Practical Data Analysis with JMP, Third Edition. Robert Carver


      Along with the median, it is commonly used as a measure of central tendency; in a symmetric distribution, the mean and median are quite close in value. When a distribution is strongly left-skewed like this one, the mean will tend to be smaller than the median. In a right-skewed distribution, the opposite will be true.

      ● The standard deviation (Std Dev) is a measure of dispersion, and you might think of it as a typical distance of a value from the mean of the distribution. It is usually represented by the symbol s, and is computed as follows:

      $$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

      We will have more to say about the standard deviation in later chapters, but for now, please note that it must be greater than or equal to zero, and that highly dispersed variables have larger standard deviations than consistent variables.

      ● n refers to the number of observations in the sample.
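      As a rough illustration of these summary statistics, the following Python sketch computes n, the mean, the median, and the standard deviation for a small left-skewed sample. The data values are made up for illustration and do not come from the life expectancy table, and the sketch assumes the usual sample formula for s, which divides the sum of squared deviations by n - 1.

```python
import math
import statistics

# Hypothetical left-skewed sample (made-up values, not the book's data):
# most values cluster near 80, with a tail reaching down to 52.
values = [52, 61, 70, 74, 76, 78, 79, 80, 81, 82]

n = len(values)                      # number of observations
mean = sum(values) / n               # arithmetic mean
median = statistics.median(values)   # middle value of the sorted data

# Sample standard deviation (assumed formula): square root of the sum of
# squared deviations from the mean, divided by n - 1.
s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

print(f"n = {n}, mean = {mean:.2f}, median = {median:.2f}, s = {s:.2f}")
```

      In this sample the low tail pulls the mean below the median, which is the pattern described above for a left-skewed distribution.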

      Outlier Box Plots

      Now that we have discussed the five-number summary, we can interpret a box plot. The key to interpreting an outlier box plot is to recognize that it is a diagram of the five-number summary. Here is a typical example:

[Figure: example outlier box plot with a cluster of low-value outliers]

      In a box plot, there is a rectangle with an intersecting line. Two edges of the rectangle are located at the first (Q1) and third (Q3) quartile values, and the line is located at the median. In other words, the rectangular box spans the interquartile range (IQR). Extending from the ends of the box are two lines called whiskers. In a distribution that is free of outliers, the whiskers reach to the minimum and maximum values. Otherwise, the plot limits the reach of the whiskers by the upper and lower fences, which are located 1.5 IQRs from each quartile. In this illustration, we have a cluster of seven low-value outliers.
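      The landmarks just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample values are made up, the function name box_plot_landmarks is hypothetical, and quartile conventions differ between software packages, so the quartiles computed here may not match another program's output exactly. The sketch also assumes that each whisker stops at the most extreme observation that lies inside its fence.

```python
import statistics

def box_plot_landmarks(values):
    """Return the five-number summary, fences, whisker ends, and outliers."""
    data = sorted(values)
    # Quartiles by one common convention; other software may differ slightly.
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    # Fences sit 1.5 IQRs beyond each quartile.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1],
        "iqr": iqr, "lower_fence": lower_fence, "upper_fence": upper_fence,
        # Assumption: whiskers end at the outermost points within the fences.
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Hypothetical sample with two low-value outliers (not the book's data).
sample = [40, 42, 70, 72, 74, 75, 76, 77, 78, 79, 80, 81, 82]
landmarks = box_plot_landmarks(sample)
for name, value in landmarks.items():
    print(f"{name}: {value}")
```

      With no observations beyond the fences, the outlier list is empty and the whisker ends coincide with the minimum and maximum.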

      JMP also adds two other features to the box plot. One is a diamond that represents the location of the mean. If you imagine a vertical line through the top and bottom vertices of the diamond, you have located the mean. The other two vertices, at the left and right, are positioned at the upper and lower confidence limits of the mean. We will discuss those in Chapter 11.

      The second additional feature is a red bracket above the box. This is the shortest half bracket, representing the smallest part of the number line comprising 50% of the cases. We can divide the observations in half in different ways. The median gives the upper and lower halves; the IQR box gives the middle half. This bracket gives the shortest half.
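      One way to picture the shortest half is as a sliding window over the sorted data. The Python sketch below is an assumption-laden illustration rather than a description of how JMP computes the bracket: the function name shortest_half is hypothetical, the sample is made up, and for an odd number of observations it treats "half" as the rounded-up count.

```python
import math

def shortest_half(values):
    """Return the endpoints of the narrowest interval holding half the data."""
    data = sorted(values)
    n = len(data)
    h = math.ceil(n / 2)  # assumed size of a "half" (rounded up for odd n)
    # Slide a window of h consecutive sorted values and keep the narrowest.
    best_start = min(range(n - h + 1), key=lambda i: data[i + h - 1] - data[i])
    return data[best_start], data[best_start + h - 1]

# Hypothetical sample with a low tail (not the book's data).
sample = [40, 42, 70, 72, 74, 75, 76, 77, 78, 79, 80, 81, 82]
low, high = shortest_half(sample)
print(f"shortest half spans {low} to {high} (width {high - low})")
```

      Because the observations are densest near the top of this sample, the shortest half sits toward the high end, away from the low tail.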

      A box plot very efficiently conveys information about the center, symmetry, dispersion, and outliers for a single distribution. When we compare box plots across several groups or samples, the results can be quite revealing. In the next chapter, we will look at such box plots and other ways of summarizing two variables at a time.

      Now that you have completed all of the activities in this chapter, use the techniques that you have learned to respond to these questions.

      1. Scenario: We will continue our analysis of the variation in life expectancy at birth in 2015. Reset the Data Filter to show and include 2015.

      a. When we first constructed the LifeExp histogram, we described it as multi-peaked and left-skewed. Use the hand tool to increase and reduce the number of bars. Adjust the number of bars so that there are two prominent peaks. Describe what you did, and where the peaks are located.

      b. Rescale the axes of the same histogram and see if you can emphasize the two peaks even more (in other words, have them separated distinctly). Describe what you did to make these peaks more distinct and noticeable.

      c. Based on what you have seen in these exercises, why is it a good idea to think critically about an analyst’s choice of scale in a reported graph?

      d. Highlight a few of the left-most bars in the histogram for LifeExp and look at the Distribution report for Region. Which continent or continents are home to the countries with the shortest life expectancies in the world? What might account for this?

      2. Scenario: Now let’s look at the distribution of life expectancy 25 years before 2015. Use the Data Filter to choose the observations from 1990.

      a. Use the Distribution platform to summarize Region and LifeExp for this subset. In a few sentences, describe the distribution of LifeExp in 1990.

      b. Compare the five-number summaries for life expectancy in 1990 and in 2015. Comment on what you find.

      c. Compare the standard deviations for life expectancy in 1990 and 2015. Comment on what you find.

      d. You will recall that in 2015, the mean life expectancy was shorter than the median, consistent with the left-skewed shape. How do the mean and median compare in the 1990 data?

      3. Scenario: The data file called Sleeping Animals contains data about the size, sleep habits, lifespan, and other attributes of different mammalian species.

      a. Construct box plots for Lifespan and TotalSleep. For each plot, explain what the landmarks on each plot tell you about the distribution of each variable. Comment on noteworthy features of the plot.

      b. Which distribution is more symmetric? Explain specifically how the graphs and descriptive statistics helped you come to a conclusion.

      c. According to the data table, “Man” has a maximum life span of 100 years. Approximately what percent of mammals in the data set live less than 100 years?

      d. Sleep hours are divided into “dreaming” and “non-dreaming” sleep. How do the distributions of these types of sleep compare?

      e. Select the species that tend to get the most total sleep. Comment on how those species compare to the other species in terms of their predation, exposure, and overall danger indexes.

      f. Now use the Distribution platform to analyze the body weights of these mammals. What’s different about this distribution in comparison to the other continuous variables that you have analyzed thus far?

      g. Select those mammals that sleep in the most exposed locations. How do their body weights tend to compare to the other mammals? What might explain this comparison?

      4. Scenario: When financial analysts want a benchmark for the performance of individual equities (stocks), they often rely on a “broad market index” such as the S&P 500 in the U.S. There are many such indexes in stock markets around the world. One major index on the Tokyo Stock Exchange is the Nikkei 225, and this set of questions refers to data about the monthly values of the Nikkei 225 from December 31, 2013 through December 31, 2018. In other words, our data table called NIKKEI225 reflects monthly market activity for a five-year period.

      a. The variable called Volume is the total number of shares traded per month (in millions of shares). Describe the distribution of this variable.

      b. The variable called Change% is the monthly change, expressed as a percentage, in the closing value of the index. When Change% is positive, the index increased that month. When the variable is negative, the index decreased that month. Describe the distribution of this variable.

      c. Use the Quantiles to determine approximately how often the Nikkei declines. (Hint: What percentile is 0?)

      d. Use Graph Builder to make a Line Graph (6th icon in the icon bar) that shows adjusted closing prices over time. Then, use the Distribution platform to create a histogram of adjusted closing prices. Each graph summarizes the Adj Close variable, but each graph presents a different view of the data. Comment on the comparison of the two graphs.

      e. Now make a line graph of the monthly percentage changes over time. How would you describe the pattern in this graph?

      5. Scenario: Anyone traveling by air understands that there is always some chance of a flight delay. In the United States, the Department of Transportation monitors the arrival and departure time of every flight. The data table Airline Delays contains a sample of 51,603 flights for four airlines destined for three busy airports.

      b.

