Practical Data Analysis with JMP, Third Edition. Robert Carver
1. In Chapter 21, we will learn how to design experiments. In this chapter, we will concentrate on the nature of experimental data.
2. Visit http://www.cdc.gov/nchs/surveys.htm to find the NHANES and other public-use survey data. Though the topic is beyond the scope of this book, readers engaged in survey research will want to learn how to conduct a database query and import the results into JMP. Interested readers should consult the section on “Importing Data” in Chapter 2 of the JMP User Guide.
3. Visit https://climatecommunication.yale.edu/publications/politics-global-warming-april-2019/ to read the full report.
Chapter 3: Describing a Single Variable
Variable Types and Their Distributions
Distribution of a Categorical Variable
Using Graph Builder to Explore Categorical Data Visually
Distribution of a Quantitative Variable
Using the Distribution Platform for Continuous Data
Exploring Further with the Graph Builder
Summary Statistics for a Single Variable
Overview
Once we have framed some research questions and gathered relevant data, the next phase of an investigation is to examine the variability in the data. The goal of descriptive analysis is to summarize where things stand with each variable. In fact, the term statistics comes from the practice of characterizing the state of political affairs through the reporting of facts and figures. This chapter presents several standard tools that we can use to examine how a variable varies, to describe the pattern of variation that it exhibits, and to look for departures from the overall pattern as well.
The Concept of a Distribution
Data analysis generally focuses on one or more variables—attributes of the individual observations. When we speak of a variable’s distribution, we are referring to a pattern of values. The distribution describes the different values the variable can assume, and how often it assumes each value.
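To make the idea concrete, here is a minimal Python sketch that tabulates the distribution of a short list of labels: each distinct value, how often it occurs, and its share of the whole. The sample labels and the helper name `frequency_table` are illustrative assumptions. They are not values from the Life Expectancy table, and the code is not part of JMP.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, proportion) rows, sorted by value."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, n, n / total) for value, n in sorted(counts.items())]

# Hypothetical sample of category labels (made up for illustration).
sample = ["Europe", "Asia", "Europe", "Africa", "Asia", "Europe"]

for value, count, share in frequency_table(sample):
    print(f"{value:<8}{count:>3}{share:>8.3f}")
```

The counts describe how often the variable assumes each value, and the proportions always sum to 1.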
In our first example, we will continue to consider the variability of life expectancy around the world. The data that we will use come to us from the World Bank. In Chapter 1, we used a small portion of this data set for 2017. Now we will look at more years.
Variable Types and Their Distributions
In Chapter 2, we did our work in a JMP Project. Get in the habit of using a project for each chapter.
1. Select File ► New ► Project.
2. Select File ► Open, select the Life Expectancy data table, and click Open.
Before doing any analysis, make sure that you can answer these questions:
● What population does this data table represent?
● What is the source of the data?
● How many variables are in the table?
● What data type is each variable?
● What does each variable represent?
● How many observations are there?
Take special note of the way this data table has been organized. We have 12 annual observations for each country, spaced at 5-year intervals, and they are stacked one upon the other. Not surprisingly, JMP refers to this arrangement as stacked data.
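For readers who find the layout easier to see in code, the following Python sketch contrasts a wide arrangement (one row per country, one column per year) with the stacked arrangement described above (one row per country-year pair). The country names, years, and values are invented for illustration, and `stack` is a hypothetical helper, not a JMP function.

```python
# Hypothetical wide layout: one row per country, one column per year.
wide = [
    {"Country": "A", "2010": 70.0, "2015": 71.5},
    {"Country": "B", "2010": 80.0, "2015": 81.0},
]

def stack(rows, id_key, value_name):
    """Turn every (row, year column) pair into a row of its own."""
    stacked = []
    for row in rows:
        for key, value in row.items():
            if key != id_key:
                stacked.append(
                    {id_key: row[id_key], "Year": int(key), value_name: value}
                )
    return stacked

stacked = stack(wide, "Country", "LifeExp")
for record in stacked:
    print(record)
```

Two countries observed in two years yield four stacked rows. In the same way, 12 observations per country produce 12 rows per country in the stacked table.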
As in Chapter 1, we will raise some questions about how life expectancy at birth varies in different parts of the world. There are far too many observations for us to get a general sense of the variation simply by scanning the table visually. We need some sensible ways to find the patterns among the large number of rows. We will begin our analysis by looking at the nominal variable called Region.
Statisticians generally distinguish among four types of data:
Categorical Types | Quantitative Types
Nominal | Interval
Ordinal | Ratio
One reason that it is important to understand the differences among data types is that we analyze them in different ways. In JMP, we differentiate between nominal, ordinal, and continuous data. Nominal and ordinal variables are categorical, distinguishing one observation from another in some qualitative, non-measurable way. Interval and ratio data are both numeric, and JMP treats both as continuous. Interval variables are artificially constructed, like a temperature scale or stock index, with arbitrarily chosen zero points. Most measurement data are considered ratio data because ratios of values are meaningful. For example, a film that lasts 120 minutes is twice as long as one lasting 60 minutes. In contrast, 120 degrees Celsius is not twice as hot as 60 degrees Celsius.
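The temperature example can be checked with a little arithmetic. The Python sketch below is an illustration only. It relies on the standard Celsius-to-Kelvin offset of 273.15, and the function name `celsius_to_kelvin` is an assumption of this example. Because the zero point of the Celsius scale is arbitrary, the ratio of two Celsius readings changes when the same temperatures are expressed on the Kelvin scale, which has a true zero. The ratio of two durations, by contrast, is the same in any unit.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin."""
    return celsius + 273.15

# Ratio of the raw Celsius readings.
ratio_celsius = 120 / 60

# Ratio of the same two temperatures measured from absolute zero.
ratio_kelvin = celsius_to_kelvin(120) / celsius_to_kelvin(60)

# Film lengths in minutes and in seconds give the same ratio.
ratio_minutes = 120 / 60
ratio_seconds = (120 * 60) / (60 * 60)

print(ratio_celsius)            # 2.0
print(round(ratio_kelvin, 2))   # 1.18
print(ratio_minutes, ratio_seconds)  # 2.0 2.0
```

On the Kelvin scale, 120 degrees Celsius is only about 18% hotter than 60 degrees Celsius, not 100% hotter.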
Distribution of a Categorical Variable
In its reporting, the World Bank identifies each country of the world with a continental region. There are seven regions, each with a different number of countries. The variable Region is nominal—it literally names a country’s general location on earth. Let’s get familiar with the different regions and see how many countries are in each. In other words, let’s look at the distribution of Region.
1. Select Analyze ► Distribution. In the Distribution dialog box (Figure 3.1), select the variable Region as the Y, Columns variable. Click OK.
Figure 3.1: Distribution Dialog Box
Anytime you want to assign a column to a role in a JMP dialog box, you have three options: you can highlight the column name in the Select Columns list and click the corresponding role button, you can double-click the column name, or you can click-drag the column name into the role box.
The result appears in Figure 3.2. JMP constructs a simple bar chart listing the seven continental regions and showing a rectangular bar corresponding to the number of times the name of the region occurs in the data table. Though we cannot immediately tell from the graph alone exactly how many countries are in each, North America clearly has the fewest countries and Europe & Central Asia has the most.
Figure 3.2: Distribution of Region
Below the graph is a frequency distribution (titled Frequencies), which provides a more specific summary. Here we find the name of each region, and the number of times each regional name appears in our table. For example, “East Asia & Pacific” occurs 432 times. As a proportion of the whole table, 16.7% of the rows (Prob. = 0.16744)