Medical Statistics. David Machin
2.4); the median is the middle observation, which splits the data set into two halves with an equal number of observations in each half (eight in this example). As the number of observations is even (n = 16), the median is the average of the two central ordered values (the eighth and ninth). So, the median corn size is (3 + 3)/2 = 3 mm.
If we had observed an additional 17th subject with a corn size of 10 mm, the median would be the 9th ordered observation, which is 3 mm.
The median has the advantage that it is not affected by outliers; for example, the median in these data would be unaffected by replacing the largest corn size of ‘10 mm’ with ‘100 mm’. However, it is not statistically efficient, as it does not make use of all the individual data values.
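The three median calculations described above can be checked with a short sketch. It assumes Python and its standard `statistics` module, neither of which is used in the book; the variable names are illustrative only.

```python
from statistics import median

# The 16 ordered corn sizes (mm) from Table 2.4
corn_sizes = [1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 10]

# Even n: the median is the mean of the 8th and 9th ordered values
median_16 = median(corn_sizes)

# Odd n: add a hypothetical 17th subject with a 10 mm corn;
# the median is then the 9th ordered value
median_17 = median(corn_sizes + [10])

# Replace the largest value (10 mm) with an extreme outlier (100 mm)
median_outlier = median(corn_sizes[:-1] + [100])

print(median_16, median_17, median_outlier)
```

All three medians equal 3 mm, which illustrates that the median does not respond to a change in the most extreme value.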
Mode
A third measure of location is termed the mode. This is the value that occurs most frequently, or, if the data are grouped, the grouping with the highest frequency. It is not used much in statistical analysis, since its value depends on the accuracy with which the data are measured, although it may be useful for categorical data to describe the most frequent category. However, the expression ‘bimodal’ distribution is used to describe a distribution with two peaks in it. This can be caused by mixing two or more populations together. For example, height might appear to have a bimodal distribution if one had men and women in the study population. Some illnesses may raise a biochemical measure, so in a population containing healthy individuals and those who are ill one might expect a bimodal distribution. However, some illnesses are defined by the value of a measure, say, obesity or high blood pressure, and in these cases the distributions are usually unimodal, with those above a given value regarded as ill.
Table 2.4 The 16 corn sizes ordered and ranked from smallest to largest.
Rank order | Corn size (mm)
---|---
1 | 1
2 | 2
3 | 2
4 | 2
5 | 2
6 | 2
7 | 3
8 | 3
9 | 3
10 | 3
11 | 4
12 | 4
13 | 5
14 | 6
15 | 6
16 | 10
Example – Calculation of the Mode – Corn Size Data
Among the 16 patients with corns, 5 have a corn size of 2 mm, more than for any other size; thus, the modal corn size is 2 mm.
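The same count can be reproduced by tallying how often each size occurs. This is a minimal sketch in Python using the standard `collections.Counter`; the language and the variable names are assumptions made for illustration, not part of the book.

```python
from collections import Counter

# The 16 ordered corn sizes (mm) from Table 2.4
corn_sizes = [1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 10]

# Tally the frequency of each distinct corn size
counts = Counter(corn_sizes)

# most_common(1) returns a list holding the single (value, frequency)
# pair with the highest frequency
modal_size, frequency = counts.most_common(1)[0]

print(f"Modal corn size: {modal_size} mm (observed in {frequency} patients)")
```

If two sizes were tied for the highest frequency, `most_common(1)` would report only one of them, so the full tally in `counts` should be inspected when ties are possible.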
Measures of Dispersion or Variability
We also need a numerical way of summarising the amount of spread or variability in a data set. The three main approaches to quantifying variability are the range, the interquartile range and the standard deviation.
Range
The simplest way to describe the spread of a data set is to quote the minimum (lowest) and maximum (highest) values. The range is given as the smallest and largest observations. For some data it is very useful, because one would want to know these numbers; for example, the ages of the youngest and oldest participants in a sample. However, if outliers are present it may give a distorted impression of the variability of the data, since only two of the data points are included in making the estimate. Thus, the range is affected by extreme values at each end of the data.
Example – Calculation of the Range – Corn Size Data
The range for the corn size data is 1 to 10 mm or, described by a single number, 10 − 1 = 9 mm.
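A brief sketch, assuming Python (not used in the book) and illustrative variable names, reproduces this calculation and shows how strongly a single extreme value changes the range.

```python
# The 16 ordered corn sizes (mm) from Table 2.4
corn_sizes = [1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 10]

# The range quoted as the smallest and largest observations
smallest, largest = min(corn_sizes), max(corn_sizes)

# The range described by a single number
range_width = largest - smallest

# Hypothetical outlier: replace the largest value (10 mm) with 100 mm
with_outlier = corn_sizes[:-1] + [100]
range_width_outlier = max(with_outlier) - min(with_outlier)

print(f"Range: {smallest} to {largest} mm, width {range_width} mm")
print(f"Width with a 100 mm outlier: {range_width_outlier} mm")
```

Changing one observation moves the single-number range from 9 mm to 99 mm, whereas the median of the same data stays at 3 mm.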
Quartiles and the Interquartile Range
The quartiles, namely the lower quartile, the median and the upper quartile, divide the data into four equal parts using three cut‐points; that is, there will be approximately equal numbers of observations in the four sections (and exactly equal if the sample size is divisible by four and the measures are all distinct). The quartiles are calculated in a similar way to the median: first order the data and then count the appropriate number from the bottom. The lower quartile is found by ranking the data and then taking the value below which 25% of the data sit. The upper quartile is the value above which the top 25% of the data points sit. The interquartile range is a useful measure of variability: it is the range of values that includes the middle 50% of observations, and is given by the difference between the lower and upper quartiles. The interquartile range is not vulnerable to outliers, and whatever the distribution of the data, we know that 50% of them lie within the interquartile range.
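As an illustration, the quartiles and interquartile range of the 16 corn sizes in Table 2.4 can be sketched as follows. This assumes Python and its standard `statistics.quantiles` function, which are not used in the book. Several conventions exist for interpolating between ordered values, so the figures produced here may differ slightly from those obtained with another convention, including the one used in the book's own worked example.

```python
from statistics import quantiles

# The 16 ordered corn sizes (mm) from Table 2.4
corn_sizes = [1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 10]

# 'exclusive' places the quantile for proportion p at position
# p * (n + 1) in the ordered data, interpolating between neighbours
q1, q2, q3 = quantiles(corn_sizes, n=4, method="exclusive")
iqr = q3 - q1

# 'inclusive' uses position 1 + p * (n - 1) instead, so the
# cut-points can differ slightly in small samples
q1_inc, _, q3_inc = quantiles(corn_sizes, n=4, method="inclusive")

print(f"exclusive: Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"inclusive: Q1={q1_inc}, Q3={q3_inc}, IQR={q3_inc - q1_inc}")
```

Both conventions agree on the lower quartile and the median for these data but give different upper quartiles, which is why the method used should be stated when quartiles are reported for small samples.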
Percentiles
The median and quartiles are examples of percentiles, that is, points which divide the distribution of the data set into percentages above or below a certain value. A percentile (or a centile) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20% of the observations may be found. The median is the 50th percentile, the lower quartile is the 25th percentile and the upper quartile is the 75th percentile. With enough data any percentile can be calculated from continuous data.
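The correspondence between quartiles and percentiles can be checked numerically. The sketch below assumes Python's standard `statistics.quantiles` with its default interpolation convention; the helper `percentile` is a hypothetical name introduced only for this illustration. With just 16 observations the extreme percentiles are extrapolated and should not be relied upon, so only central ones are shown.

```python
from statistics import quantiles

# The 16 ordered corn sizes (mm) from Table 2.4
corn_sizes = [1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 10]

# n=100 returns the 99 cut-points that split the data into 100 parts
cut_points = quantiles(corn_sizes, n=100)


def percentile(k):
    """Return the k-th percentile (k = 1..99); hypothetical helper."""
    return cut_points[k - 1]


# The quartiles computed directly, using the same default convention
q1, q2, q3 = quantiles(corn_sizes, n=4)

for k in (20, 25, 50, 75):
    print(f"{k}th percentile: {percentile(k)} mm")
```

Under this convention the 25th, 50th and 75th percentiles coincide with the lower quartile, the median and the upper quartile respectively.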
Example – Calculation of the Range, Quartiles, and Inter‐Quartile Range – Corn Size Data
Suppose, as in Table 2.5, we had the 16 corn sizes in millimetres