Statistics for HCI. Alan Dix
neck-and-neck at 9 heads to 10 tails, but the other four races were all won by heads with some quite large margins: 10 to 7, 10 to 6, 10 to 5, and 10 to 4.
Often people are surprised because they are expecting a near neck-and-neck race every time. As the coins are all fair, they expect approximately equal numbers of heads and tails. However, just like the rainfall in Gheisra, it is very common to have one quite far ahead of the other.
You might think that because the probability of a head is a half, the number of heads will be near enough half. Indeed, this is the case if you average over lots and lots of tosses. However, with just 20 coins in a race, the variability is large.
The probability of getting an outright winner, all heads or all tails, is low: only about 1 in 500. However, the probability of getting a near wipe-out, with 1 head and 10 tails or vice versa, is around 1 in 100, so in a large class one person is likely to have this.
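If you would like to check figures like these yourself, here is a minimal sketch. It assumes the race stops as soon as either side reaches 10 and that every coin is fair; the function name and the `target` parameter are my own, chosen for illustration.

```python
from math import comb

def race_result_probabilities(target=10):
    """Probability of each final score in a fair two-horse coin race.

    Assumption: the race stops as soon as either side reaches `target`.
    Returns a dict mapping the loser's score to the probability that a
    race ends target-to-that-score, counting either side as the winner.
    """
    probabilities = {}
    for loser in range(target):
        # The winning toss comes last; before it, the winner has
        # target - 1 points and the loser has `loser` points, in any order.
        orderings = comb(target - 1 + loser, loser)
        one_side = orderings * 0.5 ** (target + loser)
        probabilities[loser] = 2 * one_side  # either heads or tails may win
    return probabilities

probs = race_result_probabilities()
for loser, p in probs.items():
    print(f"10 to {loser}: {p:.4f} (about 1 in {1 / p:.0f})")
```

Under these assumptions a 10 to 9 finish accounts for fewer than one race in five, so a clear front runner is the norm rather than the exception.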
Figure 2.2: Two-horse races—Were yours neck-and-neck or was there a front runner?
2.1.3 LESSONS
I hope these two activities begin to give you some idea of the wild nature of random phenomena. We can see a few general lessons.
First, apparent patterns or differences may just be pure chance. For example, if you had found heads winning by 10 to 2, you might have thought this meant that your coin was in some way biased to heads. Or, you might have thought that the nearly straight line of three drops on Day 1 had to mean something. But random things are so wild that apparently systematic effects sometimes happen by chance.
Second, this wildness may lead to what appear to be ‘bad values.’ If you had got 10 tails and just 1 head, you might have thought “but coins are fair, so I must have done something wrong.” Indeed, famous scientists have fallen for this fallacy!
Mendel’s experiments on the inheritance of garden pea characteristics laid the foundations for modern genetics. However, his results are a little too good. If you cross-pollinate two plants, one of them pure bred to have a recessive characteristic (say R) and the other purely dominant (say D), in the first generation all the progeny show the dominant characteristic, but in fact possess precisely one recessive and one dominant gene (RD). In the second generation, produced by interbreeding two of the first-generation RD plants, the observable characteristics are expected to be dominant and recessive in the ideal ratio 3:1. In Mendel’s data the ratios are just a little too close to this figure. It seems likely that he rejected ‘bad values,’ assuming he had done something wrong, when in fact they were just the results of chance.
The same thing can happen in physics. In 1909, Robert Millikan and Harvey Fletcher ran an experiment to determine the charge of a single electron. The experiment (also known as the ‘Millikan Oil Drop Experiment’) found that charge came in discrete units and thus showed that each electron has an identical charge. To do this they created charged oil drops and suspended them using an electric field. The relationship between the electric field needed and the size (and hence weight) of a drop enabled them to calculate the charge on each oil drop. These always came in multiples of a single value, the electron charge. There are always sources of error in any measurements and yet the reported charges are a little too close to multiples of the same number. Again, it looks like ‘bad’ results were ignored as some form of mistake during the setup of the experiment.
2.2 QUICK (AND DIRTY!) TIP
We often deal with survey or count data. This might come in public forms such as opinion poll data preceding an election, or from your own data when you email out a survey or count kinds of errors in a user study.
So when you find that 27% of the users in your study had a problem, how confident do you feel in using this to estimate the level of prevalence amongst users in general? If you did a bigger study with more users would you be surprised if the figure you got was actually 17%, 37%, or 77%?
You can work out precise numbers for this, but I often use a simple rule of thumb method for doing a quick estimate.
for survey or other count data do square root times two (ish)
We’re going to deal with this by looking at three separate cases.
2.2.1 CASE 1 – SMALL PROPORTIONS
First, consider the case when the number you are dealing with is a comparatively small proportion of the overall sample. For example, assume you want to know about people’s favourite colours. You do a survey of 1000 people and 10% say their favourite colour is blue. How reliable is this figure? If you had done a larger survey, would the answer still be close to 10% or could it be very different?
The simple rule is that the variation is 2x the square root of the number of people who chose blue.
To work this out, first calculate how many people the 10% represents. Given the sample was 1000, this is 100 people. The square root of 100 is 10, so 2x this is 20 people. You can be reasonably confident that the number of people choosing blue in your sample is within +/- 20 of the number you’d expect from the population as a whole. Dividing that +/- 20 people by the 1000 sample, the % of people for whom blue is their favourite colour is likely to be within +/- 2% of the measured 10%.
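The arithmetic for this case can be written as a few lines of Python. This is only a sketch of the rule of thumb; the function name and its parameters are my own labels, not part of any library.

```python
from math import sqrt

def small_proportion_margin(count, sample_size):
    """Rule-of-thumb margin for a small proportion: 2 x sqrt(count).

    Returns the margin as a number of people and as a percentage
    of the sample.
    """
    margin_people = 2 * sqrt(count)
    margin_percent = 100 * margin_people / sample_size
    return margin_people, margin_percent

# The favourite-colour example: 100 of 1000 people chose blue.
people, percent = small_proportion_margin(count=100, sample_size=1000)
print(f"+/- {people:.0f} people, +/- {percent:.0f}%")  # +/- 20 people, +/- 2%
```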
2.2.2 CASE 2 – LARGE MAJORITY
The second case is when you have a large majority who have selected a particular option. For example, let’s say in another survey, this time of 200 people, 85% said green was their favourite colour.
This time you still apply the “2x square root” rule, but instead focus on the smaller number, those who didn’t choose green. The 15% who didn’t choose green is 15% of 200, that is 30 people. The square root of 30 is about 5.5, so the expected variability is about +/- 11, or in percentage terms about +/- 5%. That is, the real proportion over the population as a whole could be anywhere between 80% and 90%.
Notice how the variability of the proportion estimate from the sample increases as the sample size gets smaller.
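To see the effect of sample size, here is a small sketch that applies the same rule to the smaller group for several sample sizes. The function name is hypothetical, and the larger sample sizes are invented purely for comparison; only the survey of 200 comes from the example above.

```python
from math import sqrt

def majority_margin_percent(sample_size, majority_fraction):
    """Rule-of-thumb margin (in percentage points) for a large majority.

    Applies 2 x sqrt to the smaller group, i.e. those outside the majority.
    """
    minority_count = sample_size * (1 - majority_fraction)
    margin_people = 2 * sqrt(minority_count)
    return 100 * margin_people / sample_size

# 200 is the survey in the text; 2000 and 20000 are illustrative only.
for n in (200, 2000, 20000):
    print(n, round(majority_margin_percent(n, 0.85), 1))
```

With the same 85% majority, the margin shrinks from roughly 5.5 points at 200 people to roughly 1.7 at 2000 and 0.5 at 20000: each tenfold increase in sample size only narrows it by a factor of about three.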
2.2.3 CASE 3 – MIDDLING
Finally, if the numbers are near the middle, just take the square root, but this time multiply by 1.5.
For example, if you took a survey of 2000 people and 50% answered yes to a question, this represents 1000 people. The square root of 1000 is a bit over 30, and 1.5x this is around 50 people, so you expect a variation of about +/- 50 people, or about +/- 2.5%.
Opinion polls for elections often have samples of around 2000, so if the parties are within a few points of each other you really have no idea who will win.
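For the middling case, a sketch along the same lines turns the 1.5x rule into a range around the measured figure. Again the function name is my own, and the numbers are those of the 2000-person example.

```python
from math import sqrt

def middling_range_percent(count, sample_size):
    """Rule-of-thumb range (in percent) for a proportion near 50%.

    Uses 1.5 x sqrt(count) as the margin, as in Case 3.
    """
    share = 100 * count / sample_size
    margin = 100 * 1.5 * sqrt(count) / sample_size
    return share - margin, share + margin

# 1000 of 2000 people answered yes.
low, high = middling_range_percent(count=1000, sample_size=2000)
print(f"{low:.1f}% to {high:.1f}%")  # 47.6% to 52.4%
```

The unrounded margin is a little under the 2.5% worked out by hand, which rounded 47 people up to about 50.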
2.2.4 WHY DOES THIS WORK?
For those who’d like to understand the detailed stats for this (skip if you don’t!) …
These three cases are simplified forms of the precise mathematical formula for the variance of a Binomial distribution, np(1 – p), where n is the number in the sample and p the true population proportion for the thing you are measuring. When you are dealing with fairly small proportions the (1 – p) term is close to 1, so the whole variance is close to np, that is the number with the given value. You then take the square root to give the standard deviation. The factor of 2 is because about 95% of measurements fall within 2 standard deviations. The reason this becomes 1.5 in the middle is that you can no longer treat (1 – p) as nearly 1, and for p = 0.5, this makes things smaller by a factor of the square root of 0.5, which is about 0.7. Two times 0.7 is (about) one and a half (I did say quick and dirty!).
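As a rough check on how dirty the quick rule is, the sketch below compares it with two standard deviations from the Binomial formula for the three worked examples. The function names are my own, and treating the measured proportion as if it were the true p is an assumption made only for the comparison.

```python
from math import sqrt

def exact_two_sd(n, p):
    """Two standard deviations of a Binomial(n, p) count."""
    return 2 * sqrt(n * p * (1 - p))

def rule_of_thumb(n, p, multiplier):
    """Multiplier x sqrt of the smaller group's expected count."""
    return multiplier * sqrt(n * min(p, 1 - p))

# (sample size, proportion, multiplier) for the three cases in the text.
cases = [(1000, 0.10, 2), (200, 0.85, 2), (2000, 0.50, 1.5)]
for n, p, k in cases:
    print(f"n={n}, p={p}: rule {rule_of_thumb(n, p, k):.1f}, "
          f"exact {exact_two_sd(n, p):.1f}")
```

On these three examples the rule of thumb comes out slightly larger than the formula, by less than 10% each time, which is close enough for a quick estimate.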
2.2.5 MORE IMPORTANT THAN THE MATH …
However, for survey data, or indeed any kind of data, these calculations of variability are in the end far less critical than ensuring that the sample really does adequately measure the thing you are after.