Statistical Significance Testing for Natural Language Processing. Rotem Dror
the scores of two different algorithms applied on the same sentence?
The examples above are paired samples, that is, samples in which a natural coupling occurs. In a dataset of paired samples (often called dependent samples), each data point in one sample is uniquely paired with a data point in the other sample. Paired samples can be before/after samples, in which a metric is computed before and after a certain action. Alternatively, they can be matched samples, in which individuals are matched on some characteristic such as age or gender. In general, paired samples arise whenever each data point in one sample is directly matched to a data point in the other sample.
In contrast to paired samples, we sometimes have independent samples, consisting of unrelated data points. Such independent samples can be obtained simply by randomly sampling from two different populations. A more realistic case arises in medical experiments, where two separate groups (often a treatment group and a placebo group) are randomly created without first matching the subjects.
Algorithm 3.1 Statistical Hypothesis Testing Process with Critical Regions
Input : H0 the null hypothesis, H1 the alternative hypothesis, α the significance level.
Output : Decision to either reject the null hypothesis in favor of the alternative or not reject it.
1:O = {ø}—list of observations.
2:O ← Perform experiment to test the hypotheses.
3:Decide which statistical test is appropriate.
4:Calculate the observed test statistic T (O).
5:Derive the distribution of the test statistic under the null hypothesis H0.
6:Calculate the critical region—the possible values of T for which the null hypothesis is rejected. The probability of the critical region under the distribution of the test statistic under the null hypothesis is α.
7:Reject the null hypothesis H0 in favor of the alternative hypothesis H1 if the observed test statistic T (O) is in the critical region.
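The steps of Algorithm 3.1 can be sketched in Python for one simple case. The sketch assumes a one-sided test whose statistic follows a standard normal distribution under H0; that distribution, the function name reject_by_critical_region, and the example values are illustrative assumptions and are not part of the algorithm itself.

```python
from statistics import NormalDist


def reject_by_critical_region(t_observed: float, alpha: float) -> bool:
    """Return True if H0 is rejected at significance level alpha.

    Assumption (illustration only): T ~ N(0, 1) under H0 and large
    values of T count as evidence against H0 (one-sided test).
    """
    null_dist = NormalDist(mu=0.0, sigma=1.0)
    # Step 6: the critical region is [threshold, +inf); its probability
    # under the null distribution is alpha.
    threshold = null_dist.inv_cdf(1.0 - alpha)
    # Step 7: reject H0 if the observed statistic lies in that region.
    return t_observed >= threshold


if __name__ == "__main__":
    # Hypothetical observed statistics, chosen only for demonstration.
    for t in (1.2, 2.1):
        print(t, reject_by_critical_region(t, alpha=0.05))
```

For a two-sided alternative the critical region would instead cover both tails, each with probability α/2.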
Algorithm 3.2 Statistical Hypothesis Testing Process with p-value
Input : H0 the null hypothesis, H1 the alternative hypothesis, α the significance level.
Output : Decision to either reject the null hypothesis in favor of the alternative or not reject it.
Notice: steps 1–5 are the same as in Algorithm 3.1.
6:Calculate the p-value—the probability, under the null hypothesis H0, of observing a test statistic at least as extreme as that which was observed.
7:Reject the null hypothesis H0 in favor of the alternative hypothesis H1 if and only if the p-value is less than (or equal to) α.
The distinction between paired and independent samples is crucial in NLP. We often compare several algorithms on the same dataset, and hence paired tests are more common. In what follows, we survey prominent parametric and nonparametric tests, emphasizing the paired setup. In addition, Algorithms 3.1 and 3.2 present pseudocode for the general process applied when testing for statistical significance. The two processes are equivalent: the observed test statistic falls in the critical region exactly when the p-value is at most α.
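The equivalence of the two processes can be checked numerically in a small sketch. It assumes a one-sided test with a standard normal null distribution; the names NULL_DIST, p_value, in_critical_region, and decisions_agree are hypothetical helpers introduced only for this illustration.

```python
from statistics import NormalDist

# Assumed null distribution of the test statistic (illustration only).
NULL_DIST = NormalDist(mu=0.0, sigma=1.0)


def p_value(t_observed: float) -> float:
    """Algorithm 3.2, step 6: P(T >= t_observed) under H0 (one-sided)."""
    return 1.0 - NULL_DIST.cdf(t_observed)


def in_critical_region(t_observed: float, alpha: float) -> bool:
    """Algorithm 3.1, steps 6-7: upper-tail region with probability alpha."""
    return t_observed >= NULL_DIST.inv_cdf(1.0 - alpha)


def decisions_agree(t_observed: float, alpha: float) -> bool:
    """True when both procedures reach the same reject / not-reject decision."""
    reject_by_p = p_value(t_observed) <= alpha
    return reject_by_p == in_critical_region(t_observed, alpha)


if __name__ == "__main__":
    # Hypothetical statistics and significance levels.
    for t in (-1.0, 0.5, 1.2, 2.1, 3.0):
        for alpha in (0.01, 0.05, 0.10):
            print(t, alpha, round(p_value(t), 4), decisions_agree(t, alpha))
```

The p-value form is often more convenient in practice because it reports how strong the evidence is rather than only the binary decision.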
3.2 PARAMETRIC TESTS
As previously defined, parametric tests are statistical significance tests that assume prior knowledge regarding the test statistic’s distribution under the null hypothesis. When using such tests, we utilize the test statistic’s assumed distribution in order to ensure a bound on the type I error and a low probability of making a type II error. We will now elaborate on several prominent parametric tests that are suitable for the setup of paired samples.
Algorithm 3.3 The Paired Z-test
Input : Paired samples $\{(x_i, y_i)\}_{i=1}^{n}$.
Output : p—the p-value.
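Notations : n the sample size, $d_i = x_i - y_i$ the ith paired difference, $\sigma$ the known population standard deviation of the differences.
1:Calculate the mean of the paired differences $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$.
2:Calculate the test statistic $z = \frac{\bar{d}}{\sigma / \sqrt{n}}$.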
3:Calculate p = P(Z ≥ z) where Z ∼ N(0, 1).
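A minimal Python sketch of the paired Z-test follows. It assumes the standard form of the statistic, z = d̄ / (σ / √n), where d̄ is the mean of the paired differences and σ is the known population standard deviation of those differences; the function name paired_z_test and the sample values are hypothetical.

```python
import math
from statistics import NormalDist


def paired_z_test(x, y, sigma):
    """One-sided paired Z-test; returns the p-value P(Z >= z).

    x, y  : paired observations of equal length.
    sigma : known population standard deviation of the differences
            x_i - y_i (an assumption that rarely holds in NLP).
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    # Step 1: mean of the paired differences.
    diffs = [a - b for a, b in zip(x, y)]
    d_bar = sum(diffs) / n
    # Step 2: test statistic, standard normal under H0.
    z = d_bar / (sigma / math.sqrt(n))
    # Step 3: upper-tail probability under N(0, 1).
    return 1.0 - NormalDist().cdf(z)


if __name__ == "__main__":
    # Hypothetical paired scores, for demonstration only.
    scores_a = [2.0, 3.0, 4.0, 5.0]
    scores_b = [1.0, 2.0, 3.0, 4.0]
    print(paired_z_test(scores_a, scores_b, sigma=1.0))
```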
We begin with tests that are highly relevant to NLP setups, covering cases where the metric values come from a normal distribution. Relevant NLP metrics include sentence-level accuracy, recall, unlabeled attachment score (UAS), and labeled attachment score (LAS) [Yeh, 2000].
Paired Z-test In this test, the sample is assumed to be normally distributed and the standard deviation of the population is known. The test is used to validate the hypothesis that the sample was drawn from a given population, by checking whether the sample mean equals the population mean. It is rarely applicable in NLP since the population standard deviation is rarely known, but we define it here for completeness. In addition, the test that validates the same hypothesis without assuming a known standard deviation, the t-test described next, is one of the most commonly used tests in NLP. The Z-test is defined in Algorithm 3.3.
Paired Student’s t-test This test aims to assess whether the population means of two sets of measurements differ from each other, and is based on the assumption that both samples come from a normal distribution [Fisher, 1937]. The calculations of the test statistic and the p-value for this test are shown in Algorithm 3.4.
Since this test assumes a normal distribution and is computed over population means, one may argue that, based on the Central Limit Theorem (CLT), it can be applied to compare any sufficiently large measurement sets; however, in NLP setups the test examples (e.g., sentences from the same document) are often dependent, violating the independence assumption of the CLT.
In practice, the t-test is often applied with evaluation measures such as accuracy, UAS, and LAS, which compute the mean number of correct predictions per input example. When comparing two dependency parsers, for example, we can apply the test to check whether the average difference of their UAS scores is significantly larger than zero, which can serve as an indication that one parser is better than the other. Using the t-test with such metrics can be justified based on the CLT.
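The parser comparison described above might be sketched as follows, using only the Python standard library. The per-sentence UAS values are invented for illustration, and the names t_sf and paired_t_test are hypothetical. The sketch assumes the usual paired statistic t = d̄ / (s / √n) with n − 1 degrees of freedom; the upper-tail probability is obtained by Simpson's rule on the t density, a stand-in for a library routine such as scipy.stats.t.sf.

```python
import math
from statistics import mean, stdev


def t_sf(t: float, df: int, steps: int = 2000) -> float:
    """Approximate P(T >= t) for Student's t with df degrees of freedom.

    Integrates the density over [0, |t|] with Simpson's rule
    (steps must be even) and uses the symmetry of the distribution.
    """
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)

    def pdf(u: float) -> float:
        return coef * (1.0 + u * u / df) ** (-(df + 1) / 2)

    a = abs(t)
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3.0  # probability mass between 0 and |t|
    upper_tail = 0.5 - area
    return upper_tail if t >= 0 else 1.0 - upper_tail


def paired_t_test(x, y):
    """One-sided paired t-test for H1: mean(x - y) > 0; returns (t, p)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two paired samples with at least 2 items")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    d_bar = mean(diffs)
    s = stdev(diffs)  # sample standard deviation (divides by n - 1)
    t_stat = d_bar / (s / math.sqrt(n))
    return t_stat, t_sf(t_stat, df=n - 1)


if __name__ == "__main__":
    # Hypothetical per-sentence UAS scores of two parsers on the same sentences.
    parser_a = [0.92, 0.85, 0.78, 0.95, 0.88, 0.81, 0.90, 0.87]
    parser_b = [0.90, 0.80, 0.79, 0.91, 0.85, 0.80, 0.86, 0.84]
    t_stat, p = paired_t_test(parser_a, parser_b)
    print(f"t = {t_stat:.3f}, one-sided p = {p:.4f}")
```

With so few sentences the normality assumption carries the result; on a realistic test set the same function applies unchanged, subject to the dependence caveat raised earlier.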
Algorithm 3.4 The Paired Sample t-test
Input : Paired samples.
Output : p—the p-value.
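Notations : D differences between two paired samples, $d_i$ the ith observation in D, n the sample size, $\bar{d}$ the sample mean of the differences, s the sample standard deviation of the differences.
1:Calculate the sample mean $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$.
2:Calculate the sample standard deviation $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (d_i - \bar{d})^2}$.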