Active Learning. Burr Settles
Читать онлайн книгу.href="#fb3_img_img_d5df6b99-4ee4-5c3d-b9b5-5900490a8465.png" alt="Image"/>
Figure 2.2: Uncertainty sampling with a toy data set. (a) 400 instances, evenly sampled from two class Gaussians. Instances are represented as points in a 2D input space. (b) A logistic regression model trained with 30 labeled instances randomly drawn from the problem domain. The line represents the decision boundary of the classifier. (c) A logistic regression model trained with 30 actively queried instances using uncertainty sampling.
Figure 2.3: Generic pool-based uncertainty sampling algorithm.
2.3 MEASURES OF UNCERTAINTY
A general active learning algorithm is presented in Figure 2.3. The key component of the algorithm with respect to designing an active learning system is line 5, and we need a way to measure the uncertainty of candidate queries in the pool. For binary classification, the “closest to the decision boundary (probability ≈ 0.5)” heuristic will suffice. But when we deal with problems and models that have posterior distributions over three or more labels—or even multiple output structures—we need a more general measure of uncertainty or information content. From this point on, let
denote the best instance that the utility measure A would select for querying.Least Confident. A basic uncertainty sampling strategy is to query the instance whose predicted output is the least confident:
where ŷ = argmaxy Pθ(y|x), the prediction with the highest posterior probability under the model θ. In other words, this strategy prefers the instance whose most likely labeling is actually the least likely among the unlabeled instances available for querying. One way to interpret this uncertainty measure is the expected 0/1-loss, i.e., the model’s belief that it has mislabeled x. A drawback of this strategy is that it only considers information about the best prediction. Thus, it effectively throws away information about the rest of the posterior distribution.
Margin. A different active learning strategy is based on the output margin:
where ŷ1 and ŷ2 are the first and second most likely predictions under the model, respectively. Margin sampling addresses a shortcoming of the least confident strategy by incorporating the second best labeling in its assessment. Intuitively, instances with large margins are easy, since the learner has little doubt in how to differentiate between the two most probable alternatives. Instances with small margins are more ambiguous, though, and knowing the true labeling should help the model discriminate more effectively between them. However, for problems with a very large number of alternatives, the margin approach still ignores much of the output distribution.
Entropy. Perhaps the most general (and most common) uncertainty sampling strategy uses entropy (Shannon, 1948), usually denoted by H, as the utility measure:
where y ranges over all possible labelings of x. Entropy is a measure of a variable’s average information content. As such, it is often thought of as an uncertainty or impurity measure in machine learning. One interpretation of this strategy is the expected log-loss, i.e., the expected number of bits it will take to “encode” the model’s posterior label distribution.
Figure 2.4: Illustrations of various uncertainty measures. (a–c) Binary classification tasks. Each plot shows the utility score as a function of Pθ(⊕|x), which is the posterior probability of the positive class. (d–f) Ternary classification tasks (three labels). Heatmap corners represent posterior distributions where one label is very likely, with the opposite edge plotting the probability range between the other two classes (when that label has low probability). The center of each heatmap is the uniform distribution.
Figure 2.4 visualizes the implicit relationship among these uncertainty measures. For binary classification (top row), all three strategies are monotonic functions of one another. They are symmetric with a peak about Pθ(⊕|x) = 0.5. In effect, they all reduce to querying the instance that is closest to the decision boundary. For a three-label classification task (bottom row), the relationship begins to change. For all three measures, the most informative instance lies at the center of the triangular simplex, because this represents where the posterior label distribution is most uniform (and therefore most uncertain under the model). Similarly, the least informative instances are at the three corners, where one of the classes has extremely high probability (and thus little model uncertainty). The main differences lie in the rest of the probability space. For example, the entropy measure does not particularly favor instances for which only one of the labels is highly unlikely (i.e., along the outer side edges of the simplex), because the model is fairly certain that it is not the true label. The least confident and margin measures, on the other hand, consider such instances to be useful if the model has trouble distinguishing between the remaining two classes (i.e., at the midpoint of an outer edge). Intuitively, entropy seems appropriate if the objective function is to minimize log-loss, while the other two (particularly margin) are more appropriate if we aim to reduce classification error, since they prefer instances that would help the model better discriminate among specific classes.
2.4 BEYOND CLASSIFICATION
Not all machine learning is classification. In particular, we might want to use active learning to reduce the cost of training a model that predicts structured output, such as label sequences or trees. For example, Figure 2.5 illustrates how information extraction—the task of automatically extracting structured information like database entries from unstructured text—is framed as a sequence-labeling task. Let x = 〈x1, …, xT〉 be an observation sequence of length T with a corresponding label sequence y = 〈y1, …, yT〉. Words in a sentence correspond to tokens in x, which are mapped to labels in y.
Figure 2.5: An information extraction example viewed as a sequence labeling task. (a) A sample input sequence x and corresponding label sequence y. (b) A probabilistic sequence model represented as a finite state machine, illustrating the path of 〈x, y〉 through the model.
Figure 2.5(a) presents an example 〈x, y〉 pair. The labels indicate whether a given word belongs to an entity class of interest (org
and loc
in this case, for “organization” and “location,” respectively) or null
otherwise. Unlike simple classification, x is not represented by a single feature vector, but rather a sequence of feature vectors: one for each token (i.e., word). One approach is to treat each token as an instance, and train a classifier that scans through the input sequence, assigning output labels to tokens independently. However, the word “Madison,” devoid of context, might refer to an location (city), organization (university), or even a person. For tasks such as this, sequence models based on probabilistic finite state machines, such as hidden Markov models or linear-chain conditional random fields, are considered the state of the art. An example sequence model is shown in Figure 2.5(b). Such models can produce a probability distribution for every possible label sequence y, the number of which can grow exponentially in the sequence length T.
Fortunately,