Active Learning. Burr Settles
Читать онлайн книгу.Pθ(ŷ|x) can be efficiently computed with dynamic programming. Selecting the best query is generally no more complicated or expensive than the standard inference procedure. The Viterbi algorithm (Corman et al., 1992) requires O(TM) time, for example, where T is the sequence length and M is the number of label states. It is often possible to perform “N-best” inference using a beam search as well (Schwartz and Chow, 1990), which finds the N most likely output structures under the model. This makes it simple to compute the necessary probabilities for ŷ1 and ŷ2 in the margin strategy, and comes at little extra computational expense: the complexity is O(TMN) for sequences, which for N = 2 merely doubles the runtime compared to the least confident strategy. Dynamic programs have also been developed to compute the entropy over all possible sequences (Mann and McCallum, 2007) or trees (Hwa, 2004), although this approach is significantly more expensive. The fastest entropy algorithm for sequence models requires O(TM2) time, which can be very slow when the number of label states is large. Furthermore, some structured models are so complex that they require approximate inference techniques, such as loopy belief propagation or Markov chain Monte Carlo (Koller and Friedman, 2009). In such cases, the least confident strategy is still straightforward since only the “best” prediction needs to be evaluated. However, the margin and entropy heuristics cease to be tractable and exact for these more complex models.
So far we have only discussed problems with discrete outputs—classification and structured prediction. Uncertainty sampling is also applicable to regression, i.e., learning tasks with continuous output variables. In this setting, the learner can simply query the unlabeled instance for which the learner has the highest output variance in its prediction. Under a Gaussian assumption, the entropy of a random variable is a monotonic function of its variance, so this approach is much in same the spirit as entropy-based uncertainty sampling. Another interpretation of variance is the expected squared-loss of the model’s prediction. Closed-form estimates of variance can be computed for a variety of model classes, although they can require complex and expensive computations.
Figure 2.6 illustrates variance-based uncertainty sampling using an artificial neural network. The target function is a Gaussian in the range [-10,10], shown by the solid red line in the top row of plots. The network in this example has one hidden layer of three logistic units, and one linear output unit. The network is initially trained with two labeled instances drawn at random (upper left plot), and its variance estimate (lower left plot) is used to select the first query. This process repeats for a few iterations, and after three queries the network can approximate the target function fairly well. In general, though, estimating the output variance is nontrivial and will depend on the type of model being used. This is in contrast to the utility measures in Section 2.3 for discrete outputs, which only require that the learner produce probability estimates. In Section 3.4, we will discuss active learning approaches using ensembles as a simpler way to estimate output variance. Active learning for regression has a long history in the statistics literature, generally referred to as optimal experimental design (Federov, 1972). However, the statistics community generally eschews uncertainty sampling in lieu of more sophisticated strategies, which we will explore further in Chapter 4.
Figure 2.6: Variance-based uncertainty sampling for a toy 1D regression task. Each column represents an iteration of active learning. In the top row, solid lines show the target function to be learned, while dashed lines show a neural network approximation based on available training data (black dots). The bottom row plots the network’s output variance across the input range, which is used to select the query for the next iteration.
2.5 DISCUSSION
Uncertainty sampling is possibly the most popular active learning strategy in practice. Perhaps this is because of its intuitive appeal combined with the ease of implementation. Most of the common uncertainty-based utility measures do not require significant engineering overhead to use. In fact, as long as the learner can provide a confidence or probability score along with its predictions, any of the measures in Section 2.3 can be employed with the learner as a “black box.” Standard classification or inference procedures can be used, leaving the choice of learning algorithm fairly modular. This is not necessarily the case for all active learning approaches. Furthermore, if inference is fast and tractable, then querying should also be fast and tractable.
Our discussion of uncertainty sampling has, thus far, been limited to the pool-based setting where the learner selects the “best” query from the set of unlabeled data U. Uncertainty sampling can also be employed in the stream-based selective sampling setting, where unlabeled instances are drawn x ∽ DX one at a time from an input distribution, and the learner decides on the spot whether to query or discard it. The simplest way to implement uncertainty-based selective sampling is to set a threshold on the uncertainty measure and use this to define a region of uncertainty. An instance is queried if it falls within the region of uncertainty, and discarded otherwise. The learner is re-trained after each new query instance (or batch of instances) is labeled and added to L.
Конец ознакомительного фрагмента.
Текст предоставлен ООО «ЛитРес».
Прочитайте эту книгу целиком, купив полную легальную версию на ЛитРес.
Безопасно оплатить книгу можно банковской картой Visa, MasterCard, Maestro, со счета мобильного телефона, с платежного терминала, в салоне МТС или Связной, через PayPal, WebMoney, Яндекс.Деньги, QIWI Кошелек, бонусными картами или другим удобным Вам способом.