Twoing criterion: This binary criterion is defined as
$$\mathrm{twoing}(a_i, d_1, d_2, S) = 0.25 \cdot \frac{\left|\sigma_{a_i \in d_1} S\right|}{\left|S\right|} \cdot \frac{\left|\sigma_{a_i \in d_2} S\right|}{\left|S\right|} \cdot \left( \sum_{c_i \in \mathrm{dom}(y)} \left| \frac{\left|\sigma_{a_i \in d_1 \wedge y = c_i} S\right|}{\left|\sigma_{a_i \in d_1} S\right|} - \frac{\left|\sigma_{a_i \in d_2 \wedge y = c_i} S\right|}{\left|\sigma_{a_i \in d_2} S\right|} \right| \right)^2$$
When the target attribute is binary, the Gini and twoing criteria are equivalent. For multiclass problems, the twoing criterion prefers attributes with evenly divided splits.
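As an illustration, the twoing score of a candidate split can be computed directly from the definition above; the following is a minimal Python sketch in which the function name, the boolean bag-membership mask, and the example data are assumptions made for illustration.

```python
import numpy as np

def twoing(y, in_d1):
    """Twoing score of a candidate binary split of the sample S.
    y     : array of target labels for the instances in S
    in_d1 : boolean mask, True where the split attribute value falls in bag d1
    """
    y = np.asarray(y)
    in_d1 = np.asarray(in_d1, dtype=bool)
    p1, p2 = in_d1.mean(), (~in_d1).mean()   # |sigma_{a_i in d1} S|/|S| and |sigma_{a_i in d2} S|/|S|
    if p1 == 0 or p2 == 0:
        return 0.0                           # degenerate split: one bag is empty
    diff = sum(abs(np.mean(y[in_d1] == c) - np.mean(y[~in_d1] == c))
               for c in np.unique(y))        # sum over classes of |P(y=c | d1) - P(y=c | d2)|
    return 0.25 * p1 * p2 * diff ** 2

# A pure, even binary split attains the maximal value 0.25 * 0.5 * 0.5 * 2**2 = 0.25.
y = np.array([0, 0, 0, 1, 1, 1])
print(twoing(y, np.array([True, True, True, False, False, False])))  # 0.25
```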
Orthogonality (ort) criterion: This binary criterion is defined as
$$\mathrm{ort}(a_i, d_1, d_2, S) = 1 - \cos\theta(P_{y,1}, P_{y,2}),$$
where θ(Py,1, Py,2) is the angle between the two distribution vectors Py,1 and Py,2 of the target attribute y on the bags d1 and d2, respectively. Splits that drive the two class distributions toward orthogonality (a larger angle θ) therefore score higher.
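Under this 1 − cos θ reading, the score is simply one minus the cosine similarity of the two class-distribution vectors; a minimal sketch follows, in which the function name and argument names are assumptions for illustration.

```python
import numpy as np

def ort(y, in_d1):
    """Orthogonality score 1 - cos(theta) between the class-distribution
    vectors P_{y,1} and P_{y,2} of the two bags (illustrative sketch)."""
    y = np.asarray(y)
    in_d1 = np.asarray(in_d1, dtype=bool)
    if not in_d1.any() or in_d1.all():
        return 0.0                      # degenerate split: one bag is empty
    classes = np.unique(y)
    p_y1 = np.array([np.mean(y[in_d1] == c) for c in classes])   # distribution on d1
    p_y2 = np.array([np.mean(y[~in_d1] == c) for c in classes])  # distribution on d2
    cos_theta = p_y1 @ p_y2 / (np.linalg.norm(p_y1) * np.linalg.norm(p_y2))
    return 1.0 - cos_theta
```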
Kolmogorov–Smirnov (K–S) criterion: Assuming a binary target attribute, namely, dom(y) = {c1, c2}, the criterion is defined as
(2.21)
It has been suggested to extend this measure to handle target attributes with multiple classes and missing data values. The results indicate that the suggested method outperforms the gain ratio criterion.
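One common reading of the K–S split measure is the gap between the two class-conditional probabilities of falling into bag d1; for a numeric attribute, maximizing this gap over candidate thresholds recovers the classical K–S distance between the class-conditional cumulative distributions. The sketch below illustrates that reading only; the function and variable names are assumptions.

```python
import numpy as np

def ks_score(in_d1, y, c1, c2):
    """K-S style score of a binary split: distance between the class-conditional
    probabilities of falling into bag d1 (illustrative sketch)."""
    in_d1 = np.asarray(in_d1, dtype=bool)
    y = np.asarray(y)
    return abs(np.mean(in_d1[y == c1]) - np.mean(in_d1[y == c2]))

def ks_best_threshold(x, y, c1, c2):
    """For a numeric attribute, scan thresholds t and score the split x <= t;
    the maximum is the K-S distance between the class-conditional CDFs."""
    x = np.asarray(x)
    return max(ks_score(x <= t, y, c1, c2) for t in np.unique(x))
```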
Stopping criteria: The tree splitting (growing) phase continues until a stopping criterion is triggered. The following conditions are common stopping rules: (i) all instances in the training set belong to a single value of y; (ii) the maximum tree depth has been reached; (iii) the number of cases in the terminal node is less than the minimum number of cases for parent nodes; (iv) if the node were split, the number of cases in one or more child nodes would be less than the minimum number of cases for child nodes; and (v) the best splitting criterion value does not exceed a certain threshold.
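These rules translate directly into a guard evaluated before each split; the sketch below is a minimal illustration in which all threshold values and parameter names are assumptions.

```python
def should_stop(node_labels, depth, child_sizes, best_split_score,
                max_depth=10, min_parent_size=20, min_child_size=7,
                min_split_score=1e-3):
    """Return True if growing should stop at this node (illustrative thresholds)."""
    if len(set(node_labels)) <= 1:                      # (i) node is pure
        return True
    if depth >= max_depth:                              # (ii) maximum depth reached
        return True
    if len(node_labels) < min_parent_size:              # (iii) too few cases to act as a parent
        return True
    if any(s < min_child_size for s in child_sizes):    # (iv) a child would be too small
        return True
    if best_split_score <= min_split_score:             # (v) best criterion below threshold
        return True
    return False
```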
Pruning methods: Employing tight stopping criteria tends to create small, underfitted decision trees, whereas loose stopping criteria tend to generate large decision trees that are overfitted to the training set. Pruning methods were developed to resolve this dilemma. There are various techniques for pruning decision trees; most perform a top‐down or bottom‐up traversal of the nodes, and a node is pruned if doing so improves a certain criterion. Next, we describe the most popular techniques.
Cost‐complexity pruning: This proceeds in two stages. In the first stage, a sequence of trees T0, T1, …, Tk is built on the training data, where T0 is the original tree before pruning and Tk is the root tree. In the second stage, one of these trees is chosen as the pruned tree, based on its generalization error estimate. The tree Ti + 1 is obtained by replacing one or more of the subtrees in the predecessor tree Ti with suitable leaves; the subtrees pruned are those that yield the lowest increase in apparent error rate per pruned leaf:
$$\alpha = \frac{\varepsilon(\mathrm{pr}(T,t), S) - \varepsilon(T, S)}{\left|l(T)\right| - \left|l(\mathrm{pr}(T,t))\right|} \qquad (2.22)$$
where ε(T, S) denotes the error rate of the tree T over the sample S, ∣l(T)∣ denotes the number of leaves in T, and pr(T, t) denotes the tree obtained by replacing the node t in T with a suitable leaf. In the second phase, the generalization error of each pruned tree T0, T1, …, Tk is estimated, and the best pruned tree is then selected. If the given dataset is large enough, it is suggested to break it into a training set and a pruning set: the trees are constructed using the training set and evaluated on the pruning set. If the dataset is not large enough, cross‐validation is suggested instead, despite its computational cost.
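The first stage can be sketched as repeatedly removing the "weakest link", i.e. the internal node whose replacement by a leaf yields the smallest α of Eq. (2.22). In the sketch below, the tree representation and the helpers err, n_leaves, internal_nodes, and prune_at are assumptions standing in for ε(T, S), ∣l(T)∣, and pr(T, t).

```python
def weakest_link(tree, S, err, n_leaves, internal_nodes, prune_at):
    """Find the internal node whose pruning gives the smallest increase in
    apparent error rate per pruned leaf (the alpha of Eq. (2.22))."""
    best_alpha, best_tree = float("inf"), None
    for t in internal_nodes(tree):
        pruned = prune_at(tree, t)                                    # pr(T, t)
        a = (err(pruned, S) - err(tree, S)) / (n_leaves(tree) - n_leaves(pruned))
        if a < best_alpha:
            best_alpha, best_tree = a, pruned
    return best_tree, best_alpha

# Iterating weakest_link from T0 down to the root tree produces the sequence
# T0, T1, ..., Tk; the second stage then selects the Ti with the lowest
# estimated generalization error (pruning set or cross-validation).
```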
Reduced error pruning: While traversing the internal nodes from the bottom to the top, the procedure checks, for each internal node, whether replacing it with the most frequent class reduces the tree’s accuracy on the pruning set. If not, the node is pruned. The procedure continues until any further pruning would decrease the accuracy. It can be shown that this procedure ends with the smallest accurate subtree with respect to a given pruning set.
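A minimal sketch of this procedure, assuming helper functions internal_nodes_bottom_up, replace_with_majority_leaf, and accuracy (all illustrative names):

```python
def reduced_error_prune(tree, pruning_set, accuracy,
                        internal_nodes_bottom_up, replace_with_majority_leaf):
    """Bottom-up reduced error pruning against a held-out pruning set."""
    improved = True
    while improved:
        improved = False
        for t in internal_nodes_bottom_up(tree):
            candidate = replace_with_majority_leaf(tree, t)
            # prune if accuracy on the pruning set does not decrease
            if accuracy(candidate, pruning_set) >= accuracy(tree, pruning_set):
                tree = candidate
                improved = True
                break          # restart the traversal on the pruned tree
    return tree
```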
Minimum‐error pruning (MEP): This performs a bottom‐up traversal of the internal nodes. At each node, it compares the l‐probability‐error rate estimate with and without pruning. The l‐probability‐error rate estimate is a correction to the simple frequency‐based probability estimate. If St denotes the instances that have reached node t, then the error rate obtained if this node were pruned is
$$\varepsilon'(t) = 1 - \max_{c_i \in \mathrm{dom}(y)} \frac{\left|\sigma_{y = c_i} S_t\right| + l \cdot p_{apr}(y = c_i)}{\left|S_t\right| + l} \qquad (2.23)$$
where papr(y = ci) is the a priori probability of y taking the value ci, and l denotes the weight given to the a priori probability. A node is pruned if pruning does not increase the l‐probability‐error rate.
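A minimal sketch of the pruned-node estimate of Eq. (2.23); the function name, the dictionary of priors, and the argument names are assumptions for illustration.

```python
import numpy as np

def l_error_rate(labels_at_node, priors, l):
    """l-probability (m-estimate style) error rate at a node if it were turned
    into a leaf: 1 - max_c (n_c + l * prior_c) / (n + l)."""
    labels_at_node = np.asarray(labels_at_node)
    n = len(labels_at_node)
    best = max((np.sum(labels_at_node == c) + l * p) / (n + l)
               for c, p in priors.items())
    return 1.0 - best

# Example: 8 of 10 instances in class "a", uniform priors, l = 2
print(l_error_rate(["a"] * 8 + ["b"] * 2, {"a": 0.5, "b": 0.5}, l=2))  # 0.25
```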
Pessimistic pruning: The basic idea is that the error rate estimated on the training set is not reliable enough. Instead, a more realistic measure, known as the continuity correction for the binomial distribution, should be used: ε′(T, S) = ε(T, S) + ∣l(T)∣/(2 · ∣S∣). However, this correction still produces an optimistic error rate. Consequently, it was suggested to prune an internal node t if its error rate is within one standard error of that of a reference tree, namely
$$\varepsilon'(\mathrm{pr}(T,t), S) \le \varepsilon'(T, S) + \sqrt{\frac{\varepsilon'(T, S)\,\bigl(1 - \varepsilon'(T, S)\bigr)}{\left|S\right|}} \qquad (2.24)$$
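A sketch of the corrected error rate and the one-standard-error pruning test described above; the use of the unpruned tree's corrected error as the reference, and all names, are assumptions for illustration.

```python
import math

def corrected_error(err, n_leaves, n_samples):
    """Continuity-corrected error rate: eps'(T, S) = eps(T, S) + |l(T)| / (2 |S|)."""
    return err + n_leaves / (2.0 * n_samples)

def should_prune(err_pruned, err_subtree, n_samples):
    """Prune node t if the corrected error of pr(T, t) is within one standard
    error of the corrected error of the unpruned tree (cf. Eq. (2.24))."""
    se = math.sqrt(err_subtree * (1.0 - err_subtree) / n_samples)
    return err_pruned <= err_subtree + se
```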