Quickly Boosting Decision Trees - Pruning Underachieving Features Early
Ron Appel (appel@caltech.edu), Caltech, Pasadena, CA 91125 USA
Thomas Fuchs (fuchs@caltech.edu), Caltech, Pasadena, CA 91125 USA
Piotr Dollár (pdollar@microsoft.com), Microsoft Research, Redmond, WA 98052 USA
Pietro Perona (perona@caltech.edu), Caltech, Pasadena, CA 91125 USA

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

Boosted decision trees are among the most popular learning techniques in use today. While exhibiting fast speeds at test time, relatively slow training renders them impractical for applications with real-time learning requirements. We propose a principled approach to overcome this drawback. We prove a bound on the error of a decision stump given its preliminary error on a subset of the training data; the bound may be used to prune unpromising features early in the training process. We propose a fast training algorithm that exploits this bound, yielding speedups of an order of magnitude at no cost in the final performance of the classifier. Our method is not a new variant of Boosting; rather, it is used in conjunction with existing Boosting algorithms and other sampling methods to achieve even greater speedups.

1. Introduction

Boosting is one of the most popular learning techniques in use today, combining many weak learners to form a single strong one (Schapire, 1990; Freund, 1995; Freund & Schapire, 1996). Shallow decision trees are commonly used as weak learners due to their simplicity and robustness in practice (Quinlan, 1996; Breiman, 1998; Ridgeway, 1999; Kotsiantis, 2007). This powerful combination of Boosting and decision trees is the learning backbone behind many state-of-the-art methods across a variety of domains such as computer vision, behavior analysis, and document ranking, to name a few (Dollár et al., 2012; Burgos-Artizzu et al., 2012; Asadi & Lin, 2013), with the added benefit of exhibiting fast speed at test time.

Learning speed is important as well. In active or real-time learning situations, such as human-in-the-loop processes or when dealing with data streams, classifiers must learn quickly to be practical. This is our motivation: fast training without sacrificing accuracy. To this end, we propose a principled approach. Our method offers a speedup of an order of magnitude over prior approaches while maintaining identical performance. The contributions of our work are the following:

1. Given the performance on a subset of data, we prove a bound on a stump's classification error, information gain, Gini impurity, and variance.
2. Based on this bound, we propose an algorithm guaranteed to produce trees identical to those trained by classical algorithms, and experimentally show it to be one order of magnitude faster for classification tasks.
3. We outline an algorithm for quickly Boosting decision trees using our quick tree-training method, applicable to any variant of Boosting.

In the following sections, we discuss related work, inspect the tree-boosting process, describe our algorithm, prove our bound, and conclude with experiments on several datasets, demonstrating our gains.
2. Related Work

Many variants of Boosting (Freund & Schapire, 1996) have proven to be competitive in terms of prediction accuracy in a variety of applications (Bühlmann & Hothorn, 2007); however, the slow training speed of boosted trees remains a practical drawback. Accordingly, a large body of literature is devoted to speeding up Boosting, mostly categorizable as: methods that subsample features or data points, and methods that speed up training of the trees themselves.

In many situations, groups of features are highly correlated. By carefully choosing exemplars, an entire set of features can be pruned based on the performance of its exemplar. (Dollár et al., 2007) propose clustering features based on their performance in previous stages of boosting. (Kégl & Busa-Fekete, 2009) partition features into many subsets, deciding which ones to inspect at each stage using adversarial multi-armed bandits. (Paul et al., 2009) use random projections to reduce the dimensionality of the data, in essence merging correlated features.

Other approaches subsample the data. In weight-trimming (Friedman, 2000), all samples with weights smaller than a certain threshold are ignored. With Stochastic Boosting (Friedman, 2002), each weak learner is trained on a random subset of the data. For very large datasets, or in the case of on-line learning, elaborate sampling methods have been proposed, e.g. Hoeffding trees (Domingos & Hulten, 2000) and FilterBoost (Domingo & Watanabe, 2000; Bradley & Schapire, 2007). To this end, probabilistic bounds can be computed on the error rates given the number of samples used (Mnih et al., 2008). More recently, Laminating (Dubout & Fleuret, 2011) trades off the number of features for the number of samples considered as training progresses, enabling constant-time Boosting. Although all these methods can be made to work in practice, they provide no performance guarantees.

A third line of work focuses on speeding up the training of the decision trees themselves. Building upon the C4.5 tree-training algorithm of (Quinlan, 1993), using shallow (depth-D) trees and quantizing feature values into B ≪ N bins leads to an efficient O(D×K×N) implementation, where K is the number of features and N is the number of samples (Wu et al., 2008; Sharp, 2008).

Orthogonal to all of the above methods is the use of parallelization: multiple cores or GPUs. Recently, (Svore & Burges, 2011) distributed computation over cluster nodes for the application of ranking, however reporting lower accuracies as a result. GPU implementations of GentleBoost exist for object detection (Coates et al., 2009) and for medical imaging using Probabilistic Boosting Trees (PBT) (Birkbeck et al., 2011). Although these methods offer speedups in their own right, we focus on the single-core paradigm.

Regardless of the subsampling heuristic used, once a subset of features or data points is obtained, weak learners are trained on that subset in its entirety. Consequently, each of the aforementioned strategies can be viewed as a two-stage process: in the first, a smaller set of features or data points is collected, and in the second, decision trees are trained on that entire subset.

We propose a method for speeding up this second stage; thus, our approach can be used in conjunction with all the prior work mentioned above for even greater speedup. Unlike the aforementioned methods, our approach provides a performance guarantee: the boosted tree offers identical performance to one with classical training.

3. Boosting Trees

A boosted classifier (or regressor) of the form

H(x) = \sum_t \alpha_t h_t(x)

can be trained by greedily minimizing a loss function L, i.e. by optimizing the scalar α_t and weak learner h_t(x) at each iteration t. Before training begins, each data sample x_i is assigned a non-negative weight w_i (which is derived from L). After each iteration, misclassified samples are weighted more heavily, thereby increasing the severity of misclassifying them in following iterations. Regardless of the type of Boosting used (i.e. AdaBoost, LogitBoost, L2Boost, etc.), each iteration requires training a new weak learner given the sample weights. We focus on the case where the weak learners are shallow trees.

3.1. Training Decision Trees

A decision tree h_TREE(x) is composed of a stump h_j(x) at every non-leaf node j. Trees are commonly grown using a greedy procedure as described in (Breiman et al., 1984; Quinlan, 1993), recursively setting one stump at a time, starting at the root and working through to the lower nodes. Each stump produces a binary decision; it is given an input x ∈ R^K and is parametrized with a polarity p ∈ {±1}, a threshold τ ∈ R, and a feature index k ∈ {1, 2, ..., K}:

h_j(x) \equiv p_j \,\mathrm{sign}\!\left(x[k_j] - \tau_j\right)

[where x[k] indicates the kth feature (or dimension) of x]

We note that even in the multi-class case, stumps are trained in a similar fashion, with the binary decision discriminating between subsets of the classes. However, this is transparent to the training routine; thus, we only discuss binary stumps.
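To fix notation, here is a minimal sketch of the stump and the boosted classifier in Python (illustrative only, not the authors' code; X is assumed to be an N×K array of feature values, and stump parameters are assumed to be stored as (k, p, τ) triples):

```python
import numpy as np

def stump_predict(X, k, p, tau):
    """Binary decision stump h(x) = p * sign(x[k] - tau), with polarity p in {-1, +1}.
    Samples with x[k] <= tau are sent to the -p side, matching the weighted-error
    split used later in this section."""
    return np.where(X[:, k] > tau, p, -p)

def boosted_predict(X, stumps, alphas):
    """Boosted classifier H(x) = sum_t alpha_t * h_t(x); the predicted label is sign(H)."""
    H = np.zeros(len(X))
    for (k, p, tau), alpha in zip(stumps, alphas):
        H += alpha * stump_predict(X, k, p, tau)
    return np.sign(H)
```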
In the context of classification, the goal in each stage of stump training is to find the optimal parameters that minimize ε, the weighted classification error:

\varepsilon \equiv \frac{1}{Z} \sum_i w_i \,\mathbf{1}_{\{h(x_i) \neq y_i\}}, \qquad Z \equiv \sum_i w_i

[where 1_{...} is the indicator function]

In this paper, we focus on classification error; however, other types of split criteria (i.e. information gain, Gini impurity, or variance) can be derived in a similar form; please see the supplementary materials for details. For binary stumps, we can rewrite the error as:

\varepsilon^{(k)} = \frac{1}{Z} \Big[ \sum_{x_i[k] \le \tau} w_i \,\mathbf{1}_{\{y_i = +p\}} \; + \sum_{x_i[k] > \tau} w_i \,\mathbf{1}_{\{y_i = -p\}} \Big]

In practice, this error is minimized by selecting the single best feature k* from all of the features:

\{p^*, \tau^*, k^*\} = \operatorname*{argmin}_{p,\,\tau,\,k} \; \varepsilon^{(k)}, \qquad \varepsilon^* \equiv \varepsilon^{(k^*)}

In current implementations of Boosting (Sharp, 2008), feature values can first be quantized into B bins by linearly distributing them in [1, B] (outer bins corresponding to the min/max, or to a fixed number of standard deviations from the mean), or by any other quantization method. Not surprisingly, using too few bins reduces threshold precision and hence overall performance. We find that B = 256 is large enough to incur no loss in practice.

Determining the optimal threshold τ* requires accumulating each sample's weight into discrete bins corresponding to that sample's feature value x_i[k]. This procedure turns out to still be quite costly: for each of the K features, the weight of each of the N samples has to be added to a bin, making the stump training operation O(K×N) – the very bottleneck of training boosted decision trees.
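The quantize-and-accumulate procedure just described can be sketched as follows (illustrative Python, not the authors' implementation; Xq is assumed to already hold feature values quantized to integer bins 0..B−1). The bincount accumulation is the O(K×N) step; the threshold search over cumulative sums adds only O(K×B):

```python
import numpy as np

def train_stump_binned(Xq, y, w, B=256):
    """Exhaustive stump training on pre-quantized features.
    Xq: (N, K) integer bins in 0..B-1; y: (N,) labels in {-1, +1}; w: (N,) weights.
    Returns (k, p, tau, error) minimizing the weighted classification error."""
    N, K = Xq.shape
    Z = w.sum()
    best = (None, None, None, np.inf)
    for k in range(K):
        # O(N) per feature: accumulate each sample's weight into its bin
        pos = np.bincount(Xq[:, k], weights=w * (y > 0), minlength=B)
        neg = np.bincount(Xq[:, k], weights=w * (y < 0), minlength=B)
        # error of predicting -p for bins <= tau and +p for bins > tau, with p = +1
        err_pos = (np.cumsum(pos) + neg.sum() - np.cumsum(neg)) / Z
        err = np.minimum(err_pos, 1.0 - err_pos)   # 1 - err_pos is the p = -1 error
        tau = int(np.argmin(err))
        if err[tau] < best[3]:
            p = +1 if err_pos[tau] <= 0.5 else -1
            best = (k, p, tau, float(err[tau]))
    return best
```

Quantization itself is done once before Boosting; the per-iteration cost is dominated by the weight accumulation, which is the cost the following sections aim to reduce.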
In the following sections, we examine the stump-training process in greater detail and develop an intuition for how we can reduce computational costs.

3.2. Progressively Increasing Subsets

Let us assume that at the start of each Boosting iteration, the data samples are sorted in order of decreasing weight, i.e. w_i ≥ w_j ∀ i < j. Consequently, we define Z_m, the mass of the heaviest subset of m data points:

Z_m \equiv \sum_{i \le m} w_i \qquad [\text{note: } Z_N \equiv Z]

Clearly, Z_m is greater than or equal to the sum of any m other sample weights. As we increase m, the m-subset includes more samples and, accordingly, its mass Z_m increases (although at a diminishing rate).

In Figure 1, we plot Z_m/Z for progressively increasing m-subsets, averaged over multiple iterations. An interesting empirical observation can be made about Boosting: a large fraction of the overall weight is accounted for by just a few samples. Since training time is dependent on the number of samples and not on their cumulative weight, we should be able to leverage this fact and train using only a subset of the data. The more non-uniform the weight distribution, the more we can leverage; Boosting is an ideal situation where few samples are responsible for most of the weight.

[Figure 1: plot of relative subset mass Z_m/Z versus relative subset size m/N for AdaBoost, LogitBoost, and L2Boost.]
Figure 1. Progressively increasing m-subset mass using different variants of Boosting. Relative subset mass Z_m/Z exceeds 90% after only m ≈ 7% of the total number of samples N in the case of AdaBoost, or m/N ≈ 20% in the case of L2Boost.

The same observation was made by (Friedman et al., 2000), giving rise to the idea of weight-trimming, which proceeds as follows. At the start of each Boosting iteration, samples are sorted in order of weight (from largest to smallest). For that iteration, only the samples belonging to the smallest m-subset such that Z_m/Z ≥ η are used for training (where η is some predefined threshold); all the other samples are temporarily ignored – or trimmed. Friedman et al. claimed this to "dramatically reduce computation for boosted models without sacrificing accuracy." In particular, they prescribed η = 90% to 99% "typically", but they left open the question of how to choose η from the statistics of the data (Friedman et al., 2000).

Their work raises an interesting question: Is there a principled way to determine (and train on only) the smallest subset after which including more samples does not alter the final stump? Indeed, we can prove that a relatively small m-subset contains enough information to set the optimal stump, and thereby save a lot of computation.
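For reference, the weight-trimming selection just described, keeping the smallest m-subset whose mass reaches a fraction η of the total, takes only a few lines (an illustrative sketch, not Friedman et al.'s code):

```python
import numpy as np

def trimmed_subset(w, eta=0.90):
    """Return indices of the heaviest samples whose cumulative mass Z_m first
    reaches eta * Z; all remaining samples would be trimmed for this iteration."""
    order = np.argsort(-w)                            # heaviest samples first
    Zm = np.cumsum(w[order])                          # Z_m for m = 1..N
    m = int(np.searchsorted(Zm, eta * Zm[-1])) + 1    # smallest m with Z_m >= eta * Z
    return order[:m]
```

What this snippet cannot answer is precisely the open question above: how large η must be so that training on the kept subset provably leaves the chosen stump unchanged.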
3.3. Preliminary Errors

Definition: given a feature k and an m-subset, the best preliminary error ε_m^(k) is the lowest achievable training error if only the data points in that subset were considered. Equivalently, this is the reported error if all samples not in that m-subset are trimmed.

\varepsilon^{(k)}_m \equiv \frac{1}{Z_m} \Big[ \sum_{\substack{i \le m \\ x_i[k] \le \tau^{(k)}_m}} w_i \,\mathbf{1}_{\{y_i = +p^{(k)}_m\}} \; + \sum_{\substack{i \le m \\ x_i[k] > \tau^{(k)}_m}} w_i \,\mathbf{1}_{\{y_i = -p^{(k)}_m\}} \Big]

[where p_m^(k) and τ_m^(k) are the optimal preliminary parameters]

The emphasis on preliminary indicates that the subset does not contain all data points, i.e. m < N, and the emphasis on best indicates that no choice of polarity p or threshold τ can lead to a lower preliminary error using that feature.

As previously described, given a feature k, a stump is trained by accumulating sample weights into bins. We can view this as a progression: initially, only a few samples are binned, and as training continues, more and more samples are accounted for; hence, ε_m^(k) may be computed incrementally, which is evident in the smoothness of the traces in Figure 2 (a small code sketch is given at the end of this subsection).

[Figure 2: traces of preliminary error ε^(k) (y-axis) versus relative subset mass Z_m/Z (bottom axis), with the corresponding relative subset size m/N on the top axis; one curve per feature.]
Figure 2. Traces of the best preliminary errors over increasing Z_m/Z. Each curve corresponds to a different feature. The dashed orange curve corresponds to a feature that turns out to be misleading (i.e. its final error drastically worsens when all the sample weights are accumulated). The thick green curve corresponds to the optimal feature; note that it is among the best performing features even when training with relatively few samples.

Figure 2 shows the best preliminary error for each of the features in a typical experiment. By examining Figure 2, we can make four observations:

1. Most of the overall weight is accounted for by the first few samples. (This can be seen by comparing the top and bottom axes in the figure.)
2. Feature errors increasingly stabilize as m → N.
3. The best performing features at smaller m-subsets (i.e. Z_m/Z < 80%) can end up performing quite poorly once all of the weights are accumulated (dashed orange curve).
4. The optimal feature is among the best features for smaller m-subsets as well (solid green curve).

From the full training run shown in Figure 2, we note (in retrospect) that using an m-subset with m ≈ 0.2N would suffice in determining the optimal feature. But how can we know a priori which m is good enough?

Averaging over many training iterations, in Figure 3 we plot the probability that the optimal feature is among the top-performing features when trained on only an m-subset of the data. This gives us an idea as to how small m can be while still correctly predicting the optimal feature.

[Figure 3: probability (y-axis) that the optimal feature is among the preliminary top-performing features (top 1, top 10, and a fixed fraction of all features), versus relative subset mass Z_m/Z.]
Figure 3. Probability that the optimal feature is among the K top-performing features. The red curve corresponds to K = 1, the purple to K = 10, and the blue to K = 1% of all features. Note the knee-point around Z_m/Z ≈ 75%, at which the optimal feature is among the 10 preliminary-best features over 95% of the time.

From Figure 3, we see that the optimal feature is not amongst the top-performing features until a large enough m-subset is used – in this case, Z_m/Z ≈ 75%. Although "optimality" is not quite appropriate in the context of greedy stage-wise procedures (such as boosted trees), consistently choosing sub-optimal parameters at each stump empirically leads to substantially poorer performance and should be avoided.
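Before moving on, the preliminary error defined at the start of this subsection can be made concrete with a short sketch (illustrative Python; it assumes features are pre-quantized into B bins and samples are already sorted by decreasing weight, so the first m rows form the m-subset). Evaluating it for increasing m produces traces of the kind shown in Figure 2; in practice the bins are updated incrementally rather than recomputed as done here for clarity:

```python
import numpy as np

def preliminary_error(xq, y, w, m, B=256):
    """Best preliminary error eps_m^(k) for one quantized feature column xq,
    i.e. the lowest weighted error achievable using only the m heaviest samples."""
    Zm = w[:m].sum()
    pos = np.bincount(xq[:m], weights=w[:m] * (y[:m] > 0), minlength=B)
    neg = np.bincount(xq[:m], weights=w[:m] * (y[:m] < 0), minlength=B)
    err_pos = (np.cumsum(pos) + neg.sum() - np.cumsum(neg)) / Zm   # polarity p = +1
    return float(np.minimum(err_pos, 1.0 - err_pos).min())         # best over p and tau
```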
In the following section, we outline our approach, which determines the optimal stump parameters using the smallest possible m-subset.

4. Pruning Underachieving Features

Figure 3 suggests that the optimal feature can often be estimated at a fraction of the computational cost using the following heuristic:

Faulty Stump Training
1. Train each feature only using samples in the m-subset where Z_m/Z ≈ 75%.
2. Prune all but the 10 best performing features.
3. For each un-pruned feature, complete training on the entire data set.
4. Finally, report the best performing feature (and corresponding parameters).

This heuristic is not guaranteed to return the optimal feature, since premature pruning can occur in step 2. However, if we were somehow able to bound the error, we would be able to prune features that would provably underachieve (i.e. would no longer have any chance of being optimal in the end).

Definition: a feature k is denoted underachieving if it is guaranteed to perform worse than the best-so-far feature k′ on the entire training data.

Proposition 1: for a feature k, the following bound holds (proof given in Section 4.2): given two subsets (where one is larger than the other), the product of subset mass and preliminary error is always greater for the larger subset:

m \le n \;\;\Rightarrow\;\; Z_m \,\varepsilon^{(k)}_m \le Z_n \,\varepsilon^{(k)}_n \qquad (1)

Let us assume that the best-so-far error ε′ has been determined over a few of the features (and that the parameters that led to this error have been stored). Hence, this is an upper bound for the error of the stump currently being trained. For the next feature k in the queue, if even on a smaller (m < N)-subset we find that Z_m ε_m^(k) ≥ Z ε′, then by Proposition 1 (with n = N):

Z_m \,\varepsilon^{(k)}_m \ge Z \varepsilon' \;\;\Rightarrow\;\; Z \,\varepsilon^{(k)} \ge Z \varepsilon' \;\;\Rightarrow\;\; \varepsilon^{(k)} \ge \varepsilon'

Therefore, if Z_m ε_m^(k) ≥ Z ε′, then feature k is underachieving and can safely be pruned. Note that the lower the best-so-far error ε′, the harsher the bound; consequently, it is desirable to train a relatively low-error feature early on.

Accordingly, we propose a new method based on comparing feature performance on subsets of data and, consequently, pruning underachieving features:

Quick Stump Training
1. Train each feature only using data in a relatively small m-subset.
2. Sort the features based on their preliminary errors (from best to worst).
3. Continue training one feature at a time on progressively larger subsets, updating ε_m^(k):
   • if it is underachieving, prune it immediately;
   • if it trains to completion, save it as best-so-far.
4. Finally, report the best performing feature (and corresponding parameters).

4.1. Subset Scheduling

Deciding which schedule of m-subsets to use is a subtlety that requires further explanation. Although this choice does not affect the optimality of the trained stump, it may affect the speedup. If the first "relatively small" m-subset (as prescribed in step 1) is too small, we may lose out on low-error features, leading to less-harsh pruning. If it is too large, we may be doing unnecessary computation. Furthermore, since the calculation of the preliminary error does incur some (albeit low) computational cost, it is impractical to use every m when training on progressively larger subsets.

To address this issue, we implement a simple schedule: the first m-subset is determined by the parameter η_kp such that Z_m/Z ≈ η_kp. M following subsets are equally spaced out between η_kp and 1. Figure 4 shows a parameter sweep over η_kp and M, from which we fix η_kp = 90% and M = 20 and use this setting for all of our experiments.

[Figure 4: heat map of computational cost over η_kp (0% to 100%) and M (1 to 200).]
Figure 4. Computational cost of training boosted trees over a range of η_kp and M, averaged over several types of runs (with varying numbers and depths of trees). Red corresponds to higher cost, blue to lower cost. The lowest computational cost is achieved at η_kp = 90% and M = 20.

Using our quick stump training procedure, we determine the optimal parameters without having to consider every sample for each feature. By pruning underachieving features, a lot of computation is saved.
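A minimal sketch of Quick Stump Training with the (η_kp, M) schedule described above (illustrative Python, not the authors' implementation; features are assumed pre-quantized into B bins and samples pre-sorted by decreasing weight, and each preliminary error is recomputed from scratch rather than updated incrementally as in the actual method):

```python
import numpy as np

def quick_train_stump(Xq, y, w, B=256, eta_kp=0.90, M=20):
    """Returns ((k, p, tau), error) identical to exhaustive training, pruning a
    feature k as soon as Z_m * eps_m^(k) >= Z * eps' (Proposition 1)."""
    N, K = Xq.shape
    Zcum = np.cumsum(w)                     # Z_m for m = 1..N (weights sorted descending)
    Z = Zcum[-1]

    def prelim(k, m):
        """Unnormalized best preliminary error Z_m * eps_m^(k) and its parameters."""
        pos = np.bincount(Xq[:m, k], weights=w[:m] * (y[:m] > 0), minlength=B)
        neg = np.bincount(Xq[:m, k], weights=w[:m] * (y[:m] < 0), minlength=B)
        err_pos = np.cumsum(pos) + neg.sum() - np.cumsum(neg)      # polarity p = +1
        err = np.minimum(err_pos, Zcum[m - 1] - err_pos)
        tau = int(np.argmin(err))
        p = +1 if err_pos[tau] <= Zcum[m - 1] / 2 else -1
        return err[tau], p, tau

    # schedule: first subset has mass ~eta_kp * Z, then M subsets spaced up to the full set
    targets = np.linspace(eta_kp, 1.0, M + 1) * Z
    schedule = np.unique(np.minimum(np.searchsorted(Zcum, targets) + 1, N))
    m0 = schedule[0]

    # steps 1-2: train every feature on the smallest subset, sort by preliminary error
    queue = sorted(range(K), key=lambda k: prelim(k, m0)[0])

    best, best_err = None, np.inf           # best-so-far parameters and Z * eps'
    for k in queue:
        for m in schedule:                  # step 3: progressively larger subsets
            e, p, tau = prelim(k, m)
            if e >= best_err:               # underachieving: Z_m*eps_m >= Z*eps' -> prune
                break
        else:                               # trained to completion on all N samples
            best, best_err = (k, p, tau), e
    return best, best_err / Z
```

Because a feature is pruned only when the bound already guarantees ε^(k) ≥ ε′, the returned stump is the one exhaustive training would produce; only the amount of binning work changes.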
We now outline the full Boosting procedure using our quick training method:

Quickly Boosting Decision Trees
1. Initialize weights (sorted in decreasing order).
2. Train decision tree h_t (one node at a time) using the Quick Stump Training method.
3. Perform standard Boosting steps:
   (a) determine the optimal α_t (i.e. using line-search);
   (b) update sample weights given the misclassification error of h_t and the variant of Boosting used;
   (c) if more Boosting iterations are needed, sort the sample weights in decreasing order, increment the iteration number t, and go to step 2.

We note that sorting the weights in step 3(c) above is an O(N) operation. Given an initially sorted set, Boosting updates the sample weights based on whether the samples were correctly classified or not. All correctly classified samples are weighted down, but they maintain their respective ordering. Similarly, all misclassified samples are weighted up, also maintaining their respective ordering. Finally, these two sorted lists are merged in O(N). (A code sketch of this merge is given at the end of this section.)

We now give a proof for the bound that our method is based on, and in the following section, we demonstrate its effectiveness in practice.

4.2. Proof of Proposition 1

As previously defined, ε_m^(k) is the preliminary weighted classification error computed using the feature k on samples in the m-subset; hence:

Z_m \varepsilon^{(k)}_m = \sum_{\substack{i \le m \\ x_i[k] \le \tau^{(k)}_m}} w_i \,\mathbf{1}_{\{y_i = +p^{(k)}_m\}} + \sum_{\substack{i \le m \\ x_i[k] > \tau^{(k)}_m}} w_i \,\mathbf{1}_{\{y_i = -p^{(k)}_m\}}

Proposition 1: m \le n \;\Rightarrow\; Z_m \varepsilon^{(k)}_m \le Z_n \varepsilon^{(k)}_n

Proof: ε_m^(k) is the best achievable preliminary error on the m-subset (and correspondingly, {p_m^(k), τ_m^(k)} are the best preliminary parameters); therefore:

Z_m \varepsilon^{(k)}_m \le \sum_{\substack{i \le m \\ x_i[k] \le \tau'}} w_i \,\mathbf{1}_{\{y_i = +p\}} + \sum_{\substack{i \le m \\ x_i[k] > \tau'}} w_i \,\mathbf{1}_{\{y_i = -p\}} \qquad \forall\; p, \tau'

Hence, switching the optimal parameters {p_m^(k), τ_m^(k)} for potentially sub-optimal ones {p_n^(k), τ_n^(k)} (note the subtle change in indices):

Z_m \varepsilon^{(k)}_m \le \sum_{\substack{i \le m \\ x_i[k] \le \tau^{(k)}_n}} w_i \,\mathbf{1}_{\{y_i = +p^{(k)}_n\}} + \sum_{\substack{i \le m \\ x_i[k] > \tau^{(k)}_n}} w_i \,\mathbf{1}_{\{y_i = -p^{(k)}_n\}}

Further, by summing over a larger subset (n ≥ m), the resulting sum can only increase:

Z_m \varepsilon^{(k)}_m \le \sum_{\substack{i \le n \\ x_i[k] \le \tau^{(k)}_n}} w_i \,\mathbf{1}_{\{y_i = +p^{(k)}_n\}} + \sum_{\substack{i \le n \\ x_i[k] > \tau^{(k)}_n}} w_i \,\mathbf{1}_{\{y_i = -p^{(k)}_n\}}

But the right-hand side is equivalent to Z_n ε_n^(k); hence:

Z_m \varepsilon^{(k)}_m \le Z_n \varepsilon^{(k)}_n

Q.E.D.

For similar proofs using information gain, Gini impurity, or variance minimization as split criteria, please see the supplementary material.
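Putting the pieces together, here is a compact sketch of the boxed procedure above, reusing quick_train_stump from the Section 4 sketch (depth-1 weak learners and an AdaBoost-style α and weight update are assumed purely for concreteness; the method itself is agnostic to the Boosting variant). merge_sorted_weights implements the O(N) merge of step 3(c); all names are illustrative:

```python
import numpy as np

def merge_sorted_weights(w, correct):
    """Step 3(c): after the weight update, correctly classified samples keep their
    relative order and so do the misclassified ones, so one O(N) merge of the two
    already-sorted lists restores a full decreasing order (no O(N log N) sort)."""
    up = np.flatnonzero(~correct)          # misclassified: weighted up, order preserved
    down = np.flatnonzero(correct)         # correct: weighted down, order preserved
    perm, i, j = [], 0, 0
    while i < len(up) and j < len(down):
        if w[up[i]] >= w[down[j]]:
            perm.append(up[i]); i += 1
        else:
            perm.append(down[j]); j += 1
    perm.extend(up[i:]); perm.extend(down[j:])
    return np.asarray(perm, dtype=int)

def quick_boost(Xq, y, T, B=256):
    """Minimal AdaBoost-flavoured loop over quick stump training (depth-1 weak learners)."""
    N = len(y)
    w = np.full(N, 1.0 / N)
    order = np.argsort(-w)                 # step 1: the only full sort (ties only here)
    Xs, ys, ws = Xq[order], y[order], w[order]
    ensemble = []
    for t in range(T):
        (k, p, tau), eps = quick_train_stump(Xs, ys, ws, B)   # step 2
        alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))   # step 3(a), AdaBoost alpha
        correct = np.where(Xs[:, k] > tau, p, -p) == ys
        ws = ws * np.exp(np.where(correct, -alpha, alpha))    # step 3(b)
        ws /= ws.sum()
        perm = merge_sorted_weights(ws, correct)              # step 3(c): O(N) merge
        Xs, ys, ws = Xs[perm], ys[perm], ws[perm]
        ensemble.append((k, p, tau, alpha))
    return ensemble
```

The argsort at initialization is the only full sort; every subsequent iteration only merges the two already-sorted sublists.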
5. Experiments

In the previous section, we proposed an efficient stump-training algorithm and showed that it has a lower expected computational cost than the traditional method. In this section, we describe experiments that are designed to assess whether the method is practical and whether it delivers significant training speedup. We train and test on three real-world datasets and empirically compare the speedups.

5.1. Datasets

We trained AdaBoosted ensembles of shallow decision trees of various depths on the following three datasets:

1. CMU-MIT Faces dataset (Rowley et al., 1996); 8.5·10^3 training and 4.0·10^3 test samples. The 4.3·10^3 features used are the result of convolutions with Haar-like wavelets (Viola & Jones, 2004), using 2000 stumps as in (Viola & Jones, 2004).

2. INRIA Pedestrian dataset (Dalal & Triggs, 2005); 1.7·10^4 training and 1.1·10^4 test samples. The 5.1·10^3 features used are Integral Channel Features (Dollár et al., 2009). The classifier has 4000 depth-2 trees as in (Dollár et al., 2009).

3. MNIST Digits (LeCun & Cortes, 1998); 6.0·10^4 training and 1.0·10^4 test samples. The 7.8·10^2 features used are grayscale pixel values. The ten-class classifier uses 1000 depth-4 trees based on (Kégl & Busa-Fekete, 2009).

5.2. Comparisons

Quick Boosting can be used in conjunction with all previously mentioned heuristics to provide further gains in training speed. We report all computational costs in units proportional to Flops, since running time (in seconds) is dependent on compiler optimizations which are beyond the scope of this work.
[Figure 5: six plots, training loss versus computational cost (top row) and test error (area over the ROC) versus computational cost (bottom row), each comparing AdaBoost, Weight-Trimming 99%, Weight-Trimming 90%, LazyBoost 90%, LazyBoost 50%, and StochasticBoost 50%.]
Figure 5. Computational cost versus training loss (top plots) and versus test error (bottom plots) for the various heuristics on three datasets: (a1,2) CMU-MIT Faces, (b1,2) INRIA Pedestrians, and (c1,2) MNIST Digits [see text for details]. Dashed lines correspond to the "quick" versions of the heuristics (using our proposed method) and solid lines correspond to the original heuristics. Test error is defined as the area over the ROC curve.

In Figure 5, we plot the computational cost versus training loss and versus test error. We compare vanilla (no heuristics) AdaBoost, Weight-Trimming with η = 90% and 99% [see Section 3.2], LazyBoost 90% and 50% (only 90% or 50% of randomly selected features are used to train each weak learner), and StochasticBoost (only a 50% random subset of the samples is used to train each weak learner). To these six heuristics, we apply our method to produce six "quick" versions.

We further note that our goal in these experiments is not to tweak and enhance the performance of the classifiers, but to compare the performance of the heuristics with and without our proposed method.
[Figure 6: bar chart of relative (x-fold) reduction in test error for AdaBoost, WT:99%, WT:90%, Lazy:90%, Lazy:50%, and Stoch:50%.]
Figure 6. Relative (x-fold) reduction in test error (area over the ROC) due to our method, given a fixed computational cost. Triplets of bars of the same color correspond to the three datasets: CMU-MIT Faces, INRIA Pedestrians, and MNIST Digits.

[Figure 7: bar chart of relative computational speedup for the same six heuristics.]
Figure 7. Relative speedup in training time due to our proposed method to achieve the desired minimal error rate. Triplets of bars of the same color correspond to the three datasets: CMU-MIT Faces, INRIA Pedestrians, and MNIST Digits.

5.3. Results and Discussion

From Figure 5, we make several observations. "Quick" versions require less computational cost than (and produce classifiers identical to) their slow counterparts. From the training loss plots (5a1, 5b1, 5c1), we gauge the speedup offered by our method: often around an order of magnitude. Quick-LazyBoost-50% and Quick-StochasticBoost-50% are the least computationally intensive heuristics, and vanilla AdaBoost always achieves the smallest training loss and attains the lowest test error in two of the three datasets.

The motivation behind this work was to speed up training such that: (i) for a fixed computational budget, the best possible classifier could be trained, and (ii) given a desired performance, a classifier could be trained with the least computational cost.

For each dataset, we find the lowest-cost heuristic and set that computational cost as our budget. We then boost as many weak learners as our budget permits for each of the heuristics (with and without our method) and compare the test errors achieved, plotting the relative gains in Figure 6. For most of the heuristics, there is a two to eight-fold reduction in test error, whereas for weight-trimming, we see less of a benefit. In fact, for the second dataset, Weight-Trimming-90% runs at the same cost with and without our speedup.

Conversely, in Figure 7, we compare how much less computation is required to achieve the best test error rate by using our method for each heuristic. Most heuristics see an eight to sixteen-fold reduction in computational cost, whereas for weight-trimming, there is still a speedup, albeit again a much smaller one (between one and two-fold).

As discussed in Section 3.2, weight-trimming acts similarly to our proposed method in that it prunes features, although it does so naively – without adhering to a provable bound. This results in a speedup (at times almost equivalent to our own), but also leads to classifiers that do not perform as well as those trained using the other heuristics.

6. Conclusions

We presented a principled approach for speeding up training of boosted decision trees. Our approach is built on a novel bound on classification or regression error, guaranteeing that gains in speed do not come at a loss in classification performance.

Experiments show that our method is able to reduce training cost by an order of magnitude or more, or, given a computational budget, is able to train classifiers that reduce errors on average by two-fold or more. Our ideas may be applied concurrently with other techniques for speeding up Boosting (e.g. subsampling of large datasets) and do not limit the generality of the method, enabling the use of Boosting in applications where fast training of classifiers is key.

Acknowledgements

This work was supported by NSERC 420456-2012, MURI-ONR N00014-10-1-0933, ARO/JPL-NASA Stennis NAS7.03001, and the Moore Foundation.
References

Asadi, N. and Lin, J. Training efficient tree-based models for document ranking. In European Conference on Information Retrieval (ECIR), 2013.

Birkbeck, N., Sofka, M., and Zhou, S. K. Fast boosting trees for classification, pose detection, and boundary detection on a GPU. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011.

Bradley, J. K. and Schapire, R. E. FilterBoost: Regression and classification on large datasets. In Neural Information Processing Systems (NIPS), 2007.

Breiman, L. Arcing classifiers. In The Annals of Statistics, 1998.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Chapman & Hall, New York, NY, 1984.

Bühlmann, P. and Hothorn, T. Boosting algorithms: Regularization, prediction and model fitting. In Statistical Science, 2007.

Burgos-Artizzu, X. P., Dollár, P., Lin, D., Anderson, D. J., and Perona, P. Social behavior recognition in continuous videos. In Computer Vision and Pattern Recognition (CVPR), 2012.

Coates, A., Baumstarck, P., Le, Q., and Ng, A. Y. Scalable learning for object detection with GPU hardware. In Intelligent Robots and Systems (IROS), 2009.

Dalal, N. and Triggs, B. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR), 2005.

Dollár, P., Tu, Z., Tao, H., and Belongie, S. Feature mining for image classification. In Computer Vision and Pattern Recognition (CVPR), June 2007.

Dollár, P., Tu, Z., Perona, P., and Belongie, S. Integral channel features. In British Machine Vision Conference (BMVC), 2009.

Dollár, P., Appel, R., and Kienzle, W. Crosstalk cascades for frame-rate pedestrian detection. In European Conference on Computer Vision (ECCV), 2012.

Domingo, C. and Watanabe, O. Scaling up a boosting-based learner via adaptive sampling. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2000.

Domingos, P. and Hulten, G. Mining high-speed data streams. In International Conference on Knowledge Discovery and Data Mining, 2000.

Dubout, C. and Fleuret, F. Boosting with maximum adaptive sampling. In Neural Information Processing Systems (NIPS), 2011.

Freund, Y. Boosting a weak learning algorithm by majority. In Information and Computation, 1995.

Freund, Y. and Schapire, R. E. Experiments with a new boosting algorithm. In Machine Learning International Workshop, 1996.

Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting. In The Annals of Statistics, 2000.

Friedman, J. H. Greedy function approximation: A gradient boosting machine. In The Annals of Statistics, 2000.

Friedman, J. H. Stochastic gradient boosting. In Computational Statistics & Data Analysis, 2002.

Kégl, B. and Busa-Fekete, R. Accelerating AdaBoost using UCB. KDD-Cup 2009 competition, 2009.

Kotsiantis, S. B. Supervised machine learning: A review of classification techniques. Informatica, 31:249–268, 2007.

LeCun, Y. and Cortes, C. The MNIST database of handwritten digits, 1998.

Mnih, V., Szepesvári, C., and Audibert, J. Empirical Bernstein stopping. In International Conference on Machine Learning (ICML), 2008.

Paul, B., Athithan, G., and Murty, M. N. Speeding up AdaBoost classifier with random projection. In International Conference on Advances in Pattern Recognition (ICAPR), 2009.

Quinlan, R. J. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

Quinlan, R. J. Bagging, boosting, and C4.5. In National Conference on Artificial Intelligence, 1996.

Ridgeway, G. The state of boosting. In Computing Science and Statistics, 1999.

Rowley, H., Baluja, S., and Kanade, T. Neural network-based face detection. In Computer Vision and Pattern Recognition (CVPR), 1996.

Schapire, R. E. The strength of weak learnability. In Machine Learning, 1990.

Sharp, T. Implementing decision trees and forests on a GPU. In European Conference on Computer Vision (ECCV), 2008.

Svore, K. M. and Burges, C. J. Large-scale learning to rank using boosted decision trees. In Scaling Up Machine Learning: Parallel and Distributed Approaches, 2011.

Viola, P. A. and Jones, M. J. Robust real-time face detection. International Journal of Computer Vision (IJCV), 2004.

Wu, J., Brubaker, S. C., Mullin, M. D., and Rehg, J. M. Fast asymmetric learning for cascade face detection. Pattern Analysis and Machine Intelligence (PAMI), 2008.