Collaborative hyperparameter tuning
Rémi Bardenet (1,2,3), remi.bardenet@gmail.com
Mátyás Brendel (2,4), matthias.brendel@gmail.com
Balázs Kégl (2,3), balazs.kegl@gmail.com
Michèle Sebag (3), sebag@lri.fr

1 Dept. of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK
2 Linear Accelerator Laboratory (LAL), CNRS/University of Paris Sud, 91405 Orsay, France
3 Computer Science Laboratory (LRI), CNRS/University of Paris Sud, 91405 Orsay, France
4 Ferchau Engineering, 88046 Friedrichshafen, Germany

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

Hyperparameter learning has traditionally been a manual task because of the limited number of trials. Today's computing infrastructures allow bigger evaluation budgets, thus opening the way for algorithmic approaches. Recently, surrogate-based optimization was successfully applied to hyperparameter learning for deep belief networks and to WEKA classifiers. The methods combined brute-force computational power with model building about the behavior of the error function in the hyperparameter space, and they could significantly improve on manual hyperparameter tuning. What may make experienced practitioners even better at hyperparameter optimization is their ability to generalize across similar learning problems. In this paper, we propose a generic method to incorporate knowledge from previous experiments when simultaneously tuning a learning algorithm on new problems at hand. To this end, we combine surrogate-based ranking and optimization techniques into surrogate-based collaborative tuning (SCoT). We demonstrate SCoT in two experiments where it outperforms standard tuning techniques and single-problem surrogate-based optimization.

1. Introduction

Hyperparameter tuning is a crucial step in machine learning practice. Recently, it was shown that the state of the art on image classification benchmarks can be improved by configuring existing techniques better rather than by inventing new learning paradigms (Pinto et al., 2009; Coates et al., 2011; Bergstra et al., 2011; Snoek et al., 2012; Thornton et al., 2012). Hyperparameter tuning is often carried out by hand, progressively refining a grid over the hyperparameter space. Several automatic hyperparameter tuning methods are already available, including local-search-based methods (ParamILS of Hutter et al., 2009), estimation-of-distribution methods (REVAC of Nannen & Eiben, 2007), and surrogate-based methods (Hutter, 2009). Recently, Bergstra et al. (2011) successfully applied surrogate-based optimization methods to tuning numerous hyperparameters of deep belief networks. The method combined brute-force computational power with building a model of the behavior of the cost function (validation error) in the hyperparameter space, and it could significantly improve on manual hyperparameter tuning. In a similar setup, Thornton et al. (2012) applied surrogate-based optimization to a large number of classifiers from the WEKA package, outperforming the state of the art on a large set of problems.

What may still make experienced practitioners better at hyperparameter optimization is their ability to generalize across similar learning problems. For example, if somebody in the past successfully applied a classification algorithm A to the popular MNIST dataset with a given set of hyperparameters x, he or she would certainly use this set as a hint (or "prior") to choose the hyperparameters of A when tuning A on a slightly noisy or rotated version of MNIST. Our main contribution is to propose a way to mimic this human behavior when performing an automatic, surrogate-based hyperparameter search.

Brendel & Schoenauer (2011) recently proposed a similar idea in artificial planning. They tuned an evolutionary algorithm by learning a mapping from problems to hyperparameters using a neural network. This setting may work well if the problem descriptors determine the optimal hyperparameters with high certainty. In machine learning on diverse datasets, however, we cannot hope to come up with a set of easily measurable problem descriptors that can predict the best hyperparameters with no uncertainty. What we propose here instead is to build a model from past experience that can bias the search on a new problem towards regions in the hyperparameter space where the optimal hyperparameters are likely to be found. Surrogate-based methods suit this setup well. They also solve the exploration/exploitation dilemma efficiently. Furthermore, by combining surrogate-based ranking and optimization techniques, we can define a novel Bayesian optimization method that can be used to collaboratively optimize quantitatively different but similarly behaving objective functions.

1.1. The formal setup

To tackle the problem, we first build a quality function f : D × H → R that takes a dataset D ∈ D and a set of hyperparameters x ∈ H as inputs, and outputs the quality f(D, x) of the result A(D, x) obtained by applying the algorithm A to the dataset D with hyperparameters x. The hyperparameters are usually numerical, so the space H is naturally R^dH, where dH is the number of hyperparameters. To apply surrogate optimization on the joint space D × H, we also represent datasets with a real vector of size dD. These dD numerical descriptors should be such that datasets with similar descriptors give rise to similar optimal hyperparameters. Coming up with such features is itself an interesting (and, in general, open) research problem. We will describe our approach in Section 4.

Once the feature and dataset representation D × H is fixed, we place a prior over f that incorporates knowledge from previous experiments. The Gaussian process (GP) prior is now a common choice in sequential model-based optimization (SMBO; Jones, 2001; Lizotte, 2008). Since GPs are closed under sampling, they provide a simple and elegant way of accumulating more and more knowledge about the surrogate function as the observations arrive. On the other hand, specifying the quality function f is more subtle. Typically, in single-set hyperparameter optimization, f is a (cross-)validation error f(D, x) = R(A(D, x)), as in the setup of Bergstra et al. (2011). The problem with this choice in our multi-set setup is that when applying the same algorithm A to different problems D1, ..., DM ∈ D, the errors can differ significantly in scale, which means that f can be non-smooth in its dimensions coming from D (Figure 1). The raw validation error is thus a poor choice for f. At the same time, similar problems may share a model of where the error is minimized in the hyperparameter space. To deal with this issue, we need a quality function that encodes knowledge on the ranking of hyperparameters on individual problems, and can convey information such as "if f(D1, x1) < f(D1, x2) and D1 is similar to D2, then probably f(D2, x1) < f(D2, x2)", even though the ranges of the errors R(A(D1, ·)) and R(A(D2, ·)) are very different. We therefore propose to rank the hyperparameters xj on each problem Di by the validation error R(A(Di, xj)), and to take as f : D × H → R the surrogate model output by surrogate-based ranking algorithms. This results in a novel and rather uncommon sequential model-based optimization algorithm that redefines its training set at each time step, since new rankings yield a new model.

Figure 1. (a) Error surface of AdaBoost on lymph; (b) error surface of AdaBoost on sonar; (c) the common latent ranker f_AdaBoost(·, (m, T)). (a,b) Error surfaces on two similar datasets have similar shapes although the errors are quantitatively different. (c) The similar shapes can be captured by a latent ranker.

The rest of the paper is organized as follows: in Section 2 we give a brief overview of sequential model-based optimization. Section 3 contains our contributions, describing in detail the quality function f and the prior we place over it. Finally, in Section 4 we present two experimental case studies on AdaBoost and on multilayer perceptrons, where we compare our approach to several standard tuning techniques and to single-problem surrogate-based optimization.
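To make the setup of Section 1.1 concrete, here is a small illustrative sketch (not the authors' code; the numbers are invented) of the joint representation a point in D × H gets, and of the observation that motivates ranking: two similar problems can have error surfaces on very different scales while ordering the hyperparameter settings identically.

```python
import numpy as np

def joint_point(dataset_features, hyperparams):
    """Represent a (dataset, hyperparameter) pair as a single real vector in D x H."""
    return np.concatenate([np.asarray(dataset_features, float),
                           np.asarray(hyperparams, float)])

# Hypothetical validation errors of three hyperparameter settings on two problems:
# the scales differ, but the induced rankings are identical.
errors_D1 = np.array([0.10, 0.08, 0.15])
errors_D2 = np.array([0.42, 0.35, 0.55])
print(np.argsort(errors_D1))   # [1 0 2]
print(np.argsort(errors_D2))   # [1 0 2] -- same ordering, different scale
```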
2. Sequential model-based optimization

On today's large databases, obtaining a validation error for a given set of hyperparameters requires a training phase that can typically take hours or days. Sequential model-based optimization (SMBO) is a surrogate optimization method suitable for such expensive-to-evaluate target functions. SMBO replaces the target function f(z) by a cheaper-to-evaluate surrogate model M, and iteratively 1) tunes the model and 2) optimizes an auxiliary criterion function S(z; M). The criterion S(z; M) measures the interest of asking the target function value at a new point z, given the model M. The optimization paradigm is presented in Figure 2. The ingredients that the user has to provide are a model M and a criterion S, for which we now present two common choices (which we also used in our experiments).

SMBO(f, M, T, S)
1  O ← ∅
2  for t ← 1 to T
3      z* ← arg max_z S(z; M)
4      Evaluate f(z*)                     ▷ Expensive step
5      O ← O ∪ {(z*, f(z*))}
6      Fit a new model M to O
7  return O

Figure 2. The pseudo-code of generic sequential model-based optimization. f is the function to be optimized, M is an initial surrogate model that is updated when new evaluations of f become available, T is the number of steps, and S is the auxiliary criterion that measures the interest of asking the target function value at a new point z.

2.1. A model: Gaussian processes

Gaussian processes (GPs; Rasmussen & Williams, 2006) are a convenient way of putting priors over functions. It is the most common model of choice for SMBO, mainly since it is closed under sampling. This means that if we put a GP prior on f, then the posterior distribution of f given a set O := {(z_i, f(z_i)), i = 1, ..., N} of observed points is still a GP. Formally, a GP is defined by its mean function µ(·) and its covariance function k(·, ·). The latter is a positive kernel that encodes the degree of smoothness of the sample functions, while the mean function encodes a trend shared by these functions. Basically, writing f ∼ GP(µ, k) means that marginally, for any M, we have (f(z_1), ..., f(z_M))^T ∼ N(0, Q) with Q := (k(z_i, z_j))_{1≤i,j≤M}. It is common to assume µ ≡ 0 after centering the data. If f ∼ GP(0, k(·, ·)), then the posterior of f given O is GP(µ̃, k̃(·, ·)), where

    µ̃(z) = k_z^T K^{-1} f,                                      (1)
    k̃(z_1, z_2) = k(z_1, z_2) − k_{z_1}^T K^{-1} k_{z_2},

with K := (k(z_i, z_j))_{1≤i,j≤N}, f := (f(z_i))_{1≤i≤N}, and k_v := (k(v, z_i))_{1≤i≤N}. This property allows us to derive analytically the posterior distribution of the value f(z) at any test point z, which is univariate Gaussian with mean µ̃(z) and variance k̃(z, z). To conclude, note that it is easy to incorporate input-independent evaluation noise by adding a scalar matrix to K in the regression equations (1). Note also that kernel functions have their own hyperparameters, which are usually chosen by maximizing the marginal likelihood of the data given the hyperparameters (see Rasmussen & Williams, 2006, for details).

2.2. A criterion for SMBO: expected improvement

Armed with this analytical posterior on f(z), we can design criteria that predict the quality of a new point z. Expected improvement (EI; Jones, 2001) is the most common such criterion, but others include the probability of improvement (Jones, 2001), minimizing the conditional entropy of the minimizer (Villemonteix et al., 2006), or, more recently, a criterion based on multi-armed bandits (Srinivas et al., 2010). We will use here the EI criterion, which we now describe formally. Assume we want to minimize a function f on which we put a GP(0, k) prior. Conditioning on the target function evaluated at some training points O := {(z_i, f(z_i)), i = 1, ..., N}, we obtain a posterior GP model M as described in Section 2.1, and the expected improvement at z over the best minimum found so far is defined by

    EI(z; M) := E[ max(m_N − f(z), 0) | F_N ],

where F_N is the σ-algebra generated by the previous fitness evaluations O, and m_N := min_{1≤i≤N} f(z_i).
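To make Section 2 concrete, the following sketch implements the regression equations (1), the usual closed form of EI under a Gaussian posterior, and the loop of Figure 2 over a finite candidate set. It is an illustrative reimplementation, not the authors' code: the kernel is a fixed isotropic squared exponential, and its lengthscale and the noise level are hard-coded instead of being tuned by marginal likelihood.

```python
import numpy as np
from scipy.stats import norm

def sq_exp_kernel(A, B, lengthscale=0.2):
    """Isotropic squared exponential kernel k(a, b) = exp(-||a - b||^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(Z, f, Ztest, noise=1e-6):
    """Posterior mean and variance at the test points, following Eq. (1),
    with input-independent noise added to K as mentioned in Section 2.1."""
    K = sq_exp_kernel(Z, Z) + noise * np.eye(len(Z))
    Ks = sq_exp_kernel(Ztest, Z)                       # rows are k_z^T
    mean = Ks @ np.linalg.solve(K, f)                  # k_z^T K^{-1} f
    var = sq_exp_kernel(Ztest, Ztest).diagonal() \
          - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mean, np.maximum(var, 1e-12)

def expected_improvement(mean, var, best_f):
    """Closed-form EI for minimization: E[max(best_f - f(z), 0)]."""
    sigma = np.sqrt(var)
    u = (best_f - mean) / sigma
    return (best_f - mean) * norm.cdf(u) + sigma * norm.pdf(u)

def smbo(f, candidates, n_init=3, T=20, seed=0):
    """Generic SMBO loop of Figure 2 over a finite candidate set."""
    rng = np.random.default_rng(seed)
    idx = list(rng.choice(len(candidates), size=n_init, replace=False))
    y = [f(candidates[i]) for i in idx]                # expensive evaluations
    for _ in range(T):
        Z, fvals = candidates[idx], np.array(y)
        mean, var = gp_posterior(Z, fvals, candidates)
        ei = expected_improvement(mean, var, fvals.min())
        ei[idx] = -np.inf                              # do not re-evaluate known points
        nxt = int(np.argmax(ei))
        idx.append(nxt)
        y.append(f(candidates[nxt]))                   # expensive step of Figure 2
    return candidates[idx], np.array(y)
```

For instance, smbo(lambda z: float(np.sum((z - 0.3) ** 2)), np.random.default_rng(1).random((200, 2))) runs the loop on a toy quadratic over 200 random two-dimensional candidates.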
3. The quality function f and its prior

We will now apply the SMBO framework of Section 2 in the joint space D × H of (dataset, hyperparameters) pairs. To explain our motivation for designing the target "quality" function f and choosing a ranking-based surrogate method, we start with an example. Figures 1(a) and 1(b) show the two-dimensional error surfaces obtained when running AdaBoost.MH of Schapire & Singer (1999) using product weak learners (Kégl & Busa-Fekete, 2009) with two hyperparameters (the number of terms m and the number of boosting iterations T) on two benchmark datasets (the data selection and the experimental setup will be described in detail in Section 4). The datasets are similar in terms of some high-level features (such as the number of instances, number of classes, or number of attributes) and the shapes of the error surfaces are also similar, so it intuitively makes sense to try to fit a common surrogate function onto the surfaces. The errors R(AdaBoost(D, (m, T))), however, are quantitatively different in scale for these two datasets, so direct fitting would fail. To overcome this problem, we will use a ranking surrogate f_AdaBoost(D, (m, T)) that only defines the relative ordering of error values.

3.1. An SMBO algorithm targeting a latent ranker

Let us fix a learning algorithm A. Let R(A(D, x)) be a validation error of algorithm A applied with hyperparameters x ∈ H to problem D ∈ D. Define a partial order ≺ on D × H by

    (D, x) ≺ (D, x') ⇔ R(A(D, x)) ≤ R(A(D, x')).        (2)

We say that a function g preserves rankings iff (D, x) ≺ (D, x') implies g(D, x) ≤ g(D, x'). Assume that we are given a smooth function f : D × H → R that preserves rankings and whose range does not vary across datasets. f is a better target for an SMBO-based tuning algorithm than the raw validation error, since the arbitrary scale factors corresponding to different datasets would not influence the stochastic search.

Surrogate-based ranking algorithms, such as SVMrank of Joachims (2002) or the GP-based ranking algorithm of Chu & Ghahramani (2005), build an estimate f̂ by trying to preserve as many as possible of the empirical rankings they are given while keeping f̂ smooth and flat. Thus, our strategy is to feed such a surrogate-based ranking algorithm B with all available rankings defined by (2). The output f̂ will not estimate f in a classical (e.g., L2) sense, but it will be a reasonable approximation of f up to a monotonic transformation.

Assuming that f(D, x) is smooth in x is quite natural, and most local search and surrogate optimization algorithms are designed based on this hypothesis. In addition, and this is our key assumption, we also suppose that f(D, x) is smooth in D, which means that similar problems produce similarly-looking error surfaces. How problem similarity is defined is a deep and generally unanswered question. In Section 4 we propose two simple setups, but the algorithm described in the following section is generic in the sense that it only assumes that there exists a positive semidefinite kernel representing problem similarity.

We can now describe an SMBO algorithm that approximately optimizes f up to a monotonic transformation. Let us start with a collection O = ((D_i, x_i, R_i))_i of (dataset, hyperparameter, validation error) triplets. Note that in Figures 3 and 4, we also abusively denote O = (D, x, R) to be able to underline the separate roles of each triplet component. Consider the repetition of the following steps. First compute all pairwise rankings P among the pairs (D_i, x_i) in O according to (2). Then apply a surrogate-based ranking algorithm B to the rankings P, outputting f̂. Evaluate f̂ on all points (D_i, x_i) in O to yield the vector of surrogate values, and perform GP regression using these values as targets. Pick the next point to add to O by maximizing EI, as described in Section 2.

The procedure described above yields a first Bayesian optimization algorithm, presented in Figure 3 and named ST for surrogate-based tuning. What makes it a particularly unusual Bayesian optimization algorithm is that the regressed function is different at each time step, because f̂ depends on the rankings P and the size of P increases at each iteration.

ST(D, T, O, A, B)
1  for t ← 0 to T − 1
2      Compute pairwise rankings P from O as in (2)
3      f̂ ← values at (D, x) of the surrogate f̂ built by B from the rankings P
4      M ← posterior GP on f̂ knowing (D, x), f̂
5      x* ← argmax_{x ∈ H} EI(D, x; M)
6      R* ← R(A(D, x*))                  ▷ Run learning algorithm
7      O ← O ∪ {(D, x*, R*)}
8  return O

Figure 3. The pseudo-code of the surrogate-based tuning algorithm. Input B denotes a surrogate-based ranking algorithm. O = ((D_i, x_i, R_i))_i = (D, x, R) summarizes the available data. The input O comes from the results of past experiments with the same algorithm, and O is then updated along the algorithm. See text for details.
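The following sketch (again illustrative, not the authors' implementation; the dataset names and error values are made up) shows how the pairwise preferences P of Step 2 of Figure 3 can be extracted from O according to (2): only points evaluated on the same dataset are compared, so the arbitrary error scales of different datasets never enter the constraints. The resulting pairs are the constraints that a ranking algorithm such as SVMrank is trained to respect (the exact input format depends on the implementation).

```python
def pairwise_rankings(O):
    """Build preference pairs P from triplets O = [(dataset_id, x, R), ...].

    A pair (i, j) means that point i should be ranked better (lower error)
    than point j; by (2), only points sharing the same dataset are compared.
    """
    P = []
    for i, (Di, xi, Ri) in enumerate(O):
        for j, (Dj, xj, Rj) in enumerate(O):
            if i != j and Di == Dj and Ri < Rj:
                P.append((i, j))
    return P

# Example: two datasets, two hyperparameter settings each (invented numbers).
O = [("lymph", (2, 100), 0.18), ("lymph", (10, 1000), 0.12),
     ("sonar", (2, 100), 0.31), ("sonar", (10, 1000), 0.22)]
print(pairwise_rankings(O))   # [(1, 0), (3, 2)] -- no cross-dataset pairs
```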
3.2. Collaborative tuning

Since ST is now independent of the scale of the validation error R(·), we can apply it to several problems D_1, ..., D_M at the same time. Instead of repeatedly applying ST (Figure 3), we propose here to tune all problems simultaneously by spending one iteration on each problem in turn, thus making use of all information gained so far on all problems. This gives birth to the SCoT algorithm, for surrogate-based collaborative tuning, presented in Figure 4. In Section 5, we come back to other collaborative strategies.

SCoT((D_1, ..., D_M), T, O, A, B)
1  for t ← 0 to T − 1
2      for i ← 1 to M
3          Compute pairwise rankings P from O as in (2)
4          f̂ ← values at (D, x) of the surrogate f̂ built by B from the rankings P
5          M ← posterior GP on f̂ knowing (D, x), f̂
6          x* ← argmax_{x ∈ H} EI(D_i, x; M)
7          R* ← R(A(D_i, x*))            ▷ Run learning algorithm
8          O ← O ∪ {(D_i, x*, R*)}
9  return O

Figure 4. The pseudo-code of the surrogate-based collaborative tuning algorithm. See the caption of Figure 3 for notations, the only difference being the simultaneous tuning on several datasets D_1, ..., D_M. See text for details.

3.3. On the choice of a surrogate-based ranking algorithm

The method of choice for Algorithm B in Figure 4 is the Gaussian-process-based ranking algorithm of Chu & Ghahramani (2005), since it provides a way of simultaneously estimating the values of a hidden f̂ and tuning the GP hyperparameters in a double optimization loop. This method, applied with a squared exponential kernel with automatic relevance determination (SE-ARD; Rasmussen & Williams, 2006, pp. 83 and 106), would avoid the need for Step 5 of SCoT and provide an easier interpretation for f̂. However, both our implementation and the one available on the web proved to be too slow in the regime presented in Section 4 to be realistically incorporated in the for loop of SCoT. We thus chose to limit ourselves in Step 4 to an isotropic squared exponential kernel and used the efficient optimization routines available for SVMrank of Joachims (2002), while Step 5 is carried out with an SE-ARD kernel. Note that SVMrank requires that all hyperparameter vectors x ∈ H be comparable when applied to the same problem, which is the case here by definition of ≺ in (2).
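To make the control flow of Figure 4 explicit, here is a schematic, runnable version of the SCoT loop. It is a sketch, not the authors' implementation: instead of SVMrank or the GP ranker of Chu & Ghahramani (2005), it uses a crude within-dataset rank transform as the ranking step; it reuses joint_point, gp_posterior and expected_improvement from the earlier sketches; it maximizes EI over a user-supplied finite candidate set per dataset; and validation_error is a user-supplied callable that trains A and returns the validation error.

```python
import numpy as np
# Reuses joint_point, gp_posterior and expected_improvement from the earlier sketches.

def within_dataset_ranks(O):
    """Crude stand-in for the ranking surrogate target: replace each validation
    error by its rank within its own dataset, rescaled to [0, 1]."""
    targets = np.zeros(len(O))
    for D in set(d for d, _, _ in O):
        idx = [i for i, (d, _, _) in enumerate(O) if d == D]
        order = np.argsort([O[i][2] for i in idx])      # sort by validation error
        for rank, pos in enumerate(order):
            targets[idx[pos]] = rank / max(len(idx) - 1, 1)
    return targets

def scot(datasets, features, candidates, validation_error, O, T=10):
    """Schematic SCoT loop (Figure 4): one EI maximization per dataset per round.

    datasets:         list of dataset identifiers D_1, ..., D_M
    features:         dict mapping each identifier to its feature vector in D
    candidates:       dict mapping each identifier to an array of hyperparameter vectors
    validation_error: callable (dataset_id, x) -> validation error of A(D, x)
    O:                list of (dataset_id, x, error) triplets from past experiments
    """
    for _ in range(T):
        for D in datasets:
            y = within_dataset_ranks(O)                           # Steps 3-4: rank-based targets
            Z = np.array([joint_point(features[d], x) for d, x, _ in O])
            Ztest = np.array([joint_point(features[D], x) for x in candidates[D]])
            mean, var = gp_posterior(Z, y, Ztest)                 # Step 5: GP regression
            ei = expected_improvement(mean, var, y.min())         # Step 6: EI over H for D
            x_star = candidates[D][int(np.argmax(ei))]
            R_star = validation_error(D, x_star)                  # Step 7: run the learner
            O.append((D, x_star, R_star))                         # Step 8: update O
    return O
```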
4. Experiments

To assess the performance of SCoT as a practical hyperparameter tuning algorithm, we now present two experiments of tuning classification algorithms. In the first experiment we apply multi-class AdaBoost.MH with two hyperparameters to a large number of heterogeneous UCI problems. In the two-dimensional hyperparameter space we can precompute the validation errors on a finite grid, and we can perform a meta-cross-validation over datasets to assess whether a model tuned on a set of training problems can aid the hyperparameter optimization on a hold-out set of test problems. In the second experiment we tune five hyperparameters of single-hidden-layer neural nets on binary classification problems. In this case precomputing the errors on a grid is impossible. Furthermore, to make the point that problem similarity can help collaborative tuning, we use 20 datasets obtained by perturbing the same five datasets, four times each.

For the sake of clarity, we henceforth call instance a point to be classified, and we call attribute one of the elements of the input vector. The term feature is reserved for the numerical descriptors of the whole classification problem, that is, a point in D is described by a vector of features. Finally, note that all GP hyperparameters were tuned by maximizing the marginal likelihood (Rasmussen & Williams, 2006).

4.1. A case study on AdaBoost

We describe here a setup that mimics an experienced AdaBoost user facing multi-class classification problems. We used the implementation of AdaBoost.MH available at multiboost.org (Benbouzid et al., 2012).

Figure 5. Optimal number of product terms versus the problem feature log(n/d): scatter of log(m) against log(n/d), with linear fit y = 0.17x + 0.94.

4.1.1. Setup

We downloaded 29 multi-class classification problems from WEKA. We converted nominal attributes to numerical (binary) values using a one-hot encoding. We split each set 80-20% into training and validation sets. We ran AdaBoost.MH with decision products as weak learners (Kégl & Busa-Fekete, 2009), so the two hyperparameters to tune were the number of iterations T and the number of product terms m. Our target measure R(AdaBoost(D, (m, T))) was the classical multi-class 0-1 error. In this setup, because H has dimension 2, we decided to precompute all validation errors on a 12 × 9 grid with values m ∈ {2, 3, 4, 5, 7, 10, 15, 20, 30} and T ∈ {2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000}, giving us 108 trained models for each dataset. The inputs of the GP kernel were the logarithms of the hyperparameters, rescaled to belong to [0, 1]. Figure 1 shows two example error surfaces and a smooth latent ranker learned by the GP.

To embed the problems into a Euclidean space D in which a GP can be used with a classical squared exponential kernel, we first extracted three simple measures from each dataset, the number of training instances n, the number of classes K, and the number of attributes d, and defined three features K, log d, and log(n/d) that we found indicative of the value of the best hyperparameters. For example, Figure 5 shows that log(n/d) is correlated with the optimal number of product terms m. This makes sense: the more instances we have compared to the number of attributes, the more complex the optimal classifier can be.

In order to describe some statistical properties of each dataset, a fourth feature ρ was derived through PCA: we extracted the first d0 principal components that explained 95% of the variance of each dataset, and divided d0 by the number of attributes d to obtain ρ. Table 1 summarizes the statistics of the features, and Figure 6 visualizes the datasets in the feature space projected onto the first two principal components. The first two components explain 73.5% of the variance, which means that the four-dimensional distribution is not degenerate but not completely uniform either. Finally, the features were also rescaled to belong to [0, 1].

feature      min     1st q.   med     3rd q.   max
K            2       2        5       10       26
log d        1.61    2.3      2.94    4.01     5.49
log(n/d)     1       3.03     4.04    4.57     7.05
ρ            0.14    0.5      0.67    0.75     0.86

Table 1. Minimum value, first quartile, median, third quartile and maximum value of the features of the problems.

Figure 6. Projection of the problems onto the first two principal components of the feature space. For the sake of visibility some problem names are omitted.
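The dataset features of Section 4.1.1 are straightforward to compute from the raw data; the sketch below is illustrative (it uses scikit-learn's PCA for convenience, which the paper does not prescribe) and returns K, log d, log(n/d) and ρ for one problem. The subsequent rescaling of each feature to [0, 1] across the collection is not shown.

```python
import numpy as np
from sklearn.decomposition import PCA

def dataset_features(X, y):
    """Features of a classification problem: K, log d, log(n/d) and rho,
    the fraction of attributes needed to explain 95% of the variance."""
    n, d = X.shape
    K = len(np.unique(y))
    cumvar = np.cumsum(PCA().fit(X).explained_variance_ratio_)
    d0 = int(np.searchsorted(cumvar, 0.95) + 1)   # first d0 components reach 95%
    rho = d0 / d
    return np.array([K, np.log(d), np.log(n / d), rho])
```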
4.1.2. Different tuning strategies

We designed five experiments, each one mimicking a different user behavior. In all experiments, we used a 5-fold cross-validation over the 29 datasets. To avoid confusion with training and testing in each problem, we call meta-train and meta-test, respectively, the train and test sets made out of problems.

Global default. This experiment mimics the tuning strategy of a non-expert in machine learning who chooses the same hyperparameter vector for every problem. Although this sounds unreasonable, a typical WEKA user often runs classification algorithms using the default hyperparameters. Here we assume that the designer of the classification algorithm is an expert, so he sets the default hyperparameters to those that perform the best on average. Formally, we select the hyperparameter vector that minimizes the average error over the problems in the meta-train sets.

Collaborative default. This experiment mimics the tuning strategy of a more experienced user, who builds a model according to his or her experience with the meta-train set, but runs out of time before the conference deadline. She thus uses the surrogate model to predict one vector of hyperparameters for each problem, but does no tuning on the problems. In practice, this strategy is similar to a single iteration of the outer loop of SCoT (Figure 4) and consists in (i) taking the surrogate output by SVMrank, (ii) regressing it with a GP, and (iii) taking the hyperparameter with the best posterior mean on each meta-test problem.

Separate surrogate tuning. This experiment mimics a user who has time for a sequential algorithm, but does not learn either from the meta-train set or from the knowledge acquired in tuning on other meta-test problems simultaneously. State-of-the-art surrogate tuning methods are of this kind. This experiment consists in running a different SMBO algorithm on each meta-test problem, thus using an independent two-dimensional GP for each meta-test problem. To avoid numerical problems, we start with four iterations picking random hyperparameters.

SCoT (surrogate collaborative tuning). This experiment mimics an expert user, who tries to learn a relation between features, hyperparameters, and quality from the meta-train problems he encountered in the past. Here we combine the static model and the surrogate technique, and we add collaborative tuning as described in Section 3 and Figure 4.

Random search. This is the baseline experiment that samples points from predefined independent priors on each hyperparameter (uniform in Section 4.1, defined in Table 2 for Section 4.2). This is probably the most common strategy when exhaustive grid search is out of computational reach. Bergstra & Bengio (2012) demonstrated that random search is competitive in hyperparameter tuning for deep belief networks.

Note that the three last strategies were used without replacement. For surrogate-based methods, when EI is maximized at a point that has already been evaluated in the past, we replace it by the unevaluated point in the grid with the highest EI. In practice, all strategies should then converge to the best point when the number of evaluations reaches the size of the grid.

4.1.3. Results

We present results in terms of average generalization error in the supplementary material, obtained through a 5-fold cross-validation on the set of datasets. Comparing the methods in terms of average generalization error is however questionable for the exact same reason that made us use a ranking-based surrogate: the classification datasets may not be commensurable in terms of test error rate (Demšar, 2006; Lacoste et al., 2012). Hence we computed an average rank score as follows: in each iteration and for each problem, the results of the different strategies were ranked with the so-called fractional ranking ("1 2.5 2.5 4" ranking), ties being rewarded by the average of their ordinal rankings. The ranking points of all meta-test problems were then averaged to get the final score. The average rank of each strategy can then be plotted against the number of trials. Note that the average rank of a single method depends on the results of the others, which means that the score of the non-tuning methods (global and collaborative defaults) will increase while their quality values remain constant. Also, note that the lower the average rank score, the better the method.

Figure 7 shows the results in terms of ranking. The first observation is that the collaborative default achieves a better score than the global default, validating our hypothesis that past experience can help to find better hyperparameters. Secondly, separate tuning beats random search, confirming the results of Bergstra et al. (2011). Finally, SCoT seems to robustly outperform all other methods, which means that combining surrogate optimization and collaborative tuning gets the best of both worlds. Note that as the number of iterations grows, all three tuning methods start to saturate the search space, so their average ranks converge to the same value.

Figure 7. The average rank of the different methods (global default, collaborative default, random search, separate tuning, SCoT) as a function of the number of trials. Collaborative methods start from the value 2.62 and all non-collaborative methods start from 3.26; the first four iterations of separate tuning are the same as random search, see text for details.
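The fractional ("1 2.5 2.5 4") ranking used above is what scipy's rankdata computes by default; the following small sketch of the average-rank score is illustrative, not the authors' evaluation code.

```python
import numpy as np
from scipy.stats import rankdata

def average_rank(best_errors_so_far):
    """best_errors_so_far: array of shape (n_problems, n_strategies) holding, for one
    given number of trials, the best validation error reached so far by each strategy
    on each problem. Returns the average fractional rank of each strategy."""
    ranks = np.vstack([rankdata(row) for row in best_errors_so_far])  # "1 2.5 2.5 4" ranking
    return ranks.mean(axis=0)

# Example with 2 problems and 4 strategies; ties share the average of their ranks.
scores = np.array([[0.10, 0.12, 0.12, 0.20],
                   [0.30, 0.25, 0.40, 0.25]])
print(average_rank(scores))   # [2.   2.   3.25 2.75]
```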
4.2. A controlled experiment with MLPs

In this experiment we tested SCoT on tuning five hyperparameters of multi-layer perceptrons (MLPs) with a single hidden layer. The hyperparameters and their prior distributions are summarized in Table 2. The goal of the experiment was to test SCoT on a setup where exhaustive grid search is not possible at all due to the higher number of hyperparameters. This experiment also underlines the cotuning abilities of SCoT.

Hyperparameter             Distribution
learning rate              log U(10^-3, 10)
ℓ1 penalty                 log U(10^-7, 10^-4)
ℓ2 penalty                 log U(10^-7, 10^-4)
batch size                 uniform on {1, ..., 30}
number of hidden units     uniform on {10, ..., 100}

Table 2. Hyperparameters of a single-layer MLP. The distributions used in the random strategy are given in the second column.

4.2.1. Setup

MLPs are widely used in practice; however, tuning their larger number of hyperparameters is a more important issue than in boosting and SVMs, so automatic tuning methods can have a greater practical impact. To demonstrate that SCoT learns a latent structure when such a structure exists, we picked five different binary classification problems out of the collection presented in Section 4.1 and perturbed them in three different ways: (i) we suppressed one attribute, (ii) we halved the number of samples, and (iii) both. In total, we tuned MLPs simultaneously on 20 datasets.

The dataset space D was built as follows. Since all classification problems were binary, the number of classes K was not relevant here. Besides the three features log d, log(n/d), and ρ described in Section 4.1.1, we used two more statistical descriptors: the skewness and the kurtosis of each dataset projected onto its first principal component. In the end, H × D has ten dimensions. In the supplementary material, we provide an illustration of a PCA in D of our 20 datasets.

Unlike in the case study on AdaBoost of Section 4.1, we did not perform meta-cross-validation; rather, we acted as if we used SCoT to tune neural networks simultaneously on these 20 datasets. Methods that build models (collaborative default, separate tuning, and collaborative tuning in Section 4.1.2) started only after the first 10 random points were evaluated on each dataset. The global default strategy was taken here to be the constant choice of the best hyperparameters on average over these first 10 random points.
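As a concrete reading of Table 2, drawing one configuration for the random-search baseline amounts to sampling each hyperparameter independently from its prior. This is an illustrative sketch, not the authors' code, and the dictionary keys are made-up names.

```python
import numpy as np

def sample_mlp_hyperparameters(rng=None):
    """Draw one MLP configuration from the priors of Table 2."""
    rng = np.random.default_rng() if rng is None else rng
    log_uniform = lambda lo, hi: np.exp(rng.uniform(np.log(lo), np.log(hi)))
    return {
        "learning_rate": log_uniform(1e-3, 10.0),
        "l1_penalty":    log_uniform(1e-7, 1e-4),
        "l2_penalty":    log_uniform(1e-7, 1e-4),
        "batch_size":    int(rng.integers(1, 31)),        # uniform on {1, ..., 30}
        "hidden_units":  int(rng.integers(10, 101)),      # uniform on {10, ..., 100}
    }
```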
4.2.2. Results

As in Section 4.1.3, results based on generalization error are deferred to the supplementary material. Figure 8 depicts the results in terms of average ranking, as in Section 4.1. Soon after the initial 10 training points, SCoT clearly outperforms all other methods. Separate tuning comes second, but sharing information among problems obviously helps SMBO on this controlled benchmark.

Figure 8. Results on the MLP benchmark (global default, collaborative default, random search, separate tuning, SCoT) in terms of the average "fractional" rankings (see Section 4.1.3).

5. Conclusion

We presented SCoT, a surrogate-based optimization algorithm for hyperparameter tuning. It builds on previous approaches but adds a memory of past experiments and a collaborative tuning ability. Developing these two points led to the combination of surrogate-based optimization and ranking, resulting in a novel Bayesian optimization algorithm whose target moves with the iteration number. We demonstrated SCoT in two experiments on AdaBoost and MLPs: it outperformed a variety of common tuning strategies, including vanilla surrogate-based optimization.

In order to eventually yield an automatic hyperparameter tuner, some methodological and theoretical efforts are needed. On the methodological side, a comprehensive database of learning problems should be compiled, progressively closing the gap between benchmark problems and real-world applications. On the theoretical side, some effort in feature construction is needed, for instance to define compound parameters more amenable to building the surrogate quality model. Such augmentations of the training set of SCoT will require careful choices in the methods, maybe leading to lightening the computational burden with cheaper models such as approximate GPs. Finally, we feel that there is room for improvement in designing asynchronous strategies for the collaborative tuning of SCoT, which currently tunes problems in a synchronous way, spending one point on each problem in turn, while it should intuitively spend more evaluations on difficult problems.

Acknowledgments

This work was supported by the ANR-2010-COSI-002 grant of the French National Research Agency.
References

Benbouzid, D., Busa-Fekete, R., Casagrande, N., Collin, F.-D., and Kégl, B. MultiBoost: a multi-purpose boosting package. Journal of Machine Learning Research, 13:549–553, 2012.

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012.

Bergstra, J., Bardenet, R., Kégl, B., and Bengio, Y. Algorithms for hyperparameter optimization. In Advances in Neural Information Processing Systems (NIPS), volume 24. The MIT Press, 2011.

Brendel, M. and Schoenauer, M. Instance-based parameter tuning for evolutionary AI planning. In Proceedings of the 20th Genetic and Evolutionary Computation Conference, 2011.

Chu, W. and Ghahramani, Z. Preference learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, pp. 137–144, 2005.

Coates, A., Lee, H., and Ng, A. Y. An analysis of single-layer networks in unsupervised feature learning. In 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Demšar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.

Hutter, F. Automated Configuration of Algorithms for Solving Hard Computational Problems. PhD thesis, University of British Columbia, 2009.

Hutter, F., Hoos, H. H., Leyton-Brown, K., and Stützle, T. ParamILS: an automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36:267–306, October 2009.

Joachims, T. Optimizing search engines using clickthrough data. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2002.

Jones, D. R. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21:345–383, 2001.

Kégl, B. and Busa-Fekete, R. Boosting products of base classifiers. In International Conference on Machine Learning, volume 26, pp. 497–504, Montreal, Canada, 2009.

Lacoste, A., Laviolette, F., and Marchand, M. Bayesian comparison of machine learning algorithms on single and multiple datasets. In 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

Lizotte, D. Practical Bayesian Optimization. PhD thesis, University of Alberta, 2008.

Nannen, V. and Eiben, A. E. Relevance estimation and value calibration of evolutionary algorithm parameters. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 975–980, 2007.

Pinto, N., Doukhan, D., DiCarlo, J. J., and Cox, D. D. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Computational Biology, 5(11), 2009.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.

Schapire, R. E. and Singer, Y. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.

Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, volume 25, 2012.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, 2010.

Thornton, C., Hutter, F., Hoos, H. H., and Leyton-Brown, K. Auto-WEKA: Automated selection and hyper-parameter optimization of classification algorithms. Technical report, http://arxiv.org/abs/1208.3719, 2012.

Villemonteix, J., Vazquez, E., and Walter, E. An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 2006.