DEHB: Evolutionary Hyperband for Scalable, Robust and Efficient Hyperparameter Optimization
Noor Awad¹, Neeratyoy Mallik¹, Frank Hutter¹,²
¹ Department of Computer Science, University of Freiburg, Germany
² Bosch Center for Artificial Intelligence, Renningen, Germany
{awad, mallik, fh}@cs.uni-freiburg.de
Proceedings of IJCAI-21. arXiv:2105.09821v2 [cs.LG] 21 Oct 2021

Abstract

Modern machine learning algorithms crucially rely on several design decisions to achieve strong performance, making the problem of Hyperparameter Optimization (HPO) more important than ever. Here, we combine the advantages of the popular bandit-based HPO method Hyperband (HB) and the evolutionary search approach of Differential Evolution (DE) to yield a new HPO method which we call DEHB. Comprehensive results on a very broad range of HPO problems, as well as a wide range of tabular benchmarks from neural architecture search, demonstrate that DEHB achieves strong performance far more robustly than all previous HPO methods we are aware of, especially for high-dimensional problems with discrete input dimensions. For example, DEHB is up to 1000× faster than random search. It is also efficient in computational time, conceptually simple and easy to implement, positioning it well to become a new default HPO method.

1 Introduction

Many algorithms in artificial intelligence rely crucially on good settings of their hyperparameters to achieve strong performance. This is particularly true for deep learning [Henderson et al., 2018; Melis et al., 2018], where dozens of hyperparameters concerning both the neural architecture and the optimization & regularization pipeline need to be instantiated. At the same time, modern neural networks continue to get larger and more computationally expensive, making the need for efficient hyperparameter optimization (HPO) ever more important.

We believe that a practical, general HPO method must fulfill many desiderata, including: (1) strong anytime performance, (2) strong final performance with a large budget, (3) effective use of parallel resources, (4) scalability w.r.t. the dimensionality and (5) robustness & flexibility. These desiderata drove the development of BOHB [Falkner et al., 2018], which satisfied them by combining the best features of Bayesian optimization via Tree Parzen estimates (TPE) [Bergstra et al., 2011] (in particular, strong final performance), and the many advantages of bandit-based HPO via Hyperband [Li et al., 2017]. While BOHB is among the best general-purpose HPO methods we are aware of, it still has problems with optimizing discrete dimensions and does not scale as well to high dimensions as one would wish. Therefore, it does not work well on high-dimensional HPO problems with discrete dimensions and also has problems with tabular neural architecture search (NAS) benchmarks (which can be tackled as high-dimensional discrete-valued HPO benchmarks, an approach followed, e.g., by regularized evolution (RE) [Real et al., 2019]).

The main contribution of this paper is to further improve upon BOHB to devise an effective general HPO method, which we dub DEHB. DEHB is based on a combination of the evolutionary optimization method of differential evolution (DE [Storn and Price, 1997]) and Hyperband and has several useful properties:

1. DEHB fulfills all the desiderata of a good HPO optimizer stated above, and in particular achieves more robust strong final performance than BOHB, especially for high-dimensional and discrete-valued problems.
2. DEHB is conceptually simple and can thus be easily re-implemented in different frameworks.
3. DEHB is computationally cheap, not incurring the overhead typical of most BO methods.
4. DEHB effectively takes advantage of parallel resources.

After discussing related work (Section 2) and background on DE and Hyperband (Section 3), Section 4 describes our new DEHB method in detail. Section 5 then presents comprehensive experiments on artificial toy functions, surrogate benchmarks, Bayesian neural networks, reinforcement learning, and 13 different tabular neural architecture search benchmarks, demonstrating that DEHB is more effective and robust than a wide range of other HPO methods, and in particular up to 1000× faster than random search (Figure 7) and up to 32× faster than BOHB (Figure 11) on HPO problems; on toy functions, these speedup factors even reached 33 440× and 149×, respectively (Figure 6).
2 Related Work

HPO as a black-box optimization problem can be broadly tackled using two families of methods: model-free methods, such as evolutionary algorithms, and model-based Bayesian optimization methods. Evolutionary Algorithms (EAs) are model-free population-based methods which generally include a method of initializing a population; mutation, crossover, selection operations; and a notion of fitness. EAs have been used for black-box optimization in an HPO setting since the 1980s [Grefenstette, 1986]. They have also been popular for designing architectures of deep neural networks [Angeline et al., 1994; Xie and Yuille, 2017; Real et al., 2017; Liu et al., 2017]; recently, Regularized Evolution (RE) [Real et al., 2019] achieved state-of-the-art results on ImageNet.

Bayesian optimization (BO) uses a probabilistic model based on the already observed data points to model the objective function and to trade off exploration and exploitation. The most commonly used probabilistic models in BO are Gaussian processes (GPs) since they obtain well-calibrated and smooth uncertainty estimates [Snoek et al., 2012]. However, GP-based models have high complexity, do not natively scale well to high dimensions and do not apply to complex spaces without prior knowledge; alternatives include tree-based methods [Bergstra et al., 2011; Hutter et al., 2011] and Bayesian neural networks [Springenberg et al., 2016].

Recent so-called multi-fidelity methods exploit cheap approximations of the objective function to speed up the optimization [Liu et al., 2016; Wang et al., 2017]. Multi-fidelity optimization is also popular in BO, with Fabolas [Klein et al., 2016] and Dragonfly [Kandasamy et al., 2020] being GP-based examples. The popular method BOHB [Falkner et al., 2018], which combines BO and the bandit-based approach Hyperband [Li et al., 2017], has been shown to be a strong off-the-shelf HPO method and to the best of our knowledge is the best previous off-the-shelf multi-fidelity optimizer.

3 Background

3.1 Differential Evolution (DE)

In each generation $g$, DE uses an evolutionary search based on difference vectors to generate new candidate solutions. DE is a population-based EA which uses three basic iterative steps (mutation, crossover and selection). At the beginning of the search on a $D$-dimensional problem, we initialize a population of $N$ individuals $x_{i,g} = (x^1_{i,g}, x^2_{i,g}, \ldots, x^D_{i,g})$ randomly within the search range of the problem being solved. Each individual $x_{i,g}$ is evaluated by computing its corresponding objective function value. Then the mutation operation generates a new offspring for each individual. The canonical DE uses a mutation strategy called rand/1, which selects three random parents $x_{r_1}, x_{r_2}, x_{r_3}$ to generate a new mutant vector $v_{i,g}$ for each $x_{i,g}$ in the population as shown in Eq. 1, where $F$ is a scaling factor parameter that takes a value within the range $(0, 1]$.

$$v_{i,g} = x_{r_1,g} + F \cdot (x_{r_2,g} - x_{r_3,g}). \tag{1}$$

The crossover operation then combines each individual $x_{i,g}$ and its corresponding mutant vector $v_{i,g}$ to generate the final offspring/child $u_{i,g}$. The canonical DE uses a simple binomial crossover to select values from $v_{i,g}$ with a probability $p$ (called crossover rate) and from $x_{i,g}$ otherwise. For the members $x_{i,g+1}$ of the next generation, DE then uses the better of $x_{i,g}$ and $u_{i,g}$. More details on DE can be found in Appendix A.

3.2 Successive Halving (SH) and Hyperband (HB)

Successive Halving (SH) [Jamieson and Talwalkar, 2016] is a simple yet effective multi-fidelity optimization method that exploits the fact that, for many problems, low-cost approximations of the expensive blackbox functions exist, which can be used to rule out poor parts of the search space at little computational cost. Higher-cost approximations are only used for a small fraction of the configurations to be evaluated. Specifically, an iteration of SH starts by sampling $N$ configurations uniformly at random, evaluating them at the lowest-cost approximation (the so-called lowest budget), and forwarding the top $1/\eta$ fraction of them to the next budget (function evaluations at which are expected to be roughly $\eta$ times more expensive). This process is repeated until the highest budget, used by the expensive original blackbox function, is reached. Once the runs on the highest budget are complete, the current SH iteration ends, and the next iteration starts with the lowest budget. We call each such fixed sequence of evaluations from lowest to highest budget an SH bracket. While SH is often very effective, it is not guaranteed to converge to the optimal configuration even with infinite resources, because it can drop poorly-performing configurations at low budgets that actually might be the best with the highest budget.

Hyperband (HB) [Li et al., 2017] solves this problem by hedging its bets across different instantiations of SH with successively larger lowest budgets, thereby being provably at most a constant factor slower than random search. In particular, this procedure also makes it possible to find configurations that are strong for higher budgets but would have been eliminated for lower budgets. Algorithm 2 in Appendix B shows the pseudocode for HB with the SH subroutine. One iteration of HB (also called HB bracket) can be viewed as a sequence of SH brackets with different starting budgets and different numbers of configurations for each SH bracket. The precise budgets and number of configurations per budget are determined by HB given its 3 parameters: minimum budget, maximum budget, and $\eta$.

The main advantages of HB are its simplicity, theoretical guarantees, and strong anytime performance compared to optimization methods operating on the full budget. However, HB can perform worse than BO and DE for longer runs since it only selects configurations based on random sampling and does not learn from previously sampled configurations.

4 DEHB

We design DEHB to satisfy all the desiderata described in the introduction (Section 1). DEHB inherits several advantages from HB to satisfy some of these desiderata, including its strong anytime performance, scalability and flexibility. From the DE component, it inherits robustness, simplicity, and computational efficiency. We explain DEHB in detail in the remainder of this section; full pseudocode can be found in Algorithm 3 in Appendix C.
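As an illustration of the two building blocks reviewed in Section 3, the following is a minimal Python sketch of one canonical DE generation (rand/1 mutation as in Eq. 1, binomial crossover, greedy selection) and of the rung schedule of a single SH bracket. It is not the authors' implementation: the function names `de_generation` and `sh_schedule` are hypothetical, the search space is assumed to be the unit hypercube with out-of-range mutants clipped to [0, 1], the three parents are assumed to be distinct and different from the target, and one forced crossover index is used, as is common in DE implementations.

```python
import random


def de_generation(population, fitness, objective, F=0.5, p=0.5, rng=random):
    """One generation of canonical DE; lower objective values are better.

    population: list of at least 4 vectors (lists of floats in [0, 1]).
    fitness:    objective value of each vector, in the same order.
    Returns the next population and its fitness values.
    """
    n, dim = len(population), len(population[0])
    new_pop, new_fit = [], []
    for i, x in enumerate(population):
        # rand/1 mutation (Eq. 1): three distinct parents, none equal to i.
        r1, r2, r3 = rng.sample([j for j in range(n) if j != i], 3)
        v = [population[r1][d] + F * (population[r2][d] - population[r3][d])
             for d in range(dim)]
        # Assumption: mutants leaving the unit hypercube are clipped.
        v = [min(1.0, max(0.0, vd)) for vd in v]
        # Binomial crossover: take v[d] with probability p, else x[d].
        # The forced index ensures at least one coordinate comes from v.
        forced = rng.randrange(dim)
        u = [v[d] if (rng.random() < p or d == forced) else x[d]
             for d in range(dim)]
        # Selection: the better of parent x and child u survives.
        fu = objective(u)
        if fu <= fitness[i]:
            new_pop.append(u)
            new_fit.append(fu)
        else:
            new_pop.append(x)
            new_fit.append(fitness[i])
    return new_pop, new_fit


def sh_schedule(n_configs, min_budget, max_budget, eta):
    """(number of configurations, budget) for each rung of one SH bracket."""
    rungs, n, budget = [], n_configs, min_budget
    while budget <= max_budget and n >= 1:
        rungs.append((n, budget))
        # Keep the top 1/eta fraction and raise the budget by a factor eta.
        n, budget = n // eta, budget * eta
    return rungs


print(sh_schedule(27, 1, 27, 3))  # [(27, 1), (9, 3), (3, 9), (1, 27)]
```

With 27 configurations, budgets from 1 to 27 and η = 3, the schedule is 27, 9, 3 and 1 configurations at budgets 1, 3, 9 and 27. Because selection is greedy, the fitness of each population slot can never get worse from one generation to the next.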
Figure 1: Internals of a DEHB iteration showing information flow across fidelities (top-down), and how each subpopulation is updated in each DEHB iteration (left-right).

4.1 High-Level Overview

A key design principle of DEHB is to share information across the runs it executes at various budgets. DEHB maintains a subpopulation for each of the budget levels, where the population size for each subpopulation is assigned as the maximum number of function evaluations HB allocates for the corresponding budget.

We borrow nomenclature from HB and call the HB iterations that DEHB uses DEHB iterations. Figure 1 illustrates one such iteration, where minimum budget, maximum budget, and η are 1, 27, and 3, respectively. The topmost sphere for SH Bracket 1 is the first step, where 27 configurations are sampled uniformly at random and evaluated at the lowest budget 1. These evaluated configurations now form the DE subpopulation associated with budget 1. The dotted arrow pointing downwards indicates that the top-9 configurations (27/η) are promoted to be evaluated on the next higher budget 3 to create the DE subpopulation associated with budget 3, and so on until the highest budget. This progressive increase of the budget by η and decrease of the number of configurations evaluated by η is simply the vanilla SH. Indeed, each SH bracket for this first DEHB iteration is basically executing vanilla SH, starting from different minimum budgets, just like in HB.

One difference from vanilla SH is that random sampling of configurations occurs only once: in the first step of the first SH bracket of the first DEHB iteration. Every subsequent SH bracket begins by reusing the subpopulation updated in the previous SH bracket, and carrying out a DE evolution (detailed in Section 4.2). For example, for SH bracket 2 in Figure 1, the subpopulation of 9 configurations for budget 3 (topmost sphere) is propagated from SH bracket 1 and undergoes evolution. The top 3 configurations (9/η) then affect the population for the next higher budget 9 of SH bracket 2. Specifically, these will be used as the so-called parent pool for that higher budget, using the modified DE evolution to be discussed in Section 4.2. The end of SH Bracket 4 marks the end of this DEHB iteration. We dub DEHB's first iteration its initialization iteration. At the end of this iteration, all DE subpopulations associated with the higher budgets are seeded with configurations that performed well in the lower budgets. In subsequent SH brackets, no random sampling occurs anymore, and the search runs separate DE evolutions at different budget levels, where information flows from the subpopulations at lower budgets to those at higher budgets through the modified DE mutation (Fig. 3).

4.2 Modified Successive Halving using DE Evolution

We now discuss the deviations from vanilla SH by elaborating on the design of an SH bracket inside DEHB, highlighted with a box in Figure 1 (SH Bracket 1). In DEHB, the top-performing configurations from a lower budget are not simply promoted and evaluated on a higher budget (except for the Initialization SH bracket). Rather, in DEHB, the top-performing configurations are collected in a Parent Pool (Figure 2). This pool is responsible for transfer of information from a lower budget to the next higher budget, but not by directly suggesting best configurations from the lower budget for re-evaluation at a higher budget. Instead, the parent pool represents a good performing region w.r.t. the lower budget, from which parents can be sampled for mutation. Figure 3b demonstrates how a parent pool contributes in a DE evolution in DEHB. Unlike in vanilla DE (Figure 3a), in DEHB, the mutants involved in DE evolution are extracted from the parent pool instead of the population itself. This allows the evolution to incorporate and combine information from the current budget, and also from the decoupled search happening on the lower budget. The selection step as shown in Figure 3 is responsible for updating the current subpopulation if the new suggested configuration is better. If not, the existing configuration is retained in the subpopulation. This guards against cases where performance across budget levels is not correlated and good configurations from lower budgets do not improve higher budget scores. However, search on the higher budget can still progress, as the first step of every SH bracket performs vanilla DE evolution (there is no parent pool to receive information from). Thereby, search at the required budget level progresses even if lower budgets are not informative.

Figure 2: Modified SH routine under DEHB

Figure 3: Modified DE evolution under DEHB

Additionally, we also construct a global population pool consisting of configurations from all the subpopulations. This pool does not undergo any evolution and serves as the parent pool in the edge case where the parent pool is smaller than the minimum number of individuals required for the mutation step. For the example in Figure 2, under the rand1 mutation strategy (which requires three parents), we see that for the highest budget, only one configuration (3/η) is included from the previous budget. In such a scenario, the additional two required parents are sampled from the global population pool.

4.3 DEHB efficiency and parallelization

As mentioned previously, DEHB carries out separate DE searches at each budget level. Moreover, the DE operations involved in evolving a configuration are constant in operations and time. Therefore, DEHB's runtime overhead does not grow over time, even as the number of performed function evaluations increases; this is in stark contrast to model-based methods, whose time complexity is often cubic in the number of performed function evaluations. Indeed, Figure 4 demonstrates that, for a tabular benchmark with negligible cost for function evaluations, DEHB is almost 2 orders of magnitude faster than BOHB at performing 13336 function evaluations. GP-based Bayesian optimization tools would require approximations to even fit a single model with this number of function evaluations.

Figure 4: Runtime comparison for DEHB and BOHB based on a single run on the Cifar-10 benchmark from NAS-Bench-201. The x-axis shows the actual cumulative wall-clock time spent by the algorithm (optimization time) in between the function evaluations.

We also briefly describe a parallel version of DEHB (see Appendix C.3 for details of its design). Since DEHB can be viewed as a sequence of predetermined SH brackets, the SH brackets can be asynchronously distributed over free workers. A central DEHB Orchestrator keeps a single copy of all DE subpopulations, allowing for asynchronous, immediate DE evolution updates. Figure 5 illustrates that this parallel version achieves linear speedups for similar final performance.

Figure 5: Results for the OpenML Letter surrogate benchmark where n represents the number of workers that were used for each DEHB run. Each trace is averaged over 10 runs.

5 Experiments

We now comprehensively evaluate DEHB, illustrating that it is more robust and efficient than any other HPO method we are aware of. To keep comparisons fair and reproducible, we use a broad collection of publicly-available HPO and NAS benchmarks: all HPO benchmarks that were used to demonstrate the strength of BOHB [Falkner et al., 2018]¹ and also a broad collection of 13 recent tabular NAS benchmarks represented as HPO problems [Awad et al., 2020].

¹ We leave out the 2-dimensional SVM surrogate benchmarks since all multi-fidelity algorithms performed similarly for this easy task, without any discernible difference.

In this section, to avoid cluttered plots we present a focused comparison of DEHB with BOHB, the best previous off-the-shelf multi-fidelity HPO method we are aware of, which has in turn outperformed a broad range of competitors (GP-BO, TPE, SMAC, HB, Fabolas, MTBO, and HB-LCNet) on these benchmarks [Falkner et al., 2018]. For reference, we also include the obligatory random search (RS) baseline in these plots, showing it to be clearly dominated, with up to 1000-fold speedups. We also provide a comparison against a broader range of methods at the end of this section (see Figure 13 and Table 1), with a full comparison in Appendix D. We also compare to the recent GP-based multi-fidelity BO tool Dragonfly in Appendix D.7. Details for the hyperparameter values of the used algorithms can be found in Appendix D.1.

We use the same parameter settings for mutation factor F = 0.5 and crossover rate p = 0.5 for both DE and DEHB. The population size for DEHB is not user-defined but set by its internal Hyperband component, while we set it to 20 for DE following [Awad et al., 2020]. Unless specified otherwise, we report results from 50 runs for all algorithms, plotting the validation regret (the difference of the validation score from the global best) over the cumulative cost incurred by the function evaluations, and ignoring the optimizers' overhead in order to not give DEHB what could be seen as an unfair
advantage.³ We also show the speedups that DEHB achieves compared to RS and BOHB, where this is possible without adding clutter.

³ Shaded bands in plots represent the standard error of the mean.

5.1 Artificial Toy Function: Stochastic Counting Ones

This toy benchmark by Falkner et al. [2018] is useful to assess scaling behavior and ability to handle binary dimensions. The goal is to minimize the following objective function:

$$f(x) = -\left( \sum_{x \in X_{cat}} x + \sum_{x \in X_{cont}} \mathbb{E}_b\left[ B_{p=x} \right] \right),$$

where the sum of the categorical variables ($x_i \in \{0, 1\}$) represents the standard discrete counting ones problem. The continuous variables ($x_j \in [0, 1]$) represent the stochastic component, with the budget $b$ controlling the noise. The budget here represents the number of samples used to estimate the mean of the Bernoulli distribution ($B$) with parameters $x_j$. Following Falkner et al. [2018], we run 4 sets of experiments with $N_{cont} = N_{cat} \in \{4, 8, 16, 32\}$, where $N_{cont} = |X_{cont}|$ and $N_{cat} = |X_{cat}|$, using the same budget spacing and plotting the normalized regret $(f(x) + d)/d$, where $d = N_{cat} + N_{cont}$. Although this is a toy benchmark, it can offer interesting insights since the search space has mixed binary/continuous dimensions, which DEHB handles well (refer to Appendix C.2 for more details). In Figure 6, we consider the 64-dimensional space $N_{cat} = N_{cont} = 32$; results for the lower dimensions can be found in Appendix D.2. Both BOHB and DEHB begin with a set of randomly sampled individuals evaluated on the lowest budget. It is therefore unsurprising that in Figure 6 (and in other experiments too), these two algorithms follow a similar optimization trace at the beginning of the search. Given the high dimensionality, BOHB requires many more samples to switch to model-based search, which slows its convergence in comparison to the lower dimensional cases ($N_{cont} = N_{cat} \in \{4, 8, 16\}$). In contrast, DEHB's convergence rate is almost agnostic to the increase in dimensionality.

Figure 6: Results for the Stochastic Counting Ones problem in 64 dimensional space with 32 categorical and 32 continuous hyperparameters. All algorithms shown were run for 50 runs.

5.2 Surrogates for Feedforward Neural Networks

In this experiment, we optimize six architectural and training hyperparameters of a feed-forward neural network on six different datasets from OpenML [Vanschoren et al., 2014], using a surrogate benchmark built by Falkner et al. [2018]. The budgets are the training epochs for the neural networks. For all six datasets, we observe a similar pattern of the search trajectory, with DEHB and BOHB having similar anytime performance and DEHB achieving the best final score. An example is given in Figure 7, also showing a 1000-fold speedup over random search; qualitatively similar results for the other 5 datasets are in Appendix D.3.

Figure 7: Results for the OpenML Adult surrogate benchmark for 6 continuous hyperparameters for 50 runs of each algorithm.

5.3 Bayesian Neural Networks

In this benchmark, introduced by Falkner et al. [2018], a two-layer fully-connected Bayesian Neural Network is trained using stochastic gradient Hamiltonian Monte-Carlo sampling (SGHMC) [Chen et al., 2014] with scale adaptation [Springenberg et al., 2016]. The budgets were the number of MCMC steps (500 as minimum; 10000 as maximum). Two regression datasets from UCI [Dua and Graff, 2017] were used for the experiments: Boston Housing and Protein Structure. Figure 8 shows the results (for Boston Housing; the results for Protein Structure are in Appendix D.4). For this extremely noisy benchmark, BOHB and DEHB perform similarly, and both are about 2× faster than RS.

Figure 8: Results for tuning 5 hyperparameters of a Bayesian Neural Network on the Boston Housing regression dataset for 50 runs each.

5.4 Reinforcement Learning

For this benchmark, used by Falkner et al. [2018], a proximal policy optimization (PPO) [Schulman et al., 2017] implementation is parameterized with 7 hyperparameters. PPO is used to learn the cartpole swing-up task from the OpenAI Gym [Brockman et al., 2016] environment. We plot the mean number of episodes needed until convergence for a configuration over actual cumulative wall-clock time in Figure 9. Despite the strong noise in this problem, BOHB and DEHB are able to improve continuously, showing similar performance, and speeding up over random search by roughly 2×.

Figure 9: Results for tuning PPO on OpenAI Gym cartpole environment with 7 hyperparameters. Each algorithm was run for 50 runs.

5.5 NAS Benchmarks

In this series of experiments, we evaluate DEHB on a broad range of NAS benchmarks. We use a total of 13 tabular benchmarks from NAS-Bench-101 [Ying et al., 2019], NAS-Bench-1shot1 [Zela et al., 2020], NAS-Bench-201 [Dong and Yang, 2020] and NAS-HPO-Bench [Klein and Hutter,
2019]. For NAS-Bench-101, we show results on CifarC (a mixed data type encoding of the parameter space [Awad et al., 2020]) in Figure 10; BOHB and DEHB initially perform similarly to RS for this dataset, since there is only little correlation between runs with few epochs (low budgets) and many epochs (high budgets) in NAS-Bench-101. In the end, RS stagnates, BOHB stagnates at a slightly better performance, and DEHB continues to improve. In Figure 11, we report results for ImageNet16-120 from NAS-Bench-201. In this case, DEHB is clearly the best of the methods, quickly converging to a strong solution.

Figure 10: Results for Cifar C from NAS-Bench-101 for a 27-dimensional space (22 continuous + 5 categorical hyperparameters).

Figure 11: Results for ImageNet16-120 from NAS-Bench-201 for 50 runs of each algorithm. The search space contains 6 categorical parameters.

Finally, Figure 12 reports results for the Protein Structure dataset provided in NAS-HPO-Bench. DEHB makes progress faster than BOHB to reach the optimum. The results on other NAS benchmarks are qualitatively similar to these 3 representative benchmarks, and are given in Appendix D.6.

Figure 12: Results for the Protein Structure dataset from NAS-HPO-Bench for 50 runs of each algorithm. The search space contains 9 hyperparameters.

5.6 Results summary

We now compare DEHB to a broader range of baseline algorithms, also including HB, TPE [Bergstra et al., 2011], SMAC [Hutter et al., 2011], regularized evolution (RE) [Real et al., 2019], and DE. Based on the mean validation regret, all algorithms can be ranked for each benchmark, for every second of the estimated wallclock time. Arranging the mean regret per timepoint across all benchmarks (except the Stochastic Counting Ones and the Bayesian Neural Network benchmarks, which do not have runtimes as budgets), we compute the average relative rank over time for each algorithm in Figure 13, where all 8 algorithms were given the mean rank of 4.5 at the beginning. The shaded region clearly indicates that DEHB is the most robust algorithm for this set of benchmarks (discussed further in Appendix D.8). In the end, RE and DE are similarly good, but these blackbox optimization algorithms perform worst for small compute budgets, while DEHB's multi-fidelity aspect makes it robust across compute budgets. In Table 1, we show the average rank of each algorithm based on the final validation regret achieved across all benchmarks (now also including Stochastic Counting Ones and Bayesian Neural Networks; data derived from Table 2 in Appendix D.8). Next to its strong anytime performance, DEHB also yields the best final performance in this comparison, thus emerging as a strong general optimizer that works consistently across a diverse set of benchmarks. Result tables and figures for all benchmarks can be found in Appendix D.

Figure 13: Average rank of the mean validation regret of 50 runs of each algorithm, averaged over the NAS-Bench-101, NAS-Bench-1shot1, NAS-HPO-Bench, NAS-Bench-201, OpenML surrogates, and the Reinforcement Learning benchmarks.

            RS    HB    BOHB  TPE   SMAC  RE    DE    DEHB
Avg. rank   7.46  6.54  4.42  4.35  4.73  3.16  2.96  2.39

Table 1: Mean ranks based on final mean validation regret for all algorithms tested for all benchmarks.

6 Conclusion

We introduced DEHB, a new, general HPO solver, built to perform efficiently and robustly across many different problem domains. As discussed, DEHB satisfies the many requirements of such an HPO solver: strong performance with both short and long compute budgets, robust results, scalability to high dimensions, flexibility to handle mixed data types, parallelizability, and low computational overhead. Our experiments show that DEHB meets these requirements and in particular yields much more robust performance for discrete and high-dimensional problems than BOHB, the previous best overall HPO method we are aware of. Indeed, in our experiments, DEHB was up to 32× faster than BOHB and up to 1000× faster than random search. DEHB does not require advanced software packages, is simple by design, and can easily be implemented across various platforms and languages, allowing for practical adoption. We thus hope that DEHB will become a new default HPO method. Our reference implementation of DEHB is available at https://github.com/automl/DEHB.
Acknowledgements. The authors acknowledge funding by the Robert Bosch GmbH, by the German Federal Ministry of Education and Research (BMBF, grant RenormalizedFlows 01IS19077C), and support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no INST 39/963-1 FUGG.

References

P. J. Angeline, G. M. Saunders, and J. B. Pollack. An evolutionary algorithm that constructs recurrent neural networks. IEEE Transactions on Neural Networks, 5(1):54–65, 1994.
N. Awad, N. Mallik, and F. Hutter. Differential evolution for neural architecture search. In First ICLR Workshop on Neural Architecture Search, 2020.
J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Proc. of NeurIPS'11, pages 2546–2554, 2011.
G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
U. K. Chakraborty. Advances in differential evolution, volume 143. Springer, 2008.
T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014.
S. Das, S. S. Mullick, and P. N. Suganthan. Recent advances in differential evolution–an updated survey. Swarm and Evolutionary Computation, 27:1–30, 2016.
X. Dong and Y. Yang. NAS-Bench-102: Extending the scope of reproducible neural architecture search. arXiv preprint arXiv:2001.00326, 2020.
D. Dua and C. Graff. UCI machine learning repository, 2017.
S. Falkner, A. Klein, and F. Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In Proc. of ICML'18, pages 1437–1446, 2018.
J. J. Grefenstette. Optimization of control parameters for genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 16:341–359, 1986.
P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In Proc. of AAAI'18, 2018.
F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proc. of LION'11, pages 507–523, 2011.
K. Jamieson and A. Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Proc. of AISTATS'16, 2016.
K. Price, R. M. Storn, and J. A. Lampinen. Differential evolution: a practical approach to global optimization. Springer Science & Business Media, 2006.
K. Kandasamy, G. Dasarathy, J. Schneider, and B. Póczos. Multi-fidelity Bayesian optimisation with continuous approximations. arXiv:1703.06240 [stat.ML], 2017.
K. Kandasamy, K. R. Vysyaraju, W. Neiswanger, B. Paria, C. R. Collins, J. Schneider, B. Poczos, and E. P. Xing. Tuning hyperparameters without grad students: Scalable and robust Bayesian optimisation with Dragonfly. Journal of Machine Learning Research, 21(81):1–27, 2020.
A. Klein and F. Hutter. Tabular benchmarks for joint architecture and hyperparameter optimization. arXiv preprint arXiv:1905.04970, 2019.
A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. arXiv:1605.07079 [cs.LG], 2016.
L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. In Proc. of ICLR'17, 2017.
B. Liu, S. Koziel, and Q. Zhang. A multi-fidelity surrogate-model-assisted evolutionary algorithm for computationally expensive optimization problems. Journal of Computational Science, 12:28–37, 2016.
H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017.
H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
G. Melis, C. Dyer, and P. Blunsom. On the state of the art of evaluation in neural language models. In Proc. of ICLR'18, 2018.
H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin. Large-scale evolution of image classifiers. In Proc. of ICML, pages 2902–2911. JMLR.org, 2017.
E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. In Proc. of AAAI, volume 33, pages 4780–4789, 2019.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proc. of NeurIPS'12, pages 2951–2959, 2012.
J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust Bayesian neural networks. In Proc. of NeurIPS, pages 4134–4142, 2016.
R. Storn and K. Price. Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11(4):341–359, 1997.
M. Vallati, F. Hutter, L. Chrpa, and T. L. McCluskey. On the effective configuration of planning domain models. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
J. Vanschoren, J. N. Van Rijn, B. Bischl, and L. Torgo. OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014.
H. Wang, Y. Jin, and J. Doherty. A generic test suite for evolutionary multifidelity optimization. IEEE Transactions on Evolutionary Computation, 22(6):836–850, 2017.
L. Xie and A. Yuille. Genetic CNN. In Proc. of ICCV, pages 1379–1388, 2017.
C. Ying, A. Klein, E. Real, E. Christiansen, K. Murphy, and F. Hutter. NAS-Bench-101: Towards reproducible neural architecture search. arXiv preprint arXiv:1902.09635, 2019.
A. Zela, J. Siems, and F. Hutter. NAS-Bench-1shot1: Benchmarking and dissecting one-shot neural architecture search. arXiv preprint arXiv:2001.10422, 2020.
A More details on DE

Differential Evolution (DE) is a simple, well-performing evolutionary algorithm to solve a variety of optimization problems [K. Price and Lampinen, 2006] [Das et al., 2016]. This algorithm was originally introduced in 1995 by Storn and Price [Storn and Price, 1997], and later attracted the attention of many researchers to propose new improved state-of-the-art algorithms [Chakraborty, 2008]. DE is based on four steps: initialization, mutation, crossover and selection. Algorithm 1 presents the DE pseudo-code.

Initialization. DE is a population-based meta-heuristic algorithm which consists of a population of N individuals. Each individual is considered a solution and expressed as a vector of D-dimensional decision variables as follows:

$$\mathrm{pop}_g = (x^1_{i,g}, x^2_{i,g}, \ldots, x^D_{i,g}), \quad i = 1, 2, \ldots, N, \tag{2}$$

where g is the generation number, D is the dimension of the problem being solved and N is the population size. The algorithm starts initially with randomly distributed individuals within the search space. The function value for the problem being solved is then computed for each individual, f(x).

Mutation. A new child/offspring is generated using the mutation operation for each individual in the population by a so-called mutation strategy. Figure 14 illustrates this operation for a 2-dimensional case. The classical DE uses the mutation operator rand/1, in which three random individuals/parents denoted as $x_{r_1}$, $x_{r_2}$, $x_{r_3}$ are chosen to generate a new vector $v_i$ as follows:

$$v_{i,g} = x_{r_1,g} + F \cdot (x_{r_2,g} - x_{r_3,g}), \tag{3}$$

where $v_{i,g}$ is the mutant vector generated for each individual $x_{i,g}$ in the population. F is the scaling factor that usually takes values within the range (0, 1] and $r_1$, $r_2$, $r_3$ are the indices of different randomly-selected individuals. Eq. 3 allows some parameters to be outside the search range; therefore, each parameter in $v_{i,g}$ is checked and reset (in this work, to a value chosen uniformly at random from [0, 1]) if it happens to be outside the boundaries.

Figure 14: Illustration of DE Mutation operation for a 2-dimensional case using the rand/1 mutation strategy. The scaled difference vector ($F \cdot (x_{r_2} - x_{r_3})$) is used to determine the neighbourhood of search from $x_{r_1}$. Depending on the diversity of the population, DE mutation's search will be explorative or exploitative.

Crossover. When the mutation phase is completed, the crossover operation is applied to each target vector $x_{i,g}$ and its corresponding mutant vector $v_{i,g}$ to generate a trial vector $u_{i,g}$. Classical DE uses the following uniform (binomial) crossover:

$$u^j_{i,g} = \begin{cases} v^j_{i,g} & \text{if } (rand \le p) \text{ or } (j = j_{rand}) \\ x^j_{i,g} & \text{otherwise} \end{cases} \tag{4}$$

The crossover rate p is real-valued and is usually specified in the range [0, 1]. This variable controls the portion of parameter values that are copied from the mutant vector. The jth parameter value is copied from the mutant vector $v_{i,g}$ to the corresponding position in the trial vector $u_{i,g}$ if a random number is less than or equal to p. If the condition is not satisfied, then the jth position is copied from the target vector $x_{i,g}$. $j_{rand}$ is a random integer in the range [1, D] that ensures at least one dimension is copied from the mutant, in case the random number generated for every dimension is greater than p. Figure 15 shows an illustration of the crossover operations.

Selection. After the final offspring is generated, the selection operation takes place to determine whether the target (the parent, $x_{i,g}$) or the trial (the offspring, $u_{i,g}$) vector survives to the next generation by comparing the function values. The offspring replaces its parent if it has a better function value (DE is a minimizer), as shown in Equation 5. Otherwise, the new offspring is discarded, and the target vector remains in the population for the next generation.

$$x_{i,g+1} = \begin{cases} u_{i,g} & \text{if } f(u_{i,g}) \le f(x_{i,g}) \\ x_{i,g} & \text{otherwise} \end{cases} \tag{5}$$

B More details on Hyperband

The Hyperband [Li et al., 2017] (HB) algorithm was designed to perform random sampling with early stopping based on pre-determined geometrically spaced resource allocation. For DEHB we replace the random sampling with DE search. However, DEHB uses HB at its core to solve the "n versus B/n" tradeoff that HB was designed to address. Algorithm 2 shows how DEHB interfaces with HB to query how many configurations to run at each budget in each iteration. This view treats the DEHB algorithm as a sequence of predetermined (by HB), repeating Successive Halving (SH) brackets, where the iteration number refers to the index of the SH bracket run by DEHB.

C More details on DEHB

C.1 DEHB algorithm

Algorithm 3 gives the pseudo code describing DEHB. DEHB takes as input the parameters for HB ($b_{min}$, $b_{max}$, $\eta$) and the parameters for DE (F, p).
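The bracket schedule that Appendix B refers to can be made concrete with a small sketch. The function name `sh_bracket` is hypothetical and is not taken from the authors' code. Algorithm 2 writes the lowest budget of a bracket as $(b_{max} / b_{min}) \cdot \eta^{-s}$; the sketch below uses $b_{max} \cdot \eta^{-s}$ instead, which is the same value when budgets are measured in units of $b_{min}$. That normalisation is an assumption made here so that the highest rung always equals $b_{max}$. Integer arithmetic replaces the floor, ceiling and logarithm to avoid floating-point rounding.

```python
def sh_bracket(b_min, b_max, eta, iteration):
    """Return (n_configs, budgets) for one Successive Halving bracket.

    Hypothetical helper following the structure of Algorithm 2.
    """
    # s_max = floor(log_eta(b_max / b_min)), computed without floating-point logs.
    s_max = 0
    while b_min * eta ** (s_max + 1) <= b_max:
        s_max += 1
    # Brackets cycle from the most aggressive (s = s_max) down to s = 0.
    s = s_max - (iteration % (s_max + 1))
    # N = ceil((s_max + 1) / (s + 1) * eta^s), written as integer ceiling division.
    total = (s_max + 1) * eta ** s
    n = (total + s) // (s + 1)
    # Lowest budget of this bracket (assumption: the top rung equals b_max).
    b0 = b_max / eta ** s
    n_configs, budgets = [], []
    for i in range(s + 1):
        n_configs.append(n // eta ** i)   # floor(N * eta^-i)
        budgets.append(b0 * eta ** i)
    return n_configs, budgets


# One full cycle of brackets for b_min = 1, b_max = 27, eta = 3.
for it in range(4):
    print(it, sh_bracket(1, 27, 3, it))
```

Each successive iteration starts at a higher budget with fewer configurations, and iteration `s_max + 1` wraps around to the first bracket again.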
Algorithm 2: A SH bracket under Hyperband
Input:
    b_min, b_max - min and max budgets
    η - reduction factor (the top 1/η configurations are promoted)
    iteration - iteration number
Output: List of no. of configurations and budgets
    s_max = ⌊log_η(b_max / b_min)⌋
    s = s_max − (iteration mod (s_max + 1))
    N = ⌈((s_max + 1) / (s + 1)) · η^s⌉
    b_0 = (b_max / b_min) · η^(−s)
    budgets = n_configs = []
    for i ∈ {0, ..., s} do
        N_i = ⌊N · η^(−i)⌋
        b = b_0 · η^i
        n_configs.append(N_i)
        budgets.append(b)
    end
    return n_configs, budgets

Algorithm 1: DE Optimizer
Input:
    f - black-box problem
    F - scaling factor (default F = 0.5)
    p - crossover rate (default p = 0.5)
    N - population size
Output: Return best found individual in pop
    g = 0, FE = 0;
    pop_g ← initial_population(N, D);
    fitness_g ← evaluate_population(pop_g);
    FE = N;
    while (FE < FE_max) do
        mutate(pop_g);
        offspring_g ← crossover(pop_g);
        fitness_g ← evaluate_population(offspring_g);
        pop_{g+1}, fitness_{g+1} ← select(pop_g, offspring_g);
        FE = FE + N;
        g = g + 1;
    end
    return Individual with the best fitness seen

Figure 15: Illustration of DE Crossover operation for a 2-dimensional case using the binomial crossover. The vertices of the rectangle show the possible solutions between a parent x and a mutant v. Based on the choice of p, the resultant individual will either be a copy of the parent, or the mutant, or incorporate either component from parent and mutant. [Vertices shown: U = V, U = (V^1, X^2), U = (X^1, V^2), X.]

For the experiments in this paper, the termination condition was chosen as the total number of DEHB brackets to run. However, in our implementation it can also be specified as the total absolute number of function evaluations, or a cumulative wallclock time as budget. In the following, Lk denotes line k of Algorithm 3. L6 is the call to Algorithm 2, which gives the list of increasing budgets to be used for that SH bracket. The nomenclature DE[budgets[i]], used in L9 and L12, indicates the DE subpopulation associated with the budgets[i] fidelity level. The if...else block from L11-15 differentiates the first DEHB bracket from the rest. During the first DEHB bracket (bracket_counter == 0) and its second SH bracket onwards (i > 0), the top configurations from the lower fidelity are promoted (that is, only evaluated on the higher budget and not evolved using mutation-crossover-selection) for evaluation in the next higher fidelity. The function DE trial generation on L14, i.e., the sequence of mutation-crossover operations, generates a candidate configuration (config) to be evaluated for all other scenarios. L17 carries out the DE selection procedure by comparing the fitness score of config and the selected target for that DE evolution step. The target ($x_{i,g}$ from Equation 4) is selected on L9 by a rolling pointer over the subpopulation list. That is, for every iteration (every increment of j) a pointer moves forward by one index position in the subpopulation, selecting an individual to be a target. When this pointer reaches the maximal index, it resets to point back to the starting index of the subpopulation. L18 compares the score of the last evaluated config with the best found score so far. If the new config has a better fitness score, the best found score is updated and the new config is marked as the incumbent, config_inc. This stores the best found configuration as an anytime best performing configuration.

C.2 Handling Mixed Data Types

When dealing with discrete or categorical search spaces, such as the NAS problem, the best way to apply DE with such parameters is to keep the population continuous and perform mutation and crossover normally (Eq. 3, 4); then, to evaluate a configuration we evaluate a copy of it in the original discrete space as we explain below. If we instead dealt with a discrete population, then the diversity of population would drop dramatically, leading to many individuals having the same parameter values; the resulting population would then have many duplicates, lowering the diversity of the difference distribution and making it hard for DE to explore effectively.
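The diversity argument above can be illustrated with a toy count of the distinct difference values $x_a - x_b$ available to the mutation operator. This is a minimal sketch under stated assumptions, not an experiment from the paper: it uses a single hypothetical categorical parameter with three choices, a population of 20, and the helper name `distinct_differences` is made up for this illustration.

```python
import random


def distinct_differences(population):
    """Count distinct values of x_a - x_b over all ordered pairs a != b."""
    diffs = set()
    for a, xa in enumerate(population):
        for b, xb in enumerate(population):
            if a != b:
                diffs.add(round(xa - xb, 12))
    return len(diffs)


rng = random.Random(0)
n_choices = 3  # hypothetical categorical parameter with 3 choices

# Continuous population in [0, 1], as kept internally by DE.
continuous = [rng.random() for _ in range(20)]
# Discrete population: each value replaced by the index of its category.
discrete = [min(int(u * n_choices), n_choices - 1) for u in continuous]

print("continuous:", distinct_differences(continuous))
print("discrete:  ", distinct_differences(discrete))  # at most 2 * n_choices - 1
```

With three categories the discrete population can produce at most five distinct differences, whatever its size, whereas the continuous population offers a far richer set of difference vectors for Eq. 3.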
We designed DEHB to scale all parameters of a configuration in a population to a unit hypercube [0, 1], for the two broad types of parameters normally encountered:
• Integer and float parameters, $X^i \in [a_i, b_i]$, are retrieved as $a_i + (b_i - a_i) \cdot U_{i,g}$, where the integer parameters are additionally rounded.
• Ordinal and categorical parameters, $X^i \in \{x_1, \ldots, x_n\}$, are treated equivalently s.t. the range [0, 1] is divided uniformly into n bins.

We also experimented with another encoding design where each category of each categorical variable is represented as a continuous variable in [0, 1], and the category whose continuous variable has the maximum value is chosen [Vallati et al., 2015]. For example, in Figure 16, the effective dimensionality of the search space will become 96: 32 continuous variables + 64 continuous variables derived from 32 binary variables. We choose a representative set of benchmarks (NAS-Bench-201 and Counting Ones) to compare DEHB with the two encodings mentioned, in Figures 16, 17. One example in which this alternative encoding performs much worse than the DE-NAS [Awad et al., 2020] encoding we chose for DEHB is enough to make the point. The encoding from [Vallati et al., 2015] did not achieve a better final performance than DEHB in any of our experiments.

Figure 16: Comparing DEHB encodings for the Stochastic Counting Ones problem in 64 dimensional space with 32 categorical and 32 continuous hyperparameters. Results for all algorithms on 50 runs. [Panel: 32+32; x-axis: cumulative budget / b_max; y-axis: normalized validation regret; curves: DEHB (orig), DEHB (encoding).]

Figure 17: Comparing DEHB encodings for the Cifar-100 dataset from NAS-Bench-201's 6-dimensional space. Results for all algorithms on 50 runs. [Panel: Cifar-100; x-axis: estimated wallclock time [s]; y-axis: validation regret; curves: DEHB (orig), DEHB (encoding).]

Algorithm 3: DEHB
Input:
    b_min, b_max - min and max budgets
    η - (default η = 3)
    F - scaling factor (default F = 0.5)
    p - crossover rate (default p = 0.5)
Output: Best found configuration, config_inc
 1  s_max = ⌊log_η(b_max / b_min)⌋
 2  Initialize (s_max + 1) DE subpopulations randomly
 3  bracket_counter = 0
 4  while termination condition do
 5      for iteration ∈ {0, 1, ..., s_max} do
 6          budgets, n_configs = SH_bracket_under_HB(b_min, b_max, η, iteration)
 7          for i ∈ {0, 1, ..., s_max − iteration} do
 8              for j ∈ {1, 2, ..., n_configs[i]} do
 9                  target = rolling pointer for DE[budgets[i]]
10                  mutation_type = "vanilla" if i is 0 else "altered"
11                  if bracket_counter is 0 and i > 0 then
12                      config = j-th best config from DE[budgets[i − 1]]
13                  else
14                      config = DE_trial_generation(target, mutation_type)
15                  end
16                  result = Evaluate config on budgets[i]
17                  DE selection using result, config vs. target
18                  Update incumbent, config_inc
19              end
20          end
21      end
22      bracket_counter += 1
23  end
24  return config_inc

C.3 Parallel Implementation

The DEHB algorithm is a sequence of DEHB Brackets, which in turn are a fixed sequence of SH brackets. This feature, along with the asynchronous nature of DE, allows a parallel execution of DEHB. We dub the main process the DEHB Orchestrator, which maintains a single copy of all DE subpopulations. An HB bracket manager determines which budget to run from which SH bracket. Based on this input from the bracket manager, the orchestrator can fetch a configuration (generated by DE mutation and crossover) from the current subpopulations and make an asynchronous call for its evaluation on the assigned budget. The rest of the orchestrator continues synchronously to check for free workers, and to query the HB bracket manager for the next budget and SH bracket. Once a worker finishes computation, the orchestrator collects the result, performs DE selection and updates the relevant subpopulation accordingly. This form of an update is referred to as immediate, asynchronous DE.

DEHB uses a synchronous SH routine. Though each of the function evaluations at a particular budget can be distributed, a higher budget needs to wait on all the lower budget evaluations to be finished. A higher budget evaluation can begin only once the lower budget evaluations are over and the top 1/η can be selected. However, the asynchronous nature of DE allows a new bracket to begin if a worker is available while existing SH brackets have pending jobs or are waiting for results. The new bracket can continue using the current state of DE subpopulations maintained by the DEHB Orchestrator. Once the pending jobs from previous brackets are over, the DE selection updates the DEHB Orchestrator's subpopulations. Thus, the utilisation of available computational resources is maximized while the central copy of subpopulations maintained by the Orchestrator ensures that each new SH bracket spawned works with the latest updated subpopulation.
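The mapping from the unit hypercube back to the original search space, described in Section C.2, can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the `decode` function and the tuple-based `space` description are hypothetical, and clamping the value 1.0 into the last bin is an assumption the text does not spell out.

```python
def decode(unit_vector, space):
    """Map a vector in [0, 1]^D to a configuration in the original space.

    `space` is a hypothetical list of parameter descriptions:
      ("float", low, high), ("int", low, high) or ("cat", [choices...]).
    """
    config = []
    for u, spec in zip(unit_vector, space):
        kind = spec[0]
        if kind in ("float", "int"):
            _, low, high = spec
            value = low + (high - low) * u          # a_i + (b_i - a_i) * U
            config.append(int(round(value)) if kind == "int" else value)
        elif kind == "cat":
            choices = spec[1]
            # Divide [0, 1] uniformly into len(choices) bins;
            # u = 1.0 is clamped into the last bin (assumption).
            index = min(int(u * len(choices)), len(choices) - 1)
            config.append(choices[index])
        else:
            raise ValueError(f"unknown parameter type: {kind}")
    return config


space = [("float", 0.0, 10.0), ("int", 1, 5), ("cat", ["relu", "tanh", "sigmoid"])]
print(decode([0.25, 0.6, 0.9], space))
```

Only the decoded copy is passed to the objective function; the population itself stays in [0, 1], so mutation and crossover (Eq. 3, 4) operate on continuous values throughout.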
D More details on Experiments

D.1 Baseline Algorithms

In all our experiments we keep the configuration of each algorithm fixed. These are well-performing settings that have been benchmarked in previous works: [Falkner et al., 2018], [Ying et al., 2019], [Awad et al., 2020].

Random Search (RS). We sample random architectures in the configuration space from a uniform distribution in each generation.

BOHB. We used the implementation from https://github.com/automl/HpBandSter. In [Ying et al., 2019], they identified the settings of key hyperparameters as: η is set to 3, the minimum bandwidth for the kernel density estimator is set to 0.3 and the bandwidth factor is set to 3. In our experiments, we deploy the same settings.

Hyperband (HB). We used the implementation from https://github.com/automl/HpBandSter. We set η = 3, and this parameter is not free to change since no other budgets are included in the NAS benchmarks.

Tree-structured Parzen estimator (TPE). We used the open-source implementation from https://github.com/hyperopt/hyperopt. We kept the hyperparameters at their default settings.

Sequential Model-based Algorithm Configuration (SMAC). We used the implementation from https://github.com/automl/SMAC3 under its default parameter setting. Only for the Counting Ones problem with 64 dimensions, the initial design had to be changed to a Latin Hypercube design, instead of a Sobol design.

Regularized Evolution (RE). We used the implementation from [Real et al., 2019]. We initially sample an edge or operator uniformly at random, then we perform the mutation. After reaching the population size, RE kills the oldest member at each iteration. As recommended by [Ying et al., 2019], the population size (PS) and sample size (TS) are set to 100 and 10 respectively.

Differential Evolution (DE). We used the implementation from [Awad et al., 2020], keeping the rand1 strategy for mutation and binomial crossover as the crossover strategy. We also use the same population size of 20 as [Awad et al., 2020].

All plots for all baselines show the incumbent validation regret over the estimated wallclock time, ignoring the optimization time. The x-axis therefore accounts only for the cumulative cost incurred by function evaluations for each algorithm. All algorithms were run for a similar actual wallclock time budget. Certain algorithms under certain benchmarks may not appear to have equivalent total estimated wallclock time. That is an artefact of ignoring optimization time. Model-based algorithms such as SMAC, BOHB, TPE have a computational cost dependent on the observation history. They might therefore undertake fewer function evaluations for the same actual wallclock time.

D.2 Artificial Toy Function: Stochastic Counting Ones

The Counting Ones benchmark was designed to minimize the following objective function:

$$f(x) = -\left( \sum_{x_i \in X_{cat}} x_i + \sum_{x_j \in X_{cont}} \mathbb{E}_b\left[ B_{p=x_j} \right] \right),$$

where the sum of the categorical variables ($x_i \in \{0, 1\}$) represents the standard discrete counting ones problem. The continuous variables ($x_j \in [0, 1]$) represent the stochastic component with the budget b controlling the noise. The budget here represents the number of samples used to estimate the mean of the Bernoulli distribution (B) with parameters $x_j$.

The experiments on the Stochastic Counting Ones benchmark used N = {4, 8, 16, 32}, all of which are shown in Figure 18. For the low dimensional cases, BOHB and SMAC's models are able to give them an early advantage. For this toy benchmark the global optimum is located at the corner of a unit hypercube. Random samples can span the lower dimensional space adequately for a model to improve the search rapidly. DEHB on the other hand may require a few extra function evaluations to reach similar convergence. However, this conservative approach aids DEHB for the high-dimensional cases, where it is able to converge much more rapidly in comparison to other algorithms, especially where SMAC and BOHB's convergence worsens significantly. DEHB thus showcases its robust performance even when the dimensionality of the problem increases exponentially.

Figure 18: Results for the Stochastic Counting Ones problem for N = {4, 8, 16, 32} respectively indicating N categorical and N continuous hyperparameters for each case. All algorithms shown were run for 50 runs. [Panels: 4+4, 8+8, 16+16, 32+32; x-axis: cumulative budget / b_max; y-axis: normalized validation regret; curves: RS, HB, BOHB, TPE, SMAC, RE, DE, DEHB.]

D.3 Feed-forward networks on OpenML datasets

Figure 19 shows the results on all 6 datasets from the OpenML surrogates benchmark: Adult, Letter, Higgs, MNIST, Optdigits, Poker. The surrogate model space is just 6-dimensional, allowing BOHB and TPE to build more confident models and be well-performing algorithms in this space, especially early in the optimization. However, DE and DEHB are able to remain competitive and consistently achieve a better final performance than TPE and BOHB, respectively. Even TPE achieves a better final performance than BOHB. Overall, DEHB is a competitive anytime performer for this benchmark with the most robust final performances.
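The Stochastic Counting Ones objective from Section D.2 is simple enough to write down directly. The sketch below is an illustration under assumptions, not the benchmark code used in the experiments: the function name and argument layout are hypothetical, and the expectation over the Bernoulli distribution is estimated by the mean of `budget` samples, as the text describes.

```python
import random


def stochastic_counting_ones(x_cat, x_cont, budget, rng):
    """Noisy Counting Ones objective (to be minimised).

    x_cat:  list of 0/1 values (categorical part)
    x_cont: list of floats in [0, 1] (continuous part)
    budget: number of Bernoulli samples drawn per continuous variable
    rng:    a random.Random instance
    """
    total = float(sum(x_cat))
    for p in x_cont:
        # Estimate the mean of Bernoulli(p) from `budget` samples.
        successes = sum(1 for _ in range(budget) if rng.random() < p)
        total += successes / budget
    return -total


rng = random.Random(0)
# At the optimum (all ones) the objective is noise-free and equals -(4 + 4).
print(stochastic_counting_ones([1] * 4, [1.0] * 4, budget=10, rng=rng))
# Away from the corners, a larger budget reduces the noise of the estimate.
print(stochastic_counting_ones([1] * 4, [0.5] * 4, budget=3, rng=rng))
print(stochastic_counting_ones([1] * 4, [0.5] * 4, budget=10000, rng=rng))
```

A small budget gives a cheap but noisy evaluation, and a large budget approaches the noise-free value, which is what makes the function usable as a multi-fidelity benchmark.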
Figure 19: Results for the OpenML surrogate benchmark for the 6 datasets: Adult, Higgs, Letter, MNIST, Optdigits, Poker. The search space had 6 continuous hyperparameters. All plots shown were averaged over 50 runs of each algorithm. [Panels: Adult, Higgs, Letter, Mnist, Optdigits, Poker; x-axis: estimated wallclock time [s]; y-axis: validation regret; curves: RS, HB, BOHB, TPE, SMAC, RE, DE, DEHB.]

D.4 Bayesian Neural Networks

The search space for the two-layer fully-connected Bayesian Neural Network is defined by 5 hyperparameters which are: the step length, the length of the burn-in period, the number of units in each layer, and the decay parameter of the momentum variable. In Figure 20, we show the results for the tuning of Bayesian Neural Networks on both the Boston Housing and Protein Structure datasets for the 5-dimensional Bayesian Neural Networks benchmark. We observe that SMAC, TPE and BOHB are able to build models and reach similar regions of performance with high confidence. DEHB is slower to match in such a low-dimensional noisy space. However, given the same cumulative budget, DEHB achieves a competitive final score.

Figure 20: Results for tuning 5 hyperparameters of a Bayesian Neural Network on the Boston Housing and Protein Structure datasets respectively, for 50 runs of each algorithm. [Panels: Boston Housing, Protein Structure; x-axis: MCMC steps; y-axis: negative log-likelihood; curves: RS, HB, BOHB, TPE, SMAC, RE, DE, DEHB.]

D.5 Reinforcement Learning

For this benchmark, the proximal policy optimization (PPO) [Schulman et al., 2017] implementation is parameterized with 7 hyperparameters: # units layer 1, # units layer 2, batch size, learning rate, discount, likelihood ratio clipping and entropy regularization. Figure 21 summarises the performance of all algorithms on the RL problem for the Cartpole benchmark. SMAC uses a Sobol grid as its initial design, and both its benefit and drawback can be seen as SMAC rapidly improves, stalls, and then improves again once model-based search begins. However, BOHB and DEHB both remain competitive, and BOHB, DEHB, SMAC emerge as the top-3 for this benchmark, achieving similar final scores. We notice that the DE trace stands out as worse than RS and explain the reason below. Given the late improvement for DE pop = 20, we posit that this is a result of the deferred updates of DE based on the classical DE [Awad et al., 2020] update design and also the design of the benchmark.

Figure 21: (left) Results for tuning PPO on OpenAI Gym cartpole environment with 7 hyperparameters. Each algorithm shown was run for 50 runs. (right) Same experiment to compare DE with a population size of 10 and 20. [x-axis: time [s]; y-axis: epochs until convergence; curves (left): RS, HB, BOHB, TPE, SMAC, RE, DE, DEHB; curves (right): RS, BOHB, DE pop = 10, DE pop = 20.]

For classical-DE, the updates are deferred, that is, the results of the selection process are incorporated into the population for consideration in the next evolution step only after all the individuals of the population have undergone evolution. In terms of computation, the wall-clock time of as many function evaluations as the population size is accumulated before the population is updated. In Figure 21 we illustrate why, given how this benchmark is designed, this minor detail for DE slows down convergence. Along with a DE of population size 20 as used in the experiments, we compare a DE of population size 10 in Figure 21. For the Reinforcement Learning benchmark from [Falkner et al., 2018], each full budget function evaluation consists of 9 trials of a maximum of 3000 episodes. With a population of 20, DE will not inject a new individual into a population unless all 20 individuals have participated as a parent in the crossover operation. This accumulates wallclock time equivalent to 20 individuals times 9 trials times the time taken for a maximum of 3000 episodes. This can explain the flat trajectories in the optimization trace for DE pop = 20 in Figure 21 (right). DE pop = 10 cuts this accumulated wallclock time in half, injects newer configurations into the population sooner, and therefore searches faster. Given enough runtime, we expect DE pop = 20 to converge to similar final scores. DEHB uses the immediate update design for DE, wherein it updates the population immediately after a DE selection and does not wait for the entire population to evolve. We posit that this feature, along with lower fidelity search, and performing grid search over population sizes with Hyperband, enables DEHB to be more practical than classical-DE.
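The difference between the deferred update of classical DE and the immediate update used by DEHB can be seen in a compact sketch of rand/1 mutation, binomial crossover and selection (Eq. 3 to 5). This is a minimal illustration under assumptions, not the implementation used in the experiments: the function name `de_minimize`, the unit-cube search space and the toy sphere objective are hypothetical, and the three random parents are drawn to be distinct from the target, which is a common convention.

```python
import random


def de_minimize(f, dim, pop_size, max_evals, immediate, F=0.5, p=0.5, seed=0):
    """Minimise f over [0, 1]^dim with rand/1/bin DE; return the best value."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    fit = [f(x) for x in pop]
    evals = pop_size
    while evals < max_evals:
        # Deferred update: selection results are staged in copies and only
        # become visible to mutation once the whole generation is done.
        # Immediate update: winners are written straight into the population.
        new_pop, new_fit = (pop, fit) if immediate else (list(pop), list(fit))
        for i in range(pop_size):
            if evals >= max_evals:
                break
            r1, r2, r3 = rng.sample([k for k in range(pop_size) if k != i], 3)
            # Mutation (Eq. 3); out-of-range values are resampled uniformly.
            v = [pop[r1][j] + F * (pop[r2][j] - pop[r3][j]) for j in range(dim)]
            v = [vj if 0.0 <= vj <= 1.0 else rng.random() for vj in v]
            # Binomial crossover (Eq. 4); j_rand forces one mutant dimension.
            j_rand = rng.randrange(dim)
            u = [v[j] if (rng.random() <= p or j == j_rand) else pop[i][j]
                 for j in range(dim)]
            # Selection (Eq. 5): the trial replaces the target if not worse.
            fu = f(u)
            evals += 1
            if fu <= new_fit[i]:
                new_pop[i], new_fit[i] = u, fu
        pop, fit = new_pop, new_fit
    return min(fit)


def sphere(x):
    """Toy objective with its minimum at the centre of the unit cube."""
    return sum((xi - 0.5) ** 2 for xi in x)


print("deferred :", de_minimize(sphere, 3, 20, 400, immediate=False))
print("immediate:", de_minimize(sphere, 3, 20, 400, immediate=True))
```

In the deferred variant a better individual cannot serve as a parent until all `pop_size` evaluations of the generation are finished, which is the accumulated wallclock cost discussed above; the immediate variant makes it available to the very next mutation.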