High-Dimensional Bayesian Optimization with Sparse Axis-Aligned Subspaces
High-Dimensional Bayesian Optimization with Sparse Axis-Aligned Subspaces

David Eriksson*,1   Martin Jankowiak*,2
1 Facebook, Menlo Park, California, USA
2 Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
* Equal contribution
Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021).

Abstract

Bayesian optimization (BO) is a powerful paradigm for efficient optimization of black-box objective functions. High-dimensional BO presents a particular challenge, in part because the curse of dimensionality makes it difficult to define—as well as do inference over—a suitable class of surrogate models. We argue that Gaussian process surrogate models defined on sparse axis-aligned subspaces offer an attractive compromise between flexibility and parsimony. We demonstrate that our approach, which relies on Hamiltonian Monte Carlo for inference, can rapidly identify sparse subspaces relevant to modeling the unknown objective function, enabling sample-efficient high-dimensional BO. In an extensive suite of experiments comparing to existing methods for high-dimensional BO we demonstrate that our algorithm, Sparse Axis-Aligned Subspace BO (SAASBO), achieves excellent performance on several synthetic and real-world problems without the need to set problem-specific hyperparameters.

1 INTRODUCTION

Optimization plays an essential role in many fields of science, engineering and beyond. From calibrating complex experimental systems to tuning hyperparameters of machine learning models, the need for scalable and efficient optimization methods is ubiquitous. Bayesian Optimization (BO) algorithms have proven particularly successful on a wide variety of domains including hyperparameter tuning [Snoek et al., 2012], A/B tests [Letham et al., 2019], chemical engineering [Hernández-Lobato et al., 2017], materials science [Ueno et al., 2016], control systems [Candelieri et al., 2018], and drug discovery [Negoescu et al., 2011].

These algorithms typically consist of two components. The first component employs Bayesian methods to construct a surrogate model of the (unknown) objective function. The second component uses this model together with an acquisition function to select the most promising query point(s) at which to evaluate the objective function. By leveraging the uncertainty quantification provided by the Bayesian model, a well-designed BO algorithm can provide an effective balance between exploration and exploitation, leading to highly sample-efficient optimization.

While BO has become a workhorse algorithm that is employed in a wide variety of settings, successful applications are often limited to low-dimensional problems, e.g. fewer than twenty dimensions [Frazier, 2018]. Applying BO to high-dimensional problems remains a significant challenge. The difficulty can be traced to both of the algorithm components mentioned above, although we postulate that suitable function priors are especially important for good performance. In particular, in order for BO to be sample-efficient in high-dimensional spaces, it is crucial to define surrogate models that are sufficiently parsimonious that they can be inferred from a small number of query points. An overly flexible class of models is likely to suffer from overfitting, which severely limits its effectiveness in decision-making. Likewise, an overly rigid class of models is unlikely to capture enough features of the objective function. A compromise between flexibility and parsimony is essential.

In this work we focus on the setting where we aim to optimize a black-box function with hundreds of variables and where we are limited to a few hundred queries of the objective function. We argue that in this low-sample regime Gaussian process surrogate models defined on sparse axis-aligned subspaces provide an attractive compromise between flexibility and parsimony. More specifically, our contributions are as follows:

• We propose the sparsity-inducing SAAS function prior.
• We demonstrate that when combined with the No-U-Turn Sampler (NUTS) for inference, our surrogate model quickly identifies the most relevant low-dimensional subspace, which in turn leads to sample-efficient BO.
• We show that SAASBO outperforms a number of strong baselines on several problems, including three real-world problems with as many as 388 dimensions, all without setting problem-specific hyperparameters.

2 RELATED WORK

There is a large body of research on high-dimensional BO, and a wide variety of surrogate modeling and acquisition strategies have been proposed. In the following we draw attention to a number of common themes.

A popular approach is to rely on low-dimensional structure, with several methods utilizing random projections [Wang et al., 2016, Qian et al., 2016, Binois et al., 2020, Letham et al., 2020]. REMBO uses a random projection to project low-dimensional points up to the original space [Wang et al., 2016]. ALEBO introduces several refinements to REMBO and demonstrates improved performance across a large number of problems [Letham et al., 2020]. Alternatively, the embedding can be learned jointly with the model, including both linear [Garnett et al., 2014] and non-linear [Lu et al., 2018] embeddings. Finally, Hashing-enhanced Subspace BO (HeSBO) [Nayebi et al., 2019] relies on hashing and sketching to reduce surrogate modeling and acquisition function optimization to a low-dimensional space.

Several methods rely on additive structure, where the function is assumed to be a sum of low-dimensional components [Kandasamy et al., 2015, Gardner et al., 2017, Mutny and Krause, 2018, Wang et al., 2018]. This approach allows separating the input space into independent domains, reducing the effective dimensionality of the model.

A common feature of many BO algorithms in high dimensions is that they tend to prefer highly uncertain query points near the domain boundary. As this is usually where the model is the most uncertain, this is often a poor choice that leads to over-exploration and poor optimization performance. Oh et al. [2018] address this issue by introducing a cylindrical kernel that promotes selection of query points in the interior of the domain. LineBO [Kirschner et al., 2019] optimizes the acquisition function along one-dimensional lines, which also helps to avoid highly uncertain points. The TuRBO algorithm uses several trust regions centered around the current best solution [Eriksson et al., 2019]. These trust regions are resized based on progress, allowing TuRBO to zoom in on promising regions. Li et al. [2017] use dropout to select a subset of dimensions over which to optimize the acquisition function, with excluded dimensions fixed to the value of the best point found so far.

It is also important to note that there are many black-box optimization algorithms that do not rely on Bayesian methods, with evolutionary algorithms being especially common. While most methods require thousands of evaluations to find good minima [Yu and Gen, 2010], the popular covariance matrix adaptation evolution strategy (CMA-ES; [Hansen et al., 2003]) is competitive with BO on some problems [Letham et al., 2020].

3 BACKGROUND

We use this section to establish our notation and review necessary background material. Throughout this paper we work in the D-dimensional domain D = [0, 1]^D. We consider the minimization problem x_min ∈ argmin_{x∈D} f_obj(x) for a noise-free objective function f_obj : D → R. We assume that evaluations of f_obj are costly and that we are limited to at most a few hundred. Additionally, f_obj is a black-box function and gradient information is unavailable.

The rest of this section is organized as follows: in Sec. 3.1 we review Gaussian processes; and in Sec. 3.2 we review the expected improvement acquisition function.

3.1 GAUSSIAN PROCESSES

Gaussian processes (GPs) offer powerful non-parametric function priors that are the gold standard in BO due to their flexibility and excellent uncertainty quantification. A GP on the input space D is specified¹ by a covariance function or kernel k : D × D → R [Rasmussen, 2003]. A common choice is the RBF or squared exponential kernel, which is given by

$k_\psi(x, y) = \sigma_k^2 \exp\{-\tfrac{1}{2} \textstyle\sum_i \rho_i (x_i - y_i)^2\}$   (1)

where ρ_i for i = 1, ..., D are inverse squared length scales and where we use ψ to collectively denote all the hyperparameters, i.e. ψ = {ρ_{1:D}, σ_k²}. For scalar regression f : D → R the joint density of a GP takes the form

$p(\mathbf{y}, \mathbf{f} \mid X) = \mathcal{N}(\mathbf{y} \mid \mathbf{f}, \sigma^2 \mathbb{1}_N)\, \mathcal{N}(\mathbf{f} \mid \mathbf{0}, K^\psi_{XX})$   (2)

where y are the real-valued targets, f are the latent function values, X = {x_i}_{i=1}^N are the N inputs with x_i ∈ D, σ² is the variance of the Normal likelihood N(y | ·), and K^ψ_XX is the N × N kernel matrix. Throughout this paper we will be interested in modeling noise-free functions, in which case σ² is set to a small constant. The marginal likelihood of the observed data can be computed in closed form:

$p(\mathbf{y} \mid X, \psi) = \int d\mathbf{f}\, p(\mathbf{y}, \mathbf{f} \mid X) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, K^\psi_{XX} + \sigma^2 \mathbb{1}_N).$   (3)

The posterior distribution of the GP at a query point x* ∈ D is the Normal distribution N(µ_f(x*), σ_f(x*)²), where µ_f(·) and σ_f(·)² are given by

$\mu_f(x^*) = (k^\psi_{*X})^{\mathsf{T}} (K^\psi_{XX} + \sigma^2 \mathbb{1}_N)^{-1} \mathbf{y}$   (4)
$\sigma_f(x^*)^2 = k^\psi_{**} - (k^\psi_{*X})^{\mathsf{T}} (K^\psi_{XX} + \sigma^2 \mathbb{1}_N)^{-1} k^\psi_{*X}$   (5)

Here k^ψ_{**} = k_ψ(x*, x*) and k^ψ_{*X} is the column vector specified by (k^ψ_{*X})_n = k_ψ(x*, x_n) for n = 1, ..., N.

¹ Here and elsewhere we assume that the mean function is uniformly zero.

3.2 EXPECTED IMPROVEMENT

Expected improvement (EI) is a popular acquisition function that is defined as follows [Mockus et al., 1978, Jones et al., 1998]. Suppose that in previous rounds of BO we have collected H = {x_{1:N}, y_{1:N}}. Then let y_min = min_n y_n denote the best function evaluation we have seen so far. We define the improvement u(x | y_min) at query point x ∈ D as u(x | y_min) = max(0, y_min − f(x)). EI is defined as the expectation of the improvement over the posterior of f(x):

$\mathrm{EI}(x \mid y_{\min}, \psi) = \mathbb{E}_{p(f(x) \mid \psi, \mathcal{H})}\big[u(x \mid y_{\min})\big]$   (6)

where our notation makes explicit the dependence of Eqn. (6) on the kernel hyperparameters ψ. For a GP like in Sec. 3.1 this expectation can be evaluated in closed form:

$\mathrm{EI}(x \mid y_{\min}, \psi) = (y_{\min} - \mu_f(x))\,\Phi(Z) + \sigma_f(x)\,\varphi(Z)$   (7)

where Z ≡ (y_min − µ_f(x)) / σ_f(x) and where Φ(·) and φ(·) are the CDF and PDF of the unit Normal distribution, respectively. By maximizing Eqn. (7) over D we can find query points x that balance exploration and exploitation.

4 BAYESIAN OPTIMIZATION WITH SPARSE AXIS-ALIGNED SUBSPACES

We now introduce the surrogate model we use for high-dimensional BO. For a large number of dimensions, the space of functions mapping D to R is—to put it mildly—very large, even assuming a certain degree of smoothness. To facilitate sample-efficient BO it is necessary to make additional assumptions. Intuitively, we would like to assume that the dimensions of x ∈ D exhibit a hierarchy of relevance. For example in a particular problem we might have that {x_3, x_52} are crucial features for mapping the principal variation of f_obj, {x_7, x_14, x_31, x_72} are of moderate importance, while the remaining features are of marginal importance. This motivates the following desiderata for our function prior:

1. Assumes a hierarchy of feature relevances
2. Encompasses a flexible class of smooth non-linear functions
3. Admits tractable (approximate) inference

4.1 SAAS FUNCTION PRIOR

To satisfy our desiderata we introduce a GP model with a structured prior over the kernel hyperparameters, in particular one that induces sparse structure in the (inverse squared) length scales ρ_i. In detail we define the following model:

[kernel variance]    $\sigma_k^2 \sim \mathcal{LN}(0, 10^2)$
[global shrinkage]   $\tau \sim \mathcal{HC}(\alpha)$
[length scales]      $\rho_i \sim \mathcal{HC}(\tau)$ for $i = 1, ..., D$   (8)
[function values]    $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, K^\psi_{XX})$ with $\psi = \{\rho_{1:D}, \sigma_k^2\}$
[observations]       $\mathbf{y} \sim \mathcal{N}(\mathbf{f}, \sigma^2 \mathbb{1}_N)$

where LN denotes the log-Normal distribution and HC(α) denotes the half-Cauchy distribution, i.e. p(τ | α) ∝ (α² + τ²)⁻¹ 1(τ > 0) and p(ρ_i | τ) ∝ (τ² + ρ_i²)⁻¹ 1(ρ_i > 0). Here α > 0 is a hyperparameter that controls the level of shrinkage (our default is α = 0.1). We use an RBF kernel, although other choices like the Matérn-5/2 kernel are also possible. We also set σ² → 10⁻⁶, since we focus on noise-free objective functions f_obj. Noisy objective functions can be accommodated by placing a weak prior on σ², for example σ² ∼ LN(0, 10²).

The SAAS function prior defined in (8) has the following important properties. First, the prior on the kernel variance σ_k² is weak (i.e. non-informative). Second, the level of global shrinkage (i.e. sparsity) is controlled by the scalar τ > 0, which tends to concentrate near zero due to the half-Cauchy prior. Third, the (inverse squared) length scales ρ_i are also governed by half-Cauchy priors, and thus they too tend to concentrate near zero (more precisely for most i we expect ρ_i ≲ τ). Consequently most of the dimensions are 'turned off' in accord with the principle of automatic relevance determination introduced by MacKay and Neal [1994]. Finally, while the half-Cauchy priors favor values near zero, they have heavy tails. This means that if there is sufficient evidence in the observations y, the posterior over τ will be pushed to higher values, thus reducing the level of shrinkage and allowing more of the ρ_i to escape zero, effectively 'turning on' more dimensions. The parsimony inherent in our function prior is thus adaptive: as more data is accumulated, more of the ρ_i will escape zero, and posterior mass will give support to a richer class of functions. This is in contrast to a standard GP fit with maximum likelihood estimation (MLE), which will generally exhibit non-negligible ρ_i for most dimensions—since there is no mechanism regularizing the length scales—typically resulting in drastic overfitting in high-dimensional settings.

Conceptually, our function prior describes functions defined on sparse axis-aligned subspaces, thus the name of our prior (SAAS) and our method (SAASBO).
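To make the generative model in Eqn. (8) concrete, here is a minimal NumPyro sketch of the SAAS GP prior with the latent function values marginalized out as in Eqn. (3). The function and site names are illustrative and are not taken from the authors' reference implementation.

```python
# A minimal sketch of the SAAS prior in Eqn. (8), written in NumPyro.
# Names are illustrative; see the paper's linked repository for the
# authors' own implementation.
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist


def saas_gp_model(X, y, alpha=0.1, sigma_sq=1e-6):
    N, D = X.shape
    # [kernel variance] weak log-Normal prior
    var_k = numpyro.sample("kernel_var", dist.LogNormal(0.0, 10.0))
    # [global shrinkage] half-Cauchy with scale alpha
    tau = numpyro.sample("tau", dist.HalfCauchy(alpha))
    # [length scales] inverse squared length scales, shrunk toward zero by tau
    with numpyro.plate("dims", D):
        rho = numpyro.sample("rho", dist.HalfCauchy(tau))
    # RBF kernel: k(x, x') = var_k * exp(-0.5 * sum_i rho_i (x_i - x'_i)^2)
    diff = X[:, None, :] - X[None, :, :]                      # (N, N, D)
    K = var_k * jnp.exp(-0.5 * jnp.sum(rho * diff ** 2, axis=-1))
    K = K + sigma_sq * jnp.eye(N)                             # noise-free: small constant
    # [observations] with f marginalized analytically: y ~ N(0, K + sigma^2 I)
    numpyro.sample(
        "y", dist.MultivariateNormal(jnp.zeros(N), covariance_matrix=K), obs=y
    )
```

Running NUTS on this model yields posterior samples of (σ_k², τ, ρ_{1:D}); the half-Cauchy scales are what encode the global-plus-local shrinkage structure described above.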
Algorithm 1: We outline the main steps in SAASBO when NUTS is used for inference. To instead use MAP we simply swap out line 4. For details on inference see Sec. 4.2; for details on EI maximization see Sec. 4.3.

Input: Objective function f_obj; initial evaluation budget m ≥ 2; total evaluation budget T > m; hyperparameter α; number of NUTS samples L; and initial query set x_{1:m} and evaluations y_{1:m} (optional)
Output: Approximate minimizer and minimum (x_min, y_min)
1  If {x_{1:m}, y_{1:m}} is not provided, let x_{1:m} be a Sobol sequence in D and let y_t = f_obj(x_t) for t = 1, ..., m.
2  for t = m + 1, ..., T do
3      Let H_t = {x_{1:t−1}, y_{1:t−1}} and y_min^t = min_{s<t} y_s.
4      Draw L NUTS samples {ψ_ℓ} of the SAAS GP hyperparameters conditioned on H_t (Sec. 4.2).
5      Let x_t maximize EI(x | y_min^t, {ψ_ℓ}) (Sec. 4.3) and evaluate y_t = f_obj(x_t).
6  return the best pair (x_min, y_min) observed.
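The control flow of Algorithm 1 can be summarized in a short sketch. The model-fitting and EI-maximization steps (Sec. 4.2 and Sec. 4.3) are passed in as callables here, since their details live elsewhere; the helper names are illustrative only.

```python
# High-level sketch of Algorithm 1 (NUTS variant). fit_gp and propose_next
# stand in for the inference step (Sec. 4.2) and the EI maximization step
# (Sec. 4.3); they are placeholders, not the authors' API.
import numpy as np
from scipy.stats import qmc


def saasbo_minimize(f_obj, dim, fit_gp, propose_next, m=10, total_budget=50, seed=0):
    # Line 1: if no initial data is given, start from a Sobol sequence in [0, 1]^dim.
    X = qmc.Sobol(d=dim, scramble=True, seed=seed).random(m)
    y = np.array([f_obj(x) for x in X])
    # Lines 2-6: iterate fit -> maximize EI -> evaluate until the budget is spent.
    for _ in range(m, total_budget):
        gp_samples = fit_gp(X, y)                          # e.g. L NUTS samples of psi
        x_next = propose_next(gp_samples, X, y, y.min())   # maximize sample-averaged EI
        X = np.vstack([X, x_next])
        y = np.append(y, f_obj(x_next))
    best = np.argmin(y)
    return X[best], y[best]
```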
runs of L-BFGS-B to obtain the query point

$x_{\mathrm{next}} = \operatorname{argmax}_x \; \mathrm{EI}(x \mid y_{\min}, \{\psi_\ell\})$   (11)

See the supplementary materials for further details and Alg. 1 for a complete outline of the SAASBO algorithm.

4.4 DISCUSSION

We note that the axis-aligned structure of our model need not be as restrictive as one might at first assume. For example, suppose that f_obj can be written as f_obj(x) = g(x_3 − x_7) for some g : R → R. In order for our model to capture the structure of f_obj, both x_3 and x_7 need to be identified as relevant. In many cases we expect this to be possible with a relatively small number of samples. While it is true that identifying the direction z = x_3 − x_7 could be even easier in a different coordinate system, inferring non-axis-aligned subspaces would come at the cost of substantially increased computational cost. More importantly, by searching over a much larger set of subspaces our surrogate model would likely be much more susceptible to overfitting. Given that for many problems we expect much of the function variation to be captured by axis-aligned blocks of input features, we view our axis-aligned assumption as a good compromise between flexibility and parsimony. Importantly, our modeling approach does not sacrifice any of the many benefits of GPs (e.g. flexible non-linearity and non-parametric latent functions) nor do we need to make any strong assumptions about f_obj (e.g. additive decomposition).

It is important to emphasize that it is by design that the model defined in Sec. 4.1 does not include any discrete latent variables. A natural alternative to our model would introduce D binary-valued variables that control whether or not a given dimension is relevant to modeling f_obj. However, inference in any such model is very challenging as it requires exploring a discrete space of size 2^D. Our model can be understood as a continuous relaxation of such an approach. We discuss this point in more detail in Sec. A.3 in the supplementary materials.

5 EXPERIMENTS

We present an empirical validation of our approach. In Sec. 5.1-5.2 we characterize the behavior of SAASBO in controlled settings. In Sec. 5.4-5.7 we benchmark SAASBO against a number of state-of-the-art methods for high-dimensional BO. SAASBO will be implemented in BoTorch and code for reproducing the experiments will be available at https://github.com/martinjankowiak/saasbo.

5.1 THE SAAS PRIOR PROVIDES GOOD MODEL FIT IN HIGH DIMENSIONS

In Fig. 1 we demonstrate the importance of using a sparsity-inducing prior like SAAS when fitting a GP in a high-dimensional domain. In D = 100 dimensions both maximum likelihood estimation and full Bayesian inference for a GP with weak log-Normal priors on the (squared inverse) length scales ρ_i concentrate on solutions in which the vast majority of the ρ_i are O(1). Consequently with high probability the kernel similarity between a randomly chosen test point and any of the N = 100 training data points is O(exp(−D)) ≈ 0, with the result that both these models revert to a trivial mean prediction across most of the domain. By contrast, the SAAS prior only allows a few ρ_i to escape zero, resulting in a model that is much more useful for exploration and exploitation of the most important design variables.

Figure 1: We compare model fit for three models (GP-MLE, GP-NUTS-Dense, GP-NUTS-SAAS) using the same train/test data obtained from two independent runs of Algorithm 1 on the d = 6 Hartmann function embedded in D ∈ {30, 100} dimensions. We compare: (left) a GP fit with MLE; (middle) a GP with weak priors fit with NUTS; and (right) a GP with a SAAS prior (this paper; see Eqn. (8)) fit with NUTS. In D = 30 dimensions (top row) both models fit with NUTS provide good fits to the test data, while MLE provides a bad fit near the minimum. In D = 100 dimensions (bottom row) only SAAS provides a good fit. In each figure mean predictions are depicted with dots and bars denote 95% confidence intervals.

5.2 SAASBO CAN QUICKLY IDENTIFY THE MOST RELEVANT DIMENSIONS

We characterize the behavior of SAASBO in a controlled setting where we embed the two-dimensional Branin function in D = 100 dimensions. First, we explore the degree to which SAASBO's performance depends on the approximate inference algorithm used, in particular comparing NUTS
Branin (d = 2, D = 100) Branin (d = 2, D = 100) Branin (d = 2, D = 100) 6 2.0 4.0 α = 0.01 α = 0.01 3.5 # of relevant dimensions found α = 0.1 α = 0.1 Effective subspace dimension 5 α = 1.0 α = 1.0 1.5 3.0 MAP MAP Best value found 4 2.5 3 1.0 2.0 1.5 2 0.5 α = 0.01 α = 0.1 1.0 1 Global α = 1.0 Minimum 0.5 MAP 0.0 0 0.0 0 10 20 30 40 50 10 20 30 40 50 10 20 30 40 50 Number of evaluations Number of evaluations Number of evaluations Figure 2: We explore how SAASBO performs on Branin (D = 100), comparing SAASBO-NUTS for three values of the sparsity controlling hyperparameter α to SAASBO-MAP. Each curve corresponds to 60 independent replications of Algorithm 1. Left: We compare performance w.r.t. the best minimum found (the mean is depicted by a thick line and shaded bands denote standard errors). Middle: We depict the mean number of relevant dimensions found, where a relevant dimension is declared ‘found’ if its corresponding PosteriorMedian(ρk ) is among the two largest {PosteriorMedian(ρi )}Di=1 . Right: We depict the mean effective subspace dimension, defined to be the number of dimensions for which PosteriorMedian(ρk ) > 0.5. to MAP (see Sec. 4.2 for details on inference). In Fig. 2 prior, and that all remaining hyperparameters control the (left) we see that NUTS outperforms MAP by a consider- computational budget (e.g. the number of NUTS samples L). able margin. In Fig. 2 (middle and right) we demonstrate This is in contrast to the many methods for high-dimensional that both inference methods are able to reliably identify the BO that rely on several (potentially sensitive) hyperparame- two relevant dimensions after ∼ 20 − 30 evaluations. ters such as the dimension de of a random embedding. Why does NUTS outperform MAP even though MAP is able to identify the relevant subspace? We hypothesize that 5.3 BASELINES the primary reason for the superior performance of NUTS is that the EI objective in Eqn. (10) is considerably more We compare SAASBO to a comprehensive selection of base- robust when averaged over multiple samples of the GP ker- lines: ALEBO, CMA-ES, EBO, HeSBO, SMAC, Sobol, nel hyperparameters. In particular, averaging over multiple and TuRBO. ALEBO [Letham et al., 2020] is chosen as a samples—potentially from distinct modes of the posterior— representative random embedding method, as it improves appears to mitigate EI’s tendency to seek out the boundary upon the original REMBO method [Wang et al., 2016]. Ad- of the domain D. For this reason we use NUTS for the ex- ditionally, we compare to HeSBO, which uses hashing and periments in this work, noting that while we obtain good sketching to project low-dimensional points up to the origi- performance with MAP in some problem settings we find nal space [Nayebi et al., 2019]. The EBO method by Wang that NUTS is significantly more robust. et al. [2018] exploits additive structure to scale to high- dimensional spaces. We also compare to CMA-ES [Hansen Next, we explore the dependence of SAASBO-NUTS on the et al., 2003], which is a popular evolutionary method that hyperparameter α. In Fig. 2 (left) we see that there is mini- is often competitive with BO methods on high-dimensional mal dependence on α, with the three values leading to simi- problems. TuRBO [Eriksson et al., 2019] uses a trust region lar optimization performance. In Fig. 2 (middle and right) centered at the best solution to avoid exploring highly uncer- we see that, as expected, smaller values of α are more con- tain parts of the search space. 
We also include an additional servative (i.e., prefer smaller subspaces), while larger values BO method that does not rely on GPs, namely SMAC [Hut- of α are less conservative (i.e., prefer larger subspaces). We ter et al., 2011]. Finally, we also compare to scrambled note, however, that this effect is most pronounced when Sobol sequences [Owen, 2003]. only a small number of datapoints have been collected. Af- ter ∼ 20 function evaluations the observations overwhelm We use the default settings for all baselines. For ALEBO and the prior p(τ) and the posterior quickly concentrates on the HeSBO we evaluate both de = 5 and de = 10 on the three two relevant dimensions. synthetic problems in Sec. 5.4. As de = 5 does not perform well on the three real-world applications in Sec. 5.5-5.7, we Given the good performance of all three values of α, for the instead evaluate de = 10 and de = 20 on these problems. remainder of our experiments we choose the intermediate value α = 0.1. While performance can perhaps be improved We also mention a baseline method for which we do not re- in some cases by tuning α, we find it encouraging that we port results, since it underperforms random search. Namely can get good performance with a single α. We emphasize for our surrogate model we use a quadratic polynomial over that α is the only hyperparameter that governs the function D with O(D2 ) coefficients governed by a sparsity-inducing
Branin (d = 2, D = 100) Hartmann (d = 6, D = 100) Rosenbrock (d = 3, D = 100) 8 −0.5 5 −1.0 4 6 Best value found Best value found Best value found −1.5 3 4 −2.0 2 −2.5 2 1 −3.0 0 0 −3.5 0 10 20 30 40 50 0 20 40 60 80 100 0 10 20 30 40 50 Number of evaluations Number of evaluations Number of evaluations 20 −0.5 5 −1.0 15 4 −1.5 Final value Final value Final value 3 10 −2.0 2 −2.5 5 1 −3.0 0 0 −3.5 SAASBO SMAC ALEBO (de = 5) HeSBO (de = 5) CMA-ES TuRBO EBO ALEBO (de = 10) HeSBO (de = 10) Sobol Figure 3: We compare SAASBO to seven baseline methods on three d−dimensional functions embedded in D = 100 dimensions. In each case we do 30 independent replications. Top row: For each method we depict the mean value of the best minimum found at a given iteration. Bottom row: For each method we depict the distribution over the final approximate minimum ymin encoded as a violin plot, with horizontal bars corresponding to 5%, 50%, and 95% quantiles. Horseshoe prior [Carvalho et al., 2009]. As in Baptista and lighting a serious downside of random embedding methods. Poloczek [2018], Oh et al. [2019], this finite feature ex- Crucially this important hyperparameter needs to be chosen pansion admits efficient inference with a Gibbs sampler. before the start of optimization and is not learned. Unfortunately, in our setting, where D is continuous and not discrete, this leads to pathological behavior when combined with EI, since the minima of simple parametric models are very likely to be found at the boundary of D. This is in con- trast to the mean-reverting behavior of a GP with a RBF or Matérn kernel, which is a much more appropriate modeling assumption in high dimensions. 5.5 ROVER TRAJECTORY PLANNING 5.4 SYNTHETIC PROBLEMS We consider a variation of the rover trajectory planning prob- lem from [Wang et al., 2018] where the task is to find an optimal trajectory through a 2d-environment. In the original In this section we consider the Branin (d = 2), Hartmann problem, the trajectory is determined by fitting a B-spline (d = 6), and Rosenbrock (d = 3) test functions embedded to 30 waypoints and the goal is to optimize the locations in a D = 100 space. These are problems with unambiguous of these waypoints. This is a challenging problem that re- low-dimensional structure where we expect both random quires thousands of evaluations to find good solutions, see embedding methods and SAASBO to perform well. e.g. [Eriksson et al., 2019]. To make the problem more Fig. 3 shows that SAASBO and ALEBO-5 perform the best suitable for small evaluation budgets, we require that the on Branin. SAASBO performs the best on Hartmann fol- B-spline starts and ends at the pre-determined starting posi- lowed by ALEBO-10. HeSBO performs well on Rosenbrock tion and destination. We also increase the dimensionality to and the final performance of SAASBO, HeSBO-5, HeSBO- D = 100 by using 50 waypoints. Fig. 4 shows that SAASBO 10, and ALEBO-5 are similar. However, both ALEBO and performs the best on this problem. This problem is chal- HeSBO show significant sensitivity to the embedded sub- lenging for all methods, each of which had at least one space dimension on at least two of the three problems, high- replication where the final reward was below 2.5.
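For concreteness, the synthetic benchmarks above embed a low-dimensional test function in a D = 100 domain by adding dummy dimensions that do not affect the output. The sketch below illustrates this construction for Branin; the choice of which two coordinates are "active" and the rescaling from the normalized unit cube to Branin's native domain are illustrative assumptions, not the exact setup used in the experiments.

```python
# Sketch of embedding the d = 2 Branin function in a D = 100 domain with
# dummy dimensions. Which two coordinates carry the function is arbitrary here.
import numpy as np


def branin(x1, x2):
    # Standard Branin on [-5, 10] x [0, 15]; global minimum ~ 0.3979.
    a, b, c = 1.0, 5.1 / (4 * np.pi ** 2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1 - t) * np.cos(x1) + s


def embedded_branin(x, active_dims=(0, 1)):
    """x lies in the normalized domain [0, 1]^100; only two coordinates matter."""
    x = np.asarray(x)
    x1 = -5.0 + 15.0 * x[active_dims[0]]   # rescale [0, 1] -> [-5, 10]
    x2 = 15.0 * x[active_dims[1]]          # rescale [0, 1] -> [0, 15]
    return branin(x1, x2)


# Example: evaluate at a random point in the 100-dimensional unit cube.
value = embedded_branin(np.random.rand(100))
```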
Rover (D = 100) SVM (D = 388) Vehicle Design (D = 124) −2.00 0.40 340 Negative Reward −2.25 0.35 320 −2.50 Vehicle mass 0.30 300 Test RMSE −2.75 0.25 280 −3.00 0.20 260 −3.25 −3.50 0.15 240 −3.75 0.10 220 0 20 40 60 80 100 0 20 40 60 80 100 0 100 200 300 400 Number of evaluations Number of evaluations Number of evaluations −1.5 0.45 340 −2.0 0.40 320 −2.5 0.35 300 Final value Final value Final value −3.0 0.30 280 −3.5 0.25 260 −4.0 0.20 −4.5 0.15 240 −5.0 0.10 220 SAASBO SMAC ALEBO (de = 10) HeSBO (de = 10) CMA-ES TuRBO EBO ALEBO (de = 20) HeSBO (de = 20) Sobol Figure 4: We compare SAASBO to baseline methods on rover trajectory planning (D = 100), SVM hyperparameter tuning (D = 388), and MOPTA vehicle design (D = 124). We do 30 independent replications for Rover and SVM and 15 replications for MOPTA. Top row: For each method we depict the mean value of the best minimum found at a given iteration. Bottom row: For each method we depict the distribution over the final approximate minimum ymin encoded as a violin plot, with horizontal bars corresponding to 5%, 50%, and 95% quantiles. 5.6 HYPERPARAMETER TUNING OF AN SVM variables describe materials, gauges, and vehicle shape. To accommodate our baseline methods, While some meth- We define a hyperparameter tuning problem using a ker- ods such as Scalable Constrained Bayesian Optimization nel support vector machine (SVM) trained on a 385- (SCBO) [Eriksson and Poloczek, 2020] can handle this con- dimensional regression dataset. This results in a D = 388 strained problem with thousands of evaluations, we convert problem, with 3 regularization parameters and 385 kernel the hard constraints into a soft penalty, yielding a scalar length scales. We expect this problem to have some amount objective function. Fig. 4 shows that SAASBO outperforms of low-dimensional structure, as we expect the regulariza- other methods by a large margin. TuRBO and CMA-ES tion parameters to be most relevant, with a number of length perform better than the remaining methods, which fail to scales of secondary, but non-negligible importance. This identify good solutions. While this problem does not have intuition is confirmed in Fig. 7 in the supplementary mate- obvious low-dimensional structure, our flexible SAAS prior rials, which demonstrates that SAASBO quickly focuses on still results in superior optimization performance. In Fig. 8 the regularization parameters, explaining the superior per- in the supplementary materials we see that this good per- formance of SAASBO seen in Fig. 4. ALEBO makes little formance can be traced to the adaptive parsimony of the progress after iteration 30, indicating that there may not be SAAS prior, which identifies small (d ∼ 2) subspaces at the any good solutions within the random embeddings. HeSBO beginning of optimization and increasingly larger (d ∼ 10) and EBO do better than the other methods, but fail to match subspaces towards the end. the final performance of SAASBO. 6 DISCUSSION 5.7 VEHICLE DESIGN Black-box optimization in hundreds of dimensions presents We consider the vehicle design problem MOPTA08, a chal- a number of challenges, many of which can be traced to the lenging real-world high-dimensional BO problem [Jones, many degrees of freedom that characterize high-dimensional 2008]. The goal is to minimize the mass of a vehicle sub- spaces. The majority of approaches to Bayesian optimiza- ject to 68 performance constraints. The D = 124 design tion try to circumvent this potential hazard by reducing the
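The MOPTA08 experiment above folds the 68 hard performance constraints into a soft penalty to obtain a single scalar objective. The exact penalty form and weight used by the authors are not specified in this section, so the sketch below shows one common choice (a weighted sum of constraint violations) purely for illustration.

```python
# One common way to fold hard constraints c_j(x) <= 0 into a scalar objective
# via a soft penalty. The penalty form and weight are assumptions, not
# necessarily the exact choice used for the MOPTA08 experiments.
import numpy as np


def penalized_objective(objective, constraints, weight=10.0):
    """Return a scalar objective: f(x) + weight * sum_j max(0, c_j(x))."""
    def wrapped(x):
        violations = np.array([c(x) for c in constraints])
        return objective(x) + weight * np.clip(violations, 0.0, None).sum()
    return wrapped


# Toy example with stand-ins for the vehicle mass and two constraints.
mass = lambda x: float(np.sum(x))
constraints = [lambda x: 0.25 - x[0], lambda x: x[1] - 0.75]
f_soft = penalized_objective(mass, constraints)
print(f_soft(np.array([0.1, 0.9, 0.5])))
```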
effective dimensionality of the problem. For example ran- optimization via random embeddings. Journal of global dom projection methods like ALEBO and HeSBO work optimization, 76(1):69–90, 2020. directly in a low-dimensional space, while methods like TuRBO or LineBO constrain the domain over which the James Bradbury, Roy Frostig, Peter Hawkins, acquisition function is optimized. We take the view that it Matthew James Johnson, Chris Leary, Dougal Maclaurin, is much more natural to work directly in the full space and and Skye Wanderman-Milne. JAX: Composable instead rely on a sparsity-inducing function prior to mitigate transformations of Python+NumPy programs, 2018. the curse of dimensionality. URL http://github.com/google/jax, 4:16, 2020. As we have shown in a comprehensive set of experiments, Antonio Candelieri, Raffaele Perego, and Francesco SAASBO outperforms state-of-the-art BO methods on sev- Archetti. Bayesian optimization of pump operations in eral synthetic and real-world problems. Our approach pro- water distribution systems. Journal of Global Optimiza- vides several distinct advantages: we highlight three. First, tion, 71(1):213–235, 2018. it preserves—and therefore can exploit—structure in the in- Carlos M Carvalho, Nicholas G Polson, and James G Scott. put domain, in contrast to methods like ALEBO or HeSBO Handling sparsity via the horseshoe. In Artificial Intelli- which risk scrambling it. Second, it is adaptive and exhibits gence and Statistics, pages 73–80. PMLR, 2009. little sensitivity to its hyperparameters. Third, it can nat- urally accommodate both input and output constraints, in Dheeru Dua and Casey Graff. Uci machine learning reposi- contrast to methods that rely on random projections, for tory, 2017. URL: http://archive.ics.uci.edu/ml, 7(1), 2019. which input constraints are particularly challenging. David Eriksson and Matthias Poloczek. Scalable While we have obtained strikingly good performance using a constrained Bayesian optimization. arXiv preprint simple acquisition strategy, it is likely that making the most arXiv:2002.08526, 2020. of our SAAS function prior will require a decision-theoretic framework that is better suited to high-dimensional settings. David Eriksson, Michael Pearce, Jacob R. Gardner, Ryan This is an interesting direction for future elaborations of Turner, and Matthias Poloczek. Scalable global optimiza- SAASBO. tion via local Bayesian optimization. In Advances in Neural Information Processing Systems 32, pages 5497– 5508, 2019. Acknowledgements Peter I Frazier. A tutorial on Bayesian optimization. arXiv We thank Neeraj Pradhan and Du Phan for help with preprint arXiv:1807.02811, 2018. NumPyro and Maximilian Balandat for providing feedback Jacob R. Gardner, Chuan Guo, Kilian Q. Weinberger, Ro- on the paper. man Garnett, and Roger B. Grosse. Discovering and ex- ploiting additive structure for Bayesian optimization. In References Proceedings of the 20th International Conference on Ar- tificial Intelligence and Statistics, volume 54 of Proceed- Maximilian Balandat, Brian Karrer, Daniel R. Jiang, Samuel ings of Machine Learning Research, pages 1311–1319. Daulton, Benjamin Letham, Andrew Gordon Wilson, and PMLR, 2017. Eytan Bakshy. Botorch: A framework for efficient Monte- Roman Garnett, Michael A. Osborne, and Philipp Hennig. Carlo Bayesian optimization. In Advances in Neural Active learning of linear embeddings for Gaussian pro- Information Processing Systems 33, 2020. cesses. 
In Proceedings of the Thirtieth Conference on Un- certainty in Artificial Intelligence, pages 230–239. AUAI Ricardo Baptista and Matthias Poloczek. Bayesian opti- Press, 2014. mization of combinatorial structures. volume 80 of Pro- ceedings of Machine Learning Research, pages 471–480. Nikolaus Hansen, Sibylle D Müller, and Petros Koumout- PMLR, 2018. sakos. Reducing the time complexity of the derandom- ized evolution strategy with covariance matrix adaptation Eli Bingham, Jonathan P Chen, Martin Jankowiak, Fritz (CMA-ES). Evolutionary computation, 11(1):1–18, 2003. Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D Goodman. José Miguel Hernández-Lobato, James Requeima, Ed- Pyro: Deep universal probabilistic programming. The ward O. Pyzer-Knapp, and Alán Aspuru-Guzik. Parallel Journal of Machine Learning Research, 20(1):973–978, and distributed Thompson sampling for large-scale accel- 2019. erated exploration of chemical space. In Proceedings of the 34th International Conference on Machine Learning, Mickaël Binois, David Ginsbourger, and Olivier Roustant. volume 70 of Proceedings of Machine Learning Research, On the choice of the low-dimensional domain for global pages 1470–1479. PMLR, 2017.
Matthew D Hoffman and Andrew Gelman. The No-U-Turn of Machine Learning Research, pages 3273–3281. PMLR, sampler: Adaptively setting path lengths in Hamiltonian 2018. Monte Carlo. J. Mach. Learn. Res., 15(1):1593–1623, 2014. David JC MacKay and Radford M Neal. Automatic rele- vance determination for neural networks. In Technical Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Se- Report in preparation. Cambridge University, 1994. quential model-based optimization for general algorithm Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. To- configuration. In International conference on learning ward global optimization, volume 2, chapter Bayesian and intelligent optimization, pages 507–523. Springer, methods for seeking the extremum, 1978. 2011. Mojmir Mutny and Andreas Krause. Efficient high dimen- Donald R Jones. Large-scale multi-disciplinary mass opti- sional Bayesian optimization with additivity and quadra- mization in the auto industry. In MOPTA 2008 Confer- ture Fourier features. In Advances in Neural Information ence (20 August 2008), 2008. Processing Systems 31, pages 9019–9030, 2018. Donald R Jones, Matthias Schonlau, and William J Welch. Amin Nayebi, Alexander Munteanu, and Matthias Poloczek. Efficient global optimization of expensive black-box func- A framework for Bayesian optimization in embedded tions. Journal of Global optimization, 13(4):455–492, subspaces. In Proceedings of the 36th International Con- 1998. ference on Machine Learning, volume 97 of Proceedings Kirthevasan Kandasamy, Jeff G. Schneider, and Barnabás of Machine Learning Research, pages 4752–4761. PMLR, Póczos. High dimensional Bayesian optimisation and ban- 2019. dits via additive models. In Proceedings of the 32nd In- Diana M Negoescu, Peter I Frazier, and Warren B Pow- ternational Conference on Machine Learning, volume 37 ell. The knowledge-gradient algorithm for sequencing of JMLR Workshop and Conference Proceedings, pages experiments in drug discovery. INFORMS Journal on 295–304. JMLR.org, 2015. Computing, 23(3):346–363, 2011. Diederik P. Kingma and Jimmy Ba. Adam: A method for ChangYong Oh, Efstratios Gavves, and Max Welling. stochastic optimization. In 3rd International Conference BOCK: Bayesian optimization with cylindrical kernels. on Learning Representations, 2015. In Proceedings of the 35th International Conference on Johannes Kirschner, Mojmir Mutny, Nicole Hiller, Ras- Machine Learning, volume 80 of Proceedings of Machine mus Ischebeck, and Andreas Krause. Adaptive and Learning Research, pages 3865–3874. PMLR, 2018. safe bayesian optimization in high dimensions via one- ChangYong Oh, Jakub M. Tomczak, Efstratios Gavves, and dimensional subspaces. In Proceedings of the 36th Inter- Max Welling. Combinatorial Bayesian optimization us- national Conference on Machine Learning, volume 97 ing the graph cartesian product. In Advances in Neural of Proceedings of Machine Learning Research, pages Information Processing Systems 32, pages 2910–2920, 3429–3438. PMLR, 2019. 2019. Benjamin Letham, Brian Karrer, Guilherme Ottoni, Eytan Art B Owen. Quasi-Monte Carlo sampling. Monte Carlo Bakshy, et al. Constrained Bayesian optimization with Ray Tracing: Siggraph, 1:69–88, 2003. noisy experiments. Bayesian Analysis, 14(2):495–519, 2019. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Benjamin Letham, Roberto Calandra, Akshara Rai, and Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Eytan Bakshy. 
Re-examining linear embeddings for high- et al. Scikit-learn: Machine learning in Python. The dimensional Bayesian optimization. In Advances in Neu- Journal of machine Learning research, 12:2825–2830, ral Information Processing Systems 33, 2020. 2011. Cheng Li, Sunil Gupta, Santu Rana, Vu Nguyen, Svetha Du Phan, Neeraj Pradhan, and Martin Jankowiak. Compos- Venkatesh, and Alistair Shilton. High dimensional able effects for flexible and accelerated probabilistic pro- Bayesian optimization using dropout. In Proceedings gramming in NumPyro. arXiv preprint arXiv:1912.11554, of the Twenty-Sixth International Joint Conference on 2019. Artificial Intelligence, pages 2096–2102. ijcai.org, 2017. Hong Qian, Yi-Qi Hu, and Yang Yu. Derivative-free opti- Xiaoyu Lu, Javier Gonzalez, Zhenwen Dai, and Neil D. mization of high-dimensional non-convex functions by Lawrence. Structured variationally auto-encoded opti- sequential random embeddings. In Proceedings of the mization. In Proceedings of the 35th International Con- Twenty-Fifth International Joint Conference on Artificial ference on Machine Learning, volume 80 of Proceedings Intelligence, pages 1946–1952. IJCAI/AAAI Press, 2016.
Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003. Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practi- cal Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2960–2968, 2012. Tsuyoshi Ueno, Trevor David Rhone, Zhufeng Hou, Teruyasu Mizoguchi, and Koji Tsuda. COMBO: An effi- cient Bayesian optimization library for materials science. Materials discovery, 4:18–21, 2016. Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scale Bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 745– 754. PMLR, 2018. Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando de Feitas. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55:361–387, 2016. Xinjie Yu and Mitsuo Gen. Introduction to evolutionary algorithms. Springer Science & Business Media, 2010. Ciyou Zhu, Richard H Byrd, Peihuang Lu, and Jorge No- cedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4): 550–560, 1997.
A INFERENCE Branin (d = 2, D = 100) 6 128-128-8 A.1 NUTS 5 512-256-16 Best value found We use the NUTS sampler implemented in NumPyro [Phan 4 et al., 2019, Bingham et al., 2019], which leverages JAX 3 for efficient hardware acceleration [Bradbury et al., 2020]. In most of our experiments (see Sec. E for exceptions) 2 we run NUTS for 768 = 512 + 256 steps where the first 1 Global Nwarmup = 512 samples are for burn-in and (diagonal) mass Minimum matrix adaptation (and thus discarded), and where we re- 0 0 10 20 30 40 50 tain every 16th sample among the final Npost = 256 samples Number of evaluations (i.e. sample thinning), yielding a total of L = 16 approximate posterior samples. It is these L samples that are then used Figure 5: We depict how SAASBO-NUTS per- to compute Eqns. (4), (5), (10). We also limit the maximum forms on Branin as we reduce the sampling budget tree depth in NUTS to 6. (Nwarmup , Npost , L) = (512, 256, 16) to (Nwarmup , Npost , L) = We note that these choices are somewhat conservative, and (128, 128, 8). We compare performance w.r.t. the best in many settings we would expect good results with fewer minimum found (the mean is depicted by a thick line samples. Indeed on the Branin test function, see Fig. 5, we and shaded bands denote standard errors). Each curve find a relatively marginal drop in performance when we corresponds to 60 independent replications of Algorithm 1. reduce the NUTS sampling budget as follows: i) reduce the number of warmup samples from 512 to 128; ii) reduce the number of post-warmup samples from 256 to 128; and iii) For each s = 1, ..., S we then compute the leave-one-out pre- reduce the total number of retained samples from 16 to 8. dictive log likelihood using the mean and variance functions We expect broadly similar results for many other problems. given in Eqns. (4)-(5). We then choose the value of s that See Sec. C for corresponding runtime results. maximizes this predictive log likelihood and use the corre- sponding kernel hyperparameter ψs to compute the expected It is worth emphasizing that while SAASBO requires speci- improvement in Eqn. (10). fying a few hyperparameters that control NUTS, these hy- perparameters are purely computational in nature, i.e. they have no effect on the SAAS function prior. Users simply A.3 NO DISCRETE LATENT VARIABLES choose a value of L that meets their computational budget. This is in contrast to e.g. the embedding dimension de that As discussed briefly in the main text, it is important that is required by ALEBO and HeSBO: the value of de often the SAAS prior defined in Sec. 4.1 does not include any has significant effects on optimization performance. discrete latent variables. Indeed a natural alternative to our model would introduce D binary-valued latent variables We also note that it is possible to make SAASBO-NUTS that control whether or not a given dimension is relevant to faster by means of the following modifications: modeling fobj . However, inference in any such model can be very challenging, as it requires exploring an extremely 1. Warm-start mass adaptation with mass matrices from large discrete space of size 2D . Our model can be under- previous iterations. stood as a continuous relaxation of such an approach. This 2. Instead of fitting a new SAAS GP at each iteration, only is a significant advantage since it means we can leverage fit every M iterations (say M = 5), and reuse hyperpa- gradient information to efficiently explore the posterior. 
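As a companion to the NUTS settings described in Sec. A.1 (512 warmup steps, 256 post-warmup samples thinned by 16, maximum tree depth 6), the snippet below shows how such a configuration could look with NumPyro's public MCMC API; `saas_gp_model`, `X_train`, and `y_train` refer to the illustrative model sketch and the data collected so far, and the exact call is an assumption rather than the authors' code.

```python
# NUTS configuration mirroring the sampling budget described in Sec. A.1:
# 512 warmup steps, 256 post-warmup samples thinned by 16, max tree depth 6.
# saas_gp_model is the illustrative model sketch from Sec. 4.1.
import jax.random as random
from numpyro.infer import MCMC, NUTS

kernel = NUTS(saas_gp_model, max_tree_depth=6)
mcmc = MCMC(
    kernel,
    num_warmup=512,     # burn-in + (diagonal) mass matrix adaptation (discarded)
    num_samples=256,    # post-warmup samples
    thinning=16,        # retain every 16th sample, i.e. L = 16 approximate samples
    progress_bar=False,
)
# X_train, y_train: the (normalized) query points and observations collected so far.
mcmc.run(random.PRNGKey(0), X_train, y_train)
samples = mcmc.get_samples()   # dict with sites "kernel_var", "tau", "rho"
```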
In- rameter samples {ψ` } across M iterations of SAASBO. deed, the structure of our sparsity-inducing prior closely mirrors the justly famous Horseshoe prior [Carvalho et al., 2009], which is a popular prior for Sparse Bayesian linear A.2 MAP regression. We note that in contrast to the linear regression setting of the Horseshoe prior, our sparsity-inducing prior We run the Adam optimizer [Kingma and Ba, 2015] for governs inverse squared length scales in a non-linear kernel 1500 steps and with a learning rate of 0.02 and β1 = 0.50 to and not variances. While we expect that any prior that con- maximize the log density centrates ρi at zero can exhibit good empirical performance Us (ψs |τs ) = log p(y|X, ψs ) + log p(ψs |τs ) (12) in the setting of high-dimensional BO, this raises the impor- tant question whether distributional assumptions other than w.r.t. ψs for S = 4 pre-selected values of τs : τs ∈ those in Eqn. (8) may be better suited to governing our prior {1, 10−1 , 10−2 , 10−3 }. This optimization is trivially opti- expectations about ρi . Making a careful investigation of this mized across S. point is an interesting direction for future work.
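The MAP procedure in Sec. A.2 (Adam for 1500 steps with learning rate 0.02 and β1 = 0.50, run for S = 4 pre-selected values of τ and selected by leave-one-out predictive log likelihood) can be sketched as follows. The `log_density` and `loo_log_likelihood` helpers are hypothetical placeholders for Eqn. (12) and the leave-one-out computation from Eqns. (4)-(5); the sketch uses Optax for Adam and is not the authors' implementation.

```python
# Sketch of the MAP procedure in Eqn. (12): for each pre-selected global
# shrinkage value tau_s, maximize the unnormalized log density with Adam and
# keep the solution with the best leave-one-out predictive log likelihood.
# log_density and loo_log_likelihood are hypothetical helpers for illustration.
import jax
import optax


def map_fit(log_density, loo_log_likelihood, init_params,
            taus=(1.0, 1e-1, 1e-2, 1e-3), num_steps=1500):
    candidates = []
    for tau in taus:
        params = dict(init_params)
        opt = optax.adam(learning_rate=0.02, b1=0.5)
        state = opt.init(params)
        loss_fn = lambda p: -log_density(p, tau)   # maximize U_s = minimize -U_s
        for _ in range(num_steps):
            grads = jax.grad(loss_fn)(params)
            updates, state = opt.update(grads, state)
            params = optax.apply_updates(params, updates)
        candidates.append((loo_log_likelihood(params), params))
    # Select the hyperparameters whose leave-one-out predictive likelihood is best.
    return max(candidates, key=lambda c: c[0])[1]
```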
B EXPECTED IMPROVEMENT D ADDITIONAL FIGURES AND MAXIMIZATION EXPERIMENTS We first form a scrambled Sobol sequence x1:Q (see D.1 MODEL FITTING e.g. [Owen, 2003]) of length Q = 5000 in the D-dimensional domain D. We then compute the expected improvement in In Fig. 6 we reproduce the experiment described in Sec. 5.1, Eqn. (10) in parallel for each point in the Sobol sequence. with the difference that we replace the RBF kernel with a We then choose the top K = 3 points in x1:Q , that yield the Matérn-5/2 kernel. largest EIs. For each of these K approximate maximizers we GP-MLE GP-NUTS-Dense GP-NUTS-SAAS run L-BFGS [Zhu et al., 1997] initialized with the approx- 2 (d = 6, D = 30) 2 (d = 6, D = 30) 2 (d = 6, D = 30) imate maximizer and using the implementation provided Predicted value Predicted value Predicted value by Scipy (in particular fmin_l_bfgs_b) to obtain the 0 0 0 final query point xnext , which (approximately) maximizes −2 −2 −2 Eqn. (10). We limit fmin_l_bfgs_b to use a maximum of 100 function evaluations. −4 −4 −2 0 2 −4 −4 −2 0 2 −4 −4 −2 0 2 True value True value True value GP-MLE GP-NUTS-Dense GP-NUTS-SAAS (d = 6, D = 100) (d = 6, D = 100) (d = 6, D = 100) 2 2 2 Predicted value Predicted value Predicted value C RUNTIME EXPERIMENT 0 0 0 −2 −2 −2 We measure the runtime of SAASBO as well as each base- line method on the Branin test problem. See Table 1 for the −4 −4 −2 0 2 −4 −4 −2 0 2 −4 −4 −2 0 2 True value True value True value results. We record runtimes for both the default SAASBO- NUTS settings described in Sec. A.1 as well as one with a Figure 6: This figure is an exact reproduction of Fig. 1 in the reduced NUTS sampling budget. While SAASBO requires main text apart from the use of a Matérn-5/2 kernel instead of a RBF kernel. We compare model fit for three models Table 1: Average runtime per iteration on the Branin using the same train/test data obtained from two indepen- test function embedded in a 100-dimensional space. Each dent runs of Algorithm 1 on the d = 6 Hartmann function method uses m = 10 initial points and a total of 50 function embedded in D ∈ {30, 100} dimensions. We compare: (left) evaluations. Runtimes are obtained using a 2.4 GHz 8-Core a GP fit with MLE; (middle) a GP with weak priors fit with Intel Core i9 CPU outfitted with 32 GB of RAM. NUTS; and (right) a GP with a SAAS prior (this paper; see Eqn. (8)) fit with NUTS. In D = 30 dimensions (top row) Method Time / iteration all models provide good fits to the test data. In D = 100 SAASBO (default) 26.51 seconds dimensions (bottom row) only SAAS provides a good fit. In SAASBO (128-128-8) 19.21 seconds each figure mean predictions are depicted with dots and bars TuRBO 1.52 seconds denote 95% confidence intervals. SMAC 12.12 seconds EBO 128.10 seconds ALEBO (de = 5) 4.34 seconds We note that the qualitative behavior in Fig. 6 matches the ALEBO (de = 10) 11.91 seconds behavior in Fig. 1. In particular, in D = 100 dimensions HeSBO (de = 5) 0.70 seconds only the sparsity-inducing SAAS function prior provides HeSBO (de = 10) 1.51 seconds a good fit. This emphasizes that the potential for drastic CMA-ES < 0.1 seconds overfitting that arises when fitting a non-sparse GP in high Sobol < 0.01 seconds dimensions is fundamental and is not ameliorated by using a different kernel. 
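The EI maximization scheme described in Sec. B (score Q = 5000 scrambled Sobol candidates, polish the top K = 3 with L-BFGS-B limited to 100 function evaluations, return the best) can be sketched as below. The callable `ei(x)` stands for the sample-averaged expected improvement of Eqn. (10) and is assumed to be supplied; the use of numerical gradients via `approx_grad` is a simplification for illustration.

```python
# Sketch of the EI maximization scheme in Sec. B: score Q Sobol candidates,
# refine the top K with L-BFGS-B (maxfun = 100), and return the best point.
# ei(x) is the sample-averaged EI in Eqn. (10), passed in by the caller.
import numpy as np
from scipy.optimize import fmin_l_bfgs_b
from scipy.stats import qmc


def maximize_ei(ei, dim, Q=5000, K=3, seed=0):
    candidates = qmc.Sobol(d=dim, scramble=True, seed=seed).random(Q)
    scores = np.array([ei(x) for x in candidates])
    best_x, best_val = None, -np.inf
    for idx in np.argsort(-scores)[:K]:
        # Minimize -EI within the unit hypercube, starting from a top candidate.
        x_opt, neg_ei, _ = fmin_l_bfgs_b(
            lambda x: -ei(x),
            x0=candidates[idx],
            bounds=[(0.0, 1.0)] * dim,
            approx_grad=True,
            maxfun=100,
        )
        if -neg_ei > best_val:
            best_x, best_val = x_opt, -neg_ei
    return best_x
```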
In particular the fact that the Matérn-5/2 kernel decays less rapidly at large distances as compared to more time per iteration than other methods such as TuRBO the RBF kernel (quadratically instead of exponentially) does and HeSBO, the overhead is relatively moderate in the set- not prevent the non-sparse models from yielding essentially ting where the black-box function fobj is very expensive to trivial predictions across most of the domain D. evaluate. We note that after reducing the NUTS sampling budget to (Nwarmup , Npost , L) = (128, 128, 8) about 75% of the runtime is devoted to EI optimization. Since our current D.2 SVM RELEVANCE PLOTS implementation executes K = 3 runs of L-BFGS serially, this runtime could be reduced further by executing L-BFGS In Fig. 7 we explore the relevant subspace identified by in parallel. SAASBO during the course of optimization of the SVM
problem discussed in Sec. 5.6. We see that the three most MOPTA (D = 124) 10 important hyperparameters, namely the regularization hyper- parameters, are consistently found more or less immediately Effective subspace dimension 8 once the initial Sobol phase of Algorithm 1 is over. This explains the rapid early progress that SAASBO makes in 6 Fig. 4 during optimization. We note that the 4th most rele- vant dimension turns out to be a length scale for a patient 4 ID feature, which makes sense given the importance of this feature to the regression problem. 2 ρ > 0.1 100 200 300 400 SVM (D = 388) SVM (D = 388) Number of evaluations 3.2 5 # of relevant dimensions found Effective subspace dimension 3.0 4 2.8 Figure 8: We depict the effective subspace dimension dur- 3 ing the course of a single run of Algorithm 1 on the 2.6 2 MOPTA vehicle design problem. Here the effective sub- 2.4 space dimension is the number of dimensions for which 1 2.2 ρ > 0.1 ρ > 0.5 PosteriorMedian(ρk ) > ξ , with ξ = 0.1 an arbitrary cutoff. 2.0 0 20 40 60 80 100 20 40 60 80 100 Number of evaluations Number of evaluations SVM (D = 388) SVM (D = 388) 0.40 0.35 Figure 7: Left: We depict the mean number of regulariza- 0.35 0.30 tion hyperparameters that have been ‘found’ in the SVM 0.30 Test RMSE Final value 0.25 problem, where a regularization hyperparameter is ‘found’ 0.25 if its corresponding PosteriorMedian(ρk ) is among the three 0.20 0.20 largest {PosteriorMedian(ρi )}Di=1 . Note that there are three 0.15 0.15 regularization hyperparameters in total. Right: We depict the mean effective subspace dimension, defined to be the 0.10 0 25 50 75 100 0.10 number of dimensions for which PosteriorMedian(ρk ) > ξ Number of evaluations where ξ ∈ {0.1, 0.5} is an arbitrary cutoff. Means are aver- SAASBO (RBF) BO-NUTS-Dense (RBF) Sobol SAASBO (Matérn-5/2) BO-NUTS-Dense (Matérn-5/2) ages across 30 independent replications. Figure 9: We compare the BO performance of the SAAS function prior to a non-sparse function prior on the SVM D.3 MOPTA08 RELEVANCE PLOTS hyperparameter tuning problem (D = 388). In addition we compare the RBF kernel to the Matérn-5/2 kernel. We do In Fig. 8 we see that during the course of a single run of 15 independent replications for each method, except for SAASBO on the MOPTA08 vehicle design problem, the SAASBO-RBF and Sobol, for which we reproduce the same effective dimension of the identified subspace steadily in- 30 replications from the main text. Left: For each method creases from about 2 to about 10 as more evaluations are we depict the mean value of the best minimimum found at a collected. Using an increasingly flexible surrogate model given iteration. Right: For each method we depict the dis- over the course of optimization is key to the excellent opti- tribution over the final approximate minimum ymin encoded mization performance of SAASBO on this problem. as a violin plot, with horizontal bars corresponding to 5%, 50%, and 95% quantiles. D.4 SVM ABLATION STUDY In Fig. 9 we depict results from an ablation study of D.5 ROTATED HARTMANN SAASBO in the context of the SVM problem. First, as a companion to Fig. 1 and Fig. 6, we compare the BO perfor- In this experiment we study how the axis-aligned assump- mance of the SAAS function prior to a non-sparse function tion in SAAS affects the performance if we rotate the co- prior that places weak priors on the length scales. As we ordinate system. In particular, we consider the Hartmann would expect from Fig. 1 and Fig. 
6, the resulting BO per- function fhart for d = 6 embedded in D = 100. Given a linear formance is very poor for the non-sparse prior. Second, we projection dimensionality d p ≥ d, we generate a random pro- also compare the default RBF kernel to a Matérn-5/2 kernel. jection Pd p ∈ Rd p ×d where [Pd p ]i j ∼ N (0, 1/d p ). The goal We find that, at least on this problem, both kernels lead to is to optimize f˜(x) = fhart (Pd p x1:d p − z)) where x ∈ [0, 1]D similar BO performance. and z ∈ Rd . Given a Pd p , z is a vector in [0, 1]d such that
f˜([x∗ ; w]) = fhart (x∗ ), ∀w ∈ [0, 1]D−d where x∗ is the global does not work well on any problem so we instead report optimum of the Hartmann function. The translation z guar- results for de = 10 and de = 20. antees that the global optimum value is attainable in the For CMA-ES we use the pycma4 implementation. CMA- domain. We consider d p = 6, 18, 30 and generate a random ES is initialized using a random point in the domain and Pd p and z for each embedded dimensionality that we use uses the default initial step-size of 0.25. Recall that the for all replications. EBO is excluded from this study as it domain is normalized to [0, 1]D for all problems. We run performed worse than Sobol in Fig. 3. EBO using the reference implementation by the authors5 The results are shown in Fig. 10. We see that SAASBO with the default settings. EBO requires knowing the value outperforms the other methods even though the function has of the function at the global optimum. Similarly to [Letham been rotated, which violates the axis-aligned structure. Even et al., 2020] we provide this value to EBO for all problems, though the function is rotated, SAASBO quickly identifies but note that EBO still performs poorly on all problems the most important parameters in the rotated space. We apart from Branin and SVM. also notice that the worst-case performance of SAASBO Our comparison to SMAC uses SMAC4HPO, which is im- is better than for the other methods across all projection plemented in SMAC36 . On all problems we run SMAC in dimensionalities considered. deterministic mode, as all problems considered in this paper are noise-free. For Sobol we use the SobolEngine imple- E ADDITIONAL EXPERIMENTAL mentation in PyTorch. Finally, we compare to TuRBO with DETAILS a single trust region due to the limited evaluation budget; we use the implementation provided by the authors7 . Apart from the experiment in Sec. 5.2 that is depicted in Fig. 2 we use α = 0.1 in all experiments. Apart from Fig. 6 E.4 SYNTHETIC PROBLEMS and Fig. 9, we use an RBF kernel in all experiments. We consider three standard synthetic functions from the E.1 MODEL FIT EXPERIMENT optimization literature. Branin is a 2-dimensional function that we embed in a 100-dimensional space. We consider the In the model fit experiment in Sec. 5.1 we take data collected standard domain [−5, 10] × [0, 15] before normalizing the from two different runs of SAASBO in D = 100. We use one domain to [0, 1]100 . For Hartmann, we consider the d = 6 run as training data and the second run as test data, each version on the domain [0, 1]6 before embedding it in a 100- with N = 100 datapoints. To construct datasets in D = 30 dimensional space. For Rosenbrock, we use d = 3 and the dimensions we include the 6 relevant dimensions as well domain [−2, 2]3 , which we then embed and normalize so as 24 randomly chosen redundant dimensions and drop all that the full domain is [0, 1]100 . Rosenbrock is a function that remaining dimensions. is challenging to model, as there are large function values at the boundary of the domain. For this reason all methods minimize log(1 + fobj (x)). All methods except for CMA- E.2 INFERENCE AND HYPERPARAMETER ES are initialized with m = 10 initial points for Branin and COMPARISON EXPERIMENT Rosenbrock and m = 20 initial points for Hartmann. For the experiment in Sec. 5.2 that is depicted in Fig. 2 we initialize SAASBO with m = 10 points from a Sobol E.5 ROVER sequence. 
We consider the rover trajectory optimization problem that was also considered in Wang et al. [2018]. The goal is to E.3 BASELINES optimize the trajectory of a rover where this trajectory is determined by fitting a B-spline to 30 waypoints in the 2D We compare SAASBO to ALEBO, CMA-ES, EBO, HeSBO, plane. While the original problem had a pre-determined SMAC, Sobol, and TuRBO. For ALEBO and HeSBO we origin and destination, the resulting B-spline was not con- use the implementations in BoTorch [Balandat et al., 2020] strained to start and end at these positions. To make the with the same settings that were used by [Letham et al., problem easier, we force the B-spline to start and end at 2020]. We consider embeddings of dimensionality de = 5 these pre-determined positions. Additionally, we use 50 and de = 10 on the synthetic problems, which is similar to the de = d and de = 2d heuristics that were considered 4 https://github.com/CMA-ES/pycma in [Nayebi et al., 2019] as well as [Letham et al., 2020]. 5 https://github.com/zi-w/ As the true active dimensionality d of fobj is unknown, we Ensemble-Bayesian-Optimization do not allow any method to explicitly use this additional 6 https://github.com/automl/SMAC3 information. For the three real-world experiments, de = 5 7 https://github.com/uber-research/TuRBO