Surrogate Model Based Hyperparameter Tuning for Deep Learning with SPOT

A Preprint

arXiv:2105.14625v1 [cs.LG] 30 May 2021

Thomas Bartz-Beielstein∗
Institute for Data Science, Engineering, and Analytics
Technische Hochschule Köln
5164 Gummersbach, Germany
thomas.bartz-beielstein@th-koeln.de

June 1, 2021

Abstract

A surrogate model based hyperparameter tuning approach for deep learning is presented. This article demonstrates how the architecture-level parameters (hyperparameters) of deep learning models that were implemented in Keras/TensorFlow can be optimized. The implementation of the tuning procedure is 100% based on R, the software environment for statistical computing. With a few lines of code, existing R packages (tfruns and SPOT) can be combined to perform hyperparameter tuning. An elementary hyperparameter tuning task (neural network and the MNIST data) is used to exemplify this approach.

Keywords hyperparameter tuning · deep learning · hyperparameter optimization · surrogate model based optimization · sequential parameter optimization

1 Introduction

Deep-learning models require the specification of several architecture-level parameters, which are called hyperparameters. Hyperparameters are to be distinguished from the parameters of a model that are optimized during the training phase, e.g., via backpropagation. We will consider hyperparameters in this article. Typical questions regarding hyperparameters in deep-learning models are as follows:

1. How many layers should be stacked?
2. Which dropout rate should be used?
3. How many filters (units) should be used in each layer?
4. Which activation function should be used?

In many real-world projects, experienced deep-learning engineers build intuition over time as to what reasonable hyperparameters are, i.e., Hyperparameter Tuning (HPT) skills are developed. These skills are based on human expert and domain knowledge and not on valid formal rules. Chollet and Allaire [2018] describe the situation as follows:

If you want to get to the very limit of what can be achieved on a given task, you can't be content with arbitrary [hyperparameter] choices made by a fallible human. Your initial decisions are almost always suboptimal, even if you have good intuition. You can refine your choices by tweaking them by hand and retraining the model repeatedly—that's what machine-learning engineers and researchers spend most of their time doing.

∗ https://www.spotseven.de
But it shouldn't be your job as a human to fiddle with hyperparameters all day—that is better left to a machine.

HPT develops tools to explore the space of possible hyperparameter configurations systematically, in a principled way. The HPT process typically looks like this:

(HPT-1) Parameter selection. Choose a set of hyperparameters.
(HPT-2) Model building. Build the corresponding Deep Neural Network (DNN) model.
(HPT-3) Model training and evaluation. Fit the model to the training data and measure the final performance on the validation data.
(HPT-4) Parameter update. Choose the next set of hyperparameters to try.
(HPT-5) Looping. Repeat until the budget is exhausted.
(HPT-6) Final evaluation on test data, i.e., measuring performance on the test (hold-out) data.

Training the weights of a DNN model is relatively easy, e.g., a loss function is evaluated on a mini-batch of data and then the backpropagation algorithm is used to move the weights in the right direction [Chollet and Allaire 2018]. Updating hyperparameters, on the other hand, is extremely challenging, because computing the feedback signal can be extremely expensive: it requires creating and training a new model on a dataset. Furthermore, the hyperparameter space is not continuous or differentiable, because it is also made of discrete decisions. Standard gradient methods are therefore not applicable in hyperparameter space. Instead, gradient-free optimization techniques, which sometimes are far less efficient than gradient methods, are applied. These challenges make HPT a difficult task. The key to this process is the HPT algorithm that uses the history of validation performance, given various sets of hyperparameters, to choose the next set of hyperparameters to evaluate. Several different approaches are possible:

• manual search,
• simple random search, i.e., choosing hyperparameters to evaluate at random, repeatedly,
• grid and pattern search [Meignan et al. 2015] [Lewis et al. 2000] [Tatsis and Parsopoulos 2016],
• model-free algorithms, i.e., algorithms that do not explicitly make use of a model, e.g., evolutionary algorithms [Hansen 2006] [Bartz-Beielstein et al. 2014],
• hyperband, i.e., a multi-armed bandit strategy that dynamically allocates resources to a set of random configurations and uses successive halving to stop poorly performing configurations [Li et al. 2016],
• Surrogate Model Based Optimization (SMBO) such as the Sequential Parameter Optimization Toolbox (SPOT) [Bartz-Beielstein et al. 2005] [Bartz-Beielstein et al. 2021].

Grid search and manual search are probably the most widely used strategies for HPT. Bergstra and Bengio [2012] demonstrate empirically and show theoretically that randomly chosen trials are more efficient for HPT than trials on a grid. Due to its simplicity, random search turns out to be the best solution in many situations. Hyperband can result in a worse final performance than model-based approaches, because it only samples configurations randomly and does not learn from previously sampled configurations. SMBO based on Kriging (aka Gaussian processes or Bayesian Optimization (BO)) has been successfully applied to HPT in several works, e.g., Bartz-Beielstein and Markon [2004] propose a combination of classical statistical tools, BO (Design and Analysis of Computer Experiments (DACE)), and Classification and Regression Trees (CART) as a surrogate model. Snoek et al.
[2012] used the CIFAR-10 dataset, which consists of 60,000 32x32 color images in ten classes, for optimizing the hyperparameters of a Convolutional Neural Network (CNN). Bergstra et al. [2013] optimized a three-layer CNN and Mendoza et al. [2019] participated in the 2016 AutoML challenge. O'Malley et al. [2019] present Keras Tuner, a hyperparameter tuner for Keras with TensorFlow 2.0. First, a model-building function is defined, which takes an argument from which hyperparameters, such as the number of units (hidden nodes) of the neural network, can be sampled. Next, a tuner is instantiated. Available tuners are RandomSearch and Hyperband. HPT is a powerful technique that is an absolute requirement to get to state-of-the-art models on any task or to win machine-learning competitions. However, there are important issues to keep in mind when doing HPT: e.g., validation-set overfitting can occur, because hyperparameters are optimized based on information derived from the validation data. Falkner et al. [2018] claim that practical Hyperparameter Optimization (HPO) solutions should fulfill the following requirements: strong anytime and final performance, effective use
of parallel resources, scalability, as well as robustness and flexibility. A treatment for these issues is proposed by Bartz-Beielstein et al. [2020]. Although their recommendations were developed for benchmark studies in optimization, they are also relevant for HPT, because HPT can be seen as a special benchmarking variant.

(R-1) Goals: what are the reasons for performing benchmark studies?
(R-2) Problems: how to select suitable problems?
(R-3) Algorithms: how to select a portfolio of algorithms to be included in the benchmark study?
(R-4) Performance: how to measure performance?
(R-5) Analysis: how to evaluate results?
(R-6) Design: how to set up a study, e.g., how many runs shall be performed?
(R-7) Presentation: how to describe results?
(R-8) Reproducibility: how to guarantee scientifically sound results and how to guarantee a lasting impact, e.g., in terms of comparability?

In addition to these recommendations, there are some specific issues that are caused by the deep-learning setup. These will be discussed in Sec. 4. Note that sometimes the terms HPT and HPO are used synonymously. We will use these terms with different meanings: HPO develops and applies methods to determine the best hyperparameters in an effective and efficient manner. HPT develops and applies methods that try to understand the effects and interactions of hyperparameters to enable learning and understanding. This article proposes an HPT approach based on SPOT that focuses on the following topics:

Understanding. In contrast to standard HPO approaches, SPOT provides statistical tools for understanding hyperparameter importance and interactions between several hyperparameters.
Transparency and Explainability. Understanding is a key tool for enabling transparency, e.g., quantifying the contribution of deep learning components (layers, activation functions, etc.).
Reproducibility. The software code used in this study is available in the open source R software environment for statistical computing and graphics (R) as the package SPOT via the Comprehensive R Archive Network (CRAN).

This paper is structured as follows: Section 2 describes materials and methods that were used for the experimental setup. Section 3 presents results from a simple experiment. A discussion is presented in Sec. 4. The appendix contains information on how to set up the Python software environment for performing HPT with SPOT and Training Run Tools for TensorFlow (tfruns). Source code for performing the experiments will be included in the R package SPOT. Further information is published on https://www.spotseven.de and, with some delay, on CRAN (https://cran.r-project.org/package=SPOT).

2 Materials and Methods

2.1 Benchmarks and Software Packages

Before presenting the elements (benchmarks and software tools) for the experiments, we consider similar approaches that might be worth looking at. This list is not complete and will be updated in forthcoming versions of this paper.

2.1.1 Software Packages

2.1.2 Hyperparameter Optimization Software

Balaprakash et al. [2018] present DeepHyper, a Python package that provides a common interface for the implementation and study of scalable hyperparameter search methods. Karmanov et al. [2018] created a Rosetta Stone of deep-learning frameworks to allow data scientists to easily leverage their expertise from
one framework to another. They provide a common setup for comparisons across GPUs (potentially CUDA versions and precision) and for comparisons across languages (Python, Julia, R). Users should be able to verify the expected performance of their own installation.

2.1.3 Artificial Toy Functions

Because BO does not work well on high-dimensional mixed continuous and categorical configuration spaces, Falkner et al. [2018] use a simple counting-ones problem to analyze this issue. Zaefferer and Bartz-Beielstein [2016] discuss these problems in greater detail. How to implement BO for discrete (and continuous) optimization problems is analyzed in the seminal paper by Bartz-Beielstein and Zaefferer [2017].

2.1.4 Experiments on Surrogate Benchmarks

Falkner et al. [2018] optimized six hyperparameters that control the training procedure of a fully connected neural network (initial learning rate, batch size, dropout, exponential decay factor for learning rate) and the architecture (number of layers, units per layer) for six different datasets gathered from OpenML [Vanschoren et al. 2014], see Table 1.

Table 1: The hyperparameters and architecture choices for the fully connected networks as defined in Falkner et al. [2018]

Hyperparameter             Lower Bound   Upper Bound   Log-transform
batch size                 2^3           2^8           yes
dropout rate               0             0.5           no
initial learning rate      1e-6          1e-2          yes
exponential decay factor   -0.185        0             no
# hidden layers            1             5             no
# units per layer          2^4           2^8           yes

Instead of training the networks directly, Falkner et al. [2018] used a surrogate network. To build a surrogate, they sampled 10,000 random configurations for each data set, trained them for 50 epochs, and recorded the classification error after each epoch and the total training time. Two independent random forest models were fitted to predict these two quantities as a function of the hyperparameter configuration used. Falkner et al. [2018] note that Hyperband (HB) initially performed much better than the vanilla BO methods and achieved a roughly three-fold speedup over random search. Instead of using the surrogate network, we will use the original networks. Our approach is described in Sec. 2.3.

2.2 Software: Keras, Tensorflow, tfruns, and SPOT

2.2.1 Keras and Tensorflow

Keras is the high-level Application Programming Interface (API) of TensorFlow (TF), which is developed with a focus on enabling fast experimentation. TF is an open source software library for numerical computation using data flow graphs [Abadi et al. 2016]. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them [O'Malley et al. 2019]. The tensorflow R package provides access to the complete TF API from within R.

2.2.2 The R Package tfruns

The R package tfruns2 provides a suite of tools for tracking, visualizing, and managing TF training runs and experiments from R. tfruns enables tracking the hyperparameters, metrics, output, and source code of every training run and comparing hyperparameters and metrics across runs to find the best performing model. It automatically generates reports to visualize individual training runs or comparisons between runs. tfruns can be used without any changes to source code, because run data is automatically captured for all Keras and TF models.

2 https://cran.r-project.org/package=tfruns, https://tensorflow.rstudio.com/tools/tfruns
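The flag mechanism of tfruns is the hook that will be used later for the tuning: a training script declares its tunable quantities as flags, and each run can override them. The following is a minimal sketch of this mechanism; the script name mnist_flags.R and the flag names are placeholders chosen for this illustration, not part of the tfruns package itself.

# Inside a training script, e.g., "mnist_flags.R" (hypothetical name):
library(tfruns)
FLAGS <- flags(
  flag_numeric("dropout1", 0.4),   # default values double as the default model
  flag_integer("units1", 256)
)
# ... the script then builds and fits the Keras model using FLAGS$dropout1, FLAGS$units1 ...

# From the R console, a set of runs over these flags can be launched and inspected:
library(tfruns)
runs <- tuning_run("mnist_flags.R", flags = list(
  dropout1 = c(0.3, 0.4),
  units1   = c(128, 256)
))
ls_runs()   # lists the tracked runs together with their flags and metrics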
2.2.3 SPOT

The SPOT package for R is a toolbox for tuning and understanding simulation and optimization algorithms [Bartz-Beielstein et al. 2021]. SMBO investigations are common approaches in simulation and optimization. Sequential parameter optimization has been developed because there is a strong need for sound statistical analysis of simulation and optimization algorithms. SPOT includes methods for tuning based on classical regression and analysis of variance techniques; tree-based models such as CART and random forest; BO (Gaussian process models, aka Kriging); and combinations of different meta-modeling approaches. SPOT implements key techniques such as exploratory fitness landscape analysis and sensitivity analysis. SPOT can be used for understanding the performance of algorithms and gaining insight into an algorithm's behavior. Furthermore, SPOT can be used as an optimizer and for automatic and interactive tuning. SPOT finds improved solutions in the following way:

1. Initially, a population of (random) solutions is created.
2. A set of surrogate models is specified.
3. Then, the solutions are evaluated on the objective function.
4. Next, surrogate models are built.
5. A global search is performed to generate new candidate solutions.
6. The new solutions are evaluated on the objective function.

These steps are repeated until a satisfying solution has been found, see [Bartz-Beielstein et al. 2021]. How these software packages (Keras/TF, tfruns, and SPOT) can be combined in a very efficient and effective manner will be exemplified in the following section.

2.3 Example: Tuning Hyperparameters with SPOT

The general DNN workflow will be as follows: first, the training data, train_images and train_labels, are fed to the DNN. The DNN will then learn to associate images and labels. Finally, the trained DNN should produce predictions for test_images. These predictions are evaluated, i.e., it is validated whether they match the labels from test_labels.

2.3.1 The Data Set: MNIST

The DNN in this example uses the Keras R package to learn to classify hand-written digits from the Modified National Institute of Standards and Technology (MNIST) data set. This is a supervised multi-class classification problem, i.e., grayscale images of handwritten digits (28 pixels by 28 pixels) should be assigned to ten categories (0 to 9). MNIST is a set of 60,000 training and 10,000 test images. The MNIST data set comes preloaded in Keras, in the form of train and test lists, each of which includes a set of images (x) and associated labels (y): train_images and train_labels form the training set, the data that the model will learn from. The model will then be tested on the test set, test_images and test_labels. The images are encoded as 3D arrays, and the labels are a 1D array of digits, ranging from 0 to 9. There is a one-to-one correspondence between the images and the labels. Before training the DNN, the data are preprocessed by reshaping them into the shape the DNN can process. All values will be scaled so that they lie in the [0, 1] interval. The original training images were stored in an array of shape (60000, 28, 28) of type integer with values in the [0, 255] interval. They are transformed into a double array of shape (60000, 28 × 28) with grayscale values between 0 and 1. Furthermore, the labels are categorically encoded.
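In R, these preprocessing steps and the default network can be written down in a few lines. The snippet below is a minimal sketch, assuming the preprocessing described above and the default architecture from Table 2 (two hidden layers with 256 and 128 units, dropout rates of 0.4 and 0.3, and a ten-unit softmax output layer); it uses standard functions of the keras R package and is not necessarily the exact code of the original study.

library(keras)

# Load MNIST and split into training and test sets
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test  <- mnist$test$x
y_test  <- mnist$test$y

# Reshape the 3D image arrays to (n, 784) and rescale to [0, 1]
x_train <- array_reshape(x_train, c(nrow(x_train), 784)) / 255
x_test  <- array_reshape(x_test,  c(nrow(x_test),  784)) / 255

# Categorically encode the labels (one-hot vectors of length 10)
y_train <- to_categorical(y_train, 10)
y_test  <- to_categorical(y_test, 10)

# Default model: two hidden layers with dropout (defaults from Table 2),
# softmax output layer with ten units (one per digit class)
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = 'softmax')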
Finally, (i) the loss function, which determines how good the DNN's predictions are on the training data, (ii) the optimizer, i.e., the update mechanism of the network, which adjusts the weights using backpropagation, and (iii) the metrics, which monitor the progress during training and testing, are specified using the compile function from Keras.

model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(lr = 0.01),
  metrics = c('accuracy')
)

The DNN training can be started as follows (using Keras's fit function).

history <- model %>% keras::fit(
  x_train, y_train,
  batch_size = 128,
  epochs = 20,
  verbose = 0,
  validation_split = 0.2
)

Two quantities are displayed during training: (i) the loss of the network over the training data, and (ii) the accuracy of the network over the training data. Figure 1 illustrates that an accuracy greater than 95 percent on the training data can be reached quickly. Finally, using Keras's evaluate function, the DNN model performance can be checked on the test data set:

score <- model %>% evaluate(
  x_test, y_test,
  verbose = 0
)
cat('Test loss:', score[[1]], '\n')
cat('Test accuracy:', score[[2]], '\n')

Test loss: 0.2828377
Test accuracy: 0.9638
Figure 1: Loss and accuracy plotted against epochs.

Running the DNN model as a standalone process before starting the tuning process is strongly recommended. As shown in this section, the default DNN model seems to work fine. The tuning process can be started next.

2.3.3 Interfacing tfruns from SPOT

After testing the model as a standalone implementation, the model can be combined with the SPOT framework. A wrapper function is used to connect tfruns to the SPOT tuner. The setup requires only a few lines of R code.
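The following is a hedged sketch of such a wrapper. It assumes a training script (here called "kerasModel.R", a placeholder name) that declares one tfruns flag per hyperparameter of Table 2 and reports the validation loss; funTfrunsSingle() evaluates a single configuration, and funTfruns() applies it row-wise, because SPOT passes a matrix with one candidate configuration per row.

library(tfruns)

# Evaluate one hyperparameter configuration x (ordered as in Table 2)
# by launching a single tfruns run of the training script.
funTfrunsSingle <- function(x) {
  runs <- tuning_run("kerasModel.R", flags = list(
    dropout1 = x[1],
    dropout2 = x[2],
    units1   = round(x[3]),
    units2   = round(x[4]),
    lr       = x[5],
    epochs   = round(x[6])
  ))
  runs$metric_val_loss   # assumes the script tracks validation loss
}

# SPOT expects an objective function that maps a matrix of candidate
# configurations (one row each) to a column of objective values.
funTfruns <- function(X) {
  matrix(apply(X, 1, funTfrunsSingle), ncol = 1)
}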
Table 2: The hyperparameters and architecture choices for the first DNN example: fully connected networks

Variable Name   Hyperparameter              Type      Default   Lower Bound   Upper Bound
x1              first layer dropout rate    numeric   0.4       0             1
x2              second layer dropout rate   numeric   0.3       0             1
x3              units per first layer       integer   256       16            512
x4              units per second layer      integer   128       4             256
x5              learning rate               numeric   0.001     0.0001        0.1
x6              training epochs             integer   20        5             25

The following hyperparameters are tuned:
(i) the dropout rate, i.e., the fraction of the units that are dropped out during training [Chollet and Allaire 2018]; the dropout rate of the first and second layer will be tuned individually;
(ii) the number of units, i.e., the number of single outputs from a single layer; the number of units of the first and second layer will be tuned individually;
(iii) the learning rate, which controls how much to change the DNN model in response to the estimated error each time the model weights are updated;
(iv) the number of training epochs, where a training epoch is one forward and backward pass of a complete data set.
These hyperparameters and their ranges are listed in Table 2. Using these parameter specifications, we are ready to perform the first SPOT HPT run.
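The call to SPOT's spot() function is sketched below. The settings shown here are assumptions derived from the information in this paper: the bounds and types from Table 2, the budget of 600 evaluations used in Sec. 3, and a random forest surrogate (suggested by the randomForest fit stored in the result object, see Fig. 2); the control settings of the original run may differ.

library(SPOT)

res <- spot(
  x = NULL,                     # no user-supplied start design
  fun = funTfruns,              # wrapper around tfruns, see Sec. 2.3.3
  lower = c(0, 0,  16,   4, 1e-4,  5),
  upper = c(1, 1, 512, 256, 1e-1, 25),
  control = list(
    funEvals = 600,             # total number of DNN trainings
    model = buildRandomForest,  # surrogate model
    types = c("numeric", "numeric", "integer", "integer", "numeric", "integer")
  )
)
res$xbest                       # best hyperparameter configuration found
res$ybest                       # corresponding validation loss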
3 Results

Figure 2: Internal list structure of the result object res from the SPOT run. Information about the surrogate model is omitted here for readability.

List of 9
 $ xbest   : num [1, 1:6] 1.31e-02 4.23e-01 4.73e+02 1.10e+02 4.89e-04 ...
 $ ybest   : num [1, 1] 0.0954
 $ x       : num [1:600, 1:6] 0.9906 0.7771 0.1383 0.0094 0.605 ...
 $ y       : num [1:600, 1] 2.3 2.09 1.46 1.13 1.95 ...
 $ logInfo : logi NA
 $ count   : int 600
 $ msg     : chr "budget exhausted"
 $ modelFit:List of 3
  ..$ rfFit:List of 17 ...
  .. ..- attr(*, "class")= chr "randomForest"
  ..$ x    : num [1:599, 1:6] 0.9906 0.7771 0.1383 0.0094 0.605 ...
  ..$ y    : num [1:599, 1] 2.3 2.09 1.46 1.13 1.95 ...
  ..- attr(*, "class")= chr "spotRandomForest"
 $ ybestVec: num [1:600] 0.26 0.26 0.26 0.26 0.26 ...

Table 3: Hyperparameter values from iteration 454 and the best configuration. x1 denotes the first layer dropout rate, x2 the second layer dropout rate, x3 the units per first layer, x4 the units per second layer, x5 the learning rate, and x6 the training epochs. y denotes the value of the loss function (to be minimized).

hyperparameter configuration   x1      x2      x3    x4    x5         x6   y
outlier454                     0.102   0.430   482   190   0.058251   11   34.0114
xbest                          0.013   0.423   473   110   0.000489   14   0.0954

Before an extensive HPO is performed (which might also require a significant amount of time and resources, even when it is performed automatically), a few HPT procedures are performed. Insights gained from HPT might help to avoid pitfalls such as ill-specified parameter ranges, too short run times, etc. In addition to an effective and efficient way to determine the optimal hyperparameters, SPOT provides tools for learning and understanding.3 The HPT experiment from Sec. 2.3 used n = 600 evaluations, i.e., SPOT generated a result list (res) with the information shown in Fig. 2.

Plots. First of all, the res list information will be used to visualize the route to the solution: in Fig. 3, loss function values are plotted against the number of iterations. This figure reveals that one hyperparameter configuration (from iteration 454; it will be referred to as outlier454 in the following) should be investigated further, because this configuration results in an unusually high loss value of 34.01. Without exception, the loss values from the other hyperparameter configurations are smaller than 5. Using the default configuration results in a loss value of 0.28. The related hyperparameter values are shown in Table 3.

Box plots. Secondly, looking at the relationship to the remaining 599 hyperparameter configurations from this experiment might be insightful: therefore, Fig. 4 visualizes (using box plots) the ranges of the six hyperparameters from the complete HPT experiment. Figure 4 reveals that the largest differences between the best solution and outlier454 occur for the variables x1, i.e., the first layer dropout rate, and x5, i.e., the learning rate.

Regression trees. Thirdly, to analyze effects and interactions between hyperparameters, a simple regression tree, as shown in Fig. 5, can be used. The regression tree supports the observation that the hyperparameter values for x1, i.e., the first layer dropout rate, and x5, i.e., the learning rate, are relevant. To conclude this first analysis, the interaction between x1 and x5 will be visualized. SPOT provides several tools for the analysis of interactions. Highly recommended is the use of contour plots as shown in Fig. 6 to show interactions between x1 and x5.
3 Or, as Bartz-Beielstein [2006] wrote: "[SPOT] provides means for understanding algorithms' performance (we will use datascopes similar to microscopes in biology and telescopes in astronomy)."
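The diagnostic plots described above can all be derived from the res object. The sketch below shows one way to reproduce such an analysis with base R graphics and the rpart package; it is an illustration, not necessarily the exact code used to produce Figs. 3-5.

library(rpart)

# Route to the solution (cf. Fig. 3): loss values per evaluation on a log scale
plot(as.vector(res$y), log = "y", xlab = "iterations", ylab = "loss")
abline(h = min(res$y), col = "blue")          # best loss found by the tuner

# Ranges of the six hyperparameters (cf. Fig. 4)
colnames(res$x) <- paste0("x", 1:6)
boxplot(res$x)

# Regression tree relating the hyperparameters to the loss (cf. Fig. 5)
df <- data.frame(res$x, y = as.vector(res$y))
tree <- rpart(y ~ ., data = df)
plot(tree); text(tree)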
Figure 3: Loss function values plotted against the number of SPOT iterations. The orange line represents the loss obtained with the default DNN hyperparameters. The blue line shows the best function value found during the tuning procedure. Grey squares represent the values generated during the hyperparameter tuning procedure. The red square shows one extreme value that should be investigated further. Note: loss values are plotted on a log scale.

Figure 4: Six box plots, i.e., each plot represents the values of one parameter (plotted on a log scale). The red squares represent outlier454, the blue ones show the settings of the best value. The largest deviations can be observed for x1 and x5.

Figure 5: Regression tree based on the first run with 600 evaluations. Apparently, hyperparameter values from x5 and x1 are important. This result supports the previous analysis.

Figure 6: Surface plot: learning rate x5 plotted against dropout1 x1.

In addition to the analysis performed so far, surface plots can be used: the surface plot shown in Fig. 6 supports the observation that the hyperparameters x5 and x1 have significant effects on the loss function and that their values should be small.

Linear models. Finally, a simple linear regression model can be fitted to the data. Based on the data from SPOT's res list, this can be done as follows:

lm.res <- lm(res$y ~ res$x)
summary(lm.res)

## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2332275  0.2630064  -0.887   0.3756
## res$x1       1.0295949  0.2103293   4.895 1.27e-06 ***
## res$x2      -0.0059548  0.2108685  -0.028   0.9775
## res$x3       0.0009967  0.0004168   2.392   0.0171 *
## res$x4      -0.0010134  0.0007944  -1.276   0.2026
## res$x5      19.7357685  1.8288093  10.792  < 2e-16 ***
## res$x6       0.0054800  0.0088735   0.618   0.5371
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.404 on 593 degrees of freedom
## Multiple R-squared: 0.2283, Adjusted R-squared: 0.2205
## F-statistic: 29.24 on 6 and 593 DF, p-value: < 2.2e-16

Although this linear model requires a detailed investigation (a misspecification analysis is necessary, e.g., R^2_adj ≈ 0.22), it is also in accordance with the previous observations that the hyperparameters x5 and x1 have significant effects on the loss function, and that their values should be small.

4 Discussion and Conclusions

This study briefly explains how HPT can be used as a datascope for the optimization of DNN hyperparameters. The results and observations presented in Sec. 3 can be stated as a hypothesis:

(H-1): hyperparameters x1, i.e., the first layer dropout rate, and x5, i.e., the learning rate, have significant effects on the loss function. Their values should be small.

This hypothesis requires further investigations. The results only scratch the surface of the HPT set of tools, e.g., the role and impact of noise was not considered. SPOT provides very powerful tools such as Optimal Computing Budget Allocation (OCBA) to handle noisy function evaluations efficiently [Chen et al. 1997] [Bartz-Beielstein et al. 2011]. Interesting structures and patterns in the data, e.g., patterns in Fig. 3 that occur in the interval between iteration 150 and 200 (on the x-axis, which represents the number of iterations), should be investigated further.
Furthermore, there seems to be an upper limit for the values of the loss function: no loss function values appear between 5 and 34 (34 is the loss value of the outlier454 parameter set). Considering the research goals stated in Sec. 1, the HPT approach presented in this study provides many tools and solutions. In addition to the research goals (R-1) to (R-8) from Sec. 1, important goals that are specific for HPT in DNN should be mentioned. Problem and algorithm designs will be treated separately.

Problem Design. The problem design comprises the set of parameters related to the problem. In HPT and HPO, regression or classification tasks are often considered. In our study, the MNIST data set was chosen.

• A sound test problem specification is necessary, i.e., train, validation, and test sets should be clearly specified.
• Initialization (this is similar to the specification of starting points in optimization).
• Usage of surrogate benchmarks should be considered (this is similar to the use of CFD simulations in optimization).
• Repeats (power of the test, severity), i.e., how many runs are feasible or necessary?
• What are meaningful differences (w.r.t. the specification of the loss function)?
• Remember: scientific relevance is not identical to statistical significance.
• Floor and ceiling effects should be avoided.
• Comparison to a baseline (random search, random sampling, mean value, ...) is a must.

Algorithm Design. The algorithm design in HPT and HPO refers to the model, i.e., DNNs. In our study, the neural network from Sec. 2.3 was chosen.

• A sound algorithm (neural network) specification is required.
• Initialization, pre-training (starting points in optimization). Pre-tuning should be explained.
• Hyperparameters (ranges, types) should be clearly specified.
• Are there any additional (untunable) parameters?
• How is noise (randomness, stochasticity) treated?
• Is reproducibility ensured?
• Last but not least: open source code and open data should be provided.

To conclude: differences between HPT and HPO were discussed. An HPT approach based on SMBO was introduced and exemplified. It combines two packages from the statistical programming environment R: tfruns and SPOT, hence providing an HPT environment that is fully accessible from R. Although HPT can be performed with R functions, the underlying TF (Keras) and Python environment has to be installed. This installation is explained in the Appendix.

5 Appendix

5.1 Installing Python

5.1.1 Create a Python Environment in the Project

It is recommended that one Python virtual environment is used per experiment. Navigate into the project directory by using the following command:

cd <project-directory>

Create a new virtual environment in a folder called python within the project directory using the following command:

virtualenv python

5.1.2 Activate the Python Environment

The virtualenv can be activated using the following command in a terminal:
source python/bin/activate

To verify that the correct version of Python was activated, the following command can be executed in a terminal:

which python

5.1.3 Install Python Packages in the Environment

Python packages such as numpy, pandas, matplotlib, and other packages can be installed in the Python virtualenv by using pip install:

pip install numpy pandas matplotlib tensorflow

5.1.4 Install and Configure reticulate to use the Correct Python Version

Install the reticulate package using the following command in the R console:

install.packages("reticulate")

To configure reticulate to point to the Python executable in the virtualenv python from above, create a file in the project directory called .Rprofile with the following contents:

Sys.setenv(RETICULATE_PYTHON = "python/bin/python")

R must be restarted for the setting to take effect. To check that reticulate is configured for the correct version of Python, the following command can be used in the R console:

reticulate::py_config()

# Get Data:
library("reticulate")
use_virtualenv("~/workspace/bab-bart11o/python")
library("keras")
reticulate::py_config()

5.2 Installing Keras

To get started with Keras, the Keras R package, the core Keras library, and a backend tensor engine (such as TF) must be installed. This can be done as follows:

install.packages("tensorflow")
install.packages("keras")
library(keras)
install_keras()

This will provide you with a default CPU-based installation of Keras and TF. To install a GPU-based version of the TF backend engine, the corresponding command reads as follows:

install_keras(tensorflow = "gpu")

5.3 Installing SPOT

The following commands can be used to install the most recent version of SPOT.

install.packages("devtools")
devtools::install_gitlab("bartz/spot", host = "http://owos.gm.fh-koeln.de:8055")
References

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv e-prints, art. arXiv:1603.04467, Mar. 2016.

P. Balaprakash, M. Salim, T. D. Uram, V. Vishwanath, and S. M. Wild. DeepHyper: Asynchronous hyperparameter search for deep neural networks. In 25th IEEE International Conference on High Performance Computing (HiPC18). IEEE, 2018. doi: 10.1109/hipc.2018.00014.

T. Bartz-Beielstein. Experimental Research in Evolutionary Computation—The New Experimentalism. Natural Computing Series. Springer, Berlin, Heidelberg, New York, 2006. ISBN 3-540-32026-1. doi: 10.1007/3-540-32027-X. URL http://dx.doi.org/10.1007/3-540-32027-X.

T. Bartz-Beielstein and S. Markon. Tuning Search Algorithms for Real-World Applications: A Regression Tree Based Approach. In G. W. Greenwood, editor, Proceedings 2004 Congress on Evolutionary Computation (CEC'04), Portland OR, pages 1111–1118, Piscataway NJ, 2004. IEEE. ISBN 0-7803-8515-2. doi: 10.1109/CEC.2004.1330986. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1330986.

T. Bartz-Beielstein and M. Zaefferer. Model-based methods for continuous and discrete global optimization. Applied Soft Computing, 55:154–167, 2017. ISSN 1568-4946. doi: 10.1016/j.asoc.2017.01.039. URL http://www.sciencedirect.com/science/article/pii/S1568494617300546.

T. Bartz-Beielstein, C. Lasarczyk, and M. Preuss. Sequential Parameter Optimization. In B. McKay et al., editors, Proceedings 2005 Congress on Evolutionary Computation (CEC'05), Edinburgh, Scotland, pages 773–780, Piscataway NJ, 2005. IEEE Press. ISBN 0-7803-9363-5. doi: 10.1109/CEC.2005.1554761.

T. Bartz-Beielstein, M. Friese, M. Zaefferer, B. Naujoks, O. Flasch, W. Konen, and P. Koch. Noisy optimization with sequential parameter optimization and optimal computational budget allocation. In Proceedings of the 13th annual conference companion on Genetic and evolutionary computation, pages 119–120, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0690-4. doi: 10.1145/2001858.2001926.

T. Bartz-Beielstein, J. Branke, J. Mehnen, and O. Mersmann. Evolutionary algorithms. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(3):178–195, 2014. ISSN 1942-4795. doi: 10.1002/widm.1124. URL http://dx.doi.org/10.1002/widm.1124.

T. Bartz-Beielstein, C. Doerr, J. Bossek, S. Chandrasekaran, T. Eftimov, A. Fischbach, P. Kerschke, M. Lopez-Ibanez, K. M. Malan, J. H. Moore, B. Naujoks, P. Orzechowski, V. Volz, M. Wagner, and T. Weise. Benchmarking in optimization: Best practice and open issues. arXiv, July 2020. URL https://arxiv.org/abs/2007.03488.

T. Bartz-Beielstein, M. Zaefferer, and F. Rehbach. In a Nutshell – The Sequential Parameter Optimization Toolbox. arXiv e-prints, art. arXiv:1712.04076, Dec. 2021.

J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13:281–305, 2012. ISSN 1532-4435; 1533-7928/e.

J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures.
In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML'13, pages I–115–I–123. JMLR.org, 2013.

H. C. Chen, C. H. Chen, L. Dai, and E. Yücesan. New Development of Optimal Computing Budget Allocation for Discrete Event Simulation. In S. Andradóttir, K. J. Healy, D. H. Withers, and B. L. Nelson, editors, Proceedings of the 1997 Winter Simulation Conference, pages 334–341, Piscataway NJ, 1997. IEEE Computer Society.

F. Chollet and J. Allaire. Deep Learning with R. Manning, 2018.

S. Falkner, A. Klein, and F. Hutter. BOHB: Robust and Efficient Hyperparameter Optimization at Scale, 2018. URL http://arxiv.org/abs/1807.01774.

N. Hansen. The CMA evolution strategy: a comparing review. In J. A. Lozano, P. Larranaga, I. Inza, and E. Bengoetxea, editors, Towards a new evolutionary computation. Advances on estimation of distribution algorithms, pages 75–102. Springer, 2006.
I. Karmanov, M. Salvaris, M. Fierro, and D. Dean. Comparing deep learning frameworks: A rosetta stone approach. Microsoft docs, Microsoft, 2018. Last accessed 29-May-2021. https://docs.microsoft.com/de-de/archive/blogs/machinelearning/comparing-deep-learning-frameworks-a-rosetta-stone-approach.

R. M. Lewis, V. Torczon, and M. W. Trosset. Direct search methods: Then and now. Journal of Computational and Applied Mathematics, 124(1–2):191–207, 2000.

L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. arXiv e-prints, art. arXiv:1603.06560, Mar. 2016.

D. Meignan, S. Knust, J.-M. Frayet, G. Pesant, and N. Gaud. A Review and Taxonomy of Interactive Optimization Methods in Operations Research. ACM Transactions on Interactive Intelligent Systems, Sept. 2015.

H. Mendoza, A. Klein, M. Feurer, J. T. Springenberg, M. Urban, M. Burkart, M. Dippel, M. Lindauer, and F. Hutter. Towards Automatically-Tuned Deep Neural Networks, pages 135–149. Springer International Publishing, Cham, 2019. ISBN 978-3-030-05318-5. doi: 10.1007/978-3-030-05318-5_7. URL https://doi.org/10.1007/978-3-030-05318-5_7.

T. O'Malley, E. Bursztein, J. Long, F. Chollet, H. Jin, L. Invernizzi, et al. Keras Tuner, 2019. https://github.com/keras-team/keras-tuner.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. arXiv e-prints, art. arXiv:1206.2944, June 2012.

V. Tatsis and K. Parsopoulos. Grid search for operator and parameter control in differential evolution. In 9th Hellenic Conference on Artificial Intelligence (SETN 2016), 2016.

J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. OpenML: networked science in machine learning. arXiv e-prints, art. arXiv:1407.7722, July 2014.

M. Zaefferer and T. Bartz-Beielstein. Efficient global optimization with indefinite kernels. In J. Handl, E. Hart, P. R. Lewis, M. López-Ibáñez, G. Ochoa, and B. Paechter, editors, Parallel Problem Solving from Nature – PPSN XIV: 14th International Conference, Edinburgh, UK, September 17-21, 2016, Proceedings, pages 69–79. Springer International Publishing, Cham, 2016. ISBN 978-3-319-45823-6. doi: 10.1007/978-3-319-45823-6_7. URL http://dx.doi.org/10.1007/978-3-319-45823-6_7.