Bayesian Optimization for Hyperparameter Tuning

Bayesian Optimization helped us find a hyperparameter configuration that is better than the one found by Random Search for a neural network on the San Francisco Crimes dataset. People who are familiar with Machine Learning might want to fast forward to Section 3 for details. The code to reproduce the experiments can be found here.

Hyperparameter tuning may be one of the most tricky, yet interesting, topics in Machine Learning. For most Machine Learning practitioners, mastering the art of tuning hyperparameters requires not only a solid background in Machine Learning algorithms, but also extensive experience working with real-world datasets.

In this post, I will give an overview of hyperparameters and various approaches of tuning hyperparameters intelligently. In particular, I will explain how we applied Bayesian Optimization to tune hyperparameters of a predictive model on a real-world dataset.

1. What is a Hyperparameter? Why it is Important?

Machine Learning is the process of improving a performance score, P, on a learning experience (or data, or evidence), E, for a specific class of tasks, T. In practice, that means almost every Machine Learning model has a specific set of parameters that must be estimated from the dataset E, in order to maximize the performance score P.

For example, considering a linear regression model where, given a dataset $E=\left&space;\{x_{i},y_{i}\right&space;\}_{i=1}^{N}$, we want to accurately predict the target $x_{i}$ for every input vector $y_{i}$.  The linear regression model does so by assuming a linear relationship between $x_{i}$ and $y_{i}$.  via the following formula:

$y=w_{T}x$

where $w$ is a weight vector whose values is to be determined by the training algorithm of linear regression. Indeed, $w$ is called the parameter of the linear regression model.

In order to estimate $w$ from the dataset E, a training algorithm needs to be applied. The training algorithm itself might also have its own parameters. For instance, gradient descent has learning rate, early stopping criteria and others as parameters. To differentiate training algorithm parameters, we call them hyperparameters. Hyperparameters are important because they directly control the behavior of the training algorithm and have a significant effect on the performance of the model being trained.

In addition, there may be hyperparameters that control the capacity of the model, e.g. the maximum depth of a decision tree, the number of trees in a random forest, the regularization coefficient in ridge/lasso regression, the penalty coefficient in SVM, the number of layers in neural networks, etc. These capacity hyperparameters specify how flexible the model is, and can be seen as the knobs to control the model bias vs. variance.

The diagram below depicts the interaction of hyperparameters. Hyperparameters are specified before the training algorithm starts and can’t be optimized inside the training algorithm itself. The first choice of hyperparameters will not result in optimal model performance, so hyperparameters are tuned by adjusting parameters and taking another step of training the model – essentially an optimization procedure by itself. If the training algorithm takes a long time to run, the whole process of hyperparameter tuning will take even longer, because it needs to run the training algorithms multiple times to find a near optimal hyperparameter configuration.

The optimal set of hyperparameters largely depends on the model and the dataset. Even with the same model, the optimal hyperparameters might differ between datasets. Therefore, there has to be a meta step where we tune the hyperparameters separately for every new dataset.

2. Hyperparameters Tuning Algorithms

The straightforward way to tune hyperparameters is based on human expertise. Experienced Machine Learning practitioners know approximately how to chose good hyperparameters. For a new dataset, they will follow a trial and error process with various configurations of hyperparameters. This process of experimenting with hyperparameters is heuristic and different people with different experience might come up with different settings and the process is not easily reproducible.

There is also a real risk that a human won’t achieve a near optimum setting of hyperparameters. Humans are not good at handling high dimensional data and can easily misinterpret or miss trends and relationships when trying to tune multiple hyperparameters. For instance, while tuning just two parameters, practitioners often fall back to tuning one parameter then tuning the second parameter. This may lead to concluding improvement in performance has plateaued while adjusting the second hyperparameter, while more improvement might be available by going back to changing the first hyperparameter. This difficulty requires an automatic and reproducible approach for hyperparameter tuning. The most popular approaches are outlined in this section.

2.1. Grid Search

In grid search, we try a set of configurations of hyperparameters and train the algorithm accordingly, choosing the hyperparameter configuration that gives the best performance. In practice, practitioners specify the bounds and steps between values of the hyperparameters, so that it forms a grid of configurations. Practitioners typically start with a limited grid with relatively large steps between parameter values, then extend or make the grid finer at the best configuration and continue searching on the new grid. This process is called manual grid search.

Grid search is a costly approach. Assuming we have nhyperparameters and each hyperparameter has two values, then the total number of configurations is $2^{n}$. Therefore it is only feasible to do grid search on a small number of configurations. Fortunately, grid search is embarrassingly parallel, meaning it can be parallelized easily where each worker works on different parameter settings. This makes grid search a bit more feasible given enough computational power.

2.2. Random Search

In Random Search for Hyperparameter Optimization, the authors show a surprising result: by navigating the grid of hyperparameters randomly, one can obtain similar performance to a full grid search. The authors show that if the close-to-optimal region of hyperparameters occupies at least 5% of the search space, then a random search with a certain number of trials (typically 40-60 trials) will be able to find that region with high probability.

Random search is surprisingly simple and effective, therefore it is considered by many practitioners as the de facto method for tuning hyperparameters. Like grid search, it is embarassingly parallel, yet the number of trials is much less, while the performance is comparable.

2.3. Automatic Hyperparameter Tuning

In Grid Search and Random Search, we try the configurations randomly and blindly. The next trial is independent to all the trials done before. In contrast, automatic hyperparameter tuning forms knowledge about the relation between the hyperparameter settings and model performance in order to make a smarter choice for the next parameter settings. Typically it will first collect the performance at several configurations, then make some inference and decide what configuration to try next. The purpose is to minimize the number of trials while finding a good optimum. This process, by nature, is sequential and not easily parallelizable.

It can be seen that hyperparameter tuning is an optimization problem, where we find the hyperparameter setting that maximizes the performance of the model on a validation set. However, the mapping from the hyperparameters to the performance on the validation set cannot be written into a formula, hence we don’t know the derivatives of this function – it’s known as a blackbox function. Because it’s a blackbox function, optimization techniques like Newton method or Gradient Descent cannot be applied.

Most of the approaches for tuning hyperparameters fall into Sequential Model-based Global Optimization(SMBO). These approaches use a surrogate function to approximate the true blackbox function. Typically the inner loop of SMBO is the optimization of this surrogate, or some kind of transformation done on the surrogate. The configuration that maximizes this surrogate will be the one should be tried next. SMBO algorithms differ in the criteria by which it optimizes the surrogate, and in the way they model the surrogate given the observation history. Several SMBO approaches for hyperparameters have been recently proposed in literature:

• Bayesian Optimization: uses a Gaussian Process to model the surrogate, and typically optimizes the Expected Improvement, which is the expected probability that new trials will improve upon the current best observation. Gaussian Process is a distribution over functions. A sample from a Gaussian process is an entire function. Training a Gaussian Process involves fitting this distribution to the given data, so that it generates functions that are close to the observed data. Using Gaussian process, one can compute the Expected Improvement of any point in the search space. The one gives the highest expected improvement will be tried next. Bayesian Optimization typically gives non-trivial, off-the-grid values for continuous hyperparameters (like the learning rate, regularization coefficient,…) and was shown to beat human performance on some good benchmark datasets. A well-known implementation of Bayesian Optimization is Spearmint.
• Sequential Model-based Algorithm Configuration (SMAC): uses a random forest of regression trees to model the objective function, new points are sampled from the region considered optimal (high Expected Improvement) by the random forest. The implementation can be found here.
• Tree-structured Parzen Estimator (TPE): is an improved version of SMAC, where two separated models are used to model the posterior. A well-known implementation of TPE is hyperopt.

3. Bayesian Optimization in Action

At Arimo, the Data Science and Advanced Research team regularly develops new models on new datasets and we could save significant time and effort by automating hyperparameter tuning. Also, we want to add features to our products that make it easier for data scientists to develop, train and deploy new models. We decided to try out Bayesian optimization to see how well it performed, whether it was efficient and could be implemented straightforwardly. For our testing, we used the San Francisco Crimes dataset.

San Francisco Crimes contains 12 years of crime reports from across all of San Francisco neighborhoods. Here are some samples from the training set

 Dates Category Descript DayOfWeek PdDistrict Resolution Address X Y 2015-05-13 23:53:00 WARRANTS WARRANT ARREST Wednesday NORTHERN ARREST, BOOKED OAK ST / LAGUNA ST -122.425891675136 37.7745985956747 2015-05-13 23:53:00 OTHER OFFENSES TRAFFIC VIOLATION ARREST Wednesday NORTHERN ARREST, BOOKED OAK ST / LAGUNA ST -122.425891675136 37.7745985956747 2015-05-13 23:33:00 OTHER OFFENSES TRAFFIC VIOLATION ARREST Wednesday NORTHERN ARREST, BOOKED VANNESS AV / GREENWICH ST -122.42436302145 37.8004143219856

In the test set, only the time and location is available, and we wanted to build a model that predicts the category of crimes given location and time. There were 39 crime categories to be predicted.

Like any other Data Science problem, we start with Feature Engineering. In this post, we are going to use some good practices from the Kaggle community, including:

• Extracting time, day, month, year, day of week, season from the Dates column.
• Address is an interesting feature. In this post we use the count featurizer as done in this script
• We also do some feature engineering on the latitude and longitude, as described here.

That gave about 77 input features. We then split the training set into training and validation, and then fit a neural network to predict the crime category. The performance measurement is the multiclass log-loss, as indicated in the Kaggle competition.

Now the important step: specifying the architecture of the neural network, and various hyperparameters for the training algorithm. We don’t know the optimal values for those hyperparameters, so we need to define the search space. For this particular problem, our search space will be:

 Hyper-parameter Type Possible Values Number of Hidden Layers INT 1, 2, 3 Number of hidden units INT 64, 128, 256 Dropout at the input layer FLOAT [0, 0.5] Dropout at the hidden units FLOAT [0,0.75] Learning rate FLOAT [0.01, 0.1] L2 Weight dDecay FLOAT [0, 0.01]

We decided to use Rectifier as the nonlinearity of the hidden layers, and softmax for the output layer. In principle, we could also tune the nonlinearity function of the hidden layers (choices can be Rectifier, Tanh, Sigmoid, etc…) but Rectifier gave superior performance in our initial tests, so we don’t tune it. The maximum number of training epochs was fixed at 20.

Now we have 2 discrete and 4 continuous hyperparameters, including both the hyperparameters of the models (number of hidden layers, dropout rate…) and the parameters of the training algorithm (learning rate…). Assume that we want to do grid search, and we only take 3 samples for each continuous variable, then we already have different configurations. Using a GPU, training the neural network on this dataset takes about 25 minutes in average. So it takes 12.6 days to run the full grid search. If we have a bigger dataset, full grid search may become infeasible.

Instead, we are going to use Spearmint Bayesian optimization to tune the hyperparameters. We wrote some convenient wrappers around Spearmint making training models very easy. Given the above grid, we ran Spearmint using the GPEIOptChooser chooser, and a maximum of 40 trials. The trials can be summarized in the following chart.

Bayesian Optimization gave non-trivial values for continuous variables like Learning rRate and Dropout rRate. It also learns to enable dropout after a few trials, and it seems to favor small networks (2 hidden layers with 256 units), probably because bigger networks might over fit the data. The whole process of Bayesian Optimization took about 15.8 hours, and we get the best performance of 2.1502 in logloss on the validation set.

To make a comparison, we also run Random Search on exactly the same grid and data, using RandomizedSearchCV in scikit-learn. After spending 13.7 hours, Random Search gave a performance of 2.1591 on the validation set. The hyperparameters values of Bayesian Optimization and Random Search are displayed in the following table

 Hyperparameters Bayesian Optimization Random Search Number of hidden layers 2 1 Number of hidden units 256 256 Dropout at the input layer 0.423678 0.475357 Dropout at the hidden units 0.091693 0.280905 Learning rate >0.025994 0.0879691 L2 Weight decay >0.00238 0.005968 Logloss on the validation set >2.1502 2.1591 Running time (hours) 15.8 13.7

We then took the model trained with Bayesian Optimization and ran it on the test set, then submitted the result to Kaggle. It gave a score of 2.24152 on the public leaderboard – at the time in the top 30 scores for the dataset. We could even further improve with more feature engineering and doing ensembles of the top models, however that is out of scope of this post.

4. Conclusion

We have seen an introduction to hyperparameters, and how to use Bayesian Optimization for tuning hyperparameters of a neural networks on the San Francisco Crimes dataset. It can be seen that on the same grid and the same number of trials, Bayesian Optimization found a better configuration than random search.

Moving forward, it should be noted that Bayesian Optimization can be used to tune any expensive blackbox noisy function. In the next posts, we will see how Bayesian Optimization could be used to automate the feature engineering process and truly improve the productivity of Data Scientists.

In the mean time, we are hiring Data Scientists, among others. Check it out!