Intro to Hyperparameter Optimization for Machine Learning

Overview of a few black-box hyperparameter optimization techniques: grid search, randomized search and sequential model-based optimization.

Categories: technical, statistics, machine learning

Affiliation: Duke University

Published: January 29, 2022

1 Introduction

Machine learning is easy, right? You pick a model, fit it to your data, and out come predictions.

Not quite. Sometimes we talk about the fancy math and algorithms under the hood to make it look serious, but we rarely talk about how difficult it is to transform whatever data we can gather into useful, actionable predictions that have business value.

There are many challenges. First, there's the transformation of a business problem into something that's remotely approachable by machine learning and statistics. Second, there's the development of a data collection plan or, more often than not, the identification of observational data that is already available. With this data in hand comes the third step, modeling, which bridges the gap between raw numbers and useful answers. Modeling may have to account for all kinds of issues with your data, such as class imbalance, missingness, and non-representativeness. You also want to obtain good answers, so throughout this step you loop between model specification, evaluation, and refinement. It is a lengthy process of investigating your model's performance, gaining insight into why you observe what you do, and applying various fixes and improvements. Finally, in a fourth stage, you must account for how your model will be used and manage its lifecycle.

Moral of the story: there is a lot of work involved. We need all hands on deck. And even more than that, we need robust automation tools to support this machine learning workflow.

This blog post is about a single set of tools – hyperparameter optimization techniques – used to help with the model specification, evaluation, and refinement loop. I will focus on the standard machine learning framework of supervised learning. In this context, machine learning algorithms can be seen as black boxes which take in some data, a bunch of tuning hyperparameters specified by the user of the algorithm, and which output predictions. The quality of the predictions can be evaluated through data splitting or cross-validation. That is, we’re always able to compare predictions to ground truth for the data we have at hand.
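To give a sense of what this black-box view looks like in code, here is a small, self-contained sketch on synthetic data (the dataset and the random forest are just stand-ins for illustration): we choose hyperparameters, hand the data to the algorithm, and judge the resulting predictions against held-out ground truth.

Code
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for whatever data we managed to gather.
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

# Hyperparameters are chosen by the user; parameters are learned inside fit().
black_box = RandomForestRegressor(n_estimators=100, max_depth=4, random_state=0)
black_box.fit(X_train, y_train)

# Quality of the predictions, measured against the held-out ground truth.
mean_absolute_error(y_test, black_box.predict(X_test))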

My goal is to describe key approaches to hyperparameter optimization (see Table 1) in order to provide a conceptual understanding that can be helpful in practice. I describe black-box methods, which treat the machine learning algorithm as, well, a black box. This includes grid search, randomized search, and sequential model-based optimization such as Bayesian optimization. There are additional methods to consider, such as Hyperband and Bayesian model selection, which integrate with the learning algorithms themselves. Those will have to wait for another blog post.

Table 1: Different types of hyperparameter optimization methods
Black-box methods                     Integrated methods
------------------------------------  ------------------------
Grid Search                           Hyperband
Randomized Search                     Bayesian Model Selection
Sequential Model-Based Optimization

Before getting into the details of these methods, though, let's go over some basic concepts and terminology which I'll be using.

1.1 Background and Terminology

First, let's talk about models, parameters and performance evaluation. This will also be the occasion for me to introduce some terminology and notation.

Models, Parameters and Performance Evaluation

A model is a mathematical representation of something going on in the real world. For instance, suppose you want to predict whether or not a given stock \(X\) is going to go up tomorrow. A model for this could be: predict it’s going to go up with probability \(\alpha\) if it went up today, otherwise predict it’s not going to go up with probability \(\beta\). There’s only one variable in this model (whether or not the stock went up today), and there are two parameters, the probabilities \(\alpha\) and \(\beta\). Here the parameters could be learned if we had historical data.
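For illustration, here is one way \(\alpha\) and \(\beta\) could be estimated from historical data (made up here): as the empirical frequency of an up day following an up day, and of a down day following a down day.

Code
import numpy as np

# Made-up daily movements: 1 if the stock went up that day, 0 otherwise.
went_up = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1])

today, tomorrow = went_up[:-1], went_up[1:]

# alpha: how often an "up" day was followed by another "up" day.
alpha = tomorrow[today == 1].mean()

# beta: how often a "down" day was followed by another "down" day.
beta = (1 - tomorrow[today == 0]).mean()

alpha, beta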

You could consider more sophisticated models, such as classical time series models or recurrent neural networks. In all cases, you have variables (the input to your model), parameters (what you learn from data), and you end up with predictions.

You can compare the performance of any model by comparing the predictions to what actually happened. For instance, you could look at how often your predictions were right. That’s a performance evaluation metric. Your goal is usually to build a model which will keep on performing well.

Formally, let \(R\) be the (average) future performance of your model. You don’t know this quantity, but you can estimate it as \(\hat R\) using techniques such as cross-validation and its variants. There might be a bias and a variance to \(\hat R\), but the best we can do in practice is to try to find the model with the best estimated performance (modulo certain adjustments).
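As a minimal illustration (using a small classification dataset that ships with scikit-learn, just to keep things self-contained), here is what such an estimate looks like with 5-fold cross-validation; the spread of the fold scores gives a rough sense of the variance mentioned above.

Code
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Accuracy on each of the 5 held-out folds.
fold_scores = cross_val_score(LogisticRegression(max_iter=5000),
                              X_demo, y_demo, cv=5, scoring="accuracy")

R_hat = fold_scores.mean()   # our estimate of future performance
spread = fold_scores.std()   # fold-to-fold variability of that estimate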

This brings us to the question: how should you choose a model? The standard in machine learning is to choose the model which maximizes \(\hat R\). It's not the only solution, and it's not always the best solution (it can be better to do model averaging if \(\hat R\) has some variance), but it's what we'll focus on throughout this blog post.

Furthermore, we’ll approach this problem through the lens of hyperparameter selection.

Hyperparameters

Hyperparameters are things that you have to specify before you can run a model, such as:

  • what data features to use,
  • what type of model to use (linear model? random forest? neural network?)
  • other decisions that go into the specification of a model:
    • the number of layers in your neural network,
    • the learning rate for the gradient descent algorithm,
    • the maximum depth for decision trees, etc.

There is only a practical distinction between parameters and hyperparameters. Hyperparameters are things that are usually set separately from the other model parameters, or that do not fit nicely within a model's learning algorithm. Depending on the framework you're using, parameters can become hyperparameters and vice versa. For example, by using ensemble methods, you could easily transform the "model type choice" hyperparameter into a simple parameter of your ensemble that is learned from data.

The key thing is that, in practice, there will typically be some distinction between parameters of your model and a set of hyperparameters that you have to specify.
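scikit-learn makes this distinction quite visible: hyperparameters are constructor arguments that you choose, while parameters learned from the data show up as attributes with a trailing underscore after fitting. A small sketch on made-up data:

Code
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X_demo, y_demo = make_regression(n_samples=200, n_features=3, random_state=0)

# Hyperparameter: the regularization strength alpha, chosen by the user.
ridge = Ridge(alpha=1.0)
ridge.fit(X_demo, y_demo)

# Parameters: learned from the data during fit().
ridge.coef_, ridge.intercept_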

Through experience, you can learn what hyperparameters work well for the kinds of problems that you work on. Other times, you might carefully tune hyperparameters by hand and investigate the impact of your choices on model performance.

The manual process of hyperparameter tuning can lead to important insights into the performance and behavior of your model. However, it can also be a menial task that would be better automated through hyperparameter optimization algorithms aiming to maximize \(\hat R\), such as those that I review below.

Example

Let’s look at an example to make things concrete. This is adapted from scikit-optimize’s tutorial for tuning scikit-learn estimators.

We’ll consider the California housing dataset from the scikit-learn library. Each row in this dataset represents a census block and contains aggregated information regarding houses in that block. Our goal will be to predict median house price at the block level given these other covariates.

Code
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing

dataset = fetch_california_housing(as_frame=True)

X = dataset.data # Covariates
n_features = X.shape[1] # Number of features
y = dataset.target # Median house prices

X
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25
... ... ... ... ... ... ... ... ...
20635 1.5603 25.0 5.045455 1.133333 845.0 2.560606 39.48 -121.09
20636 2.5568 18.0 6.114035 1.315789 356.0 3.122807 39.49 -121.21
20637 1.7000 17.0 5.205543 1.120092 1007.0 2.325635 39.43 -121.22
20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43 -121.32
20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37 -121.24

20640 rows × 8 columns

For the regression, we'll use scikit-learn's gradient boosted trees estimator. This model has a number of internal parameters which we don't need to know much about, as well as hyperparameters which can be used to tune the model. These include the max_depth hyperparameter for the maximum depth of the decision trees, learning_rate for the learning rate of gradient boosting, max_features for the maximum number of features to use in each decision tree, and a few more. Ranges of reasonable values for these hyperparameters are specified in the space variable below.

Code
from sklearn.ensemble import GradientBoostingRegressor
from skopt.space import Real, Integer

model = GradientBoostingRegressor(n_estimators=25, random_state=0)

space  = [Integer(1, 8, name='max_depth'),
          Real(0.01, 1, "log-uniform", name='learning_rate'),
          Integer(1, n_features, name='max_features'),
          Integer(1, 50, name='min_samples_leaf')
]

Now, the last thing we need is an estimator \(\hat R\) of the model's performance. This is our Rhat() function (i.e. \(\hat R\)), the quantity we'll try to optimize. Here we use a cross-validated mean absolute error score, so lower values are better and our goal will be to minimize it (scikit-learn reports the negated error, hence the sign flip in the code below).

Code
from sklearn.model_selection import cross_val_score

def Rhat(**params):
  # Set the requested hyperparameters, then estimate performance by 3-fold CV.
  model.set_params(**params)

  # scikit-learn returns the *negated* MAE, so flip the sign back:
  # lower values of Rhat are better.
  return -np.mean(cross_val_score(model, X, y, cv=3, n_jobs=-1,
                                  scoring="neg_mean_absolute_error"))

With this, we can fit the model to the data (using default hyperparameter values to begin with), and evaluate the model’s performance.

Code
model.fit(X, y)

Rhat()
np.float64(0.5503720160635011)

Here median house price is expressed in hundreds of thousands of dollars, and we can interpret the model's performance on that scale. The value \(\hat R \approx 0.55\) means that, on average, the absolute error of the model is about $55,000. We'll see if we can do better using hyperparameter optimization.

2 Black-Box Optimization Methods

Black-box hyperparameter optimization algorithms treat the underlying machine learning algorithm as unknown. We only assume that, given a set of hyperparameters \(\lambda\), we can compute the estimated model performance \(\hat R(\lambda)\). There is usually variance in \(\hat R(\lambda)\), but this is not something that I will talk about in this post. We will therefore treat \(\hat R\) as a deterministic function to be optimized.

Note: in practice, you need to account for the variance in \(\hat R\), as otherwise you may be in for unpleasant surprises. It's just not something I'm covering in this post, since I want to focus on a conceptual understanding of the optimization algorithms.

We can use almost any technique to try to optimize \(\hat R\), but there are a number of challenges with hyperparameter optimization:

  1. \(\hat R\) is usually rather costly to evaluate.
  2. We usually do not have gradient information regarding \(\hat R\) (otherwise, hyperparameters for which we have gradient information could easily be incorporated as parameters of the underlying ML algorithms).
  3. The hyperparameter space is usually complex. It can contain discrete variables and can even be tree-structured, where some hyperparameters are only defined conditionally on other hyperparameters (see the short sketch after this list).
  4. The hyperparameter space is usually somewhat high-dimensional, with more than just 2-3 dimensions.
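To make the third challenge concrete, here is a purely hypothetical sketch of a tree-structured search space, where which hyperparameters are defined depends on the value of another hyperparameter (the model type):

Code
# Hypothetical, tree-structured search space: the available hyperparameters
# depend on the value of the "model type" hyperparameter.
search_space = {
    "random_forest": {
        "n_estimators": [100, 300, 1000],
        "max_depth": [3, 5, 10, None],
    },
    "neural_network": {
        "n_layers": [1, 2, 3],
        "learning_rate": [1e-4, 1e-3, 1e-2],  # only meaningful for the network
    },
}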

These particularities of the hyperparameter optimization problem have led the machine learning community to favor some of the optimization techniques which I discuss below.

2.3 Sequential Model-Based Optimization

All of the techniques considered so far made no assumption at all about the function \(\hat R\) to optimize.

This is a problem, because we do have prior information about \(\hat R\). We can expect \(\hat R\) to have some level of regularity, meaning that similar hyperparameter configurations should have similar performance. This knowledge allows us to make inference about \(\hat R(\lambda)\) given the evaluation of \(\hat R\) at other points \(\tilde \lambda \not = \lambda\).

More formally, suppose we have evaluated \(\hat R\) at a sequence of hyperparameter configurations \(\lambda_1, \lambda_2, \dots, \lambda_n\), thus observing \(\hat R(\lambda_1), \hat R(\lambda_2), \dots, \hat R(\lambda_n)\). This allows us to make inference about \(\hat R\). In particular, we can try to guess which \(\lambda_{n+1}\) is most likely to maximize \(\hat R\), or to improve our knowledge of \(\hat R\). Once we've observed \(\hat R(\lambda_{n+1})\), we repeat the process, trying to guess which \(\lambda_{n+2}\) to pick next, and so on. That is the entire idea behind sequential model-based optimization.

To make this work in practice, we need the following ingredients:

  1. An inferential model for \(\hat R\). That could be a Bayesian nonparametric model, like a Gaussian process, or something else, like a Tree-structured Parzen Estimator.
  2. A method to guess the next best hyperparameter value to pick. Typically, \(\lambda_{n+1}\) is chosen to maximize the expected improvement criterion. This chooses \(\lambda\) to maximize the expected value of \(\max\{\hat R(\lambda) - R^*, 0\}\), where \(R^*\) is the current observed performance maximum. In other words, we want to maximize the potential for improving the current optimum, without penalizing the possibility of observing a lower performance. This allows us to optimize \(\hat R\) while still exploring the hyperparameter space (see the closed-form expression and the sketch just after this list). A review of a few other selection criteria can be found here.
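For concreteness: under a Gaussian posterior for \(\hat R(\lambda)\) (as with a Gaussian process surrogate and noiseless observations), the expected improvement criterion has a well-known closed form,

\[
\mathrm{EI}(\lambda) = \mathbb{E}\big[\max\{\hat R(\lambda) - R^*, 0\}\big] = \big(\mu(\lambda) - R^*\big)\,\Phi(z) + \sigma(\lambda)\,\phi(z), \qquad z = \frac{\mu(\lambda) - R^*}{\sigma(\lambda)},
\]

where \(\mu(\lambda)\) and \(\sigma(\lambda)\) are the surrogate's posterior mean and standard deviation at \(\lambda\), and \(\Phi\) and \(\phi\) are the standard normal distribution function and density.

Putting the two ingredients together, here is a bare-bones sketch of the SMBO loop. It assumes purely numeric hyperparameters, it optimizes the acquisition criterion over a batch of random candidates, and the names smbo_maximize, Rhat and sample_candidate are just placeholders; real libraries such as scikit-optimize (used below) handle mixed variable types and many other practical details.

Code
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def smbo_maximize(Rhat, sample_candidate, n_init=5, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)

    # Start from a few random hyperparameter configurations.
    lambdas = [sample_candidate(rng) for _ in range(n_init)]
    scores = [Rhat(lam) for lam in lambdas]

    for _ in range(n_iter):
        # 1. Fit the surrogate model to everything observed so far.
        surrogate = GaussianProcessRegressor(normalize_y=True)
        surrogate.fit(np.array(lambdas), np.array(scores))

        # 2. Choose the next configuration by expected improvement,
        #    approximated over a batch of random candidates.
        candidates = np.array([sample_candidate(rng) for _ in range(500)])
        mu, sigma = surrogate.predict(candidates, return_std=True)
        best = max(scores)
        z = (mu - best) / np.maximum(sigma, 1e-12)
        ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

        # 3. Evaluate the chosen configuration and add it to the history.
        lam_next = candidates[np.argmax(ei)]
        lambdas.append(lam_next)
        scores.append(Rhat(lam_next))

    best_idx = int(np.argmax(scores))
    return lambdas[best_idx], scores[best_idx]

# Toy usage on a made-up one-dimensional "performance" surface.
lam_best, score_best = smbo_maximize(
    Rhat=lambda lam: float(-(lam[0] - 0.3) ** 2),
    sample_candidate=lambda rng: rng.uniform(0, 1, size=1),
)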

When a Bayesian inferential framework is chosen, sequential model-based optimization is called Bayesian optimization or Bayesian search. It is beyond the scope of this blog post to go into the details of Gaussian processes, but below I show how the scikit-optimize library can be used to perform Bayesian optimization based on Gaussian processes and the expected improvement criterion. Note that gp_minimize minimizes its objective, which is why we hand it the cross-validated error to be minimized:

Code
from skopt import gp_minimize
from skopt.utils import use_named_args

@use_named_args(space)
def objective(**params):
  model.set_params(**params)
  
  return -np.mean(cross_val_score(model, X, y, cv=3, n_jobs=-1,
                                  scoring="neg_mean_absolute_error"))

res_gp = gp_minimize(objective, space, n_calls=54, random_state=1)

res_gp.fun  # best cross-validated mean absolute error found
## 0.46
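The result object returned by gp_minimize also stores the best hyperparameter values found, in the same order as the space list defined earlier; these could then be passed back to model.set_params() to refit the final model on the full dataset.

Code
# Best hyperparameter values found, matched back to their names.
best_params = dict(zip(
    ["max_depth", "learning_rate", "max_features", "min_samples_leaf"],
    res_gp.x))

best_params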

With Bayesian optimization, we see that much more time is spent sampling performant models.

Code
import matplotlib.pyplot as plt

plt.clf()
p = plt.hist(res_gp.func_vals, bins=20)
p = plt.title("Score distribution for evaluated hyperparameters", loc="left")
p = plt.xlabel("Cross-validated average absolute error")
plt.show()

Furthermore, we can see that the algorithm quickly converges towards performant models.

Code
from skopt.plots import plot_convergence

plot_convergence(res_gp)

3 Summary

This blog post provided a basic overview of hyperparameter optimization and of what can be gained from these techniques. We reviewed grid search, the simplest brute-force approach. We reviewed random search, which improves upon grid search when some hyperparameter dimensions are not influential. Finally, we reviewed sequential model-based optimization, which much more effectively samples models with good performance.
