Helpful context:


You have a neural network with five hyperparameters: learning rate, batch size, dropout probability, number of layers, and hidden size. Every combination requires training the full model and evaluating it on a validation set. One training run takes four hours. Grid search over 5 values per hyperparameter gives $5^5 = 3125$ configurations. That is over 12,000 hours of compute - roughly 500 days on a single machine.

There has to be a smarter way. Bayesian optimization is that smarter way.

The Objective Function

Define $f(x)$ as the validation performance (say, validation accuracy) when you train with hyperparameter configuration $x$. We want to find $x^* = \arg\max_x f(x)$.

Three properties make this hard:

  1. Expensive to evaluate. Each call to $f$ requires a full training run. You can afford tens of evaluations, not thousands.
  2. No gradients. The validation metric is not differentiable with respect to the hyperparameters in general. You cannot run gradient descent on $f$ directly. (This is a different optimization problem from training the model, where gradients of the training loss are available.)
  3. Black box. You get a number out when you put a configuration in. You have no analytic expression for $f$, no formula to reason about. Only a sequence of evaluated points.

This is called black-box optimization. The goal is to find the maximum of $f$ while calling $f$ as few times as possible.

Why Grid Search and Random Search Fall Short

Grid search evaluates a fixed grid of hyperparameter combinations. It is exhaustive within the grid but wastes evaluations: if a region of the grid is bad, grid search still evaluates all grid points in that region. It also misses points between grid values - the true optimum may lie between the grid lines.

Random search (Bergstra and Bengio, 2012) is surprisingly better. The key insight: if only 2 of your 5 hyperparameters actually matter to performance, random search effectively searches those 2 important dimensions much more densely than grid search does. Grid search allocates equal evaluation budget to all dimensions including the irrelevant ones; random search does not.

But both methods share a fatal flaw: they ignore what was already evaluated. If configuration $x = (0.001, 64, 0.3, 4, 256)$ gave 91% validation accuracy and configuration $x' = (0.001, 64, 0.3, 4, 512)$ is very similar, you have good reason to believe $x'$ will also perform well - or that slightly different values near $(0.001, 64, 0.3, 4, \cdot)$ might do even better. Grid search and random search throw away that information completely. Every evaluation is independent.

The Core Idea - Use What You Already Know

Bayesian optimization builds a model of $f$ from the evaluations it has already made, then uses that model to decide intelligently where to evaluate next.

After evaluating $f$ at points $x_1, \ldots, x_t$, you have a dataset of observations ${(x_i, f(x_i))}_{i=1}^t$. Fit a surrogate model - a probabilistic model that predicts $f(x)$ at every unobserved point $x$ and also quantifies its uncertainty about that prediction. Use the surrogate to compute an acquisition function that scores each candidate point $x$ by how useful it would be to evaluate next. Evaluate $f$ at the point that maximizes the acquisition function. Update the surrogate with the new observation. Repeat.

This is sequential model-based optimization (SMBO). Each evaluation informs the next. The surrogate gradually becomes more accurate as more observations accumulate, and the acquisition function focuses the remaining budget on the most promising regions.

Gaussian Processes as the Surrogate

A Gaussian Process (GP) is the standard surrogate model for Bayesian optimization. A GP is a distribution over functions: it defines a probability distribution over all possible functions $f$, and conditioning on observations gives a posterior distribution over functions consistent with those observations.

Formally, a GP is specified by a mean function $m(x)$ and a covariance kernel $k(x, x')$. The kernel captures how similar we expect $f$ to be at two nearby points. A common choice is the squared exponential kernel:

$$k(x, x') = \sigma_f^2 \exp!\left(-\frac{|x - x'|^2}{2\ell^2}\right)$$

Here $\sigma_f^2$ is the signal variance (overall scale of $f$’s variation) and $\ell$ is the length scale (how quickly $f$ changes - small $\ell$ means $f$ can vary rapidly, large $\ell$ means $f$ is smooth). Both are hyperparameters of the GP itself, learned from the observations by maximizing marginal likelihood.

Given $t$ observations, the GP posterior at any new point $x$ is a Gaussian:

$$p(f(x) \mid \text{data}) = \mathcal{N}(\mu(x), \sigma^2(x))$$

The posterior mean $\mu(x)$ is the GP’s best guess for $f(x)$. The posterior variance $\sigma^2(x)$ quantifies uncertainty. Near observed points, the variance is small (we are confident). Far from any observation, the variance is large (we are uncertain).

This is exactly what Bayesian optimization needs: a model that not only predicts, but also knows when it does not know.

The Acquisition Function

The acquisition function $\alpha(x)$ takes the GP’s posterior mean and variance at point $x$ and returns a score for how useful it would be to evaluate $f(x)$ next. Maximizing $\alpha$ gives the next point to evaluate.

The central tension is exploration vs exploitation. Exploitation means evaluating near the current best known point - likely to give a good result but unlikely to discover something better. Exploration means evaluating in high-uncertainty regions - the GP does not know what is there, and there might be something excellent.

Expected Improvement (EI). Let $f^* = \max_{i \leq t} f(x_i)$ be the current best observed value. The expected improvement at $x$ is the expected amount by which $f(x)$ would exceed $f^*$:

$$\text{EI}(x) = \mathbb{E}!\left[\max(0,\ f(x) - f^*)\right]$$

Under the GP, $f(x) \sim \mathcal{N}(\mu(x), \sigma^2(x))$, and EI has a closed form:

$$\text{EI}(x) = (\mu(x) - f^* - \xi) \cdot \Phi(Z) + \sigma(x) \cdot \phi(Z)$$

where $Z = \frac{\mu(x) - f^* - \xi}{\sigma(x)}$, $\Phi$ is the standard normal CDF, $\phi$ is the standard normal PDF, and $\xi \geq 0$ is a small exploration parameter (typically 0.01).

The formula rewards two things simultaneously: a high mean $\mu(x)$ (the GP predicts a good value there) and a high standard deviation $\sigma(x)$ (uncertainty - something good might be lurking). EI is zero where $\mu(x) < f^*$ and $\sigma(x)$ is also small (we confidently predict this region is bad).

Upper Confidence Bound (UCB). A simpler approach:

$$\text{UCB}(x) = \mu(x) + \kappa \cdot \sigma(x)$$

The acquisition score is just the mean plus $\kappa$ times the standard deviation. The hyperparameter $\kappa$ directly controls the exploration-exploitation trade-off: large $\kappa$ favors exploration of uncertain regions, small $\kappa$ favors exploitation near the current best.

Worked example. After 3 evaluations at $x \in {0.001, 0.01, 0.1}$ (learning rates) with validation accuracies ${85%, 91%, 79%}$, the GP posterior has high mean near $x = 0.01$ (the current best) and also high uncertainty near $x = 0.003$ (never evaluated, but adjacent to a good region). EI will score $x = 0.003$ highly: maybe the true optimum is nearby and we have not looked there yet. EI will score $x = 0.2$ (far from all evaluated points) highly for exploration. The next evaluation goes wherever EI is highest.

The Full Loop

  1. Evaluate $f$ at a small number of random configurations to initialize (typically 5-10 points).
  2. Fit a GP to the observed ${(x_i, f(x_i))}$ pairs.
  3. Maximize the acquisition function $\alpha(x)$ over the hyperparameter space to find $x_{\text{next}}$. (This inner optimization is cheap because evaluating $\alpha$ is fast - it just queries the GP.)
  4. Evaluate $f(x_{\text{next}})$ - train the model with those hyperparameters and measure validation accuracy.
  5. Add $(x_{\text{next}}, f(x_{\text{next}}))$ to the observations.
  6. Go to step 2.

After 20-50 total evaluations, Bayesian optimization typically finds a configuration that would require hundreds or thousands of random search evaluations to match.

Random search allocates evaluations uniformly over the hyperparameter space regardless of what was found. Bayesian optimization concentrates evaluations in the regions the surrogate identifies as promising, while still occasionally exploring uncertain regions. The more informative each evaluation, the faster convergence to the optimum.

The gain is most dramatic when $f$ is smooth (nearby hyperparameters give similar performance) and when only a few hyperparameters matter most. The GP’s length scale parameter adapts to this: if performance is insensitive to batch size, the GP learns a long length scale in that dimension and stops sending evaluations there.

When BO Breaks Down

Scalability of the GP. Fitting a GP requires inverting the $t \times t$ covariance matrix, which costs $O(t^3)$ time and $O(t^2)$ memory. After a few hundred observations, standard GP inference becomes slow. Sparse GP approximations exist but add complexity.

High dimensionality. GPs work well in low-dimensional hyperparameter spaces (under roughly 20 dimensions). In high dimensions, the surrogate needs many observations to model $f$ accurately because there are many directions to explore. For neural architecture search with hundreds of discrete choices, GP-based BO is impractical.

Discrete and conditional spaces. Hyperparameter spaces often have structure: the number of hidden units is discrete; the dropout probability only applies if dropout is enabled. Standard GP kernels assume continuous inputs. Extensions exist (random forest surrogates, TPE) but require departing from the clean GP framework.

Tools in Practice

Optuna is the most widely used hyperparameter optimization framework in production ML today. It implements Tree-structured Parzen Estimator (TPE) - a BO variant that models $p(x \mid f(x) > \tau)$ and $p(x \mid f(x) \leq \tau)$ separately using kernel density estimators, rather than fitting a GP over the full space. TPE scales much better than GP-based BO and handles discrete and conditional hyperparameter spaces naturally.

import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    n_layers = trial.suggest_int("n_layers", 1, 5)
    # ... train model with these hyperparameters ...
    return validation_accuracy

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

print(study.best_params)

Hyperopt is older and also implements TPE. scikit-optimize (skopt) implements GP-based BO and is easy to use for small experiments. BoTorch is a PyTorch-based framework for scalable BO with GPU-accelerated GP inference and modern acquisition functions, used when the full GP framework is needed at scale.


Concept Key point
Black-box optimization Maximize $f(x)$ when $f$ is expensive to evaluate, non-differentiable, and has no analytic form
Random search vs grid search Random search better when few hyperparameters matter; both ignore previous evaluations
Surrogate model A cheap probabilistic model of $f$; GP gives mean and uncertainty at every point
GP posterior Mean = best guess; variance = uncertainty; low near observed points, high far away
Acquisition function Scores candidate points by expected usefulness; balances exploration and exploitation
Expected Improvement $\mathbb{E}[\max(0, f(x) - f^*)]$; rewards high mean and high uncertainty simultaneously
UCB $\mu(x) + \kappa \sigma(x)$; $\kappa$ controls exploration-exploitation trade-off
GP scaling $O(t^3)$ in the number of observations; impractical above a few hundred evaluations
TPE BO variant used in Optuna; scales better than GP-based BO; handles discrete spaces

Read Next: