Gaussian Processes
Distributions Over Functions
A Gaussian Process (GP) is a probability distribution over functions. Formally, $f \sim \mathcal{GP}(\mu, k)$ means that for any finite collection of inputs $x_1, \ldots, x_n$, the joint distribution of function values is multivariate Gaussian:
$$(f(x_1), \ldots, f(x_n)) \sim \mathcal{N}(\mathbf{m}, K)$$
where $m_i = \mu(x_i)$ and $K_{ij} = k(x_i, x_j)$. The mean function $\mu: \mathcal{X} \to \mathbb{R}$ captures prior beliefs about the function’s average behavior (often taken to be zero). The kernel (covariance function) $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ encodes beliefs about smoothness, periodicity, and scale.
A valid kernel must produce positive semi-definite Gram matrices $K$ for all finite input sets - this guarantees that the implied covariances are internally consistent.
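To make the finite-dimensional definition concrete, here is a minimal sketch that builds the Gram matrix for a grid of inputs and draws sample function values from the resulting multivariate Gaussian. It assumes a zero mean function and the squared exponential kernel defined in the next section; the hyperparameter values and variable names are illustrative only.

```python
# Finite-dimensional view of a GP: pick any inputs, build K, sample from N(m, K).
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel k(x, x') = sigma^2 exp(-(x - x')^2 / (2 l^2))."""
    sq_dists = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)            # any finite collection of inputs
m = np.zeros_like(x)                  # zero mean function
K = rbf_kernel(x, x)                  # Gram matrix, positive semi-definite
K += 1e-10 * np.eye(len(x))           # small jitter for numerical stability
samples = rng.multivariate_normal(m, K, size=3)   # three draws from the GP prior
```

Each row of `samples` is one realization of the prior evaluated at the chosen inputs; repeating this with a finer grid traces out smooth random functions.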
Kernels
The choice of kernel encodes structural assumptions about $f$:
Squared Exponential (RBF): $$k(x, x') = \sigma^2 \exp\!\left(-\frac{|x - x'|^2}{2\ell^2}\right).$$ Produces infinitely differentiable (analytic) sample paths. $\ell$ is the length-scale (how quickly correlation decays) and $\sigma^2$ is the output variance.
Matérn $\nu = 5/2$: $$k(x,x') = \sigma^2\!\left(1 + \frac{\sqrt{5}|x-x'|}{\ell} + \frac{5|x-x'|^2}{3\ell^2}\right)\exp\!\left(-\frac{\sqrt{5}|x-x'|}{\ell}\right).$$ Produces twice-differentiable sample paths - often more realistic than the RBF for physical processes.
Periodic: $$k(x,x') = \sigma^2 \exp\!\left(-\frac{2\sin^2(\pi|x-x'|/p)}{\ell^2}\right).$$ Encodes periodicity with period $p$.
Kernels can be combined: sums and products of valid kernels are valid, enabling compositional modeling.
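As a rough illustration of these kernels and of composition, the sketch below implements the Matérn 5/2 and periodic covariances from above and combines them by product and sum. The combinations and hyperparameter values are illustrative assumptions, not recommendations.

```python
# Matern 5/2 and periodic kernels, plus composition:
# sums and products of valid kernels are again valid kernels.
import numpy as np

def matern52(x1, x2, lengthscale=1.0, variance=1.0):
    r = np.abs(x1[:, None] - x2[None, :])
    s = np.sqrt(5.0) * r / lengthscale
    return variance * (1.0 + s + s**2 / 3.0) * np.exp(-s)

def periodic(x1, x2, period=1.0, lengthscale=1.0, variance=1.0):
    r = np.abs(x1[:, None] - x2[None, :])
    return variance * np.exp(-2.0 * np.sin(np.pi * r / period) ** 2 / lengthscale**2)

x = np.linspace(0, 5, 40)
# Product: a "locally periodic" kernel whose periodic correlation decays with distance.
K_locally_periodic = periodic(x, x) * matern52(x, x, lengthscale=2.0)
# Sum: a smooth trend plus a seasonal component.
K_trend_plus_seasonal = matern52(x, x, lengthscale=3.0) + periodic(x, x, period=1.0)
```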
GP Posterior: Conditioning on Data
Given observations $\mathbf{y} = f(\mathbf{x}) + \boldsymbol{\varepsilon}$ at training inputs $\mathbf{x} = (x_1,\ldots,x_n)$ with noise $\boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma_n^2 I)$, the GP posterior at test inputs $\mathbf{x}^\ast$ has a closed-form Gaussian distribution with:
$$\mu^\ast(\mathbf{x}^\ast) = \mu(\mathbf{x}^\ast) + K_{*f}(K_{ff} + \sigma_n^2 I)^{-1}(\mathbf{y} - \mu(\mathbf{x}))$$
$$\Sigma^\ast(\mathbf{x}^\ast, \mathbf{x}^\ast) = K_{**} - K_{*f}(K_{ff} + \sigma_n^2 I)^{-1}K_{f*}$$
where $K_{ff} = k(\mathbf{x}, \mathbf{x})$, $K_{*f} = k(\mathbf{x}^\ast, \mathbf{x})$, and $K_{**} = k(\mathbf{x}^\ast, \mathbf{x}^\ast)$.
These formulas follow from the standard formula for conditioning a multivariate Gaussian. The posterior mean is a linear combination of kernel evaluations - a “kernel machine.” The posterior variance $\Sigma^\ast$ is the prior variance $K_{**}$ minus a non-negative correction term: observing data can only reduce uncertainty.
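A minimal sketch of these equations, assuming a zero prior mean; the `kernel` argument is any Gram-matrix function like those above, and the variable names are illustrative.

```python
# GP posterior mean and covariance at test inputs, via the conditioning formulas above.
import numpy as np
from scipy.linalg import cholesky, cho_solve

def gp_posterior(kernel, x_train, y_train, x_test, noise_var=1e-2):
    K_ff = kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_sf = kernel(x_test, x_train)            # K_{*f}
    K_ss = kernel(x_test, x_test)             # K_{**}
    L = cholesky(K_ff, lower=True)            # Cholesky factor of K_ff + sigma_n^2 I
    alpha = cho_solve((L, True), y_train)     # (K_ff + sigma_n^2 I)^{-1} y
    mean = K_sf @ alpha
    V = cho_solve((L, True), K_sf.T)          # (K_ff + sigma_n^2 I)^{-1} K_{f*}
    cov = K_ss - K_sf @ V
    return mean, cov
```

The Cholesky solve is used instead of an explicit matrix inverse; it is both cheaper and numerically more stable for the symmetric positive-definite matrix $K_{ff} + \sigma_n^2 I$.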
Predictive Uncertainty
The diagonal of $\Sigma^\ast$ gives pointwise predictive variance - a calibrated measure of uncertainty that automatically grows away from the training data. This is a key advantage over neural networks: GPs provide principled uncertainty estimates without dropout approximations or ensemble methods.
The predictive distribution for a new noisy observation $y^\ast$ is:
$$p(y^\ast \mid \mathbf{x}^\ast, \mathbf{x}, \mathbf{y}) = \mathcal{N}(y^\ast;\ \mu^\ast, \Sigma^\ast + \sigma_n^2 I).$$
Hyperparameter Optimization via Marginal Likelihood
The kernel hyperparameters $\theta$ (e.g., $\ell$, $\sigma^2$, $\sigma_n^2$) are fit by maximizing the log marginal likelihood:
$$\log p(\mathbf{y} \mid \mathbf{x}, \theta) = -\frac{1}{2}\mathbf{y}^\top (K_{ff} + \sigma_n^2 I)^{-1}\mathbf{y} - \frac{1}{2}\log\det(K_{ff} + \sigma_n^2 I) - \frac{n}{2}\log 2\pi.$$
The first term rewards data fit; the second penalizes model complexity (Occam’s razor emerges automatically). The hyperparameters are fit by gradient-based optimization of this objective. The marginal likelihood automatically selects an appropriate length-scale: too small a length-scale fits the data closely (large first term) but incurs a large complexity penalty (very negative second term), while too large a length-scale is penalized little but fits the data poorly.
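A sketch of how this objective is typically set up for a generic optimizer, assuming an RBF kernel, a zero mean, and log-transformed hyperparameters to keep them positive; `x_train` and `y_train` are placeholder names for the training data.

```python
# Negative log marginal likelihood of a zero-mean GP with an RBF kernel,
# parameterized by theta = (log lengthscale, log signal variance, log noise variance).
import numpy as np
from scipy.linalg import cholesky, cho_solve
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, x, y):
    lengthscale, signal_var, noise_var = np.exp(log_theta)   # exponentiate to stay positive
    sq_dists = (x[:, None] - x[None, :]) ** 2
    K = signal_var * np.exp(-0.5 * sq_dists / lengthscale**2) + noise_var * np.eye(len(x))
    L = cholesky(K, lower=True)
    alpha = cho_solve((L, True), y)
    # log det via Cholesky: log|K| = 2 * sum(log diag(L))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(x) * np.log(2 * np.pi)

# e.g. result = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(x_train, y_train))
```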
Computational Cost and Sparse Approximations
Exact GP inference requires solving a linear system with the $n \times n$ matrix $K_{ff} + \sigma_n^2 I$, costing $O(n^3)$ in time and $O(n^2)$ in memory. This is prohibitive for $n \gtrsim 10^4$.
Sparse approximations introduce $m \ll n$ inducing points $\mathbf{z}$ and approximate the full GP by conditioning on $f(\mathbf{z})$. The variational approach of Titsias (2009) selects the inducing locations by maximizing a lower bound on the log marginal likelihood, reducing the cost to $O(nm^2)$ time and $O(nm)$ memory.
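The sketch below shows only the predictive mean and the $O(nm^2)$ cost structure: every linear solve involves an $m \times m$ matrix and the full $n \times n$ matrix $K_{ff}$ is never formed. This is the predictive mean shared by the DTC and Titsias-style variational approximations; the variational bound used to place $\mathbf{z}$ and the corrected predictive variance are omitted here, so treat it as an illustration of the cost, not a full implementation.

```python
# Sparse GP predictive mean with m inducing points: only m x m solves, O(n m^2) total.
import numpy as np

def sparse_gp_mean(kernel, x_train, y_train, z_inducing, x_test, noise_var=1e-2):
    K_mm = kernel(z_inducing, z_inducing)              # m x m
    K_mf = kernel(z_inducing, x_train)                 # m x n
    K_sm = kernel(x_test, z_inducing)                  # n_test x m
    A = K_mm + K_mf @ K_mf.T / noise_var               # m x m, built in O(n m^2)
    A += 1e-8 * np.eye(len(z_inducing))                # jitter for numerical stability
    # mean_* = K_{*m} (K_mm + K_mf K_fm / sigma_n^2)^{-1} K_mf y / sigma_n^2
    return K_sm @ np.linalg.solve(A, K_mf @ y_train) / noise_var
```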
Examples
Bayesian optimization. GPs provide the surrogate model in Bayesian optimization (BO): fit a GP to observed objective evaluations, then use the posterior to define an acquisition function (e.g., Expected Improvement, Upper Confidence Bound) that balances exploration and exploitation.
Expected Improvement at a candidate $x^\ast$ is:
$$\mathrm{EI}(x^\ast) = \mathbb{E}\big[\max(f(x^\ast) - f^+,\, 0)\big]$$
where $f^+ = \max_i y_i$ is the current best observation. This has a closed form in terms of the GP posterior mean and variance.
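For a maximization problem the closed form is $\mathrm{EI} = (\mu - f^+)\Phi(z) + \sigma\phi(z)$ with $z = (\mu - f^+)/\sigma$, where $\mu$ and $\sigma$ are the posterior mean and standard deviation at $x^\ast$. A small sketch (function and variable names are illustrative):

```python
# Closed-form Expected Improvement (maximization), given the GP posterior at a candidate.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    sigma = np.maximum(sigma, 1e-12)          # guard against zero posterior variance
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)
```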
Hyperparameter tuning. BO with GP surrogates is the standard approach for tuning expensive models (e.g., training neural networks) where each evaluation costs hours of compute.
GP regression generalizes ridge regression: with fixed hyperparameters, the GP posterior mean coincides with kernel ridge regression in the RKHS induced by the kernel (with regularization parameter $\sigma_n^2$), providing a Bayesian perspective on regularization.