Helpful context:


You’ve placed weather sensors at ten locations around a city. You read the temperature at each one. Now someone asks: what’s the temperature at a spot between two of your sensors - a location where you have no measurement at all?

You could just average the nearest neighbors. But that throws away information: you know that nearby places tend to have similar temperatures, and that this similarity decays with distance. You’d also like to express how confident you are in the estimate - wider uncertainty far from any sensor, tighter uncertainty near one. A simple average gives you a guess but not an honest account of what you don’t know.

Gaussian processes give you both: a principled estimate and a calibrated measure of uncertainty, derived directly from the structure of the data and your assumptions about how the function behaves. They are one of the most elegant ideas in machine learning, and they rest on concepts you already know.


From One Gaussian to Infinitely Many

Start small.

A single Gaussian distribution is a probability distribution over one number. You draw a value $x \sim \mathcal{N}(\mu, \sigma^2)$ - it has mean $\mu$ and variance $\sigma^2$.

A bivariate Gaussian is a distribution over pairs of numbers $(x_1, x_2)$. It is characterized by a mean vector and a $2 \times 2$ covariance matrix:

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}\right).$$

The covariance $\sigma_{12}$ captures how correlated $x_1$ and $x_2$ are. If $\sigma_{12}$ is large and positive, then when $x_1$ is above its mean, $x_2$ tends to be above its mean too. If $\sigma_{12} = 0$, they are independent.

Now push this further. What if you had a distribution over a function $f$, rather than over a fixed finite-dimensional vector?

A Gaussian process is exactly this. It is a probability distribution over functions. Formally:

Definition. A Gaussian process (GP) is a collection of random variables $\{f(x) : x \in \mathcal{X}\}$, any finite subset of which is jointly Gaussian.

This is the key point. You never have to deal with all possible inputs at once. Whenever you pick a finite set of inputs $\{x_1, x_2, \ldots, x_n\}$, the corresponding function values $(f(x_1), f(x_2), \ldots, f(x_n))$ are jointly Gaussian, with some mean vector and covariance matrix.

Discomfort check. “A distribution over functions” sounds impossible - there are uncountably many functions from $\mathbb{R}$ to $\mathbb{R}$. How can you put a probability distribution on that? The trick is that a GP is defined entirely by its finite-dimensional marginals. You specify: for any finite input set, what is the joint Gaussian distribution? Kolmogorov’s extension theorem guarantees that if these marginals are consistent (they agree whenever you ask about a shared subset of inputs), then there is a unique probability measure on function space that generates them. You never need to write down a density over infinite dimensions - you only ever query the process at finitely many points.

A GP is completely specified by two things:

  • A mean function $m(x) = \mathbb{E}[f(x)]$: the expected value of the function at each input.
  • A kernel (or covariance function) $k(x, x') = \text{Cov}(f(x), f(x'))$: the covariance between function values at any pair of inputs.

For any finite set of inputs $X = \{x_1, \ldots, x_n\}$, the joint distribution is:

$$\mathbf{f} = (f(x_1), \ldots, f(x_n))^\top \sim \mathcal{N}(\mathbf{m}, K),$$

where $\mathbf{m}i = m(x_i)$ and $K{ij} = k(x_i, x_j)$.


The Kernel: The Personality of the Process

The kernel $k(x, x')$ is where all the modeling assumptions live. It encodes your beliefs about the function before you see any data: how smooth it is, whether it oscillates, how quickly it varies.

The RBF / Squared Exponential Kernel

The most common choice is the radial basis function (RBF) kernel, also called the squared exponential:

$$k(x, x') = \sigma_f^2 \exp\left(-\frac{|x - x'|^2}{2\ell^2}\right).$$

Two parameters:

  • Output scale $\sigma_f^2$: controls the overall variance of the function values.
  • Length scale $\ell$: controls how quickly correlation decays with distance. Small $\ell$ means the function varies rapidly; large $\ell$ means it varies slowly.

The key behavior: when $x \approx x'$, the kernel is close to $\sigma_f^2$. When $x$ and $x'$ are far apart (relative to $\ell$), the kernel decays to zero - the function values become nearly independent. Functions drawn from a GP with the RBF kernel are infinitely differentiable, meaning they are very smooth.

Periodic Kernel

$$k(x, x') = \sigma_f^2 \exp\left(-\frac{2\sin^2(\pi |x - x'| / p)}{\ell^2}\right).$$

This generates functions that repeat with period $p$. Useful for modeling seasonal patterns, cycles in data, or anything periodic.

Matérn Kernels

The Matérn family interpolates between the extreme smoothness of the RBF kernel and rough, jagged functions. The parameter $\nu$ controls differentiability: $\nu = 1/2$ gives functions that are continuous but not differentiable (like Brownian motion paths); $\nu = 3/2$ gives once-differentiable functions; $\nu = 5/2$ gives twice-differentiable functions. Many real-world phenomena are better modeled with Matérn kernels than with the RBF - physical processes are rarely infinitely smooth.

What Makes a Valid Kernel?

Not every function $k(x, x')$ is a valid covariance function. The matrix $K$ with $K_{ij} = k(x_i, x_j)$ must be positive semi-definite for any input set - this is what ensures the distribution is a genuine multivariate Gaussian. Mercer’s theorem characterizes valid kernels: they are precisely those that can be expressed as an inner product $k(x, x') = \langle \phi(x), \phi(x') \rangle$ in some (possibly infinite-dimensional) feature space $\phi$.


The Prior: Sampling Functions Before Seeing Data

Before observing any data, a GP with mean function $m \equiv 0$ and kernel $k$ defines a prior distribution over functions: $f \sim \mathcal{GP}(0, k)$.

To sample from this prior, you pick a grid of input points $x_1, \ldots, x_n$, form the kernel matrix $K$, and draw a multivariate Gaussian sample:

$$\mathbf{f} \sim \mathcal{N}(\mathbf{0}, K).$$

With the RBF kernel, samples look like smooth, wavy curves - they vary at a rate determined by $\ell$, and their typical amplitude is set by $\sigma_f^2$. With a small length scale, the curves wiggle rapidly; with a large length scale, they vary slowly. With a Matérn-1/2 kernel, samples look like jagged, rough paths.

The prior encodes what functions you consider plausible before seeing any evidence. It is not arbitrary - choosing a kernel is a modeling decision that reflects genuine knowledge about the problem. If you know your data has a periodic component, use a periodic kernel. If you believe the function is very smooth, the RBF is appropriate. If you want to be less committal about smoothness, use Matérn.


The Posterior: Learning from Data

Here is where the GP becomes powerful. Suppose you observe noisy measurements at $n$ locations:

$$y_i = f(x_i) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma_n^2),$$

where $\sigma_n^2$ is the observation noise variance. You want to predict $f$ at new test inputs $X_* = \{x_1^, \ldots, x_m^\}$.

The joint distribution of training outputs $\mathbf{y}$ and test function values $\mathbf{f}_*$ is Gaussian:

$$\begin{pmatrix} \mathbf{y} \\ \mathbf{f}* \end{pmatrix} \sim \mathcal{N}\left(\mathbf{0},\ \begin{pmatrix} K + \sigma_n^2 I & K* \\ K_*^\top & K_{**} \end{pmatrix}\right),$$

where $K = k(X, X)$ is the $n \times n$ training kernel matrix, $K_* = k(X, X_)$ is the $n \times m$ cross-kernel, and $K_{**} = k(X_, X_*)$ is the $m \times m$ test kernel matrix.

Conditioning on $\mathbf{y}$ (using the standard formula for Gaussian conditionals) gives the posterior:

$$\mathbf{f}* \mid X, \mathbf{y}, X* \sim \mathcal{N}(\boldsymbol{\mu}*, \Sigma*),$$

where:

$$\boldsymbol{\mu}* = K*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y},$$

$$\Sigma_* = K_{**} - K_^\top (K + \sigma_n^2 I)^{-1} K_.$$

These two equations are the core of GP regression. Let’s unpack what they say.

The posterior mean $\boldsymbol{\mu}*$ is a linear combination of the training observations $\mathbf{y}$, weighted by the kernel. Inputs $x$ that are close to training points (high $k(x_, x_i)$) get large weights; inputs far from all training points get small weights. This is principled interpolation.

The posterior variance $\Sigma_$ tells you how uncertain you are. Look at $K_{**} - K_^\top (K + \sigma_n^2 I)^{-1} K_*$: the first term is the prior variance at the test points; the second term is the reduction in variance due to the data. Near training points, the second term is large - the data tells you a lot. Far from all training points, the second term is small - you haven’t learned much, and uncertainty is close to the prior.

This is one of the defining features of GPs: uncertainty is automatically higher where you have less data, and lower where you have more. Neural networks, by contrast, give you a point prediction with no built-in sense of where they’re extrapolating.

Discomfort check. Why does conditioning a Gaussian give another Gaussian? This is a property of the multivariate normal distribution. The conditional distribution of a subset of Gaussian variables given another subset is always Gaussian - and the mean and covariance of the conditional can be computed exactly using the partitioned matrix formulas. It’s the Gaussian’s self-consistency under conditioning that makes GPs analytically tractable.


Hyperparameter Learning

The kernel has parameters - the length scale $\ell$, the output scale $\sigma_f^2$, and the noise level $\sigma_n^2$. How do you choose them?

The Bayesian answer: maximize the marginal likelihood (also called the evidence). The marginal likelihood is the probability of the observed data $\mathbf{y}$ under the GP model, averaged over all possible functions:

$$\log p(\mathbf{y} \mid X, \theta) = -\frac{1}{2}\mathbf{y}^\top (K_\theta + \sigma_n^2 I)^{-1} \mathbf{y} - \frac{1}{2}\log|K_\theta + \sigma_n^2 I| - \frac{n}{2}\log 2\pi.$$

The first term rewards fitting the data well. The second term penalizes complex models (large determinant corresponds to high prior uncertainty, i.e., a flexible model). This is an automatic Occam’s razor: the marginal likelihood balances fit against complexity, and you can optimize it using gradient descent to find good hyperparameters.

This is different from cross-validation: you get the model complexity penalty for free, without needing to hold out data.


Computational Cost: The Main Limitation

The bottleneck in GP regression is inverting the $n \times n$ kernel matrix $K + \sigma_n^2 I$. Exact matrix inversion costs $O(n^3)$ time and $O(n^2)$ memory. For $n = 1{,}000$ training points, this is fast. For $n = 100{,}000$, it is intractable.

This is the Achilles' heel of GPs. Exact GPs scale poorly to large datasets - a fundamental constraint that has driven an entire subfield of approximate GP methods:

  • Sparse GPs: approximate the full kernel matrix using $m \ll n$ “inducing points”. Cost drops to $O(nm^2)$.
  • Kronecker structure: if inputs lie on a grid and the kernel is separable, exploit the Kronecker product structure for fast inference.
  • Random features: approximate the kernel using random Fourier features (Rahimi & Recht), turning the GP into a linear model in feature space.

For moderate $n$ (up to tens of thousands), exact GPs remain one of the most principled probabilistic models available.


Predictive Uncertainty in Practice

Here is a concrete example. Suppose you fit a GP to noisy observations of a smooth function. The posterior mean traces through the data, wiggling to fit the observations. The posterior variance is narrow near the training points - the model is confident there - and widens in the gaps between observations and beyond the edges of the training range.

If you add a new observation in a previously uncertain region, the posterior mean shifts toward the new point and the posterior variance there collapses. The process learns. Each new observation updates the posterior globally, because the kernel connects every pair of inputs.

This behavior - narrow uncertainty where you have data, wide uncertainty where you don’t - is called calibration. A calibrated model means that when you say you’re 95% confident the true value lies in an interval, you’re right 95% of the time. GPs are well-calibrated by construction when the model assumptions hold. This makes them especially valuable in applications like Bayesian optimization, where acting on uncertainty is as important as making accurate predictions.


Connections to Other Models

Gaussian Processes and Neural Networks

Neal (1994) showed that a single-layer neural network with infinitely many hidden units and random weights drawn from a suitable prior converges to a GP. The kernel is determined by the activation function. This connection was extended by Matthews et al. and Lee et al. (2018) to deep networks: an infinitely wide, infinitely deep neural network corresponds to a specific GP kernel - the Neural Network Gaussian Process (NNGP) kernel.

This means GPs and deep learning are not opposites - they sit at two ends of a spectrum. GPs are the infinite-width, exact Bayesian limit of neural networks. In practice, finite networks behave differently, but the NNGP connection illuminates why neural networks generalize: they are implicitly performing something like GP regression with a kernel induced by their architecture.

Gaussian Processes and Bayesian Optimization

Bayesian optimization uses a GP as a surrogate model for an expensive black-box function (like the accuracy of a neural network as a function of its hyperparameters). You query the function at a few points, build a GP posterior, then choose the next point to query based on an acquisition function that trades off exploration (going where uncertainty is high) and exploitation (going where the mean is high). The GP’s uncertainty estimates make this trade-off principled. This is how modern hyperparameter tuning tools like Optuna and GPyOpt work.


Summary

Concept Description
Gaussian process Distribution over functions; any finite collection of function values is jointly Gaussian
Mean function $m(x)$ Expected function value at each input
Kernel $k(x, x')$ Covariance between function values; encodes prior beliefs about smoothness
RBF kernel Smooth, infinitely differentiable functions; decays exponentially with distance
Matérn kernel Controlled roughness; $\nu = 3/2$ or $5/2$ are common choices
GP prior Distribution over functions before seeing data
GP posterior (regression) Updated distribution after conditioning on noisy observations
Posterior mean Predictive estimate; linear combination of training outputs
Posterior variance Uncertainty; collapses near data, widens away from it
Marginal likelihood Used to learn kernel hyperparameters; automatic complexity penalty
Computational cost $O(n^3)$ exact; sparse methods reduce to $O(nm^2)$
NNGP correspondence Infinitely wide neural networks converge to GPs

Gaussian processes are, at heart, a generalization of the multivariate Gaussian to function spaces. The key machinery - Gaussian conditioning, positive-definite kernels, marginal likelihood optimization - all flows from probability theory you already know. What GPs add is the ability to put principled probability distributions directly over functions, inheriting all the benefits of Bayesian inference: automatic uncertainty quantification, hyperparameter learning through the marginal likelihood, and elegant posterior updates in closed form.

The cost is scalability. For large datasets, you need approximations. But for moderate-scale problems, or problems where calibrated uncertainty is non-negotiable, Gaussian processes remain one of the most powerful tools in the probabilistic modeling toolkit.


Read next: