Metropolis-Hastings

Helpful context:

In 1953, scientists at Los Alamos needed to compute multidimensional integrals that appear in statistical mechanics - integrals over spaces with billions of dimensions. Standard numerical integration was hopeless: a grid over 100 dimensions with just 2 points per axis requires $2^{100}$ evaluations. Nicholas Metropolis and colleagues invented a different approach: walk randomly through the space, but bias the walk toward high-probability regions. Sample from the walk. The integral becomes an average over your samples.

It worked. And it now powers Bayesian inference, protein structure prediction, and training modern neural networks.

The Problem

You have a target distribution $\pi(x)$ that you want to sample from. You can evaluate $\pi(x)$ up to a normalizing constant - that is, you know $\tilde{\pi}(x) \propto \pi(x)$ but not the constant $Z = \int \tilde{\pi}(x) dx$.

This situation is ubiquitous in Bayesian inference. The posterior is:

$$p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta) p(\theta)}{p(\text{data})}$$

You can evaluate the numerator - the likelihood times the prior. The denominator $p(\text{data}) = \int p(\text{data} \mid \theta) p(\theta) d\theta$ is an integral over all possible parameters, and it’s usually intractable. You know the posterior up to the normalizing constant but can’t sample from it directly.

The goal: generate samples $\theta_1, \theta_2, \ldots$ from $\pi$ without knowing $Z$, so you can compute expectations $\mathbb{E}_\pi[f(\theta)] \approx \frac{1}{N}\sum_i f(\theta_i)$.

Markov Chain Monte Carlo

The strategy: construct a Markov chain whose stationary distribution is $\pi$. Run the chain for a long time. Collect the states as (approximately) samples from $\pi$.

For a Markov chain to have $\pi$ as its stationary distribution, it’s sufficient for it to satisfy detailed balance:

$$\pi(x) T(x' \mid x) = \pi(x') T(x \mid x')$$

This says the probability flow from $x$ to $x'$ equals the probability flow from $x'$ to $x$ - the chain is in equilibrium with respect to $\pi$.

Given: a proposal distribution $q(x' \mid x)$ (your idea for where to step next). The proposal can be anything - Gaussian, uniform, whatever. You accept or reject based on whether the proposed move improves or worsens $\pi$.

Algorithm:

Start at any state $x$.
Propose a new state $x' \sim q(\cdot \mid x)$.
Compute the acceptance probability:

$$\alpha(x, x') = \min\left(1, \frac{\tilde{\pi}(x') q(x \mid x')}{\tilde{\pi}(x) q(x' \mid x)}\right)$$

Accept: move to $x'$ with probability $\alpha$. Otherwise, stay at $x$.
Record the current state as a sample. Go to step 2.

Why the normalizing constant cancels: The ratio $\tilde{\pi}(x')/\tilde{\pi}(x) = \pi(x')/\pi(x)$ regardless of $Z$. You never need to compute it.

Why this works: For $x' \neq x$, the probability of moving from $x$ to $x'$ is $q(x' \mid x) \cdot \alpha(x, x')$. You can verify:

$$\pi(x) q(x' \mid x) \alpha(x, x') = \min\left(\pi(x) q(x' \mid x),; \pi(x') q(x \mid x')\right) = \pi(x') q(x \mid x') \alpha(x', x)$$

This is exactly detailed balance. So $\pi$ is the stationary distribution.

The Random Walk Version

The most common proposal is a symmetric Gaussian:

$$q(x' \mid x) = \mathcal{N}(x' \mid x, \sigma^2 I)$$

Symmetry means $q(x \mid x') = q(x' \mid x)$, so the ratio $q(x \mid x')/q(x' \mid x) = 1$. The acceptance probability simplifies to:

$$\alpha(x, x') = \min\left(1, \frac{\tilde{\pi}(x')}{\tilde{\pi}(x)}\right)$$

If the proposal is in a higher-probability region, always accept. If it’s in a lower-probability region, accept with probability $\pi(x')/\pi(x)$. This biases the walk toward high-probability regions without trapping it there.

The step size $\sigma$ matters enormously. Too small: the chain shuffles around slowly, consecutive samples are nearly identical, you need many steps to explore. Too large: most proposals land in low-probability regions and are rejected, the chain barely moves. For a $d$-dimensional Gaussian target, the optimal acceptance rate is approximately $23.4%$, achieved by setting $\sigma \propto d^{-1/2}$.

Gibbs Sampling

If your distribution is over multiple variables $(x_1, \ldots, x_d)$, and you can sample from each conditional distribution $p(x_i \mid x_{-i})$ (all others fixed), use Gibbs sampling: cycle through variables, drawing each from its full conditional.

Gibbs is Metropolis-Hastings with a special proposal: propose $x_i' \sim p(x_i \mid x_{-i})$. The acceptance probability is always 1 - the chain always moves. No rejections.

This works beautifully for Bayesian models with conjugate priors (where conditionals have closed form) and graphical models with local structure.

Discomfort check. The Markov chain needs to mix well - meaning it needs to explore the full space efficiently. How do you know when it has mixed?

Honestly, you don’t know definitively. Mixing is hard to prove. Diagnostics help:

Trace plots: plot the sampled values over time. A well-mixed chain looks like white noise, jumping around the space. A poorly-mixed chain shows long runs in one region before moving.

Gelman-Rubin statistic $\hat{R}$: run multiple chains from different starting points. If they’ve converged to the same distribution, the between-chain variance should equal the within-chain variance. $\hat{R}$ measures this ratio; values below 1.01 indicate convergence.

Effective sample size (ESS): adjacent samples from a Markov chain are correlated, so $N$ samples aren’t as informative as $N$ independent samples. ESS estimates how many independent samples your $N$ correlated samples are worth. Low ESS means poor mixing.

Non-mixing is the central practical challenge of MCMC. There’s no substitute for diagnostic checking.

Hamiltonian Monte Carlo

Random walk MH explores space diffusively. To travel a distance $L$ in a $d$-dimensional space, random walk takes $O(d \cdot L^2)$ steps because it’s undirected. For high-dimensional posteriors, this is devastating.

Hamiltonian Monte Carlo (HMC) uses gradient information $\nabla \log \pi(x)$ to make directed proposals. Augment the state with a momentum variable $p \sim \mathcal{N}(0, M)$. Define the Hamiltonian:

$$\mathcal{H}(x, p) = -\log \pi(x) + \frac{1}{2} p^T M^{-1} p$$

Simulate Hamiltonian dynamics using leapfrog integration: the state travels along a trajectory that approximately conserves $\mathcal{H}$. At the end of the trajectory, apply a Metropolis correction for the discretization error.

The result: the chain proposes states far from the current position, with high acceptance probability. Mixing time scales as $O(d^{1/4})$ instead of $O(d)$.

Stan and PyMC use NUTS (No-U-Turn Sampler), which adapts the trajectory length automatically to avoid wasting steps going in circles.

Applications

Bayesian regression. In a hierarchical regression model, the joint posterior over coefficients and variance components has no closed form. HMC/NUTS samples it efficiently. Used in epidemiology to estimate reproduction numbers from case counts with proper uncertainty quantification.

Astrophysics. Inferring black hole parameters from gravitational wave data (LIGO): the likelihood of the observed strain given a black hole merger model is known, but the posterior over mass, spin, and distance needs MCMC. The posteriors reported in black hole detection papers are MCMC samples.

Protein folding. Sampling low-energy configurations of a protein: the energy function is known, but the integral over all configurations is intractable. MCMC (particularly replica exchange MCMC, which runs chains at different temperatures) explores the energy landscape.

Statistical physics. The Ising model of magnetic materials defines a distribution over spin configurations $\sigma \in \{-1, +1\}^n$. The partition function is intractable. Metropolis sampling with single spin-flip proposals has studied phase transitions since the original 1953 paper - the very problem the algorithm was invented to solve.

Summary

Concept	Detail
Problem	Sample from $\pi(x)$ knowing only $\tilde{\pi}(x) \propto \pi(x)$
MCMC idea	Run a Markov chain with stationary distribution $\pi$
Detailed balance	$\pi(x) T(x' \mid x) = \pi(x') T(x \mid x')$ guarantees stationarity
MH acceptance ratio	$\min\left(1, \frac{\tilde{\pi}(x') q(x \mid x')}{\tilde{\pi}(x) q(x' \mid x)}\right)$
Why $Z$ cancels	Ratio $\tilde{\pi}(x')/\tilde{\pi}(x) = \pi(x')/\pi(x)$ regardless of normalizing constant
Gibbs sampling	Cycle through variables, sample each from its conditional; always accepted
HMC	Use gradients for directed proposals; mixes in $O(d^{1/4})$ vs $O(d)$
Mixing diagnostics	Trace plots, Gelman-Rubin $\hat{R}$, effective sample size

Read next:

Integer Linear Programming - Optimization When Fractions Are Forbidden