Prerequisite: Probability

Overview

Bayesian inference often reduces to computing expectations under a posterior distribution $p(\theta | \mathcal{D}) \propto p(\mathcal{D} | \theta)\, p(\theta)$. Except in conjugate families, this posterior has no closed form and the normalising constant $\int p(\mathcal{D}|\theta)\,p(\theta)\,d\theta$ is intractable. Markov Chain Monte Carlo (MCMC) sidesteps the normalising constant by constructing a Markov chain whose stationary distribution is the target, then collecting samples from the chain after it has converged.

The Metropolis-Hastings algorithm, introduced by Metropolis et al. (1953) and generalised by Hastings (1970), is the foundational MCMC method. It requires only the ability to evaluate $\tilde{p}(x) \propto p(x)$, not the normalised density.

Detailed Balance and Stationarity

A Markov chain with transition kernel $T(x' | x)$ has $p$ as its stationary distribution if it satisfies the detailed balance condition:

$$p(x)\, T(x' | x) = p(x')\, T(x | x')$$

Summing over $x$ gives $\int p(x)\, T(x'|x)\,dx = p(x')$, which is exactly the stationarity condition. Detailed balance is sufficient (but not necessary) for stationarity, and it is the property that Metropolis-Hastings is designed to satisfy.

The Metropolis-Hastings Algorithm

Input: unnormalised target $\tilde{p}$, proposal distribution $q(x' | x)$, initial state $x_0$.

  1. At state $x$, draw a proposal $x' \sim q(\cdot | x)$.
  2. Compute the acceptance probability:

$$\alpha(x, x') = \min\!\left(1,\; \frac{\tilde{p}(x')\, q(x | x')}{\tilde{p}(x)\, q(x' | x)}\right)$$

  3. Accept: set $x \leftarrow x'$ with probability $\alpha$. Otherwise keep $x$.
  4. Record $x$ as a sample. Repeat from step 1.
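The steps above can be sketched in Python. This is a minimal implementation using NumPy; the drifted (asymmetric) Gaussian proposal and the standard normal target are illustrative choices, picked so that the $q$-ratio in the acceptance probability actually matters. Working in log space avoids overflow in the density ratio.

```python
import numpy as np

def metropolis_hastings(log_p_tilde, propose, log_q, x0, n_samples, seed=0):
    """Metropolis-Hastings with a generic (possibly asymmetric) proposal."""
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        x_prop = propose(x, rng)
        # log of alpha = min(1, p~(x') q(x|x') / (p~(x) q(x'|x)))
        log_alpha = (log_p_tilde(x_prop) + log_q(x, x_prop)
                     - log_p_tilde(x) - log_q(x_prop, x))
        if np.log(rng.uniform()) < log_alpha:
            x = x_prop                      # accept the proposal
        samples.append(x)                   # record current state either way
    return np.array(samples)

# Illustrative run: unnormalised standard normal target, proposal with drift 0.5.
log_p = lambda x: -0.5 * x**2
propose = lambda x, rng: x + rng.normal(0.5, 1.0)
log_q = lambda x_new, x_old: -0.5 * (x_new - x_old - 0.5)**2
samples = metropolis_hastings(log_p, propose, log_q, x0=0.0, n_samples=20000)
```

Note that a rejected proposal still produces a sample: the current state is recorded again, which is what makes the chain's stationary distribution correct.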

Why This Satisfies Detailed Balance

The transition kernel is $T(x'|x) = q(x'|x)\,\alpha(x,x') + \delta(x'-x)\left(1 - \int q(y|x)\,\alpha(x,y)\,dy\right)$. For $x' \neq x$:

$$p(x)\,q(x'|x)\,\alpha(x,x') = p(x)\,q(x'|x)\,\min\!\left(1, \frac{p(x')q(x|x')}{p(x)q(x'|x)}\right)$$

$$= \min\!\left(p(x)\,q(x'|x),\; p(x')\,q(x|x')\right) = p(x')\,q(x|x')\,\alpha(x',x)$$

which is exactly detailed balance. The acceptance ratio cancels the normalising constant since $\tilde{p}(x')/\tilde{p}(x) = p(x')/p(x)$.
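This identity can be checked numerically on a small discrete state space. The three-state target and proposal matrix below are arbitrary illustrative values; the check confirms that probability flux is symmetric and that $p$ is stationary.

```python
import numpy as np

# Three-state target and an arbitrary proposal matrix (illustrative values);
# q[i, j] = q(j | i), so each row sums to 1.
p = np.array([0.2, 0.3, 0.5])
q = np.array([[0.10, 0.60, 0.30],
              [0.40, 0.20, 0.40],
              [0.50, 0.25, 0.25]])

# alpha[i, j] = min(1, p(j) q(i|j) / (p(i) q(j|i)))
alpha = np.minimum(1.0, (p[None, :] * q.T) / (p[:, None] * q))

# Full transition kernel: accepted moves off the diagonal,
# rejected probability mass folded back onto the diagonal.
T = q * alpha
np.fill_diagonal(T, 0.0)
np.fill_diagonal(T, 1.0 - T.sum(axis=1))

flux = p[:, None] * T    # flux[i, j] = p(i) T(j | i)
```

Here `flux` equals its own transpose (detailed balance), and `p @ T` recovers `p` (stationarity).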

Random Walk Metropolis-Hastings

The most common choice is a symmetric Gaussian proposal:

$$q(x' | x) = \mathcal{N}(x' \mid x,\ \sigma^2 I)$$

Symmetry means $q(x|x') = q(x'|x)$, so the ratio collapses to:

$$\alpha(x, x') = \min\!\left(1,\; \frac{\tilde{p}(x')}{\tilde{p}(x)}\right)$$

The step size $\sigma$ is critical: too small and the chain mixes slowly; too large and almost every proposal lands in a low-probability region and is rejected. For a $d$-dimensional Gaussian target, the asymptotically optimal acceptance rate is approximately $0.234$ as $d \to \infty$, achieved by tuning $\sigma \propto d^{-1/2}$.
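The effect of the step size can be measured directly by running the symmetric-proposal chain at different values of $\sigma$ and recording the empirical acceptance rate. The helper name and the specific $\sigma$ values below are illustrative.

```python
import numpy as np

def rw_metropolis(log_p_tilde, x0, sigma, n_steps, seed=0):
    """Random walk Metropolis; returns samples and the empirical acceptance rate."""
    rng = np.random.default_rng(seed)
    x, accepted, samples = x0, 0, []
    for _ in range(n_steps):
        x_prop = x + sigma * rng.standard_normal()
        # Symmetric proposal: alpha = min(1, p~(x') / p~(x))
        if np.log(rng.uniform()) < log_p_tilde(x_prop) - log_p_tilde(x):
            x, accepted = x_prop, accepted + 1
        samples.append(x)
    return np.array(samples), accepted / n_steps

log_p = lambda x: -0.5 * x**2                                 # standard normal target
_, rate_small = rw_metropolis(log_p, 0.0, sigma=0.01, n_steps=5000)
_, rate_large = rw_metropolis(log_p, 0.0, sigma=50.0, n_steps=5000)
_, rate_mid   = rw_metropolis(log_p, 0.0, sigma=2.4,  n_steps=5000)
# rate_small is near 1 (tiny steps, slow mixing); rate_large is near 0
# (almost everything rejected); rate_mid sits in between.
```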

Burn-In and Mixing

The initial samples are influenced by the starting point $x_0$ rather than the target $p$. A burn-in period discards these early samples. Even after burn-in, the samples are not independent: consecutive samples are correlated because the chain moves incrementally. The effective sample size (ESS) quantifies this:

$$\text{ESS} = \frac{N}{1 + 2\sum_{k=1}^{\infty} \rho_k}$$

where $\rho_k$ is the autocorrelation at lag $k$. A chain that mixes poorly has high autocorrelation and low ESS per sample collected.
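The ESS formula can be sketched in NumPy. Truncating the autocorrelation sum at the first non-positive lag is one simple, common heuristic; production samplers use more careful truncation rules.

```python
import numpy as np

def effective_sample_size(x):
    """ESS = N / (1 + 2 * sum of autocorrelations), truncated at the
    first non-positive lag (a simple heuristic)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n   # autocovariance estimates
    rho = acov / acov[0]                                 # autocorrelations
    s = 0.0
    for k in range(1, n):
        if rho[k] <= 0:
            break
        s += rho[k]
    return n / (1.0 + 2.0 * s)

rng = np.random.default_rng(1)
iid = rng.standard_normal(5000)          # independent draws: ESS close to N
ar = np.empty(5000)                      # AR(1) with coefficient 0.9: low ESS
ar[0] = 0.0
for t in range(1, 5000):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()
```

The independent series gives an ESS near its length, while the strongly autocorrelated AR(1) series gives a much smaller one, mirroring a poorly mixing chain.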

Convergence diagnostic $\hat{R}$: Run $M$ parallel chains. The Gelman-Rubin statistic compares between-chain variance $B$ and within-chain variance $W$:

$$\hat{R} = \sqrt{\frac{(N-1)/N \cdot W + B/N}{W}}$$

Values close to 1.0 (typically $< 1.01$) indicate the chains have converged to the same distribution.
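The statistic translates directly into code. This is a sketch of the formula as stated above; the split-chain and rank-normalised variants used by modern samplers are omitted.

```python
import numpy as np

def gelman_rubin(chains):
    """R-hat from an (M, N) array of M chains of length N."""
    chains = np.asarray(chains, dtype=float)
    M, N = chains.shape
    W = chains.var(axis=1, ddof=1).mean()          # within-chain variance
    B = N * chains.mean(axis=1).var(ddof=1)        # between-chain variance
    var_hat = (N - 1) / N * W + B / N
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
converged = rng.standard_normal((4, 1000))          # four chains, same distribution
stuck = converged + 3.0 * np.arange(4)[:, None]     # chains centred at 0, 3, 6, 9
```

For the converged chains $\hat{R}$ is close to 1; for the shifted chains the between-chain variance dominates and $\hat{R}$ is far above 1.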

Gibbs Sampling

Gibbs sampling is a special case of Metropolis-Hastings that exploits the ability to sample from full conditionals. Given $x = (x_1, \ldots, x_d)$, cycle through coordinates:

$$x_i^{(t+1)} \sim p\!\left(x_i \;\big|\; x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_d^{(t)}\right)$$

This is equivalent to Metropolis-Hastings with $q(x'|x)$ equal to the conditional $p(x_i' | x_{-i})$, which gives acceptance probability $\alpha = 1$: proposals are always accepted. Gibbs is exact when full conditionals are available in closed form (conjugate Bayesian models, graphical models with local structure).
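A minimal example where the full conditionals are known in closed form is a standard bivariate normal with correlation $\rho$, for which $x_1 \mid x_2 \sim \mathcal{N}(\rho x_2,\ 1 - \rho^2)$ and symmetrically. A sketch:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho**2)
    x1 = x2 = 0.0
    out = np.empty((n_samples, 2))
    for t in range(n_samples):
        x1 = rho * x2 + sd * rng.standard_normal()   # sample x1 | x2
        x2 = rho * x1 + sd * rng.standard_normal()   # sample x2 | x1 (just updated)
        out[t] = x1, x2
    return out

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
```

Every draw is accepted, and the empirical correlation of the samples recovers $\rho$.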

Hamiltonian Monte Carlo

Random walk MH explores space diffusively, taking $O(d)$ steps to travel a distance $O(\sqrt{d})$. Hamiltonian Monte Carlo (HMC) uses gradient information $\nabla \log p(x)$ to make large, directed proposals.

Augment the state with a momentum variable $r \sim \mathcal{N}(0, M)$. Define the Hamiltonian:

$$\mathcal{H}(x, r) = -\log p(x) + \frac{1}{2} r^T M^{-1} r$$

Hamilton’s equations $\dot{x} = M^{-1}r$, $\dot{r} = \nabla \log p(x)$ define a trajectory that preserves $\mathcal{H}$. Numerically, a leapfrog integrator discretises this trajectory and a Metropolis correction handles discretisation error. Because HMC proposes states far from the current position while maintaining high acceptance probability, it mixes dramatically faster than random walk MH, particularly in high dimensions. NUTS (No-U-Turn Sampler), the default in Stan and PyMC, adapts the trajectory length automatically.
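A single HMC transition with a leapfrog integrator can be sketched as follows. This assumes an identity mass matrix $M = I$; the step size, trajectory length, and Gaussian target in the demo run are illustrative choices, not tuned values.

```python
import numpy as np

def hmc_step(log_p, grad_log_p, x, eps, n_leapfrog, rng):
    """One HMC transition with identity mass matrix (a sketch)."""
    r = rng.standard_normal(x.shape)            # resample momentum r ~ N(0, I)
    x_new, r_new = x.copy(), r.copy()
    # Leapfrog: half momentum step, alternating full steps, half momentum step
    r_new += 0.5 * eps * grad_log_p(x_new)
    for _ in range(n_leapfrog - 1):
        x_new += eps * r_new
        r_new += eps * grad_log_p(x_new)
    x_new += eps * r_new
    r_new += 0.5 * eps * grad_log_p(x_new)
    # Metropolis correction for the discretisation error in H
    h_old = -log_p(x) + 0.5 * (r @ r)
    h_new = -log_p(x_new) + 0.5 * (r_new @ r_new)
    if np.log(rng.uniform()) < h_old - h_new:
        return x_new, True
    return x, False

# Illustrative run: standard normal target in d = 10 dimensions.
log_p = lambda x: -0.5 * (x @ x)
grad_log_p = lambda x: -x
rng = np.random.default_rng(0)
x = np.zeros(10)
draws = []
for _ in range(2000):
    x, _ = hmc_step(log_p, grad_log_p, x, eps=0.2, n_leapfrog=10, rng=rng)
    draws.append(x.copy())
draws = np.array(draws)
```

Because the leapfrog integrator nearly conserves $\mathcal{H}$, most proposals are accepted even though each trajectory moves far from the current state.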

Complexity

| Method | Cost per step | Notes |
| --- | --- | --- |
| Random walk MH | $O(d)$ | Mixes in $O(d)$ steps |
| Gibbs | $O(d)$ | Requires full conditionals |
| HMC ($L$ leapfrog steps) | $O(Ld)$ | Mixes in $O(d^{1/4})$ steps |

Examples

Bayesian posterior inference. In a hierarchical regression model, the joint posterior over regression coefficients and variance components has no closed form. HMC/NUTS in Stan samples this posterior efficiently, enabling uncertainty quantification over predictions. This is used routinely in epidemiology to estimate $R_0$ from case count data.

Ising model (physics simulations). The Ising model defines a distribution over spin configurations $\sigma \in \{-1, +1\}^n$ via $p(\sigma) \propto \exp(\beta \sum_{\langle i,j \rangle} \sigma_i \sigma_j)$. The normalising constant (partition function) is intractable for large $n$. Metropolis sampling with single spin-flip proposals has been used since the 1950s to study phase transitions.
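A single-spin-flip Metropolis sweep for a 2D Ising lattice with periodic boundaries can be sketched as follows; the lattice size, $\beta$ value, and sweep count are illustrative.

```python
import numpy as np

def ising_metropolis_sweep(spins, beta, rng):
    """One sweep of single spin-flip Metropolis on a 2D Ising lattice
    with periodic boundary conditions."""
    n = spins.shape[0]
    for _ in range(n * n):
        i, j = rng.integers(n, size=2)
        # Flipping spin (i, j) changes the energy by dE = 2 * s_ij * (sum of neighbours)
        nb = (spins[(i + 1) % n, j] + spins[(i - 1) % n, j]
              + spins[i, (j + 1) % n] + spins[i, (j - 1) % n])
        dE = 2 * spins[i, j] * nb
        if dE <= 0 or rng.uniform() < np.exp(-beta * dE):
            spins[i, j] *= -1

rng = np.random.default_rng(0)
spins = np.ones((16, 16), dtype=int)        # ordered start
for _ in range(100):
    ising_metropolis_sweep(spins, beta=0.6, rng=rng)
magnetisation = abs(spins.mean())           # stays near 1 below beta_c (~0.44)
```

Running the same sweep at a small $\beta$ (high temperature) instead destroys the order and drives the magnetisation towards zero, which is the phase transition these simulations probe.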

Probabilistic graphical models. Loopy belief propagation can fail on graphs with many cycles. MCMC provides an alternative: block Gibbs sampling over cliques, or MH proposals that flip multiple variables jointly, gives a consistent estimator of marginal probabilities in latent Dirichlet allocation, hidden Markov models, and Boltzmann machines.


Read Next: Probabilistic Sampling