A Markov chain is a stochastic process with a simple but powerful structure: the future depends on the present, not the past. This memorylessness makes the mathematics tractable while still capturing rich long-run behavior. The theory of stationary distributions and convergence underpins applications from web search ranking to Bayesian computation.

The Markov Property

Definition. A sequence of random variables $\{X_n\}_{n \geq 0}$ taking values in a countable state space $\mathcal{S}$ is a Markov chain if for all $n \geq 0$ and all states $i_0, \ldots, i_{n+1} \in \mathcal{S}$,

$$P(X_{n+1} = i_{n+1} \mid X_0 = i_0, \ldots, X_n = i_n) = P(X_{n+1} = i_{n+1} \mid X_n = i_n).$$

This is the Markov property: given the present state $X_n$, the future $X_{n+1}$ is independent of the history $X_0, \ldots, X_{n-1}$.

Transition Matrix

For a time-homogeneous Markov chain, the one-step transition probabilities do not depend on $n$:

$$P_{ij} = P(X_{n+1} = j \mid X_n = i).$$

The matrix $P = (P_{ij})$ is the transition matrix. It satisfies:

  1. $P_{ij} \geq 0$ for all $i, j$, and
  2. $\sum_{j \in \mathcal{S}} P_{ij} = 1$ for all $i$ (each row sums to 1).

Such a matrix is called stochastic (or row-stochastic).

The $n$-step transition probability $P_{ij}^{(n)} = P(X_n = j \mid X_0 = i)$ is given by the $(i,j)$ entry of the matrix power $P^n$.

Chapman-Kolmogorov Equations. For all $m, n \geq 0$:

$$P_{ij}^{(m+n)} = \sum_{k \in \mathcal{S}} P_{ik}^{(m)} P_{kj}^{(n)}.$$

Proof. Condition on the state at time $m$:

$$P(X_{m+n} = j \mid X_0 = i) = \sum_{k} P(X_{m+n} = j \mid X_m = k)\, P(X_m = k \mid X_0 = i),$$

using the Markov property in the first factor. In matrix notation, $P^{m+n} = P^m \cdot P^n$. $\square$

If $\mu^{(0)} = (\mu^{(0)}_i)$ is the initial distribution (a row vector), then the distribution at time $n$ is $\mu^{(n)} = \mu^{(0)} P^n$.
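The row-vector conventions above are easy to check numerically. A minimal NumPy sketch, using an illustrative 3-state transition matrix (not from the text):

```python
import numpy as np

# An illustrative 3-state transition matrix (rows index the current state).
P = np.array([
    [0.8, 0.15, 0.05],
    [0.3, 0.4,  0.3 ],
    [0.2, 0.3,  0.5 ],
])

# Each row of a stochastic matrix sums to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# n-step transition probabilities are the entries of the matrix power P^n.
P5 = np.linalg.matrix_power(P, 5)

# Chapman-Kolmogorov in matrix form: P^(2+3) = P^2 · P^3.
assert np.allclose(P5, np.linalg.matrix_power(P, 2) @ np.linalg.matrix_power(P, 3))

# Distribution at time n: mu_n = mu_0 · P^n (row-vector convention).
mu0 = np.array([1.0, 0.0, 0.0])   # start surely in state 0
mu5 = mu0 @ P5
print(mu5)                        # a probability vector: entries sum to 1
```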

State Classification

Communicating Classes

State $j$ is reachable from state $i$ (written $i \to j$) if $P_{ij}^{(n)} > 0$ for some $n \geq 0$. States $i$ and $j$ communicate ($i \leftrightarrow j$) if $i \to j$ and $j \to i$. Communication is an equivalence relation; its equivalence classes are called communicating classes. A chain with a single communicating class is irreducible.

Recurrence and Transience

Definition. Let $T_i = \min\{n \geq 1 : X_n = i\}$ be the first return time to state $i$. State $i$ is:

  • recurrent if $P(T_i < \infty \mid X_0 = i) = 1$, and
  • transient if $P(T_i < \infty \mid X_0 = i) < 1$.

Criterion. State $i$ is recurrent if and only if $\sum_{n=0}^\infty P_{ii}^{(n)} = \infty$, and transient if and only if this sum is finite.

Proof sketch. Let $f_{ii} = P(T_i < \infty \mid X_0 = i)$. By the renewal argument, the number of returns to $i$ is geometric with parameter $f_{ii}$, so the expected number of returns is $\sum_{k=1}^\infty f_{ii}^k = f_{ii}/(1-f_{ii})$ when $f_{ii} < 1$, and $+\infty$ when $f_{ii} = 1$. Since the expected number of returns also equals $\sum_{n=1}^\infty P_{ii}^{(n)}$, this sum is finite exactly when $f_{ii} < 1$. $\square$

Recurrence and transience are class properties: if $i \leftrightarrow j$, then $i$ is recurrent iff $j$ is recurrent.
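The criterion can be seen concretely in a standard example not covered above: the simple symmetric random walk on $\mathbb{Z}$, for which $P_{00}^{(2n)} = \binom{2n}{n}/4^n \sim 1/\sqrt{\pi n}$. The partial sums of these return probabilities grow without bound, so the origin is recurrent. A short sketch:

```python
from math import comb

# Simple symmetric random walk on the integers: returns to 0 happen only
# at even times, with P_00^(2n) = C(2n, n) / 4^n.
def p_return(n):
    return comb(2 * n, n) / 4 ** n

# Partial sums of the return probabilities diverge (roughly like 2*sqrt(N/pi)),
# so by the criterion state 0 is recurrent.
partial = [sum(p_return(n) for n in range(N)) for N in (10, 100, 1000)]
print(partial)
```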

Period

The period of state $i$ is $d(i) = \gcd\{n \geq 1 : P_{ii}^{(n)} > 0\}$. If $d(i) = 1$, state $i$ is aperiodic. Period is also a class property. An irreducible chain is aperiodic if any state has a self-loop ($P_{ii} > 0$), since then $d(i) = \gcd(\{1\} \cup \ldots) = 1$.
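One way to compute a state's period numerically is to collect the return times with positive probability up to some horizon and take their gcd. A sketch (the horizon and the example matrices are illustrative; for a finite chain a modest horizon suffices in practice):

```python
from functools import reduce
from math import gcd
import numpy as np

def period(P, i, horizon=50):
    """gcd of all n <= horizon with P_ii^(n) > 0."""
    ns = []
    Pn = np.eye(len(P))
    for n in range(1, horizon + 1):
        Pn = Pn @ P
        if Pn[i, i] > 1e-12:
            ns.append(n)
    return reduce(gcd, ns) if ns else 0

# A deterministic 3-cycle has period 3.
cycle = np.array([[0., 1., 0.], [0., 0., 1.], [1., 0., 0.]])
print(period(cycle, 0))   # 3

# Adding a self-loop at state 0 makes the chain aperiodic.
loopy = np.array([[0.5, 0.5, 0.], [0., 0., 1.], [1., 0., 0.]])
print(period(loopy, 0))   # 1
```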

Stationary Distributions

Definition. A probability distribution $\pi = (\pi_i)_{i \in \mathcal{S}}$ is a stationary distribution if

$$\pi P = \pi \quad \text{and} \quad \sum_{i \in \mathcal{S}} \pi_i = 1, \quad \pi_i \geq 0.$$

Equivalently, $\pi_j = \sum_i \pi_i P_{ij}$ for all $j$: if the chain starts in distribution $\pi$, it stays in $\pi$ at all future times.

Existence and Uniqueness Theorem. An irreducible Markov chain has a stationary distribution $\pi$ if and only if all states are positive recurrent (i.e., $E[T_i \mid X_0 = i] < \infty$). When it exists, $\pi$ is unique and $\pi_i = 1/E[T_i \mid X_0 = i]$.

For finite irreducible chains, positive recurrence holds automatically, so a unique stationary distribution always exists.
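For a finite chain, the stationary distribution can be computed directly by solving the linear system $\pi P = \pi$, $\sum_i \pi_i = 1$. A minimal NumPy sketch with an illustrative 2-state chain:

```python
import numpy as np

def stationary(P):
    """Solve pi P = pi with sum(pi) = 1 as an overdetermined least-squares system."""
    n = len(P)
    A = np.vstack([P.T - np.eye(n), np.ones(n)])  # stationarity rows + normalization
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])      # illustrative irreducible 2-state chain
pi = stationary(P)
print(pi)                       # [5/6, 1/6]
assert np.allclose(pi @ P, pi)  # pi is indeed stationary
```

Consistent with the theorem, $1/\pi_i$ recovers the expected return time $E[T_i \mid X_0 = i]$.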

Convergence to Stationarity

Convergence Theorem. Let ${X_n}$ be an irreducible, positive recurrent, and aperiodic Markov chain with stationary distribution $\pi$. Then for all states $i, j$:

$$\lim_{n \to \infty} P_{ij}^{(n)} = \pi_j.$$

In matrix form: $P^n \to \mathbf{1}\pi$ as $n \to \infty$, where $\mathbf{1}$ is the all-ones column vector and $\pi$ is a row vector, so every row of $P^n$ converges to $\pi$.

Proof sketch via coupling. Run two copies of the chain independently, one started at $i$ and one started in $\pi$. Because the chain is irreducible and aperiodic, the two copies eventually meet (couple) at some finite random time $\tau$. After coupling they evolve identically, so the coupling inequality bounds the total variation distance: $|P_{ij}^{(n)} - \pi_j| \leq P(\tau > n) \to 0$. $\square$

The rate of convergence is controlled by the spectral gap: for finite chains, $|P_{ij}^{(n)} - \pi_j| = O(|\lambda_2|^n)$, where $\lambda_2$ is the eigenvalue of $P$ with the second-largest modulus (the largest eigenvalue is always $1$).
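Both the convergence theorem and the spectral-gap rate can be observed numerically. The sketch below uses an illustrative 2-state chain whose eigenvalues are $1$ and $0.4$, so the worst-case entry error of $P^n$ should shrink like $0.4^n$:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])              # irreducible and aperiodic
pi = np.array([5/6, 1/6])               # its stationary distribution

# Eigenvalues sorted by modulus: the largest is 1, the gap is 1 - |lambda_2|.
eigvals = sorted(np.linalg.eigvals(P), key=abs, reverse=True)
lam2 = abs(eigvals[1])                  # 0.4 for this chain

# Every row of P^n approaches pi, with error on the order of |lambda_2|^n.
for n in (1, 5, 10, 20):
    err = np.abs(np.linalg.matrix_power(P, n) - pi).max()
    print(n, err, lam2 ** n)
```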

Detailed Balance and Reversibility

Definition. A distribution $\pi$ satisfies detailed balance (or reversibility) with respect to $P$ if

$$\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for all } i, j.$$

Theorem. If $\pi$ satisfies detailed balance with $P$, then $\pi$ is a stationary distribution.

Proof. Sum over $i$: $\sum_i \pi_i P_{ij} = \sum_i \pi_j P_{ji} = \pi_j \sum_i P_{ji} = \pi_j$. $\square$

Detailed balance is easier to verify than the stationary equation and is the key condition exploited in the Metropolis-Hastings algorithm.
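Detailed balance amounts to symmetry of the flow matrix $F_{ij} = \pi_i P_{ij}$, which is simple to check numerically. The 3-state chain below is an illustrative birth-death example with $\pi \propto (1, 2, 4)$, not taken from the text:

```python
import numpy as np

def reversible(pi, P, tol=1e-12):
    """Check detailed balance: pi_i P_ij == pi_j P_ji for all pairs (i, j)."""
    F = pi[:, None] * P          # F_ij = pi_i P_ij, the probability flow i -> j
    return np.allclose(F, F.T, atol=tol)

# Illustrative birth-death chain; these are always reversible.
P = np.array([
    [0.5,  0.5,  0.0 ],
    [0.25, 0.25, 0.5 ],
    [0.0,  0.25, 0.75],
])
pi = np.array([1., 2., 4.]) / 7.

print(reversible(pi, P))         # True
assert np.allclose(pi @ P, pi)   # detailed balance implies stationarity
```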

Examples

PageRank as a Stationary Distribution

Google’s PageRank models web surfing as a Markov chain. The state space is the set of web pages; $P_{ij} = 1/\text{out-degree}(i)$ if page $i$ links to page $j$, and 0 otherwise. To ensure irreducibility and aperiodicity (and thus a unique stationary distribution), the teleportation trick adds a small probability $\alpha \approx 0.15$ of jumping uniformly to any page:

$$P^{\text{PageRank}} = (1-\alpha)P + \frac{\alpha}{N}\mathbf{1}\mathbf{1}^T.$$

This modified chain is irreducible (any page reachable from any other) and aperiodic. Its unique stationary distribution $\pi$ is the PageRank vector: $\pi_i$ is the long-run fraction of time spent at page $i$, interpretable as the page’s importance. It is computed by power iteration $\pi^{(t+1)} = \pi^{(t)} P^{\text{PageRank}}$ until convergence.
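The construction can be sketched in a few lines of NumPy; the 4-page link structure below is purely illustrative:

```python
import numpy as np

# Tiny illustrative web: page i links to the pages in links[i].
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
N, alpha = 4, 0.15

# Link-following matrix: P_ij = 1 / out-degree(i) for each outgoing link.
P = np.zeros((N, N))
for i, outs in links.items():
    P[i, outs] = 1.0 / len(outs)

# Teleportation: with probability alpha jump uniformly, else follow a link.
G = (1 - alpha) * P + alpha / N * np.ones((N, N))

# Power iteration: pi <- pi G until convergence (fixed iteration count here).
pi = np.ones(N) / N
for _ in range(100):
    pi = pi @ G
print(pi)   # PageRank scores; page 2, which everyone links to, ranks highest
```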

Metropolis-Hastings MCMC

Suppose we want to sample from a target distribution $\pi$ that is known only up to a normalizing constant (common in Bayesian posterior inference). The Metropolis-Hastings algorithm constructs a Markov chain with stationary distribution $\pi$:

  1. From current state $x$, propose $y \sim q(\cdot \mid x)$ using a proposal kernel $q$.
  2. Accept with probability $\alpha(x, y) = \min\left(1,\, \frac{\pi(y)\, q(x \mid y)}{\pi(x)\, q(y \mid x)}\right)$.
  3. If accepted, move to $y$; otherwise stay at $x$.

Correctness. The chain satisfies detailed balance with respect to $\pi$: the acceptance ratio $\alpha$ is designed so that $\pi(x) P(x, y) = \pi(y) P(y, x)$ wherever $\pi(x)q(y|x) \neq 0$. By the detailed balance theorem, $\pi$ is stationary. By the convergence theorem (assuming irreducibility and aperiodicity, which hold under mild conditions on $q$), the chain’s distribution converges to $\pi$.

The Metropolis-Hastings algorithm is the foundation of modern Bayesian computation: one never needs the normalizing constant $Z = \int \pi(x)dx$ because it cancels in the acceptance ratio.
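A minimal random-walk Metropolis sketch, targeting an unnormalized standard normal density. The Gaussian proposal is symmetric, $q(y \mid x) = q(x \mid y)$, so the $q$-ratio cancels; all names and tuning choices here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    """Unnormalized target density: standard normal without its constant."""
    return np.exp(-0.5 * x * x)

def metropolis(n_steps, step=1.0, x0=0.0):
    x, samples = x0, []
    for _ in range(n_steps):
        y = x + step * rng.standard_normal()          # symmetric proposal
        # Acceptance ratio uses only the *unnormalized* target: Z cancels.
        if rng.random() < min(1.0, target(y) / target(x)):
            x = y                                     # accept the move
        samples.append(x)                             # on rejection, repeat x
    return np.array(samples)

s = metropolis(50_000)[5_000:]   # discard an initial burn-in segment
print(s.mean(), s.std())         # should be close to 0 and 1
```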

Random Walk on a Graph

Let $G = (V, E)$ be an undirected graph. The simple random walk moves from vertex $i$ to a uniformly random neighbor at each step: $P_{ij} = 1/d_i$ if $\{i,j\} \in E$ (where $d_i$ is the degree of $i$), and $0$ otherwise.

This chain satisfies detailed balance with $\pi_i \propto d_i$: verify $d_i \cdot (1/d_i) = d_j \cdot (1/d_j) = 1$ whenever $\{i,j\} \in E$. Therefore, the stationary distribution is $\pi_i = d_i / (2|E|)$. For $d$-regular graphs ($d_i = d$ for all $i$), the stationary distribution is uniform. The spectral gap of the walk is $1 - \lambda_2$, where $\lambda_2$ is the second-largest eigenvalue of the normalized adjacency matrix $D^{-1/2} A D^{-1/2}$; this connection between mixing times and graph expansion is a central theme in spectral graph theory.
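A quick numerical check of $\pi_i = d_i / (2|E|)$ on a small illustrative graph (a path with a triangle attached):

```python
import numpy as np

# Illustrative undirected graph on 5 vertices: a path 0-1-2 plus triangle 2-3-4.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (2, 4)]
n = 5
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

deg = A.sum(axis=1)
P = A / deg[:, None]             # simple random walk: P_ij = 1/d_i for neighbors

# Degree-proportional stationary distribution: pi_i = d_i / (2|E|).
pi = deg / (2 * len(edges))
assert np.allclose(pi @ P, pi)   # stationary, via detailed balance
print(pi)
```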

