Law of Large Numbers & Central Limit Theorem
Two theorems underpin almost all of applied statistics and machine learning: the Law of Large Numbers tells you that sample averages converge to the true mean, and the Central Limit Theorem tells you how they converge and at what rate. Together they explain why averaging reduces noise, why confidence intervals are shaped the way they are, and why Gaussian approximations are so unreasonably effective.
Setup
Let $X_1, X_2, \ldots$ be i.i.d. random variables with mean $\mu = E[X_i]$ and variance $\sigma^2 = \mathrm{Var}(X_i) < \infty$. Define the sample mean
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i.$$
Weak Law of Large Numbers
Theorem (WLLN). For any $\varepsilon > 0$,
$$P\left(|\bar{X}_n - \mu| \geq \varepsilon\right) \to 0 \quad \text{as } n \to \infty.$$
This is convergence in probability, written $\bar{X}_n \xrightarrow{p} \mu$.
Proof via Chebyshev’s inequality. We have $E[\bar{X}_n] = \mu$ and
$$\mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n}.$$
Chebyshev’s inequality states that for any random variable $Y$ with finite variance,
$$P(|Y - E[Y]| \geq \varepsilon) \leq \frac{\mathrm{Var}(Y)}{\varepsilon^2}.$$
Applying this to $\bar{X}_n$:
$$P\left(|\bar{X}_n - \mu| \geq \varepsilon\right) \leq \frac{\sigma^2}{n\varepsilon^2} \to 0. \qquad \square$$
The proof is elegant precisely because it requires nothing beyond finite variance, not even a specific distributional form.
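To make the bound concrete, here is a minimal Monte Carlo sketch (the Exp(1) distribution, the sample sizes, and $\varepsilon$ are arbitrary illustrative choices): it estimates $P(|\bar{X}_n - \mu| \geq \varepsilon)$ by simulation and compares it with the Chebyshev bound $\sigma^2/(n\varepsilon^2)$ from the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.0, 1.0            # Exp(1) has mean 1 and variance 1
eps = 0.1
n_trials = 10_000                # Monte Carlo repetitions per sample size

for n in [10, 100, 1_000]:
    # n_trials independent sample means, each over n i.i.d. Exp(1) draws
    xbar = rng.exponential(scale=1.0, size=(n_trials, n)).mean(axis=1)
    p_hat = np.mean(np.abs(xbar - mu) >= eps)    # estimated deviation probability
    bound = sigma2 / (n * eps**2)                # Chebyshev bound from the proof
    print(f"n={n:5d}  P(|Xbar - mu| >= eps) ~ {p_hat:.4f}   Chebyshev bound {bound:.4f}")
```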
Strong Law of Large Numbers
Theorem (SLLN). With probability 1,
$$\bar{X}_n \xrightarrow{a.s.} \mu, \quad \text{i.e.,} \quad P\left(\lim_{n\to\infty} \bar{X}_n = \mu\right) = 1.$$
Almost sure convergence is strictly stronger than convergence in probability. The sample path $\bar{X}_1(\omega), \bar{X}_2(\omega), \ldots$ converges to $\mu$ for almost every outcome $\omega$.
Proof sketch via Borel-Cantelli. Consider the subsequence $n_k = k^2$. Chebyshev's inequality gives $P(|\bar{X}_{n_k} - \mu| > \varepsilon) \leq \sigma^2/(k^2 \varepsilon^2)$, which is summable in $k$, so the Borel-Cantelli lemma gives $\bar{X}_{n_k} \to \mu$ a.s. along the subsequence. A monotonicity argument on the partial sums (applied to the positive and negative parts of the $X_i$) controls the gaps between consecutive $n_k$ and extends the convergence to all $n$. The full proof with only a finite first moment (Kolmogorov's SLLN) uses a truncation argument. $\square$
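Almost sure convergence is a statement about individual sample paths, which is easy to see numerically. A small sketch (illustrative only; the Bernoulli(0.3) distribution and the checkpoints are arbitrary) tracks the running mean of a single simulated path:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.3                                      # true mean of Bernoulli(p)
x = rng.binomial(1, p, size=100_000)         # one realised sample path
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

# The SLLN says this particular path's running mean settles down to p.
for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"n={n:7d}  running mean = {running_mean[n - 1]:.4f}   (target {p})")
```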
Central Limit Theorem
The LLN says $\bar{X}_n \to \mu$. The CLT characterises the fluctuations around $\mu$.
Theorem (CLT). Let $Z_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}$. Then
$$Z_n \xrightarrow{d} \mathcal{N}(0,1),$$
meaning $P(Z_n \leq z) \to \Phi(z)$ for every $z$, where $\Phi$ is the standard normal CDF.
Proof via moment generating functions. Let $Y_i = (X_i - \mu)/\sigma$ so that $E[Y_i] = 0$, $E[Y_i^2] = 1$, and $Z_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n Y_i$. The MGF of $Z_n$ is
$$M_{Z_n}(t) = \left[M_Y\left(\frac{t}{\sqrt{n}}\right)\right]^n.$$
Taylor-expanding $M_Y$ around 0 (using $M_Y(0)=1$, $M_Y'(0)=0$, $M_Y''(0)=1$):
$$M_Y\left(\frac{t}{\sqrt{n}}\right) = 1 + \frac{t^2}{2n} + O(n^{-3/2}).$$
Therefore
$$M_{Z_n}(t) = \left(1 + \frac{t^2}{2n} + O(n^{-3/2})\right)^n \to e^{t^2/2},$$
which is the MGF of $\mathcal{N}(0,1)$. Convergence of MGFs (in a neighbourhood of 0) implies convergence in distribution. $\square$
The characteristic function proof ($\varphi_{Z_n}(t) \to e^{-t^2/2}$) avoids the assumption that MGFs exist and is the standard route in full generality.
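The convergence in distribution is also easy to check by simulation. The sketch below (illustrative; it uses heavily skewed Exp(1) summands, $n = 50$, and SciPy's normal CDF) compares the empirical CDF of $Z_n$ with $\Phi$ at a few points:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu, sigma = 1.0, 1.0                      # Exp(1): mean 1, standard deviation 1
n, n_trials = 50, 100_000

xbar = rng.exponential(scale=1.0, size=(n_trials, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - mu) / sigma      # the standardised statistic Z_n

# P(Z_n <= z) should be close to Phi(z) even though the summands are skewed.
for q in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"z={q:+.1f}   empirical {np.mean(z <= q):.4f}   Phi(z) {norm.cdf(q):.4f}")
```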
Berry-Esseen Bound
The CLT is an asymptotic result. How fast is the convergence?
Theorem (Berry-Esseen). If $E[|X_i|^3] < \infty$, then
$$\sup_z \left|P(Z_n \leq z) - \Phi(z)\right| \leq \frac{C \cdot E[|X_i - \mu|^3]}{\sigma^3 \sqrt{n}},$$
where $C \leq 0.4748$ (the best currently known bound on the constant). The approximation error decays as $O(1/\sqrt{n})$, uniformly in $z$, not as $1/n$.
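As a rough numerical check (illustrative; Bernoulli(0.1) summands are chosen precisely because they are skewed, and the supremum is approximated on a finite grid), one can compare the simulated Kolmogorov distance with the Berry-Esseen bound:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
p = 0.1                                        # skewed Bernoulli(p) summands
mu = p
sigma = np.sqrt(p * (1 - p))
rho = p * (1 - p) * ((1 - p)**2 + p**2)        # E|X - mu|^3 for Bernoulli(p)
C = 0.4748

n_trials = 50_000
grid = np.linspace(-4, 4, 801)
for n in [10, 50, 200]:
    xbar = rng.binomial(1, p, size=(n_trials, n)).mean(axis=1)
    z = np.sqrt(n) * (xbar - mu) / sigma
    emp_cdf = np.searchsorted(np.sort(z), grid, side="right") / n_trials
    ks = np.max(np.abs(emp_cdf - norm.cdf(grid)))          # approx. sup-distance
    bound = C * rho / (sigma**3 * np.sqrt(n))               # Berry-Esseen bound
    print(f"n={n:4d}  sup |P(Z_n <= z) - Phi(z)| ~ {ks:.4f}   bound {bound:.4f}")
```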
Normal Approximation to the Binomial
If $X \sim \mathrm{Binomial}(n, p)$, then $X = \sum_{i=1}^n B_i$ where $B_i \sim \mathrm{Bernoulli}(p)$. Setting $\mu = p$, $\sigma^2 = p(1-p)$:
$$\frac{X - np}{\sqrt{np(1-p)}} \xrightarrow{d} \mathcal{N}(0,1).$$
The continuity correction improves finite-$n$ accuracy because $X$ is discrete: use $P(X \leq k) \approx \Phi\left(\frac{k + 0.5 - np}{\sqrt{np(1-p)}}\right)$.
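A small comparison (illustrative values of $n$, $p$, and $k$; SciPy's exact binomial CDF serves as the reference) shows the effect of the correction:

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 40, 0.3
mean, sd = n * p, np.sqrt(n * p * (1 - p))

for k in [8, 12, 16]:
    exact = binom.cdf(k, n, p)                    # exact P(X <= k)
    plain = norm.cdf((k - mean) / sd)             # plain normal approximation
    corrected = norm.cdf((k + 0.5 - mean) / sd)   # with continuity correction
    print(f"k={k:2d}  exact {exact:.4f}   normal {plain:.4f}   corrected {corrected:.4f}")
```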
Multivariate CLT
For i.i.d. random vectors $\mathbf{X}_i \in \mathbb{R}^d$ with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$,
$$\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} \mathcal{N}_d(\mathbf{0}, \Sigma).$$
This follows from the univariate CLT via the Cramér-Wold device: it suffices to show $\mathbf{v}^\top \sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} \mathcal{N}(0, \mathbf{v}^\top \Sigma \mathbf{v})$ for every $\mathbf{v} \in \mathbb{R}^d$, and each projection $\mathbf{v}^\top \mathbf{X}_i$ is a univariate i.i.d. sequence with mean $\mathbf{v}^\top \boldsymbol{\mu}$ and variance $\mathbf{v}^\top \Sigma \mathbf{v}$.
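The reduction is easy to verify numerically. The sketch below (an illustrative check; the construction $\mathbf{X}_i = \boldsymbol{\mu} + A(\mathbf{E}_i - \mathbf{1})$ with i.i.d. Exp(1) components and the projection vector $\mathbf{v}$ are arbitrary choices) confirms that the projected, scaled sample mean has mean close to $0$ and variance close to $\mathbf{v}^\top \Sigma \mathbf{v}$:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, n_trials = 3, 200, 10_000
mu = np.array([1.0, -2.0, 0.5])
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
Sigma = A @ A.T                            # covariance of the vectors built below
v = np.array([1.0, -1.0, 2.0])             # arbitrary projection direction

# Non-Gaussian i.i.d. vectors: X_i = mu + A (E_i - 1) with E_i ~ iid Exp(1),
# so E[X_i] = mu and Cov(X_i) = A A^T = Sigma.
E = rng.exponential(scale=1.0, size=(n_trials, n, d))
X = mu + (E - 1.0) @ A.T
Z = np.sqrt(n) * (X.mean(axis=1) - mu)     # sqrt(n) (Xbar_n - mu), one row per trial

proj = Z @ v                               # the Cramér-Wold projection v^T Z_n
print("empirical mean of v^T Z_n:", round(float(proj.mean()), 4))
print("empirical var  of v^T Z_n:", round(float(proj.var()), 4))
print("v^T Sigma v              :", round(float(v @ Sigma @ v), 4))
```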
Why This Matters in Practice
Confidence intervals. If $\hat{\theta}$ is an estimator with asymptotic variance $\sigma^2/n$, an approximate 95% confidence interval is $\hat{\theta} \pm 1.96\,\sigma/\sqrt{n}$, justified entirely by the CLT.
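Coverage is straightforward to check by simulation. A minimal sketch (illustrative; Exp(1) data with the sample standard deviation plugged in for $\sigma$) counts how often the nominal 95% interval contains the true mean:

```python
import numpy as np

rng = np.random.default_rng(5)
mu_true = 1.0                    # true mean of Exp(1)
n, n_trials = 100, 20_000

x = rng.exponential(scale=1.0, size=(n_trials, n))
xbar = x.mean(axis=1)
se = x.std(axis=1, ddof=1) / np.sqrt(n)          # plug-in standard error
covered = (xbar - 1.96 * se <= mu_true) & (mu_true <= xbar + 1.96 * se)
print(f"empirical coverage of the nominal 95% CI: {covered.mean():.3f}")
```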
SGD and gradient noise. In stochastic gradient descent, each mini-batch gradient is an average of per-sample gradients. The CLT implies that gradient noise is approximately Gaussian, which justifies many analyses of SGD dynamics and learning rate schedules.
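A toy illustration of this point (a hypothetical least-squares setup; the model, batch size, and dimensions are arbitrary, and only one gradient coordinate is inspected) compares quantiles of the standardised mini-batch gradient noise with standard normal quantiles:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
n_samples, dim, batch = 50_000, 10, 32

# Hypothetical least-squares problem: y = X w_true + noise, current iterate w = 0.
X = rng.standard_normal((n_samples, dim))
w_true = rng.standard_normal(dim)
y = X @ w_true + 0.5 * rng.standard_normal(n_samples)
w = np.zeros(dim)

residual = X @ w - y
per_sample_grads = residual[:, None] * X           # g_i = (w . x_i - y_i) x_i
full_grad = per_sample_grads.mean(axis=0)

# A mini-batch gradient is an average of `batch` per-sample gradients, so the CLT
# suggests its deviation from the full gradient is approximately Gaussian.
idx = rng.integers(0, n_samples, size=(10_000, batch))
batch_grads = per_sample_grads[idx].mean(axis=1)
noise = batch_grads[:, 0] - full_grad[0]           # first coordinate of the noise
z = (noise - noise.mean()) / noise.std()

for q in [0.05, 0.25, 0.50, 0.75, 0.95]:
    print(f"quantile {q:.2f}:  empirical {np.quantile(z, q):+.3f}"
          f"   N(0,1) {norm.ppf(q):+.3f}")
```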
Averaging reduces noise at rate $1/\sqrt{n}$. This is the fundamental reason ensemble methods, repeated measurements, and averaging estimators all improve with sample size, but the improvement slows down: you need 4× more data to halve the standard error.
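The rate itself is easy to see directly. A minimal sketch (illustrative; Exp(1) data and arbitrary sample sizes) shows the empirical standard error roughly halving each time $n$ is multiplied by 4:

```python
import numpy as np

rng = np.random.default_rng(7)
n_trials = 5_000
prev = None

for n in [100, 400, 1_600]:
    xbar = rng.exponential(scale=1.0, size=(n_trials, n)).mean(axis=1)
    se = xbar.std()                          # empirical standard error of the mean
    note = "" if prev is None else f"   previous SE / this SE = {prev / se:.2f}"
    print(f"n={n:5d}  SE(Xbar) ~ {se:.5f}{note}")
    prev = se
```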