Law of Large Numbers & CLT - Why Averages Behave
Helpful context:
- Moment-Generating Functions - All a Distribution’s Moments in One Package
- Joint Distributions - What Multiple Random Variables Know About Each Other
Imagine you walk into a casino. Every game has a small edge in the house’s favor - say, 1% on each bet. On any given hand of blackjack, you might win big. The house might lose big. Neither outcome is surprising. Randomness is real.
But now imagine you’re the casino’s accountant, tracking the results of 1,000,000 bets over the year. You are not surprised. You know, with near certainty, how the year will end. The house will be up roughly 1% of the total money wagered. It might be 0.98% or 1.02%, but it won’t be 10% in either direction.
Why? The individual bets are random. The aggregate is not. Something turns the chaos of individual outcomes into the regularity of long-run averages.
That something is the Law of Large Numbers. And the Central Limit Theorem explains the shape of the fluctuations around it.
The Law of Large Numbers
Let $X_1, X_2, \ldots$ be independent and identically distributed (i.i.d.) random variables with mean $\mu = E[X_i]$ and variance $\sigma^2 = \text{Var}(X_i) < \infty$. The sample mean is:
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i.$$
Weak Law of Large Numbers (WLLN): For any $\varepsilon > 0$,
$$P(|\bar{X}_n - \mu| \geq \varepsilon) \to 0 \quad \text{as } n \to \infty.$$
The probability that the sample mean is far from $\mu$ goes to zero. This is called convergence in probability.
Proof via Chebyshev
The proof is elegant because it requires almost nothing. We compute the mean and variance of $\bar{X}_n$:
$$E[\bar{X}_n] = \mu, \qquad \text{Var}(\bar{X}_n) = \frac{\sigma^2}{n}.$$
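Both identities take one line each. The mean uses linearity of expectation; the variance uses independence, which makes all covariance terms vanish:

$$E[\bar{X}_n] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \frac{n\mu}{n} = \mu, \qquad \text{Var}(\bar{X}_n) = \frac{1}{n^2}\,\text{Var}\left(\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.$$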
The variance of the average shrinks as $1/n$ - this is the key. Now apply Chebyshev’s inequality: for any random variable $Y$ with finite variance,
$$P(|Y - E[Y]| \geq \varepsilon) \leq \frac{\text{Var}(Y)}{\varepsilon^2}.$$
Applying this to $\bar{X}_n$:
$$P(|\bar{X}_n - \mu| \geq \varepsilon) \leq \frac{\sigma^2}{n\varepsilon^2}.$$
As $n \to \infty$, the right side goes to zero. Done.
This proof assumes only finite variance - no specific distribution. Binomial, Poisson, Exponential, whatever. The LLN holds for all of them.
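As a rough numerical check of the Chebyshev argument, the sketch below simulates averages of fair die rolls and compares the observed frequency of a large deviation with the bound $\sigma^2/(n\varepsilon^2)$. The choice of a die, the tolerance `eps = 0.5`, the seed, and the trial count are all illustrative assumptions, not part of the proof.

```python
import random

def tail_probability(n, eps, trials, rng):
    """Estimate P(|mean of n fair-die rolls - 3.5| >= eps) by simulation."""
    hits = 0
    for _ in range(trials):
        mean = sum(rng.randint(1, 6) for _ in range(n)) / n
        if abs(mean - 3.5) >= eps:
            hits += 1
    return hits / trials

rng = random.Random(0)      # fixed seed so the run is repeatable
var = 35 / 12               # variance of a single fair die roll
eps = 0.5                   # illustrative tolerance

results = []
for n in (10, 40, 160):
    empirical = tail_probability(n, eps, trials=2000, rng=rng)
    bound = min(1.0, var / (n * eps ** 2))   # Chebyshev bound, capped at 1
    results.append((n, empirical, bound))
    print(f"n={n:4d}  empirical={empirical:.4f}  Chebyshev bound={bound:.4f}")
```

The empirical frequencies should sit well below the bound: Chebyshev is loose, but it is enough to force the probability to zero.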
The Strong LLN: A Stronger Statement
The weak law says: for each fixed $\varepsilon$, the probability of a large deviation goes to 0. But you could imagine a weird world where deviations keep happening, just rarely enough to satisfy the weak law.
The Strong Law of Large Numbers (SLLN) rules this out:
$$P\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1.$$
With probability 1, the sample path $\bar{X}_1, \bar{X}_2, \ldots$ actually converges to $\mu$. Not just “is likely close to $\mu$” for large $n$ - the sequence converges, in the same sense that $1/n \to 0$.
The strong law requires more work to prove (a common textbook proof assumes a finite fourth moment and uses the Borel-Cantelli lemma; the general result needs only a finite mean). But intuitively: it says that almost every sequence of observations you could draw will eventually settle down to the true mean.
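To see what "settling down" looks like on a single sample path, the following sketch flips one simulated fair coin 200,000 times and records the worst deviation of the running mean from 0.5 from several starting points onward. A finite simulation cannot prove almost-sure convergence; it only illustrates the behavior the strong law describes. The seed and checkpoints are arbitrary choices.

```python
import random

rng = random.Random(1)      # one fixed sample path
p = 0.5                     # fair coin, so the true mean is 0.5
n_total = 200_000

heads = 0
deviations = []             # deviations[n - 1] = |running mean after n flips - p|
for n in range(1, n_total + 1):
    heads += rng.random() < p
    deviations.append(abs(heads / n - p))

# Largest deviation seen from each checkpoint to the end of the path.
worst_from = {}
for start in (100, 1_000, 10_000, 100_000):
    worst_from[start] = max(deviations[start - 1:])
    print(f"from n={start:>7,} onward: worst |mean - 0.5| = {worst_from[start]:.5f}")
```

The weak law only controls each $n$ separately; the quantity printed here, the worst deviation over a whole tail of the path, is the one the strong law is about.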
What the LLN Does Not Say
The LLN is frequently misunderstood, so let’s be explicit.
The LLN does not say that past outcomes affect future ones. If you flip a fair coin and get 10 heads in a row, the next flip is still 50-50. The LLN says that as the total number of flips grows large, the proportion of heads will get close to 0.5 - not because the coin “corrects” for the past imbalance, but because 10 extra heads become negligible in a sea of thousands of flips.
This error - thinking the coin owes you tails after a run of heads - is called the gambler’s fallacy. The LLN is often cited to justify it. But the LLN has no memory. Each flip is independent of every other flip. The convergence happens because of dilution, not correction.
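Both halves of this argument can be checked with a small sketch. The first part estimates how often a simulated fair coin lands heads immediately after five heads in a row; the second part is pure arithmetic, assuming (for illustration only) that the flips after an initial run of 10 heads split exactly evenly. The streak length, seed, and flip counts are arbitrary.

```python
import random

rng = random.Random(2)
flips = [rng.random() < 0.5 for _ in range(400_000)]  # True means heads

# 1) No memory: how often is the flip right after five straight heads a head?
run = 0                  # length of the current streak of heads
after_streak = []
for is_heads in flips:
    if run >= 5:
        after_streak.append(is_heads)
    run = run + 1 if is_heads else 0
p_after_streak = sum(after_streak) / len(after_streak)
print(f"P(heads | previous 5 were heads) ~ {p_after_streak:.3f}")

# 2) Dilution: start 10 heads ahead, then suppose later flips split evenly
#    (the expected outcome, used here only to isolate the arithmetic).
proportions = {}
for later_flips in (0, 100, 10_000, 1_000_000):
    heads = 10 + later_flips // 2
    total = 10 + later_flips
    proportions[later_flips] = heads / total
    print(f"after {later_flips:>9,} more flips: surplus of heads = {heads - (total - heads)}, "
          f"proportion = {heads / total:.5f}")
```

The surplus of heads never shrinks in the second part, yet the proportion still approaches 0.5. Nothing is corrected; the early run is simply outnumbered.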
The Central Limit Theorem
The LLN tells you where the sample mean goes: toward $\mu$. It does not tell you how far off it typically is, or what the distribution of the deviations looks like.
The Central Limit Theorem answers that.
Theorem (CLT): Standardize the sample mean by subtracting the true mean and dividing by the standard error:
$$Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}.$$
Then $Z_n$ converges in distribution to $\mathcal{N}(0,1)$:
$$P(Z_n \leq z) \to \Phi(z) \quad \text{for every } z,$$
where $\Phi$ is the standard Normal CDF.
This holds regardless of the original distribution of the $X_i$ - as long as the mean is finite and the variance is finite and nonzero.
Think about what this is saying. The $X_i$ could be Bernoulli(0.1) - a distribution that’s completely non-symmetric, concentrated at 0 and 1. The average of 1000 such variables, properly centered and scaled, is approximately normally distributed. The bell curve emerges from random variables that are not themselves bell-shaped at all.
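A quick simulation makes this concrete. The sketch below draws 2,000 averages of 1,000 Bernoulli(0.1) variables, standardizes each one as in the theorem, and compares the empirical CDF with $\Phi$ at a few points. The number of repetitions, the seed, and the evaluation points are assumptions chosen for illustration; small discrepancies are expected because the average is still discrete.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(3)
p, n, trials = 0.1, 1000, 2000
mu, sigma = p, math.sqrt(p * (1 - p))   # mean and sd of one Bernoulli(p)

z_values = []
for _ in range(trials):
    mean = sum(rng.random() < p for _ in range(n)) / n
    z_values.append((mean - mu) / (sigma / math.sqrt(n)))   # standardized mean Z_n

phi = NormalDist().cdf                  # standard Normal CDF
comparison = []
for z in (-1.0, 0.0, 1.0, 2.0):
    empirical = sum(v <= z for v in z_values) / trials
    comparison.append((z, empirical, phi(z)))
    print(f"z={z:+.1f}  empirical P(Z_n <= z)={empirical:.3f}  Phi(z)={phi(z):.3f}")
```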
Proof via MGFs
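Assume the MGF of $X_i$ is finite in a neighborhood of 0; this is stronger than the CLT needs, and the general proof runs the same argument with characteristic functions. Let $Y_i = (X_i - \mu)/\sigma$, so that $E[Y_i] = 0$ and $E[Y_i^2] = 1$. The standardized sum is: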
$$Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n Y_i.$$
The MGF of $Z_n$ is:
$$M_{Z_n}(t) = \left[M_Y\left(\frac{t}{\sqrt{n}}\right)\right]^n.$$
(We used the fact that MGFs multiply for independent variables, and that dividing by $\sqrt{n}$ scales the argument of the MGF.)
Now Taylor-expand $M_Y$ around $t = 0$. Using $M_Y(0) = 1$, $M_Y'(0) = E[Y] = 0$, $M_Y''(0) = E[Y^2] = 1$:
$$M_Y\left(\frac{t}{\sqrt{n}}\right) = 1 + 0 + \frac{1}{2} \cdot \frac{t^2}{n} + O(n^{-3/2}) = 1 + \frac{t^2}{2n} + O(n^{-3/2}).$$
Therefore:
$$M_{Z_n}(t) = \left(1 + \frac{t^2}{2n} + O(n^{-3/2})\right)^n \xrightarrow{n \to \infty} e^{t^2/2}.$$
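The limit $e^{t^2/2}$ is the MGF of $\mathcal{N}(0,1)$. By the continuity theorem for MGFs (if the MGFs converge on a neighborhood of 0 to the MGF of some distribution, the random variables converge in distribution to it), $Z_n \to \mathcal{N}(0,1)$ in distribution.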
The key step: the first-order term in the Taylor expansion of $M_Y$ involves only $E[Y] = 0$, and the second-order term involves only $E[Y^2] = 1$. The higher-order terms, which capture the shape of the original distribution, all vanish as $n \to \infty$. The CLT works because adding independent variables smooths out everything except the mean and variance.
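The limit can be watched numerically in a case where the MGF has a closed form. The sketch below assumes each $Y_i$ is $+1$ or $-1$ with probability $1/2$ (a choice made here for convenience, not taken from the proof), so that $M_Y(t) = \cosh(t)$ with mean 0 and variance 1, and evaluates $[M_Y(t/\sqrt{n})]^n$ at the arbitrary point $t = 1.5$.

```python
import math

def mgf_standardized_sum(t, n):
    """MGF of Z_n = (Y_1 + ... + Y_n) / sqrt(n) when each Y_i is +1 or -1 with probability 1/2.

    A single Y has MGF cosh(t), so M_{Z_n}(t) = cosh(t / sqrt(n)) ** n.
    """
    return math.cosh(t / math.sqrt(n)) ** n

t = 1.5
limit = math.exp(t ** 2 / 2)            # MGF of N(0, 1) at t
gaps = []
for n in (1, 10, 100, 1000, 10000):
    value = mgf_standardized_sum(t, n)
    gaps.append(abs(value - limit))
    print(f"n={n:6d}  M_Zn(t)={value:.6f}  e^(t^2/2)={limit:.6f}  gap={gaps[-1]:.6f}")
```

The gap shrinks steadily toward zero as $n$ grows, which is the convergence the Taylor expansion predicts.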
What “Converges in Distribution” Means
Convergence in distribution is weaker than convergence in probability. It says: the CDF of $Z_n$ converges to the standard Normal CDF $\Phi(z)$, pointwise at every point of continuity of $\Phi$. (Since $\Phi$ is continuous everywhere, this means at every $z$.)
It does not say that $Z_n$ converges to a specific Normal random variable as a number. It says that the probability distribution of $Z_n$ looks more and more like the Normal distribution as $n$ grows.
Discomfort check. The CLT says the sample average is approximately Normal. It does not say that the individual observations are Normal.
Example: $X_i \sim \text{Bernoulli}(0.1)$. Each $X_i$ takes the value 1 with probability 0.1 and 0 with probability 0.9. The individual observations are definitely not normal - they’re discrete, taking only two values.
But the average $\bar{X}_{1000}$ of 1000 such variables is approximately Normal with mean 0.1 and variance $0.1 \cdot 0.9 / 1000 = 0.00009$ (standard deviation about 0.0095). The distribution of the average, properly scaled, looks like a bell curve.
This is counterintuitive because we’re used to thinking of Normal as a property of the data. The CLT says it’s a property of averages of data - a very different statement.
The CLT in Practice
How Large Does $n$ Need to Be?
The CLT is an asymptotic result - it holds “as $n \to \infty$.” In practice, you need a finite $n$, and the question is: when is the Normal approximation good enough?
The common rule of thumb is $n \geq 30$. But this is highly distribution-dependent:
- If the original distribution is symmetric and light-tailed (like a Normal or Uniform), the Normal approximation is good even for $n = 5$ or $10$.
- If the original distribution is highly skewed (like an Exponential or a Bernoulli with very small $p$), you may need $n = 100$ or more before the approximation is accurate.
The Berry-Esseen theorem gives a quantitative bound: when the third absolute moment $E|X_i - \mu|^3$ is finite, the worst-case error in the Normal approximation to the CDF of $Z_n$ decays as $O(1/\sqrt{n})$, with a constant proportional to $E|X_i - \mu|^3 / \sigma^3$.
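For Bernoulli variables the approximation error can be computed exactly, because the sum is Binomial. The sketch below measures the largest gap between the CDF of the standardized mean and $\Phi$ for Bernoulli(0.1), and compares it with a bound of the form $C \cdot E|X - \mu|^3 / (\sigma^3 \sqrt{n})$. The constant `C = 0.4748` is quoted from the literature as an upper bound for the i.i.d. case and should be treated as an assumption of this sketch; the helper names are hypothetical.

```python
import math
from statistics import NormalDist

def binomial_pmf(n, k, p):
    """Binomial(n, p) probability of k successes, computed in log space for stability."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def kolmogorov_distance(n, p):
    """sup_z |P(Z_n <= z) - Phi(z)| for the standardized mean of n Bernoulli(p) draws."""
    phi = NormalDist().cdf
    sd = math.sqrt(n * p * (1 - p))
    cdf_before, worst = 0.0, 0.0
    for k in range(n + 1):
        z = (k - n * p) / sd
        cdf_after = cdf_before + binomial_pmf(n, k, p)
        # The CDF jumps at z, so check the gap just below the jump and at it.
        worst = max(worst, abs(cdf_before - phi(z)), abs(cdf_after - phi(z)))
        cdf_before = cdf_after
    return worst

p = 0.1
sigma = math.sqrt(p * (1 - p))
rho = p * (1 - p) * ((1 - p) ** 2 + p ** 2)   # E|X - p|^3 for Bernoulli(p)
C = 0.4748                                    # assumed Berry-Esseen constant (see lead-in)

rows = []
for n in (10, 100, 1000):
    distance = kolmogorov_distance(n, p)
    bound = C * rho / (sigma ** 3 * math.sqrt(n))
    rows.append((n, distance, bound))
    print(f"n={n:5d}  max CDF error={distance:.4f}  bound={bound:.4f}")
```

With a skewed distribution like this one, the error at $n = 100$ is still visible, which is the practical content of the warning about the $n \geq 30$ rule.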
Polling and Margins of Error
If you poll $n$ people, each independently expressing support (1) or opposition (0) for some policy, then $\bar{X}_n$ estimates the true support proportion $p$. By the CLT, $\bar{X}_n$ is approximately $\mathcal{N}(p,\ p(1-p)/n)$.
The standard error is $\sqrt{p(1-p)/n} \leq 1/(2\sqrt{n})$, with equality at $p = 1/2$. So the margin of error at 95% confidence ($\pm 1.96$ standard errors) is at most $1.96/(2\sqrt{n}) = 0.98/\sqrt{n}$, which is commonly rounded to $1/\sqrt{n}$.
This is where the familiar “margin of error $\approx 3\%$” for political polls comes from: $n = 1000$ gives $1/\sqrt{1000} \approx 0.032$, or about 3.2%.
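The arithmetic is short enough to put in two small functions. This is a minimal sketch assuming simple random sampling and the Normal approximation; the function names are hypothetical, and `p = 0.5` is used as the worst case.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard Normal

def margin_of_error(n, p=0.5):
    """95% margin of error for a sample proportion; p = 0.5 is the worst case."""
    return Z_95 * math.sqrt(p * (1 - p) / n)

def sample_size_for(margin, p=0.5):
    """Smallest n whose 95% margin of error is at most `margin`."""
    return math.ceil(p * (1 - p) * (Z_95 / margin) ** 2)

for n in (100, 1000, 10000):
    print(f"n={n:6d}  worst-case margin={margin_of_error(n):.4f}  "
          f"1/sqrt(n)={1 / math.sqrt(n):.4f}")
print("n needed for a 3% margin:", sample_size_for(0.03))
```

Because the margin scales as $1/\sqrt{n}$, halving it requires four times as many respondents.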
Monte Carlo Integration
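To estimate $\int_0^1 f(x)\,dx$, sample $X_1, \ldots, X_n \sim \text{Uniform}[0,1]$ and compute the average $\bar{f}_n = \frac{1}{n}\sum_{i=1}^n f(X_i)$. The LLN guarantees that $\bar{f}_n$ converges to $\int_0^1 f(x)\,dx$, and, provided $f(X_i)$ has finite variance, the CLT says that $\bar{f}_n$ is approximately Normal around that value with standard deviation $\sqrt{\text{Var}(f(X_i))/n}$, which is proportional to $1/\sqrt{n}$.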
The error shrinks as $1/\sqrt{n}$ regardless of the dimension of the integral (the dimension can still affect the constant, through the variance of $f$). This dimension-independent rate is why Monte Carlo methods are so useful for high-dimensional integration: deterministic grid-based methods suffer from the curse of dimensionality, while the Monte Carlo rate does not.
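A minimal sketch of the estimator, with a CLT-based standard error attached, is shown below. The integrand $f(x) = x^2$ (exact integral $1/3$) is chosen here only because the answer is known; the function name and sample sizes are illustrative.

```python
import math
import random

def monte_carlo_integral(f, n, rng):
    """Estimate the integral of f over [0, 1] and its standard error from n uniform samples."""
    values = [f(rng.random()) for _ in range(n)]
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)   # sample variance of f(X)
    return mean, math.sqrt(variance / n)

def f(x):
    return x * x            # exact integral over [0, 1] is 1/3

rng = random.Random(4)
estimates = []
for n in (1_000, 10_000, 100_000):
    estimate, std_error = monte_carlo_integral(f, n, rng)
    estimates.append((n, estimate, std_error))
    print(f"n={n:7,}  estimate={estimate:.5f}  std error={std_error:.5f}  "
          f"actual error={abs(estimate - 1 / 3):.5f}")
```

Each hundredfold increase in $n$ should cut the standard error by roughly a factor of ten, the $1/\sqrt{n}$ rate in action.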
Why Does It Work for Such Different Distributions?
Here is the deepest intuition. Think of each $X_i$ as carrying some information about its distribution - information about skewness, kurtosis, all the higher-order shape characteristics. When you average $n$ independent copies, the contributions to the average from any individual $X_i$ are of order $1/n$.
The mean contributes at order 1. The variance contributes at order $1/\sqrt{n}$ (this is the scale of fluctuations). The skewness contributes at order $1/n$. The kurtosis at $1/n^{3/2}$. All higher moments become negligible as $n$ grows.
In the limit, only the mean and variance survive in the distribution of the average. And the only distribution with a given mean and variance where all higher-order information has been washed away is the Normal. The CLT works because averages are smoothing operators that destroy distributional structure, leaving only the first two moments behind.
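One way to make this bookkeeping precise is through cumulants, a tool not introduced above and used here only as a sketch (it assumes the relevant moments are finite). Write $\kappa_r(\cdot)$ for the $r$-th cumulant: $\kappa_1$ is the mean, $\kappa_2$ the variance, and $\kappa_3$, $\kappa_4$ the unnormalized skewness and excess kurtosis. Cumulants add over independent sums and scale as $\kappa_r(cY) = c^r \kappa_r(Y)$, so for the standardized sum $Z_n = n^{-1/2}\sum_{i=1}^n Y_i$:

$$\kappa_r(Z_n) = n \cdot n^{-r/2}\,\kappa_r(Y) = n^{1 - r/2}\,\kappa_r(Y), \qquad \text{so} \quad \kappa_3(Z_n) = \frac{\kappa_3(Y)}{\sqrt{n}}, \quad \kappa_4(Z_n) = \frac{\kappa_4(Y)}{n}.$$

Every cumulant of order three or higher tends to zero, while $\kappa_1(Z_n) = 0$ and $\kappa_2(Z_n) = 1$ stay fixed, and the Normal is the distribution whose cumulants beyond the second are all zero. Read on the scale of $\bar{X}_n$, whose fluctuations are of size $\sigma/\sqrt{n}$, these $1/\sqrt{n}$ and $1/n$ corrections correspond heuristically to the $1/n$ and $1/n^{3/2}$ orders quoted above.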
Connection to Machine Learning: Stochastic Gradient Descent
In stochastic gradient descent (SGD), at each step you compute the gradient using a mini-batch of $m$ data points. This mini-batch gradient is the average of $m$ per-sample gradients:
$$\hat{g} = \frac{1}{m} \sum_{i=1}^m \nabla_\theta \ell(\theta; x_i).$$
By the CLT, if the mini-batch is sampled independently from the data and the per-sample gradients have finite variance, this gradient estimate is approximately Normally distributed around the true gradient, with variance proportional to $1/m$. The Gaussian approximation to gradient noise is the starting point for many theoretical analyses of SGD - it’s why practitioners talk about gradient noise as “Gaussian noise” and why the analogy to Langevin dynamics (a continuous-time stochastic differential equation with Gaussian noise) is so useful.
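The $1/m$ scaling is easy to observe on a toy problem. The sketch below assumes a hypothetical one-parameter least-squares model on synthetic data and draws mini-batches with replacement, so that the per-sample gradients in a batch are i.i.d. draws (real training loops usually sample without replacement within an epoch). It checks that the variance of the mini-batch gradient times $m$ stays close to the per-sample variance.

```python
import random
import statistics

rng = random.Random(5)

# Hypothetical toy dataset: y = 2x + noise, with a single scalar parameter theta.
data = []
for _ in range(2000):
    x = rng.gauss(0.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))

def per_sample_grad(theta, x, y):
    """Gradient in theta of the squared-error loss 0.5 * (theta * x - y) ** 2."""
    return (theta * x - y) * x

theta = 0.0
grads = [per_sample_grad(theta, x, y) for x, y in data]
full_grad = statistics.fmean(grads)            # gradient over the whole dataset
per_sample_var = statistics.pvariance(grads)   # variance of one per-sample gradient

def minibatch_grad(m):
    """Average of m per-sample gradients drawn with replacement (i.i.d. draws)."""
    return statistics.fmean(rng.choices(grads, k=m))

summary = []
for m in (4, 16, 64):
    draws = [minibatch_grad(m) for _ in range(4000)]
    var_m = statistics.pvariance(draws)
    summary.append((m, statistics.fmean(draws), var_m))
    print(f"m={m:3d}  mean={summary[-1][1]:+.3f} (full gradient {full_grad:+.3f})  "
          f"variance={var_m:.4f}  variance*m={var_m * m:.3f}")
print(f"per-sample variance = {per_sample_var:.3f}")
```

Quadrupling the batch size cuts the gradient variance by about four, so the noise standard deviation only halves; this is the same $1/\sqrt{n}$ trade-off as in polling.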
Summary
| Concept | Statement |
|---|---|
| Sample mean | $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ |
| Variance of sample mean | $\text{Var}(\bar{X}_n) = \sigma^2/n$ |
| Weak LLN | $P(\lvert\bar{X}_n - \mu\rvert \geq \varepsilon) \to 0$ for every $\varepsilon > 0$ |
| Strong LLN | $P(\bar{X}_n \to \mu) = 1$ |
| Gambler’s fallacy | LLN says nothing about individual future outcomes |
| CLT | $\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} \mathcal{N}(0,1)$ |
| CLT proof sketch | MGF of standardized sum $\to e^{t^2/2}$ via Taylor expansion |
| What CLT says | The average is approximately Normal |
| What CLT does not say | Individual observations are Normal |
| Standard error | $\sigma/\sqrt{n}$ - how much $\bar{X}_n$ fluctuates around $\mu$ |
| Margin of error (95%) | $\approx 2\sigma/\sqrt{n}$, at most $1/\sqrt{n}$ for proportions |
The LLN is why the casino always wins in the long run. The CLT is why we can say, precisely, what “always” means and how big the fluctuations are along the way.