Concentration Inequalities - How Far Can a Random Variable Stray From Its Mean? // Megha Bose

Helpful context:

Law of Large Numbers & CLT - Why Averages Behave

The Central Limit Theorem tells you something beautiful: the sample mean converges to the true mean, and the fluctuations around it are approximately Normal. But there’s a catch hidden in the word “approximately.”

The CLT is an asymptotic result. It holds “in the limit as $n \to \infty$.” For any specific sample size - say $n = 100$ or $n = 10000$ - the CLT gives you no concrete guarantee. It tells you the distribution is close to Normal, but not how close.

What if you need to know: with probability at least 95%, the average of my 500 measurements is within 0.1 of the true value? The CLT cannot answer that directly. You need a concentration inequality - a finite-sample bound that holds with explicit constants.

This is the domain of concentration inequalities: giving you hard guarantees, not just asymptotic approximations.

Markov’s Inequality: The Most Basic Bound

Start with the most elementary question you could ask: if a non-negative random variable has a known mean, how large can it be?

Markov’s Inequality: If $X \geq 0$ and $E[X] < \infty$, then for any $a > 0$:

$$P(X \geq a) \leq \frac{E[X]}{a}.$$

The proof is a single line. Since $X \geq 0$:

$$E[X] = \int_0^\infty x dP \geq \int_a^\infty x dP \geq \int_a^\infty a dP = a \cdot P(X \geq a).$$

Divide both sides by $a$.

Example: if the average grade on an exam is 70, what fraction of students can score above 90? Markov says at most $70/90 \approx 78%$. This is a weak bound - intuitively, if the average is 70, probably fewer than 78% score above 90 - but it requires absolutely no information about the distribution beyond its mean.

Markov’s inequality is tight: the bound $E[X]/a$ is achieved exactly when $X$ takes value $a$ with probability $E[X]/a$ and value 0 otherwise. You cannot improve it without assuming more about $X$.

Chebyshev’s Inequality: Adding Variance

Now suppose you know both the mean and the variance. You can get a tighter bound by applying Markov’s inequality to the squared deviation $(X - \mu)^2$.

Chebyshev’s Inequality: If $\text{Var}(X) < \infty$, then for any $k > 0$:

$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2},$$

where $\mu = E[X]$ and $\sigma^2 = \text{Var}(X)$.

Proof: Apply Markov to $Y = (X-\mu)^2$ with threshold $a = (k\sigma)^2$:

$$P(|X - \mu| \geq k\sigma) = P((X-\mu)^2 \geq k^2\sigma^2) \leq \frac{E[(X-\mu)^2]}{k^2\sigma^2} = \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}.$$

So at most 1/4 of the probability mass can be more than 2 standard deviations from the mean, regardless of what the distribution looks like. At most 1/9 can be more than 3 standard deviations away.

Chebyshev’s inequality is also tight: a distribution that puts mass $1/(2k^2)$ at each of $\mu + k\sigma$ and $\mu - k\sigma$, and the remaining mass at $\mu$, achieves the bound exactly. So you can’t improve Chebyshev without additional assumptions either.

Applied to the sample mean $\bar{X}_n$ (with variance $\sigma^2/n$), Chebyshev gives:

$$P(|\bar{X}_n - \mu| \geq t) \leq \frac{\sigma^2}{nt^2}.$$

This is a quantitative version of the Law of Large Numbers: to guarantee the error is at most $t$ with probability at least $1 - \delta$, you need $n \geq \sigma^2/(t^2 \delta)$ samples. The bound is correct but often pessimistic - the true tail probabilities typically decay much faster than $1/n$.

The Exponential Moment Trick: Chernoff Bounds

Markov and Chebyshev are universal - they work with minimal assumptions. But their bounds decay polynomially in the deviation. For many applications, you want exponentially small tail probabilities.

The idea: instead of applying Markov to $X$ or $X^2$, apply it to $e^{tX}$ for some $t > 0$:

$$P(X \geq a) = P(e^{tX} \geq e^{ta}) \leq \frac{E[e^{tX}]}{e^{ta}} = \frac{M_X(t)}{e^{ta}}.$$

Now optimize over $t > 0$ to get the tightest bound. The result - a Chernoff bound - gives exponentially decaying tails.

Chernoff for the Binomial: Let $X \sim \text{Binomial}(n, p)$ with $\mu = np$. For any $\delta \in (0, 1)$:

$$P(X \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3}.$$

The tail probability decays exponentially in $\mu = np$. Doubling the number of trials doubles the exponent - the tail gets squared, not halved. This is a dramatic improvement over Chebyshev’s polynomial decay.

For the lower tail: $P(X \leq (1-\delta)\mu) \leq e^{-\mu\delta^2/2}$.

Example: you flip a fair coin 1000 times. The expected number of heads is $\mu = 500$. The probability of getting at least 600 heads ($\delta = 0.2$) is at most $e^{-500 \cdot 0.04/3} \approx e^{-6.67} \approx 0.0013$. Less than 0.13% - much smaller than what Chebyshev would give ($\leq 1/(4 \cdot (100/\sqrt{250})^2) \approx 6.25%$).

Hoeffding’s Inequality: Bounded Variables

The Chernoff approach required knowing the specific distribution (Binomial, in the example above). What if you only know that your variables are bounded?

Hoeffding’s Inequality: Let $X_1, \ldots, X_n$ be independent, with $a_i \leq X_i \leq b_i$ almost surely. Let $\bar{X} = \frac{1}{n}\sum X_i$ and $\mu = E[\bar{X}]$. Then for any $t > 0$:

$$P(\bar{X} - \mu \geq t) \leq \exp\left(-\frac{2n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$

In the symmetric case where all $X_i \in [a, b]$:

$$P(\bar{X} - \mu \geq t) \leq \exp\left(-\frac{2nt^2}{(b-a)^2}\right).$$

This is an exponential bound that requires nothing about the distribution of the $X_i$ beyond their range. Bounded? Hoeffding applies.

The proof goes through the Hoeffding lemma: for any $X \in [a, b]$ with $E[X] = 0$, the MGF satisfies $E[e^{sX}] \leq e^{s^2(b-a)^2/8}$. The key step is that the exponential function is convex, so $e^{sx} \leq \frac{b-x}{b-a}e^{sa} + \frac{x-a}{b-a}e^{sb}$, and bounding this expression with $1 + u \leq e^u$ gives the lemma.

Discomfort check. If we already have the CLT, why do we need concentration inequalities?

Three reasons:

Finite samples. The CLT is asymptotic - it tells you what happens as $n \to \infty$. Concentration inequalities are non-asymptotic - they give you concrete bounds for any fixed $n$. If you need “with probability at least 99%, the error is at most 0.1 after 500 samples,” you need Hoeffding, not the CLT.

Heavy tails. The CLT requires finite variance. Some distributions - Pareto, Cauchy, certain power laws - have infinite variance or don’t even have a finite mean. The CLT doesn’t apply. Markov’s inequality still does, requiring only a finite mean.

Non-i.i.d. settings. The CLT assumes i.i.d. data. Many concentration results extend to non-i.i.d. situations: martingales, Markov chains, functions of many variables. McDiarmid’s inequality, for example, handles any function of independent variables satisfying a bounded differences condition - no identically distributed requirement.

In short: the CLT is a beautiful asymptotic characterization. Concentration inequalities are the finite-sample engineering tools.

Application to Machine Learning: PAC Learning

Here is where concentration inequalities pay off in a real theorem. Consider a finite hypothesis class $\mathcal{H}$ - a set of $N$ classification rules. You draw $m$ labeled examples from some unknown distribution and pick the hypothesis that does best on your training set. How many examples do you need before you can trust that your training error is close to the true error?

For any single hypothesis $h \in \mathcal{H}$, its empirical risk $\hat{R}(h) = \frac{1}{m}\sum_{i=1}^m \mathbf{1}[h(x_i) \neq y_i]$ is the average of $m$ independent $\{0,1\}$ random variables. By Hoeffding’s inequality:

$$P(|\hat{R}(h) - R(h)| \geq \varepsilon) \leq 2e^{-2m\varepsilon^2}.$$

Now you need this to hold for all $N$ hypotheses simultaneously. Apply the union bound: the probability that any one of $N$ bad events occurs is at most the sum of their individual probabilities:

$$P\left(\max_{h \in \mathcal{H}} |\hat{R}(h) - R(h)| \geq \varepsilon\right) \leq N \cdot 2e^{-2m\varepsilon^2}.$$

Set this equal to $\delta$ and solve for $m$:

$$m \geq \frac{1}{2\varepsilon^2} \log\frac{2N}{\delta}.$$

To learn any finite hypothesis class with accuracy $\varepsilon$ and confidence $1-\delta$, you need a sample size that grows logarithmically in $N$ and as $1/\varepsilon^2$. This is the fundamental PAC learning bound (Probably Approximately Correct), and it’s a direct consequence of Hoeffding plus a union bound.

The logarithmic dependence on $N$ is what makes machine learning tractable: you can have exponentially many hypotheses and still learn with a polynomial number of samples. A model space with $2^{100}$ binary classifiers needs only about $100 \log 2 / \varepsilon^2 \approx 70/\varepsilon^2$ samples, not $2^{100}$ of them.

Sub-Gaussian Variables: The Right Generalization

Hoeffding’s inequality works for bounded variables. But many distributions of interest aren’t bounded - they’re just “well-behaved” in their tails. The right generalization is sub-Gaussian.

Definition: A zero-mean random variable $X$ is sub-Gaussian with parameter $\sigma$ if, for all $t \in \mathbb{R}$:

$$E[e^{tX}] \leq e^{\sigma^2 t^2/2}.$$

This says: the MGF of $X$ is bounded above by the MGF of $\mathcal{N}(0, \sigma^2)$. The tails of $X$ are no heavier than Gaussian tails.

Sub-Gaussian variables satisfy Hoeffding-type bounds by definition: applying the Chernoff argument with the sub-Gaussian MGF bound immediately gives:

$$P(X \geq t) \leq e^{-t^2/(2\sigma^2)}.$$

What variables are sub-Gaussian?

$X \sim \mathcal{N}(0, \sigma^2)$: sub-Gaussian with parameter $\sigma$ (tight).
Any bounded $X \in [a, b]$ with mean 0: sub-Gaussian with parameter $(b-a)/2$ (by Hoeffding’s lemma).
$X \sim \text{Bernoulli}(p) - p$: sub-Gaussian with parameter $1/2$.

Sub-Gaussian variables are closed under independent sums: if $X_i$ are independent sub-Gaussian with parameter $\sigma_i$, then $\sum X_i$ is sub-Gaussian with parameter $\sqrt{\sum \sigma_i^2}$. This extends Hoeffding’s inequality from bounded variables to the wider class of sub-Gaussian variables with minimal extra effort.

Summary

Inequality	Assumptions	Bound	Decay
Markov	$X \geq 0$, finite mean	$P(X \geq a) \leq E[X]/a$	$1/a$ (linear)
Chebyshev	Finite variance	$P(\|X-\mu\| \geq k\sigma) \leq 1/k^2$	$1/k^2$ (quadratic)
Chernoff (Binomial)	$X \sim \text{Bin}(n,p)$	$P(X \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3}$	Exponential in $n$
Hoeffding	Independent, bounded $[a_i, b_i]$	$P(\bar{X}-\mu \geq t) \leq \exp(-2n^2t^2/\sum(b_i-a_i)^2)$	Exponential in $n$
PAC learning	Hoeffding + union bound	$m = O(\log(N/\delta)/\varepsilon^2)$ samples	Log in hypothesis count
Sub-Gaussian	MGF $\leq e^{\sigma^2 t^2/2}$	$P(X \geq t) \leq e^{-t^2/(2\sigma^2)}$	Exponential in $t^2$

The hierarchy runs from weakest to strongest: Markov assumes least and proves least; sub-Gaussian bounds assume more structure and give exponential tails. The right tool depends on what you know about your data. When you only know the mean, Markov. When you know the variance, Chebyshev. When you know the range or the MGF, Hoeffding or Chernoff. When you need sub-exponential tails for a non-bounded, non-Gaussian process, sub-Gaussian techniques are the starting point for more advanced methods.

Read next: