Concentration inequalities bound the probability that a random variable deviates from its mean. Unlike asymptotic results such as the CLT, these are finite-sample guarantees with explicit constants - exactly what is needed for rigorous analysis in machine learning and statistics.

Markov’s Inequality

The most elementary bound requires only a finite mean.

Theorem (Markov). If $X \geq 0$ and $E[X] < \infty$, then for any $a > 0$,

$$P(X \geq a) \leq \frac{E[X]}{a}.$$

Proof. Since $X \geq 0$,

$$E[X] = \int_0^\infty x \, dP \geq \int_a^\infty x \, dP \geq a \int_a^\infty dP = a \cdot P(X \geq a).$$

Dividing both sides by $a$ gives the result. $\square$

Markov’s inequality is tight: the bound is achieved when $X$ takes value $a$ with probability $E[X]/a$ and $0$ otherwise.
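Both the bound and its looseness for typical distributions are easy to probe numerically. A minimal simulation sketch (Python; the Exponential(1) distribution and sample size are arbitrary illustrative choices) compares empirical tail probabilities with $E[X]/a$:

```python
import random

random.seed(0)

# Monte Carlo check of Markov's inequality, P(X >= a) <= E[X]/a,
# for an Exponential(1) variable, whose mean is 1.
n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]
mean = sum(samples) / n

results = {}
for a in (1.0, 2.0, 5.0):
    tail = sum(x >= a for x in samples) / n   # empirical P(X >= a)
    results[a] = (tail, mean / a)             # (tail, Markov bound)
```

For this distribution the true tail $e^{-a}$ sits well below $E[X]/a$, illustrating that Markov is tight only for the extremal two-point distribution described above.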

Chebyshev’s Inequality

Applying Markov to the squared deviation yields a bound in terms of the variance.

Theorem (Chebyshev). If $\text{Var}(X) < \infty$, then for any $k > 0$,

$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2},$$

where $\mu = E[X]$ and $\sigma^2 = \text{Var}(X)$.

Proof. Apply Markov’s inequality to the non-negative random variable $(X - \mu)^2$ with $a = (k\sigma)^2$:

$$P(|X - \mu| \geq k\sigma) = P\!\left((X-\mu)^2 \geq k^2\sigma^2\right) \leq \frac{E[(X-\mu)^2]}{k^2\sigma^2} = \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}. \qquad \square$$

Setting $k = \sqrt{n}\,t/\sigma$ and applying to $\bar{X}_n$, whose variance is $\sigma^2/n$, gives $P(|\bar{X}_n - \mu| \geq t) \leq \sigma^2/(nt^2)$, recovering a quantitative Law of Large Numbers.
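This quantitative LLN can be checked by simulation. A minimal sketch, with Bernoulli(1/2) samples chosen purely for illustration ($\mu = 1/2$, $\sigma^2 = 1/4$):

```python
import random

random.seed(1)

# Empirical check of P(|X̄_n - μ| >= t) <= σ²/(n t²)
# for Bernoulli(1/2) samples: μ = 1/2, σ² = 1/4.
n, t, trials = 100, 0.1, 20_000
mu, var = 0.5, 0.25

deviations = sum(
    abs(sum(random.random() < mu for _ in range(n)) / n - mu) >= t
    for _ in range(trials)
)
empirical = deviations / trials
chebyshev_bound = var / (n * t**2)   # = 0.25 for these parameters
```

The empirical deviation probability (roughly 0.06 here) is far below the Chebyshev bound of 0.25, a gap the exponential bounds below explain.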

Chernoff Bounds

The moment generating function (MGF) approach yields exponentially decaying bounds, far sharper than Chebyshev's polynomial decay when exponential moments exist.

Theorem (Chernoff, Binomial). Let $X \sim \text{Binomial}(n, p)$ with $\mu = np$. For any $\delta > 0$,

$$P(X \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3}.$$

Proof sketch. For any $t > 0$, Markov applied to $e^{tX}$ gives

$$P(X \geq (1+\delta)\mu) = P(e^{tX} \geq e^{t(1+\delta)\mu}) \leq \frac{M_X(t)}{e^{t(1+\delta)\mu}} = \frac{(1-p+pe^t)^n}{e^{t(1+\delta)\mu}}.$$

Using $1 + x \leq e^x$ and $p = \mu/n$:

$$\frac{M_X(t)}{e^{t(1+\delta)\mu}} \leq \frac{e^{np(e^t - 1)}}{e^{t(1+\delta)\mu}} = \exp\!\left(\mu(e^t - 1) - t(1+\delta)\mu\right).$$

Minimizing over $t > 0$ by setting $t = \log(1+\delta)$ yields the exponent $\mu(\delta - (1+\delta)\log(1+\delta))$. Using $\log(1+\delta) \geq \frac{2\delta}{2+\delta}$ for $\delta > 0$, this exponent is at most $-\mu\delta^2/(2+\delta) \leq -\mu\delta^2/3$ for $\delta \in (0,1]$. $\square$

An analogous argument applied to $e^{-tX}$ gives the lower tail: $P(X \leq (1-\delta)\mu) \leq e^{-\mu\delta^2/2}$ for $\delta \in (0,1)$.
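Since the binomial tail can be computed exactly for moderate $n$, the Chernoff bound is easy to compare against the truth. A sketch (parameters are illustrative):

```python
from math import ceil, comb, exp

# Exact binomial upper tail vs. the Chernoff bound exp(-μ δ²/3)
# for X ~ Binomial(n, p).
n, p, delta = 100, 0.5, 0.2
mu = n * p
k0 = ceil((1 + delta) * mu)   # smallest integer >= (1+δ)μ, here 60

# Exact P(X >= (1+δ)μ) by summing the binomial pmf.
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))
chernoff = exp(-mu * delta**2 / 3)
```

The exact tail (about 0.03 here) is well below the Chernoff value (about 0.51): the constant $1/3$ is convenient rather than sharp, but the exponential decay in $n$ is what matters.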

Hoeffding’s Inequality

Hoeffding’s inequality applies to sums of bounded random variables and is one of the most widely used results in learning theory.

Theorem (Hoeffding). Let $X_1, \ldots, X_n$ be independent with $a_i \leq X_i \leq b_i$ almost surely. Set $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ and $\mu = E[\bar{X}]$. Then for any $t > 0$,

$$P(\bar{X} - \mu \geq t) \leq \exp\!\left(-\frac{2n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$

In the special case $a_i = a$ and $b_i = b$ for all $i$:

$$P(\bar{X} - \mu \geq t) \leq \exp\!\left(-\frac{2nt^2}{(b-a)^2}\right).$$

The proof uses Hoeffding’s lemma: if $X \in [a,b]$ and $E[X] = 0$, then $E[e^{sX}] \leq e^{s^2(b-a)^2/8}$. Convexity of $e^{sx}$ gives $e^{sx} \leq \frac{b-x}{b-a}e^{sa} + \frac{x-a}{b-a}e^{sb}$; taking expectations and bounding the logarithm of the right-hand side via a second-order Taylor expansion yields the lemma.
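A quick empirical check of the equal-range form, using Uniform[0,1] samples (an arbitrary choice satisfying $a = 0$, $b = 1$, $\mu = 1/2$):

```python
import random
from math import exp

random.seed(2)

# Empirical check of P(X̄ - μ >= t) <= exp(-2 n t² / (b-a)²)
# for Uniform[0,1] samples, where (b - a)² = 1.
n, t, trials = 50, 0.1, 20_000
mu = 0.5

exceed = sum(
    sum(random.random() for _ in range(n)) / n - mu >= t
    for _ in range(trials)
) / trials

hoeffding_bound = exp(-2 * n * t**2)   # = exp(-1) ≈ 0.368
```

The empirical exceedance (under 1% here) respects the bound with room to spare; Hoeffding pays for generality by ignoring the actual variance, which motivates variance-aware refinements such as Bernstein's inequality.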

Sub-Gaussian Random Variables

Definition. A zero-mean random variable $X$ is sub-Gaussian with parameter $\sigma$ (written $X \sim \text{subG}(\sigma^2)$) if for all $t \in \mathbb{R}$,

$$E[e^{tX}] \leq e^{\sigma^2 t^2/2}.$$

An equivalent characterization, up to constant factors in the parameter: $X$ is sub-Gaussian iff $P(|X| \geq u) \leq 2e^{-u^2/(2\sigma^2)}$ for all $u \geq 0$.

Examples.

  • If $X \in [a, b]$ almost surely, then $X - E[X]$ is sub-Gaussian with parameter $(b-a)/2$ (Hoeffding’s lemma).
  • $X \sim \mathcal{N}(0, \sigma^2)$ is sub-Gaussian with parameter $\sigma$.
  • If $B \sim \text{Bernoulli}(p)$, then $B - p$ is sub-Gaussian with parameter $1/2$.

Sub-Gaussians are closed under independent sums: if $X_i \sim \text{subG}(\sigma_i^2)$ are independent, then $\sum X_i \sim \text{subG}(\sum \sigma_i^2)$.
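The symmetric Bernoulli case can be verified deterministically: for $X$ uniform on $\{-1/2, +1/2\}$ the MGF has the closed form $E[e^{tX}] = \cosh(t/2)$, so the subG$(1/4)$ condition reduces to a scalar inequality. A sketch checking it on a grid:

```python
from math import cosh, exp

# Check the sub-Gaussian MGF bound for X uniform on {-1/2, +1/2}:
# E[e^{tX}] = cosh(t/2), and subG(1/4) requires cosh(t/2) <= exp(t²/8).
grid = [x / 10 for x in range(-100, 101)]   # t ranging over [-10, 10]
mgf_ok = all(cosh(t / 2) <= exp(t**2 / 8) + 1e-12 for t in grid)
```

The inequality $\cosh(x) \leq e^{x^2/2}$ holds for all real $x$ (compare Taylor series term by term), so the grid check is just a numerical confirmation.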

McDiarmid’s Inequality

McDiarmid’s inequality is a powerful generalization to functions of multiple random variables, requiring only a bounded differences condition.

Theorem (McDiarmid). Let $X_1, \ldots, X_n$ be independent, and let $f: \mathcal{X}^n \to \mathbb{R}$ satisfy the bounded differences condition: for each $i$ and all values $x_1, \ldots, x_n, x_i'$,

$$|f(x_1,\ldots,x_i,\ldots,x_n) - f(x_1,\ldots,x_i',\ldots,x_n)| \leq c_i.$$

Then for any $t > 0$,

$$P(f(X_1,\ldots,X_n) - E[f] \geq t) \leq \exp\!\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right).$$

This recovers Hoeffding when $f(x_1,\ldots,x_n) = \frac{1}{n}\sum x_i$ and $c_i = (b_i - a_i)/n$.
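For a function that is not a plain average, consider the number of distinct values among $n$ independent draws from a finite set: changing one draw moves the count by at most 1, so $c_i = 1$. A simulation sketch (parameters are illustrative, and $E[f]$ is itself estimated by Monte Carlo):

```python
import random
from math import exp

random.seed(3)

# McDiarmid for f(x_1,...,x_n) = number of distinct values among
# n draws from {0,...,m-1}. Changing one coordinate changes f by at
# most 1, so c_i = 1 and P(f - E[f] >= t) <= exp(-2 t² / n).
n, m, t, trials = 50, 100, 4, 20_000

values = [len({random.randrange(m) for _ in range(n)}) for _ in range(trials)]
ef = sum(values) / trials                 # Monte Carlo estimate of E[f]
empirical = sum(v - ef >= t for v in values) / trials
mcdiarmid_bound = exp(-2 * t**2 / n)      # = exp(-0.64) ≈ 0.527
```

Note that $f$ here is not a sum of independent terms, which is exactly the setting where McDiarmid goes beyond Hoeffding.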

Union Bound

Lemma (Union Bound / Boole’s Inequality). For any countable collection of events $A_1, A_2, \ldots$,

$$P\!\left(\bigcup_i A_i\right) \leq \sum_i P(A_i).$$

Proof. Define $B_1 = A_1$ and $B_k = A_k \setminus \bigcup_{j<k} A_j$. The $B_k$ are disjoint, $\bigcup B_k = \bigcup A_k$, and $P(B_k) \leq P(A_k)$, so $P(\bigcup A_k) = \sum P(B_k) \leq \sum P(A_k)$. $\square$

The union bound is indispensable when controlling the probability that any one of many bad events occurs - a ubiquitous setting in algorithm analysis and learning theory.

Examples

Generalization Bounds in PAC Learning

In the PAC learning framework, a hypothesis class $\mathcal{H}$ with $|\mathcal{H}| < \infty$ is learned from $n$ i.i.d. samples. For any fixed $h \in \mathcal{H}$, the empirical risk $\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[h(x_i) \neq y_i]$ is an average of bounded $[0,1]$ random variables. By Hoeffding,

$$P(|R(h) - \hat{R}(h)| \geq \epsilon) \leq 2e^{-2n\epsilon^2}.$$

Applying the union bound over all $h \in \mathcal{H}$:

$$P\!\left(\sup_{h \in \mathcal{H}} |R(h) - \hat{R}(h)| \geq \epsilon\right) \leq 2|\mathcal{H}| \, e^{-2n\epsilon^2}.$$

Setting this equal to $\delta$ and solving for $n$:

$$n \geq \frac{1}{2\epsilon^2}\log\!\frac{2|\mathcal{H}|}{\delta}.$$

This is the fundamental sample complexity bound of PAC learning: $O(\log|\mathcal{H}|/\epsilon^2)$ samples suffice to learn any finite hypothesis class with accuracy $\epsilon$ and confidence $1-\delta$.
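The bound translates directly into a sample size calculator. A minimal sketch (the helper name and example parameters are ours, for illustration):

```python
from math import ceil, log

def pac_sample_size(num_hypotheses: int, eps: float, delta: float) -> int:
    """Smallest integer n satisfying n >= log(2|H|/δ) / (2 ε²)."""
    return ceil(log(2 * num_hypotheses / delta) / (2 * eps**2))

# E.g. |H| = 1000 hypotheses, accuracy ε = 0.05, confidence δ = 0.05.
n_needed = pac_sample_size(1000, 0.05, 0.05)
```

Because $|\mathcal{H}|$ enters only logarithmically, squaring the size of the hypothesis class merely doubles its contribution to the sample complexity.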

Confidence Intervals for Empirical Means

Suppose $X_1, \ldots, X_n \in [0,1]$ are i.i.d. observations. We want to estimate $\mu = E[X]$. By Hoeffding’s inequality with $t = \epsilon$,

$$P(|\bar{X}_n - \mu| \geq \epsilon) \leq 2e^{-2n\epsilon^2}.$$

Setting the right-hand side equal to $\delta$: $\epsilon = \sqrt{\log(2/\delta)/(2n)}$. Therefore, with probability at least $1 - \delta$,

$$|\bar{X}_n - \mu| \leq \sqrt{\frac{\log(2/\delta)}{2n}}.$$

This gives a non-asymptotic confidence interval that requires no distributional assumptions beyond boundedness. For $n = 1000$ and $\delta = 0.05$, the half-width is approximately $\sqrt{\log(40)/2000} \approx 0.043$, meaning the empirical mean is within $\pm 0.043$ of the true mean with 95% confidence.
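The interval is a one-line computation. A sketch (the helper name is illustrative):

```python
from math import log, sqrt

def hoeffding_halfwidth(n: int, delta: float) -> float:
    """Half-width of the Hoeffding interval for the mean of [0,1] samples:
    with probability >= 1 - δ, |X̄_n - μ| <= sqrt(log(2/δ) / (2n))."""
    return sqrt(log(2 / delta) / (2 * n))

width = hoeffding_halfwidth(1000, 0.05)   # ≈ 0.043
```

The $1/\sqrt{n}$ dependence means quadrupling the sample size halves the interval, while tightening $\delta$ costs only a logarithmic factor.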

