Concentration Inequalities
Concentration inequalities bound the probability that a random variable deviates from its mean. Unlike asymptotic results such as the CLT, these are finite-sample guarantees with explicit constants - exactly what is needed for rigorous analysis in machine learning and statistics.
Markov’s Inequality
The most elementary bound requires only a finite mean.
Theorem (Markov). If $X \geq 0$ and $E[X] < \infty$, then for any $a > 0$,
$$P(X \geq a) \leq \frac{E[X]}{a}.$$
Proof. Since $X \geq 0$,
$$E[X] = \int_0^\infty x \, dP \geq \int_a^\infty x \, dP \geq a \int_a^\infty dP = a \cdot P(X \geq a).$$
Dividing both sides by $a$ gives the result. $\square$
Markov’s inequality is tight: the bound is achieved when $X$ takes value $a$ with probability $E[X]/a$ and $0$ otherwise.
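As a quick sanity check, here is a minimal Monte Carlo sketch (the Exponential(1) test distribution and the sample size are illustrative choices, not from the text) comparing the empirical tail probability to the Markov bound $E[X]/a$:

```python
import random

# Monte Carlo check of Markov's inequality for X ~ Exponential(1), so E[X] = 1.
random.seed(0)
n_samples = 100_000
samples = [random.expovariate(1.0) for _ in range(n_samples)]

for a in [1, 2, 5, 10]:
    empirical = sum(x >= a for x in samples) / n_samples
    markov = 1.0 / a  # E[X] / a
    print(f"a={a:2d}  P(X >= a) ~ {empirical:.4f}   Markov bound: {markov:.4f}")
```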
Chebyshev’s Inequality
Applying Markov to the squared deviation yields a bound in terms of the variance.
Theorem (Chebyshev). If $\text{Var}(X) < \infty$, then for any $k > 0$,
$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2},$$
where $\mu = E[X]$ and $\sigma^2 = \text{Var}(X)$.
Proof. Apply Markov’s inequality to the non-negative random variable $(X - \mu)^2$ with $a = (k\sigma)^2$:
$$P(|X - \mu| \geq k\sigma) = P\left((X-\mu)^2 \geq k^2\sigma^2\right) \leq \frac{E[(X-\mu)^2]}{k^2\sigma^2} = \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}. \qquad \square$$
Setting $k = \sqrt{n}\,t/\sigma$ and applying to $\bar{X}_n$, whose variance is $\sigma^2/n$, gives $P(|\bar{X}_n - \mu| \geq t) \leq \sigma^2/(nt^2)$, recovering a quantitative Law of Large Numbers.
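This quantitative Law of Large Numbers can be checked numerically. The sketch below assumes Uniform[0,1] samples (so $\mu = 1/2$, $\sigma^2 = 1/12$; these choices are illustrative) and compares the empirical deviation probability to the Chebyshev bound $\sigma^2/(nt^2)$:

```python
import random

# Empirical check of P(|X_bar_n - mu| >= t) <= sigma^2 / (n t^2)
# for X_i ~ Uniform[0, 1]: mu = 1/2, sigma^2 = 1/12.
random.seed(0)
n, t, trials = 100, 0.05, 20_000
mu, var = 0.5, 1.0 / 12.0

deviations = 0
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n
    if abs(xbar - mu) >= t:
        deviations += 1

print(f"empirical  P(|X_bar - mu| >= {t}) ~ {deviations / trials:.4f}")
print(f"Chebyshev  bound                 = {var / (n * t * t):.4f}")
```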
Chernoff Bounds
The MGF-based approach yields exponentially sharp bounds, far tighter than Chebyshev when exponential moments exist.
Theorem (Chernoff, Binomial). Let $X \sim \text{Binomial}(n, p)$ with $\mu = np$. For any $\delta \in (0,1]$,
$$P(X \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3}.$$
Proof sketch. For any $t > 0$, Markov applied to $e^{tX}$ gives
$$P(X \geq (1+\delta)\mu) = P(e^{tX} \geq e^{t(1+\delta)\mu}) \leq \frac{M_X(t)}{e^{t(1+\delta)\mu}} = \frac{(1-p+pe^t)^n}{e^{t(1+\delta)\mu}}.$$
Using $1 + x \leq e^x$ and $p = \mu/n$:
$$\frac{M_X(t)}{e^{t(1+\delta)\mu}} \leq \frac{e^{np(e^t - 1)}}{e^{t(1+\delta)\mu}} = \exp\left(\mu(e^t - 1) - t(1+\delta)\mu\right).$$
Minimizing over $t > 0$ by setting $t = \log(1+\delta)$ yields the exponent $\mu(\delta - (1+\delta)\log(1+\delta))$. Using $\log(1+\delta) \geq \frac{2\delta}{2+\delta}$ for $\delta > 0$, this exponent is at most $-\mu\delta^2/(2+\delta)$, which is at most $-\mu\delta^2/3$ for $\delta \in (0,1]$. $\square$
An analogous argument with $t < 0$ gives the lower-tail Chernoff bound: $P(X \leq (1-\delta)\mu) \leq e^{-\mu\delta^2/2}$ for $\delta \in (0,1)$.
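To see how much sharper the exponential bound is, the following sketch (with illustrative parameters $n = 1000$, $p = 0.1$, $\delta = 0.5$) compares the exact binomial upper tail against the Chernoff bound $e^{-\mu\delta^2/3}$ and the corresponding Chebyshev bound:

```python
from math import comb, exp

# Exact binomial upper tail vs. Chernoff and Chebyshev bounds.
n, p, delta = 1000, 0.1, 0.5
mu = n * p                      # 100
threshold = (1 + delta) * mu    # 150
var = n * p * (1 - p)           # 90

exact = sum(comb(n, k) * p**k * (1 - p)**(n - k)
            for k in range(int(threshold), n + 1))
chernoff = exp(-mu * delta**2 / 3)
chebyshev = var / (delta * mu)**2   # P(|X - mu| >= delta*mu) <= Var/(delta*mu)^2

print(f"exact tail  P(X >= {threshold:.0f}) = {exact:.3e}")
print(f"Chernoff    bound           = {chernoff:.3e}")
print(f"Chebyshev   bound           = {chebyshev:.3e}")
```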
Hoeffding’s Inequality
Hoeffding’s inequality applies to sums of bounded random variables and is one of the most widely used results in learning theory.
Theorem (Hoeffding). Let $X_1, \ldots, X_n$ be independent with $a_i \leq X_i \leq b_i$ almost surely. Set $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ and $\mu = E[\bar{X}]$. Then for any $t > 0$,
$$P(\bar{X} - \mu \geq t) \leq \exp\left(-\frac{2n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$
In the special case $a_i = a$ and $b_i = b$ for all $i$:
$$P(\bar{X} - \mu \geq t) \leq \exp\left(-\frac{2nt^2}{(b-a)^2}\right).$$
The proof uses Hoeffding’s lemma: for $X \in [a,b]$ with $E[X] = 0$, $E[e^{sX}] \leq e^{s^2(b-a)^2/8}$. This is established by convexity of $e^{sx}$, which gives $e^{sx} \leq \frac{b-x}{b-a}e^{sa} + \frac{x-a}{b-a}e^{sb}$; taking expectations and bounding the logarithm of the resulting expression via a second-order Taylor expansion yields the claim.
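Hoeffding’s lemma can be verified directly in a simple case. The sketch below takes $X = B - p$ with $B \sim \text{Bernoulli}(p)$ (so $X \in [-p, 1-p]$ and $b - a = 1$) and checks $E[e^{sX}] \leq e^{s^2/8}$ at a few values of $s$; the choice of $p$ and the grid of $s$ values are illustrative:

```python
from math import exp

# Check Hoeffding's lemma E[exp(s X)] <= exp(s^2 (b-a)^2 / 8)
# for X = B - p with B ~ Bernoulli(p): X takes values -p and 1-p, so b - a = 1.
p = 0.3
for s in [-4.0, -1.0, 0.5, 2.0, 6.0]:
    mgf = (1 - p) * exp(-s * p) + p * exp(s * (1 - p))  # exact E[exp(sX)]
    bound = exp(s * s / 8.0)
    print(f"s={s:5.1f}  E[e^(sX)] = {mgf:10.4f}  <=  bound {bound:10.4f}")
```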
Sub-Gaussian Random Variables
Definition. A zero-mean random variable $X$ is sub-Gaussian with parameter $\sigma$ (written $X \sim \text{subG}(\sigma^2)$) if for all $t \in \mathbb{R}$,
$$E[e^{tX}] \leq e^{\sigma^2 t^2/2}.$$
Up to constants in $\sigma$, this is equivalent to the tail condition $P(|X| \geq u) \leq 2e^{-u^2/(2\sigma^2)}$ for all $u \geq 0$.
Examples.
- For any bounded $X \in [a, b]$, the centered variable $X - E[X]$ is sub-Gaussian with parameter $(b-a)/2$ (Hoeffding’s lemma).
- $X \sim \mathcal{N}(0, \sigma^2)$ is sub-Gaussian with parameter $\sigma$.
- If $B \sim \text{Bernoulli}(p)$, then $B - p$ is sub-Gaussian with parameter $1/2$.
Sub-Gaussians are closed under independent sums: if $X_i \sim \text{subG}(\sigma_i^2)$ are independent, then $\sum X_i \sim \text{subG}(\sum \sigma_i^2)$.
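A small Monte Carlo sketch of the closure property, using Rademacher ($\pm 1$) summands as an assumed example: each is $\text{subG}(1)$, so the sum of $n$ of them is $\text{subG}(n)$ and satisfies $P(|S_n| \geq u) \leq 2e^{-u^2/(2n)}$:

```python
import random
from math import exp, sqrt

# Sum of n independent Rademacher variables is subG(n), so
# P(|S_n| >= u) <= 2 exp(-u^2 / (2n)).
random.seed(0)
n, trials = 100, 50_000
u = 2.0 * sqrt(n)   # deviation of two "standard deviations"

exceed = 0
for _ in range(trials):
    s = sum(random.choice((-1, 1)) for _ in range(n))
    if abs(s) >= u:
        exceed += 1

print(f"empirical    P(|S_n| >= {u:.0f}) ~ {exceed / trials:.4f}")
print(f"sub-Gaussian bound           = {2 * exp(-u * u / (2 * n)):.4f}")
```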
McDiarmid’s Inequality
McDiarmid’s inequality is a powerful generalization to functions of multiple random variables, requiring only a bounded differences condition.
Theorem (McDiarmid). Let $X_1, \ldots, X_n$ be independent, and let $f: \mathcal{X}^n \to \mathbb{R}$ satisfy the bounded differences condition: for each $i$ and all values $x_1, \ldots, x_n, x_i'$,
$$|f(x_1,\ldots,x_i,\ldots,x_n) - f(x_1,\ldots,x_i',\ldots,x_n)| \leq c_i.$$
Then for any $t > 0$,
$$P(f(X_1,\ldots,X_n) - E[f] \geq t) \leq \exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right).$$
This recovers Hoeffding when $f(x_1,\ldots,x_n) = \frac{1}{n}\sum x_i$ and $c_i = (b_i - a_i)/n$.
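As a concrete (and purely illustrative) instance of the bounded differences condition, let $f$ count the number of distinct values among $n$ i.i.d. uniform draws from $\{1, \ldots, m\}$; changing any single coordinate changes the count by at most 1, so $c_i = 1$ and McDiarmid gives the bound $e^{-2t^2/n}$. A Monte Carlo check:

```python
import random
from math import exp

# f(x_1, ..., x_n) = number of distinct values among n draws from {1, ..., m}.
# Changing one coordinate changes f by at most 1, so c_i = 1 for every i.
random.seed(0)
n, m, trials, t = 50, 100, 20_000, 2.0

def f(xs):
    return len(set(xs))

values = [f([random.randint(1, m) for _ in range(n)]) for _ in range(trials)]
mean_f = sum(values) / trials   # Monte Carlo estimate of E[f]
empirical = sum(v - mean_f >= t for v in values) / trials

print(f"empirical  P(f - E[f] >= {t}) ~ {empirical:.4f}")
print(f"McDiarmid  bound             = {exp(-2 * t * t / n):.4f}")
```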
Union Bound
Lemma (Union Bound / Boole’s Inequality). For any countable collection of events $A_1, A_2, \ldots$,
$$P\left(\bigcup_i A_i\right) \leq \sum_i P(A_i).$$
Proof. Define $B_1 = A_1$ and $B_k = A_k \setminus \bigcup_{j<k} A_j$. The $B_k$ are disjoint, $\bigcup B_k = \bigcup A_k$, and $P(B_k) \leq P(A_k)$, so $P(\bigcup A_k) = \sum P(B_k) \leq \sum P(A_k)$. $\square$
The union bound is indispensable when controlling the probability that any one of many bad events occurs - a ubiquitous setting in algorithm analysis and learning theory.
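A brief numerical illustration with deliberately overlapping events (a single standard normal exceeding several thresholds, an assumed example): the union bound charges the sum of the individual probabilities, which can be loose when the events are strongly correlated but is always valid:

```python
import random

# Union bound sanity check: overlapping events A_i = {Z >= c_i} for one
# standard normal Z.  Here P(union A_i) = P(Z >= min c_i), while the union
# bound charges the sum of the individual probabilities.
random.seed(0)
thresholds = [1.0, 1.2, 1.5, 2.0]
trials = 200_000

hits_union = 0
hits_each = [0] * len(thresholds)
for _ in range(trials):
    z = random.gauss(0.0, 1.0)
    if any(z >= c for c in thresholds):
        hits_union += 1
    for i, c in enumerate(thresholds):
        hits_each[i] += z >= c

p_union = hits_union / trials
p_sum = sum(h / trials for h in hits_each)
print(f"P(union) ~ {p_union:.4f}   union bound (sum of P(A_i)) ~ {p_sum:.4f}")
```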
Examples
Generalization Bounds in PAC Learning
In the PAC learning framework, a hypothesis class $\mathcal{H}$ with $|\mathcal{H}| < \infty$ is learned from $n$ i.i.d. samples. For any fixed $h \in \mathcal{H}$, the empirical risk $\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[h(x_i) \neq y_i]$ is an average of bounded $[0,1]$ random variables. By Hoeffding,
$$P(|R(h) - \hat{R}(h)| \geq \epsilon) \leq 2e^{-2n\epsilon^2}.$$
Applying the union bound over all $h \in \mathcal{H}$:
$$P\left(\sup_{h \in \mathcal{H}} |R(h) - \hat{R}(h)| \geq \epsilon\right) \leq 2|\mathcal{H}|\, e^{-2n\epsilon^2}.$$
Setting this equal to $\delta$ and solving for $n$:
$$n \geq \frac{1}{2\epsilon^2}\log\frac{2|\mathcal{H}|}{\delta}.$$
This is the fundamental sample complexity bound of PAC learning: $O(\log|\mathcal{H}|/\epsilon^2)$ samples suffice to learn any finite hypothesis class with accuracy $\epsilon$ and confidence $1-\delta$.
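The sample complexity formula is easy to turn into a small calculator; the function name and the example values of $|\mathcal{H}|$, $\epsilon$, $\delta$ below are illustrative:

```python
from math import ceil, log

# Sample complexity for a finite hypothesis class via Hoeffding + union bound:
# n >= log(2|H|/delta) / (2 eps^2) guarantees sup_h |R(h) - R_hat(h)| < eps
# with probability at least 1 - delta.
def pac_sample_size(num_hypotheses: int, eps: float, delta: float) -> int:
    return ceil(log(2 * num_hypotheses / delta) / (2 * eps**2))

# Illustrative values: |H| = 10^6, eps = 0.05, delta = 0.05.
print(pac_sample_size(10**6, 0.05, 0.05))   # prints 3501
```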
Confidence Intervals for Empirical Means
Suppose $X_1, \ldots, X_n \in [0,1]$ are i.i.d. observations. We want to estimate $\mu = E[X]$. By Hoeffding’s inequality with $t = \epsilon$,
$$P(|\bar{X}_n - \mu| \geq \epsilon) \leq 2e^{-2n\epsilon^2}.$$
Setting the right-hand side equal to $\delta$: $\epsilon = \sqrt{\log(2/\delta)/(2n)}$. Therefore, with probability at least $1 - \delta$,
$$|\bar{X}_n - \mu| \leq \sqrt{\frac{\log(2/\delta)}{2n}}.$$
This gives a non-asymptotic confidence interval that requires no distributional assumptions beyond boundedness. For $n = 1000$ and $\delta = 0.05$, the half-width is $\sqrt{\log(40)/2000} \approx 0.043$, so the empirical mean is within $\pm 0.043$ of the true mean with 95% confidence.
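The arithmetic in this example can be reproduced directly; the helper name below is a hypothetical choice, not from the text:

```python
from math import log, sqrt

# Hoeffding half-width of a (1 - delta) confidence interval for the mean of
# i.i.d. [0, 1]-valued observations.
def hoeffding_halfwidth(n: int, delta: float) -> float:
    return sqrt(log(2 / delta) / (2 * n))

print(hoeffding_halfwidth(1000, 0.05))   # ~0.043, matching the example above
```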