High-Dimensional Geometry
High-dimensional geometry defies the intuition built from two and three dimensions. As the ambient dimension $d$ grows, space becomes vast and sparse in ways that fundamentally alter the behaviour of algorithms, probability, and data. Understanding these phenomena is essential to modern machine learning, compressed sensing, and randomised algorithms.
The Curse of Dimensionality: Volume Concentration
Let $B_d(r) = \{x \in \mathbb{R}^d : |x| \leq r\}$ denote the $d$-dimensional ball of radius $r$. Its volume is
$$\text{Vol}(B_d(r)) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}\, r^d.$$
A striking consequence of this formula is that the fraction of volume lying within an $\epsilon$-shell near the surface is
$$\frac{\text{Vol}(B_d(r)) - \text{Vol}(B_d(r(1-\epsilon)))}{\text{Vol}(B_d(r))} = 1 - (1-\epsilon)^d \xrightarrow{d\to\infty} 1$$
for any fixed $\epsilon > 0$. Practically all of the volume of a high-dimensional ball is concentrated in a thin shell near the boundary.
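A two-line check of this collapse in plain Python (the shell width $\epsilon = 0.1$ is an arbitrary illustrative choice):

```python
# Fraction of the ball's volume within an eps-shell of the boundary: 1 - (1 - eps)^d.
eps = 0.1
for d in (2, 10, 100, 1000):
    print(f"d = {d:4d}: shell fraction = {1 - (1 - eps) ** d:.6f}")
```

Already at $d = 100$ the shell holds all but roughly $3 \times 10^{-5}$ of the volume.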
d = 2:            d = 10:           d = 100:

  ****              ****              ****
 *####*            *#  #*            *    *
 *####*            *#  #*            *    *
  ****              ****              ****

legend: * = shell, # = interior
Interior fraction: 0.75, 0.35, 0.0000...6
The shell thickness needed to capture half the volume shrinks as $O(1/d)$, so data points drawn uniformly from $B_d(1)$ are effectively confined to a thin annulus. This makes nearest-neighbour search increasingly unreliable as $d$ grows.
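To see the $O(1/d)$ rate explicitly, set the shell fraction equal to one half:
$$1 - (1-\epsilon)^d = \frac{1}{2} \quad\Longleftrightarrow\quad \epsilon = 1 - 2^{-1/d} \approx \frac{\ln 2}{d}.$$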
Formal Statement
Theorem (Volume Concentration). For any $\epsilon \in (0,1)$,
$$\Pr_{x \sim \text{Unif}(B_d(1))}\left[\,|x| \geq 1 - \epsilon\,\right] = 1 - (1-\epsilon)^d.$$
As $d \to \infty$ this probability converges to $1$ for every fixed $\epsilon > 0$.
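A Monte Carlo check of this probability; a sketch assuming NumPy, using the standard uniform-in-ball sampler (Gaussian direction, radius $U^{1/d}$):

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_in_ball(n, d, rng):
    # Uniform points in the unit ball: Gaussian direction, radius U^(1/d).
    g = rng.standard_normal((n, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.random(n) ** (1.0 / d)
    return directions * radii[:, None]

eps, n = 0.05, 100_000
for d in (2, 10, 100):
    x = uniform_in_ball(n, d, rng)
    frac = np.mean(np.linalg.norm(x, axis=1) >= 1 - eps)
    print(f"d = {d:3d}: empirical {frac:.4f}   exact {1 - (1 - eps) ** d:.4f}")
```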
Random Vectors in High Dimensions
Concentration of the Norm
Let $x = (x_1,\dots,x_d)$ with $x_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$. Then $|x|^2 = \sum_{i=1}^d x_i^2 \sim \chi^2(d)$, which has mean $d$ and variance $2d$. By the law of large numbers,
$$\frac{|x|^2}{d} \xrightarrow{d\to\infty} 1 \quad \text{in probability},$$
so $|x| \approx \sqrt{d}$ with fluctuations of order $1$. More precisely, a Gaussian concentration argument gives
$$\Pr\left[\,\bigl|\,|x| - \sqrt{d}\,\bigr| \geq t\right] \leq 2\exp\left(-\frac{t^2}{2}\right).$$
A standard Gaussian vector in $\mathbb{R}^d$ has norm tightly concentrated near $\sqrt{d}$, not near $0$ or $\infty$.
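A quick empirical confirmation (a sketch assuming NumPy; sample sizes are illustrative): the mean norm tracks $\sqrt{d}$ while the standard deviation stays $O(1)$, in fact close to $1/\sqrt{2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
for d in (10, 100, 10_000):
    x = rng.standard_normal((2_000, d))
    norms = np.linalg.norm(x, axis=1)
    print(f"d = {d:6d}: mean |x| = {norms.mean():7.2f}  "
          f"(sqrt(d) = {np.sqrt(d):7.2f}),  std = {norms.std():.3f}")
```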
Near-Orthogonality of Random Unit Vectors
For two independent standard Gaussian vectors $x, y \in \mathbb{R}^d$, the inner product of the normalised vectors satisfies
$$\left\langle \frac{x}{|x|}, \frac{y}{|y|}\right\rangle \approx \mathcal{N}\left(0, \frac{1}{d}\right).$$
The inner product concentrates near $0$ with standard deviation $1/\sqrt{d}$. A random pair of unit vectors is nearly orthogonal. This means one can fit exponentially many ($2^{\Omega(d)}$) nearly-orthogonal vectors in $\mathbb{R}^d$, which is the geometric engine behind many high-dimensional constructions.
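The $1/\sqrt{d}$ scale is easy to observe numerically (a sketch assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
for d in (10, 100, 10_000):
    x = rng.standard_normal((2_000, d))
    y = rng.standard_normal((2_000, d))
    # Inner products of the normalised vectors: concentrate near 0 at scale 1/sqrt(d).
    cos = (x * y).sum(axis=1) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
    print(f"d = {d:6d}: std = {cos.std():.4f},  1/sqrt(d) = {1 / np.sqrt(d):.4f}")
```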
The Johnson-Lindenstrauss Lemma
Theorem (Johnson-Lindenstrauss, 1984). Let $\varepsilon \in (0,1)$ and let $P$ be any set of $n$ points in $\mathbb{R}^D$. There exists a map $f: \mathbb{R}^D \to \mathbb{R}^m$ with
$$m = O\left(\frac{\log n}{\varepsilon^2}\right)$$
such that for all $u, v \in P$,
$$(1-\varepsilon)|u-v|^2 \leq |f(u)-f(v)|^2 \leq (1+\varepsilon)|u-v|^2.$$
Proof Sketch via Gaussian Projections
Let $A \in \mathbb{R}^{m \times D}$ have i.i.d. $\mathcal{N}(0,1)$ entries and define $f(x) = \frac{1}{\sqrt{m}}Ax$. For a fixed vector $v \in \mathbb{R}^D$,
$$\mathbb{E}\left[|Av|^2\right] = m|v|^2.$$
Each coordinate of $Av$ is $\mathcal{N}(0, |v|^2)$, and $|Av|^2/|v|^2 \sim \chi^2(m)$. A Chernoff-type bound for $\chi^2$ random variables gives
$$\Pr\left[\,\left|\frac{|Av|^2}{m|v|^2} - 1\right| \geq \varepsilon\right] \leq 2e^{-(\varepsilon^2 - \varepsilon^3)m/4}.$$
Setting $m = C\log n / \varepsilon^2$ makes this at most $2/n^2$. A union bound over the $\binom{n}{2} < n^2$ pairs of points completes the argument.
Original space R^D                        Projected space R^m

  .   .  .                                     . .
 .  .  .  .       ---[A/sqrt(m)]-->           .  .
   .  .  .                                     . .

              distances preserved up to (1±ε)
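A minimal sketch of the Gaussian projection in NumPy/SciPy; the constant 8 in the choice of $m$ is illustrative, not optimal:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
n, D, eps = 100, 10_000, 0.25
m = int(8 * np.log(n) / eps**2)        # m = O(log n / eps^2); constant 8 is illustrative

P = rng.standard_normal((n, D))        # n arbitrary points in R^D
A = rng.standard_normal((m, D))
Q = P @ A.T / np.sqrt(m)               # f(x) = Ax / sqrt(m)

# Ratio of squared pairwise distances after vs before projection.
ratios = pdist(Q, "sqeuclidean") / pdist(P, "sqeuclidean")
print(f"m = {m}: distortion ratios in [{ratios.min():.3f}, {ratios.max():.3f}]")
```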
The JL lemma has far-reaching consequences: it justifies random projection as a preprocessing step in nearest-neighbour search, SVMs, and clustering, reducing both storage and computation while preserving geometric structure.
Concentration of Measure
Definition. A function $f: \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz (with respect to the Euclidean norm) if $|f(x)-f(y)| \leq L|x-y|$ for all $x,y$.
Theorem (Gaussian Concentration). If $x \sim \mathcal{N}(0, I_d)$ and $f$ is $L$-Lipschitz, then
$$\Pr\left[f(x) \geq \mathbb{E}[f(x)] + t\right] \leq e^{-t^2/(2L^2)}.$$
This is a profound statement: in high dimensions, a Lipschitz function of a Gaussian vector is essentially constant, deviating from its mean by more than $t$ with probability exponentially small in $t^2$.
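For instance, $f(x) = \max_i x_i$ is 1-Lipschitz (in $\ell_\infty$, hence in $\ell_2$), so its fluctuations stay $O(1)$ even as its mean grows like $\sqrt{2\ln d}$; a sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(4)
# f(x) = max_i x_i is 1-Lipschitz: |max(x) - max(y)| <= |x - y|.
for d in (10, 1_000, 10_000):
    samples = rng.standard_normal((2_000, d)).max(axis=1)
    print(f"d = {d:6d}: E[max] ~ {samples.mean():.3f}  "
          f"(sqrt(2 ln d) = {np.sqrt(2 * np.log(d)):.3f}),  std = {samples.std():.3f}")
```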
Theorem (McDiarmid’s Inequality). Let $f(x_1,\dots,x_n)$ satisfy: changing any single $x_i$ while fixing the others changes $f$ by at most $c_i$. For independent $x_1,\dots,x_n$,
$$\Pr\left[f - \mathbb{E}[f] \geq t\right] \leq \exp\left(-\frac{2t^2}{\sum_i c_i^2}\right).$$
McDiarmid’s inequality is the bounded-differences analogue of Gaussian concentration and is ubiquitous in learning theory.
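For example, the empirical mean $f(x_1,\dots,x_n) = \frac{1}{n}\sum_{i=1}^n x_i$ of independent $[0,1]$-valued variables has bounded differences $c_i = 1/n$, so McDiarmid gives
$$\Pr\left[f - \mathbb{E}[f] \geq t\right] \leq e^{-2nt^2},$$
recovering Hoeffding's inequality.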
Compressed Sensing and the Restricted Isometry Property
The Restricted Isometry Property
Definition (RIP). A matrix $A \in \mathbb{R}^{m \times n}$ satisfies the Restricted Isometry Property of order $k$ with constant $\delta_k \in (0,1)$ if
$$(1-\delta_k)|x|^2 \leq |Ax|^2 \leq (1+\delta_k)|x|^2$$
for all $k$-sparse vectors $x$ (vectors with at most $k$ non-zero entries).
Theorem. Let $A \in \mathbb{R}^{m \times n}$ have i.i.d. $\mathcal{N}(0,1/m)$ entries. If
$$m \geq C\, k\log\frac{n}{k}$$
for a sufficiently large absolute constant $C$, then $A$ satisfies RIP of order $k$ with constant $\delta_k < 1/4$ with probability at least $1 - 2e^{-cm}$.
The proof uses a covering-number argument: there are at most $\binom{n}{k} \leq (en/k)^k$ supports of size $k$; on each support, one covers the set of unit-norm vectors with an $\varepsilon$-net, applies Gaussian concentration at every net point, and finishes with a union bound.
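Verifying RIP exactly is computationally intractable, since it quantifies over all $\binom{n}{k}$ supports, but a random probe illustrates the near-isometry on sparse vectors; a sketch assuming NumPy, with illustrative sizes and constant:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 200, 5
m = int(4 * k * np.log(n / k))                 # m = O(k log(n/k)); constant 4 illustrative
A = rng.standard_normal((m, n)) / np.sqrt(m)   # i.i.d. N(0, 1/m) entries

# Probe |Ax|^2 / |x|^2 on random k-sparse vectors (a probe, not a certificate).
ratios = []
for _ in range(10_000):
    x = np.zeros(n)
    x[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
    ratios.append(np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2)
ratios = np.array(ratios)
print(f"m = {m}: |Ax|^2/|x|^2 in [{ratios.min():.3f}, {ratios.max():.3f}]")
```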
Compressed Sensing Recovery
Theorem (Candès-Tao, 2006). Suppose $x \in \mathbb{R}^n$ is $k$-sparse and $A \in \mathbb{R}^{m \times n}$ satisfies RIP with $\delta_{2k} < \sqrt{2}-1$. Then $x$ is the unique solution to
$$\min_{z \in \mathbb{R}^n} |z|_1 \quad \text{subject to} \quad Az = Ax.$$
n-dimensional signal x             m-dimensional measurements         Recovery

[0, 3, 0, 0, 7, 0, 0, 2]  ---[A: m<<n]-->  [y_1, ..., y_m]  --[L1 min]-->  x

      k = 3 nonzeros               m = O(k log(n/k))                   exact!
This is the theoretical foundation of compressed sensing: rather than measuring $n$ quantities, only $m \ll n$ random linear measurements suffice to recover a sparse signal exactly, by solving a convex programme.
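A minimal end-to-end sketch, assuming SciPy: the $\ell_1$ minimisation is cast as a linear programme via the standard split $z = u - v$ with $u, v \geq 0$; the sizes are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
n, k, m = 100, 3, 30                           # m << n measurements

# k-sparse ground truth and Gaussian measurement matrix.
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k) * 5
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x

# Basis pursuit: min |z|_1 s.t. Az = y, as an LP over (u, v) with z = u - v.
c = np.ones(2 * n)                             # objective: sum(u) + sum(v) = |z|_1
A_eq = np.hstack([A, -A])                      # A(u - v) = y
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n))
z = res.x[:n] - res.x[n:]
print(f"recovery error |z - x| = {np.linalg.norm(z - x):.2e}")
```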
Covering Numbers and Epsilon-Nets
Definition. An $\varepsilon$-net of a set $K \subset \mathbb{R}^d$ is a finite set $\mathcal{N} \subseteq K$ such that every point of $K$ is within distance $\varepsilon$ of some point in $\mathcal{N}$. The covering number $\mathcal{N}(K, |\cdot|, \varepsilon)$ is the minimum size of an $\varepsilon$-net.
For the $d$-dimensional unit ball,
$$\left(\frac{1}{\varepsilon}\right)^d \leq \mathcal{N}(B_d(1), |\cdot|_2, \varepsilon) \leq \left(\frac{2}{\varepsilon}+1\right)^d.$$
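Both bounds come from volume comparisons. Any $\varepsilon$-net of $N$ points covers $B_d(1)$ by $\varepsilon$-balls of total volume $N\varepsilon^d\,\text{Vol}(B_d(1))$, forcing $N \geq (1/\varepsilon)^d$. Conversely, a maximal $\varepsilon$-separated subset is automatically an $\varepsilon$-net, and the disjoint $\varepsilon/2$-balls around its points fit inside $B_d(1+\varepsilon/2)$, giving $N(\varepsilon/2)^d \leq (1+\varepsilon/2)^d$, i.e. $N \leq (2/\varepsilon + 1)^d$.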
Covering numbers quantify the complexity of a set and appear throughout: in the RIP proof, in learning theory (Rademacher complexity), and in approximation theory.
High-Dimensional Convex Bodies
Theorem (Brunn-Minkowski). For non-empty compact sets $A, B \subseteq \mathbb{R}^d$ and the Minkowski sum $A + B = \{a+b : a \in A, b \in B\}$,
$$\text{Vol}(A+B)^{1/d} \geq \text{Vol}(A)^{1/d} + \text{Vol}(B)^{1/d}.$$
Brunn-Minkowski is one of the deepest results in convex geometry. It implies the AM-GM inequality, the isoperimetric inequality (balls minimise surface area for fixed volume), and underlies the log-concave sampling theory used in modern MCMC methods.
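To see the connection to AM-GM: for axis-aligned boxes $A = \prod_i [0, a_i]$ and $B = \prod_i [0, b_i]$, the Minkowski sum is $\prod_i [0, a_i + b_i]$, and Brunn-Minkowski reads
$$\prod_{i=1}^d (a_i + b_i)^{1/d} \geq \prod_{i=1}^d a_i^{1/d} + \prod_{i=1}^d b_i^{1/d},$$
which is exactly the superadditivity of the geometric mean, a form of AM-GM.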
The High-Dimensional Perspective in Machine Learning
The curse of dimensionality is real: in $d$ dimensions, nearest-neighbour search needs a number of samples exponential in $d$ before query points have genuinely close neighbours. But high dimensionality is also a blessing:
Expressivity. Neural networks in high dimensions can represent functions of exponential complexity relative to their parameter count, because the input space is so rich.
Random projections in ML. By JL, one can reduce feature dimensions from $D$ to $O(\log n / \varepsilon^2)$ before training, speeding up kernel methods and reducing memory without sacrificing accuracy.
Dimension reduction. PCA finds the low-dimensional subspace capturing the most variance. But in high dimensions, the sample covariance has eigenvalues inflated by noise - a phenomenon explained by random matrix theory. The Marchenko-Pastur law gives the noise floor above which eigenvalues are meaningful (see the simulation after this list).
Compressed sensing in imaging. MRI machines acquire fewer measurements than the Nyquist theorem requires by exploiting sparsity in a wavelet basis, with RIP guaranteeing recovery. The same principle underlies single-pixel cameras and genomic data compression.
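A small simulation of the pure-noise case (a sketch assuming NumPy): even when the true covariance is the identity, sample eigenvalues spread across the Marchenko-Pastur bulk, up to the edge $(1+\sqrt{\gamma})^2$ with $\gamma = d/n$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 1_000, 500                          # gamma = d/n = 0.5
X = rng.standard_normal((n, d))            # pure-noise data: true covariance = I
S = X.T @ X / n                            # sample covariance
eigs = np.linalg.eigvalsh(S)

gamma = d / n
mp_edge = (1 + np.sqrt(gamma)) ** 2        # Marchenko-Pastur upper edge
print(f"largest sample eigenvalue: {eigs.max():.3f}")
print(f"MP noise ceiling (1 + sqrt(gamma))^2 = {mp_edge:.3f}")
# All population eigenvalues equal 1, yet sample eigenvalues reach ~2.9:
# eigenvalues below the MP edge are indistinguishable from noise.
```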
High-dimensional geometry teaches a discipline of thinking carefully about where probability mass actually lives, how functions concentrate, and what geometric structure survives dimensionality reduction - lessons that are indispensable throughout modern computational mathematics.