VC Dimension
When does a learning algorithm generalise? The Vapnik-Chervonenkis (VC) dimension gives the sharpest combinatorial answer: generalisation is possible if and only if the hypothesis class has finite VC dimension. This result - the Fundamental Theorem of PAC Learning - connects combinatorics, probability, and learning theory in one clean statement.
The PAC Learning Framework
Definition. A concept class $\mathcal{C}$ over instance space $\mathcal{X}$ is PAC learnable if there exists an algorithm $\mathcal{A}$ and a polynomial $m(\cdot, \cdot)$ such that for all $\epsilon, \delta \in (0,1)$, all target concepts $c \in \mathcal{C}$, and all distributions $\mathcal{D}$ over $\mathcal{X}$: given $m \geq m(1/\epsilon, 1/\delta)$ i.i.d. samples from $\mathcal{D}$ labelled by $c$, with probability at least $1 - \delta$ the algorithm outputs a hypothesis $h$ with generalisation error $\text{err}(h) = P_{x \sim \mathcal{D}}[h(x) \neq c(x)] \leq \epsilon$.
The key quantity $m(1/\epsilon, 1/\delta)$ is the sample complexity: the number of examples sufficient to learn any target to accuracy $\epsilon$ with confidence $1-\delta$.
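To make the definition concrete, here is a minimal simulation sketch of the PAC game in a toy setting (a hypothetical target: a 1-D threshold concept under the uniform distribution on $[0,1]$; all names and constants are illustrative). A consistent learner that places its threshold at the smallest positive example has error equal to the gap between its threshold and the target's, so we can measure how often $m$ samples suffice for a given $\epsilon$:

```python
import numpy as np

def pac_trial(m, target=0.3, rng=None):
    """One PAC trial: learn the threshold concept c(x) = 1[x >= target] from m samples."""
    rng = rng or np.random.default_rng()
    x = rng.uniform(0.0, 1.0, m)              # D = Uniform[0, 1]
    y = (x >= target).astype(int)             # labels from the target concept
    # Consistent learner: put the learned threshold at the smallest positive example.
    h = x[y == 1].min() if y.any() else 1.0
    # For thresholds under the uniform distribution, err(h) is the gap |h - target|.
    return abs(h - target)

errs = np.array([pac_trial(m=200) for _ in range(1000)])
print((errs <= 0.05).mean())                  # fraction of trials achieving err <= 0.05
```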
Shattering and VC Dimension
Definition. A hypothesis class $\mathcal{H}$ shatters a finite set $C = \{x_1, \ldots, x_k\} \subseteq \mathcal{X}$ if for every binary labelling $b: C \to \{0, 1\}$ there exists $h \in \mathcal{H}$ consistent with $b$:
$$\forall b \in \{0,1\}^C \;\; \exists h \in \mathcal{H} \text{ s.t. } h(x_i) = b(x_i) \;\forall i$$
Shattering means $\mathcal{H}$ can represent every possible dichotomy of $C$ - no matter how the points are labelled, $\mathcal{H}$ has a function consistent with it.
Definition. The VC dimension $\text{VCdim}(\mathcal{H})$ is the size of the largest set shattered by $\mathcal{H}$:
$$\text{VCdim}(\mathcal{H}) = \max\{|C| : \mathcal{H} \text{ shatters } C\}$$
If $\mathcal{H}$ shatters arbitrarily large sets, $\text{VCdim}(\mathcal{H}) = \infty$.
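These definitions can be checked mechanically on small cases. The brute-force sketch below (1-D threshold hypotheses $h_t(x) = \mathbf{1}[x \geq t]$, a purely illustrative class) enumerates the labellings a hypothesis set realises; it confirms that thresholds shatter any single point but no pair, so their VC dimension is 1:

```python
def shatters(hypotheses, points):
    """True if the hypothesis set realises every binary labelling of `points`."""
    realised = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realised) == 2 ** len(points)

# 1-D thresholds h_t(x) = 1[x >= t]; a small grid of thresholds suffices here.
thresholds = [lambda x, t=t: int(x >= t) for t in (-1.0, 0.5, 1.5, 3.0)]

print(shatters(thresholds, [1.0]))        # True:  a single point is shattered
print(shatters(thresholds, [1.0, 2.0]))   # False: the labelling (1, 0) is never realised
```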
VC Dimension of Halfspaces in $\mathbb{R}^d$
Theorem. The class of halfspaces $\mathcal{H} = \{x \mapsto \mathbf{1}[w \cdot x \geq \theta] : w \in \mathbb{R}^d, \theta \in \mathbb{R}\}$ has $\text{VCdim}(\mathcal{H}) = d + 1$.
Proof (lower bound: $d+1$ points are shattered). Take the $d+1$ points $x_0 = 0$ and $x_i = e_i$ (the $i$-th standard basis vector). For any labelling $b \in \{0,1\}^{d+1}$, construct $w_i = 2b(x_i) - 1 \in \{-1,+1\}$ for $i \geq 1$ and set $\theta = \frac{1}{2}(1 - b(x_0))$. A careful check shows the halfspace with these parameters classifies each $x_i$ according to $b$. Hence $\mathcal{H}$ shatters these $d+1$ points.
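The "careful check" can also be run numerically. The sketch below (with $d = 3$, purely illustrative) enumerates every labelling of $0, e_1, \ldots, e_d$ and verifies that the stated $w$ and $\theta$ realise it:

```python
from itertools import product
import numpy as np

d = 3
points = [np.zeros(d)] + [np.eye(d)[i] for i in range(d)]    # x_0 = 0, x_i = e_i

for b in product([0, 1], repeat=d + 1):                      # every labelling of the d+1 points
    w = np.array([2 * b[i + 1] - 1 for i in range(d)])       # w_i = 2 b(x_i) - 1
    theta = 0.5 * (1 - b[0])                                  # theta = (1 - b(x_0)) / 2
    predicted = [int(w @ x >= theta) for x in points]
    assert predicted == list(b), (b, predicted)

print(f"all {2 ** (d + 1)} labellings realised")             # the d+1 points are shattered
```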
Proof (upper bound: no $d+2$ points are shattered). Any $d+2$ points $x_1, \ldots, x_{d+2}$ in $\mathbb{R}^d$ are affinely dependent: there exist coefficients $\alpha_i$, not all zero, with $\sum_{i} \alpha_i x_i = 0$ and $\sum_i \alpha_i = 0$. Partition the indices into $P = \{i : \alpha_i > 0\}$ and $N = \{i : \alpha_i < 0\}$; both are nonempty, and $\sum_{i \in P} \alpha_i x_i = \sum_{i \in N} (-\alpha_i) x_i =: u$ with $\sum_{i \in P} \alpha_i = \sum_{i \in N} (-\alpha_i) =: s > 0$. Suppose a halfspace $w \cdot x \geq \theta$ labelled every point of $P$ positive and every point of $N$ negative. Summing the constraints $w \cdot x_i \geq \theta$ over $P$ with weights $\alpha_i$ gives $w \cdot u \geq s\theta$, while summing the strict constraints $w \cdot x_i < \theta$ over $N$ with weights $-\alpha_i$ gives $w \cdot u < s\theta$ - a contradiction. Hence this labelling is not realisable. $\square$
Growth Function and Sauer’s Lemma
The growth function $\Pi_\mathcal{H}(m)$ counts the maximum number of distinct labellings $\mathcal{H}$ induces on any $m$ points:
$$\Pi_\mathcal{H}(m) = \max_{x_1,\ldots,x_m \in \mathcal{X}} \left|\{(h(x_1),\ldots,h(x_m)) : h \in \mathcal{H}\}\right|$$
Clearly $\Pi_\mathcal{H}(m) \leq 2^m$, with equality when $\mathcal{H}$ shatters some $m$-element set.
Theorem (Sauer’s Lemma). If $\text{VCdim}(\mathcal{H}) = d$, then for all $m$:
$$\Pi_\mathcal{H}(m) \leq \sum_{i=0}^{d} \binom{m}{i}$$
For $m \geq d$, this sum is at most $(em/d)^d$, so the growth function is polynomial in $m$ once VC dimension is finite (instead of exponential).
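A quick numerical check (illustrative values only) shows how sharply the Sauer bound separates from $2^m$ once $m$ grows past $d$:

```python
from math import comb, e

def sauer_bound(m, d):
    """Sauer's lemma bound on the growth function: sum_{i=0}^{d} C(m, i)."""
    return sum(comb(m, i) for i in range(d + 1))

d = 5
for m in (5, 20, 100):
    print(m, sauer_bound(m, d), round((e * m / d) ** d), 2 ** m)   # exact bound, (em/d)^d, 2^m
```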
Proof sketch. By induction on $m$. Fix a set $S$ of $m$ points and remove one point $x$. Every labelling of $S \setminus \{x\}$ realised by $\mathcal{H}$ extends to a labelling of $S$ either in exactly one way or in both ways. The labellings of the first kind are counted by $\Pi_\mathcal{H}(m-1)$, and those of the second kind by $\Pi_{\mathcal{H}'}(m-1)$ for the restricted class $\mathcal{H}'$ realising both extensions, which has VC dimension at most $d-1$. Applying the induction hypothesis to both terms and Pascal's identity gives the bound. $\square$
The Fundamental Theorem of PAC Learning
Theorem. A hypothesis class $\mathcal{H}$ is PAC learnable if and only if $\text{VCdim}(\mathcal{H}) < \infty$.
The key direction (finite VC dim implies learnability) uses the VC generalisation bound.
Theorem (VC Generalisation Bound). Let $\text{VCdim}(\mathcal{H}) = d$. For any distribution $\mathcal{D}$ and any $\delta \in (0,1)$, with probability at least $1-\delta$ over a sample $S$ of $m$ points, every $h \in \mathcal{H}$ satisfies:
$$\text{err}(h) \leq \hat{\text{err}}_S(h) + \sqrt{\frac{2d\log(em/d) + 2\log(2/\delta)}{m}}$$
where $\hat{\text{err}}_S(h)$ is the empirical error on $S$. In particular, for an empirical risk minimiser with $\hat{\text{err}}_S(h) = 0$, the sample complexity is:
$$m = O\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$$
(a sharper analysis of the realisable case improves this to $O\!\left(\frac{d\log(1/\epsilon) + \log(1/\delta)}{\epsilon}\right)$).
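Plugging numbers into the bound makes the interplay of $d$ and $m$ tangible. The sketch below evaluates the deviation term of the bound stated above for a few illustrative values:

```python
from math import e, log, sqrt

def vc_deviation(d, m, delta):
    """Deviation term of the VC generalisation bound stated above."""
    return sqrt((2 * d * log(e * m / d) + 2 * log(2 / delta)) / m)

d, delta = 10, 0.05
for m in (100, 1_000, 10_000, 100_000):
    print(f"m = {m:>7}:  err(h) <= emp_err(h) + {vc_deviation(d, m, delta):.3f}")
```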
Proof sketch. Use the symmetrisation trick: bound the probability that the empirical error deviates from the true error by introducing a ghost sample $S'$. Then use a union bound over the at most $\Pi_\mathcal{H}(2m) \leq (2em/d)^d$ distinct labellings on $S \cup S'$, applying Hoeffding’s inequality to each. Sauer’s lemma converts $2^m$ (infeasible) to a polynomial (feasible). $\square$
Rademacher Complexity
VC dimension measures combinatorial complexity with worst-case labels. Rademacher complexity provides a tighter, distribution-dependent alternative:
$$\mathcal{R}_m(\mathcal{H}) = \mathbb{E}_{S, \sigma}\!\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\sum_{i=1}^m \sigma_i h(x_i)\right]$$
where $\sigma_i \in \{-1, +1\}$ are i.i.d. uniform (Rademacher) random variables. Intuitively, this measures how well $\mathcal{H}$ can correlate with pure noise. Rademacher bounds are tighter than VC bounds because they adapt to the actual data distribution and do not pay for complexity on parts of the space the distribution ignores.
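Empirical Rademacher complexity can be estimated directly by Monte Carlo. The sketch below (1-D thresholds with outputs in $\{-1,+1\}$ on a fixed sample; all choices illustrative) draws random sign vectors and measures the best correlation any hypothesis achieves with them:

```python
import numpy as np

def empirical_rademacher(xs, hypotheses, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_h (1/m) sum_i sigma_i h(x_i) ]."""
    rng = np.random.default_rng(seed)
    m = len(xs)
    preds = np.array([[h(x) for x in xs] for h in hypotheses])   # |H| x m, entries in {-1, +1}
    sigmas = rng.choice([-1, 1], size=(n_draws, m))              # random noise labellings
    # For each noise vector, take the best-correlated hypothesis, then average over draws.
    return np.mean(np.max(sigmas @ preds.T, axis=1)) / m

xs = np.linspace(0.0, 1.0, 50)
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in np.linspace(0.0, 1.0, 21)]
print(empirical_rademacher(xs, thresholds))    # well below 1: thresholds cannot fit arbitrary noise
```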
Examples
VC dimension of neural networks. A feedforward network with $W$ weights, depth $L$, and piecewise-linear activations has VC dimension $O(WL \log W)$ (Bartlett et al.); for threshold activations the classical bound is $O(W \log W)$. Modern networks have billions of parameters, yielding astronomical VC dimensions - yet they generalise well. This apparent contradiction, vividly illustrated by the "double descent" phenomenon in which test error falls again as models grow past the interpolation threshold, reveals that VC dimension is a pessimistic worst-case bound; practical generalisation depends on implicit regularisation, the geometry of the data distribution, and algorithmic factors not captured by combinatorial complexity.
Expressiveness vs. generalisation. The VC generalisation bound makes the tradeoff explicit. A richer class (larger $d$) can represent more functions but requires more samples to generalise. A class with $d = 0$ (constant functions) needs no data but is useless. The optimal model complexity is the one that balances approximation error (does $\mathcal{H}$ contain a good approximation to the target?) against estimation error (given $m$ samples, can we find it?). Structural risk minimisation (SRM) formalises this by choosing $\mathcal{H}$ to minimise the sum of empirical error and the VC generalisation bound.
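A minimal SRM sketch (hypothetical nested classes, with VC dimensions and empirical errors standing in for already-trained models; the penalty reuses the deviation term from the bound above):

```python
from math import e, log, sqrt

def srm_select(classes, m, delta=0.05):
    """Pick the class minimising empirical error plus the VC deviation term."""
    def penalty(d):
        return sqrt((2 * d * log(e * m / d) + 2 * log(2 / delta)) / m)
    return min(classes, key=lambda name: classes[name][1] + penalty(classes[name][0]))

# name -> (VC dimension, empirical error); purely illustrative numbers.
classes = {"linear": (11, 0.30), "quadratic": (66, 0.05), "deep net": (10_000, 0.00)}
print(srm_select(classes, m=50_000))   # "quadratic": the best error-plus-penalty trade-off
```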
The VC framework gives the “when” of generalisation. The companion question - “how” - is addressed in the PAC Learning post, where algorithms and computational complexity enter the picture.