Consider the problem of recognizing handwritten digits. A pixel grid of 28x28 gives 784 numbers. Each image is a point in $\mathbb{R}^{784}$. Nearby points in that space - images that look similar to each other - should belong to the same digit class. This is the geometric intuition behind classification: structure in the input space reflects structure in the labels, and a good classifier exploits that geometry.

Every classification algorithm is an answer to one question: how do you use that geometry? K-nearest neighbors answers it directly - find the training examples closest to the query point and vote. Support vector machines answer it by finding the hyperplane that best separates classes with the widest possible margin. The two approaches are philosophically opposite: KNN stores everything and decides at query time, SVM compresses everything into a boundary and decides at training time. Both approaches turn out to be remarkably principled, and understanding them carefully reveals the major themes of machine learning - overfitting, the curse of dimensionality, the bias-variance tradeoff, and the kernel trick.

This post develops both classifiers from first principles, starting with the feature space geometry, then building to the full SVM theory including soft margins, the dual problem, kernels, and multiclass extensions. The goal is not just to know how the algorithms work, but to understand why they work and what they are actually doing to the training data.


Feature Space and Classifier Design

A feature vector $\mathbf{x} \in \mathbb{R}^d$ represents a data point. Each of the $d$ coordinates is a measured property - pixel intensity, word frequency, sensor reading. The feature space $\mathbb{R}^d$ contains all possible inputs.

A classifier is a function $f: \mathbb{R}^d \to \{1, \ldots, K\}$ mapping each input to one of $K$ class labels. The goal is to learn $f$ from training data $\{(\mathbf{x}i, y_i)\}{i=1}^n$ where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{1, \ldots, K\}$.

Every classifier partitions the feature space into $K$ decision regions - the set of points mapped to each class. The boundaries between regions are the decision boundaries. The classifier’s inductive bias determines the shape of these boundaries: KNN produces boundaries that are piecewise-linear (Voronoi cells), linear classifiers produce boundaries that are hyperplanes (half-spaces).

The fundamental tension in learning a classifier is the bias-variance tradeoff. A classifier with too much flexibility perfectly separates training points but fails on new data - it has memorized noise rather than pattern. A classifier with too little flexibility imposes structure that the data does not support - it misses real patterns. Choosing the right class of functions $f$ is most of the work.


K-Nearest Neighbor Classifier

The k-nearest neighbor (KNN) classifier makes no parametric assumptions about the data distribution. It simply stores all training examples and, at test time, uses the local structure of the training set to make predictions.

The algorithm. Given a test point $\mathbf{x}$:

  1. Compute the distance $d(\mathbf{x}, \mathbf{x}_i)$ to every training point $\mathbf{x}_i$.
  2. Find the $k$ training points $\mathcal{N}_k(\mathbf{x})$ with smallest distance.
  3. Return the majority class among those $k$ neighbors:

$$f(\mathbf{x}) = \underset{c}{\arg\max} \sum_{i \in \mathcal{N}_k(\mathbf{x})} \mathbf{1}[y_i = c]$$

The Euclidean distance $d(\mathbf{x}, \mathbf{z}) = |\mathbf{x} - \mathbf{z}|_2$ is the standard choice, but other metrics (Manhattan, cosine, Mahalanobis) can be substituted depending on the problem.

The Voronoi Diagram and k=1

For $k = 1$, the decision boundary is exactly the Voronoi diagram of the training points. Each training point $\mathbf{x}_i$ defines a Voronoi cell - the set of all points closer to $\mathbf{x}_i$ than to any other training point. The 1-NN classifier assigns every test point the label of whichever training point owns the Voronoi cell containing it.

The resulting decision boundary can be arbitrarily complex: it has one cell per training point, so $n$ cells total, and the boundary wiggles around every example in the training set. This is the hallmark of overfitting. A single mislabeled training point creates a small “island” of the wrong class in the middle of another class’s territory.

The Effect of k

Increasing $k$ smooths the decision boundary by averaging over more neighbors. A region must have majority support from $k$ training points to be assigned a given class, which suppresses isolated outliers. As $k \to n$, the classifier predicts the majority class everywhere - maximum smoothness but also maximum bias, ignoring all local structure.

The optimal $k$ trades off these extremes. Odd values of $k$ avoid ties in binary classification. In practice, $k$ is chosen by cross-validation.

The Cover-Hart Theorem

Why does KNN work at all? The Cover-Hart theorem provides the theoretical answer: in the limit of infinite training data, the error rate of the 1-NN classifier is at most twice the Bayes error rate - the irreducible error of the optimal classifier that knows the true class probabilities.

$$R_{1\text{-NN}} \leq 2 R^*$$

where $R^* = 1 - \mathbb{E}[\max_c P(Y=c \mid \mathbf{X})]$ is the Bayes error. The reason: with enough data, every test point $\mathbf{x}$ will have a nearest neighbor $\mathbf{x}'$ so close that they are essentially at the same location. The 1-NN classifier then uses a single sample from $P(Y \mid \mathbf{x})$ to estimate the class, and two such samples independently drawn give error at most $2R^*$. This is a strong guarantee that requires no modeling assumptions.

Computational Cost and Data Structures

The naive implementation costs $O(nd)$ per query - compute all $n$ distances, each taking $O(d)$ time. For large $n$ or $d$, this is prohibitively slow.

k-d trees partition the feature space recursively along coordinate axes, organizing training points into a binary tree. For low-dimensional data ($d \lesssim 20$), nearest neighbor search runs in $O(d \log n)$ average time. For high $d$, the tree degenerates and offers no speedup.

Ball trees partition space into nested hyperspheres and handle moderate dimensions better than k-d trees. Locality-sensitive hashing provides approximate nearest neighbors in sublinear time for high-dimensional data.

The Curse of Dimensionality

KNN fundamentally relies on the assumption that nearby points in feature space have similar labels. This fails in high dimensions.

In $d$ dimensions, a hypercube of side 1 has volume 1. A ball of radius $r$ has volume proportional to $r^d$. To capture a fixed fraction $\epsilon$ of the training data, the ball radius must grow as $\epsilon^{1/d}$. For $d = 100$ and $\epsilon = 0.01$, the radius is $0.01^{1/100} \approx 0.955$ - nearly the entire space. “Nearby” neighbors in high dimensions are almost as far as random points.

More precisely, in high dimensions all pairwise distances concentrate: $\frac{\max_i d(\mathbf{x}, \mathbf{x}_i) - \min_i d(\mathbf{x}, \mathbf{x}_i)}{\min_i d(\mathbf{x}, \mathbf{x}_i)} \to 0$ as $d \to \infty$ under mild conditions. The “nearest” neighbor is barely closer than the farthest. Distance loses its discriminative power.

This is why KNN works beautifully on MNIST (which despite 784 pixels lives on a low-dimensional manifold) but poorly on raw high-dimensional genomic or text data without dimensionality reduction.


Linear Classifiers and Discriminant Functions

A linear classifier partitions the feature space using a hyperplane. In binary classification (classes $+1$ and $-1$), the classifier is:

$$f(\mathbf{x}) = \text{sign}(\mathbf{w} \cdot \mathbf{x} + b)$$

where $\mathbf{w} \in \mathbb{R}^d$ is the weight vector (normal to the hyperplane) and $b \in \mathbb{R}$ is the bias. The decision boundary is the hyperplane $\{\mathbf{x} : \mathbf{w} \cdot \mathbf{x} + b = 0\}$.

The signed distance from a point $\mathbf{x}$ to the hyperplane is:

$$\text{dist}(\mathbf{x}) = \frac{\mathbf{w} \cdot \mathbf{x} + b}{|\mathbf{w}|}$$

Points with $\mathbf{w} \cdot \mathbf{x} + b > 0$ are classified as $+1$; those with $\mathbf{w} \cdot \mathbf{x} + b < 0$ as $-1$. The magnitude of $\mathbf{w} \cdot \mathbf{x} + b$ measures how far the point is from the boundary - a natural confidence score.

Learning a linear classifier means learning $\mathbf{w}$ and $b$. The geometry is fixed (a hyperplane); the question is which hyperplane.

Gradient Descent and the Hinge Loss

A natural loss for linear classifiers is the hinge loss: a training point $(\mathbf{x}_i, y_i)$ contributes loss $\max(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))$. Points correctly classified with confidence at least 1 contribute zero loss. Points close to the boundary or misclassified contribute positive loss proportional to how wrong they are.

The total training loss is:

$$L(\mathbf{w}, b) = \sum_{i=1}^n \max(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))$$

Gradient descent updates $\mathbf{w}$ and $b$ by subtracting a multiple of the gradient. For the hinge loss, when $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) < 1$, the gradient with respect to $\mathbf{w}$ is $-y_i \mathbf{x}_i$. Points that satisfy the constraint contribute no gradient - only the violating points drive the update.

The Perceptron

The Perceptron is the simplest online learning algorithm for linear classifiers. It initializes $\mathbf{w} = 0$, $b = 0$ and processes one training example at a time. When it makes an error - when $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \leq 0$ - it updates:

$$\mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i, \quad b \leftarrow b + y_i$$

This update moves $\mathbf{w}$ in the direction of $y_i \mathbf{x}_i$, which increases $y_i(\mathbf{w} \cdot \mathbf{x}_i + b)$ by $y_i^2 |\mathbf{x}_i|^2 = |\mathbf{x}_i|^2 > 0$. The misclassified point is less wrong after the update.

The Perceptron convergence theorem guarantees termination on linearly separable data. Let $R = \max_i |\mathbf{x}_i|$ (radius of the data) and $\gamma$ (the margin) be the maximum over all weight vectors $\mathbf{w}$ with $|\mathbf{w}| = 1$ of $\min_i y_i(\mathbf{w} \cdot \mathbf{x}_i + b)$. Then the Perceptron makes at most:

$$\left(\frac{R}{\gamma}\right)^2 \text{ updates}$$

before converging to a correct classifier. A large margin means few updates; a small margin can mean many. The proof shows that correct updates increase $\mathbf{w} \cdot \mathbf{w}^$ (alignment with the true separating direction) while bounding the growth of $|\mathbf{w}|$, so the angle between $\mathbf{w}$ and $\mathbf{w}^$ strictly decreases with each update.

The Perceptron has a critical limitation: it finds some separating hyperplane but not necessarily the best one. On non-separable data it never converges. The SVM resolves both issues.


Support Vector Machines

Among all hyperplanes that correctly separate the training data, which one should we choose? The answer is the one with the maximum margin - the largest minimum distance from any training point to the hyperplane.

The margin matters because it controls generalization. A hyperplane that passes close to training points is sensitive to small perturbations - moving a training point slightly could flip the classification. A wide-margin hyperplane is robust to such perturbations. The maximum-margin hyperplane is the most “confident” separator.

Hard-Margin SVM

For a separating hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$, the distance from training point $\mathbf{x}_i$ to the hyperplane is $\frac{y_i(\mathbf{w} \cdot \mathbf{x}_i + b)}{|\mathbf{w}|}$ (using $y_i$ to get the signed distance with the correct sign). Normalizing so that the closest training point has $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1$, the margin is $\frac{2}{|\mathbf{w}|}$. The hyperplanes $\mathbf{w} \cdot \mathbf{x} + b = \pm 1$ are called the gutters and pass through the closest points of each class, called support vectors.

Maximizing the margin $\frac{2}{|\mathbf{w}|}$ is equivalent to minimizing $|\mathbf{w}|^2$. The hard-margin SVM is the quadratic program:

$$\min_{\mathbf{w}, b} \frac{1}{2}|\mathbf{w}|^2 \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 \quad \forall i$$

This requires the data to be linearly separable. When it is not, the constraints are infeasible and the hard-margin SVM has no solution.

Soft-Margin SVM

The soft-margin SVM allows constraints to be violated by introducing slack variables $\xi_i \geq 0$. Point $i$ violates the margin constraint when $\xi_i > 0$; it is misclassified when $\xi_i > 1$. The optimization becomes:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}|\mathbf{w}|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i,; \xi_i \geq 0 \quad \forall i$$

The hyperparameter $C > 0$ controls the tradeoff: large $C$ penalizes violations heavily (narrow margin, fits training data closely), small $C$ tolerates many violations (wide margin, potentially more robust). In the limit $C \to \infty$, the soft-margin SVM reduces to the hard-margin SVM.

Substituting the optimal $\xi_i = \max(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))$, the objective becomes $\frac{1}{2}|\mathbf{w}|^2 + C \sum_i \max(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))$ - exactly the hinge loss with $\ell_2$ regularization. The SVM primal is regularized empirical risk minimization with the hinge loss.

Dual Formulation

The primal SVM is a convex quadratic program with $d + n + 1$ variables. For large $d$ (or infinite-dimensional feature maps), the dual formulation is more useful.

By Lagrangian duality, introducing multipliers $\alpha_i \geq 0$ for the margin constraints and $\mu_i \geq 0$ for $\xi_i \geq 0$, the dual problem is:

$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i \cdot \mathbf{x}_j$$

$$\text{subject to} \quad 0 \leq \alpha_i \leq C \quad \forall i, \quad \sum_{i=1}^n \alpha_i y_i = 0$$

The dual has $n$ variables (one per training point) and depends on the training data only through inner products $\mathbf{x}_i \cdot \mathbf{x}_j$. This is the key observation that enables the kernel trick.

KKT conditions give the primal-dual relationship. At the optimum: $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$. Training points with $\alpha_i = 0$ do not contribute to $\mathbf{w}$ at all - they are irrelevant. Training points with $\alpha_i > 0$ are the support vectors - they lie on or inside the margin and determine the decision boundary entirely. The solution is sparse: typically only a small fraction of training points are support vectors.

The decision function at test time is:

$$f(\mathbf{x}) = \text{sign}\left(\sum_{i:\alpha_i > 0} \alpha_i y_i \mathbf{x}_i \cdot \mathbf{x} + b\right)$$

The test point interacts with the training data only through inner products with support vectors. Everything else is discarded.


The Kernel Trick

The dual SVM depends on training data only through inner products $\mathbf{x}_i \cdot \mathbf{x}_j$. This creates an opening: replace every inner product with a kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$ for some feature map $\phi: \mathbb{R}^d \to \mathcal{H}$ into a (possibly infinite-dimensional) Hilbert space $\mathcal{H}$.

The remarkable fact is that we never need to compute $\phi(\mathbf{x})$ explicitly. We only need $K(\mathbf{x}_i, \mathbf{x}_j)$, which can be evaluated directly in the original space. The SVM then finds the maximum-margin hyperplane in $\mathcal{H}$, a space that can be far more expressive than $\mathbb{R}^d$, without ever leaving the input space computationally.

Common Kernels

Linear kernel: $K(\mathbf{x}, \mathbf{z}) = \mathbf{x} \cdot \mathbf{z}$. This corresponds to $\phi = \text{id}$, the identity map. Recovers the standard linear SVM.

Polynomial kernel: $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z} + 1)^p$. Corresponds to all monomials of degree up to $p$ in the input features. For $d = 2$ and $p = 2$, the feature map is $\phi(x_1, x_2) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, \sqrt{2}x_1 x_2, x_2^2)$ - six dimensions from two, computed via a single inner product.

Radial basis function (RBF) / Gaussian kernel: $K(\mathbf{x}, \mathbf{z}) = \exp\left(-\frac{|\mathbf{x} - \mathbf{z}|^2}{2\sigma^2}\right)$. This corresponds to an infinite-dimensional feature map. The bandwidth $\sigma$ controls locality: small $\sigma$ makes the kernel sensitive only to very nearby points (risk of overfitting), large $\sigma$ makes it broad (risk of underfitting). The RBF kernel is by far the most widely used in practice.

Mercer’s Condition

Not every function $K$ is a valid kernel. For $K$ to correspond to an inner product in some Hilbert space, the kernel matrix $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ must be positive semidefinite for any finite set of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$. This is Mercer’s condition.

A positive semidefinite kernel guarantees that the dual optimization problem is convex and has a unique solution. Constructing new kernels from existing ones is safe using these facts: sums of valid kernels are valid, products of valid kernels are valid, and a positive constant multiple of a valid kernel is valid.

Why Kernels Enable Non-Linear Separation

A dataset that is not linearly separable in $\mathbb{R}^d$ may become linearly separable in $\mathcal{H}$. The kernel SVM finds a linear boundary in $\mathcal{H}$, which corresponds to a non-linear boundary in the original input space. The decision boundary in $\mathbb{R}^d$ is:

$$\left\{\mathbf{x} : \sum_{i:\alpha_i > 0} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b = 0\right\}$$

For the RBF kernel, this is a sum of Gaussians centered at support vectors. The boundary can be any shape that can be expressed as a weighted combination of such bumps - enormously flexible.


Multi-Class Strategies

Binary SVMs classify into two classes. Extending to $K > 2$ classes requires a strategy for combining binary classifiers.

One-vs-Rest

One-vs-Rest (OvR) trains $K$ binary classifiers. Classifier $k$ is trained to separate class $k$ from all other classes combined (labeled $+1$ and $-1$ respectively). At test time, each classifier produces a score $f_k(\mathbf{x}) = \mathbf{w}_k \cdot \mathbf{x} + b_k$, and the point is assigned the class with the highest score:

$$\hat{y} = \underset{k}{\arg\max}; f_k(\mathbf{x})$$

OvR is fast: $K$ classifiers, each trained on $n$ points. The downside is class imbalance - classifier $k$ has 1 positive class vs. $K-1$ negative classes, so even random prediction gives $(K-1)/K$ “accuracy.” The decision regions of different classifiers may also conflict: there may be regions where multiple classifiers predict $+1$ (ambiguous) or all predict $-1$ (no class claimed).

One-vs-One

One-vs-One (OvO) trains $\binom{K}{2} = K(K-1)/2$ binary classifiers, one for every pair of classes $(j, k)$. Classifier $(j,k)$ is trained only on data from classes $j$ and $k$. At test time, each classifier votes for one class, and the class with the most votes wins.

OvO produces well-balanced training sets - each classifier sees equal numbers of each class. But training $O(K^2)$ classifiers is expensive for large $K$, and at test time $O(K^2)$ classifier evaluations are required. For small $K$ (say, $K \leq 10$) OvO is often preferred.

Decision Directed Acyclic Graph

DDAG (Decision Directed Acyclic Graph) arranges the $K(K-1)/2$ binary classifiers as a directed acyclic graph with $K(K-1)/2$ internal nodes and $K$ leaves. Each internal node is a binary classifier for one pair of classes $(j, k)$, and eliminates the losing class. Starting at the root, the algorithm traverses the graph, eliminating one class per node, and outputs whichever class survives to the leaf.

The test-time cost of DDAG is $O(K)$ classifier evaluations (traversing from root to leaf through $K-1$ nodes), compared to $O(K^2)$ for full OvO voting. The accuracy is comparable to OvO while being much faster to evaluate.

Hierarchical Classifiers

When the $K$ classes have a natural hierarchy (as in taxonomy or ontology), hierarchical classifiers organize the classes as a tree. Each internal node has a binary (or multi-way) classifier that routes the input down the tree. Classification terminates at a leaf. The tree structure encodes prior knowledge about class relationships and reduces the number of classifiers needed.


Summary

KNN Linear SVM Kernel SVM
Model type Non-parametric Parametric (linear) Non-parametric (kernel)
Training cost $O(1)$ (store data) $O(n \cdot d \cdot T)$ gradient steps $O(n^2 d)$ to $O(n^3)$
Test cost $O(nd)$ naive; $O(d \log n)$ k-d tree $O(d)$ $O(s \cdot d)$ where $s$ = support vectors
Key parameters $k$, distance metric Learning rate, iterations $C$, kernel, kernel params
Kernel support No No Yes
Linear separability required No Soft-margin: No No

KNN makes no assumptions but pays quadratic test-time cost and suffers in high dimensions. Linear SVMs are fast at test time but can only represent linear boundaries. Kernel SVMs combine the expressiveness of non-parametric methods with the principled maximum-margin optimization of SVMs, at the cost of $O(n^2)$ training time and the need to choose a kernel.

The SVM’s maximum-margin principle is not just computational convenience - it is a strong inductive bias toward solutions that generalize well. The Perceptron convergence theorem and the Cover-Hart theorem tell us that both linear and nearest-neighbor classifiers have rigorous theoretical guarantees, not just empirical success. Understanding those guarantees is what separates machine learning from curve fitting.


Read next: