Bias, Variance, & Overfitting
The Bias-Variance Decomposition
Consider a regression problem where the true relationship is $y = f(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Given a model $\hat{f}$ trained on a dataset $D$, the expected mean squared error at a fixed test point $x$ decomposes as:
$$\mathbb{E}_D\left[(y - \hat{f}(x))^2\right] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2$$
Derivation. Let $\bar{f}(x) = \mathbb{E}_D[\hat{f}(x)]$ be the mean prediction over all possible training sets. Then:
$$\mathbb{E}_D\left[(y - \hat{f})^2\right] = \mathbb{E}\left[(y - \bar{f} + \bar{f} - \hat{f})^2\right]$$
Expanding and taking expectations, the cross terms vanish because $\mathbb{E}_D[\hat{f} - \bar{f}] = 0$ by definition of $\bar{f}$ and $\mathbb{E}[\epsilon] = 0$:
$$= \underbrace{(f(x) - \bar{f}(x))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D\left[(\hat{f}(x) - \bar{f}(x))^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}$$
Bias measures how far the average prediction is from the truth - it reflects systematic error due to model assumptions. Variance measures how much predictions fluctuate across different training sets - it reflects sensitivity to data. The noise term $\sigma^2$ is irreducible; no model can eliminate it.
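A minimal simulation sketch of this decomposition, assuming a cubic polynomial fit to noisy samples of $\sin(2\pi x)$ (the degree, noise level, and test point are illustrative choices):

```python
# Monte Carlo estimate of bias^2 / variance / noise at a fixed test point
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)
sigma, n_train, degree, n_datasets = 0.3, 30, 3, 500
x_test = 0.5                                   # fixed test point x

preds = np.empty(n_datasets)
for d in range(n_datasets):
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)   # draw a fresh training set D
    coeffs = np.polyfit(x, y, degree)          # fit \hat{f} on D
    preds[d] = np.polyval(coeffs, x_test)      # \hat{f}(x_test)

bias_sq = (preds.mean() - f(x_test)) ** 2      # (f(x) - \bar{f}(x))^2
variance = preds.var()                         # E_D[(\hat{f}(x) - \bar{f}(x))^2]
print(bias_sq, variance, sigma**2, bias_sq + variance + sigma**2)
```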
Underfitting and Overfitting
High bias (underfitting) occurs when the model is too simple to capture the true structure. A degree-1 polynomial fit to a nonlinear signal will have high training error and high test error. The model fails regardless of how much data is provided.
High variance (overfitting) occurs when the model is too complex relative to the available data. A degree-15 polynomial fit to 20 data points will memorize training noise, achieving near-zero training error but high test error. Collecting more data reduces variance.
The fundamental tradeoff is that reducing bias (by increasing model complexity) generally increases variance, and vice versa. Model selection is the problem of finding the complexity that minimizes total test error.
Regularization
$L_2$ Regularization (Ridge)
Augmenting the cost with $\frac{\lambda}{2}\|\theta\|_2^2$ penalizes large weights; in the MAP interpretation this corresponds to a zero-mean Gaussian prior on $\theta$. For linear regression the ridge estimate has a closed form:
$$\hat{\theta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$$
The $\lambda I$ term adds $\lambda$ to every eigenvalue of $X^T X$, improving the conditioning of the matrix. Geometrically, $L_2$ regularization constrains the solution to lie within an $\ell_2$ ball - it shrinks all weights proportionally toward zero, reducing variance without inducing exact sparsity.
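A minimal NumPy sketch of this closed form; the design matrix `X`, targets `y`, and penalty `lam` are assumed inputs:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam*I) theta = X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)   # lam*I adds lam to every eigenvalue
    return np.linalg.solve(A, X.T @ y)       # solve() is preferable to an explicit inverse

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=50)
theta = ridge_fit(X, y, lam=1.0)
```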
$L_1$ Regularization (Lasso)
The penalty $\lambda \|\theta\|_1$ corresponds to a Laplace prior and induces sparsity - many weights become exactly zero. This happens because the $\ell_1$ ball has corners on the axes; the constrained optimum tends to land on a corner, where some coordinates are exactly zero. There is no closed form for the Lasso, but coordinate descent and proximal gradient methods solve it efficiently.
A practical mnemonic: $L_2$ is differentiable everywhere and yields dense solutions; $L_1$ is not differentiable at zero and yields sparse solutions.
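A short sketch of the proximal gradient method (ISTA), assuming squared-error loss: each iteration takes a gradient step on the smooth term and then applies soft-thresholding, which is exactly where coordinates are zeroed out.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    theta = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant of the gradient
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y)              # gradient of 0.5 * ||y - X theta||^2
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta                                  # many entries end up exactly zero
```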
Early Stopping as Implicit Regularization
In gradient descent, training longer allows the model to fit progressively finer structure (and noise) in the training data. Early stopping halts training when validation loss begins to increase, preventing the model from reaching regions of parameter space that overfit. For linear models trained with gradient descent, it can be shown that early stopping is approximately equivalent to $L_2$ regularization - the number of gradient steps plays the role of $1/\lambda$.
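A sketch of the stopping rule, assuming hypothetical `train_step` and `val_loss` callables that run one epoch of gradient descent and evaluate held-out loss:

```python
def train_with_early_stopping(train_step, val_loss, max_epochs=1000, patience=10):
    best_loss, best_params, epochs_since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        params = train_step()                  # one epoch of gradient descent
        loss = val_loss(params)                # monitor held-out loss
        if loss < best_loss:
            best_loss, best_params, epochs_since_best = loss, params, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # validation loss stopped improving
                break
    return best_params                         # parameters from the best epoch
```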
Dropout
Dropout is a stochastic regularization technique. During training, each neuron activation $h_j$ is multiplied by an independent Bernoulli mask $m_j \sim \text{Bernoulli}(p)$ where $p$ is the keep probability. The effective forward pass computes:
$$\tilde{h}_j = m_j \cdot h_j$$
At test time, dropout is disabled and activations are scaled by $p$ so that expected outputs match training; equivalently, with inverted dropout the activations are divided by $p$ during training, so no test-time scaling is needed:
$$\mathbb{E}[\tilde{h}_j] = p \cdot h_j$$
The ensemble interpretation: a network with $n$ dropout units implicitly trains $2^n$ weight-sharing sub-networks. At test time, the full network with scaled weights approximates the geometric mean of this ensemble. This ensemble view explains why dropout is a variance-reducing regularizer - averaging over many models reduces the sensitivity to any particular training sample.
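A minimal NumPy sketch of the inverted-dropout forward pass described above:

```python
import numpy as np

def dropout_forward(h, p, training, rng=np.random.default_rng()):
    if not training:
        return h                               # test time: identity, no rescaling
    mask = rng.random(h.shape) < p             # m_j ~ Bernoulli(p), p = keep probability
    return h * mask / p                        # divide by p so E[h_tilde] = h
```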
Data Augmentation
Data augmentation synthetically expands the training set by applying label-preserving transformations: horizontal flips, random crops, color jitter for images; synonym replacement or back-translation for text. Formally, augmentation adds samples drawn from $p_{\text{aug}}(x|y)$ rather than the true data distribution. This is effective when $p_{\text{aug}}$ approximates invariances present in the test distribution. Like dropout, it reduces variance by preventing the model from memorizing specific artifact patterns.
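A hedged sketch of a few label-preserving image transforms on an `H x W x C` float array in $[0, 1]$; the specific transforms and ranges are illustrative choices, not a fixed recipe:

```python
import numpy as np

def augment(image, rng=np.random.default_rng()):
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                       # horizontal flip
    h, w, _ = image.shape
    crop_h, crop_w = int(0.9 * h), int(0.9 * w)         # random 90% crop
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    image = image[top:top + crop_h, left:left + crop_w, :]
    image = image * rng.uniform(0.8, 1.2)               # simple brightness jitter
    return np.clip(image, 0.0, 1.0)
```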
The Double Descent Phenomenon
Classical statistics predicts a U-shaped test error curve: error decreases as model size grows up to an optimal capacity, then increases. Modern observations reveal a second descent: when models are trained past the interpolation threshold (where training error reaches zero), test error begins to decrease again.
The interpolation threshold corresponds to the minimum complexity model that can exactly fit the training data. Beyond this point, among all models that perfectly interpolate, gradient descent and its variants tend to find solutions with minimum norm - which often generalize well despite zero training loss. This reconciles the apparent paradox of overparameterized neural networks: they can both memorize training data and generalize.
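A small sketch of the minimum-norm intuition, assuming a linear model with more parameters than samples: `np.linalg.pinv` returns the minimum-$\ell_2$-norm interpolator, which is also the solution gradient descent from zero initialization converges to in this linear setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                                 # more parameters than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

theta = np.linalg.pinv(X) @ y                  # min ||theta||_2 subject to X theta = y
print(np.allclose(X @ theta, y))               # interpolates: training error ~ 0
```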
Structural Risk Minimization
Vapnik’s Structural Risk Minimization (SRM) framework formalizes the tradeoff between empirical risk and model complexity. Given a hypothesis class $\mathcal{H}$ with VC dimension $d$, the generalization bound states that with probability $\ge 1 - \delta$:
$$R(f) \le \hat{R}(f) + \sqrt{\frac{d \log(2n/d) + \log(2/\delta)}{2n}}$$
where $R(f)$ is the true risk, $\hat{R}(f)$ is the empirical risk, and $n$ is the training set size. SRM selects the hypothesis class that minimizes the right-hand side - a rigorous justification for choosing simpler models when data is limited.
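A quick sketch that evaluates the bound above for a few hypothesis classes; the VC dimensions and empirical risks are made-up numbers for illustration:

```python
import math

def vc_bound(emp_risk, d, n, delta=0.05):
    # Right-hand side of the SRM bound above
    return emp_risk + math.sqrt((d * math.log(2 * n / d) + math.log(2 / delta)) / (2 * n))

n = 1000
for d, emp_risk in [(5, 0.20), (50, 0.10), (500, 0.02)]:
    print(d, vc_bound(emp_risk, d, n))   # SRM picks the class minimizing the bound
```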
Examples
Polynomial regression and the bias-variance tradeoff. Fit polynomials of degree $k = 1, 5, 15$ to 30 noisy samples from $f(x) = \sin(2\pi x)$. Degree 1 has high bias: the linear fit cannot capture the sinusoidal shape. Degree 15 has high variance: it passes through all training points but oscillates wildly (Runge’s phenomenon) at test points. Degree 5 approximately minimizes the sum of squared bias and variance on held-out data.
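A sketch of this experiment, assuming noise level 0.2 and an illustrative test grid:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)
x_train = rng.uniform(0, 1, 30)
y_train = f(x_train) + rng.normal(0, 0.2, 30)
x_test = np.linspace(0.05, 0.95, 200)

for k in (1, 5, 15):
    coeffs = np.polyfit(x_train, y_train, k)   # degree 15 may warn about conditioning
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - f(x_test)) ** 2)
    print(k, train_mse, test_mse)              # degree 15: tiny train error, large test error
```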
Random forests as variance reduction. A single decision tree grown to full depth has near-zero bias but high variance - small changes in training data produce very different trees. Random forests train $B$ trees on bootstrap samples with random feature subsets, then average predictions:
$$\hat{f}_{\text{RF}}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}_b(x)$$
If trees have variance $\sigma^2$ and pairwise correlation $\rho$, the ensemble variance is $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$. As $B \to \infty$, variance approaches $\rho\sigma^2$ - the irreducible component due to tree correlations. Random feature selection reduces $\rho$, so the ensemble generalizes better than any single tree.
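A short sketch that evaluates the ensemble-variance formula for illustrative values of $\rho$ and $B$:

```python
import numpy as np

sigma2 = 1.0                       # variance of a single fully grown tree
B = np.array([1, 10, 100, 1000])   # number of trees in the ensemble
for rho in (0.8, 0.3, 0.05):       # lower rho = more decorrelated trees
    ens_var = rho * sigma2 + (1 - rho) / B * sigma2
    print(rho, ens_var)            # approaches rho * sigma^2 as B grows
```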