You train a decision tree on 1000 examples of a binary classification problem. It achieves 100% training accuracy. You test it on 100 new examples. It achieves 55% accuracy - barely better than a coin flip. You didn’t build a classifier. You built a lookup table that memorized noise.

This is overfitting. And understanding why it happens - and what to do about it - is the central intellectual challenge of machine learning. The bias-variance tradeoff gives you the language to think about it precisely.


Two Kinds of Error

Before decomposing anything, name the two ways a model can fail.

A model has high bias if it makes systematic errors - it’s consistently wrong in the same direction, regardless of which training data you use. A linear model fit to a quadratic pattern makes the same kind of mistake every time: on an inverted-U pattern, for example, it underpredicts in the middle and overpredicts at the extremes, no matter how much data you give it. The model is too simple to capture the truth.
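To see that this error does not shrink with more data, here is a minimal sketch. The inverted-U target $f(x) = 1 - x^2$, the noise level, and the sample sizes are illustrative assumptions, not part of the argument above.

```python
import numpy as np

rng = np.random.default_rng(0)

def line_fit_errors(n):
    """Fit a straight line to noisy samples of 1 - x^2.

    Returns (prediction - truth) at x = -1, 0, 1.
    """
    x = rng.uniform(-1.0, 1.0, size=n)
    y = (1.0 - x**2) + rng.normal(0.0, 0.1, size=n)
    slope, intercept = np.polyfit(x, y, deg=1)
    probes = np.array([-1.0, 0.0, 1.0])
    return (slope * probes + intercept) - (1.0 - probes**2)

small = line_fit_errors(100)       # modest training set
large = line_fit_errors(100_000)   # 1000x more data, same kind of error
print("n=100:    ", small.round(2))
print("n=100000: ", large.round(2))
```

The signs of the errors are the same for both sample sizes: negative in the middle, positive at the two ends. More data makes the fitted line more stable, but it cannot bend it.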

A model has high variance if it changes dramatically depending on the training data. Retrain on a different sample of 1000 examples, and you get a different model. The model is memorizing the specific dataset rather than learning the underlying pattern. This is the decision tree above.

Both are failure modes. The question is which one you’re suffering from, and how to fix it.


The Bias-Variance Decomposition

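Here is the precise statement. Suppose we’re doing regression: the true relationship is $y = f(x) + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is irreducible noise. We train a model $\hat{f}$ using some training set $\mathcal{D}$, drawn from the data distribution. The expected squared error at a test point $x$, averaged over both the random training set $\mathcal{D}$ and the noise $\epsilon$ in the test label (we write $\mathbb E_{\mathcal{D}}$ for both to keep the notation light), is: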

$$\mathbb E_{\mathcal{D}}\left[(\hat{f}(x) - y)^2\right] = \text{Bias}(\hat{f}(x))^2 + \text{Var}(\hat{f}(x)) + \sigma^2.$$

The three terms are:

$$\text{Bias}(\hat{f}(x)) = \mathbb E_{\mathcal{D}}[\hat{f}(x)] - f(x),$$

$$\text{Var}(\hat{f}(x)) = \mathbb E_{\mathcal{D}}\left[\left(\hat{f}(x) - \mathbb E_{\mathcal{D}}[\hat{f}(x)]\right)^2\right],$$

$$\sigma^2 = \text{Var}(\epsilon).$$

Let’s work through the derivation to see where this comes from. Let $\mu = \mathbb E_{\mathcal{D}}[\hat{f}(x)]$ be the average prediction over all possible training sets. Then:

$$\mathbb E_{\mathcal{D}}[(\hat{f}(x) - y)^2] = \mathbb E_{\mathcal{D}}[(\hat{f}(x) - f(x) - \epsilon)^2].$$

Expanding and using $\mathbb{E}[\epsilon] = 0$ and independence of $\epsilon$ from $\mathcal{D}$:

$$= \mathbb E_{\mathcal{D}}[(\hat{f}(x) - f(x))^2] + \sigma^2.$$

Now add and subtract $\mu$ inside the first term:

$$\mathbb E_{\mathcal{D}}[(\hat{f}(x) - \mu + \mu - f(x))^2] = \mathbb E_{\mathcal{D}}[(\hat{f}(x) - \mu)^2] + (\mu - f(x))^2,$$

where the cross-term vanishes because $\mathbb E_{\mathcal{D}}[\hat{f}(x) - \mu] = 0$. The result is:

$$= \overbrace{(\mu - f(x))^2}^{\text{Bias}^2} + \overbrace{\mathbb E_{\mathcal{D}}\left[(\hat{f}(x)-\mu)^2\right]}^{\text{Variance}} + \overbrace{\sigma^2}^{\text{Noise}}.$$

The three terms capture distinct sources of error. Bias measures systematic error. Variance measures sensitivity to the training set. Noise is irreducible - it is the minimum possible error even if your model is perfect.
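The decomposition can be checked numerically by simulating many training sets. The sketch below is one way to do it; the true function $\sin(2\pi x)$, the noise level, the linear model, and the test point are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical true function, chosen only for illustration.
    return np.sin(2 * np.pi * x)

SIGMA = 0.3       # standard deviation of the noise epsilon
N_TRAIN = 30      # size of each training set D
N_TRIALS = 5000   # number of independent training sets
X0 = 0.25         # the fixed test point x
DEGREE = 1        # a deliberately simple (high-bias) model

# Train one model per simulated training set and record its prediction at X0.
preds = np.empty(N_TRIALS)
for t in range(N_TRIALS):
    x = rng.uniform(0.0, 1.0, N_TRAIN)
    y = f(x) + rng.normal(0.0, SIGMA, N_TRAIN)
    preds[t] = np.polyval(np.polyfit(x, y, DEGREE), X0)

mu = preds.mean()              # average prediction over training sets
bias_sq = (mu - f(X0)) ** 2    # squared bias
variance = preds.var()         # variance of the prediction
noise = SIGMA ** 2             # irreducible noise

# Left-hand side: squared error against fresh noisy labels at X0.
y_test = f(X0) + rng.normal(0.0, SIGMA, N_TRIALS)
expected_sq_error = np.mean((preds - y_test) ** 2)

print(f"bias^2 + variance + noise = {bias_sq + variance + noise:.3f}")
print(f"simulated expected error  = {expected_sq_error:.3f}")
```

The two printed numbers should agree up to Monte Carlo error. Raising `DEGREE` moves error out of the bias term and into the variance term.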


The Classic Tradeoff

As model complexity increases, what happens to bias and variance?

A very simple model - say, predicting the constant mean of $y$ - has high bias (it’s almost certainly wrong about the relationship between $x$ and $y$) but very low variance (the sample mean barely changes from one training set to the next). A very complex model - a depth-50 decision tree with millions of leaves - has low bias (it can approximate almost any function) but high variance (small changes in the training set produce completely different trees).

The classical picture: as complexity increases, bias falls monotonically and variance rises monotonically. Total error is a U-shaped curve. There is a sweet spot in the middle where the sum of all three terms is minimized. This is the optimal model complexity for your data.

The practical implication is that you cannot simultaneously minimize both bias and variance. Adding model capacity reduces bias but raises variance. Regularization reduces variance but raises bias. Every modeling choice involves this tradeoff.
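A small experiment makes the tradeoff concrete. This is a sketch under assumptions: polynomial degree stands in for "model complexity", and the target $\sin(\pi x)$, the noise level, and the sample sizes are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(np.pi * x)   # hypothetical true function

def sample(n, sigma=0.3):
    x = rng.uniform(-1.0, 1.0, n)
    return x, f(x) + rng.normal(0.0, sigma, n)

x_train, y_train = sample(30)
x_test, y_test = sample(2000)

degrees = list(range(10))      # complexity knob: polynomial degree 0..9
train_mse, test_mse = [], []
for d in degrees:
    coeffs = np.polyfit(x_train, y_train, d)
    train_mse.append(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse.append(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))

best_degree = degrees[int(np.argmin(test_mse))]
for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree {d}: train {tr:.3f}  test {te:.3f}")
print("lowest test error at degree", best_degree)
```

Training error can only go down as the degree grows, because each polynomial family contains the previous one. Test error is lowest at some intermediate degree; which degree exactly depends on the seed and the sample size.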


Double Descent

Here is where the classical story breaks down.

In modern neural networks - particularly large ones with far more parameters than training examples - something surprising happens. The test error follows the classical U-shaped curve up to a certain complexity, peaks at the “interpolation threshold” where the model just barely fits the training data exactly, and then decreases again as the model grows larger.

This phenomenon is called double descent. Over-parameterized models (more parameters than data points) can generalize better than models at the complexity that classical theory calls optimal. Why? Several reasons are hypothesized:

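  • Very large models have many ways to interpolate the training data. In linear regression, gradient descent started from zero converges to the minimum-norm interpolating solution; a similar preference for low-norm, smooth solutions is hypothesized for neural networks, and such solutions tend to generalize well.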
  • Over-parameterization makes the loss landscape more benign - fewer sharp minima, better-conditioned optimization.
  • Implicit regularization from stochastic gradient descent biases learning toward simpler solutions.
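The minimum-norm claim can be verified exactly in the one setting where it is a theorem: over-parameterized linear regression with gradient descent started at zero. The sketch below covers only that linear case; the data are random and the dimensions are arbitrary assumptions. It says nothing by itself about neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                  # far more parameters than examples
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Reference: the minimum-norm solution of Xw = y via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

# Gradient descent on 0.5 * ||Xw - y||^2, started at w = 0.
step = 1.0 / np.linalg.norm(X, ord=2) ** 2   # 1 / (largest singular value)^2
w = np.zeros(d)
for _ in range(2000):
    w -= step * (X.T @ (X @ w - y))

# A different interpolating solution: add a direction that X cannot see.
null_dir = rng.normal(size=d)
null_dir -= np.linalg.pinv(X) @ (X @ null_dir)   # project onto the null space of X
w_other = w_min_norm + null_dir

print("GD fits training data exactly:", np.allclose(X @ w, y, atol=1e-6))
print("GD matches min-norm solution: ", np.allclose(w, w_min_norm, atol=1e-6))
print("norms (min-norm vs other):    ",
      np.linalg.norm(w_min_norm).round(3), np.linalg.norm(w_other).round(3))
```

Both `w` and `w_other` have zero training error, but gradient descent lands on the smaller one. The iterates never leave the row space of `X`, which is why the null-space component stays at zero.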

Double descent is empirically real and reproducible. It challenges the classical bias-variance narrative, which predicted that over-parameterization would always hurt. The formal theory for why it works is still an active research area.


Training Error vs. Generalization Error

Two quantities matter in practice:

Training error is the loss on the data you trained on. The optimizer minimizes this by design. It is almost always optimistic - the model has seen these examples, so it has an unfair advantage. For a sufficiently expressive model with enough training time, training error can reach zero even on random noise.

Generalization error (also called test error) is the expected loss on new, unseen data drawn from the same distribution. This is what you actually care about. It measures whether the model learned the true pattern or just memorized the training set.

The generalization gap is generalization error minus training error. A small gap means the model generalizes. A large gap means overfitting - the model is much better on training data than on new data. Underfitting produces a small gap too, but both training and test error are high.
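Here is a minimal sketch of the "zero training error on random noise" case. It uses a 1-nearest-neighbor classifier as the expressive model (an assumption made for brevity, standing in for any memorizer) and labels that are pure coin flips, so there is nothing to learn.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_nn_predict(x_train, y_train, x_query):
    """Predict the label of the closest training point (1-nearest-neighbor)."""
    # Squared Euclidean distances, shape (n_query, n_train).
    d2 = ((x_query[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d2.argmin(axis=1)]

# Features are random and labels are independent coin flips.
x_train = rng.normal(size=(200, 5))
y_train = rng.integers(0, 2, size=200)
x_test = rng.normal(size=(1000, 5))
y_test = rng.integers(0, 2, size=1000)

train_error = np.mean(one_nn_predict(x_train, y_train, x_train) != y_train)
test_error = np.mean(one_nn_predict(x_train, y_train, x_test) != y_test)
gap = test_error - train_error

print(f"train error {train_error:.2f}, test error {test_error:.2f}, gap {gap:.2f}")
```

Every training point is its own nearest neighbor, so training error is exactly zero. Test error sits near 50%, and the whole of it shows up as generalization gap.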

Discomfort check. “The model is overfitting, so I need more data.” True. But also: “The model is overfitting, so I need a simpler model.” Also true. And: “The model is overfitting, so I need regularization.” Also true. All three interventions reduce variance, and all three are valid. Which to use depends on what is feasible. More data is best but often expensive. Regularization is cheap. Simpler models may sacrifice the capacity you need to solve the problem. There is no universal answer.


Diagnosing the Problem

Before choosing a fix, diagnose which failure mode you have.

Underfitting (high bias): High training error AND high test error, with a small generalization gap. The model is not powerful enough to capture the signal.

  • Fix: use a more expressive model, train longer, use more features, reduce regularization.

Overfitting (high variance): Low training error AND high test error, with a large generalization gap. The model has memorized the training data.

  • Fix: get more data, use regularization, use a simpler model, use dropout, use early stopping.

How do you tell them apart? In practice, you diagnose by looking at learning curves: plot training and validation error as a function of training set size or training steps. Converging curves that are both high indicate underfitting. A large and growing gap indicates overfitting.
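The sketch below computes learning curves as a function of training set size for two models. Everything specific here is an assumption for the demo: the target $\sin(\pi x)$, the noise level, and the use of degree-1 and degree-9 polynomials as the "too simple" and "flexible" models.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(np.pi * x)   # hypothetical true function

def sample(n, sigma=0.3):
    x = rng.uniform(-1.0, 1.0, n)
    return x, f(x) + rng.normal(0.0, sigma, n)

x_val, y_val = sample(2000)    # fixed validation set

def learning_curve(degree, sizes):
    """Return a list of (n, train_mse, val_mse), one entry per training set size."""
    rows = []
    for n in sizes:
        x, y = sample(n)
        coeffs = np.polyfit(x, y, degree)
        train = np.mean((np.polyval(coeffs, x) - y) ** 2)
        val = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
        rows.append((n, float(train), float(val)))
    return rows

sizes = [15, 50, 200, 1000]
underfit = learning_curve(degree=1, sizes=sizes)   # too simple for a sine wave
flexible = learning_curve(degree=9, sizes=sizes)   # overfits when data is scarce

for name, rows in (("degree 1", underfit), ("degree 9", flexible)):
    for n, train, val in rows:
        print(f"{name}  n={n:5d}  train {train:.3f}  val {val:.3f}  gap {val - train:.3f}")
```

For the straight line, training and validation error converge to the same high value: underfitting. For the degree-9 polynomial, the gap is large at small `n` and closes as data is added, which is the signature of a variance problem that more data fixes.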


Train, Validation, and Test Splits

You need separate data for three distinct purposes:

Training set: Fit the model parameters. The optimizer sees this data.

Validation set: Select hyperparameters (learning rate, regularization strength, model depth, number of epochs). This data is not used by the optimizer, but your decisions are influenced by it. This introduces a subtle form of leakage - if you try 100 hyperparameter configurations on the validation set, the winner will look good partly by luck, and you will have overfit to the validation set.

Test set: Final, unbiased evaluation. Use this exactly once, after all decisions are made. Never use the test set to make any modeling decisions. If you evaluate on the test set and then change something, you have contaminated it - your final number is now optimistic.
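The selection effect on the validation set is easy to simulate. In this sketch, each of 100 hypothetical "configurations" is a classifier that guesses at random, so its true accuracy is exactly 50%. The set sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_val, n_test, n_configs = 200, 5000, 100

y_val = rng.integers(0, 2, n_val)
y_test = rng.integers(0, 2, n_test)

# Each row is one "configuration" whose predictions are coin flips.
val_preds = rng.integers(0, 2, (n_configs, n_val))
test_preds = rng.integers(0, 2, (n_configs, n_test))

val_acc = (val_preds == y_val).mean(axis=1)
test_acc = (test_preds == y_test).mean(axis=1)

best = int(val_acc.argmax())   # pick the winner on validation data
print(f"best validation accuracy: {val_acc[best]:.3f}")
print(f"its test accuracy:        {test_acc[best]:.3f}")
```

The winning configuration looks clearly better than chance on the validation set, purely because it was the luckiest of 100. On the untouched test set it falls back to roughly 50%. This is why the test set must stay out of every decision.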

The standard split is roughly 70/15/15 or 80/10/10 depending on dataset size. For very large datasets (millions of examples), the validation and test sets can be much smaller fractions.
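A three-way split only needs a shuffled index array. The helper below is a hypothetical sketch, not a library function; the name `train_val_test_split` and its default fractions are assumptions matching the 70/15/15 split mentioned above.

```python
import numpy as np

def train_val_test_split(n_examples, val_frac=0.15, test_frac=0.15, seed=0):
    """Return three disjoint index arrays covering range(n_examples)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)        # shuffle once, then slice
    n_test = int(round(test_frac * n_examples))
    n_val = int(round(val_frac * n_examples))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

A plain random shuffle assumes the examples are independent. For time series or grouped data (several rows per user, say), you would split by time or by group instead, so that related examples do not land on both sides.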

K-fold cross-validation is used when data is scarce. Split the data into $K$ equal folds; for each fold $k$, train on the other $K-1$ folds and evaluate on fold $k$. Average the $K$ validation errors. This makes full use of all data for both training and evaluation, at the cost of training $K$ models. Common choices are $K = 5$ or $K = 10$.
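The procedure can be sketched in a few lines. The helper names `k_fold_indices` and `cross_val_mse` are hypothetical, and the polynomial model and synthetic data are assumptions used only to have something to evaluate.

```python
import numpy as np

def k_fold_indices(n_examples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each example is validated exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_examples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def cross_val_mse(x, y, degree, k=5):
    """Average validation MSE of a degree-`degree` polynomial over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(x), k):
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[val_idx])
        scores.append(np.mean((pred - y[val_idx]) ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 60)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.3, 60)

cv_by_degree = {d: cross_val_mse(x, y, d) for d in (1, 3, 9)}
print(cv_by_degree)
```

`np.array_split` is used so that the folds may differ in size by one when $K$ does not divide the number of examples. The same fold assignment is reused for every degree, which makes the comparison between models fairer.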


Regularization Techniques

These all reduce variance, at the cost of increased bias.

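L2 regularization (weight decay): Add $\lambda\|\theta\|_2^2$ to the loss. Shrinks all weights toward zero, preventing any single parameter from dominating. Equivalent to MAP estimation under a zero-mean Gaussian prior on the weights. This is the most common form of regularization in neural networks and is typically exposed as weight_decay in optimizers (for plain SGD the two coincide; for adaptive optimizers such as Adam, decoupled weight decay is not identical to an L2 penalty).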
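For linear regression the L2-penalized problem has a closed-form solution (ridge regression), which makes the shrinkage easy to see. This is a sketch on synthetic data; `ridge_fit` is a hypothetical helper, and the squared-error loss is an assumption since the text above does not fix a loss.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + rng.normal(0.0, 0.5, size=50)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda = {lam:6.1f}   ||w|| = {norm:.3f}")
```

The weight norm falls steadily as $\lambda$ grows. The weights are pulled toward zero but generally do not reach it, which is the main contrast with L1.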

L1 regularization (Lasso): Add $\lambda\|\theta\|_1$ to the loss. Produces sparse solutions: many weights become exactly zero. Useful when you believe only a few features are relevant.
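The exact zeros come from the soft-thresholding step that appears when you minimize an L1-penalized loss. The sketch below uses proximal gradient descent (ISTA), which is one standard way to solve the problem, not the only one. The loss $\tfrac12\|Xw - y\|^2$ and the identity design matrix are assumptions chosen so the answer can be checked by hand.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry toward zero by t; entries with |z| <= t become zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Minimize 0.5 * ||Xw - y||^2 + lam * ||w||_1 by proximal gradient descent."""
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2   # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

# With X = I the solution is just soft_threshold(y, lam).
X = np.eye(4)
y = np.array([3.0, -0.2, 0.5, -2.0])
w = lasso_ista(X, y, lam=1.0)
print(w)   # entries: 2, 0, 0, -1
```

The two coefficients whose magnitude is below $\lambda$ are set exactly to zero, and the other two are shrunk by $\lambda$. An L2 penalty on the same data would shrink all four but zero out none.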

Dropout: During training, randomly set each neuron’s activation to zero with probability $p$ (typically $p = 0.1$ to $0.5$). At inference, use all neurons with activations scaled by $(1-p)$, so that the expected activation matches what the next layer saw during training. The intuition: each training step uses a different sub-network. The model cannot rely on any single neuron; it must learn redundant representations. This is approximately equivalent to training an ensemble of up to $2^n$ weight-sharing sub-networks, where $n$ is the number of units that can be dropped.
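Below is a minimal NumPy sketch of the dropout forward pass. The first function follows the description above (scale by $1-p$ at inference). The second is the "inverted" variant that many frameworks use, which rescales during training instead so that inference is a no-op. Both function names are hypothetical.

```python
import numpy as np

def dropout_forward(x, p, training, rng):
    """Classic dropout: drop with probability p in training, scale by (1 - p) at inference."""
    if training:
        keep_mask = rng.random(x.shape) >= p   # True with probability 1 - p
        return x * keep_mask
    return x * (1.0 - p)

def inverted_dropout_forward(x, p, training, rng):
    """Inverted dropout: rescale by 1 / (1 - p) in training, identity at inference."""
    if training:
        keep_mask = rng.random(x.shape) >= p
        return x * keep_mask / (1.0 - p)
    return x

rng = np.random.default_rng(0)
x = np.ones(100_000)
p = 0.3

train_out = dropout_forward(x, p, training=True, rng=rng)
test_out = dropout_forward(x, p, training=False, rng=rng)
inv_train_out = inverted_dropout_forward(x, p, training=True, rng=rng)

print("classic, training mean: ", train_out.mean().round(3))
print("classic, inference mean:", test_out.mean().round(3))
print("inverted, training mean:", inv_train_out.mean().round(3))
```

In both variants the average activation in training matches the activation at inference, which is the whole point of the scaling. The two differ only in where the factor is applied.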

Early stopping: Monitor validation error during training. Stop when validation error starts rising, even if training error is still falling. This is one of the simplest and most effective regularization techniques. It requires no modification to the loss function.
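The stopping rule can be written as a small function over the validation curve. One assumption is added here that the text does not mention: a `patience` parameter, which waits a few epochs before stopping because real validation curves are noisy. The function name and the example numbers are made up.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation errors.

    Training stops once `patience` epochs have passed without a new best error.
    """
    best_error = float("inf")
    best_epoch = -1
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_error, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_errors) - 1, best_epoch   # never triggered: ran to the end

val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_errors, patience=3)
print(stop_epoch, best_epoch)   # 6 3
```

In a real training loop you would save a checkpoint whenever a new best is found and restore the weights from `best_epoch` after stopping. Setting `patience=1` recovers the literal rule of stopping at the first rise.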

Data augmentation: Artificially expand the training set by applying transformations that preserve the label: rotations, flips, crops (for images); synonym replacement, back-translation (for text). More data generally reduces variance. Augmentation creates more data cheaply.
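As a minimal sketch, here is horizontal flipping for a batch of grayscale images stored as a NumPy array. The array layout `(n, height, width)` and the helper name are assumptions. Whether a flip preserves the label depends on the task: it is fine for cats and dogs, but not for digits or text.

```python
import numpy as np

def augment_with_flips(images, labels):
    """Return the original images plus their left-right mirrors, with matching labels.

    images: array of shape (n, height, width); labels: array of shape (n,).
    """
    flipped = images[:, :, ::-1]             # reverse the width axis
    return np.concatenate([images, flipped]), np.concatenate([labels, labels])

images = np.arange(2 * 2 * 3).reshape(2, 2, 3)   # two tiny 2x3 "images"
labels = np.array([0, 1])

aug_images, aug_labels = augment_with_flips(images, labels)
print(aug_images.shape)   # (4, 2, 3)
print(aug_labels)         # [0 1 0 1]
```

In practice augmentations are usually applied randomly on the fly during training rather than materialized up front, so the model rarely sees the exact same input twice.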


Generalization Bounds: The Formal Theory

The bias-variance decomposition describes expected error, but it doesn’t tell you how well you can expect to generalize. Statistical learning theory provides formal bounds.

For a hypothesis class of VC dimension $d$ (the size of the largest set of points that the class can label in all $2^d$ possible ways), with $n$ training samples, the generalization gap satisfies:

$$\text{Gen. gap} \leq \mathcal{O}\left(\sqrt{\frac{d\log(n/d)}{n}}\right)$$

with high probability. Larger hypothesis class → larger $d$ → worse bound. More data → smaller bound. This formalizes the intuition that more complex models require more data to generalize.
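To get a feel for how the bound scales, the sketch below evaluates $\sqrt{d\log(n/d)/n}$ directly. The constant hidden inside $\mathcal{O}(\cdot)$ and the confidence term are dropped (set to 1 and 0 respectively), which is an assumption: the numbers show the trend, not an actual guarantee.

```python
import math

def vc_gap_bound(d, n):
    """Evaluate sqrt(d * log(n / d) / n), ignoring the constants hidden by O(.)."""
    if n <= d:
        raise ValueError("this form of the bound is only informative for n > d")
    return math.sqrt(d * math.log(n / d) / n)

# Fixed capacity d = 100, growing training set.
bounds = {n: vc_gap_bound(d=100, n=n) for n in (1_000, 10_000, 100_000, 1_000_000)}
for n, b in bounds.items():
    print(f"d = 100, n = {n:>9,}: bound ~ {b:.3f}")

# Fixed training set, 10x more capacity.
print("d = 1000, n = 1,000,000:", round(vc_gap_bound(1000, 1_000_000), 3))
```

Each tenfold increase in $n$ shrinks the bound by a bit less than $\sqrt{10}$ because of the logarithm, and raising $d$ at fixed $n$ makes it worse.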

The VC bound is often loose in practice - neural networks have astronomically large VC dimension but generalize far better than the bound predicts. This is one of the deep mysteries motivating the double descent discussion above, and an active area of research.

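A tighter framework for neural networks uses PAC-Bayesian bounds. These depend not on the capacity of the whole hypothesis class but on how far a distribution over the learned weights moves from a prior chosen before seeing the data. In practice they reward solutions that tolerate weight perturbations, which connects them to the flatness of the loss landscape near the solution found - flat minima tend to generalize better.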


Summary

| Concept | Definition | Symptom | Fix |
|---|---|---|---|
| Bias | Systematic error from wrong assumptions | High train AND test error | More expressive model |
| Variance | Sensitivity to training data | Low train, high test error | Regularization, more data |
| Irreducible noise | True randomness in data | Irreducible | Better data collection |
| Underfitting | High bias | Both errors high, small gap | More capacity |
| Overfitting | High variance | Small train error, large gap | Regularization, data |
| Generalization gap | Test error $-$ train error | Large gap = overfitting | Reduce variance |
| Double descent | Error drops again at very high complexity | N/A | Use large models with care |

The bias-variance tradeoff is not a problem to solve - it is a constraint to navigate. Every model is somewhere on the spectrum. Your job is to diagnose where your model sits, understand why, and choose the right intervention.

