Helpful context:


A child touches a hot stove once and never does it again. An ML algorithm “learns” the same lesson by processing the same examples thousands of times, adjusting its internal parameters each pass.

Is that really learning? The child generalized instantly from a single experience. The algorithm requires massive repetition, enormous compute, and still fails in ways no child would. And yet the algorithm can diagnose cancer from images, translate between languages it has never seen paired together, and predict protein structures that took scientists decades to figure out.

What is actually happening? Strip away the marketing language - “intelligence”, “thinking”, “understanding” - and what you find underneath is a specific mathematical framework: a function class, a loss function, an optimizer, and a theorem that says when the learned function will work on new data it has never seen. Everything else is a special case.


The Learning Problem

Here is the formal setup.

There is an unknown target function $f: \mathcal{X} \to \mathcal{Y}$. In supervised learning, $\mathcal{X}$ is the input space (images, text, sensor readings) and $\mathcal{Y}$ is the output space (labels, numbers, categories). The function $f$ describes the true relationship between inputs and outputs - it is what we are trying to approximate.

We do not have access to $f$ directly. Instead, we observe a training set:

$$S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\},$$

where each $(x_i, y_i)$ is drawn independently from some unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$.

We want to find a hypothesis $h: \mathcal{X} \to \mathcal{Y}$ that does well on new inputs drawn from $\mathcal{D}$ - not just on the training examples we happened to see. This is the fundamental tension of machine learning: we optimize on $S$, but we care about performance on $\mathcal{D}$.

A few immediate observations:

  • We never get to see $f$ or $\mathcal{D}$ directly. We see only a finite sample.
  • $S$ might be noisy: $y_i = f(x_i) + \varepsilon$ for some noise $\varepsilon$.
  • We want $h \approx f$ everywhere, not just at the specific points in $S$.

This is a harder problem than it looks. Any function that perfectly memorizes the training set - assigns $h(x_i) = y_i$ for all $i$ and behaves arbitrarily elsewhere - has zero training error. But it is useless: it has not learned anything about the underlying relationship, only the specific data points.


The Three Components of Every ML System

Every machine learning algorithm, from the simplest linear regression to GPT-4, is built from three components.

1. Hypothesis class $\mathcal{H}$: the set of functions we are willing to consider. Examples:

  • Linear functions: $h(x) = w \cdot x + b$.
  • Degree-$d$ polynomials.
  • Decision trees with depth at most $D$.
  • Neural networks with a specific architecture.

The hypothesis class defines the inductive bias of the algorithm: what kinds of solutions it is capable of finding. If the true function $f$ is not (approximately) in $\mathcal{H}$, no amount of data will help - you cannot learn what your model cannot represent. Choosing $\mathcal{H}$ is as important as choosing the optimizer.

2. Loss function $L$: how we measure the cost of a prediction. Examples:

  • Squared error: $L(h(x), y) = (h(x) - y)^2$. Standard for regression.
  • Cross-entropy: $L(h(x), y) = -y \log h(x) - (1-y) \log(1-h(x))$. Standard for binary classification.
  • Hinge loss: $L(h(x), y) = \max(0, 1 - y \cdot h(x))$. Used in SVMs.
  • 0-1 loss: $L(h(x), y) = \mathbf{1}[h(x) \neq y]$. The “natural” classification error, but not differentiable.

The choice of loss function is not arbitrary. It encodes what you actually care about - and the loss you optimize (because it is differentiable) and the loss you care about (accuracy on real data) are often different things.

3. Optimizer: the procedure that searches $\mathcal{H}$ for the best hypothesis. For small, convex problems, the optimizer can find the global minimum exactly. For neural networks, we use stochastic gradient descent (SGD) and its variants (Adam, RMSProp): iteratively move in the direction of decreasing loss, using gradient information.


Empirical Risk Minimization

Define two quantities:

Empirical risk (training error):

$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i).$$

This is what you minimize during training. You can compute it.

True risk (generalization error):

$$R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[L(h(x), y)].$$

This is what you actually care about. You cannot compute it - $\mathcal{D}$ is unknown. You can only estimate it using held-out data.

Empirical Risk Minimization (ERM): find the hypothesis $\hat{h}$ that minimizes empirical risk:

$$\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}(h).$$

The central question of statistical learning theory: under what conditions does small empirical risk guarantee small true risk? When does doing well on the training set mean doing well in the world?

The gap $R(h) - \hat{R}(h)$ is the generalization gap. This is what validation and test sets measure: how much worse the model does on data it has not seen.


The Fundamental Tension: Underfitting and Overfitting

More complex hypothesis classes (larger $\mathcal{H}$) can represent more functions, including the true function $f$. This is good.

But a more complex class also has more capacity to fit noise. A polynomial of degree 100 fit to 10 data points will pass through all 10 points perfectly - and oscillate wildly in between. It has not learned the function; it has memorized the data. This is overfitting: small empirical risk, large true risk.

Too simple a hypothesis class misses real structure in the data. A linear model fit to data generated by a sine wave will miss the curvature regardless of how much data you have. This is underfitting: large empirical risk, large true risk.

The right hypothesis class is complex enough to represent the true signal, but not so complex that it memorizes noise. This is the bias-variance tradeoff:

  • Bias: systematic error from using too simple a model (underfitting). High bias means the model cannot represent the true function.
  • Variance: sensitivity to the specific training set. High variance means the model changes a lot when you replace the training data with a new sample from $\mathcal{D}$.

More formally, for squared error loss, the expected error decomposes as:

$$\mathbb{E}[(h(x) - f(x))^2] = \text{Bias}^2 + \text{Variance} + \text{Noise}.$$

The noise term is irreducible (it is the inherent randomness in the data). Bias and variance trade off against each other as you change the complexity of $\mathcal{H}$.


A Bound on Generalization

Statistical learning theory makes the bias-variance intuition precise. Here is a fundamental result.

Theorem (Finite hypothesis class, Hoeffding bound). Suppose $L \in [0, 1]$ and $|\mathcal{H}|$ is finite. Then with probability at least $1 - \delta$ over a random draw of training set $S$ of size $n$:

$$R(\hat{h}) \leq \hat{R}(\hat{h}) + \sqrt{\frac{\log(|\mathcal{H}|/\delta)}{2n}}.$$

Two key insights from this bound:

  1. More data ($n$ larger) tightens the bound. The generalization gap shrinks as $1/\sqrt{n}$. Doubling your data reduces the gap by a factor of $\sqrt{2}$.

  2. More complex model ($|\mathcal{H}|$ larger) loosens the bound. The $\log |\mathcal{H}|$ term penalizes model complexity. If $|\mathcal{H}|$ doubles, the bound increases by $\sqrt{\log 2 / (2n)}$ - a small but nonzero cost.

The bound makes precise the intuitive principle: with a fixed dataset, you pay a price for searching a larger hypothesis class. You might find a better fit to the training data, but you have less confidence that this fit generalizes.

For infinite hypothesis classes - like neural networks - the right complexity measure is not $|\mathcal{H}|$ but something like Rademacher complexity or VC dimension, which measure the effective capacity of the class rather than its literal size.

Discomfort check. Modern deep learning seems to violate this picture entirely. Very large neural networks with billions of parameters are trained to near-zero training error on millions of examples - and they generalize well. The classical bias-variance tradeoff says they should overfit catastrophically. They don’t. This is called the “double descent” phenomenon: as model size increases past a threshold, test error initially increases (classical overfitting) and then decreases again, eventually below the optimal classical model. The explanation involves implicit regularization from SGD, the geometry of the loss landscape, and the structure of real data. Classical bounds are not tight enough to explain modern deep learning - this is an active research area. But the ERM framework and the bias-variance intuition remain the right mental model for understanding what is happening.


The No Free Lunch Theorem

Here is an uncomfortable result.

No Free Lunch Theorem (Wolpert, 1996). For any learning algorithm $A$ and any performance measure, there exists a distribution $\mathcal{D}$ on which $A$ performs no better than random guessing - and for which any other algorithm $A'$ outperforms $A$.

In other words: without assumptions about the problem, no algorithm is universally better than any other. Every algorithm that works well on some problems works badly on others. There is no universally optimal learning algorithm.

This is not pessimistic - it is clarifying. It means that inductive bias is not a weakness but a necessity. When a neural network assumes that useful features can be built by composing local patterns (as in convolutional networks), or that sequences have useful structure (as in transformers), it is restricting the hypothesis class in ways that make learning feasible. These restrictions are not arbitrary - they are engineering choices based on domain knowledge about the structure of images, text, and other data.

Good ML is not about finding the most general algorithm. It is about encoding the right assumptions for your problem.


What Makes Features Matter

Classical machine learning requires hand-designed features: you transform raw inputs (pixel values, word counts) into a representation that makes the problem easier (edges, topic vectors). The model then learns a simple function (often linear) over these features. Feature engineering is the hard part.

Deep learning replaces hand-designed features with learned ones. A convolutional neural network learns edge detectors in its first layer, texture detectors in its second, part detectors in its third, and object detectors in its fourth - all from raw pixels. The features are learned jointly with the classifier.

Both are instances of the same ERM framework. The difference is where in the hypothesis class the representation is built:

  • Classical ML: $h(x) = g(\phi(x))$ where $\phi$ is hand-designed and $g$ is learned.
  • Deep learning: $h(x) = g_L \circ g_{L-1} \circ \cdots \circ g_1(x)$ where all $g_i$ are learned jointly.

Deep learning works when the learning problem has enough data and compute that learning $\phi$ from scratch is tractable, and when the right features are hard to specify by hand.


Supervised, Unsupervised, Reinforcement Learning

The ERM framework, as stated, covers supervised learning: you have labeled pairs $(x_i, y_i)$ and optimize a loss that compares predictions to labels.

Unsupervised learning has no labels. The training set is $\{x_1, \ldots, x_n\}$ alone. The goal is to learn the structure of the distribution $\mathcal{D}$ over inputs: clusters, low-dimensional representations, generative models. Examples: $k$-means clustering, principal component analysis (PCA), variational autoencoders (VAEs), generative adversarial networks (GANs).

Reinforcement learning (RL) has neither labels nor a fixed dataset. An agent interacts with an environment: it takes an action $a$ in state $s$, receives a reward $r$, and transitions to a new state $s'$. The goal is to learn a policy $\pi(a \mid s)$ - a distribution over actions given the current state - that maximizes long-run cumulative reward. The data is generated by the agent’s own behavior, which creates a feedback loop absent in supervised learning.

All three paradigms ask the same underlying question: what can you infer about a distribution from samples? They differ in what the samples look like and what “inference” means.


What Makes Deep Learning Different

Deep learning is not a different mathematical framework. It is ERM with:

  • Very expressive hypothesis classes: neural networks with many layers and millions to billions of parameters. These can approximate any continuous function (universal approximation theorem), at least in principle.
  • Backpropagation: an efficient algorithm for computing gradients of the loss with respect to all parameters simultaneously. Without backprop, training deep networks would be computationally infeasible. Backpropagation is just the chain rule of calculus, applied systematically.
  • Large datasets: ImageNet has 14 million images. Common Crawl, used to train language models, has petabytes of text. With enough data, very flexible models can be trained without catastrophic overfitting.
  • GPU compute: matrix multiplications (the core operation in neural network forward and backward passes) are massively parallelizable on GPUs. Training that took months in 2010 takes hours today.

None of this changes the fundamental framework. The model still minimizes empirical risk over a hypothesis class. The generalization question is still how closely training performance predicts test performance. The optimizer still uses gradient information.

What changed is scale. Scale changed what hypothesis classes are feasible, what datasets are available, and what can be computed in a reasonable time. The math is the same. The scale changed everything.

Discomfort check. “Learning” sounds like it implies consciousness, understanding, or intelligence. It does not. Every ML algorithm is optimization of a loss function on data, nothing more. A neural network that classifies images is minimizing a number. When people talk about a language model “understanding” a sentence, they mean it produces outputs that are useful - outputs that correlate with what a human would say. Whether there is anything it is like to be the model is a different question entirely, and not one the mathematics has an opinion on. The outputs can be impressive and the underlying process entirely mechanical. Do not confuse performance with understanding.


Summary

Concept Definition
Hypothesis class $\mathcal{H}$ Set of functions the algorithm can represent; encodes inductive bias
Loss function $L$ Measure of prediction error; determines what the optimizer minimizes
Empirical risk $\hat{R}(h)$ Average loss on training set; what we can compute
True risk $R(h)$ Expected loss on the data distribution; what we care about
ERM Find $\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}(h)$
Generalization gap $R(h) - \hat{R}(h)$; measured by validation/test performance
Bias Error from too-simple hypothesis class; underfitting
Variance Sensitivity to training set; overfitting
Generalization bound $R(\hat{h}) \leq \hat{R}(\hat{h}) + O\left(\sqrt{\log
No Free Lunch No algorithm is universally best; inductive bias is necessary
Backpropagation Chain rule applied to compute gradients in neural networks
Deep learning vs. classical ML Same ERM framework; different where in the pipeline features are learned

Machine learning, at its core, is an answer to a single question: given data drawn from an unknown distribution, how do you find a function that works well on new data from the same distribution? The answer requires three choices - what functions you consider (hypothesis class), how you measure quality (loss), and how you search (optimizer) - and one theoretical guarantee: that under certain conditions, what you find on the training set will transfer to the real world.

Everything else - convolutional networks, transformers, reinforcement learning, generative models - is a specific instantiation of this framework, engineered to work on a particular kind of data with a particular kind of structure. The framework is simple. Making it work at scale is where the hard problems live.


Read next: