Machine Learning First Principles
What is Learning from Data?
Machine learning is the study of algorithms that improve their performance on a task by exposure to data. To make this precise, we need a formal vocabulary.
Definition (Learning problem). A supervised learning problem is defined by three ingredients:
- An input space $\mathcal{X}$ (e.g., $\mathbb{R}^d$ for $d$-dimensional features),
- An output space $\mathcal{Y}$ (e.g., $\mathbb{R}$ for regression, $\{0,1\}$ for binary classification),
- An unknown data-generating distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$.
We observe a training set $\mathcal{S} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn i.i.d. from $\mathcal{D}$. Our goal is to find a function $h: \mathcal{X} \to \mathcal{Y}$ that predicts well on new draws from $\mathcal{D}$.
Definition (Hypothesis class). A hypothesis class $\mathcal{H}$ is a set of candidate functions $h: \mathcal{X} \to \mathcal{Y}$. For example, all linear functions $h(x) = w^T x + b$ form a hypothesis class parameterized by $(w, b)$.
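To make the notion of a hypothesis class concrete, here is a minimal sketch (in Python with NumPy; the function names and parameter values are illustrative, not from the text) of the linear class, where each choice of $(w, b)$ picks out one member $h \in \mathcal{H}$:

```python
import numpy as np

def linear_hypothesis(w, b):
    """Return one member h of the linear hypothesis class, fixed by (w, b)."""
    def h(x):
        # x: array of shape (d,) or (n, d); prediction is w^T x + b
        return np.asarray(x) @ w + b
    return h

# Two different hypotheses from the same class H
h1 = linear_hypothesis(np.array([1.0, -2.0]), 0.5)
h2 = linear_hypothesis(np.array([0.0, 3.0]), -1.0)
x = np.array([2.0, 1.0])
print(h1(x), h2(x))  # 0.5 and 2.0
```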
Risk and Empirical Risk
To measure how well a hypothesis performs, we need a loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}$ that quantifies the cost of predicting $\hat{y}$ when the true label is $y$.
Definition (Risk). The true risk (population risk) of a hypothesis $h$ is
$$R(h) = \mathbb{E}_{(X,Y) \sim \mathcal{D}}[\ell(h(X), Y)].$$
This is the quantity we actually care about - average loss on the underlying distribution. But $\mathcal{D}$ is unknown, so $R(h)$ is inaccessible directly.
Definition (Empirical risk). The empirical risk is the average loss on the observed training set:
$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i).$$
By the law of large numbers, $\hat{R}(h) \to R(h)$ as $n \to \infty$ for any fixed $h$. However, when we optimize $h$ over the training set, this convergence argument breaks down - the selected $h$ depends on the data.
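The following sketch (the synthetic data-generating process and squared-error loss are assumptions chosen for illustration) computes $\hat{R}(h)$ for a fixed hypothesis chosen independently of the data, and shows it concentrating near the true risk as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def squared_loss(y_hat, y):
    return (y_hat - y) ** 2

def empirical_risk(h, X, y, loss):
    """Average loss of hypothesis h over a finite sample."""
    return np.mean([loss(h(x_i), y_i) for x_i, y_i in zip(X, y)])

# Assumed synthetic distribution D: y = 2x + Gaussian noise with std 0.5
def sample(n):
    X = rng.normal(size=(n, 1))
    y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)
    return X, y

h = lambda x: 2.0 * x[0]  # a fixed hypothesis, chosen before seeing any data
for n in (10, 1000, 100000):
    X, y = sample(n)
    print(n, empirical_risk(h, X, y, squared_loss))  # approaches R(h) = 0.25 as n grows
```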
Empirical Risk Minimization
Definition (ERM). Empirical Risk Minimization selects the hypothesis that minimizes empirical risk:
$$\hat{h} = \underset{h \in \mathcal{H}}{\arg\min} \; \hat{R}(h).$$
ERM is the foundational algorithmic principle underlying most of supervised learning. Nearly every method - linear regression, logistic regression, neural networks - can be viewed as solving an ERM problem with a particular choice of $\mathcal{H}$ and $\ell$.
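A minimal ERM sketch, assuming a synthetic training set and a deliberately finite hypothesis class (linear functions $h_w(x) = wx$ with $w$ restricted to a grid), so that the $\arg\min$ can be taken by direct search:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed training set: y = 1.5 x + noise
X = rng.uniform(-1, 1, size=50)
y = 1.5 * X + rng.normal(scale=0.2, size=50)

# A finite hypothesis class: h_w(x) = w * x for w on a grid
candidate_ws = np.linspace(-3, 3, 601)

def empirical_risk(w):
    return np.mean((w * X - y) ** 2)

# ERM: pick the hypothesis with the smallest empirical risk
w_hat = candidate_ws[np.argmin([empirical_risk(w) for w in candidate_ws])]
print(w_hat)  # close to the true slope 1.5
```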
Generalization: The Central Problem
The gap between empirical and true risk is called the generalization gap:
$$\text{Generalization gap} = R(\hat{h}) - \hat{R}(\hat{h}).$$
If this gap is large, the model has overfit: it has memorized the training data without capturing the underlying pattern. If both $R(\hat{h})$ and $\hat{R}(\hat{h})$ are large, the model has underfit: $\mathcal{H}$ is not expressive enough to capture the pattern.
Theorem (Generalization bound, finite $\mathcal{H}$, informal). Suppose $\mathcal{H}$ is finite and $\ell \in [0,1]$. Then with probability at least $1 - \delta$,
$$R(\hat{h}) \leq \hat{R}(\hat{h}) + \sqrt{\frac{\log|\mathcal{H}| + \log(1/\delta)}{2n}}.$$
This bound reveals the fundamental tension in learning:
- A larger $\mathcal{H}$ (more expressive model) can achieve lower $\hat{R}(\hat{h})$, but the generalization term grows.
- A smaller $\mathcal{H}$ generalizes better but may underfit.
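To see the trade-off numerically, the sketch below evaluates the additive term from the finite-class bound for a few assumed values of $|\mathcal{H}|$ and $n$, with $\delta = 0.05$:

```python
import numpy as np

def generalization_term(H_size, n, delta=0.05):
    """Additive term sqrt((log|H| + log(1/delta)) / (2n)) from the finite-class bound."""
    return np.sqrt((np.log(H_size) + np.log(1.0 / delta)) / (2 * n))

# Larger hypothesis classes need more data for the same guarantee
for H_size in (10, 10**6):
    for n in (100, 10000):
        print(H_size, n, round(generalization_term(H_size, n), 3))
```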
For infinite hypothesis classes (e.g., neural networks), the complexity is measured by VC dimension or Rademacher complexity instead of $|\mathcal{H}|$, but the qualitative tension remains.
Overfitting and Underfitting
Overfitting occurs when the model fits training data well but fails on test data: low $\hat{R}$, high $R$. A model that simply memorizes every training pair $\{(x_i, y_i)\}$ achieves $\hat{R} = 0$ but generalizes no better than random on unseen inputs - it has learned nothing about $\mathcal{D}$.
Underfitting occurs when the model lacks the capacity to fit even the training data: both $\hat{R}$ and $R$ are high. Restricting a neural network to a single linear layer when the true relationship is nonlinear is an example.
The bias-variance trade-off formalizes this: under squared loss, the expected test error at a point (averaging over random training sets and label noise) decomposes as
$$\mathbb{E}[(h(x) - y)^2] = \underbrace{\text{Bias}^2}_{\text{underfitting}} + \underbrace{\text{Variance}}_{\text{overfitting}} + \sigma^2_{\text{noise}}.$$
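A small Monte Carlo sketch of the decomposition (the true function, noise level, and polynomial models are all assumptions chosen for illustration): it repeatedly redraws a training set, fits polynomials of increasing degree, and estimates the squared bias and variance of the prediction at one test point.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(3 * x)   # assumed true regression function
sigma = 0.3                   # assumed noise standard deviation
x0 = 0.5                      # test point at which we decompose the error

def fit_and_predict(degree, n=30):
    """Fit a polynomial of the given degree to a fresh training set, predict at x0."""
    X = rng.uniform(-1, 1, size=n)
    y = f(X) + rng.normal(scale=sigma, size=n)
    coeffs = np.polyfit(X, y, deg=degree)
    return np.polyval(coeffs, x0)

for degree in (1, 3, 12):     # underfit, reasonable fit, overfit
    preds = np.array([fit_and_predict(degree) for _ in range(2000)])
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    print(degree, round(bias2, 4), round(var, 4))
```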
The No Free Lunch Theorem
No single learning algorithm works best for every problem. More precisely:
Theorem (No Free Lunch, informal). For any learning algorithm $A$ and any performance measure, there exists a distribution $\mathcal{D}$ on which $A$ performs no better than random guessing on unseen inputs.
The implication is not that ML is hopeless but that assumptions matter. Every practical learning algorithm embeds inductive biases - preferences for certain types of hypotheses. Choosing the right inductive bias for a problem domain is what algorithm design is really about.
Three Components of a Machine Learning System
Every supervised learning system has three interacting components:
- Model ($\mathcal{H}$): The family of functions the algorithm searches over. Choice of model encodes assumptions about the problem structure (linearity, smoothness, compositionality).
- Loss function ($\ell$): The objective being minimized. Common choices include squared error $\ell(\hat{y}, y) = (\hat{y} - y)^2$ for regression and cross-entropy $\ell(\hat{p}, y) = -y \log \hat{p} - (1-y)\log(1-\hat{p})$ for classification.
- Optimizer: The algorithm that searches $\mathcal{H}$ to minimize $\hat{R}$. For differentiable losses, gradient-based methods (gradient descent and its variants) are standard.
These three choices are largely independent. The same model can be trained with different losses; the same loss can be minimized by different optimizers.
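A compact sketch that wires the three components together: a linear model, squared-error loss, and plain gradient descent as the optimizer (the synthetic data, learning rate, and iteration count are assumptions for illustration). Swapping the loss or the optimizer would leave the other components untouched, which is the independence point made above.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed synthetic training data: y = 2 x1 - x2 + 0.5 + noise
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.5 + rng.normal(scale=0.1, size=200)

# Model: linear, parameterized by (w, b)
def predict(w, b, X):
    return X @ w + b

# Loss: squared error, averaged into the empirical risk
def empirical_risk(w, b):
    return np.mean((predict(w, b, X) - y) ** 2)

# Optimizer: plain gradient descent on the empirical risk
w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    residual = predict(w, b, X) - y
    grad_w = 2 * X.T @ residual / len(y)
    grad_b = 2 * residual.mean()
    w, b = w - lr * grad_w, b - lr * grad_b

print(w, b, empirical_risk(w, b))  # parameters near (2, -1, 0.5), risk near the noise floor
```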
Supervised, Unsupervised, and Self-Supervised Learning
The framework above describes supervised learning, where every training example has a label $y_i$ provided by an external source. Not all learning settings match this description.
Unsupervised learning works with unlabeled data $\{x_1, \ldots, x_n\}$ drawn from a marginal distribution $p(x)$. The goal might be to learn the structure of $p(x)$ (density estimation), to compress $x$ into a low-dimensional representation (dimensionality reduction), or to partition the data into coherent groups (clustering).
Self-supervised learning creates supervision from the data itself. A model is trained to predict one part of an input from another - for example, predicting the next word in a sentence given all preceding words. No human-provided labels are needed. Modern large language models are trained this way.
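A toy illustration of how self-supervised targets are manufactured from raw data: each prefix of a token sequence becomes the input and the following token becomes the label (the sentence is an arbitrary example, not from the text):

```python
# Turn an unlabeled token sequence into (context, next-token) training pairs,
# the setup used for next-word prediction.
tokens = "the cat sat on the mat".split()

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ...
```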
Examples
Regression. Predict a patient’s blood pressure from age, weight, and lifestyle variables. Here $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \mathbb{R}$, and a natural loss is squared error. ERM over the class of linear functions yields ordinary least squares.
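A sketch of that pipeline with synthetic stand-in data (the feature coefficients and noise level are assumptions): ERM with squared loss over linear functions reduces to solving a least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in for (age, weight, lifestyle score) -> blood pressure
X = rng.normal(size=(100, 3))
y = X @ np.array([0.8, 1.2, -0.5]) + 120 + rng.normal(scale=2.0, size=100)

# Ordinary least squares: solve the least-squares problem for [w; b]
X_aug = np.hstack([X, np.ones((100, 1))])  # append a column of ones for the intercept
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = theta[:3], theta[3]
print(w, b)  # close to (0.8, 1.2, -0.5) and 120
```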
Binary classification. Classify an email as spam or not-spam. Here $\mathcal{Y} = \{0, 1\}$. A common loss is binary cross-entropy. ERM over linear functions with a sigmoid link yields logistic regression.
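A corresponding sketch for logistic regression, again on synthetic data (the labeling rule, learning rate, and iteration count are assumptions): a linear score is passed through a sigmoid and the average cross-entropy is minimized by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic binary labels from a noisy linear rule
X = rng.normal(size=(300, 2))
y = (X @ np.array([1.0, -2.0]) + 0.3 * rng.normal(size=300) > 0).astype(float)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(1000):
    p = sigmoid(X @ w + b)           # predicted P(y = 1 | x)
    grad_w = X.T @ (p - y) / len(y)  # gradient of average cross-entropy
    grad_b = (p - y).mean()
    w, b = w - lr * grad_w, b - lr * grad_b

print(w, b, ((sigmoid(X @ w + b) > 0.5) == y).mean())  # training accuracy
```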
Why memorizing training data fails. Consider a dataset of $n$ points. The hypothesis “output $y_i$ if $x = x_i$, else output 0” achieves $\hat{R} = 0$ on the training set. But on any test point $x \notin \{x_1, \ldots, x_n\}$ it predicts 0 regardless of the true label - the true risk can be arbitrarily high. This is the archetype of overfitting: zero training error with no generalization.
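The memorizing hypothesis can be written out directly (synthetic data assumed); it attains zero empirical risk, yet its predictions on fresh draws are unrelated to the true labels:

```python
import numpy as np

rng = np.random.default_rng(6)

X_train = rng.normal(size=(20, 1))
y_train = np.sin(3 * X_train[:, 0])  # assumed true labels

def memorizer(x):
    """Output y_i if x exactly matches a training point, else 0."""
    for x_i, y_i in zip(X_train, y_train):
        if np.array_equal(x, x_i):
            return y_i
    return 0.0

train_risk = np.mean([(memorizer(x) - y) ** 2 for x, y in zip(X_train, y_train)])
X_test = rng.normal(size=(20, 1))
y_test = np.sin(3 * X_test[:, 0])
test_risk = np.mean([(memorizer(x) - y) ** 2 for x, y in zip(X_test, y_test)])
print(train_risk, test_risk)  # 0.0 on the training set, well above zero on fresh data
```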