Machine Learning First Principles
What is Learning from Data?
Machine learning is the study of algorithms that improve their performance on a task by exposure to data. To make this precise, we need a formal vocabulary.
Definition (Learning problem). A supervised learning problem is defined by three ingredients:
- An input space $\mathcal{X}$ (e.g., $\mathbb{R}^d$ for $d$-dimensional features),
- An output space $\mathcal{Y}$ (e.g., $\mathbb{R}$ for regression, $\{0,1\}$ for binary classification),
- An unknown data-generating distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$.
We observe a training set $\mathcal{S} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn i.i.d. from $\mathcal{D}$. Our goal is to find a function $h: \mathcal{X} \to \mathcal{Y}$ that predicts well on new draws from $\mathcal{D}$.
Definition (Hypothesis class). A hypothesis class $\mathcal{H}$ is a set of candidate functions $h: \mathcal{X} \to \mathcal{Y}$. For example, all linear functions $h(x) = w^T x + b$ form a hypothesis class parameterized by $(w, b)$.
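To make the notion of a hypothesis class concrete, here is a minimal sketch (in Python with NumPy; the function names and parameter values are illustrative, not from the text) of the linear class, where each choice of $(w, b)$ picks out one member $h \in \mathcal{H}$:

```python
import numpy as np

def linear_hypothesis(w, b):
    """Return one member h of the linear hypothesis class, fixed by (w, b)."""
    def h(x):
        # x: array of shape (d,) or (n, d); prediction is w^T x + b
        return np.asarray(x) @ w + b
    return h

# Two different hypotheses from the same class H
h1 = linear_hypothesis(np.array([1.0, -2.0]), 0.5)
h2 = linear_hypothesis(np.array([0.0, 3.0]), -1.0)
x = np.array([2.0, 1.0])
print(h1(x), h2(x))  # 0.5 and 2.0
```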
Risk and Empirical Risk
To measure how well a hypothesis performs, we need a loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}$ that quantifies the cost of predicting $\hat{y}$ when the true label is $y$.
Definition (Risk). The true risk (population risk) of a hypothesis $h$ is
$$R(h) = \mathbb{E}_{(X,Y) \sim \mathcal{D}}[\ell(h(X), Y)].$$
This is the quantity we actually care about - average loss on the underlying distribution. But $\mathcal{D}$ is unknown, so $R(h)$ is inaccessible directly.
Definition (Empirical risk). The empirical risk is the average loss on the observed training set:
$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i).$$
By the law of large numbers, $\hat{R}(h) \to R(h)$ as $n \to \infty$ for any fixed $h$. However, when we optimize $h$ over the training set, this convergence argument breaks down - the selected $h$ depends on the data.
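The following sketch (the synthetic data-generating process and squared-error loss are assumptions chosen for illustration) computes $\hat{R}(h)$ for a fixed hypothesis chosen independently of the data, and shows it concentrating near the true risk as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def squared_loss(y_hat, y):
    return (y_hat - y) ** 2

def empirical_risk(h, X, y, loss):
    """Average loss of hypothesis h over a finite sample."""
    return np.mean([loss(h(x_i), y_i) for x_i, y_i in zip(X, y)])

# Assumed synthetic distribution D: y = 2x + Gaussian noise with std 0.5
def sample(n):
    X = rng.normal(size=(n, 1))
    y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)
    return X, y

h = lambda x: 2.0 * x[0]  # a fixed hypothesis, chosen before seeing any data
for n in (10, 1000, 100000):
    X, y = sample(n)
    print(n, empirical_risk(h, X, y, squared_loss))  # approaches R(h) = 0.25 as n grows
```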
Empirical Risk Minimization
Definition (ERM). Empirical Risk Minimization selects the hypothesis that minimizes empirical risk:
$$\hat{h} = \underset{h \in \mathcal{H}}{\arg\min} \; \hat{R}(h).$$
ERM is the foundational algorithmic principle underlying most of supervised learning. Nearly every method - linear regression, logistic regression, neural networks - can be viewed as solving an ERM problem with a particular choice of $\mathcal{H}$ and $\ell$.
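A minimal ERM sketch, assuming a synthetic training set and a deliberately finite hypothesis class (linear functions $h_w(x) = wx$ with $w$ restricted to a grid), so that the $\arg\min$ can be taken by direct search:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed training set: y = 1.5 x + noise
X = rng.uniform(-1, 1, size=50)
y = 1.5 * X + rng.normal(scale=0.2, size=50)

# A finite hypothesis class: h_w(x) = w * x for w on a grid
candidate_ws = np.linspace(-3, 3, 601)

def empirical_risk(w):
    return np.mean((w * X - y) ** 2)

# ERM: pick the hypothesis with the smallest empirical risk
w_hat = candidate_ws[np.argmin([empirical_risk(w) for w in candidate_ws])]
print(w_hat)  # close to the true slope 1.5
```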
Generalization: The Central Problem
The gap between empirical and true risk is called the generalization gap:
$$\text{Generalization gap} = R(\hat{h}) - \hat{R}(\hat{h}).$$
If this gap is large, the model has overfit: it has memorized the training data without capturing the underlying pattern. If both $R(\hat{h})$ and $\hat{R}(\hat{h})$ are large, the model has underfit: $\mathcal{H}$ is not expressive enough to capture the pattern.
Theorem (Generalization bound, finite $\mathcal{H}$, informal). Suppose $\mathcal{H}$ is finite and $\ell \in [0,1]$. Then with probability at least $1 - \delta$,
$$R(\hat{h}) \leq \hat{R}(\hat{h}) + \sqrt{\frac{\log|\mathcal{H}| + \log(1/\delta)}{2n}}.$$
This bound reveals the fundamental tension in learning:
- A larger $\mathcal{H}$ (more expressive model) can achieve lower $\hat{R}(\hat{h})$, but the generalization term grows.
- A smaller $\mathcal{H}$ generalizes better but may underfit.
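To see the trade-off numerically, the sketch below evaluates the additive term from the finite-class bound for a few assumed values of $|\mathcal{H}|$ and $n$, with $\delta = 0.05$:

```python
import numpy as np

def generalization_term(H_size, n, delta=0.05):
    """Additive term sqrt((log|H| + log(1/delta)) / (2n)) from the finite-class bound."""
    return np.sqrt((np.log(H_size) + np.log(1.0 / delta)) / (2 * n))

# Larger hypothesis classes need more data for the same guarantee
for H_size in (10, 10**6):
    for n in (100, 10000):
        print(H_size, n, round(generalization_term(H_size, n), 3))
```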
For infinite hypothesis classes (e.g., neural networks), the complexity is measured by VC dimension or Rademacher complexity instead of $|\mathcal{H}|$, but the qualitative tension remains.
Overfitting and Underfitting
Overfitting occurs when the model fits training data well but fails on test data: low $\hat{R}$, high $R$. A model that simply memorizes every training pair $\{(x_i, y_i)\}$ achieves $\hat{R} = 0$ but generalizes no better than random on unseen inputs - it has learned nothing about $\mathcal{D}$.
Underfitting occurs when the model lacks the capacity to fit even the training data: both $\hat{R}$ and $R$ are high. Restricting a neural network to a single linear layer when the true relationship is nonlinear is an example.
The bias-variance trade-off formalizes this: under squared loss, the expected test error at a point (averaging over random training sets and label noise) decomposes as
$$\mathbb{E}[(h(x) - y)^2] = \underbrace{\text{Bias}^2}_{\text{underfitting}} + \underbrace{\text{Variance}}_{\text{overfitting}} + \sigma^2_{\text{noise}}.$$
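A small Monte Carlo sketch of the decomposition (the true function, noise level, and polynomial models are all assumptions chosen for illustration): it repeatedly redraws a training set, fits polynomials of increasing degree, and estimates the squared bias and variance of the prediction at one test point.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(3 * x)   # assumed true regression function
sigma = 0.3                   # assumed noise standard deviation
x0 = 0.5                      # test point at which we decompose the error

def fit_and_predict(degree, n=30):
    """Fit a polynomial of the given degree to a fresh training set, predict at x0."""
    X = rng.uniform(-1, 1, size=n)
    y = f(X) + rng.normal(scale=sigma, size=n)
    coeffs = np.polyfit(X, y, deg=degree)
    return np.polyval(coeffs, x0)

for degree in (1, 3, 12):     # underfit, reasonable fit, overfit
    preds = np.array([fit_and_predict(degree) for _ in range(2000)])
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    print(degree, round(bias2, 4), round(var, 4))
```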
The No Free Lunch Theorem
No single learning algorithm works best for every problem. More precisely:
Theorem (No Free Lunch, informal). For any learning algorithm $A$ and any performance measure, there exists a distribution $\mathcal{D}$ on which $A$ performs no better than random guessing on unseen inputs.
The implication is not that ML is hopeless but that assumptions matter. Every practical learning algorithm embeds inductive biases - preferences for certain types of hypotheses. Choosing the right inductive bias for a problem domain is what algorithm design is really about.
Three Components of a Machine Learning System
Every supervised learning system has three interacting components:
- Model ($\mathcal{H}$): The family of functions the algorithm searches over. Choice of model encodes assumptions about the problem structure (linearity, smoothness, compositionality).
- Loss function ($\ell$): The objective being minimized. Common choices include squared error $\ell(\hat{y}, y) = (\hat{y} - y)^2$ for regression and cross-entropy $\ell(\hat{p}, y) = -y \log \hat{p} - (1-y)\log(1-\hat{p})$ for classification.
- Optimizer: The algorithm that searches $\mathcal{H}$ to minimize $\hat{R}$. For differentiable losses, gradient-based methods (gradient descent and its variants) are standard.
These three choices are largely independent. The same model can be trained with different losses; the same loss can be minimized by different optimizers.
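A compact sketch that wires the three components together: a linear model, squared-error loss, and plain gradient descent as the optimizer (the synthetic data, learning rate, and iteration count are assumptions for illustration). Swapping the loss or the optimizer would leave the other components untouched, which is the independence point made above.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed synthetic training data: y = 2 x1 - x2 + 0.5 + noise
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.5 + rng.normal(scale=0.1, size=200)

# Model: linear, parameterized by (w, b)
def predict(w, b, X):
    return X @ w + b

# Loss: squared error, averaged into the empirical risk
def empirical_risk(w, b):
    return np.mean((predict(w, b, X) - y) ** 2)

# Optimizer: plain gradient descent on the empirical risk
w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    residual = predict(w, b, X) - y
    grad_w = 2 * X.T @ residual / len(y)
    grad_b = 2 * residual.mean()
    w, b = w - lr * grad_w, b - lr * grad_b

print(w, b, empirical_risk(w, b))  # parameters near (2, -1, 0.5), risk near the noise floor
```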
Supervised, Unsupervised, and Self-Supervised Learning
The framework above describes supervised learning, where every training example has a label $y_i$ provided by an external source. Not all learning settings match this description.
Unsupervised learning works with unlabeled data $\{x_1, \ldots, x_n\}$ drawn from a marginal distribution $p(x)$. The goal might be to learn the structure of $p(x)$ (density estimation), to compress $x$ into a low-dimensional representation (dimensionality reduction), or to partition the data into coherent groups (clustering).
Self-supervised learning creates supervision from the data itself. A model is trained to predict one part of an input from another - for example, predicting the next word in a sentence given all preceding words. No human-provided labels are needed. Modern large language models are trained this way.
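A toy illustration of how self-supervised targets are manufactured from raw data: each prefix of a token sequence becomes the input and the following token becomes the label (the sentence is an arbitrary example, not from the text):

```python
# Turn an unlabeled token sequence into (context, next-token) training pairs,
# the setup used for next-word prediction.
tokens = "the cat sat on the mat".split()

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ...
```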
Examples
Regression. Predict a patient’s blood pressure from age, weight, and lifestyle variables. Here $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \mathbb{R}$, and a natural loss is squared error. ERM over the class of linear functions yields ordinary least squares.
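A sketch of that pipeline with synthetic stand-in data (the feature coefficients and noise level are assumptions): ERM with squared loss over linear functions reduces to solving a least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in for (age, weight, lifestyle score) -> blood pressure
X = rng.normal(size=(100, 3))
y = X @ np.array([0.8, 1.2, -0.5]) + 120 + rng.normal(scale=2.0, size=100)

# Ordinary least squares: solve the least-squares problem for [w; b]
X_aug = np.hstack([X, np.ones((100, 1))])  # append a column of ones for the intercept
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = theta[:3], theta[3]
print(w, b)  # close to (0.8, 1.2, -0.5) and 120
```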
Binary classification. Classify an email as spam or not-spam. Here $\mathcal{Y} = \{0, 1\}$. A common loss is binary cross-entropy. ERM over linear functions with a sigmoid link yields logistic regression.
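A corresponding sketch for logistic regression, again on synthetic data (the labeling rule, learning rate, and iteration count are assumptions): a linear score is passed through a sigmoid and the average cross-entropy is minimized by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic binary labels from a noisy linear rule
X = rng.normal(size=(300, 2))
y = (X @ np.array([1.0, -2.0]) + 0.3 * rng.normal(size=300) > 0).astype(float)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(1000):
    p = sigmoid(X @ w + b)           # predicted P(y = 1 | x)
    grad_w = X.T @ (p - y) / len(y)  # gradient of average cross-entropy
    grad_b = (p - y).mean()
    w, b = w - lr * grad_w, b - lr * grad_b

print(w, b, ((sigmoid(X @ w + b) > 0.5) == y).mean())  # training accuracy
```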
Why memorizing training data fails. Consider a dataset of $n$ points. The hypothesis “output $y_i$ if $x = x_i$, else output 0” achieves $\hat{R} = 0$ on the training set. But on any test point $x \notin \{x_1, \ldots, x_n\}$ it predicts 0 regardless of the true label - the true risk can be arbitrarily high. This is the archetype of overfitting: zero training error with no generalization.
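The memorizing hypothesis can be written out directly (synthetic data assumed); it attains zero empirical risk, yet its predictions on fresh draws are unrelated to the true labels:

```python
import numpy as np

rng = np.random.default_rng(6)

X_train = rng.normal(size=(20, 1))
y_train = np.sin(3 * X_train[:, 0])  # assumed true labels

def memorizer(x):
    """Output y_i if x exactly matches a training point, else 0."""
    for x_i, y_i in zip(X_train, y_train):
        if np.array_equal(x, x_i):
            return y_i
    return 0.0

train_risk = np.mean([(memorizer(x) - y) ** 2 for x, y in zip(X_train, y_train)])
X_test = rng.normal(size=(20, 1))
y_test = np.sin(3 * X_test[:, 0])
test_risk = np.mean([(memorizer(x) - y) ** 2 for x, y in zip(X_test, y_test)])
print(train_risk, test_risk)  # 0.0 on the training set, well above zero on fresh data
```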