Objective Functions & Loss Design - What You Optimize Is What You Get
Helpful context:
- Autodiff - Derivatives Without Doing the Algebra by Hand
- Entropy & Information Theory - The Mathematics of Surprise
- KL Divergence - How Wrong Is Your Distribution?
A self-driving car trained to minimize the number of accidents might learn to stay parked forever. A recommendation system trained to maximize clicks might learn to recommend outrage. Neither model is broken - both are doing exactly what they were told. The loss function you choose is the most consequential design decision in any ML system. It defines what “good” means.
This is not a peripheral concern. Training is just optimization: the optimizer finds the parameters that minimize the loss. If the loss is misspecified, the optimizer will find a model that does exactly what you measured, not what you wanted. Get the loss wrong and nothing else matters.
What Is a Loss Function?
A loss function $L(\theta)$ maps model parameters $\theta$ to a real number measuring how badly the model performs on training data. Given predictions $\hat{y}_1, \ldots, \hat{y}_n$ and true targets $y_1, \ldots, y_n$, the loss aggregates individual errors into a single scalar that the optimizer tries to minimize.
The choice of aggregation is a design choice. The choice of how to measure individual errors is a design choice. Both encode assumptions about what kinds of mistakes are acceptable and what the underlying data-generating process looks like.
Regression Losses
Mean Squared Error
The most common loss for regression:
$$L_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2.$$
The squaring does two things: it makes all errors positive, and it penalizes large errors quadratically. An error of 2 is penalized four times as much as an error of 1.
MSE has a clean probabilistic interpretation. If you assume the true targets are generated by your model plus independent Gaussian noise,
$$y_i = f(x_i;\theta) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0,\sigma^2),$$
then maximizing the likelihood of the data under this model is equivalent to minimizing MSE. This is the formal justification: MSE is the right loss when your noise is Gaussian.
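The equivalence takes one line to verify. A sketch of the derivation, writing $\hat{y}_i = f(x_i;\theta)$ and treating $\sigma$ as fixed and known (an assumption made here for simplicity):

$$-\log p(y_1,\ldots,y_n \mid x_1,\ldots,x_n;\theta) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2 = \text{const} + \frac{n}{2\sigma^2}\,L_{\text{MSE}}.$$

The first term does not depend on $\theta$ and the second is a positive multiple of MSE, so the negative log-likelihood and MSE share the same minimizer.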
Practical properties: MSE is differentiable everywhere (smooth gradient landscape), but sensitive to outliers. A single data point with error 10 contributes 100 to the loss, while ten points with error 1 each contribute only 10 total. If your data has outliers, MSE will distort the model toward them.
Mean Absolute Error
$$L_{\text{MAE}} = \frac{1}{n}\sum_{i=1}^{n}|\hat{y}_i - y_i|.$$
Errors are penalized linearly. A point with error 10 contributes 10, not 100. This makes MAE much more robust to outliers.
The probabilistic interpretation: MAE is the right loss when noise is Laplace-distributed rather than Gaussian. The Laplace distribution has heavier tails than the Gaussian, which matches the robustness property.
There is a cost. MAE is not differentiable at zero - the gradient is $\pm 1$ everywhere except the kink. This can cause gradient-based optimizers to oscillate near the minimum. And where MSE produces the conditional mean of $y$ given $x$, MAE produces the conditional median. For skewed distributions, the median can be very different from the mean.
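The mean-versus-median claim can be checked numerically. The sketch below fits a single constant prediction to a toy dataset with one outlier by brute-force search over a grid; the data, the grid, and the function names are illustrative assumptions, not part of any library.

```python
from statistics import mean, median

def mse(c, ys):
    """Mean squared error of the constant prediction c."""
    return sum((c - y) ** 2 for y in ys) / len(ys)

def mae(c, ys):
    """Mean absolute error of the constant prediction c."""
    return sum(abs(c - y) for y in ys) / len(ys)

# Toy targets with one large outlier (hypothetical data).
ys = [1.0, 2.0, 3.0, 4.0, 100.0]

# Brute-force search over candidate constants 0.0, 0.1, ..., 100.0.
candidates = [i / 10 for i in range(0, 1001)]
best_mse = min(candidates, key=lambda c: mse(c, ys))
best_mae = min(candidates, key=lambda c: mae(c, ys))

print("MSE minimizer:", best_mse, "| mean:  ", mean(ys))
print("MAE minimizer:", best_mae, "| median:", median(ys))
```

The MSE minimizer is dragged toward the outlier, while the MAE minimizer stays with the bulk of the data.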
Huber Loss
Huber loss is a pragmatic compromise: quadratic for small errors, linear for large ones.
$$L_\delta(\hat{y}, y) = \begin{cases} \frac{1}{2}(\hat{y} - y)^2 & \text{if } |\hat{y} - y| \leq \delta \\ \delta\left(|\hat{y} - y| - \frac{\delta}{2}\right) & \text{otherwise.} \end{cases}$$
For errors smaller than the threshold $\delta$, you get the smooth, well-conditioned gradient of MSE. For large errors (potential outliers), the penalty grows only linearly, like MAE, so the gradient magnitude is capped at $\delta$. The result is differentiable everywhere, robust to outliers, and well-behaved near the minimum.
The downside: you have to choose $\delta$. This is a hyperparameter that affects the tradeoff between robustness and smoothness. In practice, $\delta = 1$ is a common default, but it may need tuning for your data’s scale.
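A minimal sketch of the piecewise definition above and its derivative with respect to the error $\hat{y} - y$. The function names are hypothetical, and `delta=1.0` simply mirrors the common default mentioned in the text.

```python
def huber(error, delta=1.0):
    """Huber loss for a single error value (prediction minus target)."""
    a = abs(error)
    if a <= delta:
        return 0.5 * error ** 2          # quadratic region
    return delta * (a - 0.5 * delta)     # linear region

def huber_grad(error, delta=1.0):
    """Derivative of the Huber loss w.r.t. the error: the error itself, clipped to [-delta, delta]."""
    return max(-delta, min(delta, error))

for e in (0.5, 1.0, 10.0):
    print(e, huber(e), huber_grad(e))
```

The two branches agree at $|\hat{y} - y| = \delta$, which is why the loss has no kink there.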
Classification Losses
Binary Cross-Entropy
For binary classification, the model outputs a probability $\hat{y}_i = \sigma(z_i) \in (0,1)$ via sigmoid, and the true label is $y_i \in \{0, 1\}$. The loss is:
$$L_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right].$$
When $y_i = 1$, only the first term matters: $-\log \hat{y}_i$. As $\hat{y}_i \to 1$, the loss goes to 0. As $\hat{y}_i \to 0$, the loss goes to $+\infty$. When $y_i = 0$, only the second term matters, with the same pattern.
The probabilistic derivation is clean. Assume each label $y_i$ is a Bernoulli random variable with success probability $\hat{y}_i$. The log-likelihood is:
$$\log P(\text{data}) = \sum_{i=1}^{n}\left[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right].$$
Minimizing BCE is exactly maximizing this log-likelihood. The loss is not arbitrary - it is the principled choice under the Bernoulli model.
One crucial property: BCE penalizes confident wrong predictions catastrophically. If the model outputs $\hat{y}_i = 0.001$ for a true positive, the loss is $-\log(0.001) \approx 6.9$. This is what makes the loss effective - and what makes it important that the output is a real probability, not a score.
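Because the loss diverges as $\hat{y}_i$ approaches the wrong extreme, implementations usually compute it from the logit $z_i$ rather than from the probability. The sketch below uses one standard numerically stable rearrangement; the function name is a hypothetical helper, not a specific library's API.

```python
import math

def bce_with_logits(z, y):
    """Binary cross-entropy for one example, computed from the logit z.

    Algebraically equal to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))],
    rearranged as max(z, 0) - z*y + log(1 + exp(-|z|)) to avoid overflow.
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# Logit corresponding to a predicted probability of 0.001.
z = math.log(0.001 / 0.999)
print(bce_with_logits(z, 1.0))        # close to -log(0.001)
print(bce_with_logits(-1000.0, 1.0))  # finite even for an extreme logit
```

A direct implementation that first computes the sigmoid would call `math.exp(1000.0)` for the second example, which raises `OverflowError` in Python.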
Categorical Cross-Entropy
For multi-class classification with $K$ classes, the model outputs a probability vector $\hat{y}_i \in [0,1]^K$ (entries summing to 1) via softmax, and the true label is a one-hot vector. The loss is:
$$L_{\text{CE}} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}.$$
Since $y_{ik} = 1$ for exactly one $k$ per sample, this simplifies to $-\log \hat{y}_{i, \text{true}}$ for each sample - the negative log probability assigned to the correct class. The same MLE derivation applies, now under the categorical distribution.
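The simplification to $-\log \hat{y}_{i,\text{true}}$ can be sketched for a single sample as follows. The log-sum-exp shift is a common stability trick added here as an assumption; the function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Log of the softmax probabilities, using the log-sum-exp shift for stability."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy(logits, true_class):
    """Categorical cross-entropy for one sample: -log p(true class)."""
    return -log_softmax(logits)[true_class]

print(cross_entropy([0.0, 0.0, 0.0], 1))   # uniform prediction over 3 classes
print(cross_entropy([1000.0, 0.0], 0))     # confident and correct
```

A uniform prediction over $K$ classes costs $\log K$, which is a useful sanity check for the loss value at initialization.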
KL Divergence as a Loss
The connection between cross-entropy and KL divergence matters for understanding what you’re optimizing.
Recall the KL divergence from a true distribution $p$ to a model distribution $q$:
$$D_{\text{KL}}(p \| q) = \sum_k p_k \log \frac{p_k}{q_k}.$$
The cross-entropy between $p$ and $q$ decomposes as:
$$H(p, q) = H(p) + D_{\text{KL}}(p \| q),$$
where $H(p) = -\sum_k p_k \log p_k$ is the entropy of the true distribution.
When you minimize cross-entropy over the model parameters $\theta$ (which affect $q$ but not $p$), $H(p)$ is a constant and drops out. So minimizing cross-entropy is exactly minimizing KL divergence. The KL term is zero exactly when $q = p$, that is, when your model perfectly captures the true distribution. The cross-entropy then sits at its floor $H(p)$, which is itself zero for one-hot labels.
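The decomposition is easy to confirm numerically. In this sketch the two distributions are arbitrary illustrative values and the function names are hypothetical.

```python
import math

def entropy(p):
    """H(p) = -sum_k p_k log p_k."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log q_k."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def kl(p, q):
    """D_KL(p || q) = sum_k p_k log(p_k / q_k)."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution (illustrative)
q = [0.5, 0.3, 0.2]   # model distribution (illustrative)

print(cross_entropy(p, q), entropy(p) + kl(p, q))  # the two sides agree
print(kl(p, p), cross_entropy(p, p), entropy(p))   # at q = p only H(p) remains
```

At $q = p$ the KL term vanishes while the cross-entropy equals $H(p)$, so a nonzero training loss does not by itself mean the model is wrong when the targets are soft.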
Discomfort check. The loss function should match the probabilistic model. If you assume Gaussian noise, use MSE. If you assume Bernoulli outcomes, use binary cross-entropy. If you assume categorical outcomes, use categorical cross-entropy. This is not just a recommendation: maximum likelihood estimation under each noise model yields exactly the corresponding loss. Using MSE for a classification problem is not “wrong” in a catastrophic sense, but it is misspecified: the model will still train, but the implied noise model (Gaussian noise on 0/1 labels) does not describe the data, and the gradient through a sigmoid or softmax shrinks toward zero for confident mistakes, so training is often slower and the results worse. The loss defines your model’s worldview.
Contrastive Losses
Standard supervised losses compare each prediction to its label independently. Contrastive losses instead compare pairs or groups of examples, encouraging the model to learn a geometry where similar inputs are close and dissimilar inputs are far.
Triplet loss takes an anchor $a$, a positive $p$ (same class as $a$), and a negative $n$ (different class):
$$L_{\text{triplet}} = \max(0, \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha),$$
where $\alpha$ is a margin. The loss is zero when the positive is already closer than the negative by at least $\alpha$; positive otherwise.
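A minimal sketch of the triplet formula for embeddings that have already been computed. The embeddings are plain Python lists, and the default `margin=0.2` is an arbitrary illustrative value, not a recommendation from the text.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin) for one triplet."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

# Easy triplet: the positive is much closer than the negative, so the loss is zero.
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0]))
# Hard triplet: the negative is closer than the positive, so the loss is positive.
print(triplet_loss([0.0, 0.0], [1.0, 0.0], [0.5, 0.0]))
```

Triplets that already satisfy the margin contribute no gradient, which is why triplet-based training typically depends on how triplets are selected.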
InfoNCE (used in SimCLR, CLIP) is a more modern formulation. Given a positive pair $(x, x^+)$ and $K-1$ negative samples $x^-_1, \ldots, x^-_{K-1}$:
$$L_{\text{InfoNCE}} = -\log \frac{\exp(f(x)^\top f(x^+)/\tau)}{\exp(f(x)^\top f(x^+)/\tau) + \sum_{j}\exp(f(x)^\top f(x^-_j)/\tau)},$$
where $\tau$ is a temperature hyperparameter. This is a softmax over similarities - the loss encourages the model to identify the positive pair among the negatives.
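The "softmax over similarities" reading can be sketched directly: treat the positive as class 0 among $K$ candidates and compute a cross-entropy. The embeddings are assumed to be precomputed lists, `tau=0.1` is an illustrative default, and the function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(query, positive, negatives, tau=0.1):
    """InfoNCE for one query: cross-entropy with the positive as the correct 'class'."""
    logits = [dot(query, positive) / tau] + [dot(query, n) / tau for n in negatives]
    m = max(logits)  # log-sum-exp shift for numerical stability
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return lse - logits[0]

# Positive aligned with the query, negatives orthogonal or opposite: loss near zero.
print(info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]]))
# All candidates equally similar: loss equals log(K) with K = 3.
print(info_nce([1.0, 0.0], [0.0, 1.0], [[0.0, 1.0], [0.0, 1.0]]))
```

A smaller $\tau$ sharpens the softmax, so hard negatives dominate the loss more strongly.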
Contrastive losses are the core of self-supervised learning. They allow representation learning without labels: you can define positive pairs as augmented views of the same image, or aligned image-text pairs, without any human annotation.
Regularization as a Loss Component
A loss function often has two components, a data-fit term and a regularization term:
$$L(\theta) = L_{\text{data}}(\theta) + \lambda R(\theta).$$
The regularization term $R(\theta)$ penalizes model complexity and reduces overfitting. The hyperparameter $\lambda$ controls the tradeoff.
L2 regularization (weight decay):
$$R(\theta) = \|\theta\|_2^2 = \sum_j \theta_j^2.$$
This penalizes large weights. The probabilistic interpretation: when the data term is a negative log-likelihood, L2 corresponds to placing a Gaussian prior $\theta_j \sim \mathcal{N}(0, 1/(2\lambda))$ on each weight, because the negative log of that density is $\lambda\theta_j^2$ plus a constant. MAP estimation under this prior gives L2 regularization. The effect is to shrink all weights toward zero, reducing the model’s ability to rely on any single feature.
L1 regularization (Lasso):
$$R(\theta) = \|\theta\|_1 = \sum_j |\theta_j|.$$
The probabilistic interpretation: L1 corresponds to a Laplace prior on weights. The key property is sparsity: L1 regularization tends to set many weights exactly to zero, effectively performing feature selection. L2 shrinks everything; L1 zeroes things out.
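The "shrinks versus zeroes" contrast shows up already in one dimension. As a simplifying assumption, take the data term to be $\frac{1}{2}(\theta - w)^2$, where $w$ is the unregularized solution; both penalized problems then have closed-form minimizers. The function names are hypothetical.

```python
def l2_shrink(w, lam):
    """argmin over theta of 0.5*(theta - w)**2 + lam*theta**2: proportional shrinkage."""
    return w / (1.0 + 2.0 * lam)

def l1_shrink(w, lam):
    """argmin over theta of 0.5*(theta - w)**2 + lam*|theta|: soft thresholding."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # small weights are set exactly to zero

for w in (0.3, -0.4, 2.0):
    print(w, l2_shrink(w, 0.5), l1_shrink(w, 0.5))
```

Under L2 a nonzero weight is scaled down but never reaches zero; under L1 any weight with $|w| \le \lambda$ is removed outright.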
Elastic net: combine both.
$$R(\theta) = \alpha \|\theta\|_1 + (1-\alpha)\|\theta\|_2^2.$$
You get both sparsity from L1 and the stability of L2.
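Dropout is a different mechanism: during training, randomly zero out each neuron with probability $p$. This is not added to the loss function - it is applied during the forward pass. The effect is similar to ensembling many sub-networks, which reduces overfitting. At inference, all neurons are active and activations are scaled by $(1-p)$ so that their expected value matches training; most modern frameworks instead scale the surviving activations by $1/(1-p)$ during training (inverted dropout) and leave inference untouched.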
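The two scaling conventions for dropout can be compared side by side. This is a sketch on plain lists of activations; the function names are hypothetical and the random generator is passed in explicitly so the example is reproducible.

```python
import random

def dropout_classic(x, p, rng, training):
    """Classic dropout: mask during training, scale by (1 - p) at inference."""
    if training:
        return [0.0 if rng.random() < p else xi for xi in x]
    return [xi * (1.0 - p) for xi in x]

def dropout_inverted(x, p, rng, training):
    """Inverted dropout: scale survivors by 1/(1 - p) in training, identity at inference."""
    if training:
        return [0.0 if rng.random() < p else xi / (1.0 - p) for xi in x]
    return list(x)

rng = random.Random(0)
x = [1.0] * 100_000
p = 0.5

# Average activation under each convention during training.
classic_train_mean = sum(dropout_classic(x, p, rng, training=True)) / len(x)
inverted_train_mean = sum(dropout_inverted(x, p, rng, training=True)) / len(x)
print(classic_train_mean, inverted_train_mean)
```

In both conventions the expected activation during training matches the activation at inference; they differ only in where the scaling happens.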
Choosing a Loss in Practice
There is no universal answer. The right loss depends on your problem structure:
What is the output type?
- Real-valued output (price, temperature): MSE, MAE, or Huber.
- Probability (is this spam?): binary cross-entropy.
- Category (which digit?): categorical cross-entropy.
- Distribution: KL divergence or cross-entropy.
What is the noise model?
- Gaussian noise → MSE.
- Heavy-tailed noise or outliers likely → Huber or MAE.
- Bernoulli outcomes → binary cross-entropy.
- Categorical outcomes → categorical cross-entropy.
Are there outliers?
- Yes: Huber or MAE. No: MSE is fine.
Do you need a sparse model (few nonzero weights)?
- Yes: add L1 regularization.
Is calibration important?
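- Use a proper scoring rule. Cross-entropy (log loss) is the standard choice for classification. MSE applied to probabilities (the Brier score) is also a proper scoring rule, but it penalizes confident errors far less sharply; MAE applied to probabilities is not proper.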
Is interpretability important?
- Simpler losses are easier to debug. A model trained with a complex custom loss is harder to reason about when something goes wrong.
The Loss Landscape
The loss function you choose doesn’t just determine the optimal solution - it shapes the entire optimization landscape through which gradient descent must navigate.
For a linear model with well-normalized inputs, an MSE loss is convex and well-conditioned: there are no saddle points and the negative gradient always points roughly toward the minimum. A cross-entropy loss can have plateaus (where the model is confident and correct, so the gradient is nearly zero) and cliffs (where the model is confident and wrong, producing large gradient spikes).
Ill-conditioned loss landscapes - where the loss changes rapidly in some directions and slowly in others - slow down training dramatically. This is one reason batch normalization was such a breakthrough: it smooths the loss landscape by ensuring activations remain well-scaled, allowing much higher learning rates.
Non-convex losses (which all deep networks have) mean there are multiple local minima. The practical finding is that most local minima in large networks have similar loss values, so finding the global minimum is not necessary for a good solution. But saddle points - where gradient descent can stall - are more common and more problematic than local minima.
The connection to regularization: regularization doesn’t just prevent overfitting, it shapes the loss landscape. L2 regularization makes the landscape more convex (it adds a bowl-shaped term to every parameter), which improves conditioning and speeds convergence.
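The "bowl-shaped term" can be made precise. A sketch, assuming the data term is twice differentiable and $R(\theta) = \|\theta\|_2^2$ as defined earlier:

$$\nabla^2_\theta\left[L_{\text{data}}(\theta) + \lambda\|\theta\|_2^2\right] = \nabla^2_\theta L_{\text{data}}(\theta) + 2\lambda I.$$

Every eigenvalue of the Hessian is shifted up by $2\lambda$. Where the data-term Hessian is positive definite with extreme eigenvalues $\mu_{\min}$ and $\mu_{\max}$, the condition number drops from $\mu_{\max}/\mu_{\min}$ to $(\mu_{\max} + 2\lambda)/(\mu_{\min} + 2\lambda)$.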
RLHF and Preference Objectives
In reinforcement learning from human feedback, the loss is more complex. You are not minimizing a pointwise error - you are training a reward model to predict human preferences.
The Bradley-Terry model says that if a human prefers response $y_1$ over $y_2$ given prompt $x$, the probability of this preference is:
$$P(y_1 \succ y_2 \mid x) = \sigma(r(x, y_1) - r(x, y_2)),$$
where $r$ is the learned reward function and $\sigma$ is sigmoid. The reward model is trained with cross-entropy to predict which response was preferred. This is still a cross-entropy loss, but now over pairwise comparisons rather than class labels.
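For a single comparison, the cross-entropy under the Bradley-Terry model reduces to $-\log\sigma(r_{\text{chosen}} - r_{\text{rejected}})$. A minimal sketch, assuming the two scalar rewards have already been computed by the reward model; the function name is hypothetical.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), written as a stable softplus of the negated margin."""
    d = r_chosen - r_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

print(preference_loss(0.0, 0.0))  # no preference expressed: log(2)
print(preference_loss(5.0, 0.0))  # agrees with the human label: small loss
print(preference_loss(0.0, 5.0))  # disagrees with the human label: large loss
```

Only the difference between the two rewards enters the loss, so the reward model is identified only up to an additive constant per prompt.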
The reward model is then used to fine-tune a language model with PPO or similar policy gradient methods. The full RLHF objective is a combination of maximizing expected reward plus a KL penalty to prevent the model from drifting too far from the base model:
$$L_{\text{RLHF}} = -\mathbb{E}[r(x,y)] + \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}).$$
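A Monte Carlo estimate of this objective's value can be sketched from sampled responses. The assumptions here are that the responses were drawn from the current policy $\pi_\theta$, so the KL term can be estimated by the average log-probability gap, and that `beta=0.1` is an illustrative value. This evaluates the objective only; it is not PPO or any particular training algorithm, and the function name is hypothetical.

```python
def rlhf_objective(rewards, logp_policy, logp_ref, beta=0.1):
    """Estimate -E[r] + beta * KL(pi_theta || pi_ref) from samples y ~ pi_theta.

    rewards[i]     : reward model score r(x, y_i)
    logp_policy[i] : log pi_theta(y_i | x)
    logp_ref[i]    : log pi_ref(y_i | x)
    """
    n = len(rewards)
    total = 0.0
    for r, lp, lr in zip(rewards, logp_policy, logp_ref):
        total += -r + beta * (lp - lr)  # per-sample reward term plus KL estimate
    return total / n

# Policy identical to the reference: the KL estimate is zero.
print(rlhf_objective([1.0, 2.0], [-1.0, -2.0], [-1.0, -2.0]))
# Policy has drifted toward these responses: the KL penalty raises the loss.
print(rlhf_objective([1.0, 2.0], [-0.5, -1.0], [-1.0, -2.0]))
```

Raising $\beta$ makes drift from the reference model more expensive relative to reward.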
The loss has become a design artifact encoding what “better” means according to human raters. Every choice in this pipeline - what questions to rate, who rates them, how to aggregate ratings - becomes encoded in the loss.
Summary
| Loss | Formula | When to Use | Properties |
|---|---|---|---|
| MSE | $\frac{1}{n}\sum(\hat{y}-y)^2$ | Regression, Gaussian noise | Differentiable, outlier-sensitive |
| MAE | $\frac{1}{n}\sum\lvert\hat{y}-y\rvert$ | Robust regression | Outlier-robust, non-differentiable at 0 |
| Huber | Quadratic then linear | Regression with outliers | Smooth, robust, needs $\delta$ |
| Binary CE | $-\frac{1}{n}\sum[y\log\hat{y}+(1-y)\log(1-\hat{y})]$ | Binary classification | Bernoulli MLE, penalizes confident errors |
| Categorical CE | $-\frac{1}{n}\sum\sum y_k\log\hat{y}_k$ | Multi-class classification | Categorical MLE |
| L2 reg | $\lambda\lVert\theta\rVert_2^2$ | General regularization | Gaussian prior, shrinks weights |
| L1 reg | $\lambda\lVert\theta\rVert_1$ | Sparse models | Laplace prior, zeroes weights |
The loss function is the ML system’s constitution. It encodes what you value, what you’re willing to trade off, and what you’re willing to get wrong. No amount of architecture sophistication or optimization tuning will fix a misspecified loss. Choose deliberately.