Objective Functions
Loss, Cost, and Objective: Clarifying the Terminology
These three terms are often used interchangeably, but they carry distinct meanings. The loss function $L(\hat{y}, y)$ measures the discrepancy for a single training example. The cost function $J(\theta)$ is typically the average loss over the entire dataset:
$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} L(f_\theta(x_i), y_i)$$
The objective function is the broader quantity being optimized - it may include regularization terms, constraints, or auxiliary penalties on top of the cost. Understanding this hierarchy matters because the loss you report and the objective you actually optimize can differ substantially.
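The hierarchy can be sketched in a few lines of NumPy; the linear model and the value of $\lambda$ here are purely illustrative:

```python
import numpy as np

def loss(y_hat, y):
    """Per-example squared-error loss L(y_hat, y)."""
    return (y_hat - y) ** 2

def cost(theta, X, y):
    """Cost J(theta): average loss over the dataset, here for a linear model."""
    return np.mean(loss(X @ theta, y))

def objective(theta, X, y, lam=0.1):
    """Objective: cost plus an L2 regularization term."""
    return cost(theta, X, y) + lam * np.sum(theta ** 2)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
theta = np.array([1.0, 2.0])
print(cost(theta, X, y))       # 0.0: the model fits the data exactly
print(objective(theta, X, y))  # 0.5: regularization penalizes the weights anyway
```

Note that `theta` achieves zero cost but nonzero objective: the reported loss and the optimized quantity disagree exactly as described above.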
Regression Losses
Mean Squared Error (MSE) is the canonical regression loss:
$$L_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$$
It penalizes large errors quadratically, making it sensitive to outliers. Its gradient with respect to $\hat{y}$ is simply $-2(y_i - \hat{y}_i)$, which is analytically convenient.
Mean Absolute Error (MAE) uses the $\ell_1$ norm:
$$L_{\text{MAE}} = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$$
MAE is robust to outliers because errors are penalized linearly, but its gradient is constant in magnitude and undefined at zero - making optimization noisier near convergence.
Huber Loss $L_\delta$ combines both, being quadratic for small errors and linear for large ones:
$$L_\delta(r) = \begin{cases} \frac{1}{2}r^2 & |r| \le \delta \\ \delta\left(|r| - \frac{\delta}{2}\right) & |r| > \delta \end{cases}$$
where $r = y - \hat{y}$. For $|r| \le \delta$, the gradient is $r$ (same as MSE up to scaling); for $|r| > \delta$, it is $\delta \cdot \operatorname{sign}(r)$: the same sign behavior as MAE, but capped in magnitude. The hyperparameter $\delta$ controls the transition point - smaller $\delta$ makes the loss more robust to outliers at the cost of slower convergence near the optimum.
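The gradient behavior of the three losses can be compared directly; this sketch works on the residual $r = y - \hat{y}$:

```python
import numpy as np

def mse_grad(r):
    """Gradient of (1/2) r^2 w.r.t. the residual r: grows linearly with r."""
    return r

def mae_grad(r):
    """(Sub)gradient of |r|: constant magnitude, undefined at r = 0."""
    return np.sign(r)

def huber_grad(r, delta=1.0):
    """Huber gradient: equal to r inside the delta band, clipped outside."""
    return np.clip(r, -delta, delta)

residuals = np.array([-5.0, -0.5, 0.5, 5.0])
print(mse_grad(residuals))    # [-5.  -0.5  0.5  5. ] : outliers dominate
print(huber_grad(residuals))  # [-1.  -0.5  0.5  1. ] : capped at +/- delta
```

The clipping in `huber_grad` is exactly the outlier robustness described above: a residual of 5 contributes no more gradient than a residual of 1.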
Classification Losses
Binary Cross-Entropy: Derivation from MLE
Suppose labels $y \in \{0,1\}$ and the model outputs $\hat{p} = \sigma(\theta^T x)$. The likelihood of a single example under a Bernoulli model is:
$$p(y|\hat{p}) = \hat{p}^y (1-\hat{p})^{1-y}$$
Taking the negative log-likelihood over $n$ examples:
$$-\log \mathcal{L} = -\sum_{i=1}^n \left[ y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i) \right]$$
This is exactly binary cross-entropy. Minimizing it is equivalent to MLE under a Bernoulli generative assumption - which is why it is the right loss for binary classification, not something chosen arbitrarily.
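In practice the loss is computed from the logit $z = \theta^T x$ rather than from $\hat{p}$, using the standard numerically stable rearrangement $\max(z,0) - zy + \log(1 + e^{-|z|})$; a minimal sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_naive(p, y):
    """Negative Bernoulli log-likelihood for one example, from p directly."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_stable(z, y):
    """Same quantity computed from the logit z, in the numerically stable
    form max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0) - z * y + math.log(1 + math.exp(-abs(z)))

z, y = 2.0, 1
print(bce_naive(sigmoid(z), y))  # -log(sigmoid(2)) ~ 0.1269
print(bce_stable(z, y))          # identical, but safe for large |z|
```

For large $|z|$, `bce_naive` would evaluate `log` at values that underflow to 0; the stable form never does.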
Categorical Cross-Entropy
For $K$-class problems with softmax outputs $\hat{p}_k = e^{z_k}/\sum_j e^{z_j}$ and one-hot labels $y$:
$$L_{\text{CE}} = -\sum_{k=1}^K y_k \log \hat{p}_k = -\log \hat{p}_{y^\ast}$$
where $y^\ast$ is the true class. This again follows from MLE under a categorical distribution. The gradient of $L_{\text{CE}}$ with respect to the logits $z$ simplifies beautifully to $\hat{p} - y$, which is why cross-entropy and softmax are almost always paired.
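The $\hat{p} - y$ gradient can be verified numerically with a finite-difference check against the analytic expression:

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, y_true):
    """Categorical cross-entropy of the softmax of logits z at the true class."""
    return -np.log(softmax(z)[y_true])

z = np.array([2.0, 1.0, 0.1])
y_true = 0
y = np.zeros(3); y[y_true] = 1.0

analytic = softmax(z) - y      # the claimed gradient w.r.t. the logits

# Central finite differences on the same loss
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[k], y_true)
     - cross_entropy(z - eps * np.eye(3)[k], y_true)) / (2 * eps)
    for k in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```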
Focal Loss
In severely imbalanced datasets, easy negatives dominate the gradient signal. Focal loss down-weights easy examples by multiplying by a modulating factor:
$$L_{\text{focal}} = -(1 - p_t)^\gamma \log(p_t)$$
where $p_t = \hat{p}$ if $y=1$ and $p_t = 1-\hat{p}$ otherwise, and $\gamma \ge 0$ is the focusing parameter. When $\gamma = 0$, this reduces to standard cross-entropy. When $\gamma = 2$ (a common choice), an easy example with $p_t = 0.9$ receives a modulating factor of $(0.1)^2 = 0.01$, while a hard example with $p_t = 0.1$ receives $(0.9)^2 = 0.81$ - the easy example's contribution is suppressed by nearly two orders of magnitude, effectively forcing the model to focus on the difficult minority cases.
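The down-weighting is easy to see numerically; a minimal single-example sketch:

```python
import math

def focal_loss(p_hat, y, gamma=2.0):
    """Focal loss for one example; gamma = 0 recovers binary cross-entropy."""
    p_t = p_hat if y == 1 else 1.0 - p_hat
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy = focal_loss(0.9, 1)  # well-classified positive
hard = focal_loss(0.1, 1)  # badly-classified positive
print(easy, hard)          # the easy example contributes orders of magnitude less
```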
Ranking Losses
Hinge loss is the foundation of support vector machines:
$$L_{\text{hinge}} = \max(0, 1 - y\hat{y}), \quad y \in \{-1, +1\}$$
It penalizes predictions that are on the wrong side of the margin or within the margin, and is zero whenever the prediction is correct with sufficient confidence ($y\hat{y} \ge 1$). This “max-margin” property gives SVMs their geometric interpretation.
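The three regimes (confidently correct, inside the margin, wrong side) show up directly in a one-line implementation:

```python
def hinge(y, y_hat):
    """Hinge loss: zero once the prediction clears the margin y * y_hat >= 1."""
    return max(0.0, 1.0 - y * y_hat)

print(hinge(+1, 2.0))   # 0.0 : correct with sufficient margin, no penalty
print(hinge(+1, 0.5))   # 0.5 : correct side but inside the margin
print(hinge(+1, -1.0))  # 2.0 : wrong side, penalized linearly
```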
Contrastive loss operates on pairs $(x_i, x_j)$ with a binary label $s_{ij}$ indicating whether they are similar:
$$L_{\text{contrastive}} = s_{ij} \cdot d_{ij}^2 + (1 - s_{ij}) \cdot \max(0, m - d_{ij})^2$$
where $d_{ij}$ is the distance between embeddings and $m$ is a margin hyperparameter. Similar pairs are pulled together; dissimilar pairs are pushed apart only if they fall within the margin.
Triplet loss uses anchor $a$, positive $p$, and negative $n$ examples to enforce $d(a,p)^2 + \alpha < d(a,n)^2$ for some margin $\alpha$:
$$L_{\text{triplet}} = \max(0, d(a,p)^2 - d(a,n)^2 + \alpha)$$
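A minimal sketch on raw embedding vectors, using squared Euclidean distances; the embeddings and margin value are illustrative:

```python
import numpy as np

def triplet_loss(a, p, n, alpha=0.2):
    """Triplet loss: zero once the positive is closer than the negative
    by at least the margin alpha (in squared distance)."""
    d_ap = np.sum((a - p) ** 2)
    d_an = np.sum((a - n) ** 2)
    return max(0.0, d_ap - d_an + alpha)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # close to the anchor
n = np.array([1.0, 0.0])   # far from the anchor
print(triplet_loss(a, p, n))  # 0.0: the margin is already satisfied
```

Swapping `p` and `n` makes the loss positive, which is exactly the gradient signal that pushes the negative away and pulls the positive in.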
Generative Objectives: The ELBO
Variational autoencoders maximize the Evidence Lower BOund (ELBO) because $\log p(x)$ is intractable when the latent variable $z$ must be integrated out. Starting from:
$$\log p(x) = \log \int p(x|z)\,p(z)\,dz$$
Introduce an approximate posterior $q_\phi(z|x)$ and apply Jensen’s inequality:
$$\log p(x) \ge \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\text{KL}}\left(q_\phi(z|x) \,\|\, p(z)\right)$$
The right side is the ELBO. The first term is the reconstruction loss - how well the decoder recovers $x$ from sampled $z$. The second term is a regularizer that keeps the approximate posterior close to the prior $p(z)$. Maximizing the ELBO simultaneously trains the encoder (variational parameters $\phi$) and decoder (generative parameters $\theta$).
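With a diagonal-Gaussian encoder and standard-normal prior, the KL term has the well-known closed form $\frac{1}{2}\sum(\sigma^2 + \mu^2 - 1 - \log\sigma^2)$; a sketch of the negative ELBO assuming a fixed-variance Gaussian decoder (so the reconstruction term is squared error):

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over
    latent dimensions: 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def negative_elbo(x, x_recon, mu, logvar):
    """Negative ELBO = reconstruction error + KL regularizer.
    Squared error corresponds to a Gaussian decoder with fixed variance."""
    recon = np.sum((x - x_recon) ** 2)
    return recon + gaussian_kl(mu, logvar)

mu, logvar = np.zeros(2), np.zeros(2)
print(gaussian_kl(mu, logvar))  # 0.0: the posterior equals the prior
```

The trade-off is visible in the two terms: a sharper posterior (smaller `logvar`) can reconstruct better but pays a larger KL penalty.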
Regularization as Prior: MAP Derivation
Regularization is not arbitrary - it corresponds to placing a prior over parameters in a MAP estimation framework. Given $\log p(\theta|D) \propto \log p(D|\theta) + \log p(\theta)$:
$L_2$ regularization corresponds to a Gaussian prior. Let $p(\theta) = \mathcal{N}(0, \sigma^2 I)$. Then:
$$\log p(\theta) = -\frac{1}{2\sigma^2}\|\theta\|_2^2 + \text{const}$$
Adding this to the negative log-likelihood gives the objective $J(\theta) + \frac{\lambda}{2}\|\theta\|_2^2$ with $\lambda = 1/\sigma^2$. The MAP estimate shrinks all weights toward zero proportionally.
$L_1$ regularization corresponds to a Laplace prior. Let $p(\theta_j) \propto \exp(-|\theta_j|/b)$. Then:
$$\log p(\theta) = -\frac{1}{b}\|\theta\|_1 + \text{const}$$
The Laplace prior has a sharper peak at zero and heavier tails than the Gaussian - this is precisely why $L_1$ regularization drives many weights to exactly zero while allowing others to be large, producing sparse solutions.
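The sparsity difference shows up directly in the optimizer updates. A sketch, using a plain gradient step for the $L_2$ penalty versus the proximal (soft-threshold) step for the non-differentiable $L_1$ penalty; the learning rate and $\lambda$ are illustrative:

```python
import numpy as np

def l2_shrink(theta, lam, lr=0.1):
    """Gradient step on (lambda/2)*||theta||^2: multiplicative shrinkage."""
    return theta * (1.0 - lr * lam)

def l1_prox(theta, lam, lr=0.1):
    """Proximal (soft-threshold) step on lambda*||theta||_1:
    weights within lr*lam of zero are snapped exactly to zero."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lr * lam, 0.0)

theta = np.array([0.05, -0.5, 2.0])
print(l2_shrink(theta, lam=1.0))  # all weights shrink, none reach zero
print(l1_prox(theta, lam=1.0))    # [ 0.  -0.4  1.9]: small weight zeroed out
```

This is the mechanical face of the sharper Laplace peak: $L_2$ shrinks proportionally, while $L_1$ subtracts a constant and truncates at zero.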
Multi-Task Objectives
When training on multiple tasks simultaneously, the simplest approach is a weighted sum:
$$J(\theta) = \sum_i w_i L_i(\theta)$$
The choice of weights $w_i$ is a hyperparameter. A principled alternative is uncertainty weighting (Kendall et al., 2018), where each task weight is learned automatically:
$$J(\theta, \sigma) = \sum_i \frac{1}{2\sigma_i^2} L_i(\theta) + \log \sigma_i$$
The intuition: tasks with high aleatoric uncertainty $\sigma_i$ are down-weighted automatically. The $\log \sigma_i$ term prevents the trivial solution of sending all $\sigma_i \to \infty$. The weights $\sigma_i$ are learned jointly with $\theta$.
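A minimal sketch of the weighted objective; parameterizing by $\log\sigma_i$ (a common implementation choice, assumed here) keeps each $\sigma_i$ positive during learning:

```python
import numpy as np

def multitask_objective(losses, log_sigmas):
    """Uncertainty-weighted multi-task objective: each task loss is scaled
    by 1/(2*sigma_i^2), and the log(sigma_i) term penalizes inflating
    sigma_i to escape the loss entirely."""
    sigmas_sq = np.exp(2.0 * log_sigmas)
    return np.sum(losses / (2.0 * sigmas_sq) + log_sigmas)

losses = np.array([1.0, 4.0])
print(multitask_objective(losses, np.zeros(2)))           # 2.5: both sigmas = 1
print(multitask_objective(losses, np.array([0.0, 1.0])))  # second task down-weighted
```

In training, `log_sigmas` would be trainable parameters optimized jointly with $\theta$ by the same gradient step.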
Examples
Why cross-entropy and not accuracy? Accuracy is piecewise constant - its gradient with respect to model parameters is zero almost everywhere. Cross-entropy is smooth and differentiable, and its gradient at a well-classified example is small while it provides strong signal for misclassified or uncertain examples. Minimizing cross-entropy indirectly maximizes accuracy while remaining tractable for gradient-based optimization.
Label smoothing is a practical technique that modifies the target distribution. Instead of one-hot targets $y_k \in \{0,1\}$, use:
$$y_k^{\text{smooth}} = (1 - \epsilon)\,y_k + \frac{\epsilon}{K}$$
for smoothing parameter $\epsilon$ (typically 0.1). This prevents the model from becoming overconfident - a logit of $+\infty$ on the correct class is no longer the true optimum. It acts as a regularizer, improves calibration, and often improves generalization, particularly in translation and image classification tasks.
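The transformation is a one-liner; note the smoothed targets still sum to one, so they remain a valid distribution:

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """Mix the one-hot target with the uniform distribution over K classes."""
    K = y_onehot.shape[-1]
    return (1.0 - eps) * y_onehot + eps / K

y = np.array([0.0, 0.0, 1.0, 0.0])
print(smooth_labels(y))  # [0.025 0.025 0.925 0.025]
```

Cross-entropy against these soft targets is minimized by a finite logit gap rather than an infinite one, which is the anti-overconfidence effect described above.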