The Likelihood Function

Suppose we observe data $x = (x_1, \ldots, x_n)$ assumed to be i.i.d. draws from a distribution parameterized by $\theta \in \Theta$. The likelihood function is

$$L(\theta; x) = \prod_{i=1}^n p(x_i \mid \theta),$$

where $p(\cdot \mid \theta)$ denotes the probability mass function (discrete case) or probability density function (continuous case). The likelihood is a function of $\theta$ with the data $x$ held fixed - it measures how probable the observed data is under each parameter value.

Working with products is numerically inconvenient, so we use the log-likelihood

$$\ell(\theta) = \log L(\theta; x) = \sum_{i=1}^n \log p(x_i \mid \theta).$$

Since $\log$ is strictly increasing, maximizing $\ell(\theta)$ is equivalent to maximizing $L(\theta; x)$.
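
To see the numerical issue concretely, here is a minimal sketch (assuming NumPy and SciPy are available): the raw likelihood of a modest standard-normal sample underflows to zero in floating point, while the log-likelihood is perfectly representable.

```python
# Sketch, assuming NumPy/SciPy: the product of 1000 densities underflows,
# while the sum of log-densities is numerically stable.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                 # i.i.d. N(0, 1) draws

likelihood = np.prod(norm.pdf(x))         # underflows: prints 0.0
log_likelihood = np.sum(norm.logpdf(x))   # stable: roughly -1400
print(likelihood, log_likelihood)
```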

Definition. The maximum likelihood estimator (MLE) is

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta \in \Theta} \ell(\theta).$$
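
As an illustration of the definition, the following sketch (a hypothetical example, assuming NumPy and SciPy) computes the MLE of an exponential rate numerically and compares it with the closed-form answer $\hat{\lambda} = 1/\bar{x}$.

```python
# Sketch: numerical MLE for the rate of an Exponential(lambda) model,
# assuming NumPy and SciPy are available.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / 2.5, size=1000)    # true rate lambda = 2.5

def neg_log_likelihood(lam):
    # ell(lambda) = n log(lambda) - lambda * sum(x_i); we minimize -ell
    return -(len(x) * np.log(lam) - lam * x.sum())

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print(result.x, 1 / x.mean())                    # numerical vs. closed-form MLE
```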

MLE for the Bernoulli Distribution

Let $x_1, \ldots, x_n \in \{0, 1\}$ be i.i.d. Bernoulli$(p)$. Then $p(x_i \mid p) = p^{x_i}(1-p)^{1-x_i}$ and

$$\ell(p) = \sum_{i=1}^n \bigl[x_i \log p + (1 - x_i) \log(1-p)\bigr] = n\bar{x} \log p + n(1 - \bar{x})\log(1-p),$$

where $\bar{x} = \frac{1}{n}\sum x_i$. Setting $\frac{d\ell}{dp} = 0$:

$$\frac{d\ell}{dp} = \frac{n\bar{x}}{p} - \frac{n(1-\bar{x})}{1-p} = 0 \implies n\bar{x}(1-p) = n(1-\bar{x})p \implies \bar{x} = p.$$

Thus $\hat{p} = \bar{x}$, the sample proportion. The second derivative $\frac{d^2\ell}{dp^2} = -\frac{n\bar{x}}{p^2} - \frac{n(1-\bar{x})}{(1-p)^2} < 0$, confirming a maximum.
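
A quick numerical check (a minimal sketch, assuming NumPy): a grid search over the Bernoulli log-likelihood recovers the sample proportion.

```python
# Sketch, assuming NumPy: the grid maximizer of the Bernoulli log-likelihood
# agrees with the closed-form MLE, the sample proportion.
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.3, size=1000)       # i.i.d. Bernoulli(0.3) draws
x_bar = x.mean()                          # closed-form MLE

p_grid = np.linspace(0.001, 0.999, 999)
log_lik = x.sum() * np.log(p_grid) + (len(x) - x.sum()) * np.log(1 - p_grid)
p_hat = p_grid[np.argmax(log_lik)]        # grid-search maximizer

print(x_bar, p_hat)                       # agree up to grid resolution
```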

MLE for the Gaussian Distribution

Let $x_1, \ldots, x_n$ be i.i.d. $\mathcal{N}(\mu, \sigma^2)$. The log-likelihood is

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.$$

Estimating $\mu$: Setting $\partial \ell / \partial \mu = 0$:

$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0 \implies \hat{\mu} = \bar{x}.$$

Estimating $\sigma^2$: Setting $\partial \ell / \partial \sigma^2 = 0$ (substituting $\hat{\mu} = \bar{x}$):

$$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i - \bar{x})^2 = 0 \implies \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.$$

Note that $\hat{\sigma}^2$ divides by $n$, not $n-1$, so it is biased: $E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$. The bias vanishes as $n \to \infty$, and dividing by $n-1$ instead (Bessel's correction) yields the familiar unbiased sample variance.
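
A Monte Carlo sketch of the bias formula (assuming NumPy): with $n = 10$ and $\sigma^2 = 4$, the average of $\hat{\sigma}^2$ over many replications should be close to $\frac{n-1}{n}\sigma^2 = 3.6$.

```python
# Sketch, assuming NumPy: verify E[sigma2_hat] = (n-1)/n * sigma^2 by simulation.
import numpy as np

rng = np.random.default_rng(3)
mu0, sigma2_0, n = 1.0, 4.0, 10

x = rng.normal(mu0, np.sqrt(sigma2_0), size=(100_000, n))
sigma2_hat = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)  # divide by n

print(sigma2_hat.mean())           # ~ 3.6, matching the bias formula
print(sigma2_0 * (n - 1) / n)      # theoretical expectation
```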

Properties of MLEs

Consistency

Under regularity conditions (differentiable log-likelihood, identifiable model), $\hat{\theta}_n \xrightarrow{P} \theta_0$ as $n \to \infty$, where $\theta_0$ is the true parameter. Intuitively, the normalized log-likelihood $\ell(\theta)/n \to E_{\theta_0}[\log p(X \mid \theta)]$ by the law of large numbers, and this expectation is uniquely maximized at $\theta_0$ (a consequence of Gibbs' inequality: $E_{\theta_0}[\log p(X \mid \theta_0)] \geq E_{\theta_0}[\log p(X \mid \theta)]$).
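
A minimal sketch of consistency (assuming NumPy), using the Bernoulli MLE from earlier: the estimation error shrinks as $n$ grows.

```python
# Sketch, assuming NumPy: the Bernoulli MLE's error tends to 0 as n grows.
import numpy as np

rng = np.random.default_rng(4)
p0 = 0.3
for n in [100, 10_000, 1_000_000]:
    p_hat = rng.binomial(1, p0, size=n).mean()
    print(n, abs(p_hat - p0))     # error shrinks roughly like 1/sqrt(n)
```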

Asymptotic Normality

The Fisher information at $\theta$ is

$$I(\theta) = -E\!\left[\frac{\partial^2 \log p(X \mid \theta)}{\partial \theta^2}\right] = E\!\left[\left(\frac{\partial \log p(X \mid \theta)}{\partial \theta}\right)^{\!2}\right].$$

The second equality holds under regularity. For $n$ i.i.d. observations, the total Fisher information is $nI(\theta)$. The MLE satisfies

$$\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}\!\left(0, \frac{1}{I(\theta_0)}\right).$$

This follows from a Taylor expansion of the score equation $\ell'(\hat{\theta}) = 0$ around $\theta_0$, combined with the central limit theorem applied to the scaled score $\ell'(\theta_0)/\sqrt{n}$.
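
To illustrate, the following sketch (assuming NumPy) checks the asymptotic variance for the Bernoulli MLE, where $I(p) = 1/(p(1-p))$, so $\sqrt{n}(\hat{p}_n - p_0)$ has limiting variance $p_0(1 - p_0)$.

```python
# Sketch, assuming NumPy: sqrt(n) * (p_hat - p0) should have variance
# close to 1/I(p0) = p0 * (1 - p0) for Bernoulli data.
import numpy as np

rng = np.random.default_rng(5)
p0, n, reps = 0.3, 2000, 50_000

p_hats = rng.binomial(n, p0, size=reps) / n    # MLE = sample proportion
z = np.sqrt(n) * (p_hats - p0)

print(z.var(), p0 * (1 - p0))                  # empirical vs. asymptotic variance
```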

Cramér-Rao Bound and Efficiency

Theorem (Cramér-Rao). For any unbiased estimator $\hat{\theta}$ of $\theta$,

$$\text{Var}(\hat{\theta}) \geq \frac{1}{nI(\theta)}.$$

Proof sketch. Let $S = \partial \log L/\partial \theta$ be the score. By the Cauchy-Schwarz inequality applied to $\text{Cov}(\hat{\theta}, S)$:

$$[\text{Cov}(\hat{\theta}, S)]^2 \leq \text{Var}(\hat{\theta}) \cdot \text{Var}(S).$$

Under regularity the score has mean zero, so $\text{Cov}(\hat{\theta}, S) = E[\hat{\theta} S]$. Differentiating $E[\hat{\theta}] = \theta$ under the integral sign gives $E[\hat{\theta} S] = 1$, and independence of the observations gives $\text{Var}(S) = nI(\theta)$. The bound follows. $\square$

An estimator achieving this bound is efficient. The MLE is asymptotically efficient: its asymptotic variance $1/(nI(\theta))$ equals the Cramér-Rao bound.
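
As a concrete check, the Bernoulli model from earlier attains the bound exactly, not just asymptotically. The single-observation Fisher information is

$$I(p) = -E\!\left[-\frac{X}{p^2} - \frac{1-X}{(1-p)^2}\right] = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)},$$

so the bound reads $\text{Var}(\hat{p}) \geq \frac{p(1-p)}{n}$, and the MLE $\hat{p} = \bar{x}$ achieves it at every sample size since $\text{Var}(\bar{x}) = \frac{p(1-p)}{n}$.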

Connection to Cross-Entropy

For a categorical model with $K$ classes, $p(x = k \mid \theta) = \theta_k$, let $n_k$ be the number of observations in class $k$ and $\hat{p}_k = n_k/n$ the empirical frequencies. The log-likelihood is

$$\ell(\theta) = n \sum_{k=1}^K \hat{p}_k \log \theta_k.$$

Maximizing this subject to $\sum_k \theta_k = 1$ gives $\hat{\theta}_k = \hat{p}_k$. The negative log-likelihood per sample is

$$-\frac{\ell(\theta)}{n} = -\sum_{k=1}^K \hat{p}_k \log \theta_k = H(\hat{p}, \theta),$$

the cross-entropy of the empirical distribution $\hat{p}$ with model distribution $\theta$. Thus MLE for categorical models is equivalent to minimizing cross-entropy - the standard loss in classification.
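
This identity is easy to verify numerically (a minimal sketch, assuming NumPy):

```python
# Sketch, assuming NumPy: the per-sample negative log-likelihood of a
# categorical model equals the cross-entropy H(p_hat, theta).
import numpy as np

counts = np.array([30, 50, 20])          # n_k for K = 3 classes
n = counts.sum()
p_hat = counts / n                       # empirical frequencies

theta = np.array([0.25, 0.5, 0.25])      # an arbitrary model distribution
nll_per_sample = -np.sum(counts * np.log(theta)) / n
cross_entropy = -np.sum(p_hat * np.log(theta))

print(nll_per_sample, cross_entropy)     # identical
```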

Examples

Logistic regression as MLE. In binary logistic regression, $p(y_i = 1 \mid x_i, \beta) = \sigma(\beta^\top x_i)$ where $\sigma(z) = 1/(1+e^{-z})$. The log-likelihood is

$$\ell(\beta) = \sum_{i=1}^n \bigl[y_i \log \sigma(\beta^\top x_i) + (1-y_i)\log(1 - \sigma(\beta^\top x_i))\bigr],$$

which is the negative binary cross-entropy. Gradient ascent on $\ell$ is the standard training procedure. The MLE is consistent and asymptotically normal under mild design conditions.
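
A minimal sketch of this training procedure (assuming NumPy; the data and step size are illustrative), using the gradient $\nabla_\beta \ell(\beta) = X^\top(y - \sigma(X\beta))$:

```python
# Sketch, assuming NumPy: logistic-regression MLE by gradient ascent
# on the mean log-likelihood.
import numpy as np

rng = np.random.default_rng(6)
n, d = 500, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(d)
lr = 0.1
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += lr * X.T @ (y - p) / n       # ascend the mean log-likelihood

print(beta)                              # approaches beta_true as n grows
```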

Gaussian MLE in linear regression. In the model $y_i = x_i^\top \beta + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ i.i.d., the log-likelihood is

$$\ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|y - X\beta\|^2.$$

Maximizing over $\beta$ is equivalent to minimizing $\|y - X\beta\|^2$, which is ordinary least squares. The MLE $\hat{\beta} = (X^\top X)^{-1} X^\top y$ is the OLS estimator, establishing that least squares is maximum likelihood under Gaussian noise - a fundamental connection in statistics.
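
A short sketch (assuming NumPy) confirming the closed forms on synthetic data; note that the MLE for $\sigma^2$ again divides by $n$:

```python
# Sketch, assuming NumPy: the Gaussian MLEs in linear regression are the
# OLS coefficients and the mean squared residual.
import numpy as np

rng = np.random.default_rng(7)
n, d = 200, 2
X = rng.normal(size=(n, d))
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # (X^T X)^{-1} X^T y
sigma2_hat = np.mean((y - X @ beta_hat) ** 2)    # MLE for sigma^2 (divides by n)

print(beta_hat, sigma2_hat)                      # near beta_true and 0.25
```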

