Logistic Regression
From Regression to Classification
Binary classification asks: given $\mathbf{x} \in \mathbb{R}^d$, predict $y \in \{0, 1\}$. We want to model the conditional probability $P(Y=1 \mid \mathbf{x})$. A linear model $\mathbf{w}^T \mathbf{x}$ is unbounded and cannot directly represent a probability. We need a function that maps $\mathbb{R}$ to $(0, 1)$.
The Logistic Function
Definition (Logistic / sigmoid function). The logistic function is
$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$
Key properties:
- $\sigma(z) \in (0, 1)$ for all $z \in \mathbb{R}$,
- $\sigma(0) = 1/2$,
- $\sigma(-z) = 1 - \sigma(z)$ (symmetry),
- $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ (useful derivative identity).
The logit (log-odds) interpretation. If $p = \sigma(z)$, then
$$z = \log\frac{p}{1-p} = \text{logit}(p).$$
So $z = \mathbf{w}^T \mathbf{x}$ models the log-odds of the positive class as a linear function of the features.
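As a concrete check, here is a minimal NumPy sketch (the helper names `sigmoid` and `logit` are our own) verifying the properties above and the log-odds inverse relationship:

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def logit(p):
    """Inverse of the sigmoid: the log-odds log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

z = np.linspace(-5, 5, 101)
p = sigmoid(z)

assert np.all((p > 0) & (p < 1))          # sigma(z) in (0, 1)
assert np.isclose(sigmoid(0.0), 0.5)      # sigma(0) = 1/2
assert np.allclose(sigmoid(-z), 1 - p)    # symmetry: sigma(-z) = 1 - sigma(z)
assert np.allclose(logit(p), z)           # logit inverts the sigmoid

# Finite-difference check of the derivative identity sigma'(z) = sigma(z)(1 - sigma(z))
h = 1e-6
num_grad = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
assert np.allclose(num_grad, p * (1 - p), atol=1e-6)
```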
The Logistic Regression Model
The model is:
$$P(Y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}},$$
$$P(Y = 0 \mid \mathbf{x}) = 1 - \sigma(\mathbf{w}^T \mathbf{x}).$$
Both expressions collapse into the compact form
$$P(Y = y \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x})^y \cdot (1 - \sigma(\mathbf{w}^T \mathbf{x}))^{1-y}, \quad y \in \{0, 1\}.$$
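For concreteness, a tiny sketch of this forward computation (the weight and feature vectors below are made up for illustration; a bias term can be absorbed as a constant feature):

```python
import numpy as np

w = np.array([1.5, -2.0, 0.5])   # hypothetical learned weights
x = np.array([0.2, 0.4, 1.0])    # feature vector; trailing 1 acts as the bias feature

p1 = 1.0 / (1.0 + np.exp(-(w @ x)))   # P(Y = 1 | x) = sigma(w^T x)
p0 = 1.0 - p1                         # P(Y = 0 | x)
print(p1, p0)
```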
Loss: Negative Log-Likelihood
Given $n$ i.i.d. examples, the log-likelihood is
$$\log \mathcal{L}(\mathbf{w}) = \sum_{i=1}^{n} \left[y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i)\right],$$
where $\hat{p}_i = \sigma(\mathbf{w}^T \mathbf{x}_i)$. Minimizing the negative log-likelihood (equivalently, binary cross-entropy) gives
$$L(\mathbf{w}) = -\frac{1}{n} \sum_{i=1}^{n} \left[y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i)\right].$$
Unlike MSE for linear regression, this objective has no closed-form minimizer - we must use iterative methods.
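A minimal NumPy sketch of this loss (the array names `X`, `y`, `w` and the small constant `eps`, which guards against $\log 0$, are our own choices):

```python
import numpy as np

def nll(w, X, y, eps=1e-12):
    """Average negative log-likelihood (binary cross-entropy).

    X : (n, d) feature matrix, y : (n,) labels in {0, 1}, w : (d,) weights.
    """
    p = 1.0 / (1.0 + np.exp(-X @ w))   # p_i = sigma(w^T x_i)
    p = np.clip(p, eps, 1 - eps)       # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```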
Gradient Derivation
Let $\hat{\mathbf{p}} = (\hat{p}_1, \ldots, \hat{p}_n)^T$ with $\hat{p}_i = \sigma(\mathbf{w}^T \mathbf{x}_i)$. We derive the gradient:
$$\frac{\partial L}{\partial w_j} = -\frac{1}{n}\sum_{i=1}^n \left[\frac{y_i}{\hat{p}_i} \cdot \frac{\partial \hat{p}_i}{\partial w_j} - \frac{1-y_i}{1-\hat{p}_i} \cdot \frac{\partial \hat{p}_i}{\partial w_j}\right].$$
Using $\frac{\partial \hat{p}_i}{\partial w_j} = \hat{p}_i(1-\hat{p}_i)\, x_{ij}$:
$$\frac{\partial L}{\partial w_j} = -\frac{1}{n}\sum_{i=1}^n \left[y_i(1-\hat{p}_i) - (1-y_i)\hat{p}_i\right] x_{ij} = \frac{1}{n}\sum_{i=1}^n (\hat{p}_i - y_i)\, x_{ij}.$$
In matrix form:
$$\nabla_{\mathbf{w}} L = \frac{1}{n} X^T (\hat{\mathbf{p}} - \mathbf{y}).$$
This elegant expression has the same structure as the gradient of linear regression MSE: the residuals $(\hat{\mathbf{p}} - \mathbf{y})$ are mapped back onto the features by $X^T$.
The loss $L$ is convex in $\mathbf{w}$ (the Hessian $H = \frac{1}{n} X^T W X$, where $W = \text{diag}(\hat{p}_i(1-\hat{p}_i))$, is positive semidefinite), so gradient descent converges to a global minimum whenever one exists (with linearly separable data the unregularized loss has no finite minimizer, one motivation for the regularization discussed below).
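A minimal gradient-descent sketch built on this gradient (the synthetic data, fixed step size, and iteration count are our own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n examples, d features (last column of ones = bias feature).
n, d = 200, 3
X = np.hstack([rng.normal(size=(n, d - 1)), np.ones((n, 1))])
w_true = np.array([2.0, -1.0, 0.5])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(d)
lr = 0.5
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
    grad = X.T @ (p - y) / n           # (1/n) X^T (p_hat - y)
    w -= lr * grad

print(w)   # should land reasonably close to w_true
```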
Decision Boundary
The decision rule is: predict $\hat{y} = 1$ iff $\hat{p} \geq 0.5$, i.e., iff $\mathbf{w}^T \mathbf{x} \geq 0$. The decision boundary is the hyperplane
$$\{\mathbf{x} : \mathbf{w}^T \mathbf{x} = 0\},$$
which is a $(d-1)$-dimensional affine subspace of $\mathbb{R}^d$. Logistic regression thus produces a linear classifier despite modeling probabilities through the sigmoid nonlinearity.
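The two forms of the decision rule agree, as this small check illustrates (the toy `X` and `w` are made up):

```python
import numpy as np

def predict(X, w):
    """Predict 1 iff w^T x >= 0, equivalently sigma(w^T x) >= 0.5."""
    return (X @ w >= 0).astype(int)

X = np.array([[1.0, 2.0], [-3.0, 0.5], [0.0, 0.0]])
w = np.array([0.8, -0.4])
p = 1.0 / (1.0 + np.exp(-X @ w))
assert np.array_equal(predict(X, w), (p >= 0.5).astype(int))
```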
Multinomial Logistic Regression (Softmax)
For $K$-class classification with label $y \in \{1, \ldots, K\}$, we assign a weight vector $\mathbf{w}_k$ to each class and define
$$P(Y = k \mid \mathbf{x}) = \frac{e^{\mathbf{w}_k^T \mathbf{x}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^T \mathbf{x}}}.$$
This is the softmax transformation. It generalizes the sigmoid: for $K = 2$, choosing $\mathbf{w}_1 = -\mathbf{w}_2 = \mathbf{w}/2$ recovers binary logistic regression, since $P(Y=1 \mid \mathbf{x}) = \sigma((\mathbf{w}_1 - \mathbf{w}_2)^T \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x})$.
The maximum entropy interpretation is illuminating. Among all distributions over $K$ classes that match the observed feature expectations $\mathbb{E}[\mathbf{x}]$ under each class, the softmax distribution is the one with maximum entropy. This makes it the “least committal” choice consistent with the linear structure - a principled justification rooted in information theory.
The loss for multinomial logistic regression is cross-entropy:
$$L(\{\mathbf{w}_k\}) = -\frac{1}{n}\sum_{i=1}^n \sum_{k=1}^K \mathbf{1}[y_i = k] \log P(Y = k \mid \mathbf{x}_i).$$
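A minimal softmax / cross-entropy sketch (the array names, zero-based class labels, and the max-subtraction trick for numerical stability are our own choices):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax of the score matrix Z = X @ W, shape (n, K)."""
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def cross_entropy(W, X, y, eps=1e-12):
    """Average cross-entropy; y holds integer class labels in {0, ..., K-1}."""
    P = softmax(X @ W)                     # (n, K) class probabilities
    n = X.shape[0]
    return -np.mean(np.log(P[np.arange(n), y] + eps))
```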
Regularization
As with linear regression, logistic regression benefits from regularization to prevent overfitting, especially when features are high-dimensional or nearly collinear.
- $L_2$ (ridge): Add $\lambda \|\mathbf{w}\|_2^2$ to the loss. The gradient becomes $\nabla L + 2\lambda \mathbf{w}$ (see the sketch after this list). The objective becomes strictly convex, giving a unique minimizer.
- $L_1$ (lasso): Add $\lambda \|\mathbf{w}\|_1$. Induces sparsity, useful when many features are irrelevant.
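A sketch of the $L_2$-regularized gradient (the strength `lam` and the choice to penalize every weight, including any bias feature, are modeling choices made here for simplicity):

```python
import numpy as np

def ridge_logistic_grad(w, X, y, lam):
    """Gradient of the L2-regularized negative log-likelihood."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y) + 2 * lam * w

# One gradient step (X, y, w, lr as in the earlier sketches; lam is hypothetical):
# w -= lr * ridge_logistic_grad(w, X, y, lam=1e-2)
```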
Examples
Spam classification. Represent an email as a bag-of-words vector $\mathbf{x} \in \mathbb{R}^d$ where $x_j$ counts occurrences of word $j$. Train logistic regression on labeled examples (spam/not-spam). The learned weights $w_j$ indicate which words are predictive of spam. Words like “free” and “winner” get large positive weights; words like “attached” and “regards” get small or negative weights. The decision boundary separates spam from legitimate email in word-count space.
Multiclass digit recognition. On the MNIST dataset with $K = 10$ digit classes, multinomial logistic regression (softmax) learns 10 weight vectors, each acting as a template for one digit. Despite being a linear model, it achieves around 92% accuracy - a useful baseline before turning to convolutional networks.
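To reproduce a softmax baseline of this kind quickly, a scikit-learn sketch on its small built-in 8x8 digits dataset (not full MNIST, so the accuracy will differ from the figure above) might look like:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)   # 10 digit classes, 8x8 grayscale images
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# With the default lbfgs solver this fits a multinomial (softmax) model.
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```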