Logistic Regression - Classification Dressed Up as Regression
Helpful context:
- Linear Regression - The Best Straight Line Through the Cloud of Points
- Probability Distributions - The Shapes That Randomness Takes
Is this email spam? Is this tumor malignant? Is this transaction fraudulent?
These are classification problems. The output you want isn’t a continuous number - it’s a category. Spam or not spam. Malignant or benign. Fraud or legitimate. Better still, the model should report a probability alongside the category: how confident is it that this email is spam?
Linear regression gives you a number in $(-\infty, \infty)$. Probabilities must live in $[0, 1]$. Can we fix this with a single clever modification?
Yes. That modification is logistic regression.
Why Not Just Use Linear Regression?
Suppose you try to directly predict a binary label $y \in \{0, 1\}$ with a linear model: $\hat{y} = w^T x + b$.
Three problems. First, $w^T x + b$ can be any real number - there’s nothing stopping it from being $-17$ or $1053$, neither of which is a valid probability. Second, MSE penalizes a prediction for being far from its 0/1 target even when it falls on the correct side: an example with label 1 and $w^T x + b = 5$ incurs a squared error of 16, so the model wastes effort pulling such predictions back toward 0 or 1, and a linear function has no natural stopping point in those directions. Third, nothing about 0/1 labels suggests that the probability is linear in the raw features; the natural quantity to model as linear is the log-odds.
What we want is a function that maps any real number to a probability in $(0, 1)$. Several such functions exist, but one is especially natural.
The Sigmoid Function
Define:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
This is the sigmoid (or logistic) function. Let’s understand its shape before doing anything else.
At $z = 0$: $\sigma(0) = 1/(1 + 1) = 0.5$. The midpoint.
As $z \to +\infty$: $e^{-z} \to 0$, so $\sigma(z) \to 1$.
As $z \to -\infty$: $e^{-z} \to +\infty$, so $\sigma(z) \to 0$.
The function is S-shaped, smooth, monotone increasing, and always in $(0, 1)$. It’s the natural “squashing” function - it takes any real number and squeezes it into a probability.
Three properties that make it mathematically pleasant:
Symmetry: $\sigma(-z) = 1 - \sigma(z)$. The probability of class 0 is one minus the probability of class 1.
Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. The derivative is expressible in terms of the function itself. This will make gradient computations very clean.
Inverse: The inverse of $\sigma$ is $\sigma^{-1}(p) = \log\frac{p}{1-p}$, the log-odds or logit function.
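The three properties are easy to check numerically. The sketch below is only an illustration: the test point `z = 1.7` and the step `h` are arbitrary choices, and the two-branch form of `sigmoid` is a common trick for avoiding overflow rather than anything required by the definition.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)              # z < 0, so exp(z) lies in (0, 1)
    return ez / (1.0 + ez)

def logit(p):
    """Inverse of sigmoid: the log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

z = 1.7      # arbitrary test point
h = 1e-6     # step for the central finite difference
numeric_derivative = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

print("midpoint:  ", sigmoid(0.0))
print("symmetry:  ", sigmoid(-z), "vs", 1 - sigmoid(z))
print("derivative:", numeric_derivative, "vs", sigmoid(z) * (1 - sigmoid(z)))
print("inverse:   ", logit(sigmoid(z)), "vs", z)
```

Each printed pair should agree up to floating-point error.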
The Logistic Model
The model is:
$$P(Y = 1 \mid x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$
We model the probability of class 1 as the sigmoid of a linear function of the features. The probability of class 0 is:
$$P(Y = 0 \mid x) = 1 - \sigma(w^T x + b) = \sigma(-(w^T x + b))$$
To predict a class, threshold at 0.5: predict class 1 if $P(Y=1|x) > 0.5$, i.e., if $w^T x + b > 0$.
Log-odds interpretation. What is the model really saying? Compute the log-odds of class 1:
$$\log \frac{P(Y=1|x)}{P(Y=0|x)} = \log \frac{\sigma(w^Tx+b)}{1-\sigma(w^Tx+b)} = w^T x + b$$
The model says: the log-odds of class 1 is a linear function of the features. This is the right level of abstraction. Not the probability itself (which lives in $[0,1]$ and doesn’t have to be linear in $x$), but the log-odds (which lives in $(-\infty, \infty)$ and can freely be linear).
This is why the algorithm is called logistic regression - it’s regression in log-odds space. It’s a linear model, but for the log-odds, not the probability.
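To make the model concrete, here is a minimal sketch of a single prediction. The weights, bias, and input are made-up values chosen for illustration; `predict_proba` is a hypothetical helper name, not a reference to any library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """P(Y = 1 | x) under the logistic model: sigmoid of the linear score."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical parameters and input
w = [0.8, -1.2]
b = 0.3
x = [2.0, 1.0]

linear_score = sum(wj * xj for wj, xj in zip(w, x)) + b   # w^T x + b
p1 = predict_proba(w, b, x)
log_odds = math.log(p1 / (1.0 - p1))      # should recover the linear score
predicted_class = 1 if p1 > 0.5 else 0    # same as checking linear_score > 0

print(linear_score, p1, log_odds, predicted_class)
```

The log-odds recovered from the probability matches the linear score, which is the log-odds interpretation in action.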
Training: Maximum Likelihood
For linear regression, we minimized mean squared error, which turned out to be MLE under Gaussian noise. For logistic regression, we do MLE directly.
Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$ with $y^{(i)} \in \{0, 1\}$, and absorbing the bias $b$ into $w$ by appending a constant feature $1$ to every $x^{(i)}$, the likelihood of a single example is:
$$P(y^{(i)} | x^{(i)}, w) = \sigma(w^T x^{(i)})^{y^{(i)}} \cdot (1 - \sigma(w^T x^{(i)}))^{1-y^{(i)}}$$
This is a compact way of writing: if $y^{(i)} = 1$, the probability is $\sigma(w^T x^{(i)})$; if $y^{(i)} = 0$, it’s $1 - \sigma(w^T x^{(i)})$. The full log-likelihood is:
$$\ell(w) = \sum_{i=1}^m \left[ y^{(i)} \log \sigma(w^T x^{(i)}) + (1 - y^{(i)}) \log(1 - \sigma(w^T x^{(i)})) \right]$$
We maximize $\ell(w)$, or equivalently minimize the cross-entropy loss:
$$L(w) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log \sigma(w^T x^{(i)}) + (1-y^{(i)}) \log(1 - \sigma(w^T x^{(i)})) \right]$$
Discomfort check. Why is this called cross-entropy? In information theory, the cross-entropy between a true distribution $p$ and a predicted distribution $q$ is $H(p, q) = -\sum p \log q$. Here, the “true distribution” over labels for example $i$ is concentrated at $y^{(i)}$ (a degenerate distribution), and the “predicted distribution” assigns probability $\sigma(w^T x^{(i)})$ to class 1 and $1 - \sigma(w^T x^{(i)})$ to class 0. Minimizing the cross-entropy loss is exactly minimizing the KL divergence between the true label distribution and the model’s predicted distribution, averaged over training examples. The model tries to make its probability distribution as close as possible to reality.
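The loss is short enough to write out directly. This is a sketch on made-up data, with the bias folded into the weight vector through a leading constant feature; the three weight vectors are arbitrary and only meant to show how the loss reacts to good and bad predictions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy_loss(w, X, y):
    """Average negative log-likelihood of labels y under sigmoid(w . x)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return -total / len(y)

# Made-up data; the leading 1.0 in each row plays the role of the bias feature
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -3.0]]
y = [1, 0, 1, 0]

loss_at_zero = cross_entropy_loss([0.0, 0.0], X, y)   # every prediction is 0.5
loss_good = cross_entropy_loss([0.0, 2.0], X, y)      # scores agree with labels
loss_bad = cross_entropy_loss([0.0, -2.0], X, y)      # scores contradict labels

print(loss_at_zero, loss_good, loss_bad)
```

With all weights at zero the loss is $\log 2$, the cost of a coin flip. Confident wrong predictions are punished much more heavily than confident right ones are rewarded.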
The Gradient: Surprisingly Clean
Take the gradient of $L(w)$ with respect to $w$. Using the chain rule and the sigmoid derivative $\sigma'(z) = \sigma(z)(1-\sigma(z))$:
Let $\hat{p}^{(i)} = \sigma(w^T x^{(i)})$. Then:
$$\frac{\partial L}{\partial w} = \frac{1}{m} \sum_{i=1}^m (\hat{p}^{(i)} - y^{(i)}) x^{(i)}$$
In matrix form, with $X$ the design matrix and $\hat{p}$ the vector of predictions:
$$\nabla_w L = \frac{1}{m} X^T (\hat{p} - y)$$
The gradient has exactly the same form as in linear regression: $\frac{1}{m} X^T (\text{prediction} - \text{label})$. The error signal at each example is $\hat{p}^{(i)} - y^{(i)}$ - the predicted probability minus the true label. This error gets weighted by the feature vector $x^{(i)}$ and summed.
The sigmoid’s special derivative causes an elegant cancellation: the $\sigma(1-\sigma)$ term from the chain rule cancels with the $1/(\sigma(1-\sigma))$ from the log-derivative of the likelihood. What remains is just the raw prediction error.
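The cancellation can be written out for a single example. Here $z = w^T x$, $\hat{p} = \sigma(z)$, and $L_i = -[y \log \hat{p} + (1-y)\log(1-\hat{p})]$ is that example’s contribution to the loss (the symbol $L_i$ is introduced only for this sketch). The chain rule has three factors:

$$\frac{\partial L_i}{\partial \hat{p}} = -\frac{y}{\hat{p}} + \frac{1-y}{1-\hat{p}} = \frac{\hat{p} - y}{\hat{p}(1-\hat{p})}, \qquad \frac{\partial \hat{p}}{\partial z} = \hat{p}(1-\hat{p}), \qquad \frac{\partial z}{\partial w} = x$$

Multiplying them:

$$\frac{\partial L_i}{\partial w} = \frac{\hat{p} - y}{\hat{p}(1-\hat{p})} \cdot \hat{p}(1-\hat{p}) \cdot x = (\hat{p} - y)\, x$$

Averaging over the $m$ examples gives $\frac{1}{m} \sum_{i=1}^m (\hat{p}^{(i)} - y^{(i)}) x^{(i)}$.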
The gradient descent update is:
$$w \leftarrow w - \eta \cdot \nabla_w L = w - \frac{\eta}{m} X^T (\hat{p} - y)$$
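The update rule can be sketched as a plain loop. Everything specific here is an assumption made for illustration: the six-point dataset is made up (and deliberately not separable, so a finite optimum exists), the learning rate `eta = 0.5` and the 2000 steps are arbitrary, and the bias is handled as a leading constant feature.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(w, X):
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]

def loss(w, X, y):
    p = predict(w, X)
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
                for pi, yi in zip(p, y)) / len(y)

def gradient(w, X, y):
    """(1/m) X^T (p_hat - y), computed one coordinate at a time."""
    p = predict(w, X)
    m = len(y)
    return [sum((pi - yi) * xi[j] for pi, yi, xi in zip(p, y, X)) / m
            for j in range(len(w))]

# Made-up data: columns are [1 (bias), feature]; the classes overlap
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 0, 1, 1]

w = [0.0, 0.0]
eta = 0.5                       # learning rate (arbitrary choice)
initial_loss = loss(w, X, y)
for step in range(2000):
    g = gradient(w, X, y)
    w = [wj - eta * gj for wj, gj in zip(w, g)]
final_loss = loss(w, X, y)

print("weights:", w)
print("loss:", initial_loss, "->", final_loss)
```

The loss starts at $\log 2$ (all predictions equal 0.5) and decreases until the gradient is numerically zero.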
No Closed Form
Unlike linear regression, there is no formula like $(X^TX)^{-1}X^Ty$ that directly gives the optimal weights. The cross-entropy loss is convex, so any local minimum is a global minimum, but that minimum cannot be found by setting the gradient to zero and solving algebraically. The equation $X^T(\hat{p} - y) = 0$ involves $\hat{p} = \sigma(Xw)$, which is nonlinear in $w$, and it has no closed-form solution in general. (If the classes are perfectly linearly separable, there is no finite minimizer at all: the loss keeps decreasing as the weights grow.)
We must use iterative optimization. Gradient descent works and is simple. Newton’s method works better (faster convergence) by also using second-order information (the Hessian), but each step is more expensive. The Hessian of the cross-entropy loss is:
$$H = \frac{1}{m} X^T \text{diag}(\hat{p} \odot (1-\hat{p})) X$$
where $\odot$ is elementwise multiplication. This is positive semidefinite (confirming convexity), and Newton’s method with this Hessian is called IRLS (Iteratively Reweighted Least Squares), because each Newton step amounts to solving a weighted least-squares problem. It typically converges in 10 - 20 iterations, where gradient descent can need thousands.
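A minimal sketch of Newton’s method for the two-parameter case (bias plus one feature), where the $2 \times 2$ linear system can be solved by hand. The dataset is made up and not separable, the stopping tolerance is arbitrary, and `gradient_and_hessian` is a hypothetical helper written for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up, non-separable data: columns are [1 (bias), feature]
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 0, 1, 1]
m = len(y)

def gradient_and_hessian(w):
    """Return (1/m) X^T (p - y) and (1/m) X^T diag(p(1-p)) X."""
    g = [0.0, 0.0]
    H = [[0.0, 0.0], [0.0, 0.0]]
    for xi, yi in zip(X, y):
        p = sigmoid(w[0] * xi[0] + w[1] * xi[1])
        s = p * (1.0 - p)                    # diagonal weight for this example
        for j in range(2):
            g[j] += (p - yi) * xi[j] / m
            for k in range(2):
                H[j][k] += s * xi[j] * xi[k] / m
    return g, H

w = [0.0, 0.0]
iterations = 0
for _ in range(25):
    g, H = gradient_and_hessian(w)
    if max(abs(gj) for gj in g) < 1e-10:     # arbitrary tolerance
        break
    # Solve H * step = g with the explicit 2x2 inverse
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    step = [(H[1][1] * g[0] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - H[1][0] * g[0]) / det]
    w = [w[0] - step[0], w[1] - step[1]]     # Newton update: w <- w - H^{-1} g
    iterations += 1

print("weights:", w, "after", iterations, "Newton steps")
```

On this small problem a handful of Newton steps is enough. A production implementation would use a linear solver and guard against a near-singular Hessian, which this sketch does not.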
The Decision Boundary
When we use logistic regression as a classifier, we predict class 1 when $P(Y=1|x) > 0.5$, which is exactly when $w^T x + b > 0$.
The decision boundary is the set of points where $w^T x + b = 0$. This is a hyperplane - a line in 2D, a plane in 3D, a hyperplane in $n$ dimensions.
Logistic regression is a linear classifier. The boundary between classes is always a hyperplane, no matter what the data looks like. If class 0 examples and class 1 examples are arranged in two interlocking crescents, or in nested circles, logistic regression cannot separate them with a straight line.
Discomfort check. Think about the XOR problem. Four points: $(0,0) \to 0$, $(1,0) \to 1$, $(0,1) \to 1$, $(1,1) \to 0$. No straight line separates the 0s from the 1s. Any linear classifier will fail. Logistic regression, despite being a sophisticated probabilistic model, is fundamentally a straight-line separator. This limitation motivates every non-linear model that comes after - kernel methods, decision trees, neural networks. They all exist, in some sense, because logistic regression can’t do XOR.
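The XOR failure can be checked directly. The sketch below does two things: it searches a coarse grid of weights (the grid range and spacing are arbitrary) and counts how many of the four points each linear rule gets right, and it evaluates the cross-entropy gradient at $w = 0$, $b = 0$.

```python
import math
from itertools import product

points = [(0, 0), (1, 0), (0, 1), (1, 1)]
labels = [0, 1, 1, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def num_correct(b, w1, w2):
    """How many XOR points the rule 'predict 1 if b + w1*x1 + w2*x2 > 0' gets right."""
    correct = 0
    for (x1, x2), y in zip(points, labels):
        pred = 1 if b + w1 * x1 + w2 * x2 > 0 else 0
        correct += int(pred == y)
    return correct

# Coarse grid over (b, w1, w2), from -5.0 to 5.0 in steps of 0.5
grid = [k / 2 for k in range(-10, 11)]
best = max(num_correct(b, w1, w2) for b, w1, w2 in product(grid, repeat=3))

# Gradient of the cross-entropy loss at b = w1 = w2 = 0, features (1, x1, x2)
features = [(1.0, x1, x2) for x1, x2 in points]
grad_at_zero = [
    sum((sigmoid(0.0) - y) * f[j] for f, y in zip(features, labels)) / 4
    for j in range(3)
]

print("best number of correct points on the grid:", best)
print("gradient at zero weights:", grad_at_zero)
```

No rule on the grid classifies more than three of the four points correctly. The gradient at zero weights vanishes, and because the loss is convex, that makes the all-zero model a global minimizer: the best logistic fit to XOR predicts 0.5 for every point.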
Extension to Multiple Classes: Softmax
Binary logistic regression generalizes naturally to $K > 2$ classes using the softmax function.
For each class $k \in \{1, \ldots, K\}$, we have a separate weight vector $w_k$. The probability of class $k$ is:
$$P(Y = k \mid x) = \frac{\exp(w_k^T x)}{\sum_{j=1}^K \exp(w_j^T x)}$$
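This is the softmax function. It generalizes the sigmoid: with $K = 2$, dividing numerator and denominator by $\exp(w_1^T x)$ gives $P(Y = 1 \mid x) = \sigma((w_1 - w_2)^T x)$, which is logistic regression with weight vector $w = w_1 - w_2$. The outputs are positive and sum to 1, so they form a valid probability distribution over classes.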
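A short sketch of softmax and its two-class special case. Subtracting the maximum score before exponentiating is a standard numerical-stability trick, not part of the definition; it leaves the result unchanged because the common factor cancels in the ratio. The scores used are arbitrary.

```python
import math

def softmax(scores):
    """Map a list of real scores to probabilities that sum to 1."""
    shift = max(scores)                       # stability trick: avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Three classes with arbitrary scores w_k^T x
probs = softmax([2.0, 1.0, -1.0])

# Two classes: softmax reduces to the sigmoid of the score difference
s1, s2 = 0.4, -1.1
two_class = softmax([s1, s2])

print(probs, sum(probs))
print(two_class[0], "vs", sigmoid(s1 - s2))
```

The largest score gets the largest probability, and in the two-class case the first output agrees with $\sigma(s_1 - s_2)$.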
Training uses cross-entropy loss generalized to $K$ classes:
$$L = -\frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \mathbb{1}[y^{(i)} = k] \log P(Y = k \mid x^{(i)})$$
The gradient has the same clean form: for weight vector $w_k$, the gradient is $\frac{1}{m} X^T (\hat{p}_k - \mathbb{1}[y = k])$, where $\hat{p}_k$ is the predicted probability of class $k$.
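The multi-class gradient formula can be sanity-checked against finite differences. This is a sketch on made-up data with $K = 3$ classes. Classes are indexed from 0 in the code rather than from 1, the weights `W` are arbitrary, and the helper names are hypothetical.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def probabilities(W, x):
    """W is a list of K weight vectors; returns P(Y = k | x) for each k."""
    return softmax([sum(wj * xj for wj, xj in zip(wk, x)) for wk in W])

def loss(W, X, y):
    return -sum(math.log(probabilities(W, xi)[yi]) for xi, yi in zip(X, y)) / len(y)

def gradient(W, X, y):
    """Row k is (1/m) * sum_i (p_k(x_i) - 1[y_i = k]) * x_i."""
    K, d, m = len(W), len(W[0]), len(y)
    grad = [[0.0] * d for _ in range(K)]
    for xi, yi in zip(X, y):
        p = probabilities(W, xi)
        for k in range(K):
            err = p[k] - (1.0 if yi == k else 0.0)
            for j in range(d):
                grad[k][j] += err * xi[j] / m
    return grad

def numeric_gradient(W, X, y, h=1e-6):
    grad = [[0.0] * len(W[0]) for _ in W]
    for k in range(len(W)):
        for j in range(len(W[0])):
            W_plus = [row[:] for row in W]
            W_minus = [row[:] for row in W]
            W_plus[k][j] += h
            W_minus[k][j] -= h
            grad[k][j] = (loss(W_plus, X, y) - loss(W_minus, X, y)) / (2 * h)
    return grad

# Made-up data: leading 1.0 is the bias feature; labels are in {0, 1, 2}
X = [[1.0, 0.5, -1.2], [1.0, -0.3, 0.8], [1.0, 1.5, 0.2], [1.0, -1.0, -0.4]]
y = [0, 1, 2, 1]
W = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1], [-0.3, 0.2, 0.5]]

analytic = gradient(W, X, y)
numeric = numeric_gradient(W, X, y)
max_diff = max(abs(a - n) for ra, rn in zip(analytic, numeric) for a, n in zip(ra, rn))
print("largest difference between analytic and numeric gradient:", max_diff)
```

The two gradients should agree up to finite-difference error. Note also that the per-class gradients sum to zero across classes, because the errors $\hat{p}_k - \mathbb{1}[y = k]$ sum to zero for every example.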
Softmax logistic regression is also called multinomial logistic regression or maximum entropy (MaxEnt) classification. It is the standard output layer of neural networks for multi-class classification.
Discriminative vs. Generative Models
Logistic regression is a discriminative model: it directly models $P(Y|X)$, the probability of the label given the features. It doesn’t say anything about the distribution of $X$ itself.
A contrasting approach is Naive Bayes, a generative model: it models $P(X|Y)$ (the distribution of features given the label) and $P(Y)$ (the prior on labels), then applies Bayes' theorem:
$$P(Y|X) \propto P(X|Y) P(Y)$$
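As a worked example of the generative route, here is Bayes’ theorem applied to a toy spam filter with two binary word features. Every number below is made up for illustration, and the code uses the “naive” assumption that the features are independent given the class, so that $P(X|Y)$ is a product of per-feature terms.

```python
# Hypothetical prior P(Y) and per-word probabilities P(word present | Y)
prior = {"spam": 0.4, "ham": 0.6}
word_given_class = {
    "spam": {"free": 0.7, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}

# One email: contains "free", does not contain "meeting"
email = {"free": 1, "meeting": 0}

def unnormalized_posterior(cls):
    """P(X | Y = cls) * P(Y = cls), with features independent given the class."""
    score = prior[cls]
    for word, present in email.items():
        p = word_given_class[cls][word]
        score *= p if present else (1.0 - p)
    return score

scores = {cls: unnormalized_posterior(cls) for cls in prior}
evidence = sum(scores.values())              # normalizing constant P(X)
posterior = {cls: s / evidence for cls, s in scores.items()}

print(posterior)
```

The unnormalized scores are $0.4 \times 0.7 \times 0.9 = 0.252$ for spam and $0.6 \times 0.1 \times 0.5 = 0.03$ for ham, so the posterior probability of spam is $0.252 / 0.282 \approx 0.89$.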
Both approaches can be used for classification. Their trade-offs:
With limited data, generative models can be more efficient. If you correctly specify the data-generating process (e.g., “features are Gaussian within each class”), a generative model uses all the data more efficiently. Naive Bayes can work well with hundreds of examples where logistic regression struggles.
With abundant data, discriminative models often win. They make fewer assumptions - logistic regression doesn’t assume anything about the distribution of $X$, only about $P(Y|X)$. When the generative model’s assumptions are wrong (and they usually are), the discriminative model suffers less.
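There’s a theoretical result along these lines: as the amount of training data grows, a discriminative model like logistic regression reaches an error at least as low as that of its generative counterpart, while the generative model can get close to its own (possibly worse) limit with fewer examples. But we never have infinite data, and the crossover point depends on how wrong the generative model’s assumptions are.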
Connection to Information Theory
The cross-entropy loss has a deep connection to information theory.
The KL divergence from distribution $q$ to distribution $p$ is:
$$D_{\text{KL}}(p \,\|\, q) = \sum_k p_k \log \frac{p_k}{q_k}$$
It measures how different $q$ is from $p$. It’s always $\geq 0$, and equals 0 only when $p = q$.
For a single training example with true label $y$ and predicted probability $\hat{p}$, the cross-entropy is:
$$H(y, \hat{p}) = -y \log \hat{p} - (1-y) \log(1-\hat{p})$$
And the KL divergence between the true label distribution and the predicted distribution is $D_{\text{KL}} = H(y, \hat{p}) - H(y, y)$. The term $H(y,y)$ is the entropy of the true distribution: it doesn’t depend on $w$, and for a hard 0/1 label it is exactly zero. So minimizing cross-entropy is identical to minimizing KL divergence: we’re making the model’s predicted distribution as close as possible (in KL sense) to the true label distribution.
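The identity $D_{\text{KL}} = H(p, q) - H(p, p)$ is easy to verify numerically. In this sketch the distributions are arbitrary examples; terms with $p_k = 0$ are skipped, following the usual convention $0 \log 0 = 0$.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log q_k, skipping terms where p_k = 0."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def entropy(p):
    return cross_entropy(p, p)

def kl_divergence(p, q):
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

predicted = [0.2, 0.8]        # model: P(class 0), P(class 1)
one_hot = [0.0, 1.0]          # hard label y = 1
soft = [0.3, 0.7]             # a non-degenerate "true" distribution

# Hard label: entropy is zero, so KL and cross-entropy coincide
print(kl_divergence(one_hot, predicted), cross_entropy(one_hot, predicted))

# Soft label: they differ by the entropy of the true distribution
print(kl_divergence(soft, predicted),
      cross_entropy(soft, predicted) - entropy(soft))
```

For the hard label both quantities equal $-\log 0.8$. For the soft label they differ by a constant that does not involve the model, which is why minimizing one minimizes the other.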
Training logistic regression = minimizing KL divergence between model predictions and true labels. This isn’t just a nice interpretation - it connects to the broader framework of maximum entropy models and the foundations of statistical inference.
Summary
| Concept | Details |
|---|---|
| Model | $P(Y=1 \mid x) = \sigma(w^Tx + b)$ |
| Sigmoid | $\sigma(z) = \frac{1}{1+e^{-z}}$; maps $\mathbb{R} \to (0,1)$ |
| Log-odds | $\log\frac{P(Y=1 \mid x)}{P(Y=0 \mid x)} = w^Tx + b$ |
| Loss | Cross-entropy: $-\frac{1}{m}\sum[y\log\hat{p} + (1-y)\log(1-\hat{p})]$ |
| Gradient | $\nabla_w L = \frac{1}{m}X^T(\hat{p} - y)$ |
| Closed form? | No - iterative optimization required |
| Decision boundary | Hyperplane $\{x : w^Tx + b = 0\}$ |
| Multiclass | Softmax: $P(Y=k \mid x) = \frac{\exp(w_k^Tx)}{\sum_j \exp(w_j^Tx)}$ |
| Discriminative | Models $P(Y \mid X)$ directly, not the distribution of $X$ |
| Information theory | Minimizing cross-entropy = minimizing KL divergence |
Logistic regression takes a linear model and applies the sigmoid to produce a probability. The log-odds is linear in the features. Training maximizes likelihood - equivalently, minimizes cross-entropy. The gradient has the same structure as linear regression. The decision boundary is a hyperplane, which is the fundamental limitation: logistic regression cannot learn non-linear class boundaries. That limitation is what motivates neural networks.