Linear Regression - The Best Straight Line Through the Cloud of Points
Helpful context:
- Machine Learning - What It Means to Generalize From Data
- Expectation, Variance & Covariance - The Center, the Spread, and the Relationship
- Gradients & Partial Derivatives - Slopes in Every Direction at Once
You have a spreadsheet with 1000 rows. Each row is a house that recently sold: its size in square feet and its sale price. You want to predict the price of a new house given only its size.
The simplest possible strategy: draw a line. Price = w · size + b. If you pick the right w and b, this line will be a useful predictor. But how do you find the right w and b? You can’t just eyeball 1000 data points.
This question - how to fit the best line through data - is linear regression. It is the simplest machine learning algorithm that actually works. It is also the foundation for understanding logistic regression, neural networks, and virtually everything else in supervised learning. And despite its simplicity, it has a beautiful mathematical structure that rewards careful study.
The Model
For a single input feature, the model is:
$$h_w(x) = wx + b$$
This is a line. $w$ is the slope (how much price increases per square foot) and $b$ is the intercept (baseline price at zero square footage, which doesn’t need to be physically meaningful).
For $n$ input features - size, number of bedrooms, age of the house, distance to the nearest school - the model is:
$$h_w(x) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = w^T x + b$$
This is a hyperplane in $n+1$ dimensions. With one feature it’s a line. With two features it’s a plane through 3D space. With $n$ features it’s something you can’t visualize but can still compute with.
A notational convenience: absorb the bias $b$ into the weight vector by appending a constant feature $x_0 = 1$ to every input. Then $w^T x + b = w^T x$ with the extended vectors. We’ll keep $b$ explicit in the intuition and use the matrix form where it makes the algebra cleaner.
Given $m$ training examples $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$, we can write all predictions at once:
$$\hat{y} = Xw$$
where $X \in \mathbb{R}^{m \times (n+1)}$ is the design matrix (each row is one training example with a prepended 1), $w \in \mathbb{R}^{n+1}$ includes the bias, and $\hat{y} \in \mathbb{R}^m$ is the vector of predictions.
The Loss Function
We need a way to measure how wrong our predictions are. The standard choice is mean squared error (MSE):
$$L(w) = \frac{1}{m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right)^2 = \frac{1}{m} |Xw - y|^2$$
Why squared error rather than absolute error $|h_w(x^{(i)}) - y^{(i)}|$ or something else?
Mathematical convenience. The squared error is differentiable everywhere. Absolute error has a kink at zero and is not differentiable there, making optimization harder. MSE is a smooth bowl - it has a unique minimum, and calculus can find it.
Probabilistic justification. Suppose the true relationship is $y^{(i)} = w^T x^{(i)} + \varepsilon^{(i)}$ where each noise term $\varepsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$ is an independent Gaussian error. Then the likelihood of the data is:
$$P(y | X, w) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y^{(i)} - w^T x^{(i)})^2}{2\sigma^2}\right)$$
Maximizing this likelihood is equivalent to minimizing $\sum_i (y^{(i)} - w^T x^{(i)})^2$. MSE is exactly maximum likelihood estimation under Gaussian noise. This connection isn’t a coincidence - it’s why squared error is the natural loss for regression when errors are expected to be symmetric and small ones are much more common than large ones.
Large errors get penalized proportionally more (quadratically) than small ones. A prediction that’s off by 10 incurs 100 units of loss; a prediction off by 20 incurs 400. This is a feature, not a bug - it strongly penalizes outlier predictions.
The Normal Equations: Exact Solution
The loss $L(w) = \frac{1}{m}|Xw - y|^2$ is a quadratic function of $w$. Quadratic functions have a unique minimum (if the leading coefficient is positive), and calculus can find it exactly. Set the gradient to zero.
Computing the gradient. Expand the squared norm:
$$|Xw - y|^2 = (Xw - y)^T(Xw - y) = w^T X^T X w - 2 y^T X w + y^T y$$
Differentiating with respect to $w$:
$$\nabla_w L = \frac{1}{m}(2X^T X w - 2 X^T y) = \frac{2}{m} X^T(Xw - y)$$
Set this to zero:
$$X^T X w = X^T y$$
Solving for $w$:
$$w = (X^T X)^{-1} X^T y$$
This is the normal equation. It gives the exact optimal weights in closed form - no iteration, no learning rate, no hyperparameters. Plug in your data, invert a matrix, multiply, done.
Discomfort check. Why can we invert $X^T X$? It’s an $(n+1) \times (n+1)$ matrix. It’s invertible exactly when the columns of $X$ are linearly independent - that is, when no feature is a perfect linear combination of the others. If you have two identical features (say, square footage in both square feet and square feet again), the columns are linearly dependent and $X^T X$ is singular. In practice, this means removing duplicate or perfectly collinear features before solving. Near-perfect collinearity (two very highly correlated features) doesn’t make the matrix singular but makes it ill-conditioned - small perturbations in the data cause large swings in $w$. Ridge regression (below) directly addresses this.
Geometric Interpretation: Projection
Here is the most elegant way to see what linear regression is doing.
The predictions $\hat{y} = Xw$ are always in the column space of $X$ - the set of all vectors that can be written as linear combinations of the columns of $X$. As $w$ ranges over all possible weight vectors, $Xw$ traces out a subspace of $\mathbb{R}^m$.
The true labels $y$ almost certainly don’t lie in this subspace (unless the data fits a hyperplane perfectly, which real data never does). We want to find the point in col$(X)$ that is closest to $y$. Closest in the Euclidean sense means minimizing $|Xw - y|^2$.
The closest point in a subspace to a vector outside it is the orthogonal projection. The residual vector $y - \hat{y}$ must be perpendicular to every column of $X$:
$$X^T (y - \hat{y}) = 0 \implies X^T y = X^T X w \implies w = (X^T X)^{-1} X^T y$$
The normal equations literally say “the residuals are normal (perpendicular) to the column space.” This is where the name comes from. Ordinary least squares is the orthogonal projection of the label vector onto the column space of the feature matrix.
This picture is useful beyond aesthetics. It immediately tells you: if you add a feature that is already in the column space of existing features, you gain nothing - the projection doesn’t change. Feature redundancy is geometrically visible.
Maximum Likelihood: Same Solution, Different Story
Let’s make the probabilistic interpretation precise. Assume each $y^{(i)} = w^T x^{(i)} + \varepsilon^{(i)}$ with $\varepsilon^{(i)} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$.
The conditional distribution of $y^{(i)}$ given $x^{(i)}$ and $w$ is:
$$y^{(i)} | x^{(i)}, w \sim \mathcal{N}(w^T x^{(i)}, \sigma^2)$$
The log-likelihood of the entire dataset is:
$$\ell(w) = \sum_{i=1}^m \log P(y^{(i)} | x^{(i)}, w) = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^m (y^{(i)} - w^T x^{(i)})^2$$
Maximizing $\ell(w)$ with respect to $w$ is equivalent to minimizing $\sum_i (y^{(i)} - w^T x^{(i)})^2$. The solution is exactly the normal equation. Linear regression = MLE under Gaussian noise. The least-squares criterion isn’t arbitrary - it’s the right thing to do when you believe errors are normally distributed.
When the Exact Solution Is Too Slow
The normal equation is exact but expensive. Computing $X^T X$ costs $O(mn^2)$ operations (matrix multiply of $m \times (n+1)$ by $(n+1) \times m$). Inverting the $(n+1) \times (n+1)$ result costs $O(n^3)$.
For $m = 10{,}000$ examples and $n = 1{,}000$ features, this is manageable. But for $n = 1{,}000{,}000$ features - common in text processing, genomics, or online advertising - the inversion alone requires $10^{18}$ operations. It is simply infeasible.
The alternative is gradient descent: iteratively nudge $w$ in the direction that decreases the loss. The gradient of $L$ is:
$$\nabla_w L = \frac{2}{m} X^T (Xw - y)$$
The update rule is:
$$w \leftarrow w - \eta \cdot \frac{2}{m} X^T (Xw - y)$$
Each step costs $O(mn)$ - one matrix-vector multiply to compute predictions, one to compute the gradient. For large $n$, this is far cheaper than forming and inverting $X^T X$.
For very large datasets, even computing the gradient over all $m$ examples is expensive. Stochastic gradient descent (SGD) estimates the gradient using a single randomly-chosen example or a small mini-batch. Each step is cheap, and convergence still happens in expectation.
Discomfort check. Gradient descent converges to the same solution as the normal equation (since the MSE loss is convex with a unique global minimum), but it takes many iterations. How many? For MSE, the convergence rate depends on the learning rate $\eta$ and the condition number of $X^T X$ (the ratio of its largest to smallest eigenvalue). A poorly conditioned matrix - one with eigenvalues spanning many orders of magnitude - causes slow convergence and requires careful tuning of $\eta$. This is one reason preprocessing (feature normalization) matters: it keeps eigenvalues in a similar range and speeds up gradient descent.
Measuring Fit: R²
How do you know if the model is good? The loss value in absolute terms is hard to interpret - it depends on the scale of $y$. A normalized measure is the coefficient of determination, $R^2$:
$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum_i (y^{(i)} - \hat{y}^{(i)})^2}{\sum_i (y^{(i)} - \bar{y})^2}$$
where $\bar{y}$ is the mean of the labels. $SS_{\text{res}}$ is the variance unexplained by the model; $SS_{\text{tot}}$ is the total variance in the labels.
$R^2 = 1$: the model explains all the variance. Every prediction is perfect.
$R^2 = 0$: the model explains none of the variance. Predicting the mean $\bar{y}$ for every input would do just as well.
$R^2 < 0$: the model is worse than predicting the mean. This can happen for misspecified models evaluated on test data.
$R^2$ is the fraction of variance explained by the model. It’s a useful summary but not a complete picture - a model can have high $R^2$ on training data but poor $R^2$ on test data (overfitting).
Ridge Regression: Regularization
Linear regression minimizes $|Xw - y|^2$. If $n$ is large relative to $m$, or if features are highly correlated, the solution can have very large weights - it fits the training data well by exploiting collinearities, then generalizes poorly.
Ridge regression adds a penalty on the size of the weights:
$$L_{\text{ridge}}(w) = |Xw - y|^2 + \lambda |w|^2$$
The $\lambda |w|^2$ term is the regularizer. Taking the gradient and setting to zero:
$$2X^T(Xw - y) + 2\lambda w = 0$$
$$w_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$$
Two things to notice. First, $(X^T X + \lambda I)$ is always invertible for $\lambda > 0$, even when $X^T X$ is singular. Adding $\lambda$ to the diagonal shifts all eigenvalues up by $\lambda$, so none are zero. Ridge regression solves the collinearity problem by construction.
Second, as $\lambda$ increases, the weights are pulled toward zero. With very large $\lambda$, all weights are near zero and the model predicts close to the mean regardless of input. The $\lambda$ parameter controls the bias-variance trade-off: larger $\lambda$ introduces more bias (predictions closer to the mean) but reduces variance (less sensitivity to the particular training set).
Bayesian interpretation. Ridge regression is maximum a posteriori (MAP) estimation with a Gaussian prior $w \sim \mathcal{N}(0, \sigma^2/\lambda \cdot I)$. The prior says: before seeing data, we believe the weights are small. The posterior mode (MAP estimate) balances the likelihood (data fit) against the prior (weight size). Ridge is not just a regularization trick - it’s exact Bayesian inference under a specific prior.
Lasso: Sparse Solutions
Lasso (Least Absolute Shrinkage and Selection Operator) replaces the squared penalty with an absolute value:
$$L_{\text{lasso}}(w) = |Xw - y|^2 + \lambda |w|_1 = |Xw - y|^2 + \lambda \sum_j |w_j|$$
The $\ell_1$ penalty has a remarkable property: the optimal solution often has many weights equal to exactly zero. Lasso simultaneously fits the model and performs feature selection, automatically setting irrelevant weights to zero.
Why does $\ell_1$ produce sparsity while $\ell_2$ doesn’t? The geometry is the reason. The $\ell_2$ ball $\{w : |w|^2 \leq c\}$ is a smooth sphere. The $\ell_1$ ball $\{w : |w|_1 \leq c\}$ has corners - one at each coordinate axis. The constrained minimum of a function on a smooth sphere can be anywhere on the boundary, but on the $\ell_1$ ball, the minimum tends to hit corners where many coordinates are zero.
Lasso doesn’t have a closed-form solution (because $|w|_1$ is not differentiable at $w_j = 0$). It requires iterative methods such as coordinate descent. But it’s an indispensable tool when you have many features and expect only a few to be relevant.
Assumptions and When They Break
Linear regression makes several assumptions. Knowing them tells you when the model is trustworthy and when you should be suspicious.
Linearity. The true relationship between features and label is linear (or linear after your feature engineering). If the true relationship is cubic, a linear model will systematically mispredict. Diagnostic: plot residuals vs. fitted values - a curved pattern suggests non-linearity.
Homoscedasticity. The variance of the noise $\varepsilon^{(i)}$ is constant across all inputs. If variance grows with the prediction (e.g., expensive houses have more variable prices), OLS estimates are still unbiased but are no longer the minimum-variance estimator. Weighted least squares or log-transforming the label can help.
Independence. Errors are independent of each other. This fails for time series (today’s error correlates with tomorrow’s) or spatial data (nearby houses have correlated errors). Ignoring this gives overly optimistic confidence intervals.
No perfect multicollinearity. No feature is an exact linear combination of others. Near-multicollinearity doesn’t break the algebra but causes numerical instability and inflated variance in the weights. Ridge regression is the remedy.
Summary
| Concept | Formula | Notes |
|---|---|---|
| Model | $\hat{y} = Xw$ | Hyperplane in feature space |
| MSE loss | $L = \frac{1}{m}|Xw - y|^2$ | Differentiable, unique minimum |
| Normal equation | $w = (X^TX)^{-1}X^Ty$ | Exact; costs $O(mn^2 + n^3)$ |
| Geometric view | $\hat{y}$ = orthogonal projection of $y$ onto col$(X)$ | Residuals $\perp$ col$(X)$ |
| MLE connection | Minimizing MSE = MLE under $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ | Same solution |
| Gradient descent | $w \leftarrow w - \frac{2\eta}{m}X^T(Xw-y)$ | Scales to large $n$, $m$ |
| $R^2$ | $1 - SS_{\text{res}}/SS_{\text{tot}}$ | Fraction of variance explained |
| Ridge | $w = (X^TX + \lambda I)^{-1}X^Ty$ | Shrinks weights; always invertible |
| Lasso | $|Xw - y|^2 + \lambda|w|_1$ | Sparse weights; feature selection |
Linear regression fits a hyperplane. The normal equation finds the best hyperplane exactly. Geometrically, it projects the label vector onto the column space of the feature matrix. Probabilistically, it is MLE under Gaussian noise. Gradient descent finds the same answer iteratively when the matrix inversion is too expensive. Ridge and Lasso extend the model to handle collinearity and sparsity.
This is the simplest story in supervised learning. But the same structure - a model, a loss, and optimization - recurs in every algorithm that follows.
Read next: