Linear Regression
The Linear Model
Linear regression models the relationship between inputs $\mathbf{x} \in \mathbb{R}^d$ and a real-valued output $y \in \mathbb{R}$ as
$$\hat{y} = \mathbf{w}^T \mathbf{x} + b,$$
where $\mathbf{w} \in \mathbb{R}^d$ is the weight vector and $b \in \mathbb{R}$ is the bias. Absorbing the bias by appending a constant feature 1 to every input, we write $\hat{y} = \mathbf{w}^T \mathbf{x}$ with $\mathbf{x} \in \mathbb{R}^{d+1}$. In matrix form, predictions over $n$ examples are $\hat{\mathbf{y}} = X\mathbf{w}$, where $X \in \mathbb{R}^{n \times (d+1)}$ is the design matrix.
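As a concrete illustration (a minimal sketch with made-up data, not taken from the text above), the bias-absorption trick and the matrix form of the predictions look like this in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X_raw = rng.normal(size=(n, d))          # n examples, d raw features
X = np.hstack([X_raw, np.ones((n, 1))])  # append a constant 1: design matrix, shape (n, d + 1)
w = rng.normal(size=d + 1)               # weight vector; the last entry plays the role of the bias b
y_hat = X @ w                            # all n predictions in one matrix-vector product
```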
The Loss: Mean Squared Error
The standard loss for regression is the mean squared error (MSE):
$$L(\mathbf{w}) = \frac{1}{n} \|X\mathbf{w} - \mathbf{y}\|^2 = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{w}^T \mathbf{x}_i - y_i)^2.$$
MSE penalizes large residuals quadratically, making it sensitive to outliers but analytically tractable.
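In code, the loss is a one-liner; the following sketch (the names are ours, purely illustrative) mirrors the formula above:

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error L(w) = (1/n) * ||X w - y||^2."""
    residuals = X @ w - y
    return residuals @ residuals / len(y)
```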
The Normal Equations
To find the minimizer, compute the gradient and set it to zero.
$$\nabla_{\mathbf{w}} L(\mathbf{w}) = \frac{2}{n} X^T(X\mathbf{w} - \mathbf{y}) = \mathbf{0}.$$
This gives the normal equations:
$$X^T X \mathbf{w} = X^T \mathbf{y}.$$
If $X^T X$ is invertible (equivalently, $X$ has full column rank), the unique solution is
$$\mathbf{w}^\ast = (X^T X)^{-1} X^T \mathbf{y}.$$
Verification that this is a minimum. The Hessian is $\nabla^2 L = \frac{2}{n} X^T X$, which is positive semidefinite (and positive definite when $X$ has full column rank). Therefore $L$ is convex, and any critical point is a global minimum.
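In practice the solution is computed by solving the linear system rather than forming the inverse explicitly. A minimal NumPy sketch on synthetic data (our own construction, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])          # design matrix with bias column
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=n)  # noisy linear targets

# Solve the normal equations X^T X w = X^T y; solve() avoids an explicit inverse.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# lstsq() minimizes ||Xw - y|| via an SVD and is more robust when X is ill-conditioned.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_star, w_lstsq)
```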
Geometric Interpretation
The prediction $\hat{\mathbf{y}} = X\mathbf{w}^\ast$ is the orthogonal projection of $\mathbf{y}$ onto the column space $\text{col}(X)$.
To see why, note that $\hat{\mathbf{y}} = X(X^T X)^{-1} X^T \mathbf{y} = P\mathbf{y}$, where $P = X(X^T X)^{-1}X^T$ is the projection matrix. The residual $\mathbf{y} - \hat{\mathbf{y}} = (I - P)\mathbf{y}$ is orthogonal to every column of $X$:
$$X^T(\mathbf{y} - X\mathbf{w}^\ast) = X^T \mathbf{y} - X^T X \mathbf{w}^\ast = \mathbf{0},$$
which is precisely the normal equations. Geometrically, linear regression finds the point in the column space of $X$ closest to $\mathbf{y}$ in Euclidean distance.
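A quick numerical check of this picture (a sketch with arbitrary data; the projection matrix $P$ is formed explicitly only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)

P = X @ np.linalg.solve(X.T @ X, X.T)    # projection matrix P = X (X^T X)^{-1} X^T
y_hat = P @ y                            # orthogonal projection of y onto col(X)
residual = y - y_hat

assert np.allclose(X.T @ residual, 0.0)  # residual is orthogonal to every column of X
assert np.allclose(P @ P, P)             # projecting twice changes nothing: P is idempotent
```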
Ridge Regression ($L_2$ Regularization)
When features are nearly collinear, $X^T X$ is nearly singular and $\mathbf{w}^\ast$ can have very large magnitude. Ridge regression adds an $L_2$ penalty to control this:
$$L_\lambda(\mathbf{w}) = \frac{1}{n}\|X\mathbf{w} - \mathbf{y}\|^2 + \lambda \|\mathbf{w}\|^2, \quad \lambda > 0.$$
Setting the gradient to zero:
$$\nabla_\mathbf{w} L_\lambda = \frac{2}{n} X^T(X\mathbf{w} - \mathbf{y}) + 2\lambda \mathbf{w} = \mathbf{0},$$
gives the ridge solution:
$$\mathbf{w}^\ast_\text{ridge} = (X^T X + n\lambda I)^{-1} X^T \mathbf{y}.$$
The matrix $X^T X + n\lambda I$ has eigenvalues $\sigma_i^2 + n\lambda > 0$, where the $\sigma_i$ are the singular values of $X$, so it is always invertible regardless of $X$. Ridge shrinks each weight toward zero, reducing variance at the cost of introducing some bias.
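The closed form translates directly to code; this sketch (our own helper, assuming the loss exactly as written above) adds $n\lambda$ to the diagonal before solving:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + n*lam*I)^{-1} X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
y = rng.normal(size=40)
w_ridge = ridge_fit(X, y, lam=0.1)
```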
Lasso ($L_1$ Regularization)
Lasso replaces the $L_2$ penalty with an $L_1$ penalty:
$$L_\lambda(\mathbf{w}) = \frac{1}{n}\|X\mathbf{w} - \mathbf{y}\|^2 + \lambda \|\mathbf{w}\|_1.$$
Unlike ridge, lasso has no closed form because $\|\mathbf{w}\|_1$ is not differentiable at zero. However, its solutions are typically sparse: many weights are driven to exactly zero, performing automatic feature selection.
Geometric intuition. The $L_1$ ball $\{\mathbf{w} : \|\mathbf{w}\|_1 \leq c\}$ is a polytope with corners on the coordinate axes. In the equivalent constrained formulation (minimize the MSE subject to $\|\mathbf{w}\|_1 \leq c$), the solution lies where the smallest ellipsoidal level set of the MSE loss first touches this polytope, which typically happens at a corner; since corners lie on coordinate axes, some weights are forced to exactly zero.
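Although there is no closed form, lasso can be solved with simple iterative methods. The proximal-gradient (ISTA) sketch below is one such option, not necessarily what a production library uses (scikit-learn's Lasso, for instance, relies on coordinate descent); the step size and iteration count are illustrative assumptions:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1; entries with |v_j| <= t become exactly zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Proximal gradient descent for (1/n) * ||X w - y||^2 + lam * ||w||_1."""
    n, d = X.shape
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the smooth term's gradient
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ w - y)       # gradient of the MSE term only
        w = soft_threshold(w - step * grad, step * lam)
    return w
```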
Probabilistic View: MLE under Gaussian Noise
Assume the data-generating process is
$$y_i = \mathbf{w}^T \mathbf{x}_i + \varepsilon_i, \quad \varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2).$$
The likelihood of the training data under this model is
$$p(\mathbf{y} \mid X, \mathbf{w}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2}\right).$$
Maximizing the log-likelihood:
$$\log p = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{w}^T \mathbf{x}_i)^2.$$
Since $\sigma^2$ does not depend on $\mathbf{w}$, maximizing over $\mathbf{w}$ is equivalent to minimizing $\sum_i (y_i - \mathbf{w}^T \mathbf{x}_i)^2$, which has the same minimizer as the MSE loss. MLE under Gaussian noise is therefore identical to ordinary least squares.
Similarly, adding a Gaussian prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \tau^2 I)$ and computing the MAP estimate yields ridge regression with $\lambda = \sigma^2 / (n\tau^2)$.
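This equivalence is easy to check numerically; the sketch below (arbitrary data and variances, chosen only for illustration) compares the MAP closed form with the ridge closed form:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
y = rng.normal(size=60)
n, d = X.shape
sigma2, tau2 = 0.5, 2.0

# MAP: minimize (1/(2*sigma2)) * ||X w - y||^2 + (1/(2*tau2)) * ||w||^2.
w_map = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(d), X.T @ y)

# Ridge with lambda = sigma2 / (n * tau2) gives identical weights.
lam = sigma2 / (n * tau2)
w_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
assert np.allclose(w_map, w_ridge)
```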
Key Assumptions
Linear regression achieves its nice properties only when several assumptions hold:
- Linearity: $\mathbb{E}[y \mid x] = \mathbf{w}^T \mathbf{x}$.
- Homoscedasticity: $\text{Var}(\varepsilon_i) = \sigma^2$ is constant across observations.
- No multicollinearity: The columns of $X$ are linearly independent, ensuring $X^TX$ is invertible.
- Exogeneity: $\mathbb{E}[\varepsilon_i \mid \mathbf{x}_i] = 0$; the errors have zero mean conditional on the inputs (in particular, they are uncorrelated with them).
Violations lead to biased or inefficient estimates and require modified estimators.
Examples
Feature scaling. Rescaling or shifting the features does not change the fitted values from the normal-equation solution (the weights and bias adjust to compensate), but gradient-based solvers converge much faster when features are on similar scales. Standardizing features to zero mean and unit variance (z-scoring) is standard practice: $\tilde{x}_j = (x_j - \mu_j)/\sigma_j$.
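A minimal z-scoring helper (a sketch; the key detail is keeping the training statistics so new data can be transformed identically):

```python
import numpy as np

def standardize(X):
    """Z-score each feature column: zero mean, unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma  # return mu, sigma to reuse on test data
```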
Polynomial regression as a special case. To fit a degree-$p$ polynomial in a scalar input $x$, define the feature map $\phi(x) = (1, x, x^2, \ldots, x^p)^T$ and apply linear regression with design matrix $\Phi$ whose rows are $\phi(x_i)^T$. The model $\hat{y} = \mathbf{w}^T \phi(x)$ is nonlinear in $x$ but linear in $\mathbf{w}$, so all the linear regression theory applies directly. This illustrates that “linear” in linear regression refers to linearity in the parameters, not in the raw features.
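A short sketch of this feature map (using NumPy's Vandermonde helper; the cubic target below is an arbitrary example of our own):

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Rows are phi(x_i) = (1, x_i, x_i^2, ..., x_i^degree)."""
    return np.vander(x, N=degree + 1, increasing=True)

rng = np.random.default_rng(4)
x = np.linspace(-1.0, 1.0, 20)
y = 2 * x**3 - x + 0.05 * rng.normal(size=20)

Phi = polynomial_design_matrix(x, degree=3)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # ordinary least squares on the mapped features
```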