Orthogonality & Projections
Orthogonality
Two vectors $u, v$ in an inner product space are orthogonal, written $u \perp v$, if $\langle u, v \rangle = 0$. A set of vectors is orthogonal if every pair is orthogonal, and orthonormal if additionally each vector has unit norm: $|e_i| = 1$ for all $i$.
Proposition. An orthogonal set of nonzero vectors is linearly independent.
Proof. Suppose $\sum_{i=1}^k \alpha_i v_i = 0$ with all $v_i \neq 0$ and $\langle v_i, v_j \rangle = 0$ for $i \neq j$. Take the inner product of both sides with $v_j$: $$0 = \left\langle \sum_i \alpha_i v_i, v_j \right\rangle = \alpha_j |v_j|^2$$ Since $|v_j|^2 > 0$, we get $\alpha_j = 0$. This holds for each $j$. $\square$
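A quick numerical check of the proposition (a sketch using NumPy; the three vectors below are arbitrary illustrative choices):

```python
import numpy as np

# Three mutually orthogonal, nonzero vectors in R^4.
v1 = np.array([1.0, 1.0, 0.0, 0.0])
v2 = np.array([1.0, -1.0, 0.0, 0.0])
v3 = np.array([0.0, 0.0, 2.0, 0.0])
V = np.column_stack([v1, v2, v3])

print(V.T @ V)                     # diagonal Gram matrix: off-diagonal inner products are 0
print(np.linalg.matrix_rank(V))    # 3 -- the columns are linearly independent
```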
Orthogonal Complement
Definition. For a subset $S \subseteq V$, the orthogonal complement is $$S^\perp = \{v \in V : \langle v, s \rangle = 0 \text{ for all } s \in S\}$$
$S^\perp$ is always a subspace (even if $S$ is not). Key properties for a subspace $U \subseteq V$ (with $V$ finite-dimensional):
- $(U^\perp)^\perp = U$
- $\dim U + \dim U^\perp = \dim V$
- $U \cap U^\perp = \{0\}$
- $V = U \oplus U^\perp$ (direct sum decomposition)
The last point is the orthogonal decomposition theorem, and it is the foundation of projection theory.
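Here is a small numerical illustration of the decomposition (a sketch in NumPy; it borrows the projection-matrix formula $P = A(A^T A)^{-1}A^T$ derived later in the post, and the subspace and vector are random choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))           # columns span a 2-dimensional subspace U of R^5
P = A @ np.linalg.inv(A.T @ A) @ A.T      # orthogonal projection onto U

v = rng.standard_normal(5)
u_part = P @ v                            # component in U
w_part = v - u_part                       # component in U^perp

print(np.allclose(v, u_part + w_part))    # True: v = u + w
print(np.allclose(A.T @ w_part, 0))       # True: w is orthogonal to everything in U
# dim U + dim U^perp = dim V:
print(np.linalg.matrix_rank(P) + np.linalg.matrix_rank(np.eye(5) - P))   # 5
```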
Orthonormal Bases
An orthonormal basis $(e_1, \ldots, e_n)$ satisfies $\langle e_i, e_j \rangle = \delta_{ij}$ (the Kronecker delta). The advantages over a general basis are significant:
Coordinates are inner products. If $v = \sum_i \alpha_i e_i$, then $\langle v, e_j \rangle = \alpha_j$. No linear system needs to be solved to find coordinates.
Norm is Euclidean in coordinates. $|v|^2 = \sum_i |\alpha_i|^2$ (Parseval’s identity). There is no Gram matrix to worry about.
Change of basis is orthogonal. The matrix $Q$ whose columns are an orthonormal basis satisfies $Q^T Q = I$ (and $QQ^T = I$ if $Q$ is square). Orthogonal matrices preserve inner products: $\langle Qx, Qy \rangle = x^T Q^T Q y = x^T y$.
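These three advantages are easy to see numerically (a sketch; the orthonormal basis below comes from the QR factorisation of a random matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # columns of Q: an orthonormal basis of R^4

v = rng.standard_normal(4)
coords = Q.T @ v                                   # coordinates are just inner products <v, e_i>

print(np.allclose(Q @ coords, v))                  # reconstruct v from its coordinates
print(np.allclose(np.sum(coords**2), v @ v))       # Parseval: ||v||^2 = sum of squared coordinates
print(np.allclose(Q.T @ Q, np.eye(4)))             # Q^T Q = I
```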
Orthogonal Projection
Projection onto a 1-Dimensional Subspace
Given a nonzero vector $u$ and a vector $v$, the orthogonal projection of $v$ onto $\text{span}(u)$ is: $$P_u v = \frac{\langle v, u \rangle}{\langle u, u \rangle} u = \frac{\langle v, u \rangle}{|u|^2} u$$
The scalar $\frac{\langle v, u \rangle}{|u|}$ is the component of $v$ along $u$ (its signed length in the direction of $u$). The residual $v - P_u v$ is orthogonal to $u$ by direct computation: $$\langle v - P_u v, u \rangle = \langle v, u \rangle - \frac{\langle v,u \rangle}{|u|^2}\langle u, u \rangle = 0$$
Projection onto a line spanned by u:

```
        v
       /|
      / |
     /  |  (v - P_u v)  <-- orthogonal to u
    /   |
   /    |
  *-----*-----------> u
  0     P_u v

  v = P_u v + (v - P_u v)
  <P_u v, v - P_u v> = 0
```
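A minimal numerical version of the 1-dimensional projection (the vectors are arbitrary illustrative choices):

```python
import numpy as np

u = np.array([2.0, 1.0, 0.0])             # direction to project onto
v = np.array([1.0, 3.0, 5.0])

P_u_v = (v @ u) / (u @ u) * u             # P_u v = (<v,u> / <u,u>) u
residual = v - P_u_v

print(P_u_v)                              # [2. 1. 0.]
print(np.isclose(residual @ u, 0.0))      # True: the residual is orthogonal to u
```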
Projection onto a General Subspace
Let $U = \text{span}(u_1, \ldots, u_k)$ be a subspace of $V$. The orthogonal projection of $v$ onto $U$ is the unique vector $P_U v \in U$ such that $v - P_U v \in U^\perp$.
If $(e_1, \ldots, e_k)$ is an orthonormal basis for $U$, then: $$P_U v = \sum_{i=1}^k \langle v, e_i \rangle e_i$$
This is the cleanest formula. When the basis is not orthonormal, write the basis vectors as columns of a matrix $A$. Then: $$P_U v = A(A^T A)^{-1} A^T v$$
The matrix $P = A(A^T A)^{-1} A^T$ is the orthogonal projection matrix onto $\text{col}(A)$.
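Both formulas give the same projection, which is worth checking numerically (a sketch; $A$ is a random matrix with independent columns, and the orthonormal basis comes from its thin QR factorisation):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))            # non-orthonormal basis of a 3-dim subspace U

P = A @ np.linalg.inv(A.T @ A) @ A.T       # P_U = A (A^T A)^{-1} A^T

Q, _ = np.linalg.qr(A)                     # orthonormal basis for col(A)
P_ortho = Q @ Q.T                          # matrix form of sum_i <., e_i> e_i

print(np.allclose(P, P_ortho))             # True: same projection either way
```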
Properties of Projection Matrices
A matrix $P$ is an orthogonal projection if and only if:
- $P^2 = P$ (idempotent: projecting twice is the same as projecting once)
- $P^T = P$ (symmetric)
Idempotence alone defines a projection (not necessarily orthogonal). Adding symmetry forces $\text{null}(P) = \text{col}(P)^\perp$, so the projection is along directions orthogonal to its range. The eigenvalues of a projection matrix are only 0 and 1: eigenvectors with eigenvalue 1 lie in $\text{col}(P)$, and eigenvectors with eigenvalue 0 lie in $\text{null}(P) = \text{col}(P)^\perp$.
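A quick check of these properties on a random projection matrix (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 2))
P = A @ np.linalg.inv(A.T @ A) @ A.T       # projection onto a 2-dim subspace of R^5

print(np.allclose(P @ P, P))               # idempotent: P^2 = P
print(np.allclose(P, P.T))                 # symmetric:  P^T = P
print(np.round(np.linalg.eigvalsh(P), 8))  # eigenvalues: three 0s and two 1s
```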
Projection onto a plane in R^3:

```
      z
      |
      |   v
      |  /|
      | / |
      |/  |  (v - Pv)
      *---+------> y
     /    Pv
    /
   x

   Pv lies in the xy-plane
   v - Pv is vertical (perpendicular to the plane)
```
Best Approximation Theorem
Theorem (Best Approximation). Let $U$ be a finite-dimensional subspace of $V$, and $v \in V$. Then for any $u \in U$: $$|v - P_U v| \leq |v - u|$$ with equality iff $u = P_U v$. That is, $P_U v$ is the closest point in $U$ to $v$.
Proof. Write $v - u = (v - P_U v) + (P_U v - u)$. The two summands are orthogonal: $P_U v - u \in U$ and $v - P_U v \in U^\perp$, so by the Pythagorean theorem: $$|v - u|^2 = |v - P_U v|^2 + |P_U v - u|^2 \geq |v - P_U v|^2$$ Equality holds iff $|P_U v - u|^2 = 0$, i.e., $u = P_U v$. $\square$
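The inequality can be sanity-checked by comparing the projection against many other candidate points of $U$ (a sketch with random data):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 2))            # columns span the subspace U
P = A @ np.linalg.inv(A.T @ A) @ A.T
v = rng.standard_normal(6)

best = np.linalg.norm(v - P @ v)           # distance from v to its projection
# distances from v to 1000 other random points of U:
others = [np.linalg.norm(v - A @ rng.standard_normal(2)) for _ in range(1000)]

print(best <= min(others))                 # True: nothing in U beats the projection
```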
This theorem is the geometric heart of least squares.
Least Squares and Normal Equations
Given an overdetermined linear system $Ax = b$ (more equations than unknowns, typically no exact solution), the least squares solution minimises $|Ax - b|_2^2$ over all $x$.
Derivation. We want the $x$ such that $Ax$ is as close as possible to $b$ in $\ell^2$. By the best approximation theorem, the closest point in $\text{col}(A)$ to $b$ is $P_{\text{col}(A)} b$. We need $Ax = P_{\text{col}(A)} b$, which requires $b - Ax \in \text{col}(A)^\perp = \text{null}(A^T)$. Thus: $$A^T(b - Ax) = 0 \implies A^T Ax = A^T b$$
These are the normal equations. If $A$ has full column rank, $A^T A$ is invertible and the unique solution is: $$\hat{x} = (A^T A)^{-1} A^T b$$
The matrix $(A^T A)^{-1} A^T$ is the Moore-Penrose pseudoinverse $A^\dagger$ when $A$ has full column rank. The projection matrix onto $\text{col}(A)$ is then $P = A(A^T A)^{-1} A^T = AA^\dagger$.
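Putting the pieces together (a sketch with a random overdetermined system; the normal-equation solution is compared against NumPy's built-in least-squares solver and pseudoinverse):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((20, 3))           # 20 equations, 3 unknowns: no exact solution
b = rng.standard_normal(20)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)       # normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # library least-squares solver
x_pinv = np.linalg.pinv(A) @ b                     # Moore-Penrose pseudoinverse

print(np.allclose(x_normal, x_lstsq), np.allclose(x_normal, x_pinv))   # True True
print(np.allclose(A.T @ (b - A @ x_normal), 0))    # residual is orthogonal to col(A)
```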
Why normal equations can be numerically bad. The condition number of $A^T A$ is the square of the condition number of $A$. If $A$ is ill-conditioned, forming $A^T A$ squares that ill-conditioning, and rounding error in floating point can wipe out most of the accuracy of the computed solution. The numerically preferred approach is a QR decomposition of $A$; see the Gram-Schmidt post.
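The squared condition number is easy to demonstrate, as is the QR-based alternative (a sketch; the matrix is deliberately constructed with one tiny singular value):

```python
import numpy as np

rng = np.random.default_rng(6)
U, _, Vt = np.linalg.svd(rng.standard_normal((30, 3)), full_matrices=False)
A = U @ np.diag([1.0, 1.0, 1e-6]) @ Vt     # ill-conditioned by construction
b = rng.standard_normal(30)

print(np.linalg.cond(A))                   # ~1e6
print(np.linalg.cond(A.T @ A))             # ~1e12 -- the square

# QR-based least squares never forms A^T A: solve R x = Q^T b.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)
print(np.allclose(A.T @ (b - A @ x_qr), 0, atol=1e-6))   # residual still orthogonal to col(A)
```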
Examples
Linear regression. Fitting $y = X\beta + \epsilon$ by OLS gives $\hat{\beta} = (X^T X)^{-1} X^T y$, and the fitted values are $\hat{y} = X\hat{\beta} = X(X^T X)^{-1} X^T y = Hy$, where $H = X(X^T X)^{-1} X^T$ is the hat matrix (it puts the hat on $y$). The residuals $\hat{\epsilon} = y - \hat{y} = (I - H)y$ lie in the orthogonal complement of the column space of $X$. The hat matrix is an orthogonal projection, so $H^2 = H$ and $H^T = H$.
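A small OLS example to see the hat matrix at work (a sketch with synthetic data; the true coefficients 2 and 3 and the noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal(n)])   # intercept + one regressor
y = 2.0 + 3.0 * X[:, 1] + 0.1 * rng.standard_normal(n)

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
y_hat = H @ y                              # fitted values
resid = y - y_hat                          # residuals = (I - H) y

print(np.allclose(H @ H, H), np.allclose(H, H.T))   # H is an orthogonal projection
print(np.allclose(X.T @ resid, 0))                  # residuals orthogonal to col(X)
```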
PCA as projection. Principal Component Analysis finds the $k$-dimensional subspace that captures maximum variance. With the data matrix $X$ centred (column means subtracted), the projection of the data onto the top $k$ eigenvectors of the covariance matrix $\Sigma = \frac{1}{n}X^T X$ is the orthogonal projection $P = V_k V_k^T$, where $V_k$ contains the top $k$ eigenvectors. This is the best rank-$k$ approximation to the data by the Eckart-Young theorem (via SVD).
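The eigenvector and SVD routes give the same projection, which a short computation confirms (a sketch with synthetic correlated data):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # correlated features
Xc = X - X.mean(axis=0)                    # centre the data first

k = 2
cov = Xc.T @ Xc / len(Xc)                  # covariance matrix (1/n) X^T X
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
Vk = eigvecs[:, -k:]                       # top-k eigenvectors
P = Vk @ Vk.T                              # orthogonal projection onto the top-k subspace

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P_svd = Vt[:k].T @ Vt[:k]                  # same subspace from the top-k right singular vectors

print(np.allclose(P, P_svd))               # True: identical projections
```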
Pseudoinverse and underdetermined systems. When $A$ has more columns than rows and full row rank (underdetermined, infinitely many solutions), the pseudoinverse $A^\dagger = A^T(AA^T)^{-1}$ gives the minimum-norm solution: among all $x$ with $Ax = b$, the pseudoinverse selects the one with smallest $|x|_2$. This is again a projection, onto the row space of $A$, which is $\text{null}(A)^\perp$.
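And the minimum-norm behaviour of the pseudoinverse on an underdetermined system (a sketch; the random system below has full row rank with probability 1):

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((3, 6))            # 3 equations, 6 unknowns
b = rng.standard_normal(3)

x_min = A.T @ np.linalg.inv(A @ A.T) @ b   # minimum-norm solution A^T (A A^T)^{-1} b
print(np.allclose(A @ x_min, b))           # True: it solves the system exactly
print(np.allclose(x_min, np.linalg.pinv(A) @ b))   # True: matches the pseudoinverse

# Adding anything from null(A) keeps Ax = b but increases the norm:
null_dir = np.linalg.svd(A)[2][-1]         # a unit vector in null(A)
other = x_min + 2.0 * null_dir
print(np.allclose(A @ other, b), np.linalg.norm(other) > np.linalg.norm(x_min))   # True True
```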