Orthogonality & Projections
Orthogonality
Two vectors $u, v$ in an inner product space are orthogonal, written $u \perp v$, if $\langle u, v \rangle = 0$. A set of vectors is orthogonal if every pair is orthogonal, and orthonormal if additionally each vector has unit norm: $|e_i| = 1$ for all $i$.
Proposition. An orthogonal set of nonzero vectors is linearly independent.
Proof. Suppose $\sum_{i=1}^k \alpha_i v_i = 0$ with all $v_i \neq 0$ and $\langle v_i, v_j \rangle = 0$ for $i \neq j$. Take the inner product of both sides with $v_j$: $$0 = \left\langle \sum_i \alpha_i v_i, v_j \right\rangle = \alpha_j |v_j|^2$$ Since $|v_j|^2 > 0$, we get $\alpha_j = 0$. This holds for each $j$. $\square$
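A quick numerical check of the proposition (a sketch using NumPy; the three vectors below are arbitrary illustrative choices):

```python
import numpy as np

# Three mutually orthogonal, nonzero vectors in R^4.
v1 = np.array([1.0, 1.0, 0.0, 0.0])
v2 = np.array([1.0, -1.0, 0.0, 0.0])
v3 = np.array([0.0, 0.0, 2.0, 0.0])
V = np.column_stack([v1, v2, v3])

print(V.T @ V)                     # diagonal Gram matrix: off-diagonal inner products are 0
print(np.linalg.matrix_rank(V))    # 3 -- the columns are linearly independent
```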
Orthogonal Complement
Definition. For a subset $S \subseteq V$, the orthogonal complement is $$S^\perp = \{v \in V : \langle v, s \rangle = 0 \text{ for all } s \in S\}$$
$S^\perp$ is always a subspace (even if $S$ is not). Key properties for a subspace $U \subseteq V$ (with $V$ finite-dimensional):
- $(U^\perp)^\perp = U$
- $\dim U + \dim U^\perp = \dim V$
- $U \cap U^\perp = \{0\}$
- $V = U \oplus U^\perp$ (direct sum decomposition)
The last point is the orthogonal decomposition theorem, and it is the foundation of projection theory.
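Here is a small numerical illustration of the decomposition (a sketch in NumPy; it borrows the projection-matrix formula $P = A(A^T A)^{-1}A^T$ derived later in the post, and the subspace and vector are random choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))           # columns span a 2-dimensional subspace U of R^5
P = A @ np.linalg.inv(A.T @ A) @ A.T      # orthogonal projection onto U

v = rng.standard_normal(5)
u_part = P @ v                            # component in U
w_part = v - u_part                       # component in U^perp

print(np.allclose(v, u_part + w_part))    # True: v = u + w
print(np.allclose(A.T @ w_part, 0))       # True: w is orthogonal to everything in U
# dim U + dim U^perp = dim V:
print(np.linalg.matrix_rank(P) + np.linalg.matrix_rank(np.eye(5) - P))   # 5
```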
Orthonormal Bases
An orthonormal basis $(e_1, \ldots, e_n)$ satisfies $\langle e_i, e_j \rangle = \delta_{ij}$ (the Kronecker delta). The advantages over a general basis are significant:
Coordinates are inner products. If $v = \sum_i \alpha_i e_i$, then $\langle v, e_j \rangle = \alpha_j$. No linear system needs to be solved to find coordinates.
Norm is Euclidean in coordinates. $|v|^2 = \sum_i |\alpha_i|^2$ (Parseval’s identity). There is no Gram matrix to worry about.
Change of basis is orthogonal. The matrix $Q$ whose columns are an orthonormal basis satisfies $Q^T Q = I$ (and $QQ^T = I$ if $Q$ is square). Orthogonal matrices preserve inner products: $\langle Qx, Qy \rangle = x^T Q^T Q y = x^T y$.
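These three advantages are easy to see numerically (a sketch; the orthonormal basis below comes from the QR factorisation of a random matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # columns of Q: an orthonormal basis of R^4

v = rng.standard_normal(4)
coords = Q.T @ v                                   # coordinates are just inner products <v, e_i>

print(np.allclose(Q @ coords, v))                  # reconstruct v from its coordinates
print(np.allclose(np.sum(coords**2), v @ v))       # Parseval: ||v||^2 = sum of squared coordinates
print(np.allclose(Q.T @ Q, np.eye(4)))             # Q^T Q = I
```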
Orthogonal Projection
Projection onto a 1-Dimensional Subspace
Given a nonzero vector $u$ and a vector $v$, the orthogonal projection of $v$ onto $\text{span}(u)$ is: $$P_u v = \frac{\langle v, u \rangle}{\langle u, u \rangle} u = \frac{\langle v, u \rangle}{|u|^2} u$$
The scalar $\frac{\langle v, u \rangle}{|u|}$ is the component of $v$ along $u$ (its signed length in the direction of $u$). The residual $v - P_u v$ is orthogonal to $u$ by direct computation: $$\langle v - P_u v, u \rangle = \langle v, u \rangle - \frac{\langle v,u \rangle}{|u|^2}\langle u, u \rangle = 0$$
Projection onto a line spanned by u:

```
        v
       /|
      / |
     /  |  (v - P_u v)  <-- orthogonal to u
    /   |
   /    |
  *-----*-----------> u
  0     P_u v

  v = P_u v + (v - P_u v)
  <P_u v, v - P_u v> = 0
```
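A minimal numerical version of the 1-dimensional projection (the vectors are arbitrary illustrative choices):

```python
import numpy as np

u = np.array([2.0, 1.0, 0.0])             # direction to project onto
v = np.array([1.0, 3.0, 5.0])

P_u_v = (v @ u) / (u @ u) * u             # P_u v = (<v,u> / <u,u>) u
residual = v - P_u_v

print(P_u_v)                              # [2. 1. 0.]
print(np.isclose(residual @ u, 0.0))      # True: the residual is orthogonal to u
```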
Projection onto a General Subspace
Let $U = \text{span}(u_1, \ldots, u_k)$ be a subspace of $V$. The orthogonal projection of $v$ onto $U$ is the unique vector $P_U v \in U$ such that $v - P_U v \in U^\perp$.
If $(e_1, \ldots, e_k)$ is an orthonormal basis for $U$, then: $$P_U v = \sum_{i=1}^k \langle v, e_i \rangle e_i$$
This is the cleanest formula. When the basis is not orthonormal, write the basis vectors as columns of a matrix $A$. Then: $$P_U v = A(A^T A)^{-1} A^T v$$
The matrix $P = A(A^T A)^{-1} A^T$ is the orthogonal projection matrix onto $\text{col}(A)$.
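Both formulas give the same projection, which is worth checking numerically (a sketch; $A$ is a random matrix with independent columns, and the orthonormal basis comes from its thin QR factorisation):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))            # non-orthonormal basis of a 3-dim subspace U

P = A @ np.linalg.inv(A.T @ A) @ A.T       # P_U = A (A^T A)^{-1} A^T

Q, _ = np.linalg.qr(A)                     # orthonormal basis for col(A)
P_ortho = Q @ Q.T                          # matrix form of sum_i <., e_i> e_i

print(np.allclose(P, P_ortho))             # True: same projection either way
```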
Properties of Projection Matrices
A matrix $P$ is an orthogonal projection if and only if:
- $P^2 = P$ (idempotent: projecting twice is the same as projecting once)
- $P^T = P$ (symmetric)
Idempotence alone defines a projection (not necessarily orthogonal). Adding symmetry forces $\text{null}(P) = \text{col}(P)^\perp$, so the projection is along directions orthogonal to its range. The eigenvalues of a projection matrix are only 0 and 1: eigenvectors with eigenvalue 1 lie in $\text{col}(P)$, and eigenvectors with eigenvalue 0 lie in $\text{null}(P) = \text{col}(P)^\perp$.
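A quick check of these properties on a random projection matrix (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 2))
P = A @ np.linalg.inv(A.T @ A) @ A.T       # projection onto a 2-dim subspace of R^5

print(np.allclose(P @ P, P))               # idempotent: P^2 = P
print(np.allclose(P, P.T))                 # symmetric:  P^T = P
print(np.round(np.linalg.eigvalsh(P), 8))  # eigenvalues: three 0s and two 1s
```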
Projection onto a plane in R^3:

```
      z
      |
      |   v
      |  /|
      | / |
      |/  |  (v - Pv)
      *---+------> y
     /    Pv
    /
   x

   Pv lies in the xy-plane
   v - Pv is vertical (perpendicular to the plane)
```
Best Approximation Theorem
Theorem (Best Approximation). Let $U$ be a finite-dimensional subspace of $V$, and $v \in V$. Then for any $u \in U$: $$|v - P_U v| \leq |v - u|$$ with equality iff $u = P_U v$. That is, $P_U v$ is the closest point in $U$ to $v$.
Proof. Write $v - u = (v - P_U v) + (P_U v - u)$. The two summands are orthogonal: $P_U v - u \in U$ and $v - P_U v \in U^\perp$, so by the Pythagorean theorem: $$|v - u|^2 = |v - P_U v|^2 + |P_U v - u|^2 \geq |v - P_U v|^2$$ Equality holds iff $|P_U v - u|^2 = 0$, i.e., $u = P_U v$. $\square$
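The inequality can be sanity-checked by comparing the projection against many other candidate points of $U$ (a sketch with random data):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 2))            # columns span the subspace U
P = A @ np.linalg.inv(A.T @ A) @ A.T
v = rng.standard_normal(6)

best = np.linalg.norm(v - P @ v)           # distance from v to its projection
# distances from v to 1000 other random points of U:
others = [np.linalg.norm(v - A @ rng.standard_normal(2)) for _ in range(1000)]

print(best <= min(others))                 # True: nothing in U beats the projection
```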
This theorem is the geometric heart of least squares.
Least Squares and Normal Equations
Given an overdetermined linear system $Ax = b$ (more equations than unknowns, typically no exact solution), the least squares solution minimises $|Ax - b|_2^2$ over all $x$.
Derivation. We want the $x$ such that $Ax$ is as close as possible to $b$ in $\ell^2$. By the best approximation theorem, the closest point in $\text{col}(A)$ to $b$ is $P_{\text{col}(A)} b$. We need $Ax = P_{\text{col}(A)} b$, which requires $b - Ax \in \text{col}(A)^\perp = \text{null}(A^T)$. Thus: $$A^T(b - Ax) = 0 \implies A^T Ax = A^T b$$
These are the normal equations. If $A$ has full column rank, $A^T A$ is invertible and the unique solution is: $$\hat{x} = (A^T A)^{-1} A^T b$$
The matrix $(A^T A)^{-1} A^T$ is the Moore-Penrose pseudoinverse $A^\dagger$ when $A$ has full column rank. The projection matrix onto $\text{col}(A)$ is then $P = A(A^T A)^{-1} A^T = AA^\dagger$.
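Putting the pieces together (a sketch with a random overdetermined system; the normal-equation solution is compared against NumPy's built-in least-squares solver and pseudoinverse):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((20, 3))           # 20 equations, 3 unknowns: no exact solution
b = rng.standard_normal(20)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)       # normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # library least-squares solver
x_pinv = np.linalg.pinv(A) @ b                     # Moore-Penrose pseudoinverse

print(np.allclose(x_normal, x_lstsq), np.allclose(x_normal, x_pinv))   # True True
print(np.allclose(A.T @ (b - A @ x_normal), 0))    # residual is orthogonal to col(A)
```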
Why normal equations can be numerically bad. The condition number of $A^T A$ is the square of the condition number of $A$. If $A$ is ill-conditioned, forming $A^T A$ squares that ill-conditioning, and rounding error in floating point can wipe out most of the accuracy of the computed solution. The numerically preferred approach is a QR decomposition of $A$; see the Gram-Schmidt post.
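The squared condition number is easy to demonstrate, as is the QR-based alternative (a sketch; the matrix is deliberately constructed with one tiny singular value):

```python
import numpy as np

rng = np.random.default_rng(6)
U, _, Vt = np.linalg.svd(rng.standard_normal((30, 3)), full_matrices=False)
A = U @ np.diag([1.0, 1.0, 1e-6]) @ Vt     # ill-conditioned by construction
b = rng.standard_normal(30)

print(np.linalg.cond(A))                   # ~1e6
print(np.linalg.cond(A.T @ A))             # ~1e12 -- the square

# QR-based least squares never forms A^T A: solve R x = Q^T b.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)
print(np.allclose(A.T @ (b - A @ x_qr), 0, atol=1e-6))   # residual still orthogonal to col(A)
```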
Examples
Linear regression. Fitting $y = X\beta + \epsilon$ by OLS gives $\hat{\beta} = (X^T X)^{-1} X^T y$, and the fitted values are $\hat{y} = X\hat{\beta} = X(X^T X)^{-1} X^T y = Hy$, where $H = X(X^T X)^{-1} X^T$ is the hat matrix (it puts the hat on $y$). The residuals $\hat{\epsilon} = y - \hat{y} = (I - H)y$ lie in the orthogonal complement of the column space of $X$. The hat matrix is an orthogonal projection, so $H^2 = H$ and $H^T = H$.
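A small OLS example to see the hat matrix at work (a sketch with synthetic data; the true coefficients 2 and 3 and the noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal(n)])   # intercept + one regressor
y = 2.0 + 3.0 * X[:, 1] + 0.1 * rng.standard_normal(n)

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
y_hat = H @ y                              # fitted values
resid = y - y_hat                          # residuals = (I - H) y

print(np.allclose(H @ H, H), np.allclose(H, H.T))   # H is an orthogonal projection
print(np.allclose(X.T @ resid, 0))                  # residuals orthogonal to col(X)
```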
PCA as projection. Principal Component Analysis finds the $k$-dimensional subspace that captures maximum variance. With the data matrix $X$ centred (column means subtracted), the projection of the data onto the top $k$ eigenvectors of the covariance matrix $\Sigma = \frac{1}{n}X^T X$ is the orthogonal projection $P = V_k V_k^T$, where $V_k$ contains the top $k$ eigenvectors. This is the best rank-$k$ approximation to the data by the Eckart-Young theorem (via SVD).
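The eigenvector and SVD routes give the same projection, which a short computation confirms (a sketch with synthetic correlated data):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # correlated features
Xc = X - X.mean(axis=0)                    # centre the data first

k = 2
cov = Xc.T @ Xc / len(Xc)                  # covariance matrix (1/n) X^T X
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
Vk = eigvecs[:, -k:]                       # top-k eigenvectors
P = Vk @ Vk.T                              # orthogonal projection onto the top-k subspace

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P_svd = Vt[:k].T @ Vt[:k]                  # same subspace from the top-k right singular vectors

print(np.allclose(P, P_svd))               # True: identical projections
```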
Pseudoinverse and underdetermined systems. When $A$ has more columns than rows and full row rank (underdetermined, infinitely many solutions), the pseudoinverse $A^\dagger = A^T(AA^T)^{-1}$ gives the minimum-norm solution: among all $x$ with $Ax = b$, the pseudoinverse selects the one with smallest $|x|_2$. This is again a projection, onto the row space of $A$, which is $\text{null}(A)^\perp$.
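And the minimum-norm behaviour of the pseudoinverse on an underdetermined system (a sketch; the random system below has full row rank with probability 1):

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((3, 6))            # 3 equations, 6 unknowns
b = rng.standard_normal(3)

x_min = A.T @ np.linalg.inv(A @ A.T) @ b   # minimum-norm solution A^T (A A^T)^{-1} b
print(np.allclose(A @ x_min, b))           # True: it solves the system exactly
print(np.allclose(x_min, np.linalg.pinv(A) @ b))   # True: matches the pseudoinverse

# Adding anything from null(A) keeps Ax = b but increases the norm:
null_dir = np.linalg.svd(A)[2][-1]         # a unit vector in null(A)
other = x_min + 2.0 * null_dir
print(np.allclose(A @ other, b), np.linalg.norm(other) > np.linalg.norm(x_min))   # True True
```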