Orthogonality & Projections - Right Angles Have Deep Consequences // Megha Bose

Helpful context:

Inner Products & Norms - Angles and Lengths for Any Vector Space

Orthogonality

Two vectors $u, v$ in an inner product space are orthogonal, written $u \perp v$, if $\langle u, v \rangle = 0$. A set of vectors is orthogonal if every pair is orthogonal, and orthonormal if additionally each vector has unit norm: $|e_i| = 1$ for all $i$.

Proposition. An orthogonal set of nonzero vectors is linearly independent.

Proof. Suppose $\sum_{i=1}^k \alpha_i v_i = 0$ with all $v_i \neq 0$ and $\langle v_i, v_j \rangle = 0$ for $i \neq j$. Take the inner product of both sides with $v_j$: $$0 = \left\langle \sum_i \alpha_i v_i, v_j \right\rangle = \alpha_j |v_j|^2$$ Since $|v_j|^2 > 0$, we get $\alpha_j = 0$. This holds for each $j$. $\square$

Orthogonal Complement

Definition. For a subset $S \subseteq V$, the orthogonal complement is $$S^\perp = \{v \in V : \langle v, s \rangle = 0 \text{ for all } s \in S\}$$

$S^\perp$ is always a subspace (even if $S$ is not). Key properties for a subspace $U \subseteq V$ (with $V$ finite-dimensional):

$(U^\perp)^\perp = U$
$\dim U + \dim U^\perp = \dim V$
$U \cap U^\perp = \{0\}$
$V = U \oplus U^\perp$ (direct sum decomposition)

The last point is the orthogonal decomposition theorem, and it is the foundation of projection theory.

Orthonormal Bases

The Kronecker Delta

Before stating what an orthonormal basis is, we need a compact piece of notation. The Kronecker delta $\delta_{ij}$ (named after Leopold Kronecker, a 19th-century German mathematician) is simply:

$$\delta_{ij} = 1 \text{ if } i = j, \qquad \delta_{ij} = 0 \text{ if } i \neq j$$

That is all it is - a selector that returns 1 when the two indices match and 0 when they do not. It is not a deep concept; it is a notational convenience that avoids writing “equals 1 if $i = j$ and 0 otherwise” every time. The identity matrix $I$ has entries $(I){ij} = \delta{ij}$. The condition $\langle e_i, e_j \rangle = \delta_{ij}$ is just a clean way of saying: basis vectors are unit length (self inner product is 1) and mutually orthogonal (cross inner product is 0).

Properties

An orthonormal basis $(e_1, \ldots, e_n)$ satisfies $\langle e_i, e_j \rangle = \delta_{ij}$. The advantages over a general basis are significant:

Coordinates are inner products. If $v = \sum_i \alpha_i e_i$, then $\langle v, e_j \rangle = \alpha_j$. No linear system needs to be solved to find coordinates.

Norm is Euclidean in coordinates. $|v|^2 = \sum_i |\alpha_i|^2$ (Parseval’s identity). There is no Gram matrix to worry about.

The Q Matrix and Its Inverse

Form the matrix $Q$ whose columns are the orthonormal basis vectors $e_1, \ldots, e_n$. The $(i,j)$ entry of $Q^T Q$ is the dot product of column $i$ with column $j$, which is exactly $\langle e_i, e_j \rangle = \delta_{ij}$. So $Q^T Q = I$, which means:

$$Q^{-1} = Q^T$$

This is a remarkable and practically important fact: the inverse of an orthogonal matrix is just its transpose. Computing a general matrix inverse costs $O(n^3)$ operations and is numerically fragile. For orthogonal matrices, the inverse is free - just flip rows and columns - and numerically perfectly stable.

For a square $Q$, both $Q^T Q = I$ and $Q Q^T = I$ hold, meaning the rows of $Q$ also form an orthonormal set, not just the columns.

What orthogonal matrices do geometrically. Because $Q^{-1} = Q^T$, applying $Q$ is a length-preserving, angle-preserving transformation - a rotation or reflection. Concretely:

$Q$ preserves lengths: since $Q^T Q = I$,

$$|Qx|^2 = (Qx)^T(Qx) = x^T Q^T Q x = x^T x = |x|^2$$

$Q$ preserves angles: for any two vectors $x, y$,

$$\langle Qx, Qy \rangle = x^T Q^T Q y = x^T y = \langle x, y \rangle$$

For the determinant: $\det(Q^T Q) = \det(I) = 1$, and $\det(Q^T Q) = \det(Q)^2$, so $\det(Q) = \pm 1$. Pure rotations have $\det(Q) = +1$; reflections have $\det(Q) = -1$.

Orthogonal Projection

Projection onto a 1-Dimensional Subspace

Given a nonzero vector $u$ and a vector $v$, the orthogonal projection of $v$ onto $\text{span}(u)$ is: $$P_u v = \frac{\langle v, u \rangle}{\langle u, u \rangle} u = \frac{\langle v, u \rangle}{|u|^2} u$$

The scalar $\frac{\langle v, u \rangle}{|u|}$ is the component of $v$ along $u$ (its signed length in the direction of $u$). The residual $v - P_u v$ is orthogonal to $u$ by direct computation: $$\langle v - P_u v, u \rangle = \langle v, u \rangle - \frac{\langle v,u \rangle}{|u|^2}\langle u, u \rangle = 0$$

Projection onto a General Subspace

Let $U = \text{span}(u_1, \ldots, u_k)$ be a subspace of $V$. The orthogonal projection of $v$ onto $U$ is the unique vector $P_U v \in U$ such that $v - P_U v \in U^\perp$.

If $(e_1, \ldots, e_k)$ is an orthonormal basis for $U$, then: $$P_U v = \sum_{i=1}^k \langle v, e_i \rangle e_i$$

This is the cleanest formula. When the basis is not orthonormal, write the basis vectors as columns of a matrix $A$. Then: $$P_U v = A(A^T A)^{-1} A^T v$$

The matrix $P = A(A^T A)^{-1} A^T$ is the orthogonal projection matrix onto $\text{col}(A)$.

Properties of Projection Matrices

A matrix $P$ is an orthogonal projection if and only if:

$P^2 = P$ (idempotent: projecting twice is the same as projecting once)
$P^T = P$ (symmetric)

Condition 1 alone defines a projection (not necessarily orthogonal). Adding condition 2 enforces that the projection is orthogonal to its complement. The eigenvalues of a projection matrix are only 0 and 1 - eigenvectors with eigenvalue 1 are in $\text{col}(P)$, and eigenvectors with eigenvalue 0 are in $\text{null}(P) = \text{col}(P)^\perp$.

Best Approximation Theorem

Theorem (Best Approximation). Let $U$ be a finite-dimensional subspace of $V$, and $v \in V$. Then for any $u \in U$: $$|v - P_U v| \leq |v - u|$$ with equality iff $u = P_U v$. That is, $P_U v$ is the closest point in $U$ to $v$.

This theorem is the geometric heart of least squares.

Least Squares and Normal Equations

Given an overdetermined linear system $Ax = b$ (more equations than unknowns, typically no exact solution), the least squares solution minimises $|Ax - b|_2^2$ over all $x$.

Derivation. We want the $x$ such that $Ax$ is as close as possible to $b$ in $\ell^2$. By the best approximation theorem, the closest point in $\text{col}(A)$ to $b$ is $P_{\text{col}(A)} b$. We need $Ax = P_{\text{col}(A)} b$, which requires $b - Ax \in \text{col}(A)^\perp = \text{null}(A^T)$. Thus: $$A^T(b - Ax) = 0 \implies A^T Ax = A^T b$$

These are the normal equations. If $A$ has full column rank, $A^T A$ is invertible and the unique solution is: $$\hat{x} = (A^T A)^{-1} A^T b$$

The matrix $(A^T A)^{-1} A^T$ is the Moore-Penrose pseudoinverse $A^\dagger$ when $A$ has full column rank. The projection matrix onto $\text{col}(A)$ is then $P = A(A^T A)^{-1} A^T = AA^\dagger$.

Why normal equations can be numerically bad. The condition number of $A^T A$ is the square of the condition number of $A$. If $A$ is ill-conditioned, $A^T A$ is dramatically more so, causing catastrophic cancellation in floating point. The numerically preferred method is QR decomposition of $A$ - see the Gram-Schmidt post.

Examples

Linear regression. Fitting $y = X\beta + \epsilon$ by OLS gives $\hat{\beta} = (X^T X)^{-1} X^T y$, and the fitted values are $\hat{y} = X\hat{\beta} = X(X^T X)^{-1} X^T y = Hy$, where $H = X(X^T X)^{-1} X^T$ is the hat matrix (it puts the hat on $y$). The residuals $\hat{\epsilon} = y - \hat{y} = (I - H)y$ lie in the orthogonal complement of the column space of $X$. The hat matrix is an orthogonal projection, so $H^2 = H$ and $H^T = H$.

PCA as projection. Principal Component Analysis finds the $k$-dimensional subspace that captures maximum variance. The projection of data onto the top $k$ eigenvectors of the covariance matrix $\Sigma = \frac{1}{n}X^T X$ is the orthogonal projection $P = V_k V_k^T$ where $V_k$ contains the top $k$ eigenvectors. This is the best rank-$k$ approximation to the data by the Eckart-Young theorem (via SVD).

Pseudoinverse and underdetermined systems. When $A$ has more columns than rows (underdetermined, infinitely many solutions), the pseudoinverse $A^\dagger = A^T(AA^T)^{-1}$ gives the minimum-norm solution: among all $x$ with $Ax = b$, the pseudoinverse selects the one with smallest $|x|_2$. This is again a projection - onto the row space of $A$, which is $\text{null}(A)^\perp$.

Key Properties

Orthogonality

$u \perp v$ iff $\langle u, v \rangle = 0$
Any orthogonal set of nonzero vectors is linearly independent
An orthonormal set satisfies $\langle e_i, e_j \rangle = \delta_{ij}$

Orthogonal complement

$S^\perp$ is always a subspace, even when $S$ is not
$(U^\perp)^\perp = U$
$\dim U + \dim U^\perp = \dim V$
$U \cap U^\perp = {0}$
$V = U \oplus U^\perp$ (every vector decomposes uniquely into a part in $U$ and a part in $U^\perp$)

Kronecker delta

$\delta_{ij} = 1$ if $i = j$, else $0$ - just a notational selector; $(I){ij} = \delta{ij}$
$\langle e_i, e_j \rangle = \delta_{ij}$ is the compact statement that basis vectors are unit length and mutually orthogonal

Orthonormal bases

Coordinates are inner products: if $v = \sum_i \alpha_i e_i$ then $\alpha_j = \langle v, e_j \rangle$
$|v|^2 = \sum_i |\alpha_i|^2$ (Parseval’s identity)
The change-of-basis matrix $Q$ (columns = orthonormal basis vectors) satisfies $Q^T Q = I$
Therefore $Q^{-1} = Q^T$ - the inverse is just the transpose, free to compute and numerically stable
For square $Q$: $QQ^T = I$ as well, so rows are also orthonormal
$Q$ preserves lengths: $|Qx| = |x|$
$Q$ preserves angles: $\langle Qx, Qy \rangle = \langle x, y \rangle$
$\det(Q) = \pm 1$: rotations give $+1$, reflections give $-1$

Projections

Projection onto $\text{span}(u)$: $P_u v = \dfrac{\langle v, u \rangle}{|u|^2}, u$
The residual $v - P_u v$ is always orthogonal to $u$
Projection onto a subspace $U$ with orthonormal basis $(e_1, \ldots, e_k)$: $P_U v = \sum_{i=1}^k \langle v, e_i \rangle e_i$
With a general (non-orthonormal) basis as columns of $A$: $P = A(A^T A)^{-1} A^T$

Projection matrices

$P$ is an orthogonal projection iff $P^2 = P$ (idempotent) and $P^T = P$ (symmetric)
Eigenvalues of a projection matrix are only 0 and 1
Eigenvectors with eigenvalue 1 span $\text{col}(P)$; eigenvectors with eigenvalue 0 span $\text{null}(P) = \text{col}(P)^\perp$

Best approximation

$P_U v$ is the unique closest point in $U$ to $v$: $|v - P_U v| \leq |v - u|$ for all $u \in U$
This is the geometric foundation of least squares

Least squares and normal equations

The least squares solution to $Ax = b$ satisfies $A^T A x = A^T b$
When $A$ has full column rank: $\hat{x} = (A^T A)^{-1} A^T b$
The residual $b - A\hat{x}$ lies in $\text{col}(A)^\perp = \text{null}(A^T)$
The condition number of $A^T A$ is the square of that of $A$ - numerically, prefer QR over the normal equations

Read Next: