Gram-Schmidt - Building Perpendicular Bases From Anything You're Given
Helpful context:
The Problem
Every finite-dimensional inner product space has an orthonormal basis - but given an arbitrary basis, how do we construct one? The Gram-Schmidt process is the answer: a constructive algorithm that takes any linearly independent list and produces an orthonormal list spanning the same space.
Theorem. Every finite-dimensional inner product space (over $\mathbb{R}$ or $\mathbb{C}$) has an orthonormal basis.
Proof. Take any basis. Apply Gram-Schmidt. $\square$ (The content is in the algorithm.)
The Gram-Schmidt Algorithm
Input: Linearly independent vectors $v_1, v_2, \ldots, v_n \in V$.
Output: Orthonormal vectors $e_1, e_2, \ldots, e_n$ with $\text{span}(e_1, \ldots, e_k) = \text{span}(v_1, \ldots, v_k)$ for each $k$.
Step 1. Normalise $v_1$:
$$e_1 = \frac{v_1}{|v_1|}$$
Step $k$ (for $k = 2, \ldots, n$). Subtract from $v_k$ its projections onto all previously found orthonormal vectors to get an intermediate vector $u_k$, then normalise:
$$u_k = v_k - \sum_{j=1}^{k-1} \langle v_k, e_j \rangle e_j$$
$$e_k = \frac{u_k}{|u_k|}$$
The vector $u_k$ is the component of $v_k$ orthogonal to $\text{span}(e_1, \ldots, e_{k-1})$. Linear independence of the $v_i$ guarantees $u_k \neq 0$ - otherwise $v_k$ would lie in $\text{span}(v_1, \ldots, v_{k-1})$, contradicting independence.
Why It Works
Claim: After step $k$, the vectors $e_1, \ldots, e_k$ are orthonormal and span the same space as $v_1, \ldots, v_k$.
Base case: $k = 1$. $e_1 = v_1 / |v_1|$ is a unit vector spanning $\text{span}(v_1)$. ✓
Inductive step: Assume the claim holds for $k-1$. Then $u_k = v_k - P_{\text{span}(e_1,\ldots,e_{k-1})} v_k$ is the component of $v_k$ orthogonal to $\text{span}(e_1, \ldots, e_{k-1})$. By construction, $\langle u_k, e_j \rangle = 0$ for all $j = 1, \ldots, k-1$, since the projection subtraction removes all components in those directions. Normalising gives $e_k$ with $|e_k| = 1$. The span condition holds because
$$v_k = u_k + \sum_{j=1}^{k-1} \langle v_k, e_j \rangle e_j$$
is a linear combination of $u_k$ (proportional to $e_k$) and $e_1, \ldots, e_{k-1}$. $\square$
Worked Example in $\mathbb{R}^3$
Take $v_1 = (1, 1, 0)$, $v_2 = (1, 0, 1)$, $v_3 = (0, 1, 1)$.
Step 1. Normalise $v_1$:
$$|v_1| = \sqrt{2}, \qquad e_1 = \frac{1}{\sqrt{2}}(1, 1, 0)$$
Step 2. Project $v_2$ onto $e_1$ and subtract:
$$\langle v_2, e_1 \rangle = \frac{1}{\sqrt{2}}$$
$$u_2 = (1,0,1) - \frac{1}{\sqrt{2}} \cdot \frac{1}{\sqrt{2}}(1,1,0) = (1,0,1) - \frac{1}{2}(1,1,0) = \left(\frac{1}{2}, -\frac{1}{2}, 1\right)$$
$$|u_2| = \sqrt{\frac{1}{4} + \frac{1}{4} + 1} = \sqrt{\frac{3}{2}}, \qquad e_2 = \frac{1}{\sqrt{6}}(1, -1, 2)$$
Step 3. Project $v_3$ onto $e_1$ and $e_2$ and subtract both:
$$\langle v_3, e_1 \rangle = \frac{1}{\sqrt{2}}, \qquad \langle v_3, e_2 \rangle = \frac{0 - 1 + 2}{\sqrt{6}} = \frac{1}{\sqrt{6}}$$
$$u_3 = (0,1,1) - \frac{1}{\sqrt{2}} \cdot \frac{1}{\sqrt{2}}(1,1,0) - \frac{1}{\sqrt{6}} \cdot \frac{1}{\sqrt{6}}(1,-1,2)$$
$$= (0,1,1) - \frac{1}{2}(1,1,0) - \frac{1}{6}(1,-1,2) = \left(-\frac{2}{3}, \frac{2}{3}, \frac{2}{3}\right)$$
$$|u_3| = \frac{2}{\sqrt{3}}, \qquad e_3 = \frac{1}{\sqrt{3}}(-1, 1, 1)$$
Verification: $\langle e_1, e_2 \rangle = 0$, $\langle e_1, e_3 \rangle = 0$, $\langle e_2, e_3 \rangle = 0$. Each vector has unit norm. ✓
Modified Gram-Schmidt
Classical Gram-Schmidt as stated above is mathematically correct but numerically unstable. In finite-precision arithmetic, rounding errors accumulate: after many steps, the computed $e_k$ may not be exactly orthogonal to the earlier $e_j$.
Modified Gram-Schmidt (MGS) reorders the operations to reduce error propagation. Instead of computing all projections of $v_k$ onto the previous $e_j$ simultaneously, we update the remaining vectors sequentially: after finding $e_k$, immediately remove the $e_k$ component from every subsequent $v_j$ before moving on.
Mathematically, classical and modified GS produce the same result. Numerically, MGS projects the already-partially-orthogonalised $v_j$ against $e_k$, so errors from earlier steps do not compound the way they do in the classical formulation. The loss of orthogonality in classical GS grows with the square of the condition number of the input matrix; in modified GS it grows only linearly with the condition number.
For highly ill-conditioned problems, even MGS is insufficient. The remedy is reorthogonalisation: after computing $e_k$, apply one additional round of projection removal to clean up residual errors.
QR Decomposition
The Gram-Schmidt process directly produces a QR decomposition of any matrix $A$ with linearly independent columns.
Theorem (QR Decomposition). Let $A \in \mathbb{R}^{m \times n}$ have linearly independent columns ($m \geq n$). Then $A = QR$ where:
- $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns ($Q^T Q = I_n$)
- $R \in \mathbb{R}^{n \times n}$ is upper triangular with positive diagonal entries
Construction. Running Gram-Schmidt on the columns $a_1, \ldots, a_n$ of $A$ gives orthonormal columns $e_1, \ldots, e_n$ of $Q$. The matrix $R$ captures the change of basis. For each column $a_k$:
$$a_k = \sum_{j=1}^{k} r_{jk} e_j$$
where $r_{jk} = \langle a_k, e_j \rangle$ for $j = 1, \ldots, k-1$, and $r_{kk} = |u_k|$ which is positive by construction.
$R$ is upper triangular because $a_k$ only involves $e_1, \ldots, e_k$ (the span-preservation property of GS). The diagonal entries are positive because we normalise with $|u_k|$.
Uniqueness of QR
Theorem. If $A$ has full column rank, the QR decomposition with positive diagonal of $R$ is unique.
Proof sketch. Suppose $A = Q_1 R_1 = Q_2 R_2$ with both having positive diagonals. Then $Q_2^T Q_1 = R_2 R_1^{-1}$. The left side is orthogonal; the right side is upper triangular. An orthogonal upper triangular matrix must be diagonal with diagonal entries satisfying $d_i^2 = 1$, so $d_i = \pm 1$. Positivity of the diagonals of both $R_1$ and $R_2$ forces $d_i = 1$, giving $Q_1 = Q_2$ and $R_1 = R_2$. $\square$
Solving Least Squares via QR
The QR decomposition provides the numerically preferred method for least squares. Given $Ax \approx b$ with $A = QR$, since $Q$ preserves norms we have:
$$|Ax - b|^2 = |QRx - b|^2 = |Rx - Q^T b|^2 + |b - QQ^T b|^2$$
The second term does not depend on $x$. Minimising the first term requires solving the triangular system:
$$Rx = Q^T b$$
This is done by back-substitution in $O(n^2)$ operations. No matrix-matrix product $A^T A$ is ever formed, so the condition number is not squared - this is the key numerical advantage over the normal equations.
Further Applications
The QR algorithm for eigenvalues. The iteration $A_0 = A$, then $A_{k+1} = R_k Q_k$ (swap the order after factoring $A_k = Q_k R_k$) converges to a triangular matrix whose diagonal entries are the eigenvalues. Each step is a similarity transformation $A_{k+1} = Q_k^T A_k Q_k$, which preserves eigenvalues. With shifts and deflation, the QR algorithm runs in $O(n^3)$ and is the standard method for dense eigenvalue problems.
Krylov subspace methods. Iterative solvers for large sparse systems (GMRES, Arnoldi, Lanczos) work in the Krylov subspace $\mathcal{K}_k(A, b) = \text{span}(b, Ab, A^2 b, \ldots, A^{k-1} b)$. Building an orthonormal basis for $\mathcal{K}_k$ as $k$ grows is precisely the Arnoldi iteration - Gram-Schmidt applied to successive Krylov vectors. The Lanczos iteration is the symmetric specialisation where the orthonormal basis can be maintained with only a 3-term recurrence, giving $O(kn)$ cost instead of $O(k^2 n)$. These methods underlie MATLAB’s eigs and SciPy’s eigsh.
Read Next: