Gram-Schmidt Orthogonalization
Prerequisite:
The Problem
Every finite-dimensional inner product space has an orthonormal basis - but given an arbitrary basis, how do we construct one? The Gram-Schmidt process is the answer: a constructive algorithm that takes any linearly independent list and produces an orthonormal list spanning the same space.
Theorem. Every finite-dimensional inner product space (over $\mathbb{R}$ or $\mathbb{C}$) has an orthonormal basis.
Proof. Take any basis. Apply Gram-Schmidt. $\square$ (The content is in the algorithm.)
The Gram-Schmidt Algorithm
Input: Linearly independent vectors $v_1, v_2, \ldots, v_n \in V$.
Output: Orthonormal vectors $e_1, e_2, \ldots, e_n$ with $\text{span}(e_1, \ldots, e_k) = \text{span}(v_1, \ldots, v_k)$ for each $k$.
Step 1. Set $e_1 = v_1 / |v_1|$.
Step $k$ (for $k = 2, \ldots, n$). Subtract from $v_k$ its projections onto all previously found orthonormal vectors, then normalise: $$\tilde{e}_k = v_k - \sum_{j=1}^{k-1} \langle v_k, e_j \rangle e_j, \qquad e_k = \frac{\tilde{e}_k}{|\tilde{e}_k|}$$
The vector $\tilde{e}_k$ is the component of $v_k$ orthogonal to $\text{span}(e_1, \ldots, e_{k-1})$. Linear independence of the $v_i$ guarantees $\tilde{e}_k \neq 0$ (otherwise $v_k \in \text{span}(v_1, \ldots, v_{k-1})$, contradicting independence).
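For concreteness, here is a minimal NumPy sketch of the classical algorithm; the function name `gram_schmidt` and the column-wise convention are choices made for this illustration, not fixed by the text.

```python
import numpy as np

def gram_schmidt(V):
    """Classical Gram-Schmidt on the columns of V (assumed linearly independent, real)."""
    n = V.shape[1]
    E = np.zeros(V.shape, dtype=float)
    for k in range(n):
        # subtract from v_k its projections onto all previously computed e_j
        tilde = V[:, k] - sum((V[:, k] @ E[:, j]) * E[:, j] for j in range(k))
        E[:, k] = tilde / np.linalg.norm(tilde)   # normalise
    return E
```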
Why It Works: Induction
Claim: After step $k$, the set $\{e_1, \ldots, e_k\}$ is orthonormal and spans the same space as $\{v_1, \ldots, v_k\}$.
Base case: $k=1$. $e_1 = v_1/|v_1|$ is a unit vector spanning $\text{span}(v_1)$. $\checkmark$
Inductive step: Assume the claim holds for $k-1$. Then $\tilde{e}_k = v_k - P_{\text{span}(e_1,\ldots,e_{k-1})} v_k$ is the component of $v_k$ orthogonal to $\text{span}(e_1, \ldots, e_{k-1})$. By construction, $\langle \tilde{e}_k, e_j \rangle = 0$ for all $j < k$ (the projection subtraction removes all components in those directions). Normalising gives $e_k$ with $|e_k|=1$. The span condition holds because $v_k = \tilde{e}_k + \sum_{j<k} \langle v_k, e_j \rangle e_j$ is a linear combination of $\tilde{e}_k$ (proportional to $e_k$) and $e_1, \ldots, e_{k-1}$. $\square$
ASCII Example in $\mathbb{R}^3$
Given: v1 = (1,1,0), v2 = (1,0,1), v3 = (0,1,1)
Step 1:
||v1|| = sqrt(2)
e1 = (1/sqrt(2), 1/sqrt(2), 0)
Step 2:
<v2, e1> = 1/sqrt(2)
proj = (1/2, 1/2, 0)
tilde_e2 = v2 - proj = (1/2, -1/2, 1)
||tilde_e2|| = sqrt(1/4 + 1/4 + 1) = sqrt(3/2)
e2 = (1/sqrt(6), -1/sqrt(6), 2/sqrt(6))
Step 3:
<v3, e1> = 1/sqrt(2), <v3, e2> = -1/sqrt(6) + 2/sqrt(6) = 1/sqrt(6)
proj = (1/2, 1/2, 0) + (1/6)*(1,-1,2) = (2/3, 1/3, 1/3)
tilde_e3 = (0,1,1) - (2/3, 1/3, 1/3) = (-2/3, 2/3, 2/3)
||tilde_e3|| = (2/3)*sqrt(3)
e3 = (-1/sqrt(3), 1/sqrt(3), 1/sqrt(3))
Check: <e1,e2> = 0, <e1,e3> = 0, <e2,e3> = 0 ✓
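The hand computation can be checked numerically by feeding v1, v2, v3 to the `gram_schmidt` sketch above:

```python
V = np.array([[1., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.]]).T            # columns are v1, v2, v3
E = gram_schmidt(V)
print(np.round(E, 4))                      # columns e1, e2, e3 as computed above
print(np.round(E.T @ E, 10))               # identity matrix: orthonormality check
```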
Modified Gram-Schmidt
Classical Gram-Schmidt as stated above is mathematically correct but numerically unstable. In finite-precision arithmetic, rounding errors accumulate: after many steps, the computed $e_k$ may not be exactly orthogonal to the earlier $e_j$, and the error grows.
Modified Gram-Schmidt (MGS) reorders the operations to reduce error propagation. Instead of computing all projections of $v_k$ onto the previous $e_j$ simultaneously, we update the residual sequentially:
for k = 1 to n:
    e_k = v_k / ||v_k||
    for j = k+1 to n:
        v_j = v_j - <v_j, e_k> * e_k   # remove component along e_k immediately
Mathematically, classical and modified GS are identical. Numerically, MGS projects the already-partially-orthogonalised $v_j$ against $e_k$, so errors from earlier steps don’t compound the way they do in the classical formulation. The loss of orthogonality in classical GS is $O(\epsilon_{\text{mach}} \cdot \kappa(A)^2)$; in modified GS it is $O(\epsilon_{\text{mach}} \cdot \kappa(A))$, where $\kappa(A)$ is the condition number.
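A minimal NumPy sketch of the MGS loop above, operating on a copy of the input columns (the function name is a placeholder):

```python
import numpy as np

def modified_gram_schmidt(V):
    """Modified Gram-Schmidt on the columns of V (assumed linearly independent, real)."""
    V = np.array(V, dtype=float, copy=True)
    n = V.shape[1]
    E = np.zeros_like(V)
    for k in range(n):
        E[:, k] = V[:, k] / np.linalg.norm(V[:, k])
        for j in range(k + 1, n):
            # remove the component along e_k from v_j immediately
            V[:, j] -= (V[:, j] @ E[:, k]) * E[:, k]
    return E
```

Comparing $\|Q^T Q - I\|$ for the classical and modified sketches on an ill-conditioned input (a Vandermonde matrix, say) tends to make the difference between the two error bounds visible.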
For highly ill-conditioned problems, even MGS is insufficient. The remedy is reorthogonalisation: after computing $\tilde{e}_k$, orthogonalise it against $e_1, \ldots, e_{k-1}$ a second time before normalising - in effect, one extra Gram-Schmidt pass applied to the output of the first.
QR Decomposition
The Gram-Schmidt process directly produces a QR decomposition of any matrix $A$ with linearly independent columns.
Theorem (QR Decomposition). Let $A \in \mathbb{R}^{m \times n}$ have linearly independent columns ($m \geq n$). Then $A = QR$ where:
- $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns ($Q^T Q = I_n$)
- $R \in \mathbb{R}^{n \times n}$ is upper triangular with positive diagonal entries
Construction. Running Gram-Schmidt on the columns $a_1, \ldots, a_n$ of $A$ gives orthonormal columns $e_1, \ldots, e_n$ of $Q$. The matrix $R$ captures the change of basis: $$a_k = \sum_{j=1}^k r_{jk} e_j, \qquad \text{where } r_{jk} = \langle a_k, e_j \rangle \text{ for } j < k, \quad r_{kk} = |\tilde{e}_k| > 0$$
$R$ is upper triangular because $a_k$ only involves $e_1, \ldots, e_k$ (the span preservation property of GS), and the diagonal entries are positive because we normalise with $|\tilde{e}_k|$.
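As a sketch, the same MGS loop can return both factors; the helper name `qr_via_mgs` is hypothetical, and only real matrices and thin factors are handled:

```python
import numpy as np

def qr_via_mgs(A):
    """Thin QR factorisation of A (independent columns) via modified Gram-Schmidt."""
    V = np.array(A, dtype=float, copy=True)
    m, n = V.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.linalg.norm(V[:, k])        # r_kk = ||tilde_e_k|| > 0
        Q[:, k] = V[:, k] / R[k, k]
        for j in range(k + 1, n):
            R[k, j] = Q[:, k] @ V[:, j]          # r_kj = <a_j, e_k> (from the partially orthogonalised column)
            V[:, j] -= R[k, j] * Q[:, k]
    return Q, R
```

A quick sanity check is `Q, R = qr_via_mgs(A); np.allclose(A, Q @ R)`.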
Uniqueness of QR
Theorem. If $A$ has full column rank, the QR decomposition with positive diagonal of $R$ is unique.
Proof sketch. Suppose $A = Q_1 R_1 = Q_2 R_2$ with both having positive diagonals. Then $Q_2^T Q_1 = R_2 R_1^{-1}$. The left side has orthonormal columns (so it’s orthogonal); the right side is upper triangular (product of upper triangular matrices). An orthogonal upper triangular matrix must be diagonal, and its diagonal entries satisfy $d_i^2 = 1$, so $d_i = \pm 1$. Positivity of the diagonals of both $R_1$ and $R_2$ forces $d_i = 1$, giving $Q_1 = Q_2$ and $R_1 = R_2$. $\square$
Solving Least Squares via QR
The QR decomposition provides the numerically preferred method for least squares. Given $Ax \approx b$ with $A = QR$: $$|Ax - b|^2 = |QRx - b|^2 = |Rx - Q^T b|^2 + |b - QQ^T b|^2$$
(using $|Qy| = |y|$, which holds because $Q^T Q = I$, and splitting $b = QQ^T b + (I - QQ^T)b$, whose second piece is orthogonal to the range of $Q$). Only the first term depends on $x$; minimising it means driving it to zero by solving the triangular system $Rx = Q^T b$, which is done by back-substitution in $O(n^2)$ operations. No matrix-matrix product $A^T A$ is formed, so the condition number is not squared.
Algorithm for least squares via QR:
1. Compute A = QR (via modified Gram-Schmidt or Householder)
2. Compute c = Q^T b
3. Solve Rx = c by back-substitution
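A minimal sketch of these three steps using NumPy's (Householder-based) QR and SciPy's triangular solver; the wrapper name `lstsq_qr` is a placeholder:

```python
import numpy as np
from scipy.linalg import solve_triangular

def lstsq_qr(A, b):
    """Least-squares solution of A x ~ b via thin QR."""
    Q, R = np.linalg.qr(A, mode='reduced')        # step 1: A = QR
    c = Q.T @ b                                   # step 2: c = Q^T b
    return solve_triangular(R, c, lower=False)    # step 3: back-substitution

# quick check against NumPy's built-in least squares
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
assert np.allclose(lstsq_qr(A, b), np.linalg.lstsq(A, b, rcond=None)[0])
```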
Householder reflections (not Gram-Schmidt) are typically preferred in production codes for QR because they are numerically superior for very ill-conditioned problems, but the result is the same decomposition.
Examples
The QR algorithm for eigenvalues. The iteration $A_0 = A$, then $A_{k+1} = R_k Q_k$ (swap the order after factoring $A_k = Q_k R_k$) converges to a triangular (or quasi-triangular) matrix whose diagonal entries are the eigenvalues. This is not the same as applying QR decomposition once - it is a sequence of similarity transformations $A_{k+1} = Q_k^T A_k Q_k$, each of which preserves eigenvalues. With an initial reduction to Hessenberg form, shifts, and deflation, the QR algorithm computes all eigenvalues in $O(n^3)$ operations and is the standard method for dense eigenvalue problems.
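A toy sketch of the unshifted iteration, with no Hessenberg reduction, shifts, or deflation, so it is illustrative rather than practical:

```python
import numpy as np

def qr_eigenvalues(A, iters=500):
    """Unshifted QR iteration: A_{k+1} = R_k Q_k; the diagonal tends to the eigenvalues."""
    Ak = np.array(A, dtype=float, copy=True)
    for _ in range(iters):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q                     # similarity transform Q^T A_k Q
    return np.diag(Ak)

A = np.array([[2., 1., 0.], [1., 3., 1.], [0., 1., 4.]])   # symmetric example
print(np.sort(qr_eigenvalues(A)), np.linalg.eigvalsh(A))   # the two should agree
```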
Krylov subspace methods. Iterative methods for large sparse problems (GMRES for linear systems; Arnoldi and Lanczos for eigenvalue problems) work in the Krylov subspace $\mathcal{K}_k(A, b) = \text{span}(b, Ab, A^2 b, \ldots, A^{k-1} b)$. Building an orthonormal basis for $\mathcal{K}_k$ as $k$ grows is precisely the Arnoldi iteration - Gram-Schmidt applied to successive Krylov vectors. The Lanczos iteration is the symmetric specialisation in which the orthonormal basis and the projection of $A$ onto it can both be maintained with only 3-term recurrences, giving $O(kn)$ orthogonalisation cost instead of $O(k^2 n)$. These methods underlie MATLAB's eigs and Python's scipy.sparse.linalg.eigsh.
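A sketch of the Arnoldi iteration, i.e. modified Gram-Schmidt applied to successive Krylov vectors (the function name is a placeholder, and breakdown handling is omitted):

```python
import numpy as np

def arnoldi(A, b, k):
    """Orthonormal basis Q of K_k(A, b) plus the Hessenberg projection H of A onto it."""
    n = len(b)
    Q = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    Q[:, 0] = b / np.linalg.norm(b)
    for j in range(k):
        w = A @ Q[:, j]                     # next Krylov direction
        for i in range(j + 1):              # MGS orthogonalisation against q_0, ..., q_j
            H[i, j] = Q[:, i] @ w
            w -= H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        Q[:, j + 1] = w / H[j + 1, j]       # breaks down if w = 0 (invariant subspace found)
    return Q, H
```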
Read Next: