A matrix factorization decomposes a matrix into a product of structured factors - triangular, orthogonal, diagonal - that are easier to work with than the original. Different factorizations are optimal for different tasks: solving linear systems, least squares, handling positive-definite structure, or revealing spectral information. This post covers the five main factorizations used in scientific computing.

LU Decomposition

The Factorization

Theorem (LU without pivoting). If $A \in \mathbb{R}^{n \times n}$ has the property that all leading principal submatrices $A_{1:k, 1:k}$ are nonsingular (for $k = 1, \ldots, n-1$), then there exists a unique factorization

$$A = LU,$$

where $L$ is unit lower triangular (diagonal entries all 1) and $U$ is upper triangular.

Theorem (LU with partial pivoting). For any $A \in \mathbb{R}^{n \times n}$, there exists a permutation matrix $P$, unit lower triangular $L$, and upper triangular $U$ such that

$$PA = LU.$$

The permutation $P$ reorders rows so that, at each elimination step, the largest-magnitude entry in the current column (on or below the diagonal) is moved into the pivot position - this is partial pivoting and is essential for numerical stability.

Connection to Gaussian Elimination

LU decomposition is Gaussian elimination in disguise.

Start:  A = [a11 a12 a13]     After elim col 1:  [u11 u12 u13]
            [a21 a22 a23]   ==================>  [0   u22 u23]
            [a31 a32 a33]                        [0   a'32 a'33]

Multipliers m21 = a21/a11, m31 = a31/a11 (and, from the next step, m32 = a'32/u22) are stored in L:
L = [1    0    0 ]
    [m21  1    0 ]
    [m31  m32  1 ]

Each elimination step applies an elementary lower triangular matrix $E_k$ from the left: $E_{n-1} \cdots E_1 A = U$, so $L = E_1^{-1} \cdots E_{n-1}^{-1}$ (each $E_k^{-1}$ just negates the subdiagonal entries of $E_k$, so $L$ accumulates the multipliers directly).

Cost: Factorization is $O(n^3)$; once $PA = LU$ is known, solving $Ax = b$ costs $O(n^2)$:

  1. Solve $Ly = Pb$ (forward substitution, $O(n^2)$).
  2. Solve $Ux = y$ (back substitution, $O(n^2)$).

This makes LU ideal when you must solve $Ax = b$ for many different right-hand sides $b$ (factor once, solve multiple times cheaply).
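
For example, a minimal SciPy sketch of the factor-once, solve-many pattern (scipy.linalg.lu_factor and lu_solve wrap LAPACK's getrf/getrs; the sizes here are arbitrary):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))

# Factor once: O(n^3). lu_factor returns packed LU factors and pivot indices.
lu, piv = lu_factor(A)

# Solve for many right-hand sides: O(n^2) each.
for _ in range(10):
    b = rng.standard_normal(500)
    x = lu_solve((lu, piv), b)
    assert np.allclose(A @ x, b)
```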

Existence and Uniqueness

Theorem. If $A$ is nonsingular, then $PA = LU$ exists and $U$ is nonsingular. The factorization $A = LU$ (without pivoting) is unique when it exists.

Uniqueness proof. If $A = L_1 U_1 = L_2 U_2$, then $L_2^{-1} L_1 = U_2 U_1^{-1}$. The left side is unit lower triangular; the right side is upper triangular. A matrix that is both unit lower and upper triangular must be $I$. So $L_1 = L_2$ and $U_1 = U_2$. $\square$

Numerical Notes

Partial pivoting (LAPACK’s dgetrf) ensures $|l_{ij}| \leq 1$, which bounds the growth factor $g = \max_{i,j} |u_{ij}| / \max_{i,j} |a_{ij}|$ by $2^{n-1}$. That worst case is exponentially large and is attained by a specific adversarial matrix, but in practice partial pivoting is extremely reliable. Complete pivoting (pivoting both rows and columns) has a much smaller provable growth bound but is rarely needed.
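
To see why pivoting matters, here is a small illustrative sketch (the 2×2 matrix with a tiny pivot is the standard textbook example, not taken from LAPACK): elimination without row exchanges uses a multiplier of about $10^{20}$ and loses the solution in double precision, while pivoted LU recovers it.

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

A = np.array([[1e-20, 1.0],
              [1.0,   1.0]])
b = np.array([1.0, 2.0])

# LU without pivoting, done by hand for this 2x2: the tiny (1,1) pivot
# produces a multiplier ~1e20 and swamps a22.
m21 = A[1, 0] / A[0, 0]
L_np = np.array([[1.0, 0.0], [m21, 1.0]])
U_np = np.array([[A[0, 0], A[0, 1]],
                 [0.0, A[1, 1] - m21 * A[0, 1]]])
x_nopivot = solve_triangular(U_np, solve_triangular(L_np, b, lower=True))

# LU with partial pivoting (scipy.linalg.lu returns A = P @ L @ U).
P, L, U = lu(A)
x_pivot = solve_triangular(U, solve_triangular(L, P.T @ b, lower=True))

print(x_nopivot)   # [0., 1.]  -- first component completely wrong
print(x_pivot)     # ~[1., 1.] -- agrees with np.linalg.solve(A, b)
```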

QR Decomposition

The Factorization

Theorem (QR Decomposition). For any $A \in \mathbb{R}^{m \times n}$ with $m \geq n$, there exists a factorization

$$A = QR,$$

where $Q \in \mathbb{R}^{m \times m}$ is orthogonal ($Q^T Q = I$) and $R \in \mathbb{R}^{m \times n}$ is upper triangular (zero below the diagonal).

The thin QR keeps only the first $n$ columns of $Q$:

$$A = \hat{Q}\hat{R}, \quad \hat{Q} \in \mathbb{R}^{m \times n}, \; \hat{R} \in \mathbb{R}^{n \times n},$$

where $\hat{Q}$ has orthonormal columns ($\hat{Q}^T \hat{Q} = I_n$) and $\hat{R}$ is square upper triangular.

Uniqueness. If $A$ has full column rank and we require diagonal entries of $R$ to be positive, the thin QR is unique.

Gram–Schmidt Process

The classical Gram–Schmidt algorithm computes QR by orthogonalizing the columns of $A$:

Input: columns a1, ..., an of A
q1 = a1 / ||a1||
r11 = ||a1||

for j = 2, ..., n:
    v = aj
    for i = 1, ..., j-1:
        r_ij = qi^T * aj        (projection coefficient)
        v = v - r_ij * qi       (subtract projection)
    r_jj = ||v||
    qj = v / r_jj

The columns $q_1, \ldots, q_n$ are the orthonormal basis (columns of $\hat{Q}$), and $r_{ij}$ fill $\hat{R}$.

Modified Gram–Schmidt (MGS) reorders the operations so that each projection coefficient is computed against the partially orthogonalized vector ($r_{ij} = q_i^T v$) rather than the original column $a_j$. This is more numerically stable: it avoids the severe cancellation classical Gram–Schmidt suffers when $a_j$ is nearly in the span of earlier columns. Even MGS loses orthogonality in floating point when columns are nearly dependent; in that case, use Householder QR.
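
A minimal NumPy sketch of modified Gram–Schmidt (a direct transcription of the idea, not tuned for performance):

```python
import numpy as np

def mgs_qr(A):
    """Thin QR via modified Gram-Schmidt: A (m x n, m >= n) = Q @ R."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = A.copy()                      # columns get orthogonalized in place
    R = np.zeros((n, n))
    for j in range(n):
        R[j, j] = np.linalg.norm(Q[:, j])
        Q[:, j] /= R[j, j]
        for k in range(j + 1, n):
            # MGS: project against the *current*, partially reduced column k
            R[j, k] = Q[:, j] @ Q[:, k]
            Q[:, k] -= R[j, k] * Q[:, j]
    return Q, R

A = np.random.default_rng(1).standard_normal((6, 4))
Q, R = mgs_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(4)))  # True True
```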

Householder Reflections

A Householder reflector is $H = I - 2vv^T / (v^T v)$, an orthogonal matrix that reflects across the hyperplane orthogonal to $v$. By choosing $v$ appropriately, $H$ maps any given vector $x$ to $\pm\|x\|_2 \, e_1$ (the sign is chosen to avoid cancellation).

Householder QR applies $n$ reflectors from the left:

$$H_n \cdots H_2 H_1 A = R \implies Q^T = H_n \cdots H_1 \implies Q = H_1 \cdots H_n.$$

This is backward stable: the computed $Q$ and $R$ satisfy $A + \delta A = QR$ where $\|\delta A\|_F / \|A\|_F = O(\varepsilon_{\text{machine}})$.
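
A compact NumPy sketch of Householder QR (a from-scratch illustration; in practice use numpy.linalg.qr, which calls LAPACK's geqrf):

```python
import numpy as np

def householder_qr(A):
    """Full QR of A (m x n, m >= n) via Householder reflectors."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    Q = np.eye(m)
    for k in range(n):
        x = R[k:, k]
        # v = x + sign(x1) * ||x|| * e1, normalized; the sign avoids cancellation.
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])
        v /= np.linalg.norm(v)
        # Apply H = I - 2 v v^T to the trailing block of R and accumulate into Q.
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, R

A = np.random.default_rng(2).standard_normal((5, 3))
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A))                 # True
print(np.allclose(Q.T @ Q, np.eye(5)))       # True: Q orthogonal
print(np.allclose(np.tril(R, -1), 0))        # True: R upper triangular
```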

Application: Least Squares

The overdetermined system $Ax \approx b$ ($A \in \mathbb{R}^{m \times n}$, $m > n$, full column rank) has a least squares solution minimizing $\|Ax - b\|_2^2$.

Using thin QR $A = \hat{Q}\hat{R}$:

$$\|Ax - b\|_2^2 = \|\hat{Q}\hat{R}x - b\|_2^2 = \|\hat{R}x - \hat{Q}^T b\|_2^2 + \|(I - \hat{Q}\hat{Q}^T)b\|_2^2.$$

The second term is independent of $x$; minimizing requires $\hat{R}x = \hat{Q}^T b$. Since $\hat{R}$ is square upper triangular and nonsingular (full rank assumed):

$$\hat{R}x^\ast = \hat{Q}^T b \quad \text{(solve by back-substitution)}.$$

This is numerically superior to the normal equations $A^T A x = A^T b$ (which squares the condition number).
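
A short NumPy/SciPy sketch of the QR route next to the normal equations (for comparison only; numpy.linalg.lstsq does this work for you via LAPACK):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 5))       # tall matrix, full column rank
b = rng.standard_normal(100)

# Least squares via thin QR: solve R x = Q^T b by back-substitution.
Q, R = np.linalg.qr(A)                  # thin QR (mode='reduced' is the default)
x_qr = solve_triangular(R, Q.T @ b)

# Normal equations: squares the condition number, shown only for contrast.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_qr, x_ref), np.allclose(x_ne, x_ref))   # True True (A is well conditioned)
```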

Cholesky Decomposition

The Factorization

Definition. $A \in \mathbb{R}^{n \times n}$ is symmetric positive definite (SPD) if $A = A^T$ and $x^T A x > 0$ for all $x \neq 0$.

Theorem (Cholesky). $A$ is symmetric positive definite if and only if there exists a unique lower triangular matrix $L$ with positive diagonal entries such that

$$A = LL^T.$$

This is the Cholesky decomposition (or Cholesky factorization).

Proof (existence by induction). Write $A = \begin{pmatrix} \alpha & v^T \\ v & B \end{pmatrix}$. Since $A$ is SPD, $\alpha = e_1^T A e_1 > 0$, so $\ell_{11} = \sqrt{\alpha}$ is real. The Schur complement $S = B - vv^T/\alpha$ is also SPD (verify: $x^T S x = x^T B x - (v^T x)^2 / \alpha > 0$ for $x \neq 0$, which is $y^T A y$ with $y = (-v^T x/\alpha, \; x)$). Apply induction to $S = \tilde{L}\tilde{L}^T$, and assemble:

$$L = \begin{pmatrix} \ell_{11} & 0 \\ v/\ell_{11} & \tilde{L}\end{pmatrix}. \qquad \square$$

Cost: Cholesky is $\approx n^3/3$ flops - half the cost of LU - because it exploits symmetry (only the lower triangle is computed). No pivoting is needed: all Schur complements of an SPD matrix are SPD, so the algorithm never encounters a zero pivot.

Algorithm

for j = 1, ..., n:
    l_jj = sqrt(a_jj - sum_{k=1}^{j-1} l_jk^2)
    for i = j+1, ..., n:
        l_ij = (a_ij - sum_{k=1}^{j-1} l_ik * l_jk) / l_jj

The algorithm fails (encounters $\sqrt{\text{negative}}$) if and only if $A$ is not positive definite - Cholesky is a practical test for positive definiteness.

Numerical Stability and Applications

Cholesky is numerically stable without any pivoting: every entry of $L$ satisfies $|l_{ij}| \leq \sqrt{a_{ii}}$ (since $\sum_k l_{ik}^2 = a_{ii}$), so no element growth occurs. This makes it the preferred solver for SPD systems.

Connection to multivariate Gaussians. The density of $\mathcal{N}(\mu, \Sigma)$ involves $\Sigma^{-1}$ and $\det(\Sigma)$. With $\Sigma = LL^T$:

  • $\Sigma^{-1} = L^{-T} L^{-1}$: solve two triangular systems.
  • $\det(\Sigma) = (\det L)^2 = \left(\prod_{i=1}^n l_{ii}\right)^2$: trivial from diagonal.
  • Sampling: to generate $x \sim \mathcal{N}(\mu, \Sigma)$, draw $z \sim \mathcal{N}(0, I)$ and return $x = \mu + Lz$. This uses $L$ as a “square root” of $\Sigma$ (sketched below).
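
A minimal NumPy sketch of the sampling recipe (the 2×2 covariance is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])          # any SPD covariance works

L = np.linalg.cholesky(Sigma)           # Sigma = L @ L.T, L lower triangular

# x = mu + L z with z ~ N(0, I); rows of z are independent draws.
z = rng.standard_normal((100_000, 2))
x = mu + z @ L.T

print(x.mean(axis=0))                   # approximately mu
print(np.cov(x, rowvar=False))          # approximately Sigma
```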

Optimization. In Newton’s method, the Hessian $H$ of a strictly convex function is SPD. Cholesky factorization of $H$ is the standard way to solve $Hp = -\nabla f$ (the Newton step). The modified Cholesky algorithm additionally detects non-positive-definiteness and shifts the Hessian, useful when $H$ may be indefinite.

Eigendecomposition

Theorem. $A \in \mathbb{F}^{n \times n}$ is diagonalizable if and only if

$$A = Q \Lambda Q^{-1},$$

where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ and $Q$ has eigenvectors as columns.

For symmetric $A$: $Q$ can be taken orthogonal ($Q^T Q = I$), giving

$$A = Q \Lambda Q^T \quad (A \text{ symmetric}),$$

the spectral decomposition. Eigenvalues are real; $Q$ is an orthogonal change-of-basis to the eigenbasis.

Existence: Eigenvalues always exist over an algebraically closed field such as $\mathbb{C}$, but an eigendecomposition exists only when $A$ is diagonalizable. Every real symmetric matrix is diagonalizable (with real orthogonal $Q$); a general square matrix may be defective (e.g., $\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$ has a repeated eigenvalue with only one independent eigenvector) and then no eigendecomposition exists.

Cost: Computing eigendecomposition is $O(n^3)$ via the QR algorithm. However, eigenvectors are more expensive to compute than eigenvalues alone.

When to use eigendecomposition:

  • Computing matrix powers: $A^k = Q\Lambda^k Q^{-1}$ (see the sketch after this list).
  • Solving ODEs: $\dot{x} = Ax$, solution $x(t) = e^{At}x_0 = Qe^{\Lambda t}Q^{-1}x_0$.
  • PCA (eigendecomposition of the covariance matrix).
  • Spectral graph theory (Laplacian eigenvectors).
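
For instance, a small NumPy sketch using the spectral decomposition of a symmetric matrix for powers and for the matrix exponential (the symmetric case is chosen so that np.linalg.eigh applies):

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4))
A = B + B.T                               # symmetric, so A = Q diag(lam) Q^T

lam, Q = np.linalg.eigh(A)                # real eigenvalues, orthonormal Q

# Matrix power: A^k = Q diag(lam^k) Q^T
k = 5
print(np.allclose(Q @ np.diag(lam**k) @ Q.T,
                  np.linalg.matrix_power(A, k)))        # True

# ODE xdot = A x: x(t) = e^{At} x0 = Q e^{Lambda t} Q^T x0
t, x0 = 0.1, np.ones(4)
x_t = Q @ (np.exp(lam * t) * (Q.T @ x0))
print(x_t)
```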

Schur Decomposition

Theorem (Schur). For any $A \in \mathbb{C}^{n \times n}$, there exists a unitary $Q$ and upper triangular $T$ such that

$$A = QTQ^\ast.$$

The diagonal entries of $T$ are the eigenvalues of $A$ (in any order). If $A$ is real and all eigenvalues are real, $Q$ can be taken real orthogonal and $T$ real upper triangular.

Proof by induction. $A$ has at least one eigenvalue $\lambda_1$ (over $\mathbb{C}$) with unit eigenvector $q_1$. Extend $q_1$ to an orthonormal basis, the columns of a unitary $Q_1$; then $Q_1^\ast A Q_1 = \begin{pmatrix}\lambda_1 & w^\ast \\ 0 & A_1\end{pmatrix}$ for some $w \in \mathbb{C}^{n-1}$ and $A_1 \in \mathbb{C}^{(n-1)\times(n-1)}$. Apply induction to $A_1$. $\square$

Schur vs Jordan. The Schur form is always upper triangular (not block diagonal with Jordan blocks). It is numerically computable and backward stable (the QR algorithm produces the Schur form). Jordan form is theoretically cleaner but numerically fragile.

Real Schur form. For real $A$ with complex eigenvalues (appearing in conjugate pairs $\alpha \pm i\beta$), the real Schur form is $A = QTQ^T$ where $T$ is quasi-upper triangular: block upper triangular with $1 \times 1$ blocks for real eigenvalues and $2 \times 2$ blocks for conjugate pairs. No complex arithmetic needed.
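
A quick SciPy check of the real Schur form (scipy.linalg.schur with output='real'; the example matrix, with one real eigenvalue and one conjugate pair, is chosen for illustration):

```python
import numpy as np
from scipy.linalg import schur

A = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 2.0]])        # eigenvalues: +i, -i, 2

T, Q = schur(A, output='real')
print(np.allclose(Q @ T @ Q.T, A))      # True: A = Q T Q^T with Q real orthogonal
print(np.round(T, 3))                   # quasi-upper triangular: a 2x2 block for the pair +-i
print(np.linalg.eigvals(A))             # matches the block eigenvalues
```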

Comparing Factorizations

| Factorization | Form | When it exists | Primary use |
|---|---|---|---|
| LU | $PA = LU$ | Always (with pivoting) | Solving $Ax = b$, computing $\det A$ |
| QR | $A = QR$ | Always ($m \geq n$) | Least squares, orthogonalization |
| Cholesky | $A = LL^T$ | $A$ symmetric positive definite | SPD systems, sampling Gaussians |
| Eigendecomposition | $A = Q\Lambda Q^{-1}$ | $A$ diagonalizable | Powers, ODEs, PCA |
| Schur | $A = QTQ^\ast$ | Always (square) | Canonical form, computing eigenvalues |
| SVD | $A = U\Sigma V^T$ | Always (any shape) | Rank, least squares, dimensionality reduction |

Computational cost ($A \in \mathbb{R}^{n \times n}$):

| Factorization | Flops |
|---|---|
| LU | $\frac{2}{3}n^3$ |
| QR (Householder) | $\frac{4}{3}n^3$ |
| Cholesky | $\frac{1}{3}n^3$ |
| Eigendecomposition | $\sim 10\, n^3$ (QR algorithm) |
| Schur | $\sim 10\, n^3$ (QR algorithm) |
| SVD | $\sim 11\, n^3$ (for $m = n$) |

After factorization, solving $Ax = b$ costs:

  • LU: $O(n^2)$ (triangular substitutions)
  • Cholesky: $O(n^2)$ (same, but half as many since symmetric)
  • QR: $O(n^2)$ (back-substitute, or thin QR for least squares)

Examples

LAPACK (Linear Algebra PACKage) is the standard library for dense numerical linear algebra, underlying NumPy, SciPy, MATLAB, Julia, and most ML frameworks.

Key LAPACK routines:

  • dgesv: LU solve (wraps dgetrf + dgetrs)
  • dpotrf + dpotrs: Cholesky factorization + solve (for SPD systems)
  • dgels: QR-based least squares
  • dgesvd / dgesdd: SVD (divide-and-conquer version dgesdd is faster)
  • dsyev: symmetric eigendecomposition (Schur = eigendecomposition for symmetric)
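
In Python you rarely call these routines directly; NumPy and SciPy dispatch to LAPACK under the hood. The routine names in the comments below are the typical drivers these wrappers document, though the exact choices can vary by version:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(6)
A = rng.standard_normal((200, 200))
b = rng.standard_normal(200)
S = A @ A.T + 200 * np.eye(200)                        # an SPD matrix

x = np.linalg.solve(A, b)                              # general solve   (gesv)
c, low = cho_factor(S)                                 # Cholesky        (potrf)
y = cho_solve((c, low), b)                             # SPD solve       (potrs)
coef = np.linalg.lstsq(A[:, :10], b, rcond=None)[0]    # least squares   (gelsd)
U, s, Vt = np.linalg.svd(A)                            # SVD             (gesdd)
w, V = np.linalg.eigh(S)                               # symmetric eig   (syevd)
```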

Positive-definite Hessians in optimization. Newton’s method requires solving $H \Delta x = -\nabla f$ where $H = \nabla^2 f$ is the Hessian. For strictly convex $f$, $H$ is SPD: use Cholesky. The L-BFGS algorithm approximates the inverse Hessian with a low-rank update, avoiding factorization entirely.
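
A minimal sketch of Newton's method with a Cholesky-based step (the strictly convex objective $f(x) = \tfrac{1}{2}x^T P x + \sum_i e^{x_i}$ is a hypothetical example chosen so that the Hessian is SPD everywhere; a real solver would add a line search):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(7)
B = rng.standard_normal((5, 5))
P = B @ B.T + np.eye(5)                       # SPD by construction

grad = lambda x: P @ x + np.exp(x)
hess = lambda x: P + np.diag(np.exp(x))       # SPD for every x

x = np.zeros(5)
for _ in range(20):                           # plain Newton iteration
    c, low = cho_factor(hess(x))              # H = L L^T; raises LinAlgError if H is not SPD
    x = x + cho_solve((c, low), -grad(x))     # Newton step: solve H p = -grad f

print(np.linalg.norm(grad(x)))                # ~0: gradient vanishes at the minimizer
```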

Covariance matrices. Covariance matrices $\Sigma = \mathbb{E}[(X - \mu)(X-\mu)^T]$ are symmetric positive semidefinite. In practice, sample covariance matrices may be PSD but not PD (when $n < p$: fewer samples than features). Cholesky fails; use SVD or regularize: $\Sigma_\epsilon = \Sigma + \epsilon I$ (ridge regularization), which is always SPD.
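
A small sketch of that failure mode and the ridge fix (dimensions chosen so the sample covariance is rank-deficient; the value of $\epsilon$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 20, 50                                   # fewer samples than features
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False)                     # p x p, rank <= n - 1 < p

try:
    np.linalg.cholesky(S)                       # PSD but singular: typically not PD in floating point
    print("Cholesky happened to succeed (roundoff kept all pivots positive)")
except np.linalg.LinAlgError:
    print("Cholesky failed: S is not positive definite")

eps = 1e-6
L = np.linalg.cholesky(S + eps * np.eye(p))     # ridge-regularized covariance is SPD
print(np.allclose(L @ L.T, S + eps * np.eye(p)))   # True
```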

Sparse matrices. For large sparse $A$ (e.g., finite element discretizations, graph Laplacians), dense factorizations are prohibitive. Sparse LU (e.g., UMFPACK), sparse Cholesky (CHOLMOD), and iterative solvers (conjugate gradient for SPD, GMRES for general) are used instead. Fill-in (new nonzeros in $L, U$) is managed by ordering heuristics (AMD, METIS).
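
A minimal SciPy sketch contrasting a sparse direct factorization with an iterative solve on a 1D Laplacian (scipy.sparse.linalg.splu uses SuperLU rather than UMFPACK; the problem size is arbitrary, and the unpreconditioned CG is slow here because the Laplacian is ill-conditioned):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu, cg

# Sparse SPD test matrix: the tridiagonal 1D Laplacian.
n = 10_000
A = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format='csc')
b = np.ones(n)

# Sparse direct: LU with a fill-reducing column ordering (SuperLU).
x_direct = splu(A).solve(b)

# Iterative: conjugate gradient (A is SPD), no factorization or fill-in at all.
x_cg, info = cg(A, b, maxiter=5 * n)
print(info, np.linalg.norm(A @ x_cg - b) / np.linalg.norm(b))
print(np.linalg.norm(x_direct - x_cg) / np.linalg.norm(x_direct))
```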

