Singular Value Decomposition
The singular value decomposition (SVD) is arguably the most important matrix factorization in applied mathematics and data science. Unlike eigendecomposition, it applies to every matrix - not just square or diagonalizable ones - and it reveals the intrinsic geometry of a linear map stripped of all arbitrary basis choices.
The SVD Theorem
Theorem (Existence of SVD). Let $A \in \mathbb{R}^{m \times n}$ (or $\mathbb{C}^{m \times n}$). Then there exist:
- an orthogonal (unitary) matrix $U \in \mathbb{R}^{m \times m}$,
- an orthogonal (unitary) matrix $V \in \mathbb{R}^{n \times n}$,
- a matrix $\Sigma \in \mathbb{R}^{m \times n}$ with nonnegative entries on the main diagonal and zeros elsewhere,
such that
$$\boxed{A = U \Sigma V^T.}$$
The diagonal entries $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0$ are the singular values of $A$. The columns of $U$ are the left singular vectors; the columns of $V$ are the right singular vectors.
The SVD exists for every real or complex matrix, of any shape and any rank. This is the key advantage over eigendecomposition.
Geometric Interpretation
Every linear map $A: \mathbb{R}^n \to \mathbb{R}^m$ decomposes into three steps:
```
        V^T                Sigma                 U
R^n ----------> R^n ----------------> R^m ----------------> R^m
 x  |------>  V^T x  |------>  Sigma V^T x  |------>  U Sigma V^T x = Ax
    (rotate/reflect    (axis-aligned         (rotate/reflect
     in R^n)            scaling)              in R^m)
```
- $V^T$: orthogonal transformation (rotation/reflection) in the domain $\mathbb{R}^n$. Rotates the input into a special coordinate system aligned with the right singular vectors.
- $\Sigma$: axis-aligned scaling. The $i$-th coordinate is scaled by $\sigma_i$. Coordinates beyond $\min(m,n)$ are dropped (if $m < n$) or padded with zero (if $m > n$).
- $U$: orthogonal transformation in the codomain $\mathbb{R}^m$. Rotates the output into the final coordinate system.
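A quick numerical check of this three-step picture, as a minimal sketch in NumPy (the matrix `A` here is just an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((m, n))        # arbitrary example matrix

U, s, Vt = np.linalg.svd(A)            # full SVD: U is m x m, Vt is n x n
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(s)             # embed the singular values in an m x n matrix

x = rng.standard_normal(n)
step1 = Vt @ x                         # rotate/reflect in the domain
step2 = Sigma @ step1                  # axis-aligned scaling (pad or drop coordinates)
step3 = U @ step2                      # rotate/reflect in the codomain

assert np.allclose(step3, A @ x)       # U Sigma V^T x equals A x
```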
The unit sphere becomes an ellipsoid. The image of the unit sphere $S^{n-1} = \{x : \|x\| = 1\}$ under $A$ is an ellipsoid with semi-axes $\sigma_1 u_1, \sigma_2 u_2, \ldots, \sigma_r u_r$ (where $r = \mathrm{rank}(A)$ and $u_i$ are the left singular vectors).
(Diagram: the unit sphere in $\mathbb{R}^n$, stretched by factors $\sigma_1, \sigma_2, \ldots$ under $A = U\Sigma V^T$, becomes an ellipsoid in $\mathbb{R}^m$ with semi-axes of lengths $\sigma_1, \sigma_2, \ldots$)
Relationship to Eigenvalues
The singular values of $A$ are the square roots of the eigenvalues of the symmetric positive semidefinite matrices $A^T A$ and $A A^T$.
Theorem. $\sigma_i^2$ are the nonzero eigenvalues of $A^T A$ (and of $A A^T$, which has the same nonzero eigenvalues). The right singular vectors $v_i$ are eigenvectors of $A^T A$; the left singular vectors $u_i$ are eigenvectors of $A A^T$.
Proof. From $A = U\Sigma V^T$:
$$A^T A = V \Sigma^T U^T U \Sigma V^T = V (\Sigma^T \Sigma) V^T.$$
Since $V$ is orthogonal, this is the eigendecomposition of $A^T A$: columns of $V$ are eigenvectors, diagonal entries of $\Sigma^T \Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$ are eigenvalues. Similarly $AA^T = U \Sigma \Sigma^T U^T$, so columns of $U$ are eigenvectors of $AA^T$. $\square$
Consistency check: $A^T A$ is $n \times n$ and symmetric PSD; $AA^T$ is $m \times m$ and symmetric PSD. Their nonzero eigenvalues coincide (this follows from the general fact that $AB$ and $BA$ have the same nonzero eigenvalues). The number of positive eigenvalues equals $\mathrm{rank}(A) = r$.
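A small numerical sanity check of this relationship (a sketch only; `A` is an arbitrary example matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigendecomposition of the symmetric PSD matrix A^T A, sorted in decreasing order.
evals, evecs = np.linalg.eigh(A.T @ A)
evals, evecs = evals[::-1], evecs[:, ::-1]

assert np.allclose(np.sqrt(evals), s)                      # sigma_i = sqrt(lambda_i(A^T A))
# Eigenvectors match the right singular vectors up to sign.
assert np.allclose(np.abs(evecs), np.abs(Vt.T), atol=1e-8)
```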
Computing the SVD
The standard computational path:
- Form $A^T A$ ($n \times n$ symmetric PSD matrix).
- Eigendecompose $A^T A = V D V^T$ (by spectral theorem, since $A^T A$ is symmetric). The eigenvalues $d_i = \sigma_i^2 \geq 0$, so $\sigma_i = \sqrt{d_i}$.
- Compute left singular vectors: for each $\sigma_i > 0$, set $u_i = \frac{1}{\sigma_i} A v_i$. These are automatically orthonormal, since $u_i^T u_j = \frac{1}{\sigma_i \sigma_j} v_i^T A^T A v_j = \frac{\sigma_j}{\sigma_i} v_i^T v_j = \delta_{ij}$.
- Extend $\{u_i\}$ and $\{v_i\}$ to full orthonormal bases of $\mathbb{R}^m$ and $\mathbb{R}^n$ respectively (for the “full” SVD).
Warning. Forming $A^T A$ explicitly squares the condition number ($\kappa(A^T A) = \kappa(A)^2$), losing numerical precision. Production algorithms (e.g., LAPACK’s dgesvd and dgesdd) use Golub–Kahan bidiagonalization followed by an iterative reduction of the bidiagonal matrix (QR algorithm or divide and conquer) - achieving the SVD without ever forming $A^T A$.
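The eigendecomposition route above can be written out directly. This is a didactic sketch only; as the warning notes, a library routine such as `numpy.linalg.svd` should be used in practice:

```python
import numpy as np

def svd_via_normal_equations(A):
    """Thin SVD of A (full column rank assumed) via eigendecomposition of A^T A.
    Didactic only: forming A^T A squares the condition number."""
    evals, V = np.linalg.eigh(A.T @ A)        # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]           # re-sort in descending order
    evals, V = evals[order], V[:, order]
    s = np.sqrt(np.clip(evals, 0.0, None))    # sigma_i = sqrt(lambda_i)
    U = (A @ V) / s                           # columns u_i = A v_i / sigma_i
    return U, s, V.T

A = np.random.default_rng(2).standard_normal((6, 4))
U, s, Vt = svd_via_normal_equations(A)
assert np.allclose(U @ np.diag(s) @ Vt, A)
assert np.allclose(s, np.linalg.svd(A, compute_uv=False))
```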
Economy (Thin) SVD
When $m \geq n$, most of $U$ is “wasted”: $\Sigma$ has zero rows at the bottom. The thin SVD keeps only the first $n$ columns of $U$:
$$A = \underbrace{U_n}_{m \times n}\, \underbrace{\hat{\Sigma}}_{n \times n}\, \underbrace{V^T}_{n \times n},$$
where $U_n$ has orthonormal columns (but is not square) and $\hat{\Sigma} = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$. The thin SVD is usually sufficient and saves storage/computation.
For rank-$r$ $A$ (with $r \leq \min(m,n)$), the compact SVD keeps only $r$ terms:
$$A = \sum_{i=1}^r \sigma_i u_i v_i^T.$$
This is the most economical representation.
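In NumPy-like libraries the thin SVD is requested with a flag; a small sketch (the example matrix is arbitrary):

```python
import numpy as np

A = np.random.default_rng(3).standard_normal((1000, 50))

# Full SVD would make U a 1000 x 1000 matrix - mostly "wasted" columns.
# Thin SVD: U is 1000 x 50, Sigma is 50 x 50, V^T is 50 x 50.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert U.shape == (1000, 50) and s.shape == (50,) and Vt.shape == (50, 50)
assert np.allclose(U @ np.diag(s) @ Vt, A)

# Compact SVD as a sum of r rank-one outer products.
r = np.linalg.matrix_rank(A)
A_compact = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))
assert np.allclose(A_compact, A)
```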
The Eckart–Young Theorem: Best Low-Rank Approximation
Theorem (Eckart–Young, 1936). Let $A \in \mathbb{R}^{m \times n}$ with SVD $A = U\Sigma V^T$ and singular values $\sigma_1 \geq \cdots \geq \sigma_r > 0$. For any $k \leq r$, the best rank-$k$ approximation of $A$ in the Frobenius norm (and in the spectral norm) is
$$A_k = \sum_{i=1}^k \sigma_i u_i v_i^T = U_k \Sigma_k V_k^T,$$
where $U_k, V_k$ are the first $k$ columns of $U, V$ and $\Sigma_k = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$. The approximation error is:
$$\|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}, \quad \|A - A_k\|_2 = \sigma_{k+1}.$$
Moreover, $A_k$ achieves the minimum among all rank-$k$ matrices $B$:
$$A_k = \arg\min_{\mathrm{rank}(B) \leq k} \|A - B\|_F = \arg\min_{\mathrm{rank}(B) \leq k} \|A - B\|_2.$$
Proof sketch (spectral norm version). For any rank-$k$ matrix $B$, $\ker(B)$ has dimension $\geq n-k$. The subspace $\mathrm{span}(v_1, \ldots, v_{k+1})$ has dimension $k+1$. By dimension counting, there exists a unit vector $w$ in $\ker(B) \cap \mathrm{span}(v_1, \ldots, v_{k+1})$. Then:
$$\|A - B\|_2 \geq \|(A-B)w\|_2 = \|Aw\|_2 \geq \sigma_{k+1}\|w\|_2 = \sigma_{k+1}.$$
(The last inequality holds because for any unit vector $w \in \mathrm{span}(v_1, \ldots, v_{k+1})$ we have $\|Aw\|^2 = \sum_{i \leq k+1} \sigma_i^2 (v_i^T w)^2 \geq \sigma_{k+1}^2$.) Since $\|A - A_k\|_2 = \sigma_{k+1}$ exactly, $A_k$ achieves the bound. $\square$
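A numerical illustration of the theorem, as a sketch: the truncated SVD is formed explicitly and its errors are checked against the singular values.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 20))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]        # best rank-k approximation

spectral_err = np.linalg.norm(A - A_k, 2)       # largest singular value of A - A_k
frob_err = np.linalg.norm(A - A_k, 'fro')

assert np.isclose(spectral_err, s[k])                       # sigma_{k+1}
assert np.isclose(frob_err, np.sqrt(np.sum(s[k:]**2)))      # sqrt of the tail energy
```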
Singular value “energy” diagram:
```
sigma_i
|
|*
| *
|  *
|   *
|    **
|      ***
|         *****
|              **************
+------------------------------------> i
  1  2  3  4  5  6  7  8  9
```
Rapid decay = low-rank structure (data compressible). Slow decay = no low-rank structure (e.g., noise).
Keeping the top $k$ singular values captures most of the “energy”:
$$\frac{\|A_k\|_F^2}{\|A\|_F^2} = \frac{\sigma_1^2 + \cdots + \sigma_k^2}{\sigma_1^2 + \cdots + \sigma_r^2}.$$
The Pseudoinverse
For a non-square or rank-deficient matrix, the ordinary inverse does not exist. The Moore–Penrose pseudoinverse provides a canonical substitute.
Definition. Given $A = U\Sigma V^T$, define
$$A^+ = V \Sigma^+ U^T,$$
where $\Sigma^+$ is obtained by replacing each nonzero diagonal entry $\sigma_i$ of $\Sigma$ with $1/\sigma_i$, and transposing.
Theorem (Pseudoinverse Characterization). $A^+$ is the unique matrix satisfying the four Moore–Penrose conditions:
- $A A^+ A = A$
- $A^+ A A^+ = A^+$
- $(A A^+)^T = A A^+$ (symmetric)
- $(A^+ A)^T = A^+ A$ (symmetric)
Application to least squares. The system $Ax = b$ (with $A \in \mathbb{R}^{m \times n}$, $m > n$) is generally overdetermined. The minimum-norm least squares solution is:
$$x^\ast = A^+ b = V \Sigma^+ U^T b.$$
This is the solution that minimizes $|Ax - b|^2$ and (among all minimizers) has minimum $|x|^2$.
Derivation. Write $b = U U^T b = \sum_i (u_i^T b) u_i$. Then
$$A^+ b = V\Sigma^+ U^T b = \sum_{i :\, \sigma_i > 0} \frac{u_i^T b}{\sigma_i} v_i.$$
Each term $\frac{u_i^T b}{\sigma_i} v_i$ is the projection of $b$ onto the $i$-th left singular direction, scaled by $1/\sigma_i$. Components of $b$ in $\ker(A^T) = \mathrm{span}(u_{r+1}, \ldots, u_m)$ are discarded (they contribute nothing to $Ax$).
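A minimal sketch of the least-squares use, comparing the explicit SVD formula with library routines (the example system is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((100, 3))          # overdetermined: m > n
b = rng.standard_normal(100)

# Pseudoinverse solution built directly from the thin SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ b) / s)             # sum_i (u_i^T b / sigma_i) v_i

# The same answer from library routines.
assert np.allclose(x_svd, np.linalg.pinv(A) @ b)
assert np.allclose(x_svd, np.linalg.lstsq(A, b, rcond=None)[0])
```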
Condition Number and Numerical Rank
Definition. The condition number of $A$ is
$$\kappa(A) = \|A\|_2 \|A^+\|_2 = \frac{\sigma_{\max}}{\sigma_{\min}} = \frac{\sigma_1}{\sigma_r}.$$
(For singular $A$, $\kappa(A) = \infty$.)
The condition number measures how much the relative error in $b$ is amplified in the solution $x$ of $Ax = b$:
$$\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \frac{\|\delta b\|}{\|b\|}.$$
A well-conditioned matrix ($\kappa$ close to 1) gives reliable solutions; an ill-conditioned matrix ($\kappa \gg 1$) magnifies errors catastrophically.
Numerical rank. In finite precision arithmetic, singular values below $\varepsilon_{\text{machine}} \cdot \sigma_1$ are numerically zero. The numerical rank is
$$r_{\varepsilon} = \#\{i : \sigma_i > \varepsilon \cdot \sigma_1\}.$$
A matrix is numerically rank-deficient if $\sigma_r / \sigma_1 < \varepsilon_{\text{machine}} \approx 10^{-16}$. In this regime, forming $A^{-1}$ or solving $Ax = b$ via LU is dangerous; use the pseudoinverse or truncated SVD instead.
Truncated SVD regularization. To solve an ill-conditioned system $Ax \approx b$, compute the truncated SVD solution:
$$x_k = \sum_{i=1}^k \frac{u_i^T b}{\sigma_i} v_i,$$
discarding terms where $\sigma_i$ is small (they would amplify noise). This is closely related to Tikhonov regularization (ridge regression), which damps small singular values smoothly via filter factors $\sigma_i/(\sigma_i^2 + \lambda)$ instead of truncating them.
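A sketch of numerical rank and truncated-SVD regularization; the ill-conditioned test matrix, noise level, and truncation threshold below are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(6)
# Build an ill-conditioned matrix with rapidly decaying singular values.
Q1, _ = np.linalg.qr(rng.standard_normal((50, 50)))
Q2, _ = np.linalg.qr(rng.standard_normal((50, 50)))
s_true = 10.0 ** -np.arange(50)                    # sigma_i = 10^{-i}
A = Q1 @ np.diag(s_true) @ Q2.T

b = A @ rng.standard_normal(50) + 1e-10 * rng.standard_normal(50)   # noisy right-hand side

U, s, Vt = np.linalg.svd(A)
numerical_rank = np.sum(s > np.finfo(float).eps * s[0])   # sigma_i above eps * sigma_1

k = np.sum(s > 1e-8 * s[0])                        # keep only well-resolved directions
x_k = Vt[:k].T @ ((U[:, :k].T @ b) / s[:k])        # truncated SVD solution
```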
Applications
PCA as SVD of the Data Matrix
Given a centred data matrix $X \in \mathbb{R}^{n \times p}$ ($n$ observations, $p$ features, each column has zero mean), compute $X = U\Sigma V^T$. Then:
- The principal components (PCs) are the columns of $V$ (the right singular vectors of $X$), which are the eigenvectors of the covariance matrix $X^T X / (n-1)$.
- The scores (projections of data onto PCs) are the columns of $U\Sigma = XV$ (or just the left singular vectors scaled by singular values).
- The variance explained by the $k$-th PC is $\sigma_k^2 / \sum_i \sigma_i^2$.
PCA is the SVD of $X$; the eigendecomposition of the covariance matrix $X^T X$ is equivalent but numerically less stable.
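A compact sketch of PCA via the SVD of the centred data matrix (the toy data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # toy data: n=200, p=5
X = X - X.mean(axis=0)                         # centre each feature (column)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

components = Vt                                # rows are the principal component directions
scores = U * s                                 # U Sigma, equivalently X @ Vt.T
explained_variance_ratio = s**2 / np.sum(s**2)

# Cross-check against the eigendecomposition of the covariance matrix.
evals = np.linalg.eigvalsh(X.T @ X / (X.shape[0] - 1))[::-1]
assert np.allclose(evals, s**2 / (X.shape[0] - 1))
```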
Image Compression
An $m \times n$ grayscale image is an $m \times n$ matrix $A$. The rank-$k$ approximation $A_k = U_k \Sigma_k V_k^T$ requires storing $k(m + n + 1)$ numbers instead of $mn$. For $k \ll \min(m,n)$, this is substantial compression. Quality degrades gracefully as $k$ decreases.
| $k$ | Result |
|---|---|
| $k = \mathrm{rank}(A)$ | Full image (no compression) |
| $k = 50$ | Hard to distinguish from the original |
| $k = 20$ | Visible blur but recognizable |
| $k = 5$ | Abstract impression |
| $k = 1$ | Best rank-1 approximation: $u_1 v_1^T$ (one outer product) |
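A sketch of rank-$k$ image compression; the synthetic `image` array below stands in for real grayscale data:

```python
import numpy as np

# Synthetic "image": any 2-D array of grayscale intensities works here.
y, x = np.mgrid[0:256, 0:256]
image = np.sin(x / 20.0) * np.cos(y / 30.0)
image += 0.1 * np.random.default_rng(8).standard_normal((256, 256))

U, s, Vt = np.linalg.svd(image, full_matrices=False)

for k in (1, 5, 20, 50):
    image_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]              # rank-k approximation
    storage = k * (image.shape[0] + image.shape[1] + 1)       # k(m + n + 1) numbers
    rel_err = np.linalg.norm(image - image_k, 'fro') / np.linalg.norm(image, 'fro')
    print(f"k={k:3d}  storage={storage:7d}  relative error={rel_err:.3f}")
```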
Latent Semantic Analysis (LSA)
In NLP, build a term-document matrix $A$ (rows = words, columns = documents, entry = TF-IDF weight). The rank-$k$ SVD reveals latent topics: $u_i$ represents a “concept” direction in word space, $v_i$ in document space, and $\sigma_i$ gives the importance of topic $i$.
Queries are matched against documents in the reduced rank-$k$ space, allowing for synonymy (words with similar meaning land near each other in concept space) and polysemy handling.
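A toy illustration of the idea; the tiny term-document matrix and query below are made up for the example:

```python
import numpy as np

# Term-document matrix: rows = terms (car, engine, road, poem, verse), columns = 4 documents.
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0, 2.0]])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_k = (np.diag(s[:k]) @ Vt[:k]).T           # documents in the k-dimensional concept space

query = np.array([1.0, 0.0, 1.0, 0.0, 0.0])    # query containing "car" and "road"
query_k = query @ U[:, :k]                      # project the query into the same space

similarity = docs_k @ query_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(query_k))
print(similarity)    # the first two documents (the "car" topic) score highest
```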
Collaborative Filtering
User-item rating matrices (e.g., Netflix ratings) are large, sparse, and low-rank in practice (users have latent preference dimensions). SVD (or its variants like SVD++) finds a low-rank factorization $A \approx U\Sigma V^T$ where $U$ encodes user preferences and $V^T$ encodes item features in a shared latent space. Missing ratings are predicted by the corresponding entries of $U\Sigma V^T$.
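Plain SVD needs a fully observed matrix, so real recommenders fit the factors by optimizing over observed entries only. The following is a deliberately naive sketch (fill missing ratings with each user's mean, then truncate):

```python
import numpy as np

rng = np.random.default_rng(9)
R = rng.integers(1, 6, size=(20, 10)).astype(float)    # toy ratings: 20 users x 10 items
observed = rng.random(R.shape) < 0.7                    # roughly 30% of ratings are "missing"
R_obs = np.where(observed, R, np.nan)

# Naive imputation: replace each user's missing ratings with that user's mean rating.
user_means = np.nanmean(R_obs, axis=1, keepdims=True)
R_filled = np.where(np.isnan(R_obs), user_means, R_obs)

# Rank-k factorization; R_hat predicts a rating for every user-item pair.
k = 3
U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

print(R_hat[0])    # predicted ratings for user 0, including previously missing items
```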
Formal Summary
| Concept | Formula | Notes |
|---|---|---|
| Full SVD | $A = U\Sigma V^T$ | $U \in \mathbb{R}^{m\times m}$, $V \in \mathbb{R}^{n \times n}$ orthogonal |
| Compact SVD | $A = U_r \Sigma_r V_r^T$ | Only $r = \mathrm{rank}(A)$ terms |
| Singular values | $\sigma_i = \sqrt{\lambda_i(A^T A)}$ | $\sigma_1 \geq \cdots \geq \sigma_r > 0$ |
| Best rank-$k$ approx | $A_k = U_k \Sigma_k V_k^T$ | Error $= \sigma_{k+1}$ (spectral), $\sqrt{\sum_{i>k}\sigma_i^2}$ (Frobenius) |
| Pseudoinverse | $A^+ = V\Sigma^+ U^T$ | Min-norm least squares solution |
| Condition number | $\kappa(A) = \sigma_1/\sigma_r$ | Ill-conditioned if $\kappa \gg 1$ |
| Rank | $r = \#\{i : \sigma_i > 0\}$ | Numerical rank uses threshold |
The SVD is computationally $O(mn^2)$ for $m \geq n$ (thin SVD), more for the full version. Randomized SVD algorithms (Halko, Martinsson, Tropp) can approximate the top-$k$ SVD in $O(mnk)$ time, enabling SVD of matrices too large to process otherwise.
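A minimal sketch of the randomized idea: a basic range finder without power iterations, in the spirit of Halko et al.; the function name, oversampling parameter, and test matrix are illustrative.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Approximate top-k SVD of A via a randomized range finder."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                     # orthonormal basis for range(A @ Omega)
    B = Q.T @ A                                        # small (k + oversample) x n matrix
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_b)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(10)
A = rng.standard_normal((2000, 30)) @ rng.standard_normal((30, 300))   # exactly rank 30
U, s, Vt = randomized_svd(A, k=20)
assert np.allclose(s, np.linalg.svd(A, compute_uv=False)[:20])
```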