Singular Value Decomposition
The singular value decomposition (SVD) is arguably the most important matrix factorization in applied mathematics and data science. Unlike eigendecomposition, it applies to every matrix - not just square or diagonalizable ones - and it reveals the intrinsic geometry of a linear map stripped of all arbitrary basis choices.
The SVD Theorem
Theorem (Existence of SVD). Let $A \in \mathbb{R}^{m \times n}$ (or $\mathbb{C}^{m \times n}$). Then there exist:
- an orthogonal (unitary) matrix $U \in \mathbb{R}^{m \times m}$,
- an orthogonal (unitary) matrix $V \in \mathbb{R}^{n \times n}$,
- a matrix $\Sigma \in \mathbb{R}^{m \times n}$ with nonnegative entries on the main diagonal and zeros elsewhere,
such that
$$\boxed{A = U \Sigma V^T.}$$
The diagonal entries $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0$ are the singular values of $A$. The columns of $U$ are the left singular vectors; the columns of $V$ are the right singular vectors.
The SVD exists for every real or complex matrix, of any shape and any rank. This is the key advantage over eigendecomposition.
Geometric Interpretation
Every linear map $A: \mathbb{R}^n \to \mathbb{R}^m$ decomposes into three steps:
```
        V^T                Sigma                 U
R^n ----------> R^n ----------------> R^m ----------------> R^m
 x  |------>  V^T x  |------>  Sigma V^T x  |------>  U Sigma V^T x = Ax
    (rotate/reflect    (axis-aligned         (rotate/reflect
     in R^n)            scaling)              in R^m)
```
- $V^T$: orthogonal transformation (rotation/reflection) in the domain $\mathbb{R}^n$. Rotates the input into a special coordinate system aligned with the right singular vectors.
- $\Sigma$: axis-aligned scaling. The $i$-th coordinate is scaled by $\sigma_i$. Coordinates beyond $\min(m,n)$ are dropped (if $m < n$) or padded with zero (if $m > n$).
- $U$: orthogonal transformation in the codomain $\mathbb{R}^m$. Rotates the output into the final coordinate system.
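A quick numerical check of this three-step picture, as a minimal sketch in NumPy (the matrix `A` here is just an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((m, n))        # arbitrary example matrix

U, s, Vt = np.linalg.svd(A)            # full SVD: U is m x m, Vt is n x n
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(s)             # embed the singular values in an m x n matrix

x = rng.standard_normal(n)
step1 = Vt @ x                         # rotate/reflect in the domain
step2 = Sigma @ step1                  # axis-aligned scaling (pad or drop coordinates)
step3 = U @ step2                      # rotate/reflect in the codomain

assert np.allclose(step3, A @ x)       # U Sigma V^T x equals A x
```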
The unit sphere becomes an ellipsoid. The image of the unit sphere $S^{n-1} = \{x : \|x\| = 1\}$ under $A$ is an ellipsoid with semi-axes $\sigma_1 u_1, \sigma_2 u_2, \ldots, \sigma_r u_r$ (where $r = \mathrm{rank}(A)$ and $u_i$ are the left singular vectors).
(Diagram: the unit sphere in $\mathbb{R}^n$, stretched by factors $\sigma_1, \sigma_2, \ldots$ under $A = U\Sigma V^T$, becomes an ellipsoid in $\mathbb{R}^m$ with semi-axes of lengths $\sigma_1, \sigma_2, \ldots$)
Relationship to Eigenvalues
The singular values of $A$ are the square roots of the eigenvalues of the symmetric positive semidefinite matrices $A^T A$ and $A A^T$.
Theorem. $\sigma_i^2$ are the nonzero eigenvalues of $A^T A$ (and of $A A^T$, which has the same nonzero eigenvalues). The right singular vectors $v_i$ are eigenvectors of $A^T A$; the left singular vectors $u_i$ are eigenvectors of $A A^T$.
Proof. From $A = U\Sigma V^T$:
$$A^T A = V \Sigma^T U^T U \Sigma V^T = V (\Sigma^T \Sigma) V^T.$$
Since $V$ is orthogonal, this is the eigendecomposition of $A^T A$: columns of $V$ are eigenvectors, diagonal entries of $\Sigma^T \Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$ are eigenvalues. Similarly $AA^T = U \Sigma \Sigma^T U^T$, so columns of $U$ are eigenvectors of $AA^T$. $\square$
Consistency check: $A^T A$ is $n \times n$ and symmetric PSD; $AA^T$ is $m \times m$ and symmetric PSD. Their nonzero eigenvalues coincide (this follows from the general fact that $AB$ and $BA$ have the same nonzero eigenvalues). The number of positive eigenvalues equals $\mathrm{rank}(A) = r$.
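A small numerical sanity check of this relationship (a sketch only; `A` is an arbitrary example matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigendecomposition of the symmetric PSD matrix A^T A, sorted in decreasing order.
evals, evecs = np.linalg.eigh(A.T @ A)
evals, evecs = evals[::-1], evecs[:, ::-1]

assert np.allclose(np.sqrt(evals), s)                      # sigma_i = sqrt(lambda_i(A^T A))
# Eigenvectors match the right singular vectors up to sign.
assert np.allclose(np.abs(evecs), np.abs(Vt.T), atol=1e-8)
```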
Computing the SVD
The standard computational path:
- Form $A^T A$ ($n \times n$ symmetric PSD matrix).
- Eigendecompose $A^T A = V D V^T$ (by spectral theorem, since $A^T A$ is symmetric). The eigenvalues $d_i = \sigma_i^2 \geq 0$, so $\sigma_i = \sqrt{d_i}$.
- Compute left singular vectors: for each $\sigma_i > 0$, set $u_i = \frac{1}{\sigma_i} A v_i$. These are automatically orthonormal, since $u_i^T u_j = \frac{1}{\sigma_i \sigma_j} v_i^T A^T A v_j = \frac{\sigma_j}{\sigma_i} v_i^T v_j = \delta_{ij}$.
- Extend $\{u_i\}$ and $\{v_i\}$ to full orthonormal bases of $\mathbb{R}^m$ and $\mathbb{R}^n$ respectively (for the “full” SVD).
Warning. Forming $A^T A$ explicitly squares the condition number ($\kappa(A^T A) = \kappa(A)^2$), losing numerical precision. Production algorithms (e.g., LAPACK’s dgesvd and dgesdd) use Golub–Kahan bidiagonalization followed by an iterative reduction of the bidiagonal matrix (QR algorithm or divide and conquer) - achieving the SVD without ever forming $A^T A$.
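The eigendecomposition route above can be written out directly. This is a didactic sketch only; as the warning notes, a library routine such as `numpy.linalg.svd` should be used in practice:

```python
import numpy as np

def svd_via_normal_equations(A):
    """Thin SVD of A (full column rank assumed) via eigendecomposition of A^T A.
    Didactic only: forming A^T A squares the condition number."""
    evals, V = np.linalg.eigh(A.T @ A)        # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]           # re-sort in descending order
    evals, V = evals[order], V[:, order]
    s = np.sqrt(np.clip(evals, 0.0, None))    # sigma_i = sqrt(lambda_i)
    U = (A @ V) / s                           # columns u_i = A v_i / sigma_i
    return U, s, V.T

A = np.random.default_rng(2).standard_normal((6, 4))
U, s, Vt = svd_via_normal_equations(A)
assert np.allclose(U @ np.diag(s) @ Vt, A)
assert np.allclose(s, np.linalg.svd(A, compute_uv=False))
```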
Economy (Thin) SVD
When $m \geq n$, most of $U$ is “wasted”: $\Sigma$ has zero rows at the bottom. The thin SVD keeps only the first $n$ columns of $U$:
$$A = \underbrace{U_n}_{m \times n}\, \underbrace{\hat{\Sigma}}_{n \times n}\, \underbrace{V^T}_{n \times n},$$
where $U_n$ has orthonormal columns (but is not square) and $\hat{\Sigma} = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$. The thin SVD is usually sufficient and saves storage/computation.
For rank-$r$ $A$ (with $r \leq \min(m,n)$), the compact SVD keeps only $r$ terms:
$$A = \sum_{i=1}^r \sigma_i u_i v_i^T.$$
This is the most economical representation.
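In NumPy-like libraries the thin SVD is requested with a flag; a small sketch (the example matrix is arbitrary):

```python
import numpy as np

A = np.random.default_rng(3).standard_normal((1000, 50))

# Full SVD would make U a 1000 x 1000 matrix - mostly "wasted" columns.
# Thin SVD: U is 1000 x 50, Sigma is 50 x 50, V^T is 50 x 50.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert U.shape == (1000, 50) and s.shape == (50,) and Vt.shape == (50, 50)
assert np.allclose(U @ np.diag(s) @ Vt, A)

# Compact SVD as a sum of r rank-one outer products.
r = np.linalg.matrix_rank(A)
A_compact = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))
assert np.allclose(A_compact, A)
```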
The Eckart–Young Theorem: Best Low-Rank Approximation
Theorem (Eckart–Young, 1936). Let $A \in \mathbb{R}^{m \times n}$ with SVD $A = U\Sigma V^T$ and singular values $\sigma_1 \geq \cdots \geq \sigma_r > 0$. For any $k \leq r$, the best rank-$k$ approximation of $A$ in the Frobenius norm (and in the spectral norm) is
$$A_k = \sum_{i=1}^k \sigma_i u_i v_i^T = U_k \Sigma_k V_k^T,$$
where $U_k, V_k$ are the first $k$ columns of $U, V$ and $\Sigma_k = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$. The approximation error is:
$$\|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}, \quad \|A - A_k\|_2 = \sigma_{k+1}.$$
Moreover, $A_k$ achieves the minimum among all rank-$k$ matrices $B$:
$$A_k = \arg\min_{\mathrm{rank}(B) \leq k} \|A - B\|_F = \arg\min_{\mathrm{rank}(B) \leq k} \|A - B\|_2.$$
Proof sketch (spectral norm version). For any rank-$k$ matrix $B$, $\ker(B)$ has dimension $\geq n-k$. The subspace $\mathrm{span}(v_1, \ldots, v_{k+1})$ has dimension $k+1$. By dimension counting, there exists a unit vector $w$ in $\ker(B) \cap \mathrm{span}(v_1, \ldots, v_{k+1})$. Then:
$$\|A - B\|_2 \geq \|(A-B)w\|_2 = \|Aw\|_2 \geq \sigma_{k+1}\|w\|_2 = \sigma_{k+1}.$$
(The last inequality holds because for any unit vector $w \in \mathrm{span}(v_1, \ldots, v_{k+1})$ we have $\|Aw\|^2 = \sum_{i \leq k+1} \sigma_i^2 (v_i^T w)^2 \geq \sigma_{k+1}^2$.) Since $\|A - A_k\|_2 = \sigma_{k+1}$ exactly, $A_k$ achieves the bound. $\square$
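A numerical illustration of the theorem, as a sketch: the truncated SVD is formed explicitly and its errors are checked against the singular values.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 20))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]        # best rank-k approximation

spectral_err = np.linalg.norm(A - A_k, 2)       # largest singular value of A - A_k
frob_err = np.linalg.norm(A - A_k, 'fro')

assert np.isclose(spectral_err, s[k])                       # sigma_{k+1}
assert np.isclose(frob_err, np.sqrt(np.sum(s[k:]**2)))      # sqrt of the tail energy
```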
Singular value “energy” diagram:
```
sigma_i
|
|*
| *
|  *
|   *
|    **
|      ***
|         *****
|              **************
+------------------------------------> i
  1  2  3  4  5  6  7  8  9
```
Rapid decay = low-rank structure (data compressible). Slow decay = no low-rank structure (e.g., noise).
Keeping the top $k$ singular values captures most of the “energy”:
$$\frac{\|A_k\|_F^2}{\|A\|_F^2} = \frac{\sigma_1^2 + \cdots + \sigma_k^2}{\sigma_1^2 + \cdots + \sigma_r^2}.$$
The Pseudoinverse
For a non-square or rank-deficient matrix, the ordinary inverse does not exist. The Moore–Penrose pseudoinverse provides a canonical substitute.
Definition. Given $A = U\Sigma V^T$, define
$$A^+ = V \Sigma^+ U^T,$$
where $\Sigma^+$ is obtained by replacing each nonzero diagonal entry $\sigma_i$ of $\Sigma$ with $1/\sigma_i$, and transposing.
Theorem (Pseudoinverse Characterization). $A^+$ is the unique matrix satisfying the four Moore–Penrose conditions:
- $A A^+ A = A$
- $A^+ A A^+ = A^+$
- $(A A^+)^T = A A^+$ (symmetric)
- $(A^+ A)^T = A^+ A$ (symmetric)
Application to least squares. The system $Ax = b$ (with $A \in \mathbb{R}^{m \times n}$, $m > n$) is generally overdetermined. The minimum-norm least squares solution is:
$$x^\ast = A^+ b = V \Sigma^+ U^T b.$$
This is the solution that minimizes $|Ax - b|^2$ and (among all minimizers) has minimum $|x|^2$.
Derivation. Write $b = U U^T b = \sum_i (u_i^T b) u_i$. Then
$$A^+ b = V\Sigma^+ U^T b = \sum_{i :\, \sigma_i > 0} \frac{u_i^T b}{\sigma_i} v_i.$$
Each term $\frac{u_i^T b}{\sigma_i} v_i$ is the projection of $b$ onto the $i$-th left singular direction, scaled by $1/\sigma_i$. Components of $b$ in $\ker(A^T) = \mathrm{span}(u_{r+1}, \ldots, u_m)$ are discarded (they contribute nothing to $Ax$).
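A minimal sketch of the least-squares use, comparing the explicit SVD formula with library routines (the example system is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((100, 3))          # overdetermined: m > n
b = rng.standard_normal(100)

# Pseudoinverse solution built directly from the thin SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ b) / s)             # sum_i (u_i^T b / sigma_i) v_i

# The same answer from library routines.
assert np.allclose(x_svd, np.linalg.pinv(A) @ b)
assert np.allclose(x_svd, np.linalg.lstsq(A, b, rcond=None)[0])
```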
Condition Number and Numerical Rank
Definition. The condition number of $A$ is
$$\kappa(A) = \|A\|_2 \|A^+\|_2 = \frac{\sigma_{\max}}{\sigma_{\min}} = \frac{\sigma_1}{\sigma_r}.$$
(For singular $A$, $\kappa(A) = \infty$.)
The condition number measures how much the relative error in $b$ is amplified in the solution $x$ of $Ax = b$:
$$\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \frac{\|\delta b\|}{\|b\|}.$$
A well-conditioned matrix ($\kappa$ close to 1) gives reliable solutions; an ill-conditioned matrix ($\kappa \gg 1$) magnifies errors catastrophically.
Numerical rank. In finite precision arithmetic, singular values below $\varepsilon_{\text{machine}} \cdot \sigma_1$ are numerically zero. The numerical rank is
$$r_{\varepsilon} = \#\{i : \sigma_i > \varepsilon \cdot \sigma_1\}.$$
A matrix is numerically rank-deficient if $\sigma_r / \sigma_1 < \varepsilon_{\text{machine}} \approx 10^{-16}$. In this regime, forming $A^{-1}$ or solving $Ax = b$ via LU is dangerous; use the pseudoinverse or truncated SVD instead.
Truncated SVD regularization. To solve an ill-conditioned system $Ax \approx b$, compute the truncated SVD solution:
$$x_k = \sum_{i=1}^k \frac{u_i^T b}{\sigma_i} v_i,$$
discarding terms where $\sigma_i$ is small (they would amplify noise). This is closely related to Tikhonov regularization (ridge regression), which damps small singular values smoothly via filter factors $\sigma_i/(\sigma_i^2 + \lambda)$ instead of truncating them.
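A sketch of numerical rank and truncated-SVD regularization; the ill-conditioned test matrix, noise level, and truncation threshold below are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(6)
# Build an ill-conditioned matrix with rapidly decaying singular values.
Q1, _ = np.linalg.qr(rng.standard_normal((50, 50)))
Q2, _ = np.linalg.qr(rng.standard_normal((50, 50)))
s_true = 10.0 ** -np.arange(50)                    # sigma_i = 10^{-i}
A = Q1 @ np.diag(s_true) @ Q2.T

b = A @ rng.standard_normal(50) + 1e-10 * rng.standard_normal(50)   # noisy right-hand side

U, s, Vt = np.linalg.svd(A)
numerical_rank = np.sum(s > np.finfo(float).eps * s[0])   # sigma_i above eps * sigma_1

k = np.sum(s > 1e-8 * s[0])                        # keep only well-resolved directions
x_k = Vt[:k].T @ ((U[:, :k].T @ b) / s[:k])        # truncated SVD solution
```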
Applications
PCA as SVD of the Data Matrix
Given a centred data matrix $X \in \mathbb{R}^{n \times p}$ ($n$ observations, $p$ features, each column has zero mean), compute $X = U\Sigma V^T$. Then:
- The principal components (PCs) are the columns of $V$ (the right singular vectors of $X$), which are the eigenvectors of the covariance matrix $X^T X / (n-1)$.
- The scores (projections of data onto PCs) are the columns of $U\Sigma = XV$ (or just the left singular vectors scaled by singular values).
- The variance explained by the $k$-th PC is $\sigma_k^2 / \sum_i \sigma_i^2$.
PCA is the SVD of $X$; the eigendecomposition of the covariance matrix $X^T X$ is equivalent but numerically less stable.
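A compact sketch of PCA via the SVD of the centred data matrix (the toy data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # toy data: n=200, p=5
X = X - X.mean(axis=0)                         # centre each feature (column)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

components = Vt                                # rows are the principal component directions
scores = U * s                                 # U Sigma, equivalently X @ Vt.T
explained_variance_ratio = s**2 / np.sum(s**2)

# Cross-check against the eigendecomposition of the covariance matrix.
evals = np.linalg.eigvalsh(X.T @ X / (X.shape[0] - 1))[::-1]
assert np.allclose(evals, s**2 / (X.shape[0] - 1))
```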
Image Compression
An $m \times n$ grayscale image is an $m \times n$ matrix $A$. The rank-$k$ approximation $A_k = U_k \Sigma_k V_k^T$ requires storing $k(m + n + 1)$ numbers instead of $mn$. For $k \ll \min(m,n)$, this is substantial compression. Quality degrades gracefully as $k$ decreases.
| $k$ | Result |
|---|---|
| $k = \mathrm{rank}(A)$ | Full image (no compression) |
| $k = 50$ | Hard to distinguish from the original |
| $k = 20$ | Visible blur but recognizable |
| $k = 5$ | Abstract impression |
| $k = 1$ | Best rank-1 approximation: $u_1 v_1^T$ (one outer product) |
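A sketch of rank-$k$ image compression; the synthetic `image` array below stands in for real grayscale data:

```python
import numpy as np

# Synthetic "image": any 2-D array of grayscale intensities works here.
y, x = np.mgrid[0:256, 0:256]
image = np.sin(x / 20.0) * np.cos(y / 30.0)
image += 0.1 * np.random.default_rng(8).standard_normal((256, 256))

U, s, Vt = np.linalg.svd(image, full_matrices=False)

for k in (1, 5, 20, 50):
    image_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]              # rank-k approximation
    storage = k * (image.shape[0] + image.shape[1] + 1)       # k(m + n + 1) numbers
    rel_err = np.linalg.norm(image - image_k, 'fro') / np.linalg.norm(image, 'fro')
    print(f"k={k:3d}  storage={storage:7d}  relative error={rel_err:.3f}")
```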
Latent Semantic Analysis (LSA)
In NLP, build a term-document matrix $A$ (rows = words, columns = documents, entry = TF-IDF weight). The rank-$k$ SVD reveals latent topics: $u_i$ represents a “concept” direction in word space, $v_i$ in document space, and $\sigma_i$ gives the importance of topic $i$.
Queries are matched against documents in the reduced rank-$k$ space, allowing for synonymy (words with similar meaning land near each other in concept space) and polysemy handling.
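A toy illustration of the idea; the tiny term-document matrix and query below are made up for the example:

```python
import numpy as np

# Term-document matrix: rows = terms (car, engine, road, poem, verse), columns = 4 documents.
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0, 2.0]])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_k = (np.diag(s[:k]) @ Vt[:k]).T           # documents in the k-dimensional concept space

query = np.array([1.0, 0.0, 1.0, 0.0, 0.0])    # query containing "car" and "road"
query_k = query @ U[:, :k]                      # project the query into the same space

similarity = docs_k @ query_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(query_k))
print(similarity)    # the first two documents (the "car" topic) score highest
```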
Collaborative Filtering
User-item rating matrices (e.g., Netflix ratings) are large, sparse, and low-rank in practice (users have latent preference dimensions). SVD (or its variants like SVD++) finds a low-rank factorization $A \approx U\Sigma V^T$ where $U$ encodes user preferences and $V^T$ encodes item features in a shared latent space. Missing ratings are predicted by the corresponding entries of $U\Sigma V^T$.
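Plain SVD needs a fully observed matrix, so real recommenders fit the factors by optimizing over observed entries only. The following is a deliberately naive sketch (fill missing ratings with each user's mean, then truncate):

```python
import numpy as np

rng = np.random.default_rng(9)
R = rng.integers(1, 6, size=(20, 10)).astype(float)    # toy ratings: 20 users x 10 items
observed = rng.random(R.shape) < 0.7                    # roughly 30% of ratings are "missing"
R_obs = np.where(observed, R, np.nan)

# Naive imputation: replace each user's missing ratings with that user's mean rating.
user_means = np.nanmean(R_obs, axis=1, keepdims=True)
R_filled = np.where(np.isnan(R_obs), user_means, R_obs)

# Rank-k factorization; R_hat predicts a rating for every user-item pair.
k = 3
U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

print(R_hat[0])    # predicted ratings for user 0, including previously missing items
```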
Formal Summary
| Concept | Formula | Notes |
|---|---|---|
| Full SVD | $A = U\Sigma V^T$ | $U \in \mathbb{R}^{m\times m}$, $V \in \mathbb{R}^{n \times n}$ orthogonal |
| Compact SVD | $A = U_r \Sigma_r V_r^T$ | Only $r = \mathrm{rank}(A)$ terms |
| Singular values | $\sigma_i = \sqrt{\lambda_i(A^T A)}$ | $\sigma_1 \geq \cdots \geq \sigma_r > 0$ |
| Best rank-$k$ approx | $A_k = U_k \Sigma_k V_k^T$ | Error $= \sigma_{k+1}$ (spectral), $\sqrt{\sum_{i>k}\sigma_i^2}$ (Frobenius) |
| Pseudoinverse | $A^+ = V\Sigma^+ U^T$ | Min-norm least squares solution |
| Condition number | $\kappa(A) = \sigma_1/\sigma_r$ | Ill-conditioned if $\kappa \gg 1$ |
| Rank | $r = \#\{i : \sigma_i > 0\}$ | Numerical rank uses threshold |
The SVD is computationally $O(mn^2)$ for $m \geq n$ (thin SVD), more for the full version. Randomized SVD algorithms (Halko, Martinsson, Tropp) can approximate the top-$k$ SVD in $O(mnk)$ time, enabling SVD of matrices too large to process otherwise.
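A minimal sketch of the randomized idea: a basic range finder without power iterations, in the spirit of Halko et al.; the function name, oversampling parameter, and test matrix are illustrative.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Approximate top-k SVD of A via a randomized range finder."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                     # orthonormal basis for range(A @ Omega)
    B = Q.T @ A                                        # small (k + oversample) x n matrix
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_b)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(10)
A = rng.standard_normal((2000, 30)) @ rng.standard_normal((30, 300))   # exactly rank 30
U, s, Vt = randomized_svd(A, k=20)
assert np.allclose(s, np.linalg.svd(A, compute_uv=False)[:20])
```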