Singular Value Decomposition - The Skeleton of Every Matrix // Megha Bose

Helpful context:

The singular value decomposition (SVD) is arguably the most important matrix factorization in applied mathematics and data science. Unlike eigendecomposition, it applies to every matrix - not just square or diagonalizable ones - and it reveals the intrinsic geometry of a linear map stripped of all arbitrary basis choices.

The SVD Theorem

Theorem (Existence of SVD). Let $A \in \mathbb{R}^{m \times n}$ (or $\mathbb{C}^{m \times n}$). Then there exist:

an orthogonal (unitary) matrix $U \in \mathbb{R}^{m \times m}$,
an orthogonal (unitary) matrix $V \in \mathbb{R}^{n \times n}$,
a matrix $\Sigma \in \mathbb{R}^{m \times n}$ with nonneg entries on the main diagonal and zeros elsewhere,

such that

$$A = U \Sigma V^T.$$

The diagonal entries $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0$ are the singular values of $A$. The columns of $U$ are the left singular vectors; the columns of $V$ are the right singular vectors.

The SVD always exists, over any field, for any shape of matrix. This is the key advantage over eigendecomposition.

Why SVD, not something else? Any matrix $A$ has infinitely many factorizations: $A = BC$ for any invertible $C$ with $B = AC^{-1}$. SVD is special because it simultaneously diagonalizes both $A^T A$ and $A A^T$ using the same singular values. Eigendecomposition requires a square matrix and may not exist over the reals. LU factorization reflects elimination order, not intrinsic structure. QR reflects a specific basis choice. SVD alone extracts the intrinsic geometry: the input directions (columns of $V$) that the matrix stretches, the output directions (columns of $U$) it stretches into, and how much stretching happens (singular values). These are not choices made by the analyst - they are properties of the matrix itself.

Another way to see it: $A^T A$ is symmetric positive semidefinite, so it has a real eigendecomposition $A^T A = V \Lambda V^T$. The columns of $V$ are the “natural” input directions for $A$ - the orthonormal basis in which $A$’s action is simplest. SVD is what you get when you ask: what is the most natural coordinate system for understanding what $A$ does?

Geometric Interpretation

Every linear map $A: \mathbb{R}^n \to \mathbb{R}^m$ decomposes into three steps:

$V^T$: orthogonal transformation (rotation/reflection) in the domain $\mathbb{R}^n$. Rotates the input into a special coordinate system aligned with the right singular vectors.
$\Sigma$: axis-aligned scaling. The $i$-th coordinate is scaled by $\sigma_i$. Coordinates beyond $\min(m,n)$ are dropped (if $m < n$) or padded with zero (if $m > n$).
$U$: orthogonal transformation in the codomain $\mathbb{R}^m$. Rotates the output into the final coordinate system.

The unit sphere becomes an ellipsoid. The image of the unit sphere $S^{n-1} = \{x : |x| = 1\}$ under $A$ is an ellipsoid with semi-axes $\sigma_1 u_1, \sigma_2 u_2, \ldots, \sigma_r u_r$ (where $r = \mathrm{rank}(A)$ and $u_i$ are the left singular vectors).

Relationship to Eigenvalues

The singular values of $A$ are the square roots of the eigenvalues of the symmetric positive semidefinite matrices $A^T A$ and $A A^T$.

Theorem. $\sigma_i^2$ are the nonzero eigenvalues of $A^T A$ (and of $A A^T$, which has the same nonzero eigenvalues). The right singular vectors $v_i$ are eigenvectors of $A^T A$; the left singular vectors $u_i$ are eigenvectors of $A A^T$.

Proof. From $A = U\Sigma V^T$:

$$A^T A = V \Sigma^T U^T U \Sigma V^T = V (\Sigma^T \Sigma) V^T.$$

Since $V$ is orthogonal, this is the eigendecomposition of $A^T A$: columns of $V$ are eigenvectors, diagonal entries of $\Sigma^T \Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$ are eigenvalues. Similarly $AA^T = U \Sigma \Sigma^T U^T$, so columns of $U$ are eigenvectors of $AA^T$. $\square$

Consistency check: $A^T A$ is $n \times n$ and symmetric PSD; $AA^T$ is $m \times m$ and symmetric PSD. Their nonzero eigenvalues coincide (this follows from the general fact that $AB$ and $BA$ have the same nonzero eigenvalues). The number of positive eigenvalues equals $\mathrm{rank}(A) = r$.

Computing the SVD

The standard computational path:

Form $A^T A$ ($n \times n$ symmetric PSD matrix).
Eigendecompose $A^T A = V D V^T$ (by spectral theorem, since $A^T A$ is symmetric). The eigenvalues $d_i = \sigma_i^2 \geq 0$, so $\sigma_i = \sqrt{d_i}$.
Compute left singular vectors: for each $\sigma_i > 0$, $u_i = \frac{1}{\sigma_i} A v_i$. Orthonormalize if needed (they are automatically orthonormal if done correctly).
Extend $\{u_i\}$ and $\{v_i\}$ to full orthonormal bases of $\mathbb{R}^m$ and $\mathbb{R}^n$ respectively (for the “full” SVD).

Warning. Forming $A^T A$ explicitly squares the condition number ($\kappa(A^T A) = \kappa(A)^2$), losing numerical precision. Production algorithms (e.g., LAPACK’s dgesdd) use bidiagonalization (Golub - Reinsch) followed by the QR algorithm on the bidiagonal - achieving SVD without forming $A^T A$.

Economy (Thin) SVD

When $m \geq n$, most of $U$ is “wasted”: $\Sigma$ has zero rows at the bottom. The thin SVD keeps only the first $n$ columns of $U$:

$$A = \underbrace{U_n}_{m \times n} \underbrace{\hat{\Sigma}}_{n \times n} \underbrace{V^T}_{n \times n},$$

where $U_n$ has orthonormal columns (but is not square) and $\hat{\Sigma} = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$. The thin SVD is usually sufficient and saves storage/computation.

For rank-$r$ $A$ (with $r \leq \min(m,n)$), the compact SVD keeps only $r$ terms:

$$A = \sum_{i=1}^r \sigma_i u_i v_i^T.$$

This is the most economical representation.

The Eckart - Young Theorem: Best Low-Rank Approximation

Theorem (Eckart - Young, 1936). Let $A \in \mathbb{R}^{m \times n}$ with SVD $A = U\Sigma V^T$ and singular values $\sigma_1 \geq \cdots \geq \sigma_r > 0$. For any $k \leq r$, the best rank-$k$ approximation of $A$ in the Frobenius norm (and in the spectral norm) is

$$A_k = \sum_{i=1}^k \sigma_i u_i v_i^T = U_k \Sigma_k V_k^T,$$

where $U_k, V_k$ are the first $k$ columns of $U, V$ and $\Sigma_k = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$. The approximation error is:

$$|A - A_k|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}, \quad |A - A_k|_2 = \sigma_{k+1}.$$

Moreover, $A_k$ achieves the minimum among all rank-$k$ matrices $B$:

$$A_k = \arg\min_{\mathrm{rank}(B) \leq k} |A - B|_F = \arg\min_{\mathrm{rank}(B) \leq k} |A - B|_2.$$

Proof sketch (spectral norm version). For any rank-$k$ matrix $B$, $\ker(B)$ has dimension $\geq n-k$. The subspace $\mathrm{span}(v_1, \ldots, v_{k+1})$ has dimension $k+1$. By dimension counting, there exists a unit vector $w$ in $\ker(B) \cap \mathrm{span}(v_1, \ldots, v_{k+1})$. Then:

$$|A - B|_2 \geq |(A-B)w|_2 = |Aw|_2 \geq \sigma_{k+1}|w|_2 = \sigma_{k+1}.$$

(The last inequality uses the variational characterization $\sigma_{k+1} = \min_{|x|=1, x \perp v_1,\ldots,v_k} |Ax|$.) Since $|A - A_k|_2 = \sigma_{k+1}$ exactly, $A_k$ achieves the bound. $\square$

Singular value “energy” diagram:

Typical singular value spectra. Rapid decay (blue) signals low-rank structure: a few values dominate, so a rank-k approximation captures most of the energy. Slow decay (orange) appears in noise-like matrices with no compressible structure.

Keeping top $k$ singular values captures most of the “energy”:

$$\frac{|A_k|_F^2}{|A|_F^2} = \frac{\sigma_1^2 + \cdots + \sigma_k^2}{\sigma_1^2 + \cdots + \sigma_r^2}$$

The Pseudoinverse

For a non-square or rank-deficient matrix, the ordinary inverse does not exist. The Moore - Penrose pseudoinverse provides a canonical substitute.

Definition. Given $A = U\Sigma V^T$, define

$$A^+ = V \Sigma^+ U^T,$$

where $\Sigma^+$ is obtained by replacing each nonzero diagonal entry $\sigma_i$ of $\Sigma$ with $1/\sigma_i$, and transposing.

Theorem (Pseudoinverse Characterization). $A^+$ is the unique matrix satisfying the four Moore - Penrose conditions:

$A A^+ A = A$
$A^+ A A^+ = A^+$
$(A A^+)^T = A A^+$ (symmetric)
$(A^+ A)^T = A^+ A$ (symmetric)

Application to least squares. The system $Ax = b$ (with $A \in \mathbb{R}^{m \times n}$, $m > n$) is generally overdetermined. The minimum-norm least squares solution is:

$$x^\ast = A^+ b = V \Sigma^+ U^T b.$$

This is the solution that minimizes $|Ax - b|^2$ and (among all minimizers) has minimum $|x|^2$.

Derivation. Write $b = U U^T b = \sum_i (u_i^T b) u_i$. Then

$$A^+ b = V\Sigma^+ U^T b = \sum_{i:\sigma_i > 0} \frac{u_i^T b}{\sigma_i} v_i.$$

Each term $\frac{u_i^T b}{\sigma_i} v_i$ is the projection of $b$ onto the $i$-th left singular direction, scaled by $1/\sigma_i$. Components of $b$ in $\ker(A^T) = \mathrm{span}(u_{r+1}, \ldots, u_m)$ are discarded (they contribute nothing to $Ax$).

Condition Number and Numerical Rank

Definition. The condition number of $A$ is

$$\kappa(A) = |A|_2 |A^+|_2 = \frac{\sigma_{\max}}{\sigma_{\min}} = \frac{\sigma_1}{\sigma_r}.$$

(For singular $A$, $\kappa(A) = \infty$.)

The condition number measures how much the relative error in $b$ is amplified in the solution $x$ of $Ax = b$:

$$\frac{|\delta x|}{|x|} \leq \kappa(A) \frac{|\delta b|}{|b|}.$$

A well-conditioned matrix ($\kappa$ close to 1) gives reliable solutions; an ill-conditioned matrix ($\kappa \gg 1$) magnifies errors catastrophically.

Numerical rank. In finite precision arithmetic, singular values below $\varepsilon_{\text{machine}} \cdot \sigma_1$ are numerically zero. The numerical rank is

$$r_{\varepsilon} = #\{i : \sigma_i > \varepsilon \cdot \sigma_1\}.$$

A matrix is numerically rank-deficient if $\sigma_r / \sigma_1 < \varepsilon_{\text{machine}} \approx 10^{-16}$. In this regime, forming $A^{-1}$ or solving $Ax = b$ via LU is dangerous; use the pseudoinverse or truncated SVD instead.

Truncated SVD regularization. To solve an ill-conditioned system $Ax \approx b$, compute the truncated SVD solution:

$$x_k = \sum_{i=1}^k \frac{u_i^T b}{\sigma_i} v_i,$$

discarding terms where $\sigma_i$ is small (would amplify noise). This is Tikhonov regularization (ridge regression) in disguise.

Applications

PCA as SVD of the Data Matrix

Given a centred data matrix $X \in \mathbb{R}^{n \times p}$ ($n$ observations, $p$ features, each column has zero mean), compute $X = U\Sigma V^T$. Then:

The principal components (PCs) are the columns of $V$ (right singular vectors of $X$) = eigenvectors of the covariance matrix $X^T X / (n-1)$.
The scores (projections of data onto PCs) are the columns of $U\Sigma = XV$ (or just the left singular vectors scaled by singular values).
The variance explained by the $k$-th PC is $\sigma_k^2 / \sum_i \sigma_i^2$.

PCA is the SVD of $X$; the eigendecomposition of the covariance matrix $X^T X$ is equivalent but numerically less stable.

Image Compression

An $m \times n$ grayscale image is an $m \times n$ matrix $A$. The rank-$k$ approximation $A_k = U_k \Sigma_k V_k^T$ requires storing $k(m + n + 1)$ numbers instead of $mn$. For $k \ll \min(m,n)$, this is substantial compression. Quality degrades gracefully as $k$ decreases.

k = rank(A)    Full image (no compression)
k = 50         Hard to distinguish from original
k = 20         Visible blur but recognizable
k = 5          Abstract impression
k = 1          Best rank-1 approximation: u1 v1^T (one outer product)

Latent Semantic Analysis (LSA)

In NLP, build a term-document matrix $A$ (rows = words, columns = documents, entry = TF-IDF weight). The rank-$k$ SVD reveals latent topics: $u_i$ represents a “concept” direction in word space, $v_i$ in document space, and $\sigma_i$ gives the importance of topic $i$.

Queries are matched against documents in the reduced rank-$k$ space, allowing for synonymy (words with similar meaning land near each other in concept space) and polysemy handling.

Collaborative Filtering

User-item rating matrices (e.g., Netflix ratings) are large, sparse, and low-rank in practice (users have latent preference dimensions). SVD (or its variants like SVD$++$) finds a low-rank factorization $A \approx U\Sigma V^T$ where $U$ encodes user preferences and $V^T$ encodes item features in a shared latent space. Missing ratings are predicted by the corresponding entries of $U\Sigma V^T$.

When Low Rank Is Not Enough

Low-rank approximation via SVD is optimal in Frobenius norm - but Frobenius optimality does not guarantee interpretability. A matrix can be numerically low-rank (singular values decay quickly after the first few) without the leading singular vectors being semantically meaningful.

The dark side of PCA. PCA finds the directions of maximum variance. If the data has a low-dimensional structure aligned with variance, PCA recovers it. But if the structure is in a subspace that does not explain maximum variance - for example, a rare but important cluster of points - PCA discards it. Variance is not the same as importance.

Rotation ambiguity. If $A \approx U_k \Sigma_k V_k^T$, any rotation $R$ gives an equally valid factorization: $A \approx (U_k R)(R^T \Sigma_k V_k^T)$. The columns of $U_k$ and $V_k$ are not uniquely identified - they can be rotated freely within the rank-$k$ subspace. Methods like Independent Component Analysis (ICA) and Non-negative Matrix Factorization (NMF) add additional constraints (independence, non-negativity) to select a specific rotation that is more interpretable for specific domains.

When to use it and when not to. Use SVD when you need the best low-rank approximation in Frobenius norm, when you need the pseudoinverse, or when you want to understand the condition number and numerical rank. Do not expect SVD to reveal semantically meaningful structure unless the data’s meaningful structure happens to align with directions of maximum variance.

Formal Summary

Concept	Formula	Notes
Full SVD	$A = U\Sigma V^T$	$U \in \mathbb{R}^{m\times m}$, $V \in \mathbb{R}^{n \times n}$ orthogonal
Thin SVD	$A = U_r \Sigma_r V_r^T$	Only $r = \mathrm{rank}(A)$ terms
Singular values	$\sigma_i = \sqrt{\lambda_i(A^T A)}$	$\sigma_1 \geq \cdots \geq \sigma_r > 0$
Best rank-$k$ approx	$A_k = U_k \Sigma_k V_k^T$	Error $= \sigma_{k+1}$ (spectral), $\sqrt{\sum_{i>k}\sigma_i^2}$ (Frobenius)
Pseudoinverse	$A^+ = V\Sigma^+ U^T$	Min-norm least squares solution
Condition number	$\kappa(A) = \sigma_1/\sigma_r$	Ill-conditioned if $\kappa \gg 1$
Rank	$r = #\{\sigma_i > 0\}$	Numerical rank uses threshold

The SVD is computationally $O(mn^2)$ for $m \geq n$ (thin SVD), more for the full version. Randomized SVD algorithms (Halko, Martinsson, Tropp) can approximate the top-$k$ SVD in $O(mnk)$ time, enabling SVD of matrices too large to process otherwise.

Read Next: