Random Matrix Theory
Random matrix theory (RMT) studies the statistical properties of matrices whose entries are drawn from probability distributions. Originating in nuclear physics - where Wigner modelled the energy levels of heavy nuclei as eigenvalues of random matrices - RMT has become a central tool in statistics, signal processing, and the theory of deep learning. The key insight is that individual randomness averages out in large systems, producing universal limiting laws of remarkable precision.
Why Random Matrices?
In high-dimensional statistics, we observe a data matrix $X \in \mathbb{R}^{p \times n}$ with $p$ features and $n$ samples. When both $p$ and $n$ are large, the sample covariance $\hat{\Sigma} = \frac{1}{n}XX^T$ has eigenvalues that differ systematically from the true population eigenvalues. Classical results (the law of large numbers) say little about how bad this distortion is. RMT gives sharp, quantitative answers.
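A minimal NumPy sketch (the sizes $p = 500$, $n = 1000$ are illustrative) makes the distortion concrete: even when the population covariance is exactly the identity, the sample eigenvalues spread over a wide interval rather than concentrating at 1.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 500, 1000                 # aspect ratio p/n = 0.5; true covariance is the identity
X = rng.standard_normal((p, n))
S = X @ X.T / n                  # sample covariance
eigs = np.linalg.eigvalsh(S)

# Every population eigenvalue is 1, yet the sample eigenvalues spread over
# roughly [(1 - sqrt(0.5))^2, (1 + sqrt(0.5))^2] ~ [0.09, 2.91].
print(eigs.min(), eigs.max())
```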
In deep learning, the weight matrices $W^{(l)}$ of a neural network at initialisation are random. Their eigenvalue distribution affects gradient flow, trainability, and generalisation. RMT provides the right framework to analyse these phenomena.
Gaussian Ensembles
The Gaussian Orthogonal Ensemble
Definition. The Gaussian Orthogonal Ensemble (GOE) is the probability measure on $n \times n$ real symmetric matrices $W$ with density proportional to
$$\exp\!\left(-\frac{n}{4}\,\text{tr}(W^2)\right) = \exp\!\left(-\frac{n}{4}\sum_{i,j}W_{ij}^2\right).$$
Equivalently, $W_{ij} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1/n)$ for $i < j$ and $W_{ii} \sim \mathcal{N}(0, 2/n)$.
The GOE is the natural ensemble of symmetric random matrices invariant under orthogonal conjugation: if $O$ is any orthogonal matrix, $OWO^T$ has the same distribution as $W$.
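A standard recipe for sampling from the GOE, shown here as a NumPy sketch: symmetrise a square matrix of i.i.d. standard Gaussians and rescale, which reproduces the variances above.

```python
import numpy as np

def sample_goe(n: int, rng: np.random.Generator) -> np.ndarray:
    """Sample an n x n GOE matrix by symmetrising a Gaussian matrix.

    Off-diagonal entries get variance 1/n and diagonal entries 2/n,
    matching the density exp(-(n/4) tr(W^2)).
    """
    A = rng.standard_normal((n, n))
    return (A + A.T) / np.sqrt(2 * n)

rng = np.random.default_rng(0)
W = sample_goe(1000, rng)
# Orthogonal invariance: Q @ W @ Q.T has the same distribution for any fixed orthogonal Q.
```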
Eigenvalue Repulsion
A key feature of GOE (and all Gaussian ensembles) is eigenvalue repulsion: the joint density of eigenvalues $\lambda_1 \leq \cdots \leq \lambda_n$ has the form
$$p(\lambda_1,\dots,\lambda_n) \propto \prod_{i < j}|\lambda_i - \lambda_j|^\beta \cdot \exp\!\left(-\frac{n\beta}{4}\sum_i \lambda_i^2\right),$$
with $\beta = 1$ for GOE and $\beta = 2$ for the Gaussian Unitary Ensemble (GUE, complex Hermitian matrices). The factor $\prod_{i<j}|\lambda_i-\lambda_j|^\beta$ is the absolute Vandermonde determinant raised to the power $\beta$; it vanishes whenever two eigenvalues coincide, so coincidences are heavily penalised. Eigenvalues of random matrices are mutually repelling - a dramatic contrast to the Poisson process (independent points), which describes the level statistics of integrable quantum systems.
Wigner’s Semicircle Law
Theorem (Wigner, 1955). Let $W$ be an $n \times n$ symmetric matrix with i.i.d. entries on and above the diagonal satisfying $\mathbb{E}[W_{ij}] = 0$ and $\mathbb{E}[W_{ij}^2] = \sigma^2/n$. The empirical spectral distribution
$$\mu_n = \frac{1}{n}\sum_{i=1}^n \delta_{\lambda_i(W)}$$
converges weakly, almost surely, to the semicircle distribution with density
$$\rho_{\text{sc}}(x) = \frac{1}{2\pi \sigma^2}\sqrt{4\sigma^2 - x^2} \cdot \mathbf{1}_{|x| \leq 2\sigma}.$$
For $\sigma = 1$ this is the semicircle on $[-2,2]$, with peak density $1/\pi$ at the origin.
Empirical eigenvalue histogram vs. semicircle (sigma = 1):
density
0.32 |         ***
     |      **     **
0.16 |    **         **
     |   *             *
     |  *               *
   0 +--+-------+-------+---> x
       -2       0       2
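The theorem is easy to verify numerically; a short sketch with $\sigma = 1$ compares the empirical eigenvalue histogram of a symmetrised Gaussian matrix to the semicircle density on $[-2, 2]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n))
W = (A + A.T) / np.sqrt(2 * n)   # Wigner matrix with sigma = 1
eigs = np.linalg.eigvalsh(W)

# Compare the histogram to the semicircle density rho(x) = sqrt(4 - x^2) / (2*pi).
hist, edges = np.histogram(eigs, bins=50, range=(-2.2, 2.2), density=True)
centers = (edges[:-1] + edges[1:]) / 2
semicircle = np.sqrt(np.clip(4 - centers**2, 0, None)) / (2 * np.pi)
print(np.abs(hist - semicircle).max())   # small, and shrinks as n grows
```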
Proof Sketch via the Moment Method
The $k$-th moment of $\mu_n$ is $m_k = \frac{1}{n}\mathbb{E}[\text{tr}(W^k)] = \frac{1}{n}\sum_{i_1,\dots,i_k} \mathbb{E}[W_{i_1 i_2} W_{i_2 i_3} \cdots W_{i_k i_1}]$. Since $\mathbb{E}[W_{ij}] = 0$, only index sequences where each edge $(i_l, i_{l+1})$ appears at least twice contribute. Counting such sequences by Catalan-number combinatorics shows that $m_{2k} \to C_k \sigma^{2k}$ (the $k$-th Catalan number) and $m_{2k+1} \to 0$. These are precisely the moments of the semicircle distribution, proving convergence in distribution.
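The Catalan-number limit can also be checked directly; in this sketch ($\sigma = 1$, illustrative size $n = 2000$) the even empirical moments $\frac{1}{n}\text{tr}(W^{2k})$ should approach $C_k = 1, 2, 5, 14, \dots$

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n))
W = (A + A.T) / np.sqrt(2 * n)            # Wigner matrix with sigma = 1
eigs = np.linalg.eigvalsh(W)

for k in range(1, 5):
    catalan = comb(2 * k, k) // (k + 1)   # k-th Catalan number
    empirical = (eigs ** (2 * k)).mean()  # (1/n) tr(W^{2k})
    print(k, catalan, round(float(empirical), 3))
```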
The semicircle law is universal: the limiting distribution depends on the entries only through their variance $\sigma^2$, not their higher moments.
The Marchenko-Pastur Distribution
Setting. Let $X \in \mathbb{R}^{p \times n}$ have i.i.d. entries with mean $0$ and variance $1$. The sample covariance is $S = \frac{1}{n}XX^T \in \mathbb{R}^{p \times p}$. Assume $p, n \to \infty$ with $p/n \to \gamma \in (0,\infty)$.
Theorem (Marchenko-Pastur, 1967). The empirical spectral distribution of $S$ converges almost surely to the Marchenko-Pastur distribution with density
$$\rho_{\text{MP}}(x) = \frac{1}{2\pi\gamma x}\sqrt{(\lambda_+ - x)(x - \lambda_-)}\cdot \mathbf{1}_{[\lambda_-,\,\lambda_+]}(x),$$
where $\lambda_{\pm} = (1 \pm \sqrt{\gamma})^2$ are the upper and lower edges of the spectrum. If $\gamma > 1$, there is an additional point mass of weight $1 - 1/\gamma$ at $0$.
Marchenko-Pastur density (gamma = 0.5):
density  lambda_-                  lambda_+
    |    |                         |
    |    ***                       |
    |   *   ***                    |
    |   *      ****                |
    |  *           ********        |
  0 +--+----------------------+------> x
      0.09                   2.91
Significance for PCA
In classical PCA, one retains eigenvectors whose eigenvalues are “large”. But when $p$ is comparable to $n$, even a null covariance ($\Sigma = I$) produces sample eigenvalues ranging from $\lambda_-$ to $\lambda_+$. The Marchenko-Pastur law provides the noise floor: eigenvalues of the sample covariance that exceed $\lambda_+ = (1+\sqrt{\gamma})^2$ are statistically significant; those below $\lambda_+$ cannot be distinguished from pure noise.
This gives a principled, model-free threshold for PCA, superseding heuristics like “eigenvalue $> 1$” (Kaiser’s rule) or scree plots.
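A sketch of the resulting test, with a hypothetical planted spike of population strength 9 on top of identity noise: only eigenvalues exceeding $\lambda_+$ are flagged as signal.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 400, 1000
gamma = p / n
lam_plus = (1 + np.sqrt(gamma)) ** 2           # Marchenko-Pastur noise floor

u = rng.standard_normal(p)
u /= np.linalg.norm(u)
Sigma_half = np.eye(p) + 2.0 * np.outer(u, u)  # population covariance I + 8*u*u^T (spike = 9)
X = Sigma_half @ rng.standard_normal((p, n))
S = X @ X.T / n

eigs = np.linalg.eigvalsh(S)
print("noise floor lambda_+:", round(lam_plus, 2))
print("eigenvalues above the floor:", eigs[eigs > lam_plus])   # just the planted direction
```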
Eigenvalue Spacing and the Tracy-Widom Distribution
GUE Universality
For GUE matrices, the joint eigenvalue density (with $\beta=2$) can be analysed exactly via orthogonal polynomial techniques. The normalised spacing between consecutive bulk eigenvalues is closely approximated by the Wigner surmise
$$p(s) \approx \frac{32s^2}{\pi^2} e^{-4s^2/\pi},$$
showing strong repulsion at small $s$ (unlike the exponential distribution $e^{-s}$ of a Poisson process). This distinction between GOE/GUE statistics and Poisson statistics is a universal signature of quantum chaos and complex correlated systems.
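A crude numerical check of the repulsion, assuming only NumPy: sample Hermitian matrices with complex Gaussian entries, normalise bulk spacings by their mean (a rough stand-in for proper unfolding), and compare the fraction of tiny spacings to the Poisson prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 400, 200
all_s = []
for _ in range(trials):
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    H = (A + A.conj().T) / 2                # GUE-type Hermitian matrix
    eigs = np.linalg.eigvalsh(H)
    bulk = eigs[n // 4 : 3 * n // 4]        # stay away from the spectral edges
    s = np.diff(bulk)
    all_s.append(s / s.mean())              # crude local unfolding
s = np.concatenate(all_s)

# Under the Wigner surmise, P(s < 0.1) ~ 0.001; for a Poisson process it is ~ 0.095.
print((s < 0.1).mean())
```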
Tracy-Widom Distribution
Theorem (Tracy-Widom, 1994). Let $\lambda_{\max}$ be the largest eigenvalue of a GUE matrix of size $n$. Then
$$\Pr\!\left[n^{2/3}\left(\lambda_{\max} - 2\right) \leq s\right] \to F_2(s),$$
where $F_2$ is the Tracy-Widom distribution of order $2$, expressed via a solution to the Painlevé II equation. The analogous result for GOE gives $F_1$.
The fluctuations live on the scale $n^{-2/3}$ - much finer than the $n^{-1/2}$ scale of classical central limit theorems - which is characteristic of the spectral edge, and $F_2$ appears universally in diverse combinatorial problems including the longest increasing subsequence of a random permutation.
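The edge fluctuations are straightforward to sample; this sketch (sizes illustrative) draws GUE matrices scaled so the spectrum fills $[-2, 2]$ and compares the rescaled largest eigenvalue against the known Tracy-Widom mean and standard deviation (about $-1.77$ and $0.90$).

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 500
samples = []
for _ in range(trials):
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    H = (A + A.conj().T) / np.sqrt(4 * n)      # GUE normalised so the bulk fills [-2, 2]
    lam_max = np.linalg.eigvalsh(H)[-1]
    samples.append(n ** (2 / 3) * (lam_max - 2))
samples = np.array(samples)

# F_2 has mean ~ -1.77 and standard deviation ~ 0.90 (finite-n bias is expected).
print(round(samples.mean(), 2), round(samples.std(), 2))
```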
Free Probability
Free probability, developed by Voiculescu, is a non-commutative probability theory that handles the addition and multiplication of large random matrices. Two large random matrices $A$ and $B$ are free (the matrix analogue of classical independence) if their mixed moments are determined by their individual moment sequences via a specific combinatorial rule: alternating products of centred copies of $A$ and $B$ have vanishing trace, which fixes every mixed moment.
Free convolution $\mu_A \boxplus \mu_B$ gives the limiting spectral distribution of $A+B$ when $A$ and $B$ are free. Analogously, free multiplicative convolution $\mu_A \boxtimes \mu_B$ handles $A^{1/2}BA^{1/2}$.
A central result: if $A$ is drawn from a rotationally invariant ensemble (e.g. GOE) and $B$ is any fixed deterministic matrix (e.g. diagonal), then $A$ and $B$ are asymptotically free, and the spectrum of $A+B$ is the free convolution of their spectra. This gives a powerful toolbox for computing the spectra of matrix polynomials without explicit diagonalisation.
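The simplest sanity check: the free convolution of two semicircles is a semicircle with summed variances, so the sum of two independent GOE matrices (each with spectrum filling $[-2,2]$) should have spectrum filling $[-2\sqrt{2}, 2\sqrt{2}]$. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1500

def goe(n, rng):
    A = rng.standard_normal((n, n))
    return (A + A.T) / np.sqrt(2 * n)       # semicircle on [-2, 2]

# Independent GOE matrices are asymptotically free, so the spectrum of the sum
# is semicircle (+) semicircle: a semicircle with variance 2, support [-2*sqrt(2), 2*sqrt(2)].
eigs = np.linalg.eigvalsh(goe(n, rng) + goe(n, rng))
print(eigs.min(), eigs.max())               # ~ +/- 2.83
```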
Sample Covariance Matrices and Shrinkage
Eigenvalue Inflation
Let the true covariance be $\Sigma$ with eigenvalues $\sigma_1^2 \geq \cdots \geq \sigma_p^2$. When $p/n = \gamma > 0$, the sample eigenvalues $\hat{\lambda}_i$ are systematically more spread out than the population eigenvalues: large eigenvalues are over-estimated, small ones under-estimated.
Ledoit-Wolf Shrinkage
Theorem (Ledoit-Wolf, 2004). The optimal linear shrinkage estimator - minimising the Frobenius-norm risk $\mathbb{E}\big[\|\hat{\Sigma} - \Sigma\|_F^2\big]$ over all estimators of the form $\alpha S + \beta I$ - has coefficients given by closed-form expressions in $p$, $n$, and the moments of $S$.
More sophisticated nonlinear shrinkage applies a function $\eta(\lambda)$ to each sample eigenvalue, where $\eta$ is derived from the Marchenko-Pastur equation; this asymptotically attains the oracle loss as $p/n \to \gamma$. In practice shrinkage dramatically improves portfolio optimisation, covariance-based classifiers, and anomaly detection.
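One rarely implements the estimator by hand; scikit-learn ships a linear Ledoit-Wolf estimator (assumed installed here), and a quick sketch on a hypothetical diagonal population covariance shows the Frobenius-error improvement.

```python
import numpy as np
from sklearn.covariance import LedoitWolf    # assumes scikit-learn is installed

rng = np.random.default_rng(0)
p, n = 200, 300
var = np.linspace(0.5, 3.0, p)               # hypothetical population variances
Sigma = np.diag(var)
X = rng.standard_normal((n, p)) * np.sqrt(var)

S = X.T @ X / n                              # raw sample covariance
lw = LedoitWolf().fit(X)                     # shrinkage intensity estimated from the data

print("shrinkage:", round(lw.shrinkage_, 3))
print("sample error:", round(np.linalg.norm(S - Sigma), 2))
print("Ledoit-Wolf error:", round(np.linalg.norm(lw.covariance_ - Sigma), 2))  # typically smaller
```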
Random Matrix Theory in Deep Learning
Weight Initialisation and Singular Value Spectra
At random initialisation with i.i.d. $\mathcal{N}(0, 1/n_l)$ entries, the squared singular values of $W^{(l)} \in \mathbb{R}^{n_{l+1} \times n_l}$ follow a Marchenko-Pastur law with ratio $\gamma = n_{l+1}/n_l$. The product of many such matrices - the input-output Jacobian of a deep linear network - has singular values that explode or vanish exponentially in depth unless the initialisation scale is carefully controlled (cf. Xavier/He initialisation).
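A sketch of the depth effect in a linear network: push a unit vector through many random layers and watch its norm. With entry standard deviation $g/\sqrt{n_l}$, any gain $g \neq 1$ compounds exponentially (the gains and sizes below are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 300, 30

for gain in [0.8, 1.0, 1.25]:               # entry std = gain / sqrt(width)
    x = rng.standard_normal(width)
    x /= np.linalg.norm(x)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * (gain / np.sqrt(width))
        x = W @ x                            # one linear layer of the Jacobian product
    print(gain, np.linalg.norm(x))           # vanishes, stays O(1), or explodes
```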
Neural Network Hessian Spectra
The Hessian $H = \nabla^2 \mathcal{L}$ of the training loss has an empirical spectral distribution that, near a minimum, decomposes into a bulk governed by the Marchenko-Pastur law (from the noise in stochastic gradients) and a few outlier eigenvalues corresponding to directions of large curvature (the signal). This separation - predicted by RMT - is exploited by second-order optimisers that treat bulk and outlier directions differently.
Hessian eigenvalue spectrum near a minimum:
density
| Bulk (MP law) Outliers
| **** | | |
|* ** | | |
| ** | | |
0 +--------***---------+-+-+-----> lambda
lambda_+
(noise floor)
Overparameterisation and the Double Descent Phenomenon
In overparameterised models (parameters $\gg$ samples), the interpolating solution (zero training loss) often generalises well. RMT explains this: the minimum-norm solution $\hat{w} = X^+ y$ (with $X^+$ the Moore-Penrose pseudoinverse) has risk governed by the Marchenko-Pastur distribution when $p/n > 1$; the risk peaks at the interpolation threshold $p = n$ and descends again beyond it - the “double descent” curve observed empirically.
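A toy sketch in the spirit of the misspecified linear model: fit minimum-norm least squares on the first $p$ of 200 true features (all sizes here are illustrative) and watch the test risk peak near $p = n$ before descending.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_true, trials = 50, 200, 50
w_star = rng.standard_normal(p_true) / np.sqrt(p_true)

for p in [10, 25, 45, 50, 55, 100, 200]:    # sweep the model size through p = n
    risks = []
    for _ in range(trials):
        X = rng.standard_normal((n, p_true))
        y = X @ w_star + 0.1 * rng.standard_normal(n)
        w_hat = np.linalg.pinv(X[:, :p]) @ y         # minimum-norm least squares
        X_test = rng.standard_normal((2000, p_true))
        err = X_test[:, :p] @ w_hat - X_test @ w_star
        risks.append(np.mean(err ** 2))
    print(p, round(float(np.mean(risks)), 3))        # peaks near p = n, descends after
```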
Formal Summary of Key Theorems
| Theorem | Setting | Result |
|---|---|---|
| Wigner semicircle | Symmetric, i.i.d. entries | ESD $\to$ semicircle on $[-2\sigma, 2\sigma]$ |
| Marchenko-Pastur | $\frac{1}{n}XX^T$, $p/n\to\gamma$ | ESD $\to$ MP law on $[\lambda_-, \lambda_+]$ |
| Tracy-Widom | GUE largest eigenvalue | $n^{2/3}$ fluctuations $\to F_2$ |
| Free convolution | Independent rotationally invariant matrices | Spectrum of sum via $\boxplus$ |
Random matrix theory transforms vague intuitions about “noise in high dimensions” into precise, universal laws. Its theorems tell us exactly where eigenvalues sit in the absence of signal, how they fluctuate, and how to correct for finite-sample bias - capabilities that are increasingly indispensable in the era of high-dimensional data.