Dimensionality Reduction - Keeping the Signal, Dropping the Noise // Megha Bose

Helpful context:

Imagine you have a dataset of images - each image is 64x64 pixels, giving a 4096-dimensional vector. You want to find images similar to a query image, so you compute Euclidean distances between the 4096-dimensional vectors. The problem: in 4096 dimensions, all pairwise distances become nearly equal. The nearest neighbor and the farthest point are at almost the same distance from your query. Your distance metric has become useless. This is not a pathology of your particular dataset; it is a mathematical consequence of high dimensions, a manifestation of the curse of dimensionality. The volume of a high-dimensional ball concentrates near its surface, making the notion of “near” and “far” nearly indistinguishable.

The antidote is to find a low-dimensional representation of the data that captures its essential structure. This is dimensionality reduction: mapping from a high-dimensional space $\mathbb{R}^d$ to a low-dimensional space $\mathbb{R}^k$ with $k \ll d$, while preserving the structure that matters for your task. What structure to preserve - global geometry, local neighborhoods, generative factors - is precisely where the different methods diverge.

There is no free lunch here. Every dimensionality reduction method makes a tradeoff: what it preserves, what it discards, how well it scales, and how interpretable the resulting representation is. PCA preserves global linear structure but is blind to nonlinear manifolds. t-SNE preserves local neighborhoods beautifully but distorts global distances. Autoencoders can capture arbitrary nonlinear structure but require training and lose interpretability. Understanding these tradeoffs is what lets you choose the right tool.

PCA: Directions of Maximum Variance

Principal Component Analysis is the canonical linear dimensionality reduction method. The intuition: if your data lives in a 2D plane embedded in 3D space, you only need two coordinates to describe it. PCA finds those coordinates automatically.

More precisely, PCA finds the directions in $\mathbb{R}^d$ along which the data has the most variance. The first principal component $u_1$ is the unit vector that maximizes the variance of the projection $u_1^\top x$:

$$u_1 = \arg\max_{|u|=1} \text{Var}(u^\top X) = \arg\max_{|u|=1} u^\top \Sigma u$$

where $\Sigma$ is the covariance matrix of the data. By the Lagrangian solution, this is the eigenvector of $\Sigma$ corresponding to the largest eigenvalue. The second principal component $u_2$ is the direction of next-greatest variance, constrained to be orthogonal to $u_1$ - this is the second eigenvector of $\Sigma$. Continuing this way gives all $d$ principal components, ordered by decreasing variance (eigenvalue).

To reduce to $k$ dimensions, project each data point onto the top $k$ principal components:

$$z = U_k^\top (x - \mu)$$

where $U_k \in \mathbb{R}^{d \times k}$ has the top $k$ eigenvectors as columns, and $\mu$ is the data mean. This is the low-dimensional representation. The reconstruction is $\hat{x} = U_k z + \mu$.

Connection to SVD. Let $X \in \mathbb{R}^{n \times d}$ be the mean-centered data matrix. Its SVD is $X = U \Sigma V^\top$. The principal components are the columns of $V$ (right singular vectors), and the principal component scores are the columns of $U\Sigma$. This is computationally important: computing the SVD of $X$ directly is more numerically stable than computing the eigendecomposition of the covariance $X^\top X / n$.

Why high-variance directions capture information. PCA minimizes the mean squared reconstruction error over all possible $k$-dimensional linear projections. The Eckart-Young theorem states that the best rank-$k$ approximation to a matrix in Frobenius norm is given by keeping the top $k$ singular vectors. High-variance directions capture the most “spread” in the data; discarding them would lose the most information in a squared-error sense.

Choosing the Number of Components

The number of components $k$ is the key hyperparameter in PCA. Two standard approaches:

Explained variance ratio. Each principal component $i$ explains a fraction $\lambda_i / \sum_j \lambda_j$ of the total variance, where $\lambda_i$ is its eigenvalue. Plot the cumulative explained variance ratio as a function of $k$. Choose $k$ such that the cumulative ratio exceeds a threshold - 95% is common. This says: keep enough components to explain 95% of the variance in the data.

Scree plot (elbow method). Plot the eigenvalues in decreasing order. Look for an “elbow” - a point after which the eigenvalues drop off steeply and then flatten. The elbow suggests the natural dimensionality of the data. This is more subjective than the threshold method.

For very high-dimensional data (e.g., genomics with $d = 20000$ features), the first two or three components often explain a surprisingly large fraction of the variance, because most of the variance is driven by a few dominant factors (e.g., population stratification, batch effects).

What PCA Preserves and Destroys

PCA preserves global linear structure. Pairwise Euclidean distances in the original space are approximately preserved in the projection, by the Eckart-Young theorem. If two points are far apart in the original space, they tend to be far apart in the PCA projection. If two points are close, the projection keeps them close - as long as the distance is in a direction captured by the kept components.

PCA destroys local nonlinear structure. If your data lies on a curved manifold - say, a Swiss roll (a 2D sheet rolled up in 3D) - PCA will unfurl the sheet partially but cannot fully separate the inner and outer layers because the roll has intrinsic nonlinear structure. Linear projections cannot “unfold” nonlinear manifolds.

PCA is invariant to the choice of basis but sensitive to feature scaling. If features have very different scales (e.g., one feature ranges from 0 to 1 and another from 0 to 10000), the high-variance feature dominates all principal components. Standard practice: normalize features to zero mean and unit variance before applying PCA, unless the raw scale is meaningful.

t-SNE: Preserving Neighborhoods

t-Distributed Stochastic Neighbor Embedding (van der Maaten and Hinton, 2008) is designed for visualization: mapping high-dimensional data to 2D or 3D in a way that reveals cluster structure.

The key idea: represent pairwise relationships as probabilities, and find a 2D configuration that matches those probabilities.

Step 1. For each pair of points $i, j$ in the original high-dimensional space, compute a probability $p_{ij}$ that reflects their similarity. Points that are close neighbors get high probability; distant points get near-zero probability. The similarity is computed using a Gaussian kernel centered at each point:

$$p_{j|i} = \frac{\exp(-|x_i - x_j|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-|x_i - x_k|^2 / 2\sigma_i^2)}$$

The bandwidth $\sigma_i$ is chosen so that the effective number of neighbors (the perplexity) equals a user-specified value (typically 5-50).

Step 2. Initialize 2D points $y_1, \ldots, y_n$ randomly. Compute pairwise similarities $q_{ij}$ in the 2D space using a Student’s t-distribution with one degree of freedom (heavier tails than Gaussian - this helps separate clusters):

$$q_{ij} = \frac{(1 + |y_i - y_j|^2)^{-1}}{\sum_{k \neq l} (1 + |y_k - y_l|^2)^{-1}}$$

Step 3. Minimize the KL divergence between $P$ and $Q$ using gradient descent:

$$\text{KL}(P | Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

The asymmetry of KL divergence matters: it penalizes heavily when $p_{ij}$ is large but $q_{ij}$ is small (nearby points mapped far apart) more than when $p_{ij}$ is small but $q_{ij}$ is large (distant points mapped nearby). The result is that local neighborhood structure is well-preserved; global distances are not.

Limitations of t-SNE. The 2D embedding is not a projection - you cannot apply it to new points without rerunning the algorithm. The result depends on hyperparameters (perplexity, learning rate, number of iterations) and the random initialization. Cluster sizes and inter-cluster distances in a t-SNE plot are not meaningful - two clusters that appear close may be globally distant. t-SNE is useful for visualization and exploration, not for learning representations used in downstream tasks.

UMAP: Faster with Better Global Structure

Uniform Manifold Approximation and Projection (McInnes et al., 2018) shares t-SNE’s goal of visualizing high-dimensional data in 2D or 3D, but is faster and better at preserving global structure.

UMAP’s theoretical foundation is Riemannian geometry and fuzzy topology, but the intuition is similar to t-SNE: build a graph representing the local structure of the data, then find a low-dimensional embedding that preserves that graph.

Step 1. For each point, find its $k$ nearest neighbors. Construct a weighted graph where the edge weight between $i$ and $j$ reflects their similarity, normalized by the distance to the nearest neighbor of $i$ (so that each point has at least one near neighbor, even in sparse regions). This gives a fuzzy topological representation of the data manifold.

Step 2. Optimize a low-dimensional embedding to have a similar fuzzy topological structure, using cross-entropy between the two fuzzy graph representations as the loss.

UMAP vs t-SNE. UMAP is 5-10x faster for large datasets. It preserves more global structure - clusters that are far apart in the original space tend to remain relatively far apart in the UMAP plot. It can also be applied to new points (unlike t-SNE), because the optimization can be extended to out-of-sample embedding. The tradeoff: UMAP’s local neighborhood preservation is slightly less faithful than t-SNE’s, though this varies by dataset.

For most practical visualization tasks, UMAP is now the default choice over t-SNE due to speed and scalability.

Autoencoders: Nonlinear Dimensionality Reduction

An autoencoder is a neural network trained to reconstruct its input through a bottleneck:

$$x \xrightarrow{\text{encoder}} z \xrightarrow{\text{decoder}} \hat{x}$$

The encoder $f: \mathbb{R}^d \to \mathbb{R}^k$ compresses the input to a low-dimensional latent representation $z$. The decoder $g: \mathbb{R}^k \to \mathbb{R}^d$ reconstructs the original from $z$. Training minimizes reconstruction loss $|x - \hat{x}|^2$.

The bottleneck dimension $k$ is the dimensionality of the representation. By forcing information through this bottleneck, the network must learn to compress - keeping only the most important structure and discarding noise.

Why autoencoders can do what PCA cannot. PCA can only learn linear projections. If the low-dimensional structure is nonlinear (a curved manifold, a set of discrete clusters, a generative factor like “rotation angle” or “facial expression”), PCA will miss it. An autoencoder with nonlinear activations can represent and exploit this structure. A nonlinear encoder can “unfold” the Swiss roll that PCA cannot.

Variants. A variational autoencoder (VAE) enforces a probabilistic structure on the latent space, making it useful for generation, not just compression. A denoising autoencoder is trained to reconstruct clean inputs from noisy ones, learning more robust representations. A sparse autoencoder adds a sparsity penalty to the latent code, encouraging each feature to activate for a small number of inputs.

The cost of nonlinearity. Autoencoders require training (choosing architecture, learning rate, number of epochs). The latent representation is not interpretable in the way PCA components are (the components have no natural ordering or meaning). Different random initializations may learn different representations. PCA, by contrast, gives a unique and globally optimal linear projection.

When to Use Which

Method	Preserves	Fails on	Speed	Invertible	Best for
PCA	Global linear structure	Nonlinear manifolds	Very fast	Yes	Preprocessing, compression, linear visualization
t-SNE	Local neighborhoods	Global distances	Slow	No	2D/3D visualization, cluster exploration
UMAP	Local + some global	Very fine structure	Fast	Approximately	Visualization, semi-supervised learning
Autoencoder	Whatever the task needs	Interpretability	Moderate (training)	Yes (via decoder)	Representation learning, generation

A few practical rules:

Always try PCA first. It is fast, interpretable, and often sufficient.
Use t-SNE or UMAP when you want to visually inspect cluster structure. Do not draw quantitative conclusions from inter-cluster distances.
Use an autoencoder when downstream performance matters more than interpretability, or when the structure is clearly nonlinear.
For preprocessing before a supervised model, PCA (or a linear projection) is usually best because it is stable and does not overfit.

Read next: