Helpful context:


Every high-dimensional dataset lives, in practice, on something much smaller. A collection of face images - each one $256 \times 256$ pixels, so a point in $\mathbb{R}^{65536}$ - does not actually fill that space uniformly. The images cluster on a low-dimensional surface: the manifold of human faces. It is parameterized by things like the angle of the jaw, the distance between the eyes, the direction of gaze. An autoencoder is a neural network that learns this manifold directly, without being told anything about it. It does so by trying to compress its input into a small code and then reconstruct the original from that code alone. If the reconstruction is good, the bottleneck has learned to represent what actually varies in the data.

The idea of compression-as-learning runs deep. A model that must reproduce its input from a short description cannot cheat by memorizing - it must find structure. The bottleneck dimension $d$ controls how much structure the model is forced to discover: too large and the network learns identity; too small and it cannot represent the data faithfully. In the space between these extremes lives everything interesting. Different architectural choices and regularization strategies push the learned representation toward different properties: sparsity, robustness to noise, sensitivity to the data manifold geometry, probabilistic interpretability. Each variant tells you something different about the structure of the data.

This post works through the main variants in order of increasing sophistication. We start with the basic autoencoder and show that its linear special case is exactly principal component analysis. Then nonlinear activations, denoising, sparsity, and contraction each add a new inductive bias. We close with Restricted Boltzmann Machines - a different tradition entirely, energy-based rather than feed-forward - which give a probabilistic account of representation learning and a genuinely generative model of data.


Standard Autoencoders

An autoencoder is a composition of two functions. The encoder $f: \mathbb{R}^D \to \mathbb{R}^d$ maps an input $x$ to a code (latent representation) $z = f(x)$. The decoder $g: \mathbb{R}^d \to \mathbb{R}^D$ maps the code back to input space. The model is trained to minimize the reconstruction loss over a dataset $\{x^{(i)}\}$:

$$\mathcal{L} = \frac{1}{n} \sum_{i=1}^n | x^{(i)} - g(f(x^{(i)})) |^2$$

The critical constraint is $d \ll D$: the code dimension is much smaller than the input dimension. This bottleneck is what forces the encoder to compress. Without it, the network could learn the identity map with zero loss and zero effort. With it, the network must decide what to preserve and what to discard, and the right answer is to preserve whatever accounts for the most variation across the dataset.

The encoder and decoder are typically parameterized as multilayer perceptrons. The encoder might be $f(x) = \sigma(W_2 \sigma(W_1 x + b_1) + b_2)$ for some weight matrices $W_1, W_2$ and activation $\sigma$; the decoder mirrors this structure. Training is done by gradient descent on $\mathcal{L}$ via backpropagation - there are no special labels, only the input itself serving as its own target. This is why autoencoders are called self-supervised.

The latent code $z$ can be used for downstream tasks. If the encoder has learned a good representation, a linear classifier on $z$ should perform well even when the original data is too high-dimensional for direct classification. This is the representation learning payoff: the compression forces the network to discover the factors of variation that generalize.


Linear Autoencoders and PCA

Suppose both the encoder and decoder are linear functions: $f(x) = Wx$ with $W \in \mathbb{R}^{d \times D}$, and $g(z) = Vz$ with $V \in \mathbb{R}^{D \times d}$. The reconstruction loss becomes:

$$\mathcal{L} = \frac{1}{n} \sum_{i=1}^n | x^{(i)} - VWx^{(i)} |^2 = | X - XW^\top V^\top |_F^2$$

where $X \in \mathbb{R}^{n \times D}$ is the data matrix (rows are examples, assumed mean-centered).

The claim is that the optimal linear autoencoder performs exactly PCA. More precisely, the composition $VW$ at the minimum of $\mathcal{L}$ is the orthogonal projection onto the subspace spanned by the top $d$ eigenvectors of the sample covariance matrix $\Sigma = \frac{1}{n} X^\top X$.

Proof via SVD. Let $X = U S V^\top$ be the singular value decomposition, where $U \in \mathbb{R}^{n \times D}$ has orthonormal columns, $S = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ with $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$, and $V \in \mathbb{R}^{D \times D}$ is orthogonal. The columns of $V$ are the eigenvectors of $X^\top X$, and $\sigma_k^2 / n$ are the eigenvalues of the covariance matrix.

The reconstruction error is $|X - XV^\top W^\top W V|_F^2$ after substituting $W = \tilde{W} V^\top$ for some $\tilde{W}$. Working in the eigenvector basis, we want to find the rank-$d$ matrix $P = V W^\top W V^\top$ that minimizes $|X - XP|_F^2$. By the Eckart-Young theorem, the best rank-$d$ approximation to $X$ in Frobenius norm is $X_d = U_d S_d V_d^\top$, where $U_d, S_d, V_d$ are the leading $d$ components of the SVD. The projection is:

$$P^\star = V_d V_d^\top$$

where $V_d \in \mathbb{R}^{D \times d}$ contains the top $d$ right singular vectors of $X$ (equivalently, top $d$ eigenvectors of the covariance matrix). The minimum reconstruction error is:

$$\mathcal{L}^\star = \sum_{k=d+1}^{r} \sigma_k^2$$

the sum of squared singular values not captured by the bottleneck. This is exactly the variance discarded by PCA when retaining $d$ components.

The encoder weight matrix at the optimum satisfies $W = V_d^\top$ (project onto the top eigenvectors), and the decoder satisfies $V = V_d$ (map back to input space along those same directions). Note that many other weight matrices achieve the same optimum - any $W = A V_d^\top$ where $A$ is an invertible $d \times d$ matrix, paired with $V = V_d A^{-1}$, gives the same product $VW = V_d V_d^\top$. The encoder-decoder composition is unique (it is the projection), but the individual weight matrices are not.

Why does this matter? It means that adding nonlinearities to an autoencoder is the only way to do something genuinely beyond PCA. A linear autoencoder, no matter how many layers deep, can only learn the same linear subspace. Depth without nonlinearity is redundant. This is the formal justification for using activation functions in autoencoders - they are not optional decoration, they are what separates the model from a glorified PCA.


Nonlinear Autoencoders and the Manifold Hypothesis

Add nonlinear activations - sigmoid, tanh, ReLU - and the autoencoder can learn curved structures. This connects to the manifold hypothesis: the assumption that high-dimensional data concentrates near a low-dimensional smooth manifold embedded in the ambient space.

A linear autoencoder can only learn a $d$-dimensional linear subspace (a flat plane through the origin). Nonlinear encoder-decoder pairs can learn a $d$-dimensional manifold of arbitrary curvature. Concretely: if the data lies on a $d$-dimensional sphere embedded in $\mathbb{R}^D$, a linear autoencoder with bottleneck dimension $d$ will fail (a sphere is not a linear subspace), but a nonlinear autoencoder can learn to map the sphere to a flat $d$-dimensional code and back.

The encoder $f$ acts as a coordinate chart for the manifold: it maps a neighborhood of the manifold to a flat region of $\mathbb{R}^d$. The decoder $g$ maps back to the manifold. Training on the reconstruction loss shapes both to be consistent: $g \circ f$ must be approximately the identity on the data manifold, but can do anything off it. As training progresses, the Jacobian of $f$ at any data point $x$ will tend to have only $d$ large singular values - the directions along the manifold - and small values in the normal directions, which the encoder compresses away.

This is not just a geometric metaphor. It is why learned representations generalize: if the encoder captures the true degrees of freedom of the data, then points that are nearby on the manifold (similar images, similar sentences) map to nearby codes, and the downstream classifier can draw simple boundaries in code space even when the raw data requires complicated decision surfaces.


Denoising Autoencoders

A standard autoencoder trained to minimize $|x - g(f(x))|^2$ has a pathological solution: if $d \geq D$, the network can learn the identity map exactly, achieving zero loss without learning anything. Even with $d < D$, a sufficiently expressive network can overfit and fail to generalize. The denoising autoencoder (DAE), introduced by Vincent et al., fixes this by changing the problem.

Instead of reconstructing $x$ from $x$, the DAE is trained to reconstruct a clean input $x$ from a corrupted version $\tilde{x}$. Two corruption processes are common:

Gaussian noise: $\tilde{x} = x + \varepsilon$ where $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$.

Masking noise: each coordinate of $x$ is independently set to zero with probability $p$ (also called dropout noise).

The loss is:

$$\mathcal{L}{\mathrm{DAE}} = \mathbb{E}{x \sim p_{\mathrm{data}}} \mathbb{E}_{\tilde{x} \sim q(\tilde{x}|x)} \left[ | x - g(f(\tilde{x})) |^2 \right]$$

The network sees corrupted $\tilde{x}$ but must predict clean $x$. This forces it to learn which directions in input space are informative (and thus worth recovering) and which are noise. The identity map is no longer a valid solution - applying the identity to $\tilde{x}$ reproduces the noise, not the signal.

Why DAEs learn the manifold. The optimal reconstruction $g(f(\tilde{x}))$ under squared loss is the conditional mean $\mathbb{E}[x \mid \tilde{x}]$. For small Gaussian noise $\varepsilon$ with covariance $\sigma^2 I$, this conditional mean points from $\tilde{x}$ back toward the data manifold: roughly, $\mathbb{E}[x \mid \tilde{x}] \approx \tilde{x} + \sigma^2 \nabla_x \log p(\tilde{x})$. The reconstruction vector $g(f(\tilde{x})) - \tilde{x}$ therefore approximates the score function $\nabla_x \log p(x)$ of the data distribution.

This connection - denoising is equivalent to score matching - is the foundation of modern diffusion models. A DAE trained at noise level $\sigma$ is learning to estimate the score of the data distribution convolved with Gaussian noise of that level. The data manifold is the zero-set of the score-based vector field: it is where all the denoising arrows point toward. By forcing the network to denoise, we force it to learn the geometry of the data distribution, not just a compact representation of the training set.

Robustness. DAE representations are empirically more robust than standard AE representations. Because the encoder must handle corrupted inputs during training, it learns features that are stable under small perturbations - which is exactly what you want for downstream classification, where test inputs may differ slightly from training inputs.


Sparse Autoencoders

The bottleneck in a standard autoencoder achieves compression through dimensionality: $d \ll D$. A sparse autoencoder (SAE) achieves a different kind of compression: the code $h = f(x) \in \mathbb{R}^d$ is forced to be sparse, meaning that most of its entries are zero for any given input, even if $d \geq D$.

The training objective adds an $\ell_1$ penalty on the hidden activations:

$$\mathcal{L}_{\mathrm{SAE}} = | x - g(f(x)) |^2 + \lambda | h |_1$$

where $h = f(x)$ and $\lambda > 0$ controls the sparsity-reconstruction tradeoff. The $\ell_1$ norm $|h|_1 = \sum_k |h_k|$ is chosen over $\ell_2$ because it induces true zeros in the solution (by the subdifferential analysis of the $\ell_1$ penalty, a coordinate $h_k$ is set to exactly zero whenever the reconstruction gradient at that coordinate is small relative to $\lambda$).

Overcomplete dictionaries. The striking regime is $d > D$: the code dimension exceeds the input dimension. A standard autoencoder with $d > D$ has no compression pressure at all and trivially learns the identity. An SAE with $d > D$ is different: the code is high-dimensional but must be sparse - only $k \ll d$ entries active per input. This forces the model to learn an overcomplete dictionary of $d$ features such that any input can be reconstructed from a small subset of them. The analogy is to sparse coding: just as $k$-sparse signals can be recovered from few measurements via compressed sensing when the dictionary is incoherent, an SAE learns a dictionary in which each training example activates a small, distinct subset of features.

Why sparsity helps interpretability. A dense code $h \in \mathbb{R}^d$ is hard to interpret: every coordinate mixes many underlying factors. A sparse code is far more interpretable: if $h_k$ is active only when a particular feature is present in $x$, then $h_k$ can be associated with that feature by inspection. This is the basis for the application of sparse autoencoders in mechanistic interpretability of large language models.

The superposition hypothesis, developed in circuits-style research, posits that neural networks store more concepts than they have neurons by representing each concept as a sparse linear combination of neurons (the neurons “superpose” many features in their activations). SAEs trained on the residual stream or MLP activations of a transformer can decompose these superposed representations: each SAE feature corresponds to a specific, human-interpretable concept (a name, a syntactic role, a factual category). Because the dictionary is overcomplete and the codes are sparse, each concept activates on the appropriate inputs and is silent otherwise - recovering the individual “virtual neurons” that the transformer has mixed together.


Contractive Autoencoders

The contractive autoencoder (CAE) regularizes the encoder by penalizing the sensitivity of the code to small input perturbations. Specifically, it adds a penalty on the Frobenius norm of the Jacobian of the encoder:

$$\mathcal{L}_{\mathrm{CAE}} = | x - g(f(x)) |^2 + \lambda | J_f(x) |_F^2$$

where $J_f(x) \in \mathbb{R}^{d \times D}$ is the Jacobian of $f$ at $x$, with $(J_f)_{ij} = \partial f_i / \partial x_j$.

The Frobenius norm $|J_f(x)|F^2 = \sum{i,j} (\partial f_i / \partial x_j)^2$ measures the total squared sensitivity: how much the code changes when the input is perturbed in any direction. Minimizing this forces the encoder to be locally contractive - to map nearby inputs to nearby codes, with small derivatives.

Why contraction is the right regularizer. The encoder’s Jacobian has two kinds of directions: those tangent to the data manifold, and those normal to it. The reconstruction loss requires that the encoder be invertible (via the decoder) along the manifold - otherwise reconstruction fails. So the encoder’s Jacobian cannot be small in the tangent directions. But in the normal directions - perturbations that move $x$ off the manifold - the encoder is free to contract sharply. The Jacobian penalty therefore pushes the encoder to have small singular values in the normal directions and larger ones in the tangent directions. This means the encoder has learned to distinguish tangent from normal: it knows the manifold.

Connection to DAEs. Alain and Bengio (2014) showed that for small Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, a DAE and a CAE converge to the same solution in the limit $\sigma \to 0$ when $\lambda = \sigma^2$. Both regularizers enforce the same geometric constraint - that the representation is insensitive to off-manifold perturbations - but via different mechanisms. The DAE does it by averaging over corrupted inputs; the CAE does it by directly penalizing the Jacobian. The equivalence confirms that the Jacobian penalty is the “right” thing to regularize.


Restricted Boltzmann Machines

Restricted Boltzmann Machines take a completely different approach. Rather than parameterizing encoder and decoder directly, an RBM is an energy-based model that defines a joint probability distribution over visible units $v \in \{0,1\}^D$ and hidden units $h \in \{0,1\}^d$.

Energy function. The energy of a configuration $(v, h)$ is:

$$E(v, h) = -v^\top W h - b^\top v - c^\top h$$

where $W \in \mathbb{R}^{D \times d}$ is the weight matrix connecting visible to hidden units, $b \in \mathbb{R}^D$ are visible biases, and $c \in \mathbb{R}^d$ are hidden biases. The joint distribution is the Boltzmann distribution:

$$P(v, h) = \frac{1}{Z} \exp(-E(v, h)) = \frac{1}{Z} \exp(v^\top W h + b^\top v + c^\top h)$$

where the partition function $Z = \sum_{v,h} \exp(-E(v,h))$ normalizes the distribution. The marginal over visible units is $P(v) = \sum_h P(v,h)$, which the model is trained to maximize on the data.

The restriction. In a Boltzmann machine, every unit can be connected to every other. The “restricted” version has no connections within the visible layer and no connections within the hidden layer - only between them. This bipartite structure gives a crucial property: the visible and hidden units are conditionally independent given the other layer.

Conditional distributions. Because there are no hidden-hidden or visible-visible connections, the conditional distributions factor completely:

$$P(h_j = 1 \mid v) = \sigma\left(c_j + v^\top W_{:j}\right)$$

$$P(v_i = 1 \mid h) = \sigma\left(b_i + W_{i:} h\right)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function and $W_{:j}$ is the $j$-th column of $W$. Each hidden unit $h_j$ turns on with probability that depends on how much the input $v$ aligns with the $j$-th column of $W$ - the $j$-th feature detector. Each visible unit $v_i$ turns on with probability determined by how much the hidden configuration $h$ supports pixel $i$.

This factored structure means that given $v$, all $h_j$ can be sampled in parallel (no need to sample one at a time). Likewise given $h$, all $v_i$ can be sampled in parallel. This makes block Gibbs sampling efficient: alternate between sampling the full hidden layer given the visible, and sampling the full visible layer given the hidden.

Training: Contrastive Divergence. The log-likelihood gradient for an RBM is:

$$\frac{\partial \log P(v)}{\partial W} = \mathbb{E}{h \sim P(h|v)}[v h^\top] - \mathbb{E}{(v',h') \sim P(v,h)}[v' h'^\top]$$

The first term - the positive phase - is a sample from the model distribution with the visible units clamped to data. Since $P(h|v)$ is factorial, computing this expectation is tractable: $\mathbb{E}[h_j \mid v] = \sigma(c_j + v^\top W_{:j})$.

The second term - the negative phase - requires sampling from the joint $P(v,h)$, which is intractable in general (it requires running Gibbs sampling to convergence). Hinton’s contrastive divergence (CD-$k$) approximates this by running only $k$ steps of block Gibbs starting from the data:

  1. Start with a data point $v^{(0)} = x$.
  2. Sample $h^{(0)} \sim P(h \mid v^{(0)})$.
  3. Sample $v^{(1)} \sim P(v \mid h^{(0)})$ (this is the “reconstruction”).
  4. Repeat $k$ times to get $(v^{(k)}, h^{(k)})$.
  5. Use $(v^{(k)}, h^{(k)})$ in place of the true model sample.

The gradient update is:

$$\Delta W \propto \mathbb{E}[v^{(0)} (h^{(0)})^\top] - \mathbb{E}[v^{(k)} (h^{(k)})^\top]$$

with analogous updates for $b$ and $c$.

Why CD-1 works. With $k = 1$ step, the negative phase sample $v^{(1)}$ is just one step away from the data. This is a biased estimator of the true gradient, but it works well in practice for two reasons. First, the bias decreases as training progresses - as the model improves, the one-step reconstruction $v^{(1)}$ comes to resemble the true model distribution. Second, CD-1 can be seen as minimizing a different objective (the contrastive divergence $\mathrm{KL}(p_0 | p_\infty) - \mathrm{KL}(p_1 | p_\infty)$, where $p_k$ is the distribution after $k$ Gibbs steps from data), which is well-defined and has the right fixed points.

The positive phase increases the probability of data configurations; the negative phase decreases the probability of reconstructed (“fantasy”) configurations. This push-pull dynamic is the RBM’s analog of maximizing likelihood, and it is what gives the model its generative power.

Deep Boltzmann Machines. Multiple RBMs can be stacked: the hidden units of one become the visible units of the next. Training is typically done greedily, one layer at a time. This yields a deep generative model with multiple levels of abstraction in its hidden representations. Deep Boltzmann Machines (DBMs) have joint connections between all adjacent layers and require a more sophisticated training procedure (variational mean-field inference for the positive phase), but they capture richer dependencies than stacked RBMs.


Applications

The variants above each have a natural domain of application, shaped by what their regularizer enforces.

Anomaly detection. A standard autoencoder trained on normal data learns to reconstruct normal inputs well. Anomalous inputs - fraud, equipment faults, novel disease presentations - are off the learned manifold and reconstruct poorly. The reconstruction error $|x - g(f(x))|^2$ serves directly as an anomaly score, with no labeled anomaly examples needed during training. Denoising autoencoders work especially well here because the learned manifold is smoother and more stable.

Representation learning. The encoder $f$ of any trained autoencoder can be used as a feature extractor. The resulting code $z = f(x)$ is typically much lower-dimensional than $x$ and captures semantically meaningful variation. Linear classifiers trained on $z$ often match or exceed nonlinear classifiers trained on $x$ directly - evidence that the autoencoder has extracted the relevant structure.

Generative sampling. RBMs are genuine generative models: block Gibbs sampling from $P(v)$ produces new samples from the learned distribution. Run Gibbs for sufficiently many steps starting from random noise, and the chain converges to samples from the model. This gives RBMs the ability to hallucinate new examples - new faces, new digits - that were not in the training set but look like they could have been.

Mechanistic interpretability. Sparse autoencoders trained on transformer activations have become a primary tool in the circuits research agenda for understanding large language models. The SAE decomposes the dense, superposed activations of a model into a sparse combination of human-interpretable features. By identifying which SAE features activate on which inputs, researchers can reverse-engineer what a model has learned: which features encode factual knowledge, which encode syntax, which mediate specific reasoning steps. The overcomplete, sparse code is essential here - a dense code would not give clean feature-to-concept correspondences.


Summary

Model Loss Key property
Standard AE $|x - g(f(x))|^2$ Compression via bottleneck $d \ll D$
Linear AE $|x - g(f(x))|^2$, linear $f,g$ Equivalent to PCA; optimal $VW = V_d V_d^\top$
Nonlinear AE $|x - g(f(x))|^2$, nonlinear $f,g$ Learns curved manifolds; beyond PCA
Denoising AE $\mathbb{E}|x - g(f(\tilde{x}))|^2$ Learns score function; manifold-aware
Sparse AE $|x - g(f(x))|^2 + \lambda|h|_1$ Sparse codes; overcomplete dictionaries; interpretability
Contractive AE $|x - g(f(x))|^2 + \lambda|J_f|_F^2$ Small Jacobian off manifold; equiv. to DAE in limit
RBM Negative log-likelihood (CD-$k$) Energy-based; factorial conditionals; generative sampling

The through-line across all these models is the same insight: to learn structure in data, force the model to solve a constrained reconstruction problem. The constraint - dimensionality, sparsity, noise robustness, Jacobian size, or energy - determines what structure gets learned. Choose the constraint to match what you care about in the representation.


Read Next: