Mechanistic Interpretability
Mechanistic interpretability is the programme of reverse-engineering trained neural networks into human-understandable algorithms. Rather than asking “what does this model do on a benchmark?” it asks “what computation does this model actually implement, and how?” This post covers the core concepts and methods, from the circuits framework to sparse autoencoders.
The Circuits Hypothesis
The circuits hypothesis, introduced by Olah and collaborators in the Distill Circuits thread and developed further in Anthropic's Transformer Circuits work, proposes that neural networks learn interpretable algorithms implemented as circuits - subgraphs of the computation graph where specific features computed by earlier layers are routed through specific weights to produce meaningful intermediate computations.
A circuit is formally a tuple $(N, E)$ where $N$ is a set of nodes (neurons, attention heads, or entire layers in a coarser analysis) and $E$ is a set of edges weighted by the strength of connection. The claim is that many capabilities are localised to relatively small circuits, not distributed uniformly across the network.
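One minimal way to make this concrete - purely illustrative, not a standard library interface - is a small Python structure holding nodes and weighted edges:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    """A node in the computation graph: a neuron, attention head, or whole layer."""
    layer: int
    kind: str    # e.g. "mlp_neuron" or "attn_head" (illustrative labels)
    index: int

@dataclass
class Circuit:
    """A circuit (N, E): a subgraph with edges weighted by connection strength."""
    nodes: set[Node] = field(default_factory=set)
    edges: dict[tuple[Node, Node], float] = field(default_factory=dict)

    def add_edge(self, src: Node, dst: Node, weight: float) -> None:
        self.nodes.update((src, dst))
        self.edges[(src, dst)] = weight

# Example: an edge from head 4 in layer 0 to head 7 in layer 1.
circuit = Circuit()
circuit.add_edge(Node(0, "attn_head", 4), Node(1, "attn_head", 7), weight=0.8)
```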
Superposition and Polysemanticity
The superposition hypothesis explains a fundamental puzzle: networks appear to represent far more features than they have dimensions. A layer with $d$ neurons can represent at most $d$ orthogonal features, yet in practice probing studies find thousands of interpretable directions in spaces of hundreds of dimensions.
The resolution is that features are stored in superposition: they are not orthogonal but are instead nearly orthogonal, tolerating small interference. If features are sparse - most are inactive for any given input - the cross-feature interference remains small on average.
Formally, if $f \in \mathbb{R}^n$ is a feature vector (with $n \gg d$) and $W \in \mathbb{R}^{d \times n}$ maps features into the hidden layer, then the $i$-th component of the reconstruction $W^T W f$ is $\|W_i\|^2 f_i + \sum_{j \neq i} (W_i^T W_j) f_j$, where $W_i$ denotes the $i$-th column of $W$. The second sum is the interference. When the $f_j$ are sparse, most interference terms vanish for any given input and each active feature can be recovered approximately.
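The effect is easy to check numerically. The sketch below (dimensions and sparsity levels are made up for illustration) packs $n = 512$ random unit-norm feature directions into $d = 64$ dimensions and compares the recovery error for dense versus sparse feature vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512                                   # hidden dims, features (n >> d)

# Random feature directions as unit-norm columns: W in R^{d x n}.
W = rng.standard_normal((d, n))
W /= np.linalg.norm(W, axis=0, keepdims=True)

def recovery_error(active_fraction: float) -> float:
    """Mean absolute interference on the active features."""
    f = rng.standard_normal(n) * (rng.random(n) < active_fraction)
    f_hat = W.T @ (W @ f)                        # project down, read back out
    active = f != 0
    return float(np.mean(np.abs(f_hat[active] - f[active])))

print("dense  (100% of features active):", recovery_error(1.00))
print("sparse (  2% of features active):", recovery_error(0.02))
# The sparse case shows far smaller interference per active feature.
```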
This explains polysemantic neurons: a single neuron activates for multiple seemingly unrelated concepts because it participates in representing several superposed features.
Sparse Autoencoders
Sparse autoencoders (SAEs) are the primary tool for extracting monosemantic features from polysemantic activations. Given a residual stream vector $x \in \mathbb{R}^d$, an SAE computes:
$$\hat{x} = W_d \, \text{ReLU}(W_e x + b_e) + b_d$$
where $W_e \in \mathbb{R}^{m \times d}$ is an encoder with $m \gg d$ hidden units, and $W_d \in \mathbb{R}^{d \times m}$ is a decoder constrained to have unit-norm columns. The training loss is:
$$L = \|x - \hat{x}\|_2^2 + \lambda \, \|\text{ReLU}(W_e x + b_e)\|_1$$
The $L_1$ sparsity penalty drives most hidden units to zero for any given input, encouraging each active unit to correspond to a single interpretable feature. SAEs trained at scale on transformer residual streams (Anthropic’s work on Claude activations) recover tens of thousands to millions of features, including specific named entities, syntactic roles, and emotional valences - far more than the number of dimensions in the residual stream.
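A minimal PyTorch sketch of such an SAE - hyperparameters and the column-renormalisation step are illustrative rather than taken from any particular published implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)   # W_e x + b_e
        self.decoder = nn.Linear(d_hidden, d_model)   # W_d f + b_d
        self.l1_coeff = l1_coeff

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))                   # sparse feature activations
        x_hat = self.decoder(f)                       # reconstruction of x
        loss = F.mse_loss(x_hat, x) + self.l1_coeff * f.abs().sum(-1).mean()
        return x_hat, f, loss

    @torch.no_grad()
    def renormalize_decoder(self) -> None:
        # Keep each feature's decoder direction (a column of W_d) at unit norm.
        w = self.decoder.weight                       # shape [d_model, d_hidden]
        w.div_(w.norm(dim=0, keepdim=True))
```

Calling renormalize_decoder() after each optimiser step enforces the unit-norm constraint on decoder columns; the loss matches the objective above, with the $L_1$ term applied to the ReLU feature activations.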
Attention Head Decomposition
Each attention head can be decomposed into two matrices that capture its computational role:
$$W_{OV} = W_O W_V \in \mathbb{R}^{d \times d}, \qquad W_{QK} = W_Q^T W_K \in \mathbb{R}^{d \times d}$$
$W_{QK}$ determines what each position attends to: token $i$ attends strongly to token $j$ when $e_i^T W_{QK} e_j$ is large, where $e_i$, $e_j$ are the residual stream vectors at those positions.
$W_{OV}$ determines what information is moved: when position $j$ is attended to, the vector $W_{OV} e_j$ is added to position $i$’s residual stream.
This decomposition allows rigorous analysis of what a head “does” without needing to run the full model.
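As a sketch with illustrative shapes (assuming the convention that $W_Q$, $W_K$, $W_V$ map from the residual stream to the head dimension and $W_O$ maps back), the composite matrices are formed by simple matrix products:

```python
import numpy as np

d_model, d_head = 768, 64
rng = np.random.default_rng(0)

# Per-head projections (random placeholders standing in for trained weights).
W_Q = rng.standard_normal((d_head, d_model)) / np.sqrt(d_model)
W_K = rng.standard_normal((d_head, d_model)) / np.sqrt(d_model)
W_V = rng.standard_normal((d_head, d_model)) / np.sqrt(d_model)
W_O = rng.standard_normal((d_model, d_head)) / np.sqrt(d_head)

W_QK = W_Q.T @ W_K      # (d_model, d_model): bilinear form scoring query/key pairs
W_OV = W_O @ W_V        # (d_model, d_model): what is written when a position is attended to

e_i = rng.standard_normal(d_model)   # residual vector at the attending position
e_j = rng.standard_normal(d_model)   # residual vector at the attended position
attention_score = e_i @ W_QK @ e_j   # large => position i attends strongly to position j
moved_vector = W_OV @ e_j            # contribution added to position i's residual stream
```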
Induction Heads
Induction heads are the second half of a two-head circuit that implements in-context sequence copying. The pattern: given tokens $[A][B] \ldots [A]$, the circuit predicts $[B]$ to follow the second $[A]$.
The mechanism requires two heads working in sequence:
- A previous-token head (layer 0) shifts information: it copies the embedding of token $t$ into the residual stream at position $t+1$.
- An induction head (layer 1) attends back to the previous occurrence of the current token by using the previous-token head’s output in its key - effectively searching for positions whose following token matches the current token, then copying that following token’s value.
Induction heads are present in virtually every transformer with two or more layers and are thought to underlie much of the in-context learning ability of large models.
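A standard diagnostic follows from this description: on a prompt made of a random token sequence followed by an exact repeat, an induction head should attend from each token in the second half back to the position just after that token's first occurrence. The sketch below scores heads from a generic attention-pattern array (the tensor layout is an assumption; adapt it to whatever caching mechanism your codebase provides):

```python
import numpy as np

def induction_scores(pattern: np.ndarray, half_len: int) -> np.ndarray:
    """pattern: [n_heads, seq, seq] attention probabilities for one prompt that
    consists of `half_len` random tokens followed by the same tokens repeated.
    For a query at position t in the repeated half, an induction head attends to
    position t - half_len + 1 (the token after the previous occurrence)."""
    n_heads, seq, _ = pattern.shape
    queries = np.arange(half_len, seq)               # positions in the repeated half
    keys = queries - half_len + 1                    # "previous occurrence, plus one"
    return pattern[:, queries, keys].mean(axis=-1)   # one score per head

# Heads whose score is close to 1.0 on such prompts behave like induction heads.
```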
Logit Lens and Direct Logit Attribution
The logit lens reads off intermediate predictions by projecting the residual stream at layer $l$ directly through the unembedding matrix:
$$\hat{p}_l^{(t)} = \text{softmax}(W_U h_l^{(t)})$$
where $h_l^{(t)} \in \mathbb{R}^d$ is the residual stream at layer $l$, position $t$, and $W_U \in \mathbb{R}^{V \times d}$ is the unembedding matrix. This reveals how the model’s “best guess” at each token evolves through the layers.
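A sketch of the logit lens over a cached residual stream (the tensor layout is assumed; the final LayerNorm that real implementations usually apply before unembedding is omitted for brevity):

```python
import torch

def logit_lens(resid_by_layer: torch.Tensor, W_U: torch.Tensor,
               position: int, top_k: int = 5) -> list[torch.Tensor]:
    """resid_by_layer: [n_layers, seq, d_model] residual stream after each layer.
    W_U: [vocab, d_model] unembedding matrix.
    Returns the top-k predicted token ids at `position` for every layer."""
    guesses = []
    for resid in resid_by_layer:
        logits = resid[position] @ W_U.T          # project straight to vocab space
        guesses.append(torch.topk(logits, top_k).indices)
    return guesses
```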
Direct logit attribution decomposes the final logit for a specific token $v$ into contributions from each component:
$$W_U^{(v)} h_L^{(t)} = \sum_{\ell} W_U^{(v)} \left(\text{MLP}_\ell^{(t)} + \sum_h \text{Attn}_{\ell,h}^{(t)}\right)$$
since the residual stream is a sum of all component outputs. This gives a precise account of which layers and heads are responsible for promoting or suppressing specific tokens.
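Given per-component outputs at the position of interest (again standing in for a real activation cache), the attribution is just a dot product with the relevant unembedding row:

```python
import torch

def direct_logit_attribution(component_outputs: dict[str, torch.Tensor],
                             W_U: torch.Tensor, token_id: int) -> dict[str, float]:
    """component_outputs: name -> [d_model] vector each component wrote to the
    residual stream at the position of interest (e.g. "L5.attn.h3", "L7.mlp";
    the naming scheme is illustrative). Returns each component's additive
    contribution to the logit of `token_id`."""
    u_v = W_U[token_id]                              # unembedding row for token v
    return {name: float(out @ u_v) for name, out in component_outputs.items()}

# Because the residual stream is a sum of these outputs, the contributions add up
# (up to final-LayerNorm effects) to the model's actual logit for token_id.
```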
Activation Patching
Activation patching (or causal tracing) provides causal evidence for circuit claims. The procedure:
- Run the model on a clean input and record all intermediate activations.
- Run the model on a corrupted input (e.g., with a key fact changed) and record the output.
- For each component of interest, re-run on the corrupted input but patch in the clean activation at that component.
- Measure the change in output toward the clean prediction.
If patching a single component’s clean activation restores the clean prediction, that component carries the causally relevant information for the computation. Iterating this over components lets researchers identify the minimal circuit responsible for a capability.
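A sketch of a single patching step using plain PyTorch forward hooks (in practice this is usually done with a hooked-transformer library; the module handle and the inputs here are placeholders):

```python
import torch

def run_with_patch(model, module, clean_activation, corrupted_inputs):
    """Re-run `model` on corrupted inputs, overwriting `module`'s output with its
    activation recorded from the clean run, and return the patched logits."""
    def hook(mod, inputs, output):
        return clean_activation                  # replace this component's output

    handle = module.register_forward_hook(hook)
    try:
        with torch.no_grad():
            patched_logits = model(corrupted_inputs)
    finally:
        handle.remove()
    return patched_logits
```

The restoration metric is typically the patched logit difference between the clean and corrupted answers, normalised so that 0 means the patch had no effect and 1 means it fully recovered the clean behaviour.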
Examples
The IOI Circuit. The indirect object identification (IOI) task asks the model to complete “Mary and John went to the store. John gave a book to ___” with “Mary”. Wang et al. identified the circuit responsible in GPT-2 small: name-mover heads in late layers copy the indirect object name into the output; duplicate-token heads in mid layers detect that the subject name (“John”) appears twice; S-inhibition heads carry that signal forward and stop the name movers from attending to the repeated subject. The full circuit involves fewer than 30 of GPT-2 small’s 144 attention heads.
Factual Recall Mechanism. For prompts of the form “The Eiffel Tower is located in ___”, mechanistic analysis shows that mid-layer MLP blocks act as a “knowledge store”: specific neurons fire for the subject entity and causally produce the correct attribute in the residual stream. Patching these MLP outputs across different subject entities transplants factual associations, demonstrating that factual knowledge is localised to identifiable components rather than distributed uniformly.