Mechanistic Interpretability - What Is the Model Actually Computing?
Helpful context:
- Transformers From First Principles - Why Attention Changed Everything
- Linear Transformations - Geometry Encoded as Arithmetic
You trained a neural network to play chess. It wins against grandmasters. But you have no idea how. Is it doing pattern matching? Calculating variations many moves ahead? Forming abstract concepts like “king safety” or “pawn structure”? You can’t look inside. The weights are 50 million floating-point numbers with no labels. You know the input (board position), the output (move), and the fact that training worked. You don’t know what it computed to get from one to the other.
Now multiply this uncertainty by a thousand and you have GPT-4. Mechanistic interpretability is the research program that asks: can we reverse-engineer what neural networks actually compute, at the level of specific algorithms implemented by specific components?
This is not interpretability in the usual sense - where you ask “what features matter for this prediction?” Mechanistic interpretability asks a harder question: what is the precise computational procedure the network executes?
The Problem: Weights Are Not Algorithms
When you learn arithmetic, you can describe your algorithm. “To multiply 47 by 8: multiply 40 by 8 (320), multiply 7 by 8 (56), add (376).” The algorithm is inspectable. You can explain it, verify it, modify it.
A neural network’s weight matrix encodes an algorithm, but not in any human-readable form. You have a 2048×2048 matrix of floating-point numbers. Somewhere in those roughly 4 million entries is the information that allows the model to add two numbers. But which entries? How are they organized? What computation do they implement?
The naive approach - “look at the weights” - fails because individual weights have no meaning in isolation. A weight of 0.73 in one context and -0.31 in another are both meaningless without understanding the entire matrix they participate in. The meaning is distributed.
Mechanistic interpretability tries to find the distributed structures - the circuits, representations, and algorithms - that emerge from training.
The Linear Representation Hypothesis
The starting hypothesis: neural networks represent concepts as directions in high-dimensional activation space.
What does this mean concretely? Consider the residual stream of a transformer - the $d$-dimensional vector that accumulates information as it passes through layers. The linear representation hypothesis says: meaningful properties of the current token (or context) correspond to linear directions in this space. “This token is a verb” is a direction. “The subject of the current sentence is male” is a direction. “We’re currently inside a Python function definition” is a direction.
Operations on concepts correspond to vector operations on directions. Adding vectors adds features. Subtracting vectors removes features. Linear classifiers can detect features.
The canonical early evidence: Word2Vec (Mikolov et al., 2013). Train word embeddings on a text corpus. Then:
$$\text{king} - \text{man} + \text{woman} \approx \text{queen}$$
The gender direction (man → woman) can be added to the king direction to produce the queen direction. Analogy relationships - capital cities, verb tenses, comparatives - all appear as linear offsets in embedding space. This was widely replicated and extended.
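The analogy arithmetic can be sketched with a handful of hand-built vectors. The embeddings below are hypothetical (three dimensions chosen so that one axis loosely means "royalty" and another "gender"); real Word2Vec vectors are learned and have hundreds of dimensions, and the helper names are illustrative only.

```python
import math

# Hypothetical 3-d embeddings: axes are roughly (royalty, gender, other).
emb = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word nearest to emb[a] - emb[b] + emb[c], excluding the inputs."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

result = analogy("king", "man", "woman")
print(result)
```

Excluding the three input words from the candidates mirrors how such analogy tests are usually scored; without that exclusion the nearest neighbour is often one of the inputs.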
Probing classifiers provide a more systematic test. Take the activations of a model at some layer, and train a linear classifier to predict a linguistic property (part of speech, syntactic role, named entity type). If the linear classifier succeeds - which it does, consistently, for many properties - then those properties are linearly decodable from the activations. Nonlinear probes often add little accuracy for these properties, which is consistent with the representations being linear. One caveat: a successful probe shows that the information is present, not that the model actually uses it.
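A minimal probing sketch, under strong assumptions: the "activations" here are synthetic 8-dimensional vectors in which a binary property is written along one fixed direction plus Gaussian noise, and the probe is a logistic regression trained with plain SGD. A real probe would read activations from an actual model layer; all names and constants below are illustrative.

```python
import math
import random

random.seed(0)
D = 8  # toy activation width

# Assumption: the property is encoded along one fixed unit direction.
direction = [random.gauss(0, 1) for _ in range(D)]
norm = math.sqrt(sum(x * x for x in direction))
direction = [x / norm for x in direction]

def make_example():
    """Synthetic activation: noise plus +/-2 along the feature direction."""
    label = random.randint(0, 1)
    sign = 1.0 if label else -1.0
    act = [random.gauss(0, 0.5) + sign * 2.0 * d for d in direction]
    return act, label

train = [make_example() for _ in range(400)]
test = [make_example() for _ in range(200)]

# Linear probe: logistic regression trained with SGD.
w, b = [0.0] * D, 0.0
for _ in range(20):
    for act, label in train:
        z = sum(wi * xi for wi, xi in zip(w, act)) + b
        p = 1.0 / (1.0 + math.exp(-z))
        grad = p - label  # gradient of the log loss w.r.t. z
        w = [wi - 0.1 * grad * xi for wi, xi in zip(w, act)]
        b -= 0.1 * grad

def predict(act):
    return int(sum(wi * xi for wi, xi in zip(w, act)) + b > 0)

accuracy = sum(predict(a) == y for a, y in test) / len(test)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

Evaluating on held-out examples matters: a probe with enough capacity can memorize its training set, which would say nothing about how the property is encoded.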
Logit lens (nostalgebraist, 2020): at each intermediate layer of a language model, project the residual stream directly onto the vocabulary using the final unembedding matrix. This gives you a “prediction” at each layer - the model’s best guess at the next token if it had to stop there. You can watch the prediction evolve: early layers often predict words in the same syntactic category, middle layers narrow to semantically similar words, late layers converge to the specific correct word. This visualization makes the incremental refinement process legible.
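The projection step itself is small enough to sketch. Everything numeric below is made up for illustration: a four-word vocabulary, a 3-dimensional residual stream, a hypothetical unembedding matrix, and invented residual states for three layers. The sketch also skips the final layer normalization that is usually applied before unembedding.

```python
vocab = ["the", "cat", "dog", "Paris"]

# Hypothetical unembedding matrix: one row (d_model = 3) per vocabulary item.
W_U = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.2],
    [0.0, 0.9, -0.2],
    [0.0, 0.0, 1.0],
]

# Invented residual stream states after each layer, for one token position.
residual_by_layer = [
    [0.5, 0.1, 0.0],
    [0.2, 0.6, 0.1],
    [0.1, 0.9, 0.5],
]

def logit_lens(resid):
    """Project a residual vector onto the vocabulary and return the top token."""
    logits = [sum(w * r for w, r in zip(row, resid)) for row in W_U]
    best = max(range(len(vocab)), key=lambda i: logits[i])
    return vocab[best]

lens_predictions = [logit_lens(r) for r in residual_by_layer]
print(lens_predictions)
```

With a real model, `residual_by_layer` would come from cached activations and `W_U` from the model's own unembedding weights; the per-layer loop stays the same.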
Circuits: Algorithms Inside Networks
The circuits program (Olah et al., 2020, and subsequent work from Anthropic and others) takes a more surgical approach: identify specific subgraphs of neurons and attention heads that implement specific algorithms.
Induction heads are the cleanest and most important example (Olsson et al., 2022).
An induction head implements the following operation: given a sequence that contains a repeated pattern [A][B]...[A], predict [B] after the second occurrence of [A]. In English: if the text so far reads “John went to the store. John”, then the induction head detects that “went” followed “John” before, and it boosts the probability of “went” now.
This is in-context learning in its most basic form: the model is using information from earlier in the context to make predictions. The remarkable finding from Olsson et al.: induction heads formed in every multi-layer transformer they examined (across scales, across architectures, across training setups). One-layer models cannot form them, because the circuit needs two layers. They appear early in training - often within the first 10% of training steps - in a sudden transition associated with a phase change in training loss. And the paper presents evidence that they are causally responsible for much of the in-context learning that language models exhibit.
The circuit implementing induction behavior requires two attention heads working together across two layers:
- Previous token head (layer $\ell$): for token at position $i$, attends to the token at position $i-1$. This writes information about “what came before this token” into the residual stream.
- Induction head (layer $\ell+1$): uses what the previous token head wrote to attend from the current token [A] back to the position whose preceding token was also [A]. That position holds [B], and the head copies it into the current prediction.
This is a two-hop lookup: every position first records “who came before me”; the current token then asks “which earlier position was preceded by a token like me?” and copies what it finds there. It’s implemented by two attention heads with a specific communication pattern between them.
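The two-head circuit above can be written out as an explicit algorithm. This is a symbolic sketch of the rule the circuit approximates, not of attention itself: real heads attend softly over many positions, whereas the hypothetical `induction_predict` below picks the single most recent match.

```python
def induction_predict(tokens):
    """Predict the next token by the induction rule [A][B]...[A] -> [B].

    Step 1 (previous token head): tag every position with the token before it.
    Step 2 (induction head): from the last position, find the most recent
    position whose tag equals the current token, and copy that position's token.
    """
    prev_token = [None] + tokens[:-1]           # step 1: "who came before me"
    current = tokens[-1]
    for pos in range(len(tokens) - 1, 0, -1):   # scan right to left
        if prev_token[pos] == current:          # step 2: tag matches current token
            return tokens[pos]                  # copy what followed the earlier [A]
    return None                                 # no earlier occurrence: rule is silent

context = ["John", "went", "to", "the", "store", ".", "John"]
prediction = induction_predict(context)
print(prediction)
```

Note that the rule never looks at what the tokens mean. It would complete a repeated string of random tokens just as readily, which is one way induction behavior is tested in practice.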
Indirect object identification (Wang et al., 2022): In a sentence like “Mary and John went to the store; Mary gave ___”, the model must identify “John” (the indirect object) as the correct completion. Working with GPT-2 small, the researchers found a circuit spanning multiple layers and attention heads, with identifiable roles: duplicate token heads that detect that “Mary” has appeared twice, S-inhibition heads that suppress the repeated subject from being predicted as the indirect object, name mover heads that copy the remaining name, and backup name mover heads that take over when the name movers are knocked out.
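The circuit was identified with causal interventions: knocking out attention heads (setting their output to zero or to mean activations) or patching individual paths, and measuring the effect on model performance (for example, the logit difference between the correct and the incorrect name). Heads whose removal caused the largest drop were assigned to the circuit.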
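Mean ablation can be sketched on a toy additive model. The setup is an assumption made for illustration: each "head" contributes a scalar to the logit difference between two names, the contributions are invented, and the metric is the logit difference in favour of the correct name rather than accuracy.

```python
# Hypothetical per-head contributions to logit(name_A) - logit(name_B).
# y = +1 when name_A is the correct answer, -1 when name_B is.
def head_outputs(y):
    return {"name_mover": 2.0 * y, "s_inhibition": 0.5 * y, "unrelated": 0.3}

labels = [1, -1, 1, -1, 1, -1]
runs = [head_outputs(y) for y in labels]
head_names = list(runs[0])

# Mean activation of each head over the dataset, used as the ablation value.
mean_output = {h: sum(r[h] for r in runs) / len(runs) for h in head_names}

def signed_logit_diff(ablated=None):
    """Average logit difference in favour of the correct name."""
    total = 0.0
    for y, run in zip(labels, runs):
        logit_diff = sum(
            mean_output[h] if h == ablated else run[h] for h in head_names
        )
        total += y * logit_diff
    return total / len(labels)

baseline = signed_logit_diff()
effect = {h: baseline - signed_logit_diff(ablated=h) for h in head_names}
print(baseline, effect)
```

The `unrelated` head has a nonzero output but no ablation effect, because replacing a constant with its mean changes nothing. This is the argument for mean ablation over zero ablation: zeroing that head would shift the logits and could look like an effect.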
The Superposition Hypothesis
If every concept were encoded in its own dedicated neuron, interpretability would be easy: find the “Paris neuron,” the “plural noun neuron,” the “code neuron.” But this is not what happens.
The superposition hypothesis (Elhage et al., 2022) proposes something more nuanced: a neural network with $d$ dimensions can represent more than $d$ features, by encoding multiple features in an overlapping (superimposed) way, at the cost of interference.
Here’s the key idea. Suppose features are sparse - most features are “off” (zero) at any given time. For example, in a typical sentence, the “referring to Paris” feature is off for 99% of tokens. If features are sparse, then packing multiple features into the same subspace causes only occasional interference (when multiple features happen to be active simultaneously).
The geometry: $n > d$ vectors in $\mathbb{R}^d$ cannot all be exactly orthogonal, but they can be nearly orthogonal, with small pairwise inner products, and you can pack far more such vectors into $\mathbb{R}^d$ than you might expect. Specifically, for a fixed tolerance on the inner products, the number of near-orthogonal vectors that fit grows exponentially with $d$. This is closely related to the Johnson-Lindenstrauss lemma. A network can exploit this to pack many features into a small number of dimensions.
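A quick numerical check of the geometric claim, using random unit vectors as a stand-in for learned feature directions (an assumption: trained networks arrange their features deliberately, not randomly). In both cases below there are 1.5 times as many vectors as dimensions; the sizes are arbitrary and kept small so the pure-Python loop stays fast.

```python
import math
import random

random.seed(0)

def random_unit_vector(d):
    """Sample a direction uniformly on the unit sphere in R^d."""
    v = [random.gauss(0, 1) for _ in range(d)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def max_abs_cosine(n, d):
    """Largest |cosine| between any two of n random unit vectors in R^d."""
    vecs = [random_unit_vector(d) for _ in range(n)]
    return max(
        abs(sum(a * b for a, b in zip(vecs[i], vecs[j])))
        for i in range(n)
        for j in range(i + 1, n)
    )

low_dim = max_abs_cosine(n=12, d=8)       # 12 directions in 8 dimensions
high_dim = max_abs_cosine(n=300, d=200)   # 300 directions in 200 dimensions
print(f"worst overlap, d=8:   {low_dim:.2f}")
print(f"worst overlap, d=200: {high_dim:.2f}")
```

The worst-case overlap is the relevant quantity for interference: it bounds how much one active feature can leak into the readout of another.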
Implication: individual neurons are not the right unit of analysis. A neuron that “activates for Paris” might also activate (slightly) for “museum,” “tower,” and “French city” - because all of these concepts are stored as nearby directions. When you probe individual neurons, you find polysemantic behavior: the same neuron responding to apparently unrelated inputs.
This makes interpretability much harder. You can’t just look at the neuron with the highest activation and say what it represents. You need to find the actual directions (features) in the high-dimensional space.
Sparse Autoencoders: Extracting Features
If features are superimposed as directions in activation space, how do you find them?
A sparse autoencoder (SAE) is a technique for doing this. Train an autoencoder with a large hidden layer to reconstruct activations, subject to a sparsity constraint on the hidden layer. The constraint forces the network to represent each activation as a sparse combination of “dictionary features.”
$$\text{hidden} = \text{ReLU}(W_{\text{enc}}(x - b_{\text{dec}}) + b_{\text{enc}})$$
$$\hat{x} = W_{\text{dec}} \cdot \text{hidden} + b_{\text{dec}}$$
with a loss:
$$\mathcal{L} = \|x - \hat{x}\|_2^2 + \lambda \|\text{hidden}\|_1$$
The $L_1$ penalty encourages sparsity: most hidden units are zero for any given activation $x$. The dictionary vectors (columns of $W_{\text{dec}}$) become the “features.”
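The forward pass and loss translate directly into code. The sketch below follows the three equations above term by term; the weights are a hand-picked toy dictionary (three features in a 2-dimensional activation space), not trained values, and the training loop is omitted.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """hidden = ReLU(W_enc (x - b_dec) + b_enc); x_hat = W_dec hidden + b_dec."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(W_enc, centered), b_enc)]
    hidden = [max(0.0, p) for p in pre]
    x_hat = [r + b for r, b in zip(matvec(W_dec, hidden), b_dec)]
    return hidden, x_hat

def sae_loss(x, x_hat, hidden, lam):
    """Squared reconstruction error plus L1 sparsity penalty."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(h) for h in hidden)
    return recon + lam * sparsity

# Toy dictionary (illustrative, untrained). W_dec has shape d x m, so its
# columns (1,0), (0,1), (-1,-1) are the three dictionary vectors.
W_dec = [[1.0, 0.0, -1.0],
         [0.0, 1.0, -1.0]]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]  # shape m x d
b_enc = [0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

x = [2.0, 0.0]
hidden, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, hidden, lam=0.1)
print(hidden, x_hat, loss)
```

In this toy case the input is reconstructed exactly by a single active feature, so the whole loss comes from the sparsity term. During training, $\lambda$ controls the trade-off between reconstruction quality and how few features fire.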
If the superposition hypothesis is correct, the SAE should decompose each activation into a small number of active, interpretable features. Anthropic’s recent work (Templeton et al., 2024) applied SAEs to a large language model and found millions of interpretable features: the feature for “the Eiffel Tower,” the feature for “DNA base pairs,” the feature for “code inside a Python function,” the feature for “a specific Supreme Court justice.” These features had clear semantic interpretations when probed by finding the text examples that maximally activated them.
This is striking: the network has rich, inspectable internal representations, but they were hidden in superposition. The SAE is the lens that makes them visible.
Discomfort check. The circuit-level results we have are mostly from small, well-controlled models: the GPT-2 family (124M to 1.5B parameters), simple two-layer transformers, or specific components of larger models. Whether the same circuits exist in GPT-4 or Claude 3 - and whether the circuits are the same type of computation or something more complex - is not established. The circuits found so far are existence proofs: they show that transformers can implement clean algorithms. They don’t show that every capability in a large model is implemented this way. We don’t have a complete mechanistic explanation of any non-trivial computation in any capable model. The field is young, the tools are primitive, and the models being studied are tiny relative to frontier models.
Activation Patching and Causal Tracing
To move from correlation (“this component activates for this input”) to causation (“this component is responsible for this output”), mechanistic interpretability uses activation patching (also called causal tracing, Meng et al., 2022).
The protocol:
- Run the model on a clean input (e.g., “The Eiffel Tower is in”, which the model completes with “Paris”) and record all intermediate activations.
- Run the model on a corrupted input (e.g., “The Colosseum is in”) that changes the answer.
- For each component (attention head, MLP layer, residual stream position), “patch” the corrupted run by replacing its activation with the clean-run value.
- Measure how much this single patch restores the correct output (“Paris”).
The components whose patching most restores the output are the strongest candidates for storing or computing that information.
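The protocol above can be sketched on a toy "model" with three named components. Everything here is hypothetical: the component names, the numeric encoding of subjects, and the rule that turns the final activation into an answer. A real implementation would cache and overwrite activations with framework hooks instead of a `patch` dictionary.

```python
SUBJECT_CODE = {"Eiffel Tower": 1.0, "Colosseum": -1.0}  # made-up encoding

def run_model(subject, patch=None):
    """Toy three-component model. Returns (activations, output).

    `patch` maps a component name to an activation that overrides the value
    the model would otherwise compute for that component.
    """
    patch = patch or {}
    acts = {}
    # Component 1: looks up a code for the subject.
    acts["subject_mlp"] = patch.get("subject_mlp", SUBJECT_CODE[subject])
    # Component 2: a constant that ignores the subject entirely.
    acts["position_head"] = patch.get("position_head", 0.5)
    # Component 3: combines the two upstream activations.
    acts["mixer"] = patch.get(
        "mixer", 2.0 * acts["subject_mlp"] + acts["position_head"]
    )
    output = "Paris" if acts["mixer"] > 0 else "Rome"
    return acts, output

# Step 1: clean run, recording activations.
clean_acts, clean_out = run_model("Eiffel Tower")
# Step 2: corrupted run.
_, corrupted_out = run_model("Colosseum")

# Steps 3 and 4: patch one component at a time and check the output.
restored = {}
for name in clean_acts:
    _, out = run_model("Colosseum", patch={name: clean_acts[name]})
    restored[name] = (out == clean_out)

print(clean_out, corrupted_out, restored)
```

Both `subject_mlp` and the downstream `mixer` restore the answer, while `position_head` does not. Patching locates every component on the causal path, so finding where the information first enters usually means looking for the earliest such component.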
Meng et al. applied this to GPT models to find where factual associations are stored. For a prompt like “The Eiffel Tower is located in the city of ___”, they found that recalling “Paris” depends most strongly on MLP modules in the middle layers, acting on the residual stream at the position of the last subject token (the end of “Eiffel Tower”). This is consistent across many different factual associations.
They used this to build ROME (Rank-One Model Editing): a method to surgically edit a model’s factual knowledge by applying a rank-one update to the weight matrix of one of those MLP layers, without retraining. You can change “the Eiffel Tower is in Paris” to “the Eiffel Tower is in Rome” in the model’s outputs, while leaving most other knowledge intact.
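The idea of a rank-one edit can be shown with a simplified update rule. This is not ROME's actual formula, which also weights the update by a covariance estimate of the keys; the sketch below assumes that covariance is the identity, and the key and value vectors are invented. It treats the weight matrix as a key-to-value map and changes what one key maps to.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rank_one_edit(W, key, new_value):
    """Return W + (new_value - W key) key^T / (key . key).

    The edited matrix maps `key` to `new_value` and maps any vector
    orthogonal to `key` exactly as W did.
    """
    residual = [v - o for v, o in zip(new_value, matvec(W, key))]
    scale = sum(k * k for k in key)
    return [
        [w + r * k / scale for w, k in zip(row, key)]
        for row, r in zip(W, residual)
    ]

W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
eiffel_key = [1.0, 1.0, 0.0]    # hypothetical key vector for "Eiffel Tower"
other_key = [1.0, -1.0, 0.0]    # an orthogonal key for some other subject
rome_value = [0.0, 5.0]         # hypothetical value vector decoding to "Rome"

W_edited = rank_one_edit(W, eiffel_key, rome_value)
print(matvec(W_edited, eiffel_key))   # the edited association
print(matvec(W_edited, other_key))    # the untouched association
```

The update is an outer product, so it generally changes many entries of the matrix even though it has rank one. Keys that are not orthogonal to the edited key are affected in proportion to their overlap, which is one reason edits can leak into related facts.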
Connection to Safety
Mechanistic interpretability is not purely an academic exercise. It’s a central part of AI safety research, particularly at Anthropic.
If you understand what a model computes, you can in principle:
Detect deception. If a model is behaving differently than its outputs suggest - giving misleading answers while “knowing” they’re wrong - interpretability tools might detect the discrepancy between the model’s internal representations and its outputs. A model that claims “I don’t know” but whose residual stream contains the answer could in principle be caught this way.
Identify harmful capabilities. Rather than probing model outputs for dangerous knowledge, you could inspect the internal representations directly. If “how to synthesize a pathogen” is a direction in activation space, you might be able to detect this without generating the output.
Targeted editing. Rather than retraining to remove a capability, surgical weight editing (like ROME) might remove specific knowledge or behaviors without affecting others. This is far from reliable today, but the research direction is active.
Understand generalization. The circuits framework offers a partial mechanistic account of why transformers generalize (in the cases studied, they learn general algorithms like induction, not specific lookup tables). Understanding generalization mechanisms helps predict failure modes.
Summary
| Concept | What it claims | Evidence |
|---|---|---|
| Linear representations | Concepts are directions in activation space | Word2Vec analogies, probing classifiers, logit lens |
| Induction heads | Attention heads implement in-context copy | Found in all multi-layer transformers examined; appear in phase transition during training |
| Superposition | $n > d$ features packed in $d$ dimensions via sparsity | Polysemantic neurons; SAE decompositions |
| Sparse autoencoders | SAE decoder columns are interpretable features | Millions of interpretable features found in Claude 3 Sonnet (Templeton et al., 2024) |
| Activation patching | Identifies causal components for specific outputs | ROME: facts stored in middle-layer MLP at subject position |
Mechanistic interpretability is the attempt to do for neural networks what circuit diagrams do for hardware: describe what computation is happening and how. The work so far is encouraging - clean algorithms (induction, indirect object identification) have been found in small models. The work ahead is harder: scaling these tools to models with hundreds of billions of parameters, where the circuits may be more complex, more distributed, and less legible. But the central claim - that neural networks implement algorithms that can be reverse-engineered - has enough evidence that the research program seems viable.