Mechanistic Interpretability: Reading the Model's Mind
Sparse autoencoders, features, circuits. How researchers try to see what a model actually thinks, and why it may be the most strategically important safety work.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. Why Opening the Box Matters
2. Interpretability
3. Sparse autoencoders
4. Features
Section 1
Why Opening the Box Matters
Behavior-based evaluations tell you what a model does. Interpretability tries to tell you why. If deceptive alignment is a real concern, you cannot catch it from outputs alone. You need to look inside. Mechanistic interpretability is the research program of understanding neural networks well enough to trust or distrust them for specific reasons.
The starting problem: polysemantic neurons
Early hopes that individual neurons encoded individual concepts were wrong. Most neurons are polysemantic: they activate on many unrelated concepts (curly hair, bird beaks, and political slogans in the same neuron). The superposition hypothesis (Elhage et al., 2022) explains why: models represent many more features than they have dimensions by packing them into overlapping, non-orthogonal directions and tolerating the resulting interference.
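A quick way to see why this is geometrically possible: in a d-dimensional space you can fit far more than d nearly orthogonal directions. The sketch below (plain NumPy, with illustrative sizes rather than numbers from any real model) measures how small the interference between random feature directions stays:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 64, 1024          # far more candidate features than dimensions

# Give each feature a random unit direction in the d-dimensional activation space.
directions = rng.normal(size=(n_features, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Interference = |cosine similarity| between distinct feature directions.
overlaps = np.abs(directions @ directions.T)
np.fill_diagonal(overlaps, 0.0)

print(f"mean interference: {overlaps.mean():.3f}")   # small
print(f"max interference:  {overlaps.max():.3f}")    # well below 1
```

If only a few features are active at once, that residual interference is tolerable, which is the core intuition behind superposition.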
Sparse autoencoders: the breakthrough
A sparse autoencoder (SAE) is trained to reconstruct a model's activations through a much wider hidden layer under an L1 sparsity penalty. Because the hidden layer is wide and sparse, it can represent features one per unit. Running a trained SAE over model activations yields largely monosemantic features. Anthropic's Scaling Monosemanticity (May 2024) trained SAEs on Claude 3 Sonnet activations and extracted millions of interpretable features.
What features look like
- A feature for the Golden Gate Bridge, active on mentions, images, and abstract references
- Safety-relevant features for deception, sycophancy, bias, dangerous content
- Features for security vulnerabilities in code
- Features for specific people, places, emotions, and abstractions like betrayal or symmetry
- Multilingual features: the same concept fires the same feature across languages
- Multimodal features: the same feature fires on both text and images
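Labels like these are usually assigned by looking at the text where a feature fires hardest. A minimal sketch of that step in Python, assuming you already have a trained SAE encoder and activations cached from one layer (the function name and arguments are illustrative, not any particular library's API):

```python
import torch

def top_activating_positions(sae_encoder, activations, tokens, feature_id, k=5):
    """Rank token positions by how strongly one SAE feature fires there.
    Reading the text around the top positions is how features get labels
    like 'Golden Gate Bridge' or 'security vulnerability'."""
    # activations: (n_positions, d_model) cached from one layer of the model
    # tokens: n_positions decoded token strings aligned with the activations
    feats = torch.relu(sae_encoder(activations))       # (n_positions, n_features)
    values, positions = feats[:, feature_id].topk(k)
    return [(tokens[p], v.item()) for p, v in zip(positions.tolist(), values)]
```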
Steering: feature manipulation
If a feature represents a concept, you can amplify it and observe the effect on output. Anthropic's Golden Gate Claude demo (May 2024) clamped the Golden Gate feature high and produced a model that obsessively steered every conversation toward the bridge. The proof of concept: features are causal, not just correlational.
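Mechanically, steering adds a multiple of a feature's decoder direction back into the activations at the layer the SAE was trained on. A hedged PyTorch sketch; the layer index, feature index, and hook wiring are illustrative, not Anthropic's implementation:

```python
import torch

def make_steering_hook(decoder_directions: torch.Tensor,
                       feature_id: int,
                       strength: float = 10.0):
    """Build a forward hook that adds a scaled feature direction to a layer's
    output, effectively clamping that feature 'on' during generation."""
    direction = decoder_directions[feature_id]           # shape: (d_model,)
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Assumes the hooked module returns a (batch, seq_len, d_model) tensor
        # of residual-stream activations; returning a value replaces the output.
        return output + strength * direction
    return hook

# Illustrative usage on a hypothetical model and trained SAE:
# handle = model.layers[20].register_forward_hook(
#     make_steering_hook(sae.decoder.weight.T, feature_id=1234, strength=10.0))
# ... generate text, observe the obsession, then handle.remove()
```

The same intervention with a negative strength suppresses the concept instead of amplifying it.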
Circuits: features connected
Features are the nouns; circuits are the verbs. A circuit is a chain of features and attention heads that implements a specific computation. Early circuits work identified induction heads (which copy repeated patterns from earlier in the context), name mover heads (which copy the correct name into the output in indirect-object sentences), and factual recall circuits. Transformer-circuits.pub publishes ongoing circuit analyses.
The SAE objective: reconstruct activations through a wide, sparse hidden layer.
Simplified SAE training:
Given activations x (dimension d) from a layer,
train encoder E (d -> k where k >> d) and decoder D
with loss:
L = ||x - D(E(x))||^2 + lambda * ||E(x)||_1
reconstruction error + sparsity penalty
Each of the k dimensions in E(x) becomes a 'feature.'
With k in the millions and lambda tuned, features
often become monosemantic.
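To make that objective concrete, here is a minimal PyTorch sketch of one training step; the sizes, learning rate, and lambda are placeholders rather than values from any published SAE:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # E: d -> k, with k >> d
        self.decoder = nn.Linear(n_features, d_model)   # D: k -> d

    def forward(self, x):
        features = torch.relu(self.encoder(x))          # E(x): sparse, non-negative
        recon = self.decoder(features)                  # D(E(x))
        return recon, features

def sae_loss(x, recon, features, lam=1e-3):
    reconstruction = ((x - recon) ** 2).sum(dim=-1).mean()   # ||x - D(E(x))||^2
    sparsity = features.abs().sum(dim=-1).mean()             # ||E(x)||_1
    return reconstruction + lam * sparsity

# One illustrative training step on a batch of cached activations:
sae = SparseAutoencoder(d_model=4096, n_features=65536)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(256, 4096)        # stand-in for real residual-stream activations
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
loss.backward()
opt.step()
```

Each column of the decoder weight matrix is the direction for one feature, which is exactly what the steering sketch above adds back into the activations.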
Compare the options
| Era | Representative work | Main insight |
|---|---|---|
| 2019-2020 | Transformer Circuits (Olah) | Attention heads have specific roles |
| 2021-2022 | Induction Heads | In-context learning has a mechanism |
| 2022 | Toy Models of Superposition | Features live in superposition |
| 2023 | Towards Monosemanticity | SAEs extract single features |
| 2024 | Scaling Monosemanticity + Golden Gate Claude | SAEs scale to Claude 3 Sonnet, features are causal |
| 2025 | Circuits Updates + Crosscoders | Features connected into circuits, ongoing |
Why this may be strategically central
- Distinguishes aligned behavior from deceptively-aligned behavior (in principle)
- Gives auditors something concrete to audit, not just outputs
- Enables targeted edits: remove dangerous capabilities without retraining
- Provides early warning for capability shifts (new features emerging)
- Creates a path to verifiable safety claims rather than confidence-based ones
Known limits
- SAEs reconstruct activations imperfectly; a heavy L1 penalty can suppress or miss real features
- Features are model-specific; not yet transferable across model generations
- Feature labels rely on human interpretation, which can be wrong
- Millions of features mean labeling must be automated, and automated labeling makes mistakes
- We do not yet know if all mesa-objective representations will be interpretable
“We have the first plausible path to actually reading the mind of a neural network. That was not true five years ago.”
The big idea: for the first time, there is a plausible technical path to reading what a neural network is actually computing. It is incomplete, expensive, and young. It is also the most direct response to the harder alignment questions, and it is advancing quickly.
Related lessons
Keep going
Creators · 37 min
Feature Discovery in LLMs
A feature is a direction in activation space that corresponds to a concept. Finding them — naming them, ranking them, connecting them — is one of the central activities of interpretability research.
Builders · 28 min
Circuits in Neural Networks
A circuit is a small sub-network inside a big model that implements one specific behavior. Finding circuits is how researchers prove how a model does what it does.
Creators · 50 min
AI Alignment: The Actual Technical Problem
Alignment is not a vibes debate. It is a concrete technical problem about getting systems to pursue goals we actually want. Here is what researchers work on when they say they work on alignment.
