Sparse autoencoders, features, circuits. How researchers try to see what a model actually thinks, and why it may be the most strategically important safety work.
Behavior-based evaluations tell you what a model does. Interpretability tries to tell you why. If deceptive alignment is a real concern, you cannot catch it from outputs alone. You need to look inside. Mechanistic interpretability is the research program of understanding neural networks well enough to trust or distrust them for specific reasons.
Early hopes that individual neurons would encode individual concepts turned out to be wrong. Most neurons are polysemantic: a single neuron activates on many unrelated concepts (curly hair, bird beaks, and political slogans in the same neuron). The superposition hypothesis (Elhage et al., 2022) explains why: models pack many more features than they have dimensions by storing them as overlapping directions, tolerating a little interference between features that rarely co-occur.
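A toy illustration of the superposition idea, in the spirit of Elhage et al. (2022) but with made-up numbers: in d dimensions you can fit far more than d feature directions if they are only nearly orthogonal, so reading out one feature picks up a little of the others. A short NumPy sketch measuring that interference:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512  # 64 dimensions, 512 candidate "features" (illustrative sizes)

# Random unit vectors: far more features than dimensions.
W = rng.normal(size=(n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Interference: cosine similarity between distinct feature directions.
overlaps = W @ W.T
np.fill_diagonal(overlaps, 0.0)
print(f"max |cos| between distinct features:  {np.abs(overlaps).max():.2f}")
print(f"mean |cos| between distinct features: {np.abs(overlaps).mean():.2f}")
# Typical output: max around 0.5, mean around 0.1 -- small but nonzero
# interference, which a model can tolerate if features are sparsely active.
```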
A sparse autoencoder (SAE) is trained to reconstruct a model's activations through a much wider hidden layer under an L1 sparsity penalty. Because the hidden layer is wide and sparse, it can afford to represent features one per unit. Applied to model activations, a trained SAE decomposes them into largely monosemantic features. Anthropic's Scaling Monosemanticity paper (May 2024) trained SAEs on Claude 3 Sonnet activations and extracted millions of interpretable features.
If a feature really represents a concept, you can amplify it and observe the effect on the model's output. Anthropic's Golden Gate Claude demo (May 2024) clamped the Golden Gate Bridge feature to a high value and produced a model that obsessively steered every conversation toward the bridge. That was the proof of concept: features are causal, not just correlational.
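A minimal sketch of what "clamping a feature" means mechanically, assuming you already have an SAE decoder direction for the feature: add a scaled copy of that direction to a layer's output via a PyTorch forward hook. The model, layer path, and scale below are placeholders, not Anthropic's actual setup.

```python
import torch

def make_steering_hook(feature_direction: torch.Tensor, scale: float):
    """Return a forward hook that adds `scale * feature_direction`
    to a layer's output (the residual-stream activation)."""
    direction = feature_direction / feature_direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden)  # match dtype/device
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Usage sketch (names are placeholders for whatever model you load):
# layer = model.transformer.h[20]              # some mid-network block
# handle = layer.register_forward_hook(
#     make_steering_hook(golden_gate_direction, scale=10.0))
# ... generate text; every forward pass is nudged toward the feature ...
# handle.remove()
```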
Features are the nouns; circuits are the verbs. A circuit is a chain of features and attention heads that implements a specific computation. Early circuits work identified induction heads (which copy repeated patterns from earlier in the context), name mover heads (which route the correct name to the output), and factual recall circuits. transformer-circuits.pub publishes ongoing circuit analyses.
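One common diagnostic for induction heads, used in tools like TransformerLens (the code here is an illustrative sketch, not that library's API): feed the model a random token sequence repeated twice and score each head on how much attention it pays, from each token in the second half, to the position right after that token's first occurrence. A head matching the [A][B] ... [A] -> attend-to-[B] pattern scores high.

```python
import numpy as np

def induction_score(attn: np.ndarray, seq_len: int) -> float:
    """Score one head's attention pattern on a sequence that repeats
    after `seq_len` tokens (total length 2 * seq_len).

    At repeated-token position i (second half), an induction head
    attends back to position i - seq_len + 1: the token that followed
    this token's first occurrence.
    """
    total = 2 * seq_len
    score = sum(attn[i, i - seq_len + 1] for i in range(seq_len, total))
    return score / seq_len  # mean attention mass on the induction target

# Usage sketch: attn would be a real head's attention pattern on a
# sequence like rand_tokens + rand_tokens; here, a random stand-in.
attn = np.random.dirichlet(np.ones(100), size=100)  # rows sum to 1
print(f"induction score (random baseline): {induction_score(attn, 50):.3f}")
```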
Simplified SAE training:

```
Given activations x (dimension d) from a layer,
train encoder E (d -> k, where k >> d) and decoder D
with loss:

    L = ||x - D(E(x))||^2 + lambda * ||E(x)||_1
        (reconstruction error + sparsity penalty)

Each of the k dimensions in E(x) becomes a "feature."
With k in the millions and lambda tuned, features
often become monosemantic.
```

The SAE objective: reconstruct activations through a wide, sparse hidden layer.
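To make the objective concrete, here is a minimal runnable PyTorch sketch of the same loss. The layer sizes, the ReLU encoder nonlinearity, and the lambda default are illustrative choices, not the settings from Anthropic's papers.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d, k)  # E: d -> k, with k >> d
        self.decoder = nn.Linear(k, d)  # D: k -> d

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # E(x), nonnegative
        return self.decoder(features), features  # (reconstruction, features)

def sae_loss(x, x_hat, features, lam: float = 1e-3):
    # L = ||x - D(E(x))||^2 + lambda * ||E(x)||_1
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return recon + lam * sparsity

# Toy training step on fake "activations".
sae = SparseAutoencoder(d=512, k=8192)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
x = torch.randn(64, 512)
opt.zero_grad()
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
opt.step()
```

In real use, x would be a large stream of activations captured from the model under study, and k is pushed into the millions.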
| Era | Representative work | Main insight |
|---|---|---|
| 2019-2020 | Transformer Circuits (Olah) | Attention heads have specific roles |
| 2021-2022 | Induction Heads | In-context learning has a mechanism |
| 2022 | Toy Models of Superposition | Features live in superposition |
| 2023 | Towards Monosemanticity | SAEs extract single features |
| 2024 | Scaling Monosemanticity + Golden Gate Claude | SAEs scale to Claude 3 Sonnet, features are causal |
| 2025 | Circuits Updates + Crosscoders | Features connected into circuits, ongoing |
> "We have the first plausible path to actually reading the mind of a neural network. That was not true five years ago."
>
> — Chris Olah, Anthropic interpretability team
The big idea: for the first time, there is a plausible technical path to reading what a neural network is actually computing. It is incomplete, expensive, and young. It is also the most direct response to the harder alignment questions, and it is advancing quickly.
15 questions · take the quiz online for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety-interpretability-creators
1. What is the core problem that motivated the development of sparse autoencoders?
2. According to the superposition hypothesis, why do most neurons in large language models appear polysemantic?
3. What architectural choice allows a sparse autoencoder to produce monosemantic features?
4. What did Anthropic's 'Golden Gate Claude' experiment demonstrate about features?
5. In the terminology of mechanistic interpretability, what is the relationship between features and circuits?
6. Why might mechanistic interpretability be strategically important for AI safety compared to behavior-based evaluations alone?
7. What computational role do induction heads play in transformer models?
8. What capability does 'steering' via feature manipulation enable researchers to do?
9. What is one key limitation of sparse autoencoders as mentioned in the material?
10. Why is finding interpretable features insufficient on its own for complete model understanding?
11. What economic consideration was raised about scaling interpretability to frontier models?
12. What did Anthropic's 'Scaling Monosemanticity' paper successfully demonstrate?
13. What does it mean that a feature is 'multilingual'?
14. What risk exists even with automated feature labeling?
15. The quote 'We have the first plausible path to actually reading the mind of a neural network' refers most directly to what development?