Sparse autoencoders, features, circuits. How researchers try to see what a model actually thinks, and why it may be the most strategically important safety work.
Behavior-based evaluations tell you what a model does. Interpretability tries to tell you why. If deceptive alignment is a real concern, you cannot catch it from outputs alone. You need to look inside. Mechanistic interpretability is the research program of understanding neural networks well enough to trust or distrust them for specific reasons.
Early hopes that individual neurons encoded individual concepts were wrong. Most neurons are polysemantic: a single neuron can activate on several unrelated concepts (for example, curly hair, bird beaks, and political slogans). The superposition hypothesis (Elhage et al., 2022) explains this: a model represents many more features than it has dimensions by assigning them non-orthogonal directions in activation space, at the cost of some interference between features. This works because most features are inactive on any given input.
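One way to build intuition for superposition is to check how nearly orthogonal random directions are when there are more of them than dimensions. The sketch below is a toy illustration, not the setup from Elhage et al.: the sizes, the random unit vectors, and the dot-product readout are all assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 256, 1024   # assumed toy sizes: 4x more features than dimensions

# Give each feature a random unit-length direction in the d-dimensional space.
directions = rng.normal(size=(n_features, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Interference: how much each feature direction overlaps with every other one.
overlap = directions @ directions.T
np.fill_diagonal(overlap, 0.0)
print("mean |overlap| between distinct features:", np.abs(overlap).mean())
print("max  |overlap| between distinct features:", np.abs(overlap).max())

# Switch on a sparse handful of features and add their directions together.
active = rng.choice(n_features, size=3, replace=False)
x = directions[active].sum(axis=0)

# Read every feature back with a dot product and keep the three strongest.
readout = directions @ x
top3 = np.argsort(readout)[-3:]
print("active features:   ", sorted(active.tolist()))
print("strongest readouts:", sorted(top3.tolist()))
```

The readout stays clean only because few features are active at once. Raising `size=3` toward `n_features` makes the interference terms pile up, which is the trade-off the superposition hypothesis describes.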
A sparse autoencoder (SAE) is trained to reconstruct model activations through a much wider hidden layer, with an L1 penalty that keeps most hidden units at zero on any given input. The extra width gives the SAE room to assign each feature its own unit, and many of the learned units turn out to be monosemantic, though not all. Anthropic's Scaling Monosemanticity (May 2024) trained SAEs on Claude 3 Sonnet activations and extracted millions of interpretable features.
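The forward pass and loss of such an autoencoder can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not Anthropic's implementation: the ReLU encoder, the bias terms, the toy sizes, the random weights, and the names `sae_forward` and `sae_loss` are all illustrative. A real run would minimize this loss with an optimizer over activations collected from a model.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into wide, non-negative features and decode them back."""
    features = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU encoder (assumed), shape (batch, k)
    reconstruction = features @ W_dec + b_dec        # linear decoder, shape (batch, d)
    return features, reconstruction

def sae_loss(x, features, reconstruction, l1_coeff):
    """Squared reconstruction error plus an L1 penalty on the feature activations."""
    recon_error = np.sum((x - reconstruction) ** 2, axis=1).mean()
    sparsity = np.sum(np.abs(features), axis=1).mean()
    return recon_error + l1_coeff * sparsity, recon_error, sparsity

rng = np.random.default_rng(0)
d, k, batch = 16, 128, 32            # assumed toy sizes; real SAEs are far wider
x = rng.normal(size=(batch, d))      # stand-in for activations taken from a model layer
W_enc = rng.normal(scale=0.1, size=(d, k))
b_enc = np.zeros(k)
W_dec = rng.normal(scale=0.1, size=(k, d))
b_dec = np.zeros(d)

features, reconstruction = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss, recon_error, sparsity = sae_loss(x, features, reconstruction, l1_coeff=1e-3)
print(f"loss={loss:.3f}  reconstruction={recon_error:.3f}  L1={sparsity:.3f}")
```

The coefficient `l1_coeff` sets the trade-off: a larger value pushes more features to zero at the cost of a worse reconstruction.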
If a feature represents a concept, you can amplify it and observe the effect on output. Anthropic's Golden Gate Claude demo (May 2024) clamped the Golden Gate feature high and produced a model that obsessively steered every conversation toward the bridge. The proof of concept: features are causal, not just correlational.
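Clamping a feature can be sketched as: encode the activations, overwrite one feature, decode, and add back whatever the SAE failed to reconstruct. The code below is a hedged illustration with random, untrained weights, so feature 7 has no meaning here. The function name `clamp_feature`, the ReLU encoder, and the choice to preserve the reconstruction error are assumptions, and the step of writing the steered activations back into a running model is not shown.

```python
import numpy as np

def clamp_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx, value):
    """Return activations with one SAE feature forced to `value`."""
    features = np.maximum(0.0, x @ W_enc + b_enc)   # SAE encoder (ReLU assumed)
    residual = x - (features @ W_dec + b_dec)       # part of x the SAE does not reconstruct
    steered = features.copy()
    steered[:, feature_idx] = value                 # clamp the chosen feature
    return steered @ W_dec + b_dec + residual       # decode and restore the residual

rng = np.random.default_rng(0)
d, k = 16, 128                                      # assumed toy sizes
W_enc = rng.normal(scale=0.1, size=(d, k))
b_enc = np.zeros(k)
W_dec = rng.normal(scale=0.1, size=(k, d))
b_dec = np.zeros(d)
x = rng.normal(size=(4, d))                         # stand-in for model activations

x_steered = clamp_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx=7, value=10.0)

# The edit moves each activation vector along the clamped feature's decoder row.
shift = x_steered - x
cosine = shift @ W_dec[7] / (np.linalg.norm(shift, axis=1) * np.linalg.norm(W_dec[7]))
print("cosine between each shift and decoder row 7:", np.round(cosine, 3))
```

Because only one feature changes and the residual is added back, everything else in the activation is left untouched. That isolation is what makes a clamping experiment informative about a single feature.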
Features are the nouns; circuits are the verbs. A circuit is a chain of features and attention heads that implements a specific computation. Early circuits work identified induction heads (which continue a pattern seen earlier in the context), name mover heads (which copy the correct name to the output in indirect-object identification), and factual recall circuits. Transformer-circuits.pub publishes ongoing circuit analyses.
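To make "a specific computation" concrete, the behaviour usually attributed to induction heads can be written as a plain rule: if the current token appeared earlier, predict the token that followed it last time. The function below is only a behavioural sketch of that rule. It does not model attention or any real network, and the name `induction_prediction` is hypothetical.

```python
def induction_prediction(tokens):
    """Predict the next token by the rule 'A B ... A -> B'.

    Finds the most recent earlier occurrence of the final token and returns
    the token that followed it, or None if the final token is new.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan the earlier positions from right to left for a match.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_prediction(["Mr", "Dursley", "was", "proud", ".", "Mr"]))  # prints Dursley
print(induction_prediction(["the", "cat", "sat"]))                          # prints None
```

Circuit analysis asks how a network's heads and features combine to implement a rule like this one, rather than merely observing that the rule describes the outputs.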
Simplified SAE training: given activations x (dimension d) from a layer, train an encoder E (d -> k, where k >> d) and a decoder D with the loss

L = ||x - D(E(x))||^2 + lambda * ||E(x)||_1

where the first term is the reconstruction error and the second is the sparsity penalty. Each of the k dimensions in E(x) becomes a "feature." With k in the millions and lambda tuned, features often become monosemantic.

The SAE objective: reconstruct activations through a wide, sparse hidden layer.

| Era | Representative work | Main insight |
|---|---|---|
| 2020-2021 | Circuits thread and Transformer Circuits (Olah et al.) | Attention heads have specific roles |
| 2021-2022 | Induction Heads | In-context learning has a mechanism |
| 2022 | Toy Models of Superposition | Features live in superposition |
| 2023 | Towards Monosemanticity | SAEs extract single features |
| 2024 | Scaling Monosemanticity + Golden Gate Claude | SAEs scale to Claude 3 Sonnet, features are causal |
| 2025 | Circuits Updates + Crosscoders | Features connected into circuits, ongoing |
We have the first plausible path to actually reading the mind of a neural network. That was not true five years ago.
— Chris Olah, Anthropic interpretability team
The big idea: for the first time, there is a plausible technical path to reading what a neural network is actually computing. It is incomplete, expensive, and young. It is also the most direct response to the harder alignment questions, and it is advancing quickly.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety-interpretability-creators
What is the main idea of "Mechanistic Interpretability: Reading the Model's Mind"?
Which concept is most central to "Mechanistic Interpretability: Reading the Model's Mind"?
Which use of AI fits this topic best?
What should a careful learner remember about "Who is doing this work"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about interpretability be treated?
Name one way to verify an AI answer about interpretability.
Which action would help you apply "Mechanistic Interpretability: Reading the Model's Mind" responsibly?