Sparse autoencoders, features, circuits. How researchers try to see what a model actually thinks, and why it may be the most strategically important safety work.
Behavior-based evaluations tell you what a model does. Interpretability tries to tell you why. If deceptive alignment is a real concern, you cannot catch it from outputs alone. You need to look inside. Mechanistic interpretability is the research program of understanding neural networks well enough to trust or distrust them for specific reasons.
Early hopes that individual neurons would encode individual concepts turned out to be wrong. Most neurons are polysemantic: a single neuron activates on many unrelated concepts (curly hair, bird beaks, and political slogans in the same neuron). The superposition hypothesis (Elhage et al., 2022) explains why: models pack many more features than they have dimensions by storing them as overlapping directions, tolerating a little interference between features that rarely co-occur.
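A toy illustration of the superposition idea, in the spirit of Elhage et al. (2022) but with made-up numbers: in d dimensions you can fit far more than d feature directions if they are only nearly orthogonal, so reading out one feature picks up a little of the others. A short NumPy sketch measuring that interference:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512  # 64 dimensions, 512 candidate "features" (illustrative sizes)

# Random unit vectors: far more features than dimensions.
W = rng.normal(size=(n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Interference: cosine similarity between distinct feature directions.
overlaps = W @ W.T
np.fill_diagonal(overlaps, 0.0)
print(f"max |cos| between distinct features:  {np.abs(overlaps).max():.2f}")
print(f"mean |cos| between distinct features: {np.abs(overlaps).mean():.2f}")
# Typical output: max around 0.5, mean around 0.1 -- small but nonzero
# interference, which a model can tolerate if features are sparsely active.
```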
A sparse autoencoder (SAE) is trained to reconstruct a model's activations through a much wider hidden layer under an L1 sparsity penalty. Because the hidden layer is wide and sparse, it can afford to represent features one per unit. Applied to model activations, a trained SAE decomposes them into largely monosemantic features. Anthropic's Scaling Monosemanticity paper (May 2024) trained SAEs on Claude 3 Sonnet activations and extracted millions of interpretable features.
If a feature really represents a concept, you can amplify it and observe the effect on the model's output. Anthropic's Golden Gate Claude demo (May 2024) clamped the Golden Gate Bridge feature to a high value and produced a model that obsessively steered every conversation toward the bridge. That was the proof of concept: features are causal, not just correlational.
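A minimal sketch of what "clamping a feature" means mechanically, assuming you already have an SAE decoder direction for the feature: add a scaled copy of that direction to a layer's output via a PyTorch forward hook. The model, layer path, and scale below are placeholders, not Anthropic's actual setup.

```python
import torch

def make_steering_hook(feature_direction: torch.Tensor, scale: float):
    """Return a forward hook that adds `scale * feature_direction`
    to a layer's output (the residual-stream activation)."""
    direction = feature_direction / feature_direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden)  # match dtype/device
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Usage sketch (names are placeholders for whatever model you load):
# layer = model.transformer.h[20]              # some mid-network block
# handle = layer.register_forward_hook(
#     make_steering_hook(golden_gate_direction, scale=10.0))
# ... generate text; every forward pass is nudged toward the feature ...
# handle.remove()
```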
Features are the nouns; circuits are the verbs. A circuit is a chain of features and attention heads that implements a specific computation. Early circuits work identified induction heads (which copy repeated patterns from earlier in the context), name mover heads (which route the correct name to the output), and factual recall circuits. transformer-circuits.pub publishes ongoing circuit analyses.
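One common diagnostic for induction heads, used in tools like TransformerLens (the code here is an illustrative sketch, not that library's API): feed the model a random token sequence repeated twice and score each head on how much attention it pays, from each token in the second half, to the position right after that token's first occurrence. A head matching the [A][B] ... [A] -> attend-to-[B] pattern scores high.

```python
import numpy as np

def induction_score(attn: np.ndarray, seq_len: int) -> float:
    """Score one head's attention pattern on a sequence that repeats
    after `seq_len` tokens (total length 2 * seq_len).

    At repeated-token position i (second half), an induction head
    attends back to position i - seq_len + 1: the token that followed
    this token's first occurrence.
    """
    total = 2 * seq_len
    score = sum(attn[i, i - seq_len + 1] for i in range(seq_len, total))
    return score / seq_len  # mean attention mass on the induction target

# Usage sketch: attn would be a real head's attention pattern on a
# sequence like rand_tokens + rand_tokens; here, a random stand-in.
attn = np.random.dirichlet(np.ones(100), size=100)  # rows sum to 1
print(f"induction score (random baseline): {induction_score(attn, 50):.3f}")
```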
Simplified SAE training:

```
Given activations x (dimension d) from a layer,
train encoder E (d -> k, where k >> d) and decoder D
with loss:

    L = ||x - D(E(x))||^2 + lambda * ||E(x)||_1
        (reconstruction error + sparsity penalty)

Each of the k dimensions in E(x) becomes a "feature."
With k in the millions and lambda tuned, features
often become monosemantic.
```

The SAE objective: reconstruct activations through a wide, sparse hidden layer.
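To make the objective concrete, here is a minimal runnable PyTorch sketch of the same loss. The layer sizes, the ReLU encoder nonlinearity, and the lambda default are illustrative choices, not the settings from Anthropic's papers.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d, k)  # E: d -> k, with k >> d
        self.decoder = nn.Linear(k, d)  # D: k -> d

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # E(x), nonnegative
        return self.decoder(features), features  # (reconstruction, features)

def sae_loss(x, x_hat, features, lam: float = 1e-3):
    # L = ||x - D(E(x))||^2 + lambda * ||E(x)||_1
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return recon + lam * sparsity

# Toy training step on fake "activations".
sae = SparseAutoencoder(d=512, k=8192)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
x = torch.randn(64, 512)
opt.zero_grad()
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
opt.step()
```

In real use, x would be a large stream of activations captured from the model under study, and k is pushed into the millions.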
| Era | Representative work | Main insight |
|---|---|---|
| 2019-2020 | Transformer Circuits (Olah) | Attention heads have specific roles |
| 2021-2022 | Induction Heads | In-context learning has a mechanism |
| 2022 | Toy Models of Superposition | Features live in superposition |
| 2023 | Towards Monosemanticity | SAEs extract single features |
| 2024 | Scaling Monosemanticity + Golden Gate Claude | SAEs scale to Claude 3 Sonnet, features are causal |
| 2025 | Circuits Updates + Crosscoders | Features connected into circuits, ongoing |
> "We have the first plausible path to actually reading the mind of a neural network. That was not true five years ago."
>
> — Chris Olah, Anthropic interpretability team
The big idea: for the first time, there is a plausible technical path to reading what a neural network is actually computing. It is incomplete, expensive, and young. It is also the most direct response to the harder alignment questions, and it is advancing quickly.
15 questions · take the quiz online for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety-interpretability-creators
1. What is the core problem that motivated the development of sparse autoencoders?
2. According to the superposition hypothesis, why do most neurons in large language models appear polysemantic?
3. What architectural choice allows a sparse autoencoder to produce monosemantic features?
4. What did Anthropic's 'Golden Gate Claude' experiment demonstrate about features?
5. In the terminology of mechanistic interpretability, what is the relationship between features and circuits?
6. Why might mechanistic interpretability be strategically important for AI safety compared to behavior-based evaluations alone?
7. What computational role do induction heads play in transformer models?
8. What capability does 'steering' via feature manipulation enable researchers to do?
9. What is one key limitation of sparse autoencoders as mentioned in the material?
10. Why is finding interpretable features insufficient on its own for complete model understanding?
11. What economic consideration was raised about scaling interpretability to frontier models?
12. What did Anthropic's 'Scaling Monosemanticity' paper successfully demonstrate?
13. What does it mean that a feature is 'multilingual'?
14. What risk exists even with automated feature labeling?
15. The quote 'We have the first plausible path to actually reading the mind of a neural network' refers most directly to what development?