Mechanistic Interpretability: Reading the Model's Mind
Sparse autoencoders, features, circuits. How researchers try to see what a model actually thinks, and why it may be the most strategically important safety work.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. Why Opening the Box Matters
2. Interpretability
3. Sparse autoencoders
4. Features
Section 1
Why Opening the Box Matters
Behavior-based evaluations tell you what a model does. Interpretability tries to tell you why. If deceptive alignment is a real concern, you cannot catch it from outputs alone. You need to look inside. Mechanistic interpretability is the research program of understanding neural networks well enough to trust or distrust them for specific reasons.
The starting problem: polysemantic neurons
Early hopes that individual neurons encoded individual concepts were wrong. Most neurons are polysemantic: they activate on many unrelated concepts (curly hair, bird beaks, and political slogans in the same neuron). The superposition hypothesis (Elhage et al., 2022) explains why: models represent many more features than they have dimensions by packing them into overlapping, non-orthogonal directions and tolerating the resulting interference.
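A quick way to see why this is geometrically possible: in a d-dimensional space you can fit far more than d nearly orthogonal directions. The sketch below (plain NumPy, with illustrative sizes rather than numbers from any real model) measures how small the interference between random feature directions stays:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 64, 1024          # far more candidate features than dimensions

# Give each feature a random unit direction in the d-dimensional activation space.
directions = rng.normal(size=(n_features, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Interference = |cosine similarity| between distinct feature directions.
overlaps = np.abs(directions @ directions.T)
np.fill_diagonal(overlaps, 0.0)

print(f"mean interference: {overlaps.mean():.3f}")   # small
print(f"max interference:  {overlaps.max():.3f}")    # well below 1
```

If only a few features are active at once, that residual interference is tolerable, which is the core intuition behind superposition.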
Sparse autoencoders: the breakthrough
A sparse autoencoder (SAE) is trained to reconstruct a model's activations through a much wider hidden layer under an L1 sparsity penalty. Because the hidden layer is wide and sparse, it can represent features one per unit. Running a trained SAE over model activations yields largely monosemantic features. Anthropic's Scaling Monosemanticity (May 2024) trained SAEs on Claude 3 Sonnet activations and extracted millions of interpretable features.
What features look like
- A feature for the Golden Gate Bridge, active on mentions, images, and abstract references
- Safety-relevant features for deception, sycophancy, bias, dangerous content
- Features for security vulnerabilities in code
- Features for specific people, places, emotions, and abstractions like betrayal or symmetry
- Multilingual features: the same concept fires the same feature across languages
- Multimodal features: the same feature fires on both text and images
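Labels like these are usually assigned by looking at the text where a feature fires hardest. A minimal sketch of that step in Python, assuming you already have a trained SAE encoder and activations cached from one layer (the function name and arguments are illustrative, not any particular library's API):

```python
import torch

def top_activating_positions(sae_encoder, activations, tokens, feature_id, k=5):
    """Rank token positions by how strongly one SAE feature fires there.
    Reading the text around the top positions is how features get labels
    like 'Golden Gate Bridge' or 'security vulnerability'."""
    # activations: (n_positions, d_model) cached from one layer of the model
    # tokens: n_positions decoded token strings aligned with the activations
    feats = torch.relu(sae_encoder(activations))       # (n_positions, n_features)
    values, positions = feats[:, feature_id].topk(k)
    return [(tokens[p], v.item()) for p, v in zip(positions.tolist(), values)]
```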
Steering: feature manipulation
If a feature represents a concept, you can amplify it and observe the effect on output. Anthropic's Golden Gate Claude demo (May 2024) clamped the Golden Gate feature high and produced a model that obsessively steered every conversation toward the bridge. The proof of concept: features are causal, not just correlational.
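Mechanically, steering adds a multiple of a feature's decoder direction back into the activations at the layer the SAE was trained on. A hedged PyTorch sketch; the layer index, feature index, and hook wiring are illustrative, not Anthropic's implementation:

```python
import torch

def make_steering_hook(decoder_directions: torch.Tensor,
                       feature_id: int,
                       strength: float = 10.0):
    """Build a forward hook that adds a scaled feature direction to a layer's
    output, effectively clamping that feature 'on' during generation."""
    direction = decoder_directions[feature_id]           # shape: (d_model,)
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Assumes the hooked module returns a (batch, seq_len, d_model) tensor
        # of residual-stream activations; returning a value replaces the output.
        return output + strength * direction
    return hook

# Illustrative usage on a hypothetical model and trained SAE:
# handle = model.layers[20].register_forward_hook(
#     make_steering_hook(sae.decoder.weight.T, feature_id=1234, strength=10.0))
# ... generate text, observe the obsession, then handle.remove()
```

The same intervention with a negative strength suppresses the concept instead of amplifying it.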
Circuits: features connected
Features are the nouns; circuits are the verbs. A circuit is a chain of features and attention heads that implements a specific computation. Early circuits work identified induction heads (which copy repeated patterns from earlier in the context), name mover heads (which copy the correct name into the output in indirect-object sentences), and factual recall circuits. Transformer-circuits.pub publishes ongoing circuit analyses.
The SAE objective: reconstruct activations through a wide, sparse hidden layer.
Simplified SAE training:
Given activations x (dimension d) from a layer,
train encoder E (d -> k where k >> d) and decoder D
with loss:
L = ||x - D(E(x))||^2 + lambda * ||E(x)||_1
reconstruction error + sparsity penalty
Each of the k dimensions in E(x) becomes a 'feature.'
With k in the millions and lambda tuned, features
often become monosemantic.
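To make that objective concrete, here is a minimal PyTorch sketch of one training step; the sizes, learning rate, and lambda are placeholders rather than values from any published SAE:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # E: d -> k, with k >> d
        self.decoder = nn.Linear(n_features, d_model)   # D: k -> d

    def forward(self, x):
        features = torch.relu(self.encoder(x))          # E(x): sparse, non-negative
        recon = self.decoder(features)                  # D(E(x))
        return recon, features

def sae_loss(x, recon, features, lam=1e-3):
    reconstruction = ((x - recon) ** 2).sum(dim=-1).mean()   # ||x - D(E(x))||^2
    sparsity = features.abs().sum(dim=-1).mean()             # ||E(x)||_1
    return reconstruction + lam * sparsity

# One illustrative training step on a batch of cached activations:
sae = SparseAutoencoder(d_model=4096, n_features=65536)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(256, 4096)        # stand-in for real residual-stream activations
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
loss.backward()
opt.step()
```

Each column of the decoder weight matrix is the direction for one feature, which is exactly what the steering sketch above adds back into the activations.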
Compare the options
| Era | Representative work | Main insight |
|---|---|---|
| 2019-2020 | Transformer Circuits (Olah) | Attention heads have specific roles |
| 2021-2022 | Induction Heads | In-context learning has a mechanism |
| 2022 | Toy Models of Superposition | Features live in superposition |
| 2023 | Towards Monosemanticity | SAEs extract single features |
| 2024 | Scaling Monosemanticity + Golden Gate Claude | SAEs scale to Claude 3 Sonnet, features are causal |
| 2025 | Circuits Updates + Crosscoders | Features connected into circuits, ongoing |
Why this may be strategically central
- Distinguishes aligned behavior from deceptively-aligned behavior (in principle)
- Gives auditors something concrete to audit, not just outputs
- Enables targeted edits: remove dangerous capabilities without retraining
- Provides early warning for capability shifts (new features emerging)
- Creates a path to verifiable safety claims rather than confidence-based ones
Known limits
- SAEs reconstruct activations imperfectly; a heavy L1 penalty can suppress or miss real features
- Features are model-specific; not yet transferable across model generations
- Feature labels rely on human interpretation, which can be wrong
- Millions of features mean labeling must be automated, and automated labeling makes mistakes
- We do not yet know if all mesa-objective representations will be interpretable
“We have the first plausible path to actually reading the mind of a neural network. That was not true five years ago.”
The big idea: for the first time, there is a plausible technical path to reading what a neural network is actually computing. It is incomplete, expensive, and young. It is also the most direct response to the harder alignment questions, and it is advancing quickly.
Related lessons
Keep going
Creators · 37 min
Feature Discovery in LLMs
A feature is a direction in activation space that corresponds to a concept. Finding them — naming them, ranking them, connecting them — is one of the central activities of interpretability research.
Builders · 28 min
Circuits in Neural Networks
A circuit is a small sub-network inside a big model that implements one specific behavior. Finding circuits is how researchers prove how a model does what it does.
Creators · 50 min
AI Alignment: The Actual Technical Problem
Alignment is not a vibes debate. It is a concrete technical problem about getting systems to pursue goals we actually want. Here is what researchers work on when they say they work on alignment.
