Sparse Autoencoders Explained
Neural networks mix many concepts into each neuron. Sparse autoencoders pull them apart into human-readable features. This is the workhorse of modern interpretability.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The Superposition Problem

Concept cluster
Terms to connect while reading
- sparse autoencoder
- superposition
- dictionary learning
Section 1
The Superposition Problem
A neural network represents far more concepts than it has neurons. A single neuron in GPT-2 might fire for 'this is a legal document,' 'the Golden Gate Bridge,' and 'word ending in -ing' all at once. This is superposition: the network packs many features into shared directions in activation space. If you want to see the features, you need to unmix them.
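A toy numerical illustration of the idea (a hypothetical sketch, not taken from any real model): squeeze three feature directions into a two-dimensional activation space and a single read-out neuron ends up responding to all three.

```python
import numpy as np

# Hypothetical toy example: 3 feature directions packed into a 2-dimensional
# activation space. With more features than dimensions they cannot all be
# orthogonal, so one "neuron" (a read-out direction) overlaps with several.
features = np.array([
    [1.0, 0.0],       # feature A, e.g. "this is a legal document"
    [-0.5, 0.866],    # feature B, e.g. "the Golden Gate Bridge"
    [-0.5, -0.866],   # feature C, e.g. "word ending in -ing"
])

neuron = np.array([1.0, 0.2])
neuron /= np.linalg.norm(neuron)  # a single neuron's read-out direction

for name, f in zip("ABC", features):
    print(f"feature {name} -> neuron response {f @ neuron:+.2f}")
# All three responses are nonzero: the neuron is polysemantic, and the
# underlying features can only be recovered by unmixing the shared directions.
```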
How a sparse autoencoder works
1. Take activations from a specific layer of a trained model
2. Pass them through an encoder into a much larger hidden space (often 8-64x wider)
3. Force the hidden activations to be sparse: only a few fire per input
4. Decode back to the original activations and minimize reconstruction loss
5. Train over billions of tokens, then interpret the features the SAE learned (a minimal code sketch of these steps follows this list)
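A minimal sketch of that recipe in PyTorch. The class and variable names, the ReLU encoder, and the L1 sparsity penalty are illustrative assumptions; real SAEs differ in activation function, sparsity mechanism, and training scale.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstruct model activations through a wide, sparsely-active hidden layer."""
    def __init__(self, d_model: int, expansion: int = 16):
        super().__init__()
        d_hidden = d_model * expansion                 # step 2: much wider hidden space
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))         # feature activations (mostly zero once trained)
        recon = self.decoder(feats)                    # step 4: map back to the original activation space
        return recon, feats

# One illustrative training step on a batch of activations from one layer.
sae = SparseAutoencoder(d_model=768)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                        # step 3: sparsity pressure via an assumed L1 penalty

acts = torch.randn(4096, 768)                          # stand-in for real layer activations (step 1)
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
loss.backward()
opt.step()
```

After training, interpreting a feature (step 5) usually means collecting the inputs on which it activates most strongly and reading off what they have in common.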
What you can do with features
- Identify safety-relevant concepts like deception, manipulation, bias
- Steer model behavior by amplifying or suppressing specific features (see the sketch after this list)
- Debug failures by checking which features fired
- Detect distribution shift by watching feature activation patterns
- Audit for concepts a model 'should not' have learned
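For the steering item above, one common approach is to add or subtract a feature's decoder direction in the activations. This sketch continues the SparseAutoencoder example from earlier; the feature index and strength are arbitrary assumptions.

```python
def steer(acts: torch.Tensor, sae: SparseAutoencoder, feature_idx: int, scale: float) -> torch.Tensor:
    """Amplify (scale > 0) or suppress (scale < 0) one learned feature in a batch of activations."""
    direction = sae.decoder.weight[:, feature_idx]       # that feature's direction in activation space
    return acts + scale * direction

steered = steer(acts, sae, feature_idx=123, scale=4.0)   # hypothetical feature index and strength
```

Debugging which features fired is the same forward pass in reverse order of concern: run the activations through the SAE and inspect the nonzero entries of `feats` for the inputs of interest.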
“We have gone from not knowing what was inside these models to a crude map. The map is wrong in places, but it is a map.”
The big idea: SAEs turned interpretability from stamp-collecting into an engineering discipline. They are the main reason interpretability had a breakout year in 2024.
Related lessons
Keep going
Creators · 55 min
Mechanistic Interpretability: Reading the Model's Mind
Sparse autoencoders, features, circuits. How researchers try to see what a model actually thinks, and why it may be the most strategically important safety work.
Creators · 50 min
AI Alignment: The Actual Technical Problem
Alignment is not a vibes debate. It is a concrete technical problem about getting systems to pursue goals we actually want. Here is what researchers work on when they say they work on alignment.
Creators · 40 min
Jailbreak Case Studies: What Actually Broke
Abstract jailbreak theory is less useful than real cases. Here are the techniques that worked on production models, what they taught us, and what is still unsolved.
