Neural networks mix many concepts into each neuron. Sparse autoencoders pull them apart into human-readable features. They are the workhorse of modern interpretability.
A neural network has way more concepts than neurons. A single neuron in GPT-2 might fire for 'this is a legal document,' 'the Golden Gate Bridge,' and 'word ending in -ing' all at once. This is superposition: the network packs many features into shared directions in activation space. If you want to see the features, you need to unmix them.
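To make the unmixing concrete, here is a minimal sketch of a sparse autoencoder in PyTorch. The names (SparseAutoencoder, sae_loss), the widths, and the L1 coefficient are illustrative assumptions, not the exact recipe from any published SAE: the hidden layer is much wider than the input, and an L1 penalty pushes most hidden activations to zero.

```python
# Minimal sparse autoencoder sketch (PyTorch). All names and
# hyperparameters are illustrative, not a specific lab's setup.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        # d_hidden is typically many times larger than d_model
        # (an "overcomplete dictionary" of candidate features).
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative; the L1 term
        # in the loss drives most of them to exactly zero (sparsity).
        f = torch.relu(self.encoder(x))
        # The decoder's job is to reconstruct the original activation
        # from the sparse feature vector.
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error + L1 sparsity penalty on the features.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# Usage: feed activations captured from the model being studied
# (e.g. residual-stream or MLP outputs). Random data stands in here.
sae = SparseAutoencoder(d_model=768, d_hidden=768 * 16)
acts = torch.randn(32, 768)
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
loss.backward()
```

The key design choice is the overcomplete hidden layer plus the sparsity penalty: together they force the SAE to explain each activation as a small combination drawn from a large dictionary of candidate features, which is what makes individual features legible.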
We have gone from not knowing what was inside these models to a crude map. The map is wrong in places, but it is a map.
— Chris Olah, Anthropic interpretability lead
The big idea: SAEs turned interpretability from stamp-collecting into an engineering discipline. They are the main reason interpretability had a breakout year in 2024.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-sparse-autoencoders-creators
1. In a sparse autoencoder, how does the hidden layer size compare to the input activation size?
2. What does it mean for SAE hidden activations to be 'sparse'?
3. When Anthropic trained sparse autoencoders on Claude 3 Sonnet, approximately how many features did they discover?
4. What happened when researchers amplified the 'Golden Gate Bridge' feature in Claude?
5. Which of the following is NOT listed as an application of SAE features in the lesson?
6. The lesson compares SAE features to a map that is 'wrong in places.' What does this analogy imply?
7. What is 'dictionary learning' in the context of sparse autoencoders?
8. What does 'monosemanticity' mean in the context of interpretability?
9. Why is training SAEs on frontier models computationally expensive?
10. How can SAE features be used to detect distribution shift?
11. What is the primary goal of the decoder in a sparse autoencoder?
12. The lesson states SAEs turned interpretability from 'stamp-collecting' into an engineering discipline. What does 'stamp-collecting' refer to?
13. Why is it important to audit which features a model has learned?
14. What type of layer activations are fed into a sparse autoencoder?
15. Which company was mentioned in the lesson as training SAEs on one of their language models?