Neural networks mix many concepts into each neuron. Sparse autoencoders pull them apart into human-readable features. They are the workhorse of modern interpretability.
A neural network has way more concepts than neurons. A single neuron in GPT-2 might fire for 'this is a legal document,' 'the Golden Gate Bridge,' and 'word ending in -ing' all at once. This is superposition: the network packs many features into shared directions in activation space. If you want to see the features, you need to unmix them.
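To make the unmixing concrete, here is a minimal sketch of a sparse autoencoder in PyTorch. The names (SparseAutoencoder, sae_loss), the widths, and the L1 coefficient are illustrative assumptions, not the exact recipe from any published SAE: the hidden layer is much wider than the input, and an L1 penalty pushes most hidden activations to zero.

```python
# Minimal sparse autoencoder sketch (PyTorch). All names and
# hyperparameters are illustrative, not a specific lab's setup.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        # d_hidden is typically many times larger than d_model
        # (an "overcomplete dictionary" of candidate features).
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative; the L1 term
        # in the loss drives most of them to exactly zero (sparsity).
        f = torch.relu(self.encoder(x))
        # The decoder's job is to reconstruct the original activation
        # from the sparse feature vector.
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error + L1 sparsity penalty on the features.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# Usage: feed activations captured from the model being studied
# (e.g. residual-stream or MLP outputs). Random data stands in here.
sae = SparseAutoencoder(d_model=768, d_hidden=768 * 16)
acts = torch.randn(32, 768)
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
loss.backward()
```

The key design choice is the overcomplete hidden layer plus the sparsity penalty: together they force the SAE to explain each activation as a small combination drawn from a large dictionary of candidate features, which is what makes individual features legible.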
We have gone from not knowing what was inside these models to a crude map. The map is wrong in places, but it is a map.
— Chris Olah, Anthropic interpretability lead
The big idea: SAEs turned interpretability from stamp-collecting into an engineering discipline. They are the main reason interpretability had a breakout year in 2024.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-sparse-autoencoders-creators
1. In a sparse autoencoder, how does the hidden layer size compare to the input activation size?
2. What does it mean for SAE hidden activations to be 'sparse'?
3. When Anthropic trained sparse autoencoders on Claude 3 Sonnet, approximately how many features did they discover?
4. What happened when researchers amplified the 'Golden Gate Bridge' feature in Claude?
5. Which of the following is NOT listed as an application of SAE features in the lesson?
6. The lesson compares SAE features to a map that is 'wrong in places.' What does this analogy imply?
7. What is 'dictionary learning' in the context of sparse autoencoders?
8. What does 'monosemanticity' mean in the context of interpretability?
9. Why is training SAEs on frontier models computationally expensive?
10. How can SAE features be used to detect distribution shift?
11. What is the primary goal of the decoder in a sparse autoencoder?
12. The lesson states SAEs turned interpretability from 'stamp-collecting' into an engineering discipline. What does 'stamp-collecting' refer to?
13. Why is it important to audit which features a model has learned?
14. What type of layer activations are fed into a sparse autoencoder?
15. Which company was mentioned in the lesson as training SAEs on one of their language models?