Feature Discovery in LLMs
A feature is a direction in activation space that corresponds to a concept. Finding them — naming them, ranking them, connecting them — is one of the central activities of interpretability research.
Lesson map
What this lesson covers:
1. What Counts as a Feature

Concept cluster
Terms to connect while reading: feature, activation, interpretability
Section 1
What Counts as a Feature
Informally, a feature is anything the network appears to 'represent' in its activations. More formally, it's a direction in activation space whose magnitude correlates with the presence of some input property — a word, a concept, a category, a sentiment.
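To make the definition concrete, here is a toy sketch in Python with NumPy (all names and sizes below are illustrative assumptions, not tied to any particular model): a feature is just a unit vector, and "how strongly it fires" on an activation is the projection onto that vector.

```python
import numpy as np

# Toy illustration of "a feature is a direction in activation space".
d_model = 8                                    # hypothetical hidden size
feature_direction = np.random.randn(d_model)   # stand-in for a discovered feature
feature_direction /= np.linalg.norm(feature_direction)  # unit-normalize the direction

activation = np.random.randn(d_model)          # one token's activation vector
feature_score = activation @ feature_direction # projection = how strongly the feature fires
# For a real feature, this score should be high on inputs that contain the
# concept and near zero on inputs that do not.
```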
Ways researchers find them
1. Activation maximization: find inputs that make a neuron or direction fire hardest
2. Sparse autoencoders: learn a dictionary of candidate features from activations (sketched in code after this list)
3. Probing: train a small classifier to detect a concept from activations
4. Difference in means: compare mean activations across two labeled sets (sketched in code after this list)
5. Steering vectors: find a direction that, when added to activations, changes output in a specific way (an intervention sketch appears after the quote below)
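A minimal sketch of the difference-in-means method: collect activations on inputs that do and do not show the concept, and take the difference of the class means as the candidate direction. The array names and shapes are assumptions for illustration, not any specific library's API.

```python
import numpy as np

def difference_in_means_direction(acts_with: np.ndarray, acts_without: np.ndarray) -> np.ndarray:
    """Candidate feature direction from two labeled activation sets.

    acts_with / acts_without: shape (n_examples, d_model), activations collected
    on inputs that do / do not exhibit the concept of interest.
    """
    direction = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return direction / np.linalg.norm(direction)  # unit-normalize for comparability

# Usage sketch: score held-out activations by projecting onto the direction;
# a good candidate feature separates the two labeled sets cleanly.
# scores = held_out_acts @ direction
```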
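And a minimal sparse-autoencoder sketch in PyTorch. The sizes, the `activation_batches` iterator, and the L1 coefficient are placeholder assumptions; real SAE training pipelines are much larger, but the core objective is the same: reconstruct activations through a sparse, overcomplete bottleneck.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Learns an overcomplete dictionary of candidate feature directions."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        codes = torch.relu(self.encoder(acts))   # sparse codes: which features are active
        recon = self.decoder(codes)              # reconstruction from those features
        return recon, codes

def sae_loss(recon, acts, codes, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes the codes toward sparsity.
    return ((recon - acts) ** 2).mean() + l1_coeff * codes.abs().mean()

# Training sketch; `activation_batches` is a hypothetical iterator over
# (batch, d_model) activations collected from the model being studied.
# sae = SparseAutoencoder(d_model=4096, d_dict=32768)
# opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
# for acts in activation_batches:
#     recon, codes = sae(acts)
#     loss = sae_loss(recon, acts, codes)
#     opt.zero_grad(); loss.backward(); opt.step()
# Each decoder column sae.decoder.weight[:, i] is one candidate feature direction.
```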
Named features in the wild (as of ~2024-2025)
- Truthfulness and deception directions
- Refusal features that activate before a safety refusal
- Sycophancy features tied to agreeing with users
- Entity-tracking features for names and pronouns
- Code-vulnerability features
- Sandbagging-related features
“Every feature we name is a hypothesis. The test is whether intervening on it does what we expect.”
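The intervention test the quote describes can be run as a steering-vector edit: add a scaled copy of the candidate direction to a layer's activations and check whether behavior shifts the way the feature's name predicts. The sketch below uses a PyTorch forward hook; the layer index, scale, and model interface are assumptions for illustration, not a specific model's API.

```python
import torch

def add_steering_hook(module: torch.nn.Module, direction: torch.Tensor, scale: float):
    """Add `scale * direction` to a layer's output on every forward pass.

    Assumes the module's output is a (batch, seq, d_model) tensor, or a tuple
    whose first element is such a tensor (common in transformer blocks).
    """
    def hook(_module, _inputs, output):
        if isinstance(output, tuple):
            return (output[0] + scale * direction,) + output[1:]
        return output + scale * direction
    return module.register_forward_hook(hook)

# Usage sketch with a hypothetical decoder-only model:
# handle = add_steering_hook(model.layers[15], refusal_direction, scale=4.0)
# output = model.generate(prompt)   # does the model now refuse more, or less?
# handle.remove()                   # always detach the hook afterwards
```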
The big idea: features are the unit of analysis in modern interpretability. Learning to read them is the closest thing the field has to learning to read code.
Related lessons
- Mechanistic Interpretability: Reading the Model's Mind (Creators · 55 min): Sparse autoencoders, features, circuits. How researchers try to see what a model actually thinks, and why it may be the most strategically important safety work.
- Circuits in Neural Networks (Builders · 28 min): A circuit is a small sub-network inside a big model that implements one specific behavior. Finding circuits is how researchers prove how a model does what it does.
- AI Alignment: The Actual Technical Problem (Creators · 50 min): Alignment is not a vibes debate. It is a concrete technical problem about getting systems to pursue goals we actually want. Here is what researchers work on when they say they work on alignment.
