Why Models Are Hard to Reason About
LLMs are black boxes with billions of parameters. Why is interpretability so hard — and what progress has been made?
Lesson map
What this lesson covers
Learning path: the main moves in order
1. A Trillion Parameters of Fog
Concept cluster: terms to connect while reading
- interpretability
- mechanistic
- black box
Section 1
A Trillion Parameters of Fog
You trained a model. It works. You cannot say why. Modern LLMs have hundreds of billions of parameters woven together; they produce answers through processes no human designed. Understanding them has become its own research field: interpretability.
Why it is so hard
- Superposition: each neuron encodes many features, and each feature is spread across many neurons (see the sketch after this list)
- Distributed representations: no single unit corresponds to 'the idea of a cat'
- Nonlinearity: changing one weight shifts outputs in nonlinear ways across the entire network
- Emergent behaviors: the whole exhibits capabilities that no individual part explains
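To make superposition concrete, here is a minimal NumPy sketch (all dimensions and values are toy, invented for illustration): eight sparse features packed into a four-dimensional activation space. Because eight directions cannot all be orthogonal in four dimensions, reading one feature back picks up interference from the others.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 8, 4  # more features than dimensions: superposition

# One unit-length direction per feature, crammed into 4 dimensions.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only feature 3 is active.
x = np.zeros(n_features)
x[3] = 1.0

activation = x @ W          # the 4-dim "neuron" vector the model would see
readout = activation @ W.T  # naive decode: dot product with each direction

print(np.round(readout, 2))
# readout[3] is 1.0, but the other entries are nonzero: interference.
# No single neuron "is" feature 3, and feature 3 touches every neuron.
```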
Four interpretability traditions
Compare the options
| Approach | Question asked | Example |
|---|---|---|
| Behavioral | What does the model do? | Eval suites, red-teaming |
| Probing | What information does it encode? | Linear probes on activations |
| Mechanistic | What algorithm runs inside? | Circuit analysis, induction heads |
| Feature-level | Which concepts does it represent? | Sparse autoencoders (SAEs) |
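To ground the Probing row, here is a minimal linear-probe sketch using scikit-learn. The activations are synthetic stand-ins with a planted signal; in real probing work, X would be hidden states captured from an actual model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Fake "activations": 1000 examples of 64-dim hidden states, where one
# direction carries a binary property (say, past vs. present tense).
y = rng.integers(0, 2, size=1000)
X = rng.normal(size=(1000, 64))
X[:, 7] += 3.0 * y  # plant the signal along a single direction

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
# High accuracy shows the property is linearly decodable from the
# activations; it does not prove the model actually uses it.
```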
Mechanistic interpretability
The ambitious project: reverse-engineer a neural network into human-readable algorithms. Anthropic's mechanistic interpretability team found 'induction heads' that copy patterns; other researchers traced an 'indirect object identification' circuit in GPT-2; and sparse autoencoders have since surfaced human-interpretable features in larger models.
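To give a feel for the sparse-autoencoder idea, here is a minimal PyTorch sketch trained on synthetic activations. Real SAE work trains on activations captured from a live model with far larger dictionaries; the dimensions and coefficients below are invented, so read this as the shape of the objective, not the method itself.

```python
import torch
import torch.nn as nn

d_model, d_dict, l1_coeff = 64, 256, 1e-3  # toy sizes, assumed for illustration

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # feature activations, pushed toward zero
        return self.dec(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(4096, d_model)  # stand-in for captured model activations

for step in range(200):
    x_hat, f = sae(acts)
    # Reconstruct the activations while keeping feature use sparse (L1 penalty).
    loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.4f}")
# Each decoder column is a candidate feature direction; the hope is that
# sparse features are more human-interpretable than raw neurons.
```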
Why it matters
1. Debug weird behaviors (jailbreaks, hallucinations) at the circuit level
2. Spot deception or scheming before deployment
3. Give the safety community tools for audit
4. Catch dangerous capabilities earlier in training
“We can now identify millions of features inside a frontier language model.”
The big idea: we are building minds faster than we can understand them. Interpretability is the project to close that gap.
Related lessons
Keep going
- What Is Intelligence, Really? A Working Framework. Before we can judge whether an AI is intelligent, we need a framework for what intelligence even means. Draws on Chollet, Dennett, and modern evals.
- The Economics and Ethics of Training Data. Data is the strategic asset of AI. Understand the supply chain, the legal fight, and the philosophical stakes before you build anything on top.
- Emergence, Capability Forecasting, and Safety. Emergent abilities make AI both more exciting and more dangerous. How do labs forecast what the next model will do, and what happens when they are wrong?
