LLMs are black boxes with billions of parameters. Why is interpretability so hard — and what progress has been made?
You trained a model. It works. You cannot say why. Modern LLMs have hundreds of billions of parameters woven together; they produce answers through processes no human designed. Understanding them has become its own research field: interpretability.
| Approach | Question asked | Example |
|---|---|---|
| Behavioral | What does the model do? | Eval suites, red-teaming |
| Probing | What information does it encode? | Linear probes on activations |
| Mechanistic | What algorithm runs inside? | Circuit analysis, induction heads |
| Feature-level | What concepts does it represent? | Sparse autoencoders (SAEs) |
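To make the probing row concrete, here is a minimal sketch of a linear probe. It assumes synthetic "activations" in which a binary concept is written along one hidden direction; real probes are trained on activations recorded from an actual model. The function names, dimensions, and hyperparameters are illustrative assumptions, not part of any library.

```python
import numpy as np


def make_synthetic_activations(n=400, dim=16, seed=0):
    """Fake activations: a binary concept shifts each vector along one hidden direction."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    noise = rng.normal(size=(n, dim))
    # Label 1 is shifted by +3 along the direction, label 0 by -3.
    acts = noise + np.outer(3.0 * (2 * labels - 1), direction)
    return acts, labels


def fit_linear_probe(acts, labels, lr=0.1, steps=500):
    """Logistic regression trained with plain gradient descent."""
    n, dim = acts.shape
    w = np.zeros(dim)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(w, b, acts, labels):
    """Fraction of examples where the probe's prediction matches the label."""
    preds = (acts @ w + b > 0).astype(int)
    return float((preds == labels).mean())


acts, labels = make_synthetic_activations()
w, b = fit_linear_probe(acts[:300], labels[:300])
print("held-out probe accuracy:", probe_accuracy(w, b, acts[300:], labels[300:]))
```

High held-out accuracy says the concept is linearly readable from the activations. It does not show that the model itself uses that information.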
The ambitious project: reverse-engineer a neural network into human-readable algorithms. Anthropic's interpretability researchers described 'induction heads' that copy earlier patterns; researchers at Redwood Research mapped an 'indirect object identification' circuit in GPT-2 small; and, with sparse autoencoders, Anthropic extracted human-interpretable features from larger models.
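The pattern-copying behavior credited to induction heads can be written as a plain rule: if the current token appeared earlier, predict whatever followed it last time. The sketch below captures only that input-output behavior. It is not the attention mechanism itself, and `induction_predict` is a hypothetical name used for illustration.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the latest earlier match."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the current one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # No earlier occurrence, so the rule has nothing to copy.


print(induction_predict(["Mr", "Dursley", "said", "Mr"]))
```

Mechanistic work aims to show which attention heads and weights implement a rule like this inside a trained transformer, rather than only observing that the behavior occurs.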
We can now identify millions of features inside a frontier language model.
— Anthropic, Scaling Monosemanticity (2024)
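For a sense of what a sparse autoencoder computes, here is a minimal forward pass and loss. The sizes, random weights, and L1 coefficient are assumptions chosen for illustration; published SAEs are trained on real model activations at far larger scale, and this sketch does no training at all.

```python
import numpy as np


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct them."""
    features = np.maximum(0.0, x @ w_enc + b_enc)  # ReLU keeps features >= 0
    reconstruction = features @ w_dec + b_dec
    return features, reconstruction


def sae_loss(x, features, reconstruction, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward zero."""
    mse = np.mean((x - reconstruction) ** 2)
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return float(mse + l1_coeff * sparsity)


rng = np.random.default_rng(0)
d_model, d_features = 8, 32  # more features than dimensions (overcomplete)
x = rng.normal(size=(4, d_model))  # stand-in for real model activations
w_enc = rng.normal(scale=0.1, size=(d_model, d_features))
b_enc = np.zeros(d_features)
w_dec = rng.normal(scale=0.1, size=(d_features, d_model))
b_dec = np.zeros(d_model)

features, reconstruction = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
print("loss:", sae_loss(x, features, reconstruction))
```

After training, the hope is that each feature fires for one recognizable concept, which is what makes the features readable to a human.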
The big idea: we are building minds faster than we can understand them. Interpretability is the project to close that gap.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-hard-to-reason-about
What is the main idea of "Why Models Are Hard to Reason About"?
Which concept is most central to "Why Models Are Hard to Reason About"?
Which use of AI fits this topic best?
What should a careful learner remember about "Sparse autoencoders (SAEs)"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about interpretability be treated?
Name one way to verify an AI answer about interpretability.
Which action would help you apply "Why Models Are Hard to Reason About" responsibly?