Why Models Are Hard to Reason About
LLMs are black boxes with billions of parameters. Why is interpretability so hard — and what progress has been made?
Lesson map
What this lesson covers
Learning path: the main moves in order
1. A Trillion Parameters of Fog
Concept cluster: terms to connect while reading
- interpretability
- mechanistic
- black box
Section 1
A Trillion Parameters of Fog
You trained a model. It works. You cannot say why. Modern LLMs have hundreds of billions of parameters woven together; they produce answers through processes no human designed. Understanding them has become its own research field: interpretability.
Why it is so hard
- Superposition: each neuron encodes many features, and each feature is spread across many neurons (see the sketch after this list)
- Distributed representations: no single unit corresponds to 'the idea of a cat'
- Nonlinearity: changing one weight shifts outputs in nonlinear ways across the entire network
- Emergent behaviors: the whole exhibits capabilities that no individual part explains
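To make superposition concrete, here is a minimal NumPy sketch (all dimensions and values are toy, invented for illustration): eight sparse features packed into a four-dimensional activation space. Because eight directions cannot all be orthogonal in four dimensions, reading one feature back picks up interference from the others.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 8, 4  # more features than dimensions: superposition

# One unit-length direction per feature, crammed into 4 dimensions.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only feature 3 is active.
x = np.zeros(n_features)
x[3] = 1.0

activation = x @ W          # the 4-dim "neuron" vector the model would see
readout = activation @ W.T  # naive decode: dot product with each direction

print(np.round(readout, 2))
# readout[3] is 1.0, but the other entries are nonzero: interference.
# No single neuron "is" feature 3, and feature 3 touches every neuron.
```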
Four interpretability traditions
Compare the options
| Approach | Question asked | Example |
|---|---|---|
| Behavioral | What does the model do? | Eval suites, red-teaming |
| Probing | What information does it encode? | Linear probes on activations |
| Mechanistic | What algorithm runs inside? | Circuit analysis, induction heads |
| Feature-level | Which concepts does it represent? | Sparse autoencoders (SAEs) |
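To ground the Probing row, here is a minimal linear-probe sketch using scikit-learn. The activations are synthetic stand-ins with a planted signal; in real probing work, X would be hidden states captured from an actual model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Fake "activations": 1000 examples of 64-dim hidden states, where one
# direction carries a binary property (say, past vs. present tense).
y = rng.integers(0, 2, size=1000)
X = rng.normal(size=(1000, 64))
X[:, 7] += 3.0 * y  # plant the signal along a single direction

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
# High accuracy shows the property is linearly decodable from the
# activations; it does not prove the model actually uses it.
```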
Mechanistic interpretability
The ambitious project: reverse-engineer a neural network into human-readable algorithms. Anthropic's mechanistic interpretability team found 'induction heads' that copy patterns; other researchers traced an 'indirect object identification' circuit in GPT-2; and sparse autoencoders have since surfaced human-interpretable features in larger models.
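To give a feel for the sparse-autoencoder idea, here is a minimal PyTorch sketch trained on synthetic activations. Real SAE work trains on activations captured from a live model with far larger dictionaries; the dimensions and coefficients below are invented, so read this as the shape of the objective, not the method itself.

```python
import torch
import torch.nn as nn

d_model, d_dict, l1_coeff = 64, 256, 1e-3  # toy sizes, assumed for illustration

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # feature activations, pushed toward zero
        return self.dec(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(4096, d_model)  # stand-in for captured model activations

for step in range(200):
    x_hat, f = sae(acts)
    # Reconstruct the activations while keeping feature use sparse (L1 penalty).
    loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.4f}")
# Each decoder column is a candidate feature direction; the hope is that
# sparse features are more human-interpretable than raw neurons.
```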
Why it matters
1. Debug weird behaviors (jailbreaks, hallucinations) at the circuit level
2. Spot deception or scheming before deployment
3. Give the safety community tools for audit
4. Catch dangerous capabilities earlier in training
“We can now identify millions of features inside a frontier language model.”
The big idea: we are building minds faster than we can understand them. Interpretability is the project to close that gap.
Related lessons
Keep going
- What Is Intelligence, Really? A Working Framework. Before we can judge whether an AI is intelligent, we need a framework for what intelligence even means. Draws on Chollet, Dennett, and modern evals.
- The Economics and Ethics of Training Data. Data is the strategic asset of AI. Understand the supply chain, the legal fight, and the philosophical stakes before you build anything on top.
- Emergence, Capability Forecasting, and Safety. Emergent abilities make AI both more exciting and more dangerous. How do labs forecast what the next model will do, and what happens when they are wrong?
