Probing: Linear, Nonlinear, and Contrast
Probing asks a simple question: given a model's hidden state, can a small classifier predict some property? The answer tells you what the model represents, whether or not it uses that information.
Section 1
The Probe Recipe
Take a labeled dataset — sentences marked 'positive sentiment' vs 'negative,' or 'factual' vs 'hallucinated.' Extract the model's hidden activations on each input. Train a small classifier (a probe) to predict the label from the activations. If it succeeds, the model represents that information somewhere.
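The recipe above can be sketched end to end. This is a minimal illustration, not a real probing run: synthetic vectors with a planted "sentiment direction" stand in for activations you would actually extract from a model layer, and all names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden states: we plant a "sentiment direction"
# in random vectors so the recipe is runnable without a real model.
d, n = 16, 200                                 # hidden size, dataset size
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)            # 0 = negative, 1 = positive
acts = rng.normal(size=(n, d)) + 2.0 * np.outer(2 * labels - 1, direction)

# The probe itself: logistic regression, trained by gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # sigmoid
    w -= 0.5 * (acts.T @ (p - labels)) / n
    b -= 0.5 * np.mean(p - labels)

acc = np.mean(((acts @ w + b) > 0) == labels)
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy here means the label is linearly decodable from the activations, which is exactly the claim a linear probe licenses: the information is present, not necessarily used.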
Types of probes
- Linear probe: a single linear layer — simplest and most trusted
- Nonlinear probe: an MLP or deeper classifier — recovers information that is encoded nonlinearly and would be invisible to a linear probe
- Contrast probe (CCS, Contrast-Consistent Search): trained without labels, under the constraint that a statement and its negation should receive opposite probe outputs
- Direction finding: not quite a probe, but similar — find a direction in activation space that maps to a concept
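The contrast-probe idea can be made concrete as a loss function. The sketch below uses synthetic statement/negation activation pairs (illustrative data, not real model states) to show the two terms of a CCS-style objective: a statement and its negation should get complementary predictions, and the probe should avoid the uninformative 0.5 answer, with no labels anywhere.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Synthetic contrast pairs: acts_pos[i] / acts_neg[i] stand in for the
# hidden states of a statement and its negation. A hidden "truth
# direction" flips sign between the two -- the structure CCS exploits.
d, n = 16, 200
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)
signs = 2 * rng.integers(0, 2, size=n) - 1     # hidden truth values
acts_pos = rng.normal(size=(n, d)) + 2.0 * np.outer(signs, truth_dir)
acts_neg = rng.normal(size=(n, d)) - 2.0 * np.outer(signs, truth_dir)

def ccs_loss(w, b):
    """CCS-style objective: no labels, only consistency + confidence."""
    pp = sigmoid(acts_pos @ w + b)
    pn = sigmoid(acts_neg @ w + b)
    consistency = (pp - (1.0 - pn)) ** 2       # contradictions should sum to 1
    confidence = np.minimum(pp, pn) ** 2       # punish the 0.5 cop-out
    return np.mean(consistency + confidence)

# A probe aligned with the truth direction scores far better than a
# random one, even though no labels appear in the loss.
aligned = ccs_loss(3.0 * truth_dir, 0.0)
random_probe = ccs_loss(3.0 * rng.normal(size=d) / np.sqrt(d), 0.0)
print(f"aligned: {aligned:.3f}  random: {random_probe:.3f}")
```

In the real method the probe is trained by minimizing this loss over the pair activations; the point of the toy comparison is that the unsupervised objective already prefers the truth-tracking direction.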
Classic findings
1. BERT encodes syntactic trees in its hidden states (Hewitt and Manning, 2019)
2. LLMs represent truth-vs-falsehood directions that persist across layers
3. Chess-playing transformers encode full board state in their activations
4. Many 'emergent' capabilities show up first in probes before appearing in output
“Probing tells you what is in the water. It does not tell you what the fish is doing.”
The big idea: probing is the cheapest, oldest tool in the interpretability kit. It is also still one of the most useful, especially when paired with intervention experiments.
