Activation Patching: Intervention Experiments
Correlation is not causation, even inside a neural network. Activation patching is the interpretability equivalent of a controlled experiment — swap one component and see what changes.
Lesson map
The main moves in order
1. The causal question
2. Activation patching
3. Causal intervention
4. Path patching
Section 1
The Causal Question
A probe tells you a model represents something. A circuit diagram is a hypothesis about how. Activation patching is how you test that hypothesis causally: you take an activation from one run and paste it into another run, then see if the output changes the way your theory predicts.
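The "copy from one run, paste into another" move is mechanically simple. Here is a minimal sketch using PyTorch forward hooks on a toy two-layer model; the model, inputs, and the single patch site are stand-ins for illustration, not a real interpretability setup:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a model with an intermediate activation worth patching.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

clean_x = torch.randn(1, 4)
corrupt_x = torch.randn(1, 4)

# Clean run: cache the activation at the chosen site.
cache = {}
def save_hook(module, inp, out):
    cache["hidden"] = out.detach()

handle = model[0].register_forward_hook(save_hook)
clean_logits = model(clean_x)
handle.remove()

# Corrupted run, but with the cached clean activation spliced in.
# A forward hook that returns a tensor replaces the module's output.
def patch_hook(module, inp, out):
    return cache["hidden"]

handle = model[0].register_forward_hook(patch_hook)
patched_logits = model(corrupt_x)
handle.remove()

# Here the patch overwrites everything downstream of the site, so the
# patched run reproduces the clean output exactly.
print(torch.allclose(patched_logits, clean_logits))  # True
```

In a real transformer the cached and patched site would be something finer-grained, such as one attention head's output at one token position, so the patch changes only part of what downstream layers see.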
The standard procedure
1. Run the model on a clean input and record the correct answer.
2. Run the model on a corrupted input (e.g., a different name, a shuffled prompt) and record the wrong answer.
3. Save activations from the clean run at every position and layer.
4. Re-run the corrupted input, but patch in a clean activation at one specific site.
5. Measure whether the output moved back toward the correct answer.
6. Repeat across sites to find which ones matter.
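The full procedure can be sketched end to end on a toy model. Everything here is a hypothetical stand-in: a one-hidden-layer network plays the role of the transformer, individual hidden units play the role of sites (heads, positions), and class 0 plays the role of the correct answer. The normalized logit-difference metric, however, is the kind of recovery score used in practice:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model: one hidden layer whose individual units we treat as
# patchable sites. In a real transformer the sites would be attention
# heads or MLP outputs at each (layer, position) pair.
lin1, lin2 = nn.Linear(4, 8), nn.Linear(8, 2)

clean_x, corrupt_x = torch.randn(1, 4), torch.randn(1, 4)

# Steps 1-3: clean and corrupted baselines, caching the clean activations.
clean_hidden = torch.relu(lin1(clean_x))
clean_logits = lin2(clean_hidden)
corrupt_hidden = torch.relu(lin1(corrupt_x))
corrupt_logits = lin2(corrupt_hidden)

def logit_diff(logits):
    # Treat class 0 as the "correct answer" and class 1 as the wrong one.
    return (logits[0, 0] - logits[0, 1]).item()

clean_ld, corrupt_ld = logit_diff(clean_logits), logit_diff(corrupt_logits)

def recovery(patched_logits):
    # 0.0 = output still fully corrupted, 1.0 = clean behavior restored.
    return (logit_diff(patched_logits) - corrupt_ld) / (clean_ld - corrupt_ld)

# Steps 4-6: splice one clean unit at a time into the corrupted run,
# then rank sites by how much of the clean behavior each restores.
scores = {}
for unit in range(8):
    patched_hidden = corrupt_hidden.clone()
    patched_hidden[0, unit] = clean_hidden[0, unit]
    scores[unit] = recovery(lin2(patched_hidden))

for unit, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"site {unit}: recovery {score:+.2f}")
```

Because the readout here is linear, the per-site scores happen to sum exactly to 1.0; in a real network, nonlinearity and redundancy break that additivity, which is why patching sweeps are read as rankings of which sites matter rather than as exact attributions.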
Why it is the gold standard
- Produces causal, not correlational, evidence
- Works without retraining
- Scales to any architecture with a residual stream
- Can identify redundant circuits that do the same work
“Without causal intervention, every interpretability claim is a guess dressed in plots.”
The big idea: patching is how you separate 'the model represents this' from 'the model uses this.' Without it, interpretability is observational. With it, interpretability becomes experimental.
