Correlation is not causation, even inside a neural network. Activation patching is the interpretability equivalent of a controlled experiment — swap one component and see what changes.
A probe tells you that a model represents something. A circuit diagram is a hypothesis about how the model uses it. Activation patching is how you test that hypothesis causally: you take an activation from one run and paste it into the same component of another run, then check whether the output changes the way your theory predicts.
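To make the procedure concrete, here is a minimal sketch in plain Python. The two-unit network, its weights, and the helper names (`hidden`, `output`, `run`, `restored_fraction`) are hypothetical, chosen only so the result can be checked by hand; real experiments patch activations inside a trained model. Both hidden units copy an input feature, so a probe could read the feature off either one, but only unit 0 feeds the output.

```python
# Toy illustration of activation patching. The network and its weights are
# hypothetical, picked so every number can be verified by hand.

W_IN = [[1.0, 0.0], [0.0, 1.0]]   # hidden unit i reads input feature i
W_OUT = [2.0, 0.0]                # the output depends on hidden unit 0 only


def relu(x):
    return max(0.0, x)


def hidden(x):
    """Hidden-layer activations for input vector x."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W_IN]


def output(h):
    """Scalar output computed from hidden activations h."""
    return sum(w * hi for w, hi in zip(W_OUT, h))


def run(x, patch=None):
    """Forward pass; `patch` maps hidden-unit index -> activation to paste in."""
    h = hidden(x)
    for i, value in (patch or {}).items():
        h[i] = value
    return output(h)


clean_x = [1.0, 1.0]      # input on which the behavior of interest appears
corrupt_x = [0.0, 0.0]    # input on which it does not

clean_h = hidden(clean_x)         # cache activations from the clean run
clean_out = run(clean_x)
corrupt_out = run(corrupt_x)


def restored_fraction(unit):
    """Share of the clean-vs-corrupt output gap recovered by patching one unit."""
    patched_out = run(corrupt_x, patch={unit: clean_h[unit]})
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)


effects = {unit: restored_fraction(unit) for unit in range(len(clean_h))}
print(effects)
```

Patching unit 0 restores the full clean output, while patching unit 1 restores none of it, even though both units carry the same kind of information. In this toy setup, that gap is the difference between a unit that represents a feature and a unit the output actually depends on.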
Without causal intervention, every interpretability claim is a guess dressed in plots.
— Common paraphrase from interpretability workshops
The big idea: patching is how you separate 'the model represents this' from 'the model uses this.' Without it, interpretability is observational. With it, interpretability becomes experimental.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-activation-patching-creators
What is the main idea of "Activation Patching: Intervention Experiments"?
Which concept is most central to "Activation Patching: Intervention Experiments"?
Which use of AI fits this topic best?
What should a careful learner remember about "Path patching"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about activation patching be treated?
Name one way to verify an AI answer about activation patching.
Which action would help you apply "Activation Patching: Intervention Experiments" responsibly?