Correlation is not causation, even inside a neural network. Activation patching is the interpretability equivalent of a controlled experiment — swap one component and see what changes.
A probe tells you a model represents something. A circuit diagram is a hypothesis about how. Activation patching is how you test that hypothesis causally: you cache an activation from one run (for example, on a clean prompt), paste it into the same site during a second run (for example, on a corrupted prompt), and check whether the output moves the way your theory predicts.
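The procedure can be sketched on a toy network. The example below is a minimal illustration in pure Python, not a real interpretability library: the two-unit network, its weights, the inputs, and the helper names (forward, restored_fraction) are all made up for this sketch. Both hidden units respond to the difference between the clean and corrupted inputs, so both "represent" it, but only one of them feeds the output.

```python
# Toy activation patching on a hand-built two-layer network (pure Python).
# All weights, inputs, and function names are hypothetical, chosen for illustration.

def relu(x):
    return max(0.0, x)

W_IN = [[1.0, 0.0],   # hidden unit 0 reads input feature 0
        [0.0, 1.0]]   # hidden unit 1 reads input feature 1
W_OUT = [2.0, 0.0]    # the output reads hidden unit 0 only

def forward(x, patch=None):
    """Run the network; `patch` maps a hidden index to an activation to paste in."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W_IN]
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value  # the intervention: overwrite one activation
    out = sum(w * h for w, h in zip(W_OUT, hidden))
    return out, hidden

clean_x = [1.0, 1.0]
corrupt_x = [0.0, 0.0]

# Step 1: run both inputs and cache the clean activations.
clean_out, clean_hidden = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

def restored_fraction(idx):
    """Patch one clean activation into the corrupted run and report how much
    of the clean-vs-corrupted output gap it restores (0 = none, 1 = all)."""
    patched_out, _ = forward(corrupt_x, patch={idx: clean_hidden[idx]})
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)

# Step 2: patch each hidden unit in turn.
effects = {idx: restored_fraction(idx) for idx in range(len(W_IN))}
print(effects)
```

In this toy setup, patching hidden unit 0 restores the full clean output while patching hidden unit 1 restores none of it, even though both units differ between the two runs. A probe would flag both units; only the intervention shows which one the output depends on. With a real model the same loop would typically run over layers, heads, or token positions rather than two hand-written units.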
Without causal intervention, every interpretability claim is a guess dressed in plots.
— Common paraphrase from interpretability workshops
The big idea: patching is how you separate 'the model represents this' from 'the model uses this.' Without it, interpretability is observational. With it, interpretability becomes experimental.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-activation-patching-creators
What is the core idea behind "Activation Patching: Intervention Experiments"?
Which term best describes a foundational idea in "Activation Patching: Intervention Experiments"?
A learner studying Activation Patching: Intervention Experiments would need to understand which concept?
Which of these is directly relevant to Activation Patching: Intervention Experiments?
Which of the following is a key point about Activation Patching: Intervention Experiments?
Which of these does NOT belong in a discussion of Activation Patching: Intervention Experiments?
Which statement is accurate regarding Activation Patching: Intervention Experiments?
Which of these does NOT belong in a discussion of Activation Patching: Intervention Experiments?
What is the key insight about "Path patching" in the context of Activation Patching: Intervention Experiments?
What is the key insight about "Off-distribution activations" in the context of Activation Patching: Intervention Experiments?
Which statement accurately describes an aspect of Activation Patching: Intervention Experiments?
What does working with Activation Patching: Intervention Experiments typically involve?
Which best describes the scope of "Activation Patching: Intervention Experiments"?
Which section heading best belongs in a lesson about Activation Patching: Intervention Experiments?
Which section heading best belongs in a lesson about Activation Patching: Intervention Experiments?