Activation Patching: Intervention Experiments

Correlation is not causation, even inside a neural network. Activation patching is the interpretability equivalent of a controlled experiment: swap one component's activation, hold everything else fixed, and see how the output changes.

34 min · Reviewed 2026

The Causal Question

A probe tells you that some information can be decoded from a model's activations. A circuit diagram is a hypothesis about how the model uses that information. Activation patching tests the hypothesis causally: you take an activation from one run, paste it into another run, and check whether the output changes the way your theory predicts.
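To make the "paste an activation from one run into another" idea concrete, here is a minimal sketch on a toy two-stage function rather than a real network. The names `stage1`, `stage2`, and `run` are hypothetical and exist only for this illustration.

```python
# Toy two-stage "model": hidden = stage1(x), output = stage2(hidden).
# All names are hypothetical; this is not a real network or library API.

def stage1(x):
    """Produce the intermediate activation from the input."""
    return [2.0 * x[0], x[0] + x[1]]

def stage2(hidden):
    """Produce the output from the intermediate activation."""
    return hidden[0] - hidden[1]

def run(x, patched_hidden=None):
    """Run the model; if given, `patched_hidden` replaces the activation."""
    hidden = stage1(x) if patched_hidden is None else patched_hidden
    return stage2(hidden), hidden

source_out, source_hidden = run([3.0, 1.0])   # run we copy the activation from
target_out, _ = run([0.0, 1.0])               # run we paste it into, unpatched
patched_out, _ = run([0.0, 1.0], patched_hidden=source_hidden)

print(source_out, target_out, patched_out)
```

Here the whole activation is replaced and nothing else feeds the second stage, so the patched output necessarily matches the source run. Real experiments patch a much smaller part of the model, which is why the size of the shift is informative.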

The standard procedure

1. Run the model on a clean input and confirm it gives the correct answer
2. Run the model on a corrupted input (e.g., a different name or a shuffled prompt) and confirm it gives the wrong answer
3. Save the activations from the clean run at every layer and position
4. Re-run the corrupted input, but patch in the clean activation at one specific site
5. Measure how far the output moves back toward the correct answer
6. Repeat across sites to find which ones matter
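The procedure above can be sketched end to end on a toy model. This is an illustration under stated assumptions, not a reference implementation: the two linear layers in `WEIGHTS`, the `READOUT` vector, the `forward` function, and the "fraction of the clean-versus-corrupted gap recovered" score are all hypothetical choices made for this example.

```python
# Toy stand-in for a network: two linear layers plus a readout.
# WEIGHTS, READOUT, and forward() are hypothetical, for illustration only.
WEIGHTS = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]],  # layer 0
    [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],  # layer 1
]
READOUT = [1.0, 0.0, 0.0]

def forward(x, patch=None):
    """Return (output, cache). `patch` maps (layer, unit) -> value to overwrite."""
    patch = patch or {}
    cache, h = [], list(x)
    for layer, W in enumerate(WEIGHTS):
        h = [sum(w * a for w, a in zip(row, h)) for row in W]
        for unit in range(len(h)):
            if (layer, unit) in patch:
                h[unit] = patch[(layer, unit)]  # overwrite with the patched value
        cache.append(list(h))
    output = sum(r * a for r, a in zip(READOUT, h))
    return output, cache

clean_x, corrupt_x = [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]
clean_out, clean_cache = forward(clean_x)   # clean run, activations saved
corrupt_out, _ = forward(corrupt_x)         # corrupted baseline

# Patch one site at a time and score how much of the gap is recovered:
# 0.0 means no effect, 1.0 means the clean output is fully restored.
recovery = {}
for layer in range(len(WEIGHTS)):
    for unit in range(len(clean_cache[layer])):
        site = (layer, unit)
        patched_out, _ = forward(corrupt_x, {site: clean_cache[layer][unit]})
        recovery[site] = (patched_out - corrupt_out) / (clean_out - corrupt_out)

for site, score in sorted(recovery.items()):
    print(site, round(score, 2))
```

In this toy, two layer-0 sites each recover half of the gap and one layer-1 site recovers all of it, while the remaining sites recover nothing. Normalizing by the clean-versus-corrupted gap is one common way to make scores comparable across sites; other metrics, such as a difference in logits, would fit the same loop.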

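Why it is the gold standard

Produces causal, not correlational, evidence
Works without retraining: it only needs forward passes and cached activations
Applies to any architecture whose intermediate activations can be read and overwritten, including any model with a residual stream
Can surface redundant circuits: when several sites each restore the answer on their own, they may be doing the same work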

Without causal intervention, every interpretability claim is a guess dressed in plots.
— Common paraphrase from interpretability workshops

The big idea: patching is how you separate 'the model represents this' from 'the model uses this.' Without it, interpretability is observational. With it, interpretability becomes experimental.
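The gap between "represents" and "uses" can be shown with a deliberately contrived toy. In this sketch, which assumes a hypothetical two-unit model and uses a simple comparison as a stand-in for a probe, both hidden units carry the input feature but only one of them is wired to the output.

```python
# Hypothetical toy: both hidden units copy x[0], but only unit 0 reaches the output.

def forward(x, patch=None):
    """Return (output, hidden). `patch` is an optional (unit, value) pair."""
    hidden = [x[0], x[0]]
    if patch is not None:
        unit, value = patch
        hidden[unit] = value
    output = 3.0 * hidden[0] + 0.0 * hidden[1]  # unit 1 has zero weight
    return output, hidden

clean_out, clean_hidden = forward([1.0])
corrupt_out, corrupt_hidden = forward([0.0])

# Observational view (stand-in for a probe): the feature is readable from
# a unit if that unit's value differs between the two inputs.
probe_reads_feature = [clean_hidden[u] != corrupt_hidden[u] for u in (0, 1)]

# Experimental view: patch each unit's clean value into the corrupted run
# and record how much the output moves.
effects = [
    forward([0.0], patch=(u, clean_hidden[u]))[0] - corrupt_out
    for u in (0, 1)
]

print(probe_reads_feature)  # both units represent the feature
print(effects)              # only unit 0 changes the output
```

Both units pass the observational check, but only patching unit 0 moves the output. A real probe would be a trained classifier and real unused features are rarely this clean, so treat this as a picture of the distinction rather than a typical result.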

End-of-lesson check

8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-activation-patching-creators

What is the main idea of "Activation Patching: Intervention Experiments"?
1. Correlation is not causation, even inside a neural network.
2. Use AI as the final authority for the whole decision
3. Avoid checking the answer once it sounds polished
4. Focus only on speed instead of judgment

Which concept is most central to "Activation Patching: Intervention Experiments"?
1. causal intervention
2. activation patching
3. path patching
4. causal tracing

Which use of AI fits this topic best?
1. Let the AI decide what matters without your review
2. Use the answer before checking whether it fits the situation
3. Run the model on a clean input, get the correct answer
4. Treat the AI output as automatically correct

What should a careful learner remember about "Path patching"?
1. Use AI to draft or organize ideas about activation patching, then verify before acting.
2. Skip the context so the tool can guess faster
3. Treat the output as private even after sharing it online
4. Use the answer without checking the source

You want to use AI after this lesson. What is the safest next step?
1. Act immediately because the AI answer is written clearly
2. AI cannot make the human values decision for you.
3. Hide uncertainty so the final answer looks cleaner
4. Use private or sensitive details before checking permission

How should AI output about activation patching be treated?
1. As proof that no other source is needed
2. As a replacement for context, consent, or expert review
3. As a draft or helper output that still needs human judgment and verification
4. As something that becomes correct when it sounds confident

Name one way to verify an AI answer about activation patching.

Which action would help you apply "Activation Patching: Intervention Experiments" responsibly?
1. Use the tool to avoid thinking through the tradeoff
2. Keep going even if the output conflicts with a trusted source
3. Treat the AI output as automatically correct
4. Run the model on a corrupted input (e.g., a different name, a shuffled prompt), get the wrong answer
