A circuit is a small sub-network inside a larger model that implements one specific behavior. Finding circuits is how researchers show, mechanistically, how a model does what it does.
28 min · Reviewed 2026
From Features to Circuits
Finding features tells you what a model represents. Circuits tell you how it computes. A circuit is a specific subset of attention heads and MLP components that, together, implement a particular capability.
Famous examples
Induction heads: detect 'A B ... A' and predict 'B' next, enabling in-context learning
IOI circuit: identifies the indirect object in sentences like 'John and Mary went to the store; John gave a drink to ___'
Modular addition circuit: a small transformer that computes (a+b) mod p using rotations in a Fourier basis
Greater-than circuit: determines which of two numbers is larger (studied in GPT-2 small via span completions like 'The war lasted from 1732 to 17__', where valid end years must exceed 32)
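The induction pattern above can be sketched as plain code. This is a toy illustration, not a real transformer component: a hypothetical copy-match rule over a token list stands in for what an induction head computes with attention.

```python
# Toy sketch of the induction pattern "A B ... A -> predict B".
def induction_predict(tokens):
    """Predict the next token by matching the last token against an
    earlier occurrence and copying the token that followed it."""
    last = tokens[-1]
    # Scan earlier positions for a match (the "previous-token" step).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]  # copy what followed the match last time
    return None                   # no earlier match: no induction prediction

print(induction_predict(["A", "B", "C", "A"]))  # -> B
```

A real induction head implements this match-and-copy behavior with learned attention weights rather than an explicit scan, which is why it generalizes to tokens never seen together in training.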
Why circuits matter for safety
A circuit-level understanding could reveal deceptive reasoning as it happens
Circuits for sycophancy or refusal can be audited directly
Removing a circuit can ablate a capability without full retraining
Circuits that generalize across models are candidates for universal interpretability claims
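The ablation idea in the list above can be made concrete with a minimal sketch. Real circuit ablation hooks specific attention heads or MLPs inside a transformer; here, hypothetical named functions stand in for components, and ablating means zeroing a component's output.

```python
# Minimal sketch of zero-ablation over a toy "model" built from
# named component functions (names are illustrative, not a real API).
def run_model(x, components, ablated=()):
    """Sum component outputs, zeroing any component named in `ablated`."""
    out = 0.0
    for name, fn in components.items():
        out += 0.0 if name in ablated else fn(x)
    return out

components = {
    "head_1": lambda x: 2 * x,   # hypothetical capability-carrying component
    "head_2": lambda x: x + 1,
}
full = run_model(3, components)                         # 2*3 + (3+1) = 10
without = run_model(3, components, ablated={"head_1"})  # only head_2: 4
print(full - without)  # effect attributable to head_1: 6
```

Comparing outputs with and without a component is the basic evidence researchers use to argue that a particular head or MLP belongs to a circuit.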
The big idea: circuits are the wiring diagrams of neural networks. We can draw a few of them. We cannot yet draw most. That asymmetry is the state of the art.
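The modular addition example from the list of famous circuits can be checked numerically. Representing each residue as a rotation on the unit circle turns addition mod p into multiplication of complex exponentials, since angles add when rotations compose; this is the Fourier-basis trick the trained network rediscovers.

```python
# Numeric check of the "rotations in a Fourier basis" claim:
# map residue a to the rotation e^(2*pi*i*a/p), multiply rotations,
# and read (a + b) mod p back off the resulting angle.
import cmath

p = 7

def rot(a):
    """Embed residue a as the rotation e^(2*pi*i*a/p)."""
    return cmath.exp(2j * cmath.pi * a / p)

def add_mod_p(a, b):
    """Recover (a + b) mod p from the product of two rotations."""
    angle = cmath.phase(rot(a) * rot(b))   # angles add under multiplication
    return round(angle * p / (2 * cmath.pi)) % p

print(add_mod_p(5, 4))  # -> 2, since (5 + 4) % 7 == 2
```

The final `% p` handles the wrap-around when the combined angle passes pi, which is exactly the modular reduction.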
End-of-lesson check
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-circuits-builders
What is a circuit in the context of neural network interpretability?
A training dataset used to teach models specific behaviors
A complete architecture diagram showing all layers in a transformer model
A type of output layer that generates predictions
A specific subset of attention heads and MLP components that together implement a particular capability
What does it mean to 'ablate' a component when researchers are searching for circuits?
To increase the size of the component to see how it affects performance
To zero out or remove the component and observe what changes in the model's behavior
To visualize the internal activations of the component
To add a new component to the network and test its effects
Why are circuits considered valuable for AI safety?
Circuits make models run faster on consumer hardware
Circuits automatically prevent models from generating harmful content
Circuits reduce the amount of training data needed
Circuit-level understanding could reveal deceptive reasoning as it happens
Which famous circuit detects patterns like 'A B ... A' and then predicts that 'B' will appear next?
Greater-than circuit
IOI circuit
Modular addition circuit
Induction head
What specific task does the IOI (Indirect Object Identification) circuit perform?
Identifies the indirect object in sentences like 'John gave a drink to ___'
Computes the sum of two numbers modulo a prime
Compares two numbers to determine which is larger
Detects whether a statement is true or false
Why is it much harder to prove comparable mechanisms exist in frontier models like GPT-4 compared to GPT-2 small?
Frontier models were not trained on enough data
Frontier models have fewer parameters to examine
Frontier models are orders of magnitude more complex and harder to analyze
Frontier models use completely different programming languages
What is an induction head?
A type of loss function used in training
A circuit that detects 'A B ... A' patterns and predicts what comes next
A mechanism that prevents models from repeating themselves
A component that combines multiple attention outputs
If researchers wanted to remove a model's ability to refuse certain requests without full retraining, what approach might they use?
Replace the model's activation function
Delete all training data related to safety
Add more parameters to override the behavior
Ablate the specific circuit responsible for refusal behavior
What does it mean when a circuit 'generalizes across models'?
The same circuit mechanism is found in multiple different models
The circuit automatically adjusts its size based on input
The circuit can be transferred between different programming languages
The circuit allows models to learn from fewer examples
What does the greater-than circuit do?
Calculates the difference between two numbers
Rounds numbers to the nearest integer
Predicts the next number in a sequence
Determines which of two numbers is larger
How does the modular addition circuit perform computation?
Using rotations in a Fourier basis
By looking up pre-computed answers in a table
Using conditional if-else statements
Through simple repeated addition
What is the main goal of circuit discovery in interpretability research?
To reduce the cost of training models
To increase the speed of model inference
To make models generate more text
To understand exactly how a model computes specific behaviors
Why might understanding specific circuits be useful for auditing AI systems?
Auditors can use circuits to generate training data
Auditors can replace circuits with human reviewers
Auditors can use circuits to make models run on phones
Auditors can directly examine circuits responsible for behaviors like sycophancy or refusal
What is an attention head?
A component in transformers that focuses on specific parts of the input
A type of training objective
A method for visualizing which data points are important
A neural network trained to pay attention to user queries
The lesson says the 'big idea' is that circuits are the 'wiring diagrams' of neural networks. What does this analogy suggest?
Circuits are unnecessary for modern neural networks
Circuits are physical components like wires and resistors
Circuits must be drawn by hand by engineers
Circuits show how information flows and is transformed through the network