A circuit is a small sub-network inside a larger model that implements one specific behavior. Finding circuits is how researchers show, mechanistically, how a model does what it does.
28 min · Reviewed 2026
From Features to Circuits
Finding features tells you what a model represents. Circuits tell you how it computes. A circuit is a specific subset of attention heads and MLP components that, together, implement a particular capability.
Famous examples
Induction heads: detect 'A B ... A' and predict 'B' next, enabling in-context learning
IOI circuit: identifies the indirect object in sentences like 'John and Mary went to the store; John gave a drink to ___'
Modular addition circuit: a small transformer that computes (a+b) mod p using rotations in a Fourier basis
Greater-than circuit: determines which of two numbers is larger (studied in GPT-2 small via span completions like 'The war lasted from 1732 to 17__', where valid end years must exceed 32)
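The induction pattern above can be sketched as plain code. This is a toy illustration, not a real transformer component: a hypothetical copy-match rule over a token list stands in for what an induction head computes with attention.

```python
# Toy sketch of the induction pattern "A B ... A -> predict B".
def induction_predict(tokens):
    """Predict the next token by matching the last token against an
    earlier occurrence and copying the token that followed it."""
    last = tokens[-1]
    # Scan earlier positions for a match (the "previous-token" step).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]  # copy what followed the match last time
    return None                   # no earlier match: no induction prediction

print(induction_predict(["A", "B", "C", "A"]))  # -> B
```

A real induction head implements this match-and-copy behavior with learned attention weights rather than an explicit scan, which is why it generalizes to tokens never seen together in training.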
Why circuits matter for safety
A circuit-level understanding could reveal deceptive reasoning as it happens
Circuits for sycophancy or refusal can be audited directly
Removing a circuit can ablate a capability without full retraining
Circuits that generalize across models are candidates for universal interpretability claims
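The ablation idea in the list above can be made concrete with a minimal sketch. Real circuit ablation hooks specific attention heads or MLPs inside a transformer; here, hypothetical named functions stand in for components, and ablating means zeroing a component's output.

```python
# Minimal sketch of zero-ablation over a toy "model" built from
# named component functions (names are illustrative, not a real API).
def run_model(x, components, ablated=()):
    """Sum component outputs, zeroing any component named in `ablated`."""
    out = 0.0
    for name, fn in components.items():
        out += 0.0 if name in ablated else fn(x)
    return out

components = {
    "head_1": lambda x: 2 * x,   # hypothetical capability-carrying component
    "head_2": lambda x: x + 1,
}
full = run_model(3, components)                         # 2*3 + (3+1) = 10
without = run_model(3, components, ablated={"head_1"})  # only head_2: 4
print(full - without)  # effect attributable to head_1: 6
```

Comparing outputs with and without a component is the basic evidence researchers use to argue that a particular head or MLP belongs to a circuit.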
The big idea: circuits are the wiring diagrams of neural networks. We can draw a few of them. We cannot yet draw most. That asymmetry is the state of the art.
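The modular addition example from the list of famous circuits can be checked numerically. Representing each residue as a rotation on the unit circle turns addition mod p into multiplication of complex exponentials, since angles add when rotations compose; this is the Fourier-basis trick the trained network rediscovers.

```python
# Numeric check of the "rotations in a Fourier basis" claim:
# map residue a to the rotation e^(2*pi*i*a/p), multiply rotations,
# and read (a + b) mod p back off the resulting angle.
import cmath

p = 7

def rot(a):
    """Embed residue a as the rotation e^(2*pi*i*a/p)."""
    return cmath.exp(2j * cmath.pi * a / p)

def add_mod_p(a, b):
    """Recover (a + b) mod p from the product of two rotations."""
    angle = cmath.phase(rot(a) * rot(b))   # angles add under multiplication
    return round(angle * p / (2 * cmath.pi)) % p

print(add_mod_p(5, 4))  # -> 2, since (5 + 4) % 7 == 2
```

The final `% p` handles the wrap-around when the combined angle passes pi, which is exactly the modular reduction.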
End-of-lesson check
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-circuits-builders
What is a circuit in the context of neural network interpretability?
A training dataset used to teach models specific behaviors
A complete architecture diagram showing all layers in a transformer model
A type of output layer that generates predictions
A specific subset of attention heads and MLP components that together implement a particular capability
What does it mean to 'ablate' a component when researchers are searching for circuits?
To increase the size of the component to see how it affects performance
To zero out or remove the component and observe what changes in the model's behavior
To visualize the internal activations of the component
To add a new component to the network and test its effects
Why are circuits considered valuable for AI safety?
Circuits make models run faster on consumer hardware
Circuits automatically prevent models from generating harmful content
Circuits reduce the amount of training data needed
Circuit-level understanding could reveal deceptive reasoning as it happens
Which famous circuit detects patterns like 'A B ... A' and then predicts that 'B' will appear next?
Greater-than circuit
IOI circuit
Modular addition circuit
Induction head
What specific task does the IOI (Indirect Object Identification) circuit perform?
Identifies the indirect object in sentences like 'John gave a drink to ___'
Computes the sum of two numbers modulo a prime
Compares two numbers to determine which is larger
Detects whether a statement is true or false
Why is it much harder to prove comparable mechanisms exist in frontier models like GPT-4 compared to GPT-2 small?
Frontier models were not trained on enough data
Frontier models have fewer parameters to examine
Frontier models are orders of magnitude more complex and harder to analyze
Frontier models use completely different programming languages
What is an induction head?
A type of loss function used in training
A circuit that detects 'A B ... A' patterns and predicts what comes next
A mechanism that prevents models from repeating themselves
A component that combines multiple attention outputs
If researchers wanted to remove a model's ability to refuse certain requests without full retraining, what approach might they use?
Replace the model's activation function
Delete all training data related to safety
Add more parameters to override the behavior
Ablate the specific circuit responsible for refusal behavior
What does it mean when a circuit 'generalizes across models'?
The same circuit mechanism is found in multiple different models
The circuit automatically adjusts its size based on input
The circuit can be transferred between different programming languages
The circuit allows models to learn from fewer examples
What does the greater-than circuit do?
Calculates the difference between two numbers
Rounds numbers to the nearest integer
Predicts the next number in a sequence
Determines which of two numbers is larger
How does the modular addition circuit perform computation?
Using rotations in a Fourier basis
By looking up pre-computed answers in a table
Using conditional if-else statements
Through simple repeated addition
What is the main goal of circuit discovery in interpretability research?
To reduce the cost of training models
To increase the speed of model inference
To make models generate more text
To understand exactly how a model computes specific behaviors
Why might understanding specific circuits be useful for auditing AI systems?
Auditors can use circuits to generate training data
Auditors can replace circuits with human reviewers
Auditors can use circuits to make models run on phones
Auditors can directly examine circuits responsible for behaviors like sycophancy or refusal
What is an attention head?
A component in transformers that focuses on specific parts of the input
A type of training objective
A method for visualizing which data points are important
A neural network trained to pay attention to user queries
The lesson says the 'big idea' is that circuits are the 'wiring diagrams' of neural networks. What does this analogy suggest?
Circuits are unnecessary for modern neural networks
Circuits are physical components like wires and resistors
Circuits must be drawn by hand by engineers
Circuits show how information flows and is transformed through the network