Circuits in Neural Networks
A circuit is a small sub-network inside a big model that implements one specific behavior. Finding circuits is how researchers move from describing what a model does to showing, mechanistically, how it does it.
Lesson map
What this lesson covers
Learning path
The main moves in order:
1. From Features to Circuits
Concept cluster
Terms to connect while reading: circuit, attention head, interpretability
Section 1
From Features to Circuits
Finding features tells you what a model represents. Circuits tell you how it computes. A circuit is a specific subset of attention heads and MLP components that, together, implement a particular capability.
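In practice, researchers localize these components with causal interventions such as activation patching: run the model on a corrupted prompt, splice in one component's activation saved from a clean run, and measure how much of the correct behavior returns. Below is a minimal sketch in PyTorch; the module path model.blocks[layer].attn, the head layout, and the function name are illustrative assumptions, not any real model's API.

```python
import torch

def patching_effect(model, clean_ids, corrupt_ids, target_id, layer, head, head_dim):
    """Measure how much splicing one attention head's clean output into a
    corrupted run restores the clean prediction.

    Assumed (hypothetical) layout: model.blocks[layer].attn emits a tensor of
    shape (batch, seq, n_heads * head_dim) before the output projection."""
    cache = {}

    def save_hook(module, inp, out):
        cache["clean"] = out.detach()

    def patch_hook(module, inp, out):
        out = out.clone()
        lo, hi = head * head_dim, (head + 1) * head_dim
        out[..., lo:hi] = cache["clean"][..., lo:hi]  # splice in the clean head output
        return out

    attn = model.blocks[layer].attn  # assumed module path; adapt to your model
    with torch.no_grad():
        handle = attn.register_forward_hook(save_hook)
        clean_logits = model(clean_ids)      # clean run, caching this head's output
        handle.remove()

        corrupt_logits = model(corrupt_ids)  # corrupted baseline, no intervention

        handle = attn.register_forward_hook(patch_hook)
        patched_logits = model(corrupt_ids)  # corrupted run with the clean head patched in
        handle.remove()

    # Fraction of the clean-vs-corrupt logit gap on the target token that the patch recovers.
    gap = clean_logits[0, -1, target_id] - corrupt_logits[0, -1, target_id]
    recovered = patched_logits[0, -1, target_id] - corrupt_logits[0, -1, target_id]
    return (recovered / gap).item()
```

A head (or MLP block) that recovers a large fraction of the gap is a candidate member of the circuit; repeating this over all components produces the wiring diagram.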
Famous examples
- Induction heads: detect the pattern 'A B ... A' and predict 'B' next, enabling in-context learning (see the detection sketch after this list)
- IOI circuit: identifies the indirect object in sentences like 'John and Mary went to the store; John gave a drink to ___' (answer: Mary)
- Modular addition circuit: a small transformer that computes (a+b) mod p using rotations in a Fourier basis
- Greater-than circuit: in GPT-2, predicts that an event's end year must exceed its start year in prompts like 'The war lasted from the year 1732 to the year 17__'
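Induction heads are usually found by scoring attention patterns on a sequence built from two copies of the same random token segment: at any position in the second copy, an induction head attends back to the token just after the previous occurrence of the current token. A sketch, assuming you can already extract per-head attention weights from your model (the tensor layout below is an assumption):

```python
import torch

def induction_scores(attn_patterns, seq_len):
    """Score each head for induction behavior on a repeated random sequence
    [x, x] of total length 2*seq_len (two copies of the same random segment).

    attn_patterns: (n_layers, n_heads, 2*seq_len, 2*seq_len) attention weights,
    extracted however your model exposes them (assumed, not a standard API).
    An induction head at position t attends to position t - seq_len + 1: the
    token *after* the previous occurrence of the current token."""
    n_layers, n_heads, total, _ = attn_patterns.shape
    scores = torch.zeros(n_layers, n_heads)
    # Only positions in the second copy have a previous occurrence to attend to.
    for t in range(seq_len, total):
        scores += attn_patterns[:, :, t, t - seq_len + 1]
    return scores / (total - seq_len)  # mean attention placed on the induction offset
```

Heads with scores near 1.0 are putting almost all their attention on the induction target; those are the candidates to test causally with patching or ablation.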
Why circuits matter for safety
1. A circuit-level understanding could reveal deceptive reasoning as it happens
2. Circuits for sycophancy or refusal can be audited directly
3. Removing a circuit can ablate a capability without full retraining (see the ablation sketch after this list)
4. Circuits that generalize across models are candidates for universal interpretability claims
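The ablation point (3 above) can be tested with the same hook machinery as the patching sketch: zero out the output slice of every head in a candidate circuit, then re-run the capability evaluation. A minimal sketch under the same assumed module layout:

```python
import torch

def zero_ablate_head(model, layer, head, head_dim):
    """Zero one attention head's output contribution on every forward pass.

    Assumes the hypothetical layout from the patching sketch above:
    model.blocks[layer].attn emits (batch, seq, n_heads * head_dim)."""
    lo, hi = head * head_dim, (head + 1) * head_dim

    def ablate_hook(module, inp, out):
        out = out.clone()
        out[..., lo:hi] = 0.0  # knock out this head's slice of the residual update
        return out

    return model.blocks[layer].attn.register_forward_hook(ablate_hook)

# Usage: ablate every head in the candidate circuit, re-run the capability
# eval, then call .remove() on each returned handle to restore the model.
```

If the capability disappears while unrelated behavior is preserved, that is strong evidence the circuit is both necessary for the behavior and cleanly separable from the rest of the network.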
Key terms in this lesson: circuit, attention head, interpretability.
The big idea: circuits are the wiring diagrams of neural networks. We can draw a few of them. We cannot yet draw most. That asymmetry is the state of the art.
Related lessons
Keep going
Creators · 55 min
Mechanistic Interpretability: Reading the Model's Mind
Sparse autoencoders, features, circuits. How researchers try to see what a model actually thinks, and why it may be the most strategically important safety work.
Creators · 37 min
Feature Discovery in LLMs
A feature is a direction in activation space that corresponds to a concept. Finding them — naming them, ranking them, connecting them — is one of the central activities of interpretability research.
Builders · 28 min
Where Bias in AI Actually Comes From
AI bias is not magic and not moral failure. It is math operating on imperfect data. Here is exactly where the bias enters the system.
