Scalable Oversight: Watching Models Smarter Than You
When AI outputs get too long, too technical, or too fast for humans to check, how do you know it is doing the right thing? Scalable oversight is the research program trying to answer that.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The Bandwidth Problem
2. Scalable oversight
3. RLHF
4. Human feedback
Section 1
The Bandwidth Problem
Human feedback is the backbone of modern alignment. Raters read a model's answer and upvote or downvote it. That works well when the answer is short and the rater is qualified. It breaks down when the answer is a 30-page research paper, a 10,000-line codebase, or a claim in a field no rater actually knows.
Why it matters
- Models are outpacing the speed at which humans can read their work
- Domains like biology and math exceed most raters' expertise
- Rater fatigue causes quality to drop across long sessions
- If the model is wrong in ways the rater can't see, training reinforces the error
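The last bullet is the core failure mode. A toy sketch makes it concrete: assume a hypothetical rater with a fixed "reading budget" who labels pairwise preferences for RLHF-style training (all names here are illustrative, not a real API). An error that sits past the budget is invisible, so the preference label can reward the flawed answer.

```python
# Toy sketch (hypothetical names throughout): a rater with a fixed
# reading budget labels pairwise preferences. Errors beyond the
# budget are invisible, so a flawed answer can win the comparison.

def rater_prefers(answer_a: str, answer_b: str, budget: int = 100) -> str:
    """Return 'a' or 'b' based only on what the rater actually reads."""
    seen_a, seen_b = answer_a[:budget], answer_b[:budget]
    # The rater can only penalize errors they can see.
    errors_a = seen_a.count("[WRONG]")
    errors_b = seen_b.count("[WRONG]")
    return "a" if errors_a <= errors_b else "b"

short_correct = "The proof holds because the lemma applies."
long_flawed = "A long, confident derivation... " * 10 + "[WRONG] step here."

# The flaw sits past the rater's budget, so the label rewards the
# flawed answer -- and training reinforces the error.
label = rater_prefers(long_flawed, short_correct)
print(label)  # 'a': the long, flawed answer wins
```

With a larger budget (say 400 characters), the same rater spots the flaw and the label flips, which is exactly the bandwidth that long outputs exhaust.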
Main approaches
1. Debate: two models argue opposing sides, a human judges
2. Iterated amplification: break hard tasks into smaller pieces a human can check
3. Recursive reward modeling: train a reward model on easier subtasks, then use it to evaluate harder ones
4. Process supervision: score the reasoning steps, not just the final answer
5. Critique models: one model finds flaws in another's output
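To see what process supervision buys over outcome-only scoring, here is a minimal sketch under stated assumptions: the function names are hypothetical, and the per-step scores are supplied by hand where a real system would use a trained step-level reward model.

```python
# Minimal sketch of outcome vs. process supervision. All names are
# illustrative assumptions, not a real library API.

def outcome_reward(final_answer: str, target: str) -> float:
    """Outcome supervision: score only the final answer."""
    return 1.0 if final_answer == target else 0.0

def process_reward(steps: list[str], step_scores: list[float]) -> float:
    """Process supervision: aggregate per-step scores from a verifier.

    In practice step_scores would come from a trained step-level
    reward model; here they are given directly to stay self-contained.
    """
    assert len(steps) == len(step_scores)
    return sum(step_scores) / len(step_scores)

steps = [
    "Expand (x+1)^2 to x^2 + 2x + 1",   # valid step
    "Set x^2 + 2x + 1 = 0, so x = 1",   # invalid step
    "Therefore the root is x = 1",      # follows from the bad step
]
scores = [1.0, 0.0, 0.0]  # a step verifier flags where reasoning breaks

print(outcome_reward("x = 1", "x = -1"))   # 0.0: wrong, but no signal where
print(process_reward(steps, scores))       # ~0.33: points at the faulty step
```

The outcome signal says only "wrong"; the process signal localizes the failure to a step a human (or a weaker model) can actually check, which is the whole point of supervising the reasoning rather than the result.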
“The hope is that we can use AI to help us align smarter AI, bootstrapping supervision up the capability curve.”
The big idea: alignment at scale is not just better labels. It is a research bet that supervision itself can be amplified without losing the human anchor.
