Scalable Oversight: How Do You Supervise What You Cannot Evaluate?
Debate, amplification, weak-to-strong generalization, process supervision. Research on how humans can supervise models smarter than they are.
Lesson map
The main moves, in order:
1. The Supervision Ceiling
2. Scalable oversight
3. Debate
4. Amplification
Section 1: The Supervision Ceiling
RLHF and related methods work because humans can judge whether a response is good. What happens when the model's output is a 10,000-line proof, a molecular design, or a multi-step plan that no human can evaluate end-to-end? The human can no longer provide a trustworthy reward signal. This is the scalable oversight problem.
Why it matters now, not later
- Frontier models already produce outputs humans cannot fully check (e.g., long code changes)
- The quality of the training signal caps the quality of the trained behavior
- Once models clearly exceed human expertise in a domain, naive RLHF becomes misleading
- Even current models need scalable oversight in narrow domains
Proposal 1: Debate
Irving, Christiano, and Amodei (OpenAI, 2018) proposed training two AI agents to argue opposing answers to a question, with a human judging the exchange. The intuition is that a true position should be easier to defend than a compelling false one under adversarial questioning. The human supervises arguments, not solutions, which is a much easier task.
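The protocol is simple enough to sketch. Below is a minimal, hypothetical version in Python: `agent_a`, `agent_b`, and `judge` are placeholder callables rather than any published API, and the point is that the judge only ever reads the transcript, never the underlying hard task.

```python
from typing import Callable, List, Tuple

Agent = Callable[[str, List[str]], str]   # (question, transcript so far) -> next argument
Judge = Callable[[str, List[str]], int]   # (question, full transcript) -> 0 if A wins, 1 if B wins

def run_debate(question: str, agent_a: Agent, agent_b: Agent,
               judge: Judge, rounds: int = 3) -> Tuple[int, List[str]]:
    """Two agents alternate short arguments for opposing answers; a judge
    who only reads the transcript picks a winner.

    In self-play training, the judge's verdict is the reward: each agent is
    optimized to win, and the hope is that truthful, checkable arguments are
    the winning strategy.
    """
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("A: " + agent_a(question, transcript))
        transcript.append("B: " + agent_b(question, transcript))
    winner = judge(question, transcript)   # the human never inspects the raw hard task
    return winner, transcript
```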
Proposal 2: Iterated amplification
Paul Christiano's iterated distillation and amplification (IDA): a human working with several copies of a model can answer harder questions than the human alone. Distill that amplified capability into a single model. Now the distilled model + humans can answer still-harder questions. Iterate. The training signal at each step stays within human-judgable complexity.
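A minimal sketch of one amplify-then-distill round, under the assumption that a model is just a question-to-answer function and distillation is any supervised training routine. Every name here (`amplify`, `ida_round`, the decompose/recombine callables) is illustrative, not taken from Christiano's implementation.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a "model" is any question -> answer function, and
# "distill" is any supervised-training routine over (question, answer) pairs.
Model = Callable[[str], str]
Decompose = Callable[[str], List[str]]            # human splits a hard question
Recombine = Callable[[str, List[str]], str]       # human merges sub-answers
Distill = Callable[[List[Tuple[str, str]]], Model]

def amplify(decompose: Decompose, recombine: Recombine, model: Model) -> Model:
    """Amplification: the human breaks the question into pieces that copies of
    the current model can answer, then combines the results. Every judgement
    the human makes stays at human-checkable size."""
    def amplified(question: str) -> str:
        sub_questions = decompose(question)
        sub_answers = [model(q) for q in sub_questions]
        return recombine(question, sub_answers)
    return amplified

def ida_round(model: Model, questions: List[str],
              decompose: Decompose, recombine: Recombine,
              distill: Distill) -> Model:
    """One IDA iteration: amplify the current model, use the amplified system
    to label training questions, then distill those labels into a single model."""
    amplified = amplify(decompose, recombine, model)
    training_pairs = [(q, amplified(q)) for q in questions]
    return distill(training_pairs)
```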
Proposal 3: Weak-to-strong generalization
OpenAI's December 2023 paper Weak-to-Strong Generalization tested whether a weaker model supervising a stronger one could still produce aligned, capable behavior. A GPT-2-level model provided the labels, and GPT-4 was finetuned on them across NLP tasks. The strong model consistently outperformed its weak supervisor, recovering part of the gap to a model trained on ground truth, which suggests some generalization of correct behavior despite noisy supervision. The paper was framed as a testbed for the superhuman-supervision problem, not a solution.
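The experimental logic can be reproduced in miniature. The toy below is not OpenAI's code or models: it simulates a weak supervisor with 25% random label errors and trains a higher-capacity scikit-learn classifier on those labels, then checks whether the student beats its own supervision on held-out ground truth. Real weak-model errors are more systematic than random flips, so this is the friendliest possible case.

```python
# Toy illustration of the weak-to-strong setup (not the paper's code or models).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(6000, 20))
y = (X[:, :5].sum(axis=1) > 0).astype(int)   # ground truth, never shown to the student

# Simulated weak supervisor: correct labels, except 25% are flipped at random.
flip = rng.random(6000) < 0.25
weak_labels = np.where(flip, 1 - y, y)

train, test = slice(0, 5000), slice(5000, 6000)
student = GradientBoostingClassifier().fit(X[train], weak_labels[train])

print("weak supervision accuracy:", (weak_labels[test] == y[test]).mean())  # ~0.75 by construction
print("strong student accuracy  :", student.score(X[test], y[test]))
# Weak-to-strong generalization is the observation that the second number can
# exceed the first: the student fits the underlying signal, not the label noise.
```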
Proposal 4: Process supervision
Reward correct intermediate steps, not just correct final outputs. OpenAI's Let's Verify Step by Step (Lightman et al., 2023) showed that a process-supervised reward model outperformed an outcome-supervised one on MATH problems. For tasks where the chain of reasoning is checkable even when the final answer is not, this shifts the supervision load to the step level.
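In sketch form, the difference is where the reward attaches. The `StepScorer` below is a hypothetical stand-in for a process reward model; the product aggregation is one common choice (score the solution by the probability that every step is correct), and the rest is illustrative.

```python
from typing import Callable, List

# Hypothetical stand-in: a process reward model that scores one reasoning step
# in [0, 1], given the question and the steps before it.
StepScorer = Callable[[str, List[str], str], float]

def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome supervision: a single bit of signal for the whole solution."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def process_reward(question: str, steps: List[str], score_step: StepScorer) -> float:
    """Process supervision: every intermediate step is judged on its own.

    Aggregating by product treats the result as the probability that all
    steps are correct (min-over-steps is another common choice).
    """
    solution_score = 1.0
    for i, step in enumerate(steps):
        solution_score *= score_step(question, steps[:i], step)
    return solution_score
```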
Proposal 5: AI Control
Redwood Research's AI Control agenda (Shlegeris et al., 2024) takes a different angle: assume the model may be unaligned, and design the deployment so that a capable but potentially unaligned model cannot cause catastrophic harm. Examples: an untrusted model proposes actions, a trusted weaker model monitors them, and a limited human audit budget is spent on the most suspicious cases. This is complementary to alignment work, not a substitute for it.
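One concrete control protocol, trusted monitoring with a defer-to-trusted fallback, can be sketched as a deployment loop. All the callables and threshold values below are hypothetical placeholders; the shape of the logic (untrusted model proposes, trusted monitor scores, scarce human audits catch the top slice) is the part that matters.

```python
from typing import Callable

# Hypothetical placeholders: any callables with these shapes would do.
Propose = Callable[[str], str]             # model: task -> proposed action
Suspicion = Callable[[str, str], float]    # trusted monitor: (task, action) -> suspicion in [0, 1]
Audit = Callable[[str, str], bool]         # human auditor: True means the action looks safe

def trusted_monitoring(task: str,
                       untrusted_propose: Propose,
                       trusted_propose: Propose,
                       monitor: Suspicion,
                       human_audit: Audit,
                       audit_threshold: float = 0.95,
                       defer_threshold: float = 0.7) -> str:
    """One control-style deployment step.

    The untrusted (capable) model does the work; a trusted (weaker) model
    scores how suspicious the proposed action looks. The most suspicious
    actions are escalated to a scarce human auditor; moderately suspicious
    ones are replaced with the trusted model's own (worse but safe) action.
    """
    action = untrusted_propose(task)
    suspicion = monitor(task, action)
    if suspicion >= audit_threshold:
        # Spend the limited human audit budget on the top slice of suspicion scores.
        return action if human_audit(task, action) else trusted_propose(task)
    if suspicion >= defer_threshold:
        # Defer to trusted: give up some capability, keep safety.
        return trusted_propose(task)
    return action
```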
Compare the options
| Approach | Human task | Scales with capability? | Status |
|---|---|---|---|
| Debate | Judge an argument | In theory yes | Research, partial empirical |
| Iterated amplification | Coordinate with AI helpers | Yes | Research, mostly theoretical |
| Weak-to-strong | Provide imperfect supervision | Unknown; early empirical evidence | Testbed, not full solution |
| Process supervision | Judge individual steps | For decomposable tasks | Deployed (o1, etc.) |
| AI control | Monitor untrusted model | Different axis, complementary | Active research |
“The central question of alignment is not whether we can supervise AI today. It is whether our supervision can grow as fast as AI's capability.”
The big idea: scalable oversight is the research problem of how humans keep judgment useful as AI capability climbs past our ability to directly evaluate outputs. No proposal solves it. Several partial approaches stack together, and that stack is the field's working answer today.
