Loading lesson…
Debate, amplification, weak-to-strong, process supervision. Research on how humans supervise models smarter than them.
RLHF and related methods work because humans can judge whether a response is good. What happens when the model's output is a 10,000-line proof, a molecular design, or a multi-step plan that no human can evaluate end-to-end? The human can no longer provide a trustworthy reward signal. This is the scalable oversight problem.
Irving, Christiano, Amodei (OpenAI, 2018) proposed training two AI agents to debate an answer to a question, with a human judging the debate. The intuition is that finding a true argument should be easier than finding a compelling false argument that withstands adversarial questioning. The human is supervising arguments, not solutions, which is a much easier task.
Paul Christiano's iterated distillation and amplification (IDA): a human working with several copies of a model can answer harder questions than the human alone. Distill that amplified capability into a single model. Now the distilled model + humans can answer still-harder questions. Iterate. The training signal at each step stays within human-judgable complexity.
OpenAI's December 2023 paper Weak-to-Strong Generalization tested whether a weaker model supervising a stronger one could still produce aligned behavior. GPT-2 supervised GPT-4 on NLP tasks. The strong model outperformed the weak supervisor, suggesting some generalization of correct behavior despite noisy supervision. The paper was framed as a testbed for the superhuman-supervision problem, not a solution.
Reward correct intermediate steps, not just correct outputs. OpenAI's Let's Verify Step by Step (Lightman et al., 2023) showed this outperformed outcome-only supervision on MATH. For tasks where the chain of reasoning is checkable even when the final answer is not, this transfers supervision load to the step level.
Redwood Research's AI Control agenda (Shlegeris et al., 2024) takes a different angle: assume the model may be unaligned, and design deployment so that a capable but potentially unaligned model cannot cause catastrophic harm. Examples: untrusted models proposing actions, trusted models auditing, monitoring at every step. This is complementary to alignment, not a substitute.
| Approach | Human task | Scales with capability? | Status |
|---|---|---|---|
| Debate | Judge an argument | In theory yes | Research, partial empirical |
| Iterated amplification | Coordinate with AI helpers | Yes | Research, mostly theoretical |
| Weak-to-strong | Provide imperfect supervision | Tested empirically | Testbed, not full solution |
| Process supervision | Judge individual steps | For decomposable tasks | Deployed (o1, etc.) |
| AI control | Monitor untrusted model | Different axis, complementary | Active research |
The central question of alignment is not whether we can supervise AI today. It is whether our supervision can grow as fast as AI's capability.
— Jan Leike, former OpenAI superalignment co-lead
The big idea: scalable oversight is the research problem of how humans keep judgment useful as AI capability climbs past our ability to directly evaluate outputs. No proposal solves it. Several partial approaches stack together, and that stack is the field's working answer today.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety-scalable-oversight-creators
What is the main idea of "Scalable Oversight: How Do You Supervise What You Cannot Evaluate"?
Which concept is most central to "Scalable Oversight: How Do You Supervise What You Cannot Evaluate"?
Which use of AI fits this topic best?
What should a careful learner remember about "Why there is no silver bullet"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about scalable oversight be treated?
Name one way to verify an AI answer about scalable oversight.
Which action would help you apply "Scalable Oversight: How Do You Supervise What You Cannot Evaluate" responsibly?