Debate, amplification, weak-to-strong generalization, process supervision: research on how humans can supervise models smarter than they are.
RLHF and related methods work because humans can judge whether a response is good. What happens when the model's output is a 10,000-line proof, a molecular design, or a multi-step plan that no human can evaluate end-to-end? The human can no longer provide a trustworthy reward signal. This is the scalable oversight problem.
Irving, Christiano, and Amodei (OpenAI, 2018) proposed training two AI agents to debate the answer to a question, with a human judging the debate. The intuition is that defending a true claim should be easier than sustaining a compelling false one under adversarial questioning. The human supervises arguments rather than solutions, which is a much easier task.
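A minimal sketch of the protocol's shape, assuming hypothetical `agent_a`, `agent_b`, and `judge` callables (none of these names come from the paper or any real API):

```python
from typing import Callable

def debate(question: str,
           agent_a: Callable[[str, list[str]], str],
           agent_b: Callable[[str, list[str]], str],
           judge: Callable[[str, list[str]], str],
           rounds: int = 3) -> str:
    """Two agents argue in turns; a judge sees only the transcript."""
    transcript: list[str] = []
    for _ in range(rounds):
        # Each agent sees the full transcript so far and tries to
        # expose flaws in the other side's latest argument.
        transcript.append("A: " + agent_a(question, transcript))
        transcript.append("B: " + agent_b(question, transcript))
    # The judge never verifies the answer directly; they only decide
    # which side argued more convincingly under adversarial pressure.
    return judge(question, transcript)
```

The design choice doing the work: the judge sees only the transcript, so the human task stays "evaluate an argument" even when the underlying question has outgrown direct evaluation.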
Paul Christiano's iterated distillation and amplification (IDA): a human working with several copies of a model can answer harder questions than the human alone could. Distill that amplified capability into a single model; now the human plus the distilled model can answer still-harder questions. Iterate. The training signal at each step stays within human-judgable complexity.
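One amplify-then-distill iteration as a hedged sketch. `human_with_helpers` (the human coordinating model copies) and `train` (supervised distillation) are hypothetical stand-ins, not real APIs:

```python
from typing import Callable

Model = Callable[[str], str]

def ida_step(model: Model,
             questions: list[str],
             human_with_helpers: Callable[[str, list[Model]], str],
             train: Callable[[list[tuple[str, str]]], Model],
             n_copies: int = 4) -> Model:
    helpers = [model] * n_copies
    # Amplification: each answer comes from the human coordinating
    # several copies of the current model, so every judgment the human
    # makes stays within human-judgable complexity.
    amplified = [(q, human_with_helpers(q, helpers)) for q in questions]
    # Distillation: train a single, faster model to imitate the
    # human-plus-helpers system.
    return train(amplified)
```

Iterating `ida_step` makes each distilled generation the helper pool for the next round, which is where the still-harder questions come from.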
OpenAI's December 2023 paper Weak-to-Strong Generalization (Burns et al.) tested whether a weak model supervising a stronger one could still produce aligned behavior: a GPT-2-level model generated the labels used to fine-tune GPT-4 on NLP tasks. The strong model outperformed its weak supervisor, suggesting that correct behavior generalizes to some degree despite noisy supervision. The paper framed this as a testbed for the superhuman-supervision problem, not a solution.
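The paper summarizes results as performance gap recovered (PGR): the fraction of the gap between the weak supervisor and the strong model's ceiling that the weakly supervised student closes. A tiny illustration with made-up accuracy numbers:

```python
def performance_gap_recovered(weak_acc: float,
                              w2s_acc: float,
                              ceiling_acc: float) -> float:
    """PGR = (student - weak supervisor) / (strong ceiling - weak supervisor)."""
    return (w2s_acc - weak_acc) / (ceiling_acc - weak_acc)

# Illustrative numbers only: a weak supervisor at 60% accuracy, a strong
# ceiling at 90%, and a weakly supervised student at 78% recovers
# roughly 60% of the gap.
print(performance_gap_recovered(0.60, 0.78, 0.90))  # ≈ 0.6
```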
Process supervision rewards correct intermediate steps, not just correct final outputs. OpenAI's Let's Verify Step by Step (Lightman et al., 2023) showed that step-level supervision outperformed outcome-only supervision on the MATH benchmark. For tasks where the chain of reasoning is checkable even when the final answer is not, this shifts the supervision load to the step level.
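A sketch of how a process reward model (PRM) can rerank candidate solutions. `prm` is a hypothetical callable returning the probability (in (0, 1]) that one reasoning step is correct; the paper scores a full solution as the product of its per-step probabilities:

```python
import math
from typing import Callable

def score_solution(steps: list[str], prm: Callable[[str], float]) -> float:
    # Product of per-step correctness probabilities, computed in log
    # space: a solution is only as strong as its weakest steps.
    return math.exp(sum(math.log(prm(step)) for step in steps))

def best_of_n(candidates: list[list[str]],
              prm: Callable[[str], float]) -> list[str]:
    # Outcome supervision would compare final answers only; process
    # supervision reranks entire reasoning chains step by step.
    return max(candidates, key=lambda steps: score_solution(steps, prm))
```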
Redwood Research's AI Control agenda (Shlegeris et al., 2024) takes a different angle: assume the model may be unaligned, and design the deployment so that a capable but potentially unaligned model cannot cause catastrophic harm. Example protocols: an untrusted model proposes actions, a trusted model audits them, and every step is monitored. This is complementary to alignment, not a substitute for it.
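A sketch of one control protocol in that spirit, trusted monitoring: the untrusted model's proposals get a suspicion score from a weaker trusted model, and a scarce human-audit budget is spent on the most suspicious ones. The callables and the 2% budget are illustrative assumptions, not Redwood's code:

```python
from typing import Callable

def run_with_monitoring(tasks: list[str],
                        untrusted: Callable[[str], str],
                        suspicion: Callable[[str, str], float],
                        audit: Callable[[str, str], bool],
                        audit_budget: float = 0.02) -> list[str]:
    proposals = [(task, untrusted(task)) for task in tasks]
    # Rank by the trusted model's suspicion score and spend the
    # limited human-audit budget on the top of the list.
    ranked = sorted(proposals, key=lambda p: suspicion(*p), reverse=True)
    n_audit = max(1, int(audit_budget * len(ranked)))
    approved = []
    for i, (task, action) in enumerate(ranked):
        if i < n_audit and not audit(task, action):
            continue  # the audit flagged this action: block it
        approved.append(action)
    return approved
```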
| Approach | Human task | Scales with capability? | Status |
|---|---|---|---|
| Debate | Judge an argument | In theory, yes | Research; partial empirical results |
| Iterated amplification | Coordinate with AI helpers | Yes | Research; mostly theoretical |
| Weak-to-strong | Provide imperfect supervision | Open question | Testbed, not a full solution |
| Process supervision | Judge individual steps | For decomposable tasks | Deployed (o1, etc.) |
| AI control | Monitor an untrusted model | Different axis; complementary | Active research |
> The central question of alignment is not whether we can supervise AI today. It is whether our supervision can grow as fast as AI's capability.
>
> — Jan Leike, former OpenAI superalignment co-lead
The big idea: scalable oversight is the research problem of keeping human judgment useful as AI capability climbs past our ability to evaluate outputs directly. No single proposal solves it. Several partial approaches stack together, and that stack is the field's working answer today.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety-scalable-oversight-creators
1. What is the core challenge that scalable oversight aims to address?
2. In the debate method for scalable oversight, what specifically does the human judge evaluate?
3. What is iterated amplification (IDA) primarily concerned with?
4. What did OpenAI's weak-to-strong generalization experiment demonstrate?
5. Process supervision differs from outcome-only supervision by rewarding what?
6. AI control as discussed in the lesson assumes what about AI models?
7. Why does naive RLHF become misleading once models exceed human expertise in a domain?
8. Which scalable oversight approach has been deployed in current systems like o1?
9. The lesson describes the scalable oversight problem as recursive. What does this mean?
10. What was the intuition behind the debate proposal by Irving, Christiano, and Amodei?
11. According to the comparison table, what is the current status of weak-to-strong generalization?
12. What does the lesson identify as the 'worst case' scenario for scalable oversight?
13. Which approach scales with capability, according to the comparison table?
14. What role do RSPs (Responsible Scaling Policies) play in the lesson's framework?
15. In iterated amplification, what happens in the 'distillation' step?