Scalable Oversight: How Do You Supervise What You Cannot Evaluate?
Debate, amplification, weak-to-strong generalization, process supervision. Research on how humans can supervise models smarter than they are.
Lesson map
The main moves, in order:
1. The Supervision Ceiling
2. Scalable oversight
3. Debate
4. Amplification
Section 1: The Supervision Ceiling
RLHF and related methods work because humans can judge whether a response is good. What happens when the model's output is a 10,000-line proof, a molecular design, or a multi-step plan that no human can evaluate end-to-end? The human can no longer provide a trustworthy reward signal. This is the scalable oversight problem.
Why it matters now, not later
- Frontier models already produce outputs humans cannot fully check (e.g., long code changes)
- The quality of the training signal caps the quality of the trained behavior
- Once models clearly exceed human expertise in a domain, naive RLHF becomes misleading
- Even current models need scalable oversight in narrow domains
Proposal 1: Debate
Irving, Christiano, and Amodei (OpenAI, 2018) proposed training two AI agents to argue opposing answers to a question, with a human judging the exchange. The intuition is that a true position should be easier to defend than a compelling false one under adversarial questioning. The human supervises arguments, not solutions, which is a much easier task.
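The protocol is simple enough to sketch. Below is a minimal, hypothetical version in Python: `agent_a`, `agent_b`, and `judge` are placeholder callables rather than any published API, and the point is that the judge only ever reads the transcript, never the underlying hard task.

```python
from typing import Callable, List, Tuple

Agent = Callable[[str, List[str]], str]   # (question, transcript so far) -> next argument
Judge = Callable[[str, List[str]], int]   # (question, full transcript) -> 0 if A wins, 1 if B wins

def run_debate(question: str, agent_a: Agent, agent_b: Agent,
               judge: Judge, rounds: int = 3) -> Tuple[int, List[str]]:
    """Two agents alternate short arguments for opposing answers; a judge
    who only reads the transcript picks a winner.

    In self-play training, the judge's verdict is the reward: each agent is
    optimized to win, and the hope is that truthful, checkable arguments are
    the winning strategy.
    """
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("A: " + agent_a(question, transcript))
        transcript.append("B: " + agent_b(question, transcript))
    winner = judge(question, transcript)   # the human never inspects the raw hard task
    return winner, transcript
```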
Proposal 2: Iterated amplification
Paul Christiano's iterated distillation and amplification (IDA): a human working with several copies of a model can answer harder questions than the human alone. Distill that amplified capability into a single model. Now the distilled model + humans can answer still-harder questions. Iterate. The training signal at each step stays within human-judgable complexity.
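A minimal sketch of one amplify-then-distill round, under the assumption that a model is just a question-to-answer function and distillation is any supervised training routine. Every name here (`amplify`, `ida_round`, the decompose/recombine callables) is illustrative, not taken from Christiano's implementation.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a "model" is any question -> answer function, and
# "distill" is any supervised-training routine over (question, answer) pairs.
Model = Callable[[str], str]
Decompose = Callable[[str], List[str]]            # human splits a hard question
Recombine = Callable[[str, List[str]], str]       # human merges sub-answers
Distill = Callable[[List[Tuple[str, str]]], Model]

def amplify(decompose: Decompose, recombine: Recombine, model: Model) -> Model:
    """Amplification: the human breaks the question into pieces that copies of
    the current model can answer, then combines the results. Every judgement
    the human makes stays at human-checkable size."""
    def amplified(question: str) -> str:
        sub_questions = decompose(question)
        sub_answers = [model(q) for q in sub_questions]
        return recombine(question, sub_answers)
    return amplified

def ida_round(model: Model, questions: List[str],
              decompose: Decompose, recombine: Recombine,
              distill: Distill) -> Model:
    """One IDA iteration: amplify the current model, use the amplified system
    to label training questions, then distill those labels into a single model."""
    amplified = amplify(decompose, recombine, model)
    training_pairs = [(q, amplified(q)) for q in questions]
    return distill(training_pairs)
```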
Proposal 3: Weak-to-strong generalization
OpenAI's December 2023 paper Weak-to-Strong Generalization tested whether a weaker model supervising a stronger one could still produce aligned, capable behavior. A GPT-2-level model provided the labels, and GPT-4 was finetuned on them across NLP tasks. The strong model consistently outperformed its weak supervisor, recovering part of the gap to a model trained on ground truth, which suggests some generalization of correct behavior despite noisy supervision. The paper was framed as a testbed for the superhuman-supervision problem, not a solution.
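The experimental logic can be reproduced in miniature. The toy below is not OpenAI's code or models: it simulates a weak supervisor with 25% random label errors and trains a higher-capacity scikit-learn classifier on those labels, then checks whether the student beats its own supervision on held-out ground truth. Real weak-model errors are more systematic than random flips, so this is the friendliest possible case.

```python
# Toy illustration of the weak-to-strong setup (not the paper's code or models).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(6000, 20))
y = (X[:, :5].sum(axis=1) > 0).astype(int)   # ground truth, never shown to the student

# Simulated weak supervisor: correct labels, except 25% are flipped at random.
flip = rng.random(6000) < 0.25
weak_labels = np.where(flip, 1 - y, y)

train, test = slice(0, 5000), slice(5000, 6000)
student = GradientBoostingClassifier().fit(X[train], weak_labels[train])

print("weak supervision accuracy:", (weak_labels[test] == y[test]).mean())  # ~0.75 by construction
print("strong student accuracy  :", student.score(X[test], y[test]))
# Weak-to-strong generalization is the observation that the second number can
# exceed the first: the student fits the underlying signal, not the label noise.
```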
Proposal 4: Process supervision
Reward correct intermediate steps, not just correct final outputs. OpenAI's Let's Verify Step by Step (Lightman et al., 2023) showed that a process-supervised reward model outperformed an outcome-supervised one on MATH problems. For tasks where the chain of reasoning is checkable even when the final answer is not, this shifts the supervision load to the step level.
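In sketch form, the difference is where the reward attaches. The `StepScorer` below is a hypothetical stand-in for a process reward model; the product aggregation is one common choice (score the solution by the probability that every step is correct), and the rest is illustrative.

```python
from typing import Callable, List

# Hypothetical stand-in: a process reward model that scores one reasoning step
# in [0, 1], given the question and the steps before it.
StepScorer = Callable[[str, List[str], str], float]

def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome supervision: a single bit of signal for the whole solution."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def process_reward(question: str, steps: List[str], score_step: StepScorer) -> float:
    """Process supervision: every intermediate step is judged on its own.

    Aggregating by product treats the result as the probability that all
    steps are correct (min-over-steps is another common choice).
    """
    solution_score = 1.0
    for i, step in enumerate(steps):
        solution_score *= score_step(question, steps[:i], step)
    return solution_score
```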
Proposal 5: AI Control
Redwood Research's AI Control agenda (Shlegeris et al., 2024) takes a different angle: assume the model may be unaligned, and design the deployment so that a capable but potentially unaligned model cannot cause catastrophic harm. Examples: an untrusted model proposes actions, a trusted weaker model monitors them, and a limited human audit budget is spent on the most suspicious cases. This is complementary to alignment work, not a substitute for it.
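One concrete control protocol, trusted monitoring with a defer-to-trusted fallback, can be sketched as a deployment loop. All the callables and threshold values below are hypothetical placeholders; the shape of the logic (untrusted model proposes, trusted monitor scores, scarce human audits catch the top slice) is the part that matters.

```python
from typing import Callable

# Hypothetical placeholders: any callables with these shapes would do.
Propose = Callable[[str], str]             # model: task -> proposed action
Suspicion = Callable[[str, str], float]    # trusted monitor: (task, action) -> suspicion in [0, 1]
Audit = Callable[[str, str], bool]         # human auditor: True means the action looks safe

def trusted_monitoring(task: str,
                       untrusted_propose: Propose,
                       trusted_propose: Propose,
                       monitor: Suspicion,
                       human_audit: Audit,
                       audit_threshold: float = 0.95,
                       defer_threshold: float = 0.7) -> str:
    """One control-style deployment step.

    The untrusted (capable) model does the work; a trusted (weaker) model
    scores how suspicious the proposed action looks. The most suspicious
    actions are escalated to a scarce human auditor; moderately suspicious
    ones are replaced with the trusted model's own (worse but safe) action.
    """
    action = untrusted_propose(task)
    suspicion = monitor(task, action)
    if suspicion >= audit_threshold:
        # Spend the limited human audit budget on the top slice of suspicion scores.
        return action if human_audit(task, action) else trusted_propose(task)
    if suspicion >= defer_threshold:
        # Defer to trusted: give up some capability, keep safety.
        return trusted_propose(task)
    return action
```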
Compare the options
| Approach | Human task | Scales with capability? | Status |
|---|---|---|---|
| Debate | Judge an argument | In theory yes | Research, partial empirical |
| Iterated amplification | Coordinate with AI helpers | Yes | Research, mostly theoretical |
| Weak-to-strong | Provide imperfect supervision | Unknown; early empirical evidence | Testbed, not full solution |
| Process supervision | Judge individual steps | For decomposable tasks | Deployed (o1, etc.) |
| AI control | Monitor untrusted model | Different axis, complementary | Active research |
“The central question of alignment is not whether we can supervise AI today. It is whether our supervision can grow as fast as AI's capability.”
The big idea: scalable oversight is the research problem of how humans keep judgment useful as AI capability climbs past our ability to directly evaluate outputs. No proposal solves it. Several partial approaches stack together, and that stack is the field's working answer today.
