Debate as an Alignment Method
Two AIs argue opposite sides. A human judges the transcript. The bet: truth is easier to defend than lies, so debate surfaces signal a human alone would miss.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. Two Lawyers, One Judge
2. Debate
3. Scalable oversight
4. Adversarial training
Section 1
Two Lawyers, One Judge
Proposed by Irving, Christiano, and Amodei at OpenAI in 2018, AI Safety via Debate structures oversight as an adversarial game. Two copies of a model take opposite positions on a question. They argue. A human reads the exchange and picks a winner. The hypothesis is that lying requires a more fragile story than telling the truth, so the liar loses over many rounds.
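The protocol above can be sketched as a simple loop: two debaters alternate arguments on a fixed question, and a judge who never sees the ground truth picks a winner from the transcript alone. This is a minimal toy sketch, not the paper's actual training setup; the `pro`, `con`, and `judge` callables are hypothetical stand-ins for model and human roles.

```python
from typing import Callable, List, Tuple

Argument = str
Transcript = List[Tuple[str, Argument]]  # (debater name, argument)


def run_debate(
    question: str,
    pro: Callable[[str, Transcript], Argument],   # argues the "yes" side
    con: Callable[[str, Transcript], Argument],   # argues the "no" side
    judge: Callable[[str, Transcript], str],      # returns the winner's name
    rounds: int = 3,
) -> str:
    """Alternate arguments for a fixed number of rounds, then judge."""
    transcript: Transcript = []
    for _ in range(rounds):
        transcript.append(("pro", pro(question, transcript)))
        transcript.append(("con", con(question, transcript)))
    # The judge only reads the transcript -- no access to ground truth.
    return judge(question, transcript)


# Usage with scripted stand-ins (a real judge would be a human reader):
pro = lambda q, t: "the evidence supports yes"
con = lambda q, t: "that evidence is cherry-picked"
judge = lambda q, t: "pro"
winner = run_debate("Is the claim true?", pro, con, judge)
```

The key structural point the sketch makes concrete: the judge's signature takes only the question and transcript, which is exactly why debate is pitched as oversight for questions the human cannot answer directly.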
Why adversarial structure helps
- A single model can confidently lie to a rater with no counter
- In debate, the other model is motivated to expose the lie
- The human does not need to know the truth directly — just which argument is stronger
- Works for questions where humans can evaluate locally even if they can't reason end-to-end
Where it wobbles
1. Obfuscated arguments: a well-crafted lie with plausible local steps can beat a clumsy truth
2. Human judges get fooled by rhetoric, confidence, and length
3. Debate assumes both sides have equal capability — unequal models break the symmetry
4. Some questions don't decompose into checkable sub-claims
“If the only tool we had were RLHF, we would be stuck. Debate is one attempt at a different tool.”
The big idea: adversarial oversight is a structural idea, not a product. It may or may not scale, but the reasoning behind it — use conflict to surface signal — is worth carrying into any supervision scheme.
Related lessons
Keep going
Creators · 45 min
Scalable Oversight: How Do You Supervise What You Cannot Evaluate?
Debate, amplification, weak-to-strong generalization, process supervision. Research on how humans can supervise models smarter than they are.
Creators · 55 min
Alignment: The Full Technical Picture
What alignment actually is as a research program, how it is done in practice, what the open problems are, and where the actual papers live. A model that is always helpful will help you do harmful things.
Creators · 50 min
AI Alignment: The Actual Technical Problem
Alignment is not a vibes debate. It is a concrete technical problem about getting systems to pursue goals we actually want. Here is what researchers work on when they say they work on alignment.
