Lesson 537 of 1596
Debate as an Alignment Method
Two AIs argue opposite sides. A human judges the transcript. The bet: truth is easier to defend than lies, so debate surfaces signal a human alone would miss. Two Lawyers, One Judge Proposed by Irving, Christiano, and Amodei at OpenAI in 2018, AI Safety via Debate structures oversight as an adversarial game.
Creators · Ethics & Society · ~21 min read
Two Lawyers, One Judge
Proposed by Irving, Christiano, and Amodei at OpenAI in 2018, AI Safety via Debate structures oversight as an adversarial game. Two copies of a model take opposite positions on a question. They argue. A human reads the exchange and picks a winner. The hypothesis is that lying requires a more fragile story than telling the truth, so the liar loses over many rounds.
Why adversarial structure helps
- A single model can confidently lie to a rater with no counter
- In debate, the other model is motivated to expose the lie
- The human does not need to know the truth directly — just which argument is stronger
- Works for questions where humans can evaluate locally even if they can't reason end-to-end
Where it wobbles
- 1Obfuscated arguments: a well-crafted lie with plausible local steps can beat a clumsy truth
- 2Human judges get fooled by rhetoric, confidence, and length
- 3Debate assumes both sides have equal capability — unequal models break the symmetry
- 4Some questions don't decompose into checkable sub-claims
“If the only tool we had were RLHF, we would be stuck. Debate is one attempt at a different tool.”
Key terms in this lesson
The big idea: adversarial oversight is a structural idea, not a product. It may or may not scale, but the reasoning behind it — use conflict to surface signal — is worth carrying into any supervision scheme.
End-of-lesson quiz
Check what stuck
8 questions · Score saves to your progress.
Tutor
Curious about “Debate as an Alignment Method”?
Ask anything about this lesson. I’ll answer using just what you’re reading — short, friendly, grounded.
Progress saved locally in this browser. Sign in to sync across devices.
Related lessons
Keep going
Creators · 45 min
Scalable Oversight: How Do You Supervise What You Cannot Evaluate
Debate, amplification, weak-to-strong, process supervision. Research on how humans supervise models smarter than them.
Creators · 55 min
Alignment: The Full Technical Picture
What alignment actually is as a research program, how it is done in practice, what the open problems are, and where the actual papers live. A model that is always helpful will help you do harmful things.
Creators · 50 min
AI Alignment: The Actual Technical Problem
Alignment is not a vibes debate. It is a concrete technical problem about getting systems to pursue goals we actually want. Here is what researchers work on when they say they work on alignment.
