Two AIs argue opposite sides. A human judges the transcript. The bet: truth is easier to defend than lies, so debate surfaces signal a human alone would miss.

Two Lawyers, One Judge
Proposed by Irving, Christiano, and Amodei at OpenAI in 2018, AI Safety via Debate structures oversight as an adversarial game. Two copies of a model take opposite positions on a question. They argue. A human reads the exchange and picks a winner. The hypothesis is that lying requires a more fragile story than telling the truth, so the liar loses over many rounds.
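To make the protocol concrete, here is a minimal sketch in Python. The structure (two copies of a model arguing in alternating turns, a judge reading the full transcript) follows the paper's setup, but the function names (`query_debater`, `query_judge`), the fixed round count, and the placeholder bodies are illustrative assumptions, not any real API.

```python
# A minimal sketch of the debate protocol, assuming hypothetical model calls.
# `query_debater` and `query_judge` are stand-ins for whatever model or human
# interface you actually use; their bodies here are placeholders.

def query_debater(model_id: str, stance: str, question: str,
                  transcript: list[str]) -> str:
    """Return the next argument for `stance`, given the debate so far."""
    return f"[{model_id} argues '{stance}' on '{question}', turn {len(transcript) + 1}]"

def query_judge(question: str, transcript: list[str]) -> str:
    """Return the winning stance. In the real protocol a human reads the transcript."""
    return "pro"  # placeholder verdict

def run_debate(question: str, rounds: int = 3) -> str:
    transcript: list[str] = []
    for _ in range(rounds):
        # Two copies of the same model take opposite positions and argue in turn.
        transcript.append(query_debater("copy_A", "pro", question, transcript))
        transcript.append(query_debater("copy_B", "con", question, transcript))
    # The judge picks a winner from the whole exchange, not a single answer.
    return query_judge(question, transcript)

if __name__ == "__main__":
    print("Judge picks:", run_debate("Is claim X true?"))
```

The design point the sketch makes explicit: the judge never evaluates a bare answer, only an adversarial transcript, which is where the "fragile story" hypothesis is supposed to do its work.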
If the only tool we had were RLHF, we would be stuck. Debate is one attempt at a different tool.
— Geoffrey Irving, UK AI Safety Institute
The big idea: adversarial oversight is a structural idea, not a product. It may or may not scale, but the reasoning behind it — use conflict to surface signal — is worth carrying into any supervision scheme.
15 questions · take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-debate-alignment-creators
What is the core hypothesis that makes AI debate potentially useful for alignment?
Why is an adversarial structure more effective than a single AI answering questions unsupervised?
What did the image-classification debate experiment demonstrate about the value of local evidence?
What is an 'obfuscated argument' as described in the lesson?
Which of the following is identified as a limitation of AI debate in the lesson?
What assumption about the debaters is required for debate to work as intended?
What did recent 2024 research from Anthropic and others find about debate's effectiveness?
The lesson describes debate as a 'structural idea, not a product.' What does this mean?
In the debate framework, what motivates the AI arguing against a false claim to expose the lie?
What happens when debate involves questions that cannot be decomposed into checkable sub-claims?
Why might a longer argument win a debate even if it contains errors?
What is a 'judge model' in the context of AI debate?
What distinguishes 'adversarial training' from regular training in AI development?
What did the lesson mean by saying 'truth is easier to defend than lies' in the context of debate?
If two AI models with very different capability levels participate in debate, what problem can arise?