Two AIs argue opposite sides. A human judges the transcript. The bet: truth is easier to defend than lies, so debate surfaces signal a human alone would miss.

Two Lawyers, One Judge
Proposed by Irving, Christiano, and Amodei at OpenAI in 2018, AI Safety via Debate structures oversight as an adversarial game. Two copies of a model take opposite positions on a question. They argue. A human reads the exchange and picks a winner. The hypothesis is that lying requires a more fragile story than telling the truth, so the liar loses over many rounds.
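To make the protocol concrete, here is a minimal sketch in Python. The structure (two copies of a model arguing in alternating turns, a judge reading the full transcript) follows the paper's setup, but the function names (`query_debater`, `query_judge`), the fixed round count, and the placeholder bodies are illustrative assumptions, not any real API.

```python
# A minimal sketch of the debate protocol, assuming hypothetical model calls.
# `query_debater` and `query_judge` are stand-ins for whatever model or human
# interface you actually use; their bodies here are placeholders.

def query_debater(model_id: str, stance: str, question: str,
                  transcript: list[str]) -> str:
    """Return the next argument for `stance`, given the debate so far."""
    return f"[{model_id} argues '{stance}' on '{question}', turn {len(transcript) + 1}]"

def query_judge(question: str, transcript: list[str]) -> str:
    """Return the winning stance. In the real protocol a human reads the transcript."""
    return "pro"  # placeholder verdict

def run_debate(question: str, rounds: int = 3) -> str:
    transcript: list[str] = []
    for _ in range(rounds):
        # Two copies of the same model take opposite positions and argue in turn.
        transcript.append(query_debater("copy_A", "pro", question, transcript))
        transcript.append(query_debater("copy_B", "con", question, transcript))
    # The judge picks a winner from the whole exchange, not a single answer.
    return query_judge(question, transcript)

if __name__ == "__main__":
    print("Judge picks:", run_debate("Is claim X true?"))
```

The design point the sketch makes explicit: the judge never evaluates a bare answer, only an adversarial transcript, which is where the "fragile story" hypothesis is supposed to do its work.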
If the only tool we had were RLHF, we would be stuck. Debate is one attempt at a different tool.
— Geoffrey Irving, UK AI Safety Institute
The big idea: adversarial oversight is a structural idea, not a product. It may or may not scale, but the reasoning behind it — use conflict to surface signal — is worth carrying into any supervision scheme.
15 questions · take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-debate-alignment-creators
What is the core hypothesis that makes AI debate potentially useful for alignment?
Why is an adversarial structure more effective than a single AI answering questions unsupervised?
What did the image-classification debate experiment demonstrate about the value of local evidence?
What is an 'obfuscated argument' as described in the lesson?
Which of the following is identified as a limitation of AI debate in the lesson?
What assumption about the debaters is required for debate to work as intended?
What did recent 2024 research from Anthropic and others find about debate's effectiveness?
The lesson describes debate as a 'structural idea, not a product.' What does this mean?
In the debate framework, what motivates the AI arguing against a false claim to expose the lie?
What happens when debate involves questions that cannot be decomposed into checkable sub-claims?
Why might a longer argument win a debate even if it contains errors?
What is a 'judge model' in the context of AI debate?
What distinguishes 'adversarial training' from regular training in AI development?
What did the lesson mean by saying 'truth is easier to defend than lies' in the context of debate?
If two AI models with very different capability levels participate in debate, what problem can arise?