Two AIs argue opposite sides. A human judges the transcript. The bet: truth is easier to defend than lies, so debate surfaces signal a human alone would miss.

Two Lawyers, One Judge
Proposed by Irving, Christiano, and Amodei at OpenAI in 2018, AI Safety via Debate structures oversight as an adversarial game. Two copies of a model take opposite positions on a question. They argue. A human reads the exchange and picks a winner. The hypothesis is that lying requires a more fragile story than telling the truth, so the liar loses over many rounds.
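To make the "fragile story" intuition concrete, here is a minimal toy sketch. It is not the protocol from the paper. It assumes a stand-in game in which one debater (the claimant) asserts the sum of a list, the other (the challenger) keeps narrowing the dispute to one half, and the judge can only afford to check a single element. The names `run_debate`, `claimant`, and `challenger` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def run_debate(values: List[int], claimed_total: int) -> Tuple[str, int]:
    """Toy debate over the claim 'sum(values) == claimed_total'.

    Illustrative assumption: the judge cannot add up the whole list,
    but can look at exactly one element at the end.
    Returns (winner, number_of_rounds).
    """
    if not values:
        raise ValueError("values must be non-empty")

    lo, hi = 0, len(values)   # span currently in dispute
    claim = claimed_total     # claimant's asserted sum for that span
    rounds = 0

    while hi - lo > 1:
        mid = (lo + hi) // 2
        # The claimant must split the claim into two sub-claims that add
        # up to it. This toy claimant is truthful about the left half and
        # pushes any error into the right half.
        left_claim = sum(values[lo:mid])
        right_claim = claim - left_claim

        # The challenger picks the half whose sub-claim is false. If the
        # overall claim is false, at least one half must be.
        if left_claim != sum(values[lo:mid]):
            hi, claim = mid, left_claim
        else:
            lo, claim = mid, right_claim
        rounds += 1

    # The judge checks the one remaining element against the final claim.
    winner = "claimant" if values[lo] == claim else "challenger"
    return winner, rounds


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]      # true sum is 31
    print(run_debate(data, 32))          # false claim
    print(run_debate(data, 31))          # true claim
```

In this toy, a false total cannot be split into sub-claims that are all true, so the challenger can always steer the dispute to one element the judge can verify. Whether real questions decompose this cleanly is exactly the open question about debate.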
If the only tool we had were RLHF, we would be stuck. Debate is one attempt at a different tool.
— Geoffrey Irving, UK AI Safety Institute
The big idea: adversarial oversight is a structural idea, not a product. It may or may not scale, but the reasoning behind it — use conflict to surface signal — is worth carrying into any supervision scheme.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-debate-alignment-creators
What is the main idea of "Debate as an Alignment Method"?
Which concept is most central to "Debate as an Alignment Method"?
Which use of AI fits this topic best?
What should a careful learner remember about "Toy success"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about debate be treated?
Name one way to verify an AI answer about debate.
Which action would help you apply "Debate as an Alignment Method" responsibly?