

Debate as an Alignment Method

Two AIs argue opposite sides. A human judges the transcript. The bet: truth is easier to defend than lies, so debate surfaces signal a human alone would miss.



Two Lawyers, One Judge

Proposed by Irving, Christiano, and Amodei at OpenAI in 2018, "AI Safety via Debate" structures oversight as an adversarial game. Two copies of a model take opposite positions on a question and argue in turns. A human reads the exchange and picks a winner. The hypothesis is that a lie needs a more fragile story than the truth does, so over enough rounds the opponent can find the weak point and the liar loses.
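The game described above can be sketched as a short loop. This is a minimal illustration, not code from the paper: `run_debate`, `honest_debater`, `dishonest_debater`, and `checking_judge` are hypothetical names, the debaters are hard-coded stand-ins for model copies, and the judge is a stand-in for a human who can verify one simple kind of claim.

```python
from typing import Callable, Dict, List, Tuple

Transcript = List[Tuple[str, str]]                # (side, argument) pairs
Debater = Callable[[str, str, Transcript], str]   # question, assigned answer, transcript
Judge = Callable[[str, Transcript], str]          # returns "A", "B", or "tie"


def run_debate(question: str, answers: Dict[str, str],
               debaters: Dict[str, Debater], judge: Judge,
               rounds: int = 2) -> Tuple[str, Transcript]:
    """Alternate arguments for a fixed number of rounds, then ask the judge."""
    transcript: Transcript = []
    for _ in range(rounds):
        for side in ("A", "B"):
            argument = debaters[side](question, answers[side], transcript)
            transcript.append((side, argument))
    return judge(question, transcript), transcript


# Hypothetical stand-ins for the two model copies.
def honest_debater(question: str, answer: str, transcript: Transcript) -> str:
    return "91 = 7 * 13"                          # a claim the judge can check

def dishonest_debater(question: str, answer: str, transcript: Transcript) -> str:
    return "91 has no small factors, trust me"    # nothing checkable


def checking_judge(question: str, transcript: Transcript) -> str:
    """Toy judge: counts verified claims of the exact form 'n = a * b'."""
    verified = {"A": 0, "B": 0}
    for side, argument in transcript:
        if "=" in argument and "*" in argument:
            left, right = argument.split("=")
            a, b = right.split("*")
            if int(left) == int(a) * int(b):
                verified[side] += 1
    if verified["A"] == verified["B"]:
        return "tie"
    return max(verified, key=verified.get)


winner, transcript = run_debate(
    "Is 91 prime?",
    answers={"A": "no", "B": "yes"},
    debaters={"A": honest_debater, "B": dishonest_debater},
    judge=checking_judge,
)
print(winner)  # A
```

The judge never decides primality itself. It only verifies the one multiplication that the honest side puts in front of it, which is the division of labor the proposal relies on.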

Why adversarial structure helps

A single model can lie confidently to a rater, and nothing in the setup pushes back
In debate, the other model is rewarded for exposing the lie
The human does not need to know the truth directly, only which argument is stronger
It works for questions where humans can check individual steps even if they cannot reason end-to-end
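One way to picture local evaluation without end-to-end reasoning is a toy dispute over the sum of a list. The sketch below is an illustration under stated assumptions, not the paper's protocol: the claimant must split its claimed total into two half-sums, the challenger points at the half it disputes, and the judge ends up inspecting a single element. All function names are hypothetical, the list is assumed non-empty, and the challenger is assumed capable enough to compute true sums.

```python
from typing import Callable, List, Tuple

Split = Callable[[List[int], int, int, int], Tuple[int, int]]
Pick = Callable[[List[int], int, int, int, int], str]


def honest_split(xs: List[int], lo: int, hi: int, claimed: int) -> Tuple[int, int]:
    """Claimant who reports the true sum of each half."""
    mid = (lo + hi) // 2
    return sum(xs[lo:mid]), sum(xs[mid:hi])


def liar_split(xs: List[int], lo: int, hi: int, claimed: int) -> Tuple[int, int]:
    """Claimant defending a false total: true left half, error pushed to the right."""
    mid = (lo + hi) // 2
    left = sum(xs[lo:mid])
    return left, claimed - left


def challenger_pick(xs: List[int], lo: int, hi: int,
                    left_claim: int, right_claim: int) -> str:
    """Challenger disputes the left half if it is wrong, otherwise the right."""
    mid = (lo + hi) // 2
    return "left" if sum(xs[lo:mid]) != left_claim else "right"


def debate_sum(xs: List[int], claimed_total: int,
               split: Split, pick: Pick) -> Tuple[bool, int]:
    """Return (claim upheld?, rounds of narrowing) for a claimed sum of xs."""
    lo, hi, claim, rounds = 0, len(xs), claimed_total, 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left_claim, right_claim = split(xs, lo, hi, claim)
        if left_claim + right_claim != claim:   # judge checks one addition
            return False, rounds
        if pick(xs, lo, hi, left_claim, right_claim) == "left":
            hi, claim = mid, left_claim
        else:
            lo, claim = mid, right_claim
        rounds += 1
    return xs[lo] == claim, rounds              # judge looks at one element


xs = [3, 1, 4, 1, 5, 9, 2, 6]                   # true sum is 31
print(debate_sum(xs, 31, honest_split, challenger_pick))  # (True, 3)
print(debate_sum(xs, 40, liar_split, challenger_pick))    # (False, 3)
```

A false total must be wrong in at least one half, so the challenger can always keep the dispute on a false sub-claim until it reaches one element. The judge's work stays constant per round while the list can be arbitrarily long.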

Toy success

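In the original paper's MNIST experiment, two agents took turns revealing pixels of a handwritten digit to a judge that could see only the revealed pixels. With six pixels, the judge's accuracy rose from 59.4% when the pixels were chosen at random to 88.9% when they were chosen through debate. The baseline is the same sparse judge without debate, not a judge who sees the whole image. The paper also describes a prototype in which human debaters argue 'this is a dog' vs 'this is a cat' over a photo the judge cannot see. Local evidence plus adversarial selection made a weak judge far more accurate.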
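The pixel-revealing setup can be imitated at toy scale. The sketch below is not the paper's experiment: the 3x3 templates, the function names, and the scoring rule are all hypothetical. It assumes, as a rule of this toy game, that debaters may choose which pixels to reveal but cannot misreport their values.

```python
from typing import Dict, List, Tuple

# Hypothetical 3x3 binary "images", flattened row by row.
TEMPLATES: Dict[str, List[int]] = {
    "cat": [1, 0, 1,
            0, 1, 0,
            1, 1, 1],
    "dog": [1, 0, 1,
            0, 1, 0,
            0, 0, 1],
}


def pick_pixel(image: List[int], own: str, other: str,
               revealed: Dict[int, int]) -> int:
    """Choose an unseen pixel that supports `own`, preferring ones that contradict `other`."""
    unseen = [i for i in range(len(image)) if i not in revealed]
    supports = [i for i in unseen if image[i] == TEMPLATES[own][i]]
    decisive = [i for i in supports if image[i] != TEMPLATES[other][i]]
    pool = decisive or supports or unseen
    return pool[0]


def sparse_judge(revealed: Dict[int, int]) -> str:
    """Pick the label whose template agrees with the most revealed pixels."""
    scores = {
        label: sum(template[i] == value for i, value in revealed.items())
        for label, template in TEMPLATES.items()
    }
    return max(scores, key=scores.get)


def pixel_debate(image: List[int], claims: Tuple[str, str],
                 turns_each: int = 2) -> Tuple[str, Dict[int, int]]:
    """Debaters alternate revealing true pixel values; the judge sees only those."""
    first, second = claims
    revealed: Dict[int, int] = {}
    for _ in range(turns_each):
        for own, other in ((first, second), (second, first)):
            i = pick_pixel(image, own, other, revealed)
            revealed[i] = image[i]          # values cannot be misreported
    return sparse_judge(revealed), revealed


verdict, seen = pixel_debate(TEMPLATES["cat"], claims=("cat", "dog"))
print(verdict, len(seen))  # cat 4
```

The side arguing for the true label can always find pixels that separate the two templates, while the other side can only reveal pixels that fit both. That asymmetry is what lets a judge with four pixels out of nine reach the right verdict here.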


Where it wobbles

1. Obfuscated arguments: a well-crafted lie with plausible local steps can beat a clumsy truth
2. Human judges get fooled by rhetoric, confidence, and length
3. Debate assumes both sides have equal capability; unequal models break the symmetry
4. Some questions don't decompose into checkable sub-claims
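A rough way to see why obfuscation hurts: suppose a long argument hides a few flawed steps, neither debater can point to them, and the judge is reduced to spot-checking steps chosen uniformly at random. Under those assumptions (which are a simplification for illustration, not a result from the literature), the chance of catching a flaw is a simple counting problem. The function name `detection_probability` is hypothetical.

```python
from fractions import Fraction
from math import comb


def detection_probability(n_steps: int, n_checked: int, n_flawed: int = 1) -> Fraction:
    """Chance that at least one flawed step is among the randomly checked ones.

    Counts the ways to pick only sound steps and subtracts from 1
    (sampling without replacement).
    """
    all_sound = Fraction(comb(n_steps - n_flawed, n_checked), comb(n_steps, n_checked))
    return 1 - all_sound


print(detection_probability(10, 1))     # 1/10
print(detection_probability(1000, 5))   # 1/200
```

With one flawed step the probability is just the fraction of steps checked, so stretching an argument from 10 steps to 1000 makes the same flaw far harder to catch. Debate helps only when the opponent can direct the judge to the bad step instead of leaving it to chance.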

Recent results

Work published in 2024 by Anthropic and others found that debate outperforms non-adversarial baselines on some question-answering tasks, especially with stronger judges. The improvements are modest and task-specific.

“If the only tool we had were RLHF, we would be stuck. Debate is one attempt at a different tool.”
Geoffrey Irving, UK AI Safety Institute


Key terms in this lesson

debate
adversarial training
obfuscated argument
judge model

The big idea: adversarial oversight is a structural idea, not a product. It may or may not scale, but the reasoning behind it — use conflict to surface signal — is worth carrying into any supervision scheme.

