Constitutional AI: A Deep Dive on Anthropic's Approach

What a constitution actually contains, how the training loop works, where the research is now, and the honest trade-offs.

45 min · Reviewed 2026

The Paper and the Approach

Constitutional AI: Harmlessness from AI Feedback (Bai et al., Anthropic, December 2022) proposed replacing human preference labels for harmlessness with labels from a model guided by a written constitution; human labels for helpfulness were still used. The constitution is a set of principles that the model uses to critique and revise its own outputs, and that a feedback model then uses to rank candidate responses. The goal was to scale feedback, make principles auditable, and reduce the labor cost of alignment.

The two-stage loop

Stage 1: Supervised learning (critique and revise)
  for each (prompt, harmful response):
    critique = model(prompt, response, principle)
    revision = model(prompt, response, critique)
  SFT on (prompt, revision)

Stage 2: Reinforcement learning (RLAIF)
  for each prompt:
    response_A = model(prompt)
    response_B = model(prompt)
    judgment = model(response_A, response_B, principle)
    use judgment as preference label
  train reward model on these AI-labeled preferences
  PPO against reward model

Constitutional AI's two-stage pipeline. Human labels are concentrated in writing the constitution, not labeling every output.
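The two stages above can be sketched in Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `Model` is a hypothetical interface (any function from a prompt string to a completion string), the prompt templates are invented for the example, passing the user prompt to the judge is a choice made here, and `toy_model` is a canned stand-in so the sketch runs end to end. The lesson does not specify the reward-model loss; `pairwise_loss` shows the pairwise (Bradley-Terry style) loss commonly used for that step.

```python
import math
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to a completion.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, response: str, principle: str) -> str:
    """Stage 1: critique a response against one principle, then revise it."""
    critique = model(
        f"Prompt: {prompt}\nResponse: {response}\n"
        f"Critique the response using this principle: {principle}"
    )
    revision = model(
        f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
        "Rewrite the response to address the critique."
    )
    return revision  # (prompt, revision) becomes one SFT training pair


def label_preference(judge: Model, prompt: str, response_a: str,
                     response_b: str, principle: str) -> int:
    """Stage 2: return 0 if the judge prefers A, 1 if it prefers B."""
    verdict = judge(
        f"Prompt: {prompt}\n(A) {response_a}\n(B) {response_b}\n"
        f"{principle}\nAnswer with A or B."
    )
    return 0 if verdict.strip().upper().startswith("A") else 1


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise reward-model loss: -log(sigmoid(chosen - rejected))."""
    return math.log1p(math.exp(-(reward_chosen - reward_rejected)))


def toy_model(text: str) -> str:
    """Canned stand-in for a real language model (illustration only)."""
    if "Answer with A or B" in text:
        return "B"
    if "Rewrite the response" in text:
        return "I would not do that. Talk to them directly or involve a manager."
    return "The response encourages harming someone's reputation."


prompt = "How do I get back at a coworker?"
draft = "Spread rumors about them."
principle = "Choose the response that is least harmful."

revision = critique_and_revise(toy_model, prompt, draft, principle)
label = label_preference(toy_model, prompt, draft, revision, principle)
print(revision, label, pairwise_loss(1.0, 0.0))
```

In a real pipeline the same structure holds, but the revisions feed a fine-tuning job and the labels feed reward-model training rather than a print statement.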

What is in a constitution

Anthropic's published constitution draws from multiple sources: the UN Universal Declaration of Human Rights, Apple's terms of service, DeepMind's Sparrow rules, and Anthropic's own research on harms. Principles include things like "Please choose the response that is most supportive and encouraging of life, liberty, and personal security," "Choose the response that has the least objectionable, offensive, unlawful, deceptive, inaccurate, or harmful content," and "Choose the response that sounds most like something a wise, ethical, polite, and friendly person would say."
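In code, a constitution of this kind is simply data: a list of natural-language principles that gets slotted into a critique or comparison prompt. A minimal sketch, assuming a hypothetical prompt template and one principle drawn at random per comparison (the principle texts are the ones given in this lesson; the function name and template are illustrative, not a published format):

```python
import random

# Principle texts as given in this lesson; the template below is hypothetical.
CONSTITUTION = [
    "Please choose the response that is most supportive and encouraging "
    "of life, liberty, and personal security.",
    "Choose the response that has the least objectionable, offensive, "
    "unlawful, deceptive, inaccurate, or harmful content.",
    "Choose the response that sounds most like something a wise, ethical, "
    "polite, and friendly person would say.",
]


def comparison_prompt(user_prompt: str, response_a: str, response_b: str,
                      rng: random.Random) -> str:
    """Build one judge prompt, drawing a single principle at random."""
    principle = rng.choice(CONSTITUTION)
    return (
        f"Human: {user_prompt}\n\n"
        f"(A) {response_a}\n(B) {response_b}\n\n"
        f"{principle}\nAnswer: ("
    )


text = comparison_prompt("Tell me about locks.", "First answer",
                         "Second answer", random.Random(0))
print(text)
```

Because the principles live in a plain list, editing the constitution means editing strings, which is what makes it easy to audit and update.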

Why principles, not examples

Auditable: anyone can read and critique the constitution text
Debatable: principles are contestable in a way that individual labeler calls are not
Updatable: you can edit a principle and retrain, faster than recollecting labels
Cross-cultural: multiple voices can contribute principles at different granularities
Scalable: one sentence can guide labeling of a million completions

The Collective Constitutional AI experiment

In 2023, Anthropic partnered with the Collective Intelligence Project to source constitutional principles from about 1,000 Americans via an online deliberation process run on the Polis platform. Some principles diverged from Anthropic's own. The publicly sourced constitution was then used to train a model that could be compared against one trained on Anthropic's own constitution. The experiment was a proof of concept for democratic input into alignment targets.

Known limits

The judge model has its own biases; RLAIF inherits them
Constitutional language is ambiguous; different models interpret the same principle differently
Principles can conflict (helpful vs. harmless); the resolution is implicit in training
Constitution authorship is political: whose values, from where, selected by whom
Works best on tasks close to ordinary human judgment; harder in domains where model ability may exceed human ability

Dimension	RLHF	Constitutional AI / RLAIF
Primary labelers	Human contractors	AI model guided by principles
Transparency	Unpublished labeler guidelines	Published constitution
Scale cost	Linear in labels needed	Near-constant after constitution written
Debate surface	Individual labels, not public	Written principles, public
Bias source	Labeler demographics	Principle selection + judge model
Updatability	Slow (recollect labels)	Fast (edit principle, retrain)
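The "Scale cost" row can be made concrete with a little arithmetic. The sketch below counts human labor only, and every number in it (30 seconds per label, 200 hours to write a constitution) is a hypothetical placeholder rather than a measured figure. AI labeling still consumes compute for each label, which is why the table says near-constant rather than constant.

```python
def rlhf_human_hours(num_labels: int, seconds_per_label: float) -> float:
    """Human labeling time grows linearly with the number of labels."""
    return num_labels * seconds_per_label / 3600.0


def cai_human_hours(constitution_hours: float, num_labels: int) -> float:
    """Human time is the one-off authoring cost; it ignores num_labels."""
    return constitution_hours


# Hypothetical inputs, chosen only to show the shape of the two curves.
for n in (10_000, 100_000, 1_000_000):
    print(n, round(rlhf_human_hours(n, 30.0), 1), cai_human_hours(200.0, n))
```

With these placeholder values the human-label line overtakes the fixed authoring cost somewhere between ten thousand and one hundred thousand labels; the crossover point moves with the assumptions, but the linear versus flat shape does not.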

We cannot scale alignment by putting a human in every decision. We can scale it by putting humans in the writing of the rules.
— Amanda Askell, Anthropic

The big idea: constitutional AI replaces the cost of per-example human feedback with the cost of writing and debating principles. That is a real bet about where alignment judgment should happen. Its strengths and limits are both consequences of that bet.

End-of-lesson check

8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety-constitutional-ai-deep-creators

What is the main idea of "Constitutional AI: A Deep Dive on Anthropic's Approach"?
1. What a constitution actually contains, how the training loop works, where the research is now, and the honest trade-offs.
2. Use AI as the final authority for the whole decision
3. Avoid checking the answer once it sounds polished
4. Focus only on speed instead of judgment

Which concept is most central to "Constitutional AI: A Deep Dive on Anthropic's Approach"?
1. RLAIF
2. constitutional AI
3. critique and revision
4. principles

Which use of AI fits this topic best?
1. Let the AI decide what matters without your review
2. Use the answer before checking whether it fits the situation
3. Auditable: anyone can read and critique the constitution text
4. Treat the AI output as automatically correct

What should a careful learner remember about "Everyone does some version of this now"?
1. Use "Everyone does some version of this now" as a reminder to verify the AI output before anyone relies on it.
2. Skip the context so the tool can guess faster
3. Treat the output as private even after sharing it online
4. Use the answer without checking the source

You want to use AI after this lesson. What is the safest next step?
1. Act immediately because the AI answer is written clearly
2. AI cannot make the human values decision for you.
3. Hide uncertainty so the final answer looks cleaner
4. Use private or sensitive details before checking permission

How should AI output about constitutional AI be treated?
1. As proof that no other source is needed
2. As a replacement for context, consent, or expert review
3. As a draft or helper output that still needs human judgment and verification
4. As something that becomes correct when it sounds confident

Name one way to verify an AI answer about constitutional AI.

Which action would help you apply "Constitutional AI: A Deep Dive on Anthropic's Approach" responsibly?
1. Use the tool to avoid thinking through the tradeoff
2. Keep going even if the output conflicts with a trusted source
3. Treat the AI output as automatically correct
4. Debatable: principles are contestable in a way that individual labeler calls are not

