Constitutional AI: A Deep Dive on Anthropic's Approach
What a constitution actually contains, how the training loop works, where the research is now, and the honest trade-offs.
Lesson map

The main moves, in order:

1. The Paper and the Approach
2. Constitutional AI
3. RLAIF
4. Critique and revision
Section 1
The Paper and the Approach
Constitutional AI: Harmlessness from AI Feedback (Bai et al., Anthropic, December 2022) proposed replacing most human preference labels with labels from a model guided by a written constitution. The constitution is a set of principles the model uses to critique and revise its own outputs, and another model uses to rank revised outputs. The goal was to scale feedback, make principles auditable, and reduce the labor cost of alignment.
The two-stage loop
Constitutional AI's two-stage pipeline. Human labels are concentrated in writing the constitution, not labeling every output.
Stage 1: Supervised learning (critique and revise)

```text
for each (prompt, harmful_response):
    critique = model(prompt, harmful_response, principle)
    revision = model(prompt, harmful_response, critique)
    SFT on (prompt, revision)
```
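The Stage 1 loop can be sketched in Python. This is a hedged illustration, not Anthropic's implementation: `generate` is a hypothetical stand-in for a real language-model call, and the prompt wording is invented for clarity.

```python
# Hypothetical sketch of Stage 1 (critique and revise).
# `generate` stands in for a real LLM call; it is not a real API.
def generate(text: str) -> str:
    # Placeholder: a real system would query a language model here.
    return f"[model output for: {text[:40]}]"

def critique_and_revise(prompt: str, response: str, principle: str):
    # Ask the model to critique its own response against one principle...
    critique = generate(
        f"Critique this response using the principle '{principle}':\n"
        f"Prompt: {prompt}\nResponse: {response}"
    )
    # ...then revise the response in light of that critique.
    revision = generate(
        f"Revise the response given this critique:\n{critique}\n"
        f"Original response: {response}"
    )
    # (prompt, revision) pairs become the SFT training data.
    return prompt, revision

pair = critique_and_revise(
    "How do I pick a lock?",
    "Sure! First, get a tension wrench...",
    "Choose the response that has the least harmful content",
)
```

The key design point: the model that produced the harmful response is the same model doing the critiquing and revising, steered only by the principle text.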
Stage 2: Reinforcement learning (RLAIF)

```text
for each prompt:
    response_A = model(prompt)
    response_B = model(prompt)
    judgment   = model(response_A, response_B, principle)
    use judgment as preference label

train reward model on these AI-labeled preferences
run PPO against the reward model
```

What is in a constitution
Anthropic's published constitution draws on multiple sources: the UN Universal Declaration of Human Rights, Apple's terms of service, DeepMind's Sparrow rules, and Anthropic's own research on harms. Principles include "Please choose the response that is most supportive and encouraging of life, liberty, and personal security," "Choose the response that has the least objectionable, offensive, unlawful, deceptive, inaccurate, or harmful content," and "Choose the response that sounds most like something a wise, ethical, polite, and friendly person would say."
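Mechanically, a constitution is just a list of principle strings, and the paper samples a principle for each critique-revision (or judgment) pass. A minimal sketch, using three of the published principles quoted above:

```python
import random

# A toy constitution: a list of principle strings (three of the
# published principles quoted in this lesson, for illustration).
CONSTITUTION = [
    "Please choose the response that is most supportive and encouraging "
    "of life, liberty, and personal security.",
    "Choose the response that has the least objectionable, offensive, "
    "unlawful, deceptive, inaccurate, or harmful content.",
    "Choose the response that sounds most like something a wise, ethical, "
    "polite, and friendly person would say.",
]

def sample_principle(rng=random):
    # One randomly drawn principle guides one critique-revision
    # or judgment pass; over many passes, coverage averages out.
    return rng.choice(CONSTITUTION)

principle = sample_principle()
```

This is part of why the approach scales: editing the list edits the training signal, with no relabeling required.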
Why principles, not examples
- Auditable: anyone can read and critique the constitution text
- Debatable: principles are contestable in a way that individual labeler calls are not
- Updatable: you can edit a principle and retrain, faster than recollecting labels
- Cross-cultural: multiple voices can contribute principles at different granularities
- Scalable: one sentence can guide labeling of a million completions
The Collective Constitutional AI experiment
In 2023, Anthropic partnered with the Collective Intelligence Project to source constitutional principles from about 1,000 Americans via a deliberative polling process. Some principles diverged from Anthropic's own. The publicly sourced constitution was then used to train a comparable model. The experiment was a proof of concept for democratic input into alignment targets.
Known limits
- The judge model has its own biases; RLAIF inherits them
- Constitutional language is ambiguous; different models interpret the same principle differently
- Principles can conflict (helpful vs. harmless); the resolution is implicit in training
- Constitution authorship is political: whose values, from where, selected by whom
- Works best for adjacent-to-human-judgment tasks; harder for superhuman domains
Compare the options
| Dimension | RLHF | Constitutional AI / RLAIF |
|---|---|---|
| Primary labelers | Human contractors | AI model guided by principles |
| Transparency | Unpublished labeler guidelines | Published constitution |
| Scale cost | Linear in labels needed | Near-constant after constitution written |
| Debate surface | Individual labels, not public | Written principles, public |
| Bias source | Labeler demographics | Principle selection + judge model |
| Updatability | Slow (recollect labels) | Fast (edit principle, retrain) |
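Both columns of the table feed the same downstream machinery: preference pairs train a reward model, typically with a pairwise (Bradley-Terry style) loss. A hedged sketch, where the scalar scores stand in for reward-model outputs:

```python
import math

# Pairwise (Bradley-Terry style) loss for training a reward model on
# preference labels, whether those labels come from human raters (RLHF)
# or an AI judge (RLAIF). Scores here stand in for reward-model outputs.
def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    # P(chosen preferred) = sigmoid(score_chosen - score_rejected);
    # the loss is the negative log of that probability.
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the reward model already prefers the chosen response, loss is small:
low = pairwise_loss(2.0, -1.0)
# If it prefers the rejected response instead, loss is large:
high = pairwise_loss(-1.0, 2.0)
```

The only thing that changes between RLHF and RLAIF is where the (chosen, rejected) labels come from, which is exactly why the bias source shifts from labeler demographics to principle selection plus the judge model.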
“We cannot scale alignment by putting a human in every decision. We can scale it by putting humans in the writing of the rules.”
The big idea: constitutional AI replaces the cost of per-example human feedback with the cost of writing and debating principles. That is a real bet about where alignment judgment should happen. Its strengths and limits are both consequences of that bet.