What a constitution actually contains, how the training loop works, where the research is now, and the honest trade-offs.
"Constitutional AI: Harmlessness from AI Feedback" (Bai et al., Anthropic, December 2022) proposed replacing most human preference labels with labels from a model guided by a written constitution. The constitution is a set of principles that the model uses to critique and revise its own outputs, and that a second model uses to rank the resulting outputs. The goal was to scale feedback, make the principles auditable, and reduce the labor cost of alignment.
Stage 1: Supervised learning (critique and revise)

    for each (prompt, harmful response):
        critique = model(prompt, response, principle)
        revision = model(prompt, response, critique)
    SFT on (prompt, revision)

Stage 2: Reinforcement learning (RLAIF)

    for each prompt:
        response_A = model(prompt)
        response_B = model(prompt)
        judgment = model(response_A, response_B, principle)
        use judgment as preference label
    train reward model on these AI-labeled preferences
    PPO against reward model

Constitutional AI's two-stage pipeline. Human labels are concentrated in writing the constitution, not labeling every output.

Anthropic's published constitution draws from multiple sources: the UN Universal Declaration of Human Rights, Apple's terms of service, DeepMind's Sparrow rules, and Anthropic's own research on harms. Principles include things like "please choose the response that is most supportive and encouraging of life, liberty, and personal security," "choose the response that has the least objectionable, offensive, unlawful, deceptive, inaccurate, or harmful content," and "choose the response that sounds most like something a wise, ethical, polite, and friendly person would say."
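The data-generation half of the two-stage pipeline can be sketched in Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: the `Model` interface, the prompt templates, `toy_model`, and the choice to sample one principle at random per example are all hypothetical. Reward-model training and PPO are left out; the sketch only produces the SFT pairs and the AI-labeled preferences those steps would consume.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Principles quoted in the lesson text.
PRINCIPLES = [
    "Please choose the response that is most supportive and encouraging "
    "of life, liberty, and personal security.",
    "Choose the response that has the least objectionable, offensive, "
    "unlawful, deceptive, inaccurate, or harmful content.",
    "Choose the response that sounds most like something a wise, ethical, "
    "polite, and friendly person would say.",
]

# Hypothetical interface: any callable mapping a text prompt to a completion.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, response: str,
                        principle: str) -> Tuple[str, str]:
    """Stage 1 inner step: critique a response, then revise it."""
    critique = model(
        f"Prompt: {prompt}\nResponse: {response}\n"
        f"Critique the response using this principle: {principle}"
    )
    revision = model(
        f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
        "Rewrite the response to address the critique."
    )
    return critique, revision


def build_sft_dataset(model: Model, pairs: List[Tuple[str, str]],
                      rng: random.Random) -> List[Tuple[str, str]]:
    """Stage 1: turn (prompt, harmful response) pairs into (prompt, revision)."""
    dataset = []
    for prompt, response in pairs:
        principle = rng.choice(PRINCIPLES)  # assumption: one random principle
        _, revision = critique_and_revise(model, prompt, response, principle)
        dataset.append((prompt, revision))
    return dataset


@dataclass
class Preference:
    prompt: str
    chosen: str
    rejected: str


def label_preference(judge: Model, prompt: str, response_a: str,
                     response_b: str, principle: str) -> Preference:
    """Stage 2 inner step: the judge's verdict becomes a preference label."""
    verdict = judge(
        f"Prompt: {prompt}\nA: {response_a}\nB: {response_b}\n"
        f"Principle: {principle}\nAnswer with A or B."
    )
    if verdict.strip().upper().startswith("A"):
        return Preference(prompt, chosen=response_a, rejected=response_b)
    return Preference(prompt, chosen=response_b, rejected=response_a)


def toy_model(text: str) -> str:
    """Deterministic stand-in for a language model, for demonstration only."""
    if "Answer with A or B" in text:
        return "B"
    if "Rewrite the response" in text:
        return "I can't help with that, but here is a safer alternative."
    return "The response could cause harm."
```

In a real run, `toy_model` would be replaced by calls to an actual language model, the output of `build_sft_dataset` would feed supervised fine-tuning, and a collection of `Preference` records would train the reward model.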
In 2023, Anthropic partnered with the Collective Intelligence Project to source constitutional principles from about 1,000 Americans through an online public deliberation process. Some of the resulting principles diverged from Anthropic's own. The publicly sourced constitution was then used to train a model for comparison with one trained on Anthropic's constitution. The experiment was a proof of concept for democratic input into alignment targets.
| Dimension | RLHF | Constitutional AI / RLAIF |
|---|---|---|
| Primary labelers | Human contractors | AI model guided by principles |
| Transparency | Unpublished labeler guidelines | Published constitution |
| Scale cost | Linear in labels needed | Near-constant after constitution written |
| Debate surface | Individual labels, not public | Written principles, public |
| Bias source | Labeler demographics | Principle selection + judge model |
| Updatability | Slow (recollect labels) | Fast (edit principle, retrain) |
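The "Scale cost" row can be made concrete with a toy cost model. This is only a sketch: the function names and every number passed to them are hypothetical, and the per-label inference cost is included to show why the table says "near-constant" rather than "constant".

```python
import math
from typing import Optional


def human_label_cost(n_labels: int, cost_per_label: float) -> float:
    """RLHF-style cost: grows linearly with the number of labels."""
    return n_labels * cost_per_label


def ai_label_cost(n_labels: int, constitution_cost: float,
                  inference_cost_per_label: float) -> float:
    """RLAIF-style cost: a fixed cost for writing the constitution,
    plus a (typically small) inference cost per AI-generated label."""
    return constitution_cost + n_labels * inference_cost_per_label


def break_even_labels(cost_per_label: float, constitution_cost: float,
                      inference_cost_per_label: float) -> Optional[int]:
    """Smallest label count at which AI labeling is no more expensive
    than human labeling, or None if it never catches up."""
    saving = cost_per_label - inference_cost_per_label
    if saving <= 0:
        return None
    return math.ceil(constitution_cost / saving)
```

For example, with made-up figures of 1.0 per human label, 10,000 to write the constitution, and 0.5 per AI label, `break_even_labels(1.0, 10000.0, 0.5)` returns 20000: below that many labels, human labeling is cheaper in this toy model.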
We cannot scale alignment by putting a human in every decision. We can scale it by putting humans in the writing of the rules.
— Amanda Askell, Anthropic
The big idea: constitutional AI replaces the cost of per-example human feedback with the cost of writing and debating principles. That is a real bet about where alignment judgment should happen. Its strengths and limits are both consequences of that bet.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety-constitutional-ai-deep-creators
What is the main idea of "Constitutional AI: A Deep Dive on Anthropic's Approach"?
Which concept is most central to "Constitutional AI: A Deep Dive on Anthropic's Approach"?
Which use of AI fits this topic best?
What should a careful learner remember about "Everyone does some version of this now"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about constitutional AI be treated?
Name one way to verify an AI answer about constitutional AI.
Which action would help you apply "Constitutional AI: A Deep Dive on Anthropic's Approach" responsibly?