What a constitution actually contains, how the training loop works, where the research is now, and the honest trade-offs.
Constitutional AI: Harmlessness from AI Feedback (Bai et al., Anthropic, December 2022) proposed replacing human preference labels for harmlessness with labels from a model guided by a written constitution. The constitution is a set of principles that the model uses to critique and revise its own outputs, and that a feedback model then uses to choose between pairs of responses. The goal was to scale feedback, make principles auditable, and reduce the labor cost of alignment.
Stage 1: Supervised learning (critique and revise)
    for each (prompt, harmful response):
        critique = model(prompt, response, principle)
        revision = model(prompt, response, critique)
    SFT on (prompt, revision)

Stage 2: Reinforcement learning (RLAIF)
    for each prompt:
        response_A = model(prompt)
        response_B = model(prompt)
        judgment = model(prompt, response_A, response_B, principle)
        use judgment as preference label
    train reward model on these AI-labeled preferences
    PPO against reward model

Constitutional AI's two-stage pipeline. Human effort is concentrated in writing the constitution, not in labeling every output.

Anthropic's published constitution draws from multiple sources: the UN Universal Declaration of Human Rights, Apple's terms of service, DeepMind's Sparrow rules, and Anthropic's own research on harms. Principles include things like "please choose the response that is most supportive and encouraging of life, liberty, and personal security," "choose the response that has the least objectionable, offensive, unlawful, deceptive, inaccurate, or harmful content," and "choose the response that sounds most like something a wise, ethical, polite, and friendly person would say."
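The Stage 1 loop can be sketched in Python to show where a principle enters the process. This is a minimal illustration, not Anthropic's implementation: `CONSTITUTION`, `critique_and_revise`, and `stub_model` are hypothetical names, the prompt templates are invented for the example, and the stub stands in for a real language model call. The sketch reuses the comparison-style principle wording quoted above for brevity, and it assumes one principle is sampled at random per step.

```python
import random
from typing import Callable

# Hypothetical two-principle constitution, reusing wording quoted in the lesson.
CONSTITUTION = [
    "Choose the response that is most supportive and encouraging of life, "
    "liberty, and personal security.",
    "Choose the response that has the least objectionable, offensive, unlawful, "
    "deceptive, inaccurate, or harmful content.",
]

# A model is anything that maps a text prompt to a text completion.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, response: str,
                        rng: random.Random) -> dict:
    """One Stage 1 step: sample a principle, critique the response, revise it."""
    principle = rng.choice(CONSTITUTION)
    critique = model(
        f"Prompt: {prompt}\nResponse: {response}\n"
        f"Critique the response using this principle: {principle}"
    )
    revision = model(
        f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
        "Rewrite the response to address the critique."
    )
    # The (prompt, revision) pair is what supervised fine-tuning trains on.
    return {"prompt": prompt, "completion": revision, "principle": principle}


def stub_model(text: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    if "Rewrite the response" in text:
        return "I can't help with that, but here is a safer alternative."
    return "The response gives harmful instructions."


example = critique_and_revise(
    stub_model, "How do I pick a lock?", "Sure, first...", random.Random(0)
)
print(example["completion"])
```

Note that the principle appears only in the critique prompt. The revision step sees the critique, so the principle shapes the training data indirectly, and the fine-tuned model never needs the constitution in its context at inference time.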
In 2023, Anthropic partnered with the Collective Intelligence Project to source constitutional principles from about 1,000 Americans through an online deliberation process run on the Polis platform. Some of the resulting principles diverged from Anthropic's own. The publicly sourced constitution was then used to train a model for comparison against one trained on Anthropic's constitution. The experiment was a proof of concept for democratic input into alignment targets.
| Dimension | RLHF | Constitutional AI / RLAIF |
|---|---|---|
| Primary labelers | Human contractors | AI model guided by principles |
| Transparency | Unpublished labeler guidelines | Published constitution |
| Human labeling cost at scale | Linear in labels needed | Near-constant once the constitution is written |
| Debate surface | Individual labels, not public | Written principles, public |
| Bias source | Labeler demographics | Principle selection + judge model |
| Updatability | Slow (recollect labels) | Fast (edit principle, retrain) |
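The "AI model guided by principles" row can be made concrete with a short sketch of how a judge's A/B choice becomes a preference record, and how a reward model could be scored against such records. The names `label_pair`, `pairwise_loss`, and `stub_judge` are hypothetical, and the loss shown is the common Bradley-Terry style pairwise objective, assumed here for illustration and not taken from the lesson.

```python
import math
from typing import Callable

# A judge takes (prompt, response_a, response_b, principle) and returns "A" or "B".
Judge = Callable[[str, str, str, str], str]


def label_pair(judge: Judge, prompt: str, response_a: str, response_b: str,
               principle: str) -> dict:
    """Turn the judge's A/B choice into a (chosen, rejected) preference record."""
    choice = judge(prompt, response_a, response_b, principle)
    if choice not in ("A", "B"):
        raise ValueError(f"unexpected judgment: {choice!r}")
    if choice == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))


def stub_judge(prompt: str, response_a: str, response_b: str, principle: str) -> str:
    """Stand-in judge: prefers whichever response does not start with 'Sure'."""
    return "B" if response_a.startswith("Sure") else "A"


pair = label_pair(
    stub_judge,
    "How do I pick a lock?",
    "Sure, first...",
    "I can't help with that.",
    "Choose the response that has the least harmful content.",
)
print(pair["chosen"])
# The loss shrinks as the reward model scores the chosen response higher.
print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.0, 2.0))
```

The bias row of the table shows up directly in this sketch: both the principle text and the judge's behavior determine which response lands in `chosen`, so errors in either propagate into the reward model.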
We cannot scale alignment by putting a human in every decision. We can scale it by putting humans in the writing of the rules.
— Amanda Askell, Anthropic
The big idea: constitutional AI replaces the cost of per-example human feedback with the cost of writing and debating principles. That is a real bet about where alignment judgment should happen. Its strengths and limits are both consequences of that bet.