Constitutional AI: A Deep Dive on Anthropic's Approach
What a constitution actually contains, how the training loop works, where the research is now, and the honest trade-offs.
Lesson map

The main moves, in order:

1. The Paper and the Approach
2. Constitutional AI
3. RLAIF
4. Critique and revision
Section 1
The Paper and the Approach
Constitutional AI: Harmlessness from AI Feedback (Bai et al., Anthropic, December 2022) proposed replacing most human preference labels with labels from a model guided by a written constitution. The constitution is a set of principles the model uses to critique and revise its own outputs, and another model uses to rank revised outputs. The goal was to scale feedback, make principles auditable, and reduce the labor cost of alignment.
The two-stage loop
Constitutional AI's two-stage pipeline. Human labels are concentrated in writing the constitution, not labeling every output.
Stage 1: Supervised learning (critique and revise)

```text
for each (prompt, harmful_response):
    critique = model(prompt, harmful_response, principle)
    revision = model(prompt, harmful_response, critique)
    SFT on (prompt, revision)
```
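The Stage 1 loop can be sketched in Python. This is a hedged illustration, not Anthropic's implementation: `generate` is a hypothetical stand-in for a real language-model call, and the prompt wording is invented for clarity.

```python
# Hypothetical sketch of Stage 1 (critique and revise).
# `generate` stands in for a real LLM call; it is not a real API.
def generate(text: str) -> str:
    # Placeholder: a real system would query a language model here.
    return f"[model output for: {text[:40]}]"

def critique_and_revise(prompt: str, response: str, principle: str):
    # Ask the model to critique its own response against one principle...
    critique = generate(
        f"Critique this response using the principle '{principle}':\n"
        f"Prompt: {prompt}\nResponse: {response}"
    )
    # ...then revise the response in light of that critique.
    revision = generate(
        f"Revise the response given this critique:\n{critique}\n"
        f"Original response: {response}"
    )
    # (prompt, revision) pairs become the SFT training data.
    return prompt, revision

pair = critique_and_revise(
    "How do I pick a lock?",
    "Sure! First, get a tension wrench...",
    "Choose the response that has the least harmful content",
)
```

The key design point: the model that produced the harmful response is the same model doing the critiquing and revising, steered only by the principle text.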
Stage 2: Reinforcement learning (RLAIF)

```text
for each prompt:
    response_A = model(prompt)
    response_B = model(prompt)
    judgment   = model(response_A, response_B, principle)
    use judgment as preference label

train reward model on these AI-labeled preferences
run PPO against the reward model
```

What is in a constitution
Anthropic's published constitution draws on multiple sources: the UN Universal Declaration of Human Rights, Apple's terms of service, DeepMind's Sparrow rules, and Anthropic's own research on harms. Principles include "Please choose the response that is most supportive and encouraging of life, liberty, and personal security," "Choose the response that has the least objectionable, offensive, unlawful, deceptive, inaccurate, or harmful content," and "Choose the response that sounds most like something a wise, ethical, polite, and friendly person would say."
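Mechanically, a constitution is just a list of principle strings, and the paper samples a principle for each critique-revision (or judgment) pass. A minimal sketch, using three of the published principles quoted above:

```python
import random

# A toy constitution: a list of principle strings (three of the
# published principles quoted in this lesson, for illustration).
CONSTITUTION = [
    "Please choose the response that is most supportive and encouraging "
    "of life, liberty, and personal security.",
    "Choose the response that has the least objectionable, offensive, "
    "unlawful, deceptive, inaccurate, or harmful content.",
    "Choose the response that sounds most like something a wise, ethical, "
    "polite, and friendly person would say.",
]

def sample_principle(rng=random):
    # One randomly drawn principle guides one critique-revision
    # or judgment pass; over many passes, coverage averages out.
    return rng.choice(CONSTITUTION)

principle = sample_principle()
```

This is part of why the approach scales: editing the list edits the training signal, with no relabeling required.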
Why principles, not examples
- Auditable: anyone can read and critique the constitution text
- Debatable: principles are contestable in a way that individual labeler calls are not
- Updatable: you can edit a principle and retrain, faster than recollecting labels
- Cross-cultural: multiple voices can contribute principles at different granularities
- Scalable: one sentence can guide labeling of a million completions
The Collective Constitutional AI experiment
In 2023, Anthropic partnered with the Collective Intelligence Project to source constitutional principles from about 1,000 Americans via a deliberative polling process. Some principles diverged from Anthropic's own. The publicly sourced constitution was then used to train a comparable model. The experiment was a proof of concept for democratic input into alignment targets.
Known limits
- The judge model has its own biases; RLAIF inherits them
- Constitutional language is ambiguous; different models interpret the same principle differently
- Principles can conflict (helpful vs. harmless); the resolution is implicit in training
- Constitution authorship is political: whose values, from where, selected by whom
- Works best for adjacent-to-human-judgment tasks; harder for superhuman domains
Compare the options
| Dimension | RLHF | Constitutional AI / RLAIF |
|---|---|---|
| Primary labelers | Human contractors | AI model guided by principles |
| Transparency | Unpublished labeler guidelines | Published constitution |
| Scale cost | Linear in labels needed | Near-constant after constitution written |
| Debate surface | Individual labels, not public | Written principles, public |
| Bias source | Labeler demographics | Principle selection + judge model |
| Updatability | Slow (recollect labels) | Fast (edit principle, retrain) |
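Both columns of the table feed the same downstream machinery: preference pairs train a reward model, typically with a pairwise (Bradley-Terry style) loss. A hedged sketch, where the scalar scores stand in for reward-model outputs:

```python
import math

# Pairwise (Bradley-Terry style) loss for training a reward model on
# preference labels, whether those labels come from human raters (RLHF)
# or an AI judge (RLAIF). Scores here stand in for reward-model outputs.
def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    # P(chosen preferred) = sigmoid(score_chosen - score_rejected);
    # the loss is the negative log of that probability.
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the reward model already prefers the chosen response, loss is small:
low = pairwise_loss(2.0, -1.0)
# If it prefers the rejected response instead, loss is large:
high = pairwise_loss(-1.0, 2.0)
```

The only thing that changes between RLHF and RLAIF is where the (chosen, rejected) labels come from, which is exactly why the bias source shifts from labeler demographics to principle selection plus the judge model.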
“We cannot scale alignment by putting a human in every decision. We can scale it by putting humans in the writing of the rules.”
The big idea: constitutional AI replaces the cost of per-example human feedback with the cost of writing and debating principles. That is a real bet about where alignment judgment should happen. Its strengths and limits are both consequences of that bet.