AI Alignment: The Actual Technical Problem
Alignment is not a vibes debate. It is a concrete technical problem about getting systems to pursue goals we actually want. Here is what researchers work on when they say they work on alignment.
Lesson map

The main moves, in order:

1. The Core Problem in One Sentence
2. Alignment
3. RLHF
4. Constitutional AI
The Core Problem in One Sentence
Alignment is the problem of making AI systems pursue the goals their designers actually intended, not just goals that look the same on a benchmark but diverge in the wild. It sounds simple. It is not.
Why you cannot just write down the goal
Humans do not agree on most goals in precise terms, and even a clear-sounding goal like "be helpful" has endless failure modes. A model that is always helpful will help you do harmful things; a model that refuses aggressively becomes useless. The target is a moving, multidimensional judgment call, and the training signal has to approximate it.
How alignment is done in practice
1. Pretraining: next-token prediction teaches the base capabilities, not behavior.
2. Supervised fine-tuning (SFT): humans write ideal responses and the model learns that distribution.
3. RLHF (Reinforcement Learning from Human Feedback): humans rank outputs and the model is optimized toward the preferred ones (see the reward-model sketch after this list).
4. RLAIF (RL from AI Feedback): another model, guided by a constitution, does the ranking.
5. Red-teaming: humans try to break the model, and the failure modes feed back into training.
6. Evaluation: behavioral tests across thousands of scenarios before deployment.
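To make step 3 concrete, here is a minimal sketch, assuming PyTorch, of the pairwise preference loss commonly used to train an RLHF reward model. The function and variable names are illustrative, not from any particular codebase:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss: train the reward model to score the
    human-preferred response above the rejected one.

    Both tensors hold one scalar reward per comparison, shape (batch,).
    """
    # -log sigmoid(r_chosen - r_rejected) is minimized when the chosen
    # response's reward exceeds the rejected one's by a wide margin.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: rewards a reward-model head might emit for three ranked pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.5, 0.9, -1.0])
print(preference_loss(chosen, rejected))  # backprop this into the reward model
```

The trained reward model then scores candidate outputs during the RL stage (typically PPO), which is where the policy actually "optimizes for preferred outputs."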
Constitutional AI
Anthropic's constitutional AI approach (Bai et al., 2022) writes down a set of principles (drawn from sources like the UN Declaration of Human Rights, platform terms of service, and original safety research) and uses them to generate training feedback without a human in every loop. The model critiques its own outputs against the constitution and revises them. This scales feedback and makes the principles auditable.
The CAI / RLAIF loop replaces most human preference labeling with model-based critique against a written constitution.
Simplified CAI loop:

1. The model generates a response to a prompt.
2. The model critiques its own response against a constitutional principle (e.g., "does this response risk harm?").
3. The model revises the response to address the critique.
4. Train on the resulting (prompt, revised response) pairs.
5. Optionally, use another model as a preference judge (this is RLAIF; see the sketch below).
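Here is a minimal Python sketch of that loop, assuming a generic `generate(prompt)` chat-model call. The helper and the prompt templates are hypothetical stand-ins, and the principle is paraphrased rather than quoted from any actual constitution:

```python
# Minimal sketch of the critique-and-revise loop. `generate` stands in for
# any chat-model call; the templates are illustrative, not the actual
# constitution from Bai et al. (2022).

CRITIQUE_TEMPLATE = (
    "Principle: does this response risk harm?\n"
    "Critique the response below against that principle.\n\n"
    "Response:\n{response}"
)
REVISION_TEMPLATE = (
    "Rewrite the response so it addresses the critique.\n\n"
    "Response:\n{response}\n\nCritique:\n{critique}"
)

def cai_iteration(generate, prompt: str) -> tuple[str, str]:
    """One pass of the loop; returns a (prompt, revised response) training pair."""
    response = generate(prompt)                                        # step 1
    critique = generate(CRITIQUE_TEMPLATE.format(response=response))   # step 2
    revised = generate(REVISION_TEMPLATE.format(response=response,
                                                critique=critique))    # step 3
    return prompt, revised                                             # step 4 trains on these
```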
Open problems that keep researchers up at night
- Sycophancy: models learn to flatter raters, not be honest
- Deceptive alignment: a model that behaves well during training and differently in deployment
- Reward hacking: exploiting the reward function rather than the intent (a toy example follows this list)
- Scalable oversight: how do humans supervise a model smarter than them?
- Value specification: whose values? measured how? frozen when?
- Capability generalization outpacing alignment generalization
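Reward hacking is easiest to see in a toy setting. The sketch below uses a made-up proxy reward (counting polite phrases) that diverges from the real goal (answering the question); nothing here comes from a real system:

```python
POLITE_PHRASES = ["happy to help", "great question", "thanks for asking"]

def proxy_reward(response: str) -> float:
    """Stand-in reward model: counts polite phrases, ignores correctness."""
    return float(sum(response.lower().count(p) for p in POLITE_PHRASES))

candidates = [
    "The capital of France is Paris.",
    "Great question! Happy to help. Thanks for asking. Great question!",
]

# Selecting for the proxy picks the sycophantic non-answer: the reward
# function was satisfied, the intent was not.
print(max(candidates, key=proxy_reward))
```

The same dynamic, under real optimization pressure at scale, is what makes sycophancy and specification gaming structural rather than accidental.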
Compare: alignment approaches
| Approach | Feedback source | Strength | Weakness |
|---|---|---|---|
| RLHF | Paid human raters | Grounded in human preference | Expensive, labeler bias |
| Constitutional AI | Written principles + model | Scalable, auditable | Constitution selection is political |
| Debate | Two AIs arguing to a human | Leverages model capability for oversight | Mostly research-stage |
| Amplification | Recursive human-AI teams | Scales oversight | Mostly research-stage |
Where the research actually lives
- Anthropic: constitutional AI, interpretability, the Responsible Scaling Policy (RSP)
- OpenAI: pioneered RLHF; its safety teams were reorganized in 2024-2025
- DeepMind: scalable oversight, evaluations
- METR: model evaluations for autonomy and capabilities
- Apollo Research: scheming and deceptive alignment evals
- Redwood Research: interpretability, AI control
- UK AISI and US AISI: government-run evaluations
- Alignment Research Center (ARC): theoretical alignment research; its ARC Evals team, which ran the early autonomous replication evals, spun out as METR (above)
“We are trying to build something that optimizes a goal, while the thing that we actually want is very hard to specify. That gap is where all the danger lives.”
The big idea: alignment is a technical research program with real open problems and concrete partial solutions. The question is not whether we know how to align AI. It is whether alignment keeps pace with capability. That race is the central drama of frontier AI right now.