Alignment: The Full Technical Picture
What alignment actually is as a research program, how it is done in practice, what the open problems are, and where the actual papers live.
Lesson map
What this lesson covers

Learning path: the main moves in order
1. The Field in One Paragraph
2. Alignment
3. RLHF
4. RLAIF
The Field in One Paragraph
AI alignment is the research program of making AI systems pursue the goals their designers actually intended, not just proxies that look the same on benchmarks but diverge in deployment. The central problem is that humans cannot precisely specify what we want, and optimization amplifies every small mismatch. Alignment tries to narrow the gap between intent and behavior as capability grows.
Why you cannot just write down the goal
Humans do not agree on most values in precise terms. Even clear-sounding targets like "be helpful" have endless failure modes. A model that is always helpful will help you do harmful things. A model that refuses aggressively is useless. The target is a moving, multidimensional judgment, and training has to approximate it with a concrete signal.
The pipeline in 2026
1. Pretraining: next-token prediction builds capabilities, not behavior.
2. Supervised fine-tuning (SFT): humans write ideal responses; the model learns the distribution.
3. Preference learning: humans (or AI) rank outputs. RLHF uses humans; RLAIF uses a model guided by a constitution.
4. Direct preference optimization (DPO) or similar: convert rankings into a loss without a separate reward model (see the sketch after this list).
5. Red-teaming: find failure modes, feed them back into training.
6. Evaluation: behavioral tests across thousands of scenarios before deployment.
7. Deployment monitoring: measure real-world behavior, update.
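To make step 4 concrete, here is a minimal sketch of the DPO objective from Rafailov et al. (2023), assuming the summed per-sequence log-probabilities have already been computed for the policy being trained and for a frozen reference model. The tensor names are illustrative, not from any particular library.

```python
# Minimal DPO loss sketch. Inputs are 1-D tensors of summed log-probs,
# one entry per (prompt, response) pair; "chosen" was preferred over
# "rejected" by the rater. The reference log-probs are precomputed
# constants, so gradients flow only through the policy.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Turn preference pairs into a loss with no separate reward model."""
    # Implicit reward for each response: beta * log(pi_theta / pi_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss pushes the chosen response's implicit reward above
    # the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The beta term controls how strongly the policy is anchored to the reference model: small values allow more drift from the SFT distribution, large values keep the policy conservative.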
Constitutional AI in detail
Anthropic's Constitutional AI approach (Bai et al., 2022) writes down principles drawn from sources like the UN Universal Declaration of Human Rights, platform terms of service, and Anthropic's own research. The model critiques its own outputs against the constitution and revises them. A second model then preference-ranks the revised outputs. This scales feedback and makes the principles auditable.
The CAI / RLAIF loop replaces most human labeling with model-based critique against written principles.
Simplified CAI loop:
1. Model generates a response to the prompt
2. Model critiques its own response against a constitutional principle ("does this risk harm? is it honest?")
3. Model revises the response to address the critique
4. Train on (prompt, revised response) pairs
5. Optional: a second model preference-ranks revisions, feeding back into the RL loop (this is RLAIF)
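In code-shaped terms, the critique-and-revise phase might look like the sketch below. The generate function is a hypothetical wrapper around whatever model is being trained, and the principle text is paraphrased for illustration, not quoted from Anthropic's actual constitution.

```python
# A minimal sketch of CAI's critique-and-revise phase, assuming a
# hypothetical generate(prompt) -> str model wrapper.
def cai_revision(prompt, generate,
                 principle="Does this response risk harm? Is it honest?"):
    response = generate(prompt)  # step 1: initial answer
    critique = generate(         # step 2: self-critique against a principle
        f"Critique the following response against the principle "
        f"'{principle}'.\n\nResponse: {response}"
    )
    revised = generate(          # step 3: revision addressing the critique
        f"Rewrite the response to address the critique.\n\n"
        f"Response: {response}\nCritique: {critique}"
    )
    return prompt, revised       # step 4: an SFT training pair
```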
Scalable oversight: the real puzzle
What happens when the model is smarter than the humans supervising it? Paul Christiano's iterated distillation and amplification (IDA) and OpenAI's debate proposals try to decompose hard questions into simpler sub-questions humans can judge. Anthropic has worked on market-making and debate. These are all active research directions with no settled answer.
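A toy version of the debate protocol makes the idea concrete: two models argue, and the judge only has to evaluate the transcript rather than answer the original hard question directly. The debater and judge arguments below are hypothetical callables mapping a transcript string to a reply; the real proposal (Irving et al., 2018) adds many refinements.

```python
# Toy round-based debate: the judge sees only the argument transcript.
def debate(question, debater_a, debater_b, judge, rounds=3):
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        transcript.append("A: " + debater_a("\n".join(transcript)))
        transcript.append("B: " + debater_b("\n".join(transcript)))
    # Judging a transcript of competing arguments is meant to be easier
    # than judging the original question unaided.
    return judge("\n".join(transcript))
```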
Compare the options
| Approach | Feedback source | Strength | Weakness |
|---|---|---|---|
| RLHF | Paid human raters | Grounded in human preference | Expensive, labeler bias |
| Constitutional AI | Written principles + model | Scalable, auditable | Constitution selection is political |
| Debate | Two AIs arguing to a human | Leverages capability for oversight | Mostly research-stage |
| Iterated amplification | Recursive human-AI teams | Scales oversight | Mostly research-stage |
| Weak-to-strong | Weaker model supervises stronger | Empirical testbed for future | Uncertain generalization |
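The weak-to-strong row is easy to simulate at toy scale. The sketch below, loosely in the spirit of Burns et al. (2023), trains a weak classifier on a little data, lets it label a larger set, trains a stronger model on those noisy labels, and asks whether the strong model recovers performance its supervisor never had. It is purely illustrative; the real experiments use language models.

```python
# Toy weak-to-strong generalization: does a strong model trained on a
# weak model's labels beat the weak model on held-out ground truth?
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(
    X, y, train_size=200, random_state=0)          # small labeled set
X_train, X_test, _, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

weak = LogisticRegression().fit(X_weak, y_weak)    # the weak supervisor
weak_labels = weak.predict(X_train)                # its imperfect labels
strong = GradientBoostingClassifier().fit(X_train, weak_labels)

print("weak accuracy:     ", weak.score(X_test, y_test))
print("strong-from-weak:  ", strong.score(X_test, y_test))
```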
Open problems that keep researchers up
- Sycophancy: models learn to flatter raters, not to be honest
- Deceptive alignment: a model behaves well during training and differently in deployment
- Reward hacking: exploiting the reward function rather than the intent (see the sketch after this list)
- Goal misgeneralization: correct reward, wrong internalized goal
- Capability generalization outpacing alignment generalization
- Value specification: whose values, measured how, frozen when
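Reward hacking is easy to demonstrate in miniature. In the sketch below, a random-search optimizer maximizes a proxy reward ("longer answers look more helpful") while an intended quality metric, which the optimizer never sees, stays flat. Every function here is a toy stand-in, not a real training signal.

```python
# Toy reward hacking: the optimizer climbs the proxy, not the intent.
import random

def proxy_reward(answer):
    return len(answer.split())          # what the optimizer actually sees

def intended_quality(answer):
    return len(set(answer.split()))     # toy stand-in for real quality

answer = "the capital of france is paris"
for _ in range(50):
    # Pad the answer with a repeated word: proxy goes up, quality doesn't.
    candidate = answer + " " + random.choice(answer.split())
    if proxy_reward(candidate) > proxy_reward(answer):
        answer = candidate              # the optimizer happily accepts

print(proxy_reward(answer))             # proxy score ballooned
print(intended_quality(answer))         # intended quality never moved
```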
“We are trying to build something that optimizes a goal, while the thing that we actually want is very hard to specify. That gap is where all the danger lives.”
The big idea: alignment is a technical research program with concrete methods, partial solutions, and specific open problems. The question is not whether we know how to align AI. It is whether alignment keeps pace with capability. That race is the central drama of frontier AI right now.
