Alignment: The Full Technical Picture
What alignment actually is as a research program, how it is done in practice, what the open problems are, and where the actual papers live.
Lesson map
What this lesson covers

Learning path: the main moves in order
1. The Field in One Paragraph
2. Alignment
3. RLHF
4. RLAIF
The Field in One Paragraph
AI alignment is the research program of making AI systems pursue the goals their designers actually intended, not just proxies that look the same on benchmarks but diverge in deployment. The central problem is that humans cannot precisely specify what we want, and optimization amplifies every small mismatch. Alignment tries to narrow the gap between intent and behavior as capability grows.
Why you cannot just write down the goal
Humans do not agree on most values in precise terms. Even clear-sounding targets like "be helpful" have endless failure modes. A model that is always helpful will help you do harmful things. A model that refuses aggressively is useless. The target is a moving, multidimensional judgment, and training has to approximate it with a concrete signal.
The pipeline in 2026
1. Pretraining: next-token prediction builds capabilities, not behavior.
2. Supervised fine-tuning (SFT): humans write ideal responses; the model learns the distribution.
3. Preference learning: humans (or AI) rank outputs. RLHF uses humans; RLAIF uses a model guided by a constitution.
4. Direct preference optimization (DPO) or similar: convert rankings into a loss without a separate reward model (see the sketch after this list).
5. Red-teaming: find failure modes, feed them back into training.
6. Evaluation: behavioral tests across thousands of scenarios before deployment.
7. Deployment monitoring: measure real-world behavior, update.
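To make step 4 concrete, here is a minimal sketch of the DPO objective from Rafailov et al. (2023), assuming the summed per-sequence log-probabilities have already been computed for the policy being trained and for a frozen reference model. The tensor names are illustrative, not from any particular library.

```python
# Minimal DPO loss sketch. Inputs are 1-D tensors of summed log-probs,
# one entry per (prompt, response) pair; "chosen" was preferred over
# "rejected" by the rater. The reference log-probs are precomputed
# constants, so gradients flow only through the policy.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Turn preference pairs into a loss with no separate reward model."""
    # Implicit reward for each response: beta * log(pi_theta / pi_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss pushes the chosen response's implicit reward above
    # the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The beta term controls how strongly the policy is anchored to the reference model: small values allow more drift from the SFT distribution, large values keep the policy conservative.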
Constitutional AI in detail
Anthropic's Constitutional AI approach (Bai et al., 2022) writes down principles drawn from sources like the UN Universal Declaration of Human Rights, platform terms of service, and Anthropic's own research. The model critiques its own outputs against the constitution and revises them. A second model then preference-ranks the revised outputs. This scales feedback and makes the principles auditable.
The CAI / RLAIF loop replaces most human labeling with model-based critique against written principles.
Simplified CAI loop:
1. Model generates a response to the prompt
2. Model critiques its own response against a constitutional principle ("does this risk harm? is it honest?")
3. Model revises the response to address the critique
4. Train on (prompt, revised response) pairs
5. Optional: a second model preference-ranks revisions, feeding back into the RL loop (this is RLAIF)
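In code-shaped terms, the critique-and-revise phase might look like the sketch below. The generate function is a hypothetical wrapper around whatever model is being trained, and the principle text is paraphrased for illustration, not quoted from Anthropic's actual constitution.

```python
# A minimal sketch of CAI's critique-and-revise phase, assuming a
# hypothetical generate(prompt) -> str model wrapper.
def cai_revision(prompt, generate,
                 principle="Does this response risk harm? Is it honest?"):
    response = generate(prompt)  # step 1: initial answer
    critique = generate(         # step 2: self-critique against a principle
        f"Critique the following response against the principle "
        f"'{principle}'.\n\nResponse: {response}"
    )
    revised = generate(          # step 3: revision addressing the critique
        f"Rewrite the response to address the critique.\n\n"
        f"Response: {response}\nCritique: {critique}"
    )
    return prompt, revised       # step 4: an SFT training pair
```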
Scalable oversight: the real puzzle
What happens when the model is smarter than the humans supervising it? Paul Christiano's iterated distillation and amplification (IDA) and OpenAI's debate proposals try to decompose hard questions into simpler sub-questions humans can judge. Anthropic has worked on market-making and debate. These are all active research directions with no settled answer.
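A toy version of the debate protocol makes the idea concrete: two models argue, and the judge only has to evaluate the transcript rather than answer the original hard question directly. The debater and judge arguments below are hypothetical callables mapping a transcript string to a reply; the real proposal (Irving et al., 2018) adds many refinements.

```python
# Toy round-based debate: the judge sees only the argument transcript.
def debate(question, debater_a, debater_b, judge, rounds=3):
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        transcript.append("A: " + debater_a("\n".join(transcript)))
        transcript.append("B: " + debater_b("\n".join(transcript)))
    # Judging a transcript of competing arguments is meant to be easier
    # than judging the original question unaided.
    return judge("\n".join(transcript))
```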
Compare the options
| Approach | Feedback source | Strength | Weakness |
|---|---|---|---|
| RLHF | Paid human raters | Grounded in human preference | Expensive, labeler bias |
| Constitutional AI | Written principles + model | Scalable, auditable | Constitution selection is political |
| Debate | Two AIs arguing to a human | Leverages capability for oversight | Mostly research-stage |
| Iterated amplification | Recursive human-AI teams | Scales oversight | Mostly research-stage |
| Weak-to-strong | Weaker model supervises stronger | Empirical testbed for future | Uncertain generalization |
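The weak-to-strong row is easy to simulate at toy scale. The sketch below, loosely in the spirit of Burns et al. (2023), trains a weak classifier on a little data, lets it label a larger set, trains a stronger model on those noisy labels, and asks whether the strong model recovers performance its supervisor never had. It is purely illustrative; the real experiments use language models.

```python
# Toy weak-to-strong generalization: does a strong model trained on a
# weak model's labels beat the weak model on held-out ground truth?
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(
    X, y, train_size=200, random_state=0)          # small labeled set
X_train, X_test, _, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

weak = LogisticRegression().fit(X_weak, y_weak)    # the weak supervisor
weak_labels = weak.predict(X_train)                # its imperfect labels
strong = GradientBoostingClassifier().fit(X_train, weak_labels)

print("weak accuracy:     ", weak.score(X_test, y_test))
print("strong-from-weak:  ", strong.score(X_test, y_test))
```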
Open problems that keep researchers up
- Sycophancy: models learn to flatter raters, not to be honest
- Deceptive alignment: a model behaves well during training and differently in deployment
- Reward hacking: exploiting the reward function rather than the intent (see the sketch after this list)
- Goal misgeneralization: correct reward, wrong internalized goal
- Capability generalization outpacing alignment generalization
- Value specification: whose values, measured how, frozen when
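Reward hacking is easy to demonstrate in miniature. In the sketch below, a random-search optimizer maximizes a proxy reward ("longer answers look more helpful") while an intended quality metric, which the optimizer never sees, stays flat. Every function here is a toy stand-in, not a real training signal.

```python
# Toy reward hacking: the optimizer climbs the proxy, not the intent.
import random

def proxy_reward(answer):
    return len(answer.split())          # what the optimizer actually sees

def intended_quality(answer):
    return len(set(answer.split()))     # toy stand-in for real quality

answer = "the capital of france is paris"
for _ in range(50):
    # Pad the answer with a repeated word: proxy goes up, quality doesn't.
    candidate = answer + " " + random.choice(answer.split())
    if proxy_reward(candidate) > proxy_reward(answer):
        answer = candidate              # the optimizer happily accepts

print(proxy_reward(answer))             # proxy score ballooned
print(intended_quality(answer))         # intended quality never moved
```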
“We are trying to build something that optimizes a goal, while the thing that we actually want is very hard to specify. That gap is where all the danger lives.”
The big idea: alignment is a technical research program with concrete methods, partial solutions, and specific open problems. The question is not whether we know how to align AI. It is whether alignment keeps pace with capability. That race is the central drama of frontier AI right now.
