AI Alignment: The Actual Technical Problem
Alignment is not a vibes debate. It is a concrete technical problem about getting systems to pursue goals we actually want. Here is what researchers work on when they say they work on alignment.
Lesson map

The main moves, in order:

1. The Core Problem in One Sentence
2. Alignment
3. RLHF
4. Constitutional AI
The Core Problem in One Sentence
Alignment is the problem of making AI systems pursue the goals their designers actually intended, not just goals that look the same on a benchmark but diverge in the wild. It sounds simple. It is not.
Why you cannot just write down the goal
Humans do not agree on most goals in precise terms, and even a clear-sounding goal like "be helpful" has endless failure modes. A model that is always helpful will help you do harmful things; a model that refuses aggressively becomes useless. The target is a moving, multidimensional judgment call, and the training signal has to approximate it.
How alignment is done in practice
1. Pretraining: next-token prediction teaches the base capabilities, not behavior.
2. Supervised fine-tuning (SFT): humans write ideal responses and the model learns that distribution.
3. RLHF (Reinforcement Learning from Human Feedback): humans rank outputs and the model is optimized toward the preferred ones (see the reward-model sketch after this list).
4. RLAIF (RL from AI Feedback): another model, guided by a constitution, does the ranking.
5. Red-teaming: humans try to break the model, and the failure modes feed back into training.
6. Evaluation: behavioral tests across thousands of scenarios before deployment.
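To make step 3 concrete, here is a minimal sketch, assuming PyTorch, of the pairwise preference loss commonly used to train an RLHF reward model. The function and variable names are illustrative, not from any particular codebase:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss: train the reward model to score the
    human-preferred response above the rejected one.

    Both tensors hold one scalar reward per comparison, shape (batch,).
    """
    # -log sigmoid(r_chosen - r_rejected) is minimized when the chosen
    # response's reward exceeds the rejected one's by a wide margin.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: rewards a reward-model head might emit for three ranked pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.5, 0.9, -1.0])
print(preference_loss(chosen, rejected))  # backprop this into the reward model
```

The trained reward model then scores candidate outputs during the RL stage (typically PPO), which is where the policy actually "optimizes for preferred outputs."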
Constitutional AI
Anthropic's constitutional AI approach (Bai et al., 2022) writes down a set of principles (drawn from sources like the UN Declaration of Human Rights, platform terms of service, and original safety research) and uses them to generate training feedback without a human in every loop. The model critiques its own outputs against the constitution and revises them. This scales feedback and makes the principles auditable.
The CAI / RLAIF loop replaces most human preference labeling with model-based critique against a written constitution.
Simplified CAI loop:

1. The model generates a response to a prompt.
2. The model critiques its own response against a constitutional principle (e.g., "does this response risk harm?").
3. The model revises the response to address the critique.
4. Train on the resulting (prompt, revised response) pairs.
5. Optionally, use another model as a preference judge (this is RLAIF; see the sketch below).
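Here is a minimal Python sketch of that loop, assuming a generic `generate(prompt)` chat-model call. The helper and the prompt templates are hypothetical stand-ins, and the principle is paraphrased rather than quoted from any actual constitution:

```python
# Minimal sketch of the critique-and-revise loop. `generate` stands in for
# any chat-model call; the templates are illustrative, not the actual
# constitution from Bai et al. (2022).

CRITIQUE_TEMPLATE = (
    "Principle: does this response risk harm?\n"
    "Critique the response below against that principle.\n\n"
    "Response:\n{response}"
)
REVISION_TEMPLATE = (
    "Rewrite the response so it addresses the critique.\n\n"
    "Response:\n{response}\n\nCritique:\n{critique}"
)

def cai_iteration(generate, prompt: str) -> tuple[str, str]:
    """One pass of the loop; returns a (prompt, revised response) training pair."""
    response = generate(prompt)                                        # step 1
    critique = generate(CRITIQUE_TEMPLATE.format(response=response))   # step 2
    revised = generate(REVISION_TEMPLATE.format(response=response,
                                                critique=critique))    # step 3
    return prompt, revised                                             # step 4 trains on these
```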
Open problems that keep researchers up at night
- Sycophancy: models learn to flatter raters, not be honest
- Deceptive alignment: a model that behaves well during training and differently in deployment
- Reward hacking: exploiting the reward function rather than the intent (a toy example follows this list)
- Scalable oversight: how do humans supervise a model smarter than them?
- Value specification: whose values? measured how? frozen when?
- Capability generalization outpacing alignment generalization
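Reward hacking is easiest to see in a toy setting. The sketch below uses a made-up proxy reward (counting polite phrases) that diverges from the real goal (answering the question); nothing here comes from a real system:

```python
POLITE_PHRASES = ["happy to help", "great question", "thanks for asking"]

def proxy_reward(response: str) -> float:
    """Stand-in reward model: counts polite phrases, ignores correctness."""
    return float(sum(response.lower().count(p) for p in POLITE_PHRASES))

candidates = [
    "The capital of France is Paris.",
    "Great question! Happy to help. Thanks for asking. Great question!",
]

# Selecting for the proxy picks the sycophantic non-answer: the reward
# function was satisfied, the intent was not.
print(max(candidates, key=proxy_reward))
```

The same dynamic, under real optimization pressure at scale, is what makes sycophancy and specification gaming structural rather than accidental.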
Compare: alignment approaches
| Approach | Feedback source | Strength | Weakness |
|---|---|---|---|
| RLHF | Paid human raters | Grounded in human preference | Expensive, labeler bias |
| Constitutional AI | Written principles + model | Scalable, auditable | Constitution selection is political |
| Debate | Two AIs arguing to a human | Leverages model capability for oversight | Mostly research-stage |
| Amplification | Recursive human-AI teams | Scales oversight | Mostly research-stage |
Where the research actually lives
- Anthropic: constitutional AI, interpretability, the Responsible Scaling Policy (RSP)
- OpenAI: pioneered RLHF; its safety teams were reorganized in 2024-2025
- DeepMind: scalable oversight, evaluations
- METR: model evaluations for autonomy and capabilities
- Apollo Research: scheming and deceptive alignment evals
- Redwood Research: interpretability, AI control
- UK AISI and US AISI: government-run evaluations
- Alignment Research Center (ARC): theoretical alignment research; its ARC Evals team, which ran the early autonomous replication evals, spun out as METR (above)
“We are trying to build something that optimizes a goal, while the thing that we actually want is very hard to specify. That gap is where all the danger lives.”
The big idea: alignment is a technical research program with real open problems and concrete partial solutions. The question is not whether we know how to align AI. It is whether alignment keeps pace with capability. That race is the central drama of frontier AI right now.