RLHF to RLAIF: How Preference Learning Scaled
RLHF made ChatGPT possible. RLAIF is trying to take humans out of the loop. Here is the history, the trade-offs, and where the field is going.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The Technique That Made ChatGPT Usable
2. RLHF
3. RLAIF
4. DPO
Section 1
The Technique That Made ChatGPT Usable
A GPT-3 base model in 2020 was impressive but nearly unusable in a consumer chat interface. You had to prompt it carefully. It would helpfully suggest crimes. The jump from that to ChatGPT in late 2022 was not bigger models. It was reinforcement learning from human feedback.
The RLHF recipe
1. Start with a pretrained base model.
2. Supervised fine-tune on high-quality example completions (SFT).
3. Collect preference data: show humans two model outputs, ask which is better.
4. Train a reward model to predict human preferences from (prompt, completion) pairs.
5. Use PPO (or another RL algorithm) to fine-tune the policy against the reward model.
6. Add a KL penalty to keep the policy close to the SFT base (prevents reward hacking); see the sketch after this list.
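To make steps 4 through 6 concrete, here is a minimal sketch of the two losses involved, assuming PyTorch tensors of per-completion scores and log-probabilities. The function names and the default `beta` are illustrative, not any lab's production code.

```python
import torch
import torch.nn.functional as F

# Step 4: the reward model is commonly trained with a Bradley-Terry pairwise
# loss on scalar scores for the preferred and dispreferred completions.
def reward_model_loss(score_chosen: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    # Push the preferred completion's score above the rejected one's.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Steps 5-6: the signal PPO optimizes is the reward model's score minus a
# KL penalty that keeps the policy close to the SFT reference model.
def rlhf_objective(rm_score: torch.Tensor,
                   policy_logprob: torch.Tensor,
                   ref_logprob: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    kl_estimate = policy_logprob - ref_logprob  # per-sample KL estimate
    return rm_score - beta * kl_estimate
```

The KL term is what stops the policy from drifting into text the reward model scores highly but no human would prefer.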
RLAIF: replace the human
RL from AI Feedback, introduced by Anthropic in Constitutional AI (Bai et al., 2022), replaces most human preference labels with labels from another (or the same) model guided by a written constitution. The model critiques outputs against principles, revises them, and does the preference ranking. You still use humans, but orders of magnitude fewer.
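As an illustration, here is a hedged sketch of the AI-feedback labeling step, assuming a generic text-generation client. `client.generate`, `judge_model`, and the single principle shown are placeholders, not Anthropic's actual constitution or API.

```python
# Illustrative RLAIF labeling step: a judge model picks the preferred output
# against a constitutional principle. The client interface is hypothetical.
PRINCIPLE = "Choose the response that is more helpful, honest, and harmless."

def ai_preference_label(client, judge_model: str, prompt: str,
                        response_a: str, response_b: str) -> str:
    judge_prompt = (
        f"{PRINCIPLE}\n\n"
        f"User prompt:\n{prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response better follows the principle? Answer with 'A' or 'B'."
    )
    verdict = client.generate(model=judge_model, prompt=judge_prompt)
    return "A" if verdict.strip().upper().startswith("A") else "B"
```

These AI-generated labels then feed the same reward-model or DPO training step that human labels would.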
DPO: skip the reward model
Direct Preference Optimization (Rafailov et al., 2023) showed that you can turn preference pairs directly into a loss function without training a separate reward model and running PPO. DPO is simpler, cheaper, and often works as well or better. Many 2024-2026 open-weight alignment pipelines use DPO or variants like IPO, KTO, SimPO.
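A minimal sketch of the DPO loss itself, assuming you have already summed token log-probabilities for the chosen and rejected completions under the policy and the frozen reference model; variable names and the default `beta` are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: how far the policy has moved from the reference
    # on each completion.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Preference pairs become a direct classification-style loss:
    # no separate reward model, no PPO rollouts.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```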
Compare the options
| Method | Feedback source | Loss setup | Strength | Weakness |
|---|---|---|---|---|
| RLHF (PPO) | Human | Reward model + PPO RL | Proven at scale | Complex, expensive, tuning-heavy |
| RLAIF | AI + constitution | Reward model + RL or DPO | Scales feedback cheaply | Quality depends on judge model and constitution |
| DPO | Human or AI | Direct preference loss | Simple, stable, no RM needed | Less mature at frontier scale |
| KTO | Single good/bad label | Kahneman-Tversky prospect loss | Works with one-sided data | Newer, less studied |
Known pathologies
- Sycophancy: Anthropic found that preference-trained models learn to agree with users in ways humans rate positively but that damage honesty (Sharma et al., 2023)
- Mode collapse: models produce less diverse outputs after heavy RLHF
- Over-refusal: safety training from refusal preferences generalizes to refusing benign requests
- Reward model exploitation: the policy finds inputs where the reward model gives high score but a human would not
- Distributional narrowing: RLHF shrinks the model's generative range, sometimes too much
Where the field is moving
- Process supervision (reward reasoning, not just answers) for math and code
- Rubric-based AI judges trained to apply fine-grained criteria
- Constitutional AI variants baked into every major lab's pipeline
- Online preference learning: continuously update from live user feedback
- Scalable oversight research for when the judge is weaker than the policy
“RLHF turns a pretrained model from a savant into a colleague. It also teaches the savant to please you. Separating those two is the next decade of work.”
The big idea: every modern chat model has gone through preference learning, and every one of its quirks (hedging, flattery, format bloat) traces back to how those preferences were collected. Follow the feedback and you understand the model.