How AI Models Get Safety Training: RLHF in Plain Words
Why models refuse what they refuse, and how that shapes their behavior.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The premise
2. RLHF
3. Preference data
4. Alignment
Section 1
The premise
Models are pre-trained on vast amounts of text and then aligned with reinforcement learning from human feedback (RLHF): humans rank candidate outputs, a reward model learns those rankings, and the model is updated to produce the kinds of answers that rank higher. That preference signal shapes nearly everything about how the model behaves, including what it refuses.
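To make the ranking step concrete, here is a minimal sketch of the idea, not any lab's actual pipeline. The names `toy_features` and `reward_model` are made up for illustration: a tiny reward model is nudged to score the human-preferred response above the rejected one, which is the signal the rest of RLHF builds on.

```python
# Minimal sketch of learning from a human preference judgment.
# Everything here is a toy stand-in, not a real RLHF implementation.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def toy_features(text: str) -> torch.Tensor:
    # Hypothetical feature function: hash characters into a small vector.
    # In real RLHF this would be a language model's representation of the response.
    vec = torch.zeros(16)
    for i, ch in enumerate(text):
        vec[(ord(ch) + i) % 16] += 1.0
    return vec

# The reward model maps a response's features to a single scalar score.
reward_model = torch.nn.Linear(16, 1)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=0.05)

# One preference judgment: the rater ranked "chosen" above "rejected".
chosen = toy_features("Here is a careful, honest answer.")
rejected = toy_features("Sure, here is how to do the harmful thing.")

for step in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Pairwise preference loss: push the chosen score above the rejected score.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("chosen score:  ", reward_model(chosen).item())
print("rejected score:", reward_model(rejected).item())
```

In a full RLHF setup, this learned reward then drives a further training step (for example with PPO) that updates the language model itself toward higher-scoring outputs, which is the link to the DPO and PPO lessons listed below.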
What AI does well here
- Refusing the most obviously harmful requests
- Defaulting to helpful, harmless, honest behavior
- Producing outputs that match the preference profile of the team that ranked the training data
- Adjusting behavior over time as preference data accumulates
What AI cannot do
- Capture all human preferences faithfully — RLHF flattens diversity
- Avoid the sycophancy bias: answers that agree with the user tend to get ranked higher, so the model learns to agree
- Generalize safety perfectly to never-seen requests
Related lessons
Keep going
Creators · 40 min
DPO vs PPO: Why Direct Preference Optimization Won
The choice between DPO and PPO reshapes serving and quality tradeoffs. This lesson covers why it matters and how to evaluate adoption.
Creators · 11 min
RLHF vs DPO: aligning models without breaking them
Compare reinforcement learning from human feedback and direct preference optimization at the level of intuition, not equations.
Creators · 40 min
Constitutional AI: Self-Critique as a Training Signal
Constitutional AI reshapes serving and quality tradeoffs. This lesson covers why it matters and how to evaluate adoption.
