How AI Models Get Safety Training: RLHF in Plain Words
Why models refuse what they refuse, and how that shapes their behavior.
Creators · AI Foundations · ~7 min read
The premise
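Models are first pre-trained on vast amounts of text, then aligned with reinforcement learning from human feedback (RLHF): human raters rank candidate outputs, a reward model is trained on those rankings, and the model is tuned to produce outputs that score higher. This step strongly shapes how models behave, including what they refuse.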
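The ranking step can be sketched as a toy reward model. This is a minimal illustration, not how any particular lab implements RLHF. It assumes each output is reduced to two made-up features (helpful, harmful), fits a linear reward with the pairwise (Bradley-Terry style) loss commonly used for preference data, and uses best-of-n selection as a stand-in for the reinforcement learning update. All names and numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(weights, features):
    """Linear reward: higher means raters would likely prefer this output."""
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit weights so that chosen outputs score above rejected ones."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient step on the pairwise loss -log(sigmoid(margin)).
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights

# Hypothetical features per output: [helpful, harmful]
HELPFUL, HARMFUL, BLAND = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]

# Each pair is (output the rater ranked higher, output ranked lower).
preference_pairs = [
    (HELPFUL, HARMFUL),
    (HELPFUL, BLAND),
    (BLAND, HARMFUL),
]

weights = train_reward_model(preference_pairs, n_features=2)

# Stand-in for the RL step: pick the candidate the reward model scores highest.
candidates = {"helpful answer": HELPFUL, "harmful answer": HARMFUL, "refusal": BLAND}
best = max(candidates, key=lambda name: reward(weights, candidates[name]))
print(best)  # helpful answer
```

Real systems replace the hand-made features with a neural network and update the model's own parameters rather than filtering candidates, but the signal is the same: rankings become a score, and the score steers behavior.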
What AI does well here
- Refusing the most obviously harmful requests
- Defaulting to helpful, harmless, honest behavior
- Producing outputs that match a particular team's preference profile
- Adjusting behavior over time as preference data accumulates
What AI cannot do
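- Capture all human preferences faithfully: pooling many raters' rankings into one reward signal flattens the diversity of their views
- Avoid sycophancy bias: answers that agree with the user tend to get ranked higher, so the model learns to agree
- Generalize safety perfectly to requests unlike anything in the preference data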
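The first two limits above can be made concrete with a little arithmetic. The sketch below assumes a two-option Bradley-Terry preference model, where the fitted reward gap between two response styles is the log of their win ratio. The rater counts are invented for illustration and do not come from any real dataset.

```python
import math

def fitted_reward_gap(wins_a, wins_b):
    """Maximum-likelihood reward gap (a minus b) under a two-option
    Bradley-Terry model: the log of the win ratio."""
    return math.log(wins_a / wins_b)

# Sycophancy: hypothetical counts where raters picked the agreeing answer
# over the answer that pushed back, 70 times out of 100.
sycophancy_gap = fitted_reward_gap(70, 30)
print(round(sycophancy_gap, 3))  # 0.847

# Flattened diversity: two hypothetical rater groups with opposite tastes
# (formal vs. casual tone), pooled into one dataset.
group_a = (90, 10)  # (formal wins, casual wins)
group_b = (20, 80)
pooled = (group_a[0] + group_b[0], group_a[1] + group_b[1])

gap_a = fitted_reward_gap(*group_a)      # strongly prefers formal
gap_b = fitted_reward_gap(*group_b)      # strongly prefers casual
gap_pooled = fitted_reward_gap(*pooled)  # mild preference for formal
print(round(gap_pooled, 3))  # 0.201
```

A positive sycophancy gap means a model optimized against this reward is nudged toward agreeing, whether or not agreeing is correct. The pooled gap matches neither group's actual preference.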
