Tendril

Lesson 2074 of 2116

How AI Models Get Safety Training: RLHF in Plain Words

Why models refuse what they refuse, and how that shapes their behavior.

Creators | AI Foundations | ~7 min read | BI2 · Representation & Reasoning | BI3 · Learning | BI4 · Natural Interaction

Terms to connect while reading

RLHF, preference data, alignment, refusal behavior

Section 1

The premise

Models are pre-trained on vast amounts of text and then aligned via reinforcement learning from human feedback (RLHF): humans rank candidate outputs, a reward model learns to predict those rankings, and the model is then tuned to produce outputs that score higher. This shapes much of how models behave, including what they refuse.
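
The "humans rank outputs" step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular lab trains its models: the two features (`is_helpful`, `is_harmful`), the toy `PAIRS` data, and the linear `reward` function are all hypothetical. It fits a reward score from pairwise preferences using the Bradley-Terry (pairwise logistic) loss, which is a common choice for reward models. The later step, where the language model itself is tuned against that reward, is left out.

```python
import math

# Hypothetical toy data: each response is a feature vector
# (is_helpful, is_harmful); each pair is (preferred, rejected).
PAIRS = [
    ((1.0, 0.0), (0.0, 0.0)),  # helpful beats unhelpful
    ((1.0, 0.0), (1.0, 1.0)),  # safe beats harmful
    ((0.0, 0.0), (0.0, 1.0)),  # a refusal beats a harmful answer
]

def reward(weights, features):
    """Linear reward: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(weights, pairs):
    """Mean Bradley-Terry loss: -log sigmoid(r_preferred - r_rejected)."""
    total = 0.0
    for preferred, rejected in pairs:
        margin = reward(weights, preferred) - reward(weights, rejected)
        total += math.log(1.0 + math.exp(-margin))
    return total / len(pairs)

def train(pairs, steps=500, lr=0.5):
    """Plain gradient descent on the pairwise loss."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        grads = [0.0, 0.0]
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Derivative of -log sigmoid(margin) with respect to margin.
            coeff = -1.0 / (1.0 + math.exp(margin))
            for i in range(len(weights)):
                grads[i] += coeff * (preferred[i] - rejected[i]) / len(pairs)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

weights = train(PAIRS)
print("learned weights (helpful, harmful):", weights)
```

After training, the weight on the helpful feature is positive and the weight on the harmful feature is negative, so the learned reward prefers the same responses the raters did. In this toy setup, refusal behavior is just the pattern the raters rewarded.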

What AI does well here

Refusing the most obviously harmful requests
Defaulting to helpful, harmless, honest behavior
Producing outputs that match a particular team's preference profile
Adjusting behavior across retraining rounds as preference data accumulates

What AI cannot do

Capture all human preferences faithfully: RLHF pools many raters into one signal, which flattens diversity
Fully avoid sycophancy bias: responses that agree with the user tend to get ranked higher
Generalize safety perfectly to never-seen requests
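
The "flattens diversity" limit above can be made concrete with a small sketch. It is a deliberate simplification, and the vote counts and style labels are invented for illustration: ten hypothetical raters disagree about the best response style, yet pooling their votes yields a single winner that training then pushes toward.

```python
from collections import Counter

# Hypothetical votes: each rater names the response style they prefer.
votes = ["concise"] * 6 + ["detailed"] * 4

def aggregate(votes):
    """Return the majority winner and each option's share of the vote."""
    counts = Counter(votes)
    total = len(votes)
    shares = {option: n / total for option, n in counts.items()}
    winner = max(counts, key=counts.get)
    return winner, shares

winner, shares = aggregate(votes)
print("winner:", winner)
print("shares:", shares)
```

The minority preference still shows up in `shares`, but a model optimized toward the single winner serves those raters less well. The same mechanism can drive sycophancy: if agreeable answers win even slightly more votes, the pooled signal favors agreement.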
