RLHF to RLAIF: How Preference Learning Scaled
RLHF made ChatGPT possible. RLAIF is trying to take humans out of the loop. Here is the history, the trade-offs, and where the field is going.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The Technique That Made ChatGPT Usable
2. RLHF
3. RLAIF
4. DPO
Section 1
The Technique That Made ChatGPT Usable
A GPT-3 base model in 2020 was impressive but nearly unusable in a consumer chat interface. You had to prompt it carefully. It would helpfully suggest crimes. The jump from that to ChatGPT in late 2022 was not bigger models. It was reinforcement learning from human feedback.
The RLHF recipe
1. Start with a pretrained base model.
2. Supervised fine-tune on high-quality example completions (SFT).
3. Collect preference data: show humans two model outputs, ask which is better.
4. Train a reward model to predict human preferences from (prompt, completion) pairs.
5. Use PPO (or another RL algorithm) to fine-tune the policy against the reward model.
6. Add a KL penalty to keep the policy close to the SFT base (prevents reward hacking); see the sketch after this list.
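To make steps 4 through 6 concrete, here is a minimal sketch of the two losses involved, assuming PyTorch tensors of per-completion scores and log-probabilities. The function names and the default `beta` are illustrative, not any lab's production code.

```python
import torch
import torch.nn.functional as F

# Step 4: the reward model is commonly trained with a Bradley-Terry pairwise
# loss on scalar scores for the preferred and dispreferred completions.
def reward_model_loss(score_chosen: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    # Push the preferred completion's score above the rejected one's.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Steps 5-6: the signal PPO optimizes is the reward model's score minus a
# KL penalty that keeps the policy close to the SFT reference model.
def rlhf_objective(rm_score: torch.Tensor,
                   policy_logprob: torch.Tensor,
                   ref_logprob: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    kl_estimate = policy_logprob - ref_logprob  # per-sample KL estimate
    return rm_score - beta * kl_estimate
```

The KL term is what stops the policy from drifting into text the reward model scores highly but no human would prefer.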
RLAIF: replace the human
RL from AI Feedback, introduced by Anthropic in Constitutional AI (Bai et al., 2022), replaces most human preference labels with labels from another (or the same) model guided by a written constitution. The model critiques outputs against principles, revises them, and does the preference ranking. You still use humans, but orders of magnitude fewer.
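As an illustration, here is a hedged sketch of the AI-feedback labeling step, assuming a generic text-generation client. `client.generate`, `judge_model`, and the single principle shown are placeholders, not Anthropic's actual constitution or API.

```python
# Illustrative RLAIF labeling step: a judge model picks the preferred output
# against a constitutional principle. The client interface is hypothetical.
PRINCIPLE = "Choose the response that is more helpful, honest, and harmless."

def ai_preference_label(client, judge_model: str, prompt: str,
                        response_a: str, response_b: str) -> str:
    judge_prompt = (
        f"{PRINCIPLE}\n\n"
        f"User prompt:\n{prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response better follows the principle? Answer with 'A' or 'B'."
    )
    verdict = client.generate(model=judge_model, prompt=judge_prompt)
    return "A" if verdict.strip().upper().startswith("A") else "B"
```

These AI-generated labels then feed the same reward-model or DPO training step that human labels would.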
DPO: skip the reward model
Direct Preference Optimization (Rafailov et al., 2023) showed that you can turn preference pairs directly into a loss function without training a separate reward model and running PPO. DPO is simpler, cheaper, and often works as well or better. Many 2024-2026 open-weight alignment pipelines use DPO or variants like IPO, KTO, SimPO.
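A minimal sketch of the DPO loss itself, assuming you have already summed token log-probabilities for the chosen and rejected completions under the policy and the frozen reference model; variable names and the default `beta` are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: how far the policy has moved from the reference
    # on each completion.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Preference pairs become a direct classification-style loss:
    # no separate reward model, no PPO rollouts.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```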
Compare the options
| Method | Feedback source | Loss setup | Strength | Weakness |
|---|---|---|---|---|
| RLHF (PPO) | Human | Reward model + PPO RL | Proven at scale | Complex, expensive, tuning-heavy |
| RLAIF | AI + constitution | Reward model + RL or DPO | Scales feedback cheaply | Quality depends on judge model and constitution |
| DPO | Human or AI | Direct preference loss | Simple, stable, no RM needed | Less mature at frontier scale |
| KTO | Single good/bad label | Kahneman-Tversky prospect loss | Works with one-sided data | Newer, less studied |
Known pathologies
- Sycophancy: Anthropic found that preference-trained models learn to agree with users in ways humans rate positively but that damage honesty (Sharma et al., 2023)
- Mode collapse: models produce less diverse outputs after heavy RLHF
- Over-refusal: safety training from refusal preferences generalizes to refusing benign requests
- Reward model exploitation: the policy finds inputs where the reward model gives high score but a human would not
- Distributional narrowing: RLHF shrinks the model's generative range, sometimes too much
Where the field is moving
- Process supervision (reward reasoning, not just answers) for math and code
- Rubric-based AI judges trained to apply fine-grained criteria
- Constitutional AI variants baked into every major lab's pipeline
- Online preference learning: continuously update from live user feedback
- Scalable oversight research for when the judge is weaker than the policy
“RLHF turns a pretrained model from a savant into a colleague. It also teaches the savant to please you. Separating those two is the next decade of work.”
The big idea: every modern chat model has gone through preference learning, and every one of its quirks (hedging, flattery, format bloat) traces back to how those preferences were collected. Follow the feedback and you understand the model.