RLHF made ChatGPT possible. RLAIF is trying to take humans out of the loop. Here is the history, the trade-offs, and where the field is going.
A GPT-3 base model in 2020 was impressive but nearly unusable in a consumer chat interface. You had to prompt it carefully. It would helpfully suggest crimes. The jump from that to ChatGPT in late 2022 was not bigger models. It was reinforcement learning from human feedback (RLHF).
Reinforcement learning from AI feedback (RLAIF), introduced by Anthropic in Constitutional AI (Bai et al., 2022), replaces most human preference labels with labels from another (or the same) model guided by a written constitution. The model critiques outputs against the constitution's principles, revises them, and does the preference ranking. You still use human labels, but orders of magnitude fewer.
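As a rough illustration, the AI-feedback step boils down to a judge prompt. The helper below is a minimal sketch, not Anthropic's pipeline: `call_judge` is a hypothetical stand-in for whatever model API you use, and the two principles are illustrative, not the actual constitution.

```python
from typing import Callable

# Illustrative principles only; a real constitution is longer and more precise.
CONSTITUTION = [
    "Choose the response that is more helpful and honest.",
    "Choose the response less likely to encourage illegal or harmful acts.",
]

def ai_preference_label(prompt: str, response_a: str, response_b: str,
                        call_judge: Callable[[str], str]) -> str:
    """Ask a judge model which response better follows the constitution.

    Returns "A" or "B". In an RLAIF pipeline these labels replace human
    preference annotations and feed a reward model or a DPO-style loss.
    """
    principles = "\n".join(f"- {p}" for p in CONSTITUTION)
    judge_prompt = (
        f"Principles:\n{principles}\n\n"
        f"User prompt:\n{prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        'Which response better follows the principles? Answer "A" or "B".'
    )
    verdict = call_judge(judge_prompt).strip().upper()
    return "A" if verdict.startswith("A") else "B"
```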
Direct Preference Optimization (Rafailov et al., 2023) showed that you can turn preference pairs directly into a loss function without training a separate reward model and running PPO. DPO is simpler, cheaper, and often works as well as or better than PPO-based RLHF. Many 2024-2026 open-weight alignment pipelines use DPO or variants such as IPO, KTO, and SimPO.
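The core of DPO fits in a few lines. Here is a minimal PyTorch sketch, assuming you have already computed summed per-sequence log-probabilities under the policy and under the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a tensor of summed per-sequence log-probabilities,
    shape (batch,). beta controls how far the policy may drift from the
    reference model, playing the role of the KL penalty in PPO-based RLHF.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected, which implicitly
    # defines a reward r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)).
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

The only ingredients are log-probabilities from two forward passes: no reward model, no sampling loop, no PPO machinery, which is where the simplicity and cost savings come from.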
| Method | Feedback source | Loss setup | Strength | Weakness |
|---|---|---|---|---|
| RLHF (PPO) | Human | Reward model + PPO RL | Proven at scale | Complex, expensive, tuning-heavy |
| RLAIF | AI + constitution | Reward model + RL or DPO | Scales feedback cheaply | Quality depends on judge model and constitution |
| DPO | Human or AI | Direct preference loss | Simple, stable, no RM needed | Less mature at frontier scale |
| KTO | Single good/bad label | Kahneman-Tversky prospect loss | Works with one-sided data | Newer, less studied |
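For comparison, KTO needs only a per-example "good" or "bad" label rather than a paired ranking. The sketch below follows the Kahneman-Tversky-inspired loss at a high level; the reference point `z_ref` is crudely approximated from the batch mean, whereas the published method estimates it more carefully, so treat this as an illustration of the shape of the loss, not a faithful reimplementation.

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable, beta=0.1,
             lambda_d=1.0, lambda_u=1.0):
    """KTO-style loss over a batch of single-label examples.

    policy_logps / ref_logps: summed sequence log-probs, shape (batch,).
    is_desirable: boolean tensor marking which outputs were labeled "good".
    The batch-mean log-ratio is a crude stand-in for the KL reference
    point used in the original formulation.
    """
    logratio = policy_logps - ref_logps
    z_ref = logratio.mean().detach().clamp(min=0)
    # Desirable outputs are pushed above the reference point, undesirable
    # ones below it; lambda_d / lambda_u weight the two classes.
    losses = torch.where(
        is_desirable,
        lambda_d * (1 - torch.sigmoid(beta * (logratio - z_ref))),
        lambda_u * (1 - torch.sigmoid(beta * (z_ref - logratio))),
    )
    return losses.mean()
```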
> RLHF turns a pretrained model from a savant into a colleague. It also teaches the savant to please you. Separating those two is the next decade of work.
>
> — An alignment researcher at OpenAI
The big idea: every modern chat model has gone through preference learning, and every one of its quirks (hedging, flattery, format bloat) traces back to how those preferences were collected. Follow the feedback and you understand the model.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety-rlhf-rlaif-creators
What was the key technical innovation that transformed a raw GPT-3 model into ChatGPT?
What is the purpose of the KL divergence penalty in the RLHF pipeline?
Why is scaling RLHF to even larger datasets economically challenging?
In RLAIF (Reinforcement Learning from AI Feedback), what component of traditional RLHF is replaced by AI?
What does Direct Preference Optimization (DPO) eliminate compared to standard RLHF?
What pathology did Anthropic identify where models become agreeable in ways that humans rate positively but that damage honesty?
What is the symptom of 'mode collapse' in preference-trained models?
What occurs when safety training from refusal preferences generalizes to rejecting legitimate user requests?
What is 'process supervision' in the context of improving alignment methods?
What type of data does the KTO alignment method require compared to DPO?
In Constitutional AI, what is the 'constitution' that guides the AI feedback?
The lesson quotes: 'RLHF turns a pretrained model from a savant into a colleague. It also teaches the savant to please you.' What trade-off does this highlight?
What does 'online preference learning' refer to in the evolution of alignment techniques?
Based on the lesson, which statement explains why 'honest uncertainty loses' in preference-trained models?
What is a key weakness of RLAIF compared to RLHF with human feedback?