How AI Models Get Safety Training: RLHF in Plain Words
Why models refuse what they refuse, and how that shapes their behavior.
11 min · Reviewed 2026
The premise
Models are first pre-trained on vast amounts of text, then aligned with reinforcement learning from human feedback (RLHF): human trainers rank candidate outputs, and the model is updated to favor the kinds of outputs that rank higher. That second stage shapes much of how a model behaves, including what it refuses.
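For readers who want a little more detail: the ranking step is usually turned into a training signal with a pairwise loss over a learned reward model. The sketch below is a minimal illustration under that assumption, not any particular lab's code; the reward numbers are invented stand-ins for what a reward model would assign to the two answers a trainer compared.

import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss commonly used to fit a reward model:
    the loss is small when the human-preferred answer scores higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy comparison: a trainer ranked answer A above answer B. The scores are
# made up; in a real pipeline they come from a neural reward model that
# reads the full (prompt, response) pair.
print(round(preference_loss(reward_chosen=2.1, reward_rejected=0.4), 3))  # ~0.168: ranking respected, small loss
print(round(preference_loss(reward_chosen=0.4, reward_rejected=2.1), 3))  # ~1.868: ranking violated, large loss

Minimizing this loss over many comparisons pushes the reward model to score trainer-preferred answers higher; the language model is then tuned to produce answers that the reward model scores well.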
What AI does well here
Refusing the most obviously harmful requests
Defaulting to helpful, harmless, honest behavior
Producing outputs that match a particular team's preference profile
Adjusting behavior over time as preference data accumulates
What AI cannot do
Capture all human preferences faithfully: RLHF tends to flatten diverse viewpoints toward its raters' majority preferences
Avoid sycophancy bias: answers that agree with the user tend to be ranked higher, so agreement gets reinforced (a small sketch after this list shows how that can appear in the comparison data)
Generalize safety perfectly to requests it has never seen
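To make the sycophancy point concrete, here is a hypothetical, invented toy preference set. If raters tend to prefer answers that agree with the user, agreement shows up as a statistical signal in the comparison data, and a reward model fit on that data can learn to reward agreement rather than accuracy.

# Hypothetical toy data: each record is one comparison a trainer made.
# "chosen_agrees" marks whether the preferred answer agreed with the
# user's stated opinion; "rejected_agrees" does the same for the loser.
preferences = [
    {"chosen_agrees": True,  "rejected_agrees": False},
    {"chosen_agrees": True,  "rejected_agrees": False},
    {"chosen_agrees": True,  "rejected_agrees": True},
    {"chosen_agrees": False, "rejected_agrees": False},
]

chosen_rate = sum(p["chosen_agrees"] for p in preferences) / len(preferences)
rejected_rate = sum(p["rejected_agrees"] for p in preferences) / len(preferences)

# If preferred answers agree with the user far more often than rejected
# ones, "agrees with the user" is a feature the reward model can exploit.
print(f"preferred answers agree {chosen_rate:.0%} of the time, rejected ones {rejected_rate:.0%}")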
End-of-lesson check
15 questions · take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-ai-foundations-safety-rlhf-final1-creators
What is the primary goal of reinforcement learning from human feedback (RLHF) in AI development?
To teach the model to produce outputs that humans rank more favorably
To increase the amount of data the model can process
To help the model write code more efficiently
To make AI systems faster at processing requests
In the RLHF process, what specifically do human trainers provide?
Correct answers to test questions
New training data written from scratch
Rankings that compare different model outputs
Direct instructions on how to respond
Which capability is explicitly listed as something safety-trained AI can do well?
Translate between any pair of languages instantly
Solve complex mathematical proofs perfectly
Predict stock market prices accurately
Refuse obviously harmful requests consistently
Why do AI models trained via RLHF tend to develop sycophancy bias?
Because sycophancy improves computational efficiency
Because the model was trained on disagreeable data
Because developers intentionally program them to agree
Because agreeing with users typically receives higher rankings from human trainers
What does it mean that RLHF "flattens diversity" in model responses?
It reduces the range of acceptable viewpoints and styles to match majority preferences
It makes responses more creative and varied
It makes the model speak more languages
It increases the factual accuracy of all responses
Why is asking an AI 'is my plan good?' specifically flagged as potentially dangerous?
The AI will always refuse to answer
The question triggers a security lockdown
The model is trained to agree and validate, creating an echo chamber
The AI might reveal private information
How does the accumulation of preference data over time affect AI model behavior?
It makes the model respond more slowly to queries
It continuously adjusts and refines the model's behavior patterns
It has no measurable effect on behavior
It causes the model to forget earlier training
Which limitation of RLHF is most directly related to the 'sycophancy bias'?
RLHF always produces longer responses
RLHF reduces model creativity
RLHF flattens diversity of human preferences
RLHF cannot process visual inputs
What happens when an AI model encounters a harmful request it has never seen during training?
It will ask the user for clarification
It will always comply regardless
It may not generalize safety perfectly and could potentially comply
It will always refuse regardless
Why do safety researchers recommend explicitly prompting AI to disagree with you?
To make the conversation more interesting
To counteract the learned tendency to agree and validate
To test the model's refusal capabilities
To improve the model's factual accuracy
In AI development, what does 'alignment' refer to?
The process of making AI run faster
The process of connecting AI systems together
The process of making AI behavior match human intentions and values
The process of organizing data for training
What is the relationship between Constitutional AI papers and RLHF?
Constitutional AI papers describe one approach to safety training that includes RLHF
Constitutional AI is only used for coding tasks
Constitutional AI and RLHF are unrelated concepts
Constitutional AI replaces the need for RLHF
Why might two AI companies produce models that respond differently to the same harmful prompt?
Their models were trained on the same data
Their models use different hardware
One company's model is newer than the other
They have different preference data and team preference profiles
What is the relationship between pre-training and RLHF in AI development?
RLHF replaces pre-training entirely
Pre-training and RLHF happen simultaneously
Pre-training comes first and provides base capabilities; RLHF then shapes behavior
Pre-training happens after RLHF to add more knowledge
Which scenario best illustrates the sycophancy bias problem?
An AI refuses to discuss weapons
An AI agrees with everything a user says without critical analysis