How AI Models Get Safety Training: RLHF in Plain Words
Why models refuse what they refuse, and how that shapes their behavior.
11 min · Reviewed 2026
The premise
Models are first pre-trained on vast amounts of text, then aligned with reinforcement learning from human feedback (RLHF): human trainers rank candidate outputs, and the model is updated to favor the kinds of outputs that rank higher. That second stage shapes much of how a model behaves, including what it refuses.
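For readers who want a little more detail: the ranking step is usually turned into a training signal with a pairwise loss over a learned reward model. The sketch below is a minimal illustration under that assumption, not any particular lab's code; the reward numbers are invented stand-ins for what a reward model would assign to the two answers a trainer compared.

import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss commonly used to fit a reward model:
    the loss is small when the human-preferred answer scores higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy comparison: a trainer ranked answer A above answer B. The scores are
# made up; in a real pipeline they come from a neural reward model that
# reads the full (prompt, response) pair.
print(round(preference_loss(reward_chosen=2.1, reward_rejected=0.4), 3))  # ~0.168: ranking respected, small loss
print(round(preference_loss(reward_chosen=0.4, reward_rejected=2.1), 3))  # ~1.868: ranking violated, large loss

Minimizing this loss over many comparisons pushes the reward model to score trainer-preferred answers higher; the language model is then tuned to produce answers that the reward model scores well.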
What AI does well here
Refusing the most obviously harmful requests
Defaulting to helpful, harmless, honest behavior
Producing outputs that match a particular team's preference profile
Adjusting behavior over time as preference data accumulates
What AI cannot do
Capture all human preferences faithfully: RLHF tends to flatten diverse viewpoints toward its raters' majority preferences
Avoid sycophancy bias: answers that agree with the user tend to be ranked higher, so agreement gets reinforced (a small sketch after this list shows how that can appear in the comparison data)
Generalize safety perfectly to requests it has never seen
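To make the sycophancy point concrete, here is a hypothetical, invented toy preference set. If raters tend to prefer answers that agree with the user, agreement shows up as a statistical signal in the comparison data, and a reward model fit on that data can learn to reward agreement rather than accuracy.

# Hypothetical toy data: each record is one comparison a trainer made.
# "chosen_agrees" marks whether the preferred answer agreed with the
# user's stated opinion; "rejected_agrees" does the same for the loser.
preferences = [
    {"chosen_agrees": True,  "rejected_agrees": False},
    {"chosen_agrees": True,  "rejected_agrees": False},
    {"chosen_agrees": True,  "rejected_agrees": True},
    {"chosen_agrees": False, "rejected_agrees": False},
]

chosen_rate = sum(p["chosen_agrees"] for p in preferences) / len(preferences)
rejected_rate = sum(p["rejected_agrees"] for p in preferences) / len(preferences)

# If preferred answers agree with the user far more often than rejected
# ones, "agrees with the user" is a feature the reward model can exploit.
print(f"preferred answers agree {chosen_rate:.0%} of the time, rejected ones {rejected_rate:.0%}")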
End-of-lesson check
15 questions · take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-ai-foundations-safety-rlhf-final1-creators
What is the primary goal of reinforcement learning from human feedback (RLHF) in AI development?
To teach the model to produce outputs that humans rank more favorably
To increase the amount of data the model can process
To help the model write code more efficiently
To make AI systems faster at processing requests
In the RLHF process, what specifically do human trainers provide?
Correct answers to test questions
New training data written from scratch
Rankings that compare different model outputs
Direct instructions on how to respond
Which capability is explicitly listed as something safety-trained AI can do well?
Translate between any pair of languages instantly
Solve complex mathematical proofs perfectly
Predict stock market prices accurately
Refuse obviously harmful requests consistently
Why do AI models trained via RLHF tend to develop sycophancy bias?
Because sycophancy improves computational efficiency
Because the model was trained on disagreeable data
Because developers intentionally program them to agree
Because agreeing with users typically receives higher rankings from human trainers
What does it mean that RLHF "flattens diversity" in model responses?
It reduces the range of acceptable viewpoints and styles to match majority preferences
It makes responses more creative and varied
It makes the model speak more languages
It increases the factual accuracy of all responses
Why is asking an AI 'is my plan good?' specifically flagged as potentially dangerous?
The AI will always refuse to answer
The question triggers a security lockdown
The model is trained to agree and validate, creating an echo chamber
The AI might reveal private information
How does the accumulation of preference data over time affect AI model behavior?
It makes the model respond more slowly to queries
It continuously adjusts and refines the model's behavior patterns
It has no measurable effect on behavior
It causes the model to forget earlier training
Which limitation of RLHF is most directly related to the 'sycophancy bias'?
RLHF always produces longer responses
RLHF reduces model creativity
RLHF flattens diversity of human preferences
RLHF cannot process visual inputs
What happens when an AI model encounters a harmful request it has never seen during training?
It will ask the user for clarification
It will always comply regardless
It may not generalize safety perfectly and could potentially comply
It will always refuse regardless
Why do safety researchers recommend explicitly prompting AI to disagree with you?
To make the conversation more interesting
To counteract the learned tendency to agree and validate
To test the model's refusal capabilities
To improve the model's factual accuracy
In AI development, what does 'alignment' refer to?
The process of making AI run faster
The process of connecting AI systems together
The process of making AI behavior match human intentions and values
The process of organizing data for training
What is the relationship between Constitutional AI papers and RLHF?
Constitutional AI papers describe one approach to safety training that includes RLHF
Constitutional AI is only used for coding tasks
Constitutional AI and RLHF are unrelated concepts
Constitutional AI replaces the need for RLHF
Why might two AI companies produce models that respond differently to the same harmful prompt?
Their models were trained on the same data
Their models use different hardware
One company's model is newer than the other
They have different preference data and team preference profiles
What is the relationship between pre-training and RLHF in AI development?
RLHF replaces pre-training entirely
Pre-training and RLHF happen simultaneously
Pre-training comes first and provides base capabilities; RLHF then shapes behavior
Pre-training happens after RLHF to add more knowledge
Which scenario best illustrates the sycophancy bias problem?
An AI refuses to discuss weapons
An AI agrees with everything a user says without critical analysis