'Training data,' 'fine-tuning,' 'RLHF' — the words sound mysterious. The actual process is three clear stages.
Modern AI models go through three stages: (1) Pretraining: the model reads trillions of words of internet text; (2) Fine-tuning: it studies carefully curated examples of 'good' responses; (3) RLHF (Reinforcement Learning from Human Feedback): humans rate pairs of responses and the model learns which kind people prefer. Each stage costs more per data point than the last but uses far less data.
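If you like seeing ideas as code, here's a toy sketch of those three stages using a tiny word-counting model. Everything in it (the corpus, the curated examples, the update weights) is made up for illustration; real labs train giant neural networks on vastly more data, but the shape of the pipeline is the same.

```python
# Toy sketch of the three training stages, using a bigram word model.
# All data and numbers here are invented for illustration only.
from collections import defaultdict

# Stage 1: Pretraining -- count which word follows which in a pile of text.
# Huge amounts of data, but each data point carries a cheap, weak signal.
corpus = "the cat sat on the mat the dog sat on the rug".split()
counts = defaultdict(lambda: defaultdict(float))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1.0

# Stage 2: Fine-tuning -- nudge the model with a few curated 'good' examples.
# Far less data than pretraining, so each example is weighted more heavily.
curated = [("the", "cat"), ("cat", "sat")]
for prev, nxt in curated:
    counts[prev][nxt] += 5.0

# Stage 3: RLHF-style preference learning -- a human compares two possible
# continuations; the preferred one is boosted relative to the rejected one.
preferences = [("the", "cat", "dog")]  # after 'the', rater prefers 'cat' over 'dog'
for prev, better, worse in preferences:
    counts[prev][better] += 2.0
    counts[prev][worse] -= 1.0

def next_word(prev):
    """Pick the highest-scoring continuation the toy model has seen."""
    options = counts[prev]
    return max(options, key=options.get) if options else None

print(next_word("the"))  # -> 'cat'
```

The point of the toy: each later stage touches fewer examples but pushes the model harder per example, which is exactly the cost-versus-data trade-off the lesson describes.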
Read Anthropic's Constitutional AI paper summary on their website (no math, just plain English about how they trained Claude). It takes 15 minutes and you'll understand more about modern AI than 95% of people.
Three stages built every modern AI. Knowing the differences helps you understand why models behave so differently.
The big idea: Three stages make AI: read everything, learn from curated examples, then learn what people prefer. All three stages matter.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-builders-foundations-ai-how-models-are-trained-r9a10-teen
What is the first major stage in training a modern AI language model?
During the pretraining stage, about how many tokens did GPT-4 process?
Why do models sometimes give responses that seem overly agreeable or flattering?
What is fine-tuning in the AI training process?
In RLHF, what do human raters actually do?
What makes Constitutional AI different from standard RLHF?
Which training stage is primarily responsible when an AI produces harmful or biased outputs?
The lesson calls the three training stages 'the recipe for every chatbot on Earth.' Why?
What is 'training data' in the context of AI development?
If you wanted to teach an AI to follow specific rules (like 'don't give medical advice'), which training stage would you focus on?
What did Anthropic's Constitutional AI paper demonstrate?
Why do AI companies use so much internet text for pretraining?
The lesson mentions that RLHF uses about 1 million human comparisons. What is being compared?
What did the lesson say is unique about how Claude (Anthropic's AI) was trained?
Why might an AI be more honest if it were trained with less RLHF?