The Full LLM Pipeline
Pre-training → SFT → RLHF → Constitutional AI.
Every frontier LLM is the product of four stages: pre-training, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and — more recently — Constitutional AI / RLAIF. Understanding each stage tells you why different models behave differently.
1. Pre-training
The model is trained on a huge corpus — typically trillions of tokens scraped from the web, books, code, and academic papers. It learns next-token prediction on that raw data. At the end of pre-training, the model is a powerful autocomplete — smart, but not helpful. Ask it a question and it might continue with more questions instead of answering.
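Concretely, “next-token prediction” is just cross-entropy loss on the following token, applied over the whole corpus. Here is a minimal sketch in PyTorch, assuming `model` is any causal LM that maps token ids to logits of shape (batch, seq, vocab); the names are illustrative, not a particular library’s API.

```python
import torch
import torch.nn.functional as F

# Pre-training objective: predict token t+1 from tokens 0..t.
def next_token_loss(model, token_ids):
    inputs = token_ids[:, :-1]   # everything except the last token
    targets = token_ids[:, 1:]   # the same sequence shifted left by one
    logits = model(inputs)       # (batch, seq-1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (N, vocab)
        targets.reshape(-1),                  # (N,)
    )
```

That single loss, repeated over trillions of tokens, is the entire pre-training signal.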
2. Supervised fine-tuning (SFT)
Humans write thousands of high-quality example conversations: “here’s a good answer to this question.” The model is fine-tuned on these examples and learns the shape of “being an assistant.” Its output is now conversational, but not yet well aligned with human preferences.
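In code, SFT is the same next-token loss on a much smaller, curated dataset, usually with the loss masked so the model is only graded on the assistant’s reply. A sketch, assuming a `tokenizer` with an `encode` method and the same kind of causal LM as above (all names illustrative):

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # cross_entropy skips targets with this value

def make_sft_example(tokenizer, prompt, answer):
    prompt_ids = tokenizer.encode(prompt)
    answer_ids = tokenizer.encode(answer)
    input_ids = prompt_ids + answer_ids
    # Mask the prompt: learn to produce the answer, not repeat the question.
    labels = [IGNORE] * len(prompt_ids) + answer_ids
    return torch.tensor([input_ids]), torch.tensor([labels])

def sft_loss(model, input_ids, labels):
    logits = model(input_ids[:, :-1])
    targets = labels[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=IGNORE,
    )
```

The objective never changed; only the data did.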
3. RLHF — Reinforcement Learning from Human Feedback
Humans compare pairs of model outputs: “which is better?” Those preferences train a separate reward model that approximates human taste. The main model is then fine-tuned with reinforcement learning to maximize that reward. This is what makes ChatGPT feel “polished.”
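The reward model is typically trained with a Bradley-Terry-style loss on those human comparisons: score the preferred answer above the rejected one. A minimal sketch, assuming `reward_model` maps a tokenized response to a scalar score (illustrative names, not a specific library):

```python
import torch.nn.functional as F

# Reward-model objective: widen the score gap between the response
# humans preferred and the one they rejected.
# loss = -log sigmoid(r_chosen - r_rejected)
def preference_loss(reward_model, chosen_ids, rejected_ids):
    r_chosen = reward_model(chosen_ids)      # scalar per example
    r_rejected = reward_model(rejected_ids)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The main model is then optimized (commonly with PPO, plus a KL penalty that keeps it close to the SFT model) to produce outputs this reward model scores highly.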
4. Constitutional AI / RLAIF
Anthropic’s twist on RLHF: instead of all preferences coming from humans, an AI critiques its own outputs using a written constitution, a document of principles. This scales better and produces models that can explain their own refusals. “RLAIF” = Reinforcement Learning from AI Feedback. It’s part of why Claude sounds more measured than other frontier models.
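The mechanic is a critique-and-revision loop: the model drafts a response, critiques its own draft against a principle, then rewrites. A toy sketch, where `generate(prompt)` stands in for any LLM call and the principles are made up for illustration (this is not Anthropic’s actual constitution):

```python
# Illustrative principles only; the real constitution is a longer document.
PRINCIPLES = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that could enable dangerous or illegal activity.",
]

def constitutional_revision(generate, user_prompt):
    response = generate(user_prompt)
    for principle in PRINCIPLES:
        critique = generate(
            f"Critique this response against the principle: {principle}\n\n"
            f"Prompt: {user_prompt}\nResponse: {response}"
        )
        response = generate(
            f"Revise the response to address the critique.\n\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response
```

The revised outputs become SFT data, and AI preference judgments over pairs of responses replace human labels in the RLHF step; that substitution is the “AI feedback” in RLAIF.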
Why this matters for you
The same base model, post-trained differently, becomes a very different product. When you notice Claude refusing politely, ChatGPT pivoting smoothly, or Gemini being chattier, you’re seeing the fingerprints of different SFT data, different reward models, and different safety philosophies — not fundamentally different intelligence.