The Full LLM Pipeline
Pre-training → SFT → RLHF → Constitutional AI.
Every frontier LLM is the product of four stages: pre-training, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and — more recently — Constitutional AI / RLAIF. Understanding each stage tells you why different models behave differently.
1. Pre-training
The model is trained on a huge corpus — typically trillions of tokens scraped from the web, books, code, and academic papers. It learns next-token prediction on that raw data. At the end of pre-training, the model is a powerful autocomplete — smart, but not helpful. Ask it a question and it might continue with more questions instead of answering.
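Concretely, “next-token prediction” is just cross-entropy loss on the following token, applied over the whole corpus. Here is a minimal sketch in PyTorch, assuming `model` is any causal LM that maps token ids to logits of shape (batch, seq, vocab); the names are illustrative, not a particular library’s API.

```python
import torch
import torch.nn.functional as F

# Pre-training objective: predict token t+1 from tokens 0..t.
def next_token_loss(model, token_ids):
    inputs = token_ids[:, :-1]   # everything except the last token
    targets = token_ids[:, 1:]   # the same sequence shifted left by one
    logits = model(inputs)       # (batch, seq-1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (N, vocab)
        targets.reshape(-1),                  # (N,)
    )
```

That single loss, repeated over trillions of tokens, is the entire pre-training signal.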
2. Supervised fine-tuning (SFT)
Humans write thousands of high-quality example conversations: “here’s a good answer to this question.” The model is fine-tuned on these examples and learns the shape of “being an assistant.” Its output is now conversational, but not yet well aligned with human preferences.
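In code, SFT is the same next-token loss on a much smaller, curated dataset, usually with the loss masked so the model is only graded on the assistant’s reply. A sketch, assuming a `tokenizer` with an `encode` method and the same kind of causal LM as above (all names illustrative):

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # cross_entropy skips targets with this value

def make_sft_example(tokenizer, prompt, answer):
    prompt_ids = tokenizer.encode(prompt)
    answer_ids = tokenizer.encode(answer)
    input_ids = prompt_ids + answer_ids
    # Mask the prompt: learn to produce the answer, not repeat the question.
    labels = [IGNORE] * len(prompt_ids) + answer_ids
    return torch.tensor([input_ids]), torch.tensor([labels])

def sft_loss(model, input_ids, labels):
    logits = model(input_ids[:, :-1])
    targets = labels[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=IGNORE,
    )
```

The objective never changed; only the data did.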
3. RLHF — Reinforcement Learning from Human Feedback
Humans compare pairs of model outputs: “which is better?” Those preferences train a separate reward model that approximates human taste. The main model is then fine-tuned with reinforcement learning to maximize that reward. This is what makes ChatGPT feel “polished.”
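The reward model is typically trained with a Bradley-Terry-style loss on those human comparisons: score the preferred answer above the rejected one. A minimal sketch, assuming `reward_model` maps a tokenized response to a scalar score (illustrative names, not a specific library):

```python
import torch.nn.functional as F

# Reward-model objective: widen the score gap between the response
# humans preferred and the one they rejected.
# loss = -log sigmoid(r_chosen - r_rejected)
def preference_loss(reward_model, chosen_ids, rejected_ids):
    r_chosen = reward_model(chosen_ids)      # scalar per example
    r_rejected = reward_model(rejected_ids)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The main model is then optimized (commonly with PPO, plus a KL penalty that keeps it close to the SFT model) to produce outputs this reward model scores highly.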
4. Constitutional AI / RLAIF
Anthropic’s twist on RLHF: instead of all preferences coming from humans, an AI critiques its own outputs using a written constitution, a document of principles. This scales better and produces models that can explain their own refusals. “RLAIF” = Reinforcement Learning from AI Feedback. It’s part of why Claude sounds more measured than other frontier models.
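The mechanic is a critique-and-revision loop: the model drafts a response, critiques its own draft against a principle, then rewrites. A toy sketch, where `generate(prompt)` stands in for any LLM call and the principles are made up for illustration (this is not Anthropic’s actual constitution):

```python
# Illustrative principles only; the real constitution is a longer document.
PRINCIPLES = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that could enable dangerous or illegal activity.",
]

def constitutional_revision(generate, user_prompt):
    response = generate(user_prompt)
    for principle in PRINCIPLES:
        critique = generate(
            f"Critique this response against the principle: {principle}\n\n"
            f"Prompt: {user_prompt}\nResponse: {response}"
        )
        response = generate(
            f"Revise the response to address the critique.\n\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response
```

The revised outputs become SFT data, and AI preference judgments over pairs of responses replace human labels in the RLHF step; that substitution is the “AI feedback” in RLAIF.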
Why this matters for you
The same base model, post-trained differently, becomes a very different product. When you notice Claude refusing politely, ChatGPT pivoting smoothly, or Gemini being chattier, you’re seeing the fingerprints of different SFT data, different reward models, and different safety philosophies — not fundamentally different intelligence.