A/B Testing LLM Outputs
When you change a prompt, how do you know the new version is actually better? A/B testing is the honest answer.
Section 1
Two Versions Go In. One Wins.
An A/B test compares two variants: A, the control, is the current version, and B, the treatment, is the proposed change. You randomly route half of your real traffic to each, measure a metric, and see which wins. The same logic applies to LLM prompts, system messages, and models.
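One way to split traffic in half is sketched below. It is an illustration, not a prescribed method: it hashes a request ID instead of drawing a random number, so that the same ID always lands in the same variant. The function name `assign_variant`, the experiment label `prompt-v2-test`, and the `req-…` ID format are all hypothetical.

```python
import hashlib

def assign_variant(request_id: str, experiment: str = "prompt-v2-test") -> str:
    """Return "A" (control) or "B" (treatment), splitting traffic roughly 50/50."""
    # Hash the (hypothetical) experiment label together with the ID so the
    # split is stable across retries and independent between experiments.
    digest = hashlib.sha256(f"{experiment}:{request_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "A" if bucket < 50 else "B"

# Sanity check on made-up request IDs: the split should be close to even.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"req-{i}")] += 1
print(counts)
```

If one user sends many requests, hashing a user ID rather than a request ID keeps that user in a single variant for the whole test.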
A minimal LLM A/B test
1. Pick one metric to optimize (satisfaction rating, task completion, response length, or whatever matters for your product)
2. Define the change (new prompt, new model, new temperature)
3. Randomly assign each incoming request to A or B
4. Log the metric for each request
5. Wait until you reach a sample size you fixed in advance
6. Compare the means with a statistical test
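The last step above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the ratings are simulated rather than real logs, the function name `compare_means` is hypothetical, and the p-value uses a normal approximation to Welch's test, which is only reasonable when each arm has hundreds of samples or more.

```python
import math
import random
from statistics import NormalDist, mean, variance

def compare_means(a, b):
    """Two-sided test for a difference in means (B minus A).

    Uses the Welch standard error with a normal approximation,
    so it assumes both samples are large.
    """
    diff = mean(b) - mean(a)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, z, p_value

# Simulated logs (assumption): satisfaction ratings for each variant.
rng = random.Random(0)
ratings_a = [rng.gauss(3.8, 1.0) for _ in range(2000)]
ratings_b = [rng.gauss(4.0, 1.0) for _ in range(2000)]

diff, z, p_value = compare_means(ratings_a, ratings_b)
print(f"diff={diff:.3f}  z={z:.2f}  p={p_value:.4f}")
```

For small samples or heavily skewed metrics, a t-distribution or a permutation test is likely a better fit than this approximation.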
Common mistakes
- Peeking: checking results early and stopping when favorable (inflates false positives)
- Confounded changes: you changed two things — you cannot tell which worked
- Selection bias: assigning 'easy' requests to B
- Ignoring variance across users: one power user dominates the metric
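To see why peeking inflates false positives, the sketch below simulates A/A tests, where both arms draw from the same distribution, so every "significant" result is a false alarm. The function name, the sample sizes, and the number of simulations are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold (about 1.96)

def run_aa_test(rng, n_per_arm=1000, look_every=100):
    """Simulate one A/A test with interim looks.

    Returns (significant_at_any_look, significant_at_final_look).
    Assumes n_per_arm is a multiple of look_every, so the last look
    is the planned end of the test.
    """
    sum_a = sum_b = sq_a = sq_b = 0.0
    hit_at_any_look = False
    hit_at_this_look = False
    for n in range(1, n_per_arm + 1):
        x, y = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        sum_a += x
        sq_a += x * x
        sum_b += y
        sq_b += y * y
        if n % look_every == 0:
            # Sample variances from running sums, then a z statistic.
            var_a = (sq_a - sum_a * sum_a / n) / (n - 1)
            var_b = (sq_b - sum_b * sum_b / n) / (n - 1)
            z = (sum_b - sum_a) / n / math.sqrt((var_a + var_b) / n)
            hit_at_this_look = abs(z) > Z_CRIT
            hit_at_any_look = hit_at_any_look or hit_at_this_look
    return hit_at_any_look, hit_at_this_look

rng = random.Random(42)
results = [run_aa_test(rng) for _ in range(500)]
peeking_rate = sum(any_look for any_look, _ in results) / len(results)
planned_rate = sum(final for _, final in results) / len(results)
print(f"false positives if you stop at any significant look: {peeking_rate:.1%}")
print(f"false positives if you only test at the planned end: {planned_rate:.1%}")
```

The peeking rate can never be lower than the planned rate here, because the final look is one of the looks. With ten looks it typically comes out well above the nominal 5 percent.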
Compare the options
| Do | Do not |
|---|---|
| Pre-register your hypothesis | Dig until you find a significant effect |
| Lock your sample size ahead of time | Stop the test as soon as one variant pulls ahead |
| Control for time-of-day and cohort effects | Run A on Monday, B on Tuesday |
| Report effect size with CI | Report only the p-value |
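Two rows of the table lend themselves to a short sketch: locking the sample size ahead of time, and reporting the effect size with a confidence interval. The code below uses a common normal-approximation formula for comparing two means. The function names, the 80 percent power default, and the simulated ratings are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, mean, variance

def sample_size_per_arm(min_effect, std_dev, alpha=0.05, power=0.8):
    """Approximate samples needed per variant to detect `min_effect`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * std_dev / min_effect) ** 2
    return math.ceil(n)

def diff_with_ci(a, b, confidence=0.95):
    """Return (diff, low, high) for mean(b) - mean(a), normal approximation."""
    diff = mean(b) - mean(a)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    margin = NormalDist().inv_cdf(0.5 + confidence / 2) * se
    return diff, diff - margin, diff + margin

# Decide the sample size before the test starts.
n_needed = sample_size_per_arm(min_effect=0.1, std_dev=1.0)
print(f"samples needed per arm: {n_needed}")

# Simulated ratings (assumption) collected at that fixed sample size.
rng = random.Random(7)
ratings_a = [rng.gauss(3.8, 1.0) for _ in range(n_needed)]
ratings_b = [rng.gauss(3.9, 1.0) for _ in range(n_needed)]
diff, low, high = diff_with_ci(ratings_a, ratings_b)
print(f"effect: {diff:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Detecting a difference of 0.1 standard deviations at these settings takes about 1,570 samples per arm, and halving the detectable effect roughly quadruples that number. This is why small tests so often end without a clear answer.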
“Absence of evidence is not evidence of absence — especially in small A/B tests.”
The big idea: your prompt is not better because you think it is. A/B tests are how you find out.