Tendril

A/B Testing LLM Outputs

When you change a prompt, how do you know the new version is actually better? A/B testing is the honest answer.

Builders · AI Foundations · ~17 min read

Two Versions Go In. One Wins.

An A/B test compares two variants — A is the current version, B is the proposed change. You route half your real traffic to each, measure a metric, and see which wins. Same logic for LLM prompts, system messages, or models.

A minimal LLM A/B test

1. Pick one metric to optimize (satisfaction rating, task completion, response length, whatever matters)
2. Define the change (new prompt, new model, new temperature)
3. Randomly assign each incoming request to A or B
4. Log the metric for each
5. Wait for enough samples (fix the sample size before you start)
6. Compare means with a statistical test
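
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production harness: the names `assign_variant`, `compare_means`, the salt string, and the simulated ratings are all assumptions made up for the example. The comparison uses a normal approximation (a Welch-style z-test), which is only reasonable once each variant has a large number of samples.

```python
import hashlib
import math
import random
from statistics import NormalDist, mean, variance


def assign_variant(request_id: str, salt: str = "prompt-test-1") -> str:
    """Deterministically map a request ID to 'A' or 'B' with roughly 50/50 odds."""
    digest = hashlib.sha256(f"{salt}:{request_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def compare_means(a, b):
    """Return (mean_b - mean_a, two-sided p-value) from a large-sample z-test."""
    diff = mean(b) - mean(a)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, p_value


# Hypothetical log of one metric per request, grouped by variant.
log = {"A": [], "B": []}
rng = random.Random(42)  # stands in for real user feedback
for i in range(2000):
    variant = assign_variant(f"req-{i}")
    # Simulated rating; B is given a small made-up lift of 0.1.
    rating = rng.gauss(3.5, 1.0) + (0.1 if variant == "B" else 0.0)
    log[variant].append(rating)

diff, p_value = compare_means(log["A"], log["B"])
print(f"n_A={len(log['A'])} n_B={len(log['B'])} diff={diff:.3f} p={p_value:.4f}")
```

Hashing the request ID instead of calling a random number generator on each request keeps the assignment reproducible: the same ID always lands in the same variant, which makes logs easier to audit. If one user sends many requests, hashing a user ID instead keeps that user in a single variant.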

Common mistakes

Peeking: checking results early and stopping when favorable (inflates false positives)
Confounded changes: you changed two things — you cannot tell which worked
Selection bias: assigning 'easy' requests to B
Ignoring variance across users: one power user dominates the metric
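
To see why peeking is a problem, here is a small simulation sketch. It runs many A/A tests, where both variants draw from the same distribution, so every "significant" result is a false positive. One strategy checks only once at the planned sample size; the other checks after every batch and counts the test as a win if any check crosses the 5% threshold. The batch size, number of looks, and function names are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold, about 1.96


def mean_and_var(xs):
    """Sample mean and unbiased sample variance."""
    n = len(xs)
    m = sum(xs) / n
    return m, sum((x - m) ** 2 for x in xs) / (n - 1)


def is_significant(a, b):
    """Large-sample z-test on the difference in means at the 5% level."""
    mean_a, var_a = mean_and_var(a)
    mean_b, var_b = mean_and_var(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return abs(mean_b - mean_a) / se > Z_CRIT


def simulate(n_tests=500, looks=10, batch=50, seed=0):
    """Run A/A tests (no true difference) and count false positives."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_tests):
        a, b = [], []
        flagged_at_any_look = False
        for _ in range(looks):
            a.extend(rng.gauss(0, 1) for _ in range(batch))
            b.extend(rng.gauss(0, 1) for _ in range(batch))
            if is_significant(a, b):
                flagged_at_any_look = True  # a peeker would stop here
        peeking_hits += flagged_at_any_look
        fixed_hits += is_significant(a, b)  # only the final, planned look
    return fixed_hits / n_tests, peeking_hits / n_tests


fixed_rate, peeking_rate = simulate()
print(f"fixed-sample false positive rate: {fixed_rate:.3f}")
print(f"peek-and-stop false positive rate: {peeking_rate:.3f}")
```

The fixed-sample rate should land near the nominal 5%, while the peeking rate can never be lower and is typically several times higher, because every extra look is another chance for noise to cross the threshold.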

Compare the options

Do	Do not
Pre-register your hypothesis	Dig until you find a significant effect
Lock your sample size ahead of time	Stop the test as soon as one variant pulls ahead
Control for time-of-day and cohort effects	Run A on Monday, B on Tuesday
Report effect size with CI	Report only the p-value
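
Two of the "Do" items can be made concrete with a short sketch: fixing a sample size before the test starts, and reporting the effect size with a confidence interval. The formula used for the sample size is the standard normal-approximation one for comparing two means, n per variant = 2 * ((z_alpha/2 + z_beta) * sigma / delta)^2. The function names, the 80% power default, and the example ratings are assumptions for illustration only.

```python
import math
from statistics import NormalDist, mean, variance


def required_n_per_arm(sigma, min_effect, alpha=0.05, power=0.8):
    """Approximate samples per variant for a two-sided test on means.

    sigma: expected standard deviation of the metric.
    min_effect: smallest difference in means worth detecting.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / min_effect) ** 2)


def diff_with_ci(a, b, confidence=0.95):
    """Return (mean_b - mean_a, lower, upper) using a normal approximation."""
    diff = mean(b) - mean(a)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se


# Detecting a 0.2-point lift on a metric with standard deviation 1.0:
print(required_n_per_arm(sigma=1.0, min_effect=0.2))  # → 393

# Hypothetical ratings; far too few samples for a real test.
ratings_a = [3, 4, 3, 5, 4, 3, 4, 4]
ratings_b = [4, 5, 4, 5, 4, 4, 5, 3]
diff, lower, upper = diff_with_ci(ratings_a, ratings_b)
print(f"B - A = {diff:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

An interval that includes zero means the data is compatible with no difference at all, which is more informative than a bare p-value. With samples this small the normal approximation is rough; a t-based interval would be the safer choice.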

“Absence of evidence is not evidence of absence — especially in small A/B tests.”
Common statistics wisdom

The big idea: your prompt is not better because you think it is. A/B tests are how you find out.

