Designing Your Own Eval
The eval that matters most is the one tied to your real task. Here is a step-by-step way to build one.
Lesson map
What this lesson covers, in order:
1. The Only Eval That Really Matters
2. Custom eval
3. Rubric
4. Golden set
Section 1
The Only Eval That Really Matters
Public benchmarks are useful signals, but the eval that matters for your project is the one built on your users' actual work. Designing a good custom eval is a distinct skill.
Eight-step recipe
1. Write down the user task in one sentence
2. Sample 50-200 real instances of the task from logs or interviews
3. For each, decide what 'good' means (right answer? right tone? right format?); see the case-file sketch after this list
4. Write an explicit rubric, not just vibes
5. Have at least one human grade the sample to validate the rubric
6. Automate the grader (LLM-as-judge or string match)
7. Check the automated grader against the human on a subset
8. Version the eval: same input, comparable output over time
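Steps 2 and 3 are easier to keep honest if each sampled instance lives in its own small file with an explicit statement of what 'good' means for it. Here is one possible shape, sketched in Python; the field names (input, good_means, must_include, must_avoid) are illustrative assumptions, not a standard.

```python
import json
from pathlib import Path

# A hypothetical case: one real user input plus an explicit statement of
# what 'good' means for it. Field names are assumptions for illustration.
case = {
    "id": "001",
    "input": "How do I export my invoices as CSV?",
    "good_means": {
        "must_include": ["Settings", "Export", "CSV"],  # facts the answer needs
        "must_avoid": ["I think", "I'm not sure"],      # hedging the rubric bans
        "format": "numbered steps, second person",
    },
}

Path("evals/cases").mkdir(parents=True, exist_ok=True)
Path("evals/cases/001.json").write_text(json.dumps(case, indent=2))
```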
The rubric is the product
Most 'AI product' failures are actually rubric failures. The team never wrote down what good looks like, so they shipped something that kind-of-works until a customer complained. A crisp rubric forces the fuzzy bits into the open.
Compare the options
| Bad rubric | Good rubric |
|---|---|
| Response is helpful | Response directly answers the user's first question within the first two sentences |
| Tone is good | Tone is friendly, avoids hedging phrases like 'I think', and addresses the user in the second person |
| Factually accurate | Any specific claim can be verified against a cited source; no invented statistics |
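A crisp rubric row can often be checked mechanically. Below is a minimal sketch of a rule-based check for the tone and answers-up-front rows; the phrase list and helper names are assumptions, and a real grader would likely combine rules like this with an LLM-as-judge.

```python
import re

# Hypothetical phrase list; tune it to match your own rubric.
HEDGING_PHRASES = ["i think", "i'm not sure", "it might be", "probably"]

def first_two_sentences(text: str) -> str:
    """Rough sentence split; good enough for a grader sketch."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:2])

def grade_tone(response: str, question_keywords: list[str]) -> dict:
    """Return a pass/fail verdict per rubric row, not a single fuzzy score."""
    opening = first_two_sentences(response).lower()
    return {
        "answers_up_front": all(k.lower() in opening for k in question_keywords),
        "no_hedging": not any(p in response.lower() for p in HEDGING_PHRASES),
        "second_person": " you " in f" {response.lower()} ",
    }

print(grade_tone("You can export invoices from Settings > Export. Choose CSV.",
                 ["export", "csv"]))
```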
Keep it honest
- Never let the model see the rubric (unless that is the point of your system)
- Refresh the sample quarterly — user behavior drifts
- Track false positives and false negatives separately (see the sketch after this list)
- Store every eval run in version control for trend analysis
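Step 7 of the recipe and the false-positive/false-negative bullet above are the same check: compare the automated grader to the human labels on a shared subset and report the two error types separately. A rough sketch, assuming both graders emit one boolean per case:

```python
def grader_agreement(human: list[bool], auto: list[bool]) -> dict:
    """Compare automated verdicts with human verdicts on the same cases."""
    assert len(human) == len(auto), "grade the same subset with both graders"
    false_positives = sum(1 for h, a in zip(human, auto) if a and not h)
    false_negatives = sum(1 for h, a in zip(human, auto) if h and not a)
    agreement = sum(1 for h, a in zip(human, auto) if h == a) / len(human)
    return {
        "agreement": agreement,               # fraction where both graders agree
        "false_positives": false_positives,   # auto passed, human failed
        "false_negatives": false_negatives,   # auto failed, human passed
    }

# Example: ten cases double-graded by a human and the automated grader.
human = [True, True, False, True, False, True, True, False, True, True]
auto  = [True, True, True,  True, False, True, False, False, True, True]
print(grader_agreement(human, auto))
# {'agreement': 0.8, 'false_positives': 1, 'false_negatives': 1}
```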
A minimal folder layout for a versioned, repeatable eval
Eval file structure (example):

evals/
  README.md         # what this eval measures
  rubric.md         # the explicit definition of good
  cases/
    001.json        # one input + expected output behavior
    002.json
    ...
  runner.py         # runs model(s) and grader
  grader.py         # LLM-as-judge or rules
  history/
    2026-04-23.csv  # one row per case per model
    2026-04-30.csv

“You cannot improve what you do not measure, and you cannot measure what you have not defined.”
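To make the layout concrete, here is a minimal sketch of what runner.py might contain, assuming case files shaped like the earlier example. The model_response and grade functions are placeholders for your own model call and grader, and the column names are illustrative, not prescribed.

```python
import csv
import datetime
import json
from pathlib import Path

def model_response(prompt: str) -> str:
    """Stand-in for your real model call; swap in your API client or local model."""
    return f"(placeholder answer to: {prompt})"

def grade(case: dict, response: str) -> bool:
    """Stand-in for grader.py; here, a bare substring check as a placeholder."""
    required = case.get("good_means", {}).get("must_include", [])
    return all(term.lower() in response.lower() for term in required)

def run(eval_dir: str = "evals", model_name: str = "my-model") -> None:
    eval_path = Path(eval_dir)
    today = datetime.date.today().isoformat()
    rows = []
    for case_file in sorted((eval_path / "cases").glob("*.json")):
        case = json.loads(case_file.read_text())
        response = model_response(case["input"])
        rows.append({
            "date": today,
            "model": model_name,
            "case": case_file.stem,
            "passed": grade(case, response),
        })

    # One dated CSV per run keeps the history diffable in version control.
    out = eval_path / "history" / f"{today}.csv"
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["date", "model", "case", "passed"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    run()
```

Each run appends a new dated file rather than overwriting the last one, which is what makes the version-controlled trend analysis in the bullets above possible.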
The big idea: a good eval is a living spec for what your product is supposed to do. It is one of the most valuable artifacts you will ever build.
