Evaluating Prompt Performance: From Vibes to Metrics
You can't improve what you don't measure. Build an eval set, pick metrics, and turn prompt engineering from gut-feel into a rigorous discipline.
Lesson map
The main moves, in order:
1. Vibes don't scale
2. Prompt evaluation
3. Eval sets
4. LLM-as-judge
Section 1
Vibes don't scale
Early prompt tuning feels like cooking — taste, adjust, taste again. That works until you have three prompts in production serving 10,000 users. Then you need evals: a test set you run on every prompt change, with metrics that fail loudly when quality regresses.
Build an eval set
1. Collect 20-100 real inputs that represent your task (not synthetic fluff).
2. For each input, define what a correct or acceptable output looks like.
3. Include hard cases: ambiguous inputs, edge cases, adversarial inputs.
4. Version the set. Freeze it. Label updates v1, v2, v3.
5. Split it into a validation set (to tune on) and a held-out test set (for the final check); a minimal data-format sketch follows this list.
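Below is one possible shape for such a set: plain JSONL rows that are easy to diff and version. The field names (id, input, ideal, tags, split) and the file naming are illustrative assumptions, not a standard format.

```python
# A minimal eval-set sketch, stored as JSONL so it can be versioned and frozen.
# Field names (id, input, ideal, tags, split) are illustrative, not a standard.
import json

EVAL_SET_VERSION = "v1"  # bump to v2, v3 when cases are added or changed

eval_set = [
    {
        "id": "case-001",
        "input": "Summarize this support ticket: ...",
        "ideal": "Two sentences naming the product and the customer's issue.",
        "tags": ["typical"],
        "split": "validation",   # tune prompts against this slice
    },
    {
        "id": "case-002",
        "input": "Summarize this support ticket: ...",
        "ideal": "Two sentences; must not reveal internal account notes.",
        "tags": ["adversarial"],
        "split": "test",         # held-out: final check only
    },
]

with open(f"eval_set_{EVAL_SET_VERSION}.jsonl", "w", encoding="utf-8") as f:
    for case in eval_set:
        f.write(json.dumps(case) + "\n")
```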
Metric families
Compare the options
| Metric type | When to use | Example |
|---|---|---|
| Exact match | Classification, extraction with known answers. | Sentiment labels match ground truth. |
| Regex / schema | Structured output (JSON, tags). | Response parses as valid JSON matching schema. |
| Semantic similarity | Open-ended answers with multiple valid phrasings. | Cosine similarity of embeddings >= 0.85 to reference. |
| Rubric scoring | Quality dimensions (clarity, accuracy, tone). | 1-5 score on 4 axes, averaged. |
| LLM-as-judge | Subjective quality at scale. | Ask a strong model to score outputs against criteria. |
| Human review | Gold standard for nuanced judgment. | Expert rates a sample of 50 outputs per release. |
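To make the first three rows of the table concrete, here is a rough sketch of what those checks can look like in code. The function names and the 0.85 threshold are illustrative, and embeddings are assumed to be plain lists of floats.

```python
import json

def exact_match(output: str, ground_truth: str) -> bool:
    # Exact match: classification / extraction with a known answer.
    return output.strip().lower() == ground_truth.strip().lower()

def valid_json_with_keys(output: str, required_keys: set[str]) -> bool:
    # Schema check: response must parse as JSON and contain the required keys.
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys <= parsed.keys()

def semantically_close(output_emb: list[float], ref_emb: list[float],
                       threshold: float = 0.85) -> bool:
    # Semantic similarity: cosine similarity of embeddings against a reference.
    dot = sum(a * b for a, b in zip(output_emb, ref_emb))
    norm = (sum(a * a for a in output_emb) ** 0.5) * (sum(b * b for b in ref_emb) ** 0.5)
    return norm > 0 and dot / norm >= threshold
```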
LLM-as-judge template
A judge prompt you can run over every row of your eval set.
You are a rigorous evaluator. Score the following AI response.
<task_description>
{TASK}
</task_description>
<input>
{INPUT}
</input>
<response>
{MODEL_OUTPUT}
</response>
<reference_answer>
{IDEAL_ANSWER}
</reference_answer>
Score the response on:
1. Correctness (1-5): Does it answer the task accurately?
2. Completeness (1-5): Does it cover all parts of the task?
3. Format adherence (1-5): Does it match the required format?
4. Conciseness (1-5): Is it appropriately concise?
Respond in this XML format:
<scores>
<correctness>N</correctness>
<completeness>N</completeness>
<format>N</format>
<conciseness>N</conciseness>
</scores>
<justification>One paragraph explaining the scores.</justification>
<failure_mode>If any score is below 4, name the primary failure mode.</failure_mode>
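Because the judge answers in XML, your harness has to pull the numeric scores back out. Here is a minimal parsing sketch, assuming the tag names above and single-digit 1-5 scores; the helper name is illustrative.

```python
import re

def parse_judge_scores(judge_response: str) -> dict[str, int]:
    # Extract the four 1-5 scores from the judge's <scores> block.
    scores = {}
    for axis in ("correctness", "completeness", "format", "conciseness"):
        match = re.search(rf"<{axis}>\s*(\d)\s*</{axis}>", judge_response)
        scores[axis] = int(match.group(1)) if match else 0  # 0 = unparseable; flag for review
    return scores
```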
The regression gate
Every proposed prompt change runs against the eval set automatically. If any metric drops by more than a threshold (e.g., -5% correctness), the change is blocked or flagged for human review. This is the same discipline as unit tests — just for prompts.
An eval harness in pseudocode.
# Pseudocode for an eval run
results = []
for case in eval_set:
    output = run_prompt(candidate_prompt, case.input)
    scores = judge(task, case.input, output, case.ideal)
    results.append(scores)

mean = average(results)   # mean score per axis across all cases
if mean.correctness < baseline.correctness - 0.05:
    fail("Regression detected on correctness")
elif any(r.correctness < 3 for r in results):
    warn("Individual case below threshold — review")
else:
    pass_and_record_new_baseline()

Beyond single-prompt eval
- Compare across models: same prompt on Claude Sonnet 4.5 vs Opus 4.7 vs Haiku.
- Compare across temperatures: find the sweet spot for your task.
- Compare prompt variants side-by-side (A/B testing at the prompt level); see the sketch after this list.
- Track drift: run the same eval monthly; model updates can shift performance silently.
- Track cost/quality tradeoffs: the cheapest prompt that meets your quality bar often wins.
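Here is a rough sketch of such a side-by-side comparison. It reuses the run_prompt and judge placeholders from the harness above, assumes eval cases are the dict-shaped rows from the earlier sketch, and assumes judge returns a dict of scores; compare_variants is an illustrative name.

```python
from statistics import mean

def compare_variants(variants: dict[str, str], eval_set, task: str) -> dict[str, float]:
    # Run each prompt variant over the same eval set and average judge correctness.
    results = {}
    for name, prompt in variants.items():
        scores = []
        for case in eval_set:
            output = run_prompt(prompt, case["input"])            # placeholder from the harness
            judged = judge(task, case["input"], output, case["ideal"])
            scores.append(judged["correctness"])
        results[name] = mean(scores)
    return results

# Example: keep the cheapest variant whose mean correctness still clears your bar.
# scores = compare_variants({"v1": PROMPT_V1, "v2": PROMPT_V2}, eval_set, TASK)
```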
