Evaluating Prompt Performance: From Vibes to Metrics
You can't improve what you don't measure. Build an eval set, pick metrics, and turn prompt engineering from gut-feel into a rigorous discipline.
Lesson map
The main moves, in order:
1. Vibes don't scale
2. Prompt evaluation
3. Eval sets
4. LLM-as-judge
Section 1
Vibes don't scale
Early prompt tuning feels like cooking — taste, adjust, taste again. That works until you have three prompts in production serving 10,000 users. Then you need evals: a test set you run on every prompt change, with metrics that fail loudly when quality regresses.
Build an eval set
1. Collect 20-100 real inputs that represent your task (not synthetic fluff).
2. For each input, define what a correct or acceptable output looks like.
3. Include hard cases: ambiguous inputs, edge cases, adversarial inputs.
4. Version the set. Freeze it. Label updates v1, v2, v3.
5. Split it into a validation set (to tune on) and a held-out test set (for the final check); a minimal data-format sketch follows this list.
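Below is one possible shape for such a set: plain JSONL rows that are easy to diff and version. The field names (id, input, ideal, tags, split) and the file naming are illustrative assumptions, not a standard format.

```python
# A minimal eval-set sketch, stored as JSONL so it can be versioned and frozen.
# Field names (id, input, ideal, tags, split) are illustrative, not a standard.
import json

EVAL_SET_VERSION = "v1"  # bump to v2, v3 when cases are added or changed

eval_set = [
    {
        "id": "case-001",
        "input": "Summarize this support ticket: ...",
        "ideal": "Two sentences naming the product and the customer's issue.",
        "tags": ["typical"],
        "split": "validation",   # tune prompts against this slice
    },
    {
        "id": "case-002",
        "input": "Summarize this support ticket: ...",
        "ideal": "Two sentences; must not reveal internal account notes.",
        "tags": ["adversarial"],
        "split": "test",         # held-out: final check only
    },
]

with open(f"eval_set_{EVAL_SET_VERSION}.jsonl", "w", encoding="utf-8") as f:
    for case in eval_set:
        f.write(json.dumps(case) + "\n")
```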
Metric families
Compare the options
| Metric type | When to use | Example |
|---|---|---|
| Exact match | Classification, extraction with known answers. | Sentiment labels match ground truth. |
| Regex / schema | Structured output (JSON, tags). | Response parses as valid JSON matching schema. |
| Semantic similarity | Open-ended answers with multiple valid phrasings. | Cosine similarity of embeddings >= 0.85 to reference. |
| Rubric scoring | Quality dimensions (clarity, accuracy, tone). | 1-5 score on 4 axes, averaged. |
| LLM-as-judge | Subjective quality at scale. | Ask a strong model to score outputs against criteria. |
| Human review | Gold standard for nuanced judgment. | Expert rates a sample of 50 outputs per release. |
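To make the first three rows of the table concrete, here is a rough sketch of what those checks can look like in code. The function names and the 0.85 threshold are illustrative, and embeddings are assumed to be plain lists of floats.

```python
import json

def exact_match(output: str, ground_truth: str) -> bool:
    # Exact match: classification / extraction with a known answer.
    return output.strip().lower() == ground_truth.strip().lower()

def valid_json_with_keys(output: str, required_keys: set[str]) -> bool:
    # Schema check: response must parse as JSON and contain the required keys.
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys <= parsed.keys()

def semantically_close(output_emb: list[float], ref_emb: list[float],
                       threshold: float = 0.85) -> bool:
    # Semantic similarity: cosine similarity of embeddings against a reference.
    dot = sum(a * b for a, b in zip(output_emb, ref_emb))
    norm = (sum(a * a for a in output_emb) ** 0.5) * (sum(b * b for b in ref_emb) ** 0.5)
    return norm > 0 and dot / norm >= threshold
```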
LLM-as-judge template
A judge prompt you can run over every row of your eval set.
You are a rigorous evaluator. Score the following AI response.
<task_description>
{TASK}
</task_description>
<input>
{INPUT}
</input>
<response>
{MODEL_OUTPUT}
</response>
<reference_answer>
{IDEAL_ANSWER}
</reference_answer>
Score the response on:
1. Correctness (1-5): Does it answer the task accurately?
2. Completeness (1-5): Does it cover all parts of the task?
3. Format adherence (1-5): Does it match the required format?
4. Conciseness (1-5): Is it appropriately concise?
Respond in this XML format:
<scores>
<correctness>N</correctness>
<completeness>N</completeness>
<format>N</format>
<conciseness>N</conciseness>
</scores>
<justification>One paragraph explaining the scores.</justification>
<failure_mode>If any score is below 4, name the primary failure mode.</failure_mode>
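Because the judge answers in XML, your harness has to pull the numeric scores back out. Here is a minimal parsing sketch, assuming the tag names above and single-digit 1-5 scores; the helper name is illustrative.

```python
import re

def parse_judge_scores(judge_response: str) -> dict[str, int]:
    # Extract the four 1-5 scores from the judge's <scores> block.
    scores = {}
    for axis in ("correctness", "completeness", "format", "conciseness"):
        match = re.search(rf"<{axis}>\s*(\d)\s*</{axis}>", judge_response)
        scores[axis] = int(match.group(1)) if match else 0  # 0 = unparseable; flag for review
    return scores
```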
The regression gate
Every proposed prompt change runs against the eval set automatically. If any metric drops by more than a threshold (e.g., -5% correctness), the change is blocked or flagged for human review. This is the same discipline as unit tests — just for prompts.
An eval harness in pseudocode.
# Pseudocode for an eval run
results = []
for case in eval_set:
    output = run_prompt(candidate_prompt, case.input)
    scores = judge(task, case.input, output, case.ideal)
    results.append(scores)

mean = average(results)   # mean score per axis across all cases
if mean.correctness < baseline.correctness - 0.05:
    fail("Regression detected on correctness")
elif any(r.correctness < 3 for r in results):
    warn("Individual case below threshold — review")
else:
    pass_and_record_new_baseline()

Beyond single-prompt eval
- Compare across models: same prompt on Claude Sonnet 4.5 vs Opus 4.7 vs Haiku.
- Compare across temperatures: find the sweet spot for your task.
- Compare prompt variants side-by-side (A/B testing at the prompt level); see the sketch after this list.
- Track drift: run the same eval monthly; model updates can shift performance silently.
- Track cost/quality tradeoffs: the cheapest prompt that meets your quality bar often wins.
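Here is a rough sketch of such a side-by-side comparison. It reuses the run_prompt and judge placeholders from the harness above, assumes eval cases are the dict-shaped rows from the earlier sketch, and assumes judge returns a dict of scores; compare_variants is an illustrative name.

```python
from statistics import mean

def compare_variants(variants: dict[str, str], eval_set, task: str) -> dict[str, float]:
    # Run each prompt variant over the same eval set and average judge correctness.
    results = {}
    for name, prompt in variants.items():
        scores = []
        for case in eval_set:
            output = run_prompt(prompt, case["input"])            # placeholder from the harness
            judged = judge(task, case["input"], output, case["ideal"])
            scores.append(judged["correctness"])
        results[name] = mean(scores)
    return results

# Example: keep the cheapest variant whose mean correctness still clears your bar.
# scores = compare_variants({"v1": PROMPT_V1, "v2": PROMPT_V2}, eval_set, TASK)
```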
