Designing Your Own Eval
The eval that matters most is the one tied to your real task. Here is a step-by-step way to build one.
Lesson map
What this lesson covers, in order:
1. The Only Eval That Really Matters
2. Custom eval
3. Rubric
4. Golden set
Section 1
The Only Eval That Really Matters
Public benchmarks are useful signals, but the eval that matters for your project is the one built on your users' actual work. Designing a good custom eval is a distinct skill.
Eight-step recipe
1. Write down the user task in one sentence
2. Sample 50-200 real instances of the task from logs or interviews
3. For each, decide what 'good' means (right answer? right tone? right format?); see the case-file sketch after this list
4. Write an explicit rubric, not just vibes
5. Have at least one human grade the sample to validate the rubric
6. Automate the grader (LLM-as-judge or string match)
7. Check the automated grader against the human on a subset
8. Version the eval: same input, comparable output over time
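Steps 2 and 3 are easier to keep honest if each sampled instance lives in its own small file with an explicit statement of what 'good' means for it. Here is one possible shape, sketched in Python; the field names (input, good_means, must_include, must_avoid) are illustrative assumptions, not a standard.

```python
import json
from pathlib import Path

# A hypothetical case: one real user input plus an explicit statement of
# what 'good' means for it. Field names are assumptions for illustration.
case = {
    "id": "001",
    "input": "How do I export my invoices as CSV?",
    "good_means": {
        "must_include": ["Settings", "Export", "CSV"],  # facts the answer needs
        "must_avoid": ["I think", "I'm not sure"],      # hedging the rubric bans
        "format": "numbered steps, second person",
    },
}

Path("evals/cases").mkdir(parents=True, exist_ok=True)
Path("evals/cases/001.json").write_text(json.dumps(case, indent=2))
```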
The rubric is the product
Most 'AI product' failures are actually rubric failures. The team never wrote down what good looks like, so they shipped something that kind-of-works until a customer complained. A crisp rubric forces the fuzzy bits into the open.
Compare the options
| Bad rubric | Good rubric |
|---|---|
| Response is helpful | Response directly answers the user's first question within the first two sentences |
| Tone is good | Tone is friendly, avoids hedging phrases like 'I think', and addresses the user in the second person |
| Factually accurate | Any specific claim can be verified against a cited source; no invented statistics |
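A crisp rubric row can often be checked mechanically. Below is a minimal sketch of a rule-based check for the tone and answers-up-front rows; the phrase list and helper names are assumptions, and a real grader would likely combine rules like this with an LLM-as-judge.

```python
import re

# Hypothetical phrase list; tune it to match your own rubric.
HEDGING_PHRASES = ["i think", "i'm not sure", "it might be", "probably"]

def first_two_sentences(text: str) -> str:
    """Rough sentence split; good enough for a grader sketch."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:2])

def grade_tone(response: str, question_keywords: list[str]) -> dict:
    """Return a pass/fail verdict per rubric row, not a single fuzzy score."""
    opening = first_two_sentences(response).lower()
    return {
        "answers_up_front": all(k.lower() in opening for k in question_keywords),
        "no_hedging": not any(p in response.lower() for p in HEDGING_PHRASES),
        "second_person": " you " in f" {response.lower()} ",
    }

print(grade_tone("You can export invoices from Settings > Export. Choose CSV.",
                 ["export", "csv"]))
```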
Keep it honest
- Never let the model see the rubric (unless that is the point of your system)
- Refresh the sample quarterly — user behavior drifts
- Track false positives and false negatives separately (see the sketch after this list)
- Store every eval run in version control for trend analysis
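Step 7 of the recipe and the false-positive/false-negative bullet above are the same check: compare the automated grader to the human labels on a shared subset and report the two error types separately. A rough sketch, assuming both graders emit one boolean per case:

```python
def grader_agreement(human: list[bool], auto: list[bool]) -> dict:
    """Compare automated verdicts with human verdicts on the same cases."""
    assert len(human) == len(auto), "grade the same subset with both graders"
    false_positives = sum(1 for h, a in zip(human, auto) if a and not h)
    false_negatives = sum(1 for h, a in zip(human, auto) if h and not a)
    agreement = sum(1 for h, a in zip(human, auto) if h == a) / len(human)
    return {
        "agreement": agreement,               # fraction where both graders agree
        "false_positives": false_positives,   # auto passed, human failed
        "false_negatives": false_negatives,   # auto failed, human passed
    }

# Example: ten cases double-graded by a human and the automated grader.
human = [True, True, False, True, False, True, True, False, True, True]
auto  = [True, True, True,  True, False, True, False, False, True, True]
print(grader_agreement(human, auto))
# {'agreement': 0.8, 'false_positives': 1, 'false_negatives': 1}
```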
A minimal folder layout for a versioned, repeatable eval
Eval file structure (example):

evals/
  README.md         # what this eval measures
  rubric.md         # the explicit definition of good
  cases/
    001.json        # one input + expected output behavior
    002.json
    ...
  runner.py         # runs model(s) and grader
  grader.py         # LLM-as-judge or rules
  history/
    2026-04-23.csv  # one row per case per model
    2026-04-30.csv

“You cannot improve what you do not measure, and you cannot measure what you have not defined.”
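To make the layout concrete, here is a minimal sketch of what runner.py might contain, assuming case files shaped like the earlier example. The model_response and grade functions are placeholders for your own model call and grader, and the column names are illustrative, not prescribed.

```python
import csv
import datetime
import json
from pathlib import Path

def model_response(prompt: str) -> str:
    """Stand-in for your real model call; swap in your API client or local model."""
    return f"(placeholder answer to: {prompt})"

def grade(case: dict, response: str) -> bool:
    """Stand-in for grader.py; here, a bare substring check as a placeholder."""
    required = case.get("good_means", {}).get("must_include", [])
    return all(term.lower() in response.lower() for term in required)

def run(eval_dir: str = "evals", model_name: str = "my-model") -> None:
    eval_path = Path(eval_dir)
    today = datetime.date.today().isoformat()
    rows = []
    for case_file in sorted((eval_path / "cases").glob("*.json")):
        case = json.loads(case_file.read_text())
        response = model_response(case["input"])
        rows.append({
            "date": today,
            "model": model_name,
            "case": case_file.stem,
            "passed": grade(case, response),
        })

    # One dated CSV per run keeps the history diffable in version control.
    out = eval_path / "history" / f"{today}.csv"
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["date", "model", "case", "passed"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    run()
```

Each run appends a new dated file rather than overwriting the last one, which is what makes the version-controlled trend analysis in the bullets above possible.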
The big idea: a good eval is a living spec for what your product is supposed to do. It is one of the most valuable artifacts you will ever build.
