Prompt Evaluation and Testing: From Vibes to Rigorous Evals, Part 2
Get a self-estimated confidence number you can route on, without pretending it is perfectly calibrated.
40 min · Reviewed 2026
Asking Claude and GPT for calibrated confidence scores
The premise
A rough confidence number, even imperfect, beats no signal at all when routing humans into the loop.
What AI does well here
Ask for a 0-100 score with a one-line rationale
Route low-confidence answers to humans
What AI cannot do
Trust the absolute number
Replace measured calibration on real data
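A minimal sketch of the score-and-route loop in Python, assuming a generic call_model(prompt) helper and a send_to_human_queue(question, result) function for the review path (both hypothetical; any SDK and queue will do):

import json, re

CONFIDENCE_SUFFIX = (
    "\n\nAfter your answer, output one line of JSON: "
    '{"confidence": <0-100 integer>, "rationale": "<one line>"}'
)

def answer_with_confidence(question):
    raw = call_model(question + CONFIDENCE_SUFFIX)  # hypothetical chat helper
    match = re.search(r"\{.*\}", raw, re.DOTALL)    # grab the trailing JSON object
    if not match:
        return {"answer": raw, "confidence": 0, "rationale": "unparseable"}
    meta = json.loads(match.group())
    return {"answer": raw[: match.start()].strip(), **meta}

HUMAN_REVIEW_THRESHOLD = 60  # a routing knob, not a calibrated probability

def route(question):
    result = answer_with_confidence(question)
    if result["confidence"] < HUMAN_REVIEW_THRESHOLD:
        return send_to_human_queue(question, result)  # hypothetical review queue
    return result

Tune the threshold by checking how accuracy actually varies with the self-reported score on your own data; the absolute number means little on its own.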
Building non-English eval sets for an AI assistant
The premise
Quality in English does not predict quality in Spanish, Hindi, or Japanese.
What AI does well here
Translate or source 50 prompts per shipped language
Score with native speakers, not auto-translate
What AI cannot do
Judge non-English correctness with English-only auto-graders
Replace local cultural review
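One way to structure a per-language eval case, sketched in Python; the JSON Lines layout and field names are illustrative, not a standard:

import json

case = {
    "id": "es-0042",
    "language": "es",
    "prompt": "Resume este correo en dos frases.",  # sourced or reviewed by a native speaker, not auto-translated
    "expected_behavior": "Two-sentence summary in natural Spanish",
    "grader": "native_speaker",  # never an English-only auto-grader
    "native_score": None,        # filled in during human review, 1-5
}

with open("evals_es.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(case, ensure_ascii=False) + "\n")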
AI prompting and eval harness design
The premise
Prompts regress silently; an eval harness with golden cases is the only safety net.
What AI does well here
Define 30-100 golden cases with expected outputs
Run on every prompt change in CI
What AI cannot do
Cover every edge case
Replace human spot-check on style and tone
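A minimal harness sketch that CI can run as a pytest test on every prompt change, assuming a goldens.jsonl file of records with "id", "input", and "expected" fields, a PROMPT_TEMPLATE with an {input} slot, and a call_model helper (all three are assumptions):

import json

def load_goldens(path="goldens.jsonl"):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def test_prompt_against_goldens():
    failures = []
    for case in load_goldens():
        output = call_model(PROMPT_TEMPLATE.format(input=case["input"]))
        if case["expected"] not in output:  # substring check; swap in a grader for fuzzier cases
            failures.append(case["id"])
    assert not failures, f"Regressed golden cases: {failures}"

Wire this into CI so a failing golden case blocks the merge, then spot-check style and tone by hand.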
Prompting AI: context stuffing vs retrieval — choosing the right tool
The premise
Long context windows tempt teams to paste everything. Past a certain size, models miss content in the middle, and cost grows linearly with every token you add. Retrieval is more work upfront but scales further.
What AI does well here
Use information from anywhere in a moderate context window
Cite passages when asked to ground answers
Combine retrieved snippets into a coherent answer
What AI cannot do
Reliably attend to every fact in a very long context
Decide on its own which document to retrieve from
Recover from bad retrieval results — garbage in, garbage out
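A rough way to encode that trade-off, assuming a count_tokens helper and a retriever object with a search(query, k) method (both hypothetical):

def build_context(query, documents, budget_tokens=8000):
    total = sum(count_tokens(d) for d in documents)
    if total <= budget_tokens:
        # Small corpus: stuffing everything is simpler, and models use moderate windows well
        return "\n\n".join(documents)
    # Large corpus: retrieve a few relevant chunks instead of paying for (and losing) the middle
    hits = retriever.search(query, k=5)
    return "\n\n".join(hit.text for hit in hits)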
Prompting AI: versioning prompts like code
The premise
When prompts live in a Notion doc, you can't tell when a regression happened or who changed what. Putting prompts in git, reviewing changes, and gating with evals turns them into manageable artifacts.
What AI does well here
Behave consistently for a frozen prompt + frozen model + temp 0
Show measurable differences between versions on the same eval set
Roll back instantly when a prompt is reverted in code
What AI cannot do
Tell you which prompt version produced a past output without you logging it
Maintain consistency across model upgrades without re-evaluation
Self-version or self-tag its own prompts
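A minimal sketch of tying every output to a prompt version, assuming prompts live as text files in the repo and a call_model helper; the log schema is illustrative:

import hashlib, json, time
from pathlib import Path

def load_prompt(name):
    text = Path(f"prompts/{name}.txt").read_text(encoding="utf-8")
    version = hashlib.sha256(text.encode()).hexdigest()[:12]  # content hash doubles as a version id
    return text, version

def logged_call(name, user_input):
    prompt, version = load_prompt(name)
    output = call_model(prompt + "\n\n" + user_input)  # hypothetical model helper
    with open("calls.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps({"ts": time.time(), "prompt": name,
                            "prompt_version": version, "output": output}) + "\n")
    return output

Because the version is a hash of the file contents, git blame tells you who changed what, and the log tells you which version produced any past output.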
AI Prompting: Use an LLM Judge With a Rubric — Carefully
The premise
LLM-as-judge evaluation is fast and cheap, but uncalibrated judges drift, favor their own outputs, and hide systematic bias.
What AI does well here
Write a rubric with concrete criteria and examples
Calibrate against 50+ human-scored examples
Detect position bias by swapping output order
Re-calibrate when the judge or generator model changes
What AI cannot do
Replace human judgment for novel quality dimensions
Detect issues your rubric does not name
Stay calibrated forever without re-checks
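A minimal sketch of a pairwise judge with an order swap, assuming a call_model helper that replies with a single letter; the rubric wording is an example:

JUDGE_PROMPT = """You are grading two answers against this rubric:
- Accuracy: factual claims are correct and supported
- Completeness: the whole question is addressed
- Tone: concise and professional
Question: {question}
Answer A: {a}
Answer B: {b}
Reply with exactly one letter, A or B, for the better answer."""

def judge_pair(question, out1, out2):
    first = call_model(JUDGE_PROMPT.format(question=question, a=out1, b=out2)).strip()
    swapped = call_model(JUDGE_PROMPT.format(question=question, a=out2, b=out1)).strip()
    if first == swapped:
        # Same letter after the swap means the judge is scoring position, not quality
        return "position_bias"
    return "out1" if first == "A" else "out2"

Run the same harness over your 50+ human-scored examples and compare before trusting the judge on anything new.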
AI Prompting: Build RAG Prompts That Actually Use the Retrieved Context
The premise
RAG fails when prompts let the model fall back to its own knowledge; explicit grounding instructions and citation requirements force the model to use what you fetched.
What AI does well here
Wrap retrieved chunks in delimited tags with IDs
Require citations to chunk IDs in the answer
Tell the model what to do if retrieval is empty
Surface 'I do not know' as a valid output
What AI cannot do
Replace good retrieval — bad chunks beat good prompting every time
Guarantee no hallucinated citations
Eliminate the need to verify cited chunks exist
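A minimal sketch of the grounding wrapper plus a post-hoc check that cited chunk IDs actually exist; the tag format and helper names are assumptions:

import re

def build_grounded_prompt(question, chunks):
    if not chunks:
        return f"{question}\n\nNo documents were retrieved. Reply exactly: I do not know."
    context = "\n".join(
        f"<chunk id='{i}'>\n{text}\n</chunk>" for i, text in enumerate(chunks)
    )
    return (
        "Answer using ONLY the chunks below. Cite chunk ids like [0].\n"
        "If the chunks do not contain the answer, reply: I do not know.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

def cited_ids_exist(answer, n_chunks):
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return cited <= set(range(n_chunks))  # out-of-range ids are hallucinated citations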
AI and grounded prompts with retrieval
The premise
Models hallucinate facts with confidence. Retrieving the source and quoting it in the prompt cuts hallucinations dramatically.
What AI does well here
Propose a retrieval slot in the prompt.
Suggest a 'cite the passage' instruction.
Help format snippets for compactness.
What AI cannot do
Guarantee the model uses retrieved text.
Replace a real RAG eval suite.
Catch when retrieval returned the wrong doc.
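A small sketch of compact snippet formatting before quoting sources in-prompt; the snippet schema and the 500-character cap are arbitrary choices for illustration:

def format_snippets(snippets, max_chars=500):
    lines = []
    for s in snippets:  # each snippet: {"source": ..., "text": ...}
        body = " ".join(s["text"].split())[:max_chars]  # collapse whitespace, cap length
        lines.append(f"[{s['source']}] {body}")
    return "\n".join(lines)

# Usage: prompt = f"Quote the passage you rely on.\n\nSources:\n{format_snippets(hits)}\n\nQ: {q}"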
AI and prompt test suite basics
The premise
Prompts regress silently when models or wording change. A 10-case test suite catches the worst regressions cheaply.
AI scoring is noisy without a rubric. Specify dimensions and a 1-5 scale and you get usable, repeatable grades.
What AI does well here
Score consistently across multiple drafts when given a rubric.
Justify each score with concrete evidence.
Identify which rubric dimension is weakest.
Apply the same rubric across many candidates.
What AI cannot do
Assign meaningful scores without a rubric.
Match a senior human grader's nuance on subjective dimensions.
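A minimal sketch of rubric scoring across drafts, assuming a call_model helper that returns the requested JSON; the three dimensions are examples:

import json

RUBRIC_PROMPT = """Score the draft from 1 (poor) to 5 (excellent) on: accuracy, completeness, tone.
Draft:
{draft}
Reply with JSON only: {{"accuracy": n, "completeness": n, "tone": n, "evidence": "<one line>"}}"""

def score_drafts(drafts):
    return [json.loads(call_model(RUBRIC_PROMPT.format(draft=d))) for d in drafts]

def weakest_dimension(scores):
    dims = ("accuracy", "completeness", "tone")
    return min(dims, key=lambda d: sum(s[d] for s in scores))  # lowest total = weakest dimension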
AI Self-Consistency Prompting: Sampling Multiple Paths for Reliable Answers
The premise
Self-consistency prompting samples multiple reasoning paths at non-zero temperature and aggregates the most common answer — improving reliability on tasks with verifiable answers.
What AI does well here
Producing varied reasoning paths at elevated temperature
Converging on stable answers when the underlying logic is sound
Improving math and logic accuracy materially
Surfacing answer distribution when prompted
What AI cannot do
Help on tasks where there's no verifiable correct answer
Eliminate systematic biases shared across all samples
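A minimal sketch of self-consistency voting, assuming a call_model helper that accepts a temperature argument and a prompt that asks for a final line of the form ANSWER: <value> (both assumptions):

from collections import Counter

def self_consistent_answer(prompt, n_samples=7):
    finals = []
    for _ in range(n_samples):
        raw = call_model(prompt, temperature=0.8)  # non-zero temperature varies the reasoning paths
        for line in reversed(raw.splitlines()):
            if line.startswith("ANSWER:"):
                finals.append(line.removeprefix("ANSWER:").strip())
                break
    if not finals:
        return None, 0.0
    (answer, votes), = Counter(finals).most_common(1)
    return answer, votes / len(finals)  # agreement rate doubles as the answer distribution signal

Note the second return value: low agreement across samples is itself a useful flag for review.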
AI Prompt Eval: Detecting Regressions Before Production
The premise
AI prompt changes require evaluation against golden sets and held-out adversarial cases — vibe-checks miss regressions that hit users in production.
What AI does well here
Producing structured outputs for automated grading
Following test scenarios near-deterministically when seeded
Reporting per-test pass/fail with explanations
Replicating runs against frozen prompt versions
What AI cannot do
Generate genuinely adversarial test cases against itself
Self-assess whether a change improved user outcomes
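A minimal sketch of a before/after regression report over golden and adversarial subsets, assuming a run_case(prompt, case) grader that returns True on pass and cases labeled with a "set" field (all hypothetical):

def regression_report(old_prompt, new_prompt, cases):
    for subset in ("golden", "adversarial"):
        subset_cases = [c for c in cases if c["set"] == subset]
        old_pass = sum(run_case(old_prompt, c) for c in subset_cases)
        new_pass = sum(run_case(new_prompt, c) for c in subset_cases)
        flag = "  <-- REGRESSION" if new_pass < old_pass else ""
        print(f"{subset}: {old_pass}/{len(subset_cases)} -> {new_pass}/{len(subset_cases)}{flag}")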