The premise
You can't improve what you don't measure. Eval suites turn 'feels better' into 'scored 87 vs 82.'
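Concretely, the lesson's recommended structure is a CSV where each row holds an input, an expected_output, and a rubric_score (the score column is filled in by the judge on each run). The rows below are made-up examples:

```csv
input,expected_output,rubric_score
"Summarize our refund policy in one sentence.","Refunds are available within 30 days of purchase.",87
"Which plans include SSO?","SSO is available on the Business and Enterprise plans.",82
```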
What AI does well here
- Run a fixed test set against new prompts/models.
- Compare outputs on rubric scores.
- Surface regressions when you change a prompt (see the sketch after this list).
- Generate test cases when seeded with examples.
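Here is a minimal sketch of that loop, assuming the CSV format above plus stub `generate` and `judge_score` functions. Every name in it (`eval_cases.csv`, `run_eval`, the toy exact-match judge) is illustrative, not a specific library's API:

```python
import csv

def generate(prompt_input: str) -> str:
    """Stand-in for your model call; swap in the prompt or model under test."""
    return "model output for: " + prompt_input

def judge_score(expected: str, actual: str) -> float:
    """Stand-in for an LLM-as-judge rubric scorer returning 0-100.
    Real judges can prefer verbose, hedge-heavy answers, so
    sample-validate their scores against your own ratings."""
    return 100.0 if actual.strip() == expected.strip() else 0.0  # toy exact match

def run_eval(test_set_path: str) -> float:
    """Run the fixed test set and return the mean rubric score.
    A fuller harness would also write rubric_score back to the CSV."""
    scores = []
    with open(test_set_path, newline="") as f:
        for row in csv.DictReader(f):
            actual = generate(row["input"])
            scores.append(judge_score(row["expected_output"], actual))
    return sum(scores) / len(scores)

if __name__ == "__main__":
    BASELINE = 87.0  # mean score of the previous prompt on this same test set
    current = run_eval("eval_cases.csv")
    print(f"scored {current:.0f} vs {BASELINE:.0f}")
    if current < BASELINE:
        print("Regression: the prompt change made things worse on the fixed set.")
```

The design point that matters is that the test set stays fixed across runs: a baseline of 87 only means something if the current run saw exactly the same cases.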
What AI cannot do
- Replace human judgment for subjective dimensions.
- Catch edge cases you didn't include in the eval.
End-of-lesson check
15 questions · Take the quiz online for instant feedback at tendril.neural-forge.io/learn/quiz/end-tools-ai-evals-and-testing-r13a2-creators
What is the primary benefit of running an eval suite against a fixed test set?
- It allows you to compare new prompts or models against a consistent baseline
- It lets the AI generate entirely new test cases on its own
- It automatically fixes bugs in the AI model
- It eliminates the need for any human involvement in testing
According to the recommended structure, what three elements should be included in each row of an eval CSV?
- input, actual_output, and error_message
- input, expected_output, and rubric_score
- prompt, model_name, and timestamp
- question, answer_key, and difficulty_rating
A developer runs an eval and finds their new prompt scores 82, down from 87 with the previous prompt. What does this represent in eval terminology?
- A judge scoring error
- A successful optimization
- An edge case failure
- A performance regression
What bias does the lesson identify in LLM-as-judge scoring?
- It cannot evaluate technical content
- It always prefers shorter responses
- It only works with multiple-choice questions
- It prefers verbose, hedge-heavy answers
What does the lesson recommend for validating that LLM-as-judge scores are reliable?
- Only use judges trained on your specific domain
- Require two different judges to agree on every score
- Sample-validate judge scores against your own ratings
- Run the judge on the same data multiple times
Which statement best describes what AI can do well within an eval framework?
- Identify every possible edge case automatically
- Replace human judgment on subjective quality dimensions
- Generate test cases when seeded with examples
- Design the scoring rubric without any human input
A developer changes their prompt and wants to know if it improved the AI's responses. How does an eval help answer this question?
- By running the new prompt against a fixed test set and comparing rubric scores to baseline
- By showing the old and new prompts side-by-side to users for feedback
- By asking the AI to rate its own responses
- By measuring how long the new prompt takes to generate responses
The lesson states that AI cannot replace human judgment for what type of dimensions?
- Subjective dimensions
- Numerical dimensions
- Binary dimensions
- Technical dimensions
What is the relationship between the baseline and current scores in an eval?
- The baseline must always be higher than current scores
- The baseline is the score you hope to achieve eventually
- The baseline serves as a reference point for comparing new scores
- The baseline is calculated after running the current prompt
What does it mean to 'surface a regression' in the context of AI evals?
- To detect when a prompt or model change causes performance to drop
- To fix a bug that was introduced in a previous version
- To recover a lost version of a prompt
- To identify that the AI has become too verbose
Why might an LLM-as-judge give a high score to a response that isn't actually the best answer?
- Because it cannot read the response being scored
- Because it has access to the internet during scoring
- Because it prefers verbose, hedge-heavy answers that sound cautious
- Because it randomly assigns scores to save time
What is the fundamental problem that eval frameworks aim to solve?
- Making AI models run faster
- Replacing all human QA testers
- Eliminating the need for any test cases
- Turning subjective impressions like 'feels better' into measurable scores
After running an LLM-as-judge on your eval, you notice it consistently gives scores that seem too high. What should you do?
- Switch to a different scoring method entirely
- Sample-validate the judge scores against your own ratings
- Reduce the number of test cases
- Stop using evals and rely on user feedback
What must be true about a test set for it to effectively surface regressions?
- It must change with every test iteration
- It must be validated by an external auditor
- It must be fixed and consistent across runs
- It must contain only edge cases
Why is human judgment still necessary even when using LLM-as-judge in evals?
- Because subjective dimensions cannot be reliably scored by AI
- Because human judgment is faster than AI scoring
- Because evals require a human to run the test cases
- Because AI judges always make random errors