The premise
AI can scaffold a Promptfoo configuration with prompts, providers, test cases, and assertions for side-by-side provider comparison.
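Such a scaffold might look like the minimal promptfooconfig.yaml sketch below; the provider ids, prompt text, and ticket variable are illustrative assumptions, not choices mandated by the lesson.

```yaml
# promptfooconfig.yaml — illustrative scaffold; provider ids are assumptions
description: Side-by-side comparison of one prompt across two providers
prompts:
  - "Summarize the following support ticket in two sentences: {{ticket}}"
providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-sonnet-20241022
tests:
  - vars:
      ticket: "My invoice was charged twice last month."
    assert:
      - type: contains
        value: invoice
```

Running `npx promptfoo eval` against a file like this sends the same prompt to both providers and renders the outputs side by side for comparison.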
What AI does well here
- Generate test cases per provider with shared assertions
- Draft assertions for contains, format, and grading-by-judge
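The three assertion types above could be drafted in a single test case roughly as follows; the rubric wording and test values are assumptions for illustration.

```yaml
tests:
  - vars:
      ticket: "My invoice was charged twice last month."
    assert:
      # contains: literal substring check on the output
      - type: contains
        value: refund
      # format: output must parse as valid JSON
      - type: is-json
      # grading-by-judge: an LLM grades the output against a rubric
      - type: llm-rubric
        value: The summary is accurate, polite, and under three sentences.
```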
What AI cannot do
- Decide acceptance thresholds that justify shipping
- Replace human inspection of judge-graded outputs
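One way to keep judge grading reviewable is to pin the grading model explicitly. In Promptfoo the grader can be overridden under `defaultTest.options` (the model id below is an assumption); the pass/fail threshold itself remains a human decision.

```yaml
defaultTest:
  options:
    # Pin the judge model so grading behavior doesn't drift
    # when the provider updates its default model
    provider: openai:gpt-4o-2024-08-06
# What score is "good enough to ship" is set by the prompt owner,
# not the scaffolding AI — sample and inspect judge-graded outputs.
```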
End-of-lesson check
15 questions · take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-tools-promptfoo-config-suite-r9a4-creators
What is Promptfoo primarily designed to do?
- Deploy AI models to production servers
- Generate new prompts automatically based on a topic
- Convert natural language queries into database SQL statements
- Compare the same prompt against multiple AI providers to evaluate quality differences
When AI assists with creating a Promptfoo configuration, which task is within AI's capability?
- Drafting test cases and initial assertions for comparison
- Replacing human review of judge-graded outputs
- Determining what score makes a prompt acceptable to ship
- Deciding which provider your team should adopt permanently
Which three assertion types can AI help draft for Promptfoo tests?
- encrypt, hash, and compress
- contains, format, and grading-by-judge
- login, logout, and session
- deploy, scale, and latency
What is an 'acceptance threshold' in the context of prompt testing?
- A time limit for how long a test can run
- A configuration setting for how many providers to test simultaneously
- A cutoff score or criterion that determines whether a prompt passes or fails testing
- A file that lists which prompts to include in a test suite
Why must humans, not AI, decide acceptance thresholds for prompt testing?
- AI cannot read configuration files
- Thresholds are illegal to automate
- Thresholds require understanding of real-world requirements and risk tolerance that AI lacks
- AI always produces perfect scores
What warning does the lesson provide about judge model graders?
- They cannot be used with multiple providers
- They eliminate the need for any human review
- Their grading behavior can change when the underlying model is updated
- They always produce identical scores regardless of context
In Promptfoo terminology, what does 'side-by-side' testing mean?
- Testing prompts one after another in sequence
- Running identical prompts against multiple providers simultaneously to compare outputs
- Placing two test files next to each other in a folder
- Writing prompts and tests in adjacent code blocks
What is the recommended practice when using a judge model in Promptfoo?
- Pin the judge model version and periodically review its grading behavior
- Run judge models only in production
- Avoid using judge models for any reason
- Use the latest available version at all times
What does it mean to 'scaffold' a Promptfoo configuration?
- AI generates an initial draft with prompts, providers, tests, and assertions that humans then refine
- Delete existing test cases to start fresh
- Deploy the configuration to a live server
- Manually type every line of configuration from scratch
What can AI do regarding assertions in a Promptfoo configuration?
- Decide which assertions should cause permanent production blocks
- Write assertions that pass every test without evaluation
- Eliminate the need for any assertions in testing
- Draft initial assertions for contains, format, and judge-based grading
Which statement about human inspection of judge-graded outputs is correct?
- Humans should defer completely to the judge model's decisions
- Human inspection is required for every single test case, not just samples
- Humans must still inspect judge-graded outputs because AI cannot replace this review
- Human inspection is only needed for contains assertions
What is the prompt owner responsible for in a Promptfoo configuration created with AI assistance?
- Writing all test case text manually
- Designing the user interface of the testing tool
- Choosing which AI model to purchase for the company
- The assertions and acceptance criteria that determine if prompts are shippable
What information should be included when AI scaffolds a Promptfoo configuration?
- Prompts, providers, test cases, assertions, judge model configuration, and output format
- A list of competitors
- Just the provider API keys
- Only the prompts themselves
What is 'regression' in the context of prompt testing?
- A type of assertion that checks for regression testing
- A new feature added to the testing tool
- When a prompt that previously passed now fails or produces worse outputs
- When the testing tool crashes
What happens if you rely solely on AI to set acceptance thresholds without human oversight?
- The tests will automatically improve the prompts
- You might ship prompts that don't meet actual quality requirements for your users
- The CI pipeline will become faster
- There is no downside—AI is reliable for this