The premise
AI can scaffold a Promptfoo configuration with prompts, providers, test cases, and assertions for side-by-side provider comparison.
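Such a scaffold might look like the minimal promptfooconfig.yaml sketch below; the provider ids, prompt text, and ticket variable are illustrative assumptions, not choices mandated by the lesson.

```yaml
# promptfooconfig.yaml — illustrative scaffold; provider ids are assumptions
description: Side-by-side comparison of one prompt across two providers
prompts:
  - "Summarize the following support ticket in two sentences: {{ticket}}"
providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-sonnet-20241022
tests:
  - vars:
      ticket: "My invoice was charged twice last month."
    assert:
      - type: contains
        value: invoice
```

Running `npx promptfoo eval` against a file like this sends the same prompt to both providers and renders the outputs side by side for comparison.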
What AI does well here
- Generate test cases per provider with shared assertions
- Draft assertions for contains, format, and grading-by-judge
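The three assertion types above could be drafted in a single test case roughly as follows; the rubric wording and test values are assumptions for illustration.

```yaml
tests:
  - vars:
      ticket: "My invoice was charged twice last month."
    assert:
      # contains: literal substring check on the output
      - type: contains
        value: refund
      # format: output must parse as valid JSON
      - type: is-json
      # grading-by-judge: an LLM grades the output against a rubric
      - type: llm-rubric
        value: The summary is accurate, polite, and under three sentences.
```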
What AI cannot do
- Decide acceptance thresholds that justify shipping
- Replace human inspection of judge-graded outputs
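One way to keep judge grading reviewable is to pin the grading model explicitly. In Promptfoo the grader can be overridden under `defaultTest.options` (the model id below is an assumption); the pass/fail threshold itself remains a human decision.

```yaml
defaultTest:
  options:
    # Pin the judge model so grading behavior doesn't drift
    # when the provider updates its default model
    provider: openai:gpt-4o-2024-08-06
# What score is "good enough to ship" is set by the prompt owner,
# not the scaffolding AI — sample and inspect judge-graded outputs.
```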
End-of-lesson check
15 questions · take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-tools-promptfoo-config-suite-r9a4-creators
What is Promptfoo primarily designed to do?
- Deploy AI models to production servers
- Generate new prompts automatically based on a topic
- Convert natural language queries into database SQL statements
- Compare the same prompt against multiple AI providers to evaluate quality differences
When AI assists with creating a Promptfoo configuration, which task is within AI's capability?
- Drafting test cases and initial assertions for comparison
- Replacing human review of judge-graded outputs
- Determining what score makes a prompt acceptable to ship
- Deciding which provider your team should adopt permanently
Which three assertion types can AI help draft for Promptfoo tests?
- encrypt, hash, and compress
- contains, format, and grading-by-judge
- login, logout, and session
- deploy, scale, and latency
What is an 'acceptance threshold' in the context of prompt testing?
- A time limit for how long a test can run
- A configuration setting for how many providers to test simultaneously
- A cutoff score or criterion that determines whether a prompt passes or fails testing
- A file that lists which prompts to include in a test suite
Why must humans, not AI, decide acceptance thresholds for prompt testing?
- AI cannot read configuration files
- Thresholds are illegal to automate
- Thresholds require understanding of real-world requirements and risk tolerance that AI lacks
- AI always produces perfect scores
What warning does the lesson provide about judge model graders?
- They cannot be used with multiple providers
- They eliminate the need for any human review
- Their grading behavior can change when the underlying model is updated
- They always produce identical scores regardless of context
In Promptfoo terminology, what does 'side-by-side' testing mean?
- Testing prompts one after another in sequence
- Running identical prompts against multiple providers simultaneously to compare outputs
- Placing two test files next to each other in a folder
- Writing prompts and tests in adjacent code blocks
What is the recommended practice when using a judge model in Promptfoo?
- Pin the judge model version and periodically review its grading behavior
- Run judge models only in production
- Avoid using judge models for any reason
- Use the latest available version at all times
What does it mean to 'scaffold' a Promptfoo configuration?
- AI generates an initial draft with prompts, providers, tests, and assertions that humans then refine
- Delete existing test cases to start fresh
- Deploy the configuration to a live server
- Manually type every line of configuration from scratch
What can AI do regarding assertions in a Promptfoo configuration?
- Decide which assertions should cause permanent production blocks
- Write assertions that pass every test without evaluation
- Eliminate the need for any assertions in testing
- Draft initial assertions for contains, format, and judge-based grading
Which statement about human inspection of judge-graded outputs is correct?
- Humans should defer completely to the judge model's decisions
- Human inspection is required for every single test case, not just samples
- Humans must still inspect judge-graded outputs because AI cannot replace this review
- Human inspection is only needed for contains assertions
What is the prompt owner responsible for in a Promptfoo configuration created with AI assistance?
- Writing all test case text manually
- Designing the user interface of the testing tool
- Choosing which AI model to purchase for the company
- The assertions and acceptance criteria that determine if prompts are shippable
What information should be included when AI scaffolds a Promptfoo configuration?
- Prompts, providers, test cases, assertions, judge model configuration, and output format
- A list of competitors
- Just the provider API keys
- Only the prompts themselves
What is 'regression' in the context of prompt testing?
- A type of assertion that checks for regression testing
- A new feature added to the testing tool
- When a prompt that previously passed now fails or produces worse outputs
- When the testing tool crashes
What happens if you rely solely on AI to set acceptance thresholds without human oversight?
- The tests will automatically improve the prompts
- You might ship prompts that don't meet actual quality requirements for your users
- The CI pipeline will become faster
- There is no downside—AI is reliable for this