Agent quality requires trajectory-level evaluation; step-by-step accuracy misses the actual outcome.
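To see why, consider a toy comparison of the two metrics. The sketch below is illustrative only: the Run record and the sample numbers are made up for this example, not drawn from any real evaluation harness.

```python
# Minimal sketch: high per-step accuracy can coexist with low task completion.
# The Run record and the sample numbers are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class Run:
    steps_total: int      # steps the agent executed
    steps_correct: int    # steps judged correct in isolation
    task_completed: bool  # did the run achieve the user's goal?

runs = [
    Run(steps_total=50, steps_correct=50, task_completed=True),
    Run(steps_total=48, steps_correct=47, task_completed=False),  # one bad step derails the task
    Run(steps_total=52, steps_correct=52, task_completed=False),  # every step fine, wrong overall plan
]

step_accuracy = sum(r.steps_correct for r in runs) / sum(r.steps_total for r in runs)
completion_rate = sum(r.task_completed for r in runs) / len(runs)

print(f"step accuracy:   {step_accuracy:.1%}")   # ~99%
print(f"completion rate: {completion_rate:.1%}") # ~33%
```

With these made-up runs, per-step accuracy comes out above 99% even though only one run in three actually completed its task, which is exactly the gap trajectory-level evaluation is meant to expose.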
What AI does well here
- Evaluate task-completion rate (did the agent finish what was asked)
- Evaluate trajectory quality (was the path reasonable)
- Compare agent decisions to human-judgment ground truth on representative tasks
- Track quality over time as the system is updated (see the sketch after this list)
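Taken together, the checklist above amounts to a small evaluation loop. The sketch below is a minimal illustration under stated assumptions: the record fields, the path-overlap proxy for trajectory quality, and the 80% action-trigger threshold are all hypothetical, not part of any particular framework.

```python
# Minimal sketch of the evaluation loop above. Record fields, the path-overlap
# proxy for trajectory quality, and the 80% trigger threshold are illustrative
# assumptions, not part of any specific framework.
from collections import defaultdict

def evaluate(records):
    """records: list of dicts with hypothetical keys
    'month', 'completed', 'agent_path', 'expert_path'."""
    by_month = defaultdict(list)
    for r in records:
        by_month[r["month"]].append(r)

    for month, batch in sorted(by_month.items()):
        # Task-completion rate: did the agent finish what was asked?
        completion = sum(r["completed"] for r in batch) / len(batch)
        # Trajectory-quality proxy: overlap between the agent's path and the
        # expert's path on the same task (the human-judgment ground truth).
        agreement = sum(
            len(set(r["agent_path"]) & set(r["expert_path"])) / max(len(r["expert_path"]), 1)
            for r in batch
        ) / len(batch)
        print(f"{month}: completion={completion:.0%}, trajectory agreement={agreement:.0%}")
        if completion < 0.80:  # example action trigger: quality fell below threshold
            print(f"  -> completion below threshold in {month}; flag for human review")
```

In practice the records would come from a sample of representative tasks that human experts have also solved, so the same loop covers ground-truth comparison and month-over-month tracking.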
What AI cannot do
- Substitute step accuracy for trajectory quality
- Eliminate the human-judgment component of evaluation
- Predict trajectory quality from training data alone
End-of-lesson check
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-agentic-agent-quality-evaluation-creators
1. A developer reports their agent achieves 99% step accuracy but only completes 40% of assigned tasks. What does this most likely indicate about evaluating the agent?
   a. Step accuracy is an unreliable metric when the agent works in unfamiliar domains
   b. Single-step accuracy fails to capture whether the agent achieves its ultimate goal
   c. The agent requires more training on individual action sequences
   d. Step accuracy measurements should be replaced with outcome-based measurements only
2. Which of the following best describes trajectory quality in agent evaluation?
   a. The percentage of individual actions the agent executes without errors
   b. The sequence and logic of steps an agent takes to complete a task
   c. The final output produced by the agent regardless of how it was generated
   d. The speed at which the agent completes each step in a process
3. A product team implements a system that compares their agent's decisions to expert human decisions on 500 representative tasks. What component of agent evaluation design is this addressing?
   a. Task completion measurement
   b. Trajectory quality assessment
   c. Human-judgment ground truth
   d. Action triggers for quality drops
4. Why is it impossible for AI systems to fully eliminate human judgment from agent evaluation?
   a. Humans are needed to execute the agent's actions
   b. AI cannot measure its own performance without external reference
   c. Human judgment defines what 'quality' means for most real-world tasks
   d. Training data always contains errors that require human correction
5. A company deploys an agent system and monitors its performance metrics monthly. They notice task-completion rates dropped from 87% to 64% over three months. What evaluation component is this demonstrating?
   a. Trajectory quality assessment
   b. Human-judgment ground truth comparison
   c. Quality tracking over time
   d. Action trigger implementation
6. What is the primary limitation of predicting trajectory quality from training data alone?
   a. Training data is always too small to be useful
   b. Trajectory quality depends on how the agent generalizes to novel situations, which training data cannot fully capture
   c. AI models cannot analyze their own training data
   d. Training data only measures step-level accuracy, not trajectory outcomes
7. When designing an agent quality evaluation system, what is the purpose of defining action triggers?
   a. To automatically correct agent errors in real time
   b. To define what happens when quality metrics fall below acceptable thresholds
   c. To determine which tasks to include in evaluation sampling
   d. To establish baseline accuracy levels for individual steps
8. An agent takes 47 steps to complete a task that a human expert completes in 12 steps, yet both achieve the same correct outcome. Which evaluation metric would highlight this difference?
   a. Task completion rate
   b. Step accuracy
   c. Trajectory quality
   d. Human-judgment ground truth
9. What does sampling methodology address in agent evaluation design?
   a. How to measure individual step performance
   b. Which tasks to evaluate and how to select them for assessment
   c. When to trigger human review of agent outputs
   d. How to compare agent trajectories to human decision paths
10. A developer claims their agent evaluation system requires no human involvement because the system automatically calculates accuracy scores. What fundamental flaw exists in this approach?
   a. Accuracy scores cannot be calculated automatically for agent tasks
   b. The system lacks a definition of what constitutes accurate or high-quality behavior
   c. Automated scoring always produces biased results
   d. Human involvement is only needed for step-level evaluation, not trajectory evaluation
11. Why is task-completion rate insufficient as a standalone metric for agent quality?
   a. Task completion cannot be measured reliably across different agent types
   b. An agent can complete tasks through inefficient or undesirable methods
   c. Task completion is always 100% for well-designed agents
   d. Completion rate ignores whether the agent used appropriate resources
12. What distinguishes trajectory-level evaluation from step-by-step evaluation?
   a. Trajectory evaluation requires human reviewers, while step evaluation is fully automated
   b. Trajectory evaluation assesses the entire task path, while step evaluation focuses on individual actions
   c. Trajectory evaluation is only for simple tasks, while step evaluation handles complexity
   d. Trajectory evaluation measures output quality, while step evaluation measures process quality
13. A team evaluates their agent on 1,000 tasks and finds it completes 92% successfully. However, when humans review the successful trajectories, they flag 34% as having serious flaws in reasoning. What does this reveal about relying only on task-completion rate?
   a. Task completion rate overestimates quality because it ignores trajectory reasoning flaws
   b. Task completion rate is useless for agent evaluation
   c. The agent requires more training on the 8% of failed tasks
   d. The human reviewers are incorrect in their assessments
14. When designing an agent evaluation system, what is the relationship between task completion measurement and trajectory quality assessment?
   a. They are interchangeable; measuring one eliminates the need for the other
   b. Task completion measures outcomes while trajectory quality measures process; both are necessary for comprehensive evaluation
   c. Trajectory quality assessment makes task completion measurement redundant
   d. Task completion should always be measured before trajectory quality assessment
15. A team selects evaluation tasks that are all very similar to examples in the agent's training data. Why might this sampling approach produce misleading quality results?
   a. Similar tasks always produce accurate agent behavior
   b. The agent may perform well on familiar tasks but poorly on novel situations it hasn't encountered
   c. Training data examples are always poor quality benchmarks
   d. Similar tasks cannot be used for evaluation because they introduce bias