Tendril
Knowledge check · 15 questions
Tests understanding of tool-use evaluation design, AI's capabilities and limitations in creating benchmarks, and the importance of inspecting failure cases.
Tool-Use Evaluation: Building Reliable Agent Benchmarks — Quick Check