Loading lesson…
Numbers on leaderboards are seductive and often wrong. Learn the big benchmarks, their leaderboard positions, their recently-exposed cheats, and how to run your own evals.
| Benchmark | Measures | April 2026 leader |
|---|---|---|
| SWE-bench Verified | Fixing real GitHub issues end-to-end. | Claude Opus 4.7 (~87.6%). |
| WebArena | Multi-step web navigation tasks. | Competitive among OpenAI/Anthropic/Google. |
| GAIA | General assistant tasks (multi-modal). | Claude Sonnet 4.5 at 74.6% (Princeton HAL). |
| OSWorld | Full desktop GUI usage. | Claude Sonnet 4.6 (~72.5%). |
| TAU-bench | Tool-using customer service. | Varies by domain. |
| AgentBench | Multi-domain agent tasks. | Varies by task. |
# Minimal agent eval harness import json from dataclasses import dataclass from statistics import mean @dataclass class Case: id: str goal: str check: callable # returns (pass: bool, reason: str) eval_set: list[Case] = [ Case("email-001", "Find the most recent invoice from Acme in my inbox and extract the total.", lambda out: (out["total"] == 4231.50, "wrong total")), # 49 more real cases from your actual work ] def run_agent(case: Case) -> dict: # invoke your agent with case.goal; return structured output results = [] for case in eval_set: try: out = run_agent(case) ok, reason = case.check(out) except Exception as e: ok, reason = False, f"error: {e}" results.append({"id": case.id, "pass": ok, "reason": reason, "cost": out.get("cost", 0), "steps": out.get("steps", 0)}) print(f"Pass rate: {mean(r['pass'] for r in results):.0%}") print(f"Avg cost: ${mean(r['cost'] for r in results):.3f}") print(f"Avg steps: {mean(r['steps'] for r in results):.1f}")A bare-bones eval harness you can write in an afternoon. 50 cases from your own work beat 500 from a public benchmark.| Metric | Why it matters |
|---|---|
| Pass rate | Obvious. But without others, misleading. |
| Avg cost per task | Separates viable prod agents from demos. |
| P95 latency | Distinguishes responsive from 'grab a coffee' agents. |
| Step count | Short solutions > long meandering ones. |
| Escalation rate | How often did the agent punt to a human? |
| Recoverable-failure rate | On fail, could a retry fix it? |
| Unrecoverable-failure rate | Catastrophic: data loss, wrong actions. |
| Reproducibility | Run the same case 10x — same result? |
When 'correct' is subjective (written summaries, code quality, tone), use a strong model as the judge with a rubric. Anthropic's Claude Opus 4.7 and OpenAI's GPT-5 are current standards. Mix judges across providers to reduce bias.
The production agent teams that consistently ship aren't the ones with the highest SWE-bench score — they're the ones with the most honest internal eval set.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-agentic-evaluation-creators
What is the main idea of "Evaluating Agent Performance: SWE-bench, WebArena, GAIA"?
Which concept is most central to "Evaluating Agent Performance: SWE-bench, WebArena, GAIA"?
Which use of AI fits this topic best?
What should a careful learner remember about "April 2026 — every major benchmark was jailbroken"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about SWE-bench be treated?
Name one way to verify an AI answer about SWE-bench.
Which action would help you apply "Evaluating Agent Performance: SWE-bench, WebArena, GAIA" responsibly?