Lesson 53 of 2116
Evaluating Agent Performance: SWE-bench, WebArena, GAIA
Numbers on leaderboards are seductive and often wrong. Learn the big benchmarks, their leaderboard positions, their recently-exposed cheats, and how to run your own evals.
Lesson map
What this lesson covers, in order:
1. The big benchmarks at a glance
2. SWE-bench
3. WebArena
4. GAIA
Section 1
The big benchmarks at a glance
Compare the options
| Benchmark | Measures | April 2026 leader |
|---|---|---|
| SWE-bench Verified | Fixing real GitHub issues end-to-end. | Claude Opus 4.7 (~87.6%). |
| WebArena | Multi-step web navigation tasks. | Competitive among OpenAI/Anthropic/Google. |
| GAIA | General assistant tasks (multi-modal). | Claude Sonnet 4.5 at 74.6% (Princeton HAL). |
| OSWorld | Full desktop GUI usage. | Claude Sonnet 4.6 (~72.5%). |
| TAU-bench | Tool-using customer service. | Varies by domain. |
| AgentBench | Multi-domain agent tasks. | Varies by task. |
What each benchmark actually tests
- SWE-bench Verified: 500 curated Python GitHub issues. The agent must produce a patch that makes the issue's tests pass. Tests coding + repo navigation + test execution (a grading sketch follows this list).
- WebArena: 812 tasks in self-hosted clones of Reddit, GitLab, e-commerce, CMS. Tests multi-page web flows.
- GAIA: 466 real-world assistant tasks across three difficulty tiers. Heavy on multi-modal and tool use.
- OSWorld: Desktop control tasks across Ubuntu apps — the hardest GUI benchmark until its 2026 exploit.
- TAU-bench: Dialogue + tool calls for customer service domains (airline, retail).
- Terminal-Bench: Completing real CLI tasks from natural-language descriptions.
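To make the SWE-bench row concrete, here is a rough sketch of how SWE-bench-style grading works: apply the agent's patch to a checkout of the repo, run the tests the issue says should flip from failing to passing, and confirm the rest of the suite still passes. This is an illustration, not the official harness; the function names, the fail_to_pass/pass_to_pass argument names, and the pytest invocation are assumptions made for the sketch.
# Rough sketch of SWE-bench-style grading (illustrative, not the official harness)
import subprocess

def run_tests(repo_dir: str, test_ids: list[str]) -> bool:
    # True if every listed test passes in the checked-out repo.
    result = subprocess.run(["python", "-m", "pytest", *test_ids],
                            cwd=repo_dir, capture_output=True, text=True)
    return result.returncode == 0

def grade_patch(repo_dir: str, patch_file: str,
                fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    # A patch that does not apply cleanly scores zero.
    if subprocess.run(["git", "apply", patch_file], cwd=repo_dir).returncode != 0:
        return False
    # The issue's failing tests must now pass, and existing tests must not regress.
    return run_tests(repo_dir, fail_to_pass) and run_tests(repo_dir, pass_to_pass)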
Why public scores mislead
1. Harness variance — two implementations can score 15 points apart on the same model.
2. Model version sloppiness — a label like 'Claude 3.5 Sonnet' covers multiple model releases.
3. Benchmark leakage — training-data contamination (SWE-bench draws from public repos).
4. Hidden cost per task — a 90% scorer that burns $50 per task isn't production-viable (see the sketch after this list).
5. Exploitability — as the Berkeley paper showed, some benchmarks can be gamed.
6. Leaderboard gaming — fine-tunes that overfit specific tasks.
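Point 4 is easy to quantify: divide average cost per task by pass rate to get cost per solved task, and the ranking can flip relative to the leaderboard. A minimal sketch, with made-up numbers:
# Cost per *solved* task often ranks agents differently than raw pass rate.
# The numbers below are illustrative, not real leaderboard data.
agents = {
    "agent_a": {"pass_rate": 0.90, "cost_per_task": 50.00},
    "agent_b": {"pass_rate": 0.72, "cost_per_task": 1.20},
}

for name, s in agents.items():
    cost_per_solved = s["cost_per_task"] / s["pass_rate"]
    print(f"{name}: {s['pass_rate']:.0%} pass rate, ${cost_per_solved:.2f} per solved task")

# agent_a: 90% pass rate, $55.56 per solved task
# agent_b: 72% pass rate, $1.67 per solved task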
Build your own eval — the only honest metric
A bare-bones eval harness you can write in an afternoon. 50 cases from your own work beat 500 from a public benchmark.
# Minimal agent eval harness
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class Case:
    id: str
    goal: str
    check: Callable[[dict], tuple[bool, str]]  # returns (passed: bool, reason: str)

eval_set: list[Case] = [
    Case("email-001",
         "Find the most recent invoice from Acme in my inbox and extract the total.",
         lambda out: (out["total"] == 4231.50, "wrong total")),
    # ... 49 more real cases from your actual work
]

def run_agent(case: Case) -> dict:
    # invoke your agent with case.goal; return structured output
    ...

results = []
for case in eval_set:
    out = {}  # so the cost/steps lookups below still work if run_agent raises
    try:
        out = run_agent(case)
        ok, reason = case.check(out)
    except Exception as e:
        ok, reason = False, f"error: {e}"
    results.append({"id": case.id, "pass": ok, "reason": reason,
                    "cost": out.get("cost", 0), "steps": out.get("steps", 0)})

print(f"Pass rate: {mean(r['pass'] for r in results):.0%}")
print(f"Avg cost: ${mean(r['cost'] for r in results):.3f}")
print(f"Avg steps: {mean(r['steps'] for r in results):.1f}")

What to measure beyond pass rate
Compare the options
| Metric | Why it matters |
|---|---|
| Pass rate | Obvious. But without others, misleading. |
| Avg cost per task | Separates viable prod agents from demos. |
| P95 latency | Distinguishes responsive from 'grab a coffee' agents. |
| Step count | Short solutions > long meandering ones. |
| Escalation rate | How often did the agent punt to a human? |
| Recoverable-failure rate | On fail, could a retry fix it? |
| Unrecoverable-failure rate | Catastrophic: data loss, wrong actions. |
| Reproducibility | Run the same case 10x — same result? |
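Two of these metrics trip people up in practice: P95 latency and reproducibility. A minimal sketch of both, building on the harness above. It assumes each result record also carries a "latency" field in seconds (the harness as written records only cost and steps), and it reuses Case and run_agent from that harness.
# P95 latency and reproducibility, computed from the harness's per-run records.
# Assumes each record also has a "latency" field (not collected by the harness above).
from statistics import quantiles

def p95_latency(results: list[dict]) -> float:
    # 95th-percentile latency in seconds (needs at least two records).
    return quantiles([r["latency"] for r in results], n=20)[-1]

def reproducibility(case: Case, runs: int = 10) -> float:
    # Run the same case repeatedly; return the fraction that agree with the modal outcome.
    outcomes = []
    for _ in range(runs):
        try:
            ok, _ = case.check(run_agent(case))
        except Exception:
            ok = False
        outcomes.append(ok)
    return outcomes.count(max(set(outcomes), key=outcomes.count)) / runs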
LLM-as-judge for open-ended output
When 'correct' is subjective (written summaries, code quality, tone), use a strong model as the judge with a rubric. Anthropic's Claude Opus 4.7 and OpenAI's GPT-5 are current standards. Mix judges across providers to reduce bias.
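A minimal sketch of a rubric-driven judge follows. call_model is a placeholder for whichever provider client you use, and the rubric, the JSON reply format, and the 1-to-5 scale are illustrative choices, not a standard.
# Minimal LLM-as-judge sketch. `call_model` is a placeholder for your provider's
# chat API; the rubric, reply format, and 1-5 scale are illustrative.
import json

RUBRIC = """Score the summary from 1 (unusable) to 5 (excellent) on:
- factual accuracy against the source
- coverage of the key points
- tone appropriate for the intended reader
Reply with JSON only: {"score": <int>, "justification": "<one sentence>"}"""

def call_model(provider: str, prompt: str) -> str:
    # Placeholder: route to the provider's API and return the model's raw reply text.
    ...

def judge(source: str, summary: str, providers=("anthropic", "openai")) -> float:
    # Average the rubric score across judges from different providers to reduce bias.
    scores = []
    for provider in providers:
        prompt = f"{RUBRIC}\n\nSOURCE:\n{source}\n\nSUMMARY:\n{summary}"
        scores.append(json.loads(call_model(provider, prompt))["score"])
    return sum(scores) / len(scores)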
The production agent teams that consistently ship aren't the ones with the highest SWE-bench score — they're the ones with the most honest internal eval set.
Related lessons
Keep going
Creators · 52 min
Red-Teaming Agents: Injection, Escalation, Exfil
An agent is a new attack surface. Prompt injection, privilege escalation, data exfiltration — these are no longer theoretical. Learn the attacks and the defenses.
Creators · 75 min
Capstone: Build and Ship a Real Agent
Everything comes together. Design, code, test, secure, and ship a production-quality agent with open-source code you can fork today.
Creators · 50 min
Tool Use at the API Level: The Primitive
Underneath every agent framework is the same primitive — the model returns a structured tool call, you execute it, you feed the result back. Master this loop and every framework looks familiar.
