Evaluating Agent Performance: SWE-bench, WebArena, GAIA

Numbers on leaderboards are seductive and often wrong. Learn the big benchmarks, who currently leads them, how they were recently gamed, and how to run your own evals.

50 min · Reviewed 2026

The big benchmarks at a glance

Benchmark	Measures	April 2026 leader
SWE-bench Verified	Fixing real GitHub issues end-to-end.	Claude Opus 4.7 (~87.6%).
WebArena	Multi-step web navigation tasks.	Competitive among OpenAI/Anthropic/Google.
GAIA	General assistant tasks (multi-modal).	Claude Sonnet 4.5 at 74.6% (Princeton HAL).
OSWorld	Full desktop GUI usage.	Claude Sonnet 4.6 (~72.5%).
TAU-bench	Tool-using customer service.	Varies by domain.
AgentBench	Multi-domain agent tasks.	Varies by task.

What each benchmark actually tests

SWE-bench Verified: 500 curated Python GitHub issues. Agent must produce a passing patch. Tests coding + repo navigation + test execution.
WebArena: 812 tasks in self-hosted clones of Reddit, GitLab, e-commerce, CMS. Tests multi-page web flows.
GAIA: 466 real-world assistant tasks across three difficulty tiers. Heavy on multi-modal and tool use.
OSWorld: Desktop control tasks across Ubuntu apps — the hardest GUI benchmark until its 2026 exploit.
TAU-bench: Dialogue + tool calls for customer service domains (airline, retail).
Terminal-Bench: Completing real CLI tasks from natural-language descriptions.
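
To make "a passing patch" concrete: SWE-bench-style grading is usually described as running two groups of tests after the patch is applied, those that failed before the fix and must now pass, and those that already passed and must not break. The sketch below illustrates that rule only. The function name `is_resolved` and the dict-of-results input are assumptions for illustration, not the official harness API.

```python
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Return True only if every listed test passes after the patch.

    test_results maps test name -> passed (hypothetical shape).
    A test missing from the results counts as a failure.
    """
    fixed = all(test_results.get(t, False) for t in fail_to_pass)
    unbroken = all(test_results.get(t, False) for t in pass_to_pass)
    return fixed and unbroken


# Illustrative results: the bug test now passes, but one existing test broke.
results = {"test_bug": True, "test_existing_a": True, "test_existing_b": False}
fixes_bug_only = is_resolved(results, ["test_bug"], ["test_existing_a"])
breaks_other = is_resolved(results, ["test_bug"],
                           ["test_existing_a", "test_existing_b"])
print(fixes_bug_only, breaks_other)
```

A patch that fixes the issue but breaks an existing test does not count as resolved, which is why repo navigation and test execution matter as much as the code change itself.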

Why public scores mislead

Harness variance: two harness implementations can score 15 points apart on the same model.
Model version sloppiness: a name like 'Claude 3.5 Sonnet' covers more than one dated release, and reports often don't say which.
Benchmark leakage: training data contamination (SWE-bench is built from public repos).
Hidden cost per task: a 90% scorer that burns $50 per task isn't production-viable.
Exploitability: as the Berkeley paper showed, some benchmarks can be gamed.
Leaderboard gaming: fine-tunes that overfit specific tasks.
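
The cost point is simple arithmetic once failed attempts are counted: what you pay per solved task is the cost per attempt divided by the pass rate. A minimal sketch follows. `cost_per_solved_task` is a hypothetical helper name, and the second agent's numbers are made up for contrast, not measurements.

```python
def cost_per_solved_task(pass_rate: float, cost_per_attempt: float) -> float:
    """Average spend per solved task, counting the failed attempts too."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempt / pass_rate


# The 90% scorer at $50 per task, versus a hypothetical cheaper agent
# (illustrative numbers only).
expensive = cost_per_solved_task(0.90, 50.00)
cheap = cost_per_solved_task(0.75, 2.00)
print(f"expensive: ${expensive:.2f} per solved task")  # $55.56
print(f"cheap:     ${cheap:.2f} per solved task")      # $2.67
```

On these numbers the lower-scoring agent is roughly twenty times cheaper per solved task, a trade-off a leaderboard score alone does not show.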

Build your own eval — the only honest metric

    # Minimal agent eval harness
    from dataclasses import dataclass
    from statistics import mean
    from typing import Callable

    @dataclass
    class Case:
        id: str
        goal: str
        check: Callable[[dict], tuple[bool, str]]  # returns (passed, reason)

    eval_set: list[Case] = [
        Case("email-001",
             "Find the most recent invoice from Acme in my inbox and extract the total.",
             lambda out: (out.get("total") == 4231.50, "wrong total")),
        # 49 more real cases from your actual work
    ]

    def run_agent(case: Case) -> dict:
        # Invoke your agent with case.goal and return structured output:
        # the fields your checks read, plus "cost" and "steps".
        raise NotImplementedError("wire this up to your agent")

    results = []
    for case in eval_set:
        out = {}  # reset so a crash never reuses the previous case's output
        try:
            out = run_agent(case)
            ok, reason = case.check(out)
        except Exception as e:
            ok, reason = False, f"error: {e}"
        results.append({"id": case.id, "pass": ok,
                        "reason": "" if ok else reason,  # reason only on failure
                        "cost": out.get("cost", 0), "steps": out.get("steps", 0)})

    print(f"Pass rate: {mean(r['pass'] for r in results):.0%}")
    print(f"Avg cost: ${mean(r['cost'] for r in results):.3f}")
    print(f"Avg steps: {mean(r['steps'] for r in results):.1f}")

A bare-bones eval harness you can write in an afternoon. 50 cases from your own work beat 500 from a public benchmark.

What to measure beyond pass rate

Metric	Why it matters
Pass rate	The obvious one, but misleading without the others.
Avg cost per task	Separates viable prod agents from demos.
P95 latency	Distinguishes responsive from 'grab a coffee' agents.
Step count	Short solutions > long meandering ones.
Escalation rate	How often did the agent punt to a human?
Recoverable-failure rate	On fail, could a retry fix it?
Unrecoverable-failure rate	Catastrophic: data loss, wrong actions.
Reproducibility	Run the same case 10x — same result?
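
Several of these metrics fall out of the same run log if each run records a few extra fields. The sketch below assumes a hypothetical record shape (`id`, `pass`, `latency_s`, `escalated`) and uses a nearest-rank percentile. Reproducibility is taken here to mean the share of cases whose repeated runs all agree, which is one reasonable definition among several.

```python
import math
from collections import defaultdict


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(runs: list[dict]) -> dict:
    """Aggregate per-run records into the metrics from the table."""
    outcomes_by_case = defaultdict(list)
    for r in runs:
        outcomes_by_case[r["id"]].append(r["pass"])
    # A case is reproducible if every repeat gave the same pass/fail result.
    consistent = sum(1 for o in outcomes_by_case.values() if len(set(o)) == 1)
    return {
        "pass_rate": sum(r["pass"] for r in runs) / len(runs),
        "p95_latency_s": p95([r["latency_s"] for r in runs]),
        "escalation_rate": sum(r["escalated"] for r in runs) / len(runs),
        "reproducibility": consistent / len(outcomes_by_case),
    }


# Two cases, each run twice (illustrative data).
runs = [
    {"id": "a", "pass": True, "latency_s": 4.0, "escalated": False},
    {"id": "a", "pass": True, "latency_s": 5.0, "escalated": False},
    {"id": "b", "pass": True, "latency_s": 9.0, "escalated": False},
    {"id": "b", "pass": False, "latency_s": 40.0, "escalated": True},
]
summary = summarize(runs)
print(summary)
```

Case "b" passes once and fails once, so it counts toward the pass rate but drags reproducibility down to 50%. A single run would have hidden that.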

LLM-as-judge for open-ended output

When 'correct' is subjective (written summaries, code quality, tone), use a strong model as the judge with a rubric. Anthropic's Claude Opus 4.7 and OpenAI's GPT-5 are current standards. Mix judges across providers to reduce bias.
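
One way to structure this is sketched below, under stated assumptions: each judge is a plain function that takes a rubric and an output and returns integer scores per criterion. The rubric text, the criterion names, and the stub judges are all hypothetical. In practice each judge function would wrap a call to a different provider's model.

```python
from statistics import mean
from typing import Callable

RUBRIC = """Score the summary from 1 (poor) to 5 (excellent) on each criterion:
- accuracy: no claims that contradict the source
- coverage: the key points are present
- tone: appropriate for a customer-facing message
Return one integer per criterion."""

# (rubric, output) -> {criterion: score}
Judge = Callable[[str, str], dict[str, int]]


def judge_output(output: str, judges: dict[str, Judge]) -> dict:
    """Average rubric scores across judges and flag large disagreements."""
    per_judge = {name: judge(RUBRIC, output) for name, judge in judges.items()}
    criteria = list(next(iter(per_judge.values())))
    scores = {c: mean(s[c] for s in per_judge.values()) for c in criteria}
    # Criteria where judges differ by 2+ points go to a human reviewer.
    disputed = [
        c for c in criteria
        if max(s[c] for s in per_judge.values())
        - min(s[c] for s in per_judge.values()) >= 2
    ]
    return {"scores": scores, "disputed": disputed}


# Stub judges standing in for models from two different providers.
def judge_a(rubric: str, output: str) -> dict[str, int]:
    return {"accuracy": 5, "coverage": 4, "tone": 4}


def judge_b(rubric: str, output: str) -> dict[str, int]:
    return {"accuracy": 4, "coverage": 2, "tone": 4}


result = judge_output("Your refund was issued on Monday.",
                      {"provider_a": judge_a, "provider_b": judge_b})
print(result)
```

Averaging smooths out one judge's quirks, and the `disputed` list shows where the judges disagree enough that a person should look.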

The production agent teams that consistently ship aren't the ones with the highest SWE-bench score — they're the ones with the most honest internal eval set.

End-of-lesson check

8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-agentic-evaluation-creators

What is the main idea of "Evaluating Agent Performance: SWE-bench, WebArena, GAIA"?
1. Numbers on leaderboards are seductive and often wrong.
2. Use AI as the final authority for the whole decision
3. Avoid checking the answer once it sounds polished
4. Focus only on speed instead of judgment
Which concept is most central to "Evaluating Agent Performance: SWE-bench, WebArena, GAIA"?
1. WebArena
2. SWE-bench
3. GAIA
4. benchmark hacking
Which use of AI fits this topic best?
1. Let the AI decide what matters without your review
2. Use the answer before checking whether it fits the situation
3. SWE-bench Verified: 500 curated Python GitHub issues. Agent must produce a passing patch. Tests coding + repo navigation + test execution.
4. Treat the AI output as automatically correct
What should a careful learner remember about "April 2026 — every major benchmark was jailbroken"?
1. Use AI to draft or organize ideas about SWE-bench, then verify before acting.
2. Skip the context so the tool can guess faster
3. Treat the output as private even after sharing it online
4. Use the answer without checking the source
You want to use AI after this lesson. What is the safest next step?
1. Act immediately because the AI answer is written clearly
2. Use AI for drafting and comparison, but verify before publishing or relying on it.
3. Hide uncertainty so the final answer looks cleaner
4. Use private or sensitive details before checking permission
How should AI output about SWE-bench be treated?
1. As proof that no other source is needed
2. As a replacement for context, consent, or expert review
3. As a draft or helper output that still needs human judgment and verification
4. As something that becomes correct when it sounds confident
Name one way to verify an AI answer about SWE-bench.
Which action would help you apply "Evaluating Agent Performance: SWE-bench, WebArena, GAIA" responsibly?
1. Use the tool to avoid thinking through the tradeoff
2. Keep going even if the output conflicts with a trusted source
3. Treat the AI output as automatically correct
4. WebArena: 812 tasks in self-hosted clones of Reddit, GitLab, e-commerce, CMS. Tests multi-page web flows.
