Lesson 53 of 2116
Evaluating Agent Performance: SWE-bench, WebArena, GAIA
Numbers on leaderboards are seductive and often wrong. Learn the big benchmarks, their leaderboard positions, their recently-exposed cheats, and how to run your own evals.
Lesson map
What this lesson covers, in order:
1. The big benchmarks at a glance
2. SWE-bench
3. WebArena
4. GAIA
Section 1
The big benchmarks at a glance
Compare the options
| Benchmark | Measures | April 2026 leader |
|---|---|---|
| SWE-bench Verified | Fixing real GitHub issues end-to-end. | Claude Opus 4.7 (~87.6%). |
| WebArena | Multi-step web navigation tasks. | Competitive among OpenAI/Anthropic/Google. |
| GAIA | General assistant tasks (multi-modal). | Claude Sonnet 4.5 at 74.6% (Princeton HAL). |
| OSWorld | Full desktop GUI usage. | Claude Sonnet 4.6 (~72.5%). |
| TAU-bench | Tool-using customer service. | Varies by domain. |
| AgentBench | Multi-domain agent tasks. | Varies by task. |
What each benchmark actually tests
- SWE-bench Verified: 500 curated Python GitHub issues. The agent must produce a patch that makes the issue's tests pass. Tests coding + repo navigation + test execution (a grading sketch follows this list).
- WebArena: 812 tasks in self-hosted clones of Reddit, GitLab, e-commerce, CMS. Tests multi-page web flows.
- GAIA: 466 real-world assistant tasks across three difficulty tiers. Heavy on multi-modal and tool use.
- OSWorld: Desktop control tasks across Ubuntu apps — the hardest GUI benchmark until its 2026 exploit.
- TAU-bench: Dialogue + tool calls for customer service domains (airline, retail).
- Terminal-Bench: Completing real CLI tasks from natural-language descriptions.
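To make the SWE-bench row concrete, here is a rough sketch of how SWE-bench-style grading works: apply the agent's patch to a checkout of the repo, run the tests the issue says should flip from failing to passing, and confirm the rest of the suite still passes. This is an illustration, not the official harness; the function names, the fail_to_pass/pass_to_pass argument names, and the pytest invocation are assumptions made for the sketch.
# Rough sketch of SWE-bench-style grading (illustrative, not the official harness)
import subprocess

def run_tests(repo_dir: str, test_ids: list[str]) -> bool:
    # True if every listed test passes in the checked-out repo.
    result = subprocess.run(["python", "-m", "pytest", *test_ids],
                            cwd=repo_dir, capture_output=True, text=True)
    return result.returncode == 0

def grade_patch(repo_dir: str, patch_file: str,
                fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    # A patch that does not apply cleanly scores zero.
    if subprocess.run(["git", "apply", patch_file], cwd=repo_dir).returncode != 0:
        return False
    # The issue's failing tests must now pass, and existing tests must not regress.
    return run_tests(repo_dir, fail_to_pass) and run_tests(repo_dir, pass_to_pass)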
Why public scores mislead
1. Harness variance — two implementations can score 15 points apart on the same model.
2. Model version sloppiness — a label like 'Claude 3.5 Sonnet' covers multiple model releases.
3. Benchmark leakage — training-data contamination (SWE-bench draws from public repos).
4. Hidden cost per task — a 90% scorer that burns $50 per task isn't production-viable (see the sketch after this list).
5. Exploitability — as the Berkeley paper showed, some benchmarks can be gamed.
6. Leaderboard gaming — fine-tunes that overfit specific tasks.
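Point 4 is easy to quantify: divide average cost per task by pass rate to get cost per solved task, and the ranking can flip relative to the leaderboard. A minimal sketch, with made-up numbers:
# Cost per *solved* task often ranks agents differently than raw pass rate.
# The numbers below are illustrative, not real leaderboard data.
agents = {
    "agent_a": {"pass_rate": 0.90, "cost_per_task": 50.00},
    "agent_b": {"pass_rate": 0.72, "cost_per_task": 1.20},
}

for name, s in agents.items():
    cost_per_solved = s["cost_per_task"] / s["pass_rate"]
    print(f"{name}: {s['pass_rate']:.0%} pass rate, ${cost_per_solved:.2f} per solved task")

# agent_a: 90% pass rate, $55.56 per solved task
# agent_b: 72% pass rate, $1.67 per solved task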
Build your own eval — the only honest metric
A bare-bones eval harness you can write in an afternoon. 50 cases from your own work beat 500 from a public benchmark.
# Minimal agent eval harness
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class Case:
    id: str
    goal: str
    check: Callable[[dict], tuple[bool, str]]  # returns (passed: bool, reason: str)

eval_set: list[Case] = [
    Case("email-001",
         "Find the most recent invoice from Acme in my inbox and extract the total.",
         lambda out: (out["total"] == 4231.50, "wrong total")),
    # ... 49 more real cases from your actual work
]

def run_agent(case: Case) -> dict:
    # invoke your agent with case.goal; return structured output
    ...

results = []
for case in eval_set:
    out = {}  # so the cost/steps lookups below still work if run_agent raises
    try:
        out = run_agent(case)
        ok, reason = case.check(out)
    except Exception as e:
        ok, reason = False, f"error: {e}"
    results.append({"id": case.id, "pass": ok, "reason": reason,
                    "cost": out.get("cost", 0), "steps": out.get("steps", 0)})

print(f"Pass rate: {mean(r['pass'] for r in results):.0%}")
print(f"Avg cost: ${mean(r['cost'] for r in results):.3f}")
print(f"Avg steps: {mean(r['steps'] for r in results):.1f}")

What to measure beyond pass rate
Compare the options
| Metric | Why it matters |
|---|---|
| Pass rate | Obvious. But without others, misleading. |
| Avg cost per task | Separates viable prod agents from demos. |
| P95 latency | Distinguishes responsive from 'grab a coffee' agents. |
| Step count | Short solutions > long meandering ones. |
| Escalation rate | How often did the agent punt to a human? |
| Recoverable-failure rate | On fail, could a retry fix it? |
| Unrecoverable-failure rate | Catastrophic: data loss, wrong actions. |
| Reproducibility | Run the same case 10x — same result? |
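Two of these metrics trip people up in practice: P95 latency and reproducibility. A minimal sketch of both, building on the harness above. It assumes each result record also carries a "latency" field in seconds (the harness as written records only cost and steps), and it reuses Case and run_agent from that harness.
# P95 latency and reproducibility, computed from the harness's per-run records.
# Assumes each record also has a "latency" field (not collected by the harness above).
from statistics import quantiles

def p95_latency(results: list[dict]) -> float:
    # 95th-percentile latency in seconds (needs at least two records).
    return quantiles([r["latency"] for r in results], n=20)[-1]

def reproducibility(case: Case, runs: int = 10) -> float:
    # Run the same case repeatedly; return the fraction that agree with the modal outcome.
    outcomes = []
    for _ in range(runs):
        try:
            ok, _ = case.check(run_agent(case))
        except Exception:
            ok = False
        outcomes.append(ok)
    return outcomes.count(max(set(outcomes), key=outcomes.count)) / runs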
LLM-as-judge for open-ended output
When 'correct' is subjective (written summaries, code quality, tone), use a strong model as the judge with a rubric. Anthropic's Claude Opus 4.7 and OpenAI's GPT-5 are current standards. Mix judges across providers to reduce bias.
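A minimal sketch of a rubric-driven judge follows. call_model is a placeholder for whichever provider client you use, and the rubric, the JSON reply format, and the 1-to-5 scale are illustrative choices, not a standard.
# Minimal LLM-as-judge sketch. `call_model` is a placeholder for your provider's
# chat API; the rubric, reply format, and 1-5 scale are illustrative.
import json

RUBRIC = """Score the summary from 1 (unusable) to 5 (excellent) on:
- factual accuracy against the source
- coverage of the key points
- tone appropriate for the intended reader
Reply with JSON only: {"score": <int>, "justification": "<one sentence>"}"""

def call_model(provider: str, prompt: str) -> str:
    # Placeholder: route to the provider's API and return the model's raw reply text.
    ...

def judge(source: str, summary: str, providers=("anthropic", "openai")) -> float:
    # Average the rubric score across judges from different providers to reduce bias.
    scores = []
    for provider in providers:
        prompt = f"{RUBRIC}\n\nSOURCE:\n{source}\n\nSUMMARY:\n{summary}"
        scores.append(json.loads(call_model(provider, prompt))["score"])
    return sum(scores) / len(scores)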
The production agent teams that consistently ship aren't the ones with the highest SWE-bench score — they're the ones with the most honest internal eval set.
Related lessons
Keep going
Creators · 52 min
Red-Teaming Agents: Injection, Escalation, Exfil
An agent is a new attack surface. Prompt injection, privilege escalation, data exfiltration — these are no longer theoretical. Learn the attacks and the defenses.
Creators · 75 min
Capstone: Build and Ship a Real Agent
Everything comes together. Design, code, test, secure, and ship a production-quality agent with open-source code you can fork today.
Creators · 50 min
Tool Use at the API Level: The Primitive
Underneath every agent framework is the same primitive — the model returns a structured tool call, you execute it, you feed the result back. Master this loop and every framework looks familiar.
