Agent Benchmarks: WebArena, GAIA, OSWorld
LLM benchmarks are about single answers. Agent benchmarks measure multi-step real-world task completion. Very different beast.
Lesson map
The main moves, in order:
1. When the Test Is a Task, Not a Question
2. Agent benchmarks
3. WebArena
4. GAIA
Section 1
When the Test Is a Task, Not a Question
Traditional LLM benchmarks ask a question and score an answer. Agent benchmarks give the model tools — a browser, a terminal, a file system — and score whether it can actually complete a real task. It is a much harder evaluation.
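The contrast can be sketched in a few lines. This is an illustrative toy, not any benchmark's real scoring code: a QA benchmark compares answer strings, while an agent benchmark inspects the state of the environment after the agent is done.

```python
# Toy sketch: scoring an answer vs. scoring a task.
# All names here are illustrative, not from any real benchmark's API.

def score_qa(predicted: str, gold: str) -> bool:
    """Classic LLM benchmark: compare a single answer string."""
    return predicted.strip().lower() == gold.strip().lower()

def score_task(final_state: dict, goal_state: dict) -> bool:
    """Agent benchmark: compare the environment's final state to the goal.
    The agent's intermediate text doesn't matter, only what actually happened."""
    return all(final_state.get(k) == v for k, v in goal_state.items())

# A QA benchmark checks text:
print(score_qa("Paris", "paris"))   # True

# An agent benchmark checks resulting state (extra keys are fine):
goal = {"cart_items": ["SKU-123"], "order_placed": True}
final = {"cart_items": ["SKU-123"], "order_placed": True, "pages_visited": 7}
print(score_task(final, goal))      # True
```

Note the asymmetry: the QA scorer can be fooled by a lucky string, but the task scorer only passes if the environment ended up in the right state.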
The big three, plus two you will hear mentioned alongside them
Compare the options
| Benchmark | Year | Environment | Task type |
|---|---|---|---|
| WebArena | 2023 | Realistic web apps (shop, Reddit clone, maps) | Multi-step web tasks |
| GAIA | 2023 | Web + tools | Knowledge-intensive assistant tasks |
| OSWorld | 2024 | Real Ubuntu + macOS desktops | Cross-app computer tasks |
| SWE-bench | 2023 | GitHub repos | Code bug fixes |
| τ-bench (Tau-bench) | 2024 | Customer service domain | Tool-calling chat agents |
WebArena
Self-hostable clones of an e-commerce store, Reddit, GitLab, and a Wikipedia-style site. Tasks range from 'add this product to the cart and check out' to 'find the most-upvoted post in this subreddit and summarize the top three comments.' Models must click, type, scroll, and read.
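A web task like the checkout example decomposes into a sequence of low-level actions. The primitives below are hypothetical stand-ins; WebArena's real action space is similar in spirit (click, type, scroll, and so on, issued against a live page's accessibility tree):

```python
# Hypothetical sketch of a WebArena-style action trace.
# Element ids and the Action type are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click" | "type" | "scroll"
    target: str        # element id in the page's accessibility tree
    text: str = ""     # only used for "type"

# One plausible trace for "add this product to cart and checkout":
trace = [
    Action("click", "search-box"),
    Action("type",  "search-box", "usb-c cable"),
    Action("click", "result-0"),
    Action("click", "add-to-cart"),
    Action("click", "checkout"),
]
print(len(trace), "steps")  # prints: 5 steps
```

Every action in the trace is a failure point, which is why multi-step success rates are so much lower than single-answer accuracy.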
GAIA
From Meta AI. 466 real-world questions that require multiple tools to solve: reading PDFs, watching videos, running calculations. The tasks are easy for humans, who succeed about 92 percent of the time, and hard for AI, with top systems around 50-65 percent in 2025.
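Despite the multi-step work involved, GAIA grades only the short final answer, roughly by exact match after light normalization. The normalizer below is an illustrative approximation, not GAIA's actual grading code:

```python
# Sketch of GAIA-style final-answer scoring (illustrative normalization,
# not the benchmark's real implementation).
def normalize(ans: str) -> str:
    # Strip whitespace, case, trailing periods, and thousands separators.
    return ans.strip().lower().rstrip(".").replace(",", "")

def gaia_score(predicted: str, gold: str) -> bool:
    return normalize(predicted) == normalize(gold)

print(gaia_score("1,234.", "1234"))  # True: formatting noise is ignored
```

The point of this design: the hard part is the tool use needed to *reach* the answer, so the scoring itself can stay simple and automatic.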
OSWorld
A true operating system. 369 tasks across common apps — LibreOffice, Chrome, VS Code, Thunderbird. Models must complete tasks like 'configure Thunderbird with this IMAP server' or 'format this spreadsheet.' Human baseline is around 72 percent. Frontier models hover in the 20-40 percent range as of 2025-2026.
A typical real-world computer-use task
Example OSWorld task:
Goal: 'In LibreOffice Calc, make the values in
column B that are less than 100 appear in red bold.'
The agent must:
1. Open Calc with the provided file
2. Select column B
3. Navigate Format -> Conditional
4. Configure the rule
5. Save the file
Each step can fail. The final state of the environment is compared to ground truth.

As the GAIA paper puts it: "GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modal handling, web browsing, and generally tool-use proficiency."
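A final-state check for the Calc task above might look like the sketch below. The cell model and function are hypothetical; the real OSWorld evaluator inspects the saved file, but the principle is the same: ignore how the agent clicked through menus and verify only the outcome.

```python
# Hypothetical OSWorld-style outcome check for the Calc task.
# The cell representation is an assumption for illustration.

def check_column_b(cells: dict) -> bool:
    """cells maps refs like 'B1' to dicts like
    {'value': 87, 'color': 'red', 'bold': True}."""
    for ref, cell in cells.items():
        if not ref.startswith("B"):
            continue
        if cell["value"] < 100 and not (cell["color"] == "red" and cell["bold"]):
            return False  # a sub-100 value the agent failed to format
    return True

saved = {
    "B1": {"value": 87,  "color": "red",   "bold": True},
    "B2": {"value": 150, "color": "black", "bold": False},
}
print(check_column_b(saved))  # True: final state matches ground truth
```

An agent that did everything right except clicking Save would fail this check, which is exactly the kind of last-step brittleness these benchmarks expose.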
The big idea: agents are judged by what they do, not what they say. These benchmarks are where AI capability is genuinely being tested now.
