LLM benchmarks score single answers. Agent benchmarks measure multi-step, real-world task completion, which is a very different problem.
Traditional LLM benchmarks ask a question and score an answer. Agent benchmarks give the model tools — a browser, a terminal, a file system — and score whether it can actually complete a real task. It is a much harder evaluation.
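The loop behind that idea can be sketched in a few lines. This is a toy illustration, not the harness of any benchmark listed here: `ToyEnvironment`, `scripted_policy`, and `run_episode` are hypothetical names, the "model" is a fixed script, and the only tools are two file operations. What it shows is that the score comes from the state of the environment after the episode, not from anything the agent says.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyEnvironment:
    """Hypothetical stand-in for a benchmark environment: a tiny file system."""
    files: dict[str, str] = field(default_factory=dict)

    def write_file(self, path: str, text: str) -> str:
        self.files[path] = text
        return f"wrote {path}"

    def read_file(self, path: str) -> str:
        return self.files.get(path, "<missing>")


def scripted_policy(step: int, observation: str) -> tuple[str, tuple]:
    """Stands in for the model: returns (tool name, arguments) or ('stop', ())."""
    plan = [
        ("write_file", ("notes.txt", "hello")),
        ("read_file", ("notes.txt",)),
        ("stop", ()),
    ]
    return plan[min(step, len(plan) - 1)]


def run_episode(
    env: ToyEnvironment,
    policy: Callable[[int, str], tuple[str, tuple]],
    check: Callable[[ToyEnvironment], bool],
    max_steps: int = 10,
) -> bool:
    """Let the policy call tools step by step, then score the final state."""
    tools = {"write_file": env.write_file, "read_file": env.read_file}
    observation = "start"
    for step in range(max_steps):
        tool, args = policy(step, observation)
        if tool == "stop":
            break
        observation = tools[tool](*args)
    # The verdict depends only on what the environment looks like now.
    return check(env)


env = ToyEnvironment()
success = run_episode(
    env, scripted_policy, lambda e: e.files.get("notes.txt") == "hello"
)
print(success)  # True
```

A policy that stops immediately would return `False` under the same check, because the file never gets written.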
| Benchmark | Year | Environment | Task type |
|---|---|---|---|
| WebArena | 2023 | Realistic web apps (shop, Reddit clone, maps) | Multi-step web tasks |
| GAIA | 2023 | Web + tools | Knowledge-intensive assistant tasks |
| OSWorld | 2024 | Real desktop operating systems (Ubuntu, Windows, macOS) | Cross-app computer tasks |
| SWE-bench | 2023 | GitHub repos | Code bug fixes |
| τ-bench (Tau-bench) | 2024 | Customer service domain | Tool-calling chat agents |
WebArena provides self-hostable web apps: an online store, a Reddit-style forum, GitLab, and Wikipedia-style sites. Tasks range from 'add this product to the cart and check out' to 'find the most-upvoted post in this subreddit and summarize the top three comments.' Models must click, type, scroll, and read.
GAIA comes from Meta AI and collaborators. It contains 466 real-world questions that require multiple tools to solve, such as reading PDFs, watching videos, and running calculations. The tasks are easy for humans (about 92 percent correct) and hard for AI (top systems around 50-65 percent in 2025).
OSWorld runs agents inside a real operating system. It has 369 tasks across common apps: LibreOffice, Chrome, VS Code, Thunderbird. Models must complete tasks like 'configure Thunderbird with this IMAP server' or 'format this spreadsheet.' The human baseline is around 72 percent. Frontier models hover in the 20-40 percent range as of 2025-2026.
Example OSWorld task (a typical real-world computer-use task):

Goal: 'In LibreOffice Calc, make the values in column B that are less than 100 appear in red bold.'

The agent must:

1. Open Calc with the provided file
2. Select column B
3. Navigate Format -> Conditional
4. Configure the rule
5. Save the file

Each step can fail. Final state is compared to ground truth.

"GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modal handling, web browsing, and generally tool-use proficiency."
— Mialon et al., GAIA paper (2023)
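The OSWorld example above ends with the final state being compared to ground truth. Here is a minimal sketch of what such a check could look like. It is not OSWorld's actual checker: the `Cell` class and `check_final_state` function are hypothetical, the spreadsheet is modeled as a plain list instead of a saved file, and it assumes cells at or above the threshold should keep default black, non-bold formatting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical model of one spreadsheet cell in column B."""
    value: float
    color: str = "black"
    bold: bool = False


def expected_style(value: float, threshold: float = 100) -> tuple[str, bool]:
    """Ground truth for the task: values below the threshold are red and bold."""
    return ("red", True) if value < threshold else ("black", False)


def check_final_state(column_b: list[Cell], threshold: float = 100) -> bool:
    """Pass only if every cell's formatting matches the ground truth."""
    return all(
        (cell.color, cell.bold) == expected_style(cell.value, threshold)
        for cell in column_b
    )


# An agent that configured the rule correctly.
correct = [Cell(42, "red", True), Cell(100), Cell(250)]
# An agent that set the color but forgot bold: one missed detail fails the task.
partial = [Cell(42, "red", False), Cell(100), Cell(250)]

print(check_final_state(correct))  # True
print(check_final_state(partial))  # False
```

The check never looks at which menus the agent clicked. Any sequence of actions that produces the right final state passes, and a near miss scores the same as no attempt.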
The big idea: agents are judged by what they do, not what they say. These benchmarks are where AI capability is genuinely being tested now.
6 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-agent-benchmarks
What is the main idea of "Agent Benchmarks: WebArena, GAIA, OSWorld"?
Which concept is most central to "Agent Benchmarks: WebArena, GAIA, OSWorld"?
What should a careful learner remember about "Agent benchmarks are brutal"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about agent benchmarks be treated?
Name one way to verify an AI answer about agent benchmarks.