LLM benchmarks are about single answers. Agent benchmarks measure multi-step real-world task completion. Very different beast.
Traditional LLM benchmarks ask a question and score an answer. Agent benchmarks give the model tools — a browser, a terminal, a file system — and score whether it can actually complete a real task. It is a much harder evaluation.
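To make that difference concrete, here is a minimal sketch of an agent-benchmark episode in Python. The `env` and `agent` objects and their methods are hypothetical stand-ins for a benchmark harness and an LLM-backed policy; the point is that scoring happens on the final environment state, not on a single generated answer.

```python
def run_episode(env, agent, max_steps: int = 30) -> bool:
    """Let the agent act with its tools until it stops or the step budget runs out."""
    obs = env.reset()                 # e.g. the starting web page or a desktop screenshot
    for _ in range(max_steps):
        action = agent.decide(obs)    # the model picks a tool call: click, type, run a command...
        obs, done = env.step(action)  # the environment actually executes it
        if done:
            break
    return env.check_success()        # the final state is compared to ground truth
```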
| Benchmark | Year | Environment | Task type |
|---|---|---|---|
| WebArena | 2023 | Realistic web apps (shop, Reddit clone, maps) | Multi-step web tasks |
| GAIA | 2023 | Web + tools | Knowledge-intensive assistant tasks |
| OSWorld | 2024 | Real Ubuntu + macOS desktops | Cross-app computer tasks |
| SWE-bench | 2023 | GitHub repos | Code bug fixes |
| τ-bench (Tau-bench) | 2024 | Customer service domain | Tool-calling chat agents |
WebArena provides self-hostable clones of realistic sites: an online shop, a Reddit-style forum, GitLab, a Wikipedia-like knowledge base, and maps. Tasks range from 'add this product to the cart and check out' to 'find the most-upvoted post in this subreddit and summarize the top three comments.' Models must click, type, scroll, and read.
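The sketch below shows what one step of that interaction might look like in code. It is a hedged illustration, not WebArena's actual harness: `browser`, `llm`, and the action format are hypothetical, and real setups feed the model an accessibility tree or screenshot rather than a plain-text page dump.

```python
def web_task_loop(browser, llm, task: dict, max_steps: int = 25) -> str:
    page_text = browser.goto(task["start_url"])   # open the self-hosted site
    for _ in range(max_steps):
        prompt = (
            f"Task: {task['intent']}\n"
            f"Page:\n{page_text}\n"
            "Reply with one action: click <element>, type <element> <text>, scroll, or stop."
        )
        action = llm.complete(prompt).strip()
        if action == "stop":
            break
        page_text = browser.execute(action)       # click/type/scroll on the live page
    return browser.current_url()                  # checked by the task's evaluator afterwards
```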
GAIA, from Meta AI, poses 466 real-world questions that require multiple tools to solve: reading PDFs, watching videos, running calculations. The tasks are easy for humans (around 92 percent) and hard for AI (top systems around 50-65 percent in 2025).
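Despite the multi-tool setup, GAIA grades only the final short answer, roughly an exact match after light normalization. The helper below is a simplified sketch of that idea, not the official scorer.

```python
def normalize(answer: str) -> str:
    # Lowercase, strip commas and extra whitespace before comparing.
    return " ".join(answer.strip().lower().replace(",", "").split())

def score_gaia(predicted: str, reference: str) -> bool:
    return normalize(predicted) == normalize(reference)

# A question may require browsing, a PDF, and arithmetic,
# but it still reduces to one checkable string.
assert score_gaia("  2,250 km ", "2250 km")
```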
OSWorld is a true operating system: 369 tasks across common apps such as LibreOffice, Chrome, VS Code, and Thunderbird. Models must complete tasks like 'configure Thunderbird with this IMAP server' or 'format this spreadsheet.' The human baseline is around 72 percent; frontier models hover in the 20-40 percent range as of 2025-2026.
Example OSWorld task:
Goal: 'In LibreOffice Calc, make the values in column B that are less than 100 appear in red bold.'
The agent must:
1. Open Calc with the provided file
2. Select column B
3. Navigate Format -> Conditional
4. Configure the rule
5. Save the file
Each step can fail. The final state is compared to ground truth. This is a typical real-world computer-use task.

'GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modal handling, web browsing, and generally tool-use proficiency.'
— Mialon et al., GAIA paper (2023)
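Returning to the OSWorld example: the 'compared to ground truth' step is typically a verifier script that inspects the saved file rather than the agent's transcript. Below is a hedged sketch of such a check; `load_cells` is a hypothetical helper that returns each cell's value and formatting.

```python
def verify_red_bold_under_100(path: str) -> bool:
    """Pass only if every value in column B below 100 is red and bold in the saved file."""
    for cell in load_cells(path, column="B"):              # hypothetical helper
        if cell.value is not None and cell.value < 100:
            if not (cell.font_color == "FF0000" and cell.bold):
                return False                               # one wrong cell fails the whole task
    return True
```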
The big idea: agents are judged by what they do, not what they say. These benchmarks are where AI capability is genuinely being tested now.