Agent Benchmarks: WebArena, GAIA, OSWorld

Standard LLM benchmarks score a single answer. Agent benchmarks score whether a model can complete a multi-step, real-world task, which is a very different problem.

40 min · Reviewed 2026

When the Test Is a Task, Not a Question

Traditional LLM benchmarks ask a question and score an answer. Agent benchmarks give the model tools (a browser, a terminal, a file system) and score whether it can actually complete a real task. This is a much harder test, because success depends on a whole sequence of actions going right.
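The difference in scoring can be sketched in a few lines of Python. This is a toy illustration, not code from any of the benchmarks: `ToyFileSystem`, `scripted_policy`, `run_episode`, and `task_succeeded` are hypothetical names, and the "agent" is a fixed script standing in for a model. The point is that the score comes from inspecting the environment after the episode, not from grading text the agent produced.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFileSystem:
    """Hypothetical stand-in for a benchmark environment: a dict of path -> contents."""
    files: dict = field(default_factory=dict)

    def step(self, action: tuple) -> str:
        """Apply one tool call and return an observation string."""
        name, *args = action
        if name == "write":
            path, text = args
            self.files[path] = text
            return f"wrote {path}"
        if name == "read":
            (path,) = args
            return self.files.get(path, "<missing>")
        return f"unknown action: {name}"


def scripted_policy(observation: str, turn: int):
    """Stand-in for a model: returns the next action, or None when it considers itself done."""
    plan = [("read", "notes.txt"), ("write", "summary.txt", "3 items")]
    return plan[turn] if turn < len(plan) else None


def run_episode(env: ToyFileSystem, policy, max_steps: int = 10) -> int:
    """Observe-act loop with a step budget. Returns the number of actions taken."""
    observation = "start"
    for turn in range(max_steps):
        action = policy(observation, turn)
        if action is None:
            return turn
        observation = env.step(action)
    return max_steps


def task_succeeded(env: ToyFileSystem) -> bool:
    """Score the final state of the environment, not anything the agent said."""
    return env.files.get("summary.txt") == "3 items"


env = ToyFileSystem(files={"notes.txt": "apples, pears, plums"})
steps_taken = run_episode(env, scripted_policy)
print(steps_taken, task_succeeded(env))
```

A real benchmark replaces the dict with a browser or a virtual machine and the script with a model, but the shape of the loop and the state-based check stay the same.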

The big three

This lesson covers the first three rows in detail. SWE-bench and τ-bench are listed for comparison.

Benchmark	Year	Environment	Task type
WebArena	2023	Realistic web apps (shop, Reddit clone, maps)	Multi-step web tasks
GAIA	2023	Web + tools	Knowledge-intensive assistant tasks
OSWorld	2024	Real desktop OS (mostly Ubuntu; also Windows, macOS)	Cross-app computer tasks
SWE-bench	2023	GitHub repos	Code bug fixes
τ-bench (Tau-bench)	2024	Customer service domain	Tool-calling chat agents

WebArena

Self-hostable web apps modeled on real sites: an online store, a Reddit-style forum, GitLab, a map, and a copy of Wikipedia. Tasks range from 'add this product to cart and checkout' to 'find the most-upvoted post in this subreddit and summarize the top three comments.' Models must click, type, scroll, and read.

GAIA

Developed by researchers at Meta AI, Hugging Face, and AutoGPT. It contains 466 real-world questions that require multiple tools to solve, such as reading PDFs, watching videos, and running calculations. The tasks are easy for humans (about 92 percent) and hard for AI (top systems around 50-65 percent in 2025).
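GAIA questions have short, factual answers, so a final answer can be checked automatically even though the work to reach it involved many tool calls. Below is a minimal sketch of that kind of lenient matching. The normalization rules are assumptions chosen for illustration, and `normalize` and `answers_match` are hypothetical helpers, not GAIA's official scorer.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase, trim, drop trailing periods, and collapse whitespace."""
    text = answer.strip().lower().rstrip(".")
    return re.sub(r"\s+", " ", text)


def answers_match(predicted: str, expected: str) -> bool:
    """Compare as numbers when both parse as numbers, otherwise as normalized strings."""
    def to_number(text: str):
        try:
            # Strip common formatting so "1,024" and "$1024" parse as 1024.
            return float(text.replace(",", "").replace("$", "").replace("%", ""))
        except ValueError:
            return None

    p, e = to_number(predicted.strip()), to_number(expected.strip())
    if p is not None and e is not None:
        return abs(p - e) < 1e-9
    return normalize(predicted) == normalize(expected)


print(answers_match("1,024", "1024"))
print(answers_match(" Paris. ", "paris"))
print(answers_match("Lyon", "Paris"))
```

The check only sees the final answer. An agent that browsed the right pages but reported the wrong number still gets no credit.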

OSWorld

A real operating system, not a simulation of one. It has 369 tasks across common apps such as LibreOffice, Chrome, VS Code, and Thunderbird. Models must complete tasks like 'configure Thunderbird with this IMAP server' or 'format this spreadsheet.' Human baseline is around 72 percent. Frontier models hover in the 20-40 percent range as of 2025-2026.

Example OSWorld task:

Goal: 'In LibreOffice Calc, make the values in
column B that are less than 100 appear in red bold.'

The agent must:
  1. Open Calc with the provided file
  2. Select column B
  3. Navigate Format -> Conditional
  4. Configure the rule
  5. Save the file

Each step can fail. The final state is compared to the ground truth.

A typical real-world computer-use task
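One way to see why "each step can fail" matters: if every step succeeded independently with the same probability, the chance of finishing the whole task would be that probability raised to the number of steps. Independence and a uniform per-step rate are simplifying assumptions made here for illustration, and `task_success_probability` is a hypothetical helper. Real agents can sometimes notice a mistake and recover.

```python
def task_success_probability(per_step_success: float, num_steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    return per_step_success ** num_steps


# The Calc example above lists five steps.
for p in (0.99, 0.9, 0.8):
    print(f"per-step {p:.2f} -> whole task {task_success_probability(p, 5):.2f}")
```

Under these assumptions, an agent that gets each step right 90 percent of the time completes the five-step task only about 59 percent of the time.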

GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency.
— Mialon et al., GAIA paper (2023)

The big idea: agents are judged by what they do, not what they say. These benchmarks are where AI capability is genuinely being tested now.

End-of-lesson check

15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-agent-benchmarks

What is the core idea behind "Agent Benchmarks: WebArena, GAIA, OSWorld"?
1. LLM benchmarks are about single answers. Agent benchmarks measure multi-step real-world task completion. Very different beast.
2. adjudication
3. Lock the version — never silently edit labels
4. Large-scale pretraining builds general-purpose features
Which term best describes a foundational idea in "Agent Benchmarks: WebArena, GAIA, OSWorld"?
1. WebArena
2. agent benchmark
3. GAIA
4. OSWorld
A learner studying Agent Benchmarks: WebArena, GAIA, OSWorld would need to understand which concept?
1. agent benchmark
2. GAIA
3. WebArena
4. OSWorld
Which of these is directly relevant to Agent Benchmarks: WebArena, GAIA, OSWorld?
1. agent benchmark
2. WebArena
3. OSWorld
4. GAIA
What is the key insight about "Agent benchmarks are brutal" in the context of Agent Benchmarks: WebArena, GAIA, OSWorld?
1. Scoring high on MMLU and low on OSWorld is normal. Single-shot language skill does not translate directly into real-worl…
2. adjudication
3. Lock the version — never silently edit labels
4. Large-scale pretraining builds general-purpose features
What is the recommended tip about "Ground your practice in fundamentals" in the context of Agent Benchmarks: WebArena, GAIA, OSWorld?
1. adjudication
2. Every AI capability has an underlying mechanism. Understanding that mechanism tells you where it'll fail — which is more…
3. Lock the version — never silently edit labels
4. Large-scale pretraining builds general-purpose features
Which statement accurately describes an aspect of Agent Benchmarks: WebArena, GAIA, OSWorld?
1. adjudication
2. Lock the version — never silently edit labels
3. Traditional LLM benchmarks ask a question and score an answer. Agent benchmarks give the model tools — a browser, a terminal, a file system …
4. Large-scale pretraining builds general-purpose features
What does working with Agent Benchmarks: WebArena, GAIA, OSWorld typically involve?
1. adjudication
2. Lock the version — never silently edit labels
3. Large-scale pretraining builds general-purpose features
4. Self-hostable web apps modeled on real sites (an online store, a Reddit-style forum, GitLab). Tasks range from 'add this product to cart and checkout' to 'find th…
Which of the following is true about Agent Benchmarks: WebArena, GAIA, OSWorld?
1. From Meta AI. 466 real-world questions that require multiple tools to solve, like reading PDFs, watching videos, and running calculations.
2. adjudication
3. Lock the version — never silently edit labels
4. Large-scale pretraining builds general-purpose features
Which best describes the scope of "Agent Benchmarks: WebArena, GAIA, OSWorld"?
1. It is unrelated to foundations workflows
2. It focuses on how agent benchmarks measure multi-step real-world task completion, in contrast to single-answer LLM benchmarks
3. It applies only to the opposite beginner tier
4. It was deprecated in 2024 and no longer relevant
Which section heading best belongs in a lesson about Agent Benchmarks: WebArena, GAIA, OSWorld?
1. adjudication
2. Lock the version — never silently edit labels
3. The big three
4. Large-scale pretraining builds general-purpose features
Which section heading best belongs in a lesson about Agent Benchmarks: WebArena, GAIA, OSWorld?
1. adjudication
2. Lock the version — never silently edit labels
3. Large-scale pretraining builds general-purpose features
4. WebArena
Which section heading best belongs in a lesson about Agent Benchmarks: WebArena, GAIA, OSWorld?
1. OSWorld
2. adjudication
3. Lock the version — never silently edit labels
4. Large-scale pretraining builds general-purpose features
Which of the following is a concept covered in Agent Benchmarks: WebArena, GAIA, OSWorld?
1. WebArena
2. agent benchmark
3. GAIA
4. OSWorld
Which of the following is a concept covered in Agent Benchmarks: WebArena, GAIA, OSWorld?
1. agent benchmark
2. GAIA
3. WebArena
4. OSWorld
