Agent Benchmarks: WebArena, GAIA, OSWorld
LLM benchmarks are about single answers. Agent benchmarks measure multi-step real-world task completion. Very different beast.
Lesson map
The main moves, in order:
1. When the Test Is a Task, Not a Question
2. Agent benchmarks
3. WebArena
4. GAIA
Section 1
When the Test Is a Task, Not a Question
Traditional LLM benchmarks ask a question and score an answer. Agent benchmarks give the model tools — a browser, a terminal, a file system — and score whether it can actually complete a real task. It is a much harder evaluation.
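The contrast can be sketched in a few lines. This is an illustrative toy, not any benchmark's real scoring code: a QA benchmark compares answer strings, while an agent benchmark inspects the state of the environment after the agent is done.

```python
# Toy sketch: scoring an answer vs. scoring a task.
# All names here are illustrative, not from any real benchmark's API.

def score_qa(predicted: str, gold: str) -> bool:
    """Classic LLM benchmark: compare a single answer string."""
    return predicted.strip().lower() == gold.strip().lower()

def score_task(final_state: dict, goal_state: dict) -> bool:
    """Agent benchmark: compare the environment's final state to the goal.
    The agent's intermediate text doesn't matter, only what actually happened."""
    return all(final_state.get(k) == v for k, v in goal_state.items())

# A QA benchmark checks text:
print(score_qa("Paris", "paris"))   # True

# An agent benchmark checks resulting state (extra keys are fine):
goal = {"cart_items": ["SKU-123"], "order_placed": True}
final = {"cart_items": ["SKU-123"], "order_placed": True, "pages_visited": 7}
print(score_task(final, goal))      # True
```

Note the asymmetry: the QA scorer can be fooled by a lucky string, but the task scorer only passes if the environment ended up in the right state.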
The big three, plus two you will hear mentioned alongside them
Compare the options
| Benchmark | Year | Environment | Task type |
|---|---|---|---|
| WebArena | 2023 | Realistic web apps (shop, Reddit clone, maps) | Multi-step web tasks |
| GAIA | 2023 | Web + tools | Knowledge-intensive assistant tasks |
| OSWorld | 2024 | Real Ubuntu + macOS desktops | Cross-app computer tasks |
| SWE-bench | 2023 | GitHub repos | Code bug fixes |
| τ-bench (Tau-bench) | 2024 | Customer service domain | Tool-calling chat agents |
WebArena
Self-hostable clones of an e-commerce store, Reddit, GitLab, and a Wikipedia-style site. Tasks range from 'add this product to the cart and check out' to 'find the most-upvoted post in this subreddit and summarize the top three comments.' Models must click, type, scroll, and read.
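A web task like the checkout example decomposes into a sequence of low-level actions. The primitives below are hypothetical stand-ins; WebArena's real action space is similar in spirit (click, type, scroll, and so on, issued against a live page's accessibility tree):

```python
# Hypothetical sketch of a WebArena-style action trace.
# Element ids and the Action type are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click" | "type" | "scroll"
    target: str        # element id in the page's accessibility tree
    text: str = ""     # only used for "type"

# One plausible trace for "add this product to cart and checkout":
trace = [
    Action("click", "search-box"),
    Action("type",  "search-box", "usb-c cable"),
    Action("click", "result-0"),
    Action("click", "add-to-cart"),
    Action("click", "checkout"),
]
print(len(trace), "steps")  # prints: 5 steps
```

Every action in the trace is a failure point, which is why multi-step success rates are so much lower than single-answer accuracy.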
GAIA
From Meta AI. 466 real-world questions that require multiple tools to solve: reading PDFs, watching videos, running calculations. The tasks are easy for humans, who succeed about 92 percent of the time, and hard for AI, with top systems around 50-65 percent in 2025.
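Despite the multi-step work involved, GAIA grades only the short final answer, roughly by exact match after light normalization. The normalizer below is an illustrative approximation, not GAIA's actual grading code:

```python
# Sketch of GAIA-style final-answer scoring (illustrative normalization,
# not the benchmark's real implementation).
def normalize(ans: str) -> str:
    # Strip whitespace, case, trailing periods, and thousands separators.
    return ans.strip().lower().rstrip(".").replace(",", "")

def gaia_score(predicted: str, gold: str) -> bool:
    return normalize(predicted) == normalize(gold)

print(gaia_score("1,234.", "1234"))  # True: formatting noise is ignored
```

The point of this design: the hard part is the tool use needed to *reach* the answer, so the scoring itself can stay simple and automatic.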
OSWorld
A true operating system. 369 tasks across common apps — LibreOffice, Chrome, VS Code, Thunderbird. Models must complete tasks like 'configure Thunderbird with this IMAP server' or 'format this spreadsheet.' Human baseline is around 72 percent. Frontier models hover in the 20-40 percent range as of 2025-2026.
A typical real-world computer-use task
Example OSWorld task:
Goal: 'In LibreOffice Calc, make the values in
column B that are less than 100 appear in red bold.'
The agent must:
1. Open Calc with the provided file
2. Select column B
3. Navigate Format -> Conditional
4. Configure the rule
5. Save the file
Each step can fail. The final state of the environment is compared to ground truth.

As the GAIA paper puts it: "GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modal handling, web browsing, and generally tool-use proficiency."
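A final-state check for the Calc task above might look like the sketch below. The cell model and function are hypothetical; the real OSWorld evaluator inspects the saved file, but the principle is the same: ignore how the agent clicked through menus and verify only the outcome.

```python
# Hypothetical OSWorld-style outcome check for the Calc task.
# The cell representation is an assumption for illustration.

def check_column_b(cells: dict) -> bool:
    """cells maps refs like 'B1' to dicts like
    {'value': 87, 'color': 'red', 'bold': True}."""
    for ref, cell in cells.items():
        if not ref.startswith("B"):
            continue
        if cell["value"] < 100 and not (cell["color"] == "red" and cell["bold"]):
            return False  # a sub-100 value the agent failed to format
    return True

saved = {
    "B1": {"value": 87,  "color": "red",   "bold": True},
    "B2": {"value": 150, "color": "black", "bold": False},
}
print(check_column_b(saved))  # True: final state matches ground truth
```

An agent that did everything right except clicking Save would fail this check, which is exactly the kind of last-step brittleness these benchmarks expose.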
The big idea: agents are judged by what they do, not what they say. These benchmarks are where AI capability is genuinely being tested now.
