Four benchmarks dominate modern AI announcements. Know what each measures, how, and where it breaks.
When a model launches, four benchmarks almost always appear in the announcement. You will be a more sophisticated reader if you know them cold.
| Benchmark | What it tests | Format | Typical frontier score |
|---|---|---|---|
| MMLU | Broad academic knowledge, 57 subjects | Multiple choice | 85-90%+ |
| GPQA Diamond | Graduate-level science questions | Multiple choice, expert-written | 60-80% |
| HumanEval | Python coding correctness | Function-completion, unit tests | 90%+ |
| SWE-bench | Real-world software engineering bug fixes | GitHub issues + repos | 40-70% depending on variant |
MMLU (Massive Multitask Language Understanding) was released in 2020 by Dan Hendrycks and colleagues. It covers 57 subjects from high-school to professional level, all multiple-choice. It was groundbreaking in 2021, but by 2024-2026 frontier models score above 85 percent and the benchmark has largely saturated, so it no longer separates the top models from one another.
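A multiple-choice score is just the fraction of questions where the model's chosen letter matches the answer key. The snippet below is a minimal sketch of that arithmetic; the function name and the four-question data are made up for illustration and are not taken from any official evaluation harness.

```python
def multiple_choice_accuracy(predictions, answer_key):
    """Fraction of questions where the predicted letter matches the key."""
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must have the same length")
    if not answer_key:
        raise ValueError("need at least one question")
    correct = sum(
        pred.strip().upper() == gold.strip().upper()
        for pred, gold in zip(predictions, answer_key)
    )
    return correct / len(answer_key)


# Hypothetical data: four questions, the model gets three right.
answer_key = ["C", "A", "D", "B"]
predictions = ["C", "A", "B", "B"]
accuracy = multiple_choice_accuracy(predictions, answer_key)
print(f"accuracy = {accuracy:.0%}")  # accuracy = 75%
```

With four answer options per question, random guessing lands near 25 percent, so that is the floor to read any reported score against.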
GPQA was released in 2023. Its questions are written by domain experts with PhDs and are meant to be "Google-proof": you cannot find the answer by searching. The 'Diamond' subset is the hardest, and it is the one most launch announcements report. This is where much of the 2025-2026 frontier competition happens.
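HumanEval was released with OpenAI's Codex in 2021. It contains 164 Python problems. Each gives the model a function signature and docstring, and the completion is checked by unit tests the model does not see. Simple but effective. It is now largely saturated: recent GPT-4-class models and Claude 3.5 and later report scores of roughly 90 percent or higher. MBPP, a similar benchmark from the same period, and the newer, harder BigCodeBench are the usual alternatives.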
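To make "function-completion checked by unit tests" concrete, here is a minimal sketch of that kind of check. It is not the official HumanEval harness: the helper name, the test cases, and both completions are hypothetical, modeled on the sum_evens example that appears in this lesson.

```python
def passes_unit_tests(candidate_source, entry_point, test_cases):
    """Return True if the candidate function passes every (args, expected) case.

    Caution: exec runs arbitrary code. Real harnesses sandbox this step;
    this sketch does not.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        func = namespace[entry_point]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


# Hypothetical tests for the sum_evens task.
test_cases = [
    (([1, 2, 3, 4],), 6),
    (([],), 0),
    (([1, 3, 5],), 0),
]

correct_completion = '''
def sum_evens(nums):
    return sum(n for n in nums if n % 2 == 0)
'''

# Sums the odd numbers instead, so the first test fails.
buggy_completion = '''
def sum_evens(nums):
    return sum(n for n in nums if n % 2 == 1)
'''

print(passes_unit_tests(correct_completion, "sum_evens", test_cases))  # True
print(passes_unit_tests(buggy_completion, "sum_evens", test_cases))    # False
```

A problem counts as solved only if every test passes, which is why a plausible-looking but subtly wrong completion scores zero on that problem.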
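SWE-bench was released in 2023. It pulls real bug-fix tasks from 12 popular Python repositories on GitHub and checks each fix against the repository's real unit tests. SWE-bench Verified (a cleaner subset curated by humans) is the version most frequently cited. Because the model has to work inside a full codebase and produce a patch that passes those tests, this benchmark tests agentic coding ability, not just completion.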
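The pass/fail rule behind a SWE-bench-style score can be sketched in a few lines. A task counts as resolved only if, after the model's patch is applied, the tests that exposed the bug now pass and the tests that already passed still pass (the dataset labels these groups FAIL_TO_PASS and PASS_TO_PASS). Everything else below is an assumption for illustration: the function name, the test names, and the result dictionaries are hypothetical, and the real harness runs each repository's test suite in an isolated environment rather than reading a dictionary.

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """Return True if the patch fixes the issue without breaking other tests.

    test_results maps a test name to True (passed) or False (failed) after
    the model's patch is applied. A test missing from the results counts
    as failed.
    """
    fixed = all(test_results.get(name, False) for name in fail_to_pass)
    unbroken = all(test_results.get(name, False) for name in pass_to_pass)
    return fixed and unbroken


# Hypothetical test names and outcomes for a single task.
fail_to_pass = ["test_parse_empty_header"]
pass_to_pass = ["test_parse_basic", "test_parse_unicode"]

good_patch = {
    "test_parse_empty_header": True,
    "test_parse_basic": True,
    "test_parse_unicode": True,
}
# Fixes the bug but breaks a test that used to pass.
regressing_patch = {
    "test_parse_empty_header": True,
    "test_parse_basic": False,
    "test_parse_unicode": True,
}

print(is_resolved(good_patch, fail_to_pass, pass_to_pass))        # True
print(is_resolved(regressing_patch, fail_to_pass, pass_to_pass))  # False
```

The headline number is the percentage of tasks resolved under this rule, so a patch that fixes the issue but causes a regression earns no credit.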
What these benchmarks actually look like

Example MMLU question (high school biology):
Q: What is the basic unit of life?
A) Tissue
B) Organ
C) Cell
D) Organism

Example HumanEval task:

    def sum_evens(nums):
        """Return the sum of even numbers in the list."""
        # model writes this

"We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more."
— Hendrycks et al., MMLU paper (2020)
The big idea: these four names will show up in every frontier model launch for years. Knowing what each measures and how it breaks makes you a fluent reader of AI claims.
6 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-mmlu-gpqa-humaneval-swebench
What is the main idea of "MMLU, GPQA, HumanEval, SWE-bench: The Core Four"?
Which concept is most central to "MMLU, GPQA, HumanEval, SWE-bench: The Core Four"?
What should a careful learner remember about "Contamination risk"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about MMLU be treated?
Name one way to verify an AI answer about MMLU.