Four benchmarks dominate modern AI announcements. Know what each measures, how, and where it breaks.
When a model launches, four benchmarks almost always appear in the announcement. You will be a more sophisticated reader if you know them cold.
| Benchmark | What it tests | Format | Typical frontier score |
|---|---|---|---|
| MMLU | Broad academic knowledge, 57 subjects | Multiple choice | 85-90%+ |
| GPQA Diamond | Graduate-level science questions | Multiple choice, expert-written | 60-80% |
| HumanEval | Python coding correctness | Function-completion, unit tests | 90%+ |
| SWE-bench | Real-world software engineering bug fixes | GitHub issues + repos | 40-70% depending on variant |
**MMLU.** Released in 2020 by Dan Hendrycks and colleagues: 57 subjects ranging from elementary to professional level, all multiple choice. It was groundbreaking when it appeared, but by 2024-2026 frontier models score above 85 percent and the benchmark has largely saturated.
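For multiple-choice benchmarks like MMLU (and GPQA below), the scoring machinery is nothing more than accuracy against an answer key. A minimal sketch, with question IDs and answers invented for illustration:

```python
# Minimal sketch of multiple-choice scoring, MMLU-style.
# Question IDs and answers are made up; real MMLU spans 57 subjects
# and thousands of held-out questions.

answer_key = {
    "hs_bio_001": "C",   # "What is the basic unit of life?" -> Cell
    "hs_bio_002": "A",
    "law_017": "D",
}

model_answers = {
    "hs_bio_001": "C",
    "hs_bio_002": "B",   # wrong
    "law_017": "D",
}

correct = sum(model_answers[qid] == gold for qid, gold in answer_key.items())
print(f"accuracy: {correct / len(answer_key):.0%}")  # 67% on this toy set
```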
**GPQA.** Released in 2023. Questions are written by domain experts with PhDs and are meant to be Google-proof: you cannot find the answer by searching. The 'Diamond' subset is the hardest. This is where the real 2025-2026 frontier competition happens.
**HumanEval.** Released with OpenAI's Codex in 2021: 164 Python problems, each graded by hidden unit tests. Simple but effective, and now largely saturated (GPT-4 and Claude 3.5+ score above 90 percent). MBPP and the newer BigCodeBench are the serious successors.
**SWE-bench.** Released in 2023. Pulls real bug-fix tasks from 12 popular Python repositories on GitHub, with the repos' real unit tests used to verify each fix. SWE-bench Verified (a cleaner, human-curated subset) is the version most frequently cited. This benchmark tests agentic coding ability, not just code completion.
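To make "real repos, real unit tests" concrete, here is a rough sketch of what a SWE-bench-style check does for one task: pin the repository at the buggy commit, apply the model's patch, and see whether the tests tied to the original fix now pass. The repo path, commit, patch file, and test ID below are placeholders, and the real harness adds per-repo environments and separate pass/fail test lists.

```python
# Rough sketch of grading one SWE-bench-style task.
# Paths, the commit hash, and the test ID are hypothetical placeholders.
import subprocess

def run(cmd, cwd):
    return subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)

repo = "/tmp/some_repo"                # repository already cloned here
base_commit = "abc1234"                # commit the issue was filed against
model_patch = "/tmp/model_patch.diff"  # diff produced by the model
fail_to_pass = ["tests/test_core.py::test_issue_12345"]  # tests the fix must make pass

run(["git", "checkout", base_commit], cwd=repo)         # pin the buggy state
applied = run(["git", "apply", model_patch], cwd=repo)  # apply the model's fix
if applied.returncode != 0:
    print("patch did not apply -> unresolved")
else:
    tests = run(["python", "-m", "pytest", *fail_to_pass], cwd=repo)
    print("resolved" if tests.returncode == 0 else "unresolved")
```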
**What these benchmarks actually look like**

Example MMLU question (high school biology):
Q: What is the basic unit of life?
A) Tissue B) Organ C) Cell D) Organism
Example HumanEval task:

def sum_evens(nums):
    """Return the sum of even numbers in the list."""
    # model writes this
    ...
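Here is how the hidden-unit-test grading might look for this toy task. The prompt-plus-completion split mirrors the HumanEval format, but the check() tests are invented for this sketch; the real ones stay hidden from the model.

```python
# Sketch of HumanEval-style grading for the toy task above.
# The check() tests are made up; the benchmark's own tests are hidden.

prompt = (
    "def sum_evens(nums):\n"
    '    """Return the sum of even numbers in the list."""\n'
)
completion = "    return sum(n for n in nums if n % 2 == 0)\n"  # the model's output

def check(candidate):
    assert candidate([1, 2, 3, 4]) == 6
    assert candidate([]) == 0
    assert candidate([1, 3, 5]) == 0

namespace = {}
exec(prompt + completion, namespace)   # real harnesses run this in a sandbox
try:
    check(namespace["sum_evens"])
    print("passed")
except AssertionError:
    print("failed")
```

HumanEval reports pass@k over multiple sampled completions, but the per-problem check is essentially this.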
> We introduce a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
>
> — Hendrycks et al., MMLU paper (2020)
The big idea: these four names will show up in every frontier model launch for years. Knowing what each measures and how it breaks makes you a fluent reader of AI claims.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-mmlu-gpqa-humaneval-swebench