MMLU, GPQA, HumanEval, SWE-bench: The Core Four
Four benchmarks dominate modern AI announcements. Know what each measures, how, and where it breaks.
Section 1
Know These By Heart
When a model launches, four benchmarks almost always appear in the announcement. You will be a more sophisticated reader if you know them cold.
Compare the options
| Benchmark | What it tests | Format | Typical frontier score |
|---|---|---|---|
| MMLU | Broad academic knowledge, 57 subjects | Multiple choice | 85-90%+ |
| GPQA Diamond | Graduate-level science questions | Multiple choice, expert-written | 60-80% |
| HumanEval | Python coding correctness | Function-completion, unit tests | 90%+ |
| SWE-bench | Real-world software engineering bug fixes | GitHub issues + repos | 40-70% depending on variant |
MMLU — Massive Multitask Language Understanding
Released in 2020 by Dan Hendrycks and colleagues. 57 subjects from high school to professional level, all multiple choice. In the authors' words: “We introduce a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.” It was groundbreaking at release, but by 2024-2026 frontier models score above 85 percent and the benchmark is largely saturated.
GPQA — Graduate-Level Google-Proof Q&A
Released in 2023. Questions are written by domain experts with PhDs and are meant to be Google-proof — you cannot find the answer by searching. The 'Diamond' subset is the hardest. This is where the real 2025-2026 frontier competition happens.
HumanEval
Released alongside OpenAI's Codex in 2021. 164 Python problems, each graded by hidden unit tests. Simple but effective, and now largely saturated (GPT-4 and Claude 3.5+ score above 90 percent); MBPP and the newer BigCodeBench are the serious successors.
SWE-bench
Released in 2023. Pulls real bug-fix tasks from 12 popular Python repositories on GitHub, graded by the repositories' own unit tests. SWE-bench Verified (a cleaner, human-curated subset) is the variant most frequently cited. This benchmark tests agentic coding ability, reading an issue and producing a working patch across a real codebase, not just function completion.
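The evaluation loop behind that claim is worth seeing in outline. Below is a minimal, hypothetical sketch written inline with git and pytest; the real SWE-bench harness runs each task in a pinned Docker environment with per-repository install and test commands, and the headline number is the percentage of issues resolved.

```python
# Minimal sketch of an SWE-bench-style evaluation loop (illustrative only;
# the real harness uses pinned Docker images and per-repo test commands).
import subprocess
import tempfile

def evaluate_task(repo_url: str, base_commit: str, model_patch: str,
                  fail_to_pass_tests: list[str]) -> bool:
    """A task counts as 'resolved' if the model's patch makes the failing tests pass."""
    with tempfile.TemporaryDirectory() as workdir:
        # 1. Check out the repository at the commit where the issue was reported.
        subprocess.run(["git", "clone", repo_url, workdir], check=True)
        subprocess.run(["git", "checkout", base_commit], cwd=workdir, check=True)

        # 2. Apply the patch the model produced from the issue text.
        subprocess.run(["git", "apply", "-"], input=model_patch.encode(),
                       cwd=workdir, check=True)

        # 3. Run the tests that the real human fix made pass (FAIL_TO_PASS).
        result = subprocess.run(["python", "-m", "pytest", *fail_to_pass_tests],
                                cwd=workdir)
        return result.returncode == 0
```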
What these benchmarks actually look like
Example MMLU question (high school biology):
Q: What is the basic unit of life?
A) Tissue B) Organ C) Cell D) Organism
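Scoring this is mechanical. Here is a minimal sketch of MMLU-style accuracy, assuming a hypothetical `ask_model` call and using the single item above as a stand-in; the real benchmark has roughly 14,000 questions across 57 subjects.

```python
# Minimal sketch of MMLU-style scoring: every item is four-option multiple
# choice, so the reported number is plain accuracy over the whole test set.

def ask_model(question: str, options: dict) -> str:
    """Return the model's chosen letter: 'A', 'B', 'C', or 'D'. Placeholder for a real model call."""
    return "C"  # stub

items = [
    {
        "question": "What is the basic unit of life?",
        "options": {"A": "Tissue", "B": "Organ", "C": "Cell", "D": "Organism"},
        "answer": "C",
    },
    # ...the real benchmark has thousands more items across 57 subjects
]

correct = sum(ask_model(i["question"], i["options"]) == i["answer"] for i in items)
print(f"accuracy: {correct / len(items):.1%}")
```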
Example HumanEval task:
"""Return the sum of even numbers in the list."""
def sum_evens(nums):
# model writes this
...“We introduce a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.”
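Grading is just as mechanical for HumanEval: the model's completion is appended to the prompt, the hidden unit tests are appended after that, and the whole file either runs clean or it does not. The sketch below uses illustrative names (`prompt`, `completion`, `test_code`) and a toy test suite, not the benchmark's actual harness or data fields.

```python
# Minimal sketch of HumanEval-style grading: run prompt + completion + tests
# in a separate process with a timeout; the problem is solved only if every
# hidden test passes.
import subprocess
import sys
import tempfile

prompt = '''def sum_evens(nums):
    """Return the sum of even numbers in the list."""
'''

completion = "    return sum(n for n in nums if n % 2 == 0)\n"  # model output

test_code = """
assert sum_evens([1, 2, 3, 4]) == 6
assert sum_evens([]) == 0
assert sum_evens([1, 3, 5]) == 0
print("PASS")
"""

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(prompt + completion + test_code)
    path = f.name

try:
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
    solved = result.returncode == 0
except subprocess.TimeoutExpired:
    solved = False

print("solved" if solved else "failed")
```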
The big idea: these four names will show up in every frontier model launch for years. Knowing what each measures and how it breaks makes you a fluent reader of AI claims.
