MMLU, GPQA, HumanEval, SWE-bench: The Core Four

Four benchmarks dominate modern AI announcements. Know what each measures, how, and where it breaks.

40 min · Reviewed 2026

Know These By Heart

When a model launches, four benchmarks almost always appear in the announcement. You will be a more sophisticated reader if you know them cold.

Benchmark	What it tests	Format	Typical frontier score
MMLU	Broad academic knowledge, 57 subjects	Multiple choice	85-90%+
GPQA Diamond	Graduate-level science questions	Multiple choice, expert-written	60-80%
HumanEval	Python coding correctness	Function-completion, unit tests	90%+
SWE-bench	Real-world software engineering bug fixes	GitHub issues + repos	40-70% depending on variant

MMLU — Massive Multitask Language Understanding

Released in 2020 by Dan Hendrycks and colleagues. It covers 57 subjects from high-school to professional level, all as four-option multiple-choice questions. It was groundbreaking at release, but frontier models in 2024-2026 score above 85 percent, so the benchmark has largely saturated and no longer separates the top models from one another.
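Because every MMLU item is a four-option multiple-choice question, scoring reduces to plain accuracy, and random guessing already earns about 25 percent. The sketch below shows that arithmetic. The answer lists are made-up toy data, not real MMLU items, and the function names are illustrative only.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of items where the predicted letter matches the answer key."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one item")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def chance_baseline(num_options: int = 4) -> float:
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_options


# Hypothetical toy data: five items, answer letters A-D.
gold = ["C", "A", "D", "B", "C"]
predictions = ["C", "A", "B", "B", "C"]

score = accuracy(predictions, gold)
print(f"accuracy = {score:.0%}, chance = {chance_baseline():.0%}")
# prints: accuracy = 80%, chance = 25%
```

Read this way, a score of 85 to 90 percent sits close to the ceiling of the scale, which is roughly what "saturated" means in practice.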

GPQA — Graduate-Level Google-Proof Q&A

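Released in 2023. Questions are written by domain experts in biology, physics, and chemistry and are meant to be Google-proof: you cannot find the answer by searching. The 'Diamond' subset is the most strictly filtered, keeping 198 questions that experts answered correctly and most non-experts got wrong. It is one of the places where frontier models still visibly compete in 2025-2026.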

HumanEval

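Released with OpenAI's Codex in 2021. 164 hand-written Python problems: the model sees a function signature and docstring, and its completion is run against unit tests. Simple but effective. Now largely saturated (GPT-4o and Claude 3.5 Sonnet both reported scores of about 90 percent or higher). MBPP, from the same period, and the newer BigCodeBench are common alternatives.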
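The "run the completion against unit tests" idea can be sketched in a few lines. This is a minimal illustration, not the official HumanEval harness: `passes_tests`, `check_sum_evens`, and the two candidate completions are hypothetical, and a real harness would run untrusted code in a sandbox with a timeout.

```python
from typing import Callable, Dict


def passes_tests(completion_source: str, entry_point: str,
                 check: Callable[[Callable], None]) -> bool:
    """Execute model-written source and report whether every assertion in check holds."""
    namespace: Dict[str, object] = {}
    try:
        # WARNING: exec runs arbitrary code; acceptable only for trusted toy input.
        exec(completion_source, namespace)
        check(namespace[entry_point])
    except Exception:
        # Any error (syntax error, missing function, failed assertion) counts as a fail.
        return False
    return True


def check_sum_evens(candidate: Callable) -> None:
    """Hypothetical unit tests for a toy 'sum of even numbers' task."""
    assert candidate([1, 2, 3, 4]) == 6
    assert candidate([]) == 0
    assert candidate([1, 3, 5]) == 0


# Two made-up model completions: one correct, one that ignores the 'even' condition.
good = "def sum_evens(nums):\n    return sum(n for n in nums if n % 2 == 0)\n"
bad = "def sum_evens(nums):\n    return sum(nums)\n"

results = [passes_tests(src, "sum_evens", check_sum_evens) for src in (good, bad)]
pass_rate = sum(results) / len(results)
print(results, pass_rate)  # prints: [True, False] 0.5
```

The score is all-or-nothing per problem: a completion that looks plausible but fails a single assertion earns no credit.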

SWE-bench

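Released in 2023. It pulls real bug-fix tasks from 12 popular Python repositories on GitHub and grades each attempt with the repository's own unit tests. SWE-bench Verified, a 500-task subset that human annotators screened for clear issue descriptions and reliable tests, is the version most frequently cited. Because the model must navigate an existing codebase and produce a working patch, this benchmark tests agentic coding ability, not just function completion.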
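Grading a SWE-bench-style task comes down to two groups of tests: those that failed before the fix and must pass after it, and those that already passed and must not regress (the dataset labels these FAIL_TO_PASS and PASS_TO_PASS). The sketch below shows only that decision rule. `TaskInstance`, `is_resolved`, the test names, and the results are hypothetical, and it assumes the patch has already been applied and the tests already run.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskInstance:
    """Hypothetical, simplified stand-in for one bug-fix task."""
    instance_id: str
    fail_to_pass: List[str]  # tests that fail before the fix and must pass after it
    pass_to_pass: List[str]  # tests that already pass and must not regress


def is_resolved(task: TaskInstance, results_after_patch: Dict[str, bool]) -> bool:
    """A task is resolved only if every listed test passes after the model's patch.

    A test missing from the results is treated as a failure.
    """
    required = task.fail_to_pass + task.pass_to_pass
    return all(results_after_patch.get(name, False) for name in required)


task = TaskInstance(
    instance_id="example-repo-0001",  # made-up identifier
    fail_to_pass=["test_parse_empty_header"],
    pass_to_pass=["test_parse_basic", "test_parse_unicode"],
)

# The patch fixes the reported bug but breaks an unrelated test.
fixed_bug_but_broke_other = {
    "test_parse_empty_header": True,
    "test_parse_basic": True,
    "test_parse_unicode": False,
}
# The patch fixes the bug and leaves everything else green.
clean_fix = {name: True for name in task.fail_to_pass + task.pass_to_pass}

print(is_resolved(task, fixed_bug_but_broke_other))  # prints: False
print(is_resolved(task, clean_fix))                  # prints: True
```

The regression check is what makes the task harder than function completion: a patch that fixes the issue while breaking something else still counts as a failure.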

Example MMLU question (high school biology):

Q: What is the basic unit of life?
  A) Tissue   B) Organ   C) Cell   D) Organism

Example HumanEval task:

def sum_evens(nums):
    """Return the sum of even numbers in the list."""
    # model writes this
    ...

What these benchmarks actually look like

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
— Hendrycks et al., MMLU paper (2020)

The big idea: these four names will show up in every frontier model launch for years. Knowing what each measures and how it breaks makes you a fluent reader of AI claims.

End-of-lesson check

15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-mmlu-gpqa-humaneval-swebench

What is the core idea behind "MMLU, GPQA, HumanEval, SWE-bench: The Core Four"?
1. Four benchmarks dominate modern AI announcements. Know what each measures, how, and where it breaks.
2. Jailbreaks: prompts that bypass safety guidelines
3. publishing
4. Skill is not one-dimensional — a model great at coding and bad at poetry cannot …
Which term best describes a foundational idea in "MMLU, GPQA, HumanEval, SWE-bench: The Core Four"?
1. GPQA
2. MMLU
3. HumanEval
4. SWE-bench
A learner studying MMLU, GPQA, HumanEval, SWE-bench: The Core Four would need to understand which concept?
1. MMLU
2. HumanEval
3. GPQA
4. SWE-bench
Which of these is directly relevant to MMLU, GPQA, HumanEval, SWE-bench: The Core Four?
1. MMLU
2. GPQA
3. SWE-bench
4. HumanEval
What is the key insight about "Contamination risk" in the context of MMLU, GPQA, HumanEval, SWE-bench: The Core Four?
1. MMLU and HumanEval have been public since 2020-2021. They are almost certainly in most training sets now.
2. Jailbreaks: prompts that bypass safety guidelines
3. publishing
4. Skill is not one-dimensional — a model great at coding and bad at poetry cannot …
What is the recommended tip about "Ground your practice in fundamentals" in the context of MMLU, GPQA, HumanEval, SWE-bench: The Core Four?
1. Jailbreaks: prompts that bypass safety guidelines
2. Every AI capability has an underlying mechanism. Understanding that mechanism tells you where it'll fail — which is more…
3. publishing
4. Skill is not one-dimensional — a model great at coding and bad at poetry cannot …
Which statement accurately describes an aspect of MMLU, GPQA, HumanEval, SWE-bench: The Core Four?
1. Jailbreaks: prompts that bypass safety guidelines
2. publishing
3. When a model launches, four benchmarks almost always appear in the announcement.
4. Skill is not one-dimensional — a model great at coding and bad at poetry cannot …
What does working with MMLU, GPQA, HumanEval, SWE-bench: The Core Four typically involve?
1. Jailbreaks: prompts that bypass safety guidelines
2. publishing
3. Skill is not one-dimensional — a model great at coding and bad at poetry cannot …
4. Released in 2020 by Dan Hendrycks and colleagues. 57 subjects from high-school to professional level, all multiple-choice.
Which of the following is true about MMLU, GPQA, HumanEval, SWE-bench: The Core Four?
1. Released in 2023. Questions are written by domain experts with PhDs and are meant to be Google-proof — you cannot find the answer by searchi…
2. Jailbreaks: prompts that bypass safety guidelines
3. publishing
4. Skill is not one-dimensional — a model great at coding and bad at poetry cannot …
Which best describes the scope of "MMLU, GPQA, HumanEval, SWE-bench: The Core Four"?
1. It is unrelated to foundations workflows
2. It focuses on the four benchmarks that dominate modern AI announcements: what each measures, how, and where it breaks.
3. It applies only to the beginner tier
4. It was deprecated in 2024 and no longer relevant
Which section heading best belongs in a lesson about MMLU, GPQA, HumanEval, SWE-bench: The Core Four?
1. Jailbreaks: prompts that bypass safety guidelines
2. publishing
3. MMLU — Massive Multitask Language Understanding
4. Skill is not one-dimensional — a model great at coding and bad at poetry cannot …
Which section heading best belongs in a lesson about MMLU, GPQA, HumanEval, SWE-bench: The Core Four?
1. Jailbreaks: prompts that bypass safety guidelines
2. publishing
3. Skill is not one-dimensional — a model great at coding and bad at poetry cannot …
4. GPQA — Graduate-Level Google-Proof Q&A
Which section heading best belongs in a lesson about MMLU, GPQA, HumanEval, SWE-bench: The Core Four?
1. HumanEval
2. Jailbreaks: prompts that bypass safety guidelines
3. publishing
4. Skill is not one-dimensional — a model great at coding and bad at poetry cannot …
Which section heading best belongs in a lesson about MMLU, GPQA, HumanEval, SWE-bench: The Core Four?
1. Jailbreaks: prompts that bypass safety guidelines
2. SWE-bench
3. publishing
4. Skill is not one-dimensional — a model great at coding and bad at poetry cannot …
Which of the following is a concept covered in MMLU, GPQA, HumanEval, SWE-bench: The Core Four?
1. GPQA
2. HumanEval
3. MMLU
4. SWE-bench
