MMLU, GPQA, HumanEval, SWE-bench: The Core Four
Four benchmarks dominate modern AI announcements. Know what each measures, how, and where it breaks.
Section 1
Know These By Heart
When a model launches, four benchmarks almost always appear in the announcement. You will be a more sophisticated reader if you know them cold.
Compare the options
| Benchmark | What it tests | Format | Typical frontier score |
|---|---|---|---|
| MMLU | Broad academic knowledge, 57 subjects | Multiple choice | 85-90%+ |
| GPQA Diamond | Graduate-level science questions | Multiple choice, expert-written | 60-80% |
| HumanEval | Python coding correctness | Function-completion, unit tests | 90%+ |
| SWE-bench | Real-world software engineering bug fixes | GitHub issues + repos | 40-70% depending on variant |
MMLU — Massive Multitask Language Understanding
Released in 2020 by Dan Hendrycks and colleagues. 57 subjects from high school to professional level, all multiple choice. In the authors' words: “We introduce a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.” It was groundbreaking at release, but by 2024-2026 frontier models score above 85 percent and the benchmark is largely saturated.
GPQA — Graduate-Level Google-Proof Q&A
Released in 2023. Questions are written by domain experts with PhDs and are meant to be Google-proof — you cannot find the answer by searching. The 'Diamond' subset is the hardest. This is where the real 2025-2026 frontier competition happens.
HumanEval
Released alongside OpenAI's Codex in 2021. 164 Python problems, each graded by hidden unit tests. Simple but effective, and now largely saturated (GPT-4 and Claude 3.5+ score above 90 percent); MBPP and the newer BigCodeBench are the serious successors.
SWE-bench
Released in 2023. Pulls real bug-fix tasks from 12 popular Python repositories on GitHub, graded by the repositories' own unit tests. SWE-bench Verified (a cleaner, human-curated subset) is the variant most frequently cited. This benchmark tests agentic coding ability, reading an issue and producing a working patch across a real codebase, not just function completion.
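The evaluation loop behind that claim is worth seeing in outline. Below is a minimal, hypothetical sketch written inline with git and pytest; the real SWE-bench harness runs each task in a pinned Docker environment with per-repository install and test commands, and the headline number is the percentage of issues resolved.

```python
# Minimal sketch of an SWE-bench-style evaluation loop (illustrative only;
# the real harness uses pinned Docker images and per-repo test commands).
import subprocess
import tempfile

def evaluate_task(repo_url: str, base_commit: str, model_patch: str,
                  fail_to_pass_tests: list[str]) -> bool:
    """A task counts as 'resolved' if the model's patch makes the failing tests pass."""
    with tempfile.TemporaryDirectory() as workdir:
        # 1. Check out the repository at the commit where the issue was reported.
        subprocess.run(["git", "clone", repo_url, workdir], check=True)
        subprocess.run(["git", "checkout", base_commit], cwd=workdir, check=True)

        # 2. Apply the patch the model produced from the issue text.
        subprocess.run(["git", "apply", "-"], input=model_patch.encode(),
                       cwd=workdir, check=True)

        # 3. Run the tests that the real human fix made pass (FAIL_TO_PASS).
        result = subprocess.run(["python", "-m", "pytest", *fail_to_pass_tests],
                                cwd=workdir)
        return result.returncode == 0
```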
What these benchmarks actually look like
Example MMLU question (high school biology):
Q: What is the basic unit of life?
A) Tissue B) Organ C) Cell D) Organism
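Scoring this is mechanical. Here is a minimal sketch of MMLU-style accuracy, assuming a hypothetical `ask_model` call and using the single item above as a stand-in; the real benchmark has roughly 14,000 questions across 57 subjects.

```python
# Minimal sketch of MMLU-style scoring: every item is four-option multiple
# choice, so the reported number is plain accuracy over the whole test set.

def ask_model(question: str, options: dict) -> str:
    """Return the model's chosen letter: 'A', 'B', 'C', or 'D'. Placeholder for a real model call."""
    return "C"  # stub

items = [
    {
        "question": "What is the basic unit of life?",
        "options": {"A": "Tissue", "B": "Organ", "C": "Cell", "D": "Organism"},
        "answer": "C",
    },
    # ...the real benchmark has thousands more items across 57 subjects
]

correct = sum(ask_model(i["question"], i["options"]) == i["answer"] for i in items)
print(f"accuracy: {correct / len(items):.1%}")
```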
Example HumanEval task:
"""Return the sum of even numbers in the list."""
def sum_evens(nums):
# model writes this
...“We introduce a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.”
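Grading is just as mechanical for HumanEval: the model's completion is appended to the prompt, the hidden unit tests are appended after that, and the whole file either runs clean or it does not. The sketch below uses illustrative names (`prompt`, `completion`, `test_code`) and a toy test suite, not the benchmark's actual harness or data fields.

```python
# Minimal sketch of HumanEval-style grading: run prompt + completion + tests
# in a separate process with a timeout; the problem is solved only if every
# hidden test passes.
import subprocess
import sys
import tempfile

prompt = '''def sum_evens(nums):
    """Return the sum of even numbers in the list."""
'''

completion = "    return sum(n for n in nums if n % 2 == 0)\n"  # model output

test_code = """
assert sum_evens([1, 2, 3, 4]) == 6
assert sum_evens([]) == 0
assert sum_evens([1, 3, 5]) == 0
print("PASS")
"""

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(prompt + completion + test_code)
    path = f.name

try:
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
    solved = result.returncode == 0
except subprocess.TimeoutExpired:
    solved = False

print("solved" if solved else "failed")
```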
The big idea: these four names will show up in every frontier model launch for years. Knowing what each measures and how it breaks makes you a fluent reader of AI claims.
