MMLU-Pro, SWE-Bench, GPQA, ARC-AGI: vendor benchmark cards look authoritative. Most are gameable, contaminated, or measure the wrong thing.
10 min · Reviewed 2026
The vendor card is not the whole truth
Every frontier model launches with a benchmark card: a wall of percentages on standard tests. Treat the card as marketing first, evidence second. Two identical-looking scores can come from different methodologies, different levels of training-data contamination, and different prompt-engineering tricks.
Common benchmarks and what they measure
| Benchmark          | What it measures           | Watch out for                         |
| ------------------ | -------------------------- | ------------------------------------- |
| MMLU-Pro           | Multi-domain reasoning     | Saturated; small differences are noise |
| SWE-Bench Verified | Real-world code repair     | Setup-script differences move scores  |
| GPQA Diamond       | Hard graduate science      | Memorization risk                     |
| ARC-AGI            | Abstract reasoning puzzles | Cost varies wildly by lab             |
| AIME / MATH        | Math competition problems  | Training contamination is endemic     |
| LiveBench          | Continuously refreshed     | Less gameable but newer methodology   |
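The "small differences are noise" warning can be made concrete with a binomial margin of error. The sketch below is a rough approximation, assuming each benchmark question is an independent pass/fail trial and that both models were scored on the same number of questions. The function names, scores, and test-set size are hypothetical, chosen only for illustration.

```python
import math

def score_margin(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an accuracy measured on n questions.

    Assumes independent pass/fail questions (a simplification).
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

def difference_is_noise(acc_a: float, acc_b: float, n_questions: int,
                        z: float = 1.96) -> bool:
    """True when the gap between two scores is within their combined margin."""
    var_a = acc_a * (1.0 - acc_a) / n_questions
    var_b = acc_b * (1.0 - acc_b) / n_questions
    return abs(acc_a - acc_b) < z * math.sqrt(var_a + var_b)

# Hypothetical card numbers: 88% vs 89% on a 1,000-question test.
close_gap = difference_is_noise(0.88, 0.89, 1000)
# Hypothetical card numbers: 70% vs 80% on the same test.
wide_gap = difference_is_noise(0.70, 0.80, 1000)

print(f"88% vs 89% is noise: {close_gap}")
print(f"70% vs 80% is noise: {wide_gap}")
```

Under these assumptions a one-point gap on 1,000 questions cannot be told apart from sampling noise, while a ten-point gap can.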
Three questions to ask any benchmark card
Was the test set in the training data — explicitly or by leak?
What inference scaffolding was used — chain-of-thought, multi-sample voting, agent loops?
How does the model perform on a fresh, never-published test of the same skill?
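The second question matters because scaffolding alone can move a score. The sketch below uses made-up sampled answers (no real model is called) to compare single-sample accuracy with majority voting over five samples per question. The data and function names are hypothetical and only illustrate the mechanism.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions that match the gold answers."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical data: 4 multiple-choice questions, 5 sampled answers each.
gold = ["B", "C", "A", "D"]
samples = [
    ["A", "B", "B", "C", "B"],
    ["C", "C", "D", "C", "A"],
    ["D", "A", "A", "D", "A"],
    ["B", "B", "D", "B", "A"],
]

# Single-sample score: take only the first sampled answer per question.
single_sample = accuracy([s[0] for s in samples], gold)
# Voted score: take the majority answer across all five samples.
voted = accuracy([majority_vote(s) for s in samples], gold)

print(f"single sample: {single_sample:.2f}")  # 0.25
print(f"majority of 5: {voted:.2f}")          # 0.75
```

The same hypothetical model scores 25% or 75% depending on how it was sampled, so a card that does not state its scaffolding cannot be compared with one that does.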
Applied exercise
Build a 25-question internal eval from real tasks at your company
Run the top three models against it
Compare your ranking to the public benchmark ranking
If they disagree, your eval is the more relevant one for your work; trust it
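The comparison step of the exercise can be sketched as follows. The model names and all scores are placeholders, not real results. The code ranks the models by each score set and lists the pairs whose order flips between your eval and the public benchmark.

```python
from itertools import combinations

def rank_models(scores: dict[str, float]) -> list[str]:
    """Model names ordered from best to worst score."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)

def discordant_pairs(internal: dict[str, float],
                     public: dict[str, float]) -> list[tuple[str, str]]:
    """Pairs of models that the two score sets order differently."""
    flipped = []
    for a, b in combinations(sorted(internal), 2):
        internal_gap = internal[a] - internal[b]
        public_gap = public[a] - public[b]
        if internal_gap * public_gap < 0:  # opposite signs: order flipped
            flipped.append((a, b))
    return flipped

# Placeholder scores: correct answers out of 25 on the internal eval,
# and a headline accuracy copied from a hypothetical benchmark card.
internal_scores = {"model_a": 18 / 25, "model_b": 21 / 25, "model_c": 15 / 25}
public_scores = {"model_a": 0.86, "model_b": 0.84, "model_c": 0.79}

internal_ranking = rank_models(internal_scores)
public_ranking = rank_models(public_scores)
flips = discordant_pairs(internal_scores, public_scores)

print("internal:", internal_ranking)
print("public:  ", public_ranking)
print("disagreements:", flips)
```

With only 25 questions, a gap of a few answers between two models may not be stable, so rerunning the eval before acting on a flipped pair is a reasonable precaution.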
The big idea: benchmark cards rank models. Your eval ranks them for your work. The second ranking is the one that matters.
End-of-lesson check
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-frontier-benchmark-cards-creators
What is the main idea of "Reading Benchmark Cards Critically"?