Multimodal Benchmarks
Evaluating models that see, hear, and read at once requires new kinds of tests. Here are the ones that matter.
Lesson map
What this lesson covers
Learning path
The main moves in order
- 1. Beyond Text
- 2. Multimodal
- 3. Vision
- 4. MMMU
Section 1
Beyond Text
Once models handle images, audio, and video, text-only benchmarks are insufficient. A new family of multimodal benchmarks has emerged. Knowing them helps you cut through vision-language model marketing.
Compare the options
| Benchmark | Modality | What it tests |
|---|---|---|
| MMMU | Images + text | College-level multidisciplinary reasoning over figures |
| MathVista | Images + text | Visual math and geometry |
| ChartQA | Charts + text | Reading and reasoning over charts |
| DocVQA | Document images | Extracting info from real documents |
| MVBench / Video-MME | Video | Video understanding |
| AudioSet / MMAU | Audio | Sound understanding |
MMMU: the multimodal MMLU
Released in 2023, MMMU contains 11,500 college-level questions spanning 30 subjects, each paired with one or more images: diagrams, charts, medical scans, chemical structures. It demands a realistic mix of visual and textual reasoning. Human experts score around 88 percent; as of early 2026, frontier models are approaching that level but have not reached it.
Known pitfalls
- Text-only solvability: many 'multimodal' questions can be answered without the image
- OCR leakage: models that are good at OCR can game vision tests by reading text inside the image
- Cultural and language bias in image sources
- Small test sets inflate noise
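The last pitfall is easy to quantify. Treating each question as an independent coin flip, the standard error of an accuracy score is sqrt(p(1-p)/n). A minimal sketch of that arithmetic (the 500-question test size is an illustrative assumption, not a specific benchmark):

```python
import math

def accuracy_stderr(accuracy: float, n_questions: int) -> float:
    """One-sigma noise on a benchmark accuracy under a binomial model."""
    return math.sqrt(accuracy * (1 - accuracy) / n_questions)

# A 65% score on a hypothetical 500-question test carries about
# +/-2.1 points of one-sigma noise, so a 2-point "win" over a rival
# model can easily be chance.
small = accuracy_stderr(0.65, 500)

# The same score over MMMU's ~11,500 questions is far tighter.
large = accuracy_stderr(0.65, 11_500)

print(round(small * 100, 2), round(large * 100, 2))
```

This is why a small leaderboard gap on a small test set tells you very little on its own.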
Video: the new frontier
Video benchmarks like Video-MME and LongVideoBench measure temporal reasoning over long clips. These are genuinely hard — even top 2025 models score 60-70 percent on short videos and much less on long ones.
“MMMU includes 11.5K questions collected from college exams, quizzes, and textbooks, covering six core disciplines.”
The big idea: multimodal benchmarks test whether a model actually sees, or whether it is reading labels and faking the rest. Always ask about the text-only baseline.
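The text-only baseline is simple to run yourself: evaluate the same questions twice, once with the image and once without, and look at the gap. A minimal sketch, where `ask` is a hypothetical stand-in for whatever model API you are testing:

```python
def text_only_gap(questions, ask):
    """Accuracy with images minus accuracy without them.

    `questions` is a list of dicts with 'text', 'image', and 'answer'
    keys; `ask(text, image)` returns the model's answer, and passing
    image=None runs the text-only ablation.
    """
    n = len(questions)
    with_image = sum(ask(q["text"], q["image"]) == q["answer"] for q in questions)
    text_only = sum(ask(q["text"], None) == q["answer"] for q in questions)
    return with_image / n - text_only / n

# A gap near zero means the "multimodal" score is mostly reading,
# not seeing: the model never needed the image.
```

If a benchmark paper or model card does not report this ablation, run it before trusting the headline number.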
Related lessons
- Scaling Laws and Compute-Optimal Training (50 min): Dive into the equations that governed the last five years of AI progress, and the fresh questions they raise now that pure scaling is hitting walls.
- How Chatbot Arena Works (35 min): The world's most influential 'leaderboard' for AI is not a test — it is humans voting blindly. Here is how that works.
- Benchmark Contamination (38 min): When the test questions quietly end up in the training data, scores lie. Here is how it happens and how to catch it.
