Multimodal Benchmarks
Evaluating models that see, hear, and read at once requires new kinds of tests. Here are the ones that matter.
Lesson map
What this lesson covers
Learning path
The main moves in order
- 1. Beyond Text
- 2. Multimodal
- 3. Vision
- 4. MMMU
Section 1
Beyond Text
Once models handle images, audio, and video, text-only benchmarks are insufficient. A new family of multimodal benchmarks has emerged. Knowing them helps you cut through vision-language model marketing.
Compare the options
| Benchmark | Modality | What it tests |
|---|---|---|
| MMMU | Images + text | College-level multidisciplinary reasoning over figures |
| MathVista | Images + text | Visual math and geometry |
| ChartQA | Charts + text | Reading and reasoning over charts |
| DocVQA | Document images | Extracting info from real documents |
| MVBench / Video-MME | Video | Video understanding |
| AudioSet / MMAU | Audio | Sound understanding |
MMMU: the multimodal MMLU
Released in 2023, MMMU contains 11,500 college-level questions spanning 30 subjects, each paired with one or more images: diagrams, charts, medical scans, chemical structures. It demands a realistic mix of visual and textual reasoning. Human experts score around 88 percent; as of early 2026, frontier models are approaching that level but have not reached it.
Known pitfalls
- Text-only solvability: many 'multimodal' questions can be answered without the image
- OCR leakage: models that are good at OCR can game vision tests by reading text inside the image
- Cultural and language bias in image sources
- Small test sets inflate noise
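The last pitfall is easy to quantify. Treating each question as an independent coin flip, the standard error of an accuracy score is sqrt(p(1-p)/n). A minimal sketch of that arithmetic (the 500-question test size is an illustrative assumption, not a specific benchmark):

```python
import math

def accuracy_stderr(accuracy: float, n_questions: int) -> float:
    """One-sigma noise on a benchmark accuracy under a binomial model."""
    return math.sqrt(accuracy * (1 - accuracy) / n_questions)

# A 65% score on a hypothetical 500-question test carries about
# +/-2.1 points of one-sigma noise, so a 2-point "win" over a rival
# model can easily be chance.
small = accuracy_stderr(0.65, 500)

# The same score over MMMU's ~11,500 questions is far tighter.
large = accuracy_stderr(0.65, 11_500)

print(round(small * 100, 2), round(large * 100, 2))
```

This is why a small leaderboard gap on a small test set tells you very little on its own.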
Video: the new frontier
Video benchmarks like Video-MME and LongVideoBench measure temporal reasoning over long clips. These are genuinely hard — even top 2025 models score 60-70 percent on short videos and much less on long ones.
“MMMU includes 11.5K questions collected from college exams, quizzes, and textbooks, covering six core disciplines.”
The big idea: multimodal benchmarks test whether a model actually sees, or whether it is reading labels and faking the rest. Always ask about the text-only baseline.
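The text-only baseline is simple to run yourself: evaluate the same questions twice, once with the image and once without, and look at the gap. A minimal sketch, where `ask` is a hypothetical stand-in for whatever model API you are testing:

```python
def text_only_gap(questions, ask):
    """Accuracy with images minus accuracy without them.

    `questions` is a list of dicts with 'text', 'image', and 'answer'
    keys; `ask(text, image)` returns the model's answer, and passing
    image=None runs the text-only ablation.
    """
    n = len(questions)
    with_image = sum(ask(q["text"], q["image"]) == q["answer"] for q in questions)
    text_only = sum(ask(q["text"], None) == q["answer"] for q in questions)
    return with_image / n - text_only / n

# A gap near zero means the "multimodal" score is mostly reading,
# not seeing: the model never needed the image.
```

If a benchmark paper or model card does not report this ablation, run it before trusting the headline number.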
Related lessons
- Scaling Laws and Compute-Optimal Training (50 min): Dive into the equations that governed the last five years of AI progress, and the fresh questions they raise now that pure scaling is hitting walls.
- How Chatbot Arena Works (35 min): The world's most influential 'leaderboard' for AI is not a test — it is humans voting blindly. Here is how that works.
- Benchmark Contamination (38 min): When the test questions quietly end up in the training data, scores lie. Here is how it happens and how to catch it.
