Evaluating models that see, hear, and read at once requires new kinds of tests. Here are the ones that matter.
Once models handle images, audio, and video, text-only benchmarks are insufficient. A new family of multimodal benchmarks has emerged. Knowing them helps you cut through vision-language model marketing.
| Benchmark | Modality | What it tests |
|---|---|---|
| MMMU | Images + text | College-level multidisciplinary reasoning over figures |
| MathVista | Images + text | Visual math and geometry |
| ChartQA | Charts + text | Reading and reasoning over charts |
| DocVQA | Document images | Extracting info from real documents |
| MVBench / Video-MME | Video | Video understanding |
| AudioSet / MMAU | Audio | Sound classification (AudioSet) and audio reasoning (MMAU) |
MMMU, released in 2023, contains 11,500 college-level questions drawn from 30 subjects, each paired with one or more images: diagrams, charts, medical scans, chemical structures. It reflects a real-world mix of visual and textual reasoning. The human expert score is around 88 percent; frontier models are approaching it but, as of early 2026, not there yet.
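Because MMMU's 30 subjects have uneven question counts, how you average matters: micro-averaging (over all questions) and macro-averaging (over subjects) can disagree. A hypothetical scoring helper, not the official MMMU evaluator, illustrates the difference:

```python
from collections import defaultdict

def benchmark_scores(results):
    """results: list of (subject, correct: bool) pairs.
    Returns (micro, macro) accuracy: micro averages over all questions,
    macro averages the per-subject accuracies."""
    by_subject = defaultdict(list)
    for subject, correct in results:
        by_subject[subject].append(correct)
    total = sum(len(v) for v in by_subject.values())
    micro = sum(c for v in by_subject.values() for c in v) / total
    macro = sum(sum(v) / len(v) for v in by_subject.values()) / len(by_subject)
    return micro, macro

# 2 Art questions (both right), 4 Math questions (1 right):
results = [("Art", True), ("Art", True),
           ("Math", False), ("Math", False), ("Math", False), ("Math", True)]
print(benchmark_scores(results))  # (0.5, 0.625)
```

With lopsided subject sizes the two numbers diverge, which is one reason leaderboard comparisons should state which average they report.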
Video benchmarks like Video-MME and LongVideoBench measure temporal reasoning over long clips. These are genuinely hard — even top 2025 models score 60-70 percent on short videos and much less on long ones.
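Part of what makes long video hard is sheer token cost: input size scales roughly as frames times tokens per frame. A back-of-envelope sketch (the 256-tokens-per-frame figure is an illustrative assumption; real encoders vary):

```python
def video_token_cost(duration_s: float, fps_sampled: float, tokens_per_frame: int) -> int:
    """Rough token budget for a video clip: tokens ≈ frames × tokens-per-frame."""
    frames = int(duration_s * fps_sampled)
    return frames * tokens_per_frame

# A 10-minute clip sampled at 1 frame/s, assuming 256 tokens per frame:
print(video_token_cost(600, 1.0, 256))  # 153600 tokens
```

At that rate an hour-long video blows past most context windows, which is why long-video benchmarks stress frame sampling and memory strategies as much as raw perception.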
MMMU includes 11.5K questions collected from college exams, quizzes, and textbooks, covering six core disciplines.
— Yue et al., MMMU paper (2023)
The big idea: multimodal benchmarks test whether a model actually sees, or whether it is reading labels and faking the rest. Always ask about the text-only baseline.
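One way to make that baseline check concrete: compare accuracy with and without the images, relative to chance. The metric below is a hypothetical diagnostic, not a standard benchmark statistic, assuming multiple-choice questions with a known chance rate:

```python
def vision_gain(multimodal_acc: float, text_only_acc: float, chance_acc: float) -> float:
    """Fraction of the model's above-chance accuracy that disappears
    when the images are removed — a rough proxy for 'actually seeing'."""
    if multimodal_acc <= chance_acc:
        return 0.0  # no above-chance signal to attribute
    return (multimodal_acc - text_only_acc) / (multimodal_acc - chance_acc)

# 62% with images, 48% image-blind, 25% chance on 4-way multiple choice:
print(round(vision_gain(0.62, 0.48, 0.25), 2))  # 0.38
```

A low value means most of the score survives without the image, i.e. the benchmark's questions (or the model's answers) lean heavily on text priors.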