Multimodal Models: Vision, Audio, and What They Cannot See
What it actually means when a model can see images and hear audio.
Creators · AI Foundations · ~7 min read
The premise
Multimodal models encode images and audio into the same representation space they use for text, which lets them describe, transcribe, and reason across modalities. The capabilities are remarkable; the limits are predictable.
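To make "the same representation space" concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than how any particular model works: the numbers are hand-picked, the names `IMAGE_PROJECTION` and `TEXT_EMBEDDINGS` are hypothetical, and a real model learns these values from data across thousands of dimensions. The idea it shows is only this: once image features are projected into the space where word vectors live, the two can be compared directly.

```python
import math

def project(features, weights):
    """Linear projection: map a feature vector into the shared space."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical, hand-picked values (a real model learns these).
# The projection maps 3 raw image features into a 2-dimensional shared space.
IMAGE_PROJECTION = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
]
# Text embeddings that already live in the same 2-dimensional space.
TEXT_EMBEDDINGS = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}

# Made-up features for one image region.
patch_features = [0.9, 0.1, 0.0]
patch_vector = project(patch_features, IMAGE_PROJECTION)

# Because both vectors share a space, the image region can be scored against words.
closest = max(TEXT_EMBEDDINGS, key=lambda word: cosine(patch_vector, TEXT_EMBEDDINGS[word]))
print(closest)
```

With these toy values the image region lands nearest the word "cat". The point is the mechanism, not the numbers: description and question answering build on image and text content being comparable in one space.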
What AI does well here
- Describing images, including charts, screenshots, and diagrams
- Transcribing clear speech with reasonable accuracy
- Answering questions about an image when given relevant context
- Comparing two images for differences
What AI cannot do
- Reliably read fine print, low-resolution text, or messy handwriting
- Identify specific real people in photos
- Locate a feature in an image with pixel-level precision
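One plausible reason for the localization limit, sketched in Python under stated assumptions: many vision models cut an image into fixed-size patches and represent each patch as a single vector. The 16-pixel patch size and 224-pixel image width below are illustrative assumptions, not figures from this lesson, and `patch_index` is a hypothetical helper. If a model works this way, two pixels inside the same patch are hard to tell apart by position, and detail smaller than a patch (such as fine print) is compressed into one vector.

```python
def patch_index(x, y, image_width, patch_size=16):
    """Return the row-major index of the patch that contains pixel (x, y)."""
    patches_per_row = image_width // patch_size
    return (y // patch_size) * patches_per_row + (x // patch_size)

# Assumed setup: a 224-pixel-wide image split into 16x16 patches (14 per row).
first = patch_index(33, 5, image_width=224)
second = patch_index(47, 15, image_width=224)
neighbor = patch_index(48, 5, image_width=224)

# Two pixels 14 columns and 10 rows apart fall into the same patch.
print(first == second)
# Moving one more column to the right crosses into the next patch.
print(neighbor - first)
```

Under these assumptions, position is resolved to roughly the nearest patch rather than the nearest pixel, which is consistent with a model giving approximate locations ("top left", "near the center") more dependably than exact coordinates.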