Multimodal Models: Vision, Audio, and What They Cannot See
What it actually means when a model can see images and hear audio.
Lesson map
The main moves, in order:
- The premise
- Multimodal
- Vision
- Audio
Section 1
The premise
Multimodal models translate images and audio into the same representation space as text, letting them describe, transcribe, and reason across modalities. The capabilities are remarkable; the limits are predictable.
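To make "same representation space" concrete, here is a minimal sketch of CLIP-style image-text matching using the Hugging Face transformers library: an image and several candidate captions are embedded into one shared space and scored against each other. The checkpoint name is a real public model, but the image path and captions are placeholders, and the large multimodal assistants this lesson describes use bigger models than this sketch.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Load a small public image-text model and its preprocessor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chart.png")  # placeholder local file
captions = [
    "a bar chart of quarterly revenue",
    "a photo of a cat",
    "handwritten meeting notes",
]

# Embed the image and the captions into the same representation space.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image's similarity to each caption in that space.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0]):
    print(f"{p:.2f}  {caption}")
```

The highest-scoring caption is the one whose text embedding lies closest to the image embedding, which is the basic mechanism behind describing and matching across modalities.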
What AI does well here
- Describing images, including charts, screenshots, and diagrams
- Transcribing audio with reasonable accuracy for clear speech (see the transcription sketch after this list)
- Answering questions about an image given context
- Comparing two images for differences
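The transcription bullet above can be tried locally with the open-source openai-whisper package. This is a hedged sketch under stated assumptions, not the pipeline any particular assistant uses; the audio file name is a placeholder, and the tool expects ffmpeg to be installed.

```python
import whisper  # pip install openai-whisper

# Load a small general-purpose checkpoint; "medium" or "large" trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe a local audio file; the path is a placeholder.
result = model.transcribe("meeting.wav")
print(result["text"])
```

Accuracy is strongest on clear, single-speaker speech and degrades with heavy accents, crosstalk, or noisy recordings, which is exactly the pattern the list above predicts.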
What AI cannot do
- Reliably read fine print, low-resolution text, or messy handwriting
- Identify specific real people in photos
- Pinpoint exactly where a feature sits in an image with pixel-level precision