Seeing & Speaking AI
Vision, voice, and multimodal models.
“Multimodal” is a fancy word for “can handle more than just text.” Modern AIs can see pictures, hear you talk, watch videos, and respond out loud. Here’s how.
Vision — AI that sees
Upload a picture and ask about it. The AI turns your picture into a bunch of numbers (we covered this in the Explorers tier) and answers like it’s part of the conversation. Try: take a photo of a math problem and ask the AI to walk through it.
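Under the hood, a vision request usually bundles the image (often base64-encoded) together with your question in a single message. Here is a minimal sketch in the style of common vision-chat APIs; the model name and field names are illustrative, not any specific vendor's API:

```python
import base64

def build_vision_request(image_bytes: bytes, question: str) -> dict:
    """Bundle an image and a question into one chat-style message.

    The payload shape mimics common vision-chat APIs, but the exact
    field names vary by provider -- treat this as an illustration.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "some-vision-model",  # placeholder model name
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image", "data": encoded},
                ],
            }
        ],
    }

# Example: pretend these bytes are a photo of a math worksheet.
request = build_vision_request(b"\x89PNG...", "Walk me through problem 3.")
```

The key idea is that text and image travel in the same message, so the model answers about the picture as naturally as it answers about words.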
Voice — AI that listens and speaks
Three things are happening: your voice gets converted into text (called speech-to-text), the AI writes a reply, and that reply gets converted back into spoken audio (called text-to-speech). Some AIs now do all of this in one model, which is why voice conversations feel more natural than they used to.
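That classic pipeline can be sketched as three plain functions wired together. The bodies below are stand-in stubs (a real version would call speech and language models), but the shape of the flow is the same:

```python
def transcribe(audio: bytes) -> str:
    """Speech-to-text stub; a real version would call an STT model."""
    return "what is a multimodal model?"

def generate_reply(text: str) -> str:
    """Language-model stub; a real version would call an LLM."""
    return f"You asked: '{text}' -- a multimodal model handles more than text."

def speak(text: str) -> bytes:
    """Text-to-speech stub; a real version would return synthesized audio."""
    return text.encode("utf-8")  # placeholder for audio bytes

def voice_turn(audio_in: bytes) -> bytes:
    """One conversational turn: audio in -> text -> reply -> audio out."""
    heard = transcribe(audio_in)
    reply = generate_reply(heard)
    return speak(reply)

audio_out = voice_turn(b"...raw microphone audio...")
```

Each hop adds a little delay and can lose nuance (tone, pauses), which is exactly what single-step voice models try to avoid.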
Video — still weird, improving fast
Video AI can watch a short clip and answer questions about it (“what just happened?”) and — in models like Sora and Veo — generate new video from a text prompt. This is the fastest-moving area in AI right now.
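Video models typically don't look at every single frame; a common trick is to sample a handful of evenly spaced frames and treat each one as an image. A sketch of that sampling step (the function name is made up for illustration):

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples evenly spaced frame indices from a clip."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))
    segment = total_frames / num_samples
    # Take the middle frame of each segment for an even spread.
    return [int(segment * i + segment / 2) for i in range(num_samples)]

# A 300-frame clip (about 10 seconds at 30 fps) sampled down to 8 frames:
indices = sample_frame_indices(300, 8)
```

Those sampled frames can then be fed to a vision model one by one, which is why "what just happened?" questions work even on models that were mostly trained on still images.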
Why multimodal matters
A tutor AI that can look at your handwritten work is way more useful than one that can only read what you type. A research assistant that can watch a lecture and take notes is closer to how you actually study. These aren’t different products — they’re the same AI, looking through more senses.
The privacy question
Whenever you upload a photo or a voice recording, it leaves your device and goes to the AI company's servers. Think twice before uploading photos of other people, your school ID, or anything you wouldn't want stored somewhere you can't see.