Seeing & Speaking AI
Vision, voice, and multimodal models.
“Multimodal” is a fancy word for “can handle more than just text.” Modern AIs can see pictures, hear you talk, watch videos, and respond out loud. Here’s how.
Vision — AI that sees
Upload a picture and ask about it. The AI turns your picture into a bunch of numbers (we covered this in the Explorers tier) and answers like it’s part of the conversation. Try: take a photo of a math problem and ask the AI to walk through it.
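Under the hood, a vision request usually bundles the image (often base64-encoded) together with your question in a single message. Here is a minimal sketch in the style of common vision-chat APIs; the model name and field names are illustrative, not any specific vendor's API:

```python
import base64

def build_vision_request(image_bytes: bytes, question: str) -> dict:
    """Bundle an image and a question into one chat-style message.

    The payload shape mimics common vision-chat APIs, but the exact
    field names vary by provider -- treat this as an illustration.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "some-vision-model",  # placeholder model name
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image", "data": encoded},
                ],
            }
        ],
    }

# Example: pretend these bytes are a photo of a math worksheet.
request = build_vision_request(b"\x89PNG...", "Walk me through problem 3.")
```

The key idea is that text and image travel in the same message, so the model answers about the picture as naturally as it answers about words.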
Voice — AI that listens and speaks
Three things are happening: your voice gets converted into text (called speech-to-text), the AI writes a reply, and that reply gets converted back into spoken audio (called text-to-speech). Some AIs now do all of this in one model, which is why voice conversations feel more natural than they used to.
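That classic pipeline can be sketched as three plain functions wired together. The bodies below are stand-in stubs (a real version would call speech and language models), but the shape of the flow is the same:

```python
def transcribe(audio: bytes) -> str:
    """Speech-to-text stub; a real version would call an STT model."""
    return "what is a multimodal model?"

def generate_reply(text: str) -> str:
    """Language-model stub; a real version would call an LLM."""
    return f"You asked: '{text}' -- a multimodal model handles more than text."

def speak(text: str) -> bytes:
    """Text-to-speech stub; a real version would return synthesized audio."""
    return text.encode("utf-8")  # placeholder for audio bytes

def voice_turn(audio_in: bytes) -> bytes:
    """One conversational turn: audio in -> text -> reply -> audio out."""
    heard = transcribe(audio_in)
    reply = generate_reply(heard)
    return speak(reply)

audio_out = voice_turn(b"...raw microphone audio...")
```

Each hop adds a little delay and can lose nuance (tone, pauses), which is exactly what single-step voice models try to avoid.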
Video — still weird, improving fast
Video AI can watch a short clip and answer questions about it (“what just happened?”) and — in models like Sora and Veo — generate new video from a text prompt. This is the fastest-moving area in AI right now.
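Video models typically don't look at every single frame; a common trick is to sample a handful of evenly spaced frames and treat each one as an image. A sketch of that sampling step (the function name is made up for illustration):

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples evenly spaced frame indices from a clip."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))
    segment = total_frames / num_samples
    # Take the middle frame of each segment for an even spread.
    return [int(segment * i + segment / 2) for i in range(num_samples)]

# A 300-frame clip (about 10 seconds at 30 fps) sampled down to 8 frames:
indices = sample_frame_indices(300, 8)
```

Those sampled frames can then be fed to a vision model one by one, which is why "what just happened?" questions work even on models that were mostly trained on still images.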
Why multimodal matters
A tutor AI that can look at your handwritten work is way more useful than one that can only read what you type. A research assistant that can watch a lecture and take notes is closer to how you actually study. These aren’t different products — they’re the same AI, looking through more senses.
The privacy question
Whenever you upload a photo or a voice recording, it leaves your device and goes to the AI company's servers. Think twice before uploading photos of other people, your school ID, or anything you wouldn't want stored somewhere you can't see.