Understand multimodal models that handle text, images, audio, and video together.
Multimodal AI handles more than text. GPT-5, Claude, and Gemini can all 'see' images and 'hear' audio. You can show the AI a photo of your homework, a math problem on a whiteboard, or send it a song clip.
Take a photo of something confusing (a sign, a chart, a recipe in another language). Send it to a multimodal AI and see if it 'gets' what you need.
Older models only read text. Modern models (GPT-4o, Claude Sonnet 4.5+, Gemini 2.5) are 'multimodal': one model handles text, images, audio, and, in Gemini's case, video. Paste a screenshot of an error, ask 'what's wrong?', and the model 'sees' the image. That's the whole point of multimodal: if you can show it, you don't have to describe it.
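If you'd rather try this from code than from a chat window, here is a minimal sketch of sending a screenshot plus a question to a multimodal model. It assumes the OpenAI Python SDK with an API key in your environment; the file name and question are placeholders, and the model name may change over time. Claude and Gemini offer similar APIs.

```python
# Minimal sketch: ask a multimodal model about a screenshot.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
import base64
from openai import OpenAI

client = OpenAI()

# Read a local screenshot (hypothetical file name) and base64-encode it
# so the image can travel inside the request.
with open("error_screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # any multimodal model the API offers
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's wrong in this error message?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The message carries two content parts, one text and one image, which is exactly the 'show it instead of describing it' idea expressed as data.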
Take a screenshot of something confusing today (an error message, a chart). Drop it into ChatGPT or Claude with one question. Skip the typing.
Multimodal models accept multiple input types. You can paste a screenshot of a bug, talk to ChatGPT in voice mode, or share a photo of your fridge and ask 'what can I cook?' Output is still mostly text, but image and voice output are growing.
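Voice input can work as one end-to-end audio model, the way native voice modes do, or as a simpler two-step pipeline: transcribe the speech, then send the text. Here is a sketch of the pipeline version, again assuming the OpenAI Python SDK; the voice-note file name is a placeholder and the model names may differ.

```python
# Minimal sketch: voice input as a two-step pipeline (speech -> text -> model).
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

# Step 1: transcribe a recorded voice note (hypothetical file) to plain text.
with open("voice_note.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Step 2: send the transcribed question on as ordinary text.
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)

print(answer.choices[0].message.content)
```

The pipeline is easy to build but loses tone of voice and timing; end-to-end audio models keep that information, which is why voice modes feel more natural.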
Take a screenshot of something on your screen and ask an AI a question about it. Notice how much easier sending the picture was than typing out a description.
15 questions · take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-builders-model-families-AI-and-multimodal-teen
What makes a model 'multimodal'?
A student takes a photo of a math problem on a whiteboard and asks an AI to solve it. This is an example of:
Why might showing an AI a picture of a confusing sign be better than typing out a description?
Which of these is NOT mentioned as a use case for multimodal AI in the lesson?
What does it mean when the lesson says AI can 'see' images?
A user sends a voice note to an AI assistant instead of typing. What capability is being used?
The lesson says: 'If you can show it, you don't have to describe it.' What is the main advantage of this approach?
Which of these inputs would a multimodal AI definitely be able to process?
Your grandma can't see a funny meme online. How could a multimodal AI help?
What would happen if you tried to show a photo of a recipe to a text-only AI?
The lesson mentions GPT-5, Claude, and Gemini as examples of multimodal models. What do these represent?
You see a chart with confusing data on it. What could you do with a multimodal AI that you couldn't do with a text-only AI?
Why is it useful that multimodal AI can 'hear' audio?
A developer shows a multimodal AI a screenshot of a broken website button. What is the likely goal?
What is computer vision?