Multi-modal AI takes in more than just text: pictures, sound, and video too.
Old AI could only read text. New 'multi-modal' AI can also look at pictures, listen to your voice, or watch video.
If your AI app has a camera button, snap a photo of an object and ask 'What is this?'
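For a sense of what happens behind that camera button, here is a minimal sketch, assuming the OpenAI Python SDK and a vision-capable model; the file name photo.jpg and the question are placeholders, and other multi-modal APIs follow a similar shape.

```python
import base64
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Read the snapped photo and encode it as a data URL the API can accept.
with open("photo.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable model works here
    messages=[{
        "role": "user",
        "content": [
            # The text question and the image travel in the same message:
            # that is what makes the request multi-modal.
            {"type": "text", "text": "What is this?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}},
        ],
    }],
)

# The model answers in plain text, just like a text-only chat.
print(response.choices[0].message.content)
```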
Here's why "Some AI Can See Pictures and Hear Sound" matters: learning about AI is one of the most important skills you can build for the future, and knowing how multi-modal AI uses pictures, sound, and video gives you a concrete advantage.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-explorers-foundations-AI-and-the-multi-modal-input-r10a5
1. Which of these is an example of vision AI being used?
2. If you show an AI a photo of your pet and ask 'What is this?', what is the AI doing?
3. What is audio AI mainly used for?
4. A blind person uses an AI app that looks at photos and tells them what's in the picture. What kind of AI is this?
5. Which of these inputs can a multi-modal AI accept?
6. Why is it helpful that AI can now 'see' pictures?
7. What does the term 'multi-modal' mean when describing AI?
8. If an AI app has a camera button, what can you do with it?
9. Which of these is a real example of audio AI?
10. What is the main difference between old AI and new multi-modal AI?
11. A student photographs their homework with their phone and an AI reads the handwritten answers. What is happening?
12. Which of the following can a multi-modal AI accept as input?
13. If you wanted an AI to look at a painting and tell you what it shows, what type of AI would you need?
14. What happens when you talk to an AI instead of typing?
15. Why might someone use the camera feature on an AI app?