Multimodal Models: Vision, Audio, and What They Cannot See
What it actually means when a model can "see" images and "hear" audio.
11 min · Reviewed 2026
The premise
Multimodal models translate images and audio into the same representation space as text, letting them describe, transcribe, and reason across modalities. The capabilities are remarkable; the limits are predictable.
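The "shared representation space" idea can be sketched in a few lines: every input, whatever its modality, becomes a vector, and cross-modal comparison reduces to vector similarity. This is a toy illustration only, the embedding values below are invented, real models learn them from data.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings in a shared 3-dimensional space:
# a photo of a dog, the caption "a dog", and the caption "a spreadsheet".
image_dog = [0.9, 0.1, 0.0]
text_dog = [0.8, 0.2, 0.1]
text_spreadsheet = [0.0, 0.1, 0.95]

# The matching caption scores far higher than the unrelated one,
# which is what lets one model reason across modalities.
print(cosine_similarity(image_dog, text_dog)
      > cosine_similarity(image_dog, text_spreadsheet))  # True
```

In a real system the vectors have hundreds or thousands of dimensions, but the mechanism is the same: describing an image is, loosely, finding text whose representation sits near the image's.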
What AI does well here
Describing images, including charts, screenshots, and diagrams
Transcribing audio with reasonable accuracy for clear speech
Answering questions about an image given context
Comparing two images for differences
What AI cannot do
Reliably read fine print, low-resolution text, or messy handwriting
Identify specific real people in photos
Tell you exactly where in an image a feature is at pixel precision
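Because a model can confidently "read" text that is not actually there, any text extracted from an image should be treated as unverified until it matches a trusted source. A minimal sketch of that guard, assuming you have the original document text to check against (the helper name and fuzzy-matching caveat are illustrative, not a real library API):

```python
def unverified_quotes(extracted_quotes, source_text):
    """Return extracted quotes that do not appear verbatim in the trusted source.

    Illustrative helper: treat anything the model "read" from an image as
    unverified until it matches the original text. Real pipelines may also
    need fuzzy matching to tolerate OCR-style noise.
    """
    normalized_source = " ".join(source_text.split()).lower()
    return [
        q for q in extracted_quotes
        if " ".join(q.split()).lower() not in normalized_source
    ]

source = "The council voted 7-2 to approve the SALE 30% OFF signage ordinance."
model_output = ["voted 7-2 to approve", "SALE 50% OFF"]

print(unverified_quotes(model_output, source))  # ['SALE 50% OFF']
```

Exact substring matching is deliberately strict: a flagged quote is not necessarily wrong, but it is a prompt for a human to go back to the source before publishing.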
End-of-lesson check
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-ai-foundations-multimodal-final1-creators
What enables a multimodal AI system to process images, text, and audio within a single model?
Processing only one modality at a time to avoid confusion
Converting different input types into a shared representation space the model can understand
Using completely separate models for each modality that never communicate
Transforming all inputs into audio signals before processing
A model is shown a low-resolution photograph of a handwritten recipe and correctly identifies the main ingredients but misreads several spice names. This demonstrates which documented limitation?
Inability to recognize food items in photographs
Lack of general knowledge about cooking
Difficulty reliably reading low-resolution text and messy handwriting
Complete refusal to process unfamiliar inputs
A news organization uses a multimodal AI to extract quotes from a photograph of a printed document. What critical caution should guide their use of this output?
Verify every extracted quote against the original source because the model may confidently describe text that isn't there
The system cannot read printed documents, only handwritten text
AI-extracted quotes from images are always accurate and need no verification
They should only trust quotes extracted from audio, never from images
A user uploads two nearly identical product photos and asks an AI to identify all differences between them. What capability is the model demonstrating?
Creating 3D models based on two-dimensional photos
Generating completely new images from scratch
Comparing two images to detect differences between them
Translating visual content into multiple languages
A user asks a multimodal model to identify a specific celebrity in a photograph. The model refuses to provide a name. Why might this happen?
The model can only process text, not images
The model cannot reliably identify specific real people in photographs
Photographs of celebrities are always too low-resolution to process
Celebrities are never included in training data
An educator uploads a complex organizational chart and asks an AI to describe what it shows. Which capability is the model expected to demonstrate?
Generating a completely new organizational chart
Identifying every employee by name from the chart
Reading text smaller than 5 pixels with perfect accuracy
Describing charts, diagrams, and screenshots with reasonable accuracy
A multimodal model provides a detailed description of an image but when asked for the exact location of a specific object, it gives only a general area. What limitation does this illustrate?
Complete failure to understand image content
Refusal to provide location information on principle
Lack of any visual perception capabilities
Inability to locate features at pixel-level precision
A researcher uploads a clear audio recording of someone speaking in a quiet room. What reasonable expectation should they have for transcription accuracy?
Perfect accuracy regardless of audio quality or background noise
Generation of a written summary without being asked
Automatic translation into ten different languages
Reasonable accuracy for clear speech transcription
What fundamental capability distinguishes multimodal models from text-only or image-only models?
Lower computational resource requirements
Ability to reason across different input types like text, images, and audio
Faster processing speed than single-modality models
Higher accuracy on tasks involving only one modality
A user asks a model to provide exact pixel coordinates for a red car in a photograph. The model describes the car's location in general terms but cannot provide pixel-level precision. Which limitation best explains this?
Refusal to process photographs of vehicles
Inability to locate features at pixel precision in images
Lack of color recognition capabilities
Complete failure to detect vehicles in photographs
A student uploads a data visualization from a financial report and wants to test the AI's abilities. Which question would best showcase the model's documented strengths?
Asking the model to generate an entirely new chart
Requesting predictions about future stock prices based on the chart
Requesting the model to identify each person visible in the chart
Asking what the chart shows, what the axes represent, and what conclusion it supports
A company wants to automate extraction of information from identity documents using multimodal AI. Which documented limitation should they account for in their workflow?
Refusal to process official government documents
Potential errors with fine print, low-resolution text, or messy handwriting on documents
Complete inability to read any text appearing in images
Inability to distinguish between different types of documents
A user provides context about a historical event and then asks detailed questions about a photograph from that event. What capability is the model demonstrating?
Generating new photographs of historical events
Accessing real-time information about current events
Reading the intentions of people in the photograph
Answering questions about an image when given relevant context
A model confidently reads a sign in an image as saying 'SALE 50% OFF' when the sign actually says 'SALE 30% OFF.' What is the most important takeaway about multimodal model limitations?
Models can confidently describe text that isn't there; verification against the source is essential
Models never make mistakes with text recognition in images
The model was deliberately providing incorrect information
Image-based text recognition is completely unreliable and useless
Which of the following is explicitly listed in the lesson as a capability that multimodal models can perform?
Describing charts, screenshots, and diagrams in detail
Identifying specific real people in photographs with names
Reading handwritten notes with perfect accuracy
Providing exact pixel coordinates for objects in images