Vision, Voice, Tools
Multimodal I/O and tool use at the API level.
Modern LLM APIs aren’t just text in / text out. They accept images, audio, and video, and they can call external functions — which is how agents get anything done in the real world.
Vision at the API level
Pass an image (URL or base64) alongside your text prompt in a multi-part messages array. The vision encoder turns the image into tokens that share the same embedding space as text, so the model can “read” the image.
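As a minimal sketch using the OpenAI Python SDK (the model name and image URL here are placeholder assumptions; other providers use a similar multi-part shape):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A multi-part user message: one text part, one image part.
response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable model works here
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```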
Audio — two patterns
- ASR → LLM → TTS. Traditional pipeline: Whisper transcribes, the LLM responds, a TTS model speaks. High latency, high quality; see the sketch after this list.
- End-to-end speech. Newer models (GPT-4o Realtime, Gemini Live) ingest and emit audio directly, with sub-second latency and the ability to handle interruptions naturally.
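A minimal sketch of the first pattern with the OpenAI Python SDK (model and voice names are assumptions; any ASR/LLM/TTS stack slots into the same three steps):

```python
from openai import OpenAI

client = OpenAI()

# 1. ASR: transcribe the user's speech.
with open("question.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# 2. LLM: respond to the transcribed text.
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = chat.choices[0].message.content

# 3. TTS: speak the response aloud.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
with open("answer.mp3", "wb") as out:
    out.write(speech.content)
```

Each hop adds latency, which is exactly the cost the end-to-end speech models are designed to remove.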
Tool use (function calling)
You define a list of tools with JSON schemas. The model can decide to call one of them instead of responding directly. Your server runs the tool, passes the result back, and the model continues. This is the primitive beneath every agent.
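In the OpenAI-style API, a tool definition looks roughly like this (get_weather is a hypothetical example tool, not a built-in):

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {  # standard JSON Schema
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]
```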
MCP — Model Context Protocol
Anthropic’s open standard for exposing tools (and data sources) to AI systems. Any MCP-compliant server — a Postgres DB, a Figma file, a GitHub repo — becomes instantly usable by any MCP-compliant client; think of it as the “USB-C” of AI tool use.
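As a sketch, assuming the official mcp Python SDK and its FastMCP helper, exposing a tool over MCP can be this small (the server name and tool are hypothetical):

```python
from mcp.server.fastmcp import FastMCP

# Hypothetical server exposing one tool to any MCP client.
mcp = FastMCP("weather-demo")

@mcp.tool()
def get_weather(city: str) -> dict:
    """Return current weather for a city (hardcoded for the demo)."""
    return {"temp": 58, "conditions": "cloudy"}

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```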
A concrete tool-use loop
user: "What's the weather in Cleveland?"
model: calls get_weather(city="Cleveland")
your server: { temp: 58, conditions: "cloudy" }
model: "It's 58°F and cloudy in Cleveland right now."Tutor