Vision, Voice, Tools
Multimodal I/O and tool use at the API level.
Modern LLM APIs aren’t just text in / text out. They accept images, audio, and video, and they can call external functions — which is how agents get anything done in the real world.
Vision at the API level
Pass an image (URL or base64) alongside your text prompt in a multi-part messages array. The vision encoder turns the image into tokens that share the same embedding space as text, so the model can “read” the image.
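As a minimal sketch using the OpenAI Python SDK (the model name and image URL here are placeholder assumptions; other providers use a similar multi-part shape):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A multi-part user message: one text part, one image part.
response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable model works here
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```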
Audio — two patterns
- ASR → LLM → TTS. Traditional pipeline: Whisper transcribes, the LLM responds, a TTS model speaks. High latency, high quality; see the sketch after this list.
- End-to-end speech. Newer models (GPT-4o Realtime, Gemini Live) ingest and emit audio directly, with sub-second latency and the ability to handle interruptions naturally.
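A minimal sketch of the first pattern with the OpenAI Python SDK (model and voice names are assumptions; any ASR/LLM/TTS stack slots into the same three steps):

```python
from openai import OpenAI

client = OpenAI()

# 1. ASR: transcribe the user's speech.
with open("question.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# 2. LLM: respond to the transcribed text.
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = chat.choices[0].message.content

# 3. TTS: speak the response aloud.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
with open("answer.mp3", "wb") as out:
    out.write(speech.content)
```

Each hop adds latency, which is exactly the cost the end-to-end speech models are designed to remove.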
Tool use (function calling)
You define a list of tools with JSON schemas. The model can decide to call one of them instead of responding directly. Your server runs the tool, passes the result back, and the model continues. This is the primitive beneath every agent.
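In the OpenAI-style API, a tool definition looks roughly like this (get_weather is a hypothetical example tool, not a built-in):

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {  # standard JSON Schema
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]
```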
MCP — Model Context Protocol
Anthropic’s open standard for exposing tools (and data sources) to AI systems. Any MCP-compliant server — a Postgres DB, a Figma file, a GitHub repo — becomes instantly usable by any MCP-compliant client; think of it as the “USB-C” of AI tool use.
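As a sketch, assuming the official mcp Python SDK and its FastMCP helper, exposing a tool over MCP can be this small (the server name and tool are hypothetical):

```python
from mcp.server.fastmcp import FastMCP

# Hypothetical server exposing one tool to any MCP client.
mcp = FastMCP("weather-demo")

@mcp.tool()
def get_weather(city: str) -> dict:
    """Return current weather for a city (hardcoded for the demo)."""
    return {"temp": 58, "conditions": "cloudy"}

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```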
A concrete tool-use loop
user: "What's the weather in Cleveland?"
model: calls get_weather(city="Cleveland")
your server: { temp: 58, conditions: "cloudy" }
model: "It's 58°F and cloudy in Cleveland right now."Tutor