Local Function Calling and Structured Output: Making Small Models Reliable
Tool use and JSON output are not just frontier-cloud features. Modern Ollama and llama.cpp support both — with sharper constraints that pay off in reliability.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The reliability problem
2. Function calling
3. Structured output
4. JSON schema
Section 1
The reliability problem
When a frontier cloud model returns JSON, it almost always parses. When a 7B local model returns JSON, it sometimes adds a stray comma, drops a closing brace, or wraps the whole thing in apologetic prose. The model is not 'wrong'; it is undertrained for that exact format. The fix is constrained decoding — telling the inference engine to only allow tokens that keep the output valid.
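Before moving the constraint into the engine, it is worth seeing what the unconstrained approach looks like in practice. The sketch below is a hypothetical defensive parser (the `extract_json` helper is invented for illustration, not part of any library): it strips the prose wrapper a small model often adds and attempts a parse, which works some of the time and fails silently the rest.

```python
import json

def extract_json(raw: str):
    """Best-effort JSON extraction from a model reply that may wrap
    the payload in prose. Returns None when no valid JSON is found."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end == -1 or end < start:
        return None
    try:
        return json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None

# Typical small-model replies:
print(extract_json('Sure! Here is the JSON: {"intent": "book"}'))  # {'intent': 'book'}
print(extract_json('{"intent": "book",}'))  # None -- trailing comma, parse fails
```

Every `None` here is silent data loss, which is why the rest of this lesson moves validation out of your parsing code and into the decoder.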
Three levels of structured output
1. Hope: prompt the model nicely and parse the result. Fast to implement, brittle in production.
2. JSON mode: ask Ollama to enforce JSON-shaped output. Better, but with no schema constraint.
3. Grammar / schema: feed a JSON schema or a GBNF grammar to the engine. Output is syntactically valid by construction.
Schema-constrained output via the OpenAI-compatible API. The local engine enforces validity.

```python
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

user_text = "I need to move my appointment to next week."  # example input

schema = {
    "type": "object",
    "properties": {
        "intent": {"type": "string", "enum": ["book", "cancel", "reschedule"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["intent", "confidence"],
}

resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": user_text}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "intent", "schema": schema},
    },
)

# Safe to parse without a fallback: the engine guarantees valid JSON.
result = json.loads(resp.choices[0].message.content)
```

Compare the options
| Approach | Setup cost | Failure mode | Throughput cost |
|---|---|---|---|
| Hope + parse | Trivial | Invalid JSON sometimes | None |
| JSON mode (no schema) | Low | Wrong shape, valid syntax | Negligible |
| JSON schema constrained | Low | Semantic errors only (syntax guaranteed) | Small |
| GBNF grammar | Medium | Can be too restrictive | Small |
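The table's last row mentions GBNF grammars without showing one. Below is a small hand-written grammar for the same intent/confidence object, as an illustration of the format; the rule names are our choice, and in practice llama.cpp also ships a generic `json.gbnf` plus a schema-to-grammar converter, so you rarely write these from scratch.

```
root   ::= "{" ws "\"intent\"" ws ":" ws intent ws "," ws "\"confidence\"" ws ":" ws number ws "}"
intent ::= "\"book\"" | "\"cancel\"" | "\"reschedule\""
number ::= "0" ("." [0-9]+)? | "1" (".0")?
ws     ::= [ \t\n]*
```

Passed to llama.cpp with `--grammar-file`, this restricts decoding so that no token sequence outside the grammar can ever be emitted. The trade-off is the table's "can be too restrictive" failure mode: a grammar this tight leaves the model no way to express "none of the above."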
Function calling on small models
Modern Ollama supports OpenAI-style tool calls — same shape as the cloud APIs. Many open-weight models (Llama 3.1+, Qwen 2.5+, Mistral) are tuned for tool use and work well. Older or off-brand models will hallucinate the wire format. Always test the specific model you plan to ship; do not assume that 'tool calling works' generalizes.
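The shape below follows the OpenAI-style tools API that Ollama mirrors; the `get_weather` function and its schema are invented for illustration. Because the wire format is exactly what weaker models get wrong, the dispatch side validates the tool name and argument JSON before executing anything.

```python
import json

# A local function the model should be able to call.
# The name and schema are illustrative, not from any library.
def get_weather(city: str) -> dict:
    return {"city": city, "forecast": "sunny"}  # stub

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the forecast for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

REGISTRY = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> dict:
    """Validate and run one tool call from the model.

    Small models sometimes emit unknown tool names or malformed
    argument JSON, so both are checked before executing."""
    fn = REGISTRY.get(tool_call["function"]["name"])
    if fn is None:
        raise ValueError(f"unknown tool: {tool_call['function']['name']}")
    args = json.loads(tool_call["function"]["arguments"])
    return fn(**args)

# Against a live Ollama server the round trip looks like (not run here):
# client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
# resp = client.chat.completions.create(
#     model="llama3.1:8b",
#     messages=[{"role": "user", "content": "Weather in Oslo?"}],
#     tools=TOOLS,
# )
# for tc in resp.choices[0].message.tool_calls:
#     print(dispatch({"function": {"name": tc.function.name,
#                                  "arguments": tc.function.arguments}}))

call = {"function": {"name": "get_weather", "arguments": '{"city": "Oslo"}'}}
print(dispatch(call))  # {'city': 'Oslo', 'forecast': 'sunny'}
```

Keeping the registry explicit, rather than dispatching on whatever name the model emits, is what turns a hallucinated tool call into a catchable error instead of an arbitrary code path.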
Apply this
1. Pick a real classification task in your work and write a JSON schema for the output.
2. Implement it on Ollama with a response_format JSON schema.
3. Compare reliability against the same task with no schema constraint, on 50 inputs.
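Step 3 needs a scoring rule. A minimal sketch, assuming you have collected the raw model outputs as strings; "reliable" here means the output both parses and matches the intent schema from earlier in the lesson. The hand-rolled checks stand in for a full JSON Schema validator.

```python
import json

REQUIRED = {"intent", "confidence"}
INTENTS = {"book", "cancel", "reschedule"}

def is_valid(raw: str) -> bool:
    """True when an output parses as JSON and matches the intent schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or not REQUIRED <= obj.keys():
        return False
    return (obj["intent"] in INTENTS
            and isinstance(obj["confidence"], (int, float))
            and 0 <= obj["confidence"] <= 1)

def reliability(outputs: list[str]) -> float:
    """Fraction of valid outputs -- run once on the constrained batch
    and once on the unconstrained batch over the same 50 inputs."""
    return sum(is_valid(o) for o in outputs) / len(outputs)

print(reliability(['{"intent": "book", "confidence": 0.9}',
                   'Sure! {"intent": "book"}']))  # 0.5
```

Counting schema conformance, not just parse success, matters: JSON mode without a schema will score well on `json.loads` and still fail this check.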
The big idea: small local models become reliable when you constrain their output. The cloud's apparent reliability is partly that the engine already does this for you.
Related lessons
Keep going
Creators · 9 min
Hermes For Structured JSON Output: Schemas That Work
When you need data, not prose, an open-weight model has to play by a schema. Hermes is one of the more reliable choices — but only if you prompt it carefully.
Creators · 40 min
Tool Calling Quality Across Frontier Models
Tool calling quality varies across frontier models. Selection by use case improves reliability.
Creators · 30 min
Structured Output Modes: JSON Mode, Schema, Tool Forcing
How vendors implement structured output and which mode to pick per use case.
