Tool use and JSON output are not just frontier-cloud features. Modern Ollama and llama.cpp support both — with sharper constraints that pay off in reliability.
When a frontier cloud model returns JSON, it almost always parses. When a 7B local model returns JSON, it sometimes adds a stray comma, drops a closing brace, or wraps the whole thing in apologetic prose. The model is not 'wrong'; it is undertrained for that exact format. The fix is constrained decoding — telling the inference engine to only allow tokens that keep the output valid.
```python
import json

from openai import OpenAI

# Ollama exposes an OpenAI-compatible API on port 11434.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

schema = {
    "type": "object",
    "properties": {
        "intent": {"type": "string", "enum": ["book", "cancel", "reschedule"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["intent", "confidence"],
}

user_text = "I need to move my appointment to Friday"  # example input

resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": user_text}],
    response_format={"type": "json_schema", "json_schema": {"name": "intent", "schema": schema}},
)
result = json.loads(resp.choices[0].message.content)
```

Schema-constrained output via the OpenAI-compatible API. The local engine enforces validity.

| Approach | Setup cost | Failure mode | Throughput cost |
|---|---|---|---|
| Hope + parse | Trivial | Invalid JSON sometimes | None |
| JSON mode (no schema) | Low | Wrong shape, valid syntax | Negligible |
| JSON schema constrained | Low | Semantic errors only (syntax and shape guaranteed) | Small |
| GBNF grammar | Medium | Can be too restrictive | Small |
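For the last row, llama.cpp's server accepts a GBNF grammar directly in its `/completion` request. A minimal sketch, assuming a llama.cpp server running on the default port; the grammar itself is deliberately tiny and illustrative, and real grammars usually describe a full JSON object:

```python
import json
import urllib.request

# Illustrative GBNF grammar: the model may emit only one of three quoted strings.
INTENT_GRAMMAR = r'''
root   ::= "\"" intent "\""
intent ::= "book" | "cancel" | "reschedule"
'''

def build_request(prompt: str) -> dict:
    """Payload for llama.cpp's /completion endpoint; "grammar" takes a GBNF string."""
    return {"prompt": prompt, "grammar": INTENT_GRAMMAR, "n_predict": 8}

def classify(prompt: str, url: str = "http://localhost:8080/completion") -> str:
    """POST the constrained request to a locally running llama.cpp server."""
    data = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]
```

The trade-off the table flags is visible here: this grammar can never emit anything outside the three strings, even when "none of the above" would be the honest answer.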
Modern Ollama supports OpenAI-style tool calls — same shape as the cloud APIs. Many open-weight models (Llama 3.1+, Qwen 2.5+, Mistral) are tuned for tool use and work well. Older or off-brand models will hallucinate the wire format. Always test the specific model you plan to ship; do not assume that 'tool calling works' generalizes.
The big idea: small local models become reliable when you constrain their output. The cloud's apparent reliability is partly that the engine already does this for you.