Prompt Caching and Cost Optimization

Long system prompts are expensive. Prompt caching lets you reuse an already-processed prefix, cutting the cost of those cached input tokens by up to 90% and lowering latency. Here's how to architect prompts for caching.

34 min · Reviewed 2026

Why caching exists

In production AI apps, the same long preamble — system prompt, few-shot examples, long retrieved documents — is often sent on every request while only a short user message changes. Processing that preamble every time is wasteful. Prompt caching lets the provider store a processed version of the prefix and reuse it.
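To make the waste concrete, here is a rough cost sketch. The multipliers are assumptions modeled on typical published pricing (a cache write billed at 1.25x the base input price, a cache read at 0.10x), and the function name and token counts are hypothetical. It also assumes every request after the first arrives before the cache expires.

```python
def input_cost_units(prefix_tokens, message_tokens, requests, cached,
                     write_mult=1.25, read_mult=0.10):
    """Total input cost in base-token units (1 unit = 1 uncached input token).

    write_mult and read_mult are assumed pricing multipliers, not fixed facts.
    """
    if requests <= 0:
        return 0.0
    if not cached:
        # Every request pays full price for the preamble and the message.
        return float(requests * (prefix_tokens + message_tokens))
    # The first request writes the prefix to the cache at a premium.
    first = prefix_tokens * write_mult + message_tokens
    # Later requests read the prefix at a discount; the message is never cached.
    rest = (requests - 1) * (prefix_tokens * read_mult + message_tokens)
    return first + rest


# Hypothetical workload: 10,000-token preamble, 100-token message, 100 requests.
uncached = input_cost_units(10_000, 100, 100, cached=False)
cached = input_cost_units(10_000, 100, 100, cached=True)
savings = 1 - cached / uncached

print(f"uncached={uncached:.0f} cached={cached:.0f} savings={savings:.1%}")
```

Under these assumptions the saving lands a little below the 90% ceiling, because the first request pays the write premium and the user message is always billed at full price.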

Prompt structure for caching

// Anthropic Messages API
{
  "model": "claude-sonnet-4-5",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "<long system prompt, role, rules, policies>",
      "cache_control": {"type": "ephemeral"}
    },
    {
      "type": "text",
      "text": "<knowledge base, 50KB of docs>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "What's our refund policy for digital goods?"}
  ]
}

Marking two prefix chunks as cacheable. Only the final user message varies per request.
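The same request body can be assembled in code. The sketch below only builds the payload as a plain dictionary and sends nothing over the network. The helper name build_cached_request is hypothetical, and max_tokens is included because the Messages API requires it (the value 1024 is arbitrary).

```python
import json


def build_cached_request(system_prompt, knowledge_base, user_message,
                         model="claude-sonnet-4-5", max_tokens=1024):
    """Build a Messages API request body with two cacheable system blocks."""

    def cacheable(text):
        # Mark a stable block so the provider may cache the prefix up to here.
        return {
            "type": "text",
            "text": text,
            "cache_control": {"type": "ephemeral"},
        }

    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [cacheable(system_prompt), cacheable(knowledge_base)],
        # The user message changes per request, so it carries no cache marker.
        "messages": [{"role": "user", "content": user_message}],
    }


payload = build_cached_request(
    "<long system prompt, role, rules, policies>",
    "<knowledge base, 50KB of docs>",
    "What's our refund policy for digital goods?",
)
print(json.dumps(payload, indent=2))
```

Keeping the builder separate from the HTTP call makes it easy to check that the stable blocks are byte-for-byte identical between requests, which is what a cache hit depends on.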

What to cache

System prompts longer than ~1K tokens.
Few-shot example blocks that don't change between requests.
Retrieved documents that are reused across a session.
Tool definitions and schemas.
Large style guides, glossaries, or brand-voice references.

What NOT to cache

The user's per-request message (defeats the purpose).
Content under ~1K tokens (overhead > benefit; providers may also enforce a minimum cacheable length).
Personalized data that changes every request.
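Anything requested less often than the cache's expiration window: with sparse traffic the entry expires before it is reused, so you pay to write the cache and never get a hit.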
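The two lists above can be folded into a simple pre-flight check. This is a heuristic sketch, not a provider rule: the function name is hypothetical, and the defaults (a 1,024-token minimum and a 300-second expiration window) are assumptions that should be checked against your provider's current documentation.

```python
def worth_caching(prefix_tokens, changes_every_request, seconds_between_requests,
                  min_tokens=1024, ttl_seconds=300):
    """Return True if a block looks like a reasonable caching candidate.

    min_tokens and ttl_seconds are assumed defaults, not guaranteed values.
    """
    if changes_every_request:
        # Per-request or personalized content never repeats, so it never hits.
        return False
    if prefix_tokens < min_tokens:
        # Too small: overhead outweighs the benefit.
        return False
    if seconds_between_requests >= ttl_seconds:
        # Sparse traffic: the entry expires before the next request arrives.
        return False
    return True


print(worth_caching(8_000, False, 20))     # stable, large, frequent
print(worth_caching(500, False, 20))       # too small
print(worth_caching(8_000, True, 20))      # changes every request
print(worth_caching(8_000, False, 3_600))  # requests too sparse
```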

Cache hierarchy

Caching works best when prefixes are layered in order from most-stable to least-stable. Put static rules first, then slowly-changing context (user profile, session info), then the per-request message last. Because the cache matches on an exact prefix, a change in any layer invalidates every layer after it, so this ordering lets the longest possible prefix hit the cache.

[STABLE SYSTEM]        <- cache forever (or until product change)
[KNOWLEDGE BASE]       <- cache for the day
[USER PROFILE]         <- cache for the session
[CONVERSATION HISTORY] <- cache up to last turn
[CURRENT USER MSG]     <- never cache

Layered cache strategy: deeper layers reused more often.
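A small sketch of why the ordering matters. It sorts layers by how often they change and counts how many leading layers two requests share, which is roughly the part a prefix cache can reuse. The Layer class, the stability ranks, and the sample texts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    text: str
    stability: int  # lower = changes less often (illustrative ranking)


def assemble(layers):
    """Order layers from most stable to least stable."""
    return sorted(layers, key=lambda layer: layer.stability)


def shared_prefix_len(a, b):
    """Count the leading layers that are identical in both prompts."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


system = Layer("system", "rules v1", 0)
kb = Layer("kb", "product docs", 1)
profile = Layer("profile", "user=alice", 2)

# Layers are passed in arbitrary order; assemble() puts stable ones first.
req_a = assemble([Layer("message", "hello", 4), profile, kb, system])
req_b = assemble([Layer("message", "refund?", 4), profile, kb, system])
req_c = assemble([Layer("message", "hello", 4),
                  Layer("profile", "user=bob", 2), kb, system])
req_d = assemble([Layer("message", "hello", 4), profile, kb,
                  Layer("system", "rules v2", 0)])

print(shared_prefix_len(req_a, req_b))  # only the message differs
print(shared_prefix_len(req_a, req_c))  # the profile differs
print(shared_prefix_len(req_a, req_d))  # the system prompt differs
```

Editing the system prompt leaves nothing reusable, while a new user message leaves every earlier layer intact, which is the reason to put the most volatile content last.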

Measuring impact

Baseline: log tokens-in / tokens-out / latency / cost per request for a representative day.
Add cache_control markers.
Re-measure. Compare per-request cost and P50/P95 latency.
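Track cache hit rate. Providers report cached-token counts in each response's usage data (Anthropic, for example, returns cache_read_input_tokens and cache_creation_input_tokens), and you derive the rate from those. Aim for >80%.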
Revisit when the preamble changes — cache misses spike after a prompt update.
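The hit-rate step can be sketched as follows. It uses one possible definition, the token-weighted share of input tokens served from the cache, and assumes each logged usage record is a dict with Anthropic-style fields where input_tokens counts only the uncached remainder. The function name and the sample log are hypothetical.

```python
def cache_hit_rate(usages):
    """Token-weighted hit rate: cached reads / all input tokens processed."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    uncached = sum(u.get("input_tokens", 0) for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0


# Hypothetical log: one request writes a 10,000-token prefix, nine reuse it.
log = [{"input_tokens": 100,
        "cache_creation_input_tokens": 10_000,
        "cache_read_input_tokens": 0}]
log += [{"input_tokens": 100,
         "cache_creation_input_tokens": 0,
         "cache_read_input_tokens": 10_000} for _ in range(9)]

rate = cache_hit_rate(log)
print(f"hit rate: {rate:.1%}")
```

A sudden drop in this number right after a deployment is usually the prompt-update effect described in the last step: the old prefix no longer matches, so every request writes a new cache entry until traffic warms it up again.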

End-of-lesson check

8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-prompting-caching-creators

What is the main idea of "Prompt Caching and Cost Optimization"?
1. Long system prompts are expensive.
2. Use AI as the final authority for the whole decision
3. Avoid checking the answer once it sounds polished
4. Focus only on speed instead of judgment

Which concept is most central to "Prompt Caching and Cost Optimization"?
1. cache_control
2. prompt caching
3. cost optimization
4. latency

Which use of AI fits this topic best?
1. Let the AI decide what matters without your review
2. Use the answer before checking whether it fits the situation
3. System prompts longer than ~1K tokens.
4. Treat the AI output as automatically correct

What should a careful learner remember about "The numbers (Anthropic, 2026)"?
1. Use AI to draft or organize ideas about prompt caching, then verify before acting.
2. Skip the context so the tool can guess faster
3. Treat the output as private even after sharing it online
4. Use the answer without checking the source

You want to use AI after this lesson. What is the safest next step?
1. Act immediately because the AI answer is written clearly
2. Use AI for drafting and comparison, but verify before publishing or relying on it.
3. Hide uncertainty so the final answer looks cleaner
4. Use private or sensitive details before checking permission

How should AI output about prompt caching be treated?
1. As proof that no other source is needed
2. As a replacement for context, consent, or expert review
3. As a draft or helper output that still needs human judgment and verification
4. As something that becomes correct when it sounds confident

Name one way to verify an AI answer about prompt caching.

Which action would help you apply "Prompt Caching and Cost Optimization" responsibly?
1. Use the tool to avoid thinking through the tradeoff
2. Keep going even if the output conflicts with a trusted source
3. Treat the AI output as automatically correct
4. Few-shot example blocks that don't change between requests.
