Prompt Caching and Cost Optimization
Long system prompts are expensive. Prompt caching lets you reuse the prefix at up to 90% cost reduction and much lower latency. Here's how to architect prompts for caching.
What this lesson covers
1. Why caching exists
2. Prompt caching
3. cache_control
4. Cost optimization
Why caching exists
In production AI apps, the same long preamble — system prompt, few-shot examples, long retrieved documents — is often sent on every request while only a short user message changes. Processing that preamble every time is wasteful. Prompt caching lets the provider store a processed version of the prefix and reuse it.
Prompt structure for caching
Marking two prefix chunks as cacheable. Only the final user message varies per request.
// Anthropic Messages API
{
  "model": "claude-sonnet-4-5",
  "system": [
    {
      "type": "text",
      "text": "<long system prompt, role, rules, policies...>",
      "cache_control": {"type": "ephemeral"}
    },
    {
      "type": "text",
      "text": "<knowledge base, 50KB of docs>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "What's our refund policy for digital goods?"}
  ]
}
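The same request through the official Python SDK. A minimal sketch: the two prompt constants are placeholders for your own content, and the usage fields on the response let you confirm the cache is actually being hit.

import anthropic

LONG_SYSTEM_PROMPT = "<long system prompt, role, rules, policies...>"  # placeholder
KNOWLEDGE_BASE = "<knowledge base, 50KB of docs>"                      # placeholder

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {"type": "text", "text": LONG_SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": KNOWLEDGE_BASE,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": "What's our refund policy for digital goods?"}],
)

# Cache activity is reported on every response:
print(response.usage.cache_creation_input_tokens)  # tokens written to the cache
print(response.usage.cache_read_input_tokens)      # tokens served from the cache

The first request pays to write the prefix; subsequent requests within the cache window read it back at the discounted rate.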
What to cache
- System prompts longer than ~1K tokens.
- Few-shot example blocks that don't change between requests.
- Retrieved documents that are reused across a session.
- Tool definitions and schemas (see the sketch after this list).
- Large style guides, glossaries, or brand-voice references.
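Tool definitions cache the same way. A sketch with a hypothetical get_policy tool: placing the cache_control marker on the last tool caches the entire tools array as part of the prefix.

tools = [
    {
        "name": "get_policy",  # hypothetical tool, for illustration only
        "description": "Look up a policy document by topic.",
        "input_schema": {
            "type": "object",
            "properties": {"topic": {"type": "string"}},
            "required": ["topic"],
        },
        # A marker on the final tool caches everything up to and including it.
        "cache_control": {"type": "ephemeral"},
    },
]
# Pass tools=tools to client.messages.create(...) alongside the system blocks.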
What NOT to cache
- The user's per-request message (defeats the purpose).
- Content under ~1K tokens; that's below most Claude models' minimum cacheable length, so it won't be cached at all.
- Personalized data that changes every request.
- Content requested so infrequently that the cache expires between calls (Anthropic's ephemeral cache lives about five minutes, refreshed on each hit).
Cache hierarchy
Caching works best when prefixes are layered in order from most-stable to least-stable. Put static rules first, then slowly-changing context (user profile, session info), then the per-request message last. That way the maximum possible prefix hits the cache.
Layered cache strategy: the earlier, more stable layers are reused most often.
[STABLE SYSTEM]        <- cache forever (or until product change)
[KNOWLEDGE BASE]       <- cache for the day
[USER PROFILE]         <- cache for the session
[CONVERSATION HISTORY] <- cache up to last turn
[CURRENT USER MSG]     <- never cache
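In request form, each layer gets its own cache_control breakpoint; Anthropic allows up to four per request. A sketch, with every variable below a placeholder for your own data:

STABLE_SYSTEM = "<rules and policies>"        # changes only on product updates
KNOWLEDGE_BASE = "<today's retrieved docs>"   # refreshed daily
USER_PROFILE = "<profile for this session>"   # stable within a session
conversation_history = []                     # prior turns, appended as you go
current_message = "What's our refund policy for digital goods?"

request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "system": [
        # Most stable first, so the longest possible prefix matches the cache.
        {"type": "text", "text": STABLE_SYSTEM,
         "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": KNOWLEDGE_BASE,
         "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": USER_PROFILE,
         "cache_control": {"type": "ephemeral"}},
    ],
    "messages": [
        *conversation_history,  # the last prior turn can carry a fourth breakpoint
        {"role": "user", "content": current_message},  # never cached
    ],
}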
Measuring impact
1. Baseline: log tokens-in / tokens-out / latency / cost per request for a representative day.
2. Add cache_control markers.
3. Re-measure. Compare per-request cost and P50/P95 latency.
4. Track cache hit rate (providers report it in response usage; see the sketch below). Aim for >80%.
5. Revisit whenever the preamble changes: cache misses spike after a prompt update.
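A sketch of steps 1 and 4, assuming you collect the usage object from each response. The base price is an illustrative placeholder; the 1.25x write and 0.1x read multipliers match Anthropic's published pricing for the default five-minute cache.

BASE = 3.00 / 1_000_000             # illustrative $/input token; check current pricing
WRITE_MULT, READ_MULT = 1.25, 0.10  # cache-write and cache-read price multipliers

def input_cost(u):
    # u.input_tokens counts only uncached input tokens on Anthropic responses.
    return (u.input_tokens * BASE
            + u.cache_creation_input_tokens * BASE * WRITE_MULT
            + u.cache_read_input_tokens * BASE * READ_MULT)

def cache_hit_rate(usages):
    read = sum(u.cache_read_input_tokens for u in usages)
    written = sum(u.cache_creation_input_tokens for u in usages)
    return read / (read + written) if (read + written) else 0.0

def savings_vs_uncached(usages):
    # Baseline: every input token billed at the base rate, no caching at all.
    actual = sum(input_cost(u) for u in usages)
    baseline = sum((u.input_tokens + u.cache_creation_input_tokens
                    + u.cache_read_input_tokens) * BASE for u in usages)
    return 1 - actual / baseline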
Related lessons
- Anthropic's Prompt Engineering Patterns (38 min). Master Anthropic's core patterns, such as Be Direct, Let Claude Think, and Chain Complex Prompts, to write production-grade prompts.
- Meta-Prompting: AI That Writes AI Prompts (36 min). Use an AI to write, optimize, and debug your prompts; it's how top teams ship production prompts faster than humans alone could.
- Red-Teaming Your Own Prompts (38 min). Before shipping, attack your own prompts: inject, confuse, overload, and role-swap. If you don't find the holes, your users will.
