Prompt Caching and Cost Optimization
Long system prompts are expensive. Prompt caching lets you reuse the prefix at up to 90% cost reduction and much lower latency. Here's how to architect prompts for caching.
What this lesson covers
1. Why caching exists
2. Prompt caching
3. cache_control
4. Cost optimization
Why caching exists
In production AI apps, the same long preamble — system prompt, few-shot examples, long retrieved documents — is often sent on every request while only a short user message changes. Processing that preamble every time is wasteful. Prompt caching lets the provider store a processed version of the prefix and reuse it.
Prompt structure for caching
Marking two prefix chunks as cacheable. Only the final user message varies per request.
// Anthropic Messages API
{
  "model": "claude-sonnet-4-5",
  "system": [
    {
      "type": "text",
      "text": "<long system prompt, role, rules, policies...>",
      "cache_control": {"type": "ephemeral"}
    },
    {
      "type": "text",
      "text": "<knowledge base, 50KB of docs>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "What's our refund policy for digital goods?"}
  ]
}
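The same request through the official Python SDK. A minimal sketch: the two prompt constants are placeholders for your own content, and the usage fields on the response let you confirm the cache is actually being hit.

import anthropic

LONG_SYSTEM_PROMPT = "<long system prompt, role, rules, policies...>"  # placeholder
KNOWLEDGE_BASE = "<knowledge base, 50KB of docs>"                      # placeholder

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {"type": "text", "text": LONG_SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": KNOWLEDGE_BASE,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": "What's our refund policy for digital goods?"}],
)

# Cache activity is reported on every response:
print(response.usage.cache_creation_input_tokens)  # tokens written to the cache
print(response.usage.cache_read_input_tokens)      # tokens served from the cache

The first request pays to write the prefix; subsequent requests within the cache window read it back at the discounted rate.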
What to cache
- System prompts longer than ~1K tokens.
- Few-shot example blocks that don't change between requests.
- Retrieved documents that are reused across a session.
- Tool definitions and schemas (see the sketch after this list).
- Large style guides, glossaries, or brand-voice references.
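Tool definitions cache the same way. A sketch with a hypothetical get_policy tool: placing the cache_control marker on the last tool caches the entire tools array as part of the prefix.

tools = [
    {
        "name": "get_policy",  # hypothetical tool, for illustration only
        "description": "Look up a policy document by topic.",
        "input_schema": {
            "type": "object",
            "properties": {"topic": {"type": "string"}},
            "required": ["topic"],
        },
        # A marker on the final tool caches everything up to and including it.
        "cache_control": {"type": "ephemeral"},
    },
]
# Pass tools=tools to client.messages.create(...) alongside the system blocks.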
What NOT to cache
- The user's per-request message (defeats the purpose).
- Content under ~1K tokens; that's below most Claude models' minimum cacheable length, so it won't be cached at all.
- Personalized data that changes every request.
- Content requested so infrequently that the cache expires between calls (Anthropic's ephemeral cache lives about five minutes, refreshed on each hit).
Cache hierarchy
Caching works best when prefixes are layered in order from most-stable to least-stable. Put static rules first, then slowly-changing context (user profile, session info), then the per-request message last. That way the maximum possible prefix hits the cache.
Layered cache strategy: the earlier, more stable layers are reused most often.
[STABLE SYSTEM]        <- cache forever (or until product change)
[KNOWLEDGE BASE]       <- cache for the day
[USER PROFILE]         <- cache for the session
[CONVERSATION HISTORY] <- cache up to last turn
[CURRENT USER MSG]     <- never cache
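In request form, each layer gets its own cache_control breakpoint; Anthropic allows up to four per request. A sketch, with every variable below a placeholder for your own data:

STABLE_SYSTEM = "<rules and policies>"        # changes only on product updates
KNOWLEDGE_BASE = "<today's retrieved docs>"   # refreshed daily
USER_PROFILE = "<profile for this session>"   # stable within a session
conversation_history = []                     # prior turns, appended as you go
current_message = "What's our refund policy for digital goods?"

request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "system": [
        # Most stable first, so the longest possible prefix matches the cache.
        {"type": "text", "text": STABLE_SYSTEM,
         "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": KNOWLEDGE_BASE,
         "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": USER_PROFILE,
         "cache_control": {"type": "ephemeral"}},
    ],
    "messages": [
        *conversation_history,  # the last prior turn can carry a fourth breakpoint
        {"role": "user", "content": current_message},  # never cached
    ],
}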
Measuring impact
1. Baseline: log tokens-in / tokens-out / latency / cost per request for a representative day.
2. Add cache_control markers.
3. Re-measure. Compare per-request cost and P50/P95 latency.
4. Track cache hit rate (providers report it in response usage; see the sketch below). Aim for >80%.
5. Revisit whenever the preamble changes: cache misses spike after a prompt update.
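A sketch of steps 1 and 4, assuming you collect the usage object from each response. The base price is an illustrative placeholder; the 1.25x write and 0.1x read multipliers match Anthropic's published pricing for the default five-minute cache.

BASE = 3.00 / 1_000_000             # illustrative $/input token; check current pricing
WRITE_MULT, READ_MULT = 1.25, 0.10  # cache-write and cache-read price multipliers

def input_cost(u):
    # u.input_tokens counts only uncached input tokens on Anthropic responses.
    return (u.input_tokens * BASE
            + u.cache_creation_input_tokens * BASE * WRITE_MULT
            + u.cache_read_input_tokens * BASE * READ_MULT)

def cache_hit_rate(usages):
    read = sum(u.cache_read_input_tokens for u in usages)
    written = sum(u.cache_creation_input_tokens for u in usages)
    return read / (read + written) if (read + written) else 0.0

def savings_vs_uncached(usages):
    # Baseline: every input token billed at the base rate, no caching at all.
    actual = sum(input_cost(u) for u in usages)
    baseline = sum((u.input_tokens + u.cache_creation_input_tokens
                    + u.cache_read_input_tokens) * BASE for u in usages)
    return 1 - actual / baseline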
Related lessons
- Anthropic's Prompt Engineering Patterns (38 min). Master Anthropic's core patterns, such as Be Direct, Let Claude Think, and Chain Complex Prompts, to write production-grade prompts.
- Meta-Prompting: AI That Writes AI Prompts (36 min). Use an AI to write, optimize, and debug your prompts; it's how top teams ship production prompts faster than humans alone could.
- Red-Teaming Your Own Prompts (38 min). Before shipping, attack your own prompts: inject, confuse, overload, and role-swap. If you don't find the holes, your users will.
