Debugging Cost and Rate Limits in AI Coding
Your agent is running but nothing happens. Or your bill quadrupled overnight. Cost and rate-limit issues feel like bugs — and you fix them with debugging instincts, not new code.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. When the Bug Is the Bill
2. Rate limits
3. Context cost
4. Prompt caching
Section 1
When the Bug Is the Bill
You ask Claude Code to do something simple. Forty minutes later it's still going, you're hitting rate limits, and your monthly budget has evaporated. This is not a code bug. It's a cost bug, and it yields to the same debugging discipline.
The 2026 cost landscape (rough orders of magnitude)
Compare the options
| Tool / model | Input ($/M tok) | Output ($/M tok) | Notes |
|---|---|---|---|
| Claude Sonnet 4.7 | $3 | $15 | Default for Claude Code |
| Claude Opus 4.5 | $15 | $75 | Premium reasoning |
| GPT-5.5 | ~$5 | ~$15 | Codex CLI default |
| Gemini 2.5 Pro | $3.50 | $10.50 | Often cheapest at large context |
| Cursor Pro plan | $20/mo flat | Quota-based | Soft limits vary |
| Windsurf Pro | $15/mo flat | Daily/weekly quota | Switched March 2026 |
| Copilot Pro | $10/mo flat | Generous quota | Includes Claude Opus access |
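Before tuning anything, it helps to know what a session actually costs. Here is a minimal back-of-the-envelope estimator using the ballpark per-token prices from the table; the price map and the session figures are illustrative assumptions, not quotes.

```python
# Rough per-request cost estimator using the table's ballpark prices.
# Prices are illustrative assumptions, in dollars per million tokens.
PRICES = {
    "sonnet": {"input": 3.00, "output": 15.00},
    "opus": {"input": 15.00, "output": 75.00},
    "gemini-2.5-pro": {"input": 3.50, "output": 10.50},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the approximate dollar cost of a workload."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A 30-turn session that re-sends 200k tokens of context each turn:
print(f"${estimate_cost('sonnet', 200_000 * 30, 2_000 * 30):.2f}")  # $18.90
print(f"${estimate_cost('opus', 200_000 * 30, 2_000 * 30):.2f}")    # $94.50
```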
The five expensive habits to spot in your own workflow
1. Loading entire codebases into context every session — dramatically more expensive than indexed search
2. Running the heaviest model for trivial tasks (Opus to fix a typo)
3. Long sessions without compaction — every turn re-bills the whole history (quantified in the sketch after this list)
4. 10 subagents in parallel for a 30-minute task — 10x cost for no real win
5. Letting agents loop on failing tests for an hour without intervention
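Habit 3 deserves numbers. Because each turn re-sends the whole history, billed input grows roughly quadratically with session length, and compaction resets that growth. A minimal sketch, under assumed per-turn, threshold, and summary-size figures:

```python
# Cumulative billed input tokens, with and without compaction.
# Per-turn growth, threshold, and summary size are illustrative assumptions.
TURNS = 100
NEW_TOKENS_PER_TURN = 2_000
COMPACT_THRESHOLD = 50_000  # e.g. run /compact past ~50k tokens of context
COMPACTED_SIZE = 10_000     # assumed size of the compacted summary

def billed_tokens(compact: bool) -> int:
    context, billed = 0, 0
    for _ in range(TURNS):
        context += NEW_TOKENS_PER_TURN
        if compact and context > COMPACT_THRESHOLD:
            context = COMPACTED_SIZE  # the summary replaces the raw history
        billed += context  # every turn re-bills the whole context
    return billed

print(billed_tokens(compact=False))  # 10,100,000 tokens
print(billed_tokens(compact=True))   # ~2,800,000 tokens, about a quarter
```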
Prompt caching is the single biggest cost lever
Caching the stable parts of your prompt is the single most cost-effective change in any LLM application.
# Without caching:
# - Every turn re-bills the full system prompt + project context
# - 200k tokens * 30 turns = 6M input tokens billed
# With caching (Anthropic, OpenAI, AI Gateway):
# - System prompt + project context cached after first turn
# - Subsequent turns pay 10% of cached portion + full new tokens
# - Same workload: ~1M billed tokens vs 6M
# Claude Code uses caching automatically.
# Custom apps via the API: set cache_control: ephemeral on stable system blocks.
from anthropic import Anthropic

client = Anthropic()

# Placeholder for your long, stable system prompt and project context
LONG_STABLE_INSTRUCTIONS = "..."

response = client.messages.create(
    model="claude-sonnet-4-6",
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "..."}],
)
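A caveat on the 10% figure: cache reads are cheap, but cache writes typically bill above the base input rate (Anthropic charges roughly 1.25x for ephemeral writes), and the cache expires after a few minutes of inactivity. Caching pays off on multi-turn sessions, not one-off calls.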
Model routing: not every task needs the flagship
- Boilerplate, formatting, comments — Haiku, GPT-5 Mini, Gemini Flash
- Code generation in known patterns — Sonnet, GPT-5, Gemini 2.5 Pro
- Architectural decisions, novel algorithms — Opus, GPT-5.5
- Debugging hard bugs — Opus or Sonnet with Extended Thinking
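In a custom pipeline, this routing can be a plain lookup keyed on task type. A minimal sketch; the task categories and model IDs here are illustrative assumptions, not fixed tiers:

```python
# Route each task to the cheapest model tier that can plausibly handle it.
# Categories and model IDs are illustrative assumptions.
ROUTES = {
    "boilerplate": "claude-haiku-4",    # formatting, comments, renames
    "generation": "claude-sonnet-4-6",  # code in known patterns
    "architecture": "claude-opus-4-5",  # novel design, hard debugging
}

def pick_model(task_type: str) -> str:
    """Default to the mid-tier model when the task type is unrecognized."""
    return ROUTES.get(task_type, ROUTES["generation"])

assert pick_model("boilerplate") == "claude-haiku-4"
assert pick_model("refactor-unknown") == "claude-sonnet-4-6"
```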
Diagnosing rate-limit failures
Compare the options
| Symptom | Likely cause | Fix |
|---|---|---|
| Sudden 429 errors | Bursty parallel calls (subagents) | Stagger calls; respect retry-after |
| Slow responses, no errors | Soft rate limit / queueing | Reduce concurrency, switch model |
| Account-wide hard cap hit | Monthly quota exhausted | Buy more, optimize, or wait |
| Per-minute limits hit but daily fine | Bursty patterns | Add backoff with jitter |
| Quota silently consumed | A loop you forgot you started | Check `codex cloud` / Cursor cloud agents — they run while you sleep |
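For the 429 and bursty-pattern rows, the standard remedy is exponential backoff with jitter that honors the server's retry-after header when one is sent. A minimal sketch against the Anthropic Python SDK (which also retries on its own; this just makes the mechanics explicit):

```python
import random
import time

import anthropic

client = anthropic.Anthropic()

def create_with_backoff(max_retries: int = 5, base_delay: float = 1.0, **kwargs):
    """Call the Messages API, backing off on 429s.

    Honors the retry-after header when the server sends one; otherwise
    doubles the delay each attempt and adds jitter so parallel callers
    (e.g. subagents) don't retry in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except anthropic.RateLimitError as err:
            retry_after = err.response.headers.get("retry-after")
            delay = float(retry_after) if retry_after else base_delay * 2**attempt
            time.sleep(delay + random.uniform(0, 1))
    raise RuntimeError(f"still rate-limited after {max_retries} retries")
```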
The cost-aware coding loop
Cost discipline is a habit, not a tool. Practice these checks until they're automatic.
# Daily checklist (30 seconds):
1. Check yesterday's spend — anomalies?
2. Confirm no zombie cloud agents are running
3. Review your model defaults — should this project use Sonnet or Haiku?
# Per-session checklist (10 seconds):
1. /compact when the session crosses ~50k tokens
2. /clear when the session is done — don't keep stale context for tomorrow
3. Spawn subagents only when the work is truly parallel
# Per-prompt checklist (instant):
1. Did I include unnecessary context (paste of files I don't need)?
2. Could a smaller model handle this?
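For the ~50k compaction threshold, you don't have to guess in custom apps: the Anthropic SDK can count a conversation's tokens before you send it. A minimal sketch (inside Claude Code itself, /compact and built-in context tracking handle this for you):

```python
import anthropic

client = anthropic.Anthropic()

# The conversation history you are about to re-send.
history = [{"role": "user", "content": "..."}]

count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    messages=history,
)
if count.input_tokens > 50_000:
    print("Context past ~50k tokens; compact or summarize before the next turn.")
```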
When you're hitting limits hard
1. Switch to a flat-rate plan if your usage is steady (Cursor, Windsurf, Copilot)
2. Use API access through AI Gateway with caching enabled — most cost-efficient at scale
3. Move heavy explore-and-summarize work to a cheaper model with long context (Gemini)
4. Reserve flagship models for the actual hard reasoning steps
5. If you genuinely need more capacity, the cost of upgrading is usually less than the cost of throttled engineering time
“If you can't measure it, you can't optimize it. Watch the meter.”
The big idea: cost and rate limits are part of the AI coding craft. Cache stable prompts, route by task complexity, compact long sessions, and audit background agents. The engineers who stay productive at scale are the ones who treat tokens as a real resource, not an unlimited well.