Debugging Cost and Rate Limits in AI Coding
Your agent is running but nothing happens. Or your bill quadrupled overnight. Cost and rate-limit issues feel like bugs — and you fix them with debugging instincts, not new code.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. When the Bug Is the Bill
2. Rate limits
3. Context cost
4. Prompt caching
Section 1
When the Bug Is the Bill
You ask Claude Code to do something simple. Forty minutes later it's still going, you're hitting rate limits, and your monthly budget has evaporated. This is not a code bug. It's a cost bug, and it yields to the same debugging discipline.
The 2026 cost landscape (rough orders of magnitude)
Compare the options
| Tool / model | Input ($/M tok) | Output ($/M tok) | Notes |
|---|---|---|---|
| Claude Sonnet 4.7 | $3 | $15 | Default for Claude Code |
| Claude Opus 4.5 | $15 | $75 | Premium reasoning |
| GPT-5.5 | ~$5 | ~$15 | Codex CLI default |
| Gemini 2.5 Pro | $3.50 | $10.50 | Often cheapest at large context |
| Cursor Pro plan | $20/mo flat | Quota-based | Soft limits vary |
| Windsurf Pro | $15/mo flat | Daily/weekly quota | Switched March 2026 |
| Copilot Pro | $10/mo flat | Generous quota | Includes Claude Opus access |
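Before tuning anything, it helps to know what a session actually costs. Here is a minimal back-of-the-envelope estimator using the ballpark per-token prices from the table; the price map and the session figures are illustrative assumptions, not quotes.

```python
# Rough per-request cost estimator using the table's ballpark prices.
# Prices are illustrative assumptions, in dollars per million tokens.
PRICES = {
    "sonnet": {"input": 3.00, "output": 15.00},
    "opus": {"input": 15.00, "output": 75.00},
    "gemini-2.5-pro": {"input": 3.50, "output": 10.50},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the approximate dollar cost of a workload."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A 30-turn session that re-sends 200k tokens of context each turn:
print(f"${estimate_cost('sonnet', 200_000 * 30, 2_000 * 30):.2f}")  # $18.90
print(f"${estimate_cost('opus', 200_000 * 30, 2_000 * 30):.2f}")    # $94.50
```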
The five expensive habits to spot in your own workflow
1. Loading entire codebases into context every session — dramatically more expensive than indexed search
2. Running the heaviest model for trivial tasks (Opus to fix a typo)
3. Long sessions without compaction — every turn re-bills the whole history (quantified in the sketch after this list)
4. 10 subagents in parallel for a 30-minute task — 10x cost for no real win
5. Letting agents loop on failing tests for an hour without intervention
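Habit 3 deserves numbers. Because each turn re-sends the whole history, billed input grows roughly quadratically with session length, and compaction resets that growth. A minimal sketch, under assumed per-turn, threshold, and summary-size figures:

```python
# Cumulative billed input tokens, with and without compaction.
# Per-turn growth, threshold, and summary size are illustrative assumptions.
TURNS = 100
NEW_TOKENS_PER_TURN = 2_000
COMPACT_THRESHOLD = 50_000  # e.g. run /compact past ~50k tokens of context
COMPACTED_SIZE = 10_000     # assumed size of the compacted summary

def billed_tokens(compact: bool) -> int:
    context, billed = 0, 0
    for _ in range(TURNS):
        context += NEW_TOKENS_PER_TURN
        if compact and context > COMPACT_THRESHOLD:
            context = COMPACTED_SIZE  # the summary replaces the raw history
        billed += context  # every turn re-bills the whole context
    return billed

print(billed_tokens(compact=False))  # 10,100,000 tokens
print(billed_tokens(compact=True))   # ~2,800,000 tokens, about a quarter
```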
Prompt caching is the single biggest cost lever
Caching the stable parts of your prompt is the single most cost-effective change in any LLM application.
# Without caching:
# - Every turn re-bills the full system prompt + project context
# - 200k tokens * 30 turns = 6M input tokens billed
# With caching (Anthropic, OpenAI, AI Gateway):
# - System prompt + project context cached after first turn
# - Subsequent turns pay 10% of cached portion + full new tokens
# - Same workload: ~1M billed tokens vs 6M
# Claude Code uses caching automatically.
# Custom apps via the API: set cache_control: ephemeral on stable system blocks.
from anthropic import Anthropic

client = Anthropic()

# Placeholder for your long, stable system prompt and project context
LONG_STABLE_INSTRUCTIONS = "..."

response = client.messages.create(
    model="claude-sonnet-4-6",
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "..."}],
)
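A caveat on the 10% figure: cache reads are cheap, but cache writes typically bill above the base input rate (Anthropic charges roughly 1.25x for ephemeral writes), and the cache expires after a few minutes of inactivity. Caching pays off on multi-turn sessions, not one-off calls.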
Model routing: not every task needs the flagship
- Boilerplate, formatting, comments — Haiku, GPT-5 Mini, Gemini Flash
- Code generation in known patterns — Sonnet, GPT-5, Gemini 2.5 Pro
- Architectural decisions, novel algorithms — Opus, GPT-5.5
- Debugging hard bugs — Opus or Sonnet with Extended Thinking
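In a custom pipeline, this routing can be a plain lookup keyed on task type. A minimal sketch; the task categories and model IDs here are illustrative assumptions, not fixed tiers:

```python
# Route each task to the cheapest model tier that can plausibly handle it.
# Categories and model IDs are illustrative assumptions.
ROUTES = {
    "boilerplate": "claude-haiku-4",    # formatting, comments, renames
    "generation": "claude-sonnet-4-6",  # code in known patterns
    "architecture": "claude-opus-4-5",  # novel design, hard debugging
}

def pick_model(task_type: str) -> str:
    """Default to the mid-tier model when the task type is unrecognized."""
    return ROUTES.get(task_type, ROUTES["generation"])

assert pick_model("boilerplate") == "claude-haiku-4"
assert pick_model("refactor-unknown") == "claude-sonnet-4-6"
```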
Diagnosing rate-limit failures
Compare the options
| Symptom | Likely cause | Fix |
|---|---|---|
| Sudden 429 errors | Bursty parallel calls (subagents) | Stagger calls; respect retry-after |
| Slow responses, no errors | Soft rate limit / queueing | Reduce concurrency, switch model |
| Account-wide hard cap hit | Monthly quota exhausted | Buy more, optimize, or wait |
| Per-minute limits hit but daily fine | Bursty patterns | Add backoff with jitter |
| Quota silently consumed | A loop you forgot you started | Check `codex cloud` / Cursor cloud agents — they run while you sleep |
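For the 429 and bursty-pattern rows, the standard remedy is exponential backoff with jitter that honors the server's retry-after header when one is sent. A minimal sketch against the Anthropic Python SDK (which also retries on its own; this just makes the mechanics explicit):

```python
import random
import time

import anthropic

client = anthropic.Anthropic()

def create_with_backoff(max_retries: int = 5, base_delay: float = 1.0, **kwargs):
    """Call the Messages API, backing off on 429s.

    Honors the retry-after header when the server sends one; otherwise
    doubles the delay each attempt and adds jitter so parallel callers
    (e.g. subagents) don't retry in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except anthropic.RateLimitError as err:
            retry_after = err.response.headers.get("retry-after")
            delay = float(retry_after) if retry_after else base_delay * 2**attempt
            time.sleep(delay + random.uniform(0, 1))
    raise RuntimeError(f"still rate-limited after {max_retries} retries")
```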
The cost-aware coding loop
Cost discipline is a habit, not a tool. Practice these checks until they're automatic.
# Daily checklist (30 seconds):
1. Check yesterday's spend — anomalies?
2. Confirm no zombie cloud agents are running
3. Review your model defaults — should this project use Sonnet or Haiku?
# Per-session checklist (10 seconds):
1. /compact when the session crosses ~50k tokens
2. /clear when the session is done — don't keep stale context for tomorrow
3. Spawn subagents only when the work is truly parallel
# Per-prompt checklist (instant):
1. Did I include unnecessary context (paste of files I don't need)?
2. Could a smaller model handle this?
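For the ~50k compaction threshold, you don't have to guess in custom apps: the Anthropic SDK can count a conversation's tokens before you send it. A minimal sketch (inside Claude Code itself, /compact and built-in context tracking handle this for you):

```python
import anthropic

client = anthropic.Anthropic()

# The conversation history you are about to re-send.
history = [{"role": "user", "content": "..."}]

count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    messages=history,
)
if count.input_tokens > 50_000:
    print("Context past ~50k tokens; compact or summarize before the next turn.")
```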
When you're hitting limits hard
1. Switch to a flat-rate plan if your usage is steady (Cursor, Windsurf, Copilot)
2. Use API access through AI Gateway with caching enabled — most cost-efficient at scale
3. Move heavy explore-and-summarize work to a cheaper model with long context (Gemini)
4. Reserve flagship models for the actual hard reasoning steps
5. If you genuinely need more capacity, the cost of upgrading is usually less than the cost of throttled engineering time
“If you can't measure it, you can't optimize it. Watch the meter.”
The big idea: cost and rate limits are part of the AI coding craft. Cache stable prompts, route by task complexity, compact long sessions, and audit background agents. The engineers who stay productive at scale are the ones who treat tokens as a real resource, not an unlimited well.