Loading lesson…
Your agent is running but nothing happens. Or your bill quadrupled overnight. Cost and rate-limit issues feel like bugs — and you fix them with debugging instincts, not new code.
You ask Claude Code to do something simple. Forty minutes later it's still going, you're hitting rate limits, and your monthly budget evaporated. This is not a code bug. It's a cost bug. They share a debugging discipline.
| Tool / model | Input ($/M tok) | Output ($/M tok) | Notes |
|---|---|---|---|
| Claude Sonnet 4.7 | $3 | $15 | Default for Claude Code |
| Claude Opus 4.5 | $15 | $75 | Premium reasoning |
| GPT-5.5 | $5-ish | $15-ish | Codex CLI default |
| Gemini 2.5 Pro | $3.5 | $10.5 | Often cheapest at large context |
| Cursor Pro plan | $20/mo | Quota-based | Soft limits vary |
| Windsurf Pro | $15/mo | Daily/weekly quota | Switched March 2026 |
| Copilot Pro | $10/mo | Generous | Includes Claude Opus access |
# Without caching: # - Every turn re-bills the full system prompt + project context # - 200k tokens * 30 turns = 6M input tokens billed # With caching (Anthropic, OpenAI, AI Gateway): # - System prompt + project context cached after first turn # - Subsequent turns pay 10% of cached portion + full new tokens # - Same workload: ~1M billed tokens vs 6M # Claude Code uses caching automatically. # Custom apps via the API: set cache_control: ephemeral on stable system blocks. from anthropic import Anthropic client = Anthropic() response = client.messages.create( model="claude-sonnet-4-6", system=[ {"type": "text", "text": LONG_STABLE_INSTRUCTIONS, "cache_control": {"type": "ephemeral"}}, ], messages=[{"role": "user", "content": ""}], )Caching the stable parts of your prompt is the single most cost-effective change in any LLM application.| Symptom | Likely cause | Fix |
|---|---|---|
| Sudden 429 errors | Bursty parallel calls (subagents) | Stagger calls; respect retry-after |
| Slow responses, no errors | Soft rate limit / queueing | Reduce concurrency, switch model |
| Account-wide hard cap hit | Monthly quota exhausted | Buy more, optimize, or wait |
| Per-minute limits hit but daily fine | Bursty patterns | Add backoff with jitter |
| Quota silently consumed | A loop you forgot you started | Check `codex cloud` / Cursor cloud agents — they run while you sleep |
# Daily checklist (30 seconds): 1. Check yesterday's spend — anomalies? 2. Confirm no zombie cloud agents are running 3. Review your model defaults — should this project use Sonnet or Haiku? # Per-session checklist (10 seconds): 1. /compact when the session crosses ~50k tokens 2. /clear when the session is done — don't keep stale context for tomorrow 3. Spawn subagents only when the work is truly parallel # Per-prompt checklist (instant): 1. Did I include unnecessary context (paste of files I don't need)? 2. Could a smaller model handle this?Cost discipline is a habit, not a tool. Practice these checks until they're automatic.If you can't measure it, you can't optimize it. Watch the meter.
— A finops engineer turned AI specialist
The big idea: cost and rate limits are part of the AI coding craft. Cache stable prompts, route by task complexity, compact long sessions, and audit background agents. The engineers who stay productive at scale are the ones who treat tokens as a real resource, not an unlimited well.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-coding-debug-cost-and-rate-limits-creators
What is the main idea of "Debugging Cost and Rate Limits in AI Coding"?
Which concept is most central to "Debugging Cost and Rate Limits in AI Coding"?
Which use of AI fits this topic best?
What should a careful learner remember about "The compounding session"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about rate limit be treated?
Name one way to verify an AI answer about rate limit.
Which action would help you apply "Debugging Cost and Rate Limits in AI Coding" responsibly?