Your agent is running but nothing happens. Or your bill quadrupled overnight. Cost and rate-limit issues feel like bugs, and you fix them with debugging instincts, not new code.
You ask Claude Code to do something simple. Forty minutes later it's still going, you're hitting rate limits, and your monthly budget has evaporated. That's not a code bug; it's a cost bug. The two share a debugging discipline.
| Tool / plan | Input ($/M tok) or monthly price | Output ($/M tok) or quota | Notes |
|---|---|---|---|
| Claude Sonnet 4.7 | $3 | $15 | Default for Claude Code |
| Claude Opus 4.5 | $15 | $75 | Premium reasoning |
| GPT-5.5 | $5-ish | $15-ish | Codex CLI default |
| Gemini 2.5 Pro | $3.5 | $10.5 | Often cheapest at large context |
| Cursor Pro plan | $20/mo | Quota-based | Soft limits vary |
| Windsurf Pro | $15/mo | Daily/weekly quota | Switched March 2026 |
| Copilot Pro | $10/mo | Generous | Includes Claude Opus access |
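To make the table concrete, here is a back-of-envelope estimator. It is a sketch using the Sonnet list prices from the table above ($3/M input, $15/M output); the session sizes are illustrative.

```python
# Rough per-session cost at Sonnet list prices from the table above.
IN_RATE, OUT_RATE = 3.00, 15.00  # dollars per million tokens

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one session at list price, with no caching."""
    return input_tokens / 1e6 * IN_RATE + output_tokens / 1e6 * OUT_RATE

# A 30-turn agent session that re-sends a 200k-token context every turn,
# producing 60k output tokens in total:
print(f"${session_cost(200_000 * 30, 60_000):.2f}")  # prints "$18.90"
```

Run the same numbers at Opus rates ($15/$75) and the session lands near $100 — which is why model choice and caching dominate the bill.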
# Without caching:
# - Every turn re-bills the full system prompt + project context
# - 200k tokens * 30 turns = 6M input tokens billed
# With caching (Anthropic, OpenAI, AI Gateway):
# - System prompt + project context cached after first turn
# - Subsequent turns pay 10% of cached portion + full new tokens
# - Same workload: ~1M billed tokens vs 6M
# Claude Code uses caching automatically.
# Custom apps via the API: set cache_control: ephemeral on stable system blocks.
from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    system=[
        {"type": "text", "text": LONG_STABLE_INSTRUCTIONS,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": "..."}],
)

Caching the stable parts of your prompt is the single most cost-effective change in any LLM application.

| Symptom | Likely cause | Fix |
|---|---|---|
| Sudden 429 errors | Bursty parallel calls (subagents) | Stagger calls; respect retry-after |
| Slow responses, no errors | Soft rate limit / queueing | Reduce concurrency, switch model |
| Account-wide hard cap hit | Monthly quota exhausted | Buy more, optimize, or wait |
| Per-minute limits hit but daily fine | Bursty patterns | Add backoff with jitter |
| Quota silently consumed | A loop you forgot you started | Check `codex cloud` / Cursor cloud agents — they run while you sleep |
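The "add backoff with jitter" fix from the table can be sketched in a few lines. This is a minimal illustration, not a specific client's API: `RateLimitError` stands in for whatever 429 exception your SDK raises, and the retry-after handling assumes the server's hint is exposed as an attribute.

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for your client's 429 exception."""
    def __init__(self, retry_after=None):
        self.retry_after = retry_after

def with_backoff(call, max_retries=5, base=1.0, cap=30.0):
    """Retry `call` on rate limits with exponential backoff and full jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError as err:
            # Respect an explicit retry-after hint if the server sent one;
            # otherwise sleep a random amount up to the capped exponential.
            if err.retry_after is not None:
                delay = err.retry_after
            else:
                delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
    raise RuntimeError("rate limit: retries exhausted")
```

Full jitter (a uniform draw up to the exponential ceiling) is what prevents a fleet of subagents from retrying in lockstep and re-triggering the same 429.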
# Daily checklist (30 seconds):
1. Check yesterday's spend — anomalies?
2. Confirm no zombie cloud agents are running
3. Review your model defaults — should this project use Sonnet or Haiku?
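The "anomalies?" check in step 1 can be automated with a trailing-average comparison. This is an illustrative sketch: the data source (billing export, gateway logs) and the 2x threshold are assumptions you should tune.

```python
def spend_anomaly(daily_spend, factor=2.0):
    """True if the most recent day exceeds `factor` times the trailing average."""
    *history, yesterday = daily_spend[-8:]  # last 7 days + yesterday
    baseline = sum(history) / len(history)
    return yesterday > factor * baseline

# A quiet week, then a spike (made-up numbers):
print(spend_anomaly([4.1, 3.8, 4.4, 4.0, 3.9, 4.2, 4.1, 12.7]))  # prints "True"
```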
# Per-session checklist (10 seconds):
1. /compact when the session crosses ~50k tokens
2. /clear when the session is done — don't keep stale context for tomorrow
3. Spawn subagents only when the work is truly parallel
# Per-prompt checklist (instant):
1. Did I include unnecessary context (paste of files I don't need)?
2. Could a smaller model handle this?

Cost discipline is a habit, not a tool. Practice these checks until they're automatic.

If you can't measure it, you can't optimize it. Watch the meter.
— A finops engineer turned AI specialist
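The per-prompt check "could a smaller model handle this?" can be partially automated. Below is a sketch of routing by task complexity; the model names and the keyword heuristic are illustrative assumptions, not any tool's actual API.

```python
CHEAP, MID, PREMIUM = "claude-haiku", "claude-sonnet-4-6", "claude-opus"

def pick_model(task: str) -> str:
    """Route boilerplate to a cheap model, hard reasoning to a premium one."""
    t = task.lower()
    if any(k in t for k in ("boilerplate", "rename", "docstring", "reformat")):
        return CHEAP
    if any(k in t for k in ("architecture", "race condition", "security review")):
        return PREMIUM
    return MID  # sensible default for everything else
```

A real router would classify with a tiny model rather than keywords, but the shape is the same: default to mid-tier, demote the obvious, promote the hard.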
The big idea: cost and rate limits are part of the AI coding craft. Cache stable prompts, route by task complexity, compact long sessions, and audit background agents. The engineers who stay productive at scale are the ones who treat tokens as a real resource, not an unlimited well.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-coding-debug-cost-and-rate-limits-creators
1. According to the pricing table in the material, which model combination would be the most expensive for a task requiring significant output tokens?
2. A developer leaves Claude Code running overnight with a 200k token context and returns to find their bill tripled. Which expensive habit from the material best explains this?
3. A student asks Claude Opus to fix a single typo in a variable name. Why does the material consider this wasteful?
4. What does the material identify as the single biggest cost lever for reducing AI coding expenses?
5. A developer receives sudden 429 errors when running their AI coding agent. According to the diagnostic table, what is the most likely cause?
6. Which model routing strategy does the material recommend for generating boilerplate code and comments?
7. A developer notices their account shows quota consumed but no visible errors. They haven't run any requests themselves. What should they check for?
8. What service allows you to write one application and route requests to different models based on capability and cost?
9. The material draws an analogy: in the 1990s engineers obsessed over RAM, and in 2026 they should treat what similarly?
10. A developer has steady, predictable AI coding usage every month. Which cost strategy does the material recommend?
11. Which combination of tasks represents proper model routing according to the material?
12. A developer's responses are slow but there are no error codes. According to the diagnostic table, what is likely happening?
13. The material mentions that the cost difference between disciplined and undisciplined sessions can be as much as:
14. What is the recommended action after finishing a Claude Code session to avoid ongoing charges?
15. For non-critical code paths like tests and formatting, the material suggests routing through a gateway can reduce costs by: