Rate-Limiting, Costs, and Optimization
AI coding bills surprise teams that don't watch them. Let's break down the real cost drivers, the levers that actually reduce them, and how to set guardrails before your CFO does.
Lesson map
1. The Bill That Compounds
2. Token economics
3. Rate limits
4. Prompt caching
The Bill That Compounds
A single engineer using Claude Code heavily can generate hundreds of millions of tokens a month. Across a team, the numbers become genuinely expensive. Most overspend comes from habits, not essential usage — which means most of it is recoverable.
The four cost drivers, ranked
1. Context size per call — long-context calls dominate cost
2. Model tier — flagship models are 5-10x the price of mid-tier siblings
3. Call frequency — agentic loops can burn through tokens in minutes
4. Output length — output tokens run about 5x the per-token price of inputs
Cost per million tokens, roughly (April 2026)
| Model tier | Input $/MTok | Output $/MTok |
|---|---|---|
| Flagship (Claude Opus, GPT-5.5, Gemini Ultra) | $15 | $75 |
| Mid (Claude Sonnet, GPT-5, Gemini Pro) | $3 | $15 |
| Small (Claude Haiku, GPT-5 mini, Gemini Flash) | $0.25 | $1.25 |
| Open-weights self-hosted (Llama 4, Qwen 3) | ~$0 (hardware only) | ~$0 (hardware only) |
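To see how these rates combine on a single call, here's a minimal TypeScript sketch; the prices are hardcoded from the table above, and `callCost` is an illustrative helper, not a provider API:

```typescript
// Prices from the table above, in $ per million tokens.
const PRICES = {
  flagship: { input: 15, output: 75 },
  mid:      { input: 3, output: 15 },
  small:    { input: 0.25, output: 1.25 },
} as const;

// Dollar cost of one call, given token counts and a tier.
function callCost(tier: keyof typeof PRICES, inputTok: number, outputTok: number): number {
  const p = PRICES[tier];
  return (inputTok * p.input + outputTok * p.output) / 1_000_000;
}

console.log(callCost('flagship', 100_000, 2_000)); // $1.65
console.log(callCost('mid', 100_000, 2_000));      // $0.33
```

Note how the input term dominates a long-context call even at a 5x output price, which is why context size tops the driver list.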
The levers that actually work
- Prompt caching: 90% input discount after the first call — always on for long context (see the sketch after this list)
- Right-size the model: use mid/small tier for 80% of work, flagship only when needed
- Limit context: paste only what's relevant, not the whole repo
- Cap max_tokens on output: 2048 is enough for most coding tasks
- Batch requests: one call with 5 questions beats 5 calls
- Model routing: a gateway picks the cheapest model that meets quality
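The first and fourth levers are one-line changes at the call site. Here's a minimal sketch assuming the Anthropic TypeScript SDK; the model id is a stand-in for whatever mid-tier model you use, and `repoContext` is the long, stable prefix (conventions, key files) you resend on every call:

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function ask(repoContext: string, question: string) {
  return client.messages.create({
    model: 'claude-sonnet-4-5', // stand-in id; pick your current mid-tier model
    max_tokens: 2048,           // lever: cap output spend per call
    system: [
      {
        type: 'text',
        text: repoContext,
        // Lever: mark the stable prefix cacheable. Later calls that reuse
        // this exact prefix pay the discounted cached-input rate.
        cache_control: { type: 'ephemeral' },
      },
    ],
    messages: [{ role: 'user', content: question }],
  });
}
```

The savings compound fastest in agentic loops, which resend the same context dozens of times per session.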
A simple routing pattern
A router in front of your agent is the single highest-leverage optimization. Gateways like Vercel AI Gateway and LiteLLM do this with config.
// Route simple tasks to cheap models, hard tasks to flagships.
// Task, classifyComplexity, and callModel are sketch-level stand-ins;
// classifyComplexity itself uses a small/cheap model.
type Complexity = 'trivial' | 'moderate' | 'hard';
interface Task { prompt: string }
declare function classifyComplexity(task: Task): Promise<Complexity>;
declare function callModel(model: string, task: Task): Promise<string>;

async function routeTask(task: Task) {
  const complexity = await classifyComplexity(task);
  if (complexity === 'trivial') {
    return callModel('haiku', task);  // $0.25/MTok input
  }
  if (complexity === 'moderate') {
    return callModel('sonnet', task); // $3/MTok input
  }
  return callModel('opus', task);     // $15/MTok input
}

// Rule of thumb: at scale, 70-80% of tasks route to cheap tiers.
// Average cost drops 3-5x with no measurable quality loss.

Rate limits to know
- Provider-level: requests/minute and tokens/minute caps by tier
- Tool-level: Claude Code, Cursor, and Windsurf all enforce daily/weekly quotas
- Org-level: your company may set per-user caps; check before a big day
- Burst-level: retry with exponential backoff on 429s, never just hammer (see the sketch below)
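For the burst-level rule, a minimal retry sketch; `callModel` is the hypothetical helper from the routing example, and the delay constants are just reasonable defaults:

```typescript
// Retry on 429 (rate limited) with exponential backoff plus jitter.
async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = (err as { status?: number }).status;
      // Only retry rate-limit errors, and only maxRetries times.
      if (status !== 429 || attempt >= maxRetries) throw err;
      // 1s, 2s, 4s, 8s... plus up to 1s of jitter so a fleet of
      // clients doesn't retry in lockstep.
      const delayMs = 1000 * 2 ** attempt + Math.random() * 1000;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage: await withBackoff(() => callModel('sonnet', task));
```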
Team FinOps for AI
1. Set per-user monthly budgets with alerts at 50% / 80% / 100%
2. Tag every API call with user, project, and feature for accurate attribution (see the sketch after this list)
3. Review spend weekly in the first month of new tool adoption, monthly after
4. Publish a leaderboard of efficient users — peer pressure beats policy
5. Negotiate enterprise volume discounts once spend is predictable
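Step 2 is the foundation for the rest of the list: budgets and reviews are only as accurate as the attribution beneath them. A sketch of a tagging wrapper, where `callModel` and `recordUsage` are hypothetical stand-ins for your model client and your spend ledger:

```typescript
interface CallTags { user: string; project: string; feature: string }

declare function callModel(model: string, task: { prompt: string }): Promise<string>;
declare function recordUsage(entry: CallTags & { model: string; latencyMs: number }): Promise<void>;

// Every model call goes through here, so spend can later be grouped
// by user, project, or feature instead of showing up as one blob.
async function taggedCall(tags: CallTags, model: string, prompt: string) {
  const started = Date.now();
  const result = await callModel(model, { prompt });
  await recordUsage({ ...tags, model, latencyMs: Date.now() - started });
  return result;
}
```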
When self-hosting makes sense
At sustained high volume, self-hosting an open-weights model like Llama 4 or Qwen 3 can undercut API pricing. The crossover happens around $10-20k/month of API spend, higher if you need flagship quality. Below that, running GPUs is a distraction.
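To sanity-check that crossover against your own numbers, a back-of-envelope sketch; every figure in it is an assumption to replace with your actuals:

```typescript
// Assumed numbers, not quotes: an all-in self-hosted node cost, and the
// mid-tier API input price from the table above.
const nodeCostPerMonth = 15_000; // $/month: hardware amortization + ops (assumed)
const apiInputPerMTok = 3;       // $/MTok, mid tier

// Monthly volume at which the node starts beating the API on price.
const breakevenMTok = nodeCostPerMonth / apiInputPerMTok;
console.log(breakevenMTok); // 5,000 MTok, i.e. ~5 billion tokens/month
```

Five billion tokens a month is sustained, serious load; below it, the node costs more than the API bill it replaces, before counting the engineering time to run it.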
“The first rule of AI cost optimization: the bill you can't see is the bill you can't control.”
The big idea: AI coding bills scale with habits, not just headcount. Caching, routing, context hygiene, and budget visibility get you 80% of the savings. Skip those and you're burning money the market is teaching other teams to keep.
