AI coding bills surprise teams that don't watch them. Let's break down the real cost drivers, the levers that actually reduce them, and how to set guardrails before your CFO does.
A single engineer using Claude Code heavily can generate hundreds of millions of tokens a month. Across a team, that volume gets genuinely expensive. Most overspend comes from habits, not essential usage, which means most of it is recoverable.
| Model tier | Input $/MTok | Output $/MTok |
|---|---|---|
| Flagship (Claude Opus, GPT-5.5, Gemini Ultra) | $15 | $75 |
| Mid (Claude Sonnet, GPT-5, Gemini Pro) | $3 | $15 |
| Small (Claude Haiku, GPT-5 mini, Gemini Flash) | $0.25 | $1.25 |
| Open-weights self-hosted (Llama 4, Qwen 3) | ~$0 (hardware only) | ~$0 (hardware only) |
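To make the table concrete, here is a minimal sketch that turns the per-million-token prices above into a monthly figure. The `monthlyCost` helper and the workload of 200M input and 20M output tokens per month are illustrative assumptions, not measurements, and the sketch ignores caching discounts and any provider-specific pricing rules.

```typescript
// Per-million-token prices in USD, copied from the table above.
type PriceTier = "flagship" | "mid" | "small";

const PRICES: Record<PriceTier, { input: number; output: number }> = {
  flagship: { input: 15, output: 75 },
  mid: { input: 3, output: 15 },
  small: { input: 0.25, output: 1.25 },
};

// Hypothetical helper: monthly cost in USD for a given token volume.
function monthlyCost(tier: PriceTier, inputTokens: number, outputTokens: number): number {
  const price = PRICES[tier];
  return (inputTokens / 1_000_000) * price.input + (outputTokens / 1_000_000) * price.output;
}

// Assumed workload for one heavy user: 200M input and 20M output tokens per month.
const flagshipCost = monthlyCost("flagship", 200_000_000, 20_000_000); // 200*15 + 20*75 = 4500
const midCost = monthlyCost("mid", 200_000_000, 20_000_000); // 200*3 + 20*15 = 900
const smallCost = monthlyCost("small", 200_000_000, 20_000_000); // 200*0.25 + 20*1.25 = 75
```

Under these assumptions, the same workload costs $4,500 on a flagship model, $900 on a mid-tier model, and $75 on a small one, which is why tier choice dominates the bill.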
```typescript
// Route simple tasks to cheap models, hard tasks to flagships.
async function routeTask(task: Task) {
  // classifyComplexity uses a small/cheap model
  const complexity = await classifyComplexity(task);

  if (complexity === 'trivial') {
    return callModel('haiku', task); // $0.25/M
  }
  if (complexity === 'moderate') {
    return callModel('sonnet', task); // $3/M
  }
  return callModel('opus', task); // $15/M
}

// Rule of thumb: at scale, 70-80% of tasks route to cheap tiers.
// Average cost drops 3-5x with no measurable quality loss.
```

A router in front of your agent is the single highest-leverage optimization. Gateways like Vercel AI Gateway and LiteLLM do this with config.

At sustained high volume, self-hosting an open-weights model like Llama 4 or Qwen 3 can undercut API pricing. The crossover happens around $10-20k/month of API spend, higher if you need flagship quality. Below that, running GPUs is a distraction.
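Returning to the routing rule of thumb, the size of the saving depends on the traffic mix. The sketch below computes a blended input price for an assumed mix. The `blendedInputPrice` helper and the 50/25/25 split are hypothetical, and the calculation assumes every task uses roughly the same number of tokens regardless of tier, which real workloads may not.

```typescript
// Input prices in USD per million tokens, matching the router's comments.
type RouteTier = "haiku" | "sonnet" | "opus";

const INPUT_PRICE: Record<RouteTier, number> = { haiku: 0.25, sonnet: 3, opus: 15 };

// Hypothetical helper: weighted average input price for a traffic mix.
// Each value in `mix` is the share of tasks routed to that tier (must sum to 1).
function blendedInputPrice(mix: Record<RouteTier, number>): number {
  const tiers = Object.keys(mix) as RouteTier[];
  const totalShare = tiers.reduce((sum, tier) => sum + mix[tier], 0);
  if (Math.abs(totalShare - 1) > 1e-9) {
    throw new Error("traffic shares must sum to 1");
  }
  return tiers.reduce((sum, tier) => sum + mix[tier] * INPUT_PRICE[tier], 0);
}

// Assumed mix: 75% of tasks leave the flagship tier.
const blended = blendedInputPrice({ haiku: 0.5, sonnet: 0.25, opus: 0.25 });
// 0.5*0.25 + 0.25*3 + 0.25*15 = 4.625 USD per million input tokens

// Reduction relative to sending everything to the flagship tier.
const reduction = INPUT_PRICE.opus / blended; // 15 / 4.625, roughly 3.2x
```

With this mix the blended price lands at about $4.63 per million input tokens, roughly a 3.2x reduction against an all-flagship baseline. Shifting more traffic to the smallest tier pushes the ratio higher.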
The first rule of AI cost optimization: the bill you can't see is the bill you can't control.
— A FinOps lead
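The visibility the quote calls for can start small. Below is a minimal sketch of an in-memory budget guard that tracks spend per team and reports when a soft or hard threshold is crossed. The `BudgetGuard` class, the 80% warning ratio, and the team name are all hypothetical. A real setup would persist spend and read costs from your gateway or provider usage reports.

```typescript
type BudgetStatus = "ok" | "warn" | "blocked";

// Hypothetical in-memory guard: one monthly limit shared by every team it tracks.
class BudgetGuard {
  private readonly spent = new Map<string, number>();
  private readonly monthlyLimitUsd: number;
  private readonly warnRatio: number;

  constructor(monthlyLimitUsd: number, warnRatio = 0.8) {
    this.monthlyLimitUsd = monthlyLimitUsd;
    this.warnRatio = warnRatio;
  }

  // Add the cost of one request and return the team's status afterwards.
  record(team: string, costUsd: number): BudgetStatus {
    this.spent.set(team, (this.spent.get(team) ?? 0) + costUsd);
    return this.status(team);
  }

  // "blocked" at or above the limit, "warn" at or above warnRatio of it, otherwise "ok".
  status(team: string): BudgetStatus {
    const total = this.spent.get(team) ?? 0;
    if (total >= this.monthlyLimitUsd) return "blocked";
    if (total >= this.monthlyLimitUsd * this.warnRatio) return "warn";
    return "ok";
  }
}

// Usage with an assumed $1,000 monthly limit.
const guard = new BudgetGuard(1000);
const first = guard.record("platform", 500); // total 500: "ok"
const second = guard.record("platform", 350); // total 850: "warn"
const third = guard.record("platform", 200); // total 1050: "blocked"
```

The guard only reports a status. Whether "blocked" rejects requests outright or just downgrades them to a cheaper tier is a policy choice left to the caller.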
The big idea: AI coding bills scale with habits, not just headcount. Caching, routing, context hygiene, and budget visibility get you 80% of the savings. Skip those and you're burning money that better-run teams have learned to keep.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-coding-rate-limiting-costs-optimization-creators
What is the main idea of "Rate-Limiting, Costs, and Optimization"?
Which concept is most central to "Rate-Limiting, Costs, and Optimization"?
Which use of AI fits this topic best?
What should a careful learner remember about "These numbers move monthly"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about token economics be treated?
Name one way to verify an AI answer about token economics.
Which action would help you apply "Rate-Limiting, Costs, and Optimization" responsibly?