Rate-Limiting, Costs, and Optimization
AI coding bills surprise teams that don't watch them. Let's break down the real cost drivers, the levers that actually reduce them, and how to set guardrails before your CFO does.
Lesson map
1. The Bill That Compounds
2. Token economics
3. Rate limits
4. Prompt caching
The Bill That Compounds
A single engineer using Claude Code heavily can generate hundreds of millions of tokens a month. Across a team, the numbers become genuinely expensive. Most overspend comes from habits, not essential usage — which means most of it is recoverable.
The four cost drivers, ranked
1. Context size per call — long-context calls dominate cost
2. Model tier — flagship models are 5-10x the price of mid-tier siblings
3. Call frequency — agentic loops can burn through tokens in minutes
4. Output length — output tokens run about 5x the per-token price of inputs
Cost per million tokens, roughly (April 2026)
| Model tier | Input $/MTok | Output $/MTok |
|---|---|---|
| Flagship (Claude Opus, GPT-5.5, Gemini Ultra) | $15 | $75 |
| Mid (Claude Sonnet, GPT-5, Gemini Pro) | $3 | $15 |
| Small (Claude Haiku, GPT-5 mini, Gemini Flash) | $0.25 | $1.25 |
| Open-weights self-hosted (Llama 4, Qwen 3) | ~$0 (hardware only) | ~$0 (hardware only) |
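To see how these rates combine on a single call, here's a minimal TypeScript sketch; the prices are hardcoded from the table above, and `callCost` is an illustrative helper, not a provider API:

```typescript
// Prices from the table above, in $ per million tokens.
const PRICES = {
  flagship: { input: 15, output: 75 },
  mid:      { input: 3, output: 15 },
  small:    { input: 0.25, output: 1.25 },
} as const;

// Dollar cost of one call, given token counts and a tier.
function callCost(tier: keyof typeof PRICES, inputTok: number, outputTok: number): number {
  const p = PRICES[tier];
  return (inputTok * p.input + outputTok * p.output) / 1_000_000;
}

console.log(callCost('flagship', 100_000, 2_000)); // $1.65
console.log(callCost('mid', 100_000, 2_000));      // $0.33
```

Note how the input term dominates a long-context call even at a 5x output price, which is why context size tops the driver list.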
The levers that actually work
- Prompt caching: 90% input discount after the first call — always on for long context (see the sketch after this list)
- Right-size the model: use mid/small tier for 80% of work, flagship only when needed
- Limit context: paste only what's relevant, not the whole repo
- Cap max_tokens on output: 2048 is enough for most coding tasks
- Batch requests: one call with 5 questions beats 5 calls
- Model routing: a gateway picks the cheapest model that meets quality
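The first and fourth levers are one-line changes at the call site. Here's a minimal sketch assuming the Anthropic TypeScript SDK; the model id is a stand-in for whatever mid-tier model you use, and `repoContext` is the long, stable prefix (conventions, key files) you resend on every call:

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function ask(repoContext: string, question: string) {
  return client.messages.create({
    model: 'claude-sonnet-4-5', // stand-in id; pick your current mid-tier model
    max_tokens: 2048,           // lever: cap output spend per call
    system: [
      {
        type: 'text',
        text: repoContext,
        // Lever: mark the stable prefix cacheable. Later calls that reuse
        // this exact prefix pay the discounted cached-input rate.
        cache_control: { type: 'ephemeral' },
      },
    ],
    messages: [{ role: 'user', content: question }],
  });
}
```

The savings compound fastest in agentic loops, which resend the same context dozens of times per session.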
A simple routing pattern
A router in front of your agent is the single highest-leverage optimization. Gateways like Vercel AI Gateway and LiteLLM do this with config.
// Route simple tasks to cheap models, hard tasks to flagships.
// Task, classifyComplexity, and callModel are sketch-level stand-ins;
// classifyComplexity itself uses a small/cheap model.
type Complexity = 'trivial' | 'moderate' | 'hard';
interface Task { prompt: string }
declare function classifyComplexity(task: Task): Promise<Complexity>;
declare function callModel(model: string, task: Task): Promise<string>;

async function routeTask(task: Task) {
  const complexity = await classifyComplexity(task);
  if (complexity === 'trivial') {
    return callModel('haiku', task);  // $0.25/MTok input
  }
  if (complexity === 'moderate') {
    return callModel('sonnet', task); // $3/MTok input
  }
  return callModel('opus', task);     // $15/MTok input
}

// Rule of thumb: at scale, 70-80% of tasks route to cheap tiers.
// Average cost drops 3-5x with no measurable quality loss.

Rate limits to know
- Provider-level: requests/minute and tokens/minute caps by tier
- Tool-level: Claude Code, Cursor, and Windsurf all enforce daily/weekly quotas
- Org-level: your company may set per-user caps; check before a big day
- Burst-level: retry with exponential backoff on 429s, never just hammer (see the sketch below)
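For the burst-level rule, a minimal retry sketch; `callModel` is the hypothetical helper from the routing example, and the delay constants are just reasonable defaults:

```typescript
// Retry on 429 (rate limited) with exponential backoff plus jitter.
async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = (err as { status?: number }).status;
      // Only retry rate-limit errors, and only maxRetries times.
      if (status !== 429 || attempt >= maxRetries) throw err;
      // 1s, 2s, 4s, 8s... plus up to 1s of jitter so a fleet of
      // clients doesn't retry in lockstep.
      const delayMs = 1000 * 2 ** attempt + Math.random() * 1000;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage: await withBackoff(() => callModel('sonnet', task));
```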
Team FinOps for AI
1. Set per-user monthly budgets with alerts at 50% / 80% / 100%
2. Tag every API call with user, project, and feature for accurate attribution (see the sketch after this list)
3. Review spend weekly in the first month of new tool adoption, monthly after
4. Publish a leaderboard of efficient users — peer pressure beats policy
5. Negotiate enterprise volume discounts once spend is predictable
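Step 2 is the foundation for the rest of the list: budgets and reviews are only as accurate as the attribution beneath them. A sketch of a tagging wrapper, where `callModel` and `recordUsage` are hypothetical stand-ins for your model client and your spend ledger:

```typescript
interface CallTags { user: string; project: string; feature: string }

declare function callModel(model: string, task: { prompt: string }): Promise<string>;
declare function recordUsage(entry: CallTags & { model: string; latencyMs: number }): Promise<void>;

// Every model call goes through here, so spend can later be grouped
// by user, project, or feature instead of showing up as one blob.
async function taggedCall(tags: CallTags, model: string, prompt: string) {
  const started = Date.now();
  const result = await callModel(model, { prompt });
  await recordUsage({ ...tags, model, latencyMs: Date.now() - started });
  return result;
}
```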
When self-hosting makes sense
At sustained high volume, self-hosting an open-weights model like Llama 4 or Qwen 3 can undercut API pricing. The crossover happens around $10-20k/month of API spend, higher if you need flagship quality. Below that, running GPUs is a distraction.
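To sanity-check that crossover against your own numbers, a back-of-envelope sketch; every figure in it is an assumption to replace with your actuals:

```typescript
// Assumed numbers, not quotes: an all-in self-hosted node cost, and the
// mid-tier API input price from the table above.
const nodeCostPerMonth = 15_000; // $/month: hardware amortization + ops (assumed)
const apiInputPerMTok = 3;       // $/MTok, mid tier

// Monthly volume at which the node starts beating the API on price.
const breakevenMTok = nodeCostPerMonth / apiInputPerMTok;
console.log(breakevenMTok); // 5,000 MTok, i.e. ~5 billion tokens/month
```

Five billion tokens a month is sustained, serious load; below it, the node costs more than the API bill it replaces, before counting the engineering time to run it.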
“The first rule of AI cost optimization: the bill you can't see is the bill you can't control.”
The big idea: AI coding bills scale with habits, not just headcount. Caching, routing, context hygiene, and budget visibility get you 80% of the savings. Skip those and you're burning money the market is teaching other teams to keep.
