AI Pricing Models: Per-Token, Cached, Batch, and Reserved Capacity
Understand the AI pricing landscape across input, output, cached, batch, and reserved tiers.
11 min · Reviewed 2026
The premise
AI provider pricing now spans per-token, cached-token, batch, and reserved-capacity tiers, each suited to a different workload pattern.
What AI does well here
Per-token: low-volume, sporadic workloads
Cached tokens: repeated long contexts at much lower cost
Batch APIs: high-volume async work at deep discounts
Reserved: predictable steady-state high volume (the tiers are compared in the cost sketch after this list)
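To make the tiers concrete, here is a back-of-envelope cost sketch. Every number in it is a placeholder assumption (the per-million-token rates, the cache and batch discount factors, and the workload shape), not any provider's published pricing; reserved capacity is left out because it is a negotiated flat commitment rather than a metered rate.

```python
# Rough monthly-cost comparison across pricing tiers.
# All rates and discounts below are assumed placeholders --
# substitute your provider's published pricing.

PRICE_PER_M_INPUT = 3.00    # $ per 1M input tokens (assumed)
PRICE_PER_M_OUTPUT = 15.00  # $ per 1M output tokens (assumed)
CACHED_INPUT_FACTOR = 0.10  # cached input billed at 10% of full rate (assumed)
BATCH_FACTOR = 0.50         # batch tier billed at 50% of the sync rate (assumed)
DAYS_PER_MONTH = 30

def monthly_cost(requests_per_day, input_tokens, output_tokens,
                 cached_prefix=0, batch=False):
    """Estimate monthly spend for one workload under the assumed rates."""
    fresh_input = input_tokens - cached_prefix
    per_request = (
        fresh_input * PRICE_PER_M_INPUT / 1e6
        + cached_prefix * PRICE_PER_M_INPUT * CACHED_INPUT_FACTOR / 1e6
        + output_tokens * PRICE_PER_M_OUTPUT / 1e6
    )
    if batch:
        per_request *= BATCH_FACTOR
    return requests_per_day * DAYS_PER_MONTH * per_request

# One hypothetical workload, three ways: 100k requests/day,
# 2,500 input tokens (2,000 of them a stable prefix), 500 output tokens.
plain = monthly_cost(100_000, 2_500, 500)
cached = monthly_cost(100_000, 2_500, 500, cached_prefix=2_000)
cached_batch = monthly_cost(100_000, 2_500, 500, cached_prefix=2_000, batch=True)
print(f"per-token ${plain:,.0f} | +caching ${cached:,.0f} | +batch ${cached_batch:,.0f}")
```

Under these assumed rates, the same workload falls from $45,000 to $14,400 per month, which is the shape of the lesson's later claim that maximizing cache hits and pushing async work to batch APIs can cut a bill by more than half.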
What AI cannot do
Optimize pricing tier choice without workload data
Predict its own input and output token usage precisely (illustrated in the sketch below)
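That second limitation is easy to see in code: input tokens can be counted exactly before a request is sent, but output length is unknown until the response returns, so any pre-call cost estimate is a range. Below is a minimal sketch using the open-source tiktoken tokenizer; the encoding name and the dollar rates are assumptions you would match to your actual model and provider.

```python
# Input tokens are countable up front; output tokens are only
# bounded by the max_tokens cap, so cost forecasts carry a band.
import tiktoken  # open-source tokenizer; the encoding chosen below is an assumption

enc = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the attached quarterly report in three bullet points."
input_tokens = len(enc.encode(prompt))  # exact, known before the call

max_output_tokens = 300  # request cap; actual output lands anywhere in [0, cap]

# Assumed placeholder rates, $ per 1M tokens.
low = input_tokens * 3.00 / 1e6
high = low + max_output_tokens * 15.00 / 1e6
print(f"input tokens: {input_tokens}; cost per call: ${low:.6f} to ${high:.6f}")
```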
End-of-lesson check
15 questions · take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-model-families-pricing-models-final5-creators
Which workload pattern is most economically suited for per-token pricing?
A fixed daily workload of 10 million tokens
Sporadic, low-volume requests that arrive unpredictably
A nightly job processing millions of database records
Continuous background data processing running 24/7
A developer implements prompt caching and sees costs drop by 60%. What scenario most likely explains this result?
They switched from a premium model to a budget model
Their system prompt and user queries share a long consistent prefix
They reduced their maximum output token limit
They began using a different API provider
A team runs the same complex analysis on 50,000 records every night. Which pricing approach would likely yield the greatest cost savings?
Pay-per-request with priority routing
Batch API with asynchronous processing
Reserved capacity with dedicated infrastructure
Per-token pricing with standard API calls
A company's AI usage is highly predictable: they process 2 million tokens daily, 365 days a year, with consistent request patterns. Which pricing tier best matches this situation?
Reserved capacity or committed use discounts
Pay-as-you-go with no commitments
Per-token pricing with prompt caching enabled
Batch API with background processing
A developer changes just one word in their system prompt and notices their cached tokens dropped to nearly zero. What does this demonstrate about prompt caching?
The cache was deliberately cleared by the provider
Cache hits require exact prefix matches
Cached prompts have a 24-hour expiration policy
Cache automatically rebuilds within a few minutes
What is a fundamental limitation that prevents AI systems from automatically selecting the optimal pricing tier?
AI systems cannot connect to billing APIs
AI cannot predict its own input and output token usage precisely
A startup wants to minimize costs for an application where users submit questions and expect answers within 2 seconds. They have 10,000 daily users. Which approach would likely save them the most money?
Switch to reserved capacity with per-second billing
Move all processing to a serverless function
Implement aggressive prompt caching on the system prompt
Use batch API processing for all user queries
Which statement best describes the primary use case for cached tokens?
Storing conversation history between sessions
Lowering costs when the same long context appears repeatedly
Enabling faster real-time speech translation
Reducing costs for one-time, unique queries
An AI-powered customer service chatbot receives 100,000 messages per day with a 2,000-token system prompt that never changes. What cost optimization strategy should they prioritize?
Switching to reserved capacity immediately
Negotiating volume discounts with the API provider
Maximizing prompt cache hits on the system prompt
Implementing batch processing for all messages
A data science team needs to summarize 500,000 PDF documents over the weekend. They don't need results until Monday. Which pricing strategy would be most appropriate?
Per-token pricing with standard synchronous API calls
Batch API with asynchronous processing
Reserved capacity with dedicated GPU instances
Pay-per-request with premium support
What does the lesson identify as two specific moves that can cut AI bills by over 50%?
Using smaller models and reducing output limits
Enabling compression and caching all outputs
Maximizing prompt cache hits and pushing async work to batch APIs
Switching providers and negotiating contracts
Why might a company choose NOT to use reserved capacity pricing even with high, predictable volume?
They need the flexibility to scale down during low periods without penalties
The technology is not yet available for most AI providers
Reserved capacity actually costs more than per-token for any volume
Reserved capacity is only available to enterprise customers
Which scenario best represents an ideal use case for batch API pricing?
A user waiting in a chat interface for an immediate response
A content moderation system scanning millions of uploaded images overnight
A stock trading algorithm making split-second decisions
A healthcare system flagging urgent lab results in real-time
A team wants to reduce costs but finds their prompt changes frequently due to adding new features. What should they prioritize before investing in caching infrastructure?
Stabilizing their prompts to ensure cache effectiveness
Removing all system prompts to maximize cache hits
Switching to a cheaper AI model
Implementing a prompt versioning system
What is the primary trade-off when using batch API pricing?
Lower cost in exchange for asynchronous (delayed) results
Lower cost but a requirement to use more expensive premium models