Frontier models can now read a million tokens of your codebase in one shot. That changes how we architect prompts and retrieval, and it reshapes the cost curve of agentic work.
Claude, Gemini, and GPT now offer 1M+ token context windows for coding workloads. That's roughly 750,000 words, or most mid-sized repos. When the whole codebase fits in one prompt, architectural questions, cross-file refactors, and full-repo audits become viable in a single shot.
Just because you can paste a million tokens doesn't mean you should. Cost scales linearly with input tokens, and attention quality degrades unevenly across the window. The middle of a long context tends to be the worst-attended region — a phenomenon called the lost-in-the-middle effect.
| Strategy | Cost per call | Attention quality |
|---|---|---|
| Paste full repo every call | Very high | Variable — worst in middle |
| Paste only relevant files | Moderate | Good — surgical |
| RAG (retrieve then prompt) | Low | Excellent if retrieval is good |
| Prompt caching with long context | High first call, low after | Full context with amortized cost |
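To make the table concrete, it helps to put rough numbers on a session. The sketch below compares the four strategies; the per-token price, token counts, and call count are illustrative assumptions, not quoted rates.

```python
# Back-of-the-envelope cost comparison. Prices are ILLUSTRATIVE
# assumptions (USD per million input tokens), not quoted rates.
PRICE_PER_MTOK = 3.00        # assumed base input price
CACHE_READ_FRACTION = 0.10   # cached reads at ~10% of base (typical)

REPO_TOKENS = 800_000        # whole repo pasted as context
RELEVANT_TOKENS = 20_000     # hand-picked relevant files only
RAG_TOKENS = 4_000           # a few retrieved chunks
CALLS = 50                   # questions asked in one session

def cost(tokens_per_call: int, calls: int, cached: bool = False) -> float:
    """Total input cost in USD for a session of `calls` requests."""
    if not cached:
        return tokens_per_call * calls * PRICE_PER_MTOK / 1_000_000
    # Caching: first call at full price, the rest at the cached read rate.
    first = tokens_per_call * PRICE_PER_MTOK / 1_000_000
    rest = (tokens_per_call * (calls - 1) * CACHE_READ_FRACTION
            * PRICE_PER_MTOK / 1_000_000)
    return first + rest

print(f"Full repo, no cache: ${cost(REPO_TOKENS, CALLS):8.2f}")
print(f"Relevant files only: ${cost(RELEVANT_TOKENS, CALLS):8.2f}")
print(f"RAG retrieval:       ${cost(RAG_TOKENS, CALLS):8.2f}")
print(f"Full repo, cached:   ${cost(REPO_TOKENS, CALLS, cached=True):8.2f}")
```

Under these assumptions, caching turns a $120 session into roughly $14, while surgical and RAG prompts stay cheapest outright, matching the table's ordering.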
Both Anthropic and OpenAI support prompt caching. Once a long context is cached, subsequent calls reuse it at a fraction of the input cost, typically around 10 percent. This makes long-context workflows economically sane. Structure prompts so the stable parts (codebase, docs, instructions) come first — they cache. The variable part (the question) comes last.
```python
# Anthropic prompt caching example
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # substitute a current long-context model
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": "You are a senior engineer reviewing a codebase.",
        },
        {
            "type": "text",
            "text": open("whole_repo.txt").read(),  # ~800k tokens
            "cache_control": {"type": "ephemeral"},  # CACHE THIS
        },
    ],
    messages=[{
        "role": "user",
        "content": "What's the purpose of utils/parser.ts?",
    }],
)
# First call: full price for the 800k tokens.
# Every subsequent call within the 5-minute cache window: ~10% of that cost.
```

Cache the stable parts of a long-context prompt. Pay full price once, then reuse cheaply for the duration of your session.

The needle-in-haystack eval plants a specific fact in a long context and asks the model to retrieve it. Frontier models score near-perfectly on simple versions, but real-world performance on complex questions across long context is noticeably worse. Test your actual workflow before trusting reported benchmarks.
> Long context is a better memory, not a better brain.
> — An AI systems researcher
The big idea: 1M-token context opens whole-repo reasoning, but only if you pair it with caching, structure, and citations. Long context is a superpower priced by the token — spend it deliberately.
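The citations leg of that advice is easy to operationalize: tell the model that every claim must carry a file path and line number, then spot-check a few. A minimal sketch, assuming the `client` and the cached system blocks from the earlier example are in scope (here named `cached_system`); the prompt wording is illustrative, not a prescribed pattern.

```python
# Demand grounded answers: a file path and line number for every claim.
# Assumes `client` and the cached system blocks (`cached_system`) from
# the earlier example are in scope; the prompt wording is illustrative.
question = (
    "Where is the HTTP retry logic implemented? For every claim, cite "
    "the file and line it came from, e.g. (src/http/client.ts:142). "
    "If you cannot find support in the provided code, say so."
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # substitute a current long-context model
    max_tokens=1024,
    system=cached_system,  # same stable blocks as before -> cache hit
    messages=[{"role": "user", "content": question}],
)
print(response.content[0].text)
# Spot-check one or two cited locations before acting on the answer.
```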
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-coding-long-context-1m-token-creators
1. What capability becomes newly viable when a model's context window can hold roughly 750,000 words (approximately one million tokens)?
2. What is the 'lost-in-the-middle' effect in long-context models?
3. Which strategy for using long-context models has the LOWEST cost per API call?
4. Why is prompt caching described as a 'game-changer' for long-context workflows?
5. In a prompt caching workflow, where should the variable part of your query (the specific question) be placed?
6. What is the needle-in-haystack evaluation?
7. Which design pattern helps models better attend to structure in long prompts?
8. What is the main drawback of pasting the full repository into every API call?
9. For very large monorepos exceeding one million tokens, what approach typically beats pure long context on both cost and quality?
10. Why should you ask an AI coding assistant to cite which file and line number its answer came from?
11. What is the primary limitation of long-context models besides the direct monetary cost of tokens?
12. Why might answers from long-context prompts be less reliable than answers from surgical (narrow) prompts?
13. Before trusting reported benchmarks about how well a model handles your specific long-context workflow, what should you do?
14. In what situation is RAG (retrieve then prompt) preferred over pure long context, even though both could technically work?
15. When splitting a very large repository into chunks for processing, what is the recommended chunking strategy?