Frontier models now read a million tokens of your codebase in one shot. That changes how we architect prompts, retrieval, and the cost curve of agentic work.
Claude, Gemini, and GPT now offer 1M+ token context windows for coding workloads. That's roughly 750,000 words of English prose, enough to hold most mid-sized repos. When the whole codebase fits in one prompt, architectural questions, cross-file refactors, and full-repo audits become viable in a single shot.
Just because you can paste a million tokens doesn't mean you should. Cost scales linearly with input tokens, and attention quality degrades unevenly across the window. The middle of a long context tends to be the worst-attended region — a phenomenon called the lost-in-the-middle effect.
| Strategy | Cost per call | Attention quality |
|---|---|---|
| Paste full repo every call | Very high | Variable — worst in middle |
| Paste only relevant files | Moderate | Good — surgical |
| RAG (retrieve then prompt) | Low | Excellent if retrieval is good |
| Prompt caching with long context | High first call, low after | Full context with amortized cost |
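Before choosing a row from the table, it helps to know roughly how large the repo is in tokens and what one uncached call would cost. The sketch below is a rough estimate only: the 4-characters-per-token ratio is a common rule of thumb rather than a real tokenizer, and the function names, the extension list, and the price argument are all hypothetical.

```python
import os

# Assumption: ~4 characters per token. Real tokenizers vary by model and language.
CHARS_PER_TOKEN = 4


def estimate_repo_tokens(root, extensions=(".py", ".ts", ".js", ".md")):
    """Roughly estimate the token count of the source files under `root`."""
    total_chars = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Skip binaries and anything else outside the listed extensions.
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    total_chars += len(f.read())
    return total_chars // CHARS_PER_TOKEN


def input_cost_usd(tokens, usd_per_million_tokens):
    """Input cost grows linearly with the number of tokens sent."""
    return tokens / 1_000_000 * usd_per_million_tokens
```

Calling `input_cost_usd(estimate_repo_tokens("."), price)` with the per-million-token price from your provider's price sheet gives a ballpark for the "paste full repo every call" row. For a precise figure, use the provider's own token counter.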
Both Anthropic and OpenAI support prompt caching. Once a long context is cached, subsequent calls reuse it at a fraction of the input cost: around 10 percent on Anthropic, with the exact discount varying by provider and model. This makes long-context workflows economically sane. Structure prompts so the stable parts (codebase, docs, instructions) come first, because caching works on the prompt prefix. The variable part (the question) comes last.
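The economics of caching can be sketched with simple arithmetic. The function below is illustrative only: `session_cost` is a hypothetical helper, the 10 percent read multiplier comes from the figure quoted above, the 1.25 write multiplier is an assumption about a cache-write surcharge, and it assumes every call lands while the cache is still warm.

```python
def session_cost(context_tokens, calls, usd_per_million_tokens,
                 write_multiplier=1.25, read_multiplier=0.10):
    """Compare the input cost of resending a context vs. caching it.

    Returns (uncached_usd, cached_usd) for `calls` requests that all
    share the same `context_tokens` prefix.
    """
    per_call = context_tokens / 1_000_000 * usd_per_million_tokens

    # Without caching, every call pays the full input price.
    uncached = calls * per_call

    # With caching, the first call writes the cache (assumed surcharge)
    # and every later call reads it at the discounted rate.
    cached = per_call * (write_multiplier + (calls - 1) * read_multiplier)
    return uncached, cached
```

Under these assumptions, caching costs slightly more for a single call and pays for itself from the second call onward. Swap in your provider's actual multipliers before relying on the numbers.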
```python
# Anthropic prompt caching example
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[
        {
            # Short, stable instructions come first.
            "type": "text",
            "text": "You are a senior engineer reviewing a codebase."
        },
        {
            # The large, stable block: everything up to here is cached.
            "type": "text",
            "text": open("whole_repo.txt").read(),  # 800k tokens
            "cache_control": {"type": "ephemeral"}  # CACHE THIS
        }
    ],
    # The variable part (the question) goes last, after the cached prefix.
    messages=[{
        "role": "user",
        "content": "What's the purpose of utils/parser.ts?"
    }]
)
# First call: full price (plus a cache-write surcharge) for the 800k tokens.
# Every subsequent call in the next 5 min: ~10% of that cost.
```

Cache the stable parts of a long-context prompt. Pay full price once, then reuse cheaply for the duration of your session.

The needle-in-haystack eval plants a specific fact in a long context and asks the model to retrieve it. Frontier models score near-perfectly on simple versions, but real-world performance on complex questions across long context is noticeably worse. Test your actual workflow before trusting reported benchmarks.
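A needle-in-haystack check of the kind described above can be sketched in a few lines. This is a minimal harness, not a standard benchmark: `build_haystack`, `run_needle_eval`, and the `ask_model` callable are hypothetical names, and `ask_model` stands in for whatever function sends a prompt to your model and returns its text reply.

```python
from typing import Callable


def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Place `needle` among `filler` lines at a relative depth.

    depth=0.0 puts the needle at the start, 1.0 at the end.
    """
    index = round(depth * len(filler))
    return "\n".join(filler[:index] + [needle] + filler[index:])


def run_needle_eval(
    ask_model: Callable[[str], str],
    filler: list[str],
    needle: str,
    question: str,
    expected: str,
    depths: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> dict[float, bool]:
    """Ask the same question with the needle at each depth.

    Returns a map of depth -> whether the expected answer appeared.
    """
    results = {}
    for depth in depths:
        prompt = build_haystack(filler, needle, depth) + "\n\nQuestion: " + question
        answer = ask_model(prompt)
        # Simple substring match; real workflows may need a stricter grader.
        results[depth] = expected.lower() in answer.lower()
    return results
```

Using your own source files as filler and a question that requires combining facts from two places gives a more honest picture than a single planted sentence. Depths that fail near 0.5 would be consistent with the lost-in-the-middle effect.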
Long context is a better memory, not a better brain.
— An AI systems researcher
The big idea: 1M-token context opens whole-repo reasoning, but only if you pair it with caching, structure, and citations. Long context is a superpower priced by the token — spend it deliberately.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-coding-long-context-1m-token-creators
What is the main idea of "Long-Context Code Understanding — The 1M-Token Era"?
Which concept is most central to "Long-Context Code Understanding — The 1M-Token Era"?
Which use of AI fits this topic best?
What should a careful learner remember about "When retrieval still wins"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about context windows be treated?
Name one way to verify an AI answer about context windows.
Which action would help you apply "Long-Context Code Understanding — The 1M-Token Era" responsibly?