Long-Context Code Understanding — The 1M-Token Era
Frontier models now read a million tokens of your codebase in one shot. That changes how we architect prompts, retrieval, and the cost curve of agentic work.
Creators · AI-Assisted Coding · ~30 min read
A Million Tokens Changes the Job
Claude, Gemini, and GPT now offer 1M+ token context windows for coding workloads. That's roughly 750,000 words of English prose (code usually tokenizes less densely, so expect fewer words per token), which is enough to hold most small-to-medium repos. When the whole codebase fits in one prompt, architectural questions, cross-file refactors, and full-repo audits become viable in a single shot.
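Before pasting a repo, it helps to check whether it plausibly fits. The sketch below is a rough estimator, not a tokenizer: the 4-characters-per-token ratio, the extension list, the skipped directory names, and the function names are all assumptions made for illustration, and real token counts depend on the model's tokenizer.

```python
import os

# Assumption: ~4 characters per token is a rough average; real tokenizers vary,
# and code often tokenizes less efficiently than English prose.
CHARS_PER_TOKEN = 4
CONTEXT_LIMIT = 1_000_000  # hypothetical 1M-token window

# Hypothetical choices; adjust to your repo.
SOURCE_EXTENSIONS = {".py", ".ts", ".js", ".go", ".rs", ".java", ".md"}
SKIPPED_DIRS = {".git", "node_modules", "dist"}


def estimate_repo_tokens(root: str) -> int:
    """Walk `root` and estimate tokens from the character count of source files."""
    total_chars = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIPPED_DIRS]
        for name in filenames:
            if os.path.splitext(name)[1] in SOURCE_EXTENSIONS:
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total_chars += len(handle.read())
    return total_chars // CHARS_PER_TOKEN


def fits_in_context(root: str, reserve_for_output: int = 8_000) -> bool:
    """True if the estimate leaves room for the question and the reply."""
    return estimate_repo_tokens(root) + reserve_for_output <= CONTEXT_LIMIT
```

Treat the result as a first filter only. If the estimate lands anywhere near the limit, count tokens with the provider's own tooling before committing to a full-repo prompt.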
What you can now do in one pass
- Read an entire small-to-medium repo into context
- Ask architectural questions that cross many files
- Refactor a shared type across 50 call sites at once
- Summarize every change between two git tags
- Audit for duplicated logic repo-wide
Long context is not free context
Just because you can paste a million tokens doesn't mean you should. Cost scales linearly with input tokens, and attention quality degrades unevenly across the window. The middle of a long context tends to be the worst-attended region — a phenomenon called the lost-in-the-middle effect.
Compare the options
| Strategy | Cost per call | Attention quality |
|---|---|---|
| Paste full repo every call | Very high | Variable — worst in middle |
| Paste only relevant files | Moderate | Good — surgical |
| RAG (retrieve then prompt) | Low | Excellent if retrieval is good |
| Prompt caching with long context | High first call, low after | Full context with amortized cost |
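To make the cost column concrete, here is a small arithmetic sketch comparing "paste the full repo every call" against "cache the repo once, then reuse it". Every number in `Pricing` is a hypothetical placeholder rather than any provider's price list, and the model assumes the cache never expires between calls.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    # All values are hypothetical placeholders for illustration only.
    input_per_mtok: float = 5.00          # dollars per million uncached input tokens
    cache_write_multiplier: float = 1.25  # assumed premium for writing the cache
    cache_read_multiplier: float = 0.10   # assumed discount for reading the cache


def cost_without_cache(context_tokens: int, calls: int, p: Pricing) -> float:
    """Every call pays the full input price for the whole context."""
    return calls * context_tokens / 1_000_000 * p.input_per_mtok


def cost_with_cache(context_tokens: int, calls: int, p: Pricing) -> float:
    """First call writes the cache; every later call reads it.

    Assumes the cache stays warm for the whole session.
    """
    if calls <= 0:
        return 0.0
    per_call = context_tokens / 1_000_000 * p.input_per_mtok
    first = per_call * p.cache_write_multiplier
    rest = (calls - 1) * per_call * p.cache_read_multiplier
    return first + rest


if __name__ == "__main__":
    pricing = Pricing()
    # 800k-token context, 20 questions in one session.
    print(f"{cost_without_cache(800_000, 20, pricing):.2f}")  # 80.00
    print(f"{cost_with_cache(800_000, 20, pricing):.2f}")     # 12.60
```

Under these assumed prices the gap grows with every extra call, which is why the caching row reads "high first call, low after". Output tokens and the per-call question are left out because they are the same in both strategies.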
Prompt caching is the game-changer
Both Anthropic and OpenAI support prompt caching. Once a long context is cached, subsequent calls reuse it at a fraction of the normal input price: around 10 percent for Anthropic cache reads, with the exact discount varying by provider and model. This makes long-context workflows economically sane. Caches match on an exact prefix, so structure prompts so that the stable parts (codebase, docs, instructions) come first and the variable part (the question) comes last.
Cache the stable parts of a long-context prompt. Pay full price once, then reuse cheaply for the duration of your session.
```python
# Anthropic prompt caching example
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("whole_repo.txt", encoding="utf-8") as f:
    repo_text = f.read()  # ~800k tokens

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": "You are a senior engineer reviewing a codebase.",
        },
        {
            "type": "text",
            "text": repo_text,
            "cache_control": {"type": "ephemeral"},  # cache the prompt up to this block
        },
    ],
    messages=[{
        "role": "user",
        "content": "What's the purpose of utils/parser.ts?",
    }],
)
# First call: full price for the 800k tokens (cache writes may carry a small premium).
# Every subsequent call in the next 5 min: ~10% of that cost.
```

Design patterns that matter
1. Put stable content first (repo, docs, instructions), variable content last
2. Use explicit section markers (file path tags), because models attend to structure
3. Ask the model to cite which file/line its answer came from, which keeps it grounded
4. Chunk very large repos by concern, not by file order
5. Always measure: run the same query with and without long context and compare
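The first three patterns can be sketched as a small prompt builder. The `<file path="...">` tag format and the function names are hypothetical choices for illustration; any consistent, explicit marker would serve the same purpose.

```python
from typing import Dict, List


def pack_repo(files: Dict[str, str]) -> str:
    """Wrap each file in a path-tagged section.

    Files are emitted in sorted order so the packed text is identical
    across calls, which keeps a cached prefix reusable.
    """
    sections: List[str] = []
    for path in sorted(files):
        sections.append(f'<file path="{path}">\n{files[path]}\n</file>')
    return "\n".join(sections)


def build_prompt(instructions: str, files: Dict[str, str], question: str) -> str:
    """Stable content first (instructions, repo); the variable question last."""
    return "\n\n".join([
        instructions,
        pack_repo(files),
        "Cite the file path and line numbers that support your answer.",
        f"Question: {question}",
    ])


if __name__ == "__main__":
    repo = {
        "utils/parser.ts": "export function parse(input: string) { return input.trim(); }",
        "index.ts": 'import { parse } from "./utils/parser";',
    }
    print(build_prompt("You are a senior engineer reviewing a codebase.", repo,
                       "What's the purpose of utils/parser.ts?"))
```

Two prompts built from the same files differ only after `Question:`, so everything before it is a candidate for caching. For pattern 4, one option is to call `pack_repo` once per concern (for example auth, billing, UI) and send only the group a question touches.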
Needle-in-haystack: the eval to know
The needle-in-haystack eval plants a specific fact in a long context and asks the model to retrieve it. Frontier models score near-perfectly on simple versions, but real-world performance on complex questions across long context is noticeably worse. Test your actual workflow before trusting reported benchmarks.
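A minimal version of this eval is easy to run against your own workflow. In the sketch below, `ask_model` is a hypothetical callable you would supply (for example, a thin wrapper around your provider's client); the substring check is a deliberately crude pass criterion, and the function names are illustrative.

```python
from typing import Callable, Dict, List


def build_haystack(filler: List[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler lines."""
    position = round(depth * len(filler))
    lines = filler[:position] + [needle] + filler[position:]
    return "\n".join(lines)


def run_needle_eval(
    ask_model: Callable[[str], str],
    filler: List[str],
    needle: str,
    question: str,
    expected: str,
    depths: List[float],
) -> Dict[float, bool]:
    """Return {depth: passed}, where passed means `expected` appears in the answer."""
    results: Dict[float, bool] = {}
    for depth in depths:
        prompt = build_haystack(filler, needle, depth) + "\n\n" + question
        results[depth] = expected in ask_model(prompt)
    return results
```

Sweeping `depths` from 0.0 to 1.0 shows whether retrieval dips in the middle of the window for your model and your content. Using real files from your repo as `filler`, and a question that requires combining two needles, gets closer to the complex case where performance tends to drop.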
“Long context is a better memory, not a better brain.”
The big idea: 1M-token context opens whole-repo reasoning, but only if you pair it with caching, structure, and citations. Long context is a superpower priced by the token — spend it deliberately.
