Long-Context Code Understanding — The 1M-Token Era
Frontier models now read a million tokens of your codebase in one shot. That changes how we architect prompts, retrieval, and the cost curve of agentic work.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. A Million Tokens Changes the Job
2. Context window
3. Long context
4. Needle-in-haystack
Section 1
A Million Tokens Changes the Job
Claude, Gemini, and GPT now offer 1M+ token context windows for coding workloads. That's roughly 750,000 words, or most mid-sized repos. When the whole codebase fits in one prompt, architectural questions, cross-file refactors, and full-repo audits become viable in a single shot.
What you can now do in one pass
- Read an entire small-to-medium repo into context (see the packing sketch after this list)
- Ask architectural questions that cross many files
- Refactor a shared type across 50 call sites at once
- Summarize every change between two git tags
- Audit for duplicated logic repo-wide
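What does "read an entire repo into context" look like mechanically? A minimal sketch, assuming a local source tree; the helper name and the <file path=...> tag format are illustrative choices, not a required API:

from pathlib import Path

def pack_repo(root: str, extensions=(".py", ".ts", ".md")) -> str:
    """Concatenate matching files under `root` into one prompt string,
    wrapping each file in an explicit path marker so answers can cite it."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in extensions:
            parts.append(f'<file path="{path}">\n{path.read_text(errors="ignore")}\n</file>')
    return "\n\n".join(parts)

repo_context = pack_repo("src")
# Quick size check before you spend money: ~4 characters per token is a rough rule of thumb.
print(f"approx. {len(repo_context) // 4:,} tokens")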
Long context is not free context
Just because you can paste a million tokens doesn't mean you should. Cost scales linearly with input tokens, and attention quality degrades unevenly across the window. The middle of a long context tends to be the worst-attended region — a phenomenon called the lost-in-the-middle effect.
Compare the options
| Strategy | Cost per call | Attention quality |
|---|---|---|
| Paste full repo every call | Very high | Variable — worst in middle |
| Paste only relevant files | Moderate | Good — surgical |
| RAG (retrieve then prompt) | Low | Excellent if retrieval is good |
| Prompt caching with long context | High first call, low after | Full context with amortized cost |
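Putting rough numbers on the table above makes the tradeoff concrete. A back-of-the-envelope sketch with illustrative prices: the $3-per-million rate and the cache multipliers are assumptions in the spirit of current Anthropic pricing, not quoted figures.

TOKENS = 800_000          # repo size in tokens
CALLS = 20                # questions asked in one session
PRICE_PER_MTOK = 3.00     # assumed $/million input tokens
CACHE_WRITE_MULT = 1.25   # assumed premium for writing the cache
CACHE_READ_MULT = 0.10    # assumed cost of reading the cache (~10% of base input)

base = TOKENS / 1_000_000 * PRICE_PER_MTOK           # $ for one uncached pass

no_cache = base * CALLS                               # re-send the repo on every call
with_cache = base * CACHE_WRITE_MULT + base * CACHE_READ_MULT * (CALLS - 1)

print(f"No caching:   ${no_cache:.2f}")    # 20 full-price passes
print(f"With caching: ${with_cache:.2f}")  # 1 cache write + 19 cheap reads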
Prompt caching is the game-changer
Both Anthropic and OpenAI support prompt caching. Once a long context is cached, subsequent calls reuse it at a fraction of the input cost: Anthropic's cache reads run at roughly 10 percent of the normal input price, and OpenAI applies an automatic discount to cached tokens. This makes long-context workflows economically sane. Structure prompts so the stable parts (codebase, docs, instructions) come first — they cache. The variable part (the question) comes last.
Cache the stable parts of a long-context prompt. Pay full price once, then reuse cheaply for the duration of your session.
# Anthropic prompt caching example
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": "You are a senior engineer reviewing a codebase."
        },
        {
            "type": "text",
            "text": open("whole_repo.txt").read(),  # 800k tokens
            "cache_control": {"type": "ephemeral"}  # CACHE THIS
        }
    ],
    messages=[{
        "role": "user",
        "content": "What's the purpose of utils/parser.ts?"
    }]
)
# First call: full price for the 800k tokens (cache writes carry a small premium).
# Every subsequent call within the cache's ~5-minute lifetime: ~10% of that cost.

Design patterns that matter
1. Put stable content first (repo, docs, instructions), variable content last
2. Use explicit section markers (file path tags) — models attend to structure
3. Ask the model to cite which file/line its answer came from — keeps it grounded (patterns 1–3 are sketched after this list)
4. Chunk very large repos by concern, not by file order
5. Always measure: run the same query with and without long context and compare
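A minimal sketch of patterns 1 through 3 combined in a small prompt builder. The function name, the <file path=...> tags, and the citation wording are illustrative choices, not a required convention:

def build_prompt(files: dict[str, str], question: str) -> tuple[str, str]:
    """Return (system, user) text: stable context first, variable question last."""
    # Pattern 2: wrap each file in an explicit path marker so the model can navigate.
    context = "\n\n".join(
        f'<file path="{path}">\n{source}\n</file>'
        for path, source in sorted(files.items())
    )
    # Pattern 3: demand citations so answers stay grounded in the provided files.
    instructions = (
        "You are reviewing the codebase below. Answer only from these files, "
        "and cite the file path (and line numbers where possible) for every claim."
    )
    # Pattern 1: stable content (instructions + repo) up front, the question last.
    return instructions + "\n\n" + context, question

system_text, user_text = build_prompt(
    {"utils/parser.ts": open("utils/parser.ts").read()},
    "Which functions in utils/parser.ts are never called?",
)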
Needle-in-haystack: the eval to know
The needle-in-haystack eval plants a specific fact in a long context and asks the model to retrieve it. Frontier models score near-perfectly on simple versions, but real-world performance on complex questions across long context is noticeably worse. Test your actual workflow before trusting reported benchmarks.
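Running the probe on your own stack takes only a few lines. A minimal sketch, reusing the lesson's example model name and haystack file; the needle text and the three depths are arbitrary choices:

import anthropic

client = anthropic.Anthropic()

NEEDLE = "The database connection pool size is set to 37 in config/db.yaml."
QUESTION = "What is the database connection pool size, and where is it set?"

filler = open("whole_repo.txt").read()   # the long "haystack" context

for depth in (0.1, 0.5, 0.9):            # plant the needle early, middle, late
    cut = int(len(filler) * depth)
    haystack = filler[:cut] + "\n" + NEEDLE + "\n" + filler[cut:]
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=256,
        system=[{"type": "text", "text": haystack}],
        messages=[{"role": "user", "content": QUESTION}],
    )
    answer = response.content[0].text
    print(f"depth={depth:.0%}  found={'37' in answer}")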
“Long context is a better memory, not a better brain.”
The big idea: 1M-token context opens whole-repo reasoning, but only if you pair it with caching, structure, and citations. Long context is a superpower priced by the token — spend it deliberately.