Frontier models can now read a million tokens of your codebase in one shot. That changes how we architect prompts and retrieval, and it reshapes the cost curve of agentic work.
Claude, Gemini, and GPT now offer 1M+ token context windows for coding workloads. That's roughly 750,000 words, or most mid-sized repos. When the whole codebase fits in one prompt, architectural questions, cross-file refactors, and full-repo audits become viable in a single shot.
Just because you can paste a million tokens doesn't mean you should. Cost scales linearly with input tokens, and attention quality degrades unevenly across the window. The middle of a long context tends to be the worst-attended region — a phenomenon called the lost-in-the-middle effect.
| Strategy | Cost per call | Attention quality |
|---|---|---|
| Paste full repo every call | Very high | Variable — worst in middle |
| Paste only relevant files | Moderate | Good — surgical |
| RAG (retrieve then prompt) | Low | Excellent if retrieval is good |
| Prompt caching with long context | High first call, low after | Full context with amortized cost |
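To make the table concrete, it helps to put rough numbers on a session. The sketch below compares the four strategies; the per-token price, token counts, and call count are illustrative assumptions, not quoted rates.

```python
# Back-of-the-envelope cost comparison. Prices are ILLUSTRATIVE
# assumptions (USD per million input tokens), not quoted rates.
PRICE_PER_MTOK = 3.00        # assumed base input price
CACHE_READ_FRACTION = 0.10   # cached reads at ~10% of base (typical)

REPO_TOKENS = 800_000        # whole repo pasted as context
RELEVANT_TOKENS = 20_000     # hand-picked relevant files only
RAG_TOKENS = 4_000           # a few retrieved chunks
CALLS = 50                   # questions asked in one session

def cost(tokens_per_call: int, calls: int, cached: bool = False) -> float:
    """Total input cost in USD for a session of `calls` requests."""
    if not cached:
        return tokens_per_call * calls * PRICE_PER_MTOK / 1_000_000
    # Caching: first call at full price, the rest at the cached read rate.
    first = tokens_per_call * PRICE_PER_MTOK / 1_000_000
    rest = (tokens_per_call * (calls - 1) * CACHE_READ_FRACTION
            * PRICE_PER_MTOK / 1_000_000)
    return first + rest

print(f"Full repo, no cache: ${cost(REPO_TOKENS, CALLS):8.2f}")
print(f"Relevant files only: ${cost(RELEVANT_TOKENS, CALLS):8.2f}")
print(f"RAG retrieval:       ${cost(RAG_TOKENS, CALLS):8.2f}")
print(f"Full repo, cached:   ${cost(REPO_TOKENS, CALLS, cached=True):8.2f}")
```

Under these assumptions, caching turns a $120 session into roughly $14, while surgical and RAG prompts stay cheapest outright, matching the table's ordering.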
Both Anthropic and OpenAI support prompt caching. Once a long context is cached, subsequent calls reuse it at a fraction of the input cost, typically around 10 percent. This makes long-context workflows economically sane. Structure prompts so the stable parts (codebase, docs, instructions) come first — they cache. The variable part (the question) comes last.
```python
# Anthropic prompt caching example
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # substitute a current long-context model
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": "You are a senior engineer reviewing a codebase.",
        },
        {
            "type": "text",
            "text": open("whole_repo.txt").read(),  # ~800k tokens
            "cache_control": {"type": "ephemeral"},  # CACHE THIS
        },
    ],
    messages=[{
        "role": "user",
        "content": "What's the purpose of utils/parser.ts?",
    }],
)
# First call: full price for the 800k tokens.
# Every subsequent call within the 5-minute cache window: ~10% of that cost.
```

Cache the stable parts of a long-context prompt. Pay full price once, then reuse cheaply for the duration of your session.

The needle-in-haystack eval plants a specific fact in a long context and asks the model to retrieve it. Frontier models score near-perfectly on simple versions, but real-world performance on complex questions across long context is noticeably worse. Test your actual workflow before trusting reported benchmarks.
> Long context is a better memory, not a better brain.
> — An AI systems researcher
The big idea: 1M-token context opens whole-repo reasoning, but only if you pair it with caching, structure, and citations. Long context is a superpower priced by the token — spend it deliberately.
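The citations leg of that advice is easy to operationalize: tell the model that every claim must carry a file path and line number, then spot-check a few. A minimal sketch, assuming the `client` and the cached system blocks from the earlier example are in scope (here named `cached_system`); the prompt wording is illustrative, not a prescribed pattern.

```python
# Demand grounded answers: a file path and line number for every claim.
# Assumes `client` and the cached system blocks (`cached_system`) from
# the earlier example are in scope; the prompt wording is illustrative.
question = (
    "Where is the HTTP retry logic implemented? For every claim, cite "
    "the file and line it came from, e.g. (src/http/client.ts:142). "
    "If you cannot find support in the provided code, say so."
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # substitute a current long-context model
    max_tokens=1024,
    system=cached_system,  # same stable blocks as before -> cache hit
    messages=[{"role": "user", "content": question}],
)
print(response.content[0].text)
# Spot-check one or two cited locations before acting on the answer.
```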
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-coding-long-context-1m-token-creators
1. What capability becomes newly viable when a model's context window can hold roughly 750,000 words (approximately one million tokens)?
2. What is the 'lost-in-the-middle' effect in long-context models?
3. Which strategy for using long-context models has the LOWEST cost per API call?
4. Why is prompt caching described as a 'game-changer' for long-context workflows?
5. In a prompt caching workflow, where should the variable part of your query (the specific question) be placed?
6. What is the needle-in-haystack evaluation?
7. Which design pattern helps models better attend to structure in long prompts?
8. What is the main drawback of pasting the full repository into every API call?
9. For very large monorepos exceeding one million tokens, what approach typically beats pure long context on both cost and quality?
10. Why should you ask an AI coding assistant to cite which file and line number its answer came from?
11. What is the primary limitation of long-context models besides the direct monetary cost of tokens?
12. Why might answers from long-context prompts be less reliable than answers from surgical (narrow) prompts?
13. Before trusting reported benchmarks about how well a model handles your specific long-context workflow, what should you do?
14. In what situation is RAG (retrieve then prompt) preferred over pure long context, even though both could technically work?
15. When splitting a very large repository into chunks for processing, what is the recommended chunking strategy?