Context Windows & RAG
Why 1M tokens matters and what retrieval does.
The context window is how much an LLM can “hold in mind” at once — measured in tokens. In 2022 it was ~4K. In 2026, frontier models handle 1M–2M tokens. That completely changes what’s possible.
Context vs. memory — they’re not the same
An LLM has no persistent memory. Every request is stateless. What feels like memory in a chat is the entire conversation being re-sent with each message. When you hit the context limit, the oldest messages start getting dropped.
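In code, that “memory” is just a list of messages that grows each turn and gets trimmed when it no longer fits. Here’s a minimal sketch using the OpenAI Python SDK; the model name, the token budget, and the 4-characters-per-token estimate are all illustrative:

```python
# "Memory" in a chat app: the full history is re-sent on every turn.
from openai import OpenAI

client = OpenAI()
MAX_TOKENS = 128_000  # the model's context window (illustrative)
history = [{"role": "system", "content": "You are a helpful tutor."}]

def estimate_tokens(messages):
    # Rough heuristic: ~4 characters per token for English text.
    return sum(len(m["content"]) for m in messages) // 4

def chat(user_message):
    history.append({"role": "user", "content": user_message})
    # Statelessness in action: once the history outgrows the context
    # window, the oldest non-system messages are simply dropped.
    while estimate_tokens(history) > MAX_TOKENS and len(history) > 2:
        history.pop(1)  # keep the system prompt, drop the oldest turn
    reply = client.chat.completions.create(
        model="gpt-4o-mini", messages=history
    )
    content = reply.choices[0].message.content
    history.append({"role": "assistant", "content": content})
    return content
```

The model never remembers anything between calls — everything it “knows” about the conversation is inside that `history` list.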
What 1M tokens actually buys you
- ~750 pages of text, or ~2500 pages of code.
- An entire textbook as a prompt.
- A company’s full documentation site.
- Hours of transcribed audio.
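To check what actually fits, count tokens rather than pages. A sketch using the tiktoken library — `cl100k_base` is one common encoding, and real limits depend on the specific model’s tokenizer, so treat the number as an estimate:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

with open("textbook.txt") as f:  # hypothetical file
    n_tokens = len(enc.encode(f.read()))

print(f"{n_tokens:,} tokens")
print("Fits in a 1M context:", n_tokens <= 1_000_000)
```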
RAG — retrieval-augmented generation
Even 1M tokens isn’t enough for, say, a whole code repo or all of Wikipedia. RAG fixes this: instead of stuffing everything into the context, you:
- Chunk your documents into small pieces.
- Compute an embedding (a vector) for each chunk.
- Store all embeddings in a vector database.
- When the user asks something, compute the embedding of their question, find the top-k most similar chunks, and include only those in the prompt (see the sketch after this list).
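Here is that pipeline end to end. The library choice (sentence-transformers), the embedding model, and the naive fixed-size chunker are all assumptions — the lesson doesn’t prescribe a stack — and a production system would swap the in-memory numpy array for a real vector database:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Chunk documents into small pieces (naive fixed-size split).
def chunk(text, size=500):
    return [text[i:i + size] for i in range(0, len(text), size)]

docs = ["...your documentation here..."]
chunks = [c for d in docs for c in chunk(d)]

# 2-3. Compute an embedding per chunk and store them all
#      (here: a plain numpy array standing in for a vector DB).
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

# 4. Embed the question and find the top-k most similar chunks.
#    Cosine similarity reduces to a dot product on normalized vectors.
def retrieve(question, k=3):
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

context = "\n\n".join(retrieve("How do I rotate my API key?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
```

The payoff: the prompt stays small no matter how big the document collection grows, because only the top-k relevant chunks ride along with each question.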
Context caching
Sending the same long context with every request is expensive. Anthropic, OpenAI, and Google all offer prompt caching — after you send a long prompt once, requests that reuse it pay a fraction of the normal input price (as little as ~10% on some providers; exact discounts vary). For any app that reuses a system prompt or loaded document, this is the single biggest cost optimization.
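With the Anthropic Python SDK, for example, you opt in by tagging the reusable prefix with `cache_control`. A minimal sketch — the model name and file are illustrative, and pricing details belong to your provider’s docs:

```python
import anthropic

client = anthropic.Anthropic()
long_document = open("docs.md").read()  # the big, reused context

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": f"Answer questions about this document:\n{long_document}",
            # Everything up to and including this block gets cached,
            # so follow-up questions reuse it at the discounted rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize section 3."}],
)
print(response.content[0].text)
```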
Limits of long context
“The model has 1M token context” doesn’t mean it uses all of it equally well. Information in the middle of a long context is often missed — a phenomenon called “lost in the middle.” Put critical instructions at the start or end.
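A cheap mitigation is purely structural: sandwich the long material between the instructions, so nothing critical sits in the weak middle zone. A sketch (all names illustrative):

```python
instructions = "Extract every date mentioned, as an ISO-8601 list."
long_document = open("report.txt").read()

prompt = (
    f"{instructions}\n\n"
    f"--- DOCUMENT ---\n{long_document}\n--- END DOCUMENT ---\n\n"
    f"Reminder: {instructions}"
)
```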
