Context Windows & RAG
Why 1M tokens matters and what retrieval does.
The context window is how much an LLM can “hold in mind” at once — measured in tokens. In 2022 it was ~4K. In 2026, frontier models handle 1M–2M tokens. That completely changes what’s possible.
Context vs. memory — they’re not the same
An LLM has no persistent memory. Every request is stateless. What feels like memory in a chat is the entire conversation being re-sent with each message. When you hit the context limit, the oldest messages start getting dropped.
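In code, that “memory” is just a list of messages that grows each turn and gets trimmed when it no longer fits. Here’s a minimal sketch using the OpenAI Python SDK; the model name, the token budget, and the 4-characters-per-token estimate are all illustrative:

```python
# "Memory" in a chat app: the full history is re-sent on every turn.
from openai import OpenAI

client = OpenAI()
MAX_TOKENS = 128_000  # the model's context window (illustrative)
history = [{"role": "system", "content": "You are a helpful tutor."}]

def estimate_tokens(messages):
    # Rough heuristic: ~4 characters per token for English text.
    return sum(len(m["content"]) for m in messages) // 4

def chat(user_message):
    history.append({"role": "user", "content": user_message})
    # Statelessness in action: once the history outgrows the context
    # window, the oldest non-system messages are simply dropped.
    while estimate_tokens(history) > MAX_TOKENS and len(history) > 2:
        history.pop(1)  # keep the system prompt, drop the oldest turn
    reply = client.chat.completions.create(
        model="gpt-4o-mini", messages=history
    )
    content = reply.choices[0].message.content
    history.append({"role": "assistant", "content": content})
    return content
```

The model never remembers anything between calls — everything it “knows” about the conversation is inside that `history` list.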
What 1M tokens actually buys you
- ~750 pages of text, or ~2500 pages of code.
- An entire textbook as a prompt.
- A company’s full documentation site.
- Hours of transcribed audio.
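To check what actually fits, count tokens rather than pages. A sketch using the tiktoken library — `cl100k_base` is one common encoding, and real limits depend on the specific model’s tokenizer, so treat the number as an estimate:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

with open("textbook.txt") as f:  # hypothetical file
    n_tokens = len(enc.encode(f.read()))

print(f"{n_tokens:,} tokens")
print("Fits in a 1M context:", n_tokens <= 1_000_000)
```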
RAG — retrieval-augmented generation
Even 1M tokens isn’t enough for, say, a whole code repo or all of Wikipedia. RAG fixes this: instead of stuffing everything into the context, you:
- Chunk your documents into small pieces.
- Compute an embedding (a vector) for each chunk.
- Store all embeddings in a vector database.
- When the user asks something, compute the embedding of their question, find the top-k most similar chunks, and include only those in the prompt (see the sketch after this list).
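Here is that pipeline end to end. The library choice (sentence-transformers), the embedding model, and the naive fixed-size chunker are all assumptions — the lesson doesn’t prescribe a stack — and a production system would swap the in-memory numpy array for a real vector database:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Chunk documents into small pieces (naive fixed-size split).
def chunk(text, size=500):
    return [text[i:i + size] for i in range(0, len(text), size)]

docs = ["...your documentation here..."]
chunks = [c for d in docs for c in chunk(d)]

# 2-3. Compute an embedding per chunk and store them all
#      (here: a plain numpy array standing in for a vector DB).
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

# 4. Embed the question and find the top-k most similar chunks.
#    Cosine similarity reduces to a dot product on normalized vectors.
def retrieve(question, k=3):
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

context = "\n\n".join(retrieve("How do I rotate my API key?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
```

The payoff: the prompt stays small no matter how big the document collection grows, because only the top-k relevant chunks ride along with each question.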
Context caching
Sending the same long context with every request is expensive. Anthropic, OpenAI, and Google all offer prompt caching — after you send a long prompt once, requests that reuse it pay a fraction of the normal input price (as little as ~10% on some providers; exact discounts vary). For any app that reuses a system prompt or loaded document, this is the single biggest cost optimization.
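With the Anthropic Python SDK, for example, you opt in by tagging the reusable prefix with `cache_control`. A minimal sketch — the model name and file are illustrative, and pricing details belong to your provider’s docs:

```python
import anthropic

client = anthropic.Anthropic()
long_document = open("docs.md").read()  # the big, reused context

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": f"Answer questions about this document:\n{long_document}",
            # Everything up to and including this block gets cached,
            # so follow-up questions reuse it at the discounted rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize section 3."}],
)
print(response.content[0].text)
```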
Limits of long context
“The model has 1M token context” doesn’t mean it uses all of it equally well. Information in the middle of a long context is often missed — a phenomenon called “lost in the middle.” Put critical instructions at the start or end.
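A cheap mitigation is purely structural: sandwich the long material between the instructions, so nothing critical sits in the weak middle zone. A sketch (all names illustrative):

```python
instructions = "Extract every date mentioned, as an ISO-8601 list."
long_document = open("report.txt").read()

prompt = (
    f"{instructions}\n\n"
    f"--- DOCUMENT ---\n{long_document}\n--- END DOCUMENT ---\n\n"
    f"Reminder: {instructions}"
)
```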
