Chunk, embed, store, retrieve, generate. Build retrieval-augmented generation in a single file.
RAG is: chunk documents, embed chunks, store vectors, retrieve top-k for a query, generate an answer grounded in retrieved chunks. Everything else is variation.
from openai import OpenAI
import numpy as np

client = OpenAI()

def chunk(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    # Sliding word windows; assumes overlap < size so the loop always advances.
    words = text.split()
    chunks: list[str] = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i:i+size]))
        i += size - overlap
    return chunks

def embed(texts: list[str]) -> np.ndarray:
    # One batched API call; one row per input text.
    r = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in r.data], dtype=np.float32)

def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Cosine similarity of every row of `a` against the single vector `b`.
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b)
    return a_n @ b_n

Chunking with overlap, batched embeddings, cosine similarity. The full math fits in 20 lines.

DOC = open("handbook.txt", encoding="utf-8").read()
CHUNKS = chunk(DOC)
MATRIX = embed(CHUNKS)
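To see the overlap arithmetic concretely, here is the same sliding-window chunker run on a tiny made-up "document" of numbered words, with small values for size and overlap (no API calls needed; the chunker is repeated so the demo runs standalone):

```python
def chunk(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    # Same sliding-window chunker as above.
    words = text.split()
    chunks: list[str] = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i:i+size]))
        i += size - overlap
    return chunks

doc = " ".join(str(n) for n in range(12))  # "0 1 2 ... 11", 12 words
pieces = chunk(doc, size=5, overlap=2)
# Windows start every size - overlap = 3 words, so consecutive chunks
# share 2 words: ["0 1 2 3 4", "3 4 5 6 7", "6 7 8 9 10", "9 10 11"]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some words twice.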
def answer(question: str, k: int = 4) -> str:
    q_vec = embed([question])[0]
    scores = cosine(MATRIX, q_vec)
    top = np.argsort(-scores)[:k]  # indices of the k highest-scoring chunks
    context = "\n\n---\n\n".join(CHUNKS[i] for i in top)
    r = client.responses.create(
        model="gpt-5",
        input=[
            {"role": "system", "content": "Answer only from the provided context. If unsure, say you don't know."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return r.output_text

print(answer("What is the PTO policy?"))

Embed once at startup, search in-memory, ground the prompt in retrieved chunks. Good enough for 10k-chunk corpora.

Understanding "RAG From Scratch" in practice: AI-assisted coding shifts work from syntax recall to design thinking, since models handle the boilerplate while you focus on architecture. Knowing how to apply this five-step pipeline is a concrete advantage.
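The retrieval step can be checked in isolation with hand-made 3-dimensional vectors standing in for real embeddings (toy numbers chosen for illustration, not actual model output):

```python
import numpy as np

# Four fake "chunk" vectors and one fake query vector.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
], dtype=np.float32)
query = np.array([1.0, 0.2, 0.0], dtype=np.float32)

# Same cosine-then-argsort logic used in answer() above.
d_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
q_n = query / np.linalg.norm(query)
scores = d_n @ q_n
top2 = np.argsort(-scores)[:2]
# top2 == [0, 2]: the query points mostly along axis 0, so row 0 wins,
# with the diagonal row 2 second.
```

Note that normalization makes the score depend only on direction, not magnitude, which is why cosine similarity is the usual choice for comparing embeddings.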
The big idea: RAG is a five-step pipeline, not magic. Own every step once, then upgrade with a real vector DB when you outgrow numpy.
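Before reaching for a vector database, there is one cheap NumPy upgrade worth knowing: normalize the corpus matrix once at startup, so each query costs a single matmul instead of re-normalizing every row. A minimal sketch, with random vectors standing in for real embeddings and `top_k` as a hypothetical helper name:

```python
import numpy as np

rng = np.random.default_rng(0)
MATRIX = rng.normal(size=(1000, 64)).astype(np.float32)  # stand-in for embed(CHUNKS)

# Normalize every row once, instead of on each query inside cosine().
MATRIX_N = MATRIX / np.linalg.norm(MATRIX, axis=1, keepdims=True)

def top_k(q_vec: np.ndarray, k: int = 4) -> np.ndarray:
    q_n = q_vec / np.linalg.norm(q_vec)
    return np.argsort(-(MATRIX_N @ q_n))[:k]  # one matmul per query

hits = top_k(MATRIX[42])
# A vector has cosine similarity 1.0 with itself, so index 42 ranks first.
```

When even this is too slow, or you need metadata filters, persistence, or incremental updates, that is the signal to move to a dedicated vector store.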
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-progx-rag-from-scratch-creators
What are the five core steps that make up a retrieval-augmented generation system?
A developer building a RAG system wants to maximize recall—finding all potentially relevant information. Which chunking approach would best support this goal?
Why is the phrase 'answer only from the provided context' critical to include in RAG system prompts?
A developer has been using a simple Python list to store document embeddings for a RAG project. What typically signals that it's time to upgrade to a dedicated vector database?
In the retrieval step of RAG, what does the 'k' in 'top-k retrieval' represent?
What does it mean for a RAG system's output to be 'grounded' in the retrieved context?
A student notices their RAG system sometimes fails to find relevant information even though it exists in the documents. What is the most likely cause if the embedding model and retrieval logic are working correctly?
When embedding a 2,000-word article for RAG, what happens if you chunk it into only two very large pieces?
What is the primary advantage of using dense vector embeddings over simple keyword matching for retrieval?
A developer implements a RAG system but finds the language model frequently makes up facts not present in any retrieved document. What should they verify in their implementation?
Why do the lesson notes emphasize splitting documents first by headings or paragraphs before breaking them into smaller word-based chunks?
A RAG system retrieves five document chunks for a user query, but the language model ignores four of them and answers based only on one. What might explain this behavior?
In the context of RAG, what is the main benefit of using a dedicated vector database over storing embeddings in a simple NumPy array?
If a user asks a RAG system about a concept not covered in any indexed documents, what is the most likely system behavior?
The lesson describes RAG as a five-step pipeline rather than 'magic.' What is the practical implication of this framing for someone building RAG systems?