Chunk, embed, store, retrieve, generate. Build retrieval-augmented generation in a single file.
RAG is: chunk documents, embed chunks, store vectors, retrieve top-k for a query, generate an answer grounded in retrieved chunks. Everything else is variation.
from openai import OpenAI
import numpy as np

client = OpenAI()

def chunk(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    # Sliding word windows; assumes overlap < size so the loop always advances.
    words = text.split()
    chunks: list[str] = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i:i+size]))
        i += size - overlap
    return chunks

def embed(texts: list[str]) -> np.ndarray:
    # One batched API call; one row per input text.
    r = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in r.data], dtype=np.float32)

def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Cosine similarity of every row of `a` against the single vector `b`.
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b)
    return a_n @ b_n

Chunking with overlap, batched embeddings, cosine similarity. The full math fits in 20 lines.

DOC = open("handbook.txt", encoding="utf-8").read()
CHUNKS = chunk(DOC)
MATRIX = embed(CHUNKS)
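To see the overlap arithmetic concretely, here is the same sliding-window chunker run on a tiny made-up "document" of numbered words, with small values for size and overlap (no API calls needed; the chunker is repeated so the demo runs standalone):

```python
def chunk(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    # Same sliding-window chunker as above.
    words = text.split()
    chunks: list[str] = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i:i+size]))
        i += size - overlap
    return chunks

doc = " ".join(str(n) for n in range(12))  # "0 1 2 ... 11", 12 words
pieces = chunk(doc, size=5, overlap=2)
# Windows start every size - overlap = 3 words, so consecutive chunks
# share 2 words: ["0 1 2 3 4", "3 4 5 6 7", "6 7 8 9 10", "9 10 11"]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some words twice.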
def answer(question: str, k: int = 4) -> str:
    q_vec = embed([question])[0]
    scores = cosine(MATRIX, q_vec)
    top = np.argsort(-scores)[:k]  # indices of the k highest-scoring chunks
    context = "\n\n---\n\n".join(CHUNKS[i] for i in top)
    r = client.responses.create(
        model="gpt-5",
        input=[
            {"role": "system", "content": "Answer only from the provided context. If unsure, say you don't know."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return r.output_text

print(answer("What is the PTO policy?"))

Embed once at startup, search in-memory, ground the prompt in retrieved chunks. Good enough for 10k-chunk corpora.

Understanding "RAG From Scratch" in practice: AI-assisted coding shifts work from syntax recall to design thinking, since models handle the boilerplate while you focus on architecture. Knowing how to apply this five-step pipeline is a concrete advantage.
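The retrieval step can be checked in isolation with hand-made 3-dimensional vectors standing in for real embeddings (toy numbers chosen for illustration, not actual model output):

```python
import numpy as np

# Four fake "chunk" vectors and one fake query vector.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
], dtype=np.float32)
query = np.array([1.0, 0.2, 0.0], dtype=np.float32)

# Same cosine-then-argsort logic used in answer() above.
d_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
q_n = query / np.linalg.norm(query)
scores = d_n @ q_n
top2 = np.argsort(-scores)[:2]
# top2 == [0, 2]: the query points mostly along axis 0, so row 0 wins,
# with the diagonal row 2 second.
```

Note that normalization makes the score depend only on direction, not magnitude, which is why cosine similarity is the usual choice for comparing embeddings.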
The big idea: RAG is a five-step pipeline, not magic. Own every step once, then upgrade with a real vector DB when you outgrow numpy.
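Before reaching for a vector database, there is one cheap NumPy upgrade worth knowing: normalize the corpus matrix once at startup, so each query costs a single matmul instead of re-normalizing every row. A minimal sketch, with random vectors standing in for real embeddings and `top_k` as a hypothetical helper name:

```python
import numpy as np

rng = np.random.default_rng(0)
MATRIX = rng.normal(size=(1000, 64)).astype(np.float32)  # stand-in for embed(CHUNKS)

# Normalize every row once, instead of on each query inside cosine().
MATRIX_N = MATRIX / np.linalg.norm(MATRIX, axis=1, keepdims=True)

def top_k(q_vec: np.ndarray, k: int = 4) -> np.ndarray:
    q_n = q_vec / np.linalg.norm(q_vec)
    return np.argsort(-(MATRIX_N @ q_n))[:k]  # one matmul per query

hits = top_k(MATRIX[42])
# A vector has cosine similarity 1.0 with itself, so index 42 ranks first.
```

When even this is too slow, or you need metadata filters, persistence, or incremental updates, that is the signal to move to a dedicated vector store.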
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-progx-rag-from-scratch-creators
What are the five core steps that make up a retrieval-augmented generation system?
A developer building a RAG system wants to maximize recall—finding all potentially relevant information. Which chunking approach would best support this goal?
Why is the phrase 'answer only from the provided context' critical to include in RAG system prompts?
A developer has been using a simple Python list to store document embeddings for a RAG project. What typically signals that it's time to upgrade to a dedicated vector database?
In the retrieval step of RAG, what does the 'k' in 'top-k retrieval' represent?
What does it mean for a RAG system's output to be 'grounded' in the retrieved context?
A student notices their RAG system sometimes fails to find relevant information even though it exists in the documents. What is the most likely cause if the embedding model and retrieval logic are working correctly?
When embedding a 2,000-word article for RAG, what happens if you chunk it into only two very large pieces?
What is the primary advantage of using dense vector embeddings over simple keyword matching for retrieval?
A developer implements a RAG system but finds the language model frequently makes up facts not present in any retrieved document. What should they verify in their implementation?
Why do the lesson notes emphasize splitting documents first by headings or paragraphs before breaking them into smaller word-based chunks?
A RAG system retrieves five document chunks for a user query, but the language model ignores four of them and answers based only on one. What might explain this behavior?
In the context of RAG, what is the main benefit of using a dedicated vector database over storing embeddings in a simple NumPy array?
If a user asks a RAG system about a concept not covered in any indexed documents, what is the most likely system behavior?
The lesson describes RAG as a five-step pipeline rather than 'magic.' What is the practical implication of this framing for someone building RAG systems?