Interactive Explainer
RAG, From Scratch
Ask a chatbot who won yesterday's election and it confidently makes something up. The fix? Give it the right documents before it answers. That's RAG—retrieval-augmented generation—and you can build it from two ideas you already know.
Drag points, switch domains, and watch token generation unfold. Every step of the pipeline is interactive.
The problem: frozen knowledge
A language model's knowledge is baked into its weights during training. Once training ends, those weights are frozen. This creates three problems that show up constantly in practice: the model hallucinates facts it never learned, its knowledge goes stale the moment the world changes, and updating it means expensive retraining.
Two building blocks
RAG combines exactly two ideas. If you've taken an intro ML course, you already know both. Let's make sure we agree on them before wiring them together.
Building Block 1
Next-token prediction (LM)
A language model takes a sequence of tokens and predicts what comes next. GPT, LLaMA, Claude—they all work this way, one token at a time. Everything the model "knows" lives in its weights. This is called parametric memory—fixed once training ends.
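To make "next-token prediction" concrete, here is a toy language model: just a lookup table from a two-token context to a next-token distribution, decoded greedily. The table entries are made-up numbers for illustration, not from any real model.

```python
# A toy "language model": a lookup table of next-token probabilities.
# (Hypothetical numbers; a real LM computes these with a neural network.)
probs = {
    ("the", "air"): {"quality": 0.6, "is": 0.3, "force": 0.1},
}

def next_token(context):
    """Greedy decoding: condition on the last two tokens, pick the most likely next one."""
    dist = probs[tuple(context[-2:])]
    return max(dist, key=dist.get)

print(next_token(["the", "air"]))  # → quality
```

Everything that follows in this article keeps this mechanism unchanged; RAG only changes what goes into `context`.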
Building Block 2
k-Nearest Neighbours (kNN)
Represent items as vectors (points in space). Given a query vector, find the k closest points. No learning at query time—pure geometry. If two documents are about similar topics, their vectors will be close together.
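The "pure geometry" claim is easy to verify in code. A minimal kNN sketch with NumPy, using hand-picked 2D vectors (the coordinates are hypothetical, chosen so topically similar documents cluster):

```python
import numpy as np

# Toy corpus of 2D "embeddings" (hand-picked for illustration)
docs = np.array([
    [0.90, 0.80],  # doc 0: air quality
    [0.85, 0.75],  # doc 1: air quality (close to doc 0)
    [0.10, 0.20],  # doc 2: cooking
    [0.15, 0.10],  # doc 3: cooking
    [0.50, 0.90],  # doc 4: history
])

def knn(query, points, k=2):
    """Return indices of the k points nearest to `query` (Euclidean distance)."""
    dists = np.linalg.norm(points - query, axis=1)
    return np.argsort(dists)[:k]

query = np.array([0.88, 0.79])
print(knn(query, docs, k=2))  # the two air-quality docs win
```

No parameters, no training loop: sorting by distance is the entire retrieval algorithm.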
The RAG pipeline
Five steps. We'll walk through each one interactively.
There are two stages. Offline (done once): embed all your documents and store them in a searchable index. Online (at query time): push the user's question through the full pipeline. Let's build both, step by step.
First, pick a domain. This determines the corpus (the documents the model can search over):
Embed the documents
Here are the five documents in our mini-corpus. In a real system, you'd have thousands or millions—but five is enough to see the mechanism.
An encoder model (like Sentence-BERT) reads each document and outputs a vector—a list of numbers that captures the document's meaning. Documents about similar topics produce similar vectors. In real systems these are 768-dimensional or more, but to keep things visual we'll use 2D embeddings.
Document embeddings in 2D
Each blue dot below is a document, plotted at its 2D embedding coordinates. Notice how topically related documents cluster together.
docs = split(corpus) # chunk text into passages
emb = encoder(docs) # embed each chunk → vector
index = FAISS(emb) # build a searchable index

Embed the query
A user asks a question. We pass it through the same encoder to get a query vector. Because the encoder maps semantically similar text to nearby points, the query will land close to relevant documents.
Now both the documents and the query live in the same vector space. Below is the 2D embedding canvas. The orange dot is the query; the blue dots are documents. Drag the orange query point to explore—the two nearest documents will be highlighted with distance lines in real time.
Retrieve with kNN
Now we compute the Euclidean distance between the query vector and every document vector:
Sort by distance. The top-k (here k=2) closest documents "win" and become our retrieved context. Here's the live distance table—drag the query point above and watch it update:
| Document | Embedding | Distance | Retrieved? |
|---|---|---|---|
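The live table above can be reproduced offline in a few lines. The embeddings and the query position below are hypothetical stand-ins for whatever the interactive canvas currently shows:

```python
import numpy as np

# Hypothetical 2D embeddings for five documents and one query
doc_embs = {"d1": (0.20, 0.90), "d2": (0.80, 0.30), "d3": (0.25, 0.85),
            "d4": (0.70, 0.60), "d5": (0.10, 0.40)}
q = np.array([0.22, 0.88])
k = 2

# Euclidean distance from the query to every document, then sort ascending
dists = {name: float(np.linalg.norm(np.array(v) - q)) for name, v in doc_embs.items()}
ranked = sorted(dists, key=dists.get)

for name in ranked:
    flag = "yes" if name in ranked[:k] else "no"
    print(f"{name}  dist={dists[name]:.3f}  retrieved={flag}")
```

Dragging the query point in the demo is equivalent to changing `q` and re-running this loop.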
Build the prompt
This is where retrieval meets generation. We take the retrieved documents and the user's question and concatenate them into a single text block—the prompt that the LM will read:
q_emb = encoder(query) # step 2: embed the query
docs = index.search(q_emb, k=2) # step 3: retrieve top-2
prompt = concat(docs, query) # step 4: assemble prompt
output = LLM(prompt) # step 5: generate answer

After prompt assembly, RAG's retrieval job is done. From here on it's standard language modelling—the LM reads the prompt and generates an answer, one token at a time.
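The `concat(docs, query)` step deserves one concrete line of detail. A minimal prompt template, assuming a simple numbered-context format (real systems vary the wording considerably):

```python
def build_prompt(docs, question):
    """Concatenate retrieved passages and the user question into one prompt.
    A minimal template for illustration; production templates differ."""
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

retrieved = ["Delhi's PM2.5 levels peak in November.",
             "Crop burning in nearby states worsens Delhi's air."]
print(build_prompt(retrieved, "Why is Delhi's air worst in November?"))
```

Note that the retrieved text is plain input tokens: the LM has no special "retrieval channel," only a longer prompt.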
Generate the answer
The LM generates tokens one at a time, conditioned on the full prompt (context + question). Mathematically, it's still plain next-token prediction:

$$p_\theta(y_t \mid \text{context},\ \text{question},\ y_{<t})$$
But the context tokens profoundly change which tokens are likely. Without context, the model might hallucinate. With context, it can ground its answer in real text. Click to watch generation unfold:
Tokens highlighted in green come directly from the retrieved context. The model isn't "copying" in a literal sense—it's just that the context makes those specific words overwhelmingly likely under next-token prediction.
How does the LM use the context?
Through self-attention. When generating each output token, the model attends to all tokens in the prompt—including the retrieved documents. It naturally focuses on the most relevant context tokens:
Darker cells = higher attention weight. When the model generates "Delhi," it attends heavily to "Delhi" in the context. When it generates "PM2.5," it attends to "PM2.5." The attention mechanism is what makes RAG work at the neural network level.
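The "attends heavily to similar tokens" behaviour falls directly out of the dot-product in attention. A stripped-down sketch with toy key/query vectors (the numbers are hypothetical, and real attention also scales scores and projects through learned matrices):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Four context-token key vectors; the first two are similar to the query
keys = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
query = np.array([1.0, 0.0])  # what the token being generated "looks for"

weights = softmax(keys @ query)  # dot-product attention scores → weights
print(weights.round(2))          # mass concentrates on keys similar to the query
```

The weights sum to one, so attention is a soft selection over the prompt: exactly the mechanism that lets a retrieved fact dominate the output.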
Side by side: with vs. without RAG
Without RAG
The model relies solely on its weights. May hallucinate or hedge.
With RAG
The context contains the fact. Attention surfaces it into the answer.
Try different domains
The beauty of RAG is that the pipeline is domain-agnostic. Swap the corpus and the same five steps work on air quality, cooking, history, or medicine. Switch domains and watch every step adapt:
Corpus
Query
Retrieved (k = 2)
Assembled prompt
Generated answer
Why RAG works
The intuition is simple: instead of asking the model to recall a fact from memory, you hand it a sheet of notes. But we can make this precise.
Without RAG, the model generates purely from parametric memory:

$$p_\theta(y \mid x) = \prod_{t} p_\theta(y_t \mid x, y_{<t})$$

With RAG, we marginalise over retrieved documents $z$:

$$p(y \mid x) = \sum_{z} p_\eta(z \mid x)\, p_\theta(y \mid x, z)$$

In practice, we approximate the sum with just the top-k retrieved documents:

$$p(y \mid x) \approx \sum_{z \in \mathrm{top\text{-}k}(x)} p_\eta(z \mid x)\, p_\theta(y \mid x, z)$$
Three concrete benefits
- Reduces hallucination. The LM has actual text to reference, making it far less likely to invent facts.
- Always up-to-date. Need fresher knowledge? Update the document store. No retraining needed—just re-index.
- Cheaper than fine-tuning. The model's weights stay frozen. Only the index changes. You don't need GPUs to update knowledge.
RAG vs kNN-LM: a subtle distinction
| Method | Retrieval unit | How it's used |
|---|---|---|
| kNN-LM | Individual tokens | Interpolated into output logits at each step |
| RAG | Whole documents | Prepended to the input prompt |
When RAG fails
RAG is powerful, but it's not magic. Understanding the failure modes will save you hours of debugging:
1. The retriever fetches wrong documents
If the embedding model doesn't understand your domain well, it may retrieve documents that look similar on the surface but are semantically wrong. Garbage context in → garbage answer out. The embedding model's quality is the ceiling for RAG.
2. Context dilution (k too large)
Stuffing too many documents into the prompt spreads attention thin. The one relevant document gets buried among nine irrelevant ones. Smaller, more precise k often beats larger k.
3. Redundant documents waste context
Five documents that all say the same thing use up precious context window without adding new information. Diversity-aware retrieval (like MMR—maximal marginal relevance) helps.
4. No multi-hop reasoning
If the answer requires combining facts from document A and document B, the LM often fails. It tends to copy from one document rather than synthesise across several. This is an active area of research.
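The MMR re-ranking mentioned under failure mode 3 is small enough to sketch in full. This is a toy version assuming cosine similarity and hand-picked 2D vectors in which documents 0 and 1 are near-duplicates:

```python
import numpy as np

def mmr(query, docs, k=2, lam=0.5):
    """Maximal marginal relevance: trade off relevance to the query (weight lam)
    against redundancy with already-selected documents (weight 1 - lam)."""
    def sim(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    selected, remaining = [], list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i):
            rel = sim(query, docs[i])
            red = max((sim(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

docs = np.array([[1.0, 0.0], [0.99, 0.05], [0.7, 0.7]])  # 0 and 1 are near-duplicates
query = np.array([1.0, 0.1])
print(mmr(query, docs, k=2))  # picks one duplicate, then the diverse document
```

Plain top-k would return the two near-duplicates; MMR's redundancy penalty swaps the second one for the document that adds new information.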
What we built
- RAG = Retrieval + Generation. Find the nearest documents in embedding space, paste them into the prompt, and let the LM answer from the text.
- Five steps. Embed → Index → Retrieve → Build prompt → Generate.
- Better input, not better model. RAG doesn't change the model's weights—it changes what the model sees. The underlying math is still next-token prediction.
- Attention is the mechanism. The LM attends to context tokens via self-attention, naturally copying relevant facts into its output.
- Know the limits. Bad embeddings, diluted context, redundant docs, and multi-hop reasoning are where RAG breaks down. No tool is a silver bullet.