
Interactive Explainer

RAG, From Scratch

Ask a chatbot who won yesterday's election and it confidently makes something up. The fix? Give it the right documents before it answers. That's RAG—retrieval-augmented generation—and you can build it from two ideas you already know.

Drag points, switch domains, and watch token generation unfold. Every step of the pipeline is interactive.

~12 min · Deep Learning · Retrieval · Language Models
Part I

The problem: frozen knowledge

A language model's knowledge is baked into its weights during training. Once training ends, those weights are frozen. This creates three problems that show up constantly in practice:

1. Stale knowledge. Ask about something that happened after the training cutoff and the model has no idea. It will either refuse or, worse, guess confidently.
2. Missing domain facts. Your company's internal docs, a niche research paper, a new city's bus schedule—none of it was in the training data.
3. Hallucination. When the model doesn't know something, it often invents a plausible-sounding but wrong answer instead of saying "I don't know."
The core idea. Instead of retraining the model every time knowledge changes, give it an open-book exam. Fetch the relevant documents at query time and paste them into the prompt. The model reads them and answers from the text—not from memory. That is RAG.
Part II

Two building blocks

RAG combines exactly two ideas. If you've taken an intro ML course, you already know both. Let's make sure we agree on them before wiring them together.

Building Block 1

Next-token prediction (LM)

A language model takes a sequence of tokens and predicts what comes next. GPT, LLaMA, Claude—they all work this way, one token at a time. Everything the model "knows" lives in its weights. This is called parametric memory—fixed once training ends.

Building Block 2

k-Nearest Neighbours (kNN)

Represent items as vectors (points in space). Given a query vector, find the k closest points. No learning at query time—pure geometry. If two documents are about similar topics, their vectors will be close together.
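That's the whole algorithm. A minimal sketch in plain Python, with toy 2D points standing in for document embeddings:

```python
from math import dist

def knn(query, points, k=2):
    """Return the indices of the k points closest to query (Euclidean)."""
    order = sorted(range(len(points)), key=lambda i: dist(query, points[i]))
    return order[:k]

# toy 2D "document embeddings"
docs = [(0.90, 0.80), (0.20, 0.10), (0.86, 0.75), (0.10, 0.90)]
print(knn((0.88, 0.77), docs))  # → [2, 0]: the two nearest documents
```

Note there is nothing to train here: given the vectors, retrieval is a sort over distances.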

Pause and think. The LM generates fluent text but can hallucinate facts. kNN finds relevant documents but can't write sentences. RAG chains them: kNN retrieves the facts, and the LM turns them into an answer. Neither can do the job alone.
Part III

The RAG pipeline

Five steps. We'll walk through each one interactively.

Query (q) → Embed (E) → Retrieve (kNN) → Prompt ([ ]) → Generate (LM)

There are two stages. Offline (done once): embed all your documents and store them in a searchable index. Online (at query time): push the user's question through the full pipeline. Let's build both, step by step.

First, pick a domain. This determines the corpus (the documents the model can search over):

Part IV: Step 1

Embed the documents

Here are the five documents in our mini-corpus. In a real system, you'd have thousands or millions—but five is enough to see the mechanism.

An encoder model (like Sentence-BERT) reads each document and outputs a vector—a list of numbers that captures the document's meaning. Documents about similar topics produce similar vectors. In real systems these are 768-dimensional or more, but to keep things visual we'll use 2D embeddings.

Document embeddings in 2D

Each blue dot below is a document, plotted at its 2D embedding coordinates. Notice how topically related documents cluster together.

docs  = split(corpus)       # chunk text into passages
emb   = encoder(docs)       # embed each chunk → vector
index = FAISS(emb)          # build a searchable index
What makes a good embedding? "Delhi PM2.5 in winter" should land near "Kolkata winter smog" and far from "Gulab jamun is deep-fried." The encoder learns these distances during its own training on large text corpora.
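A real encoder is a trained neural network, but the geometry it produces can be mimicked with a deliberately crude stand-in. A toy bag-of-words "encoder" (the vocabulary and texts here are purely illustrative):

```python
from collections import Counter
from math import sqrt

VOCAB = ["delhi", "pm2.5", "winter", "smog", "gulab", "jamun", "deep-fried"]

def embed(text):
    """Toy encoder: one dimension per vocabulary word (real encoders are learned)."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

q  = embed("delhi pm2.5 winter")
d1 = embed("kolkata winter smog")
d2 = embed("gulab jamun deep-fried")
print(cosine(q, d1), cosine(q, d2))  # d1 is closer to the query than d2
```

Even this crude encoder puts the smog document nearer the air-quality query than the dessert document. A learned encoder does the same thing, but captures meaning rather than literal word overlap.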
Part V: Step 2

Embed the query

A user asks a question. We pass it through the same encoder to get a query vector. Because the encoder maps semantically similar text to nearby points, the query will land close to relevant documents.


Now both the documents and the query live in the same vector space. Below is the 2D embedding canvas. The orange dot is the query; the blue dots are documents. Drag the orange query point to explore—the two nearest documents will be highlighted with distance lines in real time.

Drag the query point to explore. Notice how the retrieval table, the prompt builder, and the attention heatmap below all update in real time!
Part VI: Step 3

Retrieve with kNN

Now we compute the Euclidean distance between the query vector and every document vector:
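With 2D embeddings, that distance between query $\mathbf{q}$ and document $\mathbf{d}_i$ is presumably the familiar:

```latex
d(\mathbf{q}, \mathbf{d}_i) = \sqrt{(q_1 - d_{i,1})^2 + (q_2 - d_{i,2})^2}
```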

Sort by distance. The top-k (here k=2) closest documents "win" and become our retrieved context. Here's the live distance table—drag the query point above and watch it update:

Document | Embedding | Distance | Retrieved?
Why top-k, not all? LMs have a finite context window (typically 4K–128K tokens). We can't paste every document, so we pick the k most relevant ones. Choosing k is a trade-off: too small and you miss context; too large and irrelevant documents drown the signal.
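The live table above boils down to one sort. A sketch with made-up 2D coordinates and document names:

```python
from math import dist

doc_points = {
    "Delhi winter PM2.5":   (0.90, 0.80),
    "Kolkata winter smog":  (0.85, 0.70),
    "Mumbai monsoon rains": (0.40, 0.60),
    "Bus schedule":         (0.20, 0.90),
    "Gulab jamun recipe":   (0.10, 0.15),
}
query, k = (0.88, 0.77), 2

# sort every document by distance to the query; the first k "win"
table = sorted((dist(query, p), name) for name, p in doc_points.items())
for rank, (d, name) in enumerate(table):
    print(f"{name:22s} {d:.3f} {'retrieved' if rank < k else ''}")
```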
Part VII: Step 4

Build the prompt

This is where retrieval meets generation. We take the retrieved documents and the user's question and concatenate them into a single text block—the prompt that the LM will read:

Context (retrieved docs)
Question (user query)
q_emb  = encoder(query)             # step 2: embed the query
docs   = index.search(q_emb, k=2)   # step 3: retrieve top-2
prompt = concat(docs, query)        # step 4: assemble prompt
output = LLM(prompt)                # step 5: generate answer

After prompt assembly, RAG's retrieval job is done. From here on it's standard language modelling—the LM reads the prompt and generates an answer, one token at a time.
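Prompt assembly really is just string formatting. A minimal sketch (the template wording is illustrative, not a fixed standard):

```python
def build_prompt(retrieved_docs, question):
    """Step 4: concatenate retrieved passages and the user question."""
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, 1))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

docs = ["Delhi's PM2.5 often reaches hazardous levels in winter.",
        "Kolkata sees heavy smog in December and January."]
print(build_prompt(docs, "How bad is Delhi's air in winter?"))
```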

Part VIII: Step 5

Generate the answer

The LM generates tokens one at a time, conditioned on the full prompt (context + question). Mathematically, it's still plain next-token prediction:
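Written out, with $c$ the retrieved context, $q$ the question, and $y$ the answer, the factorisation would read:

```latex
p(y \mid c, q) = \prod_{t=1}^{T} p(y_t \mid c, q, y_{<t})
```

The only change from ordinary generation is that $c$ now sits in the conditioning set.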

But the context tokens profoundly change which tokens are likely. Without context, the model might hallucinate. With context, it can ground its answer in real text. Click to watch generation unfold:

Tokens highlighted in green come directly from the retrieved context. The model isn't "copying" in a literal sense—it's just that the context makes those specific words overwhelmingly likely under next-token prediction.

How does the LM use the context?

Through self-attention. When generating each output token, the model attends to all tokens in the prompt—including the retrieved documents. It naturally focuses on the most relevant context tokens:

Darker cells = higher attention weight. When the model generates "Delhi," it attends heavily to "Delhi" in the context. When it generates "PM2.5," it attends to "PM2.5." The attention mechanism is what makes RAG work at the neural network level.
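Those heatmap cells are a softmax over match scores. A toy sketch (the scores are made up for illustration; real scores come from query-key dot products inside the network):

```python
from math import exp

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# made-up match scores for the output token "Delhi" against each context token
context = ["Air", "pollution", "in", "Delhi", "is", "high"]
scores  = [0.1, 0.8, 0.0, 3.0, 0.0, 1.2]
weights = softmax(scores)
print(context[weights.index(max(weights))])  # → Delhi
```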

Side by side: with vs. without RAG

Without RAG

The model relies solely on its weights. May hallucinate or hedge.

"Air pollution in Delhi is a significant concern, particularly during certain seasons..." Vague, hedging

With RAG

The context contains the fact. Attention surfaces it into the answer.

"Air pollution in Delhi during winter is very high, often reaching hazardous PM2.5 levels." Specific, grounded
Instruction tuning matters. A base LM might continue the prompt as though it were part of a document, rather than answering the question. An instruction-tuned model has learned "Context + Question → Answer." Same next-token prediction math, but the weights are aligned to produce helpful answers.
Part IX

Try different domains

The beauty of RAG is that the pipeline is domain-agnostic. Swap the corpus and the same five steps work on air quality, cooking, history, or medicine. Switch domains and watch every step adapt:

1. Corpus
2. Query
3. Retrieved (k = 2)
4. Assembled prompt
5. Generated answer

Part X

Why RAG works

The intuition is simple: instead of asking the model to recall a fact from memory, you hand it a sheet of notes. But we can make this precise.

Without RAG, the model generates purely from parametric memory:

With RAG, we marginalise over retrieved documents:

In practice, we approximate with the top-k documents:
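In the standard formulation (as in the original RAG paper), those three expressions read approximately as follows, with $x$ the query, $y$ the answer, $D$ the corpus, $p_\eta$ the retriever, and $p_\theta$ the generator:

```latex
% Without RAG: parametric generation only
p(y \mid x) = \prod_{t} p_\theta(y_t \mid x, y_{<t})

% With RAG: marginalise over documents d in the corpus D
p(y \mid x) = \sum_{d \in D} p_\eta(d \mid x)\, p_\theta(y \mid x, d)

% Top-k approximation: keep only the k most relevant documents
p(y \mid x) \approx \sum_{d \in \mathrm{top}\text{-}k(x)} p_\eta(d \mid x)\, p_\theta(y \mid x, d)
```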

Three concrete benefits

RAG vs kNN-LM: a subtle distinction

Method | Retrieval unit | How it's used
kNN-LM | Individual tokens | Interpolated into output logits at each step
RAG | Whole documents | Prepended to the input prompt
The one-line summary. RAG = kNN over knowledge + LM over text. Parametric memory (what the model learned) + non-parametric memory (what you retrieved just now).
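For contrast, kNN-LM's per-token interpolation (from the kNN-LM paper) has the form:

```latex
p(y_t \mid x) = \lambda\, p_{\mathrm{kNN}}(y_t \mid x) + (1 - \lambda)\, p_{\mathrm{LM}}(y_t \mid x)
```

RAG intervenes once, at the input; kNN-LM intervenes at every output step.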
Part XI

When RAG fails

RAG is powerful, but it's not magic. Understanding the failure modes will save you hours of debugging:

1. The retriever fetches wrong documents

If the embedding model doesn't understand your domain well, it may retrieve documents that look similar on the surface but are semantically wrong. Garbage context in → garbage answer out. The embedding model's quality is the ceiling for RAG.

2. Context dilution (k too large)

Stuffing too many documents into the prompt spreads attention thin. The one relevant document gets buried among nine irrelevant ones. Smaller, more precise k often beats larger k.

3. Redundant documents waste context

Five documents that all say the same thing use up precious context window without adding new information. Diversity-aware retrieval (like MMR—maximal marginal relevance) helps.
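A minimal sketch of the MMR selection rule, using word-overlap (Jaccard) similarity as a cheap stand-in for embedding similarity; `lam` trades relevance against diversity:

```python
def jaccard(a, b):
    """Word-overlap similarity; a crude stand-in for embedding similarity."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def mmr(query, candidates, k=2, lam=0.5):
    """Greedy MMR: each pick balances relevance to the query against
    similarity to documents already selected."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda d: lam * jaccard(query, d)
                   - (1 - lam) * max((jaccard(d, s) for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected

docs = ["delhi winter smog pm2.5",
        "delhi winter smog pollution",   # near-duplicate of the first
        "delhi bus schedule"]
print(mmr("delhi winter air", docs))
```

Plain top-2 retrieval would return the two near-duplicates; MMR penalises the second copy and picks the diverse document instead.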

4. No multi-hop reasoning

If the answer requires combining facts from document A and document B, the LM often fails. It tends to copy from one document rather than synthesise across several. This is an active area of research.

Citations are not proof. Even if the model writes "[Source 1]," it may cite the wrong document or misrepresent what it says. Citations in LM output are a learned output pattern, not a grounding mechanism. Always verify.
Part XII

What we built