Interactive Explainer
RAG, From Scratch
Ask a chatbot who won yesterday's election and it confidently makes something up. The fix? Give it the right documents before it answers. That's RAG—retrieval-augmented generation—and you can build it from two ideas you already know.
Drag points, switch domains, and watch token generation unfold. Every step of the pipeline is interactive.
The problem: frozen knowledge
A language model's knowledge is baked into its weights during training. Once training ends, those weights are frozen. This creates three problems that show up constantly in practice: the model hallucinates facts it never learned, its knowledge goes stale the moment the world changes, and updating it means expensive retraining.
Two building blocks
RAG combines exactly two ideas. If you've taken an intro ML course, you already know both. Let's make sure we agree on them before wiring them together.
Building Block 1
Next-token prediction (LM)
A language model takes a sequence of tokens and predicts what comes next. GPT, LLaMA, Claude—they all work this way, one token at a time. Everything the model "knows" lives in its weights. This is called parametric memory—fixed once training ends.
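To make "next-token prediction" concrete, here is a toy language model: just a lookup table from a two-token context to a next-token distribution, decoded greedily. The table entries are made-up numbers for illustration, not from any real model.

```python
# A toy "language model": a lookup table of next-token probabilities.
# (Hypothetical numbers; a real LM computes these with a neural network.)
probs = {
    ("the", "air"): {"quality": 0.6, "is": 0.3, "force": 0.1},
}

def next_token(context):
    """Greedy decoding: condition on the last two tokens, pick the most likely next one."""
    dist = probs[tuple(context[-2:])]
    return max(dist, key=dist.get)

print(next_token(["the", "air"]))  # → quality
```

Everything that follows in this article keeps this mechanism unchanged; RAG only changes what goes into `context`.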
Building Block 2
k-Nearest Neighbours (kNN)
Represent items as vectors (points in space). Given a query vector, find the k closest points. No learning at query time—pure geometry. If two documents are about similar topics, their vectors will be close together.
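The "pure geometry" claim is easy to verify in code. A minimal kNN sketch with NumPy, using hand-picked 2D vectors (the coordinates are hypothetical, chosen so topically similar documents cluster):

```python
import numpy as np

# Toy corpus of 2D "embeddings" (hand-picked for illustration)
docs = np.array([
    [0.90, 0.80],  # doc 0: air quality
    [0.85, 0.75],  # doc 1: air quality (close to doc 0)
    [0.10, 0.20],  # doc 2: cooking
    [0.15, 0.10],  # doc 3: cooking
    [0.50, 0.90],  # doc 4: history
])

def knn(query, points, k=2):
    """Return indices of the k points nearest to `query` (Euclidean distance)."""
    dists = np.linalg.norm(points - query, axis=1)
    return np.argsort(dists)[:k]

query = np.array([0.88, 0.79])
print(knn(query, docs, k=2))  # the two air-quality docs win
```

No parameters, no training loop: sorting by distance is the entire retrieval algorithm.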
The RAG pipeline
Five steps. We'll walk through each one interactively.
There are two stages. Offline (done once): embed all your documents and store them in a searchable index. Online (at query time): push the user's question through the full pipeline. Let's build both, step by step.
First, pick a domain. This determines the corpus (the documents the model can search over):
Embed the documents
Here are the five documents in our mini-corpus. In a real system, you'd have thousands or millions—but five is enough to see the mechanism.
An encoder model (like Sentence-BERT) reads each document and outputs a vector—a list of numbers that captures the document's meaning. Documents about similar topics produce similar vectors. In real systems these are 768-dimensional or more, but to keep things visual we'll use 2D embeddings.
Document embeddings in 2D
Each blue dot below is a document, plotted at its 2D embedding coordinates. Notice how topically related documents cluster together.
docs = split(corpus) # chunk text into passages
emb = encoder(docs) # embed each chunk → vector
index = FAISS(emb) # build a searchable index

Embed the query
A user asks a question. We pass it through the same encoder to get a query vector. Because the encoder maps semantically similar text to nearby points, the query will land close to relevant documents.
Now both the documents and the query live in the same vector space. Below is the 2D embedding canvas. The orange dot is the query; the blue dots are documents. Drag the orange query point to explore—the two nearest documents will be highlighted with distance lines in real time.
Retrieve with kNN
Now we compute the Euclidean distance between the query vector and every document vector:
Sort by distance. The top-k (here k=2) closest documents "win" and become our retrieved context. Here's the live distance table—drag the query point above and watch it update:
| Document | Embedding | Distance | Retrieved? |
|---|---|---|---|
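The live table above can be reproduced offline in a few lines. The embeddings and the query position below are hypothetical stand-ins for whatever the interactive canvas currently shows:

```python
import numpy as np

# Hypothetical 2D embeddings for five documents and one query
doc_embs = {"d1": (0.20, 0.90), "d2": (0.80, 0.30), "d3": (0.25, 0.85),
            "d4": (0.70, 0.60), "d5": (0.10, 0.40)}
q = np.array([0.22, 0.88])
k = 2

# Euclidean distance from the query to every document, then sort ascending
dists = {name: float(np.linalg.norm(np.array(v) - q)) for name, v in doc_embs.items()}
ranked = sorted(dists, key=dists.get)

for name in ranked:
    flag = "yes" if name in ranked[:k] else "no"
    print(f"{name}  dist={dists[name]:.3f}  retrieved={flag}")
```

Dragging the query point in the demo is equivalent to changing `q` and re-running this loop.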
Build the prompt
This is where retrieval meets generation. We take the retrieved documents and the user's question and concatenate them into a single text block—the prompt that the LM will read:
q_emb = encoder(query) # step 2: embed the query
docs = index.search(q_emb, k=2) # step 3: retrieve top-2
prompt = concat(docs, query) # step 4: assemble prompt
output = LLM(prompt) # step 5: generate answer

After prompt assembly, RAG's retrieval job is done. From here on it's standard language modelling—the LM reads the prompt and generates an answer, one token at a time.
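The `concat(docs, query)` step deserves one concrete line of detail. A minimal prompt template, assuming a simple numbered-context format (real systems vary the wording considerably):

```python
def build_prompt(docs, question):
    """Concatenate retrieved passages and the user question into one prompt.
    A minimal template for illustration; production templates differ."""
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

retrieved = ["Delhi's PM2.5 levels peak in November.",
             "Crop burning in nearby states worsens Delhi's air."]
print(build_prompt(retrieved, "Why is Delhi's air worst in November?"))
```

Note that the retrieved text is plain input tokens: the LM has no special "retrieval channel," only a longer prompt.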
Generate the answer
The LM generates tokens one at a time, conditioned on the full prompt (context + question). Mathematically, it's still plain next-token prediction:

$$p_\theta(y_t \mid \text{context},\ \text{question},\ y_{<t})$$
But the context tokens profoundly change which tokens are likely. Without context, the model might hallucinate. With context, it can ground its answer in real text. Click to watch generation unfold:
Tokens highlighted in green come directly from the retrieved context. The model isn't "copying" in a literal sense—it's just that the context makes those specific words overwhelmingly likely under next-token prediction.
How does the LM use the context?
Through self-attention. When generating each output token, the model attends to all tokens in the prompt—including the retrieved documents. It naturally focuses on the most relevant context tokens:
Darker cells = higher attention weight. When the model generates "Delhi," it attends heavily to "Delhi" in the context. When it generates "PM2.5," it attends to "PM2.5." The attention mechanism is what makes RAG work at the neural network level.
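The "attends heavily to similar tokens" behaviour falls directly out of the dot-product in attention. A stripped-down sketch with toy key/query vectors (the numbers are hypothetical, and real attention also scales scores and projects through learned matrices):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Four context-token key vectors; the first two are similar to the query
keys = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
query = np.array([1.0, 0.0])  # what the token being generated "looks for"

weights = softmax(keys @ query)  # dot-product attention scores → weights
print(weights.round(2))          # mass concentrates on keys similar to the query
```

The weights sum to one, so attention is a soft selection over the prompt: exactly the mechanism that lets a retrieved fact dominate the output.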
Side by side: with vs. without RAG
Without RAG
The model relies solely on its weights. May hallucinate or hedge.
With RAG
The context contains the fact. Attention surfaces it into the answer.
Try different domains
The beauty of RAG is that the pipeline is domain-agnostic. Swap the corpus and the same five steps work on air quality, cooking, history, or medicine. Switch domains and watch every step adapt:
Corpus
Query
Retrieved (k = 2)
Assembled prompt
Generated answer
Why RAG works
The intuition is simple: instead of asking the model to recall a fact from memory, you hand it a sheet of notes. But we can make this precise.
Without RAG, the model generates purely from parametric memory:

$$p_\theta(y \mid x) = \prod_{t} p_\theta(y_t \mid x, y_{<t})$$

With RAG, we marginalise over retrieved documents $z$:

$$p(y \mid x) = \sum_{z} p_\eta(z \mid x)\, p_\theta(y \mid x, z)$$

In practice, we approximate the sum with just the top-k retrieved documents:

$$p(y \mid x) \approx \sum_{z \in \mathrm{top\text{-}k}(x)} p_\eta(z \mid x)\, p_\theta(y \mid x, z)$$
Three concrete benefits
- Reduces hallucination. The LM has actual text to reference, making it far less likely to invent facts.
- Always up-to-date. Need fresher knowledge? Update the document store. No retraining needed—just re-index.
- Cheaper than fine-tuning. The model's weights stay frozen. Only the index changes. You don't need GPUs to update knowledge.
RAG vs kNN-LM: a subtle distinction
| Method | Retrieval unit | How it's used |
|---|---|---|
| kNN-LM | Individual tokens | Interpolated into output logits at each step |
| RAG | Whole documents | Prepended to the input prompt |
When RAG fails
RAG is powerful, but it's not magic. Understanding the failure modes will save you hours of debugging:
1. The retriever fetches wrong documents
If the embedding model doesn't understand your domain well, it may retrieve documents that look similar on the surface but are semantically wrong. Garbage context in → garbage answer out. The embedding model's quality is the ceiling for RAG.
2. Context dilution (k too large)
Stuffing too many documents into the prompt spreads attention thin. The one relevant document gets buried among nine irrelevant ones. Smaller, more precise k often beats larger k.
3. Redundant documents waste context
Five documents that all say the same thing use up precious context window without adding new information. Diversity-aware retrieval (like MMR—maximal marginal relevance) helps.
4. No multi-hop reasoning
If the answer requires combining facts from document A and document B, the LM often fails. It tends to copy from one document rather than synthesise across several. This is an active area of research.
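The MMR re-ranking mentioned under failure mode 3 is small enough to sketch in full. This is a toy version assuming cosine similarity and hand-picked 2D vectors in which documents 0 and 1 are near-duplicates:

```python
import numpy as np

def mmr(query, docs, k=2, lam=0.5):
    """Maximal marginal relevance: trade off relevance to the query (weight lam)
    against redundancy with already-selected documents (weight 1 - lam)."""
    def sim(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    selected, remaining = [], list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i):
            rel = sim(query, docs[i])
            red = max((sim(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

docs = np.array([[1.0, 0.0], [0.99, 0.05], [0.7, 0.7]])  # 0 and 1 are near-duplicates
query = np.array([1.0, 0.1])
print(mmr(query, docs, k=2))  # picks one duplicate, then the diverse document
```

Plain top-k would return the two near-duplicates; MMR's redundancy penalty swaps the second one for the document that adds new information.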
What we built
- RAG = Retrieval + Generation. Find the nearest documents in embedding space, paste them into the prompt, and let the LM answer from the text.
- Five steps. Embed → Index → Retrieve → Build prompt → Generate.
- Better input, not better model. RAG doesn't change the model's weights—it changes what the model sees. The underlying math is still next-token prediction.
- Attention is the mechanism. The LM attends to context tokens via self-attention, naturally copying relevant facts into its output.
- Know the limits. Bad embeddings, diluted context, redundant docs, and multi-hop reasoning are where RAG breaks down. No tool is a silver bullet.