The Attention Mechanism

Lecture 12 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Learning outcomes

By the end of this lecture you will be able to:

  1. Explain attention as differentiable dictionary lookup.
  2. Derive scaled dot-product attention and justify the √d_k scaling.
  3. Distinguish Bahdanau (additive) from Luong (multiplicative).
  4. Implement QKV self-attention in PyTorch.
  5. Apply causal masking to get a GPT-style decoder.
  6. State the O(n²) complexity wall and its consequences.

Recap · where we are

Module 6 opens. The previous lecture ended on a cliff-hanger:

  • Seq2Seq works for short sentences.
  • The fixed-size context vector can't hold everything for long ones.
  • BLEU collapses past ~30 tokens.

Today we fix it.

Today maps to Prince Ch 12 (early sections). This is the single most influential idea in deep learning between backprop and diffusion.

The one-line fix

Don't force the decoder to read from one fixed vector. Let it peek at every encoder state and decide which ones matter for the current step.

Bahdanau et al. 2014 · "Neural Machine Translation by Jointly Learning to Align and Translate."

Why one vector can't hold everything

Think about translating a 40-word English sentence into French. A Seq2Seq encoder has to compress:

  • every word's identity
  • the full syntactic structure
  • the tense, number, gender, mood of every verb and noun
  • the coreference chains ("it" refers to "the box")
  • the sentiment

...into a single 512-dim vector. One number for every 0.3 bits of meaning.

No matter how big you make that vector, there's always a sentence that exceeds it. The information-theoretic problem is fundamental, not an engineering issue.

The shift in viewpoint

Old · push mode

Encoder pushes a summary forward. Decoder takes whatever fits.

  • One-shot summarization.
  • Lossy — compression is mandatory.

New · pull mode

Decoder pulls information on demand. Encoder keeps everything around.

  • No compression required.
  • Decoder chooses what's relevant per step.

Attention is the mechanism for the pull — a differentiable version of "look up the word I need right now."

Four questions

  1. What does attention look like — literally, as a heatmap?
  2. What are Q, K, V and why "retrieval"?
  3. Why do we divide by √d_k?
  4. What is self-attention and how does it differ from cross-attention?

PART 1

Attention as soft alignment

The heatmap view

What attention looks like · (figure: attention heatmap for an English→French sentence pair, source words on one axis, target words on the other)

Interpretation

Every target word computes a distribution over source words. Darker cell = more attention mass.

Three things to notice:

  1. Rough diagonal — most alignments are monotonic.
  2. "traversé" attends to "cross" — semantic alignment across a language barrier.
  3. "qu'il" attends to BOTH "it" AND "animal" — coreference resolution, emergent from training.

No one told the model what "it" refers to. It learned this by minimizing translation loss. Attention made linguistic structure visible for the first time.

▶ Interactive demo: hover over source tokens and watch the attention weights update in real time.

PART 2

From Bahdanau to QKV

Two parameterizations · one abstraction

A 3×3 worked example · before the math

Source · "the cat slept" Target step · decoding French for "cat" → "chat"

Encoder states (toy 2-d): (the), (cat), (slept)

Decoder state: (about to emit chat)

Raw scores (dot products): , ,

Softmax: "cat" gets the largest weight.

Context:

The decoder reads a weighted mixture, dominated by the relevant word. That's one step of attention.
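A minimal sketch of this single decoding step in PyTorch. The vectors are illustrative assumed values, not the slide's originals; the point is the score → softmax → blend pipeline.

H = torch.tensor([[0.1, 0.2],    # h_1: "the"   (toy 2-d values, assumed)
                  [0.9, 0.8],    # h_2: "cat"
                  [0.3, 0.1]])   # h_3: "slept"
s = torch.tensor([1.0, 0.9])     # decoder state, about to emit "chat"

scores  = H @ s                         # raw dot-product scores, one per source word
alphas  = torch.softmax(scores, dim=0)  # "cat" gets the largest weight
context = alphas @ H                    # weighted mixture of encoder states
print(alphas, context)

(Assumes `import torch`; the same pattern works for any number of source words.)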

Bahdanau (additive) attention · 2014

Analogy · the compatibility test. A plain dot product compares profiles directly. A learned score function instead projects both vectors into a shared compatibility space (W_a s and U_a h), combines them with a non-linearity (tanh), and an "expert" read-out vector v_a turns the result into a single 8/10-style score. More expressive than a raw dot product when the vectors aren't aligned.

Score between decoder state s_t and encoder state h_i:

e_{t,i} = v_aᵀ tanh(W_a s_t + U_a h_i)

Bahdanau · worked numeric, term-by-term

Toy 2-d decoder state s and encoder state h.

Step 1 · project. Compute W_a s and U_a h (each again 2-d).

Step 2 · add + non-linearity. Form tanh(W_a s + U_a h).

Step 3 · final dot product with v_a. The score is v_aᵀ tanh(W_a s + U_a h).

This single number is the alignment score for one decoder-step / encoder-position pair. Compute it for every encoder position i, then softmax. (A numeric sketch with toy values follows.)
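A minimal sketch of the additive score with assumed toy values for s, h, W_a, U_a, v_a; only the structure (project, add, tanh, read out with v_a) mirrors the slide.

import torch

s   = torch.tensor([1.0, 0.5])                  # decoder state (assumed toy values)
h   = torch.tensor([0.2, 0.9])                  # one encoder state
W_a = torch.tensor([[0.5, 0.0], [0.0, 0.5]])    # projects the decoder state
U_a = torch.tensor([[1.0, 0.0], [0.0, 1.0]])    # projects the encoder state
v_a = torch.tensor([1.0, -1.0])                 # read-out vector

hidden = torch.tanh(W_a @ s + U_a @ h)          # project, add, non-linearity
score  = v_a @ hidden                           # single alignment score e_{t,i}
print(score)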

Luong (multiplicative) attention · 2015

A simpler score, no learned MLP:

e_{t,i} = s_tᵀ W_a h_i   (or simply s_tᵀ h_i in the pure dot-product variant)

  • Faster — one matrix multiply instead of a two-layer MLP.
  • Equivalent in expressiveness when you have enough data (the learned projections in QKV absorb the job of Bahdanau's W_a, U_a, v_a).

This is the version Vaswani et al. kept for the Transformer in 2017, with one small but crucial addition: the 1/√d_k scaling.
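For comparison, a sketch of the multiplicative score with the same kind of assumed toy values:

import torch

s   = torch.tensor([1.0, 0.5])                  # decoder state (assumed toy values)
h   = torch.tensor([0.2, 0.9])                  # encoder state
W_a = torch.tensor([[0.5, 0.1], [0.0, 0.5]])    # learned bilinear matrix

score_general = s @ W_a @ h     # Luong "general": one matrix multiply, no MLP
score_dot     = s @ h           # Luong "dot": no learned parameters at all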

Why multiplicative won

Two reasons the ML community moved from Bahdanau to Luong:

  1. Hardware — a dot product is one matrix multiply; GPUs love it. Bahdanau's MLP has element-wise tanh, which is slower per op and harder to batch.
  2. Enough capacity elsewhere — once we added learned Q/K/V projections (next section), we no longer needed a tanh MLP to learn the similarity. The dot product of the projections already gives it.

Pattern in DL · keep the core operation small + fast, push learning into the linear layers around it. This is also why attention beat CNN for sequences — CNN's inductive bias was baked in; attention's bias is learned.

PART 3

Q, K, V · the clean abstraction

Attention as database retrieval

Soft retrieval · picture

Why Q, K, V are three different things · the library analogy

You walk into a library with a question.

  • Query · "I need info on the 2008 financial crisis."
  • Key · the title on each book's spine ("The 2008 Meltdown", "The Great Depression"...).
  • Value · the actual content inside each book.

You use the query to match against keys. The match score decides how much of each book's value (content) you blend into your final answer.

That's it · attention is a soft library lookup. The next slides formalize the math; the analogy is the whole intuition.

The retrieval metaphor

Imagine a Python dictionary lookup:

db = {"cat": "meow", "dog": "bark"}
query = "cat"
result = db[query]      # returns "meow"

Attention is the soft version of this:

  • Query — what you're looking for (from the decoder).
  • Key — what each encoder state announces itself as.
  • Value — what each encoder state actually contains.

Score keys against the query → softmax → use weights to blend values.

Soft retrieval · why three roles not one

You could imagine an attention mechanism where keys and values are the same vectors. Early models did exactly this (Luong 2015 scores against the encoder states and blends those same states). So why separate them?

Keys say "what I am"; values say "what I contribute."

A word like "bank" in a sentence should be found by the query "financial", but contribute the full contextual embedding. Keys for retrieval, values for content.

  • Keys are optimized for similarity with plausible queries.
  • Values are optimized to carry whatever downstream layers need.

Separating them roughly doubles the key/value parameter count of a head, but it lets "how findable am I?" and "what do I contribute?" be optimized independently.

The Python-dict mental model · extended

import torch

# Hard retrieval: an exact key match returns exactly one value
db  = {"cat": "meow", "dog": "bark", "cow": "moo"}
out = db["cat"]                                            # exact match → one value

# Soft retrieval (attention): every value contributes, weighted by similarity
k1, k2, k3 = torch.tensor([0.9, 0.1]), torch.tensor([0.2, 0.8]), torch.tensor([0.5, 0.5])
v1, v2, v3 = torch.tensor([1.0, 2.0]), torch.tensor([3.0, 4.0]), torch.tensor([5.0, 6.0])
query = torch.tensor([1.0, 0.0])

scores  = torch.stack([query @ k for k in (k1, k2, k3)])   # similarities
weights = torch.softmax(scores, dim=0)                     # probabilities
out     = weights[0]*v1 + weights[1]*v2 + weights[2]*v3    # blend of values

Attention is differentiable dictionary lookup. The network's parameters shape what "sim" means and what each key/value represents. Everything else is the soft version of db[query].

QKV · the computation

Scaled dot-product · step by step

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

Scaled dot-product · worked numeric (4 steps)

Tiny example · 2 tokens, d_k = 2, so the scale factor is √2 ≈ 1.41.

Step 1 · scores. Q Kᵀ is a 2×2 matrix of dot products — token 1's query matches token 2 better (score 2 vs score 1).

Step 2 · scale. Divide every score by √2: token 1's row becomes (0.71, 1.41).

Step 3 · row-wise softmax.
Row 1: softmax(0.71, 1.41) ≈ (0.33, 0.67)
Row 2: computed the same way from token 2's scaled scores.

Step 4 · weighted sum. Each output row is that row's weights times V.

Output for token 1 is 33% of v_1 + 67% of v_2. (The snippet below reproduces these numbers.)
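A minimal PyTorch check of the same four steps. The Q, K, V values are illustrative picks (not from the slides), chosen so that token 1's raw scores come out as 1 and 2.

import torch
import torch.nn.functional as F

Q = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0]])
K = torch.tensor([[1.0, 1.0],
                  [2.0, 0.5]])
V = torch.tensor([[10.0, 0.0],
                  [ 0.0, 10.0]])

d_k = Q.size(-1)
scores  = Q @ K.T / d_k ** 0.5          # steps 1 + 2: score and scale
weights = F.softmax(scores, dim=-1)     # step 3: row-wise softmax
out     = weights @ V                   # step 4: blend the values

print(weights[0])   # ≈ tensor([0.33, 0.67]): matches the worked example's row 1
print(out[0])       # ≈ 0.33 * v_1 + 0.67 * v_2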

One actor, three roles

Analogy. Eddie Murphy in The Nutty Professor — same person (the input vector x), different costumes and makeup (the projection matrices W_Q, W_K, W_V), three characters.

  • Query role · "I'm the hero — what do I need?"
  • Key role · "Here's what I am."
  • Value role · "Here's the info I have to offer."

The network learns the costumes (the W_Q, W_K, W_V matrices) that make attention work.

QKV · projection worked example

Input vector for "cat" · x ∈ ℝ^{d_in}. Project to d_k dimensions via three learned matrices:

q = x W_Q,   k = x W_K,   v = x W_V

The single input now plays three roles: query q, key k, value v.

The full Q, K, V matrices are this calculation repeated for every token (stack the x's into X and multiply once). PyTorch:

import math
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    def __init__(self, d_in, d_k):
        super().__init__()
        # Three learned projections: same input, three roles
        self.Wq = nn.Linear(d_in, d_k, bias=False)
        self.Wk = nn.Linear(d_in, d_k, bias=False)
        self.Wv = nn.Linear(d_in, d_k, bias=False)

    def forward(self, x):
        # x: (..., seq_len, d_in)
        Q, K, V = self.Wq(x), self.Wk(x), self.Wv(x)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # scaled dot-product
        weights = scores.softmax(dim=-1)                          # each row sums to 1
        return weights @ V                                        # blend of values
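A quick shape check, as a sketch; the sizes and batch below are arbitrary choices, not from the slides.

head = AttentionHead(d_in=32, d_k=8)
x = torch.randn(4, 10, 32)        # (batch, seq_len, d_in), arbitrary toy sizes
out = head(x)
print(out.shape)                  # torch.Size([4, 10, 8]): one d_k-dim output per token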

PART 4

Why √d_k?

The scaling that makes attention work

Why softmax can get "spiky"

Analogy · grading on a curve.

  • Scores 1–10 · 7 vs 8 are close. Curve is smooth.
  • Scores 1–1000 · 700 vs 800 are worlds apart. The 800 wins; everyone else gets ~0%. Curve is spiky.

Unscaled dot products behave like the second case as d_k grows. The 1/√d_k scaling factor brings us back to the first.

Variance of unscaled dot products

For q, k ∈ ℝ^{d_k} with i.i.d. zero-mean, unit-variance entries, the raw score is q · k = Σᵢ qᵢ kᵢ.

Variance of a sum of independent terms = sum of variances: Var(q · k) = Σᵢ Var(qᵢ kᵢ) = d_k.

(For two independent zero-mean unit-variance variables, Var(qᵢ kᵢ) = Var(qᵢ) · Var(kᵢ) = 1.)

Standard deviation is √d_k → typical score magnitudes scale like √d_k.

With d_k = 256, say, raw scores have standard deviation 16. Softmax of scores that large → nearly one-hot. Scale by 1/√d_k → variance back to 1.
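A quick empirical check of the variance claim; the d_k and sample count below are arbitrary choices.

import torch

d_k = 256
q = torch.randn(50_000, d_k)            # i.i.d. standard normal entries
k = torch.randn(50_000, d_k)
raw = (q * k).sum(dim=-1)               # 50k raw dot products

print(raw.var())                        # ≈ d_k = 256
print((raw / d_k ** 0.5).var())         # ≈ 1 after scaling by √d_k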

Worked numeric · the scaling factor in action

Take d_k = 256, so √d_k = 16. A typical raw-score vector for one query then has entries spread over a range of tens.

Without scaling. The gaps between scores are also of order 16, so softmax puts essentially all its mass on the largest score — essentially one-hot → gradient ≈ 0 → learning stalls.

With scaling. Divide every score by 16: the entries land back in the ±1–2 range, the softmax output stays soft, and gradients flow.

Numeric demo · softmax at different scales

Take a handful of raw logits and watch the softmax output as their scale (temperature) changes:

temperature | softmax output
logits / 1  | (0.58, 0.21, 0.13, …) — soft
logits / 4  | (0.40, 0.30, 0.26, …) — very soft
logits × 10 | (0.9999, 4e-5, 2e-7) — one-hot

Dot products without scaling behave like the bottom row — effectively one-hot. Divide by √d_k and you land back on the top row. The √d_k scaling plays exactly the role of a temperature denominator, derived from a variance analysis rather than tuned by hand.
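To reproduce this behaviour with any logits you like (the logits below are an assumption, not necessarily the table's):

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, 0.0])

print(F.softmax(logits / 1, dim=0))    # soft: mass spread over several entries
print(F.softmax(logits / 4, dim=0))    # very soft: nearly uniform
print(F.softmax(logits * 10, dim=0))   # essentially one-hot: winner takes all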

In pictures · (figure: softmax outputs with and without the √d_k scaling)

Why one-hot is bad

If softmax outputs are (nearly) one-hot, then attention picks one encoder state and ignores the rest.

Two consequences:

  1. Information from other positions is discarded.
  2. Gradient through the softmax is near zero (saturation) → training stalls.

The fix — divide the scores by √d_k:

Scores stay in a healthy range → softmax stays soft → gradients keep flowing.

⚠️ optional · The derivation in three lines

Assume the entries of q and k are i.i.d. with zero mean and unit variance. Then E[q · k] = 0 and Var(q · k) = Σᵢ Var(qᵢ kᵢ) = d_k.

So std(q · k) = √d_k. Dividing the scores by √d_k makes their variance 1 — independent of dimension.

Why this matters · the same attention block can be used at small or large d_k without retuning any temperature. The scaling is dimension-invariant by construction.

PART 5

Self-attention vs cross-attention

Same machinery · different sources for QKV

Cross-attention · the decoder reads the encoder

In a Seq2Seq + attention model:

  • Queries come from the decoder (current target position).
  • Keys and values come from the encoder (all source positions).

This is what the first attention heatmap showed — one distribution per target step over source positions.

Self-attention · the "bank" disambiguation

How do you know "bank" means a financial institution in "the bank approved the loan"?

You look at the other words. "Loan" tells you which bank.

In "he sat on the river bank" · the word "river" tells you the other meaning.

Self-attention is exactly this · every word looks at every other word in the sentence to figure out its context-specific meaning. The famous "the animal didn't cross the street because it was too tired" example needs this · attention disambiguates "it" by attending to "animal."

Self-attention · a sequence attends to itself

Same operation. Now

Q = X W_Q,   K = X W_K,   V = X W_V

and all three come from the same input sequence X.

Self-attention lets every position in a sequence aggregate information from every other position — in parallel, with no recurrence.

This is the idea that killed RNNs and gave us Transformers (L13).

Self-attention · 3 tokens, by hand

Sentence · "the cat slept" token embeddings .

After projecting: are each a matrix.

Row says: how much does token want to look at every other token?

Softmax per row → 3×3 attention matrix . Output is again .

Every output row is a weighted blend of value rows — a contextualized embedding for that token.
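The same 3-token computation written out with raw tensor ops. Embeddings and projections are random placeholders, so only the shapes (and the fact that every row of A sums to 1) are meaningful.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(3, 4)                        # 3 token embeddings, toy dimension 4
Wq, Wk, Wv = (torch.randn(4, 2) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv             # each is 3 × 2
A = F.softmax(Q @ K.T / 2 ** 0.5, dim=-1)    # 3 × 3 attention matrix, rows sum to 1
out = A @ V                                  # 3 × 2: one contextualized embedding per token

print(A)            # row i = how much token i attends to each token
print(out.shape)    # torch.Size([3, 2])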

Self-attention vs convolution · same goal, different bias

Convolution

  • Each output depends on a fixed local window.
  • Inductive bias: locality.
  • Parameters: shared kernel.

Self-attention

  • Each output depends on all positions.
  • Inductive bias: learned from data.
  • Parameters: Q, K, V projections.

Convolution bakes in "nearby tokens matter"; self-attention lets the network decide from data whether nearby or far-away tokens matter. When you have lots of data, learned bias wins over hand-designed bias. That's the whole arc of the 2017–2025 vision revolution in one sentence.

Causal mask · 5×5 grid (figure: lower-triangular mask — each position attends only to itself and earlier positions)

Causal self-attention · don't peek at the future

When writing the next word of "The quick brown fox jumps…", you can only use what you've already written. A causal mask is like covering the future with cardboard — for "fox", you see "The quick brown fox" but everything after is hidden.

import math
import torch

# Q, K: (n, d_k) projections of the same n-token sequence
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)     # (n, n)
mask   = torch.triu(torch.ones_like(scores), diagonal=1).bool()
scores.masked_fill_(mask, float('-inf'))              # future positions → -inf
weights = scores.softmax(dim=-1)                      # rows still sum to 1

That's the only difference between BERT-style (bidirectional) and GPT-style (causal) attention. Same module, different mask.

Worked numeric · how the −∞ mask works

Raw scores for 3 tokens (scaling omitted for simplicity): a full 3×3 matrix of query–key dot products.

Step 1 · mask the upper triangle. Every entry where the key position lies after the query position is set to −∞.

Step 2 · row-wise softmax. Since e^(−∞) = 0:

  • Row 1: all weight on token 1.
  • Row 2: weight split between tokens 1 and 2; zero on token 3.
  • Row 3: weight spread over all three tokens.

The −∞ guarantees future tokens get exactly zero weight after softmax. Token 1 sees only itself; token 2 sees tokens 1–2; token 3 sees all.
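A sketch that verifies the mask behaviour on an assumed 3×3 score matrix:

import torch
import torch.nn.functional as F

scores = torch.tensor([[1.0, 2.0, 0.5],       # assumed raw scores, not the slide's
                       [0.5, 1.5, 1.0],
                       [2.0, 0.5, 1.0]])

mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
weights = F.softmax(scores.masked_fill(mask, float('-inf')), dim=-1)

print(weights)
# row 1: [1, 0, 0]          token 1 sees only itself
# row 2: [·, ·, 0]          token 2 splits weight over tokens 1–2
# row 3: all three nonzero  token 3 sees everything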

Complexity · the O(n²) wall

Self-attention on a sequence of length n:

  • Q Kᵀ builds an n × n matrix → O(n²) memory and compute.
  • Double the context → 4× the cost.

At a 4k-token context in fp32, one head's attention matrix is already 64 MB per layer (quick check after the list below). Scaling context to 1M tokens naively would need 500 GB per layer. This is the wall that motivates:

  • FlashAttention (L23) · recompute attention in tiles, avoiding the full matrix.
  • Sparse / local / linear attention (reading) · trade off quality for or .
  • KV caching (L23) · don't redo the whole computation at every generation step.
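A back-of-the-envelope check of the quadratic growth, assuming one fp32 score matrix per head (the context lengths are illustrative picks):

def attn_matrix_mib(n, bytes_per_elem=4):
    """Memory for one n × n fp32 attention score matrix, in MiB."""
    return n * n * bytes_per_elem / 2**20

print(attn_matrix_mib(4096))      # 64.0: one head, one layer, 4k context
print(attn_matrix_mib(8192))      # 256.0: doubling the context quadruples it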

The four kinds of attention you will meet

Type | Q comes from | K, V come from | Used in
Encoder self-attn | encoder | encoder | Transformer encoder
Decoder self-attn (causal) | decoder (past only) | decoder (past only) | GPT, Transformer decoder
Cross-attention | decoder | encoder | Seq2Seq decoder, translation
Masked/local attention | input | input (masked) | Longformer, Reformer, etc.

PART 6

What attention unlocked

One slide of consequences

Why attention was such a big deal

  1. Bottleneck solved · no more "fit everything into 512 dims."
  2. Long-range dependencies · every target step can see any source step.
  3. Parallelizable · unlike RNNs, all attention scores can be computed at once.
  4. Interpretable · attention heatmaps are the first DL visualization tool that's actually informative.
  5. Transfer · attention blocks compose cleanly — stack them, mix them with cross-attention, make them multi-headed. The Transformer (L13) is exactly this.

Without attention · no Transformer · no BERT · no GPT · no Claude · no diffusion text conditioning. A single 2014 paper seeded the next decade.

"Attention Is All You Need" · the 2017 pivot

Vaswani et al.'s one-page insight · drop the RNN entirely, use only attention plus FFNs plus positional encodings.

Before: encoders and decoders were RNNs with attention as a helper. After: attention was the load-bearing operation; RNN was gone.

Consequence table:

Axis | RNN + attention | Transformer
Sequential compute | yes (unrolled over time) | no (fully parallel)
Long-range path | O(n) hops | O(1) hop
Training throughput | slow | 10–20× faster
Scaling | plateaus at ~1B params | trained to 1T+ params

Every major model since 2018 (BERT, GPT-*, T5, Claude, Llama) is this architecture, plus or minus details. L13 builds it from parts.

Lecture 12 — summary

  • Attention = soft retrieval · each query selects a weighted combination of values based on similarity to keys.
  • Bahdanau (additive, 2014) and Luong (multiplicative, 2015) are two parameterizations — we use Luong's dot-product form in Transformers.
  • QKV abstraction · Q, K, V are learned projections of the same input; the network decides what each role should be.
  • √d_k scaling — without it, softmax collapses to one-hot at large d_k and gradients die.
  • Self-attention · Q, K, V from the same sequence; parallel, long-range, interpretable.
  • This unlocked the Transformer — next lecture.

Read before Lecture 13

Prince Ch 12 mid-sections (Transformer block).

Next lecture

The Transformer — built live. Multi-head attention · positional encoding · residual + LayerNorm · the full encoder-decoder stack.

Notebook 12 · 12-attention-nmt.ipynb — add attention to Lecture 11's Seq2Seq; visualize attention heatmaps; watch BLEU improve on long sentences.