The heatmap view
For each target word, the decoder computes a distribution over the source words. Darker cell = more attention mass.
Three things to notice:
No one told the model what "it" refers to. It learned this by minimizing translation loss. Attention made linguistic structure visible for the first time.
Two parameterizations · one abstraction
Source · "the cat slept"
Target step · decoding French for "cat" → "chat"
Encoder states (toy 2-d): $h_1, h_2, h_3$ for "the", "cat", "slept".
Decoder state: $s$.
Raw scores (dot products): $e_i = s^\top h_i$.
Softmax: $\alpha_i = \exp(e_i) / \sum_j \exp(e_j)$.
Context: $c = \sum_i \alpha_i h_i$.
The decoder reads a weighted mixture, dominated by the relevant word. That's one step of attention.
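The same step in code (a minimal numpy sketch; the 2-d state values are made up for illustration):

```python
import numpy as np

# toy 2-d encoder states for "the", "cat", "slept" (made-up values)
H = np.array([[0.1, 0.2],
              [1.0, 0.9],
              [0.2, 0.1]])
s = np.array([0.9, 1.0])                          # decoder state while producing "chat"

scores = H @ s                                    # raw dot products, one per source word
weights = np.exp(scores) / np.exp(scores).sum()   # softmax
context = weights @ H                             # weighted mixture of encoder states
print(weights.round(2))                           # ≈ [0.15, 0.69, 0.15]: mass on "cat"
```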
Analogy · the compatibility test. A plain dot product compares profiles directly. A learned score function projects both into a shared compatibility space (tanh), and an "expert" vector $v$ reads off the final score.
Score between decoder state $s_t$ and encoder state $h_i$ (Bahdanau-style, additive):
$e_{t,i} = v^\top \tanh(W_s s_t + W_h h_i)$
Toy 2-d walk-through:
Step 1 · project: compute $W_s s_t$ and $W_h h_i$.
Step 2 · add + non-linearity: $\tanh(W_s s_t + W_h h_i)$.
Step 3 · final dot product with $v$.
This single number is the alignment score for one (decoder step, source position) pair.
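In code, the additive score is a tiny function (a sketch; the matrices Ws, Wh and the vector v would be learned parameters):

```python
import numpy as np

def additive_score(s_t, h_i, Ws, Wh, v):
    """Bahdanau-style alignment score for one (decoder step, source position) pair."""
    hidden = np.tanh(Ws @ s_t + Wh @ h_i)   # project into a shared space, add, squash
    return v @ hidden                       # the "expert" v reads off a single number
```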
A simpler score, no learned MLP: $e_{t,i} = s_t^\top h_i$ (Luong-style, multiplicative).
This is the version Vaswani et al. kept for the Transformer in 2017, with one small but crucial addition: the $1/\sqrt{d_k}$ scaling (covered below).
Two reasons the ML community moved from Bahdanau to Luong: dot products are pure matrix multiplies, so they run far faster on GPUs, and they match additive attention's quality with fewer parameters.
Pattern in DL · keep the core operation small and fast, push learning into the linear layers around it. This is also why attention beat CNNs for sequences: a CNN's inductive bias is baked in; attention's bias is learned.
Attention as database retrieval
You walk into a library with a question.
You use the query to match against keys. The match score decides how much of each book's value (content) you blend into your final answer.
That's it · attention is a soft library lookup. The next slides formalize the math; the analogy is the whole intuition.
Imagine a Python dictionary lookup:

```python
db = {"cat": "meow", "dog": "bark"}
query = "cat"
result = db[query]  # returns "meow"
```
Attention is the soft version of this:
Score keys against the query → softmax → use weights to blend values.
You could imagine an attention mechanism where the keys and values are the same vectors, but separating them pays off.
Keys say "what I am"; values say "what I contribute."
A word like "bank" in a sentence should be found by the query "financial", but contribute the full contextual embedding. Keys for retrieval, values for content.
Separating them doubles the parameter count of one head but roughly doubles the expressiveness too.
```python
import numpy as np

# Hard retrieval: exact key match returns exactly one value
db = {"k1": "v1", "k2": "v2", "k3": "v3"}
out = db["k1"]                    # exact match → one value

# Soft retrieval (attention): every value contributes, weighted by similarity
keys   = [np.array([1., 0.]), np.array([0., 1.]), np.array([1., 1.])]
values = [np.array([1., 2.]), np.array([3., 4.]), np.array([5., 6.])]
query  = np.array([1., 0.])

scores  = np.array([query @ k for k in keys])      # similarities
weights = np.exp(scores) / np.exp(scores).sum()    # softmax → probabilities
out = sum(w * v for w, v in zip(weights, values))  # blend of all values
```
Attention is a differentiable dictionary lookup. The network's parameters shape what "sim" means and what each key and value represents. Everything else is the soft version of db[query].
Tiny example · 2 tokens, toy 2-d queries, keys, and values.
Step 1 · scores: $S = QK^\top$.
Step 2 · scale: divide by $\sqrt{d_k}$.
Step 3 · row-wise softmax: $A = \mathrm{softmax}(S / \sqrt{d_k})$, one distribution per row.
Row 1: token 1's weights over both tokens.
Row 2: token 2's weights over both tokens.
Step 4 · weighted sum: output $= AV$.
Output for token 1 is 33% of $v_1$ and 67% of $v_2$ · a blend, not a hard pick.
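A numeric sketch (values hand-picked so token 1's weights land near the 33/67 split above):

```python
import numpy as np

d_k = 2
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
K = np.array([[0.0, 0.0],
              [0.98, 0.98]])      # chosen so row 1's scaled scores differ by ~ln 2
V = np.array([[1.0, 0.0],
              [0.0, 1.0]])

S = Q @ K.T / np.sqrt(d_k)        # scaled scores
A = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)  # row-wise softmax
print(A[0].round(2))              # ≈ [0.33, 0.67]
print((A @ V)[0].round(2))        # output for token 1: 33% of v1 + 67% of v2
```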
Analogy. Eddie Murphy in The Nutty Professor · the same person (the same input vector $x$) appears as several different characters (query, key, value).
The network learns the costumes ($W_Q$, $W_K$, $W_V$); the actor underneath never changes.
Input vector for "cat" · $x \in \mathbb{R}^{d_{\text{in}}}$.
The single input $x$ is projected three ways: $q = W_Q x$, $k = W_K x$, $v = W_V x$.
The full sequence stacks into matrices: $Q = XW_Q$, $K = XW_K$, $V = XW_V$.
```python
import math
import torch
from torch import nn

class AttentionHead(nn.Module):
    def __init__(self, d_in, d_k):
        super().__init__()
        self.Wq = nn.Linear(d_in, d_k, bias=False)
        self.Wk = nn.Linear(d_in, d_k, bias=False)
        self.Wv = nn.Linear(d_in, d_k, bias=False)

    def forward(self, x):
        # x: (batch, seq_len, d_in)
        Q, K, V = self.Wq(x), self.Wk(x), self.Wv(x)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # (batch, n, n)
        weights = scores.softmax(dim=-1)  # each row sums to 1
        return weights @ V                # contextualized embeddings
```
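A quick shape check (a usage sketch; the sizes are illustrative):

```python
head = AttentionHead(d_in=8, d_k=4)
x = torch.randn(2, 5, 8)   # batch of 2 sequences, 5 tokens, 8-dim embeddings
out = head(x)
print(out.shape)           # torch.Size([2, 5, 4])
```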
The scaling that makes attention work
Analogy · grading on a curve. If everyone's raw scores sit close together, the curve spreads partial credit across the class; if one score towers over the rest, that student takes essentially all the credit. Unscaled dot products behave like the second case as $d_k$ grows.
For $q, k \in \mathbb{R}^{d_k}$ with i.i.d. zero-mean, unit-variance components: $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$.
Variance of a sum of independent terms = sum of variances: $\mathrm{Var}(q \cdot k) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i) = d_k$.
(For two independent zero-mean unit-variance variables, $\mathrm{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1$.)
Standard deviation $= \sqrt{d_k}$.
With $d_k = 256$: standard deviation $= \sqrt{256} = 16$.
Without scaling. Scores routinely land around $\pm 16$ and beyond; softmax saturates into a near one-hot output.
With scaling. Divide by 16: scores return to roughly unit variance, and softmax stays usefully soft.
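A quick empirical check of the claim (a sketch with random vectors, $d_k = 256$ as above):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 256
q = rng.standard_normal((10_000, d_k))   # zero-mean, unit-variance components
k = rng.standard_normal((10_000, d_k))

dots = (q * k).sum(axis=1)               # 10,000 raw dot products
print(dots.std())                        # ≈ 16 = sqrt(256)
print((dots / np.sqrt(d_k)).std())       # ≈ 1 after scaling
```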
Raw logits $(1.0,\ 0.0,\ -0.5,\ -1.0)$, pushed through softmax at three temperatures:

| temperature | softmax |
|---|---|
| /1 | (0.58, 0.21, 0.13, …) · soft |
| /4 | (0.33, 0.25, 0.22, …) · very soft |
| ×10 | (0.99995, 4.5e-5, 3e-7, ≈0) · one-hot |
Dot products without scaling behave like the bottom row: effectively one-hot. Divide by $\sqrt{d_k}$ and they behave like the top rows again.
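The table takes a few lines to reproduce (assuming the raw logits (1.0, 0.0, −0.5, −1.0) reconstructed above, which match the /1 and ×10 rows):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # subtract max for numerical stability
    return e / e.sum()

logits = np.array([1.0, 0.0, -0.5, -1.0])
for scale, label in [(1.0, "/1 "), (0.25, "/4 "), (10.0, "x10")]:
    print(label, softmax(logits * scale).round(5))
```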
If softmax outputs are (nearly) one-hot, then attention picks one encoder state and ignores the rest.
Two consequences: the model can no longer blend information from several positions, and the saturated softmax has near-zero gradient almost everywhere, so learning stalls.
The fix · divide scores by $\sqrt{d_k}$.
Scores stay in a healthy range → softmax stays soft → gradients keep flowing.
Assume the components of $q$ and $k$ are i.i.d. with zero mean and unit variance, so $\mathrm{Var}(q \cdot k) = d_k$.
So dividing by $\sqrt{d_k}$ restores unit variance: $\mathrm{Var}\big(q \cdot k / \sqrt{d_k}\big) = d_k / d_k = 1$.
Why this matters · the same attention block can be used at any width: scaled scores stay in the same healthy range whether $d_k$ is 64 or 4096, with no per-size retuning.
Same machinery · different sources for QKV
In a Seq2Seq + attention model: queries come from the decoder state; keys and values come from the encoder states. That is cross-attention.
This is what the first attention heatmap showed — one distribution per target step over source positions.
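A minimal sketch of that wiring (random toy states; shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, d_k = 2, 3, 4                  # 2 target steps, 3 source tokens
dec = rng.standard_normal((m, d_k))  # queries come from the decoder
enc = rng.standard_normal((n, d_k))  # keys and values come from the encoder

scores = dec @ enc.T / np.sqrt(d_k)  # (m, n): one row per target step
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
context = weights @ enc              # one blended source reading per target step
print(weights.round(2))              # one distribution over source positions per row
```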
How do you know "bank" means a financial institution in "the bank approved the loan"?
You look at the other words. "Loan" tells you which bank.
In "he sat on the river bank" · the word "river" tells you the other meaning.
Self-attention is exactly this · every word looks at every other word in the sentence to figure out its context-specific meaning. The famous "the animal didn't cross the street because it was too tired" example needs this · attention disambiguates "it" by attending to "animal."
Same operation. Now: $Q = XW_Q$, $K = XW_K$, $V = XW_V$.
All three come from the same input sequence.
Self-attention lets every position in a sequence aggregate information from every other position — in parallel, with no recurrence.
This is the idea that killed RNNs and gave us Transformers (L13).
Sentence · "the cat slept" · token embeddings stacked as $X \in \mathbb{R}^{3 \times d}$.
After projecting: $Q = XW_Q$, $K = XW_K$, $V = XW_V$.
Row $i$ of $QK^\top / \sqrt{d_k}$ holds token $i$'s scores against all three tokens.
Softmax per row → 3×3 attention matrix $A$; output $= AV$.
Every output row is a weighted blend of value rows — a contextualized embedding for that token.
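The whole 3-token computation fits in a few lines (random toy weights; the outputs are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_k = 4, 3
X = rng.standard_normal((3, d))                       # "the", "cat", "slept" embeddings
Wq, Wk, Wv = (rng.standard_normal((d, d_k)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
S = Q @ K.T / np.sqrt(d_k)                            # 3x3 scaled scores
A = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)  # row-wise softmax
out = A @ V                                           # each row: contextualized embedding
print(A.round(2))                                     # 3x3, each row sums to 1
```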
Convolution bakes in "nearby tokens matter"; self-attention lets the network decide from data whether nearby or far-away tokens matter. When you have lots of data, learned bias wins over hand-designed bias. That's the whole arc of the 2017–2025 vision revolution in one sentence.
When writing the next word of "The quick brown fox jumps…", you can only use what you've already written. A causal mask is like covering the future with cardboard: at "fox", you can see "The quick brown fox", but everything after it is hidden.
```python
import math
import torch

# Q, K: (n, d_k) query/key matrices (assumed already computed)
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)              # (n, n)
mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()  # True above the diagonal
scores.masked_fill_(mask, float('-inf'))                       # future → -inf
weights = scores.softmax(dim=-1)                               # rows still sum to 1
```
That's the only difference between BERT-style (bidirectional) and GPT-style (causal) attention. Same module, different mask.
Raw scores (3 tokens): a 3×3 matrix $S$, where $S_{ij}$ scores token $i$ against token $j$.
Step 1 · mask the upper triangle: set $S_{ij} = -\infty$ for every $j > i$.
Step 2 · row-wise softmax. Since $e^{-\infty} = 0$, future positions get exactly zero weight.
The resulting attention matrix is lower-triangular: token 1 attends only to itself, token 2 to tokens 1–2, token 3 to all three.
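Running the masked snippet above on random 3-token scores shows the triangle directly (a sketch; imports as in that snippet):

```python
n, d_k = 3, 4
Q, K = torch.randn(n, d_k), torch.randn(n, d_k)
scores = Q @ K.T / math.sqrt(d_k)
mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
scores.masked_fill_(mask, float('-inf'))
print(scores.softmax(dim=-1))   # lower-triangular: row 1 is [1, 0, 0]
```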
Self-attention on a sequence of length $n$ costs $O(n^2 d)$ time and $O(n^2)$ memory: every position scores every other position.
At large $n$, the $n \times n$ score matrix dominates compute and memory, which is what the masked/local variants in the table below attack.
| Type | Q comes from | K, V come from | Used in |
|---|---|---|---|
| Encoder self-attn | encoder | encoder | Transformer encoder |
| Decoder self-attn (causal) | decoder (past only) | decoder (past only) | GPT, decoder of Transformer |
| Cross-attention | decoder | encoder | Seq2Seq decoder, translation |
| Masked/local attention | input | input (masked) | Longformer, Reformer, etc. |
One slide of consequences
Without attention · no Transformer · no BERT · no GPT · no Claude · no diffusion text conditioning. A single 2014 paper seeded the next decade.
Vaswani et al.'s one-page insight · drop the RNN entirely, use only attention plus FFNs plus positional encodings.
Before: encoders and decoders were RNNs with attention as a helper. After: attention was the load-bearing operation; RNN was gone.
Consequence table:
| Axis | RNN+attention | Transformer |
|---|---|---|
| Sequential compute | yes (unrolled) | no (parallel) |
| Long-range path | O(n) steps | O(1) |
| Training throughput | slow | 10–20× faster |
| Scaling | plateaus at ~1B | trained to 1T+ |
Every major model since 2018 (BERT, GPT-*, T5, Claude, Llama) is this architecture, plus or minus details. L13 builds it from parts.