The Attention Mechanism

Lecture 12 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Learning outcomes

By the end of this lecture you will be able to:

  1. Explain attention as differentiable dictionary lookup.
  2. Derive scaled dot-product attention and justify the √d_k scaling.
  3. Distinguish Bahdanau (additive) from Luong (multiplicative).
  4. Implement QKV self-attention in PyTorch.
  5. Apply causal masking to get a GPT-style decoder.
  6. State the O(n²) complexity wall and its consequences.

Recap · where we are

Module 6 opens. The previous lecture ended on a cliff-hanger:

  • Seq2Seq works for short sentences.
  • The fixed-size context vector can't hold everything for long ones.
  • BLEU collapses past ~30 tokens.

Today we fix it.

Today maps to Prince Ch 12 (early sections). This is the single most influential idea in deep learning between backprop and diffusion.

The one-line fix

Don't force the decoder to read from one fixed vector. Let it peek at every encoder state and decide which ones matter for the current step.

Bahdanau et al. 2014 · "Neural Machine Translation by Jointly Learning to Align and Translate."

Why one vector can't hold everything

Think about translating a 40-word English sentence into French. A Seq2Seq encoder has to compress:

  • every word's identity
  • the full syntactic structure
  • the tense, number, gender, mood of every verb and noun
  • the coreference chains ("it" refers to "the box")
  • the sentiment

...into a single 512-dim vector. One number for every 0.3 bits of meaning.

No matter how big you make that vector, there's always a sentence that exceeds it. The information-theoretic problem is fundamental, not an engineering issue.

The shift in viewpoint

Old · push mode

Encoder pushes a summary forward. Decoder takes whatever fits.

  • One-shot summarization.
  • Lossy — compression is mandatory.

New · pull mode

Decoder pulls information on demand. Encoder keeps everything around.

  • No compression required.
  • Decoder chooses what's relevant per step.

Attention is the mechanism for the pull — a differentiable version of "look up the word I need right now."

Four questions

  1. What does attention look like — literally, as a heatmap?
  2. What are Q, K, V and why "retrieval"?
  3. Why do we divide by √d_k?
  4. What is self-attention and how does it differ from cross-attention?

PART 1

Attention as soft alignment

The heatmap view

What attention looks like · (figure: attention heatmap for an English→French sentence pair, source words on one axis, target words on the other)

Interpretation

Every target word computes a distribution over source words. Darker cell = more attention mass.

Three things to notice:

  1. Rough diagonal — most alignments are monotonic.
  2. "traversé" attends to "cross" — semantic alignment across a language barrier.
  3. "qu'il" attends to BOTH "it" AND "animal" — coreference resolution, emergent from training.

No one told the model what "it" refers to. It learned this by minimizing translation loss. Attention made linguistic structure visible for the first time.

▶ Interactive demo: hover over source tokens and watch the attention weights update in real time.

PART 2

From Bahdanau to QKV

Two parameterizations · one abstraction

A 3×3 worked example · before the math

Source · "the cat slept" Target step · decoding French for "cat" → "chat"

Encoder states (toy 2-d): (the), (cat), (slept)

Decoder state: (about to emit chat)

Raw scores (dot products): , ,

Softmax: "cat" gets the largest weight.

Context:

The decoder reads a weighted mixture, dominated by the relevant word. That's one step of attention.
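A minimal sketch of this single decoding step in PyTorch. The vectors are illustrative assumed values, not the slide's originals; the point is the score → softmax → blend pipeline.

H = torch.tensor([[0.1, 0.2],    # h_1: "the"   (toy 2-d values, assumed)
                  [0.9, 0.8],    # h_2: "cat"
                  [0.3, 0.1]])   # h_3: "slept"
s = torch.tensor([1.0, 0.9])     # decoder state, about to emit "chat"

scores  = H @ s                         # raw dot-product scores, one per source word
alphas  = torch.softmax(scores, dim=0)  # "cat" gets the largest weight
context = alphas @ H                    # weighted mixture of encoder states
print(alphas, context)

(Assumes `import torch`; the same pattern works for any number of source words.)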

Bahdanau (additive) attention · 2014

Analogy · the compatibility test. A plain dot product compares profiles directly. A learned score function instead projects both vectors into a shared compatibility space (W_a s and U_a h), combines them with a non-linearity (tanh), and an "expert" read-out vector v_a turns the result into a single 8/10-style score. More expressive than a raw dot product when the vectors aren't aligned.

Score between decoder state s_t and encoder state h_i:

e_{t,i} = v_aᵀ tanh(W_a s_t + U_a h_i)

Bahdanau · worked numeric, term-by-term

Toy 2-d decoder state s and encoder state h.

Step 1 · project. Compute W_a s and U_a h (each again 2-d).

Step 2 · add + non-linearity. Form tanh(W_a s + U_a h).

Step 3 · final dot product with v_a. The score is v_aᵀ tanh(W_a s + U_a h).

This single number is the alignment score for one decoder-step / encoder-position pair. Compute it for every encoder position i, then softmax. (A numeric sketch with toy values follows.)
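A minimal sketch of the additive score with assumed toy values for s, h, W_a, U_a, v_a; only the structure (project, add, tanh, read out with v_a) mirrors the slide.

import torch

s   = torch.tensor([1.0, 0.5])                  # decoder state (assumed toy values)
h   = torch.tensor([0.2, 0.9])                  # one encoder state
W_a = torch.tensor([[0.5, 0.0], [0.0, 0.5]])    # projects the decoder state
U_a = torch.tensor([[1.0, 0.0], [0.0, 1.0]])    # projects the encoder state
v_a = torch.tensor([1.0, -1.0])                 # read-out vector

hidden = torch.tanh(W_a @ s + U_a @ h)          # project, add, non-linearity
score  = v_a @ hidden                           # single alignment score e_{t,i}
print(score)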

Luong (multiplicative) attention · 2015

A simpler score, no learned MLP:

e_{t,i} = s_tᵀ W_a h_i   (or simply s_tᵀ h_i in the pure dot-product variant)

  • Faster — one matrix multiply instead of a two-layer MLP.
  • Equivalent in expressiveness when you have enough data (the learned projections in QKV absorb the job of Bahdanau's W_a, U_a, v_a).

This is the version Vaswani et al. kept for the Transformer in 2017, with one small but crucial addition: the 1/√d_k scaling.
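For comparison, a sketch of the multiplicative score with the same kind of assumed toy values:

import torch

s   = torch.tensor([1.0, 0.5])                  # decoder state (assumed toy values)
h   = torch.tensor([0.2, 0.9])                  # encoder state
W_a = torch.tensor([[0.5, 0.1], [0.0, 0.5]])    # learned bilinear matrix

score_general = s @ W_a @ h     # Luong "general": one matrix multiply, no MLP
score_dot     = s @ h           # Luong "dot": no learned parameters at all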

Why multiplicative won

Two reasons the ML community moved from Bahdanau to Luong:

  1. Hardware — a dot product is one matrix multiply; GPUs love it. Bahdanau's MLP has element-wise tanh, which is slower per op and harder to batch.
  2. Enough capacity elsewhere — once we added learned Q/K/V projections (next section), we no longer needed a tanh MLP to learn the similarity. The dot product of the projections already gives it.

Pattern in DL · keep the core operation small + fast, push learning into the linear layers around it. This is also why attention beat CNN for sequences — CNN's inductive bias was baked in; attention's bias is learned.

PART 3

Q, K, V · the clean abstraction

Attention as database retrieval

Soft retrieval · picture

Why Q, K, V are three different things · the library analogy

You walk into a library with a question.

  • Query · "I need info on the 2008 financial crisis."
  • Key · the title on each book's spine ("The 2008 Meltdown", "The Great Depression"...).
  • Value · the actual content inside each book.

You use the query to match against keys. The match score decides how much of each book's value (content) you blend into your final answer.

That's it · attention is a soft library lookup. The next slides formalize the math; the analogy is the whole intuition.

The retrieval metaphor

Imagine a Python dictionary lookup:

db = {"cat": "meow", "dog": "bark"}
query = "cat"
result = db[query]      # returns "meow"

Attention is the soft version of this:

  • Query — what you're looking for (from the decoder).
  • Key — what each encoder state announces itself as.
  • Value — what each encoder state actually contains.

Score keys against the query → softmax → use weights to blend values.

Soft retrieval · why three roles not one

You could imagine an attention mechanism where keys and values are the same vectors. Early models did exactly this (Luong 2015 scores against the encoder states and blends those same states). So why separate them?

Keys say "what I am"; values say "what I contribute."

A word like "bank" in a sentence should be found by the query "financial", but contribute the full contextual embedding. Keys for retrieval, values for content.

  • Keys are optimized for similarity with plausible queries.
  • Values are optimized to carry whatever downstream layers need.

Separating them roughly doubles the key/value parameter count of a head, but it lets "how findable am I?" and "what do I contribute?" be optimized independently.

The Python-dict mental model · extended

import torch

# Hard retrieval: an exact key match returns exactly one value
db  = {"cat": "meow", "dog": "bark", "cow": "moo"}
out = db["cat"]                                            # exact match → one value

# Soft retrieval (attention): every value contributes, weighted by similarity
k1, k2, k3 = torch.tensor([0.9, 0.1]), torch.tensor([0.2, 0.8]), torch.tensor([0.5, 0.5])
v1, v2, v3 = torch.tensor([1.0, 2.0]), torch.tensor([3.0, 4.0]), torch.tensor([5.0, 6.0])
query = torch.tensor([1.0, 0.0])

scores  = torch.stack([query @ k for k in (k1, k2, k3)])   # similarities
weights = torch.softmax(scores, dim=0)                     # probabilities
out     = weights[0]*v1 + weights[1]*v2 + weights[2]*v3    # blend of values

Attention is differentiable dictionary lookup. The network's parameters shape what "sim" means and what each key/value represents. Everything else is the soft version of db[query].

QKV · the computation

Scaled dot-product · step by step

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

Scaled dot-product · worked numeric (4 steps)

Tiny example · 2 tokens, d_k = 2, so the scale factor is √2 ≈ 1.41.

Step 1 · scores. Q Kᵀ is a 2×2 matrix of dot products — token 1's query matches token 2 better (score 2 vs score 1).

Step 2 · scale. Divide every score by √2: token 1's row becomes (0.71, 1.41).

Step 3 · row-wise softmax.
Row 1: softmax(0.71, 1.41) ≈ (0.33, 0.67)
Row 2: computed the same way from token 2's scaled scores.

Step 4 · weighted sum. Each output row is that row's weights times V.

Output for token 1 is 33% of v_1 + 67% of v_2. (The snippet below reproduces these numbers.)
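A minimal PyTorch check of the same four steps. The Q, K, V values are illustrative picks (not from the slides), chosen so that token 1's raw scores come out as 1 and 2.

import torch
import torch.nn.functional as F

Q = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0]])
K = torch.tensor([[1.0, 1.0],
                  [2.0, 0.5]])
V = torch.tensor([[10.0, 0.0],
                  [ 0.0, 10.0]])

d_k = Q.size(-1)
scores  = Q @ K.T / d_k ** 0.5          # steps 1 + 2: score and scale
weights = F.softmax(scores, dim=-1)     # step 3: row-wise softmax
out     = weights @ V                   # step 4: blend the values

print(weights[0])   # ≈ tensor([0.33, 0.67]): matches the worked example's row 1
print(out[0])       # ≈ 0.33 * v_1 + 0.67 * v_2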

One actor, three roles

Analogy. Eddie Murphy in The Nutty Professor — same person (the input vector x), different costumes and makeup (the projection matrices W_Q, W_K, W_V), three characters.

  • Query role · "I'm the hero — what do I need?"
  • Key role · "Here's what I am."
  • Value role · "Here's the info I have to offer."

The network learns the costumes (the W_Q, W_K, W_V matrices) that make attention work.

QKV · projection worked example

Input vector for "cat" · x ∈ ℝ^{d_in}. Project to d_k dimensions via three learned matrices:

q = x W_Q,   k = x W_K,   v = x W_V

The single input now plays three roles: query q, key k, value v.

The full Q, K, V matrices are this calculation repeated for every token (stack the x's into X and multiply once). PyTorch:

import math
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    def __init__(self, d_in, d_k):
        super().__init__()
        # Three learned projections: same input, three roles
        self.Wq = nn.Linear(d_in, d_k, bias=False)
        self.Wk = nn.Linear(d_in, d_k, bias=False)
        self.Wv = nn.Linear(d_in, d_k, bias=False)

    def forward(self, x):
        # x: (..., seq_len, d_in)
        Q, K, V = self.Wq(x), self.Wk(x), self.Wv(x)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # scaled dot-product
        weights = scores.softmax(dim=-1)                          # each row sums to 1
        return weights @ V                                        # blend of values
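A quick shape check, as a sketch; the sizes and batch below are arbitrary choices, not from the slides.

head = AttentionHead(d_in=32, d_k=8)
x = torch.randn(4, 10, 32)        # (batch, seq_len, d_in), arbitrary toy sizes
out = head(x)
print(out.shape)                  # torch.Size([4, 10, 8]): one d_k-dim output per token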

PART 4

Why √d_k?

The scaling that makes attention work

Why softmax can get "spiky"

Analogy · grading on a curve.

  • Scores 1–10 · 7 vs 8 are close. Curve is smooth.
  • Scores 1–1000 · 700 vs 800 are worlds apart. The 800 wins; everyone else gets ~0%. Curve is spiky.

Unscaled dot products behave like the second case as d_k grows. The 1/√d_k scaling factor brings us back to the first.

Variance of unscaled dot products

For q, k ∈ ℝ^{d_k} with i.i.d. zero-mean, unit-variance entries, the raw score is q · k = Σᵢ qᵢ kᵢ.

Variance of a sum of independent terms = sum of variances: Var(q · k) = Σᵢ Var(qᵢ kᵢ) = d_k.

(For two independent zero-mean unit-variance variables, Var(qᵢ kᵢ) = Var(qᵢ) · Var(kᵢ) = 1.)

Standard deviation is √d_k → typical score magnitudes scale like √d_k.

With d_k = 256, say, raw scores have standard deviation 16. Softmax of scores that large → nearly one-hot. Scale by 1/√d_k → variance back to 1.
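A quick empirical check of the variance claim; the d_k and sample count below are arbitrary choices.

import torch

d_k = 256
q = torch.randn(50_000, d_k)            # i.i.d. standard normal entries
k = torch.randn(50_000, d_k)
raw = (q * k).sum(dim=-1)               # 50k raw dot products

print(raw.var())                        # ≈ d_k = 256
print((raw / d_k ** 0.5).var())         # ≈ 1 after scaling by √d_k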

Worked numeric · the scaling factor in action

Take d_k = 256, so √d_k = 16. A typical raw-score vector for one query then has entries spread over a range of tens.

Without scaling. The gaps between scores are also of order 16, so softmax puts essentially all its mass on the largest score — essentially one-hot → gradient ≈ 0 → learning stalls.

With scaling. Divide every score by 16: the entries land back in the ±1–2 range, the softmax output stays soft, and gradients flow.

Numeric demo · softmax at different scales

Take a handful of raw logits and watch the softmax output as their scale (temperature) changes:

temperature | softmax output
logits / 1  | (0.58, 0.21, 0.13, …) — soft
logits / 4  | (0.40, 0.30, 0.26, …) — very soft
logits × 10 | (0.9999, 4e-5, 2e-7) — one-hot

Dot products without scaling behave like the bottom row — effectively one-hot. Divide by √d_k and you land back on the top row. The √d_k scaling plays exactly the role of a temperature denominator, derived from a variance analysis rather than tuned by hand.
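To reproduce this behaviour with any logits you like (the logits below are an assumption, not necessarily the table's):

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, 0.0])

print(F.softmax(logits / 1, dim=0))    # soft: mass spread over several entries
print(F.softmax(logits / 4, dim=0))    # very soft: nearly uniform
print(F.softmax(logits * 10, dim=0))   # essentially one-hot: winner takes all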

In pictures · (figure: softmax outputs with and without the √d_k scaling)

Why one-hot is bad

If softmax outputs are (nearly) one-hot, then attention picks one encoder state and ignores the rest.

Two consequences:

  1. Information from other positions is discarded.
  2. Gradient through the softmax is near zero (saturation) → training stalls.

The fix — divide the scores by √d_k:

Scores stay in a healthy range → softmax stays soft → gradients keep flowing.

⚠️ optional · The derivation in three lines

Assume the entries of q and k are i.i.d. with zero mean and unit variance. Then E[q · k] = 0 and Var(q · k) = Σᵢ Var(qᵢ kᵢ) = d_k.

So std(q · k) = √d_k. Dividing the scores by √d_k makes their variance 1 — independent of dimension.

Why this matters · the same attention block can be used at small or large d_k without retuning any temperature. The scaling is dimension-invariant by construction.

PART 5

Self-attention vs cross-attention

Same machinery · different sources for QKV

Cross-attention · the decoder reads the encoder

In a Seq2Seq + attention model:

  • Queries come from the decoder (current target position).
  • Keys and values come from the encoder (all source positions).

This is what the first attention heatmap showed — one distribution per target step over source positions.

Self-attention · the "bank" disambiguation

How do you know "bank" means a financial institution in "the bank approved the loan"?

You look at the other words. "Loan" tells you which bank.

In "he sat on the river bank" · the word "river" tells you the other meaning.

Self-attention is exactly this · every word looks at every other word in the sentence to figure out its context-specific meaning. The famous "the animal didn't cross the street because it was too tired" example needs this · attention disambiguates "it" by attending to "animal."

Self-attention · a sequence attends to itself

Same operation. Now

Q = X W_Q,   K = X W_K,   V = X W_V

and all three come from the same input sequence X.

Self-attention lets every position in a sequence aggregate information from every other position — in parallel, with no recurrence.

This is the idea that killed RNNs and gave us Transformers (L13).

Self-attention · 3 tokens, by hand

Sentence · "the cat slept" token embeddings .

After projecting: are each a matrix.

Row says: how much does token want to look at every other token?

Softmax per row → 3×3 attention matrix . Output is again .

Every output row is a weighted blend of value rows — a contextualized embedding for that token.
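The same 3-token computation written out with raw tensor ops. Embeddings and projections are random placeholders, so only the shapes (and the fact that every row of A sums to 1) are meaningful.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(3, 4)                        # 3 token embeddings, toy dimension 4
Wq, Wk, Wv = (torch.randn(4, 2) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv             # each is 3 × 2
A = F.softmax(Q @ K.T / 2 ** 0.5, dim=-1)    # 3 × 3 attention matrix, rows sum to 1
out = A @ V                                  # 3 × 2: one contextualized embedding per token

print(A)            # row i = how much token i attends to each token
print(out.shape)    # torch.Size([3, 2])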

Self-attention vs convolution · same goal, different bias

Convolution

  • Each output depends on a fixed local window.
  • Inductive bias: locality.
  • Parameters: shared kernel.

Self-attention

  • Each output depends on all positions.
  • Inductive bias: learned from data.
  • Parameters: Q, K, V projections.

Convolution bakes in "nearby tokens matter"; self-attention lets the network decide from data whether nearby or far-away tokens matter. When you have lots of data, learned bias wins over hand-designed bias. That's the whole arc of the 2017–2025 vision revolution in one sentence.

Causal mask · 5×5 grid (figure: lower-triangular mask — each position attends only to itself and earlier positions)

Causal self-attention · don't peek at the future

When writing the next word of "The quick brown fox jumps…", you can only use what you've already written. A causal mask is like covering the future with cardboard — for "fox", you see "The quick brown fox" but everything after is hidden.

import math
import torch

# Q, K: (n, d_k) projections of the same n-token sequence
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)     # (n, n)
mask   = torch.triu(torch.ones_like(scores), diagonal=1).bool()
scores.masked_fill_(mask, float('-inf'))              # future positions → -inf
weights = scores.softmax(dim=-1)                      # rows still sum to 1

That's the only difference between BERT-style (bidirectional) and GPT-style (causal) attention. Same module, different mask.

Worked numeric · how the −∞ mask works

Raw scores for 3 tokens (scaling omitted for simplicity): a full 3×3 matrix of query–key dot products.

Step 1 · mask the upper triangle. Every entry where the key position lies after the query position is set to −∞.

Step 2 · row-wise softmax. Since e^(−∞) = 0:

  • Row 1: all weight on token 1.
  • Row 2: weight split between tokens 1 and 2; zero on token 3.
  • Row 3: weight spread over all three tokens.

The −∞ guarantees future tokens get exactly zero weight after softmax. Token 1 sees only itself; token 2 sees tokens 1–2; token 3 sees all.
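A sketch that verifies the mask behaviour on an assumed 3×3 score matrix:

import torch
import torch.nn.functional as F

scores = torch.tensor([[1.0, 2.0, 0.5],       # assumed raw scores, not the slide's
                       [0.5, 1.5, 1.0],
                       [2.0, 0.5, 1.0]])

mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
weights = F.softmax(scores.masked_fill(mask, float('-inf')), dim=-1)

print(weights)
# row 1: [1, 0, 0]          token 1 sees only itself
# row 2: [·, ·, 0]          token 2 splits weight over tokens 1–2
# row 3: all three nonzero  token 3 sees everything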

Complexity · the O(n²) wall

Self-attention on a sequence of length n:

  • Q Kᵀ builds an n × n matrix → O(n²) memory and compute.
  • Double the context → 4× the cost.

At a 4k-token context in fp32, one head's attention matrix is already 64 MB per layer (quick check after the list below). Scaling context to 1M tokens naively would need 500 GB per layer. This is the wall that motivates:

  • FlashAttention (L23) · recompute attention in tiles, avoiding the full matrix.
  • Sparse / local / linear attention (reading) · trade off quality for or .
  • KV caching (L23) · don't redo the whole computation at every generation step.
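A back-of-the-envelope check of the quadratic growth, assuming one fp32 score matrix per head (the context lengths are illustrative picks):

def attn_matrix_mib(n, bytes_per_elem=4):
    """Memory for one n × n fp32 attention score matrix, in MiB."""
    return n * n * bytes_per_elem / 2**20

print(attn_matrix_mib(4096))      # 64.0: one head, one layer, 4k context
print(attn_matrix_mib(8192))      # 256.0: doubling the context quadruples it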

The four kinds of attention you will meet

Type | Q comes from | K, V come from | Used in
Encoder self-attn | encoder | encoder | Transformer encoder
Decoder self-attn (causal) | decoder (past only) | decoder (past only) | GPT, Transformer decoder
Cross-attention | decoder | encoder | Seq2Seq decoder, translation
Masked/local attention | input | input (masked) | Longformer, Reformer, etc.

PART 6

What attention unlocked

One slide of consequences

Why attention was such a big deal

  1. Bottleneck solved · no more "fit everything into 512 dims."
  2. Long-range dependencies · every target step can see any source step.
  3. Parallelizable · unlike RNNs, all attention scores can be computed at once.
  4. Interpretable · attention heatmaps are the first DL visualization tool that's actually informative.
  5. Transfer · attention blocks compose cleanly — stack them, mix them with cross-attention, make them multi-headed. The Transformer (L13) is exactly this.

Without attention · no Transformer · no BERT · no GPT · no Claude · no diffusion text conditioning. A single 2014 paper seeded the next decade.

"Attention Is All You Need" · the 2017 pivot

Vaswani et al.'s one-page insight · drop the RNN entirely, use only attention plus FFNs plus positional encodings.

Before: encoders and decoders were RNNs with attention as a helper. After: attention was the load-bearing operation; RNN was gone.

Consequence table:

Axis | RNN + attention | Transformer
Sequential compute | yes (unrolled over time) | no (fully parallel)
Long-range path | O(n) hops | O(1) hop
Training throughput | slow | 10–20× faster
Scaling | plateaus at ~1B params | trained to 1T+ params

Every major model since 2018 (BERT, GPT-*, T5, Claude, Llama) is this architecture, plus or minus details. L13 builds it from parts.

Lecture 12 — summary

  • Attention = soft retrieval · each query selects a weighted combination of values based on similarity to keys.
  • Bahdanau (additive, 2014) and Luong (multiplicative, 2015) are two parameterizations — we use Luong's dot-product form in Transformers.
  • QKV abstraction · Q, K, V are learned projections of the same input; the network decides what each role should be.
  • √d_k scaling — without it, softmax collapses to one-hot at large d_k and gradients die.
  • Self-attention · Q, K, V from the same sequence; parallel, long-range, interpretable.
  • This unlocked the Transformer — next lecture.

Read before Lecture 13

Prince Ch 12 mid-sections (Transformer block).

Next lecture

The Transformer — built live. Multi-head attention · positional encoding · residual + LayerNorm · the full encoder-decoder stack.

Notebook 12 · 12-attention-nmt.ipynb — add attention to Lecture 11's Seq2Seq; visualize attention heatmaps; watch BLEU improve on long sentences.