After each attention sublayer sits a two-layer MLP with a massive hidden size (4× d_model): FFN(x) = W₂ · GELU(W₁x + b₁) + b₂.
~⅔ of Transformer parameters live in the FFN, not in attention. Attention mixes tokens; the FFN transforms each token independently with huge capacity. Recent interpretability work (Anthropic) shows FFN layers store facts and concepts; attention layers route information between them.
GELU activation (smoother than ReLU) is the standard choice. Llama 2+ uses SwiGLU, a slightly better variant.
```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x, mask=None):
        # Pre-norm, then attention, then residual
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a
        # Pre-norm, then FFN, then residual
        x = x + self.ffn(self.norm2(x))
        return x
```
That's the Transformer. Everything else is plumbing around this.
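As an aside on the SwiGLU variant mentioned above · a minimal sketch of a Llama-style gated FFN. This is an illustration under assumptions (bias-free projections, a caller-chosen d_ff); exact hidden sizes and rounding conventions differ between models.

```python
import torch.nn as nn


class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)  # gating branch
        self.up = nn.Linear(d_model, d_ff, bias=False)    # value branch
        self.down = nn.Linear(d_ff, d_model, bias=False)  # project back to d_model
        self.act = nn.SiLU()

    def forward(self, x):
        # SwiGLU: SiLU(gate(x)) elementwise-multiplies up(x), then project down
        return self.down(self.act(self.gate(x)) * self.up(x))
```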
Why one head is never enough
Setup · 1 sentence, 3 tokens, d_model = 4, so x.shape = (1, 3, 4).
Split into 2 heads · reshape (1, 3, 4) → (1, 3, 2, 2), then transpose to (1, 2, 3, 2): two heads, each seeing all 3 tokens at d_k = 2. Within a head, each token's query is scored against the other tokens' keys; head 1 and head 2 each produce their own raw scores and their own (3, 2) output. Stack the heads: (1, 2, 3, 2). Concatenate back: (1, 3, 4). Final projection: (1, 3, 4). Output shape is identical to input, so the block is composable: stack as many as you want. (Shape trace in the sketch below.)
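A quick shape trace of that walkthrough (a sketch with random weights; only the shapes matter, the values are illustrative):

```python
import torch

B, N, d, n_heads = 1, 3, 4, 2
d_k = d // n_heads
x = torch.randn(B, N, d)                                  # (1, 3, 4)

Wq, Wk, Wv, Wo = (torch.randn(d, d) for _ in range(4))    # illustrative random projections
q = (x @ Wq).view(B, N, n_heads, d_k).transpose(1, 2)     # (1, 2, 3, 2)
k = (x @ Wk).view(B, N, n_heads, d_k).transpose(1, 2)
v = (x @ Wv).view(B, N, n_heads, d_k).transpose(1, 2)

scores = q @ k.transpose(-2, -1) / d_k ** 0.5             # (1, 2, 3, 3): raw scores per head
heads = scores.softmax(-1) @ v                            # (1, 2, 3, 2): each head's output
out = heads.transpose(1, 2).reshape(B, N, d) @ Wo         # concat → (1, 3, 4), project → (1, 3, 4)
print(q.shape, scores.shape, heads.shape, out.shape)
```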
A single attention head must average over all kinds of relationships at once · subject-verb, pronoun-antecedent, adjective-noun, syntax, semantics.
Multi-head attention is a team of specialists running in parallel · one head specializes in syntax, another in coreference, another in long-range dependencies.
After each head computes its own answer, the outputs are concatenated and projected · the network learns the right division of labor among heads.
A single attention head has to choose one distribution over positions per query. But real language has multiple relations to track at once: subject-verb agreement, pronoun-antecedent links, adjective-noun modification, and longer-range dependencies.
Multiple heads = multiple "attention circuits" running in parallel. Each head specializes in a different kind of relationship.
Empirically, 8 or 16 heads is standard. Increasing beyond that has diminishing returns: each head's dimension d_k = d_model / n_heads shrinks, and heads that are too narrow stop being useful.
Block params depend on d_model and d_ff, not on sequence length. The table below uses d_model = 512 and d_ff = 4 × 512 = 2048.
Attention. Four matrices (W_Q, W_K, W_V, W_O), each d_model × d_model → 4 · d_model² weights.
FFN. Two layers · d_model × d_ff up-projection and d_ff × d_model down-projection → 2 · d_model · d_ff weights.
LayerNorm. Each LN has scale + shift = 2 · d_model parameters; the block has two of them.
(Number of heads doesn't change the count: the same matrices are just reshaped per head. Biases are ignored here.)
| Component | Calculation | Params |
|---|---|---|
| Attention | 4 × 512² | 1,048,576 (~33%) |
| FFN | 2 × 512 × 2048 | 2,097,152 (~66%) |
| LayerNorm × 2 | 2 × 2 × 512 | 2,048 (<0.1%) |
| Total | | 3,147,776 ≈ 3.15M |
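A quick sanity check of these numbers (weights only, biases ignored):

```python
d_model, d_ff = 512, 2048
attn = 4 * d_model * d_model              # W_Q, W_K, W_V, W_O
ffn = 2 * d_model * d_ff                  # up- and down-projection
ln = 2 * (2 * d_model)                    # two LayerNorms, scale + shift each
print(attn, ffn, ln, attn + ffn + ln)     # 1048576 2097152 2048 3147776
```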
Conclusion · the FFN ("thinking") uses 2× the parameters of attention ("communication").
Anthropic interpretability work · FFN layers store facts and concepts; attention layers route information between them. Different roles, different param budgets.
Attention params are independent of sequence length: the same weights process 10 or 10,000 tokens. Combined with the fact that all positions are processed in parallel, this is a big scaling advantage over RNNs, which step through tokens one at a time.
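A quick check of that claim, reusing the TransformerBlock defined above (a sketch; the two sequence lengths are arbitrary):

```python
import torch

block = TransformerBlock(d_model=512, n_heads=8, d_ff=2048)
short = block(torch.randn(1, 10, 512))     # 10 tokens
long = block(torch.randn(1, 1000, 512))    # 1,000 tokens, same weights
print(short.shape, long.shape)             # (1, 10, 512) and (1, 1000, 512)
```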
```python
import math
import torch
import torch.nn as nn

# PyTorch gives you this in one line (note the argument is embed_dim, not d_model):
self.attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

# By hand, the core operation is:
def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    B, N, d = x.shape
    d_k = d // n_heads
    # Project and reshape: (B, N, d) → (B, n_heads, N, d_k)
    q = (x @ Wq).view(B, N, n_heads, d_k).transpose(1, 2)
    k = (x @ Wk).view(B, N, n_heads, d_k).transpose(1, 2)
    v = (x @ Wv).view(B, N, n_heads, d_k).transpose(1, 2)
    # Scaled dot-product attention per head, then concatenate
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = scores.softmax(dim=-1)
    out = (weights @ v).transpose(1, 2).contiguous().view(B, N, d)
    return out @ Wo
```
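A quick usage check of the hand-rolled version (random weights, illustrative scale):

```python
import torch

d, n_heads = 512, 8
x = torch.randn(2, 16, d)                                          # batch of 2, 16 tokens each
Wq, Wk, Wv, Wo = (torch.randn(d, d) * d ** -0.5 for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads).shape)      # torch.Size([2, 16, 512])
```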
Telling the model "this is position 7"
Attention with no positional info:
"Dog bites man" → same attention weights as
"Man bites dog"
Both have the same tokens, just reordered. Attention computes a weighted sum over a set of tokens; it has no built-in notion of order, so permuting the input only permutes the output.
We need to inject position information into the token embeddings. The Transformer's choice was sinusoidal — a specific multi-scale "clock" vector added to each token's embedding.
Imagine encoding position with a clock that has many hands, some spinning fast and some slowly.
Reading all hands gives a unique signature for each position. To get the signature for position + 1, you just rotate each hand a fixed amount — easy for the model to learn the relative offset.
Sinusoidal encoding is a high-dimensional version of this clock. For one pair of dimensions $(2i, 2i{+}1)$, position $pos$ is encoded as

$$PE_{pos,\,2i} = \sin(\omega_i \, pos), \qquad PE_{pos,\,2i+1} = \cos(\omega_i \, pos), \qquad \omega_i = 10000^{-2i/d_{\text{model}}}.$$

What about position $pos + k$? Letting the angle-addition formulas do the work,

$$\begin{pmatrix} \sin(\omega_i (pos+k)) \\ \cos(\omega_i (pos+k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix}$$

A 2D rotation matrix that depends only on the offset $k$, never on the absolute position: "move forward by k" is always the same linear map, exactly the "rotate each hand by a fixed amount" property of the clock. A worked numeric check is sketched below.
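A minimal numeric sketch of the encoding itself, following the Vaswani 2017 formula (dimensions kept tiny so the "clock reading" is easy to inspect):

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    freq = torch.exp(-torch.arange(0, d_model, 2, dtype=torch.float32) * math.log(10000.0) / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

pe = sinusoidal_pe(max_len=64, d_model=8)
print(pe[7])   # the unique 8-number "clock reading" for position 7
```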
Learned embeddings work too but are simply undefined past the training length. Sinusoidal encoding is defined for any position, so it can at least extrapolate in principle.
| Method | How | Used in |
|---|---|---|
| Sinusoidal (Vaswani 2017) | fixed sin/cos | original Transformer |
| Learned | nn.Embedding(max_len, d_model) | BERT, GPT-2, GPT-3 |
| RoPE (Su 2021) | rotate Q and K by position-dependent angle | Llama, Mistral, PaLM, modern LLMs |
| ALiBi (Press 2021) | bias attention scores by relative distance | OPT-175B, some variants |
2026 · RoPE dominates new LLMs. We'll cover it in L15 (LLMs). For now, any of the four works — pick the one that matches your base model.
Encoder · decoder · causal mask
| Model | What it is | Use case |
|---|---|---|
| Encoder-only (BERT) | stack of encoder blocks | classification, embedding, retrieval |
| Decoder-only (GPT, Llama, Claude) | stack of decoder blocks, no cross-attn | autoregressive generation |
| Encoder-decoder (T5, BART) | both, with cross-attention | translation, summarization |
In 2026, decoder-only dominates LLMs. Encoder-only ships in retrieval pipelines. Encoder-decoder survives for translation-style tasks.
Decoder predicting "The quick brown fox ___" (answer: "jumps"). During training the whole sentence is fed; when predicting position 5, the model must not see token 5.
Causal mask · add −∞ to every attention score that looks at a future position, before the softmax, so those positions get weight exactly 0.
Worked numeric. Token 3's pre-softmax scores over positions 1–4 might be, say, (2.0, 0.5, 1.0, 3.0).
Apply mask → (2.0, 0.5, 1.0, −∞).
Weights after softmax ≈ (0.63, 0.14, 0.23, 0.00).
Token 4's weight is exactly 0: the cheat is closed off.
```python
import torch

# Causal mask for sequence length N
N = 128
mask = torch.triu(torch.ones(N, N), diagonal=1).bool()  # upper-triangular, excluding diagonal
# [[F, T, T, T, ...],
#  [F, F, T, T, ...],
#  [F, F, F, T, ...],
#  ...]
# True = mask (set score to -inf)

# In MultiheadAttention, mask=True means "block this position"
out, _ = self.attn(x, x, x, attn_mask=mask)
```
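And the worked numbers above fall straight out of the same masking idea (score values illustrative):

```python
import torch

scores = torch.tensor([2.0, 0.5, 1.0, 3.0])    # token 3's raw scores over positions 1-4
future = torch.tensor([False, False, False, True])
masked = scores.masked_fill(future, float("-inf"))
print(masked.softmax(dim=-1))                  # ≈ tensor([0.63, 0.14, 0.23, 0.00])
```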
Karpathy nanoGPT in 80 lines
```python
import torch
import torch.nn as nn


class GPT(nn.Module):
    def __init__(self, vocab, d_model=192, n_heads=6, n_layers=6, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList([TransformerBlock(d_model, n_heads, 4 * d_model)
                                     for _ in range(n_layers)])
        self.norm_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab, bias=False)

    def forward(self, idx):
        B, N = idx.shape
        pos = torch.arange(N, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)        # add learned positional embedding
        mask = torch.triu(torch.ones(N, N, device=idx.device), diagonal=1).bool()
        for block in self.blocks:
            x = block(x, mask=mask)
        x = self.norm_f(x)
        return self.head(x)                              # logits over vocab
```
Train this on Tiny Shakespeare → a working generator of Shakespeare-style text. Seriously.
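A minimal training-loop sketch, assuming data is a 1-D LongTensor of token ids from Tiny Shakespeare and vocab is its vocabulary size (both placeholders here, not defined above):

```python
import torch
import torch.nn.functional as F

model = GPT(vocab)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(5000):
    ix = torch.randint(len(data) - 257, (64,))             # 64 random chunk starts
    batch = torch.stack([data[i:i + 257] for i in ix])     # 257 tokens: 256 inputs + shifted targets
    x, y = batch[:, :-1], batch[:, 1:]
    logits = model(x)                                      # (64, 256, vocab)
    loss = F.cross_entropy(logits.reshape(-1, vocab), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```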
In the original Vaswani Transformer the decoder has three sublayers, not two: masked self-attention over the target generated so far, cross-attention over the encoder's output, and the FFN.
Cross-attention is the Bahdanau-attention mechanism from L12, with learned Q/K/V projections. The encoder produces a rich representation of the source; the decoder queries it at every step.
GPT and Llama drop the encoder and cross-attention entirely — decoder-only. T5 keeps them for translation. Stable Diffusion uses cross-attention to inject text conditioning into images (L22).
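A minimal sketch of that cross-attention call with nn.MultiheadAttention (tensor shapes and names are illustrative):

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

enc_out = torch.randn(1, 20, d_model)   # encoder's representation of a 20-token source
dec_x = torch.randn(1, 5, d_model)      # decoder states for the 5 target tokens so far

# Q from the decoder, K and V from the encoder output
out, attn_weights = cross_attn(query=dec_x, key=enc_out, value=enc_out)
print(out.shape)                        # torch.Size([1, 5, 512]): one vector per target token
```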
| Variation | Change | Seen in |
|---|---|---|
| Pre-norm (vs post-norm) | normalize before sublayer | GPT-2+, Llama, Claude |
| SwiGLU FFN (vs ReLU) | SiLU + gating | Llama 2+ |
| RoPE (vs sinusoidal PE) | rotate Q, K per position | Llama, Mistral, PaLM |
| GQA (vs MHA) | fewer KV heads than Q heads | Llama 2 70B+ |
| RMSNorm (vs LayerNorm) | drop mean centering | Llama, Mistral |
| Parallel attention + FFN | attn and FFN run in parallel, not sequentially | GPT-J, PaLM |
Each tweak is small (a 0.1–1% win). Stacked, they define a "2026 default Transformer" that looks quite different from Vaswani 2017 in its details but is identical in structure.
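As one example of how small these tweaks are in code, a minimal RMSNorm sketch (scale-only normalization in the Llama style; the eps value is illustrative):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))   # scale only, no shift
        self.eps = eps

    def forward(self, x):
        # Normalize each token vector by its root-mean-square; no mean subtraction
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.weight * (x / rms)
```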
Top 5 issues to check:
Weight tying · self.head.weight = self.tok_emb.weight reduces params by ~25% and usually helps.

Karpathy's "most common deep-learning bug" list puts attention-mask bugs at the top. Every implementation has one that costs a week of debugging.