After each attention sublayer sits a two-layer MLP with a massive hidden size (4× d_model): FFN(x) = W₂ · GELU(W₁x + b₁) + b₂.
~⅔ of Transformer parameters live in the FFN, not in attention. Attention mixes tokens; the FFN transforms each token independently with huge capacity. Recent interpretability work (Anthropic) shows FFN layers store facts and concepts; attention layers route information between them.
GELU activation (smoother than ReLU) is the standard choice. Llama 2+ uses SwiGLU, a slightly better variant.
```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x, mask=None):
        # Pre-norm, then attention, then residual
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a
        # Pre-norm, then FFN, then residual
        x = x + self.ffn(self.norm2(x))
        return x
```
That's the Transformer. Everything else is plumbing around this.
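As an aside on the SwiGLU variant mentioned above · a minimal sketch of a Llama-style gated FFN. This is an illustration under assumptions (bias-free projections, a caller-chosen d_ff); exact hidden sizes and rounding conventions differ between models.

```python
import torch.nn as nn


class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)  # gating branch
        self.up = nn.Linear(d_model, d_ff, bias=False)    # value branch
        self.down = nn.Linear(d_ff, d_model, bias=False)  # project back to d_model
        self.act = nn.SiLU()

    def forward(self, x):
        # SwiGLU: SiLU(gate(x)) elementwise-multiplies up(x), then project down
        return self.down(self.act(self.gate(x)) * self.up(x))
```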
Why one head is never enough
Setup · 1 sentence, 3 tokens, d_model = 4, so x.shape = (1, 3, 4).
Split into 2 heads · reshape (1, 3, 4) → (1, 3, 2, 2), then transpose to (1, 2, 3, 2): two heads, each seeing all 3 tokens at d_k = 2. Within a head, each token's query is scored against the other tokens' keys; head 1 and head 2 each produce their own raw scores and their own (3, 2) output. Stack the heads: (1, 2, 3, 2). Concatenate back: (1, 3, 4). Final projection: (1, 3, 4). Output shape is identical to input, so the block is composable: stack as many as you want. (Shape trace in the sketch below.)
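A quick shape trace of that walkthrough (a sketch with random weights; only the shapes matter, the values are illustrative):

```python
import torch

B, N, d, n_heads = 1, 3, 4, 2
d_k = d // n_heads
x = torch.randn(B, N, d)                                  # (1, 3, 4)

Wq, Wk, Wv, Wo = (torch.randn(d, d) for _ in range(4))    # illustrative random projections
q = (x @ Wq).view(B, N, n_heads, d_k).transpose(1, 2)     # (1, 2, 3, 2)
k = (x @ Wk).view(B, N, n_heads, d_k).transpose(1, 2)
v = (x @ Wv).view(B, N, n_heads, d_k).transpose(1, 2)

scores = q @ k.transpose(-2, -1) / d_k ** 0.5             # (1, 2, 3, 3): raw scores per head
heads = scores.softmax(-1) @ v                            # (1, 2, 3, 2): each head's output
out = heads.transpose(1, 2).reshape(B, N, d) @ Wo         # concat → (1, 3, 4), project → (1, 3, 4)
print(q.shape, scores.shape, heads.shape, out.shape)
```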
A single attention head must average over all kinds of relationships at once · subject-verb, pronoun-antecedent, adjective-noun, syntax, semantics.
Multi-head attention is a team of specialists running in parallel · one head specializes in syntax, another in coreference, another in long-range dependencies.
After each head computes its own answer, the outputs are concatenated and projected · the network learns the right division of labor among heads.
A single attention head has to choose one distribution over positions per query. But real language has multiple relations to track at once: subject-verb agreement, pronoun-antecedent links, adjective-noun modification, and longer-range dependencies.
Multiple heads = multiple "attention circuits" running in parallel. Each head specializes in a different kind of relationship.
Empirically, 8 or 16 heads is standard. Increasing beyond that has diminishing returns: each head's dimension d_k = d_model / n_heads shrinks, and heads that are too narrow stop being useful.
Block params depend on d_model and d_ff, not on sequence length. The table below uses d_model = 512 and d_ff = 4 × 512 = 2048.
Attention. Four matrices (W_Q, W_K, W_V, W_O), each d_model × d_model → 4 · d_model² weights.
FFN. Two layers · d_model × d_ff up-projection and d_ff × d_model down-projection → 2 · d_model · d_ff weights.
LayerNorm. Each LN has scale + shift = 2 · d_model parameters; the block has two of them.
(Number of heads doesn't change the count: the same matrices are just reshaped per head. Biases are ignored here.)
| Component | Calculation | Params |
|---|---|---|
| Attention | 4 × 512² | 1,048,576 (~33%) |
| FFN | 2 × 512 × 2048 | 2,097,152 (~66%) |
| LayerNorm × 2 | 2 × 2 × 512 | 2,048 (<0.1%) |
| Total | | 3,147,776 ≈ 3.15M |
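A quick sanity check of these numbers (weights only, biases ignored):

```python
d_model, d_ff = 512, 2048
attn = 4 * d_model * d_model              # W_Q, W_K, W_V, W_O
ffn = 2 * d_model * d_ff                  # up- and down-projection
ln = 2 * (2 * d_model)                    # two LayerNorms, scale + shift each
print(attn, ffn, ln, attn + ffn + ln)     # 1048576 2097152 2048 3147776
```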
Conclusion · the FFN ("thinking") uses 2× the parameters of attention ("communication").
Anthropic interpretability work · FFN layers store facts and concepts; attention layers route information between them. Different roles, different param budgets.
Attention params are independent of sequence length: the same weights process 10 or 10,000 tokens. Combined with the fact that all positions are processed in parallel, this is a big scaling advantage over RNNs, which step through tokens one at a time.
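A quick check of that claim, reusing the TransformerBlock defined above (a sketch; the two sequence lengths are arbitrary):

```python
import torch

block = TransformerBlock(d_model=512, n_heads=8, d_ff=2048)
short = block(torch.randn(1, 10, 512))     # 10 tokens
long = block(torch.randn(1, 1000, 512))    # 1,000 tokens, same weights
print(short.shape, long.shape)             # (1, 10, 512) and (1, 1000, 512)
```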
```python
import math
import torch
import torch.nn as nn

# PyTorch gives you this in one line (note the argument is embed_dim, not d_model):
self.attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

# By hand, the core operation is:
def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    B, N, d = x.shape
    d_k = d // n_heads
    # Project and reshape: (B, N, d) → (B, n_heads, N, d_k)
    q = (x @ Wq).view(B, N, n_heads, d_k).transpose(1, 2)
    k = (x @ Wk).view(B, N, n_heads, d_k).transpose(1, 2)
    v = (x @ Wv).view(B, N, n_heads, d_k).transpose(1, 2)
    # Scaled dot-product attention per head, then concatenate
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = scores.softmax(dim=-1)
    out = (weights @ v).transpose(1, 2).contiguous().view(B, N, d)
    return out @ Wo
```
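A quick usage check of the hand-rolled version (random weights, illustrative scale):

```python
import torch

d, n_heads = 512, 8
x = torch.randn(2, 16, d)                                          # batch of 2, 16 tokens each
Wq, Wk, Wv, Wo = (torch.randn(d, d) * d ** -0.5 for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads).shape)      # torch.Size([2, 16, 512])
```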
Telling the model "this is position 7"
Attention with no positional info:
"Dog bites man" → same attention weights as
"Man bites dog"
Both have the same tokens, just reordered. Attention computes a weighted sum over a set of tokens; it has no built-in notion of order, so permuting the input only permutes the output.
We need to inject position information into the token embeddings. The Transformer's choice was sinusoidal — a specific multi-scale "clock" vector added to each token's embedding.
Imagine encoding position with a clock that has many hands, some spinning fast and some slowly.
Reading all hands gives a unique signature for each position. To get the signature for position + 1, you just rotate each hand a fixed amount — easy for the model to learn the relative offset.
Sinusoidal encoding is a high-dimensional version of this clock. For one pair of dimensions $(2i, 2i{+}1)$, position $pos$ is encoded as

$$PE_{pos,\,2i} = \sin(\omega_i \, pos), \qquad PE_{pos,\,2i+1} = \cos(\omega_i \, pos), \qquad \omega_i = 10000^{-2i/d_{\text{model}}}.$$

What about position $pos + k$? Letting the angle-addition formulas do the work,

$$\begin{pmatrix} \sin(\omega_i (pos+k)) \\ \cos(\omega_i (pos+k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix}$$

A 2D rotation matrix that depends only on the offset $k$, never on the absolute position: "move forward by k" is always the same linear map, exactly the "rotate each hand by a fixed amount" property of the clock. A worked numeric check is sketched below.
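A minimal numeric sketch of the encoding itself, following the Vaswani 2017 formula (dimensions kept tiny so the "clock reading" is easy to inspect):

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    freq = torch.exp(-torch.arange(0, d_model, 2, dtype=torch.float32) * math.log(10000.0) / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

pe = sinusoidal_pe(max_len=64, d_model=8)
print(pe[7])   # the unique 8-number "clock reading" for position 7
```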
Learned embeddings work too but are simply undefined past the training length. Sinusoidal encoding is defined for any position, so it can at least extrapolate in principle.
| Method | How | Used in |
|---|---|---|
| Sinusoidal (Vaswani 2017) | fixed sin/cos | original Transformer |
| Learned | nn.Embedding(max_len, d_model) | BERT, GPT-2, GPT-3 |
| RoPE (Su 2021) | rotate Q and K by position-dependent angle | Llama, Mistral, PaLM, modern LLMs |
| ALiBi (Press 2021) | bias attention scores by relative distance | OPT-175B, some variants |
2026 · RoPE dominates new LLMs. We'll cover it in L15 (LLMs). For now, any of the four works — pick the one that matches your base model.
Encoder · decoder · causal mask
| Model | What it is | Use case |
|---|---|---|
| Encoder-only (BERT) | stack of encoder blocks | classification, embedding, retrieval |
| Decoder-only (GPT, Llama, Claude) | stack of decoder blocks, no cross-attn | autoregressive generation |
| Encoder-decoder (T5, BART) | both, with cross-attention | translation, summarization |
In 2026, decoder-only dominates LLMs. Encoder-only ships in retrieval pipelines. Encoder-decoder survives for translation-style tasks.
Decoder predicting "The quick brown fox ___" (answer: "jumps"). During training the whole sentence is fed; when predicting position 5, the model must not see token 5.
Causal mask · add −∞ to every attention score that looks at a future position, before the softmax, so those positions get weight exactly 0.
Worked numeric. Token 3's pre-softmax scores over positions 1–4 might be, say, (2.0, 0.5, 1.0, 3.0).
Apply mask → (2.0, 0.5, 1.0, −∞).
Weights after softmax ≈ (0.63, 0.14, 0.23, 0.00).
Token 4's weight is exactly 0: the cheat is closed off.
```python
import torch

# Causal mask for sequence length N
N = 128
mask = torch.triu(torch.ones(N, N), diagonal=1).bool()  # upper-triangular, excluding diagonal
# [[F, T, T, T, ...],
#  [F, F, T, T, ...],
#  [F, F, F, T, ...],
#  ...]
# True = mask (set score to -inf)

# In MultiheadAttention, mask=True means "block this position"
out, _ = self.attn(x, x, x, attn_mask=mask)
```
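And the worked numbers above fall straight out of the same masking idea (score values illustrative):

```python
import torch

scores = torch.tensor([2.0, 0.5, 1.0, 3.0])    # token 3's raw scores over positions 1-4
future = torch.tensor([False, False, False, True])
masked = scores.masked_fill(future, float("-inf"))
print(masked.softmax(dim=-1))                  # ≈ tensor([0.63, 0.14, 0.23, 0.00])
```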
Karpathy nanoGPT in 80 lines
```python
import torch
import torch.nn as nn


class GPT(nn.Module):
    def __init__(self, vocab, d_model=192, n_heads=6, n_layers=6, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList([TransformerBlock(d_model, n_heads, 4 * d_model)
                                     for _ in range(n_layers)])
        self.norm_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab, bias=False)

    def forward(self, idx):
        B, N = idx.shape
        pos = torch.arange(N, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)        # add learned positional embedding
        mask = torch.triu(torch.ones(N, N, device=idx.device), diagonal=1).bool()
        for block in self.blocks:
            x = block(x, mask=mask)
        x = self.norm_f(x)
        return self.head(x)                              # logits over vocab
```
Train this on Tiny Shakespeare → a working generator of Shakespeare-style text. Seriously.
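A minimal training-loop sketch, assuming data is a 1-D LongTensor of token ids from Tiny Shakespeare and vocab is its vocabulary size (both placeholders here, not defined above):

```python
import torch
import torch.nn.functional as F

model = GPT(vocab)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(5000):
    ix = torch.randint(len(data) - 257, (64,))             # 64 random chunk starts
    batch = torch.stack([data[i:i + 257] for i in ix])     # 257 tokens: 256 inputs + shifted targets
    x, y = batch[:, :-1], batch[:, 1:]
    logits = model(x)                                      # (64, 256, vocab)
    loss = F.cross_entropy(logits.reshape(-1, vocab), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```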
In the original Vaswani Transformer the decoder has three sublayers, not two: masked self-attention over the target generated so far, cross-attention over the encoder's output, and the FFN.
Cross-attention is the Bahdanau-attention mechanism from L12, with learned Q/K/V projections. The encoder produces a rich representation of the source; the decoder queries it at every step.
GPT and Llama drop the encoder and cross-attention entirely — decoder-only. T5 keeps them for translation. Stable Diffusion uses cross-attention to inject text conditioning into images (L22).
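A minimal sketch of that cross-attention call with nn.MultiheadAttention (tensor shapes and names are illustrative):

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

enc_out = torch.randn(1, 20, d_model)   # encoder's representation of a 20-token source
dec_x = torch.randn(1, 5, d_model)      # decoder states for the 5 target tokens so far

# Q from the decoder, K and V from the encoder output
out, attn_weights = cross_attn(query=dec_x, key=enc_out, value=enc_out)
print(out.shape)                        # torch.Size([1, 5, 512]): one vector per target token
```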
| Variation | Change | Seen in |
|---|---|---|
| Pre-norm (vs post-norm) | normalize before sublayer | GPT-2+, Llama, Claude |
| SwiGLU FFN (vs ReLU) | SiLU + gating | Llama 2+ |
| RoPE (vs sinusoidal PE) | rotate Q, K per position | Llama, Mistral, PaLM |
| GQA (vs MHA) | fewer KV heads than Q heads | Llama 2 70B+ |
| RMSNorm (vs LayerNorm) | drop mean centering | Llama, Mistral |
| Parallel attention + FFN | attn and FFN run in parallel, not sequentially | GPT-J, PaLM |
Each tweak is small (a 0.1–1% win). Stacked, they define a "2026 default Transformer" that looks quite different from Vaswani 2017 in its details but is identical in structure.
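As one example of how small these tweaks are in code, a minimal RMSNorm sketch (scale-only normalization in the Llama style; the eps value is illustrative):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))   # scale only, no shift
        self.eps = eps

    def forward(self, x):
        # Normalize each token vector by its root-mean-square; no mean subtraction
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.weight * (x / rms)
```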
Top 5 issues to check:
Weight tying · self.head.weight = self.tok_emb.weight reduces params by ~25% and usually helps.

Karpathy's "most common deep-learning bug" list puts attention-mask bugs at the top. Every implementation has one that costs a week of debugging.