Compress source into a vector · decompress into target.
Two unrolled RNNs, back to back, trained end-to-end. No grammar rules, no alignment dictionaries, no phrase tables — the representations are learned from parallel corpus data alone. This was radically new in 2014; by 2016 it was state-of-the-art in production MT.
The same encoder-decoder pattern returns in T5 (L14), Stable Diffusion (L22), and every modular ML system that maps between domains.
Two design choices:
Modern multilingual models (mT5, NLLB, Whisper) share vocab via SentencePiece — a single token stream covers 100+ languages. Today's LLMs do the same.
A human translator faced with a long German sentence does not translate word-by-word. They read the whole sentence, pause to grasp the meaning, form a mental summary, then start composing the English translation by unpacking that summary.
Seq2Seq does exactly this · the encoder reads, builds a context vector (the "mental summary"), and the decoder unpacks it into the target language.
The problem · for long sentences, even a great human's mental summary fails. So does Seq2Seq · this is why we'll add attention in L12.
Forget RNNs for a second. Two black boxes: an encoder that squeezes the source sentence into a vector, and a decoder that expands that vector into the target sentence.
Like reading a sentence, thinking "got it", and explaining it in another language.
Interactive: see BLEU curves fall as source length grows — seq2seq-bottleneck.
nn.LSTM
An nn.LSTM layer is a function with specific I/O:
Input — the embedded source sequence (e.g. (10, 256)).
outputs — hidden state at every time step (e.g. (10, 512)).
(h_n, c_n) — final hidden + cell state (each (1, 512)).
For the encoder we want the final state — that's our context vector. Discard outputs (using _ in Python) and keep the tuple (h, c).
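A quick shape check of that contract (a minimal sketch; a batch dimension of 1 is added in front of the shapes quoted above):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=256, hidden_size=512, batch_first=True)
x = torch.randn(1, 10, 256)          # one sentence of 10 tokens, each embedded in 256 dims
outputs, (h_n, c_n) = lstm(x)
print(outputs.shape)                 # torch.Size([1, 10, 512]): one hidden state per time step
print(h_n.shape, c_n.shape)          # torch.Size([1, 1, 512]) each: final hidden + cell state
```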
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_emb=256, d_h=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_emb)
        self.encoder = nn.LSTM(d_emb, d_h, batch_first=True)
        self.decoder = nn.LSTM(d_emb, d_h, batch_first=True)
        self.output = nn.Linear(d_h, tgt_vocab)

    def forward(self, src, tgt):
        # Encoder: keep only the final (h, c), the context vector
        _, (h, c) = self.encoder(self.src_emb(src))
        # Decoder: start from the context and read the (teacher-forced) target
        dec_out, _ = self.decoder(self.tgt_emb(tgt), (h, c))
        return self.output(dec_out)          # logits over the target vocabulary
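A quick smoke test of the module (the batch size, sequence lengths, and vocab sizes below are arbitrary):

```python
import torch

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (32, 12))   # batch of 32 source sentences, 12 tokens each
tgt = torch.randint(0, 8000, (32, 15))   # teacher-forced decoder inputs, 15 tokens each
logits = model(src, tgt)
print(logits.shape)                      # torch.Size([32, 15, 8000])
```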
How we train sequence generators
At inference the decoder feeds its own predictions back in:
decoder sees "<start>" → emits "The" → sees "The" → emits "animal" → …
If we trained the same way and the decoder's first prediction is wrong, the error compounds: all subsequent steps condition on bad inputs. Training becomes painfully slow and unstable.
Like learning to ride a bike with a parent holding the seat. You still pedal and steer (predict the next word). But when you wobble (make a mistake), the parent keeps you on the right path (feeds you the ground-truth word).
You learn the core motion much faster and more safely. At inference, the training wheels come off.
Without teacher forcing · the decoder feeds its own prediction back in. One error at step 1 → wrong context for every later step.
With teacher forcing · the decoder sees the ground-truth previous token. Every step is a clean, independent prediction problem.
The biggest reason for teacher forcing isn't safety — it's parallelism.
<s>, "The", "cat", … are known up-front → feed them in all at once → one big matrix multiply.A slow sequential loop becomes a fast parallel computation — 10–100× speedup on training. Empirically tolerated despite the "you train on a distribution you won't see at inference" critique.
Even though training ≠ inference, teacher forcing works in practice. The "you're training on a distribution you won't see at inference" critique is real (exposure bias, next slide), but empirically it's tolerated because the alternative, fully autoregressive training, is 10–100× slower.
Source: "Le chien a chassé le chat." Target: "The dog chased the cat."
Training (teacher forcing).
<s> → predicts "A" (logP −1.2). Truth is "The". Loss is computed."The". Back on track.Inference (autoregressive).
<s> → predicts "A".The model trained in a perfect world, tested in a messy one.
Mitigations · scheduled sampling (Bengio 2015), noisy data augmentation, or just sidestep with massive-scale Transformers (2020+).
How to generate at inference
Simplest: at each step, pick the single most likely next token (greedy decoding): $y_t = \arg\max_{w} P(w \mid y_{<t}, x)$.
Fast. Deterministic. Usually suboptimal — a slightly-less-likely next token can lead to a much-more-likely full sequence.
Example (simplified):
greedy: "The dog is running" (but doesn't quite fit context)
better: "The puppies are running" (total prob higher)
Imagine the true best translation is "The cat sits on the mat" (probability 0.7).
At step 1, suppose "A" has a slightly higher probability than "The". Greedy picks "A" because it's locally higher, but the best full sequence starting with "A" scores lower than "The cat sits on the mat".
Local optima are not global optima. Greedy decoding is a greedy search on the product of conditional probabilities — it commits at every step. Beam search (next) mitigates by keeping multiple candidates alive until the sequence ends.
Greedy decoding is one hiker taking the steepest step at every fork. They miss bigger peaks that require a less steep early path.
Beam search sends out a team of $k$ hikers. At every fork each hiker considers the options, and the team keeps only the $k$ most promising partial routes.
The team's collective best end-of-path score is far closer to the global maximum than greedy's. Cost · roughly $k\times$ the decoding compute.
We want the sentence with highest probability under the model: $P(y \mid x) = \prod_t P(y_t \mid y_{<t}, x)$.
For a 20-word sentence we'd multiply 20 small probabilities → numerical underflow (rounded to 0).
Fix · take logs.
Each probability becomes a log-probability and the product becomes a sum: $\log P(y \mid x) = \sum_t \log P(y_t \mid y_{<t}, x)$. No more underflow.
Every $\log P(y_t \mid \cdot)$ is negative, so each extra word makes the total more negative: plain summed log-prob systematically favors short sentences.
| Sentence | logP |
|---|---|
| "I am" | higher (only 2 negative terms) |
| "I am a student" | lower (4 negative terms) |
Plain log-prob picks "I am". Wrong.
Length normalization · divide by sentence length raised to a power $\alpha$: $\text{score}(y) = \frac{1}{|y|^{\alpha}} \sum_t \log P(y_t \mid y_{<t}, x)$, with $\alpha$ typically around 0.6–0.7.
It makes finished, longer hypotheses comparable to short ones — the goal is fair comparison, not always preferring long.
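A tiny numeric illustration of both fixes, with made-up per-token log-probabilities and an assumed $\alpha = 0.7$:

```python
def length_normalized(logprobs, alpha=0.7):
    # Sum of per-token log-probs, divided by length^alpha
    return sum(logprobs) / (len(logprobs) ** alpha)

short = [-0.9, -1.1]                      # "I am"            (made-up numbers)
long_ = [-0.6, -0.7, -0.7, -0.8]          # "I am a student"  (made-up numbers)

print(sum(short), sum(long_))             # raw sums ≈ -2.0 vs ≈ -2.8: raw logP picks the short one
print(length_normalized(short), length_normalized(long_))
                                          # ≈ -1.23 vs ≈ -1.06: the longer hypothesis now wins
```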
At each step, maintain the $k$ best partial hypotheses (the beam): expand each with every possible next token, score all candidates by summed log-prob, and keep only the top $k$.
Typical $k$ · 4–10 for machine translation; larger beams cost more for diminishing returns.
Vocab · {The, A, cat, dog, sat, ran}. Decoding "The cat sat".
Step 1. Beams · [<s>]. Top-2 next-token log-probs:
| token | logP |
|---|---|
| The | -0.5 |
| A | -0.7 |
Keep both. Beams · [<s>, The] (-0.5), [<s>, A] (-0.7).
Step 2. Expand each. For brevity, top extensions:
| sequence | logP |
|---|---|
| <s> The cat | -0.5 + -0.6 = -1.1 |
| <s> The dog | -0.5 + -1.3 = -1.8 |
| <s> A cat | -0.7 + -0.9 = -1.6 |
| <s> A dog | -0.7 + -1.0 = -1.7 |
Keep the top 2 · <s> The cat (-1.1) and <s> A cat (-1.6).
Step 3. Continue until each beam emits </s>. The final ranking divides each summed log-prob by $|y|^{\alpha}$ (length normalization) before picking the winner.
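A compact beam-search sketch over the Seq2Seq decoder defined earlier (not a reference implementation: the beam width, the <s>/</s> ids, batch size 1, and alpha are assumed values):

```python
import torch

@torch.no_grad()
def beam_search(model, src, k=4, bos_id=1, eos_id=2, max_len=30, alpha=0.7):
    _, (h, c) = model.encoder(model.src_emb(src))        # context from one source sentence
    beams = [([bos_id], 0.0, (h, c))]                    # (tokens, summed logP, decoder state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score, state in beams:
            inp = torch.tensor([[tokens[-1]]])
            dec_out, new_state = model.decoder(model.tgt_emb(inp), state)
            logp = torch.log_softmax(model.output(dec_out[:, -1]), dim=-1).squeeze(0)
            top_logp, top_ids = logp.topk(k)
            for lp, idx in zip(top_logp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [idx], score + lp, new_state))
        # Keep the k best partial hypotheses; move completed ones aside
        candidates.sort(key=lambda b: b[1], reverse=True)
        beams = []
        for tokens, score, state in candidates[:k]:
            (finished if tokens[-1] == eos_id else beams).append((tokens, score, state))
        if not beams:
            break
    finished = finished or beams
    # Final ranking uses length normalization: summed logP / len^alpha
    return max(finished, key=lambda b: b[1] / len(b[0]) ** alpha)[0]
```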
A beam of $k = 1$ is exactly greedy decoding; a huge beam approaches exhaustive search. In between, beam search trades compute for better full-sequence scores.
Top-$k$ sampling · instead of the argmax, sample the next token from the $k$ most probable candidates (renormalized).
For open-ended generation (story writing, chat) beam is too deterministic — everything sounds the same.
When the next-token distribution has one tall bar, nucleus narrows automatically; when it's flat, nucleus widens. Top-$k$ keeps a fixed $k$ tokens regardless of the shape of the distribution.
2026 LLM default — nucleus sampling with $p$ around 0.9–0.95, typically combined with a temperature setting.
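A sketch of top-$k$ and nucleus (top-$p$) filtering over a 1-D vector of next-token logits; the default temperature, $k$, and $p$ values here are illustrative, not quoted from the slides:

```python
import torch

def sample_next(logits, temperature=0.8, top_k=0, top_p=0.9):
    logits = logits / temperature                    # makes a copy; caller's tensor is untouched
    if top_k > 0:                                    # top-k: keep only the k most likely tokens
        kth = logits.topk(top_k).values[-1]
        logits[logits < kth] = float('-inf')
    if top_p < 1.0:                                  # nucleus: smallest set with cumulative prob >= p
        sorted_logits, sorted_idx = logits.sort(descending=True)
        probs = sorted_logits.softmax(-1)
        cutoff = probs.cumsum(-1) > top_p
        cutoff[1:] = cutoff[:-1].clone()             # shift so the token crossing p is still kept
        cutoff[0] = False
        logits[sorted_idx[cutoff]] = float('-inf')
    return torch.multinomial(logits.softmax(-1), num_samples=1).item()
```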
Why attention became inevitable
The entire source — 5 words or 500 — must compress into one fixed-size context vector. The decoder then generates from that single vector.
For short sentences, fine. For long sentences, the encoder forgets the beginning by the time it reaches the end. The decoder has no way to recover what was lost.
The 2014 paper itself found that reversing the source improved BLEU by several points:
"I am happy"→ encoded in order
"happy am I"→ encoded reversed
Why? The last source words (now first) are closest to where the decoder begins generating — less path length for that information to travel through the hidden-state chain.
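On a batched tensor the trick is a one-liner (a sketch; it assumes no padding, since flipping would also move pad tokens):

```python
src_reversed = src.flip(dims=[1])   # reverse the time dimension: "I am happy" -> "happy am I"
_, (h, c) = model.encoder(model.src_emb(src_reversed))
```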
A hack that works is a sign of a problem waiting to be solved properly. Reversing the input "fixed" the bottleneck by shifting where the leakage happens, not by removing it.
If one context vector can't hold all source info, don't use one context vector.
Instead, let the decoder look at all the encoder hidden states — and decide which ones to focus on for each target step.
That is attention. Bahdanau et al. 2014 — the paper that launched a decade of NLP. Next lecture.
Even in 2026, some parts survive
The Seq2Seq pattern (encoder → context → decoder) is everywhere. Only the implementation of "context" changed: fixed vector (2014) → attention (2015) → self-attention (2017) → ... → your favorite 2026 LLM.