Interactive Explainer
Feeling the Seq2Seq bottleneck
Long sentences break short context vectors. Build a 1-vector encoder yourself, test it on sentences of growing length, and see precisely when information starts falling out — the failure that made attention inevitable.
The suitcase problem
Imagine you're going on a one-week holiday and the airline restricts you to a single 7 kg carry-on. You can just manage. Now imagine the same suitcase for a six-month expedition. The suitcase hasn't changed — your needs have. Everything that doesn't fit has to be left behind.
The 2014 Seq2Seq model (Sutskever, Vinyals, Le) built a neural translator by packing the entire source sentence into a single fixed-size context vector — call it 512 floats. Short sentences packed fine. Long sentences overflowed. This page lets you run that compression live and watch translation quality crater as sentence length grows.
Pick a sentence pair
Three example translation tasks. Each has a short, medium, and long variant. Pick a scenario to begin.
Compress into a context vector
Now simulate the encoder. It reads the tokens one at a time, folding each into a single hidden state; what you see plotted is a 2D sketch of the full d-dim context. Slide the context dim to see how much room the encoder has.
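If you want to poke at the same mechanism offline, here is a minimal sketch (plain NumPy with random, untrained weights; it is not the widget above, only an illustration of its shape): a GRU-style encoder that folds a token sequence, one token at a time, into a single d-dimensional context vector. Whatever the sentence length, the output is the same d numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 1000, 16                            # toy vocabulary and context size

# random, untrained parameters (illustration only)
E   = rng.standard_normal((vocab_size, d)) * 0.1    # token embeddings
W_z = rng.standard_normal((2 * d, d)) * 0.1         # update-gate weights
W_h = rng.standard_normal((2 * d, d)) * 0.1         # candidate-state weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(token_ids):
    """Fold an arbitrary-length token sequence into ONE d-dim context vector."""
    h = np.zeros(d)
    for t in token_ids:
        x  = E[t]
        xh = np.concatenate([x, h])
        z     = sigmoid(xh @ W_z)        # how much of the old state to overwrite
        h_new = np.tanh(xh @ W_h)        # candidate content from the new token
        h = (1 - z) * h + z * h_new      # GRU-style blend
    return h                             # the context vector

short = rng.integers(0, vocab_size, size=5)
long_ = rng.integers(0, vocab_size, size=60)
print(encode(short).shape, encode(long_).shape)   # both (16,): 5 tokens or 60, same size
```

The 60-token sentence has to share exactly the same 16 numbers as the 5-token one; that is the suitcase.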
Translation quality vs source length
Plotting BLEU against source-sentence length (Cho et al., 2014; Bahdanau et al., 2015) shows the fixed-vector encoder-decoder falling off a cliff past roughly 30 tokens, while the attention model holds steady. Here is that result, reproduced interactively.
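If you want to redraw this curve for your own translator, the bucketing is simple. A minimal sketch using sacreBLEU is below; `sources`, `references`, and `hypotheses` are assumed to be parallel lists of strings you supply.

```python
from collections import defaultdict
import sacrebleu   # pip install sacrebleu

def bleu_by_source_length(sources, references, hypotheses, bucket=10):
    """Group sentence triples by source length, then score each bucket separately."""
    buckets = defaultdict(lambda: ([], []))            # length bin -> (hyps, refs)
    for src, ref, hyp in zip(sources, references, hypotheses):
        bin_id = (len(src.split()) // bucket) * bucket
        buckets[bin_id][0].append(hyp)
        buckets[bin_id][1].append(ref)
    scores = {}
    for bin_id, (hyps, refs) in sorted(buckets.items()):
        label = f"{bin_id}-{bin_id + bucket - 1} tokens"
        scores[label] = sacrebleu.corpus_bleu(hyps, [refs]).score
    return scores

# scores = bleu_by_source_length(sources, references, hypotheses)
# for length_range, bleu in scores.items():
#     print(length_range, round(bleu, 1))
```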
What information gets dropped first?
Here's a concrete example. Feed a sentence through the compressor, then ask it to recall specific parts. Watch which parts the decoder can still recover.
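The demo above does this with a trained model. For an offline intuition, here is a crude stand-in (an assumption of this sketch, not an LSTM): bind each token's embedding to a random ±1 key for its position, sum everything into one d-dim vector, then try to read each position back by unbinding and nearest-neighbour lookup. Short sentences come back almost intact; as the token count grows relative to d, recall collapses.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, max_len = 256, 500, 80
emb  = rng.standard_normal((vocab, d))               # token embeddings
keys = rng.choice([-1.0, 1.0], size=(max_len, d))    # fixed positional keys

def compress(tokens):
    """One d-dim vector for the whole sentence: sum of key-bound token embeddings."""
    return sum(emb[t] * keys[i] for i, t in enumerate(tokens))

def recall(context, position):
    """Unbind the positional key, then pick the nearest vocabulary embedding."""
    query = context * keys[position]
    return int(np.argmax(emb @ query))

for length in (5, 15, 30, 60):
    sent = rng.integers(0, vocab, size=length)
    ctx = compress(sent)
    correct = sum(recall(ctx, i) == sent[i] for i in range(length))
    print(f"length {length:2d}: recovered {correct}/{length} tokens")
```

Every stored token adds interference to every other one, so the vector does not fail at one hard limit; it fades, and the parts that depend on fine detail fade first.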
Attention · remove the bottleneck
Rather than cramming everything into one vector, attention gives the decoder access to every encoder hidden state. At each decoding step it scores each state against the current decoder state, normalises the scores with a softmax, and takes the weighted average: a fresh context vector for every output word instead of one for the whole sentence.
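In code the whole mechanism is a few lines. Below is a minimal dot-product attention sketch in NumPy with random, untrained states (Bahdanau's original scored with a small feed-forward network rather than a dot product, but the shape of the computation is the same).

```python
import numpy as np

rng = np.random.default_rng(0)
src_len, d = 12, 16
encoder_states = rng.standard_normal((src_len, d))   # one vector per source token
decoder_state  = rng.standard_normal(d)              # current decoder hidden state

def attend(decoder_state, encoder_states):
    """Dot-product attention: a fresh context vector for one decoding step."""
    scores  = encoder_states @ decoder_state          # relevance of each source position
    scores -= scores.max()                            # numerical stability for softmax
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over source positions
    context = weights @ encoder_states                # weighted average, shape (d,)
    return context, weights

context, weights = attend(decoder_state, encoder_states)
print(weights.round(2))     # attention distribution over the 12 source tokens
print(context.shape)        # (16,), recomputed at every output step
```

Because the context is recomputed for every output word, no single vector ever has to carry the whole sentence; the Transformer later made this weighted average the only mixing operation in the model.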
Three common confusions
"Just make the vector bigger."
A 2048-dim vector delays the problem but doesn't solve it. The bottleneck is structural: one vector carries information from every position. You always run out eventually.
"Reversing source is a hack."
It's a hack, but a pragmatic one. Reversing the source puts the first source words right next to the first target words, so the start of the sentence translates reliably while the tail still has to survive the long trip through one vector; that asymmetry exposed exactly where the bottleneck leaks. The fix (attention) came directly from recognising this.
"Modern LLMs still use encoder-decoder."
Most 2026 LLMs are decoder-only: they don't encode the source into a separate vector at all; the whole conversation is one long sequence attended to directly. Encoder-decoder survives in text-to-text models like T5 and in dedicated machine-translation systems.
Three lenses on the same fix
| Model (year) | What it added | BLEU on 50-token sentences |
|---|---|---|
| Seq2Seq (2014) | encoder+decoder LSTM, one context vector | 22 |
| Reversed Seq2Seq (2014) | flip source · same architecture | 26 |
| Bahdanau attention (2015) | decoder attends to all encoder states | 30 |
| Transformer (2017) | attention everywhere · no recurrence | 35+ |
Part of the ES 667 Deep Learning course · IIT Gandhinagar · Aug 2026.