
Interactive Explainer

Feeling the Seq2Seq bottleneck

Long sentences break short context vectors. Build a 1-vector encoder yourself, test it on sentences of growing length, and see precisely when information starts falling out — the failure that made attention inevitable.

Prelude

The suitcase problem

Imagine you're going on a one-week holiday and the airline restricts you to a single 7 kg carry-on. You can just manage. Now imagine the same suitcase for a six-month expedition. The suitcase hasn't changed — your needs have. Everything that doesn't fit has to be left behind.

The 2014 Seq2Seq paper (Sutskever, Vinyals & Le) built a neural translator by packing the entire source sentence into a single fixed-size context vector (call it 512 floats). Short sentences packed fine. Long sentences overflowed. This page lets you run that compression live and watch translation quality crater as sentence length grows.

By the end of this page, you will have watched the context vector struggle, measured the BLEU drop, and understood exactly why Bahdanau et al. added attention a year later.
Step 1

Pick a sentence pair

Three example translation tasks. Each has a short, medium, and long variant. Pick a scenario to begin.

[Interactive readout: source tokens · bits of info (approx.)]
Pause. A 12-word sentence has roughly 120 bits of entropy (≈10 bits per word). A 512-dim float vector can in principle store ~16,000 raw bits (512 × 32). So for short sentences there's *plenty* of room on paper. The real constraint is not raw storage but how much a sequentially updated hidden state can actually preserve; in practice the trouble starts at ~50 words.
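Check the arithmetic. A runnable version of the estimate above; the per-word entropy and vector size are the rough figures quoted in the text, not measured values.

```python
# Back-of-the-envelope capacity check using the approximate numbers above.
BITS_PER_WORD = 10      # rough entropy of one word of natural language
VECTOR_DIM = 512        # context-vector size used in Sutskever et al. (2014)
BITS_PER_FLOAT = 32     # raw storage per float; usable capacity is far lower

raw_capacity = VECTOR_DIM * BITS_PER_FLOAT          # ~16,384 bits on paper
for n_words in (12, 30, 50, 100):
    sentence_bits = n_words * BITS_PER_WORD
    print(f"{n_words:3d} words ≈ {sentence_bits:5d} bits of entropy "
          f"vs {raw_capacity} raw bits of storage")
```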
Step 2

Compress into a context vector

Now simulate the encoder. Each token is read sequentially; the hidden state shown here is a 2D sketch of the full d-dimensional context vector. Slide the context dim to see how much room the encoder has.

[Interactive readout: context capacity · sentence entropy · compression loss]
Read the picture. Each encoder step compresses more source info into the same-size vector. Early tokens fade as later ones overwrite the representation. Sutskever's team observed exactly this and fixed it partly by reversing the source (so early tokens are written last, closer to where the decoder reads).
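Feel the overwrite. Here is a minimal sketch of that fading effect: a random, untrained vanilla RNN in NumPy, with a finite-difference probe of how much each input token still moves the final context vector. The dimensions and weight scaling are illustrative choices, not the paper's setup.

```python
import numpy as np

# Toy single-vector encoder: a vanilla RNN with random (untrained) weights.
# We measure how much perturbing each input token moves the final hidden
# state, a rough proxy for how much of that token survives the compression.

rng = np.random.default_rng(0)
d_in, d_h, T = 16, 32, 40          # token dim, context dim, sentence length
W_x = rng.normal(0, 1.0 / np.sqrt(d_in), (d_h, d_in))
W_h = rng.normal(0, 0.4 / np.sqrt(d_h), (d_h, d_h))  # contractive: old info decays

def encode(tokens):
    """Fold a (T, d_in) token sequence into one d_h-dim context vector."""
    h = np.zeros(d_h)
    for x in tokens:
        h = np.tanh(W_x @ x + W_h @ h)   # each step partly overwrites h
    return h

def influence(tokens, pos, eps=1e-3):
    """Finite-difference sensitivity of the final state to token `pos`."""
    bumped = tokens.copy()
    bumped[pos] += eps
    return np.linalg.norm(encode(bumped) - encode(tokens)) / eps

tokens = rng.normal(size=(T, d_in))
print("first token :", influence(tokens, 0))        # tiny: faded out
print("last token  :", influence(tokens, T - 1))    # large: most recent
print("first token after reversing the source:",
      influence(tokens[::-1], T - 1))               # early word is now read last
```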
Step 3

Translation quality vs source length

The 2014 Seq2Seq paper (Sutskever et al.) and the 2015 attention paper (Bahdanau et al.) both reported BLEU broken down by source-sentence length. Here are those curves, reproduced interactively: without attention, BLEU falls off a cliff past ~30 tokens.

[Interactive chart: BLEU vs. source length · Seq2Seq (no attention) · Seq2Seq (reversed source) · Bahdanau attention (2015)]
The story. Reversing the source bought them 4-5 BLEU points — not because it fixed the bottleneck, but because it shifted *where* the leakage happens (last-read = closest-to-decoder). Attention replaced the fixed vector with a *flexible lookup* and leaped past both curves.
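Measure it yourself. A hedged sketch of the same length-bucketed BLEU breakdown using the sacrebleu library; `sources`, `hypotheses`, and `references` stand in for your own parallel test data, and the bucket edges are arbitrary.

```python
import sacrebleu  # pip install sacrebleu

def bleu_by_length(sources, hypotheses, references, edges=(10, 20, 30, 50)):
    """Corpus BLEU per source-length bucket (bucket edges are illustrative)."""
    buckets = {}
    for src, hyp, ref in zip(sources, hypotheses, references):
        n = len(src.split())
        key = next((e for e in edges if n <= e), f"> {edges[-1]}")
        hyps, refs = buckets.setdefault(key, ([], []))
        hyps.append(hyp)
        refs.append(ref)
    return {key: sacrebleu.corpus_bleu(hyps, [refs]).score
            for key, (hyps, refs) in buckets.items()}

# scores = bleu_by_length(test_sources, system_outputs, gold_translations)
# -> dict mapping each length bucket to its corpus BLEU
```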
Step 4

What information gets dropped first?

Here's a concrete example. Feed a sentence through the compressor, then ask it to recall specific parts. Watch which parts the decoder can still recover.

[Interactive chart: recall accuracy by token position]
Pattern. Middle tokens are lost first: they sit between the start (which the decoder reproduces first, before errors accumulate) and the end (most recent in the encoder's memory). This U-shape is why beam-search translations famously get "the middle" wrong for long sentences in pre-attention models.
Step 5

Attention · remove the bottleneck

Rather than cramming everything into one vector, attention gives the decoder access to every encoder hidden state. At each decoding step, it weights them by relevance and averages.

[Interactive widget: attention weights α_{t,i}]
Mechanism. Each target step t builds its context as a weighted average of all source states h_1…h_T, with learned weights α_{t,i}. Long sentences don't hurt: every source token remains accessible. That's it: attention dissolves the bottleneck.
c_t = Σ_i α_{t,i} · h_i     where    α_{t,i} = softmax_i(score(s_t, h_i)) = exp(score(s_t, h_i)) / Σ_j exp(score(s_t, h_j))
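In code. A minimal NumPy sketch of that routing. The dot-product score is an illustrative stand-in; Bahdanau et al. used a small learned MLP for score(s_t, h_i), but the softmax-and-average step is the same.

```python
import numpy as np

def attention_context(s_t, H):
    """s_t: (d,) decoder state; H: (T, d) encoder states -> context (d,), weights (T,)."""
    scores = H @ s_t                                # score(s_t, h_i) for every source i
    scores = scores - scores.max()                  # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over source positions
    c_t = alpha @ H                                 # weighted average of ALL h_i
    return c_t, alpha

rng = np.random.default_rng(0)
T, d = 50, 512                  # 50 source tokens: nothing overflows, every h_i stays reachable
H = rng.normal(size=(T, d))
s_t = rng.normal(size=d)
c_t, alpha = attention_context(s_t, H)
print(alpha.sum(), c_t.shape)   # weights sum to 1.0; the context is still one (512,) vector per step
```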
Misconceptions

Three common confusions

False

"Just make the vector bigger."
A 2048-dim vector delays the problem but doesn't solve it. The bottleneck is structural: one vector carries information from every position. You always run out eventually.

False

"Reversing source is a hack."
It's a hack, but a pragmatic one that revealed which end of the sentence the bottleneck hurts most. The fix (attention) came directly from recognising this leak.

False

"Modern LLMs still use encoder-decoder."
Most 2026 LLMs are decoder-only: they don't encode the source into a vector at all; the whole conversation is one long sequence. Encoder-decoder survives in models like T5 and in dedicated machine-translation systems.

Bonus

Three lenses on the same fix

Model (year) · What it added · BLEU on 50-token sentences
Seq2Seq (2014) · encoder + decoder LSTM, one context vector · 22
Reversed Seq2Seq (2014) · flipped source, same architecture · 26
Bahdanau attention (2015) · decoder attends to all encoder states · 30
Transformer (2017) · attention everywhere, no recurrence · 35+
Final takeaway. The Seq2Seq bottleneck is the pedagogical motivation for attention. Every subsequent architecture is "attention, but with different routing": cross-attention in the Transformer decoder, cross-attention in Stable Diffusion's U-Net, and attention over projected image tokens in LLaVA. Once you've felt where the bottleneck hurts, the fix is obvious.

Part of the ES 667 Deep Learning course · IIT Gandhinagar · Aug 2026.