Seq2Seq & the Motivation for Attention

Lecture 11 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Learning outcomes

By the end of this lecture you will be able to:

  1. Describe encoder-decoder RNN architecture for variable-length tasks.
  2. Implement teacher forcing and explain exposure bias.
  3. Contrast greedy / beam / top-k / nucleus decoding.
  4. Apply length normalization in beam search.
  5. Diagnose the fixed-length bottleneck that killed pre-attention Seq2Seq.
  6. Motivate attention (L12) as the fix for the bottleneck.

Recap · where we are

Last lecture: LSTMs solve the vanishing-gradient problem in RNNs via gated cell states.

But solving depth is not enough. Many tasks need to map an input sequence to a different output sequence:

  • Machine translation — English to French
  • Summarization — article to abstract
  • Speech recognition — audio to text
  • Program synthesis — comment to code

Today maps to Bishop Ch 12 (Seq2Seq). UDL treats Transformers directly; we cover the encoder-decoder precursor because it motivates attention.

Four questions

  1. How do we map a variable-length input to a variable-length output?
  2. What is teacher forcing and why does everyone use it?
  3. How do we decode at inference — greedy, beam, nucleus?
  4. What's wrong with Seq2Seq — and why did attention become inevitable?

PART 1

The encoder-decoder architecture

Read all, then generate

Seq2Seq · the 2014 breakthrough

Sutskever, Vinyals, Le 2014 · "Sequence to Sequence Learning with Neural Networks" — achieved BLEU 34 on English→French, within striking distance of phrase-based MT.

Two separate RNNs:

  1. Encoder · reads the source sequence x_1, …, x_T, updating its hidden state at each step.
  2. Context vector · the encoder's final hidden state — a compressed summary of the whole source.
  3. Decoder · starts from the context vector, generates target tokens one at a time.

The whole idea in one sentence

Compress source into a vector · decompress into target.

Two unrolled RNNs, back to back, trained end-to-end. No grammar rules, no alignment dictionaries, no phrase tables — the representations are learned from parallel corpus data alone. This was radically new in 2014; by 2016 it was state-of-the-art in production MT.

The same encoder-decoder pattern returns in T5 (L14), Stable Diffusion (L22), and every modular ML system that maps between domains.

Shared vs separate vocabularies

Two design choices:

  • Separate vocab · source is 40k English tokens, target is 40k French tokens, each with its own embedding matrix. Clean; embeddings can specialize.
  • Shared vocab · one vocabulary for both, one embedding matrix. Saves parameters; lets the model see "Paris" as the same token in both languages.

Modern multilingual models (mT5, NLLB, Whisper) share vocab via SentencePiece — a single token stream covers 100+ languages. Today's LLMs do the same.

The translator analogy

A human translator faced with a long German sentence does not translate word-by-word. They read the whole sentence, pause to grasp the meaning, form a mental summary, then start composing the English translation by unpacking that summary.

Seq2Seq does exactly this · the encoder reads, builds a context vector (the "mental summary"), and the decoder unpacks it into the target language.

The problem · for long sentences, even a great human's mental summary fails. So does Seq2Seq · this is why we'll add attention in L12.

The big picture · encoder–decoder

Forget RNNs for a second. Two black boxes:

  1. Encoder · reads the entire English sentence and squishes its meaning into a single thought vector c.
  2. Decoder · takes c and unpacks it, word by word, into French.

Like reading a sentence, thinking "got it", and explaining it in another language.

The architecture

▶ Interactive: see BLEU curves fall as source length grows — seq2seq-bottleneck.

A quick look inside nn.LSTM

An nn.LSTM layer is a function with specific I/O:

  • Input · sequence of embedded tokens (e.g. 10-word sentence → (10, 256)).
  • Output · returns two things:
    1. outputs — hidden state at every time step (e.g. (10, 512)).
    2. (h_n, c_n) — final hidden + cell state (each (1, 512)).

For the encoder we want the final state — that's our context vector. We discard outputs (using _ in Python) and keep the tuple (h, c).
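
A minimal shape check (a sketch; batch_first=True adds a leading batch dimension of 1 that the shapes above omit):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=256, hidden_size=512, batch_first=True)
x = torch.randn(1, 10, 256)              # 1 sentence, 10 embedded tokens
outputs, (h_n, c_n) = lstm(x)
print(outputs.shape)                     # torch.Size([1, 10, 512]) · hidden state at every step
print(h_n.shape, c_n.shape)              # torch.Size([1, 1, 512]) each · the final state we keep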

Seq2Seq in PyTorch · skeleton

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_emb=256, d_h=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_emb)
        self.encoder = nn.LSTM(d_emb, d_h, batch_first=True)
        self.decoder = nn.LSTM(d_emb, d_h, batch_first=True)
        self.output  = nn.Linear(d_h, tgt_vocab)

    def forward(self, src, tgt):
        _, (h, c) = self.encoder(self.src_emb(src))     # context = (h, c)
        dec_out, _ = self.decoder(self.tgt_emb(tgt), (h, c))
        return self.output(dec_out)
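
A hedged usage sketch of the skeleton (vocabulary sizes, batch shapes, and the random token ids are illustrative, not from the course notebook). The decoder input is the ground-truth target shifted right by one position; the loss compares each step's logits against the next ground-truth token, which is exactly the teacher forcing of Part 2.

import torch
import torch.nn.functional as F

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (32, 12))            # 32 source sentences, 12 token ids each
tgt = torch.randint(0, 8000, (32, 15))            # 32 target sentences, 15 token ids each

logits = model(src, tgt[:, :-1])                  # decoder sees the ground-truth prefix
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),   # (32*14, 8000)
                       tgt[:, 1:].reshape(-1))                # predict the next token at every step
loss.backward()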

PART 2

Teacher forcing

How we train sequence generators

The problem with training auto-regressively

At inference the decoder feeds its own predictions back in:

decoder sees "<start>" → emits "The" → sees "The" → emits "animal" → …

But at training, if the decoder's first prediction is wrong, the error compounds — all subsequent steps condition on bad inputs. Training becomes painfully slow and unstable.

Teacher forcing · the fix

Teacher forcing · detailed flow

Teacher forcing · the training-wheels analogy

Like learning to ride a bike with a parent holding the seat. You still pedal and steer (predict the next word). But when you wobble (make a mistake), the parent keeps you on the right path (feeds you the ground-truth word).

You learn the core motion much faster and more safely. At inference, the training wheels come off.

Two regimes side-by-side

Autoregressive (inference)

The decoder feeds its own prediction back in. One error at step 1 → wrong context for every later step.

Teacher forcing (training)

The decoder sees the ground-truth previous token. Every step is a clean, independent prediction problem.

Why teacher forcing? · speed

The biggest reason for teacher forcing isn't safety — it's parallelism.

  • Autoregressive · compute step 1 → step 2 → step 3 → … sequential.
  • Teacher forcing · all decoder inputs <s>, "The", "cat", … are known up-front → feed them in all at once → one big matrix multiply.

A slow sequential loop becomes a fast parallel computation — 10–100× speedup on training. Empirically tolerated despite the "you train on a distribution you won't see at inference" critique.
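
A sketch of the contrast using the skeleton above (continuing the illustrative src/tgt tensors from the earlier snippet; the greedy feedback loop is a simplification of real inference):

# Teacher forcing · the whole ground-truth prefix is known up-front,
# so the decoder consumes it in a single call.
logits = model(src, tgt[:, :-1])                     # (32, 14, tgt_vocab)

# Autoregressive · each input depends on the previous prediction, so we loop.
_, state = model.encoder(model.src_emb(src))         # context = (h, c)
tok = tgt[:, :1]                                     # stand-in for the <s> token ids
generated = []
for _ in range(20):                                  # illustrative max length
    out, state = model.decoder(model.tgt_emb(tok), state)
    tok = model.output(out).argmax(dim=-1)           # feed the model's own prediction back in
    generated.append(tok)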

Teacher forcing · why it's OK pragmatically

Even though training ≠ inference, teacher forcing works because:

  1. Loss is averaged over all time steps · bad predictions at step 10 aren't penalized harder than bad predictions at step 1; every position gets a balanced gradient signal.
  2. The model learns conditional distributions · it is trained to predict the next token given the true previous tokens, so if it is fed the right history at inference (a perfect rollout), it generalizes.
  3. Huge speedup from parallelization · with ground truth, all decoder steps can be computed in parallel (one big matrix multiply) instead of sequentially.

The "you're training on a distribution you won't see at inference" critique is real (exposure bias, next slide). But empirically it's tolerated because the alternative — fully autoregressive training — is 10-100× slower.

Exposure bias · the price you pay

Exposure bias · concrete cascade

Source: "Le chien a chassé le chat." Target: "The dog chased the cat."

Training (teacher forcing).

  1. Input <s> → predicts "A" (logP −1.2). Truth is "The". Loss is computed.
  2. Next input is the ground truth "The". Back on track.

Inference (autoregressive).

  1. Input <s> → predicts "A".
  2. Next input is "A" (its own mistake).
  3. Conditioned on "A", the model now predicts "dog" (probable continuation of "A").
  4. Sequence so far · "A dog…". The model has never seen its own mistakes during training, so it has no idea how to recover. Continues "A dog ran away" — completely diverged.

The model trained in a perfect world, tested in a messy one.

Mitigations · scheduled sampling (Bengio 2015), noisy data augmentation, or just sidestep with massive-scale Transformers (2020+).

PART 3

Decoding strategies

How to generate at inference

Four common strategies

Decoding · the search tree

Greedy · pick top-1 every step

Simplest: at each step, pick y_t = argmax_y P(y | y_<t, x).

Fast. Deterministic. Usually suboptimal — a slightly-less-likely next token can lead to a much-more-likely full sequence.

Example (simplified):

greedy: "The dog is running" (but doesn't quite fit context)
better: "The puppies are running" (total prob higher)

Greedy fails · the canonical example

Imagine the true best translation is "The cat sits on the mat" (full-sequence probability 0.4).

At step 1, the next-token probabilities are:

  • "The" · 0.45 (leads to the correct sequence, full probability 0.4)
  • "A" · 0.55 (leads to "A feline rests...", full probability 0.3)

Greedy picks "A" because it's locally higher. But the full-sequence probability is lower than the one starting with "The" (0.3 < 0.4).

Local optima are not global optima. Greedy decoding is a greedy search on the product of conditional probabilities — it commits at every step. Beam search (next) mitigates by keeping multiple candidates alive until the sequence ends.

Beam search · the tree

Beam search · the team-of-hikers analogy

Greedy decoding is one hiker taking the steepest step at every fork. They miss bigger peaks that require a less steep early path.

Beam search sends out k hikers (the "beam"). At every fork, they explore in parallel. We rank all paths globally and keep the k most-promising survivors. Repeat.

The team's collective best end-of-path score is far closer to the global maximum than greedy's. Cost · compute. Quality · 2-5 BLEU points typically.

From probabilities to log-probabilities

We want the output sentence with the highest probability under the model:

y* = argmax_y P(y | x) = argmax_y ∏_t P(y_t | y_<t, x)

For a 20-word sentence we'd multiply 20 small probabilities → a vanishingly small number, and for longer outputs the product underflows (rounded to 0).

Fix · take logs. log P(y | x) = Σ_t log P(y_t | y_<t, x). Multiplying probabilities becomes adding log-probabilities — much more stable.

Each term log P(y_t | y_<t, x) is a negative number. Higher (less negative) = better.
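
A quick numeric check (pure Python; the per-token probability of 0.01 and the 200-step length are deliberately extreme to force the underflow):

import math

probs = [0.01] * 200                     # 200 tokens, each with probability 0.01

product = 1.0
for p in probs:
    product *= p
print(product)                           # 0.0 · the raw product underflows float64

log_score = sum(math.log(p) for p in probs)
print(log_score)                         # about -921.0 · perfectly representable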

The length-bias problem

Every term log P(y_t | y_<t, x) is negative → longer sentences accumulate more negative terms → lower scores → shorter outputs are always preferred.

Compare "I am" with the fuller translation "I am a student" · the longer sentence adds extra negative log-prob terms, so its raw summed score is lower.

Plain log-prob picks "I am". Wrong.

Length normalization · divide the summed log-prob by the sentence length raised to a power α:

  • α = 0 · no penalty (broken).
  • α = 1 · simple average per token.
  • α ≈ 0.6 · typical compromise (a soft penalty).

It makes finished, longer hypotheses comparable to short ones — the goal is fair comparison, not always preferring long.
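
A minimal scoring helper (the helper name and the per-token log-probs are illustrative):

def length_normalized_score(token_logps, alpha=0.6):
    """Summed log-prob divided by length**alpha."""
    return sum(token_logps) / (len(token_logps) ** alpha)

short_hyp = [-1.0, -1.1]                       # "I am"
long_hyp  = [-1.0, -1.1, -0.9, -1.0]           # "I am a student"

print(length_normalized_score(short_hyp, alpha=0))   # -2.10 · raw sum prefers the short hypothesis
print(length_normalized_score(long_hyp,  alpha=0))   # -4.00
print(length_normalized_score(short_hyp, alpha=1))   # -1.05 · per-token average
print(length_normalized_score(long_hyp,  alpha=1))   # -1.00 · now the longer, better one wins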

Beam search · keep top-k paths

At each step, maintain k candidate prefixes. Expand each with every possible next token, then keep the top k by length-normalized log-prob score.

Typical k = 4 to 10. Larger k = better quality, slower. Still deterministic given k.

Beam search · worked example with k = 2

Vocab · {The, A, cat, dog, sat, ran}. Decoding "The cat sat".

Step 1. Beams · [<s>]. Top-2 next-token log-probs:

token logP
A -0.5
The -0.7

Keep both. Beams · [<s>, A] (-0.5), [<s>, The] (-0.7).

Step 2. Expand each. For brevity, the top extensions:

sequence logP
<s> The cat -0.7 + -0.6 = -1.3
<s> The dog -0.7 + -1.3 = -2.0
<s> A dog -0.5 + -0.9 = -1.4
<s> A cat -0.5 + -1.2 = -1.7

Keep the top 2 · <s> The cat (-1.3) and <s> A dog (-1.4).

Step 3. Continue until the end token. Final scores are divided by length^α so hypotheses of different lengths are comparable.

A beam of k = 2 finds "The cat sat" (logP -1.7). Greedy, by contrast, commits to "A" at step 1 (locally the best token), then to "dog", and ends at "A dog ran" (logP -2.5).
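
A compact sketch of the loop in plain Python. step_logprobs is a hypothetical hook standing in for one decoder step (it would wrap the LSTM decoder plus a log-softmax); ranking by the length-normalized score at every step is a simplification of the usual finished-hypothesis bookkeeping.

def beam_search(step_logprobs, k=2, max_len=10, alpha=0.6, eos="</s>"):
    """step_logprobs(prefix) -> {token: log-prob}. Hypothetical model hook."""
    beams = [(["<s>"], 0.0)]                          # (prefix, summed log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos:                     # finished · carry forward unchanged
                candidates.append((prefix, score))
                continue
            for tok, lp in step_logprobs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        # rank by length-normalized score, keep the top k
        candidates.sort(key=lambda c: c[1] / (len(c[0]) ** alpha), reverse=True)
        beams = candidates[:k]
    return beams[0]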

Top-k and nucleus · the improv-comedian analogy

  • Greedy / beam · a predictable comedian who always says the most obvious next line. Safe, but boring and repetitive.
  • Sampling · a creative comedian who knows the top 5–10 good responses and sometimes picks the 3rd or 4th — leads to a funnier story.

Top-k and nucleus are two ways to define the pool of good options for the comedian to sample from.

Top-k vs nucleus · how the pool is chosen

For open-ended generation (story writing, chat) beam is too deterministic — everything sounds the same.

  • Top-k · pool = the k most-likely tokens. Fixed size. Renormalize, sample.
  • Top-p (nucleus) · pool = the smallest set of tokens whose cumulative probability reaches p. Dynamic size — small when the model is confident, large when uncertain.

When the next-token distribution has one tall bar, nucleus narrows automatically; when it's flat, nucleus widens. Top-k ignores this and uses the same k everywhere.

2026 LLM default — nucleus with p = 0.9 or 0.95, plus a temperature knob.
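
A minimal sketch of both pools over a toy next-token distribution (PyTorch; the probabilities are illustrative, and the exact cut-off convention for the nucleus varies between implementations):

import torch

probs = torch.tensor([0.45, 0.25, 0.12, 0.08, 0.05, 0.03, 0.02])   # toy next-token distribution

def top_k_sample(probs, k=3):
    vals, idx = probs.topk(k)                 # fixed-size pool: the k most-likely tokens
    vals = vals / vals.sum()                  # renormalize, then sample
    return idx[torch.multinomial(vals, 1)]

def nucleus_sample(probs, p=0.9):
    sorted_p, idx = probs.sort(descending=True)
    keep = sorted_p.cumsum(0) <= p            # smallest pool reaching cumulative prob p
    keep[0] = True                            # always keep at least the top token
    n = int(keep.sum())                       # dynamic pool size
    pool = sorted_p[:n] / sorted_p[:n].sum()
    return idx[:n][torch.multinomial(pool, 1)]

print(top_k_sample(probs), nucleus_sample(probs))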

PART 4

The bottleneck that killed Seq2Seq

Why attention became inevitable

The failure mode · source length

The bottleneck in one sentence

The entire source — 5 words or 500 — must compress into one fixed-size context vector. The decoder then generates from that single vector.

For short sentences, fine. For long sentences, the encoder forgets the beginning by the time it reaches the end. The decoder has no way to recover what was lost.

Sutskever's own fix · reverse the input

The 2014 paper itself found that reversing the source improved BLEU by several points:

"I am happy" → encoded in order
"happy am I" → encoded reversed

Why? The last source words (now first) are closest to where the decoder begins generating — less path length for that information to travel through the hidden-state chain.
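
In code the trick is a one-liner: reverse the token order of each source sentence before encoding (a sketch, assuming a batch-first tensor of token ids and no padding; with padded batches only the non-pad positions should be flipped).

src_reversed = src.flip(dims=[1])      # "I am happy" -> "happy am I" at the token level
_, (h, c) = model.encoder(model.src_emb(src_reversed))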

A hack that works is a sign of a problem waiting to be solved properly. Reversing the input "fixed" the bottleneck by shifting where the leakage happens, not by removing it.

The obvious next step

If one context vector can't hold all source info, don't use one context vector.

Instead, let the decoder look at all the encoder hidden states — and decide which ones to focus on for each target step.

That is attention. Bahdanau et al. 2014 — the paper that launched a decade of NLP. Next lecture.

PART 5

Applications of classic Seq2Seq

Even in 2026, some parts survive

Where Seq2Seq-like ideas still ship

  • Speech recognition · Whisper uses encoder-decoder with attention (the full recipe).
  • Machine translation · Transformer encoder-decoder is literally Seq2Seq with attention instead of a context vector.
  • Code generation · same family, different data.
  • Summarization · T5, BART — encoder-decoder Transformers.

The Seq2Seq pattern (encoder → context → decoder) is everywhere. Only the implementation of "context" changed: fixed vector (2014) → attention (2015) → self-attention (2017) → ... → your favorite 2026 LLM.

Lecture 11 — summary

  • Seq2Seq · encoder-decoder for variable-length input → variable-length output.
  • Context vector is the encoder's final hidden state — a compressed summary.
  • Teacher forcing trains the decoder with ground-truth history; exposure bias is the price.
  • Decoding — greedy, beam (length-normalized), top-k, nucleus (2026 default).
  • The bottleneck — one fixed vector for all source info; BLEU collapses past ~30 tokens.
  • Attention is the fix · next lecture opens Module 6.

Read before Lecture 12

Prince Ch 12 (early sections on attention and QKV).

Next lecture

The Attention Mechanism — Bahdanau additive · Luong multiplicative · the query-key-value abstraction · scaled dot-product and why · self-attention.

Notebook 11 · 11-seq2seq-nmt.ipynb — tiny English→French translator with an LSTM encoder-decoder; no attention yet. Next notebook adds attention and you'll see the BLEU gap close.