Many-to-many (in sync) · e.g., POS tagging — one label per input token.
Many-to-many (encoder-decoder) · e.g., translation — input sequence, output sequence, different lengths. This is the Seq2Seq pattern covered in L11.
Four shapes, same cell.
```python
import torch
import torch.nn as nn

class RNNCell(nn.Module):
    def __init__(self, d_in, d_h):
        super().__init__()
        self.W = nn.Linear(d_in, d_h, bias=False)  # input-to-hidden
        self.U = nn.Linear(d_h, d_h, bias=True)    # hidden-to-hidden

    def forward(self, x_t, h_prev):
        # tanh folds the output back to [-1, 1]
        return torch.tanh(self.W(x_t) + self.U(h_prev))

# Unrolled loop · x has shape [batch, seq_len, d_in]
h = torch.zeros(batch, d_h)
for t in range(seq_len):
    h = cell(x[:, t], h)
```
nn.RNN, nn.LSTM, nn.GRU handle the loop + CUDA kernels for you.
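For example, a minimal usage sketch (the dimensions here are illustrative):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=64, hidden_size=128, batch_first=True)
x = torch.randn(32, 100, 64)   # [batch, seq_len, d_in]
out, h_T = rnn(x)              # out: h at every step, [32, 100, 128]; h_T: final h, [1, 32, 128]
```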
Same problem as depth, now along the time axis
Analogy · Chinese whispers. Tell a secret to person 1. They whisper to person 2, etc. By person 20, the message is garbled (vanished) or wildly distorted (exploded).
The gradient is that secret. It travels backward from the loss to the start of the sequence, and at each step it gets transformed.
Start with the simplest possible RNN: $h_t = \tanh(W x_t + U h_{t-1})$.

Chain rule:

$$\frac{\partial \mathcal{L}}{\partial h_1} = \frac{\partial \mathcal{L}}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}, \qquad \frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\!\left(1 - h_t^2\right) U$$

For sequence length $T$, the gradient at step 1 carries $T - 1$ of these Jacobian factors. Add the bound $|\tanh'| \le 1$, and each factor's norm is at most $\|U\|$.

A product of $T - 1$ factors shrinks exponentially when the norms sit below 1 and explodes when they sit above 1.

Consider a tiny RNN with scalar state · $h_t = w\,h_{t-1}$. Backprop through $T$ steps multiplies the gradient by $w^T$: with $w = 0.9$, $0.9^{100} \approx 3 \times 10^{-5}$; with $w = 1.1$, $1.1^{100} \approx 1.4 \times 10^{4}$.

That's why vanilla RNNs can't learn dependencies across more than ~20 timesteps. Every step in the product pulls the gradient toward zero if $|w| < 1$ (or toward infinity if $|w| > 1$).
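A minimal sketch verifying the scalar case · autograd recovers exactly $w^{100}$:

```python
import torch

h0 = torch.tensor(1.0, requires_grad=True)
h = h0
for _ in range(100):
    h = 0.9 * h            # 100 steps of h_t = w * h_{t-1} with w = 0.9
h.backward()
print(h0.grad.item())      # 0.9**100 ≈ 2.66e-05 · the gradient has all but vanished
```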
For long sequences (thousands of steps), full BPTT is expensive.
Truncated BPTT (TBPTT) — only backpropagate through the last $K$ timesteps, detaching the hidden state between chunks:
```python
h = torch.zeros(batch, d_h)
for chunk in sequence.split(K, dim=1):   # chunks of K timesteps
    h = h.detach()                       # cut the gradient here
    for t in range(chunk.size(1)):
        h = cell(chunk[:, t], h)
    loss = criterion(h, target)
    opt.zero_grad()
    loss.backward()                      # gradients flow at most K steps back
    opt.step()
```
Typical $K$ is in the tens to low hundreds; the exact value trades gradient reach against memory and speed.
Exploding gradients are worse than vanishing — one bad step can destroy weeks of training. Clip by global norm:
```python
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```
Pascanu et al. 2013 showed clipping at norm ~1 makes RNN training robust. Still the default for any sequence model — RNN, LSTM, Transformer. Cheap insurance against numerical catastrophe.
Three sigmoid gates protect a cell state
Interactive: drag forget/input/output sliders; see the cell state freeze, flow, or reset — lstm-gates.
A vanilla RNN treats every input the same. The LSTM has three "specialists":

- Forget gate · decides what to erase from memory.
- Input gate · decides what new information to write.
- Output gate · decides what part of memory to reveal.
The next slide gives the math · keep these roles in mind as you read the equations.
A vanilla RNN crams everything into one memory · the hidden state $h_t$.

LSTM adds a separate memory conveyor belt · the cell state $c_t$.
A protected long-term memory + learned controllers for write / forget / read. That's the whole idea.
Combine inputs into one control vector · every gate reads the concatenation $[h_{t-1}; x_t]$.
Step 1 · Forget gate (sigmoid → values in $[0, 1]$):

$$f_t = \sigma(W_f [h_{t-1}; x_t] + b_f)$$

0 → forget · 1 → keep.

Step 2 · Input gate + candidate:

$$i_t = \sigma(W_i [h_{t-1}; x_t] + b_i), \qquad \tilde{c}_t = \tanh(W_c [h_{t-1}; x_t] + b_c)$$

Step 3 · Update the cell state · throw out old, add new:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
The crucial + is what makes gradients flow.
Step 4 · Output gate + hidden state:

$$o_t = \sigma(W_o [h_{t-1}; x_t] + b_o), \qquad h_t = o_t \odot \tanh(c_t)$$
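Collecting the four steps into code · a from-scratch sketch (the class name is mine; in practice you'd reach for `nn.LSTMCell`):

```python
import torch
import torch.nn as nn

class NaiveLSTMCell(nn.Module):
    def __init__(self, d_in, d_h):
        super().__init__()
        # one matrix computes all four pre-activations from [h_{t-1}; x_t]
        self.gates = nn.Linear(d_in + d_h, 4 * d_h)

    def forward(self, x_t, h_prev, c_prev):
        f, i, o, g = self.gates(torch.cat([x_t, h_prev], dim=-1)).chunk(4, dim=-1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)  # Steps 1, 2, 4: gates in (0, 1)
        c_t = f * c_prev + i * torch.tanh(g)    # Step 3: the crucial "+"
        h_t = o * torch.tanh(c_t)               # Step 4: gated read-out
        return h_t, c_t
```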
1D state. Setup (illustrative values): the cell is tracking "subject is plural" with $c_{t-1} = 2.0$, and a singular subject has just arrived.

Network has learned (for this input): $f_t = 0.1$ (drop the old fact), $i_t = 0.9$ (write the new one), $\tilde{c}_t = -1.0$.

Compute new cell state: $c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde{c}_t = 0.1 \cdot 2.0 + 0.9 \cdot (-1.0) = -0.7$.

The memory flipped from large positive (plural) to negative (singular) in one step — exactly because the forget gate was small and the input gate was large.
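The same update as one line of Python, with the illustrative values above:

```python
f, i, c_prev, c_tilde = 0.1, 0.9, 2.0, -1.0
print(f * c_prev + i * c_tilde)   # -0.7 · plural flipped to singular in one step
```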
- Forget gate $f_t$ · what to erase from long-term memory.
- Input gate $i_t$ · what new information to write.
- Output gate $o_t$ · what part of memory to expose as $h_t$.
The LSTM's "memory" is the cell state $c_t$.
Vanilla RNN. $\dfrac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}(1 - h_t^2)\, U$ — a matrix multiplication at every step → product of matrices → vanishing/exploding.

LSTM cell state. $\dfrac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)$.

That's it — no matrix multiplication. Just element-wise multiplication by the forget gate.
If the network learns $f_t \approx 1$, the gradient passes through step after step essentially unchanged.

Suppose memory must be preserved → forget gate trained to $f_t = 0.999$:
| After 100 steps | Gradient factor |
|---|---|
| LSTM along cell state ($f_t = 0.999$) | $0.999^{100} \approx 0.90$ |
| Vanilla RNN (optimistic factor 0.5) | $0.5^{100} \approx 10^{-30}$ |
LSTM signal survives. RNN signal is below floating-point precision.
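Checking both factors directly:

```python
print(0.999 ** 100)   # ≈ 0.905 · the cell-state path barely decays
print(0.5 ** 100)     # ≈ 7.9e-31 · the vanilla RNN path is numerically gone
```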
This is the same idea as ResNet skip connections, in the time dimension.
Fewer gates, comparable accuracy
Two design questions Cho et al. asked in 2014: do we really need two separate states (cell + hidden)? And do we really need separate forget and input gates?

Result · the update gate $z_t$ alone does the work of the forget/input pair, and the hidden state doubles as the memory.
Step 1 · Reset gate. How much of the past to use when forming the candidate: $r_t = \sigma(W_r [h_{t-1}; x_t] + b_r)$

Step 2 · Candidate. Reset gate filters the past first: $\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}; x_t] + b_h)$

Step 3 · Update gate. How much new vs old to use: $z_t = \sigma(W_z [h_{t-1}; x_t] + b_z)$

Step 4 · Final state · linear interpolation: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
The additive structure (like LSTM's cell state) is what keeps gradients flowing.
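The four steps as a from-scratch sketch (my naming; `nn.GRUCell` is the production version):

```python
import torch
import torch.nn as nn

class NaiveGRUCell(nn.Module):
    def __init__(self, d_in, d_h):
        super().__init__()
        self.rz = nn.Linear(d_in + d_h, 2 * d_h)   # reset + update gates together
        self.cand = nn.Linear(d_in + d_h, d_h)     # candidate state

    def forward(self, x_t, h_prev):
        r, z = torch.sigmoid(self.rz(torch.cat([x_t, h_prev], dim=-1))).chunk(2, dim=-1)
        h_tilde = torch.tanh(self.cand(torch.cat([x_t, r * h_prev], dim=-1)))  # reset filters the past
        return (1 - z) * h_prev + z * h_tilde      # Step 4: linear interpolation
```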
1D state. Take $h_{t-1} = 0.8$ and candidate $\tilde{h}_t = -0.5$ (illustrative values).

Case 1 · $z_t = 0.1$: $h_t = 0.9 \cdot 0.8 + 0.1 \cdot (-0.5) = 0.67$.

Close to old state — input mostly ignored.

Case 2 · $z_t = 0.9$: $h_t = 0.1 \cdot 0.8 + 0.9 \cdot (-0.5) = -0.37$.

Close to candidate — state nearly fully replaced.

The single update gate gives the network smooth control between "preserve" and "overwrite."
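The interpolation, checked numerically with the values above:

```python
h_prev, h_tilde = 0.8, -0.5
for z in (0.1, 0.9):
    print((1 - z) * h_prev + z * h_tilde)   # 0.67 (preserve), then -0.37 (overwrite)
```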
In the late 2010s, ML papers often included an ablation · "we tried LSTM and GRU and picked whichever worked." By 2019, most groups defaulted to whichever they had better library support for.
That casualness told the story · the gating trick matters; which gates you pick doesn't much. Any additive-gated recurrence works; the architectural variants are micro-optimizations on the core idea.
| | LSTM | GRU |
|---|---|---|
| Gates | 3 + candidate | 2 + candidate |
| State | cell + hidden | hidden only |
| Params (d_in = d_h = 128, weights only) | 4 · 128 · 256 ≈ 131k | 3 · 128 · 256 ≈ 98k |
| Accuracy | baseline | often tied |
| Training speed | slower | ~15% faster |
Empirically close — both far beyond vanilla RNNs on long-range tasks. GRU was often preferred pre-Transformer; today either is fine for the few remaining RNN use cases.
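The parameter counts in the table are easy to verify (128-dim input and hidden; the totals below also include the bias terms):

```python
import torch.nn as nn

lstm, gru = nn.LSTM(128, 128), nn.GRU(128, 128)
print(sum(p.numel() for p in lstm.parameters()))  # 132096 = 4·128·256 weights + 2·4·128 biases
print(sum(p.numel() for p in gru.parameters()))   # 99072  = 3·128·256 weights + 2·3·128 biases
```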
Bidirectional · run one RNN left-to-right and another right-to-left; concatenate the outputs. Used for classification/tagging (not generation).
```python
self.rnn = nn.LSTM(d_in, d_h, num_layers=2, bidirectional=True, batch_first=True)
```
Stacked (deep) RNN · layer $\ell$ consumes the hidden-state sequence of layer $\ell - 1$ as its input.
Both tricks combinable: 2-layer bidirectional LSTMs were the standard NLP architecture from 2015–2017 (before Transformers).
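A shape sketch of that standard setup (dimensions are illustrative):

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(64, 128, num_layers=2, bidirectional=True, batch_first=True)
x = torch.randn(8, 50, 64)     # [batch, seq_len, d_in]
out, (h_n, c_n) = rnn(x)
print(out.shape)               # [8, 50, 256] · forward and backward outputs concatenated per step
print(h_n.shape)               # [4, 8, 128] · 2 layers × 2 directions
```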
Transformers have largely replaced RNNs for translation, language modeling, and most large-scale NLP · anywhere the full sequence is available up front and parallel training wins.
But RNNs are still the right choice for:
- Streaming / online inference · constant-size state, O(1) work per new token.
- Tiny devices · small, fixed memory footprint.
Starting in 2023, a new class of models has re-emerged · state-space models and linear RNNs that match Transformer quality with O(1) inference per token.
The story isn't "RNNs are dead" — it's "vanilla RNNs with sequential gradients couldn't scale." Modern parallelizable RNNs are a quiet comeback. Watch this space.
Consider translating:
"The animal didn't cross the street because it was too tired."
To translate "it" correctly, the model must look back to "animal" — maybe 6 tokens ago.
An RNN compresses all of that into a single fixed-size hidden vector $h_t$ and hopes the relevant information survives.
The next lecture (L11) examines encoder-decoder Seq2Seq, which also struggles with this fixed-length bottleneck. That struggle motivates attention (L12) — the idea that finally let sequence models scale.