Loss: $L(\theta) = \tfrac{1}{2}\left(\lambda_1 \theta_1^2 + \lambda_2 \theta_2^2\right)$ with $\lambda_1 \gg \lambda_2$ · a ravine: steep across, shallow along.
Pick the largest stable LR: gradient descent on a quadratic is stable only when $\eta < 2/\lambda_{\max}$, so the steep direction caps $\eta$.
The high-curvature direction forces a tiny LR; the low-curvature direction then converges painfully slowly.
Start at $\theta = (1, 1)$ and take $\lambda_1 = 100$, $\lambda_2 = 1$, $\eta = 0.019$ (just under the stability limit $2/\lambda_1 = 0.02$):

| step | $\theta_1$ | $\theta_2$ | $L(\theta)$ |
|---|---|---|---|
| 0 → 1 | $-0.90$ | $0.981$ | $40.98$ |
| 1 → 2 | $0.81$ | $0.962$ | $33.27$ |
| 2 → 3 | $-0.73$ | $0.944$ | $27.02$ |

$\theta_1$ flips sign every step (the zig-zag) while $\theta_2$ shrinks by a factor of only $0.981$ per step.
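A minimal sketch of that trace in pure Python, assuming the toy values above; the numbers it prints match the table:

```python
# Gradient descent on the toy ravine L(θ) = ½(100·θ1² + θ2²).
lam1, lam2, eta = 100.0, 1.0, 0.019    # assumed toy values from the slide
theta1, theta2 = 1.0, 1.0              # start at (1, 1)

for step in range(3):
    g1, g2 = lam1 * theta1, lam2 * theta2            # ∇L = (λ1·θ1, λ2·θ2)
    theta1, theta2 = theta1 - eta * g1, theta2 - eta * g2
    loss = 0.5 * (lam1 * theta1**2 + lam2 * theta2**2)
    print(f"{step} -> {step+1}: θ1={theta1:+.3f}  θ2={theta2:.3f}  L={loss:.2f}")
```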
In deep nets, the interesting structure of the loss surface lives at its critical points.
A critical point is anywhere the gradient vanishes: $\nabla L(\theta) = 0$. In 1D that means a minimum or a maximum.
In 2D and higher: a third option — the saddle.
Analogy · Pringles chip / horse saddle. At the centre, the gradient is zero. But it's not a minimum. Along the horse's spine, the surface curves up. Across its back, the surface curves down. Mixed curvature → saddle point.
To classify a critical point, look at curvature in every direction. The Hessian $H = \nabla^2 L(\theta)$ packages all of it: its eigenvalues are the curvatures along its eigendirections. All positive → minimum; all negative → maximum; mixed signs → saddle.
In $n$ dimensions the Hessian has $n$ eigenvalues, one per direction.
Toy probability: assume each eigenvalue is randomly positive or negative with probability $1/2$, independently.
For a net with $n = 10^6$ parameters, $P(\text{all positive}) = 2^{-10^6}$ · effectively zero.
Almost every critical point in a deep net is a saddle, not a minimum. The challenge isn't escaping valleys — it's navigating vast, flat saddle regions. Momentum's memory saves you here: it keeps you moving in a consistent direction through the flat plateau.
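A minimal sketch of the classification rule, using `torch.autograd.functional.hessian` on a toy saddle $f(x, y) = x^2 - y^2$ (my example, not from the lecture):

```python
import torch

def f(p):                       # toy saddle: curves up in x, down in y
    x, y = p
    return x**2 - y**2

p0 = torch.zeros(2)             # the critical point: ∇f(0, 0) = 0
H = torch.autograd.functional.hessian(f, p0)
eigs = torch.linalg.eigvalsh(H) # curvature along each eigendirection

if (eigs > 0).all():   kind = "minimum"
elif (eigs < 0).all(): kind = "maximum"
else:                  kind = "saddle"
print(eigs.tolist(), "->", kind)   # [-2.0, 2.0] -> saddle
```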
Gradient from a batch is a noisy estimate of the full gradient.
Larger batch → less noise, but in practice often worse generalization: the gradient noise acts as an implicit regularizer. "Linear-scaling" rule: if you 2× the batch, 2× the learning rate.
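A minimal sketch of the linear-scaling rule (the base batch and LR are assumed for illustration):

```python
# Linear-scaling rule: LR grows proportionally with batch size.
base_batch, base_lr = 64, 0.01        # assumed reference configuration
for batch in (64, 128, 256, 512):
    lr = base_lr * batch / base_batch
    print(f"batch={batch:4d}  lr={lr:.3f}")   # 0.010, 0.020, 0.040, 0.080
```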
The single most important change to SGD · momentum.
Vanilla SGD is a short-sighted hiker · only looks at the slope under their feet. In a narrow canyon they zig-zag wildly.
Momentum turns the hiker into a heavy ball rolling down the hill. The ball's inertia smooths out the zig-zags and carries it through small bumps and flat spots.
Algorithmically · keep an exponentially-weighted average of past gradients · use that as the update direction. The next slide turns this analogy into one line of math.
Replace position updates with velocity updates. A ball rolling down the valley:

$$v_t = \beta\, v_{t-1} + \nabla L(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} - \eta\, v_t$$

Formally: keep an exponentially-weighted average of past gradients and step along it instead of the raw gradient.
Let gradients in a ravine look like: $g_t = \left((-1)^{t+1}\,100,\; 1\right)$ · the steep coordinate flips sign every step, the shallow coordinate is a constant $+1$.
With $\beta = 0.9$, the alternating $\pm 100$s largely cancel inside the velocity, while the constant $+1$s pile up toward $1/(1-\beta) = 10$.
Vanilla SGD zig-zags across the walls at full amplitude; momentum damps the zig-zag and accelerates along the valley floor.
The same EMA mechanism shows up in Adam (L5), batch-norm running stats, and target networks in RL. One primitive, many uses.
Is $\beta$ yet another hyperparameter to tune? Yes · but a forgiving one.
| β | effective memory | behavior |
|---|---|---|
| 0.5 | 2 steps | barely smooths |
| 0.9 | 10 steps | sensible default |
| 0.95 | 20 steps | slower to respond to curvature changes |
| 0.99 | 100 steps | heavy; needs gradient clipping |
Most practitioners set $\beta = 0.9$ and never touch it.
Vanilla SGD only cares about the slope right now. A heavy ball has inertia — its motion today is a mix of where it was already going and the new push from the slope.
Analogy · pushing a bowling ball. Push it once → it rolls. Push it again in the same direction → it speeds up. Push it sideways → it changes direction, but it doesn't stop and turn on a dime. This memory of past motion is what we add to SGD.
This is an Exponential Moving Average (EMA) of past gradients, up to a constant factor: unrolling gives $v_t = \sum_{k=0}^{t-1} \beta^k\, g_{t-k}$, so with $\beta = 0.9$ the velocity remembers roughly the last $1/(1-\beta) = 10$ gradients.
PyTorch's SGD(..., momentum=0.9) uses an equivalent form.
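A quick sanity check (my toy script, assuming a 1-parameter loss $L(w) = w^2$): the hand-rolled recurrence $v \leftarrow \mu v + g$ tracks `torch.optim.SGD` step for step:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1, momentum=0.9)

w_hand, v = w.detach().clone(), torch.zeros(1)
for _ in range(5):
    loss = (w ** 2).sum()          # L(w) = w²
    opt.zero_grad()
    loss.backward()
    opt.step()

    g = 2 * w_hand                 # ∇L = 2w, computed by hand
    v = 0.9 * v + g                # PyTorch's momentum recurrence
    w_hand = w_hand - 0.1 * v

print(w.item(), w_hand.item())     # identical values
```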
Ravine gradients: $g_t = \left((-1)^{t+1}\,100,\; 1\right)$, velocity $v_t = 0.9\, v_{t-1} + g_t$:

| $t$ | $g_t$ | $v_t$ |
|---|---|---|
| 1 | $(100,\ 1)$ | $(100,\ 1.00)$ |
| 2 | $(-100,\ 1)$ | $(-10,\ 1.90)$ |
| 3 | $(100,\ 1)$ | $(91,\ 2.71)$ |

Observation. The oscillating coordinate gets knocked down (in steady state its amplitude settles near $100/(1+\beta) \approx 53$), while the persistent coordinate climbs toward $1/(1-\beta) = 10\times$ the raw gradient.
Momentum damps oscillation, amplifies persistence.
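A minimal sketch reproducing those numbers ($\beta = 0.9$, the alternating toy gradients above):

```python
beta = 0.9
v = (0.0, 0.0)
for t in range(1, 4):
    g = (100.0 if t % 2 else -100.0, 1.0)         # steep coord flips sign
    v = (beta * v[0] + g[0], beta * v[1] + g[1])  # v ← βv + g
    print(f"t={t}: g={g}  v=({v[0]:+.1f}, {v[1]:.2f})")
```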
Interactive: race SGD, momentum, Adam on a 2D quadratic — optimizer-race.
Same model, same LR, after 10 epochs:
| Optimizer | Train loss | Val accuracy |
|---|---|---|
| SGD | 0.42 | 92.1% |
| SGD + momentum | 0.13 | 97.6% |
Momentum is a free 5 points on MNIST. The single highest-value change to SGD.
```python
import torch
# model: any torch.nn.Module

# vanilla SGD
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# SGD + momentum — the sensible default
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```
Never use vanilla SGD without momentum for a deep network. It is a Lego brick, not a finished optimizer.
Evaluate the gradient one step ahead
Standard momentum is like driving by looking at the road right in front of your bumper · you steer based on what's directly under you.
Nesterov is like looking down the road. You first imagine where momentum is taking you, look at the slope at that future point, and steer based on that.
The result · less overshoot near valley walls · cleaner approach to the minimum · provably better convergence rate (in the convex case).
Standard momentum is a bit reckless. It computes the gradient at the current spot, then commits to a big velocity-driven step. It's like a driver looking only at the road right under the car.
Analogy · Nesterov is a smarter driver who looks ahead.
If your velocity was about to drive you into a wall, the lookahead gradient already points back — correcting the course before you fully commit.
One change: where the gradient is measured. Classical momentum uses $\nabla L(\theta_{t-1})$; Nesterov uses the lookahead point:

$$v_t = \beta\, v_{t-1} + \nabla L\!\left(\theta_{t-1} - \eta\beta\, v_{t-1}\right), \qquad \theta_t = \theta_{t-1} - \eta\, v_t$$

1D toy: $L(\theta) = \tfrac12\theta^2$, $\eta = 0.5$, $\beta = 0.9$, current state $\theta = 1$, $v = 1$. Classical gradient: $\nabla L(1) = 1$. Nesterov looks ahead to $\theta - \eta\beta v = 0.55$ and measures $\nabla L(0.55) = 0.55$.
The Nesterov gradient was smaller — it "saw" the valley flattening ahead → slightly more conservative step → less overshoot. Tiny difference per step, but compounds over training.
If the lookahead overshoots, the gradient at that point points back — correction kicks in before you commit to the full step.
Less overshoot near curvature changes, slightly faster convergence.
Theoretical payoff (convex, smooth case): optimal first-order convergence rate $O(1/t^2)$, versus $O(1/t)$ for plain gradient descent.
In practice on deep nets, the gain is modest but free.
If the gradient at the start was misleading, you've committed before knowing.
Information from the landscape you're about to visit, not the one you're leaving.
That's it · the same update, just measured at a smarter location.
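A minimal sketch (my toy script) racing the two updates on the 1D quadratic above, same $\eta = 0.5$, $\beta = 0.9$, and start; the Nesterov iterates settle much faster:

```python
eta, beta = 0.5, 0.9
grad = lambda th: th                      # ∇L for L(θ) = ½θ²

def run(nesterov, steps=8):
    th, v = 1.0, 1.0                      # same start as the 1D toy
    path = []
    for _ in range(steps):
        look = th - eta * beta * v if nesterov else th
        v = beta * v + grad(look)         # gradient at lookahead (or not)
        th = th - eta * v
        path.append(th)
    return path

print("classical:", [f"{x:+.2f}" for x in run(False)])
print("nesterov: ", [f"{x:+.2f}" for x in run(True)])
```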
```python
opt = torch.optim.SGD(model.parameters(),
                      lr=0.01,
                      momentum=0.9,
                      nesterov=True)  # ← the flag
```
That is the entire cost of using it.
What to actually use
| Use-case | Optimizer | Reason |
|---|---|---|
| CNN from scratch | SGD + momentum + Nesterov | best test accuracy on vision |
| Fine-tuning anything | AdamW (L5) | robust across LR |
| Transformer training | AdamW + warmup + cosine | field default |
| Debugging a new model | Adam first | faster to iterate |
Image researchers often prefer SGD-Momentum for final runs — it tends to find flatter minima that generalize slightly better. Everyone else uses AdamW.
Q. Student sets momentum=0.99 because "more is better." Loss diverges. Why?
Higher $\beta$ multiplies the terminal step by $1/(1-\beta)$: jumping from $0.9$ to $0.99$ makes effective steps $10\times$ larger, so an LR that was stable before now diverges.
Rule of thumb: keep the effective LR $\eta/(1-\beta)$ roughly constant · if you raise $\beta$ from $0.9$ to $0.99$, cut $\eta$ by $10\times$.
What happens when the gradient points the same way, step after step?
Analogy · pushing a child on a swing. First push — they get some velocity. Next time they swing by — push again. The new push adds to the existing velocity. They go higher and higher. Momentum does the same: each consistent gradient builds the velocity, leading to far bigger steps than the gradient alone.
Let's derive how big that compounding gets.
Assume gradient is constant: $g_t = g$ for every step. Unrolling $v_t = \beta v_{t-1} + g$ gives $v_t = g\left(1 + \beta + \beta^2 + \cdots + \beta^{t-1}\right)$.
A geometric series with ratio $\beta < 1$: $v_\infty = \dfrac{g}{1-\beta}$.
So the parameter update at terminal velocity is: $\Delta\theta = -\eta\, v_\infty = -\dfrac{\eta}{1-\beta}\, g$.
How much does $\beta$ amplify the step?

| $\beta$ | $1/(1-\beta)$ | Effect |
|---|---|---|
| 0.0 | 1× | no momentum |
| 0.5 | 2× | light |
| 0.9 | 10× | standard default |
| 0.95 | 20× | heavy |
| 0.99 | 100× | very heavy |
Key takeaway. A small bump from $\beta = 0.9$ to $\beta = 0.99$ multiplies the terminal step by another $10\times$ · an LR that was stable before can now diverge.
Practical recipe: fix $\beta = 0.9$ and tune $\eta$; if you do change $\beta$, rescale $\eta$ to keep $\eta/(1-\beta)$ roughly constant.
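A minimal sketch of the climb to terminal velocity (constant $g = 1$, $\beta = 0.9$, so $v_\infty = 10$):

```python
beta, g = 0.9, 1.0
v = 0.0
for t in range(1, 51):
    v = beta * v + g                   # v ← βv + g with constant gradient
    if t in (1, 5, 10, 25, 50):
        print(f"t={t:2d}  v={v:.3f}")  # climbs toward g/(1-β) = 10
```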
Common symptoms and fixes:
| Symptom | Likely cause | Fix |
|---|---|---|
| Loss → NaN after step 1 | LR too high, fp16 overflow | halve $\eta$; check loss scaling |
| Loss oscillates (±) | ravine + momentum too low | raise $\beta$ (0.9 → 0.95) |
| Loss plateaus for hundreds of steps | stuck near saddle | raise $\eta$ or $\beta$ |
| Loss drops then climbs | overfitting (not optimizer) | add weight decay, lower capacity |
| Training slower than Keras example | no momentum | add momentum=0.9 |
Most "my network doesn't train" bugs are optimizer-level. The debug ladder from L3 + this table catches ~90% of them in practice.
Vanilla SGD is steepest descent · momentum is a low-pass filter on gradients · Nesterov is momentum that peeks ahead. All three live on the same loss landscape · they differ only in how they smooth the gradient signal across steps.
| Update | Equation | Cures |
|---|---|---|
| SGD | $\theta \leftarrow \theta - \eta g$ | nothing — pure 1st-order |
| Momentum | $v \leftarrow \beta v + g$; $\;\theta \leftarrow \theta - \eta v$ | ravine zig-zag, saddles |
| Nesterov | $v \leftarrow \beta v + \nabla L(\theta - \eta\beta v)$; $\;\theta \leftarrow \theta - \eta v$ | overshoot near a minimum |
Effective LR under momentum · $\eta_{\text{eff}} = \eta/(1-\beta)$ · this is the number to keep constant when you trade $\eta$ against $\beta$.
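A minimal sketch collecting all three updates in one function (the name and signature are mine, not a library API):

```python
def step(theta, v, grad_fn, eta, beta=0.0, nesterov=False):
    """One update: beta=0 → SGD, beta>0 → momentum, nesterov=True → NAG."""
    look = theta - eta * beta * v if nesterov else theta
    v = beta * v + grad_fn(look)       # velocity: gradient memory
    return theta - eta * v, v

# Usage on the 1D quadratic L(θ) = ½θ², ∇L(θ) = θ:
theta, v = 1.0, 0.0
for _ in range(20):
    theta, v = step(theta, v, lambda th: th, eta=0.1, beta=0.9, nesterov=True)
print(theta)
```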
The 50-layer ResNet that oscillated forever? The answer is (b) ravine + step too large.
Vanishing gradients (a) would flatten loss, not oscillate. Bad shuffling (c) would show step-to-step jitter, not a stable cycle. Saddles (d) plateau, not oscillate.
Fix · raise $\beta$ so the shallow direction keeps its speed, and lower $\eta$ below the steep direction's stability limit $2/\lambda_{\max}$.
P1. A quadratic $L(\theta) = \tfrac12\left(\lambda_1\theta_1^2 + \lambda_2\theta_2^2\right)$ with $\lambda_1 \gg \lambda_2$ · derive the largest stable LR for vanilla GD and the per-step contraction factor along the shallow direction.
P2. Show that the momentum recurrence $v_t = \beta v_{t-1} + g_t$ unrolls to $v_t = \sum_{k=0}^{t-1}\beta^k\, g_{t-k}$, and that for a constant gradient $g$ it converges to $g/(1-\beta)$.
P3. Explain in one sentence why momentum helps escape saddles but does not help escape strict local minima.
P4. Run two trajectories on a Rosenbrock-like ravine · SGD with a small LR, and SGD + momentum ($\beta = 0.9$) at the same LR · plot both paths and compare the number of steps to reach the valley floor.
P5. Show that Nesterov's update can be rewritten as classical momentum plus a gradient correction term. State the correction.
P6. A practitioner raises momentum from 0.9 to 0.99 and the run NaNs. Without changing $\beta$ back, what single change restores stability, and by roughly what factor?