SGD, Momentum, Nesterov

Lecture 4 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Learning outcomes

By the end of this lecture you will be able to:

  1. Identify ravines and saddles in high-dimensional loss landscapes.
  2. Show why vanilla SGD oscillates across ravines.
  3. Derive momentum as EMA of gradients.
  4. Explain Nesterov's lookahead and its convergence rate payoff.
  5. Pick β appropriately (0.9 default; when to adjust).
  6. Diagnose an optimizer failure from training curves.

Recap · where we are

  • Deep networks are trainable with ResNets + He init + ReLU.
  • PyTorch recipe — forward, loss, zero_grad, backward, step.
  • Debugging ladder — overfit one batch, LR finder, error analysis.

Today maps to UDL Ch 6 · fitting models (SGD, momentum, acceleration).

One piece we glossed over: the optimizer. Today we open that box.

Four questions for today

  1. What does the loss landscape look like, really?
  2. Why does vanilla SGD oscillate on it?
  3. How does momentum fix the oscillation?
  4. What does Nesterov's lookahead add on top?

Pop quiz · what's killing this run?

A 50-layer ResNet trains fine for 200 steps, then loss starts oscillating between 1.4 and 1.9 forever — never diverging, never improving.

(a) Vanishing gradients.
(b) Step size too large for a narrow ravine.
(c) Bad data shuffling.
(d) Saddle point.

Stop and decide. We'll come back to this once you've seen the ravine geometry — your gut answer will probably change.

This is the most common failure mode of vanilla SGD on real loss surfaces. By the end of today you'll diagnose it from the curves alone.

PART 1

The loss landscape

What makes neural-net optimization hard

Gradient descent — picture

Three kinds of critical points

The ravine problem — κ elongates the basin

Why vanilla SGD oscillates on ravines

Ravine · zig-zag vs glide

What makes a valley hard to navigate?

Analogy · hiking. A round bowl is easy — every direction goes downhill. A steep, narrow canyon is tricky. The walls are very steep, but the path forward (along the floor) is almost flat.

  • Big step → slam into the canyon wall.
  • Tiny step (avoiding the wall) → crawl along the floor.

The "steepness ratio" between the walls and the floor is the condition number . Large → narrow ravine → SGD struggles.

The condition number · let's compute it

Loss: L(θ) = ½ (100 θ₁² + θ₂²). Hessian eigenvalues λ₁ = 100, λ₂ = 1 → condition number κ = λ_max / λ_min = 100.

  1. Gradient. ∇L(θ) = (100 θ₁, θ₂).
  2. GD update. θ ← θ − α ∇L(θ).
  3. Per-coordinate. θᵢ ← (1 − α λᵢ) θᵢ.
  4. Stability constraint. |1 − α λᵢ| < 1 for every i ⟹ α < 2/λ_max = 0.02.

Pick the largest (nearly) stable LR: α = 0.019.

  • θ₁ shrinks by |1 − 100α| = 0.9 per step — and actually flips sign every step (overshoots).
  • θ₂ shrinks by only 1 − α = 0.981 per step — crawls.

The high-curvature direction forces a tiny LR; the low-curvature direction then converges painfully slowly.

Worked numeric · SGD in a ravine

Same loss, κ = 100. Start θ₀ = (1, 1), α = 0.019.

step     θ₁                  θ₂
0 → 1    1 → −0.90           1 → 0.981
1 → 2    −0.90 → 0.81        0.981 → 0.962
2 → 3    0.81 → −0.73        0.962 → 0.944

θ₁ zig-zags wildly: 1 → −0.90 → 0.81 → −0.73 → …
θ₂ crawls: 1 → 0.981 → 0.962 → 0.944 → …

In deep nets, κ is often in the thousands or worse. This is why momentum, adaptive LRs, and normalization all help — they rescale the geometry so κ matters less.
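
A minimal sketch of this experiment, assuming the same toy quadratic and α = 0.019 as above (not from the course notebook):

import numpy as np

lams = np.array([100.0, 1.0])        # Hessian eigenvalues: steep wall, flat floor
alpha = 0.019                        # just under the 2/λ_max = 0.02 stability limit
theta = np.array([1.0, 1.0])

for step in range(3):
    grad = lams * theta              # ∇L(θ) = (100·θ1, θ2)
    theta = theta - alpha * grad
    print(step + 1, theta.round(3))  # θ1 flips sign each step; θ2 barely moves
# 1 [-0.9    0.981]
# 2 [ 0.81   0.962]
# 3 [-0.729  0.944]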

What kinds of "flat spots" exist?

A critical point is anywhere ∇L(θ) = 0. In 1D: valley bottom (min) or hilltop (max).
In 2D and higher: a third option — the saddle.

Analogy · Pringles chip / horse saddle. At the centre, the gradient is zero. But it's not a minimum. Along the horse's spine, the surface curves up. Across its back, the surface curves down. Mixed curvature → saddle point.

To classify a critical point, look at curvature in every direction. The Hessian stores all second derivatives; its eigenvalues give curvature along the principal directions.
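
A quick sketch of that classification rule on a toy saddle, f(x, y) = x² − y² (the function and the numbers are illustrative):

import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])          # Hessian of x² − y² at its critical point (0, 0)
eig = np.linalg.eigvalsh(H)

if (eig > 0).all():
    kind = "local minimum"           # curvature up in every direction
elif (eig < 0).all():
    kind = "local maximum"           # curvature down in every direction
else:
    kind = "saddle"                  # mixed curvature
print(eig, kind)                     # [-2.  2.] saddle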

Why saddles dominate in high dimensions

In D dimensions there are D curvature directions (one per Hessian eigenvalue):

  • Local min — all eigenvalues λᵢ > 0 (curvature UP everywhere).
  • Local max — all λᵢ < 0 (curvature DOWN everywhere).
  • Saddle — a mix of positive and negative λᵢ.

Toy probability: assume each eigenvalue is independently positive or negative with probability ½.

For a network with D = 10⁶ parameters: the probability that every eigenvalue is positive — a true local minimum — is (½)^D, essentially zero.
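
The arithmetic, as a short check (the D values are illustrative):

from math import log10

for D in (10, 1_000, 1_000_000):
    # 0.5**D underflows for large D, so report its base-10 exponent instead
    print(f"D = {D:>9,}   P(all eigenvalues > 0) ≈ 10^{D * log10(0.5):.0f}")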

Almost every critical point in a deep net is a saddle, not a minimum. The challenge isn't escaping valleys — it's navigating vast, flat saddle regions. Momentum's memory saves you here: it keeps you moving in a consistent direction through the flat plateau.

Mini-batch noise is not always bad

Gradient from a batch is a noisy estimate of the full gradient.

  • Bad: adds variance to each step.
  • Good: helps escape saddle points and shallow local minima.
  • Good (more): implicit regularization from noise is part of why SGD generalizes.

Larger batch → less noise → often slightly worse generalization in practice. "Linear-scaling" rule of thumb: if you 2× the batch, 2× the learning rate.
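
The scaling rule as code — a sketch with illustrative base values, not a universal law (it breaks down for very large batches):

base_lr, base_batch = 0.01, 128

def scaled_lr(batch_size):
    # "2× the batch, 2× the learning rate"
    return base_lr * batch_size / base_batch

print(scaled_lr(256))    # 0.02
print(scaled_lr(1024))   # 0.08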

PART 2

Momentum

The single most important change to SGD

Momentum · the heavy-ball analogy

Vanilla SGD is a short-sighted hiker · only looks at the slope under their feet. In a narrow canyon they zig-zag wildly.

Momentum turns the hiker into a heavy ball rolling down the hill. The ball's inertia smooths out the zig-zags and carries it through small bumps and flat spots.

Algorithmically · keep an exponentially-weighted average of past gradients · use that as the update direction. The next slide turns this analogy into one line of math.

Momentum = EMA of gradients

The physical intuition

Replace position updates with velocity updates. A ball rolling down the valley:

  • accumulates speed in consistent directions
  • averages out back-and-forth from noise

Formally — keep an exponential moving average of past gradients.

Momentum · numerical trace

Let gradients in a ravine look like g_t = ((−1)^t, 0.1): flipping sign on every step in direction 1, a small consistent push of 0.1 in direction 2.

With β = 0.9, the EMA settles to:

  • direction 1 · average of the ±1 oscillations → near zero
  • direction 2 · average of 0.1 → 0.1 (preserved)

Vanilla SGD zig-zags on direction 1. Momentum cancels that out — direction 2 gets all the step budget. Zig-zag in, drift out.

The same EMA mechanism shows up in Adam (L5), batch-norm running stats, and target networks in RL. One primitive, many uses.

Momentum = one more hyperparameter?

Yes — but a forgiving one.

β effective memory behavior
0.5 2 steps barely smooths
0.9 10 steps sensible default
0.95 20 steps slower to respond to curvature changes
0.99 100 steps heavy; needs gradient clipping

Most practitioners set β = 0.9 once and never touch it again. The knob you tune is the learning rate α.

From hiker to heavy ball

Vanilla SGD only cares about the slope right now. A heavy ball has inertia — its motion today is a mix of where it was already going and the new push from the slope.

Analogy · pushing a bowling ball. Push it once → it rolls. Push it again in the same direction → it speeds up. Push it sideways → it changes direction, but it doesn't stop and turn on a dime. This memory of past motion is what we add to SGD.

Momentum · build the update step-by-step

  1. Define velocity. Start with v₀ = 0 — a vector that remembers past gradients.
  2. Velocity update. Mix old velocity with the new gradient using β:

     v_t = β · v_{t−1} + (1 − β) · ∇L(θ_t)

  3. Position update. Step using the smoothed velocity, not the raw gradient:

     θ_{t+1} = θ_t − α · v_t

This is an Exponential Moving Average (EMA). With β = 0.9: keep 90% of the old velocity, mix in 10% of the new gradient. Effective memory ≈ 1/(1−β) = 10 steps.

PyTorch's SGD(..., momentum=0.9) uses the equivalent accumulation form v_t = β·v_{t−1} + ∇L(θ_t); dropping the (1−β) factor just rescales the effective LR.
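
A from-scratch sketch of the two update lines for a single parameter tensor (the helper name momentum_step is mine, not PyTorch's):

import torch

def momentum_step(theta, grad, v, lr=0.01, beta=0.9):
    v.mul_(beta).add_(grad, alpha=1 - beta)   # v ← β·v + (1−β)·∇L(θ)
    theta.sub_(lr * v)                        # θ ← θ − α·v
    return theta, v

theta = torch.tensor([1.0, 1.0])
v     = torch.zeros_like(theta)               # v₀ = 0
grad  = torch.tensor([1.0, 0.1])
theta, v = momentum_step(theta, grad, v)
print(theta, v)   # tensor([0.9990, 0.9999]) tensor([0.1000, 0.0100])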

Worked numeric · momentum smooths the ravine

Ravine gradients: g_t = ((−1)^t, 0.1) — first component flips sign every step, second is constant. β = 0.9, v₀ = (0, 0), v_t = β v_{t−1} + (1 − β) g_t.

t    g_t           v_t
1    (−1, 0.1)     (−0.100, 0.0100)
2    (+1, 0.1)     (+0.010, 0.0190)
3    (−1, 0.1)     (−0.091, 0.0271)

Observation.

  • First coordinate of v oscillates near zero (−0.100 → +0.010 → −0.091): zig-zags cancel out.
  • Second coordinate steadily grows (0.0100 → 0.0190 → 0.0271, toward 0.1): consistent push accumulates.

Momentum damps oscillation, amplifies persistence.
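
The same trace in a few lines of plain Python, if you want to check the numbers:

beta = 0.9
v1 = v2 = 0.0
for t in range(1, 4):
    g1, g2 = (-1.0) ** t, 0.1                 # ravine gradient ((−1)^t, 0.1)
    v1 = beta * v1 + (1 - beta) * g1
    v2 = beta * v2 + (1 - beta) * g2
    print(t, round(v1, 3), round(v2, 4))
# 1 -0.1   0.01
# 2  0.01  0.019
# 3 -0.091 0.0271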

What momentum fixes

▶ Interactive: race SGD, momentum, Adam on a 2D quadratic — optimizer-race.

A concrete example · MNIST

Same model, same LR, after 10 epochs:

Optimizer Train loss Val accuracy
SGD 0.42 92.1%
SGD + momentum 0.13 97.6%

Momentum is a free 5 points on MNIST. The single highest-value change to SGD.

Momentum in PyTorch · one line

# vanilla SGD
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# SGD + momentum  — the sensible default
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

Never use vanilla SGD without momentum for a deep network. It is a Lego brick, not a finished optimizer.

PART 3

Nesterov accelerated gradient

Evaluate the gradient one step ahead

Nesterov · driving with a longer-range view

Standard momentum is like driving by looking at the road right in front of your bumper · you steer based on what's directly under you.

Nesterov is like looking down the road. You first imagine where momentum is taking you, look at the slope at that future point, and steer based on that.

The result · less overshoot near valley walls · cleaner approach to the minimum · provably better convergence rate (in the convex case).

Classical vs Nesterov

A smarter heavy ball

Standard momentum is a bit reckless. It computes the gradient at the current spot, then commits to a big velocity-driven step. It's like a driver looking only at the road right under the car.

Analogy · Nesterov is a smarter driver who looks ahead.

  1. First, make a "guess" move based only on old velocity → land at a lookahead point.
  2. From the lookahead, compute the gradient.
  3. Use that gradient (the slope where you're about to be) to make the actual step.

If your velocity was about to drive you into a wall, the lookahead gradient already points back — correcting the course before you fully commit.

Nesterov · the update in three steps

  1. Project a lookahead point — where would the old velocity take us?

     θ_look = θ_t − α β v_{t−1}

  2. Compute the gradient at the lookahead (not at θ_t):

     g_t = ∇L(θ_look)

  3. Standard momentum update, but with the smarter gradient:

     v_t = β v_{t−1} + (1 − β) g_t,   θ_{t+1} = θ_t − α v_t

One change: where the gradient is measured.
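
A sketch of one Nesterov step with the same (1 − β) convention — grad_fn and the toy quadratic are illustrative, and PyTorch's nesterov=True uses an equivalent reformulation:

import torch

def nesterov_step(theta, v, grad_fn, lr=0.5, beta=0.9):
    look = theta - lr * beta * v          # 1. where the old velocity would take us
    g = grad_fn(look)                     # 2. gradient at the lookahead point
    v = beta * v + (1 - beta) * g         # 3. standard momentum update with that gradient
    theta = theta - lr * v
    return theta, v

# toy 1D quadratic L(θ) = ½θ²  →  ∇L(θ) = θ
theta, v = torch.tensor([1.0]), torch.tensor([1.0])
theta, v = nesterov_step(theta, v, grad_fn=lambda t: t)
print(theta, v)                           # tensor([0.5225]) tensor([0.9550])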

Worked numeric · Nesterov's correction

1D toy: L(θ) = ½θ², so ∇L(θ) = θ. Suppose the ball is already rolling: θ = 1.0, v = 1.0, with α = 0.5, β = 0.9.

Standard momentum

  • g = ∇L(1.0) = 1.0
  • v ← 0.9·1.0 + 0.1·1.0 = 1.0 → step α·v = 0.50 → θ = 0.50

Nesterov

  • lookahead: θ_look = 1.0 − 0.5·0.9·1.0 = 0.55
  • g = ∇L(0.55) = 0.55 (less steep!)
  • v ← 0.9·1.0 + 0.1·0.55 = 0.955 → step α·v ≈ 0.48 → θ ≈ 0.52

The Nesterov gradient was smaller — it "saw" the valley flattening ahead → slightly more conservative step → less overshoot. Tiny difference per step, but compounds over training.

Why it helps

If the lookahead overshoots, the gradient at that point points back — correction kicks in before you commit to the full step.

Less overshoot near curvature changes, slightly faster convergence.

Theoretical payoff (convex, smooth case): the optimal O(1/T²) convergence rate vs plain GD's O(1/T).

In practice on deep nets, the gain is modest but free.

A geometric way to see Nesterov

Classical momentum

  1. Take the velocity step β·v (momentum part)
  2. Add a correction based on gradient at start

If the gradient at the start was misleading, you've committed before knowing.

Nesterov

  1. Tentatively move to a lookahead point
  2. Measure the gradient there
  3. Use that gradient for the real step

Information from the landscape you're about to visit, not the one you're leaving.

That's it · the same update, just measured at a smarter location.

Nesterov in PyTorch · one flag

opt = torch.optim.SGD(model.parameters(),
                      lr=0.01,
                      momentum=0.9,
                      nesterov=True)    # ← the flag

That is the entire cost of using it.

PART 4

Practical recommendations

What to actually use

Current 2026 practice

Use-case Optimizer Reason
CNN from scratch SGD + momentum + Nesterov best test accuracy on vision
Fine-tuning anything AdamW (L5) robust across LR
Transformer training AdamW + warmup + cosine field default
Debugging a new model Adam first faster to iterate

Image researchers often prefer SGD-Momentum for final runs — it tends to find flatter minima that generalize slightly better. Everyone else uses AdamW.

A common mistake

Q. Student sets momentum=0.99 because "more is better." Loss diverges. Why?

Higher = longer memory. If curvature changes abruptly (early training), a heavy memory keeps you pushing in stale directions after the gradient has reversed → overshoot.

Rule of thumb: keep β at the 0.9 default. Go higher only with strong gradient clipping + warmup.
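
If you do go above 0.9, pair it with clipping. A minimal sketch — the tiny model, data, and max_norm = 1.0 are illustrative; clip_grad_norm_ is the standard PyTorch utility:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.99)
x, y = torch.randn(32, 10), torch.randn(32, 1)

loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # cap the gradient norm
opt.step()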

Momentum's compounding effect

What happens when the gradient points the same way, step after step?

Analogy · pushing a child on a swing. First push — they get some velocity. Next time they swing by — push again. The new push adds to the existing velocity. They go higher and higher. Momentum does the same: each consistent gradient builds the velocity, leading to far bigger steps than the gradient alone.

Let's derive how big that compounding gets.

Deriving the effective learning rate

Assume the gradient g is constant for many steps. Use the simpler accumulation form v_t = β v_{t−1} + g (same behaviour as the EMA form, up to a rescaling of α). Unroll, with v₀ = 0:

  • t = 1:  v₁ = g
  • t = 2:  v₂ = (1 + β) g
  • t = 3:  v₃ = (1 + β + β²) g
  • t = k:  v_k = (1 + β + … + β^{k−1}) g

A geometric series with ratio β < 1:

     v_∞ = g / (1 − β)

So the parameter update at terminal velocity is:

     Δθ = −α v_∞ = −(α / (1 − β)) g
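
A three-line numerical check of the geometric series (β = 0.9, g = 1):

beta, g, v = 0.9, 1.0, 0.0
for _ in range(100):
    v = beta * v + g          # accumulation form with a constant gradient
print(v, 1 / (1 - beta))      # ≈ 10.0  10.0 — terminal velocity ≈ g/(1−β)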

The "effective LR" in numbers

How much does β amplify α?

β       effective LR α/(1−β)    Effect
0.0     1 × α                   no momentum
0.5     2 × α                   light
0.9     10 × α                  standard default
0.95    20 × α                  heavy
0.99    100 × α                 very heavy

Key takeaway. A small bump from β = 0.9 to β = 0.99 multiplies your effective LR by 10×. Your previously-stable run will diverge. When you raise momentum, lower α to compensate.

Practical recipe: fix β = 0.9; use the LR finder to pick α. Revisit β only if training is unstable.
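
One way to encode the compensation rule (a sketch; the function name is mine):

def compensated_lr(alpha, beta_old, beta_new):
    # keep the effective LR  α/(1−β)  constant when changing momentum
    return alpha * (1 - beta_new) / (1 - beta_old)

print(compensated_lr(0.01, beta_old=0.9, beta_new=0.99))   # ≈ 0.001 — 10× smaller, as the table predicts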

Debugging optimizer failures

Common symptoms and fixes:

Symptom                               Likely cause                   Fix
Loss → NaN after step 1               LR too high, fp16 overflow     halve α, enable gradient clipping
Loss oscillates (±)                   ravine + momentum too low      raise β to 0.9
Loss plateaus for hundreds of steps   stuck near saddle              raise β or switch to Adam
Loss drops then climbs                overfitting (not optimizer)    add weight decay or other regularization
Training slower than Keras example    no momentum                    add momentum=0.9

Most "my network doesn't train" bugs are optimizer-level. The debug ladder from L3 + this table catches ~90% of them in practice.

Putting it all together · the L04 master sentence

Vanilla SGD is steepest descent · momentum is a low-pass filter on gradients · Nesterov is momentum that peeks ahead. All three live on the same loss landscape · they differ only in how they smooth the gradient signal across steps.

Update      Equation                                           Cures
SGD         θ ← θ − α ∇L(θ)                                    nothing — pure 1st-order
Momentum    v ← βv + (1−β)∇L(θ);  θ ← θ − αv                   ravine zig-zag, saddles
Nesterov    same, but ∇L evaluated at the lookahead θ − αβv    overshoot near a minimum

Effective LR under momentum · α/(1−β) (in the accumulation form v ← βv + g). Raising β from 0.9 to 0.99 multiplies it by 10× — the most common cause of "I bumped momentum and everything diverged."

Pop quiz · revisit

The 50-layer ResNet that oscillated forever? The answer is (b) ravine + step too large.

Vanishing gradients (a) would flatten loss, not oscillate. Bad shuffling (c) would show step-to-step jitter, not a stable cycle. Saddles (d) plateau, not oscillate.
Fix · raise β to 0.9 (momentum smooths the cycle), or halve α.

Practice problems

P1. A quadratic has Hessian eigenvalues λ_max = 100 and λ_min = 1. Compute the largest stable LR for SGD. Why does this LR make the low-curvature direction crawl?

P2. Show that the momentum recurrence v_t = β v_{t−1} + g with constant gradient g converges to v_∞ = g/(1−β). Use this to derive the effective LR.

P3. Explain in one sentence why momentum helps escape saddles but does not help escape strict local minima.

P4. Run two trajectories on a Rosenbrock-like ravine · vanilla SGD vs SGD-momentum with β = 0.9 at the same learning rate. Predict which oscillates more and why.

P5. Show that Nesterov's update can be rewritten as classical momentum plus a gradient correction term. State the correction.

P6. A practitioner raises momentum from 0.9 to 0.99 and the run NaNs. Without changing momentum back, what one change rescues the run?

Lecture 4 — summary

  • Loss landscapes — ravines, saddles, ill-conditioning. High-dim is mostly saddles.
  • Vanilla SGD's single step size must serve every direction at once → oscillates across narrow valleys.
  • Momentum = EMA of gradients; damps zig-zag, reinforces consistent directions.
  • Nesterov evaluates the gradient at the lookahead point — a free small speed-up.
  • In practice · SGD+momentum(+Nesterov) for vision, AdamW for everything else.

Read before Lecture 5

Prince — Ch 6 §6.4–6.6. Free at udlbook.github.io.

Next lecture

Adam, AdamW, and learning-rate schedules — per-parameter adaptive LR, bias correction, decoupled weight decay, warmup + cosine.

Notebook 4 · 04-optimizer-race.ipynb — implement SGD, momentum, Nesterov from scratch; animate trajectories on a 2D quadratic and Rosenbrock.