Loss: $L(\theta) = \tfrac{1}{2}\left(\lambda_1 \theta_1^2 + \lambda_2 \theta_2^2\right)$ with $\lambda_1 \gg \lambda_2$ · a ravine: steep across, shallow along.
Pick the largest stable LR: gradient descent on a quadratic is stable only when $\eta < 2/\lambda_{\max}$, so the steep direction caps $\eta$.
The high-curvature direction forces a tiny LR; the low-curvature direction then converges painfully slowly.
Start at $\theta = (1, 1)$ and take $\lambda_1 = 100$, $\lambda_2 = 1$, $\eta = 0.019$ (just under the stability limit $2/\lambda_1 = 0.02$):

| step | $\theta_1$ | $\theta_2$ | $L(\theta)$ |
|---|---|---|---|
| 0 → 1 | $-0.90$ | $0.981$ | $40.98$ |
| 1 → 2 | $0.81$ | $0.962$ | $33.27$ |
| 2 → 3 | $-0.73$ | $0.944$ | $27.02$ |

$\theta_1$ flips sign every step (the zig-zag) while $\theta_2$ shrinks by a factor of only $0.981$ per step.
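A minimal sketch of that trace in pure Python, assuming the toy values above; the numbers it prints match the table:

```python
# Gradient descent on the toy ravine L(θ) = ½(100·θ1² + θ2²).
lam1, lam2, eta = 100.0, 1.0, 0.019    # assumed toy values from the slide
theta1, theta2 = 1.0, 1.0              # start at (1, 1)

for step in range(3):
    g1, g2 = lam1 * theta1, lam2 * theta2            # ∇L = (λ1·θ1, λ2·θ2)
    theta1, theta2 = theta1 - eta * g1, theta2 - eta * g2
    loss = 0.5 * (lam1 * theta1**2 + lam2 * theta2**2)
    print(f"{step} -> {step+1}: θ1={theta1:+.3f}  θ2={theta2:.3f}  L={loss:.2f}")
```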
In deep nets, the interesting structure of the loss surface lives at its critical points.
A critical point is anywhere the gradient vanishes: $\nabla L(\theta) = 0$. In 1D that means a minimum or a maximum.
In 2D and higher: a third option — the saddle.
Analogy · Pringles chip / horse saddle. At the centre, the gradient is zero. But it's not a minimum. Along the horse's spine, the surface curves up. Across its back, the surface curves down. Mixed curvature → saddle point.
To classify a critical point, look at curvature in every direction. The Hessian $H = \nabla^2 L(\theta)$ packages all of it: its eigenvalues are the curvatures along its eigendirections. All positive → minimum; all negative → maximum; mixed signs → saddle.
In $n$ dimensions the Hessian has $n$ eigenvalues, one per direction.
Toy probability: assume each eigenvalue is randomly positive or negative with probability $1/2$, independently.
For a net with $n = 10^6$ parameters, $P(\text{all positive}) = 2^{-10^6}$ · effectively zero.
Almost every critical point in a deep net is a saddle, not a minimum. The challenge isn't escaping valleys — it's navigating vast, flat saddle regions. Momentum's memory saves you here: it keeps you moving in a consistent direction through the flat plateau.
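A minimal sketch of the classification rule, using `torch.autograd.functional.hessian` on a toy saddle $f(x, y) = x^2 - y^2$ (my example, not from the lecture):

```python
import torch

def f(p):                       # toy saddle: curves up in x, down in y
    x, y = p
    return x**2 - y**2

p0 = torch.zeros(2)             # the critical point: ∇f(0, 0) = 0
H = torch.autograd.functional.hessian(f, p0)
eigs = torch.linalg.eigvalsh(H) # curvature along each eigendirection

if (eigs > 0).all():   kind = "minimum"
elif (eigs < 0).all(): kind = "maximum"
else:                  kind = "saddle"
print(eigs.tolist(), "->", kind)   # [-2.0, 2.0] -> saddle
```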
Gradient from a batch is a noisy estimate of the full gradient.
Larger batch → less noise, but in practice often worse generalization: the gradient noise acts as an implicit regularizer. "Linear-scaling" rule: if you 2× the batch, 2× the learning rate.
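A minimal sketch of the linear-scaling rule (the base batch and LR are assumed for illustration):

```python
# Linear-scaling rule: LR grows proportionally with batch size.
base_batch, base_lr = 64, 0.01        # assumed reference configuration
for batch in (64, 128, 256, 512):
    lr = base_lr * batch / base_batch
    print(f"batch={batch:4d}  lr={lr:.3f}")   # 0.010, 0.020, 0.040, 0.080
```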
The single most important change to SGD · momentum.
Vanilla SGD is a short-sighted hiker · only looks at the slope under their feet. In a narrow canyon they zig-zag wildly.
Momentum turns the hiker into a heavy ball rolling down the hill. The ball's inertia smooths out the zig-zags and carries it through small bumps and flat spots.
Algorithmically · keep an exponentially-weighted average of past gradients · use that as the update direction. The next slide turns this analogy into one line of math.
Replace position updates with velocity updates. A ball rolling down the valley:

$$v_t = \beta\, v_{t-1} + \nabla L(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} - \eta\, v_t$$

Formally: keep an exponentially-weighted average of past gradients and step along it instead of the raw gradient.
Let gradients in a ravine look like: $g_t = \left((-1)^{t+1}\,100,\; 1\right)$ · the steep coordinate flips sign every step, the shallow coordinate is a constant $+1$.
With $\beta = 0.9$, the alternating $\pm 100$s largely cancel inside the velocity, while the constant $+1$s pile up toward $1/(1-\beta) = 10$.
Vanilla SGD zig-zags across the walls at full amplitude; momentum damps the zig-zag and accelerates along the valley floor.
The same EMA mechanism shows up in Adam (L5), batch-norm running stats, and target networks in RL. One primitive, many uses.
Is $\beta$ yet another hyperparameter to tune? Yes · but a forgiving one.
| β | effective memory | behavior |
|---|---|---|
| 0.5 | 2 steps | barely smooths |
| 0.9 | 10 steps | sensible default |
| 0.95 | 20 steps | slower to respond to curvature changes |
| 0.99 | 100 steps | heavy; needs gradient clipping |
Most practitioners set $\beta = 0.9$ and never touch it.
Vanilla SGD only cares about the slope right now. A heavy ball has inertia — its motion today is a mix of where it was already going and the new push from the slope.
Analogy · pushing a bowling ball. Push it once → it rolls. Push it again in the same direction → it speeds up. Push it sideways → it changes direction, but it doesn't stop and turn on a dime. This memory of past motion is what we add to SGD.
This is an Exponential Moving Average (EMA) of past gradients, up to a constant factor: unrolling gives $v_t = \sum_{k=0}^{t-1} \beta^k\, g_{t-k}$, so with $\beta = 0.9$ the velocity remembers roughly the last $1/(1-\beta) = 10$ gradients.
PyTorch's SGD(..., momentum=0.9) uses an equivalent form.
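A quick sanity check (my toy script, assuming a 1-parameter loss $L(w) = w^2$): the hand-rolled recurrence $v \leftarrow \mu v + g$ tracks `torch.optim.SGD` step for step:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1, momentum=0.9)

w_hand, v = w.detach().clone(), torch.zeros(1)
for _ in range(5):
    loss = (w ** 2).sum()          # L(w) = w²
    opt.zero_grad()
    loss.backward()
    opt.step()

    g = 2 * w_hand                 # ∇L = 2w, computed by hand
    v = 0.9 * v + g                # PyTorch's momentum recurrence
    w_hand = w_hand - 0.1 * v

print(w.item(), w_hand.item())     # identical values
```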
Ravine gradients: $g_t = \left((-1)^{t+1}\,100,\; 1\right)$, velocity $v_t = 0.9\, v_{t-1} + g_t$:

| $t$ | $g_t$ | $v_t$ |
|---|---|---|
| 1 | $(100,\ 1)$ | $(100,\ 1.00)$ |
| 2 | $(-100,\ 1)$ | $(-10,\ 1.90)$ |
| 3 | $(100,\ 1)$ | $(91,\ 2.71)$ |

Observation. The oscillating coordinate gets knocked down (in steady state its amplitude settles near $100/(1+\beta) \approx 53$), while the persistent coordinate climbs toward $1/(1-\beta) = 10\times$ the raw gradient.
Momentum damps oscillation, amplifies persistence.
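A minimal sketch reproducing those numbers ($\beta = 0.9$, the alternating toy gradients above):

```python
beta = 0.9
v = (0.0, 0.0)
for t in range(1, 4):
    g = (100.0 if t % 2 else -100.0, 1.0)         # steep coord flips sign
    v = (beta * v[0] + g[0], beta * v[1] + g[1])  # v ← βv + g
    print(f"t={t}: g={g}  v=({v[0]:+.1f}, {v[1]:.2f})")
```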
Interactive: race SGD, momentum, Adam on a 2D quadratic — optimizer-race.
Same model, same LR, after 10 epochs:
| Optimizer | Train loss | Val accuracy |
|---|---|---|
| SGD | 0.42 | 92.1% |
| SGD + momentum | 0.13 | 97.6% |
Momentum is a free 5 points on MNIST. The single highest-value change to SGD.
```python
import torch
# model: any torch.nn.Module

# vanilla SGD
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# SGD + momentum — the sensible default
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```
Never use vanilla SGD without momentum for a deep network. It is a Lego brick, not a finished optimizer.
Evaluate the gradient one step ahead
Standard momentum is like driving by looking at the road right in front of your bumper · you steer based on what's directly under you.
Nesterov is like looking down the road. You first imagine where momentum is taking you, look at the slope at that future point, and steer based on that.
The result · less overshoot near valley walls · cleaner approach to the minimum · provably better convergence rate (in the convex case).
Standard momentum is a bit reckless. It computes the gradient at the current spot, then commits to a big velocity-driven step. It's like a driver looking only at the road right under the car.
Analogy · Nesterov is a smarter driver who looks ahead.
If your velocity was about to drive you into a wall, the lookahead gradient already points back — correcting the course before you fully commit.
One change: where the gradient is measured. Classical momentum uses $\nabla L(\theta_{t-1})$; Nesterov uses the lookahead point:

$$v_t = \beta\, v_{t-1} + \nabla L\!\left(\theta_{t-1} - \eta\beta\, v_{t-1}\right), \qquad \theta_t = \theta_{t-1} - \eta\, v_t$$

1D toy: $L(\theta) = \tfrac12\theta^2$, $\eta = 0.5$, $\beta = 0.9$, current state $\theta = 1$, $v = 1$. Classical gradient: $\nabla L(1) = 1$. Nesterov looks ahead to $\theta - \eta\beta v = 0.55$ and measures $\nabla L(0.55) = 0.55$.
The Nesterov gradient was smaller — it "saw" the valley flattening ahead → slightly more conservative step → less overshoot. Tiny difference per step, but compounds over training.
If the lookahead overshoots, the gradient at that point points back — correction kicks in before you commit to the full step.
Less overshoot near curvature changes, slightly faster convergence.
Theoretical payoff (convex, smooth case): optimal first-order convergence rate $O(1/t^2)$, versus $O(1/t)$ for plain gradient descent.
In practice on deep nets, the gain is modest but free.
If the gradient at the start was misleading, you've committed before knowing.
Information from the landscape you're about to visit, not the one you're leaving.
That's it · the same update, just measured at a smarter location.
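A minimal sketch (my toy script) racing the two updates on the 1D quadratic above, same $\eta = 0.5$, $\beta = 0.9$, and start; the Nesterov iterates settle much faster:

```python
eta, beta = 0.5, 0.9
grad = lambda th: th                      # ∇L for L(θ) = ½θ²

def run(nesterov, steps=8):
    th, v = 1.0, 1.0                      # same start as the 1D toy
    path = []
    for _ in range(steps):
        look = th - eta * beta * v if nesterov else th
        v = beta * v + grad(look)         # gradient at lookahead (or not)
        th = th - eta * v
        path.append(th)
    return path

print("classical:", [f"{x:+.2f}" for x in run(False)])
print("nesterov: ", [f"{x:+.2f}" for x in run(True)])
```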
```python
opt = torch.optim.SGD(model.parameters(),
                      lr=0.01,
                      momentum=0.9,
                      nesterov=True)  # ← the flag
```
That is the entire cost of using it.
What to actually use
| Use-case | Optimizer | Reason |
|---|---|---|
| CNN from scratch | SGD + momentum + Nesterov | best test accuracy on vision |
| Fine-tuning anything | AdamW (L5) | robust across LR |
| Transformer training | AdamW + warmup + cosine | field default |
| Debugging a new model | Adam first | faster to iterate |
Image researchers often prefer SGD-Momentum for final runs — it tends to find flatter minima that generalize slightly better. Everyone else uses AdamW.
Q. Student sets momentum=0.99 because "more is better." Loss diverges. Why?
Higher $\beta$ multiplies the terminal step by $1/(1-\beta)$: jumping from $0.9$ to $0.99$ makes effective steps $10\times$ larger, so an LR that was stable before now diverges.
Rule of thumb: keep the effective LR $\eta/(1-\beta)$ roughly constant · if you raise $\beta$ from $0.9$ to $0.99$, cut $\eta$ by $10\times$.
What happens when the gradient points the same way, step after step?
Analogy · pushing a child on a swing. First push — they get some velocity. Next time they swing by — push again. The new push adds to the existing velocity. They go higher and higher. Momentum does the same: each consistent gradient builds the velocity, leading to far bigger steps than the gradient alone.
Let's derive how big that compounding gets.
Assume gradient is constant: $g_t = g$ for every step. Unrolling $v_t = \beta v_{t-1} + g$ gives $v_t = g\left(1 + \beta + \beta^2 + \cdots + \beta^{t-1}\right)$.
A geometric series with ratio $\beta < 1$: $v_\infty = \dfrac{g}{1-\beta}$.
So the parameter update at terminal velocity is: $\Delta\theta = -\eta\, v_\infty = -\dfrac{\eta}{1-\beta}\, g$.
How much does $\beta$ amplify the step?

| $\beta$ | $1/(1-\beta)$ | Effect |
|---|---|---|
| 0.0 | 1× | no momentum |
| 0.5 | 2× | light |
| 0.9 | 10× | standard default |
| 0.95 | 20× | heavy |
| 0.99 | 100× | very heavy |
Key takeaway. A small bump from $\beta = 0.9$ to $\beta = 0.99$ multiplies the terminal step by another $10\times$ · an LR that was stable before can now diverge.
Practical recipe: fix $\beta = 0.9$ and tune $\eta$; if you do change $\beta$, rescale $\eta$ to keep $\eta/(1-\beta)$ roughly constant.
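A minimal sketch of the climb to terminal velocity (constant $g = 1$, $\beta = 0.9$, so $v_\infty = 10$):

```python
beta, g = 0.9, 1.0
v = 0.0
for t in range(1, 51):
    v = beta * v + g                   # v ← βv + g with constant gradient
    if t in (1, 5, 10, 25, 50):
        print(f"t={t:2d}  v={v:.3f}")  # climbs toward g/(1-β) = 10
```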
Common symptoms and fixes:
| Symptom | Likely cause | Fix |
|---|---|---|
| Loss → NaN after step 1 | LR too high, fp16 overflow | halve $\eta$; check loss scaling |
| Loss oscillates (±) | ravine + momentum too low | raise $\beta$ (0.9 → 0.95) |
| Loss plateaus for hundreds of steps | stuck near saddle | raise $\eta$ or $\beta$ |
| Loss drops then climbs | overfitting (not optimizer) | add weight decay, lower capacity |
| Training slower than Keras example | no momentum | add momentum=0.9 |
Most "my network doesn't train" bugs are optimizer-level. The debug ladder from L3 + this table catches ~90% of them in practice.
Vanilla SGD is steepest descent · momentum is a low-pass filter on gradients · Nesterov is momentum that peeks ahead. All three live on the same loss landscape · they differ only in how they smooth the gradient signal across steps.
| Update | Equation | Cures |
|---|---|---|
| SGD | $\theta \leftarrow \theta - \eta g$ | nothing — pure 1st-order |
| Momentum | $v \leftarrow \beta v + g$; $\;\theta \leftarrow \theta - \eta v$ | ravine zig-zag, saddles |
| Nesterov | $v \leftarrow \beta v + \nabla L(\theta - \eta\beta v)$; $\;\theta \leftarrow \theta - \eta v$ | overshoot near a minimum |
Effective LR under momentum · $\eta_{\text{eff}} = \eta/(1-\beta)$ · this is the number to keep constant when you trade $\eta$ against $\beta$.
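A minimal sketch collecting all three updates in one function (the name and signature are mine, not a library API):

```python
def step(theta, v, grad_fn, eta, beta=0.0, nesterov=False):
    """One update: beta=0 → SGD, beta>0 → momentum, nesterov=True → NAG."""
    look = theta - eta * beta * v if nesterov else theta
    v = beta * v + grad_fn(look)       # velocity: gradient memory
    return theta - eta * v, v

# Usage on the 1D quadratic L(θ) = ½θ², ∇L(θ) = θ:
theta, v = 1.0, 0.0
for _ in range(20):
    theta, v = step(theta, v, lambda th: th, eta=0.1, beta=0.9, nesterov=True)
print(theta)
```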
The 50-layer ResNet that oscillated forever? The answer is (b) ravine + step too large.
Vanishing gradients (a) would flatten loss, not oscillate. Bad shuffling (c) would show step-to-step jitter, not a stable cycle. Saddles (d) plateau, not oscillate.
Fix · raise $\beta$ so the shallow direction keeps its speed, and lower $\eta$ below the steep direction's stability limit $2/\lambda_{\max}$.
P1. A quadratic $L(\theta) = \tfrac12\left(\lambda_1\theta_1^2 + \lambda_2\theta_2^2\right)$ with $\lambda_1 \gg \lambda_2$ · derive the largest stable LR for vanilla GD and the per-step contraction factor along the shallow direction.
P2. Show that the momentum recurrence $v_t = \beta v_{t-1} + g_t$ unrolls to $v_t = \sum_{k=0}^{t-1}\beta^k\, g_{t-k}$, and that for a constant gradient $g$ it converges to $g/(1-\beta)$.
P3. Explain in one sentence why momentum helps escape saddles but does not help escape strict local minima.
P4. Run two trajectories on a Rosenbrock-like ravine · SGD with a small LR, and SGD + momentum ($\beta = 0.9$) at the same LR · plot both paths and compare the number of steps to reach the valley floor.
P5. Show that Nesterov's update can be rewritten as classical momentum plus a gradient correction term. State the correction.
P6. A practitioner raises momentum from 0.9 to 0.99 and the run NaNs. Without changing $\beta$ back, what single change restores stability, and by roughly what factor?