Interactive Explainer
The Optimizer Race
Drop a starting point on four different loss landscapes and watch SGD, Momentum, and Adam solve the same optimization problem side by side, with their step counts, current loss, and trajectories all computed live.
Why not just use plain gradient descent?
Training a neural network is, at heart, minimizing a loss function. You have a dial for each weight, you measure how wrong the model is, and you turn the dials in whatever direction reduces the error. Simple.
The problem is that real loss surfaces are nothing like a smooth bowl. They have long, narrow ravines; saddle points that look flat in one direction but tilt in another; plateaus where gradients die; and cliffs where they explode. Plain gradient descent handles none of this gracefully, which is why Adam was introduced in 2014 and has essentially never left.
This page lets you race three optimizers across four diagnostic landscapes. Every dot, every path, and every loss number is computed in real time from the same closed-form landscapes you'll see in the math blocks.
Plain SGD: follow the slope
Stochastic Gradient Descent (SGD) does the obvious thing. At every step, it computes the gradient of the loss and moves the weights a small amount against that direction:

$$W_{t+1} = W_t - \alpha \, \nabla L(W_t)$$

where:
- $W_t$: your current weights.
- $\alpha$: the learning rate (step size).
- $\nabla L(W_t)$: the gradient at the current point.
SGD is the right starting point, but it gets confused whenever the gradient stops pointing toward the minimum. In a ravine the gradient points across the ravine walls, not along them, so SGD bounces from wall to wall and makes painfully slow progress toward the bottom.
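To make the ravine problem concrete, here is a minimal sketch of the SGD loop on an illustrative narrow-ravine quadratic. The surface $L(w) = \tfrac12(w_x^2 + 25\,w_y^2)$ and the hyper-parameters are assumptions for the example, not the page's exact landscape:

```python
import numpy as np

def loss(w):
    # Illustrative narrow ravine: steep across the walls (y), shallow along the floor (x)
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def grad(w):
    return np.array([w[0], 25.0 * w[1]])

def sgd(w0, alpha=0.03, steps=200):
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - alpha * grad(w)  # W_{t+1} = W_t - alpha * grad L(W_t)
    return w

w_final = sgd([-3.0, 1.0])
```

The across-the-ravine coordinate collapses within a handful of steps, while the along-the-floor coordinate decays by only a factor of $1 - \alpha$ per step; that slow axis is exactly what momentum speeds up.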
Momentum: add a little inertia
Momentum treats the optimizer like a heavy ball. Instead of taking the current gradient as the step, it accumulates a velocity that blends past gradients with the current one:

$$V_{t+1} = \beta V_t + (1 - \beta)\, \nabla L(W_t), \qquad W_{t+1} = W_t - \alpha V_{t+1}$$

where:
- $V_t$: the velocity (a running average of recent gradients).
- $\beta$: friction, usually 0.9. Smaller values mean shorter memory; larger values mean a heavier ball.
In a ravine, the oscillating component of the gradient (across the walls) cancels in the velocity because successive steps push in opposite directions. Meanwhile the tiny consistent component (along the floor) accumulates, building up speed exactly where you want it.
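The same loop with a velocity buffer, sketched in the exponential-moving-average form of momentum. The ravine surface $L(w) = \tfrac12(w_x^2 + 25\,w_y^2)$ is an illustrative assumption, not the page's exact landscape:

```python
import numpy as np

def grad(w):
    # Gradient of an illustrative ravine L(w) = 0.5*(w_x^2 + 25*w_y^2)
    return np.array([w[0], 25.0 * w[1]])

def momentum(w0, alpha=0.03, beta=0.9, steps=200):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)  # running average of recent gradients
        w = w - alpha * v                    # step along the averaged direction
    return w
```

In the velocity buffer, the across-the-walls component of successive gradients alternates sign and cancels, while the along-the-floor component keeps the same sign and accumulates.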
Adam: adapt per-parameter
Adam adds a second running average: the squared gradient. It uses the first moment for direction (like momentum) and the second moment to rescale each parameter individually:

$$m_{t+1} = \beta_1 m_t + (1 - \beta_1)\, \nabla L(W_t), \qquad v_{t+1} = \beta_2 v_t + (1 - \beta_2)\, \nabla L(W_t)^2$$

$$\hat m_{t+1} = \frac{m_{t+1}}{1 - \beta_1^{\,t+1}}, \qquad \hat v_{t+1} = \frac{v_{t+1}}{1 - \beta_2^{\,t+1}}, \qquad W_{t+1} = W_t - \alpha\, \frac{\hat m_{t+1}}{\sqrt{\hat v_{t+1}} + \epsilon}$$
The magic is in the division by $\sqrt{\hat v_{t+1}}$. Parameters whose gradient wobbles a lot (large second moment) get a smaller effective step. Parameters whose gradient has been small but consistent (small second moment) get a larger step. It's like running momentum on a warped version of the landscape where steep walls have been flattened and gentle slopes have been steepened.
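A minimal sketch of the full Adam loop. The ravine surface and the hyper-parameter values here are illustrative assumptions (the demo's constants may differ), though $\beta_1 = 0.9$, $\beta_2 = 0.999$ are the conventional defaults:

```python
import numpy as np

def grad(w):
    # Gradient of an illustrative ravine L(w) = 0.5*(w_x^2 + 25*w_y^2)
    return np.array([w[0], 25.0 * w[1]])

def adam(w0, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g       # first moment: direction
        v = beta2 * v + (1 - beta2) * g ** 2  # second moment: per-parameter scale
        m_hat = m / (1 - beta1 ** t)          # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w
```

Because each coordinate is divided by its own $\sqrt{\hat v}$, the steep and shallow axes of the ravine end up taking comparably sized steps.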
The grand prix: four landscapes
Pick a landscape, adjust the learning rate, and click anywhere on the terrain to drop a starting point. Every optimizer runs for the same number of steps from the same start so the comparison is honest.
Choose a landscape
Live scoreboard
Loss vs. step
What each landscape reveals
| Landscape | What it tests | Who wins, and why |
|---|---|---|
| Narrow ravine | Direction mismatch: gradient points across, not along. | Momentum and Adam. SGD bounces off the walls. |
| Saddle point | Escaping a "flat" region with curvature of opposite signs in different directions. | Adam (aggressive per-parameter scaling). SGD gets stuck; Momentum is slow. |
| Plateau + bowl | Dead flat region before any real signal appears. | Adam (bootstraps from tiny gradients). SGD plods. Momentum eventually picks up. |
| Rosenbrock | A curved valley that snakes toward the minimum. | Everyone struggles; Adam is usually first to the basin because it rescales per axis. |
Try starting each landscape from the same extreme corner and watch the loss curves diverge. The total compute for one step of Adam is roughly twice that of SGD, but the ratio of steps-to-convergence is usually far more than two—which is why nobody trains modern models with plain SGD anymore.
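The same-start, same-budget comparison can be reproduced offline with a tiny harness. Everything here (the ravine surface, the learning rate, the step budget) is an illustrative assumption, not the page's exact setup:

```python
import numpy as np

def loss(w):
    # Illustrative narrow ravine; the page's closed-form landscapes may differ
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def grad(w):
    return np.array([w[0], 25.0 * w[1]])

def race(w0, alpha=0.03, steps=200):
    """Run SGD, Momentum, and Adam from the same start for the same step budget."""
    w_sgd = np.array(w0, dtype=float)
    w_mom = np.array(w0, dtype=float)
    w_adam = np.array(w0, dtype=float)
    v = np.zeros(2)                    # momentum velocity
    m, s = np.zeros(2), np.zeros(2)    # Adam first/second moments
    b1, b2, eps = 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        w_sgd -= alpha * grad(w_sgd)
        g = grad(w_mom)
        v = b1 * v + (1 - b1) * g
        w_mom -= alpha * v
        g = grad(w_adam)
        m = b1 * m + (1 - b1) * g
        s = b2 * s + (1 - b2) * g ** 2
        w_adam -= alpha * (m / (1 - b1 ** t)) / (np.sqrt(s / (1 - b2 ** t)) + eps)
    return {"SGD": loss(w_sgd), "Momentum": loss(w_mom), "Adam": loss(w_adam)}
```

Swapping in a different `loss`/`grad` pair (a saddle, a plateau, Rosenbrock) reproduces the other rows of the table above.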
Three things people get wrong about Adam
"Adam always beats SGD."
On large-scale image classification tasks with well-tuned learning-rate schedules, SGD with momentum still outperforms Adam in final test accuracy. Adam converges faster but can generalize worse. The right optimizer depends on the problem.
"A higher learning rate always means faster training."
Crank $\alpha$ high enough on any of the landscapes above and the optimizers overshoot and diverge (the loss balloons toward infinity). Every landscape has a range of "safe" learning rates, and that range differs per algorithm.
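The safe range can be made exact on a 1D quadratic. This sketch uses the standard stability condition for gradient descent on $L(w) = \tfrac12 c\,w^2$ (the curvature $c = 25$ is an arbitrary choice for the example):

```python
def sgd_1d(alpha, curvature=25.0, w=1.0, steps=50):
    # L(w) = 0.5 * curvature * w^2, so grad L = curvature * w.
    # Each step multiplies w by (1 - alpha * curvature):
    # stable iff |1 - alpha*curvature| < 1, i.e. alpha < 2 / curvature.
    for _ in range(steps):
        w -= alpha * curvature * w
    return abs(w)

safe = sgd_1d(alpha=0.05)    # 0.05 < 2/25 = 0.08: |w| shrinks geometrically
unsafe = sgd_1d(alpha=0.10)  # 0.10 > 0.08: |w| grows geometrically
```

The threshold $\alpha < 2/c$ scales with the steepest curvature, which is why the narrow ravine (large curvature across the walls) tolerates a much smaller learning rate than the plateau.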
"Momentum is just about speed."
Its more important job is averaging noise. On mini-batch gradients that jitter from step to step, momentum's running average filters out the noise and keeps the net direction stable. Speed is a side effect.
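The noise-filtering claim is easy to check numerically. This sketch (the noise level, sample count, and $\beta$ are assumptions) compares the spread of raw simulated mini-batch gradients against the spread of their running average:

```python
import numpy as np

# Simulate noisy mini-batch estimates of a constant true gradient of 1.0
rng = np.random.default_rng(0)
noisy_grads = 1.0 + rng.normal(0.0, 1.0, size=2000)

# Momentum's running average: an exponential moving average with beta = 0.9
beta, v = 0.9, 0.0
smoothed = []
for g in noisy_grads:
    v = beta * v + (1 - beta) * g
    smoothed.append(v)
smoothed = np.array(smoothed)

# Skip the warm-up; in steady state the EMA's spread is roughly
# sqrt((1 - beta) / (1 + beta)) ~ 0.23 of the raw spread for beta = 0.9
raw_spread = noisy_grads[100:].std()
ema_spread = smoothed[100:].std()
```

The averaged signal hovers near the true gradient of 1.0 with a fraction of the jitter, which is what keeps the net step direction stable even when individual mini-batch gradients point all over the place.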