Interactive Explainer
The Optimizer Race
Drop a starting point on four different loss landscapes and watch SGD, Momentum, and Adam solve the same optimization problem side by side, with their step counts, current loss, and trajectories all computed live.
Why not just use plain gradient descent?
Training a neural network is, at heart, minimizing a loss function. You have a dial for each weight, you measure how wrong the model is, and you turn the dials in whatever direction reduces the error. Simple.
The problem is that real loss surfaces are nothing like a smooth bowl. They have long, narrow ravines; saddle points that look flat in one direction but tilt in another; plateaus where gradients die; and cliffs where they explode. Plain gradient descent handles none of this gracefully, which is why Adam was introduced in 2014 and has essentially never left.
This page lets you race three optimizers across four diagnostic landscapes. Every dot, every path, and every loss number is computed in real time from the same closed-form landscapes you'll see in the math blocks.
Plain SGD: follow the slope
Stochastic Gradient Descent (SGD) does the obvious thing. At every step, it computes the gradient of the loss and moves the weights a small amount against that direction:

$$W_{t+1} = W_t - \alpha \, \nabla L(W_t)$$

where:
- $W_t$: your current weights.
- $\alpha$: the learning rate (step size).
- $\nabla L(W_t)$: the gradient at the current point.
SGD is the right starting point, but it gets confused whenever the gradient stops pointing toward the minimum. In a ravine the gradient points across the ravine walls, not along them, so SGD bounces from wall to wall and makes painfully slow progress toward the bottom.
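To make the ravine problem concrete, here is a minimal sketch of the SGD loop on an illustrative narrow-ravine quadratic. The surface $L(w) = \tfrac12(w_x^2 + 25\,w_y^2)$ and the hyper-parameters are assumptions for the example, not the page's exact landscape:

```python
import numpy as np

def loss(w):
    # Illustrative narrow ravine: steep across the walls (y), shallow along the floor (x)
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def grad(w):
    return np.array([w[0], 25.0 * w[1]])

def sgd(w0, alpha=0.03, steps=200):
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - alpha * grad(w)  # W_{t+1} = W_t - alpha * grad L(W_t)
    return w

w_final = sgd([-3.0, 1.0])
```

The across-the-ravine coordinate collapses within a handful of steps, while the along-the-floor coordinate decays by only a factor of $1 - \alpha$ per step; that slow axis is exactly what momentum speeds up.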
Momentum: add a little inertia
Momentum treats the optimizer like a heavy ball. Instead of taking the current gradient as the step, it accumulates a velocity that blends past gradients with the current one:

$$V_{t+1} = \beta V_t + (1 - \beta)\, \nabla L(W_t), \qquad W_{t+1} = W_t - \alpha V_{t+1}$$

where:
- $V_t$: the velocity (a running average of recent gradients).
- $\beta$: friction, usually 0.9. Smaller values mean shorter memory; larger values mean a heavier ball.
In a ravine, the oscillating component of the gradient (across the walls) cancels in the velocity because successive steps push in opposite directions. Meanwhile the tiny consistent component (along the floor) accumulates, building up speed exactly where you want it.
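The same loop with a velocity buffer, sketched in the exponential-moving-average form of momentum. The ravine surface $L(w) = \tfrac12(w_x^2 + 25\,w_y^2)$ is an illustrative assumption, not the page's exact landscape:

```python
import numpy as np

def grad(w):
    # Gradient of an illustrative ravine L(w) = 0.5*(w_x^2 + 25*w_y^2)
    return np.array([w[0], 25.0 * w[1]])

def momentum(w0, alpha=0.03, beta=0.9, steps=200):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)  # running average of recent gradients
        w = w - alpha * v                    # step along the averaged direction
    return w
```

In the velocity buffer, the across-the-walls component of successive gradients alternates sign and cancels, while the along-the-floor component keeps the same sign and accumulates.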
Adam: adapt per-parameter
Adam adds a second running average: the squared gradient. It uses the first moment for direction (like momentum) and the second moment to rescale each parameter individually:

$$m_{t+1} = \beta_1 m_t + (1 - \beta_1)\, \nabla L(W_t), \qquad v_{t+1} = \beta_2 v_t + (1 - \beta_2)\, \nabla L(W_t)^2$$

$$\hat m_{t+1} = \frac{m_{t+1}}{1 - \beta_1^{\,t+1}}, \qquad \hat v_{t+1} = \frac{v_{t+1}}{1 - \beta_2^{\,t+1}}, \qquad W_{t+1} = W_t - \alpha\, \frac{\hat m_{t+1}}{\sqrt{\hat v_{t+1}} + \epsilon}$$
The magic is in the division by $\sqrt{\hat v_{t+1}}$. Parameters whose gradient wobbles a lot (large second moment) get a smaller effective step. Parameters whose gradient has been small but consistent (small second moment) get a larger step. It's like running momentum on a warped version of the landscape where steep walls have been flattened and gentle slopes have been steepened.
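A minimal sketch of the full Adam loop. The ravine surface and the hyper-parameter values here are illustrative assumptions (the demo's constants may differ), though $\beta_1 = 0.9$, $\beta_2 = 0.999$ are the conventional defaults:

```python
import numpy as np

def grad(w):
    # Gradient of an illustrative ravine L(w) = 0.5*(w_x^2 + 25*w_y^2)
    return np.array([w[0], 25.0 * w[1]])

def adam(w0, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g       # first moment: direction
        v = beta2 * v + (1 - beta2) * g ** 2  # second moment: per-parameter scale
        m_hat = m / (1 - beta1 ** t)          # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w
```

Because each coordinate is divided by its own $\sqrt{\hat v}$, the steep and shallow axes of the ravine end up taking comparably sized steps.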
The grand prix: four landscapes
Pick a landscape, adjust the learning rate, and click anywhere on the terrain to drop a starting point. Every optimizer runs for the same number of steps from the same start so the comparison is honest.
Choose a landscape
Live scoreboard
Loss vs. step
What each landscape reveals
| Landscape | What it tests | Who wins, and why |
|---|---|---|
| Narrow ravine | Direction mismatch: gradient points across, not along. | Momentum and Adam. SGD bounces off the walls. |
| Saddle point | Escaping a "flat" region with curvature of opposite signs in different directions. | Adam (aggressive per-parameter scaling). SGD gets stuck; Momentum is slow. |
| Plateau + bowl | Dead flat region before any real signal appears. | Adam (bootstraps from tiny gradients). SGD plods. Momentum eventually picks up. |
| Rosenbrock | A curved valley that snakes toward the minimum. | Everyone struggles; Adam is usually first to the basin because it rescales per axis. |
Try starting each landscape from the same extreme corner and watch the loss curves diverge. The total compute for one step of Adam is roughly twice that of SGD, but the ratio of steps-to-convergence is usually far more than two—which is why nobody trains modern models with plain SGD anymore.
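The same-start, same-budget comparison can be reproduced offline with a tiny harness. Everything here (the ravine surface, the learning rate, the step budget) is an illustrative assumption, not the page's exact setup:

```python
import numpy as np

def loss(w):
    # Illustrative narrow ravine; the page's closed-form landscapes may differ
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def grad(w):
    return np.array([w[0], 25.0 * w[1]])

def race(w0, alpha=0.03, steps=200):
    """Run SGD, Momentum, and Adam from the same start for the same step budget."""
    w_sgd = np.array(w0, dtype=float)
    w_mom = np.array(w0, dtype=float)
    w_adam = np.array(w0, dtype=float)
    v = np.zeros(2)                    # momentum velocity
    m, s = np.zeros(2), np.zeros(2)    # Adam first/second moments
    b1, b2, eps = 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        w_sgd -= alpha * grad(w_sgd)
        g = grad(w_mom)
        v = b1 * v + (1 - b1) * g
        w_mom -= alpha * v
        g = grad(w_adam)
        m = b1 * m + (1 - b1) * g
        s = b2 * s + (1 - b2) * g ** 2
        w_adam -= alpha * (m / (1 - b1 ** t)) / (np.sqrt(s / (1 - b2 ** t)) + eps)
    return {"SGD": loss(w_sgd), "Momentum": loss(w_mom), "Adam": loss(w_adam)}
```

Swapping in a different `loss`/`grad` pair (a saddle, a plateau, Rosenbrock) reproduces the other rows of the table above.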
Three things people get wrong about Adam
"Adam always beats SGD."
On large-scale image classification tasks with well-tuned learning-rate schedules, SGD with momentum still outperforms Adam in final test accuracy. Adam converges faster but can generalize worse. The right optimizer depends on the problem.
"A higher learning rate always means faster training."
Crank $\alpha$ high enough on any of the landscapes above and the optimizers overshoot and diverge (the loss balloons toward infinity). Every landscape has a range of "safe" learning rates, and that range differs per algorithm.
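The safe range can be made exact on a 1D quadratic. This sketch uses the standard stability condition for gradient descent on $L(w) = \tfrac12 c\,w^2$ (the curvature $c = 25$ is an arbitrary choice for the example):

```python
def sgd_1d(alpha, curvature=25.0, w=1.0, steps=50):
    # L(w) = 0.5 * curvature * w^2, so grad L = curvature * w.
    # Each step multiplies w by (1 - alpha * curvature):
    # stable iff |1 - alpha*curvature| < 1, i.e. alpha < 2 / curvature.
    for _ in range(steps):
        w -= alpha * curvature * w
    return abs(w)

safe = sgd_1d(alpha=0.05)    # 0.05 < 2/25 = 0.08: |w| shrinks geometrically
unsafe = sgd_1d(alpha=0.10)  # 0.10 > 0.08: |w| grows geometrically
```

The threshold $\alpha < 2/c$ scales with the steepest curvature, which is why the narrow ravine (large curvature across the walls) tolerates a much smaller learning rate than the plateau.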
"Momentum is just about speed."
Its more important job is averaging noise. On mini-batch gradients that jitter from step to step, momentum's running average filters out the noise and keeps the net direction stable. Speed is a side effect.
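The noise-filtering claim is easy to check numerically. This sketch (the noise level, sample count, and $\beta$ are assumptions) compares the spread of raw simulated mini-batch gradients against the spread of their running average:

```python
import numpy as np

# Simulate noisy mini-batch estimates of a constant true gradient of 1.0
rng = np.random.default_rng(0)
noisy_grads = 1.0 + rng.normal(0.0, 1.0, size=2000)

# Momentum's running average: an exponential moving average with beta = 0.9
beta, v = 0.9, 0.0
smoothed = []
for g in noisy_grads:
    v = beta * v + (1 - beta) * g
    smoothed.append(v)
smoothed = np.array(smoothed)

# Skip the warm-up; in steady state the EMA's spread is roughly
# sqrt((1 - beta) / (1 + beta)) ~ 0.23 of the raw spread for beta = 0.9
raw_spread = noisy_grads[100:].std()
ema_spread = smoothed[100:].std()
```

The averaged signal hovers near the true gradient of 1.0 with a fraction of the jitter, which is what keeps the net step direction stable even when individual mini-batch gradients point all over the place.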