Lecture 4 · momentum = EMA of past gradients. Damps ravine oscillation and speeds training.
But momentum still uses a single learning rate for all parameters.
Today maps to UDL Ch 6 (Adam) and Ch 7 (gradients + initialization revisited).
Q. Is that always the right thing?
Imagine training a model where: some parameters receive sparse, infrequent gradients (embedding rows) while others receive dense, frequent ones (attention weights).
A single LR that is right for one is wrong for the other.
Today · per-parameter adaptive learning rates — AdaGrad → RMSProp → Adam → AdamW — plus the schedule we wrap around them.
You are training a 1B-parameter Transformer on text · gradients are sparse for embeddings, dense for attention, and large for a few outlier weights.
(a) SGD with momentum 0.9.
(b) AdaGrad.
(c) AdamW with warmup + cosine.
(d) Plain Adam, constant LR.
Stop and decide. By the end of today the answer should be obvious — and you'll know why each of the others fails.
The exact same loss-curve shape would be diagnosed differently for each optimizer. That diagnostic is what this lecture trains.
SGD → AdaGrad → RMSProp → Adam
Imagine tuning two knobs · one is sensitive (you've already moved it a lot) · the other you've barely touched.
Which deserves a bigger turn? Obviously the untouched one.
AdaGrad does this for every parameter · if a parameter has accumulated lots of gradient (big knob movement so far), shrink its effective LR. If it's been quiet, leave its LR large. This is the core idea behind every adaptive optimizer (RMSProp, Adam, AdamW).
Analogy · audio mixing board. Imagine you're an engineer with a giant board of thousands of knobs (parameters).
AdaGrad's idea · keep a total movement history for each knob. The more a knob has moved, the smaller its future turns.
For a single parameter $\theta_i$: accumulate $s_{t,i} = s_{t-1,i} + g_{t,i}^2$, then update $\theta_{t+1,i} = \theta_{t,i} - \frac{\alpha}{\sqrt{s_{t,i}} + \epsilon}\, g_{t,i}$.
Each parameter gets its own effective LR.
LR shrinks gently as updates accumulate.
The catch · $s_{t,i}$ never shrinks. One big gradient → LR for that parameter is suppressed forever, and on dense gradients the effective LR decays toward zero — training stalls.
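That update is short enough to run in your head — here it is as a plain-Python sketch for one parameter, with an invented gradient stream:

```python
# Minimal AdaGrad for one parameter — illustrative values, not a training loop.
alpha, eps = 0.1, 1e-8
theta, s = 0.0, 0.0                      # parameter and accumulated squared grads

grads = [1.0, 1.0, 0.1, 1.0]             # hypothetical gradient stream
for g in grads:
    s += g ** 2                          # accumulator only ever grows
    effective_lr = alpha / (s ** 0.5 + eps)
    theta -= effective_lr * g
    print(f"s={s:.2f}  effective_lr={effective_lr:.4f}")
# effective LR: 0.1000 → 0.0707 → 0.0705 → 0.0576 — monotonically shrinking
```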
Analogy · perfect vs. fading memory. AdaGrad has infinite memory — every gradient since step 1 is in $s_t$ forever.
What if we gave it a fading memory, like momentum? An EMA of squared gradients weights recent gradients more than old ones. That's RMSProp.
Replace AdaGrad's accumulator with an EMA: $s_t = \beta\, s_{t-1} + (1-\beta)\, g_t^2$.
| | AdaGrad | RMSProp |
|---|---|---|
| Update | $s_t = s_{t-1} + g_t^2$ | $s_t = \beta\, s_{t-1} + (1-\beta)\, g_t^2$ |
| Behaviour | Grows without bound | Stabilizes |
Update rule: $\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{s_t} + \epsilon}\, g_t$ — same as AdaGrad, only the accumulator changed.
Typical $\beta = 0.9$, $\epsilon = 10^{-8}$.
Constant gradient $g$ → $s_t \to g^2$, so the effective LR settles at $\alpha / (|g| + \epsilon)$.
RMSProp's effective LR converges to a positive value rather than decaying to zero.
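To see the contrast concretely, a sketch comparing the two accumulators on the same constant gradient ($g = 1$, $\beta = 0.9$ assumed):

```python
# AdaGrad vs RMSProp accumulator on a constant gradient g = 1 — a sketch.
alpha, eps, beta = 0.1, 1e-8, 0.9
s_ada = s_rms = 0.0
for t in range(100):
    g = 1.0
    s_ada += g ** 2                               # grows without bound: s_ada = t
    s_rms = beta * s_rms + (1 - beta) * g ** 2    # converges to g^2 = 1
print(alpha / (s_ada ** 0.5 + eps))   # ~0.01 and still shrinking
print(alpha / (s_rms ** 0.5 + eps))   # ~0.1  — stabilized near alpha / |g|
```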
Combine the best of both worlds: momentum's smoothed direction (an EMA of gradients) and RMSProp's per-parameter scale (an EMA of squared gradients).
Adam does both at once: $m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$, $\quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$, $\quad \theta_{t+1} = \theta_t - \alpha\, \hat m_t / (\sqrt{\hat v_t} + \epsilon)$ (the hats are bias-corrected versions — next slide).
Defaults · $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
Suppose · gradients are constant at $g_t = 1$, with the defaults above.

| step | $g_t$ | $m_t$ | $v_t$ | $\hat m_t / \sqrt{\hat v_t}$ | update |
|---|---|---|---|---|---|
| 1 | 1.0 | 0.100 | 0.00100 | 1.000 | $-\alpha$ |
| 2 | 1.0 | 0.190 | 0.00200 | 1.000 | $-\alpha$ |
| 3 | 1.0 | 0.271 | 0.00300 | 1.000 | $-\alpha$ |
Note · the per-parameter update size is roughly constant ($\approx \alpha$): the $\sqrt{\hat v_t}$ denominator cancels the gradient's scale, so the same table with $g_t = 100$ would give the same updates.
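The table takes a few lines to verify (a sketch using the defaults above and the same constant gradient):

```python
# Verify the worked example: constant g = 1 with Adam defaults — a sketch.
alpha, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
m = v = 0.0
for t in range(1, 4):
    g = 1.0
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment (scale)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    update = alpha * m_hat / (v_hat ** 0.5 + eps)
    print(f"t={t}  m={m:.3f}  v={v:.5f}  update={update:.6f}")
# update ≈ alpha = 0.001 at every step, matching the table
```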
Why we divide by $1 - \beta^t$ · bias correction.
Our moving averages $m_t$ and $v_t$ are initialized at zero, so for the first steps they underestimate the true moments.
Analogy · making hot chocolate. You start with a cup of hot water ($m_0 = 0$) and stir in a spoonful of cocoa — one gradient — at each step. The first cups taste watery, not because the cocoa is weak but because you started from plain water.
After many spoonfuls the mix is right. Bias correction is the math trick to "undilute" the early steps and get the right strength immediately.
Initialize $m_0 = 0$. After one step with gradient $g_1$: $m_1 = 0.9 \cdot 0 + 0.1\, g_1 = 0.1\, g_1$.
The EMA is 10× smaller than the true gradient — purely because it started from zero.
Unroll the EMA from $m_0 = 0$: $m_1 = (1-\beta_1)\, g_1$, $\ m_2 = \beta_1 (1-\beta_1)\, g_1 + (1-\beta_1)\, g_2$, …
In general: $m_t = (1-\beta_1) \sum_{k=1}^{t} \beta_1^{\,t-k}\, g_k$. For a roughly constant gradient $g$, this is $m_t = (1-\beta_1)\, g\, [\, 1 + \beta_1 + \cdots + \beta_1^{\,t-1}\,]$.
The bracketed sum is a geometric series with ratio $\beta_1$, equal to $\frac{1-\beta_1^{\,t}}{1-\beta_1}$.
So $m_t = (1-\beta_1^{\,t})\, g$ — too small by exactly the factor $1-\beta_1^{\,t}$, and dividing by it recovers $g$: $\hat m_t = m_t / (1-\beta_1^{\,t})$.
Same logic gives $\hat v_t = v_t / (1-\beta_2^{\,t})$.
Suppose the true gradient is constant, $g_t = 1$, with $\beta_1 = 0.9$:

| step | $m_t$ | $\hat m_t = m_t / (1-\beta_1^{\,t})$ |
|---|---|---|
| 1 | 0.100 | 1.000 |
| 2 | 0.190 | 1.000 |
| 3 | 0.271 | 1.000 |
| 10 | 0.651 | 1.000 |
| 100 | 1.000 | 1.000 |
Without correction, the first step would use $m_1 = 0.1$ — a 10× underestimate of the true gradient.
For $\beta_2 = 0.999$ the bias fades far more slowly: $1 - \beta_2^{\,t} \approx 0.63$ only at $t = 1000$, so the $v_t$ correction matters for thousands of steps.
Bias correction is a first-few-steps phenomenon. It keeps early steps the right magnitude; after that it fades to the identity.
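Both claims — the $m_t$ column above and the slow $\beta_2$ decay — check out numerically (a sketch):

```python
# The bias factor 1 - beta**t for both betas — a quick check.
for beta, steps in [(0.9, [1, 2, 3, 10, 100]), (0.999, [10, 100, 1000])]:
    for t in steps:
        print(f"beta={beta}  t={t:>4}  1 - beta^t = {1 - beta ** t:.3f}")
# beta=0.9:   0.100, 0.190, 0.271, 0.651, 1.000 — the m_t column above
# beta=0.999: 0.010, 0.095, 0.632 — v_t stays biased for thousands of steps
```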
The decoupled-weight-decay fix
L2 regularization adds $\frac{\lambda}{2}\lVert\theta\rVert^2$ to the loss, i.e. $\lambda\theta$ to the gradient.
In plain SGD, this equals weight decay — each step shrinks weights by a factor of $(1 - \alpha\lambda)$ before the gradient step: $\theta \leftarrow (1-\alpha\lambda)\,\theta - \alpha g$.
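A two-line numeric check of that equivalence, with illustrative values:

```python
# SGD: L2-in-the-gradient vs explicit shrink-then-step — identical. A sketch.
alpha, lam, w, g = 0.1, 0.01, 2.0, 0.5
print(w - alpha * (g + lam * w))           # L2 term added to the gradient
print((1 - alpha * lam) * w - alpha * g)   # shrink weights, then gradient step
# both ≈ 1.948 — for SGD the two formulations coincide
```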
For SGD these are the same. Not so for Adam.
Setup · two parameters with equal value $\theta$ and equal $\lambda$, but very different gradient histories — one with large $\hat v_t$, one with small.
Adam (L2 in gradient) · the $\lambda\theta$ term is added to $g_t$ and then divided by $\sqrt{\hat v_t} + \epsilon$, so the high-$\hat v$ parameter is barely decayed.
AdamW (decoupled) · $\theta_{t+1} = \theta_t - \alpha\, \hat m_t / (\sqrt{\hat v_t} + \epsilon) - \alpha\lambda\,\theta_t$ — the decay term bypasses the adaptive scaling.
In AdamW, weight decay is uniform for every parameter. In Adam, parameters with large past gradients are decayed less. AdamW's behavior matches the regularization theory · use it.
# the right default for almost everything in 2026
opt = torch.optim.AdamW(model.parameters(),
lr=3e-4,
betas=(0.9, 0.999),
weight_decay=0.1) # typical for LLMs
LLMs: weight_decay=0.1. Fine-tuning: 0.01–0.05. Vision fine-tune: 0.001–0.01.
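To make the decoupling concrete, here is a single-step sketch for one scalar weight under each rule — hypothetical numbers, with momentum ($\beta_1 = 0$), bias correction, and $\epsilon$ stripped out to isolate the decay term:

```python
# One update for a weight with a large gradient history (big v_hat) — a sketch.
alpha, lam = 1e-3, 0.1
w, g, v_hat = 1.0, 0.0, 100.0      # zero task gradient: only decay acts

g_l2 = g + lam * w                             # Adam: L2 enters the gradient...
w_adam = w - alpha * g_l2 / v_hat ** 0.5       # ...then is divided by sqrt(v_hat)

w_adamw = w - alpha * g / v_hat ** 0.5 - alpha * lam * w   # AdamW: decay outside

print(w_adam, w_adamw)   # 0.99999 vs 0.9999 — AdamW decays this weight 10x more
```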
Why one learning rate isn't enough over a full run
Interactive: sliders for peak LR and warmup — lr-schedule-visualizer.
import math

# Step decay — multiply LR by 0.1 at epochs 60 and 120
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[60, 120], gamma=0.1)

# Cosine annealing over the run
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=n_epochs)

# Warmup + cosine — the 2026 default for Transformers
from torch.optim.lr_scheduler import LambdaLR

warmup, total_steps = 2_000, 100_000   # example lengths — set these from your run

def lr_lambda(step):
    if step < warmup:
        return step / warmup                                  # linear ramp 0 → 1
    progress = (step - warmup) / (total_steps - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))           # cosine decay 1 → 0

sched = LambdaLR(opt, lr_lambda)
A randomly-initialized network knows nothing.
Two things go wrong simultaneously at step 1:
Gradients are huge and chaotic — a randomly-initialized network is confidently wrong about everything.
Adam's $\hat v_t$ has seen only a handful of gradients, so the denominator is noisy and often far too small.
A perfect storm at the start · big gradient ÷ tiny, unreliable $\sqrt{\hat v_t}$ → an enormous, badly-aimed step that can wreck the initialization.
Warmup is a safety valve. Linearly ramp the LR from 0 to its peak over the first 1–10% of training, giving $\hat v_t$ time to become a trustworthy estimate before full-size steps are allowed.
After warmup, use cosine decay.
Typical Transformer schedule:
def lr_lambda(step, total, warmup):
    if step < warmup:
        return step / warmup                               # linear warmup 0 → 1
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))        # cosine
5 lines. Ubiquitous.
| Model / regime | Optimizer | LR | Schedule |
|---|---|---|---|
| CNN from scratch | SGD + momentum + Nesterov | 0.1 | step decay |
| CNN fine-tune | AdamW | 1e-4 | cosine |
| Transformer pre-train | AdamW (β₂ = 0.95) | 3e-4 | warmup + cosine |
| LoRA fine-tune of LLM | AdamW | 1e-4 to 3e-4 | cosine |
| Debugging a new idea | AdamW | 3e-4 | constant |
lr = 3e-4 is not magic — it's the number to use when you don't want to think. For a real run, do the LR finder (Lecture 3).
Leaving weight_decay at 0 for AdamW. You get plain Adam with no regularization. Surprisingly common.
No LR schedule for an LLM. Training plateaus early and you blame the architecture.
Warmup of 10 steps for a 100k-step run. Far too short. Warmup = 1–10% of total.
lr = 3e-4 for SGD. That's the Adam default. SGD usually wants lr = 0.01 to 0.1.
Adam = momentum on the gradient (the EMA $m_t$) + per-parameter scaling by gradient magnitude (the EMA $v_t$) + bias correction for both.
| Symbol | Role | Update at step $t$ |
|---|---|---|
| $m_t$ | momentum | $\beta_1 m_{t-1} + (1-\beta_1)\, g_t$ |
| $v_t$ | per-param scale | $\beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$ |
| $\hat m_t,\ \hat v_t$ | bias-corrected | $m_t / (1-\beta_1^{\,t})$, $\ v_t / (1-\beta_2^{\,t})$ |
| $\theta_t$ | step | $\theta_{t-1} - \alpha\, \hat m_t / (\sqrt{\hat v_t} + \epsilon)$ |
The final AdamW touch · also subtract $\alpha\lambda\,\theta_{t-1}$, outside the adaptive scaling — decoupled weight decay.
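The whole cheat-sheet fits in one function — a NumPy teaching sketch, not the PyTorch internals:

```python
import numpy as np

def adamw_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a parameter array theta (t counts from 1)."""
    m = beta1 * m + (1 - beta1) * g              # momentum
    v = beta2 * v + (1 - beta2) * g ** 2         # per-param scale
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # adaptive step
    theta = theta - alpha * weight_decay * theta            # decoupled decay
    return theta, m, v
```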
The 1B-parameter Transformer? The answer is (c) AdamW with warmup + cosine.
(a) SGD · single LR can't cope with sparse vs dense gradients · slow.
(b) AdaGrad · LR collapses to zero on dense layers within 10k steps.
(c) ✓ AdamW + warmup absorbs the chaotic early gradients · cosine anneals.
(d) Plain Adam · L2 silently broken; no warmup → first-step explosions.
P1. Show that AdaGrad's effective LR for parameter $i$ is $\alpha / \big(\sqrt{\sum_{k=1}^{t} g_{k,i}^2} + \epsilon\big)$, and that it is non-increasing in $t$.
P2. A parameter has constant gradient $g$. Compute the limit of RMSProp's $s_t$ and the limiting effective LR.
P3. State the exact difference between Adam-with-L2 and AdamW. Show that for a fixed gradient and fixed $\hat v_t$, the per-step weight shrinkage under Adam-with-L2 scales as $\lambda / (\sqrt{\hat v_t} + \epsilon)$, while under AdamW it is $\lambda$, independent of $\hat v_t$.
P4. A Transformer is trained for 100k steps with warmup + cosine (peak LR $3 \times 10^{-4}$). Using the 1–10% rule, pick a warmup length; then compute the LR at the end of warmup, halfway through the cosine phase, and at the final step.
P5. Why does Adam with $\beta_2 = 0.999$ still take unreliable early steps even with bias correction? (Hint: $\hat v_t$ is unbiased but high-variance when estimated from few gradients — connect this to warmup.)
P6. You ran AdamW with weight_decay=0 for 50 epochs. Train loss is 0.001, val loss is 0.7. Name two changes from this lecture (not L06's regularization) that would help and why.