Adam, AdamW & LR Schedules

Lecture 5 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Learning outcomes

By the end of this lecture you will be able to:

  1. Explain why per-parameter LR is helpful (sparse vs dense gradients).
  2. Derive Adam as momentum + RMSProp + bias correction.
  3. Compute bias-correction at small $t$ and show why it matters.
  4. Distinguish Adam vs AdamW and pick AdamW for regularized training.
  5. Pick a schedule (constant / step / cosine / warmup+cosine) per use-case.
  6. Explain why warmup is essential for Transformers.

Recap · where we left off

Lecture 4 · momentum = EMA of past gradients. Damps ravine oscillation and speeds training.

But momentum still uses a single learning rate for all parameters.

Today maps to UDL Ch 6 (Adam) and Ch 7 (gradients + initialization revisited).

Q. Is that always the right thing?

Not always — here's why

Imagine training a model where:

  • word-embedding parameters are updated by rare tokens → gradients are large and sparse
  • hidden-layer weights are updated every step → gradients are small but constant

A single LR that is right for one is wrong for the other.

Today · per-parameter adaptive learning rates — AdaGrad → RMSProp → Adam → AdamW — plus the schedule we wrap around them.

Four questions

  1. How do we get a per-parameter learning rate?
  2. What is Adam actually doing, piece by piece?
  3. Why is L2 "broken" inside Adam, and how does AdamW fix it?
  4. What schedule should you use, and why do Transformers need warmup?

Pop quiz · which optimizer would you bet on?

You are training a 1B-parameter Transformer on text · gradients are sparse for embeddings, dense for attention, and large for a few outlier weights.

(a) SGD with momentum 0.9.
(b) AdaGrad.
(c) AdamW with warmup + cosine.
(d) Plain Adam, constant LR.

Stop and decide. By the end of today the answer should be obvious — and you'll know why each of the others fails.

The exact same loss-curve shape would be diagnosed differently for each optimizer. That diagnostic is what this lecture trains.

PART 1

The family tree

SGD → AdaGrad → RMSProp → Adam

The lineage

Per-parameter LR · the two-knobs analogy

Imagine tuning two knobs · one is sensitive (you've already moved it a lot) · the other you've barely touched.

Which deserves a bigger turn? Obviously the untouched one.

AdaGrad does this for every parameter · if a parameter has accumulated lots of gradient (big knob movement so far), shrink its effective LR. If it's been quiet, leave its LR large. This is the core idea behind every adaptive optimizer (RMSProp, Adam, AdamW).

How do we give each parameter its own LR?

Analogy · audio mixing board. Imagine you're an engineer with a giant board of thousands of knobs (parameters).

  • Some knobs are very sensitive — you've already moved them a lot. Make tiny adjustments now.
  • Other knobs you've barely touched. You can afford to turn them aggressively.

AdaGrad's idea · keep a total movement history for each knob. The more a knob has moved, the smaller its future turns.

AdaGrad · build the update step-by-step

For a single parameter $\theta_i$:

  1. Gradient at step $t$ for parameter $i$: $g_{t,i} = \nabla_{\theta_i} L(\theta_t)$.
  2. Running sum of squared gradients (the "history"): $G_{t,i} = G_{t-1,i} + g_{t,i}^2$.
  3. Standard SGD would do $\theta_{t+1,i} = \theta_{t,i} - \eta\, g_{t,i}$.
  4. AdaGrad divides $\eta$ by $\sqrt{G_{t,i} + \epsilon}$:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}}\, g_{t,i}$$

$\epsilon \approx 10^{-8}$ avoids division-by-zero. Vector form (element-wise):

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$

Each parameter gets its own effective LR $\eta/\sqrt{G_{t,i} + \epsilon}$.

Worked numeric · AdaGrad on two parameters

$\theta_1$ (dense, small gradients) and $\theta_2$ (sparse, occasionally huge). $\eta = 0.1$; $\epsilon$ ignored for clarity.

$\theta_1$ · steady $g = 0.1$ every step

  • $t=1$: $G = 0.01$, update $= 0.1/\sqrt{0.01} \times 0.1 = 0.100$
  • $t=2$: $G = 0.02$, update $= 0.1/\sqrt{0.02} \times 0.1 \approx 0.071$
  • $t=3$: $G = 0.03$, update $= 0.1/\sqrt{0.03} \times 0.1 \approx 0.058$

LR shrinks gently as updates accumulate.

$\theta_2$ · sparse: $g_1 = 0$, $g_2 = 5$

  • $t=1$: $G = 0$, update $= 0$ — tiny update (nothing to update toward)
  • $t=2$: $G = 25$, update $= 0.1/\sqrt{25} \times 5 = 0.100$; effective LR drops to $0.1/5 = 0.02$

One big gradient → LR for $\theta_2$ collapses immediately. Past sparsity protected; future moves are careful.
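A minimal sketch of the AdaGrad update that reproduces the two-parameter numbers above (the gradient sequences are the illustrative values from this slide):

import math

def adagrad_steps(grads, lr=0.1, eps=1e-8):
    """Run AdaGrad on one parameter; print the per-step update size."""
    G = 0.0
    for t, g in enumerate(grads, start=1):
        G += g * g                            # accumulate squared-gradient history
        print(f"t={t}  G={G:5.2f}  update={lr / math.sqrt(G + eps) * g:.3f}")

adagrad_steps([0.1, 0.1, 0.1])   # theta_1, dense  → 0.100, 0.071, 0.058
adagrad_steps([0.0, 5.0])        # theta_2, sparse → 0.000, 0.100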

AdaGrad's problem · LR decays to zero

How do we fix AdaGrad's dying LR?

Analogy · perfect vs. fading memory. AdaGrad has infinite memory — every gradient since step 1 is in $G_t$. After a million steps, $G_t$ is huge and the effective LR $\eta/\sqrt{G_t + \epsilon}$ is essentially zero.

What if we gave it a fading memory, like momentum? An EMA of squared gradients weights recent gradients more than old ones. That's RMSProp.

RMSProp · AdaGrad with a fading memory (2012)

Replace AdaGrad's accumulator $G_t$ with an EMA $v_t$:

               AdaGrad                              RMSProp
Update         $G_t = G_{t-1} + g_t^2$ (forever)    $v_t = \beta v_{t-1} + (1-\beta)\, g_t^2$
Behaviour      grows without bound                  stabilizes near $\mathbb{E}[g^2]$

Update rule:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t + \epsilon}}\, g_t$$

Typical $\beta = 0.999$ — keep 99.9% of old, mix in 0.1% of new.

Worked numeric · AdaGrad vs. RMSProp

Constant gradient $g = 1$; $\eta = 0.1$, $\beta = 0.9$ for clarity.

AdaGrad · keeps shrinking

  • $t=1$: $G = 1$, step $= 0.1/\sqrt{1} = 0.100$
  • $t=2$: $G = 2$, step $= 0.1/\sqrt{2} \approx 0.071$
  • $t=10$: $G = 10$, step $\approx 0.032$
  • $t=100$: $G = 100$, step $= 0.010$

RMSProp · stabilizes

  • $t=1$: $v = 0.1$, step $= 0.1/\sqrt{0.1} \approx 0.316$
  • $t=10$: $v \approx 0.651$, step $\approx 0.124$
  • $t=100$: $v \approx 1$, step $\to 0.100$

RMSProp's effective LR converges to a positive value rather than decaying to zero.
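The same comparison in code — a short sketch using the constant-gradient setup above:

import math

g, lr, beta = 1.0, 0.1, 0.9
G, v = 0.0, 0.0
for t in range(1, 101):
    G += g * g                           # AdaGrad: history grows forever
    v = beta * v + (1 - beta) * g * g    # RMSProp: EMA stabilizes near g^2
    if t in (1, 2, 10, 100):
        print(f"t={t:3d}  adagrad={lr / math.sqrt(G) * g:.3f}"
              f"  rmsprop={lr / math.sqrt(v) * g:.3f}")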

The big idea · Adam = Momentum + RMSProp

Combine the best of both worlds:

  1. From momentum — EMA of gradients gives a smoother, less noisy direction. Call it $m_t$ (first moment).
  2. From RMSProp — EMA of squared gradients gives a per-parameter scale. Call it $v_t$ (second moment).

Adam does both at once.

Adam · Momentum + RMSProp

Trajectories · SGD vs momentum vs Adam

Adam · the full update

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

$$\theta_{t+1} = \theta_t - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Defaults · $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\eta = 10^{-3}$.

Adam · worked example · 3 steps on a single parameter

Suppose · gradients $g_1 = 1.0$, $g_2 = 2.0$, $g_3 = 0.5$. Defaults; $\eta = 0.1$.

step   $g_t$   $\hat{m}_t$   $\hat{v}_t$   update
1      1.0     1.000         1.000         0.100
2      2.0     1.526         2.501         0.097
3      0.5     1.148         1.750         0.087

Note · the per-parameter update size is roughly constant ($\approx \eta$) even though the gradients change. That's Adam's per-parameter adaptive scaling at work.
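A from-scratch sketch of Adam on one parameter that reproduces the table (the gradient sequence and $\eta = 0.1$ are the illustrative values above):

import math

def adam_steps(grads, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # EMA of gradients (momentum)
        v = beta2 * v + (1 - beta2) * g * g    # EMA of squared gradients (scale)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        update = lr * m_hat / (math.sqrt(v_hat) + eps)
        print(f"t={t}  m_hat={m_hat:.3f}  v_hat={v_hat:.3f}  update={update:.3f}")

adam_steps([1.0, 2.0, 0.5])   # updates ≈ 0.100, 0.097, 0.087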

PART 2

The bias-correction detail

Why we divide by $1-\beta^t$

The cold-start problem

Our moving averages $m$ and $v$ are initialized at zero.

Analogy · making hot chocolate. You start with a cup of hot water ($m_0 = 0$) and add one spoonful of cocoa (gradient $g_1$). The first sip is mostly water with a hint of cocoa — biased toward the starting point.

After many spoonfuls the mix is right. Bias correction is the math trick to "undilute" the early steps and get the right strength immediately.

What goes wrong at $t = 1$?

Initialize $m_0 = 0$. At step 1:

$$m_1 = \beta_1 m_0 + (1-\beta_1)\, g_1 = 0.9 \times 0 + 0.1\, g_1 = 0.1\, g_1$$

The EMA is 10× smaller than the true gradient — purely because it started from zero.

The correction in one picture

Deriving the bias-correction factor

Unroll the EMA from $m_0 = 0$, assuming the gradient is approximately constant ($g_k \approx g$):

$$m_1 = (1-\beta)\, g \qquad m_2 = \beta(1-\beta)\, g + (1-\beta)\, g$$

In general:

$$m_t = (1-\beta) \sum_{k=1}^{t} \beta^{t-k} g_k \approx (1-\beta) \left[ \sum_{j=0}^{t-1} \beta^j \right] g$$

The bracketed sum is a geometric series with $t$ terms:

$$\sum_{j=0}^{t-1} \beta^j = \frac{1-\beta^t}{1-\beta}$$

So $m_t \approx (1-\beta^t)\, g$. To recover $g$, divide:

$$\hat{m}_t = \frac{m_t}{1-\beta^t} \approx g$$

Same logic gives $\hat{v}_t = v_t/(1-\beta_2^t)$.

Worked example · bias correction in action

Suppose the true gradient is $g = 1$ at every step. With $\beta_1 = 0.9$, starting $m_0 = 0$:

step   $m_t$    $\hat{m}_t = m_t/(1-0.9^t)$
1      0.100    1.000
2      0.190    1.000
3      0.271    1.000
10     0.651    1.000
100    1.000    1.000

Without correction, the first step would use $m_1 = 0.1$ — ten times too small. The numerator of Adam's step would therefore be 10× smaller than it should be at $t = 1$, wasting the early training phase. The $1-\beta_1^t$ denominator fixes it exactly.
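A few lines verify the table — with a constant gradient, the corrected estimate is exactly right at every step:

beta1, m = 0.9, 0.0
for t in range(1, 101):
    m = beta1 * m + (1 - beta1) * 1.0    # constant gradient g = 1
    if t in (1, 2, 3, 10, 100):
        print(f"t={t:3d}  m={m:.3f}  m_hat={m / (1 - beta1 ** t):.3f}")  # m_hat = 1.000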

When does bias correction matter?

For $\beta_1 = 0.9$:

  • $t = 1$: factor $1/(1-0.9) = 10$ ← huge
  • $t = 10$: factor $1/(1-0.9^{10}) \approx 1.54$
  • $t = 100$: factor $1/(1-0.9^{100}) \approx 1.00$ ← negligible

Bias correction is a first-few-steps phenomenon. It keeps early steps the right magnitude; after that it fades to the identity.

PART 3

Adam → AdamW

The decoupled-weight-decay fix

L2 · quick recap from ES 654

L2 regularization adds $\frac{\lambda}{2}\lVert\theta\rVert^2$ to the loss. Gradient contribution: $\lambda\theta$.

In plain SGD, this equals weight decay — each step shrinks weights by a factor of $(1-\eta\lambda)$:

$$\theta_{t+1} = \theta_t - \eta\,(g_t + \lambda\theta_t) = (1-\eta\lambda)\,\theta_t - \eta\, g_t$$

For SGD these are the same. Not so for Adam.

AdamW · the fix

Decouple the decay from the adaptive machinery — apply it directly to the weights:

$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\,\theta_t\right)$$

Adam-with-L2 instead folds $\lambda\theta_t$ into $g_t$, so the decay gets rescaled by $1/\sqrt{\hat{v}_t}$ like any other gradient.

Adam vs AdamW · one-step worked numeric

Setup · $\theta = 1.0$, $g = 1.0$, $\lambda = 0.1$ (weight decay), $\eta = 0.1$, $\sqrt{\hat{v}} = 10$ (from RMSProp; a high-gradient parameter).

Adam (L2 in gradient)

  • Effective gradient · $g + \lambda\theta = 1.0 + 0.1 = 1.1$
  • Update · $\eta \times 1.1/\sqrt{\hat{v}} = 0.1 \times 1.1/10 = 0.011$
  • Note · the "regularization" contributed only $0.001$ — it got divided by $\sqrt{\hat{v}}$ · weakened on high-gradient params!

AdamW (decoupled)

  • Adaptive update · $\eta\, g/\sqrt{\hat{v}} = 0.1 \times 1.0/10 = 0.010$
  • Plus uniform decay · $\eta\lambda\theta = 0.1 \times 0.1 \times 1.0 = 0.010$
  • Total · $0.010 + 0.010 = 0.020$

In AdamW, weight decay is uniform for every parameter. In Adam, parameters with large past gradients are decayed less. AdamW's behavior matches the regularization theory · use it.
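The same one-step comparison in code (all values are the illustrative setup above, with bias correction ignored):

import math

theta, g, lam, lr = 1.0, 1.0, 0.1, 0.1
v_hat = 100.0                                  # sqrt(v_hat) = 10: high-gradient param

adam_update = lr * (g + lam * theta) / math.sqrt(v_hat)        # L2 inside: 0.011
adamw_update = lr * g / math.sqrt(v_hat) + lr * lam * theta    # decoupled: 0.020
print(adam_update, adamw_update)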

AdamW in PyTorch · one line

# the right default for almost everything in 2026
import torch

opt = torch.optim.AdamW(model.parameters(),
                        lr=3e-4,
                        betas=(0.9, 0.999),
                        weight_decay=0.1)   # typical for LLMs

LLMs: weight_decay=0.1. Fine-tuning: 0.01–0.05. Vision fine-tune: 0.001–0.01.

PART 4

Learning-rate schedules

Why one learning rate isn't enough over a full run

LR schedules · four common shapes

Four common schedules

▶ Interactive: sliders for peak LR and warmup — lr-schedule-visualizer.

Schedules in PyTorch

import math
import torch

# Step decay
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[60, 120], gamma=0.1)

# Cosine annealing
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=n_epochs)

# Warmup + cosine — the 2026 default for Transformers
from torch.optim.lr_scheduler import LambdaLR
def lr_lambda(step):
    if step < warmup:                  # warmup / total_steps set elsewhere
        return step / warmup           # linear ramp 0 → 1
    progress = (step - warmup) / (total_steps - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))
sched = LambdaLR(opt, lr_lambda)
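Where the scheduler sits in the loop matters: a warmup + cosine schedule defined per step must be advanced once per batch, not once per epoch. A minimal sketch (loader, model and compute_loss are placeholders):

for batch in loader:
    loss = compute_loss(model, batch)   # placeholder for your forward pass
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()    # advance the LR schedule by one optimizer step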

Why Transformers need warmup

Why are early gradients so chaotic?

A randomly-initialized network knows nothing.

  • Loss is high → gradients are large.
  • The model makes wildly overconfident-but-wrong predictions (e.g. softmax assigns 99% to the wrong class) → massive corrective gradients.

Two things go wrong simultaneously at step 1:

  1. Chaotic, large gradients from a random network.
  2. Adam's $v_t$ is itself noisy — only one batch has been seen; the EMA estimate is unreliable.

Big gradient ÷ tiny, unreliable $\sqrt{\hat{v}_t}$ → explosive first step that throws weights into unrecoverable territory.

Why Transformers need warmup

A perfect storm at the start:

  1. Adam's denominator is unstable. $v_t$ is based on only a few batches; if those gradients happen to be small, $\sqrt{\hat{v}_t}$ is tiny → updates are huge.
  2. Transformer gradients are spiky. Random init → attention accidentally focuses everything on one irrelevant token → enormous gradient on that head's weights.

Combine: spiky $g_t$ ÷ tiny $\sqrt{\hat{v}_t}$ → a massive early update.

Warmup is a safety valve. Linearly ramp $\eta$ from 0 over the first 1–10% of training:

  • At step 1, $\eta \approx 0$ → tiny update no matter how crazy the gradient.
  • This gives $v_t$ time to stabilize over many batches.
  • By the time $\eta$ reaches its target, $\hat{v}_t$ is a reliable estimate.

After warmup, use cosine decay.

PART 5

What to actually use

Warmup · typical schedule

Typical Transformer schedule:

  • Warmup · linearly ramp lr 0 → target over first 2000 steps (or 1–10% of total).
  • Plateau · hold at target lr briefly.
  • Cosine decay · smooth drop from target to ~1/10 of target over the rest.
  • Final · ~10% of training at minimum lr to squeeze last gains.
import math

def lr_lambda(step, total, warmup):
    if step < warmup:
        return step / warmup       # linear warmup 0 → 1
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))  # cosine 1 → 0

5 lines. Ubiquitous.
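To hand the three-argument version to PyTorch, one option (a sketch; total_steps and warmup_steps are placeholders you set) is functools.partial:

from functools import partial
from torch.optim.lr_scheduler import LambdaLR

sched = LambdaLR(opt, partial(lr_lambda, total=total_steps, warmup=warmup_steps))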

Defaults that work

Model / regime          Optimizer                   LR           Schedule
CNN from scratch        SGD + momentum + Nesterov   0.1          step decay
CNN fine-tune           AdamW                       1e-4         cosine
Transformer pre-train   AdamW (β₂ = 0.95)           3e-4         warmup + cosine
LoRA fine-tune of LLM   AdamW                       1e-4 – 3e-4  cosine
Debugging a new idea    AdamW                       3e-4         constant

lr = 3e-4 is not magic — it's the number to use when you don't want to think. For a real run, do the LR finder (Lecture 3).

Gradient clipping · cheap insurance
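Rescale the global gradient norm before opt.step() so one spiky batch can't blow up the weights. A minimal sketch using PyTorch's built-in (max_norm=1.0 is the common default, as in the summary below):

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip global grad norm
opt.step()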

Common mistakes

Leaving weight_decay at 0 for AdamW. You get plain Adam with no regularization. Surprisingly common.

No LR schedule for an LLM. Training plateaus early and you blame the architecture.

Warmup of 10 steps for a 100k-step run. Far too short. Warmup = 1–10% of total.

lr = 3e-4 for SGD. That's the Adam default. SGD usually wants lr = 0.01 to 0.1.

Putting it all together · the L05 master sentence

Adam = momentum on $m_t$, RMSProp on $v_t$, with a $1/(1-\beta^t)$ bias-correction at the start. AdamW peels weight-decay out of the adaptive scaling so it actually regularizes. A schedule wraps everything · warmup at the start (gradients are chaotic), cosine at the end (anneal toward a minimum).

Symbol                    Role              Update at step $t$
$m_t$                     momentum          $\beta_1 m_{t-1} + (1-\beta_1)\, g_t$
$v_t$                     per-param scale   $\beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$
$\hat{m}_t, \hat{v}_t$    bias-corrected    $m_t/(1-\beta_1^t)$, $v_t/(1-\beta_2^t)$
$\theta_t$                step (AdamW)      $\theta_{t-1} - \eta\,(\hat{m}_t/(\sqrt{\hat{v}_t}+\epsilon) + \lambda\theta_{t-1})$

The final $\lambda\theta_{t-1}$ term — applied directly to the weights, not through $\sqrt{\hat{v}_t}$ — is the only difference between Adam+L2 and AdamW. That single change is why AdamW is the 2026 default for every Transformer in the world.
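The whole table in one function — a from-scratch sketch of a single AdamW step on one scalar parameter (a real implementation vectorizes this over tensors):

import math

def adamw_step(theta, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update; returns the new (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * g          # momentum
    v = beta2 * v + (1 - beta2) * g * g      # per-param scale
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v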

Pop quiz · revisit

The 1B-parameter Transformer? The answer is (c) AdamW with warmup + cosine.

(a) SGD · single LR can't cope with sparse vs dense gradients · slow.
(b) AdaGrad · LR collapses to zero on dense layers within 10k steps.
(c) ✓ AdamW + warmup absorbs the chaotic early gradients · cosine anneals.
(d) Plain Adam · L2 silently broken; no warmup → first-step explosions.

Practice problems

P1. Show that AdaGrad's effective LR for parameter $i$ at step $t$ is $\eta/\sqrt{G_{t,i}+\epsilon}$. Argue why this monotonically decreases.

P2. A parameter has constant gradient $g$. Compute Adam's $m_t$ and $v_t$ at $t = 1, 2, 3$. Verify the bias correction recovers $\hat{m}_t = g$ and $\hat{v}_t = g^2$.

P3. State the exact difference between Adam-with-L2 and AdamW. Show that for a fixed gradient $g$ and fixed $\hat{v}_t \neq 1$ they give different updates.

P4. A Transformer is trained for 100k steps. Sketch the cosine schedule with warmup = 2000 steps and peak LR 3e-4. Give the LR at $t = 1000$, $t = 2000$, and $t = 100{,}000$.

P5. Why does Adam with $\beta_2 = 0.999$ need bias correction at $t = 1$ but not at $t = 10{,}000$? Compute $1-\beta_2^t$ for both.

P6. You ran AdamW with weight_decay=0 for 50 epochs. Train loss is 0.001, val loss is 0.7. Name two changes from this lecture (not L06's regularization) that would help and why.

Lecture 5 — summary

  • AdaGrad gave per-parameter LR — but never forgets, so LRs decay to zero.
  • RMSProp fixed that with an EMA of $g^2$.
  • Adam = momentum + RMSProp + bias correction. Robust first-try optimizer.
  • Bias correction matters for the first ~10–100 steps; the factor $1/(1-\beta^t)$ fades to 1 after that.
  • AdamW decouples weight decay from adaptive scaling. Use it, not Adam+L2.
  • Schedules — cosine is the clean default; warmup + cosine is the Transformer default.
  • Gradient clipping at 1.0 — cheap insurance.

Read before Lecture 6

Prince — Ch 6 §6.7 (Adam), Ch 7. Free at udlbook.github.io.

Next lecture

Regularization I — bias-variance in the overparameterized regime, double descent, weight decay as prior, early stopping, data augmentation, Mixup, label smoothing.

Notebook 5 · 05-adam-schedules.ipynb — implement Adam and AdamW from scratch; sweep step-decay vs cosine on CIFAR-10.