Lecture 4 · momentum = EMA of past gradients. Damps ravine oscillation and speeds training.
But momentum still uses a single learning rate for all parameters.
Today maps to UDL Ch 6 (Adam) and Ch 7 (gradients + initialization revisited).
Q. Is that always the right thing?
Imagine training a model where: some parameters receive sparse, infrequent gradients (embedding rows) while others receive dense, frequent ones (attention weights).
A single LR that is right for one is wrong for the other.
Today · per-parameter adaptive learning rates — AdaGrad → RMSProp → Adam → AdamW — plus the schedule we wrap around them.
You are training a 1B-parameter Transformer on text · gradients are sparse for embeddings, dense for attention, and large for a few outlier weights.
(a) SGD with momentum 0.9.
(b) AdaGrad.
(c) AdamW with warmup + cosine.
(d) Plain Adam, constant LR.
Stop and decide. By the end of today the answer should be obvious — and you'll know why each of the others fails.
The exact same loss-curve shape would be diagnosed differently for each optimizer. That diagnostic is what this lecture trains.
SGD → AdaGrad → RMSProp → Adam
Imagine tuning two knobs · one is sensitive (you've already moved it a lot) · the other you've barely touched.
Which deserves a bigger turn? Obviously the untouched one.
AdaGrad does this for every parameter · if a parameter has accumulated lots of gradient (big knob movement so far), shrink its effective LR. If it's been quiet, leave its LR large. This is the core idea behind every adaptive optimizer (RMSProp, Adam, AdamW).
Analogy · audio mixing board. Imagine you're an engineer with a giant board of thousands of knobs (parameters).
AdaGrad's idea · keep a total movement history for each knob. The more a knob has moved, the smaller its future turns.
For a single parameter $\theta_i$: accumulate $s_{t,i} = s_{t-1,i} + g_{t,i}^2$, then update $\theta_{t+1,i} = \theta_{t,i} - \frac{\alpha}{\sqrt{s_{t,i}} + \epsilon}\, g_{t,i}$.
Each parameter gets its own effective LR.
LR shrinks gently as updates accumulate.
The catch · $s_{t,i}$ never shrinks. One big gradient → LR for that parameter is suppressed forever, and on dense gradients the effective LR decays toward zero — training stalls.
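That update is short enough to run in your head — here it is as a plain-Python sketch for one parameter, with an invented gradient stream:

```python
# Minimal AdaGrad for one parameter — illustrative values, not a training loop.
alpha, eps = 0.1, 1e-8
theta, s = 0.0, 0.0                      # parameter and accumulated squared grads

grads = [1.0, 1.0, 0.1, 1.0]             # hypothetical gradient stream
for g in grads:
    s += g ** 2                          # accumulator only ever grows
    effective_lr = alpha / (s ** 0.5 + eps)
    theta -= effective_lr * g
    print(f"s={s:.2f}  effective_lr={effective_lr:.4f}")
# effective LR: 0.1000 → 0.0707 → 0.0705 → 0.0576 — monotonically shrinking
```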
Analogy · perfect vs. fading memory. AdaGrad has infinite memory — every gradient since step 1 is in $s_t$ forever.
What if we gave it a fading memory, like momentum? An EMA of squared gradients weights recent gradients more than old ones. That's RMSProp.
Replace AdaGrad's accumulator with an EMA: $s_t = \beta\, s_{t-1} + (1-\beta)\, g_t^2$.
| | AdaGrad | RMSProp |
|---|---|---|
| Update | $s_t = s_{t-1} + g_t^2$ | $s_t = \beta\, s_{t-1} + (1-\beta)\, g_t^2$ |
| Behaviour | Grows without bound | Stabilizes |
Update rule: $\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{s_t} + \epsilon}\, g_t$ — same as AdaGrad, only the accumulator changed.
Typical $\beta = 0.9$, $\epsilon = 10^{-8}$.
Constant gradient $g$ → $s_t \to g^2$, so the effective LR settles at $\alpha / (|g| + \epsilon)$.
RMSProp's effective LR converges to a positive value rather than decaying to zero.
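To see the contrast concretely, a sketch comparing the two accumulators on the same constant gradient ($g = 1$, $\beta = 0.9$ assumed):

```python
# AdaGrad vs RMSProp accumulator on a constant gradient g = 1 — a sketch.
alpha, eps, beta = 0.1, 1e-8, 0.9
s_ada = s_rms = 0.0
for t in range(100):
    g = 1.0
    s_ada += g ** 2                               # grows without bound: s_ada = t
    s_rms = beta * s_rms + (1 - beta) * g ** 2    # converges to g^2 = 1
print(alpha / (s_ada ** 0.5 + eps))   # ~0.01 and still shrinking
print(alpha / (s_rms ** 0.5 + eps))   # ~0.1  — stabilized near alpha / |g|
```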
Combine the best of both worlds: momentum's smoothed direction (an EMA of gradients) and RMSProp's per-parameter scale (an EMA of squared gradients).
Adam does both at once: $m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$, $\quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$, $\quad \theta_{t+1} = \theta_t - \alpha\, \hat m_t / (\sqrt{\hat v_t} + \epsilon)$ (the hats are bias-corrected versions — next slide).
Defaults · $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
Suppose · gradients are constant at $g_t = 1$, with the defaults above.

| step | $g_t$ | $m_t$ | $v_t$ | $\hat m_t / \sqrt{\hat v_t}$ | update |
|---|---|---|---|---|---|
| 1 | 1.0 | 0.100 | 0.00100 | 1.000 | $-\alpha$ |
| 2 | 1.0 | 0.190 | 0.00200 | 1.000 | $-\alpha$ |
| 3 | 1.0 | 0.271 | 0.00300 | 1.000 | $-\alpha$ |
Note · the per-parameter update size is roughly constant ($\approx \alpha$): the $\sqrt{\hat v_t}$ denominator cancels the gradient's scale, so the same table with $g_t = 100$ would give the same updates.
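The table takes a few lines to verify (a sketch using the defaults above and the same constant gradient):

```python
# Verify the worked example: constant g = 1 with Adam defaults — a sketch.
alpha, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
m = v = 0.0
for t in range(1, 4):
    g = 1.0
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment (scale)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    update = alpha * m_hat / (v_hat ** 0.5 + eps)
    print(f"t={t}  m={m:.3f}  v={v:.5f}  update={update:.6f}")
# update ≈ alpha = 0.001 at every step, matching the table
```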
Why we divide by $1 - \beta^t$ · bias correction.
Our moving averages $m_t$ and $v_t$ are initialized at zero, so for the first steps they underestimate the true moments.
Analogy · making hot chocolate. You start with a cup of hot water ($m_0 = 0$) and stir in a spoonful of cocoa — one gradient — at each step. The first cups taste watery, not because the cocoa is weak but because you started from plain water.
After many spoonfuls the mix is right. Bias correction is the math trick to "undilute" the early steps and get the right strength immediately.
Initialize $m_0 = 0$. After one step with gradient $g_1$: $m_1 = 0.9 \cdot 0 + 0.1\, g_1 = 0.1\, g_1$.
The EMA is 10× smaller than the true gradient — purely because it started from zero.
Unroll the EMA from $m_0 = 0$: $m_1 = (1-\beta_1)\, g_1$, $\ m_2 = \beta_1 (1-\beta_1)\, g_1 + (1-\beta_1)\, g_2$, …
In general: $m_t = (1-\beta_1) \sum_{k=1}^{t} \beta_1^{\,t-k}\, g_k$. For a roughly constant gradient $g$, this is $m_t = (1-\beta_1)\, g\, [\, 1 + \beta_1 + \cdots + \beta_1^{\,t-1}\,]$.
The bracketed sum is a geometric series with ratio $\beta_1$, equal to $\frac{1-\beta_1^{\,t}}{1-\beta_1}$.
So $m_t = (1-\beta_1^{\,t})\, g$ — too small by exactly the factor $1-\beta_1^{\,t}$, and dividing by it recovers $g$: $\hat m_t = m_t / (1-\beta_1^{\,t})$.
Same logic gives $\hat v_t = v_t / (1-\beta_2^{\,t})$.
Suppose the true gradient is constant, $g_t = 1$, with $\beta_1 = 0.9$:

| step | $m_t$ | $\hat m_t = m_t / (1-\beta_1^{\,t})$ |
|---|---|---|
| 1 | 0.100 | 1.000 |
| 2 | 0.190 | 1.000 |
| 3 | 0.271 | 1.000 |
| 10 | 0.651 | 1.000 |
| 100 | 1.000 | 1.000 |
Without correction, the first step would use $m_1 = 0.1$ — a 10× underestimate of the true gradient.
For $\beta_2 = 0.999$ the bias fades far more slowly: $1 - \beta_2^{\,t} \approx 0.63$ only at $t = 1000$, so the $v_t$ correction matters for thousands of steps.
Bias correction is a first-few-steps phenomenon. It keeps early steps the right magnitude; after that it fades to the identity.
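Both claims — the $m_t$ column above and the slow $\beta_2$ decay — check out numerically (a sketch):

```python
# The bias factor 1 - beta**t for both betas — a quick check.
for beta, steps in [(0.9, [1, 2, 3, 10, 100]), (0.999, [10, 100, 1000])]:
    for t in steps:
        print(f"beta={beta}  t={t:>4}  1 - beta^t = {1 - beta ** t:.3f}")
# beta=0.9:   0.100, 0.190, 0.271, 0.651, 1.000 — the m_t column above
# beta=0.999: 0.010, 0.095, 0.632 — v_t stays biased for thousands of steps
```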
The decoupled-weight-decay fix
L2 regularization adds $\frac{\lambda}{2}\lVert\theta\rVert^2$ to the loss, i.e. $\lambda\theta$ to the gradient.
In plain SGD, this equals weight decay — each step shrinks weights by a factor of $(1 - \alpha\lambda)$ before the gradient step: $\theta \leftarrow (1-\alpha\lambda)\,\theta - \alpha g$.
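A two-line numeric check of that equivalence, with illustrative values:

```python
# SGD: L2-in-the-gradient vs explicit shrink-then-step — identical. A sketch.
alpha, lam, w, g = 0.1, 0.01, 2.0, 0.5
print(w - alpha * (g + lam * w))           # L2 term added to the gradient
print((1 - alpha * lam) * w - alpha * g)   # shrink weights, then gradient step
# both ≈ 1.948 — for SGD the two formulations coincide
```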
For SGD these are the same. Not so for Adam.
Setup · two parameters with equal value $\theta$ and equal $\lambda$, but very different gradient histories — one with large $\hat v_t$, one with small.
Adam (L2 in gradient) · the $\lambda\theta$ term is added to $g_t$ and then divided by $\sqrt{\hat v_t} + \epsilon$, so the high-$\hat v$ parameter is barely decayed.
AdamW (decoupled) · $\theta_{t+1} = \theta_t - \alpha\, \hat m_t / (\sqrt{\hat v_t} + \epsilon) - \alpha\lambda\,\theta_t$ — the decay term bypasses the adaptive scaling.
In AdamW, weight decay is uniform for every parameter. In Adam, parameters with large past gradients are decayed less. AdamW's behavior matches the regularization theory · use it.
# the right default for almost everything in 2026
opt = torch.optim.AdamW(model.parameters(),
lr=3e-4,
betas=(0.9, 0.999),
weight_decay=0.1) # typical for LLMs
LLMs: weight_decay=0.1. Fine-tuning: 0.01–0.05. Vision fine-tune: 0.001–0.01.
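To make the decoupling concrete, here is a single-step sketch for one scalar weight under each rule — hypothetical numbers, with momentum ($\beta_1 = 0$), bias correction, and $\epsilon$ stripped out to isolate the decay term:

```python
# One update for a weight with a large gradient history (big v_hat) — a sketch.
alpha, lam = 1e-3, 0.1
w, g, v_hat = 1.0, 0.0, 100.0      # zero task gradient: only decay acts

g_l2 = g + lam * w                             # Adam: L2 enters the gradient...
w_adam = w - alpha * g_l2 / v_hat ** 0.5       # ...then is divided by sqrt(v_hat)

w_adamw = w - alpha * g / v_hat ** 0.5 - alpha * lam * w   # AdamW: decay outside

print(w_adam, w_adamw)   # 0.99999 vs 0.9999 — AdamW decays this weight 10x more
```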
Why one learning rate isn't enough over a full run
Interactive: sliders for peak LR and warmup — lr-schedule-visualizer.
import math

# Step decay — multiply LR by 0.1 at epochs 60 and 120
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[60, 120], gamma=0.1)

# Cosine annealing over the run
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=n_epochs)

# Warmup + cosine — the 2026 default for Transformers
from torch.optim.lr_scheduler import LambdaLR

warmup, total_steps = 2_000, 100_000   # example lengths — set these from your run

def lr_lambda(step):
    if step < warmup:
        return step / warmup                                  # linear ramp 0 → 1
    progress = (step - warmup) / (total_steps - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))           # cosine decay 1 → 0

sched = LambdaLR(opt, lr_lambda)
A randomly-initialized network knows nothing.
Two things go wrong simultaneously at step 1:
Gradients are huge and chaotic — a randomly-initialized network is confidently wrong about everything.
Adam's $\hat v_t$ has seen only a handful of gradients, so the denominator is noisy and often far too small.
A perfect storm at the start · big gradient ÷ tiny, unreliable $\sqrt{\hat v_t}$ → an enormous, badly-aimed step that can wreck the initialization.
Warmup is a safety valve. Linearly ramp the LR from 0 to its peak over the first 1–10% of training, giving $\hat v_t$ time to become a trustworthy estimate before full-size steps are allowed.
After warmup, use cosine decay.
Typical Transformer schedule:
def lr_lambda(step, total, warmup):
    if step < warmup:
        return step / warmup                               # linear warmup 0 → 1
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))        # cosine
5 lines. Ubiquitous.
| Model / regime | Optimizer | LR | Schedule |
|---|---|---|---|
| CNN from scratch | SGD + momentum + Nesterov | 0.1 | step decay |
| CNN fine-tune | AdamW | 1e-4 | cosine |
| Transformer pre-train | AdamW (β₂ = 0.95) | 3e-4 | warmup + cosine |
| LoRA fine-tune of LLM | AdamW | 1e-4 to 3e-4 | cosine |
| Debugging a new idea | AdamW | 3e-4 | constant |
lr = 3e-4 is not magic — it's the number to use when you don't want to think. For a real run, do the LR finder (Lecture 3).
Leaving weight_decay at 0 for AdamW. You get plain Adam with no regularization. Surprisingly common.
No LR schedule for an LLM. Training plateaus early and you blame the architecture.
Warmup of 10 steps for a 100k-step run. Far too short. Warmup = 1–10% of total.
lr = 3e-4 for SGD. That's the Adam default. SGD usually wants lr = 0.01 to 0.1.
Adam = momentum on the gradient (the EMA $m_t$) + per-parameter scaling by gradient magnitude (the EMA $v_t$) + bias correction for both.
| Symbol | Role | Update at step $t$ |
|---|---|---|
| $m_t$ | momentum | $\beta_1 m_{t-1} + (1-\beta_1)\, g_t$ |
| $v_t$ | per-param scale | $\beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$ |
| $\hat m_t,\ \hat v_t$ | bias-corrected | $m_t / (1-\beta_1^{\,t})$, $\ v_t / (1-\beta_2^{\,t})$ |
| $\theta_t$ | step | $\theta_{t-1} - \alpha\, \hat m_t / (\sqrt{\hat v_t} + \epsilon)$ |
The final AdamW touch · also subtract $\alpha\lambda\,\theta_{t-1}$, outside the adaptive scaling — decoupled weight decay.
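The whole cheat-sheet fits in one function — a NumPy teaching sketch, not the PyTorch internals:

```python
import numpy as np

def adamw_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a parameter array theta (t counts from 1)."""
    m = beta1 * m + (1 - beta1) * g              # momentum
    v = beta2 * v + (1 - beta2) * g ** 2         # per-param scale
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # adaptive step
    theta = theta - alpha * weight_decay * theta            # decoupled decay
    return theta, m, v
```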
The 1B-parameter Transformer? The answer is (c) AdamW with warmup + cosine.
(a) SGD · single LR can't cope with sparse vs dense gradients · slow.
(b) AdaGrad · LR collapses to zero on dense layers within 10k steps.
(c) ✓ AdamW + warmup absorbs the chaotic early gradients · cosine anneals.
(d) Plain Adam · L2 silently broken; no warmup → first-step explosions.
P1. Show that AdaGrad's effective LR for parameter $i$ is $\alpha / \big(\sqrt{\sum_{k=1}^{t} g_{k,i}^2} + \epsilon\big)$, and that it is non-increasing in $t$.
P2. A parameter has constant gradient $g$. Compute the limit of RMSProp's $s_t$ and the limiting effective LR.
P3. State the exact difference between Adam-with-L2 and AdamW. Show that for a fixed gradient and fixed $\hat v_t$, the per-step weight shrinkage under Adam-with-L2 scales as $\lambda / (\sqrt{\hat v_t} + \epsilon)$, while under AdamW it is $\lambda$, independent of $\hat v_t$.
P4. A Transformer is trained for 100k steps with warmup + cosine (peak LR $3 \times 10^{-4}$). Using the 1–10% rule, pick a warmup length; then compute the LR at the end of warmup, halfway through the cosine phase, and at the final step.
P5. Why does Adam with $\beta_2 = 0.999$ still take unreliable early steps even with bias correction? (Hint: $\hat v_t$ is unbiased but high-variance when estimated from few gradients — connect this to warmup.)
P6. You ran AdamW with weight_decay=0 for 50 epochs. Train loss is 0.001, val loss is 0.7. Name two changes from this lecture (not L06's regularization) that would help and why.