
Interactive Explainer

LR Schedule Visualizer

Four schedules, one canvas. Drag the peak LR, warmup fraction, and total steps to see exactly how the learning rate moves through training, and why Transformers cold-started at full LR diverge within their first few hundred steps.

~6 min · Deep Learning · Optimization · Schedules

The same optimizer can succeed or fail depending on the learning rate schedule you wrap around it. Cold-starting Adam on a Transformer with lr = 3e-4 diverges within a few hundred steps. Warm it up linearly from 0 and the same architecture trains cleanly.

The playground

PyTorch for the current schedule

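The page injects PyTorch for whichever schedule is currently selected, and none of that code survives extraction, so here is a minimal stand-in sketch: linear warmup into cosine decay via torch.optim.lr_scheduler.LambdaLR. The model, total step count, and 5% warmup fraction are placeholder assumptions, not values read from the sliders.

    import math
    import torch

    model = torch.nn.Linear(512, 512)                          # stand-in model
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # lr here is the peak LR

    total_steps = 10_000                    # assumption: in the widget, set from the sliders
    warmup_steps = int(0.05 * total_steps)  # assumption: 5% warmup fraction

    def lr_lambda(step):
        # Multiplier on the peak LR: linear ramp 0 -> 1, then cosine decay 1 -> 0.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    for step in range(total_steps):
        optimizer.step()   # in real training this follows loss.backward()
        scheduler.step()   # advance the schedule once per optimizer step

Note that scheduler.step() runs once per batch, not once per epoch; that is what makes the warmup land inside the first few hundred updates.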

Why Transformers need warmup

Adam's second-moment estimate v̂_t is tiny and noisy at step 1. Dividing by √v̂_t wildly amplifies the first few updates. Meanwhile, randomly initialized attention produces peaky softmax distributions, concentrating large early gradients in a few heads. Combined, cold-starting at full LR produces huge, unstable first steps, and the loss diverges.
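You can watch the amplification in a dozen lines of plain Python. This is a sketch, not a training run: standard-normal noise stands in for real gradients. At t = 1 the bias-corrected ratio m̂_t/√v̂_t is ±1 regardless of the gradient's scale, so every parameter moves by the full learning rate, and the ratio stays noisy for the next several steps while v̂_t averages over too few samples.

    import random

    random.seed(0)
    beta1, beta2, eps = 0.9, 0.999, 1e-8  # PyTorch Adam defaults
    m = v = 0.0
    for t in range(1, 11):
        g = random.gauss(0.0, 1.0)           # stand-in noisy gradient
        m = beta1 * m + (1 - beta1) * g      # first-moment EMA
        v = beta2 * v + (1 - beta2) * g * g  # second-moment EMA
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        print(t, round(m_hat / (v_hat ** 0.5 + eps), 3))  # effective step multiplier
    # t = 1 prints exactly ±1.0: a full-LR step no matter how small g_1 was.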

Warmup linearly ramps LR from 0 to peak over 1–10% of training. Once v̂_t has a few steps to stabilize and attention has diffused, you can safely ride the peak LR.
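In symbols (notation mine, not from the page: T total steps, w the warmup fraction), the ramp is lr(t) = lr_peak · t / (w·T) for t < w·T; from there, the selected schedule's decay takes over.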

Part of the ES 667 Deep Learning course · IIT Gandhinagar · Aug 2026.