Interactive Explainer
Slide the dropout probability and watch a small network flicker. Every frame is a different sub-network. Over many frames you're training an exponentially large ensemble with shared weights: the core intuition behind the most influential regularizer of 2012.
Dropout was Hinton's 2012 insight: on every forward pass, randomly silence a fraction p of the hidden units. At inference, turn dropout off and use the whole network. In effect, you train an ensemble of 2^N thinned networks (N being the number of droppable units) that all share weights, for the price of one forward pass per batch.
Try this: set p = 0.5 and press Auto-resample. Each frame you see a different sub-network. Switch to Eval mode: dropout turns off, every unit fires, and predictions become deterministic.
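Outside the widget, the same flicker is easy to reproduce in a few lines of PyTorch. A minimal sketch (the toy network and its sizes are ours, chosen arbitrarily):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(                 # toy network; sizes are arbitrary
    nn.Linear(4, 16), nn.ReLU(),
    nn.Dropout(p=0.5),               # in PyTorch, p is the *drop* probability
    nn.Linear(16, 1),
)
x = torch.randn(1, 4)

net.train()                          # dropout active: fresh mask every pass
print([round(net(x).item(), 4) for _ in range(3)])   # three different values

net.eval()                           # dropout off: full network, deterministic
print([round(net(x).item(), 4) for _ in range(3)])   # three identical values
```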
Naive dropout would make activations at eval time (all units alive) systematically larger than at train time (only a fraction alive). To keep the expected activation constant, we scale the surviving units up by 1/(1 − p) during training:

h_drop = (h ⊙ mask) / (1 − p),    mask ~ Bernoulli(1 − p)
This is called inverted dropout. It's the form PyTorch uses. Because the rescaling happens at training time, the eval path stays the simplest thing: identity.
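For concreteness, here is inverted dropout written out by hand; a minimal sketch, with the function name ours (PyTorch's nn.Dropout does the equivalent internally):

```python
import torch

def inverted_dropout(h: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    """Drop each unit with probability p; rescale survivors by 1/(1 - p)."""
    if not training or p == 0.0:
        return h                                   # eval path: identity
    mask = (torch.rand_like(h) > p).float()        # 1 with probability 1 - p
    return h * mask / (1.0 - p)                    # keeps E[output] equal to h
```

Averaged over many masks, `inverted_dropout(h, 0.5, True)` recovers `h` in expectation, which is exactly why the eval path can be the identity.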
Forgetting model.eval() at inference means dropout stays on: your predictions will flicker randomly, and calibration will be broken.
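A cheap guard against this bug is to check that two forward passes on the same input agree before serving; a hypothetical helper, not part of any library:

```python
import torch

def assert_deterministic(model: torch.nn.Module, x: torch.Tensor) -> None:
    """Fail loudly if two passes on the same input disagree."""
    with torch.no_grad():
        y1, y2 = model(x), model(x)
    assert torch.allclose(y1, y2), "stochastic forward pass: forgot model.eval()?"
```

Run it on a dummy batch right after loading the model.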
Ensemble view. Each minibatch trains a different thinned network. Over training you're implicitly training 2^N networks that all share weights. At test time, no mask ⇒ you get an approximation to the geometric-mean ensemble for free.
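You can watch that equivalence numerically: average many masked passes and compare against the single unmasked pass. A sketch reusing `net` and `x` from above; the match is approximate because the rescaling is only unbiased before the nonlinearity:

```python
import torch

# Reuses `net` and `x` from the sketch above.
with torch.no_grad():
    net.train()                                       # masks on: sample thinned networks
    mc = torch.stack([net(x) for _ in range(1000)])   # 1000 ensemble members
    net.eval()                                        # mask off: one full-width pass
    print(mc.mean().item(), net(x).item())            # close, but not identical
```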
Co-adaptation view. Without dropout, unit j relies on unit k being alive to do its job. Dropout forces every unit to be useful on its own, distributing the representation rather than localizing it.
p = 0.1 after attention and FFN. Still the default.
p = 0.5 between hidden layers. The original 2012 regime.

Part of the ES 667 Deep Learning course · IIT Gandhinagar · Aug 2026.