Interactive Explainer
Slide the dropout probability and watch a small network flicker. Every frame is a different sub-network. Over many frames you're training an exponentially large ensemble with shared weights: the core intuition behind the most influential regularizer of 2012.
Dropout was Hinton's 2012 insight: on every forward pass, randomly silence a fraction p of the hidden units. At inference, turn dropout off and use the whole network. In effect, you train an ensemble of 2^N thinned networks (N being the number of droppable units) that all share weights, for the price of one forward pass per batch.
Try this: set p = 0.5 and press Auto-resample. Each frame you see a different sub-network. Switch to Eval mode: dropout turns off, every unit fires, and predictions become deterministic.
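Outside the widget, the same flicker is easy to reproduce in a few lines of PyTorch. A minimal sketch (the toy network and its sizes are ours, chosen arbitrarily):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(                 # toy network; sizes are arbitrary
    nn.Linear(4, 16), nn.ReLU(),
    nn.Dropout(p=0.5),               # in PyTorch, p is the *drop* probability
    nn.Linear(16, 1),
)
x = torch.randn(1, 4)

net.train()                          # dropout active: fresh mask every pass
print([round(net(x).item(), 4) for _ in range(3)])   # three different values

net.eval()                           # dropout off: full network, deterministic
print([round(net(x).item(), 4) for _ in range(3)])   # three identical values
```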
Naive dropout would make activations at eval time (all units alive) systematically larger than at train time (only a fraction alive). To keep the expected activation constant, we scale the surviving units up by 1/(1 − p) during training:

h_drop = (h ⊙ mask) / (1 − p),    mask ~ Bernoulli(1 − p)
This is called inverted dropout. It's the form PyTorch uses. Because the rescaling happens at training time, the eval path stays the simplest thing: identity.
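For concreteness, here is inverted dropout written out by hand; a minimal sketch, with the function name ours (PyTorch's nn.Dropout does the equivalent internally):

```python
import torch

def inverted_dropout(h: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    """Drop each unit with probability p; rescale survivors by 1/(1 - p)."""
    if not training or p == 0.0:
        return h                                   # eval path: identity
    mask = (torch.rand_like(h) > p).float()        # 1 with probability 1 - p
    return h * mask / (1.0 - p)                    # keeps E[output] equal to h
```

Averaged over many masks, `inverted_dropout(h, 0.5, True)` recovers `h` in expectation, which is exactly why the eval path can be the identity.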
Forgetting model.eval() at inference means dropout stays on: your predictions will flicker randomly, and calibration will be broken.
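A cheap guard against this bug is to check that two forward passes on the same input agree before serving; a hypothetical helper, not part of any library:

```python
import torch

def assert_deterministic(model: torch.nn.Module, x: torch.Tensor) -> None:
    """Fail loudly if two passes on the same input disagree."""
    with torch.no_grad():
        y1, y2 = model(x), model(x)
    assert torch.allclose(y1, y2), "stochastic forward pass: forgot model.eval()?"
```

Run it on a dummy batch right after loading the model.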
Ensemble view. Each minibatch trains a different thinned network. Over training you're implicitly training 2^N networks that all share weights. At test time, no mask ⇒ you get an approximation to the geometric-mean ensemble for free.
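You can watch that equivalence numerically: average many masked passes and compare against the single unmasked pass. A sketch reusing `net` and `x` from above; the match is approximate because the rescaling is only unbiased before the nonlinearity:

```python
import torch

# Reuses `net` and `x` from the sketch above.
with torch.no_grad():
    net.train()                                       # masks on: sample thinned networks
    mc = torch.stack([net(x) for _ in range(1000)])   # 1000 ensemble members
    net.eval()                                        # mask off: one full-width pass
    print(mc.mean().item(), net(x).item())            # close, but not identical
```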
Co-adaptation view. Without dropout, unit j relies on unit k being alive to do its job. Dropout forces every unit to be useful on its own, distributing the representation rather than localizing it.
p = 0.1 after attention and FFN. Still the default.
p = 0.5 between hidden layers. The original 2012 regime.

Part of the ES 667 Deep Learning course · IIT Gandhinagar · Aug 2026.