Regularization in Deep Learning

Lecture 6 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

This lecture is comprehensive — Part 1 covers classical + data-centric regularization; Part 2 covers dropout and normalization. Intended to span two sessions.

Learning outcomes

By the end of this lecture you will be able to:

  1. Explain double descent and when it happens.
  2. Pick among L2 / early stopping / augmentation by regime.
  3. Apply Mixup / CutMix / label smoothing correctly.
  4. Implement dropout (including inverted scaling) from scratch.
  5. Choose among BN / LN / GN / RMSNorm by architecture.
  6. Explain pre-norm vs post-norm trade-offs for depth.

Recap · where we are

  • Architecture — ResNets, He init, ReLU (L2).
  • Optimizer — AdamW with warmup + cosine (L4–L5).
  • Recipe — debug ladder, error analysis (L3).

Today maps to UDL Ch 9 (Regularization) and the BatchNorm parts of Ch 11 (Residual Networks). ES 654 covered ridge/LASSO — we skim those and focus on what's new.

Plan for the two sessions

Session 1 · classical & data-centric

  1. Double descent — the modern bias-variance picture
  2. L2, L1, early stopping (brisk — prereq covered)
  3. Data augmentation
  4. Mixup & CutMix
  5. Label smoothing

Session 2 · architectural
  6. Dropout
  7. BatchNorm — mechanics, ICS debate
  8. LayerNorm — why sequences differ
  9. RMSNorm — the modern simplification
  10. Pre-norm vs post-norm placement

Two students · the regularization story

Student A · memorizes the 100 practice problems. Aces the practice test. Fails the real exam (different problems).

Student B · learns the method. Doesn't memorize · understands. Does fine on both.

Regularization is how we force our model to be Student B.

Without it · a deep network has more than enough capacity to memorize the training set perfectly while learning nothing transferable. With it · the model is encouraged to find patterns that hold beyond the training data.

What's new in DL regularization vs classical ML

You already know (ES 654)

  • L2 / ridge
  • L1 / LASSO
  • Cross-validation
  • Bias-variance tradeoff

We skim these.

New for DL

  • Double descent
  • Data augmentation (images, text)
  • Mixup / CutMix
  • Label smoothing
  • Early stopping as regularization
  • Dropout, BatchNorm, LayerNorm, RMSNorm

We spend time here.

SESSION 1 · PART 1

Double descent

Classical bias-variance, revisited

The classical textbook picture

From ES 654 you know the U-curve:

  • Too simple → underfit (high bias)
  • Too complex → overfit (high variance)
  • Sweet spot in the middle

For 50 years ML chose "the middle" via cross-validation. Done.

Q. But modern nets have far more parameters than training examples. We should be in catastrophic-overfit territory. Why aren't we?

Double descent · the 2019 surprise

What's actually happening

Three regimes:

  1. Classical underparameterized (params ≪ data): U-curve as expected.
  2. Interpolation threshold (params ≈ data): test error spikes.
  3. Modern overparameterized (params ≫ data): test error drops again.

More parameters past the interpolation threshold do not hurt. Implicit regularization from SGD + overparameterization finds flat, generalizing minima.

This is one of the big open questions in DL theory. Prince Ch 20 — "Why does deep learning work?"

Practical implication

You don't need to shrink a model to generalize. You can just go bigger and rely on:

  • Implicit regularization from SGD
  • Explicit regularization (today's topic)
  • Massive data (when available)

ResNet-50 has ~25M params; modern LLMs have on the order of 10¹¹. Both generalize fine because they are past the interpolation threshold.

SESSION 1 · PART 2

L2, L1, early stopping

Brisk — you know these from ES 654

L2 · the weight-leash analogy

Analogy. The data loss is your dog, trying to run toward an interesting smell (the optimum on the training data).

The L2 penalty is a leash pulling the dog back toward you (toward zero). The actual update is a compromise — the dog moves toward the smell, but the leash keeps it from running too far.

L2 · derive the gradient

Total loss = data loss + L2 penalty:

  L_total(w) = L_data(w) + (λ/2)‖w‖²

Differentiate term by term:

  ∇_w L_total = ∇_w L_data + λw

Plug into the SGD update rule:

  w ← w − η (∇_w L_data + λw)

The extra −ηλw term is the leash pull toward zero.

Worked numeric · L2 single-weight update

Take illustrative values: w = 2.0, ∂L_data/∂w = 0.5, λ = 0.1, η = 0.1.

Without L2

  w_new = w − η · ∂L_data/∂w = 2.0 − 0.1 × 0.5 = 1.95

With L2

  • λw contribution: 0.1 × 2.0 = 0.2
  • Full gradient: 0.5 + 0.2 = 0.7, so w_new = 2.0 − 0.1 × 0.7 = 1.93

The weight ends up slightly smaller — decayed toward zero. Repeat over thousands of steps · big weights shrink unless the data loss really wants them.

L1 vs L2 · the geometry

L2 = weight decay · regrouping the update

Take the SGD-with-L2 step:

  w ← w − η ∇_w L_data − ηλw

Group the two terms:

  w ← (1 − ηλ) w − η ∇_w L_data

The factor (1 − ηλ) is slightly less than 1. So at every step, the weight is first shrunk a little (decayed), then updated by the data gradient.

That's why "L2 regularization" is the same thing as "weight decay".

In PyTorch · AdamW(..., weight_decay=0.1) is the one line you need. (For why decoupling matters in adaptive optimizers, see L5.)
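A minimal sketch of the equivalence, reusing the illustrative numbers from the worked single-weight example (plain SGD only — AdamW decouples the decay, see L5):

# Two equivalent ways to write the same plain-SGD step
# (w, grad_data, lam, lr are the illustrative values from the worked example)
w, grad_data, lam, lr = 2.0, 0.5, 0.1, 0.1

w_penalty = w - lr * (grad_data + lam * w)        # L2 term added to the gradient
w_decay   = (1 - lr * lam) * w - lr * grad_data   # decay first, then the data step

print(w_penalty, w_decay)                         # both 1.93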

L1 · the sparsity-inducing sibling

Add an L1 penalty λ Σᵢ |wᵢ| to the loss → encourages many weights to be exactly 0.

             L1                     L2
Prior        Laplace                Gaussian
Solution     sparse (many zeros)    small (everything shrinks)
Use in DL    rare                   ubiquitous

DL rarely uses L1 — features are distributed across many weights, not localized in a few. L1's sparsity breaks distributed representations.

Early stopping as implicit regularization

Q. If val loss starts rising at epoch 30, why stop training?

Because continuing means:

  • More optimization steps → more capacity effectively used
  • Same function class, but walking further into the parameter landscape
  • Eventually you'll memorize training noise

Early stopping = an implicit form of capacity control. It's free and almost always helps. Every serious training script checkpoints on val loss.
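A minimal sketch of the patience logic most training scripts use (the val_losses list is a stand-in for real per-epoch validation results; patience = 2 is illustrative):

# Patience-based early stopping on validation loss
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58]   # stand-in values
patience, best, bad, best_epoch = 2, float("inf"), 0, -1
for epoch, val in enumerate(val_losses):
    if val < best:
        best, best_epoch, bad = val, epoch, 0             # checkpoint here in practice
    else:
        bad += 1
        if bad >= patience:
            print(f"stop at epoch {epoch}, keep checkpoint from epoch {best_epoch}")
            break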

Covered in Lecture 1 — the curve with the "best val" marker.

SESSION 1 · PART 3

Data augmentation

The single highest-value regularizer for vision

Why augmentation is powerful

Classical regularization constrains the model. Data augmentation constrains the data — by telling the model what invariances it should respect.

A flipped cat is still a cat. A color-jittered cat is still a cat. By training on the augmented versions, the model learns that these transformations should not change the prediction.

Standard vision augmentations

Which augmentations · which problem

Domain              Useful                                  Avoid
Natural images      flip · crop · color jitter · rotate     vertical flip (changes sky/ground)
Medical imaging     small rotations · mild intensity        flips (mirrors anatomy)
MNIST / digits      small rotations · elastic               flip (6 ↔ 9)
Satellite imagery   all rotations (no "up") · flip          color jitter (semantic)
Text                synonym · back-translation · mask       random char shuffle

Rule · augmentation must preserve the label. If not, it's noise, not signal.

Advanced · RandAugment, AutoAugment

Instead of hand-picking augmentations:

  • AutoAugment (2018) — learn the augmentation policy from data.
  • RandAugment (2020) — pick augmentations uniformly; pick magnitude uniformly. Two knobs.

In torchvision:

transforms.RandAugment(num_ops=2, magnitude=9)

RandAugment is the 2026 default for vision pre-training. Two-line change, consistent +1–3% accuracy.

SESSION 1 · PART 4

Mixup & CutMix

Augment the label, not just the input

The idea

Standard augmentation: one image → one transformed image, same label.

Mixup and CutMix go further: combine two images and interpolate their labels correspondingly.

Mixup and CutMix in one picture

Why Mixup and CutMix work

Three observations:

  1. Decision boundary smoothing. The model sees "half-cat half-dog" examples with mixed labels → boundary becomes smoother, not piecewise-constant.
  2. Implicit regularizer. Forces calibration — the output distribution has to reflect the interpolation.
  3. Free data. No manual annotation; the mixing happens inside the training loop.

Empirically: Mixup/CutMix adds ~1–2% CIFAR-10 accuracy. Essentially a free win for vision.

Mixup · the smoothie analogy

Analogy. Standard augmentation = slightly reshape a "cat" fruit. Mixup = put 70% cat + 30% dog into a blender → a smoothie that is neither fully cat nor fully dog.

Crucially, the label is also a smoothie: "70% cat, 30% dog". This forces the model to learn that predictions can live between classes → smoother decision boundary.

Mixup · the math, step by step

Mix two examples (x_i, y_i) and (x_j, y_j):

  1. Pick a mixing ratio λ ~ Beta(α, α), so λ ∈ [0, 1]. Say λ = 0.7.
  2. Mix inputs. x_mix = λ x_i + (1 − λ) x_j.
  3. Mix labels. y_mix = λ y_i + (1 − λ) y_j.
  4. Mix the loss equivalently:

     L = λ · ℓ(f(x_mix), y_i) + (1 − λ) · ℓ(f(x_mix), y_j)

Worked numeric · Mixup label

Cats = class 0, dogs = class 1. One-hot:

  • y_cat = [1, 0] (cat)
  • y_dog = [0, 1] (dog)

With λ = 0.7: y_mix = 0.7 · [1, 0] + 0.3 · [0, 1] = [0.7, 0.3].

The model now sees a faded cat overlaid with 30%-opacity dog, and must produce probabilities close to [0.7, 0.3] to minimize the loss. Calibrated outputs for free.
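A one-liner check of the mixed label above (λ = 0.7, as in the smoothie analogy):

import torch

lam = 0.7
y_cat, y_dog = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])   # one-hot labels
print(lam * y_cat + (1 - lam) * y_dog)                              # tensor([0.7000, 0.3000])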

Mixup in PyTorch · 10 lines

import numpy as np
import torch

def mixup_batch(x, y, alpha=0.2):
    lam = np.random.beta(alpha, alpha)      # mixing ratio from Beta(alpha, alpha)
    idx = torch.randperm(x.size(0))         # random partner for each example
    x_mix = lam * x + (1 - lam) * x[idx]
    y_a, y_b = y, y[idx]
    return x_mix, y_a, y_b, lam

# in the training step (model, criterion, x, y as in your usual loop)
x_mix, y_a, y_b, lam = mixup_batch(x, y)
logits = model(x_mix)
loss = lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b)

SESSION 1 · PART 5

Label smoothing

Because "1.0 for the right class" is a lie

Hard vs soft targets

Hard target vs soft target · in bars

Why soften the labels?

Analogy · the humble professor. A bad professor: "The answer is A. Memorize it." → encourages overconfidence.
A good professor: "It's very likely A — but reserve some confidence for being wrong." → calibrated thinking.

Label smoothing is the good professor for your neural network.

Label smoothing · derive the formula

Goal · take a tiny fraction of confidence away from the correct class and spread it evenly across all classes.

  1. Hard one-hot label for class 3 (take K = 4 classes as an illustration):
     y_hard = [0, 0, 0, 1]
  2. Uniform label (complete uncertainty):
     u = [1/K, 1/K, 1/K, 1/K] = [0.25, 0.25, 0.25, 0.25]   (each entry 1/K)
  3. Mix them with weight ε (take ε = 0.1):
     y_LS = (1 − ε) · y_hard + ε · u
  4. Compute entry-by-entry:
     • correct class: (1 − ε) · 1 + ε/K = 0.9 + 0.025 = 0.925
     • every other class: 0 + ε/K = 0.025
     y_LS = [0.025, 0.025, 0.025, 0.925]

Sum is still 1. The compact form: y_i = (1 − ε) · 1[i = correct class] + ε/K.
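A minimal sketch checking the derivation against PyTorch's built-in flag (K = 4, class 3, ε = 0.1 are the illustrative values from above):

import torch
import torch.nn.functional as F

K, k, eps = 4, 3, 0.1
y_hard = F.one_hot(torch.tensor(k), num_classes=K).float()
y_ls = (1 - eps) * y_hard + eps / K
print(y_ls)                                   # tensor([0.0250, 0.0250, 0.0250, 0.9250])

logits = torch.randn(1, K)
loss_flag = F.cross_entropy(logits, torch.tensor([k]), label_smoothing=eps)
loss_manual = -(y_ls * F.log_softmax(logits, dim=-1)).sum()
print(loss_flag.item(), loss_manual.item())   # identical up to float error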

Why label smoothing helps

  1. Prevents overconfidence. Hard labels push the logit gap toward ∞ → miscalibrated model.
  2. Regularizes the output layer. Softer target → smaller logit magnitudes.
  3. Label noise robustness. Some labels are wrong anyway — smoothing acknowledges that.

One flag in PyTorch: CrossEntropyLoss(label_smoothing=0.1).

SESSION 2

Architectural regularization

Dropout + Normalization

Session 2 · what we will cover

Two architectural regularizers that ship inside the network, plus the question of where to put them:

  1. Dropout — randomly silence neurons during training.
  2. Normalization (BatchNorm / LayerNorm / RMSNorm) — re-center and re-scale activations.
  3. Placement — pre-norm vs post-norm.

All 2026-relevant. All in every modern architecture.

SESSION 2 · PART 6

Dropout

An implicit ensemble, one line of code

The idea (Hinton 2012)

Every forward pass during training:

  1. Sample a random binary mask m_i ~ Bernoulli(p) for each hidden unit i (p = keep probability).
  2. Multiply: h_i ← m_i · h_i.
  3. Scale surviving units by 1/p to keep the expected activation the same.

At eval time, turn dropout off — use all units.

Dropout in one picture

▶ Interactive: slide p and watch a small network flicker; toggle train/eval mode — dropout-playground.

Two intuitions for why it helps

Ensemble view

Each mini-batch trains a different sub-network (a subset of units).

Over many batches you are implicitly training an ensemble of thinned networks, all sharing weights.

At test time, no mask → like averaging the ensemble.

Co-adaptation view

Without dropout, units co-adapt — neuron i relies on neuron j being alive to do its job.

Dropout forces every unit to be useful on its own → more distributed representation.

Dropout · different masks per pass

Inverted dropout · the part-time-team analogy

Analogy. A 4-person construction crew. Some days, randomly, only 2 show up.

  • Bad: they each work normally → only half a wall built. Manager learns the wrong baseline.
  • Good: with keep-prob p = 0.5, the 2 workers who show up each work twice as hard (a 1/p scale-up) → wall finished. Manager's expectation stays correct.

On the final project day (test time), all 4 show up — no scaling needed. Their normal pace = correct expectation.

Inverted dropout · why divide by

For a single neuron with activation a and keep-prob p:

Test time: dropout off → output is just a. (target expectation)

Training time: output is random:

  • with probability p: a/p — the scaled active output
  • with probability 1 − p: dropped, output 0

Expected output during training:

  E[output] = p · (a/p) + (1 − p) · 0 = a

✓ same as test time. The rest of the network sees the same expected activations in both modes — training and eval behave consistently. This is inverted dropout.

Dropout · worked numeric example

Suppose hidden activations · h = [2.0, 1.5, 0.5, 3.0] and we use p = 0.5 keep-prob.

Train pass. Sample mask m = [1, 0, 1, 0] (Bernoulli p = 0.5).

  h_drop = (m ⊙ h) / p = [2.0/0.5, 0, 0.5/0.5, 0] = [4.0, 0, 1.0, 0]

(the kept units are amplified to compensate)

Expected value · E[h_drop] = p · (h / p) = h = [2.0, 1.5, 0.5, 3.0]
↑ same as the no-dropout output.

Eval pass. Output = h = [2.0, 1.5, 0.5, 3.0]. No mask, no scaling.

The training-time scaling-up by 1/p is what lets us drop the mask at eval time without changing the network's expected output.
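A from-scratch sketch of inverted dropout in the keep-prob convention used above (the Monte-Carlo average is just a sanity check of the expectation):

import torch

def inverted_dropout(h, p=0.5, training=True):
    # p is the KEEP probability here; nn.Dropout(p) uses p as the DROP probability
    if not training:
        return h                               # eval: no mask, no scaling
    mask = (torch.rand_like(h) < p).float()    # Bernoulli(p) keep-mask
    return mask * h / p                        # scale survivors by 1/p

h = torch.tensor([2.0, 1.5, 0.5, 3.0])
torch.manual_seed(0)
print(inverted_dropout(h))                                                  # one train pass
print(torch.stack([inverted_dropout(h) for _ in range(10_000)]).mean(0))    # ≈ h
print(inverted_dropout(h, training=False))                                  # eval: h unchanged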

Dropout · the basketball-team analogy

Imagine training a basketball team where, in any given practice drill, some players randomly sit out. No one can rely too much on the star player · she might not be there.

Result · everyone becomes more versatile. The team performs more reliably with any subset on the court.

That's what dropout does to neurons · it prevents them from co-adapting (relying too heavily on a few specific neighbors). Each neuron has to become individually useful.

Dropout in PyTorch

self.drop = nn.Dropout(p=0.1)     # typical: 0.1 for Transformers
                                   #          0.5 for MLP hidden layers
                                   #          0.0 for CNNs (usually)

def forward(self, x):
    h = F.relu(self.fc1(x))
    h = self.drop(h)               # apply after the activation
    h = self.fc2(h)
    return h

model.train() and model.eval() toggle it automatically. Forgetting the mode switch is a classic bug.

Convention mismatch warning · in lecture math, p often denotes the keep probability (Bernoulli keep-mask). PyTorch's nn.Dropout(p) uses p as the drop probability. Keep-prob 0.8 ↔ nn.Dropout(p=0.2). Always double-check.

Where dropout lives in 2026

Architecture           Typical use
Large MLPs             p = 0.5 between hidden layers
CNNs                   usually absent (BN + aug is enough)
RNNs / LSTMs           variational dropout (same mask across all timesteps)
Transformers           p = 0.1 after attention and FFN
Fine-tuning an LLM     p = 0.0 or very small — data is scarce

Dropout was the biggest regularization breakthrough of 2012. Today it is overshadowed by BN + LN + augmentation for many tasks, but still in every Transformer.

SESSION 2 · PART 7

Normalization

Same family, three flavours, one knob at a time

Hiker in a canyon · why normalization matters

Imagine a hiker descending a long, narrow, steep-sided canyon. They bounce side-to-side, making slow progress along the canyon's length.

A round bowl is much easier · the hiker walks straight to the bottom.

Normalization reshapes the loss landscape from a canyon into a bowl · same minimum, much easier optimizer trajectory.

Concretely · BN/LN keep activations centered and unit-scale, which means the loss's curvature in different directions is roughly equal. The optimizer takes confident, direct steps.

Why normalize at all?

Two problems that normalization fixes:

  1. Scale drift across layers. Activations grow or shrink with depth. He init addresses this at init; normalization does it at every step.
  2. Internal covariate shift (original claim). Distribution of layer inputs changes during training. This explanation turned out to be partly wrong — see next slide.

BN · LN · RMSNorm · the axes

BatchNorm · train vs eval modes

BatchNorm · standardizing exam scores

Analogy. Student A's homework scores live on a big scale (say 0–100); Student B's on a 1–10 scale. The next layer sees raw scores and is confused by the scale gap.

BatchNorm is a fair grading TA:

  1. Centre — subtract each student's mean, so both sets of scores sit around 0.
  2. Rescale — divide by each student's std. Now both have unit scale.
  3. Re-learn the right scale. Maybe centre 0 / scale 1 isn't ideal for the next layer. BN adds learnable γ and β so the network can pick the best post-norm scale.

BatchNorm · worked numeric example

Mini-batch of 4 activations from one neuron (illustrative values): x = [2.0, 4.0, 6.0, 8.0].

Step 1 · mean. μ = (2 + 4 + 6 + 8) / 4 = 5.0

Step 2 · variance.

  • Deviations: [−3, −1, 1, 3]
  • Squared: [9, 1, 1, 9]
  • Mean of squares: σ² = 5.0

Step 3 · normalize. x̂ = (x − μ) / √(σ² + ε) ≈ [−1.34, −0.45, 0.45, 1.34]   (ε is a tiny constant for numerical stability)

Step 4 · scale + shift with learned γ, β (γ = 1, β = 0 at init):

  y = γ · x̂ + β ≈ [−1.34, −0.45, 0.45, 1.34]

This vector is what the next layer sees. At eval time · use the running mean/var collected during training, not batch stats. (model.eval() flips this switch.)
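The same computation, checked against nn.BatchNorm1d (x holds the illustrative values from the worked example):

import torch
import torch.nn as nn

x = torch.tensor([[2.0], [4.0], [6.0], [8.0]])   # batch of 4, one feature

mu = x.mean(dim=0)                               # 5.0
var = x.var(dim=0, unbiased=False)               # 5.0 — BN uses the biased variance
x_hat = (x - mu) / torch.sqrt(var + 1e-5)
print(x_hat.squeeze())                           # ≈ [-1.342, -0.447, 0.447, 1.342]

bn = nn.BatchNorm1d(1)                           # gamma = 1, beta = 0 at init
print(bn(x).squeeze())                           # matches the manual result (train mode)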

The ICS debate

Ioffe & Szegedy 2015 · BN helps by reducing internal covariate shift (ICS) — the changing distribution of layer inputs during training.

Santurkar et al. 2018 — showed ICS was largely a red herring:

BN's real benefit is that it smooths the loss landscape — makes gradients more predictable, enabling larger learning rates.

You don't need to remember this. You do need to remember: BN works, but the reason it works is more subtle than the original paper claimed.

BatchNorm in PyTorch

# For 2D / FC: nn.BatchNorm1d
# For 4D / conv: nn.BatchNorm2d

layer = nn.Sequential(
    nn.Conv2d(64, 128, 3, padding=1),
    nn.BatchNorm2d(128),          # ← after the conv, before ReLU
    nn.ReLU(),
)
  • Two learnable parameters per channel · γ (scale), β (shift)
  • Two buffers per channel · running_mean, running_var
  • Initialized to identity · γ = 1, β = 0

When BatchNorm fails

Small batch sizes — stats are noisy, BN hurts more than it helps. batch_size < 32 → prefer GroupNorm or LayerNorm.

Sequence models — variable-length sequences have inconsistent statistics along the batch axis.

Online / streaming — can't collect meaningful batch stats.

Distributed training — each replica computes its own batch stats unless you use SyncBN.

These are exactly the situations that birthed LayerNorm.

LayerNorm · fix for sequences

Normalize across the feature dimension instead of the batch dimension.

No dependence on batch size or other samples.

norm = nn.LayerNorm(d_model)    # applied at every Transformer block

Every modern Transformer (BERT, GPT, Llama, Claude) uses LayerNorm or its cheaper cousin RMSNorm — not BatchNorm.
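A minimal sketch of what nn.LayerNorm computes — statistics are per token, over the feature dimension, so batch size and sequence length don't matter (shapes are illustrative):

import torch
import torch.nn as nn

d_model = 4
x = torch.randn(2, 3, d_model)                   # (batch, seq, features)

mu = x.mean(dim=-1, keepdim=True)                # one mean per token
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + 1e-5)

ln = nn.LayerNorm(d_model)                       # gamma = 1, beta = 0 at init
print(torch.allclose(ln(x), x_hat, atol=1e-5))   # True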

RMSNorm · the cheap modern cousin

Drop the mean subtraction — keep only the scale:

  RMS(x) = √( (1/d) Σᵢ xᵢ² + ε ),    yᵢ = (xᵢ / RMS(x)) · γᵢ

  • ~15–30% cheaper than LayerNorm.
  • Empirically no worse at convergence.
  • Used in Llama, Mistral, PaLM, and most open LLMs from 2023 onward.
# PyTorch 2.4+
norm = nn.RMSNorm(d_model)
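And by hand, for comparison with the LayerNorm sketch (γ omitted since it is all-ones at init; eps set explicitly to match):

import torch
import torch.nn as nn

def rms_norm(x, eps=1e-6):
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms                                # learnable gamma omitted

x = torch.randn(2, 4)
print(torch.allclose(nn.RMSNorm(4, eps=1e-6)(x), rms_norm(x), atol=1e-5))   # True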

SESSION 2 · PART 8

Where to put the norm

Pre-norm vs post-norm

The placement matters

Why pre-norm won · the highway analogy

Analogy · highway and side road. The residual connection is a multi-lane highway that lets the gradient flow easily from the end of the network to the beginning. Sub-layers (attention, MLP) are winding side roads.

  • Post-norm = put a toll booth (LayerNorm) on the highway after the side road merges back in. All traffic — highway and side road — must pass through it. Bottleneck.
  • Pre-norm = move the toll booth to the entrance of the side road. The highway flows freely. Only side-road traffic gets normalized.

Why pre-norm won · the gradient path

Residual update: x_{l+1} = x_l + F(x_l). The identity term in ∂x_{l+1}/∂x_l = I + ∂F/∂x_l is the gradient highway.

Post-norm. Forward: x_{l+1} = LN(x_l + F(x_l)).

  • Backward to x_l goes through LN.
  • LN's gradient is a complicated, scale-dependent term in the input statistics.
  • Gradient on the skip is modulated → can vanish or explode at depth → needs aggressive warmup.

Pre-norm. Forward: x_{l+1} = x_l + F(LN(x_l)).

  • ∂x_{l+1}/∂x_l has a direct identity term from the skip.
  • LN sits on the side branch; its complicated gradient affects only the sub-layer.
  • Highway is clean → trains stably at depth.

Pre-norm is the default for every modern Transformer (GPT-2 onwards). If you're building a new Transformer in 2026, use pre-norm.
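A minimal sketch of the two placements for a generic sub-layer F (the FFN here is an illustrative stand-in for an attention or MLP block):

import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))    # LN on the side road; highway untouched

class PostNormBlock(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        return self.norm(x + self.sublayer(x))    # LN sits on the highway itself

d_model = 64
ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
x = torch.randn(2, 5, d_model)
print(PreNormBlock(d_model, ffn)(x).shape, PostNormBlock(d_model, ffn)(x).shape)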

Decision table · which norm · which model

Architecture                  Normalization               Why
CNN for images                BatchNorm                   large batches, fixed shapes
Small-batch CNN (< 32)        GroupNorm                   batch-independent
Transformer                   LayerNorm (pre-norm)        batch/seq-length independent
Modern LLM (Llama, Mistral)   RMSNorm (pre-norm)          cheaper, no loss of quality
RNN / LSTM                    LayerNorm across features   same reason as Transformer

Decision rule in two sentences

Fixed-size dense data with big batches → BatchNorm.

Anything else → LayerNorm (or RMSNorm if you want cheap).

Putting it together

The full regularization stack for 2026

The stack for a real vision training run

# 1. Architecture regularization
model = ResNet50(dropout=0.0)      # BN already in ResNet

# 2. Optimizer regularization (from L5)
opt = AdamW(model.parameters(), weight_decay=0.05)

# 3. Data augmentation
train_tfm = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),
    transforms.Normalize(MEAN, STD),
])

# 4. Mixup inside the training loop (conditional)
# 5. Label smoothing in the loss
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# 6. Early stopping — checkpoint on val loss

Lecture 6 — summary

Session 1 · classical + data-centric

  • Double descent — more params past the threshold can help; generalization in the overparameterized regime is the modern story.
  • L2 / L1 / early stopping — you know these. Use weight_decay in AdamW; checkpoint on val.
  • Data augmentation — single highest-value regularizer for vision. Must preserve labels.
  • Mixup / CutMix — interpolate inputs and labels. Free ~1–2% accuracy.
  • Label smoothing — softens hard targets; better calibration. One flag.

Session 2 · architectural

  • Dropout — implicit ensemble; inverted rescaling. p = 0.1 for Transformers, 0.5 for MLPs.
  • BN · LN · RMSNorm — same family, three axes. BN for CNNs, LN for Transformers, RMSNorm for LLMs.
  • Pre-norm > post-norm for deep Transformers.

Lecture 6 — what's next

Read before Lecture 7

Prince — Ch 10 · Convolutional networks.

Next lecture

CNN deep dive — brisk on convolution mechanics (prereq covered LeNet); deep on receptive fields, modern architectures, and inductive-bias framing.

Notebook 6a · 06a-regularization-stack.ipynb — sweep weight decay, RandAugment, label smoothing on CIFAR-10.
Notebook 6b · 06b-batchnorm-by-hand.ipynb — implement BatchNorm1d forward + backward manually (Karpathy-style); verify against nn.BatchNorm1d.