Three regimes: the classical underparameterized regime (the usual bias-variance trade-off applies), the interpolation threshold (the model barely fits the training data and test error peaks), and the overparameterized regime beyond it, where test error falls again (double descent).
More parameters past the threshold does not hurt. Implicit regularization from SGD + overparameterization finds flat, generalizing minima.
This is one of the big open questions in DL theory. Prince Ch 20 — "Why does deep learning work?"
You don't need to shrink a model to generalize. You can just go bigger and rely on implicit regularization (plus the explicit techniques in the rest of this lecture).
ResNet-50 has ~25M parameters; modern LLMs have ~10¹¹. Both generalize fine because they are well past the interpolation threshold.
Brisk — you know these from ES 654
Analogy. The data loss is your dog, trying to run toward an interesting smell (the optimum on the training data).
The L2 penalty is a leash pulling the dog back toward you (toward zero). The actual update is a compromise — the dog moves toward the smell, but the leash keeps it from running too far.
Total loss = data loss + L2 penalty:

$$L_{\text{total}}(w) = L_{\text{data}}(w) + \frac{\lambda}{2}\lVert w\rVert^2$$

Differentiate term by term:

$$\nabla_w L_{\text{total}} = \nabla_w L_{\text{data}} + \lambda w$$

Plug into the SGD update rule:

$$w \leftarrow w - \eta\left(\nabla_w L_{\text{data}} + \lambda w\right)$$

The extra term $-\eta\lambda w$ nudges every weight toward zero on each step.
The weight ends up slightly smaller — decayed toward zero. Repeat over thousands of steps · big weights shrink unless the data loss really wants them.
Take the SGD-with-L2 step:

$$w \leftarrow w - \eta\,\nabla_w L_{\text{data}} - \eta\lambda w$$

Group the two terms that contain $w$:

$$w \leftarrow (1 - \eta\lambda)\,w - \eta\,\nabla_w L_{\text{data}}$$

The factor $(1 - \eta\lambda) < 1$ multiplies the weight on every step: the weight literally decays.
That's why "L2 regularization" is the same thing as "weight decay".
In PyTorch · AdamW(..., weight_decay=0.1) is the one line you need. (For why decoupling matters in adaptive optimizers, see L5.)
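To see the "same thing" claim concretely, here is a minimal sketch, assuming plain SGD and a toy linear model, showing that an in-loss L2 penalty and the optimizer's weight_decay flag produce identical updates. (For Adam they do not, which is exactly why AdamW decouples the decay.)

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x, y = torch.randn(16, 4), torch.randn(16, 1)
lam, lr = 0.1, 0.01

a = nn.Linear(4, 1)
b = copy.deepcopy(a)                         # identical starting weights

# A: L2 penalty written into the loss, plain SGD
opt_a = torch.optim.SGD(a.parameters(), lr=lr)
loss_a = F.mse_loss(a(x), y) + lam / 2 * sum((p ** 2).sum() for p in a.parameters())
opt_a.zero_grad()
loss_a.backward()
opt_a.step()

# B: same data loss, penalty handed to the optimizer as weight_decay
opt_b = torch.optim.SGD(b.parameters(), lr=lr, weight_decay=lam)
loss_b = F.mse_loss(b(x), y)
opt_b.zero_grad()
loss_b.backward()
opt_b.step()

print(torch.allclose(a.weight, b.weight, atol=1e-6))   # True: same update for plain SGD
```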
Add an L1 penalty $\lambda\lVert w\rVert_1$ instead and compare:
| | L1 | L2 |
|---|---|---|
| Prior | Laplace | Gaussian |
| Solution | sparse (many zeros) | small (everything shrinks) |
| Use in DL | rare | ubiquitous |
DL rarely uses L1 — features are distributed across many weights, not localized in a few. L1's sparsity breaks distributed representations.
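For the rare case where you do want L1 in PyTorch: there is no optimizer flag for it, so the usual pattern is to add the penalty to the loss by hand. A sketch with an illustrative coefficient:

```python
import torch.nn.functional as F

lam_l1 = 1e-4   # illustrative value, not tuned

def loss_with_l1(model, logits, targets):
    data_loss = F.cross_entropy(logits, targets)
    # Sum of absolute values over all parameters
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    # Pushes weights toward zero; exact zeros usually need a proximal / thresholding step
    return data_loss + lam_l1 * l1_penalty
```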
Q. If val loss starts rising at epoch 30, why stop training?
Because continuing means:
Early stopping = an implicit form of capacity control. It's free and almost always helps. Every serious training script checkpoints on val loss.
Covered in Lecture 1 — the curve with the "best val" marker.
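A minimal sketch of the standard loop, assuming hypothetical train_one_epoch and evaluate helpers; the two key moves are checkpointing on the best val loss and stopping after patience epochs without improvement.

```python
import copy
import math

def train_with_early_stopping(model, train_one_epoch, evaluate, max_epochs=100, patience=5):
    # train_one_epoch(model): runs one epoch of SGD; evaluate(model): returns val loss
    best_val, best_state, bad_epochs = math.inf, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = evaluate(model)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())   # checkpoint on val loss
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                        # val loss has risen for `patience` epochs
                break
    model.load_state_dict(best_state)                         # restore the best-val weights
    return model
```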
The single highest-value regularizer for vision
Classical regularization constrains the model. Data augmentation constrains the data — by telling the model what invariances it should respect.
A flipped cat is still a cat. A color-jittered cat is still a cat. By training on the augmented versions, the model learns that these transformations should not change the prediction.
| Domain | Useful | Avoid |
|---|---|---|
| Natural images | flip · crop · color jitter · rotate | vertical flip (changes sky/ground) |
| Medical imaging | small rotations · mild intensity | flips (mirrors anatomy) |
| MNIST / digits | small rotations · elastic | flip (6 ↔ 9) |
| Satellite imagery | all rotations (no "up") · flip | color jitter (semantic) |
| Text | synonym · back-translation · mask | random char shuffle |
Rule · augmentation must preserve the label. If not, it's noise, not signal.
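Two label-preserving pipelines from the table, sketched with torchvision; the exact magnitudes are illustrative, not tuned.

```python
from torchvision import transforms

# Natural images: flips, crops, and color jitter preserve the label
natural_tfm = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Medical imaging: small rotations and mild intensity changes only, no flips
medical_tfm = transforms.Compose([
    transforms.RandomRotation(degrees=5),
    transforms.ColorJitter(brightness=0.1),
    transforms.ToTensor(),
])
```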
Instead of hand-picking augmentations:
In torchvision:
transforms.RandAugment(num_ops=2, magnitude=9)
RandAugment is the 2026 default for vision pre-training. Two-line change, consistent +1–3% accuracy.
Augment the label, not just the input
Standard augmentation: one image → one transformed image, same label.
Mixup and CutMix go further: combine two images and interpolate their labels correspondingly.
Three observations:
Empirically: Mixup/CutMix adds ~1–2% CIFAR-10 accuracy. Essentially a free win for vision.
Analogy. Standard augmentation = slightly reshape a "cat" fruit. Mixup = put 70% cat + 30% dog into a blender → a smoothie that is neither fully cat nor fully dog.
Crucially, the label is also a smoothie: "70% cat, 30% dog". This forces the model to learn that predictions can live between classes → smoother decision boundary.
Mix two examples $(x_{\text{cat}}, y_{\text{cat}})$ and $(x_{\text{dog}}, y_{\text{dog}})$ with mixing weight $\lambda = 0.7$:

$$\tilde{x} = 0.7\,x_{\text{cat}} + 0.3\,x_{\text{dog}}, \qquad \tilde{y} = 0.7\,y_{\text{cat}} + 0.3\,y_{\text{dog}}$$

Cats = class 0, dogs = class 1. One-hot: $y_{\text{cat}} = [1, 0]$, $y_{\text{dog}} = [0, 1]$, so $\tilde{y} = [0.7, 0.3]$.

The model now sees a faded cat overlaid with a 30%-opacity dog, and must produce the soft target $[0.7, 0.3]$.
import numpy as np
import torch

def mixup_batch(x, y, alpha=0.2):
    # Sample a mixing coefficient and a random pairing within the batch
    lam = np.random.beta(alpha, alpha)
    idx = torch.randperm(x.size(0), device=x.device)
    x_mix = lam * x + (1 - lam) * x[idx]     # blend the inputs
    y_a, y_b = y, y[idx]                     # keep both labels; blend them in the loss
    return x_mix, y_a, y_b, lam

# in the training step
x_mix, y_a, y_b, lam = mixup_batch(x, y)
logits = model(x_mix)
loss = lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b)
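CutMix, mentioned above, follows the same label-mixing recipe but pastes a rectangular patch instead of blending pixels. A rough sketch of the bounding-box recipe (helper name is mine); it is used with the same loss combination as mixup_batch.

```python
import numpy as np
import torch

def cutmix_batch(x, y, alpha=1.0):
    # Paste a random rectangle from a shuffled copy of the batch;
    # the label weight is the fraction of the original image that survives.
    lam = np.random.beta(alpha, alpha)
    idx = torch.randperm(x.size(0), device=x.device)
    H, W = x.shape[-2:]
    cut_h, cut_w = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
    cy, cx = np.random.randint(H), np.random.randint(W)
    t, b = np.clip(cy - cut_h // 2, 0, H), np.clip(cy + cut_h // 2, 0, H)
    l, r = np.clip(cx - cut_w // 2, 0, W), np.clip(cx + cut_w // 2, 0, W)
    x_mix = x.clone()
    x_mix[:, :, t:b, l:r] = x[idx, :, t:b, l:r]
    lam_adj = 1 - ((b - t) * (r - l)) / (H * W)     # actual surviving area
    return x_mix, y, y[idx], lam_adj
```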
Because "1.0 for the right class" is a lie
Analogy · the humble professor. A bad professor: "The answer is A. Memorize it." → encourages overconfidence.
A good professor: "It's very likely A — but reserve some confidence for being wrong." → calibrated thinking.
Label smoothing is the good professor for your neural network.
Goal · take a tiny fraction $\epsilon$ (say 0.1) of the probability mass away from the correct class and spread it uniformly over all $K$ classes.

Sum is still 1. The compact form:

$$y_{\text{smooth}} = (1 - \epsilon)\,y_{\text{one-hot}} + \frac{\epsilon}{K}$$

so the correct class gets $1 - \epsilon + \epsilon/K$ and every other class gets $\epsilon/K$.
One flag in PyTorch: CrossEntropyLoss(label_smoothing=0.1).
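A quick sanity check, sketched with hypothetical shapes: build the smoothed targets by hand from the formula above and compare against the PyTorch flag; the two losses should agree.

```python
import torch
import torch.nn.functional as F

K, eps = 10, 0.1
logits = torch.randn(4, K)
targets = torch.randint(0, K, (4,))

# Flag version
loss_flag = F.cross_entropy(logits, targets, label_smoothing=eps)

# Manual version: (1 - eps) * one-hot + eps / K, then soft-target cross-entropy
smooth = (1 - eps) * F.one_hot(targets, K).float() + eps / K
loss_manual = -(smooth * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

print(torch.allclose(loss_flag, loss_manual, atol=1e-6))   # True (up to float tolerance)
```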
Dropout + Normalization
Two architectural regularizers that ship inside the network:
All 2026-relevant. All in every modern architecture.
An implicit ensemble, one line of code
Every forward pass during training: sample a random mask, zero the masked units, and scale the surviving units up to compensate.
At eval time, turn dropout off — use all units.
Interactive: slide p and watch a small network flicker; toggle train/eval mode — dropout-playground.
Each mini-batch trains a different sub-network (a subset of units).
Over many batches you are implicitly training an ensemble of exponentially many ($2^n$ for $n$ droppable units) overlapping sub-networks, all sharing weights.
At test time, no mask → like averaging the ensemble.
Without dropout, units co-adapt — neuron $j$ learns to lean on neuron $k$'s specific quirks, and neither is useful alone.
Dropout forces every unit to be useful on its own → more distributed representation.
Analogy. A 4-person construction crew. Some days, randomly, only 2 show up — and those 2 work twice as hard that day (the train-time scaling).
On the final project day (test time), all 4 show up — no scaling needed. Their normal pace already matches the expected output.
For a single neuron with activation $a$ and keep probability $p$:

Test time: dropout off → output is just $a$.

Training time: output is $\dfrac{m}{p}\,a$ with mask $m \sim \text{Bernoulli}(p)$.

Expected output during training:

$$\mathbb{E}\!\left[\frac{m}{p}\,a\right] = \frac{p \cdot a + (1 - p)\cdot 0}{p} = a$$

✓ same as test time. The rest of the network sees the same expected activations in both modes — training and eval behave consistently. This is inverted dropout.
Suppose hidden activations · h = [2.0, 1.5, 0.5, 3.0] and we use p = 0.5 keep-prob.

Train pass. Sample mask m = [1, 0, 1, 0] (Bernoulli p=0.5). Output = h · m / p = [4.0, 0, 1.0, 0]

(the kept units are amplified to compensate)

Expected value · $\mathbb{E}[h_i\, m_i / p] = h_i$, i.e. [2.0, 1.5, 0.5, 3.0]

↑ same as the no-dropout output.

Eval pass. No mask, no scaling → output = [2.0, 1.5, 0.5, 3.0].

The training-time scaling-up by $1/p$ is exactly what makes the train-time expectation match the eval-time output.
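The same mechanics as a minimal sketch, using the lecture's keep-probability convention (hypothetical helper, not PyTorch's API):

```python
import torch

def inverted_dropout(h, keep_prob=0.5, training=True):
    # Zero each unit with probability 1 - keep_prob and scale survivors by 1 / keep_prob,
    # so the expected output matches the eval-time output.
    if not training:
        return h                                   # eval: no mask, no scaling
    mask = (torch.rand_like(h) < keep_prob).float()
    return h * mask / keep_prob

h = torch.tensor([2.0, 1.5, 0.5, 3.0])
print(inverted_dropout(h, keep_prob=0.5))          # e.g. [4.0, 0.0, 1.0, 0.0] for mask [1, 0, 1, 0]
```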
Imagine training a basketball team where, in any given practice drill, some players randomly sit out. No one can rely too much on the star player · she might not be there.
Result · everyone becomes more versatile. The team performs more reliably with any subset on the court.
That's what dropout does to neurons · it prevents them from co-adapting (relying too heavily on a few specific neighbors). Each neuron has to become individually useful.
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.drop = nn.Dropout(p=0.1)   # typical: 0.1 for Transformers, 0.5 for MLP hidden layers, 0.0 for CNNs (usually)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        h = F.relu(self.fc1(x))
        h = self.drop(h)                # apply after the activation
        h = self.fc2(h)
        return h
model.train() and model.eval() toggle it automatically. Forgetting the mode switch is a classic bug.
Convention mismatch warning · in the lecture math, $p$ is the keep probability; nn.Dropout(p) takes the drop probability. A keep-prob of 0.8 corresponds to nn.Dropout(p=0.2). Always double-check.
| Architecture | Typical use |
|---|---|
| Large MLPs | p = 0.5 between hidden layers |
| CNNs | usually absent (BN + aug is enough) |
| RNNs / LSTMs | variational dropout (same mask per timestep) |
| Transformers | p = 0.1 after attention and FFN |
| Fine-tuning an LLM | p = 0.0 or very small — data is scarce |
Dropout was the biggest regularization breakthrough of 2012. Today it is overshadowed by BN + LN + augmentation for many tasks, but still in every Transformer.
Same family, three flavours, one knob at a time
Imagine a hiker descending a long, narrow, steep-sided canyon. They bounce side-to-side, making slow progress along the canyon's length.
A round bowl is much easier · the hiker walks straight to the bottom.
Normalization reshapes the loss landscape from a canyon into a bowl · same minimum, much easier optimizer trajectory.
Concretely · BN/LN keep activations centered and unit-scale, which means the loss's curvature in different directions is roughly equal. The optimizer takes confident, direct steps.
Two problems that normalization fixes: layer inputs whose scale and mean drift as earlier layers update, and a badly conditioned loss surface that forces tiny learning rates.
Analogy. Student A's homework: scored on a different scale every week, so the raw numbers are hard to compare.
BatchNorm is a fair grading TA: subtract the class average, divide by the spread, so every batch of scores lands on a comparable scale before the next layer sees it.
Mini-batch of 4 activations from one neuron: $x_1, x_2, x_3, x_4$.

Step 1 · mean. $\mu_B = \frac{1}{4}\sum_i x_i$

Step 2 · variance. $\sigma_B^2 = \frac{1}{4}\sum_i (x_i - \mu_B)^2$

Step 3 · normalize. $\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$ (the small $\epsilon$ guards against division by zero)

Step 4 · scale + shift with learned $\gamma, \beta$: $y_i = \gamma\,\hat{x}_i + \beta$
This vector is what the next layer sees. At eval time · use the running mean/var collected during training, not batch stats. (model.eval() flips this switch.)
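A quick check that nn.BatchNorm1d in training mode reproduces steps 1–3 (a sketch; at initialization $\gamma = 1$, $\beta = 0$, so step 4 is the identity):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 1)                          # mini-batch of 4 activations from one neuron

bn = nn.BatchNorm1d(1)                         # gamma=1, beta=0 at init
bn.train()                                     # use batch stats, not running stats
y = bn(x)

mu = x.mean(dim=0)                             # step 1: batch mean
var = x.var(dim=0, unbiased=False)             # step 2: biased batch variance
x_hat = (x - mu) / torch.sqrt(var + bn.eps)    # step 3: normalize
print(torch.allclose(y, x_hat, atol=1e-6))     # True
```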
Ioffe & Szegedy 2015 · BN helps by reducing internal covariate shift (ICS) — the changing distribution of layer inputs during training.
Santurkar et al. 2018 — showed ICS was largely a red herring:
BN's real benefit is that it smooths the loss landscape — makes gradients more predictable, enabling larger learning rates.
You don't need to remember this. You do need to remember: BN works, but the reason it works is more subtle than the original paper claimed.
import torch.nn as nn

# For 2D / FC: nn.BatchNorm1d
# For 4D / conv: nn.BatchNorm2d
layer = nn.Sequential(
    nn.Conv2d(64, 128, 3, padding=1),
    nn.BatchNorm2d(128),   # ← after the conv, before ReLU
    nn.ReLU(),
)
BN also keeps running_mean and running_var buffers, updated during training and used at eval time.

Where BatchNorm struggles:

Small batch sizes — stats are noisy, BN hurts more than it helps. batch_size < 32 → prefer GroupNorm or LayerNorm.
Sequence models — variable-length sequences have inconsistent statistics along the batch axis.
Online / streaming — can't collect meaningful batch stats.
Distributed training — each replica computes its own batch stats unless you use SyncBN.
These are exactly the situations that birthed LayerNorm.
Normalize across the feature dimension instead of the batch dimension.
No dependence on batch size or other samples.
norm = nn.LayerNorm(d_model) # applied at every Transformer block
Every modern Transformer (BERT, GPT, Llama, Claude) uses LayerNorm or its cheaper cousin RMSNorm — not BatchNorm.
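A tiny sketch of the property that matters: each sample is normalized on its own, so changing one sample in the batch leaves every other sample's output untouched.

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(8)
x = torch.randn(4, 8)                     # (batch, features)

y = ln(x)
x2 = x.clone()
x2[3] = torch.randn(8)                    # change a *different* sample in the batch
y2 = ln(x2)

print(torch.allclose(y[0], y2[0]))        # True: sample 0's output is unaffected
```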
Drop the mean subtraction — keep only the scale:

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot \gamma$$
# PyTorch 2.4+
norm = nn.RMSNorm(d_model)
Pre-norm vs post-norm
Analogy · highway and side road. The residual connection is a multi-lane highway that lets the gradient flow easily from the end of the network to the beginning. Sub-layers (attention, MLP) are winding side roads.
Residual update: $x_{l+1} = x_l + \text{Sublayer}(x_l)$

Post-norm. Forward: $x_{l+1} = \text{LN}\big(x_l + \text{Sublayer}(x_l)\big)$. The norm sits on the highway; every gradient must pass through it.

Pre-norm. Forward: $x_{l+1} = x_l + \text{Sublayer}\big(\text{LN}(x_l)\big)$. The norm sits on the side road; the highway stays clean and gradients flow straight back to early layers.
Pre-norm is the default for every modern Transformer (GPT-2 onwards). If you're building a new Transformer in 2026, use pre-norm.
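A minimal sketch of the two orderings in one block (hypothetical module; real Transformer blocks add dropout, masking, and careful initialization):

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model, n_heads=8, pre_norm=True):
        super().__init__()
        self.pre_norm = pre_norm
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        if self.pre_norm:
            # Pre-norm: norm on the side road, the residual highway stays untouched
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.ffn(self.ln2(x))
        else:
            # Post-norm: norm sits on the highway, after each residual add
            x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.ln2(x + self.ffn(x))
        return x
```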
| Architecture | Normalization | Why |
|---|---|---|
| CNN for images | BatchNorm | large batches, fixed shapes |
| Small-batch CNN (< 32) | GroupNorm | batch-independent |
| Transformer | LayerNorm (pre-norm) | batch/seq-length independent |
| Modern LLM (Llama, Mistral) | RMSNorm (pre-norm) | cheaper, no loss of quality |
| RNN / LSTM | LayerNorm across features | same reason as Transformer |
Fixed-size dense data with big batches → BatchNorm.
Anything else → LayerNorm (or RMSNorm if you want cheap).
The full regularization stack for 2026
import torch.nn as nn
from torch.optim import AdamW
from torchvision import transforms

# 1. Architecture regularization
model = ResNet50(dropout=0.0)            # BN already in ResNet; ResNet50 = your model constructor

# 2. Optimizer regularization (from L5)
opt = AdamW(model.parameters(), weight_decay=0.05)

# 3. Data augmentation
train_tfm = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),
    transforms.Normalize(MEAN, STD),     # MEAN, STD = dataset channel statistics
])

# 4. Mixup inside the training loop (conditional)

# 5. Label smoothing in the loss
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# 6. Early stopping — checkpoint on val loss