torch.no_grad() for evaluation

```python
model.eval()
with torch.no_grad():
    for x, y in val_loader:
        pred = model(x)
        ...
```
Skips tape construction · faster · less memory.
.detach() to stop gradient flow

```python
target = model_old(x).detach()
loss = mse(model_new(x), target)
```
Treats the tensor as a constant in the graph.
CPU loads while the GPU computes
Before the model sees a batch, the batch should satisfy a contract.
| Item | Classification expectation | Common bug |
|---|---|---|
| x.shape | [B, C, H, W] for images or [B, d] for vectors | missing batch dimension |
| x.dtype | float32 / bfloat16 after transforms | raw uint8 pixels |
| x range | normalized, often roughly centered | values still in [0, 255] |
| y.shape | [B] for class indices | one-hot when loss expects indices |
| y.dtype | torch.long for CrossEntropyLoss | float labels |
```python
assert x.ndim == 4
assert x.dtype in (torch.float32, torch.bfloat16)
assert y.ndim == 1 and y.dtype == torch.long
```
Many "model bugs" are actually batch-contract bugs. Verify the batch before touching the architecture.
Dataset

```python
from PIL import Image
from torch.utils.data import Dataset

class TinyImageDataset(Dataset):
    def __init__(self, paths, labels, tfm=None):
        self.paths, self.labels, self.tfm = paths, labels, tfm

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        img = Image.open(self.paths[i]).convert('RGB')
        if self.tfm:
            img = self.tfm(img)
        return img, self.labels[i]
```
Two methods. That is the entire contract.
```python
from torch.utils.data import DataLoader

loader = DataLoader(dataset,
                    batch_size=64,
                    shuffle=True,
                    num_workers=4,
                    pin_memory=True,
                    persistent_workers=True,
                    drop_last=True)
```
Rule of thumb · num_workers ≈ 4 × num_GPUs. pin_memory=True when using CUDA. persistent_workers=True when epochs are short.
Data loading is successful when the GPU almost never waits for the CPU.
| Symptom | Likely cause | First fix |
|---|---|---|
| GPU utilization sawtooths | CPU/disk cannot feed batches | increase num_workers |
| first batch slow every epoch | worker restart overhead | persistent_workers=True |
| transfer to CUDA slow | pageable host memory | pin_memory=True |
| random crop dominates time | heavy CPU transforms | cache, simplify, or move to GPU |
```python
import time

t0 = time.perf_counter()
for i, batch in enumerate(loader):
    if i == 100:
        break
print("batches/sec:", 100 / (time.perf_counter() - t0))
```
Benchmark the loader alone. If data throughput is low, a bigger model or better optimizer will not fix the bottleneck.
Modern GPUs are highways for numbers. To go faster: build a bigger highway (new GPU) or make the cars smaller.
We want the speed of 16 bits without losing FP32's range.
Analogy · measuring with rulers. FP32 is a long ruler with fine tick marks. FP16 keeps the fine marks but shortens the ruler, so big values fall off the end (overflow). BF16 keeps FP32's full length but spaces the marks coarsely: same range, less precision.
For deep learning, range matters far more than ultra-fine precision. BF16 almost never overflows.
BF16 · trades precision for range. Same memory as FP16. Available on NVIDIA A100+, AMD MI200+, TPU v3+. The default for modern LLM training.
```python
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    output = model(input)              # most matmuls and convs run in BF16
    loss = criterion(output, target)
loss.backward()                        # backward outside autocast; param grads stay in the params' dtype
```
Your GPU handles batches of 64, but the model trains better with batch 256.
Analogy · polling a small town. You want the average opinion of 256 people, but only 64 fit in the room, so you poll four groups of 64 and combine the answers.
Gradient accumulation does exactly this with batches: it processes several micro-batches, sums their gradients, and updates the weights once at the end.
The key identity: gradient of a sum = sum of gradients.
Effective batch 256, GPU fits 64 → K = 4 micro-batches per update:

```python
K = 4  # 256 effective / 64 per micro-batch

for i, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / K   # divide so the K micro-batches average
    loss.backward()                     # ADDS to the existing grads in .grad
    if (i + 1) % K == 0:
        opt.step()                      # update once per K micro-batches
        opt.zero_grad()                 # then clear for the next big batch
```
Two crucial details: divide the loss by K, and call step() / zero_grad() only after K micro-batches.

Worked example (K = 2, one weight w):
- Micro-batch 1: loss.backward() → w.grad = -2. No step(), no zero_grad().
- Micro-batch 2: loss.backward() → w.grad = -2 + (-6) = -8.
- Update: opt.step() applies -8, then opt.zero_grad() resets w.grad = 0.

This is the exact update we'd get from one big batch containing both micro-batches: gradient of a sum = sum of gradients.
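That identity is easy to verify numerically. A self-contained sketch (the tensors and the toy loss are made up for illustration):

```python
import torch

w = torch.randn(5, requires_grad=True)
x = torch.randn(8, 5)

# One big batch of 8 examples.
loss_big = (x @ w).pow(2).mean()
loss_big.backward()
g_big = w.grad.clone()
w.grad = None

# Two micro-batches of 4, each loss divided by K = 2.
for chunk in x.chunk(2):
    ((chunk @ w).pow(2).mean() / 2).backward()   # accumulates into w.grad

print(torch.allclose(g_big, w.grad, atol=1e-6))  # True: identical gradient
```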
Some batches — especially in RNNs and Transformers — produce ridiculously large gradients. This is an exploding gradient.
Analogy · learning to drive. Your instructor gives small corrections: "turn 5°," "forward 10 cm." Then suddenly screams "TURN 10,000° LEFT!". Following that literally → smashed wall. Weights destroyed, restart training.
Gradient clipping is a safety rule: "No matter what gradient is computed, never let the step be larger than max_norm." It prevents one bad batch from wrecking the entire run.
A gradient is a vector — direction × magnitude.
After loss.backward(), gather all parameter gradients into one vector and compute its global L2 norm. If it exceeds max_norm (e.g. 1.0), rescale every gradient by max_norm / norm. Direction is preserved — only step size is capped.
```python
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```
Worked example: two weights whose gradients after .backward() are 3 and 4. The global norm is √(3² + 4²) = 5 > max_norm = 1.0, so both gradients are scaled by 1/5 → 0.6 and 0.8. The optimizer now uses the clipped gradient: same direction, capped length.
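The same arithmetic, checked in code. A sketch: the two scalar "weights" stand in for real parameters, and their gradients are set by hand instead of by backward():

```python
import torch

w1 = torch.zeros(1, requires_grad=True)
w2 = torch.zeros(1, requires_grad=True)
w1.grad = torch.tensor([3.0])
w2.grad = torch.tensor([4.0])

total_norm = torch.nn.utils.clip_grad_norm_([w1, w2], max_norm=1.0)
print(total_norm)        # tensor(5.) · the pre-clip global norm
print(w1.grad, w2.grad)  # tensor([0.6000]) tensor([0.8000])
```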
From zero to a trained model
Training optimizes a loss. Reporting uses a metric. They answer different questions.
| Setting | Optimized loss | Reported metric | Failure if confused |
|---|---|---|---|
| balanced classification | cross-entropy | accuracy | hides calibration |
| imbalanced classification | weighted CE / focal | F1, AUROC, AUPRC | high accuracy by predicting majority |
| regression | MSE / MAE / NLL | RMSE, MAE | loss punishes errors differently |
| ranking / retrieval | contrastive / pairwise | recall@k, MRR | good loss, poor top-k behavior |
Choose the metric before training. Otherwise you will optimize the convenient loss and later discover it does not match the real objective.
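Loss and metric side by side, to make the distinction concrete. A sketch with made-up logits and labels:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5], [0.2, 1.5], [3.0, -1.0]])
labels = torch.tensor([0, 1, 1])

loss = F.cross_entropy(logits, labels)                  # what training optimizes
acc = (logits.argmax(dim=1) == labels).float().mean()   # what you report
print(loss.item(), acc.item())                          # acc = 0.667: 3rd example is wrong
```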
```python
def train(model, loader_tr, loader_val, opt, loss_fn, n_epochs, device):
    best_val = float('inf')
    for ep in range(n_epochs):
        model.train()
        for x, y in loader_tr:
            x, y = x.to(device), y.to(device)
            loss = loss_fn(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
        val = evaluate(model, loader_val, loss_fn, device)
        print(f'epoch {ep:3d} val_loss={val:.4f}')
        if val < best_val:
            best_val = val
            torch.save(model.state_dict(), 'best.pt')
```
Every real script is a variation on this. Keep it boring.
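The loop calls an evaluate() helper it never defines. A minimal version (a sketch; the name and signature are assumed from the call above, and loss_fn is assumed to average over the batch, PyTorch's default reduction):

```python
def evaluate(model, loader, loss_fn, device):
    model.eval()                        # dropout off, BatchNorm uses running stats
    total, n = 0.0, 0
    with torch.no_grad():               # no autograd tape: faster, less memory
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            total += loss_fn(model(x), y).item() * y.size(0)
            n += y.size(0)
    return total / n
```

Note it needs both model.eval() and torch.no_grad(): the first changes layer behavior, the second skips the tape.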
```python
torch.save({
    'model': model.state_dict(),
    'optim': opt.state_dict(),
    'epoch': ep,
    'config': cfg,
}, 'ckpt.pt')

ckpt = torch.load('ckpt.pt',
                  map_location=device,
                  weights_only=True)
model.load_state_dict(ckpt['model'])
opt.load_state_dict(ckpt['optim'])
```
Don't torch.save(model) — it pickles the class, which breaks across refactors. Always save state_dict.
A validation score is useful only if the split matches the deployment question.
| Problem | What goes wrong | Better split |
|---|---|---|
| same patient in train and val | memorizes patient-specific artifacts | split by patient |
| adjacent video frames | train and val nearly duplicates | split by video / scene |
| time-series random split | future leaks into training | chronological split |
| user logs | same user behavior in both sets | split by user or time |
| repeated documents | near-duplicate text leakage | split by document/source |
Leakage makes the training recipe look correct while the model has learned the wrong thing. Check the split before celebrating a curve.
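A group-aware split is a few lines. A sketch using scikit-learn's GroupShuffleSplit; X, y, and patient_ids are assumed arrays of equal length:

```python
from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(X, y, groups=patient_ids))
# No patient_id appears in both train_idx and val_idx.
```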
Even without leakage, validation may differ from real deployment.
| Shift | Example | Practical response |
|---|---|---|
| covariate shift | new camera, hospital, device | collect representative val/test |
| label shift | class priors change | report per-class metrics |
| concept shift | definition of target changes | refresh labels and monitor drift |
| temporal shift | user behavior changes over time | time-based test set |
The question is not "did validation improve?" The question is "does validation measure the future cases we care about?"
(Karpathy's recipe — don't skip a rung)
If you can't even overfit a single batch, something is fundamentally broken.
Common failures at this rung, and their fixes:
- Loss shoots to NaN → LR too high. Fix: divide LR by 10.
- The model ends in nn.Softmax AND you use nn.CrossEntropyLoss. The loss already includes softmax → applied twice → gradients muffled. Fix: remove nn.Softmax from the model; output raw logits.
- Forgotten opt.zero_grad() → gradients from previous batches pile up; updates point in nonsense directions. Fix: add opt.zero_grad() at the start of each step.
- A parameter has requires_grad=False. It will never update. Fix: assert param.requires_grad for each layer you intend to train.
- CrossEntropyLoss wants class indices (e.g. [3, 0, 1], dtype torch.long). One-hot floats break it. Fix: check .shape and .dtype of the label tensor.
- Raw, unnormalized inputs. Fix: apply ToTensor + Normalize in the transform pipeline.

Karpathy: "Become one with the data." Before touching the model, print shapes, dtypes, ranges, label balance, and a few random examples.
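The overfit-one-batch check itself is a few lines. A sketch, assuming model (already on device), loader, loss_fn, and device exist:

```python
x, y = next(iter(loader))                # freeze one small batch
x, y = x.to(device), y.to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

print(loss.item())   # should approach 0; if it doesn't, the wiring is broken
```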
The learning rate is the most important hyperparameter.
We want the highest LR that is still stable — without dozens of trial-and-error runs.
Analogy · pushing a cart up a hill. Find the hardest you can push without tipping. Start with a tiny push and gradually increase. More force → more speed. At some point the cart wobbles. The best push is just before the wobble.
The LR finder does this with your learning rate.
A short fake training session (≈100 steps) sweeps over LRs.
How to read the plot: the loss is flat at tiny LRs, drops steeply in the sweet spot, bottoms out, then shoots up as training diverges. Pick an LR in the steep-drop region, roughly 10× below the minimum; the minimum itself sits next to the cliff.
Suppose the LR finder gives (illustrative values):

| LR | 1e-5 | 1e-4 | 1e-3 | 1e-2 | 1e-1 |
|---|---|---|---|---|---|
| Loss | 3.2 | 2.5 | 1.1 (steepest drop) | 0.9 (minimum) | 2.8 (diverging) |

Analysis. Minimum loss is at LR 1e-2, but that sits right next to the divergence zone. Good starting LR · the steepest-drop point, here 1e-3; the LR near the minimum can serve as max_lr for one-cycle training.
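A minimal LR-finder loop. A sketch under assumptions: a fresh model and opt, plus loss_fn, loader, and device defined elsewhere (libraries such as torch-lr-finder package this up properly):

```python
import math

lr, lrs, losses = 1e-6, [], []
growth = math.exp(math.log(1e-1 / 1e-6) / 100)    # exponential sweep 1e-6 → 1e-1

for i, (x, y) in enumerate(loader):
    if i == 100 or (losses and losses[-1] > 4 * min(losses)):
        break                                     # ~100 steps, or stop on blow-up
    for g in opt.param_groups:
        g['lr'] = lr
    opt.zero_grad()
    loss = loss_fn(model(x.to(device)), y.to(device))
    loss.backward()
    opt.step()
    lrs.append(lr); losses.append(loss.item())
    lr *= growth
# Plot losses vs. lrs on a log-x axis; pick an LR in the steepest-drop region.
```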
Never skip a rung. If rung 3 fails, debug at rung 3 — don't go tune hyperparams at rung 6.
When improving a model, change one thing at a time.
| Run | Change | Val metric | Interpretation |
|---|---|---|---|
| A | baseline | 82.0 | reference |
| B | stronger augmentation | 84.1 | likely useful |
| C | bigger model | 82.4 | small gain, more cost |
| D | augmentation + bigger model | 84.0 | bigger model added little |
Rules: change exactly one variable per run · keep seed, data, and schedule fixed · always compare against the same baseline on the same validation split.
Without ablations, you do not know which change helped. You only know that the final run was different.
After training — what next?
Q. Which of these is most useful?
(a) Try a bigger model.
(b) Train for more epochs.
(c) Sample 100 val mistakes and categorize them.
(d) Tune the learning rate.
Ng's rule. Before adding complexity, look at the errors. Nearly always you will find a dominant failure category — fixing it moves val accuracy far more than architectural churn.
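Collecting those 100 mistakes is mechanical. A sketch, assuming model, val_loader, and device exist:

```python
mistakes = []
model.eval()
with torch.no_grad():
    for x, y in val_loader:
        pred = model(x.to(device)).argmax(dim=1).cpu()
        for xi, yi, pi in zip(x, y, pred):
            if pi != yi:
                mistakes.append((xi, yi.item(), pi.item()))
        if len(mistakes) >= 100:
            break
# Eyeball and bucket them: blur? rare class? ambiguous label? background shortcut?
```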
Do not stop at naming the error. Convert the bucket into an action.
| Error bucket | Example diagnosis | Intervention |
|---|---|---|
| blurry images | model fails on motion blur | add blur augmentation / collect blur examples |
| rare class | few training examples | reweight loss / collect targeted data |
| label ambiguity | humans disagree | clean labels / merge classes / report uncertainty |
| background shortcut | model uses spurious context | crop, mask, augment backgrounds |
| threshold error | probabilities okay, decision bad | tune threshold on validation |
Error analysis is not a post-mortem. It is the fastest way to decide what experiment to run next.
The small things that save you weeks later
```python
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Only if bit-exact reproduction matters (slower):
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```
Seed set at the start; every experiment records its seed in the config.
Every serious run should leave enough evidence to reproduce and compare it.
| Record | Why it matters |
|---|---|
| git commit | exact code |
| config file | hyperparameters and paths |
| data version / split id | same examples in train/val/test |
| random seed | reproducibility and variance estimate |
| hardware + precision | speed and numeric behavior |
| final checkpoint + best checkpoint | resume and deploy |
| metrics over time | diagnose underfit/overfit |
```python
run = {
    "commit": git_sha,
    "seed": seed,
    "config": cfg,
    "best_val": best_val,
}
```
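Persisting the record is one more step. A sketch; the runs/ directory and filename scheme are assumptions:

```python
import json, pathlib

pathlib.Path("runs").mkdir(exist_ok=True)
with open(f"runs/{git_sha[:8]}_seed{seed}.json", "w") as f:
    json.dump(run, f, indent=2, default=str)   # default=str handles non-JSON values in cfg
```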
If you cannot compare two runs later, the experiment did not really happen.
Training a deep net is 10% picking an architecture and 90% running a disciplined loop. PyTorch is just an engine; the wins come from the data pipeline, the debugging ladder, and error analysis — none of which require new theory.
| Stage | The discipline | If you skip it |
|---|---|---|
| Build | nn.Module + nn.Parameter registration | silent missing parameters |
| Feed | DataLoader w/ workers + pin_memory | GPU sits idle |
| Train | forward · loss · zero_grad · backward · step | rotting gradients |
| Debug | overfit-1-batch → LR finder → ablation | weeks of red herrings |
| Analyze | bucket errors before scaling | scale the wrong thing |
The single insight · the bug almost never lives where students look first. Hence the ladder.
The flat-loss puzzle from the start? The right move is (c) overfit one batch.
If a model can't drive loss → 0 on 4 examples, the bug is not the LR, the optimizer, or the depth — it's the wiring. Until you climb that rung, every other tweak is guessing.
This is the single most important habit in this lecture.
P1. You change batch_size from 32 to 256 but keep the LR fixed. Training diverges. Why? Name the standard rule for scaling LR with batch size.
P2. A DataLoader with num_workers=0 and pin_memory=False is feeding a GPU running at 30% utilization. Name two changes that should help and the order in which you'd test them.
P3. Show that gradient accumulation over K micro-batches of size B, with each micro-batch loss divided by K, produces the same parameter update as one batch of size K·B.
P4. Your run has training loss 0.02, validation loss 0.6. Diagnose. Name three interventions in order of cost.
P5. Why does model.eval() differ from torch.no_grad()? Give a concrete example where you need both.
P6. A grad-clip of max_norm=1.0 is applied to a 100M-parameter model. Show that clipping the global norm is not the same as clipping each parameter's gradient separately. Which version do you want when gradients blow up only in the early layers?