torch.no_grad() for evaluation

```python
model.eval()
with torch.no_grad():
    for x, y in val_loader:
        pred = model(x)
        ...
```
Skips tape construction · faster · less memory.
.detach() to stop gradient flow

```python
target = model_old(x).detach()
loss = mse(model_new(x), target)
```
Treats the tensor as a constant in the graph.
CPU loads while the GPU computes
Before the model sees a batch, the batch should satisfy a contract.
| Item | Classification expectation | Common bug |
|---|---|---|
| x.shape | [B, C, H, W] for images or [B, d] for vectors | missing batch dimension |
| x.dtype | float32 / bfloat16 after transforms | raw uint8 pixels |
| x range | normalized, often roughly centered | values still in [0, 255] |
| y.shape | [B] for class indices | one-hot when loss expects indices |
| y.dtype | torch.long for CrossEntropyLoss | float labels |
```python
assert x.ndim == 4
assert x.dtype in (torch.float32, torch.bfloat16)
assert y.ndim == 1 and y.dtype == torch.long
```
Many "model bugs" are actually batch-contract bugs. Verify the batch before touching the architecture.
Dataset

```python
from PIL import Image
from torch.utils.data import Dataset

class TinyImageDataset(Dataset):
    def __init__(self, paths, labels, tfm=None):
        self.paths, self.labels, self.tfm = paths, labels, tfm

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        img = Image.open(self.paths[i]).convert('RGB')
        if self.tfm:
            img = self.tfm(img)
        return img, self.labels[i]
```
Two methods. That is the entire contract.
```python
from torch.utils.data import DataLoader

loader = DataLoader(dataset,
                    batch_size=64,
                    shuffle=True,
                    num_workers=4,
                    pin_memory=True,
                    persistent_workers=True,
                    drop_last=True)
```
Rule of thumb · num_workers ≈ 4 × num_GPUs. pin_memory=True when using CUDA. persistent_workers=True when epochs are short.
Data loading is successful when the GPU almost never waits for the CPU.
| Symptom | Likely cause | First fix |
|---|---|---|
| GPU utilization sawtooths | CPU/disk cannot feed batches | increase num_workers |
| first batch slow every epoch | worker restart overhead | persistent_workers=True |
| transfer to CUDA slow | pageable host memory | pin_memory=True |
| random crop dominates time | heavy CPU transforms | cache, simplify, or move to GPU |
```python
import time

t0 = time.perf_counter()
for i, batch in enumerate(loader):
    if i == 100:
        break
print("batches/sec:", 100 / (time.perf_counter() - t0))
```
Benchmark the loader alone. If data throughput is low, a bigger model or better optimizer will not fix the bottleneck.
Modern GPUs are highways for numbers. To go faster: build a bigger highway (new GPU) or make the cars smaller.
We want the speed of 16 bits without losing FP32's range.
Analogy · measuring with rulers. FP32 is a long ruler with fine tick marks. FP16 keeps the fine marks but shortens the ruler, so big values fall off the end (overflow). BF16 keeps FP32's full length but spaces the marks coarsely: same range, less precision.
For deep learning, range matters far more than ultra-fine precision. BF16 almost never overflows.
BF16 · trades precision for range. Same memory as FP16. Available on NVIDIA A100+, AMD MI200+, TPU v3+. The default for modern LLM training.
```python
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    output = model(input)              # most matmuls and convs run in BF16
    loss = criterion(output, target)
loss.backward()                        # backward outside autocast; param grads stay in the params' dtype
```
Your GPU handles batches of 64, but the model trains better with batch 256.
Analogy · polling a small town. You want the average opinion of 256 people, but only 64 fit in the room, so you poll four groups of 64 and combine the answers.
Gradient accumulation does exactly this with batches: it processes several micro-batches, sums their gradients, and updates the weights once at the end.
The key identity: gradient of a sum = sum of gradients.
Effective batch 256, GPU fits 64 → K = 4 micro-batches per update:

```python
K = 4  # 256 effective / 64 per micro-batch

for i, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / K   # divide so the K micro-batches average
    loss.backward()                     # ADDS to the existing grads in .grad
    if (i + 1) % K == 0:
        opt.step()                      # update once per K micro-batches
        opt.zero_grad()                 # then clear for the next big batch
```
Two crucial details: divide the loss by K, and call step() / zero_grad() only after K micro-batches.

Worked example (K = 2, one weight w):
- Micro-batch 1: loss.backward() → w.grad = -2. No step(), no zero_grad().
- Micro-batch 2: loss.backward() → w.grad = -2 + (-6) = -8.
- Update: opt.step() applies -8, then opt.zero_grad() resets w.grad = 0.

This is the exact update we'd get from one big batch containing both micro-batches: gradient of a sum = sum of gradients.
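That identity is easy to verify numerically. A self-contained sketch (the tensors and the toy loss are made up for illustration):

```python
import torch

w = torch.randn(5, requires_grad=True)
x = torch.randn(8, 5)

# One big batch of 8 examples.
loss_big = (x @ w).pow(2).mean()
loss_big.backward()
g_big = w.grad.clone()
w.grad = None

# Two micro-batches of 4, each loss divided by K = 2.
for chunk in x.chunk(2):
    ((chunk @ w).pow(2).mean() / 2).backward()   # accumulates into w.grad

print(torch.allclose(g_big, w.grad, atol=1e-6))  # True: identical gradient
```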
Some batches — especially in RNNs and Transformers — produce ridiculously large gradients. This is an exploding gradient.
Analogy · learning to drive. Your instructor gives small corrections: "turn 5°," "forward 10 cm." Then suddenly screams "TURN 10,000° LEFT!". Following that literally → smashed wall. Weights destroyed, restart training.
Gradient clipping is a safety rule: "No matter what gradient is computed, never let the step be larger than max_norm." It prevents one bad batch from wrecking the entire run.
A gradient is a vector — direction × magnitude.
After loss.backward(), gather all parameter gradients into one vector and compute its global L2 norm. If it exceeds max_norm (e.g. 1.0), rescale every gradient by max_norm / norm. Direction is preserved — only step size is capped.
```python
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```
Worked example: two weights whose gradients after .backward() are 3 and 4. The global norm is √(3² + 4²) = 5 > max_norm = 1.0, so both gradients are scaled by 1/5 → 0.6 and 0.8. The optimizer now uses the clipped gradient: same direction, capped length.
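The same arithmetic, checked in code. A sketch: the two scalar "weights" stand in for real parameters, and their gradients are set by hand instead of by backward():

```python
import torch

w1 = torch.zeros(1, requires_grad=True)
w2 = torch.zeros(1, requires_grad=True)
w1.grad = torch.tensor([3.0])
w2.grad = torch.tensor([4.0])

total_norm = torch.nn.utils.clip_grad_norm_([w1, w2], max_norm=1.0)
print(total_norm)        # tensor(5.) · the pre-clip global norm
print(w1.grad, w2.grad)  # tensor([0.6000]) tensor([0.8000])
```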
From zero to a trained model
Training optimizes a loss. Reporting uses a metric. They answer different questions.
| Setting | Optimized loss | Reported metric | Failure if confused |
|---|---|---|---|
| balanced classification | cross-entropy | accuracy | hides calibration |
| imbalanced classification | weighted CE / focal | F1, AUROC, AUPRC | high accuracy by predicting majority |
| regression | MSE / MAE / NLL | RMSE, MAE | loss punishes errors differently |
| ranking / retrieval | contrastive / pairwise | recall@k, MRR | good loss, poor top-k behavior |
Choose the metric before training. Otherwise you will optimize the convenient loss and later discover it does not match the real objective.
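Loss and metric side by side, to make the distinction concrete. A sketch with made-up logits and labels:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5], [0.2, 1.5], [3.0, -1.0]])
labels = torch.tensor([0, 1, 1])

loss = F.cross_entropy(logits, labels)                  # what training optimizes
acc = (logits.argmax(dim=1) == labels).float().mean()   # what you report
print(loss.item(), acc.item())                          # acc = 0.667: 3rd example is wrong
```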
```python
def train(model, loader_tr, loader_val, opt, loss_fn, n_epochs, device):
    best_val = float('inf')
    for ep in range(n_epochs):
        model.train()
        for x, y in loader_tr:
            x, y = x.to(device), y.to(device)
            loss = loss_fn(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
        val = evaluate(model, loader_val, loss_fn, device)
        print(f'epoch {ep:3d} val_loss={val:.4f}')
        if val < best_val:
            best_val = val
            torch.save(model.state_dict(), 'best.pt')
```
Every real script is a variation on this. Keep it boring.
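The loop calls an evaluate() helper it never defines. A minimal version (a sketch; the name and signature are assumed from the call above, and loss_fn is assumed to average over the batch, PyTorch's default reduction):

```python
def evaluate(model, loader, loss_fn, device):
    model.eval()                        # dropout off, BatchNorm uses running stats
    total, n = 0.0, 0
    with torch.no_grad():               # no autograd tape: faster, less memory
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            total += loss_fn(model(x), y).item() * y.size(0)
            n += y.size(0)
    return total / n
```

Note it needs both model.eval() and torch.no_grad(): the first changes layer behavior, the second skips the tape.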
```python
torch.save({
    'model': model.state_dict(),
    'optim': opt.state_dict(),
    'epoch': ep,
    'config': cfg,
}, 'ckpt.pt')

ckpt = torch.load('ckpt.pt',
                  map_location=device,
                  weights_only=True)
model.load_state_dict(ckpt['model'])
opt.load_state_dict(ckpt['optim'])
```
Don't torch.save(model) — it pickles the class, which breaks across refactors. Always save state_dict.
A validation score is useful only if the split matches the deployment question.
| Problem | What goes wrong | Better split |
|---|---|---|
| same patient in train and val | memorizes patient-specific artifacts | split by patient |
| adjacent video frames | train and val nearly duplicates | split by video / scene |
| time-series random split | future leaks into training | chronological split |
| user logs | same user behavior in both sets | split by user or time |
| repeated documents | near-duplicate text leakage | split by document/source |
Leakage makes the training recipe look correct while the model has learned the wrong thing. Check the split before celebrating a curve.
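A group-aware split is a few lines. A sketch using scikit-learn's GroupShuffleSplit; X, y, and patient_ids are assumed arrays of equal length:

```python
from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(X, y, groups=patient_ids))
# No patient_id appears in both train_idx and val_idx.
```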
Even without leakage, validation may differ from real deployment.
| Shift | Example | Practical response |
|---|---|---|
| covariate shift | new camera, hospital, device | collect representative val/test |
| label shift | class priors change | report per-class metrics |
| concept shift | definition of target changes | refresh labels and monitor drift |
| temporal shift | user behavior changes over time | time-based test set |
The question is not "did validation improve?" The question is "does validation measure the future cases we care about?"
(Karpathy's recipe — don't skip a rung)
If you can't even overfit a single batch, something is fundamentally broken.
Common failures at this rung, and their fixes:
- Loss shoots to NaN → LR too high. Fix: divide LR by 10.
- The model ends in nn.Softmax AND you use nn.CrossEntropyLoss. The loss already includes softmax → applied twice → gradients muffled. Fix: remove nn.Softmax from the model; output raw logits.
- Forgotten opt.zero_grad() → gradients from previous batches pile up; updates point in nonsense directions. Fix: add opt.zero_grad() at the start of each step.
- A parameter has requires_grad=False. It will never update. Fix: assert param.requires_grad for each layer you intend to train.
- CrossEntropyLoss wants class indices (e.g. [3, 0, 1], dtype torch.long). One-hot floats break it. Fix: check .shape and .dtype of the label tensor.
- Raw, unnormalized inputs. Fix: apply ToTensor + Normalize in the transform pipeline.

Karpathy: "Become one with the data." Before touching the model, print shapes, dtypes, ranges, label balance, and a few random examples.
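The overfit-one-batch check itself is a few lines. A sketch, assuming model (already on device), loader, loss_fn, and device exist:

```python
x, y = next(iter(loader))                # freeze one small batch
x, y = x.to(device), y.to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

print(loss.item())   # should approach 0; if it doesn't, the wiring is broken
```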
The learning rate is the most important hyperparameter.
We want the highest LR that is still stable — without dozens of trial-and-error runs.
Analogy · pushing a cart up a hill. Find the hardest you can push without tipping. Start with a tiny push and gradually increase. More force → more speed. At some point the cart wobbles. The best push is just before the wobble.
The LR finder does this with your learning rate.
A short fake training session (≈100 steps) sweeps over LRs.
How to read the plot: the loss is flat at tiny LRs, drops steeply in the sweet spot, bottoms out, then shoots up as training diverges. Pick an LR in the steep-drop region, roughly 10× below the minimum; the minimum itself sits next to the cliff.
Suppose the LR finder gives (illustrative values):

| LR | 1e-5 | 1e-4 | 1e-3 | 1e-2 | 1e-1 |
|---|---|---|---|---|---|
| Loss | 3.2 | 2.5 | 1.1 (steepest drop) | 0.9 (minimum) | 2.8 (diverging) |

Analysis. Minimum loss is at LR 1e-2, but that sits right next to the divergence zone. Good starting LR · the steepest-drop point, here 1e-3; the LR near the minimum can serve as max_lr for one-cycle training.
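A minimal LR-finder loop. A sketch under assumptions: a fresh model and opt, plus loss_fn, loader, and device defined elsewhere (libraries such as torch-lr-finder package this up properly):

```python
import math

lr, lrs, losses = 1e-6, [], []
growth = math.exp(math.log(1e-1 / 1e-6) / 100)    # exponential sweep 1e-6 → 1e-1

for i, (x, y) in enumerate(loader):
    if i == 100 or (losses and losses[-1] > 4 * min(losses)):
        break                                     # ~100 steps, or stop on blow-up
    for g in opt.param_groups:
        g['lr'] = lr
    opt.zero_grad()
    loss = loss_fn(model(x.to(device)), y.to(device))
    loss.backward()
    opt.step()
    lrs.append(lr); losses.append(loss.item())
    lr *= growth
# Plot losses vs. lrs on a log-x axis; pick an LR in the steepest-drop region.
```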
Never skip a rung. If rung 3 fails, debug at rung 3 — don't go tune hyperparams at rung 6.
When improving a model, change one thing at a time.
| Run | Change | Val metric | Interpretation |
|---|---|---|---|
| A | baseline | 82.0 | reference |
| B | stronger augmentation | 84.1 | likely useful |
| C | bigger model | 82.4 | small gain, more cost |
| D | augmentation + bigger model | 84.0 | bigger model added little |
Rules: change exactly one variable per run · keep seed, data, and schedule fixed · always compare against the same baseline on the same validation split.
Without ablations, you do not know which change helped. You only know that the final run was different.
After training — what next?
Q. Which of these is most useful?
(a) Try a bigger model.
(b) Train for more epochs.
(c) Sample 100 val mistakes and categorize them.
(d) Tune the learning rate.
Ng's rule. Before adding complexity, look at the errors. Nearly always you will find a dominant failure category — fixing it moves val accuracy far more than architectural churn.
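Collecting those 100 mistakes is mechanical. A sketch, assuming model, val_loader, and device exist:

```python
mistakes = []
model.eval()
with torch.no_grad():
    for x, y in val_loader:
        pred = model(x.to(device)).argmax(dim=1).cpu()
        for xi, yi, pi in zip(x, y, pred):
            if pi != yi:
                mistakes.append((xi, yi.item(), pi.item()))
        if len(mistakes) >= 100:
            break
# Eyeball and bucket them: blur? rare class? ambiguous label? background shortcut?
```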
Do not stop at naming the error. Convert the bucket into an action.
| Error bucket | Example diagnosis | Intervention |
|---|---|---|
| blurry images | model fails on motion blur | add blur augmentation / collect blur examples |
| rare class | few training examples | reweight loss / collect targeted data |
| label ambiguity | humans disagree | clean labels / merge classes / report uncertainty |
| background shortcut | model uses spurious context | crop, mask, augment backgrounds |
| threshold error | probabilities okay, decision bad | tune threshold on validation |
Error analysis is not a post-mortem. It is the fastest way to decide what experiment to run next.
The small things that save you weeks later
```python
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Only if bit-exact reproduction matters (slower):
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```
Seed set at the start; every experiment records its seed in the config.
Every serious run should leave enough evidence to reproduce and compare it.
| Record | Why it matters |
|---|---|
| git commit | exact code |
| config file | hyperparameters and paths |
| data version / split id | same examples in train/val/test |
| random seed | reproducibility and variance estimate |
| hardware + precision | speed and numeric behavior |
| final checkpoint + best checkpoint | resume and deploy |
| metrics over time | diagnose underfit/overfit |
```python
run = {
    "commit": git_sha,
    "seed": seed,
    "config": cfg,
    "best_val": best_val,
}
```
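Persisting the record is one more step. A sketch; the runs/ directory and filename scheme are assumptions:

```python
import json, pathlib

pathlib.Path("runs").mkdir(exist_ok=True)
with open(f"runs/{git_sha[:8]}_seed{seed}.json", "w") as f:
    json.dump(run, f, indent=2, default=str)   # default=str handles non-JSON values in cfg
```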
If you cannot compare two runs later, the experiment did not really happen.
Training a deep net is 10% picking an architecture and 90% running a disciplined loop. PyTorch is just an engine; the wins come from the data pipeline, the debugging ladder, and error analysis — none of which require new theory.
| Stage | The discipline | If you skip it |
|---|---|---|
| Build | nn.Module + nn.Parameter registration | silent missing parameters |
| Feed | DataLoader w/ workers + pin_memory | GPU sits idle |
| Train | forward · loss · zero_grad · backward · step | rotting gradients |
| Debug | overfit-1-batch → LR finder → ablation | weeks of red herrings |
| Analyze | bucket errors before scaling | scale the wrong thing |
The single insight · the bug almost never lives where students look first. Hence the ladder.
The flat-loss puzzle from the start? The right move is (c) overfit one batch.
If a model can't drive loss → 0 on 4 examples, the bug is not the LR, the optimizer, or the depth — it's the wiring. Until you climb that rung, every other tweak is guessing.
This is the single most important habit in this lecture.
P1. You change batch_size from 32 to 256 but keep the LR fixed. Training diverges. Why? Name the standard rule for scaling LR with batch size.
P2. A DataLoader with num_workers=0 and pin_memory=False is feeding a GPU running at 30% utilization. Name two changes that should help and the order in which you'd test them.
P3. Show that gradient accumulation over K micro-batches of size B, with each micro-batch loss divided by K, produces the same parameter update as one batch of size K·B.
P4. Your run has training loss 0.02, validation loss 0.6. Diagnose. Name three interventions in order of cost.
P5. Why does model.eval() differ from torch.no_grad()? Give a concrete example where you need both.
P6. A grad-clip of max_norm=1.0 is applied to a 100M-parameter model. Show that clipping the global norm is not the same as clipping each parameter's gradient separately. Which version do you want when gradients blow up only in the early layers?