A friend asks · "I have
(a) Logistic regression on raw pixel intensities.
(b) Hand-engineer SIFT/HOG features, then SVM.
(c) Train a CNN end-to-end.
Stop and decide before the next slide. Why did you pick what you picked?
There is no trick · we want you to feel where each option helps and hurts. The next slides explain why the right answer changed in 2012 — and why this whole course exists.
Deep learning = representation learning with differentiable, composable modules, trained end-to-end by gradient descent.
Every word matters. We will unpack all of them this semester.
Classical ML: hand-engineered features (SIFT, HOG, …) → learned classifier.
Deep learning: features *and* classifier learned jointly, end-to-end.
The key change is not "more parameters". It is that each layer learns a transformation: pixels become edges, edges become parts, parts become object evidence. Good representations keep task-relevant variation and suppress nuisance variation.
| Year | Winner | Top-5 error | Method |
|---|---|---|---|
| 2010 | NEC-UIUC | 28.2% | SIFT + Fisher |
| 2011 | XRCE | 25.8% | Hand-crafted |
| 2012 | AlexNet | 16.4% | 8-layer CNN |
| 2015 | ResNet | 3.6% | 152-layer CNN |
Human top-5 error: ~5.1%. AlexNet cut error by ~10 points in one year.
| | 1990 | 2024 | Growth |
|---|---|---|---|
| Compute (best-case FLOPs per dollar) | $\sim 10^{8}$ | $\sim 10^{15}$ | $\sim 10^{7}\times$ |
| Datasets (frontier) | 60k MNIST digits | $\sim$5T LLM tokens | $\sim 10^{8}\times$ |
| SOTA model params | LeNet ($\sim$60k) | Llama-3 (405B) | $\sim 10^{7}\times$ |
Three exponential ramps compounding for 30+ years. Algorithms that were "impossible" in 1990 became cheap in 2012 and routine by 2020. Deep learning didn't get smarter — the world got faster.
The 2012 ImageNet jump was the moment all three crossed the line at once.
By the end of this lecture you will be able to:
24-lecture arc
Foundations → optimization → regularization → CNNs → detection → sequences → attention → Transformers → LLMs → vision-language → VAEs → GANs → diffusion → efficient inference.
The building block you already know — tightened up
In L0 we showed every loss is the NLL of an assumed conditional distribution · $\mathcal{L}(\theta) = -\log p_\theta(y \mid \mathbf{x})$.
| Task | Conditional | Loss |
|---|---|---|
| Regression | Gaussian | MSE |
| Binary classification | Bernoulli | BCE |
| Multiclass classification | Categorical | CE |
Logistic regression and softmax classification are 1-layer neural nets. You've already met the entire framework. Today we add hidden layers so the conditional distribution can be a more complex function of the input $\mathbf{x}$.
The probabilistic story is unchanged · pick a distribution, write NLL, optimize. Only the mapping from input to distribution parameters becomes deeper.
You already know linear regression / classification from ES 654: $\hat{y} = \mathbf{w}^\top\mathbf{x} + b$, optionally passed through a sigmoid or softmax.
A neuron is just this · plus a non-linear "squashing" function: $a = \sigma(\mathbf{w}^\top\mathbf{x} + b)$.
That's the only new ingredient. Everything in this course builds on top · stack many of these neurons, learn the weights, you get a deep network.
Input · $\mathbf{x} = (x_1, x_2)$, weights $(w_1, w_2)$, bias $b$.
Pre-activation · $z = w_1 x_1 + w_2 x_2 + b$.
Activation (sigmoid) · $a = \sigma(z) = \frac{1}{1 + e^{-z}}$.
This 1-neuron, 2-input setup is exactly what we just had in linear regression · plus the sigmoid making it 0–1 instead of any real number. Stack 500 of these and you have an MLP layer.
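A minimal sketch of this 2-input neuron in PyTorch (input, weights, and bias chosen arbitrarily for illustration):

```python
import torch

# one neuron: two inputs, sigmoid squashing (illustrative values)
x = torch.tensor([1.0, 2.0])     # input
w = torch.tensor([0.5, -0.3])    # weights
b = torch.tensor(0.1)            # bias

z = w @ x + b                    # pre-activation: w1*x1 + w2*x2 + b
a = torch.sigmoid(z)             # activation squashed into (0, 1)
print(z.item(), a.item())        # here: 0.0 -> 0.5
```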
Q. Why the non-linearity? What breaks without it?
Stack two magnifying glasses · you get a bigger image. Still just a bigger linear version of the original.
Stack two linear layers · same story. The composition of linear maps is just another linear map. No new patterns become learnable.
A non-linearity is like adding a prism · it bends the input in a way no linear stack can replicate. Each layer can learn a new kind of feature.
This is why every deep network has activation functions between linear layers. Without them, depth is wasted.
A tiny 2-layer network with no nonlinearity:
Layer 1 · $\mathbf{h} = W_1\mathbf{x} + \mathbf{b}_1$
Layer 2 · $\hat{\mathbf{y}} = W_2\mathbf{h} + \mathbf{b}_2$
Substitute $\mathbf{h}$ into the second equation: $\hat{\mathbf{y}} = W_2(W_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$
Distribute: $\hat{\mathbf{y}} = (W_2 W_1)\mathbf{x} + (W_2\mathbf{b}_1 + \mathbf{b}_2)$
Define · $W' = W_2 W_1$ and $\mathbf{b}' = W_2\mathbf{b}_1 + \mathbf{b}_2$, so $\hat{\mathbf{y}} = W'\mathbf{x} + \mathbf{b}'$.
The 2-layer network is exactly one linear layer. The depth was useless.
Forward through the 2 layers: $\hat{\mathbf{y}} = W_2(W_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$.
Equivalent single layer: $\hat{\mathbf{y}} = W'\mathbf{x} + \mathbf{b}'$ — identical outputs for every input.
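A numerical check of the collapse in PyTorch (layer sizes are arbitrary): build the single equivalent layer $W' = W_2 W_1$, $\mathbf{b}' = W_2\mathbf{b}_1 + \mathbf{b}_2$ and compare outputs.

```python
import torch, torch.nn as nn

torch.manual_seed(0)
l1, l2 = nn.Linear(4, 3), nn.Linear(3, 2)   # two linear layers, no activation
x = torch.randn(5, 4)

two_layer = l2(l1(x))                       # forward through both layers

# equivalent single layer: W' = W2 W1, b' = W2 b1 + b2
W = l2.weight @ l1.weight
b = l2.weight @ l1.bias + l2.bias
one_layer = x @ W.T + b

print(torch.allclose(two_layer, one_layer, atol=1e-6))   # True
```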
XOR · 2D binary inputs, output $y = x_1 \oplus x_2$ (1 iff exactly one input is 1).

| $x_1$ | $x_2$ | $y$ |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
XOR is the historical counterexample that broke the single-layer perceptron in Minsky and Papert's 1969 critique — and motivated the multi-layer, hidden-unit revival of the 1980s.
Try mentally · can you find a single line $w_1 x_1 + w_2 x_2 + b = 0$ that puts the two 1-labelled points on one side and the two 0-labelled points on the other? (You cannot — no such line exists.)
Left · Three attempts at a linear separator. Whichever line you draw, two same-class points end up on opposite sides.
Right · A 2-layer MLP with 2 hidden ReLU units carves out a curved decision region · all four points correctly classified.
A 2-input, 2-hidden, 1-output MLP with ReLU · $\hat{y} = \mathbf{w}_2^\top\,\mathrm{ReLU}(W_1\mathbf{x} + \mathbf{b}_1) + b_2$.
Choose by hand (one standard construction) · $h_1 = \mathrm{ReLU}(x_1 + x_2)$, $h_2 = \mathrm{ReLU}(x_1 + x_2 - 1)$, output $y = h_1 - 2h_2$.
Outputs · $(0,0)\mapsto 0$, $(0,1)\mapsto 1$, $(1,0)\mapsto 1$, $(1,1)\mapsto 2 - 2 = 0$ — exactly XOR.
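A quick check of this hand-built construction in PyTorch (weights taken from the construction above):

```python
import torch

# h1 = ReLU(x1 + x2), h2 = ReLU(x1 + x2 - 1), y = h1 - 2*h2
W1 = torch.tensor([[1., 1.], [1., 1.]])
b1 = torch.tensor([0., -1.])
w2 = torch.tensor([1., -2.])

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
h = torch.relu(X @ W1.T + b1)   # hidden layer reshapes the four points
y = h @ w2                      # a linear readout now suffices
print(y)                        # tensor([0., 1., 1., 0.]) -> XOR
```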
A hidden layer reshapes the data so the next (linear) layer's job becomes easy.
Left · two interlocking spirals · no straight line separates them.
Right · the same data after a learned hidden layer · spirals are unrolled into bands that a linear classifier separates trivially.
This is what people mean by "deep learning is representation learning." The hidden layers learn a coordinate system in which the problem becomes linear. The final softmax is just classical logistic regression — applied to features the network designed for itself.

| Name | Formula | Where you see it |
|---|---|---|
| Sigmoid | $\sigma(z) = \frac{1}{1+e^{-z}}$ | gates |
| Tanh | $\tanh(z)$ | RNNs |
| ReLU | $\max(0, z)$ | most CNNs |
| GELU / SiLU | $z\,\Phi(z)$ / $z\,\sigma(z)$ | Transformers, LLMs |
| Activation | Failure mode | Practical consequence |
|---|---|---|
| Sigmoid | saturates; derivative $\le 0.25$ | vanishing gradients |
| Tanh | saturates for large $\lvert z\rvert$ | vanishing gradients (milder; zero-centred) |
| ReLU | zero gradient for $z < 0$ | dead units if updates are harsh |
| GELU / SiLU | smoother but costlier | common in Transformers, less common in tiny CNNs |
An activation is not decoration. It controls both the representation and the backward gradient flow. Modern defaults are ReLU-family for CNNs and GELU / SwiGLU-family for Transformers.
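A small sketch comparing the values and backward gradients of these activations at a few sample points (the points are chosen arbitrarily):

```python
import torch

z = torch.tensor([-4.0, -1.0, 0.0, 1.0, 4.0], requires_grad=True)

for name, f in [("sigmoid", torch.sigmoid), ("tanh", torch.tanh),
                ("relu", torch.relu), ("gelu", torch.nn.functional.gelu)]:
    a = f(z)
    # sum() gives a scalar, so backward() fills z.grad with da/dz at each point
    a.sum().backward()
    print(f"{name:8s} value {a.detach().numpy().round(3)} grad {z.grad.numpy().round(3)}")
    z.grad = None   # reset before the next activation
```

Notice how sigmoid's gradient collapses at $\lvert z\rvert = 4$ while ReLU's stays at 1 for positive inputs — the backward-flow point above, in numbers.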
For MNIST with hidden sizes 256, 256: $(784\cdot 256 + 256) + (256\cdot 256 + 256) + (256\cdot 10 + 10) = 269{,}322$ parameters.
A tiny MNIST model has ~270k parameters. GPT-3 has 175 billion — roughly $6\times 10^{5}$ times more, built from the same blocks.
For a mini-batch: stack $B$ inputs into a matrix $X \in \mathbb{R}^{B\times 784}$; each layer then acts on the whole batch at once, e.g. $H = \mathrm{ReLU}(XW_1^\top + \mathbf{b}_1)$.
MNIST batch of 64:
| Tensor | Shape |
|---|---|
| flattened input $X$ | (64, 784) |
| first weight $W_1$ (PyTorch stores (out, in)) | (256, 784) |
| hidden activations $H$ | (64, 256) |
| final logits | (64, 10) |
Most PyTorch bugs in early DL are shape bugs. Check the batch dimension first.
```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, d_in=784, d_h=256, d_out=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_h), nn.ReLU(),
            nn.Linear(d_h, d_h), nn.ReLU(),
            nn.Linear(d_h, d_out),   # raw logits — no softmax here
        )

    def forward(self, x):
        return self.net(x)
```
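A quick smoke test on a fake batch, reusing the MLP class above (the numbers match the shape table and parameter count from earlier):

```python
import torch

model = MLP()                     # 784 -> 256 -> 256 -> 10
x = torch.randn(64, 784)          # a fake MNIST batch, already flattened
logits = model(x)

print(logits.shape)               # torch.Size([64, 10])
n_params = sum(p.numel() for p in model.parameters())
print(n_params)                   # 269322 (~270k) parameters
```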
Q. Why no activation after the last Linear?
For classification: output raw logits and let the loss handle normalisation.
nn.CrossEntropyLoss internally applies log_softmax, so adding your own softmax normalises twice. Symptom to memorize: training loss stuck at ~2.30 ($= \ln 10$, the loss of a uniform guess over 10 classes).
Fix: remove the extra softmax.
The math that makes learning possible
The network outputs raw "scores" (logits) for each class · arbitrary real numbers like $(2.0,\ 1.0,\ 0.1)$.
For classification, we need a valid probability distribution · all values in $[0, 1]$, summing to 1.
Two problems to solve:
1. Make every score positive · $\exp(\cdot)$ does this for any real input.
2. Make the scores sum to 1 · divide each by their total.
Together · the softmax function, $\mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$.
Logits · $\mathbf{z} = (2.0,\ 1.0,\ 0.1)$.
Step 1 · exponentiate · $e^{\mathbf{z}} \approx (7.39,\ 2.72,\ 1.11)$.
Step 2 · sum · $7.39 + 2.72 + 1.11 \approx 11.21$.
Step 3 · normalize · $\mathbf{p} \approx (0.66,\ 0.24,\ 0.10)$.
Note · the relative ranking is preserved (the largest logit becomes the largest probability) · but the values are now interpretable as probabilities. The softmax is the standard last layer for classification.
Interactive: drag the temperature slider, watch the distribution morph — softmax-temperature.
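The demo's core idea is just a temperature-scaled softmax; a minimal sketch (the function name is ours, and the logits are the illustrative ones from the worked example):

```python
import torch

def softmax_t(z, T=1.0):
    """Softmax with temperature: small T sharpens, large T flattens."""
    return torch.softmax(z / T, dim=-1)

z = torch.tensor([2.0, 1.0, 0.1])
for T in [0.5, 1.0, 5.0]:
    print(T, softmax_t(z, T).numpy().round(3))
# T=1.0 reproduces the worked example: ~[0.659, 0.242, 0.099]
```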
One example $(\mathbf{x}, y)$ · the model outputs probabilities $\mathbf{p}$, and MLE says minimise the NLL $-\log p_y$.
With one-hot target $\mathbf{t}$ this is $-\sum_k t_k \log p_k$.
This is cross-entropy. MLE hands it to us for free — we did not invent it.
Accuracy only asks whether the top class is correct. Cross-entropy asks whether the whole probability distribution is honest.
Suppose the true class is class 0:
| Prediction $(p_0, p_1, p_2)$ | CE loss |
|---|---|
| $(0.9,\ 0.05,\ 0.05)$ | $-\ln 0.9 \approx 0.11$ |
| $(0.4,\ 0.3,\ 0.3)$ | $-\ln 0.4 \approx 0.92$ |
But if the label is class 1:
| Prediction $(p_0, p_1, p_2)$ | CE loss |
|---|---|
| $(0.9,\ 0.05,\ 0.05)$ | $-\ln 0.05 \approx 3.00$ |
| $(0.4,\ 0.3,\ 0.3)$ | $-\ln 0.3 \approx 1.20$ |
Cross-entropy rewards calibrated confidence and punishes confident wrong answers. That is why it is a proper scoring rule for classification.
Logits $\mathbf{z}$ → softmax probabilities $\mathbf{p}$ → cross-entropy against one-hot target $\mathbf{t}$. What is $\partial\mathcal{L}/\partial\mathbf{z}$?
The next slide derives the formula. The result · $\frac{\partial\mathcal{L}}{\partial z_k} = p_k - t_k$.
Setup · $p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$, $\mathcal{L} = -\sum_i t_i \log p_i$.
Want · $\frac{\partial\mathcal{L}}{\partial z_k}$.
Part 1 · $\frac{\partial\mathcal{L}}{\partial p_i} = -\frac{t_i}{p_i}$.
Part 2 · case $i = k$ · $\frac{\partial p_i}{\partial z_k} = p_k(1 - p_k)$.
Part 2 · case $i \ne k$ · $\frac{\partial p_i}{\partial z_k} = -p_i p_k$.
Combine · $\frac{\partial\mathcal{L}}{\partial z_k} = \sum_i \frac{\partial\mathcal{L}}{\partial p_i}\frac{\partial p_i}{\partial z_k} = -t_k(1 - p_k) + \sum_{i\ne k} t_i p_k = p_k\sum_i t_i - t_k$.
Since $\sum_i t_i = 1$ (one-hot), this is $p_k - t_k$.
Logits $(2.0,\ 1.0,\ 0.1)$ → $\mathbf{p} \approx (0.66,\ 0.24,\ 0.10)$; true class 0, so $\mathbf{t} = (1, 0, 0)$ and $\nabla_{\mathbf{z}}\mathcal{L} = \mathbf{p} - \mathbf{t} \approx (-0.34,\ 0.24,\ 0.10)$.
SGD step · push the correct class's logit up and the others down, each in proportion to how over-confident its probability was.
The gradient is bounded between -1 and 1 per logit · stable. No exploding gradients from the loss itself.
Prediction minus target. Same form as logistic regression — no accident.
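You can verify the $\mathbf{p} - \mathbf{t}$ result with autograd, reusing the illustrative logits from above:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([2.0, 1.0, 0.1], requires_grad=True)
target = torch.tensor(0)                          # true class is 0

loss = F.cross_entropy(z.unsqueeze(0), target.unsqueeze(0))
loss.backward()

p = torch.softmax(z.detach(), dim=0)
t = F.one_hot(target, num_classes=3).float()
print(z.grad)    # autograd's gradient
print(p - t)     # matches: prediction minus target
```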
The forward pass computes the loss. Backprop figures out who to blame.
For each weight, ask · how much did this weight contribute to the final error?
Walk backwards from the loss through the network · at each layer, distribute "blame" proportionally to how much each weight affected the layer's output. Adjust weights to reduce that blame.
That's the entire idea. The math (chain rule) is mechanical · the concept is "trace responsibility backwards."
Every network is a DAG of differentiable ops. Forward computes values; backward computes gradients in reverse order.
For a layer $\mathbf{z} = W\mathbf{x} + \mathbf{b}$, suppose the upstream gradient $\boldsymbol{\delta} = \partial\mathcal{L}/\partial\mathbf{z}$ has already arrived from the layers above.
We must figure out three things: $\partial\mathcal{L}/\partial W$, $\partial\mathcal{L}/\partial\mathbf{b}$, and $\partial\mathcal{L}/\partial\mathbf{x}$ (the message passed to the previous layer).
The next slide does these one at a time, with full chain rule.
For one element · $z_i = \sum_j W_{ij}x_j + b_i$, so by the chain rule:
1. Bias · $\frac{\partial\mathcal{L}}{\partial b_i} = \delta_i$ (because $\partial z_i/\partial b_i = 1$).
2. Weight · $\frac{\partial\mathcal{L}}{\partial W_{ij}} = \delta_i\, x_j$.
3. Input · $\frac{\partial\mathcal{L}}{\partial x_j} = \sum_i \delta_i\, W_{ij}$.
These three lines are the entire backward pass for a Linear layer in PyTorch.
Weight gradient · $\frac{\partial\mathcal{L}}{\partial W} = \boldsymbol{\delta}\,\mathbf{x}^\top$
Bias gradient · $\frac{\partial\mathcal{L}}{\partial \mathbf{b}} = \boldsymbol{\delta}$
Input gradient (passed to previous layer) · $\frac{\partial\mathcal{L}}{\partial \mathbf{x}} = W^\top\boldsymbol{\delta}$
These three numbers are what loss.backward() computes for one Linear layer · in pure NumPy you could write it in 3 lines.
For the matrix form above, those three lines look like this (sketch below).
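A minimal NumPy sketch, assuming the shape conventions above: `W` is (out, in), `x` is (in,), `delta` is (out,); the sample values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W, x = rng.standard_normal((3, 4)), rng.standard_normal(4)   # a (3, 4) linear layer
delta = rng.standard_normal(3)                               # upstream dL/dz

dW = np.outer(delta, x)   # dL/dW = delta x^T   -> shape (3, 4)
db = delta                # dL/db = delta       -> shape (3,)
dx = W.T @ delta          # dL/dx = W^T delta   -> shape (4,), passed backwards
```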
These three lines are the entire backward pass of a linear layer. Everything else is repeating this rule through activations and stacking.
Tiny 2-1-1 net with sigmoid hidden, sigmoid output. Input $\mathbf{x} = (x_1, x_2)$, target $y$, BCE loss.
Forward · $h = \sigma(w_1 x_1 + w_2 x_2 + b_1)$, then $\hat{y} = \sigma(v\,h + b_2)$.
Backward (using the BCE+sigmoid identity $\partial\mathcal{L}/\partial z_{\mathrm{out}} = \hat{y} - y$) · $\frac{\partial\mathcal{L}}{\partial v} = (\hat{y} - y)\,h$, $\frac{\partial\mathcal{L}}{\partial b_2} = \hat{y} - y$, $\frac{\partial\mathcal{L}}{\partial w_i} = (\hat{y} - y)\,v\,h(1-h)\,x_i$, $\frac{\partial\mathcal{L}}{\partial b_1} = (\hat{y} - y)\,v\,h(1-h)$.
Update · $\theta \leftarrow \theta - \eta\,\frac{\partial\mathcal{L}}{\partial\theta}$ for every parameter $\theta$.
Forward at step 1 · recompute $\hat{y}$ with the updated weights — the loss is lower than before the update.
This is the entire training loop on one example, by hand. SGD does this for every (sample, weight) pair, every step, every epoch.
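The same kind of single-example update, done by autograd instead of by hand — a sketch; the input, target, and learning rate are arbitrary choices:

```python
import torch, torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid(),   # 2 -> 1 hidden (sigmoid)
                    nn.Linear(1, 1), nn.Sigmoid())   # 1 output (sigmoid)
x, y = torch.tensor([[1.0, 0.5]]), torch.tensor([[1.0]])
bce = nn.BCELoss()

loss_before = bce(net(x), y)
loss_before.backward()                               # fill .grad for every parameter
with torch.no_grad():
    for p in net.parameters():
        p -= 0.5 * p.grad                            # one SGD step, lr = 0.5

print(loss_before.item(), bce(net(x), y).item())     # loss drops after the step
```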
For a batch: stack the inputs into $X \in \mathbb{R}^{B\times d}$ and the upstream gradients into $\Delta = \partial\mathcal{L}/\partial Z \in \mathbb{R}^{B\times m}$.
The linear-layer backward pass becomes: $\frac{\partial\mathcal{L}}{\partial W} = \Delta^\top X$, $\quad \frac{\partial\mathcal{L}}{\partial \mathbf{b}} = \sum_{\text{batch}}\Delta$, $\quad \frac{\partial\mathcal{L}}{\partial X} = \Delta W$ (with $W$ stored as (out, in)).
Backprop is mostly matrix multiplication. PyTorch is not doing magic here; it is applying these local rules through the computation graph.
A teaser for Lecture 2
A single hidden layer can approximate any continuous function (UAT — next lecture).
Q. So why do we ever use more than one?
Give an honest answer before turning the page.
Hubel & Wiesel (Nobel 1981) — the cat visual cortex is hierarchical.
| Brain region | Roughly analogous layer | Detects |
|---|---|---|
| V1 | early layer | oriented edges |
| V2 | next layer | textures, junctions |
| V4 | mid layer | shapes, parts |
| IT | late layer | objects, faces |
Biology suggested that hierarchical visual processing is useful. Modern neural networks are not literal brain models: the optimizer, data scale, loss function, and hardware are engineered.
Think of backprop as a "telephone game" played backwards from the loss.
Each layer has to pass the error signal to the layer before it. If each layer multiplies the message by something less than 1 (e.g., 0.25 for sigmoid), then by the time the signal reaches the early layers it's a faint mumble.
Those early layers stop learning. This is the vanishing gradient problem · the topic of the next slide and a recurring theme through the course (RNN, deep MLPs, deep Transformers).
For sigmoid · $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$.
Maximum value · $0.25$, reached at $z = 0$.
For $\lvert z\rvert \gtrsim 4$, $\sigma'(z) < 0.02$ — the saturated regime.
Every layer with sigmoid multiplies the backward-flowing gradient by something $\le 0.25$, and usually much smaller.
A 5-layer sigmoid net. Assume all weights = 1 and inputs are in saturating regions, so each local factor $\sigma'(z) \approx 0.1$:

| Layer | Local factor | Gradient signal reaching it |
|---|---|---|
| Output | — | 1.0 |
| L4 | 0.1 | 0.1 |
| L3 | 0.1 | 0.01 |
| L2 | 0.1 | 0.001 |
| L1 | 0.1 | 0.0001 |
The first layer gets a gradient $10{,}000\times$ smaller than the output layer — it barely learns.
Each sigmoid factor ≤ 0.25:
| Depth | Upper bound on gradient magnitude |
|---|---|
| 5 | $0.25^{5} \approx 10^{-3}$ |
| 10 | $0.25^{10} \approx 10^{-6}$ |
| 20 | $0.25^{20} \approx 10^{-12}$ |
Early layers effectively stop learning. This blocked depth for 20 years.
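You can watch the shrinkage directly. A sketch (depth, width, and input are arbitrary) that compares the gradient reaching the first layer of a deep sigmoid stack vs. a ReLU stack:

```python
import torch, torch.nn as nn

def first_layer_grad(act, depth=20, width=32):
    torch.manual_seed(0)
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), act()]
    net = nn.Sequential(*layers)
    x = torch.randn(8, width)
    net(x).sum().backward()
    return net[0].weight.grad.abs().mean().item()   # gradient reaching layer 1

print("sigmoid:", first_layer_grad(nn.Sigmoid))
print("relu   :", first_layer_grad(nn.ReLU))        # orders of magnitude larger
```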
For active neurons ($z > 0$) the ReLU gradient is exactly 1 — no shrinkage.
But ReLU alone is not enough for very deep networks:
| Problem | Tool |
|---|---|
| activation variance shrinks or explodes | Xavier / He initialization |
| gradients weaken through many layers | residual connections |
| optimization is scale-sensitive | normalization layers |
Lecture 2 derives why these tools make depth trainable.
The five lines you'll type every day this semester
```python
import torch
import torch.nn as nn

model = MLP(784, 256, 10).to('cuda')
criterion = nn.CrossEntropyLoss()
optim = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(10):
    for x, y in loader:                # `loader` = your MNIST DataLoader
        x, y = x.view(-1, 784).to('cuda'), y.to('cuda')
        logits = model(x)              # 1. forward
        loss = criterion(logits, y)    # 2. loss
        optim.zero_grad()              # 3. zero grads
        loss.backward()                # 4. backward
        optim.step()                   # 5. update
```
Every real training script is a variation on this.
zero_grad
Q. What if you forget optimizer.zero_grad()?
PyTorch accumulates gradients by default. Without zeroing, each backward adds to the previous .grad. After $k$ steps you are effectively stepping with the sum of the last $k$ gradients — large, stale, and wrong.
The single most common PyTorch bug.
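A minimal demonstration of the accumulation, using a single scalar parameter (illustrative):

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(w * 3).backward()
print(w.grad)        # tensor(3.)

(w * 3).backward()   # forgot to zero: gradients add up
print(w.grad)        # tensor(6.)

w.grad = None        # the fix: zero (or clear) before the next backward
```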
Some layers behave differently during training and evaluation.
| Mechanism | model.train() | model.eval() |
|---|---|---|
| Dropout | randomly masks activations | uses all activations |
| BatchNorm | updates running statistics | uses stored statistics |
| Autograd | tracks gradients if enabled | still tracks unless disabled |
```python
model.eval()
with torch.no_grad():
    logits = model(x_val)
```
For validation and test: use model.eval() and torch.no_grad(). Otherwise your measured performance may be noisy, slower, or wrong.
Random splitting is not always honest.
| Data type | Bad split | Better split |
|---|---|---|
| medical images | random image split | split by patient |
| video frames | random frame split | split by video / scene |
| recommender logs | random row split | split by user or time |
| documents | random paragraph split | split by document / source |
| sensors | random window split | split by device / location |
Validation should answer the question: "Will this model work on new cases we actually care about?"
Never tune on the test set.
Test = final exam you take once.
Validation = practice exams, take many times.
Training = studying.
A neural network is just a stack of (linear → non-linearity) blocks · the linear layer mixes features, the non-linearity bends the space. Backprop assigns blame layer-by-layer using the chain rule, and SGD updates parameters in the descent direction.
| Step | What happens | Where it came from |
|---|---|---|
| 1 · Forward | composition of (linear → non-linearity) layers | this lecture |
| 2 · Loss | NLL of the chosen conditional distribution | L00 |
| 3 · Backward | chain rule on the comp. graph | this lecture |
| 4 · Update | gradient step on every parameter | SGD |
Sigmoid + BCE = L00's logistic-MLE, just with a hidden layer in front. Softmax + CE = L00's multiclass NLL. Same probabilistic story, more layers.
Try these on paper; verify with the notebooks.
P1. A 1-hidden-layer MLP for MNIST has hidden width $H$. Count its parameters as a function of $H$, and check your formula against the $H = 256$ example from the lecture.
P2. Show that for sigmoid, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, and that its maximum value is $0.25$ at $z = 0$.
P3. Build a 2-hidden-unit ReLU MLP that computes the AND function ($y = 1$ iff $x_1 = x_2 = 1$) by choosing the weights by hand, as we did for XOR.
P4. A 3-class classifier outputs logits $(2.0,\ 1.0,\ 0.1)$ and the true class is 0. Compute the softmax probabilities, the cross-entropy loss, and the gradient $\mathbf{p} - \mathbf{t}$.
P5. Why does PyTorch's CrossEntropyLoss take logits rather than probabilities as input? Give two reasons (numerical and gradient-related).
P6. A 10-layer sigmoid network has weights initialized as small random values. Using the $\le 0.25$ bound on $\sigma'$, give an upper bound on the gradient magnitude reaching the first layer, and explain why that layer barely learns.