What a single hidden layer can — and can't — do
Interactive: grow a 1-hidden-layer net and watch it fit a target curve — universal-approximation.
Imagine an unlimited supply of LEGO bricks. Can you build a sculpture of anything? A car, a house, the Eiffel Tower? Yes · if your bricks are small enough, you can approximate any shape.
UAT says · a neural network with one hidden layer can do the same for mathematical functions. Its "LEGO bricks" are simple functions built from neurons.
"For any continuous function
The "true" relationship in our data · e.g., price = f(house_features).
"...and any small error
How close we want our approximation. Set
"...there exists a network..."
With one hidden layer of
"...whose output is within
A weighted sum of neuron outputs · each neuron is a "LEGO brick."
Theorem (Cybenko 1989 · Hornik 1991 · Leshno 1993)
For any continuous f on a compact set (say [0, 1]^n) and any ε > 0, there is a one-hidden-layer network g with |g(x) − f(x)| < ε for every x in the set —
for any non-polynomial activation (Leshno's version; Cybenko proved the sigmoid case).
One hidden layer suffices. The catch hides in one word: exist.
We won't write a full proof — but the structure is short and worth knowing.
Move 1 · Approximate continuous functions by step functions. Any continuous function on a compact set can be approximated uniformly by finitely many steps.
Move 2 · Approximate step functions by sums of sigmoids. A sigmoid σ(k(x − b)) with large k is as close to a step at b as we like.
Move 3 · Density via Stone-Weierstrass / Hahn-Banach. The finite sums Σ aᵢ σ(wᵢ·x + bᵢ) are dense among continuous functions on the compact set: if they were not, Hahn-Banach would give a nonzero functional vanishing on all of them, which the non-polynomial activation rules out.
Bottom line · UAT is not a constructive recipe — it's a density theorem. It says good weights exist; it says nothing about how many neurons are needed or how to find those weights.
f(x) = x² with 4 ReLUs · in numbers. Pick 4 ReLU pieces relu(w·(x − b)), one turning on at each knot, with slope w chosen so the sum matches x² at the knots:
| ReLU | turns on at | next knot | bias b |
|---|---|---|---|
| 1 | 0.0 | 0.25 | 0.0 |
| 2 | 0.25 | 0.50 | 0.25 |
| 3 | 0.50 | 0.75 | 0.50 |
| 4 | 0.75 | 1.00 | 0.75 |
The output is a piecewise-linear staircase that hugs the parabola x² across [0, 1].
That's UAT in numbers. A weighted sum of ReLU bumps approximates any 1D continuous function.
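Here is the same construction as runnable code · a minimal sketch where the slopes are recomputed by interpolating x² at the knots 0, 0.25, 0.5, 0.75 (so the numbers are derived here, not taken verbatim from the table above):

import torch

def relu_approx_x2(x):
    # Sum of 4 ReLU ramps that interpolates f(x) = x^2 at the knots 0, 0.25, 0.5, 0.75, 1.
    # Each ReLU turns on at a knot; its slope is the *change* in secant slope at that knot.
    knots = [0.0, 0.25, 0.5, 0.75, 1.0]
    y = torch.zeros_like(x)
    prev_slope = 0.0
    for k0, k1 in zip(knots[:-1], knots[1:]):
        secant = (k1**2 - k0**2) / (k1 - k0)      # slope of x^2 on [k0, k1]
        y = y + (secant - prev_slope) * torch.relu(x - k0)
        prev_slope = secant
    return y

x = torch.linspace(0, 1, 101)
print((relu_approx_x2(x) - x**2).abs().max())     # ~0.016: worst-case gap with only 4 ReLUs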
A single ReLU is a ramp going up forever. To make it come back down to zero we need three ReLUs.
| x | relu(x − 1) | −2·relu(x − 2) | relu(x − 3) | sum |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1.5 | 0.5 | 0 | 0 | 0.5 |
| 2 | 1 | 0 | 0 | 1.0 (peak) |
| 2.5 | 1.5 | -1 | 0 | 0.5 |
| 4 | 3 | -4 | 1 | 0 |
A perfect triangular bump centered at x = 2 · zero before x = 1, peak of 1 at x = 2, back to exactly zero at x = 3 and beyond.
A single ReLU is a ramp (in higher dimensions, a tilted half-plane). Combine three ReLUs with weights 1, −2, 1 · you get a triangular bump of any width and height.
This is 0 outside its interval, so weighted, shifted copies of it can be stacked to match any target.
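The same bump in a few lines of code, reproducing the table's rows:

import torch

def bump(x):
    # relu(x-1) - 2*relu(x-2) + relu(x-3): zero outside [1, 3], peak of 1 at x = 2.
    return torch.relu(x - 1) - 2 * torch.relu(x - 2) + torch.relu(x - 3)

print(bump(torch.tensor([0.0, 1.5, 2.0, 2.5, 4.0])))   # 0.0, 0.5, 1.0, 0.5, 0.0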
UAT's existence proof essentially tiles the function space with bumps. A network finds these bumps automatically through gradient descent. Existence is given by the construction; training is the open problem.
Piecewise-linear approximation of the target curve with 1, 10, and 100 hidden neurons · each jump in width adds more linear pieces and shrinks the worst-case error.
UAT says good weights exist. Not that SGD finds them. Not that the network generalizes. Not that the required width is practical.
Learnability.
Existence of good weights ≠ SGD finding them.
Width.
The bound on the number of hidden neurons can be astronomically large — in the worst case it grows exponentially with the input dimension.
Generalization.
A network that memorizes the training points can be ε-close on those points and still be wrong everywhere else.
Pop quiz. True or false: UAT guarantees a 1-hidden-layer net will perform well on unseen data, given enough neurons and data.
False — for three independent reasons.
In practice, depth is far more parameter-efficient than width.
Why deeper is (usually) better
Q. On the same classification task, compare a shallow-wide MLP (one big hidden layer) with a deep-narrow MLP (several small hidden layers).
Which has more parameters?
| Architecture | Parameters |
|---|---|
| Shallow-wide · one hidden layer | ~1.63 M |
| Deep-narrow · several hidden layers | ~0.22 M |
Shallow-wide has 7× more parameters, but deep-narrow typically wins on natural data.
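A sketch of the count · the specific layer sizes below are assumptions chosen to land near the table's totals (784-dimensional input, 10 classes), not the exact architectures behind it:

import torch.nn as nn

def n_params(model):
    return sum(p.numel() for p in model.parameters())

shallow_wide = nn.Sequential(nn.Linear(784, 2048), nn.ReLU(), nn.Linear(2048, 10))
deep_narrow = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
print(f"shallow-wide: {n_params(shallow_wide) / 1e6:.2f} M")   # ~1.63 M
print(f"deep-narrow:  {n_params(deep_narrow) / 1e6:.2f} M")    # ~0.22 M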
Parameter count is a crude proxy for capacity. What depth gives you, width alone cannot: reusable hierarchy.
Depth helps when the target function has reusable substructure.
| Domain | Reusable structure | What deeper layers reuse |
|---|---|---|
| images | edges → textures → parts → objects | local visual motifs |
| language | characters → words → phrases → discourse | compositional meaning |
| audio | samples → phonemes → syllables → words | local temporal patterns |
| tabular data | often weaker hierarchy | depth may help less |
Depth is not magic. It is a strong inductive bias for compositional problems. If the data do not have reusable structure, a deeper model can simply be harder to optimize.
Parity is recursive — XOR of pairs, then XOR of pair-pairs, then XOR of those.
Depth matches the structure of the problem: a tree of pairwise XORs computes n-bit parity in log₂ n levels, each level reusing the outputs of the level below.
Formal proof: Telgarsky, "Benefits of Depth in Neural Networks," COLT 2016. We take the statement on faith today.
Theorem (Telgarsky 2016, simplified) · There exists a function computed exactly by a narrow ReLU network whose depth grows with k, such that any much shallower network needs width exponential in k to approximate it.
The witness function is the sawtooth obtained by composing a triangle ("tent") map with itself k times.
Each composition doubles the number of "teeth" — k compositions give on the order of 2^k linear pieces, and a shallow network must pay for those pieces neuron by neuron.
Depth is exponentially more parameter-efficient than width for problems with compositional / recursive structure. Real-world data (images, language) is richly compositional · so depth pays off in practice.
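You can watch the doubling numerically · the snippet below composes a tent map with itself and counts the teeth; it is a sketch of the idea, not Telgarsky's exact construction:

import numpy as np

def tent(x):
    # Triangle ("tent") map: a single tooth on [0, 1].
    return 1 - np.abs(2 * x - 1)

x = np.linspace(0, 1, 100_001)
y = x.copy()
for depth in range(1, 6):
    y = tent(y)                                      # one more composition
    above = (y > 0.5).astype(int)
    teeth = int(np.abs(np.diff(above)).sum()) // 2   # crossings of 0.5, two per tooth
    print(f"{depth} compositions: {teeth} teeth")    # 1, 2, 4, 8, 16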
Q. If depth is so great, should we train a 500-layer plain MLP?
Give an honest answer; then read on.
Problem 1 — vanishing gradients (from L1).
Product of many sub-1 Jacobians → zero.
Problem 2 — the degradation problem.
Something strictly more surprising. Next section.
Before we fix depth, let's see it break
A 4-layer scalar network · y = w_4 · w_3 · w_2 · w_1 · x (ignoring activations).
We want the gradient of the loss L with respect to the first weight, ∂L/∂w_1.
Expand · y = w_4 · w_3 · w_2 · w_1 · x, so ∂y/∂w_1 = w_4 · w_3 · w_2 · x.
So · ∂L/∂w_1 = (∂L/∂y) · w_4 · w_3 · w_2 · x.
The key is the product w_4 · w_3 · w_2 · every extra layer adds another factor. Factors below 1 shrink the gradient exponentially with depth; factors above 1 blow it up.
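A quick autograd check of exactly this (the weight value 0.5 and input 2.0 are arbitrary):

import torch

w1, w2, w3, w4 = (torch.tensor(0.5, requires_grad=True) for _ in range(4))
x = torch.tensor(2.0)
y = w4 * w3 * w2 * w1 * x            # the 4-layer "network" with no activations
loss = 0.5 * y**2
loss.backward()
print(w1.grad.item())                # 0.03125
print((y * w4 * w3 * w2 * x).item()) # same number: (dL/dy) · w4·w3·w2 · x, with dL/dy = y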
Sigmoid derivative · σ′(z) = σ(z)(1 − σ(z)), which peaks at 0.25 (at z = 0).
Each backward step multiplies the gradient by σ′(z) ≤ 0.25 (times a weight).
| Layer (counting back) | Cumulative factor |
|---|---|
| L | 0.25 |
| L−1 | 0.0625 |
| L−2 | ≈ 0.016 |
| L−5 | ≈ 2 × 10⁻⁴ |
| L−10 | ≈ 2 × 10⁻⁷ |
After 10 sigmoid layers, the gradient signal at the earliest weight is shrunk by a million. The first layer barely updates · learning stalls.
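You can measure the shrinkage in a few lines · stack 10 scalar sigmoids with no weights at all, so each backward step multiplies by σ′(h) ≤ 0.25:

import torch

x = torch.tensor(0.0, requires_grad=True)
h = x
for _ in range(10):
    h = torch.sigmoid(h)       # 10 sigmoid "layers", best case (no weights to shrink it further)
h.backward()
print(x.grad.item())           # ≈ 4e-7 — more than a millionfold shrinkage over 10 layers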
This is the core problem ResNets and ReLU were invented to solve.
Interactive: stack sigmoids and watch the gradient evaporate with depth — vanishing-gradients.
The most important architectural idea since backprop
Train plain CNNs of increasing depth on CIFAR-10 — same optimizer, same init, just more layers.
Q. What do you expect? More layers → more capacity → better, right?
If this were overfitting, training error would drop with depth (more capacity to memorize) and test error would rise (poor generalization).
Instead, training error went up. The deeper net cannot even fit the training data.
This is an optimization problem, not a capacity problem.
Compare the signatures:
| Failure mode | Training error | Validation / test error | Main diagnosis |
|---|---|---|---|
| Underfitting | high | high | model or training too weak |
| Overfitting | low | high | generalization failure |
| Degradation | higher for deeper model | higher for deeper model | optimization failure |
The ResNet paper mattered because it exposed a surprising fact: simply adding layers can make the training objective itself harder, even though the deeper model has more representational capacity.
Imagine steering a car by giving the wheel an absolute angle (e.g., "set wheel to 27 degrees from zero"). Hard to do.
Instead, you say "turn a little right" or "stay straight." Much easier.
ResNets do the same · they reframe each layer's job from "output the right thing" to "output a small change to your input." The default action — change nothing, pass input through — is now trivial. The network only learns deviations from identity, which is much easier for SGD.
He et al. 2015 — don't ask the block to learn the full mapping H(x); ask it to learn the residual F(x) = H(x) − x and output F(x) + x.
If the optimum is close to identity, we only need F ≈ 0 — and weights near zero give that for free.
Coordinating non-linear layers to produce exact identity is the opposite: a delicate balance with no prior.
The skip connection turns a hard default into a free default.
The original ResNet block was not just "add a skip connection." It used a stack like:
| Component | What it helps with |
|---|---|
| Skip connection | direct signal and gradient path |
| BatchNorm | stable activation scale across layers |
| ReLU | nonlinearity with active-side gradient 1 |
| He initialization | variance scale matched to ReLU |
Do not learn the wrong lesson: ResNets train because several engineering choices work together. The skip connection is the central idea, but scale control is part of the system.
Forward · y = F(x) + x, where F is everything inside the block.
We want the gradient flowing back through the block · ∂L/∂x, given the upstream gradient ∂L/∂y.
By the chain rule · ∂L/∂x = ∂L/∂y · ∂y/∂x.
Compute the local gradient: ∂y/∂x = ∂F/∂x + I.
So · ∂L/∂x = ∂L/∂y · ∂F/∂x + ∂L/∂y.
The "+I" gives an uninterrupted express lane for the gradient · the second term is the original signal passing straight through even if the first term vanishes. The early layers always get a clean signal.
Same 5-layer net. Each layer's local gradient factor is 0.1 · the plain net multiplies by 0.1 per layer; the residual net multiplies by (1 + 0.1), because the identity adds 1 to every factor.
| Layer | Plain (multiplicative) | ResNet (additive) |
|---|---|---|
| L (output) | 1.0 | 1.0 |
| L−1 | 0.1 | 1.1 |
| L−2 | 0.01 | 1.21 |
| L−3 | 0.001 | 1.33 |
| L−4 | 0.0001 | 1.46 |
The plain net's signal vanishes · ResNet keeps the full unit signal plus a small contribution from each block. Hundred-layer ResNets train; 100-layer plain nets don't.
Even if ∂F/∂x is tiny — or exactly zero — the identity term still hands the full upstream gradient to the layers below.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.block(x) + x)  # ← the skip
Q. What constraint does self.block(x) + x impose on dimensions?
The addition requires matching shapes. If the block changes the width, channel count, or spatial resolution, the skip path must also transform x so the two tensors can be added:
class ResidualBlock(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(d_in, d_out), nn.ReLU(),
            nn.Linear(d_out, d_out),
        )
        self.skip = nn.Identity() if d_in == d_out else nn.Linear(d_in, d_out)

    def forward(self, x):
        return torch.relu(self.block(x) + self.skip(x))
In CNNs this is often a 1×1 convolution (with stride, when the block downsamples) on the skip path.
Later ResNets moved normalization and activation before the weight layers:
| Block style | Formula sketch | Why it matters |
|---|---|---|
| Post-activation | y = ReLU(x + F(x)) | original ResNet |
| Pre-activation | y = x + F(ReLU(BN(x))) | cleaner identity path |
Pre-activation keeps the skip path as close to a pure identity as possible. This makes very deep residual networks easier to optimize because the gradient highway is less obstructed.
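A sketch of a pre-activation block with the same ingredients as before, reordered so nothing is applied after the addition (fully-connected for simplicity):

import torch.nn as nn

class PreActResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Norm and ReLU come *before* each weight layer; the skip path stays untouched.
        self.block = nn.Sequential(
            nn.BatchNorm1d(dim), nn.ReLU(), nn.Linear(dim, dim),
            nn.BatchNorm1d(dim), nn.ReLU(), nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.block(x)              # no activation after the sum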
Transformers use the same idea in another form: residual stream plus normalization around each block.
| Year | Model | Depth | ImageNet top-5 error |
|---|---|---|---|
| 2012 | AlexNet | 8 | 16.4% |
| 2014 | VGG-19 | 19 | 7.3% |
| 2014 | GoogLeNet | 22 | 6.7% |
| 2015 | ResNet-152 | 152 | 3.6% |
Skip connections are now in virtually every modern architecture — CNNs, Transformers, diffusion U-Nets.
From first principles
Keep activations — and gradients — at roughly constant variance across layers.
Initialization theory is useful because it gives a concrete diagnostic: activation statistics by layer.
For one batch, log:
| Quantity | Healthy early signal | Red flag |
|---|---|---|
| activation mean | near 0 for normalized layers | large drift |
| activation std | roughly stable across depth | shrinks to 0 or explodes |
| gradient norm | nonzero in early layers | first layers near 0 |
| fraction ReLU active | neither 0% nor 100% | many dead units |
for name, p in model.named_parameters():
    if p.grad is not None:
        print(name, p.grad.norm().item())
Before changing architecture, check whether the signal survives the forward pass and the gradient survives the backward pass.
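The gradient loop above covers the backward side of the table; forward hooks cover the activation side (sketch · `model` and one input batch are assumed to exist):

import torch.nn as nn

def log_activation_stats(model):
    # Print per-layer activation mean / std on the next forward pass.
    def make_hook(name):
        def hook(module, inputs, output):
            print(f"{name}: mean={output.mean().item():+.3f}  std={output.std().item():.3f}")
        return hook
    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.ReLU)):
            module.register_forward_hook(make_hook(name))

# log_activation_stats(model); model(batch)   # then read the stats layer by layer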
If we initialize every weight from N(0, 1), ignoring layer width, what happens as the network gets deep?
Each layer multiplies by a matrix of N(0,1) values. Activation magnitudes either grow exponentially with depth (explode) or shrink exponentially (vanish) depending on shape.
Same product, going backwards. A single bad per-layer scale compounds → all early layers train at effectively zero speed (or blow up).
Symptom · loss is NaN at step 1, or loss sits flat with every weight stuck. If the training loop is otherwise correct, this signature points squarely at initialization.
The variance argument on the next slide gives a one-line fix · scale the init by 1/√fan_in (with an extra factor for ReLU).
Suppose · a 10-layer sigmoid network. The sigmoid derivative maxes out at 0.25.
Even at the best operating point, the gradient through one layer is multiplied by at most 0.25.
After 10 layers · 0.25¹⁰ ≈ 10⁻⁶.
After 20 layers · 0.25²⁰ ≈ 10⁻¹².
The first layer's effective learning rate is a million times smaller than the last layer's. It barely updates · network never learns features in early layers.
This is the practical reason ReLU (derivative = 0 or 1) replaced sigmoid in deep nets · it doesn't shrink the gradient by a factor every layer.
Layer: y = W x, with W an m × n matrix (fan_in = n, fan_out = m), so y_i = Σ_j W_ij x_j.
For independent zero-mean weights with variance σ² and inputs with variance Var(x):
Therefore: Var(y_i) = n · σ² · Var(x).
To preserve variance we need n · σ² = 1.
Forward: σ² = 1 / fan_in keeps activations at constant scale.
Backward: the same argument on the gradients gives σ² = 1 / fan_out.
Xavier / Glorot (2010) — compromise: σ² = 2 / (fan_in + fan_out).
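The variance identity is easy to confirm numerically (widths and sample count below are arbitrary):

import torch

n, m = 512, 256                        # fan_in, fan_out
sigma2 = 1.0 / n                       # the forward condition: sigma^2 = 1 / fan_in
W = torch.randn(m, n) * sigma2**0.5
x = torch.randn(10_000, n)             # unit-variance inputs
y = x @ W.T
print(x.var().item(), y.var().item())  # both ≈ 1.0 — variance preserved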
ReLU zeros half the pre-activations: for zero-mean symmetric y, E[ReLU(y)²] = Var(y) / 2.
To keep this constant across layers: n · σ² / 2 = 1.
He / Kaiming (2015): σ² = 2 / fan_in.
Factor of 2 compensates the ReLU halving.
10-layer fully-connected ReLU net, same width at every layer · push one batch through and log the activation scale per layer.
Naive init (weights from N(0, 1)) · the activation scale blows up by a large factor at every layer.
He init · the activation scale stays roughly constant through all 10 layers.
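A sketch of that experiment (width 512 and the batch size are assumptions; any moderate size shows the same pattern):

import torch

def act_scale(depth=10, width=512, w_std=None):
    # Push one batch through `depth` plain ReLU layers and print the activation std per layer.
    x = torch.randn(1024, width)
    for i in range(depth):
        std = w_std if w_std is not None else (2.0 / width) ** 0.5   # default: He, sqrt(2/fan_in)
        W = torch.randn(width, width) * std
        x = torch.relu(x @ W.T)
        print(f"layer {i + 1:2d}: activation std = {x.std().item():.3g}")

act_scale(w_std=1.0)    # naive N(0, 1): the scale grows ~16x per layer — explodes
act_scale()             # He init: the scale stays roughly constant across all 10 layers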
This is why initialization is not optional. Bad init → loss is NaN at step 1, or weights barely move from where they started.
# Default for nn.Linear — Kaiming uniform (sensible for ReLU)
model = nn.Linear(784, 256)
# Explicit He init
nn.init.kaiming_normal_(model.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(model.bias)
# Explicit Xavier (for sigmoid/tanh)
nn.init.xavier_normal_(model.weight)
Rule of thumb · ReLU family → He, sigmoid / tanh → Xavier.
Pop quiz. You build a 10-layer MLP with Tanh activations and He initialization. Loss oscillates, activations saturate. Why?
He doubles the variance to compensate for ReLU's halving. Tanh doesn't halve — He hands it too large a variance → activations pile up at ±1, where tanh's gradient is nearly zero.
Fix: Xavier.
| Activation | Init |
|---|---|
| ReLU, Leaky ReLU | He |
| Sigmoid, Tanh | Xavier |
| GELU, SiLU | He (convention) |
Depth is mathematically expressive (UAT, Telgarsky), but practically fragile.
Three forces conspire against naive deep nets · vanishing/exploding gradients, vanishing/exploding activations, and the degradation problem even when both are tame.
| Symptom | Root cause | Fix introduced |
|---|---|---|
| Gradient → 0 in early layers | Saturating activations multiply by sub-1 derivatives every layer | ReLU family · skip connections |
| Activation variance blows up / shrinks | Wrong init scale per layer | Xavier (tanh) · He (ReLU) |
| Deeper net is worse at training loss | Optimization, not capacity | Residual blocks |
The single insight underneath all three fixes · make every layer easy to leave alone. Identity-friendly initialization, identity skip-connections, and activations that pass gradients through unchanged in their linear regime. That's what made depth practical in 2015 and onward.
P1. UAT says good weights exist for a wide-enough single hidden layer. State three distinct things the theorem does not guarantee, and say why each matters in practice.
P2. A 5-layer plain MLP with sigmoid activations is failing to train. Without changing the architecture, name two changes that would help and explain why each works.
P3. Show that for a ResNet block y = F(x) + x, the gradient ∂L/∂x contains the term ∂L/∂y even when ∂F/∂x = 0.
P4. A 100-layer plain ReLU MLP with He initialization still reaches a worse training loss than a 20-layer one. Which failure mode is this, and what architectural change fixes it?
P5. Telgarsky's separation says depth can be exponentially more efficient than width: a deep, narrow net represents the iterated-triangle sawtooth exactly, while a shallow net needs exponentially many units to approximate it. Explain informally where the exponential cost comes from.
P6. You replace ReLU with leaky ReLU (negative slope α). Does the He variance argument still apply? What, if anything, changes?