Universal Approximation & Going Deep

Lecture 2 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Learning outcomes

By the end of this lecture you will be able to:

  1. State the Universal Approximation Theorem and cite its caveats.
  2. Explain why depth beats width in practice despite theoretical equivalence.
  3. Diagnose vanishing / exploding gradients in deep nets.
  4. Apply residual connections to train 100+ layer networks.
  5. Pick weight init (Xavier / He) based on activation.
  6. Articulate three practical limits UAT does not address.
  7. Separate expressivity, optimization, and generalization claims.
  8. Explain projection shortcuts and pre-activation residual blocks.

Recap · where we left off

  • MLPs are stacks of affine + non-linearity (L1).
  • Backprop is chain rule run right-to-left (L1).
  • Logistic / softmax classification = 1-layer MLP under categorical NLL (L0 + L1).
  • Sigmoids vanish; ReLU un-blocks depth (L1, end).
  • One hidden layer can approximate anything (UAT) — we need to make that precise today.

Today maps to UDL Ch 4 (deep networks), Ch 7 (gradients & init), Ch 11 (residual networks). Read these three chapters before or after — whichever works for you.

Today's question · if one hidden layer is universal, why do we ever go deeper? Three answers · width is exponential, depth is hierarchical, and depth is trainable (with the right tools).

Three axes we must not mix up

Deep learning progress depends on three different questions.

Axis Question L02 answer
Expressivity Can this architecture represent the function? UAT and depth separation
Optimization Can SGD find useful weights? ReLU, residuals, initialization
Generalization Will it work on unseen data? not guaranteed by UAT

A network can be expressive but untrainable. It can be trainable but overfit. It can fit train/val and still fail under distribution shift. Keep these axes separate in every DL paper you read.

Pop quiz · two architectures, same parameter budget

You have a budget of roughly 10⁴ parameters to spend on a regression task with 1-D input.

(a) Wide-and-shallow · 1 hidden layer with 5,000 ReLU units.
(b) Tall-and-thin · 50 hidden layers with 14 ReLU units each.

Which one would you bet on for fitting a complex, rapidly oscillating target function?

Stop and decide. We'll come back to this exact question when we hit Telgarsky's separation — your gut answer should change once you've seen the proof.

This is L02's central tension · width is universal but expensive, depth is exponentially more efficient when it can be trained. Today we earn both halves of that sentence.

PART 1

Universal approximation

What a single hidden layer can — and can't — do

Build a bump from two ReLUs

▶ Interactive: grow a 1-hidden-layer net and watch it fit a target curve — universal-approximation.

The "LEGO brick" idea · UAT in plain English

Imagine an unlimited supply of LEGO bricks. Can you build a sculpture of anything? A car, a house, the Eiffel Tower? Yes · if your bricks are small enough, you can approximate any shape.

UAT says · a neural network with one hidden layer can do the same for mathematical functions. Its "LEGO bricks" are simple functions built from neurons.

UAT · unpacking the statement

"For any continuous function ..."
The "true" relationship in our data · e.g., price = f(house_features).

"...and any small error ..."
How close we want our approximation. Set or whatever you need.

"...there exists a network..."
With one hidden layer of N neurons. The theorem guarantees · for any ε, some N exists.

"...whose output is within of ."

A weighted sum of neuron outputs · each neuron is a "LEGO brick."

UAT · the formal statement

Theorem (Cybenko 1989 · Hornik 1991 · Leshno 1993)

For any continuous f : [0, 1]ᵈ → ℝ and any ε > 0, there exist N and weights {vᵢ, wᵢ, bᵢ} such that

| f(x) − Σᵢ₌₁ᴺ vᵢ σ(wᵢ · x + bᵢ) | < ε for all x,

for any non-polynomial activation σ — including ReLU.

One hidden layer suffices. The catch hides in one word: exist.

UAT · the proof in three moves

We won't write a full proof — but the structure is short and worth knowing.

Move 1 · Approximate continuous functions by step functions. Any continuous f on the compact cube [0, 1]ᵈ is uniformly continuous (Heine–Cantor). So for any ε we can partition [0, 1]ᵈ into cells small enough that f varies by less than ε inside each cell. Replace f by its average value on each cell · we get a step function within ε of f.

Move 2 · Approximate step functions by sums of sigmoids. A sigmoid σ(a(x − b)) with large a is essentially a step at x = b. A bump on [b₁, b₂] is a difference of two such "near-steps." Each cell of the step function ⇒ one bump in the network. K cells → O(K) hidden units.

Move 3 · Density via Stone–Weierstrass / Hahn–Banach. The space of finite sums Σᵢ vᵢ σ(wᵢ · x + bᵢ) is dense in C([0, 1]ᵈ) — a non-trivial functional-analysis theorem. Combined with Moves 1 and 2, this gives a bound of order N ~ (1/ε)ᵈ.

Bottom line · UAT is not a constructive recipe — it's a density theorem. It says good weights exist; it says nothing about the size of N, about generalization, or about whether SGD finds those weights.

Worked example · approximate f(x) = x² with 4 ReLUs

Worked example · approximate f(x) = x² with 4 ReLUs · numbers

Pick 4 ReLU ramps on [0, 1], turning on at b = 0, 0.25, 0.50, 0.75. The output is f̂(x) = Σᵢ wᵢ · relu(x − bᵢ), with each weight wᵢ chosen so the total slope on a segment matches the secant slope of x².

ReLU i · turns on at bᵢ · covers up to · weight wᵢ
1 · 0.00 · 0.25 · 0.25
2 · 0.25 · 0.50 · 0.50
3 · 0.50 · 0.75 · 0.50
4 · 0.75 · 1.00 · 0.50

(Check · the secant slopes of x² on the four segments are 0.25, 0.75, 1.25, 1.75; each wᵢ is the increment over the previous slope.)

The output is a piecewise-linear curve that hugs x². With N ReLUs the mesh gets finer · the error falls like O(1/N²) for smooth targets.

That's UAT in numbers. A weighted sum of ReLU bumps approximates any 1D continuous function.
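
A minimal NumPy sketch of this construction (the breakpoints and weights are the ones in the table; the function names are my own):

import numpy as np

# breakpoints where each ReLU turns on, and the slope each one adds,
# chosen so the summed slope matches the secant slopes of x^2
bs = np.array([0.00, 0.25, 0.50, 0.75])
ws = np.array([0.25, 0.50, 0.50, 0.50])

def f_hat(x):
    # f_hat(x) = sum_i w_i * relu(x - b_i)
    return np.maximum(0.0, x[:, None] - bs) @ ws

x = np.linspace(0, 1, 101)
print(np.abs(f_hat(x) - x**2).max())   # ~0.0156 — the piecewise-linear interpolation error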

Building a triangle bump · step-by-step

A single ReLU is a ramp going up forever. To make it come back down to zero we need three ReLUs.

x · relu(x − 1) · −2·relu(x − 2) · relu(x − 3) · sum
0 · 0 · 0 · 0 · 0
1.5 · 0.5 · 0 · 0 · 0.5
2 · 1 · 0 · 0 · 1.0 (peak)
2.5 · 1.5 · −1 · 0 · 0.5
4 · 3 · −4 · 1 · 0

A perfect triangular bump centered at x = 2. Place enough such bumps and you can approximate any continuous function. This is what UAT proves.
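
The same bump, checked in a few lines of NumPy (a sketch · the helper names are mine):

import numpy as np

relu = lambda z: np.maximum(0.0, z)
bump = lambda x: relu(x - 1) - 2 * relu(x - 2) + relu(x - 3)   # triangle, peak 1 at x=2

for x in (0.0, 1.5, 2.0, 2.5, 4.0):
    print(x, bump(x))   # 0, 0.5, 1.0, 0.5, 0 — matches the table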

Two-ReLU bumps · the real building block

A single ReLU is an unbounded ramp. Subtract two ReLUs · g(x) = relu(x − a) − relu(x − b), with a < b · and you get a ramp that saturates, of any width and height.

This is 0 for x ≤ a, rises linearly on [a, b], and plateaus at height b − a beyond. Differences of such plateaus carve out localized bumps · place a bump on each fine slice and you can build any continuous function.

UAT's existence proof essentially tiles the input space with bumps. Trained networks discover similar structure through gradient descent — but existence is given by the construction; finding it by training is the open problem.

The price of width — curse of dimensionality

Piecewise-linear approximation of f to error ε:

  • 1D: N ~ 1/ε
  • D dimensions: N ~ (1/ε)ᴰ

D · neurons needed (at ε = 0.1)
1 · ~10
10 · ~10¹⁰
100 · ~10¹⁰⁰

UAT says good weights exist. Not that SGD finds them. Not that the network generalizes. Not that N is reasonable.

Three things UAT does not promise

Learnability.
Existence of good weights ≠ SGD finding them.

Width.
The bound on N can be astronomical.

Generalization.
A network that memorizes training points also satisfies UAT. Works on train, fails on test.

Pop quiz. True or false: UAT guarantees a 1-hidden-layer net will perform well on unseen data, given enough neurons and data.

Pop quiz · answer

False — for three independent reasons.

  1. Existence ≠ findability.
  2. "Enough neurons" can mean exponentially many.
  3. UAT says nothing about generalization.

In practice, depth is far more parameter-efficient than width.

PART 2

Depth vs width

Why deeper is (usually) better

Shallow enumerates, deep reuses

A parameter-budget exercise

Q. On a 784-input, 10-class classifier (MNIST-sized):

  • Wide shallow: 1 hidden layer of 2048 units
  • Deep narrow: 8 hidden layers of 128 units

Which has more parameters?

Answer — and the twist

Architecture · Parameters
Wide shallow · 784 → 2048 → 10 · ~1.63 M
Deep narrow · 784 → 128 (×8) → 10 · ~0.22 M

Shallow-wide has 7× more parameters, but deep-narrow typically wins on natural data.

Parameter count is a crude proxy for capacity. What depth gives you, width alone cannot: reusable hierarchy.
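
A quick sanity check of both counts (a sketch · the 784-dim input and 10 classes are inferred from the totals above):

def mlp_params(dims):
    # weights + biases for a fully-connected stack with the given layer sizes
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims[:-1], dims[1:]))

wide = mlp_params([784, 2048, 10])            # 1,628,170  ≈ 1.63 M
deep = mlp_params([784] + [128] * 8 + [10])   #   217,354  ≈ 0.22 M
print(wide, deep, round(wide / deep, 1))      # ratio ≈ 7.5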

When depth helps — and when it doesn't

Depth helps when the target function has reusable substructure.

Domain · Reusable structure · What deeper layers reuse
images · edges → textures → parts → objects · local visual motifs
language · characters → words → phrases → discourse · compositional meaning
audio · samples → phonemes → syllables → words · local temporal patterns
tabular data · often weaker hierarchy · depth may help less

Depth is not magic. It is a strong inductive bias for compositional problems. If the data do not have reusable structure, a deeper model can simply be harder to optimize.

Parity — the canonical case

Why depth helps for parity

Parity is recursive — XOR of pairs, then XOR of pair-pairs, then XOR of those.

Depth matches the structure of the problem:

  • A tree of XORs · O(log n) layers, O(n) gates for n-bit parity.
  • Shallow has no hierarchy to exploit — it must enumerate exponentially many input patterns.

Formal proof: Telgarsky, "Benefits of Depth in Neural Networks," COLT 2016. We take the statement on faith today.

Depth-vs-width · the formal separation

Theorem (Telgarsky 2016, simplified) · There exists a function representable by a ReLU network of depth O(k³) and constant width, such that any ReLU network of depth O(k) approximating it within constant error needs at least 2ᵏ units.

The witness function is the k-fold composition of the sawtooth m(x) = 2x on [0, ½], 2(1 − x) on [½, 1] ·

Each composition doubles the number of "teeth" — depth k gives 2ᵏ⁻¹ teeth using O(k) ReLUs. A shallow net must enumerate every tooth · exponential width.

Depth is exponentially more parameter-efficient than width for problems with compositional / recursive structure. Real-world data (images, language) is richly compositional · so depth pays off in practice.
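
A sketch of the doubling, with the sawtooth written as a two-ReLU network (grid size and names are my choices):

import numpy as np

def m(y):
    # sawtooth on [0, 1] as a two-ReLU net: m(y) = 2*relu(y) - 4*relu(y - 0.5)
    return 2 * np.maximum(0.0, y) - 4 * np.maximum(0.0, y - 0.5)

h = np.linspace(0.0, 1.0, 10001)
for k in range(1, 5):
    h = m(h)                                               # k compositions = depth k
    teeth = np.sum((h[1:-1] > h[:-2]) & (h[1:-1] > h[2:])) # count strict local maxima
    print(k, teeth)                                        # 1, 2, 4, 8 · doubles with depth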

Enough theory — can we just stack layers?

Q. If depth is so great, should we train a 500-layer plain MLP?

Give an honest answer; then read on.

Two things break

Problem 1 — vanishing gradients (from L1).
Product of many sub-1 Jacobians → zero.

Problem 2 — the degradation problem.
Something strictly more surprising. Next section.

PART 3

Vanishing gradients · the full picture

Before we fix depth, let's see it break

Backprop · term-by-term, no Jacobians

A 4-layer scalar network · y = w_4 · w_3 · w_2 · w_1 · x (ignoring activations).

We want the gradient of the loss with respect to the first weight w_1. Chain rule, one link at a time:

∂L/∂w_1 = ∂L/∂y · ∂y/∂w_1

Expand ∂y/∂w_1 from y = w_4 · w_3 · w_2 · w_1 · x:

∂y/∂w_1 = w_4 · w_3 · w_2 · x

So · ∂L/∂w_1 = ∂L/∂y · w_4 · w_3 · w_2 · x

The key is the product w_4 · w_3 · w_2. If any of these factors is small (< 1), the product shrinks fast.
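
A tiny autograd check of this product (a sketch · the weight values are arbitrary):

import torch

x = torch.tensor(2.0)
w1, w2, w3, w4 = (torch.tensor(v, requires_grad=True) for v in (0.5, 0.8, 0.9, 0.7))
y = w4 * w3 * w2 * w1 * x
y.backward()
print(w1.grad)   # = w4*w3*w2*x = 0.7*0.9*0.8*2.0 = 1.008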

Numeric · vanishing with sigmoids

Sigmoid derivative · σ′(z) = σ(z)(1 − σ(z)) · max value 0.25.

Each backward step multiplies by σ′ ≤ 0.25. Assume weights ≈ 1.0:

Layer (counting back) · Cumulative factor
L · 0.25
L−1 · 0.25² = 0.0625
L−2 · 0.25³ ≈ 0.016
L−5 · 0.25⁶ ≈ 2.4 × 10⁻⁴
L−10 · 0.25¹¹ ≈ 2.4 × 10⁻⁷

After 10 sigmoid layers, the gradient signal at the earliest weight is shrunk by a factor of roughly a million. The first layer barely updates · learning stalls.

This is the core problem ResNets and ReLU were invented to solve.
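
The same decay, reproduced with autograd — a minimal sketch · a width-1 "network" with all weights fixed at 1.0:

import torch

depth = 10
x = torch.tensor(0.5, requires_grad=True)
h = x
for _ in range(depth):
    h = torch.sigmoid(h)    # weight 1.0 · each step multiplies the gradient by sigma'(z) <= 0.25
h.backward()
print(x.grad)               # product of 10 sigma' factors · <= 0.25**10 ≈ 1e-6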

The chain rule is a product

Sigmoid's fatal ceiling · 0.25

▶ Interactive: stack sigmoids and watch the gradient evaporate with depth — vanishing-gradients.

Fix #1 · the ReLU family

ReLU's derivative is exactly 1 wherever the unit is active, so stacking layers no longer multiplies in a sub-0.25 factor at every step. Leaky ReLU and friends keep a small nonzero slope on the negative side so no unit is ever fully dead.

PART 4

The degradation problem & ResNets

The most important architectural idea since backprop

The He et al. (2015) experiment

Train plain CNNs of increasing depth on CIFAR-10 — same optimizer, same init, just more layers.

Q. What do you expect? More layers → more capacity → better, right?

The surprise

In He et al.'s experiment, the deeper plain net performed worse than the shallower one — on the training set as well as the test set.

Why this should bother you

If this were overfitting, training error would drop with depth (more capacity to memorize) and test error would rise (poor generalization).

Instead, training error went up. The deeper net cannot even fit the training data.

This is an optimization problem, not a capacity problem.

Degradation is not overfitting

Compare the signatures:

Failure mode · Training error · Validation / test error · Main diagnosis
Underfitting · high · high · model or training too weak
Overfitting · low · high · generalization failure
Degradation · higher for deeper model · higher for deeper model · optimization failure

The ResNet paper mattered because it exposed a surprising fact: simply adding layers can make the training objective itself harder, even though the deeper model has more representational capacity.

The thought experiment

Take a trained shallow net and stack extra layers on top, each initialized to compute the identity. The deeper network now represents exactly the same function — so a deeper net should never need to be worse. If it is worse, the optimizer is failing to find even this trivial solution · plain stacked layers find it surprisingly hard to learn the identity.

Why "learning the change" is easier · steering analogy

Imagine steering a car by giving the wheel an absolute angle (e.g., "set wheel to 27 degrees from zero"). Hard to do.

Instead, you say "turn a little right" or "stay straight." Much easier.

ResNets do the same · they reframe each layer's job from "output the right thing" to "output a small change to your input." The default action — change nothing, pass input through — is now trivial. The network only learns deviations from identity, which is much easier for SGD.

ResNet · the key insight

He et al. 2015 — don't ask the block to learn the full mapping H(x). Ask it to learn the residual:

F(x) = H(x) − x, so the block outputs F(x) + x.

If the optimum is close to identity, we only need F ≈ 0 — which SGD finds trivially.

Why residuals are easier

  • Weight decay pushes F's weights toward zero.
  • Init near zero starts the block at F ≈ 0.
  • Small SGD updates keep it there unless the signal says otherwise.

Coordinating non-linear layers to produce exact identity is the opposite: a delicate balance with no prior.

The skip connection turns a hard default into a free default.

The residual block

BatchNorm in the ResNet story

The original ResNet block was not just "add a skip connection." It used a stack like Conv → BN → ReLU → Conv → BN → add the skip → ReLU:

Component · What it helps with
Skip connection · direct signal and gradient path
BatchNorm · stable activation scale across layers
ReLU · nonlinearity with active-side gradient 1
He initialization · variance scale matched to ReLU

Do not learn the wrong lesson: ResNets train because several engineering choices work together. The skip connection is the central idea, but scale control is part of the system.

How the skip-connection makes a gradient highway

Forward · h_{l+1} = h_l + F(h_l).

We want the gradient flowing back through the block · ∂L/∂h_l.

By the chain rule ·

∂L/∂h_l = ∂L/∂h_{l+1} · ∂h_{l+1}/∂h_l

Compute the local gradient:

∂h_{l+1}/∂h_l = I + ∂F/∂h_l

So ·

∂L/∂h_l = ∂L/∂h_{l+1} + ∂L/∂h_{l+1} · ∂F/∂h_l

The "I +" gives an uninterrupted express lane for the gradient · the first term is the original signal passing straight through even if the ∂F/∂h_l term vanishes. The early layers always get a clean signal.

Numeric · gradient flow with vs without skip

Same 5-layer net. Each layer's ∂F/∂h has a tiny norm, 0.1 (the saturated regime).

Layer Plain (multiplicative) ResNet (additive)
L (output) 1.0 1.0
L−1 0.1 1.1
L−2 0.01 1.21
L−3 0.001 1.33
L−4 0.0001 1.46

The plain net's signal vanishes · ResNet keeps the full unit signal plus a small contribution from each block. Hundred-layer ResNets train; 100-layer plain nets don't.
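
The two columns in a few lines (a sketch that treats Jacobian norms as scalars — optimistic for the additive column, since ||I + J|| can be as small as 1 − ||J||):

j = 0.1                      # per-layer ||dF/dh|| in the saturated regime
plain = resnet = 1.0
for k in range(1, 5):
    plain *= j               # plain net: gradients multiply through each layer
    resnet *= 1.0 + j        # residual: (I + dF/dh) preserves the unit signal
    print(f"L-{k}: plain={plain:.4f}  resnet={resnet:.4f}")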

Skip connections fix gradient flow

Even if ∂F/∂h collapses to zero, the identity survives — a direct path back to every early layer.

Gradient highway · plain vs ResNet

ResNet in PyTorch · 12 lines

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.block(x) + x)    # ← the skip

Q. What constraint does self.block(x) + x impose on dimensions?

Projection shortcuts when shapes change

The addition requires matching shapes. If the block changes width, channels, or spatial resolution, the skip path must also transform x.

class ResidualBlock(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(d_in, d_out), nn.ReLU(),
            nn.Linear(d_out, d_out),
        )
        self.skip = nn.Identity() if d_in == d_out else nn.Linear(d_in, d_out)

    def forward(self, x):
        return torch.relu(self.block(x) + self.skip(x))

In CNNs this is typically a 1×1 convolution, sometimes with stride 2, so both branches land on the same shape before addition.

Pre-activation ResNets

Later ResNets moved normalization and activation before the weight layers:

Block style · Formula sketch · Why it matters
Post-activation · h_{l+1} = ReLU(h_l + F(h_l)) · original ResNet
Pre-activation · h_{l+1} = h_l + F(h_l), with F = BN → ReLU → weights (twice) · cleaner identity path

Pre-activation keeps the skip path as close to a pure identity as possible. This makes very deep residual networks easier to optimize because the gradient highway is less obstructed.

Transformers use the same idea in another form: residual stream plus normalization around each block.
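
A minimal pre-activation variant of the earlier block (a sketch under the same conventions · fixed width, fully-connected):

import torch.nn as nn

class PreActResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # BN and ReLU come BEFORE the weights; nothing touches the skip path
        self.block = nn.Sequential(
            nn.BatchNorm1d(dim), nn.ReLU(), nn.Linear(dim, dim),
            nn.BatchNorm1d(dim), nn.ReLU(), nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.block(x)    # no activation after the add · pure identity path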

Empirical impact

Year Model Depth ImageNet top-5
2012 AlexNet 8 16.4%
2014 VGG-19 19 7.3%
2014 GoogLeNet 22 6.7%
2015 ResNet-152 152 3.6%

Skip connections are now in virtually every modern architecture — CNNs, Transformers, diffusion U-Nets.

PART 5

Initialization

From first principles

The goal

Keep activations — and gradients — at roughly constant variance across layers.

  • Variance grows → exploding activations.
  • Variance shrinks → vanishing activations.

What to check in a real model

Initialization theory is useful because it gives a concrete diagnostic: activation statistics by layer.

For one batch, log:

Quantity · Healthy early signal · Red flag
activation mean · near 0 for normalized layers · large drift
activation std · roughly stable across depth · shrinks to 0 or explodes
gradient norm · nonzero in early layers · first layers near 0
fraction of ReLUs active · neither 0% nor 100% · many dead units
# after loss.backward() · per-parameter gradient norms
for name, p in model.named_parameters():
    if p.grad is not None:
        print(name, p.grad.norm().item())

Before changing architecture, check whether the signal survives the forward pass and the gradient survives the backward pass.
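
A matching sketch for the forward side, using forward hooks (the helper name and the choice to hook ReLU modules are mine):

import torch
import torch.nn as nn

def activation_stats(model, x):
    # log mean / std / fraction-active at every ReLU for one batch
    records, hooks = [], []
    for name, mod in model.named_modules():
        if isinstance(mod, nn.ReLU):
            def hook(m, inp, out, name=name):
                records.append((name, out.mean().item(), out.std().item(),
                                (out > 0).float().mean().item()))
            hooks.append(mod.register_forward_hook(hook))
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    for name, mean, std, frac in records:
        print(f"{name}: mean={mean:.3f}  std={std:.3f}  active={frac:.0%}")

# usage (hypothetical names): activation_stats(model, next(iter(loader))[0])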

Why initialization matters · the failure mode

If we initialize weights from N(0, 1) in a deep ReLU net:

Activations

Each layer multiplies by a matrix of N(0,1) values. Activation magnitudes either grow exponentially with depth (explode) or shrink exponentially (vanish) depending on shape.

Gradients

Same product, going backwards. A single bad scale → all early layers train at an effective rate of ≈ 0 · they never move.

Symptom · loss is NaN at step 1, or loss is flat with all weights stuck. If the training loop is otherwise correct, this is almost always an init problem.

The variance argument on the next slide gives a one-line fix · scale the init variance like 1/n_in (2/n_in for ReLU).

Vanishing gradient · numeric example

Suppose a 10-layer sigmoid network. The sigmoid derivative's max is 0.25 (at z = 0).

Even at the best point, the gradient through one layer is multiplied by at most 0.25.

After 10 layers · 0.25¹⁰ ≈ 10⁻⁶

After 20 layers · 0.25²⁰ ≈ 10⁻¹²

The first layer's effective learning rate is a million times smaller than the last layer's. It barely updates · network never learns features in early layers.

This is the practical reason ReLU (derivative = 0 or 1) replaced sigmoid in deep nets · it doesn't shrink the gradient by a constant factor at every layer.

Variance · three regimes across layers

Forward-pass variance

Layer: aⱼ = Σᵢ wⱼᵢ xᵢ. Assume the wⱼᵢ and xᵢ are independent and zero-mean.

For independent zero-mean w and x:

Var(w x) = Var(w) · Var(x)

Therefore:

Var(aⱼ) = n_in · Var(w) · Var(x)

To preserve variance we need n_in · Var(w) = 1, i.e. Var(w) = 1/n_in.

Xavier · for sigmoid/tanh

Forward pass wants Var(w) = 1/n_in.
Backward pass wants Var(w) = 1/n_out.

Xavier / Glorot (2010) — compromise:

Var(w) = 2 / (n_in + n_out)

He · for ReLU

ReLU zeros half the pre-activations:

Var(relu(a)) = ½ · Var(a) (for symmetric, zero-mean a)

To keep this constant across layers: n_in · Var(w) · ½ = 1.

He / Kaiming (2015):

Var(w) = 2 / n_in

Factor of 2 compensates the ReLU halving.

Worked numeric · variance flow over 10 layers

10-layer fully-connected ReLU net with width n everywhere. Initial activation variance Var(x) = 1.

Naive init (Var(w) = 1) ·

  • Each layer · Var(a) = n · Var(x).
  • After ReLU · halved → (n/2) · Var(x).
  • Pass through layer 2 · another factor of n/2. Explodes geometrically.
  • After 10 layers · (n/2)¹⁰ — astronomical for any realistic width. NaN at step 1.

He init (Var(w) = 2/n) ·

  • Each layer · Var(a) = n · (2/n) · Var(x) = 2 · Var(x).
  • After ReLU · halved → Var(x).
  • Stable for 10, 100, or 1000 layers. ✓

This is why initialization is not optional. Bad init → loss is NaN at step 1, or the signal shrinks to nothing and early layers never move. Good init keeps signal magnitude constant across depth.
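
Both regimes in a few lines (a sketch · width 512 and batch 4096 are arbitrary choices):

import torch

n, depth = 512, 10
x = torch.randn(4096, n)                          # unit-variance input

for name, std in [("naive, Var(w)=1", 1.0), ("He, Var(w)=2/n", (2.0 / n) ** 0.5)]:
    h = x
    for _ in range(depth):
        h = torch.relu(h @ (torch.randn(n, n) * std))
    print(f"{name}: activation std after {depth} layers ≈ {h.std().item():.2e}")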

Initialization in PyTorch

# Default for nn.Linear — Kaiming uniform (sensible for ReLU)
model = nn.Linear(784, 256)

# Explicit He init
nn.init.kaiming_normal_(model.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(model.bias)

# Explicit Xavier (for sigmoid/tanh)
nn.init.xavier_normal_(model.weight)

Rule of thumb · ReLU family → He, sigmoid / tanh → Xavier.

Pop quiz. You build a 10-layer MLP with Tanh activations and He initialization. Loss oscillates, activations saturate. Why?

Pop quiz · answer

He doubles the variance to compensate for ReLU's halving. Tanh doesn't halve — He gives too large a variance → activations saturate at ±1 → gradients die.

Fix: Xavier.

Activation · Init
ReLU, Leaky ReLU · He
Sigmoid, Tanh · Xavier
GELU, SiLU · He (convention)

Putting it all together · the L02 master sentence

Depth is mathematically expressive (UAT, Telgarsky), but practically fragile.
Three forces conspire against naive deep nets · vanishing/exploding gradients, vanishing/exploding activations, and the degradation problem even when both are tame.

Symptom · Root cause · Fix introduced
Gradient → 0 in early layers · sigmoid's 0.25 ceiling + product of small Jacobians · ReLU family + skip connections
Activation variance blows up / shrinks · wrong init scale per layer · Xavier (tanh) / He (ReLU)
Deeper net is worse at training loss · optimization, not capacity · residual blocks

The single insight underneath all three fixes · make every layer easy to leave alone. Identity-friendly initialization, identity skip-connections, and activations that pass gradients through unchanged in their linear regime. That's what made depth practical in 2015 and onward.

Practice problems

P1. UAT says a continuous f on [0, 1] can be approximated to error ε by a sum of N ReLUs. Estimate N as a function of ε. (Hint · a piecewise-linear fit with N breakpoints has error O(1/N²) for smooth f.)

P2. A 5-layer plain MLP with sigmoid activations is failing to train. Without changing the architecture, name two changes that would help and explain why each works.

P3. Show that for a ResNet block h_{l+1} = h_l + F(h_l), the local Jacobian is ∂h_{l+1}/∂h_l = I + ∂F/∂h_l. Use this to argue that even if F has a tiny Jacobian, the gradient through the residual block does not vanish.

P4. A 100-layer ReLU MLP with width n everywhere uses Xavier init, Var(w) = 1/n. Will the activation variance grow, shrink, or stay constant across depth? Why is this wrong for ReLU? What's the fix?

P5. Telgarsky's separation says shallow (depth-O(k)) ReLU nets need at least 2ᵏ units to match certain deep, narrow nets. Plug in a concrete k of your choice · how many shallow units? Why does this argue for going deep?

P6. You replace ReLU with leaky ReLU (small negative slope α) in a 50-layer net. (a) What changes for forward-pass variance? (b) What changes for vanishing gradients on initially-negative pre-activations?

Lecture 2 — summary

  • UAT is an existence theorem. Width can be exponential; depth is the practical knob.
  • Compositionality (and parity) shows depth can replace exponential width.
  • Vanishing gradients come from products of sub-1 Jacobians; ReLU unlocks depth by giving gradient 1 on the active side.
  • Degradation problem — plain deep nets train worse.
  • ResNets. Identity-in-the-Jacobian gradient highway + smoother landscape.
  • ResNet practice — BatchNorm, projection shortcuts, and pre-activation blocks keep the identity path usable.
  • Xavier / He — both derived from variance preservation.
  • Always separate axes: expressivity, optimization, and generalization are different claims.

Read before Lecture 3

Prince — Ch 4, Ch 11. Free at udlbook.github.io.

Next lecture

Tensors, autograd, nn.Module, DataLoader, the full training recipe, debugging ladder, error analysis.

Notebook 2 · 02-depth-and-resnets.ipynb — shallow-wide vs deep-narrow on spirals; build a residual block; visualize gradient norms across depth.