What a single hidden layer can — and can't — do
Interactive: grow a 1-hidden-layer net and watch it fit a target curve — universal-approximation.
Imagine an unlimited supply of LEGO bricks. Can you build a sculpture of anything? A car, a house, the Eiffel Tower? Yes · if your bricks are small enough, you can approximate any shape.
UAT says · a neural network with one hidden layer can do the same for mathematical functions. Its "LEGO bricks" are simple functions built from neurons.
"For any continuous function
The "true" relationship in our data · e.g., price = f(house_features).
"...and any small error
How close we want our approximation. Set
"...there exists a network..."
With one hidden layer of
"...whose output is within
A weighted sum of neuron outputs · each neuron is a "LEGO brick."
Theorem (Cybenko 1989 · Hornik 1991 · Leshno 1993)
For any continuous f on a compact set (say [0, 1]^n) and any ε > 0, there is a one-hidden-layer network g with |g(x) − f(x)| < ε for every x in the set —
for any non-polynomial activation (Leshno's version; Cybenko proved the sigmoid case).
One hidden layer suffices. The catch hides in one word: exist.
We won't write a full proof — but the structure is short and worth knowing.
Move 1 · Approximate continuous functions by step functions. Any continuous function on a compact set can be approximated uniformly by finitely many steps.
Move 2 · Approximate step functions by sums of sigmoids. A sigmoid σ(k(x − b)) with large k is as close to a step at b as we like.
Move 3 · Density via Stone-Weierstrass / Hahn-Banach. The finite sums Σ aᵢ σ(wᵢ·x + bᵢ) are dense among continuous functions on the compact set: if they were not, Hahn-Banach would give a nonzero functional vanishing on all of them, which the non-polynomial activation rules out.
Bottom line · UAT is not a constructive recipe — it's a density theorem. It says good weights exist; it says nothing about how many neurons are needed or how to find those weights.
f(x) = x² with 4 ReLUs · in numbers. Pick 4 ReLU pieces relu(w·(x − b)), one turning on at each knot, with slope w chosen so the sum matches x² at the knots:
| ReLU | turns on at | next knot | bias b |
|---|---|---|---|
| 1 | 0.0 | 0.25 | 0.0 |
| 2 | 0.25 | 0.50 | 0.25 |
| 3 | 0.50 | 0.75 | 0.50 |
| 4 | 0.75 | 1.00 | 0.75 |
The output is a piecewise-linear staircase that hugs the parabola x² across [0, 1].
That's UAT in numbers. A weighted sum of ReLU bumps approximates any 1D continuous function.
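Here is the same construction as runnable code · a minimal sketch where the slopes are recomputed by interpolating x² at the knots 0, 0.25, 0.5, 0.75 (so the numbers are derived here, not taken verbatim from the table above):

import torch

def relu_approx_x2(x):
    # Sum of 4 ReLU ramps that interpolates f(x) = x^2 at the knots 0, 0.25, 0.5, 0.75, 1.
    # Each ReLU turns on at a knot; its slope is the *change* in secant slope at that knot.
    knots = [0.0, 0.25, 0.5, 0.75, 1.0]
    y = torch.zeros_like(x)
    prev_slope = 0.0
    for k0, k1 in zip(knots[:-1], knots[1:]):
        secant = (k1**2 - k0**2) / (k1 - k0)      # slope of x^2 on [k0, k1]
        y = y + (secant - prev_slope) * torch.relu(x - k0)
        prev_slope = secant
    return y

x = torch.linspace(0, 1, 101)
print((relu_approx_x2(x) - x**2).abs().max())     # ~0.016: worst-case gap with only 4 ReLUs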
A single ReLU is a ramp going up forever. To make it come back down to zero we need three ReLUs.
| x | relu(x − 1) | −2·relu(x − 2) | relu(x − 3) | sum |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1.5 | 0.5 | 0 | 0 | 0.5 |
| 2 | 1 | 0 | 0 | 1.0 (peak) |
| 2.5 | 1.5 | -1 | 0 | 0.5 |
| 4 | 3 | -4 | 1 | 0 |
A perfect triangular bump centered at x = 2 · zero before x = 1, peak of 1 at x = 2, back to exactly zero at x = 3 and beyond.
A single ReLU is a ramp (in higher dimensions, a tilted half-plane). Combine three ReLUs with weights 1, −2, 1 · you get a triangular bump of any width and height.
This is 0 outside its interval, so weighted, shifted copies of it can be stacked to match any target.
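The same bump in a few lines of code, reproducing the table's rows:

import torch

def bump(x):
    # relu(x-1) - 2*relu(x-2) + relu(x-3): zero outside [1, 3], peak of 1 at x = 2.
    return torch.relu(x - 1) - 2 * torch.relu(x - 2) + torch.relu(x - 3)

print(bump(torch.tensor([0.0, 1.5, 2.0, 2.5, 4.0])))   # 0.0, 0.5, 1.0, 0.5, 0.0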
UAT's existence proof essentially tiles the function space with bumps. A network finds these bumps automatically through gradient descent. Existence is given by the construction; training is the open problem.
Piecewise-linear approximation of the target curve with 1, 10, and 100 hidden neurons · each jump in width adds more linear pieces and shrinks the worst-case error.
UAT says good weights exist. Not that SGD finds them. Not that the network generalizes. Not that the required width is practical.
Learnability.
Existence of good weights ≠ SGD finding them.
Width.
The bound on the number of hidden neurons can be astronomically large — in the worst case it grows exponentially with the input dimension.
Generalization.
A network that memorizes the training points can be ε-close on those points and still be wrong everywhere else.
Pop quiz. True or false: UAT guarantees a 1-hidden-layer net will perform well on unseen data, given enough neurons and data.
False — for three independent reasons.
In practice, depth is far more parameter-efficient than width.
Why deeper is (usually) better
Q. On the same classification task, compare a shallow-wide MLP (one big hidden layer) with a deep-narrow MLP (several small hidden layers).
Which has more parameters?
| Architecture | Parameters |
|---|---|
| Shallow-wide · one hidden layer | ~1.63 M |
| Deep-narrow · several hidden layers | ~0.22 M |
Shallow-wide has 7× more parameters, but deep-narrow typically wins on natural data.
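A sketch of the count · the specific layer sizes below are assumptions chosen to land near the table's totals (784-dimensional input, 10 classes), not the exact architectures behind it:

import torch.nn as nn

def n_params(model):
    return sum(p.numel() for p in model.parameters())

shallow_wide = nn.Sequential(nn.Linear(784, 2048), nn.ReLU(), nn.Linear(2048, 10))
deep_narrow = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
print(f"shallow-wide: {n_params(shallow_wide) / 1e6:.2f} M")   # ~1.63 M
print(f"deep-narrow:  {n_params(deep_narrow) / 1e6:.2f} M")    # ~0.22 M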
Parameter count is a crude proxy for capacity. What depth gives you, width alone cannot: reusable hierarchy.
Depth helps when the target function has reusable substructure.
| Domain | Reusable structure | What deeper layers reuse |
|---|---|---|
| images | edges → textures → parts → objects | local visual motifs |
| language | characters → words → phrases → discourse | compositional meaning |
| audio | samples → phonemes → syllables → words | local temporal patterns |
| tabular data | often weaker hierarchy | depth may help less |
Depth is not magic. It is a strong inductive bias for compositional problems. If the data do not have reusable structure, a deeper model can simply be harder to optimize.
Parity is recursive — XOR of pairs, then XOR of pair-pairs, then XOR of those.
Depth matches the structure of the problem: a tree of pairwise XORs computes n-bit parity in log₂ n levels, each level reusing the outputs of the level below.
Formal proof: Telgarsky, "Benefits of Depth in Neural Networks," COLT 2016. We take the statement on faith today.
Theorem (Telgarsky 2016, simplified) · There exists a function computed exactly by a narrow ReLU network whose depth grows with k, such that any much shallower network needs width exponential in k to approximate it.
The witness function is the sawtooth obtained by composing a triangle ("tent") map with itself k times.
Each composition doubles the number of "teeth" — k compositions give on the order of 2^k linear pieces, and a shallow network must pay for those pieces neuron by neuron.
Depth is exponentially more parameter-efficient than width for problems with compositional / recursive structure. Real-world data (images, language) is richly compositional · so depth pays off in practice.
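You can watch the doubling numerically · the snippet below composes a tent map with itself and counts the teeth; it is a sketch of the idea, not Telgarsky's exact construction:

import numpy as np

def tent(x):
    # Triangle ("tent") map: a single tooth on [0, 1].
    return 1 - np.abs(2 * x - 1)

x = np.linspace(0, 1, 100_001)
y = x.copy()
for depth in range(1, 6):
    y = tent(y)                                      # one more composition
    above = (y > 0.5).astype(int)
    teeth = int(np.abs(np.diff(above)).sum()) // 2   # crossings of 0.5, two per tooth
    print(f"{depth} compositions: {teeth} teeth")    # 1, 2, 4, 8, 16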
Q. If depth is so great, should we train a 500-layer plain MLP?
Give an honest answer; then read on.
Problem 1 — vanishing gradients (from L1).
Product of many sub-1 Jacobians → zero.
Problem 2 — the degradation problem.
Something strictly more surprising. Next section.
Before we fix depth, let's see it break
A 4-layer scalar network · y = w_4 · w_3 · w_2 · w_1 · x (ignoring activations).
We want the gradient of the loss L with respect to the first weight, ∂L/∂w_1.
Expand · y = w_4 · w_3 · w_2 · w_1 · x, so ∂y/∂w_1 = w_4 · w_3 · w_2 · x.
So · ∂L/∂w_1 = (∂L/∂y) · w_4 · w_3 · w_2 · x.
The key is the product w_4 · w_3 · w_2 · every extra layer adds another factor. Factors below 1 shrink the gradient exponentially with depth; factors above 1 blow it up.
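A quick autograd check of exactly this (the weight value 0.5 and input 2.0 are arbitrary):

import torch

w1, w2, w3, w4 = (torch.tensor(0.5, requires_grad=True) for _ in range(4))
x = torch.tensor(2.0)
y = w4 * w3 * w2 * w1 * x            # the 4-layer "network" with no activations
loss = 0.5 * y**2
loss.backward()
print(w1.grad.item())                # 0.03125
print((y * w4 * w3 * w2 * x).item()) # same number: (dL/dy) · w4·w3·w2 · x, with dL/dy = y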
Sigmoid derivative · σ′(z) = σ(z)(1 − σ(z)), which peaks at 0.25 (at z = 0).
Each backward step multiplies the gradient by σ′(z) ≤ 0.25 (times a weight).
| Layer (counting back) | Cumulative factor |
|---|---|
| L | 0.25 |
| L−1 | 0.0625 |
| L−2 | ≈ 0.016 |
| L−5 | ≈ 2 × 10⁻⁴ |
| L−10 | ≈ 2 × 10⁻⁷ |
After 10 sigmoid layers, the gradient signal at the earliest weight is shrunk by a million. The first layer barely updates · learning stalls.
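You can measure the shrinkage in a few lines · stack 10 scalar sigmoids with no weights at all, so each backward step multiplies by σ′(h) ≤ 0.25:

import torch

x = torch.tensor(0.0, requires_grad=True)
h = x
for _ in range(10):
    h = torch.sigmoid(h)       # 10 sigmoid "layers", best case (no weights to shrink it further)
h.backward()
print(x.grad.item())           # ≈ 4e-7 — more than a millionfold shrinkage over 10 layers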
This is the core problem ResNets and ReLU were invented to solve.
Interactive: stack sigmoids and watch the gradient evaporate with depth — vanishing-gradients.
The most important architectural idea since backprop
Train plain CNNs of increasing depth on CIFAR-10 — same optimizer, same init, just more layers.
Q. What do you expect? More layers → more capacity → better, right?
If this were overfitting, training error would drop with depth (more capacity to memorize) and test error would rise (poor generalization).
Instead, training error went up. The deeper net cannot even fit the training data.
This is an optimization problem, not a capacity problem.
Compare the signatures:
| Failure mode | Training error | Validation / test error | Main diagnosis |
|---|---|---|---|
| Underfitting | high | high | model or training too weak |
| Overfitting | low | high | generalization failure |
| Degradation | higher for deeper model | higher for deeper model | optimization failure |
The ResNet paper mattered because it exposed a surprising fact: simply adding layers can make the training objective itself harder, even though the deeper model has more representational capacity.
Imagine steering a car by giving the wheel an absolute angle (e.g., "set wheel to 27 degrees from zero"). Hard to do.
Instead, you say "turn a little right" or "stay straight." Much easier.
ResNets do the same · they reframe each layer's job from "output the right thing" to "output a small change to your input." The default action — change nothing, pass input through — is now trivial. The network only learns deviations from identity, which is much easier for SGD.
He et al. 2015 — don't ask the block to learn the full mapping H(x); ask it to learn the residual F(x) = H(x) − x and output F(x) + x.
If the optimum is close to identity, we only need F ≈ 0 — and weights near zero give that for free.
Coordinating non-linear layers to produce exact identity is the opposite: a delicate balance with no prior.
The skip connection turns a hard default into a free default.
The original ResNet block was not just "add a skip connection." It used a stack like:
| Component | What it helps with |
|---|---|
| Skip connection | direct signal and gradient path |
| BatchNorm | stable activation scale across layers |
| ReLU | nonlinearity with active-side gradient 1 |
| He initialization | variance scale matched to ReLU |
Do not learn the wrong lesson: ResNets train because several engineering choices work together. The skip connection is the central idea, but scale control is part of the system.
Forward · y = F(x) + x, where F is everything inside the block.
We want the gradient flowing back through the block · ∂L/∂x, given the upstream gradient ∂L/∂y.
By the chain rule · ∂L/∂x = ∂L/∂y · ∂y/∂x.
Compute the local gradient: ∂y/∂x = ∂F/∂x + I.
So · ∂L/∂x = ∂L/∂y · ∂F/∂x + ∂L/∂y.
The "+I" gives an uninterrupted express lane for the gradient · the second term is the original signal passing straight through even if the first term vanishes. The early layers always get a clean signal.
Same 5-layer net. Each layer's local gradient factor is 0.1 · the plain net multiplies by 0.1 per layer; the residual net multiplies by (1 + 0.1), because the identity adds 1 to every factor.
| Layer | Plain (multiplicative) | ResNet (additive) |
|---|---|---|
| L (output) | 1.0 | 1.0 |
| L−1 | 0.1 | 1.1 |
| L−2 | 0.01 | 1.21 |
| L−3 | 0.001 | 1.33 |
| L−4 | 0.0001 | 1.46 |
The plain net's signal vanishes · ResNet keeps the full unit signal plus a small contribution from each block. Hundred-layer ResNets train; 100-layer plain nets don't.
Even if ∂F/∂x is tiny — or exactly zero — the identity term still hands the full upstream gradient to the layers below.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.block(x) + x)  # ← the skip
Q. What constraint does self.block(x) + x impose on dimensions?
The addition requires matching shapes. If the block changes the width, channel count, or spatial resolution, the skip path must also transform x so the two tensors can be added:
class ResidualBlock(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(d_in, d_out), nn.ReLU(),
            nn.Linear(d_out, d_out),
        )
        self.skip = nn.Identity() if d_in == d_out else nn.Linear(d_in, d_out)

    def forward(self, x):
        return torch.relu(self.block(x) + self.skip(x))
In CNNs this is often a 1×1 convolution (with stride, when the block downsamples) on the skip path.
Later ResNets moved normalization and activation before the weight layers:
| Block style | Formula sketch | Why it matters |
|---|---|---|
| Post-activation | y = ReLU(x + F(x)) | original ResNet |
| Pre-activation | y = x + F(ReLU(BN(x))) | cleaner identity path |
Pre-activation keeps the skip path as close to a pure identity as possible. This makes very deep residual networks easier to optimize because the gradient highway is less obstructed.
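A sketch of a pre-activation block with the same ingredients as before, reordered so nothing is applied after the addition (fully-connected for simplicity):

import torch.nn as nn

class PreActResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Norm and ReLU come *before* each weight layer; the skip path stays untouched.
        self.block = nn.Sequential(
            nn.BatchNorm1d(dim), nn.ReLU(), nn.Linear(dim, dim),
            nn.BatchNorm1d(dim), nn.ReLU(), nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.block(x)              # no activation after the sum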
Transformers use the same idea in another form: residual stream plus normalization around each block.
| Year | Model | Depth | ImageNet top-5 error |
|---|---|---|---|
| 2012 | AlexNet | 8 | 16.4% |
| 2014 | VGG-19 | 19 | 7.3% |
| 2014 | GoogLeNet | 22 | 6.7% |
| 2015 | ResNet-152 | 152 | 3.6% |
Skip connections are now in virtually every modern architecture — CNNs, Transformers, diffusion U-Nets.
From first principles
Keep activations — and gradients — at roughly constant variance across layers.
Initialization theory is useful because it gives a concrete diagnostic: activation statistics by layer.
For one batch, log:
| Quantity | Healthy early signal | Red flag |
|---|---|---|
| activation mean | near 0 for normalized layers | large drift |
| activation std | roughly stable across depth | shrinks to 0 or explodes |
| gradient norm | nonzero in early layers | first layers near 0 |
| fraction ReLU active | neither 0% nor 100% | many dead units |
for name, p in model.named_parameters():
    if p.grad is not None:
        print(name, p.grad.norm().item())
Before changing architecture, check whether the signal survives the forward pass and the gradient survives the backward pass.
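The gradient loop above covers the backward side of the table; forward hooks cover the activation side (sketch · `model` and one input batch are assumed to exist):

import torch.nn as nn

def log_activation_stats(model):
    # Print per-layer activation mean / std on the next forward pass.
    def make_hook(name):
        def hook(module, inputs, output):
            print(f"{name}: mean={output.mean().item():+.3f}  std={output.std().item():.3f}")
        return hook
    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.ReLU)):
            module.register_forward_hook(make_hook(name))

# log_activation_stats(model); model(batch)   # then read the stats layer by layer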
If we initialize every weight from N(0, 1), ignoring layer width, what happens as the network gets deep?
Each layer multiplies by a matrix of N(0,1) values. Activation magnitudes either grow exponentially with depth (explode) or shrink exponentially (vanish) depending on shape.
Same product, going backwards. A single bad per-layer scale compounds → all early layers train at effectively zero speed (or blow up).
Symptom · loss is NaN at step 1, or loss sits flat with every weight stuck. If the training loop is otherwise correct, this signature points squarely at initialization.
The variance argument on the next slide gives a one-line fix · scale the init by 1/√fan_in (with an extra factor for ReLU).
Suppose · a 10-layer sigmoid network. The sigmoid derivative maxes out at 0.25.
Even at the best operating point, the gradient through one layer is multiplied by at most 0.25.
After 10 layers · 0.25¹⁰ ≈ 10⁻⁶.
After 20 layers · 0.25²⁰ ≈ 10⁻¹².
The first layer's effective learning rate is a million times smaller than the last layer's. It barely updates · network never learns features in early layers.
This is the practical reason ReLU (derivative = 0 or 1) replaced sigmoid in deep nets · it doesn't shrink the gradient by a factor every layer.
Layer: y = W x, with W an m × n matrix (fan_in = n, fan_out = m), so y_i = Σ_j W_ij x_j.
For independent zero-mean weights with variance σ² and inputs with variance Var(x):
Therefore: Var(y_i) = n · σ² · Var(x).
To preserve variance we need n · σ² = 1.
Forward: σ² = 1 / fan_in keeps activations at constant scale.
Backward: the same argument on the gradients gives σ² = 1 / fan_out.
Xavier / Glorot (2010) — compromise: σ² = 2 / (fan_in + fan_out).
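The variance identity is easy to confirm numerically (widths and sample count below are arbitrary):

import torch

n, m = 512, 256                        # fan_in, fan_out
sigma2 = 1.0 / n                       # the forward condition: sigma^2 = 1 / fan_in
W = torch.randn(m, n) * sigma2**0.5
x = torch.randn(10_000, n)             # unit-variance inputs
y = x @ W.T
print(x.var().item(), y.var().item())  # both ≈ 1.0 — variance preserved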
ReLU zeros half the pre-activations: for zero-mean symmetric y, E[ReLU(y)²] = Var(y) / 2.
To keep this constant across layers: n · σ² / 2 = 1.
He / Kaiming (2015): σ² = 2 / fan_in.
Factor of 2 compensates the ReLU halving.
10-layer fully-connected ReLU net, same width at every layer · push one batch through and log the activation scale per layer.
Naive init (weights from N(0, 1)) · the activation scale blows up by a large factor at every layer.
He init · the activation scale stays roughly constant through all 10 layers.
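A sketch of that experiment (width 512 and the batch size are assumptions; any moderate size shows the same pattern):

import torch

def act_scale(depth=10, width=512, w_std=None):
    # Push one batch through `depth` plain ReLU layers and print the activation std per layer.
    x = torch.randn(1024, width)
    for i in range(depth):
        std = w_std if w_std is not None else (2.0 / width) ** 0.5   # default: He, sqrt(2/fan_in)
        W = torch.randn(width, width) * std
        x = torch.relu(x @ W.T)
        print(f"layer {i + 1:2d}: activation std = {x.std().item():.3g}")

act_scale(w_std=1.0)    # naive N(0, 1): the scale grows ~16x per layer — explodes
act_scale()             # He init: the scale stays roughly constant across all 10 layers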
This is why initialization is not optional. Bad init → loss is NaN at step 1, or weights barely move from where they started.
# Default for nn.Linear — Kaiming uniform (sensible for ReLU)
model = nn.Linear(784, 256)
# Explicit He init
nn.init.kaiming_normal_(model.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(model.bias)
# Explicit Xavier (for sigmoid/tanh)
nn.init.xavier_normal_(model.weight)
Rule of thumb · ReLU family → He, sigmoid / tanh → Xavier.
Pop quiz. You build a 10-layer MLP with Tanh activations and He initialization. Loss oscillates, activations saturate. Why?
He doubles the variance to compensate for ReLU's halving. Tanh doesn't halve — He hands it too large a variance → activations pile up at ±1, where tanh's gradient is nearly zero.
Fix: Xavier.
| Activation | Init |
|---|---|
| ReLU, Leaky ReLU | He |
| Sigmoid, Tanh | Xavier |
| GELU, SiLU | He (convention) |
Depth is mathematically expressive (UAT, Telgarsky), but practically fragile.
Three forces conspire against naive deep nets · vanishing/exploding gradients, vanishing/exploding activations, and the degradation problem even when both are tame.
| Symptom | Root cause | Fix introduced |
|---|---|---|
| Gradient → 0 in early layers | Saturating activations multiply by sub-1 derivatives every layer | ReLU family · skip connections |
| Activation variance blows up / shrinks | Wrong init scale per layer | Xavier (tanh) · He (ReLU) |
| Deeper net is worse at training loss | Optimization, not capacity | Residual blocks |
The single insight underneath all three fixes · make every layer easy to leave alone. Identity-friendly initialization, identity skip-connections, and activations that pass gradients through unchanged in their linear regime. That's what made depth practical in 2015 and onward.
P1. UAT says good weights exist for a wide-enough single hidden layer. State three distinct things the theorem does not guarantee, and say why each matters in practice.
P2. A 5-layer plain MLP with sigmoid activations is failing to train. Without changing the architecture, name two changes that would help and explain why each works.
P3. Show that for a ResNet block y = F(x) + x, the gradient ∂L/∂x contains the term ∂L/∂y even when ∂F/∂x = 0.
P4. A 100-layer plain ReLU MLP with He initialization still reaches a worse training loss than a 20-layer one. Which failure mode is this, and what architectural change fixes it?
P5. Telgarsky's separation says depth can be exponentially more efficient than width: a deep, narrow net represents the iterated-triangle sawtooth exactly, while a shallow net needs exponentially many units to approximate it. Explain informally where the exponential cost comes from.
P6. You replace ReLU with leaky ReLU (negative slope α). Does the He variance argument still apply? What, if anything, changes?