
Interactive Explainer

Watching a Neural Network Become Universal

The Universal Approximation Theorem says a neural net with one hidden layer can match almost any shape you draw. Most people read that and shrug. This page lets you feel it—by building the bumps yourself, then training a real network in your browser.

Prelude

The claim that sounds too good

In 1989, Cybenko proved something startling: a feed-forward neural network with a single hidden layer and enough neurons can approximate any continuous function on a bounded interval to any accuracy you want. He proved it for sigmoids; later work extended the result to ReLUs, tanh, and essentially any activation that isn't a polynomial.

Read that literally and it sounds absurd. One layer? Any function? How can bending a few straight lines possibly match a wild curve?

The trick is to stop thinking of the hidden layer as "a network" and start thinking of it as a Lego kit of bumps. Each neuron is one Lego brick. Wide enough, any shape can be built. Let's build.

Step 1

One neuron is a bent line

A single hidden neuron computes three simple operations:

It takes the input $x$, multiplies it by a weight $w$, shifts it by a bias $b$, then passes the whole thing through a non-linear activation $\phi$. With ReLU ($\phi(z) = \max(0, z)$), that produces a hinge—flat on one side, linear on the other, with the bend at $x = -b / w$.
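In code, the whole neuron is one line. A minimal NumPy sketch (the function name and numbers here are illustrative, not the page's widget code):

```python
import numpy as np

def relu_neuron(x, w, b):
    """One hidden ReLU neuron: flat where w*x + b < 0, linear where it is > 0."""
    return np.maximum(0.0, w * x + b)

x = np.linspace(0.0, 1.0, 6)
print(relu_neuron(x, w=2.0, b=-1.0))   # hinge with its kink at x = -b/w = 0.5
# -> [0.  0.  0.  0.2 0.6 1. ]
```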

Drag the sliders below. Notice two things: the weight $w$ controls which direction the hinge rises, and the bias $b$ slides it left and right. A single ReLU neuron can only make one kink in an otherwise straight line—it can't do much on its own.

One ReLU neuron:
Pause and think. You just have a hinge. It can tilt, it can shift, but it can never peak and come back down. So how could a pile of these ever match a smooth sine wave?
Step 2

Two neurons make a bump

The secret is subtraction. Take two rising hinges with kinks slightly apart, give the first a positive output weight and the second a negative one, and add them. Before the first kink the sum is zero; between the kinks it rises; past the second kink the negative hinge cancels the growth, so the sum levels off, and if the negative weight is a little stronger it falls back down. That localized rise-and-return shape is the bump we'll build everything from.
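Here is the same arithmetic in a few lines of NumPy, a sketch with hand-picked numbers rather than the widget's slider values:

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def hinge_pair(x, b1, b2, c1, c2):
    """Weighted sum of two ReLU hinges with kinks at b1 and b2."""
    return c1 * relu(x - b1) + c2 * relu(x - b2)

x = np.linspace(0.0, 1.0, 11)
# Balanced weights (c2 = -c1): rises between the kinks, then levels off.
print(hinge_pair(x, b1=0.3, b2=0.5, c1=5.0, c2=-5.0))
# Stronger negative weight: rises, then falls back down past the second kink.
print(hinge_pair(x, b1=0.3, b2=0.5, c1=5.0, c2=-10.0))
```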

Below you control two neurons feeding a single output. The blue dashed lines show each neuron's individual hinge; the solid orange curve is their weighted sum. Slide the second bias to pull the second hinge to the right—watch a bump appear.

Two ReLU neurons combining into a bump. Dashed lines: individual neurons. Solid line: weighted sum.
Try this. Set $c_2$ equal to $-c_1$: past the second kink the two hinges cancel exactly and the sum flattens into a plateau. Move $b_1$ and $b_2$ close together for a sharp, narrow rise; move them far apart for a long, gentle one. Push $c_2$ below $-c_1$ and the sum falls back down after its peak. The spacing of the kinks sets the width, and the output weights set the height and sign. That's your Lego brick.
Step 3

Stack hinges into the shape

Here's the whole theorem in plain English: a ReLU network with enough hinges can match any continuous shape arbitrarily well. Every neuron adds one kink to a piecewise-linear fit, so $N$ neurons with kinks spread across $[0, 1]$ (endpoints included) give $N-1$ straight segments between them. Wide enough, and the segments become dense enough that the fit is indistinguishable from the target.

Below, pick a target function — or draw your own — and drag the width slider. The network places $N$ kinks at equally-spaced positions across $[0, 1]$ and picks output weights so the fit passes exactly through the target at every kink; the bits in between are straight lines. This is the classic piecewise-linear interpolant, built entirely from ReLU hinges.
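Here is a sketch of that construction in NumPy, assuming knots at both endpoints of $[0, 1]$; the names are illustrative and the widget's internals may differ:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def interpolating_relu_net(target, n_neurons):
    """One ReLU kink at each of n_neurons equally spaced knots on [0, 1], with
    output weights chosen so the fit passes through the target at every knot."""
    knots = np.linspace(0.0, 1.0, n_neurons)   # kink positions
    y = target(knots)                          # target values at the knots
    slopes = np.diff(y) / np.diff(knots)       # slope of each straight segment
    # Each output weight is the change in slope at its kink; the first neuron
    # carries the initial slope, the output bias carries y[0].  The last kink
    # sits at x = 1, so its weight does nothing inside [0, 1].
    c = np.concatenate([[slopes[0]], np.diff(slopes), [0.0]])
    def net(x):
        hidden = relu(np.atleast_1d(x)[:, None] - knots[None, :])
        return hidden @ c + y[0]
    return net

target = lambda x: np.sin(2 * np.pi * x)
net = interpolating_relu_net(target, n_neurons=8)
xs = np.linspace(0.0, 1.0, 400)
print("MSE:", np.mean((net(xs) - target(xs)) ** 2))
```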

Orange target. Blue piecewise-linear fit. Blue dots mark the $N$ kinks where the fit passes exactly through the target.
Live readouts: hidden neurons (8), parameter count (25), and mean squared error.
Watch the error drop. For smooth targets (sine, spike, bumpy), doubling $N$ shrinks the MSE by roughly $16\times$: piecewise-linear interpolation of a smooth curve has the classic $1/N^2$ error rate, which becomes $1/N^4$ once squared into an MSE. For sharp features (step, zigzag corners), convergence is slower because straight lines can't match corners. That gap between smooth and sharp is one reason modern networks add depth: deep layers can compose hinges into much sharper shapes than a single layer of hinges ever can.
Step 4

Why can't a few neurons just do it?

The difficulty is geometric: each neuron buys exactly one kink, so a handful of neurons can only draw a handful of straight segments. Here's a table of live MSE values for the current target above; switch targets to see the numbers update:

Live table: hidden width $N$, what the network can express, and MSE against the current target.

Look at the scaling: for smooth targets (sine, spike, bumpy), doubling $N$ cuts the MSE by roughly $16\times$; the interpolation error shrinks as $1/N^2$, so the MSE shrinks as $1/N^4$. For discontinuous targets (step) the MSE falls much more slowly, roughly as $1/N$, because one segment always straddles the jump. Cybenko's theorem guarantees the error can be driven as low as you like in either case: there's no wall you hit, only a trade-off between width and accuracy.
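You can reproduce the smooth-target rate in a few lines, using NumPy's own piecewise-linear interpolant as a stand-in for the hand-placed fit (the exact numbers depend on the target and the evaluation grid):

```python
import numpy as np

target = lambda x: np.sin(2 * np.pi * x)   # a smooth target
xs = np.linspace(0.0, 1.0, 4000)
for n in (8, 16, 32, 64):
    knots = np.linspace(0.0, 1.0, n)
    fit = np.interp(xs, knots, target(knots))   # piecewise-linear fit through n knots
    print(n, np.mean((fit - target(xs)) ** 2))
# Each doubling of n divides the MSE by roughly 16: the 1/n^4 rate for a smooth target.
```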

Step 5

Now let gradient descent do the placing

Our hand-placed bumps are evenly spaced. A trained network finds much better placements on its own—clustering bumps where the target is wiggly, spreading them out where it's flat. Click Train and watch it happen. The browser runs actual stochastic gradient descent on a one-hidden-layer MLP.
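The demo's loop runs in the browser, but the idea fits in a short NumPy sketch; the width, learning rate, batch size, and initialization below are illustrative choices, not the demo's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the sine target from the demo.
x = np.linspace(0.0, 1.0, 256)[:, None]
y = np.sin(2 * np.pi * x)

width, lr, batch = 32, 0.02, 16
W1 = rng.normal(0.0, 2.0, (1, width))
b1 = -W1.ravel() * rng.uniform(0.0, 1.0, width)   # spread the initial kinks across [0, 1]
W2 = rng.normal(0.0, 0.5, (width, 1))
b2 = np.zeros(1)

for step in range(20001):
    idx = rng.integers(0, len(x), size=batch)     # stochastic mini-batch
    xb, yb = x[idx], y[idx]

    # Forward pass.
    z = xb @ W1 + b1                              # pre-activations
    h = np.maximum(0.0, z)                        # ReLU hidden layer
    pred = h @ W2 + b2
    err = pred - yb

    # Backward pass: gradients of the mean squared error.
    d_pred = 2.0 * err / batch
    dW2, db2 = h.T @ d_pred, d_pred.sum(0)
    dz = (d_pred @ W2.T) * (z > 0)                # ReLU derivative gates the gradient
    dW1, db1 = xb.T @ dz, dz.sum(0)

    # SGD step.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    if step % 5000 == 0:
        full = np.maximum(0.0, x @ W1 + b1) @ W2 + b2
        print(step, np.mean((full - y) ** 2))     # train MSE falls as the kinks move into place
```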

New here: you can draw any curve as a target, tune the learning rate, toggle Show neurons to see each learned bump, and overlay the hand-placed fit from Step 3 for comparison.

Live fit while gradient descent runs. Toggle Show neurons to see individual bumps and kink positions.
Training loss (log scale). Full history from epoch 0 onward.
Live readouts: epoch, parameter count, and train MSE.
Things to try. (1) Train width-2 on sine: it finds the best fit two kinks allow. Bump to 32, retrain. (2) Switch to tanh on the spike: tanh is bad at sharp corners and you'll see residual wiggle. (3) Click Show neurons to see each bump the network learned, and notice how it clusters bumps where the target is wiggly. (4) Click vs. Hand-placed to overlay the uniform-spacing fit; gradient descent finds smarter placements. (5) Draw your own wild shape and watch the network attempt to fit it. The theorem guarantees a good fit exists, but can gradient descent find it?
Learning rate experiments. Drag $\eta$ up to 0.2 and watch the loss explode. Drag it to 0.005 and watch it crawl. The sweet spot depends on both the target and the width — one more thing the theorem says nothing about. Press Space to start/stop training.
Step 6

What about classification?

Everything so far has been regression: fit a curve to a target value. But the same theorem covers classification: change the output activation to a sigmoid and the loss to binary cross-entropy, and a one-hidden-layer MLP can carve out any decision boundary in the plane.
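In code the swap is small. A sketch of the changed pieces relative to the Step 5 loop above; the gradient identity checked below is the standard sigmoid-plus-cross-entropy one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one prediction; eps guards against log(0)."""
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# The identity that makes the swap painless: with a sigmoid output and BCE loss,
# d(loss)/d(logit) = p - y.  Check it against a finite difference.
logit, y = 0.7, 1.0
p = sigmoid(logit)
delta = 1e-6
numeric = (bce(sigmoid(logit + delta), y) - bce(sigmoid(logit - delta), y)) / (2 * delta)
print(p - y, numeric)   # the two numbers agree to several decimal places

# In the Step 5 loop, only three lines change:
#   pred    = sigmoid(h @ W2 + b2)     # the output is now a probability
#   loss    = np.mean(bce(pred, yb))   # cross-entropy instead of MSE
#   d_logit = (pred - yb) / batch      # replaces 2.0 * err / batch
# The ReLU hidden layer and its gradients are untouched.
```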

The picture below shows a 2D version of universal approximation in action. Pick a dataset, set a width, and click Train. The shaded region is the network's predicted probability; the dark line is the $p = 0.5$ decision boundary. Each ReLU neuron contributes one straight cut to the plane, and gradient descent stitches them together into the boundary.

Blue dots = class 0, orange dots = class 1. Shading is the network's predicted probability; the dark line is the $p=0.5$ decision boundary.
Binary cross-entropy loss across epochs.
Live readouts: epoch, BCE loss, and accuracy.
Things to try. (1) Train width-2 on two moons: one straight cut almost works. (2) Try width-2 on XOR: it usually gets stuck, because the two cuts have to line up along the diagonal just right and a random start rarely finds that arrangement (see the hand-built example just below). Bump it to 4 and watch it click. (3) The two spirals dataset is the classic stress test. Try width-8: stuck. Width-32: it gets close. Width-64: almost perfect. Universal approximation works, but the price is width.
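To see that the width-2 failure on XOR is about optimization rather than expressiveness, here is a hand-built width-2 network that separates the four corners; the weights are chosen by hand (not learned), and the four points stand in for the demo's clustered dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Both hidden neurons look along the diagonal x1 + x2; their weighted difference
# is a "tent" that is high on the class-1 corners and low on the class-0 corners.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])          # column i holds hidden neuron i's input weights
b1 = np.array([2.0, 0.0])
W2 = np.array([[3.0], [-6.0]])       # output weights build the tent: 3*h1 - 6*h2
b2 = np.array([-3.0])

X = np.array([[ 1.0,  1.0], [-1.0, -1.0],   # class 0 corners
              [ 1.0, -1.0], [-1.0,  1.0]])  # class 1 corners
y = np.array([0, 0, 1, 1])

h = np.maximum(0.0, X @ W1 + b1)
p = sigmoid(h @ W2 + b2).ravel()
print(np.round(p, 3))                       # roughly [0.05, 0.05, 0.95, 0.95]
print(((p > 0.5).astype(int) == y).all())   # True: two cuts suffice in principle
```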
Same theorem, two faces. Regression and classification are the same problem for a ReLU network: both are function approximation. The only difference is what the function represents (a value vs. a probability) and how we measure error (MSE vs. cross-entropy). Universal approximation covers both with one statement.
Step 7

The big number

At 16 neurons, a one-hidden-layer MLP has only a few dozen parameters, yet it can already represent kinked, localized shapes that a polynomial with a comparable number of coefficients struggles to match. Here's the count for the network you're currently training:

Parameters in a one-hidden-layer network of width $N = 16$: $N$ input weights + $N$ hidden biases + $N$ output weights + 1 output bias $= 3N + 1 = 49$.

Compare that to what the theorem promises: any continuous function on $[0,1]$, to arbitrary precision, using nothing but neurons like these, just more of them as the precision demands grow. The “cost” is not in the complexity of each unit but in width, and in a training procedure that can find the right parameters.

Step 8

Three things the theorem does not say

Universal Approximation is often repeated as a slogan ("neural nets can learn anything!"). Three important caveats usually get dropped:

Myth

"It means one hidden layer is enough in practice."
The theorem says enough width suffices. In practice, the required width can be astronomical for complicated functions. Adding depth lets you build bumps from bumps, cutting the parameter budget dramatically. That's why modern networks are deep, not one layer wide.

Myth

"Training is guaranteed to find the approximation."
The proof shows a good approximation exists; it says nothing about whether gradient descent can find it. In the demo above you'll see tanh on a spike get stuck—not a failure of the theorem, but of the optimizer on that loss surface.

Myth

"It means networks generalize to unseen inputs."
Cybenko's theorem is about reproducing a function you already know, i.e. fitting the data you have. Whether the fit behaves sensibly on inputs you haven't seen (generalization) is a completely separate question, handled by regularization, inductive bias, and more data.

Final takeaway. A one-hidden-layer network is a bump builder. The theorem says: with enough bumps, you can match any shape. The practice of deep learning is about finding the bumps efficiently—through depth, better optimizers, and good inductive biases. You now know exactly what “universal” in Universal Approximation means, because you just watched the bumps stack up one at a time.