Interactive Explainer
Watching a Neural Network Become Universal
The Universal Approximation Theorem says a neural net with one hidden layer can match almost any shape you draw. Most people read that and shrug. This page lets you feel it—by building the bumps yourself, then training a real network in your browser.
The claim that sounds too good
In 1989, Cybenko proved something startling: a feed-forward neural network with a single hidden layer and enough neurons can approximate any continuous function on a bounded interval to any accuracy you want. His proof used sigmoids; later work extended the result to tanh, ReLU, and essentially any activation that isn't a polynomial.
Read that literally and it sounds absurd. One layer? Any function? How can bending a few straight lines possibly match a wild curve?
The trick is to stop thinking of the hidden layer as "a network" and start thinking of it as a Lego kit of bumps. Each neuron is one Lego brick. Wide enough, any shape can be built. Let's build.
One neuron is a bent line
A single hidden neuron computes three simple operations:
It takes the input $x$, multiplies it by a weight $w$, shifts it by a bias $b$, then passes the whole thing through a non-linear activation $\phi$. With ReLU ($\phi(z) = \max(0, z)$), that produces a hinge—flat on one side, linear on the other, with the bend at $x = -b / w$.
Drag the sliders below. Notice two things: the weight $w$ controls which direction the hinge rises, and the bias $b$ slides it left and right. A single ReLU neuron can only make one kink in an otherwise straight line—it can't do much on its own.
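The hinge is easy to reproduce numerically. Here is a minimal NumPy sketch; the weight and bias values are just examples, playing the role of the sliders:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def hinge(x, w, b):
    """One hidden ReLU neuron: zero on one side of the kink, linear on the other."""
    return relu(w * x + b)

x = np.linspace(0, 1, 101)
w, b = 2.0, -1.0            # kink at x = -b/w = 0.5
y = hinge(x, w, b)

print(y[x < 0.5].max())     # flat side: 0.0
print(y[-1])                # rising side: 2*1.0 - 1.0 = 1.0
```

Flipping the sign of `w` makes the hinge rise to the left instead, which is exactly what the first slider demonstrates.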
Two neurons make a bump
The secret is subtraction. Take two rising hinges shifted slightly apart, multiply one by a positive output weight and the other by a negative output weight, and add them together. The result is a localized bump: zero far away, rising up at one kink, falling back at the next.
Below you control two neurons feeding a single output. The blue dashed lines show each neuron's individual hinge; the solid orange curve is their weighted sum. Slide the second bias to pull the second hinge to the right—watch a bump appear.
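In code, the subtraction trick looks like this. The kink positions (0.3 and 0.6) and output weights are assumed illustration values; the demo's sliders control the same quantities:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def two_neuron_bump(x, b1=-0.3, b2=-0.6, v1=1.0, v2=-2.0):
    """Two rising hinges with opposite output weights: zero before x = 0.3,
    rising until x = 0.6, then falling back down past the second kink."""
    return v1 * relu(x + b1) + v2 * relu(x + b2)

x = np.linspace(0, 1, 101)
y = two_neuron_bump(x)      # peaks at x = 0.6 with height 0.3
```

With `v2 = -1.0` the two unit slopes cancel and you get a flat plateau instead; making the second output weight more negative than the first is what pulls the curve back down.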
Stack hinges into the shape
Here's the whole theorem in plain English: a ReLU network with enough hinges can match any continuous shape arbitrarily well. Every neuron adds one kink to a piecewise-linear fit, so $N$ neurons give $N$ kinks and $N-1$ straight segments between them. Make the network wide enough and the segments become short enough that the fit is indistinguishable from the target.
Below, pick a target function — or draw your own — and drag the width slider. The network places $N$ kinks at equally-spaced positions across $[0, 1]$ and picks output weights so the fit passes exactly through the target at every kink; the bits in between are straight lines. This is the classic piecewise-linear interpolant, built entirely from ReLU hinges.
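Here is a sketch of that construction in NumPy, assuming (as the slider does) that the kinks include both endpoints of $[0,1]$:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def pwl_relu_fit(target, n_kinks=8):
    """Piecewise-linear interpolant of `target` on [0, 1], one ReLU hinge per kink."""
    t = np.linspace(0.0, 1.0, n_kinks)      # kink positions (endpoints included)
    y = target(t)                           # the fit passes exactly through these values
    slopes = np.diff(y) / np.diff(t)        # slope of each straight segment
    # First hinge sets the initial slope; each later hinge changes the
    # slope by (next segment's slope - previous segment's slope).
    a = np.concatenate([[slopes[0]], np.diff(slopes)])
    def fit(x):
        return y[0] + sum(a_k * relu(x - t_k) for a_k, t_k in zip(a, t[:-1]))
    return fit

f = pwl_relu_fit(np.sin, n_kinks=8)
grid = np.linspace(0, 1, 201)
mse = np.mean((f(grid) - np.sin(grid)) ** 2)   # small already at 8 kinks
```

The output weight of each hinge is just the change in slope at that kink, which is why exact interpolation at the kinks needs no optimization at all.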
Why can't a few neurons just do it?
The difficulty is geometric. Here's a table of live MSE values for the current target above — switch targets to see the numbers update:
| Hidden width $N$ | What the network can express | MSE vs current target |
|---|---|---|
Look at the scaling: for smooth targets (sine, spike, bumpy), doubling $N$ cuts the MSE by roughly $16\times$ — the pointwise error shrinks as $1/N^2$, so the MSE shrinks as $1/N^4$. For discontinuous targets (step) the rate is much slower, around $1/N$. Either way there's no wall you hit: the theorem guarantees any accuracy is reachable, at the price of more width.
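You can verify the smooth-target rate without any neural-network code, since `np.interp` builds the same piecewise-linear interpolant as the ReLU construction above (the sine frequency here is an assumed stand-in for the demo's target):

```python
import numpy as np

x = np.linspace(0, 1, 2001)
target = np.sin(2 * np.pi * x)

def mse_at_width(n_kinks):
    # np.interp through the kinks == the ReLU hinge construction from Step 3
    kinks = np.linspace(0, 1, n_kinks)
    fit = np.interp(x, kinks, np.sin(2 * np.pi * kinks))
    return np.mean((fit - target) ** 2)

ratio = mse_at_width(16) / mse_at_width(32)   # roughly 16x for this smooth target
```

Swapping the target for a step function (`np.where(x < 0.5, 0.0, 1.0)`) and re-running shows the much slower discontinuous-target rate.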
Now let gradient descent do the placing
Our hand-placed bumps are evenly spaced. A trained network finds much better placements on its own—clustering bumps where the target is wiggly, spreading them out where it's flat. Click Train and watch it happen. The browser runs actual stochastic gradient descent on a one-hidden-layer MLP.
New here: you can draw any curve as a target, tune the learning rate, toggle Show neurons to see each learned bump, and overlay the hand-placed fit from Step 3 for comparison.
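For readers who want to see the moving parts, here is a minimal sketch of that training loop in NumPy. The width, learning rate, step count, and target are illustrative choices, not the demo's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)
N, lr = 32, 0.05                       # hidden width, learning rate (illustrative)
W1 = rng.normal(0.0, 1.0, (N, 1))      # input -> hidden weights
b1 = rng.normal(0.0, 0.5, N)           # hidden biases (these scatter the kinks)
w2 = rng.normal(0.0, 0.5, N)           # hidden -> output weights
b2 = 0.0

def forward(x):
    h = np.maximum(0.0, x[:, None] * W1[:, 0] + b1)   # ReLU hidden layer
    return h, h @ w2 + b2

x_train = rng.uniform(0.0, 1.0, 256)
y_train = np.sin(2 * np.pi * x_train)

for step in range(5000):
    idx = rng.integers(0, len(x_train), 32)           # sample a minibatch
    x, y = x_train[idx], y_train[idx]
    h, pred = forward(x)
    err = (pred - y) / len(x)                         # dMSE/dpred, up to a constant
    # Backprop by hand: output layer first, then through the ReLU mask.
    grad_w2, grad_b2 = h.T @ err, err.sum()
    dh = np.outer(err, w2) * (h > 0.0)
    grad_W1 = (dh * x[:, None]).sum(axis=0)[:, None]
    grad_b1 = dh.sum(axis=0)
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    w2 -= lr * grad_w2; b2 -= lr * grad_b2

_, pred = forward(x_train)
final_mse = np.mean((pred - y_train) ** 2)   # well below the target's variance of 0.5
```

Printing `-b1 / W1[:, 0]` before and after training shows the kink placements moving, the same effect Show neurons visualizes.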
What about classification?
Everything so far has been regression: fit a curve to a target value. But the same theorem covers classification: change the output activation to a sigmoid and the loss to binary cross-entropy, and a one-hidden-layer MLP can carve out any decision boundary in the plane.
The picture below shows a 2D version of universal approximation in action. Pick a dataset, set a width, and click Train. The shaded region is the network's predicted probability; the dark line is the $p = 0.5$ decision boundary. Each ReLU neuron contributes one straight cut to the plane, and gradient descent stitches them together into the boundary.
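To make "one straight cut per neuron" concrete, here is a tiny hand-built (not trained) network whose three hidden units each cut the plane along one line, together carving out a triangular region. All weights are hypothetical illustration values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each row of W (with its bias) defines one line; the unit fires only when
# the point is on the "wrong" side of that line.
W = np.array([[-1.0,  0.0],   # fires when x < 0
              [ 0.0, -1.0],   # fires when y < 0
              [ 1.0,  1.0]])  # fires when x + y > 1
b = np.array([0.0, 0.0, -1.0])
v = np.array([-10.0, -10.0, -10.0])   # any firing unit pushes p below 0.5
c = 1.0                               # output bias: inside, p = sigmoid(1)

def predict(point):
    h = np.maximum(0.0, W @ np.asarray(point) + b)
    return sigmoid(v @ h + c)

predict([0.3, 0.3])    # inside the triangle: p = sigmoid(1), about 0.73
predict([0.9, 0.9])    # crosses the x + y = 1 cut: p near 0
```

Gradient descent in the demo does the same thing with many more cuts, orienting them to fit the data rather than a hand-chosen triangle.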
The big number
At 16 neurons, a one-hidden-layer MLP has only a few dozen parameters, yet it can already represent kinked, localized curves that a polynomial with a similar number of coefficients struggles to match. Here's the count for the network you're currently training:
Parameters in a single hidden layer of width $N = 16$
= $N$ input weights + $N$ hidden biases + $N$ output weights + 1 output bias = $3N + 1 = 49$
Compare that to what the theorem promises: any continuous function on $[0,1]$, to arbitrary precision, using nothing but these few numbers. The “cost” is not in complexity but in width—and in a training procedure that can find the right parameters.
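The count is small enough to verify in one line; a sketch for the 1D-input case this page uses:

```python
def param_count(n_hidden):
    # input->hidden weights + hidden biases + hidden->output weights + output bias
    return n_hidden + n_hidden + n_hidden + 1   # = 3N + 1

print(param_count(16))   # 49
```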
Three things the theorem does not say
Universal Approximation is often repeated as a slogan ("neural nets can learn anything!"). Three important caveats usually get dropped:
"It means one hidden layer is enough in practice."
The theorem says enough width suffices. In practice, the required width can be astronomical for complicated functions. Adding depth lets you build bumps from bumps, cutting the parameter budget dramatically. That's why modern networks are deep, not one layer wide.
"Training is guaranteed to find the approximation."
The proof shows a good approximation exists; it says nothing about whether gradient descent can find it. In the demo above you'll see tanh on a spike get stuck—not a failure of the theorem, but of the optimizer on that loss surface.
"It means networks generalize to unseen inputs."
Cybenko's theorem is about fitting training points—i.e. memorizing. Whether the fit is sensible between training points (generalization) is a completely separate question, handled by regularization, inductive bias, and more data.