Interactive Explainer
Watching a Neural Network Become Universal
The Universal Approximation Theorem says a neural net with one hidden layer can match almost any shape you draw. Most people read that and shrug. This page lets you feel it—by building the bumps yourself, then training a real network in your browser.
The claim that sounds too good
In 1989, Cybenko proved something startling: a feed-forward neural network with a single hidden layer and enough neurons can approximate any continuous function on a bounded interval to any accuracy you want. His proof used sigmoids, but the exact activation doesn't matter much: later work extended the result to essentially any non-polynomial activation, so tanh and ReLU qualify too.
Read that literally and it sounds absurd. One layer? Any function? How can bending a few straight lines possibly match a wild curve?
The trick is to stop thinking of the hidden layer as "a network" and start thinking of it as a Lego kit of bumps. Each neuron is one Lego brick. Wide enough, any shape can be built. Let's build.
One neuron is a bent line
A single hidden neuron computes three simple operations:
It takes the input $x$, multiplies it by a weight $w$, shifts it by a bias $b$, then passes the whole thing through a non-linear activation $\phi$: the neuron outputs $\phi(wx + b)$. With ReLU ($\phi(z) = \max(0, z)$), that produces a hinge: flat on one side, linear on the other, with the bend at $x = -b / w$.
Drag the sliders below. Notice two things: the weight $w$ controls which direction the hinge rises, and the bias $b$ slides it left and right. A single ReLU neuron can only make one kink in an otherwise straight line—it can't do much on its own.
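In code, the hinge is a one-liner. Here's a minimal sketch in plain Python (the `neuron` helper is illustrative, not the demo's actual implementation):

```python
def relu(z):
    """ReLU activation: phi(z) = max(0, z)."""
    return max(0.0, z)

def neuron(x, w, b):
    """One hidden neuron: weight the input, shift it, then apply ReLU."""
    return relu(w * x + b)

# The hinge is flat left of the kink at x = -b/w and linear to the right.
w, b = 2.0, -1.0          # kink at x = -b/w = 0.5
print(neuron(0.0, w, b))  # left of the kink: 0.0
print(neuron(1.0, w, b))  # right of the kink: 2*1 - 1 = 1.0
```

Changing the sign of $w$ flips which side is flat; changing $b$ slides the kink along the axis.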
Two neurons make a bump
The secret is subtraction. Take two rising hinges shifted slightly apart, multiply one by a positive output weight and the other by a negative output weight, and add them together. With equal and opposite weights the sum is a step: zero far to the left, rising at one kink, leveling off at the next. Make the negative weight a little stronger and the curve rises and then falls back, giving a bump. (Subtracting a second step a bit further right pins the far side back to zero exactly.)
Below you control two neurons feeding a single output. The blue dashed lines show each neuron's individual hinge; the solid orange curve is their weighted sum. Slide the second bias to pull the second hinge to the right—watch a bump appear.
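The subtraction trick can be sketched directly, assuming ReLU hinges as above (`step` and `bump` are hypothetical helper names, not part of the demo):

```python
def relu(z):
    return max(0.0, z)

def step(x, a, b):
    """Difference of two rising hinges with kinks at a < b: zero left of a,
    rising linearly in between, plateauing at height (b - a) right of b."""
    return relu(x - a) - relu(x - b)

def bump(x, a, b, c, d):
    """Subtracting a second step further right carves the plateau back down,
    producing a bump that is zero far away on both sides."""
    return step(x, a, b) - step(x, c, d)

# bump(x, 1, 2, 3, 4): zero below x=1, height 1 between x=2 and x=3, zero above x=4.
```

Four hinges per fully localized bump is the bookkeeping to remember when counting neurons later.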
Stack bumps until the shape appears
Here is the whole theorem in plain English: if you can place a bump anywhere, of any height and width, you can approximate any continuous function as a sum of many little bumps. A wider hidden layer means more bumps, means finer resolution.
Below, pick a target function and drag the width slider. The network uses equally-spaced bias positions and sets each output weight to the value of the target at that position—a crude, hand-made approximation, but it already shows the shape snapping into place as you add neurons.
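The hand-made construction above can be sketched as follows: triangular bumps at equally spaced centers, each scaled by the target's value there. This is a stand-in for the demo's scheme, and `hat` and `approximate` are illustrative names; each hat is itself buildable from a few ReLU hinges.

```python
import math

def hat(x, center, h):
    """Unit-height triangular bump of half-width h. Adjacent hats at
    spacing h sum to 1, so scaled hats interpolate the target."""
    return max(0.0, 1.0 - abs(x - center) / h)

def approximate(target, n, lo=0.0, hi=1.0):
    """Crude hand-made approximation: n equally spaced bumps, each
    scaled by the target's value at its center."""
    h = (hi - lo) / (n - 1)
    centers = [lo + i * h for i in range(n)]
    return lambda x: sum(target(c) * hat(x, c, h) for c in centers)

target = lambda x: math.sin(2 * math.pi * x)
f8 = approximate(target, 8)
f32 = approximate(target, 32)
# More bumps, finer resolution: f32 hugs the sine far tighter than f8.
```

No training involved; the output weights are just read off the target, which is exactly why the construction is crude but convincing.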
Why can't one neuron just do it?
The difficulty is geometric. Here's a table that makes the bottleneck concrete for the sine target:
| Hidden width $N$ | What the network can express | Best achievable MSE on sine |
|---|---|---|
Every row of this table is computed live from the current scenario. Notice how the error falls roughly as $1/N^2$—doubling the width quarters the error. Cybenko's theorem says this continues forever: there's no wall you hit, only a trade-off between width and accuracy.
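You can check the trend offline. The sketch below measures the worst-case gap between sine and the equally spaced construction, assuming (as with the hat bumps above) that it reduces to piecewise-linear interpolation; doubling $N$ should roughly quarter the gap.

```python
import math

def target(x):
    return math.sin(2 * math.pi * x)

def max_error(n):
    """Largest gap between sin and its n-knot piecewise-linear
    interpolant on [0, 1], sampled on a fine grid."""
    h = 1.0 / (n - 1)
    def f(x):
        i = min(int(x / h), n - 2)          # which segment x falls in
        x0, x1 = i * h, (i + 1) * h
        t = (x - x0) / h                    # position within the segment
        return (1 - t) * target(x0) + t * target(x1)
    return max(abs(f(k / 400) - target(k / 400)) for k in range(401))

for n in (8, 16, 32):
    print(n, round(max_error(n), 4))
# Each doubling of n cuts the error by roughly a factor of four.
```

The exact constants depend on the target's curvature, but the $1/N^2$ slope is the same one the live table traces out.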
Now let gradient descent do the placing
Our hand-placed bumps are evenly spaced. A trained network finds much better placements on its own—clustering bumps where the target is wiggly, spreading them out where it's flat. Click Train and watch it happen. The browser runs actual stochastic gradient descent on a one-hidden-layer MLP with the width, activation, and target you pick below.
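A stripped-down version of that training loop, in plain Python rather than the browser. The width, learning rate, and step count here are illustrative, not the demo's settings, and the gradients are written out by hand for a loss of $\tfrac{1}{2}(\hat{y} - y)^2$:

```python
import math, random

random.seed(0)

def train(width=16, steps=5000, lr=0.01):
    """SGD on a one-hidden-layer ReLU MLP fitting sin(2*pi*x) on [0, 1]."""
    w1 = [random.uniform(-2, 2) for _ in range(width)]   # input weights
    b1 = [random.uniform(-1, 1) for _ in range(width)]   # hidden biases
    w2 = [random.uniform(-1, 1) for _ in range(width)]   # output weights
    b2 = 0.0                                             # output bias

    def forward(x):
        h = [max(0.0, w1[j] * x + b1[j]) for j in range(width)]
        return h, sum(w2[j] * h[j] for j in range(width)) + b2

    def mse():
        pts = [i / 50 for i in range(51)]
        return sum((forward(x)[1] - math.sin(2 * math.pi * x)) ** 2
                   for x in pts) / len(pts)

    before = mse()
    for _ in range(steps):
        x = random.random()                  # one random sample per step
        y = math.sin(2 * math.pi * x)
        h, pred = forward(x)
        err = pred - y                       # dLoss/dPred for 0.5*(pred-y)^2
        for j in range(width):
            if h[j] > 0.0:                   # ReLU gate: gradient only if active
                w1[j] -= lr * err * w2[j] * x
                b1[j] -= lr * err * w2[j]
            w2[j] -= lr * err * h[j]
        b2 -= lr * err
    return before, mse()

before, after = train()
```

Watching `before` and `after`, the loss drops as the kinks migrate toward the wiggly parts of the target, which is precisely the placement advantage over our evenly spaced hand construction.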
The big number
At 16 neurons, a one-hidden-layer MLP has only a few dozen parameters, yet it can fit curves that a polynomial with the same number of coefficients cannot. Here's the count for the network you're currently training:
Parameters in a single hidden layer of width $N = 16$:
$N$ input weights + $N$ hidden biases + $N$ output weights + 1 output bias $= 3N + 1 = 49$
Compare that to what the theorem promises: any continuous function on $[0,1]$, to arbitrary precision, using nothing but these few numbers. The “cost” is not in complexity but in width—and in a training procedure that can find the right parameters.
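The tally generalizes to any width. A quick sketch (the function name is illustrative, and the default dimensions assume the demo's one-input, one-output setup):

```python
def mlp_param_count(width, d_in=1, d_out=1):
    """Parameter count for a one-hidden-layer MLP, matching the tally
    above: input weights + hidden biases + output weights + output biases."""
    return width * d_in + width + d_out * width + d_out

print(mlp_param_count(16))  # 3*16 + 1 = 49 for the demo network
```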
Three things the theorem does not say
Universal Approximation is often repeated as a slogan ("neural nets can learn anything!"). Three important caveats usually get dropped:
"It means one hidden layer is enough in practice."
The theorem says enough width suffices. In practice, the required width can be astronomical for complicated functions. Adding depth lets you build bumps from bumps, cutting the parameter budget dramatically. That's why modern networks are deep, not one layer wide.
"Training is guaranteed to find the approximation."
The proof shows a good approximation exists; it says nothing about whether gradient descent can find it. In the demo above you'll see tanh on a spike get stuck: not a failure of the theorem, but of the optimizer on that loss surface.
"It means networks generalize to unseen inputs."
Cybenko's theorem is about approximating a function you already know across its whole domain; applied to a dataset, it guarantees a good fit at the training points, i.e. memorizing. Whether the fit is sensible between training points (generalization) is a completely separate question, handled by regularization, inductive bias, and more data.