
Interactive Explainer

Watching a Neural Network Become Universal

The Universal Approximation Theorem says a neural net with one hidden layer can match almost any shape you draw. Most people read that and shrug. This page lets you feel it—by building the bumps yourself, then training a real network in your browser.

Prelude

The claim that sounds too good

In 1989, Cybenko proved something startling: a feed-forward neural network with a single hidden layer and enough neurons can approximate any continuous function on a bounded interval to any accuracy you want. Cybenko's proof used sigmoid activations; later results (Hornik 1991; Leshno et al. 1993) extended it to ReLU, tanh, and essentially any activation that isn't a polynomial.

Read that literally and it sounds absurd. One layer? Any function? How can bending a few straight lines possibly match a wild curve?

The trick is to stop thinking of the hidden layer as "a network" and start thinking of it as a Lego kit of bumps. Each neuron is one brick; with enough bricks, any shape can be built. Let's build.

Step 1

One neuron is a bent line

A single hidden neuron computes three simple operations:

It takes the input $x$, multiplies it by a weight $w$, shifts it by a bias $b$, then passes the whole thing through a non-linear activation $\phi$. With ReLU ($\phi(z) = \max(0, z)$), that produces a hinge—flat on one side, linear on the other, with the bend at $x = -b / w$.
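In code, the whole neuron is one line. Here is a minimal numpy sketch (the function name and sample values are illustrative, not the demo's internals):

```python
import numpy as np

def relu_neuron(x, w, b):
    """One hidden unit: phi(w*x + b) with phi = ReLU.
    The output is a hinge whose kink sits at x = -b / w."""
    return np.maximum(0.0, w * x + b)

x = np.linspace(-2.0, 2.0, 5)          # [-2, -1, 0, 1, 2]
print(relu_neuron(x, w=1.0, b=-1.0))   # kink at x = 1: [0. 0. 0. 0. 1.]
```

Flat where $wx + b$ is negative, linear where it is positive: exactly the hinge described above.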

Drag the sliders below. Notice two things: the weight $w$ controls which direction the hinge rises, and the bias $b$ slides it left and right. A single ReLU neuron can only make one kink in an otherwise straight line—it can't do much on its own.

One ReLU neuron:
Pause and think. You just have a hinge. It can tilt, it can shift, but it can never peak and come back down. So how could a pile of these ever match a smooth sine wave?
Step 2

Two neurons make a bump

The secret is subtraction. Take two rising hinges with their kinks slightly apart, give the first a positive output weight and the second a negative one, and add them. The sum is zero to the left of both kinks and rises at the first; what happens at the second depends on the weights: with equal-and-opposite weights the slopes cancel and the sum levels off into a plateau, and with a steeper negative weight it falls back down. Either way, all the action is localized between the kinks, and that is the raw material for a bump.

Below you control two neurons feeding a single output. The blue dashed lines show each neuron's individual hinge; the solid orange curve is their weighted sum. Slide the second bias to pull the second hinge to the right—watch a bump appear.

Two ReLU neurons combining into a bump. Dashed lines: individual neurons. Solid line: weighted sum.
Try this. Set $c_2$ equal to $-c_1$ and move $b_1$ and $b_2$ close together: a sharp, narrow step. Move them far apart: a long, gentle ramp. The spacing of the biases sets the width of the transition, and the output weights set the slope, height, and sign. That's your Lego brick.
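The construction can be sketched in numpy. This is a hedged reconstruction, not the demo's code: two opposite hinges make a plateau, and two plateaus make a bump that returns to zero (the helper names and the 0.1 ramp width are my own choices):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def plateau(x, a1, a2, h):
    """Two ReLU hinges with equal-and-opposite output weights:
    zero left of a1, ramping up between a1 and a2, flat at height h after."""
    slope = h / (a2 - a1)
    return slope * relu(x - a1) - slope * relu(x - a2)

def bump(x, left, right, h, ramp=0.1):
    """Four hinges: a plateau up minus a shifted plateau up.
    Zero outside [left, right + ramp], height h in the middle."""
    return plateau(x, left, left + ramp, h) - plateau(x, right, right + ramp, h)

x = np.array([-1.0, 0.5, 3.0])
print(plateau(x, 0.0, 0.1, 2.0))  # zero before the kinks, then flat at 2
print(bump(x, 0.0, 1.0, 2.0))     # rises to 2, then back down to zero
```

Two neurons buy you the localized rise; doubling up to four pins the shape back to zero on both sides.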
Step 3

Stack bumps until the shape appears

Here is the whole theorem in plain English: if you can place a bump anywhere, of any height and width, you can approximate any continuous function as a sum of many little bumps. A wider hidden layer means more bumps, means finer resolution.

Below, pick a target function and drag the width slider. The network uses equally-spaced bias positions and sets each output weight to the value of the target at that position—a crude, hand-made approximation, but it already shows the shape snapping into place as you add neurons.

Orange target. Blue network output. Grey bumps: individual neurons (one per hidden unit).
[Interactive controls: hidden-neuron slider (shown: 8), parameter count (25), live mean squared error.]
Watch the error drop. Go from $N=1$ (one hinge, hopeless) to $N=60$ (sixty pieces, nearly perfect). The fit is a staircase of little bumps that gets finer as you add neurons. That staircase is the standard constructive intuition for the theorem: chop the x-axis into tiny intervals and put a bump on each. (Cybenko's own 1989 proof is a slicker, non-constructive density argument, but the staircase is what it guarantees exists.)
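The hand-made scheme fits in a few lines of numpy. One assumption in this sketch: each output weight is the *change* in slope needed for the piecewise-linear sum of hinges to pass through the target at equally spaced knots, which is slightly different bookkeeping than the demo's description but the same spirit:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def hand_built_net(x, target, n):
    """n ReLU hinges at equally spaced knots on [0, 1]; each output weight
    is a slope change, so the sum interpolates the target at every knot."""
    knots = np.linspace(0.0, 1.0, n + 1)
    y = target(knots)
    out = np.full_like(x, y[0])
    prev_slope = 0.0
    for i in range(n):
        slope = (y[i + 1] - y[i]) / (knots[i + 1] - knots[i])
        out += (slope - prev_slope) * relu(x - knots[i])
        prev_slope = slope
    return out

target = lambda t: np.sin(2 * np.pi * t)
x = np.linspace(0.0, 1.0, 400)
for n in (1, 4, 16, 64):
    mse = np.mean((hand_built_net(x, target, n) - target(x)) ** 2)
    print(n, mse)   # the error shrinks steadily as the width grows
```

Each extra neuron adds one more kink, and the piecewise-linear fit hugs the sine more tightly.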
Step 4

Why can't one neuron just do it?

The difficulty is geometric. Here's a table that makes the bottleneck concrete for the sine target:

| Hidden width $N$ | What the network can express | Best achievable MSE on sine |
| --- | --- | --- |

Every row of this table is computed live from the current scenario. Notice how the error falls roughly as $1/N^2$—doubling the width quarters the error. Cybenko's theorem says this continues forever: there's no wall you hit, only a trade-off between width and accuracy.

Step 5

Now let gradient descent do the placing

Our hand-placed bumps are evenly spaced. A trained network finds much better placements on its own—clustering bumps where the target is wiggly, spreading them out where it's flat. Click Train and watch it happen. The browser runs actual stochastic gradient descent on a one-hidden-layer MLP with the width, activation, and target you pick below.

Live fit while gradient descent runs.
Training loss (log scale). Each dot is one mini-batch.
[Live readouts: epoch, parameter count, train MSE.]
Things to try. Train a width-2 network on the sine wave—it will find the best two-bump fit it can. Then bump the width to 32 and retrain. Now switch the activation to tanh on the spike target: tanh is bad at sharp corners and you'll see residual wiggle. The theorem guarantees existence of a fit, not that every activation reaches it equally fast.
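For readers who want the trainer's guts, here is a minimal numpy version: plain SGD with hand-written backprop on a one-hidden-layer ReLU MLP. The hyperparameters, seed, and batch size are arbitrary choices for this sketch, not the page's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
N, lr = 32, 0.1                      # hidden width, learning rate
w = rng.normal(size=N)               # input -> hidden weights
b = rng.normal(size=N)               # hidden biases
c = 0.1 * rng.normal(size=N)         # hidden -> output weights
c0 = 0.0                             # output bias
target = lambda t: np.sin(2 * np.pi * t)

def forward(x):
    h = np.maximum(0.0, np.outer(x, w) + b)   # (batch, N) hidden activations
    return h, h @ c + c0

xg = np.linspace(0.0, 1.0, 200)
mse_before = np.mean((forward(xg)[1] - target(xg)) ** 2)

for step in range(4000):
    x = rng.uniform(0.0, 1.0, size=64)        # one mini-batch
    h, yhat = forward(x)
    err = yhat - target(x)
    dc, dc0 = h.T @ err / len(x), err.mean()  # output-layer gradients
    dh = np.outer(err, c) * (h > 0.0)         # backprop through the ReLU
    dw, db = (dh * x[:, None]).mean(axis=0), dh.mean(axis=0)
    w -= lr * dw; b -= lr * db; c -= lr * dc; c0 -= lr * dc0

mse_after = np.mean((forward(xg)[1] - target(xg)) ** 2)
print(mse_before, "->", mse_after)            # the loss should drop sharply
```

Unlike the hand-placed construction, nothing here forces the kinks to be evenly spaced; gradient descent moves each bias to wherever it reduces the loss most.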
Step 6

The big number

At 16 neurons, a one-hidden-layer MLP has only a few dozen parameters, yet it can place localized bumps anywhere on the interval, something a polynomial with the same number of coefficients handles far less gracefully. Here's the count for the network you're currently training:

Parameters in a single hidden layer of width 16

49

= N input weights + N hidden biases + N output weights + 1 output bias = 3N + 1

Compare that to what the theorem promises: any continuous function on $[0,1]$, to arbitrary precision, using nothing but these few numbers. The “cost” is not in complexity but in width—and in a training procedure that can find the right parameters.
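The arithmetic behind the counter, as a one-line helper (the function name is hypothetical):

```python
def param_count(width):
    """Parameters of a 1-input, 1-output MLP with one hidden layer:
    width input weights + width hidden biases + width output weights
    + 1 output bias = 3*width + 1."""
    return 3 * width + 1

print(param_count(16))  # -> 49
print(param_count(8))   # -> 25
```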

Step 7

Three things the theorem does not say

Universal Approximation is often repeated as a slogan ("neural nets can learn anything!"). Three important caveats usually get dropped:

Myth

"It means one hidden layer is enough in practice."
The theorem says enough width suffices. In practice, the required width can be astronomical for complicated functions. Adding depth lets you build bumps from bumps, cutting the parameter budget dramatically. That's why modern networks are deep, not one layer wide.

Myth

"Training is guaranteed to find the approximation."
The proof shows a good approximation exists; it says nothing about whether gradient descent can find it. In the demo above you'll see tanh on a spike get stuck—not a failure of the theorem, but of the optimizer on that loss surface.

Myth

"It means networks generalize to unseen inputs."
The theorem is about representing a function you already know, everywhere on the interval. Whether a network trained on finitely many samples behaves sensibly between them (generalization) is a completely separate question, handled by regularization, inductive bias, and more data.

Final takeaway. A one-hidden-layer network is a bump builder. The theorem says: with enough bumps, you can match any shape. The practice of deep learning is about finding the bumps efficiently—through depth, better optimizers, and good inductive biases. You now know exactly what “universal” in Universal Approximation means, because you just watched the bumps stack up one at a time.