
Interactive Explainer

Feeling GAN training through the forger-and-detective game

Forget the loss function for a moment. Start with a counterfeiter and a detective, give them paint and magnifying glasses, and watch a generative model emerge as a pencil-and-paper game—no black box, no hand-waving.

Prelude

The counterfeiter and the detective

A counterfeiter prints fake banknotes. A detective inspects every bill that passes through a bank. At first, the counterfeiter is bad — the fakes look like kindergarten art, and the detective catches them instantly. The detective looks smart. The counterfeiter looks ridiculous.

But the counterfeiter learns. Every time a fake is caught, she asks the detective: "What tipped you off?" She fixes that flaw. A few hundred iterations later the fakes pass muster on the old tests — the detective must find new ones. A few thousand iterations later the detective's "tests" are splitting hairs of the paper fiber.

At the limit, the counterfeiter's bills are indistinguishable from real ones. The detective is left flipping coins. A GAN is this game, with two neural networks as the players and probability distributions as the bills.

By the end of this page, you will have driven both sides of this game yourself — you'll see what the detective's decision curve looks like, watch the counterfeiter's distribution glide onto the target, and understand precisely what "Nash equilibrium" means for two neural networks.
Step 1

Pick a target distribution

Before any training, fix what the real data looks like. In a real GAN this is a dataset of images; here it's a 1D probability density we can plot. Several shapes come up in GAN debugging discussions; this page uses a classic:

Two symmetric Gaussians — think "heights of adult men and women mixed together."

Real data distribution pdata(x). We'll keep this fixed; the generator tries to learn it.
Pause and think. If you'd never seen this density, how would you try to reproduce it? You could fit a parametric family (Gaussian mixture) — but what if the true shape isn't a mixture? A GAN skips that choice: the generator is a neural network, free to take any form. The training loss steers it toward pdata without ever assuming what shape it is.
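A target like this is a few lines of code. A minimal sketch, assuming an equal-weight mixture of N(−2, 1) and N(+2, 1); the demo's exact parameters may differ:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def p_data(x):
    """Bimodal target: equal mixture of N(-2, 1) and N(+2, 1)."""
    return 0.5 * gauss_pdf(x, -2.0, 1.0) + 0.5 * gauss_pdf(x, 2.0, 1.0)

# Riemann-sum sanity check: a density must integrate to ~1
dx = 0.01
total = sum(p_data(-10 + i * dx) for i in range(2000)) * dx
```

The two peaks sit at ±2 with a valley at 0, which is exactly the shape the generator will have to discover.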
Step 2

What does a "bad" discriminator look like?

The detective — the discriminator D(x) — takes a point x and outputs a probability: "how confident am I this came from real data, not the generator?". A perfect D outputs 1 for real, 0 for fake.

Drag the slider below to build a manual linear discriminator: a simple threshold and slope. Watch how its accuracy (fraction correctly classified) changes when you move it around. The generator's distribution here is fixed at N(0, 1.5²) — the dumb initialization before training.

(Interactive readouts: real detected correctly · fake detected correctly · discriminator loss.)
What the loss measures. A discriminator's loss is a classification loss — binary cross-entropy:
L_D  =  − E_{x∼pdata}[log D(x)]  −  E_{x∼pG}[log(1 − D(x))]
Low loss means D splits real from fake cleanly. By moving the slider you can see the loss drop when you align the decision boundary with where real and fake differ the most.
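The slider's loss readout can be reproduced numerically. A sketch with a deliberately separable setup, using an assumed real distribution N(3, 1) against the fixed fake N(0, 1.5²), so the effect of moving the boundary is easy to see; `slope` and `threshold` stand in for the two sliders:

```python
import math, random

random.seed(0)

def D(x, slope, threshold):
    """Manual linear discriminator: sigmoid of slope * (x - threshold)."""
    return 1.0 / (1.0 + math.exp(-slope * (x - threshold)))

def bce_loss(reals, fakes, slope, threshold):
    """Binary cross-entropy: -E_real[log D] - E_fake[log(1 - D)]."""
    eps = 1e-12
    l_real = -sum(math.log(D(x, slope, threshold) + eps) for x in reals) / len(reals)
    l_fake = -sum(math.log(1 - D(x, slope, threshold) + eps) for x in fakes) / len(fakes)
    return l_real + l_fake

reals = [random.gauss(3.0, 1.0) for _ in range(2000)]  # stand-in "real" data
fakes = [random.gauss(0.0, 1.5) for _ in range(2000)]  # untrained generator N(0, 1.5^2)

aligned   = bce_loss(reals, fakes, slope=2.0, threshold=1.5)   # boundary between the two
misplaced = bce_loss(reals, fakes, slope=2.0, threshold=-3.0)  # boundary in the wrong place
```

The aligned boundary lands below the give-up value log 4 ≈ 1.386; the misplaced one is far above it, because almost every fake is confidently called real.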
Step 3

What does the optimal discriminator look like?

Turns out there's a closed-form answer. If the generator's distribution is fixed, the D that minimises the loss is exactly:

D*(x)  =  pdata(x) / ( pdata(x) + pG(x) )

In plain English: for every point x, the optimal detective answers "probability this is real" by computing the ratio of real-density to total-density. You can build this by hand without any training — if you know pdata and pG.

Densities and the optimal D*(x). Move the slider to change the generator's standard deviation and watch D* re-shape itself — a better generator makes D* flatter.
(Curves: real pdata(x) · generator pG(x) · optimal D*(x).)
Read this carefully. When pG exactly equals pdata, the ratio is p/(p+p) = 0.5 everywhere. The optimal discriminator is a horizontal line at 0.5 — it gives up. That flat line is the target of GAN training.
Common confusion: "But we never actually compute D* because we don't know pG." True. The training process approximates this optimum by gradient descent on the BCE loss. Each step of D training nudges D closer to D* for the current G.
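The closed-form optimum is one line of code. A sketch reusing the Step-1-style mixture for pdata and a Gaussian pG; these exact parameters are an assumption, not the demo's:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def p_data(x):
    """Bimodal target: equal mixture of N(-2, 1) and N(+2, 1)."""
    return 0.5 * gauss_pdf(x, -2.0, 1.0) + 0.5 * gauss_pdf(x, 2.0, 1.0)

def p_g(x, sigma=1.5):
    """Untrained generator: N(0, sigma^2)."""
    return gauss_pdf(x, 0.0, sigma)

def d_star(x):
    """Optimal discriminator D*(x) = p_data / (p_data + p_G)."""
    return p_data(x) / (p_data(x) + p_g(x))

# When p_G equals p_data, the ratio is p/(p+p) = 0.5 everywhere: D* gives up.
def d_star_converged(x):
    return p_data(x) / (p_data(x) + p_data(x))
```

Evaluating `d_star` near a real mode (x = 2) gives a value above 0.5, and in the fake-heavy center (x = 0) a value below 0.5, matching the curve in the widget.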
Step 4

What does the generator do?

Given a discriminator, the generator's job is to fool it. If D thinks real points have high score and fakes have low score, G wants to push its output toward regions where D is high.

That's the gradient signal. At a sample x = G(z), the generator receives:

∇_θ log D(G(z; θ))

which points in the direction that raises D at the generated point. Interpretation: "take this sample and move it toward a place where the detective would have been more fooled."

G's gradient field for a single sample. Click anywhere on the x-axis to place a generated sample. The red arrow shows the direction G's gradient wants to push it.


Look at the arrows. Samples sitting on the left mode of the real distribution get small arrows — they're already plausible. Samples sitting between modes (where pdata is low and D is low) get big arrows pointing toward the nearest real mode. That's the physical meaning of "move toward more plausible space."
Saturation warning. Suppose D confidently rejects a sample: D(G(z)) = 0.001. Writing D = σ(f), the original minimax loss log(1 − D(G(z))) has gradient −D·∇f at that sample, proportional to 0.001: essentially zero. The more confident D gets, the less G learns. That's the vanishing-gradient problem Goodfellow's non-saturating loss (maximize log D instead of minimizing log(1 − D)) was designed to fix: its gradient is (1 − D)·∇f, strongest exactly where D is most confident.
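The size of these gradients can be checked numerically. A minimal sketch, assuming D(x) = σ(a·x + b) with illustrative parameters a, b; for this form, d/dx log(1 − D) = −D·a and d/dx log D = (1 − D)·a:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

a, b = 2.0, 0.0          # assumed discriminator: D(x) = sigmoid(a * x + b)
x = -3.45                # a generated sample that D confidently rejects
d = sigmoid(a * x + b)   # D(x) is roughly 0.001 here

# Original minimax G loss, minimize log(1 - D(x)):
# gradient w.r.t. x is -D(x) * a, which vanishes as D gets confident
grad_saturating = -d * a

# Non-saturating G loss, maximize log D(x):
# gradient w.r.t. x is (1 - D(x)) * a, healthy exactly when D is confident
grad_nonsaturating = (1 - d) * a
```

Same sample, same discriminator: the saturating gradient is on the order of 0.002, the non-saturating one near 2 — a thousandfold difference in learning signal.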
Step 5

The full dance

Now put both sides together. At every training step, we alternate:

  1. Update D: nudge it toward D* for the current G (via gradient descent on BCE loss).
  2. Update G: nudge it in the direction that raises D at its samples.

Scrub the training step slider below and watch the three curves evolve together. At step 0, G is noise; D quickly becomes peaky. As G improves, D's decision curve flattens until it's 0.5 everywhere.

(Curves: pdata · pG · D(x) · D loss · G loss.)
What the loss dynamics show. The two losses oscillate rather than decreasing monotonically. This is expected — when G improves, D's job gets harder; when D adjusts, G's gradient shifts. The oscillation is the dance. At convergence both losses settle around a fixed point; for the original minimax objective, D's BCE loss settles at −2 log(1/2) = log 4 ≈ 1.386.
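The two-step loop can be run end to end on a toy problem. A minimal sketch, not the page's actual demo: it assumes a single-Gaussian target N(2, 1), a linear generator G(z) = μ + s·z, a logistic discriminator D(x) = σ(a·x + b), the non-saturating G loss, and hand-derived gradients; the learning rate and batch size are illustrative:

```python
import math, random

random.seed(1)

def sigmoid(t):
    t = max(-30.0, min(30.0, t))  # clamp for numerical safety
    return 1.0 / (1.0 + math.exp(-t))

mu, s = -1.0, 0.5   # generator G(z) = mu + s * z, initially far from the target
a, b = 0.1, 0.0     # discriminator D(x) = sigmoid(a * x + b)
lr, batch = 0.05, 256

for step in range(2000):
    reals = [random.gauss(2.0, 1.0) for _ in range(batch)]  # target: N(2, 1)
    zs = [random.gauss(0.0, 1.0) for _ in range(batch)]
    fakes = [mu + s * z for z in zs]

    # 1. Update D: one gradient step on BCE  -E[log D(real)] - E[log(1 - D(fake))]
    da = db = 0.0
    for x in reals:
        d = sigmoid(a * x + b)
        da -= (1 - d) * x   # d/da of -log D(x)
        db -= (1 - d)
    for x in fakes:
        d = sigmoid(a * x + b)
        da += d * x         # d/da of -log(1 - D(x))
        db += d
    a -= lr * da / batch
    b -= lr * db / batch

    # 2. Update G: one gradient step on the non-saturating loss  -E[log D(G(z))]
    dmu = ds = 0.0
    for z in zs:
        d = sigmoid(a * (mu + s * z) + b)
        g = -(1 - d) * a    # d/dx of -log D(x) at x = G(z)
        dmu += g
        ds += g * z
    mu -= lr * dmu / batch
    s -= lr * ds / batch
```

By the end, μ should have drifted from −1 toward the real mean at 2, with a and b flattening out as the distributions overlap — the numerical counterpart of D's decision curve settling toward 0.5.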
Step 6

Mode collapse · when the dance breaks down

In the ideal trajectory above, G learns the full distribution. In practice, G often takes a shortcut: it discovers that one particular output fools D well, and produces only that output for every input. Diversity is lost — mode collapse.

Mode-collapse demonstration. Toggle the scenario and watch the generator fixate on just one mode of a bimodal target, even as training proceeds.
Why collapse happens. G's loss only rewards fooling D at the samples it actually produces; nothing in the objective rewards covering every mode. If one output already fools D, gradient descent gives G no reason to produce a second one, so it concentrates all its mass on the easiest mode. Preventing collapse needs extra tricks: minibatch discrimination, WGAN's earth-mover distance, spectral normalization, or just careful hyperparameter tuning.
Step 7

Three things a GAN is not

False

"GANs minimize a meaningful loss."
The minimax objective doesn't admit a single "loss goes down" interpretation. D's loss going up might mean G got better; G's loss going up might mean D caught on. Monitoring losses alone is a terrible way to judge GAN progress — use sample quality + diversity metrics (FID, precision/recall).

False

"The generator learns pdata."
Only approximately, and only at the idealised Nash equilibrium. In practice, G often learns a good-looking subset of pdata. This is why WGAN's Wasserstein distance and other regularisers matter — they push the model closer to the full distribution.

False

"You should train D to convergence before updating G."
The original paper suggested this; in practice it kills G's gradient (the stronger D gets, the more it saturates, the smaller G's gradient). Most GAN recipes alternate one step of each, or use the non-saturating G loss to stay robust.

Bonus

From minimax to Wasserstein · why training got easier

The original GAN objective measures a Jensen-Shannon divergence between pdata and pG. The problem: JS is saturating — when the distributions barely overlap (early in training), JS ≈ log 2 no matter how close or far they are. So the generator gets no useful gradient signal.

Arjovsky et al. 2017 (WGAN) replaced JS with the Wasserstein-1 (earth-mover) distance:

W(pdata, pG)  =  inf_{γ ∈ Π(pdata, pG)}  E_{(x,y)∼γ}[‖x − y‖]

where Π(pdata, pG) is the set of couplings: joint distributions over (x, y) whose marginals are pdata and pG.

Think of it as "the minimum cost to physically move the mass of pG onto pdata." This distance shrinks continuously as the distributions get closer — no saturation. Training is dramatically more stable.
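In 1D "the minimum cost to move the mass" has a simple empirical form: sort both samples and average the pointwise gaps. A sketch contrasting that with a histogram estimate of JS divergence; the sample sizes, bin counts, and Gaussian parameters are all illustrative:

```python
import math, random

random.seed(0)

def w1(xs, ys):
    """Empirical Wasserstein-1 in 1D: mean gap between sorted samples."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

def js(xs, ys, lo=-10.0, hi=10.0, bins=200):
    """Histogram estimate of Jensen-Shannon divergence (in nats)."""
    def hist(samples):
        h = [0] * bins
        for v in samples:
            i = int((v - lo) / (hi - lo) * bins)
            h[max(0, min(bins - 1, i))] += 1
        return [c / len(samples) for c in h]
    def kl(u, v):
        return sum(ui * math.log(ui / vi) for ui, vi in zip(u, v) if ui > 0 and vi > 0)
    p, q = hist(xs), hist(ys)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

real = [random.gauss(0.0, 0.5) for _ in range(5000)]
far  = [random.gauss(6.0, 0.5) for _ in range(5000)]  # generator far from the target
near = [random.gauss(3.0, 0.5) for _ in range(5000)]  # generator halfway there

w_far, w_near = w1(real, far), w1(real, near)      # shrinks in proportion to the gap
js_far, js_near = js(real, far), js(real, near)    # pinned near log 2 in both cases
```

As the fake mean moves from 6 to 3, W1 halves — a usable training signal — while both JS estimates sit near log 2 ≈ 0.693, exactly the saturation described above.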

Comparing the three objectives

Objective | Gradient when distributions are far apart | Training stability
Minimax (original, saturating) | Vanishing | Poor
Non-saturating (log D) | Healthy when G is bad | Standard
Wasserstein (WGAN / WGAN-GP) | Healthy always (linear in distance) | Best
Connection to diffusion (L21-L22). Diffusion models avoid the GAN problem entirely by never asking two distributions to match at once. They teach a single model to undo noise step by step, with a regression loss (MSE on predicted noise) — no adversarial game, no mode collapse, no minimax. That's the main reason diffusion displaced GANs for most practical image generation in 2022-23.
Final takeaway. When someone says "GAN training is unstable," they're describing the exact dynamics you just drove. The generator and discriminator are chasing each other around a saddle point in a 10⁹-dimensional space — the miracle is that it works at all. WGAN, spectral norm, and large-scale engineering (StyleGAN, BigGAN) made it work reliably, not because they found a new idea but because they tamed the dance above.

Part of the ES 667 Deep Learning course · IIT Gandhinagar · Aug 2026.