Imagine you're given 10,000 photos of cats. You must produce new cat photos that look real — not in the dataset, but plausibly drawable from the same distribution.
Mathematically: estimate $p_{\text{data}}(x)$ from the samples, then draw new $x \sim p_{\text{data}}$.
This looks innocent. But $x$ is a high-dimensional image, and $p_{\text{data}}$ is a wildly complicated distribution concentrated on a thin slice of that space.
Tempting: fit a Gaussian $\mathcal{N}(\mu, \Sigma)$ to the pixels by maximum likelihood.
Doesn't work. A Gaussian's support is all of $\mathbb{R}^d$, and its samples are blurry averages; real images occupy a vanishingly thin region of pixel space.
Same problem for any simple parametric family. Images live where simple doesn't reach.
Don't try to write down $p(x)$ at all. Learn a sampler instead:
i.e. · sample a simple distribution (Gaussian noise $z \sim \mathcal{N}(0, I)$), then push it through a neural network: $x = G(z)$.
The neural network is the distribution. We never write it down; we just sample from it by sampling noise and running forward. That's the big shift in the 2014 era.
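A minimal sketch of that shift (the MLP and its layer sizes here are illustrative, not from the text): the network plus a noise source *is* the distribution.

```python
import torch
import torch.nn as nn

# Hypothetical generator: any network mapping noise to data space
# defines an implicit distribution p_G. Layer sizes are arbitrary.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())

z = torch.randn(64, 100)   # sample the simple distribution
x_fake = G(z)              # "sampling p_G" is just a forward pass
print(x_fake.shape)        # torch.Size([64, 784])
```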
Training question · how do we make $p_G$ match $p_{\text{data}}$ when we can't evaluate either density?
Goodfellow's idea · even if we can't compute $p_G(x)$ or $p_{\text{data}}(x)$, we can train a classifier to tell their samples apart.
GAN = generator G + adversary D, trained together. The adversary's loss is well-defined (binary cross-entropy). Its gradients carry implicit information about how $p_G$ differs from $p_{\text{data}}$.
One paper → a decade of progress in generative models.
G is a counterfeiter. It makes fake paintings and tries to pass them off as real.
D is an art detective. It sees a mix of real and fake paintings and labels each.
Both get better until the fakes are indistinguishable and the detective can only guess. That stalemate is the Nash equilibrium, and it's what the math below formalizes.
Interactive: scrub through training steps; watch G's distribution slide onto the real one and D(x) flatten to 0.5 — gan-minimax-dance.
Target data · a mixture of two 1-D Gaussians (two well-separated modes).
Step 0 · G emits an unstructured blob; D separates real from fake almost perfectly.
Step 100 · G has shifted its mass outward; two bumps emerge, near but not on the modes.
Step 500 · G matches the two modes closely. D's output is ~0.5 everywhere.
Step 1000 · p_G = p_data exactly. D is random guessing. Nash equilibrium.
This trajectory is animated in the interactive. The math that follows formalizes each bullet above.
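If you want to reproduce the toy target outside the interactive, here is a sketch of the data sampler; the mode positions and width are my guesses for illustration, not the interactive's actual values.

```python
import torch

def sample_real(n, modes=(-2.0, 2.0), std=0.25):
    # Mixture of two 1-D Gaussians; mode positions/width are illustrative guesses.
    centers = torch.tensor(modes)
    idx = torch.randint(0, len(modes), (n,))
    return centers[idx] + std * torch.randn(n)

print(sample_real(5))   # e.g. tensor([ 2.1, -1.9,  2.3, -2.2, -1.8]) (values vary)
```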
The math behind the dance
D is just a binary classifier. We know how to train those.
D's goal. For real $x$, push $D(x) \to 1$; for fake $G(z)$, push $D(G(z)) \to 0$.
Average over the data distribution and noise:
$$V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
G's goal. Opposite: make $D(G(z)) \to 1$, i.e. drive $V$ down.
A two-player minimax game. Nash equilibrium · $\min_G \max_D V(D, G)$, reached when $p_G = p_{\text{data}}$ and $D \equiv 1/2$.
One real sample $x_r$, one fake sample $x_f = G(z)$.
D's objective value. $\log D(x_r) + \log(1 - D(x_f))$.
Maximize → push $D(x_r) \to 1$ and $D(x_f) \to 0$.
G's objective value. G only sees the second term: $\log(1 - D(x_f))$.
Minimize → push $D(x_f) \to 1$.
Each step both networks adjust → an equilibrium dance.
Fix G. Maximize over D pointwise: for each $x$, choose $D(x)$ to maximize $p_{\text{data}}(x)\log D(x) + p_G(x)\log(1 - D(x))$.
Take derivative · set it to zero: $D^*(x) = \dfrac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}$.
Sanity check on $p_G = p_{\text{data}}$ · then $D^*(x) = 1/2$ everywhere, as expected at equilibrium.
Plug $D^*$ back into $V$:
The GAN objective is equivalent to minimizing Jensen–Shannon divergence: $V(D^*, G) = 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_G) - \log 4$. JSD = 0 iff $p_G = p_{\text{data}}$, so the game's unique global optimum is exactly the data distribution.
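Spelling out the substitution step (the standard algebra from the 2014 paper):

$$\begin{aligned}
V(D^*, G) &= \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \tfrac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_G(x)}\right] + \mathbb{E}_{x \sim p_G}\!\left[\log \tfrac{p_G(x)}{p_{\text{data}}(x)+p_G(x)}\right] \\
&= -\log 4 + \mathrm{KL}\!\left(p_{\text{data}} \,\middle\|\, \tfrac{p_{\text{data}}+p_G}{2}\right) + \mathrm{KL}\!\left(p_G \,\middle\|\, \tfrac{p_{\text{data}}+p_G}{2}\right) \\
&= -\log 4 + 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_G).
\end{aligned}$$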
import torch

# Setup (assumed): G, D are the networks defined below; loader yields batches
# of real images. Adam(lr=2e-4, betas=(0.5, 0.999)) is the standard DCGAN choice.
NOISE_DIM = 100
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))

for batch in loader:
    # 1. Update discriminator
    opt_D.zero_grad()
    real = batch
    n = real.size(0)  # match the fake batch to the real one (handles a partial last batch)
    fake = G(torch.randn(n, NOISE_DIM)).detach()  # stop grad through G here
    loss_D = -(D(real).log() + (1 - D(fake)).log()).mean()
    loss_D.backward(); opt_D.step()

    # 2. Update generator
    opt_G.zero_grad()
    fake = G(torch.randn(n, NOISE_DIM))
    loss_G = -(D(fake).log()).mean()  # non-saturating (see next slide)
    loss_G.backward(); opt_G.step()
Key pattern — alternate between updating D and G. Balance is critical: if D gets too strong, G can't learn.
.detach() matters. In the D-update, we compute fake = G(z).detach(). Why?
Without detach(), PyTorch builds a graph through G. Calling backward() on D's loss would compute G's gradients too — but we don't want to update G yet. detach() snips the graph at G's output. G's params get no gradient from the D update.
It's a subtle but critical detail. Forgetting it is one of the top-3 bugs in from-scratch GAN code.
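A quick check you can run; the toy linear G and D (and their dimensions) are stand-ins, not the networks from this section.

```python
import torch
import torch.nn as nn

G, D = nn.Linear(4, 4), nn.Linear(4, 1)   # toy stand-ins
z = torch.randn(2, 4)

D(G(z).detach()).sum().backward()
print(G.weight.grad)   # None: detach() cut the graph at G's output
print(D.weight.grad)   # a tensor: D's params still receive gradients
```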
Intuition · early in training, G is bad, so D confidently rejects fakes: $D(G(z)) \approx 0$.
The original G objective, minimize $\log(1 - D(G(z)))$, is nearly flat there: the gradient vanishes exactly when G most needs a signal.
Non-saturating G objective (Goodfellow 2014 footnote, became the standard):
Maximize $\log D(G(z))$ instead, i.e. minimize $-\log D(G(z))$. Same fixed point, strong gradient when D is winning.
Everyone uses the non-saturating version.
Let D's pre-sigmoid activation be $a$, so $D = \sigma(a)$, and suppose $\sigma(a) \approx 0$ (D confidently rejects the fake).
Saturating loss · $L = \log(1 - \sigma(a))$.
Chain rule · $\partial L / \partial a = -\sigma(a) \to 0$ as $\sigma(a) \to 0$.
Non-saturating loss · $L = -\log \sigma(a)$.
Chain rule · $\partial L / \partial a = \sigma(a) - 1 \to -1$ as $\sigma(a) \to 0$.
The non-saturating gradient is ~100× stronger in this common early-training regime ($\sigma(a) \approx 0.01$: gradient $-0.99$ vs $-0.01$). Without it, vanilla GANs barely get off the ground; the fix appeared as a footnote in the original 2014 paper.
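Verifying the two gradients numerically; the value of $a$ is chosen so that $\sigma(a) \approx 0.01$.

```python
import torch

a = torch.tensor(-4.6, requires_grad=True)    # sigmoid(-4.6) ≈ 0.01
d = torch.sigmoid(a)

torch.log(1 - d).backward(retain_graph=True)  # saturating loss
print(a.grad)          # ≈ -0.01: almost no signal for G

a.grad = None
(-torch.log(d)).backward()                    # non-saturating loss
print(a.grad)          # ≈ -0.99: roughly 100x stronger
```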
Every optimizer you've seen (SGD, Adam) is for single-objective minimization. You're walking a fixed landscape.
GANs are different — neither loss is fixed. When G moves, D's optimal landscape shifts. When D moves, G's optimal landscape shifts.
This is a saddle-point search in a 10⁹-dimensional game. Standard optimization guarantees (convergence, stability, unique optimum) do not apply. Everything in GAN lore — DCGAN tricks, spectral normalization, WGAN — is fighting this.
Radford et al. 2015
2014-2015 papers reported brittle, hard-to-reproduce training: dozens of competing GAN variants, none of which trained consistently. What was needed were architectural norms, not loss tweaks.
Radford, Metz, Chintala 2015 · "Unsupervised Representation Learning with Deep Convolutional GANs" — a cookbook that made the whole field tractable.
The DCGAN rules aren't arbitrary · each is a stabilizer for the tricky GAN game:
- Replace pooling with strided convolutions (D) and transposed convolutions (G).
- Use batch norm in both G and D.
- Remove fully connected hidden layers.
- ReLU in G (Tanh at the output layer); LeakyReLU in D.
Each rule is a small wedge that prevents a known failure mode. These aren't deep insights; they are a cookbook that, combined, made GANs actually train.
G needs to go from (batch, noise_dim) to an image tensor such as (batch, 3, 64, 64), i.e. upsampling. Use ConvTranspose2d:
A normal conv shrinks (or preserves) spatial size. A transposed conv inflates · each input pixel is multiplied by the kernel and spread into a larger output.
Dimension formula (inverse of conv) · $\text{out} = (\text{in} - 1) \cdot \text{stride} - 2 \cdot \text{padding} + \text{kernel}$.
For kernel 4, stride 2, padding 1 · $\text{out} = 2 \cdot \text{in}$, so each layer doubles the spatial size.
Alternative · nearest-neighbor upsample + regular conv. Often cleaner; fewer "checkerboard" artifacts.
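A one-line shape check of the doubling formula above (channel counts are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 8, 8)
up = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)
print(up(x).shape)   # torch.Size([1, 32, 16, 16]): spatial size doubled
```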
Batch norm stabilizes GANs because it keeps activations well-scaled under poor initialization, helps gradients flow through deep stacks, and (per the DCGAN paper) helps prevent the generator from collapsing all samples to a single point.
Exception: the generator's output layer and the discriminator's input layer should not have BN — they'd destroy the fine-grained info there. The DCGAN paper is explicit about this.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, nz=100, ngf=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(nz, ngf*8, 4, 1, 0, bias=False),     # 1x1 -> 4x4
            nn.BatchNorm2d(ngf*8), nn.ReLU(True),
            nn.ConvTranspose2d(ngf*8, ngf*4, 4, 2, 1, bias=False),  # 4x4 -> 8x8
            nn.BatchNorm2d(ngf*4), nn.ReLU(True),
            nn.ConvTranspose2d(ngf*4, ngf*2, 4, 2, 1, bias=False),  # 8x8 -> 16x16
            nn.BatchNorm2d(ngf*2), nn.ReLU(True),
            nn.ConvTranspose2d(ngf*2, 3, 4, 2, 1, bias=False),      # 16x16 -> 32x32
            nn.Tanh()  # output in [-1, 1]
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))
Noise shape: [batch, 100] → reshaped to [batch, 100, 1, 1] → upsampled to [batch, 3, 32, 32]. (For 64×64 output, add one more stride-2 block.)
class Discriminator(nn.Module):
    def __init__(self, ndf=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ndf, 4, 2, 1, bias=False),        # 32x32 -> 16x16
            nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf, ndf*2, 4, 2, 1, bias=False),    # 16x16 -> 8x8
            nn.BatchNorm2d(ndf*2), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf*2, ndf*4, 4, 2, 1, bias=False),  # 8x8 -> 4x4
            nn.BatchNorm2d(ndf*4), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf*4, 1, 4, 1, 0, bias=False),      # 4x4 -> 1x1
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x).view(-1)
Mirror of G, basically. No BN on the first layer; LeakyReLU helps gradients flow for negative activations (dead-neuron fix).
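A quick end-to-end sanity check of the pair, using the two classes above:

```python
G, D = Generator(), Discriminator()
z = torch.randn(16, 100)
imgs = G(z)
print(imgs.shape, D(imgs).shape)   # torch.Size([16, 3, 32, 32]) torch.Size([16])
```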
Weight init · N(0, 0.02) for conv layers; N(1, 0.02) for BN gamma. This exact recipe trains on most small-to-medium image datasets without hand-holding, and is a useful starting point for any GAN project.
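A helper implementing that init (the function name is mine; the values are the ones above):

```python
def weights_init(m):
    name = m.__class__.__name__
    if 'Conv' in name:
        nn.init.normal_(m.weight, 0.0, 0.02)
    elif 'BatchNorm' in name:
        nn.init.normal_(m.weight, 1.0, 0.02)   # gamma
        nn.init.zeros_(m.bias)                 # beta

G.apply(weights_init)
D.apply(weights_init)
```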
The pathologies
GAN training is a non-cooperative game. Three common failure modes: mode collapse, vanishing gradients, and oscillation without convergence.
Most of 2015–2019 GAN research is fighting these failure modes.
G doesn't need to produce the full distribution to get low loss. It just needs to fool D on whatever samples it produces.
Suppose G finds one output (say, one "mode" of faces) that reliably fools D. D can only fix this by re-learning that particular mode; while it does, G moves to another mode. The two chase through a few modes and settle on whichever is easiest — never seeing the full distribution.
Picture · instead of p_G covering p_data, p_G is a point mass (or thin ridge) sitting inside p_data.
In 2026, if you need a GAN you almost always use WGAN-GP or a StyleGAN-family architecture; both mostly eliminate mode collapse in practice.
Standard diagnostics: inspect large sample grids for near-duplicates, compare sample diversity against held-out real data, and track FID over training.
FID is the main quantitative metric for image generation in 2026.
Fréchet Inception Distance takes two sets of images (real vs fake), runs both through a pretrained Inception-V3, gets 2048-dim feature vectors, and computes:
$$\text{FID} = \|\mu_r - \mu_f\|_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_f - 2(\Sigma_r \Sigma_f)^{1/2}\right)$$
i.e. · the Wasserstein-2 distance between two Gaussians fitted to the feature distributions.
Lower is better. FID of 10-20 · "looks good". FID of 3-5 · "basically indistinguishable". SOTA diffusion on ImageNet is ~2.
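Given the two feature matrices, the statistic itself is a few lines. A sketch using scipy; extracting the Inception-V3 features is omitted, and the function name is mine.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_fake):
    """feats_*: (N, 2048) arrays of Inception-V3 pool features."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # strip numerical-noise imaginary parts
    return np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean)
```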
Arjovsky et al. 2017
Recall · the original GAN minimizes JS divergence. When $p_{\text{data}}$ and $p_G$ have (near-)disjoint supports, $\mathrm{JSD}(p_{\text{data}} \| p_G)$ sits at its maximum, $\log 2$, no matter how far apart the distributions are.
Early in training, fake samples are far from real ones; the supports barely overlap. JS is saturated, so G's gradient is zero. This is the fundamental reason vanilla GANs struggle at the start.
Imagine two piles of sand (two distributions). How much work to reshape one pile into the other — moving each grain the minimum distance?
$$W(p_{\text{data}}, p_G) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_G)} \mathbb{E}_{(x, y) \sim \gamma}\left[\|x - y\|\right]$$
where $\gamma$ ranges over all transport plans: joint distributions whose marginals are $p_{\text{data}}$ and $p_G$.
Unlike JS, $W$ keeps decreasing as the piles move closer, even while their supports are disjoint, so it provides a useful gradient everywhere.
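A quick 1-D illustration with scipy (the distribution parameters are arbitrary): JS would read $\log 2$ for every shift below, while W-1 tracks the actual distance.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(0.0, 0.1, 1000)
for shift in (1, 5, 10):
    fake = rng.normal(shift, 0.1, 1000)
    print(shift, round(wasserstein_distance(real, fake), 2))  # grows ~linearly with shift
```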
Vanilla GAN's classifier draws an infinitely steep cliff between real/fake. When supports don't overlap, the cliff has zero gradient at its base. G gets no signal.
WGAN fix · 1-Lipschitz critic. A "critic" $f$ replaces the classifier: it outputs an unbounded score, constrained to be 1-Lipschitz.
Using Kantorovich–Rubinstein duality:
$$W(p_{\text{data}}, p_G) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{x \sim p_G}[f(x)]$$
D's job · maximize the score difference (high on real, low on fake). G's job · minimize it by pushing its samples' scores up.
1-Lipschitz check. How do you enforce $\|\nabla f\| \le 1$?
We want D's slope to be 1 everywhere, but checking everywhere is impossible. Highway patrol can't put a camera on every metre, so it places random ones. WGAN-GP picks random points $\hat{x}$ on straight lines between real and fake samples and penalizes the critic's gradient norm there.
Critic's full loss ·
$$L = \mathbb{E}_{\tilde{x} \sim p_G}[f(\tilde{x})] - \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] + \lambda \, \mathbb{E}_{\hat{x}}\left[\left(\|\nabla_{\hat{x}} f(\hat{x})\|_2 - 1\right)^2\right], \quad \lambda = 10.$$
Stabilizes training and effectively eliminates mode collapse.
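A sketch of the penalty term. The interpolation trick and λ = 10 default are from the WGAN-GP paper; the helper's name and the assumption of 4-D image batches are mine.

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)  # per-sample mix
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(
        outputs=scores, inputs=x_hat,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,   # the penalty must itself be differentiable
    )[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```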
Between 2017 and 2020 WGAN-GP became the default "safe" GAN variant. Beyond it, spectral normalization (SN-GAN) added more robustness and scales better to 1024×1024.
Disentangled latents, hyper-realistic faces
Karras et al. (NVIDIA) 2018-2021 · three generations of StyleGAN.
Result · 1024×1024 face generation indistinguishable from photographs.
Each resolution block receives a separate injection from the style vector $w$ (produced by a learned mapping network from $z$), modulating that block's feature statistics.
You can mix · coarse-resolution styles from one latent, fine-resolution styles from another: pose and face shape from A, hair and texture from B.
This era gave us "AI-generated photo" as a concept.
Still alive, but niche
| Year | Model | What it did |
|---|---|---|
| 2014 | GAN | original paper — 28×28 MNIST |
| 2015 | DCGAN | convolutional, stable training |
| 2017 | WGAN-GP | Wasserstein + gradient penalty |
| 2017 | ProGAN | progressive growing → 1024×1024 faces |
| 2019 | StyleGAN | disentangled latent, hyper-realistic faces |
| 2021 | StyleGAN3 | temporal consistency, aliasing fixes |
| | GANs | Diffusion |
|---|---|---|
| Training stability | ✗ brittle | ✓ stable |
| Sample quality | ✓ (SOTA 2018-2020) | ✓✓ (SOTA 2021+) |
| Likelihood | ✗ | ≈ |
| Inference speed | ✓✓ one pass | ✗ many passes |
| Diversity | mode collapse risk | ✓ natural |
| Latent space | rich, explorable | uniform-ish |
In 2026 · diffusion dominates text-to-image / video. GANs survive where inference speed matters (real-time face generation, StyleGAN-based editing).
GANs didn't disappear — they were outperformed at their own strength (sample quality) while being simultaneously worse at training, coverage, and conditioning.
Analogy · RNNs vs Transformers. RNNs didn't vanish, just found niches (tiny devices, streaming). GANs are the same story.
The idea of "use a critic to train a generator" shows up everywhere:
The single most durable idea from GANs is "a neural network can act as a learned loss function." That framework outlived the original GAN itself.