Today · diffusion. Sharp samples + stable training + a tractable likelihood bound. SOTA since 2021 for image, video, audio, 3D generation.
Today maps to Prince Ch 18 (early) + Ho et al. 2020 (DDPM) + Song & Ermon 2020 (score-based).
Four questions: how do we destroy data? what exactly is the forward process? what does the network learn to predict? and why is noise prediction the same as learning scores?
Corrupt then learn to uncorrupt
Gradually turn an image into pure noise, then train a network to reverse that process one tiny step at a time.
At the end of training, you can start from random noise and reverse-diffuse it into a brand new image. Each small step is easy to learn; chained together they generate.
Put a drop of ink into a glass of water. It stays concentrated, then slowly spreads, then uniformly tints the water.
Ink diffuses into water · we can describe this with a simple diffusion equation. Watching a drop blur is what "noise corrupts the signal" looks like in pictures.
"Un-diffuse" the ink back into a drop. Physics says impossible (entropy only grows). But with data · we have many examples of initial states. A neural network can learn the reverse direction from those examples.
Diffusion models learn the miracle "reverse" that physics doesn't give you — but they learn it from data, not first principles.
A GAN asks a network to hit a moving target (the discriminator's decision boundary). A diffusion model asks a network to match a static target (the noise that was added). Static targets are fundamentally easier to optimize.
Interactive: slide t, see a 2D spiral dissolve into noise; press "Reverse animate" to watch it reassemble — diffusion-denoise.
Fixed · Markov · Gaussian
Analogy. Take a sharp photo and make it one step blurrier: shrink the signal slightly (multiply by √(1−β_t)) and add a little Gaussian noise (√β_t·ε). That's it.
Formally: x_t = √(1−β_t)·x_{t−1} + √β_t·ε, with ε ~ N(0, I) and β_t small (e.g. 0.01).
i.e. q(x_t | x_{t−1}) = N(x_t; √(1−β_t)·x_{t−1}, β_t·I).
Worked numeric. Over 5 steps with β_t = 0.01, track a scalar signal starting at x₀ = 2.0 (table below).
You might ask · why not just add noise? Why shrink the signal too?
Shrinking keeps the total variance bounded. If you only add noise, the variance grows without limit; after t steps Var(x_t) = Var(x₀) + Σβ_s, which explodes instead of converging to N(0, I).
Variance check · if Var(x_{t−1}) = 1, then Var(x_t) = (1−β_t)·1 + β_t = 1.
Variance of the shrunk signal, (1−β_t), plus variance of the added noise, β_t, sums to exactly 1.
This is why the forward process preserves unit variance — it's a variance-preserving SDE.
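A one-line empirical check of that claim · a sketch assuming β = 0.01, the value used in the table below:

```python
import torch

# Check: one forward step keeps unit variance (beta = 0.01 as in the table below).
beta = 0.01
x = torch.randn(1_000_000)                    # x_{t-1} with Var = 1
x_next = (1 - beta) ** 0.5 * x + beta ** 0.5 * torch.randn_like(x)
print(x_next.var().item())                    # ~1.00, since (1-beta)*1 + beta = 1
```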
Start with x₀ = 2.00 and β_t = 0.01 at every step:

| t | x_t | noise added |
|---|---|---|
| 0 | 2.00 | — |
| 1 | 2.00·√0.99 + 0.1·0.8 = 1.99 + 0.08 = 2.07 | ε = 0.8 |
| 2 | 2.02 | ε = −0.4 |
| 3 | 1.99 | ε = −0.2 |
| 4 | 2.01 | ε = 0.3 |
| 5 | 2.02 | ε = 0.2 |
After 5 steps the signal is barely disturbed. After 1000 steps with growing β, it becomes standard normal. The accumulated effect, not each step, turns signal into noise.
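The same experiment end to end, as a sketch (assumes the standard DDPM linear β range, 10⁻⁴ → 0.02):

```python
import torch

# Sketch: run the whole forward chain on x0 = 2.0 with a linear beta schedule
# (1e-4 -> 0.02 over T = 1000) and watch the signal wash out.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
x = torch.full((100_000,), 2.0)         # 100k copies of the scalar signal x0 = 2.0
for beta in betas:
    x = (1 - beta).sqrt() * x + beta.sqrt() * torch.randn_like(x)
print(x.mean().item(), x.std().item())  # ~0.0, ~1.0: standard normal
```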
After 1 step, the signal is faded by √(1−β₁); after t steps, by √α̅_t, where α̅_t = Π_{s=1}^{t}(1−β_s).
All the per-step noise additions also "pool together" into one big Gaussian: x_t = √α̅_t·x₀ + √(1−α̅_t)·ε, ε ~ N(0, I).
Variance check · we designed the process so total variance stays 1. If Var(x₀) = 1, then Var(x_t) = α̅_t + (1−α̅_t) = 1 at every t.
Worked numeric · jump from x₀ = 2.0 straight to t = 500 (linear schedule, α̅_500 = 0.17): x_500 = √0.17·2.0 + √0.83·ε ≈ 0.82 + 0.91·ε.
After 500 steps the original signal of 2.0 has nearly washed away. No iteration needed during training.
To get x_t the obvious way, you would simulate all t sequential steps for every training example.
Batch of 128, 100k examples · ~10⁹ operations just to make noise targets. Days on a single GPU.
One sample of ε and one closed-form multiply-add gives x_t directly.
Batch of 128 in one step · microseconds. Hours instead of days.
This closed-form jump is the single biggest practical enabler of DDPM training: sample a random t, form x_t in one shot, regress the noise. Without it, every training step would cost ~500× more (the average chain length).
Start from one step: x_t = √(1−β_t)·x_{t−1} + √β_t·ε_t.
Unroll: x_t = √(1−β_t)·√(1−β_{t−1})·x_{t−2} + (noise terms) = √(Π_s(1−β_s))·x₀ + (noise terms).
Merge Gaussians (sum of independent Gaussians = Gaussian with summed variances): all the noise terms collapse into a single √(1−α̅_t)·ε.
x_t = √α̅_t·x₀ + √(1−α̅_t)·ε, where α̅_t = Π_{s=1}^{t}(1−β_s).
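The merge is easy to sanity-check numerically · a sketch (schedule values assumed as before): t sequential steps and the single closed-form jump should agree in distribution.

```python
import torch

# Sketch: t sequential one-step updates vs. the single closed-form jump.
T, t = 1000, 500
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1 - betas, dim=0)

x0 = torch.full((100_000,), 2.0)
x = x0.clone()
for beta in betas[:t]:                         # slow route: 500 sequential steps
    x = (1 - beta).sqrt() * x + beta.sqrt() * torch.randn_like(x)

eps = torch.randn_like(x0)                     # fast route: one jump
x_jump = alpha_bar[t - 1].sqrt() * x0 + (1 - alpha_bar[t - 1]).sqrt() * eps

print(x.mean().item(), x.std().item())         # both routes: same mean, same std
print(x_jump.mean().item(), x_jump.std().item())
```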
| t | Linear α̅_t | Cosine α̅_t | What's left |
|---|---|---|---|
| 0 | 1.00 | 1.00 | clean signal |
| 250 | 0.62 | 0.82 | linear: 38% destroyed · cosine: 18% |
| 500 | 0.17 | 0.50 | linear: already mostly gone |
| 750 | 0.02 | 0.18 | linear: essentially pure noise |
| 1000 | 0.00 | 0.00 | both: N(0, I) |
Linear schedule wastes computation on steps near t = T: α̅_t is already ≈ 0 there, so those steps ask the network to denoise pure noise into pure noise.
Two common schedules:
Linear (original DDPM) · β_t increases linearly from 10⁻⁴ to 0.02 over T = 1000.
Cosine (Nichol & Dhariwal 2021) · α̅_t = f(t)/f(0) with f(t) = cos²((t/T + s)/(1 + s)·π/2), s = 0.008.
Cosine schedule adds noise more gradually at the start and faster at the end. Better quality at fewer diffusion steps. Used in most modern diffusion models.
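Both schedules in a few lines · a sketch following the formulas above (exact numbers depend on the β range, so the linear column of the table will differ slightly):

```python
import math
import torch

def linear_alpha_bar(T=1000):
    betas = torch.linspace(1e-4, 0.02, T)       # original DDPM range
    return torch.cumprod(1 - betas, dim=0)

def cosine_alpha_bar(T=1000, s=0.008):
    t = torch.arange(T + 1) / T
    f = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    return f[1:] / f[0]

for step in (250, 500, 750):
    print(step,
          round(linear_alpha_bar()[step - 1].item(), 2),
          round(cosine_alpha_bar()[step - 1].item(), 2))
# cosine lands near the table above; the linear column shifts with the beta range
```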
| T | Behavior |
|---|---|
| 50 | too coarse; each step must learn a big jump; sample quality hurts |
| 200 | works but poor quality at the extremes |
| 1000 | default; great quality with cosine schedule |
| 4000 | slight quality gain; 4× inference cost; rarely worth it |
DDPM (Ho 2020) used T=1000 with linear schedule. Nichol & Dhariwal 2021 showed cosine + T=4000 gave marginal gains; T=1000+cosine is today's sweet spot.
Predict the noise
We want the reverse kernel q(x_{t−1} | x_t) · intractable directly, so we approximate it with a Gaussian p_θ(x_{t−1} | x_t) = N(μ_θ(x_t, t), σ_t²·I).
Ho et al. 2020 showed that if we also condition on x₀, the reverse step q(x_{t−1} | x_t, x₀) is exactly Gaussian with a closed-form mean, so for small β_t a Gaussian p_θ is the right family.
Further: parameterize to predict the noise ε rather than the mean: μ_θ(x_t, t) = (x_t − ((1−α_t)/√(1−α̅_t))·ε_θ(x_t, t)) / √α_t, with α_t = 1 − β_t.
Given the predicted noise, the mean · and with it the whole reverse step · follows in closed form.
Ho et al. 2020 showed the variational bound then collapses (up to per-step weights) into a plain MSE on the noise: L_simple = E_{t, x₀, ε} ‖ε − ε_θ(x_t, t)‖².
Some modern models (e.g., Stable Diffusion 2.x, Imagen Video) use "v-prediction" · a weighted combination v = √α̅_t·ε − √(1−α̅_t)·x₀ that is more numerically stable at extreme noise levels.
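All three targets are linear transforms of one another. A small illustrative sketch (the helper name `targets` is ours):

```python
import torch

# Given x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps, each target determines the others.
def targets(x0, eps, abar):
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    v = abar.sqrt() * eps - (1 - abar).sqrt() * x0               # v-prediction target
    x0_from_eps = (x_t - (1 - abar).sqrt() * eps) / abar.sqrt()  # recover x0 from eps
    return x_t, v, x0_from_eps

print(targets(torch.tensor(2.0), torch.tensor(0.5), torch.tensor(0.17)))
```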
How do we teach the network to "un-blur"?
Take a clean image. Add a known amount of random noise ε at a randomly chosen timestep t (one closed-form jump). Ask the network: looking at this noisy image, what noise was added?
The better it gets at guessing the noise, the better it is at denoising · because subtracting the predicted noise gets us back to a cleaner image.
That's the entire training objective · MSE between predicted noise and the true noise added.
In plain words: show the network a noisy image plus its noise level, and grade it on how well it guesses the exact noise.
That's it. Much simpler than GAN minimax or VAE ELBO.
Suppose x₀ = 2.0 and t = 500 (α̅_t = 0.17), and we draw ε = 0.5. Then x_t = √0.17·2.0 + √0.83·0.5 ≈ 0.82 + 0.46 = 1.28. If the network predicts ε̂ = 0.3, the loss is (0.5 − 0.3)² = 0.04.
Single example, single timestep · one squared difference. Batched, this is the whole loss function:
```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, T=1000):
    B = x0.size(0)
    t = torch.randint(0, T, (B,), device=x0.device)    # one random timestep per image
    noise = torch.randn_like(x0)                       # the eps the network must guess
    alpha_bar = alpha_bar_schedule[t].view(B, 1, 1, 1) # precomputed cumprod of (1 - beta)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise  # closed-form jump
    pred_noise = model(x_t, t)                         # network input: noisy img + t
    return F.mse_loss(pred_noise, noise)
```
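For completeness, a minimal sketch of the surrounding training loop (the optimizer and learning rate are our assumptions, not from the lecture):

```python
# sketch; model, dataloader, and alpha_bar_schedule are assumed defined as above
opt = torch.optim.Adam(model.parameters(), lr=2e-4)
for x0 in dataloader:                  # batches of clean images
    loss = ddpm_loss(model, x0)
    opt.zero_grad()
    loss.backward()
    opt.step()
```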
```python
@torch.no_grad()
def sample(model, shape, T=1000):
    x = torch.randn(shape)                        # start from N(0, I)
    for t in reversed(range(T)):
        alpha_t = alpha_schedule[t]
        alpha_bar_t = alpha_bar_schedule[t]
        predicted = model(x, torch.tensor([t]))
        # reverse mean: (x - (1-alpha)/sqrt(1-alpha_bar) * eps_hat) / sqrt(alpha)
        mean = (x - (1 - alpha_t) / (1 - alpha_bar_t).sqrt() * predicted) / alpha_t.sqrt()
        if t > 0:
            # sigma_t^2 = beta_t = 1 - alpha_t (Ho et al.'s fixed-variance choice)
            x = mean + (1 - alpha_t).sqrt() * torch.randn_like(x)
        else:
            x = mean                              # final step: no noise
    return x
```
The network architecture is a U-Net (L9) with time-step conditioning injected into each block.
To go from x_t to x_{t−1}: predict ε̂ = ε_θ(x_t, t), form the mean μ = (x_t − ((1−α_t)/√(1−α̅_t))·ε̂)/√α_t, then sample x_{t−1} = μ + √β_t·z with z ~ N(0, I).
Worked numeric (1D). Take t = 750 on the linear schedule: β_t ≈ 0.015, α_t = 0.985, α̅_t ≈ 0.02. Say x_t = 1.5 and the network predicts ε̂ = 1.2. Mean · μ = (1.5 − (0.015/√0.98)·1.2)/√0.985 ≈ (1.5 − 0.018)/0.992 ≈ 1.493.
Add noise · draw z = −0.24: x_{t−1} = 1.493 + √0.015·(−0.24) ≈ 1.493 − 0.029 = 1.464.
We took one small step from noisier (1.5) to slightly cleaner (1.464). At t = 0 the last step skips the noise; chained over 1000 steps, these small moves carry pure noise all the way back to a sample.
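The same step in plain Python, reusing the numbers above:

```python
# One reverse step (numbers from the worked example; linear schedule near t = 750).
beta_t, alpha_bar_t = 0.015, 0.02
alpha_t = 1 - beta_t
x_t, eps_hat, z = 1.5, 1.2, -0.24      # current sample, predicted noise, drawn noise

mean = (x_t - (1 - alpha_t) / (1 - alpha_bar_t) ** 0.5 * eps_hat) / alpha_t ** 0.5
x_prev = mean + beta_t ** 0.5 * z
print(round(mean, 3), round(x_prev, 3))   # 1.493 1.464
```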
A diffusion model's practical scale · for 512×512 images: ~1B-param U-Net, ~50 steps of sampling, ~5 seconds on a single GPU. Stable Diffusion's architecture is a direct descendant.
The time t must be fed to the network so that one set of weights can serve every noise level. It's encoded with sinusoidal embeddings:
```python
import math
import torch

def timestep_embedding(t, d):
    # map integer timesteps t (shape [B]) to d-dim sinusoidal features
    half = d // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half).float() / half)
    args = t.float()[:, None] * freqs[None, :]           # [B, half]
    return torch.cat([args.cos(), args.sin()], dim=-1)   # [B, d]
```
Why sinusoidal? The same reason as in Transformers (L13) · a sinusoidal basis gives a multi-scale representation of time that the network can read at any scale. Learned embeddings work too; sinusoidal is more robust to training-time changes in T.
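A quick shape check (illustrative):

```python
import torch
# uses timestep_embedding defined above
emb = timestep_embedding(torch.tensor([0, 500, 999]), 128)
print(emb.shape)  # torch.Size([3, 128])
```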
```python
import torch.nn as nn  # torch and F are imported above

class TimestepBlock(nn.Module):
    def __init__(self, channels, t_dim):
        super().__init__()
        # layer choices here are illustrative assumptions to make the block runnable
        self.norm1, self.norm2 = nn.GroupNorm(8, channels), nn.GroupNorm(8, channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.time_mlp = nn.Linear(t_dim, channels)

    def forward(self, x, t_emb):
        # x: image features (B, C, H, W). t_emb: timestep embedding (B, t_dim)
        h = self.norm1(x)
        h = self.conv1(F.silu(h))
        # project time and add as bias (broadcast over spatial dims)
        h = h + self.time_mlp(t_emb)[:, :, None, None]
        h = self.conv2(F.silu(self.norm2(h)))
        return x + h
```
Each U-Net residual block receives the time embedding and adds it to the channel dimension. The same network weights handle all timesteps — time is just another input, not a different model per step.
Same thing, different derivation
Pause and look at this from a different angle. Imagine an energy landscape over data space: real data points sit at the bottoms of deep valleys (low energy = high probability); noise sits up on high plateaus.
The score is an arrow at every point in space pointing in the steepest direction toward higher probability · down into the valleys.
If we can learn this field of arrows, we can generate: start on a high plateau (pure noise) and follow the arrows down until we land in a valley (real data).
That's score-based generation. Mathematically equivalent to DDPM · just a different lens. Pick whichever you like; many find score-based more intuitive (gradients pointing toward data).
Define the score: s(x) := ∇_x log p(x) · the gradient of the log-density with respect to the input x, not the parameters.
If we have s(x), Langevin dynamics samples from p(x): repeat x ← x + (η/2)·s(x) + √η·z, with z ~ N(0, I) and small step size η.
A small step toward high-density regions, plus a bit of noise to keep exploring. Looks just like the reverse diffusion step · follow a learned signal + add a little noise.
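A minimal Langevin sketch on a toy 1-D mixture whose true score we can get by autograd (all values illustrative):

```python
import torch

def score(x):
    # true score of an equal mixture of N(-2, 1) and N(+2, 1), via autograd
    x = x.detach().requires_grad_(True)
    logp = torch.logsumexp(
        torch.stack([-0.5 * (x - 2) ** 2, -0.5 * (x + 2) ** 2]), dim=0
    ).sum()
    return torch.autograd.grad(logp, x)[0]

x = torch.randn(5000) * 4                # start far from the data
eta = 0.1
for _ in range(500):                     # x <- x + (eta/2)*s(x) + sqrt(eta)*z
    x = x + 0.5 * eta * score(x) + eta ** 0.5 * torch.randn_like(x)
print(x.mean().item(), x.std().item())   # mass settles around the modes at +/-2
```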
Modeling p(x) itself is hard with a neural network: any unnormalized density p(x) ∝ e^{f(x)} drags in the normalizer Z = ∫e^{f(x)}dx, which is intractable in high dimensions.
Parametrize the derivative, not the function itself. Samples are what we want anyway.
Modeling the score sidesteps the normalizer problem — and the score is exactly what you need to run Langevin sampling.
The noisy distribution is Gaussian: q(x_t | x₀) = N(√α̅_t·x₀, (1−α̅_t)·I).
Log density (up to const): log q = −‖x_t − √α̅_t·x₀‖² / (2(1−α̅_t)).
Differentiate w.r.t. x_t: ∇_{x_t} log q = −(x_t − √α̅_t·x₀)/(1−α̅_t).
But the forward equation rearranges to x_t − √α̅_t·x₀ = √(1−α̅_t)·ε, so ∇_{x_t} log q = −ε/√(1−α̅_t).
Punchline. The true score is just the (negative, scaled) noise. Predicting ε and predicting the score are the same task.
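Easy to verify with autograd, reusing the numbers from the training example (a sketch):

```python
import torch

# q(x_t | x0) = N(sqrt(abar)*x0, (1-abar)*I); check score(x_t) == -eps/sqrt(1-abar)
x0, abar, eps = torch.tensor(2.0), torch.tensor(0.17), torch.tensor(0.5)
x_t = (abar.sqrt() * x0 + (1 - abar).sqrt() * eps).requires_grad_(True)

log_q = -0.5 * (x_t - abar.sqrt() * x0) ** 2 / (1 - abar)   # up to a constant
(score,) = torch.autograd.grad(log_q, x_t)
print(score.item(), (-eps / (1 - abar).sqrt()).item())      # both ~ -0.549
```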
DDPM (Ho 2020) and score-SDE (Song 2020) are two lenses on the same model.
Use whichever is easier for your problem. DDPM's discrete-time recipe is simpler to code; score-SDE gives more flexibility for continuous-time / arbitrary-schedule models (e.g., flow matching in 2023+).
| | VAE | GAN | Diffusion |
|---|---|---|---|
| Sample quality | blurry | sharp | SOTA |
| Training | stable, fast | brittle | stable, slow |
| Likelihood | ELBO | ✗ | ELBO (loose) |
| Sampling | 1 forward | 1 forward | T forwards |
| Mode coverage | strong | mode collapse risk | strong |
| Interpretability | structured latent | messy latent | unstructured noise latent |
Diffusion's big cost · slow sampling. This is what L22 will focus on — classifier-free guidance, latent diffusion, DDIM.
Think about drawing a face. A GAN must commit · "these pixels are skin, these are eyes, this is hair" — all in one forward pass. Wrong commitments cascade.
Diffusion starts with pure noise; the first reverse step sketches the rough layout; the second adds features; the hundredth adds skin texture. The network revises its answer 1000 times, getting it right in the limit.
This is why diffusion samples look sharper and more coherent than any single-forward-pass generator.
Diffusion has become the default generative model across modalities.
Consistency models and flow matching are closing the "slow sampling" gap. In 2026 · expect 1-step diffusion samplers to become competitive with GANs on speed.
Q. Is diffusion a likelihood-based model?
A. Yes, approximately. The DDPM loss corresponds to a variational lower bound (ELBO) on log p(x); the simplified ε-MSE drops the per-timestep weights, so the bound is loose.
Q. Why Gaussian noise, not uniform?
A. Because Gaussians are closed under convolution · this is what lets us write q(x_t | x₀) in closed form and jump to any t in one step during training. Sums of uniform variables are not uniform, so that shortcut would disappear.
Q. What if the data isn't image-like?
A. Use a different architecture (Transformer for sequences, GNN for graphs). The diffusion recipe is independent of architecture — only the noise-prediction network changes.