Diffusion Models — Theory

Lecture 21 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Learning outcomes

By the end of this lecture you will be able to:

  1. Explain the forward process in one sentence and write it down.
  2. Derive the closed-form for q(x_t | x₀) and use it in code.
  3. Write the DDPM training loss and describe what each term is doing.
  4. Describe the reverse process step-by-step.
  5. Understand noise schedules (linear vs cosine) and pick one for a task.
  6. Connect DDPM to score matching via Langevin dynamics.
  7. State why diffusion won over GANs and VAEs for image/video/audio.

Where we are

  • VAE (L19) · probabilistic encoder-decoder; good structure, blurry samples.
  • GAN (L20) · sharp samples, unstable training, mode collapse.

Today · diffusion. Sharp samples + stable training + tractable likelihood. SOTA since 2021 for image, video, audio, 3D generation.

Today maps to Prince Ch 18 (early) + Ho et al. 2020 (DDPM) + Song & Ermon 2020 (score-based).

Four questions:

  1. What's the forward process?
  2. What's the closed-form for q(x_t | x₀)?
  3. How do we train a diffusion model?
  4. What's the connection to score matching?

PART 1

Forward & reverse · the big picture

Corrupt then learn to uncorrupt

The intuition in one sentence

Gradually turn an image into pure noise, then train a network to reverse that process one tiny step at a time.

At the end of training, you can start from random noise and reverse-diffuse it into a brand new image. Each small step is easy to learn; chained together they generate.

A physical analogy · ink in water

Drop a drop of ink into a glass of water. It stays concentrated, then slowly spreads, then uniformly tints the water.

Forward (easy)

Ink diffuses into water · we can describe this with a simple diffusion equation. Watching a drop blur is what "noise corrupts the signal" looks like in pictures.

Reverse (hard)

"Un-diffuse" the ink back into a drop. Physics says impossible (entropy only grows). But with data · we have many examples of initial states. A neural network can learn the reverse direction from those examples.

Diffusion models learn the miracle "reverse" that physics doesn't give you — but they learn it from data, not first principles.

Why this is better than GANs

GAN problems

  • minimax: two networks playing a game
  • mode collapse
  • unstable training
  • hyper-sensitive to hyperparameters

Diffusion advantages

  • regression loss: MSE on predicted noise
  • one network
  • stable training
  • default settings usually work

A GAN asks a network to hit a moving target (the discriminator's decision boundary). A diffusion model asks a network to match a static target (the noise that was added). Static targets are fundamentally easier to optimize.

Forward corrupts · reverse reassembles

Forward + reverse · sequence view

▶ Interactive: slide t, see a 2D spiral dissolve into noise; press "Reverse animate" to watch it reassemble — diffusion-denoise.

PART 2

The forward process

Fixed · Markov · Gaussian

One forward step · slightly blurring a photo

Analogy. Take a sharp photo and make it one step blurrier:

  1. Fade the original a tiny bit (e.g. to 99.5% opacity).
  2. Add a faint layer of random static (Gaussian noise).

That's it. β_t controls both the fade amount and the static intensity.

Formally:

q(x_t | x_{t−1}) = N( x_t ; √(1 − β_t) · x_{t−1}, β_t · I )

i.e. x_t = √(1 − β_t) · x_{t−1} + √β_t · ε with ε ~ N(0, I).

Worked numeric. β_t = 0.01, x_{t−1} = 2.0.

  • Fade · √(1 − 0.01) = √0.99 ≈ 0.995, so mean = 0.995 · 2.0 = 1.99.
  • Noise · std = √0.01 = 0.1.
  • Update · x_t = 1.99 + 0.1 · ε, with ε ~ N(0, 1).

Over T steps, the signal washes out into pure N(0, I) noise. The forward process is not learned — it is a fixed dynamical system.
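
In code, one forward step is two lines. A minimal sketch (function and variable names are illustrative, not part of the lecture code):

import torch

def forward_step(x_prev, beta_t):
    # fade the signal, then add fresh Gaussian static
    eps = torch.randn_like(x_prev)
    return (1 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * eps

x1 = forward_step(torch.tensor([2.0]), beta_t=0.01)   # mean 1.99, std 0.1 around it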

Why shrink and add noise

You might ask · why not just add noise? Why shrink the signal too?

Shrinking keeps the total variance bounded. If you only add noise, the variance grows without limit; x_T would be impossibly noisy and its scale would look nothing like x₀'s.

Variance check · if Var(x_{t−1}) = 1, then Var(x_t) = (1 − β_t) · Var(x_{t−1}) + β_t.

Variance of x_t = (1 − β_t) + β_t = 1. Always.

This is why the forward process preserves unit variance — it's a variance-preserving SDE.
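
A quick numerical check of the variance argument — a throwaway sketch, assuming unit-variance data and a constant β:

import torch

x = torch.randn(100_000)        # Var(x_0) = 1
beta = 0.01
for _ in range(200):            # 200 forward steps
    x = (1 - beta) ** 0.5 * x + beta ** 0.5 * torch.randn_like(x)
print(x.var())                  # stays ≈ 1.0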

A step-by-step · small β = 0.01

Start with x₀ = 2.0. Apply 5 forward steps with β = 0.01 (so √(1 − β) ≈ 0.995):

  t    x_t     noise ε drawn    update x_t = √0.99 · x_{t−1} + 0.1 · ε
  0    2.00    —                start
  1    2.07    ε = 0.8          0.995 · 2.00 + 0.08 = 2.07
  2    2.02    ε = −0.4         0.995 · 2.07 − 0.04 = 2.02
  3    1.99    ε = −0.2         0.995 · 2.02 − 0.02 = 1.99
  4    2.01    ε = 0.3          0.995 · 1.99 + 0.03 = 2.01
  5    2.02    ε = 0.2          0.995 · 2.01 + 0.02 = 2.02

After 5 steps the signal is barely disturbed. After 1000 steps with growing β, it becomes standard normal. The accumulated effect, not each step, turns signal into noise.

The closed form · compounding fades

After 1 step, the signal is faded by √α₁ (define α_t ≡ 1 − β_t). After 2 steps · √(α₁α₂). After t steps · √ᾱ_t, where ᾱ_t = α₁α₂⋯α_t.

All the per-step noise additions also "pool together" into one big Gaussian:

x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε,  ε ~ N(0, I)   —   i.e.   q(x_t | x₀) = N( x_t ; √ᾱ_t · x₀, (1 − ᾱ_t) · I )

Variance check · we designed the process so the total variance stays 1. If ᾱ_t is the fraction of variance left from the signal, 1 − ᾱ_t is the fraction from noise. Sums to 1.

Worked numeric · jump straight from x₀ to x₅₀₀.
x₀ = 2.0; take ᾱ₅₀₀ ≈ 0.17 (linear schedule — see the table a few slides ahead).

  • Signal scale · √0.17 ≈ 0.41.
  • Noise scale · √(1 − 0.17) ≈ 0.91.
  • Sample ε = 0.5 (one draw from N(0, 1)).
  • x₅₀₀ = 0.41 · 2.0 + 0.91 · 0.5 = 0.82 + 0.46 ≈ 1.28 — the noise term is now as large as the signal term.

After 500 steps the original signal of 2.0 has nearly washed away. No iteration needed during training.
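
The same jump in code — one line once ᾱ₅₀₀ is known (the value 0.17 is taken from the linear-schedule table shown later in the lecture):

import torch

x0, abar_500 = torch.tensor([2.0]), 0.17
x500 = abar_500 ** 0.5 * x0 + (1 - abar_500) ** 0.5 * torch.randn_like(x0)
# one sample, no 500-step loop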

Why closed-form matters · training speed

Naive (iterative) forward

To get x₅₀₀, you'd apply 500 Gaussian noising steps sequentially — 500 sequential sampling operations per training example.

Batch of 128, 100k examples · ~10⁹ operations just to make noise targets. Days on a single GPU.

Closed-form

One sample of ε, one scaled add. 500× faster per example.

Batch of 128 in one step · microseconds. Hours instead of days.

This closed form is the single biggest practical convenience in DDPM training. Without it, preparing every training target would cost 500× more.

⚠️ optional · Closed-form · the derivation in 3 lines

Start from one step: x_t = √α_t · x_{t−1} + √(1 − α_t) · ε_t.

Unroll one more step (plug in the same identity for x_{t−1}):

x_t = √(α_t α_{t−1}) · x_{t−2} + √(α_t (1 − α_{t−1})) · ε_{t−1} + √(1 − α_t) · ε_t

Merge Gaussians (sum of independent Gaussians = Gaussian with summed variances): the two noise terms have total variance α_t(1 − α_{t−1}) + (1 − α_t) = 1 − α_t α_{t−1}. Repeating down to x₀:

x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε

where a single ε ~ N(0, I) replaces the t-step chain of independent ε's. Gaussian closed under convolution — this is the magic.
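
You can check the merge numerically — a small sketch with a constant β, so that ᾱ_T = (1 − β)^T; iterating the chain for T steps and sampling the closed form give the same statistics:

import torch

T, beta = 200, 0.01
alpha_bar = (1 - beta) ** T

x0 = torch.full((100_000,), 2.0)

x = x0.clone()
for _ in range(T):                                    # iterate the chain
    x = (1 - beta) ** 0.5 * x + beta ** 0.5 * torch.randn_like(x)

x_direct = alpha_bar ** 0.5 * x0 + (1 - alpha_bar) ** 0.5 * torch.randn_like(x0)

print(x.mean().item(), x.std().item())                # ≈ √ᾱ_T · 2.0 ≈ 0.73, ≈ √(1−ᾱ_T) ≈ 0.93
print(x_direct.mean().item(), x_direct.std().item())  # same statistics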

Noise schedules · in one chart

Noise schedules · the numbers

  t      Linear ᾱ_t   Cosine ᾱ_t   What's left
  0      1.00         1.00         clean signal
  250    0.62         0.82         linear: 38% destroyed · cosine: 18%
  500    0.17         0.50         linear: already mostly gone
  750    0.02         0.18         linear: essentially pure noise
  1000   0.00         0.00         both: ≈ N(0, I)

Linear schedule wastes computation on steps near t = T, where everything is already noise. Cosine keeps the middle range useful — the middle is where the model actually learns.

Noise schedules · linear vs cosine

Two common schedules:

Linear (original DDPM) · β_t grows linearly from β₁ = 10⁻⁴ to β_T = 0.02 over T steps.

Cosine (Nichol & Dhariwal 2021) · set ᾱ_t ∝ cos²( (t/T + s)/(1 + s) · π/2 ) with a small offset s ≈ 0.008 — smoother, better for smaller T.

Cosine schedule adds noise more gradually at the start and faster at the end. Better quality at fewer diffusion steps. Used in most modern diffusion models.
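
Both schedules take a few lines to compute. A sketch (the linear β endpoints follow Ho et al. 2020; the printed ᾱ values will differ slightly from the table above depending on those endpoints):

import math
import torch

T = 1000

# Linear: beta_t from 1e-4 to 0.02, alpha-bar by cumulative product
betas = torch.linspace(1e-4, 0.02, T)
abar_linear = torch.cumprod(1 - betas, dim=0)

# Cosine (Nichol & Dhariwal 2021): define alpha-bar directly, offset s = 0.008
s = 0.008
t = torch.arange(T + 1)
f = torch.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
abar_cosine = f / f[0]

for step in (0, 250, 500, 750, 999):
    print(step, round(abar_linear[step].item(), 2), round(abar_cosine[step].item(), 2))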

Picking T · the hyperparameter most people ignore

  T      Behavior
  50     too coarse; each step must learn a big jump; sample quality hurts
  200    works, but poor quality at the extremes
  1000   default; great quality with cosine schedule
  4000   slight quality gain; 4× inference cost; rarely worth it

DDPM (Ho 2020) used T=1000 with linear schedule. Nichol & Dhariwal 2021 showed cosine + T=4000 gave marginal gains; T=1000+cosine is today's sweet spot.

PART 3

Training objective

Predict the noise

The reverse process · parameterized

We want p_θ(x_{t−1} | x_t) — learn to denoise.

Ho et al. 2020 showed that when the forward steps are Gaussian with small β_t (they are), the reverse step is also well approximated by a Gaussian. So parameterize:

p_θ(x_{t−1} | x_t) = N( x_{t−1} ; μ_θ(x_t, t), σ_t² · I )

Further: parameterize μ_θ so that the network predicts the noise ε rather than the mean directly. Simpler, better training signal.

Why predict instead of the mean?

Given x_t, there are three equivalent prediction targets:

  • Predict x₀ · reconstruct the clean signal directly ("x₀-prediction").
  • Predict μ_{t−1} · the mean of the reverse distribution.
  • Predict ε · the noise that was added.

Ho et al. 2020 showed ε-prediction gives the best sample quality. Intuition · the noise target is unit-variance and its scale is independent of the data, so the network never has to learn the scale of the signal.

Modern models (SDXL, Imagen) often use "v-prediction" — a weighted combination of ε and x₀ that is more numerically stable at low signal-to-noise ratios (small ᾱ_t).
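
Because x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε, the three targets are linear functions of each other given x_t. A sketch of the conversions (the v formula follows Salimans & Ho 2022; names are illustrative):

def other_targets_from_eps(x_t, eps, alpha_bar_t):
    a = alpha_bar_t ** 0.5          # signal scale √ᾱ_t
    s = (1 - alpha_bar_t) ** 0.5    # noise scale √(1 − ᾱ_t)
    x0 = (x_t - s * eps) / a        # x0-prediction target
    v = a * eps - s * x0            # v-prediction target
    return x0, v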

Training · the noise-guessing game

How do we teach the network to "un-blur"?

Take a clean image. Add a known amount of random noise ε at a random noise level t. Show the noisy result to the network. Ask it: "What noise did I just add?"

The better it gets at guessing the noise, the better it is at denoising · because subtracting the predicted noise gets us back to a cleaner image.

That's the entire training objective · MSE between predicted noise and the true noise added.

DDPM loss · surprisingly simple

L_simple(θ) = E_{x₀, t, ε} ‖ ε − ε_θ( √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε, t ) ‖²

In plain words:

  1. Sample a clean image x₀ from the dataset.
  2. Sample a timestep t ~ Uniform{1, …, T}.
  3. Sample Gaussian noise ε ~ N(0, I).
  4. Compute x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε (closed form).
  5. Ask the network ε_θ(x_t, t) to predict ε.
  6. MSE loss between the prediction and ε.

That's it. Much simpler than GAN minimax or VAE ELBO.

Worked example · one training step

Suppose x₀ = (1.0, −2.0) (a 2D data point), t = 350, ᾱ₃₅₀ = 0.5. Sample ε = (0.5, 1.2).

  1. x_t = √0.5 · x₀ + √0.5 · ε = 0.71 · (1.0, −2.0) + 0.71 · (0.5, 1.2) = (0.71, −1.41) + (0.35, 0.85) = (1.06, −0.56)
  2. Feed (x_t, t) to the network. Prediction · ε_θ(x_t, t) = (0.4, 1.0).
  3. Loss · ‖ε − ε_θ‖² = (0.5 − 0.4)² + (1.2 − 1.0)² = 0.01 + 0.04 = 0.05.
  4. Backprop through ε_θ to update the network.

Single example, single t. Average the loss over a batch and the training step is ready. No adversarial game, no multiple networks, no cross-entropy.

DDPM in PyTorch · 30 lines

import torch
import torch.nn.functional as F

# Assumed precomputed 1-D schedule tensors of length T, e.g.
#   betas = torch.linspace(1e-4, 0.02, T)
#   alpha_schedule = 1 - betas
#   alpha_bar_schedule = torch.cumprod(alpha_schedule, dim=0)

def ddpm_loss(model, x0, T=1000):
    B = x0.size(0)
    t = torch.randint(0, T, (B,), device=x0.device)
    noise = torch.randn_like(x0)

    alpha_bar = alpha_bar_schedule[t].view(B, 1, 1, 1)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise

    pred_noise = model(x_t, t)                         # network input: noisy img + t
    return F.mse_loss(pred_noise, noise)

def sample(model, shape, T=1000):
    x = torch.randn(shape)                             # start from N(0, I)
    for t in reversed(range(T)):
        alpha_t     = alpha_schedule[t]
        alpha_bar_t = alpha_bar_schedule[t]
        predicted   = model(x, torch.tensor([t]))
        mean = (x - (1 - alpha_t) / (1 - alpha_bar_t).sqrt() * predicted) / alpha_t.sqrt()
        if t > 0:
            x = mean + (1 - alpha_t).sqrt() * torch.randn_like(x)   # σ_t = √β_t
        else:
            x = mean                                   # final step is deterministic
    return x

The network architecture is a U-Net (L9) with time-step conditioning injected into each block.

Reverse step · denoise then re-noise a little

To go from x_t to x_{t−1}:

  1. Denoise. Predict ε_θ(x_t, t) and subtract a scaled version of it from x_t → estimate of a cleaner signal.
  2. Re-noise a little. Add a small fresh noise σ_t · z so the chain stays stochastic.

Worked numeric (1D). x_t = 1.5. Schedule · α_t = 0.98, ᾱ_t = 0.3. Network predicts ε_θ = 0.7.

  • Mean · μ = ( x_t − (1 − α_t)/√(1 − ᾱ_t) · ε_θ ) / √α_t = (1.5 − 0.024 · 0.7) / 0.990 ≈ 1.498

Add noise · σ_t = √(1 − α_t) = √0.02 ≈ 0.14, draw z = −0.24 → noise term ≈ −0.034.
x_{t−1} = 1.498 − 0.034 ≈ 1.464.

We took one small step from noisier (1.5) to slightly cleaner (1.464). At the final step (the one that produces x₀), drop the noise term — that step is deterministic.
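
The same 1D reverse step in code — a tiny sketch plugging in the numbers above (the z draw is the illustrative value from the example):

x_t, eps_hat = 1.5, 0.7            # current sample, predicted noise
alpha_t, alpha_bar_t = 0.98, 0.3   # schedule values at this t

mean = (x_t - (1 - alpha_t) / (1 - alpha_bar_t) ** 0.5 * eps_hat) / alpha_t ** 0.5
sigma = (1 - alpha_t) ** 0.5       # σ_t = √β_t ≈ 0.14
x_prev = mean + sigma * (-0.24)    # ≈ 1.498 - 0.034 ≈ 1.464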

Network architecture · in one picture

Network architecture · U-Net with time

A diffusion model's ε_θ is typically a U-Net:

  • Encoder downsamples, decoder upsamples.
  • Skip connections between matching resolutions (from L9).
  • Time embedding · t becomes a sinusoidal vector, is projected through a small MLP, and added into every block.
  • Attention at low spatial resolutions (globally mix features).

For 512×512 images · ~1B param U-Net; ~50 steps of sampling; ~5 seconds on a single GPU. Stable Diffusion's architecture is a direct descendant.

Sinusoidal time embedding

The timestep t is an integer. Represent it as a dense vector using the same positional encoding from L13:

import math
import torch

def timestep_embedding(t, d):
    # t: integer timesteps, shape (B,); returns (B, d) sinusoidal embeddings
    half = d // 2
    freqs = torch.exp(
        -math.log(10000) * torch.arange(half).float() / half
    )
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([args.cos(), args.sin()], dim=-1)

The same reason as in Transformers (L13) · a sinusoidal basis gives a multi-scale representation of time that the network can read at any scale. Learned embeddings work too; sinusoidal is more robust to training-time changes in T.

Time conditioning · inject at every block

class TimestepBlock(nn.Module):
    # assumes: import torch.nn as nn, torch.nn.functional as F
    def __init__(self, channels, t_dim):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, channels)          # layer sizes here are illustrative
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm2 = nn.GroupNorm(8, channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.time_mlp = nn.Linear(t_dim, channels)      # projects t_emb to a per-channel bias

    def forward(self, x, t_emb):
        # x: image features (B, C, H, W). t_emb: timestep embedding (B, t_dim)
        h = self.norm1(x)
        h = self.conv1(F.silu(h))
        # project time and add as bias (broadcast over spatial dims)
        h = h + self.time_mlp(t_emb)[:, :, None, None]
        h = self.conv2(F.silu(self.norm2(h)))
        return x + h

Each U-Net residual block receives the time embedding and adds it to the channel dimension. The same network weights handle all timesteps — time is just another input, not a different model per step.

PART 4

Connection to score matching

Same thing, different derivation

The score field in one picture

Score · the "uphill arrows" view

Pause and look at this from a different angle. Imagine the data density as a landscape: real data points sit on the high peaks.

The score is an arrow at every point in space that points in the steepest uphill direction.

If we can learn this field of "uphill" arrows, we can follow them from anywhere and climb toward the peaks (i.e., toward real data).

That's score-based generation. Mathematically equivalent to DDPM · just a different lens. Picking either lens is fine; many find score-based more intuitive (gradients pointing toward data).

Score · the mountain-range analogy

Imagine probability as a landscape. Real data points sit on high mountain peaks; noise sits on low flat plains.

The score is an arrow at every point pointing uphill — toward higher density.

If we learn this field of "uphill" arrows, we can generate data · start on a low plain (noise) and follow the arrows until we land on a peak (real data).

The score function · math

Define the score s(x) = ∇_x log p(x) — the gradient of the log density.

  • p(x) · density (high for real data, low elsewhere).
  • log · just makes the math nice (peaks of log p = peaks of p).
  • ∇_x · vector pointing in the direction of steepest ascent.

If we have s(x), we can sample with Langevin dynamics:

x_{k+1} = x_k + η · s(x_k) + √(2η) · z_k,   z_k ~ N(0, I)

A small step toward high-density regions, plus a bit of noise to keep exploring. Looks just like the reverse diffusion step · follow a learned signal + add a little noise.
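
A toy sketch of Langevin sampling when the score is known in closed form — here the target is a standard 2D Gaussian, so ∇_x log p(x) = −x (everything here is illustrative, not the lecture's code):

import torch

def langevin(score_fn, n=1000, steps=500, eta=0.01, dim=2):
    x = 5.0 * torch.randn(n, dim)                     # start far from the data
    for _ in range(steps):
        x = x + eta * score_fn(x) + (2 * eta) ** 0.5 * torch.randn_like(x)
    return x

samples = langevin(lambda x: -x)                      # score of N(0, I) is -x
print(samples.mean(0), samples.std(0))                # ≈ (0, 0), ≈ (1, 1)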

Score vs density · why use the score?

Density

  • Must be non-negative.
  • Must integrate to 1.
  • Intractable normalizing constant for complex distributions.

Hard to model with a neural network.

Score

  • Any vector field.
  • Normalizer disappears: ∇_x log( p̃(x)/Z ) = ∇_x log p̃(x).
  • Easy to model with a neural network.

Parametrize the derivative, not the function itself. Samples are what we want anyway.

Modeling the score sidesteps the normalizer problem — and the score is exactly what you need to run Langevin sampling.

Diffusion ≈ score matching · derivation

The noisy distribution is Gaussian:

q(x_t | x₀) = N( x_t ; √ᾱ_t · x₀, (1 − ᾱ_t) · I )

Log density (up to const):

log q(x_t | x₀) = −‖ x_t − √ᾱ_t · x₀ ‖² / ( 2(1 − ᾱ_t) ) + const

Differentiate w.r.t. x_t:

∇_{x_t} log q(x_t | x₀) = −( x_t − √ᾱ_t · x₀ ) / (1 − ᾱ_t)

But the forward equation rearranges to x_t − √ᾱ_t · x₀ = √(1 − ᾱ_t) · ε. Substitute:

∇_{x_t} log q(x_t | x₀) = −ε / √(1 − ᾱ_t)

Punchline. The true score is just the (negative, rescaled) noise. Predicting ε with MSE = predicting the score:

s_θ(x_t, t) = −ε_θ(x_t, t) / √(1 − ᾱ_t)

DDPM (Ho 2020) and score-SDE (Song 2020) are two lenses on the same model.
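
In code the two views are one line apart — a sketch assuming a trained noise predictor eps_model and a precomputed alpha_bar table (both names hypothetical):

def score_from_eps(eps_model, x_t, t, alpha_bar):
    # s_θ(x_t, t) = -ε_θ(x_t, t) / √(1 - ᾱ_t)
    return -eps_model(x_t, t) / (1 - alpha_bar[t]) ** 0.5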

Two views side-by-side

DDPM view (Ho 2020)

  • Forward · fixed Markov chain.
  • Reverse · learned Gaussian chain.
  • Loss · MSE between predicted and true noise.
  • Intuition · denoising at multiple scales.

Score-SDE view (Song 2020)

  • Forward · SDE driving data to noise.
  • Reverse · another SDE driving noise to data.
  • Loss · score matching.
  • Intuition · gradient field pointing to data.

Use whichever is easier for your problem. DDPM's discrete-time recipe is simpler to code; Score-SDE gives more flexibility for continuous-time / arbitrary-schedule models (e.g., flow matching in 2023+).

PART 5

Why diffusion won

Diffusion vs VAE vs GAN

                     VAE                  GAN                    Diffusion
  Sample quality     blurry               sharp                  SOTA
  Training           stable, fast         brittle                stable, slow
  Likelihood         ELBO                 none (implicit)        ELBO (loose)
  Sampling           1 forward pass       1 forward pass         T forward passes
  Mode coverage      strong               mode collapse risk     strong
  Interpretability   structured latent    messy latent           unstructured noise latent

Diffusion's big cost · slow sampling. This is what L22 will focus on — classifier-free guidance, latent diffusion, DDIM.

Why diffusion beat GANs on image quality

  1. Training signal is always strong · MSE on noise has a meaningful gradient at every step and every t. GAN's adversarial loss often gives near-zero gradient early in training.
  2. No mode collapse · every training example teaches the model to denoise independently. The model can't "cheat" by producing one output.
  3. Iterative refinement · generation is 50-1000 tiny corrections. Errors at each step are small; the chain self-corrects. GANs must produce the final output in one forward pass.
  4. Infinite data augmentation · every (x₀, t, ε) triple is a new training example. A dataset of 10k images gives you a virtually infinite training stream.

A picture of why iteration helps

Think about drawing a face. A GAN must commit · "these pixels are skin, these are eyes, this is hair" — all in one forward pass. Wrong commitments cascade.

Diffusion starts with pure noise; the first reverse step sketches the rough layout; the second adds features; the hundredth adds skin texture. The network revises its answer 1000 times, getting it right in the limit.

This is why diffusion samples look sharper and more coherent than any single-forward-pass generator.

Applications · 2026 state

  • Text-to-image · Stable Diffusion, Midjourney, DALL-E 3, Imagen.
  • Video · Sora, Runway Gen-3, VEO.
  • Audio · AudioGen, Riffusion.
  • Molecule design · RFdiffusion for proteins.
  • Robotics policies · diffusion policy (Chi et al. 2023).

Diffusion has become the default generative model across modalities.

Frontier · where diffusion is heading

Faster sampling

  • DDIM (L22) · deterministic, 20 steps.
  • Flow matching · 5-10 steps.
  • Consistency models · 1-4 steps.

Richer conditioning

  • CFG (L22) · text steering.
  • ControlNet · per-pixel conditioning (pose, depth).
  • Inpainting · mask what to regenerate.

Consistency models and flow matching are closing the "slow sampling" gap. In 2026 · expect 1-step diffusion samplers to become competitive with GANs on speed.

Common questions · FAQ

Q. Is diffusion a likelihood-based model?
A. Yes, approximately. The DDPM loss corresponds to a variational lower bound on log p(x₀), but with a specific weighting. Tight bounds need "improved DDPM" tricks.

Q. Why Gaussian noise rather than, say, uniform?
A. Because Gaussians are closed under convolution — that's what lets us write q(x_t | x₀) in closed form. Other noise distributions (uniform, Laplacian) don't give this gift.

Q. What if the data isn't image-like?
A. Use a different architecture (Transformer for sequences, GNN for graphs). The diffusion recipe is independent of architecture — only the noise-prediction network changes.

Lecture 21 — summary

  • Forward process · add small Gaussian noise over T steps; closed form x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε.
  • Reverse process · neural net predicts the noise; subtract step by step.
  • DDPM loss · MSE between true noise and predicted noise. Stable, simple.
  • Schedule · linear or cosine; cosine is modern default.
  • Architecture · U-Net with sinusoidal time embedding + attention at low res.
  • Score matching · same model through a different lens; reverse diffusion ≈ Langevin dynamics along the score.

Read before Lecture 22

Prince Ch 18 (later sections) + HF diffusers docs + Rombach 2022 (Stable Diffusion).

Next lecture

Diffusion Models — Practice — classifier-free guidance, latent diffusion, DDIM, DiT.

Notebook 21 · 21-ddpm-2d.ipynb — implement DDPM on a 2D toy dataset (Swiss roll); visualize forward noising + reverse denoising animations.