Last lecture · DDPM — forward noise, learn reverse, predict ε. Works on MNIST, toy 2D. But how do we get from there to Stable Diffusion and Sora?
Today maps to Prince Ch 18 (later sections) + HF diffusers docs + Rombach 2022 (Stable Diffusion) + Ho & Salimans 2022 (CFG).
Four questions:
1. How does an image model listen to a text prompt? (cross-attention)
2. How do we make samples actually follow the prompt? (classifier-free guidance)
3. How do we make diffusion cheap enough to ship? (latent diffusion)
4. How do we sample fast, and what replaces the U-Net at scale? (DDIM, DiT)
We want to model $p(x \mid c)$ · images conditioned on a signal $c$ such as a text prompt · instead of the unconditional $p(x)$ from last lecture.
Three common ways to inject $c$ into the denoiser: concatenate it to the input, add it to the timestep embedding (FiLM/AdaGN style), or let the network cross-attend to it. Text-to-image models use the third.
How does an image model listen to a text prompt?
Imagine a painter following instructions. For every brushstroke they ask, "Which part of the instruction applies right here?"
Cross-attention is exactly this · each spatial location in the image listens to the most relevant words in the prompt.
The result · a dynamic, spatial link between text and pixels. The "cat" word strongly influences cat-shaped regions; the "background" word influences corners.
Analogy · team of painters. Each painter handles a small patch. Before painting they ask: "which word in the prompt is most relevant to my patch?" The painter near the top pays attention to "hat"; the furry-patch painter pays attention to "cat".
The mechanism — for each image region:
```python
import math, torch
import torch.nn as nn
import torch.nn.functional as F

text_emb = clip_text_encoder("a cat astronaut")  # [1, 77, 768] · frozen CLIP (pseudocode)

class CrossAttention(nn.Module):
    def __init__(self, d=768):
        super().__init__()
        self.q_proj, self.k_proj, self.v_proj = (nn.Linear(d, d) for _ in range(3))

    def forward(self, spatial_features, text_emb):   # [B, HW, d], [B, 77, d]
        Q = self.q_proj(spatial_features)            # queries from the image
        K, V = self.k_proj(text_emb), self.v_proj(text_emb)  # keys/values from text
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # [B, HW, 77]
        return F.softmax(scores, dim=-1) @ V         # per-pixel text-weighted update
```
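A quick shape check with random stand-ins for the CLIP output, so it runs without any pretrained model:

```python
attn = CrossAttention(d=768)
spatial = torch.randn(1, 256, 768)   # a 16×16 feature map, flattened to 256 positions
text = torch.randn(1, 77, 768)       # stand-in for clip_text_encoder output
print(attn(spatial, text).shape)     # torch.Size([1, 256, 768])
```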
Prompt: "a red square". Two word vectors:
Top-left pixel's current state:
Step 1 · scores.
"Square" is much more relevant.
Step 2 · softmax → weights. Suppose
Step 3 · final instruction.
Used to update this pixel — slightly more red, much more square.
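The arithmetic, checked in a few lines (toy values from above):

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([1.0, 3.0])                    # q·k_red, q·k_square
w = F.softmax(scores, dim=0)                         # tensor([0.1192, 0.8808])
v_red, v_square = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])
print(w[0] * v_red + w[1] * v_square)                # ≈ [0.12, 0.88] · mostly "square"
```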
Self-attention inside the U-Net mixes spatial positions. Cross-attention between spatial features and text features is where the prompt enters.
At a high-resolution block · each pixel asks "which prompt tokens matter for me?". A cat pixel pays attention to "cat" in the prompt; a sky pixel to "sky". The result is spatially localized conditioning — a single prompt can steer different regions differently.
Visualizing the cross-attention maps reveals exactly this — each word has a blob of pixels that listened most to it. This is the mechanism behind DreamBooth, Prompt-to-Prompt, and every prompt-editing technique.
The trick that makes generation feel "on-prompt"
You're lost in a forest with two compasses. One (the unconditional model) points toward "any realistic image"; the other (the conditional model) points toward "an image that matches your prompt".
The difference vector between the two readings isolates the pure prompt direction. CFG follows it past the conditional reading: $\hat\epsilon = \epsilon_{\text{uncond}} + w\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$.
Default · $w = 7.5$ in Stable Diffusion.
Interactive: slide $w$ from 0 to 20 and watch samples go generic → on-prompt → oversaturated.
To enable CFG at inference, the training needs both conditional and unconditional examples:
```python
import random
import torch
import torch.nn.functional as F

def training_step(x0, prompt):
    t = sample_t()                          # random timestep (pseudocode helper)
    noise = torch.randn_like(x0)
    x_t = add_noise(x0, noise, t)           # forward process q(x_t | x0)
    # Dropout 10% of prompts to null (enables the unconditional path later)
    if random.random() < 0.10:
        c = null_embedding
    else:
        c = clip_text_encoder(prompt)
    pred_noise = model(x_t, t, c)           # predict ε, conditioned on c
    return F.mse_loss(pred_noise, noise)
```
Same network learns both modes. At inference, you run it twice and extrapolate. Cost · 2× compute per step. Benefit · any w at generation time.
At denoising step $t$, run the model twice: $\epsilon_u = \epsilon_\theta(x_t, t, \varnothing)$ (null embedding) and $\epsilon_c = \epsilon_\theta(x_t, t, c)$ (prompt).
Difference vector · $\Delta = \epsilon_c - \epsilon_u$. Guided prediction · $\hat\epsilon = \epsilon_u + w\,\Delta$.
| $w$ | $\hat\epsilon$ | meaning |
|---|---|---|
| 0 | $\epsilon_u$ | unconditional |
| 1 | $\epsilon_u + \Delta = \epsilon_c$ | pure conditional |
| 3 | $\epsilon_c + 2\Delta$ | overshoot 2× |
| 7 | $\epsilon_c + 6\Delta$ | strong overshoot |
Larger $w$ pushes the prediction further along the prompt direction · stronger adherence, less diversity.
| $w$ | What you get |
|---|---|
| 1 | pure conditional · ignores extrapolation |
| 3 | subtle prompt adherence · natural-looking |
| 7 | Stable Diffusion default · balanced |
| 12+ | strong adherence · saturated colors, artifacts |
| 25+ | cartoonish oversaturation; often broken |
CFG trades diversity for prompt adherence. Low $w$ · varied but loosely conditioned samples. High $w$ · faithful but repetitive, oversaturated ones.
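In code, one guided prediction per denoising step looks like this · a minimal sketch, assuming a noise-prediction model with signature `model(x_t, t, cond)` (hypothetical names):

```python
import torch

@torch.no_grad()
def cfg_noise(model, x_t, t, c, null_emb, w=7.5):
    # Two forward passes per step: this is the 2× inference cost.
    eps_u = model(x_t, t, null_emb)      # "any realistic image" direction
    eps_c = model(x_t, t, c)             # "matches the prompt" direction
    return eps_u + w * (eps_c - eps_u)   # extrapolate along the difference
```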
The one trick that made Stable Diffusion ship
Most pixels in an image are correlated. A patch of blue sky has hundreds of near-identical pixel values. Diffusing each one independently is wasteful.
A pretrained VAE has already absorbed the perceptually redundant structure · the latent is a near-minimal representation. Diffusion then only has to model the interesting (high-entropy) dimensions.
Result · 48× fewer dimensions, same perceptual quality, ~10× faster sampling. This is the single change that made Stable Diffusion runnable on consumer GPUs.
A 512×512 photo of blue sky has 262,144 pixels · most are nearly identical shades of blue. Making the network carefully denoise every one of those redundant values at every one of 1000 steps is wasteful.
Like emailing a 100MB doc. Instead of attaching it raw, you zip to 1MB. The zip captures the important info in less space. Latent diffusion does the same · zip the image with a VAE, run diffusion in the small zipped space, then unzip at the end.
512×512 RGB = 512 × 512 × 3 = 786,432 dimensions.
Stable Diffusion's VAE compresses this to 64×64×4 = 16,384 dimensions.
Saving · 786,432 / 16,384 = 48× fewer dimensions to diffuse.
The diffusion loop runs in latent space; the VAE encoder runs once at the start, decoder once at the end. Single biggest reason Stable Diffusion is shippable on consumer GPUs.
VAE · Encoder: image → 4-channel latent. Decoder: latent → image. Pretrained, not updated during diffusion training.
Text encoder (CLIP) · Tokenize → embed → 77 × 768 tensor. Frozen · the diffusion model just consumes these.
U-Net · Operates in the 64×64×4 latent space. Cross-attention injects text embeddings at every scale. ~1B parameters for SD v1.5. This is the only thing you train.
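All three components ship together as one HF diffusers pipeline. A minimal usage sketch with SD v1.5 (the 50 steps and guidance 7.5 shown are the library defaults):

```python
import torch
from diffusers import StableDiffusionPipeline

# Downloads the VAE, CLIP text encoder, and U-Net as one bundle.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a cat astronaut",
             num_inference_steps=50,        # denoising loop runs in latent space
             guidance_scale=7.5).images[0]  # CFG weight w
```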
DDIM and DiT
DDPM's forward process adds tiny Gaussian noise at each of $T = 1000$ steps.
Small per-step noise → easier inverse problem for the network. But then the reverse loop has 1000 forward passes — painful at inference.
DDIM (next slide) breaks this by reinterpreting the reverse process as non-Markovian, so you can skip steps without retraining.
Vanilla DDPM sampling requires running the U-Net 1000 times per generation. Each step is a forward pass of a ~1B param network.
We want to make it faster. Two approaches: smarter samplers that reuse the same trained model without retraining (DDIM, DPM-Solver++), or distillation into a few-step student (consistency models, later in this lecture).
Original DDPM is like walking down a rocky hill in fog. You take 1000 tiny careful steps because you can only see one step ahead.
DDIM realizes · with the right model, you can see the whole path. Instead of 1000 wobbly steps, take 50 confident strides directly toward the final image.
The same trained model gets used · DDIM just chooses which subset of timesteps to evaluate. No retraining. 20× speedup at no cost in sample quality.
Song, Meng, Ermon 2020 · "Denoising Diffusion Implicit Models"
DDIM reformulates the reverse process to be deterministic given the initial noise. Crucially, the trained DDPM model can be sampled with DDIM — no retraining.
Effect · 50 DDIM steps ≈ 1000 DDPM steps in quality. 20× speedup essentially for free.
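A sketch of the deterministic (η = 0) update, assuming a noise-prediction model and the cumulative schedule $\bar\alpha$ as a lookup table · just the math, not the library implementation:

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alpha_bar, c=None):
    eps = model(x_t, t, c)                       # same trained DDPM network
    # Invert the forward process to estimate x0, then jump straight to t_prev
    # along the same trajectory · no noise is re-injected.
    x0_hat = (x_t - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
    return alpha_bar[t_prev].sqrt() * x0_hat + (1 - alpha_bar[t_prev]).sqrt() * eps
```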
In 2026, DDIM (and its successor DPM-Solver++) is the default sampler in every diffusion library.
Convolution (U-Net): you only talk to your neighbours. Messages travel via "telephone" through many layers to reach a far-away pixel.
Attention (Transformer): every pixel is in a global video call. The cat-ear pixel can directly ask the cat-tail pixel "are you also part of the cat?" Better long-range understanding.
Peebles & Xie 2023 · replace the U-Net with a Transformer.
Why · Transformers scale incredibly well. More data + more compute = better, indefinitely. Inheriting LLM-ecosystem optimizations (FlashAttention, tensor parallelism). Backbone of Sora, Stable Diffusion 3, and most 2024+ frontier image/video models.
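A skeleton of the idea · patchify the latent into tokens, then plain Transformer blocks. Timestep/text conditioning is omitted here; real DiT injects it via adaLN-Zero:

```python
import torch
import torch.nn as nn

class DiTBackbone(nn.Module):
    def __init__(self, d=768, patch=2, in_ch=4):
        super().__init__()
        # Patchify the 64×64×4 latent into a sequence of tokens.
        self.patchify = nn.Conv2d(in_ch, d, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=12, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=12)

    def forward(self, z):                                     # z: [B, 4, 64, 64]
        tokens = self.patchify(z).flatten(2).transpose(1, 2)  # [B, 1024, d]
        return self.blocks(tokens)  # every patch attends to every other patch
```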
| System | Architecture | Notes |
|---|---|---|
| Stable Diffusion 3 | DiT + rectified flow | open-weight, commercial |
| DALL-E 3 | diffusion (details undisclosed) | integrated with ChatGPT |
| Midjourney v6+ | custom diffusion | paid, high fidelity |
| Sora | DiT on spacetime patches | OpenAI video model |
| Veo 2 | latent diffusion, video | Google DeepMind video model |
| Flux | open-weight SDXL successor | commercial, fast |
All diffusion-based. All descendants of the 2020 DDPM paper.
Diffusion is one of the two big generative paradigms now (the other: autoregressive LLMs).
| Sampler | Steps | Quality | Notes |
|---|---|---|---|
| DDPM (original) | 1000 | baseline | slowest, stochastic |
| DDIM (deterministic) | 50-100 | = | same model; drop-in |
| DPM-Solver++ | 20-30 | = | ODE solver · default in HF diffusers |
| Flow matching ODE | 8-20 | = | straighter trajectories |
| Consistency models | 1-4 | slight drop | distilled; near-real-time |
5-year trajectory · 1000 steps (2020) → 50 steps (DDIM) → 4 steps (consistency) → 1 step (Rectified Flow v3 distilled). Each generation reduced inference cost by ~10×.
Peebles & Xie 2022 · replace U-Net entirely with a Transformer.
Stable Diffusion 3 and Sora both use DiT-based architectures. DiT inherits Transformer's scaling laws · more data + compute = better samples, indefinitely.
| Modality | Model | Generation speed |
|---|---|---|
| Images 1024² | SD3 · Flux | ~0.5 images/s (20 steps) |
| Video 720p · 16s | Sora · VEO | minutes per clip |
| Audio music | AudioGen · MusicLM | real-time (8 DDIM steps) |
| 3D meshes | Diffusion-SDF · ShapE | ~30s per mesh |
| Molecules | RFdiffusion | 10s per protein structure |
| Robot policies | Diffusion Policy | 50 Hz control loop |
The diffusion paradigm scaled to every signal that can be noised. 2026 is the golden age · expect more modalities (brain signals, weather, materials) by 2028.
Normal DDIM/DDPM · walk down a winding mountain path from foggy peak (noise) to clear valley (image). 50+ steps along the path.
Consistency model = a teleporter. Stand at any point on the path, press a button, instantly arrive at the valley. The teleporter is consistent · no matter where on the path you stand, it always takes you to the same final image $x_0$.
Train $f_\theta$ so that $f_\theta(x_t, t) = f_\theta(x_{t'}, t')$
for any two points on the same trajectory.
Inference: $x_0 = f_\theta(x_T, T)$ · a single forward pass, or 2-4 passes (re-noising between them) for higher quality.
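A few-step sampling sketch in the multistep style, assuming a σ-indexed model $f(x, \sigma) \to x_0$ · the schedule values are illustrative, not a tuned recipe:

```python
import torch

@torch.no_grad()
def consistency_sample(f, shape, sigmas=(80.0, 24.0, 5.0)):
    # One call already yields a sample; extra rounds of partial re-noising
    # followed by another "teleport" refine it.
    x0 = f(torch.randn(shape) * sigmas[0], sigmas[0])   # first teleport
    for s in sigmas[1:]:
        x0 = f(x0 + torch.randn(shape) * s, s)          # re-noise, teleport again
    return x0
```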
SDXL-Turbo (2023) · 1-step generation at near-50-step DDIM quality. LCM adapters · drop-in for Stable Diffusion. GAN-speed with diffusion-quality, finally.
Rectified flow (Liu 2022) and flow matching (Lipman 2022) generalize diffusion: instead of a curved noising process, learn a velocity field that transports noise to data along (near-)straight paths. Straighter trajectories mean fewer integration steps at sampling time.
Stable Diffusion 3, Flux, and many 2024+ models use flow-matching instead of pure diffusion. The boundary is blurring — both are "continuous-time generative models."
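A training-loss sketch under one common convention · a linear data→noise path with a velocity-prediction model. This is an illustration of the idea, not SD3's exact recipe:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x_data):
    # Straight-line path from data (t=0) to noise (t=1), so the target
    # velocity along the path is simply (noise - data).
    noise = torch.randn_like(x_data)
    t = torch.rand(x_data.shape[0], *([1] * (x_data.dim() - 1)))  # broadcastable
    x_t = (1 - t) * x_data + t * noise
    return F.mse_loss(model(x_t, t), noise - x_data)
```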