Last lecture · DDPM — forward noise, learn reverse, predict ε. Works on MNIST, toy 2D. But how do we get from there to Stable Diffusion and Sora?
Today maps to Prince Ch 18 (later sections) + HF diffusers docs + Rombach 2022 (Stable Diffusion) + Ho & Salimans 2022 (CFG).
Four questions:
1. How does an image model listen to a text prompt? (cross-attention)
2. How do we make samples actually follow the prompt? (classifier-free guidance)
3. How do we make diffusion cheap enough to ship? (latent diffusion)
4. How do we sample fast, and what replaces the U-Net at scale? (DDIM, DiT)
We want to model $p(x \mid c)$ · images conditioned on a signal $c$ such as a text prompt · instead of the unconditional $p(x)$ from last lecture.
Three common ways to inject $c$ into the denoiser: concatenate it to the input, add it to the timestep embedding (FiLM/AdaGN style), or let the network cross-attend to it. Text-to-image models use the third.
How does an image model listen to a text prompt?
Imagine a painter following instructions. For every brushstroke they ask, "Which part of the instruction applies right here?"
Cross-attention is exactly this · each spatial location in the image listens to the most relevant words in the prompt.
The result · a dynamic, spatial link between text and pixels. The "cat" word strongly influences cat-shaped regions; the "background" word influences corners.
Analogy · team of painters. Each painter handles a small patch. Before painting they ask: "which word in the prompt is most relevant to my patch?" The painter near the top pays attention to "hat"; the furry-patch painter pays attention to "cat".
The mechanism — for each image region:
```python
import math, torch
import torch.nn as nn
import torch.nn.functional as F

text_emb = clip_text_encoder("a cat astronaut")  # [1, 77, 768] · frozen CLIP (pseudocode)

class CrossAttention(nn.Module):
    def __init__(self, d=768):
        super().__init__()
        self.q_proj, self.k_proj, self.v_proj = (nn.Linear(d, d) for _ in range(3))

    def forward(self, spatial_features, text_emb):   # [B, HW, d], [B, 77, d]
        Q = self.q_proj(spatial_features)            # queries from the image
        K, V = self.k_proj(text_emb), self.v_proj(text_emb)  # keys/values from text
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # [B, HW, 77]
        return F.softmax(scores, dim=-1) @ V         # per-pixel text-weighted update
```
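A quick shape check with random stand-ins for the CLIP output, so it runs without any pretrained model:

```python
attn = CrossAttention(d=768)
spatial = torch.randn(1, 256, 768)   # a 16×16 feature map, flattened to 256 positions
text = torch.randn(1, 77, 768)       # stand-in for clip_text_encoder output
print(attn(spatial, text).shape)     # torch.Size([1, 256, 768])
```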
Prompt: "a red square". Two word vectors:
Top-left pixel's current state:
Step 1 · scores.
"Square" is much more relevant.
Step 2 · softmax → weights. Suppose
Step 3 · final instruction.
Used to update this pixel — slightly more red, much more square.
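The arithmetic, checked in a few lines (toy values from above):

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([1.0, 3.0])                    # q·k_red, q·k_square
w = F.softmax(scores, dim=0)                         # tensor([0.1192, 0.8808])
v_red, v_square = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])
print(w[0] * v_red + w[1] * v_square)                # ≈ [0.12, 0.88] · mostly "square"
```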
Self-attention inside the U-Net mixes spatial positions. Cross-attention between spatial features and text features is where the prompt enters.
At a high-resolution block · each pixel asks "which prompt tokens matter for me?". A cat pixel pays attention to "cat" in the prompt; a sky pixel to "sky". The result is spatially localized conditioning — a single prompt can steer different regions differently.
Visualizing the cross-attention maps reveals exactly this — each word has a blob of pixels that listened most to it. This is the mechanism behind DreamBooth, Prompt-to-Prompt, and every prompt-editing technique.
The trick that makes generation feel "on-prompt"
You're lost in a forest with two compasses. One (the unconditional model) points toward "any realistic image"; the other (the conditional model) points toward "an image that matches your prompt".
The difference vector between the two readings isolates the pure prompt direction. CFG follows it past the conditional reading: $\hat\epsilon = \epsilon_{\text{uncond}} + w\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$.
Default · $w = 7.5$ in Stable Diffusion.
Interactive: slide $w$ from 0 to 20 and watch samples go generic → on-prompt → oversaturated.
To enable CFG at inference, the training needs both conditional and unconditional examples:
```python
import random
import torch
import torch.nn.functional as F

def training_step(x0, prompt):
    t = sample_t()                          # random timestep (pseudocode helper)
    noise = torch.randn_like(x0)
    x_t = add_noise(x0, noise, t)           # forward process q(x_t | x0)
    # Dropout 10% of prompts to null (enables the unconditional path later)
    if random.random() < 0.10:
        c = null_embedding
    else:
        c = clip_text_encoder(prompt)
    pred_noise = model(x_t, t, c)           # predict ε, conditioned on c
    return F.mse_loss(pred_noise, noise)
```
Same network learns both modes. At inference, you run it twice and extrapolate. Cost · 2× compute per step. Benefit · any w at generation time.
At denoising step $t$, run the model twice: $\epsilon_u = \epsilon_\theta(x_t, t, \varnothing)$ (null embedding) and $\epsilon_c = \epsilon_\theta(x_t, t, c)$ (prompt).
Difference vector · $\Delta = \epsilon_c - \epsilon_u$. Guided prediction · $\hat\epsilon = \epsilon_u + w\,\Delta$.
| $w$ | $\hat\epsilon$ | meaning |
|---|---|---|
| 0 | $\epsilon_u$ | unconditional |
| 1 | $\epsilon_u + \Delta = \epsilon_c$ | pure conditional |
| 3 | $\epsilon_c + 2\Delta$ | overshoot 2× |
| 7 | $\epsilon_c + 6\Delta$ | strong overshoot |
Larger $w$ pushes the prediction further along the prompt direction · stronger adherence, less diversity.
| $w$ | What you get |
|---|---|
| 1 | pure conditional · ignores extrapolation |
| 3 | subtle prompt adherence · natural-looking |
| 7 | Stable Diffusion default · balanced |
| 12+ | strong adherence · saturated colors, artifacts |
| 25+ | cartoonish oversaturation; often broken |
CFG trades diversity for prompt adherence. Low $w$ · varied but loosely conditioned samples. High $w$ · faithful but repetitive, oversaturated ones.
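In code, one guided prediction per denoising step looks like this · a minimal sketch, assuming a noise-prediction model with signature `model(x_t, t, cond)` (hypothetical names):

```python
import torch

@torch.no_grad()
def cfg_noise(model, x_t, t, c, null_emb, w=7.5):
    # Two forward passes per step: this is the 2× inference cost.
    eps_u = model(x_t, t, null_emb)      # "any realistic image" direction
    eps_c = model(x_t, t, c)             # "matches the prompt" direction
    return eps_u + w * (eps_c - eps_u)   # extrapolate along the difference
```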
The one trick that made Stable Diffusion ship
Most pixels in an image are correlated. A patch of blue sky has hundreds of near-identical pixel values. Diffusing each one independently is wasteful.
A pretrained VAE has already absorbed the perceptually redundant structure · the latent is a near-minimal representation. Diffusion then only has to model the interesting (high-entropy) dimensions.
Result · 48× fewer dimensions, same perceptual quality, ~10× faster sampling. This is the single change that made Stable Diffusion runnable on consumer GPUs.
A 512×512 photo of blue sky has 262,144 pixels · most are nearly identical shades of blue. Making the network carefully denoise every one of those redundant values at every one of 1000 steps is wasteful.
Like emailing a 100MB doc. Instead of attaching it raw, you zip to 1MB. The zip captures the important info in less space. Latent diffusion does the same · zip the image with a VAE, run diffusion in the small zipped space, then unzip at the end.
512×512 RGB = 512 × 512 × 3 = 786,432 dimensions.
Stable Diffusion's VAE compresses this to 64×64×4 = 16,384 dimensions.
Saving · 786,432 / 16,384 = 48× fewer dimensions to diffuse.
The diffusion loop runs in latent space; the VAE encoder runs once at the start, decoder once at the end. Single biggest reason Stable Diffusion is shippable on consumer GPUs.
VAE · Encoder: image → 4-channel latent. Decoder: latent → image. Pretrained, not updated during diffusion training.
Text encoder (CLIP) · Tokenize → embed → 77 × 768 tensor. Frozen · the diffusion model just consumes these.
U-Net · Operates in the 64×64×4 latent space. Cross-attention injects text embeddings at every scale. ~1B parameters for SD v1.5. This is the only thing you train.
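All three components ship together as one HF diffusers pipeline. A minimal usage sketch with SD v1.5 (the 50 steps and guidance 7.5 shown are the library defaults):

```python
import torch
from diffusers import StableDiffusionPipeline

# Downloads the VAE, CLIP text encoder, and U-Net as one bundle.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a cat astronaut",
             num_inference_steps=50,        # denoising loop runs in latent space
             guidance_scale=7.5).images[0]  # CFG weight w
```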
DDIM and DiT
DDPM's forward process adds tiny Gaussian noise at each of $T = 1000$ steps.
Small per-step noise → easier inverse problem for the network. But then the reverse loop has 1000 forward passes — painful at inference.
DDIM (next slide) breaks this by reinterpreting the reverse process as non-Markovian, so you can skip steps without retraining.
Vanilla DDPM sampling requires running the U-Net 1000 times per generation. Each step is a forward pass of a ~1B param network.
We want to make it faster. Two approaches: smarter samplers that reuse the same trained model without retraining (DDIM, DPM-Solver++), or distillation into a few-step student (consistency models, later in this lecture).
Original DDPM is like walking down a rocky hill in fog. You take 1000 tiny careful steps because you can only see one step ahead.
DDIM realizes · with the right model, you can see the whole path. Instead of 1000 wobbly steps, take 50 confident strides directly toward the final image.
The same trained model gets used · DDIM just chooses which subset of timesteps to evaluate. No retraining. 20× speedup at no cost in sample quality.
Song, Meng, Ermon 2020 · "Denoising Diffusion Implicit Models"
DDIM reformulates the reverse process to be deterministic given the initial noise. Crucially, the trained DDPM model can be sampled with DDIM — no retraining.
Effect · 50 DDIM steps ≈ 1000 DDPM steps in quality. 20× speedup essentially for free.
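A sketch of the deterministic (η = 0) update, assuming a noise-prediction model and the cumulative schedule $\bar\alpha$ as a lookup table · just the math, not the library implementation:

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alpha_bar, c=None):
    eps = model(x_t, t, c)                       # same trained DDPM network
    # Invert the forward process to estimate x0, then jump straight to t_prev
    # along the same trajectory · no noise is re-injected.
    x0_hat = (x_t - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
    return alpha_bar[t_prev].sqrt() * x0_hat + (1 - alpha_bar[t_prev]).sqrt() * eps
```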
In 2026, DDIM (and its successor DPM-Solver++) is the default sampler in every diffusion library.
Convolution (U-Net): you only talk to your neighbours. Messages travel via "telephone" through many layers to reach a far-away pixel.
Attention (Transformer): every pixel is in a global video call. The cat-ear pixel can directly ask the cat-tail pixel "are you also part of the cat?" Better long-range understanding.
Peebles & Xie 2023 · replace the U-Net with a Transformer.
Why · Transformers scale incredibly well. More data + more compute = better, indefinitely. Inheriting LLM-ecosystem optimizations (FlashAttention, tensor parallelism). Backbone of Sora, Stable Diffusion 3, and most 2024+ frontier image/video models.
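A skeleton of the idea · patchify the latent into tokens, then plain Transformer blocks. Timestep/text conditioning is omitted here; real DiT injects it via adaLN-Zero:

```python
import torch
import torch.nn as nn

class DiTBackbone(nn.Module):
    def __init__(self, d=768, patch=2, in_ch=4):
        super().__init__()
        # Patchify the 64×64×4 latent into a sequence of tokens.
        self.patchify = nn.Conv2d(in_ch, d, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=12, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=12)

    def forward(self, z):                                     # z: [B, 4, 64, 64]
        tokens = self.patchify(z).flatten(2).transpose(1, 2)  # [B, 1024, d]
        return self.blocks(tokens)  # every patch attends to every other patch
```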
| System | Architecture | Notes |
|---|---|---|
| Stable Diffusion 3 | DiT + rectified flow | open-weight, commercial |
| DALL-E 3 | diffusion (details undisclosed) | integrated with ChatGPT |
| Midjourney v6+ | custom diffusion | paid, high fidelity |
| Sora | DiT on spacetime patches | OpenAI video model |
| Veo 2 | latent diffusion, video | Google DeepMind video model |
| Flux | open-weight SDXL successor | commercial, fast |
All diffusion-based. All descendants of the 2020 DDPM paper.
Diffusion is one of the two big generative paradigms now (the other: autoregressive LLMs).
| Sampler | Steps | Quality | Notes |
|---|---|---|---|
| DDPM (original) | 1000 | baseline | slowest, stochastic |
| DDIM (deterministic) | 50-100 | = | same model; drop-in |
| DPM-Solver++ | 20-30 | = | ODE solver · default in HF diffusers |
| Flow matching ODE | 8-20 | = | straighter trajectories |
| Consistency models | 1-4 | slight drop | distilled; near-real-time |
5-year trajectory · 1000 steps (2020) → 50 steps (DDIM) → 4 steps (consistency) → 1 step (Rectified Flow v3 distilled). Each generation reduced inference cost by ~10×.
Peebles & Xie 2022 · replace U-Net entirely with a Transformer.
Stable Diffusion 3 and Sora both use DiT-based architectures. DiT inherits Transformer's scaling laws · more data + compute = better samples, indefinitely.
| Modality | Model | Generation speed |
|---|---|---|
| Images 1024² | SD3 · Flux | ~0.5 images/s (20 steps) |
| Video 720p · 16s | Sora · VEO | minutes per clip |
| Audio music | AudioGen · MusicLM | real-time (8 DDIM steps) |
| 3D meshes | Diffusion-SDF · ShapE | ~30s per mesh |
| Molecules | RFdiffusion | 10s per protein structure |
| Robot policies | Diffusion Policy | 50 Hz control loop |
The diffusion paradigm scaled to every signal that can be noised. 2026 is the golden age · expect more modalities (brain signals, weather, materials) by 2028.
Normal DDIM/DDPM · walk down a winding mountain path from foggy peak (noise) to clear valley (image). 50+ steps along the path.
Consistency model = a teleporter. Stand at any point on the path, press a button, instantly arrive at the valley. The teleporter is consistent · no matter where on the path you stand, it always takes you to the same final image $x_0$.
Train $f_\theta$ so that $f_\theta(x_t, t) = f_\theta(x_{t'}, t')$
for any two points on the same trajectory.
Inference: $x_0 = f_\theta(x_T, T)$ · a single forward pass, or 2-4 passes (re-noising between them) for higher quality.
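A few-step sampling sketch in the multistep style, assuming a σ-indexed model $f(x, \sigma) \to x_0$ · the schedule values are illustrative, not a tuned recipe:

```python
import torch

@torch.no_grad()
def consistency_sample(f, shape, sigmas=(80.0, 24.0, 5.0)):
    # One call already yields a sample; extra rounds of partial re-noising
    # followed by another "teleport" refine it.
    x0 = f(torch.randn(shape) * sigmas[0], sigmas[0])   # first teleport
    for s in sigmas[1:]:
        x0 = f(x0 + torch.randn(shape) * s, s)          # re-noise, teleport again
    return x0
```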
SDXL-Turbo (2023) · 1-step generation at near-50-step DDIM quality. LCM adapters · drop-in for Stable Diffusion. GAN-speed with diffusion-quality, finally.
Rectified flow (Liu 2022) and flow matching (Lipman 2022) generalize diffusion: instead of a curved noising process, learn a velocity field that transports noise to data along (near-)straight paths. Straighter trajectories mean fewer integration steps at sampling time.
Stable Diffusion 3, Flux, and many 2024+ models use flow-matching instead of pure diffusion. The boundary is blurring — both are "continuous-time generative models."
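A training-loss sketch under one common convention · a linear data→noise path with a velocity-prediction model. This is an illustration of the idea, not SD3's exact recipe:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x_data):
    # Straight-line path from data (t=0) to noise (t=1), so the target
    # velocity along the path is simply (noise - data).
    noise = torch.randn_like(x_data)
    t = torch.rand(x_data.shape[0], *([1] * (x_data.dim() - 1)))  # broadcastable
    x_t = (1 - t) * x_data + t * noise
    return F.mse_loss(model(x_t, t), noise - x_data)
```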