Interactive Explainer
CFG draws a straight line from what the model would draw without the prompt to what it would draw with the prompt, and lets you slide along it, even past the conditional end. Slide the weight w from 0 to 30 to see the generation stretch toward, and then past, the target shape.
During training, the diffusion model sees both conditional and unconditional examples (10% of prompts are dropped and replaced with the empty string ""). At inference, you can extrapolate past the conditional prediction along the prompt direction: the further you go, the more aggressively the model commits to the prompt, until it overshoots and produces saturated, ringing artifacts.
The sweet spot for Stable Diffusion is roughly w ∈ [5, 10]. Below 3 the model drifts off-prompt; above 15 you start seeing saturation (everything becomes a vivid stereotype) and "ringing" (high-frequency artifacts around edges).
ε_CFG = ε_θ(x_t, ∅) + w · (ε_θ(x_t, c) − ε_θ(x_t, ∅))
At w=0, you get the pure unconditional noise prediction (prompt ignored). At w=1, you get the conditional prediction. Past w=1 you're extrapolating, pushing further in the prompt direction than the model ever saw during training. Artifacts appear because this extrapolation takes the prediction outside the distribution the model was trained on.
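The formula is one line of code. A minimal sketch, with the noise predictions shown as plain floats for illustration (a real implementation applies the same arithmetic elementwise to tensors):

```python
def cfg_epsilon(eps_uncond: float, eps_cond: float, w: float) -> float:
    """Classifier-free guidance: start at the unconditional prediction
    and move w times the (conditional - unconditional) difference."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# w = 0 recovers the unconditional prediction, w = 1 the conditional one,
# and w > 1 extrapolates past it.
print(cfg_epsilon(0.0, 1.0, 0.0))   # 0.0  (unconditional)
print(cfg_epsilon(0.0, 1.0, 1.0))   # 1.0  (conditional)
print(cfg_epsilon(0.0, 1.0, 7.5))   # 7.5  (extrapolated)
```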
CFG requires a small trick during training: randomly drop the prompt 10% of the time (replace with null/empty conditioning). This teaches the same model to do both conditional and unconditional generation. At inference you run two forward passes per step — one with the prompt, one with the null — and extrapolate.
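Both halves of the trick fit in a few lines. A sketch under one labeled assumption: `model(x_t, prompt)` stands in for the denoiser (a real one also takes a timestep and returns a tensor, not a float):

```python
import random

NULL_PROMPT = ""  # null conditioning: the empty string, as in the 10% dropout

def maybe_drop_prompt(prompt: str, drop_prob: float = 0.1) -> str:
    # Training-time trick: replace the prompt with null conditioning some of
    # the time, so one model learns both conditional and unconditional denoising.
    return NULL_PROMPT if random.random() < drop_prob else prompt

def cfg_step(model, x_t, prompt: str, w: float):
    # Inference: two forward passes per denoising step, then extrapolate.
    eps_cond = model(x_t, prompt)
    eps_uncond = model(x_t, NULL_PROMPT)
    return eps_uncond + w * (eps_cond - eps_uncond)
```

The dropout probability and the choice of null token vary between systems; the two-pass inference structure is the same everywhere.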
Cost · 2× inference. Benefit · you can dial any w you want at generation time without retraining. Every text-to-image system uses this (Stable Diffusion, Imagen, DALL-E 3, Midjourney).
Part of the ES 667 Deep Learning course · IIT Gandhinagar · Aug 2026.