Hidden variable
Write the generative story as · draw a hidden variable z ~ p(z), then decode it into data x ~ p(x|z). VAE (L19) and GAN (L20) use a single latent z; Diffusion (L21) is a layered latent model with a whole chain of latents z₁, …, z_T.
A brief taxonomy
| Family | How it samples | Training |
|---|---|---|
| VAE (L19) | sample z ~ p(z), decode | ELBO |
| GAN (L20) | sample z ~ p(z), generator | minimax |
| Normalizing flows | invertible transforms of p(z) | exact likelihood |
| Diffusion (L21-22) | iterative denoising from noise | score matching / denoising |
Today's topic is the VAE. Each family trades off sample quality, training stability, and tractability.
The building block
A perfect forger writes the most compact possible description of a painting on a postcard (the latent code · maybe 16 numbers).
They mail it to their partner. The partner must recreate the original painting using only the postcard.
If the postcard is too small, they're forced to learn what's truly essential · the "essence" of the painting · not every brushstroke. That's compression. That's what an autoencoder learns.
Train encoder + decoder together. The bottleneck forces the network to keep only what it needs to reconstruct the input.
Example · a tiny grayscale image goes in, the encoder squeezes it down to a short latent vector, the decoder blows it back up, and the loss is the pixel-wise reconstruction error between input and output.
Backprop adjusts encoder + decoder weights to push this lower.
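A minimal sketch of that training step, assuming PyTorch, MSE reconstruction loss, and made-up layer sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical tiny autoencoder: 784-pixel image -> 16-dim code -> 784 pixels
enc = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 16))
dec = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784), nn.Sigmoid())
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.rand(32, 784)            # stand-in batch of flattened images
recon = dec(enc(x))                # encode, then decode
loss = F.mse_loss(recon, x)        # pixel-wise reconstruction error
opt.zero_grad(); loss.backward(); opt.step()   # push the loss lower
```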
Uses · dimensionality reduction, denoising, anomaly detection (unusually high reconstruction error flags outliers), and pretraining representations for downstream tasks.
PCA is the linear autoencoder with orthogonal weights. What does nonlinearity buy you?
Concretely · PCA on MNIST reaches ~85% explained variance with 32 dims; a deep AE matches that reconstruction quality with ~16 dims. Curved manifold vs linear subspace · nonlinearity buys 2× compression.
If the latent were as large as the input, the network could simply copy pixels through and learn nothing.
The bottleneck is the forcing function · it has to be small enough that copying is impossible.
Modern variants add noise (denoising AE) or masking (MAE, L17) instead of a small bottleneck — same idea, different forcing.
Input · 28 × 28 = 784 pixels. Encode to latent z of size 16. Decode back to 784.
| Layer | Shape | Params |
|---|---|---|
| Input | 784 | — |
| Linear → ReLU | 256 | 200,960 |
| Linear → ReLU | 64 | 16,448 |
| Linear (bottleneck; μ only) | 16 | 1,040 |
| Linear → ReLU | 64 | 1,088 |
| Linear → ReLU | 256 | 16,640 |
| Linear → sigmoid | 784 | 201,488 |
Total · ~440k params. Reconstruction MSE on MNIST test · ~0.003 after 10 epochs. Compare PCA with 16 components · ~0.015. 5× better with nonlinearities.
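A quick sketch that builds the table's architecture and confirms the parameter count (layer sizes from the table; everything else is an assumption):

```python
import torch.nn as nn

# The plain autoencoder from the table above: 784 -> 16 -> 784
ae = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 16),                      # bottleneck
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),
)
print(sum(p.numel() for p in ae.parameters()))   # 437,664 ≈ 440k
```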
Suppose you train an AE on MNIST. To generate a new digit, you'd · sample a random latent z, then push it through the decoder.
What happens? Usually garbage. Why?
The latent space is irregular. The encoder only learned to map actual training images to latent points. Random z values likely fall into "nothing-mapped-here" regions where the decoder is undefined.
You'd need the latent space to be dense and structured — that's what VAE adds.
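As a sketch, the naive attempt with the hypothetical `dec` from the earlier snippet:

```python
import torch

# Naive generation with a plain AE: decode a random latent.
z = torch.randn(1, 16)        # a random point in latent space
fake = dec(z)                 # usually garbage: z likely lands where nothing was encoded
```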
A prior and a KL penalty
Interactive: slide the KL weight β, watch the latent space go from clumpy to Gaussian — vae-latent-explorer.
The prior · fix p(z) = N(0, I), a standard Gaussian over the latent space.
At generation time we draw z ~ N(0, I) and run it through the decoder.
Without a prior, you wouldn't know how to initialize z for generation.
The KL term pulls every posterior q(z|x) toward that prior, so the training encodings fill the region the prior covers instead of scattering.
Without this, training points occupy disjoint clusters.
A VAE is a plain AE with a regularizer that makes the latent space match a known distribution. Everything else follows from making that regularizer principled (the ELBO).
The encoder no longer outputs a point z · it outputs the parameters of a distribution, q(z|x) = N(μ(x), σ²(x)).
Both μ and σ come out of the same encoder network (in the code below, a single linear layer of size 2·d_z, split in half).
The VAE loss is just reconstruction + KL-to-prior. Train to minimize it. That's all you need to use a VAE.
The next two slides derive this from first principles (Jensen's inequality). If you trust me, you can skip them · come back to the math later.
Hard problem. We want to maximize the data likelihood log p(x) = log ∫ p(x|z) p(z) dz, but the integral over z is intractable.
Easy trick (variational inference). Introduce an approximate posterior q(z|x) — the encoder — and bound the likelihood from below.
Step 1 · multiply and divide by q(z|x) inside the integral · log p(x) = log ∫ q(z|x) · [ p(x|z) p(z) / q(z|x) ] dz = log E_q[ p(x|z) p(z) / q(z|x) ].
Step 2 · Jensen's inequality. Log of an expectation ≥ expectation of the log · log p(x) ≥ E_q[ log p(x|z) + log p(z) − log q(z|x) ].
Step 3 · expand. Using linearity of expectation · = E_q[ log p(x|z) ] + E_q[ log p(z) − log q(z|x) ].
The second bracket is exactly −KL( q(z|x) ‖ p(z) ).
Evidence Lower Bound — a tractable lower bound on log p(x) · ELBO = E_q[ log p(x|z) ] − KL( q(z|x) ‖ p(z) ).
Maximize the ELBO = minimize the negative. This is the VAE loss. Every term is tractable and backpropagatable.
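For reference, the same chain written out as one display (standard ELBO algebra, nothing beyond the steps above):

```latex
\[
\begin{aligned}
\log p(x) &= \log \int p(x \mid z)\, p(z)\, dz
           = \log \mathbb{E}_{q(z \mid x)}\!\left[\frac{p(x \mid z)\, p(z)}{q(z \mid x)}\right] \\
          &\ge \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z) + \log p(z) - \log q(z \mid x)\big] \quad \text{(Jensen)} \\
          &= \underbrace{\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]}_{\text{reconstruction}}
             \;-\; \underbrace{\mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)}_{\text{KL to prior}}
           \;=\; \mathrm{ELBO}(x).
\end{aligned}
\]
```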
The KL term is the price for being able to sample from a known prior at generation time · every posterior has to stay compatible with N(0, I).
For Gaussian q(z|x) = N(μ, σ²) and prior N(0, I) it has a closed form · KL = ½ Σⱼ ( μⱼ² + σⱼ² − log σⱼ² − 1 ).
Quick check on the variance term · σ² − log σ² − 1 is zero at σ = 1 and positive everywhere else, so the penalty vanishes exactly when the posterior's spread matches the prior's.
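A quick numerical sanity check of that closed form against torch.distributions (the μ and log σ² values are arbitrary):

```python
import torch
from torch.distributions import Normal, kl_divergence

# Compare the closed-form diagonal-Gaussian KL with torch's reference implementation.
mu = torch.tensor([0.5, -1.0, 0.0])
log_var = torch.tensor([0.0, -1.0, 0.5])
std = torch.exp(0.5 * log_var)

closed_form = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum()
reference = kl_divergence(Normal(mu, std), Normal(torch.zeros(3), torch.ones(3))).sum()
print(closed_form.item(), reference.item())   # the two numbers agree
```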
The VAE pushes each posterior toward the prior (μ → 0, σ → 1) while the reconstruction term pushes posteriors apart, so different inputs keep distinguishable codes.
Suppose one latent dimension has μ = 2 and σ = 1.
Plugging in · KL = ½ (2² + 1 − 0 − 1) = 2 nats for that single dimension, a cost the reconstruction term has to buy back.
This is the trade-off the VAE balances at every sample.
If the decoder is too powerful, the KL term will drive q(z|x) all the way onto the prior (KL → 0), because the decoder can model x well enough without reading z.
Posterior collapse · the VAE becomes an autoencoder where z is just noise. Reconstructions are fine (the decoder ignores z), but samples are junk (there's no latent structure to exploit).
Fixes · KL annealing (ramp the KL weight from 0 up over training), free bits (exempt a small floor of KL per dimension from the penalty), or a weaker decoder so the model is forced to route information through z. A sketch of the first fix follows.
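KL annealing bolted onto the training loop shown later in these notes; `warmup_steps` is a hypothetical schedule length, and `model`, `opt`, `BETA`, `F` are as in that loop:

```python
# KL annealing: ramp the KL weight from 0 to BETA over the first warmup_steps updates.
warmup_steps = 10_000
for step, x in enumerate(loader):
    recon, kl = model(x)
    beta_t = BETA * min(1.0, step / warmup_steps)   # linear warm-up
    recon_loss = F.mse_loss(recon, x, reduction='none').sum(-1)
    loss = (recon_loss + beta_t * kl).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```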
Imagine training a robot arm that randomly picks a part from a bin. You can't train the picking motion · "your random pick was wrong" gives no gradient.
Now change the system · the robot picks a specific part from a conveyor belt. The randomness is in how the belt is loaded, not in the robot's motion.
The belt-loading randomness ≡ the noise ε ~ N(0, 1); the robot's deterministic, trainable motion ≡ the map z = μ + σ · ε.
How to backprop through a sample
Problem. We need gradients of the loss with respect to μ and σ, but sample_from_gaussian(μ, σ) is a black box — no gradient flows through it.
Trick (Kingma & Welling 2013). Any sample from N(μ, σ²) can be rewritten as z = μ + σ · ε with ε ~ N(0, 1).
The randomness now sits outside the learnable path · ε is just an input, and gradients flow through the deterministic map into μ and σ.
Worked numeric. Encoder outputs μ = 0.5 and log σ² = −1.0 for one latent dimension, so σ = e^(−0.5) ≈ 0.61.
Suppose the drawn noise is ε = 0.3.
Then · z = μ + σ · ε ≈ 0.5 + 0.61 × 0.3 ≈ 0.68, with ∂z/∂μ = 1 and ∂z/∂σ = ε = 0.3 — plain, finite derivatives.
Optimizer can now update μ and σ (and hence the encoder weights) using the loss evaluated at this sampled z.
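The same computation as an autograd check, using the illustrative numbers above:

```python
import torch

# Verify that gradients flow through the reparameterized sample z = mu + sigma * eps.
mu = torch.tensor([0.5], requires_grad=True)
log_var = torch.tensor([-1.0], requires_grad=True)

eps = torch.tensor([0.3])                  # the "belt-loading" randomness, held fixed here
z = mu + torch.exp(0.5 * log_var) * eps    # z ≈ 0.68
z.backward()

print(mu.grad, log_var.grad)   # dz/dmu = 1; dz/dlog_var = 0.5 * sigma * eps ≈ 0.091
```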
```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, d_in, d_z):
        super().__init__()
        # Encoder outputs 2*d_z numbers: mu and log-variance
        self.enc = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * d_z))
        self.dec = nn.Sequential(nn.Linear(d_z, 256), nn.ReLU(),
                                 nn.Linear(256, d_in))

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = h.chunk(2, dim=-1)
        # Reparam: z = mu + sigma * eps
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        z = mu + std * eps
        recon = self.dec(z)
        # KL against N(0, I), closed form
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1)
        return recon, kl
```
Entire VAE in 15 lines. The trick is making sure the KL goes in the loss alongside reconstruction.
```python
import torch.nn.functional as F

# Assumes: model = VAE(...), opt = an optimizer over model.parameters(), BETA = KL weight
for x in loader:
    recon, kl = model(x)
    # Per-example reconstruction error, summed over pixels
    recon_loss = F.mse_loss(recon, x, reduction='none').sum(-1)
    loss = (recon_loss + BETA * kl).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```
Tuning BETA:
- BETA = 1 · standard VAE (follows the ELBO derivation).
- BETA > 1 · β-VAE (Higgins 2017). Stronger regularization → more disentangled, often blurrier.
- BETA < 1 · more weight on reconstruction, less on latent structure.

With a large enough β, individual latent coordinates tend to line up with independent factors of variation.
No supervision — the structure emerges from the KL regularization plus the reconstruction pressure. Disentanglement lets you do editable generation · "same face with different smile" by perturbing one z coordinate.
The trade-off · stronger KL forces shared structure, but loses reconstruction detail. β = 1 is the theoretical sweet spot; higher β sacrifices quality for interpretability.
If you have class labels y, condition both networks on them · the encoder becomes q(z|x, y) and the decoder p(x|z, y) — a conditional VAE (CVAE).
At inference · sample z ~ N(0, I), choose the class label y you want, and decode p(x|z, y).
CVAE was used for controllable generation before diffusion + CFG took over. Still shipped in some specialized systems (molecule generation, time-series imputation).
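A minimal CVAE sketch on top of the VAE class above; concatenating a one-hot label to both encoder and decoder inputs is just one common choice, assumed here:

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Conditional VAE: concatenate a one-hot label to encoder and decoder inputs."""
    def __init__(self, d_in, d_z, n_classes):
        super().__init__()
        self.n_classes = n_classes
        self.enc = nn.Sequential(nn.Linear(d_in + n_classes, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * d_z))
        self.dec = nn.Sequential(nn.Linear(d_z + n_classes, 256), nn.ReLU(),
                                 nn.Linear(256, d_in))

    def forward(self, x, y):
        y_onehot = nn.functional.one_hot(y, self.n_classes).float()
        mu, log_var = self.enc(torch.cat([x, y_onehot], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterized sample
        recon = self.dec(torch.cat([z, y_onehot], dim=-1))
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1)
        return recon, kl
```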
Suppose · one flattened MNIST image x (784 pixels), latent size 16.
Step 1 · encoder. It outputs two 16-dim vectors, μ and log σ².
Step 2 · sample. Draw ε ~ N(0, I) and set z = μ + σ · ε.
Step 3 · decode. The decoder maps z back to a 784-dim reconstruction x̂.
Step 4 · loss. Reconstruction error Σ (x − x̂)² plus β · KL( N(μ, σ²) ‖ N(0, I) ).
Backprop through this. Update params. Repeat.
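The same single step, executed with the VAE class defined earlier (a random tensor stands in for a real MNIST image):

```python
import torch

torch.manual_seed(0)
model = VAE(784, 16)                     # the class defined above
x = torch.rand(1, 784)                   # stand-in for one flattened MNIST image

recon, kl = model(x)                                 # steps 1-3 happen inside forward()
recon_loss = ((x - recon) ** 2).sum(-1)              # step 4a: reconstruction term
loss = (recon_loss + 1.0 * kl).mean()                # step 4b: add the KL (BETA = 1)
loss.backward()                                      # backprop through the whole thing
print(recon_loss.item(), kl.item(), loss.item())
```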
```python
import torch

# Generate new examples
with torch.no_grad():
    z = torch.randn(16, d_z)          # sample from the prior N(0, I)
    samples = model.dec(z)            # decode into image space

# Interpolate between two inputs in latent space
with torch.no_grad():
    mu_a, _ = model.enc(x_a).chunk(2, dim=-1)   # mu for image A
    mu_b, _ = model.enc(x_b).chunk(2, dim=-1)   # mu for image B
    for alpha in torch.linspace(0, 1, 10):
        z = (1 - alpha) * mu_a + alpha * mu_b
        morph = model.dec(z)                     # smooth transition
```
The interpolation is the magic · it produces valid intermediate images because the latent space is smooth.
Truncated sampling. Sampling z from a truncated prior — discarding or clamping draws that land far from the origin — trades diversity for more typical, cleaner samples.
Decoder stochasticity. If the decoder outputs a Gaussian p(x|z) = N(x̂(z), σ²I), you can add pixel noise on top of x̂ or just display the mean; in practice almost everyone shows the mean.
VAE blur. The KL pulls posteriors toward a simple prior, so posteriors overlap significantly. The decoder averages over all the x's whose codes land in the same overlapping region — hence the characteristic blur.
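One simple way to truncate, as a sketch (the cutoff of 2 is an arbitrary choice; `model` and the latent size 16 come from the code above):

```python
import torch

# Truncated sampling: redraw any latent coordinate that falls outside [-cutoff, cutoff].
def truncated_normal(shape, cutoff=2.0):
    z = torch.randn(shape)
    while (z.abs() > cutoff).any():
        resample = z.abs() > cutoff
        z[resample] = torch.randn(int(resample.sum()))
    return z

with torch.no_grad():
    samples = model.dec(truncated_normal((16, 16)))   # 16 samples, d_z = 16
```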
| | Sample quality | Training stability | Likelihood | Sampling speed |
|---|---|---|---|---|
| VAE | ✗ often blurry | ✓ stable | ✓ ELBO | ✓✓ one pass |
| GAN | ✓✓ sharp | ✗ brittle | ✗ no | ✓✓ one pass |
| Diffusion | ✓✓✓ SOTA | ✓ stable | ≈ | ✗ many passes |
VAEs remain useful for latent-space exploration and pre-compression — Stable Diffusion uses a VAE to compress images into a much smaller latent space (8× downsampled in each spatial dimension) before running diffusion there.
Q. Why is VAE blurrier than GAN?
A. VAE's loss is pixel-wise MSE, whose optimal prediction is the mean of the possible reconstructions. When multiple outputs are plausible (e.g., any detailed face), that mean is a smoothed average — blurry. GANs don't average; they commit to one sharp sample.
Q. Can I use a perceptual loss (feature-space MSE) instead of pixel MSE?
A. Yes — it produces sharper reconstructions. VQ-VAE (van den Oord et al., 2017) adds discrete latents to the VAE-like structure; Stable Diffusion's autoencoder combines that kind of setup with perceptual and adversarial losses.
Q. Is the posterior truly Gaussian?
A. No — the true posterior is arbitrary. The Gaussian parameterization is an approximation (the "amortized variational" part). Normalizing flow encoders and hierarchical VAEs address this; vanilla VAE trades approximation quality for simplicity.