Autoencoders & VAEs

Lecture 19 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Learning outcomes

By the end of this lecture you will be able to:

  1. State what a plain autoencoder is and why it isn't generative.
  2. Motivate the need for a prior distribution on latent space.
  3. Write the ELBO from scratch using Jensen's inequality.
  4. Derive the KL term for Gaussian posterior vs standard normal.
  5. Explain and implement the reparameterization trick.
  6. Train a β-VAE and discuss disentanglement.
  7. Place VAEs in context · pre-compressor for Stable Diffusion in 2026.

Where we are

Module 9 opens · generative models. Until now every model classified or predicted — labels, tokens, pixels-given-labels. Today we switch to: given a dataset, can I sample new examples that look like it?

Today maps to Prince Ch 17 (Variational Autoencoders) + Kingma & Welling 2013.

Four questions:

  1. What's a plain autoencoder and why isn't it generative?
  2. What does VAE add, and why does it work?
  3. What is the reparameterization trick?
  4. Where do VAEs fit in the 2026 generative landscape?

Generative modeling · the task

Given i.i.d. samples x₁, …, x_N from an unknown distribution p_data(x), learn a model p_θ(x) from which we can sample new x such that p_θ(x) ≈ p_data(x).

Two sub-tasks, often pursued together:

  1. Density estimation · assign a probability to any candidate sample.
  2. Generation · draw novel samples from p_θ(x).

Images live in a huge pixel space but concentrate on a low-dimensional manifold. Writing down p(x) analytically is hopeless; learning it from samples is the whole game.

Three generative strategies

Density-based

Write p_θ(x) explicitly; maximize the log-likelihood Σᵢ log p_θ(xᵢ).

  • Pixel-RNN, PixelCNN, Normalizing flows.
  • Clean likelihoods.
  • Autoregressive or invertible only.

Latent-based

Hidden variable z generates x: p_θ(x) = ∫ p_θ(x|z) p(z) dz.

  • VAE, GAN (implicit).
  • Compact latent, rich samples.
  • Likelihood tractable only via ELBO.

Diffusion (L21) is a layered latent model with many noise levels. The recipe of "add structure in latent space" starts here with the VAE.

PART 1

The generative model family tree

A brief taxonomy

Four families of generative models

Family             | How it samples                 | Training
VAE (L19)          | sample z ~ p(z), decode        | ELBO
GAN (L20)          | sample z ~ p(z), generator     | minimax
Normalizing flows  | invertible transforms of p(z)  | exact likelihood
Diffusion (L21-22) | iterative denoising from noise | score matching / denoising

Today is VAE. Each family has tradeoffs between sample quality, training stability, and tractability.

PART 2

Autoencoder first

The building block

Autoencoder · the postcard analogy

A perfect forger writes the most compact possible description of a painting on a postcard (the latent code · maybe 16 numbers).

They mail it to their partner. The partner must recreate the original painting using only the postcard.

If the postcard is too small, they're forced to learn what's truly essential · the "essence" of the painting · not every brushstroke. That's compression. That's what an autoencoder learns.

Postcard analogy · in math terms

  • Forger writes the postcard → encoder z = f_φ(x).
  • Postcard = compact description → latent code z ∈ ℝ^d, with d much smaller than the input size.
  • Partner reconstructs the painting → decoder x̂ = g_θ(z).
  • "Goodness" of reconstruction → MSE loss ‖x − x̂‖².

Train encoder + decoder together. The bottleneck forces the network to keep only the most informative features.

Worked numeric · 4-pixel autoencoder

Tiny grayscale image x ∈ ℝ⁴ (4 pixels). Latent dim d = 2.

  1. Encode. The network outputs z = f_φ(x) ∈ ℝ².
  2. Decode. The network maps z back to x̂ = g_θ(z) ∈ ℝ⁴.
  3. Loss · ‖x − x̂‖² (MSE).

Backprop adjusts encoder + decoder weights to push this lower.
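A minimal sketch of this example in PyTorch; the layer sizes follow the 4 → 2 → 4 setup above, and the sample pixel values are illustrative:

import torch
import torch.nn as nn

# Toy 4-pixel autoencoder: 4 -> 2 -> 4
enc = nn.Sequential(nn.Linear(4, 2))            # encoder f_phi
dec = nn.Sequential(nn.Linear(2, 4))            # decoder g_theta
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)

x = torch.tensor([[0.9, 0.1, 0.8, 0.2]])        # one tiny "image" (illustrative values)

z = enc(x)                                      # 1. encode to 2-dim latent
x_hat = dec(z)                                  # 2. decode back to 4 pixels
loss = ((x - x_hat) ** 2).mean()                # 3. MSE reconstruction loss
opt.zero_grad(); loss.backward(); opt.step()    # backprop pushes the loss lower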

Uses:

  • Denoising · dimensionality reduction · pretraining / feature learning.
  • Beats PCA for non-linear data.

Autoencoder vs PCA · what's added

PCA is the linear autoencoder with orthogonal weights. What does nonlinearity buy you?

  • PCA forces the latent space to be a linear subspace. Fine for Gaussian-like data; poor for curved manifolds.
  • An autoencoder (MLP or CNN) can fold arbitrary manifolds — digits on a swiss-roll latent, faces on a curved surface, etc.

Concretely · PCA on MNIST reaches ~85% explained variance with 32 dims; a deep AE matches the full-data reconstruction at ~16 dims. Curved manifold vs linear subspace · nonlinearity buys 2× compression.
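To see the linear baseline yourself, a quick sketch using scikit-learn's small digits dataset (8 × 8 images) rather than full MNIST, so the exact numbers will differ from those quoted above:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Linear baseline: PCA reconstruction error at several latent sizes.
X = load_digits().data / 16.0                   # 1797 x 64, scaled to [0, 1]
for k in (2, 8, 16, 32):
    pca = PCA(n_components=k).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    mse = np.mean((X - X_hat) ** 2)
    print(f"PCA k={k:>2}  reconstruction MSE = {mse:.4f}")
# A nonlinear autoencoder with the same latent size typically reaches a lower MSE,
# because it can bend the latent space around curved manifolds.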

Bottleneck intuition · why it's crucial

If the latent dimension equals the input dimension (d_z = d_in), the network can just copy · z = x, x̂ = z. Loss is zero but nothing is learned.

The bottleneck forces compression · the network must keep only the most informative features. Anything redundant gets dropped. This is why autoencoders produce useful representations even without labels.

Modern variants add noise (denoising AE) or masking (MAE, L17) instead of a small bottleneck — same idea, different forcing.

A concrete AE · MNIST dimensionality

Input · 28 × 28 = 784 pixels. Encode to latent z of size 16. Decode back to 784.

Layer            | Shape           | Params
Input            | 784             | —
Linear → ReLU    | 256             | 200,960
Linear → ReLU    | 64              | 16,448
Linear (μ only)  | 16 (bottleneck) | 1,040
Linear → ReLU    | 64              | 1,088
Linear → ReLU    | 256             | 16,640
Linear → sigmoid | 784             | 201,488

Total · ~440k params. Reconstruction MSE on MNIST test · ~0.003 after 10 epochs. Compare PCA with 16 components · ~0.015. 5× better with nonlinearities.
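The table as a PyTorch module · a sketch, with layer sizes taken from the table and a sigmoid output assumed for pixels in [0, 1]; training code omitted:

import torch.nn as nn

class MNISTAutoencoder(nn.Module):
    """784 -> 256 -> 64 -> 16 -> 64 -> 256 -> 784, roughly 440k params."""
    def __init__(self, d_in=784, d_z=16):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Linear(d_in, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, d_z),                  # bottleneck
        )
        self.dec = nn.Sequential(
            nn.Linear(d_z, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, d_in), nn.Sigmoid(),  # pixels in [0, 1]
        )

    def forward(self, x):
        return self.dec(self.enc(x))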

But autoencoders aren't generative

Suppose you train an AE on MNIST. To generate a new digit, you'd:

  1. Pick a random z in latent space.
  2. Decode it.

What happens? Usually garbage. Why?

The latent space is irregular. The encoder only learned to map actual training images to latent points. Random z values likely fall into "nothing-mapped-here" regions where the decoder is undefined.

You'd need the latent space to be dense and structured — that's what VAE adds.

PART 3

The VAE fix

A prior and a KL penalty

AE vs VAE

▶ Interactive: slide the KL weight β, watch the latent space go from clumpy to Gaussian — vae-latent-explorer.

Why a prior? · two jobs it does

The prior does two things for us:

1. Defines the sampling distribution

At generation time we draw z ~ p(z) = N(0, I) and decode. The prior is the rule book for producing valid z's.

Without a prior, you wouldn't know how to initialize z for generation.

2. Regularizes the posterior

The KL term pulls q_φ(z|x) toward p(z) = N(0, I) for every training example. Every encoded posterior overlaps in the same region → smooth latent.

Without this, training points occupy disjoint clusters.

A VAE is a plain AE with a regularizer that makes the latent space match a known distribution. Everything else follows from making that regularizer principled (the ELBO).

VAE · the encoder outputs a distribution

The encoder no longer outputs a point z. It outputs the parameters of a Gaussian:

    q_φ(z|x) = N(z; μ_φ(x), diag(σ_φ²(x)))

Both μ and σ are network outputs. During training:

  1. Given x, compute μ_φ(x) and σ_φ(x).
  2. Sample z ~ N(μ, σ² I).
  3. Decode · x̂ = g_θ(z).
  4. Loss · reconstruction + KL divergence to prior.

ELBO geometry

The punchline · you can skip the derivation

The VAE loss is just reconstruction + KL-to-prior. Train to minimize it. That's all you need to use a VAE.

The next two slides derive this from first principles (Jensen's inequality). If you trust me, you can skip them · come back to the math later.

ELBO · the hard problem and the easy trick

Hard problem. We want log p_θ(x) = log ∫ p_θ(x|z) p(z) dz. Intractable for high-dim z.

Easy trick (variational inference).

  1. Define a tractable distribution q_φ(z|x) (our encoder).
  2. Derive a lower bound on log p_θ(x) that we can compute: the ELBO.
  3. Maximizing the ELBO pushes log p_θ(x) up — like getting good practice-exam grades.

Deriving the ELBO · step by step

Step 1. Multiply and divide by q_φ(z|x):

    log p_θ(x) = log ∫ p_θ(x, z) dz = log E_{q_φ(z|x)} [ p_θ(x, z) / q_φ(z|x) ]

Step 2 · Jensen's inequality. log is concave → log E[Y] ≥ E[log Y]. (For two values: log((a + b)/2) ≥ (log a + log b)/2.) Move the log inside the expectation:

    log p_θ(x) ≥ E_{q_φ(z|x)} [ log p_θ(x, z) − log q_φ(z|x) ]  =  ELBO

Step 3 · expand. Using p_θ(x, z) = p_θ(x|z) p(z):

    ELBO = E_{q_φ(z|x)} [ log p_θ(x|z) ] + E_{q_φ(z|x)} [ log p(z) − log q_φ(z|x) ]

The second bracket is exactly −KL(q_φ(z|x) ‖ p(z)):

    ELBO = E_{q_φ(z|x)} [ log p_θ(x|z) ] − KL(q_φ(z|x) ‖ p(z))

  • First term · reconstruction (maximize).
  • Second term · KL regularizer (minimize).

The ELBO · reconstruction + KL

Evidence Lower Bound — a tractable lower bound on log p_θ(x):

    ELBO(θ, φ; x) = E_{q_φ(z|x)} [ log p_θ(x|z) ] − KL(q_φ(z|x) ‖ p(z))

  • First term · data likelihood under the decoder, averaged over encoder samples.
  • Second term · how far the posterior q_φ(z|x) is from the prior p(z) = N(0, I).

Maximize the ELBO = minimize the negative. This is the VAE loss. Every term is tractable and backpropagatable.

KL · the "cost of being different"

The KL term is the price for deviating from the standard-normal prior. Two parts:

  1. Mean cost · zero when μ = 0, grows quadratically (μ²/2 per dimension).
  2. Variance cost · zero at σ = 1, positive everywhere else.

For a Gaussian posterior N(μ, σ²) vs the prior N(0, 1), per latent dimension:

    KL(N(μ, σ²) ‖ N(0, 1)) = ½ (μ² + σ² − 1 − log σ²)

Quick check on the variance term:

  • σ² < 1 (too narrow) · −log σ² outweighs σ² − 1 → positive penalty.
  • σ² > 1 (too wide) · σ² − 1 outweighs log σ² → positive penalty.

The VAE pushes μ → 0 and σ → 1 unless the reconstruction term needs different values.

KL · worked numeric example

Suppose for a particular image x the encoder outputs μ = 1.2, σ = 0.5 (illustrative values). Standard normal prior N(0, 1).

Plugging in: ½ (1.2² + 0.5² − 1 − log 0.25) = ½ (1.44 + 0.25 − 1 + 1.39) ≈ 1.04. σ = 0.5 means a narrow posterior (confident encoding); μ = 1.2 means off-centre from the prior. The KL penalty of ~1.0 will pull μ back toward 0 during training, unless the reconstruction term needs a wide-apart μ to distinguish this image from others.

This is the trade-off the VAE balances at every sample.
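A numerical sanity check of the closed form (the μ and σ below are the illustrative values from the example above):

import torch
from torch.distributions import Normal, kl_divergence

mu, sigma = torch.tensor(1.2), torch.tensor(0.5)   # illustrative posterior parameters

# Closed form: 0.5 * (mu^2 + sigma^2 - 1 - log sigma^2)
kl_closed = 0.5 * (mu**2 + sigma**2 - 1 - torch.log(sigma**2))

# Same quantity via torch.distributions
kl_lib = kl_divergence(Normal(mu, sigma), Normal(0.0, 1.0))

print(kl_closed.item(), kl_lib.item())   # both ~1.04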

Posterior collapse · the picture

Posterior collapse · what to watch for

If the decoder is too powerful, the KL term will drive q_φ(z|x) → N(0, I) for every x · every image encodes to the same latent distribution, and z carries no information about x.

Posterior collapse · the VAE becomes an autoencoder where z is just noise. Reconstructions are fine (the decoder ignores z), but samples are junk (there's no latent structure to exploit).

Fixes:

  • Reduce decoder capacity (smaller FFN).
  • Use β-VAE with β < 1 (less KL weight).
  • KL annealing · start with β = 0, ramp up over training.
  • Free bits · allow each latent dimension some KL "for free" before penalizing (sketch of the last two below).
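A minimal sketch of the last two fixes; the warm-up length and free-bits threshold are illustrative hyperparameters, not values from the lecture:

import torch

def kl_weight(step, warmup_steps=10_000):
    """KL annealing: ramp beta linearly from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, free_bits=0.5):
    """Free bits: each dimension's KL is floored at free_bits nats, so there is
    no gradient pushing it below that floor."""
    return torch.clamp(kl_per_dim, min=free_bits).sum(-1)

# Inside the training loop (sketch):
# kl_per_dim = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp())   # shape (B, d_z)
# loss = (recon_loss + kl_weight(step) * free_bits_kl(kl_per_dim)).mean()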

PART 4

Reparameterization in one picture

Reparameterization · gradient flow

Reparameterization · the conveyor-belt analogy

Imagine training a robot arm that randomly picks a part from a bin. You can't train the picking motion · "your random pick was wrong" gives no gradient.

Now change the system · the robot picks a specific part from a conveyor belt. The randomness is in how the belt is loaded, not in the robot's motion.

The belt-loading randomness ≡ the noise ε ~ N(0, I). The robot's picking motion ≡ the deterministic map z = μ + σ · ε. Now we can compute gradients through the picking motion · which is exactly what we needed to train μ and σ.

The reparameterization trick

How to backprop through a sample

Reparameterization trick · making randomness differentiable

Problem. We need gradients of the loss with respect to the encoder outputs μ and σ. But sample_from_gaussian(μ, σ) is a black box — no gradient through it.

Trick (Kingma & Welling 2013). Any sample from can be written:

The randomness now sits outside . The path from back to is just deterministic add + multiply → gradients flow.

Worked numeric (illustrative values). Encoder outputs μ = 0.5, σ = 0.2. Sample ε = 1.3.
z = μ + σ · ε = 0.5 + 0.2 × 1.3 = 0.76.
Suppose the gradient arriving from the decoder is ∂L/∂z = 2.0.

Then:

    ∂L/∂μ = ∂L/∂z · ∂z/∂μ = 2.0 × 1 = 2.0
    ∂L/∂σ = ∂L/∂z · ∂z/∂σ = 2.0 × ε = 2.0 × 1.3 = 2.6
The optimizer can now update the encoder weights (which produced μ and σ) as usual.
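The same computation in PyTorch, letting autograd apply the chain rule (the numbers match the illustrative ones above; a toy loss stands in for the decoder):

import torch

mu = torch.tensor(0.5, requires_grad=True)
sigma = torch.tensor(0.2, requires_grad=True)
eps = torch.tensor(1.3)                 # the externalized randomness

z = mu + sigma * eps                    # deterministic path: z = 0.76
loss = 2.0 * z                          # toy loss with dL/dz = 2.0
loss.backward()

print(mu.grad.item())      # 2.0  (= dL/dz * dz/dmu = 2.0 * 1)
print(sigma.grad.item())   # 2.6  (= dL/dz * dz/dsigma = 2.0 * eps)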

VAE in PyTorch · the whole thing

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, d_in, d_z):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * d_z))
        self.dec = nn.Sequential(nn.Linear(d_z, 256), nn.ReLU(),
                                 nn.Linear(256, d_in))

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = h.chunk(2, dim=-1)

        # Reparam: z = mu + sigma · eps
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        z = mu + std * eps

        recon = self.dec(z)

        # KL against N(0, I)
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1)

        return recon, kl

Entire VAE in 15 lines. The trick is making sure the KL goes in the loss alongside reconstruction.

Training loop

import torch.nn.functional as F

for x in loader:                          # loader yields batches of flattened images
    recon, kl = model(x)
    recon_loss = F.mse_loss(recon, x, reduction='none').sum(-1)   # per-example reconstruction
    loss = (recon_loss + BETA * kl).mean()                        # BETA = KL weight (see below)
    opt.zero_grad(); loss.backward(); opt.step()

Tuning BETA:

  • BETA = 1 · standard VAE (follows the ELBO derivation).
  • BETA > 1 · β-VAE (Higgins 2017). Stronger regularization → more disentangled, often blurrier.
  • BETA < 1 · more weight to reconstruction, less to latent structure.

β · the seesaw between recon and KL

Disentanglement · what β-VAE buys you

With β > 1 on a faces dataset (Higgins 2017), each latent dimension starts to control ONE semantic factor:

  • z₁ · azimuth (face angle)
  • z₂ · lighting direction
  • z₃ · smile / frown
  • z₄ · hairstyle length
  • ...

No supervision — the structure emerges from the KL regularization plus the reconstruction pressure. Disentanglement lets you do editable generation · "same face with different smile" by perturbing one z coordinate.

The trade-off · stronger KL forces shared structure, but loses reconstruction detail. β = 1 is the theoretical sweet spot; higher β sacrifices quality for interpretability.
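Editable generation as a latent traversal, sketched with the enc/dec structure from the PyTorch code earlier; the dimension index and sweep range are arbitrary choices, and x stands for a batch of input images:

import torch

# "Same face, different smile": sweep one latent coordinate, hold the rest fixed.
with torch.no_grad():
    z = model.enc(x).chunk(2, dim=-1)[0]          # start from the mean encoding of x
    dim = 3                                       # hypothetical "smile" dimension
    for v in torch.linspace(-3, 3, 7):
        z_edit = z.clone()
        z_edit[:, dim] = v                        # edit only that coordinate
        img = model.dec(z_edit)                   # decode the edited latent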

Conditional VAE · putting labels into the game

If you have class labels , a Conditional VAE extends the game:

  • Encoder: q_φ(z | x, y)
  • Decoder: p_θ(x | z, y)
  • Prior: p(z) = N(0, I), unchanged.

At inference · sample z ~ N(0, I), fix y to the desired class, decode. Generate class-specific samples without retraining.

CVAE was used for controllable generation before diffusion + CFG took over. Still shipped in some specialized systems (molecule generation, time-series imputation).
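A sketch of the conditional variant: concatenate a one-hot label to both encoder and decoder inputs. Layer sizes mirror the unconditional VAE above; this is an illustrative implementation, not the exact one from the CVAE literature:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    def __init__(self, d_in, d_z, n_classes):
        super().__init__()
        self.n_classes = n_classes
        self.enc = nn.Sequential(nn.Linear(d_in + n_classes, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * d_z))
        self.dec = nn.Sequential(nn.Linear(d_z + n_classes, 256), nn.ReLU(),
                                 nn.Linear(256, d_in))

    def forward(self, x, y):
        y_onehot = F.one_hot(y, self.n_classes).float()
        h = self.enc(torch.cat([x, y_onehot], dim=-1))      # q(z | x, y)
        mu, log_var = h.chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        recon = self.dec(torch.cat([z, y_onehot], dim=-1))  # p(x | z, y)
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1)
        return recon, kl

# Inference: z ~ N(0, I), fix y to the desired class, decode:
# z = torch.randn(16, d_z); y = torch.full((16,), 7)
# samples = model.dec(torch.cat([z, F.one_hot(y, 10).float()], dim=-1))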

PART 5

Generating with a VAE

Worked example · one VAE forward pass

Suppose d_in = 4 (4-pixel input), d_z = 2 (2D latent). Input · a single image x ∈ ℝ⁴.

Step 1 · encoder. Suppose it outputs μ = (μ₁, μ₂), σ = (σ₁, σ₂).

Step 2 · sample. Draw ε ~ N(0, I) and set z = μ + σ ⊙ ε (one draw from q(z|x)).

Step 3 · decode. Suppose the decoder outputs x̂ = g_θ(z) ∈ ℝ⁴.

Step 4 · loss.

  • Reconstruction MSE · ‖x − x̂‖² = 0.005
  • KL · ½ Σⱼ (μⱼ² + σⱼ² − 1 − log σⱼ²)
    • z₁ term: μ₁² + σ₁² − 1 − log σ₁²
    • z₂ term: μ₂² + σ₂² − 1 − log σ₂²
    • sum / 2 = 1.02
  • Total loss = 0.005 + 1.02 = 1.025

Backprop through this. Update params. Repeat.
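The same four steps, run through the VAE class from earlier on a hypothetical 4-pixel input; with freshly initialized weights the printed numbers will differ from the worked values above:

import torch
import torch.nn.functional as F

model = VAE(d_in=4, d_z=2)                      # class defined earlier
x = torch.tensor([[0.9, 0.1, 0.8, 0.2]])        # hypothetical 4-pixel input

recon, kl = model(x)                            # steps 1-3: encode, sample, decode
recon_loss = F.mse_loss(recon, x, reduction='none').sum(-1)   # step 4a
loss = (recon_loss + kl).mean()                 # step 4b: total loss (beta = 1)
loss.backward()                                 # backprop; optimizer step would follow

print(recon_loss.item(), kl.item(), loss.item())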

Latent-space interpolation · in pictures

Sampling + interpolation

# Generate new examples
with torch.no_grad():
    z = torch.randn(16, d_z)              # sample from N(0, I)
    samples = model.dec(z)                # decode into image space

# Interpolate between two inputs in latent space
z_a = model.enc(x_a).chunk(2, dim=-1)[0]   # mu for image A
z_b = model.enc(x_b).chunk(2, dim=-1)[0]   # mu for image B
for alpha in torch.linspace(0, 1, 10):
    z = (1 - alpha) * z_a + alpha * z_b
    morph = model.dec(z)                  # smooth transition

The interpolation is the magic · it produces valid intermediate images because the latent space is smooth.

Sampling gotchas

Truncated sampling. Sampling z ~ N(0, I) occasionally gives large-‖z‖ points where training coverage was sparse. Truncate · sample from N(0, I) and resample (or reject) whenever a coordinate exceeds a threshold (e.g., |z_i| > 2). Samples look cleaner (sketch below).

Decoder stochasticity. If the decoder outputs a Gaussian p_θ(x|z) = N(μ_x(z), σ_x²), add σ_x · ε to the mean to draw a single sample. If you only decode means you get the "mode"; adding variance makes samples diverse.

VAE blur. The KL pulls posteriors toward a simple prior · posteriors overlap significantly. The decoder therefore averages over the possible x that map to a given z → samples are blurry means. This is the fundamental VAE limitation diffusion (L21) fixes.
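Truncated sampling as a sketch (assumes the model and d_z from earlier; the threshold of 2 and per-coordinate resampling are illustrative choices):

import torch

def truncated_normal(shape, threshold=2.0):
    """Sample from N(0, I), resampling any coordinate with |z_i| > threshold."""
    z = torch.randn(shape)
    mask = z.abs() > threshold
    while mask.any():
        z[mask] = torch.randn_like(z[mask])
        mask = z.abs() > threshold
    return z

with torch.no_grad():
    z = truncated_normal((16, d_z))
    samples = model.dec(z)          # cleaner-looking samples than unrestricted z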

VAE vs GAN vs Diffusion · quality ranking

Model     | Sample quality | Training stability | Likelihood | Sampling speed
VAE       | ✗ often blurry | ✓ stable           | ✓ ELBO     | ✓✓ one pass
GAN       | ✓✓ sharp       | ✗ brittle          | ✗ no       | ✓✓ one pass
Diffusion | ✓✓✓ SOTA       | ✓ stable           | ✗          | many passes

VAEs remain useful for latent-space exploration and pre-compression — Stable Diffusion uses a VAE to compress images into a much smaller latent space (e.g., 512 × 512 × 3 pixels down to a 64 × 64 × 4 latent) before running diffusion there.

Common questions · FAQ

Q. Why is VAE blurrier than GAN?
A. VAE's loss is MSE on pixels, which is the mean of possible reconstructions. When multiple outputs are possible (e.g., any detailed face), the mean is a smoothed average of those — blurry. GANs don't average; they commit.

Q. Can I use a perceptual loss (feature-space MSE) instead of pixel MSE?
A. Yes — produces sharper reconstructions. VQ-VAE (Van den Oord 2017) combines VAE-like structure with discrete latents and perceptual losses. Stable Diffusion's VAE uses this trick.

Q. Is the posterior truly Gaussian?
A. No — the true posterior is arbitrary. The Gaussian parameterization is an approximation (the "amortized variational" part). Normalizing flow encoders and hierarchical VAEs address this; vanilla VAE trades approximation quality for simplicity.

Lecture 19 — summary

  • Autoencoder · encode → bottleneck → decode. Great for compression; not generative.
  • VAE · encoder outputs a distribution (μ, σ); sample; decode; ELBO loss.
  • ELBO · reconstruction term + KL divergence to prior.
  • KL term · closed form for Gaussian; pulls q(z|x) toward N(0, I).
  • Reparameterization trick · z = μ + σ·ε; differentiable.
  • β-VAE · tune the KL weight for disentanglement.
  • 2026 role · pre-compressor in latent diffusion models (next week).

Read before Lecture 20

Prince Ch 15 · GANs.

Next lecture

GANs — minimax training, DCGAN, mode collapse, non-saturating loss.

Notebook 19 · 19-vae-mnist.ipynb — build and train a VAE on MNIST; visualize 2D latent; interpolate digits; sample from N(0, I).