Interactive Explainer
The β knob in a VAE trades reconstruction for latent structure. At β=0 you have a vanilla autoencoder: great reconstruction, chaotic latent space. At β=10 the KL term dominates: the latent hugs the unit-Gaussian prior, but samples get blurry and, in the extreme, the model stops using the latent at all. Drag the slider and see it.
A VAE's loss has two terms: reconstruction (how well can the decoder recover x from z?) and a KL divergence pulling the approximate posterior q(z|x) toward the standard-normal prior. The β-VAE (Higgins et al., 2017) simply reweights the KL term: L = recon + β · KL. Slide β to see what that weight does.
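A minimal sketch of that loss in code, assuming PyTorch (the explainer doesn't name a framework) and the common convention of having the encoder output log_var for the variance; the function name beta_vae_loss is illustrative, not part of the demo.

import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, log_var, beta=1.0):
    # Reconstruction term: how well does the decoder reproduce x?
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal-Gaussian posterior.
    kl = -0.5 * torch.sum(1.0 + log_var - mu.pow(2) - log_var.exp())
    # beta = 0 recovers a plain autoencoder; large beta squeezes q(z|x) toward the prior.
    return recon + beta * kl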
What's happening: at β=0 the latent points cluster in a few tight clumps, one per class. At β=1 the clumps relax into blobs of width roughly 1 that together fill a unit-scale disc. At β=10 everything collapses to the origin (posterior collapse): the model stops using the latent.
z = μ(x) + σ(x) ⊙ ε,   ε ~ N(0, I)
The encoder outputs μ and σ for each input. We sample ε from a fixed noise distribution and compute z deterministically from it. All the randomness lives in ε, which has no parameters, so gradients flow cleanly through μ and σ. This is the reparameterization trick that makes VAE training work.
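A sketch of that sampling step under the same assumptions as above (PyTorch, σ parameterized via log_var):

import torch

def reparameterize(mu, log_var):
    sigma = torch.exp(0.5 * log_var)   # σ(x), kept positive by construction
    eps = torch.randn_like(sigma)      # ε ~ N(0, I): fixed noise, no parameters
    return mu + sigma * eps            # z = μ(x) + σ(x) ⊙ ε

Because ε is drawn outside the model's parameters, backpropagation sees z as a deterministic, differentiable function of μ and σ.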
log p(x) ≥ E_{q(z|x)}[log p(x|z)] − KL(q(z|x) ∥ p(z))
The ELBO is a lower bound on the marginal log-likelihood: log p(x) equals the ELBO plus KL(q(z|x) ∥ p(z|x)), and that extra KL is never negative. We can't compute log p(x) directly (the marginal is intractable), but we CAN compute and maximize its lower bound. That's what VAE training does.
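To make the link to the loss above concrete, here is a one-sample Monte Carlo estimate of the ELBO (a sketch; PyTorch, the decoder call, and treating the reconstruction error as log p(x|z) up to an additive constant are all assumptions, not part of the demo). Maximizing this is the same as minimizing recon + KL with β = 1.

import torch
import torch.nn.functional as F

def elbo_estimate(x, decoder, mu, log_var):
    # Draw one z from q(z|x) via the reparameterization above.
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    # log p(x|z) for a fixed-variance Gaussian decoder, up to an additive constant.
    log_px_given_z = -F.mse_loss(decoder(z), x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal-Gaussian posterior.
    kl = -0.5 * torch.sum(1.0 + log_var - mu.pow(2) - log_var.exp())
    return log_px_given_z - kl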
Part of the ES 667 Deep Learning course · IIT Gandhinagar · Aug 2026.