Loss · mean squared error
Train · two equivalent options · the closed-form normal equation, or gradient descent.
Open question · you probably justified MSE as "penalize big errors more than small ones." True — but why squared and not absolute or cubed? We'll see today.
Same setup as linear regression, but now the target is binary · $y \in \{0, 1\}$.
The sigmoid maps any real number to $(0, 1)$ · $\sigma(z) = \frac{1}{1 + e^{-z}}$.
The model output $\hat{y} = \sigma(\theta^\top x)$ is read as $p(y = 1 \mid x)$.
Loss · binary cross-entropy (BCE)
Train · gradient descent (no closed form because of the sigmoid).
Open question · you justified BCE as "big penalty when confidently wrong." True again — but why this exact form, and not some other function that also punishes confident mistakes?
When the model overfits, you added a penalty term:
L2 (ridge) · add $\lambda \sum_j \theta_j^2$ to the loss.
L1 (lasso) · add $\lambda \sum_j \lvert\theta_j\rvert$ to the loss.
You probably learned · L2 shrinks all weights toward zero, while L1 drives many weights to exactly zero.
But why does L1 hit zero and L2 doesn't? Why is the penalty squared in L2 and absolute in L1? We'll derive both from first principles.
| Object | Recipe | Where does it come from? |
|---|---|---|
| MSE | $\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$ | ? |
| BCE | $-\frac{1}{N}\sum_i \left[ y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \right]$ | ? |
| L2 | $\lambda \sum_j \theta_j^2$ | ? |
| L1 | $\lambda \sum_j \lvert \theta_j \rvert$ | ? |
You've used all four — but none was ever derived from anything. They were handed to you with the magic words "this is the loss for regression".
Today we replace the magic with a single principle.
All four mysteries are derived consequences of two ideas · (1) the model outputs a probability distribution over the target, and (2) training maximizes the likelihood of the observed data — with priors on the parameters supplying the regularizers.
That's the whole lecture in two bullets. Everything else is unpacking.
The same machinery gives us KL divergence — the natural distance between distributions.
KL becomes the central object in · VAEs, diffusion, RLHF/DPO, and distillation — most of the second half of the course.
One framework today, ten lectures of dividends.
The model doesn't predict a number — it predicts a distribution
A random variable is a quantity whose value is uncertain; its distribution assigns a probability to each possible value.
Read · "$X \sim p$" as "$X$ is distributed according to $p$".
Two flavours, depending on the type of value · discrete (probability mass function) and continuous (probability density function).
We'll use both. The notation is mostly the same.
Most distributions have parameters — knobs that shape the distribution. We collect them into a single symbol $\theta$.
Read · "$p(x \mid \theta)$" as "the probability of $x$ given the parameters $\theta$".
The vertical bar "$\mid$" separates the random quantity from the parameters we condition on.
The parameter symbol $\theta$ stands for whatever knobs the distribution has ·
Coin · $\theta = p$, the probability of heads.
Normal · $\theta = (\mu, \sigma^2)$, the mean and variance.
Categorical · $\theta = (\pi_1, \dots, \pi_K)$, one probability per class.
In ML, $\theta$ is the full set of model weights — from a handful for linear regression to billions for an LLM.
A dataset $\mathcal{D} = \{x_1, \dots, x_N\}$ is assumed to consist of draws that are independent of each other and all come from the same distribution (IID).
These two assumptions together give us the product factorization · $p(\mathcal{D} \mid \theta) = \prod_{i=1}^N p(x_i \mid \theta)$.
This product is what becomes a sum after taking logs — and what becomes the summed loss over a dataset in every training loop. IID is the formal license to add up per-example losses.
When IID fails (time series, video frames, sensor logs from one device) we need different math · autoregressive models, state-space models, etc. For this course, treat batches as IID.
Outcome · $x \in \{0, 1\}$, with a single parameter $p \in [0, 1]$.
Probability mass function · $P(X = 1) = p$, $P(X = 0) = 1 - p$.
Two outcomes, two probabilities, summing to 1. This is the simplest non-trivial distribution.
Examples · email is spam (Y=1) or not (Y=0) · patient has disease or not · pixel is foreground or background.
We can fold the two cases of the PMF into a single expression · $p(x) = p^x (1 - p)^{1 - x}$.
Sanity check · plug in $x = 1$ to get $p$; plug in $x = 0$ to get $1 - p$.
This compact form is what makes the per-example log-likelihood $x \log p + (1 - x)\log(1 - p)$ so easy to write down — the germ of BCE.
Mean · $\mathbb{E}[X] = p$.
Variance · $\mathrm{Var}(X) = p(1 - p)$.
Variance is largest at $p = 0.5$ (maximal uncertainty) and zero at $p = 0$ or $p = 1$.
This will be reused when we derive logistic regression's gradient — it has a $\sigma(z)(1 - \sigma(z))$ term (the derivative of the sigmoid), the same $p(1 - p)$ shape.
Setup · coin with unknown bias $p$, flipped $N$ times.
Under the IID assumption · $p(x_1, \dots, x_N \mid p) = \prod_{i=1}^N p^{x_i}(1 - p)^{1 - x_i}$.
The product over independent observations is the heart of likelihood — coming up in Part 2 when we ask "which $p$ makes this data most probable?"
We now have one concrete distribution (Bernoulli). A clean way to draw a probabilistic model is plate notation — the standard for the rest of this course.
| Symbol | Meaning |
|---|---|
| ○ | a random variable (uncertain) |
| ● (filled) | an observed random variable (we see its value) |
| arrow | direct dependence — the child's distribution is conditioned on the parent |
| rectangle (plate) labelled $N$ | the contents are repeated $N$ times |
These four symbols compose every probabilistic model in this course. Bayesian networks, HMMs, VAEs, diffusion models — all drawn with these conventions.
Apply the conventions to the simplest model · the coin.
A single Bernoulli observation · the parameter $p$ feeds into an observed (shaded) node $x$.
For $N$ flips, put the observed node inside a plate labelled $N$.
The plate says "draw a fresh $x_n$ for each $n = 1, \dots, N$, all sharing the same $p$."
Outcome · $x \in \{1, \dots, K\}$, with parameters $\pi_1, \dots, \pi_K$.
Probability mass function · $P(X = k) = \pi_k$, with $\pi_k \ge 0$ and $\sum_k \pi_k = 1$.
One-hot compact form · let $\mathbf{x}$ be the one-hot vector for the outcome; then $p(\mathbf{x}) = \prod_{k=1}^K \pi_k^{x_k}$.
The product collapses · only one $x_k$ equals 1, so $p(\mathbf{x}) = \pi_{\text{true class}}$.
Mean · $\mathbb{E}[\mathbf{x}] = (\pi_1, \dots, \pi_K)$.
MNIST classifier outputs a 10-dimensional probability vector — a Categorical over the digits 0–9.
The model says
If the true label is
A perfect model would put all mass on class 2 (i.e. $\pi_2 = 1$).
The softmax output of any classifier IS a Categorical distribution. Treat it that way and the loss falls out automatically.
Continuous · $x \in \mathbb{R}$, with parameters $\mu$ (mean) and $\sigma^2$ (variance).
Probability density function (PDF) · $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$.
Mean & variance · $\mathbb{E}[X] = \mu$, $\mathrm{Var}(X) = \sigma^2$.
The most important continuous distribution in all of statistics — and the seed of the MSE loss.
Property 2 is what we'll lean on most. Let's unpack it.
Property 2 says density falls off exponentially in the squared distance from the mean, $(x - \mu)^2 / \sigma^2$.
| Distance from mean | Exponent $-\frac{d^2}{2\sigma^2}$ | Density factor (relative to peak) |
|---|---|---|
| $1\sigma$ | $-0.5$ | $\approx 0.61$ |
| $2\sigma$ | $-2$ | $\approx 0.14$ |
| $3\sigma$ | $-4.5$ | $\approx 0.011$ |
This squared, exponential decay is what makes Gaussians "tightly concentrated" — almost all the mass sits within a few
Empirical rule · 68% within $1\sigma$, 95% within $2\sigma$, 99.7% within $3\sigma$ of the mean.
House prices modelled as
| Sample value | Density | Distance from mean |
|---|---|---|
A house priced at the mean (
This squared-distance penalty sitting inside the exponent is the seed of the MSE loss.
For
If
Three deep reasons — each comes back later in the course.
These three properties together explain why the Gaussian dominates classical statistics, signal processing, diffusion models, Kalman filters, and Bayesian neural nets.
Statement (informal) · let $X_1, \dots, X_N$ be IID random variables with mean $\mu$ and finite variance $\sigma^2$.
Then as $N \to \infty$, the standardized sum $\frac{1}{\sqrt{N}} \sum_{i=1}^N (X_i - \mu)$ converges in distribution to $\mathcal{N}(0, \sigma^2)$ — regardless of the distribution of the $X_i$.
Implication · any quantity arising as the aggregate of many small effects looks Gaussian. Sensor noise, human height, daily temperature deviation — all approximately Normal because the underlying causes are sums of many small contributions.
A clean way to see CLT in action · add up uniforms.
Sum of 12 IID $\mathrm{Uniform}(0,1)$ variables has mean $6$ and variance $12 \times \frac{1}{12} = 1$.
Historically used to generate Gaussian samples before better algorithms (Box–Muller) existed · subtract 6, and you get an approximate draw from $\mathcal{N}(0, 1)$.
Sum of more and more independent terms looks more and more Gaussian — see the sketch below.
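A minimal NumPy sketch of the 12-uniform trick (sample size and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Each Uniform(0,1) has mean 1/2 and variance 1/12, so the sum of 12 of them
# has mean 6 and variance 1. Subtracting 6 gives an approximate N(0, 1) draw.
samples = rng.uniform(0.0, 1.0, size=(100_000, 12)).sum(axis=1) - 6.0

print(samples.mean())   # close to 0
print(samples.std())    # close to 1
# Compare the histogram against rng.standard_normal(100_000) to see the match.
```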
Entropy quantifies "how spread-out / uncertain" a distribution is. Among all distributions on the real line with a given mean $\mu$ and variance $\sigma^2$, the Normal has the maximum entropy.
Reading · "if all you know about a quantity is its first two moments, the least committal probability model is Gaussian." This is Occam's razor for distributions — don't bake in assumptions you can't justify.
The same max-entropy principle picks out other familiar distributions when you change the constraints ·
| Support | Constraint | Max-entropy distribution |
|---|---|---|
| $\{0, 1\}$ | given mean | Bernoulli |
| Bounded interval | none beyond support | Uniform |
| $[0, \infty)$ | given mean | Exponential |
| $\mathbb{R}$ | given mean and variance | Normal |
Each "default" distribution in classical statistics is the least committal choice given some basic constraint. This is why these distributions show up so much — they are what you get when you assume nothing extra.
Two structural facts make Gaussians uniquely well-behaved under linear maps ·
Sum of independent Gaussians is Gaussian. If $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$ are independent, then $X + Y \sim \mathcal{N}(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)$.
Affine transform of a Gaussian is Gaussian. If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $aX + b \sim \mathcal{N}(a\mu + b,\ a^2\sigma^2)$.
No other common distribution behaves this nicely. Sum of two Bernoullis isn't Bernoulli; sum of two uniforms isn't uniform. The Gaussian is the fixed point of summing.
This is why Gaussians compound cleanly under repeated additive operations.
A second, equally powerful property ·
Conjugacy. A Gaussian prior combined with a Gaussian likelihood (known noise variance) yields a Gaussian posterior, with a closed-form mean and variance.
No integral, no MCMC. Just two formulas.
This is why classical Bayesian regression with known noise variance is trivial — and why we'll have to work much harder for non-Gaussian posteriors (variational inference in L19).
Two consequences you'll see in the next few weeks ·
Diffusion (L21) · the forward process adds Gaussian noise at every step. The closed-form jump $q(x_t \mid x_0)$ exists exactly because sums of independent Gaussians are Gaussian — no integral needed.
Kalman filter · linear-Gaussian state-space models have a closed-form posterior at every time step. Used in robotics, control, and signal processing — and a stepping-stone to L19's variational inference.
Reparameterization trick (next!) · sampling from a non-standard Normal is just an affine transform of a standard one · $x = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$.
This is the single most important sampling trick in deep learning · it lets gradients flow through a sampling step.
All three rely on Gaussians being closed under affine maps. Each lecture above is a direct consequence of the two structural properties we just stated.
| Name | Notation | Type | Used for |
|---|---|---|---|
| Bernoulli | $\mathrm{Bern}(p)$ | discrete, binary | binary classification |
| Categorical | $\mathrm{Cat}(\pi_1, \dots, \pi_K)$ | discrete, $K$ outcomes | multiclass classification |
| Normal | $\mathcal{N}(\mu, \sigma^2)$ | continuous | regression, Gaussian noise |
| Laplace | $\mathrm{Laplace}(\mu, b)$ | continuous, heavy-tail | L1 regularizer prior |
| Beta | $\mathrm{Beta}(\alpha, \beta)$ | continuous on $[0, 1]$ | prior over a probability |
| Multinomial | $\mathrm{Mult}(n, \pi)$ | discrete | counts in $n$ trials |
You'll need the first three today. Laplace comes back when we derive L1; Beta when we add a prior to the coin.
In supervised learning the model does not output a number. It outputs the parameters of a distribution over $y$ given $x$ · $p(y \mid x, \theta)$.
| Task | Model output | Conditional distribution |
|---|---|---|
| Linear regression | $\hat{y} = \theta^\top x$ | $y \mid x \sim \mathcal{N}(\theta^\top x,\ \sigma^2)$ |
| Logistic regression | $\hat{y} = \sigma(\theta^\top x)$ | $y \mid x \sim \mathrm{Bern}(\sigma(\theta^\top x))$ |
Training asks · under these conditional distributions, how likely are the labels we actually saw? Maximize that — the rest follows.
Every supervised learning setup we'll see in this course shares the same plate diagram ·
Whether you are doing logistic regression, an MLP, a Transformer, or a diffusion model — the outermost graphical model is always this. Only the conditional distribution $p(y \mid x, \theta)$ changes.
How we draw from distributions — and why every generative model needs it
Sampling appears all over deep learning, in two distinct roles ·
| Role | What it means | Examples |
|---|---|---|
| Generation | Produce new instances from a learned distribution | LLM next-token, VAE images, diffusion, GAN |
| Monte Carlo estimation | Approximate an expectation we can't compute analytically | mini-batch SGD, REINFORCE, dropout averaging, evaluating ELBOs |
These two uses are technically the same operation — draw $x \sim p$ — but they serve different purposes.
For any 1-D distribution with CDF $F$ · if $U \sim \mathrm{Uniform}(0, 1)$, then $X = F^{-1}(U)$ has CDF $F$.
Why it works · $P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x)$.
To sample $X \sim \mathrm{Bern}(p)$ · draw $u \sim \mathrm{Uniform}(0, 1)$ and threshold.
The CDF jumps from $0$ to $1 - p$ at $x = 0$, and from $1 - p$ to $1$ at $x = 1$.
Inverting ·
return 0 if u < 1 - p else 1
One uniform draw + one comparison · this is what torch.bernoulli does internally. Same algorithm extends to any 1-D distribution as long as you can compute the CDF.
Categorical with probabilities $\pi_1, \dots, \pi_K$.
Build the cumulative sums $c_k = \pi_1 + \dots + \pi_k$, draw $u \sim \mathrm{Uniform}(0, 1)$, and return the smallest $k$ with $u \le c_k$.
Worked · draw one $u$, scan the cumulative sums left to right, and stop at the first bin whose cumulative probability reaches $u$ — that bin is the sample.
This is what torch.multinomial does · one uniform + one cumulative scan.
Every LLM samples its next token with exactly this algorithm.
The model produces a vector of logits over the vocabulary; softmax turns them into Categorical probabilities; emitting the next token is one inverse-CDF draw.
You will see this primitive in · LLM decoding, action sampling in RL, and every discrete generative model in this course.
Knowing it once means knowing it everywhere.
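A minimal NumPy sketch of the inverse-CDF primitive for a Categorical (the probabilities are illustrative; torch.multinomial is the library equivalent):

```python
import numpy as np

def sample_categorical(pi, rng):
    """Inverse-CDF sampling: one uniform draw + one cumulative scan."""
    u = rng.uniform()
    cdf = np.cumsum(pi)                  # c_k = pi_1 + ... + pi_k
    cdf[-1] = 1.0                        # guard against floating-point round-off
    return int(np.searchsorted(cdf, u))  # smallest k with u <= c_k

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])
draws = [sample_categorical(pi, rng) for _ in range(10_000)]
print(np.bincount(draws) / len(draws))   # empirical frequencies ~ [0.2, 0.5, 0.3]
```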
To sample $x \sim \mathcal{N}(\mu, \sigma^2)$ · draw $\epsilon \sim \mathcal{N}(0, 1)$ (torch.randn(...)) and set $x = \mu + \sigma\epsilon$.
Why this works · the affine transform of a Gaussian is a Gaussian (the "closed under linear ops" property from earlier). If $\epsilon \sim \mathcal{N}(0, 1)$, then $\mu + \sigma\epsilon \sim \mathcal{N}(\mu, \sigma^2)$.
So we only ever need a routine to draw from $\mathcal{N}(0, 1)$ — everything else is a shift and a scale.
The same affine trick is one of the most important ideas in modern DL.
Let the model output $\mu_\theta(x)$ and $\sigma_\theta(x)$, and sample $z = \mu_\theta(x) + \sigma_\theta(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
The randomness $\epsilon$ lives outside the parameters, so gradients flow through $\mu_\theta$ and $\sigma_\theta$ even though $z$ is a sample.
This is what makes VAEs trainable (L19) and powers the entire diffusion stack (L21–L22). You'll see this exact form repeatedly — and you now know where it comes from.
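A minimal PyTorch sketch of the reparameterized sample (the parameter values and the downstream loss are placeholders):

```python
import torch

mu = torch.tensor([0.5], requires_grad=True)
log_sigma = torch.tensor([-1.0], requires_grad=True)

# Reparameterized sample: the randomness lives in eps, not in the parameters,
# so gradients flow back into mu and log_sigma.
eps = torch.randn(1)
z = mu + torch.exp(log_sigma) * eps     # z ~ N(mu, sigma^2)

loss = (z ** 2).sum()                   # any downstream loss would do
loss.backward()
print(mu.grad, log_sigma.grad)          # well-defined gradients through the sample
```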
For an integral / sum we can't compute analytically · $\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{S}\sum_{s=1}^{S} f(x^{(s)})$, with $x^{(s)} \sim p$.
The Monte Carlo trick · trade an integral we can't compute for a sample mean we can. Always works as long as we can sample from
The estimate is unbiased, and its standard error shrinks like $1/\sqrt{S}$ as the number of samples grows.
Almost every "loss" you'll write is secretly an expectation being approximated by a single sample ·
| Where | The expectation | Estimated by |
|---|---|---|
| Mini-batch SGD | $\mathbb{E}_{(x, y) \sim \hat{p}_{\text{data}}}[\ell_\theta(x, y)]$ | one batch of size $B$ |
| VAE ELBO (L19) | $\mathbb{E}_{z \sim q_\phi(z \mid x)}[\,\cdot\,]$ | usually one $z$ sample |
| Diffusion loss (L21) | $\mathbb{E}_{t,\,\epsilon}[\,\cdot\,]$ | one random $t$ and one noise draw per image |
| REINFORCE / RLHF (L16) | $\mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ | a few sampled trajectories |
Each loss above looks deterministic in code — loss = ... returns a number. But probabilistically, it's a sample-mean estimate of a deeper expectation. Variance reduction (control variates, importance sampling) is a research area for exactly this reason.
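A minimal sketch of a Monte Carlo estimate (the target expectation, $\mathbb{E}[x^2]$ for $x \sim \mathcal{N}(0,1)$, is chosen only because its true value, 1, is known):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(num_samples):
    """Estimate E[x^2] for x ~ N(0, 1) by a sample mean (true value: 1)."""
    x = rng.standard_normal(num_samples)
    return np.mean(x ** 2)

for s in (1, 10, 1_000, 100_000):
    print(s, mc_estimate(s))   # noisy for small s, converges toward 1.0
```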
You now know the primitive (sampling from a Categorical). LLM generation is just Categorical sampling at every step — but with a few tweaks to control diversity ·
| Strategy | What it does | When to use |
|---|---|---|
| Greedy ($\arg\max$) | Pick the most likely token | Deterministic; safe but boring |
| Temperature $T$ | Sample from $\mathrm{softmax}(\text{logits}/T)$ | $T < 1$ sharpens (more deterministic), $T > 1$ flattens (more diverse) |
| Top-$k$ | Keep the $k$ most likely tokens, renormalize, sample | Caps diversity at the top |
| Top-$p$ (nucleus) | Keep the smallest set with cumulative prob $\ge p$, renormalize, sample | Adapts to the model's confidence |
L14 covers the full story. The point today · the underlying operation is sampling from a Categorical — exactly the inverse-CDF primitive from two slides ago.
The probability of the data, viewed as a function of the parameters, is the likelihood · $L(\theta) = p(\mathcal{D} \mid \theta)$.
You're handed a coin with unknown bias $p$ and flip it 10 times ·
Six heads, four tails.
Question · what value of $p$ best explains what you saw?
Intuition says $p = 0.6$ — the empirical frequency. Let's see why the likelihood agrees.
The likelihood of a parameter value is the probability of the observed data under that value, read as a function of the parameter · $L(p) = p(\mathcal{D} \mid p)$.
It is not a probability over $p$ — it need not integrate to 1 as $p$ varies.
For the coin, IID assumption gives · $L(p) = \prod_{i=1}^{10} p^{x_i}(1 - p)^{1 - x_i}$.
For our data · $L(p) = p^6 (1 - p)^4$.
Maximum at $p = 0.6$ — exactly the intuitive answer. The sketch below checks it numerically.
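A quick numerical check of the coin likelihood (grid resolution is arbitrary):

```python
import numpy as np

# Likelihood of the 6-heads / 4-tails dataset as a function of p.
p = np.linspace(1e-6, 1 - 1e-6, 10_001)
likelihood = p**6 * (1 - p)**4

print(p[np.argmax(likelihood)])   # ~0.6, the MLE
print(likelihood.max())           # ~1.19e-3 — already a small number for N = 10
```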
The raw likelihood is a product of $N$ numbers, each less than 1.
Problem 1 · numerical underflow. With $N$ in the thousands, the product quickly drops below the smallest representable double-precision float ($\approx 10^{-308}$).
Problem 2 · hard to differentiate. The product rule applied to $N$ factors is a combinatorial mess.
We need the same answer in a form that doesn't underflow and is easy to differentiate.
The logarithm turns products into sums and is monotonic — so maxima are preserved.
Now the dataset's "score" is a sum of per-example log-probabilities · $\log L(\theta) = \sum_{i=1}^N \log p(x_i \mid \theta)$.
We minimize the negative log-likelihood (NLL) so "loss" is something we drive down with gradient descent · $\mathrm{NLL}(\theta) = -\sum_{i=1}^N \log p(x_i \mid \theta)$.
We will always work with log-likelihood from this point onward.
Every loss in this course is an NLL. MSE, BCE, cross-entropy, ELBO, diffusion loss — all are just NLLs of carefully chosen distributions.
Inverting conditional probabilities — the foundation of MAP
For two events $A$ and $B$ · conditional probability is $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$.
Read · "given that $B$ happened, how probable is $A$?"
Cross-multiply and you get the product rule · $P(A \cap B) = P(A \mid B)\,P(B)$.
From the two ways to factor $P(A \cap B)$ · $P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$.
Divide by $P(B)$ · $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$ — Bayes' rule.
This flips the conditional · if you know $P(B \mid A)$ (plus the marginals), you can compute $P(A \mid B)$.
A disease has prevalence 1%. A test has sensitivity 95% and specificity 95%.
You test positive. How likely are you to have the disease?
Let $D$ = "has disease" and $+$ = "tests positive" · $P(D) = 0.01$, $P(+ \mid D) = 0.95$, $P(- \mid \neg D) = 0.95$.
Stop and guess before the next slide.
Evidence (total probability of testing positive) · $P(+) = P(+ \mid D)P(D) + P(+ \mid \neg D)P(\neg D) = 0.95 \times 0.01 + 0.05 \times 0.99 = 0.059$.
Posterior · $P(D \mid +) = \frac{0.95 \times 0.01}{0.059} \approx 0.16$.
Despite a "95% accurate" test, a positive result gives only 16% chance of disease.
This is the base-rate fallacy — and it's the same maths we'll apply to ML when the prior on
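The same arithmetic in three lines of Python, to make the base-rate effect tangible:

```python
# Disease prevalence 1%; sensitivity and specificity both 95%.
p_disease = 0.01
p_pos_given_disease = 0.95          # sensitivity
p_pos_given_healthy = 1 - 0.95      # false positive rate = 1 - specificity

# Evidence: total probability of testing positive.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior via Bayes' rule.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_pos, p_disease_given_pos)   # 0.059, ~0.161
```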
In ML, the role of $A$ is played by the parameters $\theta$ and the role of $B$ by the data $\mathcal{D}$ · $p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\,p(\theta)}{p(\mathcal{D})}$.
This is the central equation of probabilistic ML. It tells us how to update our belief about $\theta$ after seeing data.
| Term | What it is | Where it comes from |
|---|---|---|
| Likelihood $p(\mathcal{D} \mid \theta)$ | how plausible is the data under $\theta$ | the model |
| Prior $p(\theta)$ | belief about $\theta$ before seeing data | choice / domain knowledge |
| Posterior $p(\theta \mid \mathcal{D})$ | updated belief about $\theta$ after seeing data | what we compute |
| Evidence $p(\mathcal{D})$ | normalizing constant | usually intractable, often ignored |
We will keep these colors consistent for the rest of the course.
The evidence $p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\,p(\theta)\,d\theta$ does not depend on $\theta$.
For finding the single best $\theta$ it is therefore irrelevant — dividing by a constant doesn't move the argmax.
We only need · posterior $\propto$ likelihood $\times$ prior.
The evidence becomes important only when we want a full posterior — Bayesian neural nets, model comparison, ELBO in VAEs (L19). For today, MLE + MAP, we drop it.
Bayes' rule is not a one-shot operation. As more data arrives, today's posterior becomes tomorrow's prior ·
After dataset $\mathcal{D}_1$ · $p(\theta \mid \mathcal{D}_1) \propto p(\mathcal{D}_1 \mid \theta)\,p(\theta)$.
Now observe more data $\mathcal{D}_2$.
The previous posterior plays the role of the prior · $p(\theta \mid \mathcal{D}_1, \mathcal{D}_2) \propto p(\mathcal{D}_2 \mid \theta)\,p(\theta \mid \mathcal{D}_1)$.
Practical issue · for general distributions, the posterior may not be in the same family as the prior, so each update changes the shape of the formula. Conjugate priors (next slide) avoid this · prior and posterior stay in the same family, and updates are just parameter arithmetic.
A prior is conjugate to a likelihood when the resulting posterior belongs to the same family as the prior.
| Likelihood | Conjugate prior | Posterior |
|---|---|---|
| Bernoulli($\theta$) | Beta($\alpha, \beta$) | Beta($\alpha + \#\text{heads},\ \beta + \#\text{tails}$) |
| Categorical($\pi$) | Dirichlet($\alpha_1, \dots, \alpha_K$) | Dirichlet($\alpha_k + \text{count}_k$) |
| Normal($\mu$, known $\sigma^2$) | Normal($\mu_0, \sigma_0^2$) | Normal (closed-form mean and variance) |
| Poisson($\lambda$) | Gamma($a, b$) | Gamma($a + \sum_i x_i,\ b + N$) |
Conjugacy is why classical Bayesian statistics looks easy · all the integrals collapse. Once we go to deep neural nets, conjugacy breaks and we need approximations (variational inference, MCMC) — but for this lecture, conjugacy makes the coin example airtight.
The Beta distribution is supported on $[0, 1]$, which makes it a natural prior for a probability · $p(\theta) \propto \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$.
Mean · $\frac{\alpha}{\alpha + \beta}$.
The shape depends on $(\alpha, \beta)$ · symmetric when $\alpha = \beta$, skewed otherwise; larger values concentrate the mass.
So $\mathrm{Beta}(1, 1)$ is the uniform distribution — no preference at all.
There's a much more intuitive way to read $\alpha$ and $\beta$.
Think of $\alpha$ as pseudo-heads and $\beta$ as pseudo-tails — imaginary flips you've "seen" before any real data arrives.
Examples · $\mathrm{Beta}(1, 1)$ = essentially no pseudo-data (uniform); larger equal values of $\alpha = \beta$ = a stronger pull toward a fair coin.
This pseudo-count framing is what makes the conjugate update so clean — the posterior just adds the real counts to the pseudo-counts. We'll see that on the next slide.
The two ingredients ·
Prior · $p(\theta) \propto \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$.
Likelihood · $p(\mathcal{D} \mid \theta) \propto \theta^{N_H}(1 - \theta)^{N_T}$ for $N_H$ heads and $N_T$ tails
(we drop the binomial coefficient — it's constant in $\theta$).
Both are products of powers of $\theta$ and $(1 - \theta)$.
By Bayes' rule, posterior $\propto$ likelihood $\times$ prior ·
Combine exponents · $p(\theta \mid \mathcal{D}) \propto \theta^{\alpha + N_H - 1}(1 - \theta)^{\beta + N_T - 1}$.
This is the kernel of $\mathrm{Beta}(\alpha + N_H,\ \beta + N_T)$.
The posterior is in the same family as the prior — that's what "conjugate" means. We started with a Beta, multiplied by a Bernoulli/Binomial likelihood, and got another Beta out.
Interpretation · just add the observed counts to the pseudo-counts. That's the whole update — no integral, no normalization, no MCMC.
This is why we framed $\alpha$ and $\beta$ as pseudo-counts.
Sequential updating · because conjugacy preserves the family, you can update one observation at a time and the formulas stay the same. Each new flip just adds $1$ to $\alpha$ (heads) or to $\beta$ (tails).
This is the cleanest possible Bayesian inference, and it's what made Bayesian statistics tractable in the pre-MCMC era. Modern DL breaks conjugacy (neural-net likelihoods aren't conjugate to anything) → we need approximations like variational inference (L19) or sampling.
Start with a weakly-fair prior — a Beta with a small, equal number of pseudo-heads and pseudo-tails.
The posterior gets narrower as more data arrives. With infinite data, it converges to a point mass at the true bias.
Prior ·
Observe ·
Step 1 · update parameters
Step 2 · posterior summaries
The posterior mean lives between the prior mean and the MLE — a compromise between prior belief and data. The sketch below makes this concrete.
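A minimal sketch of the conjugate update, assuming a Beta(2, 2) prior as a stand-in for the slide's weakly-fair prior and reusing the 6-heads / 4-tails data from earlier:

```python
# Conjugate update: posterior = Beta(alpha + heads, beta + tails).
alpha, beta = 2.0, 2.0      # assumed weakly-fair prior (two pseudo-flips of each kind)
heads, tails = 6, 4

alpha_post = alpha + heads  # 8
beta_post = beta + tails    # 6

posterior_mean = alpha_post / (alpha_post + beta_post)                 # 8/14 ~ 0.571
posterior_mode = (alpha_post - 1) / (alpha_post + beta_post - 2)       # 7/12 ~ 0.583
mle = heads / (heads + tails)                                          # 0.6

print(posterior_mean, posterior_mode, mle)   # posterior sits between prior mean 0.5 and MLE 0.6
```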
The simplest possible estimator · ignore the prior, just maximize the likelihood.
Read · "what value of
This is what we'll spend Part 4 of this lecture on. Coin → linear regression → logistic regression → multiclass — all derive their loss as the negative log-likelihood under MLE.
The Bayesian-flavoured estimator · use the prior, maximize the full posterior.
Read · "given my prior belief and the data, what's the most probable
This is Part 5 (Session 2). MAP = MLE plus a regularizer that comes from the log-prior. With a Gaussian prior we'll recover L2; with a Laplace prior we'll recover L1. Same machinery, different prior.
MAP = MLE + a prior on $\theta$ — in log space, the log-prior becomes an additive regularizer.
That single sentence is what makes regularization fall out of the same machinery as the loss. There is no separate "loss" and "regularizer" theory — it's all one Bayesian story, and L2/L1 are just choices of prior.
When the data is plentiful, the likelihood dominates and MAP $\to$ MLE.
Concrete derivations · coin · linear regression · logistic regression
Three sections of the same exam · grades modelled as Normal ·
| Course | Mean | Std |
|---|---|---|
| C1 | 80 | 10 |
| C2 | 70 | 10 |
| C3 | 90 | 5 |
A student's mark is 82. Which course's grade distribution most plausibly produced it?
Stop and guess before the next slide.
Evaluate the density of each course's Normal at the observed mark and pick the largest.
Course C1 wins · its bell is centred close to 82 with reasonable spread. The sketch below checks the numbers.
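A quick check of the three densities, using the table's parameters and a mark of 82 (as suggested above):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x = 82.0   # the observed mark
for name, mu, sigma in [("C1", 80, 10), ("C2", 70, 10), ("C3", 90, 5)]:
    print(name, normal_pdf(x, mu, sigma))
# C1 ~0.039, C2 ~0.019, C3 ~0.022  ->  C1 has the highest density
```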
MLE intuition · among candidate distributions, choose the one under which the observed data has the highest probability.
This is the entire idea. Everything below is just doing it carefully.
Data · $x_1, \dots, x_N$ assumed IID $\mathcal{N}(\mu, \sigma^2)$; find the MLE of $\mu$.
Take the log · $\log L(\mu) = -\frac{1}{2\sigma^2}\sum_{i=1}^N (x_i - \mu)^2 + \text{const}$.
Step 1 · differentiate. $\frac{d}{d\mu}\log L(\mu) = \frac{1}{\sigma^2}\sum_{i=1}^N (x_i - \mu)$.
Step 2 · set derivative to zero. $\sum_{i=1}^N (x_i - \mu) = 0$.
Expand · $\sum_i x_i = N\mu$, so $\hat{\mu}_{\mathrm{MLE}} = \frac{1}{N}\sum_{i=1}^N x_i$ — the sample mean.
For our data, the MLE of the mean is simply the average of the observations.
This is your first MLE derivation. Same recipe applies to everything below.
To find the MLE for any model · write the likelihood, take the log, differentiate with respect to the parameters, and set the derivative to zero (or run gradient descent when no closed form exists).
We will now apply this recipe to linear regression (where the answer pops out as MSE) and logistic regression (where it pops out as BCE).
Modelling choice · the target is a linear function plus Gaussian noise · $y = \theta^\top x + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
Equivalently · the conditional distribution of $y$ given $x$ is $\mathcal{N}(\theta^\top x,\ \sigma^2)$.
The model's prediction $\hat{y} = \theta^\top x$ is the mean of that Gaussian.
Each data point is one draw from a Gaussian whose mean lies on the regression line. MLE asks · which line $\theta$ makes the observed $y_i$ most probable?
Equivalently · which line minimizes the squared distance from each point to the line — exactly the OLS objective. That equivalence is the whole derivation.
This is the only modelling choice. Everything else is algebra.
For a single example, $\log p(y_i \mid x_i, \theta) = -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - \theta^\top x_i)^2}{2\sigma^2}$.
Sum over the dataset · log of a product of IID terms is a sum.
Only the second term depends on $\theta$.
Maximizing the log-likelihood is therefore the same as minimizing $\sum_i (y_i - \theta^\top x_i)^2$.
The factor $\frac{1}{2\sigma^2}$ is a positive constant — it rescales the objective but doesn't move the argmin.
This is exactly MSE. MSE is not a heuristic — it is the MLE under Gaussian noise.
If the noise had been Laplace instead of Gaussian ($p(\epsilon) \propto e^{-|\epsilon|/b}$), the same derivation would have produced the mean absolute error. The loss encodes the noise assumption.
The MSE objective is quadratic in $\theta$, so the minimizer has a closed form.
Stack data · $X \in \mathbb{R}^{N \times d}$ (one example per row), $y \in \mathbb{R}^N$ · $\hat{\theta} = (X^\top X)^{-1} X^\top y$.
The familiar normal equation is the closed-form MLE under Gaussian noise. SGD / gradient descent gives the same answer iteratively.
Let
Data ·
We want the slope of a one-feature, no-intercept fit · $\hat{\theta} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$.
Let's compute it step by step.
Step 1 · denominator
Step 2 · numerator
Step 3 · solve ·
That's the OLS estimate by hand. No matrix inversion needed for a single feature.
Predictions under
Residuals
Sum-of-squared residuals ·
Sanity check · the MSE gradient at $\hat{\theta}$ is (numerically) zero — we are at the minimum of the quadratic.
This is exactly the answer torch.linalg.lstsq returns. We just did it by hand to feel the closed form.
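A minimal sketch comparing the hand formula with the library solver — the numbers here are illustrative, not the slide's original dataset:

```python
import torch

# Illustrative 1-D data.
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 5.0])

# Closed-form MLE for one feature, no intercept: theta = sum(x*y) / sum(x^2).
theta_hand = (x * y).sum() / (x ** 2).sum()

# Same answer from the library solver.
theta_lstsq = torch.linalg.lstsq(x.unsqueeze(1), y.unsqueeze(1)).solution

print(theta_hand.item(), theta_lstsq.item())   # both ~1.786
```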
Now $y \in \{0, 1\}$, and the model outputs $\hat{y}_i = \sigma(\theta^\top x_i)$, interpreted as $p(y_i = 1 \mid x_i)$.
Per-example probability · $p(y_i \mid x_i, \theta) = \hat{y}_i^{\,y_i}\,(1 - \hat{y}_i)^{1 - y_i}$.
This is the same compact Bernoulli form as the coin — except the success probability now depends on $x_i$ through the sigmoid.
Take the log · $\log p(y_i \mid x_i, \theta) = y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)$,
where $\hat{y}_i = \sigma(\theta^\top x_i)$.
Sum over the dataset and negate to get NLL · $\mathrm{NLL}(\theta) = -\sum_{i=1}^N \left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right]$.
This is exactly binary cross-entropy. Same story · BCE is not invented — it's the MLE under a Bernoulli output.
Differentiate · $\nabla_\theta\,\mathrm{NLL} = \sum_{i=1}^N (\hat{y}_i - y_i)\,x_i$.
The gradient is "prediction minus truth, weighted by input." This is the exact same form as the linear regression gradient $\sum_i (\hat{y}_i - y_i)\,x_i$ — only the meaning of $\hat{y}$ has changed.
That is not a coincidence — it's a feature of generalized linear models, all derived from MLE.
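A quick numerical check that the library BCE loss is exactly the Bernoulli NLL (random logits and labels are placeholders):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5)                             # theta^T x for 5 examples
targets = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])

# Library loss.
bce = F.binary_cross_entropy_with_logits(logits, targets)

# Negative log-likelihood of a Bernoulli with p = sigmoid(logits).
nll = -torch.distributions.Bernoulli(logits=logits).log_prob(targets).mean()

print(bce.item(), nll.item())                       # identical
```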
Now $y \in \{1, \dots, K\}$ — one of $K$ classes.
For each class $k$ the model produces a logit $z_k = \theta_k^\top x$.
Softmax turns logits into probabilities · $\pi_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$.
Two properties · every $\pi_k > 0$, and $\sum_k \pi_k = 1$ — a valid Categorical.
Modelling assumption · $y \mid x \sim \mathrm{Cat}(\pi_1, \dots, \pi_K)$.
If example $i$ has true class $c$, its probability under the model is $\pi_{i,c}$.
Equivalently with one-hot encoding $\mathbf{y}_i$ · $p(\mathbf{y}_i \mid x_i, \theta) = \prod_{k=1}^K \pi_{i,k}^{\,y_{i,k}}$.
Take logs · $\log p(\mathbf{y}_i \mid x_i, \theta) = \sum_k y_{i,k}\log\pi_{i,k} = \log\pi_{i,c}$.
Same compact-Bernoulli trick as before — only one term in the sum survives because the one-hot vector has a single 1.
Sum over the dataset and negate · $\mathrm{NLL}(\theta) = -\sum_{i=1}^N \sum_{k=1}^K y_{i,k}\log\pi_{i,k}$.
The right-hand form is the textbook categorical cross-entropy — true distribution $\mathbf{y}_i$ (one-hot) against model distribution $\pi_i$ (softmax).
This loss powers every multiclass classifier in this course · CIFAR-10 ($K = 10$) in the vision lectures, and next-token prediction in the LLM lectures ($K$ = vocabulary size).
Differentiate with respect to the logits · $\frac{\partial\,\mathrm{NLL}_i}{\partial z_{i,k}} = \pi_{i,k} - y_{i,k}$.
Prediction minus truth, per logit. Same elegant form as binary logistic ($\hat{y} - y$).
This is why softmax + cross-entropy are always implemented as one fused op (F.cross_entropy(logits, target)) — both for numerical stability (log-sum-exp trick) and because the gradient simplifies dramatically when you do them together.
3 classes (cat, dog, car). Image with true class cat (index 0). Model produces logits
Softmax (subtract max for stability) ·
Per-example loss · NLL of the true class.
Gradient on logits ·
The gradient says · push the cat-logit up (negative gradient → the optimizer subtracts it, so cat-logit increases), push the dog and car logits down. Exactly what you want.
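A minimal sketch of this worked example — the logits below are hypothetical stand-ins for the slide's numbers:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for (cat, dog, car); true class is cat (index 0).
logits = torch.tensor([2.0, 1.0, 0.1], requires_grad=True)
target = torch.tensor(0)

loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))   # fused softmax + NLL
loss.backward()

probs = F.softmax(logits.detach(), dim=0)
one_hot = F.one_hot(target, num_classes=3).float()
print(loss.item())            # -log(probs[0])
print(logits.grad)            # equals probs - one_hot: negative for cat, positive for dog and car
print(probs - one_hot)        # "prediction minus truth", per logit
```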
| Output | Distribution chosen | NLL turns out to be |
|---|---|---|
| Continuous $y$ | Normal | MSE |
| Binary $y$ | Bernoulli | BCE |
| One of $K$ classes | Categorical | CE |
Every loss = NLL under an assumed conditional distribution
Pick the distribution to match your data. The loss falls out automatically. No more "memorize MSE for regression and CE for classification" — both come from the same place.
Cat-vs-dog classifier · the model outputs $\hat{y} = p(\text{cat} \mid x)$.

| True class | Log-likelihood | NLL = loss | Model "happy"? |
|---|---|---|---|
| 1 (cat) | $\log \hat{y}$ | $-\log \hat{y}$ | yes — small loss |
| 0 (dog) | $\log(1 - \hat{y})$ | $-\log(1 - \hat{y})$ | no — big loss |
The loss is small iff the model assigned high probability to the true class. That's all cross-entropy is doing — and that's all "maximizing log-likelihood" means once you write it out.
A model defines a conditional distribution · $p(y \mid x, \theta)$.
Likelihood of a dataset · $L(\theta) = \prod_{i=1}^N p(y_i \mid x_i, \theta)$.
MLE · $\hat{\theta} = \arg\max_\theta \log L(\theta)$ — equivalently, minimize the NLL.
Plug in the right distribution and the right loss falls out automatically ·
Bayes' rule turns this into a belief-update story · posterior $\propto$ likelihood $\times$ prior — which is where MAP and regularization enter in Session 2.
Session 2 will turn the prior into regularization and KL divergence into the single language used by every model in the rest of the course.
Try these on paper; answers worked through in the notebook (lec00-mle-map.ipynb).
P1. A coin gives 12 heads in 20 flips. Compute the MLE for $p$.
P2. Show that for
P3. Write down the conditional distribution and the per-example log-likelihood for Poisson regression ($y_i \in \{0, 1, 2, \dots\}$ with rate $\lambda_i = e^{\theta^\top x_i}$).
P4. For a 3-class softmax with logits
P5. A coin's true bias is
So far · probabilistic framework + MLE. Bernoulli, Categorical, Normal · likelihood and log-likelihood · sampling primitives · Bayes' rule with its four named terms · Beta-Binomial conjugate updates · MLE for the coin, linear regression, logistic regression, and multiclass — all derived as NLL of a chosen distribution.
Session 2 picks up from here · MAP and regularization (L2 from Gaussian prior, L1 from Laplace), KL divergence and information theory as the unifying lens, and how it all fans out into VAEs, diffusion, RLHF, and the rest of the course.
Recap of session 1 · the model defines a distribution, and MLE = NLL minimization. Today · what changes when we add a prior — and why KL divergence is the right language for everything that comes next.
L2 from a Gaussian prior · L1 from a Laplace prior
You flip a coin 3 times and see 3 heads. MLE says $\hat{p} = 3/3 = 1$.
According to MLE, this coin is certainly biased to always land heads. Future tails impossible.
This is absurd. Three flips is not enough evidence to make such an extreme claim. You know most coins are roughly fair.
The fix · encode that prior knowledge into the inference. That gives MAP.
In ML, the analogous trap is overfitting · with finite data, MLE drives weights to whatever value most exactly fits the training set, even if those values are wildly extreme. A prior pulls them back.
By Bayes, $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\,p(\theta)$.
In log space · $\log p(\theta \mid \mathcal{D}) = \log p(\mathcal{D} \mid \theta) + \log p(\theta) + \text{const}$.
Equivalently, minimize the negative · $\mathrm{NLL}(\theta) - \log p(\theta)$.
The first term is your usual loss. The second term is whatever the prior gives. That second term is what we will recognize as L1 or L2.
The MAP estimate is the point in parameter space where the posterior is highest · $\hat{\theta}_{\mathrm{MAP}} = \arg\min_\theta \left[\mathrm{NLL}(\theta) - \log p(\theta)\right]$.
We choose a prior that says "weights are probably small" · each $\theta_j \sim \mathcal{N}(0, \tau^2)$ independently.
For a single weight · $\log p(\theta_j) = -\frac{\theta_j^2}{2\tau^2} + \text{const}$.
For the whole vector $\theta$ · $\log p(\theta) = -\frac{1}{2\tau^2}\sum_j \theta_j^2 + \text{const}$.
The constant doesn't depend on $\theta$, so it drops out of the argmin.
Plug into the MAP objective · $\mathrm{NLL}(\theta) + \lambda \sum_j \theta_j^2$, with $\lambda = \frac{1}{2\tau^2}$.
This is exactly L2 regularization (a.k.a. ridge, weight decay).
Linear regression. Suppose at the MLE estimate, the NLL is
Pick
The optimizer now pays a price for large weights. Since the gradient of $\lambda\theta_j^2$ is $2\lambda\theta_j$, every step shrinks each weight toward zero in proportion to its size — exactly what "weight decay" implements.
Now choose a prior that says "most weights are probably irrelevant" · each $\theta_j \sim \mathrm{Laplace}(0, b)$.
The Laplace density · $p(\theta_j) = \frac{1}{2b}\exp\!\left(-\frac{|\theta_j|}{b}\right)$.
For one weight · $\log p(\theta_j) = -\frac{|\theta_j|}{b} + \text{const}$.
For the whole vector with independent components · $\log p(\theta) = -\frac{1}{b}\sum_j |\theta_j| + \text{const}$.
Note the absolute value in the log — that's the structural difference from Gaussian's squared term.
Plug into MAP · $\mathrm{NLL}(\theta) + \lambda \sum_j |\theta_j|$, with $\lambda = \frac{1}{b}$.
This is exactly L1 regularization (a.k.a. lasso).
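A minimal sketch of the two MAP objectives side by side — the helper name `map_objective` and the numbers are hypothetical, and `lam` plays the role of $\frac{1}{2\tau^2}$ or $\frac{1}{b}$:

```python
import torch

def map_objective(nll, theta, prior="gaussian", lam=0.01):
    """MAP loss = NLL minus log-prior (constants dropped).

    Gaussian prior on the weights -> lam * sum(theta^2)   (L2 / ridge)
    Laplace  prior on the weights -> lam * sum(|theta|)   (L1 / lasso)
    """
    if prior == "gaussian":
        penalty = lam * (theta ** 2).sum()
    elif prior == "laplace":
        penalty = lam * theta.abs().sum()
    else:
        raise ValueError(prior)
    return nll + penalty

theta = torch.tensor([0.5, -2.0, 0.0])
nll = torch.tensor(1.3)                       # stand-in for the data NLL
print(map_objective(nll, theta, "gaussian"))  # L2-regularized loss
print(map_objective(nll, theta, "laplace"))   # L1-regularized loss
```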
Think of MAP as minimizing the loss subject to the prior. In 2D · the Gaussian prior's equal-density contours are circles, the Laplace prior's are diamonds; the solution sits where the loss contours first touch the prior's contour.
L1's sparsity is a direct consequence of the diamond geometry of the Laplace prior — corners that lie exactly on the coordinate axes are the most likely touchpoints. Move to high dimensions and there are exponentially many corners → many features get zeroed simultaneously.
| | L2 (Ridge) | L1 (Lasso) |
|---|---|---|
| Prior | Gaussian $\mathcal{N}(0, \tau^2)$ | Laplace$(0, b)$ |
| Log-prior penalty | $\lambda \sum_j \theta_j^2$ | $\lambda \sum_j \lvert\theta_j\rvert$ |
| Geometry | Circle | Diamond |
| Solution | small everywhere | many exactly zero |
| Gradient of penalty | $2\lambda\theta_j$ — shrinks in proportion to size | $\lambda\,\mathrm{sign}(\theta_j)$ — constant pull toward zero |
| When to use | dense problems, "shrink everything" | feature selection, "kill weak features" |
L1 and L2 are the same MAP machinery — they differ only in the prior over
Back to the coin · 3 heads, 0 tails. MLE says $\hat{p} = 1$.
Suppose your prior is a $\mathrm{Beta}(\alpha, \beta)$ that leans toward fairness (both pseudo-counts greater than 1).
Posterior ∝ Likelihood × Prior · $p(p \mid \mathcal{D}) \propto p^{3 + \alpha - 1}(1 - p)^{\beta - 1}$.
Maximize · the mode of this Beta is $\hat{p}_{\mathrm{MAP}} = \frac{3 + \alpha - 1}{3 + \alpha + \beta - 2}$.
MAP says $\hat{p}$ is strictly below 1 — the prior tempers the extreme claim, so future tails are no longer "impossible".
This is the same regularization story as L1/L2 on weights — just with a different distribution.
NLL is the loss everywhere · KL connects · VAE/diffusion/GANs are MAP++
| Output | Distribution | Loss = NLL | Lecture |
|---|---|---|---|
| Real-valued | Normal | MSE | L1 (recap), L19 (VAE recon) |
| Binary | Bernoulli | BCE | L1 (recap) |
| K classes | Categorical | Cross-entropy | L7+ (vision), L13–15 (LLMs) |
| Pixels | per-pixel Normal | per-pixel MSE | L19 (VAE), L21 (diffusion) |
| Tokens | Categorical | next-token CE | L13–L15 (LLMs) |
| Image patch given noise | Normal in pixel/score space | MSE on noise | L21 (diffusion) |
| Latent variable model | Normal + KL prior | ELBO = recon + KL | L19 (VAE) |
| Two distributions to match | — | KL minimization | L16 (DPO), L23 (distillation) |
The whole course will keep instantiating the same NLL recipe. Each new model just changes which distribution is being assumed.
Before defining entropy, define the information content (or "surprise") of one outcome · $h(x) = -\log p(x)$.
Why this exact formula? Three axioms we want surprise to satisfy · certain events carry zero surprise, less probable events carry more, and the surprise of independent events adds.
The only function satisfying all three is $h(x) = -\log p(x)$ (up to the base of the logarithm).
Same headline · "It snowed today." Two locations · a city where snow is routine, and one where it almost never snows. The probability of the event — and hence the surprise — is wildly different.
The same event carries different surprise depending on the distribution generating it. That is exactly what $h(x) = -\log p(x)$ captures.
This is also why log-likelihood works as a model-quality signal · a model that places high probability on the data has low per-sample surprise.
| Event | $p(x)$ | $h(x) = -\log_2 p(x)$ | Interpretation |
|---|---|---|---|
| Fair coin lands heads | $1/2$ | 1 bit | one yes/no question |
| Roll a 6 on a fair die | $1/6$ | $\approx 2.6$ bits | "less than 3 yes/no questions worth" |
| Win a 1-in-1024 lottery | $2^{-10}$ | 10 bits | extremely surprising |
| The sun rises tomorrow | $\approx 1$ | $\approx 0$ bits | no information (already certain) |
| Sample a specific token from 50k vocab | $1/50{,}000$ | $\approx 15.6$ bits | one token = one short word in English |
Reading · large $h$ = surprising, informative; small $h$ = expected, uninformative.
This is the per-sample log-loss in disguise. Cross-entropy is just the average of these surprises.
The entropy of a distribution is its average surprise · $H(p) = \mathbb{E}_{x \sim p}[-\log p(x)] = -\sum_x p(x)\log p(x)$.
In base 2, entropy is measured in bits; with the natural log, in nats.
So entropy is what you get when you average the per-sample surprise from the previous slides. Big entropy = on average, draws from
Imagine you must send samples from $p$ over a wire. The best possible code assigns about $-\log_2 p(x)$ bits to outcome $x$, so the average message length is $H(p)$ bits per sample.
Concrete · alphabet $\{A, B, C, D\}$ with probabilities $\left(\tfrac12, \tfrac14, \tfrac18, \tfrac18\right)$ ·
| Symbol | Optimal code | Code length | $p$ |
|---|---|---|---|
| A | 0 | 1 | $1/2$ |
| B | 10 | 2 | $1/4$ |
| C | 110 | 3 | $1/8$ |
| D | 111 | 3 | $1/8$ |
Average length $= \tfrac12(1) + \tfrac14(2) + \tfrac18(3) + \tfrac18(3) = 1.75$ bits $= H(p)$ exactly.
Entropy = the cost of describing samples from $p$, in bits per sample, under the best possible code.
| Distribution | Computation | $H$ |
|---|---|---|
| Fair coin | $-\tfrac12\log_2\tfrac12 - \tfrac12\log_2\tfrac12$ | 1 bit |
| Biased coin ($p = 0.9$) | $-0.9\log_2 0.9 - 0.1\log_2 0.1$ | $\approx 0.47$ bits |
| Uniform over 8 | $\log_2 8$ | 3 bits |
| One-hot (deterministic) | $-1\log_2 1$ | 0 bits — nothing to encode |
Entropy peaks for uniform distributions and is zero for deterministic ones. A fair coin needs 1 bit per flip; a near-deterministic coin needs ~0 bits per flip; a uniform-over-$K$ distribution needs $\log_2 K$ bits per draw.
This explains why low-entropy classifier outputs (confident) need fewer bits to encode than high-entropy outputs (uncertain) — and why temperature in LLMs (which controls output entropy) controls "creativity vs determinism."
A fair coin needs one bit to encode each flip. A coin with bias near 0 or 1 needs far less — most flips are predictable, so there's little to say.
Let $X \sim \mathrm{Bern}(p)$ · $H(p) = -p\log_2 p - (1 - p)\log_2(1 - p)$ — the binary entropy function.
| Distribution | $H$ | Interpretation |
|---|---|---|
| $p = 0$ or $p = 1$ | 0 bits | already certain |
| $p$ close to 0 or 1 | low | confident model |
| intermediate $p$ | higher | mixed evidence |
| $p = 0.5$ | 1 bit | maximum |
The maximum entropy of a Bernoulli is 1 bit, reached at $p = 0.5$ — the point of total uncertainty. The sketch below traces the whole curve.
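A minimal sketch of the binary entropy function (the probed $p$ values are arbitrary):

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with the convention 0*log(0) = 0."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(float(binary_entropy(p)), 3))
# 0.0 -> 0.0, 0.1 -> 0.469, 0.5 -> 1.0, 0.9 -> 0.469, 1.0 -> 0.0
# Peaks at p = 0.5 (1 bit), zero at the deterministic extremes.
```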
For a continuous random variable, replace the sum with an integral · $h(p) = -\int p(x)\log p(x)\,dx$.
(Lower-case $h$ — "differential entropy" — to distinguish it from the discrete $H$; it can be negative.)
For a Normal $\mathcal{N}(\mu, \sigma^2)$ · $h = \tfrac12\log(2\pi e \sigma^2)$.
Reading · the differential entropy of a Gaussian only depends on $\sigma$ — shifting the mean moves the distribution but doesn't change its spread.
Worked · for $\sigma = 1$, $h = \tfrac12\log(2\pi e) \approx 1.42$ nats $\approx 2.05$ bits.
For two random variables $X$ and $Y$ · $I(X; Y) = \mathrm{KL}\big(p(x, y)\,\|\,p(x)\,p(y)\big)$.
Read · "how much does knowing $Y$ reduce your uncertainty about $X$?" — zero exactly when they are independent.
Why we'll care · it's the information-theoretic way to ask how much a learned representation retains about its input.
We won't compute $I(X; Y)$ by hand today — it's here so the name is familiar when it reappears.
KL divergence measures how different one distribution $q$ is from a reference distribution $p$ · $\mathrm{KL}(p\,\|\,q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = \mathbb{E}_{x \sim p}\!\left[\log\frac{p(x)}{q(x)}\right]$.
(Replace the sum with an integral for continuous variables.)
Read · "the average log-ratio when the data really comes from $p$."
It's the average extra surprise you experience by encoding samples-from-$p$ with a code built for $q$.
Property 1 is what makes KL a sensible loss-like quantity — minimize it and you get to zero only when distributions match.
Property 2 says KL = 0 is the unique minimum. So minimizing KL is the right objective for matching distributions.
Property 3 is the surprising one. KL is not a metric — which direction you take matters. We'll see this matters a lot in "forward vs reverse KL" later.
Let
In nats (natural log) ·
In bits · divide the nats by $\log 2 \approx 0.693$.
Asymmetry check ·
True
Model A ·
Model B ·
A roughly-aligned model has KL
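A minimal sketch of computing KL in both directions — the two distributions here are illustrative, not the slide's specific numbers:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Illustrative distributions over three outcomes.
p = np.array([0.5, 0.4, 0.1])
q = np.array([0.4, 0.4, 0.2])

print(kl(p, q), kl(q, p))          # the two directions give different numbers
print(kl(p, q) / np.log(2))        # divide by log 2 to convert nats -> bits
```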
Same alphabet
Suppose we mistakenly built our code for the wrong distribution $q$ while the samples actually come from $p$.
When we encode samples from $p$ with the $q$-code, the average length is the cross-entropy $H(p, q) = H(p) + \mathrm{KL}(p\,\|\,q)$ — the optimal $H(p)$ bits plus a KL-sized overhead.
Reading · KL is the literal number of extra bits per sample you pay for using the wrong code. Modelling = compression · this is why "better model = better compression" is not a metaphor.
Jensen's inequality · for any concave function $f$ · $\mathbb{E}[f(X)] \le f(\mathbb{E}[X])$.
Since $\log$ is concave · $\mathbb{E}_p\!\left[\log\frac{q(x)}{p(x)}\right] \le \log \mathbb{E}_p\!\left[\frac{q(x)}{p(x)}\right]$.
So $-\mathrm{KL}(p\,\|\,q) \le \log\sum_x p(x)\frac{q(x)}{p(x)} = \log\sum_x q(x) = \log 1 = 0$, hence $\mathrm{KL}(p\,\|\,q) \ge 0$, with equality iff $p = q$.
This is the full proof of Gibbs' inequality. Jensen's inequality is the same tool used to derive the ELBO in VAEs (L19). Same trick, same place — once you see it once, you see it everywhere.
Algebra · $H(p, q) = -\sum_x p(x)\log q(x) = H(p) + \mathrm{KL}(p\,\|\,q)$ — cross-entropy splits into the entropy of the truth plus the KL gap.
For classification with a one-hot true label $p$ · $H(p, q) = -\sum_k p_k \log q_k = -\log q_{\text{true class}}$.
In classification, the truth $p$ is one-hot, so its entropy is zero and cross-entropy equals KL.
For any one-hot $p$, $H(p) = 0$, so $H(p, q) = \mathrm{KL}(p\,\|\,q)$ — minimizing cross-entropy and minimizing KL are the same thing here.
Cross-entropy of a one-hot truth against a softmax model is exactly the NLL of the model on the true class. This is the standard classification loss.
It's also exactly what L1 derived from a Bernoulli/Categorical assumption — same answer through two different lenses (NLL or KL). On the next slide we tabulate it across confidence levels.
3 classes, true class is 1 (so $p$ is one-hot on class 1) · loss $= -\log q_1$.

| Model $q_1$ | Loss $-\log q_1$ | "Sentiment" |
|---|---|---|
| $q_1 \approx 1$ | $\approx 0$ | confidently right · tiny loss |
| $q_1$ fairly high | small | mostly right |
| $q_1 = 1/3$ | $\log 3 \approx 1.1$ | uncertain (≈ uniform) |
| $q_1 \approx 0$ | very large | confidently wrong · big loss |
The loss is small iff the model assigned high probability to the true class and grows steeply when the model is confidently wrong. This asymmetric penalty is what makes cross-entropy a proper scoring rule — it rewards calibration, not just accuracy.
The classifier's training loss is just the average of this column over the dataset.
Define the empirical data distribution as the histogram of the training set, treated as a distribution · $\hat{p}_{\text{data}}(x) = \frac{1}{N}\sum_{i=1}^N \delta(x - x_i)$.
This is just a discrete probability mass with weight $\frac{1}{N}$ on every training example.
Now the average log-likelihood becomes an expectation under $\hat{p}_{\text{data}}$ · $\frac{1}{N}\sum_i \log p_\theta(x_i) = \mathbb{E}_{x \sim \hat{p}_{\text{data}}}[\log p_\theta(x)]$.
So the score we already maximize is the negative cross-entropy between the empirical distribution and the model.
Use $H(\hat{p}_{\text{data}}, p_\theta) = H(\hat{p}_{\text{data}}) + \mathrm{KL}(\hat{p}_{\text{data}}\,\|\,p_\theta)$ · the data's entropy doesn't depend on $\theta$, so maximizing likelihood is minimizing the KL term · $\hat{\theta}_{\mathrm{MLE}} = \arg\min_\theta \mathrm{KL}(\hat{p}_{\text{data}}\,\|\,p_\theta)$.
MLE = make the model distribution as KL-close as possible to the empirical distribution.
Same machinery, two-distribution view. Every classifier you've trained with cross-entropy has been silently minimizing this KL — to the one-hot empirical distribution of class labels.
This is also why MLE is mode-covering (forward KL) — covered in the "forward vs reverse KL" slide.
Adding the log-prior to the previous slide · $\hat{\theta}_{\mathrm{MAP}} = \arg\min_\theta \left[\mathrm{KL}(\hat{p}_{\text{data}}\,\|\,p_\theta) - \tfrac{1}{N}\log p(\theta)\right]$.
L2 regularization through this lens · "minimize KL-to-empirical, plus pay $\lambda\|\theta\|^2$ for drifting away from the zero-mean Gaussian prior."
This is the fit + don't drift in KL template that recurs throughout the course.
The same "fit + don't drift in KL" pattern appears in many places ·
| Method | Loss structure | Reading |
|---|---|---|
| L2 regularization | NLL $+\ \lambda\|\theta\|^2$ | don't drift from the prior on the weights |
| VAE (L19) | reconstruction $+\ \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)$ | keep the encoder posterior close to a standard normal |
| DPO / RLHF (L16) | reward $-\ \beta\,\mathrm{KL}(\pi\,\|\,\pi_{\text{ref}})$ | keep the policy close to the base model |
| Distillation (L23) | $\mathrm{KL}(\text{teacher}\,\|\,\text{student})$ | student matches teacher |
Every regularizer in modern DL is some form of "don't drift too far in KL" from a reference distribution. Once you see the pattern, you stop memorizing losses and start deriving them.
We've already seen that KL is asymmetric · $\mathrm{KL}(p\,\|\,q) \ne \mathrm{KL}(q\,\|\,p)$ in general. The two directions optimize for different behaviour.
Both are valid. Both are used in DL. Knowing which one your loss optimizes tells you what failure mode to expect. Next two slides unpack each.
Read · "average the log-ratio over the truth
Failure mode · the expectation is taken over
Consequence · mode-covering.
This is what MLE optimizes — recall MLE =
Read · "average the log-ratio over the model
Failure mode · the expectation is taken over
Consequence · mode-seeking.
This is approximately what VAE encoders, GAN training, and policy distillation optimize. Sharper samples, but mode collapse is a real risk.
Same bimodal target $p$, fit with a single Gaussian $q$ · forward KL picks a broad $q$ that covers both modes; reverse KL picks a narrow $q$ that sits on one mode.
The two errors are opposite failure modes. Knowing which KL direction your loss optimizes tells you which failure mode to expect.
For two Gaussians · $\mathrm{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2)\,\|\,\mathcal{N}(\mu_2, \sigma_2^2)\big) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$.
For the special case $q = \mathcal{N}(\mu, \sigma^2)$ against $p = \mathcal{N}(0, 1)$ · $\mathrm{KL}(q\,\|\,p) = \tfrac{1}{2}\left(\mu^2 + \sigma^2 - \log\sigma^2 - 1\right)$.
This is the exact term that shows up in the VAE loss (L19) — pulling the encoder's posterior toward a standard-normal prior. Memorize this form; it'll save you 5 minutes of staring when you re-derive it later.
Worked · the sketch below evaluates the closed form for a few representative $(\mu, \sigma)$ values.
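A minimal check of the closed form — the $(\mu, \sigma)$ values are illustrative, not the slide's original numbers:

```python
import numpy as np

def kl_gauss_to_std_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) = 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1)."""
    return 0.5 * (mu ** 2 + sigma ** 2 - np.log(sigma ** 2) - 1.0)

for mu, sigma in [(0.0, 1.0), (1.0, 1.0), (0.0, 2.0)]:
    print(mu, sigma, kl_gauss_to_std_normal(mu, sigma))
# (0, 1) -> 0.0 (identical distributions), (1, 1) -> 0.5, (0, 2) -> ~0.807 nats
```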
Once you see KL as the underlying object, every advanced model becomes a specific KL minimization ·
| Model | KL it minimizes | Lecture |
|---|---|---|
| Classifier (CE loss) | $\mathrm{KL}(\hat{p}_{\text{data}}\,\|\,p_\theta)$ with one-hot labels | L1, every classifier |
| VAE | recon-NLL $+ \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$ | L19 |
| Diffusion (variational view) | a sum of per-step KLs between forward and reverse processes | L21 |
| GAN | (approximately) Jensen-Shannon — symmetrized KL | L20 |
| DPO / RLHF KL term | $\mathrm{KL}(\pi\,\|\,\pi_{\text{ref}})$ | L16 |
| Knowledge distillation | $\mathrm{KL}(\text{teacher}\,\|\,\text{student})$ | L23 |
You will see KL terms in every generative or alignment loss for the next 24 lectures. Each is an instance of what we just derived.
Try these on paper; the notebook has plotting and verification.
P6. For linear regression with NLL
P7. Compute
P8. Show that for a one-hot true label $p$, the cross-entropy $H(p, q)$ equals $-\log q_{\text{true class}}$.
P9. A bimodal target has modes at
P10. Coin flipped 10 times, observed 8 heads. Starting from
Every advanced model in this course uses MLE (or MAP) under a specific distribution ·
| Model | Output distribution | Loss = NLL of … | Lecture |
|---|---|---|---|
| LLM (next-token) | Categorical over vocab | next-token cross-entropy | L13–L15 |
| VAE | Normal pixel decoder + Gaussian latent | reconstruction NLL + KL to prior | L19 |
| GAN | implicit generator | min–max over discriminator and generator (not a plain NLL) | L20 |
| Diffusion | Gaussian forward process | MSE on predicted noise | L21–L22 |
| RLHF / DPO | Bradley–Terry preference pair | log-sigmoid of reward gap | L16 |
You now know what all of these losses are. They're NLLs.
We will pair this lecture with a notebook (lec00-mle-map.ipynb) that walks through ·
- Coin MLE via torch.distributions.Bernoulli.
- Linear regression MLE via torch.distributions.Normal — recover OLS.
- BCEWithLogitsLoss — show it equals the NLL of a Bernoulli.
- Same code skeleton, three lines change between MLE and MAP. That's the punchline.
Q. Do we always have to pick a distribution before training?
A. Yes — implicitly or explicitly. When you write MSE, you have implicitly assumed Gaussian noise. When you write BCE, Bernoulli. The conscious choice is what we're advocating today.
Q. Why minimize NLL instead of maximize LL?
A. Pure convention. ML libraries minimize. Negate, minimize, same answer.
Q. What if my output isn't Bernoulli/Categorical/Gaussian?
A. Pick whatever distribution fits — Poisson for counts, Beta for probabilities, mixture-of-Gaussians for multimodal data. The recipe (NLL = loss) is universal.
Q. How does this connect to Bayesian deep learning?
A. Bayesian DL keeps the full posterior over $\theta$ (or an approximation to it) instead of a single point estimate. MAP is the halfway point · one best $\theta$, but informed by a prior.