A Probabilistic View of ML

Distributions · sampling · MLE · MAP · KL

Lecture 0 · ES 667: Deep Learning · spans 2 sessions

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Why this lecture

You already know how to solve regression and classification. You wrote down a loss, took a gradient, ran SGD. It works.

But two questions you may never have asked ·

  1. Where does the loss come from? Why MSE for regression and cross-entropy for classification — and not something else?
  2. What is L1 / L2 regularization, really? They're not just "add this term, weights stay small." They have a meaning.

Today's promise · one principle — maximum likelihood under a probabilistic model — gives every loss in this course, and a tiny twist on it gives every regularizer. From this lecture onward, "loss design" stops being a bag of tricks and becomes a modeling choice.

Learning outcomes · Session 1

Framework + likelihood. By end of Session 1 you can ·

  1. State the Bernoulli, Categorical, and Normal distributions and the ~ notation.
  2. Read a plate-notation graphical model and recognize the supervised setup.
  3. Explain three reasons why the Normal shows up everywhere (CLT, max entropy, closed-under-linear).
  4. Sample from Bernoulli, Categorical, and Normal — and recognize the reparameterization trick as an affine map of a base sample.
  5. Apply Bayes' rule and identify prior, likelihood, evidence, and posterior.
  6. Derive MSE / BCE / categorical CE as NLL under Gaussian / Bernoulli / Categorical outputs.

Learning outcomes · Session 2

MAP + KL + course spine. By end of Session 2 you can ·

  1. Derive L2 as MAP with a Gaussian prior, L1 as MAP with a Laplace prior, and explain L1's sparsity geometrically.
  2. State and use KL divergence, recognize cross-entropy as KL up to a constant, and re-derive MLE/MAP through the KL lens.
  3. Distinguish forward vs reverse KL and predict their respective failure modes (mode-covering vs mode-seeking).
  4. Connect these foundations to VAEs, diffusion, RLHF, distillation, and the rest of the course.

PART 0

Revision · linear & logistic regression

The loss functions we used without asking why

Linear regression · the model

Given a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$ with $\mathbf{x}_i \in \mathbb{R}^d$ and continuous targets $y_i \in \mathbb{R}$, fit a linear model ·

$\hat{y} = \mathbf{w}^\top \mathbf{x}$

(absorb the bias into $\mathbf{w}$ by appending a constant $1$ to each $\mathbf{x}$).

The prediction is a single real number — a point estimate of $y$.

So far, no probability anywhere in sight. We pick weights, we predict a number, we measure how wrong we are.

Linear regression · loss & training

Loss · mean squared error

$\mathcal{L}(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^N \bigl(y_i - \mathbf{w}^\top \mathbf{x}_i\bigr)^2$

Train · two equivalent options ·

  • Solve in closed form · $\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}$.
  • Or run gradient descent · $\mathbf{w} \leftarrow \mathbf{w} - \eta\,\nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w})$.

Open question · you probably justified MSE as "penalize big errors more than small ones." True — but why squared and not absolute or cubed? We'll see today.

Logistic regression · the model

Same setup as linear regression, but now $y \in \{0, 1\}$ (binary classification). The output should be a probability between 0 and 1.

The sigmoid $\sigma(z) = \frac{1}{1 + e^{-z}}$ maps any real number to $(0, 1)$ — large positive logits give probabilities near 1; large negative logits give probabilities near 0; zero logit gives $0.5$.

The model output $\sigma(\mathbf{w}^\top \mathbf{x})$ is interpreted as $P(y = 1 \mid \mathbf{x})$. (We'll make that interpretation precise in Part 1.)

Logistic regression · loss & training

Loss · binary cross-entropy (BCE)

$\mathcal{L}(\mathbf{w}) = -\frac{1}{N}\sum_{i=1}^N \bigl[\, y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i) \,\bigr], \qquad \hat{p}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i)$

Train · gradient descent (no closed form because of the sigmoid).

Open question · you justified BCE as "big penalty when confidently wrong." True again — but why this exact form, and not some other penalty on the predicted probability? Same question as MSE: where do these specific losses come from?

Regularization · the other half-mystery

When the model overfits, you added a penalty term:

L2 (ridge) · $\mathcal{L}(\boldsymbol{\theta}) + \lambda \sum_j \theta_j^2$

L1 (lasso) · $\mathcal{L}(\boldsymbol{\theta}) + \lambda \sum_j |\theta_j|$

You probably learned ·

  • L2 keeps weights small (smooth shrinkage).
  • L1 drives many weights to exactly zero (sparsity).

But why does L1 hit zero and L2 doesn't? Why is the penalty squared in L2 and absolute in L1? We'll derive both from first principles.

Today's question · the four mysteries

Object | Recipe | Where does it come from?
MSE | $\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$ | ?
BCE | $-\frac{1}{N}\sum_i [y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)]$ | ?
L2 | $\lambda \sum_j \theta_j^2$ | ?
L1 | $\lambda \sum_j |\theta_j|$ | ?

You've used all four — but none was ever derived from anything. They were handed to you with the magic words "this is the loss for regression".

Today we replace the magic with a single principle.

Today's promise · one principle, four answers

All four mysteries are derived consequences of two ideas ·

  1. The model defines a probability distribution over $y$ given $x$ — not a single number.
  2. Pick parameters $\theta$ that make the data most likely (MLE), optionally tempered by a prior on $\theta$ (MAP).

That's the whole lecture in two bullets. Everything else is unpacking.

Bonus · the dividend in later lectures

The same machinery gives us KL divergence — the natural distance between distributions.

KL becomes the central object in ·

  • VAEs (L19) · the ELBO is a KL between approximate and true posterior.
  • Diffusion (L21) · score matching ≡ KL minimization between noisy data and model.
  • RLHF / DPO (L16) · the reward objective is regularized by KL to a reference policy.
  • Distillation (L23) · student matches teacher distribution by minimizing KL.

One framework today, ten lectures of dividends.

PART 1

A probabilistic view of ML

The model doesn't predict a number — it predicts a distribution

Random variable · the basic object

A random variable $X$ is a quantity whose value is uncertain. It follows a distribution $p$, written $X \sim p$.

Read · "$X$ is distributed as $p$". The "$\sim$" is the central notation of this lecture.

Two flavours, depending on the type of value $X$ takes ·

  • Discrete — described by a probability mass function summing to $1$.
  • Continuous — described by a probability density integrating to $1$.

We'll use both. The notation is mostly the same.

Distributions usually have parameters

Most distributions have parameters — knobs that shape the distribution. We collect them into a single symbol $\theta$ and write $X \sim p(X \mid \theta)$.

Read · "$X$ is distributed as $p$, given parameters $\theta$."

The vertical bar "$\mid$" means "given" — the same conditional notation as in $P(A \mid B)$ from your probability course.

Parameters · three concrete examples

The parameter symbol $\theta$ is just a placeholder. For the three distributions we'll meet today ·

  • Coin · $\theta = p$ (one parameter, the bias).
    $X \sim \text{Bernoulli}(p)$.

  • Normal · $\theta = (\mu, \sigma^2)$ (two parameters).
    $X \sim \mathcal{N}(\mu, \sigma^2)$.

  • Categorical · $\theta = (p_1, \dots, p_K)$ (a vector of probabilities summing to 1).

In ML, $\theta$ ends up being the model's weights — the things we estimate from data via MLE / MAP later. For now, treat $\theta$ as known.

IID · the assumption that makes everything work

A dataset $x_1, \dots, x_N$ is independent and identically distributed if ·

  • Identically distributed · every $x_i$ comes from the same distribution.
  • Independent · knowing $x_i$ tells you nothing about $x_j$ for $j \neq i$.

These two assumptions together give us the product factorization ·

$p(x_1, \dots, x_N \mid \theta) = \prod_{i=1}^N p(x_i \mid \theta)$

This product is what becomes a sum after taking logs — and what becomes the summed loss over a dataset in every training loop. IID is the formal license to add up per-example losses.

When IID fails (time series, video frames, sensor logs from one device) we need different math · autoregressive models, state-space models, etc. For this course, treat batches as IID.

Bernoulli · the coin

Outcome $X \in \{0, 1\}$, parameter $p$ = probability of "heads."

Probability mass function ·

$P(X = 1) = p, \qquad P(X = 0) = 1 - p$

Two outcomes, two probabilities, summing to 1. This is the simplest non-trivial distribution.

Examples · email is spam ($Y=1$) or not ($Y=0$) · patient has disease or not · pixel is foreground or background.

Bernoulli · the compact form we'll reuse

We can fold the two cases of the PMF into a single expression ·

$P(X = x) = p^x (1 - p)^{1 - x}, \qquad x \in \{0, 1\}$

Sanity check · $x = 1$ gives $p$; $x = 0$ gives $1 - p$.

This compact form is what makes the per-example log-likelihood work for both classes simultaneously — the seed of binary cross-entropy. We'll use it every time we write a BCE loss.

Bernoulli · moments

Mean · $\mathbb{E}[X] = p$

Variance · $\text{Var}[X] = p(1 - p)$

Variance is largest at $p = 0.5$ (most uncertain) and zero at $p \in \{0, 1\}$ (deterministic).

This will be reused when we derive logistic regression's gradient — it has a term that is exactly the Bernoulli variance at the predicted probability.

Bernoulli · IID worked example

Setup · a coin with bias $p$. Three flips give heads, heads, tails (i.e. $x = (1, 1, 0)$).

Under the IID assumption ·

$P(x_1, x_2, x_3 \mid p) = p \cdot p \cdot (1 - p) = p^2(1 - p)$

The product over independent observations is the heart of likelihood — coming up in Part 2 when we ask "which $p$ makes the observed data most likely?"

Plate notation · graphical-model conventions

We now have one concrete distribution (Bernoulli). A clean way to draw a probabilistic model is plate notation — the standard for the rest of this course.

Symbol | Meaning
open circle | a random variable (uncertain)
● (filled circle) | an observed random variable (we see its value)
arrow $A \to B$ | $A$ generates $B$ (i.e. $B$ depends on $A$)
rectangle (plate) labelled $N$ | the contents are repeated $N$ times — independence across $i$

These four symbols compose every probabilistic model in this course. Bayesian networks, HMMs, VAEs, diffusion models — all drawn with these conventions.

Plate notation · IID Bernoulli example

Apply the conventions to the simplest model · IID Bernoulli observations.

A single Bernoulli observation · $x_i \sim \text{Bernoulli}(p)$.

For $N$ observations, draw a plate around the repeated part ·

The plate says "draw a fresh $x_i$ for each $i = 1, \dots, N$, all from the same Bernoulli($p$)." The single $p$ outside the plate is shared across all observations — that is what makes the dataset identically distributed.

Categorical · the K-sided die (formal)

Outcome $X \in \{1, \dots, K\}$, parameter vector $\mathbf{p} = (p_1, \dots, p_K)$ with $p_k \geq 0$ and $\sum_k p_k = 1$.

Probability mass function ·

$P(X = k) = p_k$

One-hot compact form · let $\mathbf{y} \in \{0, 1\}^K$ with $y_k = 1$ if $X = k$, else 0. Then ·

$P(X) = \prod_{k=1}^K p_k^{\,y_k}$

The product collapses · only one $y_k = 1$, so only one factor survives.

Mean · $\mathbb{E}[\mathbf{y}] = \mathbf{p}$. Bernoulli is the special case $K = 2$.

Categorical · a worked example

MNIST classifier outputs a probability vector $\mathbf{p}$ for one image (10 components, summing to 1).

The model says $P(\text{digit} = k) = p_k$ for each digit $k$.

If the true label is $y = 2$ (digit "2"), then the probability the model assigned to the truth is $p_2$.

A perfect model would put all mass on class 2 (i.e. $p_2 = 1$). The further the prediction is from a one-hot truth, the less likely the data is under it — and the larger the cross-entropy loss.

The softmax output of any classifier IS a Categorical distribution. Treat it that way and the loss falls out automatically.

Normal (Gaussian) · the bell curve

Continuous $X \in \mathbb{R}$ with mean $\mu$ and variance $\sigma^2$ · $X \sim \mathcal{N}(\mu, \sigma^2)$.

Probability density function (PDF) ·

$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$

Mean & variance · $\mathbb{E}[X] = \mu, \quad \text{Var}[X] = \sigma^2$

The most important continuous distribution in all of statistics — and the seed of the MSE loss.

Normal · three properties to memorize

  1. Centred at $\mu$, spread controlled by $\sigma$.
  2. Density falls off exponentially in the squared distance $(x - \mu)^2$.
  3. The squared exponent $-\frac{(x - \mu)^2}{2\sigma^2}$ will be the seed of MSE in Part 4.

Property 2 is what we'll lean on most. Let's unpack it.

Normal · how fast the bell decays

Property 2 says density falls off exponentially in $(x - \mu)^2$. How fast?

Distance from mean | Exponent $-\frac{(x-\mu)^2}{2\sigma^2}$ | Density factor
$1\sigma$ | $-0.5$ | $\approx 0.61$
$2\sigma$ | $-2$ | $\approx 0.14$
$3\sigma$ | $-4.5$ | $\approx 0.011$

This squared, exponential decay is what makes Gaussians "tightly concentrated" — almost all the mass sits within a few $\sigma$ of the mean.

Empirical rule · 68% within $1\sigma$, 95% within $2\sigma$, 99.7% within $3\sigma$.

Normal · a worked numeric example

House prices modelled as a Normal · $\text{price} \sim \mathcal{N}(\mu, \sigma^2)$ lakh.

A house priced at the mean ($x = \mu$) is the most likely. A house priced four standard deviations away is vanishingly unlikely under this model.

This squared-distance penalty is the exact form that becomes MSE when we maximize the likelihood over data — covered in Part 4.

Multivariate Normal · briefly

For $\mathbf{x} \in \mathbb{R}^d$ · $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ ·

  • $\boldsymbol{\mu} \in \mathbb{R}^d$ — mean vector.
  • $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ — covariance matrix (symmetric, positive definite).

If $\boldsymbol{\Sigma} = \sigma^2 I$ (isotropic), the components are independent. This is the case in diffusion (L21) — every noise step samples isotropic Gaussian noise. We'll come back to this when we get there.

Why does the Normal show up everywhere?

Three deep reasons — each comes back later in the course.

  1. Central Limit Theorem (CLT) · the sum of many small independent things is Normal. So measurement noise, sensor jitter, biological variation all empirically look Gaussian.
  2. Maximum entropy · among all distributions with given mean and variance, Normal has the largest entropy → "the most agnostic choice when all you know is mean & variance."
  3. Closed under linear operations · sum, scaling, and conditioning on linear maps of Gaussians stay Gaussian. Bayesian updates with Gaussian prior + Gaussian likelihood give a Gaussian posterior — conjugacy.

These three properties together explain why the Gaussian dominates classical statistics, signal processing, diffusion models, Kalman filters, and Bayesian neural nets.

Why Normal · the Central Limit Theorem

Statement (informal) · let $X_1, \dots, X_N$ be IID with mean $\mu$ and variance $\sigma^2$. Define the standardized sum ·

$Z_N = \frac{\sum_{i=1}^N X_i - N\mu}{\sigma\sqrt{N}}$

Then $Z_N \to \mathcal{N}(0, 1)$ as $N \to \infty$, regardless of the original distribution of the $X_i$.

Implication · any quantity arising as the aggregate of many small effects looks Gaussian. Sensor noise, human height, daily temperature deviation — all approximately Normal because the underlying causes are sums of many small contributions.

CLT · a worked numeric

A clean way to see CLT in action ·

Sum of 12 IID $\text{Uniform}(0, 1)$ samples · $S = \sum_{i=1}^{12} U_i$ ·

  • Mean of $S$ · $12 \times \tfrac{1}{2} = 6$.
  • Variance of $S$ · $12 \times \tfrac{1}{12} = 1$.
  • Distribution of $S$ · approximately $\mathcal{N}(6, 1)$.

Historically used to generate Gaussian samples before better algorithms (Box-Muller) existed · subtract 6, and you get a draw from approximately $\mathcal{N}(0, 1)$.
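A minimal NumPy check of this recipe (the 5,000-repeat count mirrors the figure on the next slide; any large number works) ·

```python
import numpy as np

rng = np.random.default_rng(0)

# Sum of 12 Uniform(0,1) draws: mean 12*0.5 = 6, variance 12*(1/12) = 1.
S = rng.uniform(0.0, 1.0, size=(5000, 12)).sum(axis=1)

print(S.mean(), S.var())         # ~6.0 and ~1.0
Z = S - 6.0                      # approximately N(0, 1)
print((np.abs(Z) < 1).mean())    # ~0.68, the 1-sigma empirical rule
```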

CLT in pictures · sum of N uniforms

Sum of IID samples, repeated 5000 times. One uniform is uniform; two uniforms summed form a triangular density; by N = 30 the sum is essentially indistinguishable from a Gaussian. The CLT is not a special property of any one distribution — it's an attractor that almost any IID sum flows to.

Why Normal · maximum entropy

Entropy quantifies "how spread-out / uncertain" a distribution is. Among all distributions with a given mean $\mu$ and variance $\sigma^2$, the Normal $\mathcal{N}(\mu, \sigma^2)$ has the largest entropy.

Reading · "if all you know about a quantity is its first two moments, the least committal probability model is Gaussian." This is Occam's razor for distributions — don't bake in assumptions you can't justify.

Other max-entropy distributions

The same max-entropy principle picks out other familiar distributions when you change the constraints ·

Support | Constraint | Max-entropy distribution
$\{0, 1\}$ | given mean | Bernoulli
bounded interval $[a, b]$ | none beyond support | Uniform
$[0, \infty)$ | given mean | Exponential
$\mathbb{R}$ | given mean and variance | Normal

Each "default" distribution in classical statistics is the least committal choice given some basic constraint. This is why these distributions show up so much — they are what you get when you assume nothing extra.

Why Normal · closed under linear operations

Two structural facts make Gaussians uniquely well-behaved under linear maps ·

Sum of independent Gaussians is Gaussian. If $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$ are independent ·

$X + Y \sim \mathcal{N}(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)$

Affine transform of a Gaussian is Gaussian · $aX + b \sim \mathcal{N}(a\mu_1 + b,\ a^2\sigma_1^2)$.

No other common distribution behaves this nicely. Sum of two Bernoullis isn't Bernoulli; sum of two uniforms isn't uniform. The Gaussian is the fixed point of summing.

This is why Gaussians compound cleanly under repeated additive operations.

Why Normal · conjugacy

A second, equally powerful property ·

Conjugacy. A Gaussian prior on the mean of a Gaussian likelihood gives a Gaussian posterior with closed-form mean and variance.

No integral, no MCMC. Just two formulas.

This is why classical Bayesian regression with known noise variance is trivial — and why we'll have to work much harder for non-Gaussian posteriors (variational inference in L19).

Why Normal · where this pays off (1/2)

Two consequences you'll see in the next few weeks ·

Diffusion (L21) · the forward process adds Gaussian noise at every step. The closed-form jump

exists exactly because sums of independent Gaussians are Gaussian — no integral needed.

Kalman filter · linear-Gaussian state-space models have a closed-form posterior at every time step. Used in robotics, control, and signal processing — and a stepping-stone to L19's variational inference.

Why Normal · where this pays off (2/2)

Reparameterization trick (next!) · sampling from a non-standard Normal is just an affine transform of a standard one ·

$x = \mu + \sigma\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1)$

This is the single most important sampling trick in deep learning ·

  • VAEs (L19) · sample latent codes through it.
  • Diffusion (L21) · every denoising step uses it.
  • Bayesian neural nets · weights are sampled this way.

All three rely on Gaussians being closed under affine maps. Each lecture above is a direct consequence of the two structural properties we just stated.

A small zoo of distributions you'll meet

Name | Notation | Type | Used for
Bernoulli | $\text{Bern}(p)$ | discrete, binary | binary classification
Categorical | $\text{Cat}(p_1, \dots, p_K)$ | discrete, $K$-way | multiclass classification
Normal | $\mathcal{N}(\mu, \sigma^2)$ | continuous | regression, Gaussian noise
Laplace | $\text{Laplace}(\mu, b)$ | continuous, heavy-tail | L1 regularizer prior
Beta | $\text{Beta}(\alpha, \beta)$ | continuous on $[0, 1]$ | prior over a probability
Multinomial | $\text{Mult}(n, \mathbf{p})$ | discrete | counts in $n$ trials

You'll need the first three today. Laplace comes back when we derive L1; Beta when we add a prior to the coin.

The conditional view · model outputs a distribution

In supervised learning the model does not output a number. It outputs the parameters of a distribution over $y$ given the input $\mathbf{x}$.

Task | Model output | Conditional distribution
Linear regression | $\mu = \mathbf{w}^\top \mathbf{x}$ | $y \mid \mathbf{x} \sim \mathcal{N}(\mathbf{w}^\top \mathbf{x},\ \sigma^2)$
Logistic regression | $p = \sigma(\mathbf{w}^\top \mathbf{x})$ | $y \mid \mathbf{x} \sim \text{Bern}(p)$
$K$-class softmax | $\mathbf{p} = \text{softmax}(W\mathbf{x})$ | $y \mid \mathbf{x} \sim \text{Cat}(\mathbf{p})$

Training asks · under these conditional distributions, how likely are the labels we actually saw? Maximize that — the rest follows.

The supervised graphical model · one picture

Every supervised learning setup we'll see in this course shares the same plate diagram ·

  • $\theta$ — model parameters (we will estimate by MLE / MAP).
  • $\mathbf{x}_i$ — input, observed (filled).
  • $y_i$ — output, observed during training (filled), unknown at test time.
  • The plate says · "$N$ IID samples, all sharing the same $\theta$."

Whether you are doing logistic regression, an MLP, a Transformer, or a diffusion model — the outermost graphical model is always this. Only the conditional distribution inside the plate changes.

PART 1.5

Sampling

How we draw from distributions — and why every generative model needs it

Why we sample · two roles

Sampling appears all over deep learning, in two distinct roles ·

Role What it means Examples
Generation Produce new instances from a learned distribution LLM next-token, VAE images, diffusion, GAN
Monte Carlo estimation Approximate an expectation we can't compute analytically mini-batch SGD, REINFORCE, dropout averaging, evaluating ELBOs

These two uses are technically the same operation — draw — but conceptually different. Today we set up the primitives. The advanced uses (reparameterization in VAEs, ancestral sampling in diffusion, nucleus sampling in LLMs) all reduce to combinations of what's on the next four slides.

The master sampling primitive · inverse CDF

For any 1-D distribution with CDF ·

  1. Draw .
  2. Return .

Why it works · . ✓

Inverse CDF · worked on a Bernoulli

To sample $X \sim \text{Bernoulli}(p)$ ·

The CDF jumps from $0$ to $1 - p$ at $x = 0$, then from $1 - p$ to $1$ at $x = 1$.

Inverting ·

return 0 if u < 1 - p else 1

One uniform draw + one comparison · this is what torch.bernoulli does internally. Same algorithm extends to any 1-D distribution as long as you can compute the CDF.
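A minimal sketch of that one-liner, checked against the empirical frequency ($p = 0.6$ is just the running coin example) ·

```python
import torch

def sample_bernoulli(p: float, n: int) -> torch.Tensor:
    """Inverse-CDF sampling for Bernoulli(p): one uniform + one comparison."""
    u = torch.rand(n)                 # u ~ Uniform(0, 1)
    return (u >= 1.0 - p).long()      # 0 if u < 1-p, else 1

samples = sample_bernoulli(p=0.6, n=100_000)
print(samples.float().mean())         # ~0.6, same behaviour as torch.bernoulli
```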

Sampling from a Categorical · the algorithm

Categorical with probabilities $p_1, \dots, p_K$. The CDF is the stick-breaking cumulative ·

$c_k = \sum_{j \leq k} p_j$

Build $c_1, \dots, c_K$ (so $c_K = 1$). Draw $u \sim \text{Uniform}(0, 1)$. Return the smallest $k$ such that $u \leq c_k$.

Worked · with $K = 3$, the cumulative vector is $(c_1, c_2, c_3) = (p_1,\ p_1 + p_2,\ 1)$.

Draw $u$ · if $u$ lands in $(c_2, c_3]$, the smallest $k$ with $u \leq c_k$ is $k = 3$. Return class 3.
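A sketch of the cumulative-scan sampler; the probability vector is an assumed example, not necessarily the slide's ·

```python
import torch

def sample_categorical(probs: torch.Tensor, n: int) -> torch.Tensor:
    """Stick-breaking / inverse-CDF sampling: cumulative sum + one uniform per draw."""
    c = torch.cumsum(probs, dim=0)          # e.g. (0.2, 0.7, 1.0)
    u = torch.rand(n, 1)                    # one uniform per sample
    return (u > c).sum(dim=1)               # index of the first bucket with c_k >= u

probs = torch.tensor([0.2, 0.5, 0.3])       # assumed example vector
idx = sample_categorical(probs, 100_000)
print(torch.bincount(idx, minlength=3) / 100_000)   # ~(0.2, 0.5, 0.3), same as torch.multinomial
```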

Categorical sampling · why it matters

This is what torch.multinomial does · one uniform + one cumulative scan.

Every LLM samples its next token with exactly this algorithm.

The model produces a vector of probabilities (where vocab size, often 50,000+) and we run a single Categorical draw.

You will see this primitive in ·

  • L13–L15 · LLM token generation
  • L19 · VAE discrete latents
  • L21 · diffusion class-conditional sampling

Knowing it once means knowing it everywhere.

Sampling from a Normal · the affine trick

To sample ·

  1. Sample (e.g. via Box-Muller, or just torch.randn(...)).
  2. Return .

Why this works · the affine transform of a Gaussian is a Gaussian (the "closed under linear ops" property from earlier). If , then — exactly what we wanted.

So we only ever need a routine to draw from . Everything else is multiplication and addition.

The reparameterization trick · preview of L19

The same affine trick is one of the most important ideas in modern DL.

Let the model output $\mu_\phi(x)$ and $\sigma_\phi(x)$ as deterministic functions of an input. To sample a stochastic output ·

$z = \mu_\phi(x) + \sigma_\phi(x)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1)$

The randomness $\epsilon$ is outside the gradient path. $\mu_\phi$ and $\sigma_\phi$ are deterministic ⇒ we can backprop through the sample.

This is what makes VAEs trainable (L19) and powers the entire diffusion stack (L21–L22). You'll see this exact form repeatedly — and you now know where it comes from.
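A minimal sketch of the trick with two linear layers standing in for $\mu_\phi$ and $\sigma_\phi$ (shapes and layer choices are assumptions for illustration) ·

```python
import torch

x = torch.randn(8, 4)                       # a batch of inputs (assumed shapes)
mu_net = torch.nn.Linear(4, 2)              # deterministic mu_phi(x)
logsig_net = torch.nn.Linear(4, 2)          # deterministic log sigma_phi(x)

mu, sigma = mu_net(x), logsig_net(x).exp()
eps = torch.randn_like(mu)                  # randomness lives outside the graph
z = mu + sigma * eps                        # z ~ N(mu, sigma^2), differentiable in mu, sigma

z.sum().backward()                          # gradients flow into both networks
print(mu_net.weight.grad is not None, logsig_net.weight.grad is not None)   # True True
```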

Monte Carlo expectation · the definition

For an integral / sum we can't compute analytically ·

$\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N}\sum_{i=1}^N f(x_i), \qquad x_i \sim p$

The Monte Carlo trick · trade an integral we can't compute for a sample mean we can. Always works as long as we can sample from $p$.

Monte Carlo · three properties

  • Unbiased · $\mathbb{E}[\hat{\mu}_N] = \mathbb{E}_p[f]$ exactly.
  • Variance scales as $1/N$ (law of large numbers).
  • Standard error scales as $1/\sqrt{N}$ · quadrupling samples halves the error.

The $1/\sqrt{N}$ scaling is why Monte Carlo is slow in high dimensions but still beats numerical integration whenever the dimension is more than a handful. Variance reduction (control variates, importance sampling, antithetic variates) is an entire research area built on accelerating this rate.
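A quick check of the $1/\sqrt{N}$ behaviour on an expectation we know exactly ($\mathbb{E}[X^2] = 1$ for $X \sim \mathcal{N}(0,1)$) ·

```python
import torch

torch.manual_seed(0)
f = lambda x: x ** 2                     # E[f(X)] = Var(X) = 1 for X ~ N(0, 1)

for n in (100, 10_000, 1_000_000):
    x = torch.randn(n)
    est = f(x).mean()
    print(n, est.item(), abs(est.item() - 1.0))   # error shrinks roughly like 1/sqrt(n)
```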

Monte Carlo · hidden everywhere in DL

Almost every "loss" you'll write is secretly an expectation being approximated by a single sample ·

Where The expectation Estimated by
Mini-batch SGD one batch of size
VAE ELBO (L19) usually one sample
Diffusion loss (L21) one random + one
REINFORCE / RLHF (L16) a few sampled trajectories

Each loss above looks deterministic in code — loss = ... returns a number. But probabilistically, it's a sample-mean estimate of a deeper expectation. Variance reduction (control variates, importance sampling) is a research area for exactly this reason.

LLM token sampling · preview of L14

You now know the primitive (sampling from a Categorical). LLM generation is just Categorical sampling at every step — but with a few tweaks to control diversity ·

Strategy | What it does | When to use
Greedy ($T \to 0$) | pick the most likely token | deterministic; safe but boring
Temperature $T$ | sample from $\text{softmax}(z / T)$ | $T \to 0$ = greedy; $T = 1$ = unchanged; $T > 1$ = more random
Top-$k$ | keep the top $k$ logits, renormalize, sample | caps diversity at the top
Top-$p$ (nucleus) | keep the smallest set with cumulative prob $\geq p$ | adapts to the model's confidence

L14 covers the full story. The point today · the underlying operation is sampling from a Categorical — exactly the inverse-CDF primitive from two slides ago.
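A hedged sketch of temperature and top-$k$ on a toy logits vector (the logits are assumed values; nucleus sampling follows the same pattern with a cumulative-probability cutoff) ·

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1, -1.0])       # assumed logits over a 4-token "vocab"

def sample(logits, temperature=1.0, top_k=None):
    z = logits / temperature                        # T < 1 sharpens, T > 1 flattens
    if top_k is not None:
        kth = torch.topk(z, top_k).values[-1]       # k-th largest logit
        z = torch.where(z < kth, torch.tensor(float("-inf")), z)   # mask everything below top-k
    probs = torch.softmax(z, dim=-1)                # a Categorical over tokens
    return torch.multinomial(probs, num_samples=1).item()

print(sample(logits, temperature=0.7, top_k=2))     # one token index per call
```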

PART 2

Likelihood

The probability of the data, viewed as a function of

Coin toss · setup

You're handed a coin with unknown bias $p$. You flip it $N = 10$ times and observe ·

Six heads, four tails.

Question · what value of $p$ is most consistent with this data?

Intuition says $p \approx 0.6$. We'll make that intuition rigorous and, in doing so, write down the entire framework for everything that follows.

Likelihood · definition

The likelihood of a parameter $\theta$ given data $\mathcal{D}$ is the probability of observing $\mathcal{D}$ under that parameter ·

$L(\theta) = p(\mathcal{D} \mid \theta)$

It is not a probability over $\theta$ (we'll get that from Bayes' rule next). It is the data's probability, viewed as a function of $\theta$.

For the coin, the IID assumption gives ·

$L(p) = \prod_{i=1}^N p^{x_i}(1 - p)^{1 - x_i} = p^{n_H}(1 - p)^{n_T}$

For our data · $L(p) = p^6 (1 - p)^4$.

Likelihood · plotted

$L(p) = p^6(1 - p)^4$, plotted on $p \in [0, 1]$.

Maximum at $p = 0.6$ — which matches our intuition (6 heads out of 10).

Two problems with the raw likelihood

The raw likelihood is a product. Two practical issues follow ·

Problem 1 · numerical underflow. With thousands of data points and each factor well below 1, the product shrinks below the smallest representable double-precision float ($\approx 10^{-308}$). On a computer, $L(\theta)$ becomes literally zero.

Problem 2 · hard to differentiate. The product rule on $N$ factors produces $N$ terms — algebraically and computationally messy.

We need the same answer in a form that doesn't underflow and is easy to differentiate.

The fix · take the log

The logarithm turns products into sums and is monotonic — so maxima are preserved ·

$\log L(\theta) = \sum_{i=1}^N \log p(x_i \mid \theta)$

Now the dataset's "score" is a sum of moderate negative numbers — numerically stable and easy to differentiate term-by-term.

NLL · the convention we'll always use

We minimize the negative log-likelihood (NLL) so "loss" is something we drive down with gradient descent ·

$\text{NLL}(\theta) = -\sum_{i=1}^N \log p(x_i \mid \theta)$

We will always work with log-likelihood from this point onward.

Every loss in this course is an NLL. MSE, BCE, cross-entropy, ELBO, diffusion loss — all are just NLLs of carefully chosen distributions.

PART 3

Bayes' rule

Inverting conditional probabilities — the foundation of MAP

Conditional probability · refresher

For two events $A, B$ with $P(B) > 0$ ·

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

Read · "given that $B$ happened, the probability $A$ also happened."

Cross-multiply and you get the product rule ·

$P(A \cap B) = P(A \mid B)\,P(B)$

Bayes' rule · the formula

From the two ways to factor $P(A \cap B)$ ·

$P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$

Divide by $P(B)$ ·

$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$

This flips the conditional · if you know $P(B \mid A)$ but want $P(A \mid B)$, Bayes is the bridge.

Bayes worked example · disease test (setup)

A disease has prevalence 1%. A test has sensitivity 95% and specificity 95%.

You test positive. How likely are you to have the disease?

Let = "have disease", = "test positive".

  • Prior · ,
  • Likelihood · ,

Stop and guess before the next slide.

Bayes worked example · the answer

Evidence (total probability of testing positive) ·

$P(T) = P(T \mid D)\,P(D) + P(T \mid \neg D)\,P(\neg D) = 0.95 \times 0.01 + 0.05 \times 0.99 = 0.059$

Posterior ·

$P(D \mid T) = \frac{0.95 \times 0.01}{0.059} \approx 0.16$

Despite a "95% accurate" test, a positive result gives only a 16% chance of disease.

This is the base-rate fallacy — and it's the same maths we'll apply to ML when the prior on $\theta$ is strong but the likelihood of any single data point is weak.
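The same arithmetic in a few lines, for checking your guess ·

```python
p_d = 0.01                   # prior: prevalence
p_pos_d = 0.95               # sensitivity, P(T | D)
p_pos_nd = 0.05              # 1 - specificity, P(T | not D)

evidence = p_pos_d * p_d + p_pos_nd * (1 - p_d)     # P(T) = 0.059
posterior = p_pos_d * p_d / evidence                # P(D | T) ~= 0.161
print(evidence, posterior)
```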

Bayes for ML · flip onto

In ML, $A$ becomes the parameter $\theta$ and $B$ becomes the data $\mathcal{D}$ ·

$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\,p(\theta)}{p(\mathcal{D})}$

This is the central equation of probabilistic ML. It tells us how to update our belief about $\theta$ after seeing data $\mathcal{D}$. Each term has a name and a role.

The four terms · names, roles, colors

Term | What it is | Where it comes from
$p(\mathcal{D} \mid \theta)$ — likelihood | how plausible the data is under the model | the model
$p(\theta)$ — prior | belief about $\theta$ before seeing data | choice / domain knowledge
$p(\theta \mid \mathcal{D})$ — posterior | updated belief about $\theta$ after data | what we compute
$p(\mathcal{D})$ — evidence | a normalizer, $\int p(\mathcal{D} \mid \theta)\,p(\theta)\,d\theta$ | usually intractable, often ignored

We will keep these colors consistent for the rest of the course.

Why we usually ignore the evidence

The evidence $p(\mathcal{D})$ is a constant with respect to $\theta$. It does not change as we vary $\theta$.

For finding the single best $\theta$ (MLE or MAP), the evidence is irrelevant ·

$\arg\max_\theta p(\theta \mid \mathcal{D}) = \arg\max_\theta p(\mathcal{D} \mid \theta)\,p(\theta)$

We only need · posterior $\propto$ likelihood $\times$ prior.

The evidence becomes important only when we want a full posterior — Bayesian neural nets, model comparison, ELBO in VAEs (L19). For today, MLE + MAP, we drop it.

Bayesian updating · the dynamic view

Bayes' rule is not a one-shot operation. As more data arrives, today's posterior becomes tomorrow's prior ·

After dataset $\mathcal{D}_1$ ·

$p(\theta \mid \mathcal{D}_1) \propto p(\mathcal{D}_1 \mid \theta)\,p(\theta)$

Now observe more data $\mathcal{D}_2$, independent of $\mathcal{D}_1$ given $\theta$ ·

$p(\theta \mid \mathcal{D}_1, \mathcal{D}_2) \propto p(\mathcal{D}_2 \mid \theta)\,p(\theta \mid \mathcal{D}_1)$

The previous posterior now plays the role of prior. Bayesian inference is iterative belief updating.

Practical issue · for general distributions, the posterior may not be in the same family as the prior, so each update changes the shape of the formula. Conjugate priors (next slide) avoid this · prior and posterior stay in the same family, and updates are just parameter arithmetic.

Conjugacy · when the math stays clean

A prior is conjugate to a likelihood if the posterior belongs to the same family as the prior — only the parameters change.

Likelihood | Conjugate prior | Posterior
Bernoulli($p$) | Beta($\alpha, \beta$) | Beta($\alpha + n_H,\ \beta + n_T$)
Categorical($\mathbf{p}$) | Dirichlet($\boldsymbol{\alpha}$) | Dirichlet($\boldsymbol{\alpha} + \text{counts}$)
Normal($\mu$), known $\sigma^2$ | Normal($\mu_0, \tau^2$) | Normal (closed-form mean & variance)
Poisson($\lambda$) | Gamma($a, b$) | Gamma($a + \sum_i x_i,\ b + N$)

Conjugacy is why classical Bayesian statistics looks easy · all the integrals collapse. Once we go to deep neural nets, conjugacy breaks and we need approximations (variational inference, MCMC) — but for this lecture, conjugacy makes the coin example airtight.

Beta · prior over a probability

The Beta distribution $\text{Beta}(\alpha, \beta)$ is supported on $[0, 1]$ — perfect for putting a prior on a Bernoulli's $p$.

Mean · $\dfrac{\alpha}{\alpha + \beta}$

The shape depends on $(\alpha, \beta)$ ·

  • $\alpha = \beta = 1$ · uniform on $[0, 1]$ — completely flat prior.
  • $\alpha = \beta = 2$ · "weakly fair" — peaks at $0.5$, allows variation.
  • $\alpha = \beta$ large · sharply concentrated near $0.5$.

So $\alpha, \beta$ jointly control where the prior peaks and how concentrated it is.

Beta · the pseudo-count interpretation

There's a much more intuitive way to read ·

Think of and as "pseudo-counts" — fake prior heads and tails you've already seen before the experiment.

Examples ·

  • ≈ "I've seen 1 fake head and 1 fake tail." Mild belief in fairness.
  • ≈ "I've seen 49 fake heads and 49 fake tails." Strong belief that — would take a lot of real data to shift.
  • ≈ "I've seen 0 fake heads and 8 fake tails." Strong belief the coin is biased toward tails.

This pseudo-count framing is what makes the conjugate update so clean — the posterior just adds the real counts to the pseudo-counts. We'll see that on the next slide.

Beta-Binomial · prior and likelihood

The two ingredients ·

Prior · $p \sim \text{Beta}(\alpha, \beta)$, with density $\propto p^{\alpha - 1}(1 - p)^{\beta - 1}$

Likelihood · $n_H$ heads and $n_T$ tails in $N$ flips ·

$L(p) \propto p^{n_H}(1 - p)^{n_T}$

(we drop the binomial coefficient — it's constant in $p$).

Both are products of $p$-power times $(1 - p)$-power terms. That's the structural reason conjugacy works — multiplying them stays in the same form.

Beta-Binomial · derive the posterior

By Bayes' rule, posterior $\propto$ likelihood $\times$ prior ·

$p(p \mid \mathcal{D}) \propto p^{n_H}(1 - p)^{n_T} \cdot p^{\alpha - 1}(1 - p)^{\beta - 1}$

Combine exponents ·

$= p^{\alpha + n_H - 1}(1 - p)^{\beta + n_T - 1}$

This is the kernel of $\text{Beta}(\alpha + n_H,\ \beta + n_T)$.

The posterior is in the same family as the prior — that's what "conjugate" means. We started with a Beta, multiplied by a Bernoulli/Binomial likelihood, and got another Beta out.

Beta-Binomial · the update rule

$\text{Beta}(\alpha, \beta) \;\xrightarrow{\ n_H \text{ heads},\ n_T \text{ tails}\ }\; \text{Beta}(\alpha + n_H,\ \beta + n_T)$

Interpretation · just add the observed counts to the pseudo-counts. That's the whole update — no integral, no normalization, no MCMC.

This is why we framed $\alpha, \beta$ as "pseudo-counts" earlier. They literally combine with real counts by addition.

Sequential updating · because conjugacy preserves the family, you can update one observation at a time and the formulas stay the same. Each new flip just adds 1 to either $\alpha$ or $\beta$.

This is the cleanest possible Bayesian inference, and it's what made Bayesian statistics tractable in the pre-MCMC era. Modern DL breaks conjugacy (neural-net likelihoods aren't conjugate to anything) → we need approximations like variational inference (L19) or sampling.

Beta-Binomial · the picture

Start with a weakly-fair prior $\text{Beta}(2, 2)$. After 6 heads in 10 flips · the posterior is $\text{Beta}(8, 6)$, peaking near $0.58$. After further flips (16H, 14T in total) · $\text{Beta}(18, 16)$, even sharper and peaking near $0.53$.

The posterior gets narrower as more data arrives. With infinite data, it converges to a point mass at the true $p$ — and Bayesian / frequentist inference agree.

Beta-Binomial · numeric update (work it out)

Prior · $\text{Beta}(2, 2)$ · pseudo-counts 1 head, 1 tail · prior mean $0.5$.

Observe · $n_H = 6$ heads in $N = 10$ flips · MLE $\hat{p} = 0.6$.

Step 1 · update parameters

$\alpha' = 2 + 6 = 8, \qquad \beta' = 2 + 4 = 6$

Step 2 · posterior summaries

Posterior mean $= \frac{8}{14} \approx 0.571$ · posterior mode (MAP) $= \frac{7}{12} \approx 0.583$

The posterior mean lives between the prior mean ($0.5$) and the MLE ($0.6$) — exactly what "shrinkage" means. With more data, MAP $\to$ MLE; with less data, the prior pulls the estimate toward $0.5$.
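The same update as code, using the Beta(2, 2) prior and 6-heads-in-10-flips data assumed above ·

```python
alpha, beta = 2.0, 2.0            # Beta(2, 2) prior ("1 fake head, 1 fake tail")
n_heads, n_flips = 6, 10

alpha_post = alpha + n_heads                  # 8
beta_post = beta + (n_flips - n_heads)        # 6

post_mean = alpha_post / (alpha_post + beta_post)              # ~0.571
post_mode = (alpha_post - 1) / (alpha_post + beta_post - 2)    # ~0.583 (= MAP)
mle = n_heads / n_flips                                        # 0.6
print(post_mean, post_mode, mle)   # posterior mean sits between prior mean 0.5 and MLE 0.6
```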

Estimator #1 · Maximum Likelihood (MLE)

The simplest possible estimator · ignore the prior, just maximize the likelihood ·

$\hat{\theta}_{\text{MLE}} = \arg\max_\theta\ p(\mathcal{D} \mid \theta)$

Read · "what value of $\theta$ best explains the data we actually saw?"

This is what we'll spend Part 4 of this lecture on. Coin → linear regression → logistic regression → multiclass — all derive their loss as the negative log-likelihood under MLE.

Estimator #2 · Maximum a Posteriori (MAP)

The Bayesian-flavoured estimator · use the prior, maximize the full posterior ·

$\hat{\theta}_{\text{MAP}} = \arg\max_\theta\ p(\theta \mid \mathcal{D}) = \arg\max_\theta\ p(\mathcal{D} \mid \theta)\,p(\theta)$

Read · "given my prior belief and the data, what's the most probable $\theta$?"

This is Part 5 (Session 2). MAP = MLE plus a regularizer that comes from the log-prior. With a Gaussian prior we'll recover L2; with a Laplace prior we'll recover L1. Same machinery, different prior.

MLE vs MAP · the master sentence

MAP = MLE + a prior on $\theta$.

That single sentence is what makes regularization fall out of the same machinery as the loss. There is no separate "loss" and "regularizer" theory — it's all one Bayesian story, and L2/L1 are just choices of prior.

When the data is plentiful, the likelihood dominates and MAP $\approx$ MLE. When data is scarce, the prior matters — and that's exactly when overfitting bites and regularization helps. Bayes' rule formalizes this automatically.

PART 4

Maximum Likelihood Estimation

Concrete derivations · coin · linear regression · logistic regression

Pop quiz · which course did this student come from?

Three sections of the same exam · grades modelled as Normal ·

Course Mean Std
C1 80 10
C2 70 10
C3 90 5

A student's marks · 82. Which course is this student most likely from?

Stop and guess before the next slide.

Answer · pick the distribution that best explains the data

Evaluate the density of $x = 82$ under each course's Normal ·

$p(82 \mid 80, 10) \approx 0.039 \qquad p(82 \mid 70, 10) \approx 0.019 \qquad p(82 \mid 90, 5) \approx 0.022$

Course C1 wins · its bell is centred close to 82 with reasonable spread.

MLE intuition · among candidate distributions, choose the one under which the observed data has the highest probability.

This is the entire idea. Everything below is just doing it carefully.

MLE for the coin · setup

Data · $n_H$ heads out of $N$ flips. Likelihood ·

$L(p) = p^{n_H}(1 - p)^{N - n_H}$

Take the log ·

$\log L(p) = n_H \log p + (N - n_H)\log(1 - p)$

Step 1 · differentiate ·

$\frac{d}{dp}\log L(p) = \frac{n_H}{p} - \frac{N - n_H}{1 - p}$

MLE for the coin · solve

Step 2 · set the derivative to zero ·

$\frac{n_H}{p} = \frac{N - n_H}{1 - p}$

Expand · $n_H(1 - p) = (N - n_H)\,p \;\Rightarrow\; \hat{p}_{\text{MLE}} = \frac{n_H}{N}$.

For our data · $\hat{p} = 6/10 = 0.6$. Exactly the empirical frequency.

This is your first MLE derivation. Same recipe applies to everything below.
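A sketch of the same MLE found numerically with torch.distributions, in the spirit of the notebook teaser later; the exact flip ordering is an assumption — only the counts matter ·

```python
import torch
from torch.distributions import Bernoulli

data = torch.tensor([1., 1., 1., 0., 1., 0., 1., 0., 1., 0.])   # 6 heads in 10 flips (assumed order)

p = torch.tensor(0.3, requires_grad=True)      # any starting point in (0, 1)
opt = torch.optim.Adam([p], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    nll = -Bernoulli(probs=p).log_prob(data).sum()   # negative log-likelihood
    nll.backward()
    opt.step()
    with torch.no_grad():
        p.clamp_(1e-4, 1 - 1e-4)               # keep p a valid probability

print(p.item())   # ~0.6 = n_H / N, the closed-form MLE
```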

The MLE recipe · 3 steps

To find the MLE for any model ·

  1. Pick a probabilistic model — choose a distribution that matches your output type (Bernoulli for binary, Normal for continuous, …).
  2. Write the log-likelihood of the dataset under that model · sum the per-example .
  3. Maximize over — by setting derivative to zero analytically, or by gradient ascent (equivalently, gradient descent on the negative log-likelihood).

We will now apply this recipe to linear regression (where the answer pops out as MSE) and logistic regression (where it pops out as BCE).

MLE for linear regression · the assumption

Modelling choice · the target is a linear function plus Gaussian noise ·

$y = \mathbf{w}^\top \mathbf{x} + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$

Equivalently · the conditional distribution of $y$ given $\mathbf{x}$ is

$y \mid \mathbf{x} \sim \mathcal{N}(\mathbf{w}^\top \mathbf{x},\ \sigma^2)$

The model's prediction $\mathbf{w}^\top \mathbf{x}$ is the mean of the Gaussian. Real $y$ scatters around it with variance $\sigma^2$.

Linear regression as MLE · the picture

Each data point is one draw from a Gaussian whose mean lies on the regression line. MLE asks · which line makes the observed 's most probable?

Equivalently · which line minimizes the squared distance from each point to the line — exactly the OLS objective. That equivalence is the whole derivation.

This is the only modelling choice. Everything else is algebra.

MLE for linear regression · log-likelihood

For a single example,

$\log p(y_i \mid \mathbf{x}_i, \mathbf{w}) = -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2\sigma^2}$

Sum over the dataset · the log of a product of IID terms is a sum ·

$\log L(\mathbf{w}) = -\tfrac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - \mathbf{w}^\top \mathbf{x}_i)^2$

Only the second term depends on $\mathbf{w}$.

MLE for linear regression · MSE pops out

Maximizing over $\mathbf{w}$ ·

$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}}\ \Bigl[-\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - \mathbf{w}^\top \mathbf{x}_i)^2\Bigr]$

The factor $-\frac{1}{2\sigma^2}$ is a negative constant — flip the sign and minimize ·

$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}}\ \sum_{i=1}^N (y_i - \mathbf{w}^\top \mathbf{x}_i)^2$

This is exactly MSE. MSE is not a heuristic — it is the MLE under Gaussian noise.

If the noise had been Laplace ($\epsilon \sim \text{Laplace}(0, b)$), the same derivation would give MAE (mean absolute error) instead. Loss design = noise model.

Closed-form OLS · for completeness

The MSE objective is quadratic in $\mathbf{w}$, so we can set the gradient to zero analytically.

Stack data · $X \in \mathbb{R}^{N \times d}$, $\mathbf{y} \in \mathbb{R}^N$. Then ·

$\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}$

The familiar normal equation is the closed-form MLE under Gaussian noise. SGD / gradient descent gives the same answer iteratively.

Worked OLS by hand · setup

Let $d = 1$ (single feature, no bias for clarity). Fit a scalar slope · $\hat{y} = w x$.

Data ·

We want the $w$ that minimizes $\sum_i (y_i - w x_i)^2$. With one parameter, the closed-form OLS becomes a simple ratio · $\hat{w} = \dfrac{\sum_i x_i y_i}{\sum_i x_i^2}$.

Let's compute it step by step.

Worked OLS by hand · solve

Step 1 · denominator · $\sum_i x_i^2$

Step 2 · numerator · $\sum_i x_i y_i$

Step 3 · solve · $\hat{w} = \dfrac{\sum_i x_i y_i}{\sum_i x_i^2}$

That's the OLS estimate by hand. No matrix inversion needed for $d = 1$ — just a ratio.

Worked OLS by hand · verify

Predictions under ·

Residuals ·


Sum-of-squared residuals · .

Sanity check · the MSE gradient at should be zero.

This is exactly the answer torch.linalg.lstsq returns. We just did it by hand to feel the closed form.
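Since the slide's numbers did not survive these notes, here is the same check on an assumed toy dataset — the ratio formula against torch.linalg.lstsq ·

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])        # assumed toy data
y = torch.tensor([2.1, 3.9, 6.2])

w_hat = (x * y).sum() / (x ** 2).sum()    # single-feature OLS: a ratio, no matrix inverse
print(w_hat)                              # ~2.04

# Same answer from the general least-squares solver.
w_lstsq = torch.linalg.lstsq(x.unsqueeze(1), y.unsqueeze(1)).solution
print(w_lstsq.squeeze())
```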

MLE for logistic regression · the assumption

Now $y \in \{0, 1\}$. Model ·

$p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i) = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}_i}}$

Per-example probability ·

$P(y_i \mid \mathbf{x}_i) = p_i^{\,y_i}(1 - p_i)^{1 - y_i}$

This is the same compact Bernoulli form as the coin — except now $p_i$ depends on $\mathbf{x}_i$.

MLE for logistic regression · log-likelihood

Take the log ·

$\log P(y_i \mid \mathbf{x}_i) = y_i \log p_i + (1 - y_i)\log(1 - p_i)$

where $p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i)$.

Sum over the dataset and negate to get NLL ·

$\text{NLL}(\mathbf{w}) = -\sum_{i=1}^N \bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr]$

This is exactly binary cross-entropy. Same story · BCE is not invented — it's the MLE under a Bernoulli output.
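A two-line check that PyTorch's BCE-with-logits is literally the Bernoulli NLL (the logits and labels are assumed values) ·

```python
import torch
import torch.nn.functional as F
from torch.distributions import Bernoulli

logits = torch.tensor([2.0, -1.0, 0.5])        # assumed w^T x values for three examples
y = torch.tensor([1.0, 0.0, 1.0])

bce = F.binary_cross_entropy_with_logits(logits, y, reduction="sum")
nll = -Bernoulli(logits=logits).log_prob(y).sum()
print(bce.item(), nll.item())                   # identical: BCE is the Bernoulli NLL
```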

MLE for logistic regression · gradient

Differentiate w.r.t. $\mathbf{w}$. Using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and the chain rule (derivation in any ML textbook) ·

$\nabla_{\mathbf{w}}\,\text{NLL} = \sum_{i=1}^N (p_i - y_i)\,\mathbf{x}_i$

The gradient is "prediction minus truth, weighted by input." This is the exact same form as the linear regression gradient $\sum_i (\hat{y}_i - y_i)\,\mathbf{x}_i$.

That is not a coincidence — it's a feature of generalized linear models, all derived from MLE.

Multiclass · the assumption

Now $y \in \{1, \dots, K\}$ — one of $K$ mutually exclusive classes. We need the model to output a full probability distribution over the $K$ classes. We use the softmax.

For each class $k$, learn a weight vector $\mathbf{w}_k$. Compute logits $z_k = \mathbf{w}_k^\top \mathbf{x}$ for $k = 1, \dots, K$.

Softmax turns logits into probabilities ·

$p_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$

Two properties · $p_k > 0$ (because $e^{z} > 0$) and $\sum_k p_k = 1$. So $\mathbf{p}$ is a valid distribution.

Modelling assumption · $y \mid \mathbf{x} \sim \text{Categorical}(\mathbf{p})$. Binary logistic is the special case $K = 2$ (with the softmax collapsing to a sigmoid).

Multiclass · the per-example log-likelihood

If example $i$ has true class $c_i$, the probability the model assigns to that label is $p_{i, c_i}$ — i.e. the entry of the softmax at position $c_i$.

Equivalently, with one-hot encoding $\mathbf{y}_i$ ·

$P(y_i \mid \mathbf{x}_i) = \prod_{k=1}^K p_{i,k}^{\,y_{i,k}}$

Take logs ·

$\log P(y_i \mid \mathbf{x}_i) = \sum_{k=1}^K y_{i,k}\log p_{i,k} = \log p_{i, c_i}$

Same compact-Bernoulli trick as before — only one term in the sum survives because the one-hot vector has a single 1.

Multiclass · NLL = categorical cross-entropy

Sum over the dataset and negate ·

$\text{NLL} = -\sum_{i=1}^N \log p_{i, c_i} = -\sum_{i=1}^N \sum_{k=1}^K y_{i,k}\log p_{i,k}$

The right-hand form is the textbook categorical cross-entropy — true distribution $\mathbf{y}_i$ (one-hot) against predicted distribution $\mathbf{p}_i$.

This loss powers every multiclass classifier in this course · CIFAR-10 ($K = 10$), ImageNet ($K = 1000$), and the next-token loss in every LLM ($K$ = vocab size, often 50,000+) in L13–L15.

Multiclass · the elegant gradient

Differentiate w.r.t. the logit $z_{i,k}$ (derivation uses the softmax-Jacobian trick) ·

$\frac{\partial\,\text{NLL}_i}{\partial z_{i,k}} = p_{i,k} - y_{i,k}$

Prediction minus truth, per logit. Same elegant form as binary logistic ($(p_i - y_i)\,\mathbf{x}_i$) and linear regression ($(\hat{y}_i - y_i)\,\mathbf{x}_i$). All three are instances of the same generalized-linear-model identity.

This is why softmax + cross-entropy are always implemented as one fused op (F.cross_entropy(logits, target)) — both for numerical stability (log-sum-exp trick) and because the gradient simplifies dramatically when you do them together.

Multiclass · worked numeric

3 classes (cat, dog, car). Image with true class cat (index 0). Model produces logits .

Softmax (subtract max for stability) · .
· sum .
.

Per-example loss · NLL of the true class.

Gradient on logits · .

The gradient says · push the cat-logit up (negative gradient → the optimizer subtracts it, so cat-logit increases), push the dog and car logits down. Exactly what you want.
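A sketch verifying both the loss and the $\mathbf{p} - \mathbf{y}$ gradient with F.cross_entropy; the logits are assumed values since the slide's numbers did not survive ·

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1], requires_grad=True)   # assumed logits (cat, dog, car)
target = torch.tensor(0)                                      # true class: cat

loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))   # fused log-softmax + NLL
loss.backward()

probs = torch.softmax(logits.detach(), dim=0)
print(loss.item())                 # == -log p[0]
print(logits.grad)                 # == probs - one_hot(target): "prediction minus truth"
print(probs - F.one_hot(target, 3).float())
```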

Summary so far · the pattern

Output Distribution chosen NLL turns out to be
Continuous Normal MSE
Binary Bernoulli BCE
-class Categorical CE

Every loss = NLL under an assumed conditional distribution .

Pick the distribution to match your data. The loss falls out automatically. No more "memorize MSE for regression and CE for classification" — both come from the same place.

A worked numeric · BCE on one prediction

Cat-vs-dog classifier · model output for one image.

true class log-likelihood NLL = loss model "happy"?
1 (cat) yes — small loss
0 (dog) no — big loss

The loss is small iff the model assigned high probability to the true class. That's all cross-entropy is doing — and that's all "maximizing log-likelihood" means once you write it out.

Session 1 · putting it all together

  1. A model defines a conditional distribution $p(y \mid x, \theta)$.

  2. Likelihood of a dataset · $\log p(\mathcal{D} \mid \theta) = \sum_i \log p(y_i \mid x_i, \theta)$ — take $\log$ and sum.

  3. MLE · $\hat{\theta} = \arg\max_\theta \log p(\mathcal{D} \mid \theta)$.

  4. Plug in the right distribution and the right loss falls out automatically ·

    • Normal → MSE
    • Bernoulli → BCE
    • Categorical → Cross-entropy
    • Poisson + exp link → Poisson loss
  5. Bayes' rule turns this into a belief-update story · posterior $\propto$ likelihood $\times$ prior. With a conjugate Beta prior, updates are just adding counts.

Session 2 will turn the prior into regularization and KL divergence into the single language used by every model in the rest of the course.

Session 1 · practice problems

Try these on paper; answers worked through in the notebook (lec00-mle-map.ipynb).

P1. A coin gives 12 heads in 20 flips. Compute the MLE for and the negative log-likelihood at that estimate.

P2. Show that for $y_i = w x_i + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ and known $\sigma^2$, the MLE for $w$ is $\hat{w} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$. (Hint · single-feature case of OLS.)

P3. Write down the conditional distribution and the per-example log-likelihood for Poisson regression ($y_i \sim \text{Poisson}(\lambda_i)$, $\lambda_i = e^{\mathbf{w}^\top \mathbf{x}_i}$). What is the resulting NLL loss?

P4. For a 3-class softmax with logits and true class , compute (a) the predicted probabilities, (b) the cross-entropy loss, (c) the gradient on each logit.

P5. A coin's true bias is . You see 0 heads in 5 flips. What is the MLE? Why is the answer absurd, and what would a prior change?

End of Session 1

So far · probabilistic framework + MLE. Bernoulli, Categorical, Normal · likelihood and log-likelihood · sampling primitives · Bayes' rule with its four named terms · Beta-Binomial conjugate updates · MLE for the coin, linear regression, logistic regression, and multiclass — all derived as NLL of a chosen distribution.

Session 2 picks up from here · MAP and regularization (L2 from Gaussian prior, L1 from Laplace), KL divergence and information theory as the unifying lens, and how it all fans out into VAEs, diffusion, RLHF, and the rest of the course.

Welcome to Session 2

Recap of session 1 · the model defines a distribution, and MLE = NLL minimization. Today · what changes when we add a prior — and why KL divergence is the right language for everything that comes next.

PART 5

MAP and the meaning of regularization

L2 from a Gaussian prior · L1 from a Laplace prior

Why we need a prior · MLE's trap

You flip a coin 3 times and see 3 heads. MLE says $\hat{p} = 3/3 = 1$.

According to MLE, this coin is certainly biased to always land heads. Future tails impossible.

This is absurd. Three flips is not enough evidence to make such an extreme claim. You know most coins are roughly fair.

The fix · encode that prior knowledge into the inference. That gives MAP.

In ML, the analogous trap is overfitting · with finite data, MLE drives weights to whatever value most exactly fits the training set, even if those values are wildly extreme. A prior pulls them back.

MAP · maximum a posteriori

By Bayes, .

In log space ·

Equivalently, minimize the negative ·

The first term is your usual loss. The second term is whatever the prior gives. That second term is what we will recognize as L1 or L2.

MAP · the geometric picture

The MAP estimate is the point in -space that best balances the data's preferences (likelihood) against your prior beliefs (prior).

Gaussian prior · setup

We choose a prior $\theta_j \sim \mathcal{N}(0, \tau^2)$ for every weight, independently. In words · "a priori, weights are small and centred at zero."

For a single weight ·

$\log p(\theta_j) = -\frac{\theta_j^2}{2\tau^2} + \text{const}$

For the whole vector (independent prior on each component) ·

$\log p(\boldsymbol{\theta}) = -\frac{1}{2\tau^2}\sum_j \theta_j^2 + \text{const}$

The constant doesn't depend on $\boldsymbol{\theta}$, so it drops out of the optimization.

Gaussian prior · L2 pops out

Plug into the MAP objective ·

$\hat{\boldsymbol{\theta}}_{\text{MAP}} = \arg\min_{\boldsymbol{\theta}}\ \bigl[\text{NLL}(\boldsymbol{\theta}) + \lambda\sum_j \theta_j^2\bigr], \qquad \lambda = \frac{1}{2\tau^2}$

This is exactly L2 regularization (a.k.a. ridge, weight decay).

  • Strong prior (small $\tau$) ⇒ large $\lambda$ ⇒ heavy penalty ⇒ weights pulled hard to zero.
  • Weak prior (large $\tau$) ⇒ small $\lambda$ ⇒ MAP $\approx$ MLE.

Worked numeric · MAP with L2

Linear regression. Suppose at the MLE estimate the NLL takes some value and the weight vector has a given squared norm $\|\boldsymbol{\theta}\|_2^2$.

Pick a $\lambda$ (equivalently, a prior width $\tau$ — a fairly weak prior).

The optimizer now pays a price $\lambda\|\boldsymbol{\theta}\|_2^2$ for large weights. Since the gradient of $\lambda\|\boldsymbol{\theta}\|_2^2$ is $2\lambda\boldsymbol{\theta}$, every gradient step shrinks the weights by a factor $(1 - 2\eta\lambda)$ before the data update. This is exactly weight decay — and you'll see this exact form again in Adam vs AdamW (L5).

Laplace prior · setup

Now choose a prior $\theta_j \sim \text{Laplace}(0, b)$ — same idea (centred at zero) but heavier tails and a sharper peak at zero.

The Laplace density ·

$p(\theta) = \frac{1}{2b}\exp\!\left(-\frac{|\theta|}{b}\right)$

For one weight ·

$\log p(\theta_j) = -\frac{|\theta_j|}{b} + \text{const}$

For the whole vector with independent components ·

$\log p(\boldsymbol{\theta}) = -\frac{1}{b}\sum_j |\theta_j| + \text{const}$

Note the absolute value in the log — that's the structural difference from the Gaussian's squared term.

Laplace prior · L1 pops out

Plug into MAP ·

$\hat{\boldsymbol{\theta}}_{\text{MAP}} = \arg\min_{\boldsymbol{\theta}}\ \bigl[\text{NLL}(\boldsymbol{\theta}) + \lambda\sum_j |\theta_j|\bigr], \qquad \lambda = \frac{1}{b}$

This is exactly L1 regularization (a.k.a. lasso).

  • Same machinery (MAP), different prior, different penalty.
  • The Laplace density is "sharply peaked at zero with heavy tails" — and that geometry is what makes L1 produce sparse solutions.

Why L1 produces sparse solutions · the geometry

Think of MAP as minimizing the loss subject to the prior. In 2D ·

  • L2 ball — circle. The data-loss contour can touch it anywhere on the circle. Generic touchpoints have both coordinates non-zero. Solutions are shrunk but rarely zero.
  • L1 ball — diamond. Corners stick out on the axes. Generic data-loss contours hit the diamond at a corner, where one or more coordinates are exactly zero. Solutions are sparse.

L1's sparsity is a direct consequence of the diamond geometry of the Laplace prior — corners that lie exactly on the coordinate axes are the most likely touchpoints. Move to high dimensions and there are exponentially many corners → many features get zeroed simultaneously.

L1 vs L2 · summary table

 | L2 (Ridge) | L1 (Lasso)
Prior | Gaussian $\mathcal{N}(0, \tau^2)$ | Laplace$(0, b)$
Log-prior penalty | $\lambda\sum_j \theta_j^2$ | $\lambda\sum_j |\theta_j|$
Geometry | circle | diamond
Solution | small everywhere | many exactly zero
Gradient of penalty | $2\lambda\theta$ (smooth) | $\lambda\,\text{sign}(\theta)$ (kink at 0)
When to use | dense problems, "shrink everything" | feature selection, "kill weak features"

L1 and L2 are the same MAP machinery — they differ only in the prior over . Pick your prior, get your regularizer.

A coin · MLE vs MAP

Back to the coin · 3 heads, 0 tails. MLE says $\hat{p} = 1$ (absurd).

Suppose your prior is "the coin is probably near 0.5" — say $p \sim \text{Beta}(2, 2)$, with density $\propto p(1 - p)$. MAP combines this with the likelihood ·

Posterior ∝ Likelihood × Prior ·

$p(p \mid \mathcal{D}) \propto p^3 \cdot p(1 - p) = p^4(1 - p)$

Maximize · $\log = 4\log p + \log(1 - p)$. Set the derivative to zero · $\frac{4}{p} = \frac{1}{1 - p} \Rightarrow \hat{p}_{\text{MAP}} = \frac{4}{5} = 0.8$.

MAP says $\hat{p} = 0.8$ — sensible. Three heads is some evidence the coin is biased, but the prior keeps us from going all the way to 1.0.

This is the same regularization story as L1/L2 on weights — just with a different distribution.

PART 6

Where this matters · the rest of the course

NLL is the loss everywhere · KL connects · VAE/diffusion/GANs are MAP++

Every loss is an NLL — the master table

Output Distribution Loss = NLL Lecture
Real-valued Normal MSE L1 (recap), L19 (VAE recon)
Binary Bernoulli BCE L1 (recap)
K classes Categorical Cross-entropy L7+ (vision), L13–15 (LLMs)
Pixels per-pixel Normal per-pixel MSE L19 (VAE), L21 (diffusion)
Tokens Categorical next-token CE L13–L15 (LLMs)
Image patch given noise Normal in pixel/score space MSE on noise L21 (diffusion)
Latent variable model Normal + KL prior ELBO = recon + KL L19 (VAE)
Two distributions match KL minimization DPO, distillation L16, L23

The whole course will keep instantiating the same NLL recipe. Each new model just changes which distribution is being assumed.

Information content · the surprise of a single outcome

Before defining entropy, define the information content (or "surprise") of one outcome ·

Why this exact formula? Three axioms we want to satisfy ·

  1. depends only on (not on itself).
  2. is decreasing in — rare events are more informative.
  3. is additive for independent events · .

The only function satisfying all three is for (Shannon 1948). With and , the unit is bits.

Surprise · "snowing in Kashmir vs Gandhinagar"

Same headline · "It snowed today." Two locations ·

  • Kashmir in January · snow is roughly a coin flip, $p \approx 0.5$ · $I = 1$ bit. Mild surprise.
  • Gandhinagar in January · snow is essentially unheard of, $p$ is tiny · $I$ is many bits. Front-page news.

The same event carries different surprise depending on the distribution generating it. That is exactly what formalizes.

This is also why log-likelihood works as a model-quality signal · a model that places high probability on the data has low per-sample surprise.

Information content · worked examples

Event | $I(x)$ | Interpretation
Fair coin lands heads | $-\log_2 0.5 = 1$ bit | one yes/no question
Roll a 6 on a fair die | $-\log_2 \tfrac{1}{6} \approx 2.6$ bits | "less than 3 yes/no questions worth"
Win a 1-in-1024 lottery | $-\log_2 \tfrac{1}{1024} = 10$ bits | extremely surprising
The sun rises tomorrow | $\approx 0$ bits | no information (already certain)
Sample a specific token from a 50k vocab | $-\log_2 \tfrac{1}{50000} \approx 15.6$ bits | one token ≈ one short word in English

Reading · large = "I learned a lot from observing ." Small = "I expected this, no information gained."

This is the per-sample log-loss in disguise. Cross-entropy is just the average of these surprises.

Entropy · the definition

The entropy of a distribution $p$ is the expected information of a draw from $p$ ·

$H(p) = \mathbb{E}_{x \sim p}\bigl[-\log p(x)\bigr] = -\sum_x p(x)\log p(x)$

In base 2, $H$ is in bits · the average number of bits the optimal code spends per sample from $p$.

So entropy is what you get when you average the per-sample surprise from the previous slides. Big entropy = on average, draws from $p$ are surprising. Small entropy = mostly predictable.

Entropy as code length · the Huffman lens

Imagine you must send samples from $p$ over a wire using a prefix-free binary code. Shannon's source-coding theorem says the best achievable average code length is $H(p)$ bits per symbol.

Concrete · alphabet $\{A, B, C, D\}$ with $p = (\tfrac12, \tfrac14, \tfrac18, \tfrac18)$ ·

Symbol | Optimal code | Code length
A | 0 | 1
B | 10 | 2
C | 110 | 3
D | 111 | 3

Average length $= \tfrac12 \cdot 1 + \tfrac14 \cdot 2 + \tfrac18 \cdot 3 + \tfrac18 \cdot 3 = 1.75$ bits $= H(p)$ exactly.

Entropy = the cost of describing samples from optimally. That's the right physical interpretation.

Entropy · worked numerics in bits

Distribution | $H$ (bits) | Computation
Fair coin | $1$ | $-2 \times 0.5\log_2 0.5$
Biased coin ($p$ near 0 or 1) | $< 1$ | mostly predictable flips
Uniform over 8 outcomes | $3$ | $\log_2 8$
One-hot (deterministic) | $0$ | nothing to encode

Entropy peaks for uniform distributions and is zero for deterministic ones. A fair coin needs 1 bit per flip; a near-deterministic coin needs ~0 bits per flip; a uniform-over-$K$ distribution needs $\log_2 K$ bits.

This explains why low-entropy classifier outputs (confident) need fewer bits to encode than high-entropy outputs (uncertain) — and why temperature in LLMs (which controls output entropy) controls "creativity vs determinism."
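The table's entries in a few lines (the biased-coin probability 0.9 is an assumed example) ·

```python
import torch

def entropy_bits(p: torch.Tensor) -> float:
    p = p[p > 0]                               # 0 log 0 = 0 by convention
    return float(-(p * torch.log2(p)).sum())

print(entropy_bits(torch.tensor([0.5, 0.5])))          # 1.0 bit   (fair coin)
print(entropy_bits(torch.tensor([0.9, 0.1])))          # ~0.47 bits (biased coin)
print(entropy_bits(torch.full((8,), 1 / 8)))           # 3.0 bits  (uniform over 8)
print(entropy_bits(torch.tensor([1.0, 0.0])))          # 0.0 bits  (deterministic)
```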

Entropy of a Bernoulli · in pictures

$H(p) = -p\log_2 p - (1 - p)\log_2(1 - p)$ — concave, symmetric around $p = 0.5$, peaking at 1 bit.

A fair coin needs one bit to encode each flip. A coin with $p$ close to 0 or 1 is almost deterministic — encoded with arithmetic coding it costs nearly 0 bits/flip on average.

Entropy worked · Categorical (3-class)

Let $\mathbf{p}$ vary over a 3-class Categorical and compute $H(\mathbf{p})$ in bits ·

Distribution | $H$ (bits) | Interpretation
$(1, 0, 0)$ — deterministic | $0$ | already certain
peaked | low | confident model
moderate | middling | mixed evidence
$(\tfrac13, \tfrac13, \tfrac13)$ — uniform | $\log_2 3 \approx 1.58$ | maximum

The maximum entropy of a $K$-class Categorical is $\log_2 K$, achieved by the uniform. A confident classifier has low-entropy predictions; a confused one has high-entropy predictions. This connects directly to the temperature $T$ in LLM sampling — high $T$ pushes the categorical toward uniform (high entropy), low $T$ sharpens it (low entropy).

Entropy in the continuous case · differential entropy

For a continuous random variable, replace the sum with an integral ·

$h(p) = -\int p(x)\log p(x)\,dx$

(Lower-case $h$ by convention to distinguish from the discrete case. Differential entropy can be negative — it's not a "bits per sample" quantity in the same sense.)

For a Normal ·

$h\bigl(\mathcal{N}(\mu, \sigma^2)\bigr) = \tfrac12\log\bigl(2\pi e\,\sigma^2\bigr)$

Reading · the differential entropy of a Gaussian only depends on $\sigma$ (not on $\mu$). It is maximal among all continuous distributions on $\mathbb{R}$ with that fixed variance — recovering the max-entropy reason "Why Normal" from earlier.

Worked · $\sigma = 1$ gives $h = \tfrac12\log(2\pi e) \approx 1.42$ nats $\approx 2.05$ bits.

Mutual information · briefly (will return in L17)

For two random variables with joint $p(x, y)$ ·

$I(X; Y) = \mathbb{E}_{p(x, y)}\!\left[\log\frac{p(x, y)}{p(x)\,p(y)}\right] = \text{KL}\bigl(p(x, y)\,\|\,p(x)\,p(y)\bigr)$

Read · "how much $X$ and $Y$ tell us about each other." Always $\geq 0$, equal to 0 iff $X \perp Y$.

Why we'll care ·

  • Self-supervised learning (L17) · contrastive losses are lower bounds on mutual information between augmented views (InfoNCE).
  • Information bottleneck · representation learning as minimizing $I(X; Z)$ subject to keeping $I(Z; Y)$ high.
  • Capacity of a channel · classical result with the same formula.

We won't compute directly today, but you'll see it surface in L17.

KL divergence · the definition

KL divergence measures how different one distribution $p$ is from another distribution $q$ over the same support ·

$\text{KL}(p\,\|\,q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$

(Replace the sum with an integral for continuous distributions.)

Read · "the average log-ratio $\log\frac{p(x)}{q(x)}$ when the data really comes from $p$ but you used $q$ to model it."

It's the average extra surprise you experience by encoding samples-from-$p$ using a code optimized for $q$.

KL · three properties

  1. $\text{KL}(p\,\|\,q) \geq 0$ for all $p, q$. (Gibbs' inequality — proof in two slides.)
  2. $\text{KL}(p\,\|\,q) = 0$ if and only if $p = q$ everywhere.
  3. Asymmetric · in general $\text{KL}(p\,\|\,q) \neq \text{KL}(q\,\|\,p)$.

Property 1 is what makes KL a sensible loss-like quantity — minimize it and you get to zero only when distributions match.

Property 2 says KL = 0 is the unique minimum. So minimizing KL is the right objective for matching distributions.

Property 3 is the surprising one. KL is not a metric — which direction you take matters. We'll see this matters a lot in "forward vs reverse KL" later.

KL worked · two Bernoullis

Let $p = \text{Bern}(p_1)$ (true) and $q = \text{Bern}(q_1)$ (model). Then ·

$\text{KL}(p\,\|\,q) = p_1\log\frac{p_1}{q_1} + (1 - p_1)\log\frac{1 - p_1}{1 - q_1}$

In nats · evaluate with the natural log.

In bits · divide by $\ln 2$.

Asymmetry check · computing $\text{KL}(q\,\|\,p)$ gives a close but different value. KL is not a metric.
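Since the slide's two probabilities did not survive, here is the same computation on assumed values, including the asymmetry check ·

```python
import torch
from torch.distributions import Bernoulli, kl_divergence

p = Bernoulli(probs=torch.tensor(0.5))     # assumed "true" coin
q = Bernoulli(probs=torch.tensor(0.7))     # assumed "model" coin

kl_pq = kl_divergence(p, q)                # in nats
kl_qp = kl_divergence(q, p)
print(kl_pq.item(), kl_pq.item() / torch.log(torch.tensor(2.0)).item())   # nats, bits
print(kl_qp.item())                        # different from kl_pq: KL is asymmetric
```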

KL worked · two Categoricals (3-class)

True . Two candidate models ·

Model A · — close to truth.

nats.

Model B · — very wrong on class 3.


nats.

A roughly-aligned model has a small KL; a wildly mismatched one has a large KL. KL ≈ 0 ⇒ the two distributions agree; large KL ⇒ they disagree, especially in regions where $p$ has mass but $q$ doesn't (mode-covering penalty).

KL as wasted code length · the Huffman lens

Same alphabet $\{A, B, C, D\}$. True $p = (\tfrac12, \tfrac14, \tfrac18, \tfrac18)$ — Huffman code lengths $(1, 2, 3, 3)$ · optimal cost $1.75$ bits.

Suppose we mistakenly built our code for a different distribution $q$ — the code lengths are now tuned to $q$, not $p$.

When we encode samples from $p$ with the code built for $q$, the average cost exceeds $H(p)$ by exactly $\text{KL}(p\,\|\,q)$ bits per sample.

Reading · KL is the literal number of extra bits per sample you pay for using the wrong code. Modelling = compression · this is why "better model = better compression" is not a metaphor.

Why KL ≥ 0 · Jensen's inequality (proof)

Jensen's inequality · for any concave function $f$ and random variable $X$,

$\mathbb{E}[f(X)] \leq f\bigl(\mathbb{E}[X]\bigr)$

Since $\log$ is concave ·

$-\text{KL}(p\,\|\,q) = \mathbb{E}_{x\sim p}\!\left[\log\frac{q(x)}{p(x)}\right] \leq \log \mathbb{E}_{x\sim p}\!\left[\frac{q(x)}{p(x)}\right] = \log\sum_x q(x) = \log 1 = 0$

So $-\text{KL}(p\,\|\,q) \leq 0$, i.e. $\text{KL}(p\,\|\,q) \geq 0$. Equality iff $\frac{q(x)}{p(x)}$ is constant — i.e. $p = q$. ∎

This is the full proof of Gibbs' inequality. Jensen's inequality is the same tool used to derive the ELBO in VAEs (L19). Same trick, same place — once you see it once, you see it everywhere.

Cross-entropy = entropy + KL

Algebra ·

$H(p, q) \;=\; -\sum_x p(x)\log q(x) \;=\; H(p) + \text{KL}(p\,\|\,q)$

  • Cross-entropy $H(p, q)$ · "expected bits to encode samples from $p$ if we used a code optimized for $q$."
  • Entropy $H(p)$ · "the irreducible cost of encoding $p$." Independent of $q$.
  • KL · the extra bits we waste because $q \neq p$.

For classification with a one-hot true label $p$, $H(p) = 0$ — so cross-entropy is KL. That's why we say "the classifier loss minimizes KL to the true label."

Cross-entropy worked · the one-hot collapse

In classification, the truth $p$ is one-hot and the model output $q$ is a softmax.

For any one-hot $p$, $H(p) = 0$ — there's no uncertainty in the truth. So the cross-entropy formula collapses ·

$H(p, q) = -\sum_k p_k \log q_k = -\log q_c \qquad (c = \text{the true class})$

Cross-entropy of a one-hot truth against a softmax model is exactly the NLL of the model on the true class. This is the standard classification loss.

It's also exactly what L1 derived from a Bernoulli/Categorical assumption — same answer through two different lenses (NLL or KL). On the next slide we tabulate it across confidence levels.

Cross-entropy · table across confidence levels

3 classes, true class is 1 (so $p = (1, 0, 0)$). Vary the model output $q$ and read off the loss $-\log q_1$ ·

Model (nats) "Sentiment"
confidently right · tiny loss
mostly right
uncertain (≈ uniform)
confidently wrong · big loss

The loss is small iff the model assigned high probability to the true class and grows steeply when the model is confidently wrong. This asymmetric penalty is what makes cross-entropy a proper scoring rule — it rewards calibration, not just accuracy.

The classifier's training loss is just the average of this column over the dataset.

MLE through the KL lens · setup

Define the empirical data distribution as the histogram of the training set, treated as a distribution ·

$\hat{p}_{\text{data}}(x) = \frac{1}{N}\sum_{i=1}^N \mathbb{1}[x = x_i]$

This is just a discrete probability mass with weight $\frac{1}{N}$ on each observed point.

Now the average log-likelihood becomes an expectation under $\hat{p}_{\text{data}}$ ·

$\frac{1}{N}\sum_{i=1}^N \log p_\theta(x_i) \;=\; \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\bigl[\log p_\theta(x)\bigr] \;=\; -H\bigl(\hat{p}_{\text{data}},\, p_\theta\bigr)$

So the score we already maximize is the negative cross-entropy between the empirical distribution and the model.

MLE through the KL lens · result

Use $H(p, q) = H(p) + \text{KL}(p\,\|\,q)$ and drop $H(\hat{p}_{\text{data}})$ as constant in $\theta$ ·

$\hat{\theta}_{\text{MLE}} = \arg\min_\theta\ \text{KL}\bigl(\hat{p}_{\text{data}}\,\|\,p_\theta\bigr)$

MLE = make the model distribution as KL-close as possible to the empirical distribution.

Same machinery, two-distribution view. Every classifier you've trained with cross-entropy has been silently minimizing this KL — to the one-hot empirical distribution of class labels.

This is also why MLE is mode-covering (forward KL) — covered in the "forward vs reverse KL" slide.

MAP through the KL lens

Adding the log-prior to the previous slide ·

$\hat{\theta}_{\text{MAP}} = \arg\min_\theta\ \Bigl[\text{KL}\bigl(\hat{p}_{\text{data}}\,\|\,p_\theta\bigr) - \tfrac{1}{N}\log p(\theta)\Bigr]$

L2 regularization through this lens · "minimize KL-to-empirical, plus pay for straying from a Gaussian prior."

This is the fit + don't drift in KL template that recurs throughout the course.

KL regularization · everywhere in DL

The same "fit + don't drift in KL" pattern appears in many places ·

Method | Loss structure | Reading
L2 regularization | NLL $+\ \lambda\|\theta\|_2^2$ | don't drift from the Gaussian prior
VAE (L19) | reconstruction $+\ \text{KL}\bigl(q_\phi(z \mid x)\,\|\,\mathcal{N}(0, I)\bigr)$ | encoder posterior close to a standard normal
DPO / RLHF (L16) | reward $-\ \beta\,\text{KL}\bigl(\pi\,\|\,\pi_{\text{ref}}\bigr)$ | policy close to the base model
Distillation (L23) | $\text{KL}\bigl(p_{\text{teacher}}\,\|\,p_{\text{student}}\bigr)$ | student matches teacher

Every regularizer in modern DL is some form of "don't drift too far in KL" from a reference distribution. Once you see the pattern, you stop memorizing losses and start deriving them.

KL is asymmetric · two directions, two problems

We've already seen that $\text{KL}(p\,\|\,q) \neq \text{KL}(q\,\|\,p)$. So which direction you minimize is itself a modelling choice — and the two directions optimize for very different things.

  • Forward KL · $\text{KL}(p_{\text{data}}\,\|\,q_\theta)$ — "average over the truth $p$."
  • Reverse KL · $\text{KL}(q_\theta\,\|\,p_{\text{data}})$ — "average over the model $q$."
Both are valid. Both are used in DL. Knowing which one your loss optimizes tells you what failure mode to expect. Next two slides unpack each.

Forward KL · mode-covering

Read · "average the log-ratio over the truth ."

Failure mode · the expectation is taken over , so a region where but contributes a huge penalty (log of a tiny number).

Consequence · mode-covering. is forced to put mass everywhere does. If has two modes, has to span both. Single-Gaussian fit to a bimodal target → blurry middle.

This is what MLE optimizes — recall MLE = . So MLE-trained models are mode-covering by construction.

Reverse KL · mode-seeking

Read · "average the log-ratio over the model ."

Failure mode · the expectation is taken over , so a region where but contributes a huge penalty. The safest move for is to put mass only where is large.

Consequence · mode-seeking. concentrates on a single high-density region. Two-mode target → picks one mode and ignores the other.

This is approximately what VAE encoders, GAN training, and policy distillation optimize. Sharper samples, but mode collapse is a real risk.

Forward vs reverse KL · the picture

Same bimodal target (grey shaded). Fit a single Gaussian by minimizing each direction of KL ·

  • Forward KL spreads the Gaussian across both modes — covers all data, but spends mass on the gap between modes (blurry samples). Why VAE samples are blurry.
  • Reverse KL concentrates on one mode — sharp samples but ignores the other half of the distribution. Why GANs mode-collapse.

The two errors are opposite failure modes. Knowing which KL direction your loss optimizes tells you which failure mode to expect.

KL between two Gaussians · the closed form

For $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$, the KL is available in closed form ·

$\text{KL}(p\,\|\,q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$

For the special case $q = \mathcal{N}(0, 1)$ ·

$\text{KL}\bigl(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, 1)\bigr) = \frac{1}{2}\bigl(\mu^2 + \sigma^2 - \log\sigma^2 - 1\bigr)$

This is the exact term that shows up in the VAE loss (L19) — pulling the encoder's posterior toward a standard-normal prior. Memorize this form; it'll save you 5 minutes of staring when you re-derive it later.

Worked · see the numeric check in the sketch below.
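A numeric check of the special-case formula against torch.distributions ($\mu = 1$, $\sigma = 2$ are assumed example values) ·

```python
import torch
from torch.distributions import Normal, kl_divergence

mu, sigma = torch.tensor(1.0), torch.tensor(2.0)        # assumed example values

# Closed form against the standard normal: 0.5 * (mu^2 + sigma^2 - log sigma^2 - 1)
closed = 0.5 * (mu**2 + sigma**2 - torch.log(sigma**2) - 1)
auto = kl_divergence(Normal(mu, sigma), Normal(0.0, 1.0))
print(closed.item(), auto.item())                        # both ~1.31 nats
```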

KL across the rest of the course

Once you see KL as the underlying object, every advanced model becomes a specific KL minimization ·

Model | KL it minimizes | Lecture
Classifier (CE loss) | $\text{KL}(\text{one-hot labels}\,\|\,\text{softmax output})$ | L1, every classifier
VAE | $\text{KL}\bigl(q_\phi(z \mid x)\,\|\,p(z)\bigr)$ + recon-NLL | L19
Diffusion (variational view) | KL between the forward noising and learned reverse processes | L21
GAN (approximately) | Jensen–Shannon — a symmetrized KL | L20
DPO / RLHF | $\text{KL}\bigl(\pi\,\|\,\pi_{\text{ref}}\bigr)$ term | L16
Knowledge distillation | $\text{KL}\bigl(p_{\text{teacher}}\,\|\,p_{\text{student}}\bigr)$ | L23

You will see KL terms in every generative or alignment loss for the next 24 lectures. Each is an instance of what we just derived.

Session 2 · practice problems

Try these on paper; the notebook has plotting and verification.

P6. For linear regression with NLL and weights , compute the MAP loss with (a) L2 penalty , (b) L1 penalty . Which weights does L1 push hardest?

P7. Compute using the closed form. Interpret the answer.

P8. Show that for a one-hot true label $p$ and softmax prediction $q$, the cross-entropy equals $-\log q_c$, where $c$ is the true class. (Hint · entropy of a one-hot is zero.)

P9. A bimodal target has modes at with equal mass. You fit a single Gaussian . Sketch the optimum under (a) forward KL, (b) reverse KL. Which one is mode-covering?

P10. Coin flipped 10 times, observed 8 heads. Starting from prior, write down the Beta posterior and its mean. How does the posterior mean compare to the MLE ? Why does it differ?

Foreshadow · how this lecture powers DL

Every advanced model in this course uses MLE (or MAP) under a specific distribution ·

Model Output distribution Loss = NLL of … Lecture
LLM (next-token) Categorical over vocab L13–L15
VAE Normal pixel decoder + Gaussian latent reconstruction NLL + KL to prior L19
GAN implicit generator min–max over L20
Diffusion Gaussian forward process MSE on predicted noise L21–L22
RLHF / DPO Bradley–Terry preference pair log-sigmoid of reward gap L16

You now know what all of these losses are. They're NLLs.

Notebook teaser · MLE & MAP in PyTorch

We will pair this lecture with a notebook (lec00-mle-map.ipynb) that walks through ·

  1. MLE for a coin with torch.distributions.Bernoulli.
  2. MLE for linear regression with torch.distributions.Normal — recover OLS.
  3. MLE for logistic regression with BCEWithLogitsLoss — show it equals NLL of Bernoulli.
  4. MAP for linear regression with Gaussian prior — recover ridge regression.
  5. MAP for linear regression with Laplace prior — recover lasso, see sparsity emerge as $\lambda$ grows.
  6. Visualize likelihood and posterior surfaces in 2D for a tiny example.

Same code skeleton, three lines change between MLE and MAP. That's the punchline.

Common questions · FAQ

Q. Do we always have to pick a distribution before training?
A. Yes — implicitly or explicitly. When you write MSE, you have implicitly assumed Gaussian noise. When you write BCE, Bernoulli. The conscious choice is what we're advocating today.

Q. Why minimize NLL instead of maximize LL?
A. Pure convention. ML libraries minimize. Negate, minimize, same answer.

Q. What if my output isn't Bernoulli/Categorical/Gaussian?
A. Pick whatever distribution fits — Poisson for counts, Beta for probabilities, mixture-of-Gaussians for multimodal data. The recipe (NLL = loss) is universal.

Q. How does this connect to Bayesian deep learning?
A. Bayesian DL keeps the full posterior over instead of a single point estimate. We won't go there in this course, but everything we do today is the entry point.

Lecture 0 — summary

We made the probabilistic framework concrete ·

  • Conditional view · the model outputs a distribution $p(y \mid x, \theta)$, not a number.
  • Bayes' rule for parameters · $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\,p(\theta)$ — four named terms.
  • MLE · ignore the prior, maximize the log-likelihood.
    • MSE = MLE under Gaussian noise.
    • BCE / CE = MLE under Bernoulli / Categorical outputs.
  • MAP · MLE + a prior on $\theta$; the log-prior becomes a regularizer.
    • L2 = MAP with a Gaussian prior on weights.
    • L1 = MAP with a Laplace prior — sparsity from the diamond geometry.
  • NLL is the loss everywhere. Every advanced model in this course will be an instance of this recipe.

Read before Lecture 1

Strang Ch 1 (vectors / matrices) · Bishop & Bishop §2.1–2.3 (probability primer) · §4.1–4.3 (linear regression as MLE) · §5.4 (logistic regression).

Next lecture

Why deep learning — why these tools alone aren't enough at scale, and what depth and non-linearity buy us.