Loss · mean squared error
Train · two equivalent options · the closed-form normal equation, or gradient descent.
Open question · you probably justified MSE as "penalize big errors more than small ones." True — but why squared and not absolute or cubed? We'll see today.
Same setup as linear regression, but now the target is binary · $y \in \{0, 1\}$.
The sigmoid maps any real number to $(0, 1)$ · $\sigma(z) = \frac{1}{1 + e^{-z}}$.
The model output $\hat{y} = \sigma(\theta^\top x)$ is read as $p(y = 1 \mid x)$.
Loss · binary cross-entropy (BCE)
Train · gradient descent (no closed form because of the sigmoid).
Open question · you justified BCE as "big penalty when confidently wrong." True again — but why this exact form, and not some other function that also punishes confident mistakes?
When the model overfits, you added a penalty term:
L2 (ridge) · add $\lambda \sum_j \theta_j^2$ to the loss.
L1 (lasso) · add $\lambda \sum_j \lvert\theta_j\rvert$ to the loss.
You probably learned · L2 shrinks all weights toward zero, while L1 drives many weights to exactly zero.
But why does L1 hit zero and L2 doesn't? Why is the penalty squared in L2 and absolute in L1? We'll derive both from first principles.
| Object | Recipe | Where does it come from? |
|---|---|---|
| MSE | $\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$ | ? |
| BCE | $-\frac{1}{N}\sum_i \left[ y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \right]$ | ? |
| L2 | $\lambda \sum_j \theta_j^2$ | ? |
| L1 | $\lambda \sum_j \lvert \theta_j \rvert$ | ? |
You've used all four — but none was ever derived from anything. They were handed to you with the magic words "this is the loss for regression".
Today we replace the magic with a single principle.
All four mysteries are derived consequences of two ideas · (1) the model outputs a probability distribution over the target, and (2) training maximizes the likelihood of the observed data — with priors on the parameters supplying the regularizers.
That's the whole lecture in two bullets. Everything else is unpacking.
The same machinery gives us KL divergence — the natural distance between distributions.
KL becomes the central object in · VAEs, diffusion, RLHF/DPO, and distillation — most of the second half of the course.
One framework today, ten lectures of dividends.
The model doesn't predict a number — it predicts a distribution
A random variable is a quantity whose value is uncertain; its distribution assigns a probability to each possible value.
Read · "$X \sim p$" as "$X$ is distributed according to $p$".
Two flavours, depending on the type of value · discrete (probability mass function) and continuous (probability density function).
We'll use both. The notation is mostly the same.
Most distributions have parameters — knobs that shape the distribution. We collect them into a single symbol $\theta$.
Read · "$p(x \mid \theta)$" as "the probability of $x$ given the parameters $\theta$".
The vertical bar "$\mid$" separates the random quantity from the parameters we condition on.
The parameter symbol $\theta$ stands for whatever knobs the distribution has ·
Coin · $\theta = p$, the probability of heads.
Normal · $\theta = (\mu, \sigma^2)$, the mean and variance.
Categorical · $\theta = (\pi_1, \dots, \pi_K)$, one probability per class.
In ML, $\theta$ is the full set of model weights — from a handful for linear regression to billions for an LLM.
A dataset $\mathcal{D} = \{x_1, \dots, x_N\}$ is assumed to consist of draws that are independent of each other and all come from the same distribution (IID).
These two assumptions together give us the product factorization · $p(\mathcal{D} \mid \theta) = \prod_{i=1}^N p(x_i \mid \theta)$.
This product is what becomes a sum after taking logs — and what becomes the summed loss over a dataset in every training loop. IID is the formal license to add up per-example losses.
When IID fails (time series, video frames, sensor logs from one device) we need different math · autoregressive models, state-space models, etc. For this course, treat batches as IID.
Outcome · $x \in \{0, 1\}$, with a single parameter $p \in [0, 1]$.
Probability mass function · $P(X = 1) = p$, $P(X = 0) = 1 - p$.
Two outcomes, two probabilities, summing to 1. This is the simplest non-trivial distribution.
Examples · email is spam (Y=1) or not (Y=0) · patient has disease or not · pixel is foreground or background.
We can fold the two cases of the PMF into a single expression · $p(x) = p^x (1 - p)^{1 - x}$.
Sanity check · plug in $x = 1$ to get $p$; plug in $x = 0$ to get $1 - p$.
This compact form is what makes the per-example log-likelihood $x \log p + (1 - x)\log(1 - p)$ so easy to write down — the germ of BCE.
Mean · $\mathbb{E}[X] = p$.
Variance · $\mathrm{Var}(X) = p(1 - p)$.
Variance is largest at $p = 0.5$ (maximal uncertainty) and zero at $p = 0$ or $p = 1$.
This will be reused when we derive logistic regression's gradient — it has a $\sigma(z)(1 - \sigma(z))$ term (the derivative of the sigmoid), the same $p(1 - p)$ shape.
Setup · coin with unknown bias $p$, flipped $N$ times.
Under the IID assumption · $p(x_1, \dots, x_N \mid p) = \prod_{i=1}^N p^{x_i}(1 - p)^{1 - x_i}$.
The product over independent observations is the heart of likelihood — coming up in Part 2 when we ask "which $p$ makes this data most probable?"
We now have one concrete distribution (Bernoulli). A clean way to draw a probabilistic model is plate notation — the standard for the rest of this course.
| Symbol | Meaning |
|---|---|
| ○ | a random variable (uncertain) |
| ● (filled) | an observed random variable (we see its value) |
| arrow | direct dependence — the child's distribution is conditioned on the parent |
| rectangle (plate) labelled $N$ | the contents are repeated $N$ times |
These four symbols compose every probabilistic model in this course. Bayesian networks, HMMs, VAEs, diffusion models — all drawn with these conventions.
Apply the conventions to the simplest model · the coin.
A single Bernoulli observation · the parameter $p$ feeds into an observed (shaded) node $x$.
For $N$ flips, put the observed node inside a plate labelled $N$.
The plate says "draw a fresh $x_n$ for each $n = 1, \dots, N$, all sharing the same $p$."
Outcome · $x \in \{1, \dots, K\}$, with parameters $\pi_1, \dots, \pi_K$.
Probability mass function · $P(X = k) = \pi_k$, with $\pi_k \ge 0$ and $\sum_k \pi_k = 1$.
One-hot compact form · let $\mathbf{x}$ be the one-hot vector for the outcome; then $p(\mathbf{x}) = \prod_{k=1}^K \pi_k^{x_k}$.
The product collapses · only one $x_k$ equals 1, so $p(\mathbf{x}) = \pi_{\text{true class}}$.
Mean · $\mathbb{E}[\mathbf{x}] = (\pi_1, \dots, \pi_K)$.
MNIST classifier outputs a 10-dimensional probability vector — a Categorical over the digits 0–9.
The model says
If the true label is
A perfect model would put all mass on class 2 (i.e. $\pi_2 = 1$).
The softmax output of any classifier IS a Categorical distribution. Treat it that way and the loss falls out automatically.
Continuous · $x \in \mathbb{R}$, with parameters $\mu$ (mean) and $\sigma^2$ (variance).
Probability density function (PDF) · $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$.
Mean & variance · $\mathbb{E}[X] = \mu$, $\mathrm{Var}(X) = \sigma^2$.
The most important continuous distribution in all of statistics — and the seed of the MSE loss.
Property 2 is what we'll lean on most. Let's unpack it.
Property 2 says density falls off exponentially in the squared distance from the mean, $(x - \mu)^2 / \sigma^2$.
| Distance from mean | Exponent $-\frac{d^2}{2\sigma^2}$ | Density factor (relative to peak) |
|---|---|---|
| $1\sigma$ | $-0.5$ | $\approx 0.61$ |
| $2\sigma$ | $-2$ | $\approx 0.14$ |
| $3\sigma$ | $-4.5$ | $\approx 0.011$ |
This squared, exponential decay is what makes Gaussians "tightly concentrated" — almost all the mass sits within a few
Empirical rule · 68% within $1\sigma$, 95% within $2\sigma$, 99.7% within $3\sigma$ of the mean.
House prices modelled as
| Sample value | Density | Distance from mean |
|---|---|---|
A house priced at the mean (
This squared-distance penalty sitting inside the exponent is the seed of the MSE loss.
For
If
Three deep reasons — each comes back later in the course.
These three properties together explain why the Gaussian dominates classical statistics, signal processing, diffusion models, Kalman filters, and Bayesian neural nets.
Statement (informal) · let $X_1, \dots, X_N$ be IID random variables with mean $\mu$ and finite variance $\sigma^2$.
Then as $N \to \infty$, the standardized sum $\frac{1}{\sqrt{N}} \sum_{i=1}^N (X_i - \mu)$ converges in distribution to $\mathcal{N}(0, \sigma^2)$ — regardless of the distribution of the $X_i$.
Implication · any quantity arising as the aggregate of many small effects looks Gaussian. Sensor noise, human height, daily temperature deviation — all approximately Normal because the underlying causes are sums of many small contributions.
A clean way to see CLT in action · add up uniforms.
Sum of 12 IID $\mathrm{Uniform}(0,1)$ variables has mean $6$ and variance $12 \times \frac{1}{12} = 1$.
Historically used to generate Gaussian samples before better algorithms (Box–Muller) existed · subtract 6, and you get an approximate draw from $\mathcal{N}(0, 1)$.
Sum of more and more independent terms looks more and more Gaussian — see the sketch below.
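A minimal NumPy sketch of the 12-uniform trick (sample size and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Each Uniform(0,1) has mean 1/2 and variance 1/12, so the sum of 12 of them
# has mean 6 and variance 1. Subtracting 6 gives an approximate N(0, 1) draw.
samples = rng.uniform(0.0, 1.0, size=(100_000, 12)).sum(axis=1) - 6.0

print(samples.mean())   # close to 0
print(samples.std())    # close to 1
# Compare the histogram against rng.standard_normal(100_000) to see the match.
```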
Entropy quantifies "how spread-out / uncertain" a distribution is. Among all distributions on the real line with a given mean $\mu$ and variance $\sigma^2$, the Normal has the maximum entropy.
Reading · "if all you know about a quantity is its first two moments, the least committal probability model is Gaussian." This is Occam's razor for distributions — don't bake in assumptions you can't justify.
The same max-entropy principle picks out other familiar distributions when you change the constraints ·
| Support | Constraint | Max-entropy distribution |
|---|---|---|
| $\{0, 1\}$ | given mean | Bernoulli |
| Bounded interval | none beyond support | Uniform |
| $[0, \infty)$ | given mean | Exponential |
| $\mathbb{R}$ | given mean and variance | Normal |
Each "default" distribution in classical statistics is the least committal choice given some basic constraint. This is why these distributions show up so much — they are what you get when you assume nothing extra.
Two structural facts make Gaussians uniquely well-behaved under linear maps ·
Sum of independent Gaussians is Gaussian. If $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$ are independent, then $X + Y \sim \mathcal{N}(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)$.
Affine transform of a Gaussian is Gaussian. If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $aX + b \sim \mathcal{N}(a\mu + b,\ a^2\sigma^2)$.
No other common distribution behaves this nicely. Sum of two Bernoullis isn't Bernoulli; sum of two uniforms isn't uniform. The Gaussian is the fixed point of summing.
This is why Gaussians compound cleanly under repeated additive operations.
A second, equally powerful property ·
Conjugacy. A Gaussian prior combined with a Gaussian likelihood (known noise variance) yields a Gaussian posterior, with a closed-form mean and variance.
No integral, no MCMC. Just two formulas.
This is why classical Bayesian regression with known noise variance is trivial — and why we'll have to work much harder for non-Gaussian posteriors (variational inference in L19).
Two consequences you'll see in the next few weeks ·
Diffusion (L21) · the forward process adds Gaussian noise at every step. The closed-form jump $q(x_t \mid x_0)$ exists exactly because sums of independent Gaussians are Gaussian — no integral needed.
Kalman filter · linear-Gaussian state-space models have a closed-form posterior at every time step. Used in robotics, control, and signal processing — and a stepping-stone to L19's variational inference.
Reparameterization trick (next!) · sampling from a non-standard Normal is just an affine transform of a standard one · $x = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$.
This is the single most important sampling trick in deep learning · it lets gradients flow through a sampling step.
All three rely on Gaussians being closed under affine maps. Each lecture above is a direct consequence of the two structural properties we just stated.
| Name | Notation | Type | Used for |
|---|---|---|---|
| Bernoulli | $\mathrm{Bern}(p)$ | discrete, binary | binary classification |
| Categorical | $\mathrm{Cat}(\pi_1, \dots, \pi_K)$ | discrete, $K$ outcomes | multiclass classification |
| Normal | $\mathcal{N}(\mu, \sigma^2)$ | continuous | regression, Gaussian noise |
| Laplace | $\mathrm{Laplace}(\mu, b)$ | continuous, heavy-tail | L1 regularizer prior |
| Beta | $\mathrm{Beta}(\alpha, \beta)$ | continuous on $[0, 1]$ | prior over a probability |
| Multinomial | $\mathrm{Mult}(n, \pi)$ | discrete | counts in $n$ trials |
You'll need the first three today. Laplace comes back when we derive L1; Beta when we add a prior to the coin.
In supervised learning the model does not output a number. It outputs the parameters of a distribution over $y$ given $x$ · $p(y \mid x, \theta)$.
| Task | Model output | Conditional distribution |
|---|---|---|
| Linear regression | $\hat{y} = \theta^\top x$ | $y \mid x \sim \mathcal{N}(\theta^\top x,\ \sigma^2)$ |
| Logistic regression | $\hat{y} = \sigma(\theta^\top x)$ | $y \mid x \sim \mathrm{Bern}(\sigma(\theta^\top x))$ |
Training asks · under these conditional distributions, how likely are the labels we actually saw? Maximize that — the rest follows.
Every supervised learning setup we'll see in this course shares the same plate diagram ·
Whether you are doing logistic regression, an MLP, a Transformer, or a diffusion model — the outermost graphical model is always this. Only the conditional distribution $p(y \mid x, \theta)$ changes.
How we draw from distributions — and why every generative model needs it
Sampling appears all over deep learning, in two distinct roles ·
| Role | What it means | Examples |
|---|---|---|
| Generation | Produce new instances from a learned distribution | LLM next-token, VAE images, diffusion, GAN |
| Monte Carlo estimation | Approximate an expectation we can't compute analytically | mini-batch SGD, REINFORCE, dropout averaging, evaluating ELBOs |
These two uses are technically the same operation — draw $x \sim p$ — but they serve different purposes.
For any 1-D distribution with CDF $F$ · if $U \sim \mathrm{Uniform}(0, 1)$, then $X = F^{-1}(U)$ has CDF $F$.
Why it works · $P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x)$.
To sample $X \sim \mathrm{Bern}(p)$ · draw $u \sim \mathrm{Uniform}(0, 1)$ and threshold.
The CDF jumps from $0$ to $1 - p$ at $x = 0$, and from $1 - p$ to $1$ at $x = 1$.
Inverting ·
return 0 if u < 1 - p else 1
One uniform draw + one comparison · this is what torch.bernoulli does internally. Same algorithm extends to any 1-D distribution as long as you can compute the CDF.
Categorical with probabilities $\pi_1, \dots, \pi_K$.
Build the cumulative sums $c_k = \pi_1 + \dots + \pi_k$, draw $u \sim \mathrm{Uniform}(0, 1)$, and return the smallest $k$ with $u \le c_k$.
Worked · draw one $u$, scan the cumulative sums left to right, and stop at the first bin whose cumulative probability reaches $u$ — that bin is the sample.
This is what torch.multinomial does · one uniform + one cumulative scan.
Every LLM samples its next token with exactly this algorithm.
The model produces a vector of logits over the vocabulary; softmax turns them into Categorical probabilities; emitting the next token is one inverse-CDF draw.
You will see this primitive in · LLM decoding, action sampling in RL, and every discrete generative model in this course.
Knowing it once means knowing it everywhere.
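A minimal NumPy sketch of the inverse-CDF primitive for a Categorical (the probabilities are illustrative; torch.multinomial is the library equivalent):

```python
import numpy as np

def sample_categorical(pi, rng):
    """Inverse-CDF sampling: one uniform draw + one cumulative scan."""
    u = rng.uniform()
    cdf = np.cumsum(pi)                  # c_k = pi_1 + ... + pi_k
    cdf[-1] = 1.0                        # guard against floating-point round-off
    return int(np.searchsorted(cdf, u))  # smallest k with u <= c_k

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])
draws = [sample_categorical(pi, rng) for _ in range(10_000)]
print(np.bincount(draws) / len(draws))   # empirical frequencies ~ [0.2, 0.5, 0.3]
```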
To sample $x \sim \mathcal{N}(\mu, \sigma^2)$ · draw $\epsilon \sim \mathcal{N}(0, 1)$ (torch.randn(...)) and set $x = \mu + \sigma\epsilon$.
Why this works · the affine transform of a Gaussian is a Gaussian (the "closed under linear ops" property from earlier). If $\epsilon \sim \mathcal{N}(0, 1)$, then $\mu + \sigma\epsilon \sim \mathcal{N}(\mu, \sigma^2)$.
So we only ever need a routine to draw from $\mathcal{N}(0, 1)$ — everything else is a shift and a scale.
The same affine trick is one of the most important ideas in modern DL.
Let the model output $\mu_\theta(x)$ and $\sigma_\theta(x)$, and sample $z = \mu_\theta(x) + \sigma_\theta(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
The randomness $\epsilon$ lives outside the parameters, so gradients flow through $\mu_\theta$ and $\sigma_\theta$ even though $z$ is a sample.
This is what makes VAEs trainable (L19) and powers the entire diffusion stack (L21–L22). You'll see this exact form repeatedly — and you now know where it comes from.
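A minimal PyTorch sketch of the reparameterized sample (the parameter values and the downstream loss are placeholders):

```python
import torch

mu = torch.tensor([0.5], requires_grad=True)
log_sigma = torch.tensor([-1.0], requires_grad=True)

# Reparameterized sample: the randomness lives in eps, not in the parameters,
# so gradients flow back into mu and log_sigma.
eps = torch.randn(1)
z = mu + torch.exp(log_sigma) * eps     # z ~ N(mu, sigma^2)

loss = (z ** 2).sum()                   # any downstream loss would do
loss.backward()
print(mu.grad, log_sigma.grad)          # well-defined gradients through the sample
```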
For an integral / sum we can't compute analytically · $\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{S}\sum_{s=1}^{S} f(x^{(s)})$, with $x^{(s)} \sim p$.
The Monte Carlo trick · trade an integral we can't compute for a sample mean we can. Always works as long as we can sample from
The estimate is unbiased, and its standard error shrinks like $1/\sqrt{S}$ as the number of samples grows.
Almost every "loss" you'll write is secretly an expectation being approximated by a single sample ·
| Where | The expectation | Estimated by |
|---|---|---|
| Mini-batch SGD | $\mathbb{E}_{(x, y) \sim \hat{p}_{\text{data}}}[\ell_\theta(x, y)]$ | one batch of size $B$ |
| VAE ELBO (L19) | $\mathbb{E}_{z \sim q_\phi(z \mid x)}[\,\cdot\,]$ | usually one $z$ sample |
| Diffusion loss (L21) | $\mathbb{E}_{t,\,\epsilon}[\,\cdot\,]$ | one random $t$ and one noise draw per image |
| REINFORCE / RLHF (L16) | $\mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ | a few sampled trajectories |
Each loss above looks deterministic in code — loss = ... returns a number. But probabilistically, it's a sample-mean estimate of a deeper expectation. Variance reduction (control variates, importance sampling) is a research area for exactly this reason.
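A minimal sketch of a Monte Carlo estimate (the target expectation, $\mathbb{E}[x^2]$ for $x \sim \mathcal{N}(0,1)$, is chosen only because its true value, 1, is known):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(num_samples):
    """Estimate E[x^2] for x ~ N(0, 1) by a sample mean (true value: 1)."""
    x = rng.standard_normal(num_samples)
    return np.mean(x ** 2)

for s in (1, 10, 1_000, 100_000):
    print(s, mc_estimate(s))   # noisy for small s, converges toward 1.0
```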
You now know the primitive (sampling from a Categorical). LLM generation is just Categorical sampling at every step — but with a few tweaks to control diversity ·
| Strategy | What it does | When to use |
|---|---|---|
| Greedy ($\arg\max$) | Pick the most likely token | Deterministic; safe but boring |
| Temperature $T$ | Sample from $\mathrm{softmax}(\text{logits}/T)$ | $T < 1$ sharpens (more deterministic), $T > 1$ flattens (more diverse) |
| Top-$k$ | Keep the $k$ most likely tokens, renormalize, sample | Caps diversity at the top |
| Top-$p$ (nucleus) | Keep the smallest set with cumulative prob $\ge p$, renormalize, sample | Adapts to the model's confidence |
L14 covers the full story. The point today · the underlying operation is sampling from a Categorical — exactly the inverse-CDF primitive from two slides ago.
The probability of the data, viewed as a function of the parameters, is the likelihood · $L(\theta) = p(\mathcal{D} \mid \theta)$.
You're handed a coin with unknown bias $p$ and flip it 10 times ·
Six heads, four tails.
Question · what value of $p$ best explains what you saw?
Intuition says $p = 0.6$ — the empirical frequency. Let's see why the likelihood agrees.
The likelihood of a parameter value is the probability of the observed data under that value, read as a function of the parameter · $L(p) = p(\mathcal{D} \mid p)$.
It is not a probability over $p$ — it need not integrate to 1 as $p$ varies.
For the coin, IID assumption gives · $L(p) = \prod_{i=1}^{10} p^{x_i}(1 - p)^{1 - x_i}$.
For our data · $L(p) = p^6 (1 - p)^4$.
Maximum at $p = 0.6$ — exactly the intuitive answer. The sketch below checks it numerically.
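A quick numerical check of the coin likelihood (grid resolution is arbitrary):

```python
import numpy as np

# Likelihood of the 6-heads / 4-tails dataset as a function of p.
p = np.linspace(1e-6, 1 - 1e-6, 10_001)
likelihood = p**6 * (1 - p)**4

print(p[np.argmax(likelihood)])   # ~0.6, the MLE
print(likelihood.max())           # ~1.19e-3 — already a small number for N = 10
```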
The raw likelihood is a product of $N$ numbers, each less than 1.
Problem 1 · numerical underflow. With $N$ in the thousands, the product quickly drops below the smallest representable double-precision float ($\approx 10^{-308}$).
Problem 2 · hard to differentiate. The product rule applied to $N$ factors is a combinatorial mess.
We need the same answer in a form that doesn't underflow and is easy to differentiate.
The logarithm turns products into sums and is monotonic — so maxima are preserved.
Now the dataset's "score" is a sum of per-example log-probabilities · $\log L(\theta) = \sum_{i=1}^N \log p(x_i \mid \theta)$.
We minimize the negative log-likelihood (NLL) so "loss" is something we drive down with gradient descent · $\mathrm{NLL}(\theta) = -\sum_{i=1}^N \log p(x_i \mid \theta)$.
We will always work with log-likelihood from this point onward.
Every loss in this course is an NLL. MSE, BCE, cross-entropy, ELBO, diffusion loss — all are just NLLs of carefully chosen distributions.
Inverting conditional probabilities — the foundation of MAP
For two events $A$ and $B$ · conditional probability is $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$.
Read · "given that $B$ happened, how probable is $A$?"
Cross-multiply and you get the product rule · $P(A \cap B) = P(A \mid B)\,P(B)$.
From the two ways to factor $P(A \cap B)$ · $P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$.
Divide by $P(B)$ · $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$ — Bayes' rule.
This flips the conditional · if you know $P(B \mid A)$ (plus the marginals), you can compute $P(A \mid B)$.
A disease has prevalence 1%. A test has sensitivity 95% and specificity 95%.
You test positive. How likely are you to have the disease?
Let $D$ = "has disease" and $+$ = "tests positive" · $P(D) = 0.01$, $P(+ \mid D) = 0.95$, $P(- \mid \neg D) = 0.95$.
Stop and guess before the next slide.
Evidence (total probability of testing positive) · $P(+) = P(+ \mid D)P(D) + P(+ \mid \neg D)P(\neg D) = 0.95 \times 0.01 + 0.05 \times 0.99 = 0.059$.
Posterior · $P(D \mid +) = \frac{0.95 \times 0.01}{0.059} \approx 0.16$.
Despite a "95% accurate" test, a positive result gives only 16% chance of disease.
This is the base-rate fallacy — and it's the same maths we'll apply to ML when the prior on
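The same arithmetic in three lines of Python, to make the base-rate effect tangible:

```python
# Disease prevalence 1%; sensitivity and specificity both 95%.
p_disease = 0.01
p_pos_given_disease = 0.95          # sensitivity
p_pos_given_healthy = 1 - 0.95      # false positive rate = 1 - specificity

# Evidence: total probability of testing positive.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior via Bayes' rule.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_pos, p_disease_given_pos)   # 0.059, ~0.161
```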
In ML, the role of $A$ is played by the parameters $\theta$ and the role of $B$ by the data $\mathcal{D}$ · $p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\,p(\theta)}{p(\mathcal{D})}$.
This is the central equation of probabilistic ML. It tells us how to update our belief about $\theta$ after seeing data.
| Term | What it is | Where it comes from |
|---|---|---|
| Likelihood $p(\mathcal{D} \mid \theta)$ | how plausible is the data under $\theta$ | the model |
| Prior $p(\theta)$ | belief about $\theta$ before seeing data | choice / domain knowledge |
| Posterior $p(\theta \mid \mathcal{D})$ | updated belief about $\theta$ after seeing data | what we compute |
| Evidence $p(\mathcal{D})$ | normalizing constant | usually intractable, often ignored |
We will keep these colors consistent for the rest of the course.
The evidence $p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\,p(\theta)\,d\theta$ does not depend on $\theta$.
For finding the single best $\theta$ it is therefore irrelevant — dividing by a constant doesn't move the argmax.
We only need · posterior $\propto$ likelihood $\times$ prior.
The evidence becomes important only when we want a full posterior — Bayesian neural nets, model comparison, ELBO in VAEs (L19). For today, MLE + MAP, we drop it.
Bayes' rule is not a one-shot operation. As more data arrives, today's posterior becomes tomorrow's prior ·
After dataset $\mathcal{D}_1$ · $p(\theta \mid \mathcal{D}_1) \propto p(\mathcal{D}_1 \mid \theta)\,p(\theta)$.
Now observe more data $\mathcal{D}_2$.
The previous posterior plays the role of the prior · $p(\theta \mid \mathcal{D}_1, \mathcal{D}_2) \propto p(\mathcal{D}_2 \mid \theta)\,p(\theta \mid \mathcal{D}_1)$.
Practical issue · for general distributions, the posterior may not be in the same family as the prior, so each update changes the shape of the formula. Conjugate priors (next slide) avoid this · prior and posterior stay in the same family, and updates are just parameter arithmetic.
A prior is conjugate to a likelihood when the resulting posterior belongs to the same family as the prior.
| Likelihood | Conjugate prior | Posterior |
|---|---|---|
| Bernoulli($\theta$) | Beta($\alpha, \beta$) | Beta($\alpha + \#\text{heads},\ \beta + \#\text{tails}$) |
| Categorical($\pi$) | Dirichlet($\alpha_1, \dots, \alpha_K$) | Dirichlet($\alpha_k + \text{count}_k$) |
| Normal($\mu$, known $\sigma^2$) | Normal($\mu_0, \sigma_0^2$) | Normal (closed-form mean and variance) |
| Poisson($\lambda$) | Gamma($a, b$) | Gamma($a + \sum_i x_i,\ b + N$) |
Conjugacy is why classical Bayesian statistics looks easy · all the integrals collapse. Once we go to deep neural nets, conjugacy breaks and we need approximations (variational inference, MCMC) — but for this lecture, conjugacy makes the coin example airtight.
The Beta distribution is supported on $[0, 1]$, which makes it a natural prior for a probability · $p(\theta) \propto \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$.
Mean · $\frac{\alpha}{\alpha + \beta}$.
The shape depends on $(\alpha, \beta)$ · symmetric when $\alpha = \beta$, skewed otherwise; larger values concentrate the mass.
So $\mathrm{Beta}(1, 1)$ is the uniform distribution — no preference at all.
There's a much more intuitive way to read $\alpha$ and $\beta$.
Think of $\alpha$ as pseudo-heads and $\beta$ as pseudo-tails — imaginary flips you've "seen" before any real data arrives.
Examples · $\mathrm{Beta}(1, 1)$ = essentially no pseudo-data (uniform); larger equal values of $\alpha = \beta$ = a stronger pull toward a fair coin.
This pseudo-count framing is what makes the conjugate update so clean — the posterior just adds the real counts to the pseudo-counts. We'll see that on the next slide.
The two ingredients ·
Prior · $p(\theta) \propto \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$.
Likelihood · $p(\mathcal{D} \mid \theta) \propto \theta^{N_H}(1 - \theta)^{N_T}$ for $N_H$ heads and $N_T$ tails
(we drop the binomial coefficient — it's constant in $\theta$).
Both are products of powers of $\theta$ and $(1 - \theta)$.
By Bayes' rule, posterior $\propto$ likelihood $\times$ prior ·
Combine exponents · $p(\theta \mid \mathcal{D}) \propto \theta^{\alpha + N_H - 1}(1 - \theta)^{\beta + N_T - 1}$.
This is the kernel of $\mathrm{Beta}(\alpha + N_H,\ \beta + N_T)$.
The posterior is in the same family as the prior — that's what "conjugate" means. We started with a Beta, multiplied by a Bernoulli/Binomial likelihood, and got another Beta out.
Interpretation · just add the observed counts to the pseudo-counts. That's the whole update — no integral, no normalization, no MCMC.
This is why we framed $\alpha$ and $\beta$ as pseudo-counts.
Sequential updating · because conjugacy preserves the family, you can update one observation at a time and the formulas stay the same. Each new flip just adds $1$ to $\alpha$ (heads) or to $\beta$ (tails).
This is the cleanest possible Bayesian inference, and it's what made Bayesian statistics tractable in the pre-MCMC era. Modern DL breaks conjugacy (neural-net likelihoods aren't conjugate to anything) → we need approximations like variational inference (L19) or sampling.
Start with a weakly-fair prior — a Beta with a small, equal number of pseudo-heads and pseudo-tails.
The posterior gets narrower as more data arrives. With infinite data, it converges to a point mass at the true bias.
Prior ·
Observe ·
Step 1 · update parameters
Step 2 · posterior summaries
The posterior mean lives between the prior mean and the MLE — a compromise between prior belief and data. The sketch below makes this concrete.
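A minimal sketch of the conjugate update, assuming a Beta(2, 2) prior as a stand-in for the slide's weakly-fair prior and reusing the 6-heads / 4-tails data from earlier:

```python
# Conjugate update: posterior = Beta(alpha + heads, beta + tails).
alpha, beta = 2.0, 2.0      # assumed weakly-fair prior (two pseudo-flips of each kind)
heads, tails = 6, 4

alpha_post = alpha + heads  # 8
beta_post = beta + tails    # 6

posterior_mean = alpha_post / (alpha_post + beta_post)                 # 8/14 ~ 0.571
posterior_mode = (alpha_post - 1) / (alpha_post + beta_post - 2)       # 7/12 ~ 0.583
mle = heads / (heads + tails)                                          # 0.6

print(posterior_mean, posterior_mode, mle)   # posterior sits between prior mean 0.5 and MLE 0.6
```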
The simplest possible estimator · ignore the prior, just maximize the likelihood.
Read · "what value of
This is what we'll spend Part 4 of this lecture on. Coin → linear regression → logistic regression → multiclass — all derive their loss as the negative log-likelihood under MLE.
The Bayesian-flavoured estimator · use the prior, maximize the full posterior.
Read · "given my prior belief and the data, what's the most probable
This is Part 5 (Session 2). MAP = MLE plus a regularizer that comes from the log-prior. With a Gaussian prior we'll recover L2; with a Laplace prior we'll recover L1. Same machinery, different prior.
MAP = MLE + a prior on $\theta$ — in log space, the log-prior becomes an additive regularizer.
That single sentence is what makes regularization fall out of the same machinery as the loss. There is no separate "loss" and "regularizer" theory — it's all one Bayesian story, and L2/L1 are just choices of prior.
When the data is plentiful, the likelihood dominates and MAP $\to$ MLE.
Concrete derivations · coin · linear regression · logistic regression
Three sections of the same exam · grades modelled as Normal ·
| Course | Mean | Std |
|---|---|---|
| C1 | 80 | 10 |
| C2 | 70 | 10 |
| C3 | 90 | 5 |
A student's mark is 82. Which course's grade distribution most plausibly produced it?
Stop and guess before the next slide.
Evaluate the density of each course's Normal at the observed mark and pick the largest.
Course C1 wins · its bell is centred close to 82 with reasonable spread. The sketch below checks the numbers.
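A quick check of the three densities, using the table's parameters and a mark of 82 (as suggested above):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x = 82.0   # the observed mark
for name, mu, sigma in [("C1", 80, 10), ("C2", 70, 10), ("C3", 90, 5)]:
    print(name, normal_pdf(x, mu, sigma))
# C1 ~0.039, C2 ~0.019, C3 ~0.022  ->  C1 has the highest density
```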
MLE intuition · among candidate distributions, choose the one under which the observed data has the highest probability.
This is the entire idea. Everything below is just doing it carefully.
Data · $x_1, \dots, x_N$ assumed IID $\mathcal{N}(\mu, \sigma^2)$; find the MLE of $\mu$.
Take the log · $\log L(\mu) = -\frac{1}{2\sigma^2}\sum_{i=1}^N (x_i - \mu)^2 + \text{const}$.
Step 1 · differentiate. $\frac{d}{d\mu}\log L(\mu) = \frac{1}{\sigma^2}\sum_{i=1}^N (x_i - \mu)$.
Step 2 · set derivative to zero. $\sum_{i=1}^N (x_i - \mu) = 0$.
Expand · $\sum_i x_i = N\mu$, so $\hat{\mu}_{\mathrm{MLE}} = \frac{1}{N}\sum_{i=1}^N x_i$ — the sample mean.
For our data, the MLE of the mean is simply the average of the observations.
This is your first MLE derivation. Same recipe applies to everything below.
To find the MLE for any model · write the likelihood, take the log, differentiate with respect to the parameters, and set the derivative to zero (or run gradient descent when no closed form exists).
We will now apply this recipe to linear regression (where the answer pops out as MSE) and logistic regression (where it pops out as BCE).
Modelling choice · the target is a linear function plus Gaussian noise · $y = \theta^\top x + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
Equivalently · the conditional distribution of $y$ given $x$ is $\mathcal{N}(\theta^\top x,\ \sigma^2)$.
The model's prediction $\hat{y} = \theta^\top x$ is the mean of that Gaussian.
Each data point is one draw from a Gaussian whose mean lies on the regression line. MLE asks · which line $\theta$ makes the observed $y_i$ most probable?
Equivalently · which line minimizes the squared distance from each point to the line — exactly the OLS objective. That equivalence is the whole derivation.
This is the only modelling choice. Everything else is algebra.
For a single example, $\log p(y_i \mid x_i, \theta) = -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - \theta^\top x_i)^2}{2\sigma^2}$.
Sum over the dataset · log of a product of IID terms is a sum.
Only the second term depends on $\theta$.
Maximizing the log-likelihood is therefore the same as minimizing $\sum_i (y_i - \theta^\top x_i)^2$.
The factor $\frac{1}{2\sigma^2}$ is a positive constant — it rescales the objective but doesn't move the argmin.
This is exactly MSE. MSE is not a heuristic — it is the MLE under Gaussian noise.
If the noise had been Laplace instead of Gaussian ($p(\epsilon) \propto e^{-|\epsilon|/b}$), the same derivation would have produced the mean absolute error. The loss encodes the noise assumption.
The MSE objective is quadratic in $\theta$, so the minimizer has a closed form.
Stack data · $X \in \mathbb{R}^{N \times d}$ (one example per row), $y \in \mathbb{R}^N$ · $\hat{\theta} = (X^\top X)^{-1} X^\top y$.
The familiar normal equation is the closed-form MLE under Gaussian noise. SGD / gradient descent gives the same answer iteratively.
Let
Data ·
We want the slope of a one-feature, no-intercept fit · $\hat{\theta} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$.
Let's compute it step by step.
Step 1 · denominator
Step 2 · numerator
Step 3 · solve ·
That's the OLS estimate by hand. No matrix inversion needed for a single feature.
Predictions under
Residuals
Sum-of-squared residuals ·
Sanity check · the MSE gradient at $\hat{\theta}$ is (numerically) zero — we are at the minimum of the quadratic.
This is exactly the answer torch.linalg.lstsq returns. We just did it by hand to feel the closed form.
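A minimal sketch comparing the hand formula with the library solver — the numbers here are illustrative, not the slide's original dataset:

```python
import torch

# Illustrative 1-D data.
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 5.0])

# Closed-form MLE for one feature, no intercept: theta = sum(x*y) / sum(x^2).
theta_hand = (x * y).sum() / (x ** 2).sum()

# Same answer from the library solver.
theta_lstsq = torch.linalg.lstsq(x.unsqueeze(1), y.unsqueeze(1)).solution

print(theta_hand.item(), theta_lstsq.item())   # both ~1.786
```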
Now $y \in \{0, 1\}$, and the model outputs $\hat{y}_i = \sigma(\theta^\top x_i)$, interpreted as $p(y_i = 1 \mid x_i)$.
Per-example probability · $p(y_i \mid x_i, \theta) = \hat{y}_i^{\,y_i}\,(1 - \hat{y}_i)^{1 - y_i}$.
This is the same compact Bernoulli form as the coin — except the success probability now depends on $x_i$ through the sigmoid.
Take the log · $\log p(y_i \mid x_i, \theta) = y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)$,
where $\hat{y}_i = \sigma(\theta^\top x_i)$.
Sum over the dataset and negate to get NLL · $\mathrm{NLL}(\theta) = -\sum_{i=1}^N \left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right]$.
This is exactly binary cross-entropy. Same story · BCE is not invented — it's the MLE under a Bernoulli output.
Differentiate · $\nabla_\theta\,\mathrm{NLL} = \sum_{i=1}^N (\hat{y}_i - y_i)\,x_i$.
The gradient is "prediction minus truth, weighted by input." This is the exact same form as the linear regression gradient $\sum_i (\hat{y}_i - y_i)\,x_i$ — only the meaning of $\hat{y}$ has changed.
That is not a coincidence — it's a feature of generalized linear models, all derived from MLE.
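A quick numerical check that the library BCE loss is exactly the Bernoulli NLL (random logits and labels are placeholders):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5)                             # theta^T x for 5 examples
targets = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])

# Library loss.
bce = F.binary_cross_entropy_with_logits(logits, targets)

# Negative log-likelihood of a Bernoulli with p = sigmoid(logits).
nll = -torch.distributions.Bernoulli(logits=logits).log_prob(targets).mean()

print(bce.item(), nll.item())                       # identical
```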
Now $y \in \{1, \dots, K\}$ — one of $K$ classes.
For each class $k$ the model produces a logit $z_k = \theta_k^\top x$.
Softmax turns logits into probabilities · $\pi_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$.
Two properties · every $\pi_k > 0$, and $\sum_k \pi_k = 1$ — a valid Categorical.
Modelling assumption · $y \mid x \sim \mathrm{Cat}(\pi_1, \dots, \pi_K)$.
If example $i$ has true class $c$, its probability under the model is $\pi_{i,c}$.
Equivalently with one-hot encoding $\mathbf{y}_i$ · $p(\mathbf{y}_i \mid x_i, \theta) = \prod_{k=1}^K \pi_{i,k}^{\,y_{i,k}}$.
Take logs · $\log p(\mathbf{y}_i \mid x_i, \theta) = \sum_k y_{i,k}\log\pi_{i,k} = \log\pi_{i,c}$.
Same compact-Bernoulli trick as before — only one term in the sum survives because the one-hot vector has a single 1.
Sum over the dataset and negate · $\mathrm{NLL}(\theta) = -\sum_{i=1}^N \sum_{k=1}^K y_{i,k}\log\pi_{i,k}$.
The right-hand form is the textbook categorical cross-entropy — true distribution $\mathbf{y}_i$ (one-hot) against model distribution $\pi_i$ (softmax).
This loss powers every multiclass classifier in this course · CIFAR-10 ($K = 10$) in the vision lectures, and next-token prediction in the LLM lectures ($K$ = vocabulary size).
Differentiate with respect to the logits · $\frac{\partial\,\mathrm{NLL}_i}{\partial z_{i,k}} = \pi_{i,k} - y_{i,k}$.
Prediction minus truth, per logit. Same elegant form as binary logistic ($\hat{y} - y$).
This is why softmax + cross-entropy are always implemented as one fused op (F.cross_entropy(logits, target)) — both for numerical stability (log-sum-exp trick) and because the gradient simplifies dramatically when you do them together.
3 classes (cat, dog, car). Image with true class cat (index 0). Model produces logits
Softmax (subtract max for stability) ·
Per-example loss · NLL of the true class.
Gradient on logits ·
The gradient says · push the cat-logit up (negative gradient → the optimizer subtracts it, so cat-logit increases), push the dog and car logits down. Exactly what you want.
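A minimal sketch of this worked example — the logits below are hypothetical stand-ins for the slide's numbers:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for (cat, dog, car); true class is cat (index 0).
logits = torch.tensor([2.0, 1.0, 0.1], requires_grad=True)
target = torch.tensor(0)

loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))   # fused softmax + NLL
loss.backward()

probs = F.softmax(logits.detach(), dim=0)
one_hot = F.one_hot(target, num_classes=3).float()
print(loss.item())            # -log(probs[0])
print(logits.grad)            # equals probs - one_hot: negative for cat, positive for dog and car
print(probs - one_hot)        # "prediction minus truth", per logit
```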
| Output | Distribution chosen | NLL turns out to be |
|---|---|---|
| Continuous $y$ | Normal | MSE |
| Binary $y$ | Bernoulli | BCE |
| One of $K$ classes | Categorical | CE |
Every loss = NLL under an assumed conditional distribution
Pick the distribution to match your data. The loss falls out automatically. No more "memorize MSE for regression and CE for classification" — both come from the same place.
Cat-vs-dog classifier · the model outputs $\hat{y} = p(\text{cat} \mid x)$.

| True class | Log-likelihood | NLL = loss | Model "happy"? |
|---|---|---|---|
| 1 (cat) | $\log \hat{y}$ | $-\log \hat{y}$ | yes — small loss |
| 0 (dog) | $\log(1 - \hat{y})$ | $-\log(1 - \hat{y})$ | no — big loss |
The loss is small iff the model assigned high probability to the true class. That's all cross-entropy is doing — and that's all "maximizing log-likelihood" means once you write it out.
A model defines a conditional distribution · $p(y \mid x, \theta)$.
Likelihood of a dataset · $L(\theta) = \prod_{i=1}^N p(y_i \mid x_i, \theta)$.
MLE · $\hat{\theta} = \arg\max_\theta \log L(\theta)$ — equivalently, minimize the NLL.
Plug in the right distribution and the right loss falls out automatically ·
Bayes' rule turns this into a belief-update story · posterior $\propto$ likelihood $\times$ prior — which is where MAP and regularization enter in Session 2.
Session 2 will turn the prior into regularization and KL divergence into the single language used by every model in the rest of the course.
Try these on paper; answers worked through in the notebook (lec00-mle-map.ipynb).
P1. A coin gives 12 heads in 20 flips. Compute the MLE for $p$.
P2. Show that for
P3. Write down the conditional distribution and the per-example log-likelihood for Poisson regression ($y_i \in \{0, 1, 2, \dots\}$ with rate $\lambda_i = e^{\theta^\top x_i}$).
P4. For a 3-class softmax with logits
P5. A coin's true bias is
So far · probabilistic framework + MLE. Bernoulli, Categorical, Normal · likelihood and log-likelihood · sampling primitives · Bayes' rule with its four named terms · Beta-Binomial conjugate updates · MLE for the coin, linear regression, logistic regression, and multiclass — all derived as NLL of a chosen distribution.
Session 2 picks up from here · MAP and regularization (L2 from Gaussian prior, L1 from Laplace), KL divergence and information theory as the unifying lens, and how it all fans out into VAEs, diffusion, RLHF, and the rest of the course.
Recap of session 1 · the model defines a distribution, and MLE = NLL minimization. Today · what changes when we add a prior — and why KL divergence is the right language for everything that comes next.
L2 from a Gaussian prior · L1 from a Laplace prior
You flip a coin 3 times and see 3 heads. MLE says $\hat{p} = 3/3 = 1$.
According to MLE, this coin is certainly biased to always land heads. Future tails impossible.
This is absurd. Three flips is not enough evidence to make such an extreme claim. You know most coins are roughly fair.
The fix · encode that prior knowledge into the inference. That gives MAP.
In ML, the analogous trap is overfitting · with finite data, MLE drives weights to whatever value most exactly fits the training set, even if those values are wildly extreme. A prior pulls them back.
By Bayes, $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\,p(\theta)$.
In log space · $\log p(\theta \mid \mathcal{D}) = \log p(\mathcal{D} \mid \theta) + \log p(\theta) + \text{const}$.
Equivalently, minimize the negative · $\mathrm{NLL}(\theta) - \log p(\theta)$.
The first term is your usual loss. The second term is whatever the prior gives. That second term is what we will recognize as L1 or L2.
The MAP estimate is the point in parameter space where the posterior is highest · $\hat{\theta}_{\mathrm{MAP}} = \arg\min_\theta \left[\mathrm{NLL}(\theta) - \log p(\theta)\right]$.
We choose a prior that says "weights are probably small" · each $\theta_j \sim \mathcal{N}(0, \tau^2)$ independently.
For a single weight · $\log p(\theta_j) = -\frac{\theta_j^2}{2\tau^2} + \text{const}$.
For the whole vector $\theta$ · $\log p(\theta) = -\frac{1}{2\tau^2}\sum_j \theta_j^2 + \text{const}$.
The constant doesn't depend on $\theta$, so it drops out of the argmin.
Plug into the MAP objective · $\mathrm{NLL}(\theta) + \lambda \sum_j \theta_j^2$, with $\lambda = \frac{1}{2\tau^2}$.
This is exactly L2 regularization (a.k.a. ridge, weight decay).
Linear regression. Suppose at the MLE estimate, the NLL is
Pick
The optimizer now pays a price for large weights. Since the gradient of $\lambda\theta_j^2$ is $2\lambda\theta_j$, every step shrinks each weight toward zero in proportion to its size — exactly what "weight decay" implements.
Now choose a prior that says "most weights are probably irrelevant" · each $\theta_j \sim \mathrm{Laplace}(0, b)$.
The Laplace density · $p(\theta_j) = \frac{1}{2b}\exp\!\left(-\frac{|\theta_j|}{b}\right)$.
For one weight · $\log p(\theta_j) = -\frac{|\theta_j|}{b} + \text{const}$.
For the whole vector with independent components · $\log p(\theta) = -\frac{1}{b}\sum_j |\theta_j| + \text{const}$.
Note the absolute value in the log — that's the structural difference from Gaussian's squared term.
Plug into MAP · $\mathrm{NLL}(\theta) + \lambda \sum_j |\theta_j|$, with $\lambda = \frac{1}{b}$.
This is exactly L1 regularization (a.k.a. lasso).
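A minimal sketch of the two MAP objectives side by side — the helper name `map_objective` and the numbers are hypothetical, and `lam` plays the role of $\frac{1}{2\tau^2}$ or $\frac{1}{b}$:

```python
import torch

def map_objective(nll, theta, prior="gaussian", lam=0.01):
    """MAP loss = NLL minus log-prior (constants dropped).

    Gaussian prior on the weights -> lam * sum(theta^2)   (L2 / ridge)
    Laplace  prior on the weights -> lam * sum(|theta|)   (L1 / lasso)
    """
    if prior == "gaussian":
        penalty = lam * (theta ** 2).sum()
    elif prior == "laplace":
        penalty = lam * theta.abs().sum()
    else:
        raise ValueError(prior)
    return nll + penalty

theta = torch.tensor([0.5, -2.0, 0.0])
nll = torch.tensor(1.3)                       # stand-in for the data NLL
print(map_objective(nll, theta, "gaussian"))  # L2-regularized loss
print(map_objective(nll, theta, "laplace"))   # L1-regularized loss
```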
Think of MAP as minimizing the loss subject to the prior. In 2D · the Gaussian prior's equal-density contours are circles, the Laplace prior's are diamonds; the solution sits where the loss contours first touch the prior's contour.
L1's sparsity is a direct consequence of the diamond geometry of the Laplace prior — corners that lie exactly on the coordinate axes are the most likely touchpoints. Move to high dimensions and there are exponentially many corners → many features get zeroed simultaneously.
| | L2 (Ridge) | L1 (Lasso) |
|---|---|---|
| Prior | Gaussian $\mathcal{N}(0, \tau^2)$ | Laplace$(0, b)$ |
| Log-prior penalty | $\lambda \sum_j \theta_j^2$ | $\lambda \sum_j \lvert\theta_j\rvert$ |
| Geometry | Circle | Diamond |
| Solution | small everywhere | many exactly zero |
| Gradient of penalty | $2\lambda\theta_j$ — shrinks in proportion to size | $\lambda\,\mathrm{sign}(\theta_j)$ — constant pull toward zero |
| When to use | dense problems, "shrink everything" | feature selection, "kill weak features" |
L1 and L2 are the same MAP machinery — they differ only in the prior over
Back to the coin · 3 heads, 0 tails. MLE says $\hat{p} = 1$.
Suppose your prior is a $\mathrm{Beta}(\alpha, \beta)$ that leans toward fairness (both pseudo-counts greater than 1).
Posterior ∝ Likelihood × Prior · $p(p \mid \mathcal{D}) \propto p^{3 + \alpha - 1}(1 - p)^{\beta - 1}$.
Maximize · the mode of this Beta is $\hat{p}_{\mathrm{MAP}} = \frac{3 + \alpha - 1}{3 + \alpha + \beta - 2}$.
MAP says $\hat{p}$ is strictly below 1 — the prior tempers the extreme claim, so future tails are no longer "impossible".
This is the same regularization story as L1/L2 on weights — just with a different distribution.
NLL is the loss everywhere · KL connects · VAE/diffusion/GANs are MAP++
| Output | Distribution | Loss = NLL | Lecture |
|---|---|---|---|
| Real-valued | Normal | MSE | L1 (recap), L19 (VAE recon) |
| Binary | Bernoulli | BCE | L1 (recap) |
| K classes | Categorical | Cross-entropy | L7+ (vision), L13–15 (LLMs) |
| Pixels | per-pixel Normal | per-pixel MSE | L19 (VAE), L21 (diffusion) |
| Tokens | Categorical | next-token CE | L13–L15 (LLMs) |
| Image patch given noise | Normal in pixel/score space | MSE on noise | L21 (diffusion) |
| Latent variable model | Normal + KL prior | ELBO = recon + KL | L19 (VAE) |
| Two distributions to match | — | KL minimization | L16 (DPO), L23 (distillation) |
The whole course will keep instantiating the same NLL recipe. Each new model just changes which distribution is being assumed.
Before defining entropy, define the information content (or "surprise") of one outcome · $h(x) = -\log p(x)$.
Why this exact formula? Three axioms we want surprise to satisfy · certain events carry zero surprise, less probable events carry more, and the surprise of independent events adds.
The only function satisfying all three is $h(x) = -\log p(x)$ (up to the base of the logarithm).
Same headline · "It snowed today." Two locations · a city where snow is routine, and one where it almost never snows. The probability of the event — and hence the surprise — is wildly different.
The same event carries different surprise depending on the distribution generating it. That is exactly what $h(x) = -\log p(x)$ captures.
This is also why log-likelihood works as a model-quality signal · a model that places high probability on the data has low per-sample surprise.
| Event | $p(x)$ | $h(x) = -\log_2 p(x)$ | Interpretation |
|---|---|---|---|
| Fair coin lands heads | $1/2$ | 1 bit | one yes/no question |
| Roll a 6 on a fair die | $1/6$ | $\approx 2.6$ bits | "less than 3 yes/no questions worth" |
| Win a 1-in-1024 lottery | $2^{-10}$ | 10 bits | extremely surprising |
| The sun rises tomorrow | $\approx 1$ | $\approx 0$ bits | no information (already certain) |
| Sample a specific token from 50k vocab | $1/50{,}000$ | $\approx 15.6$ bits | one token = one short word in English |
Reading · large $h$ = surprising, informative; small $h$ = expected, uninformative.
This is the per-sample log-loss in disguise. Cross-entropy is just the average of these surprises.
The entropy of a distribution is its average surprise · $H(p) = \mathbb{E}_{x \sim p}[-\log p(x)] = -\sum_x p(x)\log p(x)$.
In base 2, entropy is measured in bits; with the natural log, in nats.
So entropy is what you get when you average the per-sample surprise from the previous slides. Big entropy = on average, draws from
Imagine you must send samples from $p$ over a wire. The best possible code assigns about $-\log_2 p(x)$ bits to outcome $x$, so the average message length is $H(p)$ bits per sample.
Concrete · alphabet $\{A, B, C, D\}$ with probabilities $\left(\tfrac12, \tfrac14, \tfrac18, \tfrac18\right)$ ·
| Symbol | Optimal code | Code length | $p$ |
|---|---|---|---|
| A | 0 | 1 | $1/2$ |
| B | 10 | 2 | $1/4$ |
| C | 110 | 3 | $1/8$ |
| D | 111 | 3 | $1/8$ |
Average length $= \tfrac12(1) + \tfrac14(2) + \tfrac18(3) + \tfrac18(3) = 1.75$ bits $= H(p)$ exactly.
Entropy = the cost of describing samples from $p$, in bits per sample, under the best possible code.
| Distribution | Computation | $H$ |
|---|---|---|
| Fair coin | $-\tfrac12\log_2\tfrac12 - \tfrac12\log_2\tfrac12$ | 1 bit |
| Biased coin ($p = 0.9$) | $-0.9\log_2 0.9 - 0.1\log_2 0.1$ | $\approx 0.47$ bits |
| Uniform over 8 | $\log_2 8$ | 3 bits |
| One-hot (deterministic) | $-1\log_2 1$ | 0 bits — nothing to encode |
Entropy peaks for uniform distributions and is zero for deterministic ones. A fair coin needs 1 bit per flip; a near-deterministic coin needs ~0 bits per flip; a uniform-over-$K$ distribution needs $\log_2 K$ bits per draw.
This explains why low-entropy classifier outputs (confident) need fewer bits to encode than high-entropy outputs (uncertain) — and why temperature in LLMs (which controls output entropy) controls "creativity vs determinism."
A fair coin needs one bit to encode each flip. A coin with bias near 0 or 1 needs far less — most flips are predictable, so there's little to say.
Let $X \sim \mathrm{Bern}(p)$ · $H(p) = -p\log_2 p - (1 - p)\log_2(1 - p)$ — the binary entropy function.
| Distribution | $H$ | Interpretation |
|---|---|---|
| $p = 0$ or $p = 1$ | 0 bits | already certain |
| $p$ close to 0 or 1 | low | confident model |
| intermediate $p$ | higher | mixed evidence |
| $p = 0.5$ | 1 bit | maximum |
The maximum entropy of a Bernoulli is 1 bit, reached at $p = 0.5$ — the point of total uncertainty. The sketch below traces the whole curve.
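A minimal sketch of the binary entropy function (the probed $p$ values are arbitrary):

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with the convention 0*log(0) = 0."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(float(binary_entropy(p)), 3))
# 0.0 -> 0.0, 0.1 -> 0.469, 0.5 -> 1.0, 0.9 -> 0.469, 1.0 -> 0.0
# Peaks at p = 0.5 (1 bit), zero at the deterministic extremes.
```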
For a continuous random variable, replace the sum with an integral · $h(p) = -\int p(x)\log p(x)\,dx$.
(Lower-case $h$ — "differential entropy" — to distinguish it from the discrete $H$; it can be negative.)
For a Normal $\mathcal{N}(\mu, \sigma^2)$ · $h = \tfrac12\log(2\pi e \sigma^2)$.
Reading · the differential entropy of a Gaussian only depends on $\sigma$ — shifting the mean moves the distribution but doesn't change its spread.
Worked · for $\sigma = 1$, $h = \tfrac12\log(2\pi e) \approx 1.42$ nats $\approx 2.05$ bits.
For two random variables $X$ and $Y$ · $I(X; Y) = \mathrm{KL}\big(p(x, y)\,\|\,p(x)\,p(y)\big)$.
Read · "how much does knowing $Y$ reduce your uncertainty about $X$?" — zero exactly when they are independent.
Why we'll care · it's the information-theoretic way to ask how much a learned representation retains about its input.
We won't compute $I(X; Y)$ by hand today — it's here so the name is familiar when it reappears.
KL divergence measures how different one distribution $q$ is from a reference distribution $p$ · $\mathrm{KL}(p\,\|\,q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = \mathbb{E}_{x \sim p}\!\left[\log\frac{p(x)}{q(x)}\right]$.
(Replace the sum with an integral for continuous variables.)
Read · "the average log-ratio when the data really comes from $p$."
It's the average extra surprise you experience by encoding samples-from-$p$ with a code built for $q$.
Property 1 is what makes KL a sensible loss-like quantity — minimize it and you get to zero only when distributions match.
Property 2 says KL = 0 is the unique minimum. So minimizing KL is the right objective for matching distributions.
Property 3 is the surprising one. KL is not a metric — which direction you take matters. We'll see this matters a lot in "forward vs reverse KL" later.
Let
In nats (natural log) ·
In bits · divide the nats by $\log 2 \approx 0.693$.
Asymmetry check ·
True
Model A ·
Model B ·
A roughly-aligned model has KL
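A minimal sketch of computing KL in both directions — the two distributions here are illustrative, not the slide's specific numbers:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Illustrative distributions over three outcomes.
p = np.array([0.5, 0.4, 0.1])
q = np.array([0.4, 0.4, 0.2])

print(kl(p, q), kl(q, p))          # the two directions give different numbers
print(kl(p, q) / np.log(2))        # divide by log 2 to convert nats -> bits
```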
Same alphabet
Suppose we mistakenly built our code for the wrong distribution $q$ while the samples actually come from $p$.
When we encode samples from $p$ with the $q$-code, the average length is the cross-entropy $H(p, q) = H(p) + \mathrm{KL}(p\,\|\,q)$ — the optimal $H(p)$ bits plus a KL-sized overhead.
Reading · KL is the literal number of extra bits per sample you pay for using the wrong code. Modelling = compression · this is why "better model = better compression" is not a metaphor.
Jensen's inequality · for any concave function $f$ · $\mathbb{E}[f(X)] \le f(\mathbb{E}[X])$.
Since $\log$ is concave · $\mathbb{E}_p\!\left[\log\frac{q(x)}{p(x)}\right] \le \log \mathbb{E}_p\!\left[\frac{q(x)}{p(x)}\right]$.
So $-\mathrm{KL}(p\,\|\,q) \le \log\sum_x p(x)\frac{q(x)}{p(x)} = \log\sum_x q(x) = \log 1 = 0$, hence $\mathrm{KL}(p\,\|\,q) \ge 0$, with equality iff $p = q$.
This is the full proof of Gibbs' inequality. Jensen's inequality is the same tool used to derive the ELBO in VAEs (L19). Same trick, same place — once you see it once, you see it everywhere.
Algebra · $H(p, q) = -\sum_x p(x)\log q(x) = H(p) + \mathrm{KL}(p\,\|\,q)$ — cross-entropy splits into the entropy of the truth plus the KL gap.
For classification with a one-hot true label $p$ · $H(p, q) = -\sum_k p_k \log q_k = -\log q_{\text{true class}}$.
In classification, the truth $p$ is one-hot, so its entropy is zero and cross-entropy equals KL.
For any one-hot $p$, $H(p) = 0$, so $H(p, q) = \mathrm{KL}(p\,\|\,q)$ — minimizing cross-entropy and minimizing KL are the same thing here.
Cross-entropy of a one-hot truth against a softmax model is exactly the NLL of the model on the true class. This is the standard classification loss.
It's also exactly what L1 derived from a Bernoulli/Categorical assumption — same answer through two different lenses (NLL or KL). On the next slide we tabulate it across confidence levels.
3 classes, true class is 1 (so $p$ is one-hot on class 1) · loss $= -\log q_1$.

| Model $q_1$ | Loss $-\log q_1$ | "Sentiment" |
|---|---|---|
| $q_1 \approx 1$ | $\approx 0$ | confidently right · tiny loss |
| $q_1$ fairly high | small | mostly right |
| $q_1 = 1/3$ | $\log 3 \approx 1.1$ | uncertain (≈ uniform) |
| $q_1 \approx 0$ | very large | confidently wrong · big loss |
The loss is small iff the model assigned high probability to the true class and grows steeply when the model is confidently wrong. This asymmetric penalty is what makes cross-entropy a proper scoring rule — it rewards calibration, not just accuracy.
The classifier's training loss is just the average of this column over the dataset.
Define the empirical data distribution as the histogram of the training set, treated as a distribution · $\hat{p}_{\text{data}}(x) = \frac{1}{N}\sum_{i=1}^N \delta(x - x_i)$.
This is just a discrete probability mass with weight $\frac{1}{N}$ on every training example.
Now the average log-likelihood becomes an expectation under $\hat{p}_{\text{data}}$ · $\frac{1}{N}\sum_i \log p_\theta(x_i) = \mathbb{E}_{x \sim \hat{p}_{\text{data}}}[\log p_\theta(x)]$.
So the score we already maximize is the negative cross-entropy between the empirical distribution and the model.
Use $H(\hat{p}_{\text{data}}, p_\theta) = H(\hat{p}_{\text{data}}) + \mathrm{KL}(\hat{p}_{\text{data}}\,\|\,p_\theta)$ · the data's entropy doesn't depend on $\theta$, so maximizing likelihood is minimizing the KL term · $\hat{\theta}_{\mathrm{MLE}} = \arg\min_\theta \mathrm{KL}(\hat{p}_{\text{data}}\,\|\,p_\theta)$.
MLE = make the model distribution as KL-close as possible to the empirical distribution.
Same machinery, two-distribution view. Every classifier you've trained with cross-entropy has been silently minimizing this KL — to the one-hot empirical distribution of class labels.
This is also why MLE is mode-covering (forward KL) — covered in the "forward vs reverse KL" slide.
Adding the log-prior to the previous slide · $\hat{\theta}_{\mathrm{MAP}} = \arg\min_\theta \left[\mathrm{KL}(\hat{p}_{\text{data}}\,\|\,p_\theta) - \tfrac{1}{N}\log p(\theta)\right]$.
L2 regularization through this lens · "minimize KL-to-empirical, plus pay $\lambda\|\theta\|^2$ for drifting away from the zero-mean Gaussian prior."
This is the fit + don't drift in KL template that recurs throughout the course.
The same "fit + don't drift in KL" pattern appears in many places ·
| Method | Loss structure | Reading |
|---|---|---|
| L2 regularization | NLL $+\ \lambda\|\theta\|^2$ | don't drift from the prior on the weights |
| VAE (L19) | reconstruction $+\ \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)$ | keep the encoder posterior close to a standard normal |
| DPO / RLHF (L16) | reward $-\ \beta\,\mathrm{KL}(\pi\,\|\,\pi_{\text{ref}})$ | keep the policy close to the base model |
| Distillation (L23) | $\mathrm{KL}(\text{teacher}\,\|\,\text{student})$ | student matches teacher |
Every regularizer in modern DL is some form of "don't drift too far in KL" from a reference distribution. Once you see the pattern, you stop memorizing losses and start deriving them.
We've already seen that KL is asymmetric · $\mathrm{KL}(p\,\|\,q) \ne \mathrm{KL}(q\,\|\,p)$ in general. The two directions optimize for different behaviour.
Both are valid. Both are used in DL. Knowing which one your loss optimizes tells you what failure mode to expect. Next two slides unpack each.
Read · "average the log-ratio over the truth
Failure mode · the expectation is taken over
Consequence · mode-covering.
This is what MLE optimizes — recall MLE =
Read · "average the log-ratio over the model
Failure mode · the expectation is taken over
Consequence · mode-seeking.
This is approximately what VAE encoders, GAN training, and policy distillation optimize. Sharper samples, but mode collapse is a real risk.
Same bimodal target $p$, fit with a single Gaussian $q$ · forward KL picks a broad $q$ that covers both modes; reverse KL picks a narrow $q$ that sits on one mode.
The two errors are opposite failure modes. Knowing which KL direction your loss optimizes tells you which failure mode to expect.
For two Gaussians · $\mathrm{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2)\,\|\,\mathcal{N}(\mu_2, \sigma_2^2)\big) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$.
For the special case $q = \mathcal{N}(\mu, \sigma^2)$ against $p = \mathcal{N}(0, 1)$ · $\mathrm{KL}(q\,\|\,p) = \tfrac{1}{2}\left(\mu^2 + \sigma^2 - \log\sigma^2 - 1\right)$.
This is the exact term that shows up in the VAE loss (L19) — pulling the encoder's posterior toward a standard-normal prior. Memorize this form; it'll save you 5 minutes of staring when you re-derive it later.
Worked · the sketch below evaluates the closed form for a few representative $(\mu, \sigma)$ values.
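A minimal check of the closed form — the $(\mu, \sigma)$ values are illustrative, not the slide's original numbers:

```python
import numpy as np

def kl_gauss_to_std_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) = 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1)."""
    return 0.5 * (mu ** 2 + sigma ** 2 - np.log(sigma ** 2) - 1.0)

for mu, sigma in [(0.0, 1.0), (1.0, 1.0), (0.0, 2.0)]:
    print(mu, sigma, kl_gauss_to_std_normal(mu, sigma))
# (0, 1) -> 0.0 (identical distributions), (1, 1) -> 0.5, (0, 2) -> ~0.807 nats
```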
Once you see KL as the underlying object, every advanced model becomes a specific KL minimization ·
| Model | KL it minimizes | Lecture |
|---|---|---|
| Classifier (CE loss) | $\mathrm{KL}(\hat{p}_{\text{data}}\,\|\,p_\theta)$ with one-hot labels | L1, every classifier |
| VAE | recon-NLL $+ \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$ | L19 |
| Diffusion (variational view) | a sum of per-step KLs between forward and reverse processes | L21 |
| GAN | (approximately) Jensen-Shannon — symmetrized KL | L20 |
| DPO / RLHF KL term | $\mathrm{KL}(\pi\,\|\,\pi_{\text{ref}})$ | L16 |
| Knowledge distillation | $\mathrm{KL}(\text{teacher}\,\|\,\text{student})$ | L23 |
You will see KL terms in every generative or alignment loss for the next 24 lectures. Each is an instance of what we just derived.
Try these on paper; the notebook has plotting and verification.
P6. For linear regression with NLL
P7. Compute
P8. Show that for a one-hot true label $p$, the cross-entropy $H(p, q)$ equals $-\log q_{\text{true class}}$.
P9. A bimodal target has modes at
P10. Coin flipped 10 times, observed 8 heads. Starting from
Every advanced model in this course uses MLE (or MAP) under a specific distribution ·
| Model | Output distribution | Loss = NLL of … | Lecture |
|---|---|---|---|
| LLM (next-token) | Categorical over vocab | next-token cross-entropy | L13–L15 |
| VAE | Normal pixel decoder + Gaussian latent | reconstruction NLL + KL to prior | L19 |
| GAN | implicit generator | min–max over discriminator and generator (not a plain NLL) | L20 |
| Diffusion | Gaussian forward process | MSE on predicted noise | L21–L22 |
| RLHF / DPO | Bradley–Terry preference pair | log-sigmoid of reward gap | L16 |
You now know what all of these losses are. They're NLLs.
We will pair this lecture with a notebook (lec00-mle-map.ipynb) that walks through ·
- Coin MLE via torch.distributions.Bernoulli.
- Linear regression MLE via torch.distributions.Normal — recover OLS.
- BCEWithLogitsLoss — show it equals the NLL of a Bernoulli.
- Same code skeleton, three lines change between MLE and MAP. That's the punchline.
Q. Do we always have to pick a distribution before training?
A. Yes — implicitly or explicitly. When you write MSE, you have implicitly assumed Gaussian noise. When you write BCE, Bernoulli. The conscious choice is what we're advocating today.
Q. Why minimize NLL instead of maximize LL?
A. Pure convention. ML libraries minimize. Negate, minimize, same answer.
Q. What if my output isn't Bernoulli/Categorical/Gaussian?
A. Pick whatever distribution fits — Poisson for counts, Beta for probabilities, mixture-of-Gaussians for multimodal data. The recipe (NLL = loss) is universal.
Q. How does this connect to Bayesian deep learning?
A. Bayesian DL keeps the full posterior over $\theta$ (or an approximation to it) instead of a single point estimate. MAP is the halfway point · one best $\theta$, but informed by a prior.