Typical layer · one attention projection in a 7B LLaMA-style model is a 4096 × 4096 weight matrix W ≈ 16.8M parameters, and there are hundreds of such matrices.
Full fine-tune. Trainable = every weight: all 7B parameters need gradients, optimizer state, and a full-size checkpoint.
LoRA at rank r · freeze W and learn a low-rank update ΔW = B·A, with B of shape 4096 × r and A of shape r × 4096 → only 2 · 4096 · r trainable parameters per matrix (65,536 at r = 8).
Savings · with r = 8 on q_proj and v_proj across 32 layers: 32 × 2 × 65,536 ≈ 4.2M trainable parameters, about 0.06% of 7B.
| Method | Trainable params | Ratio | Disk |
|---|---|---|---|
| Full fine-tune | 7B | 100% | 14 GB |
| LoRA · r=8 | ~4M | 0.06% | 8 MB |
| LoRA · r=64 | ~33M | 0.47% | 66 MB |
| QLoRA · 4-bit base + r=8 | ~4M | 0.06% | 8 MB + 3.5 GB base |
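A quick sanity check of the r=8 row, assuming a LLaMA-7B-style shape (32 layers, hidden size 4096) and adapters on q_proj and v_proj only, matching the PEFT snippet below:

```python
# LoRA parameter count, back-of-the-envelope (assumed 7B shape: 32 layers, hidden 4096).
layers, hidden, r = 32, 4096, 8
per_matrix = 2 * hidden * r            # B (4096 x r) + A (r x 4096)
trainable = layers * 2 * per_matrix    # two target matrices (q_proj, v_proj) per layer
print(trainable)                       # 4,194,304 -> ~0.06% of 7B, ~8 MB in bf16
```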
Ship the 8 MB adapter alongside the public 7B base. Everyone downloads the base once; each task is just an adapter swap. This changed the open-source LLM ecosystem.
```python
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # where to inject the adapters
    lora_dropout=0.05,
    bias="none",
)

# base_model: any Hugging Face causal LM already loaded (e.g. a 7B model in bf16)
model = get_peft_model(base_model, lora_cfg)
model.print_trainable_parameters()
# trainable params: 4,194,304  all: 7,000,000,000  trainable%: 0.06%
```
The training loop is identical to normal SFT: only 0.06% of the parameters receive gradients; everything else stays frozen.
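To make the adapter swap described above concrete, a minimal sketch using the standard PEFT save/load API; the adapter directory name is just a placeholder:

```python
from peft import PeftModel

# After fine-tuning: write only the LoRA weights (~8 MB), not the 7B base.
model.save_pretrained("my-task-adapter")   # placeholder path

# Anyone who already has the shared base model attaches the adapter in one line.
tuned = PeftModel.from_pretrained(base_model, "my-task-adapter")
```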
Analogy. A photo .bmp (millions of colours) → save as .gif (256-colour palette). The GIF maps each pixel to its closest palette colour. Looks similar, vastly smaller.
QLoRA does this to weights. bf16 stores each weight in 16 bits (65,536 possible values); a 4-bit format keeps only a 16-value codebook, a 4× reduction in storage.
NF4 ("NormalFloat 4-bit") · a clever choice of those 16 codebook values, spaced to match the typical bell-shaped distribution of NN weights.
Per parameter: 16 bits → 4 bits, so the frozen 7B base shrinks from ~14 GB to ~3.5 GB.
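A toy sketch of block-wise codebook quantization. The evenly spaced codebook here is illustrative only; the real NF4 codebook places its 16 levels to match a normal weight distribution, with more levels near zero:

```python
import torch

def quantize_4bit(w: torch.Tensor, codebook: torch.Tensor):
    scale = w.abs().max()                                              # absmax scaling per block
    idx = ((w / scale).unsqueeze(-1) - codebook).abs().argmin(dim=-1)  # nearest codebook entry
    return idx.to(torch.uint8), scale                                  # 4-bit indices + one scale

def dequantize_4bit(idx: torch.Tensor, scale: torch.Tensor, codebook: torch.Tensor):
    return codebook[idx.long()] * scale                                # map indices back to values

codebook = torch.linspace(-1.0, 1.0, 16)   # illustrative; NOT the actual NF4 levels
w = torch.randn(4096)                      # one block of full-precision weights
idx, scale = quantize_4bit(w, codebook)
w_hat = dequantize_4bit(idx, scale, codebook)
print((w - w_hat).abs().mean())            # small reconstruction error, 4x less memory
```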
Standard LoRA (bf16 base). The frozen 7B base sits in GPU memory at 2 bytes per weight ≈ 14 GB, before adapters, activations, and optimizer state.
QLoRA (4-bit base). The same frozen base is stored in NF4 at ~0.5 bytes per weight ≈ 3.5 GB, which is what lets a 70B model fit on a single ≤ 48 GB consumer GPU.
Mechanism. Base weights stay quantized in NF4 and are dequantized to bf16 on the fly for each forward pass; the LoRA adapters are kept in bf16 and are the only parameters that receive gradients, so the quantized base is never updated.
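A hedged sketch of the usual Hugging Face setup (transformers + bitsandbytes + peft). BASE_MODEL_ID is a placeholder for whichever checkpoint you use, and lora_cfg is the config from the earlier snippet:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model

BASE_MODEL_ID = "your-7b-base-model"        # placeholder checkpoint name

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 codebook
    bnb_4bit_use_double_quant=True,         # also quantize the per-block scales
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for each matmul
)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID, quantization_config=bnb_cfg, device_map="auto"
)
model = get_peft_model(base_model, lora_cfg)  # bf16 LoRA adapters on a 4-bit frozen base
```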
Used everywhere in open-source LLM fine-tuning since 2023.
The alignment loop that made ChatGPT
SFT teaches the model one way to respond to each prompt. But for open-ended questions, many responses are acceptable — and you want the model to prefer the best one.
"Explain quantum entanglement to a 12-year-old."
SFT would weight all three equally if all are in the dataset. RLHF adds preference learning.
SFT is teaching a dog the command "sit." One correct action; demonstrate, repeat.
RLHF is teaching a dog to choose between options · "bark or stay quiet when the doorbell rings" · we reward good choices, penalize bad ones.
By showing the model pairs of options (good answer · bad answer) and letting it learn from the preference, we shape general behavior instead of having it memorize a single correct response.
The analogy: train a dog to fetch. Reward it for bringing the ball back, but keep the training gentle enough that it doesn't wreck everything else it already knows.
RLHF balances both goals · maximize reward, but stay near the SFT model:

$$\max_\theta \;\; \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)$$

Symbols: π_θ · the policy being trained; π_ref · the frozen SFT model; r_φ · the reward model's score; β · strength of the KL penalty.
Part 1 drives the model to produce high-reward answers. Part 2 keeps the policy close to its SFT initialization → prevents the dog from tearing up the garden.
Prompt. "Explain gravity to a 5-year-old."
Reward.
KL penalty. Contribution from this example
Total objective.
PPO updates
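A minimal sketch of that per-example computation, assuming the reward model's score and the per-token log-probabilities of the sampled answer under both models are already available (the full PPO machinery of advantages, clipping, and value heads is omitted):

```python
import torch

def rlhf_objective(reward: torch.Tensor,
                   policy_logprobs: torch.Tensor,   # log pi_theta of each sampled token
                   ref_logprobs: torch.Tensor,      # log pi_ref of the same tokens
                   beta: float = 0.1) -> torch.Tensor:
    # Sequence-level KL estimate: summed log-prob gap between policy and SFT reference.
    kl = (policy_logprobs - ref_logprobs).sum()
    # Part 1 (reward) minus part 2 (KL penalty); PPO steps theta to increase this.
    return reward - beta * kl

obj = rlhf_objective(torch.tensor(0.9),
                     torch.tensor([-1.2, -0.8, -0.5]),
                     torch.tensor([-1.3, -0.9, -0.5]))
print(obj)  # 0.9 - 0.1 * 0.2 = 0.88
```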
Reward hacking. The policy finds ways to game the RM that humans wouldn't approve of. Classic example: the RM rewards longer answers, so the policy learns to produce verbose output regardless of content.
Mode collapse. PPO can push the policy to always produce very similar responses — diversity loss.
Sycophancy. If human labelers preferred agreeable answers, the model learns to agree with whatever the user says, even when wrong.
Mitigating these is half the art of alignment.
Bypass the reward model entirely
Rafailov et al. 2023 · "Direct Preference Optimization: Your Language Model is Secretly a Reward Model."
Insight: the RLHF objective has a closed-form optimal policy given the reward. Work backwards — derive a loss directly on preference pairs, skipping the RM and PPO entirely.
RLHF · two sodas → judge scores each 1–10 → use scores to declare a winner. Indirect.
DPO · just ask "A or B?" → direct preference.
DPO writes a loss that says · "directly increase the probability of the winner and decrease the probability of the loser, relative to the frozen reference model."
Build it from the inside out:

$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left( \beta \left[ \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right] \right)$$

Each log-ratio is the policy's implicit reward for a response; the loss is logistic regression on "the winner's implicit reward should exceed the loser's."
Prompt · "Suggest a coffee shop name." Winner
| 0.05 | 0.06 | |
| 0.10 | 0.12 |
Winner score.
Loser score.
Difference =
Loss =
Gradient pushes
Pure supervised loss. No RL loop. No reward model. ~50 lines vs RLHF's thousands.
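Roughly what those ~50 lines reduce to, sketched for a single preference pair with precomputed sequence-level log-probabilities:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit rewards: how much more (or less) likely each response is under
    # the policy than under the frozen reference model.
    winner_score = beta * (policy_logp_w - ref_logp_w)
    loser_score = beta * (policy_logp_l - ref_logp_l)
    # Logistic loss on "winner should out-score loser".
    return -F.logsigmoid(winner_score - loser_score)

# The coffee-shop numbers above: both log-ratios are equal, so the margin is 0
# and the loss starts at log 2 ~ 0.693 before training pushes the pair apart.
lw, ll = torch.tensor(0.05).log(), torch.tensor(0.10).log()   # pi_theta(winner), pi_theta(loser)
rw, rl = torch.tensor(0.06).log(), torch.tensor(0.12).log()   # pi_ref(winner),  pi_ref(loser)
print(dpo_loss(lw, ll, rw, rl))   # tensor(0.6931)
```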
| | DPO | RLHF |
|---|---|---|
| Training stages | 1 | 2 (RM + PPO) |
| Code complexity | ~50 LoC | ~2000 LoC (PPO, RM, rollout) |
| Compute | 1× | 3–5× |
| Quality at top scale | tied | often slightly ahead |
| Typical adopters | open-source ecosystem | frontier labs |
Open-source (Hugging Face, Mistral, most Llama fine-tunes) · DPO. Frontier labs (Anthropic, OpenAI, Google) · RLHF variants with proprietary tooling. Both work.
Anthropic's 2022 variation (Constitutional AI / RLAIF) · use the model itself (or a stronger one) to judge response pairs against a written "constitution" of principles and generate the preference labels, in place of most human annotation.
Scales human annotation by factors of 100+. Used in Claude's alignment pipeline.
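A hypothetical sketch of one AI-feedback labeling step; judge_model, its generate method, and the two example principles are placeholders, not Anthropic's actual constitution or API:

```python
CONSTITUTION = [
    "Choose the response that is more helpful, honest, and harmless.",
    "Choose the response least likely to encourage dangerous or illegal activity.",
]

def ai_preference_label(judge_model, prompt: str, response_a: str, response_b: str) -> str:
    principle = CONSTITUTION[0]  # in practice a principle is sampled per comparison
    judge_prompt = (
        f"Human: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        f"{principle} Answer with exactly 'A' or 'B'."
    )
    # The judge's single-letter verdict becomes a preference label for DPO/RLHF-style training.
    return judge_model.generate(judge_prompt).strip()
```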
Test-time compute as a new axis
Latest generation · o1, o3, Claude extended thinking, DeepSeek R1.
The core idea: train the model with RL to produce long chains of thought, then let it spend far more tokens "thinking" at inference time, so harder problems get more test-time compute.
Result: dramatically better on math, code, logic benchmarks.
Scaling laws in training compute produced pretrained capability. A new axis — scaling test-time compute — now unlocks reasoning. Both will likely continue.
Traditional RL reward · "was the final answer correct?" · sparse, late signal.
Process reward model (PRM) · grades individual reasoning steps · dense signal, catches bad intermediate logic.
| Stage | Input | Output |
|---|---|---|
| PRM training | 100k human-labeled step-by-step proofs | reward-per-step model |
| RL with PRM | sampled chains-of-thought | policy updated to raise per-step reward |
| Result | chains improve step-by-step, not just final | better generalization |
OpenAI's "math-shepherd" (2024) · PRM-trained models beat outcome-reward-only models by 20+ points on AIME. Process > outcome rewards for multi-step reasoning.
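To make the sparse-vs-dense distinction concrete, a small sketch; prm_score_step stands in for a trained process reward model and is a placeholder, not a real API:

```python
from typing import Callable, List

def outcome_reward(final_answer: str, gold_answer: str) -> float:
    # Sparse, late signal: one number for the whole chain, granted only at the end.
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

def process_rewards(steps: List[str], prm_score_step: Callable[[str], float]) -> List[float]:
    # Dense signal: every intermediate step gets its own score, so RL can credit
    # or penalize the exact point where the reasoning goes wrong.
    return [prm_score_step(step) for step in steps]
```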
| Model | AIME 2024 (math) | Codeforces Elo | GPQA (science) |
|---|---|---|---|
| GPT-4 | 12% | ~800 | 39% |
| o1-preview | 44% | ~1500 | 73% |
| o1 | 74% | ~1900 | 78% |
| o3 | 97% | ~2700 (grandmaster) | 88% |
o3 at 97% on AIME · humans gold-medal at ~85%. One year of inference-compute scaling delivered this jump. Same base model class; the training regime changed.
| Scenario | Recommended |
|---|---|
| Small task-specific dataset (< 10k) | SFT on full model |
| Large instruction dataset (100k+) | SFT + LoRA |
| Consumer GPU (≤ 48GB) on 70B model | QLoRA (NF4 + LoRA) |
| Need preference alignment, small team | DPO |
| Need precise control of behaviors | RLHF + custom RM |
| Need safety + low-cost labels | Constitutional AI / RLAIF |
| Need reasoning / math / code | SFT + RL-with-process-rewards (o1 style) |
In the 2026 open-source ecosystem · QLoRA + DPO is the dominant recipe for instruction tuning. Frontier labs mix all of the above.