Problem	What you see
Slow API	Each request takes 2.5 seconds
Big file	The model file is 50 MB, too big for the mobile app
Out of memory	The Raspberry Pi crashes when loading the model

Part	Topic	Run alongside
1	How computers store numbers	`01-floating-point-basics`
2	Models, parameters and memory	`02-parameter-count-and-memory`
3	Profiling — find the bottleneck	`03-profiling-basics`
4	Batching — work smarter, not harder	`04-batching-benchmark`
5	Quantization in PyTorch and ONNX	`05-pytorch-dynamic-quantization`, `06-onnx-export-and-quantization`
6	Two more ideas: pruning and distillation	`07-pruning-basics`, `08-distillation-basics`
7	Putting it all together	(table built from 05–08)

#	Notebook	Open
01	Floating-point basics	Colab
02	Parameter count and memory	Colab
03	Profiling basics	Colab
04	Batching benchmark	Colab
05	PyTorch dynamic quantization	Colab
06	ONNX export and quantization	Colab
07	Pruning basics (unstructured + structured)	Colab
08	Distillation basics	Colab

Number	Scientific notation
3.14159	3.14159 × 10⁰
0.00042	4.2 × 10⁻⁴
98,700,000	9.87 × 10⁷

Field	Asks the question	Bits
Sign	positive or negative?	1
Exponent	how big or how small?	8
Mantissa	what are the significant digits?	23
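
The three fields in the table can be inspected directly. The sketch below uses only Python's standard `struct` module; the helper name `fp32_fields` is made up for illustration. It packs a number as a 32-bit float and slices the resulting bits into sign, exponent, and mantissa.

```python
import struct

def fp32_fields(x: float):
    """Split x, stored as FP32, into its (sign, exponent, mantissa) bit fields."""
    # Pack as a big-endian 32-bit float, then reread the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # top 1 bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits (the stored value, not the real exponent)
    mantissa = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, mantissa

sign, exponent, mantissa = fp32_fields(6.5)
print(f"{sign:01b} {exponent:08b} {mantissa:023b}")
```

The widths in the format string (1, 8, 23) match the "Bits" column, so the printed groups line up with the three boxes.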

Step 4a — A curious pattern

Look at what happens when we normalize different numbers the same way we
normalized 6.5 (slide the binary point so exactly one non-zero digit sits before it):

 6.5  →  1.101   × 2²          the leading digit is 1
 0.75 →  1.100   × 2⁻¹         the leading digit is 1
42.0  →  1.01010 × 2⁵          the leading digit is 1
 3.25 →  1.101   × 2¹          the leading digit is 1

Every single normalized number starts with a 1.

Why? Because in binary there are only two possible digits, 0 and 1. If the
digit before the point were 0, the number wouldn't be normalized yet: you would
still have to shift the point one more place to the right.

Analogy in base 10: "scientific notation" always looks like 3.14 × 10⁵,
never 0.314 × 10⁶ or 31.4 × 10⁴. Base 2 is even stricter: the only
legal leading digit is 1.
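
The pattern can be checked numerically. This is a minimal sketch that assumes positive inputs and relies on `math.frexp`, which returns a fraction in [0.5, 1) rather than [1, 2); the helper `normalize` (a name chosen here) rescales that by one binary place.

```python
import math

def normalize(x: float):
    """Return (significand, exponent) with x == significand * 2**exponent
    and 1 <= significand < 2. Assumes x > 0."""
    m, e = math.frexp(x)   # x == m * 2**e with 0.5 <= m < 1
    return m * 2, e - 1    # move the binary point one place to the right

for x in (6.5, 0.75, 42.0, 3.25):
    significand, exponent = normalize(x)
    # int(significand) is the digit in front of the binary point.
    print(f"{x:>5} = {significand} x 2^{exponent}, leading digit {int(significand)}")
```

The significand is printed in decimal (for example 1.625 for binary 1.101), but the digit before the point is the same in both bases.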

Real exponent	Stored as (add 127)	Binary
−126	1	`00000001` (smallest)
0	127	`01111111`
2 (ours)	129	`10000001`
127	254	`11111110` (largest)
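
To see the "add 127" rule in action, the sketch below assembles the 32 bits for 6.5 by hand and asks Python to read them back as a float. The function `encode_fp32` and its arguments are illustrative, and it only covers the normal exponent range shown in the table (stored values 1 to 254).

```python
import struct

BIAS = 127

def encode_fp32(sign: int, real_exponent: int, fraction_bits: str) -> int:
    """Build FP32 bits from a sign, a real exponent, and the bits after '1.'."""
    stored = real_exponent + BIAS            # the biased exponent from the table
    if not 1 <= stored <= 254:
        raise ValueError("exponent outside the normal FP32 range")
    if len(fraction_bits) > 23:
        raise ValueError("at most 23 fraction bits fit in the mantissa")
    # Pad the fraction on the right with zeros to fill all 23 mantissa bits.
    mantissa = int(fraction_bits.ljust(23, "0"), 2)
    return (sign << 31) | (stored << 23) | mantissa

# 6.5 = 1.101 x 2^2: sign 0, real exponent 2, fraction bits "101"
bits = encode_fp32(0, 2, "101")
(value,) = struct.unpack(">f", bits.to_bytes(4, "big"))
print(f"{bits:032b}", value)
```

The leading 1 of `1.101` never appears in `fraction_bits`; only the digits after the point are stored.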

Format	Bytes	Roughly how many distinct values	Typical use
FP32	4	~4 billion	Training (needs precision for tiny gradients)
FP16	2	~65 thousand	Mixed-precision training, GPU inference
INT8	1	256 (−128 … 127)	Inference on CPU, mobile, edge

Job	Analogy	Needed precision
Training	Measuring a microchip	Nanometers (FP32)
Inference	Measuring a room for furniture	Centimeters is fine

Format	Bytes per number
FP32	4
FP16	2
INT8	1
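
The bytes-per-number table turns into a memory estimate with one multiplication. The sketch below is a rough calculation that counts only the weights, not activations or file-format overhead, and takes 1 MB as 10^6 bytes. The layer sizes in the example are hypothetical.

```python
BYTES_PER_NUMBER = {"FP32": 4, "FP16": 2, "INT8": 1}

def model_size_mb(num_parameters: int, fmt: str) -> float:
    """Approximate weight storage in MB (1 MB = 10**6 bytes)."""
    return num_parameters * BYTES_PER_NUMBER[fmt] / 1e6

# Hypothetical 2-layer MLP: 784 -> 128 -> 10, weights plus biases.
num_parameters = 784 * 128 + 128 + 128 * 10 + 10

for fmt in BYTES_PER_NUMBER:
    print(f"{fmt}: {model_size_mb(num_parameters, fmt):.3f} MB")
```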

	Compute-bound	Memory-bound
Picture	Chef is slow at cooking	Waiter can't bring ingredients fast enough
Bottleneck	Not enough math power	Not enough data movement
When	Big batches, big matmuls	Loading weights, small batches
Fix	Fewer ops, faster hardware	Smaller model, better caching

Tool	What it tells you	Analogy
`time.time()`	Total runtime	"The meal took 90 minutes"
`%%timeit`	Average over many runs	"85 ± 5 minutes per meal"
`cProfile`	Time per function	"60 min on the main course"
`line_profiler`	Time per line	"45 min just chopping onions"
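
The first three rows of the table can be tried with the standard library alone. In the sketch below, `load_config` and `predict` are hypothetical stand-ins for a slow setup step and the function that calls it. `time.perf_counter()` is used in place of `time.time()` because it is designed for measuring short intervals. `%%timeit` is a notebook magic, so the plain `timeit` module stands in for it here.

```python
import cProfile
import io
import pstats
import time
import timeit

def load_config():
    """Hypothetical slow step, e.g. rereading a file on every call."""
    time.sleep(0.005)
    return {"scale": 2}

def predict(x):
    config = load_config()   # paid again on every single call
    return x * config["scale"]

# Stopwatch: total runtime of one call.
start = time.perf_counter()
predict(3)
elapsed = time.perf_counter() - start

# timeit: average over many runs.
runs = 20
per_call = timeit.timeit(lambda: predict(3), number=runs) / runs

# cProfile: time per function, sorted by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
for _ in range(runs):
    predict(3)
profiler.disable()
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)

print(f"one call: {elapsed:.4f} s, average: {per_call:.4f} s")
print(report.getvalue())
```

In the printed report, `load_config` should account for nearly all of the cumulative time of `predict`, which is the kind of result that points to a "load once" fix.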

Strategy	# of calls	Setup cost paid
One example at a time	1000	1000 × setup
One batch of 1000	1	1 × setup
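
The table can be reproduced by counting how often the fixed cost is paid. This is a toy sketch: `CountingModel` is a made-up class whose "setup" is only a counter, standing in for whatever per-call overhead a real model has (function dispatch, data transfer, kernel launch).

```python
class CountingModel:
    """Toy model that records how many times its fixed setup cost is paid."""

    def __init__(self):
        self.setup_calls = 0

    def __call__(self, batch):
        self.setup_calls += 1                    # fixed cost, once per call
        return [2.0 * x + 1.0 for x in batch]    # per-example work

examples = [float(i) for i in range(1000)]

# Strategy 1: one example at a time.
single = CountingModel()
one_at_a_time = [single([x])[0] for x in examples]

# Strategy 2: one batch of 1000.
batched_model = CountingModel()
batched = batched_model(examples)

print(single.setup_calls, batched_model.setup_calls)
```

Both strategies produce the same outputs; only the number of times the setup cost is paid differs.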

Model size	What you'll likely see
Tiny MLP (KB-sized, like Notebook 05)	INT8 slower, accuracy basically equal
Medium model (a few MB)	INT8 same speed or a bit faster
Large model (big transformer, LLM)	INT8 much faster and much smaller

Type	Effort	Quality	When to use
Dynamic	1 line of code	Good	Start here (easiest)
Static	~15 lines + calibration data	Better	When dynamic isn't enough
Quantization-Aware Training	Retrain the model	Best	Production, max quality
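
All three flavours rest on the same basic move: map each float to one of the 256 INT8 values and remember a scale to map it back. The sketch below shows one common variant, symmetric per-tensor quantization, in plain Python. It is an illustration under that assumption, not how PyTorch implements it internally, and the function names and sample weights are made up.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-128, 127] plus one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest step, then clamp to the INT8 range.
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the stored integers back to approximate floats."""
    return [q * scale for q in quantized]

weights = [0.42, -1.27, 0.05, 0.9, -0.33]   # hypothetical layer weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, worst_error)
```

The rounding error per weight is at most half a step (`scale / 2`), which is why a tensor whose values sit in a narrow range loses little accuracy.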

Where it runs	What they need
iPhone app	Swift / Core ML — no Python interpreter on the phone
Android app	Java / Kotlin — no PyTorch wheel for ARM Android
Browser demo	JavaScript / WebAssembly
Embedded device	C++, no GB-sized PyTorch install
Another team's Java backend	They refuse to add a Python dependency

	Before (FP32)	After (INT4)
Size	28 GB	3.5 GB
Hardware	High-end GPU	MacBook Air
Cost	Cloud GPU rental	Free, your laptop
Privacy	Data sent to cloud	Stays on device

Name	Is it a …	Main idea (one line)
GGUF	file format	Packs INT4/INT8 weights + scales + metadata into one portable file
GPTQ	algorithm	Layer-by-layer: solve for INT4 values that minimize error on a small calibration set
AWQ	algorithm	Activation-aware: protect the few weight channels that matter most

Name you'll see on Hugging Face	What it means
`llama-3-8b.Q4_K_M.gguf`	GGUF file, 4-bit, for `llama.cpp` / Ollama / LM Studio
`Llama-3-8B-Instruct-GPTQ`	GPTQ-quantized, load via `transformers` + `auto-gptq`
`Mistral-7B-Instruct-v0.2-AWQ`	AWQ-quantized, served with vLLM / TGI

	Unstructured pruning	Structured pruning
What's removed	Individual weights (edges)	Whole neurons / channels / heads (nodes)
Result on disk	Dense matrix with lots of zeros	A genuinely smaller matrix
Accuracy impact	Tiny (~0% at 40–60% sparsity)	Larger — you're deleting more at once
Speedup on normal CPU/GPU?	Not by default — you still multiply by zero	Yes — the matrix is literally smaller
Speedup with special kernels?	Yes (sparse BLAS, 2:4 sparsity on NVIDIA)	Always
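
The "Result on disk" row is easy to see on a small matrix. The sketch below is plain Python rather than the PyTorch pruning utilities, and the function names, the magnitude criterion, and the example weights are assumptions for illustration. Unstructured pruning zeros the smallest individual weights. Structured pruning drops whole rows, treating each row as one neuron.

```python
def prune_unstructured(matrix, fraction):
    """Zero the smallest-magnitude `fraction` of weights; the shape is unchanged.
    Ties at the threshold are all zeroed, so slightly more may be removed."""
    magnitudes = sorted(abs(w) for row in matrix for w in row)
    k = int(len(magnitudes) * fraction)
    if k == 0:
        return [row[:] for row in matrix]
    threshold = magnitudes[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]

def prune_structured(matrix, keep):
    """Keep the `keep` rows with the largest L1 norm; the result has fewer rows."""
    ranked = sorted(range(len(matrix)),
                    key=lambda i: sum(abs(w) for w in matrix[i]),
                    reverse=True)
    kept = sorted(ranked[:keep])   # preserve the original row order
    return [matrix[i][:] for i in kept]

weights = [
    [0.9, -0.05, 0.4],
    [0.02, 0.01, -0.03],
    [-0.7, 0.6, 0.1],
    [0.3, -0.8, 0.2],
]

sparse = prune_unstructured(weights, 0.5)   # still 4 x 3, half the entries are zero
small = prune_structured(weights, keep=3)   # genuinely 3 x 3

print(sparse)
print(small)
```

A dense matrix multiply over `sparse` does the same amount of work as before, while one over `small` has one fewer row to process.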

Level	What the student matches
Outputs (soft labels)	Teacher's output probabilities
Hidden features	Teacher's intermediate activations
Attention maps	Where a transformer teacher "looks"
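
For the first row, matching soft labels, a common choice of loss is the KL divergence between the teacher's and the student's output probabilities. The sketch below assumes that choice and adds a softening temperature, which the table does not mention. The function names and logits are hypothetical, and the usual scaling of the loss by the temperature squared is left out for brevity.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened probabilities."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

teacher_logits = [4.0, 1.0, 0.5]   # hypothetical teacher output for one example
good_student = [3.8, 1.1, 0.4]
poor_student = [0.5, 4.0, 1.0]

print(distillation_loss(good_student, teacher_logits))
print(distillation_loss(poor_student, teacher_logits))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so minimizing it pulls the student toward the teacher's soft labels.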

Variant	Size	Latency	Accuracy
baseline FP32	~600 KB	~5 ms	~0.97
dynamic INT8	~200 KB (↓3×)	~3–5 ms (hardware-dependent)	~0.96
ONNX FP32	~550 KB	~3 ms	~0.97
ONNX INT8	~180 KB (↓3×)	~2–3 ms	~0.96
pruned (40%, unstructured)	~600 KB (dense zeros)	~5 ms	~0.96
distilled student	~90 KB (↓7×)	~2 ms	~0.95

Target	Size budget	Speed budget	Typical tools
Cloud API	Large	Throughput first	Specialized servers
Mobile app	Small	Low latency	Mobile runtimes
IoT / edge	Very small	CPU only	ONNX, `llama.cpp`
Web browser	Tiny	JS / WebGPU	Browser runtimes
LLM on laptop	Fits in RAM	Usable speed	`llama.cpp`, Ollama

App feature	Model	Size	Runs
Keyboard prediction	Tiny LSTM	< 5 MB	On device
Face unlock	MobileFaceNet	< 5 MB	On device
"Hey Siri" / "OK Google"	Wake-word net	< 1 MB	On device
Offline Google Translate	Quantized transformer	~50 MB	On device
Camera HDR / night mode	Tiny CNN	< 2 MB	On device

Tool	Plain-English purpose
`vLLM`	Serve many LLM users efficiently in the cloud
`SGLang`	Fast structured / repeated prompting
`llama.cpp`	Run quantized models locally on CPUs / laptops
`Ollama`	Friendly wrapper over `llama.cpp` for desktop use
`Unsloth`	Fine-tune large models with less GPU memory

Task	Tool	Example
Measure runtime	`timeit`	`timeit.timeit(lambda: f(x), number=100)`
Find slow function	`cProfile`	`cProfile.run('my_fn()')`
Quantize PyTorch	`torch`	`torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)`
Export / deploy	`ONNX`	`torch.onnx.export(...)`
Run locally (LLM)	`llama.cpp`, Ollama	quantized models on CPUs
Serve many users	`vLLM`, `SGLang`	production LLM serving

#	Notebook	Key result
01	Floating point basics	Why `0.1 + 0.2 ≠ 0.3`
02	Parameters and memory	Connect parameter counts to MB
03	Profiling basics	Find the real bottleneck
04	Batching benchmark	See per-example latency drop as batch size grows
05	PyTorch dynamic quantization	Smaller model in 1 line
06	ONNX export and quantization	Run a model outside Python
07	Pruning basics	Unstructured vs structured, and the accuracy cliff
08	Distillation basics	Small student matches big teacher
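
The Notebook 01 result in the table above takes three lines to reproduce. Python's built-in `float` is 64-bit rather than FP32, but the cause is the same: 0.1 and 0.2 have no exact binary representation, so their stored values are slightly off. This is a minimal sketch using only the standard library.

```python
import math
from decimal import Decimal

total = 0.1 + 0.2

print(total == 0.3)              # exact comparison of the stored values
print(Decimal(0.1))              # the value actually stored for 0.1
print(math.isclose(total, 0.3))  # comparison with a tolerance
```

Comparing floats with a tolerance, as `math.isclose` does, is the usual way to avoid being surprised by this.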

Profiling, Quantization & Model Optimization

Week 11: CS 203 - Software Tools and Techniques for AI

Where We Are in the Course

A Story to Start With

By the End of This Lecture

Today's Plan

Companion Notebooks — Colab Links

Part 1: How Computers Store Numbers

Bits and Bytes — The Basics

Integers Are Easy

Floating Point — Scientific Notation for Computers

FP32: The Three Boxes

Encoding 6.5 — Steps 1 & 2

Encoding 6.5 — Step 3 (normalize)

Step 4a — A curious pattern

Step 4b — So we don't bother storing it

Step 4c — Applying it to 6.5

Edge Case — Zero

Step 5 — The Exponent Problem

Step 5 — The Biased Exponent Trick

Putting All 32 Bits Together

Decoding It Back to 6.5

FP32 vs FP16 vs INT8

How Much Precision Do We Actually Need?

Part 2: Models, Parameters, and Memory

A Model Is Just a Big Bag of Numbers

Real Models Get Big Fast

Part 3: Profiling — The Doctor's Checkup

"My Code Is Slow" Is Not Useful

Compute-Bound vs Memory-Bound

Two Kinds of Slowness

Four Profiling Tools, From Simple to Detailed

Level 1: The Stopwatch

Level 2: cProfile — Which Function Is Slow?

The Fix: Load Once (250x Speedup!)

Bonus: Memory Profiling

Profiling Cheat Sheet

Part 4: Batching — Work Smarter, Not Harder

Why Batching Helps

Batching in Practice

Part 5: Quantization — The Diet

Start with the Grocery Store

What Is Quantization, Concretely?

The Code Behind Quantization

Quantization on a Number Line

Why Does It Work? Look at the Weights

Quantization in PyTorch — One Line

What Gets Quantized?

What Does "Dynamic" Mean?

When Does INT8 Actually Speed Things Up?

Reading Your Notebook 05 Numbers

Three Flavours of Quantization

Why Do We Need ONNX?

ONNX in Code

Quantizing an sklearn Model via ONNX

The LLaMA Story

GGUF, GPTQ, AWQ — The Names You'll See

How to Recognize Them in the Wild

Part 6: Two More Ideas

Pruning — Two Flavours

Unstructured vs Structured Pruning

Pruning — How (PyTorch in 4 Lines)

Distillation — Teacher Trains Student

Distillation — Matching at Other Levels

The Mismatched-Size Catch

The Fix: a Throwaway Projection

These Ideas Stack

Part 7: Putting It All Together

The Accuracy vs Size Tradeoff

Compare Everything in One Place

Deployment Targets Have Budgets

Real Examples on Your Phone

A Few Tools Worth Recognizing

The Practical Workflow

FAQ

Summary

Key Takeaways

Tools Cheat Sheet

What to Try in the Notebooks

Questions?