Model	FP32 Size	Your Laptop RAM
LLaMA-7B	28 GB	16 GB
GPT-3	700 GB	16 GB
BERT-base	440 MB	16 GB

Deployment Target	Size Constraint	Speed Constraint
Mobile phone	< 50 MB	< 50 ms
Edge / IoT	< 10 MB	CPU only
Cloud API	Fit in GPU RAM	< 100 ms (GPU costs $$)
LLM on laptop	Fit in 16 GB RAM	Usable speed

	Compute-Bound	Memory-Bound
Analogy	Chef is too slow cooking	Waiter is too slow bringing ingredients
Bottleneck	Not enough math power (FLOPS)	Not enough data bandwidth (GB/s)
When?	Large batches, big matrix multiplies	Small batches, loading model weights
Fix	Fewer operations, faster hardware	Smaller model, better caching
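One way to make the compute-bound versus memory-bound distinction concrete is arithmetic intensity: FLOPs performed per byte moved. The sketch below estimates it for a single linear layer at two batch sizes and compares it with a hardware "ridge point" (peak FLOP/s divided by memory bandwidth). The function names and the hardware figures are illustrative assumptions, not measurements of any real device, and the model ignores caching effects.

```python
def linear_layer_intensity(batch, d_in, d_out, bytes_per_value=4):
    """FLOPs per byte for y = x @ W, with x: (batch, d_in), W: (d_in, d_out)."""
    flops = 2 * batch * d_in * d_out  # one multiply and one add per term
    values_moved = batch * d_in + d_in * d_out + batch * d_out  # read x, W; write y
    return flops / (values_moved * bytes_per_value)


def classify(intensity, peak_flops, bandwidth_bytes_per_s):
    """Compare intensity with the ridge point of a simple roofline model."""
    ridge = peak_flops / bandwidth_bytes_per_s
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical accelerator: 10 TFLOP/s peak, 500 GB/s memory bandwidth.
PEAK_FLOPS = 10e12
BANDWIDTH = 500e9  # ridge point = 20 FLOPs per byte

for batch in (1, 4096):
    ai = linear_layer_intensity(batch, d_in=4096, d_out=4096)
    verdict = classify(ai, PEAK_FLOPS, BANDWIDTH)
    print(f"batch={batch:5d}  intensity={ai:7.1f} FLOP/byte  ->  {verdict}")
```

With a batch of 1 the layer does about half a FLOP per byte, because nearly all the traffic is loading weights; with a batch of 4096 the same weights are reused thousands of times, which pushes the layer to the compute-bound side.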

Simple Profiling

The simplest profiler: a stopwatch. But do it right.

import time
import torch

def benchmark(model, input_data, n_runs=100):
    """Return the average forward-pass latency in milliseconds."""
    model.eval()  # inference mode: disables dropout and batch-norm updates
    use_cuda = torch.cuda.is_available()

    with torch.no_grad():
        # Rule 1: Warmup (JIT compilation, cache filling)
        for _ in range(10):
            model(input_data)

        # Rule 2: Sync GPU before starting and before stopping the clock,
        # because CUDA kernels run asynchronously
        if use_cuda:
            torch.cuda.synchronize()

        # Rule 3: Average many runs
        start = time.perf_counter()
        for _ in range(n_runs):
            model(input_data)
        if use_cuda:
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

    return elapsed / n_runs * 1000  # milliseconds

Three rules: (1) Warm up before timing. (2) Sync the GPU before reading the clock. (3) Average over many runs.

For deeper profiling, use torch.profiler — see the notebook.
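As a rough idea of what that looks like, the sketch below profiles one forward pass and prints a per-operator breakdown. The small model and random input are stand-ins invented for illustration; the notebook's own example may differ.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Stand-in model and input (hypothetical; substitute your own).
model = torch.nn.Sequential(
    torch.nn.Linear(256, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).eval()
input_data = torch.randn(32, 256)

# Profile CPU activity; add ProfilerActivity.CUDA to the list on a GPU machine.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):  # label for this region in the report
        with torch.no_grad():
            model(input_data)

# Per-operator breakdown, sorted by total CPU time.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(report)
```

Where the stopwatch gives a single number, this table shows which operators that time was spent in, and therefore which ones are worth optimizing first.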

Format	Bytes per number	Precision
FP32	4 bytes	Very high (~7 decimal digits)
FP16	2 bytes	High (~3 decimal digits)
INT8	1 byte	Moderate (256 levels)
INT4	0.5 bytes	Low (16 levels)
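To illustrate the precision trade-off in the table, the sketch below maps FP32 values onto 8-bit integers with a single scale factor (symmetric, per-tensor quantization) and measures the round-trip error. This is one common scheme, chosen here for simplicity. The function names are hypothetical, the weights are random stand-ins, and real libraries use more elaborate variants.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: FP32 array -> (int8 array, scale)."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate FP32 values."""
    return q.astype(np.float32) * scale

# Random stand-in for a layer's weights (small values centred on zero).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=1000).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("bytes before:", weights.nbytes)  # 4 bytes per value
print("bytes after: ", q.nbytes)        # 1 byte per value
print("max abs error:", float(np.max(np.abs(weights - restored))))
```

The rounding error per value is at most half of `scale`, so the narrower the range of the weights, the less precision is lost for the same 4x saving in storage.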

Format	Bytes/param	100M param model	7B param model (LLaMA)
FP32	4	400 MB	28 GB
FP16	2	200 MB	14 GB
INT8	1	100 MB	7 GB
INT4	0.5	50 MB	3.5 GB
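The arithmetic behind this table is just parameter count times bytes per parameter. A minimal sketch, assuming 1 GB = 1e9 bytes and counting weights only (activations and runtime overhead are ignored); the helper names are made up for illustration:

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

def model_size_gb(n_params, fmt):
    """Weight storage in GB (1 GB = 1e9 bytes) for a given number format."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

def fits_in_ram(n_params, fmt, ram_gb):
    """True if the weights alone fit in the given amount of RAM."""
    return model_size_gb(n_params, fmt) <= ram_gb

# 7B-parameter model on a 16 GB laptop.
for fmt in BYTES_PER_PARAM:
    size = model_size_gb(7e9, fmt)
    print(f"{fmt}: {size:.1f} GB, fits in 16 GB: {fits_in_ram(7e9, fmt, 16)}")
```

In practice the operating system and the model's activations also need memory, so a model that only just fits by this estimate may still not run comfortably.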

	Before (FP32)	After (INT4)
Size	28 GB	3.5 GB
Hardware needed	A100 GPU ($10K+)	MacBook Air
Cost	Cloud GPU rental	Free (your laptop)
Privacy	Data sent to cloud	Everything stays local

Target	Recommended
Mobile app	INT8 + pruning (small & fast)
Cloud API	FP16 (fast & accurate)
Research	FP32 (maximum accuracy)
LLM on laptop	INT4 (fit in RAM)

Model Profiling & Quantization

Week 13 · CS 203: Software Tools and Techniques for AI

The Problem: Your Model is Too Fat and Slow

Why This Matters

Part 1: Profiling — The Checkup

The Itemized Receipt

Compute-Bound vs Memory-Bound

Simple Profiling

Part 2: Quantization — The Diet

What is Quantization?

Model Size by Format

Why Does It Work?

Dynamic Quantization: The One-Liner

The LLaMA Story

Part 3: Other Optimizations

Pruning — The Jenga Approach

Knowledge Distillation — Master and Apprentice

ONNX — The PDF of Machine Learning

Combining Techniques — The Full Pipeline

The Pareto Frontier

Practical Workflow

Key Takeaways

The Full Course Arc