Efficient Inference

Lecture 23 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Learning outcomes

By the end of this lecture you will be able to:

  1. Explain why LLM inference is memory-bound (not compute-bound).
  2. Compute the KV-cache size for Llama 70B at 32k context.
  3. Describe paged attention and why vLLM needed it.
  4. Pick a quantization level (FP16 / INT8 / INT4 / INT2) for a deployment.
  5. Explain FlashAttention tiling in 2 sentences.
  6. Describe speculative decoding and its accept-rate dependency.

Where we are

Training a frontier LLM costs on the order of $100M. But that's done once. Inference runs every request, every user, every day — and it's where models earn their keep.

Today maps to Chip Huyen's blog posts, HF inference docs, Dao 2022 (FlashAttention), Hinton 2015 (distillation), Leviathan 2022 (speculative decoding).

Four questions:

  1. Why is LLM inference memory-bound, not compute-bound?
  2. What is the KV-cache and how do we manage it?
  3. What is quantization and how low can we go?
  4. What are FlashAttention and speculative decoding?

PART 1

Prefill vs decode

Two phases · two bottlenecks

Reading vs writing · the inference analogy

Think of LLM inference like reading a chapter then writing a summary.

Prefill = reading the chapter all at once · all words processed in parallel.

Decode = writing the summary one word at a time · each new word depends on all previous, so it's strictly sequential.

That's why decode is much slower per token than prefill · we can't parallelize across the future. Almost every inference optimization (KV cache, speculative decode, FlashAttention) targets the decode phase.

Prefill vs decode · the kitchen analogy

A chef preparing food.

  • Banquet (prefill) · entire 10-course order arrives at once. Bottleneck = how fast they can chop, fry, plate in parallel. Compute-bound.
  • Single dish (decode) · need rare ingredients in sequence — run to pantry, chop, run again. Bottleneck = the pantry runs, not the cooking. Memory-bound.

Decode · why it's memory-bound

For one token of decode on a 70B model the GPU must:

  1. Fetch all 70B weights from HBM. At 2 bytes (BF16) = 140 GB.
  2. Fetch the KV-cache for the context (>10 GB at 32k context).
  3. Do the math · matrix–vector products. Tiny compared to the data movement.

NVIDIA A100 · ~1.5 TB/s HBM bandwidth → moving ~150 GB takes ~0.1 s. The math finishes in a fraction of that. The GPU spends most of its time waiting for data.

Prefill is different · the whole prompt is processed together as a big matrix–matrix product → the GPU's FLOPs are saturated. Compute-bound.
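A quick back-of-the-envelope roofline check of that claim. The sketch only assumes the round numbers already quoted: 70B params in BF16, ~1.5 TB/s of HBM bandwidth, ~312 TFLOP/s peak BF16 on an A100.

params          = 70e9
bytes_per_param = 2                  # BF16
hbm_bandwidth   = 1.5e12             # bytes / s (A100 HBM)
peak_flops      = 312e12             # BF16 FLOP/s (A100, dense)

weight_bytes    = params * bytes_per_param    # ~140 GB streamed per decode step
flops_per_token = 2 * params                  # ~1 multiply-add per weight per token

t_memory  = weight_bytes / hbm_bandwidth      # ~0.09 s just to move the weights
t_compute = flops_per_token / peak_flops      # ~0.0004 s of actual math

print(f"memory-bound by ~{t_memory / t_compute:.0f}x")   # roughly 200x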

Phase · Bottleneck
Prefill · compute
Decode · memory

Modern inference servers (vLLM, TGI) optimize the two phases separately.

PART 2

The KV-cache

Where most of the memory pressure lives

The KV-cache explained

Cost with vs without cache

Why caching KV works

Observation · attention at step t computes softmax(q_t K_{1:t}ᵀ / √d_h) V_{1:t}. We need every past key K_i, but K_i doesn't change once token i has been generated.

The V vectors have the same property. So cache K and V as you go — each new token does O(1) new computation for its own K/V and O(t) for the attention dot-products, instead of O(t²) re-encoding the whole prefix.

Memory grows linearly with context, but per-token compute drops from re-running the whole prefix to a single-token forward pass, roughly a factor-of-t saving. That's why the KV-cache is the first thing any LLM inference stack implements.
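A minimal decode-step sketch for one attention head. Wq, Wk, Wv are toy projection matrices made up for this illustration; a real stack keeps one cache per layer and per head.

import torch

d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))    # toy projections, one head
K_cache, V_cache = [], []                              # grows by one row per generated token

def decode_step(x_t):
    """x_t: [d] embedding of the newest token; returns this head's attention output."""
    q = x_t @ Wq
    K_cache.append(x_t @ Wk)                           # O(1) new K/V work per token
    V_cache.append(x_t @ Wv)
    K, V = torch.stack(K_cache), torch.stack(V_cache)  # [t, d]; reused, never recomputed
    attn = torch.softmax(q @ K.T / d**0.5, dim=-1)     # O(t) dot-products against the cache
    return attn @ V

for _ in range(5):
    decode_step(torch.randn(d))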

KV-cache math · Llama 70B

  • n_layers · layers (80)
  • n_kv · KV heads (8 with GQA — see Lecture 15)
  • d_h · head dim (128)
  • T · context length (32k)
  • B · batch size (1)
  • bytes per element (2 for BF16)

KV-cache size = 2 (K and V) × n_layers × n_kv × d_h × T × B × 2 bytes
              = 2 × 80 × 8 × 128 × 32 000 × 1 × 2 B ≈ 10.5 GB.

With MHA (64 heads, no GQA) that would be 84 GB — more than half the size of the 140 GB of BF16 weights.
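The same arithmetic as a short script (the numbers are exactly the configuration listed above, with 32k taken as 32 000 tokens):

def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, bytes_per_el=2):
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_el   # leading 2 = K and V

gqa = kv_cache_bytes(80, 8, 128, 32_000, 1)     # Llama-70B-style GQA config
mha = kv_cache_bytes(80, 64, 128, 32_000, 1)    # same model if all 64 heads kept their own K/V
print(f"GQA: {gqa / 1e9:.1f} GB · MHA: {mha / 1e9:.1f} GB")   # ~10.5 GB vs ~83.9 GB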

Paged attention · the parking-lot analogy

Naive memory allocation reserves a bus-sized parking spot for every vehicle, even scooters · huge waste.

Paged attention has only standard car-sized spots ("pages"). A bus uses several; a scooter uses one. Pack many more vehicles in the same lot.

For LLMs · "vehicles" = ongoing requests, each needing variable-length KV cache. Pages let us serve many short prompts in the memory that previously held one long-prompt KV cache. ~4× throughput in vLLM.

Paged attention · vLLM's big idea

Kwon et al. 2023 · vLLM paper.

Problem · KV-cache memory is allocated contiguously; long-context requests waste space.

Solution · paged attention — split the KV-cache into fixed-size pages, managed like virtual memory. Each request uses only as many pages as it needs.

vLLM and TGI both use paged attention. ~4× throughput gain over naive implementations in production LLM serving in 2026.
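A toy sketch of the bookkeeping idea, not vLLM's actual data structures or API: a shared pool of fixed-size pages plus a per-request block table.

PAGE_TOKENS = 16                                   # tokens of K/V stored per page

class PagedKVCache:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))   # shared physical pool
        self.block_tables = {}                     # request id -> list of page ids
        self.lengths = {}                          # request id -> tokens stored so far

    def append_token(self, req):
        """Reserve space for one more token of this request's KV-cache."""
        n = self.lengths.get(req, 0)
        if n % PAGE_TOKENS == 0:                   # last page is full: grab a new one
            self.block_tables.setdefault(req, []).append(self.free_pages.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        """Request finished · its pages go straight back to the shared pool."""
        self.free_pages.extend(self.block_tables.pop(req, []))
        self.lengths.pop(req, None)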

PART 3

Quantization

Run bigger models on smaller hardware

The quantization ladder

INT8 · the map-scale analogy

You have a satellite image of a city with precise GPS for everything (FP32). To make a tourist map (INT8), you can't keep all that precision.

  1. Find the scale · "1 inch on the map = 1 mile in reality." The scale is the only extra piece of information you need to keep.
  2. Convert · map every real location onto paper using map_position = real_position / scale.

Quantization works the same way · find a scale s that maps FP32 weights into the small integer range [-127, 127].

INT8 · derive the formula step by step

Goal · convert FP32 weights w to 8-bit integers in [-127, 127].

Step 1. Find the max absolute value in the channel: α = max_i |w_i|.
Step 2. Map the largest weight to the largest integer (127): scale s = α / 127.

Step 3. Quantize each weight: q_i = round(w_i / s).

Step 4. Reconstruct at inference: ŵ_i = s · q_i.

Store the INT8 array q + a single FP32 scale s per channel.
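A minimal per-channel version of those four steps in PyTorch. This is a sketch; production libraries such as bitsandbytes do the same thing with fused kernels and more care.

import torch

def quantize_int8(w):
    """Symmetric per-channel INT8 quantization of a weight matrix w: [out, in]."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0              # Steps 1 + 2: one scale per row
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)     # Step 3
    return q, scale

def dequantize_int8(q, scale):
    return q.float() * scale                                       # Step 4: reconstruct

w = torch.randn(4096, 4096)
q, s = quantize_int8(w)
print((dequantize_int8(q, s) - w).abs().max())                     # error bounded by s / 2 per entry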

INT8 · worked numeric

Convert, say, w = [0.5, -1.2, 0.85, 0.095] from FP32 to INT8.

  1. Max abs. α = 1.2.
  2. Scale. s = 1.2 / 127 ≈ 0.00945.
  3. Quantize. q = round(w / s) = [53, -127, 90, 10].

Stored · 4 INT8 (4 bytes) + scale FP32 (4 bytes) = 8 bytes vs 16 bytes (4× FP32). 50% saving.

Reconstruction. ŵ = s · q ≈ [0.5008, -1.2, 0.8504, 0.0945] (0.5008 vs the original 0.5). Max error ~0.001 — quality drop barely measurable.

PyTorch · torch.quantization.quantize_dynamic or bitsandbytes.
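As a concrete example, dynamic INT8 quantization of a toy model's Linear layers is a single call (the model here is a small stand-in, not an LLM; on GPU, bitsandbytes plays the analogous role):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Linear weights are stored as INT8 with scale factors; activations are
# quantized on the fly at inference time (CPU backend).
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print((model(x) - qmodel(x)).abs().max())   # small reconstruction error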

Quantization · arithmetic in pictures

Worked example · quantize a single weight row

Channel · 5 weights · w = [0.42, -0.81, 0.05, 0.37, -0.12]

Step 1. Max abs · α = 0.81, so scale s = 0.81 / 127 ≈ 0.00638.
Step 2. Quantize · q = round(w / s) = [66, -127, 8, 58, -19].
Step 3. Stored · the 5 INT8 values (5 bytes) plus one float scale (4 bytes) · 9 bytes total instead of 20 bytes for FP32.

Reconstruction at inference · ŵ = s · q ≈ [0.421, -0.810, 0.051, 0.370, -0.121]. Max error ≈ 0.001. Roundoff is small; quality drop on benchmarks barely measurable.
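The same five-weight example in PyTorch, so you can check the numbers:

import torch

w = torch.tensor([0.42, -0.81, 0.05, 0.37, -0.12])
s = w.abs().max() / 127                     # scale ≈ 0.00638
q = torch.round(w / s).to(torch.int8)       # tensor([  66, -127,    8,   58,  -19])
w_hat = q.float() * s                       # reconstruction
print(q, (w_hat - w).abs().max())           # max error ≈ 0.001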

INT4 and below · GPTQ / AWQ

At 4 bits per weight, naive quantization breaks. Two successful tricks:

  • GPTQ (Frantar 2022) · post-training per-layer quantization that minimizes reconstruction loss.
  • AWQ (Lin 2023) · protects the ~1% of weights that activate on important inputs.

Both work well at 4-bit; AWQ edges out GPTQ at extreme compression (3-bit, 2-bit).

2026 practical recipe · AWQ 4-bit quantization for any LLM you're running on consumer hardware. exllamav2 or vLLM both support it.
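Loading a pre-quantized AWQ checkpoint in vLLM looks roughly like this (the model name below is a placeholder for whichever AWQ checkpoint you actually use):

from vllm import LLM, SamplingParams

# Any AWQ-quantized checkpoint works here; the name is a placeholder.
llm = LLM(model="some-org/Llama-3-70B-Instruct-AWQ", quantization="awq")

out = llm.generate(["Explain the KV-cache in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)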

PART 4

FlashAttention

Rewrite attention for modern GPUs

The attention memory problem

Naive attention materializes the attention matrix:

scores = Q @ K.transpose(-2, -1) / d_h**0.5   # [B, H, N, N]   ← huge for long context
weights = scores.softmax(dim=-1)              # a second [B, H, N, N] buffer
out = weights @ V                             # [B, H, N, d_h]

For N = 32k, the two N × N intermediates (scores and weights) come to ~8 GB per head per layer in FP32. GPU HBM bandwidth becomes the bottleneck, not FLOPs.

FlashAttention · tiles in SRAM

FlashAttention · the giant-mural analogy

You need to paint a football-field-sized mural (the attention matrix).

  • Naive · rent a warehouse, lay out the canvas, paint everything. Huge temporary space (HBM).
  • FlashAttention · paint one small square at a time, on a portable easel (SRAM). Special technique to keep edges matching. Never see the whole mural at once.

For N = 32k, the attention matrix has ~1B elements → 4.3 GB per head per layer in FP32. Too big to read/write to main memory efficiently.

FlashAttention · how it works

Three ideas (Dao et al. 2022):

  1. Tile · break the Q, K, V matrices into blocks (e.g. 256×256). Each block fits in SRAM (~20× faster than HBM).
  2. Fuse · do the entire (matmul + softmax + matmul) for a block in one on-chip kernel — never write the intermediate to HBM.
  3. Online softmax · math trick to update softmax incrementally as new blocks come in. Result is mathematically identical to naive — not an approximation.

Memory savings · 8192-context, FP16.

  • Naive · 8192² × 2 bytes ≈ 134 MB per head/layer (must hit HBM).
  • Flash · 256×256 tile · 256² × 2 bytes = 128 KB in SRAM. The 134 MB is never created.
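A single-query sketch of idea 3, the online softmax, in plain PyTorch. This is illustrative only; the real kernel fuses the whole thing into an on-chip CUDA kernel and tiles queries as well.

import torch

def blockwise_attention(q, K, V, block=256):
    """Attention for one query q: [d] over K, V: [N, d], one block at a time."""
    d = q.shape[-1]
    m = torch.tensor(float("-inf"))       # running max of scores seen so far
    denom = torch.tensor(0.0)             # running softmax normalizer
    acc = torch.zeros(d)                  # running (unnormalized) weighted sum of V
    for i in range(0, K.shape[0], block):
        s = K[i:i + block] @ q / d**0.5               # scores for this block only
        m_new = torch.maximum(m, s.max())
        rescale = torch.exp(m - m_new)                # fix up earlier blocks' contribution
        p = torch.exp(s - m_new)
        acc = acc * rescale + p @ V[i:i + block]
        denom = denom * rescale + p.sum()
        m = m_new
    return acc / denom

q, K, V = torch.randn(64), torch.randn(8192, 64), torch.randn(8192, 64)
full = torch.softmax(K @ q / 64**0.5, dim=0) @ V      # naive reference
print(torch.allclose(blockwise_attention(q, K, V), full, atol=1e-4))   # True: exact, not approximate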

PyTorch 2.0+ ships F.scaled_dot_product_attention — just use it.
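Minimal usage, with the same [B, H, N, d_h] shapes as the naive snippet earlier:

import torch
import torch.nn.functional as F

B, H, N, d_h = 1, 8, 4096, 128
q, k, v = (torch.randn(B, H, N, d_h) for _ in range(3))

# Dispatches to FlashAttention / memory-efficient kernels when available;
# the N x N matrix is never materialized in HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([1, 8, 4096, 128])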

FlashAttention · adoption

  • PyTorch 2.0+ · built-in as F.scaled_dot_product_attention.
  • All major inference servers use it.
  • FlashAttention 3 (2024) · further optimizations for H100 Hopper.

Just call F.scaled_dot_product_attention — don't roll your own attention in 2026.

PART 5

Speculative decoding

Generate multiple tokens per forward pass

Speculative decoding · picture

The decode speedup trick

Autoregressive generation is one token per forward pass. For a 70B model this caps your throughput.

Speculative decoding (Leviathan et al. 2022):

  1. A small draft model (~1B params) quickly guesses the next k tokens (say k = 4).
  2. The big verifier model (70B) does ONE forward pass on all k draft tokens in parallel.
  3. Accept the longest prefix where draft and verifier agree; rewind if they diverge (see the sketch below).
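A greedy-acceptance sketch of that loop. Real systems use the paper's rejection-sampling rule to keep the verifier's distribution exact; here `draft` and `verifier` stand for any two HF-style causal LMs sharing a tokenizer, called as model(ids) with a .logits output.

import torch

@torch.no_grad()
def speculative_step(draft, verifier, ids, k=4):
    """Greedy variant: returns ids extended by 1..k+1 accepted tokens."""
    # 1. Draft proposes k tokens autoregressively (cheap model, k small passes).
    proposal = ids
    for _ in range(k):
        nxt = draft(proposal).logits[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=-1)

    # 2. Verifier scores every drafted position in ONE forward pass.
    logits = verifier(proposal).logits
    verify = logits[:, -(k + 1):-1].argmax(-1)     # verifier's own choice at each drafted slot
    drafted = proposal[:, -k:]

    # 3. Accept the longest agreeing prefix, then take one token from the verifier.
    agree = (verify == drafted)[0]
    n_ok = int(agree.long().cumprod(0).sum())
    accepted = proposal[:, : ids.shape[1] + n_ok]
    if n_ok == k:
        fix = logits[:, -1].argmax(-1, keepdim=True)   # bonus token after a full accept
    else:
        fix = verify[:, n_ok:n_ok + 1]                 # verifier's correction at first disagreement
    return torch.cat([accepted, fix], dim=-1)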

Why speculative works

When the draft is right · k + 1 tokens per verifier forward pass instead of 1. 2–4× speedup.

When the draft is wrong · same cost as normal decoding (you only pay for the cheap extra forwards on the draft).

Net effect · draft is right ~70% of the time on typical text → ~2× speedup for free.

Used in GPT-4, Claude 3/4, and most hosted LLMs. The "draft" is often a specially-trained small version of the verifier or a lightweight heads-only model.

PART 6

Knowledge distillation

Train a student to mimic a teacher

Distillation · the master-chef analogy

How does an apprentice learn from a master?

  • Hard labels · master shows finished dish, says "make this." Apprentice knows only the goal.
  • Soft labels · master says "lots of tomato, a little basil, a hint of oregano — and crucially, more basil than oregano." Apprentice learns relative proportions and "dark knowledge" of what not to do.

Distillation = method 2.

Distillation · the loss, term by term

Two losses:

Part 1 · standard CE against true labels (z_s = student logits):

L_CE = CrossEntropy(y, softmax(z_s))

Part 2 · imitate the teacher's distribution. Use KL divergence between softened distributions (temperature T, z_t = teacher logits):

L_KD = T² · KL( softmax(z_t / T) ‖ softmax(z_s / T) )

Combine with weight α:

L = (1 − α) · L_CE + α · L_KD

The T² factor corrects for the gradient shrinkage that softening causes (so the two losses are comparable in scale).
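The same loss in a few lines of PyTorch (the standard Hinton-2015 formulation; T and alpha are the usual hyperparameters):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5):
    ce = F.cross_entropy(student_logits, labels)               # Part 1: hard labels
    kd = T * T * F.kl_div(                                     # Part 2: soft teacher targets
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    return (1 - alpha) * ce + alpha * kd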

Distillation · worked numeric

Image of a cat. Classes [Cat, Dog, Car].

  • Teacher logits · say z_t = [5, 1, -1].
  • Student logits · say z_s = [2, 3, 1] (the student currently prefers Dog).

T = 1. softmax(z_t) = [0.98, 0.018, 0.002]. Almost no dark knowledge — student barely learns relative class structure.

T = 3. Soften logits: z_t / 3 = [1.67, 0.33, -0.33], z_s / 3 = [0.67, 1.00, 0.33].

  • Soft teacher · softmax(z_t / 3) ≈ [0.71, 0.19, 0.10]
  • Soft student · softmax(z_s / 3) ≈ [0.32, 0.45, 0.23]

Now the student learns: increase Cat, decrease Dog and Car, but keep Dog about 2× Car. Rich nuanced signal that pure hard-labels would miss.
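To reproduce the softened distributions above (the logits are the illustrative values from this slide):

import torch

z_t = torch.tensor([5.0, 1.0, -1.0])    # teacher logits: [Cat, Dog, Car]
z_s = torch.tensor([2.0, 3.0, 1.0])     # student logits

for T in (1, 3):
    print(T, torch.softmax(z_t / T, dim=0), torch.softmax(z_s / T, dim=0))
# T=1: teacher ≈ [0.98, 0.02, 0.00] · T=3: teacher ≈ [0.71, 0.19, 0.10], student ≈ [0.32, 0.45, 0.23]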

Distillation in 2026

  • DistilBERT (2019) · 40% smaller, 60% faster, 97% of BERT's quality. Still used.
  • DistilLLaMA, DistilGPT-2 · similar story for generative models.
  • Modern practice · distill a 70B teacher into a 7B student with task-specific data · near-teacher quality at 10× cheaper inference.

Many "small-but-good" 2026 models (Phi, Gemma, DistilRoBERTa) are distilled from bigger siblings. The frontier labs train big, then distill to ship.

PART 7

Full inference stack · 2026

What a production LLM server does

  1. Batched inference · pack many user requests into each forward pass.
  2. Paged attention for KV-cache memory efficiency.
  3. INT8/INT4 quantization for the weights.
  4. FlashAttention for attention compute.
  5. Speculative decoding for throughput.
  6. Continuous batching · new requests can join mid-batch without restart.
  7. Streaming output · send tokens as they are generated.

Put together · ~10–50× faster and cheaper than naive implementations.

The inference frameworks

Framework · Who · Good at
vLLM · Berkeley / community · paged attention, continuous batching
TGI (Text Generation Inference) · Hugging Face · production serving
llama.cpp · community · CPU + laptop inference
TensorRT-LLM · NVIDIA · peak H100 performance
MLX · Apple · M-series Macs
ExLlamaV2 · community · consumer GPU (RTX 4090)

Choose by hardware + quality needs. For teaching, vLLM or llama.cpp are easiest to install.

End-to-end · optimization stack compounded

Stage · Speedup vs naive
Baseline (naive FP32) · 1×
+ BF16 · ~2×
+ KV-cache · 10×
+ FlashAttention-2 · ~2×
+ INT8 quantization · 1.5×
+ speculative decoding · 2.5×
+ batching + paged attention · ~6×
Total (compounded) · ~900×

Real production serving · ~30-100× over naive PyTorch. The rest is latency engineering (batching, KV-cache reuse across requests, model sharding).

Cost economics · 2026

Model · Cost per 1M input tokens · Cost per 1M output tokens
Claude 3.5 Haiku · $1 · $5
GPT-4o-mini · $0.15 · $0.60
Llama-3 70B (self-hosted) · ~$0.30 · ~$0.60
Claude 3.5 Sonnet · $3 · $15
o1 · $15 · $60

Reasoning models cost 5-10× more (inference-time compute). Same token count, much more compute per token. "Think longer" is pay-per-second.

Lecture 23 — summary

  • Prefill vs decode · compute-bound vs memory-bound. Different optimizations.
  • KV-cache · dominates memory for long contexts. GQA (Lecture 15) + paged attention shrink it.
  • Quantization · BF16 → INT8 → NF4. AWQ handles 4-bit well; <1% quality loss.
  • FlashAttention · exact attention with O(N) memory · 2–4× faster.
  • Speculative decoding · draft model proposes, big model verifies in parallel · 2–4× throughput.
  • Distillation · big teacher trains small student on soft targets.
  • Production stack · all of these combined → ~10–50× over naive.

Read before Lecture 24

Anthropic interp blog; Chi et al. 2023 (Diffusion Policy); blog posts on Claude Code / computer use.

Next lecture · last one!

Frontier · Agents, Reasoning, Interpretability + course wrap-up.

Notebook 23 · 23-kv-cache.ipynb — take a small GPT; add KV-cache to generation loop; measure tokens/second speedup.