With MHA (64 heads, no GQA) that would be 84 GB — larger than the weights themselves.
Naive memory allocation reserves a bus-sized parking spot for every vehicle, even scooters · huge waste.
Paged attention has only standard car-sized spots ("pages"). A bus uses several; a scooter uses one. Pack many more vehicles in the same lot.
For LLMs · "vehicles" = ongoing requests, each needing variable-length KV cache. Pages let us serve many short prompts in the memory that previously held one long-prompt KV cache. ~4× throughput in vLLM.
Kwon et al. 2023 · vLLM paper.
Problem · KV-cache memory is allocated contiguously and reserved up front for the maximum sequence length, so most of the reservation sits unused and the GPU fills with fragmented, wasted space.
Solution · paged attention — split the KV-cache into fixed-size pages, managed like virtual memory. Each request uses only as many pages as it needs.
vLLM and TGI both use paged attention. ~4× throughput gain over naive implementations in production LLM serving in 2026.
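A minimal sketch of the bookkeeping idea (not vLLM's actual implementation): a pool of fixed-size pages plus a per-request page table that grows one page at a time as tokens arrive. The class name `PagedKVAllocator` and `PAGE_SIZE` are illustrative.

```python
# Toy page allocator for KV-cache blocks; illustrative only.
PAGE_SIZE = 16  # tokens per page

class PagedKVAllocator:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))   # physical page ids
        self.page_table = {}                       # request id -> list of page ids

    def append_token(self, request_id: str, position: int) -> int:
        """Return the physical page holding `position`, allocating only on a page boundary."""
        pages = self.page_table.setdefault(request_id, [])
        if position % PAGE_SIZE == 0:              # crossed into a new page
            if not self.free_pages:
                raise MemoryError("KV-cache pool exhausted; preempt or queue the request")
            pages.append(self.free_pages.pop())
        return pages[position // PAGE_SIZE]

    def release(self, request_id: str) -> None:
        """Request finished: every one of its pages goes back to the pool immediately."""
        self.free_pages.extend(self.page_table.pop(request_id, []))
```

A short prompt holds one page; a long one holds many; nothing is reserved for tokens that were never generated.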
Run bigger models on smaller hardware
You have a satellite image of a city with precise GPS for everything (FP32). To make a tourist map (INT8), you can't keep all that precision.
Quantization · find scale
Goal · convert FP32 weights to 8-bit integers in the range [-127, 127], keeping one FP32 scale per channel.
Step 1. Find the max absolute value in the channel: `m = max(|w|)`.
Step 2. Map the largest weight to the largest integer (127): `scale = m / 127`.
Step 3. Quantize each weight: `q = round(w / scale)`.
Step 4. Reconstruct at inference: `w ≈ q * scale`.
Store the int8 array + a single FP32 scale per channel; convert back on the fly wherever the weight is used.
Stored · 4 INT8 (4 bytes) + one FP32 scale (4 bytes) = 8 bytes vs 16 bytes for 4 FP32 weights. 50% saving, and the scale's share shrinks as the channel gets longer.
Reconstruction is approximate · the rounding error is at most scale/2 per weight.
PyTorch · torch.quantization.quantize_dynamic or bitsandbytes.
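A minimal usage sketch of the first option, dynamic INT8 quantization of the linear layers (the toy model here is purely illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Linear weights become INT8 with stored scales; activations are quantized
# on the fly at runtime, hence "dynamic". Runs on CPU.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(qmodel(x).shape)  # torch.Size([1, 10])
```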
Channel · 5 weights · w = [0.42, -0.81, 0.05, 0.37, -0.12]
Step 1. `m = max(|w|) = 0.81`.
Step 2. `scale = 0.81 / 127 ≈ 0.00638`.
Step 3. `q = round(w / scale) = [66, -127, 8, 58, -19]`. Stored · the 5 INT8 values (5 bytes) plus one float scale.
Reconstruction at inference · `q * scale ≈ [0.421, -0.810, 0.051, 0.370, -0.121]`; each weight is off by at most half a step.
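The same arithmetic as a quick sketch (the numbers reproduce the worked example above):

```python
import numpy as np

w = np.array([0.42, -0.81, 0.05, 0.37, -0.12], dtype=np.float32)

scale = np.abs(w).max() / 127             # steps 1-2: one FP32 scale for the channel
q = np.round(w / scale).astype(np.int8)   # step 3: 5 bytes instead of 20
w_hat = q.astype(np.float32) * scale      # step 4: reconstruct at inference

print(q)                                  # [  66 -127    8   58  -19]
print(np.round(w_hat, 3))                 # [ 0.421 -0.81   0.051  0.37  -0.121]
print(np.abs(w - w_hat).max() <= scale / 2)  # True: error bounded by half a step
```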
At 4 bits per weight, naive quantization breaks. Two successful tricks:
- GPTQ · quantize the weights column by column, using approximate second-order (Hessian) information to adjust the not-yet-quantized weights and absorb the error.
- AWQ · activation-aware weight quantization; identify the small fraction of weight channels that matter most (judged by activation magnitudes), rescale to protect them, and quantize the rest aggressively.
Both work well at 4-bit; AWQ edges out GPTQ at extreme compression (3-bit, 2-bit).
2026 practical recipe · AWQ 4-bit quantization for any LLM you're running on consumer hardware. exllamav2 or vLLM both support it.
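If you go the vLLM route, loading an AWQ checkpoint is a one-liner. A minimal sketch, assuming a vLLM install and an AWQ-quantized checkpoint (the model name below is a placeholder, not a recommendation):

```python
from vllm import LLM, SamplingParams

# Any AWQ-quantized checkpoint works; vLLM handles paged attention + batching.
llm = LLM(model="some-org/some-model-AWQ", quantization="awq")

out = llm.generate(["Explain paged attention in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```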
Rewrite attention for modern GPUs
Naive attention materializes the full N × N score matrix:

```python
scores = Q @ K.transpose(-2, -1) / d_h**0.5  # [B, H, N, N] ← huge for long context
weights = scores.softmax(dim=-1)
out = weights @ V                            # [B, H, N, d_h]
```
For N = 8192 at FP16, that score matrix is 8192 × 8192 × 2 bytes = 128 MB per head per batch element, and it has to round-trip through slow HBM.
Analogy · you need to paint a football-field-sized mural (the N × N score matrix), but your scaffolding only reaches a small patch at a time (on-chip SRAM). FlashAttention paints patch by patch and never lays the whole wall out at once.
Three ideas (Dao et al. 2022):
1. Tiling · load Q, K, V in blocks that fit in on-chip SRAM and compute attention tile by tile.
2. Online softmax · keep a running max and running sum per row so the softmax is computed incrementally, without ever storing the full score matrix.
3. Recompute, don't store · in the backward pass, recompute the attention tiles from Q, K, V instead of saving the N × N matrix.
Memory savings · at 8192-context, FP16, the 128 MB-per-head score matrix is never written to HBM; attention activation memory drops from O(N²) to O(N).
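A single-head sketch of ideas 1 and 2 (tiling plus online softmax). This is only the algorithmic skeleton in plain PyTorch; the real kernel fuses all of it on-chip. The function name and block size are illustrative.

```python
import torch

def tiled_attention(q, k, v, block=128):
    """q: [N, d], k/v: [M, d]. Never materializes the full N x M score matrix."""
    N, d = q.shape
    out = torch.zeros_like(q)
    row_max = torch.full((N, 1), float("-inf"))
    row_sum = torch.zeros(N, 1)
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / d**0.5                                  # scores for this tile: [N, block]
        new_max = torch.maximum(row_max, s.amax(dim=-1, keepdim=True))
        corr = torch.exp(row_max - new_max)                    # rescale the old accumulators
        p = torch.exp(s - new_max)
        row_sum = row_sum * corr + p.sum(dim=-1, keepdim=True)
        out = out * corr + p @ vb
        row_max = new_max
    return out / row_sum                                       # normalize once at the end

# Matches the reference computation up to floating-point error:
q, k, v = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax(q @ k.T / 64**0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)
```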
PyTorch 2.0+ ships F.scaled_dot_product_attention, which dispatches to FlashAttention-style fused kernels when the hardware supports them. Just call it; don't roll your own attention in 2026.
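Minimal usage, with shapes following the naive snippet above (the `device="cuda"` / FP16 arguments assume a CUDA GPU; drop them to run on CPU):

```python
import torch
import torch.nn.functional as F

B, H, N, d_h = 2, 8, 4096, 64
Q, K, V = (torch.randn(B, H, N, d_h, device="cuda", dtype=torch.float16) for _ in range(3))

# Same math as the naive version, but fused: no [B, H, N, N] tensor ever hits HBM.
out = F.scaled_dot_product_attention(Q, K, V, is_causal=True)
```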
Generate multiple tokens per forward pass
Autoregressive generation is one token per forward pass. For a 70B model this caps your throughput.
Speculative decoding (Leviathan et al. 2022):
1. A small, fast draft model proposes the next k tokens autoregressively.
2. The big verifier model scores all k proposals in a single forward pass.
3. Accept the longest prefix the verifier agrees with (via a rejection test that preserves the big model's output distribution), then continue from the first disagreement.
When the draft is right · you get multiple tokens per verifier forward pass instead of one.
When the draft is wrong · same cost as normal decoding (pay for one extra forward on the draft).
Net effect · draft is right ~70% of the time on typical text → ~2× speedup for free.
Used in GPT-4, Claude 3/4, and most hosted LLMs. The "draft" is often a specially-trained small version of the verifier or a lightweight heads-only model.
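A greedy-decoding sketch of one draft-and-verify round. It assumes Hugging Face-style causal LMs (`model(ids).logits`); the full algorithm adds a rejection-sampling test so that sampled outputs exactly match the verifier's distribution, which is omitted here.

```python
import torch

@torch.no_grad()
def speculative_step(draft, verifier, ids, k=4):
    """One round: draft proposes k tokens, verifier checks them in one pass.
    ids: [1, T] tokens so far. Returns ids extended by 1 to k+1 accepted tokens."""
    T = ids.shape[1]

    # 1. Cheap: the draft model proposes k tokens autoregressively.
    proposal = ids
    for _ in range(k):
        nxt = draft(proposal).logits[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=-1)

    # 2. Expensive but single pass: the verifier scores the whole proposal at once.
    # choice[:, i] = token the verifier itself would emit after proposal[:, :i+1].
    choice = verifier(proposal).logits.argmax(-1)

    drafted = proposal[0, T:]          # the k drafted tokens
    check = choice[0, T - 1:]          # verifier's pick at each drafted slot (+1 bonus slot)

    # 3. Accept the longest agreeing prefix, then take the verifier's own token
    #    at the first mismatch (or the bonus token if everything matched).
    n_ok = 0
    while n_ok < k and drafted[n_ok] == check[n_ok]:
        n_ok += 1
    accepted = torch.cat([drafted[:n_ok], check[n_ok:n_ok + 1]])
    return torch.cat([ids, accepted.unsqueeze(0)], dim=-1)
```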
Train a student to mimic a teacher
How does an apprentice learn from a master? Two ways · (1) study only the master's finished answers, or (2) watch how the master weighs every option before committing to one.
Distillation = method 2. The student is trained to match the teacher's full output distribution, not just its final answer.
Two losses:
Part 1 · standard CE against true labels: `L_CE = CrossEntropy(y, softmax(z_student))`.
Part 2 · imitate the teacher's distribution. Use KL divergence between softened distributions (temperature T): `L_KD = KL(softmax(z_teacher / T) || softmax(z_student / T))`.
Combine with weight α: `L = α·L_CE + (1 − α)·T²·L_KD`.
The T² factor keeps the soft-target gradients on the same scale as the hard-label gradients, so α behaves consistently across temperatures.
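A compact PyTorch version of this combined loss (the function name and default values are illustrative):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Part 1: hard-label cross-entropy.
    ce = F.cross_entropy(student_logits, labels)
    # Part 2: KL between temperature-softened distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps soft-target gradients on the same scale as the hard ones
    return alpha * ce + (1 - alpha) * kd
```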
Example · image of a cat, classes [Cat, Dog, Car]. Hard label · [1, 0, 0]. Teacher's softened output (say) · [0.90, 0.07, 0.03].
Now the student learns · increase Cat, decrease Dog and Car, but keep Dog about 2× Car. A rich, nuanced signal that hard labels alone would miss.
Many "small-but-good" 2026 models (Phi, Gemma, DistilRoBERTa) are distilled from bigger siblings. The frontier labs train big, then distill to ship.
Put together · ~10–50× faster and cheaper than naive implementations.
| Framework | Who | Good at |
|---|---|---|
| vLLM | Berkeley / community | paged attention, continuous batching |
| TGI (Text Generation Inference) | Hugging Face | production serving |
| llama.cpp | community | CPU + laptop inference |
| TensorRT-LLM | NVIDIA | peak H100 performance |
| MLX | Apple | M-series Macs |
| ExLlamaV2 | community | consumer GPU (RTX 4090) |
Choose by hardware + quality needs. For teaching, vLLM or llama.cpp are easiest to install.
| Stage | Speedup vs naive |
|---|---|
| Baseline (naive fp32) | 1× |
| + bf16 | 2× |
| + KV-cache | 10× |
| + FlashAttention-2 | 3× |
| + INT8 quantization | 1.5× |
| + speculative decoding | 2.5× |
| + batching + paged attention | 4× |
| Total (compounded) | ~900× |
The stages don't compound perfectly · real production serving lands at ~30-100× over naive PyTorch. The rest of the win is latency engineering (batching policy, KV-cache reuse across requests, model sharding).
| Model | Cost per 1M input tokens | Cost per 1M output tokens |
|---|---|---|
| Claude 3.5 Haiku | $1 | $5 |
| GPT-4o-mini | $0.15 | $0.60 |
| Llama-3 70B (self-hosted) | ~$0.30 | ~$0.60 |
| Claude 3.5 Sonnet | $3 | $15 |
| o1 | $15 | $60 |
Reasoning models cost 5-10× more (inference-time compute). Same token count, much more compute per token. "Think longer" is pay-per-second.