Modern LLMs often train well past Chinchilla optimal:
| Model | Params (N) | Tokens (D) | D/N | Notes |
|---|---|---|---|---|
| Chinchilla | 70B | 1.4T | 20 | training-compute optimal |
| Llama 2 70B | 70B | 2T | 29 | slightly over |
| Llama 3 8B | 8B | 15T | 1875 | wildly over |
| Llama 3 70B | 70B | 15T | 214 | heavily over |
Why overtrain?
Chinchilla optimizes training compute. But inference is where models earn their keep. A smaller, over-trained model has lower inference cost per query — you get back the extra training compute many times over.
Two rules from Chinchilla:

1. Training compute: $C \approx 6ND$ ($N$ params, $D$ tokens).
2. Compute-optimal data: $D^* \approx 20N$.

Substitute (2) into (1) to solve for $N^*$:

$$C \approx 6N \cdot 20N = 120N^2 \quad\Rightarrow\quad N^* = \sqrt{C/120}$$

Then

$$D^* = 20N^* = 20\sqrt{C/120} \quad\Rightarrow\quad N^*,\, D^* \propto \sqrt{C}$$

For 10× more compute · $N^*$ and $D^*$ each scale by $\sqrt{10} \approx 3.2\times$.
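The same two rules as code (a minimal sketch; the 6ND and 20-tokens-per-param constants come straight from the rules above):

```python
import math

def chinchilla_optimal(compute_flops):
    """Compute-optimal params N* and tokens D* under C = 6*N*D, D = 20*N."""
    n_star = math.sqrt(compute_flops / 120)   # C = 6*N*(20*N) = 120*N^2
    d_star = 20 * n_star
    return n_star, d_star

# Chinchilla's actual budget: C = 6 * 70e9 * 1.4e12 ≈ 5.9e23 FLOPs
n, d = chinchilla_optimal(5.9e23)
print(f"N* ≈ {n/1e9:.0f}B params, D* ≈ {d/1e12:.1f}T tokens")  # ≈ 70B, 1.4T

# 10x the compute -> both scale by sqrt(10) ≈ 3.16x
n10, d10 = chinchilla_optimal(10 * 5.9e23)
print(f"{n10/n:.2f}x params, {d10/d:.2f}x tokens")
```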
| Scenario | Params | Tokens | Status |
|---|---|---|---|
| GPT-3 (2020) · undertrained | 175B | 300B | too big, too few tokens |
| Chinchilla (2022) · optimal | 70B | 1.4T | train-compute sweet spot |
| Llama-3 8B (2024) · overtrained | 8B | 15T | inference-optimal for serving |
| A large startup's "bigger = better" model | 500B | 200B | wastes compute |
The undertrained regime is more wasteful than the overtrained. GPT-3 put 175B params on only 300B tokens (D/N ≈ 1.7); at a comparable compute budget, Chinchilla's 70B/1.4T split reached markedly lower loss. Modern LLMs carefully size N and D for the whole lifecycle: training compute plus the inference bill.
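A back-of-envelope sketch of the train-vs-serve trade in Python (using the standard approximations of ≈ 6ND training FLOPs and ≈ 2N inference FLOPs per token; the break-even point is illustrative, not a measured result):

```python
# Lifecycle FLOPs: train once (≈ 6*N*D), then pay ≈ 2*N per generated token.
def train_flops(n, d): return 6 * n * d

big_N,   big_D   = 70e9, 1.4e12   # Chinchilla-optimal 70B
small_N, small_D = 8e9,  15e12    # over-trained 8B (Llama 3-style)

extra_train = train_flops(small_N, small_D) - train_flops(big_N, big_D)
saved_per_tok = 2 * big_N - 2 * small_N     # inference FLOPs saved per token

print(f"extra training: {extra_train:.1e} FLOPs")       # ~1.3e23
print(f"saved per token: {saved_per_tok:.1e} FLOPs")    # ~1.2e11
print(f"break-even: {extra_train / saved_per_tok:.1e} tokens served")  # ~1e12
```

Past roughly a trillion served tokens, the smaller model has paid back its extra training compute (quality differences aside; the sketch counts FLOPs only).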
The 2021 fix that stuck
Both learned and sinusoidal absolute embeddings inject position by adding a position vector to the token embedding: $x_i = e_{\text{tok}(i)} + p_i$.
Problems:

- Hard length cap · positions past max_len have no embedding; no extrapolation.
- Absolute, not relative · attention must learn offset patterns indirectly, position by position.
- Extra parameters · a learned position table the size of max_len × d_model.

RoPE fixes all three by rotating queries and keys instead of adding position vectors:
Relative positions encoded naturally · the inner product after rotation depends only on the offset $n - m$ between key and query positions.
Extrapolates beyond training length · rotation frequencies are fixed; a model trained at 4k context can extend to 32k without re-training (with minor fixes).
Zero added parameters · rotation matrices are deterministic given position; no nn.Embedding(max_len, d_model) allocation.
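For contrast, a minimal sketch (assuming PyTorch, with toy sizes) of the additive scheme above, showing the hard length cap that RoPE removes:

```python
import torch
import torch.nn as nn

max_len, d_model, vocab = 512, 64, 1000
tok_emb = nn.Embedding(vocab, d_model)
pos_emb = nn.Embedding(max_len, d_model)      # the learned position table

tokens = torch.randint(0, vocab, (1, max_len))
x = tok_emb(tokens) + pos_emb(torch.arange(max_len))  # fine up to position 511

# pos_emb(torch.tensor([512])) -> IndexError: no row exists for position 512.
# Longer context means re-allocating and re-training this table; RoPE has no table.
```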
Llama, Mistral, Qwen, GPT-NeoX all use RoPE in 2026. A 2021 paper (Su et al.) that took ~2 years to catch on is now the default.
| Year | Frontier model | Context |
|---|---|---|
| 2018 | BERT | 512 tokens |
| 2020 | GPT-3 | 2,048 |
| 2023 | GPT-4 | 32k |
| 2023 | Claude 2 | 100k |
| 2024 | Gemini 1.5 | 1,000,000 |
| 2026 | frontier | 2-10M |
What unlocked 1M? · RoPE extrapolation, FlashAttention (O(N) memory), GQA (smaller KV cache), and training on long documents from the start. No single trick; the stack compounds.
Imagine queries and keys living in a 2D plane. Instead of adding a position vector, rotate: the query at position $m$ by angle $m\theta$, the key at position $n$ by $n\theta$.

Attention score = dot product of the rotated vectors.

The model learns about relative positions directly.

For position $m$, the rotation matrix is

$$R(m\theta) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}$$

Apply to query (at position $m$) and key (at position $n$): $q' = R(m\theta)\,q$, $k' = R(n\theta)\,k$.

Attention score: $\text{score} = q'^\top k' = q^\top R(m\theta)^\top R(n\theta)\,k$.

Two properties of rotations:

1. $R(a)^\top = R(-a)$ (orthogonal).
2. $R(a)\,R(b) = R(a+b)$ (angles add).

Substitute: $\text{score} = q^\top R(-m\theta)\,R(n\theta)\,k = q^\top R((n-m)\theta)\,k$.

The score depends only on the relative position $n - m$, never on $m$ or $n$ alone. (In $d$ dimensions, RoPE applies this 2D rotation to each consecutive pair of dimensions, each pair with its own frequency $\theta_j$.)

Worked check, with numbers chosen for illustration: $\theta = \pi/4$, $q = (1, 0)$, $k = (0, 1)$, $m = 1$, $n = 3$.

Rotation matrices. $R(\pi/4)$ for the query, $R(3\pi/4)$ for the key.

Rotate. $q' = (\tfrac{\sqrt2}{2}, \tfrac{\sqrt2}{2})$, $k' = (-\tfrac{\sqrt2}{2}, -\tfrac{\sqrt2}{2})$.

Dot product. $q' \cdot k' = -\tfrac12 - \tfrac12 = -1$.

Verify with the relative-position form ($n - m = 2$): $q^\top R(\pi/2)\,k = (1, 0) \cdot (-1, 0) = -1$. ✓
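A minimal NumPy check of the relative-position property (a sketch over a single 2D pair; real RoPE repeats this per pair of dimensions with frequencies $\theta_j = 10000^{-2j/d}$):

```python
import numpy as np

def rot(a):
    """2D rotation matrix R(a)."""
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

theta = np.pi / 4
q, k = np.array([1.0, 0.0]), np.array([0.0, 1.0])

for m, n in [(1, 3), (5, 7), (100, 102)]:   # all share the offset n - m = 2
    score = (rot(m * theta) @ q) @ (rot(n * theta) @ k)
    print(m, n, round(score, 6))             # -1.0 every time: only n - m matters
```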
Used in Llama 1/2/3, Mistral, PaLM, GPT-NeoX. No extra params, extrapolates beyond training length.
MQA, GQA, and the KV-cache
During autoregressive decoding we attend to all previous tokens. Recomputing K and V for the entire prefix at every step would be quadratically wasteful; instead, compute each token's K and V once and cache them. That cache is what dominates serving memory.
Build up the cache size, layer by layer:

Setup: Llama 2 70B · $n_\text{layers} = 80$, $d_\text{model} = 8192$, $n_\text{heads} = 64$, $d_\text{head} = 128$, fp16 (2 bytes/value), context 4096.

Per token, per layer, full multi-head attention stores K and V: $2 \times 8192$ values. Across 80 layers: $2 \times 80 \times 8192 \times 2\,\text{B} \approx 2.6\,\text{MB per token}$, so a 4096-token context costs $\approx 10.7\,\text{GB}$ per sequence.

Punchline. The 70B weights (fp16) take 140 GB; with an MHA-style cache, a batch of only ~16 such sequences (~170 GB of cache) would outweigh the weights themselves.
The KV-cache is the model's notebook.
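A minimal sketch of that notebook in NumPy (single head, toy dimensions, query projection omitted; real implementations preallocate and batch, but the append-and-reuse pattern is the same):

```python
import numpy as np

d = 4
Wk, Wv = np.random.randn(d, d), np.random.randn(d, d)
K_cache, V_cache = [], []            # the model's "notebook"

def decode_step(x):
    """One new token's hidden state x: append its K,V, attend over all cached ones."""
    K_cache.append(x @ Wk)           # computed once, never recomputed
    V_cache.append(x @ Wv)
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ x / np.sqrt(d)      # query = current token (projection omitted)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V               # attention output over the full prefix

for _ in range(5):
    out = decode_step(np.random.randn(d))
print(len(K_cache), "cached K/V entries")   # grows by one per generated token
```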
Refine the formula · let $n_\text{kv}$ be the number of KV heads: per-token cache $= 2 \cdot n_\text{layers} \cdot n_\text{kv} \cdot d_\text{head} \cdot \text{bytes/value}$. MHA has $n_\text{kv} = n_\text{heads}$, MQA has $n_\text{kv} = 1$, and GQA sits in between.

Now compare for Llama 2 70B ($n_\text{layers} = 80$, $d_\text{head} = 128$, context 4096, fp16):
| Variant | n_kv | Cache @ 4k context, batch 1 |
|---|---|---|
| MHA (Llama 1) | 64 | ~10.7 GB |
| GQA (Llama 2, 8 groups) | 8 | ~1.3 GB |
| MQA | 1 | ~0.17 GB |
GQA reduces the KV-cache by 8× relative to MHA (64 query heads share 8 KV heads) at negligible quality cost; MQA buys another 8× but degrades quality more, which is why Llama 2 chose GQA.
```python
# In Llama 2 70B:
n_heads = 64    # query heads
n_kv = 8        # GQA groups (KV heads shared across query heads)
d_head = 128    # per-head dimension
n_layers, seq_len, fp16_bytes = 80, 4096, 2
# K and V, per layer, per position, per KV head:
cache = 2 * n_layers * n_kv * d_head * seq_len * fp16_bytes
print(f"{cache / 1e9:.2f} GB")  # ~1.34 GB (MHA with n_kv=64: ~10.7 GB)
```
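How the sharing works mechanically, as a minimal NumPy sketch (toy sizes, no causal mask; each group of query heads attends against one repeated KV head):

```python
import numpy as np

n_heads, n_kv, d_head, t = 8, 2, 4, 5       # toy sizes, not Llama's
group = n_heads // n_kv                      # 4 query heads per KV head

q = np.random.randn(n_heads, t, d_head)      # one Q projection per query head
k = np.random.randn(n_kv,    t, d_head)      # only n_kv K (and V) projections
v = np.random.randn(n_kv,    t, d_head)

k_full = np.repeat(k, group, axis=0)         # broadcast each KV head to its group
v_full = np.repeat(v, group, axis=0)         # shapes now (n_heads, t, d_head)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_head)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)
out = w @ v_full                             # (n_heads, t, d_head); the cache
print(out.shape)                             # held only n_kv heads: 4x smaller
```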
How you fit a 70B model on real hardware
You need to build a massive LEGO model (the LLM) that won't fit on one person's table (one GPU). Hire a team: everyone builds their own full copy from different instruction pages and compares notes (data parallel), everyone holds a slice of every piece (tensor parallel), or you run an assembly line of stages (pipeline parallel).

Frontier training combines all three · 3D parallelism. Each axis trades off memory, compute, and communication.
Data parallelism · Each GPU has a full copy of the model and trains on different batches; gradients are averaged (all-reduced) across GPUs.
Simple. Works for models that fit on one GPU. Breaks at 10B+.
Tensor parallelism · Split each matrix multiply across GPUs; each GPU holds a slice of W.
Megatron-LM. Required for >10B. Heavy all-reduce bandwidth.
Pipeline parallelism · Split the layer stack across GPUs: layers 1-10 on GPU 1, layers 11-20 on GPU 2, etc. A bubble of idle time unless you use micro-batching.
Add ZeRO (optimizer state sharded across data-parallel ranks) on top of these three axes and you have the full picture of a modern training run.
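A minimal single-process sketch of the tensor-parallel idea (NumPy stand-in: column-sharding one weight matrix across hypothetical "GPUs"; real systems like Megatron-LM add the communication collectives):

```python
import numpy as np

n_gpus, d_in, d_out = 4, 8, 16
x = np.random.randn(2, d_in)                 # a batch of activations
W = np.random.randn(d_in, d_out)

# Column parallelism: "GPU" i holds d_out/n_gpus columns of W.
shards = np.split(W, n_gpus, axis=1)

# Each "GPU" computes its slice of the output independently...
partials = [x @ w_shard for w_shard in shards]

# ...then an all-gather concatenates the slices (the communication step).
y_parallel = np.concatenate(partials, axis=1)

assert np.allclose(y_parallel, x @ W)        # identical to the unsharded matmul
```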
Training a 70B from scratch in 2026 · ~10k H100 GPUs for ~2 months.
Almost no one trains from scratch. Everyone fine-tunes open-weight models (Llama, Mistral, Qwen) with LoRA (next lecture).
When more params unlock new behaviors
Null hypothesis · smooth scaling. As you make a model bigger, training loss decreases smoothly. A 10B model is a bit better than a 1B; a 100B model is a bit better than a 10B. Intuitive.
The surprise · for some specific complex tasks, this doesn't happen — performance is near-random until a threshold, then takes off.
Why? A multi-step task is a product of step accuracies: a $k$-step task with per-step accuracy $p$ succeeds end-to-end with probability $p^k$. At $p = 0.9$ over 10 steps that is $0.9^{10} \approx 0.35$; nudge $p$ up to $0.99$ and it jumps to $\approx 0.90$.
Smooth improvement in per-step accuracy translates to what looks like a discontinuous jump in end-to-end task performance.
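The shape of that jump, in a few lines (a sketch; per-step accuracy is simply assumed to improve smoothly with scale, to show how the end-to-end curve steepens):

```python
# Smooth per-step accuracy vs. end-to-end success on a k-step task.
k = 10
for p in [0.70, 0.80, 0.90, 0.95, 0.99]:   # smooth improvement in p...
    print(f"per-step {p:.2f} -> end-to-end {p**k:.3f}")
# per-step 0.70 -> end-to-end 0.028       # ...looks like a sudden
# per-step 0.80 -> end-to-end 0.107       # takeoff end-to-end
# per-step 0.90 -> end-to-end 0.349
# per-step 0.95 -> end-to-end 0.599
# per-step 0.99 -> end-to-end 0.904
```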
An ability is emergent if it:

- is absent (near-random performance) in smaller models;
- appears (well above random) past some scale;
- could not have been predicted by extrapolating smaller models' curves.
No one trained specifically for it. It just appears.
Resolution · both sides are partially right. Capability improves continuously in log-probability, but thresholded metrics (exact match: full credit or none) make certain tasks look discontinuous. The user experience is still one of qualitative leaps.
Standard prompt · "Q: 23 × 47 = ?" → A: a single blurted number, often wrong.
CoT prompt · "Q: 23 × 47 = ? Let's think step by step." →
A: "23 × 47 = 23 × (50 − 3) = 1150 − 69 = 1081"
CoT unlocks multi-digit arithmetic, commonsense, logic at 60B+. Below that, CoT adds nothing (the model can't reason in steps either).
The prompt itself is a learnable control · "let's think step by step" (Kojima 2022) can add 15 points on GSM8K. No fine-tuning. This thread becomes reasoning models (o1, Claude thinking) in 2024.
| Ability | Roughly where it emerges |
|---|---|
| Multi-digit arithmetic | ~13B |
| Basic code generation | ~13B |
| Few-shot in-context learning | ~50B |
| Chain-of-thought reasoning | ~60B |
| Tool use (with prompting) | ~70B+ |
Wei et al. 2022 · "Emergent Abilities of Large Language Models." Contested (Schaeffer et al. 2023 argue it's a metric artifact) but the phenomena are real.
During pretraining the model learns only next-token prediction. But at large enough scale, it starts to learn at inference time from examples in the prompt:
Translate to French:
sea otter → loutre de mer
cheese → fromage
banana → banane
carrot → ???
No gradient update taught the model this task format. Given three examples, it infers the task and produces "carotte".
This is few-shot learning without weight updates. Emergent at scale; the foundation of modern prompting.
Prompting the model to "think step by step" dramatically improves multi-step reasoning:
Q: Roger has 5 tennis balls. He buys 2 more cans of 3
tennis balls each. How many tennis balls?
Without CoT: a direct answer ("11 tennis balls"), a mode small models often get wrong.
With CoT: "He starts with 5. 2 cans × 3 = 6 more.
Total: 5 + 6 = 11." ← reliably correct
CoT emerges at scale. At 10B params, adding "let's think step by step" doesn't help. At 100B+, it adds 20+ percentage points on math benchmarks.
The latest generation · o1, o3, Claude extended thinking, DeepSeek R1 · explicitly trains the model to produce a long internal chain of thought before answering.
This is where 2026 LLMs are. We'll see alignment + RLHF in the next lecture, then peek at reasoning in L16's final slide.