Large Language Models

Lecture 15 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Learning outcomes

By the end of this lecture you will be able to:

  1. State the three scaling laws (compute, data, params) as power-law exponents.
  2. Derive the Chinchilla D/N ≈ 20 optimum and explain why over-training is OK.
  3. Describe RoPE (rotary position embedding) geometrically.
  4. Explain GQA · fewer KV heads, much smaller KV-cache, near-identical quality.
  5. Contrast DP / TP / PP and know when to combine them.
  6. Articulate emergent abilities · what they are and why they're contested.

Where we are

  • Transformer block (L13) · stack of attention + FFN.
  • Tokenization + pretraining (L14) · BPE, BERT / GPT / T5.

The Transformer stack is the architecture. The tokenizer is the input. Now let's scale.

Today maps to the Chinchilla paper (Hoffmann 2022), HuggingFace course Ch 1, Karpathy's State of GPT. UDL Ch 12 supports this at a high level; LLM scale details come from the literature.

Four questions

  1. Why do LLMs keep getting better with more compute?
  2. What changed in positional encoding — and what is RoPE?
  3. How do 100B+ models fit on GPUs — GQA, distributed training?
  4. What are emergent abilities and why are they surprising?

What changed · 2018 to 2026

Architecture

Unchanged. Still a decoder-only Transformer. A student from 2018 who squinted at GPT-1 would recognize Llama-3.

Scale & engineering

Params · 117M → 1T+ (10,000×)
Data · 1B → 15T tokens (15,000×)
Compute · 10²⁰ → 10²⁵ FLOPs (100,000×)
Context · 512 → 1M tokens

"Bitter lesson" (Sutton) · general methods that leverage computation dominate specialized ones. LLMs are Exhibit A.

PART 1

Scaling laws

The empirical backbone of the LLM era

Scaling laws · the cake-baking analogy

You're baking a cake (the model's performance) with a fixed budget for ingredients (compute).

Should you spend on fancy flour (more parameters), or on sugar/eggs (more data)?

Scaling laws are the recipe · they tell you the optimal flour-to-sugar ratio for your specific budget.

The Chinchilla paper figured out that recipe · roughly 20 tokens per parameter. Spend more on flour (parameters) than that and you're "overparameterized" — the model is undertrained on data. Spend less and you're "over-trained" for that budget. The sweet spot gives the best cake for the money.

The Chinchilla result

Hoffmann et al. 2022 · "Training Compute-Optimal Large Language Models"

For a fixed compute budget C, performance is optimized when you scale model size N and training tokens D roughly proportionally.

Optimal ratio: D / N ≈ 20 tokens per parameter.

GPT-3 (175B params, 300B tokens · D/N ≈ 1.7) was hugely undertrained by this standard. Chinchilla (70B params, 1.4T tokens · D/N = 20) got better results with far fewer parameters.

[Figure slide · Chinchilla: D/N across models]

[Figure slide · Chinchilla: in one chart]

The 2023+ twist · overtraining

Modern LLMs often train well past Chinchilla optimal:

Model        Params  Tokens  D/N    Notes
Chinchilla   70B     1.4T    20     training-compute optimal
Llama 2 70B  70B     2T      29     slightly over
Llama 3 8B   8B      15T     1875   wildly over
Llama 3 70B  70B     15T     214    heavily over

Why overtrain?

Chinchilla optimizes training compute. But inference is where models earn their keep. A smaller, over-trained model has lower inference cost per query — you get back the extra training compute many times over.
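
A back-of-the-envelope sketch of that trade-off, using the standard approximations (training ≈ 6ND FLOPs, inference ≈ 2N FLOPs per generated token); the lifetime serving volume below is an assumed number, purely for illustration.

# Train + serve cost, assuming train ≈ 6*N*D and inference ≈ 2*N FLOPs/token.
def total_flops(n_params, n_train_tokens, n_served_tokens):
    return 6 * n_params * n_train_tokens + 2 * n_params * n_served_tokens

served = 5e12   # assumed lifetime serving volume: 5T generated tokens

print(total_flops(70e9, 1.4e12, served))   # 70B, Chinchilla-optimal: ~1.3e24
print(total_flops(8e9, 15e12, served))     # 8B, over-trained:        ~8.0e23

# Per served token the 8B model is ~9x cheaper (2*8e9 vs 2*70e9 FLOPs),
# so at high serving volumes the over-trained small model wins overall.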

Compute budget · derivation from two rules

Two rules from Chinchilla:

  1. Budget · C ≈ 6 N D (FLOPs for forward + backward over D tokens with N params).
  2. Recipe · D ≈ 20 N.

Substitute (2) into (1) to solve for N given a budget C:

  C ≈ 6 N · 20 N = 120 N²   →   N ≈ √(C / 120)

Then D = 20 N.

Worked example · spending 10²⁴ FLOPs

  1. Solve for N.
     N ≈ √(10²⁴ / 120) ≈ 9.1 × 10¹⁰
     ≈ 100 billion parameters.
  2. Solve for D.
     D = 20 N ≈ 1.8 × 10¹²
     ≈ 2 trillion tokens.

For 10× more compute · N grows by √10 ≈ 3.2×, D grows by √10 ≈ 3.2× too. Both scale as √C. This is why GPT-4 (rumored ~1.8T params) is only ~10× GPT-3 (175B) despite a far larger budget · Chinchilla's recipe says spread the budget across both axes.
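
The same arithmetic in a few lines of Python · a minimal sketch of the two rules above:

# Chinchilla sizing from a FLOP budget: C ≈ 6*N*D with D = 20*N.
C = 1e24                       # compute budget in FLOPs

N = (C / 120) ** 0.5           # from C = 6*N*(20*N) = 120*N**2
D = 20 * N

print(f"N ≈ {N:.1e} params")   # ≈ 9.1e+10  (~100B)
print(f"D ≈ {D:.1e} tokens")   # ≈ 1.8e+12  (~2T)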

Sub-optimal training · a table

Scenario                                   Params  Tokens  Status
GPT-3 (2020) · undertrained                175B    300B    too big, too few tokens
Chinchilla (2022) · optimal                70B     1.4T    train-compute sweet spot
Llama-3 8B (2024) · overtrained            8B      15T     inference-optimal for serving
A large startup's "bigger = better" model  500B    200B    wastes compute

The undertrained regime is the more wasteful one: GPT-3 (D/N ≈ 1.7) spent its budget on parameters it couldn't feed with enough tokens, where a compute-optimal model reaches the same loss far cheaper. Modern LLMs carefully size N and D together.

PART 2

RoPE · rotary positional encoding

The 2021 fix that stuck

Problems with sinusoidal / learned PE

Both inject position by adding a position vector to the token embedding:

  x_i = e_i + p_i   (e_i = token embedding, p_i = position vector)

Problems:

  • Absolute position only — model can't easily learn "2 positions apart" as a primitive.
  • Extrapolation fails — trained on ≤ 4k tokens, breaks at 10k.

[Figure slide · RoPE: rotation in pictures]

RoPE · three key properties

  1. Relative positions encoded naturally · inner product after rotation depends only on the query–key offset m − n, not on absolute positions.

  2. Extrapolates beyond training length · rotation frequencies are fixed; a model trained at 4k context can extend to 32k without re-training (with minor fixes).

  3. Zero added parameters · rotation matrices are deterministic given position; no nn.Embedding(max_len, d_model) allocation.

Llama, Mistral, Qwen, GPT-NeoX all use RoPE in 2026. A 2021 paper (Su et al.) that took ~2 years to catch on is now the default.

Context length · the scaling wall

Year  Frontier model  Context
2018  BERT            512 tokens
2020  GPT-3           2,048
2023  GPT-4           32k
2023  Claude 2        100k
2024  Gemini 1.5      1,000,000
2026  frontier        2-10M

What unlocked 1M? · RoPE extrapolation, FlashAttention (O(N) memory), GQA (smaller KV cache), and training on long documents from the start. No single trick; the stack compounds.

RoPE · the spinning-pointer intuition

Imagine q and k as pointers on a clock face. Instead of adding a position vector, rotate each pointer by an angle proportional to its position · say 10° per position.

  • Token at position 1 → rotate by 10°.
  • Token at position 3 → rotate by 30°.

Attention score = dot product of q and k → depends only on the angle between them, 30° − 10° = 20°, which encodes the relative offset 3 − 1 = 2.

The model learns about relative positions directly.

RoPE · derivation in 2D

For position m, define the angle θ_m = m θ for some base frequency θ. Rotation matrix:

  R(θ_m) = [ cos θ_m   −sin θ_m ]
           [ sin θ_m    cos θ_m ]

Apply to query q (at position m) and key k (at position n):

  q′ = R(mθ) q,   k′ = R(nθ) k

Attention score:

  q′ᵀ k′ = qᵀ R(mθ)ᵀ R(nθ) k

Two properties:

  R(α)ᵀ = R(−α)   and   R(α) R(β) = R(α + β)

Substitute:

  q′ᵀ k′ = qᵀ R((n − m)θ) k

The score depends only on the relative position n − m, not on absolutes. In high dimensions, group the d dimensions into d/2 pairs and rotate pair j with its own frequency θ_j (the standard base gives θ_j = 10000^(−2j/d)). That's "block-diagonal."

Worked numeric · RoPE in 2D

Setup. θ = 0.5 rad. q = (1, 0) at position m = 2, k = (1, 0) at position n = 4.

Rotation matrices. θ_q = mθ = 1.0 rad, θ_k = nθ = 2.0 rad.

Rotate.
  q′ = (cos 1.0, sin 1.0) ≈ (0.540, 0.841)
  k′ = (cos 2.0, sin 2.0) ≈ (−0.416, 0.909)

Dot product. q′ · k′ ≈ (0.540)(−0.416) + (0.841)(0.909) ≈ −0.225 + 0.765 = 0.540.

Verify with the relative-position form (q = k = (1, 0)): q′ᵀ k′ = qᵀ R((n − m)θ) k = cos((n − m)θ), where (n − m)θ = 1.0 rad:
cos(1.0) ≈ 0.540 ✓ (rounding only).

Used in Llama 1/2/3, Mistral, PaLM, GPT-NeoX. No extra params, extrapolates beyond training length.
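
A minimal NumPy sketch of the 2D case, mirroring the worked numbers above (rot, q, k are illustrative names, not a library API):

import numpy as np

def rot(angle):
    """2D rotation matrix R(angle)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta = 0.5                    # base frequency (rad per position)
q = k = np.array([1.0, 0.0])   # toy query / key vectors
m, n = 2, 4                    # absolute positions

print((rot(m * theta) @ q) @ (rot(n * theta) @ k))   # 0.5403...

# Depends only on the offset n - m:
print(q @ rot((n - m) * theta) @ k)                  # 0.5403... (same)

# Shift both positions by any amount; the score is unchanged:
print((rot((m + 7) * theta) @ q) @ (rot((n + 7) * theta) @ k))   # 0.5403...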

PART 3

Efficient attention

MQA, GQA, and the KV-cache

KV-cache · derivation, piece by piece

During autoregressive decoding we attend to every previous token's keys and values. Recomputing them at each step is wasteful → cache them.

Build the size, piece by piece:

  1. One token, one head, one layer · store a K and a V vector, each of size d_head.
     → 2 · d_head numbers
  2. One token, one layer, all heads. Multiply by n_heads:
     → 2 · n_heads · d_head
  3. One token, all layers. Multiply by n_layers:
     → 2 · n_layers · n_heads · d_head
  4. All tokens. Multiply by context length L:
     → 2 · L · n_layers · n_heads · d_head numbers
  5. In bytes. Multiply by 2 (for fp16):
     → 4 · L · n_layers · n_heads · d_head bytes

Worked numeric · KV-cache for Llama 70B

Setup: n_layers = 80, n_heads = 64, d_head = 128, context L = 32k tokens, fp16 (2 bytes).

  Cache = 4 · L · n_layers · n_heads · d_head
        = 4 × 32,000 × 80 × 64 × 128 bytes
        ≈ 84 GB

Punchline. The 70B weights (fp16) take 140 GB. The KV-cache for one 32k-context sequence (with full multi-head attention) adds another ~84 GB · more than half the model size again. This is the biggest bottleneck in long-context LLM serving. It drives GQA (next), FlashAttention (L23), and KV-cache compression.
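
A quick sanity check of this arithmetic · a minimal sketch; the n_kv_heads argument anticipates the GQA refinement coming next:

# KV-cache size in bytes, assuming fp16 K and V cached per layer and head.
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, d_head, bytes_per_val=2):
    return 2 * bytes_per_val * seq_len * n_layers * n_kv_heads * d_head  # 2 = K and V

# Llama-2-70B-style config with full MHA (64 KV heads), 32k context:
print(kv_cache_bytes(32_000, 80, 64, 128) / 1e9)   # ≈ 83.9 GB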

[Figure slide · MHA vs MQA vs GQA]

GQA · the shared-notebook analogy

The KV-cache is the model's notebook.

  • MHA · 64 students each keep a private 100-page notebook → 6400 pages.
  • GQA · 8 study groups of 8 students share one notebook each → 800 pages. Huge saving, almost no quality drop.
  • MQA · all 64 share one notebook → 100 pages. Maximum saving, but students may overwrite each other (quality drop).

How GQA shrinks the KV-cache

Refine the formula · let n_q = query heads, n_kv = key/value heads. Only the K/V heads are cached:

  Cache = 4 · L · n_layers · n_kv · d_head bytes

Now compare for Llama 2 70B (n_q = 64, n_layers = 80, d_head = 128, L = 32k):

Variant                  n_kv  Cache
MHA (Llama 1 style)      64    ~84 GB
GQA (Llama 2, 8 groups)  8     ~10.5 GB
MQA                      1     ~1.3 GB (quality drops)

GQA reduces the KV-cache by 8× with negligible quality loss · the modern default.

# KV-cache in Llama 2 70B (fp16, 32k context):
n_layers = 80
n_heads  = 64        # query heads
n_kv     = 8         # GQA groups (KV heads)
d_head   = 128
L        = 32_000
print(4 * L * n_layers * n_kv * d_head / 1e9)   # ≈ 10.5 GB (vs ~84 GB with MHA)
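
A toy NumPy sketch of the GQA mechanism itself · each group of query heads attends against one shared K/V head (illustrative shapes, no masking):

import numpy as np

n_q, n_kv, d_head, L = 8, 2, 4, 5        # 8 query heads share 2 KV heads
group = n_q // n_kv                      # query heads per KV head

q = np.random.randn(n_q, 1, d_head)      # one new token's queries
K = np.random.randn(n_kv, L, d_head)     # cached keys   (only n_kv heads stored)
V = np.random.randn(n_kv, L, d_head)     # cached values

K_full = np.repeat(K, group, axis=0)     # broadcast each KV head to its group
V_full = np.repeat(V, group, axis=0)

scores = q @ K_full.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_q, 1, L)
scores -= scores.max(-1, keepdims=True)                    # stable softmax
attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
out = attn @ V_full                      # (n_q, 1, d_head)
print(out.shape)                         # (8, 1, 4)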

PART 4

Distributed training

How you fit a 70B model on real hardware

[Figure slide · Distributed training: three parallelisms]

Distributed training · the LEGO-team analogy

You need to build a massive LEGO model (the LLM) that won't fit on one person's table (one GPU). Hire a team:

  • Data Parallel · each person builds a full copy of the model, on different parts of the data.
  • Tensor Parallel · the whole team works on one giant component together (split a matrix multiply).
  • Pipeline Parallel · set up an assembly line · person 1 does layers 1-10, person 2 does 11-20, etc.

Frontier training combines all three · 3D parallelism. Each axis trades off memory, compute, and communication.

Three parallelism strategies

Data parallel (DP)

Each GPU has a full copy of the model, trains on different batches. Gradients averaged across GPUs.

Simple. Works as long as a full copy fits on one GPU · breaks beyond ~10B params.

Tensor parallel (TP)

Split each matrix multiply across GPUs. Each GPU holds a slice of W.

Megatron-LM. Required for >10B. Heavy all-reduce bandwidth.
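
The core trick in NumPy · a toy sketch of column-wise tensor parallelism across two hypothetical "GPUs":

import numpy as np

x = np.random.randn(4, 512)              # activations (batch of 4)
W = np.random.randn(512, 1024)           # weight matrix to shard

W0, W1 = np.split(W, 2, axis=1)          # "GPU 0" holds W0, "GPU 1" holds W1
y0 = x @ W0                              # partial output on GPU 0
y1 = x @ W1                              # partial output on GPU 1

y = np.concatenate([y0, y1], axis=1)     # all-gather the column slices
assert np.allclose(y, x @ W)             # identical to the unsharded multiply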

Pipeline + 3D parallelism

Pipeline parallel (PP)

Split the layer stack across GPUs: layers 1-10 on GPU 1, layers 11-20 on GPU 2, etc. A bubble of idle time forms unless you use micro-batching.

Modern training runs combine all three (3D parallelism). Add ZeRO (sharded optimizer state) and you get the full picture.

The 2026 reality

Training a 70B from scratch in 2026 · ~10k H100 GPUs for ~2 months.

  • Data center cost: ~$100M
  • Energy: ~1 GWh
  • Engineering team: 50+

Almost no one trains from scratch. Everyone fine-tunes open-weight models (Llama, Mistral, Qwen) with LoRA (next lecture).

PART 5

Emergent abilities

When more params unlock new behaviors

Why "emergence" is surprising

Null hypothesis · smooth scaling. As you make a model bigger, training loss decreases smoothly. A 10B model is a bit better than a 1B; a 100B model is a bit better than a 10B. Intuitive.

The surprise · for some specific complex tasks, this doesn't happen — performance is near-random until a threshold, then takes off.

Why? A multi-step task is a product of step accuracies:

  • Small model · 50% per step. 3-step accuracy 0.5³ ≈ 12.5% · barely above random.
  • Larger model · 90% per step. 3-step accuracy 0.9³ ≈ 73% · competent!

Smooth improvement in per-step accuracy translates to what looks like a discontinuous jump in end-to-end task performance.
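
That arithmetic in a tiny sketch · per-step accuracy rises smoothly while the exact-match accuracy of a 3-step task jumps:

# End-to-end accuracy of a k-step task = (per-step accuracy) ** k.
k = 3
for p in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]:
    print(f"per-step {p:.2f} -> {k}-step {p**k:.3f}")
# per-step 0.50 -> 3-step 0.125   (barely above random)
# per-step 0.90 -> 3-step 0.729   (competent)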

[Figure slide · Emergent abilities: the curves]

What "emergent" means

An ability is emergent if it:

  1. Shows near-random performance at small scale (< 10B params).
  2. Improves rapidly to competence at large scale (> 50B).

No one trained specifically for it. It just appears.

Emergence · the controversy

Emergentists say

  • Discontinuous jumps in capability with scale.
  • New qualitative behaviors (reasoning, tool use).
  • Smaller models CAN'T do these at all.

Skeptics (Schaeffer 2023) say

  • Many "emergent" curves are metric artifacts.
  • Use a smoother metric (per-token log-prob vs exact match) and the curve becomes smooth.
  • Still a real capability gap, but gradual.

Resolution · both sides are partially right. Capability improves continuously in log-probability, but certain thresholded tasks (match or fail) look discontinuous. The user experience is still of qualitative leaps.

Chain-of-thought · prompting unlocks reasoning

Standard prompt · "Q: 23 × 47 = ?" → A: "1081" (often wrong)

CoT prompt · "Q: 23 × 47 = ? Let's think step by step." →
A: "23 × 47 = 23 × (50 − 3) = 1150 − 69 = 1081"

CoT unlocks multi-digit arithmetic, commonsense, logic at 60B+. Below that, CoT adds nothing (the model can't reason in steps either).

The prompt itself is a learnable control · "let's think step by step" (Kojima 2022) can add 15 points on GSM8K. No fine-tuning. This thread becomes reasoning models (o1, Claude thinking) in 2024.

Ability                        Roughly where it emerges
Multi-digit arithmetic         ~13B
Basic code generation          ~13B
Few-shot in-context learning   ~50B
Chain-of-thought reasoning     ~60B
Tool use (with prompting)      ~70B+

Wei et al. 2022 · "Emergent Abilities of Large Language Models." Contested (Schaeffer et al. 2023 argue it's a metric artifact) but the phenomena are real.

In-context learning · the most surprising one

During pretraining the model learns only next-token prediction. But at 100B+ params, it starts to learn at inference time from examples in the prompt:

Translate to French:
sea otter → loutre de mer
cheese → fromage
banana → banane
carrot → ???

The model was never trained on "translate to French" as an explicit task. But given three examples, it infers the task and produces "carotte".

This is few-shot learning without weight updates. Emergent at scale; the foundation of modern prompting.

Chain of thought

Prompting the model to "think step by step" dramatically improves multi-step reasoning:

Q: Roger has 5 tennis balls. He buys 2 more cans of 3
   tennis balls each. How many tennis balls?

Without CoT: "11 tennis balls." ← a bare guess; often wrong at small scale
With CoT:    "He starts with 5. 2 cans × 3 = 6 more.
              Total: 5 + 6 = 11." ← reliably correct

CoT emerges at scale. At 10B params, adding "let's think step by step" doesn't help. At 100B+, it adds 20+ percentage points on math benchmarks.

Reasoning models (2024+)

The latest generation — o1, o3, Claude extended thinking, DeepSeek R1 — explicitly trains the model to produce long internal chain-of-thought before answering:

  • Trained via RL with process rewards.
  • Spends 10×–100× more compute per answer.
  • Often dramatically better on math, code, logic.

This is where 2026 LLMs are. We'll see alignment + RLHF in the next lecture, then peek at reasoning in L16's final slide.

Lecture 15 — summary

  • Chinchilla · D/N ≈ 20 tokens per parameter is compute-optimal. Modern Llama-style models intentionally overtrain for inference gains.
  • RoPE · rotate Q and K by position-dependent angles; relative positions baked in; extrapolates. Default in 2026 LLMs.
  • GQA · grouped-query attention shrinks the KV-cache 8× (64 query heads → 8 KV heads) with near-zero quality loss. Default from Llama 2 70B onward.
  • Distributed training · DP + TP + PP + ZeRO. 70B from scratch is a ~$100M engineering feat.
  • Emergence · few-shot learning, CoT reasoning, tool use — all appear at scale, not specifically trained.

Read before Lecture 16

HF PEFT docs; Ouyang 2022 (InstructGPT); Rafailov 2023 (DPO).

Next lecture

Alignment & Fine-tuning — SFT, LoRA, QLoRA, RLHF, DPO, one slide on reasoning.

Notebook 15 · 15-rope.ipynb — implement RoPE from scratch; compare to sinusoidal PE on extrapolation.