Tokenization & Pretraining Paradigms

Lecture 14 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Where we are

Last lecture: the Transformer block. Stack it, add positional encoding, mask if autoregressive.

But stacking requires inputs. And inputs are discrete symbols (characters, subwords, words), not vectors.

Today maps to Prince Ch 12 (pretraining) + Karpathy's Let's build the GPT Tokenizer video. Tokenization is the part of LLMs everyone wants to skip — don't.

Four questions:

  1. Why is tokenization hard?
  2. How does BPE work step-by-step?
  3. What are the three pretraining paradigms — BERT, GPT, T5?
  4. Why does tokenization cause so many LLM bugs?

PART 1

Why tokenization is hard

Spoiler · there's no good answer

Learning outcomes

By the end of this lecture you will be able to:

  1. Explain why tokenization is the "first design decision of an LLM."
  2. Run the BPE algorithm on paper for 5 merges.
  3. Distinguish character / word / subword / byte-BPE and pick appropriately.
  4. Identify common tokenization bugs (counting letters, arithmetic, whitespace).
  5. Contrast the three pretraining paradigms · BERT · GPT · T5.
  6. Estimate the data + compute economics of a 70B-param training run.

Three failed alternatives

Unit · Problem
Characters · sequences are 5–10× longer → attention, which is $O(n^2)$ in length, becomes expensive
Whole words · vocab must be huge; unseen words → OOV; misspellings fail
Morphemes · language-specific; requires linguistic annotation; doesn't scale

Subword tokenization is the compromise: common words as single tokens, rare words split into learned subwords.

A concrete failure · character LMs

Train a character-level Transformer on Wikipedia. It works — GPT-like samples of readable English.

But it's expensive:

  • A 1000-word page → ~6000 characters → 6000 attention positions → 36M attention entries per layer.
  • Equivalent word-level model: 1000 tokens → 1M entries. 36× cheaper.

Plus · character LMs have to learn that "t-h-e" is a word, unit by unit. That's capacity wasted on a solved problem.

Subwords compromise: keep common sequences as one unit (saving sequence length), split rare sequences (keeping open vocabulary). The winning middle path.

Smart-dictionary analogy

A tokenizer builds a dictionary for the language.

  • Character dictionary · just the alphabet · too basic.
  • Word dictionary · huge · breaks on "un-un-believable".
  • Subword dictionary · common words plus reusable prefixes (un-) and suffixes (-able) · can compose any new word.

That's the sweet spot we'll build with BPE on the next slides.

The sweet spot · subwords

Ideal subword tokenizer:

  • Common words → single tokens (cheap, frequent)
  • Rare words → composition of familiar subwords (generalizable)
  • No OOV — any byte sequence is tokenizable
  • Vocab size tunable (typically 30k–100k)

The winning algorithm: Byte-Pair Encoding (BPE), re-purposed from 1994 data compression.

PART 2

BPE step-by-step

One merge rule at a time

BPE merges · visual

▶ Interactive: type a corpus, press "merge" to see the most frequent pair get glued live — bpe-merges.

The BPE algorithm · 7 lines

from collections import Counter

def merge_in_word(word, pair):
    # Replace every adjacent occurrence of `pair` inside one word with the merged symbol
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

def train_bpe(corpus, n_merges):
    # 1. Split every word into characters
    tokens = [list(word) for word in corpus.split()]
    merges = []

    for _ in range(n_merges):
        # 2. Count all adjacent pairs across the whole corpus
        pair_counts = Counter((a, b) for word in tokens for a, b in zip(word[:-1], word[1:]))
        if not pair_counts:
            break

        # 3. Find the most frequent pair and record it as a merge rule
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)

        # 4. Apply the merge across all words
        tokens = [merge_in_word(w, best) for w in tokens]

    return merges, tokens

At inference, apply the same merge rules in order → tokenize any new string.
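
A minimal sketch of the inference side, reusing the merge_in_word helper from the training code above (bpe_encode is a name chosen for these notes, not a library function):

def bpe_encode(word, merges):
    # Replay the learned merge rules, in training order, on a new word
    tokens = list(word)
    for pair in merges:
        tokens = merge_in_word(tokens, pair)
    return tokens

# e.g. once ("e","s"), ("es","t"), ("l","o"), ("lo","w") have been learned,
# bpe_encode("lowest", merges) comes out as ["low", "est"]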

Worked BPE · merge trace

BPE · the data-compression analogy

Like a basic file compressor. If "ABCABCABC" appears a lot, define a new symbol $Z$ = "ABC" and rewrite as "ZZZ". BPE does the same for language · find the most common adjacent pair (like th), compress it into a single new token, repeat.

Worked BPE · "low lower newest widest"

Step 0 · split each word into characters with </w> end marker.
l o w </w>, l o w e r </w>, n e w e s t </w>, w i d e s t </w>

Step 1 · count adjacent pairs.
(l, o): 2, (o, w): 2, (w, e): 2, (e, s): 2, (s, t): 2, (t, </w>): 2, (e, r): 1, …
Six pairs tie at count 2. Pick (e, s).

Merge 1 · (e, s) → "es".
l o w </w>, l o w e r </w>, n e w es t </w>, w i d es t </w>

Step 2 · recount. Now (es, t): 2 is a new pair, tied with (l, o), (o, w), (t, </w>). Pick (es, t).

Merge 2 · (es, t) → "est".
l o w </w>, l o w e r </w>, n e w est </w>, w i d est </w>

Merge 3 · (l, o) → "lo". Merge 4 · (lo, w) → "low".
After 4 merges we have learned low and est as single tokens. At inference, apply merges 1–4 in order to any new word.
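
The hand trace can be checked against the train_bpe sketch from earlier (which skips the </w> end marker and breaks ties by whichever pair was seen first, so the merge order may differ while the learned units match):

corpus = "low lower newest widest"
merges, tokens = train_bpe(corpus, n_merges=4)
print(merges)   # likely [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')] — different tie-breaking,
                # but the same units low and est emerge
print(tokens)   # [['low'], ['low', 'e', 'r'], ['n', 'e', 'w', 'est'], ['w', 'i', 'd', 'est']]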

Worked BPE · second example

Corpus · "hug bug rug". Initial: h u g </w>, b u g </w>, r u g </w>.

Count pairs. (u, g): 3, (g, </w>): 3, every other pair 1. Pick (u, g).

Merge 1 · (u, g) → "ug". New: h ug </w>, b ug </w>, r ug </w>.

Recount. (ug, </w>): 3, others 1.

Merge 2 · (ug, </w>) → "ug</w>". Final: h ug</w>, b ug</w>, r ug</w>.

The algorithm has discovered the reusable suffix ug and the morphologically meaningful unit ug</w>.

Why byte-level BPE is the default

Two breakthroughs GPT-2 introduced:

  1. Start from bytes (0–255), not Unicode characters. Every possible string becomes tokenizable, including emojis, foreign scripts, binary garbage.
  2. Pretokenize by regex before BPE, to avoid crossing word boundaries ("New York" stays as two separate merge chains).

Result · a 50k-token vocab that covers English, code, Japanese, emoji, and anything else users throw at it. No <unk> token needed.
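
Starting from bytes means the base vocabulary is literally the 256 byte values. A quick illustration in plain Python (no tokenizer library needed):

text = "naïve 🙂"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)       # any string, emoji included, is already a sequence of ids in 0–255
print(len(byte_ids))  # non-ASCII characters cost several bytes each; BPE merges then shorten this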

Llama, GPT-*, Mistral, Claude all use byte-level BPE with minor tweaks. SentencePiece is the same idea packaged for cross-language training.

Tokenizer comparison · same sentence, different counts
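
There is no single "right" count. A quick way to see the spread yourself — assuming the Hugging Face transformers package and these pretrained tokenizers are available (not part of the course code):

from transformers import AutoTokenizer

sentence = "Tokenization is the part of LLMs everyone wants to skip."
for name in ["gpt2", "bert-base-uncased", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize(sentence)
    print(f"{name:20s} {len(pieces):3d} tokens   {pieces[:6]} ...")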

Three BPE variants you will meet

Variant · How · Used in
Character-level BPE · start from Unicode chars · original 2015 paper
Byte-level BPE · start from raw bytes · GPT-2, Llama, most modern LLMs
WordPiece · same idea, likelihood-based merge · BERT, DistilBERT
SentencePiece · treat whitespace as regular char · Llama, mT5, multilingual

Byte-level BPE (GPT-2) is now the default · handles any unicode, any language, any emoji, no OOV.

Tokenization gotchas · real LLM failures

"How many r's in strawberry?" — GPT-4 famously miscounted. Why? "strawberry" tokenizes to something like ["straw", "berry"] or ["str", "aw", "berry"]. The model never sees individual letters — it sees chunks.

Arithmetic errors. Numbers tokenize inconsistently: "1234" might be one token, "1235" might split. Models learn arithmetic by memorizing token patterns, not digit manipulation.

Spaces matter. " the" (with leading space) is a different token from "the". This is why prompts to LLMs are sensitive to trailing spaces.
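
These gotchas are easy to verify, again assuming the GPT-2 tokenizer via transformers (the exact ids and splits depend on the vocabulary, so treat the comments as indicative):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.encode("the"))           # one id
print(tok.encode(" the"))          # a different id — the leading space is part of the token
print(tok.tokenize("strawberry"))  # a few chunks, not ten letters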

PART 3

Three pretraining paradigms

Same Transformer · different objectives

Three families · one architecture

BERT vs GPT · side-by-side

BERT · the cloze-test analogy

BERT plays a fill-in-the-blanks game · like a cloze test in school.

We hide ~15% of the words; BERT must guess them. Because it can see text on both sides of the blank, it gets very good at understanding context.

This makes BERT a strong encoder · ideal for tasks where you need a representation of the whole sentence (classification, retrieval, NER). It's bad at generating text · because it never practices producing tokens one-by-one.

BERT · the cloze-test math, step by step

  1. Sentence. $x = (x_1, x_2, \dots, x_T)$, e.g. "the cat sat on the mat".
  2. Mask token 4. $\tilde{x} = (x_1, x_2, x_3, [\text{MASK}], x_5, \dots, x_T)$.
  3. Goal. Predict the original token from context. We want to maximize $p(x_4 \mid \tilde{x})$.
  4. Loss. Standard trick · negative log probability: $\ell_4 = -\log p(x_4 \mid \tilde{x})$.
  5. Total. Sum over the ~15% masked positions: $\mathcal{L}_{\text{MLM}} = -\sum_{i \in \text{masked}} \log p(x_i \mid \tilde{x})$.

The model sees the whole sentence (no causal mask) → rich bidirectional context.

Worked numeric · BERT loss for one mask

Input · …sat [MASK] the…. Correct answer · "on".

Transformer outputs logits over the vocab at the [MASK] position — say (illustrative values; the rest of the vocab gets negligible mass):

  • z("on") = 5.0, z("in") = 3.0, z("under") = 1.0, …

Softmax · p("on") ≈ 0.87, p("in") ≈ 0.12, p("under") ≈ 0.02, …

Loss · -log 0.87 ≈ 0.14.

If the model were more confident (p("on") = 0.99), loss would be -log 0.99 ≈ 0.01 — model rewarded for confidence on the right answer.
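
A few lines to check this arithmetic (the logits are the same made-up values as above):

import math

# Illustrative logits at the [MASK] position (made-up values; only three vocab entries shown)
logits = {"on": 5.0, "in": 3.0, "under": 1.0}
Z = sum(math.exp(v) for v in logits.values())
probs = {w: math.exp(v) / Z for w, v in logits.items()}
print(probs)                   # roughly {'on': 0.87, 'in': 0.12, 'under': 0.02}
print(-math.log(probs["on"]))  # ≈ 0.14 — the MLM loss at this masked position
print(-math.log(0.99))         # ≈ 0.01 — the loss if the model were more confident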

Great for: classification, NER, retrieval (embeddings).
Bad for: generation — can't autoregressively extend.

BERT · why mask 15%?

  • Mask too few · most sequences see no loss signal → slow training.
  • Mask too many · target is too hard, context too sparse.

15% was found empirically. Of those 15%:

  • 80% actually replaced by [MASK]
  • 10% replaced by a random token (adds noise, helps robustness)
  • 10% left unchanged (so the model can't cheat by ignoring unmasked positions)
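
A minimal sketch of that corruption recipe (the mask id, vocab size, and the -100 "ignore" convention are placeholders in the style of common PyTorch training code, not BERT's actual values):

import random

MASK_ID = 103        # placeholder id for [MASK]; real ids come from the tokenizer
VOCAB_SIZE = 30_000  # placeholder vocab size

def mask_for_mlm(token_ids, mask_prob=0.15):
    inputs = list(token_ids)
    targets = [-100] * len(token_ids)   # -100 = "ignore this position in the loss"
    for i, tid in enumerate(token_ids):
        if random.random() < mask_prob:
            targets[i] = tid             # only masked positions contribute to the loss
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                        # 80% · replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)   # 10% · random token
            # remaining 10% · leave the token unchanged
    return inputs, targets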

This mask-then-reconstruct recipe is the same idea as the denoising autoencoder from L19 — BERT is essentially a denoising autoencoder over language, using a Transformer encoder as the denoiser.

GPT · the smartphone-keyboard analogy

Predictive text on your phone. Type "I am heading to the…" — it suggests "gym", "store", "movies". Predicts the next word from what you've already typed; never sees the future.

GPT does this for every word, learning to be an excellent generator.

GPT · CLM math, step by step

Sentence · "The cat sat". Break it into one prediction problem per position:

  1. Given <s>, predict "The". Loss · $-\log p(\text{The} \mid \langle s \rangle)$.
  2. Given "The", predict "cat". Loss · $-\log p(\text{cat} \mid \langle s \rangle, \text{The})$.
  3. Given "The cat", predict "sat". Loss · $-\log p(\text{sat} \mid \langle s \rangle, \text{The}, \text{cat})$.

Total · sum over all positions: $\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T} \log p(x_t \mid x_{<t})$

Causal attention mask · model can only look backward.

Worked numeric · GPT loss at one step

Context · "The cat …". Correct next word · "sat".

Logits over the vocab — say (illustrative values) · z("sat") = 4.0, z("ran") = 2.5, z("is") = 1.0, …

Softmax · p("sat") ≈ 0.79, p("ran") ≈ 0.18, p("is") ≈ 0.04, …

Loss at this position · -log 0.79 ≈ 0.24.

Total sentence loss · sum these up across every position. A 2048-token window gives 2048 little training problems for free, every step.
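
The same check for the causal case, summing one loss per position (a tiny three-word vocab and made-up logits, just for the demo):

import math

def softmax_nll(logits, target):
    # Negative log-likelihood of `target` under a softmax over `logits` (dict: word -> score)
    Z = sum(math.exp(v) for v in logits.values())
    return -math.log(math.exp(logits[target]) / Z)

# Illustrative per-position logits for "The cat sat" (made-up numbers)
steps = [
    ({"The": 3.0, "A": 2.0, "It": 1.0}, "The"),     # given <s>
    ({"cat": 4.0, "dog": 2.5, "car": 0.5}, "cat"),  # given "The"
    ({"sat": 4.0, "ran": 2.5, "is": 1.0}, "sat"),   # given "The cat"
]
losses = [softmax_nll(lg, tgt) for lg, tgt in steps]
print(losses, sum(losses))  # one small loss per position; the training loss is their sum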

Great for: generation, chat, code, anything where you produce text one token at a time.
Bad for: bidirectional understanding (but at scale, GPT-3+ closed this gap).

GPT · why causal loss is so rich

One tiny objective — predict the next token — forces the model to reason about:

  • Syntax · closing brackets, matching tenses, agreement
  • Semantics · what words follow each other meaningfully
  • World knowledge · "The capital of France is …" requires a fact
  • Reasoning · "If all A are B, and x is A, then x is …" requires logic
  • Style · code continuations, formal-register, poetry

Every position in a 2048-token window is a little training example. A 1T-token corpus gives you supervised tasks for free — no human labeling needed.

This scale-and-generality combo is why next-token prediction — despite looking trivial — ended up subsuming most of NLP.

T5 · encoder-decoder · text-to-text

Frame every task as text-to-text:

"translate English to German: The house is wonderful."
   → "Das Haus ist wunderbar."

"summarize: ‹paragraph›"
   → ‹summary›

"question: ‹q›  context: ‹c›"
   → ‹answer›

Raffel et al. 2019 · T5 — Text-to-Text Transfer Transformer. Unified framework; same model does translation, summarization, QA, classification.

Survives in some translation pipelines. But for pure generation, decoder-only (GPT pattern) won.

Scaling · which paradigm won?

Year · Winner · Why
2018 · BERT · cheap, works, encoder embeddings useful
2020 · GPT-3 · scale unlocked few-shot learning
2022 · GPT-3.5 / InstructGPT · alignment via instruction tuning + RLHF
2023 · GPT-4, Llama 2 · decoder-only becomes the dominant paradigm
2026 · decoder-only + tool use + reasoning · everyone converges here

Decoder-only won the LLM race. BERT still ships in retrieval pipelines (small, fast, good embeddings). T5 survives where structured I/O matters.

PART 4

What pretraining actually learns

The "foundation" in foundation model

Why pretraining works so well

Predicting the next token on a trillion-token corpus forces the model to learn:

  • Grammar — what syntactic constructions are valid
  • Facts — what follows "The capital of France is"
  • Reasoning patterns — what tokens usually complete "if A then"
  • Style — code, poetry, legal writing, instructions

Next-token prediction is so rich a task that a model good at it ends up learning most of what's in the data — implicitly, without any labeled supervision.

Pretraining data · where do 15T tokens come from?

Web-scale sources

  • Common Crawl · ~2T tokens (filtered web pages)
  • C4 · curated Common Crawl · 750B tokens
  • RefinedWeb · quality-filtered · 5T tokens
  • Wikipedia · 10B tokens (20 languages)

Specialized

  • GitHub code · 1-2T tokens (de-licensed)
  • ArXiv + books3 + StackExchange · 100B+
  • Synthetic data (Phi-3) · textbooks generated by stronger models
  • Multilingual scraped · CCMatrix, mC4

Copyright issues · training on copyrighted text without license is legally contested. NY Times v. OpenAI (2023) · unresolved. The data pipeline is as much a legal project as a technical one.

Data quality beats quantity

Phi-3 (2024) · trained on 3T "textbook-quality" tokens (heavily filtered + synthetic); matched Llama-2 7B trained on 2T web tokens.

Recipe · Scale · Quality · Outcome
15T web tokens (Llama-3) · large · noisy · baseline strong LLM
3T curated tokens (Phi-3) · small · high, partly synthetic · matches a larger model
1T book tokens (Books3) · medium · high, literary · great for story writing

Data curation is now the hottest research area. Dedup, language ID, toxicity filtering, repetition filtering, perplexity filtering with a smaller model, synthetic augmentation. Each step buys 1–5 points on benchmarks.

Compute economics · where the 6ND comes from

The famous LLM rule of thumb: training FLOPs $C \approx 6ND$, where $N$ = parameters, $D$ = training tokens.

Why 6? Per token:

  1. Forward pass. Most compute is the FFN's two big matmuls. ≈ $2N$ FLOPs per token.
  2. Backward pass. Standard rule · ~2× the forward cost. ≈ $4N$ FLOPs per token.
  3. Total per token · $2N + 4N = 6N$ FLOPs.

Multiply by all $D$ tokens in the dataset: $C \approx 6ND$.

Worked numeric · 70B model training cost

One training run of a 70B-parameter model on the Chinchilla-optimal ~1.4T tokens:

  • $C \approx 6 \times (70 \times 10^9) \times (1.4 \times 10^{12}) \approx 5.9 \times 10^{23}$ FLOPs
  • Time · ~25 days on 4k A100 GPUs (at ~150 TFLOPS / GPU effective utilization)
  • Cost · ~$5M at commercial GPU-hour rates

Smaller check · Llama 2 7B. $N = 7 \times 10^9$, $D = 2 \times 10^{12}$:
$C \approx 6 \times 7 \times 10^9 \times 2 \times 10^{12} \approx 8.4 \times 10^{22}$ FLOPs.
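
A back-of-the-envelope helper for these estimates (GPU count, per-GPU throughput, and utilization are assumptions you plug in, not fixed facts — real runs land well above the idealized wall-clock):

def chinchilla_flops(params, tokens):
    # C ≈ 6 · N · D training-FLOPs rule of thumb
    return 6 * params * tokens

c70 = chinchilla_flops(70e9, 1.4e12)   # ≈ 5.9e23 FLOPs (the 70B run above)
c7 = chinchilla_flops(7e9, 2e12)       # ≈ 8.4e22 FLOPs (the Llama 2 7B check)

# Idealized wall-clock on an assumed fleet; utilization dips, restarts, and data stalls push this up
days = c70 / (4096 * 150e12) / 86400   # ≈ 11 days at a sustained 150 TFLOPS on 4096 GPUs
print(c70, c7, days)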

Llama 3 70B · reportedly ~$80M including experiments. GPT-4 class · ~$100M+ per training run.

10 years ago a deep net cost tens of dollars to train. Today, frontier model training costs a house.

Scaling recap · 5 years of LLMs in one chart

Year · Model · Params · Tokens · Notable
2018 · BERT-base · 110M · 3.3B · first pretrained Transformer in production
2019 · GPT-2 · 1.5B · 40B · "too dangerous to release"
2020 · GPT-3 · 175B · 300B · first few-shot emergence
2022 · Chinchilla · 70B · 1.4T · train-compute optimal
2023 · Llama 2 70B · 70B · 2T · open weights, Chinchilla-ish
2024 · Llama 3 8B · 8B · 15T · aggressively over-trained for inference
2026 · frontier LLMs · 1T+ · 10T+ · multi-modal, reasoning, tool use

Architecture has barely changed (decoder-only Transformer). What scaled: compute, data, careful engineering. "Attention + scale" really was the thing. L15 goes into the mechanical details.

Fine-tuning · from pretrained to useful

Pretrained models are knowledgeable but not steerable. To make them follow instructions, chat, or specialize on a task:

  1. Supervised fine-tuning (SFT) on instruction-response pairs.
  2. LoRA / QLoRA — parameter-efficient fine-tuning (next lecture).
  3. RLHF / DPO — align with human preferences (L16).

The pretrained model is the brain. Fine-tuning is how you train it to do what you want.

Lecture 14 — summary

  • Tokenization balances sequence length vs vocab size. Subword wins.
  • BPE · greedy adjacent-pair merges; byte-level variant (GPT-2, Llama) has no OOV.
  • Tokenization bugs · spelling errors, arithmetic weirdness, space sensitivity — all trace to token boundaries.
  • BERT · encoder-only · bidirectional MLM · classification + embeddings.
  • GPT · decoder-only · causal LM · generation. The winner for scaled LLMs.
  • T5 · encoder-decoder · text-to-text framing. Niche role in 2026.

Read before Lecture 15

Chinchilla paper (Hoffmann 2022) + HuggingFace course Chapter 1.

Next lecture

Large Language Models — Chinchilla scaling, RoPE, GQA, distributed training, emergent abilities.

Notebook 14 · 14-bpe-from-scratch.ipynb — implement BPE tokenizer from scratch; train on a small corpus; visualize merges; tokenize new sentences.