Train a character-level Transformer on Wikipedia. It works: you get GPT-like samples of readable English.
But it's expensive: characters make sequences ~4-5× longer than word-level ones, and self-attention compute grows quadratically with sequence length.
Plus · character LMs have to learn that "t-h-e" is a word, unit by unit. That's capacity wasted on a solved problem.
Subwords compromise: keep common sequences as one unit (saving sequence length), split rare sequences (keeping open vocabulary). The winning middle path.
A tokenizer builds a dictionary for the language.
Ideal subword tokenizer: keeps frequent words as single tokens, splits rare words into reusable pieces · prefixes (un-) and suffixes (-able) can compose any new word. That's the sweet spot we'll build with BPE on the next slides.
The winning algorithm: Byte-Pair Encoding (BPE), re-purposed from 1994 data compression.
One merge rule at a time
```python
from collections import Counter

def merge_in_word(word, pair):
    # Replace each adjacent occurrence of `pair` with one merged token
    merged, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return merged

def train_bpe(corpus, n_merges):
    # 1. Split every word into characters, plus an end-of-word marker
    tokens = [list(word) + ['</w>'] for word in corpus.split()]
    merges = []
    for _ in range(n_merges):
        # 2. Count all adjacent pairs
        pair_counts = Counter((a, b) for word in tokens for a, b in zip(word[:-1], word[1:]))
        if not pair_counts: break
        # 3. Find the most frequent pair
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        # 4. Apply the merge across all words
        tokens = [merge_in_word(w, best) for w in tokens]
    return merges, tokens
```
At inference, apply the same merge rules in order → tokenize any new string.
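A minimal sketch of the inference side, reusing the `merge_in_word` helper above. Note that `Counter` breaks the tie differently from the hand-worked slide that follows, but the segmentation of "lowest" comes out the same:

```python
def apply_bpe(word, merges):
    # Replay the learned merge rules, in training order, on an unseen word
    symbols = list(word) + ['</w>']
    for pair in merges:
        symbols = merge_in_word(symbols, pair)
    return symbols

merges, _ = train_bpe("low lower newest widest", n_merges=4)
print(apply_bpe("lowest", merges))  # ['low', 'est', '</w>']
```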
Like a basic file compressor. If "ABCABCABC" appears a lot, define a new symbol Z = "ABC" and rewrite it as "ZZZ". BPE does the same for language · find the most common adjacent pair (like th), compress it into a single new token, repeat.
Step 0 · split each word into characters with </w> end marker.
l o w </w>, l o w e r </w>, n e w e s t </w>, w i d e s t </w>
Step 1 · count adjacent pairs.
(l, o): 2, (o, w): 2, (w, e): 2, (e, s): 2, (s, t): 2, (t, </w>): 2, (e, r): 1, …
Six pairs tie at count 2. Pick (e, s).
Merge 1 · (e, s) → "es".
l o w </w>, l o w e r </w>, n e w es t </w>, w i d es t </w>
Step 2 · recount. Now (es, t): 2 is a new pair, tied with (l, o), (o, w), (t, </w>). Pick (es, t).
Merge 2 · (es, t) → "est".
l o w </w>, l o w e r </w>, n e w est </w>, w i d est </w>
Merge 3 · (l, o) → "lo". Merge 4 · (lo, w) → "low".
After 4 merges we have learned low, est as single tokens. At inference, apply merges 1–4 in order to any new word.
Corpus · "hug bug rug". Initial: h u g </w>, b u g </w>, r u g </w>.
Count pairs · (u, g): 3 and (g, </w>): 3 tie; pick (u, g).
Merge 1 · (u, g) → "ug": h ug </w>, b ug </w>, r ug </w>.
Recount · (ug, </w>): 3 dominates.
Merge 2 · (ug, </w>) → "ug</w>": h ug</w>, b ug</w>, r ug</w>.
The algorithm has discovered the reusable suffix ug and the morphologically meaningful unit ug</w>.
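The `train_bpe` sketch from earlier reproduces this run exactly (using `Counter`'s first-seen tie-breaking):

```python
merges, tokens = train_bpe("hug bug rug", n_merges=2)
print(merges)  # [('u', 'g'), ('ug', '</w>')]
print(tokens)  # [['h', 'ug</w>'], ['b', 'ug</w>'], ['r', 'ug</w>']]
```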
Two breakthroughs GPT-2 introduced:
1. Byte-level base vocabulary · run BPE on raw bytes instead of Unicode characters, so 256 base symbols cover any string and nothing is ever out-of-vocabulary.
2. Regex pre-tokenization · block merges across letter/digit/punctuation boundaries, so tokens stay clean (no "dog!" glued into one unit).
Result · a 50k-token vocab that covers English, code, Japanese, emoji, and anything else users throw at it. No <unk> token needed.
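To see why a byte-level base never needs <unk>: any string, in any script, reduces to a sequence of the 256 possible byte values, which BPE then merges upward. A quick check:

```python
s = "café 🙂"
print(list(s.encode("utf-8")))
# [99, 97, 102, 195, 169, 32, 240, 159, 153, 130] -- the accent is two bytes, the emoji four
```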
Llama, GPT-*, Mistral, Claude all use byte-level BPE with minor tweaks. SentencePiece is the same idea packaged for cross-language training.
| Variant | How | Used in |
|---|---|---|
| Character-level BPE | start from Unicode chars | original 2015 paper |
| Byte-level BPE | start from raw bytes | GPT-2, Llama, most modern LLMs |
| WordPiece | same idea, likelihood-based merge | BERT, DistilBERT |
| SentencePiece | treat whitespace as regular char | Llama, mT5, multilingual |
Byte-level BPE (GPT-2) is now the default · handles any unicode, any language, any emoji, no OOV.
"How many r's in strawberry?" — GPT-4 famously miscounted. Why? "strawberry" tokenizes to something like ["straw", "berry"] or ["str", "aw", "berry"]. The model never sees individual letters — it sees chunks.
Arithmetic errors. Numbers tokenize inconsistently: "1234" might be one token, "1235" might split. Models learn arithmetic by memorizing token patterns, not digit manipulation.
Spaces matter. " the" (with leading space) is a different token from "the". This is why prompts to LLMs are sensitive to trailing spaces.
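You can poke at these quirks directly. A quick probe, assuming the `tiktoken` package and its GPT-4-era `cl100k_base` vocabulary (exact splits vary from vocab to vocab):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["strawberry", "the", " the", "1234", "12345"]:
    ids = enc.encode(s)
    print(f"{s!r:14} -> {[enc.decode([i]) for i in ids]}")
```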
Same Transformer · different objectives
BERT plays a fill-in-the-blanks game · like a cloze test in school.
We hide ~15% of the words; BERT must guess them. Because it can see text on both sides of the blank, it gets very good at understanding context.
This makes BERT a strong encoder · ideal for tasks where you need a representation of the whole sentence (classification, retrieval, NER). It's bad at generating text · because it never practices producing tokens one-by-one.
The model sees the whole sentence (no causal mask) → rich bidirectional context.
Input · …sat [MASK] the…. Correct answer · "on".
Transformer outputs logits over the vocab at the [MASK] position: $z \in \mathbb{R}^{|V|}$.
Softmax · $p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$
Loss · $\mathcal{L} = -\log p_{\text{on}}$
If the model were more confident ($p_{\text{on}} \to 1$), the loss would drop toward 0.
Great for: classification, NER, retrieval (embeddings).
Bad for: generation — can't autoregressively extend.
15% was found empirically. Of those 15%: 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged, so the model can't assume every visible token is real.
This mask-then-reconstruct recipe is the same idea as the denoising autoencoder from L19 — BERT is essentially a denoising autoencoder over language, using a Transformer encoder as the denoiser.
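A minimal sketch of that 80/10/10 recipe, assuming `vocab` is a plain list of token strings:

```python
import random

def mask_for_mlm(tokens, vocab, mask_rate=0.15):
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            labels.append(tok)                       # loss is computed at this position
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                   # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)                      # no loss at unmasked positions
    return inputs, labels
```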
Predictive text on your phone. Type "I am heading to the…" — it suggests "gym", "store", "movies". Predicts the next word from what you've already typed; never sees the future.
GPT does this for every word, learning to be an excellent generator.
Sentence · $x_1, \dots, x_T$. At position 1 the context is just <s>; predict "The". Loss · $-\log p(\text{The} \mid \langle s \rangle)$. Total · sum over all positions: $\mathcal{L} = -\sum_{t=1}^{T} \log p(x_t \mid x_{<t})$.
Causal attention mask · model can only look backward.
Context · "The cat …". Correct next word · "sat".
Logits over vocab · $z \in \mathbb{R}^{|V|}$
Softmax · $p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$
Loss at this position · $-\log p_{\text{sat}}$
Total sentence loss · sum these up across every position. A 2048-token window gives 2048 little training problems for free, every step.
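In code, the whole-window loss is a single shifted cross-entropy. A PyTorch sketch, assuming `logits` of shape [batch, T, vocab] from a causally masked model and token ids `x` of shape [batch, T]:

```python
import torch.nn.functional as F

def next_token_loss(logits, x):
    # Position t predicts token t+1: drop the last logit and the first target
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions at positions 0..T-2
        x[:, 1:].reshape(-1),                         # targets at positions 1..T-1
    )
```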
Great for: generation, chat, code, anything where you produce text one token at a time.
Bad for: bidirectional understanding (but at scale, GPT-3+ closed this gap).
One tiny objective · predict the next token · forces the model to reason about syntax, facts, style, and whatever logic the text encodes.
Every position in a 2048-token window is a little training example. A 1T-token corpus gives you on the order of $10^{12}$ training signals.
This scale-and-generality combo is why next-token prediction — despite looking trivial — ended up subsuming most of NLP.
Frame every task as text-to-text:
"translate English to German: The house is wonderful."
→ "Das Haus ist wunderbar."
"summarize: ‹paragraph›"
→ ‹summary›
"question: ‹q› context: ‹c›"
→ ‹answer›
Raffel et al. 2019 · T5 — Text-to-Text Transfer Transformer. Unified framework; same model does translation, summarization, QA, classification.
Survives in some translation pipelines. But for pure generation, decoder-only (GPT pattern) won.
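The prompt-prefix interface is easy to try. A sketch with Hugging Face `transformers`, assuming the `t5-small` checkpoint (plus the `sentencepiece` package) is available:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))  # expect "Das Haus ist wunderbar."
```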
| Year | Winner | Why |
|---|---|---|
| 2018 | BERT | cheap, works, encoder embeddings useful |
| 2020 | GPT-3 | scale unlocked few-shot learning |
| 2022 | GPT-3.5 / InstructGPT | alignment via instruction tuning + RLHF |
| 2023 | GPT-4, Llama 2 | decoder-only becomes the dominant paradigm |
| 2026 | decoder-only + tool use + reasoning | everyone converges here |
Decoder-only won the LLM race. BERT still ships in retrieval pipelines (small, fast, good embeddings). T5 survives where structured I/O matters.
The "foundation" in foundation model
Predicting the next token on a trillion-token corpus forces the model to learn grammar, world knowledge, style, and the discourse patterns that tie them together · each one improves the prediction.
Next-token prediction is so rich a task that a model good at it ends up learning most of what's in the data — implicitly, without any labeled supervision.
Copyright issues · training on copyrighted text without license is legally contested. NY Times v. OpenAI (2023) · unresolved. The data pipeline is as much a legal project as a technical one.
Phi-3 (2024) · trained on 3T "textbook-quality" tokens (heavily filtered + synthetic); matched Llama-2 7B trained on 2T web tokens.
| Recipe | Size | Quality | Outcome |
|---|---|---|---|
| 15T web tokens (Llama 3) | large | noisy, baseline | strong LLM |
| 3T curated tokens (Phi-3) | small | high, heavily synthetic | matches a larger model |
| 1T book tokens (Books3) | medium | high, literary | great for story writing |
Data curation is now the hottest research area. Dedup, language-ID, toxicity filter, repetition filter, perplexity filter with smaller model, synthetic augmentation. Each step buys 1-5 points on benchmarks.
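A toy version of such a pass; `lm_perplexity` here is a hypothetical stand-in for scoring with a smaller reference model (real pipelines use MinHash dedup, fastText language ID, and learned quality classifiers):

```python
def curate(docs, lm_perplexity, max_ppl=200.0):
    seen, kept = set(), []
    for doc in docs:
        key = " ".join(doc.split()).lower()  # cheap exact-duplicate key
        if key in seen:
            continue                         # dedup filter
        seen.add(key)
        if lm_perplexity(doc) > max_ppl:
            continue                         # perplexity filter: drop gibberish
        kept.append(doc)
    return kept
```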
The famous LLM rule of thumb: training FLOPs $\approx 6ND$, for $N$ parameters and $D$ training tokens.
Why 6? Per token: the forward pass costs about $2N$ FLOPs (a multiply and an add per parameter), the backward pass about $4N$ (roughly twice the forward).
Multiply by all $D$ tokens → $6ND$ FLOPs for the full run.
One training run of a 70B-parameter model on Chinchilla-optimal data ($\approx 20$ tokens per parameter → 1.4T tokens): $6 \times 70{\times}10^{9} \times 1.4{\times}10^{12} \approx 5.9 \times 10^{23}$ FLOPs.
Smaller check · Llama 2 7B on its 2T tokens: $6 \times 7{\times}10^{9} \times 2{\times}10^{12} \approx 8.4 \times 10^{22}$ FLOPs.
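The same arithmetic as a helper:

```python
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens  # ~2N forward + ~4N backward FLOPs per token

print(f"{train_flops(70e9, 1.4e12):.1e}")  # 5.9e+23 (Chinchilla-optimal 70B)
print(f"{train_flops(7e9, 2e12):.1e}")     # 8.4e+22 (Llama 2 7B)
```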
Llama 3 70B · reportedly ~$80M including experiments. GPT-4 class · ~$100M+ per training run.
10 years ago a deep net cost tens of dollars to train. Today a single frontier training run costs on the order of $100M.
| Year | Model | Params | Tokens | Notable |
|---|---|---|---|---|
| 2018 | BERT-base | 110M | 3.3B | first pretrained Transformer in production |
| 2019 | GPT-2 | 1.5B | 40B | "too dangerous to release" |
| 2020 | GPT-3 | 175B | 300B | first few-shot emergence |
| 2022 | Chinchilla | 70B | 1.4T | train-compute optimal |
| 2023 | Llama 2 70B | 70B | 2T | open weights, Chinchilla-ish |
| 2024 | Llama 3 8B | 8B | 15T | aggressively over-trained for inference |
| 2026 | frontier LLMs | 1T+ | 10T+ | multi-modal, reasoning, tool use |
Architecture has barely changed (decoder-only Transformer). What scaled: compute, data, careful engineering. "Attention + scale" really was the thing. L15 goes into the mechanical details.
Pretrained models are knowledgeable but not steerable. To make them follow instructions, chat, or specialize on a task, you fine-tune.
The pretrained model is the brain. Fine-tuning is how you train it to do what you want.