
Interactive Explainer

BPE Tokenizer Step by Step

Tokenization is invisible until it breaks. LLMs don't see characters — they see whatever subwords BPE learned to carve out. This page grows a tokenizer merge by merge, so you can watch it glue common pairs and end up with tokens like "ization", "tion", or " the".

~10 min · Deep Learning · Tokenization · LLM

Byte-Pair Encoding was invented in 1994 for data compression. It showed up in NLP in 2015 (Sennrich et al.), and became the standard tokenizer for GPT-2 (2019) and everything after. The algorithm is tiny: repeatedly merge the most frequent adjacent pair, stop at target vocab size.
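The whole loop fits in a few lines. Below is a minimal Python sketch of that procedure (not the playground's actual code): start from characters, repeatedly merge the most frequent adjacent pair, stop at a target vocabulary size. The corpus and the 40-token target are made up for illustration, and the sketch merges over the raw character stream rather than per-word counts, which is enough to surface space-prefixed tokens like " the".

```python
# Minimal BPE training sketch: merge the most frequent adjacent pair
# until the vocabulary reaches the target size. Illustrative only.
from collections import Counter


def train_bpe(text: str, target_vocab_size: int):
    tokens = list(text)          # start with one token per character
    vocab = set(tokens)
    merges = []                  # learned merge rules, in order

    while len(vocab) < target_vocab_size:
        # Count every adjacent pair in the current token sequence.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                # nothing left worth merging
        merges.append((a, b))
        vocab.add(a + b)

        # Replace every occurrence of the pair with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged

    return merges, vocab, tokens


if __name__ == "__main__":
    corpus = "tokenization generalization the theory the thing"
    merges, vocab, tokens = train_bpe(corpus, target_vocab_size=40)
    print("first merges:", merges[:10])
    print("tokens:", tokens)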

The playground

Current tokenization

Merge history


    Try this: Press "Initialize (chars)", then "Merge 10". Watch the vocabulary grow from letters into meaningful subwords. Notice how the tokenizer discovers suffixes ("ing", "ed") and common function words ("the") without any linguistic input — just counting.

    Why this matters. Every LLM's vocabulary is frozen at training time — GPT-2 has 50,257 tokens; Llama has 32k. When you type a rare word, it splits into whatever subwords were learned. This is why LLMs spell "strawberry" as "straw" + "berry" and sometimes miscount its letters.
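    You can check this split against a real model's vocabulary. The snippet below assumes the tiktoken package is installed (pip install tiktoken); it loads GPT-2's learned merges and decodes each token id back to its text piece. The exact pieces depend on the model, so your output may differ from the "straw" + "berry" example above.

```python
# Inspect how GPT-2's BPE vocabulary splits a few words (requires tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")          # GPT-2's 50,257-token vocabulary
for word in ["strawberry", " strawberry", "tokenization"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]  # text piece behind each token id
    print(f"{word!r:16} -> {pieces}")
```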

    Tokenization across models

    Part of the ES 667 Deep Learning course · IIT Gandhinagar · Aug 2026.