Interactive Explainer
Tokenization is invisible until it breaks. LLMs don't see characters — they see whatever subwords BPE learned to carve out. This page grows a tokenizer merge by merge, so you can watch it glue common pairs and end up with tokens like "ization", "tion", or " the".
Byte-Pair Encoding was invented by Philip Gage in 1994 as a data-compression technique. It showed up in NLP in 2015 (Sennrich et al.) and became the standard tokenizer for GPT-2 (2019) and everything after. The algorithm is tiny: repeatedly merge the most frequent adjacent pair of symbols, and stop when the vocabulary reaches a target size.
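The whole trainer fits in a screenful of Python. Here is a minimal sketch of that loop, using the toy corpus from Sennrich et al.'s paper (their end-of-word marker is omitted for brevity); the function names and the 10-merge budget are illustrative, not a reference implementation.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Fuse every occurrence of `pair` into a single new symbol."""
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus from Sennrich et al.: word -> frequency, split into characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for step in range(10):                     # the "Merge 10" budget
    pairs = get_pair_counts(words)
    if not pairs:                          # nothing left to merge
        break
    best = max(pairs, key=pairs.get)       # most frequent adjacent pair
    words = merge_pair(words, best)
    merges.append(best)
    print(f"merge {step + 1}: {best[0]!r} + {best[1]!r} -> {best[0] + best[1]!r}")
```

Run it and the first two merges are ('e', 's') then ('es', 't'): the tokenizer builds "est" before it completes any whole word, exactly the suffix-first behavior the demo shows.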
Try this: Press "Initialize (chars)", then "Merge 10". Watch the vocabulary grow from letters into meaningful subwords. Notice how the tokenizer discovers suffixes ("ing", "ed") and common function words ("the") without any linguistic input — just counting.
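Encoding a new word just replays the learned merges in training order. A small sketch, assuming the `merges` list produced by the trainer above:

```python
def encode(word, merges):
    """Segment a new word by replaying learned merges in training order."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i + 1 < len(symbols):
            if (symbols[i], symbols[i + 1]) == (a, b):
                symbols[i:i + 2] = [a + b]   # fuse the pair in place
            else:
                i += 1
    return symbols

# With the ten merges learned above, an unseen word still splits sensibly:
print(encode("lowest", merges))              # ['low', 'est']
```

"lowest" never appeared in the corpus, yet it comes out as a stem plus a suffix, each learned purely from pair counts.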
Part of the ES 667 Deep Learning course · IIT Gandhinagar · Aug 2026.