
Interactive Explainer

BPE Tokenizer Step by Step

Tokenization is invisible until it breaks. LLMs don't see characters — they see whatever subwords BPE learned to carve out. This page grows a tokenizer merge by merge, so you can watch it glue common pairs and end up with tokens like "ization", "tion", or " the".

~10 min · Deep Learning · Tokenization · LLM

Byte-Pair Encoding was invented in 1994 for data compression. It showed up in NLP in 2015 (Sennrich et al.), and became the standard tokenizer for GPT-2 (2019) and everything after. The algorithm is tiny: repeatedly merge the most frequent adjacent pair, stop at target vocab size.
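The whole loop fits in a few lines. Below is a minimal Python sketch of that procedure (not the playground's actual code): start from characters, repeatedly merge the most frequent adjacent pair, stop at a target vocabulary size. The corpus and the 40-token target are made up for illustration, and the sketch merges over the raw character stream rather than per-word counts, which is enough to surface space-prefixed tokens like " the".

```python
# Minimal BPE training sketch: merge the most frequent adjacent pair
# until the vocabulary reaches the target size. Illustrative only.
from collections import Counter


def train_bpe(text: str, target_vocab_size: int):
    tokens = list(text)          # start with one token per character
    vocab = set(tokens)
    merges = []                  # learned merge rules, in order

    while len(vocab) < target_vocab_size:
        # Count every adjacent pair in the current token sequence.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                # nothing left worth merging
        merges.append((a, b))
        vocab.add(a + b)

        # Replace every occurrence of the pair with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged

    return merges, vocab, tokens


if __name__ == "__main__":
    corpus = "tokenization generalization the theory the thing"
    merges, vocab, tokens = train_bpe(corpus, target_vocab_size=40)
    print("first merges:", merges[:10])
    print("tokens:", tokens)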

The playground

Current tokenization

Merge history


    Try this: Press "Initialize (chars)", then "Merge 10". Watch the vocabulary grow from letters into meaningful subwords. Notice how the tokenizer discovers suffixes ("ing", "ed") and common function words ("the") without any linguistic input — just counting.

    Why this matters. Every LLM's vocabulary is frozen at training time — GPT-2 has 50,257 tokens; Llama has 32k. When you type a rare word, it splits into whatever subwords were learned. This is why LLMs spell "strawberry" as "straw" + "berry" and sometimes miscount its letters.
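    You can check this split against a real model's vocabulary. The snippet below assumes the tiktoken package is installed (pip install tiktoken); it loads GPT-2's learned merges and decodes each token id back to its text piece. The exact pieces depend on the model, so your output may differ from the "straw" + "berry" example above.

```python
# Inspect how GPT-2's BPE vocabulary splits a few words (requires tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")          # GPT-2's 50,257-token vocabulary
for word in ["strawberry", " strawberry", "tokenization"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]  # text piece behind each token id
    print(f"{word!r:16} -> {pieces}")
```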

    Tokenization across models

    Part of the ES 667 Deep Learning course · IIT Gandhinagar · Aug 2026.