Tokenization in LLMs Explained Visually
What is Tokenization?
Tokenization is how large language models read text. Before a model can process a sentence, it must split the raw characters into smaller pieces called tokens — typically subword units like "un", "der", "stand". The dominant algorithm is Byte Pair Encoding (BPE), which iteratively merges the most frequent character pairs. GPT-4 uses a vocabulary of ~100,000 tokens. Tokenization directly affects cost (you pay per token), context length (how much text fits), and quality (poor tokenization can hurt math and multilingual performance).
Why Tokenization?
LLMs don't read text — they operate on numbers. Tokenization is the bridge: it breaks text into small pieces called tokens, each mapped to an integer the model can process.
Three ways to split text
Characters
Split every character individually. "the cat" becomes 7 tokens: t, h, e, (space), c, a, t.
Simple, but sequences get very long. A 500-word essay becomes ~2,500 tokens, and the model has to spend a full sequence position on every single character, which is slow and inefficient.
Whole words
Split on spaces. "the cat" becomes 2 tokens: the, cat.
Compact, but the vocabulary explodes. English has 170,000+ word forms. What about typos, new slang, compound words, other languages? Any word not in the vocabulary becomes an unknown token — information is lost.
Subwords — the sweet spot
BPE and similar algorithms find a middle ground: common words like the stay whole, while rare words split into meaningful pieces. tokenization becomes token + ization. ~50,000 subword tokens cover what 170,000+ words do.
Try it: Switch between example sentences on the right panel to see how character, word, and subword token counts compare.
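The same comparison is easy to sketch in plain Python. The subword split below is hand-written purely for illustration; real subword pieces come from a trained tokenizer, not a fixed list.

```python
text = "tokenization works"

# Character-level: one token per character, including the space.
char_tokens = list(text)            # 18 tokens

# Word-level: split on whitespace.
word_tokens = text.split()          # 2 tokens: ["tokenization", "works"]

# Subword-level: hand-written for illustration only; a trained BPE
# tokenizer would learn pieces like these from corpus statistics.
subword_tokens = ["token", "ization", " works"]

print(len(char_tokens), len(word_tokens), len(subword_tokens))  # 18 2 3
```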
Vocabulary size tradeoff
Think of vocabulary like a phrasebook. A small phrasebook (32K entries) forces you to spell out unfamiliar words letter by letter — more pages to read through. A large phrasebook (100K entries) has ready-made entries for those words — faster to look up, but the book itself is heavier to carry.
For an LLM, "heavier" means more GPU memory: each token needs its own learned vector stored in the model's embedding table.
- GPT-4: ~100K tokens — excellent compression, large embedding table
- Llama-2: ~32K tokens — more tokens per text, but leaner model
Every LLM makes this tradeoff differently.
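To put a number on "heavier": the embedding table costs roughly vocabulary size times embedding dimension times bytes per weight. The figures below use illustrative assumptions (a 4,096-dimensional embedding stored in 16-bit floats), not published numbers for any particular model.

```python
def embedding_table_mb(vocab_size, embed_dim=4096, bytes_per_weight=2):
    """Rough size of the token-embedding matrix in megabytes."""
    return vocab_size * embed_dim * bytes_per_weight / 1e6

print(embedding_table_mb(32_000))   # ~262 MB for a 32K vocabulary
print(embedding_table_mb(100_000))  # ~819 MB for a 100K vocabulary
```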
Tokenization is the very first step in every LLM. Before attention, before embeddings, before any neural network computation — the tokenizer runs. The right panel shows how the same text looks under each approach.
BPE: The Algorithm
Byte Pair Encoding (BPE) was originally a data compression algorithm, adapted for NLP by Sennrich et al. (2015) and popularized by GPT-2. It's now used by nearly all modern LLMs.
How it works
- Start with every character as its own token
- Count all adjacent pairs and their frequencies
- Merge the most frequent pair into a new token
- Repeat until the desired vocabulary size is reached
Press Play to watch this process on "the cat sat on the mat", or use Step Forward to go one merge at a time.
Predict the first merge
Before stepping — look at the characters. Which adjacent pair appears most often?
The answer: "at" with frequency 3 — it appears in "cat", "sat", and "mat".
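If you want to check this outside the simulator, here is a minimal sketch of the same flat-text BPE loop in Python: count adjacent pairs, merge the most frequent, repeat. It is a teaching toy, not a production tokenizer.

```python
from collections import Counter

def bpe_train(text, num_merges):
    """Toy BPE: merge the most frequent adjacent pair, num_merges times."""
    tokens = list(text)              # start with one token per character
    merges = []                      # ordered list of learned merge rules
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:                # nothing repeats: nothing to merge
            break
        merges.append((a, b))
        # Replace every occurrence of the pair with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_train("the cat sat on the mat", 5)
print(merges)   # first merge is ('a', 't'), matching the answer above
print(tokens)
```

Running it on the presets from the next section reproduces their behaviour: bpe_train("aaaaaa", 10) stops after two merges, and bpe_train("xyz123", 10) performs none because no pair repeats.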
How real BPE differs
Our simulator counts raw pairs in flat text. Production BPE works differently: text is first split into words, word frequencies are counted across a huge corpus, and pairs are counted weighted by word frequency.
If "cat" appears 1,000 times in the training corpus, the pair ("c", "a") counts 1,000 — not 1. This is why common words like "the" become single tokens early: they appear millions of times.
When multiple pairs tie in frequency, the algorithm picks the first one encountered scanning left-to-right. Different implementations may break ties differently — this is why two BPE implementations can produce slightly different vocabularies from the same corpus.
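Here is a minimal sketch of the frequency-weighted counting described above, assuming a toy word-frequency table in place of a real corpus: each word's pairs are counted once per word type, then weighted by how often the word occurs.

```python
from collections import Counter

# Toy word frequencies standing in for a real corpus count.
word_counts = {"the": 1_000, "cat": 50, "sat": 40, "mat": 30}

pair_counts = Counter()
for word, freq in word_counts.items():
    chars = list(word)
    for a, b in zip(chars, chars[1:]):
        pair_counts[(a, b)] += freq   # weight each pair by word frequency

print(pair_counts.most_common(3))
# [(('t', 'h'), 1000), (('h', 'e'), 1000), (('a', 't'), 120)]
```

Because "the" dominates the counts, its internal pairs win the earliest merges, which is exactly why very common words end up as single tokens.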
Explore the Patterns
Use the preset buttons to try different inputs and observe how BPE behaves.
What to watch for
Default — "the cat sat on the mat"
Common English pairs like "at" and "th" merge first. Notice how "the" gets built up: t + h → th, then th + e → the. The algorithm discovers word structure from raw frequency data.
Repeated chars — "aaaaaa"
Progressive doubling: a → aa → aaaa. With pure repetition, BPE builds increasingly large tokens. Just 2 merges compress 6 tokens down to 2.
Repeated words — "hello world hello world"
Many shared pairs, aggressive compression. Watch the full word reconstruction: h → he → hel → hell → hello. With enough repetition, entire words emerge as single tokens.
Patterns — "abcabc abcabc"
Repeating patterns compress efficiently. ab → abc → abcabc in just 3 merges — from 13 tokens to 3.
Overlapping pairs — "banana banana"
Both "an" and "na" have frequency 4, competing for merges. The algorithm picks whichever it encounters first (left-to-right). This shows how tie-breaking affects the merge sequence.
No repeats — "xyz123"
Zero merges. BPE needs repetition to work. When no adjacent pair appears more than once, there's nothing to merge.
The compression ratio counter shows how much the representation shrinks. Notice the pattern: repetition drives compression. Unique text stays at character level forever — the algorithm has nothing to compress.
Training vs Inference
Tokenization has two distinct phases — and our simulator only shows one of them.
Training — building the merge table
What our simulator demonstrates: scan text, count pairs, build an ordered list of merge rules. In production, this runs once on a massive corpus — billions of words from books, websites, and code.
The result is a merge table: an ordered list of rules like:
("t", "h")→"th"("th", "e")→"the"("i", "n")→"in"- ... thousands more rules
This table is fixed forever after training. GPT-4's tokenizer was trained once and has never changed since.
Inference — applying the rules
When you send text to an API, the tokenizer doesn't re-count frequencies. It splits your text into characters and applies the merge rules in the exact order they were learned — rule #1 first, then #2, then #3.
No counting. No learning. Just rule application. This is why tokenization is so fast at inference time.
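A minimal sketch of that inference step, reusing the toy merge-rule format from the training sketch above: split the new text into characters, then apply each learned rule in order. (Production tokenizers schedule merges per pre-tokenized word; this is the simplified whole-string version.)

```python
def bpe_encode(text, merges):
    """Apply learned merge rules, in order, to new text. No counting."""
    tokens = list(text)
    for a, b in merges:               # rule #1 first, then #2, then #3 ...
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Rules learned earlier on "the cat sat on the mat" now tokenize new text.
merges = [("a", "t"), ("t", "h"), ("th", "e")]
print(bpe_encode("that hat", merges))  # ['th', 'at', ' ', 'h', 'at']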
Why "xyz123" works differently
In our simulator, "xyz123" produces zero merges because no pair repeats. But a real tokenizer would still tokenize it meaningfully — the pre-trained merge table already learned pairs like "12" and "yz" from the training corpus.
Vocabulary size
Vocabulary size is a hyperparameter chosen before training:
- GPT-4: ~100K tokens — better compression, fewer tokens per text
- Llama-2: ~32K tokens — smaller embedding table, less GPU memory
Bigger vocab = better compression but larger embedding matrix. This is one of the first architectural decisions when building an LLM.
"Training a tokenizer is not the same as training a model." Tokenizer training is a deterministic statistical counting process — same corpus + same algorithm = same vocabulary every time. Model training uses stochastic gradient descent and is inherently random.
Byte-Level BPE
The unknown token problem
Character-level BPE only handles characters it saw during training. What about rare Unicode symbols, emoji, or scripts it never encountered? They become [UNK] — the unknown token. Information is lost.
The solution: start with bytes
Every file on your computer is just bytes — values from 0 to 255. There are exactly 256 possible bytes, a tiny but complete base vocabulary. Any text in any language can be represented as bytes.
The right panel shows how different scripts produce different numbers of bytes:
- ASCII characters (English letters, digits) → 1 byte each
- Accented characters (é, ñ) → 2 bytes each
- Korean, Chinese, Japanese characters → 3 bytes each
- Emoji → 4 bytes each
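These byte counts are easy to reproduce in Python: encode a character as UTF-8 and measure its length to see how many base tokens it contributes before any merges run.

```python
for ch in ["a", "7", "é", "ñ", "한", "字", "😀"]:
    print(ch, "->", len(ch.encode("utf-8")), "byte(s)")
# a -> 1    é -> 2    한 -> 3    😀 -> 4
```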
GPT-2 introduced byte-level BPE, and most modern LLMs use it or a close variant (GPT-4, Llama, Claude, Mistral). No text is ever "unknown."
Pre-tokenization — word boundaries
Before BPE runs, text is split into chunks at word boundaries. Merges never cross these boundaries — this prevents nonsensical merges like combining the last letter of one word with the space before the next.
Different models use different splitting rules:
- GPT-2: spaces become part of the next token with a Ġ prefix. "how are" → ["how", "Ġare"]
- GPT-4: uses a different regex pattern that handles numbers and whitespace differently
- T5/SentencePiece: uses a ▁ prefix and works well for languages without spaces, like Chinese and Japanese
Pre-tokenization is why different models tokenize the same text differently — they define "word boundaries" with different regex patterns before BPE even starts. This means GPT-4 and Llama produce different tokens from the same input.
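Here is a minimal sketch of the idea, using a deliberately simplified pattern rather than any model's real one (GPT-2's actual regex also handles contractions, digits, and punctuation separately, and marks spaces with the byte-level Ġ symbol).

```python
import re

# Simplified pre-tokenizer: a chunk is an optional leading space plus a run
# of word characters, or a run of other symbols, or whitespace.
PRETOKENIZE = re.compile(r" ?\w+| ?[^\w\s]+|\s+")

print(PRETOKENIZE.findall("how are you?"))
# ['how', ' are', ' you', '?']
```

BPE merges then run inside each chunk and never cross a chunk boundary.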
Why It Matters
Tokenization isn't just a preprocessing step — it directly shapes how LLMs behave, what they cost, and who they work best for.
Token count = cost
API providers charge per token. Context window limits are in tokens, not characters or words. Efficient tokenization means more content per dollar and more context per prompt.
The language gap
Tokenizers trained on English-heavy corpora compress English text efficiently (~1 token per word) but other languages pay more — the right panel shows this clearly.
Same meaning, more tokens. This is a systematic bias from training data distribution: English text dominated the tokenizer's training corpus, so English-specific merges were learned. The gap widens further with longer or more complex text.
Capitalization matters
"The" and "the" are different tokens with different embeddings. "HELLO" may tokenize as ["HE", "LLO"] while "hello" is a single token. Case-sensitive tokenization means the model literally sees different inputs.
Token boundaries affect reasoning
LLMs reason at the token level, not the character level:
- Math: "127+456" may split as ["12", "7+", "456"] — the model can't "see" the full number 127 when it's split across tokens
- Character counting: "strawberry" tokenizes as something like ["str", "awberry"], not individual letters — that's why counting r's is hard
- Code: myVariable vs my_variable tokenize very differently, affecting how the model reasons about variable names
Special tokens
<|endoftext|>, <|im_start|>, <|im_end|> are injected outside of BPE. They control conversation structure — separating system prompts, user messages, and assistant responses. The BPE algorithm never produces them.
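As a small check, again assuming tiktoken is installed: its encoder refuses a special-token string found in ordinary user text and only encodes it when the caller explicitly allows it, which is the job of the chat framework assembling the conversation.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# A special-token string in ordinary text is rejected, not silently encoded.
try:
    enc.encode("<|endoftext|>")
except ValueError as err:
    print("refused:", err)

# It is encoded only when explicitly allowed.
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
```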
Every quirk of LLM behavior — struggling with math, miscounting characters, charging more for Korean text — traces back to tokenization. Understanding tokens means understanding why models behave the way they do.