Tokenization in LLMs Explained Visually
What is Tokenization?
Tokenization is how large language models read text. Before a model can process a sentence, it must split the raw characters into smaller pieces called tokens — typically subword units like "un", "der", "stand". The dominant algorithm is Byte Pair Encoding (BPE), which iteratively merges the most frequent character pairs. GPT-4 uses a vocabulary of ~100,000 tokens. Tokenization directly affects cost (you pay per token), context length (how much text fits), and quality (poor tokenization can hurt math and multilingual performance).
Why Tokenization?
LLMs don't read text — they operate on numbers. Tokenization is the bridge: it breaks text into small pieces called tokens, each mapped to an integer the model can process.
Three ways to split text
Characters
Split every character individually. "the cat" becomes 7 tokens: t, h, e, (space), c, a, t.
Simple, but sequences get very long. A 500-word essay becomes ~2,500 tokens, and the model has to spend a full sequence position on every single character, which is slow and inefficient.
Whole words
Split on spaces. "the cat" becomes 2 tokens: the, cat.
Compact, but the vocabulary explodes. English has 170,000+ word forms. What about typos, new slang, compound words, other languages? Any word not in the vocabulary becomes an unknown token — information is lost.
Subwords — the sweet spot
BPE and similar algorithms find a middle ground: common words like the stay whole, while rare words split into meaningful pieces. tokenization becomes token + ization. ~50,000 subword tokens cover what 170,000+ words do.
Try it: Switch between example sentences on the right panel to see how character, word, and subword token counts compare.
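The same comparison is easy to sketch in plain Python. The subword split below is hand-written purely for illustration; real subword pieces come from a trained tokenizer, not a fixed list.

```python
text = "tokenization works"

# Character-level: one token per character, including the space.
char_tokens = list(text)            # 18 tokens

# Word-level: split on whitespace.
word_tokens = text.split()          # 2 tokens: ["tokenization", "works"]

# Subword-level: hand-written for illustration only; a trained BPE
# tokenizer would learn pieces like these from corpus statistics.
subword_tokens = ["token", "ization", " works"]

print(len(char_tokens), len(word_tokens), len(subword_tokens))  # 18 2 3
```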
Vocabulary size tradeoff
Think of vocabulary like a phrasebook. A small phrasebook (32K entries) forces you to spell out unfamiliar words letter by letter — more pages to read through. A large phrasebook (100K entries) has ready-made entries for those words — faster to look up, but the book itself is heavier to carry.
For an LLM, "heavier" means more GPU memory: each token needs its own learned vector stored in the model's embedding table.
- GPT-4: ~100K tokens — excellent compression, large embedding table
- Llama-2: ~32K tokens — more tokens per text, but leaner model
Every LLM makes this tradeoff differently.
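To put a number on "heavier": the embedding table costs roughly vocabulary size times embedding dimension times bytes per weight. The figures below use illustrative assumptions (a 4,096-dimensional embedding stored in 16-bit floats), not published numbers for any particular model.

```python
def embedding_table_mb(vocab_size, embed_dim=4096, bytes_per_weight=2):
    """Rough size of the token-embedding matrix in megabytes."""
    return vocab_size * embed_dim * bytes_per_weight / 1e6

print(embedding_table_mb(32_000))   # ~262 MB for a 32K vocabulary
print(embedding_table_mb(100_000))  # ~819 MB for a 100K vocabulary
```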
Tokenization is the very first step in every LLM. Before attention, before embeddings, before any neural network computation — the tokenizer runs. The right panel shows how the same text looks under each approach.
BPE: The Algorithm
Byte Pair Encoding (BPE) was originally a data compression algorithm, adapted for NLP by Sennrich et al. (2015) and popularized by GPT-2. It's now used by nearly all modern LLMs.
How it works
- Start with every character as its own token
- Count all adjacent pairs and their frequencies
- Merge the most frequent pair into a new token
- Repeat until the desired vocabulary size is reached
Press Play to watch this process on "the cat sat on the mat", or use Step Forward to go one merge at a time.
Predict the first merge
Before stepping — look at the characters. Which adjacent pair appears most often?
The answer: "at" with frequency 3 — it appears in "cat", "sat", and "mat".
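If you want to check this outside the simulator, here is a minimal sketch of the same flat-text BPE loop in Python: count adjacent pairs, merge the most frequent, repeat. It is a teaching toy, not a production tokenizer.

```python
from collections import Counter

def bpe_train(text, num_merges):
    """Toy BPE: merge the most frequent adjacent pair, num_merges times."""
    tokens = list(text)              # start with one token per character
    merges = []                      # ordered list of learned merge rules
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:                # nothing repeats: nothing to merge
            break
        merges.append((a, b))
        # Replace every occurrence of the pair with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_train("the cat sat on the mat", 5)
print(merges)   # first merge is ('a', 't'), matching the answer above
print(tokens)
```

Running it on the presets from the next section reproduces their behaviour: bpe_train("aaaaaa", 10) stops after two merges, and bpe_train("xyz123", 10) performs none because no pair repeats.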
How real BPE differs
Our simulator counts raw pairs in flat text. Production BPE works differently: text is first split into words, word frequencies are counted across a huge corpus, and pairs are counted weighted by word frequency.
If "cat" appears 1,000 times in the training corpus, the pair ("c", "a") counts 1,000 — not 1. This is why common words like "the" become single tokens early: they appear millions of times.
When multiple pairs tie in frequency, the algorithm picks the first one encountered scanning left-to-right. Different implementations may break ties differently — this is why two BPE implementations can produce slightly different vocabularies from the same corpus.
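Here is a minimal sketch of the frequency-weighted counting described above, assuming a toy word-frequency table in place of a real corpus: each word's pairs are counted once per word type, then weighted by how often the word occurs.

```python
from collections import Counter

# Toy word frequencies standing in for a real corpus count.
word_counts = {"the": 1_000, "cat": 50, "sat": 40, "mat": 30}

pair_counts = Counter()
for word, freq in word_counts.items():
    chars = list(word)
    for a, b in zip(chars, chars[1:]):
        pair_counts[(a, b)] += freq   # weight each pair by word frequency

print(pair_counts.most_common(3))
# [(('t', 'h'), 1000), (('h', 'e'), 1000), (('a', 't'), 120)]
```

Because "the" dominates the counts, its internal pairs win the earliest merges, which is exactly why very common words end up as single tokens.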
Explore the Patterns
Use the preset buttons to try different inputs and observe how BPE behaves.
What to watch for
Default — "the cat sat on the mat"
Common English pairs like "at" and "th" merge first. Notice how "the" gets built up: t + h → th, then th + e → the. The algorithm discovers word structure from raw frequency data.
Repeated chars — "aaaaaa"
Progressive doubling: a → aa → aaaa. With pure repetition, BPE builds increasingly large tokens. Just 2 merges compress 6 tokens down to 2.
Repeated words — "hello world hello world"
Many shared pairs, aggressive compression. Watch the full word reconstruction: h → he → hel → hell → hello. With enough repetition, entire words emerge as single tokens.
Patterns — "abcabc abcabc"
Repeating patterns compress efficiently. ab → abc → abcabc in just 3 merges — from 13 tokens to 3.
Overlapping pairs — "banana banana"
Both "an" and "na" have frequency 4, competing for merges. The algorithm picks whichever it encounters first (left-to-right). This shows how tie-breaking affects the merge sequence.
No repeats — "xyz123"
Zero merges. BPE needs repetition to work. When no adjacent pair appears more than once, there's nothing to merge.
The compression ratio counter shows how much the representation shrinks. Notice the pattern: repetition drives compression. Unique text stays at character level forever — the algorithm has nothing to compress.
Training vs Inference
Tokenization has two distinct phases — and our simulator only shows one of them.
Training — building the merge table
What our simulator demonstrates: scan text, count pairs, build an ordered list of merge rules. In production, this runs once on a massive corpus — billions of words from books, websites, and code.
The result is a merge table: an ordered list of rules like:
("t", "h")→"th"("th", "e")→"the"("i", "n")→"in"- ... thousands more rules
This table is fixed forever after training. GPT-4's tokenizer was trained once and has never changed since.
Inference — applying the rules
When you send text to an API, the tokenizer doesn't re-count frequencies. It splits your text into characters and applies the merge rules in the exact order they were learned — rule #1 first, then #2, then #3.
No counting. No learning. Just rule application. This is why tokenization is so fast at inference time.
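A minimal sketch of that inference step, reusing the toy merge-rule format from the training sketch above: split the new text into characters, then apply each learned rule in order. (Production tokenizers schedule merges per pre-tokenized word; this is the simplified whole-string version.)

```python
def bpe_encode(text, merges):
    """Apply learned merge rules, in order, to new text. No counting."""
    tokens = list(text)
    for a, b in merges:               # rule #1 first, then #2, then #3 ...
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Rules learned earlier on "the cat sat on the mat" now tokenize new text.
merges = [("a", "t"), ("t", "h"), ("th", "e")]
print(bpe_encode("that hat", merges))  # ['th', 'at', ' ', 'h', 'at']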
Why "xyz123" works differently
In our simulator, "xyz123" produces zero merges because no pair repeats. But a real tokenizer would still tokenize it meaningfully — the pre-trained merge table already learned pairs like "12" and "yz" from the training corpus.
Vocabulary size
Vocabulary size is a hyperparameter chosen before training:
- GPT-4: ~100K tokens — better compression, fewer tokens per text
- Llama-2: ~32K tokens — smaller embedding table, less GPU memory
Bigger vocab = better compression but larger embedding matrix. This is one of the first architectural decisions when building an LLM.
"Training a tokenizer is not the same as training a model." Tokenizer training is a deterministic statistical counting process — same corpus + same algorithm = same vocabulary every time. Model training uses stochastic gradient descent and is inherently random.
Byte-Level BPE
The unknown token problem
Character-level BPE only handles characters it saw during training. What about rare Unicode symbols, emoji, or scripts it never encountered? They become [UNK] — the unknown token. Information is lost.
The solution: start with bytes
Every file on your computer is just bytes — values from 0 to 255. There are exactly 256 possible bytes, a tiny but complete base vocabulary. Any text in any language can be represented as bytes.
The right panel shows how different scripts produce different numbers of bytes:
- ASCII characters (English letters, digits) → 1 byte each
- Accented characters (é, ñ) → 2 bytes each
- Korean, Chinese, Japanese characters → 3 bytes each
- Emoji → 4 bytes each
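These byte counts are easy to reproduce in Python: encode a character as UTF-8 and measure its length to see how many base tokens it contributes before any merges run.

```python
for ch in ["a", "7", "é", "ñ", "한", "字", "😀"]:
    print(ch, "->", len(ch.encode("utf-8")), "byte(s)")
# a -> 1    é -> 2    한 -> 3    😀 -> 4
```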
GPT-2 introduced byte-level BPE, and most modern LLMs use it or a close variant (GPT-4, Llama, Claude, Mistral). No text is ever "unknown."
Pre-tokenization — word boundaries
Before BPE runs, text is split into chunks at word boundaries. Merges never cross these boundaries — this prevents nonsensical merges like combining the last letter of one word with the space before the next.
Different models use different splitting rules:
- GPT-2: spaces become part of the next token with a Ġ prefix. "how are" → ["how", "Ġare"]
- GPT-4: uses a different regex pattern that handles numbers and whitespace differently
- T5/SentencePiece: uses a ▁ prefix and works well for languages without spaces, like Chinese and Japanese
Pre-tokenization is why different models tokenize the same text differently — they define "word boundaries" with different regex patterns before BPE even starts. This means GPT-4 and Llama produce different tokens from the same input.
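Here is a minimal sketch of the idea, using a deliberately simplified pattern rather than any model's real one (GPT-2's actual regex also handles contractions, digits, and punctuation separately, and marks spaces with the byte-level Ġ symbol).

```python
import re

# Simplified pre-tokenizer: a chunk is an optional leading space plus a run
# of word characters, or a run of other symbols, or whitespace.
PRETOKENIZE = re.compile(r" ?\w+| ?[^\w\s]+|\s+")

print(PRETOKENIZE.findall("how are you?"))
# ['how', ' are', ' you', '?']
```

BPE merges then run inside each chunk and never cross a chunk boundary.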
Why It Matters
Tokenization isn't just a preprocessing step — it directly shapes how LLMs behave, what they cost, and who they work best for.
Token count = cost
API providers charge per token. Context window limits are in tokens, not characters or words. Efficient tokenization means more content per dollar and more context per prompt.
The language gap
Tokenizers trained on English-heavy corpora compress English text efficiently (~1 token per word) but other languages pay more — the right panel shows this clearly.
Same meaning, more tokens. This is a systematic bias from training data distribution: English text dominated the tokenizer's training corpus, so English-specific merges were learned. The gap widens further with longer or more complex text.
Capitalization matters
"The" and "the" are different tokens with different embeddings. "HELLO" may tokenize as ["HE", "LLO"] while "hello" is a single token. Case-sensitive tokenization means the model literally sees different inputs.
Token boundaries affect reasoning
LLMs reason at the token level, not the character level:
- Math: "127+456" may split as ["12", "7+", "456"] — the model can't "see" the full number 127 when it's split across tokens
- Character counting: "strawberry" tokenizes as something like ["str", "awberry"], not individual letters — that's why counting r's is hard
- Code: myVariable vs my_variable tokenize very differently, affecting how the model reasons about variable names
Special tokens
<|endoftext|>, <|im_start|>, <|im_end|> are injected outside of BPE. They control conversation structure — separating system prompts, user messages, and assistant responses. The BPE algorithm never produces them.
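As a small check, again assuming tiktoken is installed: its encoder refuses a special-token string found in ordinary user text and only encodes it when the caller explicitly allows it, which is the job of the chat framework assembling the conversation.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# A special-token string in ordinary text is rejected, not silently encoded.
try:
    enc.encode("<|endoftext|>")
except ValueError as err:
    print("refused:", err)

# It is encoded only when explicitly allowed.
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
```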
Every quirk of LLM behavior — struggling with math, miscounting characters, charging more for Korean text — traces back to tokenization. Understanding tokens means understanding why models behave the way they do.