
Word Embeddings & Positional Encoding Explained


What are Embeddings?

After tokenization, each token ID is converted into a dense vector, a list of numbers (typically 768 to 12,288 dimensions) that encodes the token's meaning. The vector is called an embedding, and the layer that looks it up is the embedding layer. Tokens with similar meanings end up close together in this high-dimensional space. Positional encoding is then added so the model knows word order (since attention alone is order-agnostic). These embedding vectors are the model's only input: everything the model "knows" about your text starts here.

From Token IDs to Vectors

Tokens are integers — ID 42 means nothing to a neural network. The model needs numbers it can do math with: numbers that encode similarity, context, and meaning.

One-hot vectors — the naive approach

The simplest idea: give each token its own slot in a long list. "Cat" is token #3, so its vector is [0, 0, 1, 0, 0, ...] — a 1 in position 3 and zeros everywhere else. With a 50,000-word vocabulary, that's a vector of 50,000 numbers with 49,999 wasted zeros.

The bigger problem: in this scheme, every token is the same distance from every other token. "Cat" and "dog" are just as far apart as "cat" and "economics" — the representation carries no information about meaning.
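
A few lines of NumPy make this concrete. This is a toy sketch with an invented six-word vocabulary, not anything a real model does:

```python
import numpy as np

# Toy vocabulary, invented for illustration.
vocab = ["the", "cat", "dog", "sat", "king", "economics"]

def one_hot(token: str) -> np.ndarray:
    """Return a vector with a single 1 in the token's slot."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(token)] = 1.0
    return vec

# Every distinct pair of one-hot vectors is exactly sqrt(2) apart,
# so distance carries no information about meaning.
print(np.linalg.norm(one_hot("cat") - one_hot("dog")))        # 1.414...
print(np.linalg.norm(one_hot("cat") - one_hot("economics")))  # 1.414...
```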

Embeddings — dense vectors that encode meaning

An embedding is a dense vector, typically 768 to 12,288 numbers in modern models. Instead of a single 1 in a sea of zeros, every dimension carries information. Similar tokens end up with similar vectors.

One-hot (12 dims shown):

  cat  → [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  dog  → [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
  king → [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

Every token is the same distance from every other.

Embedding (8 dims shown):

  cat  → [0.5, 0.8, -0.1, 0.3, -0.6, 0.2, 0.7, -0.9]
  dog  → [0.4, 0.8, -0.1, 0.3, -0.5, 0.3, 0.6, -0.8]
  king → [-0.7, 0.1, 0.9, -0.5, 0.8, -0.3, -0.2, 0.6]

cat & dog look similar; king looks different.

The embedding layer is a lookup table

This is the key insight: the embedding layer is literally a matrix. Token ID 42 means "go to row 42 and read that row."

Token ID 42 (integer) → Row 42 in matrix (table lookup) → [0.12, −0.34, 0.87, ...] (embedding vector)

No computation — just a table read. The entire embedding layer is this one matrix.
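
A minimal sketch of that lookup in NumPy. The sizes are arbitrary and the matrix values here are random placeholders, not trained weights:

```python
import numpy as np

# The whole embedding layer is one matrix: one row per token ID.
vocab_size, embed_dim = 50_000, 768
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((vocab_size, embed_dim))

token_id = 42
vector = embedding_matrix[token_id]  # "go to row 42 and read that row"
print(vector.shape)                  # (768,)
```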

How are these vectors learned?

During training, the model reads billions of sentences. Words that appear near similar neighbors — "cat" and "dog" both appear near "pet", "furry", "feed" — gradually get pushed toward similar vectors. Words in different contexts drift apart.

This happens through backpropagation: the model adjusts every vector slightly with each training example until the arrangement minimizes prediction error across the entire corpus.
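
A minimal PyTorch sketch of that update, assuming an invented toy task (predict the next token ID from the current one). Real training does the same thing at vastly larger scale:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 16
embed = nn.Embedding(vocab_size, embed_dim)   # the lookup table
head = nn.Linear(embed_dim, vocab_size)       # toy prediction head
opt = torch.optim.SGD([*embed.parameters(), *head.parameters()], lr=0.1)

current = torch.tensor([3])                   # current token ID
target = torch.tensor([7])                    # next token ID
loss = nn.functional.cross_entropy(head(embed(current)), target)
loss.backward()                               # gradients flow into row 3
opt.step()                                    # ...nudging its vector slightly
```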

Fixed at inference time

Once training ends, the embedding matrix is fixed forever — just like the tokenizer's merge table. When you send text to an API, the model looks up pre-trained vectors. No learning happens.

The right panel shows a slice of the embedding matrix. Select a token to highlight its row — that row is the embedding vector. Notice how "cat" and "dog" rows look visually similar, while "cat" and "king" look very different.

What Do Dimensions Mean?

GPT-3 uses 12,288 dimensions per token. What does each number represent?

You can't label the dimensions

Imagine a photo of a cat. You could describe it with human-friendly features: "has fur", "has four legs", "is small." Each feature makes sense on its own.

Embedding dimensions don't work that way. There's no dimension you can point to and say "this one means animal." Instead, the meaning is spread across all 12,288 numbers together — like how a JPEG stores an image as thousands of numbers that individually mean nothing, but together reconstruct the picture.

The model figured out its own internal code during training. We can't read it dimension by dimension — but we don't need to.

But patterns emerge

Even though individual dimensions are uninterpretable, entire rows tell a story. The right panel color-codes each matrix cell: red = positive value, blue = negative, white = near zero.

Compare "cat" and "dog" — their color patterns are strikingly similar. Many dimensions fire in the same direction. Compare "cat" and "king" — the patterns look largely different. The visual similarity in the matrix directly reflects what the model treats as related.

Why this matters for the model

When "cat" and "dog" have similar vectors, every downstream operation — attention, feed-forward layers, output prediction — will treat them similarly. The model has no other information. Similarity in the embedding space is the model's definition of semantic similarity.

This is why embedding quality sets an upper bound on everything else. If "cat" and "dog" don't land near each other, no amount of attention can fix it.

You can't interpret individual dimensions, but you can compare entire rows. Two rows that look alike = two tokens the model treats as similar. Select two tokens in the right panel to see their rows side by side — and the cosine similarity score between them.

Tokens in Space

Real embeddings have hundreds of dimensions — we can't visualize that directly. So we project down to 2D: a snapshot that preserves the most important structure while collapsing everything else.

What you're looking at

The scatter plot shows 39 tokens plotted in 2D. Each dot is a token. Color indicates semantic group. The 2D coordinates come from projecting the high-dimensional embeddings — similar tokens end up near each other.

What to observe

Animals cluster together. "cat", "dog", "horse", "bird" all appeared near "pet", "furry", "feed", "vet" during training. They share so many contexts that their vectors converged.

Royalty clusters together. "king", "queen", "prince", "princess" share "throne", "crown", "rule", "kingdom" — completely different contexts from animals, so they land in a different region.

Function words ("the", "a", "is", "of") cluster near the center. They appear everywhere — in sentences about animals, royalty, cities, everything. Because they have no specific context, their vectors average out toward the middle.

Colors, actions, and cities each form their own clusters, for the same reason: shared context.

Hover to explore

Hover any dot to see its token and 2D coordinates. Notice how tokens within a cluster have similar coordinates — that's the projection preserving structure from higher dimensions.

The 2D view loses information, but the clusters you see are real. In higher dimensions, the separation between semantic groups is even cleaner. Production tools use PCA or t-SNE to project down for visualization: PCA preserves the directions of greatest variance, while t-SNE preserves local neighborhood structure.
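
A sketch of that projection with scikit-learn's PCA, using random stand-in vectors where real embeddings would go:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((39, 768))  # stand-in for 39 real tokens

coords = PCA(n_components=2).fit_transform(embeddings)
print(coords.shape)                          # (39, 2): one (x, y) per dot
```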

Measuring Similarity

"Cat and dog are close" — but how close? We need an actual number.

Cosine similarity

Cosine similarity measures whether two vectors point in the same direction, on a scale from −1 (opposite directions) to 1 (identical direction). It ignores magnitude and focuses purely on angle.

vec("cat") · vec("dog")dot product
÷ (|cat| × |dog|)normalize by magnitude
0.993cosine similarity

Using the actual embedding data for this module:

  Pair            Similarity
  cat ↔ dog       0.993  (nearly identical)
  king ↔ queen    0.995  (very similar)
  cat ↔ king      0.015  (very different)
  the ↔ cat       0.180  (function word vs. content word)

The diagonal of the heatmap is always 1.0 — every token is perfectly similar to itself.
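
Cosine similarity is a few lines of code. This sketch reuses the 8-dimensional toy vectors shown earlier; the real scores above come from the module's full-size embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of magnitudes:
    # only direction matters, not length.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat  = np.array([0.5, 0.8, -0.1, 0.3, -0.6, 0.2, 0.7, -0.9])
dog  = np.array([0.4, 0.8, -0.1, 0.3, -0.5, 0.3, 0.6, -0.8])
king = np.array([-0.7, 0.1, 0.9, -0.5, 0.8, -0.3, -0.2, 0.6])

print(cosine_similarity(cat, dog))   # ~0.99: nearly the same direction
print(cosine_similarity(cat, king))  # much lower: different directions
print(cosine_similarity(cat, cat))   # exactly 1.0: the diagonal case
```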

Vector arithmetic

The most famous embedding insight: king − man + woman ≈ queen.

This works because embeddings encode relational structure, not just proximity. The "royalty" direction and the "gender" direction are consistent across the vector space. Subtracting the "man" direction and adding the "woman" direction moves you from the king region to the queen region.

In 2D the effect is subtle, but in 768+ dimensions this arithmetic works reliably enough to power analogy tasks.
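
A sketch of the arithmetic, assuming `vectors` is a dict mapping words to trained embeddings (e.g. loaded from a word2vec or GloVe file; the dict itself is hypothetical here):

```python
import numpy as np

def cos(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(vectors: dict, a: str, b: str, c: str) -> str:
    """Solve a - b + c ~= ? by nearest-neighbor search over the vocab."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cos(vectors[w], target))

# With good embeddings: analogy(vectors, "king", "man", "woman") -> "queen"
```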

More examples that hold:

  • paris − france + japan ≈ tokyo (capital city relationship)
  • actor − man + woman ≈ actress (grammatical gender pair)
  • walked − walk + run ≈ ran (tense relationship)

A caveat about "famous" analogies: the widely-cited doctor − man + woman ≈ nurse example "works" not because it captures a clean linguistic relationship, but because the training corpus encodes occupational stereotypes (Bolukbasi et al., 2016). When evaluating an embedding analogy, ask whether it reflects a fact about language or a bias in the data.

Cosine similarity is how search engines, recommendation systems, and RAG pipelines find "related" content. When you ask a model a question and it retrieves relevant context from a knowledge base, it's comparing embedding vectors with cosine similarity — finding the stored chunks whose vectors point in the same direction as your query.

Further reading

  • The Illustrated Word2Vec — Jay Alammar's visual guide to how word vectors are trained and why vector arithmetic works
  • What Are Embeddings? — Vicki Boykis' comprehensive guide to embeddings as a data structure
  • Cosine Similarity (Wikipedia) — the math behind the metric
  • Man Is to Computer Programmer as Woman Is to Homemaker? (Bolukbasi et al., 2016) — the canonical paper demonstrating gender bias in word embeddings, including the doctor → nurse analogy

Positional Encoding

Embeddings encode meaning — but not order. "The cat sat on the mat" and "mat the on sat cat the" contain the exact same tokens with the exact same embeddings. The model can't distinguish them.

The problem: order matters

"Dog bites man" and "man bites dog" have the same three tokens. Without position information, the model sees identical inputs for very different meanings.

The solution: add position to each vector

Positional encoding adds a position-specific signal to each token's embedding — so the same word at position 1 produces a different vector than at position 5.

Embedding vector (what the token means) + Position signal (where the token is) = Final vector (meaning + position)
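
In code, the combination is literally one addition. A toy sketch with random stand-ins for both signals:

```python
import numpy as np

rng = np.random.default_rng(0)
embed = rng.standard_normal((3, 8))  # meaning vectors for three tokens
pe = rng.standard_normal((3, 8))     # stand-in position signals, one per slot

x = embed + pe   # same word at a different position -> different vector
print(x.shape)   # (3, 8): meaning + position, row by row
```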

Sinusoidal PE (the original approach)

The original Transformer (2017) used a clever trick: mix sine and cosine waves at different speeds.

Think of it like a clock. The second hand moves fast, the minute hand moves slowly, the hour hand barely moves. By reading all three hands together, you know the exact time. Sinusoidal PE works the same way — fast-changing dimensions track fine position differences, slow-changing dimensions track broad position.

Each position gets a unique combination of wave values — a fingerprint that the model learns to read.

Why not just number them 1, 2, 3? Because raw numbers don't generalize. Position 500 would be 100x larger than position 5, distorting the embedding. Waves stay bounded between -1 and 1 no matter how large the position number.
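
A NumPy sketch of the original formulation, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)):

```python
import numpy as np

def sinusoidal_pe(max_len: int, d: int) -> np.ndarray:
    pos = np.arange(max_len)[:, None]   # positions 0..max_len-1
    i = np.arange(0, d, 2)[None, :]     # even dimension indices
    angles = pos / (10000 ** (i / d))   # fast waves -> slow waves
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)        # even dims: sine
    pe[:, 1::2] = np.cos(angles)        # odd dims: cosine
    return pe                           # every value stays in [-1, 1]

pe = sinusoidal_pe(max_len=512, d=768)
print(pe[5, :4])   # position 5's fingerprint (first 4 of 768 dims)
```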

RoPE: what modern models actually use

Sinusoidal PE was groundbreaking but has a limitation: it encodes absolute position. Token at position 5 always gets the same position vector, regardless of context. But what matters for language is usually relative position — how far apart two tokens are, not their exact index.

RoPE (Rotary Position Embeddings), used by Llama, Mistral, and most other modern LLMs, takes a different approach.

The core idea: rotation

Instead of adding a position vector, RoPE rotates the embedding vector by an angle proportional to its position:

  • Token at position 1 → rotate by angle 1θ
  • Token at position 2 → rotate by angle 2θ
  • Token at position 5 → rotate by angle 5θ

Think of each token's embedding as an arrow on a compass. RoPE rotates each arrow by a different amount based on where it sits in the sequence.

Why rotation encodes relative position

When the model compares two tokens (via dot product in attention), what matters is the angle between their rotated vectors. If token A is at position 3 and token B is at position 7, the angle between them depends only on the gap (7 - 3 = 4) — not on the absolute positions.

Move both tokens to positions 10 and 14: the gap is still 4, the angle is the same, and the model sees the same relationship. This is why RoPE naturally handles relative position.
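
A 2D sketch of that property. The angle step θ (0.1 here, an arbitrary toy value) is the per-position rotation; the dot product comes out identical whenever the gap is the same:

```python
import numpy as np

def rotate(vec: np.ndarray, position: int, theta: float = 0.1) -> np.ndarray:
    """Rotate a 2-D vector by position * theta radians."""
    angle = position * theta
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return rot @ vec

a = np.array([1.0, 0.0])
b = np.array([0.6, 0.8])

# Gap of 4 in both cases -> identical dot products.
print(rotate(a, 3) @ rotate(b, 7))     # positions 3 and 7
print(rotate(a, 10) @ rotate(b, 14))   # positions 10 and 14: same value
```

Real RoPE applies this same idea to every pair of dimensions, each pair with its own θ, inside the attention computation.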

Why RoPE is better

  • Relative position — "the word 3 positions back" works the same everywhere in the sequence
  • Better extrapolation — models can handle longer sequences than they were trained on because relative distances stay meaningful
  • No extra parameters — it's a mathematical rotation, not a learned table

Try it

Toggle Positional Encoding ON on the right panel. Watch the dots shift — each token moves slightly based on its position in the sequence. The displacement is small but visible.

Sinusoidal PE appeared in the original Transformer (2017); BERT instead learned its absolute position embeddings as trainable parameters. Nearly all modern LLMs have switched to RoPE because relative position matters more than absolute position for language understanding. The shift from "where exactly is this token?" to "how far apart are these tokens?" was a key insight in scaling LLMs to longer contexts.

Further reading

  • Rotary Embeddings: A Relative Revolution — EleutherAI's deep dive into RoPE
  • The Illustrated Transformer — Jay Alammar's visual guide covering sinusoidal PE

Why Embeddings Matter

The embedding layer is the model's only interface to tokens. After this step, the model never sees token IDs again — everything that follows is pure vector math.

Everything downstream depends on this

Every operation in a transformer operates on the vectors that came from the embedding layer:

Embeddings (vectors from lookup) → Attention (tokens look at each other) → Feed-Forward (transform each token) → Output (predict next token)

After the embedding layer, token IDs are gone forever. Attention, feed-forward, output — all of it is pure vector math on the embeddings.

If "cat" and "dog" don't land near each other in embedding space, the attention mechanism can't learn that they're related. The feed-forward layers can't treat them similarly. Embedding quality sets the ceiling for everything.

The size of the table

Embedding table size = vocabulary × dimensions.

GPT-3's numbers: 50,257 tokens × 12,288 dimensions = 617 million parameters — just for the lookup table. That's before any attention layers.

This is why vocabulary size is a real tradeoff (from the tokenization module): more tokens means bigger embedding table means more GPU memory.

  • GPT-2: 50,257 tokens × 1,600 dims = ~80M parameters
  • GPT-3: 50,257 tokens × 12,288 dims = ~617M parameters
  • Llama-2: 32,000 tokens × 4,096 dims = ~131M parameters

How much GPU memory?

Each parameter is a number. The memory depends on how many bytes per number:

  • FP32 (full precision): 4 bytes per parameter — used during training
  • FP16 / BF16 (half precision): 2 bytes — common for inference
  • INT8 (quantized): 1 byte — used to fit large models on smaller GPUs
  • INT4 (aggressive quantization): 0.5 bytes — trades accuracy for memory

The formula: parameters × bytes per parameter = GPU memory

For GPT-3's embedding table (617M params):

  • FP32: 617M × 4 bytes = ~2.5 GB
  • FP16: 617M × 2 bytes = ~1.2 GB
  • INT8: 617M × 1 byte = ~617 MB

And that's just the embedding table — the full GPT-3 model has 175 billion parameters total, needing ~350 GB in FP16. This is why quantization (a later module) is so important.

What "understanding" really means

When people say an LLM "understands" language, they mean the embedding vectors are arranged so that:

  • Similar meanings are nearby in the vector space
  • Related concepts have consistent directional relationships (king−man ≈ queen−woman)
  • The downstream layers can exploit this structure to reason about text

There's no deeper magic. The model has never experienced the world. It has vectors — billions of them, arranged by the geometry of language.

Everything downstream works with that arrangement and nothing else. The geometry is the understanding, and that geometry is embeddings.
