
Word Embeddings & Positional Encoding Explained


What are Embeddings?

After tokenization, each token ID is converted into a dense vector, a list of numbers (typically 768 to 12,288 dimensions) that encodes the token's meaning. The vector is called an embedding, and the layer that looks it up is the embedding layer. Tokens with similar meanings end up close together in this high-dimensional space. Positional encoding is then added so the model knows word order (since attention alone is order-agnostic). These embedding vectors are the model's only input: everything the model "knows" about your text starts here.

From Token IDs to Vectors

Tokens are integers — ID 42 means nothing to a neural network. The model needs numbers it can do math with: numbers that encode similarity, context, and meaning.

One-hot vectors — the naive approach

The simplest idea: give each token its own slot in a long list. "Cat" is token #3, so its vector is [0, 0, 1, 0, 0, ...] — a 1 in position 3 and zeros everywhere else. With a 50,000-word vocabulary, that's a vector of 50,000 numbers with 49,999 wasted zeros.

The bigger problem: in this scheme, every token is the same distance from every other token. "Cat" and "dog" are just as far apart as "cat" and "economics" — the representation carries no information about meaning.
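
A few lines of NumPy make this concrete. This is a toy sketch with an invented six-word vocabulary, not anything a real model does:

```python
import numpy as np

# Toy vocabulary, invented for illustration.
vocab = ["the", "cat", "dog", "sat", "king", "economics"]

def one_hot(token: str) -> np.ndarray:
    """Return a vector with a single 1 in the token's slot."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(token)] = 1.0
    return vec

# Every distinct pair of one-hot vectors is exactly sqrt(2) apart,
# so distance carries no information about meaning.
print(np.linalg.norm(one_hot("cat") - one_hot("dog")))        # 1.414...
print(np.linalg.norm(one_hot("cat") - one_hot("economics")))  # 1.414...
```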

Embeddings — dense vectors that encode meaning

An embedding is a dense vector, typically 768 to 12,288 numbers in modern models. Instead of a single 1 in a sea of zeros, every dimension carries information. Similar tokens end up with similar vectors.

One-hot (12 dims shown):

  cat  → [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  dog  → [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
  king → [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

Every token is the same distance from every other.

Embedding (8 dims shown):

  cat  → [0.5, 0.8, -0.1, 0.3, -0.6, 0.2, 0.7, -0.9]
  dog  → [0.4, 0.8, -0.1, 0.3, -0.5, 0.3, 0.6, -0.8]
  king → [-0.7, 0.1, 0.9, -0.5, 0.8, -0.3, -0.2, 0.6]

cat & dog look similar; king looks different.

The embedding layer is a lookup table

This is the key insight: the embedding layer is literally a matrix. Token ID 42 means "go to row 42 and read that row."

Token ID 42 (integer) → Row 42 in matrix (table lookup) → [0.12, −0.34, 0.87, ...] (embedding vector)

No computation — just a table read. The entire embedding layer is this one matrix.
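
A minimal sketch of that lookup in NumPy. The sizes are arbitrary and the matrix values here are random placeholders, not trained weights:

```python
import numpy as np

# The whole embedding layer is one matrix: one row per token ID.
vocab_size, embed_dim = 50_000, 768
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((vocab_size, embed_dim))

token_id = 42
vector = embedding_matrix[token_id]  # "go to row 42 and read that row"
print(vector.shape)                  # (768,)
```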

How are these vectors learned?

During training, the model reads billions of sentences. Words that appear near similar neighbors — "cat" and "dog" both appear near "pet", "furry", "feed" — gradually get pushed toward similar vectors. Words in different contexts drift apart.

This happens through backpropagation: the model adjusts every vector slightly with each training example until the arrangement minimizes prediction error across the entire corpus.
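
A minimal PyTorch sketch of that update, assuming an invented toy task (predict the next token ID from the current one). Real training does the same thing at vastly larger scale:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 16
embed = nn.Embedding(vocab_size, embed_dim)   # the lookup table
head = nn.Linear(embed_dim, vocab_size)       # toy prediction head
opt = torch.optim.SGD([*embed.parameters(), *head.parameters()], lr=0.1)

current = torch.tensor([3])                   # current token ID
target = torch.tensor([7])                    # next token ID
loss = nn.functional.cross_entropy(head(embed(current)), target)
loss.backward()                               # gradients flow into row 3
opt.step()                                    # ...nudging its vector slightly
```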

Fixed at inference time

Once training ends, the embedding matrix is fixed forever — just like the tokenizer's merge table. When you send text to an API, the model looks up pre-trained vectors. No learning happens.

The right panel shows a slice of the embedding matrix. Select a token to highlight its row — that row is the embedding vector. Notice how "cat" and "dog" rows look visually similar, while "cat" and "king" look very different.

What Do Dimensions Mean?

GPT-3 uses 12,288 dimensions per token. What does each number represent?

You can't label the dimensions

Imagine a photo of a cat. You could describe it with human-friendly features: "has fur", "has four legs", "is small." Each feature makes sense on its own.

Embedding dimensions don't work that way. There's no dimension you can point to and say "this one means animal." Instead, the meaning is spread across all 12,288 numbers together — like how a JPEG stores an image as thousands of numbers that individually mean nothing, but together reconstruct the picture.

The model figured out its own internal code during training. We can't read it dimension by dimension — but we don't need to.

But patterns emerge

Even though individual dimensions are uninterpretable, entire rows tell a story. The right panel color-codes each matrix cell: red = positive value, blue = negative, white = near zero.

Compare "cat" and "dog" — their color patterns are strikingly similar. Many dimensions fire in the same direction. Compare "cat" and "king" — the patterns look largely different. The visual similarity in the matrix directly reflects what the model treats as related.

Why this matters for the model

When "cat" and "dog" have similar vectors, every downstream operation — attention, feed-forward layers, output prediction — will treat them similarly. The model has no other information. Similarity in the embedding space is the model's definition of semantic similarity.

This is why embedding quality sets an upper bound on everything else. If "cat" and "dog" don't land near each other, no amount of attention can fix it.

You can't interpret individual dimensions, but you can compare entire rows. Two rows that look alike = two tokens the model treats as similar. Select two tokens in the right panel to see their rows side by side — and the cosine similarity score between them.

Tokens in Space

Real embeddings have hundreds of dimensions — we can't visualize that directly. So we project down to 2D: a snapshot that preserves the most important structure while collapsing everything else.

What you're looking at

The scatter plot shows 39 tokens plotted in 2D. Each dot is a token. Color indicates semantic group. The 2D coordinates come from projecting the high-dimensional embeddings — similar tokens end up near each other.

What to observe

Animals cluster together. "cat", "dog", "horse", "bird" all appeared near "pet", "furry", "feed", "vet" during training. They share so many contexts that their vectors converged.

Royalty clusters together. "king", "queen", "prince", "princess" share "throne", "crown", "rule", "kingdom" — completely different contexts from animals, so they land in a different region.

Function words ("the", "a", "is", "of") cluster near the center. They appear everywhere — in sentences about animals, royalty, cities, everything. Because they have no specific context, their vectors average out toward the middle.

Colors, actions, and cities each form their own clusters, for the same reason: shared context.

Hover to explore

Hover any dot to see its token and 2D coordinates. Notice how tokens within a cluster have similar coordinates — that's the projection preserving structure from higher dimensions.

The 2D view loses information, but the clusters you see are real. In higher dimensions, the separation between semantic groups is even cleaner. Production tools use PCA or t-SNE to project down for visualization: PCA preserves the directions of greatest variance, while t-SNE preserves local neighborhood structure.
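
A sketch of that projection with scikit-learn's PCA, using random stand-in vectors where real embeddings would go:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((39, 768))  # stand-in for 39 real tokens

coords = PCA(n_components=2).fit_transform(embeddings)
print(coords.shape)                          # (39, 2): one (x, y) per dot
```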

Measuring Similarity

"Cat and dog are close" — but how close? We need an actual number.

Cosine similarity

Cosine similarity measures whether two vectors point in the same direction, on a scale from −1 (opposite directions) to 1 (identical direction). It ignores magnitude and focuses purely on angle.

vec("cat") · vec("dog")dot product
÷ (|cat| × |dog|)normalize by magnitude
0.993cosine similarity

Using the actual embedding data for this module:

  Pair            Similarity
  cat ↔ dog       0.993  (nearly identical)
  king ↔ queen    0.995  (very similar)
  cat ↔ king      0.015  (very different)
  the ↔ cat       0.180  (function word vs. content word)

The diagonal of the heatmap is always 1.0 — every token is perfectly similar to itself.
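
Cosine similarity is a few lines of code. This sketch reuses the 8-dimensional toy vectors shown earlier; the real scores above come from the module's full-size embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of magnitudes:
    # only direction matters, not length.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat  = np.array([0.5, 0.8, -0.1, 0.3, -0.6, 0.2, 0.7, -0.9])
dog  = np.array([0.4, 0.8, -0.1, 0.3, -0.5, 0.3, 0.6, -0.8])
king = np.array([-0.7, 0.1, 0.9, -0.5, 0.8, -0.3, -0.2, 0.6])

print(cosine_similarity(cat, dog))   # ~0.99: nearly the same direction
print(cosine_similarity(cat, king))  # much lower: different directions
print(cosine_similarity(cat, cat))   # exactly 1.0: the diagonal case
```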

Vector arithmetic

The most famous embedding insight: king − man + woman ≈ queen.

This works because embeddings encode relational structure, not just proximity. The "royalty" direction and the "gender" direction are consistent across the vector space. Subtracting the "man" direction and adding the "woman" direction moves you from the king region to the queen region.

In 2D the effect is subtle, but in 768+ dimensions this arithmetic works reliably enough to power analogy tasks.
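
A sketch of the arithmetic, assuming `vectors` is a dict mapping words to trained embeddings (e.g. loaded from a word2vec or GloVe file; the dict itself is hypothetical here):

```python
import numpy as np

def cos(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(vectors: dict, a: str, b: str, c: str) -> str:
    """Solve a - b + c ~= ? by nearest-neighbor search over the vocab."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cos(vectors[w], target))

# With good embeddings: analogy(vectors, "king", "man", "woman") -> "queen"
```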

More examples that hold:

  • paris − france + japan ≈ tokyo (capital city relationship)
  • actor − man + woman ≈ actress (grammatical gender pair)
  • walked − walk + run ≈ ran (tense relationship)

A caveat about "famous" analogies: the widely-cited doctor − man + woman ≈ nurse example "works" not because it captures a clean linguistic relationship, but because the training corpus encodes occupational stereotypes (Bolukbasi et al., 2016). When evaluating an embedding analogy, ask whether it reflects a fact about language or a bias in the data.

Cosine similarity is how search engines, recommendation systems, and RAG pipelines find "related" content. When you ask a model a question and it retrieves relevant context from a knowledge base, it's comparing embedding vectors with cosine similarity — finding the stored chunks whose vectors point in the same direction as your query.

Further reading

  • The Illustrated Word2Vec — Jay Alammar's visual guide to how word vectors are trained and why vector arithmetic works
  • What Are Embeddings? — Vicki Boykis' comprehensive guide to embeddings as a data structure
  • Cosine Similarity (Wikipedia) — the math behind the metric
  • Man Is to Computer Programmer as Woman Is to Homemaker? (Bolukbasi et al., 2016) — the canonical paper demonstrating gender bias in word embeddings, including the doctor → nurse analogy

Positional Encoding

Embeddings encode meaning — but not order. "The cat sat on the mat" and "mat the on sat cat the" contain the exact same tokens with the exact same embeddings. The model can't distinguish them.

The problem: order matters

"Dog bites man" and "man bites dog" have the same three tokens. Without position information, the model sees identical inputs for very different meanings.

The solution: add position to each vector

Positional encoding adds a position-specific signal to each token's embedding — so the same word at position 1 produces a different vector than at position 5.

Embedding vector (what the token means) + Position signal (where the token is) = Final vector (meaning + position)
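
In code, the combination is literally one addition. A toy sketch with random stand-ins for both signals:

```python
import numpy as np

rng = np.random.default_rng(0)
embed = rng.standard_normal((3, 8))  # meaning vectors for three tokens
pe = rng.standard_normal((3, 8))     # stand-in position signals, one per slot

x = embed + pe   # same word at a different position -> different vector
print(x.shape)   # (3, 8): meaning + position, row by row
```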

Sinusoidal PE (the original approach)

The original Transformer (2017) used a clever trick: mix sine and cosine waves at different speeds.

Think of it like a clock. The second hand moves fast, the minute hand moves slowly, the hour hand barely moves. By reading all three hands together, you know the exact time. Sinusoidal PE works the same way — fast-changing dimensions track fine position differences, slow-changing dimensions track broad position.

Each position gets a unique combination of wave values — a fingerprint that the model learns to read.

Why not just number them 1, 2, 3? Because raw numbers don't generalize. Position 500 would be 100x larger than position 5, distorting the embedding. Waves stay bounded between -1 and 1 no matter how large the position number.
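
A NumPy sketch of the original formulation, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)):

```python
import numpy as np

def sinusoidal_pe(max_len: int, d: int) -> np.ndarray:
    pos = np.arange(max_len)[:, None]   # positions 0..max_len-1
    i = np.arange(0, d, 2)[None, :]     # even dimension indices
    angles = pos / (10000 ** (i / d))   # fast waves -> slow waves
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)        # even dims: sine
    pe[:, 1::2] = np.cos(angles)        # odd dims: cosine
    return pe                           # every value stays in [-1, 1]

pe = sinusoidal_pe(max_len=512, d=768)
print(pe[5, :4])   # position 5's fingerprint (first 4 of 768 dims)
```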

RoPE: what modern models actually use

Sinusoidal PE was groundbreaking but has a limitation: it encodes absolute position. Token at position 5 always gets the same position vector, regardless of context. But what matters for language is usually relative position — how far apart two tokens are, not their exact index.

RoPE (Rotary Position Embeddings), used by Llama, Mistral, and most other modern LLMs, takes a different approach.

The core idea: rotation

Instead of adding a position vector, RoPE rotates the embedding vector by an angle proportional to its position:

  • Token at position 1 → rotate by angle 1θ
  • Token at position 2 → rotate by angle 2θ
  • Token at position 5 → rotate by angle 5θ

Think of each token's embedding as an arrow on a compass. RoPE rotates each arrow by a different amount based on where it sits in the sequence.

Why rotation encodes relative position

When the model compares two tokens (via dot product in attention), what matters is the angle between their rotated vectors. If token A is at position 3 and token B is at position 7, the angle between them depends only on the gap (7 - 3 = 4) — not on the absolute positions.

Move both tokens to positions 10 and 14: the gap is still 4, the angle is the same, and the model sees the same relationship. This is why RoPE naturally handles relative position.
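
A 2D sketch of that property. The angle step θ (0.1 here, an arbitrary toy value) is the per-position rotation; the dot product comes out identical whenever the gap is the same:

```python
import numpy as np

def rotate(vec: np.ndarray, position: int, theta: float = 0.1) -> np.ndarray:
    """Rotate a 2-D vector by position * theta radians."""
    angle = position * theta
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return rot @ vec

a = np.array([1.0, 0.0])
b = np.array([0.6, 0.8])

# Gap of 4 in both cases -> identical dot products.
print(rotate(a, 3) @ rotate(b, 7))     # positions 3 and 7
print(rotate(a, 10) @ rotate(b, 14))   # positions 10 and 14: same value
```

Real RoPE applies this same idea to every pair of dimensions, each pair with its own θ, inside the attention computation.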

Why RoPE is better

  • Relative position — "the word 3 positions back" works the same everywhere in the sequence
  • Better extrapolation — models can handle longer sequences than they were trained on because relative distances stay meaningful
  • No extra parameters — it's a mathematical rotation, not a learned table

Try it

Toggle Positional Encoding ON on the right panel. Watch the dots shift — each token moves slightly based on its position in the sequence. The displacement is small but visible.

Sinusoidal PE appeared in the original Transformer (2017); BERT instead learned its absolute position embeddings as trainable parameters. Nearly all modern LLMs have switched to RoPE because relative position matters more than absolute position for language understanding. The shift from "where exactly is this token?" to "how far apart are these tokens?" was a key insight in scaling LLMs to longer contexts.

Further reading

  • Rotary Embeddings: A Relative Revolution — EleutherAI's deep dive into RoPE
  • The Illustrated Transformer — Jay Alammar's visual guide covering sinusoidal PE

Why Embeddings Matter

The embedding layer is the model's only interface to tokens. After this step, the model never sees token IDs again — everything that follows is pure vector math.

Everything downstream depends on this

Every operation in a transformer operates on the vectors that came from the embedding layer:

Embeddings (vectors from lookup) → Attention (tokens look at each other) → Feed-Forward (transform each token) → Output (predict next token)

After the embedding layer, token IDs are gone forever. Attention, feed-forward, output — all of it is pure vector math on the embeddings.

If "cat" and "dog" don't land near each other in embedding space, the attention mechanism can't learn that they're related. The feed-forward layers can't treat them similarly. Embedding quality sets the ceiling for everything.

The size of the table

Embedding table size = vocabulary × dimensions.

GPT-3's numbers: 50,257 tokens × 12,288 dimensions = 617 million parameters — just for the lookup table. That's before any attention layers.

This is why vocabulary size is a real tradeoff (from the tokenization module): more tokens means bigger embedding table means more GPU memory.

  • GPT-2: 50,257 tokens × 1,600 dims = ~80M parameters
  • GPT-3: 50,257 tokens × 12,288 dims = ~617M parameters
  • Llama-2: 32,000 tokens × 4,096 dims = ~131M parameters

How much GPU memory?

Each parameter is a number. The memory depends on how many bytes per number:

  • FP32 (full precision): 4 bytes per parameter — used during training
  • FP16 / BF16 (half precision): 2 bytes — common for inference
  • INT8 (quantized): 1 byte — used to fit large models on smaller GPUs
  • INT4 (aggressive quantization): 0.5 bytes — trades accuracy for memory

The formula: parameters × bytes per parameter = GPU memory

For GPT-3's embedding table (617M params):

  • FP32: 617M × 4 bytes = ~2.5 GB
  • FP16: 617M × 2 bytes = ~1.2 GB
  • INT8: 617M × 1 byte = ~617 MB

And that's just the embedding table — the full GPT-3 model has 175 billion parameters total, needing ~350 GB in FP16. This is why quantization (a later module) is so important.

What "understanding" really means

When people say an LLM "understands" language, they mean the embedding vectors are arranged so that:

  • Similar meanings are nearby in the vector space
  • Related concepts have consistent directional relationships (king−man ≈ queen−woman)
  • The downstream layers can exploit this structure to reason about text

There's no deeper magic. The model has never experienced the world. It has vectors — billions of them, arranged by the geometry of language.

Everything downstream works with that arrangement and nothing else. The geometry is the understanding, and that geometry is embeddings.
