The news. On June 5, 2026, a paper titled "Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings" (arXiv 2606.07502) asked why large language models make such poor off-the-shelf embedding models. The authors find that pooled text embeddings drift toward frequent-but-uninformative tokens, trace the cause to a subspace in the unembedding matrix, and remove it with one linear transformation they call EmbedFilter, reportedly improving semantic quality while slightly reducing the embedding's dimensionality.
Picture a camera lens with a faint greasy smudge. Every photo you take comes out with the same low haze laid over it — bright pictures, dark pictures, all tinted the same way — so two completely different scenes end up looking oddly similar. A large language model already owns a lens: its unembedding matrix, the layer that turns each hidden state into a score for every word in the vocabulary. When you reuse the model as an embedding model — pooling its hidden states into one text embedding for a whole sentence — you are taking a photo through that lens. And the paper's finding is that the lens has a smudge.
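To make the lens concrete, here is a minimal sketch of the two operations in play: scoring every vocabulary token from a hidden state, and mean-pooling hidden states into one text embedding. The four-token vocabulary, the 3-dimensional hidden size, and every number are invented for illustration; a real model has tens of thousands of tokens and thousands of dimensions.

```python
# Toy "lens": a hypothetical unembedding matrix with one row per vocabulary token.
VOCAB = ["the", "of", "cat", "stock"]
W_U = [
    [1.0, 0.0, 0.0],  # "the"
    [0.9, 0.1, 0.0],  # "of"
    [0.1, 1.0, 0.0],  # "cat"
    [0.1, 0.0, 1.0],  # "stock"
]


def logits(hidden):
    """Score every vocabulary token for one hidden state (W_U times hidden)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_U]


def mean_pool(hidden_states):
    """Average the per-token hidden states into a single text embedding."""
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]


# Made-up hidden states for a three-token sentence; axis 0 plays the smudge.
hidden_states = [
    [3.0, 0.2, 0.0],
    [2.8, 1.0, 0.1],
    [3.1, 0.1, 0.0],
]

pooled = mean_pool(hidden_states)
scores = logits(pooled)
top_token = VOCAB[scores.index(max(scores))]
print(top_token)  # prints "the"
```

Read through the lens, the pooled vector of this toy sentence points most strongly at "the", even though the sentence also carries some "cat" signal on axis 1.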
The smudge is a subspace: a handful of directions that point toward the most frequent tokens, such as "the", "of", "and", and the punctuation found in nearly every sentence. Because almost every document contains those tokens, every pooled embedding gets dragged the same way along those directions. Texts about wildly different things land close together, which is why cosine similarity on raw LLM embeddings is a famously weak signal: the shared haze crowds out the directions that actually carry meaning. EmbedFilter identifies that subspace and projects it out with a single linear transformation (one matrix multiply, no retraining), and the haze lifts.
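In symbols, projecting out a subspace looks like this. The notation is an assumption for illustration, not necessarily the paper's: $e$ is a pooled embedding of dimension $d$, and the columns of $U$ are an orthonormal basis for the $k$ smudge directions.

$$
\tilde{e} = \left(I_d - U U^\top\right) e,
\qquad U \in \mathbb{R}^{d \times k},
\qquad U^\top U = I_k
$$

$$
U^\top \tilde{e} = U^\top e - \left(U^\top U\right) U^\top e = 0
$$

The second line checks that nothing is left along the smudge after filtering. The filtered vector still has $d$ coordinates but lies in a $(d-k)$-dimensional subspace, so rewriting it in an orthonormal basis of the remaining directions is one way the embedding could end up slightly smaller.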
Put rough numbers on the smudge to see why it matters. The figures here are illustrative, not results from the paper. Say each pooled embedding is a 4,096-dimensional vector and the frequency smudge lives in a 16-dimensional subspace. Before filtering, two unrelated documents might sit at cosine similarity 0.86: almost everything looks alike because the shared haze dominates the direction of both vectors. Project out those 16 directions and the same unrelated pair drops to about 0.34, while a genuinely related pair stays high. The separation you wanted reappears, and the filtered vector needs only 4,096 - 16 = 4,080 coordinates, a small dimensionality cut on top.
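The same effect can be reproduced at toy scale. The sketch below uses 5-dimensional vectors and a single smudge direction instead of 4,096 and 16. The three document vectors and the basis are made up, so the similarities it prints are not the 0.86 and 0.34 quoted above.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def project_out(vec, basis):
    """Remove the part of vec that lies in the span of an orthonormal basis."""
    out = list(vec)
    for u in basis:
        coeff = dot(vec, u)
        out = [o - coeff * x for o, x in zip(out, u)]
    return out


s = 1 / math.sqrt(2)
smudge_basis = [[s, s, 0.0, 0.0, 0.0]]  # hypothetical frequency direction

# Invented pooled embeddings: a large shared haze plus a little content.
doc_cats = [3.0, 2.8, 1.0, 0.9, 0.1]
doc_kittens = [2.9, 3.1, 0.9, 1.0, 0.2]
doc_bonds = [3.1, 3.0, 0.1, 0.2, 1.2]

unrelated_before = cosine(doc_cats, doc_bonds)
related_before = cosine(doc_cats, doc_kittens)

cats_f, kittens_f, bonds_f = (
    project_out(d, smudge_basis) for d in (doc_cats, doc_kittens, doc_bonds)
)
unrelated_after = cosine(cats_f, bonds_f)
related_after = cosine(cats_f, kittens_f)

print(f"unrelated: {unrelated_before:.2f} -> {unrelated_after:.2f}")
print(f"related:   {related_before:.2f} -> {related_after:.2f}")
```

On these toy numbers the unrelated pair falls from roughly 0.93 to roughly 0.25, while the related pair stays above 0.95, so the gap between related and unrelated widens sharply once the shared direction is gone.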
| Way to get a text embedding from an LLM | How | Needs training? | Frequency bias |
|---|---|---|---|
| Mean-pool the hidden states (raw) | average the per-token hidden vectors | no | high — the smudge |
| Contrastive fine-tune (dedicated embedder) | retrain on positive/negative text pairs | yes — expensive | low |
| Whitening / post-hoc normalization | re-center & rescale the embeddings | light (fit statistics on a sample corpus, no gradient training) | partial |
| EmbedFilter (arXiv 2606.07502) | project out the unembedding subspace, one linear map | no — training-free | low |
What makes this elegant is where the fix came from. You do not need a second model or a labeled dataset to find the smudge. By the paper's account, the model's own unembedding matrix already encodes which directions are frequency-driven, because those are the directions it pushes probability toward when it predicts common words. The diagnostic was sitting inside the network the whole time; EmbedFilter reads it off and removes it. That is why the de-biasing costs one matrix multiply and zero training rather than a fine-tuning run, and why a model you already use for retrieval can get cleaner vectors without being retrained.
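As a rough sketch of what "reading it off" could look like, the snippet below takes the unembedding rows of the most frequent tokens and orthonormalizes them into a basis. This is a hypothetical construction chosen for illustration; the paper's actual procedure for identifying the subspace may differ. The function names, the toy matrix, and the token counts are all assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(rows, tol=1e-9):
    """Gram-Schmidt: build an orthonormal basis, skipping dependent rows."""
    basis = []
    for row in rows:
        residual = list(row)
        for u in basis:
            coeff = dot(residual, u)
            residual = [r - coeff * x for r, x in zip(residual, u)]
        norm = math.sqrt(dot(residual, residual))
        if norm > tol:
            basis.append([r / norm for r in residual])
    return basis


def frequency_subspace(unembedding, token_counts, k):
    """Hypothetical helper: span of the unembedding rows of the k most frequent tokens."""
    top = sorted(token_counts, key=token_counts.get, reverse=True)[:k]
    return orthonormalize([unembedding[t] for t in top])


# Toy unembedding rows (hidden size 3) and invented corpus counts.
unembedding = {
    "the": [2.0, 0.1, 0.0],
    "of": [1.8, 0.0, 0.2],
    "cat": [0.1, 1.0, 0.0],
    "stock": [0.0, 0.1, 1.0],
}
token_counts = {"the": 5000, "of": 3000, "cat": 40, "stock": 25}

basis = frequency_subspace(unembedding, token_counts, k=2)
print(len(basis))  # prints 2
```

The resulting basis can be handed to any projection routine that expects orthonormal directions. In this sketch nothing is trained: the only inputs are the model's own matrix and a token frequency table.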
Related explainers
- Is Grep All You Need? — Grep vs vector retrieval — where text embeddings actually get used, and when literal search beats them
- OmniRetrieval — Source-native query dispatch — another angle on why one flat vector index loses information
- OScaR — Token Norm Imbalance — the kindred finding that a few tokens carry outsized weight and distort the model's vectors