The news. On June 28, 2026, the paper MultiHashFormer (arXiv 2606.28057) proposed hash-based autoregression: representing each token as a short hash signature (a handful of discrete IDs from several independent hash functions) instead of a row in a vocabulary-sized embedding matrix. The authors report that at 100M, 1B, and 3B parameters it outperforms standard Transformer language models; the work is early-stage (under review).

Picture a city that gives every car its own numbered parking spot. With a thousand cars that is fine; with a hundred million, the lot is bigger than the city. A standard language model parks words the same way: its embedding matrix keeps one row for every word in the vocabulary, and that table grows with the dictionary. For a small model, those word tables — the input embedding plus the output projection — can take up most of the parameter budget, and a large or multilingual vocabulary only makes the lot bigger.

The obvious shortcut is to stop giving every car its own spot — hash several words onto the same row. That saves space, but it collides: two different words land on the identical row, and the model can no longer tell them apart, which breaks next-token prediction. That is why naive hashing tricks were mainly useful for parameter savings, not clean autoregressive generation.
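To see why a single shared-row hash cannot support generation, here is a minimal sketch that hashes a toy vocabulary onto a much smaller table and counts how many tokens end up indistinguishable. The vocabulary, the table size, and the `bucket` helper are all hypothetical choices for illustration; they are not taken from the paper or from any particular prior method.

```python
import hashlib

def bucket(token, num_rows, seed=0):
    """Deterministically map a token to one of num_rows shared rows."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little") % num_rows

# Hypothetical toy setup: 10,000 tokens squeezed into 1,000 rows.
vocab = [f"tok{i}" for i in range(10_000)]
num_rows = 1_000

# Group tokens by the row they land on.
rows = {}
for tok in vocab:
    rows.setdefault(bucket(tok, num_rows), []).append(tok)

# A token is ambiguous if at least one other token shares its row.
ambiguous = sum(len(toks) for toks in rows.values() if len(toks) > 1)

print(f"rows used: {len(rows)} of {num_rows}")
print(f"tokens sharing a row with another token: {ambiguous} of {len(vocab)}")
```

With ten times more tokens than rows, the pigeonhole principle alone guarantees that more than 9,000 of the 10,000 tokens share a row with some other token, whatever hash function is used. A model that can only predict a row has no way to say which of those tokens it meant.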

MultiHashFormer's fix is to hand each word a license plate instead of a parking spot: a short multi-hash signature the model encodes, transforms, and decodes back into the next token. A plate is short, but because it is built from several independent character positions, a full-plate collision becomes extremely unlikely. Concretely, each token gets a hash signature: a short sequence of IDs, one from each of several independent hash functions. Any one hash alone would collide often; several independent ones together form a fingerprint distinct enough that two tokens almost never share a full signature. A Hash Encoder then folds that signature into a single latent vector a standard Transformer can consume, and a Hash Decoder predicts the next token's signature, which is then mapped back to a token. This replaces the usual vocabulary-sized output softmax.
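The signature idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the four salted BLAKE2 hashes, the 256-entry codebooks, the sum-pooling `encode` function, and the dictionary-based `decode` lookup are all stand-ins chosen for clarity, and the paper's Hash Encoder and Hash Decoder are learned components that may work quite differently.

```python
import hashlib
import random

NUM_HASHES = 4       # assumed number of independent hash functions
CODEBOOK_SIZE = 256  # assumed number of IDs per hash function
DIM = 8              # toy latent width

def signature(token):
    """Return one ID per salted hash function: the token's 'license plate'."""
    ids = []
    for seed in range(NUM_HASHES):
        digest = hashlib.blake2b(
            token.encode("utf-8"),
            digest_size=8,
            salt=seed.to_bytes(8, "little"),
        ).digest()
        ids.append(int.from_bytes(digest, "little") % CODEBOOK_SIZE)
    return tuple(ids)

# Hypothetical toy vocabulary and the reverse map from signature to token.
vocab = [f"tok{i}" for i in range(1_000)]
sig_to_token = {}
for tok in vocab:
    sig_to_token.setdefault(signature(tok), tok)  # first token keeps the plate

def decode(sig):
    """Look a predicted signature back up to its token."""
    return sig_to_token[sig]

# One small table per hash function, randomly initialised as a placeholder
# for learned weights.
rng = random.Random(0)
tables = [
    [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
    for _ in range(NUM_HASHES)
]

def encode(sig):
    """Fold a signature into one vector by summing a row from each table
    (an assumed pooling rule, used here only for illustration)."""
    picked = [tables[k][idx] for k, idx in enumerate(sig)]
    return [sum(values) for values in zip(*picked)]

# Compare collisions using only the first hash against the full signature.
single_hash_collisions = len(vocab) - len({signature(t)[0] for t in vocab})
full_signature_collisions = len(vocab) - len(sig_to_token)

print("signature of tok0:", signature("tok0"))
print("round trip:", decode(signature("tok0")))
print("collisions with one hash:", single_hash_collisions)
print("collisions with full signature:", full_signature_collisions)
```

Using only the first ID, at least 744 of the 1,000 toy tokens must collide, because there are only 256 possible values. The full four-ID signature draws from about 4.3 billion combinations, so shared signatures become rare, which is the property the license-plate analogy describes.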

Put rough numbers on the lot (illustrative; the paper does not publish these exact dimensions). Take a 128,000-word vocabulary and a 768-dimensional model. The input embedding table alone is 128,000 × 768 ≈ 98 million parameters, and an untied output projection is the same size again, so for a model whose Transformer body is only tens of millions of parameters, the two word tables are bigger than the brain. Swap in signatures from, say, four independent hash functions over 256-entry codebooks, and those lookup tables shrink to under a million parameters (4 × 256 × 768 ≈ 0.8 million), while 256 × 256 × 256 × 256 ≈ 4.3 billion possible signatures leave ample room for the 128,000-word vocabulary. Under uniformly random hashing, a birthday-style estimate puts the expected number of colliding token pairs at about two: negligible next to a single hash, though not literally zero. In this illustrative setup, the representation stops scaling with the dictionary.
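The arithmetic above is easy to check directly. The sketch below uses the same illustrative dimensions from the text (not figures from the paper) and adds one thing the text does not spell out: a birthday-style estimate of expected full-signature collisions, which assumes the hash functions behave like independent uniform random draws.

```python
# Illustrative dimensions from the text; none are published paper settings.
vocab_size = 128_000
d_model = 768
num_hashes = 4
codebook_size = 256

# Standard tables: one row per token, input plus an untied output projection.
embedding_params = vocab_size * d_model
untied_tables_params = 2 * embedding_params

# Hashed tables: one small codebook per hash function, each row d_model wide.
hash_table_params = num_hashes * codebook_size * d_model

# Number of distinct signatures the hashes can express.
signature_space = codebook_size ** num_hashes

# Birthday-style estimate: expected number of token pairs that share a full
# signature, assuming independent uniform hashing (an idealisation).
expected_colliding_pairs = (
    vocab_size * (vocab_size - 1) / (2 * signature_space)
)

print(f"input embedding:        {embedding_params:,} parameters")
print(f"input + untied output:  {untied_tables_params:,} parameters")
print(f"hash lookup tables:     {hash_table_params:,} parameters")
print(f"possible signatures:    {signature_space:,}")
print(f"expected colliding pairs: {expected_colliding_pairs:.2f}")
```

The shrink factor on the input table is 98,304,000 / 786,432 = 125 in this setup. The collision estimate comes out just under two pairs across the whole vocabulary, so a real system would still need some way to handle or avoid the rare clash, for example by adding a hash function or enlarging the codebooks.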

| Way to represent a token | How | Parameters vs vocabulary | Collisions |
|---|---|---|---|
| Standard embedding matrix | one dedicated row per token | grows linearly with vocab | none |
| Single-hash trick (prior work) | fold many tokens onto one shared row | sub-linear | high (breaks generation) |
| MultiHashFormer signature (arXiv 2606.28057) | multi-ID signature from several independent hashes | sub-linear | ~negligible (reported) |
| Byte / character-level | tiny fixed alphabet, no word table | tiny | none, but longer sequences |

What makes this more than a compression trick is that the signature remains generative: it preserves the one-to-one token map a language model needs while cutting the representation free from vocabulary size. Earlier hashing saved parameters but destroyed that mapping; MultiHashFormer keeps it intact by drawing each signature from several independent hash functions. If it holds up beyond these early 100M-to-3B scales, it points at a Transformer whose footprint no longer balloons with the size of its dictionary — a real lever for multilingual and on-device models.
