The news. On June 18, 2026, Vishesh Tripathi and Abhay Kumar posted Grouped Query Experts (GQE) to arXiv. GQE is a mixture-of-experts layer built on top of grouped-query attention: within each GQA group, a per-token router selects k of the query heads to run while all key-value heads stay dense. On a fixed 30-billion-token budget at the 250-million-parameter scale, the paper reports that activating only half the query heads per token matches the all-active GQA baseline in downstream accuracy, while reducing the active query-head computation.

Picture a hospital triage desk. A patient walks in, and the triage nurse does not call in every specialist on the roster — she pages only the few whose expertise this patient actually needs. A sprained wrist gets the orthopedist; a routine check needs almost no one. Everyone she pages reads from the same shared records desk, which is always staffed, but the specialists themselves are summoned on demand. Most patients simply don't need the whole roster — and paging the whole roster for everyone is exactly the waste GQE is built to remove.

Standard attention is the opposite triage desk: it calls in every specialist for every patient. Each token runs through all of the attention heads, whether it's an easy token the model could almost guess or a hard one that genuinely needs the full panel. The paper's framing is blunt — dense attention "applies the same set of attention heads to every token regardless of token difficulty or information content," and that uniform activation wastes compute, a cost that climbs as sequences get longer and attention becomes the most expensive part of the model.

[Figure: three attention layouts side by side, each showing Q, K and V. Left, MHA (standard): 8 KV pairs, full size. Middle, MQA: 1 KV pair, 8× smaller. Right, GQA (modern): 2 KV pairs, 4× smaller.]

The records desk is already shared. Grouped-query attention — the layout in the diagram's rightmost panel — long ago stopped giving every query head its own key-value head; instead a group of query heads shares one KV head, which is what keeps the KV cache small. So the keys and values are not the repeated, per-token cost — the query heads are. That is the opening GQE walks through: if the expensive, repeated work lives in the query heads, then that is where a router should decide who actually shows up.
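The cache arithmetic behind the three panels can be sketched in a few lines. This is a rough illustration, not a measurement: the function name `kv_cache_bytes` and the shape values (head dimension 128, 4,096 tokens, 24 layers, 2 bytes per value) are assumptions chosen for the example, with only the 8 / 1 / 2 KV-head counts taken from the diagram.

```python
def kv_cache_bytes(n_kv_heads, head_dim, seq_len, n_layers, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Leading factor of 2: one cached tensor for keys, one for values.
    return 2 * n_kv_heads * head_dim * seq_len * n_layers * bytes_per_value

# Hypothetical model shape, for illustration only.
shape = dict(head_dim=128, seq_len=4096, n_layers=24)

mha = kv_cache_bytes(n_kv_heads=8, **shape)  # one KV head per query head
mqa = kv_cache_bytes(n_kv_heads=1, **shape)  # one KV head shared by all 8
gqa = kv_cache_bytes(n_kv_heads=2, **shape)  # two groups of 4 query heads

print(mha // mqa, mha // gqa)  # 8 4
```

The number of query heads never appears in the formula, which is why gating query heads leaves the cache size alone.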

Grouped Query Experts turns each GQA group's query heads into experts. A lightweight per-token router scores them and activates only the top k, leaving the rest idle for that token, while the shared key-value heads stay dense for everyone — exactly the always-staffed records desk. This is mixture-of-experts applied to attention's query projection rather than to the feed-forward network where MoE usually lives. Because only the query side is gated and the KV side never changes, GQE keeps every KV-cache benefit of GQA and shrinks only the active query-head compute — the cache footprint is identical.
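A minimal sketch of that routing step, for one token and one GQA group, might look like the following. Several things here are assumptions rather than details from the paper: the router is modeled as a plain linear scorer (`router_w`), the function names `route_group` and `gqe_group_attention` are hypothetical, and how the paper weights or combines the selected heads is not covered, so the sketch simply returns the outputs of the active heads.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_group(router_logits, k):
    """Return the indices of the k highest-scoring query heads in a group."""
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of heads")
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return sorted(ranked[:k])

def gqe_group_attention(token_x, router_w, q_projs, keys, values, k):
    """Attention for one token in one group, running only the routed heads.

    router_w: one scoring vector per query head (assumed linear router).
    q_projs:  one [head_dim x d_model] matrix per query head.
    keys, values: the group's shared, always-dense KV cache.
    """
    logits = [dot(w, token_x) for w in router_w]
    active = route_group(logits, k)
    outputs = {}
    for h in active:  # idle heads are never projected or scored
        q = [dot(row, token_x) for row in q_projs[h]]
        scale = math.sqrt(len(q))
        weights = softmax([dot(q, key) / scale for key in keys])
        outputs[h] = [sum(w * v[d] for w, v in zip(weights, values))
                      for d in range(len(values[0]))]
    return outputs

# Toy example: 4 query heads in the group, keep the top 2.
token_x = [1.0, 0.0]
router_w = [[0.1, 0.0], [0.9, 0.0], [0.5, 0.0], [-0.3, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
q_projs = [identity] * 4
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

out = gqe_group_attention(token_x, router_w, q_projs, keys, values, k=2)
print(sorted(out))  # [1, 2]
```

The keys and values are read in full by every active head and never modified, which mirrors the claim that the KV side is unchanged. A trained router would need a differentiable selection scheme; this sketch covers inference-time selection only.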

Walk the numbers on one group. Say a GQA group has 8 query heads sharing 1 key-value head (illustrative head counts; the paper reports fractions, not a fixed group size). Dense GQA runs all 8 query projections and computes 8 sets of attention scores for every token. GQE's router keeps half, so for that token it computes 4 query heads instead of 8: a 2× cut in the query-head share of attention work, while the 1 shared KV head, and therefore the KV cache, is untouched. The paper's headline result is that this halving does not cost accuracy: on a 250-million-parameter model trained on 30 billion tokens, the half-active model matched the downstream accuracy of the all-active baseline.
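The same walk-through can be written as a rough operation count. This is a back-of-the-envelope sketch under stated assumptions, not the paper's accounting: the function name and the dimensions (`d_model` 1024, `head_dim` 128, 2,048 cached tokens) are hypothetical, and the count ignores the output projection, the softmax, and the router's own cost.

```python
def per_token_attention_flops(active_q_heads, n_kv_heads, d_model, head_dim, seq_len):
    """Rough FLOP count (2 per multiply-add) for one token in one GQA group."""
    q_proj = 2 * d_model * head_dim   # project the token into one query head
    scores = 2 * seq_len * head_dim   # that query against every cached key
    mix = 2 * seq_len * head_dim      # weighted sum over cached values
    query_side = active_q_heads * (q_proj + scores + mix)
    # K and V projections for the new token: independent of query-head count.
    kv_side = n_kv_heads * 2 * (2 * d_model * head_dim)
    return query_side, kv_side

# Illustrative dimensions only.
dims = dict(n_kv_heads=1, d_model=1024, head_dim=128, seq_len=2048)
dense_q, dense_kv = per_token_attention_flops(active_q_heads=8, **dims)
gqe_q, gqe_kv = per_token_attention_flops(active_q_heads=4, **dims)

query_ratio = dense_q / gqe_q
total_ratio = (dense_q + dense_kv) / (gqe_q + gqe_kv)
print(f"query-side: {query_ratio:.2f}x, whole group: {total_ratio:.2f}x")
# query-side: 2.00x, whole group: 1.91x
```

The query-side work halves exactly, while the saving for the whole group is slightly under 2× because the key and value projections are still computed for every token.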

| Attention layout | Query heads run per token | KV heads | What it saves |
| --- | --- | --- | --- |
| Multi-head attention (MHA) | all, each with its own KV head | one per query head | nothing (baseline) |
| Grouped-query attention (GQA) | all, sharing KV heads in groups | ~4–8× fewer (group size varies by model) | KV cache memory |
| Grouped Query Experts (GQE) | only the router's top-k per token [paper] | all dense (unchanged) | active query-head compute, KV cache untouched |

The honest caveats. The result so far is at a single small scale — 250M parameters on 30B tokens — so "matches the baseline" is a promising signal, not yet a guarantee it holds at frontier size. A per-token router also adds its own small cost and, like every MoE, has to learn to route well rather than collapse onto a favorite few heads. But the conceptual move is the durable part: once you accept that an easy token doesn't need every attention head, the question stops being "how many heads should the model have" and becomes "which heads should fire for this token" — and, because only the query side is gated, attention gets a sparsity dial that costs the KV cache nothing.
