What is Grouped Query Experts (GQE)?

Grouped Query Experts (GQE) is a mixture-of-experts layer built on top of grouped-query attention, from Vishesh Tripathi and Abhay Kumar (arXiv 2606.20945, June 2026). Within each GQA group, a small per-token router selects k of the query heads to activate while all key-value heads remain dense and unchanged. It is mixture-of-experts applied to attention's query projection rather than to the feed-forward network. On a 250-million-parameter model trained on 30 billion tokens, activating only half the query heads per token matched the accuracy of the all-active GQA baseline while reducing the active query-head computation.

Why does routing query heads matter?

Self-attention is often the most expensive part of a Transformer at long context, and standard attention runs every head on every token, including tokens that do not need all of them. That uniform activation wastes compute. GQE adds a sparsity dial to attention: a per-token router fires only k of the query heads, picking which heads each token uses rather than running them all. Crucially, because GQE only gates the query heads and leaves every key-value head dense, the KV cache footprint is unchanged, so the compute saving adds no memory cost.

How is GQE different from a normal mixture-of-experts layer?

A normal MoE layer puts the router on the feed-forward network: each token is sent to a few of many feed-forward experts. GQE moves that idea into the attention block, treating the query heads inside each grouped-query-attention group as the experts and routing each token to the top-k of them. The key-value heads are deliberately left out of the routing and stay dense for every token, so GQE keeps GQA's small, shared KV cache while still skipping the query-head work an individual token doesn't need.

Grouped Query Experts puts mixture-of-experts routing inside attention — Query-head expert routing

TL;DR

What is it: The Grouped Query Experts (GQE) paper (Vishesh Tripathi and Abhay Kumar, arXiv 2606.20945) puts mixture-of-experts routing on the query heads of grouped-query attention — a small per-token router activates only a few query heads instead of running them all.
Why it’s needed: Self-attention is often the most expensive part of a Transformer at long context, and standard attention runs every head on every token regardless of how much that token actually needs, so much of the head compute is wasted.
vs previous: Standard grouped-query attention keeps all query heads active for every token, and ordinary mixture-of-experts only sparsifies the feed-forward network; GQE makes the query heads themselves sparse — a subset per token — while every key-value head stays dense, so the KV cache stays the same size.

Jargon

Grouped-Query Attention (GQA): The modern attention layout where several query heads share one key-value head, instead of each query head owning its own. It is the standard trick for shrinking the KV cache, and the structure GQE builds directly on top of.
Query head: One of the parallel attention heads that produces a query — the "what am I looking for" vector for a token. The query heads are the part GQE switches on or off per token.
Key-value (KV) head: The head that produces the keys and values every query reads against. In GQA one KV head is shared by a whole group of query heads; GQE leaves all KV heads dense and always active.
Mixture of Experts (MoE): A layer that holds many sub-modules ("experts") but, for each token, a router activates only a few. Classic MoE does this to the feed-forward network; GQE does it to attention's query heads.
Router (top-k): The small learned gate that scores the candidate query heads for a token and picks the top k to activate. "k" is how many heads run; the rest stay idle for that token.
KV cache: The stored keys and values for every token already processed — the dominant memory cost of inference. Because GQE only gates query heads and never touches KV heads, the cache footprint is unchanged.
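
To make the "Router (top-k)" entry concrete, here is a minimal sketch of the selection step for one token in one GQA group. The function name `route_top_k`, the score values, and the tie-breaking rule are assumptions for illustration; the source does not say how the paper's router computes or breaks ties on its scores.

```python
def route_top_k(head_scores, k):
    """Return a 0/1 mask that keeps the k highest-scoring query heads.

    head_scores: one router score per query head in a single GQA group,
    for a single token. Ties go to the lower head index (an assumption).
    """
    if not 0 < k <= len(head_scores):
        raise ValueError("k must be between 1 and the number of heads")
    # Rank head indices by score, highest first. sorted() is stable, so
    # heads with equal scores keep their original index order.
    ranked = sorted(range(len(head_scores)), key=lambda h: -head_scores[h])
    active = set(ranked[:k])
    return [1 if h in active else 0 for h in range(len(head_scores))]


# Hypothetical router scores for one token over a group of 8 query heads.
scores = [0.1, 2.3, -0.4, 1.7, 0.0, 0.9, -1.2, 1.1]
mask = route_top_k(scores, k=4)
print(mask)  # [0, 1, 0, 1, 0, 1, 0, 1]
```

Heads with a 1 in the mask run for this token; the others are skipped entirely.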

The news. On June 18, 2026, Vishesh Tripathi and Abhay Kumar posted Grouped Query Experts (GQE) to arXiv. GQE is a mixture-of-experts layer built on top of grouped-query attention: within each GQA group, a per-token router selects k of the query heads to run while all key-value heads stay dense. On a fixed 30-billion-token budget at the 250-million-parameter scale, the paper reports that activating only half the query heads per token matches the all-active GQA baseline in downstream accuracy, while reducing the active query-head computation.

Picture a hospital triage desk. A patient walks in, and the triage nurse does not call in every specialist on the roster — she pages only the few whose expertise this patient actually needs. A sprained wrist gets the orthopedist; a routine check needs almost no one. Everyone she pages reads from the same shared records desk, which is always staffed, but the specialists themselves are summoned on demand. Most patients simply don't need the whole roster — and paging the whole roster for everyone is exactly the waste GQE is built to remove.

Standard attention is the opposite triage desk: it calls in every specialist for every patient. Each token runs through all of the attention heads, whether it's an easy token the model could almost guess or a hard one that genuinely needs the full panel. The paper's framing is blunt — dense attention "applies the same set of attention heads to every token regardless of token difficulty or information content," and that uniform activation wastes compute, a cost that climbs as sequences get longer and attention becomes the most expensive part of the model.

The records desk is already shared. Grouped-query attention — the layout in the diagram's rightmost panel — long ago stopped giving every query head its own key-value head; instead a group of query heads shares one KV head, which is what keeps the KV cache small. So the keys and values are not the repeated, per-token cost — the query heads are. That is the opening GQE walks through: if the expensive, repeated work lives in the query heads, then that is where a router should decide who actually shows up.

Grouped Query Experts turns each GQA group's query heads into experts. A lightweight per-token router scores them and activates only the top k, leaving the rest idle for that token, while the shared key-value heads stay dense for everyone — exactly the always-staffed records desk. This is mixture-of-experts applied to attention's query projection rather than to the feed-forward network where MoE usually lives. Because only the query side is gated and the KV side never changes, GQE keeps every KV-cache benefit of GQA and shrinks only the active query-head compute — the cache footprint is identical.
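
The mechanism described above can be sketched for a single token and a single GQA group in plain Python. This is an illustration under stated assumptions, not the paper's implementation: the linear router `w_router`, the hard 0/1 selection (no scaling of active heads by router probabilities), the zero output for inactive heads, and all tensor shapes are hypothetical choices made for the example.

```python
import math


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gqe_group_forward(x, w_router, w_query, keys, values, k):
    """Run one token through one GQA group with top-k query-head routing.

    x        : token hidden state, length d_model
    w_router : [n_heads][d_model] weights of an assumed linear router
    w_query  : [n_heads][d_head][d_model] per-head query projections
    keys     : [n_ctx][d_head] the group's shared, dense key cache
    values   : [n_ctx][d_head] the group's shared, dense value cache
    Returns (active_heads, outputs); inactive heads output zeros.
    """
    n_heads = len(w_query)
    d_head = len(keys[0])
    scores = matvec(w_router, x)  # one router score per query head
    ranked = sorted(range(n_heads), key=lambda h: -scores[h])
    active = sorted(ranked[:k])
    outputs = [[0.0] * d_head for _ in range(n_heads)]
    for h in active:  # skipped heads do no projection and no attention
        q = matvec(w_query[h], x)
        logits = [
            sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d_head)
            for key in keys
        ]
        weights = softmax(logits)
        outputs[h] = [
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(d_head)
        ]
    return active, outputs


# Tiny hand-made example: 4 query heads, 1 shared KV head, 2 cached tokens.
x = [1.0, 0.0]
w_router = [[0.5, 0.0], [2.0, 0.0], [-1.0, 0.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
w_query = [identity for _ in range(4)]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

active, outputs = gqe_group_forward(x, w_router, w_query, keys, values, k=2)
print(active)  # [1, 3]
```

Note that `keys` and `values` are read as-is by every active head and are never indexed by the router, which is why the cache layout is the same as in plain GQA.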

Walk the numbers on one group. Say a GQA group has 8 query heads sharing 1 key-value head (illustrative head counts; the paper reports fractions, not a fixed group size). Dense GQA runs all 8 query projections and 8 sets of attention scores for every token. GQE's router keeps half, so for that token it computes 4 query heads instead of 8, a 2× cut on the query-head share of attention work, while the 1 shared KV head, and therefore the KV cache, is untouched. The paper's headline result is that this halving does not cost accuracy: for a 250-million-parameter model trained on 30 billion tokens, the half-active model matched the accuracy of the all-active baseline.
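
The same arithmetic can be written as a rough operation count. The sketch below assumes illustrative dimensions (`d_model`, `d_head`, `n_ctx` are not from the paper) and counts only query-side multiply-accumulates; the router's own cost, the key-value projections, and the output projection are deliberately left out.

```python
def query_head_macs(n_heads, d_model, d_head, n_ctx):
    """Rough multiply-accumulate count for one token's query-side work.

    Per head: the query projection (d_model * d_head), the scores against
    n_ctx cached keys (n_ctx * d_head), and the weighted sum over n_ctx
    cached values (n_ctx * d_head).
    """
    per_head = d_model * d_head + 2 * n_ctx * d_head
    return n_heads * per_head


# Illustrative sizes only; none of these dimensions come from the paper.
d_model, d_head, n_ctx = 1024, 64, 4096
dense = query_head_macs(8, d_model, d_head, n_ctx)   # all 8 heads (GQA)
sparse = query_head_macs(4, d_model, d_head, n_ctx)  # top-4 of 8 (GQE)

print(dense, sparse, dense / sparse)  # 4718592 2359296 2.0
```

The ratio is exactly 2 regardless of the dimensions chosen, because every counted term scales with the number of heads that run.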

Attention layout	Query heads run per token	KV heads	What it saves
Multi-head attention (MHA)	all, each with its own KV head	one per query head	nothing — baseline
Grouped-query attention (GQA)	all, sharing KV heads in groups	~4–8× fewer (group size varies by model)	KV cache memory
Grouped Query Experts (GQE)	only the router's top-k per token [paper]	all dense (unchanged)	active query-head compute, KV cache untouched
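
The "KV heads" column of the table can be checked with a small calculation. This sketch assumes a hypothetical model with 32 query heads and a group size of 8; those numbers, the context length, and the head dimension are illustrative, not taken from the paper.

```python
def kv_cache_elements(n_kv_heads, n_ctx, d_head):
    """Stored key and value scalars for one layer (factor 2 = keys + values)."""
    return 2 * n_kv_heads * n_ctx * d_head


# Hypothetical configuration: 32 query heads, 8 query heads per KV head.
n_query_heads, group_size, n_ctx, d_head = 32, 8, 4096, 64

kv_heads = {
    "MHA": n_query_heads,                # one KV head per query head
    "GQA": n_query_heads // group_size,  # one KV head per group
    "GQE": n_query_heads // group_size,  # same KV heads as GQA, all dense
}
cache = {name: kv_cache_elements(h, n_ctx, d_head) for name, h in kv_heads.items()}

for name, size in cache.items():
    print(name, size)
# MHA 16777216
# GQA 2097152
# GQE 2097152
```

GQE's entry equals GQA's because routing only decides which query heads read the cache, not what is stored in it.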

The honest caveats. The result so far is at a single small scale — 250M parameters on 30B tokens — so "matches the baseline" is a promising signal, not yet a guarantee it holds at frontier size. A per-token router also adds its own small cost and, like every MoE, has to learn to route well rather than collapse onto a favorite few heads. But the conceptual move is the durable part: once you accept that an easy token doesn't need every attention head, the question stops being "how many heads should the model have" and becomes "which heads should fire for this token" — and, because only the query side is gated, attention gets a sparsity dial that costs the KV cache nothing.
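
One way to watch for the routing collapse mentioned above is to track how often each head is selected across a batch of tokens. The diagnostic below is an assumption made for illustration, not something the source attributes to the paper; `head_usage` and `max_imbalance` are hypothetical helper names, and the masks are hand-made.

```python
def head_usage(masks):
    """Fraction of tokens that activated each head.

    masks: one 0/1 list per token, all of the same length (n_heads).
    """
    n_tokens = len(masks)
    n_heads = len(masks[0])
    return [sum(mask[h] for mask in masks) / n_tokens for h in range(n_heads)]


def max_imbalance(usage, k):
    """Largest gap between a head's usage and the uniform target k / n_heads."""
    target = k / len(usage)
    return max(abs(u - target) for u in usage)


# Hand-made masks for 4 tokens, 4 heads, k = 2.
balanced = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
collapsed = [[1, 1, 0, 0]] * 4  # every token picks the same two heads

print(head_usage(balanced))   # [0.5, 0.5, 0.5, 0.5]
print(head_usage(collapsed))  # [1.0, 1.0, 0.0, 0.0]
print(max_imbalance(head_usage(collapsed), k=2))  # 0.5
```

A usage vector close to k / n_heads everywhere suggests the router spreads tokens across heads; values near 0 and 1 suggest it has settled on a fixed subset.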


Related explainers

Chiaroscuro — spectral-entropy token routing — also cuts attention FLOPs by routing, but it routes which tokens get full attention; GQE routes which query heads fire per token.
SoftMoE — differentiable routing — the same mixture-of-experts routing idea in its usual home, the feed-forward experts, rather than inside attention.
HydraHead — per-head attention hybrid — another "treat the heads individually" idea, mixing head types instead of gating them on and off.

