What is a variable-width (hourglass) transformer?

It is a transformer whose hidden width changes by depth instead of staying constant. The paper gives the stack an x-shaped (hourglass) profile — wider early and late layers, a narrower middle — and joins layers of different widths with a parameter-free residual resizing step that adds no weights. Under a loss-matched comparison it cuts about 22% of FLOPs and 15% of KV-cache memory and IO versus a uniform-width baseline, validated from 200M to 2B dense models and a 3B mixture-of-experts model.

Why does shrinking the middle layers save so much compute?

Because a layer's cost is dominated by its feed-forward network, whose FLOPs grow with the square of the hidden width. Narrowing a middle layer therefore cuts its compute roughly quadratically — halving the width drops the FFN cost to about a quarter. The KV-cache those layers hold shrinks only linearly with width, so the FLOP savings (~22%) are larger than the memory savings (~15%). The paper finds the useful width budget is not spread evenly across depth, so the outer layers stay wide while the middle thins.

What is parameter-free residual resizing, and why does it matter?

It is how the paper joins a wide layer to a narrow one without adding any learned weights — a fixed, parameter-free operation reshapes the residual stream between layers of different width. It matters because if the narrow middle were paid for by adding parameters elsewhere, the FLOP cut would be fake. Because the resizing is free, the savings hold up under a fair, loss-matched comparison where both models reach the same validation loss.

Variable-width transformers cut FLOPs 22% — Hourglass layer width

TL;DR

What is it: The news is a paper on Variable-Width Transformers — an architecture that drops the rule that every layer must be the same width and instead gives the stack an hourglass (x-shaped) profile: wider outer layers, a narrow middle.
Why it’s needed: A layer's width sets almost everything expensive about it — the feed-forward compute grows with the width squared, and the KV-cache grows with it linearly. Shrinking the middle layers cuts 22% of FLOPs and 15% of KV-cache memory and IO at the same loss.
vs previous: Standard transformers use a constant hidden width across every layer — a knob most architectures leave fixed. This paper varies it by depth and finds the middle layers can be much thinner at matched loss, joined to the wide layers by a parameter-free residual resizing step that adds no weights.

Jargon

Hidden width (d_model): The length of the vector each token carries as it flows through the layers — the dimension of the residual stream. Bigger width means more capacity per token, but more compute and memory at every layer. In a normal transformer it is the same at every layer; this paper lets it change by depth.
FLOPs: Floating-point operations — the raw count of multiply-adds a model does. It is the standard currency for "how much compute." The feed-forward block dominates it, and its FLOPs grow with the square of the hidden width.
FFN (feed-forward network): The per-token two-layer MLP inside every transformer block. It expands the width by ~4×, applies a nonlinearity, and projects back — so its cost is ≈ width². Full explainer →
KV-cache: The stored keys and values from earlier tokens that let a model decode without recomputing the whole sequence. Its size grows linearly with hidden width, so a thinner layer caches less per token. Full explainer →
Loss-matched (iso-loss) scaling: Comparing two designs at the same validation loss rather than the same parameter count. It is the honest way to claim a FLOP cut: the hourglass model is held to the same quality, then its compute is measured against a uniform-width baseline at that quality.
Parameter-free residual resizing: The paper's trick for joining a wide layer to a narrow one without adding weights. When the residual stream changes width between layers, a fixed, parameter-free operation resizes it — no new learned parameters, so the savings are real and not paid back elsewhere.
MoE (mixture of experts): A model where each token is routed to only a few of many feed-forward "experts." The hourglass result holds for a 3B MoE model too, not just dense ones — the width knob is orthogonal to expert routing.

The news. On June 16, 2026, researchers posted Variable-Width Transformers, which challenges an assumption baked into nearly every transformer: that the hidden width is the same at every layer. They give the stack an x-shaped (hourglass) profile — wider early and late layers, a narrower middle — joined by a parameter-free residual resizing step. Under loss-matched scaling the design cuts 22% of FLOPs and 15% of KV-cache memory and IO versus uniform-width baselines, and the result holds from 200M to 2B dense models and a 3B MoE.

Picture an hourglass standing on a desk. It is wide at the top, pinches to a narrow waist in the middle, and flares wide again at the bottom. Now read it top-to-bottom as a transformer: the top slice is the first layer, the bottom slice is the last, and the glass between them is the stack of layers in order. In a normal transformer that glass would be a plain cylinder — the same width all the way down, because every layer is built to the same hidden dimension. The new paper asks a question most standard architectures do not: what if the glass were allowed to pinch in the middle?

The reason the pinch is nearly free is that the cost of a layer is dominated by its width. Inside each layer, the feed-forward network expands the token vector by about 4×, runs a nonlinearity, and projects it back — two big matrix multiplies whose cost grows with the square of the hidden width. Halve the width of a middle layer and its FFN compute drops to roughly a quarter. The KV-cache those layers hold shrinks too, but only linearly with width — so a thinner middle saves a lot of FLOPs and a smaller, but still real, slice of memory. The outer layers stay wide while the middle thins — the paper's finding is that the useful width budget is not spread evenly across depth (a natural reading is that the layers that ingest the input and form the output have the most use for width).

The mechanism that makes this work without cheating is the parameter-free residual resizing. A transformer carries a running vector — the residual stream — from layer to layer, and normally every layer expects the same width, so they all just add into it. If layer 9 is narrower than layer 8, something has to reconcile the two widths. The paper does it with a fixed, parameter-free resizing of the residual stream that introduces no new learned weights. That detail is load-bearing: if you bought the narrow middle back by adding parameters elsewhere, the FLOP cut would be an illusion. Because the resizing is free, the savings survive a fair, loss-matched comparison.

Where the 22% actually comes from

Hold the depth fixed at 24 layers and treat the FFN as the dominant cost. A uniform model runs every layer at width d = 2048; since per-layer feed-forward work scales with d², write each layer's cost as 2048² ≈ 4.2 (in units of a million multiply-adds per token). Now carve the hourglass: keep the outer 8 layers at 2048 but taper the inner 16 to an average width of about 1700. Their per-layer cost falls to 1700² ≈ 2.9 — roughly 0.7× — because cost follows the square of width. Total compute: uniform = 24 × 4.2 ≈ 100.7; hourglass = (8 × 4.2) + (16 × 2.9) ≈ 33.6 + 46.2 ≈ 79.8 — a ~21% FLOP cut, right in the neighborhood of the paper's reported 22% (illustrative numbers chosen to match the paper's directional finding, not specific reported values). The KV-cache scales with width to the first power, not the square, so the same taper trims it less — the inner layers go from 2048 to 1700, about 0.83×, which lands near the paper's reported ~15% once averaged over the wide outer layers.

Layer profile	FFN cost scaling	What changes	Reported result
Uniform width (baseline)	Θ(d²) per layer, constant d	Every layer the same width — the textbook transformer	Reference (0%)
Hourglass / x-shaped (VWT)	Θ(d²) per layer, d varies by depth	Wide outer layers, narrow middle; residual resized for free	~22% fewer FLOPs, ~15% less KV-cache ~per the paper (loss-matched)
Validated scale	—	Dense + MoE, not a single size	200M–2B dense, 3B MoE ~per the paper

The deeper point is that width allocation across depth is a tunable knob, not a constant — and the paper found it had been sitting at a wasteful default. That fits a thread running through several recent results: SigmaScale's learned SVD scaling treats another fixed architectural choice as something to learn, and Tangram's per-head KV budgets attacks the same KV-cache cost from a different angle — varying the budget per attention head instead of the width per layer. The catch the authors are careful about: these are loss-matched gains, so the claim is "same quality, less compute," not "more quality." Whether the hourglass shape stays optimal at 70B+ scale is the open question — the paper validates up to a 3B MoE, and the roofline economics it leans on (FLOPs falling faster than memory) are exactly the kind of thing that can shift as models grow.

Goes deeper in: LLM Internals → Transformer Block → The Feed-Forward Network

Related explainers

SigmaScale — learned SVD scaling — another "a fixed architectural choice is actually a learnable knob" result
Tangram — per-head KV budgets — cuts the same KV-cache cost by varying budget per head instead of width per layer
MSSP vs μP — MoE scaling — the scaling-law lens on how width and depth should be allocated

Continue in trackLLM Internals: The Feed-Forward Network

Frequently Asked Questions

Check what you knowMap your AI & GPU knowledge across every track — free, role-based