The news. On June 16, 2026, researchers posted Variable-Width Transformers, which challenges an assumption baked into nearly every transformer: that the hidden width is the same at every layer. They give the stack an x-shaped (hourglass) profile — wider early and late layers, a narrower middle — joined by a parameter-free residual resizing step. Under loss-matched scaling the design cuts 22% of FLOPs and 15% of KV-cache memory and IO versus uniform-width baselines, and the result holds from 200M to 2B dense models and a 3B MoE.
Picture an hourglass standing on a desk. It is wide at the top, pinches to a narrow waist in the middle, and flares wide again at the bottom. Now read it top-to-bottom as a transformer: the top slice is the first layer, the bottom slice is the last, and the glass between them is the stack of layers in order. In a normal transformer that glass would be a plain cylinder, the same width all the way down, because every layer is built to the same hidden dimension. The new paper asks a question that standard architectures leave unasked: what if the glass were allowed to pinch in the middle?
The pinch saves so much because the cost of a layer is dominated by its width. Inside each layer, the feed-forward network expands the token vector by about 4×, runs a nonlinearity, and projects it back: two big matrix multiplies whose cost grows with the square of the hidden width. Halve the width of a middle layer and its FFN compute drops to roughly a quarter. The KV-cache those layers hold shrinks too, but only linearly with width, so a thinner middle saves a lot of FLOPs and a smaller, but still real, slice of memory. The outer layers stay wide while the middle thins. The paper's finding is that the useful width budget is not spread evenly across depth; a natural reading is that the layers that ingest the input and form the output have the most use for width.
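The square-versus-linear scaling can be checked in a few lines of Python. This is a minimal sketch under stated assumptions: the FFN is two matrix multiplies with a 4× expansion, the KV-cache stores one key and one value vector of full hidden width per token per layer (models using grouped-query attention store less), and the function names are illustrative rather than taken from the paper.

```python
def ffn_macs_per_token(d_model: int, expansion: int = 4) -> int:
    """Multiply-adds for one token through a two-matmul FFN (up- and down-projection)."""
    d_hidden = expansion * d_model
    # Up-projection costs d_model * d_hidden, down-projection costs d_hidden * d_model.
    return d_model * d_hidden + d_hidden * d_model


def kv_values_per_token(d_model: int) -> int:
    """Cached numbers per token per layer, assuming K and V are each d_model wide."""
    return 2 * d_model


full, half = 2048, 1024
ffn_ratio = ffn_macs_per_token(half) / ffn_macs_per_token(full)
kv_ratio = kv_values_per_token(half) / kv_values_per_token(full)

print(f"FFN compute at half width: {ffn_ratio:.2f}x")  # quadratic in width
print(f"KV-cache at half width:    {kv_ratio:.2f}x")   # linear in width
```

Under these assumptions, halving the width leaves a quarter of the FFN compute but half of the KV-cache, which is why the FLOP saving outpaces the memory saving.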
The mechanism that makes this work without cheating is the parameter-free residual resizing. A transformer carries a running vector — the residual stream — from layer to layer, and normally every layer expects the same width, so they all just add into it. If layer 9 is narrower than layer 8, something has to reconcile the two widths. The paper does it with a fixed, parameter-free resizing of the residual stream that introduces no new learned weights. That detail is load-bearing: if you bought the narrow middle back by adding parameters elsewhere, the FLOP cut would be an illusion. Because the resizing is free, the savings survive a fair, loss-matched comparison.
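To make "parameter-free resizing" concrete, here is a toy sketch of one way a residual stream could change width with no learned weights. The rule used here (truncate when narrowing, zero-pad when widening) is an assumption chosen for illustration; this explainer does not specify the paper's actual operator, and `resize_residual`, `apply_layer`, and `toy_layer` are hypothetical names.

```python
def resize_residual(x: list[float], new_width: int) -> list[float]:
    """Resize a residual vector without any learned weights.

    Hypothetical rule: drop trailing channels to narrow, append zeros to widen.
    """
    if new_width <= len(x):
        return x[:new_width]
    return x + [0.0] * (new_width - len(x))


def apply_layer(x: list[float], layer_fn, layer_width: int) -> list[float]:
    """Resize the stream to the layer's width, then add the layer's update."""
    x = resize_residual(x, layer_width)
    update = layer_fn(x)
    return [a + b for a, b in zip(x, update)]


def toy_layer(x: list[float]) -> list[float]:
    """Stand-in for attention + FFN: returns an update of the same width."""
    return [0.5 * v for v in x]


# Wide -> narrow -> wide, like a three-layer hourglass.
stream = [1.0] * 8
for width in [8, 4, 8]:
    stream = apply_layer(stream, toy_layer, width)
    print(width, len(stream))
```

The point of the sketch is only that the reconciliation step contributes zero parameters, so any compute saved in the narrow layers is not bought back elsewhere.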
Where the 22% actually comes from
Hold the depth fixed at 24 layers and treat the FFN as the dominant cost. A uniform model runs every layer at width d = 2048; since per-layer feed-forward work scales with d², write each layer's cost as 2048² ≈ 4.19 (in millions, a unit proportional to multiply-adds per token). Now carve the hourglass: keep the outer 8 layers at 2048 but taper the inner 16 to an average width of about 1700. Their per-layer cost falls to 1700² ≈ 2.89, roughly 0.69×, because cost follows the square of width. Total compute: uniform = 24 × 4.19 ≈ 100.7; hourglass = (8 × 4.19) + (16 × 2.89) ≈ 33.6 + 46.2 ≈ 79.8, a ~21% FLOP cut, right in the neighborhood of the paper's reported 22% (illustrative numbers chosen to match the paper's directional finding, not specific reported values). The KV-cache scales with width to the first power, not the square, so the same taper trims it less: the inner layers go from 2048 to 1700, about 0.83×, which averages out to roughly an 11% cut once the wide outer layers are included, smaller than the FLOP saving and in the same range as the paper's reported ~15%.
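The arithmetic above can be reproduced with a short script. It is a sketch under the same illustrative assumptions: 24 layers, FFN cost proportional to width squared, KV-cache proportional to width, and a flat inner width of 1700 standing in for the taper. Splitting the 8 wide layers as 4 at each end is an additional assumption, and `relative_cost` is a hypothetical helper.

```python
def relative_cost(widths: list[int], exponent: int) -> int:
    """Sum width**exponent over layers (exponent 2 ~ FFN compute, 1 ~ KV-cache)."""
    return sum(w ** exponent for w in widths)


uniform = [2048] * 24
# Assumed split: 4 wide layers at each end, 16 narrower layers in the middle.
hourglass = [2048] * 4 + [1700] * 16 + [2048] * 4

flop_cut = 1 - relative_cost(hourglass, 2) / relative_cost(uniform, 2)
kv_cut = 1 - relative_cost(hourglass, 1) / relative_cost(uniform, 1)

print(f"FLOP cut:     {flop_cut:.1%}")
print(f"KV-cache cut: {kv_cut:.1%}")
```

Changing the inner width or the number of narrowed layers in `hourglass` shows how sensitive both savings are to the shape of the profile.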
| Layer profile | FFN cost scaling | What changes | Reported result |
|---|---|---|---|
| Uniform width (baseline) | Θ(d²) per layer, constant d | Every layer the same width — the textbook transformer | Reference (0%) |
| Hourglass / x-shaped (VWT) | Θ(d²) per layer, d varies by depth | Wide outer layers, narrow middle; residual resized for free | ~22% fewer FLOPs, ~15% less KV-cache (per the paper, loss-matched) |
| Validated scale | n/a | Dense + MoE, not a single size | 200M–2B dense, 3B MoE (per the paper) |
The deeper point is that width allocation across depth is a tunable knob, not a constant, and the paper found it had been sitting at a wasteful default. That fits a thread running through several recent results: SigmaScale's learned SVD scaling treats another fixed architectural choice as something to learn, and Tangram's per-head KV budgets attack the same KV-cache cost from a different angle, varying the budget per attention head instead of the width per layer. The catch the authors are careful about: these are loss-matched gains, so the claim is "same quality, less compute," not "more quality." Whether the hourglass shape stays optimal at 70B+ scale is the open question: the paper validates up to a 3B MoE, and the roofline economics it leans on (FLOPs falling faster than memory) are exactly the kind of thing that can shift as models grow.
Related explainers
- SigmaScale — learned SVD scaling — another "a fixed architectural choice is actually a learnable knob" result
- Tangram — per-head KV budgets — cuts the same KV-cache cost by varying budget per head instead of width per layer
- MSSP vs μP — MoE scaling — the scaling-law lens on how width and depth should be allocated