Gated DeltaNet — MatMul-only triangular inverse
The news. On June 4, 2026, a paper (arXiv 2606.06034) targeted the matrix-inversion bottleneck inside chunk-wise parallel linear attention (Gated DeltaNet), where the per-chunk inverse parallelizes poorly on GPU/NPU hardware. It approximates the inverse of (I − L), where L is a strictly-lower-triangular matrix, with a truncated Neumann expansion implemented as matrix-multiplies, adds structural masking to break the sequential dependency and a parallel residual correction for accuracy, and scales the approximation order with chunk size. Because the operation becomes a MatMul, it extends to low-bit integer quantization. Reported: a ~5x kernel-level speedup, ~20% lower decode-layer overhead, and accuracy preserved in both floating-point and low-precision inference, tested on Qwen3.5-family models.
Picture a line of friends, each one saying "I'll pay you back the moment the person before me pays me." Nobody at the back can settle up until the chain has resolved all the way from the front — so the honest way to clear the books is strictly one at a time, front to back. That patient, in-order settling is forward substitution, and it is exactly how a triangular system of equations is solved by the book. The problem is that "one at a time" is the worst possible shape for a GPU, whose whole talent is doing thousands of multiply-adds simultaneously. The line of IOUs keeps the hardware standing around waiting.
Underneath the metaphor, that "line of friends" is a chunk of tokens inside linear attention. To run a chunk in parallel during training, Gated DeltaNet has to invert a matrix of the form (I − L), where L is strictly lower-triangular: each row depends only on the rows above it, just like each friend depends only on the friends ahead. Forward substitution respects that dependency by marching down the rows in order, which serializes the kernel and starves the matrix-multiply units. The paper's move is to notice a property of that triangular structure: a strictly-lower-triangular L is nilpotent. For a chunk of C tokens, L is C×C and L^C = 0, so the Neumann series stops on its own and the exact inverse is a finite sum, (I − L)⁻¹ = I + L + L² + … + L^(C−1). Truncate it to a few terms (the order scales with the chunk size), and every term is just a matrix-multiply you can fire off in parallel.
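The nilpotency argument can be checked numerically. The following is a minimal NumPy sketch, not the paper's kernel: the helper `neumann_inverse`, the chunk size of 8, and the random entries of `L` are all assumptions chosen for illustration.

```python
import numpy as np

def neumann_inverse(L, num_terms):
    """Sum I + L + ... + L^(num_terms - 1) using only matrix-multiplies."""
    n = L.shape[0]
    total = np.eye(n)
    power = np.eye(n)
    for _ in range(num_terms - 1):
        power = power @ L          # next power of L: one matmul
        total = total + power
    return total

rng = np.random.default_rng(0)
chunk = 8  # hypothetical small chunk size, kept small for readability
# Strictly lower-triangular: k=-1 zeroes the diagonal and everything above it.
L = np.tril(0.3 * rng.standard_normal((chunk, chunk)), k=-1)

exact = np.linalg.inv(np.eye(chunk) - L)

# Nilpotency: the chunk-th power of a strictly lower-triangular matrix is zero.
nilpotent = np.allclose(np.linalg.matrix_power(L, chunk), 0.0)

# With all `chunk` terms the series reproduces the inverse up to rounding.
err_full = np.max(np.abs(neumann_inverse(L, chunk) - exact))

# With only 4 terms it is an approximation with a visible error.
err_truncated = np.max(np.abs(neumann_inverse(L, 4) - exact))

print(nilpotent, err_full, err_truncated)
```

The full series matches `np.linalg.inv` to rounding error, while the 4-term version leaves a residual whose size depends on how large the entries of `L` are.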
That reframing, from "settle the chain in order" to "run a few all-at-once correction rounds", is the whole game. Each power of L is a GEMM, the operation tensor cores are built to run at high throughput, and structural masking plus a residual correction keep the truncated answer faithful. Crucially, once the kernel is only matrix-multiplies, it inherits everything those multiplies already support, including dropping the weights and activations down to low-bit integers. A sequential triangular solve has no natural INT4 form; a stack of GEMMs does.
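To make the "a GEMM has a natural integer form" point concrete, here is a rough sketch of one matrix-multiply carried out on integers. It assumes plain symmetric per-tensor quantization, which is a generic textbook scheme and not necessarily what the paper uses. The helper names `quantize_symmetric`, `quantized_matmul`, and `relative_error` are hypothetical.

```python
import numpy as np

def quantize_symmetric(x, bits):
    """Map a float array to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def quantized_matmul(a, b, bits):
    """Multiply in integer arithmetic, then rescale the accumulator to float."""
    qa, scale_a = quantize_symmetric(a, bits)
    qb, scale_b = quantize_symmetric(b, bits)
    acc = qa @ qb                         # integer multiply-accumulate
    return acc * (scale_a * scale_b)

def relative_error(approx, exact):
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)

rng = np.random.default_rng(0)
L = np.tril(rng.standard_normal((32, 32)), k=-1)  # illustrative strictly lower-triangular L

L_squared = L @ L  # one term of the series, in floating point
err_int8 = relative_error(quantized_matmul(L, L, bits=8), L_squared)
err_int4 = relative_error(quantized_matmul(L, L, bits=4), L_squared)

print(err_int8, err_int4)
```

Zeros quantize to zero, so the integer version of `L` keeps its strictly lower-triangular shape. The error grows as the bit-width shrinks, which is the gap the paper's accuracy measures would have to close.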
Here is where the win earns its keep. Take a chunk of 64 tokens. Forward substitution walks a dependency chain 64 steps long: 64 stages that cannot overlap, no matter how many cores you own. A Neumann series truncated to about 4 terms replaces that with 4 matrix-multiplies, so the critical path collapses from 64 to ~4 (illustrative). Each 64×64 multiply is something tensor cores finish in a handful of cycles, and they run concurrently rather than in lockstep. After the real-world overheads of masking and correction, the paper reports a kernel-level speedup of about 5x and roughly 20% lower decode-layer overhead, with accuracy held in both FP and low-precision runs on Qwen3.5-family models.
| How the inverse is computed | Shape of the work | Parallel? | Quantizes to INT? |
|---|---|---|---|
| Forward substitution | solve row by row, in order | no — sequential critical path ~chunk length | awkward — no natural low-bit form |
| Truncated Neumann series | a few matrix-multiplies (I + L + L² + …) | yes — terms run concurrently (~5x kernel, reported) | yes — it's a GEMM, so INT4 is on the table (reported) |
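The two rows of the table can be put side by side in code. This is an illustrative sketch under stated assumptions: the entries of `L` are drawn small (scale 0.01) so that a short series is accurate, the truncated series is evaluated in Horner form so that each step is one matrix-multiply, and the function names are hypothetical. It counts dependent steps and does not measure GPU time.

```python
import numpy as np

def inverse_by_forward_substitution(L):
    """Solve (I - L) X = I one row at a time."""
    n = L.shape[0]
    identity = np.eye(n)
    X = np.zeros((n, n))
    sequential_steps = 0
    for i in range(n):
        # Row i needs rows 0..i-1, so the rows cannot be computed out of order.
        X[i] = identity[i] + L[i, :i] @ X[:i]
        sequential_steps += 1
    return X, sequential_steps

def inverse_by_truncated_neumann(L, order):
    """Horner form of I + L + ... + L^order: `order` dependent matmuls."""
    identity = np.eye(L.shape[0])
    X = identity.copy()
    sequential_steps = 0
    for _ in range(order):
        X = identity + L @ X   # one full matmul, parallel across all rows
        sequential_steps += 1
    return X, sequential_steps

rng = np.random.default_rng(0)
chunk = 64
# Assumed small entries; real magnitudes depend on the model and its gating.
L = np.tril(0.01 * rng.standard_normal((chunk, chunk)), k=-1)

exact, steps_forward = inverse_by_forward_substitution(L)
approx, steps_neumann = inverse_by_truncated_neumann(L, order=4)
max_error = np.max(np.abs(approx - exact))

print(steps_forward, steps_neumann, max_error)
```

Forward substitution takes one dependent step per token in the chunk, while the truncated series takes one per retained power of `L`. The price is the approximation error printed last, which is small here only because `L` was chosen to be small.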
The honest caveat is that a truncated series is an approximation — drop too many terms and the inverse drifts — which is why the paper pairs it with a residual correction and scales the number of terms with chunk size rather than hard-coding it. But the trade is the kind hardware loves: spend a little extra arithmetic (a few full matrix-multiplies) to delete a long sequential dependency, and let the quantized tensor cores do what they are best at. It is the same "make it a GEMM" lever the GPU track teaches over and over — here applied to the one stubbornly sequential corner of a linear-attention layer.
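One way to see how a residual correction can recover accuracy is a single Newton-Schulz refinement step, sketched below. This is a standard technique swapped in for illustration. The source does not spell out the paper's own "parallel residual correction", so nothing here should be read as that method. The function names and the chunk size of 16 are assumptions.

```python
import numpy as np

def truncated_neumann(L, num_terms):
    """Sum I + L + ... + L^(num_terms - 1)."""
    n = L.shape[0]
    total = np.eye(n)
    power = np.eye(n)
    for _ in range(num_terms - 1):
        power = power @ L
        total = total + power
    return total

def residual_correction(A, X):
    """One Newton-Schulz step: X + X (I - A X). Costs two matmuls."""
    residual = np.eye(A.shape[0]) - A @ X   # how far A X is from the identity
    return X + X @ residual

rng = np.random.default_rng(0)
chunk = 16  # hypothetical chunk size
L = np.tril(0.2 * rng.standard_normal((chunk, chunk)), k=-1)
A = np.eye(chunk) - L
exact = np.linalg.inv(A)

coarse = truncated_neumann(L, 3)            # deliberately too few terms
corrected = residual_correction(A, coarse)

err_coarse = np.max(np.abs(coarse - exact))
err_corrected = np.max(np.abs(corrected - exact))

# For a Neumann start, the residual is exactly L^3, so one step doubles the terms.
matches_six_terms = np.allclose(corrected, truncated_neumann(L, 6))

print(err_coarse, err_corrected, matches_six_terms)
```

Because (I − L)(I + L + L²) = I − L³, the residual of a 3-term start is L³, and the corrected result equals the 6-term series. The correction is itself only matrix-multiplies, so it stays on the same hardware path as the series.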
Related explainers
- Gated DeltaNet-2 — decoupled erase/write gates — the gating half of the same layer: how Gated DeltaNet decides what state to keep versus overwrite
- Parallax — local-linear attention vs FlashAttention 2/3 — another route to fast linear attention, trading the global score matrix for local windows
- Gemma 4 QAT — quantization-aware training — the other side of the INT4 story: training weights so they survive the drop to low bit-width