Gated DeltaNet — MatMul-only triangular inverse

[Figure: Gated DeltaNet's chunk solve, from a sequential triangular inverse to a few parallel matrix-multiplies. Sequential panel: (I − L) · x = b, where each row below the diagonal waits on the rows above, 1 row at a time. Parallel GEMM panel: (I − L)⁻¹ = I + L + L² + L³ + ⋯, where each power is one matrix-multiply (the chain is finite, so a few terms suffice). Kernel wall-clock (illustrative), sequential vs parallel: ~5× faster kernel-level speedup, ~20% less decode-layer overhead, accuracy kept in FP and INT4 because it is now a MatMul. Figures reported on Qwen3.5-family; matrix and timings illustrative.]
learnaivisually.com/ai-explained/gated-deltanet-matmul-inverse

The news. On June 4, 2026, a paper (arXiv 2606.06034) targeted the matrix-inversion bottleneck inside chunk-wise parallel linear attention (Gated DeltaNet), where the per-chunk inverse parallelizes poorly on GPU/NPU hardware. It approximates the inverse of (I − L), where L is strictly lower-triangular, with a truncated Neumann expansion implemented as matrix-multiplies, adds structural masking to break the sequential dependency and a parallel residual correction for accuracy, and scales the approximation order with chunk size. Because the operation becomes a MatMul, it extends to low-bit integer quantization. Reported: a ~5x kernel-level speedup, ~20% lower decode-layer overhead, and accuracy preserved in both floating-point and low-precision inference, tested on Qwen3.5-family models.

Picture a line of friends, each one saying "I'll pay you back the moment the person before me pays me." Nobody at the back can settle up until the chain has resolved all the way from the front — so the honest way to clear the books is strictly one at a time, front to back. That patient, in-order settling is forward substitution, and it is exactly how a triangular system of equations is solved by the book. The problem is that "one at a time" is the worst possible shape for a GPU, whose whole talent is doing thousands of multiply-adds simultaneously. The line of IOUs keeps the hardware standing around waiting.
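The in-order settling can be written down in a few lines. This is a minimal sketch, assuming the system has the form (I − L) x = b with L strictly lower-triangular; the function name and the 3×3 values are made up for illustration and are not taken from the paper.

```python
def forward_substitution(L, b):
    """Solve (I - L) x = b for strictly lower-triangular L, one row at a time."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        # Row i reads every earlier x[j], so iteration i cannot start
        # until iterations 0..i-1 have finished.
        x[i] = b[i] + sum(L[i][j] * x[j] for j in range(i))
    return x


# Hypothetical 3-row system; the numbers are illustrative only.
L = [
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.25, 0.5, 0.0],
]
b = [1.0, 2.0, 3.0]
x = forward_substitution(L, b)
print(x)
```

The loop body is cheap; the cost is that the loop itself cannot be split across cores, because each iteration consumes the results of all earlier ones.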

Underneath the metaphor, that "line of friends" is a chunk of tokens inside linear attention. To run a chunk in parallel during training, Gated DeltaNet has to invert a matrix of the form (I − L), where L is strictly lower-triangular: each row depends only on the rows above it, just like each friend depends only on the friends ahead. Forward substitution respects that dependency by marching down the rows in order, which serializes the kernel and starves the matrix-multiply units. The paper's move is to notice a property of that triangular structure: a strictly-lower-triangular L is nilpotent, so for a chunk of C tokens every power from L^C onward is zero and the exact inverse is a finite sum, the Neumann series (I − L)⁻¹ = I + L + L² + ⋯ + L^(C−1). Truncate it to a few terms (the order scales with the chunk size), and every term comes from matrix-multiplies, which are parallel inside, instead of a row-by-row solve.
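The nilpotency argument is easy to check numerically. The sketch below uses plain Python lists and a made-up 4×4 matrix (the helper names `matmul`, `identity` and `neumann_sum` are illustrative, not from the paper). It confirms that the fourth power of a 4×4 strictly-lower-triangular matrix is zero, and that summing the first four terms of the series gives the exact inverse of (I − L).

```python
def matmul(A, B):
    """Plain dense product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def neumann_sum(L, terms):
    """Return I + L + L^2 + ... + L^(terms - 1)."""
    n = len(L)
    total = identity(n)
    power = identity(n)
    for _ in range(terms - 1):
        power = matmul(power, L)  # next power of L: one matrix-multiply
        total = [[t + p for t, p in zip(t_row, p_row)]
                 for t_row, p_row in zip(total, power)]
    return total


n = 4
# Hypothetical strictly-lower-triangular matrix; the values are illustrative only.
L = [[0.5 ** (i - j) if j < i else 0.0 for j in range(n)] for i in range(n)]
I_minus_L = [[(1.0 if i == j else 0.0) - L[i][j] for j in range(n)]
             for i in range(n)]

# L^n vanishes, so the series stops after n terms.
L_to_the_n = L
for _ in range(n - 1):
    L_to_the_n = matmul(L_to_the_n, L)

inverse = neumann_sum(L, n)
check = matmul(I_minus_L, inverse)  # should be the identity matrix
```

Calling `neumann_sum(L, 2)` instead gives the kind of truncated approximation the paragraph describes; how far it sits from the exact inverse depends on the size of the entries in L.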

That reframing, from "settle the chain in order" to "run a few all-at-once correction rounds", is the whole game. Each power of L is a GEMM, the operation tensor cores are built to execute at full throughput, and structural masking plus a residual correction keep the truncated answer faithful. Crucially, once the kernel is only matrix-multiplies, it inherits everything those multiplies already support, including running on operands quantized down to low-bit integers. A sequential triangular solve has no natural INT4 form; a stack of GEMMs does.
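To make the "a GEMM has an integer form" point concrete, here is a minimal sketch of a generic symmetric per-matrix quantization scheme: round each operand to signed 4-bit integers, multiply and accumulate in integers, then rescale once at the end. This is a textbook scheme chosen for illustration, not the paper's quantization method, and the function names and 2×2 values are assumptions.

```python
def quantize(M, bits=4):
    """Map a float matrix to signed integers in [-qmax, qmax] plus one scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    max_abs = max(abs(v) for row in M for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    Q = [[max(-qmax, min(qmax, round(v / scale))) for v in row] for row in M]
    return Q, scale


def quantized_matmul(A, B, bits=4):
    """Approximate A @ B using integer multiply-accumulate only."""
    QA, scale_a = quantize(A, bits)
    QB, scale_b = quantize(B, bits)
    n, m, p = len(QA), len(QB), len(QB[0])
    # Integer accumulation: no floating point inside the inner loop.
    acc = [[sum(QA[i][k] * QB[k][j] for k in range(m)) for j in range(p)]
           for i in range(n)]
    # A single rescale per output element recovers an approximate float result.
    approx = [[v * scale_a * scale_b for v in row] for row in acc]
    return approx, acc


# Hypothetical operands; the values are illustrative only.
A = [[0.9, -0.3], [0.2, 0.7]]
B = [[0.5, 0.1], [-0.3, 0.8]]
approx, acc = quantized_matmul(A, B)
exact = [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
max_err = max(abs(approx[i][j] - exact[i][j]) for i in range(2) for j in range(2))
```

Forward substitution offers no equally clean place to do this: each row's result feeds the next row, so rounding error introduced early is carried through every later step of the chain.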

[Figure: CUDA Core vs Tensor Core on the same 4×4 multiply. CUDA core: 1 result per cycle. Tensor core: 16 results per cycle, a 16× throughput gain for a 4×4 matrix multiply.]

Here is where the win earns its keep. Take a chunk of 64 tokens. Forward substitution walks a dependency chain 64 steps long: 64 stages that cannot overlap, no matter how many cores you own. A Neumann series truncated to about 4 terms replaces that with about 4 matrix-multiplies, so the critical path collapses from 64 to ~4 (illustrative). Each 64×64 multiply is a small dense GEMM whose output elements are all computed in parallel on tensor cores, rather than one row at a time. After the real-world overheads of masking and correction, the paper reports a kernel-level speedup of about 5x and roughly 20% lower decode-layer overhead, with accuracy held in both FP and low-precision runs on Qwen3.5-family models.

| How the inverse is computed | Shape of the work | Parallel? | Quantizes to INT? |
| --- | --- | --- | --- |
| Forward substitution | Solve row by row, in order | No: sequential critical path of roughly the chunk length | Awkward: no natural low-bit form |
| Truncated Neumann series | A few matrix-multiplies (I + L + L² + …) | Yes: terms run concurrently (~5x kernel, reported) | Yes: it's a GEMM, so INT4 is on the table (reported) |

The honest caveat is that a truncated series is an approximation: drop too many terms and the inverse drifts. That is why the paper pairs it with a residual correction and scales the number of terms with chunk size rather than hard-coding it. But the trade is the kind hardware loves: spend a little extra arithmetic (a few full matrix-multiplies) to delete a long sequential dependency, and let the tensor cores, on quantized operands where that applies, do what they are best at. It is the same "make it a GEMM" lever the GPU track teaches over and over, here applied to the one stubbornly sequential corner of a linear-attention layer.
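The drift, and one way a matmul-only correction can pull it back, can be seen on a small example. The sketch below truncates the series to two terms and then applies a generic residual step, X ← X + X·(I − A·X) with A = I − L. This is the classical Newton-Schulz refinement, used here as a stand-in; the paper's own residual correction is not described in this article and may differ. The 8×8 matrix and helper names are illustrative assumptions.

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def residual(A, X):
    """R = I - A X; all zeros exactly when X is the inverse of A."""
    return sub(identity(len(A)), matmul(A, X))


def refine(A, X):
    """One generic residual-correction step, built only from matrix-multiplies."""
    return add(X, matmul(X, residual(A, X)))


def max_abs(M):
    return max(abs(v) for row in M for v in row)


n = 8
# Hypothetical chunk of 8 rows; the values are illustrative only.
L = [[0.25 if j < i else 0.0 for j in range(n)] for i in range(n)]
A = sub(identity(n), L)

X0 = add(identity(n), L)   # series truncated to two terms: I + L
X1 = refine(A, X0)         # equals I + L + L^2 + L^3
X2 = refine(A, X1)         # reaches L^7, which is exact for n = 8

errors = [max_abs(residual(A, X)) for X in (X0, X1, X2)]
print(errors)
```

For this A, each refinement step doubles the number of series terms the approximation covers, so the residual shrinks from L² to L⁴ to L⁸, and L⁸ is zero for an 8×8 strictly-lower-triangular matrix.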
