What is the truncated-Neumann triangular inverse?

It's a way to invert the unit-lower-triangular matrix (I − L) inside Gated DeltaNet's chunk-wise linear attention using only matrix-multiplies. Because the strictly-lower-triangular part L is nilpotent, (I − L)⁻¹ equals the finite sum I + L + L² + L³ + …, so truncating to a few terms (each a GEMM) approximates the inverse while replacing the sequential forward-substitution solve with parallel work.

Why does turning the inverse into a MatMul matter?

Two reasons. First, GPUs and NPUs are built around matrix-multiply units; a sequential triangular solve leaves them idle, while a stack of GEMMs keeps them busy — the paper reports about a 5x kernel speedup and ~20% lower decode-layer overhead. Second, a MatMul has a natural low-bit form, so the kernel extends cleanly to INT4 inference, which a forward-substitution solve does not.

How does it relate to linear attention and Gated DeltaNet?

Gated DeltaNet is a gated linear-attention layer usually run in a chunk-wise parallel form for speed, and that form requires inverting a triangular matrix per chunk. This work keeps the architecture unchanged and only swaps how that inverse is computed — from sequential forward substitution to a truncated Neumann series of matrix-multiplies — so the speedup is a kernel-level change, not a new model.

MatMul-only matrix inversion makes quantized Gated DeltaNet 5x faster — Truncated-Neumann triangular inverse

learnaivisually.com/ai-explained/gated-deltanet-matmul-inverse

Jargon

Linear attention: An attention variant whose cost grows linearly with sequence length instead of quadratically, by keeping a running state instead of comparing every token to every other. Standard attention recomputes an all-pairs score matrix; linear attention folds the history into a fixed-size state.
Gated DeltaNet: A modern linear-attention layer that adds a learned gate (how much old state to keep vs. overwrite). It is usually run in a chunk-wise parallel form for training speed — which is exactly where the matrix inverse shows up.
Chunk-wise parallel form: A way to compute a recurrence over a block (chunk) of tokens at once with matrix ops, instead of stepping token by token. The catch: each chunk needs the inverse of a triangular matrix.
Strictly lower-triangular / nilpotent: A matrix whose nonzero entries all sit below the diagonal, which encodes a one-directional dependency. Such a matrix L is nilpotent: L^n = 0 once n reaches the matrix size, which is why the inverse of (I − L) is a finite sum.
Neumann series: The identity (I − L)⁻¹ = I + L + L² + L³ + …. For a nilpotent L the tail vanishes, so the full sum is exact and a truncated sum of a few powers is a close approximation whenever the dropped powers are small. Every term is a matrix-multiply.
GEMM / forward substitution: GEMM = general matrix-multiply, the operation tensor cores are built for. Forward substitution is the sequential alternative that solves a triangular system one row at a time — fast on paper, slow on parallel hardware.
Decode-layer overhead: The per-layer cost paid during token generation (the decode phase). The paper reports cutting it by about 20%.
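
The nilpotent and Neumann-series entries above can be written out as a short derivation. This is a sketch under two assumptions: n stands for the chunk size (so L is n×n), and S_k is shorthand introduced here for the k-term partial sum, not notation taken from the paper.

```latex
S_k = \sum_{j=0}^{k-1} L^{j},
\qquad
(I - L)\,S_k = \sum_{j=0}^{k-1} L^{j} - \sum_{j=1}^{k} L^{j} = I - L^{k}.

% L is strictly lower-triangular, so L^{n} = 0 and the full sum is exact:
(I - L)\,S_n = I - L^{n} = I
\quad\Longrightarrow\quad
(I - L)^{-1} = S_n .

% Truncating at k < n terms leaves exactly the dropped powers as error:
(I - L)^{-1} - S_k = \sum_{j=k}^{n-1} L^{j}.
```

The last line is why truncation is safe only when the dropped powers of L are small.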

The news. On June 4, 2026, a paper (arXiv 2606.06034) targeted the matrix-inversion bottleneck inside chunk-wise parallel linear attention (Gated DeltaNet), where the per-chunk inverse parallelizes poorly on GPU/NPU hardware. It approximates the inverse of (I − L), with L strictly lower-triangular, by a truncated Neumann expansion implemented as matrix-multiplies, adds structural masking to break the sequential dependency and a parallel residual correction for accuracy, and scales the approximation order with chunk size. Because the operation becomes a MatMul, it extends to low-bit integer quantization. Reported results: a ~5x kernel-level speedup, ~20% lower decode-layer overhead, and accuracy preserved in both floating-point and low-precision inference, tested on Qwen3.5-family models.

Picture a line of friends, each one saying "I'll pay you back the moment the person before me pays me." Nobody at the back can settle up until the chain has resolved all the way from the front — so the honest way to clear the books is strictly one at a time, front to back. That patient, in-order settling is forward substitution, and it is exactly how a triangular system of equations is solved by the book. The problem is that "one at a time" is the worst possible shape for a GPU, whose whole talent is doing thousands of multiply-adds simultaneously. The line of IOUs keeps the hardware standing around waiting.

Underneath the metaphor, that "line of friends" is a chunk of tokens inside linear attention. To run a chunk in parallel during training, Gated DeltaNet has to invert a matrix of the form (I − L), where L is strictly lower-triangular — each row depends only on the rows above it, just like each friend depends only on the friends ahead. Forward substitution respects that dependency by marching down the rows in order, which serializes the kernel and starves the matrix-multiply units. The paper's move is to notice a property of that triangular structure: a strictly-lower-triangular L is nilpotent, so its powers eventually hit zero and the exact inverse is a finite sum — the Neumann series (I − L)⁻¹ = I + L + L² + L³ + …. Truncate it to a few terms (the order scales with the chunk size), and every term is just a matrix-multiply you can fire off in parallel.
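
To make the contrast concrete, here is a minimal sketch in plain Python that inverts (I − L) both ways on a toy 4×4 matrix. The helper names (`neumann_inverse`, `forward_substitution_inverse`) and the matrix entries are made up for illustration; this is not the paper's kernel, only the underlying algebra with integer entries so the comparison is exact.

```python
# Sketch only: helper names and the toy matrix are illustrative assumptions.

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    # Dense matrix product; on a GPU this is the GEMM that tensor cores run.
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def neumann_inverse(L, terms):
    """Return I + L + ... + L^(terms - 1) using only multiplies and adds."""
    total = identity(len(L))
    power = identity(len(L))
    for _ in range(terms - 1):
        power = matmul(power, L)
        total = add(total, power)
    return total

def forward_substitution_inverse(L):
    """Solve (I - L) X = I one row at a time; row i waits on rows 0..i-1."""
    n = len(L)
    X = []
    for i in range(n):
        row = [1 if i == j else 0 for j in range(n)]
        for k in range(i):
            for j in range(n):
                row[j] += L[i][k] * X[k][j]
        X.append(row)
    return X

# Toy strictly lower-triangular matrix with integer entries (exact arithmetic).
L = [[0, 0, 0, 0],
     [2, 0, 0, 0],
     [1, 3, 0, 0],
     [0, 1, 2, 0]]

exact = forward_substitution_inverse(L)
full_series = neumann_inverse(L, terms=4)  # L^4 = 0, so 4 terms are exact here
truncated = neumann_inverse(L, terms=2)    # I + L only: an approximation

print("full series matches forward substitution:", full_series == exact)
print("2-term truncation matches:", truncated == exact)
```

The row loop in `forward_substitution_inverse` cannot start row i until every earlier row is finished, while `neumann_inverse` contains nothing but whole-matrix multiplies and adds.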

That reframing — from "settle the chain in order" to "run a few all-at-once correction rounds" — is the whole game. Each power of L is a GEMM, the one operation tensor cores devour by the thousands per cycle, and structural masking plus a residual correction keep the truncated answer faithful. Crucially, once the kernel is only matrix-multiplies, it inherits everything those multiplies already support — including dropping the weights and activations down to low-bit integers. A sequential triangular solve has no natural INT4 form; a stack of GEMMs does.
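
The claim that a GEMM has a natural low-bit form can be sketched as follows. The scheme below is generic symmetric per-tensor quantization to a 4-bit integer range, chosen here purely as an assumption for illustration; the source does not describe the paper's actual quantization recipe, and `quantize_int4` and `quantized_matmul` are hypothetical names.

```python
# Sketch only: a generic quantized matrix-multiply, not the paper's scheme.

def quantize_int4(M):
    """Map floats to integers in [-7, 7] with one shared scale per matrix."""
    max_abs = max(abs(v) for row in M for v in row)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [[max(-7, min(7, round(v / scale))) for v in row] for row in M]
    return q, scale

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def quantized_matmul(A, B):
    """Multiply in integers, then rescale the accumulated result to floats."""
    qa, scale_a = quantize_int4(A)
    qb, scale_b = quantize_int4(B)
    acc = matmul(qa, qb)  # integer multiply-accumulate
    return [[v * scale_a * scale_b for v in row] for row in acc]

# One Neumann term, L @ L, computed in float and in the 4-bit integer form.
L = [[0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0],
     [-0.25, 0.75, 0.0]]

exact = matmul(L, L)
approx = quantized_matmul(L, L)
max_err = max(abs(e - a) for re, ra in zip(exact, approx) for e, a in zip(re, ra))
print("max abs error of the quantized product:", max_err)
```

The whole computation is one integer multiply-accumulate plus a rescale. A row-by-row solve has no comparable form, because each row feeds the next and would need rescaling at every step.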

Figure: CUDA Core vs Tensor Core on the same 4×4 multiply. CUDA Core: 1 result per cycle. Tensor Core: 16 results per cycle (16× throughput for a 4×4 matrix multiply).

Here is where the win earns its keep. Take a chunk of 64 tokens. Forward substitution walks a dependency chain 64 steps long: 64 stages that cannot overlap, no matter how many cores you own. A Neumann series truncated to about 4 terms replaces that with 4 matrix-multiplies, so the critical path collapses from 64 to ~4 (illustrative). Each 64×64 multiply is something tensor cores finish in a handful of cycles, and they run concurrently rather than in lockstep. After the real-world overheads of masking and correction, the paper reports a kernel speedup of about 5x and roughly 20% lower decode-layer overhead, with accuracy held in both FP and low-precision runs on Qwen3.5-family models.
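
How good a short truncation is on a 64-token chunk depends on the size of the entries in L. The sketch below measures that for one hypothetical case: small, non-negative random entries below the diagonal. The entry range, the seed, and the function names are assumptions made for this example, not values from the paper.

```python
import random

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def reference_inverse(L):
    """Forward substitution for (I - L) X = I; used only as ground truth."""
    n = len(L)
    X = []
    for i in range(n):
        row = [1.0 if i == j else 0.0 for j in range(n)]
        for k in range(i):
            lik = L[i][k]
            for j in range(k + 1):  # X[k] is zero to the right of column k
                row[j] += lik * X[k][j]
        X.append(row)
    return X

def truncation_errors(L, max_terms):
    """Max-abs error of I + L + ... + L^(k-1) for k = 1..max_terms."""
    n = len(L)
    exact = reference_inverse(L)
    total = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    power = [row[:] for row in total]
    errors = {}
    for k in range(1, max_terms + 1):
        errors[k] = max(abs(e - t) for re, rt in zip(exact, total)
                        for e, t in zip(re, rt))
        power = matmul(power, L)  # next power of L: one more GEMM
        total = [[t + p for t, p in zip(rt, rp)] for rt, rp in zip(total, power)]
    return errors

rng = random.Random(0)
n = 64  # chunk size from the text
# Hypothetical entries: small, non-negative values below the diagonal.
L = [[rng.uniform(0.0, 0.01) if j < i else 0.0 for j in range(n)]
     for i in range(n)]

errors = truncation_errors(L, max_terms=8)
for k in (1, 2, 4, 8):
    print(f"{k} terms: max abs error {errors[k]:.2e}")
```

With larger entries the same loop would need more terms to reach the same error, which fits the paper's choice to scale the order with chunk size rather than fix it.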

| How the inverse is computed | Shape of the work | Parallel? | Quantizes to INT? |
| --- | --- | --- | --- |
| Forward substitution | solve row by row, in order | no: sequential critical path ~chunk length | awkward: no natural low-bit form |
| Truncated Neumann series | a few matrix-multiplies (`I + L + L² + …`) | yes: terms run concurrently (~5x kernel, reported) | yes: it's a GEMM, so INT4 is on the table (reported) |

The honest caveat is that a truncated series is an approximation — drop too many terms and the inverse drifts — which is why the paper pairs it with a residual correction and scales the number of terms with chunk size rather than hard-coding it. But the trade is the kind hardware loves: spend a little extra arithmetic (a few full matrix-multiplies) to delete a long sequential dependency, and let the quantized tensor cores do what they are best at. It is the same "make it a GEMM" lever the GPU track teaches over and over — here applied to the one stubbornly sequential corner of a linear-attention layer.
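
One way a residual correction can stay GEMM-only is sketched below. The source does not spell out the paper's correction, so this uses a generic Newton-Schulz-style refinement, X + X(I − AX) with A = I − L, as a stand-in assumption. Function names and the toy 8×8 matrix are illustrative; integer entries keep every comparison exact.

```python
# Sketch only: a generic refinement step, assumed here for illustration.

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def neumann_inverse(L, terms):
    """Return I + L + ... + L^(terms - 1)."""
    total = identity(len(L))
    power = identity(len(L))
    for _ in range(terms - 1):
        power = matmul(power, L)
        total = add(total, power)
    return total

def residual_correction(A, X):
    """One refinement step, X + X (I - A X), built from multiplies and adds."""
    residual = sub(identity(len(A)), matmul(A, X))  # how far A X is from I
    return add(X, matmul(X, residual))

n = 8
L = [[1 + (i * j) % 3 if j < i else 0 for j in range(n)] for i in range(n)]
A = sub(identity(n), L)

x2 = neumann_inverse(L, terms=2)   # crude start: I + L
x4 = residual_correction(A, x2)    # equals the 4-term series
x8 = residual_correction(A, x4)    # equals the 8-term series, exact for n = 8

print("one step gives the 4-term series:", x4 == neumann_inverse(L, terms=4))
print("two steps give the exact inverse:", matmul(A, x8) == identity(n))
```

For this particular refinement, the residual of a k-term truncation is exactly L^k, so each step doubles the number of series terms covered at the cost of two extra multiplies.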

Related explainers

Gated DeltaNet-2 — decoupled erase/write gates — the gating half of the same layer: how Gated DeltaNet decides what state to keep versus overwrite
Parallax — local-linear attention vs FlashAttention 2/3 — another route to fast linear attention, trading the global score matrix for local windows
Gemma 4 QAT — quantization-aware training — the other side of the INT4 story: training weights so they survive the drop to low bit-width
