Gated DeltaNet-2 — Decoupled channel-wise erase/write gates
Gated DeltaNet-2 (GDN-2) is the linear-attention architecture posted to arXiv on May 21, 2026 by NVIDIA researchers Hatamizadeh, Choi, and Kautz. The central change fits in one paragraph: take the single scalar gate that prior delta-rule models use to decide "how much of the recurrent state survives this step" and split it into two independent per-channel gates, one for erase and one for write, so each dimension of the fixed-size recurrent state can be updated at its own rate.
The news. The paper reports that the model, trained at 1.3B parameters on 100B FineWeb-Edu tokens, beats Mamba-2, Mamba-3, Gated DeltaNet, and Kimi Delta Attention on aggregate benchmarks. The most pronounced gains are on RULER multi-key needle-in-a-haystack, the long-context retrieval task that most directly tests whether a recurrent state can preserve several distinct pieces of information at once.
Picture an audio mixing console. The console has, say, sixteen channels — each holds a different stem (the vocals, the bassline, the synth, the room tone). The classical linear-attention "delta rule" is an engineer who can only turn one master fader: every channel gets faded together by the same amount, and every new sound is written across the whole desk at the same level. If a song needs the vocals preserved while the synth gets a hard cut, the master fader cannot do that. Gated DeltaNet-2 hands every channel its own fader and its own mute switch. Now the vocals can sit untouched while the synth is muted and the bassline is re-cut at the same instant — channel by channel, every step.
Translating back: the recurrent state in a linear-attention model is a fixed-size memory that gets updated at every token. Delta-rule models like Gated DeltaNet and KDA used a single scalar gate per token to decide how much of the past survives, which is practically a master fader. (State-space-model baselines like Mamba-2 and Mamba-3 sit in a different lineage with their own per-channel scan formulation, and are GDN-2's headline comparison baselines on the benchmark side.) GDN-2 introduces a per-channel erase gate b_t and a per-channel write gate w_t, both produced as functions of the current token, so each channel of the recurrent state is updated under its own erase rate and write rate (schematically state_c ← state_c · (1 − b_{t,c}) + write_{t,c}). The authors then derive a gate-aware backward pass that keeps the parallel-scan training algorithm intact; without it, vectorizing the gate would have broken the O(log n) depth that makes these models trainable at scale.
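The schematic update above can be sketched in a few lines of plain Python. This is only an illustration of the per-channel formula state_c ← state_c · (1 − b_{t,c}) + write_{t,c}, not the paper's implementation: the function names `gated_update` and `scalar_gated_update` are hypothetical, the state is flattened to a simple list of channels, and the gate values are hand-picked rather than computed from a token.

```python
from typing import List


def gated_update(state: List[float], erase: List[float], write: List[float]) -> List[float]:
    """One schematic step with per-channel gates (hypothetical helper).

    Each channel c is updated as state_c * (1 - erase_c) + write_c,
    following the schematic formula in the text.
    """
    if not (len(state) == len(erase) == len(write)):
        raise ValueError("state, erase, and write must have the same length")
    return [s * (1.0 - b) + w for s, b, w in zip(state, erase, write)]


def scalar_gated_update(state: List[float], keep: float, write: List[float]) -> List[float]:
    """Single-gate counterpart: one survival factor shared by every channel."""
    if len(state) != len(write):
        raise ValueError("state and write must have the same length")
    return [s * keep + w for s, w in zip(state, write)]


# Four channels, mirroring the console analogy (values are illustrative):
# channel 0 is left untouched, channel 1 is muted, channel 2 is half-erased
# and re-cut, channel 3 keeps its content and receives new material.
state = [1.0, 1.0, 1.0, 1.0]
erase = [0.0, 1.0, 0.5, 0.0]
write = [0.0, 0.0, 0.25, 0.5]

new_state = gated_update(state, erase, write)
print(new_state)  # [1.0, 0.0, 0.75, 1.5]

# With a single shared gate, muting channel 1 would also mute channel 0.
shared = scalar_gated_update(state, keep=0.0, write=write)
print(shared)  # [0.0, 0.0, 0.25, 0.5]
```

The point of the comparison is the last call: with one shared factor there is no setting that silences channel 1 while leaving channel 0 intact.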
Why decoupling helps a needle test: imagine the channel that happens to hold a long-range fact ("the protagonist's middle name was Quentin") at token 200, and the model is now processing token 5,000. Under a single-gate scheme the model picks one g per token; whenever it needs to make room for a fresh idea, g drops and every channel, including the Quentin channel, decays. Under GDN-2 the model can set b_{t,Quentin} ≈ 0 indefinitely while letting other channels decay aggressively. The needle stays alive in its own register; the rest of the state is free to churn. That is exactly the multi-key RULER setting (keep several distinct needles alive in parallel), and that is the benchmark where the paper's gain is most pronounced.
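The argument can be written as a retention factor. As a simplifying assumption for illustration (not a formula taken from the paper), ignore any new writes to the channel and unroll the schematic update from the step s where the fact was stored to the current step T. The symbol r is introduced here only for this sketch.

```latex
\[
r_c(s \to T) \;=\; \prod_{t=s+1}^{T} \bigl(1 - b_{t,c}\bigr)
\qquad \text{(per-channel erase gate)}
\]
\[
r(s \to T) \;=\; \prod_{t=s+1}^{T} g_t
\qquad \text{(single gate, identical for every channel)}
\]
```

With per-channel gates, holding b_{t,c} ≈ 0 on one channel keeps its product near 1 regardless of what the other channels do. With a single gate, every low g_t multiplies into the retention of every channel, the needle included.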
How GDN-2 relates to other linear-attention designs
| Model | Year | Gate design | State shape |
|---|---|---|---|
| Mamba-2 | 2024 | Per-channel selective scan (state-space) | Constant-size recurrent state |
| Mamba-3 | 2025 | Per-channel scan with refinement step | Constant-size recurrent state |
| Gated DeltaNet (GDN) | 2025 | Single scalar gate per token (delta rule) | Constant-size recurrent state |
| Kimi Delta Attention (KDA) | 2025 | Single learned delta gate (coupled erase/write) | Constant-size recurrent state |
| Gated DeltaNet-2 (this paper) | 2026 | Two independent per-channel gates (erase + write) | Constant-size recurrent state |
A small worked example puts numbers on the difference (illustrative: the paper does not publish per-channel decay curves at this exact setting; the structural argument is the paper's). Take a 16-channel recurrent state and a token where the model "wants to forget" most channels but keep one. Under a single-gate scheme with g = 0.40, every channel, including the one carrying the needle, drops to 40% of its prior amplitude in one step. After two such steps the needle is at 16%; after three, 6.4%, which is effectively gone. Under GDN-2 the model picks an erase rate of b = 0.60 on most channels (the same 40% survival per step, since 1 − 0.60 = 0.40) and b_needle = 0.02. The same three steps leave most channels at 6.4% as before, but the needle channel is at (1 − 0.02)³ ≈ 94% of its original amplitude, preserved across the same number of forget-pressure steps that wiped the single-gate version. One number per channel, three steps, ~15× difference in needle retention.
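The arithmetic in this example is easy to check. The snippet below is a minimal sketch that recomputes the three-step retention figures from the illustrative gate values in the text; the helper name `retention` is hypothetical, and new writes to the channels are ignored.

```python
def retention(erase_rate: float, steps: int) -> float:
    """Fraction of a channel's amplitude left after `steps` updates
    with a constant erase rate and no new writes (hypothetical helper)."""
    return (1.0 - erase_rate) ** steps


steps = 3

# Single-gate scheme: g is the survival factor shared by all channels.
single_gate = 0.40 ** steps

# Per-channel scheme: b is an erase rate, so survival per step is 1 - b.
most_channels = retention(0.60, steps)
needle_channel = retention(0.02, steps)

ratio = needle_channel / single_gate

print(f"single gate, any channel: {single_gate:.3f}")      # about 0.064
print(f"per-channel, most channels: {most_channels:.3f}")  # about 0.064
print(f"per-channel, needle channel: {needle_channel:.3f}")  # about 0.941
print(f"needle retention ratio: {ratio:.1f}x")             # about 14.7x
```

The ratio of roughly 14.7 is where the "~15×" figure comes from.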
This is the structural reason GDN-2 wins on multi-key RULER and not just average benchmarks: the contribution is not "better trained" or "more parameters" but "more independent forgetting axes," and a benchmark that demands keeping several distinct facts alive in parallel is exactly where independent axes pay off.
Goes deeper in: LLM Internals → KV Cache → The KV cache problem and its memory cost — the standard softmax-attention design that linear-attention architectures like GDN-2 are trying to replace with a constant-memory recurrent state.
Related explainers
- DeepSeek V4 — long-context cost cut to a fraction — another sub-quadratic angle on long-context decode, this time via sparse attention rather than a recurrent state
- vLLM v0.20 — TurboQuant 2-bit KV cache — shrinking the standard KV cache instead of replacing it
- SP-KV — utility predictor for the KV cache — another way to keep "important" entries alive longer in a memory budget