What is Gated DeltaNet-2?

Gated DeltaNet-2 (GDN-2) is a linear-attention architecture introduced by NVIDIA researchers Hatamizadeh, Choi, and Kautz in May 2026. It extends the delta-rule family of linear-attention models (Gated DeltaNet, Kimi Delta Attention) by replacing the single scalar gate of those prior designs with two independent per-channel gates: an erase gate b_t that controls how much of the old recurrent state is forgotten in each channel, and a write gate w_t that controls how strongly the new token writes into each channel. The authors derive a gate-aware backward pass that preserves the parallel-scan training algorithm, so the per-channel gates do not break the architecture's O(log n) training depth. At 1.3B parameters trained on 100B FineWeb-Edu tokens, the paper reports the model beats Mamba-2, Mamba-3, Gated DeltaNet, and Kimi Delta Attention on aggregate benchmarks, with the largest gains on long-context retrieval tasks.

Why does decoupling erase and write help?

The recurrent state in a linear-attention model is a fixed-size memory that has to hold every long-range fact the model wants to recall, on top of whatever short-range context it currently needs. With a single scalar gate, the model can only choose one global forget rate per token — so whenever it needs to clear room for new short-range information, every channel forgets together, including the ones that happen to hold a long-range needle. Per-channel gates let some channels behave as long-term memory (their erase rate stays near zero indefinitely) while others churn fast. This matters most on benchmarks like RULER multi-key needle-in-a-haystack, where the model has to preserve several distinct facts in parallel — which is exactly where Gated DeltaNet-2 reports its most pronounced gains.

How does Gated DeltaNet-2 compare to Mamba-3?

Both are sub-quadratic architectures with a fixed-size recurrent state and constant-memory decoding. The architectural difference is in the gate. Mamba-3 uses an input-dependent selective scan to update each channel, which is effectively per-channel decay but tied to a state-space-model formulation. Gated DeltaNet-2 stays in the delta-rule family (schematically, state ← state·(1 − b) + w) but extends the gate from a scalar to two independent per-channel vectors. The paper's headline empirical claim is that, at the 1.3B-parameter / 100B-token FineWeb-Edu scale with matched evaluation, GDN-2 beats Mamba-3 on aggregate benchmarks, with the most pronounced gain on RULER multi-key needle-in-a-haystack. Whether the architectural choice scales beyond 1.3B is, as ever for these papers, an open question.

Gated DeltaNet-2 paper — Decoupled channel-wise erase/write gates


TL;DR

What is it: The Gated DeltaNet-2 paper from NVIDIA introduces a linear-attention architecture that splits the gate of prior delta-rule models — a single scalar that mixed "forget the old state" and "write the new state" — into two independent per-channel vectors, so each dimension of the fixed-size recurrent state has its own erase rate and its own write rate.
Why it’s needed: Linear-attention models compress every past token into a fixed-size recurrent state, but a single shared gate forces every dimension of that state to forget at the same rate — so a needle from 5,000 tokens back fights with normal short-range updates. Decoupling erase and write per channel lets some dimensions behave like long-term memory while others churn fast.
vs previous: Prior delta-rule linear-attention models — Gated DeltaNet and Kimi Delta Attention (KDA) — tied erase and write to one shared scalar, while state-space baselines like Mamba-2 and Mamba-3 use a different per-channel scan formulation. Gated DeltaNet-2 reports stronger multi-key needle-in-a-haystack retrieval on RULER than all four at 1.3B / 100B FineWeb-Edu tokens, with the same linear time complexity and constant-memory decode.

Jargon

Linear attention: A family of attention variants whose state size does not grow with context length. Instead of caching one key/value pair per past token (as softmax attention does in its KV cache), the model maintains a fixed-size matrix or vector state that is updated recurrently. Decode stays constant-memory and constant-time per token regardless of how long the context is.
Delta rule: A classical recurrent-update rule that writes a residual correction into the state on each step. In linear-attention models, the delta rule typically looks like state ← state · g + write, where g controls how much of the old state is kept. Gated DeltaNet-2 generalizes this to state ← state · (1 − b) + w with independent per-channel b and w.
Mamba-3: A recent state-space-model architecture (the latest in the Mamba family) that uses input-dependent selective scans to update a fixed-size recurrent state. It is one of the strongest sub-quadratic baselines and is the headline comparison point in the Gated DeltaNet-2 paper.
Gated DeltaNet (GDN): The immediate predecessor of Gated DeltaNet-2, also from NVIDIA's research group. It combined the delta rule with a single scalar gate per token. Gated DeltaNet-2 revisits that design choice: same architecture skeleton, two-vector gate instead of one scalar.
Kimi Delta Attention (KDA): A 2025 linear-attention design from the Kimi team that also extends the delta rule with a learned gate, but keeps the erase and write coupled. Gated DeltaNet-2 reports beating KDA on aggregate benchmarks at matched scale.
RULER multi-key needle-in-a-haystack: A long-context retrieval benchmark where the model must recall several distinct facts ("needles") that were each planted at different positions in a long context. Multi-key is the hardest setting because the model cannot just preserve one global "important" register — it has to keep several pieces of information alive in parallel.
Parallel scan: The training-time algorithm that lets recurrent updates be evaluated in parallel along the sequence dimension (in O(log n) depth), which is what makes Mamba- and DeltaNet-style models efficient on GPUs. Gated DeltaNet-2's "gate-aware backward pass" is the piece that preserves this efficiency even after the gates are vectorized.
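
To make the parallel-scan entry concrete, here is a minimal sketch for a single channel with a gated recurrence h_t = a_t · h_{t-1} + u_t, where a_t stands in for the keep factor (1 − b_t) and u_t for the write. This is a simplified scalar illustration, not the paper's algorithm: the function names are hypothetical, and GDN-2's actual state update and gate-aware backward pass are not reproduced here. The point is only that each step is an affine map, affine maps compose associatively, and so all prefixes can be computed in about log2(n) combining rounds.

```python
def combine(first, second):
    """Compose two affine steps h -> a*h + u, applying `first` and then `second`."""
    a1, u1 = first
    a2, u2 = second
    return (a1 * a2, a2 * u1 + u2)


def sequential_scan(keeps, writes, h0=0.0):
    """Reference implementation: one step after another, O(n) depth."""
    h, out = h0, []
    for a, u in zip(keeps, writes):
        h = a * h + u
        out.append(h)
    return out


def doubling_scan(keeps, writes, h0=0.0):
    """Inclusive scan by repeated doubling (Hillis-Steele style).

    Each round could run all positions concurrently on parallel hardware;
    here the rounds are simulated with a list comprehension.
    Returns the per-step states and the number of rounds used.
    """
    elems = list(zip(keeps, writes))
    n = len(elems)
    offset, rounds = 1, 0
    while offset < n:
        elems = [
            combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
            for i in range(n)
        ]
        offset *= 2
        rounds += 1
    # elems[i] now holds the composed map for steps 0..i; apply it to h0.
    return [a * h0 + u for a, u in elems], rounds


keeps = [0.9, 0.5, 1.0, 0.2, 0.98, 0.4, 0.7, 0.6]
writes = [1.0, 0.0, 0.3, 0.0, 0.5, 0.0, 0.0, 0.2]
states, rounds = doubling_scan(keeps, writes, h0=2.0)
print(rounds, states[-1])
```

The doubling scan does more total work than the sequential loop, but its depth grows logarithmically with sequence length, which is the property the jargon entry refers to.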

Gated DeltaNet-2 (GDN-2) is the linear-attention architecture posted to arXiv on May 21, 2026 by NVIDIA researchers Hatamizadeh, Choi, and Kautz. The central change fits in one paragraph: take the single scalar gate that prior delta-rule models use to decide "how much of the recurrent state survives this step" and split it into two independent per-channel gates, one for erase and one for write, so each dimension of the fixed-size recurrent state can be updated at its own rate.

The news. Trained at 1.3B parameters on 100B FineWeb-Edu tokens, the paper reports the model beats Mamba-2, Mamba-3, Gated DeltaNet, and Kimi Delta Attention on aggregate benchmarks — with the most pronounced gains on RULER multi-key needle-in-a-haystack, the long-context retrieval task that most directly tests whether a recurrent state can preserve several distinct pieces of information at once.

Picture an audio mixing console. The console has, say, sixteen channels — each holds a different stem (the vocals, the bassline, the synth, the room tone). The classical linear-attention "delta rule" is an engineer who can only turn one master fader: every channel gets faded together by the same amount, and every new sound is written across the whole desk at the same level. If a song needs the vocals preserved while the synth gets a hard cut, the master fader cannot do that. Gated DeltaNet-2 hands every channel its own fader and its own mute switch. Now the vocals can sit untouched while the synth is muted and the bassline is re-cut at the same instant — channel by channel, every step.

Translating back: the recurrent state in a linear-attention model is a fixed-size memory that gets updated at every token. Delta-rule models like Gated DeltaNet and KDA used a single scalar gate per token to decide how much of the past survives, which is in effect a master fader. (State-space-model baselines like Mamba-2 and Mamba-3 sit in a different lineage with their own per-channel scan formulation, and are GDN-2's headline comparison baselines on the benchmark side.) GDN-2 introduces a per-channel erase gate b_t and a per-channel write gate w_t, both produced as functions of the current token, so each channel c of the recurrent state is updated under its own erase rate and write rate (schematically, state_c ← state_c · (1 − b_{t,c}) + write_{t,c}). The authors then derive a gate-aware backward pass that keeps the parallel-scan training algorithm intact; without it, vectorizing the gate would have broken the O(log n) depth that makes these models trainable at scale.
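
The schematic update above can be sketched in a few lines of plain Python. This is an illustration of the schematic formula only, under stated assumptions: the state is treated as a flat list of channels, the gate values are passed in directly rather than computed from a token, and the function names `scalar_gate_step` and `channel_gate_step` are hypothetical, not taken from the paper.

```python
def scalar_gate_step(state, keep, write):
    """Single-gate update: one shared keep factor applied to every channel."""
    return [s * keep + w for s, w in zip(state, write)]


def channel_gate_step(state, erase, write):
    """Per-channel update: state_c <- state_c * (1 - erase_c) + write_c."""
    return [s * (1.0 - b) + w for s, b, w in zip(state, erase, write)]


state = [1.0, 1.0, 1.0, 1.0]   # channel 0 plays the role of the "needle"
write = [0.0, 0.5, 0.0, 0.0]   # new content lands only in channel 1

# Shared gate: making room for the write also decays the needle channel.
print(scalar_gate_step(state, keep=0.4, write=write))

# Independent gates: channel 0 is barely erased, the others decay as before.
print(channel_gate_step(state, erase=[0.02, 0.6, 0.6, 0.6], write=write))
```

With the shared gate, channel 0 falls to 0.4 in one step; with per-channel gates it stays at about 0.98 while the remaining channels are updated exactly as in the shared case.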

Why decoupling helps a needle test: imagine a channel that has been holding a long-range fact ("the protagonist's middle name was Quentin") since token 200, while the model is now processing token 5,000. Under a single-gate scheme the model picks one g per token; whenever it needs to make room for a fresh idea, g drops and every channel, including the Quentin channel, decays. Under GDN-2 the model can keep b_{t, Quentin} ≈ 0 indefinitely while letting other channels decay aggressively. The needle stays alive in its own register; the rest of the state is free to churn. That is exactly the multi-key RULER setting (keep several distinct needles alive in parallel), and it is the benchmark where the paper's gain is most pronounced.
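
How close to zero does the needle channel's erase rate have to be? A rough sketch, assuming a constant erase rate on that channel and no further writes into it between token 200 and token 5,000 (both are simplifying assumptions for illustration; the numbers are not from the paper):

```python
def retention(erase_rate, steps):
    """Fraction of a channel's amplitude left after `steps` updates,
    assuming a constant erase rate and no new writes to that channel."""
    return (1.0 - erase_rate) ** steps


STEPS = 5000 - 200  # needle stored at token 200, needed at token 5,000

for b in (0.0, 1e-4, 1e-3, 1e-2):
    print(f"erase rate {b:g}: {retention(b, STEPS):.4f} of the needle survives")
```

Under these assumptions an erase rate of 0.001 per step already leaves less than 1% of the needle after 4,800 steps, while 0.0001 leaves roughly 60%. A channel acting as long-term memory therefore needs its erase gate held very near zero, which a shared gate cannot do while other channels are being cleared.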

How GDN-2 relates to other linear-attention designs

Model	Year	Gate design	State shape
Mamba-2	2024	Per-channel selective scan (state-space)	Constant-size recurrent state
Mamba-3	2025	Per-channel scan with refinement step	Constant-size recurrent state
Gated DeltaNet (GDN)	2025	Single scalar gate per token (delta rule)	Constant-size recurrent state
Kimi Delta Attention (KDA)	2025	Single learned delta gate (coupled erase/write)	Constant-size recurrent state
Gated DeltaNet-2 (this paper)	2026	Two independent per-channel gates (erase + write)	Constant-size recurrent state

A small worked example puts numbers on the difference (illustrative — the paper does not publish per-channel decay curves at this exact setting; the structural argument is the paper's). Take a 16-channel recurrent state and a token where the model "wants to forget" most channels but keep one. Under a single-gate scheme with g = 0.40, every channel — including the one carrying the needle — drops to 40% of its prior amplitude in one step. After two such steps the needle is at 16%; after three, 6.4% — effectively gone. Under GDN-2 the model picks b = 0.60 on most channels and b_needle = 0.02. The same three steps leave most channels at 6.4% as before, but the needle channel is at (1 − 0.02)³ ≈ 94% of its original amplitude — preserved across the same number of forget-pressure steps that wiped the single-gate version. One number per channel, three steps, ~15× difference in needle retention.
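
The arithmetic in the worked example can be checked directly. The variable names below are illustrative, and the gate values are the ones from the example rather than measured quantities:

```python
STEPS = 3

shared_keep = 0.40    # single-gate scheme: every channel keeps 40% per step
erase_most = 0.60     # GDN-2 style: erase rate on ordinary channels
erase_needle = 0.02   # GDN-2 style: erase rate on the needle channel

single_gate_needle = shared_keep ** STEPS      # needle under the shared gate
gdn2_other = (1 - erase_most) ** STEPS         # ordinary channels under GDN-2
gdn2_needle = (1 - erase_needle) ** STEPS      # needle under GDN-2

ratio = gdn2_needle / single_gate_needle

print(f"single gate, needle:    {single_gate_needle:.1%}")
print(f"per-channel, others:    {gdn2_other:.1%}")
print(f"per-channel, needle:    {gdn2_needle:.1%}")
print(f"needle retention ratio: {ratio:.1f}x")
```

The ordinary channels end up at the same level in both schemes; only the needle channel differs, which is where the roughly 15× gap comes from.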

This is the structural reason GDN-2 wins on multi-key RULER and not just average benchmarks: the contribution is not "better trained" or "more parameters" but "more independent forgetting axes," and a benchmark that demands keeping several distinct facts alive in parallel is exactly where independent axes pay off.

Goes deeper in: LLM Internals → KV Cache → The KV cache problem and its memory cost — the standard softmax-attention design that linear-attention architectures like GDN-2 are trying to replace with a constant-memory recurrent state.

Related explainers

DeepSeek V4 — long-context cost cut to a fraction — another sub-quadratic angle on long-context decode, this time via sparse attention rather than a recurrent state
vLLM v0.20 — TurboQuant 2-bit KV cache — shrinking the standard KV cache instead of replacing it
SP-KV — utility predictor for the KV cache — another way to keep "important" entries alive longer in a memory budget
