MarginGate — Margin-gated verification for batch-invariant decoding

[Figure: Temperature-0 BF16 isn't reproducible, so verify only the risky steps. Bars show the logit margin (lead over the 2nd-best token) at each greedy decode step (temp 0) over time; steps whose margin falls below the flip-risk threshold are flagged for an FP32 re-check. Always-on verification (dashed outline) re-checks every step, while margin gating re-checks just the flagged few: the same 100% determinism (same tokens at any batch size) at ~2× less overhead. Gate on the logit margin, re-check the few near-ties, repair real flips.]
learnaivisually.com/ai-explained/margingate-batch-invariant-decoding

The news. On May 28, 2026, a paper introduced MarginGate (arXiv 2605.30218), starting from an uncomfortable fact: temperature-0, greedy BF16 decoding is usually assumed to be reproducible, yet the same request can return different tokens depending on how many other requests happen to share its batch. The paper finds that these batch-induced token flips are rare, so MarginGate verifies only the steps at risk.

Picture an airport security line. Almost every traveler has a boarding pass that scans cleanly, so the agent waves them straight through the fast lane — that's a decode step with a wide logit margin, where the top token wins by a mile and no amount of numerical jitter would change it. The trouble is the occasional borderline pass: a near-tie between the top two tokens. For those travelers, a tiny nudge decides which way they go — and at temperature 0, that nudge can come from something as invisible as the batch they were standing in.

Why would the batch matter? Because GPU kernels accumulate the sums behind each token's score in a reduction order that can depend on batch size, and floating-point addition isn't associative: re-order the sum and the last bit can change. In BF16, which keeps only 8 bits of significand precision against FP32's 24, that last bit is comparatively large. For a confident step that is harmless. For a near-tie it can flip the winner, so the very same prompt emits one token when decoded alone and another when it rides inside a larger batch. The root cause lives one level down, in how BF16 trades mantissa bits for speed versus FP32.
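The order dependence is easy to reproduce without a GPU. The sketch below is an illustration, not code from the paper: `to_bf16` and `bf16_sum` are hypothetical helpers that emulate BF16 by rounding a float32 bit pattern to its top 16 bits (round-to-nearest-even), and the three terms are chosen only to make the effect visible.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest BF16 value (ties to even), returned as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    # BF16 is the top 16 bits of float32; this bias turns truncation into
    # round-to-nearest-even on the discarded low 16 bits.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_sum(values):
    """Accumulate left to right, rounding to BF16 after every addition."""
    total = 0.0
    for v in values:
        total = to_bf16(total + to_bf16(v))
    return total


terms = [256.0, 1.0, 1.0]
large_first = bf16_sum(terms)                  # 256 + 1 rounds back to 256, twice
small_first = bf16_sum(list(reversed(terms)))  # 1 + 1 = 2, then 2 + 256 = 258
print(large_first, small_first)  # 256.0 258.0
```

Same three numbers, two different sums. Real logits are dot products over thousands of terms and the discrepancy is usually far smaller than this toy gap, but when the top two tokens are nearly tied, a difference in the last bit is enough to change which one wins.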

MarginGate's move is to gate on the margin. High-margin steps keep the cheap BF16 fast lane untouched. Only the sparse low-margin steps are sent to secondary screening — a re-computation in FP32, the same verify-then-correct shape that speculative decoding uses. If the trusted FP32 result disagrees with what BF16 produced, MarginGate repairs the step by swapping the offending column of the K/V cache so the rest of the sequence stays consistent. The expensive check fires on a handful of travelers, not the whole terminal.
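The gating decision described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `top2_margin`, `margin_gated_step`, the `recompute_fp32` callback, and the threshold value are all hypothetical names and numbers, and the K/V cache repair is not modeled.

```python
from typing import Callable, Sequence, Tuple


def top2_margin(logits: Sequence[float]) -> Tuple[int, float]:
    """Return the argmax token and its lead over the runner-up."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best, logits[best] - logits[runner_up]


def margin_gated_step(
    fast_logits: Sequence[float],
    threshold: float,
    recompute_fp32: Callable[[], Sequence[float]],
) -> Tuple[int, bool, bool]:
    """Pick a token; returns (token, was_flagged, was_repaired)."""
    token, margin = top2_margin(fast_logits)
    if margin >= threshold:
        # Wide margin: trust the fast path, no re-check.
        return token, False, False
    # Near-tie: recompute in higher precision and trust that result.
    trusted_token, _ = top2_margin(recompute_fp32())
    return trusted_token, True, trusted_token != token


# Confident step: the expensive callback is never invoked.
print(margin_gated_step([5.0, 1.0, 0.5], 0.25, lambda: [5.0, 1.0, 0.5]))
# (0, False, False)

# Near-tie where the higher-precision logits pick the other token.
print(margin_gated_step([2.0, 2.0625, 0.0], 0.25, lambda: [2.03, 2.01, 0.0]))
# (0, True, True)
```

Passing the re-computation as a callback keeps the expensive path lazy: it runs only for flagged steps, which is the whole point of gating.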

How much does that save? Take a 1,000-token completion (illustrative). MarginGate flags the low-margin steps — about 18%, or ~180 steps — for an FP32 re-check, while the other ~820 keep the fast path. Of those 180, only a few are genuine flips: the paper measures flip rates of 0.3–1.3% of all steps (just 0.48% for Llama-3.1-8B on MATH500), so on the order of 3–13 tokens would actually have changed. In the paper's tested settings, MarginGate catches and repairs each one. Always-on verification would instead re-run all 1,000 steps in FP32 for the identical result — which is why margin-gating reports ~2× lower overhead (2.23× and 1.99× in the paper) while still restoring 100% sequence-level determinism on the models the paper tested (Llama-3.1-8B and Qwen2.5-14B).
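The arithmetic for that illustrative 1,000-token completion can be reproduced directly. `step_budget` is a hypothetical helper; the gate rate and flip rates are the figures quoted above, with flips counted as a fraction of all steps.

```python
def step_budget(total_steps: int, gate_rate: float, flip_rate: float) -> dict:
    """Split a completion into flagged and fast-path steps, plus expected flips."""
    flagged = round(total_steps * gate_rate)
    expected_flips = round(total_steps * flip_rate)
    return {
        "flagged": flagged,
        "fast_path": total_steps - flagged,
        "expected_flips": expected_flips,
    }


# Low end, Llama-3.1-8B on MATH500, and high end of the reported flip rates.
for flip_rate in (0.003, 0.0048, 0.013):
    print(flip_rate, step_budget(1000, 0.18, flip_rate))
```

With these inputs, 180 steps are flagged, 820 stay on the fast path, and the expected number of real flips lands at 3, 5, and 13 respectively.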

Strategy | Steps re-checked | Determinism | Relative overhead
--- | --- | --- | ---
Trust BF16 (no verify) | none | ✗ batch-dependent | 1× (baseline)
Always-on FP32 verify | every step | ✓ 100% | ~2× the gate, varies by model (paper)
MarginGate (margin-gated) | ~15–18% (paper) | ✓ 100% | ~2× lower than always-on (2.23× / 1.99×, paper)

The deeper lesson is that temperature 0 was never a determinism guarantee — it only fixes the sampling rule, not the arithmetic underneath it. MarginGate is cheap precisely because the failure is rare and predictable from the margin: you don't have to distrust every token, just the few that are genuinely on the fence.

