The news. On June 16, 2026, researchers published Ternary Mamba, which applies grouped quantization-aware training to compress Mamba-2 1.3B to W1.58A16: ternary weights, 16-bit activations. The checkpoint shrinks from 2,687 MB to 744 MB (3.61x) while holding 48.1% average zero-shot accuracy across seven tasks. Crucially, it fine-tunes from a pretrained FP16 checkpoint with distillation using only 102M tokens (about 4 GPU-hours on one H100), versus the 150B from-scratch tokens prior low-bit work needed. The authors also name a new failure mode unique to state-space models: zero-ratio collapse.

Picture a film-critic panel where every critic may cast exactly one of three votes: thumbs-down, a shrug, or thumbs-up — nothing in between. If you let each critic write a nuanced essay first and only afterward force them to collapse it into one of those three gestures, a lot of carefully-weighed opinions get mangled. That after-the-fact crush is post-training quantization: the model finishes training at full precision, then every weight is rounded onto the coarse grid at the end. Ternary quantization-aware training is the disciplined alternative — the panel deliberates the entire film knowing only three votes exist, so each critic learns where to spend a thumbs-up versus a shrug while it still matters.

Underneath the metaphor, the "three votes" are the only values a ternary weight can hold: -1, 0, +1. That is log2(3) ≈ 1.58 bits of information per weight, about as aggressive a step down the number line as anyone uses in practice and far below the 4-bit grid most low-bit models target. Rounding a finished model that far (the cheap PTQ path) moves nearly every weight a long way, and the network falls apart. So Ternary Mamba does the rounding inside training instead: it simulates the snap-to-grid on every forward pass so the weights learn to land on those three points, and it leans on a full-precision FP16 teacher (the master critic) to keep the ternary student's outputs close to the original. Watch the number line below tighten from a continuous range down to a handful of grid points: the gap that opens is exactly the error PTQ pays and QAT trains away.

[Interactive figure: number line of representable values (0 → 4). At 32-bit float the grid is virtually continuous: π = 3.14159 quantizes to ≈ 3.1416, Δ ≈ 0.00000.]
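To make the snap-to-grid step concrete, here is a minimal sketch of rounding one group of weights onto {-1, 0, +1} with a shared scale. The scale rule (mean absolute value of the group) and the names `ternarize_group` and `dequantize` are assumptions chosen for illustration; the article does not describe the exact quantizer Ternary Mamba uses.

```python
import math

def ternarize_group(weights):
    """Round one group of weights onto {-1, 0, +1} times a shared scale."""
    # Assumed scale rule: mean absolute value of the group ("absmean").
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    # Divide by the scale, round to the nearest integer, clamp to [-1, 1].
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map ternary codes back to real values for the forward pass."""
    return [c * scale for c in codes]

# Three possible values carry log2(3) bits of information per weight.
bits_per_weight = math.log2(3)

# Made-up example weights, not taken from any real model.
group = [0.42, -0.05, -0.61, 0.09, 0.33, -0.27, 0.02, 0.75]
codes, scale = ternarize_group(group)
approx = dequantize(codes, scale)

# The per-weight rounding error is what PTQ pays all at once.
error = [w - a for w, a in zip(group, approx)]

print(codes)
print(round(bits_per_weight, 3))
```

In QAT the same rounding would run inside every forward pass while gradients update the underlying full-precision weights; that training loop is left out of this sketch.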

There is one twist that does not show up when you quantize a transformer. A Mamba layer is a state-space model — it carries a running state forward through the sequence rather than re-reading every token — and during ternary QAT the authors hit zero-ratio collapse: too many weights snap to the middle value, 0, so the panel effectively goes silent and the model loses its voice. It is the analog of an entire critic bench defaulting to a shrug. Because the state is reused at every step, that hollowing-out compounds down the sequence in a way transformer QAT never sees, which is why grouping the weights and tuning the QAT schedule matters more here than it does for a transformer.
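One way to watch for this failure is to track the fraction of weights that land on 0 in each group. The sketch below is an illustration under assumptions: it reuses an absmean-style ternary rounding, and the `zero_ratio` helper and the example weights are hypothetical rather than the metric or data from the paper. It shows how a single outlier in a group can inflate the shared scale and push the remaining weights to 0.

```python
def ternary_codes(weights):
    """Ternary codes for one group, using an assumed absmean scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights)
    return [max(-1, min(1, round(w / scale))) for w in weights]

def zero_ratio(codes):
    """Fraction of weights in the group that snapped to 0."""
    return codes.count(0) / len(codes)

# Made-up groups: one with similar magnitudes, one with a single outlier.
balanced = [0.5, -0.4, 0.6, -0.5, 0.45, -0.55, 0.5, -0.6]
skewed = [4.0, 0.05, -0.04, 0.03, -0.05, 0.02, 0.04, -0.03]

balanced_ratio = zero_ratio(ternary_codes(balanced))
skewed_ratio = zero_ratio(ternary_codes(skewed))

print(balanced_ratio, skewed_ratio)
```

Smaller groups limit how many weights one outlier can drag to 0, which is one plausible reason grouping matters here; the source does not say whether this is the mechanism the authors identify.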

The reason anyone tolerates this is size. Take Mamba-2's 1.3 billion weights. At FP16 (2 bytes each) the checkpoint is about 2,687 MB. Squeeze each weight from 16 bits to roughly 1.58, about a tenth as many bits, and the weights alone would be near 265 MB. Activations are computed at run time and take no space in the checkpoint, so the rest of the file is overhead such as the per-group scales and the layers left at higher precision, and the real checkpoint lands at 744 MB: a 3.61x shrink while average zero-shot accuracy holds at 48.1%. And it gets there on 102M tokens, about 4 GPU-hours on a single H100, against the 150B from-scratch tokens earlier ternary work needed, roughly a 1,500x cut in training data. That gap is the whole pitch: QAT-from-a-checkpoint turns extreme quantization from a lab-scale retrain into an afternoon's fine-tune.
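The arithmetic behind those figures can be checked in a few lines. The inputs are the numbers quoted in this article; the variable names are illustrative, and the "overhead" figure is simply the difference between the reported checkpoint and the idealized weight size, not a breakdown from the paper.

```python
# Figures quoted in the article.
fp16_mb = 2687.0         # FP16 checkpoint size
ternary_mb = 744.0       # reported ternary checkpoint size
bits_fp16 = 16
bits_ternary = 1.58      # rounded value of log2(3)

# Idealized size if every weight shrank from 16 bits to 1.58 bits.
ideal_weights_mb = fp16_mb * bits_ternary / bits_fp16

# Everything in the real checkpoint beyond that idealized figure.
overhead_mb = ternary_mb - ideal_weights_mb

# Reported end-to-end compression ratio.
compression = fp16_mb / ternary_mb

# Training-data reduction: 150B from-scratch tokens vs 102M fine-tune tokens.
token_cut = 150e9 / 102e6

print(round(ideal_weights_mb), round(overhead_mb))
print(round(compression, 2), round(token_cut))
```

The idealized 265 MB is about a tenth of the FP16 size, while the reported 744 MB file is 3.61x smaller, so most of the remaining bytes are not ternary weights packed at their information-theoretic minimum.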

| Way to get a ternary model | When the rounding happens | Training cost | Result |
| --- | --- | --- | --- |
| Train ternary from scratch | built in from step one | ~150B tokens (prior work) | works, but lab-scale expensive |
| PTQ a finished model to ternary | once, after training | ~free, no retraining | collapses at three values (setup-dependent) |
| Ternary Mamba (QAT from a checkpoint) | simulated during a short fine-tune | 102M tokens · ~4 GPU-hours (paper) | 744 MB, 48.1% zero-shot (paper) |

The catch is the same one that makes QAT valuable anywhere: someone still has to run a training pass, even a short one — which is why this ships from a lab rather than as a one-line conversion you run at home. But 4 GPU-hours is cheap enough that extending a model to ternary stops being a research project, and doing it on a state-space model rather than a transformer is what makes the result new: the same train-on-the-grid discipline, carried into an architecture where a single misplaced weight echoes down the whole sequence.
