What is ternary quantization-aware training?

It trains (or fine-tunes) a model while forcing every weight onto just three values — -1, 0, and +1 — simulating that rounding on every forward pass so the weights learn to sit on the three-point grid. Ternary Mamba uses it to compress a Mamba-2 state-space model to W1.58A16 (about 1.58 bits per weight, 16-bit activations), shrinking the checkpoint 3.61x while holding 48.1% average zero-shot accuracy.

How is it different from post-training quantization or training ternary from scratch?

Post-training quantization rounds a finished model onto the grid once, at the end — fine at 4-bit, but at three values the rounding error is too large and the model collapses. Training ternary from scratch works but cost prior efforts about 150B tokens. Ternary Mamba splits the difference: it fine-tunes from a pretrained FP16 checkpoint with distillation, simulating the rounding during training, and reaches its result on just 102M tokens — about 4 GPU-hours on one H100.

What is zero-ratio collapse?

It is a failure mode the authors flag that is specific to state-space models: during ternary QAT, too many weights snap to the middle value 0, hollowing out the model's capacity. Because a Mamba layer reuses a running state at every step, that loss compounds down the sequence — an instability transformer quantization does not hit, and the reason the QAT schedule and grouped weights need extra care here.

Ternary Mamba compresses an SSM 3.6x with 4 GPU-hours of QAT — Ternary quantization-aware training

TL;DR

What is it: The Ternary Mamba paper compresses a Mamba state-space model with ternary quantization-aware training — forcing every weight to one of three values (-1, 0, +1) while the model is still training, instead of rounding it down afterward.
Why it’s needed: Ternary is the most extreme useful form of quantization: at about 1.58 bits per weight it makes a model tiny, and doing it as a fine-tune from a finished checkpoint costs only 4 GPU-hours rather than a from-scratch retrain on hundreds of billions of tokens.
vs previous: Versus post-training quantization (PTQ) — rounding a trained model to low bits at the end, which falls apart at three values — QAT simulates the rounding during training, and it has to tame a new failure mode, zero-ratio collapse, that transformer quantization never runs into.

Jargon

Ternary quantization (W1.58A16): Every weight is stored as one of just three values — -1, 0, or +1. Three states carry about 1.58 bits of information (log₂3), hence "W1.58"; the "A16" means activations still flow in 16-bit precision.
Quantization-Aware Training (QAT): Training (or fine-tuning) while simulating the low-bit rounding on every forward pass, so the weights learn to sit on the grid. The same QAT idea powers Gemma 4's 4-bit checkpoints — Ternary Mamba pushes it to the three-value limit.
Post-Training Quantization (PTQ): The cheap default: round a finished full-precision model onto the low-bit grid afterward, no retraining. GPTQ and AWQ are PTQ methods — fine at 4-bit, but at three values the rounding error is too large to absorb.
State-space model (Mamba): An alternative to the transformer: instead of attending over every past token, a Mamba layer carries a running state forward through the sequence. Because that state is reused at every step, a quantization error in one weight compounds down the line.
Knowledge distillation: The full-precision FP16 model acts as a teacher; the ternary student is trained to reproduce the teacher's outputs, which transfers most of the original quality across the brutal precision drop.
Zero-ratio collapse: An SSM-specific instability the authors flag: during QAT, too many weights snap to the middle value 0, hollowing out the model. It does not appear in transformer QAT — it is particular to the state-space recurrence.

The news. On June 16, 2026, researchers published Ternary Mamba, which applies grouped quantization-aware training to compress Mamba-2 1.3B to W1.58A16 — ternary weights, 16-bit activations. The checkpoint shrinks from 2,687 MB to 744 MB (3.61x) while holding 48.1% average zero-shot accuracy across seven tasks. Crucially, it fine-tunes from a pretrained FP16 checkpoint with distillation using only 102M tokens (about 4 GPU-hours on one H100), versus the 150B from-scratch tokens prior low-bit work needed. The authors also name a new failure mode unique to state-space models: zero-ratio collapse. Read the paper →

Picture a film-critic panel where every critic may cast exactly one of three votes: thumbs-down, a shrug, or thumbs-up — nothing in between. If you let each critic write a nuanced essay first and only afterward force them to collapse it into one of those three gestures, a lot of carefully-weighed opinions get mangled. That after-the-fact crush is post-training quantization: the model finishes training at full precision, then every weight is rounded onto the coarse grid at the end. Ternary quantization-aware training is the disciplined alternative — the panel deliberates the entire film knowing only three votes exist, so each critic learns where to spend a thumbs-up versus a shrug while it still matters.

Underneath the metaphor, the "three votes" are the only values a ternary weight can hold: -1, 0, +1. That is just about 1.58 bits of information per weight — the most aggressive step down the number line anyone uses in practice, far below the 4-bit grid most low-bit models target. Rounding a finished model that far (the cheap PTQ path) shoves nearly every weight a long way and the network falls apart. So Ternary Mamba does the rounding inside training instead: it simulates the snap-to-1 on every forward pass so the weights learn to land on those three points, and it leans on a full-precision FP16 teacher — the master critic — to keep the ternary student's outputs close to the original. Watch the number line below tighten from a continuous range down to a handful of grid points: the gap that opens is exactly the error PTQ pays and QAT trains away.

32-bit float — virtually continuous

There is one twist that does not show up when you quantize a transformer. A Mamba layer is a state-space model — it carries a running state forward through the sequence rather than re-reading every token — and during ternary QAT the authors hit zero-ratio collapse: too many weights snap to the middle value, 0, so the panel effectively goes silent and the model loses its voice. It is the analog of an entire critic bench defaulting to a shrug. Because the state is reused at every step, that hollowing-out compounds down the sequence in a way transformer QAT never sees, which is why grouping the weights and tuning the QAT schedule matters more here than it does for a transformer.

The reason anyone tolerates this is size. Take Mamba-2's 1.3 billion weights. At FP16 (2 bytes each) the checkpoint is about 2,687 MB. Squeeze each weight from 16 bits to roughly 1.58 — about a tenth as many bits — and the weights alone would be near 265 MB; counting the 16-bit activations, the per-group scales, and the layers left at higher precision, the real checkpoint lands at 744 MB — a 3.61x shrink while average zero-shot accuracy holds at 48.1%. And it gets there on 102M tokens, about 4 GPU-hours on a single H100 — against the 150B from-scratch tokens earlier ternary work needed, roughly a 1,500x cut in training data. That gap is the whole pitch: QAT-from-a-checkpoint turns extreme quantization from a lab-scale retrain into an afternoon's fine-tune.

Way to get a ternary model	When the rounding happens	Training cost	Result
Train ternary from scratch	built in from step one	~150B tokens (prior work)	works, but lab-scale expensive
PTQ a finished model to ternary	once, after training	~free, no retraining	collapses at three values (setup-dependent)
Ternary Mamba (QAT from a checkpoint)	simulated during a short fine-tune	102M tokens · ~4 GPU-hours (paper)	744 MB, 48.1% zero-shot (paper)

The catch is the same one that makes QAT valuable anywhere: someone still has to run a training pass, even a short one — which is why this ships from a lab rather than as a one-line conversion you run at home. But 4 GPU-hours is cheap enough that extending a model to ternary stops being a research project, and doing it on a state-space model rather than a transformer is what makes the result new: the same train-on-the-grid discipline, carried into an architecture where a single misplaced weight echoes down the whole sequence.

Goes deeper in: LLM Internals → Quantization → The Quantization Process

Related explainers

Gemma 4 QAT — quantization-aware training — the same train-on-the-grid idea on a transformer at 4-bit, the natural starting point before ternary
KVarN — Hadamard rotation to a 2-bit KV cache — extreme low-bit on the other budget: the KV cache, not the weights
LongLive 2.0 — NVFP4 W4A4 training and inference — low precision pushed into both training and the activations, not just the stored weights

Continue in trackLLM Internals — Quantization: how rounding to the grid works

Frequently Asked Questions

Check what you knowMap your AI & GPU knowledge across every track — free, role-based