The news. On June 5, 2026, Google released quantization-aware-trained checkpoints for the Gemma 4 family, spanning the compact E2B and E4B edge models up through 12B and larger sizes. Alongside the standard Q4_0 4-bit format, a new mobile schema applies targeted 2-bit quantization to the token-generation layers while keeping the core reasoning layers at higher precision, plus an optimized KV cache and static activations. With the mobile format, Gemma 4 E2B's reported footprint drops to about 1 GB. Checkpoints ship as GGUF for
llama.cppand as compressed tensors for vLLM. Read the announcement →
Picture the singer at a cheap keyboard that has only a handful of keys. The pitches she actually wants to sing live between those keys. The lazy way to record is to sing freely and then auto-tune the take onto the nearest key afterward — and if a note was sitting halfway between two keys, the snap yanks it a long way and the whole phrase sounds sour. That sour snap is post-training quantization: the model finishes training wherever its weights landed, and only then do you round them to the coarse low-bit grid. Quantization-aware training is the disciplined alternative — the singer rehearses on those exact keys the entire time, so every note she learns already lands on one. When you finally record in low fidelity, nothing has to move.
Underneath the metaphor, the "keys" are the grid of values a low-bit format can store, and the "sour snap" is rounding error. Drop a weight from 16-bit down to 4-bit and you go from a near-continuous range to just 16 representable values — so the rounding step has to shove each weight onto the nearest of those few points. PTQ does this once, at the end, to weights that never anticipated it. QAT instead simulates the rounding on every training step (using a straight-through estimator so gradients still flow), so the network learns weights that already sit on the grid — and learns to compensate elsewhere for the little that can't. Toggle the number line below from FP32 down to INT4 and watch the error bracket open up: that gap is what PTQ pays and QAT trains away.
32-bit float — virtually continuous
The reason anyone bothers is memory. Take a 2-billion-parameter model (illustrative — E2B is Google's compact "effective-2B" size). At BF16 (2 bytes per weight) the weights alone need 2,000,000,000 × 2 = 4 GB. Round them to 4-bit (about half a byte each) and that's roughly 1 GB — a ~4× shrink. Gemma 4's mobile format then pushes the bulky decode layers down to 2-bit while protecting the reasoning-critical layers, and — with the KV cache and activations optimized on top — Google reports that format brings E2B's footprint to about 1 GB, small enough to run on phone-class hardware. The mixed-precision-by-layer idea is the same one the diagram below shows: spend your bits where the model is fragile, save them where it is robust.
| Approach | When rounding is applied | 4-bit accuracy | Cost to produce |
|---|---|---|---|
| Post-training quantization (PTQ) | after training, once | falls off a cliff at very low bit-width (setup-dependent) | cheap — no retraining |
| Quantization-aware training (QAT) | simulated on every training step | higher quality than standard PTQ at 4-bit (Google) | needs a training / fine-tune pass |
The catch is that QAT is not free: someone has to run that extra training pass, which is why it ships from a lab with the GPUs rather than as a one-line conversion you run at home. That is exactly why a vendor releasing pre-quantized QAT checkpoints matters — Google eats the training cost once, and everyone downloading the GGUF gets 4-bit weights without the accuracy cliff they'd hit by quantizing the model themselves. PTQ still has its place when you can't retrain, and aggressive low-bit work brings its own headaches like outlier weights — but for a model meant to live on a phone, training on the grid is what makes the small size honest.
Goes deeper in: LLM Internals → Quantization → The Quantization Process
Related explainers
- LongLive 2.0 — NVFP4 W4A4 training and inference — the same "train at low precision, not just serve at it" idea, taken all the way to 4-bit activations
- QCA — outlier injection for PTQ — the failure mode on the other side: what makes post-training quantization break, and how PTQ fights back without retraining
- MobileMoE — DRAM-aware scaling — the other half of "fits on a phone": shrinking the active memory of a mixture-of-experts model, not its bit-width