The news. On June 5, 2026, Google released quantization-aware-trained checkpoints for the Gemma 4 family, spanning the compact E2B and E4B edge models up through 12B and larger sizes. Alongside the standard Q4_0 4-bit format, a new mobile schema applies targeted 2-bit quantization to the token-generation layers while keeping the core reasoning layers at higher precision, plus an optimized KV cache and static activations. With the mobile format, Gemma 4 E2B's reported footprint drops to about 1 GB. Checkpoints ship as GGUF for llama.cpp and as compressed tensors for vLLM. Read the announcement →

Picture the singer at a cheap keyboard that has only a handful of keys. The pitches she actually wants to sing live between those keys. The lazy way to record is to sing freely and then auto-tune the take onto the nearest key afterward — and if a note was sitting halfway between two keys, the snap yanks it a long way and the whole phrase sounds sour. That sour snap is post-training quantization: the model finishes training wherever its weights landed, and only then do you round them to the coarse low-bit grid. Quantization-aware training is the disciplined alternative — the singer rehearses on those exact keys the entire time, so every note she learns already lands on one. When you finally record in low fidelity, nothing has to move.

Underneath the metaphor, the "keys" are the grid of values a low-bit format can store, and the "sour snap" is rounding error. Drop a weight from 16-bit down to 4-bit and you go from a near-continuous range to just 16 representable values — so the rounding step has to shove each weight onto the nearest of those few points. PTQ does this once, at the end, to weights that never anticipated it. QAT instead simulates the rounding on every training step (using a straight-through estimator so gradients still flow), so the network learns weights that already sit on the grid — and learns to compensate elsewhere for the little that can't. Toggle the number line below from FP32 down to INT4 and watch the error bracket open up: that gap is what PTQ pays and QAT trains away.

32-bit float — virtually continuous

01234π = 3.14159Quantized: ≈ 3.1416Δ = 0.00000Representable values (0 → 4)

The reason anyone bothers is memory. Take a 2-billion-parameter model (illustrative — E2B is Google's compact "effective-2B" size). At BF16 (2 bytes per weight) the weights alone need 2,000,000,000 × 2 = 4 GB. Round them to 4-bit (about half a byte each) and that's roughly 1 GB — a ~4× shrink. Gemma 4's mobile format then pushes the bulky decode layers down to 2-bit while protecting the reasoning-critical layers, and — with the KV cache and activations optimized on top — Google reports that format brings E2B's footprint to about 1 GB, small enough to run on phone-class hardware. The mixed-precision-by-layer idea is the same one the diagram below shows: spend your bits where the model is fragile, save them where it is robust.

Quantization sensitivity by layer positionEmbeddingFP16errors affect every tokenLayer 1 AttnFP16Layer 1 FFNINT8Layer 2 AttnINT8Layer 2 FFNINT4···INT4middle layers — most robustLayer N FFNINT4Layer N AttnINT8Output projFP16errors affect predictions
high sensitivity (FP16) medium (INT8) low (INT4)
ApproachWhen rounding is applied4-bit accuracyCost to produce
Post-training quantization (PTQ)after training, oncefalls off a cliff at very low bit-width (setup-dependent)cheap — no retraining
Quantization-aware training (QAT)simulated on every training stephigher quality than standard PTQ at 4-bit (Google)needs a training / fine-tune pass

The catch is that QAT is not free: someone has to run that extra training pass, which is why it ships from a lab with the GPUs rather than as a one-line conversion you run at home. That is exactly why a vendor releasing pre-quantized QAT checkpoints matters — Google eats the training cost once, and everyone downloading the GGUF gets 4-bit weights without the accuracy cliff they'd hit by quantizing the model themselves. PTQ still has its place when you can't retrain, and aggressive low-bit work brings its own headaches like outlier weights — but for a model meant to live on a phone, training on the grid is what makes the small size honest.

Goes deeper in: LLM Internals → Quantization → The Quantization Process

Related explainers

Continue in trackLLM Internals — Quantization: how rounding to the grid works

Frequently Asked Questions