LLM Quantization — FP16, INT8, GPTQ, AWQ
What is Quantization?
LLM weights are stored as numbers, and the precision of those numbers directly determines model size and memory usage. A 7B parameter model in FP32 (32 bits per weight) takes 28 GB. In FP16 it's 14 GB. In INT4 it's 3.5 GB — small enough for a laptop. Quantization converts weights from high precision to low precision, trading a small amount of accuracy for dramatically reduced memory. Modern methods like GPTQ, AWQ, and QLoRA are smart about which weights to quantize and how, preserving model quality while shrinking the model 4–8×.
The Size Problem
Large language models store billions of parameters — each a floating-point number. By default, each number uses FP32 (32-bit floating point), which means 4 bytes per weight.
How Much Memory Is That?
A 7B-parameter model in FP32 requires:
- 7B × 4 bytes = 28 GB just to load the weights
That doesn't fit on a single consumer GPU (which typically has 8–24 GB of VRAM). Even on a professional 80 GB A100, the leftover memory allows only a small batch.
Andrej Karpathy recommends starting training in FP32 for correctness — but for inference and large-scale deployment, we need to compress.
The Solution: Use Fewer Bits
Quantization stores each weight using fewer bits. The progression looks like this:
- FP32 — 4 bytes per weight, full precision, the baseline
- FP16 — 2 bytes per weight, half the memory, still floating-point
- INT8 — 1 byte per weight, integer representation, 4× smaller than FP32
- INT4 — 0.5 bytes per weight, only 16 possible values per weight, 8× smaller
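This arithmetic is simple enough to script. A minimal sketch in Python (decimal gigabytes, weights only):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the weights alone — excludes KV cache and activations."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for fmt, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{fmt}: 7B = {weight_memory_gb(7, bits):.1f} GB, "
          f"70B = {weight_memory_gb(70, bits):.0f} GB")
# FP32: 7B = 28.0 GB, 70B = 280 GB
# FP16: 7B = 14.0 GB, 70B = 140 GB
# INT8: 7B = 7.0 GB,  70B = 70 GB
# INT4: 7B = 3.5 GB,  70B = 35 GB
```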
Why It Matters at Scale
A 70B-parameter model in FP32 would require 280 GB — more than three A100s just for weights. Quantization changes the math:
- FP32: 280 GB — impossible on a single node
- FP16: 140 GB — still needs multiple GPUs
- INT8: 70 GB — fits on one 80 GB A100
- INT4: 35 GB — fits with room left for the KV cache
INT4 quantization lets a 70B model fit on a single A100 GPU. Without it, you'd need 4–8 GPUs just to store the weights — before running a single inference step.
The right panel lets you explore how different precision levels affect the actual values stored and the tradeoffs in quality and speed.
How Numbers Shrink
Every precision format makes a different tradeoff between range (how big a number can be) and precision (how finely it can distinguish between nearby numbers).
The Number Line
When you quantize a number, you're forcing it onto a grid. FP32 has an incredibly fine grid — you can represent billions of distinct values in any range. INT4 has only 16 grid points. Everything else gets rounded to the nearest one.
The interactive diagram below shows this. Toggle between precisions to see how the grid thins out.
[Interactive number-line diagram — default view: 32-bit float, virtually continuous]
Notice how INT4's grid points are sparse. Two weights that were meaningfully different in FP32 may round to the same INT4 value — that's quantization error accumulating across billions of weights.
Bit Layouts
Different floating-point formats allocate their bits differently between sign, exponent (range), and mantissa (precision):
The key difference between BF16 and FP16:
- BF16 (Brain Float 16): 1 sign + 8 exponent bits + 7 mantissa bits — same exponent as FP32, so same numerical range, but coarser precision. The "Brain" comes from Google Brain, the AI team that created this format — not biological brains.
- FP16: 1 sign + 5 exponent bits + 10 mantissa bits — more precision within a range, but a much smaller max value (~65,504 vs ~3.4×10³⁸)
BF16 for Training, FP16 for Inference
This distinction matters in practice:
- Training — gradients can be large; FP16 overflows at ~65,504, causing NaN cascades. BF16 handles the same range as FP32 and doesn't overflow. Use BF16 for training.
- Inference — activations are bounded and gradients don't exist. FP16's extra mantissa bits give better precision per value. Many inference frameworks default to FP16.
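You can verify the overflow behaviour directly. A quick check, assuming PyTorch is installed:

```python
import torch

x = torch.tensor(70000.0)       # above FP16's max of ~65,504
print(x.to(torch.float16))      # tensor(inf) — overflow
print(x.to(torch.bfloat16))     # tensor(70144.) — in range, coarsely rounded

# BF16 trades mantissa bits for that range: nearby values collapse sooner.
print(torch.tensor(1.001).to(torch.float16))   # tensor(1.0010) — resolved
print(torch.tensor(1.001).to(torch.bfloat16))  # tensor(1.) — rounded away
```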
INT8 and INT4 drop the floating-point format entirely — they're plain integers. A signed INT8 can represent values from -128 to 127. The quantization process maps the original float range onto this integer range using a scale factor (more on that in the next step).
Try It
In the right panel, select a precision to see how the value 3.14159265... gets rounded and how much error accumulates. Compare FP32 (essentially zero error) with INT4 (error of ±0.5 or more).
The Quantization Process
Quantization isn't just truncation — it's a calibrated mapping from a float range onto an integer grid. The basic formula involves three steps: scale, round, store. Then reverse on the way out.
Scale-and-Round
Let's walk through a concrete example. Say you have 4 weights:
[0.73, -1.24, 2.01, -0.15]
Step 1 — Find the range and compute scale. The weights span from -1.24 to 2.01. For INT8 (-128 to 127), the scale is:
scale = (2.01 - (-1.24)) / 255 ≈ 0.01275
(This simple scheme uses the full min-to-max range to set the scale but no zero-point offset, which is why the largest value will clamp in the next step.)
Step 2 — Divide by scale and round to nearest integer.
- 0.73 / 0.01275 = 57.3 → round → 57
- -1.24 / 0.01275 = -97.3 → round → -97
- 2.01 / 0.01275 = 157.6 → round → 127 (clamped to max)
- -0.15 / 0.01275 = -11.8 → round → -12
Store [57, -97, 127, -12] — just 1 byte each instead of 4.
Step 3 — Dequantize when needed. Multiply back by scale:
- 57 × 0.01275 = 0.727 (original: 0.73 — close!)
- -97 × 0.01275 = -1.237 (original: -1.24 — close!)
- 127 × 0.01275 = 1.619 (original: 2.01 — lost precision, clamped)
- -12 × 0.01275 = -0.153 (original: -0.15 — close!)
The rounding (and clamping) is where precision is lost — and it's permanent. Most values were reconstructed well, but the largest value (2.01) was clamped and lost the most.
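The whole walkthrough fits in a few lines of NumPy — a minimal sketch of this min/max scale-and-round scheme (values match the example up to rounding):

```python
import numpy as np

weights = np.array([0.73, -1.24, 2.01, -0.15], dtype=np.float32)

# Step 1: one scale for the whole tensor, from its min/max range.
scale = (weights.max() - weights.min()) / 255   # ≈ 0.01275 for INT8

# Step 2: divide by scale, round, clamp to the signed INT8 range.
q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
print(q)                # [ 57 -97 127 -12]

# Step 3: dequantize by multiplying the scale back in.
deq = q.astype(np.float32) * scale
print(deq)              # ≈ [ 0.727 -1.237  1.619 -0.153]
print(deq - weights)    # the permanent rounding/clamping error
```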
One Scale or Many Scales?
A model has billions of weights. When computing the scale factor, do you use one scale for all of them, or separate scales for smaller groups?
One scale for everything — the simplest approach. Find the min and max across all billions of weights, compute one scale. The problem: if most weights are between -1 and 1, but a few are around ±50, the scale must cover -50 to 50. The 256 INT8 grid points get spread across that entire range — but 99% of the weights only live in the tiny -1 to 1 slice. Most grid points are wasted on empty space.
It's like setting the volume for an entire concert based on the single loudest moment — the quiet parts become inaudible.
Separate scales for small groups — split the weights into small blocks (e.g., 128 weights per block). Each block gets its own scale based on its own min/max. A block with weights from -0.1 to 0.1 uses all its grid points in that tiny range — high precision. Another block spanning -3 to 3 gets its own scale for that range.
This is called per-group quantization and it's why INT4 works at all. With one global scale, INT4 (only 16 grid points) would be unusable. With per-group scales, each group of 128 weights gets its 16 grid points concentrated where they're needed.
The cost: you store one extra scale number per group. For groups of 128 weights, that's 1 extra number per 128 — a negligible overhead for a massive accuracy improvement.
Per-group scaling is the key insight that makes INT4 viable for large models. Without it, INT4 accuracy would be unacceptable. With groups of 128, the quality drop from INT8 to INT4 is often under 1% on benchmarks.
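Here's a minimal sketch of per-group quantization, using symmetric absmax scales (a common variant; the walkthrough above used min/max instead). Planting a few large weights shows why a single global scale fails:

```python
import numpy as np

def quantize_per_group(w, bits=4, group_size=128):
    """Symmetric absmax quantization with one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for INT4, 127 for INT8
    groups = w.reshape(-1, group_size)            # assumes len(w) divides evenly
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return (q * scales).reshape(-1)

w = np.random.randn(4096).astype(np.float32) * 0.1
w[::512] = 5.0                                     # plant a few large outliers
q_glob, s_glob = quantize_per_group(w, bits=4, group_size=len(w))  # one scale
q_grp, s_grp = quantize_per_group(w, bits=4, group_size=128)       # per-group
print("global-scale error:", np.abs(dequantize(q_glob, s_glob) - w).mean())
print("per-group error:   ", np.abs(dequantize(q_grp, s_grp) - w).mean())
```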
PTQ vs QAT
There are two main moments to apply quantization:
- Post-Training Quantization (PTQ) — quantize after training is done. Fast, no GPU cluster needed, but you're working with weights that weren't optimized for quantization. Most deployment tools use PTQ.
- Quantization-Aware Training (QAT) — simulate quantization error during training so the model learns to be robust to it. Better accuracy, but requires retraining, which is expensive for large models.
For most LLM deployment scenarios, PTQ with per-group scaling (optionally with calibration data) is the practical default.
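To make QAT concrete, here's a minimal sketch of the fake-quantization trick it relies on — quantize in the forward pass, but let gradients flow through via a straight-through estimator (PyTorch assumed):

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Simulate symmetric absmax quantization in the forward pass."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward sees q, backward sees identity.
    return w + (q - w).detach()

# During QAT, a layer would apply fake_quantize(self.weight) in forward().
```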
In the right panel, try switching precisions to see how the quantization error changes — that error is exactly what PTQ accepts in exchange for memory savings.
The Outlier Problem
Per-group scaling makes INT4 viable — but there's a deeper problem that affects even INT8 in large models (6B+ parameters).
Most weights are small, but a few are huge
Imagine looking at all 7 billion weights in a model. Almost all of them — 99.9% — are small numbers close to zero, like 0.02, -0.15, 0.08. But scattered throughout, about 0.1% of weights have values like 42, -55, or 60 — numbers that are hundreds of times larger than the rest.
Nobody designed this. It's a pattern that emerges naturally during training in large models. These unusually large weights are called outliers, and they cause serious problems for quantization.
[Diagram: outliers force the INT8 range to span −60 to 60 — most of the 256 grid slots fall on empty space]
Click a column in the rail above the histogram to see its peak magnitude — most columns max out below 1.5, but a couple sit ~70× higher than a typical column's peak. That's the structural pattern outliers follow: they don't sprinkle randomly through a tensor, they pile into specific feature dimensions.
Toggle the outliers on and off in the diagram above. When outliers are present, the scale factor must expand to cover them. That expansion squashes the quantization grid — now your 256 INT8 grid points are spread across a huge range, and the 99.9% of normal weights that live in the narrow middle lose most of their resolution.
The result: catastrophic accuracy degradation. This is why naive INT8 quantization fails for large models.
The LLM.int8() Solution
Tim Dettmers' LLM.int8() (2022) solved this with a mixed-precision decomposition:
The key insight: outliers cluster in specific feature dimensions (columns) rather than being randomly distributed. So you can:
- Identify which columns contain outlier values
- Keep those columns in FP16 — no quantization error
- Quantize all other columns to INT8
- Run two matrix multiplications and add the results
The overhead is tiny — typically less than 1% of the weights end up in the FP16 path. The normal INT8 path handles 99%+ of the computation at full speed.
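A simplified sketch of the decomposition (the real implementation uses per-row and per-column scales plus fused GPU kernels; the threshold mirrors the paper's activation cutoff):

```python
import numpy as np

def mixed_precision_matmul(x, W, threshold=6.0):
    """LLM.int8()-style split: keep outlier columns in full precision."""
    outlier = np.abs(x).max(axis=0) > threshold         # outlier input dims

    # High-precision path: the few outlier dimensions, left unquantized.
    y_hi = x[:, outlier] @ W[outlier, :]

    # INT8 path: everything else, with simple absmax scales for the sketch.
    x_r, W_r = x[:, ~outlier], W[~outlier, :]
    sx = np.abs(x_r).max() / 127
    sw = np.abs(W_r).max() / 127
    xq = np.clip(np.round(x_r / sx), -127, 127).astype(np.int8)
    Wq = np.clip(np.round(W_r / sw), -127, 127).astype(np.int8)
    y_lo = (xq.astype(np.int32) @ Wq.astype(np.int32)).astype(np.float32) * sx * sw

    return y_hi + y_lo                                  # add the two results

x = np.random.randn(4, 512).astype(np.float32)
x[:, 7] *= 20                                           # one outlier dimension
W = np.random.randn(512, 512).astype(np.float32) * 0.02
print(np.abs(mixed_precision_matmul(x, W) - x @ W).mean())  # small error
```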
LLM.int8() changed the field. Before it, quantizing models beyond 1–2B parameters to INT8 caused unacceptable quality loss. After it, INT8 became essentially free — you could run a 65B model on a single GPU with near-zero accuracy degradation.
The outlier problem also explains why per-tensor quantization fails at large scale. Per-group (from the previous step) addresses part of this — but for activations, which are computed dynamically and can't be pre-calibrated as easily, mixed-precision decomposition is the more robust solution.
Modern Methods
The field moved fast after LLM.int8(). Several methods now dominate practical deployment and fine-tuning. Each makes a different bet on what information to preserve.
GPTQ — Compensating for Errors as You Go
The problem: naive quantization rounds each weight independently. Small errors in each weight add up across millions of weights, degrading quality.
The fix: quantize one column of weights at a time, then adjust the remaining columns to absorb the error just introduced. By the time you finish all columns, the cumulative error is much smaller.
GPTQ uses the Hessian (a measure of how sensitive the output is to each weight) to decide exactly how much to adjust. It requires a small calibration dataset and a few GPU-hours — but no retraining.
AWQ — Protect the Weights That Matter Most
The problem: all weights are treated equally during quantization, but some weights matter far more than others. Weights connected to high-activity channels cause much more output error when rounded.
The fix: identify which channels have the highest activations, then protect the weights connected to those channels with higher precision.
AWQ scales important weights up before quantization (giving them more grid resolution), then scales the activations down to compensate. Fast to apply, often matches or exceeds GPTQ quality.
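A toy version of that equivalence transform, assuming a fixed smoothing exponent `alpha` (the real method searches for per-channel scales on calibration data):

```python
import numpy as np

def int4_roundtrip(w):
    """Symmetric absmax INT4: quantize and immediately dequantize."""
    scale = np.abs(w).max() / 7
    return np.clip(np.round(w / scale), -8, 7) * scale

def awq_style_matmul(x, W, alpha=0.5):
    # Per-input-channel saliency from activation magnitudes (calibration).
    s = np.abs(x).mean(axis=0) ** alpha + 1e-6
    Wq = int4_roundtrip(W * s[:, None])   # salient rows enlarged before rounding
    return (x / s) @ Wq                   # shrink activations to compensate:
                                          # (x / s) @ (s * W) == x @ W exactly
```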
NF4 — A Smarter Grid for Normal Distributions
The problem: standard INT4 spaces its 16 grid points evenly across the range. But neural network weights aren't evenly distributed — they cluster around zero (a bell curve/normal distribution). Evenly-spaced grid points waste resolution on the sparse tails.
The fix: design a 4-bit grid where points are dense near zero and sparse at the edges — matching the shape of the weight distribution.
NF4 is mathematically optimal for normally distributed data — it minimizes the expected quantization error given only 16 possible values. This is the format used by QLoRA.
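You can approximate the idea by placing grid points at normal quantiles and comparing against an evenly spaced grid — a sketch (the real NF4 uses a fixed published table and normalizes each block by its absmax):

```python
import numpy as np
from scipy.stats import norm

# 16 levels at evenly spaced quantiles of a standard normal, scaled to [-1, 1].
levels = norm.ppf(np.linspace(0.02, 0.98, 16))
levels /= np.abs(levels).max()

def snap_to_grid(w, grid):
    return grid[np.abs(w[:, None] - grid[None, :]).argmin(axis=1)]

w = np.random.randn(50_000)
w /= np.abs(w).max()                    # block-normalized, as in NF4
uniform = np.linspace(-1, 1, 16)        # plain evenly spaced INT4-style grid
print("normal-grid error: ", np.abs(snap_to_grid(w, levels) - w).mean())
print("uniform-grid error:", np.abs(snap_to_grid(w, uniform) - w).mean())
```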
QLoRA — Fine-Tuning Without the Memory Cost
The problem: fine-tuning a 65B model normally requires the full model in FP16 (130 GB) plus optimizer states — multiple A100 GPUs.
What is LoRA? When you fine-tune a model, you normally update all its weights. But most of the changes are small and low-rank — meaning they can be approximated by two tiny matrices multiplied together, instead of modifying the full weight matrix. LoRA (Low-Rank Adaptation) adds these two small matrices as "adapters" next to each weight layer. During fine-tuning, only the adapters are trained — the original weights stay frozen. The adapters are typically less than 1% of the model size.
QLoRA's fix: combine LoRA with 4-bit quantization. Freeze the entire model in NF4 (it stays compressed in memory). Attach the small LoRA adapters in FP16 — only these get trained. The model is 4-bit (cheap to store), the adapters are FP16 (accurate to train).
A 65B model fine-tuned with QLoRA fits on a single 48 GB GPU. The main model takes ~33 GB in 4-bit; the adapters add only ~0.2 GB.
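In practice this is a few lines with Hugging Face transformers and peft — an illustrative setup (model name and hyperparameters are examples, not a recipe):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # the NF4 format described above
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,       # also quantize the scales themselves
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)       # frozen 4-bit base + trainable adapters
model.print_trainable_parameters()        # typically <1% of total parameters
```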
GPTQ and AWQ are the go-to methods for deployment — they compress an existing model as small as possible for inference. QLoRA is the go-to for fine-tuning — it makes task-specific adaptation of large models practical on modest hardware.
What to Quantize
Not everything in a model is equally safe to quantize. Different components have different sensitivity to precision loss, and the best deployments use mixed precision — different formats for different parts of the computation.
The Components
Weights are the most common quantization target:
- INT4 or INT8 with per-group scaling (GPTQ, AWQ, NF4)
- Stored compressed, dequantized on-the-fly during computation
- 4–8× memory reduction, modest speed gains
Activations (the outputs flowing between layers) are harder to quantize:
- Can't be calibrated ahead of time — they depend on the input
- Usually kept in FP16 during inference
- Some systems use FP8 activations with careful calibration
The KV Cache (from the KV Cache module — the stored keys and values for all previous tokens) is a major memory consumer during long-context inference:
- FP8 or INT8 KV cache is increasingly common
- Requires careful attention because errors here affect all future tokens
- 2× memory savings for the KV cache, which can dominate at long contexts
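For example, vLLM exposes this as a single engine argument (model name illustrative):

```python
from vllm import LLM

# An FP8 KV cache roughly halves cache memory at long context lengths.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", kv_cache_dtype="fp8")
```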
Sensitive Layers
Not all layers tolerate quantization equally:
- First and last layers — the embedding layer and the final projection to vocabulary logits — are typically kept in FP16. Errors here propagate everywhere.
- Attention layers are more sensitive than FFN layers — attention scores are highly non-linear; small weight errors compound through the softmax.
- Outlier-heavy layers (identified by LLM.int8()) — kept in FP16 or handled with mixed-precision decomposition.
Precision for Training vs Inference
- BF16 for training — same exponent range as FP32 means no overflow on large gradients. Karpathy's llm.c and most modern training frameworks default to BF16 mixed precision.
- FP16 for inference — better mantissa precision for bounded activation values, and hardware support is near-universal.
- Mixed precision compute — a common pattern: weights INT4, compute in FP16 (dequantize weights, multiply in FP16), accumulate in FP32 for correctness. This is what many GPU kernels do.
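The pattern at tensor level, sketched in PyTorch (a single per-tensor scale for brevity; real kernels fuse the dequantize into the matmul, and CPU FP16 matmul needs a recent PyTorch):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Weights live in memory as low-bit integers plus a scale.
q = torch.randint(-8, 8, (4096, 4096), dtype=torch.int8, device=device)
scale = 0.01
x = torch.randn(1, 4096, dtype=torch.float16, device=device)

w = q.to(torch.float16) * scale   # dequantize to FP16 on the fly
y = x @ w                         # FP16 multiply; on GPU, tensor cores
                                  # accumulate this product in FP32
```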
Reading GGUF Names
When you download a quantized model from Hugging Face, the filename tells you the quantization scheme:
Here are real GGUF files from TheBloke/Llama-2-7B-Chat-GGUF on Hugging Face — a popular download for running Llama locally:
- llama-2-7b-chat.Q2_K.gguf — 2.83 GB — 2-bit, smallest but significant quality loss
- llama-2-7b-chat.Q3_K_M.gguf — 3.30 GB — 3-bit medium, high quality loss
- llama-2-7b-chat.Q4_K_M.gguf — 4.08 GB — 4-bit medium, recommended balance of size and quality
- llama-2-7b-chat.Q5_K_M.gguf — 4.78 GB — 5-bit medium, very low quality loss
- llama-2-7b-chat.Q6_K.gguf — 5.53 GB — 6-bit, extremely low quality loss
- llama-2-7b-chat.Q8_0.gguf — 7.16 GB — 8-bit, near-lossless but large
The same 7B model ranges from 2.83 GB to 7.16 GB depending on quantization — a 2.5× difference. Most people use Q4_K_M as the sweet spot.
When you see "Q4_K_M" on a model card, you're reading the quantization recipe: 4-bit weights, K-quant grouping (per-group of ~256 weights), medium quality preset. This naming is now standard across llama.cpp, Ollama, and most local inference tools.
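Fetching one of these locally is a one-liner — a sketch using huggingface_hub (file names from the model card above):

```python
from huggingface_hub import hf_hub_download

# Download the Q4_K_M variant listed above.
path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
)
# Then load it with any GGUF-aware runtime, e.g. llama-cpp-python:
#   from llama_cpp import Llama
#   llm = Llama(model_path=path)
```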