LLM Quantization — FP16, INT8, GPTQ, AWQ
What is Quantization?
LLM weights are stored as numbers, and the precision of those numbers directly determines model size and memory usage. A 7B parameter model in FP32 (32 bits per weight) takes 28 GB. In FP16 it's 14 GB. In INT4 it's 3.5 GB — small enough for a laptop. Quantization converts weights from high precision to low precision, trading a small amount of accuracy for dramatically reduced memory. Modern methods like GPTQ, AWQ, and QLoRA are smart about which weights to quantize and how, preserving model quality while shrinking the model 4–8×.
The Size Problem
Large language models store billions of parameters — each a floating-point number. By default, each number uses FP32 (32-bit floating point), which means 4 bytes per weight.
How Much Memory Is That?
A 7B-parameter model in FP32 requires:
- 7B × 4 bytes = 28 GB just to load the weights
That doesn't fit on a single consumer GPU (which typically has 8–24 GB of VRAM). Even on a professional 80 GB A100, the leftover memory allows only a small batch.
Andrej Karpathy recommends starting training in FP32 for correctness — but for inference and large-scale deployment, we need to compress.
The Solution: Use Fewer Bits
Quantization stores each weight using fewer bits. The progression looks like this:
- FP32 — 4 bytes per weight, full precision, the baseline
- FP16 — 2 bytes per weight, half the memory, still floating-point
- INT8 — 1 byte per weight, integer representation, 4× smaller than FP32
- INT4 — 0.5 bytes per weight, only 16 possible values per weight, 8× smaller
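This arithmetic is simple enough to script. A minimal sketch in Python (decimal gigabytes, weights only):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the weights alone — excludes KV cache and activations."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for fmt, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{fmt}: 7B = {weight_memory_gb(7, bits):.1f} GB, "
          f"70B = {weight_memory_gb(70, bits):.0f} GB")
# FP32: 7B = 28.0 GB, 70B = 280 GB
# FP16: 7B = 14.0 GB, 70B = 140 GB
# INT8: 7B = 7.0 GB,  70B = 70 GB
# INT4: 7B = 3.5 GB,  70B = 35 GB
```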
Why It Matters at Scale
A 70B-parameter model in FP32 would require 280 GB — more than three A100s just for weights. Quantization changes the math:
- FP32: 280 GB — impossible on a single node
- FP16: 140 GB — still needs multiple GPUs
- INT8: 70 GB — fits on one 80 GB A100
- INT4: 35 GB — fits with room left for the KV cache
INT4 quantization lets a 70B model fit on a single A100 GPU. Without it, you'd need 4–8 GPUs just to store the weights — before running a single inference step.
The right panel lets you explore how different precision levels affect the actual values stored and the tradeoffs in quality and speed.
How Numbers Shrink
Every precision format makes a different tradeoff between range (how big a number can be) and precision (how finely it can distinguish between nearby numbers).
The Number Line
When you quantize a number, you're forcing it onto a grid. FP32 has an incredibly fine grid — you can represent billions of distinct values in any range. INT4 has only 16 grid points. Everything else gets rounded to the nearest one.
The interactive diagram below shows this. Toggle between precisions to see how the grid thins out.
[Interactive number-line diagram — default view: 32-bit float, virtually continuous]
Notice how INT4's grid points are sparse. Two weights that were meaningfully different in FP32 may round to the same INT4 value — that's quantization error accumulating across billions of weights.
Bit Layouts
Different floating-point formats allocate their bits differently between sign, exponent (range), and mantissa (precision):
The key difference between BF16 and FP16:
- BF16 (Brain Float 16): 1 sign + 8 exponent bits + 7 mantissa bits — same exponent as FP32, so same numerical range, but coarser precision. The "Brain" comes from Google Brain, the AI team that created this format — not biological brains.
- FP16: 1 sign + 5 exponent bits + 10 mantissa bits — more precision within a range, but a much smaller max value (~65,504 vs ~3.4×10³⁸)
BF16 for Training, FP16 for Inference
This distinction matters in practice:
- Training — gradients can be large; FP16 overflows at ~65,504, causing NaN cascades. BF16 handles the same range as FP32 and doesn't overflow. Use BF16 for training.
- Inference — activations are bounded and gradients don't exist. FP16's extra mantissa bits give better precision per value. Many inference frameworks default to FP16.
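You can verify the overflow behaviour directly. A quick check, assuming PyTorch is installed:

```python
import torch

x = torch.tensor(70000.0)       # above FP16's max of ~65,504
print(x.to(torch.float16))      # tensor(inf) — overflow
print(x.to(torch.bfloat16))     # tensor(70144.) — in range, coarsely rounded

# BF16 trades mantissa bits for that range: nearby values collapse sooner.
print(torch.tensor(1.001).to(torch.float16))   # tensor(1.0010) — resolved
print(torch.tensor(1.001).to(torch.bfloat16))  # tensor(1.) — rounded away
```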
INT8 and INT4 drop the floating-point format entirely — they're plain integers. A signed INT8 can represent values from -128 to 127. The quantization process maps the original float range onto this integer range using a scale factor (more on that in the next step).
Try It
In the right panel, select a precision to see how the value 3.14159265... gets rounded and how much error accumulates. Compare FP32 (essentially zero error) with INT4 (error of ±0.5 or more).
The Quantization Process
Quantization isn't just truncation — it's a calibrated mapping from a float range onto an integer grid. The basic formula involves three steps: scale, round, store. Then reverse on the way out.
Scale-and-Round
Let's walk through a concrete example. Say you have 4 weights:
[0.73, -1.24, 2.01, -0.15]
Step 1 — Find the range and compute scale. The weights span from -1.24 to 2.01. For INT8 (-128 to 127), the scale is:
scale = (2.01 - (-1.24)) / 255 ≈ 0.01275
(This simple scheme uses the full min-to-max range to set the scale but no zero-point offset, which is why the largest value will clamp in the next step.)
Step 2 — Divide by scale and round to nearest integer.
- 0.73 / 0.01275 = 57.3 → round → 57
- -1.24 / 0.01275 = -97.3 → round → -97
- 2.01 / 0.01275 = 157.6 → round → 127 (clamped to max)
- -0.15 / 0.01275 = -11.8 → round → -12
Store [57, -97, 127, -12] — just 1 byte each instead of 4.
Step 3 — Dequantize when needed. Multiply back by scale:
- 57 × 0.01275 = 0.727 (original: 0.73 — close!)
- -97 × 0.01275 = -1.237 (original: -1.24 — close!)
- 127 × 0.01275 = 1.619 (original: 2.01 — lost precision, clamped)
- -12 × 0.01275 = -0.153 (original: -0.15 — close!)
The rounding (and clamping) is where precision is lost — and it's permanent. Most values were reconstructed well, but the largest value (2.01) was clamped and lost the most.
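The whole walkthrough fits in a few lines of NumPy — a minimal sketch of this min/max scale-and-round scheme (values match the example up to rounding):

```python
import numpy as np

weights = np.array([0.73, -1.24, 2.01, -0.15], dtype=np.float32)

# Step 1: one scale for the whole tensor, from its min/max range.
scale = (weights.max() - weights.min()) / 255   # ≈ 0.01275 for INT8

# Step 2: divide by scale, round, clamp to the signed INT8 range.
q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
print(q)                # [ 57 -97 127 -12]

# Step 3: dequantize by multiplying the scale back in.
deq = q.astype(np.float32) * scale
print(deq)              # ≈ [ 0.727 -1.237  1.619 -0.153]
print(deq - weights)    # the permanent rounding/clamping error
```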
One Scale or Many Scales?
A model has billions of weights. When computing the scale factor, do you use one scale for all of them, or separate scales for smaller groups?
One scale for everything — the simplest approach. Find the min and max across all billions of weights, compute one scale. The problem: if most weights are between -1 and 1, but a few are around ±50, the scale must cover -50 to 50. The 256 INT8 grid points get spread across that entire range — but 99% of the weights only live in the tiny -1 to 1 slice. Most grid points are wasted on empty space.
It's like setting the volume for an entire concert based on the single loudest moment — the quiet parts become inaudible.
Separate scales for small groups — split the weights into small blocks (e.g., 128 weights per block). Each block gets its own scale based on its own min/max. A block with weights from -0.1 to 0.1 uses all its grid points in that tiny range — high precision. Another block spanning -3 to 3 gets its own scale for that range.
This is called per-group quantization and it's why INT4 works at all. With one global scale, INT4 (only 16 grid points) would be unusable. With per-group scales, each group of 128 weights gets its 16 grid points concentrated where they're needed.
The cost: you store one extra scale number per group. For groups of 128 weights, that's 1 extra number per 128 — a negligible overhead for a massive accuracy improvement.
Per-group scaling is the key insight that makes INT4 viable for large models. Without it, INT4 accuracy would be unacceptable. With groups of 128, the quality drop from INT8 to INT4 is often under 1% on benchmarks.
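Here's a minimal sketch of per-group quantization, using symmetric absmax scales (a common variant; the walkthrough above used min/max instead). Planting a few large weights shows why a single global scale fails:

```python
import numpy as np

def quantize_per_group(w, bits=4, group_size=128):
    """Symmetric absmax quantization with one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for INT4, 127 for INT8
    groups = w.reshape(-1, group_size)            # assumes len(w) divides evenly
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return (q * scales).reshape(-1)

w = np.random.randn(4096).astype(np.float32) * 0.1
w[::512] = 5.0                                     # plant a few large outliers
q_glob, s_glob = quantize_per_group(w, bits=4, group_size=len(w))  # one scale
q_grp, s_grp = quantize_per_group(w, bits=4, group_size=128)       # per-group
print("global-scale error:", np.abs(dequantize(q_glob, s_glob) - w).mean())
print("per-group error:   ", np.abs(dequantize(q_grp, s_grp) - w).mean())
```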
PTQ vs QAT
There are two main moments to apply quantization:
- Post-Training Quantization (PTQ) — quantize after training is done. Fast, no GPU cluster needed, but you're working with weights that weren't optimized for quantization. Most deployment tools use PTQ.
- Quantization-Aware Training (QAT) — simulate quantization error during training so the model learns to be robust to it. Better accuracy, but requires retraining, which is expensive for large models.
For most LLM deployment scenarios, PTQ with per-group scaling (optionally with calibration data) is the practical default.
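To make QAT concrete, here's a minimal sketch of the fake-quantization trick it relies on — quantize in the forward pass, but let gradients flow through via a straight-through estimator (PyTorch assumed):

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Simulate symmetric absmax quantization in the forward pass."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward sees q, backward sees identity.
    return w + (q - w).detach()

# During QAT, a layer would apply fake_quantize(self.weight) in forward().
```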
In the right panel, try switching precisions to see how the quantization error changes — that error is exactly what PTQ accepts in exchange for memory savings.
The Outlier Problem
Per-group scaling makes INT4 viable — but there's a deeper problem that affects even INT8 in large models (6B+ parameters).
Most weights are small, but a few are huge
Imagine looking at all 7 billion weights in a model. Almost all of them — 99.9% — are small numbers close to zero, like 0.02, -0.15, 0.08. But scattered throughout, about 0.1% of weights have values like 42, -55, or 60 — numbers that are hundreds of times larger than the rest.
Nobody designed this. It's a pattern that emerges naturally during training in large models. These unusually large weights are called outliers, and they cause serious problems for quantization.
[Diagram: outliers force the INT8 range to span −60 to 60 — most of the 256 grid slots fall on empty space]
Click a column in the rail above the histogram to see its peak magnitude — most columns max out below 1.5, but a couple sit ~70× higher than a typical column's peak. That's the structural pattern outliers follow: they don't sprinkle randomly through a tensor, they pile into specific feature dimensions.
Toggle the outliers on and off in the diagram above. When outliers are present, the scale factor must expand to cover them. That expansion squashes the quantization grid — now your 256 INT8 grid points are spread across a huge range, and the 99.9% of normal weights that live in the narrow middle lose most of their resolution.
The result: catastrophic accuracy degradation. This is why naive INT8 quantization fails for large models.
The LLM.int8() Solution
Tim Dettmers' LLM.int8() (2022) solved this with a mixed-precision decomposition:
The key insight: outliers cluster in specific feature dimensions (columns) rather than being randomly distributed. So you can:
- Identify which columns contain outlier values
- Keep those columns in FP16 — no quantization error
- Quantize all other columns to INT8
- Run two matrix multiplications and add the results
The overhead is tiny — typically less than 1% of the weights end up in the FP16 path. The normal INT8 path handles 99%+ of the computation at full speed.
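A simplified sketch of the decomposition (the real implementation uses per-row and per-column scales plus fused GPU kernels; the threshold mirrors the paper's activation cutoff):

```python
import numpy as np

def mixed_precision_matmul(x, W, threshold=6.0):
    """LLM.int8()-style split: keep outlier columns in full precision."""
    outlier = np.abs(x).max(axis=0) > threshold         # outlier input dims

    # High-precision path: the few outlier dimensions, left unquantized.
    y_hi = x[:, outlier] @ W[outlier, :]

    # INT8 path: everything else, with simple absmax scales for the sketch.
    x_r, W_r = x[:, ~outlier], W[~outlier, :]
    sx = np.abs(x_r).max() / 127
    sw = np.abs(W_r).max() / 127
    xq = np.clip(np.round(x_r / sx), -127, 127).astype(np.int8)
    Wq = np.clip(np.round(W_r / sw), -127, 127).astype(np.int8)
    y_lo = (xq.astype(np.int32) @ Wq.astype(np.int32)).astype(np.float32) * sx * sw

    return y_hi + y_lo                                  # add the two results

x = np.random.randn(4, 512).astype(np.float32)
x[:, 7] *= 20                                           # one outlier dimension
W = np.random.randn(512, 512).astype(np.float32) * 0.02
print(np.abs(mixed_precision_matmul(x, W) - x @ W).mean())  # small error
```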
LLM.int8() changed the field. Before it, quantizing models beyond 1–2B parameters to INT8 caused unacceptable quality loss. After it, INT8 became essentially free — you could run a 65B model on a single GPU with near-zero accuracy degradation.
The outlier problem also explains why per-tensor quantization fails at large scale. Per-group (from the previous step) addresses part of this — but for activations, which are computed dynamically and can't be pre-calibrated as easily, mixed-precision decomposition is the more robust solution.
Modern Methods
The field moved fast after LLM.int8(). Several methods now dominate practical deployment and fine-tuning. Each makes a different bet on what information to preserve.
GPTQ — Compensating for Errors as You Go
The problem: naive quantization rounds each weight independently. Small errors in each weight add up across millions of weights, degrading quality.
The fix: quantize one column of weights at a time, then adjust the remaining columns to absorb the error just introduced. By the time you finish all columns, the cumulative error is much smaller.
GPTQ uses the Hessian (a measure of how sensitive the output is to each weight) to decide exactly how much to adjust. It requires a small calibration dataset and a few GPU-hours — but no retraining.
AWQ — Protect the Weights That Matter Most
The problem: all weights are treated equally during quantization, but some weights matter far more than others. Weights connected to high-activity channels cause much more output error when rounded.
The fix: identify which channels have the highest activations, then protect the weights connected to those channels with higher precision.
AWQ scales important weights up before quantization (giving them more grid resolution), then scales the activations down to compensate. Fast to apply, often matches or exceeds GPTQ quality.
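A toy version of that equivalence transform, assuming a fixed smoothing exponent `alpha` (the real method searches for per-channel scales on calibration data):

```python
import numpy as np

def int4_roundtrip(w):
    """Symmetric absmax INT4: quantize and immediately dequantize."""
    scale = np.abs(w).max() / 7
    return np.clip(np.round(w / scale), -8, 7) * scale

def awq_style_matmul(x, W, alpha=0.5):
    # Per-input-channel saliency from activation magnitudes (calibration).
    s = np.abs(x).mean(axis=0) ** alpha + 1e-6
    Wq = int4_roundtrip(W * s[:, None])   # salient rows enlarged before rounding
    return (x / s) @ Wq                   # shrink activations to compensate:
                                          # (x / s) @ (s * W) == x @ W exactly
```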
NF4 — A Smarter Grid for Normal Distributions
The problem: standard INT4 spaces its 16 grid points evenly across the range. But neural network weights aren't evenly distributed — they cluster around zero (a bell curve/normal distribution). Evenly-spaced grid points waste resolution on the sparse tails.
The fix: design a 4-bit grid where points are dense near zero and sparse at the edges — matching the shape of the weight distribution.
NF4 is mathematically optimal for normally distributed data — it minimizes the expected quantization error given only 16 possible values. This is the format used by QLoRA.
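You can approximate the idea by placing grid points at normal quantiles and comparing against an evenly spaced grid — a sketch (the real NF4 uses a fixed published table and normalizes each block by its absmax):

```python
import numpy as np
from scipy.stats import norm

# 16 levels at evenly spaced quantiles of a standard normal, scaled to [-1, 1].
levels = norm.ppf(np.linspace(0.02, 0.98, 16))
levels /= np.abs(levels).max()

def snap_to_grid(w, grid):
    return grid[np.abs(w[:, None] - grid[None, :]).argmin(axis=1)]

w = np.random.randn(50_000)
w /= np.abs(w).max()                    # block-normalized, as in NF4
uniform = np.linspace(-1, 1, 16)        # plain evenly spaced INT4-style grid
print("normal-grid error: ", np.abs(snap_to_grid(w, levels) - w).mean())
print("uniform-grid error:", np.abs(snap_to_grid(w, uniform) - w).mean())
```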
QLoRA — Fine-Tuning Without the Memory Cost
The problem: fine-tuning a 65B model normally requires the full model in FP16 (130 GB) plus optimizer states — multiple A100 GPUs.
What is LoRA? When you fine-tune a model, you normally update all its weights. But most of the changes are small and low-rank — meaning they can be approximated by two tiny matrices multiplied together, instead of modifying the full weight matrix. LoRA (Low-Rank Adaptation) adds these two small matrices as "adapters" next to each weight layer. During fine-tuning, only the adapters are trained — the original weights stay frozen. The adapters are typically less than 1% of the model size.
QLoRA's fix: combine LoRA with 4-bit quantization. Freeze the entire model in NF4 (it stays compressed in memory). Attach the small LoRA adapters in FP16 — only these get trained. The model is 4-bit (cheap to store), the adapters are FP16 (accurate to train).
A 65B model fine-tuned with QLoRA fits on a single 48 GB GPU. The main model takes ~33 GB in 4-bit; the adapters add only ~0.2 GB.
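In practice this is a few lines with Hugging Face transformers and peft — an illustrative setup (model name and hyperparameters are examples, not a recipe):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # the NF4 format described above
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,       # also quantize the scales themselves
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)       # frozen 4-bit base + trainable adapters
model.print_trainable_parameters()        # typically <1% of total parameters
```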
GPTQ and AWQ are the go-to methods for deployment — they compress an existing model as small as possible for inference. QLoRA is the go-to for fine-tuning — it makes task-specific adaptation of large models practical on modest hardware.
What to Quantize
Not everything in a model is equally safe to quantize. Different components have different sensitivity to precision loss, and the best deployments use mixed precision — different formats for different parts of the computation.
The Components
Weights are the most common quantization target:
- INT4 or INT8 with per-group scaling (GPTQ, AWQ, NF4)
- Stored compressed, dequantized on-the-fly during computation
- 4–8× memory reduction, modest speed gains
Activations (the outputs flowing between layers) are harder to quantize:
- Can't be calibrated ahead of time — they depend on the input
- Usually kept in FP16 during inference
- Some systems use FP8 activations with careful calibration
The KV Cache (from the KV Cache module — the stored keys and values for all previous tokens) is a major memory consumer during long-context inference:
- FP8 or INT8 KV cache is increasingly common
- Requires careful attention because errors here affect all future tokens
- 2× memory savings for the KV cache, which can dominate at long contexts
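For example, vLLM exposes this as a single engine argument (model name illustrative):

```python
from vllm import LLM

# An FP8 KV cache roughly halves cache memory at long context lengths.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", kv_cache_dtype="fp8")
```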
Sensitive Layers
Not all layers tolerate quantization equally:
- First and last layers — the embedding layer and the final projection to vocabulary logits — are typically kept in FP16. Errors here propagate everywhere.
- Attention layers are more sensitive than FFN layers — attention scores are highly non-linear; small weight errors compound through the softmax.
- Outlier-heavy layers (identified by LLM.int8()) — kept in FP16 or handled with mixed-precision decomposition.
Precision for Training vs Inference
- BF16 for training — same exponent range as FP32 means no overflow on large gradients. Karpathy's llm.c and most modern training frameworks default to BF16 mixed precision.
- FP16 for inference — better mantissa precision for bounded activation values, and hardware support is near-universal.
- Mixed precision compute — a common pattern: weights INT4, compute in FP16 (dequantize weights, multiply in FP16), accumulate in FP32 for correctness. This is what many GPU kernels do.
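The pattern at tensor level, sketched in PyTorch (a single per-tensor scale for brevity; real kernels fuse the dequantize into the matmul, and CPU FP16 matmul needs a recent PyTorch):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Weights live in memory as low-bit integers plus a scale.
q = torch.randint(-8, 8, (4096, 4096), dtype=torch.int8, device=device)
scale = 0.01
x = torch.randn(1, 4096, dtype=torch.float16, device=device)

w = q.to(torch.float16) * scale   # dequantize to FP16 on the fly
y = x @ w                         # FP16 multiply; on GPU, tensor cores
                                  # accumulate this product in FP32
```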
Reading GGUF Names
When you download a quantized model from Hugging Face, the filename tells you the quantization scheme:
Here are real GGUF files from TheBloke/Llama-2-7B-Chat-GGUF on Hugging Face — a popular download for running Llama locally:
- llama-2-7b-chat.Q2_K.gguf — 2.83 GB — 2-bit, smallest but significant quality loss
- llama-2-7b-chat.Q3_K_M.gguf — 3.30 GB — 3-bit medium, high quality loss
- llama-2-7b-chat.Q4_K_M.gguf — 4.08 GB — 4-bit medium, recommended balance of size and quality
- llama-2-7b-chat.Q5_K_M.gguf — 4.78 GB — 5-bit medium, very low quality loss
- llama-2-7b-chat.Q6_K.gguf — 5.53 GB — 6-bit, extremely low quality loss
- llama-2-7b-chat.Q8_0.gguf — 7.16 GB — 8-bit, near-lossless but large
The same 7B model ranges from 2.83 GB to 7.16 GB depending on quantization — a 2.5× difference. Most people use Q4_K_M as the sweet spot.
When you see "Q4_K_M" on a model card, you're reading the quantization recipe: 4-bit weights, K-quant grouping (per-group of ~256 weights), medium quality preset. This naming is now standard across llama.cpp, Ollama, and most local inference tools.
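Fetching one of these locally is a one-liner — a sketch using huggingface_hub (file names from the model card above):

```python
from huggingface_hub import hf_hub_download

# Download the Q4_K_M variant listed above.
path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
)
# Then load it with any GGUF-aware runtime, e.g. llama-cpp-python:
#   from llama_cpp import Llama
#   llm = Llama(model_path=path)
```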