IBM Granite 4.1 — 8B dense matches the prior 32B MoE

IBM Granite 4.1 hero animation — Granite 4.1 8B dense model versus Granite 4.0 32B-A9B Mixture-of-Experts. Same per-token bandwidth, one quarter the HBM footprint.

The news. On April 29, 2026, IBM Research released Granite 4.1, a family of open-weight foundation models in 3B, 8B, and 30B dense decoder-only sizes under Apache 2.0 — a clean break from the Mixture-of-Experts architecture of Granite 4.0. IBM reports the 8B instruct variant matches or beats the prior Granite 4.0 32B-A9B MoE on tool calling and instruction following, while occupying roughly one quarter of the FP16 model-weight footprint in HBM. Read the release →

Picture the two teams. The dense 8B model is a small full-time team: every one of its eight members shows up for every task and contributes to every decision. The Granite 4.0 32B-A9B MoE is a much larger payroll — 32 people on staff — but only nine of them ever work on any single task. A small router sits at the door and decides, per task, which nine to pull in; the remaining 23 stay at their desks until a different task suits them better. Both teams turn out roughly the same throughput per task, because only nine people are ever moving at once. But the larger team needs four times the office space, four times the chairs, four times the parking — and that office space is GPU memory the operator has to pay for whether anyone is working or not.

In a dense transformer, every weight participates in every token. Per-token cost — both the FLOPs the GPU executes and the bytes it streams through HBM — scales linearly with parameter count. Decode in production serving is overwhelmingly memory-bandwidth-bound: the arithmetic units sit idle waiting for weights to arrive from HBM. A dense 8B model in FP16 occupies 16 GB of HBM and streams those same 16 GB through the compute units for every decoded token, since each layer's weights must be read once per forward pass. Doubling parameters roughly doubles the bandwidth bill — a direct hit to throughput.
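To make the streaming cost concrete, here is a minimal dense FFN sketch in PyTorch. The dimensions are illustrative, not Granite's published configuration; the point is that both weight matrices are read for every token that passes through.

```python
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """Sketch of a dense transformer FFN: no routing, so every weight
    matrix is streamed from HBM for every decoded token."""
    def __init__(self, d_model: int = 4096, d_ff: int = 14336):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)    # read per token
        self.down = nn.Linear(d_ff, d_model, bias=False)  # read per token
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); all parameters touch all tokens
        return self.down(self.act(self.up(x)))
```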

A Mixture-of-Experts (MoE) layer replaces that single dense FFN with many smaller "expert" sub-networks plus a tiny router that selects only a handful of experts per token. The shorthand "32B-A9B" captures the Granite 4.0 shape: 32 B total parameters resident in HBM, but only ~9 B "active" parameters evaluated per token. The remaining ~23 B sit idle in HBM for that token. Per-token bandwidth tracks the active count, not the total — which is the trick that makes MoE attractive for compute-bound workloads. But the full 32 B still has to fit in GPU memory, and that memory is exactly what shrinks the KV cache budget at long context, caps batch size, and forces deployments onto larger accelerators.
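For contrast, a minimal top-k MoE layer, assuming a standard softmax gate over the top-k router scores. Expert count, k, and dimensions here are illustrative, not Granite 4.0's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of a top-k mixture-of-experts FFN. All experts live in
    memory; only k of them run for any given token."""
    def __init__(self, d_model: int = 4096, d_ff: int = 2048,
                 n_experts: int = 16, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff, bias=False),
                          nn.GELU(),
                          nn.Linear(d_ff, d_model, bias=False))
            for _ in range(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model), batch and sequence flattened together
        gate_scores, expert_idx = self.router(x).topk(self.k, dim=-1)
        gate = F.softmax(gate_scores, dim=-1)          # (n_tokens, k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (expert_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # this expert sits idle for the whole batch
            out[token_ids] += gate[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```

Per token, only the k selected experts' matrices are read, so bandwidth tracks the active count; all n_experts stay resident in memory regardless.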

The per-token KV-cache cost multiplies out as follows (Llama-2 7B shown):

K and V           Two vectors stored per token (Key + Value)              × 2
Layers            Each layer has its own cache (like 32 filing cabinets)  × 32
Heads             Each attention head stores its own K/V pair             × 32
Head size         Each K or V vector has 128 numbers (d_head)             × 128
Bytes per number  FP16 = 2 bytes per number (half precision)              × 2

Per token (Llama-2 7B): 2 × 32 × 32 × 128 × 2 = 524,288 bytes ≈ 512 KB
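The same arithmetic as a small calculator; the function and variable names here are mine, not from any published tool.

```python
def kv_bytes_per_token(n_layers: int, n_heads: int, d_head: int,
                       bytes_per_elem: int = 2, kv_vectors: int = 2) -> int:
    """Per-token KV-cache cost: K and V, per layer, per head, per element."""
    return kv_vectors * n_layers * n_heads * d_head * bytes_per_elem

per_tok = kv_bytes_per_token(n_layers=32, n_heads=32, d_head=128)
print(per_tok)                  # 524288 bytes = 512 KiB, matching the table
print(per_tok * 8192 / 2**30)   # 4.0 -> ~4 GiB for one 8K-token sequence
```

At 8K tokens of context, a single sequence already costs ~4 GiB of cache, which is precisely the budget that resident-but-idle MoE weights squeeze.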

The "8B matches 32B-A9B" story is not that dense architecturally beats MoE — it is that architecture is only half of model quality. IBM attributes the 4.1 8B's gain to higher-quality data curation, staged refinement, supervised fine-tuning, and multi-stage reinforcement learning — not to a larger raw token count over the older MoE. Dense models are particularly efficient consumers of training quality because every parameter sees every gradient: there is no router redirecting signal away from "off-duty" experts on most batches.

A worked example sharpens the trade. At an HBM bandwidth ceiling of ~3 TB/s on a single data-center GPU and FP16 weights, a dense 8B streams ~16 GB of weights per decoded token — a ceiling near ~190 decode steps/sec in the bandwidth-bound limit. A 32B-A9B MoE streams ~18 GB of active weights per token — a ceiling near ~165 decode steps/sec. Per-token throughput lands in the same neighborhood. But the 8B fits in ¼ the memory of the 32B MoE, which is a decisive win when memory is the binding constraint — a single mid-tier accelerator can host the 8B alongside a generous KV cache and a larger batch, while the 32B-A9B either needs a bigger GPU or eats into the KV cache budget that batching depends on.
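The same back-of-envelope in code, using the paragraph's assumptions (FP16 weights, ~3 TB/s HBM). These are roofline ceilings, not measured throughput.

```python
HBM_BW = 3.0e12        # bytes/sec, assumed ~3 TB/s single-GPU HBM ceiling
BYTES_PER_PARAM = 2    # FP16

def decode_ceiling_tok_per_s(active_params_billions: float) -> float:
    """Bandwidth-bound decode limit: every active weight byte must be
    streamed from HBM once per decoded token."""
    bytes_per_token = active_params_billions * 1e9 * BYTES_PER_PARAM
    return HBM_BW / bytes_per_token

print(f"dense 8B:     {decode_ceiling_tok_per_s(8):.0f} tok/s")  # ~188
print(f"32B-A9B MoE:  {decode_ceiling_tok_per_s(9):.0f} tok/s")  # ~167
# Resident weights differ 4x, though: 8B x 2 bytes = 16 GB vs 32B x 2 bytes = 64 GB.
```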

Go deeper in: LLM Internals → Transformer Block → Modern Variants & Scale and LLM Internals → KV Cache → Memory

Frequently Asked Questions