IBM Granite 4.1 — 8B dense matches the prior 32B MoE

IBM Granite 4.1 hero animation — Granite 4.1 8B dense model versus Granite 4.0 32B-A9B Mixture-of-Experts. Same per-token bandwidth, one quarter the HBM footprint.

The news. On April 29, 2026, IBM Research released Granite 4.1, a family of open-weight foundation models in 3B, 8B, and 30B dense decoder-only sizes under Apache 2.0 — a clean break from the Mixture-of-Experts architecture of Granite 4.0. IBM reports the 8B instruct variant matches or beats the prior Granite 4.0 32B-A9B MoE on tool calling and instruction following, while occupying roughly one quarter of the FP16 model-weight footprint in HBM. Read the release →

Picture the two teams. The dense 8B model is a small full-time team: every one of its eight members shows up for every task and contributes to every decision. The Granite 4.0 32B-A9B MoE is a much larger payroll — 32 people on staff — but only nine of them ever work on any single task. A small router sits at the door and decides, per task, which nine to pull in; the remaining 23 stay at their desks until a different task suits them better. Both teams turn out roughly the same throughput per task, because only nine people are ever moving at once. But the larger team needs four times the office space, four times the chairs, four times the parking — and that office space is GPU memory the operator has to pay for whether anyone is working or not.

In a dense transformer, every weight participates in every token. Per-token cost — both the FLOPs the GPU executes and the bytes it streams through HBM — scales linearly with parameter count. Decode in production serving is overwhelmingly memory-bandwidth-bound: the arithmetic units sit idle waiting for weights to arrive from HBM. A dense 8B model in FP16 occupies 16 GB of HBM and streams those same 16 GB through the compute units for every decoded token, since each layer's weights must be read once per forward pass. Doubling parameters roughly doubles the bandwidth bill — a direct hit to throughput.
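To make the streaming cost concrete, here is a minimal dense FFN sketch in PyTorch. The dimensions are illustrative, not Granite's published configuration; the point is that both weight matrices are read for every token that passes through.

```python
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """Sketch of a dense transformer FFN: no routing, so every weight
    matrix is streamed from HBM for every decoded token."""
    def __init__(self, d_model: int = 4096, d_ff: int = 14336):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)    # read per token
        self.down = nn.Linear(d_ff, d_model, bias=False)  # read per token
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); all parameters touch all tokens
        return self.down(self.act(self.up(x)))
```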

A Mixture-of-Experts (MoE) layer replaces that single dense FFN with many smaller "expert" sub-networks plus a tiny router that selects only a handful of experts per token. The shorthand "32B-A9B" captures the Granite 4.0 shape: 32 B total parameters resident in HBM, but only ~9 B "active" parameters evaluated per token. The remaining ~23 B sit idle in HBM for that token. Per-token bandwidth tracks the active count, not the total — which is the trick that makes MoE attractive for compute-bound workloads. But the full 32 B still has to fit in GPU memory, and that memory is exactly what shrinks the KV cache budget at long context, caps batch size, and forces deployments onto larger accelerators.
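For contrast, a minimal top-k MoE layer, assuming a standard softmax gate over the top-k router scores. Expert count, k, and dimensions here are illustrative, not Granite 4.0's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of a top-k mixture-of-experts FFN. All experts live in
    memory; only k of them run for any given token."""
    def __init__(self, d_model: int = 4096, d_ff: int = 2048,
                 n_experts: int = 16, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff, bias=False),
                          nn.GELU(),
                          nn.Linear(d_ff, d_model, bias=False))
            for _ in range(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model), batch and sequence flattened together
        gate_scores, expert_idx = self.router(x).topk(self.k, dim=-1)
        gate = F.softmax(gate_scores, dim=-1)          # (n_tokens, k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (expert_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # this expert sits idle for the whole batch
            out[token_ids] += gate[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```

Per token, only the k selected experts' matrices are read, so bandwidth tracks the active count; all n_experts stay resident in memory regardless.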

The per-token KV-cache cost multiplies out as follows (Llama-2 7B shown):

K and V           Two vectors stored per token (Key + Value)              × 2
Layers            Each layer has its own cache (like 32 filing cabinets)  × 32
Heads             Each attention head stores its own K/V pair             × 32
Head size         Each K or V vector has 128 numbers (d_head)             × 128
Bytes per number  FP16 = 2 bytes per number (half precision)              × 2

Per token (Llama-2 7B): 2 × 32 × 32 × 128 × 2 = 524,288 bytes ≈ 512 KB
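The same arithmetic as a small calculator; the function and variable names here are mine, not from any published tool.

```python
def kv_bytes_per_token(n_layers: int, n_heads: int, d_head: int,
                       bytes_per_elem: int = 2, kv_vectors: int = 2) -> int:
    """Per-token KV-cache cost: K and V, per layer, per head, per element."""
    return kv_vectors * n_layers * n_heads * d_head * bytes_per_elem

per_tok = kv_bytes_per_token(n_layers=32, n_heads=32, d_head=128)
print(per_tok)                  # 524288 bytes = 512 KiB, matching the table
print(per_tok * 8192 / 2**30)   # 4.0 -> ~4 GiB for one 8K-token sequence
```

At 8K tokens of context, a single sequence already costs ~4 GiB of cache, which is precisely the budget that resident-but-idle MoE weights squeeze.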

The "8B matches 32B-A9B" story is not that dense architecturally beats MoE — it is that architecture is only half of model quality. IBM attributes the 4.1 8B's gain to higher-quality data curation, staged refinement, supervised fine-tuning, and multi-stage reinforcement learning — not to a larger raw token count over the older MoE. Dense models are particularly efficient consumers of training quality because every parameter sees every gradient: there is no router redirecting signal away from "off-duty" experts on most batches.

A worked example sharpens the trade. At an HBM bandwidth ceiling of ~3 TB/s on a single data-center GPU and FP16 weights, a dense 8B streams ~16 GB of weights per decoded token — a ceiling near ~190 decode steps/sec in the bandwidth-bound limit. A 32B-A9B MoE streams ~18 GB of active weights per token — a ceiling near ~165 decode steps/sec. Per-token throughput lands in the same neighborhood. But the 8B fits in ¼ the memory of the 32B MoE, which is a decisive win when memory is the binding constraint — a single mid-tier accelerator can host the 8B alongside a generous KV cache and a larger batch, while the 32B-A9B either needs a bigger GPU or eats into the KV cache budget that batching depends on.
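The same back-of-envelope in code, using the paragraph's assumptions (FP16 weights, ~3 TB/s HBM). These are roofline ceilings, not measured throughput.

```python
HBM_BW = 3.0e12        # bytes/sec, assumed ~3 TB/s single-GPU HBM ceiling
BYTES_PER_PARAM = 2    # FP16

def decode_ceiling_tok_per_s(active_params_billions: float) -> float:
    """Bandwidth-bound decode limit: every active weight byte must be
    streamed from HBM once per decoded token."""
    bytes_per_token = active_params_billions * 1e9 * BYTES_PER_PARAM
    return HBM_BW / bytes_per_token

print(f"dense 8B:     {decode_ceiling_tok_per_s(8):.0f} tok/s")  # ~188
print(f"32B-A9B MoE:  {decode_ceiling_tok_per_s(9):.0f} tok/s")  # ~167
# Resident weights differ 4x, though: 8B x 2 bytes = 16 GB vs 32B x 2 bytes = 64 GB.
```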

Go deeper in: LLM Internals → Transformer Block → Modern Variants & Scale and LLM Internals → KV Cache → Memory

Frequently Asked Questions