What is the difference between active and total parameters?

Total parameters are every weight the model contains — they set its knowledge capacity and its memory footprint, because all of them must be loaded into GPU memory. Active parameters are the subset actually read and multiplied for a single token; they set the per-token compute and bandwidth. In a dense model the two are equal; in a sparse Mixture-of-Experts model like GLM-5.2, active (~40B) is a small fraction of total (744B).

Why does GLM-5.2 list two parameter counts (744B total, 40B active)?

Because it is a Mixture-of-Experts model. Its feed-forward layers are split into many expert sub-networks, and a router activates only a handful per token, so the model holds 744B weights but fires only ~40B on any given token. The total predicts how much memory, and therefore how many GPUs, you need; the active count predicts how fast and how cheaply it runs per token. A single number would hide that the two costs have decoupled.

Does a lower active-parameter count make a model cheaper to run?

It makes the per-token compute and bandwidth cheaper — GLM-5.2 computes a token at roughly the cost of a 40B model. But it does not lower the memory bill: all 744B total parameters still have to fit in GPU memory whether or not they fire. So a very sparse model is cheap on compute and expensive on memory, which is why deployments often pair it with quantization and multi-GPU nodes.

GLM-5.2 becomes the top open-weights model — Active vs total parameters

Jargon

Total parameters: Every weight the model contains — here 744 billion. The total sets the model’s knowledge capacity and, critically, the memory footprint: all 744 B must be loaded into GPU memory whether or not they are used on a given token.
Active parameters: The subset of weights actually read and multiplied for a single token — here ~40 billion. In a dense model active equals total; in a sparse model active is a fraction. Per-token compute and bandwidth track the active count, not the total.
Mixture-of-Experts (MoE): A transformer variant that replaces each dense feed-forward network with many smaller “expert” sub-networks, plus a router that activates only a handful per token. It decouples total capacity from per-token cost.
Router: The small learned network inside an MoE layer that assigns each token to its top-k experts. It is what makes “which weights are active” change from token to token.
Sparsity ratio: The fraction of total parameters that are active per token. GLM-5.2’s 40 B of 744 B is roughly 5% — about one weight in eighteen. A lower ratio means more capacity sits idle on any given token.
Dense model: A model with no routing: every weight participates in every token, so active equals total. Per-token FLOPs scale linearly with the full parameter count.
FLOP: A floating-point operation — one multiply or add. A useful rule of thumb: a forward pass costs about 2 × (active parameters) FLOPs per token.
Artificial Analysis Intelligence Index: A third-party benchmark that aggregates many evals (reasoning, coding, knowledge) into a single comparable score. GLM-5.2 scored 51 on v4.1, leading all open-weights models.
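The sparsity ratio defined above is a one-line calculation. A minimal sketch, assuming the parameter counts quoted in this explainer (the active figure is approximate); the function name `sparsity_ratio` is illustrative, not taken from any library.

```python
# Parameter counts as quoted in the text; the active count is approximate.
TOTAL_PARAMS = 744e9
ACTIVE_PARAMS = 40e9


def sparsity_ratio(active: float, total: float) -> float:
    """Fraction of total parameters that are active for a single token."""
    if total <= 0 or not 0 <= active <= total:
        raise ValueError("need total > 0 and 0 <= active <= total")
    return active / total


ratio = sparsity_ratio(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"sparsity ratio: {ratio:.1%}")      # about 5.4%
print(f"one weight in {1 / ratio:.1f}")    # about 18.6
```

For a dense model the same function returns 1.0, because active equals total.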

The news. On June 17, 2026, Artificial Analysis reported that Zhipu AI’s GLM-5.2 became the leading open-weights model on its Intelligence Index v4.1, scoring 51, ahead of MiniMax-M3 and DeepSeek V4 Pro, both at 44. The model carries 744 B total parameters but activates only ~40 B per token, ships under an MIT license, and keeps GLM-5.1’s architecture while showing particular strength on scientific reasoning.

Picture a very large engine. It has hundreds of cylinders machined into the block, but at any instant only a handful are firing — and which few are firing changes constantly as a controller picks the right ones for the moment. The size of the engine is one number; the cylinders burning fuel right now are a completely different one. That is exactly the gap GLM-5.2 puts on its spec sheet: 744 billion cylinders built in, but only about 40 billion firing per token. The first number is the engine you have to build and haul; the second is the fuel you actually burn each stroke.

Model releases used to come with one headline number because the models were dense: every weight fired on every token, so the engine you built and the engine you ran were the same. A Mixture-of-Experts model breaks that equality on purpose. Most of a transformer’s parameters live in its feed-forward layers (roughly two-thirds of them), so MoE replaces each big dense feed-forward block with many smaller expert blocks and a router that lights up only the few each token needs. The 744 B stays resident, but the per-token bill tracks the ~40 B that fire.
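The routing step described above can be sketched in a few lines. This is a toy illustration, not GLM-5.2's implementation: the expert count (8), the top-k value (2), the scalar "experts", and the names `route_top_k` and `moe_forward` are all hypothetical, and random scores stand in for the output of a learned router. Normalizing the weights over only the selected experts is one common choice; real implementations differ.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Weights are normalized over the selected experts only (an assumption).
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the outputs of the selected experts; the rest never run."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Hypothetical configuration: 8 toy experts, 2 active per token.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]

random.seed(0)
for token in range(3):
    # Stand-in for the learned router's scores for this token.
    logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    chosen = [i for i, _ in route_top_k(logits, TOP_K)]
    print(f"token {token}: experts {chosen} fire, "
          f"{NUM_EXPERTS - TOP_K} stay idle")
```

All eight experts stay in memory for the whole run, but each token only pays for the two it is routed to. That is the total-versus-active split in miniature.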

So the two numbers price two genuinely different resources. The total parameter count sets your memory footprint — every one of the 744 B weights has to sit in GPU memory, idle or not, which is why running an open-weights model this large means a multi-GPU node and a good reason to shrink the weights with quantization. The active count sets your per-token compute and bandwidth — and at ~40 B active, GLM-5.2 computes each token at roughly the cost of a 40 B model even though it holds 744 B parameters of capacity. The notable part of this release is not just that an open-weights model topped the leaderboard; it is that it did so at a ~5% sparsity ratio — about one weight in eighteen — pushing the frontier on a very lean per-token budget.
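To make the memory side concrete, here is a rough sketch of the weight footprint at a few storage precisions, and the minimum number of accelerators needed just to hold the weights. The 80 GB per-GPU capacity is an assumption chosen for illustration, and the estimate deliberately ignores the KV cache, activations, quantization metadata, and framework overhead, so real deployments need more headroom.

```python
import math

TOTAL_PARAMS = 744e9  # total parameter count quoted in the text

# Bytes stored per weight at each precision (scale/zero-point metadata ignored).
BYTES_PER_PARAM = {"FP16/BF16": 2.0, "FP8": 1.0, "4-bit": 0.5}

GPU_MEMORY_GB = 80  # assumed per-GPU capacity, for illustration only


def weight_memory_gb(params, bytes_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def min_gpus_for_weights(params, bytes_per_param, gpu_gb=GPU_MEMORY_GB):
    """Lower bound on GPU count: weights only, no cache or activations."""
    return math.ceil(weight_memory_gb(params, bytes_per_param) / gpu_gb)


for name, bpp in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(TOTAL_PARAMS, bpp)
    gpus = min_gpus_for_weights(TOTAL_PARAMS, bpp)
    print(f"{name:>9}: {gb:6.0f} GB of weights, at least {gpus} x {GPU_MEMORY_GB} GB GPUs")
```

Note that the active parameter count never appears in this calculation: sparsity does not reduce the footprint, only quantization (or a smaller model) does.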

Per token, you pay…	If GLM-5.2 were dense (744B active)	GLM-5.2 as shipped (744B total, ~40B active)
Active parameters	744B (all of them)	~40B
Compute per token	~1.49 TFLOP (illustrative, ≈2× active-params rule)	~80 GFLOP (illustrative, ≈2× active-params rule)
Weights held in memory	~744 GB (approx., 1 byte/param at FP8)	~744 GB (approx., 1 byte/param at FP8)
Intelligence Index v4.1	—	51 (leading open weights)

Work one token through the numbers to see why the gap matters. Using the rule that a forward pass costs about 2 × (active parameters) FLOPs, a dense 744 B model would burn 2 × 744 B ≈ 1.49 TFLOP every token; GLM-5.2, firing only ~40 B, burns 2 × 40 B ≈ 80 GFLOP — roughly 18× less compute per token (illustrative — derived from the parameter counts, not measured). But both versions still have to keep all 744 B weights resident — about 744 GB at one byte each — so the memory bill is identical. That is the trade in parameter-count terms: MoE is designed to give you the per-token compute of a small model and the capacity of a large one — while still charging you the memory of the large one. (Real systems also pay routing overhead and run dense attention layers, so the picture is more nuanced than the two counts alone.) Whether the trade is worth it depends on what binds you — if memory is the constraint, a smaller dense model can win, which is the flip side explored in the related Granite explainer below.
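The arithmetic in the paragraph above can be reproduced directly. This is a back-of-the-envelope sketch using the 2 × (active parameters) rule of thumb from the text; the function name is illustrative, and the figures are derived from the quoted parameter counts rather than measured.

```python
TOTAL_PARAMS = 744e9   # weights resident in memory
ACTIVE_PARAMS = 40e9   # approximate weights used per token


def forward_flops_per_token(active_params):
    """Rule of thumb: about one multiply and one add per active weight."""
    return 2 * active_params


dense_flops = forward_flops_per_token(TOTAL_PARAMS)   # hypothetical dense 744B model
moe_flops = forward_flops_per_token(ACTIVE_PARAMS)    # GLM-5.2 as described

print(f"dense:   {dense_flops / 1e12:.2f} TFLOP per token")   # about 1.49
print(f"sparse:  {moe_flops / 1e9:.0f} GFLOP per token")      # about 80
print(f"compute ratio: {dense_flops / moe_flops:.1f}x")       # about 18.6
```

The ratio depends only on the two parameter counts, which is why it matches the inverse of the sparsity ratio. Routing overhead and the dense attention layers, which the rule of thumb ignores, would shift the measured figure.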

Goes deeper in: LLM Internals → Transformer Block → The Feed-Forward Network and LLM Internals → Quantization → Why Quantize

Related explainers

IBM Granite 4.1 — 8B dense matches the prior 32B MoE — the serving-cost flip side: when memory is what binds, a smaller dense model can beat a larger MoE
MobileMoE — DRAM-aware MoE scaling — what the active-vs-total gap buys you on memory-constrained devices
SoftMoE — differentiable soft top-k routing — how the router actually decides which experts fire each token
