The news. On June 17, 2026, Artificial Analysis reported that Zhipu AI’s GLM-5.2 became the leading open-weights model on its Intelligence Index v4.1, scoring 51, ahead of MiniMax-M3 and DeepSeek V4 Pro, both at 44. The model carries 744 B total parameters but activates only ~40 B per token, ships under an MIT license, and keeps GLM-5.1’s architecture while showing particular strength on scientific reasoning.
Picture a very large engine. It has hundreds of cylinders machined into the block, but at any instant only a handful are firing — and which few are firing changes constantly as a controller picks the right ones for the moment. The size of the engine is one number; the cylinders burning fuel right now are a completely different one. That is exactly the gap GLM-5.2 puts on its spec sheet: 744 billion cylinders built in, but only about 40 billion firing per token. The first number is the engine you have to build and haul; the second is the fuel you actually burn each stroke.
Earlier model releases often came with one headline number because the models were dense: every weight fired on every token, so the engine you built and the engine you ran were the same. A Mixture-of-Experts (MoE) model breaks that equality on purpose. Most of a transformer’s parameters, roughly two-thirds of them, live in its feed-forward layers, so MoE replaces that one big dense feed-forward block with many smaller expert blocks and a router that lights up only the few each token needs. All 744 B weights stay resident, but the per-token bill tracks the ~40 B that fire.
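The routing idea can be sketched in a few lines of plain Python. This is a toy illustration of generic top-k routing, not GLM-5.2's actual router: the eight scalar "experts", the router scores, and `k=2` are all hypothetical values chosen for readability, and the helper names `top_k_route` and `moe_forward` are made up for this example.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts."""
    # Pick the indices of the k largest router logits.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over only the chosen logits so the gate weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(x, experts, router_logits, k):
    """Run only the k selected experts and mix their outputs by gate weight."""
    routed = top_k_route(router_logits, k)
    output = sum(gate * experts[i](x) for i, gate in routed)
    return output, [i for i, _ in routed]

# Eight toy experts: each scalar function stands in for a feed-forward block.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
# Hypothetical router scores for one token (a real model learns these).
logits = [0.1, 2.0, -1.0, 0.3, 1.5, -0.5, 0.0, 0.2]

output, fired = moe_forward(1.0, experts, logits, k=2)
print("experts that ran:", fired)
print("mixed output:", round(output, 4))
```

Only two of the eight experts execute for this token, while all eight still have to exist in memory. That is the total-versus-active gap in miniature.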
So the two numbers price two genuinely different resources. The total parameter count sets your memory footprint: every one of the 744 B weights has to sit in GPU memory, idle or not, which is why running an open-weights model this large means a multi-GPU node, and why quantization to shrink the weights is so attractive. The active count sets your per-token compute and bandwidth: at ~40 B active, GLM-5.2 computes each token at roughly the cost of a 40 B model even though it holds 744 B parameters of capacity. The notable part of this release is not just that an open-weights model topped the leaderboard; it is that it did so with only ~5% of its weights active per token (about one weight in eighteen), pushing the frontier on a very lean per-token budget.
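To make the memory side concrete, here is a rough weights-only estimate at a few storage precisions. It is a sketch under stated assumptions: the 80 GB accelerator size is a hypothetical figure picked for illustration, the byte widths ignore quantization metadata such as scales, and KV cache and activations are left out entirely.

```python
import math

TOTAL_PARAMS = 744e9   # total parameter count quoted in the article

# Bytes per weight for a few common storage formats (weights only).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gb(total_params, bytes_per_param):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9

def min_gpus(memory_gb, gpu_memory_gb):
    """Smallest GPU count whose combined memory holds the weights alone."""
    return math.ceil(memory_gb / gpu_memory_gb)

GPU_MEMORY_GB = 80  # hypothetical accelerator size, for illustration only

footprints = {fmt: weight_memory_gb(TOTAL_PARAMS, b)
              for fmt, b in BYTES_PER_PARAM.items()}
for fmt, gb in footprints.items():
    gpus = min_gpus(gb, GPU_MEMORY_GB)
    print(f"{fmt}: {gb:.0f} GB of weights, at least {gpus} GPUs")
```

The active parameter count never appears in this calculation. Sparsity lowers the compute bill, but the footprint depends only on the total count and the bytes per weight.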
| Per token, you pay… | If GLM-5.2 were dense (744B active) | GLM-5.2 as shipped (744B total, ~40B active) |
|---|---|---|
| Active parameters | 744B (all of them) | ~40B |
| Compute per token | ~1.49 TFLOP (illustrative, ≈2× active-params rule) | ~80 GFLOP (illustrative, ≈2× active-params rule) |
| Weights held in memory | ~744 GB (approx., 1 byte/param at FP8) | ~744 GB (approx., 1 byte/param at FP8) |
| Intelligence Index v4.1 | — | 51 (leading open weights) |
Work one token through the numbers to see why the gap matters. Using the rule that a forward pass costs about 2 × (active parameters) FLOPs, a dense 744 B model would burn 2 × 744 B ≈ 1.49 TFLOP every token; GLM-5.2, firing only ~40 B, burns 2 × 40 B ≈ 80 GFLOP — roughly 18× less compute per token (illustrative — derived from the parameter counts, not measured). But both versions still have to keep all 744 B weights resident — about 744 GB at one byte each — so the memory bill is identical. That is the trade in parameter-count terms: MoE is designed to give you the per-token compute of a small model and the capacity of a large one — while still charging you the memory of the large one. (Real systems also pay routing overhead and run dense attention layers, so the picture is more nuanced than the two counts alone.) Whether the trade is worth it depends on what binds you — if memory is the constraint, a smaller dense model can win, which is the flip side explored in the related Granite explainer below.
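The arithmetic above can be reproduced with a short script. It uses the same rule of thumb of about 2 FLOPs per active parameter and the parameter counts from the article; like the table, it is illustrative and derived from parameter counts, not measured, and it ignores attention and routing overhead.

```python
TOTAL_PARAMS = 744e9    # total parameters, from the article
ACTIVE_PARAMS = 40e9    # approximate active parameters per token, from the article

def forward_flops_per_token(active_params):
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2 * active_params

dense_flops = forward_flops_per_token(TOTAL_PARAMS)   # hypothetical dense 744 B model
moe_flops = forward_flops_per_token(ACTIVE_PARAMS)    # GLM-5.2 as described

compute_ratio = dense_flops / moe_flops
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"dense: {dense_flops / 1e12:.2f} TFLOP per token")
print(f"MoE:   {moe_flops / 1e9:.0f} GFLOP per token")
print(f"ratio: {compute_ratio:.1f}x, active fraction: {active_fraction:.1%}")
```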
Goes deeper in: LLM Internals → Transformer Block → The Feed-Forward Network and LLM Internals → Quantization → Why Quantize
Related explainers
- IBM Granite 4.1 — 8B dense matches the prior 32B MoE — the serving-cost flip side: when memory is what binds, a smaller dense model can beat a larger MoE
- MobileMoE — DRAM-aware MoE scaling — what the active-vs-total gap buys you on memory-constrained devices
- SoftMoE — differentiable soft top-k routing — how the router actually decides which experts fire each token