What is DRAM-aware MoE scaling, in one paragraph?

A scaling law for Mixture-of-Experts language models that treats device DRAM as a hard constraint, not a soft preference. Where prior MoE scaling laws (OLMoE, DeepSeekMoE, Mixtral) minimized training FLOPs while assuming plenty of HBM, MobileMoE minimizes both training and per-token inference FLOPs subject to an inequality: the model's INT4 footprint plus activation buffers has to fit under the smartphone DRAM budget (approximately 3 GB). That second clause produces a different optimum — fewer total parameters, more fine-grained experts (64), fewer active per token (8), and a shared always-on expert — and yields the MobileMoE S/M/L family at 272M/528M/922M active parameters out of 1.3B/2.8B/5.3B totals.

Why does it matter for on-device LLMs?

Because the cloud-style MoE recipe is structurally wrong for phones. A 7B-total model at INT4 is approximately 3.5 GB before activations, exceeding the working DRAM budget on most mid-range devices; on a flagship it leaves no headroom for the OS, the camera pipeline, or the rest of the app. MobileMoE-L's 5.3B-total model fits in 2.75 GB at INT4 with comparable or better quality than the 7B-total OLMoE-1B-7B baseline, and the MobileMoE family is reported to run 1.8–3.8× faster prefill and 2.2–3.4× faster decode than dense baselines at comparable INT4 memory on a Samsung Galaxy S25 and iPhone 16 Pro. Without the DRAM constraint in the scaling law, the compute-optimal model is too large to ship. With it, the smaller model wins on both memory footprint and per-token compute.

How does it relate to INT4 quantization and grouped GEMM?

Two layers down the stack. INT4 QAT compresses each weight to 4 bits at training time so the resulting model both stores and runs at INT4 — the FP32 router is the carve-out, kept full precision because routing decisions are discrete and would flip under tiny numerical drift. The ExecuTorch fused-MoE kernel then takes the sparse, per-token dispatch into 8 active experts and reshapes it into one dense grouped GEMM per layer — many small matmuls of identical shape fused into one batched call, which is what amortizes kernel-launch overhead on mobile NPUs. Without both pieces, the architectural FLOP savings would be invisible at the wall clock.

MobileMoE — DRAM-aware MoE scaling for sub-3GB devices

TL;DR

What is it: The MobileMoE paper introduces a family of sub-billion-active Mixture-of-Experts language models — S/M/L at 272M / 528M / 922M active out of 1.3B / 2.8B / 5.3B total — derived from the first scaling law that jointly optimizes for DRAM and compute on smartphones.
Why it’s needed: Unlocking on-device LLMs takes more than smaller models. It requires treating DRAM as the binding constraint and routing only a sliver of parameters per token while the rest sit dormant in memory. That lets a 5.3B-total model fit in 2.75 GB at INT4 while computing with only 922M parameters per token, whereas a dense 5B model would have to stream all of its weights on every decode step.
vs previous: Prior MoE recipes (e.g. OLMoE-1B-7B) optimized for cloud-style training compute and ignored the device memory budget entirely; OLMoE's 7B-total model needs ~4.3 GB at INT4, more than the 3 GB working budget on a typical phone. MobileMoE-L reportedly matches or exceeds its quality with 30% fewer active parameters and a 23% smaller footprint.

Jargon

MoE (Mixture of Experts): An architecture where each transformer block has many small expert sub-networks plus a router that picks just a few of them per token. The model has lots of total parameters (the wardrobe) but only activates a sliver per forward pass (today's outfit). Context: Granite 4.1 — Dense vs MoE.
Active vs total params: Two different numbers. Total sets the DRAM footprint — the whole model has to be resident in memory. Active sets the per-token compute (FLOPs) and the memory bandwidth the decode loop pulls each step. MoE decouples them; dense models tie them together.
Fine-grained experts (8 of 64): MobileMoE reportedly splits each MoE layer into roughly 64 small experts and routes each token to 8 of them, instead of a coarser layout such as 2-of-8 or 2-of-32. Smaller, more numerous experts give the router more compositional choices without growing active compute.
Shared expert: An always-on, dense sub-network that processes every token regardless of routing — captures common patterns that don't need to be re-learned by every routed expert. Pairs naturally with fine-grained routing.
INT4 QAT with FP32 router: Quantization-Aware Training bakes 4-bit weight rounding into the training loss so the model learns weights that quantize well; the router stays in FP32 because routing decisions are discrete and small numerical drift would flip which experts get chosen. Background: Quantization → The process.
Grouped GEMM: A kernel that does many small matmuls of identical shape as one batched call, instead of launching one kernel per matmul. Critical for MoE because each expert's per-token batch is tiny — without grouping, kernel-launch overhead would dominate. See Operator Fusion → Kernel launch overhead.
ExecuTorch: PyTorch's on-device inference runtime for mobile and embedded targets. MobileMoE ships a custom ExecuTorch fused-MoE operator that converts the sparse expert dispatch into one dense grouped GEMM per layer — that's what unlocks the measured smartphone speedups.
Pareto frontier: The set of design points where no objective (quality, DRAM, compute) can improve without worsening another. MobileMoE's scaling law sketches this frontier in the joint (DRAM, FLOPs) plane and the S/M/L points are explicitly placed on it.
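
To make the INT4 QAT entry above concrete, here is a minimal sketch of the "fake quantization" step that quantization-aware training typically applies in the forward pass: weights are rounded to a 4-bit grid and immediately dequantized, so the loss sees the rounding error. The symmetric per-group scheme, the `group_size` value, and the function name are assumptions for illustration; the explainer does not state which scheme MobileMoE uses.

```python
def fake_quant_int4(weights, group_size=4):
    """Round weights to a signed 4-bit grid per group, then dequantize.

    Hypothetical helper: symmetric scaling with one scale per group.
    """
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 7 (signed INT4 spans -8..7).
        scale = max_abs / 7 if max_abs > 0 else 1.0
        for w in group:
            q = max(-8, min(7, round(w / scale)))  # integer code stored on device
            out.append(q * scale)                  # value the forward pass sees
    return out


# With a group maximum of 7.0 the scale is exactly 1.0, so 1.2 snaps to 1.0.
quantized = fake_quant_int4([7.0, -3.0, 1.2, 0.0])
print(quantized)
```

During training, gradients are usually passed through the rounding as if it were the identity (a straight-through estimator), which is what lets the model learn weights that sit close to the 4-bit grid.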

The news. On May 27, 2026, the MobileMoE paper introduced a family of mobile-targeted MoE language models and the first MoE scaling law that jointly optimizes against both DRAM and compute constraints. The S/M/L scales activate 272M / 528M / 922M parameters out of 1.3B / 2.8B / 5.3B totals; at INT4 they fit in 0.68 / 1.48 / 2.75 GB of weight memory. MobileMoE-L reportedly outperforms OLMoE-1B-7B by +7.4 points with 30% fewer active parameters and 23% smaller footprint, and the family is measured at 1.8–3.8× faster prefill and 2.2–3.4× faster decode on a Samsung Galaxy S25 and iPhone 16 Pro at comparable INT4 memory targets.

A packing metaphor helps. You are about to fly with a single carry-on, and the airline enforces a hard size limit on it. The problem isn't just buying smaller clothes: your entire wardrobe still has to fit. The clever traveller does two things at once: vacuum-packs every garment so it compresses to half its volume, and wears only one outfit per day so the rest can stay folded. The first move shrinks the space the whole wardrobe takes up; the second means most of the wardrobe just sits in the bag while you're walking around. Those two moves correspond to the two axes the MobileMoE paper treats as equal partners: total parameters (the whole wardrobe, which has to fit in DRAM) and active parameters (today's outfit, which sets per-token compute and bandwidth).

What's new is that prior MoE scaling laws only optimized one of those axes. OLMoE, DeepSeekMoE, and Mixtral were all derived under the implicit assumption that the cloud serving rig has plenty of HBM and the cost that matters is training FLOPs. That's a fine assumption for a datacenter; it's a useless one for a phone where total resident memory is the binding constraint and bandwidth-bound decode is the dominant runtime cost. MobileMoE writes the scaling law down explicitly: minimize training and per-token inference FLOPs subject to M(N_total, T) ≤ DRAM_budget, where M is a function of total parameters and quantization. That second clause — the inequality — is the conceptual contribution.
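
The inequality above can be sketched as a simple feasibility check. This is a minimal illustration, not the paper's scaling law: the function names, the `buffers_gb` parameter, and the 3 GB default are assumptions for the example, and the second argument T of M is omitted because this explainer does not define it.

```python
GB = 1e9  # decimal gigabytes, matching the round numbers used in the text


def weight_footprint_gb(total_params, bits_per_weight=4):
    """Resident weight memory: every parameter must sit in DRAM."""
    return total_params * bits_per_weight / 8 / GB


def fits_budget(total_params, budget_gb=3.0, buffers_gb=0.0, bits_per_weight=4):
    """True if weights plus (assumed) activation buffers fit under the budget."""
    return weight_footprint_gb(total_params, bits_per_weight) + buffers_gb <= budget_gb


candidates = {"5.3B total": 5.3e9, "7B total": 7.0e9}
feasible = {name: fits_budget(n) for name, n in candidates.items()}
print(feasible)
```

Only configurations that pass this check are candidates at all; the FLOP minimization then happens inside the feasible set, which is why the constrained optimum differs from the cloud-style one.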

The architecture follows. Every block reportedly uses fine-grained MoE: roughly 64 small experts with 8 active per token, plus one always-on shared expert that handles patterns common enough that every routed expert would otherwise learn them separately. The router stays in FP32 even after the weights are quantized to INT4, because routing decisions are discrete and tiny numerical drift would flip which expert wins, breaking the trained expert specialization. Training runs in four stages (6T-token pretraining, 500B-token math/code/knowledge mid-training, 80M-sample SFT, then INT4 QAT), and the ExecuTorch runtime converts the sparse expert dispatch into one dense grouped GEMM per layer, which is what actually unlocks the measured smartphone wins. Without that fused operator, per-token kernel-launch overhead on the device would wash out the FLOP savings.
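
The routing and dispatch described above can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the MobileMoE or ExecuTorch implementation: the function names are hypothetical, the gate weights are a softmax over the selected logits, and the shared expert's output is simply added to the routed output.

```python
import math
from collections import defaultdict


def route_top_k(router_logits, k):
    """Return [(expert_id, gate_weight)] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda e: router_logits[e], reverse=True)
    top = order[:k]
    # Softmax over the selected logits only, so the k gate weights sum to 1.
    peak = max(router_logits[e] for e in top)
    exps = [math.exp(router_logits[e] - peak) for e in top]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(top, exps)]


def moe_layer(x, router_logits, experts, shared_expert, k):
    """Shared expert runs for every token; routed experts are gated and summed."""
    routed = sum(w * experts[e](x) for e, w in route_top_k(router_logits, k))
    return shared_expert(x) + routed


def group_by_expert(batch_logits, k):
    """Map expert_id -> [(token_index, gate_weight)]: one work list per expert."""
    groups = defaultdict(list)
    for token, logits in enumerate(batch_logits):
        for expert, weight in route_top_k(logits, k):
            groups[expert].append((token, weight))
    return dict(groups)


# Toy setup: 4 experts, 2 active per token, 3 tokens.
batch_logits = [[2.0, 1.0, 0.0, -1.0],
                [0.0, 3.0, 2.5, -2.0],
                [1.5, -1.0, 0.5, 2.0]]
groups = group_by_expert(batch_logits, k=2)
print({e: [t for t, _ in work] for e, work in sorted(groups.items())})
```

Grouping tokens by expert is the bookkeeping step that lets a runtime issue one batched call covering all experts instead of one tiny matmul per (token, expert) pair; the actual fused kernel is not shown here.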

Where the wall-clock time actually goes

A back-of-envelope walk-through for MobileMoE-L (922M active params out of 5.3B total) doing per-token decode on a Galaxy S25. Decode is memory-bound, not compute-bound — at every step the runtime has to stream the active weights through whatever ALUs the device exposes (NPU on a Snapdragon, GPU shader cores, or CPU vector lanes) to compute one output token. At INT4 the active-weight bytes per token are roughly 922M × 0.5 B = ~461 MB. The Galaxy S25's LPDDR5X DRAM delivers around 76 GB/s of bandwidth (illustrative — actual sustained throughput is workload-dependent and varies across devices). That puts the bandwidth-bound lower bound at 461 MB / 76 GB/s ≈ 6.1 ms per token, or ~165 tokens/s at ideal utilization. At realistic decode utilization in the 30-50% range, you land at ~50–80 tokens/s — comfortably interactive for chat.
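
The same arithmetic as a small script, so the inputs can be swapped for another device or model. The 76 GB/s bandwidth and the 30–50% utilization range are the illustrative values from the paragraph above, not measurements, and the function name is hypothetical.

```python
def decode_bound(active_params, bits_per_weight, bandwidth_gb_s, utilization=1.0):
    """Bandwidth-bound decode estimate: (milliseconds per token, tokens per second)."""
    bytes_per_token = active_params * bits_per_weight / 8   # active weights streamed per step
    effective_bw = bandwidth_gb_s * 1e9 * utilization       # bytes per second actually achieved
    seconds = bytes_per_token / effective_bw
    return seconds * 1e3, 1.0 / seconds


ms_ideal, tps_ideal = decode_bound(922e6, 4, 76)
_, tps_low = decode_bound(922e6, 4, 76, utilization=0.30)
_, tps_high = decode_bound(922e6, 4, 76, utilization=0.50)

print(f"ideal: {ms_ideal:.1f} ms/token, {tps_ideal:.0f} tokens/s")
print(f"30-50% utilization: {tps_low:.0f}-{tps_high:.0f} tokens/s")
```

Because the estimate scales linearly with active parameters, halving the active count roughly doubles the bandwidth-bound token rate at the same utilization.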

Now do the same arithmetic on the OLMoE-1B-7B baseline. Same 1B-class active count, but 7B total at INT4 → ~3.5 GB of weight memory (the paper reports 4.3 GB at INT4 once buffers are included), which simply doesn't fit in the 3 GB working-set budget on a typical mid-range phone. Even on a flagship with the headroom, you've paid for parameters that never produce a token of output — the inactive 6B sit in DRAM consuming budget without contributing per-token compute. MobileMoE's "small active inside a small total" choice trades nothing on quality (it beats OLMoE-1B-7B by +7.4 pp) and keeps the budget honest.

How the family compares

Model	Active	Total	INT4 DRAM	Quality vs OLMoE-1B-7B
Llama 3.2 1B (dense)	1.0 B	1.0 B	~0.55 GB	baseline-class (dense small)
SmolLM2 1.7B (dense)	1.7 B	1.7 B	~0.9 GB (illustrative)	baseline-class (dense small)
OLMoE-1B-7B (MoE)	~1.3 B	~7.0 B	~4.3 GB (exceeds 3 GB budget)	reference
MobileMoE-S	0.272 B	1.3 B	~0.68 GB	—
MobileMoE-M	0.528 B	2.8 B	~1.48 GB	—
MobileMoE-L	0.922 B	5.3 B	~2.75 GB	+7.4 pp better
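
Two ratios fall straight out of the MobileMoE rows in the table: the fraction of parameters active per token, and the bits per parameter implied by the reported INT4 footprint. The sketch below only rearranges the table's own numbers; the dictionary layout and function name are assumptions for illustration.

```python
# (active params in B, total params in B, reported INT4 DRAM in GB), from the table above
family = {
    "MobileMoE-S": (0.272, 1.3, 0.68),
    "MobileMoE-M": (0.528, 2.8, 1.48),
    "MobileMoE-L": (0.922, 5.3, 2.75),
}


def summarize(active_b, total_b, int4_gb):
    """Return (active fraction, implied bits per parameter)."""
    active_fraction = active_b / total_b
    implied_bits = int4_gb * 8 / total_b  # GB per billion params -> bits per param
    return active_fraction, implied_bits


summary = {name: summarize(*row) for name, row in family.items()}
for name, (fraction, bits) in summary.items():
    print(f"{name}: {fraction:.1%} active, ~{bits:.2f} bits/param")
```

All three sizes activate roughly a fifth of their parameters per token, and the implied storage is slightly above 4 bits per parameter, which would be consistent with some tensors (such as the FP32 router) and quantization metadata being kept at higher precision.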

A small but load-bearing caveat: the 1.8–3.8× prefill / 2.2–3.4× decode range vs comparable dense baselines is reported on Samsung Galaxy S25 and iPhone 16 Pro specifically, at comparable INT4 memory targets, and won't transfer one-to-one to other devices — phone DRAM bandwidth, NPU vs CPU dispatch, and thermal headroom all shift the picture. The measured numbers are headline ratios on those two reference devices, not a universal smartphone guarantee, and individual S/M/L sizes can sit at different points within the reported range depending on the dense baseline being compared.

Goes deeper in: LLM Internals → Quantization → What to quantize

Related explainers

IBM Granite 4.1 — Dense vs MoE small models — the dense-vs-MoE tradeoff at the small-model scale that MobileMoE sits inside
MSSP vs muP — MoE scaling — a different scaling-law axis: hyperparameter transfer for MoE, complementary to MobileMoE's memory-budget axis
Mixed quantization — NVFP4 prefill, BF16 decode — the datacenter end of the precision-routing pattern MobileMoE inherits (INT4 weights, FP32 router)


