MobileMoE — DRAM-aware MoE scaling for sub-3GB devices

[Figure: Sub-3 GB DRAM is the smartphone budget. Axes: DRAM at INT4 (GB) against active params per token (B), with a line marking the 3 GB smartphone DRAM budget. Points: Llama 3.2 1B (dense 1B), SmolLM2 1.7B (dense 1.7B), OLMoE-1B-7B (MoE, total 7B, 4.3 GB, exceeds budget), MobileMoE-S (1.3B total, 8/64, 0.68 GB), MobileMoE-M (2.8B total, 8/64, 1.48 GB), MobileMoE-L (5.3B total, 8/64, 2.75 GB). Callout: MobileMoE-L vs OLMoE-1B-7B, +7.4 pp quality, 30% fewer active params.]
learnaivisually.com/ai-explained/mobilemoe-dram-aware-scaling

The news. On May 27, 2026, the MobileMoE paper introduced a family of mobile-targeted MoE language models and the first MoE scaling law that jointly optimizes against both DRAM and compute constraints. The S/M/L scales activate 272M / 528M / 922M parameters out of 1.3B / 2.8B / 5.3B totals; at INT4 they fit in 0.68 / 1.48 / 2.75 GB of weight memory. MobileMoE-L reportedly outperforms OLMoE-1B-7B by 7.4 points with 30% fewer active parameters and a 23% smaller footprint, and the family is measured at 1.8–3.8× faster prefill and 2.2–3.4× faster decode than comparable dense baselines on a Samsung Galaxy S25 and iPhone 16 Pro at comparable INT4 memory targets.

A packing metaphor helps. You are about to fly with a single bag, and the airline enforces a hard 23 kg limit at the check-in counter. Buying smaller clothes is not enough, because your entire wardrobe still has to fit. The clever traveller does two things at once: vacuum-packs every garment so it compresses to half its volume, and wears only one outfit per day so the rest can stay folded. The first move shrinks the footprint of the whole wardrobe; the second means most of the wardrobe just sits in the bag while you're walking around. Those two moves correspond to the two axes the MobileMoE paper treats as equal partners: total parameters (the whole wardrobe, which has to fit in DRAM) and active parameters (today's outfit, which sets per-token compute and bandwidth).

What's new is that prior MoE scaling laws only optimized one of those axes. OLMoE, DeepSeekMoE, and Mixtral were all designed under the implicit assumption that the cloud serving rig has plenty of HBM and that the cost that matters is training FLOPs. That's a fine assumption for a datacenter; it's a useless one for a phone, where total resident memory is the binding constraint and bandwidth-bound decode is the dominant runtime cost. MobileMoE writes the scaling law down explicitly: minimize training and per-token inference FLOPs subject to M(N_total, T) ≤ DRAM_budget, where M is the memory footprint as a function of total parameters and quantization. That second clause, the inequality, is the conceptual contribution.
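The memory side of that constraint can be sketched in a few lines. This is a minimal illustration, not the paper's scaling law: the helper names `weight_bytes` and `fits_budget` are hypothetical, the estimate counts raw INT4 weight storage only (no quantization scales or runtime buffers, which is why it lands slightly below the reported footprints), and the parameter counts are the ones quoted in this article.

```python
GB = 1e9  # decimal gigabytes, matching the figures quoted in the article


def weight_bytes(total_params: float, bits_per_weight: float = 4.0) -> float:
    """Raw weight storage only; ignores quantization scales and runtime buffers."""
    return total_params * bits_per_weight / 8.0


def fits_budget(total_params: float, budget_gb: float = 3.0,
                bits_per_weight: float = 4.0) -> bool:
    """Check the DRAM inequality: weight memory must not exceed the budget."""
    return weight_bytes(total_params, bits_per_weight) <= budget_gb * GB


# (name, total params, active params) as quoted in this article
candidates = [
    ("MobileMoE-S", 1.3e9, 272e6),
    ("MobileMoE-M", 2.8e9, 528e6),
    ("MobileMoE-L", 5.3e9, 922e6),
    ("OLMoE-1B-7B", 7.0e9, 1.0e9),
]

# Only configurations that satisfy the memory constraint are candidates at all.
feasible = [name for name, total, _ in candidates if fits_budget(total)]

for name, total, active in candidates:
    gb = weight_bytes(total) / GB
    status = "fits" if fits_budget(total) else "exceeds budget"
    print(f"{name}: ~{gb:.2f} GB raw INT4 weights, "
          f"{active / total:.0%} of params active, {status}")
```

The point of the sketch is the ordering of the checks: the total-parameter budget filters the candidate set first, and only then does the active-parameter count decide per-token cost among the survivors.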

The architecture follows. Every block reportedly uses fine-grained MoE: roughly 64 small experts with 8 active per token, plus one always-on shared expert that handles patterns common enough that every expert would learn the same thing anyway. The router stays in FP32 even after the weights are quantized to INT4, because routing decisions are discrete: tiny numerical drift could flip which expert wins and break the trained expert specialization. Training runs in four stages (6T-token pretraining, 500B-token math/code/knowledge mid-training, 80M-sample SFT, then INT4 QAT), and the ExecuTorch runtime converts the sparse expert dispatch into one dense grouped GEMM per layer, which is what actually unlocks the measured smartphone wins. Without that fused operator, the on-device per-token kernel-launch overhead would wash out the FLOPs savings.
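To see why discrete routing is sensitive to small numerical drift, here is a toy top-k router in plain Python. It is a sketch under assumptions, not MobileMoE's implementation: the `route` function is hypothetical, the logits are random stand-ins for a real router's output, and normalizing the gates with a softmax over the selected experts is one common convention that the article does not confirm. The always-on shared expert is omitted because it bypasses routing entirely.

```python
import math
import random

NUM_EXPERTS = 64  # routed experts per block (the article's "roughly 64")
TOP_K = 8         # routed experts active per token


def route(logits: list[float], k: int = TOP_K) -> list[tuple[int, float]]:
    """Return (expert_index, gate_weight) for the k highest-scoring experts."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the gate weights sum to 1.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
chosen = route(logits)

# Find the expert that barely made the cut and the one that barely missed it.
ranked = sorted(range(NUM_EXPERTS), key=lambda i: logits[i], reverse=True)
eighth, ninth = ranked[TOP_K - 1], ranked[TOP_K]

# Simulate drift: nudge the runner-up just past the last selected expert.
perturbed = list(logits)
perturbed[ninth] = logits[eighth] + 1e-6
flipped = route(perturbed)

print("before:", sorted(i for i, _ in chosen))
print("after: ", sorted(i for i, _ in flipped))
```

A perturbation that only has to close the gap between the 8th and 9th scores changes which expert's weights are applied to the token, which is the kind of flip that keeping the router in higher precision is meant to avoid.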

Where the wall-clock time actually goes

Here is a back-of-envelope walk-through for MobileMoE-L (922M active params out of 5.3B total) doing per-token decode on a Galaxy S25. Decode is memory-bound, not compute-bound: at every step the runtime has to stream the active weights through whatever ALUs the device exposes (NPU on a Snapdragon, GPU shader cores, or CPU vector lanes) to compute one output token. At INT4 the active-weight traffic per token is roughly 922M × 0.5 bytes ≈ 461 MB. The Galaxy S25's LPDDR5X DRAM delivers around 76 GB/s of bandwidth (illustrative; actual sustained throughput is workload-dependent and varies across devices). That puts the bandwidth-bound floor at 461 MB / 76 GB/s ≈ 6.1 ms per token, or ~165 tokens/s at ideal utilization. At realistic decode utilization in the 30–50% range, you land at ~50–80 tokens/s, comfortably interactive for chat.
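The same estimate is easy to reproduce and re-run for other devices. The sketch below is a rough calculator, assuming the article's illustrative 76 GB/s bandwidth figure and the 30–50% utilization range; `decode_bound` is a hypothetical helper name, and the model ignores KV-cache traffic, activations, and router weights.

```python
def decode_bound(active_params: float, bits_per_weight: float = 4.0,
                 bandwidth_gb_s: float = 76.0) -> tuple[float, float]:
    """Return (bytes streamed per token, bandwidth-bound seconds per token)."""
    bytes_per_token = active_params * bits_per_weight / 8.0
    seconds_per_token = bytes_per_token / (bandwidth_gb_s * 1e9)
    return bytes_per_token, seconds_per_token


# MobileMoE-L: 922M active parameters at INT4
bytes_per_token, seconds = decode_bound(922e6)
ideal_tps = 1.0 / seconds

# Assumed realistic share of peak bandwidth actually sustained during decode
low_tps, high_tps = 0.30 * ideal_tps, 0.50 * ideal_tps

print(f"weights streamed per token: {bytes_per_token / 1e6:.0f} MB")
print(f"bandwidth-bound latency:    {seconds * 1e3:.1f} ms/token")
print(f"ideal throughput:           {ideal_tps:.0f} tokens/s")
print(f"at 30-50% utilization:      {low_tps:.0f}-{high_tps:.0f} tokens/s")
```

Swapping in a different `bandwidth_gb_s` or `active_params` shows how directly decode speed tracks active-parameter bytes divided by memory bandwidth.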

Now do the same arithmetic on the OLMoE-1B-7B baseline. It has the same 1B-class active count, but 7B total parameters at INT4 come to ~3.5 GB of raw weight memory (the paper reports 4.3 GB at INT4 once buffers are included), which simply doesn't fit in the 3 GB working-set budget on a typical mid-range phone. Even on a flagship with the headroom, you've paid for parameters that sit idle on any given token: the roughly 6B inactive parameters occupy DRAM without contributing per-token compute. MobileMoE's "small active inside a small total" choice gives up nothing on quality (MobileMoE-L beats OLMoE-1B-7B by 7.4 pp) and keeps the budget honest.

How the family compares

| Model | Active | Total | INT4 DRAM | Quality vs OLMoE-1B-7B |
|---|---|---|---|---|
| Llama 3.2 1B (dense) | 1.0 B | 1.0 B | ~0.55 GB | baseline-class (dense small) |
| SmolLM2 1.7B (dense) | 1.7 B | 1.7 B | ~0.9 GB (illustrative) | baseline-class (dense small) |
| OLMoE-1B-7B (MoE) | ~1.0 B | ~7.0 B | ~4.3 GB (exceeds 3 GB budget) | reference |
| MobileMoE-S | 0.272 B | 1.3 B | ~0.68 GB | |
| MobileMoE-M | 0.528 B | 2.8 B | ~1.48 GB | |
| MobileMoE-L | 0.922 B | 5.3 B | ~2.75 GB | +7.4 pp better |

A small but load-bearing caveat: the 1.8–3.8× prefill / 2.2–3.4× decode range vs comparable dense baselines is reported on Samsung Galaxy S25 and iPhone 16 Pro specifically, at comparable INT4 memory targets, and won't transfer one-to-one to other devices — phone DRAM bandwidth, NPU vs CPU dispatch, and thermal headroom all shift the picture. The measured numbers are headline ratios on those two reference devices, not a universal smartphone guarantee, and individual S/M/L sizes can sit at different points within the reported range depending on the dense baseline being compared.

