NVIDIA Nemotron 3 Nano Omni — 30B-A3B multimodal MoE
The news. On April 28, 2026, NVIDIA released Nemotron 3 Nano Omni, a 30B-A3B hybrid Mixture-of-Experts model that accepts text, images, audio, video, documents, charts, and graphical-interface screenshots, and outputs text. NVIDIA reports the model delivers up to 9× higher throughput than other open omni-modal models at the same interactivity and ships with open weights, open datasets, and open training recipes — supported on Jetson, DGX Spark, DGX Station, and cloud. Read the release →
Picture the receptionist at the counter. The roster is large — thirty translators on payroll, all sitting in the office, all costing rent — but no single customer needs all of them. Most customers need two. The receptionist looks at who walked in, calls the right pair to the counter, and the other twenty-eight stay at their desks. The shorthand "30B-A3B" captures exactly that: 30 B total parameters resident in HBM, but only ~3 B active parameters evaluated per token. That ~10× sparsity ratio is what makes the 9× throughput claim plausible — at decode time, the GPU is overwhelmingly memory-bandwidth-bound, and you only have to stream the active experts' weights through the compute units, not the whole roster.
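A minimal sketch of what the receptionist does makes the sparsity concrete. The 30-expert roster and top-2 fan-out below are borrowed from the analogy, not from NVIDIA's documentation (the announcement doesn't publish the expert count or routing fan-out), and all dimensions are toy values:

```python
# Illustrative top-2 MoE routing sketch, not Nemotron's actual configuration.
import numpy as np

rng = np.random.default_rng(0)

D_MODEL   = 64      # hidden size (toy value)
D_FF      = 256     # per-expert feed-forward width (toy value)
N_EXPERTS = 30      # "thirty translators on payroll"
TOP_K     = 2       # the receptionist calls two to the counter

# Router: one linear layer scoring every expert for the incoming token.
W_router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02
# Expert weights: all 30 are resident in memory (the "rent"), few are used per token.
experts = [
    (rng.standard_normal((D_MODEL, D_FF)) * 0.02,
     rng.standard_normal((D_FF, D_MODEL)) * 0.02)
    for _ in range(N_EXPERTS)
]

def moe_ffn(token: np.ndarray) -> np.ndarray:
    """Route one token through its top-2 experts; the other 28 are never read."""
    logits = token @ W_router                                  # score all experts
    top = np.argsort(logits)[-TOP_K:]                          # pick the best two
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()    # normalize their weights
    out = np.zeros_like(token)
    for gate, idx in zip(gates, top):
        w_in, w_out = experts[idx]                             # only these weights stream
        out += gate * (np.maximum(token @ w_in, 0.0) @ w_out)  # tiny 2-layer FFN expert
    return out

token = rng.standard_normal(D_MODEL)
print(moe_ffn(token).shape)   # (64,) — same output shape, ~1/15 of the expert weights touched
```

The point of the sketch is the loop body: per token, only the two selected weight matrices are read, which is exactly the ~10× reduction in bytes streamed that the 30B-A3B shorthand promises.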
The "multimodal" half is what makes Nemotron omni rather than an LLM with a vision adapter bolted on the side. In a typical stack, a customer who shows a photo gets sent to a separate vision specialist; a customer who plays audio gets sent to a separate audio specialist; a customer who types text gets the LLM. Three separate offices, three rents, three sets of weights streamed in parallel. Nemotron's design funnels every modality through the same router into the same expert pool. Each chunk of input — a sub-word, an image patch, an audio frame, a video frame — is first turned into a token-shaped embedding, then routed by the same receptionist to whatever two translators on the same roster best fit the request. NVIDIA names Conv3D, EVS, and a 256K context window as the front-end ingredients that turn images and video frames into routable tokens; the exact expert count and top-K routing fan-out are not yet documented in the announcement.
That single-pool design is the mechanical reason a 30B-A3B omni model is far cheaper to serve than running a dense 30B LLM, a separate vision encoder, and a separate audio model side by side. You pay HBM rent for one 30 B parameter set, not three; you pay decode bandwidth for ~3 B of it per token, not three full models in parallel. The throughput win compounds when batching: every modality contributes tokens to the same continuous batch, so the scheduler keeps the GPU saturated even when an individual user is mid-conversation between an image upload and an audio reply.
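A toy scheduler loop shows why the batching part helps. The request names, lengths, and batch size below are made up for illustration; the mechanism is standard continuous batching, where a new request is admitted the moment a slot frees up rather than waiting for the whole batch to drain:

```python
# Toy continuous-batching loop: every active request contributes one token per
# decode step, whatever modality it started from, so the batch stays full.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    remaining: int   # decode steps left for this request (made-up numbers)

waiting = deque([
    Request("image-question", 4),
    Request("audio-transcript-summary", 6),
    Request("plain-text-chat", 3),
])
active, MAX_BATCH, step = [], 2, 0

while waiting or active:
    # Admit new requests as soon as a slot frees up; no waiting for the batch to drain.
    while waiting and len(active) < MAX_BATCH:
        active.append(waiting.popleft())
    step += 1
    for req in active:
        req.remaining -= 1          # one decode step for every active request
    finished = [r for r in active if r.remaining == 0]
    active = [r for r in active if r.remaining > 0]
    for r in finished:
        print(f"step {step}: {r.name} finished")

print(f"total decode steps: {step}")
```

Because image, audio, and text requests all produce ordinary tokens, they can share slots in the same batch; the GPU never idles waiting for a modality-specific model to become available.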
A worked example sharpens it. Suppose a serving cluster spends most of its decode time reading parameters out of HBM. A hypothetical dense 30 B model has to stream all ~60 GB of its FP16 weights through the memory system on every decode step. The MoE variant streams roughly ~6 GB of "active" expert weights per token — the other ~54 GB stays put in HBM. At a fixed HBM bandwidth ceiling of 3 TB/s on a modern data-center GPU, that translates to roughly ~50 decode steps/sec for the dense model and ~500 for the MoE in the bandwidth-bound limit. NVIDIA's "9× higher throughput" headline lines up with that order of magnitude, with the residual gap absorbed by attention work, KV-cache reads, routing overhead, and load-imbalance between experts.
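The same back-of-the-envelope arithmetic, written out so the numbers can be checked (pure bandwidth-bound limit; attention, KV-cache reads, and routing overhead deliberately ignored):

```python
# Reproducing the worked example above in the bandwidth-bound limit.
FP16_BYTES = 2
HBM_BW     = 3e12          # 3 TB/s, a modern data-center GPU

dense_params  = 30e9       # hypothetical dense 30B model
active_params = 3e9        # ~3B active parameters per token (30B-A3B)

dense_bytes_per_step = dense_params * FP16_BYTES    # ~60 GB read per decode step
moe_bytes_per_step   = active_params * FP16_BYTES   # ~6 GB read per decode step

print(f"dense: {HBM_BW / dense_bytes_per_step:.0f} steps/s")   # ~50
print(f"MoE:   {HBM_BW / moe_bytes_per_step:.0f} steps/s")     # ~500
```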
Go deeper in: LLM Internals → Transformer Block → The Feed-Forward Network and LLM Serving → Inference Engine → Continuous Batching