NVIDIA Nemotron 3 Nano Omni — 30B-A3B multimodal MoE
The news. On April 28, 2026, NVIDIA released Nemotron 3 Nano Omni, a 30B-A3B hybrid Mixture-of-Experts model that accepts text, images, audio, video, documents, charts, and graphical-interface screenshots, and outputs text. NVIDIA reports the model delivers up to 9× higher throughput than other open omni-modal models at the same interactivity and ships with open weights, open datasets, and open training recipes — supported on Jetson, DGX Spark, DGX Station, and cloud. Read the release →
Picture the receptionist at the counter. The roster is large — thirty translators on payroll, all sitting in the office, all costing rent — but no single customer needs all of them. Most customers need two. The receptionist looks at who walked in, calls the right pair to the counter, and the other twenty-eight stay at their desks. The shorthand "30B-A3B" captures exactly that: 30 B total parameters resident in HBM, but only ~3 B active parameters evaluated per token. That ~10× sparsity ratio is what makes the 9× throughput claim plausible — at decode time, the GPU is overwhelmingly memory-bandwidth-bound, and you only have to stream the active experts' weights through the compute units, not the whole roster.
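A minimal sketch of what the receptionist does makes the sparsity concrete. The 30-expert roster and top-2 fan-out below are borrowed from the analogy, not from NVIDIA's documentation (the announcement doesn't publish the expert count or routing fan-out), and all dimensions are toy values:

```python
# Illustrative top-2 MoE routing sketch, not Nemotron's actual configuration.
import numpy as np

rng = np.random.default_rng(0)

D_MODEL   = 64      # hidden size (toy value)
D_FF      = 256     # per-expert feed-forward width (toy value)
N_EXPERTS = 30      # "thirty translators on payroll"
TOP_K     = 2       # the receptionist calls two to the counter

# Router: one linear layer scoring every expert for the incoming token.
W_router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02
# Expert weights: all 30 are resident in memory (the "rent"), few are used per token.
experts = [
    (rng.standard_normal((D_MODEL, D_FF)) * 0.02,
     rng.standard_normal((D_FF, D_MODEL)) * 0.02)
    for _ in range(N_EXPERTS)
]

def moe_ffn(token: np.ndarray) -> np.ndarray:
    """Route one token through its top-2 experts; the other 28 are never read."""
    logits = token @ W_router                                  # score all experts
    top = np.argsort(logits)[-TOP_K:]                          # pick the best two
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()    # normalize their weights
    out = np.zeros_like(token)
    for gate, idx in zip(gates, top):
        w_in, w_out = experts[idx]                             # only these weights stream
        out += gate * (np.maximum(token @ w_in, 0.0) @ w_out)  # tiny 2-layer FFN expert
    return out

token = rng.standard_normal(D_MODEL)
print(moe_ffn(token).shape)   # (64,) — same output shape, ~1/15 of the expert weights touched
```

The point of the sketch is the loop body: per token, only the two selected weight matrices are read, which is exactly the ~10× reduction in bytes streamed that the 30B-A3B shorthand promises.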
The "multimodal" half is what makes Nemotron omni rather than an LLM with a vision adapter bolted on the side. In a typical stack, a customer who shows a photo gets sent to a separate vision specialist; a customer who plays audio gets sent to a separate audio specialist; a customer who types text gets the LLM. Three separate offices, three rents, three sets of weights streamed in parallel. Nemotron's design funnels every modality through the same router into the same expert pool. Each chunk of input — a sub-word, an image patch, an audio frame, a video frame — is first turned into a token-shaped embedding, then routed by the same receptionist to whatever two translators on the same roster best fit the request. NVIDIA names Conv3D, EVS, and a 256K context window as the front-end ingredients that turn images and video frames into routable tokens; the exact expert count and top-K routing fan-out are not yet documented in the announcement.
That single-pool design is the mechanical reason a 30B-A3B omni model is far cheaper to serve than running a dense 30B LLM, a separate vision encoder, and a separate audio model side by side. You pay HBM rent for one 30 B parameter set, not three; you pay decode bandwidth for ~3 B of it per token, not three full models in parallel. The throughput win compounds when batching: every modality contributes tokens to the same continuous batch, so the scheduler keeps the GPU saturated even when an individual user is mid-conversation between an image upload and an audio reply.
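A toy scheduler loop shows why the batching part helps. The request names, lengths, and batch size below are made up for illustration; the mechanism is standard continuous batching, where a new request is admitted the moment a slot frees up rather than waiting for the whole batch to drain:

```python
# Toy continuous-batching loop: every active request contributes one token per
# decode step, whatever modality it started from, so the batch stays full.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    remaining: int   # decode steps left for this request (made-up numbers)

waiting = deque([
    Request("image-question", 4),
    Request("audio-transcript-summary", 6),
    Request("plain-text-chat", 3),
])
active, MAX_BATCH, step = [], 2, 0

while waiting or active:
    # Admit new requests as soon as a slot frees up; no waiting for the batch to drain.
    while waiting and len(active) < MAX_BATCH:
        active.append(waiting.popleft())
    step += 1
    for req in active:
        req.remaining -= 1          # one decode step for every active request
    finished = [r for r in active if r.remaining == 0]
    active = [r for r in active if r.remaining > 0]
    for r in finished:
        print(f"step {step}: {r.name} finished")

print(f"total decode steps: {step}")
```

Because image, audio, and text requests all produce ordinary tokens, they can share slots in the same batch; the GPU never idles waiting for a modality-specific model to become available.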
A worked example sharpens it. Suppose a serving cluster spends most of its decode time reading parameters out of HBM. A hypothetical dense 30 B model has to stream all ~60 GB of its FP16 weights through the memory system on every decode step. The MoE variant streams roughly ~6 GB of "active" expert weights per token — the other ~54 GB stays put in HBM. At a fixed HBM bandwidth ceiling of 3 TB/s on a modern data-center GPU, that translates to roughly ~50 decode steps/sec for the dense model and ~500 for the MoE in the bandwidth-bound limit. NVIDIA's "9× higher throughput" headline lines up with that order of magnitude, with the residual gap absorbed by attention work, KV-cache reads, routing overhead, and load-imbalance between experts.
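The same back-of-the-envelope arithmetic, written out so the numbers can be checked (pure bandwidth-bound limit; attention, KV-cache reads, and routing overhead deliberately ignored):

```python
# Reproducing the worked example above in the bandwidth-bound limit.
FP16_BYTES = 2
HBM_BW     = 3e12          # 3 TB/s, a modern data-center GPU

dense_params  = 30e9       # hypothetical dense 30B model
active_params = 3e9        # ~3B active parameters per token (30B-A3B)

dense_bytes_per_step = dense_params * FP16_BYTES    # ~60 GB read per decode step
moe_bytes_per_step   = active_params * FP16_BYTES   # ~6 GB read per decode step

print(f"dense: {HBM_BW / dense_bytes_per_step:.0f} steps/s")   # ~50
print(f"MoE:   {HBM_BW / moe_bytes_per_step:.0f} steps/s")     # ~500
```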
Go deeper in: LLM Internals → Transformer Block → The Feed-Forward Network and LLM Serving → Inference Engine → Continuous Batching