What does 'tokens per megawatt' actually measure?

The number of LLM output tokens a serving rack produces per megawatt-hour of grid energy consumed under representative inference traffic. Despite the "tokens/MW" shorthand, the denominator is energy (MWh), not instantaneous power (MW). Unlike peak FLOPS/W (a chip-datasheet number) or even goodput (an inference-engine tuning number), tokens/MW aggregates everything between the grid meter and the user-facing API: prefill compute, decode compute, KV-cache loads, collective communication across NVLink and NIC, idle padding, and host-side overhead. It is what the operator's electricity bill actually buys, and it is the cleanest single number for generation-over-generation hardware comparisons.

Why is NVIDIA reframing serving around tokens/MW now?

Because production inference at hyperscaler scale is power-constrained, not chip-count-constrained. A site's grid contract caps the megawatts available before you decide which silicon to deploy. Throughput, cost per token, and the number of customers you can serve all fall out of how efficiently the rack converts each MWh into tokens. NVIDIA's ~50× tokens/MW and ~35× cost-per-token claims for Blackwell Ultra GB300 NVL72 only make sense in this whole-stack framing: silicon alone would not produce the multiplier. In the author's illustrative decomposition, it compounds across silicon (approximately 5×), interconnect (approximately 3.3×), and Dynamo orchestration (approximately 3×).
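The power-cap argument can be sketched as simple arithmetic. The snippet below is a minimal illustration, not a sizing tool: the 10 MW site is a hypothetical figure, and the two tokens-per-MWh rates are the article's own illustrative Hopper and Blackwell Ultra numbers from its back-of-envelope decomposition.

```python
HOURS_PER_DAY = 24


def daily_token_capacity(site_mw: float, tokens_per_mwh: float,
                         utilization: float = 1.0) -> float:
    """Tokens per day a power-capped site can emit.

    site_mw:        grid contract in megawatts (hypothetical input)
    tokens_per_mwh: measured or assumed system-level efficiency
    utilization:    fraction of the contracted power actually drawn
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    mwh_per_day = site_mw * HOURS_PER_DAY * utilization
    return mwh_per_day * tokens_per_mwh


# Hypothetical 10 MW site; rates are the article's illustrative figures.
hopper_per_day = daily_token_capacity(10, 500_000)
blackwell_per_day = daily_token_capacity(10, 25_000_000)

print(f"Hopper-era:      {hopper_per_day:,.0f} tokens/day")
print(f"Blackwell Ultra: {blackwell_per_day:,.0f} tokens/day")
```

With the megawatts fixed by the contract, the only lever on daily output is the tokens-per-MWh rate, which is why the ratio between the two results is the same as the ratio between the two rates.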

How does tokens/MW relate to goodput?

Goodput is throughput that met the SLO: a per-replica engine-tuning metric that asks 'of the tokens we generated, how many landed on time?' Tokens/MW is the system-level cousin that adds the energy denominator and rolls up across the rack and the orchestration layer. The two compose: improving goodput per replica raises the numerator of tokens/MW, while improving interconnect, Dynamo orchestration, and silicon perf/W lowers the energy spent per token, which is the denominator. Operators use goodput to tune the inference engine and tokens/MW to plan the site.
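How the two metrics compose can be shown with a small sketch. Everything here is an assumption made for illustration: the `Request` record, the single end-to-end latency SLO (real deployments often track time-to-first-token and per-token latency separately), and the sample numbers.

```python
from dataclasses import dataclass


@dataclass
class Request:
    output_tokens: int   # tokens generated for this request
    latency_s: float     # end-to-end latency (simplified SLO signal)


def goodput_tokens(requests: list[Request], slo_s: float) -> int:
    """Count only the output tokens from requests that met the SLO."""
    return sum(r.output_tokens for r in requests if r.latency_s <= slo_s)


def tokens_per_mwh(slo_tokens: int, energy_mwh: float) -> float:
    """Divide SLO-meeting tokens by metered grid energy."""
    if energy_mwh <= 0:
        raise ValueError("energy_mwh must be positive")
    return slo_tokens / energy_mwh


# Hypothetical sample: three requests, a 2-second SLO, 0.002 MWh metered.
requests = [Request(400, 1.2), Request(600, 2.8), Request(500, 0.9)]
good = goodput_tokens(requests, slo_s=2.0)
total = sum(r.output_tokens for r in requests)
rate = tokens_per_mwh(good, energy_mwh=0.002)

print(f"goodput: {good} of {total} tokens met the SLO")
print(f"tokens/MWh: {rate:,.0f}")
```

The late request still consumed energy, so it inflates the denominator while contributing nothing to the numerator. That is the sense in which engine-level goodput tuning feeds directly into the site-level number.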

NVIDIA AI Factories — Tokens-per-megawatt as a serving metric


learnaivisually.com/ai-explained/nvidia-ai-factories-tokens-per-mw

TL;DR

What is it: The NVIDIA AI Factories announcement at GTC Taipei / Computex 2026 bundles Vera Rubin silicon, Blackwell Ultra, the GB300 NVL72 rack, the Dynamo orchestration framework, and the Omniverse DSX Blueprint under a single thesis: an AI datacenter should be benchmarked the way a power plant is — on tokens per megawatt, the useful output per unit of energy input.
Why it’s needed: Inference workloads at production scale are power-constrained, not chip-count-constrained. A hyperscaler's site has a fixed grid contract in megawatts; throughput and cost per token both fall out of how many tokens the rack produces from each MWh. Reframing serving around tokens/MW turns three separate optimization fronts (silicon, interconnect, and orchestration) into one metric the CFO can also read.
vs previous: Earlier datacenter framings measured FLOPS/W or TOPS/W — peak silicon throughput per watt — which is what the chip can do on paper. Tokens/MW measures what the full stack actually delivers under real serving traffic: prefill plus decode, KV cache loads, collective communication, and idle padding all included. NVIDIA's claim of ~50× more tokens/MW for the Blackwell Ultra GB300 NVL72 rack vs prior-generation Hopper deployments — and ~35× lower cost per token — only makes sense in this whole-system framing.

Jargon

Tokens per megawatt (tokens/MW): The number of LLM output tokens a serving rack produces per megawatt-hour of grid energy consumed. Aggregates prefill compute, decode compute, KV-cache loads, all-to-all communication, idle padding, and host-side overhead into one number the utility bill can be divided into. Closely related: goodput.
Cost per token: What it actually costs the operator (electricity + amortized hardware + cooling + facility) to emit one LLM output token. For inference at scale, electricity dominates the marginal cost, so cost/token falls roughly in step as tokens/MWh rises. NVIDIA's "35× lower" claim is vs Hopper-generation deployments.
Performance per watt (perf/W): The traditional silicon metric: peak FLOPS (or TOPS) per watt of chip power draw. Useful for chip-vs-chip comparisons but ignores everything above the die — DRAM stalls, NVLink crossings, kernel-launch overhead, batching efficiency. Tokens/MW is the system-level cousin that includes those.
AI factory: NVIDIA's framing of a purpose-built LLM-serving datacenter as a token-production line — raw materials are electricity and incoming queries; output is tokens delivered under an SLO. Reuses manufacturing vocabulary (throughput, yield, OEE) for compute infrastructure.
Dynamo: NVIDIA's framework for orchestrating long-context reasoning across nodes in an AI factory. It is the orchestration layer of the AI-factory stack: it sits above the silicon and is positioned as a multiplier on top of the silicon and interconnect gains. The blog does not disclose Dynamo's internal mechanism beyond the orchestration framing.
GB300 NVL72: The Blackwell Ultra rack-scale assembly, with 72 Blackwell Ultra GPUs wired into a single fifth-gen NVLink domain. This is the interconnect detail that makes the 50× tokens/MW claim physically possible, because bigger collectives stay on fast fabric. The Vera Rubin NVL72 follow-on extends the same idea to the next-gen Rubin GPUs.
Omniverse DSX Blueprint: A digital-twin blueprint for designing AI-factory floor plans before they're built — power delivery, cooling, network topology, and rack layout simulated in Omniverse. Lets operators co-optimize the building and the chips against the tokens/MW target.

The news. On May 27, 2026, NVIDIA used its GTC Taipei / Computex keynote to bundle five separate products — Vera Rubin silicon, Blackwell Ultra silicon, the GB300 NVL72 rack, the Dynamo orchestration framework, and the Omniverse DSX Blueprint — under one "AI factories" thesis. The headline claim: a Blackwell Ultra GB300 NVL72 rack delivers ~50× more tokens per megawatt than a Hopper-generation deployment and ~35× lower cost per token for comparable inference workloads, with Vera Rubin teed up as the next forward step (reportedly ~35× performance per watt with the Groq 3 LPX pairing). The piece is vision-setting more than architectural disclosure; what it actually changes is the unit of measurement NVIDIA wants the industry to use.

Picture your utility bill. The ceiling on the kWh you can draw doesn't care what's plugged into the wall; it's set by your grid contract before you decide between the old fridge and the new one. What changes is how much useful cold air that electricity buys: the rattly 20-year-old box turns a kWh into a small puff of chill, while the modern inverter unit turns the same kWh into roughly 50 times as much. Same input, vastly different output. The chip generation, the interconnect, and the orchestration framework are three coordinated parts of that "inverter compressor" upgrade for a serving rack.

What's new is the yardstick NVIDIA wants you to compare against. The old yardstick was FLOPS: peak silicon throughput. A Hopper H100 datasheet advertises a precise dense TFLOPS number and a stack of HBM, and you reason about deployments by counting chips. The trouble is that peak FLOPS is what the silicon can do on a perfect day, not what your customers' traffic actually extracts after KV-cache loads, collectives across parallel ranks, batch padding, and host-side overhead. Performance per watt is the chip-level analogue of that perfect-day number, useful for silicon comparisons and useless for budgeting a serving SLO. Tokens/MW is the system-level cousin: measure the rack under real serving traffic, divide by the megawatt-hours the utility actually bills you for, and you get a number that already includes every inefficiency between the grid meter and the user-facing API. That's the framing inversion: instead of optimizing for a peak the workload can't reach, optimize for the unit the workload actually pays for.

The 50× claim, then, isn't a single-chip number. It is full-stack codesign that raises the roofline at every layer, with the gains compounding. Blackwell Ultra silicon moves to lower-precision tensor formats and fatter HBM stacks (more FLOPs per watt at the die). The GB300 NVL72 rack lifts the NVLink domain from one 8-GPU server up to all 72 GPUs in the rack, which means tensor-parallel and expert-parallel collectives no longer cross NIC-mediated PCIe paths at every layer, recovering the milliseconds per token those crossings used to waste at rack scale (the Vera Rubin NVL72 follow-on extends the same idea). Dynamo orchestrates long-context reasoning across nodes; the blog frames it as the layer that keeps expensive silicon from sitting idle between request tails, related in spirit to prefill / decode disaggregation, though the post doesn't disclose the specific mechanism. Multiply the wins at each layer and you reach the 50× headline; pull any one piece out and the chain breaks.

Walking the numbers — where the 50× comes from

A back-of-envelope decomposition (illustrative — NVIDIA does not publish a per-layer attribution). Suppose a Hopper-era rack averages ~500K output tokens per MWh under a representative inference SLO. That's the kind of number you'd get from a 70B-class model at modest batch sizes after every real-world overhead. Splitting NVIDIA's 50× into plausible per-layer contributions:

~5× from silicon (illustrative) — Blackwell Ultra's lower-precision tensor formats plus fatter HBM lift raw die throughput per watt over Hopper.
~3.3× from interconnect (illustrative) — keeping tensor-parallel and expert-parallel collectives inside the GB300 NVL72 rack-scale NVLink domain instead of falling onto NIC-mediated PCIe paths.
~3× from orchestration (illustrative) — Dynamo's request scheduling reduces idle decode time on the expensive GPUs.

Compounded: 5 × 3.3 × 3 = 49.5, or roughly 50×. That puts the Blackwell Ultra GB300 NVL72 endpoint at ~25 million tokens per MWh under the same SLO. At grid power priced around $0.07/kWh (i.e. $70/MWh, illustrative; large-site contracts vary), the marginal energy cost falls from ~$0.14 per 1K tokens (Hopper) to ~$0.0028 per 1K tokens (Blackwell Ultra), a 50× drop in the electricity term alone. NVIDIA's "~35× lower cost per token" is the smaller ratio that plausibly results once amortized hardware and cooling, which need not shrink by the same factor, join the electricity term. The decomposition is the author's; the headline 50× and 35× are NVIDIA's.
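The arithmetic above can be reproduced in a few lines. This is a sketch of the article's illustrative decomposition only: the per-layer gains, the 500K tokens/MWh Hopper baseline, and the $70/MWh price are the author's assumed figures, not NVIDIA-published attributions, and the function names are made up for this example.

```python
import math

# Illustrative inputs taken from the decomposition in the text.
HOPPER_TOKENS_PER_MWH = 500_000
LAYER_GAINS = {"silicon": 5.0, "interconnect": 3.3, "orchestration": 3.0}
PRICE_USD_PER_MWH = 70.0  # $0.07/kWh


def compound(gains: dict[str, float]) -> float:
    """Multiply the per-layer gains into one stack-level multiplier."""
    return math.prod(gains.values())


def energy_cost_per_1k_tokens(price_usd_per_mwh: float,
                              tokens_per_mwh: float) -> float:
    """Marginal electricity cost in USD per 1,000 output tokens."""
    return price_usd_per_mwh / tokens_per_mwh * 1_000


multiplier = compound(LAYER_GAINS)
# The text rounds 49.5 up to the 50x headline before scaling the baseline.
blackwell_tokens_per_mwh = HOPPER_TOKENS_PER_MWH * 50

hopper_cost = energy_cost_per_1k_tokens(PRICE_USD_PER_MWH,
                                        HOPPER_TOKENS_PER_MWH)
blackwell_cost = energy_cost_per_1k_tokens(PRICE_USD_PER_MWH,
                                           blackwell_tokens_per_mwh)

print(f"compounded multiplier: {multiplier:.1f}x")
print(f"Hopper energy cost:    ${hopper_cost:.4f} per 1K tokens")
print(f"Blackwell energy cost: ${blackwell_cost:.4f} per 1K tokens")

# Dropping any single layer shows how much of the headline it carries.
for name in LAYER_GAINS:
    rest = {k: v for k, v in LAYER_GAINS.items() if k != name}
    print(f"without {name}: {compound(rest):.1f}x")
```

Because the layers multiply rather than add, removing the smallest contributor (orchestration, at 3×) still cuts the total to about a third of the headline.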

Tokens/MW vs the old serving metrics

Metric	What it measures	What it misses	When useful
FLOPS / TOPS (peak)	Theoretical silicon throughput	DRAM stalls, idle padding, collectives, all of serving	Chip datasheet comparisons
Performance / watt	FLOPS divided by chip TDP	Rack-level cooling, host CPU, NIC, idle racks	Chip-vs-chip efficiency
Throughput (tokens/s)	Tokens served per second per replica	Energy cost; whether throughput came from over-provisioning	SLO sizing under fixed hardware
Goodput	Throughput that met the SLO	Energy; rack-level cost	Inference-engine tuning (see Serving Metrics)
Tokens / MW	SLO-meeting tokens per MWh of grid energy	Almost nothing operational — it's the bill	Site planning, generation-over-generation comparisons, cost-per-token budgeting

A small but load-bearing caveat: NVIDIA's 50× and 35× claims are marketing numbers for Blackwell Ultra GB300 NVL72 vs Hopper, framed under workloads NVIDIA chose. Real per-customer ratios will depend on model size, context length distribution, batch policy, KV-cache behavior, and what fraction of the time the rack actually serves traffic vs sits idle. Vera Rubin teed up alongside is a forward-looking step beyond GB300 with separately quoted figures (e.g. ~35× perf/W with Groq 3 LPX), not the source of the 50× headline. The conceptual contribution — measuring serving against the energy bill instead of the peak datasheet — survives even if your number ends up being 12× rather than 50×.
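One of the caveats above, the fraction of time the rack serves traffic versus sits idle, is easy to quantify with a toy model. The sketch assumes a rack that draws full power while serving and a fixed fraction of that power while idle; the 30% idle-power fraction and the utilization values are hypothetical, and the busy-state rate reuses the article's illustrative 25 million tokens/MWh.

```python
def effective_tokens_per_mwh(busy_tokens_per_mwh: float,
                             utilization: float,
                             idle_power_fraction: float) -> float:
    """Tokens per metered MWh once idle energy is charged to the rack.

    busy_tokens_per_mwh: efficiency while actively serving
    utilization:         fraction of wall-clock time spent serving
    idle_power_fraction: idle draw as a fraction of busy draw (assumed)
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    if not 0.0 <= idle_power_fraction <= 1.0:
        raise ValueError("idle_power_fraction must be in [0, 1]")
    # Energy is billed for busy time plus idle time at reduced draw,
    # but tokens are only produced during busy time.
    energy_share = utilization + (1.0 - utilization) * idle_power_fraction
    return busy_tokens_per_mwh * utilization / energy_share


BUSY_RATE = 25_000_000  # article's illustrative Blackwell Ultra figure

for u in (1.0, 0.75, 0.5, 0.25):
    rate = effective_tokens_per_mwh(BUSY_RATE, u, idle_power_fraction=0.3)
    print(f"utilization {u:.0%}: {rate:,.0f} tokens/MWh")
```

Under these assumptions, a rack that is busy half the time delivers roughly three quarters of its busy-state tokens/MWh, which is one way a site's realized ratio can fall well short of a headline multiplier.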

Goes deeper in: LLM Serving → Serving Metrics & SLOs

Related explainers

NVIDIA Vera Rubin NVL72 — Rack-scale NVLink domain — the next-generation extension of the same rack-scale NVLink idea introduced in GB300 NVL72
NVIDIA Jetson Thor — Edge Blackwell vs datacenter Blackwell — the same Blackwell SM building block at the other end of the power envelope (60W edge vs megawatt rack)
Compute Where It Counts — Per-token compute controller — a software lever on tokens/MW: spend more compute only on tokens that need it

