NVIDIA Vera Rubin NVL72 — Rack-scale NVLink domain
The news. On May 21, 2026, NVIDIA's Vera Rubin NVL72 swept COMPUTEX 2026's Best Choice Awards (Golden Award plus Sustainable Tech Special Award). The rack pairs 36 Vera CPUs with 72 Rubin GPUs into a single NVLink domain, glued together by the sixth-generation NVLink Switch, ConnectX-9 SuperNICs, Spectrum-X Ethernet, and BlueField-4 DPUs. NVIDIA quotes "up to ~10× higher inference performance per watt" and "~10× lower cost per token" vs the prior generation; with a Groq 3 LPX pairing the headline figure rises to ~35× throughput per watt for trillion-parameter models.
Why "72 GPUs in one NVLink domain" is the load-bearing claim
Picture the open-plan office floor. Nine cubicle rooms, each with eight desks sharing a fast whiteboard. Anyone in the same room can shout a number across; anyone in a different room has to send a chat message that crawls through a slow gateway. If a meeting needs every desk to chime in at the same speed, the slow chat sets the pace. Knocking the walls down turns nine rooms into one floor — and the meeting now runs at whiteboard speed.
That gateway is the inter-node network path (out over PCIe to a NIC, then InfiniBand or Ethernet) that the prior-generation HGX building block commonly used for any communication beyond 8 GPUs. Inside one server the 8 GPUs sat on a fast NVLink mesh; cross the server boundary and the collective dropped to NIC speeds, fast enough for storage but slow for the all-reduce that every transformer layer needs. Tensor-parallel groups commonly stayed 8-way to keep the all-reduce on NVLink; going wider often paid a noticeable bandwidth tax. Pipeline-parallel and expert-parallel layouts faced the same trade-off whenever their traffic crossed a server boundary.
The NVL72 design choice is mechanical and easy to state: build one NVLink fabric for the whole rack. The sixth-generation NVLink Switch sits between all 72 GPUs and replaces the cross-server NIC path for in-rack collectives. A tensor-parallel group can now sit at 16-way, 32-way, or even larger without leaving the NVLink fabric — and an expert-parallel deployment can put one expert per GPU and rely on the same fabric for the all-to-all. The 8-GPU island is no longer the natural unit. The rack is.
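To make the "island versus rack" point concrete, the sketch below counts how many tensor-parallel groups of a given width fit inside one NVLink domain. It is a minimal illustration, not a placement algorithm: `groups_per_domain` is a hypothetical helper, and the only inputs are the domain sizes from the text (8 GPUs per server, 72 GPUs per rack).

```python
def groups_per_domain(tp_size: int, domain_size: int) -> tuple[int, int]:
    """Return (whole TP groups that fit in one NVLink domain, GPUs left over).

    A result of (0, domain_size) means a single group of this width cannot
    stay inside the domain and would have to cross its boundary.
    """
    if tp_size <= 0 or domain_size <= 0:
        raise ValueError("tp_size and domain_size must be positive")
    return divmod(domain_size, tp_size)


# Hypothetical comparison: 8-GPU server island vs. 72-GPU rack domain.
for tp in (8, 16, 32, 72):
    for domain in (8, 72):
        groups, spare = groups_per_domain(tp, domain)
        status = "fits" if groups > 0 else "crosses the domain boundary"
        print(f"TP={tp:>2} in {domain:>2}-GPU domain: "
              f"{groups} group(s), {spare} spare GPU(s), {status}")
```

Under these assumptions a 16-way or 32-way group cannot fit in an 8-GPU domain at all, while the 72-GPU domain holds several such groups with a few GPUs left over.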
| Generation | NVLink domain (common config) | Notes |
|---|---|---|
| Hopper HGX H100 (2022) | ~8 GPUs / server | Cross-server collectives drop to InfiniBand / Ethernet. |
| Blackwell HGX B200 (2024) | ~8 GPUs / server | Same 8-GPU server building block. |
| Blackwell GB200 NVL72 (2024) | 72 GPUs / rack | First rack-scale NVLink domain in the Blackwell generation. |
| Vera Rubin NVL72 (2026) | 72 GPUs / rack | Sixth-gen NVLink Switch; Vera Arm CPU; ConnectX-9 SuperNICs; NVIDIA-claimed ~10× perf/watt. |
(Numbers above are common configurations as marketed; NVIDIA's ~10× perf/watt and ~10× lower cost per token are from the COMPUTEX 2026 release, not third-party-validated, and naturally depend on workload mix.)
A back-of-envelope feel for where time goes
A precise NVL72-vs-HGX comparison would need vendor-published bandwidth numbers and a specific workload; what follows is a deliberately illustrative sketch, not an NVIDIA-validated benchmark.
Walk a single transformer-layer forward pass on a tensor-parallel deployment. Each GPU holds a slice of the layer's weight matrix; the matmul is fast and local. Then comes the all-reduce: every GPU sums its partial output with every other GPU's before the next layer can start. The slowest link in the parallel group is what the layer waits on.
If that slowest link is the NIC path between two 8-GPU servers, it is in the tens-of-GB/s range; if it is the NVLink fabric, it is in the hundreds-of-GB/s to low-TB/s range per GPU (exact numbers depend on generation and link width). For a transformer activation tensor sized in the gigabytes — typical at large batch + long context — the gap between those two regimes is the gap between a layer's collective costing tens of milliseconds and costing a couple of milliseconds. Multiply by the layer count of a large model and the difference shows up at the token-budget level.
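The arithmetic behind that gap can be sketched with the standard bandwidth-only model of a ring all-reduce, in which each GPU moves 2(N-1)/N times the tensor size. Everything numeric below is an assumption chosen to sit inside the ranges quoted above (a 1 GB tensor, 50 GB/s for the NIC path, 900 GB/s for NVLink, a 16-way group); none of it is a measured or vendor-published figure for NVL72, and a switched fabric need not use a ring at all.

```python
def ring_allreduce_seconds(tensor_bytes: float, n_gpus: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate for a ring all-reduce.

    Each GPU sends (and receives) 2 * (N - 1) / N times the tensor size.
    Link latency and compute/communication overlap are ignored.
    """
    if n_gpus < 2:
        return 0.0  # nothing to reduce across
    traffic_per_gpu = 2 * (n_gpus - 1) / n_gpus * tensor_bytes
    return traffic_per_gpu / link_bytes_per_s


GB = 1e9
TENSOR_BYTES = 1 * GB   # assumed: "sized in the gigabytes"
NIC_BW = 50 * GB        # assumed: "tens of GB/s"
NVLINK_BW = 900 * GB    # assumed: "hundreds of GB/s to low TB/s"
TP_SIZE = 16            # assumed: a group wider than one 8-GPU server

nic_ms = ring_allreduce_seconds(TENSOR_BYTES, TP_SIZE, NIC_BW) * 1e3
nvlink_ms = ring_allreduce_seconds(TENSOR_BYTES, TP_SIZE, NVLINK_BW) * 1e3
print(f"NIC path:    {nic_ms:.1f} ms per all-reduce")
print(f"NVLink path: {nvlink_ms:.1f} ms per all-reduce")
```

With these assumed inputs the estimate comes to roughly 37.5 ms on the NIC path versus roughly 2.1 ms on NVLink, the same tens-of-milliseconds versus couple-of-milliseconds split described above. Changing any input moves the result proportionally, which is why the sketch is only a feel for the regimes.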
The matmul itself does not change between the two regimes; the network around it does. A trillion-parameter model that does not fit comfortably on 8 GPUs would have spilled across server boundaries on the old building block and paid the slower-link tax at every layer. On a single NVL72 rack it sits inside one NVLink domain.
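A rough weights-only count shows why such a model spills past 8 GPUs. The sketch below is illustrative: `min_gpus_for_weights` is a hypothetical helper, the 80 GB per-GPU memory figure is an assumed placeholder rather than a Rubin specification, and KV cache, activations, and runtime overhead are ignored, so real deployments need more headroom than this.

```python
def min_gpus_for_weights(n_params: int, bytes_per_param: int,
                         hbm_bytes_per_gpu: int) -> int:
    """Smallest GPU count whose combined memory holds the weights alone."""
    if min(n_params, bytes_per_param, hbm_bytes_per_gpu) <= 0:
        raise ValueError("all arguments must be positive")
    total_bytes = n_params * bytes_per_param
    # Integer ceiling division: round up to a whole GPU.
    return -(-total_bytes // hbm_bytes_per_gpu)


N_PARAMS = 10**12             # a trillion-parameter model, as in the text
HBM_PER_GPU = 80 * 10**9      # assumed placeholder capacity, not a Rubin spec

for label, width in (("2 bytes/param", 2), ("1 byte/param", 1)):
    gpus = min_gpus_for_weights(N_PARAMS, width, HBM_PER_GPU)
    print(f"{label}: at least {gpus} GPUs "
          f"(fits in 8: {gpus <= 8}, fits in 72: {gpus <= 72})")
```

Under these assumptions the weights alone need more than 8 GPUs at either precision but stay well under 72, which is the case the paragraph above describes.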
What goes around the GPUs
The NVLink fabric is the headline, but the rest of the platform matters too. ConnectX-9 SuperNICs and Spectrum-X Ethernet handle the scale-out fabric that connects multiple NVL72 racks for multi-rack jobs. BlueField-4 DPUs offload storage and network management from the host CPUs. And the Vera CPU is paired tightly with Rubin inside the rack.
The mental model: scale-up inside the rack runs on NVLink; scale-out across racks runs on Ethernet. The lesson that used to apply at the boundary between "inside a server" and "across servers" now applies at the boundary between "inside a rack" and "across racks." A job that fits in 72 GPUs never has to leave the rack's NVLink domain, so its tensor-parallel collectives stay on NVLink; for jobs larger than that, the splitting question shifts from "across servers" to "across racks."
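That mental model reduces to a small lookup: two GPUs in the same rack talk over the scale-up fabric, and two GPUs in different racks talk over the scale-out fabric. The sketch below assumes global ranks are numbered contiguously, 72 per rack; `racks_needed` and `fabric_between` are hypothetical helpers for illustration, not part of any NVIDIA or framework API.

```python
GPUS_PER_RACK = 72  # one NVL72 NVLink domain, per the text


def racks_needed(n_gpus: int, gpus_per_rack: int = GPUS_PER_RACK) -> int:
    """Number of racks a job of n_gpus occupies (ceiling division)."""
    if n_gpus <= 0 or gpus_per_rack <= 0:
        raise ValueError("n_gpus and gpus_per_rack must be positive")
    return -(-n_gpus // gpus_per_rack)


def fabric_between(rank_a: int, rank_b: int,
                   gpus_per_rack: int = GPUS_PER_RACK) -> str:
    """Which fabric carries traffic between two global GPU ranks.

    Assumes ranks are contiguous, so rank // gpus_per_rack is the rack index.
    """
    if rank_a == rank_b:
        return "local"
    same_rack = rank_a // gpus_per_rack == rank_b // gpus_per_rack
    return "scale-up" if same_rack else "scale-out"


print(racks_needed(72), fabric_between(0, 71))    # one rack, same domain
print(racks_needed(144), fabric_between(71, 72))  # two racks, crosses racks
```

Passing `gpus_per_rack=8` reproduces the older picture, where the same boundary sat at the edge of an 8-GPU server.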
Where it earns its keep — and where it does not
The win is loudest at two extremes. Very large models (trillion-parameter dense or extreme-MoE) were the workloads that barely fit on prior racks and paid the most for cross-server collectives, so they benefit the most from a wider NVLink domain. High-throughput inference with many concurrent requests benefits too: more GPUs in one NVLink domain means more flexible parallelism strategies and bigger continuous batches without paying a network tax at every layer.
For a model that already fits comfortably on 8 GPUs, NVL72 is overkill — most of the fabric goes unused. The 72-GPU domain is a scale ceiling you can choose to fill or not. It is NVIDIA's answer to "what is the natural unit for a single parallel job?" — and that answer has moved from "the server" to "the rack."
Related explainers
- vLLM v0.20 — FlashAttention 4 packing — what happens to attention kernels when the rack-level fabric stops being the bottleneck
- SGLang v0.5.12 — TokenSpeed MLA backend — Blackwell-targeted attention backend that the next-gen Rubin will inherit