What is the Vera Rubin NVL72?

Vera Rubin NVL72 is NVIDIA's post-Blackwell rack-scale platform — 36 Vera Arm CPUs paired with 72 Rubin GPUs, all wired into one sixth-generation NVLink Switch fabric, along with ConnectX-9 SuperNICs for scale-out Ethernet and BlueField-4 DPUs for storage and network offload. It is the successor to the Blackwell-generation GB200 NVL72, and it swept COMPUTEX 2026's Best Choice Awards on May 21, 2026.

What does "rack-scale NVLink domain" mean?

The NVLink domain is the set of GPUs that can talk to each other at NVLink bandwidth without crossing PCIe or the data-center network. On prior-generation HGX servers such as Hopper HGX H100 and Blackwell HGX B200, the NVLink domain was 8 GPUs: one server. On NVL72, the NVLink domain is 72 GPUs: the entire rack. Any communication beyond the domain boundary drops to NIC-and-switch speeds, roughly an order of magnitude slower than NVLink.

Why does it matter for LLM serving and training?

Tensor-parallel and expert-parallel collectives (the all-reduces and all-to-alls that happen at every transformer layer) bottleneck on the slowest link in the parallel group. With an 8-GPU NVLink domain, tensor-parallel groups had to stay at most 8-way to keep the all-reduce on NVLink, and anything wider paid the cross-server PCIe-over-network tax. With a 72-GPU NVLink domain, a single trillion-parameter model can run as one parallel job at NVLink speeds, and an MoE deployment can put one expert per GPU and rely on the same fabric for the all-to-all. NVIDIA quotes ~10× higher inference performance per watt and ~10× lower cost per token vs the prior generation; these are vendor figures, and real gains vary widely with workload.

NVIDIA Vera Rubin NVL72 — Rack-scale NVLink domain

TL;DR

What is it: NVIDIA's Vera Rubin NVL72 is a rack-scale platform (36 Vera CPUs paired with 72 Rubin GPUs) wired together by a sixth-generation NVLink Switch. This explainer focuses on the rack-scale NVLink domain: every GPU in the rack is reachable from every other at NVLink bandwidth, with no PCIe hop in between.
Why it’s needed: Tensor-parallel and expert-parallel collectives — the all-reduces and all-to-alls used at every transformer layer — bottleneck on the slowest link in the parallel group. Lifting the NVLink boundary from one server up to a whole rack lets bigger parallel groups stay on the fast fabric.
vs previous: Prior-generation HGX servers — the building block under Hopper H100 and Blackwell B200 deployments — commonly used 8-GPU NVLink groups. Anything beyond the server boundary dropped to NIC speeds. NVL72 makes the rack itself the natural NVLink unit.

Jargon

NVLink: NVIDIA's GPU-to-GPU interconnect, much faster per-GPU than PCIe. The exact per-generation numbers vary by configuration, but the practical rule of thumb is that NVLink is roughly an order of magnitude faster than PCIe for the patterns transformer training and serving care about.
NVLink Switch: A separate switch chip that lets many GPUs share an NVLink fabric instead of point-to-point cables — turns NVLink from a within-server link into a within-rack network.
NVLink domain: The set of GPUs that can talk to each other at NVLink bandwidth without leaving the fabric. On the prior HGX building block this was commonly 8 GPUs per server; on NVL72 it is 72 GPUs per rack.
PCIe-over-network: Shorthand for the path collectives commonly took to cross 8-GPU server boundaries: out of one server, across the data-center network through a top-of-rack switch, and back in on the other side.
all-reduce: The collective each tensor-parallel layer ends with — every GPU sums its partial output with every other GPU's partial output. The slowest link in the parallel group sets the layer's wall-clock time.
tensor parallel (TP): Splitting a single weight matrix across N GPUs so the matmul runs in parallel; needs an all-reduce after every layer.
expert parallel (EP): For Mixture-of-Experts models: putting different experts on different GPUs and routing tokens to the GPU that holds the expert; needs an all-to-all per layer.
Vera CPU: NVIDIA's next-generation Arm host CPU; one Vera CPU pairs with two Rubin GPUs in NVL72.
Rubin GPU: NVIDIA's post-Blackwell GPU generation. NVL72 is the platform that ships with it.
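
To make the tensor-parallel and all-reduce entries above concrete, here is a minimal pure-Python sketch. It simulates GPUs as list slices and assumes a row-wise split of the weight matrix; the helper names (`matmul`, `all_reduce_sum`, `tensor_parallel_matmul`) are hypothetical and chosen for illustration, not taken from any NVIDIA library.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def all_reduce_sum(partials):
    """Element-wise sum of every simulated GPU's partial output.

    Returns one copy of the total per GPU, mirroring the fact that
    after an all-reduce every participant holds the same result.
    """
    rows, cols = len(partials[0]), len(partials[0][0])
    total = [[sum(p[i][j] for p in partials) for j in range(cols)]
             for i in range(rows)]
    return [[row[:] for row in total] for _ in partials]


def tensor_parallel_matmul(x, w, n_gpus):
    """Split w by rows (and x by matching columns) across n_gpus.

    Assumes the inner dimension divides evenly by n_gpus.
    """
    shard = len(w) // n_gpus
    partials = []
    for g in range(n_gpus):
        lo, hi = g * shard, (g + 1) * shard
        x_slice = [row[lo:hi] for row in x]   # this GPU's columns of x
        w_slice = w[lo:hi]                    # this GPU's rows of w
        partials.append(matmul(x_slice, w_slice))  # local, no communication
    return all_reduce_sum(partials)           # the communication step


x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0], [0, 1], [1, 1], [2, 0]]
outputs = tensor_parallel_matmul(x, w, n_gpus=2)
print(outputs[0] == matmul(x, w))
```

The local matmuls need no communication; only the final sum does, which is why the link that carries the all-reduce sets the pace.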

The news. On May 21, 2026, NVIDIA's Vera Rubin NVL72 swept COMPUTEX 2026's Best Choice Awards (Golden Award plus Sustainable Tech Special Award). The rack joins 36 Vera CPUs and 72 Rubin GPUs into a single NVLink domain through the sixth-generation NVLink Switch, and surrounds that domain with ConnectX-9 SuperNICs, Spectrum-X Ethernet, and BlueField-4 DPUs. NVIDIA quotes "up to ~10× higher inference performance per watt" and "~10× lower cost per token" vs the prior generation; with a Groq 3 LPX pairing the headline figure rises to ~35× throughput per watt for trillion-parameter models.

Why "72 GPUs in one NVLink domain" is the load-bearing claim

Picture an office floor divided into nine rooms, each with eight desks sharing a fast whiteboard. Anyone in the same room can shout a number across; anyone in a different room has to send a chat message that crawls through a slow gateway. If a meeting needs every desk to chime in at the same speed, the slow chat sets the pace. Knocking the walls down turns nine rooms into one open floor, and the meeting now runs at whiteboard speed.

That gateway is the inter-node path (GPU to PCIe to NIC to network) that the prior-generation HGX building block commonly used for any communication beyond 8 GPUs. Inside one server the 8 GPUs sat on a fast NVLink mesh; cross the server boundary and the collective dropped to NIC speeds, which are fast enough for storage traffic but slow for the all-reduce that every transformer layer needs. Tensor-parallel groups commonly stayed 8-way to keep the all-reduce on NVLink; going wider often paid a noticeable bandwidth tax. Expert parallel did the same dance, while pipeline parallel, whose point-to-point traffic is lighter, was the usual way to span servers.

The NVL72 design choice is mechanical and easy to state: build one NVLink fabric for the whole rack. The sixth-generation NVLink Switch sits between all 72 GPUs and replaces the cross-server NIC path for in-rack collectives. A tensor-parallel group can now sit at 16-way, 32-way, or even larger without leaving the NVLink fabric — and an expert-parallel deployment can put one expert per GPU and rely on the same fabric for the all-to-all. The 8-GPU island is no longer the natural unit. The rack is.
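
One way to see the effect of the wider domain is to count how many parallel groups are forced off NVLink. The sketch below assumes ranks are numbered contiguously within each NVLink domain, which is a hypothetical layout chosen for illustration, and `groups_crossing_domain` is an invented helper name. It uses group sizes that divide 72 evenly (8, 24, 72) so that every group is full.

```python
def groups_crossing_domain(world_size, group_size, domain_size):
    """Count contiguous parallel groups that span more than one NVLink domain.

    Assumption: ranks [k * domain_size, (k + 1) * domain_size) share a domain.
    """
    crossing = 0
    for start in range(0, world_size, group_size):
        last = min(start + group_size, world_size) - 1
        # A group crosses if its first and last rank sit in different domains.
        if start // domain_size != last // domain_size:
            crossing += 1
    return crossing


WORLD = 72  # one rack's worth of GPUs
for group in (8, 24, 72):
    on_hgx = groups_crossing_domain(WORLD, group, domain_size=8)
    on_nvl72 = groups_crossing_domain(WORLD, group, domain_size=72)
    print(f"group size {group:>2}: crossing with 8-GPU domains = {on_hgx}, "
          f"with one 72-GPU domain = {on_nvl72}")
```

With 8-GPU domains, any group wider than 8 leaves NVLink; with a single 72-GPU domain, none of these groups do.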

Generation	NVLink domain (common config)	Notes
Hopper HGX H100 (2022)	8 GPUs / server	Cross-server collectives drop to InfiniBand / Ethernet.
Blackwell HGX B200 (2024)	8 GPUs / server	Same 8-GPU server building block.
Blackwell GB200 NVL72 (2024)	72 GPUs / rack	First rack-scale NVLink domain in the Blackwell generation.
Vera Rubin NVL72 (2026)	72 GPUs / rack	Sixth-gen NVLink Switch; Vera Arm CPU; ConnectX-9 SuperNICs; NVIDIA-claimed ~10× perf/watt.

(Numbers above are common configurations as marketed; NVIDIA's ~10× perf/watt and ~10× lower cost per token are from the COMPUTEX 2026 release, not third-party-validated, and naturally depend on workload mix.)

A back-of-envelope feel for where time goes

A precise NVL72-vs-HGX comparison would need vendor-published bandwidth numbers and a specific workload; what follows is a deliberately illustrative sketch, not an NVIDIA-validated benchmark.

Walk a single transformer-layer forward pass on a tensor-parallel deployment. Each GPU holds a slice of the layer's weight matrix; the matmul is fast and local. Then comes the all-reduce: every GPU sums its partial output with every other GPU's before the next layer can start. The slowest link in the parallel group is what the layer waits on.

If that slowest link is the NIC path between two 8-GPU servers, it is in the tens-of-GB/s range; if it is the NVLink fabric, it is in the hundreds-of-GB/s to low-TB/s range per GPU (exact numbers depend on generation and link width). For a transformer activation tensor sized in the gigabytes — typical at large batch + long context — the gap between those two regimes is the gap between a layer's collective costing tens of milliseconds and costing a couple of milliseconds. Multiply by the layer count of a large model and the difference shows up at the token-budget level.
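
The arithmetic behind that paragraph can be sketched with the standard bandwidth-only cost model for a ring all-reduce, in which each GPU moves 2(N-1)/N of the tensor over its slowest link. This is a textbook approximation, not a description of how NVLink Switch or any NVIDIA library actually schedules the collective, and every number below (1 GB tensor, 50 GB/s NIC path, 900 GB/s NVLink path, 16 GPUs, 100 layers) is an assumed placeholder picked to fall inside the ranges quoted above.

```python
def ring_all_reduce_seconds(tensor_bytes, n_gpus, link_bytes_per_s):
    """Bandwidth-only ring all-reduce estimate; per-message latency is ignored."""
    traffic = 2 * (n_gpus - 1) / n_gpus * tensor_bytes  # bytes moved per GPU
    return traffic / link_bytes_per_s


GB = 1e9
TENSOR_BYTES = 1 * GB     # assumed activation tensor size
NIC_BW = 50 * GB          # assumed cross-server path, "tens of GB/s"
NVLINK_BW = 900 * GB      # assumed NVLink path, "hundreds of GB/s"
N_GPUS = 16               # assumed tensor-parallel width
N_LAYERS = 100            # assumed layer count

nic_ms = ring_all_reduce_seconds(TENSOR_BYTES, N_GPUS, NIC_BW) * 1e3
nvlink_ms = ring_all_reduce_seconds(TENSOR_BYTES, N_GPUS, NVLINK_BW) * 1e3

print(f"per layer:  NIC {nic_ms:.1f} ms   NVLink {nvlink_ms:.2f} ms")
print(f"per pass:   NIC {nic_ms * N_LAYERS / 1e3:.2f} s   "
      f"NVLink {nvlink_ms * N_LAYERS / 1e3:.2f} s")
```

Under these assumptions the per-layer collective costs about 37.5 ms on the NIC path and about 2.1 ms on NVLink, matching the "tens of milliseconds" versus "a couple of milliseconds" gap described above.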

The matmul itself does not change between the two regimes; the network around it does. A trillion-parameter model that does not fit comfortably on 8 GPUs would have spilled across server boundaries on the old building block and paid the slower-link tax at every layer. On a single NVL72 rack it sits inside one NVLink domain.
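
A rough sizing check shows why a trillion-parameter model spills past 8 GPUs. The sketch below counts weights only (no KV cache, activations, or optimizer state) and assumes 2 bytes per parameter, a hypothetical 192 GB of memory per GPU, and 80% of that usable for weights. None of these values are Rubin specifications; `min_gpus_for_weights` is an illustrative helper.

```python
import math


def min_gpus_for_weights(n_params, bytes_per_param, gpu_mem_bytes,
                         usable_fraction=0.8):
    """Smallest GPU count whose combined usable memory holds the weights alone."""
    weight_bytes = n_params * bytes_per_param
    usable_per_gpu = gpu_mem_bytes * usable_fraction
    return math.ceil(weight_bytes / usable_per_gpu)


needed = min_gpus_for_weights(
    n_params=1e12,        # one trillion parameters
    bytes_per_param=2,    # assumed 16-bit weights
    gpu_mem_bytes=192e9,  # hypothetical per-GPU memory, not a vendor figure
)
print(f"GPUs needed for weights alone: {needed}")
print(f"fits in one 8-GPU server: {needed <= 8}")
print(f"fits in one 72-GPU rack:  {needed <= 72}")
```

Under these assumptions the weights alone need 14 GPUs: more than one 8-GPU server, but well inside a 72-GPU domain.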

What goes around the GPUs

The NVLink fabric is the headline, but the rest of the platform matters too. ConnectX-9 SuperNICs and Spectrum-X Ethernet handle the scale-out fabric that connects multiple NVL72 racks for multi-rack jobs. BlueField-4 DPUs offload storage and network management from the host CPUs. And the Vera CPU is paired tightly with Rubin inside the rack.

The mental model: scale-up inside the rack runs on NVLink; scale-out across racks runs on Ethernet. The design question that used to sit at the boundary between "inside a server" and "across servers" now sits at the boundary between "inside a rack" and "across racks." A job that fits in 72 GPUs never has to leave the NVLink domain; for larger jobs, the splitting question becomes how to partition across racks rather than across servers.
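
The scale-up versus scale-out split can be written down as a small lookup. This sketch assumes global ranks are assigned rack by rack, 72 per rack; the function names and fabric labels are invented for illustration.

```python
GPUS_PER_RACK = 72  # NVLink domain size on NVL72, per the text


def rack_of(rank, gpus_per_rack=GPUS_PER_RACK):
    """Rack index for a global GPU rank (assumes rack-by-rack numbering)."""
    return rank // gpus_per_rack


def fabric_between(rank_a, rank_b, gpus_per_rack=GPUS_PER_RACK):
    """Which fabric carries traffic between two GPUs."""
    if rack_of(rank_a, gpus_per_rack) == rack_of(rank_b, gpus_per_rack):
        return "scale-up (NVLink)"
    return "scale-out (Ethernet)"


def job_fabrics(ranks):
    """Every fabric a job touches when all of its GPUs talk to each other."""
    ranks = list(ranks)
    return sorted({fabric_between(a, b) for a in ranks for b in ranks if a != b})


print(job_fabrics(range(72)))    # a job that fits in one rack
print(job_fabrics(range(144)))   # a job spanning two racks
```

A 72-GPU job touches only the scale-up fabric; a 144-GPU job touches both, so the partitioning decision moves to the rack boundary.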

Where it earns its keep — and where it does not

The win is loudest at two extremes. Very large models, whether trillion-parameter dense or extreme-MoE, were the workloads that barely fit on prior-generation servers and paid the most for cross-server collectives. They benefit the most from a wider NVLink domain. High-throughput inference with many concurrent requests benefits too: more GPUs in one NVLink domain means more flexible parallelism strategies and bigger continuous batches without paying a network tax at every layer.

For a model that already fits comfortably on 8 GPUs, NVL72 is overkill — most of the fabric goes unused. The 72-GPU domain is a scale ceiling you can choose to fill or not. It is NVIDIA's answer to "what is the natural unit for a single parallel job?" — and that answer has moved from "the server" to "the rack."

Related explainers

vLLM v0.20 — FlashAttention 4 packing — what happens to attention kernels when the rack-level fabric stops being the bottleneck
SGLang v0.5.12 — TokenSpeed MLA backend — Blackwell-targeted attention backend that the next-gen Rubin will inherit
