The news. On June 16, 2026, NVIDIA reported that its Blackwell platform posted the fastest time on every one of MLPerf Training 6.0's seven benchmarks. The new GB300 NVL72 rack trained up to 1.6× faster than the previous GB200 NVL72. Submissions scaled to 8,192 GPUs: CoreWeave trained DeepSeek-V3 671B to target in 2.02 minutes, and Microsoft Azure hit the quality target on Llama 3.1 405B in 7.07 minutes at 8,192-GPU scale. The round also added new mixture-of-experts pretraining workloads.

Picture the rowing crew. The finish line is the model's quality target, and one stroke of the whole crew is one training step — every rower pulls, the boat lurches forward, and they reset for the next stroke. Each rower is a GPU, working a different slice of the same race. The trick is the catch: the instant the oars enter the water. If every oar hits at the same moment the boat surges; if they're even slightly out of time, the power cancels and the boat wallows. Adding rowers should make the boat faster — but only if the bigger crew can still hit the catch together.

That "only if" is the whole story, and its real name is strong scaling: fix the model, add GPUs, and see how much the clock actually drops. The catch is the catch. Every step, the GPUs have to stop and combine their partial results — gradients across the data-parallel replicas, plus activations and weights traded inside the tensor- and expert-parallel groups — before the next step can begin. That synchronization is a tax that grows as the crew grows, so doubling the GPUs buys you less than 2× — the speedup curve bends below the straight line. A naive cluster, like a crew that can't hold its timing, gives back most of what each new rower adds.
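To see why the curve bends, here is a toy model of one training step, offered as a sketch rather than a description of any real cluster. It assumes the compute divides perfectly across GPUs while the synchronization cost grows with the logarithm of the GPU count; that growth law and both constants (`compute_s`, `sync_base_s`) are made up for illustration and do not come from the MLPerf results.

```python
import math


def step_time(n_gpus, compute_s=512.0, sync_base_s=0.002):
    """Toy per-step time in seconds (all constants are hypothetical).

    compute_s   : total compute per step, assumed to split evenly across GPUs
    sync_base_s : assumed sync cost per doubling of the crew (log2 growth)
    """
    compute = compute_s / n_gpus
    sync = sync_base_s * math.log2(n_gpus)
    return compute + sync


def doubling_speedup(n_gpus):
    """Speedup gained by going from n_gpus to 2 * n_gpus in the toy model."""
    return step_time(n_gpus) / step_time(2 * n_gpus)


if __name__ == "__main__":
    for n in (64, 512, 4096):
        print(f"{n:>5} -> {2 * n:>5} GPUs: {doubling_speedup(n):.2f}x")
```

In this model each doubling buys less than 2×, and the shortfall widens as the crew grows, because the compute share of the step keeps shrinking while the sync share does not.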

So the engineering is all about making the catch cheap. NVIDIA's answer is the rack-scale NVLink domain: the GB300 NVL72 ties 72 GPUs into one coherent fabric — the racing shell and coxswain that keep a huge crew locked to a single cadence — so the per-step exchange finishes fast enough that thousands of GPUs still row almost as one. Pair that with lower-precision math — Blackwell's tensor cores run the matmuls in 8-bit FP8 and 4-bit NVFP4, lighter oars with less to move every stroke — plus a stronger software stack, and NVIDIA credits that combination for the sweep: a GB300 rack trains up to 1.6× faster than last generation, and 8,192 GPUs finish in minutes, not hours.
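The "lighter oars" point is simple arithmetic: fewer bits per value means fewer bytes to store and move. The sketch below counts raw payload only for a tensor set the size of DeepSeek-V3's 671B parameters. It is an assumption-laden illustration: it ignores the scale factors that low-precision formats carry alongside the values, and real training runs keep some tensors in higher precision, so treat the numbers as an upper bound on the savings rather than what any MLPerf submission actually moved.

```python
# Bits per stored value; the FP8 and NVFP4 widths are the ones quoted in the text.
BITS_PER_VALUE = {"FP32": 32, "BF16": 16, "FP8": 8, "NVFP4": 4}


def payload_gb(n_values, fmt):
    """Raw payload in gigabytes (10^9 bytes), ignoring scale-factor overhead."""
    return n_values * BITS_PER_VALUE[fmt] / 8 / 1e9


if __name__ == "__main__":
    n_params = 671e9  # parameter count from the text, used as an example size
    for fmt in BITS_PER_VALUE:
        print(f"{fmt:>5}: {payload_gb(n_params, fmt):8.1f} GB")
```

Each halving of the bit width halves the payload, which is the sense in which lower precision leaves less to move every stroke.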

| MLPerf Training 6.0 result | Scale | What it shows | Time to target |
| --- | --- | --- | --- |
| GB300 NVL72 vs GB200 NVL72 | 72-GPU rack | hardware-generation speedup | up to ~1.6× faster |
| DeepSeek-V3 671B (MoE) | ~8,192 GPUs | strong scaling, new MoE workload | ~2.02 min (CoreWeave) |
| Llama 3.1 405B (dense) | ~8,192-GPU scale | strong scaling at frontier size | ~7.07 min (Azure) |

Strong scaling, in one calculation

Hold the model fixed (DeepSeek-V3, 671B parameters, trained to MLPerf's quality target) and watch the clock as you add rowers. On 8,192 Blackwell GPUs, CoreWeave's run finished in 2.02 minutes. Now ask the strong-scaling question: had you used half as many GPUs, would it have taken exactly twice as long? Perfect scaling says yes. Suppose (illustrative) the 4,096-GPU run had actually taken 3.7 minutes. Then doubling the GPUs cut the time from 3.7 to 2.02 minutes: a 1.83× speedup, not the ideal 2.0×. Divide the measured speedup by the ideal one (1.83 / 2.0) and you get a scaling efficiency of ~92%; the missing ~8% is overhead, chiefly the time the GPUs spent at the catch, waiting on each other. The entire job at this scale is keeping that number pinned near 100%, which is exactly what a faster NVLink fabric and lighter low-precision oars are for. (The 2.02-minute, 8,192-GPU, and 1.6× figures are from NVIDIA's MLPerf 6.0 report; the 4,096-GPU time is illustrative.)
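The same calculation as a few lines of Python, in case you want to plug in your own timings. The function name `strong_scaling` is just a label chosen here, and the 3.7-minute input is the illustrative 4,096-GPU figure from above, not a reported result.

```python
def strong_scaling(t_small_min, t_large_min, gpu_ratio=2.0):
    """Return (speedup, efficiency) for a fixed workload.

    t_small_min : time on the smaller GPU count
    t_large_min : time on the larger GPU count
    gpu_ratio   : how many times more GPUs the larger run used
    """
    speedup = t_small_min / t_large_min
    efficiency = speedup / gpu_ratio  # 1.0 would be perfect strong scaling
    return speedup, efficiency


# 3.7 min at 4,096 GPUs is illustrative; 2.02 min at 8,192 GPUs is the reported time.
speedup, efficiency = strong_scaling(3.7, 2.02)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
# prints: speedup 1.83x, efficiency 92%
```

Passing a different `gpu_ratio` handles comparisons other than a doubling, for example 4.0 when going from 2,048 to 8,192 GPUs.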


Frequently Asked Questions