NVIDIA RTX Spark — Unified CPU–GPU memory
The news. On June 2, 2026, around Computex / Build 2026, NVIDIA unveiled RTX Spark, a consumer "superchip" aimed at on-device AI agents. It combines a Blackwell RTX GPU (6,144 CUDA cores, fifth-generation FP4 Tensor Cores) with a 20-core Grace CPU over NVLink-C2C, delivering up to 1 petaflop of AI compute and 128GB of unified memory. RTX Spark laptops and compact desktops ship this fall from ASUS, Dell, HP, Lenovo, Microsoft Surface, and MSI.
Picture the two chefs for a second. The prep chef chops and stages every ingredient on his bench; the line chef does the fast searing under the heat. On a normal setup they work in separate rooms joined by one narrow serving hatch — every tray of mise en place has to be slid through that slot before the line chef can touch it, and finished plates slid back. For a two-cover lunch the hatch is fine. For a 200-cover banquet, the hatch is the bottleneck: both chefs spend more time shoving trays through the slot than actually cooking. RTX Spark knocks out the wall. Now both chefs work at one long shared counter — the line chef reaches over and grabs the mise en place exactly where the prep chef left it. No hatch, no sliding, no copy.
In CUDA terms, the hatch is the PCIe bus and the trays are cudaMemcpy. A discrete GPU keeps its fast VRAM physically separate from the CPU's system RAM; before a kernel can run, the input is copied host→device across PCIe, and the result copied back. The classic four-step dance looks like this:
[Figure: the four-step flow. Host memory holds float *a, *b (input data) and float *c (results). Step 1, cudaMalloc: reserve memory. Step 3, kernel<<<>>>(): 1000s of threads. Caption: Data crosses the PCIe bus twice; GPUs need large workloads to pay off.]
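To put a number on "crosses the PCIe bus twice", here is a minimal Python sketch for an element-wise c = a + b on float arrays. The function names and the vector-add workload are illustrative assumptions, not something the figure specifies. The 64 GB/s default is the PCIe 5.0 ×16 figure this article uses, and fixed per-transfer latency is ignored.

```python
FLOAT_BYTES = 4  # size of one 32-bit float


def pcie_traffic_bytes(n: int) -> int:
    """Bytes that cross PCIe for c = a + b on n floats (hypothetical workload)."""
    host_to_device = 2 * n * FLOAT_BYTES  # inputs a and b are copied in
    device_to_host = n * FLOAT_BYTES      # result c is copied back
    return host_to_device + device_to_host


def copy_seconds(n: int, pcie_gb_per_s: float = 64.0) -> float:
    """Lower bound on total copy time, ignoring per-transfer latency."""
    return pcie_traffic_bytes(n) / (pcie_gb_per_s * 1e9)


if __name__ == "__main__":
    for n in (1_000, 1_000_000, 1_000_000_000):
        print(f"n={n:>13,}  traffic={pcie_traffic_bytes(n):>14,} B  "
              f"copy >= {copy_seconds(n):.6f} s")
```

Each element costs 12 bytes of bus traffic in exchange for a single addition, which is why small or arithmetic-light kernels rarely repay the round trip.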
A PCIe 5.0 ×16 link tops out around 64 GB/s in each direction: quick in isolation, but glacial next to a GPU's on-package memory bandwidth, which reaches several TB/s on high-end parts (order-of-magnitude figure). For a model that fits in VRAM you pay the copy once and amortize it. For a model bigger than VRAM, you stream weights across PCIe layer by layer, every forward pass, and the copy, not the matmul, sets your token rate. That's the regime where decode goes bandwidth-bound with the PCIe link as the narrowest pipe, and the GPU's compute cores sit idle waiting for bytes.
RTX Spark deletes the staging copy outright. A Grace CPU and a Blackwell GPU are bonded over NVLink-C2C into a single 128GB coherent pool. Coherent is the load-bearing word: both processors see the same bytes at the same addresses, and a write by one is visible to the other with no explicit transfer. The GPU stops being a walled-off device you ship data to and becomes a peer that reads the data in place — the same shift that the memory ladder work frames as moving the bottleneck back toward on-package bandwidth, where it belongs.
This is why NVIDIA pitches RTX Spark as an on-device agent machine. Local agents juggle big context windows, KV caches, and sometimes several models at once — state that is awkward to shuttle across PCIe but trivial to share in a unified pool. A 70B-class model at 4-bit weights needs ~35GB; it won't fit in a typical discrete laptop GPU's 8–16GB of VRAM, so today it either spills to system RAM over PCIe (slow) or simply won't run. With 128GB of unified memory, the same model just lives in the pool and the GPU addresses all of it. (NVIDIA has not published the consumer part's exact NVLink-C2C bandwidth, so treat the on-package figures below as illustrative.)
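The ~35GB figure follows from simple arithmetic, sketched below in Python. The helper names are hypothetical, and the estimate covers weights only: it ignores KV cache, activations, and quantization metadata, all of which add to the real footprint.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(footprint_gb: float, capacity_gb: float) -> bool:
    """True if the weights alone fit in the given memory capacity."""
    return footprint_gb <= capacity_gb


if __name__ == "__main__":
    need = weight_footprint_gb(70, 4)  # 70B parameters at 4 bits each
    print(f"70B @ 4-bit needs about {need:.0f} GB")
    for name, cap in (("16GB discrete VRAM", 16), ("128GB unified pool", 128)):
        print(f"  fits in {name}: {fits(need, cap)}")
```

Changing `bits_per_weight` to 8 or 16 shows how quickly the same model outgrows even large discrete cards.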
Where the copy time actually goes
A back-of-envelope walk-through (illustrative numbers; substitute your own workload). Take a 34GB 4-bit model that does not fit in a 16GB discrete GPU. On the discrete path, running one forward pass means streaming all 34GB of weights across PCIe 5.0 at ~64 GB/s → about 0.53 s of pure copy per pass. During decode that's roughly one pass per token, so the copy alone caps you near 1.9 tokens/s before a single multiply happens, and the GPU cores idle the whole time. On the unified path, the GPU addresses all 34GB in the shared pool directly at on-package bandwidth, call it ~0.5 TB/s for a consumer part (illustrative) → reading the same 34GB takes about 0.07 s, roughly 8× less wait, and the bottleneck moves back to on-package memory bandwidth, where it belongs. The model size didn't change; the copy disappeared.
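The walk-through reduces to two divisions, sketched here in Python so the numbers can be swapped for your own. The function and constant names are hypothetical, and the 0.5 TB/s unified figure is the article's illustrative placeholder, not a published spec.

```python
MODEL_GB = 34.0        # 4-bit model from the walk-through
PCIE_GB_S = 64.0       # PCIe 5.0 x16, approximate
UNIFIED_GB_S = 500.0   # illustrative on-package bandwidth (assumption)


def seconds_per_pass(model_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to move every weight once at the given bandwidth."""
    return model_gb / bandwidth_gb_per_s


def token_rate_ceiling(model_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on decode tokens/s when each token reads all weights once."""
    return bandwidth_gb_per_s / model_gb


if __name__ == "__main__":
    for name, bw in (("PCIe", PCIE_GB_S), ("unified", UNIFIED_GB_S)):
        print(f"{name:>8}: {seconds_per_pass(MODEL_GB, bw):.3f} s/pass, "
              f"<= {token_rate_ceiling(MODEL_GB, bw):.1f} tokens/s")
    print(f"wait ratio: {UNIFIED_GB_S / PCIE_GB_S:.1f}x")
```

Both ceilings ignore compute time and any overlap of transfer with execution, so real token rates will sit below them.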
How systems connect CPU and GPU memory
| System | CPU ↔ GPU memory | Interconnect | Host→device copy? |
|---|---|---|---|
| Discrete GPU (PCIe card) | separate VRAM + system RAM | PCIe 5.0 ~64 GB/s (setup-dependent) | Yes — both ways |
| Integrated GPU (iGPU) | shared system RAM | on-die | No, but low bandwidth |
| Apple Silicon (UMA) | unified system memory | on-package fabric | No |
| NVIDIA Grace Hopper (GH200) | unified, coherent | NVLink-C2C ~900 GB/s (GH200 figure) | No |
| NVIDIA RTX Spark (2026) | unified 128GB, coherent | NVLink-C2C (consumer bandwidth undisclosed) | No — zero-copy |
A caveat worth attaching to the headline: unified memory removes the copy, not the bandwidth wall. The pool is still finite-bandwidth memory, so a model that's memory-bandwidth-bound on a discrete card is still bandwidth-bound on RTX Spark — it just stops paying the PCIe tax on top. And NVIDIA quotes "1 petaflop" as a low-precision (FP4) peak, not a sustained number. The structural win is real and narrow: the host↔device copy goes away, which is exactly the tax that makes over-VRAM models painful on today's laptops.
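The caveat can be made concrete with a roofline-style estimate in Python. This sketch assumes a dense model doing about 2 FLOPs per weight per token at batch size 1 (a common rule of thumb, not a figure from NVIDIA), 4-bit weights at 0.5 bytes each, and the illustrative 0.5 TB/s pool bandwidth used earlier. The function names are hypothetical.

```python
def arithmetic_intensity(flops_per_weight: float, bytes_per_weight: float) -> float:
    """FLOPs performed per byte read from memory."""
    return flops_per_weight / bytes_per_weight


def attainable_tflops(peak_tflops: float, bandwidth_tb_per_s: float,
                      intensity: float) -> float:
    """Roofline bound: the lesser of the compute peak and bandwidth * intensity."""
    return min(peak_tflops, bandwidth_tb_per_s * intensity)


if __name__ == "__main__":
    peak = 1000.0  # quoted "1 petaflop" FP4 peak, expressed in TFLOP/s
    ai = arithmetic_intensity(flops_per_weight=2.0, bytes_per_weight=0.5)
    bound = attainable_tflops(peak, bandwidth_tb_per_s=0.5, intensity=ai)
    print(f"intensity = {ai:.1f} FLOP/byte, attainable = {bound:.1f} TFLOP/s "
          f"({100 * bound / peak:.1f}% of peak)")
```

Under these assumptions single-stream decode can use only a small fraction of the quoted peak, so the memory pool's bandwidth, not the Tensor Cores, sets the ceiling. Larger batches raise the intensity and move the bound toward compute.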
Related explainers
- Jetson Thor — Edge Blackwell vs datacenter Blackwell — the robotics cousin: the same Blackwell silicon, also on a unified-memory SoC
- Vera Rubin NVL72 — rack-scale NVLink domain — NVLink at the other extreme: 72 GPUs as one fabric (GPU↔GPU), versus RTX Spark's CPU↔GPU NVLink-C2C
- MobileMoE — DRAM-aware MoE scaling — the algorithmic side of fitting big models in tight on-device memory