The news. On June 16, 2026, AMD published ATOM + ATOMesh, a paired ROCm-native LLM serving stack for Instinct GPUs, shipped as an early (alpha) preview. ATOM is an AITER-optimized inference engine (kernel acceleration via AITER, distributed communication via MORI). ATOMesh is the orchestration layer on top: it exposes an OpenAI-compatible API, manages multiple engine backends, and applies prefill/decode disaggregation and KV-aware scheduling. It was evaluated serving DeepSeek-V4-Pro on Instinct hardware. In AMD's framing it deliberately mirrors the vLLM/SGLang design: the same serving primitives, now on AMD silicon.

Picture a restaurant kitchen where one cook does everything. First they prep an order — chopping, slicing, mixing every ingredient the dish needs, all at once, in a furious burst of knife work. Then they plate it — assembling the dish one component at a time, walking back to the fridge for each piece. Prep is a flat-out, hands-busy job; plating is a lot of trips to the fridge and not much knife work. Cram both onto one cook and they fight: a big prep order makes every waiting plate go cold, and during the slow plating trips the knives sit idle. That single overloaded cook is one GPU running an LLM, and the two jobs are prefill and decode.

When a model answers, it first runs prefill: it reads your entire prompt in one parallel pass, doing dense matrix math and filling the KV cache. Then it runs decode: it emits output one token per step, and every step drags the whole KV cache and all the weights out of memory to produce that single token. Prefill is compute-bound — limited by the GPU's math units — while decode is memory-bandwidth-bound, limited by how fast it can stream the cache out of memory. They are the prep cook and the plating cook: opposite appetites, forced to share one station.

Figure: Prefill vs Decode on the Roofline. Decode sits at roughly 1 FLOP/byte, on the memory-bound side; prefill sits at roughly 100 FLOP/byte, on the compute-bound side. Same GPU, fundamentally different bottlenecks.
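
To make the roofline gap concrete, here is a minimal sketch that estimates arithmetic intensity (FLOPs per byte moved) for a single weight matmul at decode (1 token) and at prefill (a 2,048-token prompt). The layer width, the 2-byte values, and the accelerator's peak numbers are all assumptions picked for illustration, and the model ignores attention entirely, so the absolute values will not match the rounded labels in the figure. What it shows is the size of the gap between the two phases.

```python
def matmul_intensity(n_tokens: int, d_model: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte for one (n_tokens x d_model) @ (d_model x d_model) matmul."""
    # One multiply and one add per weight, per token.
    flops = 2 * n_tokens * d_model * d_model
    # The weight matrix is streamed from memory once per pass.
    weight_bytes = d_model * d_model * bytes_per_value
    # Activations: read the input, write the output.
    activation_bytes = 2 * n_tokens * d_model * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


def bottleneck(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a kernel against the roofline ridge point (peak FLOP/s / bytes/s)."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical accelerator, not a real product spec.
PEAK_FLOPS = 1.0e15      # 1 PFLOP/s
MEM_BANDWIDTH = 4.0e12   # 4 TB/s, so the ridge point is 250 FLOP/byte

D_MODEL = 8192           # assumed hidden size
decode_intensity = matmul_intensity(n_tokens=1, d_model=D_MODEL)
prefill_intensity = matmul_intensity(n_tokens=2048, d_model=D_MODEL)

for name, value in [("decode", decode_intensity), ("prefill", prefill_intensity)]:
    label = bottleneck(value, PEAK_FLOPS, MEM_BANDWIDTH)
    print(f"{name}: {value:.1f} FLOP/byte ({label})")
```

With one token per step, every weight byte read supports only about two FLOPs, so decode lands near 1 FLOP/byte no matter how fast the math units are. Batching many tokens through the same weights is what lifts prefill over the ridge.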

That opposite-appetites problem is why a single shared worker wastes hardware. Pack prefill and decode together and a long prompt's prefill burst blocks the queue of decode steps behind it — a head-of-line stall — while the memory-bound decodes leave the expensive compute units sitting idle. You can never shape one machine to be right for both jobs at once.

Disaggregation is the fix: give prep and plating their own stations. Prefill runs on one pool of GPUs, scheduled for compute-heavy bursts; decode runs on a separate pool, scheduled for steady memory-bound streaming with large batches. When a request finishes prefill, the prefill worker hands its KV cache across the interconnect to a decode worker, which then streams the tokens out. Each pool is now sized and tuned for the one bottleneck it actually has — and AMD's ATOMesh is the orchestration layer that does exactly this routing on ROCm. This is the same playbook vLLM and SGLang made standard; ATOM + ATOMesh shows AMD building a ROCm-native path to it.
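
The two-station flow can be sketched as a toy loop. This is not ATOMesh's API or scheduler: every name below (`Request`, `prefill`, `decode_step`, `serve`) is hypothetical, the "model" is a placeholder arithmetic function, and both pools run in one Python loop rather than concurrently on separate GPUs. It only illustrates the shape of the handoff: prefill builds the KV cache once, then the request joins a decode batch that advances one token per step.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: int
    prompt: list[int]
    max_new_tokens: int
    kv_cache: list[int] = field(default_factory=list)  # stand-in for real K/V tensors
    output: list[int] = field(default_factory=list)


def prefill(req: Request) -> Request:
    """Prefill pool: consume the whole prompt in one pass and build the KV cache."""
    req.kv_cache = list(req.prompt)
    return req


def decode_step(req: Request) -> bool:
    """Decode pool: emit one token, extend the cache, report whether we are done."""
    next_token = (sum(req.kv_cache) + len(req.output)) % 1000  # toy "model"
    req.output.append(next_token)
    req.kv_cache.append(next_token)
    return len(req.output) >= req.max_new_tokens


def serve(requests: list[Request]) -> list[Request]:
    prefill_queue = deque(requests)
    decode_batch: list[Request] = []
    finished: list[Request] = []
    while prefill_queue or decode_batch:
        if prefill_queue:
            # Handoff: the finished KV cache moves from the prefill pool to the decode pool.
            decode_batch.append(prefill(prefill_queue.popleft()))
        still_running = []
        for req in decode_batch:  # one decode step for every request in the batch
            (finished if decode_step(req) else still_running).append(req)
        decode_batch = still_running
    return finished


done = serve([Request(0, [1, 2, 3], max_new_tokens=2), Request(1, [4, 5], max_new_tokens=3)])
for req in done:
    print(req.req_id, req.output)
```

In a real deployment the `decode_batch.append(...)` line is the expensive part, because it stands for a KV-cache copy across the interconnect.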

But disaggregation is not free, and the bill comes due at the handoff. After prefill, the KV cache has to physically travel from the prefill pool to the decode pool. For a 70B-class model with a 2,048-token prompt, that cache is 2 (K and V) × 80 layers × 8 KV-heads × 128 dim × 2,048 tokens × 2 B ≈ 0.67 GB (illustrative, Llama-3.1-70B with grouped-query attention). Move it over PCIe 4.0 and you pay roughly 21 ms; over NVLink, about 0.75 ms, a ~28× gap. All three figures are illustrative: the size comes from the formula above, the times are set by each interconnect's bandwidth, and none were measured on ATOM. That gap is why disaggregated stacks live or die by their interconnect, and why KV-aware scheduling tries to dodge the transfer entirely by steering a request to a worker that already holds its prefix.
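
The arithmetic above can be reproduced in a few lines. The model shape is the one quoted in the text. The bandwidth values are assumptions chosen to be consistent with the quoted times (about 32 GB/s for a PCIe 4.0 x16 link, 50 GB/s for a 400 Gb/s InfiniBand NDR link, 900 GB/s for NVLink); real links deliver less than their nominal rate, and nothing here was measured on ATOM.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one sequence: K and V, per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


def transfer_ms(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds at a given bandwidth (no latency, no overhead)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


# Assumed nominal bandwidths in GB/s; illustrative, not measured.
ASSUMED_BANDWIDTH_GB_S = {
    "PCIe 4.0 x16": 32,
    "InfiniBand NDR": 50,
    "NVLink": 900,
}

# Llama-3.1-70B-style shape from the text: 80 layers, 8 KV heads (GQA), head dim 128, 2-byte values.
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=2048)
print(f"KV cache: {size / 1e9:.2f} GB")

for link, bandwidth in ASSUMED_BANDWIDTH_GB_S.items():
    print(f"{link}: {transfer_ms(size, bandwidth):.2f} ms")

ratio = transfer_ms(size, 32) / transfer_ms(size, 900)
print(f"NVLink vs PCIe 4.0: {ratio:.1f}x")
```

Because the cache size scales linearly with prompt length, a 32K-token prompt under the same assumptions moves 16 times as much data and takes 16 times as long on every link.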

KV cache size (Llama 3.1 70B)

2 [K + V] × 80 [layers] × 8 [KV heads] × 128 [head dim] × 2,048 [tokens] × 2 B [bytes/value] ≈ 0.67 GB

Transfer time (0.67 GB KV cache)

PCIe 4.0: 21 ms
InfiniBand NDR: 13 ms
NVLink: 0.75 ms

NVLink is ~28× faster than PCIe; same-node transfers are near-free.
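
The cheapest transfer is the one that never happens, which is the idea behind KV-aware scheduling. A minimal sketch, assuming a router that knows which token prefixes each worker has cached: send the request to the worker sharing the longest prefix with the prompt, and break ties by load. The function names, worker names, and data layout are hypothetical. This is not how ATOMesh, vLLM, or SGLang implement it; production routers typically match on hashed cache blocks or a prefix tree rather than on raw token lists.

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(prompt: list[int],
                cached_prefixes: dict[str, list[list[int]]],
                load: dict[str, int]) -> str:
    """Prefer the worker with the longest cached prefix; break ties by lowest load."""
    def score(worker: str) -> tuple[int, int]:
        best_match = max(
            (shared_prefix_len(prompt, p) for p in cached_prefixes.get(worker, [])),
            default=0,
        )
        return (best_match, -load[worker])
    return max(load, key=score)


# Hypothetical cluster state: cached token prefixes and in-flight request counts.
cached = {
    "decode-0": [[1, 2, 3, 4]],
    "decode-1": [[1, 2]],
    "decode-2": [],
}
load = {"decode-0": 5, "decode-1": 1, "decode-2": 0}

print(pick_worker([1, 2, 3, 9], cached, load))  # busiest worker, but it holds the longest prefix
print(pick_worker([7, 8], cached, load))        # no cache hit anywhere, so load decides
```

A real scheduler would weigh cache hits against load instead of ranking them strictly, since a long queue on the cache-holding worker can cost more than the transfer it saves.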

| Phase | What it processes | Bottleneck (roofline) | What it wants from the hardware |
|---|---|---|---|
| Prefill | The whole prompt, in one parallel pass | Compute-bound: high arithmetic intensity | Raw matmul throughput; fewer, fatter GPUs |
| Decode | One output token per step, reading the full KV cache | Memory-bandwidth-bound: low arithmetic intensity | Memory bandwidth and large batches to amortize the weight reads |

The honest caveat: ATOM + ATOMesh ship as an early (alpha) preview, and AMD's post describes the mechanism, not head-to-head numbers. It reports that ATOMesh mirrors the vLLM/SGLang design and was evaluated serving DeepSeek-V4-Pro, but the post text gives no usable throughput or latency figures, so treat any performance claim as not yet quantified and check the source for benchmarks. The KV-transfer figures above are illustrative, sized to a representative model rather than measured on ATOM. But the durable lesson stands: once you see that prefill and decode sit on opposite sides of the roofline, "one GPU does both" stops looking efficient, and a serving stack's real job is to split the two phases and move the KV cache between them cheaply.
