The news. On June 18, 2026, researchers posted CacheWeaver, a lightweight prompt-layer method that reorders retrieved evidence so grounded RAG requests reuse as much of the KV prefix cache as possible. It changes neither the serving engine nor the retrieved documents — only the order the chunks appear in the prompt. Across three vLLM configurations it cuts median time-to-first-token by about 20–33% relative to naive retrieval-order prefix caching, reaching 97.5% of the gain an oracle ordering would give, with no measured answer-quality degradation. Read the paper →

Picture a kitchen with a warming shelf of orders that are already half-cooked. A new order comes in, and its first few courses happen to be the same dishes, in the same sequence, as a tray already sitting on the shelf. The cook doesn't start over — they grab the matching tray and only cook the courses that differ, and the first plate leaves the pass far sooner. The whole trick is that the reuse runs strictly from the front: the moment your order's courses stop matching the shelf, every course after that has to be cooked fresh.

In serving terms, each course is a retrieved evidence chunk, the order is the prompt, and the warming shelf is the KV prefix cache. A serving engine keeps the keys and values it already computed for a prompt and reuses them for any later prompt that begins with the exact same tokens. But that matching is unforgiving: change a single token near the front and the block hash no longer matches, so the cache misses and the work is redone.

Two prompts, different hashes
Prompt A
Youarehelpful.Alicesaidhi
hash: e93c
Prompt B
Youarehelpful.Bobsaidhi
hash: 8800
CACHE MISS — hashes differ, shared prefix wasted
Both prompts share 3 leading tokens, but one different token anywhere makes the full-prompt hash completely different.

Here is the catch retrieval creates. A RAG retriever returns chunks ranked by relevance, and that ranking is different for almost every question — so even two requests that pull the same documents arrange them differently, and their prompts share almost no opening. Because TTFT for a long RAG prompt is essentially the time to prefill all that evidence, a cache that almost never hits means the GPU re-reads thousands of evidence tokens on every request.

PrefillDecode
The
cat
sat
on
All prompt tokens processed at once (parallel)
KV cache fills up in one shot
GPU does lots of math (compute-bound)
Fast — GPU is good at parallel work
the
mat
.
Output tokens generated one at a time
Each step reads entire KV cache
GPU mostly loads data (memory-bound)
Slower — waiting for data, not computing
Prefill = one big batch (fast) → Decode = one token at a time (slower)

CacheWeaver's move is to treat the chunk order as a free variable. It maintains a prefix tree of recently served evidence sequences and runs a greedy algorithm that, for each incoming request, surfaces the most reusable cached prefix and then re-plates the retrieved chunks to match it. Because the reordering happens entirely at the prompt layer, the serving engine and the retrieval results stay untouched — and in the paper's evaluations the reordering shows no answer-quality loss, since the same evidence is present either way and only its order changes.

Here is where it earns its keep, with illustrative numbers. Say a request's evidence prefills to 5,000 tokens, and prefill cost is roughly proportional to the tokens the GPU must process. Left in retrieval order, only the shared system preamble and one lucky chunk match the cache — about 1,500 tokens reused, so the engine prefills the remaining 3,500. CacheWeaver reorders the same chunks so the opening matches a cached sequence of ~2,500 tokens, leaving just 2,500 to prefill fresh. TTFT tracks the recomputed portion, so it falls from 3,500 to 2,500 — about a 29% cut, squarely inside the paper's reported 20–33% band, and near the 97.5% of oracle the greedy policy is shown to reach. (The 5,000- and reuse-token figures are illustrative; the 20–33% and 97.5% are from the CacheWeaver paper.)

ApproachWhat it changesPrefix reuseMedian TTFT
Retrieval-order prefix cachingnothing — chunks left in relevance orderonly when two orders happen to matchbaseline
CacheWeaver (greedy reorder)re-sequences evidence at the prompt layermaximized via a prefix-tree match~20–33% lower (CacheWeaver paper)
Oracle orderingbest possible order, known in hindsightmaximal (upper bound)CacheWeaver reaches ~97.5% of this gain (paper)
Answer qualityno measured degradation (CacheWeaver paper)

The reason this is worth noticing is where it sits: it is not a new attention kernel or a smaller cache, but a scheduling decision at the boundary between retrieval and serving — the kind of free win that appears once you stop treating the retrieve-then-generate pipeline and the serving engine as two sealed boxes. The same evidence, the same engine, the same answer — only the order changes, and the cache does the rest.

Goes deeper in: LLM Serving → Prefix Caching & RadixAttention → The prefix tree

Related explainers

Continue in trackPrefix Caching & RadixAttention: how the prefix tree decides what the engine can reuse

Frequently Asked Questions