The news. On June 18, 2026, researchers posted CacheWeaver, a lightweight prompt-layer method that reorders retrieved evidence so grounded RAG requests reuse as much of the KV prefix cache as possible. It changes neither the serving engine nor the retrieved documents — only the order the chunks appear in the prompt. Across three vLLM configurations it cuts median time-to-first-token by about 20–33% relative to naive retrieval-order prefix caching, reaching 97.5% of the gain an oracle ordering would give, with no measured answer-quality degradation. Read the paper →
Picture a kitchen with a warming shelf of orders that are already half-cooked. A new order comes in, and its first few courses happen to be the same dishes, in the same sequence, as a tray already sitting on the shelf. The cook doesn't start over — they grab the matching tray and only cook the courses that differ, and the first plate leaves the pass far sooner. The whole trick is that the reuse runs strictly from the front: the moment your order's courses stop matching the shelf, every course after that has to be cooked fresh.
In serving terms, each course is a retrieved evidence chunk, the order is the prompt, and the warming shelf is the KV prefix cache. A serving engine keeps the keys and values it already computed for a prompt and reuses them for any later prompt that begins with the exact same tokens. But that matching is unforgiving: change a single token near the front and the block hash no longer matches, so the cache misses and the work is redone.
Here is the catch retrieval creates. A RAG retriever returns chunks ranked by relevance, and that ranking is different for almost every question — so even two requests that pull the same documents arrange them differently, and their prompts share almost no opening. Because TTFT for a long RAG prompt is essentially the time to prefill all that evidence, a cache that almost never hits means the GPU re-reads thousands of evidence tokens on every request.
CacheWeaver's move is to treat the chunk order as a free variable. It maintains a prefix tree of recently served evidence sequences and runs a greedy algorithm that, for each incoming request, surfaces the most reusable cached prefix and then re-plates the retrieved chunks to match it. Because the reordering happens entirely at the prompt layer, the serving engine and the retrieval results stay untouched — and in the paper's evaluations the reordering shows no answer-quality loss, since the same evidence is present either way and only its order changes.
Here is where it earns its keep, with illustrative numbers. Say a request's evidence prefills to 5,000 tokens, and prefill cost is roughly proportional to the tokens the GPU must process. Left in retrieval order, only the shared system preamble and one lucky chunk match the cache — about 1,500 tokens reused, so the engine prefills the remaining 3,500. CacheWeaver reorders the same chunks so the opening matches a cached sequence of ~2,500 tokens, leaving just 2,500 to prefill fresh. TTFT tracks the recomputed portion, so it falls from 3,500 to 2,500 — about a 29% cut, squarely inside the paper's reported 20–33% band, and near the 97.5% of oracle the greedy policy is shown to reach. (The 5,000- and reuse-token figures are illustrative; the 20–33% and 97.5% are from the CacheWeaver paper.)
| Approach | What it changes | Prefix reuse | Median TTFT |
|---|---|---|---|
| Retrieval-order prefix caching | nothing — chunks left in relevance order | only when two orders happen to match | baseline |
| CacheWeaver (greedy reorder) | re-sequences evidence at the prompt layer | maximized via a prefix-tree match | ~20–33% lower (CacheWeaver paper) |
| Oracle ordering | best possible order, known in hindsight | maximal (upper bound) | CacheWeaver reaches ~97.5% of this gain (paper) |
| Answer quality | — | — | no measured degradation (CacheWeaver paper) |
The reason this is worth noticing is where it sits: it is not a new attention kernel or a smaller cache, but a scheduling decision at the boundary between retrieval and serving — the kind of free win that appears once you stop treating the retrieve-then-generate pipeline and the serving engine as two sealed boxes. The same evidence, the same engine, the same answer — only the order changes, and the cache does the rest.
Goes deeper in: LLM Serving → Prefix Caching & RadixAttention → The prefix tree
Related explainers
- Attention Once Is All You Need — persistent KV cache across queries — the same "reuse the prefix instead of recomputing it" idea, taken to a stateful cache that survives across separate queries.
- AMD ATOM — prefill/decode disaggregation — why TTFT lives in the prefill phase, and another way to attack it: splitting prefill off onto its own hardware.
- Is Grep All You Need? — grep vs vector retrieval — the retrieval side of the pipeline CacheWeaver reorders, and how the choice of retriever shapes what lands in the prompt.