What is prefix-cache-aware evidence reordering?

It is reordering the retrieved chunks in a RAG prompt so the serving engine's KV prefix cache can reuse as much of the prompt's opening as possible. The engine caches the keys and values it computed for earlier prompts and reuses them for any later prompt that begins with the exact same tokens. Because retrieval returns chunks ranked by relevance — a different order for almost every question — prompts rarely share an opening, so the cache misses. CacheWeaver re-sequences the same chunks at the prompt layer to maximize that shared opening prefix, without touching the engine or the retrieved documents.

Why does it lower time-to-first-token?

Time-to-first-token for a long RAG prompt is essentially the time to prefill all the evidence — the model must read every chunk before it can answer. When the prompt's opening matches a cached prefix, the engine reuses that work and only prefills the remaining tokens, so TTFT tracks the part it has to recompute. By making more of the opening reusable, CacheWeaver shrinks that recomputed portion, cutting median TTFT by about 20–33% across three vLLM configurations and reaching roughly 97.5% of an oracle ordering's gain, with no measured loss in answer quality.

How does CacheWeaver relate to prefix caching and RAG?

It sits exactly between them. Prefix caching (RadixAttention in SGLang, block hashing in vLLM) is the serving-side mechanism that reuses a shared opening; RAG is the retrieval-side pipeline that pastes ranked evidence into the prompt. CacheWeaver changes neither — it adds a prompt-layer scheduler that keeps a prefix tree of recently served sequences and greedily reorders each request's retrieved chunks to match the most reusable cached prefix. It is complementary to the engine's caching and to the retriever's ranking, because it only governs the order in which the already-chosen chunks are laid out.

CacheWeaver reorders RAG evidence for prefix-cache reuse — Prefix-cache-aware evidence reordering

Jargon

TTFT (time-to-first-token): How long after a request arrives before the model emits its first output token. For a long RAG prompt this is almost entirely prefill time — the engine has to read all the evidence before it can answer.
Prefill: The first phase of inference, where the model processes the entire prompt at once to build its KV cache. Its cost grows with prompt length, which is why stuffing evidence into a RAG prompt is what makes TTFT slow.
KV prefix cache (RadixAttention): Serving engines store the KV computed for a prompt and reuse it for any later prompt that starts with the exact same tokens. SGLang's RadixAttention keeps these shared prefixes in a tree; vLLM does it by hashing fixed blocks. See prefix caching.
Prefix (and why order matters): Reuse works only from the front, token-for-token. Two prompts that share their opening reuse that opening; the instant they diverge, everything after must be recomputed — so the order of the evidence decides how much is reusable.
RAG (retrieval-augmented generation): Before answering, the system retrieves relevant documents and pastes them into the prompt as evidence. The retriever returns them ranked by relevance, which is the order CacheWeaver rearranges.
Oracle ordering: The best ordering you could pick if you knew the whole future cache state in advance — an upper bound, not a runnable policy. CacheWeaver's cheap greedy choice reaches about 97.5% of this ideal.

The news. On June 18, 2026, researchers posted CacheWeaver, a lightweight prompt-layer method that reorders retrieved evidence so grounded RAG requests reuse as much of the KV prefix cache as possible. It changes neither the serving engine nor the retrieved documents — only the order the chunks appear in the prompt. Across three vLLM configurations it cuts median time-to-first-token by about 20–33% relative to naive retrieval-order prefix caching, reaching 97.5% of the gain an oracle ordering would give, with no measured answer-quality degradation. Read the paper →

Picture a kitchen with a warming shelf of orders that are already half-cooked. A new order comes in, and its first few courses happen to be the same dishes, in the same sequence, as a tray already sitting on the shelf. The cook doesn't start over — they grab the matching tray and only cook the courses that differ, and the first plate leaves the pass far sooner. The whole trick is that the reuse runs strictly from the front: the moment your order's courses stop matching the shelf, every course after that has to be cooked fresh.

In serving terms, each course is a retrieved evidence chunk, the order is the prompt, and the warming shelf is the KV prefix cache. A serving engine keeps the keys and values it already computed for a prompt and reuses them for any later prompt that begins with the exact same tokens. But that matching is unforgiving: change a single token near the front and the block hash no longer matches, so the cache misses and the work is redone.

Two prompts, different hashes

Prompt A

Youarehelpful.Alicesaidhi

hash: e93c

Prompt B

Youarehelpful.Bobsaidhi

hash: 8800

CACHE MISS — hashes differ, shared prefix wasted

Both prompts share 3 leading tokens, but one different token anywhere makes the full-prompt hash completely different.

Here is the catch retrieval creates. A RAG retriever returns chunks ranked by relevance, and that ranking is different for almost every question — so even two requests that pull the same documents arrange them differently, and their prompts share almost no opening. Because TTFT for a long RAG prompt is essentially the time to prefill all that evidence, a cache that almost never hits means the GPU re-reads thousands of evidence tokens on every request.

CacheWeaver's move is to treat the chunk order as a free variable. It maintains a prefix tree of recently served evidence sequences and runs a greedy algorithm that, for each incoming request, surfaces the most reusable cached prefix and then re-plates the retrieved chunks to match it. Because the reordering happens entirely at the prompt layer, the serving engine and the retrieval results stay untouched — and in the paper's evaluations the reordering shows no answer-quality loss, since the same evidence is present either way and only its order changes.

Here is where it earns its keep, with illustrative numbers. Say a request's evidence prefills to 5,000 tokens, and prefill cost is roughly proportional to the tokens the GPU must process. Left in retrieval order, only the shared system preamble and one lucky chunk match the cache — about 1,500 tokens reused, so the engine prefills the remaining 3,500. CacheWeaver reorders the same chunks so the opening matches a cached sequence of ~2,500 tokens, leaving just 2,500 to prefill fresh. TTFT tracks the recomputed portion, so it falls from 3,500 to 2,500 — about a 29% cut, squarely inside the paper's reported 20–33% band, and near the 97.5% of oracle the greedy policy is shown to reach. (The 5,000- and reuse-token figures are illustrative; the 20–33% and 97.5% are from the CacheWeaver paper.)

Approach	What it changes	Prefix reuse	Median TTFT
Retrieval-order prefix caching	nothing — chunks left in relevance order	only when two orders happen to match	baseline
CacheWeaver (greedy reorder)	re-sequences evidence at the prompt layer	maximized via a prefix-tree match	~20–33% lower (CacheWeaver paper)
Oracle ordering	best possible order, known in hindsight	maximal (upper bound)	CacheWeaver reaches ~97.5% of this gain (paper)
Answer quality	—	—	no measured degradation (CacheWeaver paper)

The reason this is worth noticing is where it sits: it is not a new attention kernel or a smaller cache, but a scheduling decision at the boundary between retrieval and serving — the kind of free win that appears once you stop treating the retrieve-then-generate pipeline and the serving engine as two sealed boxes. The same evidence, the same engine, the same answer — only the order changes, and the cache does the rest.

Goes deeper in: LLM Serving → Prefix Caching & RadixAttention → The prefix tree

Related explainers

Attention Once Is All You Need — persistent KV cache across queries — the same "reuse the prefix instead of recomputing it" idea, taken to a stateful cache that survives across separate queries.
AMD ATOM — prefill/decode disaggregation — why TTFT lives in the prefill phase, and another way to attack it: splitting prefill off onto its own hardware.
Is Grep All You Need? — grep vs vector retrieval — the retrieval side of the pipeline CacheWeaver reorders, and how the choice of retriever shapes what lands in the prompt.

Continue in trackPrefix Caching & RadixAttention: how the prefix tree decides what the engine can reuse

Related explainers

Frequently Asked Questions