AOIAYN — Persistent KV cache across queries

[Figure: AOIAYN persists the KV cache across queries and advances it as data arrives. Panels: incoming data (streaming session); persistent KV cache (the session's running context); naïve mode, where the cache is rebuilt per query (synchronous prefill); queries q1, q2, q3 and their per-query latency. Naïve: per-query prefill grows with context, so TTFT scales with N. AOIAYN: decode-only, so TTFT is constant in context length.]
learnaivisually.com/ai-explained/aoiayn-stateful-prefix

The news. On May 14, 2026, an architecture paper titled Attention Once Is All You Need: efficient streaming inference with stateful transformers was posted as arXiv 2605.13784. The paper's premise: most LLM serving engines today are stateless — every request builds a fresh KV cache from the prompt, returns the answer, and throws the cache away — but the rising workload of streaming sessions (a live data feed plus periodic queries) makes that wasteful. AOIAYN's proposal is to make the engine stateful at the session level: keep one KV cache per session, advance it incrementally as new data arrives in the background, and serve each query against the already-warm cache. Crucially, the model itself is unchanged — full quadratic self-attention is preserved, the KV cache is structurally normal. The paper additionally introduces Flash Queries, a feature where pre-registered questions are pre-evaluated during idle GPU cycles. Reported result: up to 5.9× speedup on streaming benchmarks, with per-query latency that stays constant in context length.

Picture a restaurant on a long lunch service. A naïve kitchen treats every order as a one-shot: ingredients come in from the morning delivery, sit in the crate, and the chef chops and washes everything from scratch the moment an order ticket lands. The customer waits while all the prep work happens. A second customer's order, even for a similar dish, triggers the same chop-from-scratch dance. That is the stateless engine: each query forces a synchronous prefill over all the data the session has accumulated, and TTFT scales with how much context exists at the moment the order arrives.

AOIAYN is the mise-en-place kitchen. The same deliveries arrive throughout the morning, but the chef preps them as they come — onions diced, sauces reduced, proteins portioned — into clearly labelled bins on the counter. When an order ticket lands, the chef just plates from the prepped bins; the customer waits decode-time, not prep-time. In transformer terms, every chunk of incoming data is ingested into the persistent KV cache as it arrives, in the background, between user queries. When a query actually fires, prefill has already happened. The attention layer is unchanged — still standard quadratic self-attention reading the same K, V tensors — but the cache is structurally pre-warmed and the query just pays the decode cost it would have paid anyway.
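The contrast between the two kitchens can be sketched as a toy token-counting model. This is only an illustration under stated assumptions: `StatelessEngine` and `StatefulSession` are hypothetical names, the "cache" is a plain list standing in for the K, V tensors, and no real model or serving engine is involved. The only thing it measures is how many tokens must be prefilled on the query's critical path.

```python
from dataclasses import dataclass, field


@dataclass
class StatelessEngine:
    """Naive mode: the cache is rebuilt from scratch on every query."""
    history: list = field(default_factory=list)

    def ingest(self, chunk):
        # Data is only stored; no prefill happens until a query arrives.
        self.history.extend(chunk)

    def query(self, question):
        # Synchronous prefill over everything accumulated plus the question.
        # Returns the number of tokens prefilled on the critical path.
        return len(self.history) + len(question)


@dataclass
class StatefulSession:
    """Persistent mode: one cache per session, advanced as data arrives."""
    kv_cache: list = field(default_factory=list)  # stand-in for K, V tensors
    background_tokens: int = 0                    # prefill done off the critical path

    def ingest(self, chunk):
        # Prefill the new chunk in the background and append it to the cache.
        self.kv_cache.extend(chunk)
        self.background_tokens += len(chunk)

    def query(self, question):
        # Only the question's own tokens still need prefill. Assumption: the
        # question is not persisted into the session cache afterwards.
        return len(question)


stateless, session = StatelessEngine(), StatefulSession()
for _ in range(3):                 # three data chunks arrive between queries
    chunk = ["tok"] * 1000
    stateless.ingest(chunk)
    session.ingest(chunk)

question = ["q"] * 20
stateless_cost = stateless.query(question)  # context + question tokens
session_cost = session.query(question)      # question tokens only
```

In this toy run the stateless engine prefills the 3,000 context tokens plus the 20 question tokens at query time, while the session prefills only the 20 question tokens; the other 3,000 were handled during ingestion.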

The paper also introduces a second mechanism the kitchen analogy maps neatly onto: Flash Queries. Many restaurants pre-cook the day's most popular dishes during slow periods so they are ready the moment someone orders. AOIAYN does the same for registered questions — a set of queries that the system knows a user is likely to ask. During idle GPU cycles, the engine pre-evaluates each one against the current KV cache and stashes the answer. When the user actually asks, the response is served from cache. The paper notes this is structurally impossible in stateless engines because they discard intermediate state between requests; the persistent cache is what makes the optimization possible at all.
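A minimal sketch of the Flash Queries idea follows. The class name `FlashQueryCache`, the `answer_fn` callback, and the version counter used to mark stored answers as stale after new data arrives are all assumptions made for illustration; the source does not describe how the paper handles staleness or scheduling.

```python
from typing import Callable


class FlashQueryCache:
    """Toy registry of pre-evaluated answers, keyed by question."""

    def __init__(self, answer_fn: Callable[[str, int], str]):
        self.answer_fn = answer_fn   # hypothetical: evaluates a question against the session cache
        self.cache_version = 0       # assumption: bumped whenever new data is ingested
        self.registered = []         # questions the user is expected to ask
        self.answers = {}            # question -> (cache_version, answer)

    def register(self, question: str) -> None:
        self.registered.append(question)

    def on_ingest(self) -> None:
        # New data changes the context, so stored answers become stale.
        self.cache_version += 1

    def on_idle(self) -> int:
        # Pre-evaluate every registered question whose answer is missing or stale.
        refreshed = 0
        for q in self.registered:
            stored = self.answers.get(q)
            if stored is None or stored[0] != self.cache_version:
                self.answers[q] = (self.cache_version, self.answer_fn(q, self.cache_version))
                refreshed += 1
        return refreshed

    def ask(self, question: str):
        # Returns (answer, served_from_store).
        stored = self.answers.get(question)
        if stored is not None and stored[0] == self.cache_version:
            return stored[1], True
        return self.answer_fn(question, self.cache_version), False


calls = []

def fake_answer(question, version):
    # Stand-in for running the model; records each evaluation.
    calls.append(question)
    return f"{question} @v{version}"

fq = FlashQueryCache(fake_answer)
fq.register("latest price?")
fq.on_ingest()                             # data arrives
refreshed = fq.on_idle()                   # idle cycle: answer is pre-evaluated
answer, hit = fq.ask("latest price?")      # served from the stored answer
fq.on_ingest()                             # more data arrives, stored answer is stale
answer2, hit2 = fq.ask("latest price?")    # falls back to evaluating on demand
```

The design choice worth noting is the fallback path: if a registered question is asked before the next idle pass, this sketch evaluates it on demand rather than returning a stale answer.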

Where AOIAYN earns its keep is the scaling shape of TTFT, the time to first token (the specific TTFT values below are illustrative; they depend on hardware, model size, and ingestion rate). In a stateless engine serving a session that has accumulated 100K tokens of context, every query pays roughly T_prefill(100K) + T_decode. At 100K tokens on a 70B-class model that prefill is typically on the order of seconds, with decode on the order of milliseconds per token; the prefill dominates. The same query under AOIAYN pays roughly T_decode plus a small prefill over the query's own tokens, because the prefill over the session context has been amortised across the background ingestion that already happened. As the session grows to 200K, 500K, 1M tokens, the stateless TTFT grows at least linearly with N (under full quadratic attention, the attention part of prefill grows faster than linearly); AOIAYN's TTFT stays roughly flat, with T_prefill paid over the lifetime of the session rather than on any single critical path. The paper's headline speedup of up to 5.9× on streaming benchmarks is the empirical version of this asymptotic story.
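The scaling argument can be made concrete with a small cost model. Everything numeric here is an assumption chosen for illustration, not a figure from the paper: a constant prefill throughput of 10,000 tokens per second, 0.03 s to produce the first token once prefill is done, and a 50-token query. The model is deliberately linear in N, which understates stateless cost at long context.

```python
def ttft_stateless(context_tokens, query_tokens, prefill_tok_per_s, first_token_s):
    # Stateless: prefill the whole session context plus the query, then decode.
    return (context_tokens + query_tokens) / prefill_tok_per_s + first_token_s


def ttft_stateful(query_tokens, prefill_tok_per_s, first_token_s):
    # Persistent cache: only the query's own tokens are prefilled at query time.
    return query_tokens / prefill_tok_per_s + first_token_s


PREFILL_TOK_PER_S = 10_000.0   # assumed prefill throughput
FIRST_TOKEN_S = 0.03           # assumed time for the first decode step
QUERY_TOKENS = 50              # assumed query length

for n in (100_000, 200_000, 500_000, 1_000_000):
    cold = ttft_stateless(n, QUERY_TOKENS, PREFILL_TOK_PER_S, FIRST_TOKEN_S)
    warm = ttft_stateful(QUERY_TOKENS, PREFILL_TOK_PER_S, FIRST_TOKEN_S)
    print(f"N={n:>9,}  stateless={cold:8.3f}s  stateful={warm:6.3f}s")
```

Under these assumed constants the stateless column grows with N while the stateful column does not change. The ratio between the two is an artifact of the chosen constants and should not be compared with the paper's measured speedup.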

Where AOIAYN sits next to other long-context levers

| Lever | What it changes | Per-query work | Architectural scope |
|---|---|---|---|
| Prefix caching | Reuse KV cache for repeated prompt prefixes across requests | O(\|new suffix\|) prefill; saves time on the shared prefix only | Engine-level cache reuse; same model |
| SP-KV pruning | Drop low-utility KV entries (sparse cache) | O(N) prefill still happens, just with a 3–10× smaller cache | Co-trained head; modified model |
| FP8 / INT4 KV quantization | Pair size: fewer bits per value | O(N) prefill still happens, just with cheaper bytes | Post-training quantization; same architecture |
| AOIAYN (this paper) | Persist KV cache across queries; ingest new data in background | O(\|query\|); prefill is amortised over the session, not the request | Engine-level; standard quadratic attention preserved |
| RecMem-style memory | Carries a learned summary state across interactions | O(\|query\|), but state is a learned compressed memory, not raw KV | Different family: recurrent learned state vs. persistent raw cache |

The compositional point: AOIAYN is orthogonal to prefix caching, SP-KV, and KV quantization. Those levers change what a cache contains or how it is stored and reused, while AOIAYN changes whose cache it is and when it gets built. A serving stack could in principle layer prefix caching across requests, SP-KV pruning inside the cache, and INT4 quantization on the KV bytes, and still wrap all of that in an AOIAYN-style persistent session that advances incrementally and stages Flash Queries. The paper does not benchmark the stacked configuration; the headline 5.9× number reflects AOIAYN alone.

There are real caveats. AOIAYN preserves full quadratic self-attention, so memory still scales linearly with session length: a 1M-token session holds a 1M-token KV cache, with no compression, just persistence. Operators need large HBM headroom, an SP-KV-style sparsity overlay, or a session lifetime policy that evicts cold sessions. The engineering surface is also non-trivial: in a real fleet, persistent sessions live across multiple replicas, so the design has to cover session affinity, KV migration, and reconciliation under failure. The paper handles the single-engine case; multi-engine deployment is open work. And Flash Queries trade idle GPU cycles for a bet on which questions will actually be asked: pre-evaluating questions that never get asked wastes compute, so the registered-question set has to be picked carefully.
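To put a rough number on the memory caveat, the standard KV cache size formula can be evaluated for one example configuration. The configuration is an assumption for illustration, not something taken from the paper: 80 layers, 8 KV heads (grouped-query attention), head dimension 128, and 2 bytes per value (FP16), which is typical of a Llama-style 70B model.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    # Factor of 2: one K vector and one V vector per token, per layer, per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed 70B-class configuration (illustrative only).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=2)
session_1m = kv_cache_bytes(1_000_000, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=2)
session_1m_fp8 = kv_cache_bytes(1_000_000, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=1)

print(f"per token:      {per_token / 1024:.0f} KiB")
print(f"1M tokens FP16: {session_1m / 1e9:.2f} GB")
print(f"1M tokens FP8:  {session_1m_fp8 / 1e9:.2f} GB")
```

Under these assumptions a single 1M-token session needs about 328 GB of KV cache at FP16, which is why persistence on its own does not remove the need for HBM headroom, sparsity, quantization, or eviction.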
