Q: How is AOIAYN different from SP-KV or KV quantization?

SP-KV and KV quantization both operate on a single request's cache: SP-KV drops low-utility entries to make the cache sparse; quantization shrinks each entry's bytes. Both still pay O(N) prefill at query time. AOIAYN changes when and whose cache gets built — it keeps one persistent KV cache per session and advances it in the background as data arrives, so the query itself only pays O(|query|) decode cost. The three are orthogonal and can in principle compose: AOIAYN persists the session-level cache, SP-KV prunes inside it, quantization shrinks the bytes. The paper does not benchmark the stacked combination; the 5.9× speedup is AOIAYN alone.

Q: What are Flash Queries and why do they need a stateful engine?

Flash Queries are pre-registered questions that the engine pre-evaluates during idle GPU cycles, caching the answers so they are served instantly when the user actually asks. The paper notes that this is structurally impossible in stateless engines because each request discards its intermediate state — there is no persistent cache for a pre-evaluated answer to attach to. AOIAYN's persistent session KV cache is what makes Flash Queries possible: the engine has a live, growing context to evaluate registered questions against, and the resulting answer state can be held until the user asks. The tradeoff is that pre-evaluating questions nobody asks wastes idle GPU cycles, so the registered-question set has to be chosen carefully.

Attention Once Is All You Need — Persistent KV cache across queries

Q: What is AOIAYN?

AOIAYN ("Attention Once Is All You Need", arXiv 2605.13784, May 2026) is a streaming-inference architecture that persists the KV cache across queries in a session and advances it incrementally as new data arrives. The model itself is unchanged — full quadratic self-attention is preserved — but the engine becomes stateful at the session level: every chunk of incoming data is ingested into the persistent KV cache in the background, between user queries. When a query fires, prefill has already happened, so per-query latency becomes O(|query|) and constant in accumulated context length. The paper additionally introduces Flash Queries, which pre-evaluate a set of registered questions during idle GPU cycles. Reported result: up to 5.9× speedup on streaming benchmarks.

learnaivisually.com/ai-explained/aoiayn-stateful-prefix

TL;DR

What is it: The AOIAYN paper ("Attention Once Is All You Need", arXiv 2605.13784) proposes a stateful streaming-inference engine that keeps the KV cache between queries in a session and advances it incrementally as new data arrives, instead of rebuilding state on each request.
Why it’s needed: In long-running streaming sessions the per-query prefill cost grows with context length, dominating TTFT and forcing operators to either truncate context or eat the latency; AOIAYN moves prefill off the critical path so query latency stops scaling with N.
vs previous: The default stateless engine discards intermediate state between requests, so every query recomputes attention over the whole accumulated context; AOIAYN keeps full quadratic self-attention but reuses the cache across queries — and adds Flash Queries, idle-GPU pre-evaluation of registered questions.

Jargon

KV cache: The cached keys and values from every prior token, kept so attention does not recompute them at every decode step. Memory scales as 2 × layers × heads × head_dim × seq_len × bytes_per_value (the leading 2 counts keys and values, and heads means KV heads), so it is linear in sequence length. See KV Cache → The Redundancy Problem.
Stateless engine: The default LLM serving model — each request starts cold, builds its own KV cache from the prompt, returns the answer, and discards the cache at the end. AOIAYN's premise is that this is the wrong default for streaming sessions where context accumulates over time.
Streaming session: A long-lived session where new data arrives continuously and the user issues sporadic queries against the accumulated context — e.g. a live news feed plus periodic "what's changed?" questions. The paper's headline target workload.
Prefill on the critical path: When prefill runs synchronously inside the user's query (the default), it dominates TTFT. AOIAYN moves prefill off the critical path by running it in the background as data arrives, so the query itself only pays decode cost. See KV Cache → Prefill vs Decode.
Flash Queries: An auxiliary feature the paper introduces — a set of registered questions that the engine pre-evaluates during idle GPU cycles, caching the answers. When the user actually asks one, the answer is already computed. "Structurally impossible in stateless engines because they discard intermediate state between requests," the paper notes.
Prefix caching: The closest existing serving-stack feature — reuse the KV cache for repeated prompt prefixes across requests. AOIAYN extends this idea to session-level reuse plus background ingestion of new data, where the "prefix" keeps growing during the session. See Prefix Caching → Intro.
TTFT: Time to first token — the wall-clock latency a user sees between submitting a query and the first output token appearing. The metric AOIAYN's reported 5.9× speedup is measured against.

The news. On May 14, 2026, an architecture paper titled Attention Once Is All You Need: efficient streaming inference with stateful transformers was posted as arXiv 2605.13784. The paper's premise: most LLM serving engines today are stateless — every request builds a fresh KV cache from the prompt, returns the answer, and throws the cache away — but the rising workload of streaming sessions (a live data feed plus periodic queries) makes that wasteful. AOIAYN's proposal is to make the engine stateful at the session level: keep one KV cache per session, advance it incrementally as new data arrives in the background, and serve each query against the already-warm cache. Crucially, the model itself is unchanged — full quadratic self-attention is preserved, the KV cache is structurally normal. The paper additionally introduces Flash Queries, a feature where pre-registered questions are pre-evaluated during idle GPU cycles. Reported result: up to 5.9× speedup on streaming benchmarks, with per-query latency that stays constant in context length.

Picture a restaurant on a long lunch service. A naïve kitchen treats every order as a one-shot: ingredients come in from the morning delivery, sit in the crate, and the chef chops and washes everything from scratch the moment an order ticket lands. The customer waits while all the prep work happens. A second customer's order, even for a similar dish, triggers the same chop-from-scratch dance. That is the stateless engine: each query forces a synchronous prefill over all the data the session has accumulated, and TTFT scales with how much context exists at the moment the order arrives.

AOIAYN is the mise-en-place kitchen. The same deliveries arrive throughout the morning, but the chef preps them as they come — onions diced, sauces reduced, proteins portioned — into clearly labelled bins on the counter. When an order ticket lands, the chef just plates from the prepped bins; the customer waits decode-time, not prep-time. In transformer terms, every chunk of incoming data is ingested into the persistent KV cache as it arrives, in the background, between user queries. When a query actually fires, prefill has already happened. The attention layer is unchanged — still standard quadratic self-attention reading the same K, V tensors — but the cache is structurally pre-warmed and the query just pays the decode cost it would have paid anyway.
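
The difference between the two kitchens can be sketched as a toy simulation that only counts how many tokens are processed on the critical path of a query. This is a minimal sketch, not the paper's engine: the class names `StatelessEngine` and `StatefulSession`, the list standing in for the K, V tensors, and the token counts are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StatelessEngine:
    """Rebuilds its cache on every query, so prefill sits on the critical path."""
    context: list[str] = field(default_factory=list)

    def ingest(self, chunk: list[str]) -> None:
        # Only raw tokens are stored; no attention state survives between requests.
        self.context.extend(chunk)

    def query(self, question: list[str]) -> int:
        # Critical-path tokens: all accumulated context plus the question.
        return len(self.context) + len(question)


@dataclass
class StatefulSession:
    """Keeps one cache per session and advances it as data arrives."""
    kv_cache: list[str] = field(default_factory=list)  # stand-in for K, V tensors
    background_tokens: int = 0

    def ingest(self, chunk: list[str]) -> None:
        # Background work: extend the persistent cache between queries.
        self.kv_cache.extend(chunk)
        self.background_tokens += len(chunk)

    def query(self, question: list[str]) -> int:
        # Critical-path tokens: only the question itself.
        return len(question)


stateless, stateful = StatelessEngine(), StatefulSession()
for _ in range(100):
    chunk = ["tok"] * 1_000  # 100 deliveries of 1,000 tokens each
    stateless.ingest(chunk)
    stateful.ingest(chunk)

question = ["what", "changed", "?"]
print(stateless.query(question))  # 100003
print(stateful.query(question))   # 3
```

Both engines end up having processed the same 100,000 context tokens; the sketch only shows where that work lands, on the query in one case and in the background in the other.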

The paper also introduces a second mechanism the kitchen analogy maps neatly onto: Flash Queries. Many restaurants pre-cook the day's most popular dishes during slow periods so they are ready the moment someone orders. AOIAYN does the same for registered questions — a set of queries that the system knows a user is likely to ask. During idle GPU cycles, the engine pre-evaluates each one against the current KV cache and stashes the answer. When the user actually asks, the response is served from cache. The paper notes this is structurally impossible in stateless engines because they discard intermediate state between requests; the persistent cache is what makes the optimization possible at all.
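
A rough sketch of the Flash Queries idea follows, assuming a hypothetical `FlashQueryCache` class and a caller-supplied `answer_fn` that stands in for decoding against the session cache. The invalidation rule (an answer is treated as stale once new data has been ingested) is an assumption made for this sketch; the source does not describe how the paper handles staleness.

```python
from typing import Callable


class FlashQueryCache:
    """Toy model of pre-evaluating registered questions during idle time."""

    def __init__(self, answer_fn: Callable[[str, int], str]) -> None:
        self.answer_fn = answer_fn   # hypothetical: decodes against the session cache
        self.cache_version = 0       # bumped whenever new data is ingested
        self.registered: list[str] = []
        self.answers: dict[str, tuple[int, str]] = {}
        self.evaluations = 0         # how many times decode actually ran

    def register(self, question: str) -> None:
        self.registered.append(question)

    def on_ingest(self) -> None:
        # Assumed policy: new data makes previously cached answers stale.
        self.cache_version += 1

    def on_idle(self) -> None:
        # Pre-evaluate every registered question whose answer is missing or stale.
        for q in self.registered:
            cached = self.answers.get(q)
            if cached is None or cached[0] != self.cache_version:
                self.answers[q] = (self.cache_version, self._evaluate(q))

    def ask(self, question: str) -> tuple[str, bool]:
        # Returns (answer, served_from_cache).
        cached = self.answers.get(question)
        if cached is not None and cached[0] == self.cache_version:
            return cached[1], True
        return self._evaluate(question), False

    def _evaluate(self, question: str) -> str:
        self.evaluations += 1
        return self.answer_fn(question, self.cache_version)


fq = FlashQueryCache(lambda q, version: f"answer to {q!r} at v{version}")
fq.register("what changed?")
fq.on_ingest()   # data arrives
fq.on_idle()     # idle cycles are spent pre-evaluating
answer, hit = fq.ask("what changed?")          # served from the stash
_, other_hit = fq.ask("unregistered question")  # computed on demand
```

The `evaluations` counter makes the tradeoff visible: every registered question costs one evaluation per cache version whether or not anyone asks it.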

Where AOIAYN earns its keep is the scaling shape of TTFT (the specific TTFT values below are illustrative; they depend on hardware, model size, and ingestion rate). In a stateless engine serving a session that has accumulated 100K tokens of context, every query pays roughly T_prefill(100K) + T_decode. At 100K tokens on a 70B-class model that prefill is typically on the order of seconds, with decode on the order of milliseconds per token; the prefill dominates. The same query under AOIAYN pays roughly T_decode only, because prefill has been amortised across the background ingestion that already happened. As the session grows to 200K, 500K, 1M tokens the stateless TTFT grows at least linearly with N (the attention part of prefill grows quadratically); AOIAYN's TTFT stays roughly flat: only T_decode per query, with T_prefill paid over the lifetime of the session, not on any single critical path. The paper's headline 5.9× speedup on streaming benchmarks is the empirical version of this asymptotic story.
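
The T_prefill(N) + T_decode arithmetic above can be worked through with a back-of-the-envelope model. Everything numeric here is an assumption chosen for illustration, not a measurement from the paper: a constant prefill throughput of 10,000 tokens per second, 30 ms per decoded token, and a 50-token query. Treating prefill throughput as constant also makes the stateless curve exactly linear, which understates the cost of quadratic attention at long context.

```python
def ttft_stateless(n_context: int, n_query: int,
                   prefill_tok_per_s: float, decode_s_per_tok: float) -> float:
    """Context and query are both prefilled on the critical path, then one decode step."""
    return (n_context + n_query) / prefill_tok_per_s + decode_s_per_tok


def ttft_stateful(n_query: int,
                  prefill_tok_per_s: float, decode_s_per_tok: float) -> float:
    """Context is already cached; only the query is processed before the first token."""
    return n_query / prefill_tok_per_s + decode_s_per_tok


PREFILL_TOK_PER_S = 10_000.0  # assumed, illustrative
DECODE_S_PER_TOK = 0.03       # assumed, illustrative
QUERY_TOKENS = 50             # assumed, illustrative

for n in (100_000, 200_000, 500_000, 1_000_000):
    cold = ttft_stateless(n, QUERY_TOKENS, PREFILL_TOK_PER_S, DECODE_S_PER_TOK)
    warm = ttft_stateful(QUERY_TOKENS, PREFILL_TOK_PER_S, DECODE_S_PER_TOK)
    print(f"N={n:>9,}  stateless={cold:8.3f}s  stateful={warm:.3f}s")
```

Under these assumed numbers the stateless figure moves from about 10 s at 100K tokens to about 100 s at 1M tokens, while the stateful figure does not depend on N at all.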

Where AOIAYN sits next to other long-context levers

Lever	What it changes	Per-query work	Architectural scope
Prefix caching	Reuse KV cache for repeated prompt prefixes across requests	O(|new suffix|) prefill, saving time on the shared prefix only	Engine-level cache reuse; same model
SP-KV pruning	Drop low-utility KV entries (sparse cache)	O(N) prefill still happens, just with a 3–10× smaller cache	Co-trained head; modified model
FP8 / INT4 KV quantization	Shrinks each KV pair: fewer bits per value	O(N) prefill still happens, just with cheaper bytes	Post-training quantization; same architecture
AOIAYN (this paper)	Persist KV cache across queries; ingest new data in background	O(|query|): prefill is amortised over the session, not the request	Engine-level; standard quadratic attention preserved
RecMem-style memory	Carries a learned summary state across interactions	O(|query|), but state is a learned compressed memory, not raw KV	Different family: recurrent learned state vs. persistent raw cache

The compositional point: AOIAYN is orthogonal to prefix caching, SP-KV, and KV quantization. SP-KV and quantization act on a single request's cache and prefix caching reuses shared prompt prefixes across requests, while AOIAYN changes whose cache gets built and when. A serving stack could in principle layer prefix caching across requests, SP-KV pruning inside the cache, and INT4 quantization on the KV bytes, and still wrap all of that in an AOIAYN-style persistent session that advances incrementally and stages Flash Queries. The paper does not benchmark the stacked configuration; the headline 5.9× number is the AOIAYN feature alone.

There are real caveats. The paper preserves full quadratic self-attention, so memory still scales linearly with session length: a 1M-token session holds a 1M-token KV cache, with no compression, just persistence. Operators need large HBM headroom, an SP-KV-style sparsity overlay, or a session lifetime policy that evicts cold sessions. The engineering surface is also non-trivial: in a real fleet, persistent sessions have to live across multiple replicas, so the design has to cover session affinity, KV migration, and reconciliation under failure. The paper handles the single-engine case; multi-engine deployment is open work. And Flash Queries trade idle GPU cycles for a bet on which questions will be asked: pre-evaluating questions that never get asked wastes compute, so the registered-question set has to be picked carefully.
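
To put a number on the memory caveat, the KV cache formula from the Jargon section (2 × layers × heads × head_dim × seq_len × bytes_per_value) can be evaluated directly. The model configuration below is an assumption picked to resemble a 70B-class model with grouped-query attention (80 layers, 8 KV heads, head_dim 128, 16-bit values); it is not taken from the paper.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int) -> int:
    """2 (keys and values) x layers x KV heads x head_dim x seq_len x bytes_per_value."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Assumed, illustrative configuration; not from the paper.
CONFIG = dict(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)

for seq_len in (100_000, 500_000, 1_000_000):
    gib = kv_cache_bytes(seq_len=seq_len, **CONFIG) / 2**30
    print(f"{seq_len:>9,} tokens -> {gib:7.1f} GiB")
```

With these assumed values each token costs 327,680 bytes of cache, so a 1M-token session needs roughly 305 GiB, well beyond the 80 GB of HBM on a common data-center GPU. That is the gap the sparsity overlay or eviction policy would have to close.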

Goes deeper in: LLM Internals → KV Cache → Memory Cost

Related explainers

SP-KV — Self-pruned KV cache — the orthogonal axis: shrink the cache by dropping low-utility entries, leaving the per-request model intact
vLLM v0.20 — TurboQuant 2-bit KV cache — yet another orthogonal axis: shrink each KV pair by quantizing the bytes
DeepSeek V4 — long-context cost cut to a fraction — same memory pressure attacked via latent attention at the architecture level
RecMem — subconscious recurrence — a recurrent learned memory state, different family from AOIAYN's persistent raw cache
