What is WorldKV's evict-and-reinsert KV memory?

WorldKV (KAIST, Yi et al., arXiv 2605.22718, May 2026) is a training-free KV-cache system for autoregressive video world models. Instead of permanently discarding old KV entries to stay under a memory budget, its World Retrieval layer evicts old KV chunks to CPU/GPU storage indexed by camera and action, then reinserts the scene-relevant chunks when the camera revisits an earlier viewpoint. No re-encoding is needed, because the keys and values were saved rather than recomputed. A second layer, World Compression, prunes near-duplicate tokens within each chunk by key-key similarity to an anchor frame. The paper reports that together they match or exceed full-KV fidelity at about 2x throughput.

Why does it matter for video world models?

A video world model generates a scene frame by frame, so its KV cache grows with every frame — and a long rollout either exhausts GPU memory or, if you bound it with a sliding window, forgets places it already drew. That broken long-term coherence is exactly what makes generated worlds drift and contradict themselves. WorldKV keeps the working cache small while preserving the ability to recover earlier scenes on revisit, so the world stays consistent over a long rollout without paying full-KV memory. Because it is training-free, it applies to an existing model at inference time.

How is it different from sliding-window attention or KV pruning?

Sliding-window attention and most KV-pruning methods make eviction terminal: the discarded tokens are gone, so a revisited scene must be regenerated and drifts. WorldKV reframes eviction as storage — evicted chunks move to a cheaper tier indexed by camera/action correspondence and can be reinserted exactly when a viewpoint returns. It pairs that with key-similarity compression to halve per-chunk size. So where pruning trades memory for forgotten history, WorldKV trades a little bookkeeping and off-GPU storage for a bounded live cache that still remembers — the difference between throwing a set away and warehousing it.

WorldKV — Evict-and-reinsert KV memory

TL;DR

What is it: WorldKV (KAIST, Yi et al., arXiv 2605.22718) is a training-free KV-cache system for autoregressive video world models, the kind of model that generates a navigable world one frame at a time. It is built on evict-and-reinsert KV memory.
Why it’s needed: A video world model's KV cache grows with every frame it generates, so a long rollout either runs out of memory or forgets places it already drew; WorldKV keeps the cache bounded while preserving long-term scene coherence, reaching ~2x throughput at full-KV fidelity.
vs previous: A plain sliding window bounds memory by permanently discarding old frames, so when the camera returns to a room it already rendered the model has forgotten it; WorldKV instead stores the evicted KV chunks and reinserts the scene-relevant ones when the viewpoint comes back.

Jargon

KV cache: The stored keys and values for every token already processed, so attention never recomputes them. It is the dominant memory cost of generation, and left alone it grows with every token — here, every frame — produced.
Autoregressive video world model: A model that generates a navigable scene frame by frame, each new frame conditioned on the frames (and camera moves) before it. Because it keeps generating, its KV cache keeps growing — the problem WorldKV attacks.
Sliding window attention: An attention pattern where each token sees only the last W tokens, not all of them. It bounds memory, but on its own it slides old scenes out of view permanently — so revisiting a place means re-imagining it.
KV eviction: Removing KV entries from the cache to free memory. Normally evicted entries are discarded for good; WorldKV's twist is to keep them in cheaper storage and bring them back.
Camera / action correspondence: The index WorldKV stores evicted chunks under: which camera pose and action a chunk belongs to. When the trajectory revisits a similar viewpoint, this index finds the matching chunk to reinsert.
Anchor frame: A reference frame within a chunk that compression measures against: tokens whose keys are highly similar to the anchor's keys are treated as redundant and pruned.
Training-free: The method works on an already-trained model with no retraining or finetuning — it changes how the KV cache is managed at inference time, not the weights.

The news. On May 21, 2026, KAIST's Yi et al. published WorldKV, a training-free KV-cache system for autoregressive video world models (models that roll out a navigable world frame by frame). It tackles the cache explosion two ways: World Retrieval evicts old KV chunks to CPU/GPU storage indexed by camera and action, then reinserts the scene-relevant ones when a viewpoint is revisited; World Compression prunes redundant tokens inside each chunk by key-key similarity to an anchor frame, halving per-chunk storage. It reports matching or exceeding full-KV fidelity at ~2x throughput, with no model retraining.

Picture a film studio shooting a long, wandering scene. Every room and street the story visits is a fully built set, and the soundstage floor only fits a few sets at once. A careless crew demolishes each set the moment the camera leaves it — cheap in the moment, but when the script wanders back to the kitchen on page 200, they have to build it again from scratch, and it never quite matches. That careless crew is a sliding window: it keeps only the most recent frames and throws the rest away, so a video world model that loops back to a place it already drew has simply forgotten what it looked like.

WorldKV is the studio with a warehouse. World Retrieval crates each old set instead of demolishing it, and labels the crate by which camera angle it belongs to. The crates go to off-floor storage — CPU or spare GPU memory — so the KV cache on the active floor stays small. When the camera trajectory loops back toward an earlier viewpoint, the matching crate is wheeled straight back onto the floor and reinserted into the attention window — no reshoot. That "no reshoot" is the load-bearing part: the keys and values were saved, not recomputed, so reinserting an old scene skips re-encoding entirely, the way reusing cached KV beats recomputing it (the lookup and the off-GPU transfer still cost something, but far less than redrawing the scene). Because nothing about the model's weights changes, the whole scheme is training-free.
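The crate-and-warehouse flow can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the class name `EvictReinsertCache`, the 2-D pose tuples, the Euclidean pose distance, and the `revisit_radius` threshold are all assumptions made for the example, and plain strings stand in for real key/value tensors.

```python
import math
from collections import OrderedDict


class EvictReinsertCache:
    """Toy bounded KV cache that archives evicted chunks instead of deleting them."""

    def __init__(self, budget, revisit_radius):
        self.budget = budget                  # max chunks kept live ("on the floor")
        self.revisit_radius = revisit_radius  # hypothetical pose-distance threshold
        self.live = OrderedDict()             # chunk_id -> (pose, kv), oldest first
        self.archive = {}                     # chunk_id -> (pose, kv), off-floor store

    def _evict_oldest(self):
        chunk_id, entry = self.live.popitem(last=False)
        self.archive[chunk_id] = entry        # move to storage, do not discard

    def append(self, chunk_id, pose, kv):
        """Add a newly generated chunk, evicting the oldest ones if over budget."""
        self.live[chunk_id] = (pose, kv)
        while len(self.live) > self.budget:
            self._evict_oldest()

    def reinsert_for(self, pose):
        """Bring back the archived chunk nearest to `pose`, if it is close enough."""
        best_id, best_dist = None, self.revisit_radius
        for chunk_id, (stored_pose, _) in self.archive.items():
            dist = math.dist(pose, stored_pose)
            if dist <= best_dist:
                best_id, best_dist = chunk_id, dist
        if best_id is None:
            return None                       # no stored viewpoint matches
        self.live[best_id] = self.archive.pop(best_id)  # saved KV reused as-is
        while len(self.live) > self.budget:
            self._evict_oldest()
        return best_id


cache = EvictReinsertCache(budget=2, revisit_radius=0.5)
cache.append("kitchen", (0.0, 0.0), kv="kitchen K/V")
cache.append("hall", (5.0, 0.0), kv="hall K/V")
cache.append("garden", (10.0, 0.0), kv="garden K/V")  # pushes "kitchen" to the archive

restored = cache.reinsert_for((0.1, 0.0))   # camera loops back near the kitchen
missed = cache.reinsert_for((50.0, 50.0))   # a viewpoint never visited before
print(restored, list(cache.live), list(cache.archive), missed)
```

The live cache never exceeds its budget, yet the kitchen chunk returns with its saved keys and values intact. A real system would also have to pay for the index lookup and the off-GPU transfer, which this sketch ignores.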

[Diagram: drop-oldest eviction, remove earliest tokens first]

The diagram shows the choice WorldKV refuses. Drop-oldest and sliding-window both delete tokens to stay under budget — fine for text that never looks back, ruinous for a world model whose camera keeps revisiting rooms. WorldKV evicts the same chunks but stores them, so eviction becomes a move to the warehouse rather than the trash. The second layer, World Compression, shrinks each crate before it ships: within a chunk, tokens whose keys are near-duplicates of the chunk's anchor frame carry little new information, so they are pruned by key-key similarity. That halves per-chunk storage, which means the same memory budget now holds twice as much history.
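A minimal sketch of the pruning step, under stated assumptions: cosine similarity as the key-key measure, a fixed `threshold`, and comparing each token against its best-matching anchor key are all choices made for this example, not details confirmed by the paper. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two key vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compress_chunk(anchor_keys, keys, values, threshold=0.95):
    """Drop tokens whose key is a near-duplicate of some anchor-frame key."""
    kept_keys, kept_values = [], []
    for key, value in zip(keys, values):
        max_sim = max(cosine(key, anchor) for anchor in anchor_keys)
        if max_sim < threshold:          # different enough from the anchor: keep
            kept_keys.append(key)
            kept_values.append(value)
    return kept_keys, kept_values


anchor_keys = [[1.0, 0.0], [0.0, 1.0]]
keys = [[0.99, 0.01], [0.7, 0.7], [0.0, 2.0], [-1.0, 0.0]]
values = ["v0", "v1", "v2", "v3"]

kept_keys, kept_values = compress_chunk(anchor_keys, keys, values)
print(kept_values)  # v0 and v2 point the same way as an anchor key, so they go
```

In this toy chunk two of the four tokens survive, which mirrors the halving described above. The actual ratio on real data depends on how redundant neighbouring frames are and on the threshold chosen.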

Walk the numbers on one rollout. Say the active floor has room for 64 KV chunks (illustrative: the paper reports ratios, not absolute counts). The sliding-window baseline keeps the most recent 64 and discards everything older, so a room you rendered 300 chunks ago is gone. WorldKV moves both levers at once. World Compression halves each chunk, so 64 chunks of budget now store 128 chunks of history. World Retrieval keeps the evicted chunks off-floor and, when the camera revisits, reinserts the matching saved chunk with no re-encoding. Hold output quality equal to a full-KV run and the result is the paper's headline, ~2x throughput, because every decoding step attends over a small, relevant window instead of an ever-growing one.
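The same arithmetic as a short script. The 64-chunk budget and the 300-chunk revisit come from the illustration above; the rollout length of 400 chunks and the helper name `history_held` are assumptions added for the example.

```python
def history_held(budget_chunks, kept_fraction):
    """Chunks of history that fit in the live budget after compression."""
    return int(budget_chunks / kept_fraction)


budget_chunks = 64   # illustrative live budget
kept_fraction = 0.5  # World Compression keeps about half of each chunk
generated = 400      # assumed rollout length so far, in chunks
revisit_age = 300    # the revisited room was rendered 300 chunks ago

sliding_history = history_held(budget_chunks, 1.0)            # no compression
worldkv_history = history_held(budget_chunks, kept_fraction)  # compressed chunks

# Sliding window: anything older than the window is gone for good.
sliding_recovers = revisit_age <= sliding_history

# WorldKV: a chunk is either still live or sitting in the archive.
worldkv_live_hit = revisit_age <= worldkv_history
worldkv_archived = max(0, generated - worldkv_history)
worldkv_recovers = worldkv_live_hit or revisit_age <= generated

print(sliding_history, worldkv_history)    # 64 128
print(sliding_recovers, worldkv_recovers)  # False True
print(worldkv_archived)                    # 272
```

The room from 300 chunks ago falls outside both live windows, so the sliding window has lost it while WorldKV finds it among the 272 archived chunks.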

Strategy	What happens to old KV	Live KV size vs rollout length	Coherence on revisit
Full KV cache	Kept, all of it	Grows linearly	Best baseline, but memory explodes on long rollouts
Sliding window	Discarded permanently	Constant (~W)	Lost: revisited scenes must be re-imagined
WorldKV [paper]	Evicted to storage, reinserted on revisit	Bounded (~2x history per unit of budget)	Preserved: a matching saved chunk is reinserted

The honest caveats. Retrieval only pays off when the camera actually revisits earlier viewpoints — a trajectory that never loops back gets the compression win but little from the store, and the correspondence index adds its own bookkeeping. And "match full-KV fidelity" is a claim about this paper's video benchmarks, not a guarantee on every world model. But the deeper idea travels well past video. Once you treat eviction as "move to a labeled warehouse" rather than "throw away," the question stops being "which tokens can I afford to lose" and becomes "how do I find the right ones again" — and a cache that used to grow without limit becomes a small working set plus a retrievable archive. That reframing is exactly the retrieval pattern long-context LLM serving keeps rediscovering.

Related explainers

Baidu Unlimited OCR — Reference Sliding Window Attention — bounds the KV cache by pinning a fixed reference plus a short window; WorldKV bounds it by evicting and retrieving instead.
SP-KV — self-pruned KV cache — permanently drops low-value KV pairs; WorldKV's whole point is that the dropped pairs come back when needed.
AnchorKV — safety-aware KV compression — also compresses the cache against anchor information; WorldKV anchors on a frame's keys to prune near-duplicate tokens.
FlashMemory — lookahead sparse attention — shrinks attention by having each token read fewer keys; WorldKV shrinks it by keeping fewer chunks live and fetching the rest on demand.
