The news. On May 21, 2026, KAIST's Yi et al. published WorldKV, a training-free KV-cache system for autoregressive video world models, which roll out a navigable world frame by frame. It tackles the cache explosion two ways: World Retrieval evicts old KV chunks to CPU/GPU storage indexed by camera and action, then reinserts the scene-relevant ones when a viewpoint is revisited; World Compression prunes redundant tokens inside each chunk by key-key similarity to an anchor frame, halving per-chunk storage. The paper reports matching or exceeding full-KV fidelity at ~2x throughput, with no model retraining.
Picture a film studio shooting a long, wandering scene. Every room and street the story visits is a fully built set, and the soundstage floor only fits a few sets at once. A careless crew demolishes each set the moment the camera leaves it — cheap in the moment, but when the script wanders back to the kitchen on page 200, they have to build it again from scratch, and it never quite matches. That careless crew is a sliding window: it keeps only the most recent frames and throws the rest away, so a video world model that loops back to a place it already drew has simply forgotten what it looked like.
WorldKV is the studio with a warehouse. World Retrieval crates each old set instead of demolishing it, and labels the crate by which camera angle it belongs to. The crates go to off-floor storage — CPU or spare GPU memory — so the KV cache on the active floor stays small. When the camera trajectory loops back toward an earlier viewpoint, the matching crate is wheeled straight back onto the floor and reinserted into the attention window — no reshoot. That "no reshoot" is the load-bearing part: the keys and values were saved, not recomputed, so reinserting an old scene skips re-encoding entirely, the way reusing cached KV beats recomputing it (the lookup and the off-GPU transfer still cost something, but far less than redrawing the scene). Because nothing about the model's weights changes, the whole scheme is training-free.
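To make the warehouse idea concrete, here is a minimal sketch of a pose-indexed archive. It is not the paper's implementation: the `ChunkArchive` class, the `(x, y, yaw)` pose format, the plain Euclidean distance, and the `max_distance` threshold are all assumptions chosen for illustration, and the stored chunk is any opaque object standing in for saved keys and values.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical camera pose: (x, y, yaw). The paper indexes by camera and
# action; this sketch uses the pose only.
Pose = Tuple[float, float, float]


@dataclass
class ChunkArchive:
    """Off-floor store for evicted KV chunks, labeled by camera pose."""

    max_distance: float = 1.0  # assumed threshold for "same viewpoint"
    _entries: List[Tuple[Pose, object]] = field(default_factory=list)

    def evict(self, pose: Pose, kv_chunk: object) -> None:
        # Save the chunk under its pose label instead of deleting it.
        self._entries.append((pose, kv_chunk))

    def retrieve(self, query: Pose) -> Optional[object]:
        # Return the stored chunk whose pose is nearest to the query,
        # or None when nothing is close enough to count as a revisit.
        best, best_dist = None, self.max_distance
        for pose, chunk in self._entries:
            dist = math.dist(pose, query)
            if dist <= best_dist:
                best, best_dist = chunk, dist
        return best


archive = ChunkArchive(max_distance=0.5)
archive.evict((0.0, 0.0, 0.0), "kitchen_kv")
archive.evict((10.0, 0.0, 0.0), "street_kv")
near_kitchen = archive.retrieve((0.1, 0.0, 0.0))  # close to the kitchen pose
nowhere = archive.retrieve((5.0, 5.0, 0.0))       # far from every stored pose
```

The point of the sketch is that `retrieve` hands back saved keys and values as they were stored, so nothing is re-encoded. A real system would also need to move tensors between devices and would likely use a better pose metric than raw Euclidean distance.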
[Diagram: eviction strategies compared. Drop-oldest removes the earliest tokens first.]
The diagram shows the choice WorldKV refuses. Drop-oldest and sliding-window both delete tokens to stay under budget — fine for text that never looks back, ruinous for a world model whose camera keeps revisiting rooms. WorldKV evicts the same chunks but stores them, so eviction becomes a move to the warehouse rather than the trash. The second layer, World Compression, shrinks each crate before it ships: within a chunk, tokens whose keys are near-duplicates of the chunk's anchor frame carry little new information, so they are pruned by key-key similarity. That halves per-chunk storage, which means the same memory budget now holds twice as much history.
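The pruning step can be sketched as follows, under stated assumptions. The text only says tokens are pruned by key-key similarity to the anchor frame. The cosine metric, the "max similarity to any anchor key" score, and the `keep_ratio` parameter are illustrative choices here, not details confirmed by the paper. Keys are plain lists of floats so the example runs without any tensor library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compress_chunk(keys, values, anchor_keys, keep_ratio=0.5):
    """Keep the tokens whose keys are least similar to the anchor keys."""
    # Redundancy score: highest cosine similarity to any anchor-frame key.
    redundancy = [max(cosine(k, a) for a in anchor_keys) for k in keys]
    n_keep = max(1, round(len(keys) * keep_ratio))
    # Pick the least redundant tokens, then restore original token order.
    by_redundancy = sorted(range(len(keys)), key=lambda i: redundancy[i])
    keep = sorted(by_redundancy[:n_keep])
    return [keys[i] for i in keep], [values[i] for i in keep]


anchor = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
values = ["a", "b", "c", "d"]
kept_keys, kept_values = compress_chunk(keys, values, anchor)
```

With `keep_ratio=0.5` the two keys that nearly duplicate the anchor are dropped and the chunk shrinks to half its size, which is the storage halving described above.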
Walk the numbers on one rollout. Say the active floor has room for 64 KV chunks (illustrative — the paper reports ratios, not absolute counts). The sliding-window baseline keeps the most recent 64 and discards everything older, so a room you rendered 300 chunks ago is gone. WorldKV moves both levers at once: World Compression halves each chunk, so 64 chunks of budget now store 128 chunks of history, and World Retrieval keeps the evicted chunks off-floor, reinserting the matching saved chunk a revisit needs with no re-encoding. Hold output quality equal to a full-KV run and the result is the paper's headline — ~2x throughput — because every decoding step attends over a small, relevant window instead of an ever-growing one.
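The same arithmetic, written out as a small script. The numbers are the illustrative ones from the paragraph above (64-chunk budget, a room rendered 300 chunks ago), not figures from the paper, and the variable names are made up for this sketch.

```python
BUDGET_CHUNKS = 64        # illustrative active-floor budget from the text
COMPRESSION_RATIO = 0.5   # World Compression roughly halves per-chunk storage
REVISIT_AGE = 300         # the revisited room was rendered 300 chunks ago

# How far back each live window reaches, in chunks of history.
sliding_reach = BUDGET_CHUNKS
compressed_reach = int(BUDGET_CHUNKS / COMPRESSION_RATIO)

in_sliding_window = REVISIT_AGE <= sliding_reach
in_compressed_window = REVISIT_AGE <= compressed_reach
# Assumption: every evicted chunk was archived, so the room is retrievable.
needs_retrieval = not in_compressed_window

print(sliding_reach, compressed_reach)          # 64 128
print(in_sliding_window, in_compressed_window)  # False False
print(needs_retrieval)                          # True
```

Compression alone doubles the reach of the live window to 128 chunks, but a room 300 chunks back is still outside it. That is why the two levers are complementary: compression stretches the budget, and retrieval covers whatever still falls off the end.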
| Strategy | What happens to old KV | KV vs rollout length | Coherence on revisit |
|---|---|---|---|
| Full KV cache | kept, all of it | Grows linearly | Best baseline, but memory explodes on long rollouts |
| Sliding window | discarded permanently | Constant (window size W) | Lost: revisited scenes must be re-imagined |
| WorldKV [paper] | evicted to storage, reinserted on revisit | Bounded (same budget holds ~2x the history) | Preserved: a matching saved chunk is reinserted |
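The "Coherence on revisit" column can be illustrated with a toy simulation. This is a deliberately simplified model: each room stands in for one KV chunk, the live window is a fixed-size queue, and the `revisit_hits` function and its `archive_evicted` flag are hypothetical names for this sketch. It counts how often a revisited room still has usable KV somewhere.

```python
from collections import deque


def revisit_hits(trajectory, window, archive_evicted):
    """Count revisits that find saved KV (hits) versus none (misses)."""
    live = deque()   # rooms whose KV sits in the attention window
    archive = set()  # rooms whose KV was moved off-floor
    seen = set()
    hits = misses = 0
    for room in trajectory:
        if room in seen:
            if room in live or room in archive:
                hits += 1
            else:
                misses += 1
        seen.add(room)
        if room in live:
            live.remove(room)  # refresh its position in the window
        live.append(room)
        if len(live) > window:
            oldest = live.popleft()
            if archive_evicted:
                archive.add(oldest)  # move to storage instead of deleting
    return hits, misses


path = ["kitchen", "hall", "street", "park", "kitchen"]
full_kv = revisit_hits(path, window=len(path), archive_evicted=False)
sliding = revisit_hits(path, window=2, archive_evicted=False)
with_archive = revisit_hits(path, window=2, archive_evicted=True)
```

On this path the full cache and the archived variant both find the kitchen again, while the plain sliding window misses it. The full cache pays for that hit with a window as long as the whole trajectory, and the archived variant keeps the live window at two.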
The honest caveats. Retrieval only pays off when the camera actually revisits earlier viewpoints: a trajectory that never loops back gets the compression win but little from the store, and the camera-and-action index adds its own bookkeeping. And "matching full-KV fidelity" is a claim about this paper's video benchmarks, not a guarantee on every world model. But the deeper idea travels well past video. Once you treat eviction as "move to a labeled warehouse" rather than "throw away," the question stops being "which tokens can I afford to lose" and becomes "how do I find the right ones again," and a cache that used to grow without limit becomes a small working set plus a retrievable archive. That reframing is exactly the retrieval pattern long-context LLM serving keeps rediscovering.
Related explainers
- Baidu Unlimited OCR — Reference Sliding Window Attention — bounds the KV cache by pinning a fixed reference plus a short window; WorldKV bounds it by evicting and retrieving instead.
- SP-KV — self-pruned KV cache — permanently drops low-value KV pairs; WorldKV's whole point is that the dropped pairs come back when needed.
- AnchorKV — safety-aware KV compression — also compresses the cache against anchor information; WorldKV anchors on a frame's keys to prune near-duplicate tokens.
- FlashMemory — lookahead sparse attention — shrinks attention by having each token read fewer keys; WorldKV shrinks it by keeping fewer chunks live and fetching the rest on demand.