AnchorKV is a 2026 method (Ning Ni and Yingjie Lao, arXiv 2606.17872) that makes KV-cache compression safety-aware. Standard compressors evict cached tokens by attention score alone, which is blind to safety. AnchorKV builds an offline 'refusal anchor' — a difference-of-means direction in the model's key projection space that points toward harmful prompts — and adds a soft penalty so eviction is biased away from that direction. It is a drop-in change that reduces to the original compressor when the penalty strength is zero.

Why can compressing the KV cache hurt model safety?

KV-cache compression keeps only a subset of tokens to save memory, and mainstream policies keep the most-attended ('heavy hitter') tokens. The signal that keeps a model refusing harmful requests does not necessarily live in the most-attended tokens, so an attention-only eviction rule can drop it — and AnchorKV's authors report that such policies can either fail to defend against jailbreak attacks or degrade safety alignment under aggressive eviction. The eviction policy is, quietly, a safety policy.

How is AnchorKV different from H2O or SnapKV?

H2O and SnapKV evict by attention importance, keeping heavy-hitter tokens with no notion of safety. AnchorKV keeps that importance ranking but subtracts a soft penalty along a key-space 'refusal anchor,' so the same memory budget is spent in a way that preserves alignment. A strength knob (λ) tunes the trade between utility and safety, and at λ = 0 AnchorKV behaves exactly like the underlying compressor — so it layers on top of existing methods rather than replacing them.

AnchorKV makes KV-cache compression safety-aware — Refusal anchor in the KV cache

TL;DR

What is it: The AnchorKV paper (arXiv 2606.17872) is a drop-in change to KV-cache compression — the trick serving stacks use to shrink the cache that dominates LLM memory — that adds a safety-aware "refusal anchor" to the rule deciding which cached tokens to keep.
Why it’s needed: KV-cache compression is how a server fits long contexts in limited memory, but the usual eviction rule scores tokens only by how much attention they get — it is blind to safety, so squeezing the cache can let jailbreaks through or quietly weaken alignment.
vs previous: A standard attention-importance compressor (H2O / SnapKV-style) keeps the "heavy-hitter" tokens and drops the rest, with no notion of safety. AnchorKV adds a soft penalty along a key-space direction tied to harmful prompts, so the same memory budget is spent in a way that keeps the model aligned.

Jargon

KV cache: The stored keys and values for every token already processed, so the model never recomputes them as it generates. It is the dominant memory cost of inference — it grows with context length and is what compression targets.
KV-cache compression / eviction: Throwing away the keys and values of some cached tokens to fit a longer context into limited memory. The policy that picks which tokens to drop is the whole game — see production KV-cache tricks.
Attention score / heavy hitter: How much the model's attention weight lands on a token. Mainstream compressors (H2O, SnapKV) keep the "heavy hitters" — the most-attended tokens — and evict the rest. AnchorKV's point is that this ranking is safety-blind.
Refusal anchor: A fixed direction in key space that points toward "harmful prompt", built once, offline. AnchorKV gets it by adapting a difference-of-means representation-engineering recipe to the model's key projections — the average key for harmful prompts minus the average key for benign ones.
Difference-of-means / representation engineering: A way to read a concept out of a model's activations: collect activations for two sets of inputs (here, harmful vs. benign prompts) and take the difference of their means as the concept's direction. AnchorKV applies it per layer to the key projection used in KV caching.
Soft penalty: Instead of a hard "always keep / always drop" rule, AnchorKV subtracts a tunable amount from a token's retention score based on how much its key aligns with the refusal anchor. A strength knob (call it λ) controls it — and at λ = 0 AnchorKV is exactly the original compressor.
Jailbreak: A crafted prompt that gets a model to do what its safety training should refuse. AnchorKV's claim is that aggressive cache compression can make jailbreaks easier, and that a safety-aware eviction rule closes that gap.

The news. On June 16, 2026, Ning Ni and Yingjie Lao posted AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor. They note that mainstream KV-cache compressors keep only the most attention-relevant tokens — and that those policies "either fail to defend against jailbreak attacks or degrade safety alignment under aggressive eviction." AnchorKV is a drop-in modification that builds an offline safety anchor in the key projection space and uses a soft penalty to "trade a small amount of utility for substantially improved safety alignment," collapsing back to the plain compressor when the penalty is zero. Read the paper →

Picture a bouncer with a notebook, logging every face that comes through the door. By the end of the night the notebook is full, so to keep working he tears out pages — and the rule he uses is simply toss the pages I've flipped back to least. It feels reasonable: the faces he keeps checking must be the ones that matter. But the page he almost never flips to is the laminated "do-not-admit" profile that keeps the bar safe — so a "tear out what you ignore" rule rips out the one page he can't afford to lose. Now a banned troublemaker walks up and there's nothing left to catch him.

That bouncer is a KV-cache compressor, and the rule is the one nearly every production compressor uses. The cache is the biggest memory cost in LLM inference, so to serve long contexts a server evicts tokens — and methods like H2O and SnapKV evict by attention score, keeping the heavy-hitter tokens and dropping the quiet ones. AnchorKV's authors point out the catch: the signal that keeps a model refusing harmful requests may not be reliably preserved by the most-attended tokens, so an attention-only cull can be blind to it — and, they report, under aggressive compression or a deliberate jailbreak that blindness lets the model's alignment slip.

Remove earliest tokens first

AnchorKV's move is to make the eviction score safety-aware. Off the clock, it builds a refusal anchor: using a difference-of-means recipe — the average key for harmful prompts minus the average key for benign ones, computed per layer in the model's key projection space — it gets a single direction that points toward "harmful." Then it changes the keep/drop score by subtracting a soft penalty along that anchor, so eviction is biased away from the key-space direction tied to harmful prompts instead of ranking by attention alone. Crucially, it is a dial, not a switch: a strength knob λ sets how hard the penalty bites, and at λ = 0 AnchorKV is equivalent to the original compressor — so a team can adopt it without giving up their existing cache-compression stack.

What makes this notable is that it ties together two things the curriculum usually teaches in separate rooms — cache compression for memory, and safety alignment — and shows the eviction policy is silently a safety policy. Many long-context serving stacks already run some form of KV-cache compression to fit context in memory; AnchorKV says that knob has been quietly trading away alignment, and hands back a second knob to buy it back.

Walk the scoring math on one cache. Say the budget keeps 3 of 5 cached tokens, and the base compressor ranks them by attention importance: [0.90, 0.70, 0.60, 0.50, 0.30] (illustrative scores). The plain rule keeps the top three — including the token scoring 0.60, which happens to be a harmful-direction token (its key aligns strongly with the refusal anchor, align = 0.8). Now switch AnchorKV on with λ = 0.5: the new score is importance − λ · align, so that token drops to 0.60 − 0.5 × 0.8 = 0.20 and falls out of the top three, while a benign token at 0.50 (align ≈ 0) takes its slot. The same memory budget now spends its kept slots on tokens that keep the model aligned — exactly the paper's "small utility cost for substantially improved safety," with λ as the dial between the two.

Eviction policy	Keeps	Safety-aware?	Behavior under compression / jailbreak
Drop-oldest / sliding window	Recent tokens	No	Cheap, but blind to both importance and safety
Attention-importance (H2O, SnapKV)	Heavy-hitter tokens	No	Keeps accuracy on benign work; can let jailbreaks through or erode alignment (per AnchorKV's framing)
AnchorKV (this paper)	Heavy hitters, re-scored by a soft penalty along the refusal anchor	Yes	Substantially better safety for a small utility cost; λ = 0 ⇒ identical to the base compressor (reported, qualitative)

The honest caveats: the paper's abstract describes the mechanism qualitatively — "a small amount of utility" for "substantially improved safety alignment" — without published numbers in the abstract, so treat the size of the trade as reported, not yet quantified here, and check the paper for the benchmark figures. And the refusal anchor is only as good as the harmful/benign prompts it was built from — a difference-of-means direction captures the kinds of harm it was shown, so coverage of the anchor set matters. But the deeper lesson stands: once you accept that a cached token's value isn't only its attention weight, "which tokens to forget" becomes a safety decision, and the eviction policy needs a safety term — not just an importance one.

Goes deeper in: LLM Internals → KV Cache → Production KV caches

Related explainers

ThriftAttention — importance-aware FP4 attention — also ranks tokens/blocks by a cheap importance score, but to pick precision per block; AnchorKV reuses an importance idea to pick what to keep — with safety folded in.
SP-KV — self-pruned KV cache — another "which KV entries can we drop" policy; AnchorKV is the safety-aware cousin of the same eviction question.
Tangram — per-head KV budgets — allocates the KV memory budget across heads; AnchorKV instead changes which tokens win a fixed budget, by a safety-aware score.
KVarn — Hadamard 2-bit KV cache — shrinks the cache by quantizing every entry rather than evicting some; the two compression axes (precision vs. retention) are complementary.

Related explainers

Frequently Asked Questions