The news. On June 16, 2026, Ning Ni and Yingjie Lao posted AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor. They note that mainstream KV-cache compressors keep only the most attention-relevant tokens, and that those policies "either fail to defend against jailbreak attacks or degrade safety alignment under aggressive eviction." AnchorKV is a drop-in modification that builds an offline safety anchor in the key projection space and uses a soft penalty to "trade a small amount of utility for substantially improved safety alignment," collapsing back to the plain compressor when the penalty is zero.
Picture a bouncer with a notebook, logging every face that comes through the door. By the end of the night the notebook is full, so to keep working he tears out pages — and the rule he uses is simply toss the pages I've flipped back to least. It feels reasonable: the faces he keeps checking must be the ones that matter. But the page he almost never flips to is the laminated "do-not-admit" profile that keeps the bar safe — so a "tear out what you ignore" rule rips out the one page he can't afford to lose. Now a banned troublemaker walks up and there's nothing left to catch him.
That bouncer is a KV-cache compressor, and the rule is the one mainstream compressors use. At long context lengths the cache can become the dominant memory cost in LLM inference, so to serve long contexts a server evicts tokens, and methods like H2O and SnapKV evict by attention score, keeping the heavy-hitter tokens and dropping the quiet ones. AnchorKV's authors point out the catch: the signal that keeps a model refusing harmful requests may not be reliably preserved by the most-attended tokens, so an attention-only cull can be blind to it. Under aggressive compression or a deliberate jailbreak, they report, that blindness lets the model's alignment slip.
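The attention-score rule can be sketched in a few lines. This is a deliberately simplified illustration, not the actual H2O or SnapKV algorithm: the function names and the toy attention weights are assumptions made up for the example, and real compressors work per head and per layer with their own scoring windows.

```python
def accumulate_attention(attn_rows):
    """Sum the attention each cached token received across recent queries.

    attn_rows[q][t] is the (toy) attention weight query q put on cached token t.
    """
    totals = [0.0] * len(attn_rows[0])
    for row in attn_rows:
        for token_idx, weight in enumerate(row):
            totals[token_idx] += weight
    return totals


def keep_heavy_hitters(scores, budget):
    """Return the indices of the `budget` highest-scoring tokens, in cache order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical attention weights: 3 recent queries over 4 cached tokens.
attn_rows = [
    [0.50, 0.10, 0.30, 0.10],
    [0.40, 0.05, 0.45, 0.10],
    [0.60, 0.05, 0.25, 0.10],
]
totals = accumulate_attention(attn_rows)
kept = keep_heavy_hitters(totals, budget=2)
print(kept)  # [0, 2]: the two most-attended tokens survive
```

Nothing in this score knows what a token is for. A token that is rarely attended but matters for refusal behavior is ranked exactly like any other quiet token.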
AnchorKV's move is to make the eviction score safety-aware. Offline, it builds a refusal anchor using a difference-of-means recipe: the average key for harmful prompts minus the average key for benign ones, computed per layer in the model's key projection space, gives a single direction that points toward "harmful." It then changes the keep/drop score by subtracting a soft penalty along that anchor, so retention is biased away from the key-space direction tied to harmful prompts instead of ranking by attention alone. Crucially, it is a dial, not a switch: a strength knob λ sets how hard the penalty bites, and at λ = 0 AnchorKV is equivalent to the original compressor, so a team can adopt it without giving up their existing cache-compression stack.
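A minimal sketch of the difference-of-means step, for a single layer and with two-dimensional toy keys. The function names are hypothetical, and two choices here are assumptions rather than details taken from the paper: normalizing the anchor to unit length, and measuring a key's alignment with cosine similarity.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def refusal_anchor(harmful_keys, benign_keys):
    """Difference of means (harmful minus benign), scaled to unit length.

    Unit normalization is an assumption of this sketch.
    """
    mu_harmful = mean_vector(harmful_keys)
    mu_benign = mean_vector(benign_keys)
    diff = [h - b for h, b in zip(mu_harmful, mu_benign)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("harmful and benign means coincide; no direction to extract")
    return [d / norm for d in diff]


def alignment(key, anchor):
    """Cosine similarity between a key and the anchor (an assumed alignment measure)."""
    key_norm = math.sqrt(sum(k * k for k in key))
    if key_norm == 0.0:
        return 0.0
    return sum(k * a for k, a in zip(key, anchor)) / key_norm


# Toy key vectors for one layer (illustrative values, not real model keys).
harmful_keys = [[1.0, 0.0], [0.8, 0.2]]
benign_keys = [[0.0, 1.0], [0.2, 0.8]]

anchor = refusal_anchor(harmful_keys, benign_keys)
print(alignment([1.0, 0.0], anchor) > 0)  # True: harmful-like key points along the anchor
print(alignment([0.0, 1.0], anchor) < 0)  # True: benign-like key points away from it
```

In a real setup the keys would come from the model's key projections on a curated set of harmful and benign prompts, with one anchor per layer as the text describes.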
What makes this notable is that it ties together two things the curriculum usually teaches in separate rooms — cache compression for memory, and safety alignment — and shows the eviction policy is silently a safety policy. Many long-context serving stacks already run some form of KV-cache compression to fit context in memory; AnchorKV says that knob has been quietly trading away alignment, and hands back a second knob to buy it back.
Walk the scoring math on one cache. Say the budget keeps 3 of 5 cached tokens, and the base compressor ranks them by attention importance: [0.90, 0.70, 0.60, 0.50, 0.30] (illustrative scores). The plain rule keeps the top three, including the token scoring 0.60, which happens to be a harmful-direction token (its key aligns strongly with the refusal anchor, align = 0.8); assume the other four tokens have align ≈ 0. Now switch AnchorKV on with λ = 0.5: the new score is importance − λ · align, so that token drops to 0.60 − 0.5 × 0.8 = 0.20 and falls out of the top three, while the benign token at 0.50 keeps its score and takes the slot. The same memory budget now spends its kept slots on tokens that keep the model aligned, which is the paper's "small utility cost for substantially improved safety" in miniature, with λ as the dial between the two.
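The same walk-through as runnable arithmetic. This is a sketch of the soft-penalty re-scoring as described here, not the paper's implementation: the function names are hypothetical, the scores are the illustrative ones from the text, and setting align to exactly 0 for the four benign tokens is an assumption.

```python
def anchored_scores(importance, align, lam):
    """Soft-penalty score per token: importance - lam * align."""
    return [score - lam * a for score, a in zip(importance, align)]


def keep_top(scores, budget):
    """Return the indices of the `budget` highest-scoring tokens, in cache order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


importance = [0.90, 0.70, 0.60, 0.50, 0.30]  # illustrative attention importance
align = [0.0, 0.0, 0.8, 0.0, 0.0]            # only token 2 aligns with the anchor (assumed)

# lam = 0 reproduces the base compressor's choice.
baseline = keep_top(anchored_scores(importance, align, lam=0.0), budget=3)
# lam = 0.5 pushes token 2 down to about 0.20, so token 3 takes its slot.
anchored = keep_top(anchored_scores(importance, align, lam=0.5), budget=3)

print(baseline)  # [0, 1, 2]
print(anchored)  # [0, 1, 3]
```

Sweeping `lam` between 0 and 0.5 shows where the swap happens: token 2 loses its slot once 0.60 − 0.8 · λ falls below 0.50, that is for λ above 0.125 in this toy setup.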
| Eviction policy | Keeps | Safety-aware? | Behavior under compression / jailbreak |
|---|---|---|---|
| Drop-oldest / sliding window | Recent tokens | No | Cheap, but blind to both importance and safety |
| Attention-importance (H2O, SnapKV) | Heavy-hitter tokens | No | Keeps accuracy on benign work; can let jailbreaks through or erode alignment (per AnchorKV's framing) |
| AnchorKV (this paper) | Heavy hitters, re-scored by a soft penalty along the refusal anchor | Yes | Substantially better safety for a small utility cost (reported, qualitative); λ = 0 ⇒ identical to the base compressor |
The honest caveats: the paper's abstract describes the trade only qualitatively, as "a small amount of utility" for "substantially improved safety alignment," and gives no numbers, so treat the size of the trade as reported but not quantified here, and check the paper for the benchmark figures. The refusal anchor is also only as good as the harmful/benign prompts it was built from: a difference-of-means direction captures the kinds of harm it was shown, so coverage of the anchor set matters. But the deeper lesson stands: once you accept that a cached token's value isn't only its attention weight, "which tokens to forget" becomes a safety decision, and the eviction policy needs a safety term, not just an importance one.
Related explainers
- ThriftAttention — importance-aware FP4 attention — also ranks tokens/blocks by a cheap importance score, but to pick precision per block; AnchorKV reuses an importance idea to pick what to keep — with safety folded in.
- SP-KV — self-pruned KV cache — another "which KV entries can we drop" policy; AnchorKV is the safety-aware cousin of the same eviction question.
- Tangram — per-head KV budgets — allocates the KV memory budget across heads; AnchorKV instead changes which tokens win a fixed budget, by a safety-aware score.
- KVarn — Hadamard 2-bit KV cache — shrinks the cache by quantizing every entry rather than evicting some; the two compression axes (precision vs. retention) are complementary.