The news. On June 22, 2026, Baidu released Unlimited OCR, a 3-billion-parameter (500 million active) end-to-end OCR model that transcribes 40+ pages of documents in a single forward pass under a 32,000-token context. It replaces every decoder attention layer with Reference Sliding Window Attention (R-SWA), which holds the KV cache at a constant size throughout decoding instead of letting it grow with output length, and reports new end-to-end state-of-the-art on OmniDocBench v1.5 and v1.6. Weights and code are public under CC-BY 4.0.

Picture a scribe copying a long book by hand. The trick that keeps the desk clear is not memory — it's what stays on the desk. The source book lies open, always in reach, and the scribe glances at only the last line they wrote to keep the handwriting and spelling continuous. They never re-read the hundred pages already copied; those go in a drawer. So the desk holds the same two things on page 1 and on page 200 — the source, and the current line — and it never overflows, no matter how long the book.

A standard transformer decoder is the opposite scribe: it keeps every page it has copied stacked on the desk, because each new token attends to all previous tokens. That stack is the KV cache, and its size grows linearly with the length of the output. That is fine for a one-paragraph answer and ruinous for a 40-page transcription, where the output is enormous. At long sequence lengths the cache is already one of the largest memory costs in inference; let it grow with every page and a long document becomes much harder to fit and serve efficiently.

R-SWA is the disciplined scribe. It replaces every decoder attention layer so that each generated token attends to exactly two things: the full set of reference tokens — the document the encoder produced, kept pinned and fully visible — plus only the preceding 128 output tokens, a short sliding window over what was just written. The document never slides out of view, but the output history does. Because both pieces are bounded — the document is fixed and the window is 128 — the KV cache stays a constant size from the first page to the fortieth. This is the move a plain sliding window can't make on its own: slide a fixed window over everything and you'd lose the document you're reading; R-SWA exempts the reference tokens from the slide.
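The attention rule above can be sketched as a function that lists which cache positions are visible when the next output token is generated. This is a minimal illustration, not the paper's implementation: the function names, the layout (reference tokens first, output tokens after), and the example sizes are assumptions, and whether the current token counts toward the 128 is left aside.

```python
def causal_visible(num_ref: int, num_written: int) -> list[int]:
    """Standard causal attention: every cached position stays visible."""
    return list(range(num_ref + num_written))


def rswa_visible(num_ref: int, num_written: int, window: int = 128) -> list[int]:
    """R-SWA as described above: all reference positions plus the last `window` outputs.

    Assumed layout: positions [0, num_ref) are reference tokens,
    positions from num_ref onward are output tokens already written.
    """
    reference = list(range(num_ref))                       # pinned, never slides out
    first_recent = num_ref + max(0, num_written - window)  # oldest output still in view
    recent = list(range(first_recent, num_ref + num_written))
    return reference + recent


# Hypothetical sizes: 1,000 reference tokens, 5,000 output tokens already written.
print(len(causal_visible(1_000, 5_000)))  # 6000
print(len(rswa_visible(1_000, 5_000)))    # 1128
```

The reference block is returned whole no matter how many outputs exist, which is the exemption a plain sliding window lacks.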

Diagram: per-token KV-cache size, factor by factor

| Factor | What it counts | Multiplier |
| --- | --- | --- |
| K and V | Two vectors stored per token (Key + Value) | × 2 |
| Layers | Each layer has its own cache (like 32 filing cabinets) | × 32 |
| Heads | Each attention head stores its own K/V pair | × 32 |
| Head size | Each K or V vector has 128 numbers (d_head) | × 128 |
| Bytes per number | FP16 = 2 bytes per number (half precision) | × 2 |

Per token (Llama-2 7B): 2 × 32 × 32 × 128 × 2 = 524,288 bytes ≈ 512 KB
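The per-token product can be checked in a few lines. This is a sketch using the Llama-2 7B figures from the diagram; the function name and parameter names are illustrative, and it assumes every attention head keeps its own K/V pair, as the diagram does.

```python
def kv_bytes_per_token(layers: int, heads: int, head_dim: int,
                       bytes_per_number: int, kv_vectors: int = 2) -> int:
    """Bytes of KV cache added by one token: K and V, per layer, per head."""
    return kv_vectors * layers * heads * head_dim * bytes_per_number


# Llama-2 7B figures from the diagram: 32 layers, 32 heads, d_head = 128, FP16.
per_token = kv_bytes_per_token(layers=32, heads=32, head_dim=128, bytes_per_number=2)
print(per_token)         # 524288
print(per_token / 1024)  # 512.0
```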

The diagram shows why "grows with sequence length" is the term that hurts. KV-cache memory is a product: 2 (one K and one V) × layers × heads × head dimension × bytes per number × sequence length. Only that last factor moves as the model writes more. R-SWA freezes that factor for the output: instead of the sequence length climbing toward 32,000, the output's contribution is clamped at 128, while the reference tokens add a fixed, encoder-compressed amount. Pair that constant-cache decoder with DeepSeek-OCR's high-compression visual encoder, which compresses each page image into far fewer visual tokens, and dozens of pages fit in one 32,000-token pass.

Walk the numbers on one long document. Say transcribing 40 pages produces roughly 12,000 output tokens (illustrative — the real count depends on the document). A standard decoder's cache holds all 12,000, and the 12,000th token attends back across 11,999 predecessors — so both memory and per-token attention work climb with every page. R-SWA caps the output window at 128. So that same final token attends to just the last 128 outputs plus the fixed document tokens, and the output's contribution to the cache stays flat at 128 entries whether the document is 4 pages or 40. That clamp — from a number that grows with the page count to a constant 128 — is the decoder-side reason this can pair with a compressed visual encoder and read 40+ pages in one forward pass.
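The same walk-through as arithmetic, counting cache entries rather than bytes. This is a sketch under stated assumptions: the 12,000-token output is the illustrative figure from the paragraph above, the 4,000 reference tokens are a hypothetical number chosen only to make the comparison concrete, and the function names are invented for this example.

```python
WINDOW = 128      # output window size quoted in the article
NUM_REF = 4_000   # hypothetical reference-token count, not a figure from the paper


def standard_cache_entries(num_ref: int, num_written: int) -> int:
    """Standard decoder: the cache keeps every token seen so far."""
    return num_ref + num_written


def rswa_cache_entries(num_ref: int, num_written: int, window: int = WINDOW) -> int:
    """R-SWA: all reference tokens plus at most `window` recent outputs."""
    return num_ref + min(num_written, window)


# Output tokens written so far: early, a few pages in, and the illustrative 40-page total.
for written in (128, 1_200, 12_000):
    print(written,
          standard_cache_entries(NUM_REF, written),
          rswa_cache_entries(NUM_REF, written))
# 128 4128 4128
# 1200 5200 4128
# 12000 16000 4128
```

The standard column keeps climbing with the output, while the R-SWA column stops moving once the window is full.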

| Attention scheme | Each output token attends to… | KV cache vs output length | Where it earns its keep |
| --- | --- | --- | --- |
| Standard causal attention | every previous token | Grows linearly | Accurate, but memory explodes on long outputs |
| Plain sliding-window attention | only the last ~W tokens (W is a fixed window, model-dependent) | Constant (~W) | Cheap streaming, but it slides off the document being read |
| R-SWA (Unlimited OCR) | all reference tokens + the last 128 outputs [paper] | Constant | Long-document OCR: keeps the full source visible while bounding output memory |

The honest caveats. The 128-token window is a default, and a short window is a bet that the next line of a transcription rarely depends on text written thousands of tokens earlier — true for reading a document top-to-bottom, less obviously true for tasks with long-range output structure. And the constant-cache win leans on the encoder doing real work: if the reference tokens themselves were not compressed, "all reference tokens" would be its own large, fixed cost. But the deeper lesson generalizes past OCR — the paper itself notes R-SWA is "a general-purpose parsing attention mechanism… equally applicable to tasks such as ASR, translation, etc." Once you accept that an output token rarely needs the entire output history, the question stops being "how do we shrink the cache" and becomes "what must stay pinned, and how short can the window be" — and the cache stops growing at all.
