The news. On June 22, 2026, Baidu released Unlimited OCR, a 3-billion-parameter (500 million active) end-to-end OCR model that transcribes 40+ pages of documents in a single forward pass under a 32,000-token context. It replaces every decoder attention layer with Reference Sliding Window Attention (R-SWA), which holds the KV cache at a constant size throughout decoding instead of letting it grow with output length, and reports new end-to-end state-of-the-art on OmniDocBench v1.5 and v1.6. Weights and code are public under CC-BY 4.0.
Picture a scribe copying a long book by hand. The trick that keeps the desk clear is not memory — it's what stays on the desk. The source book lies open, always in reach, and the scribe glances at only the last line they wrote to keep the handwriting and spelling continuous. They never re-read the hundred pages already copied; those go in a drawer. So the desk holds the same two things on page 1 and on page 200 — the source, and the current line — and it never overflows, no matter how long the book.
A standard transformer decoder is the opposite scribe: it keeps every page it has copied stacked on the desk, because each new token attends to all previous tokens. That stack is the KV cache, and its size grows linearly with the length of the output — which is fine for a one-paragraph answer and ruinous for a 40-page transcription, where the output is enormous. The cache is already the biggest memory cost in inference; let it grow with every page and a long document becomes much harder to fit and serve efficiently.
R-SWA is the disciplined scribe. It replaces every decoder attention layer so that each generated token attends to exactly two things: the full set of reference tokens — the document the encoder produced, kept pinned and fully visible — plus only the preceding 128 output tokens, a short sliding window over what was just written. The document never slides out of view, but the output history does. Because both pieces are bounded — the document is fixed and the window is 128 — the KV cache stays a constant size from the first page to the fortieth. This is the move a plain sliding window can't make on its own: slide a fixed window over everything and you'd lose the document you're reading; R-SWA exempts the reference tokens from the slide.
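As a rough illustration of the attention pattern described above, the sketch below lists which positions a given output token may attend to under standard causal attention and under an R-SWA-style rule. It is not the paper's implementation: the function names, the position layout (reference tokens first, outputs after), the choice to count the current token inside the 128-token window, and the `NUM_REF` value are all assumptions made for the example.

```python
from typing import List


def causal_visible(step: int, num_ref: int) -> List[int]:
    """Standard causal attention: every reference token and every output so far."""
    return list(range(num_ref + step + 1))


def rswa_visible(step: int, num_ref: int, window: int = 128) -> List[int]:
    """Reference tokens stay pinned; only the newest `window` outputs are kept.

    Layout assumption: positions 0..num_ref-1 hold reference tokens and
    output token k sits at position num_ref + k. The window is assumed to
    include the current token.
    """
    pinned = list(range(num_ref))
    first_output = max(0, step - window + 1)
    recent = [num_ref + k for k in range(first_output, step + 1)]
    return pinned + recent


if __name__ == "__main__":
    NUM_REF = 1000  # hypothetical reference-token count, not a figure from the paper
    for step in (0, 127, 128, 5000):
        print(step,
              len(causal_visible(step, NUM_REF)),
              len(rswa_visible(step, NUM_REF)))
```

Under these assumptions the causal count keeps rising with `step`, while the R-SWA count stops at `NUM_REF + 128` once the window is full. That is the sense in which the reference tokens are exempt from the slide.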
The diagram shows why "grows with sequence length" is the term that hurts. KV-cache memory is a product: 2 (one tensor each for keys and values) × layers × KV heads × head dimension × bytes per value × sequence length. Only that last factor moves as the model writes more. R-SWA freezes that factor for the output: instead of the sequence length climbing toward 32,000, the output's contribution is clamped at 128, while the reference tokens add a fixed, encoder-compressed amount. Pair that constant-cache decoder with DeepSeek-OCR's high-compression visual encoder, which compresses each page image into far fewer visual tokens, and dozens of pages fit in one 32,000-token pass.
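To make the product concrete, here is a small back-of-the-envelope calculator. The layer count, KV-head count, head dimension, and reference-token count are placeholders chosen for illustration, not Unlimited OCR's published configuration. The leading factor of 2 accounts for storing both keys and values, and batch size is assumed to be 1.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `seq_len` positions (batch size 1)."""
    # The factor of 2 covers one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len


# Hypothetical model shape and document size; none of these come from the paper.
LAYERS, KV_HEADS, HEAD_DIM = 24, 8, 128
REF_TOKENS = 4_000   # assumed number of encoder-produced reference tokens
WINDOW = 128         # output window size stated in the article


def standard_cache(output_tokens):
    """Standard decoder: the cached sequence is reference plus all outputs."""
    return kv_cache_bytes(REF_TOKENS + output_tokens, LAYERS, KV_HEADS, HEAD_DIM)


def rswa_cache(output_tokens):
    """R-SWA-style decoder: the output side never exceeds WINDOW entries."""
    kept_outputs = min(output_tokens, WINDOW)
    return kv_cache_bytes(REF_TOKENS + kept_outputs, LAYERS, KV_HEADS, HEAD_DIM)


if __name__ == "__main__":
    for n in (128, 1_200, 12_000):
        print(f"{n:>6} outputs: "
              f"standard {standard_cache(n) / 2**20:8.1f} MiB, "
              f"windowed {rswa_cache(n) / 2**20:8.1f} MiB")
```

Whatever the real dimensions turn out to be, the shape of the result is the same: the standard figure scales with the output count, and the windowed figure stops changing once 128 outputs have been written.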
Walk the numbers on one long document. Say transcribing 40 pages produces roughly 12,000 output tokens (illustrative; the real count depends on the document). A standard decoder's cache holds all 12,000, and the 12,000th token attends back across 11,999 earlier output tokens on top of the document tokens, so both memory and per-token attention work climb with every page. R-SWA caps the output window at 128, so that same final token attends to just the last 128 outputs plus the fixed document tokens, and the output's contribution to the cache stays flat at 128 entries whether the document is 4 pages or 40. That clamp, from a number that grows with the page count to a constant 128, is the decoder-side reason this can pair with a compressed visual encoder and read 40+ pages in one forward pass.
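The same walk-through can be simulated token by token. The sketch below is a toy model of the output side of the cache only: a plain list stands in for a standard decoder that never evicts, and a `collections.deque` with `maxlen` stands in for a windowed decoder that drops its oldest entry. The 12,000-token count is the article's illustrative figure, and the function name is made up for this example.

```python
from collections import deque


def output_cache_growth(num_out, window=128):
    """Return output-side cache sizes after each generated token.

    The first list tracks a cache that keeps every output entry; the
    second tracks a cache that keeps only the newest `window` entries.
    """
    full_cache = []                      # standard decoder: nothing is evicted
    recent_cache = deque(maxlen=window)  # windowed decoder: oldest entry falls off
    standard_sizes, windowed_sizes = [], []
    for token_id in range(num_out):
        full_cache.append(token_id)
        recent_cache.append(token_id)
        standard_sizes.append(len(full_cache))
        windowed_sizes.append(len(recent_cache))
    return standard_sizes, windowed_sizes


if __name__ == "__main__":
    standard, windowed = output_cache_growth(12_000)
    for n in (100, 128, 1_200, 12_000):
        print(f"after {n:>6} tokens: standard={standard[n - 1]:>6}, "
              f"windowed={windowed[n - 1]:>4}")
```

The reference tokens are left out on purpose: they add the same fixed amount to both caches, so the difference between the two columns is entirely the output history.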
| Attention scheme | Each output token attends to… | KV cache vs output length | Where it earns its keep |
|---|---|---|---|
| Standard causal attention | every previous token | Grows linearly | Accurate, but memory explodes on long outputs |
| Plain sliding-window attention | only the last ~W tokens (W is a fixed window, model-dependent) | Constant (~W) | Cheap streaming, but it slides off the document being read |
| R-SWA (Unlimited OCR) | all reference tokens + the last 128 outputs [paper] | Constant | Long-document OCR: keeps the full source visible while bounding output memory |
The honest caveats. The 128-token window is a default, and a short window is a bet that the next line of a transcription rarely depends on text written thousands of tokens earlier — true for reading a document top-to-bottom, less obviously true for tasks with long-range output structure. And the constant-cache win leans on the encoder doing real work: if the reference tokens themselves were not compressed, "all reference tokens" would be its own large, fixed cost. But the deeper lesson generalizes past OCR — the paper itself notes R-SWA is "a general-purpose parsing attention mechanism… equally applicable to tasks such as ASR, translation, etc." Once you accept that an output token rarely needs the entire output history, the question stops being "how do we shrink the cache" and becomes "what must stay pinned, and how short can the window be" — and the cache stops growing at all.
Related explainers
- FlashMemory — lookahead sparse attention — also bounds the KV cache by having each token attend to fewer keys; R-SWA bounds it by a fixed window plus a pinned reference instead.
- SP-KV — self-pruned KV cache — drops low-value KV pairs to shrink the cache; R-SWA never writes the far output history in the first place.
- SubQ 1.1 — subquadratic sparse attention — near-linear attention for million-token context; R-SWA is the OCR-decoder cousin of the same "don't attend to everything" idea.
- DeepSeek V4 — long-context cost cut to a fraction — attacks the same long-context memory pressure at the architecture level, and shares the optical-compression encoder lineage R-SWA builds on.