The news. On July 2, 2026, researchers posted AgenticSTS to arXiv. Its subject is memory for long-horizon agents, and its move is to stop appending the raw transcript of past observations, tool calls, and reflections to every prompt. Instead, each decision gets a fresh message assembled via typed retrieval, which keeps the prompt bounded across runs of any length and lets any single memory layer be tested in isolation. The testbed is Slay the Spire 2, a closed-rule stochastic deck-builder on which a public benchmark of frontier models reports 0 wins at the lowest difficulty, against a 16% human win rate.

Picture a long investigation with a new officer deciding the next move at every step. The obvious way to keep them informed is to hand each one the entire case file — every interview, every note, everything that ever happened. It works for a while, but the file only grows: by the hundredth decision it is a phone book, expensive to carry and impossible to skim. And when a decision goes wrong, no one can say which page mattered, because it is all fused into one stack. The fix is not a thicker file but a records clerk who briefs each decision from a fresh, thin folder holding only what that decision is allowed to see. In AgenticSTS the officer is one agent decision, the swelling case file is the appended transcript, and the clerk's need-to-know folder is typed retrieval governed by a memory contract.

That contract is the whole reframe. Instead of asking "what has happened so far?" and pasting the answer in, the agent asks a narrower question every turn: what is this decision allowed to see? Context is a scarce resource, and the default of gluing the full history onto every prompt spends it recklessly: the appended transcript is the clearest case of the context bloat that context engineering's four fixes exist to prevent. Typed retrieval takes that kind of fix to its logical end: the prompt is rebuilt from scratch each turn out of separately stored memory types, so its size tracks a fixed retrieval budget instead of the run's length.
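A minimal sketch of what per-decision assembly could look like, not the paper's implementation. The layer names (`episodic`, `skills`), the word-overlap relevance score, the whitespace token count, and the contract-as-dictionary shape are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    # Assumption: whitespace words stand in for real tokenizer counts.
    return len(text.split())


@dataclass
class MemoryStore:
    """One typed memory layer. The layer names used below are hypothetical."""
    name: str
    entries: list[str] = field(default_factory=list)

    def retrieve(self, query: str, max_tokens: int) -> list[str]:
        # Toy relevance score: number of words shared with the query.
        query_words = set(query.lower().split())
        ranked = sorted(
            self.entries,
            key=lambda e: len(query_words & set(e.lower().split())),
            reverse=True,
        )
        picked, used = [], 0
        for entry in ranked:
            cost = count_tokens(entry)
            if used + cost <= max_tokens:  # never exceed this layer's budget
                picked.append(entry)
                used += cost
        return picked


def build_prompt(state: str, stores: list[MemoryStore], contract: dict[str, int]) -> str:
    """Assemble a fresh prompt; `contract` maps layer name -> token budget."""
    sections = [f"STATE:\n{state}"]
    for store in stores:
        budget = contract.get(store.name, 0)  # absent from the contract => not visible
        if budget <= 0:
            continue
        hits = store.retrieve(state, budget)
        if hits:
            sections.append(f"{store.name.upper()}:\n" + "\n".join(hits))
    return "\n\n".join(sections)


stores = [
    MemoryStore("episodic", [f"turn {i}: took damage from an elite" for i in range(1000)]),
    MemoryStore("skills", ["block first when the enemy intends a heavy attack"]),
]
contract = {"episodic": 30, "skills": 30}
prompt = build_prompt("enemy intends a heavy attack", stores, contract)
print(count_tokens(prompt))
```

However many entries the episodic store accumulates, the prompt cannot exceed the sum of the per-layer budgets plus the current state and section headers, which is the sense in which its size tracks the budget and not the run length.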

The second payoff is subtler and, for anyone shipping agents, the bigger one. When memory is one appended blob, you cannot answer "which memory made the difference?" — it is all one input. When memory is a shelf of typed folders, you can pull one folder and re-run. That is memory-layer ablation, and it turns a vague "our agent remembers things" into a measurable claim about each layer. AgenticSTS ships 298 completed, condition-tagged trajectories with frozen memory and skill snapshots precisely so those ablations are reproducible.
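The "pull one folder and re-run" idea can be sketched as a small harness. This is an illustration under assumptions, not the released tooling: the contract is modeled as a hypothetical mapping from layer name to token budget, and `play_episode` is a placeholder callable that the reader would supply to run one seeded episode and report a win or loss.

```python
from typing import Callable

Contract = dict[str, int]  # hypothetical shape: layer name -> token budget


def ablation_conditions(full: Contract) -> dict[str, Contract]:
    """Return the full contract plus one tagged variant per removed layer."""
    conditions = {"full": dict(full)}
    for layer in full:
        conditions[f"no_{layer}"] = {k: v for k, v in full.items() if k != layer}
    return conditions


def run_ablation(
    full: Contract,
    play_episode: Callable[[Contract, int], bool],
    seeds: list[int],
) -> dict[str, float]:
    """Win rate per condition tag, reusing the same seeds for every condition."""
    results = {}
    for tag, contract in ablation_conditions(full).items():
        wins = sum(play_episode(contract, seed) for seed in seeds)
        results[tag] = wins / len(seeds)
    return results
```

Holding the seeds fixed across conditions means any difference in win rate is attributable to the removed layer and to residual sampling noise, not to a different draw of starting states.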

| How the agent gets its memory | Prompt size over a long run | Can you ablate one layer? |
| --- | --- | --- |
| Append the raw transcript (default) | grows every turn, unbounded (illustrative) | no: memory is one fused blob |
| Summarize-and-append | slower growth, still rising (illustrative) | partly: the summary hides the sources |
| Typed retrieval, per decision (AgenticSTS) | bounded: tracks a fixed retrieval budget | yes: each type is a separate store |

What the bounded prompt buys you

Here is the growth made concrete. Say each decision adds roughly 800 tokens of observation, tool call, and reflection to the log — an illustrative figure the paper does not quote. Under the append-the-transcript default, decision 200 carries the whole history, about 200 × 800 = 160,000 tokens, and every turn re-pays for that entire prefix. Under the contract, the same decision is rebuilt from typed retrieval inside a fixed budget of, say, 2,000 tokens — so its prompt is roughly 2,000 tokens, not 160,000, and it stays flat as the run gets longer. The append design's cost scales with how long the agent has been running; the contract design's cost does not, which is the entire point of calling the prompt "bounded." (Only the 0-wins / 16%-human testbed figures and the 298 released trajectories come from the paper; the token counts here are illustrative.)
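The same arithmetic, written out as a short script. The 800-token and 2,000-token figures are the illustrative values from the paragraph above, not numbers from the paper, and the model is deliberately simplified: the bounded prompt is treated as exactly the retrieval budget.

```python
TOKENS_PER_DECISION = 800  # illustrative log growth per decision, not from the paper
RETRIEVAL_BUDGET = 2_000   # illustrative fixed budget, not from the paper
DECISIONS = 200


def append_prompt_tokens(decision: int) -> int:
    # Appended transcript: the prompt carries every prior decision's log.
    return decision * TOKENS_PER_DECISION


def bounded_prompt_tokens(decision: int) -> int:
    # Typed retrieval: prompt size is capped by the budget at every decision.
    return RETRIEVAL_BUDGET


# Cumulative tokens processed over the run, since each turn re-pays its prefix.
append_total = sum(append_prompt_tokens(d) for d in range(1, DECISIONS + 1))
bounded_total = sum(bounded_prompt_tokens(d) for d in range(1, DECISIONS + 1))

print(append_prompt_tokens(DECISIONS), bounded_prompt_tokens(DECISIONS))
print(append_total, bounded_total)
```

With these inputs the prompt at decision 200 is 160,000 tokens versus 2,000, and the cumulative totals over the run are 16,080,000 versus 400,000. The per-prompt gap grows linearly with run length, while the cumulative gap grows quadratically.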

Does the contract actually make agents play better? The paper does not yet show that. Its headline gameplay result sounds strong: in a fixed-start ablation, enabling strategic skills moves a no-store baseline from 3 wins out of 10 to 6 out of 10. But it is directional only. At that sample size a two-sided Fisher exact test gives p ≈ 0.37, so the gap is easily explained by luck. The contribution is the framework and the reproducible testbed, not a proven win, which is exactly why the ablatable, condition-tagged design matters more than the single reported score.
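The p-value can be checked by hand with a small standard-library implementation of the two-sided Fisher exact test. The 2×2 table below assumes 10 runs per condition, as the "out of 10" figures imply; the function name is our own.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability of k successes in row 1, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as unlikely as the observed one.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= observed * (1 + 1e-9))


# Rows: baseline (3 wins, 7 losses) and skills enabled (6 wins, 4 losses).
p_value = fisher_exact_two_sided(3, 7, 6, 4)
print(round(p_value, 2))  # prints 0.37
```

A result this far from conventional significance thresholds says only that ten runs per arm cannot separate the two conditions, not that the skills layer has no effect.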
