What is the AgenticSTS memory contract?

It is a way to give a long-horizon agent memory without appending its transcript to every prompt. Each decision's prompt is assembled as one fresh message via typed retrieval: the agent reads only what that turn is "allowed to see," pulled by type from separate memory stores. AgenticSTS (arXiv 2607.02255, July 2026) frames this as a contract: memory is a rule about what each future decision may access, which keeps the prompt bounded across a run of any length.

Why is a bounded, typed memory better than appending the transcript?

Two reasons. First, an appended transcript grows every turn, so the prompt cost scales with how long the agent has run; typed retrieval rebuilds the prompt inside a fixed budget, so it stays flat. Second, an appended transcript fuses all memory into one input you cannot take apart, while typed retrieval keeps each memory layer in its own store — so you can turn one layer off, re-run, and measure exactly what it was worth. AgenticSTS releases 298 condition-tagged trajectories to make those ablations reproducible.

Does AgenticSTS prove the contract makes agents win more?

Not yet, and the paper is careful about this. Its testbed is Slay the Spire 2, where a public benchmark reports frontier models winning 0 games at the lowest difficulty against a 16% human win rate. The headline gameplay improvement, a no-store baseline going from 3 to 6 wins out of 10 with a skill layer enabled, is directional only: a Fisher exact test puts it at roughly p ≈ 0.37, well within noise. The real contribution is the memory-contract framing and a reproducible, ablatable testbed, not a decisive performance gain.

AgenticSTS tests long-horizon agent memory — Bounded-memory contract via typed retrieval

TL;DR

What is it: AgenticSTS is a new testbed and memory design for long-horizon agents that reframes what memory even is: a bounded-memory contract where each decision is built from typed retrieval into one fresh message, instead of a growing transcript pasted back every turn.
Why it’s needed: It turns memory from a vague pile into "what each future decision is allowed to see" — which keeps the prompt bounded at any run length and lets you remove one memory layer at a time to measure exactly what it was worth, the core move of Context Engineering and agent Observability.
vs previous: The default is appending the raw transcript — every observation, tool call, and reflection glued onto every prompt. That grows without bound and fuses all memory together, so you can never tell which part actually helped; the contract design fixes both.

Jargon

Long-horizon agent: An agent that acts over many decisions toward a distant goal — not a single question-and-answer turn. The longer the run, the more its accumulated history threatens to overflow the prompt. See AI Agents → Agent Loop & State.
Typed retrieval: Assembling a turn's prompt by pulling specific kinds of memory — facts, past moves, strategy notes — from separate stores, rather than concatenating the raw log. Each "type" is retrieved on demand into a fresh message.
The memory contract: The paper's framing: memory is "a contract about what each future decision is allowed to see." A decision reads only what typed retrieval places in that turn's message — nothing is inherited by default.
Bounded context: A prompt whose size does not grow with the length of the run — it stays inside a fixed retrieval budget. The opposite of an appended transcript, which grows every turn. Related: context as a scarce resource.
Memory-layer ablation: Turning one memory source off and re-running to see how much it mattered. Only possible when layers are separable — which the contract design makes them, and an appended transcript does not.
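Fisher exact test: A statistical test for whether the gap between two small counts (like 3 vs 6 wins out of 10) is larger than chance alone would plausibly produce. A p-value near 0.37 means a gap at least this large would appear about 37% of the time even if the change had no effect, so it "could easily be noise": a directional hint, not proof.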

The news. On July 2, 2026, researchers posted AgenticSTS to arXiv. Its subject is memory for long-horizon agents, and its move is to stop appending the raw transcript of past observations, tool calls, and reflections to every prompt. Instead, each decision is assembled from a fresh message via typed retrieval, which keeps the prompt bounded across runs of any length and lets any single memory layer be tested in isolation. It runs on Slay the Spire 2, a closed-rule stochastic deck-builder where a public benchmark of frontier models reports 0 wins at the lowest difficulty while humans win 16%.

Picture a long investigation with a new officer deciding the next move at every step. The obvious way to keep them informed is to hand each one the entire case file — every interview, every note, everything that ever happened. It works for a while, but the file only grows: by the hundredth decision it is a phone book, expensive to carry and impossible to skim. And when a decision goes wrong, no one can say which page mattered, because it is all fused into one stack. The fix is not a thicker file but a records clerk who briefs each decision from a fresh, thin folder holding only what that decision is allowed to see. In AgenticSTS the officer is one agent decision, the swelling case file is the appended transcript, and the clerk's need-to-know folder is typed retrieval governed by a memory contract.

That contract is the whole reframe. Instead of asking "what has happened so far?" and pasting the answer in, the agent asks a narrower question every turn: what is this decision allowed to see? Context is a scarce resource, and the default of gluing the full history onto every prompt spends it recklessly; the appended transcript is the clearest case of the context bloat that Context Engineering's four fixes exist to prevent. Typed retrieval takes that remedy to its logical end: the prompt is rebuilt from scratch each turn out of separately stored memory types, so its size tracks a fixed retrieval budget instead of the run's length.
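To make the rebuild-from-scratch idea concrete, here is a minimal sketch of typed retrieval into a bounded message. It is not the paper's implementation: the `MemoryStore` class, the layer names, the word-overlap relevance score, and the whitespace token counter are all assumptions chosen to keep the example self-contained.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """One typed memory layer (hypothetical names such as 'facts' or 'past_moves')."""
    kind: str
    entries: list[str] = field(default_factory=list)

    def retrieve(self, query: str, limit: int) -> list[str]:
        # Toy relevance (an assumption): number of words shared with the query,
        # with ties broken in favour of the most recently stored entry.
        words = set(query.lower().split())
        scored = [(len(words & set(e.lower().split())), i, e)
                  for i, e in enumerate(self.entries)]
        scored.sort(key=lambda t: (-t[0], -t[1]))
        return [e for _, _, e in scored[:limit]]


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def build_message(observation: str, stores: list[MemoryStore],
                  budget: int, per_type: int = 3) -> str:
    """Rebuild the prompt for one decision; nothing carries over from earlier turns.

    Assumes the observation itself fits inside the budget.
    """
    parts = [f"[observation] {observation}"]
    used = count_tokens(parts[0])
    for store in stores:
        for entry in store.retrieve(observation, per_type):
            line = f"[{store.kind}] {entry}"
            cost = count_tokens(line)
            if used + cost > budget:
                continue  # skip anything that would exceed the budget
            parts.append(line)
            used += cost
    return "\n".join(parts)


facts = MemoryStore("facts", [f"enemy {i} deals {i} damage" for i in range(500)])
moves = MemoryStore("past_moves", [f"turn {i} played strike on enemy {i}" for i in range(500)])
message = build_message("enemy 7 attacks", [facts, moves], budget=40)
print(message)
print(count_tokens(message), "tokens")
```

The stores can hold 500 entries or 50,000; the message size depends only on `budget` and `per_type`, which is the sense in which the prompt is bounded.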

The second payoff is subtler and, for anyone shipping agents, the bigger one. When memory is one appended blob, you cannot answer "which memory made the difference?" — it is all one input. When memory is a shelf of typed folders, you can pull one folder and re-run. That is memory-layer ablation, and it turns a vague "our agent remembers things" into a measurable claim about each layer. AgenticSTS ships 298 completed, condition-tagged trajectories with frozen memory and skill snapshots precisely so those ablations are reproducible.
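A leave-one-out ablation over separable layers can be sketched as follows. The layer names, the `evaluate` callback, and the `toy_evaluate` stand-in are hypothetical; the toy scorer is a placeholder wired so that only one layer matters, not a result from the paper or its released trajectories.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Iterator, Sequence, Tuple

LAYERS = ("facts", "past_moves", "strategy")  # hypothetical layer names


def ablation_conditions(layers: Sequence[str]) -> Iterator[Tuple[str, FrozenSet[str]]]:
    """Yield (condition tag, enabled layers): the full set, then each leave-one-out variant."""
    yield "full", frozenset(layers)
    for off in layers:
        yield f"no_{off}", frozenset(layers) - {off}


def run_ablation(evaluate: Callable[[FrozenSet[str], int], int],
                 layers: Sequence[str], seeds: Iterable[int]) -> Dict[str, int]:
    """Count wins per condition on the same seeds, so only the removed layer differs."""
    seed_list = list(seeds)
    return {tag: sum(evaluate(enabled, seed) for seed in seed_list)
            for tag, enabled in ablation_conditions(layers)}


def layer_worth(results: Dict[str, int]) -> Dict[str, int]:
    """Wins lost when a layer is switched off, relative to the full configuration."""
    return {tag[len("no_"):]: results["full"] - wins
            for tag, wins in results.items() if tag != "full"}


def toy_evaluate(enabled: FrozenSet[str], seed: int) -> int:
    # Placeholder for a real episode runner: wins only on even seeds
    # and only when the 'strategy' layer is enabled.
    return int("strategy" in enabled and seed % 2 == 0)


results = run_ablation(toy_evaluate, LAYERS, range(10))
print(results)
print(layer_worth(results))
```

In a real harness `evaluate` would run one full episode with only the enabled stores wired into retrieval. The harness itself only requires that each layer can be switched off independently, which an appended transcript cannot offer.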

How the agent gets its memory	Prompt size over a long run	Can you ablate one layer?
Append the raw transcript (default)	grows every turn — unbounded (illustrative)	no — memory is one fused blob
Summarize-and-append	slower growth, still rising (illustrative)	partly — the summary hides the sources
Typed retrieval, per decision (AgenticSTS)	bounded — tracks a fixed retrieval budget	yes — each type is a separate store

What the bounded prompt buys you

Here is the growth made concrete. Say each decision adds roughly 800 tokens of observation, tool call, and reflection to the log — an illustrative figure the paper does not quote. Under the append-the-transcript default, decision 200 carries the whole history, about 200 × 800 = 160,000 tokens, and every turn re-pays for that entire prefix. Under the contract, the same decision is rebuilt from typed retrieval inside a fixed budget of, say, 2,000 tokens — so its prompt is roughly 2,000 tokens, not 160,000, and it stays flat as the run gets longer. The append design's cost scales with how long the agent has been running; the contract design's cost does not, which is the entire point of calling the prompt "bounded." (Only the 0-wins / 16%-human testbed figures and the 298 released trajectories come from the paper; the token counts here are illustrative.)
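The same arithmetic as a short script, which also adds up the total tokens paid over the whole run. The constants are the illustrative figures used above (800 tokens per decision, a 2,000-token budget); neither comes from the paper, and the function names are made up for this sketch.

```python
TOKENS_PER_DECISION = 800  # illustrative log growth per decision, not from the paper
RETRIEVAL_BUDGET = 2_000   # illustrative fixed budget, not from the paper


def appended_prompt_tokens(decision: int) -> int:
    """Prompt size at a given decision when the whole transcript is appended."""
    return decision * TOKENS_PER_DECISION


def contract_prompt_tokens(decision: int) -> int:
    """Upper bound on prompt size under the contract: the budget, whatever the turn."""
    return RETRIEVAL_BUDGET


def cumulative_tokens(prompt_size, decisions: int) -> int:
    """Total prompt tokens paid across decisions 1..decisions."""
    return sum(prompt_size(n) for n in range(1, decisions + 1))


print(appended_prompt_tokens(200))                     # 160000
print(contract_prompt_tokens(200))                     # 2000
print(cumulative_tokens(appended_prompt_tokens, 200))  # 16080000
print(cumulative_tokens(contract_prompt_tokens, 200))  # 400000
```

Summed over the run, the appended design pays 800 × (1 + 2 + ... + 200) = 16,080,000 tokens, which grows quadratically with run length, while the contract design pays at most 200 × 2,000 = 400,000, which grows linearly.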

Does the contract actually make agents play better? Honestly, the paper does not yet show that. Its headline gameplay result — a fixed-start ablation with strategic skills enabled moving a no-store baseline from 3 out of 10 to 6 out of 10 wins — sounds strong but is directional only: at that sample size a Fisher exact test gives p ≈ 0.37, i.e. easily explainable by luck. The contribution is the framework and the reproducible testbed, not a proven win — which is exactly why the ablatable, condition-tagged design matters more than the single reported score.
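The p-value can be checked with a few lines of standard-library Python. This is a sketch of a two-sided Fisher exact test for a 2×2 table, using the common convention of summing every outcome no more likely than the observed one; the function name and argument layout are assumptions for this example, and the paper may have used a different tool.

```python
from math import comb


def fisher_exact_two_sided(wins_a: int, n_a: int, wins_b: int, n_b: int) -> float:
    """Two-sided Fisher exact p-value for wins_a/n_a versus wins_b/n_b."""
    total = n_a + n_b
    total_wins = wins_a + wins_b
    total_losses = total - total_wins
    denom = comb(total, n_a)

    def prob(k: int) -> float:
        # Hypergeometric probability that group A holds exactly k of the wins.
        return comb(total_wins, k) * comb(total_losses, n_a - k) / denom

    observed = prob(wins_a)
    lowest = max(0, n_a - total_losses)
    highest = min(n_a, total_wins)
    # Sum every table that is no more likely than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= observed * (1 + 1e-9))


p = fisher_exact_two_sided(3, 10, 6, 10)
print(f"{p:.3f}")  # prints 0.370
```

With only ten runs per condition, even doubling the win count leaves more than a one-in-three chance of seeing a gap this large with no real effect, which is why the result is best read as directional.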


Related explainers

AtomMem — atomic-fact memory — a memory architecture that stores separable facts; AgenticSTS is the evaluation contract that would let you ablate such a store
EvoMem — patch-based agent memory — another take on structured, editable agent memory rather than a raw appended log
ContextRL — contrastive context selection — learning which context to retrieve per step, the natural next question once memory is typed and bounded


