Harness-1 — externalizing search state into the harness
The news. On June 1, 2026, Harness-1 (arXiv:2606.02373) introduced a 20B-parameter search agent that separates semantic decision-making from state management. The policy decides what to search, inspect, curate, and verify, and when to stop. A state-externalizing harness holds the working memory (candidate pools, importance-tagged curated sets, evidence links, verification records, and compressed observations) and renders only a budget-bounded slice into the model's context each step. Rather than training over an ever-growing transcript, the agent is trained with reinforcement learning over a structured external workspace. It reports 0.730 average curated recall across 8 retrieval benchmarks (web, finance, patents, multi-hop QA), +11.4 points over the next-strongest open search sub-agent. Read the paper →
Picture a detective working a long case. Every lead, photo, and verified alibi gets pinned to the case-board on the wall and connected with red string — the board is the durable record, and it only ever grows. When the detective walks into an interview, they don't wheel the entire case file into the room; they carry a single index-card briefing with just what this conversation needs. The board stays on the wall; only a briefing walks in. A rookie who instead lugs the whole growing file box into every interview eventually runs out of desk space — that is exactly what happens when a search agent replays its entire transcript into a finite context window.
That is the move Harness-1 makes concrete. The naïve design treats the agent's memory as a growing transcript: every observation, every candidate document, every verification step is concatenated and fed back to the model on the next step. It works for a few steps, then the transcript balloons and the search has to stop, not because the agent ran out of leads but because it ran out of room. Harness-1 instead keeps that durable state in the harness (the case-board) and leaves the policy to make the semantic calls: what to search, inspect, curate, and verify, and when to stop. Each step, the harness performs budget-bounded rendering: it selects a token-bounded slice of the workspace (the briefing) and shows only that to the model. The board can grow to hundreds of items while the briefing stays the same size, so context cost stays roughly flat no matter how deep the search goes. Crucially, the agent is trained with reinforcement learning over this workspace, not over transcripts, so the policy learns the harness skills (curate, importance-tag, verify, compress, stop) as first-class actions.
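To make the board-versus-briefing split concrete, here is a minimal Python sketch of budget-bounded rendering. It is not the paper's implementation: the `Item` and `Workspace` classes, the integer importance tags, the precomputed token counts, and the greedy "most important, then most recent" selection rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    text: str
    importance: int  # hypothetical tag: higher means more important
    tokens: int      # assumed to be counted ahead of time


@dataclass
class Workspace:
    """The case-board: durable state that only ever grows."""
    items: List[Item] = field(default_factory=list)

    def add(self, item: Item) -> None:
        self.items.append(item)

    def render(self, budget: int) -> List[Item]:
        """The briefing: a slice of the board that fits within `budget` tokens.

        Greedy rule (an assumption, not the paper's): consider items from
        most to least important, newest first among ties, and keep each one
        that still fits. The result is returned in insertion order.
        """
        order = sorted(
            range(len(self.items)),
            key=lambda i: (-self.items[i].importance, -i),
        )
        chosen, used = [], 0
        for i in order:
            if used + self.items[i].tokens <= budget:
                chosen.append(i)
                used += self.items[i].tokens
        return [self.items[i] for i in sorted(chosen)]


# Twenty search steps, each adding a 2,000-token observation to the board.
ws = Workspace()
for step in range(20):
    ws.add(Item(text=f"observation {step}", importance=step % 3, tokens=2000))

brief = ws.render(budget=6000)
print(len(ws.items))                  # 20 items on the board
print(sum(it.tokens for it in brief)) # 6000 tokens in the briefing
print([it.text for it in brief])      # observations 11, 14 and 17
```

The board holds all 20 observations, but the rendered slice never exceeds the 6,000-token budget. A trained policy would presumably influence what gets tagged and kept; the fixed greedy rule here only stands in for that.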
Growing transcript vs state-externalizing harness
| Design | What lives in context | Context cost as search deepens | Failure mode |
|---|---|---|---|
| Growing transcript | The full action + observation history, replayed every step | Grows with every step | Overflows the window; the search stalls on length, not leads |
| State-externalizing harness | A budget-bounded slice rendered from the workspace | ~Flat, set by a render budget | A poorly-chosen slice can omit a needed item (mitigated by importance tags + curated recall) |
The two rows describe the contrast Harness-1 draws between transcript-style memory and its externalized workspace; the "budget-bounded slice" claim is from the paper. Token figures in the hero animation are illustrative.
Walk the budget with some round numbers (illustrative). Say each search step adds about 2,000 tokens of fresh observations. Under the growing-transcript design, those tokens never leave: after 8 steps the model is reading roughly 16,000 tokens of history, after 20 steps about 40,000, and a genuinely deep multi-hop search marches straight past a typical working window. Under the state-externalizing harness, those 2,000-token observations land in the workspace, but the model is only ever shown a fixed ~6,000-token render — step 8 and step 20 cost the same 6,000 tokens in context. The accumulated evidence still exists; it just lives on the case-board instead of in the briefing. That is why Harness-1 can keep curating to 0.730 recall across deep benchmarks where a transcript agent would have run out of room — and it's the same lever the agent-engineering track frames as durable state the harness owns, rather than state smeared across a prompt.
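The same round numbers can be checked in a few lines of Python. This is only the illustrative arithmetic from the paragraph above, not a measurement. The function names are hypothetical, and the 32,000-token window passed in the last line is an assumed figure picked for the example, not a number from the paper.

```python
def transcript_tokens(step: int, per_step: int = 2000) -> int:
    """Context cost when every observation is replayed each step."""
    return step * per_step


def harness_tokens(step: int, per_step: int = 2000, render_budget: int = 6000) -> int:
    """Context cost when only a budget-bounded render is shown."""
    return min(step * per_step, render_budget)


def first_overflow_step(window: int, per_step: int = 2000) -> int:
    """First step at which the replayed transcript exceeds `window` tokens."""
    return window // per_step + 1


for step in (8, 20):
    print(step, transcript_tokens(step), harness_tokens(step))
# 8 16000 6000
# 20 40000 6000

# Assumed 32,000-token window, for illustration only.
print(first_overflow_step(32000))  # 17
```

The transcript cost is linear in the number of steps, so any fixed window is eventually exceeded; the rendered cost is capped by the budget regardless of depth.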
It lands as a sharp companion to the recent push on how search agents act — GrepSeek learns a better action space (shell commands over a corpus), while Harness-1 learns a better state substrate (an externalized workspace). Same RL-trained-search-agent family, orthogonal levers. As the work frames it, the model should make the semantic calls and the harness should own the memory — a clean division that the standard fixes for an overflowing context have been circling, now learned end-to-end.
Goes deeper in: AI Agents → The Agent Loop & State → The Anatomy of a Harness
Related explainers
- GrepSeek — training a shell-command search agent — the other lever: learning the search action space instead of the state substrate.
- Is Grep All You Need? — grep vs vector retrieval — empirical evidence that harness design dominates the retrieval algorithm.
- RecMem — subconscious + recurrence-triggered memory — another way to keep durable agent memory off the live context.