PushBench — Quantitative Goal Persistence (QGP)

[Interactive figure: PushBench, collect 100 verified artifacts. Compares an agent without a controller against an agent with a controller installed, showing verified count (out of 100), tool calls, and QGP (verified ÷ calls). State-tracking: a verified inventory rejects duplicate submissions. Backlog-tracking: an outstanding queue, terminate only when empty. Source: learnaivisually.com/ai-explained/pushbench-qgp]

The news. On May 22, 2026, Cai et al. released PushBench, a repository-artifact collection benchmark and the first paper to formalize Quantitative Goal Persistence as a measurable agent property. Frontier models — including Claude Sonnet 4.6 and GPT-5.4 — clear the 50-artifact setting comfortably and then collapse to 3 out of 9 successes at 100 artifacts. The paper's contribution is twofold: a metric that names the failure mode, and two harness-level controllers (state-tracking and backlog-tracking) that recover most of the lost ground without changing the model.

Picture a shopper walking into a grocery store with a checklist of 100 items. After an hour of wandering they emerge with a cart full of something — 142 items grabbed — but only 30 of them check off a unique line on the list. The rest are repeats of the same six aisles. That is not a "wrong answer" — the shopper picked real items, every one of them edible — but it is a particular kind of failure: lots of motion, very little progress against the stated count. Long-horizon agents fail this way constantly, and until PushBench there was no named number for it.

The mechanism is simple once the metaphor is in place. PushBench gives the agent a goal of N verified artifacts and a verifier that adjudicates each submission. The agent runs its normal loop (generate a tool call, observe, generate the next tool call), and the benchmark records both the verified count and the total number of tool calls. QGP is just verified ÷ calls. A model that submits the same three artifacts ten times each has 3 verified artifacts across 30 calls, so QGP = 0.1; a model that submits each unique artifact once and stops has QGP = 1.0. The paper's headline observation is that under the standard agent loop, frontier models' QGP degrades non-linearly with N: solid at 50, cliff-edge at 100. The shopper does fine when the list has 20 things; given 100 things, they forget what they already grabbed, and the compounding errors over a long rollout eat them alive.
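The ratio described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own scoring code: the function name `qgp`, the `is_valid` verifier callback, and the assumption that every submission costs exactly one tool call are all hypothetical simplifications.

```python
def qgp(submissions, is_valid):
    """Unique verified artifacts divided by total tool calls.

    Assumption: each submission is one tool call, and the verifier
    credits a given artifact at most once.
    """
    verified = set()
    calls = 0
    for artifact in submissions:
        calls += 1
        if is_valid(artifact):
            verified.add(artifact)  # a repeat of a verified artifact adds nothing
    return len(verified) / calls if calls else 0.0


always_valid = lambda artifact: True          # stand-in verifier for the demo
repeats = ["a", "b", "c"] * 10                # three artifacts, ten times each
unique = [f"artifact-{i}" for i in range(100)]

print(qgp(repeats, always_valid))  # 0.1
print(qgp(unique, always_valid))   # 1.0
```

The two printed values correspond to the two cases in the text: 3 verified over 30 calls, and 100 verified over 100 calls.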

The two controller patterns earn their keep precisely because they are not model changes. State-tracking is the shopper pausing at each aisle to glance at the checklist before reaching for the next can, so duplicates get caught at the moment of intent, not at checkout. Backlog-tracking is the shopper refusing to leave the store until the remaining-items column is empty; they cannot settle for "I got most of it." These are harness primitives, and they layer on top of any base agent. In the paper's experiments, state-tracking alone lifts QGP into the 69–78% range while eliminating duplicate submissions; backlog-tracking achieves 25–50% success in regimes where the unwrapped agent fails outright. Both controllers wire naturally into the existing harness checkpoint interface: an inventory is just a piece of durable state, and a backlog queue is the same shape as the retry queue you already have.
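To make the two patterns concrete, here is a rough sketch of what such controllers might look like in a harness. Everything in it is an assumption for illustration: the class names, the `allow`/`record`/`requeue` methods, and the replay-style loops are hypothetical and are not taken from the paper's implementation.

```python
from collections import deque


class StateTrackingController:
    """Verified inventory: blocks an artifact that has already been verified."""

    def __init__(self):
        self.inventory = set()

    def allow(self, artifact):
        return artifact not in self.inventory

    def record(self, artifact):
        self.inventory.add(artifact)


class BacklogTrackingController:
    """Outstanding queue: the run may end only when the queue is empty."""

    def __init__(self, targets):
        self.backlog = deque(targets)

    def next_target(self):
        return self.backlog.popleft()

    def requeue(self, target):
        self.backlog.append(target)

    def may_terminate(self):
        return not self.backlog


def run_with_state_tracking(proposals, verify, controller=None):
    """Replay proposed submissions; return (verified_count, tool_calls)."""
    verified, calls = set(), 0
    for artifact in proposals:
        if controller is not None and not controller.allow(artifact):
            continue  # duplicate intent rejected before a tool call is spent
        calls += 1
        if verify(artifact):
            verified.add(artifact)
            if controller is not None:
                controller.record(artifact)
    return len(verified), calls


def run_with_backlog(targets, attempt, max_calls):
    """Work the queue until it is empty or the call budget runs out."""
    controller = BacklogTrackingController(targets)
    calls = 0
    while not controller.may_terminate() and calls < max_calls:
        target = controller.next_target()
        calls += 1
        if not attempt(target):
            controller.requeue(target)  # still outstanding, so retry later
    return controller.may_terminate(), calls


proposals = ["a", "b", "a", "c", "b", "a"]
accept_all = lambda artifact: True
print(run_with_state_tracking(proposals, accept_all))                             # (3, 6)
print(run_with_state_tracking(proposals, accept_all, StateTrackingController()))  # (3, 3)
```

In this toy replay the verified count is the same with and without the inventory, but the call count halves, which is the per-call effect the text describes. The backlog loop differs in kind: it does not make calls cheaper, it only refuses to stop while anything is outstanding, so a call budget (`max_calls` here) is needed to keep it from running forever.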

Where the QGP gap actually comes from

Hold two things fixed, one agent and one 100-artifact PushBench task, and compare two separate experimental runs: Run A without controllers, and Run B with a state-tracking controller installed from the start. In Run A, the agent issues 142 tool calls; 30 of them submit unique artifacts that the verifier accepts, and the remaining 112 are duplicates of the first 30, the same six or seven artifacts re-submitted in slightly different phrasings. The arithmetic of the gap is simple: QGP = 30 / 142 ≈ 0.21. In Run B, the same agent on the same task issues 105 tool calls (fewer in total, because the controller rejects duplicate intents before they spend tokens), and 78 of those calls verify new artifacts. QGP = 78 / 105 ≈ 0.74, roughly a 3.5× lift in signal density per call. These are illustrative numbers calibrated to the paper's 3/9 vs. 69–78% headline figures, and they describe two separate hypothetical runs, not one continuous run with a mid-task controller install. The success-count jump matters, but the per-call yield jump is the deeper story: each tool call now carries about 3.5× the goal-advancing information of the baseline.
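Written out, the arithmetic for the two illustrative runs is as follows. The subscripts A and B are labels introduced here for the runs without and with the state-tracking controller.

```latex
\[
\mathrm{QGP}_A = \frac{30}{142} \approx 0.21,
\qquad
\mathrm{QGP}_B = \frac{78}{105} \approx 0.74,
\qquad
\frac{\mathrm{QGP}_B}{\mathrm{QGP}_A} \approx \frac{0.743}{0.211} \approx 3.5
\]
```

The duplicate count in Run A follows the same way: 142 − 30 = 112 calls that verified nothing new.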

| Setting | Verified / N | Tool calls | QGP |
| --- | --- | --- | --- |
| Frontier agent, 50 artifacts | ~50 / 50 | ~60–80 | ~0.7–0.8 (reported solid) |
| Frontier agent, 100 artifacts (no controller) | ~30 / 100 (3/9 runs succeed) | ~140+, varies | ~0.20–0.30 (illustrative) |
| + state-tracking controller | 69–78 / 100 (duplicates eliminated) | ~100–110 (illustrative) | ~0.70–0.78 |
| + backlog-tracking controller | 25–50 / 100 (works where baseline fails outright) | varies by setting | ~0.25–0.50 (illustrative) |

The deeper lesson is what QGP makes legible. Until you name the metric, "the agent kept calling tools but never finished" reads as a vibes-level complaint. Once you can compute verified ÷ calls on a controlled benchmark, you can A/B harness changes, shadow-route candidate controllers in production, and treat persistence as a first-class number alongside latency and cost. The PushBench authors are explicit that this is the point — the controllers are illustrative; the metric is the contribution. Other harness patterns can be evaluated against the same yardstick, and the long tail of "the agent works fine in demos but degrades in real workflows" finally has a place to land.

A small caveat: PushBench measures counting tasks, and QGP as defined is a goodput-style ratio for goals stated as "collect N artifacts." Tasks where the goal is "do one thing well" or "decide between two options" need their own metrics — QGP is not a universal agent score. The paper is careful about this, and the framing here matches it: PushBench is one yardstick, and an excellent one for the long-horizon failure mode that frontier agents most visibly hit. Whether QGP generalizes to non-counting goals is left as future work.
