What is Quantitative Goal Persistence (QGP)?

QGP is the ratio of verified work to total tool calls when an agent is asked to collect a fixed number of artifacts. It captures a specific failure mode that does not show up in single-task correctness scores: the agent issues many plausible tool calls but never actually finishes the requested count. A QGP of 1.0 means every tool call advanced the goal; a QGP of 0.2 means four out of five calls were noise. The PushBench paper (Cai et al., May 2026) is the first to formalize this metric and ship a benchmark for it.

Why does QGP matter for production agents?

Frontier models look strong on short tasks and fail in characteristic ways on long ones. PushBench shows Claude Sonnet 4.6 and GPT-5.4 dropping from solid 50-artifact performance to 3 out of 9 successes at 100 artifacts. If you ship an agent for any workflow that batches more than a handful of items — invoice triage, dataset annotation, bulk repository edits — QGP is the number that predicts whether it will survive scale. Single-task correctness scores will not show this failure mode; the agent looks competent until you ask it to count.

How do the state-tracking and backlog-tracking controllers work?

State-tracking maintains a verified inventory and rejects duplicate submissions before they reach the verifier; in the PushBench experiments it lifts QGP into the 69–78% range while eliminating duplicate submissions outright. Backlog-tracking maintains an outstanding-work queue and only allows the agent to terminate when the queue is empty, achieving 25–50% success in settings where the unwrapped agent fails entirely. Both are harness primitives — thin wrappers around the base agent — not model changes, so they layer onto any existing system without retraining. The paper presents them as illustrative; the QGP metric itself is the more lasting contribution.
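The state-tracking idea can be sketched in a few lines. This is a minimal illustration, not PushBench's actual interface: the class name `StateTrackingController`, its `submit` method, and the toy `is_issue` verifier are all hypothetical, and it assumes each artifact has a hashable id and that the verifier is a plain function returning True or False.

```python
from typing import Callable, Hashable, List, Set


class StateTrackingController:
    """Hypothetical harness wrapper: keeps a verified inventory and stops
    duplicate submissions before they reach the verifier."""

    def __init__(self, verifier: Callable[[Hashable], bool]) -> None:
        self._verifier = verifier
        self.inventory: Set[Hashable] = set()  # artifacts already counted
        self.rejected_duplicates = 0           # submissions stopped by the wrapper
        self.verifier_calls = 0                # submissions that reached the verifier

    def submit(self, artifact_id: Hashable) -> str:
        # Duplicate check happens first, so the verifier never sees repeats.
        if artifact_id in self.inventory:
            self.rejected_duplicates += 1
            return "duplicate"
        self.verifier_calls += 1
        if self._verifier(artifact_id):
            self.inventory.add(artifact_id)
            return "accepted"
        return "rejected"


def is_issue(artifact_id: Hashable) -> bool:
    # Toy deterministic verifier (an assumption): only "issue-" ids count.
    return str(artifact_id).startswith("issue-")


controller = StateTrackingController(is_issue)
attempts: List[str] = ["issue-1", "issue-2", "issue-1", "pr-9", "issue-2"]
results = [controller.submit(a) for a in attempts]
print(results)
print(len(controller.inventory), controller.verifier_calls, controller.rejected_duplicates)
```

In this toy run, two of the five submissions are caught by the inventory check and only three reach the verifier. A real harness would also need a notion of artifact identity that survives rephrasing, which this sketch does not attempt.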

PushBench paper — Quantitative Goal Persistence (QGP)

learnaivisually.com/ai-explained/pushbench-qgp

TL;DR

What is it: The PushBench paper formalizes Quantitative Goal Persistence (QGP) — the ratio of verified work to total tool calls when an agent is asked to collect a fixed number of artifacts — and introduces a repository-artifact benchmark plus two reusable harness controllers (state-tracking, backlog-tracking) that lift QGP without touching the underlying model.
Why it’s needed: Frontier agents look strong on short tasks and fall off a cliff at long ones. Claude Sonnet 4.6 and GPT-5.4 succeed comfortably at 50-artifact tasks but drop to 3/9 successes at 100; the failure mode is not "wrong answer" but "many plausible tool calls that never actually finish the requested count." Long-horizon reliability needs a number, and QGP is the one PushBench proposes.
vs previous: Prior agent benchmarks score correctness on a single task; PushBench scores persistence to a verified count. And prior reliability work changed the model or the prompt; PushBench's controllers are harness primitives — a verified-inventory check and a backlog queue — that wrap any base agent and recover most of the lost QGP without retraining anything.

Jargon

QGP: Quantitative Goal Persistence. It measures the gap between "the agent issued many plausible tool calls" and "the agent actually completed the requested count of verified work units." Concretely: verified artifacts ÷ tool calls on a counting task. A QGP of 1.0 means every tool call advanced the goal; 0.2 means four out of five calls were noise.
PushBench: The paper's benchmark suite. An agent is tasked with collecting N verified artifacts from a repository (issues, PRs, files, etc.); a deterministic verifier judges whether each submission counts toward the goal. PushBench reports both success rate at N and QGP.
Verifier: A small deterministic checker that grades each submission as "counts" or "does not count" toward the requested artifact count. Closely related to the verifier in RLVR — the same idea (replace a learned reward model with a checker) reused as a benchmark scoring rule.
State-tracking controller: A harness wrapper that maintains a verified inventory of completed work and rejects any submission that duplicates an already-counted artifact before it is sent to the verifier. It is not a model change — it is a thin layer between the agent's tool calls and the environment.
Backlog-tracking controller: A harness wrapper that maintains an outstanding-work queue — the explicit list of artifacts still missing from the goal — and only allows the agent to terminate when the queue is empty. Pairs naturally with state-tracking.
Over-call vs under-finish: The specific failure pattern PushBench measures. The agent over-calls (issues many tool calls, often re-submitting work it already counted) while under-finishing (the verified count plateaus well below N). One axis is wasted action, the other is wasted commitment.
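The two axes in the last entry can be separated with a small diagnostic. The sketch below is illustrative only: `profile_run`, `duplicate_share`, and `shortfall` are hypothetical names, not metrics defined by the paper, and it assumes every artifact id would pass the verifier the first time it appears.

```python
from typing import Dict, List


def profile_run(submissions: List[str], target_n: int) -> Dict[str, float]:
    """Split one run into an over-call axis and an under-finish axis.

    Simplifying assumption: each distinct id in `submissions` is a valid
    artifact, so only repeats are wasted calls.
    """
    if target_n <= 0:
        raise ValueError("target_n must be positive")
    calls = len(submissions)
    unique = len(set(submissions))
    duplicates = calls - unique
    return {
        # Over-call: fraction of calls that re-submitted already-counted work.
        "duplicate_share": duplicates / calls if calls else 0.0,
        # Under-finish: fraction of the requested count never verified.
        "shortfall": max(target_n - unique, 0) / target_n,
    }


# Toy log: 4 unique artifacts, 6 re-submissions, against a target of 8.
log = ["a", "b", "a", "c", "b", "a", "d", "c", "a", "b"]
profile = profile_run(log, target_n=8)
print(profile)
```

Here 60% of the calls are repeats and half the target is still missing, so the run fails on both axes at once.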

The news. On May 22, 2026, Cai et al. released PushBench, a repository-artifact collection benchmark and the first paper to formalize Quantitative Goal Persistence as a measurable agent property. Frontier models — including Claude Sonnet 4.6 and GPT-5.4 — clear the 50-artifact setting comfortably and then collapse to 3 out of 9 successes at 100 artifacts. The paper's contribution is twofold: a metric that names the failure mode, and two harness-level controllers (state-tracking and backlog-tracking) that recover most of the lost ground without changing the model.

Picture a shopper walking into a grocery store with a checklist of 100 items. After an hour of wandering they emerge with a cart full of something — 142 items grabbed — but only 30 of them check off a unique line on the list. The rest are repeats of the same six aisles. That is not a "wrong answer" — the shopper picked real items, every one of them edible — but it is a particular kind of failure: lots of motion, very little progress against the stated count. Long-horizon agents fail this way constantly, and until PushBench there was no named number for it.

The mechanism is straightforward once the metaphor sits still. PushBench gives the agent a goal of N verified artifacts and a verifier that adjudicates each submission. The agent runs its normal loop — generate a tool call, observe, generate the next tool call — and the benchmark records both the verified count and the total tool calls. QGP is just verified ÷ calls. A model that submits the same three artifacts ten times each has QGP ≈ 0.1; a model that submits each unique artifact once and stops has QGP = 1.0. The paper's headline observation is that under the standard agent loop, frontier models' QGP degrades non-linearly with N — solid at 50, cliff-edge at 100. The shopper does fine when the list has 20 things; given 100 things, they forget what they already grabbed, and the compounding errors over a long rollout eat them alive.
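The ratio described above can be computed directly from a run log. This is a minimal sketch under a simplifying assumption that every tool call is a submission of one artifact id; the function name `run_qgp` and the `accept_all` verifier are hypothetical, and the actual benchmark may count calls differently.

```python
from typing import Callable, Iterable, Set


def run_qgp(tool_calls: Iterable[str], verifier: Callable[[str], bool]) -> float:
    """QGP = verified artifacts / total tool calls for one run.

    Assumption: a repeat of an already-counted artifact adds a call but
    no verified work.
    """
    verified: Set[str] = set()
    calls = 0
    for artifact_id in tool_calls:
        calls += 1
        if artifact_id not in verified and verifier(artifact_id):
            verified.add(artifact_id)
    return len(verified) / calls if calls else 0.0


def accept_all(artifact_id: str) -> bool:
    # Toy verifier (an assumption): every artifact id is valid.
    return True


# The same three artifacts submitted ten times each: 3 verified / 30 calls.
looping = ["a", "b", "c"] * 10
# Each unique artifact submitted once, then stop: 50 verified / 50 calls.
clean = [f"artifact-{i}" for i in range(50)]

looping_qgp = run_qgp(looping, accept_all)
clean_qgp = run_qgp(clean, accept_all)
print(looping_qgp, clean_qgp)
```

The two toy logs reproduce the 0.1 and 1.0 cases from the paragraph above.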

The two controller patterns earn their keep precisely because they are not model changes. State-tracking is the shopper pausing at each aisle to glance at the checklist before reaching for the next can — duplicates get caught at the moment of intent, not at checkout. Backlog-tracking is the shopper refusing to leave the store until the remaining-items column is empty — they cannot satisfy themselves with "I got most of it." These are harness primitives, and they layer on top of any base agent. In the paper's experiments, state-tracking alone lifts QGP into the 69–78% range while eliminating duplicate submissions; backlog-tracking achieves 25–50% success in regimes where the unwrapped agent fails outright. Both controllers wire naturally into the existing harness checkpoint interface — an inventory is just a piece of durable state, and a backlog queue is the same shape as the retry queue you already have.
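The "refuse to leave the store" rule can be sketched as a termination gate. This is an illustration rather than the paper's implementation: `BacklogController`, `mark_verified`, and `request_stop` are hypothetical names, and the sketch assumes the outstanding artifacts can be enumerated up front.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


class BacklogController:
    """Hypothetical harness wrapper: tracks outstanding work and only lets
    the agent terminate once nothing is outstanding."""

    def __init__(self, required: Iterable[str]) -> None:
        self.backlog: Deque[str] = deque(required)  # artifacts still missing

    def mark_verified(self, artifact_id: str) -> None:
        # Called by the harness after the verifier accepts an artifact.
        if artifact_id in self.backlog:
            self.backlog.remove(artifact_id)

    def request_stop(self) -> Tuple[bool, List[str]]:
        # The agent may stop only when the backlog is empty; otherwise the
        # harness hands back the remaining items to keep the agent working.
        remaining = list(self.backlog)
        return (not remaining, remaining)


controller = BacklogController(["issue-1", "issue-2", "issue-3"])
controller.mark_verified("issue-2")
early = controller.request_stop()   # the agent tries to stop too soon
controller.mark_verified("issue-1")
controller.mark_verified("issue-3")
final = controller.request_stop()
print(early)
print(final)
```

Returning the remaining items alongside the refusal is a design choice in this sketch: it gives the agent something concrete to act on instead of a bare "not done yet".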

Where the QGP gap actually comes from

Hold two things fixed: one agent and one 100-artifact PushBench task. Then compare two separate runs: Run A without controllers, and Run B with a state-tracking controller installed from the start. In Run A, the agent issues 142 tool calls; 30 of them submit unique artifacts that the verifier accepts, and the remaining 112 are duplicates, mostly the same six or seven artifacts re-submitted in slightly different phrasings. The arithmetic of the gap is simple: QGP = 30 / 142 ≈ 0.21. In Run B, the same agent on the same task issues 105 tool calls (fewer in total, because the controller rejects duplicate intents before they spend tokens), and 78 of those calls verify new artifacts. QGP = 78 / 105 ≈ 0.74, roughly a 3.5× lift in verified work per call. These numbers are illustrative, calibrated to the paper's 3/9 vs. 69–78% headline figures, and they describe two separate hypothetical runs, not one continuous run with a mid-task controller install. The success-count jump matters, but the per-call yield jump is the deeper story: each tool call now does about 3.5× the goal-advancing work of the baseline.
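The arithmetic for the two illustrative runs can be checked in a few lines. The variable names are hypothetical, and the inputs are the illustrative counts from the paragraph above, not measured results.

```python
from fractions import Fraction

# Illustrative counts from the two hypothetical runs (not measured data).
run_a = Fraction(30, 142)   # Run A: 30 verified artifacts over 142 tool calls
run_b = Fraction(78, 105)   # Run B: 78 verified artifacts over 105 tool calls

lift = run_b / run_a        # per-call yield of Run B relative to Run A
wasted_a = 142 - 30         # calls in Run A that verified nothing new
wasted_b = 105 - 78         # calls in Run B that verified nothing new

print(round(float(run_a), 2), round(float(run_b), 2), round(float(lift), 2))
print(wasted_a, wasted_b)
```

Using `Fraction` keeps the ratio exact until the final rounding, so the lift of about 3.5× is not an artifact of rounding the two QGP values first.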

Setting	Verified / N	Tool calls	QGP
Frontier agent, 50 artifacts	~50 / 50	~60–80	~0.7–0.8 (reported solid)
Frontier agent, 100 artifacts (no controller)	~30 / 100 — 3/9 runs succeed	~140+, varies	~0.20–0.30 (illustrative)
+ state-tracking controller	69–78 / 100 — duplicates eliminated	~100–110 (illustrative)	~0.70–0.78
+ backlog-tracking controller	25–50% of runs succeed (a success rate, not a verified count; the baseline fails outright)	varies by setting	not stated

The deeper lesson is what QGP makes legible. Until you name the metric, "the agent kept calling tools but never finished" reads as a vibes-level complaint. Once you can compute verified ÷ calls on a controlled benchmark, you can A/B harness changes, shadow-route candidate controllers in production, and treat persistence as a first-class number alongside latency and cost. The PushBench authors are explicit that this is the point — the controllers are illustrative; the metric is the contribution. Other harness patterns can be evaluated against the same yardstick, and the long tail of "the agent works fine in demos but degrades in real workflows" finally has a place to land.

A small caveat: PushBench measures counting tasks, and QGP as defined is a goodput-style ratio for goals stated as "collect N artifacts." Tasks where the goal is "do one thing well" or "decide between two options" need their own metrics — QGP is not a universal agent score. The paper is careful about this, and the framing here matches it: PushBench is one yardstick, and an excellent one for the long-horizon failure mode that frontier agents most visibly hit. Whether QGP generalizes to non-counting goals is left as future work.


Related explainers

FutureSim — harness-level agent eval — another harness-side approach to making long-horizon evals tractable
RecMem — subconscious + recurrence-triggered agent memory — model-side analogue to state-tracking
Pantheon-Bench — HITL vs. autonomous coding — another benchmark for long-running coding agents
