The news. On June 12, 2026, Artificial Analysis released AgentPerf, billed as the industry's first benchmark for agentic-AI infrastructure. Rather than single chat completions, it replays real coding-agent trajectories (file reads, code execution, iteration) across 12+ programming languages, and scores how many concurrent agents a system sustains under a per-token SLO, normalized by power. NVIDIA reports that its GB300 NVL72 serves up to 20× more agents per megawatt than an HGX H200 system, running DeepSeek V4 Pro and measured at SLOs of both 20 and 60 tokens/sec.
Picture the fuel-economy sticker on a new car. The number on the window isn't a quarter-mile drag time: a single sprint down an empty straight tells you almost nothing about the commute you'll actually drive. The figure drivers care about, miles per gallon, comes from a dynamometer replaying a recorded city drive cycle: stop, go, idle, accelerate, the messy real pattern. A single chat completion is that sprint. An agent's run is the drive cycle. AgentPerf is the dyno.
The reason the distinction matters is that an agent run looks nothing like one prompt-and-reply. It is a long loop of model calls interleaved with tool executions — read a file, run the code, look at the failure, edit, try again — many steps to finish a single task. That load is bursty and stateful: the context grows with every step, leaning hard on KV-cache reuse, decode comes in stop-go spurts, and many such runs land on the system at once. A benchmark that sends one prompt and reports peak tokens per second is timing the sprint, not the commute.
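To see why a growing context leans so hard on KV-cache reuse, here is a minimal sketch that counts how many tokens a server has to prefill over one agent run, with and without reusing the cached prefix. The function name `prefill_tokens` and every token count are hypothetical, chosen only for illustration; none of them come from AgentPerf.

```python
# Illustrative only: the token counts below are made up, not AgentPerf data.

def prefill_tokens(step_sizes: list[int], reuse_cache: bool) -> int:
    """Total tokens the server must prefill across one agent run.

    step_sizes[i] is the number of new tokens appended to the context
    before model call i (system prompt, tool output, earlier replies).
    """
    total = 0
    context = 0
    for new_tokens in step_sizes:
        context += new_tokens
        # With reuse, only the new suffix is processed. Without it, the
        # whole context is processed again on every model call.
        total += new_tokens if reuse_cache else context
    return total


# Hypothetical run: 4,000 tokens of system prompt and tool definitions,
# then nine steps that each append 1,000 tokens.
steps = [4000] + [1000] * 9

without_reuse = prefill_tokens(steps, reuse_cache=False)
with_reuse = prefill_tokens(steps, reuse_cache=True)
print(without_reuse, with_reuse)  # 85000 13000
```

Under these assumed sizes, a ten-step run prefills about 6.5× more tokens without reuse, and the gap widens as runs get longer. A one-prompt benchmark has no earlier steps to reuse, so it cannot show this difference.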
So AgentPerf replays recorded coding-agent trajectories and asks a different question: how many agents can the system keep at or above a per-token speed floor at the same time? That is a goodput measurement: it counts only the agents actually holding the SLO, not raw token throughput, and then divides by the power the rack draws. The unit that falls out, agents per megawatt, is miles-per-gallon for an inference fleet: useful work per unit of energy.
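The shape of that calculation can be sketched in a few lines. This is an assumption about how such a score could be computed, not AgentPerf's published scoring method; the function name `agents_per_megawatt`, the per-agent speeds, and the power draw are all hypothetical.

```python
# A guess at the shape of the metric, not AgentPerf's actual scoring code.

def agents_per_megawatt(agent_speeds_tps: list[float],
                        slo_tps: float,
                        power_watts: float) -> float:
    """Goodput per unit power: agents holding the SLO, divided by megawatts."""
    # Count only agents whose sustained per-token speed meets the floor.
    agents_meeting_slo = sum(1 for speed in agent_speeds_tps if speed >= slo_tps)
    return agents_meeting_slo / (power_watts / 1_000_000)


# Hypothetical measurement: five concurrent agents on a system drawing 125 kW.
speeds = [72.0, 65.5, 61.0, 58.9, 40.2]  # tokens/sec per agent, made up
score = agents_per_megawatt(speeds, slo_tps=60.0, power_watts=125_000)
print(score)  # 24.0
```

Two of the five agents are still producing tokens, but they fall below the 60 tokens/sec floor and so contribute nothing to the score. That is the difference between goodput and raw throughput.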
| How you benchmark | What it sends | What it misses |
|---|---|---|
| Single chat completion | one prompt → one response | the bursty, multi-step load a real agent creates |
| Peak-throughput LLM bench | many independent prompts | KV reuse and sustained concurrency within one long run |
| AgentPerf (trajectory replay) | recorded multi-step agent runs | — (scores concurrent agents under an SLO, then agents per megawatt) |
What "20× per megawatt" means
Hold two things fixed: the power, at one megawatt, and the SLO, at 60 tokens per second. Suppose an HGX H200 deployment drawing that megawatt sustains 60 concurrent agents that stay above the floor (an illustrative figure). The one number actually reported is the ratio: the GB300 NVL72 sustains up to 20× as many on the same megawatt, or roughly 1,200 agents on that scaling.
The lever isn't only more FLOPs. Agent trajectories share a huge common prefix (the system prompt, the tool definitions, the conversation so far), so KV-cache reuse and continuous batching are what turn raw compute into sustained agents, and a single-completion benchmark never exercises that reuse. Same megawatt, up to 20× the agents, because the test finally rewards sustained, KV-reuse-heavy agent load instead of a one-shot sprint.
(Only the 20× ratio, the 20/60 tok/s SLOs, and the GB300-vs-H200 comparison come from NVIDIA; the 60-agent baseline is illustrative.)
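The same arithmetic can be turned around to ask how much power a fixed fleet would need. The sketch below reuses the illustrative 60-agent baseline from the text and the reported "up to 20×" ratio; the 6,000-agent fleet and the helper `megawatts_needed` are hypothetical additions, and because the ratio is an upper bound, the scaled result is a best case.

```python
# Only the "up to 20x" ratio is a reported figure. The 60-agent baseline
# and the 6,000-agent fleet are made-up examples.
BASELINE_AGENTS_PER_MW = 60
REPORTED_RATIO_UPPER_BOUND = 20


def megawatts_needed(fleet_size: int, agents_per_mw: float) -> float:
    """Power required to keep fleet_size agents above the SLO."""
    return fleet_size / agents_per_mw


# Best-case scaled efficiency: 60 agents/MW times the upper-bound ratio.
scaled_agents_per_mw = BASELINE_AGENTS_PER_MW * REPORTED_RATIO_UPPER_BOUND

fleet = 6_000
baseline_mw = megawatts_needed(fleet, BASELINE_AGENTS_PER_MW)
scaled_mw = megawatts_needed(fleet, scaled_agents_per_mw)
print(scaled_agents_per_mw, baseline_mw, scaled_mw)  # 1200 100.0 5.0
```

On these assumed numbers, the same fleet would need 100 MW at the baseline efficiency and as little as 5 MW at the scaled one.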
Related explainers
- NVIDIA AI Factories — tokens per megawatt — the metric cousin: AgentPerf's agents per megawatt is the agent-level version of tokens per megawatt
- WeaveBench — trajectory-aware grading — also replays the whole agent run, but to grade correctness; AgentPerf replays it to measure infrastructure
- FutureSim — harness-level agent eval — the broader shift to evaluating agents at the harness level, not single-shot QA