How is trajectory-replay benchmarking different from a normal LLM benchmark?

A normal LLM benchmark sends one prompt and measures the response — tokens per second, time to first token. An agent, though, runs a long trajectory: chained model calls interleaved with tool executions, with a growing context and bursty decode. Trajectory replay drives the system with those recorded multi-step runs instead of single prompts, so it stresses the scheduler, KV-cache reuse, and sustained decode under concurrency — the load that real agents actually create.

What does agents per megawatt measure?

Agents per megawatt is AgentPerf's headline metric: the number of concurrent agents a system keeps above the per-token SLO, divided by the power it draws. It is a goodput-style efficiency number — useful work per unit of energy — analogous to miles per gallon for an inference fleet. It rewards systems that sustain many real agent runs at once on the same power budget, not just peak token throughput on a single prompt.

NVIDIA Blackwell leads AgentPerf, the first agentic-AI infra benchmark — Trajectory-replay benchmarking

Q: What is AgentPerf?

AgentPerf is a benchmark from Artificial Analysis, billed as the first test for agentic-AI infrastructure. Instead of timing single chat completions, it replays recorded multi-step coding-agent trajectories (file reads, code execution, and iteration across 12+ programming languages) and scores how many concurrent agents a serving system sustains under a per-token SLO, normalized by power (agents per megawatt). In NVIDIA's reported result, a GB300 NVL72 system serves up to 20× more agents per megawatt than an HGX H200 system on DeepSeek V4 Pro.

TL;DR

What is it: The AgentPerf benchmark from Artificial Analysis is the first test built for agentic-AI infrastructure: instead of timing one chat completion, it replays recorded multi-step agent trajectories to see how a serving system holds up under real agent load.
Why it’s needed: Agents don't send one prompt; they run long chains of model calls and tool executions, so a serving system's real job is sustaining many such runs at once. AgentPerf measures exactly that: concurrent agents held above a per-token speed floor, normalized by power.
vs previous: A single-shot completion benchmark sends one prompt and reports tokens per second — and misses the bursty, stateful, KV-cache-heavy load a real agent creates. Trajectory replay reproduces that load, so the score reflects real production agent load, not a sprint time.

Jargon

AgentPerf: Artificial Analysis's benchmark for agentic-AI infrastructure. It drives a serving system with recorded coding-agent trajectories across 12+ programming languages and scores how many concurrent agents the system sustains while holding a per-token speed floor (the SLO), normalized by power.
Agent trajectory: The full recorded run of an agent: chained LLM calls interleaved with tool executions — read a file, run code, see the error, try again — many steps to finish one task. See AI Agents → The Agent Loop.
Per-token SLO: A service-level objective on output speed — a floor on tokens per second the system must hold for each agent. AgentPerf measures at both 20 and 60 tok/s. See LLM Serving → Serving Metrics.
Goodput: Only the work that actually meets the SLO — here, the concurrent agents staying above the token-rate floor — as opposed to raw throughput, which counts everything regardless of latency. See Throughput vs Goodput.
Agents per megawatt: AgentPerf's headline metric: concurrent agents meeting the SLO, divided by the power the system draws. An efficiency number — useful work per unit of energy — like miles per gallon for an inference fleet.
GB300 NVL72 / HGX H200: The two NVIDIA systems compared: the rack-scale Blackwell GB300 NVL72 versus the prior-generation HGX H200. Both run DeepSeek V4 Pro in the reported result.

The news. On June 12, 2026, Artificial Analysis released AgentPerf, billed as the industry's first benchmark for agentic-AI infrastructure. Rather than single chat completions, it replays real coding-agent trajectories (file reads, code execution, iteration) across 12+ programming languages, and scores how many concurrent agents a system sustains under a per-token SLO, normalized by power. NVIDIA reports its GB300 NVL72 serves up to 20× more agents per megawatt than an HGX H200 system, running DeepSeek V4 Pro and measured at both 20 and 60 tokens/sec.

Picture the fuel-economy sticker on a new car. The number that ends up on the window isn't a quarter-mile drag time — a single sprint down an empty straight tells you almost nothing about the commute you'll actually drive. The figure drivers care about, miles per gallon, comes from a dynamometer replaying a recorded city drive cycle: stop, go, idle, accelerate, the messy real pattern. A single sprint measures the wrong thing; the recorded drive cycle measures the thing you live with. A single chat completion is that sprint. An agent's run is the drive cycle. AgentPerf is the dyno.

The reason the distinction matters is that an agent run looks nothing like one prompt-and-reply. It is a long loop of model calls interleaved with tool executions (read a file, run the code, look at the failure, edit, try again), many steps to finish a single task. That load is bursty and stateful: the context grows with every step and leans hard on KV-cache reuse; decode comes in stop-go spurts; and many such runs land on the system at once. A benchmark that sends one prompt and reports peak tokens per second is timing the sprint, not the commute.
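To make the "context grows with every step" point concrete, here is a minimal sketch of a toy cost model. It counts how many tokens a server would have to prefill over one agent run, with and without reusing the KV cache from earlier steps. The `Step` class, the `prefill_tokens` function, and every token count are hypothetical illustrations; this is not AgentPerf's replay harness or any serving engine's actual accounting.

```python
from dataclasses import dataclass


@dataclass
class Step:
    new_input_tokens: int  # tool output or instructions appended before this model call
    output_tokens: int     # tokens the model decodes during this call


def prefill_tokens(steps, system_prompt_tokens, reuse_kv_cache):
    """Total tokens the server must prefill over one trajectory (toy model)."""
    context = system_prompt_tokens  # tokens in the conversation so far
    cached = 0                      # prefix length already held in the KV cache
    total = 0
    for step in steps:
        context += step.new_input_tokens
        # With reuse, only the uncached suffix is prefilled; without it, everything is.
        total += (context - cached) if reuse_kv_cache else context
        context += step.output_tokens
        if reuse_kv_cache:
            cached = context  # decoded tokens also leave their KV entries behind
    return total


# Hypothetical three-step run: read a file, run the code, retry after an error.
trajectory = [Step(500, 200), Step(1000, 300), Step(800, 150)]
without_reuse = prefill_tokens(trajectory, 2000, reuse_kv_cache=False)
with_reuse = prefill_tokens(trajectory, 2000, reuse_kv_cache=True)
print(without_reuse, with_reuse)  # 11000 4300
```

Even in this three-step toy run, recomputing the full context on every call costs more than twice the prefill work. A single-prompt benchmark has only one call, so it never sees that gap.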

So AgentPerf replays recorded coding-agent trajectories and asks a different question: how many agents can the system keep above a per-token speed floor at the same time? That is a goodput measurement: count only the agents actually holding the SLO, not raw token throughput, then divide by the power the rack draws. The unit that falls out, agents per megawatt, is miles per gallon for an inference fleet: useful work per unit of energy.
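The metric described above can be sketched in a few lines. This assumes you already have a measured decode rate for each concurrent agent and a power reading for the system. The function name, the sample rates, and the 120 kW figure are all hypothetical, and AgentPerf's actual scoring procedure may differ.

```python
def agents_per_megawatt(agent_token_rates, slo_tok_per_s, power_watts):
    """Count agents at or above the SLO floor, then normalize to one megawatt."""
    goodput_agents = sum(1 for rate in agent_token_rates if rate >= slo_tok_per_s)
    return goodput_agents * 1_000_000 / power_watts


# Hypothetical per-agent decode rates (tok/s) observed during one replay window.
rates = [72.0, 65.5, 61.0, 58.0, 40.2]
power_w = 120_000  # hypothetical system power draw in watts

at_60 = agents_per_megawatt(rates, slo_tok_per_s=60, power_watts=power_w)
at_20 = agents_per_megawatt(rates, slo_tok_per_s=20, power_watts=power_w)
print(at_60)            # 25.0
print(round(at_20, 1))  # 41.7
```

The same five agents score differently at the two SLOs: all five clear the 20 tok/s floor, but only three clear 60 tok/s. This is why the SLO has to be stated alongside any agents-per-megawatt figure.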

[Figure: throughput vs. goodput. Throughput: 10 req/s; goodput: 3 req/s (7 requests violated the SLO). Same system, same second: throughput looks healthy, goodput tells the truth.]

How you benchmark	What it sends	What it misses
Single chat completion	one prompt → one response	the bursty, multi-step load a real agent creates
Peak-throughput LLM bench	many independent prompts	KV reuse and sustained concurrency within one long run
AgentPerf (trajectory replay)	recorded multi-step agent runs	— (scores concurrent agents under an SLO, then agents per megawatt)

What "20× per megawatt" means

Hold two things fixed: the power, at one megawatt, and the SLO, at 60 tokens per second. Suppose an HGX H200 system sustains 60 concurrent agents above that floor on its megawatt (an illustrative figure, not a reported one). The one ratio NVIDIA actually reports is the comparison: the GB300 NVL72 sustains up to 20× as many agents per megawatt, which would be up to roughly 1,200 agents on that scaling.

The lever isn't only more FLOPs. Agent trajectories share a huge common prefix (the system prompt, the tool definitions, the conversation so far), so KV-cache reuse and continuous batching are what turn raw compute into sustained agents, and a single-completion benchmark never exercises that reuse. Same megawatt, up to 20× the agents, because the test rewards sustained, KV-reuse-heavy agent load instead of a one-shot sprint. (Only the 20× ratio, the 20 and 60 tok/s SLOs, and the GB300-vs-H200 comparison come from NVIDIA's reported result; the 60-agent baseline is illustrative.)
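The arithmetic behind that illustration can be written out as a short sketch. Only the "up to 20×" ratio comes from the reported result. The 60-agent baseline is the illustrative number used above, and the watts-per-agent figures derived from it are equally illustrative. The helper name is hypothetical.

```python
MEGAWATT_W = 1_000_000


def watts_per_agent(agents_per_mw):
    """Power budget each SLO-meeting agent effectively consumes."""
    return MEGAWATT_W / agents_per_mw


baseline_agents = 60     # illustrative H200 figure from the text, not a measurement
reported_ratio_max = 20  # "up to 20x", the ratio NVIDIA reports

upper_bound_agents = baseline_agents * reported_ratio_max
print(upper_bound_agents)                          # 1200
print(round(watts_per_agent(baseline_agents)))     # 16667
print(round(watts_per_agent(upper_bound_agents)))  # 833
```

Read the other way around, a 20× gain in agents per megawatt is a 20× drop in the power budget per sustained agent, whatever the true baseline turns out to be.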

Goes deeper in: Agent Engineering → Cost & Latency → The Cost Profile of an Agent

Related explainers

NVIDIA AI Factories — tokens per megawatt — the metric cousin: AgentPerf's agents per megawatt is the agent-level version of tokens per megawatt
WeaveBench — trajectory-aware grading — also replays the whole agent run, but to grade correctness; AgentPerf replays it to measure infrastructure
FutureSim — harness-level agent eval — the broader shift to evaluating agents at the harness level, not single-shot QA


