The news. On June 4, 2026, researchers released AdaPlanBench, a 307-task benchmark that asks a different question than most agent evals: can an agent plan when the rules aren't all handed to it up front? Each household task carries hidden world and user constraints, revealed only when the agent proposes a plan that breaks one, forcing it to detect the violation, revise, and re-test. Across 10 leading LLMs, the best scored just 67.75%, and every model got worse as constraints piled up, with user constraints proving harder than world constraints.
Picture a substitute chef walking into a stranger's kitchen to cook dinner. Nobody hands them the house rules. They lay out a sensible recipe — roast chicken, a peanut-sauce side, finish in the cast-iron — and the instant they reach for the peanuts, a buzzer goes off: the kid is allergic. Swap the side, reach for the cast-iron, buzzer again: that pan is off-limits. Each rule was invisible until the plan tripped it. That is exactly the world AdaPlanBench drops an agent into — the constraints exist from the start, but the only way to learn one is to violate it and be told.
This breaks the quiet assumption behind a normal benchmark, which hands the agent every requirement up front and grades a single plan once. You cannot satisfy a rule you were never given, so a one-shot planner is structurally outmatched here. The only thing that works is a closed loop: propose a plan → hit a violation → revise → re-test, until nothing buzzes — the same evaluator–optimizer loop the Agents track teaches, run against a hidden rulebook instead of a known one. AdaPlanBench scores whether the agent keeps that loop turning, not whether its first attempt happened to look good.
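The propose → violate → revise → re-test loop can be sketched in a few lines. This is a toy illustration built on the kitchen story, not AdaPlanBench's actual harness: the `Constraint` class, the `first_violation` and `replan_until_clean` helpers, and the substitution table are all hypothetical names, and a real agent would ask an LLM for the revision rather than look it up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Constraint:
    kind: str        # "world" or "user"
    forbidden: str   # plan item that trips this rule (toy assumption)
    message: str     # feedback revealed only after a violation


def first_violation(plan: set[str], hidden: list[Constraint]) -> Optional[Constraint]:
    """Return the first hidden constraint the plan breaks, or None if it is clean."""
    for constraint in hidden:
        if constraint.forbidden in plan:
            return constraint
    return None


def replan_until_clean(plan, hidden, substitutes, max_rounds=10):
    """Propose, hit a violation, revise, re-test, until nothing buzzes."""
    plan = set(plan)
    learned = []  # constraints the agent has discovered so far
    for _ in range(max_rounds):
        violation = first_violation(plan, hidden)
        if violation is None:
            return plan, learned
        learned.append(violation)
        # Revise: drop the offending item and swap in a substitute if one exists.
        plan.discard(violation.forbidden)
        replacement = substitutes.get(violation.forbidden)
        if replacement is not None:
            plan.add(replacement)
    raise RuntimeError("no clean plan found within max_rounds")


hidden_rules = [
    Constraint("user", "peanut sauce", "the kid is allergic"),
    Constraint("user", "cast-iron pan", "that pan is off-limits"),
]
substitutes = {"peanut sauce": "yogurt sauce", "cast-iron pan": "steel pan"}

final_plan, learned = replan_until_clean(
    {"roast chicken", "peanut sauce", "cast-iron pan"}, hidden_rules, substitutes
)
print(sorted(final_plan))
print([c.message for c in learned])
```

A one-shot planner corresponds to calling `first_violation` once and stopping; the loop is what lets the agent recover rules it was never given.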
The hidden rules split into two kinds, and the split is the teachable part. World constraints are facts about the environment — the oven is broken, there are no nuts in the pantry — fixed and checkable. User constraints are the family's preferences — no pork, save the good china for guests — and the paper finds them consistently harder for models to recover from. The two behave like distinct failure modes: missing a world fact is a perception slip, while missing a user preference is a values slip, and an agent can be sharp at one and clumsy at the other.
| Hidden constraint | In the kitchen | In a household task | How models cope |
|---|---|---|---|
| World constraint (a fact about the environment) | the oven is broken; there are no nuts | the door is locked; the shelf is already full | easier — a checkable fact |
| User constraint (a preference or requirement) | no pork; save the good china for guests | don't move the heirloom; finish before noon | harder — recovered less reliably than world constraints (AdaPlanBench) |
Where the score actually goes (illustrative: AdaPlanBench reports the 67.75% headline and the "harder as constraints accumulate" trend, not these per-step values). Hold one model and one task family fixed, and suppose that after replanning it satisfies any single hidden constraint about 92% of the time, independently of the others. A task passes only if it clears all of its constraints, so the odds compound: one constraint → 0.92 = 92%; three → 0.92³ ≈ 78%; five → 0.92⁵ ≈ 66%. Same agent, same per-rule skill, yet the score slides from 92% to 66% purely because the hidden rules stack, and each one is another chance to trip the buzzer. That compounding is one plausible reason errors accumulate over a long rollout, and it shows how a score near the best model's 67.75% can arise once a handful of constraints are in play.
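The compounding arithmetic is easy to reproduce. A minimal sketch, assuming each hidden constraint is cleared independently with the same probability; the 0.92 figure is the illustrative value used above, not a number from the paper, and `pass_rate` is a hypothetical helper name.

```python
def pass_rate(per_constraint: float, n_constraints: int) -> float:
    """Probability of clearing every constraint, assuming independence."""
    return per_constraint ** n_constraints


for n in (1, 3, 5):
    print(f"{n} constraint(s): {pass_rate(0.92, n):.1%}")
# 1 constraint(s): 92.0%
# 3 constraint(s): 77.9%
# 5 constraint(s): 65.9%
```

If violations are correlated (for example, one misunderstanding causes several rules to fire), the real curve would differ from this simple power law.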
The reason this matters past a leaderboard is that real agent work looks far more like the stranger's kitchen than the open-book exam. A coding agent learns the repo's unwritten conventions only when CI rejects its PR; a travel agent learns the traveler won't fly red-eyes only after it books one. AdaPlanBench's contribution is to make that plan → violate → revise competence measurable, and to show that even the best of the 10 models tested scores barely two-thirds, which reframes planning from "write the right plan" into "revise the plan every time a hidden rule fires."
Related explainers
These probe the same question from different angles — what actually predicts whether a long-horizon agent succeeds:
- AutoLab — iterative experiment-loop evaluation — scores whether an agent sustains a propose → run → measure → refine loop under a budget, the same closed-loop disposition AdaPlanBench rewards.
- Quantitative Goal Persistence (QGP) — measures whether an agent finishes a fixed count of verified work without spinning, another long-horizon failure mode a single score misses.
- Effective Feedback Compute (EFC) — success scales with the quality of the feedback signal a harness returns; AdaPlanBench's violation signal is exactly that kind of feedback.