What is adaptive replanning under hidden constraints?

It is planning as a closed loop instead of a single attempt: the agent proposes a plan, the environment reveals a hidden rule by flagging a violation, and the agent revises and re-tests until the plan satisfies every constraint. AdaPlanBench measures this directly — 307 household tasks whose world and user constraints surface only when a plan breaks them — and finds the best of 10 leading LLMs reaches just 67.75%.

Why are user constraints harder than world constraints?

World constraints are checkable facts about the environment (the oven is broken, there are no nuts), so a miss is a perception slip the agent can catch on re-test. User constraints are softer preferences (no pork, save the good china) that depend on context and intent, and AdaPlanBench reports models recover from them less reliably — the failure is about honoring what the user wants, not observing what is true.

How does AdaPlanBench differ from a normal planning benchmark?

A normal benchmark hands the agent all the constraints up front and grades one plan; AdaPlanBench hides them and discloses each only when a proposed plan violates it. That penalizes one-shot planning by design and rewards the plan, violate, revise loop instead, exposing two failure modes — missing a world fact and missing a user preference — that an open-book, single-shot score can never see.

AdaPlanBench tests agent planning under incremental constraints — Adaptive replanning under hidden constraints

TL;DR

What is it: The AdaPlanBench benchmark tests adaptive replanning under hidden constraints — 307 household tasks whose rules are revealed only when the agent's plan violates one, forcing it to detect the breach, revise, and re-test.
Why it’s needed: Real tasks rarely state every rule up front, so an agent that can only plan once fails the moment an unstated rule bites. AdaPlanBench measures the competence that deployed agents depend on, revising a plan in response to feedback, and finds the best of 10 models clears just 67.75%.
vs previous: Most planning benchmarks hand over all constraints up front and grade a single plan; AdaPlanBench hides them and surfaces each only on violation, exposing two failure modes a one-shot score is blind to: missing a world fact and missing a user preference — the harder of the two.

Jargon

AdaPlanBench: The benchmark in the news — 307 household tasks, each augmented with hidden world and user constraints, run under a multi-turn protocol where a constraint surfaces only when the agent's plan breaks it. It evaluates 10 leading LLMs; the best reaches 67.75%.
Adaptive (closed-loop) replanning: Planning as a loop — propose a plan, hit a violation, revise, re-test — until the plan satisfies every (hidden) rule. It is the opposite of writing one plan and committing to it.
World constraint: A fact about the environment — the oven is broken, there are no nuts in the pantry. Fixed and checkable; missing one is a perception slip.
User constraint: A preference or requirement from the person — no pork, save the good china for guests. Softer and context-dependent; AdaPlanBench reports models recover from these less reliably than from world constraints.
One-shot (single-pass) planning: Producing a complete plan in a single attempt with no revision. It is the baseline AdaPlanBench is designed to defeat: you cannot satisfy a rule you were never told.
Constraint-violation feedback: The signal the environment returns when a proposed plan breaks a rule. In this benchmark it is the only channel through which a hidden constraint becomes known — no violation, no disclosure.
Incremental constraint disclosure: Constraints are not all handed over at the start; they are revealed one at a time as the agent's successive plans trip them, so the rulebook fills in over the course of the task.

The news. On June 4, 2026, researchers released AdaPlanBench, a 307-task benchmark that asks a different question than most agent evals: can an agent plan when the rules aren't all handed to it up front? Each household task carries hidden world and user constraints, revealed only when the agent proposes a plan that breaks one, forcing it to detect the violation, revise, and re-test. Across 10 leading LLMs, the best scored just 67.75%, and every model got worse as constraints piled up, with user constraints proving harder than world constraints.

Picture a substitute chef walking into a stranger's kitchen to cook dinner. Nobody hands them the house rules. They lay out a sensible recipe — roast chicken, a peanut-sauce side, finish in the cast-iron — and the instant they reach for the peanuts, a buzzer goes off: the kid is allergic. Swap the side, reach for the cast-iron, buzzer again: that pan is off-limits. Each rule was invisible until the plan tripped it. That is exactly the world AdaPlanBench drops an agent into — the constraints exist from the start, but the only way to learn one is to violate it and be told.

This breaks the quiet assumption behind a normal benchmark, which hands the agent every requirement up front and grades a single plan once. You cannot satisfy a rule you were never given, so a one-shot planner is structurally outmatched here. The only thing that works is a closed loop: propose a plan → hit a violation → revise → re-test, until nothing buzzes — the same evaluator–optimizer loop the Agents track teaches, run against a hidden rulebook instead of a known one. AdaPlanBench scores whether the agent keeps that loop turning, not whether its first attempt happened to look good.
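
The propose → violate → revise → re-test loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AdaPlanBench's actual protocol or API: `Constraint`, `HiddenConstraintEnv`, `replan_loop`, and `toy_propose` are hypothetical names, the two kitchen rules are borrowed from the chef analogy, and the rule-based planner stands in for an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Plan = list[str]  # a plan is an ordered list of step descriptions


@dataclass(frozen=True)
class Constraint:
    name: str                            # label disclosed to the agent on violation
    violated_by: Callable[[Plan], bool]  # True if the plan breaks this rule


class HiddenConstraintEnv:
    """Toy environment (hypothetical): knows every rule, discloses one only on violation."""

    def __init__(self, constraints: list[Constraint]):
        self._constraints = constraints

    def check(self, plan: Plan) -> Optional[Constraint]:
        # Return the first violated constraint, or None if the plan is clean.
        for constraint in self._constraints:
            if constraint.violated_by(plan):
                return constraint
        return None


def replan_loop(env, propose, max_turns: int = 10):
    """Propose, hit a violation, revise, re-test, until nothing fires or turns run out."""
    known: list[Constraint] = []  # the rulebook as the agent has learned it so far
    for turn in range(1, max_turns + 1):
        plan = propose(known)
        violation = env.check(plan)
        if violation is None:
            return plan, known, turn
        known.append(violation)  # the only channel through which a rule becomes known
    return None, known, max_turns


def toy_propose(known: list[Constraint]) -> Plan:
    # Stand-in for an LLM planner: swaps a step once the matching rule is known.
    names = {c.name for c in known}
    side = "steamed greens" if "no peanuts" in names else "peanut-sauce side"
    pan = "steel pan" if "cast-iron off-limits" in names else "cast-iron pan"
    return ["roast chicken", f"make {side}", f"finish in {pan}"]


env = HiddenConstraintEnv([
    Constraint("no peanuts", lambda p: any("peanut" in step for step in p)),
    Constraint("cast-iron off-limits", lambda p: any("cast-iron" in step for step in p)),
])

plan, known, turns = replan_loop(env, toy_propose)
print(turns, plan)
```

In this toy run the first plan trips the peanut rule, the second trips the cast-iron rule, and the third passes. A one-shot planner would have stopped after the first proposal with no way to learn either rule.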

The hidden rules split into two kinds, and the split is the teachable part. World constraints are facts about the environment — the oven is broken, there are no nuts in the pantry — fixed and checkable. User constraints are the family's preferences — no pork, save the good china for guests — and the paper finds them consistently harder for models to recover from. The two behave like distinct failure modes: missing a world fact is a perception slip, while missing a user preference is a values slip, and an agent can be sharp at one and clumsy at the other.

Hidden constraint	In the kitchen	In a household task	How models cope
World constraint (a fact about the environment)	the oven is broken; there are no nuts	the door is locked; the shelf is already full	easier — a checkable fact
User constraint (a preference or requirement)	no pork; save the good china for guests	don't move the heirloom; finish before noon	harder — recovered less reliably than world constraints (AdaPlanBench)
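
One way to make the world-versus-user split in the table measurable is to tally recoveries separately per constraint kind. The sketch below assumes a hypothetical log of (kind, recovered) pairs; the rows are invented for illustration and are not AdaPlanBench results, and `recovery_rate_by_kind` is a made-up helper name.

```python
from collections import defaultdict

# Hypothetical log: (constraint kind, whether the revised plan satisfied it).
# These rows are invented for illustration; they are NOT AdaPlanBench data.
events = [
    ("world", True), ("world", True), ("world", False), ("world", True),
    ("user", True), ("user", False), ("user", False), ("user", True),
]


def recovery_rate_by_kind(events):
    """Fraction of disclosed constraints the agent went on to satisfy, per kind."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for kind, recovered in events:
        totals[kind] += 1
        hits[kind] += int(recovered)
    return {kind: hits[kind] / totals[kind] for kind in totals}


rates = recovery_rate_by_kind(events)
print(rates)  # {'world': 0.75, 'user': 0.5}
```

Reporting the two rates side by side, rather than one pooled number, is what lets an evaluation show that an agent can be sharp on one kind of constraint and clumsy on the other.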

Where the score actually goes (illustrative: AdaPlanBench reports the 67.75% headline and the "harder as constraints accumulate" trend, not these per-step values). Hold one model and one task family fixed, and suppose that after replanning it satisfies any single hidden constraint about 92% of the time, independently of the others. A task passes only if it clears all of its constraints, so the odds compound: one constraint → 0.92 = 92%; three → 0.92³ ≈ 78%; five → 0.92⁵ ≈ 66%. Same agent, same per-rule skill: the score slides from 92% to 66% purely because the hidden rules stack, and each one is another chance to trip the buzzer. That compounding is why errors accumulate over a long rollout, and it shows how a headline like "best model, 67.75%" is consistent with a model that handles most individual constraints well once a handful of them are in play.
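
The compounding arithmetic above can be checked directly. This is a sketch under the same illustrative assumptions as the paragraph: a fixed 92% per-constraint success rate and independence between constraints, neither of which is a figure reported by AdaPlanBench. `task_pass_probability` is a hypothetical helper name.

```python
def task_pass_probability(per_constraint_success: float, num_constraints: int) -> float:
    """Probability of clearing every constraint, assuming each is independent."""
    if not 0.0 <= per_constraint_success <= 1.0:
        raise ValueError("per_constraint_success must be in [0, 1]")
    if num_constraints < 0:
        raise ValueError("num_constraints must be non-negative")
    return per_constraint_success ** num_constraints


# Illustrative per-constraint rate from the text, not a benchmark result.
for k in (1, 3, 5):
    print(f"{k} constraint(s): {task_pass_probability(0.92, k):.1%}")
# 1 constraint(s): 92.0%
# 3 constraint(s): 77.9%
# 5 constraint(s): 65.9%
```

If the constraints are not independent (for example, one revision fixes two rules at once, or fixing one rule breaks another), the real curve will differ from this simple power law.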

The reason this matters past a leaderboard is that real agent work looks far more like the stranger's kitchen than the open-book exam. A coding agent learns the repo's unwritten conventions only when CI rejects its PR; a travel agent learns the traveler won't fly red-eyes only after it books one. AdaPlanBench's contribution is to make that plan → violate → revise competence measurable, and to show that even frontier models clear barely two-thirds of it — which reframes planning from "write the right plan" into "revise the plan every time a hidden rule fires."

Related explainers

These probe the same question from different angles — what actually predicts whether a long-horizon agent succeeds:

AutoLab — iterative experiment-loop evaluation — scores whether an agent sustains a propose → run → measure → refine loop under a budget, the same closed-loop disposition AdaPlanBench rewards.
Quantitative Goal Persistence (QGP) — measures whether an agent finishes a fixed count of verified work without spinning, another long-horizon failure mode a single score misses.
Effective Feedback Compute (EFC) — success scales with the quality of the feedback signal a harness returns; AdaPlanBench's violation signal is exactly that kind of feedback.
