Crafter — a directive critic that says what to fix

[Figure: A directive critic tells the refiner what to fix, not just how good the figure is. Refine loop, round 1: Intent, Plan ×K, Critic, Refiner, and Judge act on a shared figure spec. A scalar score (6.2 / 10) is one number with no per-dimension signal for axis ticks, legend, bar contrast, or title; per-dimension directives plus typed edits supply that signal. Figure quality on PaperBanana-Bench (0–100): baseline 33.73, Crafter 50.34, a gain of +16.61 over 292 figures.]
learnaivisually.com/ai-explained/crafter-directive-critic-harness

The news. On May 29, 2026, the Crafter paper (arXiv:2605.30611) introduced a multi-agent system for generating publication-quality, editable SVG scientific figures. Rather than ask one model for a finished image, it wraps an image backend in a five-agent harness (an Intent Reasoner, a Plan Generator, a directive Critic, a Specification Refiner, and a Convergence Judge) whose agents iterate on a shared figure specification until the judge accepts it. The paper reports a score of 50.34 against a 33.73 baseline on PaperBanana-Bench (292 figures, 0–100 scale) at roughly $0.25 per figure.

Picture handing a rough draft to two kinds of editor. The first scrawls "6 out of 10" across the top and hands it back — you know it isn't good enough, but not one thing about what to change. The second returns it covered in margin notes: enlarge this caption, the legend is missing, these two bars are the same color. Crafter is built around the second editor. It wraps an image backend in a five-agent harness that keeps refining one shared figure spec until every note clears — the draft circling author → editor → author, with tracked changes, not a fresh rewrite each round.

The move that makes the loop work is the shape of the critic's output. A scalar critic emits a single number, so the refiner downstream can only guess which knob to turn — and the figure wanders, often undoing last round's fix. Crafter's directive critic instead emits per-dimension diagnostics: a separate, concrete instruction for the ticks, the legend, the contrast, and the title. The refiner turns each one into a typed edit — a structured change to the exact element rather than a free-text revision that might contradict itself. That is the evaluator-optimizer pattern with a sharp twist: the evaluator's verdict is actionable, so each pass tends to move the figure forward instead of churning.
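The difference between a score and a directive is easiest to see as data. The sketch below is a toy illustration, not the paper's schema: `FigureSpec`, `Directive`, `TypedEdit`, the rule-based `critique`, and the color palette are all hypothetical names and values invented here. It shows a critic returning one instruction per dimension and a refiner turning each into a structured edit on a named field.

```python
from dataclasses import dataclass, replace

# All names and values here are hypothetical; the paper's real schema is not shown in this article.
PALETTE = ("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728")


@dataclass(frozen=True)
class FigureSpec:
    title: str = ""
    show_legend: bool = False
    tick_font_size: int = 6
    bar_colors: tuple = ("#777777", "#777777")


@dataclass(frozen=True)
class Directive:
    dimension: str    # which aspect of the figure is wrong
    instruction: str  # what to do about it


@dataclass(frozen=True)
class TypedEdit:
    target: str       # name of the spec field to change
    value: object     # new value for that field


def critique(spec: FigureSpec) -> list:
    """Rule-based stand-in for the critic: one directive per failing dimension."""
    found = []
    if spec.tick_font_size < 8:
        found.append(Directive("axis_ticks", "raise tick font size to at least 8"))
    if not spec.show_legend:
        found.append(Directive("legend", "add a legend"))
    if len(set(spec.bar_colors)) < len(spec.bar_colors):
        found.append(Directive("bar_contrast", "give each bar a distinct color"))
    if not spec.title:
        found.append(Directive("title", "add a title"))
    return found


def to_edit(directive: Directive, spec: FigureSpec) -> TypedEdit:
    """Stand-in for the refiner: map a directive to a structured edit."""
    if directive.dimension == "axis_ticks":
        return TypedEdit("tick_font_size", 8)
    if directive.dimension == "legend":
        return TypedEdit("show_legend", True)
    if directive.dimension == "bar_contrast":
        return TypedEdit("bar_colors", PALETTE[: len(spec.bar_colors)])
    if directive.dimension == "title":
        return TypedEdit("title", "Untitled figure")  # placeholder text
    raise ValueError(f"unknown dimension: {directive.dimension}")


def apply_edit(spec: FigureSpec, edit: TypedEdit) -> FigureSpec:
    # dataclasses.replace raises TypeError for a field the spec does not have,
    # so a malformed edit fails loudly instead of silently changing nothing.
    return replace(spec, **{edit.target: edit.value})


spec = FigureSpec()
for directive in critique(spec):
    spec = apply_edit(spec, to_edit(directive, spec))
print(critique(spec))  # no directives left for this toy spec
```

Because each edit names exactly one field, two edits in the same round cannot contradict each other the way two sentences of free-text revision can.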

Around that critic sit four more roles — an Intent Reasoner that seeds the spec, a Plan Generator that proposes K candidate layouts in parallel, the Specification Refiner that applies the typed edits, and a Convergence Judge that accepts, refines, or reverts each round. That division of labor is the textbook orchestrator-workers shape, and the judge plays the supervisor role a multi-agent team needs to decide when to keep iterating and when to stop.
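The five roles can be wired together in a few lines. What follows is a minimal sketch under stated assumptions, not the paper's implementation: every function body is a toy stand-in, the spec is a plain dictionary, candidates are generated sequentially rather than in parallel, and both the plan-selection rule (fewest open directives) and the judge's rule (compare open directives before and after) are guesses made here for illustration.

```python
# Toy stand-ins for the five roles; every body and rule here is hypothetical.
REQUIRED = ("title", "legend", "ticks")  # toy quality dimensions


def intent_reasoner() -> dict:
    """Seed the shared spec."""
    return {"title": "demo"}


def plan_generator(seed: dict, i: int) -> dict:
    """Candidate i fills in i extra fields, so the candidates differ."""
    return {**seed, **{name: True for name in REQUIRED[1 : 1 + i]}}


def critic(spec: dict) -> list:
    """Directive critic: name every dimension that still needs work."""
    return [name for name in REQUIRED if name not in spec]


def refiner(spec: dict, directives: list) -> dict:
    """Apply one edit per round, addressing the first open directive."""
    return {**spec, directives[0]: True}


def judge(open_before: int, open_after: int) -> str:
    """Accept when nothing is open, keep refining on progress, otherwise revert."""
    if open_after == 0:
        return "accept"
    if open_after < open_before:
        return "refine"
    return "revert"


def run_harness(k: int = 2, max_rounds: int = 5):
    seed = intent_reasoner()
    candidates = [plan_generator(seed, i) for i in range(k)]
    # Assumption: start from the candidate with the fewest open directives.
    spec = min(candidates, key=lambda s: len(critic(s)))
    verdicts = []
    for _ in range(max_rounds):
        directives = critic(spec)
        if not directives:
            break
        revised = refiner(spec, directives)
        verdict = judge(len(directives), len(critic(revised)))
        verdicts.append(verdict)
        if verdict != "revert":
            spec = revised  # on "revert" the previous spec is kept
        if verdict == "accept":
            break
    return spec, verdicts


final_spec, history = run_harness(k=1)
print(history)
```

In this toy, starting from a single plan takes two rounds to converge, while starting from two candidates takes one, which hints at why a wider plan search can shorten the refinement that follows.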

How much does each piece actually carry? Crafter's ablations strip one component at a time and measure the drop on PaperBanana-Bench:

| Remove this component | Falls back to | Score change |
| --- | --- | --- |
| Typed edits | Free-text revision | −8.90 |
| Diversity-driven plans (K=1) | A single plan | −8.56 |
| Refinement loop | One-shot generation | −5.48 |
| Directive critic | Scalar score | −5.04 |

Walk the numbers. The bare image backend scores 33.73; the full harness reaches 50.34 — a +16.61 lift. The two heaviest contributors are exactly the ones the metaphor predicts: kill typed edits and you fall 8.90 points back toward free-text chaos; kill the K-plan search and you lose 8.56. The directive critic and the refinement loop are worth another 5.04 and 5.48. The four ablations aren't strictly additive — each removes a single component on its own — but the lesson is unambiguous: the gain isn't one trick, it's the directive critic, the typed edits, and the parallel planning composing. And it is cheap enough to run the whole loop — about $0.25 per figure, the full 279-sample CraftBench for under $90.
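The arithmetic behind that paragraph can be checked directly. This short sketch uses only the figures quoted above; the variable names are made up for illustration, and the cost line assumes every sample costs exactly the approximate $0.25 quoted.

```python
# Scores and ablation drops as quoted in the article (PaperBanana-Bench, 0-100).
baseline, full_harness = 33.73, 50.34
ablation_drops = {
    "typed edits": 8.90,
    "diversity-driven plans": 8.56,
    "refinement loop": 5.48,
    "directive critic": 5.04,
}

lift = round(full_harness - baseline, 2)
summed_drops = round(sum(ablation_drops.values()), 2)

# Assumption: a flat $0.25 per figure; the article says "about".
craftbench_cost = 279 * 0.25

print(f"lift over baseline: {lift}")
print(f"sum of single-component drops: {summed_drops}")
print(f"estimated cost for 279 samples: ${craftbench_cost:.2f}")
```

The four single-component drops sum to 27.98, well above the 16.61 total lift, which is the concrete sense in which the ablations are not additive: the components overlap in what they contribute. At a flat $0.25, 279 samples come to $69.75, consistent with the "under $90" figure.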

The takeaway travels well beyond figures. Whenever you put a model inside a refine loop, the bottleneck is rarely the generator — it is whether the critic can say what to fix. A score ranks; a directive moves the work forward.

