Crafter — a directive critic that says what to fix
AgentThe news. On May 29, 2026, the Crafter paper (arXiv:2605.30611) introduced a multi-agent system for generating publication-quality, editable SVG scientific figures. Rather than ask one model for a finished image, it wraps an image backend in a five-agent harness — an Intent Reasoner, a Plan Generator, a directive Critic, a Specification Refiner, and a Convergence Judge — that iterate on a shared figure specification until the judge accepts it. It reports 50.34 vs a 33.73 baseline on PaperBanana-Bench (292 figures) at roughly $0.25 per figure. Read the paper →
Picture handing a rough draft to two kinds of editor. The first scrawls "6 out of 10" across the top and hands it back — you know it isn't good enough, but not one thing about what to change. The second returns it covered in margin notes: enlarge this caption, the legend is missing, these two bars are the same color. Crafter is built around the second editor. It wraps an image backend in a five-agent harness that keeps refining one shared figure spec until every note clears — the draft circling author → editor → author, with tracked changes, not a fresh rewrite each round.
The move that makes the loop work is the shape of the critic's output. A scalar critic emits a single number, so the refiner downstream can only guess which knob to turn — and the figure wanders, often undoing last round's fix. Crafter's directive critic instead emits per-dimension diagnostics: a separate, concrete instruction for the ticks, the legend, the contrast, and the title. The refiner turns each one into a typed edit — a structured change to the exact element rather than a free-text revision that might contradict itself. That is the evaluator-optimizer pattern with a sharp twist: the evaluator's verdict is actionable, so each pass tends to move the figure forward instead of churning.
Around that critic sit four more roles — an Intent Reasoner that seeds the spec, a Plan Generator that proposes K candidate layouts in parallel, the Specification Refiner that applies the typed edits, and a Convergence Judge that accepts, refines, or reverts each round. That division of labor is the textbook orchestrator-workers shape, and the judge plays the supervisor role a multi-agent team needs to decide when to keep iterating and when to stop.
How much does each piece actually carry? Crafter's ablations strip one component at a time and measure the drop on PaperBanana-Bench:
| Remove this component | Falls back to | Score change |
|---|---|---|
| Typed edits | Free-text revision | −8.90 |
| Diversity-driven plans (K=1) | A single plan | −8.56 |
| Refinement loop | One-shot generation | −5.48 |
| Directive critic | Scalar score | −5.04 |
Walk the numbers. The bare image backend scores 33.73; the full harness reaches 50.34 — a +16.61 lift. The two heaviest contributors are exactly the ones the metaphor predicts: kill typed edits and you fall 8.90 points back toward free-text chaos; kill the K-plan search and you lose 8.56. The directive critic and the refinement loop are worth another 5.04 and 5.48. The four ablations aren't strictly additive — each removes a single component on its own — but the lesson is unambiguous: the gain isn't one trick, it's the directive critic, the typed edits, and the parallel planning composing. And it is cheap enough to run the whole loop — about $0.25 per figure, the full 279-sample CraftBench for under $90.
The takeaway travels well beyond figures. Whenever you put a model inside a refine loop, the bottleneck is rarely the generator — it is whether the critic can say what to fix. A score ranks; a directive moves the work forward.
Goes deeper in: AI Agents → Workflow Patterns → Evaluator-Optimizer
Related explainers
- Maestro — RL orchestrator over frozen experts — a different way to coordinate specialists: an RL policy that routes, where Crafter's harness critiques and refines.
- EFC — feedback quality predicts agent success — the scaling-law evidence behind Crafter's bet: better feedback, not more compute, is what moves the needle.