What is Crafter's directive critic?

It is the critic agent in Crafter's five-agent figure-generation harness that emits per-dimension directive diagnostics — a concrete fix for each aspect of the figure (axis ticks, legend, contrast, title) — instead of a single scalar quality score. Because each directive names exactly what to change, the downstream refiner can apply a precise typed edit rather than guessing, which is what lets the refinement loop converge.

Why does a directive critic beat a scalar score?

A scalar score can rank two figures but cannot tell the refiner what is wrong, so the refiner gambles and the figure often wanders or undoes prior fixes. A directive critic returns a per-dimension to-do list, turning each refinement pass into targeted, accumulating progress. In Crafter's ablations, replacing the directive critic with a scalar score costs 5.04 points on PaperBanana-Bench, and dropping the typed edits it enables costs 8.90.

How does Crafter relate to the orchestrator-workers and agent-team patterns?

Crafter is a concrete instance of both. The Intent Reasoner and Plan Generator orchestrate specialized workers (the critic and refiner) — the orchestrator-workers workflow pattern — while the Convergence Judge plays the supervisor role a multi-agent team needs, deciding each round whether to accept, refine, or revert. It is a clean, low-cost example of multi-agent orchestration where the team genuinely beats a single model call.

Crafter paper — Multi-agent refinement harness with a directive critic

Crafter — a directive critic that says what to fix


Jargon

Directive critic: A critic that emits a concrete fix per dimension ("enlarge the tick font", "add a legend") instead of a single quality number. The fixes are actionable, so the refiner never has to guess.
Scalar critic: A critic that returns one number for the whole artifact (e.g. 6.2/10). It can rank candidates but cannot say what to change — the failure mode the directive critic fixes.
Typed edit: A structured, schema-constrained change to the spec — addLegend(), setTickFont(14) — versus a free-text rewrite that can silently undo a previous fix.
Agent harness: The orchestration scaffold around a model: the roles, control flow, and shared state that turn a single call into a multi-step loop.
Diversity-driven plan exploration: Generate K candidate layouts in parallel and keep the best, instead of committing to the first plan — the largest single contributor after typed edits.
Convergence judge: The agent that ends each round by deciding accept, refine, or revert — the supervisor that keeps the loop from drifting or looping forever.
PaperBanana-Bench: The figure-quality benchmark the paper reports on — 292 academic figures, scored 0–100. Crafter reaches 50.34 vs a 33.73 backend baseline.

The news. On May 29, 2026, the Crafter paper (arXiv:2605.30611) introduced a multi-agent system for generating publication-quality, editable SVG scientific figures. Rather than ask one model for a finished image, it wraps an image backend in a five-agent harness (an Intent Reasoner, a Plan Generator, a directive Critic, a Specification Refiner, and a Convergence Judge) whose agents iterate on a shared figure specification until the judge accepts it. The paper reports 50.34 against a 33.73 backend baseline on PaperBanana-Bench (292 figures, scored 0–100) at roughly $0.25 per figure.

Picture handing a rough draft to two kinds of editor. The first scrawls "6 out of 10" across the top and hands it back — you know it isn't good enough, but not one thing about what to change. The second returns it covered in margin notes: enlarge this caption, the legend is missing, these two bars are the same color. Crafter is built around the second editor. It wraps an image backend in a five-agent harness that keeps refining one shared figure spec until every note clears — the draft circling author → editor → author, with tracked changes, not a fresh rewrite each round.

The move that makes the loop work is the shape of the critic's output. A scalar critic emits a single number, so the refiner downstream can only guess which knob to turn — and the figure wanders, often undoing last round's fix. Crafter's directive critic instead emits per-dimension diagnostics: a separate, concrete instruction for the ticks, the legend, the contrast, and the title. The refiner turns each one into a typed edit — a structured change to the exact element rather than a free-text revision that might contradict itself. That is the evaluator-optimizer pattern with a sharp twist: the evaluator's verdict is actionable, so each pass tends to move the figure forward instead of churning.
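The contrast between the two critic shapes can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `FigureSpec` fields, the `Directive` type, the thresholds, the scoring rule, and the `TYPED_EDITS` table are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FigureSpec:
    # Hypothetical spec fields; the paper's real schema is not shown in this article.
    tick_font: int = 8
    has_legend: bool = False
    title: str = ""


@dataclass(frozen=True)
class Directive:
    dimension: str  # which aspect of the figure the note is about
    fix: str        # the concrete instruction for that aspect


def directive_critic(spec: FigureSpec) -> List[Directive]:
    """Return one concrete fix per failing dimension (rules are made up)."""
    notes: List[Directive] = []
    if spec.tick_font < 12:
        notes.append(Directive("ticks", "enlarge the tick font"))
    if not spec.has_legend:
        notes.append(Directive("legend", "add a legend"))
    return notes


def scalar_critic(spec: FigureSpec) -> float:
    """Collapse the same judgment into one number; the 'what to fix' is lost."""
    return 10.0 - 2.0 * len(directive_critic(spec))


# Typed edits: each dimension maps to a structured change on one field,
# so applying one fix cannot silently rewrite another field.
TYPED_EDITS: Dict[str, Callable[[FigureSpec], FigureSpec]] = {
    "ticks": lambda s: replace(s, tick_font=14),      # analogous to setTickFont(14)
    "legend": lambda s: replace(s, has_legend=True),  # analogous to addLegend()
}


def refine(spec: FigureSpec, directives: List[Directive]) -> FigureSpec:
    """Apply the typed edit that corresponds to each directive."""
    for d in directives:
        spec = TYPED_EDITS[d.dimension](spec)
    return spec


draft = FigureSpec(title="Accuracy vs. steps")
notes = directive_critic(draft)
fixed = refine(draft, notes)

print(scalar_critic(draft))      # a rank-only signal
print([d.fix for d in notes])    # an actionable to-do list
print(directive_critic(fixed))   # no notes left after the typed edits
```

Note that the title survives the refinement untouched: each typed edit changes only the field it names, which is the property a free-text rewrite cannot guarantee.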

Around that critic sit four more roles — an Intent Reasoner that seeds the spec, a Plan Generator that proposes K candidate layouts in parallel, the Specification Refiner that applies the typed edits, and a Convergence Judge that accepts, refines, or reverts each round. That division of labor is the textbook orchestrator-workers shape, and the judge plays the supervisor role a multi-agent team needs to decide when to keep iterating and when to stop.
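The control flow these roles imply can be sketched as a small loop. The sketch below is an assumption-laden toy, not the paper's algorithm: a spec is modeled as the set of dimensions that still have problems, the plan selection rule (fewest open issues), the judge's rule, and the names `toy_critic`, `toy_refiner`, `judge`, and `run_harness` are all hypothetical.

```python
from typing import FrozenSet, List, Sequence, Tuple

Spec = FrozenSet[str]  # toy spec: the set of dimensions that still have problems


def toy_critic(spec: Spec) -> List[str]:
    """Directive critic stand-in: list the dimensions that need a fix."""
    return sorted(spec)


def toy_refiner(spec: Spec, issues: List[str]) -> Spec:
    """Refiner stand-in: resolve the first open issue, leave the rest untouched."""
    return spec - {issues[0]}


def judge(open_before: int, open_after: int) -> str:
    """Hypothetical convergence rule: accept, refine, or revert."""
    if open_after == 0:
        return "accept"
    if open_after >= open_before:
        return "revert"  # the proposal made no progress, so discard it
    return "refine"


def run_harness(candidates: Sequence[Spec], max_rounds: int = 5) -> Tuple[Spec, str, int]:
    # Plan exploration: start from the candidate with the fewest open issues.
    spec = min(candidates, key=lambda s: len(toy_critic(s)))
    for round_no in range(1, max_rounds + 1):
        issues = toy_critic(spec)
        if not issues:
            return spec, "accept", round_no
        proposal = toy_refiner(spec, issues)
        verdict = judge(len(issues), len(toy_critic(proposal)))
        if verdict == "accept":
            return proposal, verdict, round_no
        if verdict == "refine":
            spec = proposal
        # On "revert" the proposal is dropped and the previous spec is kept.
    return spec, "max_rounds", max_rounds


plans = [frozenset({"ticks", "legend", "contrast"}), frozenset({"legend", "title"})]
final_spec, verdict, rounds = run_harness(plans)
print(verdict, rounds, sorted(final_spec))  # accept 2 []
```

The `max_rounds` cap stands in for the judge's job of stopping the loop: even if every proposal were reverted, the harness would terminate rather than cycle forever.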

How much does each piece actually carry? Crafter's ablations strip one component at a time and measure the drop on PaperBanana-Bench:

Component removed	Falls back to	Score change (points)
Typed edits	Free-text revision	−8.90
Diversity-driven plans	A single plan (K=1)	−8.56
Refinement loop	One-shot generation	−5.48
Directive critic	Scalar score	−5.04

Walk the numbers. The bare image backend scores 33.73; the full harness reaches 50.34, a +16.61 lift. The two heaviest contributors are typed edits and plan exploration: kill typed edits and you fall 8.90 points back toward free-text chaos; kill the K-plan search and you lose 8.56. The refinement loop and the directive critic are worth another 5.48 and 5.04. The four ablations aren't additive: each removes a single component on its own, and the four drops sum to 27.98 points, more than the 16.61 total lift, so the components depend on one another rather than stacking independently. The lesson still holds: the gain isn't one trick, it's the directive critic, the typed edits, and the parallel planning composing. And it is cheap enough to run the whole loop: about $0.25 per figure, or the full 279-sample CraftBench for under $90.
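The arithmetic is easy to check directly. The short Python snippet below uses only the figures quoted in this article; the flat $0.25 per-figure price is the article's rounded estimate, so the cost lines are approximations rather than reported totals.

```python
baseline, full = 33.73, 50.34
lift = round(full - baseline, 2)

# Leave-one-out drops from the ablation table, in points.
drops = {
    "typed edits": 8.90,
    "diversity-driven plans": 8.56,
    "refinement loop": 5.48,
    "directive critic": 5.04,
}
total_drop = round(sum(drops.values()), 2)

# Approximate cost at a flat $0.25 per figure (an estimate, not a reported total).
cost_per_figure = 0.25
craftbench_cost = 279 * cost_per_figure
paperbanana_cost = 292 * cost_per_figure

print(lift)              # 16.61
print(total_drop)        # 27.98
print(craftbench_cost)   # 69.75
print(paperbanana_cost)  # 73.0
```

The drops sum to more than the total lift, which is why they cannot be read as independent, stackable contributions. The 279-sample estimate of $69.75 is consistent with the "under $90" figure.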

The takeaway travels well beyond figures. Whenever you put a model inside a refine loop, the bottleneck is rarely the generator — it is whether the critic can say what to fix. A score ranks; a directive moves the work forward.


Related explainers

Maestro — RL orchestrator over frozen experts — a different way to coordinate specialists: an RL policy that routes, where Crafter's harness critiques and refines.
EFC — feedback quality predicts agent success — the scaling-law evidence behind Crafter's bet: better feedback, not more compute, is what moves the needle.

