OpenSCAD Pantheon benchmark — HITL vs autonomous
The news. On May 21, 2026, ModelRift published the OpenSCAD LLM benchmark: six agentic coding tools given the same prompt plus two reference images, asked to produce a Pantheon model in OpenSCAD. Best autonomous: Antigravity 2.0 + Gemini 3.5 Flash High at 4.5/5 in ~12 min. Best HITL: ModelRift + Gemini Flash 3.0 at 3.8/5 in ~10 min. Codex 5.5 High hit 3.0/5, Claude Sonnet ran 2-3× slower than Codex, and Cursor Composer was fastest but weakest at 1.4/5. Numbers are reviewer-scored; the post does not publish full transcripts or per-iteration traces.
Picture the metaphor. A learner in a driving lesson can either drive solo — head down, no one to grab the wheel — or drive with an instructor in the passenger seat who has a brake pedal. The solo learner gets to the destination faster on average, but if they pick the wrong turn at minute three, that wrong turn carries through to the end. The instructor-paired learner pauses at intersections, lets the instructor weigh in, and ends up with a route that's safer to defend but slower to take. The Pantheon benchmark is exactly this lesson played out on an OpenSCAD model: same prompt, same reference images, two different control modes for the same agent loop.
The mechanism is straightforward. Each tool was given the same prompt — "build a Pantheon model in OpenSCAD from these two reference images" — and ran its iteration loop. In autonomous mode, the agent generated OpenSCAD source, called the CLI to render it, looked at the render, decided what to refine, and repeated until it decided the model was done. In HITL mode, the loop paused at named checkpoints — typically after each render — for the human to approve, edit, or redirect before the next iteration. Both winners ran on the Gemini Flash family (Antigravity on Gemini 3.5 Flash High, ModelRift on Gemini Flash 3.0); the tools differed in how they wrapped that model in the loop.
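The two control modes can be sketched as one loop with an optional pause. This is a minimal illustration, not any tool's actual implementation: the benchmark does not publish loop internals, so `run_loop`, `Review`, and the `generate`, `render`, `critique`, and `checkpoint` callables are all hypothetical names. In a real setup, `render` would presumably wrap a call to the OpenSCAD command line; here it is left injectable so the sketch runs without OpenSCAD installed.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Review:
    """Human decision at a checkpoint (hypothetical shape)."""
    approve: bool = False           # accept the current model and stop
    redirect: Optional[str] = None  # steering text for the next iteration


def run_loop(
    generate: Callable[[str, Optional[str]], str],   # (prompt, feedback) -> OpenSCAD source
    render: Callable[[str], str],                     # source -> render artifact (e.g. image path)
    critique: Callable[[str], Tuple[bool, str]],      # render -> (agent thinks done?, self-feedback)
    prompt: str,
    checkpoint: Optional[Callable[[str, str], Review]] = None,  # None means autonomous
    max_iters: int = 10,
) -> Tuple[str, int]:
    """Return (final source, iterations used)."""
    feedback: Optional[str] = None
    source = ""
    for i in range(1, max_iters + 1):
        source = generate(prompt, feedback)
        image = render(source)
        done, feedback = critique(image)

        if checkpoint is not None:
            # HITL mode: pause after each render for the human.
            review = checkpoint(source, image)
            if review.approve:
                return source, i
            if review.redirect:
                # A redirect overrides the agent's own feedback and its "done" call.
                feedback = review.redirect
                done = False

        if done:
            return source, i
    return source, max_iters
```

Passing `checkpoint=None` gives the autonomous mode described above; passing a callable gives the HITL mode with a pause after every render. Keeping both modes in one function makes the only difference between them explicit: who is allowed to stop or redirect the loop.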
The headline finding is that the best autonomous tool beat the best HITL tool on quality: 4.5/5 vs 3.8/5. The common assumption is that a human in the loop raises quality (more eyes, more course-correction). On this task it didn't, though with a single reviewer-scored run per tool and no replicates published, the gap should be read as suggestive, not statistically certain. The two runs also used different models (Gemini 3.5 Flash High vs Gemini Flash 3.0), so this is not a clean apples-to-apples isolation of HITL itself; what it does compare is the end-to-end deliverable that a team would actually ship using each tool. (If you want the strict HITL-vs-autonomous A/B on identical models with replicates, this benchmark doesn't give it to you; that experiment is still missing.)
There's a second finding hiding under the first: HITL was faster here, not slower. ~10 min for ModelRift vs ~12 min for Antigravity 2.0. The conventional reading of HITL is "more careful, takes longer." This run flipped that — the human pauses were short enough, and the redirects shortcutting bad branches were valuable enough, that total wall-clock dropped. The reading is not "HITL is always faster"; it's that for a task with a clear visual target (the Pantheon, with two reference images), a few well-placed human redirects can save more iteration cycles than they cost.
Where the time and quality actually go
The math below uses the benchmark's reported totals plus illustrative per-iteration breakdowns. The benchmark publishes wall-clock totals and final scores only; the iteration counts and per-step times are stylized estimates picked to match the reported totals, not numbers from a published trace.
Antigravity 2.0 (autonomous), total ~12 min. Treat that as roughly 5 illustrative iterations × ~2.4 min/iteration, with no pauses and no human input. The 4.5/5 score reflects that the autonomous loop, given enough iterations, converged on architectural correctness (the dome curve, the column count, the pediment proportion) without anyone steering it.
ModelRift (HITL), total ~10 min. Treat that as roughly 3 illustrative iterations × ~2.7 min + 2 human reviews × ~0.4 min ≈ 8.9 min of model and review time, plus about a minute of unattributed overhead. The 3.8/5 score reflects that human redirects can shortcut some bad branches early (saving iterations relative to the autonomous run), but the final model still trailed on detail in this run.
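The stylized breakdown can be checked with a few lines of arithmetic. The iteration counts and per-step times are the illustrative estimates from the text, not measured values, and `loop_minutes` is a hypothetical helper name.

```python
def loop_minutes(iterations, minutes_per_iteration, reviews=0, minutes_per_review=0.0):
    """Wall-clock estimate: model iterations plus human review pauses."""
    return iterations * minutes_per_iteration + reviews * minutes_per_review


# Stylized estimates chosen to match the reported ~12 min and ~10 min totals.
autonomous = loop_minutes(iterations=5, minutes_per_iteration=2.4)
hitl = loop_minutes(iterations=3, minutes_per_iteration=2.7,
                    reviews=2, minutes_per_review=0.4)

# Whatever the HITL breakdown does not explain of the reported ~10 min total.
hitl_overhead = 10.0 - hitl

print(f"autonomous: {autonomous:.1f} min")        # autonomous: 12.0 min
print(f"HITL loop:  {hitl:.1f} min")              # HITL loop:  8.9 min
print(f"HITL unattributed: {hitl_overhead:.1f} min")  # HITL unattributed: 1.1 min
```

On these assumed numbers, the human reviews account for under a minute of the HITL run; most of the wall-clock difference comes from running three iterations instead of five.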
The cost story is different from the quality story. Autonomous wins on quality and HITL wins on wall-clock, but HITL also spends human attention, and that spend pays off only when it catches mistakes that would otherwise cost a full iteration to fix downstream. If your iteration is cheap (seconds, not minutes), the autonomous loop's "just iterate more" strategy dominates. If your iteration is expensive (long renders, paid API calls, slow CI), each human-saved iteration is worth more, and HITL pays back.
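The cheap-versus-expensive iteration argument reduces to a break-even comparison. The sketch below assumes a deliberately simple model in which every human review has a fixed cost and every saved iteration has a fixed cost, both in the same unit (minutes here, but dollars would work the same way). The function names and the example figures are illustrative assumptions, not benchmark data.

```python
def hitl_net_saving(iterations_saved, iteration_cost, reviews, review_cost):
    """Positive result: human redirects saved more than the reviews cost."""
    return iterations_saved * iteration_cost - reviews * review_cost


def break_even_iteration_cost(iterations_saved, reviews, review_cost):
    """Iteration cost above which HITL pays back under this simple model."""
    if iterations_saved <= 0:
        return float("inf")  # reviews that save nothing never pay back
    return reviews * review_cost / iterations_saved


# Assumed scenario: 2 reviews at 0.4 min each save 2 iterations.
expensive = hitl_net_saving(2, iteration_cost=2.4, reviews=2, review_cost=0.4)
cheap = hitl_net_saving(2, iteration_cost=5 / 60, reviews=2, review_cost=0.4)
threshold = break_even_iteration_cost(2, reviews=2, review_cost=0.4)

print(f"multi-minute iterations: {expensive:+.2f} min")
print(f"five-second iterations:  {cheap:+.2f} min")
print(f"break-even iteration cost: {threshold:.2f} min")
```

With multi-minute iterations the reviews come out ahead; with five-second iterations the same reviews cost more than they save, which is the "just iterate more" regime.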
What changes for production agent design
The benchmark forces a decision teams usually leave implicit: which control mode is your default for which task class? The shape of the answer has been emerging across other Agent Engineering work — the Decision Rule step frames it as a tradeoff between autonomy, controllability, and cost, and the Cost Profile step walks through where the tokens go under each mode. This benchmark gives you concrete numbers to plug in.
| Decision input | Autonomous wins when | HITL wins when |
|---|---|---|
| Task structure | well-defined target, the model has converged on similar tasks before, mistakes are cheap to discover at the end | target is fuzzy or shifts as work progresses, mistakes are expensive to discover late, the human can articulate preferences faster than they can write them down up front |
| Iteration cost | iterations are cheap (seconds, low-cost API calls); "just iterate more" is viable | iterations are expensive (long renders, paid runs, slow CI); human redirects save more than they cost |
| Reviewability | each iteration produces something a model can self-evaluate (renders, test results, type checks) | each iteration produces something only a human can score (visual judgment, taste, domain expertise) |
| Human time budget | operator is unavailable or expensive (off-hours, batch jobs) | operator is present and available; the wall-clock saved is worth their pause time |
| Failure cost | output is reversible — re-run, regenerate, throw away | output is irreversible — sent emails, executed trades, deployed code; see also the Lethal Trifecta lens |
The honest take after this benchmark is that autonomous is competitive on quality for tasks with clear visual targets — a fact that wasn't obvious a generation ago and that this benchmark puts a number on. HITL still wins where you genuinely can't define the target up front, where iteration is expensive, or where the failure cost dominates. The default that fits most production teams is probably autonomous-first with HITL escape hatches at the failure-cost-sensitive checkpoints — not HITL-throughout — and this benchmark is one data point pushing in that direction.
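One way to picture "autonomous-first with HITL escape hatches" is a per-checkpoint policy that reads two rows of the table: failure cost (reversibility) and reviewability (whether the agent can score its own output). This is a hedged sketch. The `Checkpoint` fields, the mode names, and the choice to hold irreversible steps when no operator is available are assumptions made for illustration, not something the benchmark prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    name: str
    reversible: bool      # can the output be re-run, regenerated, or thrown away?
    self_evaluable: bool  # can the agent score it (renders, tests, type checks)?


def control_mode(cp: Checkpoint, operator_available: bool) -> str:
    """Pick a control mode for one checkpoint (hypothetical policy)."""
    if not cp.reversible:
        # Escape hatch: irreversible steps never run unattended in this sketch.
        return "human_approval" if operator_available else "hold"
    if not cp.self_evaluable and operator_available:
        # Only a human can score this output, and one is present.
        return "human_review"
    # Default: let the loop iterate on its own.
    return "autonomous"


pipeline = [
    Checkpoint("render preview", reversible=True, self_evaluable=True),
    Checkpoint("judge visual style", reversible=True, self_evaluable=False),
    Checkpoint("deploy code", reversible=False, self_evaluable=True),
]

for cp in pipeline:
    print(cp.name, "->", control_mode(cp, operator_available=True))
```

The default branch is autonomous, and human involvement is triggered only by specific properties of a checkpoint, which is the opposite of pausing after every iteration.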
Related explainers
- Agentic CLEAR — System/Trace/Node eval granularity — companion piece on evaluating agent runs across the same three abstraction levels, complementary to picking the right control mode.
- Maestro — RL orchestrator over frozen experts — another agent-design pattern that competes with the autonomous-single-agent baseline this benchmark used.