The news. On June 25, 2026, DeepReinforce released Ornith-1.0, a family of MIT-licensed open-weight coding models (9B Dense, 31B Dense, 35B MoE, and 397B MoE) built on top of pretrained Gemma 4 and Qwen 3.5. The headline idea is self-scaffolding: instead of relying on a human-designed harness during reinforcement learning, the model learns to generate both the task-specific scaffold and the solution. At flagship scale, Ornith-1.0-397B reports 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified, state of the art among comparable open models.
Picture a carpenter who never reaches straight for the saw. Before each cut, they build a jig: a small custom guide clamped to the wood so the blade lands exactly where it should. Build the jig once, and the cut becomes fast, repeatable, hard to botch. A training scaffold is that jig: the task-specific setup wrapped around a model, meaning how its output is read, what counts as a step, and the structure it works within. The tools and environment around it stay fixed. For years that jig was hand-built by an engineer, and it broke the moment a task wandered off the cases it was tuned for. That is exactly why a hand-built harness gets brittle in production: real tasks don't hold still.
Ornith's move is to make the jig something the model learns to build. Each reinforcement-learning step now runs in two passes, not one. First the model looks at the task and the jig it used last time, and writes a refined scaffold fitted to this specific task — the way a carpenter shapes a new guide for an odd-angled cut. Then it solves against that scaffold, producing a rollout the verifier scores. Over training, the model gets better at both halves at once: better at pausing to set up before it acts, and better at the solve itself. The human who used to wire the harness by hand is out of the loop.
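The two passes described above can be sketched in a few lines. This is a minimal illustration, not Ornith's training code: `two_pass_step`, `StepResult`, and the three `toy_*` functions are hypothetical names, and the toy functions are plain string stand-ins for the model and the verifier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    scaffold: str   # the refined, task-specific scaffold from pass one
    rollout: str    # the solution attempt produced in pass two
    reward: float   # the verifier's score for that rollout


def two_pass_step(
    task: str,
    prev_scaffold: str,
    write_scaffold: Callable[[str, str], str],
    solve: Callable[[str, str], str],
    verify: Callable[[str, str], float],
) -> StepResult:
    """One illustrative step: refine the scaffold, then solve against it."""
    # Pass 1: look at the task and the previous scaffold, write a refined one.
    scaffold = write_scaffold(task, prev_scaffold)
    # Pass 2: solve the task inside that scaffold.
    rollout = solve(task, scaffold)
    # The verifier scores the rollout; that score is the training signal.
    reward = verify(task, rollout)
    return StepResult(scaffold, rollout, reward)


# Toy stand-ins (assumptions) so the sketch runs end to end.
def toy_write_scaffold(task: str, prev: str) -> str:
    note = f"plan-for:{task}"
    return f"{prev}|{note}" if prev else note


def toy_solve(task: str, scaffold: str) -> str:
    return f"[{scaffold}] solution to {task}"


def toy_verify(task: str, rollout: str) -> float:
    return 1.0 if task in rollout else 0.0


result = two_pass_step("fix-bug-17", "", toy_write_scaffold, toy_solve, toy_verify)
print(result.scaffold, result.reward)
```

The point of the shape is that the scaffold is an output of the step, not an input fixed in advance: the same policy produces both `scaffold` and `rollout`, so a single reward can improve both halves.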
That raises the obvious worry: if the model is allowed to rewrite its own jig, what stops it from rewriting the safety rules too? Ornith's answer is to keep every safety control outside the model, where the model can't touch it. The environment and tool surface are frozen and immutable — the carpenter's bench is bolted to the floor. A deterministic monitor enforces the trust boundary by fixed rule, like a mechanical blade guard that physically blocks an unsafe move rather than asking permission. And a frozen LLM judge sits on top of the verifier as a veto — a shop inspector who can reject a finished piece, but can't set the carpenter's pay, so it never becomes the reward the model learns to game. This is capability scoping made concrete: the model's freedom to improvise is real, but it's fenced.
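One way to picture the three external controls is as gates around the verifier's score. The sketch below is an assumption-heavy illustration: `ALLOWED_TOOLS`, `monitor_allows`, and `training_reward` are hypothetical names, the tool list is invented, and discarding a blocked or vetoed rollout (returning `None`) is an assumed policy, since the release summary above does not say what happens to rejected rollouts.

```python
from typing import Optional, Sequence

# Hypothetical frozen tool surface; the model cannot add to or edit this set.
ALLOWED_TOOLS = frozenset({"read_file", "run_tests", "apply_patch"})


def monitor_allows(tool_calls: Sequence[str]) -> bool:
    """Deterministic monitor: a fixed rule, no model in the loop."""
    return all(call in ALLOWED_TOOLS for call in tool_calls)


def training_reward(
    verifier_score: float,
    tool_calls: Sequence[str],
    judge_vetoes: bool,
) -> Optional[float]:
    """Return the verifier's score, or None if the rollout is rejected.

    The judge only contributes a yes/no veto. It never supplies a number,
    so there is no judge score for the policy to optimize against.
    """
    if not monitor_allows(tool_calls):
        return None  # blocked at the trust boundary (assumed: rollout discarded)
    if judge_vetoes:
        return None  # rejected by the frozen judge (assumed: rollout discarded)
    return verifier_score  # the reward is the verifier's score, unchanged


print(training_reward(0.8, ["read_file", "run_tests"], judge_vetoes=False))
```

Note what the model-written scaffold cannot influence here: the allow-list, the monitor's rule, and the judge all live outside anything the policy emits.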
| Approach | Who writes the task scaffold | When it's set | Where safety lives |
|---|---|---|---|
| Hand-built harness | A human engineer, per task / environment | Before training, then fixed | Whatever the human coded in |
| Self-scaffolding (Ornith) | The model itself, refined each RL step | During training, per task | Outside the model: frozen environment + deterministic monitor + frozen-judge veto |
The two-pass step isn't free, and it's worth seeing the cost concretely (the numbers here are illustrative: the release reports benchmark scores, not a per-step token trace). Picture one training step. The old way is a single rollout: read the task, write a solution, call it 8,000 tokens. Ornith's step runs in two passes: first it writes the jig, a short task-specific scaffold of roughly 1,500 tokens, then it solves against it for another 8,000. So a step costs about 9,500 tokens instead of 8,000, a ~19% surcharge for letting the model build its own jig. The bet is that the jig pays for itself: a solve guided by a task-fitted scaffold succeeds more often than a solve inside a generic one, so fewer steps are wasted over the whole run.
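The arithmetic above, plus a rough break-even check, can be worked through in a short script. The token counts are the illustrative ones from the paragraph; the 40% baseline success rate and the cost model (tokens per successful solve, with a fixed success rate per step) are assumptions added here, not figures from the release.

```python
def step_cost(solve_tokens: int, scaffold_tokens: int = 0) -> int:
    """Tokens spent in one training step."""
    return solve_tokens + scaffold_tokens


baseline = step_cost(8_000)           # single rollout
two_pass = step_cost(8_000, 1_500)    # scaffold pass + solve pass
surcharge = two_pass / baseline - 1   # fractional extra cost per step


def break_even_success_rate(baseline_rate: float, surcharge: float) -> float:
    """Success rate the two-pass step needs to match the baseline's
    tokens-per-successful-solve (cost / success rate).

    Setting two_pass / r2 == baseline / r1 gives r2 = r1 * (1 + surcharge).
    """
    return baseline_rate * (1 + surcharge)


# Hypothetical baseline: 40% of single-rollout steps succeed.
needed = break_even_success_rate(0.40, surcharge)

print(f"step cost: {baseline} -> {two_pass} tokens")
print(f"surcharge: {surcharge:.1%}")
print(f"break-even success rate: {needed:.1%}")
```

Under these assumptions the scaffold pays for itself only if it lifts the success rate by the same proportion as the surcharge, so a roughly 19% costlier step needs roughly 19% more successful solves to break even.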
The honest read is that Ornith reports its wins as benchmark scores on the flagship model, not as an ablation of the surcharge. But the open weights are the quiet payoff: an MIT-licensed family from 9B to 397B, built on Gemma 4 and Qwen 3.5, means self-scaffolding isn't a closed lab trick — it's a usable target you can fine-tune, probe, and build harness and eval demos against. The brittle, hand-tuned layer that used to cap agent performance becomes, here, one more thing the model learns to make.
Related explainers
- HarnessBridge — learned agent harness vs hand-engineered — learns the interface between agent and environment; Ornith instead learns to generate the scaffold during RL
- Harness-1 — state-externalizing search harness — keeps working memory outside the transcript; a different way the harness, not the model, does the heavy lifting
- CacheRL — cached rollouts for agent RL — another agent-RL training trick, this one cutting the cost of the rollouts themselves