What is self-scaffolding RL?

Self-scaffolding RL is a training recipe in which the model learns to write its own task-specific scaffold (the setup wrapped around it, such as the step structure and output format) instead of relying on a human-built harness, while the tools and environment around it stay fixed. In Ornith-1.0, each reinforcement-learning step runs in two passes: the model first proposes a scaffold refined for the current task, then generates a solution against that scaffold, and a verifier scores the solution. Over training, the model improves at both building the scaffold and solving against it.

Why let the model write its own scaffold?

Because an agent's performance is capped not just by the model but by the harness wrapped around it, and hand-engineering a fresh harness for every task does not scale — every new environment needs an engineer to re-wire it, and the hand-built version gets brittle as tasks wander off the cases it was tuned for. Letting the model generate and refine its own per-task scaffold removes that human bottleneck and lets the setup adapt to each task instead of being fixed in advance.

How does Ornith keep a self-modifying agent safe?

By keeping every safety control outside the model, where the model cannot rewrite it. The environment and tool surface are frozen and immutable, a deterministic monitor enforces the trust boundary by fixed rule rather than judgment, and a frozen LLM judge acts as a veto on top of the verifier rather than as the primary reward. The model is free to improvise its scaffold, but it cannot move the fence — so it cannot game the safety layer the way it could if the reward itself were a model it influenced.

Ornith-1.0 ships open MIT-licensed coding models — Self-scaffolding RL

Jargon

RL (Reinforcement Learning): Training by trial and reward: the model attempts a task, an automatic check scores the attempt, and the score nudges its weights toward attempts that score higher.
Scaffold / harness: The task-specific setup wrapped around the model — how its output is parsed, what counts as a step, the structure and instructions for this task. In Ornith the model writes this per-task scaffold, while the tools and environment around it stay fixed.
Self-scaffolding: The model generates its own scaffold rather than using a human-built one, and refines it as part of training — a jig it builds fresh for each task.
Rollout: One attempt the model generates for a task — here, one full solution produced against the current scaffold, which the verifier then scores.
Verifier: The automatic check that decides whether a rollout passed (tests green, task done). Its pass/fail signal is the primary reward in training.
Deterministic monitor: A fixed-rule safety check — not a judgment call. It enforces the trust boundary the same way every time, so the model can't talk its way past it.
Frozen LLM judge (veto): A separate model that can reject a result on top of the verifier, but is held fixed and never sets the reward — a veto, not the scoreboard.
Terminal-Bench 2.1 / SWE-Bench Verified: Two hard coding benchmarks — driving a real command line, and resolving real GitHub issues — used to rank agentic coding models.

The news. On June 25, 2026, DeepReinforce released Ornith-1.0, a family of MIT-licensed open-weight coding models (9B Dense, 31B Dense, 35B MoE, and 397B MoE) built on top of pretrained Gemma 4 and Qwen 3.5. The headline idea is self-scaffolding: instead of relying on a human-designed harness during reinforcement learning, the model learns to generate both the task-specific scaffold and the solution. At flagship scale, Ornith-1.0-397B reports 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified, state of the art among comparable open models.

Picture a carpenter who never reaches straight for the saw. Before each cut, they build a jig: a small custom guide clamped to the wood so the blade lands exactly where it should. Build the jig once, and the cut becomes fast, repeatable, hard to botch. A training scaffold is that jig: the task-specific setup wrapped around a model, meaning how its output is read, what counts as a step, and the structure it works within. The tools and environment around it stay fixed. For years that jig was hand-built by an engineer, and it fit only the cuts it was designed for: the moment a task wandered off the cases it was tuned for, it stopped helping. That is exactly why a hand-built harness gets brittle in production: real tasks don't hold still.

Ornith's move is to make the jig something the model learns to build. Each reinforcement-learning step now runs in two passes, not one. First the model looks at the task and the jig it used last time, and writes a refined scaffold fitted to this specific task — the way a carpenter shapes a new guide for an odd-angled cut. Then it solves against that scaffold, producing a rollout the verifier scores. Over training, the model gets better at both halves at once: better at pausing to set up before it acts, and better at the solve itself. The human who used to wire the harness by hand is out of the loop.
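The two passes can be sketched in a few lines of Python. This is a minimal illustration of the control flow described above, not DeepReinforce's implementation: the names `two_pass_step`, `StepResult`, and the three callables are hypothetical, the toy stand-ins replace real model and verifier calls, and the binary pass/fail reward is an assumption based on the verifier description in the jargon list.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    scaffold: str  # output of pass 1: the task-fitted scaffold
    rollout: str   # output of pass 2: the solution attempt
    reward: float  # verifier score for the rollout (assumed binary here)


def two_pass_step(
    task: str,
    previous_scaffold: str,
    propose_scaffold: Callable[[str, str], str],  # hypothetical model call
    solve: Callable[[str, str], str],             # hypothetical model call
    verify: Callable[[str, str], bool],           # hypothetical verifier
) -> StepResult:
    # Pass 1: refine the scaffold from the task and the scaffold used last time.
    scaffold = propose_scaffold(task, previous_scaffold)
    # Pass 2: solve the task against that scaffold, producing one rollout.
    rollout = solve(task, scaffold)
    # The verifier scores the rollout; pass/fail is mapped to 1.0 / 0.0.
    reward = 1.0 if verify(task, rollout) else 0.0
    return StepResult(scaffold=scaffold, rollout=rollout, reward=reward)


# Toy stand-ins so the sketch runs without a model.
def toy_propose(task: str, previous: str) -> str:
    return f"{previous}\n# fitted to: {task}".strip()


def toy_solve(task: str, scaffold: str) -> str:
    return f"solution for {task} using {len(scaffold)} scaffold chars"


def toy_verify(task: str, rollout: str) -> bool:
    return task in rollout


result = two_pass_step("fix failing test", "", toy_propose, toy_solve, toy_verify)
print(result.scaffold)
print(result.reward)
```

In a real training loop the reward would feed a policy update covering both passes, which is how the model would improve at scaffold-building and solving together. That update is omitted here because the source does not describe it.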

That raises the obvious worry: if the model is allowed to rewrite its own jig, what stops it from rewriting the safety rules too? Ornith's answer is to keep every safety control outside the model, where the model can't touch it. The environment and tool surface are frozen and immutable — the carpenter's bench is bolted to the floor. A deterministic monitor enforces the trust boundary by fixed rule, like a mechanical blade guard that physically blocks an unsafe move rather than asking permission. And a frozen LLM judge sits on top of the verifier as a veto — a shop inspector who can reject a finished piece, but can't set the carpenter's pay, so it never becomes the reward the model learns to game. This is capability scoping made concrete: the model's freedom to improvise is real, but it's fenced.
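One way to picture how these three controls combine is a reward gate in which the verifier is the only source of reward and the other two checks can only take it away. The sketch below is an assumption-laden illustration: `FROZEN_TOOLS`, `monitor_allows`, and `gated_reward` are hypothetical names, the tool list is invented, and zeroing the reward on a monitor violation is a guess, since the source does not say how a violation is handled.

```python
from typing import Iterable

# Hypothetical frozen tool surface; defined outside the model and immutable.
FROZEN_TOOLS = frozenset({"read_file", "run_tests", "apply_patch"})


def monitor_allows(tool_calls: Iterable[str]) -> bool:
    """Deterministic monitor: a fixed allow-list rule, no judgment involved."""
    return all(call in FROZEN_TOOLS for call in tool_calls)


def gated_reward(
    tool_calls: Iterable[str],
    verifier_passed: bool,
    judge_vetoes: bool,
) -> float:
    # Fixed-rule check first: any call outside the frozen surface is rejected.
    if not monitor_allows(tool_calls):
        return 0.0
    # The verifier is the primary reward signal.
    if not verifier_passed:
        return 0.0
    # The frozen judge can only veto a passing result; it never adds reward.
    if judge_vetoes:
        return 0.0
    return 1.0


print(gated_reward(["read_file", "run_tests"], verifier_passed=True, judge_vetoes=False))
print(gated_reward(["read_file", "run_tests"], verifier_passed=False, judge_vetoes=False))
print(gated_reward(["delete_repo"], verifier_passed=True, judge_vetoes=False))
```

The design point is the asymmetry: no input from the judge can turn a failing rollout into a rewarded one, so the judge has nothing for the model to optimize toward.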

Approach	Who writes the task scaffold	When it's set	Where safety lives
Hand-built harness	A human engineer, per task / environment	Before training, then fixed	Whatever the human coded in
Self-scaffolding (Ornith)	The model itself, refined each RL step	During training, per task	Outside the model: frozen environment + deterministic monitor + frozen-judge veto

The two-pass step isn't free, and it's worth seeing the cost concretely (illustrative: the release reports benchmark scores, not this per-step token trace). Picture one training step. The old way is a single rollout: read the task, write a solution, call it 8,000 tokens. Ornith's step runs in two passes: first it writes the jig, a short task-specific scaffold of roughly 1,500 tokens, then it solves against it for another 8,000. So a step costs about 9,500 tokens instead of 8,000, a surcharge of 1,500 / 8,000 = 18.75%, or roughly 19%, for letting the model build its own jig. The bet is that the jig pays for itself: a solve guided by a task-fitted scaffold succeeds more often than a solve guided by a generic one, so there are fewer wasted steps over the whole run.
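The arithmetic, and the break-even point it implies, can be checked with a short script. The token counts are the illustrative figures from the paragraph above, not measurements from the release. The break-even model (tokens spent per passing rollout, with every attempt costing the same) and the 40% baseline pass rate are simplifying assumptions added here.

```python
BASELINE_SOLVE_TOKENS = 8_000  # illustrative single-pass rollout
SCAFFOLD_TOKENS = 1_500        # illustrative scaffold-writing pass


def step_cost(scaffold_tokens: int, solve_tokens: int) -> int:
    """Total tokens for one step: scaffold pass plus solve pass."""
    return scaffold_tokens + solve_tokens


def tokens_per_pass(cost_per_step: float, pass_rate: float) -> float:
    """Expected tokens spent per passing rollout (assumed cost model)."""
    return cost_per_step / pass_rate


two_pass_cost = step_cost(SCAFFOLD_TOKENS, BASELINE_SOLVE_TOKENS)
surcharge = two_pass_cost / BASELINE_SOLVE_TOKENS - 1

# Two-pass wins when two_pass_cost / p_two < baseline_cost / p_base,
# i.e. when p_two / p_base exceeds the cost ratio below.
break_even_ratio = two_pass_cost / BASELINE_SOLVE_TOKENS

hypothetical_base_rate = 0.40  # assumed baseline pass rate, for illustration only
required_rate = hypothetical_base_rate * break_even_ratio

print(f"two-pass cost: {two_pass_cost} tokens")
print(f"surcharge: {surcharge:.2%}")
print(f"pass rate needed to break even: {required_rate:.3f}")
```

Under this simple model, the scaffold pays for itself once it lifts the pass rate by more than the surcharge in relative terms: from 40% to just above 47.5% in the hypothetical case.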

The honest read is that Ornith reports its wins as benchmark scores on the flagship model, not as an ablation of the surcharge. But the open weights are the quiet payoff: an MIT-licensed family from 9B to 397B, built on Gemma 4 and Qwen 3.5, means self-scaffolding isn't a closed lab trick. It's a usable target you can fine-tune, probe, and use as the basis for harness and eval demos. The brittle, hand-tuned layer that used to cap agent performance becomes, here, one more thing the model learns to make.

Related explainers

HarnessBridge — learned agent harness vs hand-engineered — learns the interface between agent and environment; Ornith instead learns to generate the scaffold during RL
Harness-1 — state-externalizing search harness — keeps working memory outside the transcript; a different way the harness, not the model, does the heavy lifting
CacheRL — cached rollouts for agent RL — another agent-RL training trick, this one cutting the cost of the rollouts themselves
