The news. On July 1, 2026, researchers released Dockerless (arXiv 2606.28436), a way to verify coding-agent patches without per-repository containers. Its starting point: standard execution-based verification “requires running unit tests inside per-repository environments such as Docker images, incurring substantial environment setup costs.” Dockerless replaces that with an environment-free judge that explores the repository agentically and gathers evidence to decide correctness. The post-trained model reaches a 62.0% resolve rate on SWE-bench Verified (50.0% on Multilingual, 35.2% on Pro), surpassing the Qwen3.5-9B baseline by 2.4, 8.7, and 2.9 points respectively and matching environment-based post-training.
Picture a pull request landing in your review queue. The by-the-book way to approve it is to spin up CI: let it build the whole project, then run every unit test. Thorough — but each repository needs its own build environment, and standing that environment up is the slow, costly part. A senior reviewer often does something else entirely. They read the diff, open the files around it, trace how the change ripples through the code, and sign off by reasoning — no CI run at all.
That second move is exactly what Dockerless does for coding agents. To train or grade an agent, you have to know whether the patch it generated really fixes the bug. The standard check decides pass or fail by running the repo’s unit tests inside a per-repository Docker image — and building that image, once per repo, is where the cost lives. Dockerless drops the container entirely: an execution-free judge explores the repository agentically, gathers evidence about the change, and decides correct-or-not by reasoning, not by running tests.
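To make the "read, don't run" idea concrete, here is a minimal sketch of what a judge's control flow could look like. It is not the paper's implementation: the names `Evidence`, `touched_files`, `gather_evidence`, and `judge_patch` are hypothetical, the exploration is reduced to reading the files a unified diff touches, and the `decide` callable stands in for the reasoning model.

```python
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class Evidence:
    """One piece of read-only evidence: a file path and a slice of its text."""
    path: str
    snippet: str


def touched_files(patch: str) -> List[str]:
    """Return the paths named in unified-diff '+++ b/<path>' headers."""
    return re.findall(r"^\+\+\+ b/(.+)$", patch, flags=re.MULTILINE)


def gather_evidence(repo_root: Path, patch: str, max_chars: int = 2000) -> List[Evidence]:
    """Read (never execute) the files the patch touches.

    A real judge would explore further (callers, tests, docs); this sketch
    stops at the touched files to keep the example small.
    """
    evidence = []
    for rel in touched_files(patch):
        target = repo_root / rel
        if target.is_file():
            text = target.read_text(encoding="utf-8", errors="replace")
            evidence.append(Evidence(rel, text[:max_chars]))
    return evidence


# Hypothetical signature for the reasoning step: (issue, patch, evidence) -> verdict.
Decide = Callable[[str, str, List[Evidence]], bool]


def judge_patch(repo_root: Path, patch: str, issue: str, decide: Decide) -> bool:
    """Return a correct/incorrect verdict without building or running anything."""
    evidence = gather_evidence(repo_root, patch)
    return decide(issue, patch, evidence)
```

Nothing in this flow needs a container: the only I/O is reading files, and the only verdict comes from `decide`, which in practice would be a model call rather than the stub an example would pass in.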
Here is the part that makes it more than a cheaper grader, and it is the whole idea. The same execution-free verdict does double duty in training. In supervised fine-tuning it selects which trajectories are worth learning from: keep the ones judged correct, drop the rest. In reinforcement learning it is the reward. Because neither step ever runs code, the entire post-training pipeline becomes environment-free: no per-repo containers anywhere, not for evaluation and not for reward. What used to require building evaluation infrastructure for every repository collapses into a model reading a repo.
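The double duty described above can be sketched in a few lines. This is an illustration under assumptions, not the paper's code: `Trajectory`, `select_for_sft`, and `rl_reward` are hypothetical names, the judge is any callable returning a boolean, and mapping the verdict to a binary 1.0 / 0.0 reward is an assumption (the source says only that the verdict is the reward).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Trajectory:
    """A hypothetical record of one agent run: the task and the patch it produced."""
    task_id: str
    patch: str


# Any execution-free judge: takes a trajectory, returns correct (True) or not (False).
Judge = Callable[[Trajectory], bool]


def select_for_sft(trajectories: List[Trajectory], judge: Judge) -> List[Trajectory]:
    """SFT use: keep only the trajectories the judge accepts."""
    return [t for t in trajectories if judge(t)]


def rl_reward(trajectory: Trajectory, judge: Judge) -> float:
    """RL use: the same verdict as a scalar reward (binary mapping is an assumption)."""
    return 1.0 if judge(trajectory) else 0.0


if __name__ == "__main__":
    # Toy stand-in judge for demonstration only; a real judge would reason over the repo.
    def toy_judge(t: Trajectory) -> bool:
        return "fix" in t.patch

    runs = [Trajectory("t1", "fix off-by-one"), Trajectory("t2", "rename variable")]
    print([t.task_id for t in select_for_sft(runs, toy_judge)])
    print([rl_reward(t, toy_judge) for t in runs])
```

The point of the design is that both functions take the same `judge` argument, so swapping an execution-based check for an execution-free one changes nothing else in the training loop.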
Walk the cost. Hold the base model fixed (the Qwen3.5-9B this work post-trains from) and change only how each patch gets verified. An execution-based check has two parts: standing up the environment (building a Docker image for that repo) and running the check (executing its tests). The environment build is the part that dominates. As an illustrative figure, not a measured one, call it roughly 80% of the verification bill, because it repeats for every repository. Dockerless skips both the build and the test run and pays only for the judge’s read-and-reason pass, so the environment tax disappears. It still lands 62.0% on SWE-bench Verified, which is 2.4 points above that same Qwen3.5-9B base and matches the container-based pipeline. The pattern repeats on the other splits: 50.0% on Multilingual (+8.7) and 35.2% on Pro (+2.9).
| Verification approach | How it checks a patch | Infra per repo | SWE-bench Verified resolve rate |
|---|---|---|---|
| Execution-based (Docker) | build a container, run the repo’s unit tests | a Docker image + test run per repo | matched by Dockerless (exact figure not given here) |
| Dockerless (arXiv 2606.28436) | agentic repo exploration → reason about the patch, no execution | none | 62.0% (+2.4 vs base; 50.0% / 35.2% on Multilingual / Pro) |
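The arithmetic behind the comparison can be written out as a small sketch. The resolve rates and gains are the ones reported above; the base scores are implied by subtraction rather than quoted from the paper. The cost model is purely illustrative: the repository count, patch count, and per-unit costs are made-up numbers chosen so that the environment build comes to the 80% share used as an example in the text, and the per-patch judge cost is likewise hypothetical.

```python
# Reported resolve rate and gain over the Qwen3.5-9B base, per benchmark split.
REPORTED = {
    "Verified": (62.0, 2.4),
    "Multilingual": (50.0, 8.7),
    "Pro": (35.2, 2.9),
}


def implied_base(score: float, gain: float) -> float:
    """Base-model score implied by a reported score and its gain (not a quoted figure)."""
    return round(score - gain, 1)


def execution_cost(n_repos: int, n_patches: int, build_cost: float, run_cost: float) -> float:
    """Execution-based bill: one environment build per repo plus one test run per patch."""
    return n_repos * build_cost + n_patches * run_cost


def build_share(n_repos: int, n_patches: int, build_cost: float, run_cost: float) -> float:
    """Fraction of the execution-based bill spent on environment builds."""
    total = execution_cost(n_repos, n_patches, build_cost, run_cost)
    return (n_repos * build_cost) / total


def judge_cost(n_patches: int, cost_per_patch: float) -> float:
    """Execution-free bill: one read-and-reason pass per patch, no per-repo term."""
    return n_patches * cost_per_patch


if __name__ == "__main__":
    for split, (score, gain) in REPORTED.items():
        print(split, "implied base:", implied_base(score, gain))

    # Illustrative units only: 100 repos, 1000 patches, build = 40, test run = 1.
    print("build share:", build_share(100, 1000, 40.0, 1.0))
    print("execution-based total:", execution_cost(100, 1000, 40.0, 1.0))
    # Hypothetical judge cost of 1.5 units per patch.
    print("judge total:", judge_cost(1000, 1.5))
```

The structural difference is the `n_repos * build_cost` term: it appears in the execution-based bill and is absent from the judge's, which is why the saving grows with the number of repositories rather than the number of patches.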
Because correctness is decided by reading rather than running, the training loop sheds the one component that scaled with the number of repositories, the per-repo container, without giving up accuracy on the reported benchmarks. The headline is not a new architecture or a higher ceiling. It is that the verification signal a coding agent learns from does not have to come from executed tests: a model that explores the repo and reasons can stand in for the whole environment.
Related explainers
- The Verification Horizon — co-evolving verifiers — why verifying coding agents is its own hard problem, approached by training the verifier alongside the coder
- MaxProof — defense-in-depth generative verifier — another verifier that reasons about correctness rather than only executing
- NatureBench — discovery vs reproduction — how coding agents fare on real, hard software tasks
- FutureSim — harness-level agent eval — grading an agent by its whole run, not a single answer