What is execution-free patch verification (Dockerless)?

Execution-free patch verification decides whether a coding agent's code patch is correct without running the repository's tests. Dockerless (arXiv 2606.28436) replaces the standard per-repository Docker container, in which the project is built and its unit tests are run, with a judge that explores the repository agentically, gathers evidence about the change, and reasons about correctness. It reaches a 62.0% resolve rate on SWE-bench Verified while removing the per-repo container entirely.

Why verify a coding-agent patch without running its tests?

Running tests is accurate, but it requires building and running a per-repository environment (typically a Docker image). That environment setup is the slow, expensive part, and it repeats for every repository and every patch you want to check. An execution-free judge skips the container, so evaluation and reward generation cost only the read-and-reason step. Dockerless reports that post-training on this environment-free signal matches environment-based post-training, so the speedup does not come at the cost of accuracy.

How does Dockerless relate to SWE-bench and RL reward generation?

SWE-bench is the benchmark: real GitHub issues an agent must resolve with a patch, scored by resolve rate. Dockerless's key move is that the same execution-free verdict drives both training stages: in supervised fine-tuning it selects which trajectories to learn from, and in reinforcement learning it serves as the reward. Because neither stage runs code, the whole post-training pipeline is environment-free. The resulting model posts 62.0% on Verified, 50.0% on Multilingual, and 35.2% on Pro (+2.4 / +8.7 / +2.9 over the Qwen3.5-9B baseline).

Dockerless verifies coding-agent patches without containers — Execution-free patch verification

Jargon

SWE-bench (Verified / Multilingual / Pro): A benchmark of real GitHub issues where an agent must produce a patch that resolves the issue. Verified is a human-checked subset; Multilingual and Pro are related benchmark variants reported alongside it. The score is the resolve rate: the fraction of issues the patch actually fixes.
Execution-based verification: Checking a patch by actually running the repository’s unit tests, usually inside a Docker image built for that repo. It is accurate, but the per-repo environment setup is the costly part.
Execution-free / environment-free: Judging correctness without building or running the repo — by reading and reasoning about the code. Dockerless’s whole idea: no container, no test run.
Agentic repository exploration: Letting the judge act as an agent that opens files, greps, and reads surrounding code to gather evidence about whether a change is correct.
SFT (supervised fine-tuning): Training a model on curated example trajectories. Here, the execution-free verdict selects which trajectories are good enough to learn from.
RL reward: The signal reinforcement learning optimizes. Dockerless uses the execution-free correct-or-not verdict as the reward, instead of a pass/fail from actually running tests.
Trajectory: The sequence of actions and edits an agent took to produce a patch. Trajectory selection keeps only the ones that led to a correct patch.
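
To make the resolve-rate definition concrete, here is a minimal sketch of the arithmetic. The function name and the example counts are made up for illustration; only the reported scores and gains come from the figures quoted in this explainer, and the "implied baseline" is simply score minus gain.

```python
def resolve_rate(resolved: int, total: int) -> float:
    """Percentage of issues whose patch actually resolves the issue."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * resolved / total


# Hypothetical counts, chosen only to show the calculation.
example_rate = resolve_rate(resolved=31, total=50)

# Reported Dockerless score and gain over the Qwen3.5-9B baseline, per benchmark.
reported = {
    "Verified": (62.0, 2.4),
    "Multilingual": (50.0, 8.7),
    "Pro": (35.2, 2.9),
}

# Baseline implied by subtracting the gain from the reported score.
implied_baseline = {
    name: round(score - gain, 1) for name, (score, gain) in reported.items()
}

print(example_rate)
print(implied_baseline)
```

Reading the gains this way shows that the largest jump is on Multilingual, where the implied starting point is lowest.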

The news. On July 1, 2026, researchers released Dockerless (arXiv 2606.28436), a way to verify coding-agent patches without per-repository containers. Its starting point: standard execution-based verification “requires running unit tests inside per-repository environments such as Docker images, incurring substantial environment setup costs.” Dockerless replaces that with an environment-free judge that explores the repository agentically and gathers evidence to decide correctness. It reaches 62.0% resolve rate on SWE-bench Verified (50.0% Multilingual, 35.2% Pro), surpassing the Qwen3.5-9B baseline by 2.4 / 8.7 / 2.9 points and matching environment-based post-training.

Picture a pull request landing in your review queue. The by-the-book way to approve it is to spin up CI: let it build the whole project, then run every unit test. Thorough — but each repository needs its own build environment, and standing that environment up is the slow, costly part. A senior reviewer often does something else entirely. They read the diff, open the files around it, trace how the change ripples through the code, and sign off by reasoning — no CI run at all.

That second move is exactly what Dockerless does for coding agents. To train or grade an agent, you have to know whether the patch it generated really fixes the bug. The standard check decides pass or fail by running the repo’s unit tests inside a per-repository Docker image — and building that image, once per repo, is where the cost lives. Dockerless drops the container entirely: an execution-free judge explores the repository agentically, gathers evidence about the change, and decides correct-or-not by reasoning, not by running tests.
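
As a rough picture of what "explores the repository and gathers evidence" could look like, here is a toy sketch. It is not the paper's implementation: the in-memory `Repo` dict, the `grep` helper, the `Evidence` class, and the `judge` callable are all hypothetical stand-ins, and the judge used in the demo is a trivial placeholder where a real system would call a model.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stand-in for a checked-out repository: path -> file contents.
Repo = dict[str, str]


def grep(repo: Repo, needle: str) -> list[tuple[str, int, str]]:
    """Return (path, line number, line) for every line containing `needle`."""
    hits = []
    for path, text in sorted(repo.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((path, lineno, line.strip()))
    return hits


@dataclass
class Evidence:
    notes: list[str] = field(default_factory=list)


def gather_evidence(repo: Repo, changed_symbols: list[str]) -> Evidence:
    """Collect every place the patched symbols appear, by reading only."""
    evidence = Evidence()
    for symbol in changed_symbols:
        for path, lineno, line in grep(repo, symbol):
            evidence.notes.append(f"{path}:{lineno}: {line}")
    return evidence


def verify(repo: Repo, changed_symbols: list[str],
           judge: Callable[[Evidence], bool]) -> bool:
    """Nothing is built or executed: the verdict comes from reading alone."""
    return judge(gather_evidence(repo, changed_symbols))


toy_repo = {
    "pkg/math.py": "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n",
    "pkg/ui.py": "from pkg.math import clamp\nvalue = clamp(5, 0, 3)\n",
}
evidence = gather_evidence(toy_repo, ["clamp"])
# Placeholder judge for the demo only; a real judge would reason over the notes.
verdict = verify(toy_repo, ["clamp"], judge=lambda e: len(e.notes) >= 2)
```

The point of the sketch is the shape of the loop: search, read, accumulate notes, then decide, with no build step anywhere.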

Here is the part that makes it more than a cheaper grader, and it is the whole idea. The same execution-free verdict does double duty in training. In supervised fine-tuning it selects which trajectories are worth learning from — keep the ones judged correct, drop the rest. In reinforcement learning it is the reward. Because neither step ever runs code, the entire post-training pipeline becomes environment-free: no per-repo containers anywhere, not for evaluation and not for reward. What was an offline eval you had to build infrastructure for collapses into a model reading a repo.
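
The double duty described above can be sketched in a few lines. This is an assumption-laden illustration, not the authors' code: `Trajectory`, `Judge`, `select_for_sft`, and `rl_reward` are hypothetical names, the 0/1 reward scale is an assumption, and `toy_judge` is a placeholder for the execution-free judge.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Trajectory:
    task_id: str
    patch: str  # the diff the agent produced at the end of its run


# Hypothetical judge interface: True means "judged correct", with no tests run.
Judge = Callable[[Trajectory], bool]


def select_for_sft(trajectories: Iterable[Trajectory], judge: Judge) -> list[Trajectory]:
    """SFT use: keep only the trajectories whose patch is judged correct."""
    return [t for t in trajectories if judge(t)]


def rl_reward(trajectory: Trajectory, judge: Judge) -> float:
    """RL use: the same verdict becomes the reward (binary scale assumed)."""
    return 1.0 if judge(trajectory) else 0.0


def toy_judge(trajectory: Trajectory) -> bool:
    """Placeholder for the demo only; a real judge would explore the repo."""
    return "fix" in trajectory.patch


rollouts = [
    Trajectory("issue-1", "fix: handle empty input"),
    Trajectory("issue-2", "wip: debug print"),
]
sft_set = select_for_sft(rollouts, toy_judge)
rewards = [rl_reward(t, toy_judge) for t in rollouts]
```

Both functions take the same `judge` argument, which is the whole design: one verdict source, two consumers, and no container in either path.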

Walk the cost. Hold the base model fixed (the Qwen3.5-9B this work post-trains from) and change only how each patch gets verified. Say a patch check splits into two parts: standing up the environment (building a Docker image for that repo) and running the check (executing its tests). Execution-based verification pays both, and the environment build dominates. As an illustrative figure, not a measured one, call it roughly 80% of the verification bill, because it repeats for every repository. Dockerless pays only for reading and reasoning, so it erases that environment tax and still lands 62.0% on SWE-bench Verified, 2.4 points above that same Qwen3.5-9B base and on par with the container-based pipeline. The measured pattern repeats on the other benchmarks: 50.0% Multilingual (+8.7) and 35.2% Pro (+2.9).
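
The cost walk can be written out as a back-of-the-envelope calculation. Every number here is illustrative: the 80/20 split is the hypothetical figure from the paragraph above, the cost units are arbitrary, and treating the read-and-reason step as costing the same as the test-run share is an assumption made only to keep the comparison simple.

```python
def execution_based_cost(num_patches: int, env_cost: int, run_cost: int) -> int:
    """Each check pays for the environment build and for the test run."""
    return num_patches * (env_cost + run_cost)


def execution_free_cost(num_patches: int, reason_cost: int) -> int:
    """Each check pays only for reading and reasoning; no environment."""
    return num_patches * reason_cost


# Illustrative units per patch check (not measured values).
ENV_COST = 80     # environment setup, the assumed ~80% share
RUN_COST = 20     # running the tests
REASON_COST = 20  # assumption: reading and reasoning costs about as much as the run

patches = 1_000
with_docker = execution_based_cost(patches, ENV_COST, RUN_COST)
without_docker = execution_free_cost(patches, REASON_COST)
saving_fraction = 1 - without_docker / with_docker

print(with_docker, without_docker, round(saving_fraction, 2))
```

Changing `REASON_COST` shows how sensitive the saving is to how expensive the judge itself turns out to be, which the illustrative 80% figure does not capture.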

Verification approach	How it checks a patch	Infra per repo	SWE-bench Verified resolve rate
Execution-based (Docker)	build a container, run the repo’s unit tests	a Docker image + test run per repo	not listed here; Dockerless is reported to match it
Dockerless (arXiv 2606.28436)	agentic repo exploration → reason about the patch, no execution	none	62.0% (+2.4 vs the Qwen3.5-9B base; 50.0% / 35.2% on Multilingual / Pro)

Because correctness is decided by reading rather than running, the training loop sheds the one component that scaled with the number of repositories — the per-repo container — without giving up accuracy. The headline is not a new architecture or a higher ceiling; it is that the verification signal a coding agent learns from does not actually have to come from executed tests: a model that explores the repo and reasons can stand in for the whole environment.


Related explainers

The Verification Horizon — co-evolving verifiers — why verifying coding agents is its own hard problem, approached by training the verifier alongside the coder
MaxProof — defense-in-depth generative verifier — another verifier that reasons about correctness rather than only executing
NatureBench — discovery vs reproduction — how coding agents fare on real, hard software tasks
FutureSim — harness-level agent eval — grading an agent by its whole run, not a single answer


