What is a code-as-action interface?

A code-as-action interface (also called CodeAct) lets an agent act by writing and running executable code instead of calling a fixed set of named tools. In SpatialClaw, the agent emits one Python cell at a time against a stateful kernel pre-loaded with the input frames and perception primitives, observes the cell's text and image output, then writes the next cell. This lets it freely compose operations and adapt per step, rather than being limited to the operations a tool schema exposes.

Why does SpatialClaw work without any training?

Because it changes only the agent's action interface, not the model's weights. The same off-the-shelf vision-language model is given a stateful Python kernel and asked to act one cell at a time; the gains — an average of 59.9% across 20 spatial-reasoning benchmarks, +11.2 points over the prior spatial agent — come from that observe-then-act composition, and they hold across six different VLM backbones, which is why the paper frames the interface, not a new model, as the contribution.

SpatialClaw lifts agent spatial reasoning to 59.9% — Code-as-action vs structured tool-calls

TL;DR

What it is: SpatialClaw is a training-free framework that gives a vision-language agent a code-as-action interface. Instead of calling a fixed menu of tools, the agent writes one executable Python cell per step against a stateful kernel pre-loaded with the input frames and perception primitives.
Why it’s needed: How an agent is allowed to act often matters more than which model runs it. Swapping a rigid interface for a code-based one lifts the average across 20 spatial-reasoning benchmarks to 59.9%, which is +11.2 points over the prior spatial agent, with no training at all.
vs. previous: Structured tool-calls expose a fixed schema, which limits composition, and single-pass code commits a whole script before seeing any result. Code-as-action runs one cell, observes the output, then writes the next, so the agent composes operations and adapts the analysis per task.

Jargon

VLM (vision-language model): A model that takes images (or video) plus text and reasons over both — the backbone SpatialClaw drives. The paper shows its gains hold across six different VLM backbones, so the win is the interface, not one model.
Code-as-action interface: An agent-computer interface where the agent's action is a snippet of executable code, not a call to a named tool. Also called CodeAct. The code runs and its output becomes the next observation.
Structured tool-calls: The common alternative: the agent picks from a fixed schema of named functions with typed arguments (the structured-output path). Reliable, but it can only compose the operations the schema already exposes.
Stateful kernel: A persistent Python session (think a notebook kernel) that keeps variables, loaded frames, and earlier results in memory between steps — so cell #5 can use what cell #2 computed. It is the agent's working memory.
Single-pass code: Writing one whole script up front and running it once. More flexible than a fixed schema, but the agent commits before it sees any intermediate result — no chance to adapt.
Observe-then-act loop: Run one action, read its result, then decide the next action conditioned on everything seen so far — the core of the agent loop. Code-as-action makes each loop iteration a code cell.
Training-free: The method adds no fine-tuning and no benchmark- or model-specific tuning — it changes only how the agent acts, so it drops onto an off-the-shelf VLM as-is.
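To make the stateful-kernel entry concrete, here is a minimal sketch of a persistent session built on Python's `exec`. The `StatefulKernel` class, its `run_cell` method, and the string "frames" are all hypothetical stand-ins chosen for illustration; the source does not describe how SpatialClaw's kernel is implemented.

```python
import contextlib
import io


class StatefulKernel:
    """Toy persistent session (hypothetical; not SpatialClaw's actual kernel)."""

    def __init__(self, preloaded=None):
        # One namespace dict survives across cells: this is the working memory.
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return whatever it printed as the observation."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        return buffer.getvalue()


# Placeholder strings stand in for the pre-loaded input frames.
kernel = StatefulKernel(preloaded={"frames": ["frame_0", "frame_1", "frame_2"]})

obs_1 = kernel.run_cell("n = len(frames)\nprint(n)")  # defines n
obs_2 = kernel.run_cell("print(n * 2)")               # a later cell reuses n
```

The second cell never redefines `n`; it only works because the namespace persisted between the two calls, which is the property the "cell #5 can use what cell #2 computed" description relies on.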

The news. In June 2026, researchers posted SpatialClaw to arXiv. It gives a VLM-backed agent a code action interface for open-ended 3D/4D spatial reasoning: a persistent Python kernel is pre-loaded with the input frames plus perception and geometry primitives, and the agent writes one executable cell at a time, conditioning each on all prior text and visual outputs. Across 20 spatial-reasoning benchmarks it averages 59.9%, +11.2 points over the prior spatial agent, with consistent gains across six VLM backbones and no training.

Picture a surveyor sent to measure a site they have never seen. Hand them a fixed checklist (record the width, the height, the door count) and they can only report what the form asks; an angle the form forgot is simply unmeasurable. Hand them a blank survey form to fill in up front and they must guess the whole sequence of measurements before taking a single one. What actually works is the surveyor at the site: take one reading, look at it, and let it decide the next, with every measurement written into a notebook the next reading can build on. That notebook is the agent's stateful kernel, each reading is one code cell, and "look at it, then decide" is the observe-then-act loop.

Underneath the metaphor, SpatialClaw is making a claim about an agent's action interface — the how-it-acts, not the what-model. The dominant interface today is structured tool-calls: the agent chooses a named function from a fixed schema and fills in typed arguments. It is dependable, but a fixed schema can only ever compose the operations it already lists, and open-ended spatial questions rarely decompose into the schema the designer guessed in advance. The other option, single-pass code, is more expressive but commits the entire script before any result comes back — one wrong assumption early and the whole analysis is wasted.
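The schema limit can be shown with a toy dispatcher. Everything in this sketch is an assumption made for illustration: the `TOOLS` table, the `call_tool` helper, and the three 2D points are hypothetical and do not come from the paper. The point is only that a question outside the listed operations (here, an angle) fails at dispatch, while the same question is a few lines of free-form code.

```python
import math

# Hypothetical fixed schema: the only operations the agent may call.
TOOLS = {
    "distance": lambda a, b: math.dist(a, b),
    "midpoint": lambda a, b: tuple((x + y) / 2 for x, y in zip(a, b)),
}


def call_tool(name, **arguments):
    """Dispatch a structured tool call; anything off-schema is rejected."""
    if name not in TOOLS:
        raise KeyError(f"tool {name!r} is not in the schema")
    return TOOLS[name](**arguments)


# Made-up 2D points standing in for object positions.
a, b, c = (0.0, 0.0), (3.0, 4.0), (3.0, 0.0)

d = call_tool("distance", a=a, b=b)  # on-schema: works

try:
    call_tool("angle", a=a, b=b, c=c)  # off-schema: the designer never listed it
    off_schema_ok = True
except KeyError:
    off_schema_ok = False

# The same question written as code: angle at vertex a between rays a->b and a->c.
ray_ab = math.atan2(b[1] - a[1], b[0] - a[0])
ray_ac = math.atan2(c[1] - a[1], c[0] - a[0])
angle_deg = math.degrees(ray_ab - ray_ac)
```

A real schema could of course add an `angle` tool; the sketch only illustrates that each such addition has to be anticipated by the designer in advance.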

Code-as-action keeps the expressiveness of code but adds the one thing the loop needs: feedback. The agent emits a single Python cell against a stateful kernel, runs it, reads the intermediate text and image it produced, and only then writes the next cell — so it can detect that a depth estimate looks off and recompute, or chain a detection into a measurement into a comparison without any of those being a predefined tool. Because the change lives entirely in how the agent acts rather than in the weights, it is training-free and transfers across backbones; the same loop that a fixed schema would have flattened into one blind shot becomes a stepwise plan the agent revises as it goes.
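The loop described above can be sketched with a scripted policy standing in for the VLM. This is a toy under stated assumptions: `scripted_policy`, `run_agent`, `estimate_depth`, and the depth values are hypothetical, and a real agent would generate each cell from the full history of text and image outputs rather than follow hard-coded rules.

```python
import contextlib
import io


def run_cell(source, namespace):
    """Run one cell in the shared namespace; return its printed output."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, namespace)
    return buffer.getvalue().strip()


def scripted_policy(history):
    """Stand-in for the VLM: chooses the next cell from what was observed so far."""
    if not history:
        return "depth_m = estimate_depth(frames[0])\nprint(depth_m)"
    last_cell, last_obs = history[-1]
    if len(history) == 1 and float(last_obs) < 0:
        # A negative depth is implausible, so recompute on another frame.
        return "depth_m = estimate_depth(frames[1])\nprint(depth_m)"
    if "answer" not in last_cell:
        return "answer = 'near' if depth_m < 2.0 else 'far'\nprint(answer)"
    return None  # nothing left to do


def run_agent(policy, namespace, max_steps=5):
    """Observe-then-act: one cell per step, each conditioned on prior outputs."""
    history = []
    for _ in range(max_steps):
        cell = policy(history)
        if cell is None:
            break
        history.append((cell, run_cell(cell, namespace)))
    return history


# Hypothetical pre-loaded state: two fake frames and a fake depth primitive.
namespace = {
    "frames": [{"depth": -1.0}, {"depth": 1.5}],
    "estimate_depth": lambda frame: frame["depth"],
}
history = run_agent(scripted_policy, namespace)
observations = [obs for _, obs in history]
```

Here the first observation (`-1.0`) triggers a recompute, the second (`1.5`) is accepted, and only then is the answer cell written. A single-pass script would have had to commit to the first estimate.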

Action interface	How the agent acts	Can it adapt mid-task?
Structured tool-calls	pick a named function from a fixed schema	only within the preset operations
Single-pass code	write one whole script, run it once	no — commits before any result
Code-as-action (SpatialClaw)	one executable cell per step on a stateful kernel	yes — observes each result, then writes the next

Where it earns its keep

Picture one illustrative spatial question that needs three operations composed in order: detect the two objects, measure the distance between them, then compare that distance to a reference. With structured tool-calls, those three only compose if the schema happens to expose a detect → measure → compare chain; miss one and the run dead-ends. Single-pass code can write all three at once, but if the detect step misreads an object — which the agent can't know until it runs — the measure and compare built on top inherit the error and the final answer is wrong. Code-as-action runs detect, looks at the result, fixes it if the boxes are off, and only then measures and compares — three steps, each conditioned on the last. (The three-step example is illustrative; only the 59.9% average, +11.2-point gain, 20 benchmarks, and six backbones come from the paper.) The reported +11.2 points to 59.9%, holding across six different VLMs with no training, is what that per-step feedback buys.
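The detect, measure, compare example can be written out with stub primitives. As with the prose example, this is purely illustrative: the `detect` stub, the coordinates, the scores, and the 6.0 m reference are invented for the sketch and are not results or APIs from the paper.

```python
import math

# Made-up detections: (centre_xy_in_metres, confidence) in raw detector order.
FRAME = {
    "chair": [((9.0, 0.0), 0.20), ((3.0, 4.0), 0.95)],  # first hit is a false positive
    "table": [((0.0, 0.0), 0.90)],
}
REFERENCE_M = 6.0  # hypothetical reference distance


def detect(frame, label):
    """Hypothetical detector stub: returns candidates without filtering."""
    return frame[label]


def measure(p, q):
    """Euclidean distance between two centres."""
    return math.dist(p, q)


def single_pass(frame):
    """Commit to the first detection for each object without inspecting it."""
    chair, _ = detect(frame, "chair")[0]
    table, _ = detect(frame, "table")[0]
    return measure(chair, table) < REFERENCE_M


def stepwise(frame, min_score=0.5):
    """Inspect the detection before building the measurement on top of it."""
    chairs = detect(frame, "chair")
    chair, score = chairs[0]
    if score < min_score:
        # The observation looks unreliable, so pick the best-scoring candidate.
        chair, score = max(chairs, key=lambda cand: cand[1])
    table, _ = detect(frame, "table")[0]
    return measure(chair, table) < REFERENCE_M


closer_single = single_pass(FRAME)  # builds on the false positive at 9.0 m
closer_step = stepwise(FRAME)       # corrects to the chair at 5.0 m
```

With these made-up numbers the single-pass version measures 9.0 m and answers that the pair is not within the reference distance, while the stepwise version measures 5.0 m and answers that it is. The difference comes entirely from looking at the detect output before measuring.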


Related explainers

Harness-1 — externalized agent state — the stateful kernel is exactly this idea: keep the agent's working memory in an external, inspectable store rather than in the prompt
grep vs. vector for agentic retrieval — the same lesson from a different angle: the agent's harness and interface move accuracy more than the underlying algorithm does
EnvFactory — synthetic envs for tool-use agents — the training-data side of tool use, where SpatialClaw is the inference-time interface side


