Cursor Composer 2.5 — Targeted textual feedback RL

LLM
L
Targeted textual feedback RL — one 100k-token rollout, a coach inserts a textual hint at a target span, the hinted distribution acts as a teacher, and on-policy distillation KL gives a localized credit signal that end-of-rollout scalar reward cannot provide.
learnaivisually.com/ai-explained/cursor-composer-2-5-targeted-textual-feedback-rl

The news. On May 18, 2026, Cursor released Composer 2.5, a "substantial intelligence and behavior upgrade" built on the same Moonshot Kimi K2.5 open-source checkpoint as Composer 2. The release blog details three training-stack changes; the headline one is targeted RL with textual feedback — a credit-assignment trick for very long agent rollouts. The same post also flags a companion frontier model being trained from scratch with SpaceXAI on Colossus 2 at roughly 1M H100-equivalents (~10× Composer 2.5's total compute).

Picture a basketball coach reviewing two hours of game tape after a one-point loss. The scoreboard says −1. That number is technically accurate, technically true feedback on the team's performance — and technically useless for fixing anything. Which possession lost the game? The blown switch in the second quarter? The bad rotation at minute 47? The missed corner three? The scoreboard cannot distinguish, and a player who only ever sees the scoreboard learns by drifting an average direction across every minute of every game. That is exactly what an end-of-rollout scalar reward does to an agent trained on 100k-token rollouts: one number, ~100,000 tokens, ~10⁻⁵ credit per token, and the signal on any individual move is lost in noise. This is the long-rollout version of the credit-assignment problem every RL textbook opens with.

Targeted textual feedback is the coach pulling out a clipboard. The trainer picks out the specific target message in the rollout that went wrong — a single tool call, a single response, a single multi-line plan — and constructs a short hint describing the desired improvement: "be more concise here", "check the file's imports first", "shoot the corner three". The hint gets inserted into the model's local context around that message. Now the model, with the hint in front of it, produces a different next-token distribution at that span — a distribution that reflects what the coach actually wanted. That distribution becomes the teacher. The original policy, running with the original (hint-free) context, is the student. An on-policy distillation KL loss moves the student's probabilities toward the teacher's, but only over the target span — the rest of the trajectory is still learning from the broader RL objective in parallel.

The shape of the gradient signal is what matters here. End-of-rollout scalar reward is one number trying to teach roughly 100,000 tokens. Targeted textual feedback is a small target span of dense, position-specific gradient sitting inside the same rollout. Once an annotated span exists, the team can in principle produce localized hint signals from each annotated span without re-running whole new rollouts to swing one scalar reward.

Where the wall-clock signal density actually shows up

Hold three variables fixed. One 100,000-token rollout. One scalar reward at the end (say, −1 — the rollout failed). One identifiable target span of ~50 tokens that the trainer judges to be the load-bearing moment. With end-of-rollout scalar RL alone, that single −1 has to distribute its gradient across all 100,000 tokens — the per-token signal magnitude is on the order of 1 / 100,000 ≈ 0.001% (illustrative). Stack a hundred such rollouts and the model is averaging over 10⁷ tokens to learn one consistent lesson. With targeted textual feedback, the same rollout gets a localized KL loss on the 50-token target span — per-token signal magnitude on that span is on the order of 1 / 50 ≈ 2% (illustrative), roughly ~2,000× stronger than the diffuse scalar on that span (illustrative). The broader RL gradient still applies over the full trajectory; the localized loss is additive, not a replacement.

PropertyEnd-of-rollout scalarTargeted textual feedback
Signal locationOne number at trajectory end~50-token span anywhere in the rollout
Per-token credit (~100k rollout)~10⁻⁵ ~illustrative~2% on the target span ~illustrative
Cost to produce one signalOne full rollout (~minutes ~setup-dependent)One constructed short hint (~seconds ~setup-dependent)
Loss formPolicy gradient on scalar rewardOn-policy distillation KL on the hinted span
Replaces broader RL objective?n/aNo — runs alongside, additive
Sweet spotShort rollouts with clear scalar outcomes ~setup-dependent, illustrativeLong agent rollouts where one moment was load-bearing ~setup-dependent, illustrative

This is structurally the same move that took the field from RLHF to RLVR: once an inference-time procedure — long agent rollouts here, test-time search there — becomes load-bearing, the post-training stage has to be redesigned around what that procedure actually needs from the policy. Coding agents need localized corrections, not averaged ones, because every long rollout has a few moments that matter and a lot of moments that don't.

It is also a different fix from the training-inference mismatch diagnostic and from window-level RL for speculative drafters. Both of those localize the gradient in time too, but the localization comes from algorithmic structure (windowed sampling, mismatch isolation). Cursor's lever is a constructed short hint at a chosen target message — a textual description of the desired correction — which lets the training pass focus the gradient at the span the trainer flagged, not at fixed-size windows.

Goes deeper in: Agent Engineering → Production Evals → Online vs Offline

Related explainers

Frequently Asked Questions