What are self-evolving rubrics?

A rubric is a structured scorecard that turns an agent's run into a score. A self-evolving rubric, as in SkillCoach (arXiv 2607.01874, July 2026), rewrites its own scoring criteria over several rounds: each round it proposes small patches and keeps one only if the revised rubric passes hard gates (no destructive edits) and still grades a trusted set of validation trajectories correctly. In the paper, this self-evolution lifted gold-keypoint coverage from 71.56% to 83.70%.

Why does agent skill selection get harder as the library grows?

Agent "skills" are reusable operational units (SOPs, tool workflows, scripts), and real libraries fill up with overlapping, near-duplicate skills. When many candidates look alike, the model struggles to pick the one that actually fits. SkillCoach measures this with distractor skills: against 35,500 decoys, Gemini 3.1 Pro reaches only 0.17 selection F1, and at 50,000 decoys GPT-4.5 manages just 0.33. That is evidence that reliable selection, not raw model quality, is the bottleneck at scale.

How does SkillCoach relate to fixed-rubric methods like LongTraceRL or QVal?

LongTraceRL uses a rubric as a fixed reward for RL training, and QVal tests whether a fixed per-step supervision signal ranks actions like a reference policy. Both treat the grader as given. SkillCoach's contribution is to make the grader itself adaptive: the rubric patches its own criteria under validation gates, and it grades four distinct skill-use dimensions — selection, following, composition, and reflection — rather than a single reward. It can then filter training data, more than doubling a 9B model's accuracy under distractors (14% to 32%).

SkillCoach self-evolves rubrics to grade agentic skill-use at scale — Self-evolving rubrics

TL;DR

What is it: SkillCoach (arXiv 2607.01874) is a framework that grades how well an agent uses its skills on four axes (skill selection, following, composition, and reflection), using a rubric that evolves its own scoring criteria instead of a fixed, hand-written one.
Why it’s needed: Agent "skills" are becoming a reusable operational layer — SOPs, tool workflows, validation routines. Once the library grows into thousands of overlapping skills, picking the right one is the bottleneck, and a static rubric can't keep grading that reliably.
vs previous: A fixed rubric (or an outcome-only pass/fail score) is frozen the moment you write it. SkillCoach's rubric self-evolves through validation-gated patches — it rewrites its own criteria, keeping an edit only if it still grades a trusted set of sample runs correctly.

Jargon

Skill: A reusable operational unit an agent can invoke — an SOP, a domain rule, a tool workflow, a script, or a validation routine. As these accumulate, choosing the right one for a task gets harder.
Skill library: The growing collection of an agent's skills. Overlapping, near-duplicate skills are what make reliable selection difficult — the more look-alikes, the worse the picking.
The four skill-use dimensions: Selection (pick the right skill), following (execute its key steps), composition (order dependent skills correctly), and reflection (notice and fix your own slips). SkillCoach scores all four.
Rubric: A structured scorecard that turns a messy agent trajectory into a score. In SkillCoach it is not fixed — it is the thing being improved.
Self-evolving rubric: The rubric proposes small edits to its own criteria; an edit is accepted only if it passes hard gates (no destructive changes) plus soft objectives on held-out validation trajectories. Also called validation-gated evolution.
Distractor skills: Near-duplicate decoy skills added to the library to stress-test selection. The more you add, the more skill-selection F1 collapses.
F1: The harmonic mean of precision and recall; here, how well the agent's chosen skills match the gold set of correct ones. An F1 of 0.17 means the picks barely overlap the gold set.
Rubric-filtered SFT: Supervised fine-tuning on only the trajectories the rubric scored highly. A cleaner training set lifts a small model's accuracy under distractors — a downstream use of a good rubric.

The news. On July 1, 2026, researchers released SkillCoach (arXiv 2607.01874), a framework for grading how well LLM agents use their skills: the SOPs, tool workflows, and validation routines that are becoming agents' reusable operational layer. It scores a trajectory on four dimensions (skill selection, following, composition, and reflection) and, crucially, evolves the scoring rubric itself through validation-gated patches. Its most striking finding is how badly selection scales: against 35,500 distractor skills, Gemini 3.1 Pro holds only 0.17 selection F1, and at 50,000 distractors GPT-4.5 reaches just 0.33.

Picture a cooking exam. Each cook works from an overstuffed binder of recipe cards, and the test is not "did the dish taste good" — it is whether they used the binder well: did they pull the right recipe, follow its steps, combine recipes in the right order, and catch their own slips? The trouble is the binder. Cram it with thousands of near-identical decoy cards and even a strong cook grabs the wrong one. That is SkillCoach's starting problem — because skills are becoming a reusable operational layer, the library fills with look-alikes, and the agent's ability to pick the right skill is exactly what starts to break.

SkillCoach turns "used the binder well" into four graded axes, each with its own measuring stick rather than a single vibe. Selection is scored by F1 against the gold set of correct skills; following by weighted completion of a skill's key steps; composition by whether dependent skills run in the right precedence order; and reflection by whether the agent notices and repairs its own mistakes. It is the same shift the curriculum draws between a binary pass/fail and a graded score — a rubric sees how the run went, not just whether it passed.

Dimension	What it checks	How SkillCoach scores it [paper]
Selection	did the agent pick the right skill?	F1 vs the gold skill set
Following	did it execute the skill's key steps?	weighted key-step completion
Composition	did it order dependent skills correctly?	precedence-dependency check
Reflection	did it catch and fix its own slips?	skill-grounded self-correction
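
The first three rows of the table are mechanical enough to sketch in code. The functions below are a minimal illustration, not SkillCoach's implementation: the function names, the step weights, and the representation of precedence as (before, after) pairs are all assumptions made for the example. Reflection is left out because judging self-correction needs more than set arithmetic.

```python
from typing import Dict, List, Set, Tuple


def selection_f1(chosen: Set[str], gold: Set[str]) -> float:
    """F1 between the skills the agent chose and the gold skill set."""
    true_positives = len(chosen & gold)
    if true_positives == 0:
        return 0.0  # also covers an empty chosen or gold set
    precision = true_positives / len(chosen)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def following_score(step_weights: Dict[str, float], completed: Set[str]) -> float:
    """Weighted fraction of a skill's key steps that the agent completed."""
    total = sum(step_weights.values())
    if total == 0:
        return 0.0
    done = sum(w for step, w in step_weights.items() if step in completed)
    return done / total


def composition_ok(order: List[str], precedence: List[Tuple[str, str]]) -> bool:
    """True if every (before, after) pair ran in that order.

    Assumes each skill appears at most once in `order`.
    """
    position = {skill: i for i, skill in enumerate(order)}
    for before, after in precedence:
        if before not in position or after not in position:
            return False  # a required skill never ran
        if position[before] >= position[after]:
            return False  # dependency ran out of order
    return True
```

For instance, choosing {"a", "b"} against a gold set {"a", "c"} gives precision and recall of 0.5 each, so the selection F1 is 0.5.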

Here is the twist that makes SkillCoach more than another benchmark. A grader is only as good as its scorecard, so SkillCoach lets the scorecard rewrite itself. Each round it proposes small patches to its own rubric and keeps a patch only if the revised rubric still grades a trusted set of sample runs correctly: hard gates block destructive edits, soft objectives push for sharper scoring. Across up to 6 rounds, using 10 calibration and 5 validation trajectories per iteration, the rubric tightens itself: gold-keypoint coverage climbs from 71.56% to 83.70% and usability from 81.53 to 94.33. In the kitchen: the examiner revises their own scorecard between rounds, but only edits that still grade the known-good and known-bad dishes correctly survive.
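
The accept-or-reject loop described above can be sketched as follows. This is a toy under stated assumptions, not the paper's algorithm: the rubric is modeled as a tuple of keywords, `grade`, `hard_gate`, and `propose` are hypothetical stand-ins, and the way the calibration and validation sets split the work (calibration as the soft objective, validation as the non-regression check) is a guess at one reasonable arrangement.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Rubric = Tuple[str, ...]  # toy model: a rubric is a tuple of required keywords


@dataclass(frozen=True)
class Trajectory:
    text: str
    should_pass: bool  # trusted label the rubric is expected to reproduce


def grade(rubric: Rubric, traj: Trajectory) -> bool:
    """Toy grader: pass only if every criterion keyword appears in the run."""
    return all(keyword in traj.text for keyword in rubric)


def agreement(rubric: Rubric, runs: List[Trajectory]) -> float:
    """Fraction of trusted runs the rubric grades correctly (runs must be non-empty)."""
    return sum(grade(rubric, t) == t.should_pass for t in runs) / len(runs)


def hard_gate(old: Rubric, new: Rubric) -> bool:
    """Assumed gate: a patch may add criteria but never drop an existing one."""
    return set(old) <= set(new)


def evolve(
    rubric: Rubric,
    propose: Callable[[Rubric], List[Rubric]],  # hypothetical patch proposer
    calibration: List[Trajectory],
    validation: List[Trajectory],
    max_rounds: int = 6,
) -> Rubric:
    for _ in range(max_rounds):
        for candidate in propose(rubric):
            if not hard_gate(rubric, candidate):
                continue  # destructive edit: rejected outright
            # Soft objective (assumed): the patch must grade calibration runs better.
            sharper = agreement(candidate, calibration) > agreement(rubric, calibration)
            # Validation gate: the patch must not grade trusted runs any worse.
            still_valid = agreement(candidate, validation) >= agreement(rubric, validation)
            if sharper and still_valid:
                rubric = candidate
                break  # assumption: accept at most one patch per round
    return rubric
```

With a proposer that offers both "delete everything" and "add a cite criterion", the gate discards the deletion and the loop keeps the addition only because it grades the trusted runs more accurately.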

Now watch the selection collapse with real numbers. Hold the model and the task fixed and grow only the decoy binder. Skill selection is scored by F1 on a 0-to-1 scale, where 1.0 is picking exactly the right skills and nothing else. At 35,500 distractor skills, Gemini 3.1 Pro's skill-selection F1 is just 0.17; push a different frontier model, GPT-4.5, all the way to 50,000 distractors and it reaches only 0.33. Both sit far below a perfect 1.0: against a library of look-alikes, even top models pick badly, and selection degrades as the pile grows. The lesson is blunt: at real library scale, "just let the model pick the skill" is not a plan, which is why a reliable, self-sharpening grader for skill-use matters at all.
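
To get a feel for those scores, it helps to write out the F1 formula. The worked case is purely illustrative (the counts are assumptions, not figures from the paper): an agent picks six skills, only one of which is in a gold set of six.

```latex
\[
F_1 = \frac{2PR}{P + R}, \qquad
P = \frac{|\text{chosen} \cap \text{gold}|}{|\text{chosen}|}, \quad
R = \frac{|\text{chosen} \cap \text{gold}|}{|\text{gold}|}
\]
\[
P = R = \tfrac{1}{6} \;\Rightarrow\;
F_1 = \frac{2 \cdot \tfrac{1}{6} \cdot \tfrac{1}{6}}{\tfrac{1}{6} + \tfrac{1}{6}}
    = \tfrac{1}{6} \approx 0.17
\]
```

By the same arithmetic, when precision and recall are equal, an F1 of 0.33 corresponds to getting about one pick in three right.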

And the evolved rubric is not only a scoreboard — it is a data filter. Keep just the trajectories it scored highly and fine-tune on those, and rubric-filtered SFT more than doubled a 9B model's accuracy under distractors, from 14% to 32%. A grader good enough to trust becomes a cheap way to clean the training set — the same discipline you would use to grade an agent in shadow before it acts, turned into a training signal.
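
As a rough sketch of the filtering step, assuming each trajectory already carries a rubric score between 0 and 1: the field name `rubric_score` and the 0.8 cutoff are placeholders for illustration, not values from the paper.

```python
from typing import Dict, List


def filter_for_sft(trajectories: List[Dict], min_score: float = 0.8) -> List[Dict]:
    """Keep only trajectories whose rubric score meets the threshold."""
    return [t for t in trajectories if t["rubric_score"] >= min_score]


# Hypothetical scored runs; only the high-scoring ones reach the training set.
runs = [
    {"id": "run-1", "rubric_score": 0.92},
    {"id": "run-2", "rubric_score": 0.41},
    {"id": "run-3", "rubric_score": 0.85},
]
sft_set = filter_for_sft(runs)
```

Here `sft_set` keeps run-1 and run-3 and drops run-2, so fine-tuning sees only the runs the rubric rated highly.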

Goes deeper in: AI Agents → Evals & Diagnostics → Pass/Fail vs Score

Related explainers

QVal — Q-aligned dense supervision — a same-week question about graders: QVal tests whether a fixed per-step supervision signal ranks actions correctly. SkillCoach goes one step further and lets the grader rewrite itself.
LongTraceRL — rubric process reward — a rubric used as a fixed reward for RL training. SkillCoach's rubric is instead the thing being learned.
WeaveBench — trajectory-aware grading — grades agents on the whole trajectory rather than the final outcome; the grading discipline SkillCoach automates and evolves.
SkillHone — persistent decision-history memory — evolves the agent's skills across sessions; the complement to evolving the rubric that grades how those skills get used.

