The news. On July 1, 2026, researchers released SkillCoach (arXiv 2607.01874), a framework for grading how well LLM agents use their skills: the SOPs, tool workflows, and validation routines that are becoming agents' reusable operational layer. It scores a trajectory on four dimensions (skill selection, following, composition, and reflection) and, crucially, evolves the scoring rubric itself through validation-gated patches. Its most striking finding is how poorly selection holds up at library scale: against 35,500 distractor skills, Gemini 3.1 Pro holds only 0.17 selection F1, and at 50,000 distractors GPT-4.5 reaches just 0.33.
Picture a cooking exam. Each cook works from an overstuffed binder of recipe cards, and the test is not "did the dish taste good" — it is whether they used the binder well: did they pull the right recipe, follow its steps, combine recipes in the right order, and catch their own slips? The trouble is the binder. Cram it with thousands of near-identical decoy cards and even a strong cook grabs the wrong one. That is SkillCoach's starting problem — because skills are becoming a reusable operational layer, the library fills with look-alikes, and the agent's ability to pick the right skill is exactly what starts to break.
SkillCoach turns "used the binder well" into four graded axes, each with its own measuring stick rather than a single vibe. Selection is scored by F1 against the gold set of correct skills; following by weighted completion of a skill's key steps; composition by whether dependent skills run in the right precedence order; and reflection by whether the agent notices and repairs its own mistakes. It is the same shift the curriculum draws between a binary pass/fail and a graded score — a rubric sees how the run went, not just whether it passed.
| Dimension | What it checks | How SkillCoach scores it [paper] |
|---|---|---|
| Selection | did the agent pick the right skill? | F1 vs the gold skill set |
| Following | did it execute the skill's key steps? | weighted key-step completion |
| Composition | did it order dependent skills correctly? | precedence-dependency check |
| Reflection | did it catch and fix its own slips? | skill-grounded self-correction |
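To make the first three rows of the table concrete, here is a minimal Python sketch of what each measuring stick could look like. It is an illustration under stated assumptions, not SkillCoach's implementation: the function names, the skill and step names, the step weights, and the choice to check precedence only between skills that actually ran are all hypothetical. Reflection is left out because scoring self-correction needs a judgment about the trajectory that a few lines cannot capture.

```python
from typing import Dict, Iterable, Mapping, Sequence, Tuple


def selection_f1(selected: Iterable[str], gold: Iterable[str]) -> float:
    """F1 between the skills the agent picked and the gold skill set."""
    selected_set, gold_set = set(selected), set(gold)
    hits = len(selected_set & gold_set)
    if hits == 0:
        return 0.0  # also covers an empty selection or an empty gold set
    precision = hits / len(selected_set)
    recall = hits / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def following_score(completed: Iterable[str], step_weights: Mapping[str, float]) -> float:
    """Weighted share of a skill's key steps that the agent completed."""
    total = sum(step_weights.values())
    if total == 0:
        return 0.0
    done = set(completed)
    return sum(w for step, w in step_weights.items() if step in done) / total


def composition_ok(order: Sequence[str], precedence: Iterable[Tuple[str, str]]) -> bool:
    """True if every (before, after) pair that ran appears in that order.

    Assumption: a pair is only checked when both skills appear in the run.
    """
    position: Dict[str, int] = {}
    for index, skill in enumerate(order):
        position.setdefault(skill, index)  # keep the first occurrence
    for before, after in precedence:
        if before in position and after in position:
            if position[before] > position[after]:
                return False
    return True


# Hypothetical run: one right skill, two decoys, one gold skill missed.
picked = ["parse_invoice", "send_email", "resize_image"]
gold = ["parse_invoice", "validate_totals"]
print(round(selection_f1(picked, gold), 2))  # 0.4

# Hypothetical key steps; "validate" is weighted twice as heavily.
weights = {"load": 1.0, "validate": 2.0, "report": 1.0}
print(following_score(["load", "report"], weights))  # 0.5

deps = [("parse_invoice", "validate_totals")]
print(composition_ok(["parse_invoice", "validate_totals"], deps))  # True
print(composition_ok(["validate_totals", "parse_invoice"], deps))  # False
```

Note how the selection example is punished twice: the two decoys drag precision down to 1/3, and the missed gold skill caps recall at 1/2. That is why a large pile of look-alikes hurts F1 so quickly.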
Here is the twist that makes SkillCoach more than another benchmark. A grader is only as good as its scorecard, so SkillCoach lets the scorecard rewrite itself. Each round it proposes small patches to its own rubric and keeps a patch only if the revised rubric still grades a trusted set of sample runs correctly: hard gates block destructive edits, soft objectives push for sharper scoring. Across at most 6 rounds, using 10 calibration and 5 validation trajectories per iteration, the rubric tightens itself: gold-keypoint coverage climbs from 71.56% to 83.70% and usability from 81.53 to 94.33. In the kitchen: the examiner revises their own scorecard between rounds, but only edits that still grade the known-good and known-bad dishes correctly survive.
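The accept-or-reject logic of a validation-gated patch can be sketched in a few lines of Python. Everything here is a simplifying assumption made for illustration: the rubric is modeled as a dictionary of criterion weights, a patch as a dictionary of weight overrides, the hard gate as "every trusted validation run is still graded correctly", and the soft objective as a wider score gap between good and bad calibration runs. The paper's actual rubric format, gates, and objectives may differ, and this sketch tries one patch per round.

```python
from typing import Dict, List, Tuple

Rubric = Dict[str, float]            # criterion -> weight (hypothetical shape)
Run = Tuple[Dict[str, float], bool]  # (criterion -> evidence in [0, 1], trusted label)


def score(rubric: Rubric, evidence: Dict[str, float]) -> float:
    """Weight-normalized rubric score for one run."""
    total = sum(rubric.values())
    if total == 0:
        return 0.0
    return sum(w * evidence.get(c, 0.0) for c, w in rubric.items()) / total


def hard_gate(rubric: Rubric, validation: List[Run], threshold: float = 0.5) -> bool:
    """Pass only if every trusted validation run is graded correctly."""
    return all((score(rubric, ev) >= threshold) == label for ev, label in validation)


def sharpness(rubric: Rubric, calibration: List[Run]) -> float:
    """Soft objective: mean score of good runs minus mean score of bad runs."""
    good = [score(rubric, ev) for ev, label in calibration if label]
    bad = [score(rubric, ev) for ev, label in calibration if not label]
    if not good or not bad:
        return 0.0
    return sum(good) / len(good) - sum(bad) / len(bad)


def evolve(rubric: Rubric, patches: List[Rubric], calibration: List[Run],
           validation: List[Run], max_rounds: int = 6) -> Rubric:
    """Try one patch per round; keep it only if it passes the gate and sharpens scoring."""
    current = dict(rubric)
    for patch in patches[:max_rounds]:
        candidate = {**current, **patch}
        if not hard_gate(candidate, validation):
            continue  # destructive edit: blocked by the hard gate
        if sharpness(candidate, calibration) > sharpness(current, calibration):
            current = candidate
    return current


# Hypothetical data: "verbose" is a criterion that does not track quality.
start = {"right_skill": 1.0, "verbose": 1.0}
calibration = [({"right_skill": 1.0, "verbose": 0.0}, True),
               ({"right_skill": 0.0, "verbose": 1.0}, False)]
validation = [({"right_skill": 1.0, "verbose": 1.0}, True),
              ({"right_skill": 0.0, "verbose": 1.0}, False)]
patches = [{"right_skill": 0.0},  # destructive: drops the criterion that matters
           {"verbose": 0.0}]      # helpful: drops the criterion that misleads
print(evolve(start, patches, calibration, validation))
# {'right_skill': 1.0, 'verbose': 0.0}
```

The design point the sketch tries to capture is the asymmetry: the hard gate is a veto that no amount of sharpness can override, while the soft objective only ranks patches that already passed it.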
Now watch the selection collapse with real numbers. The setup holds the model and the task fixed and grows only the decoy binder. Skill selection is scored by F1 on a 0-to-1 scale, where 1.0 means picking exactly the right skills and nothing else. At 35,500 distractor skills, Gemini 3.1 Pro's skill-selection F1 is just 0.17; a different frontier model, GPT-4.5, reaches only 0.33 at 50,000 distractors. The two figures come from different models, so they are not two points on one curve, but both sit far below a perfect 1.0: against a library of look-alikes, even top models pick badly, and it gets worse as the pile grows. The lesson is blunt: at real library scale, "just let the model pick the skill" is not a plan, which is why a reliable, self-sharpening grader for skill-use matters at all.
And the evolved rubric is not only a scoreboard — it is a data filter. Keep just the trajectories it scored highly and fine-tune on those, and rubric-filtered SFT more than doubled a 9B model's accuracy under distractors, from 14% to 32%. A grader good enough to trust becomes a cheap way to clean the training set — the same discipline you would use to grade an agent in shadow before it acts, turned into a training signal.
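Using the rubric as a data filter amounts to a threshold over scored trajectories. The sketch below shows that idea in Python under loud assumptions: the trajectory fields (`id`, `task`, `transcript`), the `min_score` cutoff of 0.8, and the prompt/response output shape are all hypothetical, and the scores are hard-coded stand-ins for whatever the evolved rubric would assign.

```python
from typing import Callable, Dict, Iterable, List

Trajectory = Dict[str, str]  # hypothetical fields: "id", "task", "transcript"


def filter_for_sft(trajectories: Iterable[Trajectory],
                   rubric_score: Callable[[Trajectory], float],
                   min_score: float = 0.8) -> List[Dict[str, str]]:
    """Keep only trajectories the rubric scored at or above min_score,
    reshaped into prompt/response pairs for supervised fine-tuning."""
    kept = []
    for traj in trajectories:
        if rubric_score(traj) >= min_score:
            kept.append({"prompt": traj["task"], "response": traj["transcript"]})
    return kept


# Stand-in scores; in practice these would come from the evolved rubric.
scores = {"run-1": 0.93, "run-2": 0.41, "run-3": 0.80}
runs = [
    {"id": "run-1", "task": "Reconcile the invoice", "transcript": "picked parse_invoice ..."},
    {"id": "run-2", "task": "Reconcile the invoice", "transcript": "picked resize_image ..."},
    {"id": "run-3", "task": "Summarize the ticket", "transcript": "picked read_ticket ..."},
]
sft_examples = filter_for_sft(runs, lambda t: scores[t["id"]])
print(len(sft_examples))  # 2
```

The cutoff is the knob that matters: set it too low and decoy-heavy runs leak into the training set, too high and there is little left to train on.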
Related explainers
- QVal — Q-aligned dense supervision — a same-week question about graders: QVal tests whether a fixed per-step supervision signal ranks actions correctly. SkillCoach goes one step further and lets the grader rewrite itself.
- LongTraceRL — rubric process reward — a rubric used as a fixed reward for RL training. SkillCoach's rubric is instead the thing being learned.
- WeaveBench — trajectory-aware grading — grades agents on the whole trajectory rather than the final outcome; the grading discipline SkillCoach automates and evolves.
- SkillHone — persistent decision-history memory — evolves the agent's skills across sessions; the complement to evolving the rubric that grades how those skills get used.