The news. On July 1, 2026, the SkillHone paper (arXiv 2606.08671) proposed a harness for continual agent skill evolution: role-separated subagents run candidate skills on practice problems and write diagnostics, revisions, and outcomes into a persistent decision history, which later agents read to refine their skills — all without retraining the underlying model. It reports a 15.8-point advantage over a commercial deep-research agent on GAIA, and an 18.8-point average accuracy gain across seven internal tool-mediated analysis scenarios. Read the paper →
Picture a race team's setup notebook. Every weekend the crews try things, and the notebook fills with entries: this front wing was too stiff in turn 3, we softened it and gained half a second. The next race, the crew opens the book before they touch the car — they inherit the hard-won lessons instead of rediscovering them. That is roughly what SkillHone does for an agent's skills, and the point is where the lessons live: in the notebook, not in a rebuilt crew.
That distinction is the whole idea. Many "self-improving" agents try to bake their experience into the weights, distilling their own past runs back into the model — and a separate recent paper found that road can backfire, degrading the agent over iterations. SkillHone takes the other branch: it keeps the experience out of the weights entirely and holds it as an external decision history the agent reads. The model never changes; the notebook does.
Filling that notebook is a team job. Role-separated subagents run candidate skills against practice problems, and each entry records the diagnostic, the revision, and the outcome — including what failed and how it was fixed (an ongoing error analysis the team writes down). When a later agent picks up a similar task, it treats that history as engineered context, reading the prior decisions instead of retracing the reasoning from scratch. Because the record carries failures and their fixes, a later run doesn't have to rediscover what already failed — the notebook is doing the learning the weights would otherwise have to.
Walk the gain across sessions (illustrative base). Suppose the base agent scores 50% on one of the seven internal analysis scenarios. Left to replay its own runs, it might stall or slip. Reading a persistent decision history instead, SkillHone reports an average lift of 18.8 points across those seven scenarios — so that 50% becomes ~68.8%, without a single weight update. On GAIA the reported edge over a commercial deep-research agent is 15.8 points (arXiv 2606.08671). The lever is not a bigger model; it is a memory the model is allowed to read.
| Aspect | Weight-baked self-evolution | SkillHone (external decision history) |
|---|---|---|
| Where experience lives | distilled into the model’s weights | in an external record the agent reads |
| Cost to improve | a training / distillation run each round | no retraining — just write and read the history |
| Main failure mode | the agent can degrade over iterations | at worst, a history that needs curating — an engineering concern, not a damaged model |
| Reported result | varies by paper / method | ~+15.8 pts GAIA, ~+18.8 avg over 7 scenarios (arXiv 2606.08671) |
The bigger point isn't the benchmark delta: it's that an agent can keep getting better without anyone retraining it, as long as its own decision history is written down well enough for the next run to read. Skills become data you can inspect and edit, not opaque weights you have to hope improved — a cheaper, more inspectable place to put "the agent learns from itself" (arXiv 2606.08671).
Goes deeper in: AI Agents → Context Engineering → The 4 Fixes
Related explainers
- Self-evolving agents collapse — the weight-baked route SkillHone avoids, and why it can degrade over iterations
- EvoMem — patch-based agent memory — another design for an agent memory the model reads instead of retraining
- PreAct — compiled trajectory replay — turning past runs into replayable programs
- HarnessBridge — learned agent harness — evolving the scaffolding around the model rather than the model