What is SkillHone's persistent decision history?

SkillHone (arXiv 2606.08671) is a harness for continual agent skill evolution. Instead of retraining the model, it keeps a persistent decision history — a structured record of what each skill attempt did, what failed, and how it was fixed. Role-separated subagents fill that record by running candidate skills on practice problems, and later agents read it to refine their skills across sessions. The reported result is a 15.8-point edge over a commercial deep-research agent on GAIA and an 18.8-point average gain across seven internal analysis scenarios.

Why keep the history external instead of retraining the model?

Distilling an agent's own past runs into its weights can backfire: a separate paper found the agent can degrade over iterations. SkillHone takes the other branch by never touching the weights: the experience lives in an external record the model reads at inference time. That makes improvement cheaper (no training run per round) and inspectable (skills are data you can read and edit, not opaque parameters), and it removes the distillation step that can go wrong in the first place.

How does SkillHone relate to agent teams and context engineering?

Two live ideas meet in SkillHone. The role-separated subagents that fill and revise the history are a supervisor/worker team pattern from agent engineering. The decision history the next agent reads is engineered context — prior decisions the agent reads instead of retracing old reasoning. SkillHone's contribution is wiring those together into a loop that improves skills across sessions without retraining.

SkillHone evolves agent skills across sessions — Persistent decision-history memory

Jargon

SkillHone: The paper’s harness (arXiv 2606.08671) for continual agent skill evolution: it stores decision histories so an agent’s skills improve over time without retraining the model.
Decision history (trajectory): A structured record of what the agent did, what failed, and how it was fixed — kept as external memory, not folded into the weights. Compiled trajectory replay is a related way to reuse past runs.
Skill (agent skill): A reusable procedure or tool-use recipe the agent can call. SkillHone evolves these across sessions rather than freezing them at training time.
Role-separated subagents: Specialized agents that each take one role in filling the record: running candidate skills and writing down the diagnostics, revisions, and outcomes. This is the supervisor/worker team pattern from agent engineering.
Practice probe: A rehearsal problem the agent solves not to answer a user, but to test and refine a skill before it counts for real.
GAIA: A benchmark of real, multi-step questions that a capable assistant should be able to answer with tools — used here to compare SkillHone against a commercial deep-research agent.

The news. On July 1, 2026, the SkillHone paper (arXiv 2606.08671) proposed a harness for continual agent skill evolution: role-separated subagents run candidate skills on practice problems and write diagnostics, revisions, and outcomes into a persistent decision history, which later agents read to refine their skills, all without retraining the underlying model. It reports a 15.8-point advantage over a commercial deep-research agent on GAIA, and an 18.8-point average accuracy gain across seven internal tool-mediated analysis scenarios.

Picture a race team's setup notebook. Every weekend the crews try things, and the notebook fills with entries: this front wing was too stiff in turn 3, we softened it and gained half a second. The next race, the crew opens the book before they touch the car — they inherit the hard-won lessons instead of rediscovering them. That is roughly what SkillHone does for an agent's skills, and the point is where the lessons live: in the notebook, not in a rebuilt crew.

That distinction is the whole idea. Many "self-improving" agents try to bake their experience into the weights, distilling their own past runs back into the model — and a separate recent paper found that road can backfire, degrading the agent over iterations. SkillHone takes the other branch: it keeps the experience out of the weights entirely and holds it as an external decision history the agent reads. The model never changes; the notebook does.
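To make "the model never changes; the notebook does" concrete, here is a minimal sketch of an external record that survives between sessions. It is not the paper's implementation: the JSON Lines file, the field names (`skill`, `failed`, `fix`), and the helpers `append_entry`, `load_history`, and `run_session` are all assumptions chosen for illustration.

```python
import json
import tempfile
from pathlib import Path


def append_entry(path: Path, entry: dict) -> None:
    """Append one decision record as a JSON line; earlier lines are never rewritten."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def load_history(path: Path) -> list[dict]:
    """Read every record written so far, oldest first."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def run_session(path: Path, lesson: dict) -> int:
    """One session: read what earlier sessions left, then add this session's lesson.

    Returns how many prior entries were available to read.
    """
    seen = len(load_history(path))
    append_entry(path, lesson)
    return seen


# Hypothetical storage location; any durable store would play the same role.
history_path = Path(tempfile.mkdtemp()) / "decision_history.jsonl"

# Session 1 starts with an empty notebook.
seen_1 = run_session(
    history_path,
    {"skill": "csv_summary", "failed": "header row counted as data", "fix": "skip first row"},
)
# Session 2 inherits session 1's entry before it does anything.
seen_2 = run_session(
    history_path,
    {"skill": "csv_summary", "failed": "empty file raised an error", "fix": "return an empty summary"},
)

print(seen_1, seen_2, len(load_history(history_path)))  # 0 1 2
```

Nothing in the sketch touches a model: the only state that changes between sessions is the file, which is also why it can be opened, read, and edited by hand.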

Filling that notebook is a team job. Role-separated subagents run candidate skills against practice problems, and each entry records the diagnostic, the revision, and the outcome — including what failed and how it was fixed (an ongoing error analysis the team writes down). When a later agent picks up a similar task, it treats that history as engineered context, reading the prior decisions instead of retracing the reasoning from scratch. Because the record carries failures and their fixes, a later run doesn't have to rediscover what already failed — the notebook is doing the learning the weights would otherwise have to.
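The division of labor described above can be sketched with plain functions standing in for subagents. This is a toy under stated assumptions, not SkillHone's actual roles or schema: the `Entry` fields mirror the diagnostic / revision / outcome wording in the text, while `runner`, `reviser`, `practice`, and `as_context` are hypothetical names, and the "skill" is just number parsing.

```python
from dataclasses import dataclass
from typing import Callable

Skill = Callable[[str], float]


@dataclass
class Entry:
    probe: str
    diagnostic: str
    revision: str
    outcome: str


def runner(skill: Skill, probe: str) -> tuple[bool, str]:
    """Role 1: run the candidate skill on a practice probe and report what happened."""
    try:
        skill(probe)
        return True, "ok"
    except ValueError as exc:
        return False, f"ValueError: {exc}"


def reviser(skill: Skill) -> Skill:
    """Role 2: propose a revised skill (here, one fixed toy patch that strips commas)."""
    return lambda text: skill(text.replace(",", ""))


def practice(skill: Skill, probes: list[str], history: list[Entry]) -> Skill:
    """Run probes, record each failure with its fix, and keep revisions that work."""
    for probe in probes:
        ok, diagnostic = runner(skill, probe)
        if ok:
            continue
        candidate = reviser(skill)
        fixed, _ = runner(candidate, probe)
        history.append(
            Entry(probe, diagnostic, "strip thousands separators",
                  "fixed" if fixed else "still failing")
        )
        if fixed:
            skill = candidate
    return skill


def as_context(history: list[Entry]) -> str:
    """What a later agent would read instead of retracing the reasoning."""
    return "\n".join(
        f"- probe {e.probe!r}: {e.diagnostic}; revision: {e.revision}; outcome: {e.outcome}"
        for e in history
    )


history: list[Entry] = []
skill = practice(float, ["42", "1,200"], history)  # float() is the starting toy skill

print(skill("3,500"))  # 3500.0
print(as_context(history))
```

The entry for `'1,200'` records both the failure and the fix, so a later run that meets `'3,500'` does not have to rediscover the comma problem.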

Walk the gain across sessions (illustrative base). Suppose the base agent scores 50% on one of the seven internal analysis scenarios. Left to replay its own runs, it might stall or slip. Reading a persistent decision history instead, SkillHone reports an average lift of 18.8 percentage points across those seven scenarios. If this scenario saw exactly the average lift, 50% would become 68.8%, without a single weight update; an average does not say what any single scenario gained. On GAIA the reported edge over a commercial deep-research agent is 15.8 points (arXiv 2606.08671). The lever is not a bigger model; it is a memory the model is allowed to read.
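The arithmetic in that walk-through is additive in percentage points, which is easy to confuse with a relative gain. A small sketch, where the 50% base is the illustrative figure from the text (not a reported number) and the helper names are hypothetical:

```python
def apply_point_gain(base_pct: float, gain_points: float) -> float:
    """Add a gain expressed in percentage points to a base accuracy."""
    return round(base_pct + gain_points, 1)


def relative_gain_pct(base_pct: float, gain_points: float) -> float:
    """Express the same gain relative to the base accuracy."""
    return round(100 * gain_points / base_pct, 1)


base = 50.0   # illustrative base accuracy, an assumption for the example
gain = 18.8   # reported average lift across the seven scenarios, in points

print(apply_point_gain(base, gain))   # 68.8
print(relative_gain_pct(base, gain))  # 37.6
```

On this illustrative base, an 18.8-point lift corresponds to a 37.6% relative improvement; the relative figure would differ for any other base.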

Aspect	Weight-baked self-evolution	SkillHone (external decision history)
Where experience lives	distilled into the model’s weights	in an external record the agent reads
Cost to improve	a training / distillation run each round	no retraining — just write and read the history
Main failure mode	the agent can degrade over iterations	at worst, a history that needs curating — an engineering concern, not a damaged model
Reported result	varies by paper / method	+15.8 pts over a commercial deep-research agent on GAIA; +18.8 pts average over 7 internal scenarios (arXiv 2606.08671)

The bigger point isn't the benchmark delta: it's that an agent can keep getting better without anyone retraining it, as long as its own decision history is written down well enough for the next run to read. Skills become data you can inspect and edit, not opaque weights you have to hope improved — a cheaper, more inspectable place to put "the agent learns from itself" (arXiv 2606.08671).


Related explainers

Self-evolving agents collapse — the weight-baked route SkillHone avoids, and why it can degrade over iterations
EvoMem — patch-based agent memory — another design for an agent memory the model reads instead of retraining
PreAct — compiled trajectory replay — turning past runs into replayable programs
HarnessBridge — learned agent harness — evolving the scaffolding around the model rather than the model


