The news. On June 3, 2026, researchers posted "Rethinking Continual Experience Internalization for Self-Evolving LLM Agents" (arXiv 2606.04703). The headline result is uncomfortable: when an agent repeatedly learns from its own experience, existing methods tend to suffer progressive capability collapse — the agent gets worse over iterations rather than compounding. The paper isolates three design axes that decide which way it goes, and combines them into a simple, robust recipe. Read the paper →
Picture a chef who rewrites the recipe book after every night of service. That is the whole promise of a self-evolving agent: each run produces notes, and those notes feed back into how the next run is cooked. Done well, the kitchen compounds — every week is a little sharper than the last. The unsettling finding is that the obvious way to do it makes the food worse: the chef scribbles down hyper-specific reactions from each night, crams them all into the book at once, and only ever studies their own shifts. A few weeks in, the book is an unreadable pile of one-off corrections and the cooking has drifted off the rails — capability collapse, wearing a toque.
The first knob is what you write down. Instance-level notes ("table 12 wanted less salt on Tuesday") are tied to one trajectory and don't transfer; pile up enough of them and they crowd out general skill. Principle-level notes ("under-salt the special, let guests add") are abstract and reusable — the paper finds they survive repeated rounds where the specifics rot. This is the same lesson as compacting a working context into durable summaries instead of hoarding raw logs: the abstraction is the point, not a lossy shortcut.
The second knob is when you inject it. Global injection rewrites the whole book in one pass at the end of the night; step-wise injection pins each lesson to the exact moment in service it applies to — the intermediate decision state in agent terms. For a long, multi-step task, lining up "use this trick here" beats one undifferentiated brain-dump, because the agent meets the advice at the point where it actually changes a choice. The third knob is whose nights you learn from. Training on-policy — only on your own rollouts — quietly amplifies your own mistakes; off-policy distillation from a fixed set of high-quality teacher trajectories is a steadier signal, the chef studying a master's service rather than re-cooking last week's errors on a loop.
Put illustrative numbers on why granularity matters (numbers here are illustrative — the paper reports the direction, not these figures). Say each self-improvement round internalizes 100 experience items. At instance-level, suppose 70 of them are over-specific — they only fire on the exact trajectory they came from — so 70% of what you bake into the weights is noise that displaces general competence. Run that eight rounds and the noise compounds: the book is now mostly dead weight. Abstract those same 100 items to principle-level first and perhaps 12 reusable strategies remain, so the signal-to-noise ratio flips from roughly 0.4 to over 7 — and the eight rounds now add up instead of canceling out.
| Design axis | The fragile choice (collapses) | The durable choice (sustains) | Why it matters |
|---|---|---|---|
| Experience granularity | instance-level — tied to one trajectory | principle-level — abstract, transferable | specifics crowd out general skill; principles survive repeated rounds |
| Injection pattern | global — rewrite everything at once | step-wise — aligned to each decision state | long-horizon tool use needs advice at the moment it changes a choice |
| Internalization regime | on-policy — train on your own rollouts | off-policy — distill strong teacher trajectories | on-policy amplifies the agent's own errors; off-policy is a steadier signal |
None of this is free: judging which notes are "principle-level" is itself a modeling problem, step-wise alignment needs you to track which state each lesson belongs to, and an off-policy teacher has to come from somewhere. But the payoff is the headline — get all three knobs right and, in the paper's experiments, the agent's skill keeps climbing across rounds instead of eroding, which is the difference between a self-evolving loop you can ship behind a shadow eval and one that quietly degrades in production.
Goes deeper in: AI Agents → Context Engineering → Fixes
Related explainers
- MLEvolve — Monte Carlo Graph Search — the search side of self-evolving agents, vs this paper's learning side
- RecMem — subconscious agent memory — keeping experience in retrievable memory instead of baking it into weights
- MSR delegation study — fidelity drift over iterations — another way agents degrade as a process repeats