The news. On June 3, 2026, researchers posted "Rethinking Continual Experience Internalization for Self-Evolving LLM Agents" (arXiv 2606.04703). The headline result is uncomfortable: when an agent repeatedly learns from its own experience, existing methods tend to suffer progressive capability collapse — the agent gets worse over iterations rather than compounding. The paper isolates three design axes that decide which way it goes, and combines them into a simple, robust recipe. Read the paper →

Picture a chef who rewrites the recipe book after every night of service. That is the whole promise of a self-evolving agent: each run produces notes, and those notes feed back into how the next run is cooked. Done well, the kitchen compounds — every week is a little sharper than the last. The unsettling finding is that the obvious way to do it makes the food worse: the chef scribbles down hyper-specific reactions from each night, crams them all into the book at once, and only ever studies their own shifts. A few weeks in, the book is an unreadable pile of one-off corrections and the cooking has drifted off the rails — capability collapse, wearing a toque.

The first knob is what you write down. Instance-level notes ("table 12 wanted less salt on Tuesday") are tied to one trajectory and don't transfer; pile up enough of them and they crowd out general skill. Principle-level notes ("under-salt the special, let guests add") are abstract and reusable — the paper finds they survive repeated rounds where the specifics rot. This is the same lesson as compacting a working context into durable summaries instead of hoarding raw logs: the abstraction is the point, not a lossy shortcut.

The second knob is when you inject it. Global injection rewrites the whole book in one pass at the end of the night; step-wise injection pins each lesson to the exact moment in service it applies to — the intermediate decision state in agent terms. For a long, multi-step task, lining up "use this trick here" beats one undifferentiated brain-dump, because the agent meets the advice at the point where it actually changes a choice. The third knob is whose nights you learn from. Training on-policy — only on your own rollouts — quietly amplifies your own mistakes; off-policy distillation from a fixed set of high-quality teacher trajectories is a steadier signal, the chef studying a master's service rather than re-cooking last week's errors on a loop.

Put illustrative numbers on why granularity matters (numbers here are illustrative — the paper reports the direction, not these figures). Say each self-improvement round internalizes 100 experience items. At instance-level, suppose 70 of them are over-specific — they only fire on the exact trajectory they came from — so 70% of what you bake into the weights is noise that displaces general competence. Run that eight rounds and the noise compounds: the book is now mostly dead weight. Abstract those same 100 items to principle-level first and perhaps 12 reusable strategies remain, so the signal-to-noise ratio flips from roughly 0.4 to over 7 — and the eight rounds now add up instead of canceling out.

Design axisThe fragile choice (collapses)The durable choice (sustains)Why it matters
Experience granularityinstance-level — tied to one trajectoryprinciple-level — abstract, transferablespecifics crowd out general skill; principles survive repeated rounds
Injection patternglobal — rewrite everything at oncestep-wise — aligned to each decision statelong-horizon tool use needs advice at the moment it changes a choice
Internalization regimeon-policy — train on your own rolloutsoff-policy — distill strong teacher trajectorieson-policy amplifies the agent's own errors; off-policy is a steadier signal

None of this is free: judging which notes are "principle-level" is itself a modeling problem, step-wise alignment needs you to track which state each lesson belongs to, and an off-policy teacher has to come from somewhere. But the payoff is the headline — get all three knobs right and, in the paper's experiments, the agent's skill keeps climbing across rounds instead of eroding, which is the difference between a self-evolving loop you can ship behind a shadow eval and one that quietly degrades in production.

Goes deeper in: AI Agents → Context Engineering → Fixes

Related explainers

Continue in trackAI Agents — Context Engineering: turning raw experience into durable, principle-level context

Frequently Asked Questions