What is continual experience internalization?

It is the step where a self-evolving agent turns its own past experience into a permanent, baked-in capability rather than something it re-reads from context each time. The paper 'Rethinking Continual Experience Internalization for Self-Evolving LLM Agents' (arXiv 2606.04703) studies how to do this repeatedly without the agent's skill degrading.

Why do self-evolving agents collapse over iterations?

Because the naive recipe internalizes the wrong things: hyper-specific, instance-level experience injected all at once and trained only on the agent's own rollouts. Over many rounds the specifics crowd out general skill and the agent's own errors get amplified, so capability erodes instead of compounding — what the paper calls progressive capability collapse.

How does the paper fix it?

With three design choices: keep principle-level (abstract, transferable) experience instead of instance-level; inject it step-wise — aligned to each intermediate decision state — instead of globally; and internalize it off-policy, distilling high-quality teacher trajectories rather than training on-policy on the agent's own runs. Combined, the paper reports a recipe where skill keeps improving across rounds rather than collapsing.

Self-evolving agents collapse over iterations — Continual experience internalization

TL;DR

What is it: A new paper, "Rethinking Continual Experience Internalization for Self-Evolving LLM Agents" (arXiv 2606.04703), studies how a self-evolving agent turns its own past runs into a reusable, baked-in skill — the step it calls continual experience internalization.
Why it’s needed: Get the recipe wrong and an agent that learns from itself gets worse each round — a progressive capability collapse — so this is what decides whether "the agent improves itself" is safe to ship at all.
vs previous: Earlier self-evolving methods store raw, instance-level experience, inject it globally, and train on-policy on the agent's own rollouts; the paper shows the durable recipe is the opposite on all three — principle-level experience, step-wise injection, and off-policy distillation on strong teacher trajectories.

Jargon

Self-evolving agent: An agent that improves itself over time by learning from its own past interactions, instead of staying frozen after training. Each round it folds what it learned back into how it acts.
Continual experience internalization: The step where past experience stops being something the agent re-reads from context and becomes a parametric capability — baked into the weights so it acts on it by default.
Capability collapse: The failure mode the paper names, in which the agent's skill degrades instead of compounding under repeated rounds of self-learning. The recipe exists to avoid it.
Principle- vs instance-level experience: The granularity of what you save. Instance-level notes are tied to one specific trajectory ("on that task, click this"); principle-level notes are abstract, transferable strategies. The paper finds principle-level is far more durable.
Step-wise vs global injection: The injection pattern — when the experience is fed in. Step-wise aligns each piece of experience with the intermediate decision state it applies to; global dumps it all in at once. Step-wise wins for long-horizon tool use.
On-policy vs off-policy distillation: The training signal. On-policy trains on the agent's own current rollouts (which can amplify its mistakes); off-policy distills from a fixed set of high-quality teacher trajectories. Off-policy is the more stable signal here.
Context distillation: Training a model to behave as if a long context were present, without actually keeping that context at inference. Here it is the vehicle for turning logged experience into a permanent skill.
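
As a rough illustration of the context distillation entry above, one common way to frame it is as matching two next-token distributions: a teacher that sees the experience in its context, and a student that does not. The sketch below assumes that general shape (a KL divergence between the two distributions); it is not the paper's exact loss, and the function names and toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def context_distillation_loss(teacher_logits_with_context, student_logits_without_context):
    """Penalize the student for disagreeing with the context-aware teacher.

    Assumption: both logit vectors score the same next-token candidates.
    """
    teacher = softmax(teacher_logits_with_context)
    student = softmax(student_logits_without_context)
    return kl_divergence(teacher, student)


# Toy example: the student has not yet absorbed what the context tells the teacher.
loss = context_distillation_loss([3.0, 1.0, 0.0], [0.0, 1.0, 3.0])
print(f"distillation loss: {loss:.3f}")
```

Driving this loss toward zero means the student behaves as if the context were present, which is the sense in which logged experience becomes a permanent skill.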

The news. On June 3, 2026, researchers posted "Rethinking Continual Experience Internalization for Self-Evolving LLM Agents" (arXiv 2606.04703). The headline result is uncomfortable: when an agent repeatedly learns from its own experience, existing methods tend to suffer progressive capability collapse, with the agent's skill degrading over iterations rather than compounding. The paper isolates three design axes that decide which way it goes, and combines them into a simple, robust recipe.

Picture a chef who rewrites the recipe book after every night of service. That is the whole promise of a self-evolving agent: each run produces notes, and those notes feed back into how the next run is cooked. Done well, the kitchen compounds — every week is a little sharper than the last. The unsettling finding is that the obvious way to do it makes the food worse: the chef scribbles down hyper-specific reactions from each night, crams them all into the book at once, and only ever studies their own shifts. A few weeks in, the book is an unreadable pile of one-off corrections and the cooking has drifted off the rails — capability collapse, wearing a toque.

The first knob is what you write down. Instance-level notes ("table 12 wanted less salt on Tuesday") are tied to one trajectory and don't transfer; pile up enough of them and they crowd out general skill. Principle-level notes ("under-salt the special, let guests add") are abstract and reusable — the paper finds they survive repeated rounds where the specifics rot. This is the same lesson as compacting a working context into durable summaries instead of hoarding raw logs: the abstraction is the point, not a lossy shortcut.

The second knob is when you inject it. Global injection rewrites the whole book in one pass at the end of the night; step-wise injection pins each lesson to the exact moment in service it applies to, which in agent terms is the intermediate decision state. For a long, multi-step task, lining up "use this trick here" beats one undifferentiated brain-dump, because the agent meets the advice at the point where it actually changes a choice.

The third knob is whose nights you learn from. Training on-policy, only on your own rollouts, quietly amplifies your own mistakes; off-policy distillation from a fixed set of high-quality teacher trajectories is a steadier signal: the chef studying a master's service rather than re-cooking last week's errors on a loop.
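
The difference between global and step-wise injection can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Principle` record, its `applies_to` tag, the state names, and the prompt layout are all hypothetical, and it assumes each lesson has already been labeled with the decision state it belongs to.

```python
from dataclasses import dataclass


@dataclass
class Principle:
    text: str
    applies_to: str  # hypothetical label for the decision state this lesson targets


def global_prompt(task, principles):
    """Global injection: every lesson is placed once, ahead of the whole task."""
    lessons = "\n".join(f"- {p.text}" for p in principles)
    return f"Lessons:\n{lessons}\n\nTask: {task}"


def stepwise_prompts(task, states, principles):
    """Step-wise injection: each decision state sees only the lessons tagged for it."""
    prompts = []
    for state in states:
        relevant = [p.text for p in principles if p.applies_to == state]
        lessons = "\n".join(f"- {t}" for t in relevant) or "- (none)"
        prompts.append(f"Task: {task}\nState: {state}\nLessons for this step:\n{lessons}")
    return prompts


# Toy example with two decision states and one lesson for each.
states = ["search", "checkout"]
principles = [
    Principle("Prefer exact-match queries before broad ones", "search"),
    Principle("Confirm the total before submitting payment", "checkout"),
]

for prompt in stepwise_prompts("buy a book", states, principles):
    print(prompt, end="\n\n")
```

In the global version the checkout lesson is already in view while the agent is still searching; in the step-wise version it only appears at the state where it can change a choice.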

To see why granularity matters, put some numbers on it (these are illustrative; the paper reports the direction, not these figures). Say each self-improvement round internalizes 100 experience items. At instance-level, suppose 70 of them are over-specific, meaning they only fire on the exact trajectory they came from, so 70% of what you bake into the weights is noise that displaces general competence. That is 30 useful items against 70 noisy ones, a signal-to-noise ratio of roughly 0.4. Run that for eight rounds and the noise compounds: the book is now mostly dead weight. Abstract those same 100 items to principle-level first and perhaps 12 reusable strategies remain, so the signal-to-noise ratio flips from roughly 0.4 to over 7, which means fewer than one internalized item in eight is noise, and the eight rounds now add up instead of canceling out.
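
The arithmetic in the illustration above can be checked with a few lines of Python. This sketch uses only the illustrative figures from the text (100 items per round, 70 of them over-specific, eight rounds, a ratio of 7); none of them are results from the paper, and the helper names `signal_to_noise` and `noise_share` are made up for this example.

```python
def signal_to_noise(useful: int, noisy: int) -> float:
    """Ratio of reusable items to over-specific ones."""
    return useful / noisy


def noise_share(snr: float) -> float:
    """Fraction of all items that are noise, given a signal-to-noise ratio."""
    return 1.0 / (1.0 + snr)


# Illustrative figures from the text, not results from the paper.
ITEMS_PER_ROUND = 100
NOISY_PER_ROUND = 70
ROUNDS = 8

useful_per_round = ITEMS_PER_ROUND - NOISY_PER_ROUND
instance_snr = signal_to_noise(useful_per_round, NOISY_PER_ROUND)

# Totals if every round repeats the same instance-level mix.
total_useful = useful_per_round * ROUNDS
total_noisy = NOISY_PER_ROUND * ROUNDS

print(f"instance-level SNR: {instance_snr:.2f}")       # 0.43
print(f"after {ROUNDS} rounds: {total_useful} useful vs {total_noisy} noisy items")  # 240 vs 560
print(f"noise share at SNR 7: {noise_share(7):.3f}")   # 0.125
```

A ratio above 7 therefore corresponds to a noise share below 12.5%, against 70% in the instance-level case.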

Design axis	The fragile choice (collapses)	The durable choice (sustains)	Why it matters
Experience granularity	instance-level — tied to one trajectory	principle-level — abstract, transferable	specifics crowd out general skill; principles survive repeated rounds
Injection pattern	global — rewrite everything at once	step-wise — aligned to each decision state	long-horizon tool use needs advice at the moment it changes a choice
Internalization regime	on-policy — train on your own rollouts	off-policy — distill strong teacher trajectories	on-policy amplifies the agent's own errors; off-policy is a steadier signal
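
The three axes in the table can be written down as a small configuration object, which makes it easy to see which choices a given pipeline is making. This is a hypothetical sketch: the enum and class names, and the `training_trajectories` helper, are invented for illustration and do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Granularity(Enum):
    INSTANCE = "instance-level"
    PRINCIPLE = "principle-level"


class Injection(Enum):
    GLOBAL = "global"
    STEPWISE = "step-wise"


class Regime(Enum):
    ON_POLICY = "on-policy"
    OFF_POLICY = "off-policy"


@dataclass(frozen=True)
class InternalizationRecipe:
    granularity: Granularity
    injection: Injection
    regime: Regime

    def fragile_axes(self):
        """List the axes set to the choice the table marks as fragile."""
        fragile = []
        if self.granularity is Granularity.INSTANCE:
            fragile.append("experience granularity")
        if self.injection is Injection.GLOBAL:
            fragile.append("injection pattern")
        if self.regime is Regime.ON_POLICY:
            fragile.append("internalization regime")
        return fragile


def training_trajectories(recipe, agent_rollouts, teacher_trajectories):
    """Pick the training data source implied by the internalization regime."""
    if recipe.regime is Regime.OFF_POLICY:
        return list(teacher_trajectories)  # fixed, high-quality teacher runs
    return list(agent_rollouts)            # the agent's own current runs


naive = InternalizationRecipe(Granularity.INSTANCE, Injection.GLOBAL, Regime.ON_POLICY)
durable = InternalizationRecipe(Granularity.PRINCIPLE, Injection.STEPWISE, Regime.OFF_POLICY)

print(naive.fragile_axes())
print(durable.fragile_axes())
```

The naive configuration reports all three axes as fragile and the durable one reports none, mirroring the two middle columns of the table.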

None of this is free: judging which notes are "principle-level" is itself a modeling problem, step-wise alignment needs you to track which state each lesson belongs to, and an off-policy teacher has to come from somewhere. But the payoff is the headline — get all three knobs right and, in the paper's experiments, the agent's skill keeps climbing across rounds instead of eroding, which is the difference between a self-evolving loop you can ship behind a shadow eval and one that quietly degrades in production.


Related explainers

MLEvolve — Monte Carlo Graph Search — the search side of self-evolving agents, vs this paper's learning side
RecMem — subconscious agent memory — keeping experience in retrievable memory instead of baking it into weights
MSR delegation study — fidelity drift over iterations — another way agents degrade as a process repeats


