// Insight
Memory as action: the agent learns what to forget
Every long-horizon agent eventually drowns in its own context. Tool results, retrieved documents, dead-end reasoning, all of it accumulates until the window is full of history and empty of signal. The standard fixes are external: summarize on a schedule, evict by recency, bolt on a memory system that cannot see what the agent is trying to do. MemAct takes the integrated route. Context management becomes part of the agent’s own action space: deleting and inserting working-memory content are moves the policy makes, chosen with the same machinery that chooses the next tool call.
The framing is the contribution. An agent that treats its context as mutable state can compress a finished sub-task to its conclusion, drop the failed branch entirely, and keep the running thesis pinned, decisions that require knowing what matters to the task right now. A recency heuristic cannot know that. A scheduled summarizer cannot either; it compresses the load-bearing constraint with the same enthusiasm as the boilerplate. Curation quality is task-dependent, which is exactly the argument for learning it.
The technical obstacle is what makes this a paper rather than a prompt. When an agent edits its own context mid-trajectory, the trajectory stops being a single coherent sequence, which breaks the assumptions standard RL training rests on. The proposed Dynamic Context Policy Optimization handles the fracture by segmenting trajectories at memory action points and applying trajectory-level advantages to the resulting segments, making end-to-end training stable despite the moving workspace. The authors report the approach reduces overall computational consumption while improving task performance, with the gains coming precisely from the model learning when curation pays.
Where this sits in the memory lineage
MemGPT opened this design space in 2023 with the operating-system metaphor: virtual context management paging information between window and external storage, the model issuing memory edits through function calls. The mechanism was right; the decision-maker was a prompt. Mem0 industrialized the external route this spring: an extract-consolidate-retrieve pipeline posting a 26% LLM-judged gain over OpenAI’s built-in memory, 91% lower p95 latency, plus 90% token savings against full context. Strong engineering, still a pipeline standing outside the policy it serves. Anthropic’s context-engineering guidance codified the engineered in-window practice, compaction and structured note-taking on a schedule.
The horizontal move trades durable storage for immediacy: in-window edits land where the very next forward pass feels them. The vertical move trades predictability for task-awareness. Every quadrant stays a legitimate choice at a price, the engineered cells cheaper to govern, the learned cells better at knowing what this task can afford to lose.
Racing has a name for the underlying discipline: weight. A bike that carries every tool it might conceivably need is slower everywhere, all the time, in every corner. Race teams strip mass with fanatical attention because weight is a tax paid continuously rather than once. Context is the agent’s weight, taxed on every single forward pass; an agent that learns to strip it is doing what every race engineer does the night before the event. The skill is knowing what you can afford to remove, which is exactly what is being learned here.
The desk relevance runs through the agent designs this archive has tracked all autumn. Memory-R1 learned add-update-delete over an external memory bank; MemAct moves the same learned-curation idea inside the window, where the long-horizon context-rot problem actually lives. A filings-research agent that runs for hours accumulates exactly the debris this targets: superseded quotes, resolved sub-questions, retrieval results that turned out irrelevant. Carrying them costs latency and money on every step. Worse, it costs accuracy, since the relevant fact competes with its own noise for attention. The pattern worth adopting is the unified policy: whatever curates the context should see the task, whether it is learned end-to-end or built as a disciplined heuristic that reads the agent’s goal state.
The caveats are early-paper standard. The evidence comes from agentic benchmarks rather than financial workflows, the deletion operation deletes evidence as happily as debris if the reward says so. An agent that edits its own record needs the external audit trail precisely because the internal one is now mutable by design. A compliance-grade deployment would log every edit operation outside the agent’s reach, turning the learned forgetting into an auditable event stream rather than silent rewriting.
The note-sized verdict: the direction is right and slightly ahead of its evidence. Context curation as a first-class, task-aware decision is where long-running agents have to go, because the alternative, windows that only grow, prices long horizons out of existence. Whether the curation policy must be learned with RL or can be engineered well enough by hand is the open question a desk pilot would answer cheaply.
MemAct makes forgetting a learned action: the agent curates its own working context with the policy that does the work, because what to keep in the window depends on the task, and only the task knows.
Working on AI that needs to ship?
I help funds, fintechs, and data teams take AI from prototype to production.