
Memory-R1: teaching an agent when to write, update, and delete

August 30, 2025 · 4 min read

memory, RAG, RL

Most agent memory today is a junk drawer: everything gets appended, nothing gets curated, and retrieval digs through the pile hoping the right fragment surfaces. Memory-R1 attacks the actual problem: maintaining a useful memory is a sequence of decisions (add, update, delete, or leave alone), and nobody labels those decisions for you. The paper’s move is to learn them from outcomes with reinforcement learning instead.

The design has two learned policies. A Memory Manager watches each dialogue turn and chooses an operation on the external memory bank: ADD a new entry, UPDATE an existing one, DELETE something stale, or NOOP. An Answer Agent then handles questions by taking the retrieved candidate memories and distilling them, filtering the haul down to what actually bears on the question before answering. Both are trained with outcome-driven RL: if the final answer is right, the operations that led there get credit. No per-operation labels exist anywhere in the loop.

Figure: Memory-R1, two policies trained from answer correctness. Correct answers are the only training signal for every memory operation upstream.
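To make the loop concrete, here is a minimal sketch of a memory bank that accepts the four operations, plus an outcome reward that is handed back to every operation in the trajectory. This is not the paper's code. The names (`MemoryBank`, `Action`, `outcome_reward`, `credit_trajectory`) are hypothetical, the exact-match reward is an assumption, and the actions are hard-coded where a trained Memory Manager would choose them.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Op(Enum):
    ADD = "ADD"
    UPDATE = "UPDATE"
    DELETE = "DELETE"
    NOOP = "NOOP"


@dataclass
class Action:
    op: Op
    entry_id: Optional[int] = None  # target entry for UPDATE / DELETE
    text: Optional[str] = None      # new content for ADD / UPDATE


@dataclass
class MemoryBank:
    entries: dict = field(default_factory=dict)
    next_id: int = 0

    def apply(self, action: Action) -> None:
        if action.op is Op.ADD:
            self.entries[self.next_id] = action.text
            self.next_id += 1
        elif action.op is Op.UPDATE:
            if action.entry_id in self.entries:
                self.entries[action.entry_id] = action.text
        elif action.op is Op.DELETE:
            self.entries.pop(action.entry_id, None)
        # NOOP: leave the bank untouched


def outcome_reward(predicted: str, gold: str) -> float:
    # Assumed reward: 1.0 on a case-insensitive exact match, else 0.0.
    return float(predicted.strip().lower() == gold.strip().lower())


def credit_trajectory(actions, reward):
    # No per-operation labels: every operation shares the final-answer reward.
    return [(action, reward) for action in actions]


# Hard-coded stand-in for what a learned Memory Manager would emit per turn.
bank = MemoryBank()
actions = [
    Action(Op.ADD, text="User adopted a dog named Buddy."),
    Action(Op.NOOP),
    Action(Op.UPDATE, entry_id=0, text="User adopted two dogs, Buddy and Scout."),
    Action(Op.ADD, text="User is flying to Lisbon on Friday."),
    Action(Op.DELETE, entry_id=1),  # the trip was cancelled
]
for action in actions:
    bank.apply(action)

reward = outcome_reward("Buddy and Scout", "buddy and scout")
credited = credit_trajectory(actions, reward)
```

The point of the sketch is the last two lines: the reward is computed once, from the answer, and spread over every upstream operation. How a real RL recipe turns that shared signal into a policy update is where PPO and GRPO differ.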

The data efficiency is the striking result. The whole system trains on 152 question-answer pairs, then leads on three long-horizon benchmarks (LoCoMo, MSC, and LongMemEval) and generalizes across model scales from 3B to 14B. On LoCoMo with a LLaMA-3.1-8B backbone, the LLM-as-a-Judge score reaches 62.74 against 45.68 for Mem0 and 48.20 for MemoryOS, the deployed systems a team would reach for today.

Figure: LoCoMo, LLM-as-a-Judge score, LLaMA-3.1-8B backbone. Training the memory operations at all delivers most of the gain over deployed systems.

The fine print is as instructive as the headline. A supervised baseline trained on the same data lands at 58.76, ahead of PPO and four points short of GRPO. Read that the way the RLVR literature taught us to: most of the lift comes from training the operations at all, with the right RL recipe adding a real but modest margin on top. The echo of s1’s thousand curated traces is loud: when the base model already carries the competence, a small, well-aimed training signal is enough to organize it. What RL adds here is not new knowledge but a policy for exercising judgment the model already has.
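The "most of the lift" claim can be checked with the four LoCoMo numbers quoted in this post. The sketch below assumes the 62.74 headline is the GRPO variant, which is how the four-point gap reads. The labels `SFT` and `GRPO` and the helper `share_from_training` are mine, not the paper's.

```python
# LLM-as-a-Judge scores on LoCoMo (LLaMA-3.1-8B backbone), as quoted above.
scores = {"Mem0": 45.68, "MemoryOS": 48.20, "SFT": 58.76, "GRPO": 62.74}


def share_from_training(baseline: str) -> float:
    """Fraction of the GRPO-over-baseline gain already captured by SFT."""
    total_gain = scores["GRPO"] - scores[baseline]
    sft_gain = scores["SFT"] - scores[baseline]
    return sft_gain / total_gain


rl_margin = scores["GRPO"] - scores["SFT"]
shares = {name: share_from_training(name) for name in ("Mem0", "MemoryOS")}

print(f"GRPO over SFT: {rl_margin:.2f} points")
for name, share in shares.items():
    print(f"SFT share of the gain over {name}: {share:.0%}")
```

Against either deployed system, supervised training on the same data accounts for roughly three quarters of the total gain, and the RL recipe supplies the remaining four points.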

Deletion is the alpha

The operation worth dwelling on is DELETE. Adding to memory is easy and every framework does it. Knowing that yesterday’s entry is now wrong (guidance was withdrawn, a position was closed, a thesis got revised) and acting on it is what separates a memory from a liability. An agent that cannot delete is an agent whose context slowly fills with confident, stale facts, which is a worse failure mode than knowing nothing.

Any quant who has maintained a research database recognizes this as point-in-time discipline. A backtest over a universe that silently retains delisted tickers or pre-revision fundamentals produces beautiful, wrong results. The desk version of Memory-R1’s problem appears the moment you give an agent a persistent notebook across sessions: an evolving earnings thesis, a position book that changes daily, broker guidance that gets superseded mid-quarter. Append-only memory turns each of those into a contamination source. A learned update-and-delete policy is the agentic equivalent of the as-of-date join. Training it from answer correctness alone, rather than from labels nobody has, is what makes it deployable.
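For readers who have not lived with point-in-time data, here is a minimal sketch of the as-of lookup the analogy leans on. It is the database-side discipline, not Memory-R1's mechanism: instead of physically deleting an entry it closes the entry's validity window, so a query as of any date sees only what was in force then. The `Fact` type, the helper names, and the ACME guidance values are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fact:
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means still in force


def supersede(facts, key, value, when):
    """UPDATE analogue: close the open fact for this key, then add the new one."""
    for fact in facts:
        if fact.key == key and fact.valid_to is None:
            fact.valid_to = when
    facts.append(Fact(key, value, when))


def retract(facts, key, when):
    """DELETE analogue: close the open fact without replacing it."""
    for fact in facts:
        if fact.key == key and fact.valid_to is None:
            fact.valid_to = when


def as_of(facts, key, when):
    """Return the value in force on `when`, or None if nothing was."""
    live = [
        f for f in facts
        if f.key == key
        and f.valid_from <= when
        and (f.valid_to is None or when < f.valid_to)
    ]
    if not live:
        return None
    return max(live, key=lambda f: f.valid_from).value


facts = []
supersede(facts, "ACME guidance", "EPS 2.10-2.20", date(2025, 1, 15))
supersede(facts, "ACME guidance", "EPS 1.90-2.00", date(2025, 2, 20))
retract(facts, "ACME guidance", date(2025, 3, 10))

before_any = as_of(facts, "ACME guidance", date(2025, 1, 1))
original = as_of(facts, "ACME guidance", date(2025, 2, 1))
revised = as_of(facts, "ACME guidance", date(2025, 3, 1))
withdrawn = as_of(facts, "ACME guidance", date(2025, 3, 15))
```

An append-only store would return both guidance ranges for every one of those queries. The learned policy's job is to decide when `supersede` or `retract` should fire, with no schema telling it so.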

The honest caveats are scope-shaped. The benchmarks are long-horizon dialogue rather than financial workflows. The memory bank is text entries rather than structured state. Whether outcome rewards stay informative when the horizon stretches from a conversation to a quarter of research sessions is exactly the kind of credit-assignment question RL keeps relearning. This spring, the agentic-RAG survey catalogued memory as the least-developed component in the stack. Memory-R1 is the first result I have seen that treats memory hygiene as a first-class learned skill rather than a heuristic, which is the right framing even if the specific recipe evolves.

The practical takeaway for anyone building a desk agent: stop evaluating memory systems by what they store and start evaluating them by what they correctly discard. A 152-example training budget means you can plausibly tune this on your own workflow’s outcomes. The bar to clear is an honest one: whether the learned policy beats a competent heuristic stack (dated entries, recency-weighted retrieval, scheduled expiry) on your tasks rather than on LoCoMo.
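For reference, the heuristic stack that a learned policy has to beat fits in a few dozen lines. The sketch below is one possible baseline under assumed choices: token-overlap relevance, a seven-day recency half-life, and a per-entry time-to-live. None of it comes from the paper, and the entries are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Entry:
    text: str
    written_at: datetime  # dated entries
    ttl: timedelta        # scheduled expiry


def expire(entries, now):
    """Drop entries whose time-to-live has elapsed."""
    return [e for e in entries if now - e.written_at < e.ttl]


def relevance(query, text):
    """Crude relevance: fraction of query tokens that appear in the entry."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def score(query, entry, now, half_life):
    """Relevance discounted by age: the score halves every `half_life`."""
    age_in_half_lives = (now - entry.written_at) / half_life
    return relevance(query, entry.text) * 0.5 ** age_in_half_lives


def retrieve(query, entries, now, half_life=timedelta(days=7), k=3):
    live = expire(entries, now)
    ranked = sorted(
        live, key=lambda e: score(query, e, now, half_life), reverse=True
    )
    return ranked[:k]


now = datetime(2025, 8, 30)
entries = [
    Entry("ACME position long 500 shares", now - timedelta(days=20), timedelta(days=90)),
    Entry("ACME position closed", now - timedelta(days=1), timedelta(days=90)),
    Entry("ACME broker guidance raised", now - timedelta(days=45), timedelta(days=30)),
]
top = retrieve("ACME position", entries, now, k=2)
```

Note what the baseline cannot do: the stale "long 500 shares" entry is outranked but still retrieved, because nothing here knows the newer entry contradicts it. That gap is the thing to measure a learned delete policy against.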

Memory-R1 trains add, update, and delete from answer correctness alone, with 152 examples: agent memory becomes a learned discipline, and deletion is the skill worth paying for.
