Tim Frenzel

// Insight

What agent memory actually costs to run

11 min read
agent-memory, systems, MLOps, agents

The agent-memory literature this archive has followed for a year optimized one number. Memory-R1 learned when to add and delete. MemAct edited its own working context. Hindsight split durable facts from evolving beliefs. Every one of them reported downstream accuracy and stopped there. A new systems paper from Stanford, KU Leuven, and MIT asks the question all of them skipped: what does running this cost? Profiling ten memory systems on a single H100, the authors find that energy per correct answer spans a 47-fold range, from roughly 4.1 kilojoules for a plain lexical index to 186 kilojoules for an agentic system. The cost is dominated by the one-time work of building the memory rather than by serving a query. For a desk choosing a memory stack, accuracy alone is an insufficient selection criterion.

Why does agent memory need a systems audit?

Agent memory is what RAG becomes when the corpus stops being static. Classic retrieval attaches a model to a fixed, pre-indexed document set. A long-horizon agent instead writes its own corpus from its interaction stream, per user, appending, summarizing, and rewriting it across sessions. That turns memory from a passive lookup into an active systems component with a write path, a search path, and an ongoing maintenance policy. The reference point for the whole design space is MemGPT, which framed memory as an operating-system problem, paging a compact core context against a large archival store. One of the ten systems here, Letta, implements that abstraction directly.

The field then split four ways. The paper’s taxonomy is the spine of everything that follows.

Four ways to give an agent a memory
Long-context (raw history in the prompt): no build, full prefill every query.
Flat RAG (BM25, embedRAG): deterministic index, no LLM on the write path.
Structure-augmented (GraphRAG, HippoRAG, Mem0, SimpleMem): an LLM extracts facts and graphs while writing.
Agentic control (A-Mem, Letta, MIRIX): the model decides when to write, read, and rewrite.
Ten systems span the four paradigms; cost shape follows the paradigm, not the vendor. Letta implements the MemGPT core-versus-archival abstraction.

Long-context pays a full-history prefill on every query. Flat RAG builds a deterministic index with no LLM in the loop. Structure-augmented systems call an LLM at write time to extract facts, triples, or summaries. Agentic systems hand memory operations to the model as a variable-depth control loop, letting it decide when to search and whether the evidence it pulled is enough. The existing benchmarks scored all four on answer accuracy and left their system behavior dark. That gap is the paper’s target, because the design choices that are invisible to an accuracy metric are exactly the ones that dominate at deployment scale.

Where does the money actually go?

Into construction. Most memory systems serve a question in a fraction of long-context wall time. Mem0 answers in under 0.1 seconds against roughly 38 seconds for stuffing the full history into GPT-4.1-mini. But that query-time advantage is paid up front: Mem0 spends about 4,108 seconds building its memory before the first question; Letta spends over thirteen hours.

Energy per correct answer, kilojoules (Qwen3-32B, 300 queries)
BM25: 4.1
embedRAG: 4.1
HippoRAG v2: 10.1
GraphRAG: 15.1
SimpleMem: 50.7
Mem0: 50.8
A-Mem: 116.1
MIRIX: 144.6
Letta: 185.9
A 47x spread on identical hardware, set by one-time construction rather than query serving.
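The normalization behind the chart is simple division, and it can be sketched in a few lines. The block below is a minimal illustration, not the paper's tooling: the function names are hypothetical, and the dictionary holds the rounded values from the chart above.

```python
# Energy per correct answer in kJ, copied from the chart above (rounded values).
ENERGY_PER_CORRECT_KJ = {
    "BM25": 4.1, "embedRAG": 4.1, "HippoRAG v2": 10.1, "GraphRAG": 15.1,
    "SimpleMem": 50.7, "Mem0": 50.8, "A-Mem": 116.1, "MIRIX": 144.6,
    "Letta": 185.9,
}


def energy_per_correct(total_kj: float, n_correct: int) -> float:
    """Normalize end-to-end energy by the number of correctly answered queries."""
    if n_correct <= 0:
        raise ValueError("n_correct must be positive")
    return total_kj / n_correct


def spread(values) -> float:
    """Ratio of the most expensive entry to the cheapest one."""
    vals = list(values)
    return max(vals) / min(vals)


chart_spread = spread(ENERGY_PER_CORRECT_KJ.values())
print(f"spread across charted systems: {chart_spread:.1f}x")
```

Run on the rounded chart values, the ratio comes out near 45x rather than the reported 47x. The difference presumably reflects rounding in the displayed figures or a system not shown in the chart.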

The measurements are not synthetic. Each system ingests five LongMemEval samples of about 360,000 tokens of conversational history, then answers 300 queries, with every model and embedding call instrumented on the same H100 timeline. End-to-end energy spans a 26.7-fold range across the suite, from BM25 at 582 kilojoules to Letta at 15,429. Normalized by correct answers, the spread widens to 47-fold.

The shape of that construction cost is specific. It is a repeated long-read, short-write workload: each write reads a long chunk and emits a compact record. The median decode share of construction tokens is only 4.6%. The work is embedding- and prefill-heavy, closer to background indexing than to interactive serving.

The practical sting is co-location. When construction and query traffic share one serving stack, a large construction prefill job occupies the KV-cache and stalls the scheduler precisely when a latency-sensitive query arrives. The paper’s recommendation is to treat construction as a background throughput workload with explicit admission control.

The embedding traffic splits by paradigm too. Append-only graph systems submit large batches, with GraphRAG embedding roughly 2,300 sequences per call as it indexes entity-relation tuples offline. Consolidating and agentic systems do the opposite, embedding each extracted fact one at a time to resolve an add-update-delete decision on the write-loop critical path. An embedding server tuned for batch throughput will head-of-line block those per-fact writers.
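One way to picture admission control for construction traffic is a token budget on in-flight prefill work. The sketch below is an assumption-laden illustration rather than anything the paper specifies: the class name, the `prefill_budget_tokens` parameter, and the FIFO backlog are all hypothetical choices.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AdmissionController:
    """Caps in-flight construction prefill so query traffic keeps headroom."""
    prefill_budget_tokens: int  # hypothetical cap, tuned per deployment
    inflight_tokens: int = 0
    backlog: deque = field(default_factory=deque)

    def submit_construction(self, job_id: str, prefill_tokens: int) -> bool:
        """Admit the job now if it fits and nothing is queued ahead of it."""
        if prefill_tokens > self.prefill_budget_tokens:
            raise ValueError("job exceeds the whole budget; split it first")
        fits = self.inflight_tokens + prefill_tokens <= self.prefill_budget_tokens
        if fits and not self.backlog:
            self.inflight_tokens += prefill_tokens
            return True
        self.backlog.append((job_id, prefill_tokens))
        return False

    def finish_construction(self, prefill_tokens: int) -> list[str]:
        """Release budget, then admit queued jobs in FIFO order while they fit."""
        self.inflight_tokens -= prefill_tokens
        admitted = []
        while (self.backlog and self.inflight_tokens + self.backlog[0][1]
               <= self.prefill_budget_tokens):
            job_id, tokens = self.backlog.popleft()
            self.inflight_tokens += tokens
            admitted.append(job_id)
        return admitted
```

Queries never pass through the controller in this sketch. Only background construction is throttled, which is the point of treating it as a throughput workload.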

A quant reads this as a familiar omission. The headline number leaves out the expensive part. A reported accuracy that hides a thirteen-hour build is the agent-stack cousin of a backtest Sharpe that ignores the implementation tax of the engine that produced it. The cost did not disappear. It moved to a line nobody was reading.

Does a cheaper construction model rescue the budget?

The obvious lever is to build memory with a small cheap model and reserve the expensive one for answering. The paper shows the lever exists, with a catch the operator has to respect. GraphRAG holds 47 to 48% accuracy across the entire ladder from Qwen3-1.7B to GPT-4o-mini, while MIRIX fails outright at Qwen3-1.7B because its pipeline needs well-formed JSON and legal tool calls that a small model cannot reliably emit. Systems without hard output contracts degrade gracefully as the construction model shrinks, which gives a continuous accuracy-versus-cost dial. Mem0, A-Mem, and SimpleMem sit on that soft-contract side, trading accuracy down smoothly as the builder falls toward a 1.7B model. That is where the cheapest construction savings hide for an operator willing to measure the floor first. Systems with strict contracts have a hard capability floor. Dropping below it does more than lower accuracy. It corrupts the store into something the answer model can recover nothing useful from. So the minimum viable construction model is an algorithm-imposed cost floor a team must validate before deployment rather than treat as a free dial.
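Measuring that floor can start with a contract-pass rate over sampled builder outputs. The sketch below assumes a strict contract of the form `{"tool": ..., "args": {...}}`; the tool names and the schema are hypothetical stand-ins, not MIRIX's actual interface.

```python
import json

# Hypothetical tool names standing in for a strict-contract system's interface.
ALLOWED_TOOLS = {"memory_add", "memory_update", "memory_delete"}


def obeys_contract(raw: str) -> bool:
    """True if the output is well-formed JSON naming a legal tool with dict args."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    tool = call.get("tool")
    return (
        isinstance(tool, str)
        and tool in ALLOWED_TOOLS
        and isinstance(call.get("args"), dict)
    )


def contract_pass_rate(outputs: list[str]) -> float:
    """Fraction of builder outputs that satisfy the contract."""
    if not outputs:
        raise ValueError("need at least one sampled output")
    return sum(obeys_contract(o) for o in outputs) / len(outputs)
```

Running this over a few hundred outputs from each candidate construction model gives a rough view of where the pass rate collapses, before any store is actually built with that model.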

What happens as the history grows?

Per-user memory is unbounded and accumulates indefinitely. The growth slope matters more than the starting footprint. Two curves diverge as a single user’s history scales toward a million tokens. On-disk footprint varies about 9-fold across systems: HippoRAG v2 inflates to roughly 62 MB on its multi-view graph, while consolidating Mem0 compresses redundant records down to about 12 MB. Token cost diverges far more sharply. Agentic systems scale super-linearly because each new write queries and rewrites a growing store. Letta’s per-ingestion cost climbs steeply past 256K tokens. None of the ten systems prunes or forgets by default. Any fleet-scale deployment has to add an independent forgetting policy or watch storage and token cost grow without bound. Project those per-user footprints to a hundred thousand users and the 9-fold spread becomes about 0.7 terabytes for embedRAG against 6.2 for HippoRAG v2, before any growth in history length. Retrieval latency, by contrast, stays nearly flat as the store grows, since index lookup is sub-linear in store size.
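The fleet projection is plain multiplication, sketched here with decimal units (1 TB = 1,000,000 MB). The function name is hypothetical. The 62 MB and 12 MB inputs come from the paragraph above; a per-user embedRAG footprint of about 7 MB is not stated in the text and is only back-solved from the 0.7 TB fleet figure.

```python
def fleet_storage_tb(per_user_mb: float, users: int) -> float:
    """Project a per-user footprint in MB to fleet storage in decimal TB."""
    return per_user_mb * users / 1_000_000


USERS = 100_000
for name, per_user_mb in [("HippoRAG v2", 62.0), ("Mem0", 12.0)]:
    print(f"{name}: {fleet_storage_tb(per_user_mb, USERS):.1f} TB")
```

The same function applied to a growing per-user footprint gives the slope that matters more than either starting point.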

On a regulated desk that missing forgetting policy is a compliance gap as much as a cost one. A client’s right to erasure, information barriers around material non-public information, and point-in-time correctness all require deliberate deletion. A store that only grows cannot honor any of them. It is the deletion-as-discipline point the agent-memory thread has circled, now a control requirement instead of an efficiency tweak.
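A forgetting policy of the kind this implies can be sketched as two filters over the store: one that drops erased users and expired records, and one that restricts reads to a point in time. The record shape and function names below are hypothetical; none of the ten systems is described as exposing this interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    user_id: str
    text: str
    written_at: float  # epoch seconds


def apply_forgetting(records, now: float, max_age_s: float, erased_users: set):
    """Keep only records from non-erased users that are within the retention window."""
    return [
        r for r in records
        if r.user_id not in erased_users and now - r.written_at <= max_age_s
    ]


def visible_as_of(records, as_of: float):
    """Point-in-time view: only records written at or before `as_of`."""
    return [r for r in records if r.written_at <= as_of]
```

This only covers the primary records. A real deployment would presumably also have to purge derived artifacts such as embeddings, summaries, and graph edges built from the deleted text.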

A second trap hides in the timing. When sessions arrive continuously and a later query depends on an earlier write, a slow-construction system faces a forced choice: block the query until the write commits, or answer against stale memory. Per-session construction time spans five orders of magnitude across the suite. For the slow systems that becomes a real scheduling decision rather than a corner case. A batch benchmark never surfaces it.
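The block-or-stale choice can be made explicit with a bounded wait. In this minimal sketch the query waits up to `max_wait_s` for the pending write to commit, then answers either way and reports whether the answer saw fresh memory. The function and parameter names are assumptions for illustration.

```python
import threading
from typing import Callable, Tuple


def answer_with_freshness(
    write_committed: threading.Event,
    answer_fn: Callable[[], str],
    max_wait_s: float,
) -> Tuple[str, bool]:
    """Wait briefly for the pending write, then answer and flag staleness.

    Returns (answer, is_fresh). is_fresh is False when the wait timed out
    and the answer was produced against memory that lacks the latest write.
    """
    is_fresh = write_committed.wait(timeout=max_wait_s)
    return answer_fn(), is_fresh
```

Setting `max_wait_s` to zero yields the always-stale policy, and a very large value approximates always-block. For a slow-construction system the interesting settings lie between those two.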

The frontier figure makes the central tension visual.

MemoryAgentBench accuracy, macro-averaged (%)
BM25: 55.8
embedRAG: 50.3
HippoRAG v2: 47.4
GraphRAG: 47
A-Mem: 42.1
SimpleMem: 36.2
MIRIX: 31.7
Mem0: 26.8
Letta: 25.9
BM25 tops this recall-leaning aggregate at a sub-second build, while A-Mem builds for ~17,666s and scores lower. On multi-hop and temporal tasks the structure-augmented systems close or reverse the gap.
The blunt surprise is that BM25, the cheapest system in the suite, tops the aggregate accuracy while the elaborate agentic builds score lower at thousands of times the construction cost.

That aggregate is benchmark-dependent. MemoryAgentBench leans recall-heavy, which rewards the exact-match retrieval lexical search handles well. The structure-augmented systems narrow or reverse the gap on the multi-hop and temporal tasks they were built for. The honest read is simpler than a lexical-search victory. No single system is best on construction cost, query latency, and accuracy at once. Each occupies a distinct point on that three-way frontier, which turns selection into a question about your own workload.
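The three-way frontier can be made concrete as a Pareto filter: a system stays on the frontier unless some other system is at least as good on all three axes and strictly better on one. The sketch below is a generic implementation with hypothetical names; any profile values fed to it should come from measurements on your own workload.

```python
from typing import List, NamedTuple


class Profile(NamedTuple):
    name: str
    build_s: float    # construction time, lower is better
    query_s: float    # query latency, lower is better
    accuracy: float   # percent correct, higher is better


def dominates(a: Profile, b: Profile) -> bool:
    """True if `a` is no worse than `b` on every axis and better on at least one."""
    no_worse = (a.build_s <= b.build_s and a.query_s <= b.query_s
                and a.accuracy >= b.accuracy)
    better = (a.build_s < b.build_s or a.query_s < b.query_s
              or a.accuracy > b.accuracy)
    return no_worse and better


def pareto_frontier(profiles: List[Profile]) -> List[Profile]:
    """Profiles that no other profile dominates."""
    return [p for p in profiles if not any(dominates(q, p) for q in profiles)]
```

Anything the filter removes is strictly worse than an alternative you already measured. Choosing among the survivors is the workload question.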

Match the cost split to the query pattern
High query volume, stable history: pay once at construction; consolidate into atomic facts; cheap, fast queries thereafter.
Sparse queries, constant ingestion: keep construction light; BM25 or embedRAG; no write backlog, always fresh.
No system wins build cost, query latency, and accuracy together; pick by how your queries actually arrive.

How predictable is the latency?

A cost you cannot bound is worse than a high cost you can. On fixed hardware the effective time-to-first-token spans two orders of magnitude, from about 0.10 seconds for Mem0 to 22.6 seconds for SimpleMem. The dominant variable is retrieval-pipeline depth rather than model serving speed. The deeper split is between two regimes. Algorithm-bounded phases stop at a point the algorithm fixes, like BM25’s single top-k lookup or GraphRAG’s one extraction call per chunk. Their worst case is a static property an operator can profile in advance. LLM-bounded phases continue until the model decides they are done, like MIRIX’s type-specific tool calls or SimpleMem’s reflection rounds. When the spec permits arbitrary depth, only an explicit iteration cap bounds the cost.
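An explicit iteration cap for an LLM-bounded phase can be as small as a loop with two exits. The sketch below assumes hypothetical `step` and `is_sufficient` callables standing in for the model's retrieval call and its own "enough evidence" judgment. The step cap and the wall-clock deadline are the external bounds.

```python
import time
from typing import Callable, List


def bounded_retrieve(
    step: Callable[[List[str]], List[str]],
    is_sufficient: Callable[[List[str]], bool],
    max_steps: int,
    deadline_s: float,
) -> List[str]:
    """Run a model-driven retrieval loop under a step cap and a deadline.

    The deadline is checked between steps only; it does not interrupt a
    step that is already running.
    """
    evidence: List[str] = []
    start = time.monotonic()
    for _ in range(max_steps):
        if time.monotonic() - start > deadline_s:
            break  # out of time: stop with whatever was gathered
        evidence.extend(step(evidence))
        if is_sufficient(evidence):
            break  # the model-side stopping rule fired first
    return evidence
```

With the cap in place the worst case becomes `max_steps` times the slowest single step, which is a quantity an operator can profile in advance.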

QA latency tail, p95/p50 ratio
BM25: 1.3
embedRAG: 1.3
SimpleMem: 1.2
HippoRAG v2: 1.6
A-Mem: 1.6
MIRIX: 2.3
Mem0: 2.4
Letta: 3.9
GraphRAG: 5.9
Deterministic pipelines stay near 1.3x; LLM-bounded retrieval widens the tail as the model adds variable steps.

The two regimes produce different tails. Deterministic systems hold a p95/p50 latency ratio near 1.3, while LLM-bounded retrieval reaches 5.9 for GraphRAG and 3.9 for Letta as extra tool calls or reflection rounds push latency past what the corpus alone predicts. For a desk that has to meet a latency target, that tail is the number that matters. An algorithm-bounded system can be provisioned from a worst case measured on representative inputs. An LLM-bounded system needs external caps and timeouts, because profiling samples the model’s behavior on the inputs you tested and stays silent on the ones you did not.
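The tail ratio is cheap to compute from logged latencies. This sketch uses a nearest-rank percentile with integer arithmetic; that definition is an assumption, since the paper's exact percentile method is not given here, and other interpolation rules will shift the result slightly on small samples.

```python
from typing import Sequence


def percentile(samples: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= q <= 100:
        raise ValueError("q must be between 1 and 100")
    ordered = sorted(samples)
    rank = (q * len(ordered) + 99) // 100  # ceil(q * n / 100) without floats
    return ordered[rank - 1]


def tail_ratio(samples: Sequence[float]) -> float:
    """p95 / p50 of the observed latencies."""
    return percentile(samples, 95) / percentile(samples, 50)
```

A ratio near 1 means the median predicts the tail. A ratio several times larger means capacity planning from the median will miss the latency target.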

That turns the paper into a buyer’s checklist. Before adopting a memory system, measure the four numbers that decide its lifecycle cost, none of which appears on an accuracy leaderboard.

Four numbers to demand before picking a memory system
Construction energy (kJ to build the store): spread 26.7x, dominates the lifecycle.
Construction-vs-query split (where the cost sits): long-context pays per query, Mem0 pays at build.
Construction-model floor (smallest builder that holds): a hard wall for strict-contract systems like MIRIX.
Growth slope (cost per token as history scales): super-linear for agentic, flat for flat RAG.
Each is measurable in a weekend on your own corpus; none shows up on the accuracy leaderboard that still drives most selection.

The bottom line

The verdict is a selection discipline. Memory choice is a systems decision priced across the full lifecycle rather than picked off a leaderboard. This is the agent-shaped version of a lesson the archive keeps relearning. In backtesting the engine is part of the model. In context files the convention everyone adopted carried an unmeasured tax. In multi-agent scaling coordination overhead is a cost line that has to clear its own bar. Agent memory is the same story with a kilojoule meter bolted on. Eighteen years on a quant desk teach the reflex this paper rewards: when someone shows you an accuracy number, ask what it cost to produce, then ask how that cost grows as the history does. The methods this archive admired were solving the easier half of the problem. Pricing the other half is what turns a memory demo into something a desk can budget and run.

Agent memory’s real cost is not the query, it is the construction: ten systems span 47x in energy per correct answer on one GPU. The bill is set by how you build the store and how fast it grows, two numbers no accuracy benchmark reports.
