
Hindsight: typed memory beats a bigger window

January 6, 2026 · 4 min read

memory, agents, benchmark

The agent-memory debate keeps being framed as a capacity question, even though the strongest results keep coming from structure. Hindsight, published in December by Latimer, Boschi, and colleagues, makes the cleanest version of the structural case yet. Memory is organized into four typed networks: world facts, the agent’s own experiences, synthesized entity summaries, and evolving beliefs. On LongMemEval the design lifts an open 20B model from 39 percent to 83.6 percent, beating full-context GPT-4o with a fraction of the parameters. On LoCoMo it posts 89.61 against 75.78 for the prior best open system. The window did not get bigger; the memory got a schema.

The four-way typing is the design decision worth dwelling on, because each type ages differently and the system can finally treat them that way. World facts are stable and verifiable. Experiences are episodic, timestamped, append-mostly. Entity summaries are synthesized and must be re-derived when their sources change. Beliefs are the volatile tier: conclusions that new evidence should revise. Keeping them separate from facts is precisely what stops yesterday’s hypothesis from masquerading as today’s ground truth. The three operations (retain, recall, reflect) route accordingly, with reflection doing the work naive systems skip: revisiting stored content to synthesize summaries and update beliefs after the fact.
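To make the typing concrete, here is a minimal sketch of a store with four kinds and the three operations. It is not the paper's implementation: the class and method names (`TypedMemory`, `MemoryItem`, `MemoryKind`), the substring-match recall, the string-join stand-in for the model call in `reflect`, and the "Acme" records are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional, Set


class MemoryKind(Enum):
    WORLD_FACT = "world_fact"          # stable, verifiable
    EXPERIENCE = "experience"          # episodic, timestamped, append-mostly
    ENTITY_SUMMARY = "entity_summary"  # synthesized; re-derived when sources change
    BELIEF = "belief"                  # volatile; revised by new evidence


@dataclass
class MemoryItem:
    kind: MemoryKind
    text: str
    entity: Optional[str] = None
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class TypedMemory:
    """Hypothetical in-process store, for illustration only."""

    # Only facts and experiences feed a summary; beliefs are kept out on purpose.
    SOURCE_KINDS = frozenset({MemoryKind.WORLD_FACT, MemoryKind.EXPERIENCE})

    def __init__(self) -> None:
        self._items: List[MemoryItem] = []

    def retain(self, kind: MemoryKind, text: str, entity: Optional[str] = None) -> MemoryItem:
        """Store one item under an explicit type."""
        item = MemoryItem(kind, text, entity)
        self._items.append(item)
        return item

    def recall(self, query: str, kinds: Optional[Set[MemoryKind]] = None) -> List[MemoryItem]:
        """Case-insensitive substring match, optionally restricted to some kinds."""
        needle = query.lower()
        return [
            item for item in self._items
            if (kinds is None or item.kind in kinds) and needle in item.text.lower()
        ]

    def reflect(self, entity: str) -> MemoryItem:
        """Re-derive the entity summary from its current sources."""
        sources = [
            i for i in self._items
            if i.entity == entity and i.kind in self.SOURCE_KINDS
        ]
        # Drop the previous summary so a stale one never sits beside the new one.
        self._items = [
            i for i in self._items
            if not (i.kind is MemoryKind.ENTITY_SUMMARY and i.entity == entity)
        ]
        # Stand-in for a model call that would synthesize the summary.
        summary = "; ".join(s.text for s in sources)
        return self.retain(MemoryKind.ENTITY_SUMMARY, summary, entity)


memory = TypedMemory()
memory.retain(MemoryKind.WORLD_FACT, "Acme filed its 10-K on March 1", entity="Acme")
memory.retain(MemoryKind.EXPERIENCE, "Asked Acme IR about margins; no reply", entity="Acme")
memory.retain(MemoryKind.BELIEF, "Acme margins will compress next quarter", entity="Acme")
summary = memory.reflect("Acme")
beliefs = memory.recall("acme", kinds={MemoryKind.BELIEF})
```

The point of the sketch is the routing: the belief never leaks into the summary, and a recall restricted to `BELIEF` returns only what the agent has concluded, not what it has verified.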

[Figure: Four memory types, because they age differently. The schema separates what is true from what is currently concluded; most failures blur exactly that line.]

The numbers justify the attention. A 44.6-point lift on LongMemEval (39 to 83.6 percent) over the same backbone given the full context is not a retrieval tweak; it is evidence that long-context recall fails for organizational reasons rather than capacity ones: the model drowns in an undifferentiated transcript that a typed store makes navigable.

[Figure: Long-horizon memory benchmarks (%). Same model, restructured memory: the gap is schema, not parameters.]

The result slots in as the third act of the thread this archive has tracked for eighteen months. Memory-R1 learned which operations to perform on an external bank; MemAct moved learned curation inside the window; Hindsight answers the question both left open (what the memory should be made of) with a typed data model rather than a learned policy. The approaches compose: a schema this clean is exactly the action space a learned manager wants to operate over.

For a desk agent, the four types map onto distinctions a research process already enforces on humans, with the beliefs tier mattering most: an investment thesis is a belief, dated and revisable; any memory system that stores it in the same bucket as a filing fact has already failed the audit.

Market data is world fact; the desk’s own calls and their outcomes are experiences; the running view of a counterparty is an entity summary, stale the day its sources change; the thesis is a belief with a revision history. A memory architecture that makes those distinctions native gives compliance something it has never had from an agent: a queryable answer to the question of what the system believed, when, and on what evidence. That is the same auditable-trail property that decides whether agent infrastructure survives review.
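The "what did the system believe, when, and on what evidence" query can be sketched as an append-only revision log with an as-of lookup. This is an illustration under stated assumptions, not anything the paper specifies: `BeliefLog`, `BeliefRevision`, the topic key, and the evidence identifiers such as `filing:10-K-2024` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class BeliefRevision:
    topic: str
    statement: str
    as_of: date                 # the date the belief was adopted
    evidence: Tuple[str, ...]   # hypothetical ids of the supporting records


class BeliefLog:
    """Append-only: a revision is added, an earlier one is never overwritten."""

    def __init__(self) -> None:
        self._revisions: List[BeliefRevision] = []

    def revise(self, topic: str, statement: str, as_of: date,
               evidence: Iterable[str]) -> BeliefRevision:
        revision = BeliefRevision(topic, statement, as_of, tuple(evidence))
        self._revisions.append(revision)
        return revision

    def believed_on(self, topic: str, when: date) -> Optional[BeliefRevision]:
        """Return the latest revision adopted on or before `when`, if any."""
        candidates = [
            r for r in self._revisions
            if r.topic == topic and r.as_of <= when
        ]
        return max(candidates, key=lambda r: r.as_of, default=None)

    def history(self, topic: str) -> List[BeliefRevision]:
        """Full revision trail for a topic, oldest first."""
        return sorted(
            (r for r in self._revisions if r.topic == topic),
            key=lambda r: r.as_of,
        )


log = BeliefLog()
log.revise("ACME", "Margins will expand on pricing power",
           date(2025, 3, 1), ["filing:10-K-2024"])
log.revise("ACME", "Margins will compress on input costs",
           date(2025, 6, 15), ["filing:10-Q-Q1", "call:2025-06-12"])

held_in_may = log.believed_on("ACME", date(2025, 5, 1))
held_in_july = log.believed_on("ACME", date(2025, 7, 1))
held_in_january = log.believed_on("ACME", date(2025, 1, 1))
```

Because revisions are appended rather than edited, the May query still returns the expansion thesis with its original evidence even after the June revision replaced it as the current view.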

The caveats are the usual ones for the genre. The benchmarks measure conversational memory rather than financial workflows. The four-network design carries a write-and-reflect cost that the paper’s gains comfortably absorb at this scale but that a high-throughput desk should meter. And the reflection step, synthesizing beliefs from evidence, is itself a model judgment whose errors now persist in storage, which argues for the beliefs tier carrying confidence and provenance fields from day one. None of that dims the headline. The field spent two years scaling windows to carry memory by brute force; a 20B model with a good schema just outperformed the brute force, and the lesson generalizes well beyond memory.
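One way to read "confidence and provenance fields from day one" is as a constraint enforced at write time, so a belief without sources cannot be stored at all. The sketch below is an assumption about what that could look like, not a description of Hindsight: the `Belief` record, the `beliefs_citing` and `below_confidence` helpers, and the source identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Belief:
    statement: str
    confidence: float            # assumed scale: 0.0 (no support) to 1.0 (certain)
    provenance: Tuple[str, ...]  # hypothetical ids of the records it was derived from

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if not self.provenance:
            raise ValueError("a belief must cite at least one source")


def beliefs_citing(beliefs: Iterable[Belief], source_id: str) -> List[Belief]:
    """Beliefs to re-examine when one source turns out to be wrong."""
    return [b for b in beliefs if source_id in b.provenance]


def below_confidence(beliefs: Iterable[Belief], threshold: float) -> List[Belief]:
    """Beliefs weak enough to warrant a second look before acting on them."""
    return [b for b in beliefs if b.confidence < threshold]


stored = [
    Belief("Counterparty X is reducing risk", 0.6, ("call:2025-06-12",)),
    Belief("Sector margins will compress", 0.8, ("filing:10-Q-Q1", "call:2025-06-12")),
    Belief("Rates stay flat through year end", 0.3, ("note:desk-2025-05-30",)),
]

to_recheck = beliefs_citing(stored, "call:2025-06-12")
weak = below_confidence(stored, 0.5)
```

With provenance stored on the record, a bad reflection step is recoverable: when a source is retracted, the beliefs built on it can be listed and re-derived rather than left to persist unnoticed.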

Four typed memory networks lift a 20B model from 39 to 83.6 on LongMemEval, past full-context GPT-4o: memory fails organizationally before it fails at capacity, and separating beliefs from facts is the schema decision an auditable desk agent needs anyway.
