// Insight
For a long-horizon agent, full history is dead weight
The instinct when an agent starts failing on a long task is to give it more memory of what it has done. This study inverts that. On a 50-task enterprise benchmark, GPT-5 itemizing hotel expenses through Dynamics 365 tools completed 71.0% of the work when it carried its full conversation history. Pruning that history to the last five tool calls and summarizing the rest lifted completion to 91.6%, using 2.7 times fewer tokens and 2.5 times less wall-clock time. Less context produced a better agent, cheaper and faster at once.
The result is the measured payoff of a geometric fact. A model reads the edges of its context window well and the middle poorly. A long agent trajectory is mostly middle. Every tool call the agent appends pushes the decision it has to make next deeper into the part of the window the model handles worst. Pruning does more than save cost. It keeps the state that matters out of the dead zone.
The study isolates two failures and a fix for each. The first is drift. An agent with the full task prompt but no running reminder of what remained completed only 8.0% of the itemizations, wandering off task as the context filled. The fix is a user model, a lightweight component that re-states the remaining amount to itemize after every step, pinning the goal to the recent edge where the model still reads it. That alone took completion from 8.0% to 71.0%. The second failure is trajectory bloat. The fix is curation: keep the last five tool calls verbatim, then summarize everything older into a compact note.
Each ingredient earns its keep. The decomposition is the actionable part.
Then the efficiency, which is where the result stops being intuitive.
Full history bought neither accuracy nor speed. Carrying 1.48 million tokens of history cost 14.56 hours per benchmark and still trailed a curated run that finished in under six. You do not trade accuracy for savings here. The same bloat that runs up the bill also buries the evidence the model needs to act on. For an agent, a longer context is just more room to lose the thread in.
Look closely at the token counts. The summary pays for itself twice over. Pruning to the last five calls alone used 535,274 tokens; adding the summary brought the total to 553,374, a difference of under 4%. For that rounding-error of extra context, completion jumped from 79.0% to 91.6%. The summary is close to free. It buys back the older state that raw truncation throws away, without reopening the bloat that pruning just closed.
The honest boundaries are the paper’s own. This is one structured task, expense itemization, where each step is a clean tool call and the relevant history is genuinely recent. The five-call window is calibrated to that rhythm: a single line takes two or three calls, making five about two cycles of working memory. A task with long-range dependencies, where a decision rests on something established forty steps ago, would need a smarter eviction policy than keep-the-last-five. Summarization is lossy by construction. The gap between 91.6% of lines completed and 99.64% of the dollar amount itemized is the residue of the few it still drops. The principle travels. The exact recipe is a starting point to tune, validated here on Claude Sonnet 4.5 as well, though not yet across task types.
The shape is common on a desk. Any agent that works a long tool-using trajectory, a reconciliation bot walking a ledger, a research agent pulling and cross-checking filings, an onboarding agent stepping through a KYC packet, accumulates the same kind of history: a long tail of settled sub-steps plus a small live frontier of what remains. The expense task just makes that structure clean enough to measure. Where the frontier is recent and the tail is settled, keep-the-recent-and-summarize is the right default. Where the tail stays live, because step forty still constrains step forty-one, the eviction policy has to get selective rather than positional. At batch scale the token line stops being a rounding error too. A run that burns 1.48 million tokens instead of 553 thousand, multiplied across thousands of documents a night, is the difference between a workflow that pencils out and one that does not.
For a desk running agents at scale, this lands on the same side as the cost work. Agent memory is expensive to build and carry. The accuracy case now agrees with the bill: curate aggressively. A race engineer would recognize the move. You do not carry fuel you will not burn, because the weight you add to feel safe is the weight that loses the lap. The agent version of ballast is the full transcript. Keep the goal in front of the model and the recent moves under it, summarize the rest, and let the dead zone stay empty. The same discipline a learned curation policy reaches through training, a desk can approximate today with two heuristics and a summary pass. Start with the heuristic. It is a one-day change to an agent loop that recovers most of the gap before anyone trains a policy or pays for a bigger window.
It is not more context that makes a long-horizon agent reliable, it is the right context: pruning to the recent tool calls and summarizing the rest lifted GPT-5 from 71.0% to 91.6% completion while cutting tokens and time, because a shorter window has a shallower middle to lose the answer in.
Working on AI that needs to ship?
I help funds, fintechs, and data teams take AI from prototype to production.