// Insight
One agent, full context: the information-theoretic case
The multi-agent debate has been running on intuitions and benchmarks; Tran and Kiela supply the missing theorem. Their argument rests on the Data Processing Inequality, the information-theoretic law that no processing step can increase the information a signal carries about the truth: every agent-to-agent handoff is a processing step, which means a pipeline of agents can only preserve or lose evidence relative to one agent holding everything. At a fixed reasoning-token budget with decent context utilization, the single agent is more information-efficient by construction, before any experiment runs. The experiments then agree.
The argument rewards one slow paragraph, because its force is in how little it assumes. Model the chain: the truth generates evidence, agent one processes the evidence, agent two processes agent one’s output. That is a Markov chain, and the Data Processing Inequality says information about the truth can only stay level or decrease along it; no cleverness at stage two recovers what stage one’s summary discarded, because the discarded bits are gone from the channel entirely. The only handoff that loses nothing is a sufficient statistic, a summary preserving everything decision-relevant, and writing sufficient statistics of open-ended reasoning is precisely what cannot be guaranteed. Read that way, the practitioner doctrine of sharing full traces is an engineering approximation to sufficiency: when you cannot prove the summary is enough, ship the whole transcript. The theorem also scopes itself honestly, binding only at a fixed reasoning budget with decent context utilization, which is why the empirical half of the paper matters rather than being ceremony.
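The inequality is easy to check numerically on a toy chain. The sketch below is an illustration under invented assumptions, not anything taken from the paper: the three-answer truth, the noise levels, and the two handoff channels (`lossy_summary`, `full_trace`) are all hypothetical numbers chosen to make the effect visible. It computes the mutual information between the truth and what each stage sees.

```python
import math

def mutual_information(joint):
    """I(A;B) in bits, from a joint distribution given as joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    total = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:
                total += p * math.log2(p / (pa[a] * pb[b]))
    return total

def joint_through(prior, channel):
    """Joint p(x, y) when x is drawn from prior and y from channel[x]."""
    return [[prior[x] * py for py in channel[x]] for x in range(len(prior))]

def compose(first, second):
    """Channel x -> z made by passing x through `first`, then `second`."""
    return [
        [sum(first[x][y] * second[y][z] for y in range(len(second)))
         for z in range(len(second[0]))]
        for x in range(len(first))
    ]

# Hypothetical toy numbers: the truth is one of three answers, equally likely.
prior = [1 / 3, 1 / 3, 1 / 3]
# Evidence reaching agent one: a noisy view of the truth.
evidence = [[0.8, 0.1, 0.1],
            [0.1, 0.8, 0.1],
            [0.1, 0.1, 0.8]]
# Lossy handoff: the summary merges evidence values 1 and 2 into one symbol.
lossy_summary = [[1.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 1.0]]
# Lossless handoff: a pure relabelling, which is a sufficient statistic.
full_trace = [[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]]

i_evidence = mutual_information(joint_through(prior, evidence))
i_lossy = mutual_information(joint_through(prior, compose(evidence, lossy_summary)))
i_full = mutual_information(joint_through(prior, compose(evidence, full_trace)))

print(f"truth vs evidence:      {i_evidence:.3f} bits")
print(f"truth vs lossy summary: {i_lossy:.3f} bits")
print(f"truth vs full trace:    {i_full:.3f} bits")
```

In this toy setup the lossy summary carries strictly less information about the truth than the evidence did, while the relabelled full trace carries exactly as much. Nothing agent two does downstream of `lossy_summary` can raise that number.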
The empirical design holds the variable everyone else lets float: thinking tokens. Across Qwen3-30B, a 70B R1 distill, and Gemini 2.5 in both Flash and Pro variants, on FRAMES and on MuSiQue filtered to 4-hop questions, single-agent systems match or beat sequential multi-agent pipelines when both spend the same reasoning budget, with the single agent strongest above a thousand tokens. The representative margin: Gemini 2.5 Pro on MuSiQue at a 5,000-token budget scores 0.419 single-agent against 0.392 for the sequential pipeline.
The doctrine that production builders converged on last June, Walden Yan’s two principles at Cognition (share full traces rather than messages, and remember that actions carry implicit decisions), now reads as the engineering translation of the same inequality. Parallel agents that cannot see each other’s full reasoning make conflicting implicit choices; the theorem says the information needed to reconcile them was destroyed in the handoff.
The paper’s own boundary is where it gets interesting for system designers. Multi-agent pipelines become competitive under context degradation: corrupt the shared context heavily, at the 70 percent masking level in their experiments, and the sequential pipeline’s enforced decomposition starts to resist the noise that the single agent’s monolithic context absorbs. Structure is a defense against a polluted window. The corollary runs both ways: multi-agent architecture is compensation for context failure, so every improvement in context quality (longer windows, better curation, learned memory) converts another multi-agent use case into overhead. The in-window curation line and the million-token windows arriving this spring are, in this light, quiet arguments against tomorrow’s committees.
Reconciling the season’s three verdicts
This paper completes a trilogy this archive has now covered in full. The three results compose rather than conflict. MAST showed multi-agent systems failing organizationally, with 44 percent of failures born in specification and design. The DeepMind scaling study quantified when coordination pays at all: below a 45 percent solo baseline, on wide-and-shallow tasks, with centralized supervision, and its standout positive case was parallel evidence-gathering in financial analysis. Tran and Kiela supply the mechanism underneath both: handoffs lose information, so coordination only pays when it buys something information theory cannot see, genuinely parallel coverage of disjoint evidence, and only until the solo agent can hold the whole problem.
The synthesis fits in one sentence: parallelize the gathering, never the reasoning.
Fan out to fetch, search, and retrieve disjoint evidence, where breadth is real and nothing is summarized away; converge everything into one context for the decision, where the inequality rules. The architecture this archive praised in A-RAG is this principle built as a product, one reasoning agent commanding parallel retrieval primitives, while the debating-committee designs earn their keep only where their dissent statistics show the debate adding evidence rather than reprocessing it.
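The shape of that principle can be sketched in a few lines. This is a minimal illustration, assuming three hypothetical retrieval functions (`search_filings`, `search_news`, `search_prices`) that stand in for real fetchers and a caller-supplied `reason` function that stands in for the single reasoning agent; none of these names come from the paper or from A-RAG. The point is structural: the fan-out returns raw results, and nothing is summarized before the one context is built.

```python
from concurrent.futures import ThreadPoolExecutor

def gather_parallel(fetchers, query):
    """Run every fetcher on the query concurrently; return raw results in fetcher order."""
    with ThreadPoolExecutor(max_workers=max(1, len(fetchers))) as pool:
        futures = [pool.submit(fetch, query) for fetch in fetchers]
        return [future.result() for future in futures]

def build_context(query, documents):
    """Join the unsummarized documents into one context for a single reasoner."""
    parts = [f"[source {i}]\n{doc}" for i, doc in enumerate(documents, start=1)]
    return "\n\n".join(parts + [f"Question: {query}"])

# Hypothetical stand-ins for real retrieval primitives.
def search_filings(query):
    return f"filings text about {query}"

def search_news(query):
    return f"news text about {query}"

def search_prices(query):
    return f"price history for {query}"

def answer(query, reason):
    """Parallel gathering, then exactly one reasoning call over the full context."""
    documents = gather_parallel([search_filings, search_news, search_prices], query)
    return reason(build_context(query, documents))

documents = gather_parallel([search_filings, search_news, search_prices], "ACME")
context = build_context("ACME", documents)
print(context)
```

Threads suit this sketch because retrieval is typically I/O-bound; the design choice that matters is that `answer` calls `reason` once, on everything gathered.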
The boundaries of the result keep the synthesis disciplined. The benchmarks are multi-hop QA, reasoning-shaped work; the wide-and-shallow gathering tasks, where the scaling experiments found committees paying, are deliberately out of frame. The architectures tested are sequential pipelines, the configuration where the inequality bites hardest, rather than parallel searchers feeding one synthesizer. And tokens are not wall-clock: parallel agents can lose information while winning latency, a trade that is sometimes correct and should be made with both prices visible rather than discovered in production. None of these caveats rescues the second reasoning agent; all of them define where the first one needs colleagues.
The corruption finding also deserves its constructive reading, because it quietly prices a design choice desks face weekly. If enforced decomposition only pays when the context is badly polluted, then investment in context hygiene (deduplication, provenance, the curation this archive has tracked from retrieval to memory) substitutes directly for architectural complexity. A desk choosing between cleaning its context pipeline and adding a second agent is choosing between fixing the disease and hiring a nurse for the symptom, at comparable cost and very different long-run maintenance.
For a desk, the budget framing is the operational gift, because it converts an architecture argument into a measurable procurement question. Multi-agent pilots habitually win comparisons by silently spending more tokens; matched-budget evaluation, the protocol this paper makes standard, asks the only fair question: what does the architecture deliver per token of reasoning? Run your candidate committee against one well-contexted agent at the same spend, on your tasks, and let the margin speak. Architecture choices should be paid for out of measured margins, the way every other allocation on a desk already is. The era of architecture-by-demo had a good run. Between the failure taxonomy, the scaling laws, and the theorem, the obligation to prove a margin now rests permanently with whoever proposes the second agent.
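A matched-budget comparison needs very little machinery. The harness below is a sketch under stated assumptions, not the paper’s evaluation code: each system is assumed to be a callable taking a question and a token budget and returning an answer plus the tokens it actually spent, scoring is assumed to be exact match, and `single_agent` and `committee` are hypothetical stubs whose behaviour is invented purely so the example runs. Their scores mean nothing empirically.

```python
from statistics import mean

def evaluate(system, tasks, budget):
    """Mean exact-match score over tasks; refuse any run that overspends the budget."""
    scores = []
    for question, expected in tasks:
        answer, tokens_used = system(question, budget)
        if tokens_used > budget:
            raise ValueError(f"system spent {tokens_used} tokens, budget is {budget}")
        scores.append(1.0 if answer == expected else 0.0)
    return mean(scores)

def matched_budget_margin(candidate, baseline, tasks, budget):
    """Candidate score minus baseline score, both held to the same token budget."""
    return evaluate(candidate, tasks, budget) - evaluate(baseline, tasks, budget)

# Hypothetical stubs standing in for real systems.
KNOWN = {"2+2": "4", "capital of France": "Paris"}

def single_agent(question, budget):
    return KNOWN.get(question, ""), budget

def committee(question, budget):
    # Invented behaviour: this stub's handoff drops any question longer than 3 characters.
    answer = KNOWN.get(question, "") if len(question) <= 3 else ""
    return answer, budget

tasks = [("2+2", "4"), ("capital of France", "Paris")]
margin = matched_budget_margin(committee, single_agent, tasks, budget=5000)
print(f"committee margin over single agent: {margin:+.2f}")
```

The budget check is the part worth keeping in any real version: a pilot that wins only by overspending should fail loudly rather than post a flattering number.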
Every handoff between agents is lossy compression between the evidence and the decision: at equal token budgets one agent with full context wins by theorem, so parallelize the gathering, never the reasoning, and make the second agent prove its margin.