// Insight

When do more agents help? DeepMind runs the experiment

January 10, 202610 min read

multi-agentscalingagentsevaluation

For a year this archive has asked the same question of every committee architecture: when does adding agents pay, and who has measured it? Kim and colleagues at Google DeepMind and MIT finally ran the experiment at the scale the question needed, 180 controlled configurations spanning five canonical architectures, six agentic benchmarks, and three model families. The answer has numbers now: coordination yields negative returns once single-agent accuracy exceeds roughly 45 percent, independent agents amplify errors 17.2 times against a single agent’s baseline, while a regression on four coordination metrics predicts the correct architecture for 87 percent of held-out configurations. The era of choosing agent topologies by vibe has an expiration date.

The experiment that was missing

The design is the contribution before any finding is. Five architectures cover the space practitioners actually choose among: a single agent, independent agents working in parallel without communication, centralized coordination through an orchestrator, decentralized peer-to-peer communication, plus a hybrid of the last two. Each runs against six benchmarks chosen for diversity of work: web browsing, financial analysis, game planning, workplace tasks, software engineering, and command-line operations, instantiated across three model families to separate architecture effects from model effects.

180 configurations, one regression

Architecture effects separated from model effects by design, not assumption.

That factorial discipline is what the multi-agent literature has lacked. Single papers demonstrate single architectures on favorable tasks; the MAST failure taxonomy catalogued how those systems break, while leaving open the quantitative question of when they should exist at all. A 180-cell grid with shared metrics answers the question the only way it can be answered: by varying one thing at a time.

Error amplification is the headline table

Every multi-agent system is an error-propagation machine. The paper measures the gain on that machine by topology. Against a single agent’s baseline of 1.0, centralized coordination amplifies trace-level errors 4.4 times. Hybrid designs reach 5.1, decentralized communication 7.8. Independent agents, parallel workers with no communication at all, hit 17.2 times.

Trace-level error amplification by topology (x, single agent = 1.0)

An unreviewed error compounds through every agent that builds on it; structure is the brake.

The ordering is the lesson: error amplification tracks the absence of a checkpoint, and independent parallelism, the architecture that demos fastest, is the one that compounds mistakes fourfold over a supervised hierarchy.

A wrong intermediate answer in a centralized design meets an orchestrator that can catch it; the same error in an independent fleet propagates into every output that touches it, unreviewed. Finance has organized against exactly this dynamic for a century, which is why the table reads less like a discovery and more like a measurement of something every four-eyes control already assumed: work that flows through a checkpoint degrades slower than work that does not.

Where committees pay, including the result that should interest this readership most

The per-benchmark deltas swing wider than any average suggests, from coordination adding 80.8 percent to destroying 70 percent of performance on the same underlying models.

Best multi-agent variant vs single agent per benchmark; the sign depends on the work, not the model.

Sit with the first row. The benchmark where multi-agent coordination pays most, by an enormous margin, is the financial-analysis suite, with centralized coordination delivering an 80.8 percent improvement. The mechanism behind the split runs through the whole table: financial analysis decomposes into parallel evidence-gathering, filings here, prices there, news in a third stream, where breadth genuinely multiplies coverage and a central desk reconciles. Planning and terminal work are sequential and stateful, every step conditioned on the last, where splitting the work splits the context that the next decision needs. Coordination pays where the task is wide and shallow, fails where it is narrow and deep, and financial research happens to be the widest, shallowest task in the benchmark suite. The committee designs this archive has reviewed were pointed at the right domain all along; what they lacked was the supervision structure the amplification table prices.

Before anyone forwards the 80.8 percent to a steering committee, the allocator’s discount applies. Finance Agent is one benchmark, built from analysis tasks with separable evidence streams and gradable outputs, which is the friendliest possible territory for a centralized committee; a real research workflow interleaves its evidence-gathering with sequential judgment, drifting the work toward the narrow-and-deep regime where the same table turns negative. The directional claim survives the discount, parallel evidence acquisition under central reconciliation is where finance should spend its coordination budget, while the magnitude is a property of the benchmark’s task mix rather than a promise about yours. The protocol for finding your own number is the paper’s real export: hold the model fixed, vary only the topology, repeat enough times to see the variance.

The tool dimension cuts the other way and deserves its own caution. Tool-heavy tasks, the 16-tool business workflows in the workplace suite, suffer from coordination overhead, with the interaction coefficient at -0.096 and p = 0.002. Tools are stateful context, and distributing them across agents multiplies the synchronization surface, the same arithmetic that made tool execution the binding constraint everywhere this archive has looked. An agent fleet sharing a Bloomberg session is a coordination problem wearing a productivity costume.

The 45 percent rule

The capability-saturation result is the one to commit to memory, because it inverts the intuition that stronger models make better committees. Once a single agent exceeds roughly 45 percent accuracy on a task, adding agents yields diminishing and then negative returns: the remaining headroom is smaller than the coordination tax.

The capability-saturation logic

The better your single agent gets, the worse the case for adding colleagues becomes.

The strategic consequence is a moving target: every model generation that lifts single-agent baselines converts more multi-agent architectures into overhead, which means committee designs are depreciating assets in a way single-agent designs are not.

A workflow that genuinely needed five agents on last year’s models may need one on next year’s, which leaves the teams maintaining elaborate orchestration maintaining something the capability curve is actively eroding. The correlated-committee worry this archive has carried since the market-simulation work gains a complement: even uncorrelated committees stop paying once the individuals get good.

A science, not a vibe

The four coordination metrics are the part a desk can operationalize tomorrow: efficiency as success normalized by turn count, overhead as cost relative to single-agent, trace-level error amplification, and redundancy as output overlap between agents. The regression on these achieves cross-validated R-squared of 0.373, rising to 0.413 with task-grounded capability measures, and selects the correct architecture for 87 percent of held-out configurations. Modest fit, excellent decisions: the model does not need to predict scores precisely to rank architectures correctly, the same distinction that separates a useful monitoring instrument from a forecasting fantasy.

The four dials that decide the architecture

Cross-validated R-squared 0.373-0.413; correct architecture picked for 87% of held-out configs.

What a 0.373 R-squared licenses deserves precision, because both dismissals and overclaims are circulating. It does not license point forecasts of system accuracy; residual variance dominates any single configuration. It demonstrably licenses architecture ranking, the 87 percent selection rate on held-out configurations, because ranking only requires the model to order options whose true gaps are large, the same logic by which a noisy factor model can still sort a cross-section. For a desk, the operational consequence is a measurement recipe rather than a lookup table: estimate your own solo baseline with repetition, compute the four dials on a pilot committee, and let the regression’s structure, rather than its coefficients, tell you which terms to watch. The 45 percent threshold itself is empirical, fitted to these benchmarks and this model generation, with the paper explicit that it moves; treating it as a law would repeat the mistake of every hard-coded constant in quantitative finance, while treating it as this quarter’s estimate of a real and re-measurable boundary is just calibration discipline.

The redundancy metric quietly answers the decorrelation question this archive kept demanding of committee papers: measured output overlap is the test for whether three agents are three opinions or one opinion with three letterheads. That it enters a predictive model with positive weight on failure is the empirical confirmation of the suspicion.

A worked example makes the dials concrete. Take the workflow this archive knows best, filings research: a screening pass over a coverage universe, evidence-gathering across documents, then synthesis into a view. The screening pass is wide, shallow, and weak-baseline on any single name, the profile where a centralized fan-out pays, with the orchestrator as the desk head reconciling. The synthesis step is narrow, deep, and strong-baseline on current models, squarely past the 45 percent line, where the table says one agent with everything in context. The overhead dial prices the boundary between them, while redundancy monitoring polices the screening fleet for the secretly-one-agent failure. One workflow, two regimes, architecture assigned by measurement per stage rather than by fashion for the system, which is what the paper’s title means by a science.

The desk translation

The deployment rule now writes itself in measurable steps. Benchmark the single agent first, on your task, with repetition; if it clears 45 percent, the burden of proof flips against adding agents at all. Where a committee is still justified, wide-and-shallow evidence gathering with thin solo baselines, make it centralized, because 4.4x versus 17.2x is the difference between a supervised team and a rumor mill. Instrument the four dials from day one, with redundancy as the standing answer to whether the committee is real. And re-run the solo baseline every model generation, since the 45 percent line is a ratchet that only moves against coordination. The overhead dial deserves the last procedural word, because it is the one finance controls natively: coordination overhead is a cost line, and cost lines get budgets. An agent architecture whose overhead is metered, capped, and reported per workflow stage is one a CFO can govern with existing machinery, which may matter more for adoption than any benchmark column in the paper.

The committee question this archive has asked all year now has its instrument; what remains is the discipline of using it before the org chart, rather than after.

Coordination pays below a 45% solo baseline and on wide-shallow tasks, costs everywhere else, and amplifies errors 17.2x when nobody supervises: the multi-agent question is now a measurement, with all four dials fitting on one dashboard.

Working on AI that needs to ship?

I help funds, fintechs, and data teams take AI from prototype to production.

Get in touch Read the book →