
MAST: multi-agent systems fail like organizations, not like models

April 5, 2025 · 6 min read

Tags: multi-agent, failure-modes, agents

The multi-agent demos keep multiplying, and so does an unexamined assumption: that when agent teams fail, the model was not smart enough. Cemri and colleagues did the unglamorous thing instead, annotating 1,642 execution traces across seven open-source frameworks (ChatDev, MetaGPT, HyperAgent, AppWorld, AG2, Magentic-One, and OpenManus) with inter-annotator agreement of 0.88. The result is MAST, the first failure taxonomy for multi-agent systems: 14 distinct modes in 3 clusters, with roughly 44 percent of failures tracing to system design and specification, 32 percent to inter-agent misalignment, and 24 percent to task verification. The headline rates frame the stakes: the systems studied fail between 41 and 86.7 percent of the time.

The cluster shares carry the finding. A plurality of failures originate before any agent thinks: ambiguous task specifications, roles that overlap or contradict, termination conditions nobody defined. The intelligence was sufficient; the org chart was not.

Figure: Where 1,642 multi-agent failures originate (%). The plurality of failures happen before any agent reasons; capability was rarely the binding constraint.

The mode-level detail reads like an audit finding from any operations review. The most frequent single failure is step repetition at 15.7 percent: agents redoing completed work because nothing tracks what is done. Reasoning-action mismatch, the agent saying one thing and doing another, accounts for 13.2 percent. Agents that keep working because they are unaware the task has ended account for 12.4 percent; outright disobedience of the task specification runs 11.8 percent. On the verification side, incorrect verification at 9.1 percent edges out absent verification at 8.2, a detail worth a pause: the checker that approves bad work is measurably more common than no checker at all, which is the difference between an unstaffed control and a false one.

Figure: The modes that dominate each cluster (share of all failures). Fourteen modes total; the top three alone account for over forty percent of everything that breaks.
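The forty-percent figure can be checked directly from the rates quoted in this post. The sketch below uses only the six mode-level percentages cited in the text; the other eight modes are not listed here, so they are left out rather than guessed. The short mode names are labels chosen for illustration.

```python
# Mode-level shares of all failures, as quoted in this post (percent).
# Only the six modes named in the text are included; labels are illustrative.
quoted_modes = {
    "step repetition": 15.7,
    "reasoning-action mismatch": 13.2,
    "unaware of termination": 12.4,
    "disobey task specification": 11.8,
    "incorrect verification": 9.1,
    "absent verification": 8.2,
}

# Rank the quoted modes from most to least frequent and keep the top three.
ranked = sorted(quoted_modes.items(), key=lambda item: item[1], reverse=True)
top_three = ranked[:3]
top_three_share = round(sum(share for _, share in top_three), 1)

for name, share in top_three:
    print(f"{name}: {share}%")
print(f"top three combined: {top_three_share}%")  # prints 41.3%
```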

Anyone who has run an operations or model-risk function will experience this taxonomy as translation rather than news. Step repetition is duplicate processing from a missing workflow state. Reasoning-action mismatch is the trader whose blotter disagrees with his commentary. Incorrect verification is the sign-off that signs without checking. Finance built maker-checker separation, reconciliation breaks, and termination criteria into its operating model because human organizations exhibit exactly these failure modes at exactly these kinds of rates when nobody engineers against them. Multi-agent systems are organizations; they inherit organizational failure physics the moment they are instantiated, regardless of how capable the individuals are.

Table: Same failure, two species: the translation table. Each agent failure mode has a human-organization twin and inherits its known cure.

That reframing changes where the engineering effort belongs. The field’s reflex when an agent team fails is to upgrade the model, the equivalent of responding to a settlement break by hiring smarter operations staff into the same broken process. MAST’s distribution says the higher-leverage fixes are organizational: task specifications precise enough to disobey detectably, role boundaries that do not overlap, explicit done-conditions, state that records what has happened, and verification designed as an adversarial step rather than a formality. None of this requires a better model. All of it requires the systems thinking that orchestration frameworks gestured at and mostly left to the user.
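Two of those fixes are easy to make mechanical. The sketch below is one possible reading, not anything taken from the MAST paper: roles are modeled as sets of responsibilities so overlaps can be detected before a run, and a task specification is written as named predicates so a violation can be reported clause by clause. The role names, responsibilities, and specification clauses are all hypothetical.

```python
from itertools import combinations

# Hypothetical role definitions: each role owns a set of responsibilities.
roles = {
    "planner": {"decompose task", "assign steps"},
    "coder": {"write code", "run tests"},
    "reviewer": {"review code", "run tests"},  # deliberately overlaps with coder
}

def overlapping_responsibilities(roles):
    """Return {(role_a, role_b): shared duties} for every pair of roles that overlap."""
    overlaps = {}
    for (a, duties_a), (b, duties_b) in combinations(sorted(roles.items()), 2):
        shared = duties_a & duties_b
        if shared:
            overlaps[(a, b)] = shared
    return overlaps

def _valid_total(out):
    """True when the output is a dict whose 'total' is a non-negative number."""
    if not isinstance(out, dict):
        return False
    total = out.get("total")
    return isinstance(total, (int, float)) and total >= 0

# Hypothetical task specification: named clauses, each a predicate on the output.
task_spec = {
    "output is a JSON object": lambda out: isinstance(out, dict),
    "total is a non-negative number": _valid_total,
}

def violated_clauses(spec, output):
    """Return the names of every clause the output fails, in specification order."""
    return [name for name, check in spec.items() if not check(output)]

print(overlapping_responsibilities(roles))
print(violated_clauses(task_spec, {"total": -5}))
```

A specification in this shape is "precise enough to disobey detectably" in the narrow sense that disobedience produces a named clause rather than a judgment call.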

The taxonomy also retroactively explains the pattern in the ensemble results this archive covered earlier. Mixture-of-Agents worked by tightly constraining the interaction: every agent sees the same inputs, aggregation is a fixed function, and there is no free-form coordination to misalign. The agentic-RAG survey’s multi-agent patterns succeed where the decomposition is mechanical. The systems MAST studied fail where coordination is open-ended, because open-ended coordination is precisely where specification and misalignment failures breed. The design lesson compresses to one line: constrain the organization, free the individuals.
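A minimal sketch of that constrained shape, under stated assumptions: agents are plain callables standing in for model calls, and majority vote stands in for the fixed aggregation function. This is an illustration of the pattern described above, not the aggregator used in the Mixture-of-Agents work.

```python
from collections import Counter
from typing import Callable, Sequence

# An agent here is any function from prompt to answer (a stand-in for a model call).
Agent = Callable[[str], str]

def constrained_ensemble(agents: Sequence[Agent], prompt: str) -> str:
    """Fan the same prompt out to every agent, then aggregate with a fixed rule."""
    # Every agent receives the identical prompt; no agent sees another's output.
    answers = [agent(prompt) for agent in agents]
    # Fixed aggregation: the most common answer wins. On a tie, Counter keeps
    # first-seen order, so the earliest tied answer is returned.
    winner, _ = Counter(answers).most_common(1)[0]
    return winner

# Hypothetical stub agents for demonstration.
stub_agents = [lambda p: "4", lambda p: "5", lambda p: "4"]
print(constrained_ensemble(stub_agents, "What is 2 + 2?"))  # prints 4
```

There is nothing for the agents to misalign on because they never talk to each other; the only coordination is the aggregation rule, which is code rather than conversation.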

The methodological by-product may travel further than the taxonomy. Annotating 1,642 traces by hand does not scale, which is why the authors built an LLM-as-judge pipeline validated against the human annotations at the 0.88 agreement level. That pipeline converts MAST from a paper into an instrument: point the judge at your own agent traces and you have a failure-mode dashboard for the systems you actually run. It is the same maturation step that reasoning-trace checking later brought to hallucination work, with the artifact going from something a reviewer might read to something a pipeline scores continuously.
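The dashboard half of that idea is a simple tally once the judge has labeled each trace. The sketch below assumes the judge's output has already been reduced to a list of mode labels per trace; the judge itself is not shown, and the trace IDs, labels, and the `failure_dashboard` function are hypothetical, not the authors' pipeline.

```python
from collections import Counter

def failure_dashboard(judged_traces):
    """Turn judge output into {mode: percent of all labeled failures}.

    judged_traces: iterable of (trace_id, list_of_mode_labels) pairs.
    Traces with an empty label list contribute no failures.
    """
    counts = Counter(mode for _, modes in judged_traces for mode in modes)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders modes from most to least frequent.
    return {mode: round(100 * n / total, 1) for mode, n in counts.most_common()}

# Hypothetical judge output for four traces.
sample = [
    ("trace-001", ["step repetition"]),
    ("trace-002", ["step repetition", "incorrect verification"]),
    ("trace-003", []),  # the judge found no failure in this trace
    ("trace-004", ["reasoning-action mismatch"]),
]

dashboard = failure_dashboard(sample)
for mode, share in dashboard.items():
    print(f"{mode}: {share}%")
```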

The taxonomy’s top modes also map one-to-one onto controls finance already knows how to build, which makes the remediation list unusually concrete. Step repetition at 15.7 percent is cured by a shared state ledger, the workflow equivalent of a trade blotter, recording what is done so no agent redoes it. Termination unawareness at 12.4 percent is cured by explicit done-conditions evaluated outside the agents, the same pattern as checkpointed-workflow semantics. And the false-checker problem, incorrect verification at 9.1 percent, is cured the way audit always cures it: the verifier gets its own success criteria, its own adversarial incentives, and no authorship stake in the work it reviews. None of these is research; all of them are missing from the frameworks measured.
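The three controls fit in a few dozen lines. The sketch below is one hypothetical shape for them, not code from any of the frameworks measured: a ledger the orchestrator consults before dispatching a step, a done-condition evaluated outside the agents, and a verifier that carries its own criteria. Every class, function, and step name is an assumption made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class StateLedger:
    """Shared record of completed steps, consulted before any step is dispatched."""
    completed: dict = field(default_factory=dict)

    def is_pending(self, step: str) -> bool:
        return step not in self.completed

    def record(self, step: str, result) -> None:
        self.completed[step] = result

def is_done(ledger: StateLedger, required_steps) -> bool:
    """Done-condition evaluated by the orchestrator, not by any agent."""
    return all(step in ledger.completed for step in required_steps)

def independent_verify(result, criteria) -> list:
    """Verifier with its own criteria; returns the names of the checks that failed."""
    return [name for name, check in criteria.items() if not check(result)]

def run(steps, workers, ledger, max_rounds=10):
    """Dispatch only pending steps; stop when the external done-condition holds."""
    rounds = 0
    while not is_done(ledger, steps) and rounds < max_rounds:
        for step in steps:
            if ledger.is_pending(step):
                ledger.record(step, workers[step]())
        rounds += 1
    return rounds

# Hypothetical workers that log each call so repetition would be visible.
calls = []
workers = {
    "fetch": lambda: calls.append("fetch") or [3, 1, 2],
    "sort": lambda: calls.append("sort") or [1, 2, 3],
}

ledger = StateLedger()
ledger.record("fetch", [3, 1, 2])  # "fetch" was already completed earlier
run(["fetch", "sort"], workers, ledger)

# The verifier's criteria are defined separately from the worker that sorted.
criteria = {
    "is sorted": lambda r: r == sorted(r),
    "non-empty": lambda r: len(r) > 0,
}
failed = independent_verify(ledger.completed["sort"], criteria)
print(calls, failed)  # prints ['sort'] []
```

The point of the design is where each decision lives: the ledger decides whether work is needed, the orchestrator decides when the task is over, and the verifier decides pass or fail against criteria the author of the work never touched.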

For a desk evaluating any agent-team proposal, MAST supplies the due-diligence checklist that did not exist last quarter. Ask for the task specification and try to violate it on paper. Ask what tracks completion state. Ask who verifies the verifier, with the 9.1-versus-8.2 statistic in hand. Ask what the system does when an agent goes silent, derails, or contradicts its own reasoning, because each has a measured base rate now. A vendor who has not read this paper is selling an org chart that has not met its failure modes. The annotated traces and the LLM-as-judge pipeline are public, which makes running your own candidate system through the taxonomy a weekend exercise rather than a research project.

One caution belongs beside the enthusiasm: the seven frameworks studied are open-source research systems, the population available for tracing rather than the population running in production, and commercial stacks with mature guardrails plausibly sit toward the friendlier end of the 41-to-86.7 range. The taxonomy transfers regardless, since the failure physics is organizational, while the rates deserve the usual external-validity discount before anyone quotes them about their own vendor.

The deeper note for this blog’s running theme: a 41-to-86.7 percent failure range, measured honestly, is the base rate hiding behind every polished multi-agent demo of the past year. Taxonomies are how engineering disciplines grow up; aviation got checklists from crash investigations, medicine got morbidity conferences, and agent systems just got their first equivalent. The capability claims will keep arriving faster than the measurements. The desks that read the measurements first keep ending up on the right side of the gap.

1,642 traces say multi-agent systems fail like organizations: 44% specification and design, a false checker more common than a missing one, and failure rates of 41 to 86.7 percent behind the demos, none of it fixable by a smarter model.
