
Mixture of Agents: when a committee of open models beats one big one

June 23, 2024 · 7 min read

Tags: ensembling, agents, open-source

Here is the result worth sitting with. A stack of open-weight models, wired together so they critique and refine each other, scored higher on a head-to-head quality benchmark than GPT-4o (GPT-4 Omni), the strongest single model you could call this month. The method is Mixture-of-Agents (MoA). The paper reports an all-open-source configuration reaching a 65.1% length-controlled (LC) win rate on AlpacaEval 2.0, ahead of GPT-4o at 57.5% and GPT-4 Turbo at 55.0%. No new model was trained. The entire gain came from how existing models were arranged.

If you have spent time in quantitative research, that shape is familiar. A committee of weaker, diverse estimators routinely beats a single strong one. Mixture-of-Agents is that instinct applied to language models. The paper is unusually clear about why it works, which is what makes it worth reading rather than just citing. My read is simpler: treat it as an ensemble. Every instinct you already have about ensembles then tells you when it will help and when it will not.

How it works

The method arranges models in layers. In the first layer, several proposers each answer the same prompt independently. Their answers are concatenated and handed to the next layer, where models see all the prior responses and write improved ones. After a few layers, a final aggregator synthesizes everything into one answer. The main configuration is concrete: six open proposers (Qwen1.5-110B-Chat, Qwen1.5-72B-Chat, WizardLM-8x22B, LLaMA-3-70B-Instruct, Mixtral-8x22B-v0.1, and dbrx-instruct), stacked three layers deep, with Qwen1.5-110B-Chat as the final aggregator. The reference implementation is open on GitHub.
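The layering described above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the `Model` type, the prompt wording in `build_aggregate_prompt`, and the `num_proposer_layers` parameter are my own assumptions, and whether the paper's "three layers" includes the aggregator layer is a counting choice I leave to the caller.

```python
from typing import Callable, List

# Hypothetical stand-in: a "model" is any callable from prompt text to response text.
# In practice this would wrap whatever inference client you use.
Model = Callable[[str], str]


def build_aggregate_prompt(user_prompt: str, prior: List[str]) -> str:
    """Concatenate the previous layer's responses ahead of the original query.

    The wording is illustrative, not the paper's actual prompt.
    """
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(prior))
    return (
        "Responses from other models to the query below:\n"
        f"{numbered}\n\n"
        "Synthesize them into a single improved answer.\n"
        f"Query: {user_prompt}"
    )


def mixture_of_agents(
    user_prompt: str,
    proposers: List[Model],
    aggregator: Model,
    num_proposer_layers: int = 2,
) -> str:
    """Run proposer layers in sequence, then one final aggregation call."""
    if num_proposer_layers < 1:
        raise ValueError("need at least one proposer layer")

    # First layer: every proposer answers the raw prompt independently.
    responses = [model(user_prompt) for model in proposers]

    # Later layers: every proposer rewrites with all prior answers in view.
    for _ in range(num_proposer_layers - 1):
        prompt = build_aggregate_prompt(user_prompt, responses)
        responses = [model(prompt) for model in proposers]

    # Final step: a single aggregator synthesizes the last layer's answers.
    return aggregator(build_aggregate_prompt(user_prompt, responses))


if __name__ == "__main__":
    # Toy stubs so the sketch runs without any model server.
    def stub(name: str) -> Model:
        return lambda prompt: f"[{name}] reply to {len(prompt)} chars of input"

    print(mixture_of_agents("What is 2 + 2?", [stub("a"), stub("b")], stub("agg")))
```

The proposer calls inside one layer are independent, so a real deployment would issue them concurrently; the loop here is sequential only to keep the sketch short.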

Figure: Mixture-of-Agents: six proposers, then an aggregator. Six open proposers answer in parallel. The paper stacks three such layers, each rewriting with every prior answer in view, before Qwen1.5-110B-Chat aggregates the final response.

Two design choices do the work. First, diversity: the proposers are different models, with different training data and different strengths. Their errors are not identical. Second, refinement: later layers do not start from scratch, they revise with every prior answer in front of them. It is closer to a panel revising a draft than to a simple vote. The refinement step is what lets the method improve quality rather than only average error down.

Why it rhymes with quant practice

Strip away the language-model specifics. This is an ensemble. We have known for decades that averaging diverse, decorrelated predictors cuts variance without adding bias, which is the logic behind bagging, random forests, and factor-committee approaches many desks run instead of betting on one signal. The proposers are base learners. The aggregator is a stacking layer. The multi-round refinement is the part classic ensembles lack.
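The variance claim has a standard closed form worth keeping in view. As a simplifying assumption (mine, not the paper's), take n estimators X_1, ..., X_n that each have variance σ² and share a common pairwise correlation ρ. The variance of their plain average is then:

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^{2}}\Bigl(n\sigma^{2} + n(n-1)\rho\sigma^{2}\Bigr)
  = \rho\,\sigma^{2} + \frac{(1-\rho)\,\sigma^{2}}{n}
```

With ρ = 0 the variance falls as σ²/n. With ρ = 1 it stays at σ² no matter how many members you add. The ρσ² term is a floor that more members cannot remove. A language-model aggregator is not literally an average, so read this as the intuition behind the ensemble framing rather than a model of MoA itself.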

The paper’s ablations make the ensemble reading concrete rather than metaphorical. Win rate rises monotonically as you add proposers. The authors find explicitly that using several different models beats sampling one model repeatedly at temperature. Diversity, not raw call count, is the active ingredient.

Figure: AlpacaEval 2.0 LC win rate (%), proposer-count ablation.

That is the same lesson a quant learns the hard way. Stacking five copies of one signal sampled differently does almost nothing; stacking five genuinely different signals is where the variance reduction comes from. The chart above is the ensembling textbook rewritten with language models on the x-axis. On a multi-strategy book, the same trap is the expensive one: sleeves that look independent on paper draw down together when the underlying bets are the same.

The cost angle is the other half of the story. The lighter MoA-Lite variant matches GPT-4o’s cost while scoring higher. It beats GPT-4 Turbo by roughly 4% on quality while being more than twice as cost-effective. For a research pipeline whose output is a memo, a literature summary, or a first-pass code review, paying in latency to avoid a closed model is often a good trade. For a compliance-sensitive shop the deeper appeal is different. The whole committee runs on hardware you own. Proprietary inputs never leave the building. There is also a quieter governance benefit. Every member’s contribution is a separate, logged generation. You can audit which proposer said what before the aggregator merged it. A black-box single call gives you one opaque answer. A committee gives you a paper trail.

Where the analogy breaks

An ensemble only helps if its members are genuinely decorrelated.

That is the trap. Many open models are trained on overlapping data and distilled from the same frontier teachers. Their mistakes are correlated. A committee that agrees on the wrong answer is concentration in disguise. Any quant who has stacked five “different” momentum signals and discovered they were the same bet knows the feeling. Off the desk, it is the same mistake as entering a race with three cars on identical setups and calling it a team strategy: you have triple the entries and one point of failure. Before trusting a stack like this, measure the correlation of the members’ errors rather than assuming difference from different model names.
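Measuring error correlation is cheap once you have per-item grades on a shared evaluation set. Here is a rough sketch, assuming you can mark each member's answer as wrong (1) or right (0) on the same items. The model names and the error vectors are made up for illustration.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[int], y: List[int]) -> float:
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def pairwise_error_correlation(
    errors: Dict[str, List[int]],
) -> Dict[Tuple[str, str], float]:
    """Correlation of error indicators for every pair of members."""
    return {
        (a, b): pearson(errors[a], errors[b])
        for a, b in combinations(errors, 2)
    }


# Hypothetical grades: 1 = wrong, 0 = right, same eight items for every member.
example_errors = {
    "model_a": [1, 0, 0, 1, 0, 0, 1, 0],
    "model_b": [1, 0, 0, 1, 0, 0, 1, 0],  # fails on exactly the same items as model_a
    "model_c": [0, 1, 0, 0, 0, 1, 0, 0],  # fails on different items
}

if __name__ == "__main__":
    for pair, rho in pairwise_error_correlation(example_errors).items():
        print(pair, round(rho, 3))
```

Pairs near 1 are the same bet under two names. A member that is somewhat weaker on its own but fails on different items can be the more valuable addition.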

The second issue is the benchmark. AlpacaEval 2.0 measures how often a strong judge model prefers one chat response to another. That is a reasonable proxy for helpfulness and almost useless as a proxy for what a desk cares about: factual grounding, calibrated uncertainty, and not inventing a number. The paper’s own FLASK breakdown is revealing here. MoA improves correctness, factuality, completeness, and insightfulness, all genuinely valuable. It loses on conciseness, producing measurably more verbose answers. An aggregator that learns to write fluent, confident, longer prose can win a preference benchmark while drifting from the source, which is exactly the failure mode you cannot afford on a financial document. Read the 65.1% as evidence the method works in general. It is not a promise about your task.

The third concern, which the ensemble framing hides, is that the aggregator is a single point of failure. The whole structure routes through one model that reads every proposal and writes the final answer. Its biases and blind spots are imposed on the output no matter how diverse the proposers were upstream. The paper gives direct evidence: WizardLM was an excellent proposer but performed poorly in the aggregator role, while models like Qwen, LLaMA-3, and GPT-4o were versatile in both. A panel that feeds a chairman who ignores the dissent is not really a panel. So when you evaluate a stack, give the aggregator the most scrutiny, swap it deliberately, and check whether the final answer ever preserves a minority-but-correct proposal or always collapses to the majority view.
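The minority-preservation check can be made mechanical on tasks with short, normalizable answers such as an extracted figure or a label. The sketch below assumes exact string matching against a reference answer, which is my simplification. Free-text outputs would need a grader in place of the equality tests. `Record` and its fields are hypothetical names.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class Record(NamedTuple):
    proposals: List[str]  # normalized answer from each proposer
    final: str            # normalized answer from the aggregator
    gold: str             # reference answer


def minority_preservation_rate(records: List[Record]) -> Optional[float]:
    """Share of cases where the aggregator kept a correct answer that was outvoted.

    A case is eligible when at least one proposer gave the gold answer but some
    other answer received strictly more votes. Returns None if no case is eligible.
    """
    eligible = 0
    kept = 0
    for r in records:
        counts = Counter(r.proposals)
        gold_votes = counts[r.gold]  # Counter returns 0 for a missing key
        top_votes = max(counts.values())
        if 0 < gold_votes < top_votes:
            eligible += 1
            if r.final == r.gold:
                kept += 1
    return kept / eligible if eligible else None
```

Run it once per candidate aggregator over the same logged proposals. A rate near zero suggests the aggregator is acting as a majority vote with better prose.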

The fourth is cost in the other direction. Layers multiply calls. In a three-layer stack with six proposers, every proposer is invoked in each proposer layer and each layer must wait for the one before it, so a single query costs many model invocations and several sequential round trips. For research and report generation that is fine. For anything latency-sensitive, or anything you call thousands of times a day, the economics flip and a single well-chosen model wins. The committee is built for quality, not for throughput.
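A back-of-envelope budget makes the multiplication explicit. The function below is a sketch under stated assumptions: proposers within a layer run fully in parallel, every proposer call takes the same time, and the aggregator is counted separately from the proposer layers. The latency figures in the example are placeholders, not measurements.

```python
from typing import Dict


def moa_call_budget(
    num_proposers: int,
    num_proposer_layers: int,
    proposer_latency_s: float,
    aggregator_latency_s: float,
) -> Dict[str, float]:
    """Estimate calls and wall-clock time for one query through the stack."""
    # Every proposer is called once per proposer layer, plus one aggregator call.
    total_calls = num_proposers * num_proposer_layers + 1
    # Layers cannot overlap: each one needs the previous layer's outputs.
    sequential_steps = num_proposer_layers + 1
    # With perfect parallelism inside a layer, a layer costs one proposer latency.
    wall_clock_s = num_proposer_layers * proposer_latency_s + aggregator_latency_s
    return {
        "total_calls": total_calls,
        "sequential_steps": sequential_steps,
        "wall_clock_s": wall_clock_s,
    }


if __name__ == "__main__":
    # Placeholder latencies of 10 s per proposer call and 15 s for the aggregator.
    print(moa_call_budget(6, 3, 10.0, 15.0))
```

Multiply `total_calls` by your daily query volume and the per-call cost of your own hardware before committing. The wall-clock figure is a floor, since real layers wait on their slowest member.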

How I would use it

This is a pattern to borrow rather than a product to buy. The design earns its keep where quality dominates and latency does not: drafting and self-critiquing research notes, cross-checking an extraction against several models before a human reads it, generating and then reviewing code. The MT-Bench numbers point the same way, with MoA at 9.25 against GPT-4 Omni’s 9.19. That is a thin margin on a saturated benchmark, and it matters less than the cost and auditability story around it.

The discipline carries over from the desk almost unchanged. Verify that your members are actually different by measuring the correlation of their errors rather than their pedigrees, because the paper’s ablations show that diversity is the ingredient that pays. Measure on a benchmark that resembles your real task, because a win rate on chat preference will not tell you whether the stack hallucinates a figure in a filing. Price the latency before you fall in love with the result, because the same arrangement that buys quality in a research loop is unusable in a hot path. Mixture-of-Agents is a clean demonstration that the arithmetic of ensembles now applies to language models.

An ensemble is only as good as the independence of its parts. Independence is the thing everyone assumes and almost no one checks.
