Tim Frenzel

// Insight

AlphaAgents: a three-analyst desk built from one model

8 min read
multi-agent · equity-research · valuation · agents

The obvious way to use an LLM for equity research is one big prompt that reads everything and emits a recommendation. AlphaAgents takes the desk-shaped alternative seriously: three role-specialized agents (fundamental, sentiment, and valuation), each reading its own evidence, then arguing to a joint buy-or-sell call in a structured group chat. The design mirrors how an actual research team divides labor. That mirroring is the interesting part, more than any backtest in the paper.

The division of labor is concrete. The fundamental agent reads 10-Ks and 10-Qs through retrieval tools and financial APIs, working cash flow, income statements, and margins. The sentiment agent digests financial news, analyst ratings, and disclosures with reflection-enhanced summarization. The valuation agent computes annualized return and volatility from price and volume history. A coordinator runs the discussion on Microsoft AutoGen’s group-chat scaffolding, guarantees every agent speaks at least twice, and lets the round-robin continue until the three converge on a recommendation. Every turn of the argument is logged, which quietly solves an audit problem that monolithic prompts cannot: you can read the minutes of the meeting that produced the call.
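The valuation agent's two numbers are simple enough to sketch. The paper does not spell out its conventions, so everything here is an assumption for illustration: simple daily returns from closing prices, a 252-day trading year, and arithmetic annualization. The function name and the constant are mine, not the paper's.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed annualization convention, not taken from the paper


def annualized_return_and_vol(prices):
    """Return (annualized mean return, annualized volatility) from daily closes."""
    if len(prices) < 3:
        raise ValueError("need at least three prices to estimate volatility")
    # Simple daily returns between consecutive closes.
    daily = [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]
    ann_return = statistics.mean(daily) * TRADING_DAYS
    ann_vol = statistics.stdev(daily) * math.sqrt(TRADING_DAYS)
    return ann_return, ann_vol


# Toy price path, purely illustrative.
ret, vol = annualized_return_and_vol([100.0, 101.0, 100.0, 102.0])
print(f"annualized return {ret:.2f}, annualized vol {vol:.2f}")
```

The point of keeping this in a tool rather than in the prompt is that the arithmetic can be checked independently of whatever the model says about it.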

AlphaAgents: divide, debate, decide
[Figure: Fundamental (filings, RAG, APIs), Sentiment (news, ratings, reflection), Valuation (price, volume, vol) → Group chat (round-robin, each speaks twice or more) → Consensus (buy or sell, reasoning logged) → Equal-weight portfolio]
A coordinator forces every specialist to speak; the debate transcript is the audit trail.
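The coordinator's rule reads naturally as a loop. The sketch below is plain Python, not AutoGen's group-chat API, and every name in it (run_debate, Turn, the stub agents, the max_rounds cap) is hypothetical. It assumes agents are callables that read the transcript so far and return a call plus a reason, and that consensus means all three latest calls match.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    round_no: int
    agent: str
    call: str        # "buy" or "sell"
    reasoning: str


def run_debate(agents, min_turns_each=2, max_rounds=6):
    """Round-robin debate. Returns (consensus call or None, full transcript)."""
    transcript = []
    for round_no in range(1, max_rounds + 1):
        # Fixed speaking order: every agent speaks once per round.
        for name, speak in agents.items():
            call, reasoning = speak(transcript)
            transcript.append(Turn(round_no, name, call, reasoning))
        # Latest call per agent (later turns overwrite earlier ones).
        latest = {t.agent: t.call for t in transcript}
        if round_no >= min_turns_each and len(set(latest.values())) == 1:
            return next(iter(latest.values())), transcript
    return None, transcript  # cap reached without consensus


def stubborn(call):
    """Stub agent that never changes its view."""
    return lambda transcript: (call, f"my evidence says {call}")


def persuadable(name, initial):
    """Stub agent that yields once both other agents agree with each other."""
    def speak(transcript):
        others = {t.agent: t.call for t in transcript if t.agent != name}
        if len(others) == 2 and len(set(others.values())) == 1:
            return next(iter(others.values())), "persuaded by the other two"
        return initial, "holding my opening view"
    return speak


agents = {
    "fundamental": stubborn("buy"),
    "sentiment": persuadable("sentiment", "sell"),
    "valuation": stubborn("buy"),
}
call, transcript = run_debate(agents)
for t in transcript:
    print(t.round_no, t.agent, t.call, t.reasoning)
```

The transcript list is the audit trail: nothing is returned that was not first appended to it.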

What the experiment actually shows

The evaluation is small. The honest read starts there. The universe is 15 randomly selected technology stocks. The agents read data from January 2024, form an equal-weight portfolio on February 1, then ride it for four months. Under a risk-neutral profile, the multi-agent portfolio beat both the benchmark and every single-agent baseline on cumulative return and rolling Sharpe. Under a risk-averse profile, every agent configuration trailed the benchmark: a tech rally punished conservative exclusions, while the multi-agent book at least showed lower volatility and smaller drawdowns than the single-agent ones. The paper reports no return percentages or Sharpe figures, which tells you to treat this as a design study rather than a strategy paper.

How the configurations landed, per the paper
[Figure: a two-by-two grid. Axes: risk profile (neutral to averse) and agent setup.]
Multi, neutral: beat benchmark and singles (cumulative return, rolling Sharpe).
Multi, averse: trailed the benchmark (lower vol, smaller drawdowns).
Single, neutral: mixed against benchmark (one lens per call).
Single, averse: weakest configurations (conservative exclusions hurt).
Qualitative outcomes only; a four-month tech rally punished every risk-averse book.
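Since the paper gives no figures, anyone replicating the comparison has to compute the two metrics themselves. A minimal sketch follows, under assumptions of mine: the portfolio is equal-weighted once at inception and never rebalanced (matching "form it, then ride it"), the risk-free rate is zero, and the rolling window is 20 trading days. The function names and the toy returns are illustrative, not from the paper.

```python
import math
import statistics


def buy_and_hold_equal_weight(returns_by_ticker):
    """Daily returns of a portfolio equal-weighted at inception, never rebalanced."""
    growth = {ticker: 1.0 for ticker in returns_by_ticker}
    n_days = len(next(iter(returns_by_ticker.values())))
    prev_value, daily = 1.0, []
    for day in range(n_days):
        for ticker, rets in returns_by_ticker.items():
            growth[ticker] *= 1.0 + rets[day]
        value = sum(growth.values()) / len(growth)
        daily.append(value / prev_value - 1.0)
        prev_value = value
    return daily


def cumulative_return(daily):
    """Compound a series of daily returns into one total return."""
    total = 1.0
    for r in daily:
        total *= 1.0 + r
    return total - 1.0


def rolling_sharpe(daily, window=20, periods=252, rf_daily=0.0):
    """Annualized Sharpe ratio over each trailing window of daily returns."""
    out = []
    for end in range(window, len(daily) + 1):
        excess = [r - rf_daily for r in daily[end - window:end]]
        sd = statistics.stdev(excess)
        mean = statistics.mean(excess)
        out.append(math.sqrt(periods) * mean / sd if sd > 0 else float("nan"))
    return out


# Made-up returns for two tickers, purely to exercise the functions.
toy = {"AAA": [0.01, -0.02, 0.03, 0.00], "BBB": [0.00, 0.01, -0.01, 0.02]}
portfolio = buy_and_hold_equal_weight(toy)
print(cumulative_return(portfolio), rolling_sharpe(portfolio, window=3))
```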

One more finding deserves a flag because it is the kind that ages well: prompting the agents with a risk-seeking profile produced outputs nearly indistinguishable from risk-neutral ones. Risk preference expressed through a prompt saturates quickly. A prompt is a soft constraint; the model’s behavior bends toward its training distribution the moment the instruction asks for something the distribution finds unnatural. Anyone planning to implement investor profiles as adjectives in a system prompt should plan to measure the difference rather than assume it.

The cost of the committee

Before the statistics, the invoice. The coordinator guarantees each of three specialists speaks at least twice, which sets a floor of six specialist turns plus coordination overhead per ticker, with retrieval calls layered on top for the fundamental agent. At 15 names that is charming. At a 2,000-name screening universe it is a structural decision, because debate cost scales linearly with the universe while most of the universe never deserved a meeting. The committee is a second-stage instrument: a cheap single-pass screen ranks the universe, then the full debate convenes only for the short list. That is the same escalate-the-residue economics that hybrid thinking switches formalized at the model layer, applied one level up the stack.
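The invoice is worth doing as arithmetic. The sketch below counts only the guaranteed specialist turns, so it is a floor: coordinator overhead and retrieval calls are left out. The short-list size of 50 and the one-call-per-name screen are hypothetical numbers of mine, chosen only to show the shape of the saving.

```python
def debate_turn_floor(n_tickers, n_specialists=3, min_turns_each=2):
    """Minimum specialist turns if every ticker gets the full debate."""
    return n_tickers * n_specialists * min_turns_each


def two_stage_turns(universe, shortlist, n_specialists=3, min_turns_each=2):
    """One cheap screening call per name, full debate only for the short list."""
    return universe + debate_turn_floor(shortlist, n_specialists, min_turns_each)


print(debate_turn_floor(15))        # 90
print(debate_turn_floor(2000))      # 12000
print(two_stage_turns(2000, 50))    # 2300
```

Both designs scale linearly in the universe; the two-stage version just pays the expensive per-name rate on the short list instead of on everything.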

The telemetry worth demanding follows from treating the debate as the product. Log the dissent rate per agent, how often a dissent flips the final call, plus the distribution of rounds to consensus. A committee that converges in one round on most names is a rubber stamp with a transcript. One that dissents constantly and never flips is theater of a different kind. The healthy regime sits between, and only the logs can show where a given deployment lands.
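Those three statistics fall out of the debate logs with very little code. The sketch below assumes a record format of my own invention (opening calls per agent, the final call, rounds taken) and two definitions that are judgment calls: a dissent is an opening call that differs from the opening majority, and a flip is a contested debate whose final call goes against that majority. With three agents and a binary call, the opening majority is always well defined.

```python
from collections import Counter


def committee_telemetry(debates):
    """Dissent rate per agent, flip rate on contested names, rounds histogram."""
    dissents, rounds = Counter(), Counter()
    agents = set()
    contested = flips = 0
    for d in debates:
        opening = d["opening"]
        agents.update(opening)
        majority, _ = Counter(opening.values()).most_common(1)[0]
        dissenters = [a for a, call in opening.items() if call != majority]
        for a in dissenters:
            dissents[a] += 1
        if dissenters:
            contested += 1
            if d["final"] != majority:
                flips += 1  # the minority view won the room
        rounds[d["rounds"]] += 1
    n = len(debates)
    return {
        "dissent_rate": {a: dissents[a] / n for a in sorted(agents)},
        "flip_rate": flips / contested if contested else 0.0,
        "rounds_to_consensus": dict(rounds),
    }


# Hypothetical log records, for illustration only.
logs = [
    {"opening": {"fundamental": "buy", "sentiment": "sell", "valuation": "buy"},
     "final": "buy", "rounds": 2},
    {"opening": {"fundamental": "buy", "sentiment": "sell", "valuation": "buy"},
     "final": "sell", "rounds": 3},
    {"opening": {"fundamental": "buy", "sentiment": "buy", "valuation": "buy"},
     "final": "buy", "rounds": 2},
    {"opening": {"fundamental": "sell", "sentiment": "buy", "valuation": "buy"},
     "final": "buy", "rounds": 2},
]
print(committee_telemetry(logs))
```

A flip rate near zero alongside a high dissent rate is the "theater" regime; a rounds histogram piled on the minimum is the rubber stamp.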

Where this sits in the industry stack

AlphaAgents did not arrive in a vacuum. The neighboring systems sharpen what is missing here. TradingAgents staffs a fuller desk: fundamental, sentiment, and technical analysts feed designated bull and bear researchers who argue opposite sides by construction, with a risk-management team gating the trader’s final call. FinCon borrows the investment-firm org chart directly, a manager-analyst hierarchy in which a risk-control component episodically self-critiques and updates the system’s investment beliefs. Set against those designs, AlphaAgents’ trio is a committee of advocates: each specialist argues its own lens, while nobody is paid to attack the emerging consensus.

Consensus-seeking vs engineered opposition
[Figure: two columns compared.]
AlphaAgents: three data-lens specialists, round-robin debate, consensus call.
Adversarial designs: bull and bear researchers, structured opposition, risk gate before the trade.
Dissent works better as a salaried role than as a hoped-for emergent property.

Real investment committees learned this lesson the slow way. The desks that survive their own conviction institutionalize the bear case (a designated devil’s advocate, a pre-mortem, a risk officer with a veto) because waiting for dissent to emerge organically selects for groupthink with extra steps. The same applies one architectural level down. An explicit bear-case agent, prompted to attack whichever thesis is winning the room, is a cheap addition to the AlphaAgents pattern that the adversarial frameworks already validate. The broader survey of LLM trading agents catalogued how quickly these architectures are proliferating, and role design, more than model choice, is where they differ. The MAST failure taxonomy measured why that matters: 44 percent of multi-agent failures were born in specification and role design rather than model capability.

One industry note on the transcript, because it is worth more than it looks. Investment advisers live under books-and-records obligations, and committee minutes are a standing exam artifact. A debate log that records which agent raised which risk, what evidence was cited, and why the dissent was overruled is the machine equivalent of well-kept minutes. Built right, the audit trail is not overhead on the system. It is one of the deliverables.

The question the paper does not answer

Here is the quant’s reflex on any committee: a three-analyst consensus is only worth more than one analyst if the three are meaningfully decorrelated. Human research teams earn their diversification because the fundamental person and the technician genuinely process different information with different priors. AlphaAgents’ three specialists run on the same base model with different prompts and different feeds. Different evidence helps. Whether it helps enough is an open question the paper does not test. The Mixture-of-Agents result cuts both ways here: ensembling LLMs demonstrably lifts benchmark scores, yet ensembling correlated views mostly lifts confidence. The failure mode to fear is three eloquent agents agreeing for the same underlying reason, with the debate transcript reading like diligence.
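The reflex has a standard formula behind it. If n signals have equal variance and a common pairwise error correlation rho, the variance of their average is (1 + (n - 1) * rho) / n times the single-signal variance, which makes the committee worth n / (1 + (n - 1) * rho) independent analysts. The sketch assumes that idealized setup; the rho values are illustrative, since the paper measures none.

```python
def effective_analysts(n, rho):
    """Effective number of independent views among n equally correlated ones."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("this sketch only handles correlations in [0, 1]")
    return n / (1.0 + (n - 1) * rho)


for rho in (0.0, 0.5, 0.8, 1.0):
    print(rho, round(effective_analysts(3, rho), 2))
# prints: 0.0 3.0 / 0.5 1.5 / 0.8 1.15 / 1.0 1.0 (one pair per line)
```

At a correlation of 0.8, three agents buy roughly the diversification of 1.15 analysts, which is the quantitative version of one analyst wearing three hats.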

There is a cheap experiment that would settle the decorrelation question; the modular design makes it easy. Run each specialist on a different base model. A fundamental agent on one vendor’s model, sentiment on another, valuation on a third, with the debate unchanged. If consensus quality improves when the underlying models differ, the committee was extracting real diversification. If nothing changes, the three role prompts were always one analyst wearing three hats.
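One way to put a number on that comparison is the pairwise agreement between the specialists' opening calls, measured once with a shared base model and once with mixed models. This is a proxy I am assuming for illustration, not a metric the paper defines, and the calls below are made-up placeholders rather than results.

```python
from itertools import combinations


def pairwise_agreement(calls_by_agent):
    """Share of tickers on which each pair of agents made the same opening call."""
    out = {}
    for a, b in combinations(sorted(calls_by_agent), 2):
        pairs = list(zip(calls_by_agent[a], calls_by_agent[b]))
        out[(a, b)] = sum(x == y for x, y in pairs) / len(pairs)
    return out


# Placeholder opening calls across four tickers (hypothetical, not from the paper).
same_model = {
    "fundamental": ["buy", "buy", "sell", "buy"],
    "sentiment":   ["buy", "buy", "sell", "buy"],
    "valuation":   ["buy", "buy", "sell", "sell"],
}
for pair, rate in pairwise_agreement(same_model).items():
    print(pair, rate)
```

Run the same function on the mixed-model configuration. If agreement barely moves, the role prompts were doing the differentiating; if it drops and outcomes improve, the shared base model was the hidden common factor.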

That concern scales beyond one desk. A market of LLM traders converging on correlated behavior is the systemic version of the same defect. The desk-level version is subtler: a debate that always converges is indistinguishable from a rubber stamp. The metric I would demand before trusting the architecture is disagreement statistics. How often does the sentiment agent dissent? How often does a dissent change the final call? A committee whose minority never wins is not a committee.

The structure still earns its keep, for a reason the sell-side analyst study set up last winter: model strengths are uneven across analytical subtasks, strong on directional synthesis, unreliable on quantitative detail. Role separation lets you gate each lens differently, verify the valuation agent’s arithmetic with tools, audit the sentiment agent’s sources, and keep the human override at the consensus step. Specialist agents with a logged debate give you governance surface that a monolithic prompt fundamentally lacks. That is the part worth copying even if the alpha never materializes.

The verdict from someone who has sat on both sides of a stock-pitch meeting: adopt the architecture, ignore the backtest. Fifteen stocks over four months in a single sector is an anecdote with error bars wider than the effect. The pattern (specialized readers, forced debate, logged reasoning, equal-weight humility about position sizing) is a sound chassis for research automation. Whether it picks stocks better than one good prompt is a question that needs a thousand stocks and a decade, not a quarter.

AlphaAgents gets the architecture right, specialist agents debating on the record, and proves nothing yet about returns: copy the governance surface, demand the disagreement statistics, ignore the four-month backtest.
