WorkstreamBench: can an agent build the model, not just the answer
The spreadsheet is where finance actually happens: the operating model, the scenario grid, the DCF that carries a deal. WorkstreamBench is the first evaluation built at that altitude. It covers end-to-end financial spreadsheet workstreams (modeling, forecasting, scenario analysis) drawn from the Financial Modeling World Cup, the ModelOff competition archive, and the Wall Street Prep curriculum, and it grades them against professional standards. The best agent in the field, Claude operating through the web interface, scores 69.1 out of 100 overall and falls to 53.4 on medium-hard tasks, and every evaluated agent degrades sharply once a workstream chains more than a few calculations.
The grading design is the contribution that will outlast the scores. Three dimensions, each with fine-grained professional criteria: Accuracy, the computations and completions being right; Formula, the cell-level logic being robust and interpretable rather than hard-coded numbers wearing formula costumes; Format, the artifact being readable by the next human who opens it. Anyone who has inherited a model knows why the second and third dimensions exist. A spreadsheet that produces today’s correct number through opaque, brittle machinery is a liability with a good first impression, the spreadsheet version of the right-answer-wrong-reason failure that validation work exists to catch.
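The "hard-coded numbers wearing formula costumes" failure is easy to picture in code. What follows is a minimal sketch of a reviewer-side check, not the benchmark's grader: it assumes a hypothetical mapping of cell addresses to raw cell contents, and the labels and the `classify_cell` helper are illustrative names invented here. The regular expressions are deliberately crude and ignore sheet-qualified references, named ranges, and string literals.

```python
import re

# A1-style reference such as B2 or $B$2. The lookarounds keep function
# names like LOG10( from being mistaken for cell references.
CELL_REF = re.compile(r"(?<![A-Za-z0-9_])\$?[A-Z]{1,3}\$?\d+(?![A-Za-z0-9_(])")
# A numeric literal that is not part of an identifier or another number.
NUMBER = re.compile(r"(?<![A-Za-z0-9_.$])\d+(?:\.\d+)?")


def classify_cell(content: str) -> str:
    """Label a cell 'input', 'formula', 'hardcoded', or 'embedded-constant'."""
    if not content.startswith("="):
        return "input"  # a typed value, which is where assumptions belong
    body = content[1:]
    has_ref = bool(CELL_REF.search(body))
    # Drop references first so row numbers are not counted as constants.
    stripped = CELL_REF.sub("", body)
    # 0 and 1 are tolerated because (1 + growth) style terms are conventional.
    constants = [n for n in NUMBER.findall(stripped) if n not in {"0", "1"}]
    if not has_ref:
        return "hardcoded"  # a formula that depends on no other cell
    return "embedded-constant" if constants else "formula"


# Hypothetical sheet fragment for illustration only.
sheet = {
    "B1": "1000",
    "B2": "=B1*(1+C1)",
    "B3": "=B1*1.07",
    "B4": "=1234.5",
}
flags = {addr: classify_cell(text) for addr, text in sheet.items()}
```

In this fragment `B3` and `B4` are the cells a reviewer would want surfaced: both produce a number today, and neither exposes the assumption behind it.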
The task provenance settles the legitimacy question most benchmarks dodge. Financial Modeling World Cup and ModelOff problems are what the industry’s best modelers compete on; Wall Street Prep material is what new analysts are drilled and examined on before anyone trusts them with a live deal. Grading agents against the apparatus that certifies humans is the same move that gave the deliverable-evaluation literature its credibility: the bar was not invented for the machines, which means clearing it would mean something.
The scale gap against prior work measures how far evaluation had drifted from the job. WorkstreamBench tasks carry roughly 33 times more cells on average and 93 times more functions at the median than SpreadsheetBench, the previous reference. Earlier suites tested whether a model could manipulate a spreadsheet; this one tests whether it can build the thing an associate is paid to build, which is why the numbers land where they do.
The degradation is the familiar one, measured deeper
The difficulty slope rhymes with everything this archive measured this spring, which is what makes it credible. Fin-RATE found models losing 14 to 19 points when filing analysis spanned entities or quarters; here the same correspondence problem appears inside a single artifact, since a financial model is a web of dependencies where cell C47 must stay consistent with assumptions made forty steps earlier. Chained calculations are the spreadsheet’s native form of the long-horizon problem, and agents degrade on them for the same reason they degrade across documents: holding many bindings in correct correspondence is the skill the training distribution underprovides. The qualitative review’s note that Claude produces the most professional-looking outputs while still falling short of professional standards is the finding in miniature: surface competence has arrived, structural reliability has not.
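One way to make "forty steps earlier" concrete is to measure how deep a cell sits in the dependency web. The sketch below is an illustration under stated assumptions rather than anything the benchmark publishes: it takes a hypothetical mapping of cell addresses to contents, treats each range endpoint as a single reference instead of expanding the range, and assumes the sheet has no circular references.

```python
import re
from functools import lru_cache

# A1-style reference, with optional $ anchors captured away.
CELL_REF = re.compile(r"\$?([A-Z]{1,3})\$?(\d+)")


def chain_depth(cells: dict[str, str]) -> dict[str, int]:
    """Return each cell's depth: 0 for inputs, 1 + deepest precedent otherwise."""

    def precedents(addr: str) -> list[str]:
        content = cells.get(addr, "")
        if not content.startswith("="):
            return []  # typed inputs and unknown cells have no precedents
        return [col + row for col, row in CELL_REF.findall(content)]

    @lru_cache(maxsize=None)
    def depth(addr: str) -> int:
        refs = precedents(addr)
        return 0 if not refs else 1 + max(depth(ref) for ref in refs)

    return {addr: depth(addr) for addr in cells}


# Hypothetical three-step chain built on two assumptions.
model = {
    "B1": "100",
    "B2": "0.05",
    "B3": "=B1*(1+B2)",
    "B4": "=B3*(1+B2)",
    "B5": "=B4-B1",
}
depths = chain_depth(model)
```

Here `B5` sits at depth 3. In a real operating model the deepest cells are the ones whose correctness depends on the most bindings holding at once, which is where the reported degradation would be expected to concentrate.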
A quieter finding hides in the rankings: the same Claude model scores differently by surface, with the web-interface agent leading while the Excel-embedded variant trails it. The model is identical; the harness (what the agent can see, how it edits, how it checks its own work) moves the score. That is a procurement insight disguised as a leaderboard footnote. Teams evaluating agent products are partly evaluating scaffolds, which is the harness-versus-model distinction that agentic benchmarks keep rediscovering. A disappointing pilot may indict the integration rather than the intelligence, and the cheap experiment is to re-run the same tasks through a different surface before changing vendors.
Read as a deployment instrument, the numbers sort the work the way the workflow-evaluation literature keeps recommending. A 69-point artifact on standardized sub-tasks is a genuinely useful draft: scaffolding, formatting, the mechanical first third of a model, delivered in minutes. A 53-point artifact on a chained workstream is a liability that looks like a deliverable. That is the dangerous quadrant, because spreadsheet errors are silent, load-bearing, and discovered by the person who trusted the cell. The gate that follows is structural rather than statistical: agent-built models earn review at the formula layer, not the output layer, with the dependency chain walked by someone who owns the conclusion. Auditing formulas is cheaper than rebuilding them, which is what keeps the draft economically positive even at today’s scores.
The closing of a loop
This is the 86th and final entry in this archive’s two-year experiment of reading the AI-and-finance literature as it landed. The symmetry with the first entries is worth one paragraph. The archive opened with benchmarks showing retrieval failing on filings while everyone assumed reading was solved. It closes with a benchmark showing agents failing on spreadsheets while everyone assumes building is next. The pattern held the whole way through: capability claims arrive first, measurement arrives second, and the gap between them is where deployment risk lives. The desks that commission their own measurement before believing either are the ones whose systems survive contact with production. WorkstreamBench hands that discipline its newest instrument, aimed at the artifact that runs the industry.
Each row’s number comes from the benchmark that finally measured the claim, FinanceBench, XFinBench, and now this one, and each was lower than the demos implied. The column that never appears in the table is the one this archive existed to supply: what a desk should do in the gap.
What clearing the bar would look like is worth stating, since someone will claim it within a year. A passing agent needs three properties at once: chained calculations that stay consistent to the final cell, formulas a reviewer can audit without archaeology, and an artifact the next analyst opens and understands. Score inflation on any single dimension would be the benchmark failing rather than the agents succeeding, and accuracy without auditable formulas is the likely first cheat. That is why the three-dimension design is the part to defend as the leaderboard fills.
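The "three properties at once" requirement can be read as a conjunctive gate rather than an average. The sketch below illustrates that reading only: the source does not say how WorkstreamBench aggregates its dimensions, and the `Scores` type, the `clears_bar` helper, and the threshold of 80 are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Per-dimension scores on a 0-100 scale (hypothetical container)."""
    accuracy: float
    formula: float
    format: float


def mean_score(s: Scores) -> float:
    """Simple average, which lets one strong dimension mask a weak one."""
    return (s.accuracy + s.formula + s.format) / 3


def clears_bar(s: Scores, threshold: float = 80.0) -> bool:
    """Conjunctive gate: every dimension must meet the assumed threshold."""
    return min(s.accuracy, s.formula, s.format) >= threshold


# Right answers produced through opaque formulas: the likely first cheat.
inflated = Scores(accuracy=95.0, formula=55.0, format=90.0)
balanced = Scores(accuracy=82.0, formula=81.0, format=80.0)
```

With these made-up numbers `inflated` averages 80 and still fails the gate, while `balanced` passes. The design choice is the point: a minimum cannot be bought with accuracy alone.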
The benchmark will saturate, as they all do; the grading philosophy should not. Accuracy, formula quality, and format legibility against professional criteria is just the definition of work product, applied to machines; it transfers to every artifact class an agent will be asked to produce next. The question was never whether an agent can fill the cells; it is whether the model it builds can be trusted by the analyst who inherits it. With the best score at 69 points and falling with complexity, that bar remains unmet.
The best agent builds a 69-point financial model and drops to 53 once the calculations chain. The artifact is the test, the formula layer is the audit surface, and the inheritance bar (can the next analyst trust this?) is still unmet.