XFinBench: where graduate-level finance still beats the best models
Every few months a model aces another math benchmark and someone proposes pointing it at valuation work. XFinBench is the reality check to run first. It is a benchmark of 4,235 problems in knowledge-intensive, graduate-level financial reasoning, built with a companion knowledge bank of 3,032 finance terms. It spans both text and visual context: the charts, payoff diagrams, and tables that real finance problems come wrapped in. The headline result: the best of 18 evaluated models, o1, scores 67.3% in the text-only setting against 79.8% for human experts, a 12.5-point gap on exactly the reasoning a desk would want to delegate.
The benchmark decomposes performance into five capabilities: terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling. Each maps to work a desk actually delegates. Terminology understanding is the vocabulary layer, knowing what a clean price or a payer swaption is. Temporal reasoning covers value moving through time. Forecasting asks for forward inference from given conditions. Scenario planning walks branching outcomes. Numerical modelling sets up and executes the calculation. That decomposition is what makes XFinBench useful rather than merely sobering, because the gap is not uniform. The paper finds o1 lags human experts most in temporal reasoning and scenario planning. Models hold up far better on terminology and on standard calculation patterns than they do on problems where value moves through time or where a structure must be reasoned through state by state.
That pattern should sound familiar. The sell-side analyst study found the same shape last winter: real model strength in directional judgment, paired with inconsistent quantitative reasoning that did not track its own confidence. XFinBench sharpens the finding with capability-level resolution and a harder problem set. The skills that fail are not exotic. Temporal reasoning is day-count conventions, compounding frequency, when a cash flow actually lands. Scenario planning is walking a tree of states without dropping a branch. These are the load-bearing skills of fixed income, derivatives, and risk work.
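To make the temporal-reasoning point concrete, here is a minimal Python sketch of two of the conventions named above. The notional, rate, and dates are made-up illustrations rather than figures from XFinBench, and the function names are hypothetical.

```python
from datetime import date


def year_fraction(start: date, end: date, basis: str) -> float:
    """Accrual fraction under two common day-count conventions."""
    days = (end - start).days
    if basis == "ACT/360":
        return days / 360.0
    if basis == "ACT/365F":
        return days / 365.0
    raise ValueError(f"unsupported basis: {basis}")


def future_value(principal: float, annual_rate: float, years: float,
                 periods_per_year: int) -> float:
    """Future value with interest compounded periods_per_year times a year."""
    growth = 1.0 + annual_rate / periods_per_year
    return principal * growth ** (periods_per_year * years)


# Illustrative inputs only (assumptions, not taken from the paper).
notional = 10_000_000.0
rate = 0.05
start, end = date(2024, 1, 15), date(2024, 7, 15)

# Same dates, same rate, different day-count basis.
interest_360 = notional * rate * year_fraction(start, end, "ACT/360")
interest_365 = notional * rate * year_fraction(start, end, "ACT/365F")
day_count_gap = interest_360 - interest_365

# Same nominal rate, different compounding frequency.
annual = future_value(100.0, 0.06, 1.0, 1)
monthly = future_value(100.0, 0.06, 1.0, 12)

print(f"ACT/360 vs ACT/365F interest gap: {day_count_gap:,.2f}")
print(f"100 at 6% for one year, annual vs monthly: {annual:.4f} vs {monthly:.4f}")
```

On these inputs the day-count choice alone moves six months of interest by about 3,463 on a ten-million notional: the right sign, the right ballpark, and wrong.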
Where the errors come from
The error analysis is the most actionable part of the paper. Two failure modes dominate. The first is rounding error during calculation: the model sets up the problem correctly, then loses precision mid-arithmetic, an error class that produces answers that are plausibly wrong rather than obviously wrong. Plausibly wrong is the expensive kind. An answer in the right ballpark with the right sign sails past the sanity checks that catch gross errors, then sits in a spreadsheet until something downstream disagrees with it. The second is visual-context blindness: on problems with charts, models misread curve positions and intersections, the bread and butter of payoff diagrams and yield-curve questions.
Both failure modes are silent. A model that sets up the right formula and rounds its way to a wrong number will pass any review that checks method rather than arithmetic. The mitigation for the first is mechanical and known: route calculation through tools rather than token-by-token arithmetic, since the same models with code execution stop dropping decimals. The mitigation for the second is blunter. Chart-reading is not yet a capability to trust unsupervised, which for a desk means the multimodal share of any workflow needs either structured data extraction upstream or a human eye on every visual inference.
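The rounding failure is easy to reproduce outside a model. The sketch below is an illustration of the error class, not the paper's methodology: it prices a level loan payment once with the monthly rate at full floating-point precision and once with that rate rounded to four decimals mid-calculation. The loan terms and the name `level_payment` are assumptions chosen for the example.

```python
def level_payment(principal: float, periodic_rate: float, n_periods: int) -> float:
    """Level payment on a fully amortizing loan (standard annuity formula)."""
    return principal * periodic_rate / (1.0 - (1.0 + periodic_rate) ** -n_periods)


# Hypothetical loan: 1,000,000 at 5% nominal annual, monthly payments, 30 years.
principal = 1_000_000.0
annual_rate = 0.05
n_months = 360

exact_rate = annual_rate / 12          # 0.0041666..., carried at full precision
rounded_rate = round(exact_rate, 4)    # 0.0042, rounded in the middle of the calculation

exact_payment = level_payment(principal, exact_rate, n_months)
rounded_payment = level_payment(principal, rounded_rate, n_months)
relative_error = (rounded_payment - exact_payment) / exact_payment

print(f"full precision:   {exact_payment:,.2f}")
print(f"rounded mid-calc: {rounded_payment:,.2f}")
print(f"relative error:   {relative_error:.2%}")
```

The setup is identical in both runs; only the intermediate precision differs. The resulting error is well under one percent, which is small enough to pass a method review and large enough to matter once it is compounded across a book.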
One more finding cuts against the obvious fix. Augmenting models with retrieved domain knowledge from the benchmark’s own knowledge bank produced consistent gains only for small open-source models. For frontier models, the knowledge was already inside; their failures live in the reasoning, in time and structure, where retrieval adds nothing. Whoever proposes solving a financial-reasoning gap with a RAG layer should sit with that result. Retrieval fixes missing knowledge. The frontier deficit on XFinBench is misapplied knowledge.
What a risk committee should do with this
Seventeen years into building and validating models for desks, my reading of a 12.5-point expert gap is not “wait for the next model.” It is a gating policy, and XFinBench’s capability decomposition practically writes it for you. Terminology-heavy work (definition lookups, instrument classification, document triage) sits comfortably inside current capability and can run with sampling-based review. Calculation-heavy work earns its place when arithmetic is routed through tools and a verification gate checks the output against source numbers. Temporal-reasoning and scenario work, the capabilities measured weakest, stays human-led, with the model as a draft generator whose every dated cash flow gets checked.
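One way to make that policy reviewable is to write it down as data. The following is a minimal sketch, not a prescribed implementation: the capability keys, the `Gate` names, and the fail-closed default are all assumptions of this example. Future forecasting is not assigned a gate in the paragraph above, so the sketch deliberately leaves it unmapped and lets it fall to the strictest gate.

```python
from enum import Enum


class Gate(Enum):
    SAMPLED_REVIEW = "model runs; humans review a sample"
    TOOL_AND_VERIFY = "arithmetic via tools; output checked against source numbers"
    HUMAN_LED = "human leads; model drafts; every dated cash flow is checked"


# Gates ordered from least to most restrictive.
STRICTNESS = [Gate.SAMPLED_REVIEW, Gate.TOOL_AND_VERIFY, Gate.HUMAN_LED]

# Mapping taken from the paragraph above; the keys are hypothetical identifiers.
GATING_POLICY = {
    "terminology_understanding": Gate.SAMPLED_REVIEW,
    "numerical_modelling": Gate.TOOL_AND_VERIFY,
    "temporal_reasoning": Gate.HUMAN_LED,
    "scenario_planning": Gate.HUMAN_LED,
}


def gate_for(capability: str) -> Gate:
    """Gate for one capability; unmapped capabilities fail closed to HUMAN_LED."""
    return GATING_POLICY.get(capability, Gate.HUMAN_LED)


def gate_for_task(capabilities: list[str]) -> Gate:
    """A task that draws on several capabilities inherits the strictest gate."""
    return max(
        (gate_for(c) for c in capabilities),
        key=STRICTNESS.index,
        default=Gate.HUMAN_LED,  # a task with no declared capabilities also fails closed
    )


print(gate_for_task(["terminology_understanding", "numerical_modelling"]).name)
print(gate_for_task(["numerical_modelling", "temporal_reasoning"]).name)
```

Taking the strictest gate across a task's capabilities is a design choice: a bond-pricing request that looks like arithmetic but depends on when coupons land is gated as temporal reasoning, not as calculation.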
The gating policy needs a maintenance schedule to mean anything. A capability profile measured against one model version is a snapshot, while deployments live through upgrades, and every model swap silently re-rolls the profile. The validation file that satisfies both an internal committee and a regulator has three properties: it is benchmarked per capability rather than in aggregate, re-run on every model change, and explicit about which capabilities are gated and how. That is an afternoon of automation against a published benchmark, repaid the first time an upgrade quietly moves a capability you depend on. The alternative is rediscovering the temporal-reasoning gap in production, one mispriced cash flow at a time. Anyone who has watched a data vendor revise a methodology mid-subscription knows the failure mode: the interface stayed identical while the behavior underneath changed, with nobody assigned to notice.
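The re-run-on-every-change check can be as small as a per-capability comparison between two benchmark runs. This is a sketch under stated assumptions: the function name, the two-point tolerance, and the accuracy figures are placeholders invented for illustration, not results from the paper.

```python
def capability_regressions(baseline: dict[str, float],
                           candidate: dict[str, float],
                           tolerance: float = 0.02) -> dict[str, float]:
    """Return capabilities whose accuracy dropped by more than `tolerance`.

    A capability missing from the candidate run is scored 0.0, so an
    incomplete re-run cannot pass silently.
    """
    flagged = {}
    for capability, old_score in baseline.items():
        drop = old_score - candidate.get(capability, 0.0)
        if drop > tolerance:
            flagged[capability] = round(drop, 4)
    return flagged


# Placeholder accuracies for illustration only; NOT results from the paper.
baseline_profile = {
    "terminology_understanding": 0.80,
    "temporal_reasoning": 0.60,
    "scenario_planning": 0.55,
}
candidate_profile = {
    "terminology_understanding": 0.83,
    "temporal_reasoning": 0.52,
    "scenario_planning": 0.55,
}

flagged = capability_regressions(baseline_profile, candidate_profile)
aggregate_change = (
    sum(candidate_profile.values()) - sum(baseline_profile.values())
) / len(baseline_profile)

print("flagged:", flagged)
print(f"aggregate change: {aggregate_change:+.4f}")
```

In this made-up run the average moves by less than the tolerance while temporal reasoning drops eight points, which is the case an aggregate-only check would wave through.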
The multimodal finding deserves its own line in that file. A meaningful share of real desk inputs arrives as charts and payoff diagrams, the exact format where the paper documents models misreading curve positions and intersections. Until that capability is measured at parity, the safe architecture extracts the underlying series upstream and feeds models numbers rather than pictures. Where extraction is impossible, the visual share of the workflow stays human.
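As an illustration of feeding numbers rather than pictures, the sketch below finds a payoff diagram's breakeven from an extracted series instead of from an image. It assumes the upstream extraction step already exists and returns spot and profit pairs; the long-call example, the grid, and both function names are hypothetical.

```python
def long_call_profit(spot: float, strike: float, premium: float) -> float:
    """Profit at expiry of a long call: intrinsic value minus premium paid."""
    return max(spot - strike, 0.0) - premium


def zero_crossings(xs: list[float], ys: list[float]) -> list[float]:
    """x positions where a piecewise-linear series crosses or touches zero."""
    crossings = []
    for i in range(len(xs) - 1):
        x0, y0, x1, y1 = xs[i], ys[i], xs[i + 1], ys[i + 1]
        if y0 == 0.0:
            crossings.append(x0)
        elif y0 * y1 < 0.0:
            # Linear interpolation between the two bracketing points.
            crossings.append(x0 - y0 * (x1 - x0) / (y1 - y0))
    if ys and ys[-1] == 0.0:
        crossings.append(xs[-1])
    return crossings


# Stand-in for a series extracted upstream (assumed values, not real data).
spots = [80.0, 90.0, 100.0, 110.0, 120.0]
profits = [long_call_profit(s, strike=100.0, premium=5.0) for s in spots]

breakevens = zero_crossings(spots, profits)
print("breakeven spot(s):", breakevens)
```

The interpolation is exact here only because the grid includes the strike, where the payoff kinks; a coarser extraction would need the kink points carried explicitly.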
The meta-lesson is about benchmark selection for validation work. Generic math scores actively mislead here: a model can sit in the 90s on competition math, the territory contamination-resistant benchmarks were built to police, while sitting 12.5 points under experts on graduate finance. Domain benchmarks with capability decomposition are the only evidence that maps to a deployment decision, the same lesson FinanceBench taught for retrieval two years ago. The right artifact for a model-risk file is not a leaderboard rank. It is a capability-by-capability profile against the specific skills your use case draws on, refreshed when the model changes, with the weak capabilities gated rather than averaged away.
The result to carry into the meeting: current frontier models are competent finance juniors with a specific, measurable deficit in time and structure. Assign them work the way you would assign it to a junior with that exact profile. Nothing in XFinBench says do not deploy. Everything in it says deploy by capability rather than by aggregate score.
XFinBench measures a 12.5-point expert gap concentrated in temporal reasoning and scenario planning: deploy models by capability profile, gate the weak ones, and never let an aggregate score make the decision.