// Insight

FinToolBench: 760 real tools, and the best agent executes a third

May 26, 2026 · 4 min read

tool-use · agents · benchmark

Tool-use benchmarks have been grading agents in toy kitchens: a handful of mock APIs that always respond, never expire, and carry no compliance semantics. FinToolBench builds the real kitchen. It pairs 760 executable financial tools with 295 queries that genuinely require them, and it scores on execution rather than intention. The headline lands hard: the best evaluated agent achieves end-to-end execution on 32.54 percent of queries, while the model with the highest tool-invocation rate succeeds on just 29.49 percent.

The decomposition is where the diagnostic value lives. Qwen3-8B invokes tools on 87.12 percent of queries, eager to act, then converts that activity into success only about a third of the time. GPT-4o sits at the opposite pole: a conservative 22.67 percent invocation rate but, when it does commit, a conditional success rate of 61.76 percent, the highest precision in the table. Doubao-Seed-1.6 balances the two into the best overall execution. The paper’s conclusion follows: neither aggressive tool use nor cautious precision suffices alone. End-to-end performance needs reliable execution combined with active selection, two skills the field has been measuring as one.
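The decomposition can be checked with a few lines of arithmetic. This is a minimal sketch, assuming end-to-end success is simply invocation rate multiplied by conditional success; that identity is an assumption about how the metrics relate, not something the article states. It also assumes the 29.49 percent figure belongs to Qwen3-8B, as the model with the highest invocation rate. The function names are hypothetical.

```python
# Sketch only. Assumed identity (not confirmed by the article):
#   end_to_end = invocation_rate * conditional_success
# All rates are percentages of the 295 queries.

def end_to_end(invocation_pct: float, conditional_pct: float) -> float:
    """End-to-end success implied by an invocation rate and a conditional rate."""
    return invocation_pct * conditional_pct / 100.0


def implied_conditional(invocation_pct: float, end_to_end_pct: float) -> float:
    """Conditional success implied by an invocation rate and an end-to-end rate."""
    return 100.0 * end_to_end_pct / invocation_pct


# GPT-4o: rarely invokes, but converts well when it does.
gpt4o_e2e = end_to_end(22.67, 61.76)

# Qwen3-8B: invokes almost always; 29.49 is assumed to be its end-to-end rate.
qwen_conditional = implied_conditional(87.12, 29.49)

print(f"GPT-4o implied end-to-end success: {gpt4o_e2e:.2f}%")
print(f"Qwen3-8B implied conditional success: {qwen_conditional:.2f}%")
```

Under these assumptions, GPT-4o's high precision still yields only about 14 percent end-to-end, and Qwen3-8B's eagerness converts at roughly 34 percent. Each dial on its own leaves the product low.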

Figure: End-to-end execution success on 295 real queries (%). GPT-4o invokes least at 22.67% but converts at 61.76% when it commits; eagerness and precision are different skills.

Two failure profiles, one missing skill

End-to-end success needs both dials turned up; every model in the table has one stuck.

The finance-specific dimensions explain why general-purpose benchmark numbers were always going to flatter. Intent mismatch, where the agent calls a tool that answers a different question from the one asked, ranges from 50 percent at best to 68.87 percent at worst. Timeliness mismatch, where stale data is retrieved for queries whose whole point is freshness, spans 30 to 46 percent across the evaluated models; it is a failure class unique to domains where data volatility is the operating condition. The paper’s FATR baseline, which injects finance attributes into tool retrieval and reasoning, measurably improves tool choice and compliance alignment, which suggests the gap is partly addressable with domain scaffolding rather than being purely a capability wall.

A race shop owns more tools than any mechanic touches in a season, and nobody grades the crew on how fast they grab something off the wall. The stop is graded on the wheel coming off and going back on. The rookie error is always the same: the right gun, the wrong socket, executed with total confidence. Tool invocation at 87 percent with execution at 29 is a garage full of confident rookies; the pit clock only counts finished wheels.

The result completes the reliability ledger this archive has kept on agentic claims. Model cards advertise hundreds of sequential tool calls; the survey’s missing column was whether any of them complete real tasks repeatedly; FinToolBench answers for finance specifically, with execution-grounded numbers that sit far below every capability narrative. A 32.54 percent ceiling on real financial tools is not a deployment verdict against agents. It is a sizing instruction. Tool-using agents in finance today belong on tasks where a two-thirds failure rate is recoverable: drafts, research legwork, and supervised retrieval, with the whitelists and budgets that make failure cheap. The FATR result points at the practical lever: teach the stack finance semantics before buying a bigger model.

One number is worth circling for anyone running deployment reviews. The timeliness mismatch floor of 30 percent means that, even with the better agents, roughly one financial answer in three arrives built on stale data, a failure class no general benchmark has a column for. Freshness checks belong in the scaffold, as cheap timestamp assertions on every retrieved value, rather than on the wish list for the next model.
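A timestamp assertion of that kind can be very small. The sketch below is illustrative only: `RetrievedValue`, `StaleDataError`, and `assert_fresh` are hypothetical names, and it assumes every tool response carries a timezone-aware "as of" timestamp, which real financial tools may or may not provide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class RetrievedValue:
    """Hypothetical wrapper for one value returned by a tool call."""
    name: str
    value: float
    as_of: datetime  # assumed timezone-aware timestamp reported by the tool


class StaleDataError(ValueError):
    """Raised when a retrieved value is older than the query tolerates."""


def assert_fresh(item: RetrievedValue, max_age: timedelta,
                 now: Optional[datetime] = None) -> RetrievedValue:
    """Return the item unchanged if it is fresh enough, otherwise raise."""
    current = now if now is not None else datetime.now(timezone.utc)
    age = current - item.as_of
    if age > max_age:
        raise StaleDataError(
            f"{item.name} is {age} old, which exceeds the allowed {max_age}"
        )
    return item


# Usage with a fixed clock so the example is deterministic.
clock = datetime(2026, 5, 26, 15, 0, tzinfo=timezone.utc)
quote = RetrievedValue("last_price", 101.25, clock - timedelta(minutes=2))
assert_fresh(quote, max_age=timedelta(minutes=15), now=clock)  # passes

old_quote = RetrievedValue("last_price", 99.80, clock - timedelta(days=1))
try:
    assert_fresh(old_quote, max_age=timedelta(minutes=15), now=clock)
except StaleDataError as err:
    print(f"rejected: {err}")
```

The tolerance has to come from the query, not the tool: a fifteen-minute window may suit an intraday price and be absurd for a quarterly filing, so `max_age` is passed in per call rather than fixed globally.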

The benchmark’s existence may matter more than its first numbers. Executable evaluation with timeliness and compliance dimensions is the harness financial agent claims have been missing. The 760-tool surface is large enough to resist the overfitting that consumed smaller suites. The first vendor whose agent clears fifty percent here will have earned a meeting; the ones quoting invocation rates have already told you which number they would rather discuss.

On 760 real financial tools, the best agent finishes 32.54 percent of jobs while the most eager one calls tools on 87 percent of queries: execution and enthusiasm are different metrics, and the pit clock only counts finished wheels.
