// Insight
Fin-RATE: models ace the document and fail the workflow
Filing-analysis benchmarks keep testing what models do to one document, while analysts get paid for what they do across many. Fin-RATE is built on that distinction. Three task families mirror the actual workflow: reasoning within a single disclosure, comparing entities on shared topics, and tracking one firm across reporting periods. Across 17 leading models, accuracy drops 14.35 points on the shift to cross-entity tasks and 18.60 on the shift to longitudinal ones. The models are competent at the document and degraded at the workflow, which is the finding every deployment plan needs taped to it.
The construction earns the workflow claim. Rather than isolated capability probes, the families are the three motions a fundamental analyst performs daily: pull the detail and reason about it inside this 10-K; line up disclosure on the same topic across competitors; follow the same firm’s story through consecutive quarters. Seventeen models spanning open-source, closed-source, and finance-specialized lines face all three, in both ground-truth and retrieval-augmented settings, which separates reasoning failures from retrieval ones, the confound that muddied filing QA evaluation for years.
Where the points go, and why
The degradation pattern is the diagnostic content. The paper attributes the drops to comparison hallucinations, temporal and entity mismatches, and declining reasoning quality and factual consistency as tasks widen. The authors note that existing benchmarks never formalized these failure categories. Read structurally, the three families differ in one variable: how many things must be held in correct correspondence at once. One document needs internal consistency. Cross-entity needs entity bindings held straight while content is compared; a model that attributes Firm A’s covenant to Firm B has failed silently, in the way that passes review and poisons the memo. Longitudinal needs temporal bindings (which quarter said what) plus the discipline to track change rather than blend periods into a smoothie of facts.
The irony stings if you have been following this blog’s alternative-data thread: the moving-targets signal showed that change across periods is where disclosure alpha lives. Fin-RATE now shows general models are worst at that very motion when asked to perform it end to end. The signal extraction worked there because the pipeline held the temporal bookkeeping in code and gave the model one bounded job per call. The benchmark’s models fail here because they are asked to be the pipeline, bookkeeping included.
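A minimal sketch of what "temporal bookkeeping in code" can look like, under loud assumptions: period labels follow a hypothetical `YYYYQn` format, and `toy_extractor` is a made-up stand-in for a single bounded model call. It is not the pipeline from the earlier post, only an illustration of the shape: the code orders the quarters and computes the changes, and the model never sees more than one period at a time.

```python
from typing import Callable, Dict, List, Tuple


def parse_period(label: str) -> Tuple[int, int]:
    """Turn a label like '2023Q4' into a sortable (year, quarter) pair."""
    year, quarter = label.split("Q")
    return int(year), int(quarter)


def track_changes(
    filings_by_period: Dict[str, str],
    extractor: Callable[[str], str],
) -> List[dict]:
    """One bounded extraction per period; ordering and diffing stay in code."""
    ordered = sorted(filings_by_period, key=parse_period)
    values = {period: extractor(filings_by_period[period]) for period in ordered}
    changes = []
    # Compare each period only with its immediate predecessor.
    for prev, curr in zip(ordered, ordered[1:]):
        if values[prev] != values[curr]:
            changes.append(
                {"from": prev, "to": curr,
                 "before": values[prev], "after": values[curr]}
            )
    return changes


def toy_extractor(text: str) -> str:
    """Hypothetical stand-in for a model call: normalizes the text only."""
    return text.strip().lower()


# Toy inputs, deliberately supplied out of chronological order.
toy_filings = {
    "2024Q1": "Guidance raised",
    "2023Q3": "Guidance maintained",
    "2023Q4": "Guidance maintained",
}
changes = track_changes(toy_filings, toy_extractor)
```

Because the period label is attached by the scaffold, the model has no opportunity to blend quarters; it can still extract the wrong thing from one filing, but it cannot misplace it in time.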
That contrast generalizes into the deployment rule this benchmark actually licenses. The degradation is not a capability wall; it is a workload-shape penalty, and workload shapes can be re-engineered. Decompose the cross-entity comparison into per-entity extraction plus a deterministic join, with entity bindings enforced by scaffold rather than attention. The task collapses back toward the single-document regime where models score well. The A-RAG primitives lesson applies in full: the wins come from giving the model bounded jobs inside structure, with correspondence handled by machinery that cannot drift. Agents that freestyle across twelve quarters of filings are running the benchmark’s hardest condition by choice.
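The decomposition can be sketched in a few lines. Everything named here is an assumption for illustration: `Extraction`, `extract_per_entity`, and `join_by_topic` are hypothetical helpers, and `toy_extractor` replaces the model call with a keyword match so the example runs on its own. The point is the shape: each call sees one filing, and the entity label comes from the scaffold's dictionary key, not from anything the model says.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Extraction:
    entity: str  # stamped by the scaffold, never produced by the model
    topic: str
    value: str


def extract_per_entity(
    filings: Dict[str, str],
    topic: str,
    extractor: Callable[[str, str], str],
) -> List[Extraction]:
    """Run one bounded, single-document extraction per filing."""
    return [
        Extraction(entity=entity, topic=topic, value=extractor(text, topic))
        for entity, text in sorted(filings.items())
    ]


def join_by_topic(rows: List[Extraction]) -> Dict[str, Dict[str, str]]:
    """Deterministic join: topic -> {entity: extracted value}."""
    table: Dict[str, Dict[str, str]] = {}
    for row in rows:
        table.setdefault(row.topic, {})[row.entity] = row.value
    return table


def toy_extractor(text: str, topic: str) -> str:
    """Hypothetical stand-in for a model call: first line mentioning the topic."""
    for line in text.splitlines():
        if topic.lower() in line.lower():
            return line.strip()
    return "NOT_FOUND"


# Toy filings, invented for the example.
filings = {
    "FirmA": "Revenue grew.\nLeverage covenant: max 3.5x net debt/EBITDA.",
    "FirmB": "Leverage covenant: max 4.0x net debt/EBITDA.\nCapex flat.",
}
comparison = join_by_topic(extract_per_entity(filings, "covenant", toy_extractor))
```

With this structure, attributing Firm A's covenant to Firm B would require a bug in the join, which is testable, rather than a lapse of attention, which is not.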
The finance-specialized models failing alongside the generalists carries its own lesson about where tuning budgets go. Finance fine-tuning corpora are overwhelmingly single-document shaped (QA pairs about one filing, sentiment on one release, extraction from one exhibit) because that is what is cheap to construct. Models tuned on that diet sharpen exactly the family the benchmark shows was never the bottleneck, while the correspondence skills the workflow needs (entity binding across documents, temporal binding across quarters) appear in no training set anyone has built. A desk commissioning fine-tuning after this result should commission workflow-shaped data or save the money.
The dual-setting design quietly answers a question desks keep asking in the wrong order. Running every task in both ground-truth and retrieval-augmented modes separates what the model cannot reason about from what the pipeline failed to fetch, the two failure classes that get blamed on each other in every post-mortem. The workflow penalty appearing in the ground-truth setting, where retrieval is removed from suspicion, is what pins the degradation on correspondence rather than on search. For anyone budgeting fixes, that attribution is worth the whole benchmark: retrieval upgrades, the reflex spend of the last two years, do not touch this failure mode, while binding checks and decomposition scaffolds, which cost a fraction as much, attack it directly.
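The attribution logic is simple enough to write down. This is a sketch of the idea, not the benchmark's scoring code: the function names are hypothetical, and the rule assumes each task has been graded once with ground-truth context and once with retrieved context.

```python
from collections import Counter
from typing import Dict, Tuple


def attribute_failure(correct_with_gold: bool, correct_with_rag: bool) -> str:
    """Classify one task from its two graded runs.

    Failing with ground-truth context cannot be blamed on retrieval,
    so it counts as a reasoning failure regardless of the RAG result.
    """
    if not correct_with_gold:
        return "reasoning"
    if not correct_with_rag:
        return "retrieval"
    return "ok"


def attribution_counts(results: Dict[str, Tuple[bool, bool]]) -> Counter:
    """Tally failure classes over {task_id: (gold_correct, rag_correct)}."""
    return Counter(attribute_failure(gold, rag) for gold, rag in results.values())


# Toy grades, invented for the example.
toy_results = {
    "t1": (True, True),    # fine in both settings
    "t2": (True, False),   # right with gold context, wrong after retrieval
    "t3": (False, False),  # wrong even with the right documents
    "t4": (False, True),   # unstable: still treated as a reasoning failure
}
counts = attribution_counts(toy_results)
```

A tally dominated by "reasoning" argues for scaffolds and binding checks; one dominated by "retrieval" argues for the search spend.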
The validation translation
For a model-risk function the benchmark supplies the missing decomposition axis. Capability profiles built on single-document scores, which is what most vendor evaluations and internal pilots measure, overstate workflow readiness by roughly 14 to 19 points, a bias in the direction that approves deployments. The gating policy XFinBench wrote for reasoning capabilities extends naturally to workflow shapes: single-document analytics inside current capability with sampling review; cross-entity outputs gated behind entity-binding verification, a cheap deterministic check (do the names in the answer match the documents retrieved?); longitudinal conclusions human-led until your own harness, on your own coverage universe, shows the gap closed. Each gate is testable and maps to a Fin-RATE family, and re-measurement follows model releases, as it always should.
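The entity-binding gate might look like the sketch below. The assumptions are significant: the coverage universe is a known list of canonical names, answers use those names verbatim, and `binding_check` is a hypothetical helper rather than anything from the benchmark. Real filings would need alias and ticker handling on top.

```python
import re
from typing import Dict, Iterable, Set


def mentioned_entities(answer: str, universe: Iterable[str]) -> Set[str]:
    """Return the universe names that appear in the answer as whole tokens."""
    found = set()
    for name in universe:
        # Lookarounds instead of \b so names ending in punctuation still match.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, answer, flags=re.IGNORECASE):
            found.add(name)
    return found


def binding_check(
    answer: str,
    retrieved_entities: Iterable[str],
    universe: Iterable[str],
) -> Dict[str, object]:
    """Compare the names in an answer with the entities whose documents were used."""
    mentioned = mentioned_entities(answer, universe)
    retrieved = set(retrieved_entities)
    unsupported = mentioned - retrieved  # named in the answer, no document behind it
    missing = retrieved - mentioned      # document supplied, never named
    return {
        "passed": not unsupported and not missing,
        "unsupported": sorted(unsupported),
        "missing": sorted(missing),
    }


# Toy universe and answers, invented for the example.
universe = ["Acme", "Borealis", "Cobalt"]
retrieved = ["Acme", "Borealis"]
good = binding_check("Acme reports a tighter covenant than Borealis.", retrieved, universe)
bad = binding_check("Acme reports a tighter covenant than Cobalt.", retrieved, universe)
```

This check catches answers that name the wrong firms. It does not catch a swap between two correctly named firms, which is why it works as a gate in front of review and not as a replacement for it.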
Building the in-house edition is a quarter’s side project with compounding returns. Your coverage universe defines the entities, your peer sets define the comparisons, and your archive of quarterly notes defines longitudinal ground truth that no public benchmark can leak. Thirty tasks per family, graded with the blind protocol the deliverable-evaluation literature standardized, produce the workflow-shaped capability profile that vendor decks will never volunteer, refreshed each model generation for the cost of a grading afternoon.
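The scoring side of such a harness is small. The sketch below assumes graders hand back one pass/fail per task, and the family names and `workflow_profile` helper are hypothetical. The grades are made up purely to show the arithmetic and are not results from Fin-RATE or any model.

```python
from typing import Dict, List

FAMILIES = ("single_document", "cross_entity", "longitudinal")


def accuracy(grades: List[bool]) -> float:
    """Percentage of tasks graded correct."""
    return 100.0 * sum(grades) / len(grades)


def workflow_profile(grades_by_family: Dict[str, List[bool]]) -> Dict[str, Dict[str, float]]:
    """Per-family accuracy plus each workflow family's gap to single-document."""
    acc = {family: accuracy(grades_by_family[family]) for family in FAMILIES}
    base = acc["single_document"]
    gaps = {
        family: base - acc[family]
        for family in FAMILIES
        if family != "single_document"
    }
    return {"accuracy": acc, "gap_vs_single_document": gaps}


# Made-up grades: 30 tasks per family, illustrative only.
toy_grades = {
    "single_document": [True] * 24 + [False] * 6,
    "cross_entity": [True] * 18 + [False] * 12,
    "longitudinal": [True] * 15 + [False] * 15,
}
profile = workflow_profile(toy_grades)
```

The gap figures, tracked per model release, are the numbers the gating policy keys on: a longitudinal gate opens only when its gap on your own universe closes.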
The 17-model breadth adds one more usable fact: the degradation pattern holds across the field, finance-tuned models included, which says the fix is not waiting in a better checkpoint this quarter. Workflow-shaped weakness needs workflow-shaped engineering, the scaffolds, joins, and binding checks above, rather than procurement. Benchmarks that tell you where engineering beats shopping are the rare ones that save money in both directions.
Models lose 14 to 19 points the moment filing analysis spans entities or quarters, exactly where analyst work lives. The fix is workflow engineering (bounded jobs inside deterministic correspondence) rather than waiting for a better checkpoint.