
RACE: when the reasoning and the answer disagree

December 20, 2025 (6 min read)

Tags: hallucination, reasoning-models, validation

Reasoning models changed what a hallucination looks like. The old failure was a confident wrong answer. The new one is subtler: a long, fluent reasoning trace that is redundant, circular, or quietly inconsistent, attached to an answer that may be right anyway. An answer that is correct for incoherent reasons is not a success; it is a failure that has not happened yet. RACE is the first hallucination detector this archive has covered that treats the trace as evidence rather than decoration.

The design samples multiple responses and extracts the essential reasoning steps from each, then computes four diagnostic signals. Inter-sample reasoning consistency asks whether the model takes compatible paths across samples. Entropy-based answer uncertainty reads the spread of final answers. Reasoning-answer alignment checks semantically whether the stated logic actually supports the stated conclusion. Internal coherence scores the trace’s own logical integrity. The four fuse into a single hallucination score, with the cross-checking structure carrying the value: a model can fake any one signal, while faking all four at once is much harder.
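A minimal sketch of the two mechanical pieces, the answer-entropy signal and the fusion step, under loud assumptions. Nothing here is the paper's formula: grouping answers by exact match, normalizing entropy by the log of the sample count, and fusing with an equal-weight mean are all illustrative choices, and `answer_entropy` and `fused_score` are hypothetical names. The other three signals need a semantic model and are taken as given inputs in [0, 1].

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Normalized Shannon entropy of sampled final answers, in [0, 1].

    Assumption: answers are grouped by case-insensitive exact match;
    a real system would likely cluster semantically equivalent answers.
    """
    counts = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    if len(counts) <= 1:
        return 0.0  # zero or one distinct answer: no spread
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)  # log(n) is the entropy when all n answers differ


def fused_score(consistency, entropy, alignment, coherence,
                weights=(0.25, 0.25, 0.25, 0.25)):
    """Hypothetical fusion: weighted mean of four risk terms in [0, 1].

    Consistency, alignment, and coherence are 'higher is better', so they
    are inverted; entropy is already 'higher is riskier'.
    """
    risks = (1.0 - consistency, entropy, 1.0 - alignment, 1.0 - coherence)
    return sum(w * r for w, r in zip(weights, risks))
```

The point of the sketch is the shape of the cross-check: a trace that scores well on three signals still moves the fused score if the fourth disagrees.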

Figure: RACE, four signals across trace and answer. The trace becomes evidence; agreement across all four signals is hard to fake.

The numbers hold up across models and datasets. Against SINdex, the strongest baseline, plus the standard field of semantic-entropy and self-checking methods, RACE wins consistently: 77.62 versus 74.50 AUROC on HotpotQA with a DeepSeek-R1 distill, and 89.67 versus 87.11 on TriviaQA with Qwen3-14B. Across the evaluated models, RACE’s AUROC ranges from roughly 72 to 91 depending on dataset, ahead of the baselines in every reported configuration.

Figure: Hallucination detection AUROC, RACE vs best baseline. Consistent wins across QA datasets and reasoning models, per the paper's main table.

The margin widens where retrieval-free recall gets hardest: on NQ-Open, RACE spans 72.14 to 78.61 across the evaluated models against 65.48 to 73.19 for SINdex, the largest gap in the table. Harder questions force longer reasoning, and longer traces carry more internal evidence, so the trace-reading detector gains exactly where output-only methods thin out.

Calibrate the enthusiasm to what AUROC in the high seventies means operationally: a useful ranking signal with real error mass on both sides, sufficient for routing and triage, insufficient as a sole gate. The evaluation also lives on question-answering datasets rather than financial calculations, so the transfer to multi-step quantitative work rests on analogy until someone measures it. Both caveats define the deployment; neither defeats it.

Why the trace is the audit surface

The finance case for reasoning-aware detection is sharper than the general one, because the right-number-wrong-reason failure is a recognized model-risk category with a name and a history. Every validation function has met the model that backs into a correct valuation through two offsetting errors. It passes outcome checks today and detonates when the errors stop offsetting. XFinBench documented the model version of the pattern: setups that are right while the arithmetic drifts, and answers that are plausibly wrong rather than obviously wrong. An output-only detector cannot see any of this by construction. A trace-reading detector at least looks where the failure lives.

RACE also completes a layered stack this archive has been assembling for two years. Pre-trained UQ heads read the model’s internal states and cost almost nothing per call. RACE reads sampled traces, costs a multiple of inference, and sees failure modes the internals miss. The verification gate checks outputs against source documents and catches what both miss. Cheap screen, trace check on the flagged minority, document verification on whatever feeds a decision: three layers, ordered by cost, each catching what the previous one structurally cannot. No single detector is the answer; the stack of three is starting to look like the deployable shape of LLM validation.

The sampling cost deserves its honest line item. RACE needs multiple responses per query, which multiplies inference cost by the sample count: unaffordable as a blanket policy, affordable on the escalation path. The routing logic writes itself: the cheap UQ screen runs everywhere, RACE runs on the uncertain tail, humans see what RACE cannot clear. A desk running reasoning models whose deliberation is already metered will recognize the pattern: the same escalate-the-residue economics, applied to validation instead of generation.
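The routing described above can be sketched in a few lines. Everything named here is hypothetical: `uq_screen` and `race_check` stand in for whatever cheap screen and trace check a desk actually runs, the thresholds are placeholders to be calibrated, and the cost model assumes the trace check draws `n_samples` fresh responses on top of the original one.

```python
def route_query(query, uq_screen, race_check,
                uq_threshold=0.3, race_threshold=0.5, n_samples=5):
    """Return (decision, inference_calls) for one query.

    uq_screen(query) -> risk in [0, 1], assumed near-free per call.
    race_check(query, n_samples) -> fused risk in [0, 1].
    """
    calls = 1  # the original response
    if uq_screen(query) < uq_threshold:
        return "accept", calls
    calls += n_samples  # assumption: sampling cost is paid only on escalation
    if race_check(query, n_samples) < race_threshold:
        return "accept_after_trace_check", calls
    return "human_review", calls


def expected_calls(escalation_rate, n_samples):
    """Average generations per query when only the escalated share is sampled."""
    return 1 + escalation_rate * n_samples
```

Under these assumptions, escalating a tenth of traffic with five samples each averages 1.5 generations per query, against 6 if the trace check ran everywhere. That arithmetic is the whole case for putting the check on the escalation path.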

Running it without fooling yourself

Operationalizing a detector like this is mostly calibration discipline, and three practices decide whether the AUROC on paper becomes precision in production. First, set the operating threshold from your review capacity rather than from the paper: a flag rate your reviewers cannot clear within a day is a queue; a stale queue is a control that exists only on the org chart. Sweep the threshold against your own labeled sample until flagged-and-reviewed matches the hours you actually have. Second, log the four signals separately rather than only the fused score. A flag driven by answer entropy points at genuine model uncertainty; one driven by reasoning-answer misalignment points at confabulation; the triage differs, while the fused score alone tells the reviewer nothing about where to look. Third, treat every model upgrade as a recalibration event, since the signal distributions are properties of the model that produced the traces. A threshold tuned on one checkpoint silently changes its flag rate on the next, the same silent re-roll that makes capability profiles go stale across upgrades.
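A sketch of the first two practices, with the third carried as a log field. The function names, the log schema, and the strict-inequality flag rule are assumptions made for illustration; the calibration scores are assumed to be fused scores from your own labeled sample and representative of daily traffic.

```python
import json


def threshold_for_capacity(scores, daily_volume, reviews_per_day):
    """Threshold such that flagging `score > threshold` fits review capacity."""
    if not scores:
        raise ValueError("need a non-empty calibration sample")
    # How many calibration items the daily review budget would cover.
    k = min(len(scores), int(reviews_per_day * len(scores) // daily_volume))
    ranked = sorted(scores, reverse=True)
    if k == len(ranked):
        return float("-inf")  # capacity covers everything: flag all
    return ranked[k]  # strict '>' leaves ties at the cut unflagged


def flag_record(query_id, checkpoint, risks, fused, threshold):
    """One log line that keeps the four signals separate from the fused score.

    risks: dict of per-signal risk terms in [0, 1], higher meaning riskier.
    """
    return json.dumps({
        "query_id": query_id,
        "model_checkpoint": checkpoint,  # a change here is a recalibration event
        "fused": fused,
        "flagged": fused > threshold,
        "signals": risks,
        "dominant_signal": max(risks, key=risks.get),
    }, sort_keys=True)
```

Logging `dominant_signal` next to the fused score is what lets a reviewer tell an uncertainty flag from a confabulation flag without recomputing anything, and grouping flag rates by `model_checkpoint` makes a silent shift after an upgrade visible.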

The labeled sample these practices depend on is cheaper than it sounds, because a desk running a verification gate is already producing it. Every output the gate adjudicates, confirmed against source documents or corrected by a reviewer, is a labeled example of the exact distribution the detector will police. A few hundred of those, accumulated in normal operation, are enough to set thresholds and sanity-check the paper’s AUROC on your own tasks. The calibration set is a byproduct of the controls you should already have.
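Sanity-checking AUROC on a few hundred gate-adjudicated examples needs no library. The sketch below uses the rank interpretation of AUROC: the probability that a randomly chosen hallucinated output scores higher than a randomly chosen correct one, with ties counted as half. The label convention (1 for hallucinated, 0 for confirmed) is an assumption, and the pairwise loop is quadratic, which is fine at this sample size.

```python
def auroc(scores, labels):
    """AUROC as P(hallucinated scores above correct), ties counted as half.

    scores: detector scores, higher meaning more likely hallucinated.
    labels: 1 for outputs the gate corrected, 0 for outputs it confirmed.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Read this way, an AUROC in the high seventies means roughly one pair in four is ranked the wrong way round, which is why the score works for triage and not as a sole gate.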

The sample-count dial is the budget lever. Fewer samples per query cheapen the check and blur every consistency signal; more samples sharpen the signals at linear cost. The honest deployment finds the knee of that curve on its own task distribution rather than inheriting the paper’s setting, which is an afternoon of measurement once the labeled sample exists.
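One way to find the knee, sketched under assumptions: you have already measured AUROC on your own labeled sample at a handful of sample counts, and you are willing to name the smallest AUROC gain per extra sample that is still worth paying for. `pick_sample_count` and its default cutoff are hypothetical, not anything the paper prescribes.

```python
def pick_sample_count(auroc_by_n, min_gain_per_sample=0.005):
    """Smallest measured n beyond which extra samples add too little AUROC.

    auroc_by_n: dict mapping sample count -> AUROC measured on your own data.
    """
    if not auroc_by_n:
        raise ValueError("need at least one measurement")
    ns = sorted(auroc_by_n)
    for lo, hi in zip(ns, ns[1:]):
        # Marginal AUROC gained per additional sample between two measurements.
        gain = (auroc_by_n[hi] - auroc_by_n[lo]) / (hi - lo)
        if gain < min_gain_per_sample:
            return lo
    return ns[-1]  # still improving at the largest count measured
```

The cutoff is a budget decision, not a statistical one: since cost is linear in the sample count, it is the price in inference you are prepared to pay per point of ranking quality.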

For a model-risk file, the property worth writing down is that the trace becomes a tested artifact rather than documentation theater. Reasoning models ship with visible work; until now the work was something a reviewer might read, sometime, if a number looked odd. A fused consistency score makes the trace machine-checkable at scale, which converts “show your work” from a courtesy into a control. The governance shift is small but real: the reasoning is now part of what gets validated, not part of what gets filed.

The verdict: adopt the layer, size it correctly. RACE-style trace checking belongs on the escalation path of any reasoning-model deployment that touches numbers, between the cheap screen and the human. The AUROC says it earns its inference bill there. Nothing yet says it replaces either neighbor. The QA-to-finance transfer wants a measurement before the committee deck claims it.

Reasoning models fail in the trace before they fail in the answer. RACE makes the trace machine-checkable, with four fused signals at an AUROC the baselines do not reach. The right-number-wrong-reason failure finally has a detector.
