// Insight

HalluBench: your hallucination detector degrades with your knowledge graph

April 21, 2026 · 4 min read

Tags: hallucination, knowledge-graphs, financial-QA

Knowledge-graph-augmented retrieval keeps being proposed for financial QA on the theory that structure disciplines generation. FinReflectKG-HalluBench tests the supervision layer that theory depends on, and finds it leaning on the same structure it is supposed to police. The top detector families score F1 of 0.82 to 0.86 when the knowledge graph is clean. When noisy triplets enter, graph-trusting detectors lose 44 to 84 percent of their discriminative power, measured as Matthews correlation (MCC), while embedding-based methods degrade only 9 percent. The detector you validated on clean infrastructure is a different detector on the infrastructure you actually run.

The benchmark itself is built the way financial evaluation should be: 755 annotated examples drawn from 300 pages of SEC 10-K filings, with evidence-linkage labels that require an answer’s support to be traced to both the retrieved text chunks and the relational triplets. That dual-evidence design is the point. A KG-augmented answer can hallucinate against the text, against the graph, or against the join between them; a benchmark that only checks one channel misses the failure modes living in the others. Six detection approaches face it, drawn from LLM judges, fine-tuned classifiers, natural-language-inference models, span detectors, and embedding-based methods.

[Figure: What HalluBench actually tests. 755 examples, 300 pages; the noise condition is where the field's assumptions break. Chart: detector degradation when KG triplets turn noisy (MCC drop, %). Range endpoints from the paper; clean-condition F1 was 0.82-0.86 for both families.]

The clean-condition result reads as good news until the stress condition reprices it. That LLM judges and embedding methods lead at 0.82-0.86 F1 matches the detection-stack picture this blog assembled in December. The noise injection is the contribution. Corrupt some triplets, which is the realistic condition for any graph built by automated extraction at filing scale, and judge-style detectors collapse: Matthews correlation drops by up to 84 percent at p < 0.001, while embedding methods barely notice. The mechanism is uncomfortable in the way useful results are. Judge-style detectors trust the graph as ground truth, so corrupted structure does not merely evade them; it actively recruits them, lending fluent confidence to whatever the bad triplet asserts. Embedding methods, comparing answers to evidence in representation space without granting the graph authority, keep their footing.
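For readers who want the degradation figure in concrete terms, here is a minimal sketch of how a relative MCC drop can be computed from a detector's confusion matrix under clean and noisy conditions. The confusion counts are invented for illustration and are not the paper's numbers, and the function names are hypothetical; the paper's own evaluation code may differ.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from a 2x2 confusion matrix.

    Returns 0.0 when any marginal is empty, a common convention
    for the otherwise undefined 0/0 case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def relative_drop_pct(clean: float, noisy: float) -> float:
    """Share of the clean-condition score lost under noise, in percent."""
    if clean <= 0:
        raise ValueError("clean-condition score must be positive")
    return 100.0 * (clean - noisy) / clean


# Illustrative counts only (assumption): the same detector scored on the
# same 200 answers, once with a clean graph and once with noisy triplets.
clean_mcc = mcc(tp=80, fp=20, fn=20, tn=80)
noisy_mcc = mcc(tp=60, fp=40, fn=40, tn=60)
drop = relative_drop_pct(clean_mcc, noisy_mcc)

print(f"clean MCC={clean_mcc:.2f}, noisy MCC={noisy_mcc:.2f}, drop={drop:.1f}%")
```

MCC is the useful lens here because it uses all four cells of the confusion matrix, so a detector that drifts toward trusting everything the graph asserts loses score even if its headline accuracy on one class holds up.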

For anyone running or planning graph-augmented retrieval on filings, three consequences follow. First, the graph is now inside the model-risk boundary: its extraction quality, its update cadence, and its error rate are model inputs, with validation obligations to match, exactly the retrieval-quality-as-silent-ceiling lesson relocated one layer up. Second, detector validation on curated benchmarks overstates production performance unless the benchmark’s corruption condition matches your pipeline’s actual noise floor; measuring that floor, sampling your own triplets against source text, is a week of annotation that reprices every downstream guarantee. Third, the layered-defense argument strengthens: an embedding-based check belongs in the stack not because it is the best detector on clean days but because its failure mode is uncorrelated with the infrastructure’s, which is the same diversification logic a desk applies to anything that matters.
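Measuring the noise floor can start small. The sketch below assumes you draw a random sample of triplets, have an annotator mark each one as supported or unsupported by the source text, and then put a confidence interval around the observed error rate. The Wilson score interval is my choice of estimator, not something the paper prescribes, and the function names, sample size, and error count are hypothetical.

```python
import math
import random


def sample_triplets(triplets: list, k: int, seed: int = 0) -> list:
    """Draw k triplets uniformly at random, without replacement, for annotation."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    return rng.sample(triplets, k)


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical audit: 200 sampled triplets, 30 judged unsupported by the filing text.
low, high = wilson_interval(errors=30, n=200)
print(f"observed error rate 15.0%, interval {low:.1%} to {high:.1%}")
```

The interval matters more than the point estimate: a detector validated under a benchmark noise level that sits below the lower bound has not been validated for your graph.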

The result also feeds the structure-versus-vectors ledger this blog has kept since the SEC-filings retrieval head-to-head. Graphs earn their complexity only where structure is informative, and now there is a second entry on the cost side: structure that can be wrong creates a hallucination channel that text-only pipelines do not have, plus a detector-degradation channel nobody was pricing. None of that forbids KG augmentation where relational queries genuinely dominate. It does mean the graph arrives with a maintenance contract. The desks that signed up for the demo rarely budgeted for the contract.

The note-sized action item: before the next KG-RAG pilot review, ask one question first: what is our triplet error rate? Then watch whether anyone in the room knows.

Hallucination detectors that trust the knowledge graph lose up to 84% of their power when the graph is noisy, while embedding methods lose 9%. Validate your detector under your infrastructure’s real noise, and keep one check whose failures are uncorrelated with the graph’s.
