
HiREC: when every 10-K looks the same to your retriever

June 14, 2025 · 6 min read

Tags: RAG, SEC-filings, multi-hop, retrieval

There is a failure mode in filing RAG that nobody warns you about until it embarrasses you in front of a client. You ask which company faces a particular litigation risk. The system confidently quotes a risk-factor paragraph, correctly worded, professionally formatted, and belonging to the wrong company. The paragraph it cited is near-identical to the one you wanted, because every issuer’s lawyers copy the same boilerplate. HiREC is built around exactly this problem. Standardized filings are full of near-duplicate text. A flat retriever cannot tell the right paragraph from its identical-looking twin in another filing.

Why filings break flat retrieval

Most retrieval treats the corpus as a flat pool of passages and pulls the chunks whose embeddings sit closest to the query. That works when passages are distinctive. Filings are the opposite. A 10-K is built from templates: the same risk-factor language, the same accounting-policy paragraphs, the same table structures, repeated across thousands of companies with only the numbers and names changed. To an embedding model, the risk-factor paragraph in one filing and its near-twin in another are almost the same point in space. The retriever pulls both, or pulls the wrong one. The duplication crowds out the passages that actually answer the question.
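To make the near-twin effect concrete, here is a minimal sketch under loud assumptions: the "embedding" is a toy bag-of-words count vector rather than a real embedding model, and the company names and paragraphs are invented for illustration. It only shows the shape of the problem, two templated paragraphs that differ by an issuer name and sit almost on top of each other.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented boilerplate: only the issuer name changes between filings.
TEMPLATE = ("{name} is subject to litigation that could result in substantial "
            "costs and could materially harm our business and results of operations")

passages = {
    "Acme 10-K, risk factors": TEMPLATE.format(name="Acme"),
    "Globex 10-K, risk factors": TEMPLATE.format(name="Globex"),
    "Acme 10-K, revenue": "Acme revenue increased due to higher unit sales in the period",
}

# How close are the two twins to each other?
twin_similarity = cosine(embed(passages["Acme 10-K, risk factors"]),
                         embed(passages["Globex 10-K, risk factors"]))

# Flat retrieval: score every passage in one pool against the query.
query = embed("What litigation risk does Acme face that could harm its business?")
scores = {source: cosine(query, embed(text)) for source, text in passages.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
gap = scores["Acme 10-K, risk factors"] - scores["Globex 10-K, risk factors"]

print(f"twin similarity: {twin_similarity:.2f}")
print(f"top two hits: {ranking[:2]}, score gap: {gap:.3f}")
```

In this toy the twins score about 0.96 against each other, and the wrong company's paragraph lands in second place, a hair behind the right one. A single token, the issuer name, is all that separates them.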

This is worse than ordinary retrieval noise. A near-duplicate is not obviously irrelevant. It reads as a perfect match, which is precisely why a model downstream will trust it and a reviewer skimming the citation will wave it through. The error hides inside a correct-looking answer, which is the most expensive kind of error in a regulated setting.

What HiREC does

HiREC attacks the problem with two structural ideas rather than a better embedding.

Figure: HiREC retrieves documents, then passages, then curates. Hierarchy first resolves which filing the question is about, so near-duplicate passages from other companies never enter the pool. Curation then trims to the evidence that matters and detects what is still missing, asking a follow-up query for it.

The first idea is hierarchical retrieval. Rather than search all passages at once, HiREC first identifies which documents are relevant, the right filing and the right company, then retrieves passages only from within those documents. The near-duplicate paragraph in some other company’s filing is never in the candidate pool, because that document was filtered out a level earlier. The hierarchy resolves the ambiguity that a flat search cannot.
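The two-stage shape can be sketched in a few lines. This is not HiREC's implementation: the token-overlap scorer, the `summary` field, and the `hierarchical_retrieve` function are all hypothetical stand-ins, and the filings are invented. The point is only that the passage stage never sees a document the first stage rejected.

```python
import re
from dataclasses import dataclass

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared tokens."""
    return len(tokens(query) & tokens(text))

@dataclass(frozen=True)
class Passage:
    doc_id: str
    text: str

@dataclass(frozen=True)
class Document:
    doc_id: str
    summary: str  # hypothetical document-level descriptor (issuer, form, year)
    passages: tuple[Passage, ...]

def hierarchical_retrieve(query: str, corpus: list[Document],
                          top_docs: int = 1, top_passages: int = 1) -> list[Passage]:
    # Stage 1: decide which filings the question is about.
    docs = sorted(corpus, key=lambda d: overlap(query, d.summary), reverse=True)[:top_docs]
    # Stage 2: rank passages, but only those inside the selected filings.
    pool = [p for d in docs for p in d.passages]
    return sorted(pool, key=lambda p: overlap(query, p.text), reverse=True)[:top_passages]

BOILERPLATE = "We are subject to litigation that could materially harm our business"
corpus = [
    Document("globex-2023", "Globex Inc 10-K fiscal year 2023", (
        Passage("globex-2023", BOILERPLATE),
    )),
    Document("acme-2023", "Acme Corp 10-K fiscal year 2023", (
        Passage("acme-2023", BOILERPLATE),
        Passage("acme-2023", "Revenue increased due to higher unit sales"),
    )),
]

hits = hierarchical_retrieve("What litigation risk does Acme report in its 2023 10-K?", corpus)
for hit in hits:
    print(hit.doc_id, "|", hit.text)
```

Both filings contain the identical paragraph, so a passage-level score alone could not separate them. The document stage does, using the issuer name in the descriptor, and the Globex twin never reaches the pool.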

The second idea is evidence curation. After retrieval, HiREC drops the passages that do not contribute and keeps the ones that do. When the curated evidence is missing a fact the question needs, the system generates a complementary query to go find it. This is what makes the approach work on multi-hop questions, the ones that need a fact from here and a fact from there before they can be answered. The system notices the gap and fills it rather than answering from incomplete evidence.
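Here is a minimal sketch of the curate-then-follow-up loop, with every component an assumption of mine. In a real system a model would presumably judge relevance and write the complementary query; below, `curate` is a keyword check, `toy_search` is a one-hit retriever over three invented passages, and the list of required facts is supplied by hand.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

@dataclass
class Curation:
    kept: list[str] = field(default_factory=list)
    dropped: list[str] = field(default_factory=list)
    missing: list[str] = field(default_factory=list)

def curate(required_facts: list[str], passages: list[str]) -> Curation:
    """Keep passages that mention a required fact; list facts nothing covers."""
    result = Curation()
    for passage in passages:
        relevant = any(fact in passage.lower() for fact in required_facts)
        (result.kept if relevant else result.dropped).append(passage)
    result.missing = [fact for fact in required_facts
                      if not any(fact in p.lower() for p in result.kept)]
    return result

def gather_evidence(question: str, scope: str, required_facts: list[str],
                    search: Callable[[str, list[str]], list[str]],
                    max_rounds: int = 3) -> tuple[list[str], list[str]]:
    evidence: list[str] = []
    queries: list[str] = []
    query = question
    for _ in range(max_rounds):
        queries.append(query)
        result = curate(required_facts, evidence + search(query, evidence))
        evidence = result.kept
        if not result.missing:
            break  # every required fact is covered
        # Complementary query: aim the next search at the gap.
        query = f"{scope} {' '.join(result.missing)}"
    return evidence, queries

STORE = [  # invented passages
    "Acme total revenue was 120 million in 2023.",
    "Acme net income was 12 million in 2023.",
    "Acme opened a new office in Austin.",
]

def toy_search(query: str, exclude: list[str]) -> list[str]:
    """Hypothetical retriever: the single best token-overlap hit not already held."""
    candidates = [p for p in STORE if p not in exclude]
    if not candidates:
        return []
    return [max(candidates, key=lambda p: len(tokens(query) & tokens(p)))]

evidence, queries = gather_evidence(
    "What was Acme's net margin in 2023?", "Acme 2023",
    ["revenue", "net income"], toy_search)
print(queries)
print(evidence)
```

The first search surfaces only the net-income passage, curation reports that revenue is missing, and the second query goes to fetch it. A system without the gap check would have answered from the first hit alone.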

The results, plus an efficiency surprise

The numbers come from LOFin, a benchmark the HiREC authors built specifically for this: 1,595 question-answer pairs over 145,897 SEC documents, including the multi-hop questions that single-passage benchmarks never test. Multi-hop is the fair test, because it is where naive RAG quietly fails: it retrieves enough for the first part of the question, stops, and never fetches the second fact, returning an answer that is half-right and fully confident.

Figure: LOFin answer accuracy (%). HiREC reaches 42.36% against 29.22% for the best baseline, a dense retriever, a gain of about 13 points over the second-best method. Page recall follows the same pattern, 45.35% against 34.78%.

HiREC answers 42.36% of the questions correctly, against 29.22% for the strongest baseline. Its page recall, the share of questions for which it surfaces the right page, is 45.35% against 34.78%. Those are not small margins on a hard benchmark. The pattern holds across answer types, widest exactly where filings hurt most: on numeric answers pulled from tables, HiREC scores 37.23% against the baseline’s 23.69%, because reading a number out of the right table depends entirely on having retrieved the right table rather than its twin.

The efficiency result is the one I did not expect. HiREC is not only more accurate, it is cheaper. It retrieves an average of only 3.7 passages per question, where the iterative baselines churn through far more. Its retrieval stage spends 4,291 input tokens against an iterative method’s 9,610, and its generation stage runs on a fraction of the tokens. Curation makes the system both better and cheaper, because the passages it declines to carry forward are passages it does not have to pay a model to read. The usual tradeoff, more accuracy for more compute, does not appear here. Trimming the near-duplicate noise improves the answer and shrinks the bill in the same move.
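For readers who want the margins spelled out, the arithmetic on the figures quoted in this section is short. The numbers are the ones reported above; the dictionary labels and variable names are mine.

```python
# (HiREC, baseline) percentages as quoted in the text above.
reported = {
    "answer accuracy, all questions": (42.36, 29.22),
    "page recall": (45.35, 34.78),
    "answer accuracy, numeric answers from tables": (37.23, 23.69),
}

gains = {}
for name, (hirec, baseline) in reported.items():
    gains[name] = hirec - baseline  # absolute gain in percentage points
    relative = gains[name] / baseline
    print(f"{name}: +{gains[name]:.2f} points ({relative:.0%} relative)")

# Retrieval-stage input tokens per question, HiREC vs. an iterative method.
hirec_tokens, iterative_tokens = 4291, 9610
token_ratio = hirec_tokens / iterative_tokens
print(f"retrieval input tokens: {token_ratio:.0%} of the iterative method's")
```

The absolute gain is 13.14 points overall and 13.54 points on numeric answers, and the retrieval stage uses a little under half the input tokens of the iterative method.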

The model-risk reading

For anyone running model-risk governance over a filing-QA system, the near-duplicate problem deserves a place on the risk register, because it produces the failure that audits are least likely to catch: a confident, well-formatted, wrong-sourced answer. HiREC’s design maps onto the controls you would want anyway. Hierarchical retrieval is a provenance control, since resolving the document first means every passage carries the filing and company it came from. A citation to the wrong issuer becomes structurally hard rather than merely unlikely. Evidence curation is an auditability control, because the system records which passages it kept, which it dropped, and what complementary query it issued to fill a gap. That trace is exactly what a reviewer needs to see why an answer was given.
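As a sketch of what that trace might look like in practice, the record below carries provenance on every passage and logs what was kept, dropped, and asked for. The class names, field names, and sample data are all hypothetical; the post does not describe HiREC's own logging format.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass(frozen=True)
class SourcedPassage:
    company: str   # provenance travels with the text
    filing: str
    page: int
    text: str

@dataclass
class RetrievalTrace:
    question: str
    resolved_filings: list[str]
    kept: list[SourcedPassage] = field(default_factory=list)
    dropped: list[SourcedPassage] = field(default_factory=list)
    complementary_queries: list[str] = field(default_factory=list)

    def wrong_issuer_citations(self, expected_company: str) -> list[SourcedPassage]:
        """Control check: any kept passage not from the expected issuer."""
        return [p for p in self.kept if p.company != expected_company]

    def to_json(self) -> str:
        """Serialize the full trace for a reviewer or an audit log."""
        return json.dumps(asdict(self), indent=2)

# Invented example data.
trace = RetrievalTrace(
    question="What was Acme's net margin in 2023?",
    resolved_filings=["Acme 10-K 2023"],
    kept=[SourcedPassage("Acme", "10-K 2023", 41, "Net income was 12 million.")],
    dropped=[SourcedPassage("Acme", "10-K 2023", 7, "Acme opened a new office.")],
    complementary_queries=["Acme 2023 revenue"],
)

violations = trace.wrong_issuer_citations("Acme")
print(trace.to_json())
```

A check like `wrong_issuer_citations` is cheap to run on every answer, and the serialized trace gives a reviewer the kept passages, the discards, and the follow-up queries in one place.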

The complementary-query mechanism is the part I would borrow regardless of the rest. A system that can notice it is missing a fact, and go get it, is categorically safer than one that answers from whatever it first retrieved. It is the retrieval-side version of a model that abstains when unsure rather than guessing. On multi-hop questions, where the usual failure is to stop after the first hop, that difference is most of the accuracy gap. It is also the difference between a system that knows what it does not know and one that does not.

The result is not that filing RAG is solved, since 42% is a research number rather than a production guarantee. The result is that the near-duplicate problem has a clean structural fix, one that happens to be cheaper than the methods it beats. For filings specifically, where the whole corpus is built from templates, retrieving the document before the passage is not an optimization. It is the thing that keeps you from citing the wrong company.

Standardized filings are full of near-identical boilerplate. A flat retriever cites the wrong company’s twin paragraph inside a correct-looking answer. HiREC resolves the document before the passage and curates the evidence, lifting LOFin accuracy to 42.36% from 29.22% while retrieving an average of only 3.7 passages per question. It is more accurate and cheaper, because curation is noise removal.
