Tim Frenzel

// Insight

Rethinking retrieval: vectors beat graph-walking on SEC filings

Tags: RAG, SEC-filings, benchmark

Every quarter someone proposes replacing the vector store with something cleverer: graph traversal, hierarchical node reasoning, retrieval that walks document structure the way a person would. This paper finally runs the comparison properly: vector-based agentic RAG against hierarchical node-based reasoning, on 1,200 SEC filings with a 150-question benchmark, reporting accuracy, latency, and the configuration details that decide real deployments. The vector-based system wins 68% of head-to-head comparisons at comparable latency, 5.2 seconds against 5.98.

The architectures represent genuinely different philosophies. The vector side is the mature stack this archive has documented all year: hybrid search, metadata filtering, agentic query handling on top of embeddings. The node-based side traverses document hierarchy without embeddings at all, reasoning its way down from section to subsection toward the answer, the approach that sounds more intelligent in a design review. Sounding intelligent is not a retrieval metric. Walking structure costs traversal steps, and every wrong turn at a high node prunes the right answer out of reach, while a good embedding plus reranker simply lands near the evidence and lets the reader sort it out.

Head-to-head win rates on 1,200 SEC filings (%)
Vector-agentic over node-based: 68
Small-to-big over baseline chunking: 65
Latency 5.2s vs 5.98s for the architectures; small-to-big adds only 0.2s.

The enhancement results inside the vector stack are the actionable half of the paper. Cross-encoder reranking delivers a 59% absolute improvement in MRR@5 at its optimal parameters, retrieving ten candidates and reranking to five. Small-to-big retrieval (embed and match small precise chunks, then hand the generator their larger parent sections) wins 65% of comparisons against baseline chunking at a cost of 0.2 seconds. Both numbers extend a pattern that FinanceBench started and FinSage confirmed: on filings, the wins come from workmanlike retrieval engineering rather than architectural revolution.

Small-to-big: match precisely, read broadly
Filing sections -> Small chunks indexed for precise matching -> Query hits small chunk -> Expand to parent section -> Generator reads full context
Precision in the match, breadth in the read, for 0.2 seconds of added latency.

The reranking numbers reward a closer look, because the parameters are the practical content. MRR@5 measures where the first genuinely relevant passage lands among the top five results, the metric that decides whether the generator sees the evidence at all. The optimal configuration retrieves ten candidates and reranks to five, which is a budget statement: cast a net wide enough that the embedding blur does not exclude the answer, then spend one cross-encoder pass sorting the catch. Wider nets help less than intuition suggests, since reranking thirty candidates mostly re-sorts noise at triple the rerank cost. Retrieve-ten-rerank-five is the kind of default a team can adopt verbatim and tune later, which is precisely what a measured benchmark is for.
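For readers who want the retrieve-ten-rerank-five default and the MRR@5 metric in concrete form, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's implementation: `first_stage` and `rerank_score` are hypothetical callables standing in for an embedding search and a cross-encoder, and the function names are mine.

```python
from typing import Callable, Sequence


def retrieve_then_rerank(
    query: str,
    first_stage: Callable[[str, int], list],      # hypothetical: cheap candidate search
    rerank_score: Callable[[str, str], float],    # hypothetical: cross-encoder style scorer
    retrieve_k: int = 10,
    keep_n: int = 5,
) -> list:
    """Fetch retrieve_k candidates cheaply, then keep the keep_n best by reranker score."""
    candidates = first_stage(query, retrieve_k)
    ranked = sorted(candidates, key=lambda doc: rerank_score(query, doc), reverse=True)
    return ranked[:keep_n]


def mrr_at_k(ranked_lists: Sequence[Sequence[str]], relevant: Sequence[set], k: int = 5) -> float:
    """Mean reciprocal rank of the first relevant item within the top k.

    A query with no relevant item in its top k contributes 0.
    """
    if not ranked_lists:
        raise ValueError("need at least one ranked list")
    total = 0.0
    for ranked, gold in zip(ranked_lists, relevant):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

The reranker is called once per candidate, so its cost grows linearly with `retrieve_k`. That is the arithmetic behind "triple the rerank cost" when the net widens from ten to thirty.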

Small-to-big deserves the spotlight because it resolves the chunking dilemma that has haunted filing QA from the start. Small chunks match precisely and orphan their context; the orphaned-number problem is the canonical filing failure. Large chunks preserve context and blur the match. Indexing small while reading big takes both ends of the trade, which is why it beats either fixed choice. It composes naturally with contextual chunking rather than competing with it: one restores meaning at indexing time, the other at read time.
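The mechanics of small-to-big fit in a few lines. The sketch below is a toy under loud assumptions: `Chunk`, `small_to_big`, and `token_overlap` are hypothetical names, and the token-overlap scorer is only a stand-in for embedding similarity so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    parent_id: str   # id of the larger section this chunk was cut from
    text: str


def token_overlap(query: str, text: str) -> float:
    """Toy stand-in for embedding similarity: fraction of query tokens found in text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0


def small_to_big(
    query: str,
    chunks: list,
    parents: dict,
    score: Callable[[str, str], float] = token_overlap,
    top_k: int = 3,
) -> list:
    """Match on small chunks, then return their parent sections without duplicates."""
    hits = sorted(chunks, key=lambda c: score(query, c.text), reverse=True)[:top_k]
    seen = set()
    expanded = []
    for hit in hits:
        if hit.parent_id not in seen:   # several hits may share one parent
            seen.add(hit.parent_id)
            expanded.append(parents[hit.parent_id])
    return expanded
```

De-duplicating parents matters in practice: several small hits often come from the same section, and the generator should read that section once.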

The deeper finding is that filings punish structure-walking because their structure is boilerplate: a hierarchy view cannot distinguish this year’s Item 7 from last year’s, while an embedding at least sees the words inside.

Node-based reasoning spends its budget navigating a tree whose shape is nearly identical across every 10-K ever filed, exactly the near-duplicate environment where hierarchical evidence curation had to add aggressive filtering to survive. Structure helps when structure is informative. SEC standardization made it deliberately uninformative.

The result also clarifies what the agentic layer is for, since both architectures here had one. Agentic query handling on top of vectors won; agentic reasoning instead of vectors lost. The distinction matters because the two get conflated under one buzzword. Decomposing a compound question, reformulating slang into filing language, choosing metadata filters: those are query-side tasks where a model adds precision before retrieval happens. Replacing the retrieval mechanism itself with stepwise reasoning subtracts the one thing embeddings are unreasonably good at: finding semantic needles in a half-million-chunk haystack in milliseconds. Intelligence belongs at the edges of retrieval rather than in the middle of it.

A filings stack with measurement behind every layer
Foundation: Hybrid search, Metadata filtering
Always on: Cross-encoder rerank, 10 to 5
Filings-specific: Small-to-big, Contextual chunking
Earned, not default: Agentic decomposition for compound queries
Each layer carries a measured win; the agentic tier joins only where queries demand it.

The economics split along a capex-opex line the latency numbers hide. The vector stack pays up front: embedding 1,200 filings is a modest one-time bill, while re-embedding the corpus on every embedding-model upgrade is the hidden subscription, the same recurring-index cost the contextual-retrieval analysis flagged for fast-moving corpora. Node-based traversal carries no index and pays per query instead, in traversal steps and model calls. At research-desk volumes the index amortizes within weeks, one more way the measured verdict favors vectors for this corpus. A shop with a tiny query budget over a huge, constantly churning document set faces a genuinely different calculation, which is the honest boundary of the finding.
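The amortization argument is simple arithmetic, and a desk can run it with its own numbers. The sketch below assumes a one-time index cost and flat per-query costs for each architecture. Every parameter is a placeholder to fill in from your own invoices; none of the values come from the paper.

```python
import math


def breakeven_queries(
    index_cost: float,
    vector_cost_per_query: float,
    traversal_cost_per_query: float,
) -> float:
    """Queries after which a one-time index is cheaper than per-query traversal.

    Returns math.inf when traversal is not more expensive per query,
    because the index then never pays for itself.
    """
    saving = traversal_cost_per_query - vector_cost_per_query
    if saving <= 0:
        return math.inf
    return index_cost / saving


def breakeven_days(
    index_cost: float,
    vector_cost_per_query: float,
    traversal_cost_per_query: float,
    queries_per_day: float,
) -> float:
    """Same break-even point, expressed in days at a given query volume."""
    if queries_per_day <= 0:
        raise ValueError("queries_per_day must be positive")
    return breakeven_queries(
        index_cost, vector_cost_per_query, traversal_cost_per_query
    ) / queries_per_day
```

To account for embedding-model upgrades, add the expected re-embedding bills over the planning horizon to `index_cost` before dividing. A low `queries_per_day` over a churning corpus is exactly the case where the break-even point drifts out of reach.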

The caveats are the benchmark’s own. A 150-question set over one corpus type is a strong start rather than a settled verdict; filings are the most standardized documents in finance, which means the anti-structure finding may soften on prospectuses or research notes where hierarchy carries real information. Cost per query is the dimension a desk should add when replicating, since a six-second pipeline that invokes a cross-encoder and an agent loop prices differently at ten queries a day than at ten thousand. None of those caveats rescue node-based traversal; they just bound how far the vector verdict travels.

The best property of this benchmark is how cheap it is to steal. Twelve hundred filings and 150 questions is weekend-scale infrastructure for any team that already runs a document pipeline: swap the public question set for the ones your analysts actually asked last quarter, then run the same head-to-head. The architecture decision stops being a taste argument. Most retrieval debates inside firms run on demos and conviction. A 150-question harness ends them.
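The harness itself is small. Here is a minimal sketch, assuming each system is a function from question to answer and that some `judge` (a hypothetical callable: an analyst, a rubric, or a model) returns "a", "b", or "tie" for each pair. The source does not specify how the paper judged wins, so the judging step here is my assumption.

```python
from collections import Counter
from typing import Callable, Iterable


def head_to_head(
    questions: Iterable[str],
    system_a: Callable[[str], str],
    system_b: Callable[[str], str],
    judge: Callable[[str, str, str], str],   # hypothetical: returns "a", "b", or "tie"
) -> dict:
    """Return the share of questions won by A, won by B, and tied."""
    tally = Counter()
    for question in questions:
        verdict = judge(question, system_a(question), system_b(question))
        if verdict not in {"a", "b", "tie"}:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        tally[verdict] += 1
    total = sum(tally.values())
    if total == 0:
        raise ValueError("no questions supplied")
    return {key: tally[key] / total for key in ("a", "b", "tie")}
```

Reporting ties separately keeps the win rate honest: a 68% win rate means something different when the remainder is losses than when it is mostly ties.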

For a desk choosing an architecture, this paper plus the agentic-RAG cost invoice from three weeks ago form a usable decision kit: vectors with hybrid search as the foundation, cross-encoder reranking always, small-to-big for filings, agentic query handling reserved for the compound questions that earn it. The five-to-six-second latency both architectures share is the honest banner number: fine for research, disqualifying for interactive products without caching and routing. None of this is the newest idea in retrieval. All of it is measured, which is the property a deployment decision actually needs.

On 1,200 SEC filings the boring stack wins: vectors with reranking beat node-based reasoning 68% head-to-head, and small-to-big buys the chunking trade-off out for 0.2 seconds. Evidence over elegance, again.
