Tim Frenzel

// Insight

Agentic reasoning, unified: a map for deciding where agents belong

10 min read
Tags: agents, reasoning, survey, taxonomy

The agent literature has been arriving faster than anyone can file it. This archive has reviewed a year of it one paper at a time: memory systems, tool-use models, debating committees, observability layers. This survey, the work of 29 authors synthesizing the field, finally supplies the filing system. It organizes agentic reasoning into three hierarchical layers (foundational, self-evolving, and collective) and cuts across them with one load-bearing distinction: scaling interaction at inference time versus optimizing behavior in post-training. Taxonomies do not usually earn a long read. This one does, because it doubles as the decision tool a research desk has been missing: a principled way to say where an agentic loop adds value and where it adds only degrees of freedom.

The intellectual lineage runs back through one paper. ReAct, Yao and colleagues’ 2022 work, introduced the interleaving of reasoning traces and actions that every system in this survey elaborates: think, act, observe what came back, think again. Everything since, the memory editors, the tool orchestras, the agent committees, is that loop scaled along different axes. The survey’s contribution is naming the axes.
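For readers who have not seen the loop written down, here is a minimal sketch of think, act, observe. It is an illustration rather than the ReAct authors' implementation: `react_loop`, `scripted_policy`, and the `add` tool are hypothetical names, and the scripted policy stands in for a language model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a policy reads the transcript so far and returns
# (thought, action name, action argument). A real system would call a model here.
Step = Tuple[str, str, str]
Policy = Callable[[List[str]], Step]


def react_loop(policy: Policy,
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 5) -> Tuple[str, List[str]]:
    """Interleave thoughts, tool calls, and observations until 'finish'."""
    transcript: List[str] = []
    for _ in range(max_steps):
        thought, action, arg = policy(transcript)       # think
        transcript.append(f"thought: {thought}")
        if action == "finish":
            return arg, transcript
        observation = tools[action](arg)                # act
        transcript.append(f"action: {action}({arg})")
        transcript.append(f"observation: {observation}")  # observe
    return "", transcript  # step budget exhausted without an answer


def scripted_policy(transcript: List[str]) -> Step:
    """Stand-in for a model: call the tool once, then report what came back."""
    observations = [l for l in transcript if l.startswith("observation: ")]
    if not observations:
        return ("I need the sum first", "add", "2,3")
    return ("I have the sum", "finish", observations[-1].split(": ", 1)[1])


tools = {"add": lambda s: str(sum(int(x) for x in s.split(",")))}
answer, trace = react_loop(scripted_policy, tools)
```

The `max_steps` budget is the only control in this sketch; scaling that number is what "scaling the interaction" amounts to at its simplest.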

The three layers, with the year’s evidence attached

Foundational agentic reasoning covers the single agent in a stable world: planning, tool use, and search. This is the layer where 2025’s open-weights race played out. K2 Thinking’s 200-to-300 tool calls is a foundational-layer capability, sustained orchestration without losing the goal. So is the routing discipline this archive keeps endorsing, deterministic workflows for known intents with model-driven reasoning reserved for the genuinely open-ended. The survey’s framing sharpens why that discipline works: planning, tool use, and search are separable competencies, which means a system that hard-codes the separable parts spends its model budget where reasoning actually binds.
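The routing discipline can be sketched in a few lines. This is a minimal illustration under stated assumptions: the intent table, the keyword classifier, and the handler names are all hypothetical, and `model_reasoning` is a stub where a model-driven loop would sit.

```python
from typing import Callable, Dict


def lookup_price(query: str) -> str:
    # Known intent: a fixed, testable workflow with no model in the path.
    return "deterministic: price lookup"


def filing_summary(query: str) -> str:
    return "deterministic: filing summary"


def model_reasoning(query: str) -> str:
    # Stub for the expensive path: an agentic loop around a model.
    return "model: open-ended reasoning"


# Hypothetical intent table; a real desk would maintain its own.
WORKFLOWS: Dict[str, Callable[[str], str]] = {
    "price": lookup_price,
    "filing": filing_summary,
}


def classify_intent(query: str) -> str:
    """Keyword match as a stand-in for whatever intent classifier is in use."""
    lowered = query.lower()
    for intent in WORKFLOWS:
        if intent in lowered:
            return intent
    return "open_ended"


def route(query: str) -> str:
    """Send known intents to hard-coded workflows, everything else to the model."""
    handler = WORKFLOWS.get(classify_intent(query), model_reasoning)
    return handler(query)
```

The design choice is that the model path is the fallback, not the default, so model spend is confined to the queries no workflow covers.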

The foundational layer also carries the year’s least glamorous and most consequential engineering: the runtime. That means plans that survive restarts, checkpoints that make multi-hour trajectories resumable, and approval gates that pause execution for a human: the properties a stable orchestration layer finally shipped. It also means the neutral tool protocol that connects agents to systems without bespoke glue. The survey files these under infrastructure rather than reasoning, which is taxonomically correct and operationally backwards: in production, the runtime properties decide whether the reasoning properties ever get exercised. A desk evaluating foundational-layer claims should read the architecture papers second and the durability story first, because a brilliant plan that dies at hour three of a filing sweep is indistinguishable from no plan at all.

Self-evolving agentic reasoning covers agents that improve themselves through feedback, memory, and adaptation. The year’s memory thread lives here, Memory-R1’s learned bank operations and MemAct’s in-window curation, alongside the reflective-optimization line that GEPA represents: systems that read their own traces and revise their own machinery, a line whose academic anchor is Reflexion, the 2023 work that formalized self-reflection as verbal reinforcement learning. The survey’s placement makes a point the individual papers could not: these are one research program, the agent as student of its own transcript, whether the thing being revised is a memory store, a prompt, or a parameter vector.

Collective multi-agent reasoning covers coordination, communication, and shared knowledge. AlphaAgents’ debating analysts sit here, as does the engineered-opposition design school it contrasts with. The survey treats collective reasoning as the least mature layer, which matches the measured record: MAST’s 1,642-trace failure taxonomy found 44 percent of multi-agent failures rooted in specification rather than capability, while the DeepMind scaling experiments found coordination turning negative once solo baselines clear 45 percent. The evidence in this archive agrees: committee architectures demonstrably produce governance surface (logged dissent, auditable consensus), while the claim that they produce better decisions than one well-tooled agent remains, on current published evidence, a hypothesis with anecdotes.

Three layers, one year of evidence
Foundational: Planning; Tool use, K2-class; Search and routing
Self-evolving: Memory ops, Memory-R1, MemAct; Reflective revision, GEPA
Collective: Debate, AlphaAgents; Coordination; Shared memory
Cross-cutting: Inference-time orchestration vs post-training optimization
The survey's taxonomy, populated with the systems this archive reviewed in 2025.

The distinction that pays the rent

The survey’s sharpest contribution is separating two ways of making an agent better that the marketing language blurs into one. In-context, inference-time approaches scale the interaction: more steps, more tools, more retries, structured orchestration around a frozen model. Post-training approaches change the model: reinforcement learning or supervised fine-tuning that bakes agentic behavior into the weights. The distinction is operational. It prices differently on every axis a desk cares about.

Two ways to buy agentic capability
Inference-time orchestration: Frozen weights; Pay per step, every run; Change costs a prompt edit; Audit reads the trace
Post-training optimization: Behavior in the weights; Pay once in training; Change costs a training run; Audit needs eval suites
Most production decisions are choices along this axis, usually made implicitly.
Inference-time capability is rented and post-trained capability is owned, and most teams make that capital decision implicitly, one prompt hack at a time.

Orchestration is reversible, inspectable, and expensive at the margin: every run re-pays the token bill, the same arithmetic that made long-horizon trajectories a hardware question. Post-training is cheap at the margin and expensive to change, with its behavior visible only through evaluation rather than inspection, which is why the RLVR skepticism and the SFT-matches-PPO fine print matter before anyone budgets a training run. The survey’s framing gives the trade a name, which is the prerequisite for making it deliberately.
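The rent-versus-own trade reduces to a break-even count of runs. The sketch below is a back-of-envelope illustration, not a costing model: the function name and every dollar figure are hypothetical, and it ignores evaluation spend, retraining, and any quality difference between the two routes.

```python
import math


def break_even_runs(training_cost: float,
                    orchestration_cost_per_run: float,
                    post_trained_cost_per_run: float) -> float:
    """Runs needed before a one-off training spend beats paying per step.

    Returns infinity when the post-trained model is no cheaper per run,
    in which case the training spend never pays back on cost alone.
    """
    saving_per_run = orchestration_cost_per_run - post_trained_cost_per_run
    if saving_per_run <= 0:
        return math.inf
    return training_cost / saving_per_run


# Hypothetical numbers: a 50,000 training run, 2.50 per orchestrated run,
# 0.50 per run once the behavior lives in the weights.
runs = break_even_runs(50_000.0, 2.50, 0.50)  # 25000.0
```

If the workflow will not plausibly execute that many times before the task definition changes, renting is the cheaper mistake.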

What buying the post-training side actually involves needs a concrete paragraph, because the survey’s catalogue makes it look more settled than it is. Agentic post-training needs trajectories (full sequences of reasoning, tool calls, and outcomes), either harvested from a stronger model or generated and filtered against a reward. Reward design is the hard part, and finance should recognize it immediately: verifiable rewards (the task completed, the constraint satisfied) train honestly, while proxy rewards (a judge model’s approval, a backtest’s improvement) train whatever maximizes the proxy. The year’s RL results bracket the honest expectations. Outcome-driven training on a few hundred trajectories demonstrably reorganizes behavior, while the supervised baseline trained on identical data keeps embarrassing the fancier optimizers, which keeps the burden of proof on anyone proposing the expensive path.

The evaluation gap the survey inherits

One column is missing from the field this survey organizes, and naming it matters more for deployment than any taxonomy cell: reliability. The literature’s benchmarks overwhelmingly measure capability (can the agent complete the task once?), while production lives and dies on whether it completes the task every time. The instruments exist at the field’s edges. Tau-bench’s pass^k metric (read "pass-hat-k") scores the probability that an agent succeeds on all k repeated trials of the same task, the statistic that separates a four-in-five agent from a demo. GDPval’s blind pairwise grading prices deliverable quality against working professionals with the judge noise published. Neither style of measurement has propagated into the agentic-reasoning mainstream the survey maps, which means most of the architectures it catalogues have never been evaluated the way an operations committee would evaluate them.
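A repeated-trial score of this kind is cheap to compute in a house harness. The sketch below assumes the usual combinatorial estimator, C(c, k) / C(n, k) for c successes in n trials, which is my understanding of how the metric is computed rather than something the survey specifies; the function names are illustrative.

```python
from math import comb
from typing import List


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate P(all k independent trials of one task succeed).

    Uses the fraction of k-subsets of the observed trials that are
    all successes: C(successes, k) / C(trials, k).
    """
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(per_task_successes: List[int], trials: int, k: int) -> float:
    """Average the per-task estimate across a task suite."""
    scores = [pass_hat_k(c, trials, k) for c in per_task_successes]
    return sum(scores) / len(scores)


# An agent that succeeds on 4 of 5 trials of a task:
once = pass_hat_k(4, 5, 1)        # single-shot success rate
four_times = pass_hat_k(4, 5, 4)  # chance that four runs in a row all succeed
```

The same agent looks very different at k = 1 and k = 4, which is the gap between a capability number and a reliability number.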

The omission has a compounding consequence for the layers. A foundational agent at 90% per-step reliability across a 20-step plan completes its task less than an eighth of the time if failures are independent, which is why long-horizon claims need trajectory-level statistics rather than step-level ones. Self-evolving systems make reliability a moving target by design: the thing being measured changes under the measurement. Collective systems multiply the problem by the number of agents and then obscure it behind consensus. Until the field reports reliability with the same prominence as capability, the practitioner translation is unchanged from what this archive concluded reviewing the individual papers: run your own harness, on your own tasks, with repetition built in, before any agent graduates from pilot.
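The compounding arithmetic is worth checking by hand. This sketch assumes independent step failures, the same assumption the paragraph above states; the function names are illustrative.

```python
def trajectory_success(per_step_reliability: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent failures."""
    return per_step_reliability ** steps


def required_step_reliability(target: float, steps: int) -> float:
    """Per-step reliability needed to hit a target trajectory success rate."""
    return target ** (1.0 / steps)


p20 = trajectory_success(0.90, 20)            # about 0.12, under one in eight
needed = required_step_reliability(0.90, 20)  # roughly 0.995 per step
```

Read in the other direction, a 20-step plan that should finish nine times in ten needs each step to succeed roughly 99.5 percent of the time, which is a much harder claim than the headline step accuracy suggests.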

Where agentic loops belong in a research process

Here is the desk translation the survey stops short of, built on its own taxonomy. The question is never whether agents are good; it is which layer’s failure mode your workflow can afford.

Foundational loops, a single agent reading, computing, drafting under a tool whitelist, have a bounded failure mode: a wrong answer, caught by the validation stack the way any model output is caught. The cost-benefit is usually favorable wherever the task is genuinely open-ended. The controls are the ones this archive has catalogued: budgets, whitelists, observability from outside the process.

Self-evolving loops change the analysis, because the failure mode is drift: a system revising its own prompts, memory, or parameters against feedback is doing optimization, and optimization against historical financial data has a name in this business. An agent that iterates on its own strategy until the backtest improves is a p-hacking machine with excellent work ethic. Every degree of freedom the agent can adjust is a test against the same history, mostly unlogged, which is exactly the multiple-testing exposure that quant research spent two decades learning to charge for. Self-evolution belongs where the feedback is verifiable and exogenous (code that compiles, constraints that hold, rewards a checker can confirm), and wherever the feedback is a backtest it needs an out-of-sample embargo the agent cannot touch.
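The multiple-testing exposure can be priced with two standard formulas and a ledger. The sketch below is illustrative: it treats the agent's iterations as independent tests, which overstates the inflation when variants are correlated, and `BacktestBudget` is a hypothetical class, not an existing tool.

```python
from typing import List


def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """P(at least one false positive) across n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that holds the family-wise rate near alpha."""
    return alpha / n_tests


class BacktestBudget:
    """Hypothetical ledger: every backtest the agent runs is logged and counted."""

    def __init__(self, max_tests: int) -> None:
        self.max_tests = max_tests
        self.log: List[str] = []

    def charge(self, description: str) -> int:
        """Record one test against history; refuse once the budget is spent."""
        if len(self.log) >= self.max_tests:
            raise RuntimeError("test budget exhausted")
        self.log.append(description)
        return len(self.log)


# Twenty unlogged strategy revisions, each judged at the 5 percent level:
exposure = family_wise_error_rate(0.05, 20)  # about 0.64
per_test = bonferroni_threshold(0.05, 20)    # 0.0025
```

The ledger matters more than the correction: a threshold can only be adjusted for tests somebody counted.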

Collective loops inherit both failure modes and add correlation: agents sampled from the same base model agreeing with each other is confidence laundering, the committee that is secretly one analyst, and the same concern the LLM-market simulations scaled up to the systemic level. The governance surface is real; the diversification claim needs the disagreement statistics before it is believed.
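The disagreement statistics in question can start very simply. The sketch below computes mean pairwise agreement across a committee's votes and an effective committee size, n / (1 + (n - 1) * rho), which assumes equal-variance, equally correlated signals; both function names are illustrative and the formula is a textbook simplification, not something the survey proposes.

```python
from itertools import combinations
from typing import List


def pairwise_agreement(votes: List[List[int]]) -> float:
    """Mean fraction of items on which each pair of agents gives the same label.

    votes[a][i] is agent a's label on item i; all agents rate the same items.
    """
    rates = [
        sum(x == y for x, y in zip(a, b)) / len(a)
        for a, b in combinations(votes, 2)
    ]
    return sum(rates) / len(rates)


def effective_agents(n_agents: int, avg_correlation: float) -> float:
    """Number of independent analysts an equicorrelated committee is worth."""
    return n_agents / (1.0 + (n_agents - 1) * avg_correlation)


# Three agents voting on four items (1 = buy, 0 = pass):
agreement = pairwise_agreement([[1, 0, 1, 1], [1, 0, 1, 0], [1, 1, 1, 1]])

# Five agents whose signals correlate at 0.8 on average:
committee_size = effective_agents(5, 0.8)  # about 1.19
```

Five agents correlated at 0.8 are worth a little over one independent analyst, which is the committee that is secretly one analyst expressed as a number.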

Agentic loops in a quant research process, gated
Open-ended research question
Foundational agent: read, compute, draft under whitelist
Validation stack: screen, trace check, verification
Self-evolution allowed only against verifiable feedback
Backtest feedback: embargoed, test-budgeted
Human owns the hypothesis registry
The taxonomy maps to gates: each layer up requires a stronger control for its failure mode.

What the survey says is missing

The open challenges the authors list (personalization, extended interaction, world modeling, scalable multi-agent training, governance for real deployment) read like a roadmap of this archive’s complaints. Extended interaction is the context-rot problem wearing its academic name. Governance for deployment is the gap between an impressive trajectory and something a regulated firm can run, the territory of kernel-level supervision and neutral protocol stewardship. That the survey maps five application domains (science, robotics, healthcare, autonomous research, mathematics) without finance as a first-class domain is itself a finding: the field’s reference synthesis treats the highest-stakes deployment environment as an afterthought, which is either an opportunity or a warning, and in my experience those are usually the same thing.

The autonomous-research domain deserves the closing thought, because it is where the three layers compound. An agent that plans experiments (foundational), refines its methods from results (self-evolving), and coordinates with specialist agents (collective) is the survey’s end state. It is recognizably a quant research pod with the humans abstracted out. Everything this archive has documented says the abstraction fails at specific, predictable joints: the feedback loops that touch historical data, the consensus that launders correlation, the long horizons that rot context. The pods that work this decade will be the ones that put agents inside the layers where failure is verifiable and keep humans at the joints where it is not.

The bottom line

The survey will be cited for its taxonomy. The taxonomy earns it: three layers and one cross-cutting distinction organize a literature that badly needed organizing, with ReAct’s think-act-observe loop as the common ancestor. The desk value is using it backwards. Classify any proposed agent system by layer; its failure mode, its required controls, and its honest cost model follow almost mechanically: bounded errors and validation stacks at the foundational layer, optimization discipline and embargoed feedback for self-evolution, decorrelation evidence for collectives, plus the rent-versus-own capital decision underneath all of it. A field this young handing practitioners a working classification is rarer than it should be. Use it as a checklist on every agent proposal that crosses the desk this year, then let the next wave of papers fill in whichever layer your pilots prove out first. The filing system finally exists. What gets filed under it is now the practitioner half of the bargain.

Three layers (foundational, self-evolving, collective) and one distinction (orchestrate at inference or optimize in post-training): classify any agent system this way and its failure mode, controls, and true cost follow almost mechanically.
