// Insight

Building the agentic enterprise: a field guide

June 7, 2026 · 20 min read

agents · enterprise-AI · architecture · production · field-guide

Everyone is building agents. Most of them will stall after the demo. Gartner expects more than 40% of agentic AI projects to be canceled by the end of 2027, citing escalating cost, unclear business value, and weak risk controls. The technology is not the problem. The gap is everything around the model.

Enterprise posture on agentic AI, early 2025 (% of 3,412 surveyed)

In a January 2025 Gartner poll, most of the 3,412 respondents were still hedging on agents.

An agent behaves less like a feature you ship and more like a system you operate.

The working demo is the cheap part. You can stand one up in an afternoon, watch it answer the obvious questions, and win the room. The obvious questions are a small minority of what real users send. The majority is edge cases, ambiguous phrasing, and requests that cross two domains at once. That is where the agent earns trust or loses it. That is the part the demo never shows.

A quant will recognize the shape of this. A backtest is cheap. A live book is not. The interesting risk shows up in production: slippage, regime change, the trade that looked great on paper and bleeds in the market. The model is rarely the thing that fails in production; the operating discipline around it is. This is the field guide I would hand a team before they greenlight an agent: the architecture it sits in, the foundations to get right before launch, then the loop that keeps it honest after.

What does the architecture actually look like?

Strip away the vendor branding and most serious agent platforms converge on the same four concerns, with a fifth that spans all of them.

The agentic enterprise stack

Four capability layers, with a trust layer spanning every one of them.

The engagement layer is where a person meets the agent: a chat panel, a messaging app, a field embedded in software people already use. It carries no intelligence of its own. Its job is to capture intent cleanly and return an answer in the surface where the work happens.

The reasoning layer is where the agent thinks. It plans, decides which tool to call, routes between a fast path and a deliberate one, and decides when it is finished. This is also where the agent is built, monitored, and orchestrated. Most of the engineering you will do lives here.

The system-of-work layer is the set of business applications where actions actually land: resolving a support case, processing a return, updating a pipeline, posting a journal entry. An agent that can talk but cannot act is a chatbot with better manners. The value comes from closing the loop into real systems.

The context layer grounds the agent in reality. It supplies the data, the retrieval, the cross-session memory, plus the metadata that tells the agent which entity, period, and policy a request belongs to. Wiring this layer to live systems without bespoke glue for every source is exactly the problem that open integration standards such as the Model Context Protocol set out to solve.

Spanning all four is the trust layer: model access across more than one provider, the guardrails on both sides of the model, the data controls, plus the observability that lets you see what the agent did and why. A request travels down through these layers and an answer travels back up.

How one request moves through the stack

Known intents take a fixed path; only open-ended requests reach the reasoning loop.

What a layered stack buys you is reuse: a team goes from idea to working agent without rebuilding reasoning, data access, business actions, and trust controls from scratch each time. Having the architecture is necessary. It is nowhere near sufficient. The hard part starts once the agent meets real users.

Why do most enterprise agents fail?

Agents built on language models are flexible by design. They interpret a wide range of inputs and decide what to do in the moment. That flexibility has a price. A language model is probabilistic. The same question can produce a different sequence of steps each time it is asked. In a casual setting that variance is harmless. In a workflow that moves money or makes a promise to a customer, inconsistency is the whole risk.

The deeper reason agents fail is an inversion of where the work sits. In traditional software, roughly 90% of the effort happens before launch. You gather requirements, design the system, build it, and test it against cases you can enumerate. After go-live you are mostly in maintenance. Agents flip that ratio.

Share of total effort that lands after go-live (%)

A practitioner estimate: with agents, most real work starts after go-live.

With agents the ratio inverts, and most of the work begins the day you go live.

The figure above is a practitioner estimate rather than a measured constant, yet every team that has run agents in production recognizes it. You launch. Real users arrive. They send the requests you never imagined. Your team has to learn a genuinely new craft: reading agent transcripts, working out why the agent made a wrong call, then updating instructions, tools, and data sources in response. None of that was in the demo.

Teams fail because they bring the traditional-software playbook to a problem that does not fit it. They treat launch as the finish line. It is the starting line. The ones that succeed do two things deliberately. They get the pre-launch foundations right so iteration is fast. They budget most of their effort for the period after go-live.

There is a design conclusion buried in this. A model predicts the next plausible response rather than executing fixed logic, which makes it powerful for reasoning and natural conversation and unreliable for anything that must be identical every time. The durable pattern is to combine both: deterministic workflows set the guardrails and the non-negotiable steps, while probabilistic reasoning adds adaptability on top. Anthropic draws the same line in its guidance on building effective agents, separating workflows, where models and tools follow predefined code paths, from agents, where the model directs its own process. Most production systems need both, and knowing which parts belong on which side is the core skill.

What has to be right before you ship?

If most of the work is post-launch, the goal of pre-launch is modest and specific. You are not trying to build the perfect agent. You are building an agent you can iterate on quickly and safely. Three foundations make that possible: the right scope, a real measure of success, plus a trust layer on both sides of the model.

Scope it small

The temptation is to aim big. Resist it. Pick a use case that is high value and genuinely achievable, and start there.

Pick the first use case

Start where value and achievability are both high; defer or avoid the rest.

The matrix sorts candidate use cases on two axes: business value across the columns, achievability down the rows. The bright cell is the one to start in: high value and achievable today. An easy but low-value task, such as a basic FAQ bot, earns you little. A high-value but hard task, such as multi-step claims processing, is a phase-two problem that needs more maturity first. A task that is both low value and hard is one to avoid outright.

There are two reasons to begin in that bright corner. Agent capability is still moving fast, so anything elaborate you overbuild now you may rebuild in six months as models and tooling improve. And the craft of operating agents is best learned on a small surface, where a mistake is cheap and the feedback loop is short. Once a team has shipped one agent, measured it, and learned the iteration cycle, the second use case goes far faster.

OpenAI gives the same advice in its practical guide to building agents: start with a single agent, validate with real users, and add capability only when it pays for itself.

Tie the agent to a number

A common failure across deployments is shipping an agent with no definition of success. Without a KPI tied to a real outcome, you cannot tell a working agent from a drifting one. Activity is not the metric. Conversations handled, messages sent, tokens burned: none of that tells you whether the agent did the work.

Pick a KPI that measures completed work. For a support agent the natural one is containment: the share of cases the agent fully resolves with no human follow-up. A user asks how to reset a password. The agent answers clearly. The user solves the problem and never returns for the same thing. That case is contained. If the user comes back the next day with the same question, the agent failed, whatever the transcript looked like. The discipline is to define the unit of useful work and count how often the agent completes it, then to hold every later decision against that number.
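To make the definition concrete, here is a minimal sketch of a containment calculation. The `Case` fields, the `topic` label, and the seven-day repeat window are all assumptions for illustration; your own definition of a repeat contact may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Case:
    user_id: str
    topic: str           # hypothetical intent label, e.g. "password_reset"
    opened_at: datetime
    escalated: bool      # True if a human had to take over


def containment_rate(cases: list[Case], window: timedelta = timedelta(days=7)) -> float:
    """Share of cases with no escalation and no repeat contact inside the window."""
    if not cases:
        return 0.0
    contained = 0
    for case in cases:
        # A repeat is a later case from the same user on the same topic.
        repeat = any(
            other.user_id == case.user_id
            and other.topic == case.topic
            and case.opened_at < other.opened_at <= case.opened_at + window
            for other in cases
        )
        if not case.escalated and not repeat:
            contained += 1
    return contained / len(cases)


day = datetime(2026, 6, 1)
cases = [
    Case("u1", "password_reset", day, escalated=False),                      # contained
    Case("u2", "password_reset", day, escalated=False),                      # user came back
    Case("u2", "password_reset", day + timedelta(days=1), escalated=False),  # contained
    Case("u3", "refund", day, escalated=True),                               # handed to a human
]
print(containment_rate(cases))  # 0.5
```

The point of the repeat check is that a clean-looking transcript still counts as a miss when the same user returns with the same problem.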

The reason this matters beyond reporting is that the KPI drives your iteration. When you review transcripts and decide what to fix first, the KPI tells you what counts. A clumsy tone is annoying. A logic error that tanks containment is urgent. The same instinct underlies how serious agent benchmarks score reliability: tau-bench measures whether an agent succeeds across repeated trials of the same task rather than once on a good day. An agent that resolves a task four times in five is a different production proposition from one that resolves it once and fails the reruns.
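A repeated-trial score of this kind can be sketched in a few lines. The estimator below is the fraction of size-k subsets of the observed trials in which every trial succeeded; treat it as my reading of a pass^k-style metric, not as the benchmark's own code. The function name `pass_hat_k` is hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Fraction of size-k subsets of the observed trials that are all successes."""
    if not 0 < k <= trials or not 0 <= successes <= trials:
        raise ValueError("need 0 < k <= trials and 0 <= successes <= trials")
    return comb(successes, k) / comb(trials, k)


print(pass_hat_k(4, 5, 1))  # 0.8
print(pass_hat_k(4, 5, 2))  # 0.6
print(pass_hat_k(1, 5, 2))  # 0.0
```

An agent that succeeds four times in five still holds a 0.6 score when two consecutive successes are required; one that succeeds once in five drops to zero.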

Build trust on both sides of the model

Your agent sits between users and your data with a model in the middle, and data moves in both directions. Queries pull sensitive information toward the model. Responses flow back out and can trigger real actions. Each direction has its own failure mode, so each needs its own checks.

Guardrails sit on both sides of the model

Queries pull data in, responses fire actions out; each side needs checks.

Input guardrails protect the data on the way in. The core three are secure retrieval, zero data retention, then a trusted boundary, in that order.

Input guardrails, in sequence

Masking comes last and is optional; it can strip context the agent needs.

Secure retrieval means you control exactly what reaches the prompt. Rather than handing the model raw database access, you route every request through a layer that returns only what the agent is permitted to see. Zero data retention is a contractual guarantee from the model provider that your prompts and responses are not stored, viewed, or used to train future models. Without it, your customer data can end up embedded in a model that serves someone else. A trusted boundary goes one step further: for the most sensitive workloads you route to a provider-hosted model that sits inside your own platform’s trust boundary, keeping the data off the public internet entirely.
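Secure retrieval can be as simple as an allow-list applied before anything reaches the prompt. This is a minimal sketch under stated assumptions: the role names, the field names, and the `secure_fetch` helper are hypothetical, and a real deployment would enforce permissions in the data platform and not in application code alone.

```python
# Hypothetical allow-list: which fields each agent role may see.
ALLOWED_FIELDS = {
    "support_agent": {"order_id", "status", "expected_date"},
    "billing_agent": {"order_id", "status", "amount_due"},
}


def secure_fetch(record: dict, role: str) -> dict:
    """Return only the fields the given role is permitted to see."""
    allowed = ALLOWED_FIELDS.get(role, set())  # unknown roles see nothing
    return {key: value for key, value in record.items() if key in allowed}


record = {
    "order_id": "A-1001",
    "status": "shipped",
    "expected_date": "2026-06-12",
    "amount_due": 0,
    "card_last4": "4242",
}
print(secure_fetch(record, "support_agent"))
# {'order_id': 'A-1001', 'status': 'shipped', 'expected_date': '2026-06-12'}
```

Defaulting an unknown role to an empty set means a misconfigured agent fails closed.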

Masking deserves a word of caution, because it is the one input control that can backfire. It catches sensitive values before they reach the model and replaces them with placeholders, which sounds like an unambiguous good. The catch is that masking can strip out the very context the agent needs. Ask an agent to find accounts similar to a reference account, then mask that account’s details. Now the agent has lost the very information it needed to find the match. Masking is a legitimate control where the redacted fields play no part in reasoning. It is a poor default for agents that depend on rich context, which is most of them.
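The similar-accounts failure is easy to reproduce in miniature. In this sketch the masking policy, the field names, and the toy similarity rule are all assumptions chosen to show the effect.

```python
SENSITIVE_FIELDS = {"industry", "region"}  # hypothetical masking policy


def mask(record: dict) -> dict:
    """Swap sensitive values for placeholders before they reach the model."""
    return {
        key: f"<{key.upper()}>" if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }


def similar_accounts(reference: dict, accounts: list[dict]) -> list[str]:
    """Toy similarity: same industry and same region as the reference."""
    return [
        account["id"]
        for account in accounts
        if account["industry"] == reference["industry"]
        and account["region"] == reference["region"]
    ]


accounts = [
    {"id": "acct-2", "industry": "logistics", "region": "midwest"},
    {"id": "acct-3", "industry": "retail", "region": "midwest"},
]
reference = {"id": "acct-1", "industry": "logistics", "region": "midwest"}

print(similar_accounts(reference, accounts))        # ['acct-2']
print(similar_accounts(mask(reference), accounts))  # []
```

Once the reference account is masked, the fields the match depends on are placeholders, so nothing matches.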

Output guardrails protect the user from a bad answer, even when the inputs were clean.

Output guardrails, in sequence

Validation blocks invented actions before they fire, not only bad text.

Tool and sub-agent validation checks that the agent is not inventing actions. If it decides to call a refund handler that does not exist, the system catches that and blocks it before anything fires. Grounding checks verify the agent is answering from your sources rather than its general training. An agent told to answer from your help content returns only what those documents support. This is the same job that lightweight uncertainty-quantification heads do from the inside, reading a model’s own signals to flag a likely fabrication and trigger an abstain-or-escalate path. Content filtering screens the output for harmful or off-brand material before a person ever sees it.
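Tool validation is the most mechanical of these checks, so it is worth sketching. The registry, the `lookup_order` stub, and the `BlockedToolCall` exception are hypothetical names; the idea is only that a proposed call is checked against a known list before anything executes.

```python
from typing import Callable


def lookup_order(order_id: str) -> dict:
    """Stub for a real system-of-work call."""
    return {"order_id": order_id, "status": "shipped"}


# Hypothetical registry: the only tools the agent may call, with required arguments.
TOOL_REGISTRY: dict[str, Callable[..., dict]] = {"lookup_order": lookup_order}
REQUIRED_ARGS: dict[str, set[str]] = {"lookup_order": {"order_id"}}


class BlockedToolCall(Exception):
    """Raised when the model proposes a call that fails validation."""


def execute_tool_call(name: str, args: dict) -> dict:
    """Run a model-proposed tool call only if the tool and its arguments are known."""
    if name not in TOOL_REGISTRY:
        raise BlockedToolCall(f"unknown tool: {name}")
    if set(args) != REQUIRED_ARGS[name]:
        raise BlockedToolCall(f"bad arguments for {name}: {sorted(args)}")
    return TOOL_REGISTRY[name](**args)


print(execute_tool_call("lookup_order", {"order_id": "A-1001"}))
try:
    execute_tool_call("issue_refund", {"order_id": "A-1001"})  # invented by the model
except BlockedToolCall as err:
    print("blocked:", err)  # blocked: unknown tool: issue_refund
```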

Neither guardrail layer is sufficient on its own.

Masking protects data on the way in and does nothing about a hallucinated tool call on the way out. Output validation catches a bad answer and does nothing to stop sensitive data reaching the model. Serious deployments run both. For anyone who has lived inside model-risk governance, this is familiar ground: controls on the inputs, controls on the outputs, then an independent check before a number reaches a decision.

What happens after you go live?

You have scoped the use case, defined a KPI, and built the trust layer. The agent is live. This is where the real work begins.

In traditional software, testing is close to binary. Unit tests and integration tests pass or they fail. Agents fail in fuzzier ways. Users ask things you never anticipated. The tone drifts off-brand. A retrieved document turns out to be stale. The agent reaches the right answer from the wrong source, which will bite you the next time that source says something different. You cannot unit-test your way out of this. You need a feedback loop. It has to be fast.

The operations loop that gates scaling

Four failure classes, four owners; loop speed, not model size, gates scaling.

The loop has four triage categories, and each has a different fix.

Tone and brand. The agent’s voice does not match the company’s. This shows up most with customer-facing agents where consistency matters. The fix lives in the system prompt and instructions: adjust the voice, add examples of preferred phrasing, then re-test against recent transcripts.

Logic errors. The agent calls the wrong tool, reasons poorly, or takes too many steps to get there, which surfaces as slow or wrong responses. Start with the tool configurations and instructions. If the same error keeps recurring, that flow is a candidate to move out of the reasoning loop and into deterministic code.

Data quality. The agent gives a wrong answer because the source was wrong rather than because it hallucinated. An agent grounded in a large library of help articles will, sooner or later, surface an outdated or contradictory document. The fix is not in the agent. It is routing the issue back to whoever owns the content, to correct or retire it. This is the unglamorous half of the work that the RAG field guide keeps returning to: retrieval quality is the silent ceiling on the whole system.

Coverage gaps. Users ask for things the agent was never built to handle. This is inevitable, and adoption only widens it. The fix is either to expand scope deliberately or to build a clean escalation to a human with full context, so customers never start over. Either way, log the gap and watch coverage grow over time.
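The four categories above map naturally onto a routing table. The sketch below assumes a reviewer has already tagged each transcript with one category; the owner names and record fields are hypothetical and will differ by organization.

```python
# Hypothetical owners for the four triage categories; adjust to your org.
OWNERS = {
    "tone": "prompt owner",
    "logic": "agent engineering",
    "data_quality": "content owner",
    "coverage_gap": "product owner",
}


def route(reviewed: list[dict]) -> dict[str, list[str]]:
    """Group reviewed transcript ids by the owner who has to fix them."""
    queues: dict[str, list[str]] = {owner: [] for owner in OWNERS.values()}
    for item in reviewed:
        owner = OWNERS[item["category"]]  # KeyError on an unknown category is deliberate
        queues[owner].append(item["transcript_id"])
    return queues


reviewed = [
    {"transcript_id": "t-101", "category": "data_quality"},
    {"transcript_id": "t-102", "category": "logic"},
    {"transcript_id": "t-103", "category": "data_quality"},
]
queues = route(reviewed)
print(queues["content owner"])      # ['t-101', 't-103']
print(queues["agent engineering"])  # ['t-102']
```

Keeping the queues separate makes the loop's speed visible per owner, which is where a slow loop usually hides.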

The speed of this loop is what gates scaling.

Across real deployments, the teams that could triage and fix quickly built confidence in their numbers and earned approval to expand. Teams with a slow loop stayed stuck in pilot mode, however good the original demo was. The loop, more than the model, is the asset.

Which mistakes show up again and again?

A feedback loop catches problems after they happen. Some problems are better prevented. Three anti-patterns recur across deployments, each easy to fall into and each hard to spot from transcripts alone, because each shows up as degraded accuracy or latency rather than an obvious error.

Reasoning where code would do

Not every decision needs to pass through the model. When a customer asks where their order is, the correct sequence is fixed: look up the order, get its status, get the shipment, format the reply. Routing that through the reasoning loop means a round trip to the model before each call, and each hop adds latency plus a fresh chance to pick the wrong tool.

Where is my order: two ways to answer

A known intent should fire a fixed sequence rather than a per-step reasoning loop.

The fix is to encode the predictable parts as control flow and keep the model for the parts that genuinely need it, such as understanding an ambiguous request or writing the final response. This is precisely the workflow-versus-agent boundary Anthropic describes. It is also why code-writing agents are a natural fit for any task where the action is literally a known procedure. If you can draw the logic as a flowchart, write it as code and keep the model for the ambiguous parts.
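Here is a minimal sketch of that split for the order-status case. The in-memory `ORDERS` and `SHIPMENTS` tables stand in for real systems, and `reasoning_loop` is a hypothetical callable wrapping whatever model call you use; intent classification is assumed to have happened upstream.

```python
from typing import Callable, Optional

# Stand-ins for real system-of-work lookups (hypothetical data).
ORDERS = {"A-1001": {"status": "shipped", "shipment_id": "S-77"}}
SHIPMENTS = {"S-77": {"carrier": "ExampleShip", "eta": "2026-06-12"}}


def order_status_workflow(order_id: str) -> str:
    """Fixed sequence: order, status, shipment, formatted reply. No model calls."""
    order = ORDERS.get(order_id)
    if order is None:
        return f"I could not find order {order_id}."
    shipment = SHIPMENTS[order["shipment_id"]]
    return (
        f"Order {order_id} is {order['status']}, "
        f"arriving {shipment['eta']} via {shipment['carrier']}."
    )


def handle(
    message: str,
    intent: str,
    order_id: Optional[str],
    reasoning_loop: Callable[[str], str],
) -> str:
    """Known intents take the fixed path; everything else goes to the model."""
    if intent == "order_status" and order_id:
        return order_status_workflow(order_id)
    return reasoning_loop(message)


def fake_loop(message: str) -> str:
    """Stub for the open-ended reasoning path."""
    return f"[model handles: {message}]"


print(handle("Where is my order A-1001?", "order_status", "A-1001", fake_loop))
print(handle("Can I change my plan and my address?", "other", None, fake_loop))
```

The fixed path makes zero round trips to the model, so it has no chance to pick the wrong tool.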

Prompting harder instead of encoding policy

This one is subtle, because it feels like good prompt engineering. The agent does something wrong. You add a forceful instruction. NEVER do X. ALWAYS do Y. Capitals, bold, exclamation points. The agent still gets it wrong. You add more emphasis. The system prompt slowly turns into a wall of shouting.

Enforcing a business rule

Emphasis is not enforcement; a hard rule belongs in deterministic code.

The reason it fails is that a model does not weight your capital letters the way a human reader would. Emphasis is not enforcement. What works is to encode the business rule as an explicit, structured policy. If your firm does not operate in a particular state, you do not want the model inferring that from a strongly worded sentence. You want a conditional that says: if the customer is in that state, return this exact response. No judgment, no variance, the same behavior every call. A controlled study on whether reinforcement learning even adds new reasoning, rather than resampling what the base model already does, points the same way: models mostly reweight what they already contain, which means a rule you must hold every time does not belong in a probabilistic path.
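The conditional described above is short enough to write out. The state code, the reply text, and the `model_reply` callable are placeholders; what matters is that the rule runs before the model is consulted at all.

```python
from typing import Callable

UNSERVED_STATES = {"NY"}  # hypothetical: states the firm does not operate in
UNSERVED_REPLY = "We are not able to offer this service in your state."


def respond(customer_state: str, message: str, model_reply: Callable[[str], str]) -> str:
    """Apply the hard business rule in code, then fall through to the model."""
    if customer_state.strip().upper() in UNSERVED_STATES:
        return UNSERVED_REPLY  # identical on every call, no model involved
    return model_reply(message)


def stub_model(message: str) -> str:
    """Stand-in for a real model call."""
    return f"[model answers: {message}]"


print(respond("ny", "Can I open an account?", stub_model))
print(respond("CA", "Can I open an account?", stub_model))
```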

Dumping everything into context

The third anti-pattern hurts accuracy and latency at once. Many teams start by passing full, unfiltered tool responses into the model’s context. An order-lookup call can return a hundred thousand tokens by default, most of it fields the agent will never use.

Tokens fed to the agent per tool call (thousands)

Return only the fields the agent needs and trim the rest.

This causes two problems. The model has to reason over a far larger input, which slows everything down. And the noise makes it less accurate, because the relevant fact is buried among hundreds of irrelevant fields and the agent is more likely to seize the wrong one. The fix is to right-size the context: return only the fields the agent actually needs (an order id, a status, an expected date, a tracking number) and trim the rest. The same logic applies to documents. Loading an entire policy file to answer one question is the document-scale version of the same mistake; retrieve the relevant section instead. Anthropic frames this as the central tension in context engineering: find the smallest set of high-signal tokens that gets the job done, because attention degrades as the context grows and piling on tokens makes the answer worse. Less context, faster answers, higher accuracy. The numbers above illustrate a single call. The principle holds across the whole system.
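Right-sizing a tool response is usually a small projection step between the tool and the model. In this sketch the raw payload and the `trim_tool_response` helper are hypothetical, and serialized length is used as a rough proxy for token count, which is an approximation.

```python
import json

# The four fields named above; everything else is noise for this intent.
NEEDED_FIELDS = ("order_id", "status", "expected_date", "tracking_number")


def trim_tool_response(raw: dict, fields: tuple[str, ...] = NEEDED_FIELDS) -> dict:
    """Keep only the fields the agent needs; silently skip any that are absent."""
    return {name: raw[name] for name in fields if name in raw}


# Hypothetical raw payload with bulky fields the agent never uses.
raw = {
    "order_id": "A-1001",
    "status": "shipped",
    "expected_date": "2026-06-12",
    "tracking_number": "TRK123",
    "audit_log": [{"event": "updated", "seq": i} for i in range(500)],
    "warehouse_notes": "internal " * 200,
}

trimmed = trim_tool_response(raw)
print(trimmed)
print(len(json.dumps(raw)), "characters before,", len(json.dumps(trimmed)), "after")
```

The projection belongs in the tool wrapper, not in the prompt, so the model never sees the discarded fields.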

The bottom line

The architecture is the easy half. It is largely a solved problem: an engagement surface, a reasoning layer, a system of work, a context layer, plus a trust layer spanning all of them. The hard half is operational. It is where the cancellations come from. Scope the first agent small and aim it at a use case that is valuable and achievable now. Tie it to a KPI that measures completed work, so progress is distinguishable from drift. Wrap the model in guardrails on both sides, because the input risk and the output risk are different risks. Then accept that launch is the start. The speed of your feedback loop will decide whether the agent scales or stalls. Keep the deterministic parts deterministic, encode your policies as code, and feed the model only what it needs.

None of this is exotic. It is the same discipline a quant desk already lives by: a model is allowed to be probabilistic, the controls around it are not, and operating the system long after the backtest looked good is the real job. The teams that internalize that ship agents that last. The teams that chase the demo join the 40%.

An agent is not a feature you ship, it is a system you operate: scope it small, tie it to a KPI, wrap it in guardrails on both sides, and budget most of your effort for the day after go-live.
