Whitepaper · Updated April 2026 · 9 min read

Workflow or Agent? A Decision Framework Before You Architect Anything

Most production 'agents' are workflows that overshot. This paper distinguishes deterministic LLM pipelines from autonomous agents, names the four questions that decide which one to build, and covers the failure modes specific to each path. Includes the 'earned autonomy' principle for promoting workflows to agents only after instrumentation justifies it.

AI · Agents · LLM · Architecture · Workflow · Engineering Leadership · Decision Frameworks


Most of the production "agents" we review are workflows that overshot their brief. Autonomy is expensive, in three measurable ways: debug cost, observability cost, and the recurring drift cost of a system that can pick its own next step. Teams pay those bills because "agent" sounds more sophisticated than "pipeline with an LLM in it."

This paper covers the decision framework that separates workflow problems from agent problems, the failure modes specific to each path, and the "earned autonomy" principle that keeps teams from shipping an agent when a deterministic pipeline would have solved the problem at a tenth of the operating cost.

A slide deck version is available at /decks/workflow-or-agent/slides.pdf.

The definitions matter more than they seem to

Ask ten teams what an "agent" is and you will get ten answers. The distinction that matters for architecture is narrow and worth stating precisely.

A workflow is a deterministic pipeline with one or more LLM steps embedded in it. The sequence of steps is defined in code. The LLM's job is local: summarize, classify, extract, generate. Control flow stays with the programmer. If step 3 returns low confidence, step 4 is a predictable branch in the code, not a decision the LLM negotiates.
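That control-flow point can be made concrete. The sketch below is a minimal, hypothetical workflow: the model call is a stub, and the low-confidence branch is ordinary code the programmer wrote, not a decision the model makes.

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    fields: dict
    confidence: float

def llm_extract(document: str) -> Extraction:
    # Stand-in for a real model call; in production this would hit an LLM API.
    return Extraction(fields={"invoice_id": "A-1"}, confidence=0.62)

def process(document: str) -> dict:
    result = llm_extract(document)        # the LLM's job is local: extract fields
    if result.confidence < 0.8:           # a predictable branch in the code,
        return {"route": "human_review",  # not something the model negotiates
                "fields": result.fields}
    return {"route": "auto", "fields": result.fields}
```

Every path through `process` is visible in the source; the model can change the data, never the route.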

An agent is a system in which the LLM itself decides what to do next. It has access to tools. It invokes them in an order it chooses. It may call itself recursively, self-correct, or abandon a line of reasoning. Control flow sits with the model.
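The agent shape inverts that relationship. In this hypothetical sketch, `choose_action` stands in for the model: the loop only bounds the session, while the model picks each tool and decides when it is done.

```python
# Tool names and the choose_action stub are illustrative, not a real API.
TOOLS = {
    "search_logs": lambda state: state + ["searched logs"],
    "restart_service": lambda state: state + ["restarted service"],
}

def choose_action(state):
    # Stand-in for an LLM deciding its next step from the transcript so far.
    if "searched logs" not in state:
        return "search_logs"
    if "restarted service" not in state:
        return "restart_service"
    return "finish"

def run_agent(max_steps: int = 10):
    state = []
    for _ in range(max_steps):         # the loop only caps the session;
        action = choose_action(state)  # the model owns the control flow
        if action == "finish":
            break
        state = TOOLS[action](state)
    return state
```

Nothing in `run_agent` says which tool runs when, or how many steps the session takes; that is exactly what makes the path hard to predict and to debug.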

Both can be valuable. Only one is the right default.

The four questions that decide

When a team is debating "workflow or agent" for a new capability, the decision almost always comes out of answers to four questions. Score each honestly.

1. Can the steps be enumerated in advance?

If the solution path is knowable at design time, the workflow wins. Nearly all structured enterprise tasks have knowable solution paths: "extract three fields from this document and validate them against a schema" is not a problem that benefits from a model picking its own approach.

Agents earn their keep when the solution path is genuinely variable. An operations agent that triages incoming issues across twenty different system types, each with a different debugging flow, has a real reason to pick its own next step.

2. How high is the blast radius of a wrong next step?

Agents fail in a distinctive way: they do the wrong thing efficiently. A workflow that reaches a dead end throws an exception; an agent that reaches a dead end often tries a different tool, then another, burning tokens and sometimes writing to systems it should not have touched.

If the cost of a wrong next step is low and easily reversible, agent autonomy is cheap. If the blast radius includes writing to production systems, sending customer communications, or moving money, agent autonomy is a contingent liability priced in incidents.

3. How observable is the decision trace?

Workflows are observable by construction. The steps are in the code. Every LLM call is a bounded unit with a known input shape and known output shape. Debugging is a matter of reading logs.

Agents require separate observability investment before they can be debugged at production scale. Trace capture, tool-call replay, decision-point annotation, and policy-violation alerting all cost engineering effort that workflow systems do not need. Teams that ship agents without this layer discover, three months in, that they cannot tell why the system did what it did. At that point the only safe move is a rewrite.
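A minimal sketch of that trace layer, with illustrative field names: each tool call is recorded with enough context to replay it and to answer "why did the agent do X" after the fact.

```python
import time

def record_step(trace: list, session_id: str, tool: str, args: dict, result: str):
    # Append one decision-point record; a real system would ship this to a store.
    trace.append({
        "session": session_id,
        "ts": time.time(),
        "tool": tool,            # which tool the model picked
        "args": args,            # exact inputs, so the call can be replayed
        "result": result[:500],  # truncated output, enough to reconstruct intent
    })

trace = []
record_step(trace, "s-42", "search_logs", {"query": "timeout"}, "3 matches")
```

The point is not the schema; it is that this layer exists before the agent ships, so every production decision leaves a replayable record.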

4. What are the unit economics at full volume?

Agents cost more per decision, and the cost surface is harder to bound. A workflow can pre-compute most of the expensive reasoning; an agent often re-derives its context each step. At 100 requests per day the difference is noise; at 100,000 per day it is the budget.
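The arithmetic is worth doing explicitly. The per-decision costs below are invented to show the scaling effect, not measured; the shape of the result is what matters.

```python
# Illustrative numbers only: a bounded single-call workflow vs a multi-step
# agent that re-derives context each step.
workflow_cost = 0.002   # dollars per decision
agent_cost = 0.020      # dollars per decision

for daily_volume in (100, 100_000):
    monthly_gap = (agent_cost - workflow_cost) * daily_volume * 30
    print(f"{daily_volume:>7}/day -> ${monthly_gap:,.0f}/month difference")
```

At 100 requests per day the gap is about $54 a month; at 100,000 it is $54,000. Same ratio, completely different budget conversation.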

  • Workflow — steps known at design time. Deterministic pipeline with LLM steps for extraction, classification, summarization. Debug via logs. Scale predictably.
  • Agent — steps chosen at run time. Tool-using system where the LLM picks the next action. Debug via traces. Scale with variable cost.
  • Hybrid — workflow with one agentic step. Deterministic skeleton, one step where the LLM picks among a small set of tools. The answer for most real systems.
  • Default — workflow first. Upgrade to an agent only after instrumentation shows a workflow cannot solve the case.

The cost curve tells the truth

The practical argument for workflow-first is about cost, not ideology. Plot the all-in cost per decision (model tokens, tool invocations, retry overhead, incident response) against problem complexity and the two paths diverge sharply.

Architecture economics

All-in cost per decision, workflow vs agent, as problem complexity grows

Illustrative. Actual crossover depends on model choice, tool count, and observability maturity.

[Chart: cost index (workflow = 1 at low complexity) on the y-axis; problem complexity from low to very high on the x-axis. Two series: workflow (deterministic pipeline) and agent (autonomous tool use).]

At low and mid complexity, workflows are cheaper by 2-4x. The crossover sits toward the high-complexity end, where enumerating every path in code becomes its own cost.

Two takeaways follow from this chart. First, workflows dominate for most enterprise problems, which is the left half of the x-axis. Second, when you do genuinely need an agent, the cost is not unbounded; it is just higher and flatter, which is consistent with the shape of problems that actually benefit from autonomy.

The failure mode is a team that assumes it is on the right side of the crossover and builds an agent where a workflow would have held.

Failure modes by path

Each path has signature failure modes worth naming so leaders can plan around them.

Workflow failures

  • The LLM step swallows the error. Classification returns "other" on ambiguous input. "Other" then routes to a default handler that is fine 80 percent of the time and dangerous the other 20. Mitigation: threshold confidence per LLM step, route low-confidence to human review.
  • The pipeline ossifies. The workflow solves 92 percent of cases at launch, then plateaus. The remaining 8 percent are the interesting ones, and they cannot be solved without loosening the control flow. Mitigation: build a release valve. When confidence is low and retry count exceeds threshold, hand off to a narrow agentic path with a strict tool allowlist.
  • The prompt file becomes a product. Unversioned prompt changes drift the behavior of a system that otherwise has strong change control. Mitigation: prompts are code. They live in the repo, flow through the same CI as everything else, and run through the evaluation suite on every change.
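The first two mitigations above can be sketched together: a per-step confidence floor with bounded retries, and a release valve that hands persistent low-confidence cases to a narrow agentic path instead of a default handler. All names are hypothetical.

```python
CONFIDENCE_FLOOR = 0.8
MAX_RETRIES = 2

def handle(case, classify, agent_fallback):
    for _attempt in range(MAX_RETRIES + 1):
        label, confidence = classify(case)
        if confidence >= CONFIDENCE_FLOOR:
            return ("workflow", label)
    # Retries exhausted at low confidence: release valve into a bounded
    # agentic path, rather than routing "other" to a default handler.
    return ("agent", agent_fallback(case))

result = handle(
    "ambiguous ticket",
    classify=lambda c: ("other", 0.4),    # stub: always low confidence
    agent_fallback=lambda c: "escalated",
)
```

The confident path never touches the agent; the agent only ever sees cases the workflow has already proven it cannot handle.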

Agent failures

  • Observability deferred. Agents without trace capture and tool-call replay are impossible to debug. Teams that ship without it pay the bill later, often in a rewrite. Mitigation: build the observability layer before the agent, not after.
  • Unbounded retries. Agents can retry a failing tool forever. The token bill and the downstream system burden grow linearly with the failure rate. Mitigation: hard caps on steps, token budget, and wall-clock time per session. The agent hits the cap and escalates.
  • Policy drift. An agent's policy on what it will and will not do is usually written into a system prompt. System prompts are easy to change; auditing them is not. Mitigation: policy as code, separate from the prompt. Tool calls pass through a policy layer before the model touches them.
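The agent-side mitigations compose into one loop: hard caps per session, and a policy layer that sits between the model and its tools. This is a sketch under assumed names; the caps and allowlist contents are illustrative.

```python
MAX_STEPS = 8
TOKEN_BUDGET = 20_000
ALLOWLIST = {"read_metrics", "search_logs"}   # writes require human approval

def policy_check(tool: str) -> bool:
    # Policy as code, separate from the prompt: versioned and auditable.
    return tool in ALLOWLIST

def run_session(choose_next, tokens_used=0):
    for step in range(MAX_STEPS):
        tool, cost = choose_next(step)        # stub for the model's choice
        if tokens_used + cost > TOKEN_BUDGET:
            return "escalate: token budget"
        if not policy_check(tool):
            return f"escalate: policy blocked {tool}"
        tokens_used += cost
    return "escalate: step cap"               # hit the cap, escalate to a human
```

Every exit from the loop is an escalation, never a silent retry: the agent runs inside the box or it stops.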

The earned-autonomy principle

The pattern we recommend, repeatedly, to engineering leaders starting AI work: build a workflow first, earn the right to upgrade to an agent by instrumenting the workflow's failure surface.

This looks like:

  1. Ship the workflow. Accept that it will solve 85 to 95 percent of cases.
  2. Instrument the failures. Capture the inputs, the intermediate states, and the reason the pipeline returned low-confidence or fell through to a default.
  3. Look at the failure distribution. If it clusters (most failures are the same two or three patterns), solve those patterns in the workflow. If it is genuinely long-tail (every failure is different), that is evidence for agent autonomy on the next layer.
  4. If the evidence supports it, add a bounded agent layer, not a replacement. The agent handles the long tail; the workflow handles the rest. Tool allowlist is narrow. The agent's territory is explicit and small.
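Steps 2 and 3 above can be sketched as a simple clustering check over the captured failure records. The records and threshold here are illustrative; the decision rule is the point.

```python
from collections import Counter

# Hypothetical failure records captured by the instrumented workflow.
failures = [
    {"reason": "low_confidence", "doc_type": "invoice_scanned"},
    {"reason": "low_confidence", "doc_type": "invoice_scanned"},
    {"reason": "schema_mismatch", "doc_type": "receipt"},
    {"reason": "low_confidence", "doc_type": "invoice_scanned"},
]

clusters = Counter((f["reason"], f["doc_type"]) for f in failures)
top, count = clusters.most_common(1)[0]

if count / len(failures) > 0.5:
    verdict = f"clustered: fix {top} in the workflow"   # evidence against an agent
else:
    verdict = "long tail: evidence for a bounded agent layer"
```

Here three of four failures are the same pattern, so the answer is a workflow fix, not an agent. The agent case only opens when no single cluster dominates.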

Teams that follow this path ship faster, debug more easily, and end up with systems that are genuinely justified in the parts where they look agentic.

Teams that skip to "it is an agent" usually end up with a system that is expensive to operate, hard to change, and solving a problem a workflow would have solved.

Three worked examples

Document processing at mid-market volume. A team extracts ten fields from incoming PDFs at 50,000 documents per month. Workflow wins decisively. Fields are enumerable, blast radius is per-document, observability is trivial. If certain document types fail, route them to a narrow agent layer that can choose among three extraction strategies. The core pipeline stays deterministic.

Customer support triage. A team routes incoming tickets to one of eight queues. Workflow wins. The routing logic benefits from an LLM classifier, but the sequence is fixed: classify, score, route, log. Moving to an agent buys nothing except debug difficulty.

IT operations remediation. A team wants automated remediation of incidents across fifteen system types, each with its own debugging flow. Agent wins, cautiously. The domain is genuinely variable; no one will write a workflow that covers fifteen distinct playbooks. The mitigation is strict: tool allowlist, step caps, mandatory human approval for any write to production systems, and a long-running evaluation harness that grades the agent on traces.

What a leader can do this week

Three concrete moves:

  1. Take one in-flight "agent" project and re-scope it as a workflow. Walk through the four questions. If two or more of them point toward workflow, revise the architecture while the code is still cheap to change.

  2. For any agent already in production, audit the observability stack. If the team cannot answer "why did the agent do X at 2:47 PM yesterday," the observability layer is insufficient, and every subsequent debug cycle will be expensive.

  3. Name one workflow candidate for earned autonomy. Identify the failure cases that cluster together. If the cluster is coherent and the long tail is genuine, that is the next agent layer, and it is narrow by design.

If you want a second opinion on a specific workflow-vs-agent decision in flight, the AI & agents practice runs these exercises with engineering teams as a two-week architecture review. Decision framework, candidate re-scoring, and a bounded pilot design if the agent path is justified.


This paper is part of a series for engineering leaders putting AI into production. The other pieces cover evaluation before shipping, model selection and cost management, and the broader adoption sequence.


Rex Black, Inc.

Enterprise technology consulting · Dallas, Texas
