A real AI agent program is graded on five levels and seven capability axes. Its return is a four-number conversation that a CFO will sit through. Most teams overrate themselves by 1.5 levels because they confuse a working demo (Level 2) with a system that runs (Level 4 or 5). This article is the canonical reference.
Read time: ~14 minutes. Written for the CIO, COO, and operating partner who has been told their team is "doing AI", and who is trying to figure out whether that's true.
Why a maturity model
Every conversation we have about AI agents in production starts the same way. The team has shipped a demo, the executive saw the demo, the demo was impressive, and now the org is told it is "doing AI." Six months later, the agent is in production for two narrow workflows, the bill is real, and nobody can answer the question that matters: is this working, and at what cost?
The maturity model exists because the gap between "we built a demo" and "we operate a system" is wider than most operators realize. The five levels below describe that gap. The seven capability axes describe what changes between levels. The four ROI numbers at the end are the only ones a CFO actually wants to see.
This is not aspirational. It is the picture of what we see when we walk into a real engagement, and the bar a defended program clears before we sign a recurring agreement.
The five levels
The ladder is forward-only. To claim a level, every capability axis at every prior level has to be operating in production, not just designed. The fastest way to fail an audit is to claim Level 4 because the eval harness exists, while the integration story is still a script on someone's laptop.
Level 1: Curiosity
The org has a workshop, a Slack channel called #ai, and one or two senior engineers who have shipped a Cursor or Copilot habit. There is no production AI yet. There is no policy.
What this looks like in the wild: a deck about "how AI could help us", a few internal experiments, a budget line item with no system to spend it on. Most Fortune 1000 teams are here as of mid-2026, regardless of what the press release said.
What's missing to leave Level 1: a chosen first agent (one job, one domain, one buyer of the outcome), a named owner, a baseline measurement of the manual process the agent will replace.
Level 2: Pilot
A working prototype is in front of friendly internal users. The agent answers, the demo lands, the executive is excited. Failures are tolerated because "it's a pilot."
What this looks like: an internal Slack bot, a wrapper around an LLM API, a manual eval ("I tried 20 prompts, 18 worked"), no SLA, no rollback plan, no observability beyond the LLM provider's dashboard. This is the level that gets written up as a win in the all-hands.
What's missing: a defended evaluation harness, decision-boundary policy, error budgets, an integration plan that survives real user load.
Level 3: Limited production
The agent serves a real, narrow customer-visible workflow. The team has at least one eval suite that runs on every change, observability that shows individual conversations, and a kill switch the on-call can flip.
What this looks like: an external-facing intake bot, a customer-support deflection layer for a single product line, an internal-facing analyst assistant. Outages are rare but loud. The team learns "agents fail differently than services."
What's missing: production-grade evaluation discipline (regression suites, capability tests, jailbreak coverage), a real human-in-the-loop design for the failure modes that matter, a documented governance posture.
Level 4: Operated production
Multiple agents in multiple workflows. Evaluation, observability, governance, and human-in-the-loop are formalized. The team can answer "what happens if this agent gets surprised on Tuesday at 3pm?" with a specific runbook, not a guess.
What this looks like: an internal AI platform team that owns the eval harness and the operating playbook, agents promoted and demoted by passing or failing capability tests, a quarterly governance review, named accountability for every agent-driven outcome.
What's missing: an operating model. Agents have to become a discipline the rest of the org can rely on, not a project the AI team owns alone.
Level 5: Operating model
AI agents are infrastructure. Every business unit has an agent program owner, every agent has an SLA, every agent has a CFO-readable ROI line in the operating budget. The org is no longer "doing AI"; AI is part of how the org works.
What this looks like: an org that adds an agent the way it adds a service, with a known cost-of-quality, a known cost-of-failure, and known unit economics. New agent ideas are graded against the same maturity bar.
This is the destination. Most public companies that talk about being here are at Level 3 with a slide deck.
The seven capability axes
The axes describe what's the same across every agent, regardless of domain. To advance a level, you advance on all seven, not just the easy ones. This is the part of the model that makes audits boring: there are seven columns, every column has to clear the bar, no column gets credit for a strong neighbor.
Maturity scoring
What clears the bar at each level, by capability axis
Illustrative. The shape is what matters: progress is broad, not deep on one axis. A team that's perfect on observability and weak on decision boundaries is not Level 4; it's Level 2 with a nice dashboard.
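If you want the grading rule as code, it is one line of arithmetic. A minimal sketch in Python, with illustrative axis names and scores (not a prescribed schema): the program's level is the minimum level cleared across the seven axes.

```python
# Minimal sketch of the grading rule: the weakest axis sets the grade.
# Axis names and the example scores are illustrative, not a prescribed schema.
AXES = [
    "decision_boundaries", "observability", "evaluation", "governance",
    "roi_math", "integration", "human_in_the_loop",
]

def program_level(axis_levels: dict[str, int]) -> int:
    """Overall maturity level is the minimum level cleared across all seven axes."""
    missing = [a for a in AXES if a not in axis_levels]
    if missing:
        raise ValueError(f"ungraded axes: {missing}")
    return min(axis_levels[a] for a in AXES)

# The "perfect on observability, weak on decision boundaries" team:
scores = {
    "decision_boundaries": 2, "observability": 5, "evaluation": 3,
    "governance": 3, "roi_math": 3, "integration": 3, "human_in_the_loop": 3,
}
print(program_level(scores))  # -> 2: Level 2 with a nice dashboard
```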
1. Decision boundaries
What this agent is allowed to decide on its own, what it has to escalate, what it must never touch. The boundaries are written down and enforced in code, not in a Notion page.
Failure mode at low maturity: the boundary is "use good judgment." This is how an agent ends up emailing a refund offer to a competitor's CFO.
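What "enforced in code" can look like: a minimal sketch, assuming a hypothetical pre-execution check on every action the agent proposes. The action kinds, thresholds, and domain allow-list are placeholders, not a recommended policy.

```python
# Hypothetical sketch of a decision boundary enforced in code, not in a doc.
# Action kinds, thresholds, and the allow-list below are placeholders.
from dataclasses import dataclass

@dataclass
class ProposedAction:
    kind: str              # e.g. "send_email", "issue_refund"
    amount_usd: float      # monetary exposure, 0 if none
    recipient_domain: str  # where the output is going

NEVER = {"sign_contract", "change_pricing"}   # the agent must never do these
ESCALATE_OVER_USD = 250.0                     # above this, a human decides
INTERNAL_DOMAINS = {"example-corp.com"}       # hypothetical allow-list

def decide(action: ProposedAction) -> str:
    """Return "allow", "escalate", or "block" before the action executes."""
    if action.kind in NEVER:
        return "block"
    if action.amount_usd > ESCALATE_OVER_USD:
        return "escalate"
    if action.kind == "send_email" and action.recipient_domain not in INTERNAL_DOMAINS:
        return "escalate"  # no refund offers leaving the building unreviewed
    return "allow"
```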
2. Observability
Per-conversation traces, per-tool-call timings, per-prompt versions, per-eval-run scores, all queryable. The on-call engineer can answer "why did the agent do that on Tuesday at 3pm" without re-running the conversation.
Failure mode: the only observability is the LLM provider's dashboard, which is missing the half-dozen tool calls and the in-house retrieval that produced the answer.
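One way to picture the trace record, as a sketch with illustrative field names; in practice the event maps onto whatever tracing backend the org already runs.

```python
# Illustrative sketch: one structured event per tool call, keyed by conversation
# and prompt version, emitted whether the call succeeds or fails.
import json
import time
import uuid

def record_tool_call(conversation_id: str, prompt_version: str, tool: str, fn, *args):
    """Run one tool call and emit a trace event with its timing and any error."""
    start = time.monotonic()
    error = None
    try:
        return fn(*args)
    except Exception as exc:
        error = repr(exc)
        raise
    finally:
        event = {
            "event_id": str(uuid.uuid4()),
            "conversation_id": conversation_id,
            "prompt_version": prompt_version,
            "tool": tool,
            "duration_ms": round((time.monotonic() - start) * 1000, 1),
            "error": error,
        }
        print(json.dumps(event))  # stand-in for shipping to the trace store
```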
3. Evaluation
A test suite that runs on every change, scoring the agent against a known set of capability tests, regression tests, and adversarial tests. Failures block the change. New capabilities ship with new tests, not with a Slack message saying "I tried it and it works."
Failure mode: manual evals. Manual evals don't catch regressions, don't survive team turnover, and don't scale past 50 prompts.
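A minimal sketch of the gate, assuming a callable agent(prompt) that returns a string and a JSON file of cases; the case format and the pass-rate floor are illustrative. The point is the exit code: a failing suite blocks the change in CI.

```python
# Illustrative eval gate. Assumes `agent(prompt) -> str` and a JSON case file;
# the case schema and 98% floor are placeholders, not a standard.
import json
import sys

def run_suite(agent, cases_path: str, pass_rate_floor: float = 0.98) -> bool:
    """Run every case; return True only if the pass rate clears the floor."""
    with open(cases_path) as f:
        cases = json.load(f)  # [{"prompt": ..., "expect": ..., "forbid": ...}, ...]
    failures = []
    for case in cases:
        answer = agent(case["prompt"]).lower()
        ok = True
        if "expect" in case:   # regression / capability case
            ok = ok and case["expect"].lower() in answer
        if "forbid" in case:   # adversarial / jailbreak case
            ok = ok and case["forbid"].lower() not in answer
        if not ok:
            failures.append(case["prompt"][:60])
    pass_rate = 1 - len(failures) / len(cases)
    print(f"{len(cases) - len(failures)}/{len(cases)} passed ({pass_rate:.1%})")
    return pass_rate >= pass_rate_floor

if __name__ == "__main__":
    from my_agent import agent  # hypothetical import: the callable under test
    sys.exit(0 if run_suite(agent, "eval_cases.json") else 1)
```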
4. Governance / accountability
A named human is on the hook for the agent's behavior. Not "the AI team", a person. The governance review is calendared. The audit trail exists in production, not in a one-time pen-test report.
Failure mode: AI ethics as an aspiration document instead of a hiring decision.
5. ROI math
The agent has a defended cost-and-benefit model with the four numbers in the next section. The model is updated on the same cadence as the financial close, not once at deploy.
Failure mode: "AI saves 30%." Saves what? Versus what baseline? Validated by whom?
6. Integration
The agent participates in the org's existing systems (auth, identity, RBAC, audit logs, change management) instead of bypassing them. Pulling out the agent does not orphan a workflow.
Failure mode: an agent with its own SQLite database, its own auth, and a Stripe key in its environment file.
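The alternative to the sidecar, sketched with hypothetical stand-ins for the org's existing token service and audit log; the function names are placeholders.

```python
# Hypothetical sketch: the agent's tool call rides the org's existing identity
# and audit path instead of carrying its own credentials.
from datetime import datetime, timezone

def get_scoped_token(service_account: str, scope: str) -> str:
    # Stand-in: in practice this asks the org's existing SSO / secrets service
    # for a short-lived, scoped credential.
    return f"demo-token:{service_account}:{scope}"

def audit(actor: str, action: str, target: str) -> None:
    # Stand-in for the org's existing audit log, the same trail humans write to.
    print(datetime.now(timezone.utc).isoformat(), actor, action, target)

def issue_refund(order_id: str, amount_usd: float) -> None:
    token = get_scoped_token("agent-refunds", scope="billing:refund")
    audit(actor="agent-refunds", action=f"refund.requested:{amount_usd:.2f}", target=order_id)
    # ...call the existing billing API with `token` here. No standalone payment
    # key in the agent's environment file, and the refund lands in the same
    # audit trail and change-management process as a human-initiated one.
```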
7. Human-in-the-loop design
The handoffs between agent and human are designed, not retrofitted. The human knows when the agent will escalate, what context they get on escalation, and how to override the agent's decision in a way that's recorded.
Failure mode: "the human reviews everything", which collapses to "nobody reviews anything within four weeks."
The four-number CFO ROI conversation
A defended ROI for an AI agent program comes down to four numbers. Anything beyond these four is decoration; anything less is a slide.
The conversation in the boardroom is "show me the four numbers, with the assumptions, with the audit trail." If the agent program can't answer all four, it's not at Level 3 yet, regardless of how good the demo was.
A worked example. A logistics customer was paying three full-time analysts, fully loaded, to enrich roughly 1,200 inbound leads per month. The "AI saves 30%" hand-wave from the vendor pitch became, after we sat with finance for two afternoons:
- Hours × rate. ~120 person-hours a week of enrichment, displaced. Loaded rate $95/hr. Annualized hours line: roughly $595K of analyst labor.
- Error-rate delta. The manual baseline let a bad (hallucinated) job title slip through roughly 1.2% of the time; the agent, with citation validation, let one through ~0.3% of the time. Revenue-at-risk per bad lead in their pipeline math came out to ~$3,400. Multiplied across 14,400 leads/year, the error-rate-delta line came to roughly $440K of avoided revenue exposure.
- Cycle-time delta. The bottleneck was downstream of the analysts; SDRs working those leads only had bandwidth for ~720 leads/month, so faster enrichment alone produced $0 of throughput improvement until they added one SDR. We told finance that, and recommended hiring the SDR. The CFO trusted the rest of the model more because we admitted that line was zero in year one.
- Ramp time × cost of ramp. The team was on a hiring plan that required two more analysts in the next 12 months, each ramping over six weeks at a fully loaded cost of ~$26K. With the agent absorbing the additional volume, both hires were avoided: a ~$52K line, plus the recruiting and management overhead that was easier to defend qualitatively.
The total wasn't the point. The point was that finance now had a closed-loop model they could update at the next quarterly close. That is what a Level 4 program looks like.
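For reference, the four lines reduce to a handful of inputs finance can re-close each quarter. A minimal sketch using the de-identified figures from the example above; the variable names are ours, the numbers are the ones already stated.

```python
# The four-number model from the worked example, as the sheet finance re-runs
# at each quarterly close. Inputs mirror the de-identified figures above.
hours_per_week     = 120      # displaced analyst enrichment hours
loaded_rate        = 95       # $/hr, fully loaded
leads_per_year     = 14_400   # 1,200 inbound leads per month
error_rate_manual  = 0.012    # bad job titles slipping through, manual baseline
error_rate_agent   = 0.003    # same measure with citation validation
revenue_at_risk    = 3_400    # $ exposure per bad lead in their pipeline math
hires_avoided      = 2
ramp_cost_per_hire = 26_000   # six-week ramp, fully loaded

hours_line      = hours_per_week * 52 * loaded_rate
error_line      = (error_rate_manual - error_rate_agent) * leads_per_year * revenue_at_risk
cycle_time_line = 0            # downstream SDR capacity was the constraint in year one
ramp_line       = hires_avoided * ramp_cost_per_hire

for name, value in [("hours x rate", hours_line), ("error-rate delta", error_line),
                    ("cycle-time delta", cycle_time_line), ("ramp avoided", ramp_line)]:
    print(f"{name:>18}: ${value:,.0f}")
# hours x rate ~$592,800; error-rate delta ~$440,640; cycle-time $0; ramp $52,000
```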
Annualized contribution by line
The four-number ROI line, illustrative, annualized
Numbers are de-identified from the worked example above. The shape (large hours line, meaningful error-rate-delta line, near-zero cycle-time line until the downstream constraint moves, modest ramp-avoided line) is what we see across most engagements.
How most programs grade out
Across the engagements where we've graded a program against the seven axes, the median is somewhere between Level 2 and Level 3. Self-assessment usually puts the same program at Level 4. The gap is consistent and direction-of-error is one-sided: optimistic.
The reasons are predictable:
- Decision boundaries are usually in a Notion page, not in code. The team mistakes "we wrote it down" for "we enforce it." The first agent that escalates the wrong issue, in the wrong direction, exposes the gap.
- Observability stops at the LLM provider's dashboard. The half-dozen tool calls between the user request and the model response are invisible. Debugging takes hours that were supposed to be saved.
- Evaluation is a folder of prompts a senior engineer maintains by hand. There is no regression suite. There is no adversarial suite. New behavior ships because someone tried it.
- Governance is "the AI team handles it", which falls apart the first time the AI team is on PTO and a customer-impacting incident lands.
- ROI math is "we estimate 30% savings", populated once at deploy, never closed against actuals.
- Integration is a sidecar with its own credentials. Removing the agent orphans a workflow because the agent is the workflow.
- Human-in-the-loop is "every output is reviewed", which silently collapses to "no output is reviewed" within four weeks of go-live.
A team can make real progress on any one of these in a sprint. Maturity is when all seven move at once, on the cadence the operating model expects, with the evidence the next audit will demand.
How a maturity assessment is run
The AI Agent Scoping Assessment is where this article leads. It is a 2-week, fixed-fee engagement that produces:
- A current-level grade across the seven capability axes (with evidence, not vibes).
- A defended next-level definition for the agent in question, with cost.
- The four-number ROI model populated against your actual workflow, with the assumption sheet attached.
- A 90-day operating plan and the option to engage Rex Black to execute it. No obligation; the deliverable stands on its own.
We use this on engagements where the operating partner already knows the team is overrating itself and wants the picture an outside hand will produce. The assessment is the picture.
Key takeaways
- The ladder is forward-only. A Level 4 program has every Level 1-3 capability operating, not just designed. Skipping is how you end up with a Level 2 program telling its board it's at Level 4.
- Move on all seven axes at once. A team that's strong on observability and weak on decision boundaries is not Level 4. It is Level 2 with a nice dashboard.
- Manual evals don't survive a quarter. Build the regression and adversarial suites before you tell the executive the agent is "in production."
- Governance is a hiring decision. AI ethics as an aspiration document does not protect anyone. A named human, on the hook, with a calendared review, does.
- Defended ROI is four numbers. Hours × loaded rate, error-rate delta, cycle-time delta, ramp-time avoided. If you can't answer all four with the assumption sheet attached, the program is not at Level 3 yet.
- Integration through, not around, your existing systems. Sidecar agents with their own credentials are how a deploy turns into a security incident a year later.
- Honest grading is the cheapest thing on this page. Most teams overrate themselves by 1.5 levels. The first hour spent grading against the seven axes is the highest-yield hour the program will ever buy.
Want to grade your program against the seven axes with our team? Schedule a scoping conversation and we will walk through the assessment in detail.