Whitepaper, AI & Quality, ~14 min read
Most AI features ship because the demo worked, not because the team can say what "ready" means numerically. That is a tolerable problem in a prototype and a career-ending one in production. This paper covers the release-gate playbook we use with engineering teams putting LLM-backed features into customer-facing systems.
You will leave with a five-dimension evaluation framework, a recipe for a golden set small enough to actually maintain, the honest limits of LLM-as-judge, rollout mechanics with named exit criteria, and the regression suite that keeps a shipped feature from quietly rotting.
A slide deck version is available at /decks/ai-evaluation-before-shipping/slides.pdf.
Why "did the demo work" is not an evaluation
The traditional software release question is binary, and the binary is well defined. Does the code meet the specification? Does the test suite pass? Did the staging smoke test stay green? Production readiness is a checklist of measurable conditions, every one of which either passes or fails.
LLM-backed features break this model in a way that teams consistently underestimate. Three properties of generative systems are responsible:
- The same input produces different outputs across runs. Temperature zero narrows the range; it does not eliminate it. Model routing, context compaction, and tool-call ordering all introduce non-determinism that is invisible to the test runner.
- The output space is effectively infinite. There is no reference value to diff against. "Correct" is a judgment call on each response, and the judgments are expensive.
- The failure modes are open. Unit tests catch the bugs the author thought of. Production traffic catches the bugs the author did not.
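The first bullet is easy to verify empirically before trusting any exact-match test. A minimal sketch, with a simulated model standing in for a real provider call (`fake_model` is a stub, not a real client):

```python
import random
from collections import Counter

def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call. Even at temperature zero, model routing,
    # context compaction, and tool-call ordering can vary output across runs;
    # here the variance is simulated with a random choice.
    return random.choice([f"{prompt}!", f"{prompt}."])

def distinct_outputs(model, prompt: str, runs: int = 50) -> int:
    """Count distinct completions for one fixed prompt.

    A result greater than 1 means exact-match assertions on this
    feature will flake in CI."""
    return len(Counter(model(prompt) for _ in range(runs)))
```

Running this against a deterministic function returns 1; against anything with hidden non-determinism it does not, which is exactly why reference-value diffing stops working.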
The net result is that a team with a strong deterministic QA culture, once it starts shipping AI features, often reverts to the weakest possible release gate: "the demo worked." That is not a gate. That is a ceremony. The purpose of this paper is to replace it.
The five evaluation dimensions
Every evaluation suite that survives contact with production covers the same five axes. Score each dimension independently; do not average them into a single number. The point is to see when one dimension regresses even if the others improve.
The fifth dimension is regression stability: does a prompt change, a model upgrade, or a retrieval index refresh keep the scores above threshold on yesterday's traffic? This is the one teams skip at their peril. It is also the one we put the most instrumentation into.
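The "score independently, never average" rule is worth encoding directly in the harness. A sketch (the dimension names and point scale are whatever your suite defines; nothing here is prescribed by a specific tool):

```python
from dataclasses import dataclass

@dataclass
class EvalScores:
    """One score per evaluation dimension, 0-100, reported separately."""
    scores: dict  # dimension name -> score

    def regressions(self, baseline: "EvalScores") -> dict:
        """Per-dimension deltas against a baseline run.

        A single averaged number would hide a drop on one axis that is
        offset by a gain on another; per-dimension deltas cannot."""
        return {d: self.scores[d] - baseline.scores[d] for d in self.scores}
```

For example, a run that gains 15 points on latency while losing 5 on correctness averages out to "better" but regresses on the dimension that matters.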
The golden set: small, ruthless, and actually maintained
The dominant failure mode of AI evaluation programs is a "golden set" that is either too large to maintain or too small to generalize. Both failure modes reduce to the same outcome: the team stops looking at the numbers.
Build the golden set to be used, not to be comprehensive.
The size target that works. 60 to 150 examples per feature is enough for almost every mid-market use case. Below 60, the scores are too noisy to act on. Above 150, the maintenance cost suppresses updates, and the set gets stale. We have never seen a team regret shipping with 100 examples; we have seen many teams drown in a 2,000-example set they refuse to look at.
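The noise claim is just binomial sampling error, and you can sanity-check the size target with three lines:

```python
import math

def pass_rate_stderr(p: float, n: int) -> float:
    """Standard error of an observed pass rate p on an n-example set,
    in percentage points (binomial approximation)."""
    return 100 * math.sqrt(p * (1 - p) / n)
```

At a 90 percent pass rate, a 60-example set has a standard error of roughly 3.9 points and a 150-example set roughly 2.4, so below the lower bound a real 3-point regression is indistinguishable from noise.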
The composition that catches what matters. Build the set in four slices, not one.
The update rhythm. Every production bug that escapes the evaluation suite gets added to the known-regressions slice the week it is fixed. If the evaluation suite did not catch it, the evaluation suite was insufficient. This is the single most effective discipline we see in teams that ship AI features durably: the golden set grows exactly in proportion to the production surface area.
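The update rhythm is easy to automate. A sketch of the "escaped bug goes into the known-regressions slice" step, assuming a JSONL golden set; the field names are illustrative, not a required schema:

```python
import datetime
import json
from pathlib import Path

def add_known_regression(path: Path, bug_id: str, prompt: str,
                         bad_output: str, expected: str) -> None:
    """Append an escaped production bug to the known-regressions slice.

    Called the week the bug is fixed, so the golden set grows in
    proportion to the production surface area."""
    record = {
        "slice": "known-regressions",
        "bug_id": bug_id,
        "added": datetime.date.today().isoformat(),
        "input": prompt,
        "observed_failure": bad_output,
        "expected": expected,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Appending rather than editing keeps the set diff-reviewable: every new regression example shows up as one line in the PR.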
LLM-as-judge: useful, and dangerous
Once the golden set has more than about 40 examples, hand-scoring every run becomes a bottleneck. The natural move is to use a second LLM as the judge. This works well enough that we recommend it. It also fails in well-characterized ways that teams need to design around.
- Position bias. When two candidate answers are presented in order, judges systematically prefer the first one. Documented in multiple studies at double-digit percentage points. Mitigation: always score both orderings and average, or use a single-score rubric instead of pairwise.
- Length bias. Judges prefer longer answers independent of correctness. Mitigation: include "answer length is not a quality signal" explicitly in the rubric, and report length distribution alongside scores.
- Self-preference. When the judge and the answerer are the same model family, scores inflate by a measurable margin. Mitigation: use a different model for the judge than for the answerer wherever cost allows.
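For teams that do keep pairwise comparison, the position-bias mitigation is a one-liner. A sketch, assuming a judge function that returns the probability in [0, 1] that its first argument is the better answer:

```python
def debiased_pairwise(judge, a: str, b: str) -> float:
    """Score candidate `a` against `b` in both presentation orders
    and average, cancelling any preference for the first position."""
    return (judge(a, b) + (1 - judge(b, a))) / 2
```

A judge that blindly prefers whichever answer comes first scores every pair at exactly 0.5 under this scheme, which is the correct "no signal" result.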
The practical pattern that works: rubric-based single scoring, not pairwise comparison. Give the judge a numbered rubric (1 to 5 on correctness, 1 to 5 on groundedness, etc.) with explicit anchor descriptions for each score. Ask for the score and the justification. Store both. Periodically, hand-sample 10 percent of the judgments and spot-check for drift.
The judge is a noisy signal. It is still more useful than no signal. Treat it like a flaky test: act on trends, not single results.
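A minimal sketch of the rubric-based pattern; the rubric wording, score range, and JSON reply schema below are illustrative, not a standard:

```python
import json

RUBRIC = """Score the answer on two axes, 1-5 each, against the source documents.
correctness: 1 = contradicts the source; 3 = partly right; 5 = fully correct.
groundedness: 1 = claims unsupported by any source; 5 = every claim traceable.
Answer length is not a quality signal.
Reply as JSON: {"correctness": n, "groundedness": n, "justification": "<why>"}"""

def parse_judgment(raw: str) -> dict:
    """Parse and validate the judge's reply; store the score AND the
    justification, so periodic hand-sampling can spot-check for drift."""
    j = json.loads(raw)
    if not (1 <= j["correctness"] <= 5 and 1 <= j["groundedness"] <= 5):
        raise ValueError("score outside the rubric's 1-5 range")
    return j
```

Validating the range at parse time matters: a judge that drifts into returning 0 or 10 should fail loudly, not silently shift the trend line.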
The release-gate table: what must pass before production traffic
A release gate is a named boolean. The value of the gate is either "pass" or "fail" against an explicit threshold. Vague thresholds ("looks good") are not gates. The gates we use, per feature, look like this:
Gate design
Typical release-gate thresholds by dimension
Thresholds are per feature. Calibrate on the first production cohort, then lock.
Hallucination rate and latency are ceilings (lower is better); the rest are floors (higher is better). Values shown are defaults we use as starting points; every team should recalibrate within two weeks of first production cohort data.
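The floor/ceiling distinction is worth making explicit in code so nobody inverts a comparison during a late-night threshold tweak. A sketch; the gate names and threshold values below are illustrative placeholders, to be calibrated per feature as described above:

```python
from typing import NamedTuple

class Gate(NamedTuple):
    name: str
    threshold: float
    ceiling: bool  # True: value must be <= threshold (hallucination, latency)
                   # False: value must be >= threshold (correctness, etc.)

    def passes(self, value: float) -> bool:
        return value <= self.threshold if self.ceiling else value >= self.threshold

# Illustrative starting points only -- calibrate per feature, then lock.
GATES = [
    Gate("correctness", 90.0, ceiling=False),
    Gate("hallucination_rate", 2.0, ceiling=True),
    Gate("p95_latency_ms", 3000.0, ceiling=True),
]
```

Each `Gate` is exactly the "named boolean" the section opens with: a name, an explicit threshold, and a pass/fail answer.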
Two design rules:
Every gate has an owner. The evaluation lead owns correctness and groundedness. The ML lead owns hallucination rate. SRE owns latency. Security owns adversarial refusal. If the gate does not have a named owner, it will not have an answer when it fails.
Gates block merges, not deploys. Move the evaluation suite into CI so a failing gate prevents the PR from merging. Deploy-time gates are too late: by the time the build hits staging, the author has moved on to the next ticket, and the gate becomes a slog the team routes around.
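Wiring the suite into CI can be as simple as a script whose exit code is the gate verdict; mark it as a required status check and a failing gate physically cannot merge. A sketch, with a deliberately plain `(name, threshold, is_ceiling)` tuple schema (an assumption, not a standard format):

```python
import sys

def check_gates(results: dict, gates: list) -> int:
    """Return a process exit code: 0 if every gate passes, 1 otherwise.

    `gates` entries are (name, threshold, is_ceiling). Run this as the
    last CI step so a failure blocks the merge, not the deploy."""
    rc = 0
    for name, threshold, is_ceiling in gates:
        value = results[name]
        ok = value <= threshold if is_ceiling else value >= threshold
        if not ok:
            print(f"GATE FAILED: {name}={value} (threshold {threshold})",
                  file=sys.stderr)
            rc = 1
    return rc
```

Reporting every failed gate before exiting, rather than stopping at the first, keeps the failure message actionable for the PR author.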
Rollout mechanics: canary, 1%, 10%, 100%
Passing the release gate means the feature is ready to start a rollout, not finish one. Every AI feature rolls out in four stages; every stage has a named exit criterion; every criterion is observable before promotion.
Production rollout
Four-stage rollout with exit criteria
The percentages label the size of each gated traffic cohort, not dropout between stages; the right cohort sizes and stage durations depend on the feature.
The most common mistake at this stage is telescoping. A team, under pressure, skips the 1% cohort because the canary "looked good." Canary traffic is not production traffic. The canary is trained to recognize the feature's failure modes; real customers are not. Respect the stages.
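"Every stage has a named exit criterion" can be enforced structurally, so skipping a stage requires editing code rather than just flipping a flag under pressure. A sketch; the metric names and criteria in `STAGES` are illustrative stand-ins for your real release-gate table:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str
    traffic_pct: float
    min_days: int
    exit_criterion: Callable[[dict], bool]  # observed metrics -> promote?

# Illustrative criteria only; the real ones come from the feature's gates.
STAGES = [
    Stage("canary", 0.1, 2, lambda m: m["incidents"] == 0),
    Stage("one_percent", 1.0, 5, lambda m: m["complaint_rate"] < 0.01),
    Stage("ten_percent", 10.0, 7, lambda m: m["score_drop_pts"] <= 3),
    Stage("full", 100.0, 0, lambda m: True),
]

def next_stage(current: int, metrics: dict) -> int:
    """Promote only when the current stage's exit criterion holds."""
    if STAGES[current].exit_criterion(metrics):
        return min(current + 1, len(STAGES) - 1)
    return current
```

Because promotion is a function of observed metrics, "the canary looked good" is never a sufficient argument: the 1% cohort's own criterion still has to pass.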
Regression: the evaluation suite that catches drift
Once a feature ships, the question flips. "Does it still work?" replaces "Is it ready?" Four drift sources matter:
- Model drift. The provider updates the underlying model. Behavior changes, often subtly. Version pinning helps; version-pinned eval runs on a schedule help more.
- Context drift. Retrieval corpus grows, or its distribution shifts. Answers that were well grounded in January are hallucinated in April because the source document was archived.
- Prompt drift. The prompt file accumulates edits. Every edit should run through the eval suite before landing; most don't.
- Upstream drift. A tool the agent calls changes its schema or quota. Agent behavior degrades in ways the agent itself cannot diagnose.
The defense is a scheduled regression run. Every week, run the golden set on a known-good evaluation harness against the current production prompt and model, and post the scores to a dashboard. A drop of more than 3 points on any dimension is an incident. Not "a bug to schedule." An incident.
A single CI job runs the golden set against the production config on a schedule, posts the results to a Slack channel, and reports a single number per dimension. Twenty lines of code, one hour to set up, and it catches the slow-rot drift that every AI feature eventually experiences. If you have nothing else, ship this first.
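A minimal version of that job might look like the sketch below. The webhook URL is a placeholder for your own Slack incoming webhook, and the per-dimension score dict is an assumed interface to your harness:

```python
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/YOUR/WEBHOOK/HERE"  # placeholder
ALERT_DROP_PTS = 3  # a drop beyond this on any dimension is an incident

def report(scores: dict, baseline: dict, post: bool = True) -> list:
    """Compare this week's per-dimension scores to the stored baseline,
    post a summary to Slack, and return any dimensions that dropped
    past the incident threshold."""
    flagged = [d for d in scores if baseline[d] - scores[d] > ALERT_DROP_PTS]
    text = " | ".join(f"{d}: {scores[d]:.1f}" for d in sorted(scores))
    if flagged:
        text = f"INCIDENT on {flagged} -- {text}"
    if post:
        body = json.dumps({"text": text}).encode()
        urllib.request.urlopen(urllib.request.Request(
            SLACK_WEBHOOK, body, {"Content-Type": "application/json"}))
    return flagged
```

Schedule it weekly in any CI system's cron trigger; the only state it needs is the baseline scores from the last known-good run.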
What this changes about your QA org
The QA function does not disappear when the team starts shipping AI. It shifts shape. The work moves away from scripted acceptance testing and toward what QA has historically called exploratory testing, plus instrumentation, plus rubric design. A QA engineer who can author a 100-example golden set, specify a judge rubric, and own the release-gate thresholds is doing more valuable work than a QA engineer who is running the same regression suite a release automation tool could run.
For most engineering organizations, the cheapest way to get this capability is to retrain one or two existing QA engineers on the LLM evaluation stack rather than hire "AI QA." The methodology is familiar; the tooling is new. Rex Black, Inc. runs a focused two-week upskilling program for exactly this transition under the learning and upskilling line.
What a leader can do this week
Three concrete moves, in order:
- Write down the current release gate for your AI feature, in the same document as the release gates for your deterministic features. If the AI gate is one line long and the deterministic gate is thirty lines, that is the gap this paper is designed to close.
- Pick one feature and build a 60-example golden set for it. Do not try to cover every feature. Pick the one with the highest blast radius on a bad answer. Two engineering days is a realistic budget for the first set.
- Schedule the weekly regression run. Even if the golden set is incomplete. Even if the rubric is rough. Shipping the rhythm is more valuable than perfecting the artifacts, because the rhythm surfaces the gaps in the artifacts.
If you want a second pair of eyes on the evaluation design for a specific feature, the AI & agents practice and the test engineering practice run this exercise together as a focused two- to three-week engagement. Decision framework, release-gate spec, initial golden set, and the CI harness to run it live.
This paper is part of a series for engineering leaders putting AI into production. The other three pieces cover workflow vs agent architecture, model selection and cost management, and the broader adoption sequence that wraps them together.