Whitepaper, AI & Quality, ~14 min read
Most AI features ship because the demo worked, not because the team can say what "ready" means numerically. That is a tolerable problem in a prototype and a career-ending one in production. This paper covers the release-gate playbook we use with engineering teams putting LLM-backed features into customer-facing systems.
You will leave with a five-dimension evaluation framework, a recipe for a golden set small enough to actually maintain, the honest limits of LLM-as-judge, rollout mechanics with named exit criteria, and the regression suite that keeps a shipped feature from quietly rotting.
A slide deck version is available at /decks/ai-evaluation-before-shipping/slides.pdf.
Why "did the demo work" is not an evaluation
The traditional software release question is binary, and the binary is well defined. Does the code meet the specification? Does the test suite pass? Did the staging smoke test stay green? Production readiness is a checklist of measurable conditions, every one of which either passes or fails.
LLM-backed features break this model in a way that teams consistently underestimate. Three properties of generative systems are responsible:
- The same input produces different outputs across runs. Temperature zero narrows the range; it does not eliminate it. Model routing, context compaction, and tool-call ordering all introduce non-determinism that is invisible to the test runner.
- The output space is effectively infinite. There is no reference value to diff against. "Correct" is a judgment call on each response, and the judgments are expensive.
- The failure modes are open. Unit tests catch the bugs the author thought of. Production traffic catches the bugs the author did not.
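The first bullet is easy to verify empirically before trusting any exact-match test. A minimal sketch, with a simulated model standing in for a real provider call (`fake_model` is a stub, not a real client):

```python
import random
from collections import Counter

def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call. Even at temperature zero, model routing,
    # context compaction, and tool-call ordering can vary output across runs;
    # here the variance is simulated with a random choice.
    return random.choice([f"{prompt}!", f"{prompt}."])

def distinct_outputs(model, prompt: str, runs: int = 50) -> int:
    """Count distinct completions for one fixed prompt.

    A result greater than 1 means exact-match assertions on this
    feature will flake in CI."""
    return len(Counter(model(prompt) for _ in range(runs)))
```

Running this against a deterministic function returns 1; against anything with hidden non-determinism it does not, which is exactly why reference-value diffing stops working.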
The net result is that a team with a strong deterministic QA culture, once it starts shipping AI features, often reverts to the weakest possible release gate: "the demo worked." That is not a gate. That is a ceremony. The purpose of this paper is to replace it.
The five evaluation dimensions
Every evaluation suite that survives contact with production covers the same five axes. Score each dimension independently; do not average them into a single number. The point is to see when one dimension regresses even if the others improve.
The fifth dimension is regression stability: does a prompt change, a model upgrade, or a retrieval index refresh keep the scores above threshold on yesterday's traffic? This is the one teams skip at their peril. It is also the one we put the most instrumentation into.
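The "score independently, never average" rule is worth encoding directly in the harness. A sketch (the dimension names and point scale are whatever your suite defines; nothing here is prescribed by a specific tool):

```python
from dataclasses import dataclass

@dataclass
class EvalScores:
    """One score per evaluation dimension, 0-100, reported separately."""
    scores: dict  # dimension name -> score

    def regressions(self, baseline: "EvalScores") -> dict:
        """Per-dimension deltas against a baseline run.

        A single averaged number would hide a drop on one axis that is
        offset by a gain on another; per-dimension deltas cannot."""
        return {d: self.scores[d] - baseline.scores[d] for d in self.scores}
```

For example, a run that gains 15 points on latency while losing 5 on correctness averages out to "better" but regresses on the dimension that matters.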
The golden set: small, ruthless, and actually maintained
The dominant failure mode of AI evaluation programs is a "golden set" that is either too large to maintain or too small to generalize. Both failure modes reduce to the same outcome: the team stops looking at the numbers.
Build the golden set to be used, not to be comprehensive.
The size target that works. 60 to 150 examples per feature is enough for almost every mid-market use case. Below 60, the scores are too noisy to act on. Above 150, the maintenance cost suppresses updates, and the set gets stale. We have never seen a team regret shipping with 100 examples; we have seen many teams drown in a 2,000-example set they refuse to look at.
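The noise claim is just binomial sampling error, and you can sanity-check the size target with three lines:

```python
import math

def pass_rate_stderr(p: float, n: int) -> float:
    """Standard error of an observed pass rate p on an n-example set,
    in percentage points (binomial approximation)."""
    return 100 * math.sqrt(p * (1 - p) / n)
```

At a 90 percent pass rate, a 60-example set has a standard error of roughly 3.9 points and a 150-example set roughly 2.4, so below the lower bound a real 3-point regression is indistinguishable from noise.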
The composition that catches what matters. Build the set in four slices, not one.
The update rhythm. Every production bug that escapes the evaluation suite gets added to the known-regressions slice the week it is fixed. If the evaluation suite did not catch it, the evaluation suite was insufficient. This is the single most effective discipline we see in teams that ship AI features durably: the golden set grows exactly in proportion to the production surface area.
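The update rhythm is easy to automate. A sketch of the "escaped bug goes into the known-regressions slice" step, assuming a JSONL golden set; the field names are illustrative, not a required schema:

```python
import datetime
import json
from pathlib import Path

def add_known_regression(path: Path, bug_id: str, prompt: str,
                         bad_output: str, expected: str) -> None:
    """Append an escaped production bug to the known-regressions slice.

    Called the week the bug is fixed, so the golden set grows in
    proportion to the production surface area."""
    record = {
        "slice": "known-regressions",
        "bug_id": bug_id,
        "added": datetime.date.today().isoformat(),
        "input": prompt,
        "observed_failure": bad_output,
        "expected": expected,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Appending rather than editing keeps the set diff-reviewable: every new regression example shows up as one line in the PR.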
LLM-as-judge: useful, and dangerous
Once the golden set has more than about 40 examples, hand-scoring every run becomes a bottleneck. The natural move is to use a second LLM as the judge. This works well enough that we recommend it. It also fails in well-characterized ways that teams need to design around.
- Position bias. When two candidate answers are presented in order, judges systematically prefer the first one. Documented in multiple studies at double-digit percentage points. Mitigation: always score both orderings and average, or use a single-score rubric instead of pairwise.
- Length bias. Judges prefer longer answers independent of correctness. Mitigation: include "answer length is not a quality signal" explicitly in the rubric, and report length distribution alongside scores.
- Self-preference. When the judge and the answerer are the same model family, scores inflate by a measurable margin. Mitigation: use a different model for the judge than for the answerer wherever cost allows.
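For teams that do keep pairwise comparison, the position-bias mitigation is a one-liner. A sketch, assuming a judge function that returns the probability in [0, 1] that its first argument is the better answer:

```python
def debiased_pairwise(judge, a: str, b: str) -> float:
    """Score candidate `a` against `b` in both presentation orders
    and average, cancelling any preference for the first position."""
    return (judge(a, b) + (1 - judge(b, a))) / 2
```

A judge that blindly prefers whichever answer comes first scores every pair at exactly 0.5 under this scheme, which is the correct "no signal" result.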
The practical pattern that works: rubric-based single scoring, not pairwise comparison. Give the judge a numbered rubric (1 to 5 on correctness, 1 to 5 on groundedness, etc.) with explicit anchor descriptions for each score. Ask for the score and the justification. Store both. Periodically, hand-sample 10 percent of the judgments and spot-check for drift.
The judge is a noisy signal. It is still more useful than no signal. Treat it like a flaky test: act on trends, not single results.
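A minimal sketch of the rubric-based pattern; the rubric wording, score range, and JSON reply schema below are illustrative, not a standard:

```python
import json

RUBRIC = """Score the answer on two axes, 1-5 each, against the source documents.
correctness: 1 = contradicts the source; 3 = partly right; 5 = fully correct.
groundedness: 1 = claims unsupported by any source; 5 = every claim traceable.
Answer length is not a quality signal.
Reply as JSON: {"correctness": n, "groundedness": n, "justification": "<why>"}"""

def parse_judgment(raw: str) -> dict:
    """Parse and validate the judge's reply; store the score AND the
    justification, so periodic hand-sampling can spot-check for drift."""
    j = json.loads(raw)
    if not (1 <= j["correctness"] <= 5 and 1 <= j["groundedness"] <= 5):
        raise ValueError("score outside the rubric's 1-5 range")
    return j
```

Validating the range at parse time matters: a judge that drifts into returning 0 or 10 should fail loudly, not silently shift the trend line.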
The release-gate table: what must pass before production traffic
A release gate is a named boolean. The value of the gate is either "pass" or "fail" against an explicit threshold. Vague thresholds ("looks good") are not gates. The gates we use, per feature, look like this:
Gate design
Typical release-gate thresholds by dimension
Thresholds are per feature. Calibrate on the first production cohort, then lock.
Hallucination rate and latency are ceilings (lower is better); the rest are floors (higher is better). Values shown are defaults we use as starting points; every team should recalibrate within two weeks of first production cohort data.
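The floor/ceiling distinction is worth making explicit in code so nobody inverts a comparison during a late-night threshold tweak. A sketch; the gate names and threshold values below are illustrative placeholders, to be calibrated per feature as described above:

```python
from typing import NamedTuple

class Gate(NamedTuple):
    name: str
    threshold: float
    ceiling: bool  # True: value must be <= threshold (hallucination, latency)
                   # False: value must be >= threshold (correctness, etc.)

    def passes(self, value: float) -> bool:
        return value <= self.threshold if self.ceiling else value >= self.threshold

# Illustrative starting points only -- calibrate per feature, then lock.
GATES = [
    Gate("correctness", 90.0, ceiling=False),
    Gate("hallucination_rate", 2.0, ceiling=True),
    Gate("p95_latency_ms", 3000.0, ceiling=True),
]
```

Each `Gate` is exactly the "named boolean" the section opens with: a name, an explicit threshold, and a pass/fail answer.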
Two design rules:
Every gate has an owner. The evaluation lead owns correctness and groundedness. The ML lead owns hallucination rate. SRE owns latency. Security owns adversarial refusal. If the gate does not have a named owner, it will not have an answer when it fails.
Gates block merges, not deploys. Move the evaluation suite into CI so a failing gate prevents the PR from merging. Deploy-time gates are too late: by the time the build hits staging, the author has moved on to the next ticket, and the gate becomes a slog the team routes around.
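Wiring the suite into CI can be as simple as a script whose exit code is the gate verdict; mark it as a required status check and a failing gate physically cannot merge. A sketch, with a deliberately plain `(name, threshold, is_ceiling)` tuple schema (an assumption, not a standard format):

```python
import sys

def check_gates(results: dict, gates: list) -> int:
    """Return a process exit code: 0 if every gate passes, 1 otherwise.

    `gates` entries are (name, threshold, is_ceiling). Run this as the
    last CI step so a failure blocks the merge, not the deploy."""
    rc = 0
    for name, threshold, is_ceiling in gates:
        value = results[name]
        ok = value <= threshold if is_ceiling else value >= threshold
        if not ok:
            print(f"GATE FAILED: {name}={value} (threshold {threshold})",
                  file=sys.stderr)
            rc = 1
    return rc
```

Reporting every failed gate before exiting, rather than stopping at the first, keeps the failure message actionable for the PR author.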
Rollout mechanics: canary, 1%, 10%, 100%
Passing the release gate means the feature is ready to start a rollout, not finish one. Every AI feature rolls out in four stages; every stage has a named exit criterion; every criterion is observable before promotion.
Production rollout
Four-stage rollout with exit criteria
The percentages label the size of each gated traffic cohort, not dropout between stages; the right cohort sizes and stage durations depend on the feature.
The most common mistake at this stage is telescoping. A team, under pressure, skips the 1% cohort because the canary "looked good." Canary traffic is not production traffic. The canary is trained to recognize the feature's failure modes; real customers are not. Respect the stages.
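"Every stage has a named exit criterion" can be enforced structurally, so skipping a stage requires editing code rather than just flipping a flag under pressure. A sketch; the metric names and criteria in `STAGES` are illustrative stand-ins for your real release-gate table:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str
    traffic_pct: float
    min_days: int
    exit_criterion: Callable[[dict], bool]  # observed metrics -> promote?

# Illustrative criteria only; the real ones come from the feature's gates.
STAGES = [
    Stage("canary", 0.1, 2, lambda m: m["incidents"] == 0),
    Stage("one_percent", 1.0, 5, lambda m: m["complaint_rate"] < 0.01),
    Stage("ten_percent", 10.0, 7, lambda m: m["score_drop_pts"] <= 3),
    Stage("full", 100.0, 0, lambda m: True),
]

def next_stage(current: int, metrics: dict) -> int:
    """Promote only when the current stage's exit criterion holds."""
    if STAGES[current].exit_criterion(metrics):
        return min(current + 1, len(STAGES) - 1)
    return current
```

Because promotion is a function of observed metrics, "the canary looked good" is never a sufficient argument: the 1% cohort's own criterion still has to pass.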
Regression: the evaluation suite that catches drift
Once a feature ships, the question flips. "Does it still work?" replaces "Is it ready?" Four drift sources matter:
- Model drift. The provider updates the underlying model. Behavior changes, often subtly. Version pinning helps; version-pinned eval runs on a schedule help more.
- Context drift. Retrieval corpus grows, or its distribution shifts. Answers that were well grounded in January are hallucinated in April because the source document was archived.
- Prompt drift. The prompt file accumulates edits. Every edit should run through the eval suite before landing; most don't.
- Upstream drift. A tool the agent calls changes its schema or quota. Agent behavior degrades in ways the agent itself cannot diagnose.
The defense is a scheduled regression run. Every week, run the golden set on a known-good evaluation harness against the current production prompt and model, and post the scores to a dashboard. A drop of more than 3 points on any dimension is an incident. Not "a bug to schedule." An incident.
A single CI job runs the golden set against the production config on a schedule, posts the results to a Slack channel, and reports a single number per dimension. Twenty lines of code, one hour to set up, and it catches the slow-rot drift that every AI feature eventually experiences. If you have nothing else, ship this first.
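A minimal version of that job might look like the sketch below. The webhook URL is a placeholder for your own Slack incoming webhook, and the per-dimension score dict is an assumed interface to your harness:

```python
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/YOUR/WEBHOOK/HERE"  # placeholder
ALERT_DROP_PTS = 3  # a drop beyond this on any dimension is an incident

def report(scores: dict, baseline: dict, post: bool = True) -> list:
    """Compare this week's per-dimension scores to the stored baseline,
    post a summary to Slack, and return any dimensions that dropped
    past the incident threshold."""
    flagged = [d for d in scores if baseline[d] - scores[d] > ALERT_DROP_PTS]
    text = " | ".join(f"{d}: {scores[d]:.1f}" for d in sorted(scores))
    if flagged:
        text = f"INCIDENT on {flagged} -- {text}"
    if post:
        body = json.dumps({"text": text}).encode()
        urllib.request.urlopen(urllib.request.Request(
            SLACK_WEBHOOK, body, {"Content-Type": "application/json"}))
    return flagged
```

Schedule it weekly in any CI system's cron trigger; the only state it needs is the baseline scores from the last known-good run.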
What this changes about your QA org
The QA function does not disappear when the team starts shipping AI. It shifts shape. The work moves away from scripted acceptance testing and toward what QA has historically called exploratory testing, plus instrumentation, plus rubric design. A QA engineer who can author a 100-example golden set, specify a judge rubric, and own the release-gate thresholds is doing more valuable work than a QA engineer who is running the same regression suite a release automation tool could run.
For most engineering organizations, the cheapest way to get this capability is to retrain one or two existing QA engineers on the LLM evaluation stack rather than hire "AI QA." The methodology is familiar; the tooling is new. Rex Black, Inc. runs a focused two-week upskilling program for exactly this transition under the learning and upskilling line.
What a leader can do this week
Three concrete moves, in order:
- Write down the current release gate for your AI feature, in the same document as the release gates for your deterministic features. If the AI gate is one line long and the deterministic gate is thirty lines, that is the gap this paper is designed to close.
- Pick one feature and build a 60-example golden set for it. Do not try to cover every feature. Pick the one with the highest blast radius on a bad answer. Two engineering days is a realistic budget for the first set.
- Schedule the weekly regression run. Even if the golden set is incomplete. Even if the rubric is rough. Shipping the rhythm is more valuable than perfecting the artifacts, because the rhythm surfaces the gaps in the artifacts.
If you want a second pair of eyes on the evaluation design for a specific feature, the AI & agents practice and the test engineering practice run this exercise together as a focused two- to three-week engagement. Decision framework, release-gate spec, initial golden set, and the CI harness to run it live.
This paper is part of a series for engineering leaders putting AI into production. The other three pieces cover workflow vs agent architecture, model selection and cost management, and the broader adoption sequence that wraps them together.