01 / 05

Five dimensions.

Score each independently.

Dimensions

Correctness. Grounded accuracy on the golden set.
Groundedness. For any RAG or tool user, can the answer be traced?
Safety. Refusal rate for off-policy, jailbreak survival.

Dimensions (cont.)

Cost & latency. Per-decision cost, P50 and P95.
Regression stability. Does yesterday's score hold after a prompt edit or model upgrade?

Evaluation before shipping.

How to test an AI application before it hits production · Rex Black, Inc.

"The demo worked"

is not a release gate.

Five dimensions.

Score each independently.

Dimensions

Dimensions (cont.)

The golden set.

Small. Ruthless. Maintained.

Size

Composition

LLM-as-judge.

Useful. And dangerous.

The release gate.

A named boolean.

Rules

Typical thresholds

Rollout.

Canary. 1%. 10%. 100%.

Stages

Discipline

Once it ships.

The question flips.

What this changes

about your QA org.

This week.

Evaluation is the

new QA.

rexblack.com/resources/writing/ai-evaluation-before-shipping