Safe at prototype. Expensive at production.
Mid-market AI deployments we audit: 60 to 80 percent of the bill is available savings, at quality parity, from disciplined routing.
Capability. Score on your golden set. Public benchmarks have no predictive value. Latency. P50 and P95 against the call site budget.
Cost. Per-decision, fully loaded. Include retry rate and fallback rate. Reliability. Single-provider 99.5% = four hours of degraded ops per month.
If a smaller model scores within 2 to 3 points of a larger model on your rubric, that is within the noise floor.
The smaller model is capable enough. The cost gap is a pure economic win.
First pass, small model. ~80% resolved here. Confidence score on the output. Promote below threshold to flagship.
Promotion rate. 15-20% after tuning is the target. Above 30% = gate is wrong. Below 10% = quality is bleeding.
Flagship: ~$9K/month. Cascade: ~$3K/month.
Flagship: ~$90K. Cascade: ~$28K.
Flagship: ~$900K/month. Cascade: ~$240K/month.
Fastest engineering work on the backlog.
Reversible. Cheap. Always first.
Only when the task is stable and narrow. Be honest about the re-train obligation.
When the gap is capability, not data.
No amount of prompting closes a capability gap.
Prompt tuning. Prompts rarely port across models. Eval re-run. Full golden set, new scores.
Regression re-verify. Known bugs may not still be bugs. Operational memory. Two weeks of on-call recalibration.
Design for switching from day one. Prompts and models are config, not code.
Every provider has at least one significant outage per quarter.
The cheapest viable pattern: two providers, one hot standby, routed through a thin abstraction.
The standby does not need to be same-family. It needs to be good enough to carry traffic while the primary recovers.
Fifteen lines of code. Build it before you need it.
Request payload. Model hint. Latency budget.
Response. Provider used. Cost incurred.
Instrument per-decision cost on the hottest feature. If the answer is "we would need to calculate," that is the first gap. Run the cascade prototype. Two-tier, naive gate. If quality holds and cost drops, you just funded the next two quarters of AI work. Audit switching cost. If migrating off the current model takes more than five engineering days, your lock-in is bigger than the team realizes.