Managing complex test environments. A logistics problem, not a spreadsheet problem.
Some bugs only reveal themselves in field-like conditions: server crashes under load, data corruption under concurrent write, regressions that hit only on a specific OS/driver/firmware stack, AI model drift that hits only on a specific dataset version. That means the test environment itself becomes complex enough to need its own system of record — hardware, software, configurations, people, locations, schedules, and the relationships between all of them. This talk is the framework for building and running that system of record, so your test coverage against complex systems stops being accidental.
Abstract
Complex environments surface different bugs than simple ones.
Some bugs are shy. They only reveal themselves during performance, stress, volume, data-quality, and reliability testing — and the symptoms, when they finally appear, are the expensive kind: server crashes, sudden performance cliffs, database wedges, silent data corruption, model drift, regressions specific to a driver version. Because software is non-linear, test results in less-complex settings rarely extrapolate. Finding these bugs requires running the tests in environments that look like production.
That requirement creates a second-order problem. A production-like test environment is usually an interconnected mesh: servers, clients, networks, storage, firmware, OS configurations, model endpoints, dataset versions, third-party services, lab locations, calibration equipment, and the people who have to operate all of it on a schedule. At that point you have stopped running tests and started running a logistics operation. Without a system of record, the team ends up guessing which hardware is where, who's running what, which build is on which box, and why today's run doesn't match yesterday's.
This talk walks through the framework. What to model. How to model it using entity-relationship thinking — a technique that predates Airtable, Terraform, and the modern inventory tools by decades and still underwrites all of them. What reports to build off the model once it exists. How to tie the model to the test schedule, to the location graph, and to the software-release plan so every test run is reproducible and explainable. The SpeedyWriter case study at the end shows the whole thing end-to-end with real entities, real cycles, and real configurations.
“Test results in less-complex settings rarely extrapolate, because software is not linear. If your customer runs it in a mesh, you have to test it in a mesh.”
— Rex Black, Inc.
Outline
What the talk covers, in order.
Why complex test environments exist at all
Shy bugs. Performance issues at scale. Volume and capacity limits. Data quality failures. Reliability and MTBF regressions. Driver- or firmware-specific failures. Model drift on specific dataset slices. These are the bug classes that hide in trivial environments and surface catastrophically in production. Testing for them requires the test environment to have enough complexity to provoke them, which is the whole reason test-environment management becomes its own discipline.
What a production-like environment actually is
Today this is a bigger graph than most teams model honestly. For enterprise systems, it's still dozens of interconnected services across cloud regions, on-prem data centers, and third-party APIs. For embedded and IoT, it's hardware, firmware, radios, and calibration equipment in specific physical locations. For AI/ML, it's model versions, dataset versions, GPU SKUs, cache states, prompt-template revisions, and evaluator configurations. The common structure: nodes, relationships, states, versions, and schedules. Everything else is details.
What to track in the system of record
Seven things minimum. Hardware installation, locations, and relocations. Current, historical, and planned hardware configurations. Hardware interconnections and networking. Test locations. Test infrastructure (harnesses, simulators, stubs). Test engineer assignments and locations. Human resource deployment. Track these and you can answer the three questions every test program gets asked at 2 PM on release day: where is it, who has it, and what software is on it.
- Hardware: what's installed, where it is, its configuration history, its network wiring.
- Software: every version running on every host — OS, firmware, applications, tooling, test scripts.
- People: who works where, in which role, on which cycle, across which shifts.
- Schedules: which test suite runs at which location on which build on which date.
Entity-relationship thinking, still the right frame
The original deck modeled this in Microsoft Access. Today the tool is Airtable, a purpose-built inventory DB, Terraform state plus a test-harness layer on top, or a JSON/YAML config in git — but the underlying technique is the same. Identify your entities (hardware, software, testers, tests, locations). List their properties. Pick the key properties that uniquely identify each entity. Identify relationships between entities (one-to-one, one-to-many, many-to-many). Note which relationships have their own properties. The E-R model is the conceptual backbone; whichever tool you pick stores it.
- Entities: hardware, software, testers, tests, locations.
- Key properties: serial numbers, build hashes, usernames, test IDs, site codes.
- Relationships: tester-runs-test, hardware-is-at-location, software-installed-on-hardware, test-requires-hardware-configuration.
- Relationship properties: date, cycle, pass/fail, time-on-task.
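The bullets above can be sketched as plain data structures. A minimal Python sketch, where every entity name, field, and value is a hypothetical illustration rather than anything from the talk:

```python
from dataclasses import dataclass
from datetime import date

# Entities, each with a key property that uniquely identifies it.
@dataclass(frozen=True)
class Hardware:
    serial: str          # key property
    model: str
    location: str

@dataclass(frozen=True)
class Software:
    revision: str        # key property, e.g. "C.1.Mac"
    platform: str

@dataclass(frozen=True)
class Tester:
    username: str        # key property

# A many-to-many relationship with its own properties:
# tester-runs-test carries a date, a cycle, and a result.
@dataclass
class TestRun:
    tester: str          # -> Tester.username
    test_id: str
    hardware: str        # -> Hardware.serial
    software: str        # -> Software.revision
    cycle: int
    run_date: date
    passed: bool

run = TestRun("mdoe", "SW-017", "SN-0042", "C.1.Mac", 1, date(2025, 3, 3), True)
```

The point is not the dataclasses; it is that keys and relationship properties are explicit, so any store (Airtable, a DB, JSON in git) can hold the same shape.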
Budgeting and planning against the model
Once the entities are in place you can plan off them instead of off guesses. Testers and tests: maintain the test suite inventory and estimated durations, then assign testers to runs. Hardware and infrastructure: know the capacity and the reservation schedule. Locations and lab space: testers work somewhere, hardware is installed somewhere, and the relationship between the two is a scheduling constraint. The planning math — who's running what, on which hardware, at which site, on which build — is a join across the model, not a standalone spreadsheet.
Reports, not tables — the management surface
Raw tables are hard to read and harder to cross-reference. The useful artifact is the report: multi-table joins organized for a reader, not for a DBA. Reports by tester ("what am I running this week"), reports by test ("where and when is this suite executing"), reports by location ("what's in this lab, for how long"), reports by cycle ("what's covered, what's not"), and summary rollups across phases. These are the artifacts management actually reads; the model exists to make them cheap to produce.
- Per-tester assignment reports — the weekly plan.
- Per-test schedule reports — the execution window.
- Per-location reports — the capacity picture.
- Per-cycle summary rollups — the coverage picture.
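To illustrate a report as a join rather than a raw table, here is a minimal per-cycle coverage rollup over a flat run log. All test IDs, names, and the `coverage_by_cycle` helper are invented for the sketch:

```python
from collections import defaultdict

# Hypothetical flat run log: (tester, test_id, location, cycle, passed).
runs = [
    ("mdoe", "SW-001", "lab-1", 1, True),
    ("mdoe", "SW-002", "lab-1", 1, False),
    ("klee", "SW-001", "lab-2", 2, True),
]
# Hypothetical plan: which tests each cycle is supposed to execute.
planned = {1: {"SW-001", "SW-002", "SW-003"}, 2: {"SW-001", "SW-002"}}

def coverage_by_cycle(runs, planned):
    """Per-cycle rollup: what's covered, what's not."""
    executed = defaultdict(set)
    for _tester, test_id, _loc, cycle, _passed in runs:
        executed[cycle].add(test_id)
    return {c: {"covered": sorted(executed[c] & tests),
                "missing": sorted(tests - executed[c])}
            for c, tests in planned.items()}

report = coverage_by_cycle(runs, planned)
# cycle 1 is missing SW-003; cycle 2 is missing SW-002
```

The same run log, grouped by tester or by location instead of by cycle, yields the other three reports in the list above.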
Locations matter more than teams admit
Equipment is installed in locations — test labs, computer rooms, colocations (like telephony servers at telco central offices, or IoT devices at customer sites). People work at locations too — labs, cubicles, home offices, customer sites, on the road. In a complex program the location graph changes over time: equipment relocates, labs reconfigure, people move. A test environment model that doesn't track locations over time cannot explain why a test ran differently this cycle than last. Capture location, and capture dates on location.
Software release tracking inside the model
Software is not one thing — it's a stack. The model has to hold BIOS/firmware versions, operating system versions, applications, virtual machines and interpreters, utilities, and test tools and scripts. The release plan ties specific revisions to specific cycles on specific platforms at specific dates. When a regression appears, the question "what changed in the stack between cycle N-1 and cycle N on this platform" has to be answerable from the model in a single query, not reconstructed from memory.
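That single query can be as small as a dictionary diff, assuming the model captures one versioned stack snapshot per (platform, cycle). A sketch under that assumption, with invented component names and versions:

```python
# Hypothetical stack snapshots keyed by (platform, cycle):
# component -> version, captured at the start of each cycle.
stacks = {
    ("win-x64", 2): {"firmware": "1.4", "os": "10.0.19045",
                     "app": "4.2.0", "harness": "0.9.1"},
    ("win-x64", 3): {"firmware": "1.5", "os": "10.0.19045",
                     "app": "4.2.1", "harness": "0.9.1"},
}

def stack_diff(stacks, platform, cycle):
    """What changed between cycle N-1 and cycle N on this platform."""
    before = stacks[(platform, cycle - 1)]
    after = stacks[(platform, cycle)]
    return {c: (before.get(c), after.get(c))
            for c in sorted(set(before) | set(after))
            if before.get(c) != after.get(c)}

changes = stack_diff(stacks, "win-x64", 3)
# {'app': ('4.2.0', '4.2.1'), 'firmware': ('1.4', '1.5')}
```

If the model can't answer this in one call, regression triage falls back to reconstructing the stack from memory, which is the failure mode the whole system of record exists to prevent.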
The SpeedyWriter worked example — what the model looks like end-to-end
The case study ships SpeedyWriter across five platforms (Mac, Win95, Win98, WinNT, Solaris) through three phases (Component, Integration, System) with three cycles each. That's a 5 × 3 × 3 lattice — forty-five distinct (platform, phase, cycle) slots, each with its own build revision, its own assigned tester, its own hardware allocation, and its own date. The model catalogs every slot; the reports slice it by tester, by platform, by phase, or by date. Replace the 1999 platforms with today's (four cloud regions × three runtime versions × three model revisions, say, or three firmware builds × two radio stacks × four physical labs) and the structure holds.
- Revision identifier grammar: "C.1.Mac" = Component phase, Cycle 1, Mac build. A single naming convention makes the model legible.
- Release dates are on the revision, not on the plan.
- Tested configurations are a query, not a report — the same model answers them on demand.
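The lattice itself is cheap to enumerate once the naming grammar is fixed. A sketch using the case study's platforms and the "C.1.Mac" grammar; the `slots` enumeration is illustrative, not taken from the deck:

```python
from itertools import product

platforms = ["Mac", "Win95", "Win98", "WinNT", "Solaris"]
phases = {"Component": "C", "Integration": "I", "System": "S"}
cycles = [1, 2, 3]

# Every (phase, cycle, platform) slot, named with the case study's
# revision grammar: "C.1.Mac" = Component, Cycle 1, Mac build.
slots = [f"{phases[ph]}.{cy}.{pf}"
         for ph, cy, pf in product(phases, cycles, platforms)]

assert len(slots) == 45   # the 5 x 3 x 3 lattice
```

Each slot then gets its own build revision, tester, hardware allocation, and date attached in the model; the reports are slices of this list joined against those attachments.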
What goes wrong without the system of record
Three failure modes, all common. Materiel misunderstandings that defocus the team: two teams assume two different labs are the same lab, or two different firmware versions are the same version, and the test fails or succeeds for reasons no one can reconstruct. Missing coverage against critical customer-impacting bugs: the environment the customer hits wasn't in the matrix, because no one was keeping the matrix. Bug irreproducibility: the build-and-configuration stack that triggered the bug can't be re-created because no one captured it at the moment of failure. All three go away when the environment is modeled instead of remembered.
Key takeaways
Four things to remember.
Treat the test environment as data.
If your environment state lives in heads, shared folders, and hallway conversations, your complex-system testing is guesswork. Model the entities and the relationships explicitly — in whatever tool fits the team.
Entities and relationships, not tools.
Airtable, Notion, a bespoke DB, Terraform state with a harness layer, or JSON in git — all fine. The entity-relationship model beneath them is what makes any of them pay off. Pick the tool that lets your team author the model without a DBA.
Reports are the product. The model is infrastructure.
The per-tester weekly plan, the per-cycle coverage rollup, the tested-configurations matrix — these are what management consumes. The entity model exists to generate them cheaply.
Dates and locations are first-class.
Equipment moves. Labs reconfigure. People change roles. Builds ship. Track everything with a date and a location, and regression triage stops being forensic work.
Worked examples
Five entities, modeled once, reported many ways.
The core entity set from the SpeedyWriter worked example, restated for a modern program. The entities are the same whether the test target is enterprise software, an embedded device, or an AI/ML system; the properties and relationships change per domain.
Hardware. Key: serial or asset ID.
Properties: type, spec, state (in-service / in-repair / retired), location, install-date, owner.
Relationships: installed at Location; runs Software; allocated to Test.
Software. Key: revision identifier (e.g. C.1.Mac or v4.2.1-gpu-a100).
Properties: phase, cycle, platform, build-date, artifact URL, checksum.
Relationships: installed on Hardware for a Cycle; under-test by Tester.
Tester key: username or employee ID; properties include role, shift, home location.
Test key: test ID; properties include suite, estimated duration, hardware requirement.
Location key: site code; properties include type (lab, datacenter, customer site, remote), capacity, timezone.
Relationships: Tester runs Test at Location on Date against Software on Hardware — the join every report in this domain is built on.
- Per-tester weekly plan (tests × days).
- Per-cycle coverage matrix (platform × phase × cycle).
- Per-location capacity view (hardware × assignment × date).
- Per-build history (software × platform × tester × pass/fail).
- Environment-drift diff (what changed in the stack between cycle N-1 and cycle N on this platform).
Closing
The deck's original tooling choice — Microsoft Access — is not the point, and never was. The point is that the test environment is a graph, and graphs need a system of record. Today that system looks different depending on the org: a shared Airtable base for a 20-person QA team, a purpose-built internal app at an enterprise, Terraform state plus a custom test-harness layer at a cloud-first company, or a YAML/JSON block in a git repo for a small infra team.
Whatever the tool, the work is the same: name the entities, pick the keys, model the relationships, write the reports, and keep the model honest as the program evolves. Teams that invest in this piece of infrastructure spend less time in release-day triage and catch more of the shy bugs the environment was built to find.
Keep reading
Related pieces.
More for this audience
Articles, guides, and case studies tagged for the same readers.
- Whitepaper
Evaluation Before Shipping: How to Test an AI Application Before It Hits Production
The release-gate playbook for AI features. Covers the five evaluation dimensions, how to build a lean golden set, where LLM-as-judge is trustworthy and where it lies, rollout mechanics with named exit criteria, and the regression suite that keeps a shipped AI feature from quietly rotting in production.
Read →
- Whitepaper
Choosing the Right Model (and Knowing When to Switch)
A practical framework for matching LLM model tier to task. Covers the four axes (capability, latency, cost, reliability), cascade routing patterns that cut cost 60 to 80 percent without measurable quality loss, switching costs you did not plan for, and the worked economics at 10K, 100K, and 1M decisions per day.
Read →
- Whitepaper
Beyond ISTQB: A Multi-Domain Certification Roadmap for Technical L&D
Most engineering L&D programs over-index on a single certification family, usually ISTQB on the QA side, AWS on the infrastructure side, and under-invest across the rest of the technical domains the org actually needs. This paper covers a multi-domain certification roadmap (QA, AI, cloud, data, security, project management, software engineering) with sequencing logic for each level of the engineering ladder, plus the maintenance discipline that keeps the roadmap relevant as the technology shifts underneath it.
Read →
- Guide
The ISTQB Advanced Level path, mapped
The Advanced Level landscape keeps changing — CTAL-TA v4.0 shipped May 2025, CTAL-TM is on v3.0, CTAL-TAE is on v2.0. This guide maps all four core modules, prerequisites, exam formats, sunset dates, and which module a given role should take first. Links directly to the authoritative istqb.org syllabi.
Read →
- Whitepaper
Bug Triage: A Cross-Functional Framework for Deciding Which Defects to Fix
Bug triage is the cross-functional decision process that converts raw defect reports into prioritized action. Done well, it optimizes limited engineering capacity against risk; done poorly, it becomes a backlog-management ritual that neither fixes the important defects nor drops the unimportant ones. This whitepaper covers the triage process, the participants, the six action outcomes, the four decision factors, and the governance disciplines that keep triage effective in continuous-delivery environments.
Read →
- Whitepaper
Building Quality In: What Engineering Organizations Do from Day One
Testing at the end builds confidence, but the most efficient quality assurance is building the system the right way from day one. This whitepaper covers the upstream disciplines — requirements clarity, lifecycle selection, per-unit programmer practices, and continuous integration — that make system-level testing cheap and fast rather than the only thing holding a release together.
Read →
Where this leads
- Service · Quality engineering
Software Quality & Security
Independent test programs, security testing, and quality engineering for systems where defects cost real money.
Learn more →
- Solution
Risk Reduction & Clear Decisions
Quality programs and decision frameworks that shift risk discussions from anecdote to evidence.
Learn more →
- Solution
Reliable Software at Scale
Quality engineering programs for organizations whose software is now operationally critical.
Learn more →
Want this talk delivered in-house?
Rex Black, Inc. delivers every talk on this site as a live workshop, a keynote, or a conference session. Tailored to your stack, your team, and your timeline.