Whitepaper · Test Data
Every test team running against production-like data has to answer the same six questions about that data. This article names all six, explains why the lazy answers don't work, and maps today's tool categories — synthetic-data generators, format-preserving encryption, LLM-assisted scrubbing — onto each question.
Read time: ~9 minutes. Written for test managers, platform leads, and anyone who has been told to "just anonymize prod" and left to figure out the rest.
Why this is hard now
Three pressures make test data harder today than it was ten years ago:
- Volume and diversity. Modern systems accumulate data over many product versions, across many integrations. No one is hand-crafting records that cover the long tail.
- Regulation. GDPR, CCPA, HIPAA, the state-level privacy laws in the U.S., sector-specific rules (PCI, SOX), and now the EU AI Act govern what you are allowed to do with production data — including in lower environments. "We only use it for testing" is not an exemption.
- AI and analytics. Test data sets are now consumed by LLM-assisted test generation, embeddings, model fine-tuning, and analytics pipelines. A leak in a test environment that feeds any of those is a leak in all of them.
That reshapes the old choice among the three classic sources of test data:
| Source | What it is | Where it shines | Where it fails |
|---|---|---|---|
| Hand-crafted | Records a tester types or generates with a small fixture script | Targeted functional tests, edge cases you deliberately want | Volume, diversity, statistical realism |
| Synthetic (generated) | Rows produced by a tool (Faker, Mockaroo, Tonic, GANs, LLM generators) | Volume, privacy (no PII by construction), speed of refresh | Distribution fidelity, real-world quirks, cross-table referential subtleties |
| Anonymized production | Real data scrubbed of identifying fields | Realism, statistical properties, cross-system joins | Cost of the scrub, risk of re-identification, ongoing governance |
Most serious programs end up using two of the three. The rest of this article is about what "anonymized production" actually has to satisfy to be worth the trouble.
The six requirements for anonymized production data
Good anonymized data is not just "the names are different." It has to satisfy all six of the following. Miss one and the downstream tests are compromised in ways that are usually not obvious until late.
1. Irreversibility
The scramble cannot be undone by inspection. Trivial substitutions fail this: if the process shifts every letter by one, "Kpio Cspxo" is trivially "John Brown" to anyone who looks. Anyone who gets a copy of the dataset becomes an attacker with a trivial de-anonymization tool.
The modern framing is stronger than just "can't be undone by eye." It's: even with access to external data sources an attacker plausibly has (voter rolls, purchased marketing lists, leaked breaches, social media), they should not be able to recover identity by linkage. This is the core failure mode behind every "anonymized dataset" re-identification paper of the last decade.
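The contrast can be made concrete. A minimal sketch, assuming nothing beyond the Python standard library: the shift-by-one "scrub" from the example is invertible in one line, while a keyed HMAC pseudonym gives an attacker without the key nothing to invert.

```python
import hashlib
import hmac

def caesar_scramble(s: str) -> str:
    # The naive "shift every letter by one" scrub from the text.
    return "".join(chr(ord(c) + 1) if c.isalpha() else c for c in s)

def caesar_unscramble(s: str) -> str:
    # Anyone who inspects the output can write this in one line.
    return "".join(chr(ord(c) - 1) if c.isalpha() else c for c in s)

def keyed_pseudonym(s: str, key: bytes) -> str:
    # Without the key there is nothing to invert; truncated for readability.
    return hmac.new(key, s.encode(), hashlib.sha256).hexdigest()[:12]

print(caesar_scramble("John Brown"))    # -> "Kpio Cspxo"
print(caesar_unscramble("Kpio Cspxo"))  # -> "John Brown", trivially undone
print(keyed_pseudonym("John Brown", b"keep-this-secret"))
```

Note that keyed pseudonymization alone does not defeat linkage attacks against the *non*-scrubbed fields; it only removes the direct identifier.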
2. Realism
The anonymized value must keep the meaning of the original. "John Brown" becoming "Lester Camden" is fine: still a plausible male name. "John Brown" becoming "Charlotte Dostoyevsky" imposes a gender change, and if the record has a gender field, the record is now internally inconsistent. That inconsistency will surface as a test failure that is really a test-data bug.
Realism extends beyond names: addresses, phone numbers, tax IDs, IP addresses, device identifiers — all have formats, ranges, and distributions that carry testing-relevant meaning.
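One way to keep replacements realistic is to bucket the replacement pools by the attribute that must survive. A minimal sketch; the name pools here are hypothetical placeholders, and a real program would use much larger bucketed lists (or a library like Faker):

```python
import hashlib

# Hypothetical pools for illustration; bucket them by whatever must survive.
MALE_FIRST = ["Lester", "Marcus", "Dmitri", "Alan"]
FEMALE_FIRST = ["Charlotte", "Ines", "Priya", "Dana"]
SURNAMES = ["Camden", "Okafor", "Lindqvist", "Moreau"]

def pick(pool: list[str], seed: str) -> str:
    # Deterministic: the same input always maps to the same replacement.
    h = int(hashlib.sha256(seed.encode()).hexdigest(), 16)
    return pool[h % len(pool)]

def anonymize_name(first: str, last: str, gender: str) -> tuple[str, str]:
    # A male name stays a male name, so the gender field stays consistent.
    pool = MALE_FIRST if gender == "M" else FEMALE_FIRST
    return pick(pool, first + "|" + last), pick(SURNAMES, last + "|" + first)

print(anonymize_name("John", "Brown", "M"))
```

The same bucketing idea applies to the other fields: replacement phone numbers drawn from the right country format, replacement IPs from the right subnet class, and so on.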
3. Query preservation
If a query against production data returns N records, an equivalent query against the anonymized data must also return N records. Not "roughly N" — exactly N. Otherwise result-count assertions break, reporting tests become non-deterministic, and performance tests that depend on data cardinality give the wrong answer.
The corollary: row counts, join cardinalities, null rates, and value distributions all have to survive the scrub. A naïve scramble that randomly redistributes values across rows will pass eyeball inspection and break every aggregation-style test you have.
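This property is checkable. A minimal sketch of one such check, assuming rows as plain dicts: compare the multiset of group sizes between production and scrubbed data, since the group *keys* are allowed to change but the cardinalities are not.

```python
from collections import Counter

def counts_preserved(prod_rows: list[dict], scrubbed_rows: list[dict],
                     group_field: str) -> bool:
    # Requirement 3: an equivalent GROUP BY must return identical counts.
    # Compare the sorted multiset of group sizes, not the (scrambled) keys.
    prod = sorted(Counter(r[group_field] for r in prod_rows).values())
    test = sorted(Counter(r[group_field] for r in scrubbed_rows).values())
    return prod == test

prod = [{"cust": "john"}, {"cust": "john"}, {"cust": "mary"}]
good = [{"cust": "t001"}, {"cust": "t001"}, {"cust": "t093"}]
bad  = [{"cust": "t001"}, {"cust": "t002"}, {"cust": "t093"}]  # cardinality broken

print(counts_preserved(prod, good, "cust"))  # -> True
print(counts_preserved(prod, bad, "cust"))   # -> False
```

Similar assertions over null rates and join cardinalities make a useful post-scrub smoke test for the anonymization pipeline itself.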
4. Interoperability across systems
Real programs have multiple databases — often multiple services, each with its own store — that share data through de facto foreign keys: full name, tax ID, email, customer ID. End-to-end tests, data-warehouse tests, and any cross-service performance test depend on those joins holding.
If the anonymization scrambles each database independently, the joins silently break. A customer with 20 records in system A might end up with zero matching records in system B. Meaningful end-to-end testing of functionality, performance, reliability, localization, and security becomes impossible.
The practical implication: anonymization has to be a coordinated operation across every system that shares data, not a per-database task. Pick a deterministic mapping keyed on the real identifier so the same "John Brown" becomes the same "Lester Camden" everywhere he appears. Formally, this determinism is the property every format-preserving-encryption or tokenization tool has to give you.
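A minimal sketch of such a deterministic mapping, assuming a single secret key shared by every system's scrub job (the field names and example data are hypothetical): each system tokenizes independently, yet the cross-system join still holds afterward.

```python
import hashlib
import hmac

SHARED_KEY = b"rotate-me"  # one key, distributed to every system's scrub job

def token(real_id: str) -> str:
    # Same real identifier -> same token, whichever database is scrubbing.
    return "c_" + hmac.new(SHARED_KEY, real_id.encode(),
                           hashlib.sha256).hexdigest()[:10]

# System A (orders) and system B (support tickets) scrub independently:
orders  = [{"cust_id": token("TAX-1234"), "total": 99}]
tickets = [{"cust_id": token("TAX-1234"), "subject": "refund"}]

# The de facto foreign key survives the scrub, so the join still works:
joined = [o for o in orders for t in tickets if o["cust_id"] == t["cust_id"]]
print(len(joined))  # -> 1
```

The key becomes a governed secret: anyone holding it can regenerate the mapping, so it belongs in a secrets manager, not in the test environment.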
5. Data-quality preservation (the errors have to survive)
This is the subtle one. Most production data has errors — industry estimates have put the rate as high as one bad record in four. Bad dates, inconsistent casing, duplicate entries, orphaned references, out-of-range values. Those errors are exactly the inputs that expose brittle code paths.
If the anonymization process "cleans up" the data as a side effect — snapping bad formats to good ones, deduplicating obvious duplicates, filling in nulls — the scrubbed data is no longer representative of what the system sees in production. Tests pass in the scrubbed environment and the same classes of bug reach customers.
The rule: the same records that had errors in production must still have errors in the test environment. The errors must be similar to the originals but must not allow the originals to be reconstructed.
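One way to honor that rule is to classify each value before scrubbing it and keep it in the same error class afterward. A minimal sketch for dates, assuming only the standard library; the replacement scheme for invalid values is illustrative, not prescriptive:

```python
import hashlib
from datetime import datetime

def is_valid_date(s: str) -> bool:
    try:
        datetime.strptime(s, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def scrub_date(s: str) -> str:
    # Valid dates get generalized (crudely, to the first of the month).
    # Invalid dates STAY invalid, but are rewritten so the original
    # garbage string cannot be read back out of the test environment.
    if is_valid_date(s):
        d = datetime.strptime(s, "%Y-%m-%d")
        return d.replace(day=1).strftime("%Y-%m-%d")
    h = hashlib.sha256(s.encode()).hexdigest()
    return f"{h[:4]}-{h[4:6]}-??"  # still malformed, but not the original

print(scrub_date("2023-07-19"))    # valid in, valid out
print(scrub_date("19/07/23 ish"))  # malformed in, malformed out
```

The same pattern generalizes: detect the error class (bad format, duplicate, orphan reference, out-of-range), scrub the value, then assert the class is unchanged.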
6. Maintainability
A test-data set is not a one-time artifact. Testers add records to cover new features, edit records to set up specific scenarios, and delete records to test cleanup flows. If the anonymization process makes any of those harder than they would be against production — because identifiers are opaque, because cross-system joins require a lookup table that's not in the test environment, because foreign keys are one-way hashes — the test team will route around the anonymized data, and the investment was wasted.
Two practical challenges
Beyond the six requirements, two operational realities ambush most anonymization programs.
Refresh effort. One client reported refreshing test data from production only once every 12 to 18 months, because the scrub process took 4 to 6 person-months of effort and an entire calendar month of wall-clock time. That meant every test environment drifted steadily away from current production for a year-plus at a time. Today, when the shape of production data changes with every release, that gap is fatal: you end up running tests against a schema and distribution nobody sees anymore.
The fix is to treat test-data refresh as a pipeline, not a project. Run it nightly or weekly against a replica. Instrument it. Make the run time a tracked metric. If it takes four months, rebuild it until it doesn't.
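"Make the run time a tracked metric" can be as simple as timing the pipeline entry point. A minimal sketch; the three stage functions are hypothetical placeholders for whatever extract/scrub/load steps a real pipeline has:

```python
import time

def refresh_test_data(extract, scrub, load) -> float:
    # Treat refresh as a pipeline: run it, time it, report the metric.
    start = time.monotonic()
    load(scrub(extract()))
    return time.monotonic() - start

# Hypothetical stage functions, for illustration only:
elapsed = refresh_test_data(
    extract=lambda: [{"name": "John Brown"}],
    scrub=lambda rows: [{"name": "scrubbed"} for _ in rows],
    load=lambda rows: None,
)
print(f"refresh took {elapsed:.3f}s")  # emit to your metrics system; alert on regression
```

The point is not the timer but the habit: once the number is tracked, a four-month scrub shows up as an outlier instead of an accepted fact.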
Quiescent data. Anonymization has to operate on a stable snapshot — the source data cannot change mid-extraction. This is the same problem as database backup, and the same mechanisms (transactional snapshots, read replicas, point-in-time restore) solve it. But the people building the anonymization pipeline often don't realize they've built one until it produces inconsistent output on the first live run.
The current tool landscape
Ten years ago the choice was "commercial ETL-style scrubber" or "custom scripts." It isn't anymore. Four categories are now in play, and serious programs use more than one:
- Format-preserving encryption and tokenization — deterministic, reversible only with the key, preserves format and length. Strong fit for requirements 1, 3, 4 (joins via deterministic tokens), and 6 (you can round-trip with the key). Weak fit for requirement 2 unless you layer a realistic-values mapping on top. Examples: FF3-1, Vault Transform, many cloud KMS tokenization services.
- Synthetic data generation — generate records that mimic the statistical properties of production without containing any production PII. Strong fit for requirements 1 and 2. Requires careful calibration for requirement 3 (query preservation) and 5 (realistic errors). Examples: Tonic.ai, Mostly AI, Gretel, open-source Faker for simpler cases. GAN-based and diffusion-based generators for high-dimensional structured data.
- LLM-assisted scrubbing — use a model to rewrite free-text fields (descriptions, notes, comments) plausibly while preserving structure. Useful for long-tail unstructured fields that deterministic scrubbers handle badly. Governance risk: the LLM itself becomes a path for leakage if run against a hosted API; run locally for anything regulated.
- Differential privacy — mathematically bounded guarantees about what a dataset reveals about any individual. More often used for analytics than for test data, but worth knowing about — particularly for statistical-only tests (reports, dashboards) where individual-record fidelity isn't needed.
A practical modern pattern: deterministic tokenization for joinable identifiers (so requirement 4 holds across systems), realistic-value mapping for PII (so requirement 2 holds), synthetic generation for long-tail free-text and augmentation, and pipeline-style refresh so the whole thing stays current.
Choosing an approach
The test-tool-selection process applies: assemble a team, enumerate options, identify risks and constraints, evaluate, select, pilot, roll out. Test data is a project, not a task. Programs that try to do it on the cheap produce one of two outcomes:
- a scrub so light it fails the six requirements and eventually fails compliance, or
- a scrub so heavy the test data becomes a second product with its own maintenance cost and no owner.
Budget for it. Own it. Ship it as a pipeline. The payoff is test environments that behave like production without betraying the people whose data made production what it is.
Further reading
- Checklist: Quality Risk Analysis Process — where privacy and data-quality risks get weighted against everything else you're testing for.
- Talk: Managing Complex Test Environments — the logistics layer under every non-trivial test-data program.
- Article: A Risk-Based Testing Pilot: Six Phases, One Worked Example — how a disciplined pilot exposes whether your test data actually supports the tests you're planning to run.