In the era of generative AI, with enormous breakthroughs in software development, quality is more important than ever, and it all starts with the right test data.
Systems rarely stand alone. They are part of complex interconnected applications, often distributed across different technologies. During integration and chain testing, consistent and reliable test data is crucial to establish that software systems continue to work correctly. Many large organizations hold historical production data going back decades, data that cannot always be recreated through current interfaces, because it originates from migrated legacy systems or because those interfaces have changed over time.
In practice, organizations regularly fall back on existing data from production systems, copying databases to test and development environments through backup and restore operations. This brings a distinct set of challenges. Production databases are often very large, sometimes multiple terabytes, and making this data available is not only time-consuming but also costly. Multiple Agile or DevOps teams frequently share the same environment, which is far from ideal from an efficiency and team-independence perspective.
Most importantly, production databases typically contain large amounts of privacy-sensitive and business-critical information. Laws and regulations such as GDPR and CCPA, as well as industry standards like ISO 27001 and SOC 2, require that appropriate measures be taken to prevent personal data from being processed beyond its original purpose.
Generative AI, such as large language models, can produce test data based on natural language instructions. For small and manageable datasets, particularly early in the development process, this approach is highly efficient. However, when requirements become more complex, limitations arise quickly: dependencies between attributes, technical constraints, and functional rules make it considerably harder for an LLM to generate consistent and reliable data at scale.
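As an illustration only, the sketch below shows what such prompt-based generation can look like in Python. It assumes the OpenAI Python SDK and an API key in the environment; the model name, prompt, and columns are invented for this example, and any chat-style LLM API would work similarly.

```python
# Minimal sketch: prompting an LLM to generate a small test dataset.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY environment variable;
# model name, prompt, and columns are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Generate 10 fictitious customer records as CSV with the columns "
    "customer_id, full_name, date_of_birth, email, country. "
    "Dates of birth must lie between 1950 and 2005 and emails must match the names."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)

# CSV text to review and load into a test database
print(response.choices[0].message.content)
```

For a handful of records with simple constraints this works well; the consistency problems mentioned above appear once the prompt has to encode many interdependent rules across related tables.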
More traditional AI techniques, such as machine learning and deep learning, can also be used to train models on existing datasets and generate new synthetic data. However, for large enterprise databases with hundreds of tables and datasets ranging from gigabytes to terabytes, these approaches are not yet sufficiently scalable. Benchmarks show that training on a dataset of one million rows linked to a table of five thousand rows can already take between fifteen and ninety hours.
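For completeness, here is a rough sketch of that model-based approach using the open-source SDV library. The file and column handling are assumptions for illustration, and the exact API can differ between SDV releases, so verify the calls against your installed version.

```python
# Sketch: training a tabular synthesizer on existing data and sampling new rows.
# Assumes SDV's single-table API (SDV 1.x style); verify against your SDV version.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real_data = pd.read_csv("customers.csv")      # existing (already anonymised) dataset

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)     # infer column types automatically

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)                    # the training step that becomes slow at scale

synthetic_data = synthesizer.sample(num_rows=10_000)
synthetic_data.to_csv("customers_synthetic.csv", index=False)
```

Even this single-table case hints at the scaling issue: repeat the fit step for hundreds of interrelated tables and the training times quoted above become plausible.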
Rule-based generation offers a complementary approach: testers define explicit requirements, and data generators produce test data accordingly at high speed and volume. The trade-off is that this requires deep knowledge of the application and considerable manual configuration effort.
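The principle is easy to demonstrate in a few lines of Python with the Faker library; the schema, constraints, and business rule below are made-up examples, and dedicated generators apply the same idea at much higher speed and volume.

```python
# Rule-based test data generation: explicit rules, deterministic structure.
# Uses the Faker library for realistic-looking values; the schema and
# business rules below are illustrative assumptions.
import csv
import random
from faker import Faker

fake = Faker()
Faker.seed(42)   # reproducible runs
random.seed(42)

rows = []
for customer_id in range(1, 1001):
    birth_date = fake.date_of_birth(minimum_age=18, maximum_age=90)
    rows.append({
        "customer_id": customer_id,                    # unique key
        "full_name": fake.name(),
        "date_of_birth": birth_date.isoformat(),
        "email": fake.unique.email(),                  # uniqueness constraint
        "segment": "senior" if birth_date.year < 1960 else "regular",  # functional rule
    })

with open("customers_testdata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```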
In practice, the most commonly used and most efficient approach combines multiple techniques. Anonymising production databases ensures that personal data can no longer be traced back to individuals and that business-sensitive information is protected, while existing production data can largely be reused and chain consistency is maintained. Combining anonymisation with subsetting and database virtualisation enables teams to work in their own independent environments, with the ability to clone, snapshot, and restore in near real time. Where anonymised data does not cover all test scenarios, targeted synthetic data can fill the gaps.
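As a simplified sketch of the anonymisation and subsetting part of this combination, the pandas example below pseudonymises personal data deterministically and keeps a subset of two related tables consistent. Table names, columns, and the masking secret are assumptions for illustration; purpose-built TDM tooling performs the same operations directly against large databases.

```python
# Sketch: deterministic anonymisation plus consistent subsetting with pandas.
# Table/column names and the masking secret are illustrative assumptions.
import hashlib
import hmac
import pandas as pd

SECRET = b"replace-with-a-managed-secret"

def pseudonymise(value: str) -> str:
    """Deterministic pseudonym: the same input always maps to the same output,
    so relationships between tables stay consistent after masking."""
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:12]

customers = pd.read_csv("customers.csv")   # parent table from a production export
orders = pd.read_csv("orders.csv")         # child table referencing customer_id

# 1. Subset: keep only customers from one country to shrink the environment.
subset_customers = customers[customers["country"] == "NL"].copy()

# 2. Preserve referential integrity: keep only orders belonging to the subset.
subset_orders = orders[orders["customer_id"].isin(subset_customers["customer_id"])].copy()

# 3. Anonymise personal data consistently.
subset_customers["email"] = subset_customers["email"].map(pseudonymise)
subset_customers["full_name"] = subset_customers["full_name"].map(pseudonymise)

subset_customers.to_csv("customers_test.csv", index=False)
subset_orders.to_csv("orders_test.csv", index=False)
```

Because the masking is deterministic, the same production value always yields the same masked value, which is what keeps data consistent across the chain of systems.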
Effective Test Data Management requires not only powerful tooling, but also a well-organised process with clear roles and responsibilities. Organisations that demonstrably succeed with TDM often choose to centralise these activities, for example within a dedicated platform team. From such a central facility, multidisciplinary teams can be optimally supported, while consistency, security, and compliance remain guaranteed.
An efficiently organised TDM process, supported by the right tooling, functions as a powerful accelerator of the software development process, delivering significant cost savings and helping organizations demonstrably comply with laws and regulations.
Want a deeper dive into test data strategies, AI applications, and the DATPROF platform?
Download the full white paper.
Published: 9 April 2026
Author: Bert Nienhuis, Chief Product Officer at DATPROF
This blog is a partner contribution to the “Amplified Quality Engineering” publication.