Quality Engineering Powered by Test Data: Balancing AI Opportunities and Risks

The role of test data management

In the era of generative AI, with enormous breakthroughs in the field of software development, quality is more important than ever. A key pillar of software quality is the data used to test and further develop applications: it is needed not only to validate new functionality, but also to ensure that existing functionality continues to work correctly.

 

Systems rarely stand alone. They are part of complex, interconnected applications and systems, often distributed across different technologies. During integration and chain testing, teams validate whether the data interchange between these systems functions correctly, combining or processing data from multiple sources. Consistent and reliable test data is crucial in this process to establish that the systems continue to work correctly together.

 

Many large organizations have historical data in their production systems that sometimes goes back decades. This data cannot always be recreated through current UI screens or APIs, for example because it originates from migrated legacy systems or because interfaces have changed over time.

 

The process by which this test data is made available to testers and developers in an efficient, compact, and secure manner is called Test Data Management.

Types of test data

Effective testing first and foremost requires test data that is available and usable. In practice, there are roughly two ways to get test data into a system.

 

In the first approach, test data is created through (automated) test cases, where data is entered via the UI screens or interfaces of the application. This often involves using a CSV or Excel file to execute the same test case multiple times with different datasets, a method known as data-driven testing. An important advantage is that creating the test data directly validates whether the functionality being exercised, such as registering a new customer, still works correctly.
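
To make this concrete, a data-driven test might look like the minimal sketch below. It uses Python with pytest and reads its datasets from a CSV file; the file name, column names, and the register_customer function are placeholders invented for illustration, not part of any specific tool.

```python
# data_driven_test.py - minimal sketch of data-driven testing (placeholders invented)
import csv

import pytest

from myapp import register_customer  # hypothetical function under test


def load_cases(path="customers.csv"):
    """Read one test case per CSV row, e.g. first_name,last_name,email."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


@pytest.mark.parametrize("case", load_cases())
def test_register_customer(case):
    # Creating the test data doubles as a check that registration still works.
    customer_id = register_customer(
        first_name=case["first_name"],
        last_name=case["last_name"],
        email=case["email"],
    )
    assert customer_id is not None
```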

 

The challenge with this method is that it is often impossible to replicate all production scenarios in this way, especially when it comes to complex chains of systems. For this reason, organizations regularly fall back on existing data from production systems. In doing so, underlying production databases are copied to test and development environments through backup and restore operations.

 

Working with copies of production databases brings an entirely different set of challenges. Production databases are often very large, sometimes multiple terabytes, and contain vast amounts of data, partly due to long historical accumulation. Transferring and making this data available is not only time-consuming but also costly due to the infrastructure required for test and development environments. Additionally, the limited number of available environments often means that multiple multidisciplinary Agile or DevOps teams must share the same environment. From an efficiency, stability, and team independence perspective, this is far from ideal.

 

And then there is the most important consideration: privacy and confidentiality. Production databases typically contain large amounts of privacy-sensitive and business-sensitive information. You don’t want to simply make this data available in other environments that don’t have the same security measures as production. Moreover, laws and regulations such as GDPR and CCPA, as well as industry standards like ISO 27001 and SOC 2, require that appropriate measures be taken to prevent personal data from being processed for purposes other than those for which it was originally recorded.

Generating test data?

Ideally, representative test data provides a realistic picture of what occurs in production systems, but in a more compact form and preferably completely anonymized or synthetic. Various methods exist to create such safe and manageable test data.

 

Broadly speaking, these methods can be divided into two categories. The first category focuses on anonymizing and potentially subsetting existing production databases. This involves making a representative selection of production data, whereby privacy-sensitive and confidential information is irreversibly removed or modified.

 

The second category consists of synthetically generating test data. This approach has three variants:

  1. Rule-based generation, where data is generated based on user-defined rules and requirements.
  2. AI model-based generation, where an AI model is trained on production data to generate new, statistically representative data.
  3. Generative AI, such as large language models, which are deployed to create test data.

Each method has its own advantages and disadvantages and is not suitable in all situations or for every test data need. The choice of approach depends on factors such as complexity, privacy requirements, representativeness, and the type of testing being performed.

Synthetic Data Generation

Using generative AI, a large language model (LLM) can be deployed to generate test data from natural-language instructions. For example, an instruction can be given to generate twenty rows with customer numbers, first names, last names, birth dates, and email addresses. Based on this instruction, combined with a system prompt, the LLM returns the test data in the desired format.
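
As a sketch of what this looks like in practice, the snippet below sends such an instruction to an LLM via an API. It uses the OpenAI Python client purely as an example; the model name, prompts, and output format are assumptions for illustration.

```python
# llm_testdata.py - illustrative sketch; model name and prompts are assumptions
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

system_prompt = (
    "You generate tabular test data. "
    "Return only CSV with a header row, no explanation."
)
user_prompt = (
    "Generate 20 rows with the columns: customer_number, first_name, "
    "last_name, birth_date (YYYY-MM-DD, always in the past), email_address."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
)
print(response.choices[0].message.content)  # CSV with twenty generated rows
```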

 

Developers and testers can also include the definition of a table, for instance in the form of a DDL statement, and ask the model to generate test data that conforms to this structure. For small and manageable datasets, particularly early in the development and testing process, this approach is highly suitable and efficient.
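
Such an instruction could, for instance, embed the table definition directly in the prompt; the customer table below is invented purely to illustrate the idea.

```python
# Hypothetical prompt that embeds a DDL statement (table structure invented)
ddl_prompt = """
Given this table definition:

CREATE TABLE customer (
    customer_number  INTEGER      PRIMARY KEY,
    first_name       VARCHAR(50)  NOT NULL,
    last_name        VARCHAR(50)  NOT NULL,
    birth_date       DATE         NOT NULL,
    email_address    VARCHAR(100) UNIQUE
);

Generate 20 INSERT statements with realistic, fictitious customers that
conform exactly to these data types and constraints.
"""
```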

 

However, when test data requirements become more complex, limitations arise. Consider dependencies between attributes or extensive requirements per attribute, such as technical requirements (data type, length, precision) and functional constraints (for example, a birth date that falls within a realistic range and is never in the future, or a contract end date that logically follows the start date). In such cases, it becomes considerably more difficult to have an LLM generate consistent and reliable test data.
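
One pragmatic countermeasure is to validate LLM output against such constraints before loading it into a test environment. The sketch below checks a single generated row; the column names and rules are assumptions chosen to match the examples above.

```python
# validate_row.py - minimal constraint check on a generated row (columns assumed)
from datetime import date


def violations(row):
    """Return a list of constraint violations for one generated row."""
    problems = []
    if row["birth_date"] > date.today():
        problems.append("birth_date lies in the future")
    if row["contract_end"] < row["contract_start"]:
        problems.append("contract_end precedes contract_start")
    if len(row["last_name"]) > 50:
        problems.append("last_name exceeds 50 characters")
    return problems


row = {
    "birth_date": date(2031, 5, 17),
    "contract_start": date(2024, 1, 1),
    "contract_end": date(2023, 1, 1),
    "last_name": "Jansen",
}
print(violations(row))  # two violations reported for this row
```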

 

Besides generative AI, more “traditional” AI techniques can also be employed to generate synthetic test data. Using machine learning or deep learning, models can be trained on existing datasets, after which these models are used to create new, synthetic data. The main difference between these approaches is that with deep learning, complex patterns are automatically recognized, whereas with machine learning, the relevant features must be explicitly defined by humans.
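
As a toy illustration of the model-based idea, the sketch below fits a simple statistical model on existing numeric data and samples new rows from it. Real tabular synthesizers are considerably more sophisticated; the columns and "training data" here are invented.

```python
# synth_ml.py - toy illustration of model-based synthetic data (not production-grade)
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

# Pretend this is real numeric training data: [age, yearly_income]
real = np.column_stack([
    rng.normal(45, 12, 1_000),
    rng.normal(38_000, 9_000, 1_000),
])

# Fit a simple statistical model on the "real" data ...
model = GaussianMixture(n_components=3, random_state=0).fit(real)

# ... and sample new, synthetic rows that follow the same distribution.
synthetic, _ = model.sample(200)
print(synthetic[:3])
```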

 

In both cases, the quality of the training data is decisive, and this is precisely where the greatest challenge lies. For small-scale but complex datasets, for example consisting of one or two tables for research purposes, this approach is well applicable. However, when these techniques are deployed for large enterprise databases with hundreds of tables and datasets ranging from gigabytes to terabytes, they currently prove insufficiently scalable.

 

Apart from the very expensive hardware required to train such models, the training time needed is a particularly limiting factor. Benchmarks show that training a model on a dataset of one million rows, linked to a related table of five thousand rows, can already take between fifteen and ninety hours. This makes it practically unfeasible to scale this approach to data models with hundreds of tables and millions to billions of records.

 

In addition to generating synthetic test data using AI or generative AI, rule-based generation can also be applied. In this approach, testers or developers define explicit requirements that the test data must comply with. Data generators are then deployed to generate test data based on these requirements, in some cases even directly within the database.
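
A minimal sketch of rule-based generation is shown below, using Python with the open-source Faker library for realistic values and explicit rules for the rest. The columns, locale, and rules are illustrative choices, not a prescription.

```python
# rulebased_gen.py - sketch of rule-based test data generation (columns illustrative)
import csv

from faker import Faker

fake = Faker("nl_NL")  # locale-specific names
Faker.seed(1)          # reproducible runs

with open("customers_generated.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_number", "first_name", "last_name", "birth_date", "email"])
    for customer_number in range(100_000, 100_500):
        first = fake.first_name()
        last = fake.last_name()
        writer.writerow([
            customer_number,                                     # rule: unique sequential key
            first,
            last,
            fake.date_of_birth(minimum_age=18, maximum_age=90),  # rule: adult, never in the future
            f"{first}.{last}".replace(" ", "").lower() + "@example.com",  # rule: derived from the name
        ])
```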

 

The major advantage of this approach is the clear specification of test data requirements and the high speed at which large volumes of data can be generated directly in the systems. This makes rule-based generation particularly suitable for large-scale and repeatable test scenarios. The disadvantage, however, is that this method, like generative AI, requires sufficient knowledge of the application and its test data requirements. Moreover, configuring data generators for a complete database requires considerable manual work. Generating representative test data for an entire enterprise chain landscape is therefore labor-intensive and requires in-depth domain and application knowledge, which is not always available.

Data Masking & Subsetting

In practice, the most commonly used and most efficient approach proves to be a combination of multiple techniques. By anonymizing production databases, it is ensured that personal data can no longer be traced back to individual persons and that business-sensitive information is not unnecessarily exposed. The major advantage of anonymized data is that existing production data can largely be reused and chain consistency is maintained, while only specific attributes need to be modified to prevent identification.
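
As a small sketch of what such anonymization can look like at the database level: identifying attributes are replaced with deterministic substitutes, so that the same original value always maps to the same masked value and consistency across systems is preserved. The table, columns, and seed list below are invented, and dedicated tooling handles this far more completely.

```python
# mask_customers.py - toy masking sketch; table, columns, and seed list invented
import hashlib
import sqlite3

SEED_LAST_NAMES = ["Jansen", "de Vries", "Bakker", "Visser", "Smit"]


def masked_last_name(original: str) -> str:
    """Deterministic substitution: identical input always yields identical output."""
    digest = hashlib.sha256(original.encode("utf-8")).digest()
    return SEED_LAST_NAMES[digest[0] % len(SEED_LAST_NAMES)]


conn = sqlite3.connect("testdata.db")  # assumed local copy with a customer table
for customer_number, last_name in conn.execute(
    "SELECT customer_number, last_name FROM customer"
).fetchall():
    conn.execute(
        "UPDATE customer SET last_name = ?, email = ? WHERE customer_number = ?",
        (
            masked_last_name(last_name),
            f"customer{customer_number}@example.com",  # no longer traceable to a person
            customer_number,
        ),
    )
conn.commit()
conn.close()
```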

 

By combining anonymization with subsetting and/or virtualizing databases, significant time and cost savings can be achieved. Smaller test data environments enable teams to have their own independent environment, without mutual dependencies. With database virtualization, environments can also be cloned, snapshotted, and restored virtually in real time. This significantly simplifies the test data process, whereas previously complete backups had to be manually restored by database administrators.
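
Subsetting itself can be pictured as follows: take a fraction of a parent table and only the child records that belong to it, so that every foreign key in the subset still resolves. The sketch below assumes two SQLite databases with identical, four-column customer and orders tables; the schema is invented for illustration.

```python
# subset_db.py - simplified subsetting sketch; schema and databases invented
import sqlite3

src = sqlite3.connect("production_copy_anonymized.db")  # anonymized source
dst = sqlite3.connect("subset.db")                      # small target environment

# 1. Take roughly 1% of customers as the starting set.
customers = src.execute(
    "SELECT * FROM customer WHERE customer_number % 100 = 0"
).fetchall()
dst.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", customers)

# 2. Copy only the orders belonging to the selected customers,
#    so referential integrity in the subset is preserved.
orders = src.execute(
    "SELECT o.* FROM orders o"
    " JOIN customer c ON c.customer_number = o.customer_number"
    " WHERE c.customer_number % 100 = 0"
).fetchall()
dst.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", orders)

dst.commit()
```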

 

When anonymized and potentially subsetted production data does not cover all test scenarios, it can be specifically supplemented with synthetic test data. This creates a flexible, scalable, and secure test data approach that is both representative and compliant.

AI Opportunities

In addition to using AI for generating test data, AI also offers opportunities to increase productivity and insight and to accelerate processes. In many solutions, AI is already being integrated as an assistant to make complex tasks that require specific knowledge more accessible and efficient.

 

For example, AI can be used to generate country- or domain-specific seed values that serve as replacements for privacy-sensitive data. Additionally, AI can provide support in generating database-specific SQL to anonymize data.
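
Purely as an illustration, such assistance could be requested with prompts along these lines; the wording and the Oracle-flavored SQL request are assumptions, not the output of any specific tool.

```python
# Illustrative prompts for AI-assisted test data tasks; wording is an assumption
seed_prompt = (
    "Give me 100 common Dutch last names, one per line, without numbering. "
    "They will be used as replacement values when anonymizing a customer table."
)
sql_prompt = (
    "Write an Oracle UPDATE statement that replaces customer.email with "
    "'customer' || customer_number || '@example.com' for every row."
)
# Either prompt can be sent with the same client pattern shown earlier.
```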

 

(Generative) AI can thus not only fulfill a primary role in producing representative and secure test data, but also play a supporting one.

DATPROF – Test Data Simplified

For almost twenty years, we at DATPROF have been developing solutions in the field of test data, primarily for organizations with complex system landscapes. Think of governments, banks, insurance companies, and pension administration organizations. During this period, we have developed a modular test data platform that addresses the above challenges in the simplest and most manageable way possible.

 

Over the years, we have seen that Test Data Management (TDM) not only requires powerful and flexible software solutions, but also a well-organized organizational process, with clear roles and responsibilities. The far-reaching decentralization of IT into multidisciplinary teams brings many advantages, but also introduces challenges. Specialist knowledge, for example for designing and maintaining anonymization, subsetting, or synthetic data generation solutions, is often limited or not structurally available within these teams.

 

Organizations that are demonstrably successful with TDM therefore often choose to centralize these activities, for example within a dedicated IT-for-IT or platform team. From such a central facility, teams can be optimally supported, while consistency, security, and compliance remain guaranteed.

 

An efficiently organized TDM process, supported by the right tooling, not only functions as a powerful accelerator of the software development process, but also delivers significant cost savings and helps organizations demonstrably comply with laws and regulations.

 

Published: 9 April 2026
Author: Bert Nienhuis, Chief Product Officer at DATPROF

This blog is a partner contribution to the “Amplified Quality Engineering” publication.
