Test data management | DevOps

DevOps teams need to do many varieties of testing. Some test varieties (such as performance testing) need a high volume of test data but the exact data doesn’t matter that much. Other test varieties (such as end-to-end testing) require a relatively limited set of test data but the values of the data must be carefully aligned across various systems, possibly even across multiple organizations.

This topic gives an overview of test data management (TDM): a set of practices that support the test process and are based on the ISO/IEC 27000 information security standards and the General Data Protection Regulation (GDPR).

Implementing the right test data management practices is key to realizing significant time and efficiency gains in quality assurance and testing. Provisioning proper test data is also one of the main bottlenecks to achieving continuous testing in DevOps. According to the Continuous Testing Report (CTR) survey, 55% of respondents currently spend between 30 and 60 percent of their total testing time on test data management activities (for actual reports refer to www.tmap.net). This is an inordinate amount of time, and most organizations have realized that addressing this one area will dramatically improve the speed and efficiency of the entire software development lifecycle. Such efficiency considerations, along with legal requirements such as GDPR and privacy and security concerns, are driving important changes in test data management.

What is test data?

Test data relates to all data(sets) and data rules used in the process of test activities. This includes but is not limited to:

Source data sets
Data created in, or sourced from other systems into, the testing environment. This includes any backups that are kept for auditing or restore purposes.
Data access and conversion rules and supporting data sets
If production data is used in testing, there will be additional rules on data access and access to information related to the process of converting this data to test data.
Data used in test scripts/execution
Specific data references used in test execution as well as data that is being consumed during testing have to be managed.
Data flows to external systems in the testing chain
Any data produced by the test activities that may end up in stubbed or virtualized systems (or dead letter queues) has to be managed, monitored and, if needed, deleted.
Test reporting
Any test data that is visualized in test reporting (from debug logs to final test reports) needs to be managed according to the agreed upon test data processes.

What is test data management?

The test data management process establishes whether an IT system complies with the relevant data requirements (security, privacy, etc.). Dealing with test data correctly has become even more important with the introduction of the GDPR, which can lead to legal and financial repercussions when issues occur with regard to the use of personal data. A proper test data management process must consider both the organizational and the technical aspects of test data management:

The organizational aspects of test data management
Organizations and teams need a well-planned test data strategy to be able to fully achieve continuous testing. They will need to have their test data model documented and – if possible – linked to a test data management tool.
The technical aspects of test data management
Over the last few years, there has been an increase in the use of commercial TDM tools that provide data subsetting and synthetically generated data. There has also been a steady decrease in the use of production data for testing. These trends are likely to strengthen as we move forward. For a current list of tools, see www.tmap.net. Well-known suppliers of TDM tools are: DATPROF, IBM, EPI-USE, Broadcom, Informatica and Delphix.

The most popular method of provisioning data is to reuse existing test data without any changes. The problem with this approach is that it often leads to inadequate test coverage, testing inefficiency, and mounting compliance and security risks. Reusing test data sets also leads to issues such as aging timestamps and date fields: dates that were in the future when the data set was created gradually move into the past.
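One common way to counter the aging of dates in a reused data set is to shift every date field by the age of the data set. The following is a minimal sketch of that idea, not a description of any particular TDM tool. The function name `shift_dates`, the field names and the sample record are hypothetical.

```python
from datetime import date

def shift_dates(rows, date_fields, captured_on, today=None):
    """Return copies of rows with the given date fields moved forward
    by the number of days that passed since the data set was captured."""
    today = today or date.today()
    offset = today - captured_on  # age of the data set as a timedelta
    shifted = []
    for row in rows:
        new_row = dict(row)  # leave the original record untouched
        for field in date_fields:
            if new_row.get(field) is not None:
                new_row[field] = new_row[field] + offset
        shifted.append(new_row)
    return shifted

# Hypothetical reused data set, captured on 15 January 2023.
orders = [{"id": 1, "ordered_on": date(2023, 1, 10), "due_on": date(2023, 1, 24)}]
fresh = shift_dates(
    orders,
    ["ordered_on", "due_on"],
    captured_on=date(2023, 1, 15),
    today=date(2024, 3, 1),
)
```

Because every field moves by the same offset, intervals between dates (here the 14 days between order and due date) are preserved. Rules tied to weekdays or month ends would need extra handling.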

Another approach to generating test data is to directly copy data from production environments. Of course, this reliance on production data has been decreasing over time, with GDPR acting as an important catalyst for the switch to masking, subsetting and synthesizing of data.

Masking
The masking of test data uses existing data as a starting point. This usually means production data that requires one or more changes to conform to the specific data rules for the situation at hand. The requirements per data field determine the masking action used, ranging from scrambling (randomly replacing a username) to depersonalizing (changing a name to “Test User 12345”) or rule-based changes (a fictional bank account number still has to adhere to certain rules).
Subsetting
A full production-sized data set (already masked) may be required for performance testing. Smaller data (sub) sets can then be produced that still contain enough of the required data sets for test executions in development or temporary environments in a CI/CD context. This will not only speed up test preparations in those environments but will also result in significant cost reductions regarding environment sizing and maintenance.
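Masking and subsetting can be illustrated with a small sketch. It assumes in-memory records with hypothetical field names (`id`, `username`, `name`, `customer_id`). Real TDM tools work on databases and offer far more masking actions. The subsetting step keeps only the orders that belong to the selected customers, so the smaller set stays internally consistent.

```python
import random
import string

def scramble(value, rng):
    """Scrambling: replace the value with random letters of the same length."""
    return "".join(rng.choices(string.ascii_lowercase, k=len(value)))

def depersonalize(record_id):
    """Depersonalizing: replace a name with an obviously fictional one."""
    return f"Test User {record_id}"

def mask_customer(customer, rng):
    """Return a masked copy of one customer record."""
    masked = dict(customer)
    masked["username"] = scramble(customer["username"], rng)
    masked["name"] = depersonalize(customer["id"])
    return masked

def subset(customers, orders, fraction, rng):
    """Keep a random fraction of the customers plus only their orders."""
    keep = max(1, int(len(customers) * fraction))
    chosen = rng.sample(customers, keep)
    chosen_ids = {c["id"] for c in chosen}
    return chosen, [o for o in orders if o["customer_id"] in chosen_ids]

# Hypothetical source data: 10 customers with 3 orders each.
rng = random.Random(42)  # fixed seed so runs are repeatable
customers = [{"id": i, "username": f"user{i}", "name": f"Real Person {i}"}
             for i in range(1, 11)]
orders = [{"order_id": 100 + i, "customer_id": (i % 10) + 1} for i in range(30)]

masked = [mask_customer(c, rng) for c in customers]
small_customers, small_orders = subset(masked, orders, 0.2, rng)
```

Masking before subsetting mirrors the order described above: the full masked set can serve performance tests, while the subset serves smaller environments.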

Synthetic data

Generating test data with rules
Even with masking or subsetting, data security considerations can still block the use of production data as a source. In addition, a newly developed system has no existing production data to draw upon. If a full data model is available and can be translated into test data management rules, completely fictional (synthetic) data can be produced. This requires detailed data requirements to be produced as part of test specification.
Generating test data using artificial intelligence
Data generation can also be performed using data intelligence / artificial intelligence. With this approach, data is created completely synthetically, following the structure of the original data. This synthetic test data always complies with GDPR rules because no original data was involved during the generation of the data.
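Rule-based generation can be sketched as follows. The data model (field names, name lists, allowed statuses and the date-of-birth range) is entirely hypothetical and stands in for the detailed data requirements mentioned above. The `.test` top-level domain is reserved for testing, so the generated e-mail addresses cannot reach a real mailbox.

```python
import random
from datetime import date, timedelta

# Hypothetical value lists; none of these come from real data.
FIRST_NAMES = ["Alex", "Sam", "Robin", "Kim"]
LAST_NAMES = ["Testa", "Sample", "Fictief", "Mock"]
STATUSES = ["active", "closed"]

def generate_customer(rng, customer_id):
    """Build one fictional customer that follows the assumed data rules."""
    # Rule: date of birth lies between 1950-01-01 and the end of 1999.
    birth = date(1950, 1, 1) + timedelta(days=rng.randrange(365 * 50))
    return {
        "id": customer_id,  # rule: unique, sequential
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "date_of_birth": birth,
        "email": f"user{customer_id}@example.test",
        "status": rng.choice(STATUSES),
    }

def generate_customers(count, seed=0):
    """Generate a repeatable data set: the same seed gives the same records."""
    rng = random.Random(seed)
    return [generate_customer(rng, i) for i in range(1, count + 1)]

synthetic_customers = generate_customers(5, seed=7)
```

Seeding the generator makes the data set reproducible, which helps when a failing test has to be rerun against exactly the same data.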

Which data items?

Many data items can be masked. Table 31.1 gives a non-exhaustive list of general personal data. When masking these data, you can distinguish between data that can be scrambled (e.g. names) and data that can be masked based on rules. For rule-based masking, you can think of:

Masking based on translation tables (e.g. address in Amsterdam is transferred to Rotterdam with matching street name and zip code).
Masking by adding an extra mask action (if after a name scramble, the name can still look like a real person, you may want to add a word like “TEST” to the surname).
Masking by applying ranges (set date of birth randomly to first or tenth of the month).
Masking by applying special functions (e.g. 11 and 13 proof, etc.).
Instead of masking, replace data completely with dummy info (e.g. video, license plate), provided that, for example, invented license plates are allowed within the application.
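Several of the rule-based masking actions above can be sketched in a few lines. All names and values here are hypothetical, including the street and postal code in the translation table. For the 11-proof, the sketch assumes the 9-digit variant with weights 9 down to 2 and a final weight of -1, as used for Dutch citizen service numbers. The text does not say which variant applies, and other identifiers use different weights.

```python
import random
from datetime import date

# Translation table with illustrative, unverified address values.
ADDRESS_TRANSLATION = {
    "Amsterdam": {"city": "Rotterdam", "street": "Voorbeeldstraat", "postal_code": "3000 AA"},
}

def translate_address(address):
    """Translation table: move an address to another city with matching fields."""
    replacement = ADDRESS_TRANSLATION.get(address["city"])
    return {**address, **replacement} if replacement else dict(address)

def add_test_marker(surname):
    """Extra mask action: make a scrambled surname visibly fictional."""
    return f"{surname} TEST"

def mask_birth_date(birth_date, rng):
    """Range rule: set the day of birth to the first or tenth of the month."""
    return birth_date.replace(day=rng.choice([1, 10]))

def passes_eleven_proof(number):
    """11-proof for the assumed 9-digit variant (weights 9..2, then -1)."""
    if len(number) != 9 or not number.isdigit():
        return False
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    return sum(w * int(d) for w, d in zip(weights, number)) % 11 == 0

def generate_eleven_proof_number(rng):
    """Draw random 9-digit numbers until one passes the 11-proof."""
    while True:
        candidate = "".join(str(rng.randrange(10)) for _ in range(9))
        if passes_eleven_proof(candidate):
            return candidate

rng = random.Random(1)
masked_number = generate_eleven_proof_number(rng)
masked_dob = mask_birth_date(date(1985, 7, 23), rng)
masked_address = translate_address(
    {"city": "Amsterdam", "street": "Damrak", "postal_code": "1012 LG"}
)
```

Generating a fresh number that satisfies the check, rather than scrambling digits, keeps the masked value acceptable to application-level validation.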

Examples of personal data.

Address	Date of birth	Phone number(s)
Bank account number	Debit card number	Photo
BC number	Driving license number	Place of residence
Cadastral number	E-mail address	Postal code
CC / KvK number	Employee number	RSIN (Fiscal number)
Comments/Description	IBAN	Signature
Commercial e-mail address	Identity card number	Social security number
Commercial phone number	IMEI number	Tax Identification Number
Contact person	IP address (dynamic)	User ID
Contract number	MAC address	Vehicle registration plate
Copy of ID document	Notary	Video
Credit card number	Official name	X-ray picture

Other files that are eligible for masking include binary files, such as audio, images, and video. When masking these, the desired use and required quality must be taken into account.

Test data management practices

A complete test data management process will cover the following aspects:

Data knowledge
The greatest challenge seen by organizations is the difficulty and time involved in extracting data spread across multiple databases. The problem becomes even more acute for larger organizations with more complex infrastructures. An accurate metadata model is needed, containing up-to-date information on various databases spread across different systems.
Data access
Another challenge is that testing teams have limited access to production systems and are dependent on database administrators to get the data they need. This is, of course, as it should be! Production systems are business-critical and should not be widely accessible. The test data management process needs to provide the rules and infrastructure for data to move and be transformed from the source to the test environments.
Data management
Organizations need to keep updating their test data sets (which may consist of masked data and synthesized data) every time there is a change in any of the associated databases (TDM impact analysis). The adoption of tools synchronized to various applications and databases – coupled with the right processes – can help organizations capture these database changes in real time and automatically update the relevant test data sets.

This is especially relevant when maintaining test data for large end-to-end tests, where keeping the data consistent across several interconnected systems is a major challenge.

These aspects influence all test data management practices regarding initial test data creation, management and impact analysis of changes. The complexity of this task requires automation where possible, using tools that can synchronize across all of the organization’s data sets.
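A TDM impact analysis depends on a metadata model that records which tables and columns each test data set draws on. The sketch below assumes such a model is available as a simple mapping. The data set names, table names and the `impacted_data_sets` function are hypothetical, and a real tool would derive the list of changed columns from the databases themselves.

```python
def impacted_data_sets(metadata_model, changed_columns):
    """Return, per test data set, the changed columns it depends on.

    metadata_model maps a data set name to a set of (table, column) pairs.
    changed_columns is an iterable of (table, column) pairs that changed.
    """
    changed = set(changed_columns)
    return {
        name: sorted(columns & changed)
        for name, columns in metadata_model.items()
        if columns & changed
    }

# Hypothetical metadata model for two test data sets.
metadata_model = {
    "e2e_order_flow": {("crm.customer", "email"), ("billing.invoice", "iban")},
    "perf_catalog": {("shop.product", "price")},
}

# Hypothetical schema change detected in the billing system.
impact = impacted_data_sets(metadata_model, [("billing.invoice", "iban")])
```

In this example only `e2e_order_flow` needs to be regenerated, which is the kind of targeted update the process aims for.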

For initial test data management activities and periodic maintenance, the following stakeholders are involved:

Project/organization
Test process
All test activities will impact TDM. The impact ranges from straightforward requirements for the data needed for test execution to the number of test databases required and the frequency of updates/changes to those test databases.

Based on project developments and the subsequent test impact analysis, a TDM impact analysis will cover the following practices:

Demand validation
Based on the data requirements all preparation for TDM is done.
Design verification
The TDM design is validated to be an accurate representation of production data and to cover all test requirements.
Script synchronization
This step is required when TDM covers two or more interacting systems. It can vary from deploying the required data to all systems in one go to a step-by-step process over multiple systems/environments.
Delivery timing
Delivery of data is dependent on the project planning and can be influenced by existing test activities (no downtime windows available), business process cycles (TDM cannot run during regular batch windows or for instance end-of-month processes that should run on the environments) or test scenario aspects (if a two-week business process needs to be simulated, TDM changes can only occur in between cycles).
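The timing constraints above amount to checking a planned data delivery against a list of blocked windows. The following sketch shows one way to do that. The window names and times are hypothetical, and a real planning process would take them from the project and environment calendars.

```python
from datetime import datetime

def conflicts(refresh_start, refresh_end, blocked_windows):
    """Return the names of blocked windows that overlap the planned refresh.

    Each blocked window is a (name, start, end) tuple. Two intervals overlap
    when each one starts before the other one ends.
    """
    return [
        name
        for name, start, end in blocked_windows
        if refresh_start < end and start < refresh_end
    ]

# Hypothetical blocked windows on a test environment.
blocked = [
    ("nightly batch", datetime(2024, 5, 30, 22, 0), datetime(2024, 5, 31, 2, 0)),
    ("end-of-month run", datetime(2024, 5, 31, 18, 0), datetime(2024, 6, 1, 6, 0)),
]

# A refresh from 01:00 to 03:00 collides with the nightly batch.
clash = conflicts(datetime(2024, 5, 31, 1, 0), datetime(2024, 5, 31, 3, 0), blocked)

# A refresh from 03:00 to 17:00 fits between the two windows.
free = conflicts(datetime(2024, 5, 31, 3, 0), datetime(2024, 5, 31, 17, 0), blocked)
```

Long-running test scenarios, such as a simulated two-week business process, can be modelled as additional blocked windows.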