Data Subsetting

There are a number of reasons why testing environments do not need the same data volumes as production environments:

  • Data storage and infrastructure costs that grow with data volume
  • Processing time and labor costs of transforming huge volumes of production data for the testing environment
  • Processing time and labor costs of maintaining data integrity
  • Processing time and labor costs of provisioning test data for specific testing scenarios

Data subsetting is one of the key concepts of the BizDataX platform. During data subsetting, BizDataX preserves referential integrity and provides an easy-to-understand, efficient mechanism for filtering out unneeded source data.
Data subsetting, data masking, and synthetic data generation can all be used in the same workflow.

The BizDataX platform handles data subsetting in much the same way as data masking. When a subset of data is needed, suppression rules can be used in combination with conditional masking.

Figure 40: Using a suppression rule in combination with conditional branching to suppress certain records

Suppression behaves according to context. If existing records are being processed, suppression maps to deletion of the records (in the relational world, this means that a DELETE statement is executed for them). If new records are being processed, suppression effectively skips creating the records (in the relational world, this means that the INSERT statement for them is not executed).
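The two behaviors can be illustrated with a minimal sketch. This is not BizDataX code: it uses Python's standard `sqlite3` module, and the `customer` table, the `is_suppressed` condition, and both helper functions are hypothetical names chosen for illustration only.

```python
import sqlite3

def is_suppressed(row):
    """Hypothetical suppression condition: keep only customers from 'HR'."""
    return row[1] != "HR"

def suppress_existing(conn):
    """Existing records: suppression maps to a DELETE for each suppressed row."""
    rows = conn.execute("SELECT id, country FROM customer").fetchall()
    doomed = [(row[0],) for row in rows if is_suppressed(row)]
    conn.executemany("DELETE FROM customer WHERE id = ?", doomed)

def insert_new(conn, rows):
    """New records: the INSERT for a suppressed row is simply never executed."""
    kept = [row for row in rows if not is_suppressed(row)]
    conn.executemany("INSERT INTO customer (id, country) VALUES (?, ?)", kept)
    return len(kept)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, country TEXT)")
conn.executemany(
    "INSERT INTO customer (id, country) VALUES (?, ?)",
    [(1, "HR"), (2, "DE"), (3, "HR")],
)

suppress_existing(conn)                              # deletes row 2
inserted = insert_new(conn, [(4, "HR"), (5, "FR")])  # row 5 is skipped
remaining_ids = [r[0] for r in conn.execute("SELECT id FROM customer ORDER BY id")]
```

In both cases the same condition decides which records are dropped; only the resulting database action differs.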

The rules for the remaining records can be defined in a default branch, so masking and suppression can be combined and processed in a single pass through the records.


Figure 41: Suppression combined with masking
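Conceptually, combining the two in a single pass amounts to one loop with a conditional branch and a default branch. The sketch below is an illustration in plain Python, not the BizDataX workflow itself; `subset_and_mask`, the `active` field, and the masking rule are all assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator

Record = dict

def subset_and_mask(
    records: Iterable[Record],
    suppress_if: Callable[[Record], bool],
    mask_default: Callable[[Record], Record],
) -> Iterator[Record]:
    """Walk the records once, suppressing some and masking the rest."""
    for record in records:
        if suppress_if(record):
            continue                  # conditional branch: record is suppressed
        yield mask_default(record)    # default branch: record is masked

customers = [
    {"id": 1, "name": "Ana", "active": True},
    {"id": 2, "name": "Ivan", "active": False},
    {"id": 3, "name": "Marko", "active": True},
]

result = list(
    subset_and_mask(
        customers,
        suppress_if=lambda r: not r["active"],  # hypothetical condition
        mask_default=lambda r: {**r, "name": f"Customer {r['id']}"},  # hypothetical mask
    )
)
```

Because each record is visited exactly once, no separate filtering step is needed before masking.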

Option: A random sample of records can be extracted using the “Sample masking” activity. The activity expects the sample size as an input parameter, given either as a total number of records or as a percentage. The example below shows how to create a sample of 10% of the total records, in which the sampled records are anonymized and the other records are suppressed.

Figure 42: Suppression combined with random sampling
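The effect of sampling combined with suppression can be approximated in a few lines of Python. This is only a sketch of the idea, not the “Sample masking” activity's actual implementation: the function name, its `size`/`percent`/`seed` parameters, and the masking rule are hypothetical.

```python
import random

def sample_and_mask(records, mask, size=None, percent=None, seed=None):
    """Anonymize a random sample of the records and suppress all others."""
    if (size is None) == (percent is None):
        raise ValueError("give exactly one of size or percent")
    total = len(records)
    if percent is not None:
        # Convert the percentage to a whole number of records.
        size = round(total * percent / 100)
    size = min(size, total)
    rng = random.Random(seed)
    chosen = set(rng.sample(range(total), size))
    # Records outside the sample are suppressed, i.e. never emitted.
    return [mask(r) for i, r in enumerate(records) if i in chosen]

records = [{"id": i, "name": f"Person {i}"} for i in range(200)]
sampled = sample_and_mask(
    records,
    mask=lambda r: {**r, "name": "XXXX"},  # hypothetical anonymization rule
    percent=10,
    seed=42,
)
```

With 200 input records and a 10% sample, 20 anonymized records are returned and the remaining 180 are dropped.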

See Enforcing Referential Integrity for more information about how to take care of referential integrity and Processing Related Records for more information about how to process related records. Both concepts can be combined with suppression.