Modeling Guide

Change Default Data Mask Settings

Set the default format and language when the data is ambiguous, set the seed value to ensure referential integrity, and enable anonymization to place matching records in a group.

Context

When the input date data is vague or ambiguous, the Data Mask node will output the format and language you specify here. For example, if your Last_Updated column has the date "2016-04-12", depending on the date format for the country, the date could be April 12, 2016, or it could be December 4, 2016. Setting the default Date format to Year Day Month ensures that the output data refers to December 4, 2016.

When you want to maintain referential integrity, set the Seed option. The data is still masked, but in a way that produces the same values each time the data is output. Suppose you are masking the Customer_ID value and want each ID to be randomized on output. You can use any combination of numbers and characters to create an identifiable seed value such as Region9_Cust. This value is not output; it only ensures that the output data is consistent each time the flowgraph is run. For example, suppose you are running a Numeric Variance with a Fixed Number and have set the Variance option to 5:

Input data    Valid output range
2550          2545-2555
3000          2995-3005
5500          5495-5505
After the first run, let's say the output data is:
Output data after initial processing
2552
3001
5505
With the seed value set, subsequent runs keep the same output for each record. Without the seed value, the output is randomized again on every run.
Output after the second run    Output after the second run
with the seed value set        without the seed value
2552                           2554
3001                           2998
5505                           5497
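
The effect of the seed can be illustrated with a minimal Python sketch. It is not the Data Mask node's implementation: the function name `masked_value` and the hashing scheme are assumptions, chosen only to show how a seed makes a random offset repeatable for a given input value.

```python
import hashlib
import random

def masked_value(value, variance, seed=None):
    """Shift value by a random integer offset in [-variance, +variance].

    With a seed, the offset is derived from the seed and the input value,
    so the same input always yields the same output. Without a seed, the
    offset is drawn fresh on every call. (Hypothetical helper.)
    """
    if seed is None:
        rng = random.Random()  # unseeded: new randomness on each call
    else:
        digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
        rng = random.Random(int.from_bytes(digest[:8], "big"))
    return value + rng.randint(-variance, variance)

inputs = [2550, 3000, 5500]
first_run = [masked_value(v, 5, seed="Region9_Cust") for v in inputs]
second_run = [masked_value(v, 5, seed="Region9_Cust") for v in inputs]
unseeded_run = [masked_value(v, 5) for v in inputs]
```

Here `first_run` and `second_run` are identical, while `unseeded_run` may differ on every execution. All three stay within 5 of the input values.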

When you want to publish a certain amount of sensitive information while maintaining an individual's or organization's privacy, you can anonymize the output. By entering a minimum group number value, you ensure that every group that is output contains at least that many records. Records in groups that fall below the threshold are not output. For example, if you have credit card transaction data and have masked the card holder names and account numbers, and generalized the date and amount, you can publish information that shows that a certain number of people spent $5000 in the month of December without identifying any individual.
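
The thresholding idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `anonymize`, the record layout, and the field names `month` and `amount_band` are hypothetical, and the 2 to 100 range check mirrors the range documented for the Minimum rows in anonymized group option.

```python
from collections import defaultdict

def anonymize(records, key_fields, min_rows):
    """Keep only records whose group has at least min_rows members.

    Records are grouped by the values of key_fields (the generalized
    columns). Hypothetical helper, not the product's algorithm.
    """
    if not 2 <= min_rows <= 100:
        raise ValueError("min_rows must be between 2 and 100")
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[f] for f in key_fields)].append(record)
    # Suppress every group that is too small to publish safely.
    return [r for group in groups.values() if len(group) >= min_rows for r in group]

# Already-masked and generalized transactions (made-up sample data).
transactions = [
    {"name": "XXXX", "month": "December", "amount_band": "5000"},
    {"name": "XXXX", "month": "December", "amount_band": "5000"},
    {"name": "XXXX", "month": "December", "amount_band": "5000"},
    {"name": "XXXX", "month": "November", "amount_band": "1000"},
]
published = anonymize(transactions, ["month", "amount_band"], min_rows=2)
```

With `min_rows=2`, the three December records are published as a group and the single November record is withheld.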

Procedure

  1. Open the Data Mask node.
  2. Click Default Settings.
  3. Set the options to your preferences, and then click Apply.

Results

Tab     Option         Description
Date    Date format

Specifies the order in which month, day, and year elements appear in the input string.

The software uses this value only when the day, month, or year in the input string is ambiguous. Choose one of these formats:
  • Day Month Year
  • Month Day Year
  • Year Day Month
  • Year Month Day

The Default Date Format matters when the date string is ambiguous. In English, when the input string is 2014/02/01, the parser can’t determine whether “02” or “01” is the month, so it relies on the Default Date Format setting for clarification. If the user sets the Default Date Format to Year_Day_Month, the software parses the string as January 2, 2014. However, if the Default Date Format is Year_Month_Day, the software parses the string as February 1, 2014.

The Default Date Format is not always needed. In English, when the input string is 2014/31/12, the software can parse the string as December 31, 2014, even though the user set the Default Date Format to Month_Day_Year, because 31 can only be a day and 2014 can only be a year.
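
The two examples above can be reproduced with a small Python sketch. It assumes slash-separated numeric input and a hypothetical function `parse_date`; it illustrates the fallback logic only and is not the software's actual parser. The default order is consulted only when more than one order yields a valid date.

```python
from datetime import date

# The four orders from the list above, as field sequences.
ORDERS = {
    "Day Month Year": ("d", "m", "y"),
    "Month Day Year": ("m", "d", "y"),
    "Year Day Month": ("y", "d", "m"),
    "Year Month Day": ("y", "m", "d"),
}

def parse_date(text, default_order):
    """Parse 'a/b/c', using default_order only if the string is ambiguous."""
    parts = [int(p) for p in text.split("/")]
    candidates = {}
    for name, order in ORDERS.items():
        fields = dict(zip(order, parts))
        try:
            candidates[name] = date(fields["y"], fields["m"], fields["d"])
        except ValueError:
            continue  # this order does not produce a real calendar date
    distinct = set(candidates.values())
    if len(distinct) == 1:
        return distinct.pop()  # unambiguous: the default is not needed
    if default_order in candidates:
        return candidates[default_order]
    raise ValueError(f"cannot parse {text!r} with default {default_order!r}")

ambiguous_ydm = parse_date("2014/02/01", "Year Day Month")
ambiguous_ymd = parse_date("2014/02/01", "Year Month Day")
unambiguous = parse_date("2014/31/12", "Month Day Year")
```

In this sketch `ambiguous_ydm` is January 2, 2014, `ambiguous_ymd` is February 1, 2014, and `unambiguous` is December 31, 2014 regardless of the default.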

Date    Month format
Specifies the format in which the randomized month is output when the software cannot determine the output month format based on the input alone:
  • Full: Output the long version of the month name.
  • Short: Output the abbreviated form of the month name, when an abbreviated form exists.

For example, let's say that, in English, the software randomizes an input date of 2015/05/05 to a randomized output date of 2015/03/22. Because the month name “May” is the same in its full and abbreviated forms, the input does not indicate which form to use, so the software relies on the Default Month Format setting to determine the output format for the month. When this option is set to Full, the software outputs “March” for the month. When this option is set to Short, the software outputs “Mar”.
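
A minimal Python sketch of this decision follows. The function `output_month_name` and the three-letter abbreviation rule are assumptions made for illustration; the software's own abbreviation tables may differ.

```python
FULL = ["January", "February", "March", "April", "May", "June",
        "July", "August", "September", "October", "November", "December"]
SHORT = [name[:3] for name in FULL]  # assumed abbreviations; "May" stays "May"

def output_month_name(input_month_name, new_month, default_format):
    """Name the randomized month (1-12) in the same style as the input.

    If the input style cannot be inferred (for example "May"),
    fall back to default_format, which is "Full" or "Short".
    """
    is_full = input_month_name in FULL
    is_short = input_month_name in SHORT
    if is_full and not is_short:
        use_full = True
    elif is_short and not is_full:
        use_full = False
    else:
        use_full = default_format == "Full"  # ambiguous: use the default
    return (FULL if use_full else SHORT)[new_month - 1]

with_full_default = output_month_name("May", 3, "Full")
with_short_default = output_month_name("May", 3, "Short")
```

An input such as "June" or "Jun" reveals its own style, so in this sketch the default is ignored for those inputs.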

Date    Language

Specifies the language that the software should use when determining the output of an ambiguous input month string. The default language is English.

Example: The software cannot determine whether an input date like Abril 26, 2014 is in Spanish or Portuguese, because the month name is spelled the same way in both languages. Therefore, it uses the Default Language value as the language of the randomized output month name.
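
The language fallback can be sketched in Python as follows. The lookup table covers only three languages and the function `output_language` is hypothetical; the sketch shows the decision rule, not the software's language detection.

```python
MONTH_NAMES = {
    "English": ["january", "february", "march", "april", "may", "june", "july",
                "august", "september", "october", "november", "december"],
    "Spanish": ["enero", "febrero", "marzo", "abril", "mayo", "junio", "julio",
                "agosto", "septiembre", "octubre", "noviembre", "diciembre"],
    "Portuguese": ["janeiro", "fevereiro", "março", "abril", "maio", "junho",
                   "julho", "agosto", "setembro", "outubro", "novembro", "dezembro"],
}

def output_language(input_month_name, default_language):
    """Return the language to use for the output month name."""
    word = input_month_name.lower()
    matches = [lang for lang, names in MONTH_NAMES.items() if word in names]
    if len(matches) == 1:
        return matches[0]        # the month name identifies the language
    return default_language      # ambiguous or unknown: use the default

abril_language = output_language("Abril", "Portuguese")
junho_language = output_language("Junho", "English")
```

"Abril" matches both Spanish and Portuguese, so the default decides; "Junho" matches only Portuguese, so the default is not consulted.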

General    Seed

An alphabetic, numeric, or alphanumeric string. Set this option once when you want to maintain referential integrity each time you run the job. One seed value maintains referential integrity for the following variance types set up in the Data Mask node: Number Variance, Date Variance, and Pattern Variance.

To retain the referential integrity for subsequent jobs using this job setup, use the same data. Do not make changes to the Data Mask node settings.

General    Anonymize Group
           Minimum rows in anonymized group

Select to output a minimum number of matching records to ensure privacy. Enter a numeric value between 2 and 100. Set this option to group the anonymized output with enough records to publish without risking privacy. A lower number means that more records can be published, but the individual or organization can be re-identified more easily. A higher number means that fewer records are published, but the individual or organization is less identifiable. Use the anonymization option with the following masking types:
  • Mask
  • Date Generalization
  • Numeric Generalization

Example

Retain referential integrity using a seed value to keep the altered values the same when you run a job multiple times.

Date variance seed example: If you randomize the input value "June 10, 2016" by 5 days, the output is a date between "June 5, 2016" and "June 15, 2016". If the output for the first run is "June 9, 2016", using the seed value outputs the value "June 9, 2016" on all subsequent runs, so that you can be certain the data is consistent. Not using the seed value might return a value of "June 11, 2016" on the next run, and "June 7, 2016" on the following run.
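
The date case can be sketched in Python in the same spirit. The function `vary_date` and the way the seed is combined with the input are assumptions for illustration only, not the Data Mask node's algorithm.

```python
import hashlib
import random
from datetime import date, timedelta

def vary_date(value, days, seed):
    """Shift a date by a repeatable offset in [-days, +days] days.

    The offset depends only on the seed and the input date, so every
    run with the same seed returns the same result. (Hypothetical helper.)
    """
    key = f"{seed}:{value.isoformat()}".encode("utf-8")
    rng = random.Random(int.from_bytes(hashlib.sha256(key).digest()[:8], "big"))
    return value + timedelta(days=rng.randint(-days, days))

original = date(2016, 6, 10)
run_1 = vary_date(original, 5, "Region9_Cust")
run_2 = vary_date(original, 5, "Region9_Cust")
```

Both runs return the same date, and that date falls between June 5, 2016 and June 15, 2016.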

Numeric variance seed example: If you randomize the input value "500" with a fixed value of 5, the output is a number between 495 and 505. If the output for the first run is "499", using the seed value outputs the value "499" on all subsequent runs, so that you can be certain the data is consistent. Not using the seed value might return a value of "503" on the next run, and "498" on the following run.