Change Default Data Mask Settings

Set the default format and language when the data is ambiguous, set the seed value to ensure referential integrity, and enable anonymization to place matching records in a group.

Context

When the input date data is vague or ambiguous, the Data Mask node will output the format and language you specify here. For example, if your Last_Updated column has the date "2016-04-12", depending on the date format for the country, the date could be April 12, 2016, or it could be December 4, 2016. Setting the default Date format to Year Day Month ensures that the output data refers to December 4, 2016.
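The effect of the default date format can be pictured with a minimal Python sketch. The `ORDERS` mapping and `parse_with_default` function are hypothetical names used only for illustration; they are not part of the Data Mask node. The sketch also assumes a purely numeric date string and always applies the default order, whereas the software uses the default only when the input is ambiguous.

```python
from datetime import date

# Hypothetical mapping from the format names in this topic to element order.
ORDERS = {
    "Day Month Year": ("day", "month", "year"),
    "Month Day Year": ("month", "day", "year"),
    "Year Day Month": ("year", "day", "month"),
    "Year Month Day": ("year", "month", "day"),
}

def parse_with_default(text: str, default_format: str) -> date:
    """Interpret a numeric date string using the default element order."""
    parts = [int(p) for p in text.replace("/", "-").split("-")]
    fields = dict(zip(ORDERS[default_format], parts))
    return date(fields["year"], fields["month"], fields["day"])

print(parse_with_default("2016-04-12", "Year Day Month"))  # 2016-12-04
print(parse_with_default("2016-04-12", "Year Month Day"))  # 2016-04-12
```

The same input string yields two different dates, which is why the default matters for ambiguous values.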

When you want to maintain referential integrity, set the Seed option. This still masks the data, but in a way that ensures consistent values each time the data is output. Let's say that you are masking the Customer_ID value, and want to ensure each ID is randomized on output. For the seed, you can use any combination of numbers and characters to create an identifiable value such as Region9_Cust. The seed value itself is not output; it just ensures that the output data is consistent each time the flowgraph is run. For example, suppose that you are running a Numeric Variance with a Fixed Number and have set the Variance option to 5:

Input data	Valid output range
2550	2545-2555
3000	2995-3005
5500	5495-5505

After the first run, let's say the output data is:

Output data after initial processing
2552
3001
5505

With the seed value set, subsequent processing keeps the same output for each record. Without the seed value set, the output continues to be randomized on every run.

Output after the second run with the seed value set	Output after the second run without the seed value
2552	2554
3001	2998
5505	5497
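The behavior in the tables above can be approximated with a short Python sketch. This is an illustration under stated assumptions, not the software's actual algorithm: `vary_number` is a hypothetical function, and deriving the random offset from a hash of the seed and the input value is just one way to get repeatable results.

```python
import hashlib
import random
from typing import Optional

def vary_number(value: int, variance: int, seed: Optional[str] = None) -> int:
    """Return value shifted by a random offset within +/- variance.

    With a seed, the offset is derived from the seed and the input value,
    so the same input produces the same output on every run (assumption:
    this mimics, but is not, the Data Mask node's behavior).
    """
    if seed is None:
        rng = random.Random()  # fresh randomness on every call
    else:
        digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
        rng = random.Random(int.from_bytes(digest[:8], "big"))
    return value + rng.randint(-variance, variance)

for value in (2550, 3000, 5500):
    seeded = vary_number(value, 5, seed="Region9_Cust")
    unseeded = vary_number(value, 5)
    print(value, seeded, unseeded)
```

Running the script repeatedly prints the same seeded column each time, while the unseeded column can change between runs. Both columns always stay within the valid output range.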

When you want to publish a certain amount of sensitive information while maintaining an individual's or organization's privacy, you can anonymize the output. By entering a minimum group number value, you ensure that every group that is output contains at least that number of records. Any records in groups that fall below the threshold are not output. If you have credit card transaction data and have masked the card holder names and account numbers, and generalized the date and amount, you can publish information that shows that a certain number of people spent $5000 in the month of December without identifying the individuals.
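The group threshold described above can be sketched in Python as follows. The `anonymize` function and the sample rows are hypothetical and exist only to illustrate the idea of suppressing groups that are too small; the sketch assumes that each row has already been masked and generalized.

```python
from collections import Counter

def anonymize(rows, min_group_rows):
    """Keep only rows whose group has at least min_group_rows members."""
    counts = Counter(rows)  # rows with identical values form one group
    return [row for row in rows if counts[row] >= min_group_rows]

# Hypothetical generalized transactions: (month, amount range).
rows = [
    ("2016-12", "5000-5999"),
    ("2016-12", "5000-5999"),
    ("2016-12", "5000-5999"),
    ("2016-11", "1000-1999"),  # a group of one, below the threshold
]

published = anonymize(rows, min_group_rows=2)
print(published)
```

With a minimum of 2, the three December rows are published and the single November row is withheld, because a group of one could point to a specific card holder.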

Procedure

Open the Data Mask node.
Click Default Settings.
Set the options to your preferences, and then click Apply.

Results

Tab	Option	Description
Date	Date format	Specifies the order in which month, day, and year elements appear in the input string. The software uses this value only when the day, month, or year in the input string is ambiguous. Choose one of these formats: Day Month Year, Month Day Year, Year Day Month, or Year Month Day. For example, in English, when an input string is `2014/02/01`, parsing can’t determine whether “02” or “01” is the month, so it relies on the Default Date Format setting for clarification. If the user sets the Default Date Format to Year_Day_Month, the software parses the string as January 2, 2014. However, if the Default Date Format is Year_Month_Day, the software parses the string as February 1, 2014. The Default Date Format may not be needed when the input string is unambiguous. In English, when the input string is `2014/31/12`, the software can parse the string as December 31, 2014, even though the user set the Default Date Format to Month_Day_Year.
Date	Month format	Specifies the format in which the randomized month is output when the software cannot determine the output month format based on the input alone. Full: Output the long version of the month name. Short: Output the abbreviated form of the month name, when an abbreviated form exists. Note: This option applies only when the month is text (not a number). For example, let's say that, in English, the software randomizes an input date of `2015/05/05` to a randomized output date of `2015/03/22`. Because “May” is the same in its full and short forms, the input does not indicate whether the output month should be full or short, so the software relies on the Default Month Format setting. When this option is set to Full, the software outputs “March” for the month. When this option is set to Short, the software outputs “Mar”.
Date	Language	Specifies the language that the software should use when determining the output of an ambiguous input month string. The default language is English. Note: This option applies only when the month is text (not a number). Example: The software cannot determine whether an input date like `Abril 26, 2014` is in Spanish or Portuguese. Therefore, it uses the Default Language value for the randomized output month name. Note: The software does not verify that the user-defined Default Language corresponds to the language of the input month.
General	Seed	A string of letters, numbers, or both. Set this option once when you want to maintain referential integrity each time you run the job. One seed value maintains referential integrity for the following variance types set up in the Data Mask node: Number Variance, Date Variance, and Pattern Variance. To retain the referential integrity for subsequent jobs using this job setup, use the same data, and do not make changes to the Data Mask node settings.
General	Anonymize Group Minimum rows in anonymized group	Select to output a minimum number of matching records to ensure privacy. Enter a numeric value between 2 and 100. Set this option to group the anonymized output with enough records to publish without risking privacy. A lower number means that more records can be published, but it is possible to re-identify the individual or organization more easily. A higher number means that fewer records are published, but the individual or organization is less identifiable. Use the anonymization option with the following masking types: Mask, Date Generalization, and Numeric Generalization.
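The Month format option in the table above can be pictured with a small Python sketch. The `month_text` function is hypothetical, and treating the short form as the first three letters of the English month name is an assumption made for illustration, not a statement about how the software abbreviates months.

```python
MONTHS_FULL = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def month_text(month: int, month_format: str) -> str:
    """Return the English month name in "Full" or "Short" form."""
    full = MONTHS_FULL[month - 1]
    # Assumption: the short form is the first three letters.
    return full if month_format == "Full" else full[:3]

print(month_text(3, "Full"))   # March
print(month_text(3, "Short"))  # Mar
print(month_text(5, "Full"), month_text(5, "Short"))  # May May
```

The last line shows why an input month of "May" gives no hint about the desired output form: its full and short spellings are identical.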

Example

Retain referential integrity using a seed value to keep the altered values the same when you run a job multiple times.

Date variance seed example: If you randomize the input value "June 10, 2016" by 5 days, the output is a date between "June 5, 2016" and "June 15, 2016". If the output for the first run is "June 9, 2016", using the seed value outputs the value "June 9, 2016" on all subsequent runs, so that you can be certain the data is consistent. Not using the seed value might return a value of "June 11, 2016" on the next run, and "June 7, 2016" on the following run.

Numeric variance seed example: If you randomize the input value "500" with a fixed value of 5, the output is a number between 495 and 505. If the output for the first run is "499", using the seed value outputs the value "499" on all subsequent runs, so that you can be certain the data is consistent. Not using the seed value might return a value of "503" on the next run, and "498" on the following run.