Change Default Data Mask Settings
Set the default format and language when the data is ambiguous, set the seed value to ensure referential integrity, and enable anonymization to place matching records in a group.
Context
When the input date data is vague or ambiguous, the Data Mask node will output the format and language you specify here. For example, if your Last_Updated column has the date "2016-04-12", depending on the date format for the country, the date could be April 12, 2016, or it could be December 4, 2016. Setting the default Date format to Year Day Month ensures that the output data refers to December 4, 2016.
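The effect of the default date format on an ambiguous string can be illustrated with a minimal Python sketch. This is not how the Data Mask node is implemented; the `DEFAULT_FORMATS` mapping and the `parse_ambiguous` helper are hypothetical names used only for illustration.

```python
from datetime import datetime

# Hypothetical mapping from a default-format name to a strptime pattern.
DEFAULT_FORMATS = {
    "Year_Month_Day": "%Y-%m-%d",
    "Year_Day_Month": "%Y-%d-%m",
}

def parse_ambiguous(text, default_format):
    """Parse a date string using the chosen default element order."""
    pattern = DEFAULT_FORMATS[default_format]
    return datetime.strptime(text, pattern).date()

as_ymd = parse_ambiguous("2016-04-12", "Year_Month_Day")
as_ydm = parse_ambiguous("2016-04-12", "Year_Day_Month")

print(as_ymd.isoformat())  # 2016-04-12, that is, April 12, 2016
print(as_ydm.isoformat())  # 2016-12-04, that is, December 4, 2016
```

The same input string yields two different dates, which is why the default setting matters whenever both readings are valid.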
When you want to maintain referential integrity, set the Seed option. The data is still masked, but in a way that produces the same output values each time the flowgraph runs. Suppose that you are masking the Customer_ID value and want each ID to be randomized on output. You can use any combination of numbers and characters to create an identifiable seed value such as Region9_Cust. This value is not output; it only ensures that the output data is consistent each time the flowgraph is run. For example, suppose you are running a Numeric Variance with a Fixed Number and have set the Variance option to 5. The following tables show the valid output range for each input value, the output of the first run, and the output of a second run with and without the seed value set.
| Input data | Valid output range |
|---|---|
| 2550 | 2545-2555 |
| 3000 | 2995-3005 |
| 5500 | 5495-5505 |
| Output data after initial processing |
|---|
| 2552 |
| 3001 |
| 5505 |
| Output after the second run with the seed value set | Output after the second run without the seed value |
|---|---|
| 2552 | 2554 |
| 3001 | 2998 |
| 5505 | 5497 |
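The seed behavior shown in the tables can be sketched in Python as follows. This is an illustration under assumptions, not the algorithm the Data Mask node uses: the `mask_number` function is hypothetical, and deriving the random generator from a hash of the seed and the input value is just one way to make the output repeatable. The specific masked values it produces will differ from those in the tables.

```python
import hashlib
import random

def mask_number(value, variance, seed=None):
    """Shift value by a random integer in [-variance, +variance].

    With a seed, the shift is derived from the seed and the input value,
    so the same input always maps to the same output. Without a seed,
    every call draws a fresh random shift.
    """
    if seed is None:
        rng = random.Random()
    else:
        digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
        rng = random.Random(int.from_bytes(digest[:8], "big"))
    return value + rng.randint(-variance, variance)

inputs = [2550, 3000, 5500]
first_run = [mask_number(v, 5, seed="Region9_Cust") for v in inputs]
second_run = [mask_number(v, 5, seed="Region9_Cust") for v in inputs]

print(first_run == second_run)  # True
```

Calling `mask_number(v, 5)` without a seed still stays within the valid range, but the result can change from one run to the next.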
When you want to publish some sensitive information while protecting the privacy of an individual or organization, you can anonymize the output. By entering a minimum group size, you ensure that every group that is output contains at least that number of records. Records in groups that fall below the threshold are not output. For example, if you have credit card transaction data and have masked the card holder names and account numbers, and generalized the date and amount, you can publish information showing that a certain number of people spent $5000 in the month of December without identifying any individual.
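The minimum group threshold can be pictured with a short Python sketch. It is a simplified illustration, not the node's actual grouping logic; the `anonymize` function, the field names, and the sample records are all hypothetical.

```python
from collections import Counter

def anonymize(records, key_fields, min_group_size):
    """Keep only records whose key-field combination occurs
    at least min_group_size times; drop all others."""
    def group_key(record):
        return tuple(record[field] for field in key_fields)

    counts = Counter(group_key(r) for r in records)
    return [r for r in records if counts[group_key(r)] >= min_group_size]

# Already masked and generalized transaction data (made-up values).
records = [
    {"month": "December", "amount": 5000},
    {"month": "December", "amount": 5000},
    {"month": "December", "amount": 5000},
    {"month": "December", "amount": 9000},
]

published = anonymize(records, ["month", "amount"], min_group_size=2)
print(len(published))  # 3
```

The single record with an amount of 9000 forms a group of one, so it is withheld, while the three matching records are published together.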
Procedure
- Open the Data Mask node.
- Click Default Settings.
- Set the options to your preferences, and then click Apply.
Results
| Tab | Option | Description |
|---|---|---|
| Date | Date format | Specifies the order in which month, day, and year elements appear in the input string. The software uses this value only when the day, month, or year in the input string is ambiguous. Choose one of the available formats. For example, in English, when the input string is 2014/02/01, parsing cannot determine whether “02” or “01” is the month, so it relies on the Default Date Format setting for clarification. If the Default Date Format is set to Year_Day_Month, the software parses the string as January 2, 2014. If the Default Date Format is set to Year_Month_Day, the software parses the string as February 1, 2014. The Default Date Format is not always needed: in English, when the input string is 2014/31/12, the software can parse the string as December 31, 2014 even though the Default Date Format is set to Month_Day_Year. |
| Date | Month format | Specifies the format in which the randomized month is output when the software cannot determine the output month format based on the input alone. For example, let's say that, in English, the software randomizes an input date of 2015/05/05 to a randomized output date of 2015/03/22. However, because “May” is ambiguous in determining whether the output is full or short, the software relies on the Default Month Format setting to determine the output format for the month. When this option is set to Full, the software outputs “March” for the month. When this option is set to Short, the software outputs “Mar”. |
| Date | Language | Specifies the language that the software should use when determining the output of an ambiguous input month string. The default language is English. Example: The software cannot determine whether an input date like Abril 26, 2014 is in Spanish or Portuguese. Therefore, it uses the Default Language value to determine the language to be used on output. The software then uses the Default Language value for the randomized output month name. |
| General | Seed | An alphabetic, numeric, or alphanumeric string. Set this option once when you want to maintain referential integrity each time you run the job. One seed value maintains referential integrity for the following variance types set up in the Data Mask node: Number Variance, Date Variance, and Pattern Variance. To retain the referential integrity for subsequent jobs using this job setup, use the same data. Do not make changes to the Data Mask node settings. |
| General | Anonymize Group, Minimum rows in anonymized group | Select to output a minimum number of matching records to ensure privacy. Enter a numeric value between 2 and 100. Set this option to group the anonymized output with enough records to publish without risking privacy. A lower number means that more records can be published, but it is possible to re-identify the individual or organization more easily. A higher number means that fewer records are published, but the individual or organization is less identifiable. Use the anonymization option with the following masking types: |
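The Month format option described in the table can be sketched in Python. This is only an illustration of why a default is needed; the `format_month` helper and the English month list are assumptions for the example, not part of the product.

```python
# English month names, written out to avoid any locale dependence.
MONTHS_FULL = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def format_month(month_number, month_format):
    """Return the month name in Full or Short (three-letter) form."""
    name = MONTHS_FULL[month_number - 1]
    if month_format == "Full":
        return name
    if month_format == "Short":
        return name[:3]
    raise ValueError(f"Unknown month format: {month_format}")

print(format_month(3, "Full"))   # March
print(format_month(3, "Short"))  # Mar

# "May" looks identical in both forms, so an input containing "May"
# does not reveal which form the output should use.
print(format_month(5, "Full") == format_month(5, "Short"))  # True
```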
Example
Retain referential integrity using a seed value to keep the altered values the same when you run a job multiple times.
Date variance seed example: If you randomize the input value "June 10, 2016" by 5 days, the output is a date between "June 5, 2016" and "June 15, 2016". If the output for the first run is "June 9, 2016", using the seed value outputs the value "June 9, 2016" on all subsequent runs, so that you can be certain the data is consistent. Not using the seed value might return a value of "June 11, 2016" on the next run, and "June 7, 2016" on the following run.
Numeric variance seed example: If you randomize the input value "500" with a fixed value of 5, the output is a number between 495-505. If the output for the first run is "499", using the seed value outputs the value "499" in all subsequent runs, so that you can be certain the data is consistent. Not using the seed value might return a value of "503" on the next run, and "498" on the following run.
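The date variance example can be sketched in Python along the following lines. The `mask_date` function is hypothetical, and seeding a generator from a hash of the seed and the input date is an assumed technique chosen for illustration, not the product's documented algorithm. The actual masked date it produces is not necessarily June 9, 2016.

```python
import hashlib
import random
from datetime import date, timedelta

def mask_date(value, variance_days, seed):
    """Shift a date by a repeatable offset in [-variance_days, +variance_days].

    The offset depends only on the seed and the input date, so every run
    with the same seed returns the same masked date.
    """
    digest = hashlib.sha256(f"{seed}:{value.isoformat()}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return value + timedelta(days=rng.randint(-variance_days, variance_days))

original = date(2016, 6, 10)
run_1 = mask_date(original, 5, seed="Region9_Cust")
run_2 = mask_date(original, 5, seed="Region9_Cust")

print(run_1 == run_2)                                   # True
print(date(2016, 6, 5) <= run_1 <= date(2016, 6, 15))   # True
```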
