Modeling Guide for SAP Data Hub

Anonymization

Anonymization helps you gain statistically valid insights from your data while protecting the privacy of individuals.

When analyzing data, you must protect the privacy of personal or sensitive information. Removing information that directly identifies an individual, such as a Social Security number or a credit card number, provides a certain amount of privacy, but the remaining data could still lead to an individual's re-identification. The Anonymization operator helps to create anonymized groups, for which you set the minimum number of records per group. You can also mask and generalize the data for a specified column. For example, if you group individuals into age brackets of 20-29, 30-39, 40-49, and so on, it is more difficult to re-identify an individual. When you mask or generalize multiple columns, the ability to re-identify an individual decreases further.
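The age-bracket idea above can be sketched in a few lines of Python. This is only an illustration of generalization, not the operator's implementation; the function name `age_bracket` and the `width` parameter are hypothetical.

```python
def age_bracket(age, width=10):
    """Replace an exact age with the bracket that contains it.

    Hypothetical helper: brackets start at multiples of `width`,
    so width=10 yields 20-29, 30-39, 40-49, and so on.
    """
    if age < 0:
        raise ValueError("age must not be negative")
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


if __name__ == "__main__":
    for age in (23, 34, 40):
        print(age, "->", age_bracket(age))
```

Because every individual aged 30 through 39 receives the same output value, the age column alone can no longer single out one person in that bracket.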

As you use the Anonymization operator, you'll see the following terms:
  • Sensitive: data that most individuals do not want known about them, for example, salary information or an illness.
  • Nonsensitive: data that most individuals may not mind sharing, for example, the country they live in.
  • Identifier: data that directly identifies individuals such as their name or Social Security number.
  • Quasi-identifier: data that indirectly identifies an individual, especially when combined with other quasi-identifiers, such as age, gender, and postcode.

Configuration Parameters

Parameter Type Description
Label String Required. Enter a name for the operator.
Minimum rows in an anonymized group Integer Required. A numeric string between 2 and 100. The larger the number you enter, the less likely it is that the data can be re-identified, and the fewer records are output. The smaller the number, the more likely it is that the data can be re-identified, and the more records are output. Groups with fewer records than the number you specify are not output.
Date Format String Required. Specifies the order in which month, day, and year elements appear in the input string. This value is used only when the day, month, or year in the input string is ambiguous.
Month Format String Required. Specifies the format in which the randomized month is output when the software cannot determine the output month format based on the input alone.
Language String Required. Specifies the language that the software should use when determining the output of an ambiguous input month string.
Century Threshold Integer Optional. Indicates whether a two-digit date is considered part of the 20th or 21st century. Enter a value from 0-99. For example, when set to 25, the dates with a 2-digit value from 00-25 result in the years 2000-2025. Dates with a 2-digit value of 26-99 result in the years 1926-1999.
Default Column Behavior String Required. Define whether to output any columns that are not defined in the Column Definitions.
Column Definitions   You can define the Anonymization operation on one or more columns. Each column has its own definition. Click the Open Editor icon, and then click +Add item and complete the following options:
  1. Column ID (string): Required. This string uniquely identifies the column. It should match the ID or name of the column coming into the operator.
  2. Column Designation (string): Required. Specifies the categorization and any masking or generalization of this column:
    • Sensitive: Data is output without modification, for example, height.
    • Nonsensitive: Data is output without modification, for example, hair color.
    • Quasi-identifier: Data is output and used to form equivalence classes, for example, age. You can further mask or generalize the quasi-identifier columns.
      • Mask: Mask all or a portion of the data with another character. For example, a Social Security number might output as ***-**-1234.
      • Date Generalization: Output date ranges into groups. For example, divide subscribers into groups based on their birth dates, and label the era (such as Millennials, GenX, Baby Boomers, and so on) rather than using the actual birth date.
      • Numeric Generalization: Output number ranges into groups. For example, output the records in a SALARY column that have values between $42,000 and $125,000 into a group called Middle Class.
      • Do Not Modify: Output data without masking or generalization.
    • Identifier: Data can positively identify an individual and is not output, for example, a Social Security number.
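
Two of the parameters above can be illustrated with a short Python sketch: the rule that groups smaller than the configured minimum are not output, and the Century Threshold rule for two-digit years. This is a simplified illustration, not the operator's actual algorithm. The function names, the dictionary-based record layout, and the assumption that records are grouped by identical quasi-identifier values are all hypothetical.

```python
from collections import defaultdict


def suppress_small_groups(records, quasi_identifiers, min_rows):
    """Keep only records whose quasi-identifier values are shared by
    at least `min_rows` records (assumed grouping rule)."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[column] for column in quasi_identifiers)
        groups[key].append(record)

    kept = []
    for members in groups.values():
        # Groups with fewer records than the minimum are not output.
        if len(members) >= min_rows:
            kept.extend(members)
    return kept


def expand_two_digit_year(two_digit_year, century_threshold):
    """Map a two-digit year to a four-digit year.

    Values from 00 through the threshold fall in the 2000s;
    larger values fall in the 1900s.
    """
    if not 0 <= two_digit_year <= 99:
        raise ValueError("two_digit_year must be between 0 and 99")
    if not 0 <= century_threshold <= 99:
        raise ValueError("century_threshold must be between 0 and 99")
    if two_digit_year <= century_threshold:
        return 2000 + two_digit_year
    return 1900 + two_digit_year


if __name__ == "__main__":
    sample = [
        {"age": "30-39", "postcode": "12***", "salary": 50000},
        {"age": "30-39", "postcode": "12***", "salary": 61000},
        {"age": "40-49", "postcode": "12***", "salary": 70000},
    ]
    print(suppress_small_groups(sample, ["age", "postcode"], 2))
    print(expand_two_digit_year(25, 25), expand_two_digit_year(26, 25))
```

In the sample data, the single record in the 40-49 group is dropped because its group has fewer than two rows. Raising the minimum makes re-identification harder but removes more records.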

Mask Options

Mask all or a portion of the data with another character. For example, a credit card number might output as ****-****-****-1234.
Parameter Type Description
Starting Position String Required. Specifies whether masking should start at the beginning or end of the value.
Unmasked Length String Required. Specifies the number of characters at the beginning or end of the value that should not be masked.
Masking Character String Required. The character or number that replaces the characters in the input data, for example, "#" or "*".
Maintain Formatting String Required.
  • True: retains any special characters such as dashes, slashes or periods, spaces between characters, and formatting in the output. For example, if you have a phone number that uses dashes, then the dashes are output.
  • False: replaces special characters and spaces with the designated masking character.
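
The interaction of the mask options can be sketched as follows. This is a minimal illustration under stated assumptions, not the operator's implementation: the function `mask_value` and its parameter names are hypothetical, the unmasked length is assumed to count every character including dashes and spaces, and masking that starts at the beginning is assumed to leave the unmasked characters at the end (and the reverse).

```python
def mask_value(value, unmasked_length, mask_char="*",
               start_from_beginning=True, maintain_formatting=True):
    """Mask `value`, leaving `unmasked_length` characters readable.

    Assumption: when masking starts at the beginning, the readable
    characters are the last ones; otherwise they are the first ones.
    """
    total = len(value)
    keep = max(0, min(unmasked_length, total))
    result = []
    for index, char in enumerate(value):
        if start_from_beginning:
            in_masked_part = index < total - keep
        else:
            in_masked_part = index >= keep

        if not in_masked_part:
            result.append(char)          # readable part, copied as-is
        elif maintain_formatting and not char.isalnum():
            result.append(char)          # keep dashes, slashes, spaces
        else:
            result.append(mask_char)
    return "".join(result)


if __name__ == "__main__":
    card = "1234-5678-9012-1234"
    print(mask_value(card, 4))
    print(mask_value(card, 4, maintain_formatting=False))
```

With Maintain Formatting set to True, the credit card example keeps its dashes. With False, the dashes are replaced by the masking character as well.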

Date Generalization Options

Output date ranges into groups.
Parameter Type Description
Auto Range Scale String Required. Defines the scale on which to base the auto range.
  • Not in Use: Indicates that you are not using auto range for the specified input column. This setting is appropriate when you complete the Range Definition options for the input column, or when you do not use this feature. Click + Add item to further define the option.
  • Calendar Year: Group records based on the calendar year. The software defines a calendar year as 1/1/yyyy to 12/31/yyyy.
  • Calendar Month: Group records based on the calendar month. The software defines a calendar month as mm/01/yyyy to mm/eom/yyyy, where "eom" is end of month.
Minimum Date String Enter the lowest acceptable date in the range.
Minimum Date Inclusive String Required. Select True when you want to include the minimum date. Select False when you do not want to include the minimum date in the results. For example, if you set the minimum date to 12/31/2020, then 12/31/2020 is included in the results when True is selected.
Maximum Date String Enter the highest acceptable date in the range.
Maximum Date Inclusive String Required. Select True when you want to include the maximum date. Select False when you do not want to include the maximum date in the results. For example, if you set the maximum date to 06/30/2020, then dates through 06/29/2020 are included in the results when False is selected.
Replacement Value String Required. Enter a value to describe the group.
Default Replacement Value String Optional. Value to output when the input value does not fall into any of the defined ranges.
Auto Range Duration Integer Required. Number of years or months to include in the range.
Auto Range Start Date String Required. Starting date in auto range.
Auto Range End Date String Required. Ending date in auto range.
Auto Range Output Format String Required. Determines the format of the output Auto Range Replacement Value.
Auto Range Year Format String Required. Specifies the number of digits to use for the year. Full Year outputs a four-digit number, for example, 2018. Short Year outputs a two-digit number, for example, 18.
Auto Range Month Format String Required. Determines the month format to use in the Auto Range Replacement Value. Full Text outputs the month name, for example, January. Short Text outputs the abbreviated month name, for example, Jan. Numeric outputs the number of the month, for example, 1 for January.
Auto Range Date Delimiter String Required. Determines the delimiter to use in the Auto Range Replacement Value.
Auto Range Numeric Format String Optional. Determines the numeric format to use in the Auto Range Replacement Value.
Auto Range Enable Zero Pad String Optional. Pad a one-digit number with zero when the format includes the month and day. For example, 1/5/2018 changes to 01/05/2018 when set to True.
Auto Range Output Language String Optional. Determines the language to use in the Auto Range Replacement Value. This setting is applicable when the Month Format is set to Short Text or Full Text.
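
As a rough illustration of how an auto range on the Calendar Year scale might group dates, the following Python sketch assigns a date to a bucket of a given duration and builds a label for it. The bucketing rule, the label layout, and the names `auto_range_year_label`, `short_year`, and `delimiter` are assumptions made for this example; the operator's actual output depends on the Auto Range Output Format and related settings.

```python
from datetime import date


def auto_range_year_label(value, start, end, duration,
                          short_year=False, delimiter="-"):
    """Return a label for the calendar-year bucket containing `value`.

    Returns None when `value` lies outside the start and end dates, so
    the caller can substitute a default replacement value.
    """
    if duration < 1:
        raise ValueError("duration must be at least 1")
    if not (start <= value <= end):
        return None

    # Buckets are counted in whole calendar years from the start year.
    buckets_from_start = (value.year - start.year) // duration
    first_year = start.year + buckets_from_start * duration
    last_year = min(first_year + duration - 1, end.year)

    def fmt(year):
        # Short Year keeps two digits, Full Year keeps four.
        return f"{year % 100:02d}" if short_year else str(year)

    if first_year == last_year:
        return fmt(first_year)
    return f"{fmt(first_year)}{delimiter}{fmt(last_year)}"


if __name__ == "__main__":
    start, end = date(1980, 1, 1), date(2019, 12, 31)
    print(auto_range_year_label(date(1987, 6, 15), start, end, 10))
    print(auto_range_year_label(date(2018, 3, 1), start, end, 1, short_year=True))
```

A date outside the start and end dates returns None in this sketch, which is where a Default Replacement Value would apply.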

Numeric Generalization Options

Output number ranges into groups. For example, output the records in an AGE column that have values between 13-19 into a group called Teenager. Specify the ranges to use for numeric variance. In the Numeric Generalization option, select + Add item.
Parameter Type Description
Minimum Value Integer Enter the lowest acceptable value in the range.
Minimum Value Inclusive String Required. Select True when you want to include the minimum value. Select False when you do not want to include the minimum value in the results. For example, if you set the minimum value to 30, then 30 is included in the results when True is selected.
Maximum Value Integer Enter the highest acceptable value in the range.
Maximum Value Inclusive String Required. Select True when you want to include the maximum value. Select False when you do not want to include the maximum value in the results. For example, if you set the maximum value to 50, then numbers through 49 are included in the results when False is selected.
Replacement Value String Optional. Enter a value to describe the group.
Default Replacement Value String Optional. Value to output when the input value does not fall into any of the defined ranges. For example, you might want to label those records as Exceptions.
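
The range options above can be sketched in Python as follows. This is a minimal illustration of the inclusive and exclusive boundary rules, not the operator's implementation. The function `generalize_number`, the dictionary keys, and the label "Thirties and forties" are hypothetical; the Teenager range and the Exceptions label come from the descriptions above.

```python
def generalize_number(value, ranges, default=None):
    """Return the replacement value of the first range containing `value`,
    or `default` when no range matches."""
    for item in ranges:
        if item["min_inclusive"]:
            above_min = value >= item["min"]
        else:
            above_min = value > item["min"]

        if item["max_inclusive"]:
            below_max = value <= item["max"]
        else:
            below_max = value < item["max"]

        if above_min and below_max:
            return item["replacement"]
    return default


# Hypothetical range definitions, one dictionary per added item.
AGE_RANGES = [
    {"min": 13, "min_inclusive": True,
     "max": 19, "max_inclusive": True, "replacement": "Teenager"},
    {"min": 30, "min_inclusive": True,
     "max": 50, "max_inclusive": False, "replacement": "Thirties and forties"},
]

if __name__ == "__main__":
    for age in (13, 19, 25, 30, 49, 50):
        print(age, "->", generalize_number(age, AGE_RANGES, default="Exceptions"))
```

With Maximum Value Inclusive set to False on the second range, 49 falls inside the range and 50 does not, so 50 receives the default replacement value.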

Numeric Generalization Example

Let's say that you want to assign employees to one of three geographic areas based on their employee number. You would add three items and complete the options as follows.