Modeling Guide for SAP Data Hub

Data Mask Examples

See several examples of Mask, Date Generalization, Date Variance, Numeric Generalization, Numeric Variance, and Pattern Variance masking options.

The following examples show how various settings can affect the data.

Mask

Masking a portion of data:
  • If you have a column for User_ID, the value is Smith7887, and you set the option to Everything except the first 4 characters, your output would be Smitxxxxx.
  • If you set the same option to Everything except the last 2 characters, your output would be xxxxxxx87. If your masking character is left blank, your output would be 87.
Using the Maintain format option:
  • If you have a column for Phone1, the value is 800-555-1234, and select Maintain format, your output would be xxx-xxx-xxxx. Not selecting this option would output xxxxxxxxxxxx.
  • If you have a column for Email1, the value is john.smith@abc.com, and you enable Maintain format and Mask all characters in email name, then your output would be xxxxxxxxxx@xxx.xxx.
  • If you enable Maintain format and disable Mask all characters in email name, then your output would be xxxx.xxxxx@xxx.xxx.
  • If you disable both Maintain format and Mask all characters in email name, then your output would be xxxxxxxxxxxxxxxxxx.

Date Generalization

Let's say that you want to divide subscribers into groups based on their birth dates, and that you want to label the era rather than using the actual birth date. Because the age groups are not an equal number of years, it would be best to manually define each range one at a time rather than using the auto range. The default masked value is defined as Out of Range, so that any records that do not belong in the defined ranges will output Out of Range as the masked value.

Minimum Column Name Maximum Masked Value
1940.01.01<= BIRTHDATE <=1964.12.31 Baby Boomer
1965.01.01<= BIRTHDATE <=1976.12.31 Gen X
1977.01.01<= BIRTHDATE <=1995.12.31 Millennial

Date Variance

The following table shows several examples for one date value in a database: May 27, 1995. The examples help to illustrate how the date is calculated internally, and how it is output when the user-defined minimum and maximum values are used together.

Variance Type Internally calculated date Output value within this range Notes

Fixed Days

Variance: 10

Minimum date: <not set>

Maximum date: <not set>

Min date: May 17, 1995

Max date: Jun 6, 1995

Min date: May 17, 1995

Max date: Jun 6, 1995

Max date: Jun 6, 1995

Fixed Days

Variance: 100

Minimum date: Mar 1, 1995

Maximum date: Aug 31, 1995

Min date: Feb 9, 1995

Max date: Sep 4, 1995

Min date: Mar 1, 1995

Max date: Aug 31, 1995

The user-defined minimum and maximum dates are within the calculated minimum and maximum dates. Therefore, the user-defined dates are used.

Fixed Months

Variance: 6

Min date: Jan 1, 1995

Max date: Dec 31, 1995

Min date: Nov 27, 1994

Max date: Nov 27, 1995

Min date: Jan 1, 1995

Max date: Nov 27, 1995

The user-defined minimum date is within the calculated variance, and is the value used for output.

The maximum user-defined date is outside of the calculated variance. Therefore, the maximum calculated date is used.

Fixed Years

Variance: 15

Min date: Jan 1, 1965

Max date: Dec 31, 2015

Min date: May 27, 1980

Max date: May 27, 2010

Min date: May 27, 1980

Max date: May 27, 2010

Both of the user-defined minimum and maximum dates are outside of the calculated variance. Therefore the calculated date is used.

Range

Min date: Jan 1, 1965

Max date: Dec 31, 2015

n/a

Min date: Jan 1, 1965

Max date: Dec 31, 2015

Because there is no variance for range, only the user-defined minimum and maximum values are used.

Numeric Generalization

Let's say that you want to categorize your employee numbers by location. Rather than outputting the employee number, the defined masked values replace the value in the EMPNO column. Any numbers that do not fall into the range have the default masked value, Outliers.

Minimum Column Name Maximum Masked Value
12000<= EMPNO <=16999 Asia_Pac
17000<= EMPNO <=29999 Europe_Africa
23000< EMPNO <=28999 Americas

Numeric Variance

The following table shows several examples for one salary value in a database: $50,000. The examples help to illustrate how the value is calculated internally, and how it is output when the user-defined minimum and maximum values are used together.

Variance Type Internally calculated value Output within this range Notes

Percentage

Variance: 25

Minimum value: <not set>

Maximum value: <not set>

Min value: $37,500

Max value: $62,500

Min value: $37,500

Max value: $62,500

The output uses the internally calculated values, because the user-defined values are not specified.

Percentage

Variance: 25

Minimum value: 45,000

Maximum value: 85,000

Min value: $37,500

Max value: $62,500

Min value: $45, 00

Max value: $62,500

The user-defined minimum value is within the calculated variance, and is the value used for output.

The maximum user-defined date is outside of the calculated variance. Therefore, the maximum calculated value is used.

Fixed number

Variance: 2500

Minimum value: <not set>

Maximum value: <not set>

Min value: $47,500

Max value: $52,500

Max value: $52,500

Max value: $52,500

The output uses the internally calculated values, because the user-defined values are not specified.

Fixed number

Variance: 2500

Minimum value: 45,000

Maximum value: 85,000

Min value: $47,500

Max value: $52,500

Min value: $47,500

Max value: $52,500

Both the user-defined minimum and maximum values are outside of the calculated variance. Therefore, the calculated minimum and maximum values are used.

Range

Minimum value: 55,000

Maximum value: 95,000

n/a

Minimum value: 55,000

Maximum value: 95,000

Because there is no variance for range, only the user-defined minimum and maximum values are used.

Pattern Variance

  • Default variance type: The following table describes how the default pattern variance masks input characters with like characters.

    Input character type Mask values
    Alphabetic

    Masks lower-case alpha character with random lower-case alpha character

    Masks upper-case alpha character with random upper-case alpha character

    Numeric Masks each numeric character with a random numeric character from 0 up to and including 9.
    Special characters or spaces Does not mask special characters or spaces, but outputs them as they are input, unmasked. For example, when the input substring contains a dash (-), the default pattern variance keeps the dash in the output. When the input substring contains a space, the default pattern variance keeps the space in the output.

    The following table shows the best practice use of default pattern variance.

    Description Best Practice Example
    Mask an entire input field using the default pattern variance. Add a substring definition with a Default variance type.

    Starting position: 1

    Ending position: <slider all the way to the right>

    Automatically apply the default pattern variance to substrings of a mapped input column that are not defined. Define input field substrings using one or more of the other pattern variance types, and leave portions of the input value undefined.

    Definition 1:

    Definition type: Preserve

    Starting position: 1

    Ending position: 3

    Definition 2:

    Definition type: String

    Starting position: 4

    Ending position: 5

    Value range min-max: 20-25

    Undefined range: 6-end of the value

    Results

    Definition 1: Preserves the characters in positions 1-3

    Definition 2: Masks the entire substring (position 4-5) with a number that is included in the minimum/maximum range (in this case 20-25)

    Undefined: Masks position six to the end of the field with the default pattern variance.

  • Preserve variance type: The application outputs the defined substring as it was input, with no masking.

    The following table contains an example of the preserve pattern variance.

    Strategy Settings Example input/output Notes
    Preserve the unit identification number in each record. Mask the rest of the value with the default pattern variance.

    Undefined: character in position 1

    Definition:

    Definition type: Preserve

    Starting position: 2

    Ending position: 3

    Value: <blank>

    Undefined: characters 4 through the end of the value

    Input value: B12:G350

    Possible output values, preserved characters in bold.

    A12:N799

    F12:M127

    Undefined: Masks the first position with a like character using the default pattern variance.

    Definition: Preserves position two and three with the preserve pattern variance. (In this case the numbers 12).

    Undefined: masks the fourth position to the end of the string using the default pattern variance.

    The colon in the input field (character four) is included in the undefined portion. The software outputs the colon as it was input based on the default pattern variance definition.

  • Character variance type: The application masks each character in the defined substring with a character from the Value option.

    The Value option can include individual upper or lower case alpha characters, numeric characters from 0 to 9, ranges of alpha characters, or ranges of numeric characters (using numbers from 0 to 9), spaces, special characters (such as @, #, _, &), or any combination of these.

    Special characters are output as they are input, without masking them, when they are present in a defined substring for character pattern variance.

    The following table contains an example of the character pattern variance.

    Strategy Settings Example input/output Notes
    Mask an identification code with specific alpha or numeric values, and apply the default pattern variance to the remaining portion of the value.

    Undefined: character in position 1

    Definition:

    Definition type: Character

    Starting position: 2

    Ending position: 3

    Value: J-L|B|W-Y|2

    Undefined: characters 4 through the end of the value

    Input value: 123a

    Possible output values, masked characters in bold.

    8KBx

    32Wt

    Undefined: Masks the first position with a like character using the default pattern variance.

    Definition: Masks position two and three using the character pattern variance and randomly chooses a character specified in the Value field to mask each position.

    Undefined: Masks the fourth position to the end of the string using the default pattern variance.

  • String variance type:The application masks the entire defined substring with a random character or string from the Value option.

    The Value option can include one or more alpha or numeric characters (such as MILK or 2458), spaces, special characters (such as @, #, _, &), numeric ranges, or any combination of these in a list in the Value option.

    The application counts all alphanumeric characters, spaces, and special characters when it determines the substring length. However, the application does not retain the special characters or spaces in the output when they are present in a defined substring for string pattern variance.

    The following table contains an example of the string pattern variance.

    Strategy Settings Example input/output Notes
    Preserve the product code, but mask the type of milk (white, chocolate, soy, and so on) with the general term MILK.

    Definition 1:

    Definition type: Preserve

    Starting position: 1

    Ending position: 5

    Value: <blank>

    Definition 2:

    Definition type: String

    Starting position: 6

    Ending position: <end of column>

    Value: MILK

    Input value: 5428-WTMLK

    5429-SOYMLK

    Possible output values, string characters in bold.

    5428-MILK

    5429-MILK

    Definition 1: Preserves the first through the fifth positions, including the dash, as part of preserve pattern variance. The dash is output as it was input because it is included in the preserve pattern variance.

    Definition 2: Masks position six to the end of the field with the value MILK.

    Include a zero to the left of any number in a range (to the left of the lower or higher number or to both numbers) so the mask value is left-padded with zeros for the length of the substring.

    Definition 1:

    Definition type: String

    Starting position: 1

    Ending position: 5

    Value: 01-8|999

    Undefined: position 6 through the end of the column.

    Input value: 04-a1099

    Possible output values, string characters in bold.

    00003502

    999832

    Definition 1: Outputs the first through the fifth characters with a number from 1 up to and including 8, or the number 999.

    When the application chooses a number from the range as a mask value, it zero-pads the number to the left so the masked value is the length of the defined substring (5 characters).

    The application does not zero-pad the number 999 because it does not contain a leading zero in the Value option.

    The application includes the dash in the input field in the position count for the substring. However, the application does not output the dash as part of the string pattern variance definition.

    Undefined: The application applies the default pattern variance to the undefined portion of the input field, character six to the end of the field, and replaces each numeric value with a random value from 0 to 9.

  • Examples of various pattern variance types by showing example definition option settings, input values, and possible output values.

    Strategy Settings Example input/outpu Notes

    Mask the weight and the unit of measure from a product code, but preserve the product type.

    Definition 1:

    Definition type: Preserve

    Starting position: 1

    Ending position: 3

    Value: <blank>

    Undefined: Position 4 and 5.

    Definition 2:

    Definition type: String

    Starting position: 6

    Ending position: 9

    Value: GAL|qt|pt|oz|CUP

    Input value: MLK12CUP

    Possible output values:

    MLK63GAL

    MLK18pt

    MLK04oz

    Definition 1: Preserves the first three positions using the preserve pattern variance.

    Undefined: Masks the fourth and fifth positions using the default pattern variance, which replaces each numeric character with a value from 0-9.

    Definition 2: Masks the sixth through the eighth positions with one of the values listed using the string pattern variance.

    Notice that in some cases, the software replaces a 3-character substring with a 2-character value.

    Mask the product type and the weight from a product code, but preserve the unit of measure.

    Definition 1:

    Definition type: String

    Starting position: 1

    Ending position: 3

    Value: ALMLK| SOYMLK| RCEMLK| WTMLK| CHMLK

    Definition 2:

    Definition type: String

    Starting position: 4

    Ending position: 5

    Value: 01-12| 32| 16

    Definition 3:

    Definition type: Preserve

    Starting position: 6

    Ending position: <end of column>

    Value: <blank>

    Input value: MLK12CUP

    Possible output values:

    WTMLK32CUP

    RCEMLK16CUP

    ALMLK08CUP

    Definition 1: Masks the first three positions using one of the values specified for string pattern variance.

    The application masks the 3-character substring with values that may be longer than 3 characters.

    Definition 2: Masks the fourth and fifth positions using the string pattern variance.

    The first value listed in the Value option for Definition 2 is a range beginning with a zero-padded number. This ensures that the mask value is the length of the defined substring, 2 characters.

    Definition 3: Preserves the sixth through the eighth positions.

    Mask the number of paper sheets per package, and the type of packaging from the product description column.

    Definition 1:

    Definition type: Character

    Starting position: 1

    Ending position: 4

    Value: 0-9

    Undefined: Position 5.

    Definition 2:

    Definition type: String

    Starting position: 6

    Ending position: <end of column>

    Value: Ream| Case| Pack| Box

    Input value: 1500/Ream

    Possible output values:

    0950/Case

    8945/Box

    2639/Pack

    Definition 1: Masks the first through the fourth positions with a number from 0-9.

    The user could leave the first through the fifth position undefined so the application masks the substring using the default pattern variance to get similar output values. The forward slash would be output as part of the substring in this case.

    Undefined: Outputs the forward slash (/) character in the fifth position using the default pattern variance (maintains special characters on output).

    Definition 2: Mask the sixth position to the end of the column with one of the character strings listed in the Value option.

    Mask the school district, the state, and the enrollment number. Preserve the type of school.

    Definition 1:

    Definition type: String

    Starting position: 1

    Ending position: 3

    Value: DST

    Definition 2:

    Definition type: String

    Starting position: 4

    Ending position: 5

    Value: ST

    Undefined: Position 6, 7, 8, and 9.

    Definition 3:

    Definition type: Preserve

    Starting position: 10

    Ending position: <end of column>

    Value: <blank>

    Input value: INDNE7321MID

    ANMA7321HIGH

    SNBCA7321ELEM

    Possible output values:

    DSTST3829MID

    DSTST5784HIGH

    DSTST0789ELEM

    Definition 1: Masks position one to three with the string DST.

    Definition 2: Mask the fourth and the fifth position with the string ST.

    Undefined: Masks position six through nine with the default pattern variance.

    Definition 3: Preserves the tenth position to the end of the column.