Modeling Guide for SAP Data Hub

Anonymized Groups Example

Use anonymization to place masked data into match groups so that you can publish data without risking re-identification of sensitive data.

Before masking, each record might be unique. After you mask several columns of identifying data with Mask, Date Generalization, or Numeric Generalization, some records become identical in those columns. In the Anonymization settings, you choose the minimum number of records that a group must contain to be output. Groups that contain fewer records than the number you specify are not output. The larger the number you enter (between 2 and 100), the less likely it is that the data can be re-identified, but fewer records are output. The smaller the number, the more likely it is that the data can be re-identified, but more records are output.

Let's say that you work for an insurance company. You want to publish quarterly information about the cause of emergency room visits without compromising the patients' privacy.
Row ID  Patient ID  Name                 Age  Postcode  Date of Visit  Issue
1       L1234-0987  James Smith          1    54601     05/13/2017     Ear infection
2       R5678-6543  Allison Zhou         26   54650     02/06/2017     Influenza
3       J2345-9876  Mia Vang             53   55190     12/18/2016     Heart attack
4       J6789-5432  Ben McCleary         68   55118     09/25/2016     Stroke
5       P7789-1212  Franz Gullikson      2    54603     04/15/2017     Ear infection
6       R0606-1223  Joseph Kaswizki      4    54551     01/21/2017     Influenza
7       B7212-7306  Avijit Farooq        3    54601     02/02/2017     Influenza
8       R8675-3099  Alejandro Rodriquez  18   54650     11/27/2016     Concussion
9       J0673-1272  Aleksandra Kaminski  2    54603     04/17/2017     Ear infection
10      W1720-0825  Amanda Barns         4    54601     02/04/2017     Influenza
You can mask the data in many different ways. In this example, we removed the patient name and ID, masked the last three digits of the postcode, generalized the age into ranges, and generalized the date of visit from a specific day into a quarter. The resulting data set shows how the formerly unique records become more generic while still providing the necessary data.
Row ID  Age      Postcode  Date of Visit  Issue
1       [0-5]    54***     Q2 2017        Ear infection
2       [20-29]  54***     Q1 2017        Influenza
3       [50-59]  55***     Q4 2016        Heart attack
4       [60-69]  55***     Q3 2016        Stroke
5       [0-5]    54***     Q2 2017        Ear infection
6       [0-5]    54***     Q1 2017        Influenza
7       [0-5]    54***     Q1 2017        Influenza
8       [15-19]  54***     Q4 2016        Concussion
9       [0-5]    54***     Q2 2017        Ear infection
10      [0-5]    54***     Q1 2017        Influenza
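
The masking steps in this example can be approximated in a few lines of Python. This is only an illustrative sketch, not how SAP Data Hub implements Mask, Date Generalization, or Numeric Generalization. The function names are hypothetical, and the age ranges are an assumption: only the ranges that appear in the example table are listed, so any other age raises an error.

```python
from datetime import datetime

# Assumption: only the age ranges shown in the example table are defined.
AGE_RANGES = [(0, 5), (15, 19), (20, 29), (50, 59), (60, 69)]


def generalize_age(age):
    """Replace an exact age with the range that contains it."""
    for low, high in AGE_RANGES:
        if low <= age <= high:
            return f"[{low}-{high}]"
    raise ValueError(f"no age range defined for {age}")


def mask_postcode(postcode, keep=2):
    """Keep the leading digits and replace the rest with asterisks."""
    return postcode[:keep] + "*" * (len(postcode) - keep)


def generalize_date(date_text):
    """Turn an MM/DD/YYYY date into a quarter label such as 'Q2 2017'."""
    visit = datetime.strptime(date_text, "%m/%d/%Y")
    quarter = (visit.month - 1) // 3 + 1
    return f"Q{quarter} {visit.year}"


def mask_record(record):
    """Drop Patient ID and Name, and generalize the remaining identifiers."""
    return {
        "Row ID": record["Row ID"],
        "Age": generalize_age(record["Age"]),
        "Postcode": mask_postcode(record["Postcode"]),
        "Date of Visit": generalize_date(record["Date of Visit"]),
        "Issue": record["Issue"],
    }


row_1 = {
    "Row ID": 1,
    "Patient ID": "L1234-0987",
    "Name": "James Smith",
    "Age": 1,
    "Postcode": "54601",
    "Date of Visit": "05/13/2017",
    "Issue": "Ear infection",
}
print(mask_record(row_1))
```

Applied to row 1, the sketch yields the same values as the first row of the masked table: [0-5], 54***, Q2 2017, Ear infection.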

Now that the data is masked, you can see how the data is placed into anonymized groups.

Age      Postcode  Date of Visit  Issue          Anonymization Group Size
[0-5]    54***     Q1 2017        Influenza      3
[20-29]  54***     Q1 2017        Influenza      1
[0-5]    54***     Q2 2017        Ear infection  3
[50-59]  55***     Q4 2016        Heart attack   1
[60-69]  55***     Q3 2016        Stroke         1
[15-19]  54***     Q4 2016        Concussion     1
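
The group sizes in this table can be reproduced by counting how many masked rows share the same Age, Postcode, and Date of Visit values. The following is a minimal sketch under that assumption; the tuple layout and the name group_sizes are hypothetical and are not part of SAP Data Hub.

```python
from collections import Counter

# (row_id, age, postcode, date_of_visit, issue), copied from the masked table.
masked_rows = [
    (1, "[0-5]", "54***", "Q2 2017", "Ear infection"),
    (2, "[20-29]", "54***", "Q1 2017", "Influenza"),
    (3, "[50-59]", "55***", "Q4 2016", "Heart attack"),
    (4, "[60-69]", "55***", "Q3 2016", "Stroke"),
    (5, "[0-5]", "54***", "Q2 2017", "Ear infection"),
    (6, "[0-5]", "54***", "Q1 2017", "Influenza"),
    (7, "[0-5]", "54***", "Q1 2017", "Influenza"),
    (8, "[15-19]", "54***", "Q4 2016", "Concussion"),
    (9, "[0-5]", "54***", "Q2 2017", "Ear infection"),
    (10, "[0-5]", "54***", "Q1 2017", "Influenza"),
]


def group_sizes(rows):
    """Count rows per combination of the three anonymized columns."""
    return Counter(
        (age, postcode, visit) for _row_id, age, postcode, visit, _issue in rows
    )


for group, size in group_sizes(masked_rows).items():
    print(group, size)
```

The count yields six groups: two groups of three rows and four groups of one row.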

If you set the Minimum rows in anonymized group option to three, you can publish six of the ten records.

Row ID  Age      Postcode  Date of Visit  Issue
1       [0-5]    54***     Q2 2017        Ear infection
5       [0-5]    54***     Q2 2017        Ear infection
6       [0-5]    54***     Q1 2017        Influenza
7       [0-5]    54***     Q1 2017        Influenza
9       [0-5]    54***     Q2 2017        Ear infection
10      [0-5]    54***     Q1 2017        Influenza

Now, let's say that the issue in the first row is a concussion rather than an ear infection. Would the record still be published? Yes, because the Issue column is not anonymized. Age, Postcode, and Date of Visit are the only anonymized columns, so only the data in those columns is used to form anonymized groups. The data in the Issue column is not.
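
To make this concrete, the sketch below filters the masked rows with a minimum group size of three, then repeats the filter after changing the issue in row 1 to a concussion. It assumes that groups are keyed only on Age, Postcode, and Date of Visit, as described above. The name publishable_ids and the tuple layout are hypothetical and do not come from SAP Data Hub.

```python
from collections import Counter

# (row_id, age, postcode, date_of_visit, issue), copied from the masked table.
masked_rows = [
    (1, "[0-5]", "54***", "Q2 2017", "Ear infection"),
    (2, "[20-29]", "54***", "Q1 2017", "Influenza"),
    (3, "[50-59]", "55***", "Q4 2016", "Heart attack"),
    (4, "[60-69]", "55***", "Q3 2016", "Stroke"),
    (5, "[0-5]", "54***", "Q2 2017", "Ear infection"),
    (6, "[0-5]", "54***", "Q1 2017", "Influenza"),
    (7, "[0-5]", "54***", "Q1 2017", "Influenza"),
    (8, "[15-19]", "54***", "Q4 2016", "Concussion"),
    (9, "[0-5]", "54***", "Q2 2017", "Ear infection"),
    (10, "[0-5]", "54***", "Q1 2017", "Influenza"),
]


def publishable_ids(rows, minimum_rows):
    """Return the IDs of rows whose anonymized group is large enough."""
    # row[1:4] is (age, postcode, date_of_visit); the issue is not in the key.
    sizes = Counter(row[1:4] for row in rows)
    return [row[0] for row in rows if sizes[row[1:4]] >= minimum_rows]


# Same data, except that the issue in row 1 is now a concussion.
changed_rows = [(1, "[0-5]", "54***", "Q2 2017", "Concussion")] + masked_rows[1:]

print(publishable_ids(masked_rows, 3))
print(publishable_ids(changed_rows, 3))
```

Both calls return the same six row IDs (1, 5, 6, 7, 9, and 10), because the changed Issue value never enters the group key.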