Preserving Privacy in Published Data Dimitris Sacharidis

Preserving Privacy in Published Data

Dimitris Sacharidis

outline

• introduction– motivation– definitions

• k-anonymity– incognito– mondrian– topdown

• l-diversity– anatomy

• other issues

motivation

• several agencies, institutions, bureaus, organizations make (sensitive) data involving people publicly available– termed microdata (vs. aggregated macrodata) used for

analysis– often required and imposed by law

• to protect privacy microdata are sanitized– explicit identifiers (SSN, name, phone #) are removed

• is this sufficient for preserving privacy?• no! susceptible to link attacks

– publicly available databases (voter lists, city directories) can reveal the “hidden” identity

link attack example

• Sweeney [S01a] managed to re-identify the medical record of the governor of Massachussetts– MA collects and publishes sanitized medical data for state

employees (microdata) left circle– voter registration list of MA (publicly available data) right circle

• looking for governor’s record• join the tables:

– 6 people had his birth date– 3 were men– 1 in his zipcode

• regarding the US 1990 census data– 87% of the population are unique based on (zipcode, gender,

dob)

privacy in microdata

• the role of attributes in microdata– explicit identifiers are removed– quasi identifiers can be used to re-identify individuals– sensitive attributes (may not exist!) carry sensitive information

Name Birthdate

Sex Zipcode Disease

Andre 21/1/79 male 53715 Flu

Beth 10/1/81 female 55410 HepatitisCarol 1/10/44 female 90210 Brochitis

Dan 21/2/84 male 02174 Sprained Ankle

Ellen 19/4/72 female 02237 AIDS

identifier

quasi identifiers sensitive

Name Birthdate

Sex Zipcode Disease

Andre 21/1/79 male 53715 FluBeth 10/1/81 female 55410 HepatitisCarol 1/10/44 female 90210 BrochitisDan 21/2/84 male 02174 Sprained

AnkleEllen 19/4/72 female 02237 AIDS

goal of privacy preservation (rough definition)de-associate individuals from sensitive

information

outline




• other issues

k-anonymity

• preserve privacy via k-anonymity, proposed by Sweeney and Samarati [S01, S02a, S02b]

• k-anonymity: intuitively, hide each individual among k-1 others– each QI set of values should appear at least k times in the

released microdata– linking cannot be performed with confidence > 1/k– sensitive attributes are not considered (going to revisit this...)

• how to achieve this?– generalization and suppression– value perturbation is not considered (we should remain truthful

to original values )

• privacy vs utility tradeoff– do not anonymize more than necessary

k-anonymity example

tools for anonymization• generalization

– publish more general values, i.e., given a domain hierarchy, roll-up

• suppression– remove tuples, i.e., do not publish outliers– often the number of suppressed tuples is bounded

Birthdate

Sex Zipcode

21/1/79 male 53715

10/1/79 female 55410

1/10/44 female 90210

21/2/83 male 02274

19/4/82 male 02237

Birthdate

SexZipcode

group 1*/1/79 person 5*****/1/79 person 5****

suppressed

1/10/44 female 90210

group 2*/*/8* male 022***/*/8* male 022**

original microdata

2-anonymous data

generalization latticeassume domain hierarchies exist for all QI attributes

zipcode birthdate sex

construct the generalization lattice for the entire QI set

objectivefind the minimum

generalization that satisfies k-anonymity

less

more

Z0

Z1

Z2

={53715, 53710, 53706, 53703}

={5371*, 5370*}

={537**}

B0

B1

={26/ 3/ 1979, 11/ 3/ 1980, 16/ 5/ 1978}

={*}

<S0, Z0>

<S1, Z0> <S0, Z1>

<S1, Z1>

<S1, Z2>

<S0, Z2>

[0, 0]

[1, 0] [0, 1]

[1, 1]

[1, 2]

[0, 2]

S0

S1

={Male, Female}

={Person}

i.e., maximize utility by finding minimum distance vector with k-anonymity

generalization latticeincognito [LDR05]

exploit monotonicity properties regarding frequency of tuples in lattice– reminiscent of OLAP hierarchies and frequent itemset mining

<S0, Z0>

<S1, Z0> <S0, Z1>

<S1, Z1>

<S1, Z2>

<S0, Z2>

(I) generalization property (~rollup)if at some node k-anonymity holds, then it also holds for any ancestor node

(II) subset property (~apriori)if for a set of QI attributes k-anonymity doesn’t hold then it doesn’t hold for any of its supersets

e.g., <S1, Z0> is k-anonymous and, thus, so is <S1, Z1> and <S1, Z2>

e.g., <S0, Z0> is not k-anonymous and, thus <S0, Z0, B0> and <S0, Z0, B1> cannot be k-anonymous

incognito [LDR05] considers sets of QI attributes of increasing cardinality (~apriori) and prunes nodes in the lattice using the two properties above

note: the entire lattice, which includes three dimensions <S,Z,B>, is too complex to show

seen in the domain space• consider the multi-dimensional domain space

– QI attributes are the dimensions– tuples are points in this space– attribute hierarchies partition dimensions

53703 53705 53709 53711 53714 53718 Z0

fem

ale

mal

eS

0

53703 53705 53709 53711 53714 53718 Z0

fem

ale

mal

eS

0

(53705, f emale)

(53711, male)

53703 53705 53709 53711 53714 53718

5370* 5371*

537**

Z0

Z1

Z2

pers

on

fem

ale

mal

eS

0

S1

(53705, f emale)

(53711, male)

zipcode hierarchy

sex hierarchy

seen in the domain space

incognito example2 QI attributes, 7 tuples, hierarchies shown with bold lines

zipcode

sex

group 1w. 2 tuples

group 2w. 3 tuples

group 3w. 2 tuples

rollu

p se

x

rollup zipcode

not 2-anonymous

2-anonymous

seen in the domain spacetaxonomy [LDR05, LDR06]

single dimensional

global recodingincognito [LDR05]

generalization taxonomy according to groupings allowed

multi dimensional global recodingmondrian [LDR06]

multi dimensional local recoding

topdown [XWP+06]

generalization strength

mondrian[LDR06]

• define utility measure: discernability metric (DM)– penalizes each tuple with the size of the group it belongs – intuitively, the ideal grouping is the one in which all groups have size k

• mondrian tries to construct groups of roughly equal size k

• what else (besides Mondrian) does this painting remind you?

• it’s reminiscent of the kd tree:

– cycle among dimensions– median splits

2-anonymous

measuring group quality

• DM depends only on the cardinality of the group– no measure of how tight the group is

• a good group is one that contains tuples with similar QI values• define a new metric [XWP+06]: normalized certainty penalty (NCP)

– measures the perimeter of the group

bad generalization

long boxes

good generalizationsquare-like boxes

topdown [XWP+06]

• start with the entire data set• iteratively split in two

– reminiscent of R-tree quadratic split

• continue until left with groups which contain <2k-1 tuples

split algorithmfind seeds, 2 points that are furthest away• heuristic, not complete quadratic search• the seeds will become the 2 split groups

examine points randomly (unlike quadratic split)

• assign point to the group whose NCP will increase the least

boosting privacy with external data

• external databases (e.g., voter list) are used by attackers• can we use them to our benefit?

– try to improve the utility of anonymized data

• join k-anonymity (JKA) [SMP]

x3

k-anonymity

x3 x3

join

JKA

microdata

public data

3-anonym

ous

join 3-anonymous

joined microdata

outline




• other issues

k-anonymity problems• k-anonymity example• homogeneity attack: in the last group everyone has cancer

• background knowledge: in the first group, Japanese have low chance of heart disease

• we need to consider the sensitive values

id Zipcode

Sex National. Disease

1 13053 28 Russian Heart Disease2 13068 29 American Heart Disease3 13068 21 Japanese Viral Infection4 13053 23 American Viral Infection5 14853 50 Indian Cancer6 14853 55 Russian Heart Disease7 14850 47 American Viral Infection8 14850 49 American Viral Infection9 13053 31 American Cancer

10 13053 37 Indian Cancer11 13068 36 Japanese Cancer12 13068 35 American Cancer

id Zipcode

Sex National. Disease

1 130** <30 ∗ Heart Disease2 130** <30 ∗ Heart Disease3 130** <30 ∗ Viral Infection4 130** <30 ∗ Viral Infection5 1485* ≥40 ∗ Cancer6 1485* ≥40 ∗ Heart Disease7 1485* ≥40 ∗ Viral Infection8 1485* ≥40 ∗ Viral Infection9 130** 3∗ ∗ Cancer

10 130** 3∗ ∗ Cancer11 130** 3∗ ∗ Cancer12 130** 3∗ ∗ Cancer

microdata 4-anonymous data

l-diversity[MGK+06]

• make sure each group contains well represented sensitive values– protect from homogeneity attacks– protect from background knowledge

l-diversity (simplified definition)a group is l-diverse if the most frequent sensitive value appears at most 1/l times in group

• l-diversity definition has monotonicity properties similar to k-anonymity– simply adapt existing algorithms to count frequencies of

sensitive values

anatomy[XT06]

• fast l-diversity algorithm• anatomy is not generalization

– seperates sensitive values from tuples– shuffles sensitive values among groups

id Age Sex

Zipcode Group ID

1 23 M 11000 12 27 M 13000 13 35 M 59000 14 59 M 12000 15 61 F 54000 26 65 F 25000 27 65 F 25000 28 70 F 30000 2

Group-ID Disease Count1 dyspepsia 21 pneumonia 22 bronchitis 12 flu 22 gastritis 1

1

3

5

62

9

8

7

4

group 1 5 8 7

1

3

62

9

4

5 8

3 9

group 1

group 2 6

7

1 2 4

5 8 7

3 9 6

1 2 4

group 1

group 2

group 3

algorithm•assign sensitive values to buckets•create groups by drawing from l largest buckets

outline




• other issues

other issues

• how to preserve privacy in dynamic data– [XT07]

• protection against minimality attack– [WFW+07]

• anonymity in location-based services– spatial cloaking

references[LDR05] LeFevre, K., DeWitt, D.J., Ramakrishnan, R. Incognito: Efficient Full-

domain k-Anonymity. SIGMOD, 2005.[LDR06] LeFevre, K., DeWitt, D.J., Ramakrishnan, R. Mondrian Multidimensional

k-Anonymity. ICDE, 2006.[S01] Samarati, P. Protecting Respondents' Identities in Microdata Release.

IEEE TKDE, 13(6):1010-1027, 2001.[S02a] Sweeney, L. k-Anonymity: A Model for Protecting Privacy. International

Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.[S02b] Sweeney, L. k-Anonymity: Achieving k-Anonymity Privacy Protection

using Generalization and Suppresion. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.

[SMP] Sacharidis, D., Mouratidis, K., Papadias, D. k-Anonymity in the Presence of External Databases (submitted)

[WFWP07] Wong, R. C., Fu, A. W., Wang, K., Pei, J. Minimality Attack in Privacy Preserving Data Publishing. VLDB, 2007.

[XT06] Xiao, X, Tao, Y. Anatomy: Simple and Effective Privacy Preservation. VLDB, 2006.

[XT07] Xiao, X, Tao, Y. m-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets. SIGMOD, 2007.

[XWP+06] Xu, J., Wang, W., Pei, J., Wang, X., Shi, B., Fu, A., Utility-Based Anonymization Using Local Recoding. SIGKDD, 2006.

Documents

Preserving Privacy in Published Data Dimitris Sacharidis