Preserving Privacy in Published Data
Dimitris Sacharidis
outline
• introduction
  – motivation
  – definitions
• k-anonymity
  – incognito
  – mondrian
  – topdown
• l-diversity
  – anatomy
• other issues
motivation
• several agencies, institutions, bureaus, and organizations make (sensitive) data involving people publicly available
  – termed microdata (vs. aggregated macrodata), used for analysis
  – often required and imposed by law
• to protect privacy, microdata are sanitized
  – explicit identifiers (SSN, name, phone #) are removed
• is this sufficient for preserving privacy?
• no! sanitized microdata are susceptible to link attacks
  – publicly available databases (voter lists, city directories) can reveal the “hidden” identity
link attack example
• Sweeney [S01a] managed to re-identify the medical record of the governor of Massachusetts
  – MA collects and publishes sanitized medical data for state employees (microdata): left circle
  – voter registration list of MA (publicly available data): right circle
• looking for the governor’s record
• join the tables:
  – 6 people had his birth date
  – 3 were men
  – 1 in his zipcode
• regarding the US 1990 census data
  – 87% of the population are unique based on (zipcode, gender, dob)
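The join behind a link attack can be sketched in a few lines. This is only an illustration on toy data: the two tables, the attribute names, and the `link_attack` helper are assumptions made for this example, not part of [S01a].

```python
# Hypothetical sanitized microdata: explicit identifiers already removed.
medical = [
    {"dob": "1979-01-21", "sex": "male", "zipcode": "53715", "disease": "Flu"},
    {"dob": "1981-01-10", "sex": "female", "zipcode": "55410", "disease": "Hepatitis"},
    {"dob": "1944-10-01", "sex": "female", "zipcode": "90210", "disease": "Bronchitis"},
]

# Hypothetical public voter list that still carries names.
voters = [
    {"name": "Andre", "dob": "1979-01-21", "sex": "male", "zipcode": "53715"},
    {"name": "Carol", "dob": "1944-10-01", "sex": "female", "zipcode": "90210"},
    {"name": "Frank", "dob": "1960-05-05", "sex": "male", "zipcode": "10001"},
]

QI = ("dob", "sex", "zipcode")  # quasi identifiers shared by both tables


def link_attack(microdata, public, qi=QI):
    """Join both tables on the quasi identifiers; report unique matches."""
    index = {}
    for row in microdata:
        index.setdefault(tuple(row[a] for a in qi), []).append(row)
    reidentified = []
    for person in public:
        matches = index.get(tuple(person[a] for a in qi), [])
        if len(matches) == 1:  # a unique match pins down one record
            reidentified.append((person["name"], matches[0]["disease"]))
    return reidentified


leaks = link_attack(medical, voters)
```

Every voter whose (dob, sex, zipcode) combination matches exactly one medical record is re-identified, even though the medical table contains no names.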
privacy in microdata
• the role of attributes in microdata
  – explicit identifiers are removed
  – quasi identifiers can be used to re-identify individuals
  – sensitive attributes (may not exist!) carry sensitive information
Name   Birthdate  Sex     Zipcode  Disease
Andre  21/1/79    male    53715    Flu
Beth   10/1/81    female  55410    Hepatitis
Carol  1/10/44    female  90210    Bronchitis
Dan    21/2/84    male    02174    Sprained Ankle
Ellen  19/4/72    female  02237    AIDS
identifier: Name
quasi identifiers: Birthdate, Sex, Zipcode
sensitive: Disease
goal of privacy preservation (rough definition): de-associate individuals from sensitive information
k-anonymity
• preserve privacy via k-anonymity, proposed by Sweeney and Samarati [S01, S02a, S02b]
• k-anonymity: intuitively, hide each individual among k-1 others
  – each QI set of values should appear at least k times in the released microdata
  – linking cannot be performed with confidence > 1/k
  – sensitive attributes are not considered (going to revisit this...)
• how to achieve this?
  – generalization and suppression
  – value perturbation is not considered (we should remain truthful to original values)
• privacy vs utility tradeoff
  – do not anonymize more than necessary
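The condition "each QI set of values should appear at least k times" translates directly into a frequency count. A minimal sketch, assuming the table is a list of dictionaries; the function name `is_k_anonymous` and the sample rows are chosen for illustration only.

```python
from collections import Counter


def is_k_anonymous(rows, qi, k):
    """True if every combination of QI values occurs at least k times."""
    counts = Counter(tuple(row[a] for a in qi) for row in rows)
    return all(c >= k for c in counts.values())


# Hypothetical released table, already generalized.
released = [
    {"birthdate": "*/1/79", "sex": "person", "zipcode": "5****"},
    {"birthdate": "*/1/79", "sex": "person", "zipcode": "5****"},
    {"birthdate": "*/*/8*", "sex": "male", "zipcode": "022**"},
    {"birthdate": "*/*/8*", "sex": "male", "zipcode": "022**"},
]
QI = ("birthdate", "sex", "zipcode")

two_anonymous = is_k_anonymous(released, QI, 2)
three_anonymous = is_k_anonymous(released, QI, 3)
```

With two tuples per QI combination, the sample table is 2-anonymous but not 3-anonymous.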
k-anonymity example
tools for anonymization
• generalization
  – publish more general values, i.e., given a domain hierarchy, roll-up
• suppression
  – remove tuples, i.e., do not publish outliers
  – often the number of suppressed tuples is bounded
original microdata
Birthdate  Sex     Zipcode
21/1/79    male    53715
10/1/79    female  55410
1/10/44    female  90210
21/2/83    male    02274
19/4/82    male    02237
2-anonymous data
            Birthdate  Sex     Zipcode
group 1     */1/79     person  5****
group 1     */1/79     person  5****
suppressed  1/10/44    female  90210
group 2     */*/8*     male    022**
group 2     */*/8*     male    022**
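Generalization followed by bounded suppression can be sketched as below. Note the assumptions: unlike the table above, which generalizes each group differently, this sketch applies one fixed recoding to all tuples (birthdate rolled up to the decade, sex to "person", zipcode to its first digit), and the helper names `generalize` and `anonymize` are hypothetical.

```python
from collections import Counter

microdata = [
    {"birthdate": "21/1/79", "sex": "male", "zipcode": "53715"},
    {"birthdate": "10/1/79", "sex": "female", "zipcode": "55410"},
    {"birthdate": "1/10/44", "sex": "female", "zipcode": "90210"},
    {"birthdate": "21/2/83", "sex": "male", "zipcode": "02274"},
    {"birthdate": "19/4/82", "sex": "male", "zipcode": "02237"},
]


def generalize(row, zip_digits=1):
    """Roll up every QI attribute along an assumed, fixed hierarchy."""
    _day, _month, year = row["birthdate"].split("/")
    return {
        "birthdate": f"*/*/{year[0]}*",  # keep only the decade
        "sex": "person",
        "zipcode": row["zipcode"][:zip_digits] + "*" * (5 - zip_digits),
    }


def anonymize(rows, k, max_suppressed):
    """Generalize all rows, then suppress rows in groups smaller than k."""
    general = [generalize(r) for r in rows]

    def key(r):
        return (r["birthdate"], r["sex"], r["zipcode"])

    counts = Counter(key(r) for r in general)
    kept = [r for r in general if counts[key(r)] >= k]
    suppressed = len(general) - len(kept)
    if suppressed > max_suppressed:
        raise ValueError("too many tuples would be suppressed")
    return kept, suppressed


kept, suppressed = anonymize(microdata, k=2, max_suppressed=1)
```

On the five tuples above, the 1944 record has no partner after generalization, so it is the single suppressed outlier and the remaining four tuples form two groups of size 2.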
generalization lattice
assume domain hierarchies exist for all QI attributes
zipcode:   Z0 = {53715, 53710, 53706, 53703}, Z1 = {5371*, 5370*}, Z2 = {537**}
birthdate: B0 = {26/3/1979, 11/3/1980, 16/5/1978}, B1 = {*}
sex:       S0 = {Male, Female}, S1 = {Person}
construct the generalization lattice for the entire QI set (nodes with their distance vectors, from less to more general)
<S0, Z0> [0, 0]
<S1, Z0> [1, 0]    <S0, Z1> [0, 1]
<S1, Z1> [1, 1]    <S0, Z2> [0, 2]
<S1, Z2> [1, 2]
objective: find the minimum generalization that satisfies k-anonymity, i.e., maximize utility by finding the minimum distance vector with k-anonymity
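A brute-force search over the <S, Z> lattice can be sketched as follows. Assumptions: the "distance" of a node is taken to be the sum of its generalization levels, the toy rows are invented, and the helper names are hypothetical. The sketch also assumes the table has at least k rows, so that the top node <S1, Z2> always qualifies.

```python
from collections import Counter
from itertools import product


def gen_sex(value, level):
    """S0 -> S1: keep the value or roll up to Person."""
    return value if level == 0 else "Person"


def gen_zip(value, level):
    """Z0 -> Z1 -> Z2: mask the last `level` digits."""
    return value[:5 - level] + "*" * level


def minimal_generalizations(rows, k, max_levels=(1, 2)):
    """Check every lattice node; return the k-anonymous ones of least height."""
    satisfying = []
    for vec in product(*(range(m + 1) for m in max_levels)):
        groups = Counter((gen_sex(s, vec[0]), gen_zip(z, vec[1])) for s, z in rows)
        if all(c >= k for c in groups.values()):
            satisfying.append(vec)
    best = min(sum(v) for v in satisfying)
    return sorted(v for v in satisfying if sum(v) == best)


# Toy data: one male and one female per zipcode of Z0.
rows = [(sex, z) for z in ("53715", "53710", "53706", "53703")
        for sex in ("Male", "Female")]

for_k2 = minimal_generalizations(rows, k=2)
for_k4 = minimal_generalizations(rows, k=4)
```

For this data, k = 2 is already met one step above the bottom, at [0, 1] and [1, 0], while k = 4 needs height 2, at [0, 2] and [1, 1]. Checking every node is exponential in the number of QI attributes, which is what motivates pruning.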
incognito [LDR05]
exploit monotonicity properties regarding the frequency of tuples in the lattice
  – reminiscent of OLAP hierarchies and frequent itemset mining
(I) generalization property (~rollup): if k-anonymity holds at some node, then it also holds for any ancestor node
  – e.g., <S1, Z0> is k-anonymous and, thus, so are <S1, Z1> and <S1, Z2>
(II) subset property (~apriori): if k-anonymity doesn’t hold for a set of QI attributes, then it doesn’t hold for any of its supersets
  – e.g., <S0, Z0> is not k-anonymous and, thus, <S0, Z0, B0> and <S0, Z0, B1> cannot be k-anonymous
incognito [LDR05] considers sets of QI attributes of increasing cardinality (~apriori) and prunes nodes in the lattice using the two properties above
note: the entire lattice, which includes three dimensions <S,Z,B>, is too complex to show
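The two pruning properties can be illustrated on the two-attribute lattice <S, Z>. This is a simplified sketch in the spirit of incognito, not the algorithm of [LDR05]: the data, the function names, and the `checks` counter (which counts how many frequency scans are actually performed) are assumptions for the example.

```python
from collections import Counter
from itertools import product


def generalize(value, attr, level):
    if attr == 0:                              # sex: S0 -> S1
        return value if level == 0 else "Person"
    return value[:5 - level] + "*" * level     # zipcode: Z0 -> Z1 -> Z2


def is_k_anonymous(rows, k, node):
    """node maps attribute index -> generalization level."""
    groups = Counter(
        tuple(generalize(r[a], a, lvl) for a, lvl in sorted(node.items()))
        for r in rows
    )
    return all(c >= k for c in groups.values())


def incognito_sketch(rows, k, max_levels=(1, 2)):
    checks = 0
    # Pass 1: single attributes, bottom-up.
    ok_single = {}
    for attr, top in enumerate(max_levels):
        found = False
        for lvl in range(top + 1):
            if not found:
                checks += 1
                found = is_k_anonymous(rows, k, {attr: lvl})
            ok_single[(attr, lvl)] = found     # (I): ancestors inherit success
    # Pass 2: attribute pairs, visited from less to more general.
    anonymous = set()
    nodes = sorted(product(*(range(t + 1) for t in max_levels)), key=sum)
    for node in nodes:
        if not all(ok_single[(a, lvl)] for a, lvl in enumerate(node)):
            continue                           # (II): a failing subset prunes the node
        if any(all(p <= n for p, n in zip(prev, node)) for prev in anonymous):
            anonymous.add(node)                # (I): ancestor of a k-anonymous node
            continue
        checks += 1
        if is_k_anonymous(rows, k, dict(enumerate(node))):
            anonymous.add(node)
    return anonymous, checks


# Toy data: two tuples per zipcode, but only one female overall.
rows = [("Male", "53715"), ("Male", "53715"), ("Male", "53710"), ("Male", "53710"),
        ("Female", "53706"), ("Male", "53706"), ("Male", "53703"), ("Male", "53703")]

anonymous_nodes, checks = incognito_sketch(rows, k=2)
```

Here <S0> alone is not 2-anonymous, so every node with S0 is pruned without scanning the data, and once <S1, Z0> is found 2-anonymous its ancestors <S1, Z1> and <S1, Z2> are accepted for free. The sketch performs 4 scans, whereas testing all six pair nodes directly would take 6.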
seen in the domain space
• consider the multi-dimensional domain space
  – QI attributes are the dimensions
  – tuples are points in this space
  – attribute hierarchies partition dimensions
[figure: grid with zipcode on one axis (Z0: 53703, 53705, 53709, 53711, 53714, 53718) and sex on the other (S0: female, male); example points (53705, female) and (53711, male); the zipcode hierarchy rolls Z0 up to Z1 = {5370*, 5371*} and Z2 = {537**}; the sex hierarchy rolls S0 up to S1 = {person}]
seen in the domain space
incognito example: 2 QI attributes, 7 tuples, hierarchies shown with bold lines
[figure: tuples plotted over zipcode and sex; labels in the figure: group 1 w. 2 tuples, group 2 w. 3 tuples, group 3 w. 2 tuples, rollup sex, rollup zipcode, not 2-anonymous, 2-anonymous]
seen in the domain space
taxonomy [LDR05, LDR06]
generalization taxonomy according to groupings allowed (the figure arranges these along an axis labeled generalization strength)
• single dimensional global recoding: incognito [LDR05]
• multi dimensional global recoding: mondrian [LDR06]
• multi dimensional local recoding: topdown [XWP+06]
mondrian [LDR06]
• define utility measure: discernability metric (DM)
  – penalizes each tuple with the size of the group it belongs to
  – intuitively, the ideal grouping is the one in which all groups have size k
• mondrian tries to construct groups of roughly equal size k
• what else (besides Mondrian) does this painting remind you of?
• it’s reminiscent of the kd-tree:
  – cycle among dimensions
  – median splits
[figure: Mondrian-style partitioning of the domain space, labeled 2-anonymous]
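The kd-tree analogy suggests a short recursive sketch: cycle among dimensions, split at the median, and stop when a split would leave a side with fewer than k tuples. Assumptions: QI values are numeric, the split is by position in sorted order (so tuples with equal values may be separated), and the function names are hypothetical. This is not the exact procedure of [LDR06].

```python
def mondrian(points, k, depth=0):
    """Recursively split points at the median; every group keeps >= k tuples."""
    if len(points) < 2 * k:          # a split would leave one side below k
        return [points]
    dim = depth % len(points[0])     # cycle among dimensions, kd-tree style
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2          # median split by position
    return (mondrian(ordered[:mid], k, depth + 1)
            + mondrian(ordered[mid:], k, depth + 1))


def bounding_box(group):
    """Generalized value of a group: the (min, max) range per dimension."""
    dims = range(len(group[0]))
    return [(min(p[d] for p in group), max(p[d] for p in group)) for d in dims]


# Hypothetical tuples as (age, zipcode) points.
points = [(23, 11000), (27, 13000), (35, 59000), (59, 12000),
          (61, 54000), (65, 25000), (70, 30000)]

groups = mondrian(points, k=2)
boxes = [bounding_box(g) for g in groups]
```

Each leaf ends up with between k and 2k-1 tuples, and each group is published as its bounding box.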
measuring group quality
• DM depends only on the cardinality of the group
  – no measure of how tight the group is
• a good group is one that contains tuples with similar QI values
• define a new metric [XWP+06]: normalized certainty penalty (NCP)
  – measures the perimeter of the group
• bad generalization: long boxes
• good generalization: square-like boxes
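The difference between the two metrics shows up on a four-point example. A minimal sketch, assuming numeric QI attributes with equal weights, DM computed as the sum of squared group sizes, and NCP of a group computed as the sum of its per-attribute ranges normalized by the domain ranges; the function names and the data are illustrative.

```python
def discernability(groups):
    """DM: each tuple is penalized with the size of its group."""
    return sum(len(g) ** 2 for g in groups)


def ncp(group, domain_ranges):
    """NCP of one group: normalized extent of its bounding box, summed over dims."""
    total = 0.0
    for d, (lo, hi) in enumerate(domain_ranges):
        values = [p[d] for p in group]
        total += (max(values) - min(values)) / (hi - lo)
    return total


def total_ncp(groups, domain_ranges):
    """Weight each group's NCP by the number of tuples it covers."""
    return sum(len(g) * ncp(g, domain_ranges) for g in groups)


domain = [(0, 10), (0, 10)]          # assumed domain range per attribute
tight = [[(0, 0), (1, 0)], [(0, 10), (1, 10)]]   # small boxes
long_ = [[(0, 0), (0, 10)], [(1, 0), (1, 10)]]   # long boxes over the same points

dm_tight, dm_long = discernability(tight), discernability(long_)
ncp_tight, ncp_long = total_ncp(tight, domain), total_ncp(long_, domain)
```

Both groupings have two groups of size 2, so DM cannot tell them apart, while NCP is ten times larger for the long boxes.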
topdown [XWP+06]
• start with the entire data set
• iteratively split in two
  – reminiscent of R-tree quadratic split
• continue until left with groups which contain <2k-1 tuples
split algorithm
• find seeds: 2 points that are furthest away
  – heuristic, not complete quadratic search
  – the seeds will become the 2 split groups
• examine points randomly (unlike quadratic split)
• assign each point to the group whose NCP will increase the least
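One binary split step can be sketched as below. Assumptions: the seed heuristic hops to the furthest point a fixed number of times, distance between two points is the NCP of the box that contains both, and the names `find_seeds` and `split` are hypothetical. The recursion and any handling of undersized groups in [XWP+06] are not shown.

```python
import random


def ncp(group, ranges):
    """Normalized extent of the group's bounding box, summed over dimensions."""
    return sum(
        (max(p[d] for p in group) - min(p[d] for p in group)) / (hi - lo)
        for d, (lo, hi) in enumerate(ranges)
    )


def find_seeds(points, ranges, rounds=3):
    """Heuristic seed search: repeatedly hop to the furthest point."""
    def dist(a, b):
        return ncp([points[a], points[b]], ranges)

    u, v = 0, 0
    for _ in range(rounds):
        v = max(range(len(points)), key=lambda c: dist(u, c))
        u, v = v, u
    if u == v:                       # degenerate case: all points coincide
        v = (u + 1) % len(points)
    return u, v


def split(points, ranges, rng):
    """Split points (at least two) in two groups grown around the seeds."""
    i, j = find_seeds(points, ranges)
    left, right = [points[i]], [points[j]]
    rest = [p for idx, p in enumerate(points) if idx not in (i, j)]
    rng.shuffle(rest)                # examine points in random order
    for p in rest:
        grow_left = ncp(left + [p], ranges) - ncp(left, ranges)
        grow_right = ncp(right + [p], ranges) - ncp(right, ranges)
        (left if grow_left <= grow_right else right).append(p)
    return left, right


points = [(0, 0), (1, 1), (0, 1), (10, 10), (11, 10), (10, 11)]
ranges = [(0, 11), (0, 11)]          # assumed domain range per attribute
left, right = split(points, ranges, random.Random(0))
```

On two well-separated clusters the seeds land in different clusters and every remaining point joins the nearer seed, whatever the random order.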
boosting privacy with external data
• external databases (e.g., voter list) are used by attackers
• can we use them to our benefit?
  – try to improve the utility of anonymized data
• join k-anonymity (JKA) [SMP]
[figure: k-anonymity compared with JKA; labels in the figure: microdata, public data, join, joined microdata, 3-anonymous, join 3-anonymous]
k-anonymity problems
• k-anonymity example
• homogeneity attack: in the last group everyone has cancer
• background knowledge: in the first group, Japanese have a low chance of heart disease
• we need to consider the sensitive values
microdata
id  Zipcode  Age  National.  Disease
1   13053    28   Russian    Heart Disease
2   13068    29   American   Heart Disease
3   13068    21   Japanese   Viral Infection
4   13053    23   American   Viral Infection
5   14853    50   Indian     Cancer
6   14853    55   Russian    Heart Disease
7   14850    47   American   Viral Infection
8   14850    49   American   Viral Infection
9   13053    31   American   Cancer
10  13053    37   Indian     Cancer
11  13068    36   Japanese   Cancer
12  13068    35   American   Cancer
4-anonymous data
id  Zipcode  Age  National.  Disease
1   130**    <30  ∗          Heart Disease
2   130**    <30  ∗          Heart Disease
3   130**    <30  ∗          Viral Infection
4   130**    <30  ∗          Viral Infection
5   1485*    ≥40  ∗          Cancer
6   1485*    ≥40  ∗          Heart Disease
7   1485*    ≥40  ∗          Viral Infection
8   1485*    ≥40  ∗          Viral Infection
9   130**    3∗   ∗          Cancer
10  130**    3∗   ∗          Cancer
11  130**    3∗   ∗          Cancer
12  130**    3∗   ∗          Cancer
l-diversity [MGK+06]
• make sure each group contains well represented sensitive values
  – protect from homogeneity attacks
  – protect from background knowledge
l-diversity (simplified definition): a group is l-diverse if the most frequent sensitive value accounts for at most a 1/l fraction of the tuples in the group
• l-diversity definition has monotonicity properties similar to k-anonymity
  – simply adapt existing algorithms to count frequencies of sensitive values
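The simplified definition can be checked by counting sensitive values per QI group. A minimal sketch using the 4-anonymous table from the k-anonymity problems slide; the tuple layout and the function name `is_l_diverse` are assumptions for the example.

```python
from collections import Counter, defaultdict


def is_l_diverse(rows, l):
    """rows are (qi_values, sensitive_value) pairs.

    Simplified l-diversity: in every QI group, the most frequent sensitive
    value covers at most a 1/l fraction of the group's tuples.
    """
    groups = defaultdict(list)
    for qi_values, sensitive in rows:
        groups[qi_values].append(sensitive)
    return all(
        max(Counter(values).values()) * l <= len(values)
        for values in groups.values()
    )


g1 = [(("130**", "<30"), d) for d in
      ("Heart Disease", "Heart Disease", "Viral Infection", "Viral Infection")]
g2 = [(("1485*", ">=40"), d) for d in
      ("Cancer", "Heart Disease", "Viral Infection", "Viral Infection")]
g3 = [(("130**", "3*"), d) for d in ("Cancer", "Cancer", "Cancer", "Cancer")]

first_two_groups = is_l_diverse(g1 + g2, l=2)
whole_table = is_l_diverse(g1 + g2 + g3, l=2)
```

The first two groups are 2-diverse, but the whole table is not, because the third group contains only Cancer: the table is 4-anonymous yet open to the homogeneity attack.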
anatomy [XT06]
• fast l-diversity algorithm
• anatomy is not generalization
  – separates sensitive values from tuples
  – shuffles sensitive values among groups
id  Age  Sex  Zipcode  Group ID
1   23   M    11000    1
2   27   M    13000    1
3   35   M    59000    1
4   59   M    12000    1
5   61   F    54000    2
6   65   F    25000    2
7   65   F    25000    2
8   70   F    30000    2
Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1
[figure: nine numbered tuples are placed into buckets by sensitive value and then drawn, step by step, into group 1, group 2, and group 3]
algorithm
• assign sensitive values to buckets
• create groups by drawing from the l largest buckets
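The two algorithm steps can be sketched as follows. Assumptions: the per-tuple diseases below are invented for illustration (the slide only gives counts per group), leftover tuples are appended to the first group that does not yet contain their value, and the function name `anatomize` is hypothetical. This is a sketch of the idea, not the full algorithm of [XT06].

```python
from collections import Counter, defaultdict


def anatomize(rows, qi, sensitive, l):
    """Return (QI table with group ids, counts of sensitive values per group)."""
    buckets = defaultdict(list)              # one bucket per sensitive value
    for row in rows:
        buckets[row[sensitive]].append(row)

    groups = []
    while sum(1 for b in buckets.values() if b) >= l:
        nonempty = [b for b in buckets.values() if b]
        largest = sorted(nonempty, key=len, reverse=True)[:l]
        groups.append([b.pop() for b in largest])   # one tuple per bucket

    for value, bucket in buckets.items():    # place any leftover tuples
        for row in bucket:
            target = next(
                (g for g in groups if all(r[sensitive] != value for r in g)), None
            )
            if target is None:
                raise ValueError("cannot place leftover tuple")
            target.append(row)

    qit = [{**{a: r[a] for a in qi}, "group": gid}
           for gid, g in enumerate(groups, 1) for r in g]
    st = Counter((gid, r[sensitive]) for gid, g in enumerate(groups, 1) for r in g)
    return qit, st


# QI values follow the slide; the disease of each tuple is an assumption.
diseases = ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia",
            "flu", "gastritis", "flu", "bronchitis"]
qi_values = [(23, "M", 11000), (27, "M", 13000), (35, "M", 59000), (59, "M", 12000),
             (61, "F", 54000), (65, "F", 25000), (65, "F", 25000), (70, "F", 30000)]
patients = [{"age": a, "sex": s, "zipcode": z, "disease": d}
            for (a, s, z), d in zip(qi_values, diseases)]

qit, st = anatomize(patients, ("age", "sex", "zipcode"), "disease", l=2)
```

The QI table keeps the exact quasi-identifier values, and the sensitive table only states how often each disease occurs per group, so within a group a tuple cannot be tied to one disease with confidence above 1/l.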
other issues
• how to preserve privacy in dynamic data
  – [XT07]
• protection against minimality attack
  – [WFWP07]
• anonymity in location-based services
  – spatial cloaking
references
[LDR05] LeFevre, K., DeWitt, D.J., Ramakrishnan, R. Incognito: Efficient Full-domain k-Anonymity. SIGMOD, 2005.
[LDR06] LeFevre, K., DeWitt, D.J., Ramakrishnan, R. Mondrian Multidimensional k-Anonymity. ICDE, 2006.
[S01] Samarati, P. Protecting Respondents' Identities in Microdata Release. IEEE TKDE, 13(6):1010-1027, 2001.
[S02a] Sweeney, L. k-Anonymity: A Model for Protecting Privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.
[S02b] Sweeney, L. k-Anonymity: Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.
[SMP] Sacharidis, D., Mouratidis, K., Papadias, D. k-Anonymity in the Presence of External Databases (submitted).
[WFWP07] Wong, R. C., Fu, A. W., Wang, K., Pei, J. Minimality Attack in Privacy Preserving Data Publishing. VLDB, 2007.
[XT06] Xiao, X., Tao, Y. Anatomy: Simple and Effective Privacy Preservation. VLDB, 2006.
[XT07] Xiao, X., Tao, Y. m-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets. SIGMOD, 2007.
[XWP+06] Xu, J., Wang, W., Pei, J., Wang, X., Shi, B., Fu, A. Utility-Based Anonymization Using Local Recoding. SIGKDD, 2006.