04/21/23 1
A Framework for Privacy-Preserving Cluster Analysis
IEEE ISI 2008
Benjamin C. M. Fung, Concordia University, Canada
Lingyu Wang and Mourad Debbabi, Concordia University, Canada
{wang, debbabi}@ciise.concordia.ca
Ke Wang, Simon Fraser University, Canada
Agenda
Motivation
Problem Scope: Anonymity in Clustering
Proposed Method: Top-Down Specialization (TDS)
Proposed Framework
Experimental Results
Related Work
Conclusion
Q & A
Motivation
Corporations, agencies, governments, and individuals want to share valuable information.
But they are reluctant to do so because of privacy concerns.
The focus of this study is publishing data for the purpose of cluster analysis.
But how can we satisfy both the privacy goal and the clustering goal?
Motivation (cont.)
Real-world scenario
A data owner wants to release a person-specific data table to another party (or the public) for the purpose of cluster analysis without compromising privacy of the individuals in the released data.
[Figure: data owner, person-specific data, data recipients, adversary]
Privacy Threat
In the tables below, a description on (Education, Sex) can be so specific that few people match it. Releasing such tables would link a unique individual, or a small number of individuals, to their sensitive information.
Education Sex Age Disease # of Recs.
9th F 30 Flu 3
10th M 32 Heart 4
11th F 35 Fever 5
12th F 37 Fever 4
Bachelors F 42 Flu 6
Bachelors F 44 Heart 4
Masters M 44 Flu 4
Masters F 44 Flu 3
Doctorate F 44 HIV 1
Total: 34
Name Education Sex …
Alice Bachelors F …
Bob Bachelors M …
Cathy Masters F …
Doug Masters F …
Emily Doctorate F …
Privacy Goal: k-Anonymity
The privacy goal is specified by the anonymity on a combination of attributes called a Quasi-Identifier (QID), where each description on a QID is required to be shared by at least k records in the table [Sweeney and Samarati 1998].
Anonymity requirement
Consider QID_1, …, QID_p, e.g., QID = {Education, Sex}.
a(qid_i) denotes the number of data records in T that share the value qid_i on QID_i, e.g., qid = {Doctorate, Female}.
A(QID_i) denotes the smallest a(qid_i) for any value qid_i on QID_i.
A table T satisfies the anonymity requirement
{<QID_1, h_1>, …, <QID_p, h_p>}
if A(QID_i) ≥ h_i for 1 ≤ i ≤ p, where h_i is the anonymity threshold on QID_i, specified by the data owner.
Anonymity Requirement
Example: QID_1 = {Education, Sex}, h_1 = 4
Education Sex Age Class # of Recs.
9th F 30 0G3B 3
10th M 32 0G4B 4
11th F 35 2G3B 5
12th F 37 3G1B 4
Bachelors F 42 4G2B 6
Bachelors F 44 4G0B 4
Masters M 44 4G0B 4
Masters F 44 3G0B 3
Doctorate F 44 1G0B 1
Total: 34
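a(qid_1) for each (Education, Sex) value:
(9th, F): 3
(10th, M): 4
(11th, F): 5
(12th, F): 4
(Bachelors, F): 10
(Masters, M): 4
(Masters, F): 3
(Doctorate, F): 1
A(QID_1) = 1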
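The counts in this example can be reproduced with a short Python sketch. It assumes the table is stored as (Education, Sex, # of Recs.) triples; the function names `qid_counts` and `satisfies` are illustrative and not part of the original slides.

```python
from collections import Counter

# (Education, Sex, # of Recs.) rows copied from the example table.
table = [
    ("9th", "F", 3), ("10th", "M", 4), ("11th", "F", 5), ("12th", "F", 4),
    ("Bachelors", "F", 6), ("Bachelors", "F", 4),
    ("Masters", "M", 4), ("Masters", "F", 3), ("Doctorate", "F", 1),
]

def qid_counts(rows):
    """a(qid): number of records sharing each (Education, Sex) value."""
    counts = Counter()
    for education, sex, n in rows:
        counts[(education, sex)] += n
    return counts

def satisfies(rows, h):
    """True if A(QID), the smallest a(qid), is at least the threshold h."""
    return min(qid_counts(rows).values()) >= h

counts = qid_counts(table)
print(counts[("Bachelors", "F")])  # 10: the two Bachelors/F rows share one qid
print(min(counts.values()))        # 1: A(QID_1), caused by (Doctorate, F)
print(satisfies(table, 4))         # False: the raw table violates h_1 = 4
```

The single (Doctorate, F) record is what drives A(QID_1) down to 1.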
Generalization
Generalize values in ∪QID_j.
Education Sex Age Disease # of Recs.
9th F 30 Flu 3
10th M 32 Heart 4
11th F 35 Fever 5
12th F 37 Fever 4
Bachelors F 42 Flu 6
Bachelors F 44 Heart 4
Masters M 44 Flu 4
Masters F 44 Flu 3
Doctorate F 44 HIV 1
Education Sex Age Disease # of Recs.
9th F 30 Flu 3
10th M 32 Heart 4
11th F 35 Fever 5
12th F 37 Fever 4
Bachelors F 42 Flu 6
Bachelors F 44 Heart 4
Grad School M 44 Flu 4
Grad School F 44 Flu/HIV 4
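The step from the first table to the second can be sketched as a lookup in a taxonomy. The mapping `PARENT` below is an assumption covering only the one generalization shown (Masters and Doctorate to Grad School); the helper names are illustrative.

```python
from collections import Counter

# Assumed fragment of the Education taxonomy: child value -> parent value.
PARENT = {"Masters": "Grad School", "Doctorate": "Grad School"}

# (Education, Sex, # of Recs.) rows from the raw table.
table = [
    ("9th", "F", 3), ("10th", "M", 4), ("11th", "F", 5), ("12th", "F", 4),
    ("Bachelors", "F", 6), ("Bachelors", "F", 4),
    ("Masters", "M", 4), ("Masters", "F", 3), ("Doctorate", "F", 1),
]

def generalize(rows, parent):
    """Replace each Education value by its parent when one is defined."""
    return [(parent.get(education, education), sex, n) for education, sex, n in rows]

def group_sizes(rows):
    """Number of records per (Education, Sex) value."""
    sizes = Counter()
    for education, sex, n in rows:
        sizes[(education, sex)] += n
    return sizes

before = group_sizes(table)
after = group_sizes(generalize(table, PARENT))
print(min(before.values()))          # 1
print(after[("Grad School", "F")])   # 4: Masters/F (3) merged with Doctorate/F (1)
print(min(after.values()))           # 3: (9th, F) is now the smallest group
```

Under this sketch the lone Doctorate record is hidden in a group of 4, while (9th, F) with 3 records would still need further generalization to reach a threshold of 4.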
Problem Statement
Anonymity in Cluster Analysis
Given a table T, an anonymity requirement, and a taxonomy tree for each categorical attribute in ∪QID_j, generalize T to satisfy the anonymity requirement while preserving as much information as possible (cluster structure) for cluster analysis.
We use the existing k-anonymity algorithms available in the current literature [Sweeney 2002; Bayardo and Agrawal 2005; Fung et al. 2005, 2007; LeFevre et al. 2005].
Intuition
Clustering goal and privacy goal are mutually exclusive:
Privacy goal: masking sensitive information, usually specific descriptions that identify individuals.
Clustering goal: grouping similar items together and extracting general structures that capture trends and patterns.
Generalization eliminates outliers, but general cluster structures can still be preserved.
If generalization is performed "carefully", identifying information can be masked while still preserving trends and patterns for clustering.
Challenges
What exactly are the cluster structures?
What information should we preserve?
Our previous work [Fung et al. 2005] addressed the problem of anonymity for classification analysis.
Education Sex Age Class # of Recs.
9th F 30 0G3B 3
10th M 32 0G4B 4
11th F 35 2G3B 5
12th F 37 3G1B 4
Bachelors F 42 4G2B 6
Bachelors F 44 4G0B 4
Masters M 44 4G0B 4
Masters F 44 3G0B 3
Doctorate F 44 1G0B 1
Education Sex Age Class # of Recs.
9th F 30 0G3B 3
10th M 32 0G4B 4
11th F 35 2G3B 5
12th F 37 3G1B 4
Bachelors F 42 4G2B 6
Bachelors F 44 4G0B 4
Grad School M 44 4G0B 4
Grad School F 44 4G0B 4
The Framework: Convert the Problem
[Figure: workflow between the data owner and the data user.
Step 1, Clustering & Labeling: apply a clustering algorithm to the raw table to obtain the raw labeled table.
Step 2, Generalizing: apply Top-Down Specialization (TDS) to obtain the generalized labeled table.
Step 3, Clustering & Labeling: apply a clustering algorithm to the generalized table, then compare cluster structures (F-measure).
Step 4, Release.]
Algorithm: Top-Down Specialization (TDS)
Initialize every value in T to the topmost value.
Initialize Cut_i to include the topmost value.
while some x ∈ ∪Cut_i is valid do
    Find the Best specialization, i.e., the one with the highest Score in ∪Cut_i.
    Perform the Best specialization on T and update ∪Cut_i.
    Update Score(x) and validity for x ∈ ∪Cut_i.
end while
return Generalized T and ∪Cut_i.
[Figure: taxonomy tree for Age: ANY [1-99) specializes into [1-37) and [37-99).]
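The loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles a single QID over categorical attributes, recomputes scores from scratch in each iteration, uses an entropy-based information gain over the Class labels, and takes Score = InfoGain / (AnonyLoss + 1) as an assumed form of the score. The Education taxonomy (`Secondary`, `University`, `Grad School`) and all function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def covers(tax, v, raw):
    """True if raw is v itself or a descendant of v in the taxonomy."""
    return v == raw or any(covers(tax, c, raw) for c in tax.get(v, ()))

def smallest_group(rows, cuts, taxes):
    """A(QID): size of the smallest group once rows are generalized to the cut."""
    groups = Counter(
        tuple(next(v for v in cuts[a] if covers(taxes[a], v, r[a])) for a in cuts)
        for r in rows
    )
    return min(groups.values())

def info_gain(rows, labels, tax, attr, v):
    """Entropy reduction from splitting the records under v among child(v)."""
    part = [l for r, l in zip(rows, labels) if covers(tax, v, r[attr])]
    if not part:
        return 0.0
    gain = entropy(part)
    for child in tax[v]:
        sub = [l for r, l in zip(rows, labels) if covers(tax, child, r[attr])]
        gain -= len(sub) / len(part) * entropy(sub)
    return gain

def tds(rows, labels, taxes, roots, k):
    """Specialize top-down while some specialization keeps A(QID) >= k."""
    cuts = {a: {roots[a]} for a in taxes}          # start at the topmost values
    while True:
        a_now = smallest_group(rows, cuts, taxes)
        best, best_score = None, -1.0
        for a in cuts:
            for v in cuts[a]:
                if v not in taxes[a]:              # leaf: cannot be specialized
                    continue
                cand = {**cuts, a: (cuts[a] - {v}) | set(taxes[a][v])}
                a_new = smallest_group(rows, cand, taxes)
                if a_new < k:                      # invalid: violates anonymity
                    continue
                score = info_gain(rows, labels, taxes[a], a, v) / (a_now - a_new + 1)
                if score > best_score:
                    best, best_score = cand, score
        if best is None:                           # no valid specialization left
            return cuts
        cuts = best

# Hypothetical taxonomies; the leaves match the example tables.
taxes = {
    "Education": {
        "ANY_Edu": ["Secondary", "University"],
        "Secondary": ["9th", "10th", "11th", "12th"],
        "University": ["Bachelors", "Grad School"],
        "Grad School": ["Masters", "Doctorate"],
    },
    "Sex": {"ANY_Sex": ["M", "F"]},
}
roots = {"Education": "ANY_Edu", "Sex": "ANY_Sex"}

# (Education, Sex, #G, #B) taken from the Class column, e.g. 2G3B -> (2, 3).
table = [("9th", "F", 0, 3), ("10th", "M", 0, 4), ("11th", "F", 2, 3),
         ("12th", "F", 3, 1), ("Bachelors", "F", 4, 2), ("Bachelors", "F", 4, 0),
         ("Masters", "M", 4, 0), ("Masters", "F", 3, 0), ("Doctorate", "F", 1, 0)]
rows, labels = [], []
for education, sex, good, bad in table:
    rows += [{"Education": education, "Sex": sex}] * (good + bad)
    labels += ["G"] * good + ["B"] * bad

cuts = tds(rows, labels, taxes, roots, k=4)
print(cuts, smallest_group(rows, cuts, taxes))
```

Because a specialization is only accepted when the resulting smallest group still has at least k records, the returned cut satisfies the anonymity requirement by construction.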
Search Criteria: Score
Consider a specialization v → child(v). To heuristically maximize the information of the generalized data for achieving a given anonymity, we favor the specialization on v that has the maximum information gain for each unit of privacy loss.
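One ratio consistent with this description, assumed here from the TDS formulation in [Fung et al. 2005] rather than read off this slide, divides the information gain of a specialization by its anonymity loss:

```latex
\[
\mathrm{Score}(v) = \frac{\mathrm{InfoGain}(v)}{\mathrm{AnonyLoss}(v) + 1},
\qquad
\mathrm{AnonyLoss}(v) = \operatorname{avg}_{j}\bigl\{ A(QID_j) - A_v(QID_j) \bigr\}
\]
```

Here A_v(QID_j) is assumed to denote the anonymity on QID_j after specializing v, and the +1 in the denominator keeps the score defined when a specialization causes no anonymity loss.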
Experimental Evaluation
Objectives:
Evaluate the information loss (in terms of cluster quality) caused by generalization. This is the cost of achieving anonymity.
Evaluate the information gain (in terms of cluster quality) compared to existing k-anonymization algorithms (which do not focus on preserving cluster structures). This is the benefit of using our method.
Data set: the Adult data set, a de facto benchmark. US census data with 45,222 records (each record represents one US resident).
Experimental Evaluation (cont.)
• Cost = 1 - clusterFM (in terms of loss in cluster structure)
• Benefit = clusterFM - distortFM
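The two quantities can be illustrated with a small Python sketch. It assumes a common definition of the clustering F-measure (for each reference cluster, take the best F over the candidate clusters and weight by cluster size); the slides do not spell out the exact formula. It also assumes that distortFM is the F-measure obtained from a table anonymized without regard to cluster structure. The toy label vectors are made up for illustration.

```python
from collections import Counter

def cluster_f_measure(reference, candidate):
    """Size-weighted best-match F-measure between two cluster labelings."""
    n = len(reference)
    ref_sizes = Counter(reference)
    cand_sizes = Counter(candidate)
    overlap = Counter(zip(reference, candidate))
    total = 0.0
    for c, c_size in ref_sizes.items():
        best = 0.0
        for k, k_size in cand_sizes.items():
            shared = overlap[(c, k)]
            if shared == 0:
                continue
            precision = shared / k_size
            recall = shared / c_size
            best = max(best, 2 * precision * recall / (precision + recall))
        total += c_size / n * best
    return total

# Toy cluster labels for six records (illustrative data only).
raw_clusters = [0, 0, 0, 1, 1, 1]          # clusters found on the raw table
tds_clusters = [0, 0, 1, 1, 1, 1]          # clusters found after generalization
distorted_clusters = [0, 1, 0, 1, 0, 1]    # clusters after a structure-blind method

cluster_fm = cluster_f_measure(raw_clusters, tds_clusters)
distort_fm = cluster_f_measure(raw_clusters, distorted_clusters)
cost = 1 - cluster_fm
benefit = cluster_fm - distort_fm
print(round(cluster_fm, 4), round(distort_fm, 4))
print(round(cost, 4), round(benefit, 4))
```

An F-measure of 1 means the generalized table yields exactly the raw cluster structure, so Cost is 0 in that case.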
Experimental Evaluation (cont.)
Related Work
[Sweeney 2002] employed bottom-up generalization to achieve k-anonymity. Single QID. Does not consider the specific use of the data.
[Iyengar 2002] proposed a genetic algorithm (GA) to address the problem of anonymity for classification. Single QID. The GA needs 18 hours to generalize 45,000 records.
[Fung et al. 2005] proposed an efficient top-down specialization (TDS) method for the problem of anonymity for classification. TDS needs only 7 seconds to generalize the same set of records (with comparable classification accuracy).
Conclusion
Quality clustering and privacy preservation can coexist.
We presented an effective top-down method that iteratively specializes the data, guided by maximizing information utility and minimizing privacy specificity.
The approach is applicable to both the public and private sectors, which share information for mutual benefit.