7/27/2019 PSDM2005 Proceedings
1/67
Privacy and Security Aspects of Data Mining
Proceedings of a Workshop held in Conjunction with
the 2005 IEEE International Conference on Data Mining
Houston, USA, November 27, 2005
Edited by
Stan Matwin
University of Ottawa (Canada)
LiWu Chang
Naval Research Laboratory (USA)
Rebecca N. Wright
Stevens Institute of Technology (USA)
Justin Zhan
University of Ottawa (Canada)
ISBN 0-9738918-9-0
i
Table of Contents ............ i
Foreword ii
What is Privacy?
Critical Steps for Privacy-Preserving Data Mining ............ 1
Chris Clifton (Purdue University)

An Adaptable Perturbation Model of Privacy
Preserving Data Mining ............ 8
Li Liu, Bhavani Thuraisingham, Murat Kantarcioglu & Latifur Khan (University of Texas at Dallas)

A Robust Data-obfuscation Approach for Privacy
Preservation of Clustered Data ............ 18
Rupa Parameswaram & Douglas Blough (Georgia Institute of Technology)

Implementing Privacy-Preserving Bayesian-Net Discovery
for Vertically Partitioned Data ............ 26
Onur Kardes (Stevens Institute of Technology), Raphael S. Ryger (Yale University), Rebecca N. Wright (Stevens Institute of Technology) & Joan Feigenbaum (Yale University)

Collaborative Recommendation Vulnerability to Focused
Bias Injection Attacks ............ 35
Robin Burke, Bamshad Mobasher, Runa Bhaumik & Chad Williams (DePaul University)

Secure K-Means Clustering Algorithm for
Distributed Databases ............ 44
Raj Bhatnagar, Ahmed Khedr & Amit Sinha (University of Cincinnati)

Generating Cryptographic Keys from Face Images While Preserving
Biometric Secrecy ............ 54
Alwyn Goh (Corentix Technologies), Yip Wai Kuan, David Ling & Andrew Jin (Multimedia University)
ii
Foreword
Privacy and security of data mining has become an active research area in recent years. Broadly, it addresses how to utilize confidential data for data mining purposes without revealing the actual confidential data values to the data miners. The goal of this workshop is to bring together researchers who have studied different aspects of this topic in order to discuss issues of privacy and security in data mining, synergize different views of techniques and policies, and explore future research directions.

These proceedings contain seven papers: one invited paper and six regular papers. Each regular paper received on average three critical reviews. Authors of accepted papers were invited to present them at the workshop. We would like to thank the authors, our invited speaker Chris Clifton, the program committee, and the external reviewers for contributing to the success of this workshop. Finally, we would like to thank the ICDM workshop organizer, Pawan Lingras, for his overall help in the organization of PSDM 2005.
Workshop Co-Organizers
Stan Matwin LiWu Chang Rebecca N. Wright Justin Zhan
U. of Ottawa, CA NRL Stevens Institute of Technology U. of Ottawa, CA
Program Committee
Elisa Bertino (Purdue University), Chris Clifton (Purdue University), Ping Chen (University of Houston Downtown), Steve Fienberg (Carnegie Mellon University), Tom Goldring (National Security Agency), Philippe Golle (Palo Alto Research Center), Sushil Jajodia (George Mason University), Helger Lipmaa (Cybernetica AS and University of Tartu, Estonia), Taneli Mielikäinen (University of Helsinki, Finland), Ira Moskowitz (Naval Research Laboratory), Kobbi Nissim (Ben-Gurion University, Israel), Jerry Reiter (Duke University), Pierangela Samarati (Università degli Studi di Milano, Italy), Aleksandra Slavkovic (Penn State), Jaideep Srivastava (University of Minnesota), Bhavani Thuraisingham (University of Texas at Dallas), Jaideep Vaidya (Rutgers University), Vassilis Verykios (University of Thessaly, Greece)
External Reviewers
Anya Kim (Naval Research Laboratory), Murat Kantarcioglu (University of Texas at Dallas)
Copyright of the photo of the cover of this proceedings: Photohome.com
What is Privacy?
Critical Steps for Privacy-Preserving Data Mining
Chris Clifton
Purdue University
Department of Computer Science
250 North University Street
West Lafayette, Indiana 47907-2066
[email protected]
Abstract
Privacy-Preserving Data Mining has generated many research successes, but as yet little real-world impact. One problem is that we do not yet have accepted definitions of privacy, whether legal, social, or technical, that apply to privacy-preserving data mining. This paper discusses this issue, and surveys work on the topic. In spite of this problem, there are real-world scenarios that can be addressed by today's technology; the paper concludes with a discussion of such areas and the research needed to make technology transfer happen.
In five short years, the research community has developed numerous technical solutions for privacy-preserving data mining. What path should the community follow to bring these solutions to adoption? What technical challenges must be solved before adoption? I claim we still face one key technical challenge: We do not yet have a coherent definition of privacy that satisfies both technical and societal concerns. In spite of this, we have an opportunity to begin technology transfer, moving solutions into practice in areas without hard privacy constraints. This will establish credibility for the technology, speeding adoption when we do have a solid definition of privacy.
With so many published papers in privacy-preserving data mining, how can I say we don't have a definition for privacy? The problem is that we have several, but none by themselves satisfy legal and societal norms. A dictionary definition of privacy that is relevant to data mining is "freedom from unauthorized intrusion" [16]. "Unauthorized" is easy to understand, which leaves us with "freedom from intrusion." What constitutes intrusion?
To understand this question, let us first look at legal definitions of privacy. Most privacy laws (e.g., European Community privacy guidelines [8] or the U.S. healthcare laws [9]) only apply to "individually identifiable" data. Combining "intrusion" and "individually identifiable" leads to a standard to judge privacy-preserving data mining: A privacy-preserving data mining technique must ensure that any information disclosed
1. cannot be traced to an individual; or
2. does not constitute an intrusion.
Formal definitions for both these items are an open challenge. We could assume that any data that does not give us completely accurate knowledge about a specific individual meets these criteria. This is unlikely to satisfy either privacy advocates or courts. At the other extreme, we could consider any improvement in our knowledge about an individual to be an intrusion. The latter is particularly likely to cause a problem for data mining, as the goal is to improve our knowledge. Even though the target is often groups of individuals, knowing more about a group does increase our knowledge about individuals in the group. The answer, and technical challenge, is measures for both the knowledge gained and our ability to relate it to a particular individual.
For our research community to truly have the impact we seek, we must develop legally and socially defensible measures of privacy. Our solutions must be proven to meet these measures, guaranteeing that information disclosed (including data mining results) does not reveal private information beyond the measures. Existing work is weak in this respect. Secure multiparty computation based approaches (those that follow the approach in Lindell & Pinkas's seminal paper [13, 14]) state what is and is not disclosed; typically, they state that only the data mining result is disclosed. This says nothing about the potential privacy impact of that result. Works based on randomization (as in Agrawal & Srikant's seminal paper [2]) have developed a plethora of measures, but none cleanly addresses both individual identifiability and intrusiveness.
1
In this paper/talk I review measures for both individual identifiability and knowledge gain. In the process, I point out shortcomings of those measures, and try to identify promising research directions. We do not need to wait for such research to be completed and accepted by the privacy and legal community before moving ahead with technology transfer; Section 3 concludes with a discussion of viable application areas where privacy-preserving data mining can provide benefit today.
1 Individual Identifiability
The U.S. Health Insurance Portability and Accountability Act (HIPAA) defines individually nonidentifiable data as data "that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual" [10]. This requires showing that the risk of identifying an individual in disclosed data is very small. Note that the analysis must be based not only on the disclosed data, but also on other easily available information. For example, Sweeney demonstrated that (commonly disclosed) anonymous medical data could be linked with (publicly available) voter registration records on birth date, gender, and postal code to give a name and address for many of the medical records [20]. Just because the individual is not identifiable in the data alone is not sufficient; joining the data with other sources must not enable identification.
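Such a linkage attack can be sketched in a few lines. The toy tables below, with hypothetical field names and records (none of these values come from Sweeney's study), show how a unique match on the quasi-identifier re-identifies a supposedly anonymous record:

```python
# Sketch of a linkage attack: re-identifying "anonymous" medical records
# by joining them with public voter rolls on quasi-identifiers.
# All field names and records here are hypothetical illustrations.

medical = [  # released without names, but with quasi-identifiers
    {"birth_date": "1965-03-12", "gender": "F", "zip": "02138", "diagnosis": "flu"},
    {"birth_date": "1972-07-01", "gender": "M", "zip": "02139", "diagnosis": "asthma"},
]
voters = [  # publicly available, with names
    {"name": "Alice Smith", "birth_date": "1965-03-12", "gender": "F", "zip": "02138"},
    {"name": "Bob Jones", "birth_date": "1980-01-20", "gender": "M", "zip": "02139"},
]

def link(medical, voters):
    """Return (name, diagnosis) pairs where the quasi-identifier
    (birth date, gender, postal code) matches exactly one voter."""
    qid = lambda r: (r["birth_date"], r["gender"], r["zip"])
    by_qid = {}
    for v in voters:
        by_qid.setdefault(qid(v), []).append(v)
    hits = []
    for m in medical:
        matches = by_qid.get(qid(m), [])
        if len(matches) == 1:  # unique match => re-identification
            hits.append((matches[0]["name"], m["diagnosis"]))
    return hits

print(link(medical, voters))  # [('Alice Smith', 'flu')]
```

The first medical record is re-identified because its quasi-identifier matches exactly one voter; the second survives only because the matching voter's birth date differs.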
One proposal to address this problem is k-anonymity [19, 20]. K-anonymity alters identifying information so that identification is only to a group of k, not to an individual. A key concept is the notion of a quasi-identifier: information that can be used to link a record to an individual. With respect to the HIPAA definition, a quasi-identifier would be data that could link to "reasonably available" information. The HIPAA regulations actually give a list of presumed quasi-identifiers; if these items are removed, data is (legally) considered not to be individually identifiable.

The definition of k-anonymity states that for any value of a quasi-identifier, there must be at least k records with the same quasi-identifier. This ensures that an attempt to identify an individual will result in at least k records that could apply to the individual. Assuming that the privacy-sensitive data (e.g., medical diagnoses) are not the same for all k records, this throws uncertainty into any knowledge about an individual. The uncertainty lowers the risk that the knowledge constitutes an intrusion.
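The definition above can be checked directly: a table is k-anonymous when every quasi-identifier value occurs in at least k records. A minimal sketch, with hypothetical records and field names:

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """True if every combination of quasi-identifier values appears
    in at least k records of the table."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical table: age generalized to ranges, postal code truncated.
table = [
    {"age": "30-39", "zip": "0213*", "diagnosis": "flu"},
    {"age": "30-39", "zip": "0213*", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "0214*", "diagnosis": "flu"},
    {"age": "40-49", "zip": "0214*", "diagnosis": "diabetes"},
]

print(is_k_anonymous(table, ["age", "zip"], 2))  # True
print(is_k_anonymous(table, ["age", "zip"], 3))  # False
```

Note that the check passes here only because the ages and postal codes were already generalized; on the raw values each record would likely be unique.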
The idea that knowledge that applies to a group rather than a specific individual does not violate privacy is legally defensible. Census bureaus have used this approach as a means of protecting privacy. Census data is typically published as contingency tables: counts of individuals meeting a particular criterion (see Table 1).

Table 1. Excerpt from Table of Census Data, U.S. Census Bureau

Block Group 1, Census Tract 1, District of Columbia, District of Columbia
  Total: 9
    Owner occupied: 3
      1-person household 2
      2-person household 1
      . . .
    Renter occupied: 6
      1-person household 3
      2-person household 2
      . . .

Aggregates that reflect a large enough number of households are not considered privacy sensitive. However, when cells list only a few individuals (as in Table 1), combining the data with other tables may reveal private information. For example, if we know that all owner-occupied 2-person households have salary over $40,000, and of the nine multiracial households, only one has salary over $40,000, we can determine that the single multiracial individual in an owner-occupied 2-person household makes over $40,000. Since race and household size can often be observed, and home ownership status is publicly available (in most of the U.S.), this would result in disclosure of an individual salary.
Several methods are used to combat this. The data used to generate Table 1 uses introduction of noise; the Census Bureau warns that "statistical procedures have been applied that introduce some uncertainty into data for small geographic areas with small population groups." Other techniques include cell suppression, in which counts smaller than a threshold are not reported at all; and generalization, where cells with small counts are merged (e.g., changing Table 1 so that it doesn't distinguish between owner-occupied and renter-occupied housing). Generalization and suppression are also common techniques for k-anonymity.
This work gives us one metric that applies to privacy-preserving data mining. If disclosures from a technique (including the results) can be shown to generalize to large enough groups of individuals, then the size of the group can be used as a metric for privacy protection. The "size of group" standard may be easily met for some techniques; e.g., pruning approaches for decision trees may already generalize outcomes that apply to only small groups, and association rule support counts provide a clear group size.
There have been several other techniques developed by
2
the official statistics (census research) community to mitigate risk of individual identification. These include generalization (e.g., limiting geographic detail), top/bottom coding (e.g., reporting a salary only as "greater than $100,000"), and data swapping (taking two records and swapping their values for one attribute). These techniques introduce uncertainty into the data, thus limiting the confidence in attempts to identify an individual in the data. These have been used to create Public Use Microdata Sets: data sets that appear to be an actual sample of census data. Because the data is only a sample, and these techniques have been applied, a match with a real individual is unlikely. Even if an apparent match is found, it is likely that this match in the quasi-identifier is actually created from some other individual through the data perturbation techniques. Knowing that this is likely, an adversary trying to compromise privacy can have little confidence that the matching data really applies to the targeted individual.

Metrics for evaluating such techniques look at both privacy and the value of the data. Determining value of data is based on preservation of univariate and covariate statistics on the data. Privacy is based on the percentage of individuals that a particularly well-equipped adversary could identify. Assumptions are that the adversary:
1. knows that some individuals are almost certainly in the sample (e.g., 600-1000 for a sample of 1500 individuals),

2. knows that the sample comes from a restricted set of individuals (e.g., 20,000),

3. has a good estimate (although some uncertainty) about the non-sensitive values (quasi-identifiers) for the target individuals, and

4. has a reasonable estimate of the sensitive values (e.g., within 10%).
The metric is based on the number of individuals the adversary is able to correctly and confidently identify. In [17], identification rates of 13% are considered acceptably low. Note that this is an extremely well-informed adversary; in practice rates would be much lower.
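The metric itself reduces to a simple fraction. A minimal sketch, assuming a hypothetical simulation in which the adversary's confident claims have been recorded against ground truth (the counts below are invented to mirror the 13% figure):

```python
def identification_rate(claims, truth):
    """Fraction of target individuals the adversary correctly and
    confidently identifies. `claims` maps a target id to the record id
    the adversary confidently asserts belongs to that target; `truth`
    maps every target id to the true record id."""
    correct = sum(1 for t, r in claims.items() if truth.get(t) == r)
    return correct / len(truth)

# Hypothetical run: 100 targets; the adversary makes 20 confident
# claims, of which 13 are correct -> a 13% identification rate.
truth = {i: f"rec{i}" for i in range(100)}
claims = {i: f"rec{i}" for i in range(13)}           # 13 correct claims
claims.update({i: "rec999" for i in range(13, 20)})  # 7 incorrect claims
print(identification_rate(claims, truth))  # 0.13
```

Incorrect claims do not raise the rate; they only illustrate that confidence alone is not identification.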
This experimental approach could be used to determine the ability of a well-informed adversary to identify individuals based on privacy-preserving data mining approaches. However, it is not amenable to a simple, "one size fits all" standard: as demonstrated in [17], applying this approach demands considerable understanding of the particular domain and the privacy risks associated with that domain.
A metric presented in [6] tries to formalize this concept of distinguishability in a more general form than k-anonymity. The idea is that we should be unable to learn a classifier that distinguishes between individuals with high probability. The specific metric proposed was:

Definition 1 [6] Two records that belong to different individuals I1, I2 are p-indistinguishable given data X if for every polynomial-time function f : I → {0, 1}

|Pr{f(I1) = 1 | X} − Pr{f(I2) = 1 | X}| ≤ p

where 0 < p