PSDM2005 Proceedings


    Privacy and Security Aspects of Data Mining

    Proceedings of a Workshop held in Conjunction with

2005 IEEE International Conference on Data Mining
Houston, USA, November 27, 2005

    Edited by

Stan Matwin, University of Ottawa (Canada)

LiWu Chang, Naval Research Laboratory (USA)

Rebecca N. Wright, Stevens Institute of Technology (USA)

Justin Zhan, University of Ottawa (Canada)

    ISBN 0-9738918-9-0


Table of Contents ............................................................ i
Foreword ..................................................................... ii

What is Privacy? Critical Steps for Privacy-Preserving Data Mining .......... 1
    Chris Clifton (Purdue University)

An Adaptable Perturbation Model of Privacy Preserving Data Mining ........... 8
    Li Liu, Bhavani Thuraisingham, Murat Kantarcioglu & Latifur Khan
    (University of Texas at Dallas)

A Robust Data-obfuscation Approach for Privacy Preservation of
Clustered Data ............................................................... 18
    Rupa Parameswaram & Douglas Blough (Georgia Institute of Technology)

Implementing Privacy-Preserving Bayesian-Net Discovery for Vertically
Partitioned Data ............................................................. 26
    Onur Kardes (Stevens Institute of Technology), Raphael S. Ryger (Yale
    University), Rebecca N. Wright (Stevens Institute of Technology) &
    Joan Feigenbaum (Yale University)

Collaborative Recommendation Vulnerability to Focused Bias Injection
Attacks ...................................................................... 35
    Robin Burke, Bamshad Mobasher, Runa Bhaumik & Chad Williams
    (DePaul University)

Secure K-Means Clustering Algorithm for Distributed Databases ............... 44
    Raj Bhatnagar, Ahmed Khedr & Amit Sinha (University of Cincinnati)

Generating Cryptographic Keys from Face Images While Preserving
Biometric Secrecy ............................................................ 54
    Alwyn Goh (Corentix Technologies), Yip Wai Kuan, David Ling &
    Andrew Jin (Multimedia University)


    Foreword

Privacy and security of data mining has become an active research area in recent years. Broadly, it addresses how to utilize confidential data for data mining purposes without revealing the actual confidential data values to the data miners. The goal of this workshop is to bring together researchers who have studied different aspects of this topic in order to discuss issues of privacy and security in data mining, synergize different views of techniques and policies, and explore future research directions.

These proceedings contain seven papers: one invited paper and six regular papers. Each regular paper received on average three critical reviews. Authors of accepted papers were invited to present them at the workshop. We would like to thank the authors, our invited speaker Chris Clifton, the program committee, and the external reviewers for contributing to the success of this workshop. Finally, we would like to thank the ICDM workshop organizer, Pawan Lingras, for his overall help in the organization of PSDM 2005.

    Workshop Co-Organizers

Stan Matwin (U. of Ottawa, CA), LiWu Chang (NRL), Rebecca N. Wright (Stevens Institute of Technology), Justin Zhan (U. of Ottawa, CA)

    Program Committee

Elisa Bertino (Purdue University), Chris Clifton (Purdue University), Ping Chen (University of Houston Downtown), Steve Fienberg (Carnegie-Mellon University), Tom Goldring (National Security Agency), Philippe Golle (Palo Alto Research Center), Sushil Jajodia (George Mason University), Helger Lipmaa (Cybernetica AS and University of Tartu, Estonia), Taneli Mielikäinen (University of Helsinki, Finland), Ira Moskowitz (Naval Research Laboratory), Kobbi Nissim (Ben Gurion, Israel), Jerry Reiter (Duke University), Pierangela Samarati (Università degli Studi di Milano, Italy), Aleksandra Slavkovic (Penn State), Jaideep Srivastava (University of Minnesota), Bhavani Thuraisingham (University of Texas at Dallas), Jaideep Vaidya (Rutgers University), Vassilis Verykios (University of Thessaly, Greece)

    External Reviewers

Anya Kim (Naval Research Laboratory), Murat Kantarcioglu (University of Texas at Dallas)

Copyright of the cover photo of these proceedings: Photohome.com


    What is Privacy?

    Critical Steps for Privacy-Preserving Data Mining

Chris Clifton
Purdue University
Department of Computer Science
250 North University Street
West Lafayette, Indiana 47907-2066
[email protected]

    Abstract

Privacy-Preserving Data Mining has generated many research successes, but as yet little real-world impact. One problem is that we do not yet have accepted definitions of privacy, whether legal, social, or technical, that apply to privacy-preserving data mining. This paper discusses this issue and surveys work on the topic. In spite of this problem, there are real-world scenarios that can be addressed by today's technology; the paper concludes with a discussion of such areas and the research needed to make technology transfer happen.

In five short years, the research community has developed numerous technical solutions for privacy-preserving data mining. What path should the community follow to bring these solutions to adoption? What technical challenges must be solved before adoption? I claim we still face one key technical challenge: we do not yet have a coherent definition of privacy that satisfies both technical and societal concerns. In spite of this, we have an opportunity to begin technology transfer, moving solutions into practice in areas without hard privacy constraints. This will establish credibility for the technology, speeding adoption when we do have a solid definition of privacy.

With so many published papers in privacy-preserving data mining, how can I say we don't have a definition for privacy? The problem is that we have several, but none by themselves satisfy legal and societal norms. A dictionary definition of privacy that is relevant to data mining is "freedom from unauthorized intrusion" [16]. "Unauthorized" is easy to understand, which leaves us with freedom from intrusion. What constitutes intrusion?

To understand this question, let us first look at legal definitions of privacy. Most privacy laws (e.g., the European Community privacy guidelines [8] or the U.S. healthcare laws [9]) only apply to individually identifiable data. Combining "intrusion" and "individually identifiable" leads to a standard by which to judge privacy-preserving data mining: a privacy-preserving data mining technique must ensure that any information disclosed

    1. cannot be traced to an individual; or

    2. does not constitute an intrusion.

Formal definitions for both of these items are an open challenge. We could assume that any data that does not give us completely accurate knowledge about a specific individual meets these criteria. This is unlikely to satisfy either privacy advocates or courts. At the other extreme, we could consider any improvement in our knowledge about an individual to be an intrusion. The latter is particularly likely to cause a problem for data mining, as the goal is to improve our knowledge. Even though the target is often groups of individuals, knowing more about a group does increase our knowledge about the individuals in the group. The answer, and the technical challenge, is measures for both the knowledge gained and our ability to relate it to a particular individual.

For our research community to truly have the impact we seek, we must develop legally and socially defensible measures of privacy. Our solutions must be proven to meet these measures, guaranteeing that information disclosed (including data mining results) does not reveal private information beyond the measures. Existing work is weak in this respect. Approaches based on secure multiparty computation (those that follow the approach in Lindell & Pinkas's seminal paper [13, 14]) state what is and is not disclosed; typically, they state that only the data mining result is disclosed. This says nothing about the potential privacy impact of that result. Works based on randomization (as in Agrawal & Srikant's seminal paper [2]) have developed a plethora of measures, but none cleanly addresses both individual identifiability and intrusiveness.


In this paper/talk I review measures for both individual identifiability and knowledge gain. In the process, I point out shortcomings of those measures and try to identify promising research directions. We need not wait for such research to be completed and accepted by the privacy and legal communities before moving ahead with technology transfer; Section 3 concludes with a discussion of viable application areas where privacy-preserving data mining can provide benefit today.

    1 Individual Identifiability

The U.S. Health Insurance Portability and Accountability Act (HIPAA) defines individually nonidentifiable data as data that "does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual" [10]. This requires showing that the risk of identifying an individual in disclosed data is very small. Note that the analysis must be based not only on the disclosed data, but also on other easily available information. For example, Sweeney demonstrated that (commonly disclosed) anonymous medical data could be linked with (publicly available) voter registration records on birth date, gender, and postal code to give a name and address for many of the medical records [20]. That the individual is not identifiable in the data alone is not sufficient; joining the data with other sources must not enable identification.
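To make the linkage concrete, here is a minimal sketch of this kind of quasi-identifier join, assuming two hypothetical pandas DataFrames; the column names and records are invented for illustration and are not Sweeney's actual data.

```python
# Minimal sketch of a linkage attack in the style Sweeney describes:
# "anonymous" medical records joined to public voter rolls on shared
# quasi-identifiers. Column names and data are hypothetical.
import pandas as pd

medical = pd.DataFrame({
    "birth_date": ["1962-07-14", "1975-01-02"],
    "gender":     ["F", "M"],
    "zip":        ["02138", "47907"],
    "diagnosis":  ["hypertension", "asthma"],
})
voters = pd.DataFrame({
    "name":       ["J. Smith", "R. Jones"],
    "address":    ["12 Elm St", "3 Oak Ave"],
    "birth_date": ["1962-07-14", "1975-01-02"],
    "gender":     ["F", "M"],
    "zip":        ["02138", "47907"],
})

quasi_identifiers = ["birth_date", "gender", "zip"]
# An exact join on the quasi-identifiers re-attaches name and address
# to each "anonymous" medical record that matches.
linked = medical.merge(voters, on=quasi_identifiers, how="inner")
print(linked[["name", "address", "diagnosis"]])
```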

One proposal to address this problem is k-anonymity [19, 20]. K-anonymity alters identifying information so that identification is only to a group of k, not to an individual. A key concept is the notion of a quasi-identifier: information that can be used to link a record to an individual. With respect to the HIPAA definition, a quasi-identifier would be data that could link to reasonably available information. The HIPAA regulations actually give a list of presumed quasi-identifiers; if these items are removed, the data is (legally) considered not to be individually identifiable.

The definition of k-anonymity states that for any value of a quasi-identifier, there must be at least k records with the same quasi-identifier. This ensures that an attempt to identify an individual will result in at least k records that could apply to the individual. Assuming that the privacy-sensitive data (e.g., medical diagnoses) are not the same for all k records, this throws uncertainty into any knowledge about an individual. The uncertainty lowers the risk that the knowledge constitutes an intrusion.
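As a direct reading of this definition (not any particular published anonymization algorithm), the following sketch checks whether a set of records is k-anonymous with respect to a chosen set of quasi-identifier attributes; the record layout is an assumption for illustration.

```python
# Sketch: check k-anonymity by counting records per quasi-identifier value.
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier value occurs in >= k records."""
    counts = Counter(
        tuple(r[a] for a in quasi_identifiers) for r in records
    )
    return all(c >= k for c in counts.values())

rows = [
    {"age_range": "30-39", "zip3": "021", "diagnosis": "flu"},
    {"age_range": "30-39", "zip3": "021", "diagnosis": "asthma"},
    {"age_range": "40-49", "zip3": "479", "diagnosis": "flu"},
]
print(is_k_anonymous(rows, ["age_range", "zip3"], k=2))  # False: one group has size 1
```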

The idea that knowledge that applies to a group rather than to a specific individual does not violate privacy is legally defensible. Census bureaus have used this approach as a means of protecting privacy. Census data is typically published as contingency tables: counts of individuals meeting a particular criterion (see Table 1).

Table 1. Excerpt from Table of Census Data, U.S. Census Bureau

Block Group 1, Census Tract 1, District of Columbia, District of Columbia
    Total:                     9
    Owner occupied:            3
        1-person household     2
        2-person household     1
        ...
    Renter occupied:           6
        1-person household     3
        2-person household     2
        ...

Aggregates that reflect a large enough number of households are not considered privacy sensitive. However, when cells list only a few individuals (as in Table 1), combining the data with other tables may reveal private information. For example, if we know that all owner-occupied 2-person households have salary over $40,000, and that of the nine multiracial households only one has salary over $40,000, we can determine that the single multiracial individual in an owner-occupied 2-person household makes over $40,000. Since race and household size can often be observed, and home ownership status is publicly available (in most of the U.S.), this would result in disclosure of an individual salary.

Several methods are used to combat this. The data used to generate Table 1 uses introduction of noise; the Census Bureau warns that "statistical procedures have been applied that introduce some uncertainty into data for small geographic areas with small population groups." Other techniques include cell suppression, in which counts smaller than a threshold are not reported at all, and generalization, where cells with small counts are merged (e.g., changing Table 1 so that it doesn't distinguish between owner-occupied and renter-occupied housing). Generalization and suppression are also common techniques for k-anonymity.
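The sketch below illustrates cell suppression and generalization on a toy version of the counts in Table 1; the threshold and the merging rule are illustrative assumptions, not Census Bureau practice.

```python
# Sketch of two disclosure-limitation techniques named above:
# cell suppression (drop counts below a threshold) and generalization
# (merge owner/renter cells of the same household size). Values invented.
def suppress_small_cells(counts, threshold=5):
    """Replace counts below the threshold with None (suppressed)."""
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}

def generalize_tenure(counts):
    """Merge owner-occupied and renter-occupied cells of the same size."""
    merged = {}
    for (tenure, size), n in counts.items():
        merged[("any tenure", size)] = merged.get(("any tenure", size), 0) + n
    return merged

block = {
    ("owner", "1-person"): 2,
    ("owner", "2-person"): 1,
    ("renter", "1-person"): 3,
    ("renter", "2-person"): 2,
}
print(suppress_small_cells(block))
print(generalize_tenure(block))
```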

This work gives us one metric that applies to privacy-preserving data mining: if the disclosures from a technique (including the results) can be shown to generalize to large enough groups of individuals, then the size of the group can be used as a metric for privacy protection. The size-of-group standard may be easily met for some techniques; e.g., pruning approaches for decision trees may already generalize outcomes that would otherwise apply to only small groups, and association rule support counts provide a clear group size.
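As a simple illustration of the size-of-group standard, the sketch below releases a data mining result (here, hypothetical association rules) only when its support count, taken as the size of the group the disclosure describes, meets a minimum threshold; the rule format and threshold are assumptions.

```python
# Illustrative sketch only: treat an association rule's support count as
# the size of the group its disclosure describes, and release a rule only
# if that group is large enough.
def releasable_rules(rules, min_group_size):
    """Keep rules whose support count meets the group-size threshold."""
    return [r for r in rules if r["support_count"] >= min_group_size]

rules = [
    {"rule": "diapers -> beer", "support_count": 120},
    {"rule": "rare_drug -> rare_condition", "support_count": 3},
]
print(releasable_rules(rules, min_group_size=25))  # only the first rule survives
```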

There have been several other techniques developed by the official statistics (census research) community to mitigate the risk of individual identification. These include generalization (e.g., limiting geographic detail), top/bottom coding (e.g., reporting a salary only as "greater than $100,000"), and data swapping (taking two records and swapping their values for one attribute). These techniques introduce uncertainty into the data, thus limiting the confidence in attempts to identify an individual in the data. They have been used to create Public Use Microdata Sets: data sets that appear to be an actual sample of census data. Because the data is only a sample, and these techniques have been applied, a match with a real individual is unlikely. Even if an apparent match is found, it is likely that this match in the quasi-identifier was actually created from some other individual through the data perturbation techniques. Knowing this, an adversary trying to compromise privacy can have little confidence that the matching data really applies to the targeted individual.

Metrics for evaluating such techniques look at both privacy and the value of the data. Determining the value of the data is based on preservation of univariate and covariate statistics on the data. Privacy is based on the percentage of individuals that a particularly well-equipped adversary could identify. Assumptions are that the adversary:

1. knows that some individuals are almost certainly in the sample (e.g., 600-1000 for a sample of 1,500 individuals),

2. knows that the sample comes from a restricted set of individuals (e.g., 20,000),

3. has a good estimate (although with some uncertainty) of the non-sensitive values (quasi-identifiers) for the target individuals, and

4. has a reasonable estimate of the sensitive values (e.g., within 10%).

The metric is based on the number of individuals the adversary is able to correctly and confidently identify. In [17], identification rates of 13% are considered acceptably low. Note that this is an extremely well-informed adversary; in practice rates would be much lower.
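The sketch below illustrates the general shape of such an experiment: a simulated adversary matches target individuals to released records on quasi-identifiers and counts unique matches whose sensitive value falls within an estimation tolerance. The matching rule, attribute names, and tolerance are assumptions for illustration, not the procedure used in [17].

```python
# Illustrative re-identification experiment: estimate the fraction of
# targets an adversary links correctly and confidently to released records.
def reidentification_rate(released, targets, quasi_ids, tol=0.10):
    hits = 0
    for t in targets:
        matches = [r for r in released
                   if all(r[q] == t[q] for q in quasi_ids)]
        # Count a hit only for a unique match whose sensitive value is
        # within the adversary's estimated range.
        if len(matches) == 1 and abs(
            matches[0]["salary"] - t["est_salary"]
        ) <= tol * t["est_salary"]:
            hits += 1
    return hits / len(targets)

released = [{"age": 34, "zip3": "021", "salary": 61_000},
            {"age": 34, "zip3": "021", "salary": 97_000},
            {"age": 58, "zip3": "479", "salary": 45_000}]
targets = [{"age": 58, "zip3": "479", "est_salary": 47_000},
           {"age": 34, "zip3": "021", "est_salary": 60_000}]
print(reidentification_rate(released, targets, ["age", "zip3"]))  # 0.5
```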

This experimental approach could be used to determine the ability of a well-informed adversary to identify individuals based on privacy-preserving data mining approaches. However, it is not amenable to a simple, one-size-fits-all standard: as demonstrated in [17], applying this approach demands considerable understanding of the particular domain and the privacy risks associated with that domain.

A metric presented in [6] tries to formalize this concept of distinguishability in a more general form than k-anonymity. The idea is that we should be unable to learn a classifier that distinguishes between individuals with high probability. The specific metric proposed was:

Definition 1 [6] Two records that belong to different individuals $I_1, I_2$ are $p$-indistinguishable given data $X$ if for every polynomial-time function $f : I \to \{0, 1\}$,

$$\left| \Pr\{f(I_1) = 1 \mid X\} - \Pr\{f(I_2) = 1 \mid X\} \right| \le p$$

where $0 < p$