PSDM2005 Proceedings


    Privacy and Security Aspects of Data Mining

    Proceedings of a Workshop held in Conjunction with

2005 IEEE International Conference on Data Mining
Houston, USA, November 27, 2005

    Edited by

Stan Matwin, University of Ottawa (Canada)

LiWu Chang, Naval Research Laboratory (USA)

Rebecca N. Wright, Stevens Institute of Technology (USA)

Justin Zhan, University of Ottawa (Canada)

    ISBN 0-9738918-9-0


Table of Contents ............................................................ i
Foreword ..................................................................... ii

What is Privacy? Critical Steps for Privacy-Preserving Data Mining .......... 1
    Chris Clifton (Purdue University)

An Adaptable Perturbation Model of Privacy Preserving Data Mining ........... 8
    Li Liu, Bhavani Thuraisingham, Murat Kantarcioglu & Latifur Khan
    (University of Texas at Dallas)

A Robust Data-obfuscation Approach for Privacy Preservation of
Clustered Data ............................................................... 18
    Rupa Parameswaram & Douglas Blough (Georgia Institute of Technology)

Implementing Privacy-Preserving Bayesian-Net Discovery for Vertically
Partitioned Data ............................................................. 26
    Onur Kardes (Stevens Institute of Technology), Raphael S. Ryger (Yale
    University), Rebecca N. Wright (Stevens Institute of Technology) &
    Joan Feigenbaum (Yale University)

Collaborative Recommendation Vulnerability to Focused Bias Injection
Attacks ...................................................................... 35
    Robin Burke, Bamshad Mobasher, Runa Bhaumik & Chad Williams
    (DePaul University)

Secure K-Means Clustering Algorithm for Distributed Databases ............... 44
    Raj Bhatnagar, Ahmed Khedr & Amit Sinha (University of Cincinnati)

Generating Cryptographic Keys from Face Images While Preserving
Biometric Secrecy ............................................................ 54
    Alwyn Goh (Corentix Technologies), Yip Wai Kuan, David Ling &
    Andrew Jin (Multimedia University)


    Foreword

Privacy and security of data mining has become an active research area in recent years. Broadly, it addresses how to utilize confidential data for data mining purposes without revealing the actual confidential data values to the data miners. The goal of this workshop is to bring together researchers who have studied different aspects of this topic in order to discuss issues of privacy and security in data mining, synergize different views of techniques and policies, and explore future research directions.

These proceedings contain seven papers: one invited paper and six regular papers. Each regular paper received on average three critical reviews. Authors of accepted papers were invited to present them at the workshop. We would like to thank the authors, our invited speaker Chris Clifton, the program committee, and the external reviewers for contributing to the success of this workshop. Finally, we would like to thank the ICDM workshop organizer, Pawan Lingras, for his overall help in the organization of PSDM 2005.

    Workshop Co-Organizers

Stan Matwin (U. of Ottawa, CA), LiWu Chang (NRL), Rebecca N. Wright (Stevens Institute of Technology), Justin Zhan (U. of Ottawa, CA)

    Program Committee

Elisa Bertino (Purdue University), Chris Clifton (Purdue University), Ping Chen (University of Houston Downtown), Steve Fienberg (Carnegie-Mellon University), Tom Goldring (National Security Agency), Philippe Golle (Palo Alto Research Center), Sushil Jajodia (George Mason University), Helger Lipmaa (Cybernetica AS and University of Tartu, Estonia), Taneli Mielikäinen (University of Helsinki, Finland), Ira Moskowitz (Naval Research Laboratory), Kobbi Nissim (Ben Gurion, Israel), Jerry Reiter (Duke University), Pierangela Samarati (Università degli Studi di Milano, Italy), Aleksandra Slavkovic (Penn State), Jaideep Srivastava (University of Minnesota), Bhavani Thuraisingham (University of Texas at Dallas), Jaideep Vaidya (Rutgers University), Vassilis Verykios (University of Thessaly, Greece)

    External Reviewers

Anya Kim (Naval Research Laboratory), Murat Kantarcioglu (University of Texas at Dallas)

Copyright of the cover photo of these proceedings: Photohome.com


    What is Privacy?

    Critical Steps for Privacy-Preserving Data Mining

Chris Clifton
Purdue University
Department of Computer Science
250 North University Street
West Lafayette, Indiana 47907-2066
[email protected]

    Abstract

Privacy-Preserving Data Mining has generated many research successes, but as yet little real-world impact. One problem is that we do not yet have accepted definitions of privacy, whether legal, social, or technical, that apply to privacy-preserving data mining. This paper discusses this issue and surveys work on the topic. In spite of this problem, there are real-world scenarios that can be addressed by today's technology; the paper concludes with a discussion of such areas and the research needed to make technology transfer happen.

In five short years, the research community has developed numerous technical solutions for privacy-preserving data mining. What path should the community follow to bring these solutions to adoption? What technical challenges must be solved before adoption? I claim we still face one key technical challenge: we do not yet have a coherent definition of privacy that satisfies both technical and societal concerns. In spite of this, we have an opportunity to begin technology transfer, moving solutions into practice in areas without hard privacy constraints. This will establish credibility for the technology, speeding adoption when we do have a solid definition of privacy.

With so many published papers in privacy-preserving data mining, how can I say we don't have a definition for privacy? The problem is that we have several, but none by themselves satisfy legal and societal norms. A dictionary definition of privacy that is relevant to data mining is "freedom from unauthorized intrusion" [16]. "Unauthorized" is easy to understand, which leaves us with freedom from intrusion. What constitutes intrusion?

To understand this question, let us first look at legal definitions of privacy. Most privacy laws (e.g., the European Community privacy guidelines [8] or the U.S. healthcare laws [9]) only apply to individually identifiable data. Combining "intrusion" and "individually identifiable" leads to a standard by which to judge privacy-preserving data mining: a privacy-preserving data mining technique must ensure that any information disclosed

    1. cannot be traced to an individual; or

    2. does not constitute an intrusion.

Formal definitions for both of these items are an open challenge. We could assume that any data that does not give us completely accurate knowledge about a specific individual meets these criteria. This is unlikely to satisfy either privacy advocates or courts. At the other extreme, we could consider any improvement in our knowledge about an individual to be an intrusion. The latter is particularly likely to cause a problem for data mining, as the goal is to improve our knowledge. Even though the target is often groups of individuals, knowing more about a group does increase our knowledge about the individuals in the group. The answer, and the technical challenge, is measures for both the knowledge gained and our ability to relate it to a particular individual.

For our research community to truly have the impact we seek, we must develop legally and socially defensible measures of privacy. Our solutions must be proven to meet these measures, guaranteeing that information disclosed (including data mining results) does not reveal private information beyond the measures. Existing work is weak in this respect. Approaches based on secure multiparty computation (those that follow the approach in Lindell & Pinkas's seminal paper [13, 14]) state what is and is not disclosed; typically, they state that only the data mining result is disclosed. This says nothing about the potential privacy impact of that result. Works based on randomization (as in Agrawal & Srikant's seminal paper [2]) have developed a plethora of measures, but none cleanly addresses both individual identifiability and intrusiveness.


In this paper/talk I review measures for both individual identifiability and knowledge gain. In the process, I point out shortcomings of those measures and try to identify promising research directions. We need not wait for such research to be completed and accepted by the privacy and legal communities before moving ahead with technology transfer; Section 3 concludes with a discussion of viable application areas where privacy-preserving data mining can provide benefit today.

    1 Individual Identifiability

The U.S. Health Insurance Portability and Accountability Act (HIPAA) defines individually nonidentifiable data as data that "does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual" [10]. This requires showing that the risk of identifying an individual in disclosed data is very small. Note that the analysis must be based not only on the disclosed data, but also on other easily available information. For example, Sweeney demonstrated that (commonly disclosed) anonymous medical data could be linked with (publicly available) voter registration records on birth date, gender, and postal code to give a name and address for many of the medical records [20]. That the individual is not identifiable in the data alone is not sufficient; joining the data with other sources must not enable identification.
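To make the linkage concrete, here is a minimal sketch of this kind of quasi-identifier join, assuming two hypothetical pandas DataFrames; the column names and records are invented for illustration and are not Sweeney's actual data.

```python
# Minimal sketch of a linkage attack in the style Sweeney describes:
# "anonymous" medical records joined to public voter rolls on shared
# quasi-identifiers. Column names and data are hypothetical.
import pandas as pd

medical = pd.DataFrame({
    "birth_date": ["1962-07-14", "1975-01-02"],
    "gender":     ["F", "M"],
    "zip":        ["02138", "47907"],
    "diagnosis":  ["hypertension", "asthma"],
})
voters = pd.DataFrame({
    "name":       ["J. Smith", "R. Jones"],
    "address":    ["12 Elm St", "3 Oak Ave"],
    "birth_date": ["1962-07-14", "1975-01-02"],
    "gender":     ["F", "M"],
    "zip":        ["02138", "47907"],
})

quasi_identifiers = ["birth_date", "gender", "zip"]
# An exact join on the quasi-identifiers re-attaches name and address
# to each "anonymous" medical record that matches.
linked = medical.merge(voters, on=quasi_identifiers, how="inner")
print(linked[["name", "address", "diagnosis"]])
```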

One proposal to address this problem is k-anonymity [19, 20]. K-anonymity alters identifying information so that identification is only to a group of k, not to an individual. A key concept is the notion of a quasi-identifier: information that can be used to link a record to an individual. With respect to the HIPAA definition, a quasi-identifier would be data that could link to reasonably available information. The HIPAA regulations actually give a list of presumed quasi-identifiers; if these items are removed, the data is (legally) considered not to be individually identifiable.

The definition of k-anonymity states that for any value of a quasi-identifier, there must be at least k records with the same quasi-identifier. This ensures that an attempt to identify an individual will result in at least k records that could apply to the individual. Assuming that the privacy-sensitive data (e.g., medical diagnoses) are not the same for all k records, this throws uncertainty into any knowledge about an individual. The uncertainty lowers the risk that the knowledge constitutes an intrusion.
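As a direct reading of this definition (not any particular published anonymization algorithm), the following sketch checks whether a set of records is k-anonymous with respect to a chosen set of quasi-identifier attributes; the record layout is an assumption for illustration.

```python
# Sketch: check k-anonymity by counting records per quasi-identifier value.
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier value occurs in >= k records."""
    counts = Counter(
        tuple(r[a] for a in quasi_identifiers) for r in records
    )
    return all(c >= k for c in counts.values())

rows = [
    {"age_range": "30-39", "zip3": "021", "diagnosis": "flu"},
    {"age_range": "30-39", "zip3": "021", "diagnosis": "asthma"},
    {"age_range": "40-49", "zip3": "479", "diagnosis": "flu"},
]
print(is_k_anonymous(rows, ["age_range", "zip3"], k=2))  # False: one group has size 1
```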

The idea that knowledge that applies to a group rather than to a specific individual does not violate privacy is legally defensible. Census bureaus have used this approach as a means of protecting privacy. Census data is typically published as contingency tables: counts of individuals meeting a particular criterion (see Table 1).

Table 1. Excerpt from Table of Census Data, U.S. Census Bureau

Block Group 1, Census Tract 1, District of Columbia, District of Columbia
    Total:                     9
    Owner occupied:            3
        1-person household     2
        2-person household     1
        ...
    Renter occupied:           6
        1-person household     3
        2-person household     2
        ...

Aggregates that reflect a large enough number of households are not considered privacy sensitive. However, when cells list only a few individuals (as in Table 1), combining the data with other tables may reveal private information. For example, if we know that all owner-occupied 2-person households have salary over $40,000, and that of the nine multiracial households only one has salary over $40,000, we can determine that the single multiracial individual in an owner-occupied 2-person household makes over $40,000. Since race and household size can often be observed, and home ownership status is publicly available (in most of the U.S.), this would result in disclosure of an individual salary.

Several methods are used to combat this. The data used to generate Table 1 uses introduction of noise; the Census Bureau warns that "statistical procedures have been applied that introduce some uncertainty into data for small geographic areas with small population groups." Other techniques include cell suppression, in which counts smaller than a threshold are not reported at all, and generalization, where cells with small counts are merged (e.g., changing Table 1 so that it doesn't distinguish between owner-occupied and renter-occupied housing). Generalization and suppression are also common techniques for k-anonymity.
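The sketch below illustrates cell suppression and generalization on a toy version of the counts in Table 1; the threshold and the merging rule are illustrative assumptions, not Census Bureau practice.

```python
# Sketch of two disclosure-limitation techniques named above:
# cell suppression (drop counts below a threshold) and generalization
# (merge owner/renter cells of the same household size). Values invented.
def suppress_small_cells(counts, threshold=5):
    """Replace counts below the threshold with None (suppressed)."""
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}

def generalize_tenure(counts):
    """Merge owner-occupied and renter-occupied cells of the same size."""
    merged = {}
    for (tenure, size), n in counts.items():
        merged[("any tenure", size)] = merged.get(("any tenure", size), 0) + n
    return merged

block = {
    ("owner", "1-person"): 2,
    ("owner", "2-person"): 1,
    ("renter", "1-person"): 3,
    ("renter", "2-person"): 2,
}
print(suppress_small_cells(block))
print(generalize_tenure(block))
```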

This work gives us one metric that applies to privacy-preserving data mining: if the disclosures from a technique (including the results) can be shown to generalize to large enough groups of individuals, then the size of the group can be used as a metric for privacy protection. The size-of-group standard may be easily met for some techniques; e.g., pruning approaches for decision trees may already generalize outcomes that would otherwise apply to only small groups, and association rule support counts provide a clear group size.
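As a simple illustration of the size-of-group standard, the sketch below releases a data mining result (here, hypothetical association rules) only when its support count, taken as the size of the group the disclosure describes, meets a minimum threshold; the rule format and threshold are assumptions.

```python
# Illustrative sketch only: treat an association rule's support count as
# the size of the group its disclosure describes, and release a rule only
# if that group is large enough.
def releasable_rules(rules, min_group_size):
    """Keep rules whose support count meets the group-size threshold."""
    return [r for r in rules if r["support_count"] >= min_group_size]

rules = [
    {"rule": "diapers -> beer", "support_count": 120},
    {"rule": "rare_drug -> rare_condition", "support_count": 3},
]
print(releasable_rules(rules, min_group_size=25))  # only the first rule survives
```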

There have been several other techniques developed by the official statistics (census research) community to mitigate the risk of individual identification. These include generalization (e.g., limiting geographic detail), top/bottom coding (e.g., reporting a salary only as "greater than $100,000"), and data swapping (taking two records and swapping their values for one attribute). These techniques introduce uncertainty into the data, thus limiting the confidence in attempts to identify an individual in the data. They have been used to create Public Use Microdata Sets: data sets that appear to be an actual sample of census data. Because the data is only a sample, and these techniques have been applied, a match with a real individual is unlikely. Even if an apparent match is found, it is likely that this match in the quasi-identifier was actually created from some other individual through the data perturbation techniques. Knowing this, an adversary trying to compromise privacy can have little confidence that the matching data really applies to the targeted individual.

Metrics for evaluating such techniques look at both privacy and the value of the data. Determining the value of the data is based on preservation of univariate and covariate statistics on the data. Privacy is based on the percentage of individuals that a particularly well-equipped adversary could identify. Assumptions are that the adversary:

1. knows that some individuals are almost certainly in the sample (e.g., 600-1000 for a sample of 1,500 individuals),

2. knows that the sample comes from a restricted set of individuals (e.g., 20,000),

3. has a good estimate (although with some uncertainty) of the non-sensitive values (quasi-identifiers) for the target individuals, and

4. has a reasonable estimate of the sensitive values (e.g., within 10%).

The metric is based on the number of individuals the adversary is able to correctly and confidently identify. In [17], identification rates of 13% are considered acceptably low. Note that this is an extremely well-informed adversary; in practice rates would be much lower.
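The sketch below illustrates the general shape of such an experiment: a simulated adversary matches target individuals to released records on quasi-identifiers and counts unique matches whose sensitive value falls within an estimation tolerance. The matching rule, attribute names, and tolerance are assumptions for illustration, not the procedure used in [17].

```python
# Illustrative re-identification experiment: estimate the fraction of
# targets an adversary links correctly and confidently to released records.
def reidentification_rate(released, targets, quasi_ids, tol=0.10):
    hits = 0
    for t in targets:
        matches = [r for r in released
                   if all(r[q] == t[q] for q in quasi_ids)]
        # Count a hit only for a unique match whose sensitive value is
        # within the adversary's estimated range.
        if len(matches) == 1 and abs(
            matches[0]["salary"] - t["est_salary"]
        ) <= tol * t["est_salary"]:
            hits += 1
    return hits / len(targets)

released = [{"age": 34, "zip3": "021", "salary": 61_000},
            {"age": 34, "zip3": "021", "salary": 97_000},
            {"age": 58, "zip3": "479", "salary": 45_000}]
targets = [{"age": 58, "zip3": "479", "est_salary": 47_000},
           {"age": 34, "zip3": "021", "est_salary": 60_000}]
print(reidentification_rate(released, targets, ["age", "zip3"]))  # 0.5
```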

This experimental approach could be used to determine the ability of a well-informed adversary to identify individuals based on privacy-preserving data mining approaches. However, it is not amenable to a simple, one-size-fits-all standard: as demonstrated in [17], applying this approach demands considerable understanding of the particular domain and the privacy risks associated with that domain.

A metric presented in [6] tries to formalize this concept of distinguishability in a more general form than k-anonymity. The idea is that we should be unable to learn a classifier that distinguishes between individuals with high probability. The specific metric proposed was:

Definition 1 [6] Two records that belong to different individuals $I_1, I_2$ are $p$-indistinguishable given data $X$ if for every polynomial-time function $f : I \to \{0, 1\}$,

$$\left| \Pr\{f(I_1) = 1 \mid X\} - \Pr\{f(I_2) = 1 \mid X\} \right| \le p$$

where $0 < p$