
ANONYMITY IN DATA PUBLISHING AND DISTRIBUTION

by

Kristen Riedt LeFevre

A dissertation submitted in partial fulfillment of

the requirements for the degree of

Doctor of Philosophy

(Computer Sciences)

at the

UNIVERSITY OF WISCONSIN–MADISON

2007


© Copyright by Kristen Riedt LeFevre 2007

All Rights Reserved


To my parents, Jeanne and John LeFevre.


ACKNOWLEDGMENTS

Research is never conducted in isolation. Rather, numerous people have contributed to this thesis through their ideas, discussions, feedback, and support.

First and foremost, I would like to thank my advisors, David DeWitt and Raghu Ramakrishnan. Their guidance, insight, and humor have shaped me as a researcher, and I am extremely fortunate to have had the opportunity to work with them over the past few years. In addition, I would like to thank Jeff Naughton and AnHai Doan for their support, feedback, and encouragement.

I also owe a great deal of gratitude to (current and former) members of the Intelligent Information Systems group at IBM Almaden Research Center. It was Rakesh Agrawal who initially led me into the topic of data privacy. During my time at IBM, I learned a lot from several collaborators: Roberto Bayardo, Tyrone Grandison, Jerry Kiernan, Ashwin Machanavajjhala, Ramakrishnan Srikant, Evimaria Terzi, and Yirong Xu. I am also grateful to IBM for supporting my research for two years through the IBM Ph.D. Fellowship program.

I am also extremely grateful to Surajit Chaudhuri's group for supporting me through the Microsoft Research Fellowship, and to the National Science Foundation CyberTrust program for supporting the umbrella Goal-Oriented Privacy Project.

At Wisconsin, I am fortunate to be surrounded by an amazing group of fellow students. In particular, Doug Burdick, Bee-Chung Chen, Lei Chen, Hector Corrada Bravo, Jesse Davis, Scott Diehl, and Vuk Ercegovac have provided valuable collaborations, technical feedback, and discussions related to my thesis work. The database group as a whole has made this a dynamic and exciting environment in which to work. I would particularly like to thank Ahmed Ayad, Jennifer Beckmann, Pedro Bizarro, Fei Chen, Lei Chen, Eric Chu, Pedro DeRose, Hongfei Guo, Alan Halverson, Allison Holloway, Jiansheng Huang, Tochukwu Iwuchukwu, Ameet Kini, Erik Paulson, Christine Reilly, Eric Robinson, Mayssam Sayyadian, Warren Shen, Srinath Shankar, Pachu Shrinivas, and Pradeep Tamma in this regard. I would also like to thank Professors Jude Shavlik and Miron Livny for the technical feedback they have provided on several occasions.

The women of WACM and SWEGA have provided an important support network and sense of community over the past few years. Many thanks to Emily Blem, Camille Fournier, Natalie Jerger, Sarah Knoop, Mariyam Mirza, Christina Oberlin, Irene Ong, Alice Pawley, Florentina Popovici, and Sondra Renley.

I would also like to thank members of the lunchtime swim team: Jim Burt, Shawn Jeffery, Colleen Moore, and Kerri Priest.

Finally, on a personal note, I would never have made it this far without the continued support and encouragement of my family: Jeanne, John, and Katie LeFevre.


TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

1 Introduction
  1.1 De-Identification in Data Publishing
  1.2 The HIPAA Privacy Rule
  1.3 Anonymity Framework & Definitions
    1.3.1 Threat Model & Probabilistic Interpretation
    1.3.2 Practical Diversity Extensions
    1.3.3 Properties
  1.4 Recoding Techniques
    1.4.1 Global vs. Local Recoding
    1.4.2 Generalization vs. Suppression
    1.4.3 Hierarchy-Based vs. Partition-Based Recoding
  1.5 Thesis Contributions and Organization

2 Incognito: Efficient Full-Domain Generalization
  2.1 Full-Domain Generalization
  2.2 Previous Algorithms for Full-Domain Generalization
  2.3 Incognito Algorithm
    2.3.1 Basic Incognito Algorithm
    2.3.2 Soundness and Completeness
    2.3.3 Algorithm Optimizations
  2.4 Experimental Performance Evaluation
    2.4.1 Experimental Data and Setup
    2.4.2 Experimental Results
  2.5 Chapter Summary

3 Mondrian: Multidimensional Partitioning
  3.1 Multidimensional Partitioning
  3.2 Some Simple General-Purpose Measures of Quality
  3.3 Theoretical Analysis (k-Anonymity)
    3.3.1 Hardness Result
    3.3.2 Bounds on Equivalence Class Size
  3.4 Recursive Partitioning Framework
    3.4.1 Quality Bounds (k-Anonymity)
    3.4.2 Incorporating Diversity
  3.5 Experimental Evaluation of Data Quality
    3.5.1 Experimental Data
    3.5.2 Results for CDM
    3.5.3 Attributes Preserved
  3.6 Chapter Summary

4 Incorporating Workload
  4.1 Motivating Example
  4.2 Language Describing Workloads
  4.3 Classification and Regression
    4.3.1 Single Target Classification Model
    4.3.2 Single Target Regression Model
    4.3.3 Multiple Target Models
    4.3.4 Other Attribute Characterizations
  4.4 Selection
  4.5 Aggregation and Summary Statistics
  4.6 Experimental Evaluation of Data Quality
    4.6.1 Methodology
    4.6.2 Learning from Regions
    4.6.3 Experimental Data
    4.6.4 Comparison with Previous Algorithms
    4.6.5 Multiple Target Models
    4.6.6 Privacy-Utility Tradeoff
    4.6.7 Selection
    4.6.8 Projection
  4.7 Chapter Summary

5 Rothko: Scalable Variations of Mondrian
  5.1 Previous Scalable Algorithms
  5.2 Exhaustive Algorithm (Rothko-T)
    5.2.1 Algorithm Overview
    5.2.2 Recoding Function Scalability
  5.3 Sampling Algorithm (Rothko-S)
    5.3.1 Estimators & Hypothesis Tests
    5.3.2 Discussion
  5.4 Analytical Comparison
    5.4.1 Rothko-T
    5.4.2 Rothko-S
  5.5 Experimental Performance Evaluation
    5.5.1 Experimental Setup
    5.5.2 Need for a Scalable Algorithm
    5.5.3 Counting I/O Requests
    5.5.4 Runtime Performance
    5.5.5 Effects of Sampling on Data Quality
    5.5.6 Hypothesis Tests and Pruning
  5.6 Chapter Summary

6 Related Work
  6.1 Privacy for Published Microdata
    6.1.1 Generalization, Recoding & Microaggregation
    6.1.2 Multiple Releases & Evolving Data
    6.1.3 The Role of Background Knowledge
    6.1.4 Other Perturbative Techniques
  6.2 Client-Side Input Perturbation
  6.3 Statistical Databases
    6.3.1 Auditing Disclosure
    6.3.2 Output Perturbation
  6.4 Query-View Security
  6.5 Distributed Privacy-Preserving Data Mining
  6.6 Database Authorization, Access Control & Security

7 Summary and Discussion

LIST OF REFERENCES

APPENDICES
  Appendix A: HIPAA Safe Harbor Provision


LIST OF TABLES

1.1 Original Data (R[ID, Q1, ..., Qd, S])
1.2 Generalized View (R∗[Q1, ..., Qd, S])
1.3 External database (U[ID, Q1, ..., Qd])
1.4 Combining U[ID, Q1, ..., Qd] and R∗[Q1, ..., Qd, S]
2.1 Hospital Patients (Incognito running example)
2.2 Experimental Data Description: Adults
2.3 Experimental Data Description: Lands End
2.4 Total nodes searched by algorithm (k = 2)
3.1 Hospital Patients (Mondrian running example)
3.2 2-Anonymous single-dimensional recoding of Patients
3.3 2-Anonymous multidimensional recoding of Patients
4.1 Experimental Data Description: Synthetic features / quasi-identifier attributes
4.2 Experimental Data Description: Synthetic class label functions
4.3 Experimental Data Description: Census
4.4 Experimental Data Description: Contraceptives
5.1 Experimental system configuration
5.2 Experimental Data Description: Synthetic numeric target functions


LIST OF FIGURES

1.1 Threat model diagram
2.1 Domain and value generalization hierarchies
2.2 2-Attribute generalization lattice
2.3 Star-schema defining generalization dimensions
2.4 Searching candidate 2-attribute generalization graphs
2.5 Algorithm: Basic Incognito
2.6 Relational representation of generalization lattice
2.7 3-Attribute graph (generated following 2-attribute search in Figure 2.4)
2.8 3-Attribute lattice without a priori pruning
2.9 Incognito performance evaluation for varied QID
2.10 Incognito performance evaluation for varied k
2.11 Cube Incognito performance
3.1 Spatial representation of Patients and partitionings
3.2 Equivalence class size bound example (2 dimensions)
3.3 Algorithm: Mondrian
3.4 Example partition tree
3.5 Experimental quality evaluation using CDM
3.6 2-Dimensional single-dimensional partitionings
3.7 2-Dimensional multidimensional partitionings
4.1 Attribute type characterizations for anonymity and classification/regression
4.2 Features vs. quasi-identifiers in classification-oriented anonymization
4.3 Evaluating a selection over generalized data
4.4 Mapping a d-dimensional rectangular region to 2·d attributes
4.5 Classification-based model evaluation using synthetic data (k = 25)
4.6 Classification-based model evaluation using real-world data
4.7 General-purpose quality measures using real-world data
4.8 Regression-based model evaluation using real-world data
4.9 Classification-based model evaluation for multiple models (k = 25)
4.10 ℓ-Diversity experiment
4.11 Imprecision for synthetic Function C2
4.12 Selection and projection experiment
5.1 Example: Rothko-T
5.2 Example: Rothko-S
5.3 Notation for analytical comparison
5.4 In-memory implementation for large data sets
5.5 I/O Cost Comparisons
5.6 Scale-up performance for Median splitting
5.7 Runtime performance for varied k
5.8 Scale-up performance for InfoGain splitting
5.9 Conditional Entropy (InfoGain splitting)
5.10 WMSE (Regression splitting)
5.11 Number of nodes pruned by Rothko-S as a function of the sample size n
6.1 Sanitized publication model
6.2 Client-side input perturbation model
6.3 Online query auditing model
6.4 Output perturbation model
6.5 Secure multiparty computation model


ABSTRACT

Numerous organizations collect and distribute non-aggregate personal data for a variety of different purposes, including demographic and public health research. In these situations, the data distributor is often faced with a quandary: On one hand, it is important to protect the anonymity and personal information of individuals. On the other hand, it is also important to preserve the utility of the data for research.

This thesis presents an extensive study of this problem. We focus primarily on notions of anonymity that are defined with respect to individual identity, or with respect to the value of a sensitive attribute. We propose a variety of techniques that use generalization (also called recoding) to produce a sanitized view, while preserving the utility of the input data. An extensive evaluation indicates that it is possible to distribute high-quality data that respects several meaningful notions of privacy. Further, it is possible to do this efficiently for large data sets.


Chapter 1

Introduction

Personal information is collected, stored, analyzed, and distributed in the course of everyday life. In the medical domain, the US Department of Health and Human Services has announced a major initiative toward digitizing the patient records maintained by hospitals, pharmacies, etc. [72]. In the United States, three independent credit reporting agencies maintain databases of personal finance information that are widely used in credit evaluation [33, 36, 84].

Supermarkets and other retailers maintain and analyze large databases of customer purchase information, collected by way of various affinity and discount programs. For example, when a customer makes a purchase using its "Club Card," the Safeway supermarket chain records data about the transaction, including "the amount and content of your purchases and the time and place these purchases are made" [76]. On the surface, this appears harmless, yet there is the potential for abuse. For example, in a Los Angeles court case, Robert Rivera sued Vons grocery store (owned by Safeway) after a slip-and-fall incident. During negotiations, Mr. Rivera's attorney claimed that Vons had accessed his client's shopping records, and planned to introduce at trial information regarding Rivera's frequent purchases of alcohol, implying that he was drunk at the time of the accident [87].

In the online world, many websites and service providers track users' search requests and navigation patterns. For example, in its privacy policy, Google clearly states [43]:

When you use Google services, our servers automatically record information that your browser sends whenever you visit a website. These server logs may include information such as your web request, Internet Protocol address, browser type, browser language, the date and time of your request and one or more cookies that may uniquely identify your browser.

Indeed, the sensitive nature of this kind of information was demonstrated in the summer of 2006 when AOL distributed search histories for more than half a million of its users (with names removed). Nonetheless, the New York Times was able to identify a handful of users based on the content of their searches [13].

Finally, in the name of counter-terrorism, the United States Department of Homeland Security has revealed the existence of a database used to assign "risk-assessments" to millions of American citizens who travel across national borders [67].

Given the ease with which such data is collected and distributed, it is not surprising that many questions have been raised in recent years about individual privacy in the digital world. Broadly speaking, the problem of data privacy encompasses the many legal, ethical, and technical issues surrounding data ownership, collection, dissemination, and use. The work described in this thesis focuses on a particular technical problem within this broad space: privacy protection and de-identification in data publishing.

1.1 De-Identification in Data Publishing

Numerous organizations collect and distribute microdata (personal data in its raw, non-aggregate form) for purposes including demographic and public health research.

In most cases, attributes that are known to uniquely identify individuals (e.g., Name or Social Security Number) are removed from the released data. However, this fails to account for the possibility of combining other, seemingly innocuous, attributes with external data to uniquely identify individuals. For example, according to one study, 87% of the population of the United States can be uniquely identified on the basis of their 5-digit zip code, sex, and date of birth [83].

The uniqueness of such attribute combinations leads to a class of "linking" attacks, where individuals are "re-identified" by combining multiple (frequently publicly-available) data sets. This type of attack was demonstrated by Sweeney, who was able to combine a public voter registration list and the de-identified patient data of Massachusetts's state employees to determine the medical history of the state's governor [83]. Concern over this type of attack has mounted in recent years due to the ease with which data is distributed over the World Wide Web.

1.2 The HIPAA Privacy Rule

In the medical domain, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) included a number of provisions related to personal privacy. In response to this legislation, the U.S. Department of Health and Human Services issued the regulation "Standards for Privacy of Individually Identifiable Health Information," commonly known as the HIPAA Privacy Rule, with which covered entities were required to demonstrate compliance by 2003.

The HIPAA Privacy Rule specifically addresses de-identification, and provides two distinct sets of requirements [66]. By satisfying one of these two provisions, data may be exempt from many of the regulations concerning personally-identifiable health information.

The first provision is deliberately vague, stating that a covered entity may determine that health information is not individually identifiable if [66]:

A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable: (i) Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and (ii) Documents the methods and results of the analysis that justify such determination

In contrast, the second provision (the so-called Safe Harbor) is very specific, and quite restrictive, requiring that eighteen specific types of information, including names and geographic information, be removed entirely for any person (e.g., patients, doctors, etc.) before the data can be considered de-identified [66]. This information is provided in Appendix A.

From a technical perspective, neither of these provisions is entirely satisfactory. The first provision is not explicit about what information is sensitive, what constitutes a "low" risk, or who should be considered a statistical expert. The second provision is more precise, but necessitates removing much of the information that is most useful in public health studies (e.g., geography and dates). Throughout this thesis, we seek to provide anonymization techniques that balance rigorous standards of privacy with the often competing goal of releasing useful and informative data.

1.3 Anonymity Framework & Definitions

This section gives an overview of the problems considered throughout this thesis. We begin with a single input relation R, containing non-aggregate personal data collected by a centralized organization. As in the majority of previous work on this topic, we assume that each attribute in R can be uniquely characterized by at most one of the following types based on knowledge of the application domain:

• Identifier: Unique identifiers (denoted ID), such as Name and Social Security Number, are removed entirely from the published data.¹

• Quasi-Identifier: The quasi-identifier is a set of attributes {Q1, ..., Qd} that is externally available in combination (either in a single table, or through joins) to the data recipient.² Examples include the combination of Birth Date, Sex, and Zip Code.

• Sensitive Attribute: An attribute S is considered sensitive if an adversary should not be permitted to uniquely associate its value with an identifier. An example is a patient's Disease attribute.

We consider the problem of producing a sanitized snapshot of R[Q1, ..., Qd, S], which we denote R∗[Q1, ..., Qd, S], that is intended to limit the risk of a linking attack.³ Throughout the remainder of this section, we will describe R∗ in terms of an abstract bucketization. Specifically, it is convenient to think of R∗ as horizontally partitioning R[Q1, ..., Qd, S] into a set of non-overlapping equivalence classes R1, ..., Rm, each with identical quasi-identifier values. A more thorough discussion of generalization techniques is in Section 1.4, but notice that it is possible to represent generalized tables in this way. For example, the generalization in Table 1.2 divides the records from Table 1.1 into two equivalence classes. Throughout the thesis, we will assume bag semantics unless otherwise noted.

¹ We use the terminology somewhat loosely. Even the U.S. Social Security Department is known to occasionally issue duplicate numbers.
² We thank Bettini et al. [16] for pointing out the incorrect definition in our earlier work. The revised definition does not substantively alter our previous results.
³ Additional problems, including inference, arise when multiple different sanitized versions of the same microdata are made available, a problem we describe more fully in Chapter 6.

Name      DOB       Sex     Zipcode  Disease
Andrew    1/5/76    Male    02173    Cancer
Bob       2/18/76   Male    02173    Broken Arm
Carl      2/24/76   Male    02174    Flu
Ellen     5/8/77    Female  02177    HIV
Frances   11/10/77  Female  02174    HIV
Gloria    12/1/77   Female  02175    HIV

Table 1.1 Original Data (R[ID, Q1, ..., Qd, S])

The first anonymity requirement we consider is k-anonymity, an intuitive means of protecting individual identity that was originally proposed by Samarati and Sweeney [78, 83]. Originally, k-anonymity was motivated by the idea that each individual in the released data should blend into a crowd. That is, no individual in R∗ should be uniquely identifiable from a group of size smaller than k on the basis of its quasi-identifier values.

Definition 1.1 (k-Anonymity) Sanitized view R∗ is said to be k-anonymous if each unique tuple in the projection of R∗ on Q1, ..., Qd occurs at least k times.
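To make Definition 1.1 concrete, the following Python sketch (not part of the dissertation; the list-of-tuples representation and column indices are assumptions made for illustration) checks whether a released view is k-anonymous by counting the occurrences of each distinct quasi-identifier tuple:

    from collections import Counter

    def is_k_anonymous(records, qi_indices, k):
        """Definition 1.1: every distinct quasi-identifier tuple in the released
        view must occur at least k times (bag semantics)."""
        counts = Counter(tuple(rec[i] for i in qi_indices) for rec in records)
        return all(c >= k for c in counts.values())

    # The generalized view of Table 1.2, as (DOB, Sex, Zipcode, Disease) tuples.
    r_star = [
        ("1976", "Male",   "0217*", "Flu"),
        ("1976", "Male",   "0217*", "Broken Arm"),
        ("1976", "Male",   "0217*", "Cancer"),
        ("1977", "Female", "0217*", "HIV"),
        ("1977", "Female", "0217*", "HIV"),
        ("1977", "Female", "0217*", "HIV"),
    ]
    print(is_k_anonymous(r_star, qi_indices=[0, 1, 2], k=3))  # True
    print(is_k_anonymous(r_star, qi_indices=[0, 1, 2], k=4))  # False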


DOB   Sex     Zipcode  Disease
1976  Male    0217*    Flu
1976  Male    0217*    Broken Arm
1976  Male    0217*    Cancer
1977  Female  0217*    HIV
1977  Female  0217*    HIV
1977  Female  0217*    HIV

Table 1.2 Generalized View (R∗[Q1, ..., Qd, S])

1.3.1 Threat Model & Probabilistic Interpretation

Although k-anonymity is effective in protecting individual identities, Machanavajjhala et al. noted that it often fails to protect the values of one or more sensitive attributes [60]. This idea is nicely motivated using a simple threat model and probabilistic interpretation of privacy.⁴

The class of linking attacks can be framed in terms of a simple threat model, depicted diagrammatically in Figure 1.1. We begin with a universe of records U[ID, Q1, ..., Qd, S], each describing a unique individual in an underlying (finite) population. The data collected by an organization is a subset of this universe (R ⊆ U). We consider an adversary who has access to the released snapshot R∗[Q1, ..., Qd, S], as well as some subset of U[ID, Q1, ..., Qd]. Using this information, the adversary tries to reconstruct the association between identifiers and sensitive values. In practice, however, it is impossible for the organization to anticipate precisely what records are available to the adversary. For this reason, we take a pessimistic view, and assume that the adversary has access to all records in U[ID, Q1, ..., Qd]. We refer to this as the external database.

It is convenient to model the adversary's knowledge of U and R∗ in terms of a set of pairs {(G1, S1), ..., (Gm, Sm)}. Si denotes the multiset of sensitive values in the ith equivalence class Ri, and Gi denotes the set of identifiers from U[ID] that can be logically associated with this equivalence class via the quasi-identifier. Notice, however, that |Gi| may be greater than |Si|. In this case, we add |Gi| − |Si| copies of a special null value (⊥) indicating "unknown" to Si.

⁴ This section is intended primarily to provide background explanation. The probabilistic interpretation is an adaptation and variation of the work in several papers [24, 60, 61, 95]. The idea that the external population can be described by a universal schema, and that the private data R is a subset of this relation, was introduced by Samarati in the context of k-anonymity, and used to motivate the definition of k-anonymity in terms of R [78].

Figure 1.1 Threat model diagram

For example, consider the external database in Table 1.3, and the generalized view in Table 1.2. In this case, we have {(G1 = {Andrew, Bob, Carl, Dave}, S1 = {Flu, Broken Arm, Cancer, ⊥}), (G2 = {Ellen, Frances, Gloria}, S2 = {HIV, HIV, HIV})}. This is also depicted in Table 1.4.

Given the released data and external database, the adversary's belief that sensitive value s is associated with a particular identity t is described by a conditional probability denoted P((t, s) | R∗, U). We define this probability using the random worlds model [12]. A possible world for (R∗, U) is an assignment that matches each element of Si with an element of Gi. In our example, the following is one of 144 possible worlds: {(Andrew, Flu), (Bob, Cancer), (Carl, ⊥), (Dave, Broken Arm), (Ellen, HIV), (Frances, HIV), (Gloria, HIV)}.

In the absence of additional information, we make the standard assumption that each possible world is equally likely. Let W1, ..., Wn denote the set of possible worlds for (R∗, U). Under this assumption,

P((t, s) | R∗, U) = |{Wi : Wi contains (t, s)}| / n.    (1.1)


Returning to our example, in all of the possible worlds, Ellen is paired with HIV. Thus, we say that P((Ellen, HIV) | R∗, U) = 1. Similarly, P((Andrew, Flu) | R∗, U) = 1/4.
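On small examples, the random-worlds probability of Equation 1.1 can be verified by brute force. The sketch below (illustrative only, not from the dissertation) enumerates, for one equivalence class, every assignment of the padded sensitive-value multiset Si to the identifiers in Gi; because possible worlds factor across equivalence classes, the per-class fraction equals P((t, s) | R∗, U):

    from itertools import permutations

    def pair_probability(group_ids, sensitive_values, target_id, target_value):
        """Fraction of possible worlds (assignments of the padded sensitive-value
        multiset to the identifiers in the class) pairing target_id with target_value."""
        assert len(group_ids) == len(sensitive_values)
        worlds = list(permutations(sensitive_values))
        matching = sum(1 for w in worlds
                       if w[group_ids.index(target_id)] == target_value)
        return matching / len(worlds)

    # Equivalence classes from Tables 1.2 and 1.3 (None plays the role of ⊥,
    # padding for Dave, who appears in U but not in R).
    g1, s1 = ["Andrew", "Bob", "Carl", "Dave"], ["Flu", "Broken Arm", "Cancer", None]
    g2, s2 = ["Ellen", "Frances", "Gloria"], ["HIV", "HIV", "HIV"]

    print(pair_probability(g1, s1, "Andrew", "Flu"))  # 0.25
    print(pair_probability(g2, s2, "Ellen", "HIV"))   # 1.0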

Returning to the threat model, our goal is to limit the adversary's confidence that a particular sensitive value is associated with any particular individual. That is, anonymization with respect to a sensitive attribute should place an upper bound on the adversary's belief in (t, s). Thus, we say that R∗ is safe if the following condition holds, where c is a user-specified parameter (0 < c < 1). We will refer to the maximum conditional probability as the breach probability.

max{ P((t, s) | R∗, U) : t ∈ U[ID], s ∈ U[S] } ≤ c    (1.2)

This formulation is still problematic, however, because the data publisher does not have access to the external database. Fortunately, because R ⊆ U, it is easy to compute an upper bound on this conditional probability. Specifically,

max{ P((t, s) | R∗, U) : t ∈ U[ID], s ∈ U[S] } ≤ max{ P((t, s) | R∗, R) : t ∈ R[ID], s ∈ R[S] }    (1.3)

It is then easy to compute P((t, s) | R∗, R) for t ∈ R[ID], s ∈ R[S]. Let Ri be the equivalence class in R∗ containing t. We compute the probability as follows:

P((t, s) | R∗, R) = |{x : x ∈ Ri, x.S = s}| / |Ri|    (1.4)

In our example, we know that the breach probability with respect to Disease is less than or equal to 1.

Finally, returning to our motivation, it is important to note that when it is the individual's identity (represented by a unique sensitive value) that is to be protected, k-anonymity is sufficient to guarantee that the breach probability is no more than 1/k.
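Because the bound of Equations 1.3 and 1.4 depends only on the released view, the publisher can compute it directly. A minimal sketch (illustrative; the tuple layout and column indices are assumptions) takes the maximum within-class relative frequency of any sensitive value:

    from collections import Counter, defaultdict

    def breach_probability_bound(records, qi_indices, s_index):
        """Upper bound of Equation 1.3, computed via Equation 1.4: the largest
        relative frequency of any sensitive value within any equivalence class."""
        classes = defaultdict(list)
        for rec in records:
            classes[tuple(rec[i] for i in qi_indices)].append(rec[s_index])
        worst = 0.0
        for values in classes.values():
            most_common = Counter(values).most_common(1)[0][1]
            worst = max(worst, most_common / len(values))
        return worst

    # The view of Table 1.2: the female class is uniformly HIV, so the bound is 1.
    r_star = ([("1976", "Male", "0217*", d) for d in ("Flu", "Broken Arm", "Cancer")]
              + [("1977", "Female", "0217*", "HIV")] * 3)
    print(breach_probability_bound(r_star, qi_indices=[0, 1, 2], s_index=3))  # 1.0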

1.3.2 Practical Diversity Extensions

Following intuition similar to that described in the previous section, Machanavajjhala et al. formulate the practical ℓ-diversity principle, which requires that each equivalence class in R∗ contain at least ℓ "well-represented" values of sensitive attribute S [60]. This principle can be implemented in several ways. Let DS denote the (finite or infinite) domain of attribute S. The first proposal requires that the entropy of S within each equivalence class be sufficiently large. (We adopt the convention 0 log 0 = 0.)

Name      DOB       Sex     Zipcode
Andrew    1/5/76    Male    02173
Bob       2/18/76   Male    02173
Carl      2/24/76   Male    02174
Dave      9/25/76   Male    02174
Ellen     5/8/77    Female  02177
Frances   11/10/77  Female  02174
Gloria    12/1/77   Female  02175
Hank      3/5/82    Male    02178

Table 1.3 External database (U[ID, Q1, ..., Qd])

            DOB   Sex     Zipcode  Disease
(Andrew)    1976  Male    0217*    Flu
(Bob)       1976  Male    0217*    Broken Arm
(Carl)      1976  Male    0217*    Cancer
(Dave)      1976  Male    0217*    ⊥
(Ellen)     1977  Female  0217*    HIV
(Frances)   1977  Female  0217*    HIV
(Gloria)    1977  Female  0217*    HIV

Table 1.4 Combining U[ID, Q1, ..., Qd] and R∗[Q1, ..., Qd, S]

Definition 1.2 (Entropy ℓ-Diversity) A table R∗ is entropy ℓ-diverse with respect to S if, for every equivalence class Ri in R∗, Σ_{s∈DS} −p(s|Ri) log p(s|Ri) ≥ log(ℓ), where p(s|Ri) is the fraction of tuples in Ri with S = s. [60]

Entropy ℓ-diversity is often quite restrictive. Because the entropy function is concave, in order to satisfy ℓ-diversity, the entropy of S within the entire data set must be at least log(ℓ) [60]. For this reason, they provide an alternate definition motivated by an "elimination" attack model. The intuition informing the following definition is as follows: the adversary must eliminate at least ℓ − 1 sensitive values in order to conclusively determine the sensitive value for a particular individual.

Definition 1.3 (Recursive (c, ℓ)-Diversity) Within an equivalence class Ri, let xi denote the number of times the ith most frequent sensitive value appears. Given a constant c, Ri satisfies recursive (c, ℓ)-diversity with respect to S if x1 < c(xℓ + xℓ+1 + ... + x|DS|). R∗ satisfies recursive (c, ℓ)-diversity if every equivalence class in R∗ satisfies recursive (c, ℓ)-diversity. (We say (c, 1)-diversity is always satisfied.) [60]
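As an illustration of Definitions 1.2 and 1.3 (the helper functions below are not from the dissertation), both conditions can be checked per equivalence class from the multiset of its sensitive values:

    import math
    from collections import Counter

    def is_entropy_l_diverse(sensitive_values, l):
        """Definition 1.2: -sum p(s) log p(s) >= log(l), with 0 log 0 = 0."""
        n = len(sensitive_values)
        probs = [c / n for c in Counter(sensitive_values).values()]
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        return entropy >= math.log(l)

    def is_recursive_cl_diverse(sensitive_values, c, l):
        """Definition 1.3: x1 < c*(x_l + x_{l+1} + ...), where x_i is the
        frequency of the i-th most frequent sensitive value."""
        if l == 1:
            return True  # (c, 1)-diversity is always satisfied
        freqs = sorted(Counter(sensitive_values).values(), reverse=True)
        return freqs[0] < c * sum(freqs[l - 1:])

    print(is_entropy_l_diverse(["Flu", "Cancer", "HIV", "HIV"], l=2))          # True
    print(is_entropy_l_diverse(["HIV", "HIV", "HIV"], l=2))                    # False
    print(is_recursive_cl_diverse(["Flu", "Flu", "Cancer", "HIV"], c=2, l=2))  # True: 2 < 2*(1+1)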

When S is numerically-valued, the definitions provided by Machanavajjhala et al. do not fully capture the intended intuition. For example, suppose S = Salary, and that some equivalence class contains salaries {100K, 101K, 102K}. Technically, this is considered 3-diverse; however, intuitively, it does not protect privacy as well as an equivalence class containing salaries {1K, 50K, 500K}.

For this reason, we propose an additional requirement, which is intended to guarantee a certain level of dispersion of S within each equivalence class. Let Var(Ri, S) = (1/|Ri|) Σ_{t∈Ri} (t.S − S̄(Ri))² denote the variance of values for sensitive attribute S among tuples in equivalence class Ri. (Let S̄(Ri) denote the mean value of S in Ri.)


Definition 1.4 (Variance Diversity) An equivalence class Ri is variance diverse with respect to sensitive attribute S if Var(Ri, S) ≥ v, where v is the diversity parameter. R∗ is variance diverse if each equivalence class in R∗ is variance diverse.
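A corresponding sketch for Definition 1.4 (again illustrative, with hypothetical salary values and an arbitrary parameter v) computes the population variance of the numeric sensitive values within a class:

    def is_variance_diverse(sensitive_values, v):
        """Definition 1.4: Var(Ri, S) = (1/|Ri|) * sum (t.S - mean)^2 >= v."""
        n = len(sensitive_values)
        mean = sum(sensitive_values) / n
        variance = sum((s - mean) ** 2 for s in sensitive_values) / n
        return variance >= v

    # Salaries from the example above: tightly clustered vs. widely dispersed.
    print(is_variance_diverse([100_000, 101_000, 102_000], v=10**9))  # False
    print(is_variance_diverse([1_000, 50_000, 500_000], v=10**9))     # True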

1.3.3 Properties

There are two important properties to note about each of the proposed anonymity requirements (k-anonymity, entropy ℓ-diversity, recursive (c, ℓ)-diversity, and variance diversity): monotonicity and bucket independence. Each of these properties proves important in developing efficient anonymization algorithms. Notice that we can define a partial order ⪯ on the set of all possible bucketizations, where R∗1 ⪯ R∗2 if and only if every equivalence class in R∗2 is the union of one or more of the equivalence classes in R∗1. In the following, the notation safeρ(R∗) denotes that R∗ satisfies anonymity requirement ρ.

Definition 1.5 (Monotonicity Property) Let R∗1 and R∗2 be bucketizations of input data R such that R∗1 ⪯ R∗2. An anonymity requirement ρ is monotone iff safeρ(R∗1) → safeρ(R∗2).

k-Anonymity satisfies the monotonicity property [53], as do entropy ℓ-diversity and recursive (c, ℓ)-diversity [60]. Similarly, it is straightforward to show that variance diversity satisfies this property.

Theorem 1.6 Variance diversity is monotone.

Proof: Without loss of generality, consider two finite multisets of real numbers, A = {a1, ..., am} and B = {b1, ..., bn}, which constitute the sensitive values in two non-overlapping buckets (equivalence classes). Suppose that each bucket satisfies variance diversity. That is, Var(A) ≥ v and Var(B) ≥ v. To show that Var(A ∪ B) ≥ v, it is sufficient to show that Var(A ∪ B) ≥ min(Var(A), Var(B)).

For two random variables, X and Y, the Law of Total Variance [73] states that Var(X) = E[Var(X|Y)] + Var[E(X|Y)]. For finite sets A and B, we use Y to indicate membership in one of the two sets. Thus, we can compute Var(A ∪ B) as follows, where Ā denotes the mean value in set A, B̄ the mean value in set B, and R̄ the mean value in A ∪ B:

R̄ = (m / (m + n)) Ā + (n / (m + n)) B̄

Var(A ∪ B) = (m / (m + n)) Var(A) + (n / (m + n)) Var(B) + (m / (m + n)) (Ā − R̄)² + (n / (m + n)) (B̄ − R̄)²

The first two terms form a weighted average of Var(A) and Var(B), and the remaining terms are non-negative. Thus, Var(A ∪ B) ≥ min(Var(A), Var(B)) ≥ v.
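The decomposition used in the proof is easy to check numerically. The following sketch (with arbitrary example buckets, not data from the thesis) verifies that the pooled variance equals the right-hand side of the decomposition and never falls below the smaller bucket variance:

    import math

    def variance(xs):
        """Population variance, matching the Var(.) used in the proof."""
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    def merged_variance_by_decomposition(a, b):
        """Right-hand side of the decomposition in the proof of Theorem 1.6."""
        m, n = len(a), len(b)
        mean_a, mean_b = sum(a) / m, sum(b) / n
        mean_ab = (m * mean_a + n * mean_b) / (m + n)
        return ((m / (m + n)) * variance(a) + (n / (m + n)) * variance(b)
                + (m / (m + n)) * (mean_a - mean_ab) ** 2
                + (n / (m + n)) * (mean_b - mean_ab) ** 2)

    a, b = [1_000, 50_000, 500_000], [20_000, 40_000, 60_000, 80_000]  # arbitrary buckets
    assert math.isclose(variance(a + b), merged_variance_by_decomposition(a, b))
    assert variance(a + b) >= min(variance(a), variance(b))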

The second property concerns our ability to determine whether a bucketization (sanitized view) R∗ satisfies the given anonymity requirement by evaluating each bucket (equivalence class) independently.⁵

Definition 1.7 (Bucket Independence) Let R1 and R2 be disjoint tuple sets, and let R∗1 and R∗2 be bucketizations of R1 and R2, respectively. An anonymity requirement ρ is bucket independent iff safeρ(R∗1) ∧ safeρ(R∗2) → safeρ(R∗1 ∪ R∗2).

By definition, k-anonymity, ℓ-diversity, and variance diversity are bucket independent requirements. This is in contrast to some subsequent proposals that incorporate background knowledge spanning multiple buckets (e.g., [61]); a more complete discussion of this and other related work is provided in Chapter 6.

1.4 Recoding Techniques

In their seminal work, Samarati and Sweeney proposed techniques for generating sanitized view R∗ using generalization and suppression [78, 82, 83]. (In the Statistics literature, this approach is often called recoding.) Informally, the idea is to replace quasi-identifier values with more general ("semantically consistent") values. For example, a Zipcode value could be generalized by suppressing the least significant digit. Subsequently, these ideas have been refined and extended in the literature.

From our perspective, the proposed recoding techniques can be roughly categorized along three main dimensions. Each recoding technique implicitly places a set of constraints on the space of possible anonymizations. As we will show throughout the thesis, the choice of recoding technique can greatly influence the quality of the anonymized data.

⁵ Many thanks to Bee-Chung Chen for pointing out the idea of bucket independence.

1.4.1 Global vs. Local Recoding

Many of the proposed recoding techniques seek to anonymize a given database by mapping each unique value of the quasi-identifier attributes to a modified (generalized) value. Following the terminology of Willenborg and de Waal [90], we refer to this as global recoding. In contrast, some proposals modify individual instances of data items, an approach we will call local recoding. In practice, the difference between global and local recoding amounts to a difference in the treatment of duplicate values.

In a relational database, there is typically a (finite or infinite) domain associated with each attribute (e.g., the domain of 5-digit integers, or the domain of dates). We use the notation DA to denote the domain of attribute A. In the course of this work, we identified two main sub-classes of global recoding. Prior to our work, most global recoding techniques required that the domain of each quasi-identifier attribute be recoded individually, an approach we termed single-dimensional global recoding. This flavor of generalization has been used to create numerous anonymization algorithms [14, 38, 47, 78, 82, 88].

Definition 1.8 (Single-Dimensional Global Recoding) A single-dimensional global recoding is defined by some function φi : DQi → D′i for each attribute Qi of the quasi-identifier. R∗ is obtained by applying each φi to the values of Qi in each tuple of R.

In contrast, we have proposed a more flexible global recoding technique, which maps the domain of quasi-identifier vectors to more general vector values [53, 55, 56]. We will refer to this approach as multidimensional global recoding.

Definition 1.9 (Multidimensional Global Recoding) A multidimensional global recoding is defined by a single function φ : ⟨DQ1 × ... × DQd⟩ → D′. R∗ is obtained by applying φ to the vector of quasi-identifier values in each tuple of R.


Unlike global recoding, local recoding allows tuples in R with identical quasi-identifier values to be replaced with different generalized values in R∗. In the presence of duplicate quasi-identifier tuples, this approach is more flexible than multidimensional global recoding.

Definition 1.10 (Local Recoding) A local recoding is defined by a function φ mapping each (non-distinct) tuple (q1, ..., qd) in the projection of Q1, ..., Qd on R to some new tuple (q′1, ..., q′d).

Two main local recoding techniques have been proposed in the literature. The first produces R∗ by suppressing individual cells of R [5, 62, 90]. The second maps individual cells to more general values [82].
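To illustrate how these recoding flavors differ in their treatment of duplicates (the attribute choices and mappings below are hypothetical, not taken from the dissertation), consider a two-attribute quasi-identifier (Age, Zipcode):

    # Hypothetical input tuples (Age, Zipcode); note the duplicated quasi-identifier value.
    tuples = [(25, "53715"), (25, "53715"), (27, "53710"), (34, "53706")]

    # Single-dimensional global recoding (Definition 1.8): one function per attribute
    # domain, applied uniformly to every occurrence of a value.
    def phi_age(age):
        low = 10 * (age // 10)
        return f"[{low}-{low + 9}]"

    def phi_zip(zipcode):
        return zipcode[:4] + "*"

    single_dimensional = [(phi_age(a), phi_zip(z)) for a, z in tuples]

    # Multidimensional global recoding (Definition 1.9): a single function over the
    # quasi-identifier vector; identical vectors still receive identical recodings.
    def phi_multi(age, zipcode):
        return ("[25-27]", "5371*") if zipcode.startswith("5371") else ("[30-39]", "5370*")

    multidimensional = [phi_multi(a, z) for a, z in tuples]

    # Local recoding (Definition 1.10): instances are recoded individually, so the two
    # copies of (25, "53715") may legitimately receive different generalized values.
    local = [("[20-29]", "53715"), ("[25-27]", "5371*"), ("[25-27]", "5371*"), ("[30-39]", "5370*")]

    print(single_dimensional)
    print(multidimensional)
    print(local)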

1.4.2 Generalization vs. Suppression

Many of the proposed (local and global) recoding techniques only consider suppressing data items in their entirety, but others consider generalizing values through some number of intermediate states. This distinction is somewhat artificial because suppression can be viewed as an extreme form of generalization. Nonetheless, the two have often been treated separately in the literature.

1.4.3 Hierarchy-Based vs. Partition-Based Recoding

Generalization-based recoding techniques can be further categorized into two main sub-groups: First, there are techniques that use a fixed user-defined "generalization hierarchy" or taxonomy to define the set of possible generalizations. Alternatively, there are techniques that view the domain of each attribute as a totally-ordered set, and define generalizations by partitioning the set into ranges [14, 47]. Often, partitioning is most suitable for continuous or numeric data, and hierarchy-based techniques are favorable for categorical values. As we will show in Chapter 3, the two can often be used simultaneously for different attributes.

1.5 Thesis Contributions and Organization

This thesis provides an extensive study of anonymization techniques for published microdata, focusing particular attention on the issues of data quality, scalability, and performance. The thesis is organized more or less chronologically, and the main technical contributions are described in four chapters.

Chapter 2 describes our first algorithm (Incognito), which efficiently anonymizes data according to a particular hierarchy-based technique, which we term full-domain generalization. The algorithm, which was first described in [53], is sound and complete according to this recoding technique. That is, it finds all full-domain generalizations that satisfy the given anonymity requirements, and for this reason, the algorithm is independent of any particular measure of data quality.

Following the Incognito work, we delved more deeply into the problem of understanding and measuring data quality. There are many ways to define data quality, utility, and anonymization optimality [14, 38, 47, 50, 56, 62, 78, 82, 83, 88]. However, a precise definition that is suitable to all applications remains elusive. In Chapter 3, we present a multidimensional recoding scheme, as well as a fast greedy partition-based anonymization algorithm (Mondrian), which implements k-anonymity, ℓ-diversity, and variance diversity. Initially, we limited the evaluation of data quality to a set of simple general-purpose measures. The theoretical and empirical studies in this chapter indicate the advantages of multidimensional recoding with respect to data quality. Specifically, we show that Mondrian often produces higher-quality data than even optimal single-dimensional algorithms, such as Incognito and others. These results were initially presented in [55].

Chapter 4 expands the notion of data quality based on the observation that quality is subjective. Thus, we look to a target workload of queries and data mining tasks that are to be carried out using the released data. In this chapter, we describe extensions to the basic Mondrian algorithm by which these target tasks can be incorporated into the anonymization process, as originally described in [56].

While the Mondrian algorithms are substantially more efficient than previous (exhaustive search) algorithms, such as Incognito, they require some modifications and extensions in order to be applied to data sets much larger than main memory. Chapter 5 describes two such extensions (collectively called Rothko). The first of these extensions is based on ideas from scalable decision tree construction, and the second is based on sampling. In both cases, the results are guaranteed to satisfy the given privacy requirements. This work was originally described in [54].

Finally, the thesis concludes with a survey of related work in Chapter 6, and discussion in Chapter 7.

Page 34: ANONYMITY IN DATA PUBLISHING AND DISTRIBUTION by …pages.cs.wisc.edu/~lefevre/LeFevre_Dissertation.pdfANONYMITY IN DATA PUBLISHING AND DISTRIBUTION by Kristen Riedt LeFevre A dissertation

17

Chapter 2

Incognito: Efficient Full-Domain Generalization

incognito (in-cog-ni-to)
Function: adverb or adjective
Etymology: Italian, from Latin incognitus
: with one's identity concealed – Merriam-Webster Dictionary

This chapter describes an optimized search-based algorithmic approach called Incognito [53]. Incognito was originally designed to implement k-anonymity via the full-domain generalization recoding technique, a single-dimensional hierarchy-based approach proposed in the original work by Samarati and Sweeney [78, 83]. However, in subsequent work, Machanavajjhala et al. showed that Incognito is easily extended to implement any anonymity requirement satisfying the monotonicity property [60], and this chapter is written to reflect this additional generality.

In contrast to previous full-domain anonymization algorithms, Incognito is sound and complete for finding safe (k-anonymous, ℓ-diverse, variance diverse) full-domain generalizations. In addition, we found that, at the time, the Incognito algorithms yielded runtime performance of up to an order of magnitude better than previous approaches on several real-world data sets.


Figure 2.1 Domain and value generalization hierarchies:
(a) Zipcode domains: Z0 = {53715, 53710, 53706, 53703}, Z1 = {5371*, 5370*}, Z2 = {537**}
(b) Birth Date domains: B0 = {1/21/76, 2/28/76, 4/13/86}, B1 = {*}
(c) Sex domains: S0 = {Male, Female}, S1 = {Person}
(d) Zipcode value hierarchy: 53715, 53710 → 5371*; 53706, 53703 → 5370*; 5371*, 5370* → 537**
(e) Birth Date value hierarchy: 1/21/76, 2/28/76, 4/13/86 → *
(f) Sex value hierarchy: Male, Female → Person

Figure 2.2 2-Attribute generalization lattice:
(a) Domain generalization lattice over ⟨Sex, Zipcode⟩, from bottom to top: ⟨S0, Z0⟩; ⟨S1, Z0⟩ and ⟨S0, Z1⟩; ⟨S1, Z1⟩ and ⟨S0, Z2⟩; ⟨S1, Z2⟩
(b) Corresponding lattice of distance vectors: [0,0]; [1,0] and [0,1]; [1,1] and [0,2]; [1,2]


2.1 Full-Domain Generalization

This chapter considers a specific hierarchy-based single-dimensional global recoding technique, which we term full-domain generalization. This approach was proposed in the original work by Samarati and Sweeney [78].1

Given a domain D, it is possible to construct a more general (semantically consistent) domain in a variety of ways. For example, the Zipcode domain can be generalized by dropping the least significant digit, and continuous domains can be divided into ranges. We use <D to denote this domain generalization relationship. For two domains Di and Dj, the relationship Di <D Dj indicates that the values in Dj are the generalizations of the values in Di. More precisely, a many-to-one value generalization function γ : Di → Dj is associated with each domain generalization Di <D Dj.

A domain generalization hierarchy is defined to be a set of domains that is totally ordered by the relationship <D. The hierarchy can be viewed as a chain of nodes, and if there is an edge from Di to Dj, we call Dj the direct generalization of Di. Note that the generalization relationship is transitive, and thus, if Di <D Dj and Dj <D Dk, then Di <D Dk. In this case, we call domain Dk an implied generalization of Di. Paths in a domain hierarchy chain correspond to implied generalizations, and edges correspond to direct generalizations. Figure 2.1(a,b,c) shows possible domain generalization hierarchies for the Zipcode, Birthdate, and Sex attributes.

We use the notation γ+ as shorthand for the composition of one or more value generalization functions, producing the direct and implied value generalizations. The value-generalization functions associated with a domain generalization hierarchy induce a corresponding value-level tree, in which edges are defined by γ and paths are defined by γ+. To illustrate, Figure 2.1(d,e,f) associates a value generalization with each value in the Zipcode, Birthdate, and Sex domains. For example, Figure 2.1(d) indicates that 5371* = γ(53715) and 537** ∈ γ+(53715).

1 In addition, the idea of a tuple-suppression threshold was suggested by Samarati and Sweeney [78], and can be incorporated as a simple extension of full-domain generalization. The idea is that there are some number of records in R that can be considered outliers. For this reason, up to a certain number of records (the maximum suppression threshold) may be completely excluded from R*. Under this combined scheme, R* is obtained through full-domain generalization, with selected outlier tuples removed entirely.
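To make the hierarchy machinery concrete, the following is a minimal sketch (illustrative only, not code from the thesis) of the Zipcode value generalization hierarchy of Figure 2.1(d), with γ represented as a one-level parent map and γ+ as the set of its direct and implied generalizations:

# A small illustrative sketch of the Zipcode hierarchy of Figure 2.1(d).
gamma = {
    "53715": "5371*", "53710": "5371*",
    "53706": "5370*", "53703": "5370*",
    "5371*": "537**", "5370*": "537**",
}

def gamma_plus(value):
    """Return the set of direct and implied generalizations of value."""
    result = set()
    while value in gamma:
        value = gamma[value]
        result.add(value)
    return result

assert gamma["53715"] == "5371*"
assert gamma_plus("53715") == {"5371*", "537**"}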


Full-domain generalization maps the entire domain of each quasi-identifier attribute to a more general domain in its domain generalization hierarchy. This scheme guarantees that all values of a particular attribute in R* belong to the same level of the value generalization hierarchy.

Definition 2.1 (Full-Domain Generalization) Let R be a relation with quasi-identifier attributes Q1, ..., Qd. A full-domain generalization is defined by a set of functions φ1, ..., φd, each of the form φi : DQi → D'Qi, where DQi ≤D D'Qi. φi maps each value q ∈ DQi to some q' ∈ D'Qi such that q' = q or q' ∈ γ+(q). R* is obtained by applying each φi to the values of Qi in each tuple of R.
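As a rough illustration (a sketch under assumed data structures, with rows as Python dicts; this is not the thesis's implementation), a full-domain generalization can be applied by mapping every value of each quasi-identifier attribute through its chosen hierarchy level:

def full_domain_generalize(rows, phi):
    # rows: list of dicts, one per tuple of R.
    # phi: attribute name -> value map sending each domain value to one level.
    generalized = []
    for row in rows:
        new_row = dict(row)
        for attr, mapping in phi.items():
            new_row[attr] = mapping[row[attr]]   # e.g. "53715" -> "5371*"
        generalized.append(new_row)
    return generalized

# Example: recode Zipcode to level Z1 while leaving other attributes intact.
phi = {"Zipcode": {"53715": "5371*", "53710": "5371*",
                   "53706": "5370*", "53703": "5370*"}}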

For a quasi-identifier consisting of multiple attributes, the domain generalization hierarchies

of the individual attributes combine to form a lattice. An example lattice for Sex and Zipcode is

shown in Figure 2.2 (a).

Formally, consider a vector of d domains with corresponding generalization hierarchies H1, ..., Hd. A vector of d domains 〈DB1, ..., DBd〉 is said to be a direct multi-attribute domain generalization (also denoted <D) of another vector of d domains 〈DA1, ..., DAd〉 if the following conditions hold:

1. There exists a single value j in 1, ..., d such that domain hierarchy Hj contains the edge DAj → DBj (i.e., DBj is a direct domain generalization of DAj).

2. For all other i in 1, ..., d, i.e., for i ≠ j, DAi = DBi.

A multi-attribute generalization lattice over d single-attribute domain generalization hierarchies is a complete lattice of d-vectors of domains in which

1. Each edge is a direct multi-attribute domain generalization relationship.

2. The bottom element is the d-vector 〈DA1, ..., DAd〉 where, for all i, DAi is the source of the hierarchy chain Hi (i.e., the most specific domain associated with domain hierarchy Hi).

3. The top element is the d-vector 〈DA1, ..., DAd〉 where, for all i, DAi is the sink of the hierarchy chain Hi (i.e., the most general domain associated with domain hierarchy Hi).


In the example lattice shown in Figure 2.2, the domain vector 〈S0, Z2〉 is a direct multi-attribute generalization of 〈S0, Z1〉 and an implied multi-attribute generalization of 〈S0, Z0〉.

From the generalization lattice, a lattice of distance vectors can be derived. The distance vector between two domain vectors 〈DA1, ..., DAn〉 and 〈DB1, ..., DBn〉 is a vector DV = [d1, ..., dn], where each value di denotes the length of the path between the domain DAi and the domain DBi in domain generalization hierarchy Hi. A lattice of distance vectors can be defined from the zero-generalization domain vector. This lattice for Sex and Zipcode is given in Figure 2.2(b). The height of a multi-attribute generalization is the sum of the values in the corresponding distance vector. For example, the height of 〈S1, Z1〉 is 2.

2.2 Previous Algorithms for Full-Domain Generalization

As mentioned in the introduction, there are a number of ways to define data quality and anonymization optimality, a recurring theme throughout this thesis. Prior to this work, one notion of optimality was defined in [78] using the distance vector of the domain generalization. Informally, this definition says that a full-domain generalization is optimal if the resulting R* is safe (satisfies all given anonymity requirements), and the height of the resulting generalization is less than or equal to that of any other safe full-domain generalization.2

However, in many cases, it is likely that users would want the flexibility to introduce their own notions of quality. For example, it might be more important in some applications that the Sex attribute be released intact, even if this means suppressing more digits of Zipcode. The previous definition does not allow this flexibility.

Prior to Incognito, several search algorithms had been proposed for full-domain generalization, each with accompanying guarantees about the optimality of the resulting anonymization. In [78], Samarati described an algorithm for finding a single k-anonymous full-domain generalization of minimum height. The algorithm is based on the observation that if no generalization of height h satisfies k-anonymity, then no generalization of height h' < h will satisfy k-anonymity. (The algorithm is easily extended to any monotone anonymity requirement based on similar observations.)

2 By this definition, there may exist more than one optimal generalization.


For this reason, Samarati's algorithm performs a binary search on the height value. If the maximum height in the generalization lattice is h, it begins by checking each generalization at height ⌊h/2⌋. If a generalization exists at this height that satisfies the given anonymity requirement(s), the search proceeds to look at the generalizations of height ⌊h/4⌋. Otherwise, it searches the generalizations of height ⌊3h/4⌋, and so forth. Proceeding in this way, the algorithm finds one safe full-domain generalization that is optimal according to this very specific definition.
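The following is a minimal sketch of this binary search over lattice heights (illustrative only; nodes_at_height and is_safe are hypothetical helpers that enumerate the generalizations at a given height and check the anonymity requirement, respectively):

def binary_search_minimum_height(max_height, nodes_at_height, is_safe):
    # Relies on monotonicity: if no generalization at height h is safe,
    # then no generalization at any lower height is safe either.
    lo, hi = 0, max_height
    best = None                       # lowest-height safe generalization found
    while lo <= hi:
        mid = (lo + hi) // 2
        safe_here = [n for n in nodes_at_height(mid) if is_safe(n)]
        if safe_here:
            best = safe_here[0]       # safe at this height; try a lower one
            hi = mid - 1
        else:
            lo = mid + 1              # nothing safe here; must search higher
    return best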

For arbitrary definitions of quality, this binary search algorithm is not always guaranteed to find the optimal generalization. Instead, a naive bottom-up breadth-first search of the generalization lattice could be used. This algorithm uses the multi-attribute generalization lattice for the domains of the quasi-identifier attributes. Starting with the least general domain at the root of the lattice, the algorithm performs a breadth-first search, checking whether each generalization satisfies the given anonymity requirement(s). For monotonic anonymity requirements, this algorithm can be used to find the set of all safe full-domain generalizations.

2.3 Incognito Algorithm

We noticed a number of convincing parallels between Samarati and Sweeney's generalization framework [78] and ideas used in managing multi-dimensional data [21, 44] and mining association rules [10, 81]. By bringing these techniques to bear on the anonymization problem, we developed a core algorithm (as well as several variations) that is often substantially more efficient than previous algorithms. Throughout this section, we use the sample Hospital Patients data in Table 2.1 as a running example to illustrate the algorithms.

There are a number of marked similarities between hierarchy-based generalization and multi-dimensional data models. It is easy to think of each domain generalization hierarchy as a dimension. For this reason, it is reasonable to think of the base relation R, and the domain generalization hierarchies associated with the quasi-identifier attributes of R, as a relational star-schema. For example, the star schema for the quasi-identifier 〈Birthdate, Sex, Zipcode〉 is given in Figure 2.3. A full-domain generalization is produced by joining R with its dimension tables, and projecting the appropriate domain attributes.


Birthdate Sex Zipcode Disease

1/21/76 Male 53715 Flu

4/13/86 Female 53715 Hepatitis

2/28/76 Male 53703 Bronchitis

1/21/76 Male 53703 Broken Arm

4/13/86 Female 53706 AIDS

2/28/76 Female 53706 Hang Nail

Table 2.1 Hospital Patients (Incognito running example)

For simplicity, if A is some attribute in the quasi-identifier and A <D A1, we will refer to attribute A1, which is produced by joining R with the dimension table of A, and projecting A1.

k-Anonymity, ℓ-diversity, and variance diversity can be checked by way of a set of count aggregates, which we will call the count histogram. (Under k-anonymity, these counts are computed over the quasi-identifier attributes; the sensitive attribute is also included for ℓ-diversity and variance diversity.)3 We say that R is safe with respect to quasi-identifier attributes Q1, ..., Qd under anonymity requirement ρ if the projection of R on Q1, ..., Qd (and sensitive attribute S) satisfies ρ.

Definition 2.2 (Count Histogram) The count histogram of relation R with respect to attributes Q1, ..., Qd (, S) is a mapping from each unique tuple 〈q1, ..., qd (, s)〉 in the projection of R on Q1, ..., Qd (, S) to the total number of tuples in R with these values.
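As a rough sketch (in-memory Python rather than the SQL GROUP BY query used in the implementation; the helper names are assumptions, and ℓ-diversity is given a simple distinct-values reading), the count histogram and the checks built on it might look as follows:

from collections import Counter

def count_histogram(rows, qi_attrs, sensitive=None):
    # One count per unique combination of quasi-identifier (and sensitive) values.
    attrs = list(qi_attrs) + ([sensitive] if sensitive else [])
    return Counter(tuple(row[a] for a in attrs) for row in rows)

def is_k_anonymous(rows, qi_attrs, k):
    # Every non-empty quasi-identifier group must contain at least k tuples.
    return all(c >= k for c in count_histogram(rows, qi_attrs).values())

def is_l_diverse(rows, qi_attrs, sensitive, l):
    # Every quasi-identifier group must contain at least l distinct sensitive values.
    groups = {}
    for key in count_histogram(rows, qi_attrs, sensitive):
        groups.setdefault(key[:-1], set()).add(key[-1])
    return all(len(vals) >= l for vals in groups.values())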

Three properties of these generalization dimensions and count histograms play a key role in our algorithms. The first property follows directly from the many-to-one generalization functions.

Proposition 2.3 (Generalization Property) Let R be a relation, and let P and Q each be sets of d attributes in R such that DP <D DQ. Let ρ be a monotone anonymity requirement. If R is safe with respect to P under ρ, then R is also safe with respect to Q under ρ.

3 In SQL, the count histogram is obtained by issuing a COUNT(*) query over R*, with Q1, ..., Qd (, S) as the attribute list in the GROUP BY clause.


Figure 2.3 Star-schema defining generalization dimensions (the Patients table joined to Birth Date, Zipcode, and Sex dimension tables containing levels B0-B1, Z0-Z2, and S0-S1).

For example, because the Patients data in Table 2.1 is 2-anonymous with respect to 〈S0〉, it must also be 2-anonymous with respect to 〈S1〉, a generalization of S0.

The second property is reminiscent of operations along dimension hierarchies in OLAP processing.

Proposition 2.4 (Rollup Property) Let R be a relation, and let P and Q each be sets of n attributes such that DP ≤D DQ. If we have f1, the count histogram of R with respect to P, then we can generate each count in f2, the count histogram of R with respect to Q, by summing the set of counts in f1 associated by γ with each value set of f2.

For example, consider F1, the relational representation of the count histogram of the Patients table from Table 2.1 with respect to 〈Birthdate, Sex, Zipcode〉. Recall that in SQL the count histogram is computed by a COUNT(*) query with Birthdate, Sex, Zipcode as the GROUP BY clause. The count histogram (F2) of Patients with respect to 〈Birthdate, Sex, Z1〉 can be produced by joining F1 with the Zipcode dimension table, and issuing a SUM(count) query with Birthdate, Sex, Z1 as the GROUP BY clause.
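In memory, the same rollup can be sketched as follows (illustrative only; the histogram is assumed to be a Counter keyed by quasi-identifier tuples, and gamma is the one-level value generalization map):

from collections import Counter

def rollup(hist, attr_index, gamma):
    # Generalize one attribute position and sum the counts that map together.
    rolled = Counter()
    for key, count in hist.items():
        new_key = key[:attr_index] + (gamma[key[attr_index]],) + key[attr_index + 1:]
        rolled[new_key] += count
    return rolled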

We also noticed a close connection with the a priori observation, a dynamic programming approach that formed the basis for a number of algorithms for mining frequent itemsets [10, 81]. This observation is easily applied to full-domain anonymization by way of the subset property.


Proposition 2.5 (Subset Property) Let R be a relation, and let Q be a set of attributes in R. Let ρ be a monotone anonymity requirement. If R is safe with respect to Q under ρ, then R is safe with respect to any P ⊆ Q under ρ.

For example, Patients (Table 2.1) is 2-anonymous with respect to 〈S1, Zipcode〉. Based on the subset property, we know that Patients must also be 2-anonymous with respect to both 〈Zipcode〉 and 〈S1〉. Similarly, we noted that Patients is not 2-anonymous with respect to 〈Sex, Zipcode〉. Based on this observation and the subset property, we know that Patients must not be 2-anonymous with respect to 〈Birthdate, Sex, Zipcode〉.

2.3.1 Basic Incognito Algorithm

The Incognito algorithm generates the set of all full-domain generalizations of R that are safe under a monotone anonymity requirement ρ. Based on the subset property, the algorithm begins by checking single-attribute subsets of the quasi-identifier, and then iterates, checking safety with respect to increasingly large subsets, in a manner reminiscent of [10, 81]. Each iteration consists of two main parts (the basic algorithm is given in Figure 2.5):

1. Each iteration considers a graph of candidate multi-attribute generalizations (nodes) constructed from a subset of the quasi-identifier of size i. We denote the set of candidate nodes Ci. The set of direct multi-attribute generalization relationships (edges) connecting these nodes is denoted Ei. A modified breadth-first search over the graph yields the set of multi-attribute generalizations of size i with respect to which R is safe (denoted Pi).

2. After obtaining Pi, the algorithm constructs the set of candidate nodes of size i + 1 (denoted Ci+1), and the edges connecting them (denoted Ei+1) using the subset property.

2.3.1.1 Breadth-First Search

The ith iteration of Incognito performs a search that determines the safety of relation R with respect to all candidate generalizations in Ci. This is accomplished using a modified bottom-up


breadth-first search, beginning at each node in the graph that is not the direct generalization of some other node, with the optimization of bottom-up aggregation based on the rollup property. The breadth-first search also makes use of the generalization property. If a safe generalization (node) is encountered, we are guaranteed by the generalization property that all of its generalizations must also be safe. For this reason, when a node is found to be safe, all of its direct generalizations are marked, and not checked in subsequent iterations of the search.

Example 2.6 Consider again the Patients data in Table 2.1 with quasi-identifier 〈Birthdate, Sex, Zipcode〉. The first iteration of Incognito finds that Patients is k-anonymous with respect to 〈B0〉, 〈S0〉, and 〈Z0〉, the un-generalized domains of Birthdate, Sex, and Zipcode, respectively. The second iteration performs three breadth-first searches to determine the k-anonymity status of Patients with respect to the multi-attribute generalizations of 〈Birthdate, Sex〉, 〈Birthdate, Zipcode〉, and 〈Sex, Zipcode〉. Figure 2.4 shows these searches. For 〈Sex, Zipcode〉, the algorithm first generates the count histogram of Patients with respect to 〈S0, Z0〉, and finds that 2-anonymity is not satisfied. It then rolls up this count histogram to generate the count histograms with respect to 〈S1, Z0〉 and 〈S0, Z1〉, and uses these results to check k-anonymity. In this example, Patients is 2-anonymous with respect to 〈S1, Z0〉. Therefore, all generalizations of 〈S1, Z0〉 (i.e., 〈S1, Z1〉 and 〈S1, Z2〉) must also be 2-anonymous given the generalization property, so they are marked. Patients is not 2-anonymous with respect to 〈S0, Z1〉, so the algorithm then checks the 2-anonymity status of only 〈S0, Z2〉. Finding that Patients is 2-anonymous with respect to this attribute set, the search is complete.

Lemma 2.7 Let ρ be a monotone anonymity requirement. The breadth-first search of the graph defined by Ci and Ei determines the safety of R with respect to all i-attribute generalizations in Ci, under ρ.

Proof During the breadth-first search, the safety of each node n is determined in one of two ways. Either the count histogram of R with respect to n is computed (and safety checked), or n is the (direct or implied) multi-attribute generalization of some safe node. In this case, n must be safe following the generalization property.


Figure 2.4 Searching candidate 2-attribute generalization graphs (the breadth-first searches over the 〈Birthdate, Sex〉, 〈Birthdate, Zipcode〉, and 〈Sex, Zipcode〉 lattices described in Example 2.6).


Algorithm: Incognito(relation R, quasi-identifiers {Q1, ..., Qd}, monotone anonymity requirement ρ)

C1 = {nodes in the domain generalization hierarchies of Q1, ..., Qd}
E1 = {edges in the domain generalization hierarchies of Q1, ..., Qd}
queue = {}
for i = 1 to d
    Pi = copy of Ci
    {roots} = {all nodes ∈ Ci with no edge ∈ Ei directed to them}
    Insert {roots} into queue, keeping queue sorted by height
    while queue ≠ {}
        node = remove first item from queue
        if node is not marked
            if node is a root
                countHist = compute count histogram of R wrt attributes of node using R
            else
                countHist = compute count histogram of R wrt node using parent's count histogram
            safe = use countHist to check whether R is safe with respect to node under ρ
            if safe
                Mark all direct generalizations of node
            else
                Delete node from Pi
                Insert direct generalizations of node into queue, keeping queue ordered by height
    Ci+1, Ei+1 = GraphGeneration(Pi, Ei)
return projection of attributes of Pd onto R and dimension tables

Figure 2.5 Algorithm: Basic Incognito


2.3.1.2 Graph Generation

We implemented each multi-attribute generalization graph as two relational tables: one for the nodes and one for the edges. Figure 2.6 shows the relational representation of the lattice depicted in Figure 2.2(a). Notice that each node is assigned a unique identifier (ID).

The graph generation component consists of three main phases. First, we have a join phase and a prune phase for generating the set of candidate nodes Ci with respect to which R could potentially be safe given previous iterations; these phases are similar to those described in [10]. The final phase is edge generation, through which the direct multi-attribute generalization relationships among candidate nodes are constructed.

The join phase creates a superset of Ci based on Pi-1. The join query is as follows, and assumes some arbitrary ordering assigned to the dimensions. As in [10], this ordering is intended purely to avoid generating duplicates.

INSERT INTO Ci (dim1, index1, ..., dimi, indexi, parent1, parent2)
SELECT p.dim1, p.index1, ..., p.dimi-1, p.indexi-1,
       q.dimi-1, q.indexi-1, p.ID, q.ID
FROM Pi-1 p, Pi-1 q
WHERE p.dim1 = q.dim1 AND p.index1 = q.index1 AND ...
  AND p.dimi-2 = q.dimi-2 AND p.indexi-2 = q.indexi-2
  AND p.dimi-1 < q.dimi-1

The result of the join phase may include some nodes with subsets not in Pi-1, and during the prune phase, we use a hash tree structure similar to that described in [10] to remove these nodes from Ci.

Once Ci has been determined, it is necessary to construct the set of edges connecting the nodes (Ei). Notice that during the join phase we tracked the unique IDs of the two nodes in Ci-1 that were combined to produce each node in Ci (parent1 and parent2).


ID   dim1   index1   dim2      index2
1    Sex    0        Zipcode   0
2    Sex    1        Zipcode   0
3    Sex    0        Zipcode   1
4    Sex    1        Zipcode   1
5    Sex    0        Zipcode   2
6    Sex    1        Zipcode   2
(a) Nodes

Start   End
1       2
1       3
2       4
3       4
3       5
4       6
5       6
(b) Edges

Figure 2.6 Relational representation of generalization lattice


Ei is constructed using Ci and Ei-1 based on some simple observations. Consider two nodes A and B ∈ Ci. We observe that if there exists a generalization relationship between the first parent of A and the first parent of B, and the second parent of B is either equal to or a generalization of the second parent of A, then B is a generalization of A. In some cases, the resulting generalization relationships may be implied, but they may only be separated by a single node. We remove these implied generalization relationships explicitly from the set of edges. The edge generation process can be expressed in SQL as follows:

INSERT INTO Ei (start, end)
WITH CandidateEdges (start, end) AS (
    SELECT p.ID, q.ID
    FROM Ci p, Ci q, Ei-1 e, Ei-1 f
    WHERE (e.start = p.parent1 AND e.end = q.parent1
           AND f.start = p.parent2 AND f.end = q.parent2)
       OR (e.start = p.parent1 AND e.end = q.parent1
           AND p.parent2 = q.parent2)
       OR (e.start = p.parent2 AND e.end = q.parent2
           AND p.parent1 = q.parent1)
)
SELECT D.start, D.end
FROM CandidateEdges D
EXCEPT
SELECT D1.start, D2.end
FROM CandidateEdges D1, CandidateEdges D2
WHERE D1.end = D2.start

Example 2.8 Consider again the Patients data in Table 2.1 with quasi-identifier 〈Birthdate, Sex, Zipcode〉. Suppose the results of the second-iteration graph search are those shown in the final steps of Figure 2.4 (a, b, c).


Figure 2.7 3-Attribute graph (generated following 2-attribute search in Figure 2.4): nodes 〈B1, S1, Z0〉, 〈B1, S1, Z1〉, 〈B1, S0, Z2〉, 〈B0, S1, Z2〉, and 〈B1, S1, Z2〉.

Figure 2.8 3-Attribute lattice without a priori pruning (all twelve domain vectors from 〈B0, S0, Z0〉 up to 〈B1, S1, Z2〉).


Figure 2.7 shows the 3-attribute graph resulting from the join, prune, and edge generation procedures applied to the 2-attribute graphs. In many cases, the resulting graph is smaller than the lattice that would have been produced without a priori pruning. For example, see Figure 2.8.

2.3.2 Soundness and Completeness

As mentioned previously, Incognito generates the set of all safe full-domain generalizations. For example, consider the generalization lattice in Figure 2.7. If relation R is k-anonymous with respect to 〈B1, S1, Z0〉, then this generalization will be among those produced as the result of Incognito. If R is not k-anonymous with respect to this generalization, then it will not be in the result set. In this section, we prove soundness and completeness.

Theorem 2.9 Basic Incognito is sound and complete for producing safe full-domain generalizations under any monotone anonymity requirement ρ.

Proof Consider a relation R and its quasi-identifier attribute set Q. Let Q' denote the set of multi-attribute domain generalizations of Q. Incognito determines the safety of each generalization q in Q' in one of three ways:

1. The safety of R with respect to some subset of q is checked explicitly and found to be unsafe. In this case, it follows from the subset property that R is not safe with respect to q.

2. The safety of R with respect to some quasi-identifier subset p is checked, and found to be unsafe, and some subset of q is a (multi-attribute) generalization of p. In this case, we know based on the generalization and subset properties that q is not safe.

3. Generalization q is checked explicitly, and it is determined that R is safe with respect to q.

Soundness and completeness are a key distinction between Incognito and Samarati's binary search algorithm [78]. Incognito will find all safe full-domain generalizations, from which the "optimal" may be chosen according to any criterion. The binary search is guaranteed to find only


a single safe full-domain generalization, which is optimal only according to the specific definition described in Section 2.1. Bottom-up breadth-first search is also sound and complete if run exhaustively.

2.3.3 Algorithm Optimizations

2.3.3.1 Super-roots

A candidate node n in Ci is a "root" if there is no generalization edge in Ei directed from another node in Ci to n. During each iteration of Incognito, the database is scanned once per root to generate its count histogram. Because of the a priori pruning optimization, however, we are not guaranteed that the candidate nodes at each iteration will form lattices, so some of these roots might come from the same "family" (generalizations of the same quasi-identifier subset). In this case, we observed that it was more efficient to group roots according to family, and then scan the database once, generating the count histogram corresponding to the greatest lower bound of each group (the "super-root"), from which the roots' histograms can then be derived by rollup. We refer to the basic algorithm, augmented by this optimization, as Super-roots Incognito.

For example, in Figure 2.7, 〈B1, S1, Z0〉, 〈B1, S0, Z2〉, and 〈B0, S1, Z2〉 are all roots of the 3-attribute graph, but they come from the same family. Rather than scanning the database once for each of these roots, Super-roots Incognito would first compute the count histogram of Patients with respect to 〈B0, S0, Z0〉, and would then use this to compute the count histogram for each of these roots.
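A sketch of how such a super-root could be chosen (illustrative only; domain vectors are represented as tuples of generalization levels, and the super-root is their componentwise minimum):

def super_root(roots):
    # roots: list of domain vectors, e.g. (1, 1, 0) for <B1, S1, Z0>.
    # The componentwise minimum is the most specific vector generalized by
    # every root, so one scan at this vector suffices and each root's count
    # histogram then follows by rollup.
    return tuple(min(levels) for levels in zip(*roots))

assert super_root([(1, 1, 0), (1, 0, 2), (0, 1, 2)]) == (0, 0, 0)   # <B0, S0, Z0>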

2.3.3.2 Bottom-Up Pre-computation

Even Super-roots Incognito scans R once per subset of the quasi-identifier in order to generate the necessary frequency sets. This drawback is fundamental to the a priori optimization, where single-attribute subsets are processed first. For example, we are not able to use the count histogram of R with respect to 〈Zipcode〉 to generate the count histogram of R with respect to 〈Sex, Zipcode〉. On the other hand, in the context of computing the data cube, these group-by queries would be processed in the opposite order [44], and rather than re-scanning the database, we could compute


the count histogram of R with respect to 〈Zipcode〉 by simply rolling up the count histogram with respect to 〈Sex, Zipcode〉.

Based on this observation, and with the hope of seeing the benefits of both optimizations, we considered first generating the count histograms of R with respect to all subsets of the quasi-identifier at the lowest level of generalization. These count histograms can be computed using bottom-up aggregation, similar to that used for computing the data cube. When Incognito is run, it processes the smallest subsets first, as before, but these zero-level count histograms can be used on each iteration instead of scanning the entire database. We refer to the basic algorithm that first pre-computes the zero-generalization count histograms as Cube Incognito.

2.4 Experimental Performance Evaluation

To assess the performance of Incognito, we performed a number of experiments using real-world data. We compared (Basic, Super-roots, and Cube) Incognito to previous optimal full-domain generalization algorithms, including Samarati's binary search [78] and naive bottom-up search (without rollup), described in Section 2.2. All of the experiments in this section are based on k-anonymity as the exclusive anonymity requirement.

In addition, we also adapted the naive bottom-up algorithm to use bottom-up aggregation ("rollup") along the generalization dimensions. The search pattern is also breadth-first, but for nodes other than the root, the count histogram is computed based on the count histogram of (one of) the nodes of which the node is a generalization. Both bottom-up variations were run exhaustively to produce all k-anonymous generalizations. Throughout our experiments, we found that the Incognito algorithms uniformly outperformed the others.

At the time, this was the largest-scale performance experiment that had been done for k-anonymity. (In Chapter 5, we present a larger performance experiment for the later Mondrian and Rothko algorithms.) No experimental evaluation was provided for binary search [78]. The genetic algorithm in [47] was evaluated using a somewhat larger database, but this algorithm does not guarantee optimality.


Attribute            Distinct Values   Generalizations
1  Age               74                5-, 10-, 20-year ranges (4)
2  Gender            2                 Suppression (1)
3  Race              5                 Suppression (1)
4  Marital Status    7                 Taxonomy tree (2)
5  Education         16                Taxonomy tree (3)
6  Native country    41                Taxonomy tree (2)
7  Work Class        7                 Taxonomy tree (2)
8  Occupation        14                Taxonomy tree (2)
9  Salary class      2                 Suppression (1)

Table 2.2 Experimental Data Description: Adults

Attribute        Distinct Values   Generalizations
1  Zipcode       31953             Round each digit (5)
2  Order date    320               Taxonomy tree (3)
3  Gender        2                 Suppression (1)
4  Style         1509              Suppression (1)
5  Price         346               Round each digit (4)
6  Quantity      1                 Suppression (1)
7  Cost          1412              Round each digit (4)
8  Shipment      2                 Suppression (1)

Table 2.3 Experimental Data Description: Lands End


2.4.1 Experimental Data and Setup

We evaluated the Incognito algorithms using two databases. The first was based on the Adults database from the UC Irvine Machine Learning Repository [17], which comprises data from the US Census. We used a configuration similar to that in [47], using nine of the attributes, all of which were considered as part of the quasi-identifier, and eliminating records with unknown values. The resulting table contained 45,222 records (5.5 MB). The second database was much larger, and contained point-of-sale information from Lands End Corporation. The database schema included eight quasi-identifier attributes, and the database contained approximately 4.6 million records (268 MB).

The experimental databases are described in Tables 2.2 and 2.3, which list the number of unique values for each attribute and give a brief description of the associated generalizations. In some cases, these were based on a categorical taxonomy tree, and in other cases they were based on rounding numeric values or simple suppression. The height of each domain generalization hierarchy is listed in parentheses. We implemented the generalization dimensions as a relational star-schema, materializing the value generalizations in the dimension tables.

We implemented the three Incognito variations, Samarati's binary search4, and the two variations of bottom-up breadth-first search as a Java application running atop IBM DB2. All count histograms were implemented as un-logged temporary tables. All experiments were run on a dual-processor AMD Athlon 1.5 GHz machine with 2 GB physical memory. The software included Microsoft Windows Server 2003 and DB2 Enterprise Server Edition Version 8.1.2. The buffer pool size was set to 256 MB. Because of the computational intensity of the algorithms, each experiment was run 2-3 times, flushing the buffer pool and system memory between runs. We report the average cold performance numbers, and the numbers were quite consistent across runs.

4 We implemented the k-anonymity check as a group-by query over the star schema. Samarati suggests an alternative approach whereby a matrix of distance vectors is constructed between unique tuples [78]. However, we found constructing this matrix prohibitively expensive for large databases.


Figure 2.9 Incognito performance evaluation for varied QID size; panels: Lands End (k = 2), Lands End (k = 10), Adults (k = 2), Adults (k = 10).


Figure 2.10 Incognito performance evaluation for varied k; panels: Adults (QID = 8), Lands End (staggered QID).

2.4.2 Experimental Results

The worst-case time complexity of each of the algorithms considered in this chapter, including

Incognito, is exponential in the size of the quasi-identifier. However, we found that in practice the

rollup and a priori optimizations go a long way in improving performance.

Figure 2.9 shows the execution time of Incognito and previous algorithms on the experimental databases for varied quasi-identifier size (k = 2, 10). We began with the first three quasi-identifier attributes from each schema (Tables 2.2 and 2.3), and added additional attributes in the order they appear in these lists. We found that Incognito substantially outperformed Binary Search on both databases, despite the fact that Incognito generates all k-anonymous full-domain generalizations, while Binary Search finds only one.

2.4.2.1 Effects of Rollup and the A Priori Optimization

As mentioned previously, we observed that the bottom-up breadth-first search can be improved by using rollup aggregation, an idea incorporated into the Incognito algorithm. To gauge the effectiveness of this optimization, we compared the version of the bottom-up algorithm with rollup to the version without rollup. Figure 2.9 shows that bottom-up performs substantially better on the Adults database when it takes advantage of rollup.


QID size Bottom-Up Incognito

3 14 14

4 47 35

5 206 103

6 680 246

7 2088 664

8 6366 1778

9 12818 4307

Table 2.4 Total nodes searched by algorithm (k = 2)

We also found that the a priori optimization, the other key component of Incognito, went a long way in helping to prune the space of nodes searched, in turn improving performance. In particular, the number of nodes searched by Incognito was much smaller than the number searched by bottom-up, and the size of the count histograms computed for each of these nodes is generally smaller. For the Adults database, k = 2, and varied quasi-identifier size (QID), the number of nodes searched is shown in Table 2.4.

As the size of k increases, more generalizations are pruned as part of smaller subsets, and the performance of Incognito improves. For example, Figure 2.10 compares Incognito and the other algorithms as k increases, and Incognito trends downward. Because of the search pattern, binary search is more erratic.

2.4.2.2 Effects of Super-Roots

The super-roots optimization was very effective in reducing the Incognito runtime because it substantially reduced access to the original data, instead computing many of the count histograms from other count histograms. By creating a single super-root frequency set (which requires a single scan), in practice we eliminate up to 4 or 5 scans of the data. This performance gain is most pronounced in the larger Lands End database.


Figure 2.11 Cube Incognito performance; panels: Adults (k = 2), Lands End (k = 2).

2.4.2.3 Effects of Pre-computation and Materialization

Figure 2.9 shows the cost of Cube Incognito, which includes both the cost of building the zero-generalization count histograms from the bottom up and the cost of anonymization using these count histograms. Figure 2.11 breaks down this cost. Because the Adults database is small, it is relatively inexpensive to build the zero-generalization count histograms, and the Cube Incognito algorithm beats Basic Incognito. On the larger Lands End database, Cube Incognito is slower than the basic variation. However, the marginal cost of anonymization is lower than that of Basic Incognito once the zero-generalization count histograms have been materialized.

Strategic materialization is an interesting future direction. In many cases, especially when many quasi-identifier attributes are considered, the zero-generalization count histograms can be quite large. Because iterations of Incognito are actually likely to need count histograms at higher levels of generalization, we suspect that materializing count histograms at multiple levels of generalization is likely to provide substantial performance improvement.


2.5 Chapter Summary

In this chapter, we described an efficient search algorithm, called Incognito, for anonymous full-domain generalization. We showed that the algorithm is sound and complete, so the "optimal" anonymization can be chosen according to any (application-specific) criterion. In addition, an experimental performance evaluation indicated that this algorithm is often more efficient than previous exhaustive search algorithms.


Chapter 3

Mondrian: Multidimensional Partitioning

Pieter Cornelis (Piet) Mondrian (March 7, 1872 - February 1, 1944) was a Dutch painter and contributor to the De Stijl art movement. He is best known for his non-representational "compositions," which consist of red, yellow, blue, and black rectangular forms, separated by thick black rectilinear lines.

This chapter introduces, describes, and evaluates the multidimensional partitioning approach to anonymization [55]. While the approach in the previous chapter relied on user-defined generalization hierarchies provided as input, the algorithms in this chapter construct generalization functions dynamically, based on the input data and optional user-supplied constraints.

Using two simple general-purpose measures of data quality, we analyze the partitioning framework theoretically. We prove that finding an optimal partitioning is NP-hard. However, under k-anonymity, the worst-case quality is substantially better than that of single-dimensional approaches.

Following this analysis, we introduce a new greedy anonymization algorithm (Mondrian), which can be used effectively to implement any monotone, bucket-independent anonymity requirement. We observe that in practice this algorithm often produces better (higher-quality) results than even optimal single-dimensional algorithms. Throughout this chapter, we will use the data in Table 3.1 as a running example.

3.1 Multidimensional Partitioning

Recall that both single-dimensional and multidimensional global recoding can be applied to categorical and numeric data. For numeric data, and other totally-ordered domains, single-dimensional


Age Sex Zipcode Disease

25 Male 53711 Flu

25 Female 53712 Hepatitis

26 Male 53711 Bronchitis

27 Male 53710 Broken Arm

27 Female 53712 AIDS

28 Male 53711 Hang Nail

Table 3.1 Hospital Patients (Mondrian running example)

"partitioning" approaches have been proposed [14, 47]. Assume that there is a total order associated with the domain of each quasi-identifier attribute Qi. A single-dimensional interval is defined by a pair of endpoints p, v ∈ DQi such that p ≤ v. (The endpoints may be open or closed.)

Definition 3.1 (Single-Dimensional Partitioning) A single-dimensional partitioning is defined by, for each Qi, a set of non-overlapping single-dimensional intervals that cover DQi. φi maps each qi ∈ DQi to the interval in which it is contained.

This approach can be extended to multidimensional recoding. It is easy to think of the (multiset) projection of R on Q1, ..., Qd as a multiset of points in d-dimensional space. (For example, Figure 3.1(a) shows the spatial representation of Patients with respect to quasi-identifiers Age and Zipcode.) Again, assume a total order for each DQi. A multidimensional region is defined by a pair of d-tuples (p1, ..., pd), (v1, ..., vd) ∈ DQ1 × ... × DQd such that ∀i, pi ≤ vi. Conceptually, each region is bounded by a d-dimensional rectangular box, and we allow each edge and vertex of this box to be either open or closed.1

Definition 3.2 (Multidimensional Partitioning) A multidimensional partitioning is defined by a set of non-overlapping multidimensional regions G1, ..., Gm that cover DQ1 × ... × DQd. φ maps each tuple (q1, ..., qd) ∈ DQ1 × ... × DQd to the region in which it is contained.

1 Non-rectangular regions may also be considered. However, we limit our discussion to rectangular regions because these are most naturally expressed in tabular form.
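A minimal sketch of such a mapping (illustrative only; the example regions over (Age, Zipcode) are hypothetical, and all box boundaries are treated as closed for simplicity even though the definition also allows open edges):

def phi(point, regions):
    # point: a tuple of quasi-identifier values; regions: list of (low, high)
    # pairs of d-tuples describing non-overlapping rectangular boxes.
    for low, high in regions:
        if all(lo <= v <= hi for v, lo, hi in zip(point, low, high)):
            return (low, high)
    return None   # point falls outside every region

regions = [((25, 53710), (26, 53711)),
           ((27, 53710), (28, 53711)),
           ((25, 53712), (27, 53712))]
assert phi((26, 53711), regions) == ((25, 53710), (26, 53711))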


Figure 3.1 Spatial representation of Patients and partitionings (Age vs. Zipcode): (a) Patients, (b) Single-Dimensional, (c) Multidimensional.

When φ is applied to input relation R, this yields data partitions R1, ..., Rm. (The tuple set in each non-empty partition forms an equivalence class in R*.) Sample 2-anonymizations of Patients, using single-dimensional and multidimensional partitioning, are shown in Tables 3.2 and 3.3, respectively. Notice that the anonymization obtained using multidimensional partitioning is not permissible under single-dimensional partitioning because the domains of Age and Zipcode are not recoded to a single set of intervals (e.g., Age 25 is mapped to either [25-26] or [25-27], depending on the values of Zipcode and Sex). However, the single-dimensional recoding is valid under the definition of multidimensional partitioning.

Proposition 3.3 Every single-dimensional partitioning for quasi-identifier attributes Q1, ..., Qd can be equivalently expressed as a multidimensional partitioning. However, when d ≥ 2 and ∀i, |DQi| ≥ 2, there exists a multidimensional partitioning that cannot be expressed as a single-dimensional partitioning.

When one or more of the quasi-identifiers is categorical, single-dimensional and multidimensional partitioning can be further constrained by user-defined generalization hierarchies like those described by Samarati and Sweeney [78, 83] (for examples, refer to Figure 2.1). The hierarchies can be used in several ways to constrain the set of possible recodings [53]. In the remainder of this chapter, we require that if φ maps a leaf value v to some ancestor v*, then all leaves that are descended from v* must also be mapped to v*.


Age Sex Zipcode Disease

[25-28] Male [53710-53711] Flu

[25-28] Female 53712 Hepatitis

[25-28] Male [53710-53711] Bronchitis

[25-28] Male [53710-53711] Broken Arm

[25-28] Female 53712 AIDS

[25-28] Male [53710-53711] Hang Nail

Table 3.2 2-Anonymous single-dimensional recoding of Patients

Age Sex Zipcode Disease

[25-26] Male 53711 Flu

[25-27] Female 53712 Hepatitis

[25-26] Male 53711 Bronchitis

[27-28] Male [53710-53711] Broken Arm

[25-27] Female 53712 AIDS

[27-28] Male [53710-53711] Hang Nail

Table 3.3 2-Anonymous multidimensional recoding of Patients


3.2 Some Simple General-Purpose Measures of Quality

In this chapter, we restrict the measurement of anonymization quality to simple general-purpose measures based on the size of the (non-empty) equivalence classes R1, ..., Rm in anonymous view R*. Intuitively, the discernability penalty (CDM), proposed by Bayardo and Agrawal [14], assigns to each tuple t a penalty based on the size of the equivalence class containing t.

    CDM = Σ_{i=1}^{m} |Ri|^2    (3.1)

Similarly, we can also measure the normalized average equivalence class size (CAVG) [55]. Notice that CAVG = 1 when each equivalence class contains precisely k records.

    CAVG = ( (Σ_{i=1}^{m} |Ri|) / m ) / k    (3.2)
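Both measures are straightforward to compute from the equivalence class sizes; the following sketch (illustrative only) uses the three classes of size two produced by the recoding in Table 3.3:

def discernability_penalty(class_sizes):            # CDM, Equation (3.1)
    return sum(size ** 2 for size in class_sizes)

def normalized_avg_class_size(class_sizes, k):      # CAVG, Equation (3.2)
    return (sum(class_sizes) / len(class_sizes)) / k

sizes = [2, 2, 2]                                    # equivalence classes of Table 3.3
assert discernability_penalty(sizes) == 12
assert normalized_avg_class_size(sizes, k=2) == 1.0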

When the purposes for which the data will be used are known a priori, the best measure of data quality takes this information into account. For example, if the data will be used to build a classification model, then the best measure of data quality is the predictive accuracy of the model. Similarly, if the data will be used to evaluate a set of queries, then it is important that these queries can be answered with high precision. We return to the idea of incorporating workload in Chapter 4. However, when no additional information is available, these simple general-purpose measures are a place to start.

3.3 Theoretical Analysis (k-Anonymity)

In this section, we will simplify our analysis to include only k-anonymity and totally-ordered domains. We first show that the problem of finding an optimal k-anonymous multidimensional partitioning is NP-hard. For this reason, we also provide some worst-case upper bounds on the size of equivalence classes resulting from greedy single-dimensional and multidimensional partitioning.


3.3.1 Hardness Result

Previous work has shown that the problems of suppressing as few cells or attributes as possible while satisfying k-anonymity are both NP-hard [5, 62]. The problem of finding the k-anonymous multidimensional partitioning with the smallest CDM or CAVG is also NP-hard, but this does not appear to follow immediately from the previous results.

We formulate the following decision problem using CAVG. (The result is similar for CDM.) Here, the input data R is equivalently represented as a set of distinct (point, count) pairs, where each point is a distinct vector of quasi-identifier values, and the count is the total number of occurrences in R.

Definition 3.4 (Decisional k-Anonymous Multidimensional Partitioning) Let DQ1 × ... × DQd be a domain space, and let R be a set of (point, count) pairs located in this space. Is there a multidimensional partitioning for the space such that for every resulting multidimensional region Gi (containing set of pairs Ri), Σ_{t ∈ Ri} t.count ≥ k or Σ_{t ∈ Ri} t.count = 0, and CAVG ≤ positive constant c?

Theorem 3.5 Decisional k-anonymous multidimensional partitioning is NP-complete.

Proof The proof is by reduction from Partition [39]: A is a multiset of n positive integers {a1, ..., an}. Is there some A' ⊆ A such that Σ_{ai ∈ A'} ai = Σ_{aj ∈ A−A'} aj?

Consider domain space [0, 1]^n, and for each ai ∈ A, construct a pair (pointi, counti). Let pointi = (0, ..., 0, 1, 0, ..., 0) (the ith coordinate is 1, and all other coordinates are 0), and let counti = ai. Let R = {(point1, count1), ..., (pointn, countn)}.

We claim that the partition problem for A can be reduced to the following: Let k = (Σ_{i=1}^{n} ai) / 2. Is there a k-anonymous multidimensional partitioning for [0, 1]^n and R such that CAVG ≤ 1? To prove this claim, we show that there is a solution to the k-anonymous multidimensional partitioning problem for R if and only if there is a solution to the partition problem for A.

Suppose there exists a k-anonymous multidimensional partitioning for R. This partitioning must define two non-overlapping regions, G1 and G2 (containing sets of pairs R1 and R2, respectively), such that Σ_{t ∈ R1} t.count = Σ_{t ∈ R2} t.count = k = (Σ_{i=1}^{n} ai) / 2, and possibly some number


of empty regions. Thus, the sums of counts for the two non-empty regions are the sums of the integers in two disjoint complementary subsets of A, and we have an equal partitioning of A.

In the other direction, suppose there is a solution to the partition problem for A. For each binary partitioning of A into disjoint complementary subsets A1 and A2, there is a multidimensional partitioning of the domain space into regions G1, ..., Gm (containing sets of pairs R1, ..., Rm) such that Σ_{t ∈ R1} t.count = Σ_{ai ∈ A1} ai, Σ_{t ∈ R2} t.count = Σ_{ai ∈ A2} ai, and all other Ri are empty: G1 is defined by two points, the origin and the point having ith coordinate 1 when ai ∈ A1 and 0 otherwise. The bounding box for G1 is closed at all edges and vertices. G2 is defined by the origin and the point p having ith coordinate 1 when ai ∈ A2, and 0 otherwise. The bounding box for G2 is open at the origin, but closed on all other edges and vertices. CAVG is the average sum of counts for the non-empty regions, divided by k. Thus, CAVG = 1, and R1, ..., Rm is a k-anonymous multidimensional partitioning for the domain space and R.

Finally, a given solution to the decisional k-anonymous multidimensional partitioning problem can be verified in polynomial time by scanning the input set of (point, count) pairs, and maintaining a sum for each region Gi.

3.3.2 Bounds on Equivalence Class Size

It is also interesting to consider worst-case upper bounds on the size of partitions resulting from k-anonymous single-dimensional and multidimensional partitioning. This section presents two results. The first result indicates that for multidimensional partitioning, this bound depends on dimensionality d, parameter k, and another parameter (b) indicating the maximum number of duplicate copies of a single point (Theorem 3.10). This is in contrast to the second result (Theorem 3.11), which indicates that for single-dimensional partitioning, when d > 2, this bound may grow linearly with the total number of records in input relation R.

In order to state these results, we first define some terminology. A multidimensional cut for a multiset of points is an axis-parallel binary cut producing two disjoint multisets of points. Such a cut is said to be allowable under k-anonymity if it does not result in a partition containing fewer than k points.


Definition 3.6 (Allowable k-Anonymous Multidimensional Cut) Consider a multiset R of points in d-dimensional space. A cut perpendicular to axis Qi at qi is allowable if and only if |{t : t ∈ R, t.Qi > qi}| ≥ k and |{t : t ∈ R, t.Qi ≤ qi}| ≥ k.
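A direct sketch of this check (illustrative only; points are represented as tuples indexed by axis):

def allowable_multidimensional_cut(points, axis, q, k):
    # A cut perpendicular to axis Qi at value q is allowable under k-anonymity
    # iff both resulting multisets still contain at least k points.
    left = sum(1 for p in points if p[axis] <= q)
    right = sum(1 for p in points if p[axis] > q)
    return left >= k and right >= k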

A single-dimensional cut is also axis-parallel, but must cut all the way across the domain space.

Definition 3.7 (Allowable k-Anonymous Single-Dimensional Cut) Consider a multiset R of points in d-dimensional space, and suppose we have made S single-dimensional cuts, dividing the space into disjoint regions G1, ..., Gm (containing multisets of points R1, ..., Rm). A single-dimensional cut perpendicular to Qi at qi is allowable, given S, if ∀j ∈ 1..m such that Gj overlaps line Qi = qi, |{t : t ∈ Rj, t.Qi > qi}| ≥ k and |{t : t ∈ Rj, t.Qi ≤ qi}| ≥ k.

We will say that a multidimensional (single-dimensional) partitioning is minimal if there are no remaining allowable multidimensional (single-dimensional) cuts.

Definition 3.8 (Minimal k-Anonymous Multidimensional Partitioning) Let G1, ..., Gm denote a set of regions induced by a multidimensional partitioning, and let each region Gi contain multiset Ri of points. This multidimensional partitioning is minimal under k-anonymity if ∀i, |Ri| ≥ k and there exists no allowable k-anonymous multidimensional cut for Gi, Ri.

Definition 3.9 (Minimal k-Anonymous Single-Dimensional Partitioning) Let G1, ..., Gm be the set of regions induced by a set S of single-dimensional cuts over some domain space containing multiset of points R. Let region Gi contain multiset of points Ri. This single-dimensional partitioning is minimal under k-anonymity if ∀i, |Ri| ≥ k and there exists no allowable k-anonymous single-dimensional cut for R given S.

For example, in Figures 3.1(b) and (c), the first cut occurs ontheZipcodedimension at 53711.

In the multidimensional case, the left-hand side is cut again on theAgedimension, which is allow-

able because it does not produce a region containing fewer thank points. In the single-dimensional

case, however, once the first cut is made, there are no remaining allowable single-dimensional cuts.


[Figure 3.2 Equivalence class size bound example (2 dimensions): (a) a set of points for which there is no allowable cut; (b) adding a single point produces an allowable cut.]

(Any cut perpendicular to the Age axis would result in a region on the right containing fewer than

k points.)

The following two theorems give upper bounds on partition size for minimal k-anonymous multidimensional and single-dimensional partitionings, respectively.

Theorem 3.10 Let G1, ..., Gm denote the set of regions induced by a minimal k-anonymous multidimensional partitioning of a d-dimensional domain space containing multiset of points R. The maximum number of points contained in any Gi is 2d(k − 1) + b, where b is the maximum number of copies of any distinct point in R.

Proof The proof has two parts. First, we show that there exists a multiset R of points in d-dimensional space such that |R| = 2d(k − 1) + b and there is no allowable k-anonymous multidimensional cut for R. Let qi denote some value on axis Qi such that qi − 1 and qi + 1 are also values on axis Qi, and let R initially contain b copies of the point (q1, q2, ..., qd). Add to R k − 1 copies of each of the following points:

    (q1 − 1, q2, ..., qd), (q1 + 1, q2, ..., qd),
    (q1, q2 − 1, ..., qd), (q1, q2 + 1, ..., qd),
    ...
    (q1, q2, ..., qd − 1), (q1, q2, ..., qd + 1)


For example, Figure 3.2 shows R in 2 dimensions. By addition, |R| = 2d(k − 1) + b, and by projecting R onto any Qi we obtain the following point counts for each value v on axis Qi:

    |{t : t ∈ R, t.Qi = v}| =  k − 1                  if v = qi − 1
                               b + 2(d − 1)(k − 1)    if v = qi
                               k − 1                  if v = qi + 1
                               0                      otherwise

Based on these counts, it is clear that any binary cut perpendicular to axis Qi would result in some partition containing fewer than k points.

Second, we show that for any multiset of points R in d-dimensional space such that |R| > 2d(k − 1) + b, there exists an allowable k-anonymous multidimensional cut for R. Consider some R in d-dimensional space, such that |R| = 2d(k − 1) + b + 1, and let qi denote the median value of R projected on axis Qi. If there is no allowable cut for R, we claim that there are at least b + 1 copies of the point (q1, ..., qd) in R, contradicting the definition of b.

For every dimension i = 1, ..., d, if there is no allowable cut perpendicular to axis Qi, then (because qi is the median) |{t : t ∈ R, t.Qi < qi}| ≤ k − 1 and |{t : t ∈ R, t.Qi > qi}| ≤ k − 1. This means that |{t : t ∈ R, t.Qi = qi}| ≥ 2(d − 1)(k − 1) + b + 1; equivalently, at most 2(k − 1) points of R differ from the median in dimension i. Summing these at most 2d(k − 1) exceptions over the d dimensions, we find that |{t : t ∈ R, t.Q1 = q1 ∧ ... ∧ t.Qd = qd}| ≥ b + 1.

Theorem 3.11 The maximum number of points contained in any region Gi resulting from a minimal k-anonymous single-dimensional partitioning of a multiset of points R in d-dimensional space (d ≥ 2) is O(|R|).

Proof We construct a multiset of points R, and a minimal k-anonymous single-dimensional partitioning for R, such that the greatest number of points in a resulting region is O(|R|).

Consider a quasi-identifier attribute Q with domain DQ, and a finite set VQ ⊆ DQ with a point q ∈ VQ. Initially, let R contain precisely 2k − 1 points t having t.Q = q. Then add to R an arbitrarily large number of points u, each with u.Q ∈ VQ, but u.Q ≠ q, and such that for each v ∈ VQ there are at least k points in the resulting set R having Q = v.


By construction, if |VQ| = r, there are r − 1 allowable k-anonymous single-dimensional cuts for R perpendicular to Q (at each point in VQ), and we denote this set of cuts S. However, there are no allowable single-dimensional cuts for R given S (perpendicular to any other axis). Thus, S is a minimal k-anonymous single-dimensional partitioning, and the size of the largest resulting region (in terms of contained points) is O(|R|).

Of course, we can project R onto a single dimension, Qi. (That is, generalize all other attributes to the full width of their domains.) In this case, single-dimensional partitioning and multidimensional partitioning are identical. Let b′ denote the maximum number of points in R having Qi = q for any value q. By Theorem 3.10, we know that the maximum number of points in any resulting region is 2(k − 1) + b′. However, for dimensionality d > 2, we draw a distinction between these two cases because (particularly for infinite domains) b′ can be arbitrarily larger than b. This distinction is of greatest concern for discrete-valued domains.

Indeed, in our experimental evaluation, we note that refining a single dimension is often the

optimal strategy for single-dimensional partitioning. For continuous domains, this may result in

reasonable partition size, but is still undesirable because many attributes are discarded entirely.

3.4 Recursive Partitioning Framework

The analysis in the previous section leads naturally to a greedy recursive multidimensional partitioning algorithm. This section describes such an algorithm, which we call Mondrian.

Originally described in [55], Mondrian is based on a top-down recursive partitioning of the (multidimensional) quasi-identifier domain space. The basic structure of the algorithm is described in Figure 3.3.² The multidimensional domain space G is represented logically using a simple structure: the range of each numeric (or ordinal categorical) attribute is represented by a min and max value; the range of each (nominal) categorical attribute is represented by some value in its generalization hierarchy. Initially, this structure is configured to include the entire domain space.

² The original conference paper also described simple extensions of this algorithm that pertain to a more relaxed local recoding approach [55].


Algorithm: Mondrian(domain space G, relation R, quasi-identifiers {Q1, ..., Qd}, monotone/bucket-independent ρ)
    if (no allowable split for G, R under ρ)
        return φ : t ∈ R → tuple representation(G, R)
    else
        best ← ChooseAttribute(R, {Q1, ..., Qd}, ρ)
        if numeric(best) or ordinal(best)
            threshold ← ChooseThreshold(best, R, ρ)
            R1 ← {t : t ∈ R, t.best ≤ threshold}
            R2 ← {t : t ∈ R, t.best > threshold}
            G1 ← Update G by setting best.max = threshold
            G2 ← Update G by setting best.min = threshold
            return Mondrian(G1, R1, {Q1, ..., Qd}, ρ) ∪ Mondrian(G2, R2, {Q1, ..., Qd}, ρ)
        else if nominal(best)
            recodings ← {}
            for each child vi of root(best.hierarchy)
                Ri ← {t : t ∈ R, t.best descended from vi in best.hierarchy}
                Gi ← Update G by setting best.value = vi
                Q′ ← Replace best.hierarchy with subtree rooted at vi in {Q1, ..., Qd}
                recodings ← recodings ∪ Mondrian(Gi, Ri, Q′, ρ)
            return recodings

Figure 3.3 Algorithm: Mondrian


[Figure 3.4 Example partition tree for quasi-identifier attributes Age and Nationality: the root splits on Age (≤ 40 vs. > 40), and the left branch is further split on Nationality (European vs. Asian), yielding leaf regions (Age 0-40, European), (Age 0-40, Asian), and (Age 41+, *).]

In each iteration, Mondrian must choose the dimension about which to split, and for numeric attributes, it must also choose a split threshold. In this section, we outline a simple general-purpose approach. Alternatively, the split attribute and threshold may be chosen based on an anticipated workload, as described in the next chapter.

When choosing the threshold, notice that if there exists an allowable k-anonymous multidimensional cut perpendicular to axis Qi, then the cut perpendicular to Qi at the median is necessarily allowable.³ We have some flexibility in choosing the dimension on which to split. In the literature about kd-trees, one heuristic chooses the dimension with the widest (normalized) range of values [37] perpendicular to which there exists an allowable cut. In the absence of application-specific considerations, this heuristic assures that each quasi-identifier dimension is given equal attention. We will refer to this algorithm as Median Mondrian. The time complexity is O(|R| log |R|).
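As an illustration only (not the dissertation's implementation), here is a minimal Python sketch of Median Mondrian restricted to numeric quasi-identifiers and strict k-anonymity; the helper names are ours, and a production version would also normalize ranges by domain width and handle generalization hierarchies as in Figure 3.3.

def median_mondrian(points, k):
    # points: list of equal-length tuples of numeric quasi-identifier values.
    # Returns a list of k-anonymous partitions (each a list of points).
    d = len(points[0])
    # Consider dimensions in order of decreasing range (the kd-tree heuristic).
    for dim in sorted(range(d),
                      key=lambda i: max(p[i] for p in points) - min(p[i] for p in points),
                      reverse=True):
        values = sorted(p[dim] for p in points)
        threshold = values[(len(values) - 1) // 2]     # median value
        left = [p for p in points if p[dim] <= threshold]
        right = [p for p in points if p[dim] > threshold]
        if len(left) >= k and len(right) >= k:          # allowable cut (Definition 3.6)
            return median_mondrian(left, k) + median_mondrian(right, k)
    return [points]   # no allowable cut remains: one equivalence class

Each returned partition would then be generalized to its bounding box, i.e., [min, max] of each dimension over the points it contains.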

It is also convenient to think of the top-down partitioning as defining a partition tree, which in turn defines recoding function φ. (Optionally, this structure can also be materialized.) Each leaf of the tree contains a (disjoint) multidimensional region, as well as a tuple representation for this region. For example, Figure 3.4 shows the partition tree for quasi-identifier attributes Age and Nationality. Notice that the generalized tuples in this example cover the domain space.

³ In the case of duplicate tuples, we select the median threshold resulting in the most even division of tuples.


Alternatively, we could consider representing the regions using statistics summarizing the data located in the leaf partitions (see Section 4.5).

In the remainder of this section, we first derive worst-case quality bounds for Median Mondrian under k-anonymity, based on the general-purpose measures in Section 3.2. Then, we show how Mondrian can be extended to incorporate diversity requirements.

3.4.1 Quality Bounds (k-Anonymity)

When the quasi-identifier domains are all totally-ordered (without pre-defined generalization hierarchies), it is easy to derive bounds for the general-purpose quality measures described in Section 3.2 under k-anonymity. By definition, k-anonymity requires that every equivalence class contain at least k records. For this reason, the optimal achievable value of C_DM (denoted C_DM*) is ≥ k ∗ |R|, and C_AVG* ≥ 1.
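For reference, assuming the definitions in Section 3.2 (C_DM charges each tuple the size of its equivalence class, and C_AVG is the average equivalence class size normalized by k), both measures can be computed from the partition sizes alone; the Python sketch below is ours and purely illustrative.

def discernability_penalty(partition_sizes):
    # C_DM: each tuple is charged the size of its equivalence class,
    # so a class of size s contributes s * s.
    return sum(s * s for s in partition_sizes)

def normalized_avg_class_size(partition_sizes, k):
    # C_AVG: (|R| / number of classes) / k; the optimum is 1.
    total = sum(partition_sizes)
    return (total / len(partition_sizes)) / k

# Example: sizes [3, 3, 4] with k = 3 give C_DM = 34 and C_AVG = (10/3)/3 ≈ 1.11.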

Note that Median Mondrian produces a minimal k-anonymous partitioning of the space. Thus, following Theorem 3.10, under the greedy algorithm, for each equivalence class Ri, |Ri| ≤ 2d(k − 1) + b, where b is the maximum number of copies of any distinct quasi-identifier tuple.

C_DM is maximized when the tuples are divided into the largest possible equivalence classes, so C_DM ≤ (2d(k − 1) + b) ∗ |R|. Thus,

    C_DM / C_DM*  ≤  (2d(k − 1) + b) / k

Similarly, C_AVG is maximized when the tuples are divided into the largest possible equivalence classes, so C_AVG ≤ (2d(k − 1) + b) / k.

    C_AVG / C_AVG*  ≤  (2d(k − 1) + b) / k

Observe that for constant d and b, the greedy algorithm is a constant-factor approximation of optimal, as defined by C_DM and C_AVG. If d varies, but b/k is constant, the approximation is O(d).

3.4.2 Incorporating Diversity

Although the previous discussion focused on k-anonymity, Mondrian is easily extended to any monotone and bucket independent anonymity requirement, such as ℓ-diversity or variance diversity


(see Section 1.3.2). Specifically, these requirements are easily incorporated as additional stopping criteria in the recursive partitioning procedure. Thus, we extend the definitions of allowability and minimality to incorporate the new requirements.

Definition 3.12 (Allowable Entropy ℓ-Diverse (Variance Diverse) Multidimensional Cut) R is a multiset of points in d-dimensional space. A cut perpendicular to axis Qi at qi is allowable if and only if {t : t ∈ R, t.Qi ≤ qi} and {t : t ∈ R, t.Qi > qi} individually satisfy ℓ-diversity (variance diversity).

Definition 3.13 (Minimal ℓ-Diverse (Variance Diverse) Multidimensional Partitioning) Consider G1, ..., Gn, the set of regions induced by a multidimensional partitioning, and let each region Gi contain multiset Ri of points. This partitioning is minimal under ℓ-diversity (variance diversity) if each Ri satisfies ℓ-diversity (variance diversity) and there does not exist an allowable ℓ-diverse (variance diverse) multidimensional cut for any Ri.

Finally, it is important to note that if we incorporate these additional requirements, our bounds on equivalence class size (and C_DM, C_AVG) do not necessarily hold.
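To illustrate how such a requirement plugs in as a stopping criterion, the following sketch checks entropy ℓ-diversity for the two sides of a candidate cut; it assumes the standard definition from Section 1.3.2 (the entropy of the sensitive attribute within each partition must be at least log ℓ) and is ours, not the dissertation's code.

import math
from collections import Counter

def satisfies_entropy_l_diversity(sensitive_values, ell):
    # Entropy of the sensitive attribute within the partition must be >= log(ell).
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy >= math.log(ell)

def is_allowable_diverse_cut(left_sensitive, right_sensitive, ell):
    # Definition 3.12: both sides of the candidate cut must individually
    # satisfy the diversity requirement.
    return (satisfies_entropy_l_diversity(left_sensitive, ell) and
            satisfies_entropy_l_diversity(right_sensitive, ell))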

3.5 Experimental Evaluation of Data Quality

In this chapter, we present an initial experimental evaluation of the Mondrian approach. We compared the quality of anonymizations produced by Median Mondrian to those produced by optimal algorithms for two other recoding techniques: full-domain generalization [53, 78] and single-dimensional partitioning [14, 47]. The specific algorithms (Incognito [53] and K-Optimize [14]) were chosen for efficiency, but any optimal algorithm would yield the same result.

It is also important to note that the time complexity of each of the exhaustive algorithms is exponential in the size of the quasi-identifier (number of attributes), while the time complexity of Mondrian is loglinear in the size of the input relation. Although there is some crossover point, in the common cases, Mondrian runs many times faster. At the same time, we find that Mondrian frequently produces results of superior quality.


In this initial set of experiments, we limited the evaluation to the simple general-purpose measures of data quality described in Section 3.2, as well as a count of the number of attributes for which information is preserved at any granularity finer than the entire domain. We also limited the evaluation to k-anonymity. In Chapter 4, we expand the experimental evaluation substantially.

3.5.1 Experimental Data

We constructed a simple synthetic generator with the goal of understanding how the underlying distribution affects the quality of the resulting anonymized data (for each of the three algorithms). In these experiments, we focused primarily on discrete distributions, so that the exhaustive algorithms would run in reasonable time without pre-generalizing the data. In particular, we considered several discrete joint distributions:

• Discrete Uniform  The discrete uniform distribution is generated in a straightforward manner based on the specified cardinality (number of unique values per attribute).

• Discrete Normal  To generate the discrete normal distribution, we first generate the multivariate normal distribution, according to a specified mean (µ) and standard deviation (σ). We then discretize the values for each attribute into equal-width ranges based on a specified cardinality (unique values per attribute).

In addition, we also used the Adults database from the UC Irvine Machine Learning Repository [17]. We configured the data as in the experiments reported by [14], using eight quasi-identifier attributes and removing tuples with missing values. The resulting data contained 30,162 records. For Mondrian and K-Optimize, we imposed an intuitive ordering on each attribute and eliminated hierarchical constraints. For Incognito, we used the same generalization hierarchies that were used in the previous chapter.


[Figure 3.5 Experimental quality evaluation using C_DM. Four panels plot the discernability penalty for full-domain, single-dimensional, and multidimensional recoding: against k for the uniform distribution (5 attributes), the normal distribution (5 attributes, σ = .2), and the Adults database, and against the standard deviation (5 attributes, k = 10).]


3.5.2 Results for C_DM

Our initial set of experiments compared the anonymizations produced by Mondrian to those generated by K-Optimize and Incognito using the C_DM measure. (The results were similar for C_AVG.) Figure 3.5 reports the results.

The first experiment compared the three algorithms for varied k. We fixed the number of records at 10,000, the per-attribute cardinality at 8, and the number of attributes at 5. For full-domain generalization, we constructed generalization hierarchies using binary trees. The results for the uniform distribution and the discrete normal distribution (µ = 3.5, σ = .2) are reported in Figure 3.5. We found that Median Mondrian produced better generalizations than the other two algorithms in both cases. However, the magnitude of this difference was much more pronounced for the non-uniform data.

Following this observation, the next experiment compared quality for varied standard deviation (σ); small σ indicates a high degree of non-uniformity. The number of attributes was again fixed at 5, and k was fixed at 10. The results show that the difference in quality is largest for non-uniform data.

In addition to the synthetic data, we compared the three algorithms using the Adults database.

Again, we found that Median Mondrian generally produced higher-quality results.

Following the analysis in Section 3.3.2, we suspect that the difference between multidimensional partitioning and single-dimensional partitioning would diminish for continuous domains. That is, when there is a negligible number of points in the input data with duplicate values for any quasi-identifier attribute, a single-dimensional partitioning approach can achieve reasonable C_DM by partitioning just a single dimension. However, large numbers of distinct values severely decrease the efficiency of the K-Optimize algorithm [14], making this difficult to evaluate experimentally.

3.5.3 Attributes Preserved

In addition, we observed informally that single-dimensional partitioning often preserves the

values of just a handful of attributes. To illustrate this point, Figures 3.6 and 3.7 show partitionings


[Figure 3.6 2-Dimensional single-dimensional partitionings for k = 50, k = 25, and k = 10.]


[Figure 3.7 2-Dimensional multidimensional partitionings for k = 50, k = 25, and k = 10.]


of a 2-dimensional quasi-identifier space containing 1000 records. The records are sampled from

a discrete normal distribution (σ = .2, µ = 25).

Multidimensional partitioning does an excellent job of capturing the underlying distribution. The optimal single-dimensional partitioning was quite sensitive to small variations in the underlying data, but we observed that it often reflected the distribution of just one attribute.

3.6 Chapter Summary

In this chapter, we introduced and developed the multidimensional partitioning approach to anonymization, as well as an efficient greedy recursive algorithm (Mondrian). Although the problem of optimal k-anonymous multidimensional partitioning is NP-hard, a simple analysis shows that Mondrian yields constant-factor bounds on data quality when parameters b and d are constant, a common case in many real-world data sets. In addition, an experimental study shows that in many practical cases, including discrete-valued domains and non-uniform data distributions, Mondrian actually produces better results than optimal single-dimensional recoding algorithms.


Chapter 4

Incorporating Workload

The previous chapter described a greedy partitioning framework for anonymization. When the data publisher does not know the purpose for which the data will be used, it is reasonable to measure data quality using simple general-purpose measures, such as those described in Section 3.2.

However, whenever possible, the best way of measuring quality is based on the task(s) for

which the data will eventually be used. This chapter describes anonymization techniques that

incorporate a target workload composed of selection, aggregation, classification, and regression

tasks. An experimental study indicates the importance of using workload as a quality evaluation

tool, as well as the improvement in data quality that can be achieved by incorporating a target

workload directly into the anonymization procedure.

4.1 Motivating Example

The importance of incorporating workload is best illustrated with an example. Suppose that a

trusted agency compiles a database of disease information for several million hospital patients. In

many cases, this data could prove useful to external researchers. At the same time, it is important

for the agency to take precautions protecting individual privacy, for example hiding identity and

guaranteeing that the released data does not reveal other information, such as an individual’s HIV

status.

Now consider Alice, an external researcher who is directing two separate studies, each of which could benefit from using the data in the central database. As part of the first study, Alice wants


to build a classification model that uses age, smoking history, and HIV status to predict life expectancy. In the second study, she would like to find combinations of variables that are useful for predicting elevated cholesterol and obesity in males over age 40.

In this situation, there are several reasons why it is valuable to distribute non-aggregate microdata to data recipients like Alice.¹ One might envision a simpler protocol, where Alice requests a specific model, constructed entirely by the agency. However, there are two downsides to this approach. First, the model-request protocol assumes that the tasks are fully specified at the time of the initial request. However, in our example, Alice's second study involves an entire class of models, each constructed using a subset of the data (attributes and records). Indeed, workloads like this arise naturally in certain types of exploratory data analysis (e.g., [23]).

Perhaps more importantly, by releasing a single snapshot, this protocol avoids the problem of auditing inference paths across multiple models constructed on the original data.² For example, it is well known that answering multiple aggregate queries may allow an adversary to infer more precise information about the underlying database [1], and in interactive environments, even rejecting a query may reveal sensitive information [49]. The inference implications (with respect to identity and the value of a sensitive attribute) of releasing one or more predictive models are not as well-understood, but each such model reveals something about the distributional characteristics of the agency's data. In certain cases, it is possible that this information could be combined with external data, resulting in a breach of privacy. On the other hand, it is appealing to release a single (sanitized) view because there are well-developed notions of anonymity, and the best an adversary can do is approximate the distribution in the data she is given.

4.2 Language Describing Workloads

In this chapter, we use a simple language L for describing a family of target workloads. Specifically, each workload family is made up of one or more of the following components:

¹ We continue to assume that the researcher receives just one distribution of any particular data set, and that she does not collude with others receiving different views of the same underlying data.

² Of course, if we release multiple snapshots, we must still take steps to audit inference.


Figure 4.1 Attribute type characterizations for anonymity and classification/regression

• Classification Tasks  A (set of) classification task(s) is characterized by a set of predictor attributes {F1, ..., Fn} (commonly called features), and one or more nominal class labels C.

• Regression Tasks  A (set of) regression task(s) is characterized by a set of features {F1, ..., Fn} and one or more numeric target attributes T.

• Selection Tasks  A set of selection tasks is defined by a set of selection predicates {P1, ..., Pn}, each of which is a boolean function of the quasi-identifier attributes.

• Aggregation Tasks  Each aggregation task is defined by an aggregate function (e.g., SUM, MIN, AVG, etc.).
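As a purely illustrative sketch (ours, not part of the dissertation), a workload family in L could be represented with a handful of simple record types:

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class ClassificationTask:
    features: List[str]            # predictor attributes F1, ..., Fn
    class_labels: List[str]        # nominal class label(s) C

@dataclass
class RegressionTask:
    features: List[str]
    targets: List[str]             # numeric target attribute(s) T

@dataclass
class SelectionTask:
    predicates: List[Callable[[Dict], bool]]   # boolean functions of the quasi-identifiers

@dataclass
class AggregationTask:
    function: str                  # e.g., "SUM", "MIN", "AVG"

@dataclass
class WorkloadFamily:
    classification: List[ClassificationTask] = field(default_factory=list)
    regression: List[RegressionTask] = field(default_factory=list)
    selection: List[SelectionTask] = field(default_factory=list)
    aggregation: List[AggregationTask] = field(default_factory=list)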

In the related literature, several single-dimensional recoding algorithms for k-anonymity have been proposed that incorporate a single target classification model (constructed over the full data set) [38, 47, 88]. However, previous work has not considered incorporating such a wide variety of workload tasks.

4.3 Classification and Regression

In this section, we consider workloads consisting of one or more classification or regression

models. When a single target model is considered in conjunction with anonymity, each attribute

has two characterizations: one for anonymity, and one for the model. Figure 4.1 describes the

space of attributes. The interesting cases are highlightedand numbered. (When an attribute is

not involved in the task, as a feature or class label, then it can be suppressed entirely. Similarly,

attributes that are not part of the quasi-identifier can be released unmodified.)


When the (set of) target model(s) is known, we seek to improve upon the simple median-partitioning heuristic described in Chapter 3. We first consider the case where {Q1, ..., Qd} = {F1, ..., Fn} and C, T ≠ S. Under this assumption, Sections 4.3.1 and 4.3.2 describe algorithms for incorporating a single classification or regression model. Section 4.3.3 provides extensions to multiple models. Finally, Section 4.3.4 discusses extensions to these techniques for handling other interesting attribute characterizations.

4.3.1 Single Target Classification Model

First consider a single target classification model, with features F1, ..., Fn and class label C. Also consider quasi-identifier attributes Q1, ..., Qd and sensitive attribute S. In this section, we assume that {F1, ..., Fn} = {Q1, ..., Qd} and that C ≠ S.

Our goal is to produce a multidimensional partitioning of the quasi-identifier domain space (also the feature space) into regions G1, ..., Gm (containing disjoint tuple multisets R1, ..., Rm, respectively) that satisfies all given anonymity requirements (k-anonymity, ℓ-diversity, etc.). At the same time, because of the classification task, we would like these partitions to be homogeneous with respect to C. More formally, one way to implement this intuition is to minimize the following function, which is the conditional entropy of class label C, given membership in a particular data partition:

    H(C|R*) = Σ_{i=1..m} (|Ri| / |R|) Σ_{c ∈ D_C} −p(c|Ri) log p(c|Ri)        (4.1)

Using this as motivation, we propose a greedy splitting algorithm based on local entropy minimization. This is reminiscent of algorithms for decision tree construction [19, 68]. At each recursive step, we choose the candidate split that minimizes the following function without violating the anonymity requirement(s). Let V denote the current (recursive) tuple set, and let V1, ..., Vm denote the set of data partitions resulting from the candidate split. p(c|Vi) is the fraction of tuples in Vi with class label C = c. We refer to this algorithm as InfoGain Mondrian.


    Entropy(V, C) = Σ_{i=1..m} (|Vi| / |V|) Σ_{c ∈ D_C} −p(c|Vi) log p(c|Vi)        (4.2)

The algorithm handles continuous quasi-identifier values by partitioning around the threshold value that minimizes Equation 4.2 without violating the anonymity requirement. In order to select this threshold, we must first sort the data with respect to the split attribute. Thus, the complexity of InfoGain Mondrian is O(|R| log² |R|).
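The following Python sketch (ours, for illustration only) scores candidate thresholds on a numeric attribute using Equation 4.2 while respecting k-anonymity; for clarity it rescans the records for every candidate threshold, whereas the O(|R| log² |R|) bound quoted above relies on sorting once and updating class counts incrementally.

import math
from collections import Counter

def weighted_entropy(partitions):
    # Equation 4.2: sum over candidate partitions Vi of (|Vi|/|V|) * H(C | Vi).
    total = sum(len(v) for v in partitions)
    score = 0.0
    for v in partitions:
        counts = Counter(label for _, label in v)
        for c in counts.values():
            p = c / len(v)
            score -= (len(v) / total) * p * math.log(p)
    return score

def best_infogain_threshold(records, dim, k):
    # records: list of (quasi_identifier_tuple, class_label) pairs.
    best = None
    for q in sorted({t[0][dim] for t in records}):
        left = [t for t in records if t[0][dim] <= q]
        right = [t for t in records if t[0][dim] > q]
        if len(left) < k or len(right) < k:      # the cut must remain allowable
            continue
        score = weighted_entropy([left, right])
        if best is None or score < best[0]:
            best = (score, q)
    return best    # (entropy, threshold), or None if no allowable cut exists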

4.3.2 Single Target Regression Model

The intuition from the previous section is easily extended to numeric target attribute T. Again, we assume {F1, ..., Fn} = {Q1, ..., Qd} and T ≠ S.

In this case, we seek to minimize the weighted mean squared error WMSE(T, R*), which indicates the impurity of target attribute T within each data partition. T(Ri) denotes the mean value of T in data partition Ri.

    MSE(T, Ri) = (1 / |Ri|) Σ_{t ∈ Ri} (t.T − T(Ri))²        (4.3)

    WMSE(T, R*) = Σ_{i=1..m} (|Ri| / |R|) MSE(T, Ri)        (4.4)
                = (1 / |R|) Σ_{i=1..m} Σ_{t ∈ Ri} (t.T − T(Ri))²        (4.5)

Equation 4.5 leads naturally to a local greedy split criterion. Let V denote the current (recursive) tuple set, and let V1, ..., Vm denote the set of data partitions resulting from a candidate split. Similar to the CART algorithm for regression trees [19], at each step we choose the split that minimizes the following expression (without violating the anonymity requirements). We call this Regression Mondrian.

    Error²(V, T) = Σ_{i=1..m} Σ_{t ∈ Vi} (t.T − T(Vi))²        (4.6)


Continuous attributes are also handled by sorting and thresholding. The time complexity of Regression Mondrian is also O(|R| log² |R|).
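A candidate split for Regression Mondrian can be scored directly from Equation 4.6; the short sketch below is ours, not the dissertation's code, and simply computes the within-partition squared error of the target values for the two sides of a cut.

def squared_error(partitions):
    # Equation 4.6: sum over candidate partitions of the squared deviation of
    # the target attribute from its partition mean.
    total = 0.0
    for targets in partitions:
        mean = sum(targets) / len(targets)
        total += sum((t - mean) ** 2 for t in targets)
    return total

# Among the allowable cuts, the algorithm picks the one minimizing
# squared_error([left_targets, right_targets]).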

4.3.3 Multiple Target Models

In certain cases, we would like to allow the data recipient to build several models, to accurately predict the marginal distributions of several class labels (C1, ..., Cn) or numeric target attributes (T1, ..., Tn). The heuristics described in the previous two sections can be extended to these cases. We assume {F1, ..., Fn} = {Q1, ..., Qd}, S ∉ (C1, ..., Cn), and S ∉ (T1, ..., Tn).

For classification, there are two ways to make this extension. In the first approach, the data recipient would build a single model to predict the vector of class labels, 〈C1, ..., Cn〉, which has domain D_C1 × ... × D_Cn. A greedy split criterion would minimize entropy with respect to this single variable.

However, in this simple approach, the size of the domain grows exponentially with the number

of target attributes. To avoid potential sparsity problems, we instead assume independence among

target attributes. This is reasonable because we are ultimately only concerned about the marginal

distribution of each target attribute. Under the independence assumption, the greedy criterion

chooses the split that minimizes the following without violating the anonymity requirement(s):

    Σ_{i=1..n} Entropy(V, Ci)        (4.7)

In regression (the squared error split criterion in particular), there is no analogous distinction between treating the set of target attributes as a single variable and assuming independence. For example, if we have two target attributes, T1 and T2, the joint error is the distance between an observed point (t1, t2) and the centroid (T1(V), T2(V)) in 2-dimensional space. The squared joint error is the sum of individual squared errors, (t1 − T1(V))² + (t2 − T2(V))². For this reason, we choose the split that minimizes the following without violating the anonymity requirement(s):

    Σ_{i=1..n} Error²(V, Ti)        (4.8)


Figure 4.2 Features vs. quasi-identifiers in classification-oriented anonymization

4.3.4 Other Attribute Characterizations

Until now, we have assumed that {F1, ..., Fn} = {Q1, ..., Qd} and that C, T ≠ S. This section explores how the greedy algorithms can be extended to handle other attribute characterizations.

The first interesting question concerns what happens when S = C or S = T. Intuitively, this appears to be problematic. Indeed, consider the case where {Q1, ..., Qd} = {F1, ..., Fn}. In this case, entropy ℓ-diversity requires that H(C|R*) ≥ log(ℓ). Similarly, variance diversity requires WMSE(T, R*) ≥ v.

However, consider the case where {Q1, ..., Qd} ⊂ {F1, ..., Fn}. In this case, we can draw a distinction between partitioning the quasi-identifier space and partitioning the feature space. Informally, it is convenient to think of the former as dividing the input data R into disjoint partitions with identical (generalized) quasi-identifier values. The latter further refines this partitioning based on the values of the additional features. For this reason, it is sometimes possible to obtain feature space partitions that are homogeneous with respect to the target attribute, while the quasi-identifier space partitions satisfy the diversity requirement. For example, consider the partitioning in Figure 4.2 (with features F1, F2 and class labels/sensitive values "+" and "-"). The feature space partitions are homogeneous with respect to the class label. However, suppose there is just one quasi-identifier attribute Q1 = F1. Clearly, the partitioning is 2-diverse with respect to Q1.

This observation leads to an interesting extension of the greedy split heuristics described in Sections 4.3.1 and 4.3.2. Let V denote the current (recursive) data partition, and let V1, ..., Vm


denote the set of data partitions resulting from a candidate (quasi-identifier) split. Let F′ = {F1, ..., Fn} − {Q1, ..., Qd}.

It is easy to think about further dividing each Vi into sub-partitions based on the values of F′. More formally, define {Vi1, ..., Vin_i} such that Vi1 ∪ ... ∪ Vin_i = Vi and ∀h, j (h ≠ j), Vih ∩ Vij = ∅. Furthermore, ∀t1, t2 ∈ Vij, t1.F′ = t2.F′, and ∀h, j (h ≠ j), there do not exist t1 ∈ Vih, t2 ∈ Vij such that t1.F′ = t2.F′.

At each recursive step, the greedy algorithm chooses the (quasi-identifier) split that minimizes the entropy of C (or squared error of T) across the resulting feature space partitions, without violating the given anonymity requirement(s) across the quasi-identifier space partitions {V1, ..., Vm}. That is, for a single classification model (class label C), choose the candidate split that minimizes the following without violating the anonymity requirement(s):

    Σ_{i=1..m} Σ_{j=1..n_i} (|Vij| / |V|) Σ_{c ∈ D_C} −p(c|Vij) log p(c|Vij)        (4.9)

For a single regression model (target attribute T), choose the candidate split that minimizes the following without violating the anonymity requirement(s):

    Σ_{i=1..m} Σ_{j=1..n_i} Σ_{t ∈ Vij} (t.T − T(Vij))²        (4.10)

The other interesting case arises when C ∈ {Q1, ..., Qd} or T ∈ {Q1, ..., Qd}. We do not have an ideal solution to this case. However, we expect that it is important to release the target attribute intact. Thus, our initial step is to refine C or T as much as possible, prior to the other quasi-identifier attributes.

4.4 Selection

Sometimes one or more of the tasks in the target workload will use only a subset of the released data, and it is important that this data can be selected precisely, despite recoding. For example, in Section 4.1, we described a study that involved building a model for only males over age 40, but this is difficult if the ages of some men are generalized to the range [30-50].


Consider a set of selection predicates {P1, ..., Pn}, each defined by a boolean function of the quasi-identifier attributes {Q1, ..., Qd}. Conceptually, each Pi defines a query region Xi in the domain space such that Xi = {x : x ∈ D_Q1 × ... × D_Qd, Pi(x) = true}. For the purposes of this work, we only consider selections for which the query region can be expressed as a d-dimensional rectangle. (Of course, some additional selections can be decomposed into two or more hyper-rectangles, and incorporated as separate queries.)

A multidimensional partitioning (and recoding function φ) divides the domain space into non-overlapping rectangular regions Y1, ..., Ym. The recoding region Yi = {y : y ∈ D_Q1 × ... × D_Qd, φ(y) = y*_i}, where y*_i is a unique generalization of the quasi-identifier vector. When evaluating Pi over the sanitized view R*, it may be that no set of recoding regions can be combined to precisely equal query region Xi. Instead, we need to define the semantics of selection queries on this type of imprecise data. Clearly, there are many possible semantics, but in the rest of this chapter we settle on one. Under this semantics, a selection with predicate Pi returns all tuples from R* that are contained in any recoding region overlapping the corresponding query region Xi. More formally,

    Overlap(Xi, {Y1, ..., Ym}) = ∪ {Yj : Yj ∈ {Y1, ..., Ym}, Yj ∩ Xi ≠ ∅}

    Pi(R*) = {φ(t) : φ(t) ∈ R* ∧ t ∈ Overlap(Xi, {Y1, ..., Ym})}

Notice that this will often produce a larger result set than evaluating Pi over the original table R. We define the imprecision to be the difference in size between these two result sets.

    Pi(R) = {t : t ∈ R, Pi(t) = true}

    imprecision(Pi, R*, R) = |Pi(R*)| − |Pi(R)|

For example, Figure 4.3 shows a 2-dimensional domain space. The shaded area represents a query region, and the tuples of R are represented by points. The recoding regions are bounded by dotted lines and numbered. Recoding regions 2, 3, and 4 overlap the query region. If we evaluated


Figure 4.3 Evaluating a selection over generalized data

this query using the original data, the result set would include 6 tuples. However, evaluating the query using the recoded data (under the given semantics) yields 10 tuples, an imprecision of 4.
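A minimal Python sketch of this imprecision computation follows; it is ours, assumes query and recoding regions are axis-parallel rectangles given as (lower corner, upper corner) pairs, and ignores the open/closed boundary details.

def overlaps(rect_a, rect_b):
    # Axis-parallel rectangles as (low, high) pairs of coordinate tuples.
    (a_lo, a_hi), (b_lo, b_hi) = rect_a, rect_b
    return all(a_lo[i] <= b_hi[i] and b_lo[i] <= a_hi[i] for i in range(len(a_lo)))

def imprecision(query_rect, regions, region_counts, original_tuples, predicate):
    # |Pi(R*)|: every tuple recoded into a region overlapping the query region
    # is returned, whether or not it truly satisfies the predicate.
    returned = sum(count for region, count in zip(regions, region_counts)
                   if overlaps(region, query_rect))
    # |Pi(R)|: tuples of the original table that actually satisfy the predicate.
    exact = sum(1 for t in original_tuples if predicate(t))
    return returned - exact

# In the example of Figure 4.3, `returned` would be 10 and `exact` 6,
# giving an imprecision of 4.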

Ideally, the goal of selection-oriented anonymization is to find the safe (k-anonymous, ℓ-diverse, variance diverse, etc.) multidimensional partitioning that minimizes the (weighted) sum of imprecision for the set of target predicates. (We assign each predicate Pi a positive weight wi.)

We incorporate this goal through another greedy splitting heuristic. Let V denote the current (recursive) tuple set, and let V1, ..., Vm denote the set of partitions resulting from the candidate split. Our heuristic minimizes the sum of weighted imprecisions:

    Σ_{i=1..n} wi · imprecision(Pi, V*, V)        (4.11)

The algorithm proceeds until there is no allowable split that reduces the imprecision of the recursive partition. We will call this algorithm Selection Mondrian. In practice, we expect this technique to be used most often for simple selections, such as breaking down health data by state. Following this, we continue to divide each resulting partition using the appropriate splitting heuristic (i.e., InfoGain Mondrian, etc.).

4.5 Aggregation and Summary Statistics

In multidimensional global recoding, individual data points are mapped to one multidimensional region in the set of disjoint rectangular regions covering the domain space. To this point in


the thesis, we have primarily considered representing each such region as a relational tuple based on its conceptual bounding box (for example, see Figure 3.4).

However, when we consider the task of answering a set of aggregate queries, it is also beneficial to consider alternate ways of representing these regions using various summary statistics, which is reminiscent of ideas used in microaggregation [30].³ In particular, we consider two types of summary statistics, which are computed based on the data contained within each region (partition). For each attribute A in partition Ri, consider the following (a short sketch follows the list):

• Range Statistic (R)  Including a summary statistic defined by the minimum and maximum value of A appearing in Ri allows for easy computation of MIN and MAX aggregates.

• Mean Statistic (M)  We also consider a summary statistic defined by the mean value of A appearing in Ri, which allows for the computation of AVG and SUM.
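The sketch below (ours, illustrative only) computes both statistics for each attribute of a partition; given the partition sizes, SUM and AVG can then be answered from the mean statistics and MIN/MAX from the range statistics.

def summary_statistics(partition, attributes):
    # partition: list of dicts, one per tuple in the equivalence class.
    stats = {}
    for a in attributes:
        values = [t[a] for t in partition]
        stats[a] = {
            "min": min(values),                    # range statistic (R)
            "max": max(values),
            "mean": sum(values) / len(values),     # mean statistic (M)
        }
    return stats

# For example, SUM of A over a partition is stats[a]["mean"] * len(partition),
# and MIN over several partitions is the minimum of their "min" statistics.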

When choosing summary statistics, it is important to consider potential avenues for inference. Notice that releasing minimum and maximum statistics allows for some inference about the distribution of values within a partition. For example, consider an attribute A, and let k = 2. Suppose that an equivalence class contains two tuples, with minimum = 0 and maximum = 1. It is easy to infer that one of the original tuples has A = 0, and the other has A = 1. However, this type of inference does not undermine protection against joining attacks because it is still impossible for an adversary to distinguish the tuples within a partition from one another.

4.6 Experimental Evaluation of Data Quality

We conducted an experimental evaluation with two main goals. The first goal was to provide insight about experimental quality evaluation methodology. We outline an experimental protocol for evaluating an anonymization algorithm with respect to a workload of classification and regression tasks. A comparison with the results of simpler general-purpose quality measures indicates the importance of evaluating data quality with respect to the target workload when it is known.

³ Certain types of aggregate functions (e.g., MEDIAN) are ill-suited to these types of computations. We do not know of any way to compute such functions from this type of summary statistics.

The second goal is to evaluate the extensions to Mondrian for incorporating workload. We pay particular attention to the impact of incorporating one or more target classification/regression models and the effects of multidimensional recoding. We also evaluate the effectiveness of our algorithms with respect to selections and projections.

4.6.1 Methodology

Given a target classification or regression task, the most direct way of evaluating the quality of an anonymization is by training each target model using the anonymized data, and evaluating the resulting models using predictive accuracy (classification), mean absolute error (regression), or similar measures. We will call this methodology model evaluation. All of our model evaluation experiments follow a common protocol:

1. The data is first divided into training and testing sets (or 10-fold cross-validation sets), Rtrain and Rtest.

2. The anonymization algorithm determines recoding function φ using only the training set Rtrain. Anonymous view R*train is obtained by applying φ to Rtrain.

3. The same recoding function φ is then applied to the testing set (Rtest), yielding R*test.

4. The classification or regression model is trained using R*train, and tested using R*test.

This experimental design is different from the setup used by Fung et al. [38] for an important reason. In [38], the combined training and testing sets were anonymized using a single-dimensional recoding algorithm based on information gain. Following this step, the data was separated into training and testing sets. In our opinion, this setup is inappropriate for evaluating the anonymization algorithm because incorporating the test set when choosing a recoding is tantamount to looking at the test set while doing feature selection. Instead, all of our experiments hold out the test set during both the anonymization and training phases.
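In code, the protocol looks roughly like the following sketch. It is ours, not the dissertation's setup (the experiments below use Weka); scikit-learn is used only for illustration, and fit_recoding / apply_recoding are hypothetical placeholders for an anonymization algorithm such as InfoGain Mondrian and its recoding function.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def model_evaluation(records, labels, fit_recoding, apply_recoding, k):
    # Step 1: split into training and testing sets.
    X_train, X_test, y_train, y_test = train_test_split(records, labels, test_size=0.3)
    # Step 2: learn the recoding function phi from the training set only.
    phi = fit_recoding(X_train, y_train, k)
    X_train_anon = apply_recoding(phi, X_train)
    # Step 3: apply the same recoding function to the held-out test set.
    X_test_anon = apply_recoding(phi, X_test)
    # Step 4: train on anonymized training data, evaluate on anonymized test data.
    model = DecisionTreeClassifier().fit(X_train_anon, y_train)
    return model.score(X_test_anon, y_test)   # predictive accuracy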


Throughout these experiments, we used k-anonymity as the anonymity requirement. We fixed the set of quasi-identifier attributes and features to be the same, and we used the implementations of the following learning algorithms provided by the Weka software package [92]:

• Decision Tree (J48)  Default settings were used.

• Naive Bayes  Supervised discretization was used for continuous attributes; otherwise all default settings were used.

• Random Forests  Each classifier was comprised of 40 random trees, and all other default settings were used.

• Support Vector Machine (SMO)  Default settings, including a linear kernel function.

• Linear Regression  Default settings were used.

• Regression Tree (M5)  Default settings were used.

In addition to model evaluation, we also measured certain characteristics of the anonymized training data to see if there was any correlation between these simpler measures and the results of the model evaluation. Specifically, we measured the average equivalence class size (see Section 3.2), and for classification tasks, we measured the conditional entropy of the class label C, given the partitioning of the full input data R into R1, ..., Rm (see Equation 4.1).

4.6.2 Learning from Regions

When single-dimensional recoding is used, standard learning algorithms can be applied directly to the resulting point data, notwithstanding the "coarseness" of some points [38]. Although multidimensional recoding techniques are more flexible, using the resulting hyper-rectangular data to train standard data mining models poses an additional challenge.

To address this problem, we make a simple observation. Because we restrict the recoding regions to include only d-dimensional hyper-rectangles, each region can be uniquely represented as a point in (2 ∗ d)-dimensional space. For example, Figure 4.4 shows a 2-dimensional rectangle,


Figure 4.4 Mapping a d-dimensional rectangular region to 2 ∗ d attributes

and its unique representation as a 4-tuple. This assumes a total order on the values of each attribute,

similar to the assumption made by support vector machines.

Following this observation, we adopt a simple pre-processing technique for learning from regions. Specifically, we extend the recoding function φ to map data points to d-dimensional regions, and in turn, to map these regions to their unique representations as points in (2 ∗ d)-dimensional space.
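A minimal sketch of this region-to-point mapping (ours, for illustration) follows; a region is assumed to be given as one (low, high) interval per quasi-identifier attribute.

def region_to_point(region):
    # Concatenate the lower and upper endpoints of each interval, turning a
    # d-dimensional hyper-rectangle into a point in 2*d dimensions.
    point = []
    for low, high in region:
        point.extend([low, high])
    return tuple(point)

# For example, the rectangle Age in [30, 40], Hours in [35, 60] becomes the
# single 4-dimensional training instance (30, 40, 35, 60), as in Figure 4.4.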

Our primary goal in developing this technique is to establish the utility of our anonymization

algorithms. There are many possible approaches to the general problem of learning from regions.

For example, Zhang and Honavar proposed an algorithm for learning decision trees from attribute

values at various levels of a taxonomy tree [99]. Alternatively, we could consider assigning a density to each multidimensional region, and then sampling point data according to this distribution.

However, a full comparison is beyond the scope of this work.

4.6.3 Experimental Data

Our experiments used both synthetic and real-world data. The synthetic data was produced using an implementation of the generator described by Agrawal et al. for testing classification algorithms [7]. This generator is based on a set of predictor attributes, and class labels are generated as functions of the predictor attributes (see Tables 4.1 and 4.2).

In addition to the synthetic data, we also used two real-world data sets. The first (Table 4.3)

was derived from a sample of the 2003 Public Use Microdata, distributed by the United States


Attribute    Distribution
salary       Uniform in [20,000, 150,000]
commission   If salary ≥ 75,000, then 0; else Uniform in [10,000, 75,000]
age          Uniform integer in [20, 80]
elevel       Uniform integer in [0, 4]
car          Uniform integer in [1, 20]
zipcode      Uniform integer in [0, 9]
hvalue       zipcode * h * 100,000, where h uniform in [0.5, 1.5]
hyears       Uniform integer in [1, 30]
loan         Uniform in [0, 500,000]

Table 4.1 Experimental Data Description: Synthetic features / quasi-identifier attributes


Function   Class A
C2    ((age < 40) ∧ (50K ≤ salary ≤ 100K)) ∨
      ((40 ≤ age < 60) ∧ (75K ≤ salary ≤ 125K)) ∨
      ((age ≥ 60) ∧ (25K ≤ salary ≤ 75K))
C4    ((age < 40) ∧ ((elevel ∈ {0, 1}) ? (25K ≤ salary ≤ 75K) : (50K ≤ salary ≤ 100K))) ∨
      ((40 ≤ age < 60) ∧ ((elevel ∈ {1, 2, 3}) ? (50K ≤ salary ≤ 100K) : (75K ≤ salary ≤ 125K))) ∨
      ((age ≥ 60) ∧ ((elevel ∈ {2, 3, 4}) ? (50K ≤ salary ≤ 100K) : (25K ≤ salary ≤ 75K)))
C5    ((age < 40) ∧ ((50K ≤ salary ≤ 100K) ? (100K ≤ loan ≤ 300K) : (200K ≤ loan ≤ 400K))) ∨
      ((40 ≤ age < 60) ∧ ((75K ≤ salary ≤ 125K) ? (200K ≤ loan ≤ 400K) : (300K ≤ loan ≤ 500K))) ∨
      ((age ≥ 60) ∧ ((25K ≤ salary ≤ 75K) ? (300K ≤ loan ≤ 500K) : (100K ≤ loan ≤ 300K)))
C6    ((age < 40) ∧ (50K ≤ (salary + commission) ≤ 100K)) ∨
      ((40 ≤ age < 60) ∧ (75K ≤ (salary + commission) ≤ 125K)) ∨
      ((age ≥ 60) ∧ (25K ≤ (salary + commission) ≤ 75K))
C7    disposable = .67 × (salary + commission) − .2 × loan − 20K;  Class A iff disposable > 0
C9    disposable = .67 × (salary + commission) − 5000 × elevel − .2 × loan − 10K;  Class A iff disposable > 0

Table 4.2 Experimental Data Description: Synthetic class label functions


Attribute            Distinct Vals    Generalization
Region               57               hierarchy
Age                  77               continuous
Citizenship          5                hierarchy
Marital Status       5                hierarchy
Education (years)    17               continuous
Sex                  2                hierarchy
Hours per week       93               continuous
Disability           2                hierarchy
Race                 9                hierarchy
Salary               2 / continuous   target

Table 4.3 Experimental Data Description: Census

Census American Community Survey⁴, with target attribute Salary. This data was used for both classification and regression, and contained 49,657 records. For classification, we replaced the numeric Salary with a Salary class (< 30K or ≥ 30K); approximately 56% of the data records had Salary < 30K. For classification, this is similar to the Adult database [17]. However, we chose to compile this new data set that can be used for both classification and regression.

The second real data set is the smaller Contraceptives database from the UCI Repository (Table 4.4), which contained 1,473 records after removing those with missing values. This data includes nine socio-economic indicators, which are used to predict the choice of contraceptive method (long-term, short-term, or none) among sampled Indonesian women.

4.6.4 Comparison with Previous Algorithms

InfoGain Mondrian and Regression Mondrian use both multidimensional recoding and classification-

and regression-oriented splitting heuristics. In this section, we evaluate the effects of these two

⁴ http://www.census.gov/acs/www/index.html


Attribute              Distinct Vals    Generalization
Wife’s age             34               continuous
Wife’s education       4                hierarchy
Husband’s education    4                hierarchy
Children               15               continuous
Wife’s religion        2                hierarchy
Wife working           2                hierarchy
Husband’s Occupation   4                hierarchy
Std. of Living         4                continuous
Media Exposure         2                hierarchy
Contraceptive          3                target

Table 4.4 Experimental Data Description: Contraceptives


components through a comparison with two previous anonymization algorithms. All of the experiments in this section consider a single target model, constructed over the entire anonymized training set.

Several previous algorithms have incorporated a single target classification model while choosing a single-dimensional recoding [38, 47, 88]. To understand the impact of multidimensional recoding, we compared InfoGain Mondrian and the greedy Top-Down Specialization (TDS) algorithm [38]. Also, we compared InfoGain and Regression Mondrian to Median Mondrian (see Section 3.4) to measure the effects of incorporating a single target model.

The first set of experiments used the synthetic classification data. Notice that the basic labeling functions in Table 4.2 include a number of constants (e.g., 75K). In order to get a more robust understanding of the behavior of the various anonymization algorithms, for functions 2, 4, and 6, we instead generated many independent data sets, varying the function constants independently at random over the range of the attribute. Additionally, we imposed hierarchical generalization constraints on attributes elevel and car.

Figure 4.5 compares the predictive accuracy of classifiers trained on data produced by the

different anonymization algorithms. In these experiments, we generated 100 independent training

and testing sets, each containing 1000 records, and we fixedk = 25. The results are averaged

across these 100 trials. For comparison, we also include theaccuracies of classifiers trained on the

(not anonymized) original data.

InfoGain Mondrian consistently outperforms both TDS and Median Mondrian, a result that

is overwhelmingly significant based on a series of paired t-tests. It is important to note that the

pre-processing step used to convert regions to points (Section 4.6.2) is only used for the mul-

tidimensional recodings; the classification algorithms run unmodified on the single-dimensional

recodings produced by TDS [38]. Thus, should a better technique be developed for learning from

regions, this would improve the results for InfoGain Mondrian, but it would not affect TDS.5

5 Note that by mapping to 2d dimensions, we effectively expand the hypothesis space considered by the linear SVM. Thus, it is not surprising that this improves accuracy for the non-linear class label functions.


[Figure: four accuracy panels, one per learner: J48, Naive Bayes, Random Forests, SVM.]

Figure 4.5 Classification-based model evaluation using synthetic data (k = 25)


[Figure: four accuracy panels: J48 (Census), Naive Bayes (Census), Random Forests (Census), J48 (Contraceptives).]

Figure 4.6 Classification-based model evaluation using real-world data


We performed a similar set of experiments using the real-world data. Figure 4.6 shows results

for the Census classification data, for increasing k. The graphs show test set accuracy (averaged

across 10 folds) for three learning algorithms. The variance across the folds was quite low, and the

differences between InfoGain Mondrian and TDS, and between InfoGain Mondrian and Median

Mondrian, were highly significant based on paired t-tests.

It is important to point out that in certain cases, notably Random Forests, the learning algorithm

overfits the model when trained using the original data. For example, the model for the original

data gets 97% accuracy on the training set, but only 73% accuracy on the test set. When overfitting

occurs, it is not surprising that the models trained on anonymized data obtain higher accuracy

because anonymization serves as a form of feature selection/construction. Interestingly, we also

tried applying a traditional form of feature selection (ranked feature selection based on information

gain) to the original data, and this did not improve the accuracy of random forests for any number

of chosen attributes. We suspect that this discrepancy is due to the flexibility of the recoding

techniques. Single-dimensional recoding (TDS) is more flexible than traditional feature selection

because it can incorporate attributes at varying levels of granularity. Multidimensional recoding is

more flexible still because it incorporates different attributes (at different levels of granularity) for

different data subsets.

We performed the same set of experiments using the Contraceptives database, and observed

similar behavior. InfoGain Mondrian yielded higher accuracy than TDS or Median Mondrian.

Results for J48 are shown in Figure 4.6.

Next, Figure 4.7 shows conditional entropy and average equivalence class size measurements,

averaged across the ten anonymized training folds of the Census classification data. Average equiv-

alence class size, which does not take into account any characteristics of the workload, is not a very

good indicator of model accuracy. Conditional entropy, which incorporates the target class label,

is a much better indicator; low conditional entropy generally indicates a higher-accuracy classification model.

For regression, we found that Regression Mondrian generally led to better models than Median

Mondrian. Figure 4.8(a) shows the mean absolute test set error for the M5 regression tree and a

linear regression using the Census regression data.


[Figure: two panels: Conditional Entropy (Census), Avg. Equivalence Class Size (Census).]

Figure 4.7 General-purpose quality measures using real-world data

[Figure: two panels: M5 Regression Tree (Census), Linear Regression (Census).]

Figure 4.8 Regression-based model evaluation using real-world data


4.6.5 Multiple Target Models

In Section 4.3.3 we described a simple adaptation to the basic InfoGain Mondrian algorithm

that allowed us to incorporate more than one target attribute, expanding the set of models for

which a particular anonymization is “optimized.” To evaluate this technique, we performed a set

of experiments using the synthetic classification data, increasing the number of class labels.

Figure 4.9 shows average test set accuracies for J48 and Naive Bayes. We first generated 100

independent training and testing sets, containing 1000 records each. We used synthetic labeling

functions 2-6, 7, and 9 from the Agrawal generator [7], randomly varying the constants in functions

2-6 as described in Section 4.6.4.

Each column in the figure (models A-G) represents the average of 25 random permutations of

the synthetic functions. The anonymizations (rows in the figure) are “optimized” for an increasing

number of target models. (For example, the anonymization in the bottom row is optimized exclu-

sively for model A.) There are two important things to note from the chart, and similar behavior

was observed for the other classification algorithms.

• Looking at each model (column) individually, when the model is included in the anonymiza-

tion (above the bold line), test set accuracy is higher than when the model is not included

(below the line).

• As we increase the number of included models (moving upward above the line within

each column), the test set accuracy tends to decrease. This is because the quality of the

anonymization with respect to each individual model is “diluted” by incorporating additional

models.

4.6.6 Privacy-Utility Tradeoff

In Section 4.3.4, we noted that there are certain cases where the tradeoff between privacy and

utility is (more or less) explicit, provided that conditional entropy is a good indicator of classifi-

cation accuracy. In particular, this occurs when the set of features is the same as the set of quasi-

identifiers, and the sensitive attribute is the same as the class label or numeric target attribute.


[Figure: two accuracy matrices (models A-G vs. anonymizations), one panel for J48 and one for Naive Bayes.]

Figure 4.9 Classification-based model evaluation for multiple models (k = 25)


In this section, we illustrate this empirically. Specifically, we conducted an experiment using

entropy ℓ-diversity as the only anonymity requirement (i.e., k = 1), for increasing values of param-

eter ℓ. We again used the Census classification data, and this time let the salary class attribute be

both the sensitive attribute and the class label. For each ℓ value, we conducted an anonymization

experiment, measuring the average conditional entropy of the resulting data (across the 10 folds),

as well as the average test set classification accuracy.

The results are shown in Figure 4.10. As expected, the conditional entropy (across resulting

partitions) increases for increasing ℓ.6 Also, it is not surprising that the classification accuracy

slowly deteriorates with increasing ℓ.

4.6.7 Selection

In Section 4.4, we discussed the importance of preserving selections, and described an algo-

rithm for incorporating rectangular selection predicates into an anonymization. We conducted an

experiment using the synthetic data (1,000 generated records), but treating synthetic Function C2

as a selection predicate. Figure 4.11 shows the imprecision of this selection when evaluated using

the recoded data. The figure shows results for data recoded using three different anonymization

algorithms. The first algorithm is Median Mondrian, with greedy recursive splits chosen from

amongst all of the quasi-identifier attributes. It also shows a restricted variation of Median Mon-

drian, where splits are made with respect to only Age and Salary. Finally, it shows the results of

Selection Mondrian, incorporating Function C2 as three separate rectangular query regions (each

with equal weight). It is intuitive that imprecision increases with k, and that imprecision is reduced

by incorporating the selection into the anonymization.

Incorporating selections can also affect model quality. In the absence of selections, InfoGain

and Regression Mondrian choose recursive splits using a greedy criterion driven by the target

model(s). When selections are included, the resulting partitions may not be the same as those that

6 When ℓ = 1, the conditional entropy is greater than 0 due to a small number of records in the original data with identical feature vectors, but differing class labels.


[Figure: three panels: J48 (Census), Naive Bayes (Census), Conditional Entropy (Census).]

Figure 4.10 ℓ-Diversity experiment



Figure 4.11 Imprecision for synthetic Function C2

would be chosen based on the target model(s). In the worst case, there may be a selection on an

attribute that is uncorrelated with the target attribute.

To test this intuition, we performed an experiment using the Census classification data. To

simulate the effect of selections that are uncorrelated with the target model, we first assigned each

training tuple to one of n groups, chosen uniformly at random. (We assume |R|/n ≥ k.) This

mimics the behavior of Selection Mondrian for a set of equality selections on a new attribute,

Group number, which takes values 1, ..., n. We then anonymized each group independently, using

either InfoGain Mondrian or Median Mondrian. Once recodings were determined for each training

group, we randomly assigned each test tuple to one of the n groups, and recoded the tuple using

the recoding function for that group. Finally, we trained a single classification model using the full

recoded training set (union of all training groups), and tested using the full recoded test set. This

process was repeated for each of ten folds.
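The following is a minimal Python sketch of this procedure, purely for illustration: the anonymize (returning a per-group recoding function), train_model, and evaluate callables are hypothetical stand-ins, not the dissertation's implementation.

import random

def selection_experiment(train, test, n, anonymize, train_model, evaluate):
    # Assign each training tuple to one of n groups, uniformly at random.
    groups = [[] for _ in range(n)]
    for t in train:
        groups[random.randrange(n)].append(t)
    # Anonymize each group independently; anonymize() is assumed to return
    # that group's recoding function phi.
    recoders = [anonymize(g) for g in groups]
    recoded_train = [phi(t) for g, phi in zip(groups, recoders) for t in g]
    # Recode each test tuple using the recoding function of a random group.
    recoded_test = [recoders[random.randrange(n)](t) for t in test]
    # Train a single model on the union of the recoded training groups.
    model = train_model(recoded_train)
    return evaluate(model, recoded_test)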

The results of this selection experiment for J48 are shown in Figure 4.12, for increasing n

and k = 50. As expected, accuracy decreases slightly as the number of selections (n) increases.

However, several selections can be incorporated without large negative effects. Similar results

were observed for the other classification algorithms.


[Figure: two panels: J48 with Selection (Census), J48 with Projection (Census).]

Figure 4.12 Selection and projection experiment

4.6.8 Projection

In certain cases, the data recipient will not use all released attributes when constructing a model.

Instead, he or she will build the model using only a projected subset of attributes. In our exper-

iments, we have found that single-dimensional recoding often preserves precise values for fewer

attributes than does multidimensional recoding. (This can often be attributed to non-uniform data

distributions, as described in Section 3.3.2.)

In this section, we describe an experiment comparing anonymization algorithms when only a

subset of the released features is used in constructing a particular model. In this experiment, we

first ranked the set of all features using the original data and a greedy information gain heuristic.

We then removed the features in order, from most to least predictive, and constructed classification

models using the remaining attributes. We fixed k = 100.

As expected, test set accuracy decreases as the most predictive features are dropped. However,

the rate of this decline varies depending on the anonymization algorithm used. Figure 4.12 shows

the observed accuracies for J48 using the Census database. Because of the single-dimensional

recoding pattern, which preserves fewer attributes, this rate of decay is the most precipitous for

TDS. The results were similar for the other classification algorithms and the Contraceptives data.


4.7 Chapter Summary

In this chapter, we observed that the most direct way of measuring data quality is based on

the workload (or set of tasks) for which the released data will ultimately be used. Further, our

experimental study indicates that simple measures such as average equivalence class size are not

necessarily indicative of quality with respect to any particular workload.

Following these observations, we introduced a simple language for describing a target “family”

of workloads, and developed a set of extensions for incorporating these workloads into the Mon-

drian partitioning framework. An extensive experimental study indicates that this approach works

quite well in practice.


Chapter 5

Rothko: Scalable Variations of Mondrian

Mark Rothko, born Marcus Rothkowitz (September 25, 1903 - February 25, 1970), was a Latvian-

American painter who is classified as an abstract expressionist.

In recent years, numerous algorithms have been proposed for anonymous generalization, clus-

tering, and microaggregation [3, 6, 5, 14, 30, 38, 47, 53, 88, 62, 78, 82], but few have considered

data sets larger than main memory. This chapter describes and evaluates two external adaptations

of the Mondrian algorithmic framework. We refer to the scalable variations as Rothko.

The first variation is based on ideas from the RainForest scalable decision tree algorithms

[41]. Although the basic structure of the algorithm is similar to RainForest, there were several

technical problems we had to address. First, in order to choose an allowable split (according

to a given split criterion and anonymity requirement), we need to choose an appropriate set of

count statistics; those used in RainForest are not always sufficient. Also, we note that in the

anonymization problem, the resulting partition tree does not necessarily fit in memory, and we

propose techniques addressing this problem.

The second variation takes a different approach, based on sampling. The main idea is to use a

sample of the input data set R (that fits in memory), and to build the partition tree optimistically

according to the sample. Any split made in error is subsequently undone; thus, the output is

guaranteed to satisfy all given anonymity requirements. We find that, for reasonably large sample

sizes, this algorithm also generally results in a minimal partitioning.

Page 112: ANONYMITY IN DATA PUBLISHING AND DISTRIBUTION by …pages.cs.wisc.edu/~lefevre/LeFevre_Dissertation.pdfANONYMITY IN DATA PUBLISHING AND DISTRIBUTION by Kristen Riedt LeFevre A dissertation

95

5.1 Previous Scalable Algorithms

The Incognito algorithm described in Chapter 2 operated on external (disk-resident) data, but

the complexity of the algorithm was exponential in the number of attributes in the quasi-identifier,

making it impractical in many situations.

In the context of location-based services, Mokbel et al. proposed using a scalable grid-based

structure to implement k-anonymity [65]. However, the proposed algorithms were not designed to

incorporate additional anonymity requirements (e.g., ℓ-diversity) or workload-oriented splitting

heuristics (e.g., InfoGain splitting). Also, they were designed to handle 2-dimensional spatial data,

and it is not immediately clear how they would scale to data with higher dimensionality.

To the best of our knowledge, all of the other proposed algorithms were designed to handle

only memory-resident data, and none has been evaluated with respect to data substantially larger

than the available memory.

5.2 Exhaustive Algorithm (Rothko-T)

Our first algorithm, which we call Rothko-Tree (or Rothko-T), leverages several ideas originally

proposed as part of the RainForest scalable decision tree framework [41]. Like Mondrian, decision

tree construction typically involves a greedy recursive partitioning of the domain (feature) space.

For decision trees, Gehrke et al. observed that split attributes (and thresholds) could be chosen

using a set of count statistics, typically much smaller than the full input data set [41].

In many cases, allowable splits can be chosen greedily in Mondrian using related count statis-

tics, each of which is typically much smaller than the size of the input data.

• Median / k-Anonymity Under k-anonymity and Median partitioning, the split attribute (and

threshold) can be chosen using what we will call an AV group. The AV set of attribute A for

tuple set R is the set of unique values of A in R, each paired with an integer indicating the

number of times it appears in R (i.e., SELECT A, COUNT(*) FROM R GROUP BY A).

The AV group is the collection of AV sets, one per quasi-identifier attribute.

Page 113: ANONYMITY IN DATA PUBLISHING AND DISTRIBUTION by …pages.cs.wisc.edu/~lefevre/LeFevre_Dissertation.pdfANONYMITY IN DATA PUBLISHING AND DISTRIBUTION by Kristen Riedt LeFevre A dissertation

96

• InfoGain / k-Anonymity When the split criterion is InfoGain, each AV set (group) must be

additionally augmented with the class label, producing an AVC set (group), as described in

[41] (i.e., SELECT A, C, COUNT(*) FROM R GROUP BY A, C).

• Median / ℓ-Diversity In order to determine whether a candidate split is allowable under ℓ-

diversity, we need to know the joint distribution of attribute values and sensitive values, for

each candidate split attribute (i.e., SELECT A, S, COUNT(*) FROM R GROUP BY A, S).

We call this the AVS set (group).

• InfoGain / ℓ-Diversity Finally, when the split criterion is InfoGain, and the anonymity con-

straint is ℓ-diversity, the allowable split yielding maximum information gain can be chosen

using both the AVC and AVS groups.

Throughout the rest of the chapter, when the anonymity requirement and split criterion are clear

from context, we will interchangeably refer to the above as frequency sets and frequency groups.
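As a concrete illustration (a sketch, not the actual implementation), the following Python computes these frequency sets in one scan over a tuple set, and derives a k-anonymous median threshold from an AV set; the dictionary-of-Counters representation and function names are assumptions made for brevity.

from collections import Counter

def frequency_groups(tuples, qi_attrs, class_attr=None, sensitive_attr=None):
    # One scan over the partition:
    #   AV sets  : counts of each quasi-identifier value
    #   AVC sets : counts of (value, class label) pairs      (InfoGain splitting)
    #   AVS sets : counts of (value, sensitive value) pairs  (l-diversity checks)
    av  = {a: Counter() for a in qi_attrs}
    avc = {a: Counter() for a in qi_attrs} if class_attr else None
    avs = {a: Counter() for a in qi_attrs} if sensitive_attr else None
    for t in tuples:
        for a in qi_attrs:
            av[a][t[a]] += 1
            if avc is not None:
                avc[a][(t[a], t[class_attr])] += 1
            if avs is not None:
                avs[a][(t[a], t[sensitive_attr])] += 1
    return av, avc, avs

def allowable_median_threshold(av_set, k):
    # Median threshold for one numeric attribute, but only if splitting there
    # leaves at least k tuples on each side (the k-anonymity requirement).
    total, running = sum(av_set.values()), 0
    for v in sorted(av_set):
        running += av_set[v]
        if running * 2 >= total:
            return v if running >= k and total - running >= k else None
    return None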

When the anonymity requirement is variance diversity, or the split criterion is Regression,

the analogous summary counts (e.g., the joint distribution of attribute A and a numeric sensitive

attribute S, or numeric target attribute T) are likely to be prohibitively large. We return to this

issue in Section 5.3.

In the remainder of this section, we describe a scalable algorithm for k-anonymity and/or ℓ-

diversity (using Median or InfoGain splitting) based on these summary counts. In each case,

the output of the scalable algorithm is identical to the output of the corresponding in-memory

algorithms described in Chapters 3 and 4.

5.2.1 Algorithm Overview

The recursive structure of Rothko-T follows that of RainForest [41], and we assume that at

least one frequency group will fit in memory. In the simplest case, the algorithm begins at the root

of the partition tree, and scans the input data (R) once to construct the frequency group. Using

this, it chooses an allowable split attribute (and threshold), according to the given split criterion.

Then, it scans R once more, and writes each tuple to a disk-resident child partition, as designated

Page 114: ANONYMITY IN DATA PUBLISHING AND DISTRIBUTION by …pages.cs.wisc.edu/~lefevre/LeFevre_Dissertation.pdfANONYMITY IN DATA PUBLISHING AND DISTRIBUTION by Kristen Riedt LeFevre A dissertation

97


Figure 5.1 Example: Rothko-T

by the chosen split. The algorithm proceeds recursively, in a depth-first manner, dividing each of

the resulting partitions (Ri) according to the same procedure.

Once the algorithm descends far enough into the partition tree, it will reach a point where the

data in each leaf partition is small enough to fit in memory. At this point, a sensible implementation

loads each partition (individually) into memory, and continues to apply the recursive procedure in

memory.

When multiple frequency groups fit in memory, the simple algorithm can be improved to take

better advantage of the available memory, using an approach reminiscent of the RainForest hybrid

algorithm. In this case, the algorithm first scans R, choosing the split attribute and threshold

using the resulting frequency group. Now, suppose that there is enough memory available to

(simultaneously) hold the frequency groups for all child partitions. Rather than repartitioning the

data across the children, the algorithm proceeds in a breadth-first manner, scanning R once again

to create frequency groups for all of the children.

Because the number of partitions grows exponentially as the algorithm descends in the tree,

it will likely reach a level at which all frequency groups no longer fit in memory. At this point,

it divides the tuples in R across the leaves, writing these partitions to disk. The algorithm then

proceeds by calling the procedure recursively on each of the resulting partitions. Again, when

each leaf partition fits in memory, a sensible implementation switches to the in-memory algorithm.

Page 115: ANONYMITY IN DATA PUBLISHING AND DISTRIBUTION by …pages.cs.wisc.edu/~lefevre/LeFevre_Dissertation.pdfANONYMITY IN DATA PUBLISHING AND DISTRIBUTION by Kristen Riedt LeFevre A dissertation

98

Example 5.1 (Rothko-T) Consider input tuple set (R), and suppose there is enough memory

available to hold 2 frequency groups for R. The initial execution of the algorithm is depicted

in Figure 5.1.

Initially, the algorithm scans R once to create the frequency group for the root (1) and chooses

the best allowable split (provided that one exists). (In this example, all of the splits are binary.)

Then, the algorithm scans R once more to construct the frequency groups for the child nodes (2

and 3), and chooses the best allowable splits for these nodes.

Following this, the four frequency groups for the next level of the tree will not fit in memory,

so the data is divided into partitions R1, ..., R4. The procedure is then called recursively on each of

the resulting partitions.
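A schematic Python rendering of this control flow appears below. It is only a sketch under simplifying assumptions (disk-resident partitions are stood in for by in-memory lists, the memory threshold TM is a made-up constant, and only Median splitting with k-anonymity is shown), not the dissertation's C++ implementation.

from collections import Counter

TM = 2_000_000   # hypothetical number of tuples that fit in memory

def median_split(partition, attrs, k):
    # Choose an allowable median split from the partition's AV group, if any.
    for a in attrs:
        av_set = Counter(t[a] for t in partition)
        total, running = len(partition), 0
        for v in sorted(av_set):
            running += av_set[v]
            if running * 2 >= total:                 # v is the median of attribute a
                if running >= k and total - running >= k:
                    return a, v
                break
    return None

def rothko_t(partition, attrs, k):
    # Depth-first Rothko-T with one frequency group in memory at a time.
    if len(partition) <= TM:
        return [partition]       # the real algorithm switches to in-memory Mondrian here
    split = median_split(partition, attrs, k)        # scan 1: frequency group + split
    if split is None:
        return [partition]                           # leaf: no allowable split
    a, v = split
    left  = [t for t in partition if t[a] <= v]      # scan 2: route tuples to children
    right = [t for t in partition if t[a] > v]
    return rothko_t(left, attrs, k) + rothko_t(right, attrs, k)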

5.2.2 Recoding Function Scalability

The previous section highlights an additional problem. Because the decision trees considered

by Gehrke et al. were of approximately constant size, it was reasonable to assume that the resulting

tree structure itself would fit in memory [41]. Unfortunately, this is often not true of our problem.

Instead, we implemented a simple scalable technique for materializing the multidimensional

recoding function φ. Notice that each path from root to leaf in the partition tree defines a rule, and

the set of all such rules defines the global recoding function φ. For example, in Figure 3.4, (Age <

40) ∧ (Nationality ∈ European) → 〈[0−40], European〉 is one such rule.

The set of recoding rules can be constructed in a scalable way, without fully materializing the

tree. In the simplest case, when only one frequency group fits in memory, the algorithm works

in a purely depth-first manner. At the end of each depth-first branch, we write the corresponding

rule (the path from root to leaf) to disk. This simple technique guarantees that the amount of

information stored in memory at any one time is proportional to the height of the tree, which

grows only as a logarithmic function of the data.
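Sketched in Python (reusing the hypothetical median_split helper from the Rothko-T sketch above, and again simulating disk partitions with lists), the depth-first rule emission looks roughly as follows; the tuple-of-predicates rule representation is an assumption.

def emit_recoding_rules(partition, attrs, k, path=(), rules=None):
    # Depth-first construction of the recoding rules: at the end of each
    # branch, record the root-to-leaf path of split predicates instead of
    # materializing the whole partition tree (a real implementation would
    # write each rule to disk rather than appending it to a list).
    if rules is None:
        rules = []
    split = median_split(partition, attrs, k)
    if split is None:
        rules.append(path)                 # e.g. (('Age', '<=', 40), ('Age', '>', 25))
        return rules
    a, v = split
    left  = [t for t in partition if t[a] <= v]
    right = [t for t in partition if t[a] > v]
    emit_recoding_rules(left,  attrs, k, path + ((a, '<=', v),), rules)
    emit_recoding_rules(right, attrs, k, path + ((a, '>',  v),), rules)
    return rules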

When more memory is available for caching frequency groups, the amount of space is slightly

larger, due to the periods of breadth-first partitioning, but the approach still consumes much less

space than materializing the entire tree.

Page 116: ANONYMITY IN DATA PUBLISHING AND DISTRIBUTION by …pages.cs.wisc.edu/~lefevre/LeFevre_Dissertation.pdfANONYMITY IN DATA PUBLISHING AND DISTRIBUTION by Kristen Riedt LeFevre A dissertation

99

Finally, note that the tree structure is only necessary if it is used to define a global recoding

function that covers the domain space. If we instead choose to represent each resulting region

using summary statistics, then the tree structure need not be materialized. Instead, the summary

statistics can be computed directly from the resulting data partitions.

5.3 Sampling Algorithm (Rothko-S)

In this section, we describe a second scalable algorithm, this time based on sampling. Rothko-

Sampling (or Rothko-S) addresses some of the shortcomings of Rothko-T. Specifically, because

splits are chosen using only memory-resident data, it provides us with the ability to choose split at-

tributes using the Regression split criterion and to check variance diversity. The sampling approach

also often leads to better performance.

The main recursive procedure consists of three phases:

1. (Optimistic) Growth Phase The procedure begins by scanning input tuple set R to obtain a

simple random sample (r) that fits in the available memory. (If R fits in memory, then r =

R.) The procedure then grows the tree, using sample r to choose split attributes (thresholds).

When evaluating a candidate split, it uses the sample to estimate certain characteristics of R,

and using these estimates, it will make a split (optimistically) if it can determine with high

confidence that the split will not violate the anonymity requirement(s) when applied to the

full partition R. The specifics of these tests are described in Section 5.3.1.

2. Repartitioning Phase Eventually, there will be no more splits that can be made with high

confidence based on sample r. If r ⊂ R, then input tuple set R is divided across the leaves

of the tree built during the growth phase.

3. Pruning Phase When r ⊂ R, there is the possibility that certain splits were made in error

during the growth phase. Given a reasonable testing procedure, this won't happen often, but

when a node in the partition tree is found to violate (one of) the anonymity requirement(s),

then all of the partitions in the subtree rooted at the parent of this node are merged. To do

this, during the repartitioning phase, we maintain certain population statistics at each node.

Page 117: ANONYMITY IN DATA PUBLISHING AND DISTRIBUTION by …pages.cs.wisc.edu/~lefevre/LeFevre_Dissertation.pdfANONYMITY IN DATA PUBLISHING AND DISTRIBUTION by Kristen Riedt LeFevre A dissertation

100


Figure 5.2 Example: Rothko-S


(For k-anonymity, this is just a single integer count. For ℓ-diversity or variance diversity, we

construct a frequency histogram over the set of unique sensitive values.)

Finally, the procedure is executed recursively on each resulting partition, R1, ..., Rm. In virtu-

ally all cases, the algorithm will eventually reach a base case where each recursive partition Ri fits

entirely in memory. (There are a few pathological exceptions, which we describe in Section 5.3.2.

These cases typically only arise when an extremely small amount of memory is available.)

Recoding function scalability can be implemented as described in Section 5.2.2. In certain

cases, we stop the growth phase early, for one of three possible reasons. First, if we are construct-

ing a global recoding function, and the tree structure has filled the available memory, we then write

the appropriate recoding rules to disk. Similarly, we repartition the data if the statistics necessary

for pruning (e.g., sensitive frequency histograms) no longer fit in memory. Finally, notice that

repartitioning across a large number of leaves may lead to a substantial amount of non-sequential

I/O if there is not enough memory available to adequately buffer writes. In order to prevent this

from occurring, the algorithm may repartition the data while there still exist high-confidence al-

lowable splits.

Example 5.2 (Rothko-S) Consider an input tuple set R. The algorithm is depicted in Figure 5.2.

The growth phase begins by choosing sample r from R, and growing the partition tree accord-

ingly. When there are no more (high-confidence) allowable splits, R is repartitioned across the

leaves of the tree (e.g., Figure 5.2(a)).

During repartitioning, the algorithm tracks necessary population statistics for each node (e.g.,

total count for k-anonymity). In the example, suppose that Node 7 violates the anonymity require-

ment (e.g., contains fewer than k tuples). In this case, the tree is pruned, and partitions R6, R7, R8

are combined.

Finally, the procedure is executed recursively on data partitions R1, ..., R5, R6 ∪ R7 ∪ R8.
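The three phases can be summarized by the following Python skeleton. It is only a sketch: grow_from_sample, repartition, violates, and merge_into_parent are hypothetical stand-ins for the sample-based tree growth, the disk repartitioning (which also gathers the per-node population statistics), the anonymity check, and the pruning merge, and TM is a made-up constant.

import random

TM = 2_000_000   # hypothetical number of tuples that fit in memory

def rothko_s(partition, requirement, grow_from_sample, repartition,
             violates, merge_into_parent):
    N = len(partition)
    sample = partition if N <= TM else random.sample(partition, TM)

    # 1. (Optimistic) growth phase: split while the hypothesis test rejects H0.
    tree = grow_from_sample(sample, requirement, population_size=N)
    if N <= TM:
        # The sample is the full partition, so no split was made in error.
        return repartition(partition, tree)

    # 2. Repartitioning phase: one scan routes every tuple to a leaf and
    #    accumulates the population statistics needed for the pruning check.
    leaf_partitions = repartition(partition, tree)

    # 3. Pruning phase: undo any split made in error on the sample by merging
    #    the subtree rooted at the violating node's parent.
    for node in tree:
        if violates(node, requirement):
            merge_into_parent(node, tree, leaf_partitions)

    # Recurse on the resulting partitions (a real implementation also stops
    # here if all splits were undone, to avoid thrashing).
    results = []
    for child in leaf_partitions:
        results.extend(rothko_s(child, requirement, grow_from_sample,
                                repartition, violates, merge_into_parent))
    return results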

5.3.1 Estimators & Hypothesis Tests

Rothko-S must often use a sample to check whether a candidate recursive split satisfies the

given anonymity requirement(s). A naive approach performs this check directly on the sample. For


example, under k-anonymity, if input data R contains N tuples, and we have selected a sample of

size n, the naive approach makes a split (optimistically) if each resulting sample partition contains

at least k(n/N) tuples.

Unfortunately, we find that this naive approach can lead to an excessive amount of pruning in

practice (Section 5.5.6). Instead, we propose to perform this check based on a statistical hypothesis

test. In this section, we outline some preliminary methods for performing these tests. We find

that, while our tests for variance diversity and ℓ-diversity do not make strong guarantees, these

tests produce quite favorable results in practice. Most importantly, the test will never affect the

anonymity of the resulting data because the algorithm always undoes any split made in error.

In the context of splitting, the null hypothesis (H0) can be described informally as stating that

the candidate split is not allowable under the given anonymity requirement. An ideal test would

reject H0 if it can determine (using realistic assumptions) that there is only a small probability

(≤ α) of the split violating the anonymity requirement. During the growth phase, Rothko-S will

make a split (optimistically) if H0 can be rejected with high confidence.

In the following, let R denote the input data (a finite population of tuples), and let N denote

the size (number of tuples) of R. Let r denote a simple random sample of n tuples, drawn uni-

formly without replacement from R (n ≤ N). Consider a candidate split, which divides R into m

partitions R1, ..., Rm. (When applied to sample r, the split yields sample partitions r1, ..., rm.)

5.3.1.1 k-Anonymity

We begin with k-anonymity. Let p = |Ri|/N denote the proportion of tuples from R that would

fall in partition Ri after applying a candidate split to R. Under k-anonymity, H0 and H1 can be

expressed (for Ri) as follows, where p0 = k/N.

H0: p = p0
H1: p ≥ p0

Similarly, let p̂ = |ri|/n. We use proportion p̂ to estimate p. Regardless of the underlying data

distribution, we know by the Central Limit Theorem that p̂ is approximately normally distributed


(for large samples). Thus, we use the following test, rejecting H0 when the expression is satisfied.1

Var_H0 = [p0(1 − p0)/n] · [(N − n)/(N − 1)]

p̂ − p0 ≥ z_{α/m} · √(Var_H0)

There are three important things to note about this test. First, notice that we are simultaneously

testing all m partitions resulting from the split. That is, we want to construct the test so that the

total probability of accepting any Ri containing fewer than k tuples is α. For this reason we use

the Bonferroni correction (α/m).

Also, it is important to remember that we are sampling from a finite population of data (R),

and the fraction of the population that fits in memory (and is included in the sample) grows each

time the algorithm repartitions the data. For this reason, we have defined Var_H0 in terms of the

sampling process, incorporating a finite population correction. Given this correction, notice that

when N = n (i.e., the entire partition fits in memory), then Var_H0 = 0.

Finally, as the growth phase progresses (prior to repartitioning), note that the population (R),

and the sample (r), do not change. The only component of the hypothesis test that changes during

a particular instantiation of the growth phase is p̂, which decreases with each split. Thus, as the

growth phase progresses, it becomes increasingly likely that we will be unable to reject H0, at

which point we repartition the data.
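A direct Python rendering of this test is sketched below, using the standard library's NormalDist for z_{α/m}; the function signature is an illustrative assumption, not the dissertation's interface.

from math import sqrt
from statistics import NormalDist

def split_is_high_confidence(k, N, n, sample_partition_sizes, alpha=0.05):
    # Optimistically accept a candidate split only if, for every resulting
    # partition, H0 can be rejected at level alpha/m (Bonferroni correction),
    # using the sample proportion and the finite population correction above.
    m = len(sample_partition_sizes)
    p0 = k / N
    var_h0 = p0 * (1 - p0) / n * (N - n) / (N - 1)
    z = NormalDist().inv_cdf(1 - alpha / m)            # z_{alpha/m}
    for n_i in sample_partition_sizes:
        p_hat = n_i / n
        if not (p_hat - p0 >= z * sqrt(var_h0)):
            return False                               # cannot reject H0 for this child
    return True

# Example: N = 5,000,000 population tuples, a sample of n = 100,000, k = 1000,
# and a binary split whose sample partitions contain 60,000 and 40,000 tuples.
print(split_is_high_confidence(1000, 5_000_000, 100_000, [60_000, 40_000]))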

5.3.1.2 ℓ-Diversity

When the anonymity requirement is recursive (c, ℓ)-diversity, it is substantially more difficult

to construct the hypothesis test with strong guarantees about α. The technique described in this

section is simply a rule of thumb that accomplishes our practical goals.2

We must use each sample partition (ri) to estimate certain characteristics of the sensitive at-

tribute S within the corresponding population partition (Ri). Let Ni denote the size of population

partition Ri, and let ni denote the size of sample partition ri.

1 zα is the number such that the area beneath the standard normal curve to the right of zα is α.

2 We also considered entropy ℓ-diversity [60], and found it equally difficult to develop a precise test without simulating H0, which is computationally costly.


Recursive (c, ℓ)-diversity can be expressed in terms of two proportions. Let Xj denote the

frequency of the jth most common sensitive value in Ri. Let p1 = X1/Ni and p2 = (Xℓ + ... +

X|dom(S)|)/Ni. Using these proportions,

H0: p1 = c·p2
H1: p1 < c·p2

We use the sample partition (ri) to estimate these proportions. Let xj denote the frequency of

the jth most common sensitive value in ri, and let p̂1 = x1/ni and p̂2 = (xℓ + ... + x|dom(S)|)/ni.

Notice that these estimates make several implicit assumptions. First, they assume that the

domain of sensitive attribute S is known. More importantly, they assume that the ordering of

sensitive value frequencies is the same in Ri and ri. (Clearly, this is not true, but in fact leads to a

conservative bias.) Nonetheless, this is a good starting point.

In order to do the test, we need to estimate the sample variance of c·p̂2 − p̂1. If we assume that

p̂1 and p̂2 are independent (also not true), then

Var(c·p̂2 − p̂1) = c²·Var(p̂2) + Var(p̂1)

An estimator for Var(p̂) is [p̂(1 − p̂)/(ni − 1)] · [(Ni − ni)/Ni], so we estimate the variance as follows.

[c²·p̂2(1 − p̂2) + p̂1(1 − p̂1)] / (ni − 1) · [(Ni − ni)/Ni]

Of course, when choosing a candidate split, we do not know Ni, the size of the ith result-

ing population partition. Instead, we use the overall sampling proportion (n/N) to guide the finite

population correction, which gives us the following estimate.

Var_H0 = [c²·p̂2(1 − p̂2) + p̂1(1 − p̂1)] / (ni − 1) · [(N − n)/N]

Finally, we reject H0 in favor of H1 when the following expression is satisfied, again using the

Bonferroni correction.

c·p̂2 − p̂1 > z_{α/m} · √(Var_H0)


5.3.1.3 Variance Diversity

When the anonymity requirement is variance diversity, our test is again just a rule of thumb.

We again use the sample partition ri to estimate certain characteristics of Ri, namely the variance

of sensitive attribute S. The null and alternative hypotheses (for population partition Ri) can be

expressed as follows.

H0: Var(Ri, S) = v
H1: Var(Ri, S) ≥ v

We use the variance of S within sample partition ri as an estimate of the variance in population

partition Ri.

Var(Ri, S) = [1/(ni − 1)] · Σ_{j=1..ni} (sj − s̄)²

Recall that if each sj is an independent normally-distributed random variable, then the sample

variance of S follows a chi-square distribution. Under this assumption, we reject H0 (for Ri) if the

following holds.3

(ni − 1)·Var(Ri, S) / v ≥ χ²_{α/m} (ni − 1 df)

In reality, S may follow an arbitrary distribution, and because we are sampling from a finite

population, the elements in the sample are not independent. Because this test does not include a

finite population correction as such, when the overall sampling proportion n/N = 1 (which means

that the algorithm is operating on the full data partition), we instead reject H0 when Var(Ri, S) ≥

v, according to the definition of variance diversity.
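The corresponding sketch in Python uses scipy for the chi-square critical value (the standard library has no chi-square inverse CDF); as with the other tests, the signature and data representation are assumptions.

from statistics import variance
from scipy.stats import chi2

def variance_diversity_split_ok(v, N, n, sample_partitions, sensitive_attr, alpha=0.05):
    # Rule-of-thumb check for variance diversity: estimate Var(R_i, S) by the
    # sample variance of S in r_i; reject H0 via the chi-square critical value,
    # or via the exact definition when the whole partition is in memory (n == N).
    m = len(sample_partitions)
    for r_i in sample_partitions:
        s_values = [t[sensitive_attr] for t in r_i]
        n_i = len(s_values)
        var_hat = variance(s_values)               # (1/(n_i - 1)) * sum (s_j - s_bar)^2
        if n == N:
            ok = var_hat >= v
        else:
            crit = chi2.ppf(1 - alpha / m, n_i - 1)    # chi^2_{alpha/m}(n_i - 1 df)
            ok = (n_i - 1) * var_hat / v >= crit
        if not ok:
            return False
    return True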

5.3.2 Discussion

Partitionings produced by Rothko-S are always guaranteed to satisfy the given anonymity re-

quirement(s), provided that the entire input database satisfies the requirement(s). In virtually all

3 χ²_α(n df) is the number such that the area beneath the chi-square density function (with n degrees of freedom) to the right is α.


cases (i.e., when the sample size is not extremely small) the resulting partitioning is also minimal

(see Section 5.5). Potential non-minimality can, however, occur in the following scenario: Sup-

pose the algorithm is operating on only a sample in some recursive instantiation (that is, R is larger

than TM). If there does not exist a single (high-confidence) split that can be made during the

growth phase, then it is possible that the resulting partitioning is non-minimal.4 In this sense, the

potential for non-minimality can be roughly equated with the power of the test. Similarly, if all

splits made during the growth phase are undone during the pruning phase, we stop the algorithm

to avoid thrashing.

There are two other important issues to consider. First, as we mentioned previously, our appli-

cation can withstand some amount of imprecision and bias in the hypothesis test routine because

splits that are made incorrectly based on a sample are eventually undone. However, it is important

for efficiency that this does not happen too often. We continue to explore this issue in Section 5.5.6

as part of the experimental evaluation.

The second important issue to consider is the precision of the sampling-based algorithm with

respect to workload-oriented splitting heuristics (InfoGain and Regression). It is clear that the split

chosen using sample r is not guaranteed to be the same as the split that would be chosen according

to the full partition R. This problem has been studied in the context of a sampling-based decision-

tree construction algorithm (BOAT) [40], and could be similarly addressed in the anonymization

setting using bootstrapping for splits, and subsequent refinement.5

From a practical perspective, however, we find that it is less important in our problem to choose

the optimal split (according to the population) at every step. While decision trees typically seek to

construct a compact structure that expresses an underlying concept, the anonymization algorithm

continues partitioning the domain space until no allowable splits remain. We return to the issue of

sampling and data quality in the experimental evaluation (Section 5.5.5).

4 Of course, in the rare event that this scenario arises in practice, it is easily detected. For k-anonymity, ℓ-diversity, Median and InfoGain splitting, a reasonable implementation would simply switch to Rothko-T for the offending partition.

5 The techniques proposed as part of BOAT would also have to be extended to handle the case where the entire partition tree does not fit in memory.


5.4 Analytical Comparison

In order to lend insight to the experimental evaluation, this section provides a brief analytical

comparison of the I/O behavior of Rothko-T and Rothko-S. For simplicity, we make this com-

parison for numeric data (binary splits), k-anonymity, and partition trees that are balanced and

complete. Obviously, these assumptions do not hold in all cases. Under Median splitting, the par-

tition tree will be only approximately balanced and complete due to duplicate values; for InfoGain

and Regression splitting, the tree is not necessarily balanced and complete. Under ℓ-diversity or

variance diversity, the analysis additionally depends on the distribution of sensitive attribute S.

Nonetheless, the analytical comparison provides valuable intuition for the relative performance of

the two scalable algorithms.

We use the notation described in Figure 5.3, and count the number of disk blocks that are read

and written during the execution of each algorithm.

5.4.1 Rothko-T

We begin with Rothko-T. Recall that once each leaf contains ≤ TM tuples, we switch to the

in-memory algorithm. The height of the partition tree, prior to this switch, is easily computed. (We

assume that k ≪ TM.)

height = max(0, ⌈log2(‖R‖ / TM)⌉)

Regardless of the available memory, the algorithm must scan the full data set height + 1

times. (The final scan imports the data in each leaf before executing the in-memory algorithm.)

As F CACHE increases, an increasing number of “repartitions” are eliminated.6 Thus, the total

number of reads and writes (disk blocks) is as follows:

6 For simplicity, we assume that the size of a frequency group is approximately constant for a given data set. In reality, the number of unique values per partition decreases as we descend in the tree.


repartitions_T = ⌈ height / (⌊log2 F CACHE⌋ + 1) ⌉

reads_T = |R| · (height + repartitions_T + 1)

writes_T = |R| · repartitions_T

It is important to note that, unlike scalable decision trees [41], Rothko-T does not scale linearly

with the size of the data. The reason for this is simple: decision trees typically express a "concept"

of fixed size, independent of the size of the training data. In the anonymization algorithm, however,

the height of the partition tree grows as a function of the input data and parameter k. For Median

partitioning, the height of the full partition tree is approximately ⌊log2(‖R‖ / k)⌋.

5.4.2 Rothko-S

In the case of Rothko-S, the number of repartitions is instead a function of the estimator (rather

than F CACHE). The following recursive function counts the number of times the full data set is

repartitioned under k-anonymity:

repartitions_S(N)
    if (N ≤ TM)
        return 0
    else
        p0 = k / N
        n = min(TM, N)
        levels = max( x ∈ Z : 1/2^x − p0 ≥ z_{α/2} · √( p0(1 − p0)/n · (N − n)/(N − 1) ) )
        if (levels > 0)
            return 1 + repartitions_S( N / 2^levels )
        else    // non-minimal partitioning
            return 0


|R|        Number of disk blocks in input relation R
‖R‖        Number of data tuples in input relation R
TM         Number of data tuples that fit in memory
F CACHE    Number of frequency groups that fit in memory (≥ 1)
height     Height of the partition tree before each leaf partition fits in memory

Figure 5.3 Notation for analytical comparison

The data is scanned once to obtain the initial sample. Each time the data is repartitioned,

the entire data set is scanned, and the new partitions written to disk. Then, each of the resulting

partitions is scanned to obtain the random sample. Thus, the total number of reads and writes (disk

blocks) is as follows:

reads_S = |R| · (2 · repartitions_S(‖R‖) + 1)

writes_S = |R| · repartitions_S(‖R‖)

In practice, we observe that for reasonably large TM (large sample size), the total number of

repartitions is often just 1. In this case, the entire data set is read three times, and written once.
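A Python transcription of the recurrence and the resulting read/write counts is sketched below; it mirrors the pseudocode above, with the standard library's NormalDist used for z_{α/2}, and the same illustrative block/tuple counts as in the Rothko-T sketch.

from math import sqrt
from statistics import NormalDist

def repartitions_s(N, TM, k, alpha=0.05):
    # Number of times Rothko-S repartitions the full data under k-anonymity,
    # assuming balanced binary median splits (mirrors repartitions_S above).
    if N <= TM:
        return 0
    p0 = k / N
    n = min(TM, N)
    z = NormalDist().inv_cdf(1 - alpha / 2)            # z_{alpha/2}
    bound = z * sqrt(p0 * (1 - p0) / n * (N - n) / (N - 1))
    levels = 0
    while 1 / 2 ** (levels + 1) - p0 >= bound:         # deepest level still splittable
        levels += 1
    if levels == 0:
        return 0                                       # non-minimal partitioning
    return 1 + repartitions_s(N / 2 ** levels, TM, k, alpha)

def rothko_s_io(R_blocks, R_tuples, TM, k):
    reps = repartitions_s(R_tuples, TM, k)
    return R_blocks * (2 * reps + 1), R_blocks * reps  # (reads, writes)

print(rothko_s_io(R_blocks=8_400, R_tuples=50_000_000, TM=2_000_000, k=1000))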

5.5 Experimental Performance Evaluation

We conducted an analytical and experimental evaluation, intended to address the following

high-level problems:

• Need for Scalable Algorithm We first seek to demonstrate the need to explicitly manage

memory and I/O when anonymizing large data sets. (Section 5.5.2)


CentOS Linux (xfs file system)

512 MB memory

Intel Pentium 4 2.4 GHz processor

40 GB Maxtor IDE hard drive

(measured 54 MB/sec sequential bandwidth)

gcc version 3.4.4

Table 5.1 Experimental system configuration

• Evaluate and Compare Algorithms One of our main goals is to evaluate and compare

our two scalable algorithms (Rothko-T and Rothko-S). To this end, we perform an exten-

sive experimental comparison of I/O behavior (Section 5.5.3) and total execution time (Sec-

tion 5.5.4).

• Sampling and Data Quality When using a sample, the splits chosen according to the In-

foGain and Regression split heuristics may not be identical to those chosen using the entire

data set. Section 5.5.5 evaluates the practical implications.

• Evaluate Hypothesis Tests Our final set of experiments (Section 5.5.6) evaluates the effec-

tiveness of the optimistic hypothesis-based splitting approach using a sample. By measuring

the frequency of pruning, we show that the approach is quite effective. Also, though just

rules of thumb, the tests described in Section 5.3.1 work quite well.

5.5.1 Experimental Setup

We implemented each of the scalable algorithms using C++. In all cases, disk-resident data

partitions were stored as ordinary files of fixed-width binary-encoded tuples. File reads and writes

were buffered into 256K blocks. Table 5.1 describes our hardware/software configuration. In each

of the experiments, we used a dedicated machine, with an initially cold buffer cache.

Our experiments again made use of the synthetic data generator described in the previous sec-

tion. The quasi-identifier attributes were generated following the distributions described in Ta-

ble 4.1. When necessary, categorical class labels were generated as a function of these attributes


Target   Function T

R7       T = 0.67 × (salary + commission) − 0.2 × loan − 20K

R10      if hyears < 20 then equity = 0
         else equity = 0.1 × hvalue × (hyears − 20)
         T = 0.67 × (salary + commission) − 5000 × elevel + 0.2 × equity − 10K

Table 5.2 Experimental Data Description: Synthetic numeric target functions

(see Table 4.2). In addition, for regression tasks, numeric target attributes were generated using

the functions described in Table 5.2. For these experiments, each quasi-identifier was treated as

numeric (without user-defined generalization hierarchies), and each tuple was 44 bytes.

For the synthetic data, the size of an AV group (Median splitting) was approximately 8.1 MB.

Because the class label attribute has two distinct values, the size of an AVC group (InfoGain

splitting) was approximately 16.2 MB. Also, for the sampling-based algorithm, we fixed α = 0.05

throughout the experimental evaluation.

5.5.2 Need for a Scalable Algorithm

When applied naively to large data sets, the Mondrian algorithms described in Chapters 3

and 4 will often lead to thrashing, with the expected poor performance. To illustrate the need to

explicitly manage memory and I/O, we performed a simple experiment. We ran an in-memory

implementation (also in C++), allowing the virtual memory system to manage memory and I/O.

Figure 5.4 shows I/O behavior and runtime performance, respectively, for Median splitting and k-

anonymity (k = 1000). As expected, the system begins to thrash for data sets that do not fit entirely

in memory. These figures show performance for data sets containing up to 10 million records; in

the remainder of this section, we will show that the scalable algorithms are easily applied to much

larger data sets.


[Figure: two panels: I/O, Runtime.]

Figure 5.4 In-memory implementation for large data sets

5.5.3 Counting I/O Requests

We begin by focusing on the I/O incurred by each of the two proposed algorithms. Each of the

experiments in this section uses Linux /proc/diskstat to count the total number of I/O requests

(in 512 byte blocks) issued to the disk. We also compare the experimental measurements to the

values predicted by the analytical study in Section 5.4. All of the experiments in this section use

Median partitioning and k-anonymity. The results are shown in Figure 5.5.

The first two experiments each used 50 million input tuples, and k = 1000. For Rothko-T,

we fixed TM = 2 million, and varied parameter F CACHE. As expected, increasing F CACHE

reduces the number of I/O requests. However, the marginal improvement obtained from each

additional frequency group is decreasing. In some cases, the observed number of disk reads is

smaller than expected due to file system buffering.

For Rothko-S, we varied the sample size. Notice that for this wide range of sample sizes, the

data was repartitioned just once, meaning that the algorithm read the entire data set 3 times, and

wrote it once. Also, the total amount of I/O is substantially less than that of Rothko-T.

Finally, we performed a scale-up experiment, increasing the data size, and fixing TM =

2 million, k = 1000. Rothko-T is able to exploit the buffer cache to some extent, but the to-

tal amount of I/O is substantially more than Rothko-S.


[Figure: three panels: Rothko-T I/O, Rothko-S I/O, Scale-up I/O.]

Figure 5.5 I/O Cost Comparisons


5.5.4 Runtime Performance

Perhaps more importantly, we evaluated the runtime performance of both proposed algorithms.

All of the experiments in this section use k-anonymity as the anonymity requirement. In each

case, we break down the execution time into three components: (1) User space CPU time, (2)

Kernel space CPU time, and (3) I/O wait time. These statistics were gathered from the system via

/proc/stat.

We begin with Median splitting. The first set of experiments measured scale-up performance,

fixing TM = 2 million, and k = 1000. Figure 5.6 shows results for Rothko-T (F CACHE = 1 and

8) and for Rothko-S. As expected, the sampling-based algorithm was faster, both in terms of total

execution time and CPU time. Additionally, each of the algorithms goes through periods where

execution is I/O-bound. Interestingly, the I/O wait times are similar for Rothko-T (F CACHE = 8)

and Rothko-S. However, this is deceptive. Although Rothko-T does more I/O, it also performs

more in-memory calculations, thus occupying the CPU while the file system flushes the buffer

cache asynchronously.

The second set of experiments considered the effects of parameter k. Results for these exper-

iments are shown in Figure 5.7. As expected, a decreasing value of k leads to more computation.

However, because the algorithms all switch to the in-memory algorithm after some number of

splits, this additional cost falls to the CPU.

Finally, we compared scale-up performance using the InfoGain split criterion, again fixing

TM = 2 million, and k = 1000. For these experiments, we used label function C2 to generate the

class labels. Figure 5.8 shows results for Rothko-T (F CACHE = 1, 4) and Rothko-S. As expected,

the CPU cost incurred by these algorithms is greater than for Median partitioning, particularly due

to the extra cost of finding safe numeric thresholds that maximize information gain. However,

Rothko-S consistently outperforms Rothko-T.7

7 For efficiency, in each case, the recursive partitioning procedure switched to Median partitioning when the information gain resulting from a new split dipped below 0.01. We note that continuing the InfoGain splitting all the way to the leaves is very CPU-intensive, particularly for numeric attributes, because of the required sorting.


[Figure: execution-time breakdowns (user CPU, kernel CPU, and I/O wait) for Rothko-T (F CACHE = 1), Rothko-T (F CACHE = 8), and Rothko-S.]

Figure 5.6 Scale-up performance for Median splitting


[Figure: execution-time breakdowns for varied k, for Rothko-T (F CACHE = 1), Rothko-T (F CACHE = 8), and Rothko-S.]

Figure 5.7 Runtime performance for varied k


[Figure: execution-time breakdowns for Rothko-T (F CACHE = 1), Rothko-T (F CACHE = 4), and Rothko-S.]

Figure 5.8 Scale-up performance for InfoGain splitting


5.5.5 Effects of Sampling on Data Quality

In Section 5.3.2, we discussed some of the potential shortcomings of the sampling-based al-

gorithm, and we noted that one primary concern is imprecision with respect to the InfoGain and

Regression split criteria. In this section, we evaluate the effects of sampling with respect to data

quality. For reasonably large sample sizes, we find that in practice the effect is often minimal.

There are a number of ways to measure data quality. In the interest of simplicity, in these

experiments, when using the InfoGain split criterion, we measured the conditional entropy of the

class label (C) with respect to the partitioning (see Equation 4.1). For Regression splitting, we

measured the weighted mean squared error (see Equation 4.5). Both of these measures relate

directly to the given task and greedy split criteria.
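For illustration, and assuming the usual weighted forms of these measures (the precise weighting in Equations 4.1 and 4.5 may differ slightly), the following sketch computes both quantities from a partitioning represented as a list of groups; the function names are ours.

    from collections import Counter
    from math import log2

    def conditional_entropy(partitions):
        # partitions: one list of class labels per group of the partitioning.
        n = sum(len(g) for g in partitions)
        h = 0.0
        for g in partitions:
            counts = Counter(g)
            h_g = -sum((c / len(g)) * log2(c / len(g)) for c in counts.values())
            h += (len(g) / n) * h_g          # weight each group by its relative size
        return h

    def weighted_squared_error(partitions):
        # partitions: one list of numeric target values per group of the partitioning.
        total = 0.0
        for g in partitions:
            mean = sum(g) / len(g)
            total += sum((y - mean) ** 2 for y in g)   # squared error around the group mean
        return total

Lower values of either measure indicate a partitioning that better preserves the structure needed by the target classification or regression task.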

We performed experiments using both synthetic and real-life data. Results for InfoGain split-

ting are shown in Figure 5.9. Results for Regression splitting are shown in Figure 5.10. For each

experiment using synthetic data, we generated 10 data sets (each containing 100,000 records), and

we increased the sample size. The reported results are averaged across the ten data sets. In the

figures, we circled partitionings that are potentially non-minimal.

Increasing the sample size does lead to a small improvement in quality (decreased entropy or

error). However, for large sample sizes, the difference is very small. In all of our experiments the

sample size had a much smaller impact on quality than the anonymity parameter k.

We also conducted a similar experiment using the Census database described in Figure 4.3.

Again, the improvement in quality gained from increasing the sample size is small.

5.5.6 Hypothesis Tests and Pruning

One of the important components in the design of the sampling-based algorithm is choosing an

appropriate means of checking each anonymity requirement (k-anonymity, ℓ-diversity, and vari-

ance diversity) using a sample. Although the algorithm will always undo splits made in error, it is

important to have a reasonable procedure in order to avoid excessive pruning.

In this section, we evaluate the effectiveness of the hypothesis-based approach described in

Section 5.3.1. As mentioned previously, our hypothesis tests for ℓ-diversity and variance diversity


[Figure: conditional entropy as a function of sample size, for synthetic label functions C2 and C7 and for the Census classification task, at several values of k.]

Figure 5.9 Conditional Entropy (InfoGain splitting)


[Figure: weighted mean squared error (WMSE) as a function of sample size (%), for synthetic data sets R7 and R10 (k = 5000, 1000, 100) and for the Census regression task (k = 500, 100, 50, 10).]

Figure 5.10 WMSE (Regression splitting)


are just “rules of thumb”. Nonetheless, we find that the approach of using a hypothesis test, as well

as the specific tests outlined in Section 5.3.1, actually work quite well in practice.
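To make the idea concrete, the following simplified sketch gives one plausible rule-of-thumb check for k-anonymity on a sample. It is only an illustration based on a normal approximation to the binomial, not the specific tests of Section 5.3.1, and the names and the default confidence parameter are ours.

    from math import sqrt

    def sample_supports_k(sample_count, sample_frac, k, z=1.645):
        # Worst case: the region holds exactly k records in the full data set.
        # Its sample count is then roughly Binomial(k, sample_frac), with mean
        # k * f and variance k * f * (1 - f).  Allow the split only if the observed
        # count exceeds that mean by z standard deviations (a one-sided check).
        mean = k * sample_frac
        std = sqrt(k * sample_frac * (1.0 - sample_frac))
        return sample_count >= mean + z * std

    def split_allowed(region_sample_counts, sample_frac, k):
        # A candidate split is kept only if every resulting region passes the check;
        # otherwise the recursion would risk a split that must later be undone.
        return all(sample_supports_k(c, sample_frac, k) for c in region_sample_counts)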

We again used the synthetic data generator and the Median split criterion. For each experiment,

we used an input of 100,000 tuples, and varied the sample size. In each run, we repartitioned

the data automatically when the height of the tree reached 8 (due to memory limitations for

storing sensitive value histograms under variance diversity).

We conducted experiments using k-anonymity, ℓ-diversity, and variance diversity, each as the

sole anonymity requirement in the respective experiment. For ℓ-diversity, we used zipcode as the

sensitive attribute, and fixed c = 1. For variance diversity, we used salary as the sensitive attribute.

In addition to the uniform salary distribution, we also considered a normal distribution. (The

population variance of the uniform salary is approximately 1.4e9; the population variance of the

normal salary is approximately 1.1e8.)

Figure 5.11 shows our results. Each entry indicates the total number of nodes that were pruned

during the algorithm’s entire execution. The numbers in parentheses indicate the number of nodes

that are pruned when we use a naive approach that does not incorporate hypothesis tests (see

Section 5.3.1). An “x” indicates that the resulting partitioning was (potentially) non-minimal, as

described in Section 5.3.2.

There are two important things to note from these results. First and foremost, the estimates

are reasonably well-behaved, and do not lead to an excessive amount of pruning, even for small

samples. Similarly, although our hypothesis tests are just rules of thumb, they provide for much

cleaner execution (less pruning) than the naive approach of using no hypothesis test.

As expected, the incidence of both non-minimality and pruning decreases with increased sam-

ple size.

5.6 Chapter Summary

In this chapter, we introduced two extensions to the Mondrian framework that allow the algo-

rithms to be applied to data sets much larger than main memory. The first extension is an adaptation

of the RainForest scalable decision tree framework [41], and the second uses a sample to perform


(a) k-Anonymity

n        k=10        k=100      k=1000     k=10000
100      68 (1384)   x (x)      x (x)      x (x)
250      30 (1110)   7 (97)     x (x)      x (x)
500      11 (419)    12 (55)    x (x)      x (x)
1000     0 (0)       5 (6)      x (x)      x (x)
2500     0 (0)       2 (1)      1 (3)      x (x)
5000     0 (0)       0 (0)      1 (4)      x (x)
10000    0 (0)       0 (0)      0 (0)      x (x)
25000    0 (0)       0 (0)      0 (0)      0 (0)

(b) ℓ-Diversity

n        ℓ=2         ℓ=4        ℓ=6        ℓ=8
100      97 (631)    x (x)      x (x)      x (x)
250      65 (87)     x (x)      8 (67)     x (x)
500      2 (763)     x (111)    2 (32)     x (x)
1000     112 (91)    1 (170)    1 (33)     x (16)
2500     0 (0)       0 (2)      1 (0)      0 (2)
5000     0 (0)       0 (2)      0 (0)      0 (9)
10000    0 (0)       0 (1)      0 (0)      0 (0)
25000    0 (0)       0 (1)      0 (0)      0 (0)

(c) Variance Diversity (Uniform S)

n        v=1.1e9     v=1.2e9    v=1.3e9
100      x (x)       x (x)      x (x)
250      x (510)     x (x)      x (x)
500      x (443)     x (434)    x (x)
1000     0 (456)     x (372)    x (x)
2500     0 (0)       0 (87)     x (146)
5000     0 (0)       0 (0)      0 (47)
10000    0 (0)       0 (0)      0 (5)
25000    0 (0)       0 (2)      0 (7)

(d) Variance Diversity (Normal S)

n        v=7e7       v=8e7      v=9e7
100      x (x)       x (x)      x (x)
250      x (396)     x (x)      x (x)
500      x (599)     x (457)    x (x)
1000     0 (402)     0 (405)    x (278)
2500     0 (18)      0 (66)     x (169)
5000     0 (0)       0 (0)      0 (6)
10000    0 (0)       0 (0)      0 (13)
25000    0 (0)       0 (0)      0 (17)

Figure 5.11 Number of nodes pruned by Rothko-S as a function of the sample size n


partitioning. We found that each of these approaches scales effectively to data sets much larger

than the available memory.


Chapter 6

Related Work

This chapter presents an overview of recent technical work related to the management of sen-

sitive and private data, emphasizing work that is most related to that presented in this thesis. As

mentioned in the introduction, data privacy involves a number of broad issues related to data col-

lection, ownership, dissemination, and use. While not exhaustive, we found the following series of

questions valuable in understanding and distinguishing the related work:

Who owns the data? Throughout the thesis, we considered a setting in which a large quantity of

personal data is collected, stored, and managed by a single central organization. The organization

itself is trustworthy, but must be cautious when distributing data externally. However, there are

many other realistic scenarios. For example, individuals may be concerned about providing their

information in the first place. In this case, steps should be taken to protect sensitive information at

the time of data collection. Alternatively, multiple organizations might compile separate databases,

which they are concerned about sharing with one another without additional precautions.

What information should be hidden / revealed? This is perhaps the most critical question, yet

the most difficult to answer. It is important to understand what information should be hidden from

an adversary, possibly given certain assumptions about his or her resources. That is, there should

be some well-developed notion of a privacy policy. At the same time, this must be balanced with

some notion of utility.

What resources are available to the adversary? This goes hand-in-hand with the previous ques-

tion. In devising an appropriate threat model and privacy policy, it is important to understand the


resources available to the adversary, both in terms of data (e.g., external databases, instance-level

and distributional background knowledge, etc.) and computational power.

What is the structure of the data? It is also important to understand the structure of the data.

For example, to this point, we have considered data consisting primarily of a single static relation,

with one record per individual, and we have assumed these records to be independent. However,

in other settings the data may be non-relational, correlated, dynamic, etc.

What is the query setting? There are many different application settings in which we may want

to protect private data, including data publishing, as well as interactive processing of user-defined

queries.

In the remainder of this chapter, we give an overview of recent work, using these questions as

a guide. A survey by Adam and Wortmann gives a comprehensive view of work in the area of

disclosure control for statistical databases prior to 1989 [1]. While some of the settings considered

in this chapter are closely related to those described in that survey, we focus our attention primarily

on more recent work. Whenever possible, we seek to point out open questions.

6.1 Privacy for Published Microdata

Throughout this thesis, we have considered the setting where a large quantity of sensitive per-

sonal information is collected and managed by a central organization. The data takes the form

of a single (static) relation, consisting of one record per individual. When the data is published,

the organization seeks to hide either the identities of individuals (i.e., under k-anonymity) or the

values of some sensitive attribute (i.e., under ℓ-diversity, variance diversity). Simultaneously, the

publishing organization seeks to retain the utility of the data for a certain set of tasks. This setting

is depicted in Figure 6.1.

Throughout this work, we have assumed that the adversary has access to some external (link-

able) database of individually identifying information and that the adversary may have infinite


Figure 6.1 Sanitized publication model

computational power. A variety of different works have considered this problem setting, and vari-

ations and extensions thereof. In this section we describe some of the different approaches that

have been taken.

6.1.1 Generalization, Recoding & Microaggregation

In recent years, numerous algorithms have been proposed for implementing k-anonymity (and

ℓ-diversity extensions [60]) via generalization and suppression. Initially, much of the work sought

to optimize simple general-purpose measures of utility. Sweeney proposed a heuristic algorithm

for cell generalization [82]; Samarati proposed the binary search algorithm for full-domain gen-

eralization that is described in Section 2.2. Meyerson and Williams [62] and Aggarwal et al. [5]

describe approximation algorithms for the cell-suppression flavor of k-anonymization. Bayardo

and Agrawal described an optimal search-based algorithm for single-dimensional recoding [14].

Winkler described a stochastic algorithm based on simulated annealing [91].

Subsequently, there has been interest in incorporating an understanding of workload when

anonymizing data. This idea was first proposed by Iyengar, who developed a genetic algorithm

for single-dimensional recoding, incorporating a single target classification model [47]. This ap-

proach proved costly, and subsequently there have been several proposed heuristic algorithms that

also incorporate a single target classifier when performing k-anonymous single-dimensional gen-

eralization [38, 88].


In addition to generalization, related techniques based on clustering have also been proposed in

the literature. Microaggregation first clusters data into (ideally homogeneous) groups of required

minimal occupancy, and then publishes the centroid of each group [30]. Similarly, Aggarwal et

al. propose clustering data into groups of at least size k, and then publishing various summary

statistics for each cluster [6]. Each of these approaches bears some resemblance to the summary

statistics we proposed in Section 4.5, but neither requires that the shape of clusters be rectangular

and non-overlapping.
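As a rough illustration of the microaggregation idea (not the algorithm of [30], which clusters more carefully), the following sketch orders records along a single attribute, forms groups of at least k records, and publishes each group's centroid; all names are hypothetical.

    import numpy as np

    def microaggregate(X, k):
        # X: (n, d) array of numeric quasi-identifiers; k: minimum group size.
        # A crude stand-in for the homogeneous clustering step of true microaggregation.
        order = np.argsort(X[:, 0])                 # order records along one attribute
        out = np.empty_like(X, dtype=float)
        n, start = len(X), 0
        while start < n:
            # The final group absorbs leftovers so every group has at least k records.
            end = n if n - start < 2 * k else start + k
            idx = order[start:end]
            out[idx] = X[idx].mean(axis=0)          # publish the group centroid
            start = end
        return out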

Finally, privacy-preserving histogram sanitization was proposed with the similar goal of guar-

anteeing that individuals blend into a crowd, based on some suitable distance metric [22]. This is

also similar to our multidimensional recoding techniques, but the probabilistic privacy definition

proposed in that work is slightly different. In our opinion, it does not fully capture situations where

the identification of even a single individual would be considered a breach.

6.1.2 Multiple Releases & Evolving Data

Although the techniques described in this thesis have considered releasing just a single (san-

itized) microdata table, several pieces of related work have sought to reason about disclosure by

multiple releases or multiple tables.

In a series of papers, Dobra and Feinberg considered the closely-related problem of measuring

disclosure for contingency tables [28, 29]. In database terminology, a contingency table can be

thought of as a count query, grouping by some set of attributes. A marginal sums out some subset

of these attributes. Given a set of marginals, the goal of this work is to compute precise upper

and lower bounds for each of the entries in the original, underlying contingency table. They show

that in certain special cases, where the schemas of the respective marginals form a decomposable

graph, these bounds can be computed in closed form.
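In database terms, the toy example below (hypothetical data and attribute names) shows a contingency table expressed as a group-by count and a marginal that sums out one of the attributes.

    import pandas as pd

    # Toy microdata: one row per individual.
    df = pd.DataFrame({
        'age_group': ['20-29', '20-29', '30-39', '30-39', '30-39'],
        'zipcode':   ['53703', '53706', '53703', '53703', '53706'],
        'disease':   ['Flu',   'Cancer', 'Flu',  'Flu',   'Cancer'],
    })

    # Contingency table: a count query grouped by a set of attributes.
    table = df.groupby(['age_group', 'zipcode', 'disease']).size()

    # A marginal sums out a subset of those attributes (here, zipcode).
    marginal = table.groupby(level=['age_group', 'disease']).sum()
    print(marginal)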

A close connection can be made between marginals and full-domain generalization. Kifer and

Gehrke extended this idea to multiple (full-domain generalized) projections of a single underlying

relation [50]. One common characteristic shared by all single table generalization / suppression

approaches to k-anonymization is their vulnerability to the so-called "curse of dimensionality," or


decrease in utility that arises when clustering or grouping high-dimensional data [2]. The work

by Kifer and Gehrke showed that it is often possible to combat the dimensionality problem by

releasing multiple independent projections.

One interesting open question in this line of research is how to discover a high-utility set of

marginals (or generalized projections) that does not violate the given anonymity requirement(s).

In other related work, Yao et al. have also considered the problem of checking a requirement

similar to ℓ-diversity,1 given a set of conjunctive views of a single underlying relation [98].

Finally, several recent works have sought to adapt the k-anonymity/ℓ-diversity framework to

handle updates and evolving data. Byun et al. propose a refinement approach (an extension of

our multidimensional partitioning techniques) that handles insertions of new records [20]. Xiao

et al. propose a further extension based on the insertion of superfluous (“counterfeit”) tuples for

databases that evolve through both insertions and deletions [96].

6.1.3 The Role of Background Knowledge

One common problem in data publishing is understanding and reasoning about background

knowledge available to the adversary. In many cases, an adversary attempting to learn personal

information from public data has some instance-level information. For example, consider again

the generalized data in Figure 1.2, and consider a nosy neighbor who is able to isolate her friend

Andrew to the first equivalence class. If she has seen Andrew recently, and knows that he does not

have a broken arm, then the probability of Andrew having Cancer increases from 1/3 to 1/2. In

response to problems like this, Martin et al. recently initiated a formal study of logical background

knowledge in data publishing [61].
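The arithmetic behind this example is simple conditioning on the adversary's instance-level knowledge. The tiny sketch below reproduces the 1/3 to 1/2 jump using a hypothetical equivalence class (the actual contents of Figure 1.2 may differ); the function name is ours.

    from collections import Counter

    def posterior(sensitive_values, target, ruled_out=()):
        # Belief that the target individual has the sensitive value `target`, after
        # background knowledge rules out some values for him or her.
        remaining = [v for v in sensitive_values if v not in ruled_out]
        return Counter(remaining)[target] / len(remaining)

    # A hypothetical 3-record equivalence class consistent with the example above.
    eq_class = ['Cancer', 'Broken Arm', 'Flu']
    print(posterior(eq_class, 'Cancer'))                            # 1/3
    print(posterior(eq_class, 'Cancer', ruled_out={'Broken Arm'}))  # 1/2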

In a realistic setting, it is impossible for the data publisher to know what instance-level infor-

mation is available to an adversary. In fact, there may be many different adversaries, each with

different knowledge. Thus, Martin et al. instead proposed quantifying the adversary’s knowledge,

1The anonymity requirement considered in this work stipulates that, given the published views, it must be true that any individual value of the ID attribute can be logically connected to at least k distinct values of the sensitive attribute S. In contrast to ℓ-diversity, it does not say anything about the representation of these values.


and releasing data that is resilient to a certain amount of knowledge (in the worst case, regard-

less of the specific content of this knowledge). In particular, they proposed a (complete) language

for expressing background knowledge, consisting of a finite conjunction of basic implications.2

They then proposed a dynamic programming algorithm for checking whether an anonymization is

safe, given that the adversary knows up to k basic implications, and noted that this check can be

incorporated into algorithms like Incognito.3

Subsequently, Chen et al. noted that quantifying background knowledge in terms of basic

implications is not terribly intuitive, and instead proposed decomposing knowledge into several

different “types,” each expressed by a stylized logical statement [24]. Specifically, this work con-

sidered three intuitive types of knowledge available to an adversary: knowledge about the target

individual, knowledge about others, and knowledge about the relationships among individuals.

Each type of knowledge is quantified separately, and the anonymity requirement is specified in

terms of a skyline ranging over the various types of background knowledge. For example, consider

a medical database and a target individual, Andrew. An organization may require that the released

data be robust to an adversary who knows up to five diseases Andrew does not have, the diseases

of three other individuals in the database, and up to two individuals who have the same disease

as Andrew. This work also describes efficient adaptations to the Mondrian framework, based on

a set of global sufficient statistics, for sanitizing data in accordance with these new anonymity

requirements.

An interesting open question in this line of research is how to represent and reason about prob-

abilistic background knowledge. For example, it is likely that an adversary could have knowledge

of the form: If Andrew has HIV, there is a 60% probability that Ellen has HIV.

2By the definition in [61], a basic implication is of the form (A1 ∧ ... ∧ Am) → (B1 ∨ ... ∨ Bn), where each Ai, Bj is an atom associating a particular individual with a particular sensitive value (e.g., Tom has AIDS).

3It is also important to note that the anonymity requirements expressed in the language of Martin et al. [61] do not necessarily satisfy the bucket independence property outlined in Section 1.3.3. There are no obvious extensions of Mondrian that are able to implement these requirements.


6.1.4 Other Perturbative Techniques

The anonymization techniques described in this thesis have focused primarily on generalization

(recoding). However, several other perturbative techniques have been proposed for protecting

privacy when publishing static microdata.

Most similar is the work of Xiao et al. (“Anatomy”) [94], which proposes bucketizing data in a

way similar to that described in Section 1.3. However, rather than generalizing the quasi-identifier

values, they propose permuting the sensitive values within each equivalence class. The standard of

privacy here is similar to ℓ-diversity.
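A minimal sketch of the within-group permutation idea appears below; it only illustrates the permutation step (the published representation used in [94] differs in its details), and the function and argument names are ours.

    import random

    def permute_sensitive(records, sensitive_attr, groups):
        # records: list of dicts; groups: list of lists of record indices
        # (the equivalence classes).  Quasi-identifiers are published unchanged,
        # but the sensitive values are shuffled within each group.
        published = [dict(r) for r in records]
        for group in groups:
            values = [records[i][sensitive_attr] for i in group]
            random.shuffle(values)
            for i, v in zip(group, values):
                published[i][sensitive_attr] = v
        return published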

Also related is the “condensation” method [3], which first clusters the input data into groups

containing at least k records, and then generates synthetic records based on the statistical properties

of each group. The data “swapping” approach swaps data values amongst records in a controlled

way, intended to approximately preserve the order statistics from the original data [70]. Other

work has proposed generating synthetic data following the statistical properties of the input data

[69, 71]. We note that it is not immediately clear how to quantify the protection afforded by each

of these schemes with respect to threat models such as the linking attack described in Section 1.3.

6.2 Client-Side Input Perturbation

Much recent work has also considered the privacy problem in a somewhat different setting.

Consider a large number of participants, P1, ..., Pn, each with an input, denoted x1, ..., xn. These

inputs are collected by a central organization, which would like to construct a single data mining

model (formally, a function g(x1, ..., xn) of the input). However, the parties are hesitant to reveal

their unmodified records to the organization. The supermarket discount cards described in the

introduction represent one interesting example. Alternatively, participants responding to a survey

provide another natural example.

A proposed solution to this problem is the idea of client-side input perturbation, depicted in

Figure 6.2, in which each of the participants adds some random noise to his or her input, following

a pre-specified distribution. Abstractly, we can think of this process in terms of a randomization


Figure 6.2 Client-side input perturbation model

operator R(), with random output. Participant Pi sends yi = R(xi) to the organization. Following

the notation in [34], we denote the probability that yi = R(xi) as p[xi → yi]. (In their original

work, Agrawal and Srikant considered a randomization operator that adds to xi some random value

r, drawn from a pre-specified distribution. That is, yi = xi + r [11].)

Agrawal and Srikant noted that in order to perform most data mining tasks, the actual input

values aren’t necessary [11].4 Instead, the distributions of the various attributes are sufficient. For

this reason, they developed a method for reconstructing the original attribute distribution from

the perturbed input, and they propose an extended decision tree construction algorithm that takes

these distributions as input [11]. Subsequently, related protocols have also been developed for

association rule mining [35, 74].

In this line of literature, Evfimievski et al. [34] provided a formal study of the privacy afforded

by client-side input perturbation. They assume that the input values x1, ..., xn are drawn from

a single finite set (DX), independently at random, with replacement, and according to a fixed

probability distribution. (Let X denote a random variable with the same probability distribution.)

From the point of view of the organization (also the adversary), yi is also an instance of a random

variable Y. Thus, prior to seeing yi, the adversary's belief that xi = x is assumed to be α =

4In some sense, this idea is reminiscent of the randomized response paradigm for survey questioning [89].


P(X = x). After seeing the perturbed input, the adversary's belief is β = P(X = x | Y = yi). The

intuition motivating their notion of privacy is that α and β should not be vastly different from one

another (in either direction).

Unfortunately, this proves difficult to check because we do not know the distribution of X.

Instead, they propose an alternate definition based on the idea of amplification, which requires that

∀ x1, x2 ∈ DX and ∀ y: p[x1 → y] / p[x2 → y] ≤ γ. The intuition is that many different input

values should be reasonably likely to be randomized to each different output value. Indeed, they

are able to show that this restriction is sufficient for guaranteeing α-β privacy when γ is defined

appropriately.
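The amplification condition is easy to check mechanically when the randomization operator is given as a table of transition probabilities. The following sketch (our own illustration, using a simple randomized-response operator as the example) computes the smallest γ for which the condition holds.

    def amplification_factor(p):
        # p[x][y] = probability that input value x is randomized to output value y.
        # Returns the smallest gamma with p[x1][y] / p[x2][y] <= gamma for all x1, x2, y.
        outputs = p[next(iter(p))].keys()
        gamma = 1.0
        for y in outputs:
            probs = [p[x][y] for x in p]
            if min(probs) == 0:
                return float('inf')       # some output rules an input out entirely
            gamma = max(gamma, max(probs) / min(probs))
        return gamma

    # Randomized response over a binary domain: report the true value with probability 0.75.
    R = {0: {0: 0.75, 1: 0.25},
         1: {0: 0.25, 1: 0.75}}
    print(amplification_factor(R))   # 3.0, so the condition holds for any gamma >= 3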

Finally, it is worth noting that these randomization techniques developed for client-side input

perturbation can also easily be applied in a publishing setting. That is, the organization simply

applies randomization function R() to each entry before releasing (y1, ..., yn) as the sanitized view.

6.3 Statistical Databases

A related and long-standing field of research considers the problem of disclosure control for

statistical databases. In this case, a central organization maintains a database of (potentially sensi-

tive) information, over which it answers aggregate queries. (These queries typically consist of an

aggregate function, such as SUM or MAX, and perhaps a selection predicate.) It is well known

that, given a sufficient number of queries, an adversary can reconstruct the contents of the under-

lying database [1]. There have been a variety of approaches proposed for addressing this problem,

and in this section, we briefly describe two: auditing and output perturbation.

6.3.1 Auditing Disclosure

One way to manage disclosure for statistical databases is through auditing. An auditing mech-

anism tracks a history of queries (per user, per colluding group, etc.), and attempts to determine

when the combination of these queries (and their answers) compromises privacy.

Several different notions of privacy have been considered in the auditing literature. Much of this

work has sought to detect exact disclosures. That is, given a set of queries and their true responses,


Figure 6.3 Online query auditing model

an adversary can uniquely determine one or more values in the underlying data. Alternatively,

definitions of partial (probabilistic) disclosure have also been proposed [49].

Kenthapadi et al. described a distinction between online and offline auditing problems [49]. In

the offline case, consider a set of queries q1, ..., qm, and their respective true answers, a1, ..., am.

Suppose that these queries have already been answered. The task of the offline auditor is to deter-

mine whether a breach of privacy has already occurred.

In the case of exact disclosure and real-valued data, when the history consists of only SUM

queries or only MAX queries, there are known polynomial-time offline auditing algorithms [25]; when there

are both SUM and MAX queries in the history, the problem is NP-hard [25]. For boolean data and

SUM queries, the problem of offline auditing for exact disclosure is coNP-hard, but (conservative)

approximations have been proposed [51].
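For SUM queries over real-valued data, exact disclosure has a clean linear-algebra characterization: value j is uniquely determined exactly when the unit vector e_j lies in the row space of the query matrix. The sketch below illustrates an audit of this kind (it is not the algorithm of [25], and the names are ours).

    import numpy as np

    def exactly_disclosed(query_matrix):
        # query_matrix: (m, n) 0/1 matrix; row i marks the values summed by query i.
        A = np.asarray(query_matrix, dtype=float)
        base_rank = np.linalg.matrix_rank(A)
        disclosed = []
        for j in range(A.shape[1]):
            e = np.zeros((1, A.shape[1]))
            e[0, j] = 1.0
            # e_j lies in the row space of A iff appending it does not raise the rank.
            if np.linalg.matrix_rank(np.vstack([A, e])) == base_rank:
                disclosed.append(j)
        return disclosed

    # Queries x0+x1, x1+x2, and x0+x2 together pin down every value exactly.
    print(exactly_disclosed([[1, 1, 0], [0, 1, 1], [1, 0, 1]]))   # [0, 1, 2]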

In contrast, in the online case, suppose queries q1, ..., qi−1 have already been answered, yielding

results a1, ..., ai−1, and that the user has posed a new query, qi. The task of an online auditor is to

decide whether or not to answer qi. This is depicted in Figure 6.3.

Unfortunately, Kenthapadi et al. note that using an offline auditor to solve an online problem

may be problematic because the decision to reject a query may itself reveal sensitive information

[49]. This problem arises, in general, when the online auditor uses the result ai (information not

available to the adversary) in making its decision. Instead, they propose the idea of a simulatable

online auditor [49]. Informally, the idea is that, given information already available to the adversary

(i.e., q1, ..., qi−1, a1, ..., ai−1, qi, but not ai), the adversary should be able to simulate the auditor's


Figure 6.4 Output perturbation model

decision to answer or reject the query, thus assuring that this decision reveals no new information.

In their paper, they propose a simulatable auditor for exact disclosure by MAX queries, and a

simulatable auditor for partial disclosure by SUM queries [49].

6.3.2 Output Perturbation

In contrast to online auditing, where each query is either answered (correctly and precisely),

or rejected, output perturbation protocols may add some noise to each query response. This is

depicted in Figure 6.4.

This problem has also drawn renewed interest in recent years [18, 27, 31, 32]. In general, let

f(D) → R be a query function, where D is a database (modeled as a finite set of n elements).

Output perturbation works by adding some random value α (drawn according to a distribution

specified by the mechanism) to the query result.5

This series of papers has proposed several (interrelated) definitions of privacy. Most recently,

they proposed the idea of differential privacy [31]. A randomized function f′ satisfies differential

privacy if, for all databases D1 and D2 that differ in at most one element, and all V ⊆ Range(f′),

P(f′(D1) ∈ V) is within a small constant factor of P(f′(D2) ∈ V).

5Initial work considered primarily sum functions [18, 27], but was later extended to general functions [32].


Intuitively, this definition is based on the idea that no harm should come to an individual, based

solely on his or her participation in a statistical database [31].6 It is shown that a small amount of

noise is sufficient for guaranteeing privacy under this definition [31, 32].
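As a concrete instance of output perturbation in this spirit, the following sketch answers a COUNT query with Laplace noise calibrated to the query's sensitivity, in the style analyzed in [32]; the function name and example data are hypothetical.

    import numpy as np

    def noisy_count(values, predicate, epsilon):
        # A count changes by at most 1 when a single record is added or removed
        # (sensitivity 1), so Laplace noise with scale 1/epsilon is added.
        true_answer = sum(1 for v in values if predicate(v))
        return true_answer + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

    # Hypothetical example: a noisy count of salaries above 50,000.
    salaries = [31000, 47500, 52000, 68000, 75500]
    print(noisy_count(salaries, lambda s: s > 50000, epsilon=0.1))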

This definition is interesting in that it allows certain disclosures that could be considered viola-

tions of commonly-held ideas of privacy. For example, a database might reveal that 99% of people

with Alice’s demographic characteristics are HIV positive, whether or not Alice’s information is

included in the database. Clearly, anyone with external information about Alice can infer with high

confidence that she has HIV. Determining whether this constitutes a breach of privacy, or simply

the free flow of information, is perhaps a philosophical question.

6.4 Query-View Security

In another related line of research, Miklau and Suciu began a series of papers on the problem

of query-view security [64]. Given a public view V of a (relational) database, the goal is to de-

termine whether it reveals any information about a private query Q of the same database. In this

work, views and queries are defined by conjunctive queries. No assumptions are made about the

computational capacity of the adversary.

The standard of privacy initially proposed by Miklau and Suciu (“perfect privacy”) is quite

strict, and requires that the public views provide no additional information with respect to the

private query [64]. That is, the posterior probability of a particular answer to the private query

should not be altered by an adversary seeing the public views. (This formalism is closely-related

to Shannon’s definition of perfect secrecy [80].) The formulation of perfect privacy makes no

assumptions about the computational capabilities of the adversary. In this initial work, Miklau and

Suciu show that the problem of checking perfect privacy in general, for conjunctive queries, is

Π2^p-complete.

Subsequently, this work has been extended in some interesting ways. Machanavajjhala and

Gehrke describe a number of cases where the problem of checking perfect privacy is

6The related notion of ǫ-indistinguishability [32] can be justified in terms of an "informed" adversary. That is, given an adversary who knows all entries but one in the database, he or she is unable to learn significantly more about the remaining entry.


tractable [59]. Deutsch and Papakonstantinou additionally consider the case where there exist

correlations amongst tuples in the underlying database [26].

One of the main advantages of perfect privacy is that it circumvents the need to track disclosure

across multiple queries. That is, we can answer an arbitrary number of public queries V that are

perfectly private with respect to private query Q because none reveals any information about Q.

The downside, of course, is that important queries may not satisfy these conditions, which impedes

utility.

6.5 Distributed Privacy-Preserving Data Mining

Considerable recent work has also focused on secure multiparty protocols for privacy-preserving

function evaluation and data mining. The setting is related to that of client-side input perturba-

tion. Again, we have two or more separate participants (organizations or individuals), P1, ..., Pn,

each with an input x1, ..., xn. Together, these parties would like to compute some function

g(x1, ..., xn), without revealing anything to one another except what is logically implied by the

result. For example, we might have several independent hospitals that would like to construct a

model predicting the effects of a certain drug, without revealing their patient information to one

another. Intuitively, the idea is to construct a protocol that, given a polynomial-time adversary,

is equivalent to a trusted third-party protocol, but that does not require the third party. This is

depicted in Figure 6.5.
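To give a flavor of how such a protocol can avoid the trusted third party, the toy sketch below computes a sum using additive secret sharing. It is only an illustration under a semi-honest, non-colluding assumption, and is far simpler than the general circuit-based protocols discussed next; all names are ours.

    import random

    MOD = 2**31

    def make_shares(value, n_parties):
        # Split an input into n random shares that sum to the value modulo MOD.
        shares = [random.randrange(MOD) for _ in range(n_parties - 1)]
        shares.append((value - sum(shares)) % MOD)
        return shares

    def secure_sum(inputs):
        # Party j sends its i-th share to party i; each party only ever sees one
        # share per input, yet the sum of the partial sums reconstructs the total.
        n = len(inputs)
        all_shares = [make_shares(x, n) for x in inputs]
        partials = [sum(all_shares[j][i] for j in range(n)) % MOD for i in range(n)]
        return sum(partials) % MOD

    print(secure_sum([12, 7, 30]))   # 49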

The secure function evaluation problem was first introduced by Yao [97], who developed a

two-party protocol for computing g(x1, x2) described by a combinatorial circuit. Recently, there

has been much interest in developing special-purpose protocols for the case where g() is a data

mining model. Lindell and Pinkas developed a solution for computing an ID3 decision tree [57].

Protocols have also been proposed for association rules [48, 85], clustering [86], and Bayesian

networks [93], among others.

These protocols guarantee strong notions of privacy with respect to the inputs x1, ..., xn. (This

is, of course, assuming that the answer g(x1, ..., xn) does not reveal sensitive information. Un-

derstanding the information revealed by various data mining models may be an open problem of


Figure 6.5 Secure multiparty computation model

independent interest.) At the same time, the frequently-voiced criticism is that the protocols can

be quite expensive (in terms of computation and communication overhead), and thus impractical

in many applications. Also of practical concern, secure multiparty protocols typically require that

each participant engage in the protocol at the same time, and have the ability to carry out some

non-trivial amount of computation. These conditions may not be realistic when the participants

are individuals, as in the supermarket and survey examples.

6.6 Database Authorization, Access Control & Security

While not the primary focus of this thesis, authorization and access control for data manage-

ment systems are also important and active research areas. In this section, we briefly describe three

distinct settings: access control models for traditional database servers, security and confidentiality

for outsourced databases, and authorization for publisheddata.

Traditionally, database authorization has focused on the setting where data is stored on a trusted

server, and the system must control the outflow of information according to some access control

policy. This is the setting addressed by the discretionary access control mechanism of SQL [45],

as well as role-based [79] and mandatory access control systems [15, 58]. More recently, several

fine-grained data-centric access control mechanisms have been developed [8, 52, 75].


In recent years, a new class of database "service providers" has emerged (e.g., Salesforce [77]).

In this model, a large number of individual clients store their data on a (potentially untrusted) third-

party database server. As expected, this raises a number of interesting questions related to data

confidentiality and integrity. Several approaches have been proposed for protecting confidentiality,

including techniques for evaluating queries over encrypted data on an untrusted server [46, 9], and

techniques for distributing confidential information across multiple untrusted servers in a way that

prevents any one server from reconstructing the original data [4].

In the third setting, Miklau and Suciu have proposed fine-grained techniques for controlling

access to published (XML) data using encryption and key distribution [63]. We note, however, that

this work does not attempt to reason about what an adversary may be able to infer (with respect to

identities, the answer to a secret query, etc.) given the data he or she is authorized to see.


Chapter 7

Summary and Discussion

This thesis has presented an extensive study of anonymization techniques for published, non-

aggregate microdata. Given the framework presented in Chapter 1, which defines notions of

anonymity with respect to identity and the value of a sensitive attribute, we have provided a number

of algorithmic techniques that focus on data quality, scalability, and performance.

Incognito We began by describing an efficient algorithm that implements any monotone anonymity

requirement via full-domain generalization. We showed that the algorithm is sound and complete;

thus the “optimal” anonymization can be chosen according toany notion of utility. In addition,

an experimental evaluation found that this algorithm (and some variations thereof) performed sub-

stantially better than the optimal search algorithms that existed previously.

Mondrian Following this, we introduced the multidimensional partitioning approach, and de-

veloped a greedy algorithm that implements any anonymity requirement exhibiting the monotonic-

ity and bucket independence properties. Initially, we limited our study to several simple general-

purpose measures of utility, and found that the greedy approach often produces higher-quality data

than even optimal single-dimensional generalization algorithms, such as Incognito and K-optimize

[14].

Workload-Aware Anonymization Later, we extended the idea of data quality based on the

observation that quality is really quite subjective. For this reason, we looked to a target workload

of queries and data mining tasks as a quality evaluation tool. We also developed a simple lan-

guage for describing a family of workloads, and developed extensions to the Mondrian algorithm


for incorporating these workloads directly into the anonymization procedure. An extensive experi-

mental evaluation showed the importance of both the multidimensional recoding approach, and of

incorporating workload into the anonymization process.

Rothko Finally, we considered the problem of scalability. While the greedy Mondrian algo-

rithms are typically more efficient than previous (exhaustive search) algorithms, they require some

modifications in order to be applied to data sets larger than memory. We developed and evaluated

two such extensions, the first based on ideas from scalable decision trees, and the second based on

an optimistic sampling-based approach.

Thus, the main contribution of this thesis was to show that it is possible to publish high-quality

non-aggregate data that respects several meaningful notions of privacy. Further, it is possible to do

this for large data sets in a scalable and efficient way.

This thesis has not, by and large, attempted to address the higher-level policy issues surround-

ing data privacy. The computer security community has long drawn a distinction between the ideas

of policy and mechanism, and a similar distinction can be made here. Unfortunately, as in security,

where it is not always clear what it means for a system to be secure, it is sometimes difficult to

precisely define what it means to protect privacy.

Roughly-speaking, a privacy policy should include high-level statements describing what tasks

an adversary should and should not be able to perform, or whathe should and should not be able

to infer, given certain resources. The probabilistic statement of attribute protection (Equation 1.2)

is one example of such a policy. However, this is certainly not the only meaningful idea of privacy.

For example, Dwork proposed the idea of differential privacy, a definition that is substantially

different, both technically and philosophically [31]. Semantic security [42] and perfect privacy

[64] represent still different notions of what it means to protect privacy.

It is our belief that reasonable expectations of privacy (and policies designed in accordance

with these expectations) are dictated by the application at hand. For this reason, it seems unlikely

that there will ever be a single catch-all framework for reasoning about all types of privacy and

disclosure. Rather, the future research direction appears to lie in defining policies (high-level

statements of “privacy”) that are appropriate (philosophically, legally, and technically) to specific


application scenarios, and developing mechanisms that rigorously enforce these policies. Many

technical mechanisms have been developed over the years, including authorization, encryption,

aggregation, generalization, output perturbation, and more. In the future, we expect that these

mechanisms will form the building blocks for enforcing emerging classes of policies.

At the same time, we do not believe that data privacy can be entirely solved by new technology.

At the very least, the development of appropriate technical policies will need to be guided by legis-

lation and established social norms. In the future, we suspect that technology and legal regulations

will play complementary roles in enforcing privacy principles.

Clearly, many questions remain. While this thesis has answered important technical questions,

data privacy is establishing itself as an important research area for years to come.


LIST OF REFERENCES

[1] N. Adam and J. Wortmann. Security-control methods for statistical databases. ACM Computing Surveys, 21(4):515–556, 1989.

[2] C.C. Aggarwal. On k-anonymity and the curse of dimensionality. In Proceedings of the 31st International Conference on Very Large Databases (VLDB), 2005.

[3] C.C. Aggarwal and P. Yu. A condensation approach to privacy-preserving data mining. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT), 2004.

[4] G. Aggarwal, M. Bawa, P. Ganesan, H. Garcia-Molina, K. Kenthapadi, R. Motwani, U. Srivastava, D. Thomas, and Y. Xu. Two can keep a secret: A distributed architecture for secure database services. In Proceedings of the 2nd Conference on Innovative Data Systems Research (CIDR), 2005.

[5] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu. Anonymizing tables. In Proceedings of the 10th International Conference on Database Theory (ICDT), 2005.

[6] G. Aggarwal, T. Feder, K. Kenthapadi, R. Panigrahy, D. Thomas, and A. Zhu. Achieving anonymity via clustering in a metric space. In Proceedings of the 25th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), 2006.

[7] R. Agrawal, S. Ghosh, T. Imielinski, and A. Swami. Database mining: A performance perspective. In IEEE Transactions on Knowledge and Data Engineering, volume 5, 1993.

[8] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Hippocratic databases. In Proceedings of the 28th International Conference on Very Large Databases (VLDB), 2002.

[9] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Order-preserving encryption for numeric data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2004.

[10] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases (VLDB), 1994.


[11] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000.

[12] F. Bacchus, A.J. Grove, J.Y. Halpern, and D. Koller. From statistical knowledge bases to degrees of belief. Artificial Intelligence, 87, 1996.

[13] M. Barbaro and T. Zeller. A face is exposed for AOL searcher no. 4417749. New York Times, August 9, 2006.

[14] R. Bayardo and R. Agrawal. Data privacy through optimal k-anonymity. In Proceedings of the 21st International Conference on Data Engineering (ICDE), 2005.

[15] D. Bell and L. LaPadula. Secure computer systems: Unified exposition and multics interpretation. Technical Report ESD-TR-75-306, MITRE Corp., Bedford, Mass., 1976.

[16] C. Bettini, X.S. Wang, and S. Jajodia. The role of quasi-identifiers in k-anonymity revisited. University of Milano Technical Report N. RT 11-06, 2006.

[17] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998.

[18] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. In Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2005.

[19] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984.

[20] J-W. Byun, Y. Sohn, E. Bertino, and N. Li. Secure anonymization for incremental datasets. In SIAM Conference on Data Mining (SDM), 2006.

[21] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26, 1997.

[22] S. Chawla, C. Dwork, F. McSherry, A. Smith, and H. Wee. Toward privacy in public databases. In 2nd Theory of Cryptography Conference, 2005.

[23] B. Chen, L. Chen, Y. Lin, and R. Ramakrishnan. Prediction cubes. In Proceedings of the 31st International Conference on Very Large Databases (VLDB), 2005.

[24] B. Chen, K. LeFevre, and R. Ramakrishnan. PrivacySkyline: Privacy with multidimensional adversarial knowledge. Under Submission, 2007.

[25] F. Chin. Security problems on inference control for sum, max, and min queries. Journal of the ACM, 33(3), 1986.

[26] A. Deutsch and Y. Papakonstantinou. Privacy in database publishing. In Proceedings of the 10th International Conference on Database Theory (ICDT), January 2005.


[27] I. Dinur and K. Nissim. Revealing information while preserving privacy. In Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2003.

[28] A. Dobra and S. Feinberg. Bounds for cell entries in contingency tables given marginal totals and decomposable graphs. Proceedings of the National Academy of Sciences, 97(22), 2000.

[29] A. Dobra and S. Feinberg. Bounding entries in multi-way contingency tables given a set of marginal totals. In Proceedings of the Shoresh Conference, 2003.

[30] J. Domingo-Ferrer and J.M. Mateo-Sanz. Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering, 4(1), 2002.

[31] C. Dwork. Differential privacy. In Proceedings of the 33rd International Colloquium on Automata, Languages, and Programming (ICALP), 2006.

[32] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Theory of Cryptography Conference, 2006.

[33] Equifax. http://equifax.com.

[34] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy-preserving data mining. In Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2003.

[35] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.

[36] Experian. http://www.experian.com.

[37] J.H. Friedman, J.L. Bentley, and R.A. Finkel. An algorithm for finding best matches in logarithmic time. ACM Transactions on Mathematical Software, 3(3), 1977.

[38] B.C.M. Fung, K. Wang, and P.S. Yu. Top-down specialization for information and privacy preservation. In Proceedings of the 21st International Conference on Data Engineering (ICDE), 2005.

[39] M.R. Garey and D.S. Johnson. Computers and intractability: A guide to the theory of NP-completeness. W.H. Freeman, 1979.

[40] J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Loh. BOAT: Optimistic decision tree construction. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1999.


[41] J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. In Proceedings of the 24th International Conference on Very Large Databases (VLDB), 1998.

[42] S. Goldwasser and S. Micali. Probabilistic encryption. Journal of Computer and System Sciences, 28, 1982.

[43] Google. Privacy policy, May 26, 2007. http://www.google.com/privacypolicy.html.

[44] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1(1), 1996.

[45] P. Griffiths and B. Wade. An authorization mechanism for a relational database system. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1976.

[46] H. Hacigumus, B. Iyer, C. Li, and S. Mehrota. Executing SQL over encrypted data in the database-service-provider model. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2002.

[47] V. Iyengar. Transforming data to satisfy privacy constraints. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.

[48] M. Kantarcioglu and C. Clifton. Privacy-preserving distributed mining of association rules on horizontally partitioned data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), 2002.

[49] K. Kenthapadi, N. Mishra, and K. Nissim. Simulatable auditing. In Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2005.

[50] D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2006.

[51] J. Kleinberg, C. Papadimitriou, and P. Raghavan. Auditing boolean attributes. In Proceedings of the 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2000.

[52] K. LeFevre, R. Agrawal, V. Ercegovac, R. Ramakrishnan, Y. Xu, and D. DeWitt. Limiting disclosure in hippocratic databases. In Proceedings of the 30th International Conference on Very Large Databases (VLDB), 2004.

[53] K. LeFevre, D. DeWitt, and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2005.

[54] K. LeFevre and D. DeWitt. Scalable anonymization algorithms for large data sets. University of Wisconsin Computer Sciences Technical Report 1590, 2007.

[55] K. LeFevre, D. DeWitt, and R. Ramakrishnan. Mondrian multidimensional k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006.

[56] K. LeFevre, D. DeWitt, and R. Ramakrishnan. Workload-aware anonymization. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.

[57] Y. Lindell and B. Pinkas. Privacy preserving data mining. Journal of Cryptology, 15(3), 2002.

[58] T. Lunt, D. Denning, R. Schell, M. Heckman, and W. Shockley. The SeaView security model. IEEE Transactions on Software Engineering, 16(6):593–607, 1990.

[59] A. Machanavajjhala and J. Gehrke. On the efficiency of checking perfect privacy. In Proceedings of the 25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2006.

[60] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-Diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006.

[61] D. Martin, D. Kifer, A. Machanavajjhala, J. Gehrke, and J. Halpern. Worst-case background knowledge in privacy. In Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2007.

[62] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In Proceedings of the 23rd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), 2004.

[63] G. Miklau and D. Suciu. Controlling access to published data using cryptography. In Proceedings of the 29th International Conference on Very Large Databases (VLDB), 2003.

[64] G. Miklau and D. Suciu. A formal analysis of information disclosure in data exchange. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2004.

[65] M. Mokbel, C. Chow, and W. Aref. The new Casper: Query processing for location services without compromising privacy. In Proceedings of the 32nd International Conference on Very Large Databases (VLDB), 2006.

[66] U.S. Department of Health and Human Services Office for Civil Rights. HIPAA administrative simplification regulation text, February 16 2006.

[67] Department of Homeland Security Privacy Office. Notice of Privacy Act system of records. Federal Register, 71(212), November 2 2006.

[68] R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[69] T.E. Raghunathan, J. Reiter, and D. Rubin. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1), 2003.

[70] S. Reiss. Practical data-swapping: The first steps. ACM Transactions on Database Systems, 9(1), 1984.

[71] J. Reiter. Satisfying disclosure restrictions with synthetic data sets. Journal of Official Statistics, 18(5), 2002.

[72] Reuters. US pushes digital medical records, July 22 2004.

[73] J.A. Rice. Mathematical statistics and data analysis. International Thomson Publishing, second edition, 1995.

[74] S. Rizvi and J.R. Haritsa. Maintaining data privacy in association rule mining. In Proceedings of the 28th International Conference on Very Large Databases (VLDB), 2002.

[75] S. Rizvi, A. Mendelzon, S. Sudarshan, and P. Roy. Extending query rewriting techniques for fine-grained access control. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2004.

[76] Safeway Inc. Privacy policy, May 26 2007. http://www.safeway.com/privacypage.asp.

[77] salesforce.com, Inc. http://www.salesforce.com.

[78] P. Samarati. Protecting respondents’ identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13(6), November/December 2001.

[79] R.S. Sandhu, E.J. Coyne, H.L. Feinstein, and C.E. Youman. Role-based access control models. IEEE Computer, 1996.

[80] C.E. Shannon. Communication theory of secrecy systems. The Bell System Technical Journal, 1949.

[81] R. Srikant and R. Agrawal. Mining generalized association rules. In Proceedings of the 21st International Conference on Very Large Databases (VLDB), 1995.

[82] L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness, and Knowledge-based Systems, 10(5):571–588, 2002.

[83] L. Sweeney. K-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness, and Knowledge-based Systems, 10(5):557–570, 2002.

[84] TransUnion LLC. http://www.transunion.com.

[85] J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.

[86] J. Vaidya and C. Clifton. Privacy-preserving k-means clustering over vertically partitioned data. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[87] J. Vogel. When cards come collecting. Seattle Weekly, September 23 1998.

[88] K. Wang, P.S. Yu, and S. Chakraborty. Bottom-up generalization: A data mining solution to privacy protection. In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM), 2004.

[89] S.L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60, 1965.

[90] L. Willenborg and T. de Waal. Elements of Statistical Disclosure Control. Springer Verlag Lecture Notes in Statistics, 2000.

[91] W. Winkler. Using simulated annealing for k-anonymity. Research Report 2002-07, US Census Bureau Statistical Research Division, 2002.

[92] I.H. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, second edition, 2005.

[93] R. Wright and Z. Yang. Privacy-preserving Bayesian network structure computation on distributed heterogeneous data. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.

[94] X. Xiao and Y. Tao. Anatomy: Simple and effective privacy preservation. In Proceedings of the 32nd International Conference on Very Large Databases (VLDB), 2006.

[95] X. Xiao and Y. Tao. Personalized privacy preservation. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2006.

[96] X. Xiao and Y. Tao. m-Invariance: Towards privacy preserving re-publication of dynamic datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2007.

[97] A. Yao. How to generate and exchange secrets. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS), 1986.

[98] C. Yao, X.S. Wang, and S. Jajodia. Checking for k-anonymity violation by views. In Proceedings of the 31st International Conference on Very Large Databases (VLDB), 2005.

[99] J. Zhang and V. Honavar. Learning decision tree classifiers from attribute value taxonomies and partially specified data. In Proceedings of the 20th International Conference on Machine Learning (ICML), 2003.

Appendix A: HIPAA Safe Harbor Provision

The second de-identification provision of the HIPAA Privacy Rule (the Safe Harbor) requires the removal of eighteen specific types of information for any person (e.g., patients, doctors, etc.). Specifically, the rule lists the following [66] (an illustrative sketch applying a few of these rules in code follows the list):

(A) Names;
(B) All geographic subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code if, according to the current publicly available data from the Bureau of the Census:
(1) The geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people;
(2) The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.
(C) All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;
(D) Telephone numbers;
(E) Fax numbers;
(F) Electronic mail addresses;
(G) Social security numbers;
(H) Medical record numbers;
(I) Health plan beneficiary numbers;
(J) Account numbers;
(K) Certificate/license numbers;
(L) Vehicle identifiers and serial numbers, including license plate numbers;
(M) Device identifiers and serial numbers;
(N) Web Universal Resource Locators (URLs);
(O) Internet Protocol (IP) address numbers;
(P) Biometric identifiers, including finger and voice prints;
(Q) Full face photographic images and any comparable images;
(R) Any other unique identifying number, characteristic, or code, except as permitted by paragraph (c) of this section.
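
To make the mechanics of rules (A) through (C) concrete, the following Python sketch shows how a single record might be transformed under a few of the Safe Harbor requirements. It is purely illustrative and is not part of the regulation or of the systems described in this thesis: the field names, the safe_harbor_zip, safe_harbor_age, and deidentify helpers, and the three-digit zip prefix population table are hypothetical assumptions (in practice, prefix populations would come from current Census Bureau data).

# Illustrative sketch only: applying Safe Harbor rules (A)-(C) to one record.
# Field names, helper functions, and ZIP3_POPULATION are hypothetical.

from datetime import date

# Hypothetical three-digit zip prefix populations (Census-derived in practice).
ZIP3_POPULATION = {"537": 500000, "036": 15000}

# Rule (A) and rules (D)-(K): fields removed outright.
DIRECT_IDENTIFIERS = {"name", "ssn", "phone", "email", "medical_record_number"}

def safe_harbor_zip(zip_code):
    """Rule (B): keep only the first three digits, and only if that prefix
    covers more than 20,000 people; otherwise report '000'."""
    prefix = zip_code[:3]
    return prefix if ZIP3_POPULATION.get(prefix, 0) > 20000 else "000"

def safe_harbor_age(birth_date, as_of):
    """Rule (C): ages over 89 are aggregated into a single '90+' category."""
    age = as_of.year - birth_date.year - (
        (as_of.month, as_of.day) < (birth_date.month, birth_date.day))
    return "90+" if age >= 90 else str(age)

def deidentify(record, as_of):
    """Drop direct identifiers, truncate zip codes, and reduce dates to years."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue
        elif field == "zip":
            out["zip3"] = safe_harbor_zip(value)
        elif field == "birth_date":
            age = safe_harbor_age(value, as_of)
            out["age"] = age
            if age != "90+":
                # Rule (C): the year alone may be kept unless it reveals age >= 90.
                out["birth_year"] = value.year
        else:
            out[field] = value
    return out

if __name__ == "__main__":
    patient = {"name": "Jane Doe", "ssn": "123-45-6789", "zip": "53706",
               "birth_date": date(1915, 4, 2), "diagnosis": "influenza"}
    print(deidentify(patient, as_of=date(2007, 1, 1)))
    # {'zip3': '537', 'age': '90+', 'diagnosis': 'influenza'}

Note that the sketch covers only a handful of the eighteen categories; a complete Safe Harbor implementation would also suppress the remaining identifier types listed above and would rely on authoritative Census data for the zip prefix test.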