
Thesis proposed to achieve the degree of master in computer science/databases

Privacy in Databases

Mathy Vanhoef

Academic years: 2010–2011 & 2011–2012
Promoter: Prof. dr. Jan Van den Bussche


Abstract

In the current digital age more and more data is being collected, yet it’s unclear how to assure adequate privacy. The question of how to analyze large amounts of data while preserving privacy now prevails more than ever. In the course of history there have been many failed attempts, showing that reasoning about privacy is fraught with pitfalls. This caused an increased interest in a mathematically robust definition of privacy.

We will prove that absolute disclosure prevention is impossible. In other words, a person that gains access to a database can always breach the privacy of an individual. This motivated the move to assuring relative disclosure prevention. One of the most promising definitions in this area is differential privacy. It addresses all the currently known attacks, has plenty of practical implementations, and has many extensions that make it applicable in a wide range of situations.

Networked data also poses challenging privacy issues. Both active and passive attacks, where the underlying structure of the network is used to de-anonymize individuals, are discussed in detail. Degree anonymization and algorithms to create degree anonymous graphs will also be given. Although still an active research topic, privacy of networked data is considered increasingly important given the rise of social media.


Preface

This thesis started as a custom proposal to study security, and in particular privacy, from a theoretical point of view. What started with reading a research article on the foundations of private data analysis quickly turned into a broad subject covering many aspects of privacy in databases.

My deepest gratitude goes to my promoter, Prof. dr. Jan Van den Bussche, for making time in his already full schedule to help me successfully complete this work. During our rare but intense meetings he provided invaluable feedback and insights. Often I was critical at first, but once he completed the explanation of his interpretation, my understanding of the matter increased significantly and my point of view broadened. Even after a meeting he would continue to spend time on finding a more sound proof, and once found I would receive an e-mail late at night explaining his findings. I’m very thankful for his indispensable help.

Finally, I would like to thank my parents and brothers, whose support during the more stressful moments has been invaluable.


Contents

I Foundations of Privacy

1 Introduction
  1.1 What is Privacy?
  1.2 Tensions in Releasing Data
    1.2.1 Preference for Accuracy
    1.2.2 Preference for Anonymity
    1.2.3 Balance between Accuracy and Anonymity
  1.3 Information Age
  1.4 Guarantees of Privacy
  1.5 Statistical Databases
    1.5.1 Genotypes and Phenotypes (dbGaP)
    1.5.2 Internet Usage
  1.6 Models for Information Flow
    1.6.1 Interactive vs. Non-Interactive
    1.6.2 Information Flow
  1.7 What About Cryptography?
    1.7.1 Secure Function Evaluation

2 Probability Theory
  2.1 Introduction
    2.1.1 Conditional Probability
    2.1.2 Discrete Variables
    2.1.3 Continuous Variables
    2.1.4 Transformation of a Random Variable
    2.1.5 Linearity of Expectation
  2.2 Entropy
    2.2.1 Self-information
    2.2.2 Shannon Entropy
    2.2.3 Min-entropy
  2.3 Laplace distribution
  2.4 Inequalities
    2.4.1 Boole’s inequality
    2.4.2 Markov Bound
    2.4.3 Hoeffding’s Inequality
  2.5 Probabilistic Turing Machine
  2.6 Statistical Distance
    2.6.1 Definition
    2.6.2 Properties

3 History
  3.1 Previous Attacks
    3.1.1 The AOL Search Fiasco
    3.1.2 The HMO records
    3.1.3 Netflix
    3.1.4 Kaggle
  3.2 Previous Ideas
  3.3 K-anonymity and Variants
    3.3.1 K-anonymity
    3.3.2 L-diversity
    3.3.3 T-closeness
  3.4 Intersection Attack
    3.4.1 The Algorithm
    3.4.2 Empirical Results
  3.5 Conclusion

4 Impossibility Result
  4.1 Motivation
  4.2 The Setting
    4.2.1 Preliminaries
    4.2.2 Assumptions
  4.3 The Proof
  4.4 Utility vector w is fully learned
  4.5 Utility vector w is only partly learned
  4.6 Discussion
  4.7 Conclusion

5 Differential Privacy
  5.1 Requirements
  5.2 Definition
  5.3 Variants
    5.3.1 Discrete Range
    5.3.2 Interpretation
  5.4 Choice of Epsilon
  5.5 Achieving Differential Privacy
    5.5.1 Counting Queries
    5.5.2 General Theorem to Achieve Privacy
    5.5.3 Adaptive Queries
    5.5.4 Histogram Queries
  5.6 K-Means Clustering
    5.6.1 Original Algorithm
    5.6.2 Private Algorithm
  5.7 Median Mechanism
  5.8 Disadvantages

II Privacy in New Settings

6 Extensions of Differential Privacy
  6.1 When Noise Makes No Sense
    6.1.1 An Example
    6.1.2 The Scoring Function
    6.1.3 The Privacy Mechanism
    6.1.4 Limitations and Future Directions
  6.2 Pan Privacy
    6.2.1 Introduction
    6.2.2 Preliminary Definitions
    6.2.3 Definition of Pan Privacy
    6.2.4 Density Estimation

7 Handling Network Data
  7.1 Motivation
  7.2 Graph Model
    7.2.1 Assuring Privacy
    7.2.2 Risk Assessment
  7.3 Graph Theory
    7.3.1 Basic Definitions
    7.3.2 Graph Properties
  7.4 Differential Privacy for Graphs
    7.4.1 Possible Definitions
    7.4.2 Degree Distribution Estimation
  7.5 Asymptotic Notation
    7.5.1 Special Notations

8 Naive Anonymization and an Active Attack
  8.1 Naive Anonymization
  8.2 Active Attack
  8.3 Background
  8.4 Idea Behind the Attack
  8.5 The Details
    8.5.1 Requirements for success
    8.5.2 Construction of the Subgraph H
    8.5.3 Finding our Constructed Subgraph
    8.5.4 Privacy Breach
  8.6 Analysis
  8.7 Experiments
  8.8 Detectability
  8.9 As a Passive Attack
  8.10 Conclusion
  8.11 Proofs
    8.11.1 Lemma 8.5
    8.11.2 Lemma 8.6

9 Degree Anonymization
  9.1 Introduction
  9.2 Anonymity Guarantees
  9.3 Problem Description
  9.4 Degree Anonymization
    9.4.1 An Improvement
  9.5 Graph Construction
    9.5.1 Realizability and ConstructGraph
    9.5.2 SuperGraph: Resemblance to the Input Graph
    9.5.3 The Probing Scheme
    9.5.4 Relaxed Graph Construction using GreedySwap
    9.5.5 SimultaneousSwap: Edge Additions and Deletions
  9.6 Experiments and Implementation: Prologue
    9.6.1 Datasets
    9.6.2 Testing Randomized Algorithms
  9.7 Individual Algorithm Evaluation
    9.7.1 Degree Anonymization
    9.7.2 SuperGraph using Probing Scheme
    9.7.3 GreedySwap and SimultaneousSwap
  9.8 Comparison Between Algorithms
  9.9 Relation to Link Prediction
  9.10 Conclusion

10 Passive Attack
  10.1 Background
  10.2 The Setting
  10.3 Seed Identification
    10.3.1 Individual Auxiliary Information
    10.3.2 Combinatorial Optimization Problem
  10.4 Propagation
  10.5 Experiments
    10.5.1 Anonymized Flickr Graph
    10.5.2 Flickr and Twitter
  10.6 Conclusion

11 Conclusion


Part I

Foundations of Privacy


Chapter 1

Introduction

Before delving into the details we will give a short introduction on privacy. This is especially needed because privacy is a very broad notion and can be tackled from multiple angles. In this chapter the sections What is Privacy? and What About Cryptography? are based on the first lecture of Naor on Foundations of Privacy [Nao10a]. Tensions in Releasing Data is based on the PhD thesis of Sweeney [Swe01]. The remaining introductory content was gathered from different sources when reading about—and working on—the subject.

1.1 What is Privacy?

Before we begin it’s necessary to define what we mean by privacy. Unfortunately this is an extremely overloaded term and is hard to define precisely. The following quotes demonstrate this point:

“Privacy is a value so complex, so entangled in competing and contradictory dimensions, so engorged with various and distinct meanings, that I sometimes despair whether it can be usefully addressed at all.” — Robert C. Post [Pos01]

“Privacy is the right to be let alone.” — Samuel D. Warren and Louis D. Brandeis [WB85]

“Privacy is our concern over our accessibility to others: the extent to which we are known to others, the extent to which others have physical access to us, and the extent to which we are the subject of others.” — Ruth Gavison [Gav80]

“There is no one that has a good definition of what privacy actually is.”

— Willem Debeuckelaere (translation mine) [pan11]


“Privacy is like oxygen. We really appreciate it only when it is gone.” — Charles J. Sykes

Figure 1.1: Tension between quality and privacy (a spectrum ranging from identifiable data with higher quality to anonymous data with higher privacy)

Especially in the current information age, privacy seems to be a fragile concept. Huge amounts of data are being collected, people publish their information on social networks, public surveillance information is being collected (e.g., security cameras), telephone companies are required to store certain information, internet service providers are storing sensitive information, and so on.

In this work we will investigate if there are mathematical definitions of privacy that make sense and can be achieved by algorithms. The advantage of this theoretical approach is that we can prove that certain methods will achieve a specific form of privacy. Today, the most promising definition is differential privacy. But before we explain it further it’s important to know the background and the history of the research being done in privacy, as this will give you better insight into the surrounding problems one must take into account.

However, not every problem related to privacy is currently solved, and there are still many open questions and problems. Whilst differential privacy and its variants certainly have promise, the definition is not always easily applicable and is in some cases too strong.

1.2 Tensions in Releasing Data

When releasing data there is always a tension between quality and anonymity. On the one hand, releasing the full dataset provides the best quality, while the other extreme of not releasing any data provides the best privacy [Swe01].

Consider figure 1.1: At one end we have the data that is fully identified, and on the other end is the anonymous data derived from the original data. When we want to provide any form of privacy, no matter how minimal, the original data must be modified to ensure anonymity, and is thus distorted. So when we move away from the fully identified end of the spectrum, towards the anonymous end, the resulting data will be less useful. In particular, this means that there are some tasks that could be computed on the original data, but not on the released data (or the error margin of the result would be too high to be considered useful).

Figure 1.2: Recipient’s needs overpower privacy concerns [Swe01]

From this we realize that the desired anonymity and usefulness depend on the sensitivity of the data itself and on the task to which the analyst puts the data. That is, given a particular task, there exists a point on the continuum in Figure 1.1 where the released data is as anonymous as possible, yet the data still remains useful for the task. This can be characterized as a tug-of-war between the people and the analysts: The people want maximum anonymity while the analyst wants the most useful data.

1.2.1 Preference for Accuracy

There are several examples where access to accurate information is more important than anonymity. In medical sectors this is especially true: To provide the best care for patients, the original, fully identified health information must be available to doctors. This is illustrated in figure 1.2.

However, note that it’s possible the information may not be released to everyone. In our example we expect that doctors follow the Hippocratic Oath and keep the information of patients private. Nevertheless, if we consider the doctor to be the analyst, it still holds that the original information is given and no modifications were made to ensure any form of privacy.

1.2.2 Preference for Anonymity

The opposite extreme is also a realistic scenario. In this case the need to preserve the confidentiality of the information overshadows the possible utility it would have for an analyst. Information related to national security or military agencies is a good example. Here privacy and confidentiality are achieved by not releasing the information at all.


Figure 1.3: Balance between privacy concerns and uses of the data [Swe01]

1.2.3 Balance between Accuracy and Anonymity

Finally there is the common situation where we must achieve a reasonable balance between accuracy and anonymity. This is also where the remainder of this work will focus: problems where we don’t want to release the original data, but still want to calculate and release global statistics and properties of the data. See figure 1.3 for a graphical representation.

Take the case of medical research where, as an example, researchers want to search for patterns in diseases, genotype-phenotype correlations, and so on. At the moment the analyst either receives the original patient data or no data at all [Swe01]. This is because there are currently no guarantees that removing certain information, or perturbing it in a certain way, provides enough privacy for the patient. In this thesis we will seek methods—that can be applied in situations like these—that provably guarantee certain definitions of privacy. If such methods can be found, it’s reasonable to assume that the anonymized data will be more easily accessible to researchers.

1.3 Information Age

Currently we are in a unique period where for the first time it’s possible to cheaply store huge amounts of data. Before the introduction of the computer, gathering information was a slow process. Analyzing this data was an even more painstaking activity. Because of these limitations the issue of privacy was easier to tackle: Removing personal identifying attributes (e.g., name, address, etc.) was enough to provide a reasonable amount of privacy. These days however this is not enough. As we will see, a combination of so-called quasi-identifiers allows one to uniquely identify an individual. For example, in an attack where medical data and voter data were combined, researchers were able to re-identify the medical records of the governor of Massachusetts [Swe02]. This was possible because with only the combination of ZIP code, birth date and gender they were able to uniquely identify a large part of the population.

Figure 1.4: Will your privacy really be kept?

Social Pressure

Another interesting aspect is that society pressures you to make certain choices that decrease your anonymity. As pointed out by Marlinspike in his presentations at Blackhat [Mar10], the choice of owning a cell phone, for example, is not necessarily a real choice. At first sight the choice may be whether or not to own a simple piece of technology. But as people begin to use the technology more and more, they become dependent on it, and the society around it changes. People used to make more plans while talking to someone in real life, but nowadays this is decreasing and they rely on their cell phone to arrange meetings.

So in our example it’s no longer just the choice of owning a cell phone, but the choice of being part of society or not. This is a much bigger choice than just owning a gadget and possibly a choice that lowers your privacy and anonymity.

1.4 Guarantees of Privacy

When filling in surveys or sharing your information online, you often see the phrase “your privacy is always kept” or similar (see figure 1.4). One has to ask oneself what this means exactly. Since there are currently no well-known methods to actually ensure privacy, this sentence doesn’t have much meaning. All it says is that the organization or person(s) in question will do their best to provide some privacy. But since no guarantees are given, the user is—most of the time—left in the dark about what is actually done to provide anonymity.

As an example of how organizations can fail to protect your privacy, we will explain the AOL search fiasco. In chapter 3 we present more cases where the privacy of individuals was not adequately protected.


The AOL Search Fiasco

In 2006 AOL released 20 million search queries from 650,000 AOL users. The data included all searches from those users over a three-month period, as well as whether they clicked on a result, what that result was, and where it appeared on the result page [Arr06]. In an attempt to protect anonymity, the username of each individual had been changed to a random ID. However, different searches made by the same person still have the same random ID.

It didn’t take long before the first person, no. 4417749, was identified as Thelma Arnold [BZ06]. She conducted hundreds of searches ranging from “numb fingers” to “60 single man” to “dog that urinates on everything”. Based on her search terms it was straightforward to deduce her age, where she lived, her last name, and so on.

1.5 Statistical Databases

The issue of privacy was first investigated with respect to statistical databases. In this context it is called statistical disclosure control: revealing accurate statistics about a set of respondents while preserving the privacy of individuals. It has a notable history with extensive literature spanning statistics, theoretical computer science, security, databases and cryptography [Dwo11]. For an overview of this long history see the survey of Adam and Wortmann [AW89]. This extensive history is a testament to the importance of privacy.

Certain results extend beyond the realm of statistical databases and can be used in more general settings. Nevertheless, since the focus is on statistical data, we will provide a few examples of why these databases can be of enormous social value.

1.5.1 Genotypes and Phenotypes (dbGaP)

In an effort to ease complex analyses of genetic associations with phenotypic and disease characteristics, the National Center for Biotechnology Information has created a database to store—among other things—individual-level information on phenotype, genotype, sequence data and the associations between them [MFJ+07]. Access to individual data is only allowed to authorized researchers, and there are strict guidelines on how to store and use this data. Although the high-density genomic data is de-identified, it still remains unique to the individual and could potentially be linked to a specific person if used in conjunction with other databases (this is what we will call a composition attack or linkage attack).

It’s clear that access to this kind of data can tremendously help researchers in their work. Unfortunately they must first get permission in order to obtain the data, and then follow strict guidelines. We believe that new privacy mechanisms can ease access to this data, while maintaining a provable form of privacy.

1.5.2 Internet Usage

Another huge source of information is internet usage. Your search queries are saved, the websites you visit can be tracked, the items you bought on websites such as Amazon can be recorded, Internet Service Providers keep logs of your overall internet usage, you provide information yourself to social networks, and so on. This last item, namely that of social networks, is especially interesting. Here people willingly publish all their information online. One has to ask oneself if everyone is actually aware of the huge privacy concerns this raises. The founder of WikiLeaks, Julian Assange, even critiqued Facebook—a popular social network website—as “an appalling spy machine”. His reasoning behind this claim is that all the information saved by Facebook is accessible to U.S. intelligence by using legal and political pressure [AE11].

It’s clear that overall this is a huge amount of data that can potentially be abused. But on the other hand it can also be invaluable information to improve services. Search engines can optimize your search results based on previous queries, e-commerce websites can suggest similar products, and so on.

Unfortunately it’s unclear how well the privacy of the users is protected in these scenarios. Ideally it should be possible to analyze this information in real time, without the need to store it afterwards, all while preserving privacy. This is not an easy problem to solve, especially if the algorithms to analyze the data are complex. Nevertheless, we believe that advances in the area of privacy under continual observation (a concept that extends the notion of differential privacy to guarantee confidentiality while analyzing data in real time) will one day offer an acceptable and usable solution.

1.6 Models for Information Flow

1.6.1 Interactive vs. Non-Interactive

There are two natural modes when releasing data. In the first one—called the non-interactive mode—a modified version of the database is published and everyone can access this database. This public database is called the sanitized database, and the process of creating it is depicted in figure 1.5. Historically this is a commonly used method, where one simply removed identifying attributes such as name, address, social security number, and so on. But now more and more data is being stored, and because of the increase in processing power, these older methods no longer provide the privacy one would expect (Chapter 3 gives examples of why such a strategy doesn’t guarantee privacy). An example of a non-interactive sanitizer is k-anonymity, which is explained in section 3.3. An advantage of the non-interactive mode is that once the sanitized database has been created, the original data can be destroyed.

Figure 1.5: Non-interactive view: A sanitized version of the database is published.

Figure 1.6: Interactive view: The sanitizer answers specific queries.

The second possibility is called the interactive mode. Here no data is published; instead users can submit queries to the curator. The curator will then answer these queries in such a way that the privacy of individuals in the database is preserved (see figure 1.6). Privacy can be preserved by adding noise to the output, allowing only a limited number of queries, rejecting certain types of queries, and so on. An example of an interactive mechanism is ε-differential privacy, which is explained in chapter 5. In both cases we assume that the curator is a trusted entity.


Figure 1.7: Model of information flow: Individuals provide their private information, from which analysts want to derive aggregate statistics. [Hay10]

1.6.2 Information Flow

An interactive or non-interactive mechanism that provides privacy can be achieved in several ways. Figure 1.7 gives a model of the information flow starting at the individual and ending at the analyst. Here individuals share their private information, which is collected in a database that is controlled by the data curator. Analysts can perform calculations on it to learn aggregate statistics, while an adversary shouldn’t be able to learn sensitive information about specific individuals.

To protect privacy we must control the flow of information somewhere along its path. As figure 1.7 suggests this can be done in several ways:

A. Input Perturbation: Users can add noise to the data they hand over. An adversary is therefore never sure whether the sensitive information about a specific individual is correct, while aggregate statistics can still be calculated.

B. Transformed Data Release: Once the data of all individuals is collected and saved in a database, one can attempt to transform this data such that sensitive information about individuals can no longer be derived from it. A well-known example is k-anonymity.

C1. Query Auditing: The data curator can monitor the queries being performed on the data. If the answer to a query might contain sensitive information on an individual, the query is refused and not executed.


C2. Query Answer Perturbation: The analyst can query the data, but noise is added to the answers in order to guarantee privacy. A well-known example is differential privacy (a small sketch follows after this list).

D. Access Control: The dataset can be given to trusted researchers and never released publicly. This should prevent malicious individuals from obtaining the data.
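To make approach C2 concrete, the following Python sketch perturbs the answer to a counting query with Laplace noise. It is only a minimal sketch of the mechanism that is formally introduced in chapter 5; the toy database, the predicate and the noise scale 1/ε are illustrative assumptions and not part of the original text.

import numpy as np

def noisy_count(database, predicate, epsilon):
    # A counting query changes by at most 1 when a single record is added or
    # removed, so Laplace noise with scale 1/epsilon masks the contribution
    # of any one individual (see chapter 5 for the formal statement).
    true_answer = sum(1 for record in database if predicate(record))
    return true_answer + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical sensitive records: (age, HIV-positive) pairs.
db = [(34, True), (51, False), (29, True), (62, False), (45, True)]
print(noisy_count(db, lambda record: record[1], epsilon=0.5))

Each run returns a different noisy answer: the analyst still learns that roughly three records match, while the exact count, and hence the contribution of any single record, is hidden behind the noise.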

1.7 What About Cryptography?

Strangely, the application of cryptography is sometimes suggested as a potential solution. Although cryptography is relevant, it does not solve the privacy problem. Cryptography can be used to encrypt, and thus hide, information from an adversary. But in the end the recipient still receives the original, unmodified information. So even if encryption is used the recipient will obtain the original information including all individual-specific data, and the privacy of these individuals will still be violated.

To further illustrate this point, we will briefly cover Secure Function Evaluation, which is sometimes offered as a solution to the privacy issue. It’s a form of secure multi-party computation where entities want to distributively compute a function over their inputs, without exposing their private input. This is a well-studied problem in the field of cryptography. At first sight this may seem to provide privacy and anonymity; unfortunately this is not the case.

1.7.1 Secure Function Evaluation

Secure Function Evaluation (SFE) deals with the problem of how to distributively compute a function f(X1, X2, . . . , Xn) where:

1. Each Xi is only known to party i.

2. The parties should only learn the final output denoted by ξ. In other words the private input of each party won’t be exposed to other parties.

Requirement 2 holds even if there are parties that maliciously attempt to obtain more information. Practical implementations that achieve this definition depend on, or are restricted by, the number of parties, the power of the adversary, the communication channel, and so on. The problem lies in the fact that the result ξ might itself disclose information. To be precise, the definition of SFE states that nothing will be revealed about any input Xi other than what can be reasonably deduced by knowing the output ξ.

So if there are two parties a and b, and the function f would be sum(a, b), the result ξ would yield the input of the other party. This is not acceptable if we want to guarantee privacy.
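The sum example can be spelled out in a few lines of Python (a sketch with hypothetical inputs; the point is independent of any concrete SFE protocol):

# Party a holds one private input, party b the other; an ideal SFE protocol
# for f(a, b) = a + b reveals only the output xi to both parties.
a, b = 42, 17        # private inputs (hypothetical values)
xi = a + b           # output of the idealized secure computation

# Party a can still recover party b's private input from its own input and
# the output alone, even though the protocol itself leaked nothing extra.
recovered_b = xi - a
assert recovered_b == b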


Chapter 2

Probability Theory

In this chapter we will give a brief overview of the basics of probability theory. This is done both to refresh the most important concepts, and to introduce the notation and terminology that will be used throughout the remainder of this thesis. To avoid any ambiguity we will start by stating the exact definitions. Then we will cover a few properties that will be used in forthcoming proofs. The mathematical definitions given in this chapter are based on those in the book of Grimmett and Welsh [GW86].

2.1 Introduction

Probability theory deals with the study of experiments. An experiment (or trial) is defined as a course of action whose consequence is not predetermined, and is denoted by E. An experiment can be defined mathematically using a probability space.

Definition 2.1. A probability space is a tuple (Ω,Σ,Pr) where:

• Ω is called the sample space and is defined as the set of all possible outcomes of the experiment E.

• Σ ⊆ 2^Ω is the event space and must be a sigma-algebra on Ω. The notation 2^Ω represents the power set of Ω. The conditions a sigma-algebra must meet are given in definition 2.2.

• Pr is a probability measure on the event space Σ. Its requirements are given in definition 2.3.

The sample space can be seen as a collection of all elementary results, or outcomes, of an experiment. An event E ∈ Σ is then a set of outcomes, and thus a subset of the sample space. The event space is the set of all events of E which can be considered interesting. If Ω is a discrete sample space, then usually Σ is equal to 2^Ω. The definition of a sigma-algebra is as follows.

Definition 2.2 (Sigma-algebra). Let Ω be any set and Σ ⊆ 2^Ω. We call Σ a sigma-algebra over Ω if

1. It’s not empty: Σ ≠ ∅

2. It’s closed under complementation: ∀A ∈ Σ : Ω − A ∈ Σ

3. It’s closed under countable unions: if A1, A2, . . . is a countable sequence of events in Σ, then ⋃_{i≥1} Ai ∈ Σ

Lemma 2.1. If Σ is a sigma-algebra over Ω, it follows from definition 2.2 that Ω ∈ Σ.

Proof. From Σ ≠ ∅ it follows that there exists an event E ∈ Σ. Because Σ is closed under complementation we also have Ω − E ∈ Σ. As it’s also closed under countable unions we finally get (Ω − E) ∪ E ∈ Σ, which is equivalent to Ω ∈ Σ.

In this thesis we will use the notation Pr [E] to denote the probability of event E. In most cases we will assume it’s evident what the sample space Ω and event space Σ are, and won’t explicitly define them. Before we can continue further we must define what a probability measure is.

Definition 2.3 (Probability Measure). Let Ω be a sample space of E and Σ be the event space of E. A probability measure on (Ω,Σ) is then a function Pr : Σ → [0, 1] which satisfies:

1. Pr [Ω] = 1

2. Let A1, A2, . . . be a countable sequence of pairwise disjoint events in Σ. Then:

Pr [⋃_{i≥1} Ai] = ∑_{i≥1} Pr [Ai]    (2.1)

This is known as the countable additivity property.

This definition can come in other, equivalent, variations. We have chosen the simplest one. Many definitions also require that Pr [∅] = 0. In our case however, this is a property that follows from the definition.

Lemma 2.2. From the definition of a probability measure Pr on (Ω,Σ) it follows that Pr [∅] = 0.


Proof. We know that Pr [Ω] = 1. As the sets Ω and ∅ are disjoint, we can apply the countable additivity property to Ω ∪ ∅ = Ω:

1 = Pr [Ω] = Pr [Ω ∪ ∅] = Pr [Ω] + Pr [∅] = 1 + Pr [∅]

From this we can see that Pr [∅] = 0.

2.1.1 Conditional Probability

One may sometimes possess partial information about an experiment without knowing its outcome exactly. If A and B are events in Σ and we are given that B occurs, then, taking into account this information, the probability of A may no longer be Pr [A]. We can see that A occurs if and only if A ∩ B occurs. This suggests that the new probability of A is proportional to Pr [A ∩ B] [GW86].

Definition 2.4. The conditional probability of event A given event B is defined as

Pr [A | B] = Pr [A ∩ B] / Pr [B]

If Pr [A | B] = Pr [A] we call the events A and B independent. We can now prove that this definition gives rise to a new probability space.

Theorem 2.3. If B ∈ Σ and Pr [B] > 0, then (Ω,Σ, Q) is a probability space where Q : Σ → R is defined by Q(A) = Pr [A | B].

Proof. We need to verify that Q is a probability measure on (Ω,Σ). Clearly we already have that the image of Q is the interval [0, 1]. We also have that

Q(Ω) = Pr [Ω | B] = Pr [Ω ∩ B] / Pr [B] = Pr [B] / Pr [B] = 1

It only remains to check that Q satisfies equation 2.1. So let A1, A2, . . . be a countable sequence of pairwise disjoint events. Then


Q [⋃_{i≥1} Ai] = (1 / Pr [B]) · Pr [(⋃_{i≥1} Ai) ∩ B]    (2.2)
              = (1 / Pr [B]) · Pr [⋃_{i≥1} (Ai ∩ B)]    (2.3)
              = (1 / Pr [B]) · ∑_{i≥1} Pr [Ai ∩ B]    (2.4)
              = ∑_{i≥1} Q(Ai)    (2.5)

Where equality 2.4 follows from the fact that Pr is a probability measure on (Ω,Σ).

2.1.2 Discrete Variables

Intuitively a random variable is a numerical representation of the outcome of an experiment. Random variables can be either discrete or continuous. Here we will begin by defining discrete random variables.

Definition 2.5 (Discrete Random Variable [GW86]). Let E be an experiment with probability space (Ω,Σ,Pr). A discrete random variable is then a function X : Ω → R such that:

1. The image of X is a countable subset of R

2. ∀x ∈ R : {ω ∈ Ω : X(ω) = x} ∈ Σ

If X is a random variable, the notation x ∈ X is defined as a shorthand for x ∈ Im(X). In other words, it means that x can be any value of the image of the random variable X. We must note that there exist more general definitions where the image of the random variable can be a set different from R [FG96]. Such definitions are based on measure theory and are out of scope for this thesis. Based on our definition we can define the probability mass function of a discrete random variable as follows.

Definition 2.6. Given a probability space (Ω,Σ,Pr), the probability mass function of a discrete random variable X is the function pX : R → [0, 1] defined as

∀x ∈ R : pX(x) = Pr [{ω ∈ Ω : X(ω) = x}]


Remark that if x is not an element of the image of X, the set considered in definition 2.6 is empty. From lemma 2.2 it then follows that pX(x) = Pr [∅] = 0. Instead of the notation pX(x) we will frequently use the equivalent notation Pr [X = x]. A consequence of using the probability measure to define the probability mass function is the property that

∑_{x∈X} pX(x) = ∑_{x∈X} Pr [{ω ∈ Ω : X(ω) = x}]    (2.6)
              = Pr [Ω] = 1    (2.7)

By using the probability mass function (pmf) we can define the cumulative distribution function (cdf) as:

F(x) = Pr [X ≤ x] = ∑_{y≤x} Pr [X = y]    (2.8)
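As a small worked example of equations 2.6 and 2.8 (a sketch that assumes a fair six-sided die, which does not appear in the original text), the cdf is obtained by summing the pmf over all values y ≤ x:

from fractions import Fraction

# Probability mass function of a fair six-sided die: each outcome has mass 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    # F(x) = Pr[X <= x] = sum of Pr[X = y] over all y <= x  (equation 2.8)
    return sum(p for y, p in pmf.items() if y <= x)

print(cdf(3))   # 1/2
print(cdf(6))   # 1, since the pmf sums to 1 (equations 2.6 and 2.7)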

2.1.3 Continuous Variables

Discrete random variables take only countably many values and their cumulative distribution functions generally look like step functions. On the other hand, the cdf of a continuous random variable is smooth. More formally this gives rise to the following definition.

Definition 2.7 (Continuous Random Variable [GW86]). Let E be an experiment with probability space (Ω,Σ,Pr). A continuous random variable is then a function X : Ω → R whose cumulative distribution function may be written in the form

F(x) = Pr [X ≤ x] = ∫_{−∞}^{x} f(u) du

for all x ∈ R, for some non-negative function f. We say that X has probability density function f.

Analogous to discrete variables, we again note that there exist more general definitions where the image of the random variable can be a set different from R [FG96].

A continuous variable is thus defined by its cumulative distribution function (cdf) F(x), which equals

F (x) = Pr [X ≤ x] = Pr [X < x]

The second equality holds because Pr [X = x] = 0. Interestingly this means that although the probability of an event is defined as being zero, the event can still occur (this is because probability spaces are based on measure theory, which doesn’t really distinguish between sets of measure zero, i.e. infinitely unlikely, and the empty set, i.e. impossible). As is clear from the definition, the probability density function (pdf) of a continuous random variable is the derivative of the cdf:

f(x) = (d/dx) F(x)

It describes the relative likelihood for this random variable to occur at a given point. We can calculate the probability that it falls within an interval by taking the integral of f(x) over the interval. It must also satisfy the following relation:

∫_{−∞}^{+∞} f(x) dx = 1    (2.9)

As a note of caution, in some papers the notation Pr [X = x] is abused to denote the probability density function f(x) when talking about continuous variables. In this thesis we have avoided this abuse of notation to prevent ambiguity. Sometimes we will use the proportional-to symbol ∝ to state that a random variable has a certain probability density function. The notation f(x) ∝ g(x) means that there exists a constant k ∈ R such that f(x) = kg(x). An example would be:

f(x) ∝ exp(−|x − µ| / b)    (2.10)

This is a shorthand notation for the pdf without specifying the constant that is needed for it to satisfy equation 2.9. In equation 2.10 the constant k would be 1/(2b). A similar notation can be used for the probability mass function of discrete variables in combination with equation 2.6.
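A quick numerical check of the claim that the constant in equation 2.10 is k = 1/(2b) (a sketch assuming µ = 0 and b = 2; the grid-based integration is only an approximation of equation 2.9):

import numpy as np

b, mu = 2.0, 0.0
dx = 0.001
x = np.arange(-50.0, 50.0, dx)              # wide grid; tails beyond it are negligible
unnormalized = np.exp(-np.abs(x - mu) / b)  # the right-hand side of equation 2.10

area = unnormalized.sum() * dx              # Riemann-sum approximation of the integral
print(area)                                 # ~ 2*b = 4.0
print(1.0 / area, 1.0 / (2 * b))            # both ~ 0.25, so k = 1/(2b)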

Finally, the term probability distribution is used to refer to either the probability mass function or the probability density function, depending on whether the context concerns discrete or continuous variables respectively.

2.1.4 Transformation of a Random Variable

Many times it’s useful to transform a random variable X to a new random variable Y using a given function f. This new random variable Y will be denoted by f(X). The definition of f(X) is

Definition 2.8. Given a random variable X over the probability space (Ω,Σ,Pr) and a function f : R → R, we define f(X) to be the function


f(X) : Ω → R, ω ↦ f(X(ω))

This results in f(X) being a random variable over the probability space (Ω,Σ,Pr).

Lemma 2.4. If X is a discrete random variable over (Ω,Σ,Pr), we have that

Pr [f(X) = t] = ∑_{s ∈ S : f(s) = t} Pr [X = s]

where S denotes the image of X.

Proof. We begin with the definition of the probability mass function and the function f(X):

Pr [f(X) = t] = Pr [{ω ∈ Ω | f(X)(ω) = t}]    (2.11)
              = Pr [{ω ∈ Ω | f(X(ω)) = t}]    (2.12)
              = Pr [⋃_{s ∈ S : f(s) = t} {ω ∈ Ω | X(ω) = s}]    (2.13)
              = ∑_{s ∈ S : f(s) = t} Pr [{ω ∈ Ω | X(ω) = s}]    (2.14)
              = ∑_{s ∈ S : f(s) = t} Pr [X = s]    (2.15)

In equation 2.12 the definition of the function f(X) is used. We observe that in the union of equation 2.13 all the sets are pairwise disjoint. Hence we can transform this to equation 2.14 by the countable additivity property.

2.1.5 Linearity of Expectation

To start from scratch we will begin with the definition of the expected value of a discrete random variable.

Definition 2.9. If X is a discrete random variable, the expectation of X is denoted by E [X] and defined by

E [X] = ∑_{x∈X} x · Pr [X = x]

For a continuous variable this becomes E [X] = ∫ x f(x) dx. Linearity of expectation now roughly states that the expected value of a sum of random variables is equal to the sum of the individual expectations. The property is very useful when analyzing randomized algorithms and probabilistic methods. Written formally this becomes


Property 2.5. Let a, b ∈ R and let X and Y be any two random variables. We have that

E [aX + bY ] = aE [X] + bE [Y ]

The formula is even applicable if the variables are dependent, and is true for both continuous and discrete random variables.
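The following sketch checks property 2.5 on two dependent variables (the joint distribution and the constants a = 2, b = 3 are illustrative assumptions): X is a fair six-sided die and Y = X², so the variables are clearly dependent, yet linearity still holds.

from fractions import Fraction

outcomes = range(1, 7)           # X is a fair six-sided die, Y = X**2
p = Fraction(1, 6)

E_X = sum(x * p for x in outcomes)
E_Y = sum(x**2 * p for x in outcomes)
E_lin = sum((2 * x + 3 * x**2) * p for x in outcomes)   # E[2X + 3Y]

assert E_lin == 2 * E_X + 3 * E_Y                       # property 2.5
print(E_X, E_Y, E_lin)                                  # 7/2, 91/6, 105/2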

2.2 Entropy

2.2.1 Self-information

Self-information (SI) is a concept in information theory that is a measure of the uncertainty associated with a particular event [Kum01]. It can also be interpreted as a measure of the information content associated with the outcome of a random variable.

Definition 2.10. Let E be an event and Pr [E] denote the probability of this event. Then the self-information of event E is

I(E) = log (1/Pr [E]) = − log Pr [E]

If you consider tossing a coin, the chance of tails or heads is fifty-fifty. Thus the self-information of the event tails is I(tails) = − log 0.5 = 1. Another example is throwing a fair die. Here any result has self-information I(E) = − log (1/6) ≈ 2.585. The term is sometimes also called the amount of surprise or surprisal [Kum01]. The name comes from the fact that a highly improbable outcome has a large amount of self-information. Another commonly used synonym is information content.

The reason for using the log function in the definition is to make the self-information additive. The additive property means that the SI of events A and B equals the sum of the SI of the individual events. In other words: I(A and B) = I(A) + I(B) if A and B are independent events. This makes intuitive sense: The information obtained from the occurrence of two independent events is the sum of the information obtained from the occurrence of the individual events. If there is a dependence between the two events this property does not hold.
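The coin and die examples above can be reproduced in a few lines of Python (a sketch; base-2 logarithms are assumed, matching the numbers in the examples):

import math

def self_information(p):
    # I(E) = -log2 Pr[E]: the information content of an event, in bits.
    return -math.log2(p)

print(self_information(0.5))      # 1.0 bit for one side of a fair coin
print(self_information(1 / 6))    # ~2.585 bits for one face of a fair die

# Additivity for independent events: I(A and B) = I(A) + I(B).
p_A, p_B = 0.5, 1 / 6
assert math.isclose(self_information(p_A * p_B),
                    self_information(p_A) + self_information(p_B))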

2.2.2 Shannon Entropy

As self-information measures the amount of uncertainty of a particular event, Shannon entropy measures the uncertainty contained in a probability distribution. It can also be seen as a measure of the average information content one is missing when one does not know the outcome of the random experiment. In short it can be seen as a measure of unpredictability or disorder.


Figure 2.1: Shannon entropy of a family of coin toss events where the probability Pr [X = 1] ranges from 0 to 1.

Definition 2.11. Let X be a discrete random variable with probability mass function pX(x). Then the Shannon entropy of X is

H(X) = − ∑_{x∈X} pX(x) log pX(x)

If pX(x) = 0 for some x, we will let the term 0 log 0 be equal to 0. See figure 2.1 for a graph of the Shannon entropy of a binary coin toss event. The probability on the x-axis denotes the probability of the event X = 1, and this also implies the probability of the event X = 0. It can be shown that the Shannon entropy is equal to the expected amount of self-information of the discrete random variable.
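The binary entropy curve of figure 2.1 can be reproduced as follows (a sketch assuming base-2 logarithms; the convention 0 log 0 = 0 from the text is handled explicitly):

import math

def shannon_entropy(pmf):
    # H(X) = -sum of p(x) * log2 p(x), with the convention 0 * log2(0) = 0.
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Binary coin toss with Pr[X = 1] = p: the entropy is maximal at p = 0.5.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, shannon_entropy([p, 1 - p]))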

2.2.3 Min-entropy

The disadvantage of the shannon entropy is that it can be very high, although the outcome of the experiment can still be predicted in most cases. This can for example happen if there is one event with probability 0.99 and there are countless other events with a much smaller probability. Shannon entropy is therefore not a good measure when we want to specify that something is hard to guess. Instead we will use min-entropy for this purpose.

Definition 2.12. Let X be a discrete random variable with probability mass function p_X(x). The min-entropy of X is

H∞(X) = min_{x∈X} (− log p_X(x)) = − log (max_{x∈X} p_X(x))

Min-entropy and shannon entropy are actually instances of Renyi entropy, which is a family of functions to measure the randomness of an experiment. Renyi entropy is parameterized by α. Setting α to 1 gives shannon entropy, while setting α to ∞ gives min-entropy. For this reason we use the notation H∞(X) to denote the min-entropy of an experiment.
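The following sketch (an added illustration) contrasts shannon entropy with min-entropy on a skewed distribution of the kind described above, where one outcome has probability 0.99 and a huge number of outcomes share the remaining probability mass:

import math

# One outcome (say, a fixed 1000-bit string) has probability 0.99, while the
# remaining 2**1000 - 1 strings share the leftover 0.01 uniformly.
p_max = 0.99
n_rest = 2**1000 - 1
p_rest = 0.01 / n_rest

shannon = -p_max * math.log2(p_max) - n_rest * p_rest * math.log2(p_rest)
min_entropy = -math.log2(p_max)

print(shannon)       # about 10 bits: the distribution looks quite unpredictable
print(min_entropy)   # about 0.0145 bits: the outcome is in fact easy to guess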

2.3 Laplace distribution

The Laplace distribution is a continuous probability distribution. It is parameterized by its mean µ and the scale factor b.

Definition 2.13 (Laplace Distribution). The probability density function of the Laplace distribution is defined as

h(x | µ, b) = (1/2b) exp(−|x − µ| / b)

Here the “|” symbol doesn’t stand for conditional probability, but signifies that µ and b are parameters of the function h. Most of the time we will let µ be zero and drop this parameter. To state that a random variable X follows the Laplace distribution the notation X ∼ Lap(b, µ) can be used. If µ is zero we will frequently use the abbreviated notation Lap(b). We have the following useful property.
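As a sketch of how Lap(b) samples can be drawn in practice (this example is added here and assumes NumPy; the role of Laplace noise is discussed in chapter 5), one can write:

import numpy as np

rng = np.random.default_rng(1)

b = 2.0                                   # scale factor, with mu = 0
samples = rng.laplace(loc=0.0, scale=b, size=100_000)

print(samples.mean())                     # close to mu = 0
print(samples.var(), 2 * b**2)            # the variance of Lap(b) is 2 b^2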

Property 2.6. If h(y) denotes the pdf of the Laplace distribution Lap(b), the inequality h(y)/h(y′) ≤ exp((1/b)|y − y′|) holds.

Proof. Let h(y) denote the pdf of the Laplace distribution. We have

h(y)/h(y′) = exp(−|y|/b) / exp(−|y′|/b)    (2.16)
           = exp((1/b)(|y′| − |y|))        (2.17)
           ≤ exp((1/b)|y − y′|)            (2.18)

Where the last inequality is a consequence of subadditivity, since it states that ∀a, b ∈ R : |a + b| ≤ |a| + |b|. Filling in a = y′ − y and b = y results in

|y′| ≤ |y′ − y| + |y|
|y′| − |y| ≤ |y′ − y|

This concludes the proof of Property 2.6.


2.4 Inequalities

2.4.1 Boole’s inequality

Boole’s inequality gives a very rough upper bound on the probability of certain events. The advantage of this property is that it holds even if the given events are dependent, and it is only based on the probability space.

Property 2.7. (Boole’s inequality) For any finite or countable set of events E1, E2, . . . , En, it holds that:

Pr[⋃_{i=1}^{n} Ei] ≤ ∑_{i=1}^{n} Pr[Ei]    (2.19)

It follows from the definition of a probability measure, namely from the countable additivity property. A consequence is that the probability function Pr also satisfies the following statement

∀A, B : Pr[A ∪ B] ≤ Pr[A] + Pr[B]

Boole’s inequality is also known as the union bound.
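A quick numerical sanity check of the union bound for two overlapping, dependent events (an added illustration, assuming NumPy):

import numpy as np

rng = np.random.default_rng(2)
x = rng.integers(1, 7, size=1_000_000)    # a die roll

event_a = (x <= 3)                        # "at most 3"
event_b = (x % 2 == 0)                    # "even", clearly dependent on the first event

p_union = np.mean(event_a | event_b)
p_bound = np.mean(event_a) + np.mean(event_b)
print(p_union, p_bound)                   # about 0.833 <= about 1.0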

2.4.2 Markov Bound

Markov’s inequality relates probabilities to the expectation of a discrete random variable. We require the random variable to be non-negative.

Property 2.8. (Markov Inequality) Let X be a non-negative random variable and v > 0. Then

Pr[X ≥ v] ≤ E[X] / v

Proof. The proof is straightforward if you begin with the definition of the expected value:

E[X] = ∑_{x∈X} x Pr[X = x]                          (2.20)
     = ∑_{x<v} x Pr[X = x] + ∑_{x≥v} x Pr[X = x]    (2.21)
     ≥ 0 + v Pr[X ≥ v]                              (2.22)

Pr[X ≥ v] ≤ E[X] / v                                (2.23)

Where inequality 2.22 is true because we assumed X is non-negative.
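As a concrete check (added here as an illustration), take a fair die with E[X] = 3.5 and v = 6:

# Markov bound check for a fair die: E[X] = 3.5, v = 6.
expected = 3.5
v = 6
markov_bound = expected / v       # about 0.583
true_prob = 1 / 6                 # about 0.167
print(true_prob <= markov_bound)  # True: the bound holds, though it is loose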


2.4.3 Hoeffding’s Inequality

Another useful inequality was found by Wassily Hoeffding in 1963 [Hoe63]. It gives an upper bound on the probability for the sum of independent random variables to deviate from its expected value. Often it’s used to prove that the result of a randomized algorithm is accurate with high probability.

Theorem 2.9 (Hoeffding’s Inequality). If X1, X2, . . . , Xn are independent random variables with ∀i : ai ≤ Xi ≤ bi and with

X = (X1 + X2 + . . . + Xn) / n

then for t > 0

Pr[X − E[X] ≥ t] ≤ exp(−2n²t² / ∑_{i=1}^{n}(bi − ai)²)    (2.24)

For a proof of this theorem we refer the reader to the original paper of Hoeffding [Hoe63]. The following upper bound trivially follows from theorem 2.9.

Theorem 2.10 (Hoeffding’s Inequality). If X1, X2, . . . , Xn are independent random variables with ∀i : ai ≤ Xi ≤ bi and with X as in theorem 2.9, then for t > 0

Pr[|X − E[X]| ≥ t] ≤ 2 exp(−2n²t² / ∑_{i=1}^{n}(bi − ai)²)    (2.25)

Proof. The proof follows from the equality

Pr[|X − E[X]| ≥ t] = Pr[X − E[X] ≥ t] + Pr[−X + E[X] ≥ t]    (2.26)

On the first probability in the sum one can directly apply Hoeffding’s inequality. For the second probability we define the new variables X′i = −Xi and

X′ = (X′1 + X′2 + . . . + X′n) / n = −X

and notice that ∀i : −bi ≤ X′i ≤ −ai. After doing this we can also apply the Hoeffding inequality on the second probability, which is now equal to

Pr[X′ − E[X′] ≥ t]

Adding both bounds together completes the proof.
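The following sketch (added here, assuming NumPy) compares the empirical deviation probability of the mean of n fair coin flips with the two-sided bound of theorem 2.10:

import numpy as np

rng = np.random.default_rng(3)

n, t = 100, 0.1
trials = 100_000
flips = rng.integers(0, 2, size=(trials, n))   # X_i in {0, 1}, so a_i = 0 and b_i = 1
means = flips.mean(axis=1)

empirical = np.mean(np.abs(means - 0.5) >= t)  # E[X] = 0.5 for a fair coin
# Two-sided Hoeffding bound: 2 exp(-2 n^2 t^2 / sum (b_i - a_i)^2) = 2 exp(-2 n t^2)
bound = 2 * np.exp(-2 * n * t**2)
print(empirical, bound)   # roughly 0.06, versus a bound of about 0.27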


2.5 Probabilistic Turing Machine

In the following chapters we will often use randomized algorithms (also called randomized functions). They can be viewed in two equivalent ways. One way of viewing randomized algorithms is to allow the Turing machine to take random steps. This can be defined by letting the transition function have two possible legal next moves. An internal coin toss is then made to decide which result is chosen. Here both options have probability 1/2. This is also how probabilistic Turing machines are commonly defined, as is for example done in the book of Sipser [Sip06]. The random choices made during the execution of a Turing machine are called the internal coin tosses of the machine [Gol01]. The output of the Turing machine is now no longer a string, but a random variable that can assume all possible strings as a value. This random variable will be denoted by M(x) where x is the input string. We can then let Pr[M(x) = y] denote the probability that machine M on input x will output y [Gol01]. The probability space is taken over all possible outcomes of the internal coin tosses.

An alternative and equivalent way of viewing randomized algorithms is by giving the outcome of the internal coin tosses as auxiliary input. Hence the real input given to the machine will be (x, r) where x is the normal input and r represents a possible outcome of the coin tosses. Given a Turing machine M(x) that uses internal coin tosses, we will let M(x, r) stand for the corresponding deterministic Turing machine where the coin tosses are given as the auxiliary bitstring r. Since we will assume M always halts, there exists a z ∈ N such that M uses at most z coin tosses on every input x. Using these assumptions the probability space (Ω, Σ, Pr) describing the distribution of the internal coin tosses becomes

• The sample space Ω is equal to {0, 1}^z. That is, it’s equal to all possible sequences of coin tosses.

• The event space Σ is equal to 2^Ω.

• Since we consider the coin tosses to be uniformly distributed, the probability measure for every ω in Ω is defined as Pr[ω] = 1/2^z.

Here the internal coin tosses of the algorithm are represented by an element ω ∈ Ω. Without loss of generality we will assume that the algorithm M outputs only bitstrings. Based on this probability space, the random variable M(x) is then the function

M(x) : Ω → {0, 1}^z, ω ↦ M(x, ω)


Using these definitions we can rewrite the formula of the probability of event M(x) = y to a more convenient and intuitive form:

Pr[M(x) = y] = Pr[{ω ∈ Ω | M(x)(ω) = y}]        (2.27)
             = Pr[{ω ∈ Ω | M(x, ω) = y}]        (2.28)
             = ∑_{ω∈Ω | M(x,ω)=y} Pr[ω]         (2.29)
             = ∑_{ω∈Ω | M(x,ω)=y} 1/2^z         (2.30)
             = |{ω ∈ Ω | M(x, ω) = y}| / 2^z    (2.31)

For equation 2.28 we used the definition of M(x)(ω). Equation 2.29 follows from the countable additivity property. Hence we can conclude with the following straightforward formula

Pr[M(x) = y] = |{r ∈ {0, 1}^z | M(x, r) = y}| / 2^z    (2.32)
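Equation 2.32 suggests a (hopelessly inefficient but conceptually clear) way to compute the output probability of a randomized algorithm by enumerating every coin-toss string r. The sketch below is an added illustration with a toy machine M, not a construction from the text:

from itertools import product

def M(x, r):
    """A toy 'machine': output x if the first coin toss is 0, otherwise the reversed input.
    The bitstring r plays the role of the auxiliary coin tosses."""
    return x if r[0] == '0' else x[::-1]

def output_probability(x, y, z):
    """Pr[M(x) = y] computed via equation 2.32 by brute force over all r in {0,1}^z."""
    matches = sum(1 for bits in product('01', repeat=z)
                  if M(x, ''.join(bits)) == y)
    return matches / 2**z

print(output_probability('ab', 'ab', z=3))   # 0.5
print(output_probability('ab', 'ba', z=3))   # 0.5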

2.6 Statistical Distance

2.6.1 Definition

Often it can be useful to have a measure that quantifies the distance between two random variables or probability distributions. We will mainly use the notion of ε-close:

Definition 2.14 (ε-close). Let X and Y be two random variables over the same image S. Their statistical distance is

∆(X, Y) def= max_{A⊆S} |Pr[X ∈ A] − Pr[Y ∈ A]|

If two distributions have statistical distance at most ε, we say they are ε-close. From this definition we can deduce that the statistical distance ranges between 0 and 1. For discrete random variables we have another formula to calculate the statistical distance.

Lemma 2.11. Given two discrete random variables X and Y over the same image S we have that

∆(X, Y) = (1/2) ∑_{s∈S} |Pr[X = s] − Pr[Y = s]|


Proof. Let X and Y be two discrete random variables over the same image S and let A = {s ∈ S | Pr[X = s] > Pr[Y = s]}. This choice of A makes Pr[X ∈ A] − Pr[Y ∈ A] as large as possible and positive, meaning we have the following

∆(X, Y) = max_{A′⊆S} |Pr[X ∈ A′] − Pr[Y ∈ A′]|    (2.33)
        = Pr[X ∈ A] − Pr[Y ∈ A]                   (2.34)
        = ∑_{a∈A} (Pr[X = a] − Pr[Y = a])         (2.35)
        = ∑_{a∈A} |Pr[X = a] − Pr[Y = a]|         (2.36)

Since both X and Y are discrete random variables equation 2.35 is valid. Equation 2.36 follows from the fact that we chose A such that for all a in A it’s true that Pr[X = a] > Pr[Y = a]. Now let Ac = S − A. We have

∆(X, Y) = Pr[X ∈ A] − Pr[Y ∈ A]                   (2.37)
        = 1 − Pr[X ∈ Ac] − 1 + Pr[Y ∈ Ac]         (2.38)
        = Pr[Y ∈ Ac] − Pr[X ∈ Ac]                 (2.39)
        = ∑_{a∈Ac} (Pr[Y = a] − Pr[X = a])        (2.40)
        = ∑_{a∈Ac} |Pr[Y = a] − Pr[X = a]|        (2.41)
        = ∑_{a∈Ac} |Pr[X = a] − Pr[Y = a]|        (2.42)

Where equation 2.41 is correct since ∀a ∈ Ac : Pr[Y = a] ≥ Pr[X = a]. Adding up equations 2.36 and 2.42 we get

2∆(X, Y) = ∑_{a∈S} |Pr[X = a] − Pr[Y = a]|

Which is what we needed to prove.
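Lemma 2.11 directly yields a way to compute the statistical distance of two discrete distributions. The sketch below is an added illustration applying it to two biased coins:

def statistical_distance(p, q):
    """Statistical distance of two discrete distributions given as dictionaries
    mapping outcomes to probabilities (lemma 2.11)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)

fair = {'heads': 0.5, 'tails': 0.5}
biased = {'heads': 0.8, 'tails': 0.2}
print(statistical_distance(fair, biased))   # 0.3, so the two coins are 0.3-close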

2.6.2 Properties

One important property is that the statistical distance cannot increase by applying a—possibly randomized—function. Although the property applies to both discrete and continuous variables, we will focus on discrete variables, since that is what we will need in order to prove the impossibility result in chapter 4.


Property 2.12. Given two discrete random variables X and Y over the same image S, and a randomized function f, we have the following property

∆(f(X), f(Y)) ≤ ∆(X, Y)

Proof. The proof when f can be a randomized function is cumbersome to write down in detail, and it is analogous to the proof where only deterministic functions are considered. Therefore we will only give the proof when f is a deterministic function. So let X and Y be two discrete random variables over the same image S, and f a deterministic function with image T. We get

∆(f(X), f(Y)) = (1/2) ∑_{t∈T} |Pr[f(X) = t] − Pr[f(Y) = t]|                              (2.43)
              = (1/2) ∑_{t∈T} |∑_{s∈S | f(s)=t} Pr[X = s] − ∑_{s∈S | f(s)=t} Pr[Y = s]|   (2.44)
              = (1/2) ∑_{t∈T} |∑_{s∈S | f(s)=t} (Pr[X = s] − Pr[Y = s])|                  (2.45)
              ≤ (1/2) ∑_{t∈T} ∑_{s∈S | f(s)=t} |Pr[X = s] − Pr[Y = s]|                    (2.46)
              = (1/2) ∑_{s∈S} |Pr[X = s] − Pr[Y = s]|                                     (2.47)
              = ∆(X, Y)                                                                   (2.48)

Equality 2.44 follows from the definition of f(X) and the fact that X and Y are discrete random variables. The inequality follows from the subadditivity property2 of the absolute value. From this we can conclude that ∆(f(X), f(Y)) ≤ ∆(X, Y). Remark that if f is an injective function the inequality becomes an equality.

This property implies that the acceptance probability of any algorithm on inputs from X differs from its acceptance probability on inputs from Y by at most ∆(X, Y). This is exactly what we will need in chapter 4.

2 In this situation a common synonym of the subadditivity property is the triangle inequality.


Chapter 3

History

Reasoning about privacy is fraught with pitfalls [BZ06, Swe02, NS08, NSR11]. To further appreciate the difficulty of finding a proper privacy definition, we will look at previous attempts and investigate why they failed. This will show why certain intuitive methods don’t work and illustrates the importance of taking auxiliary information into account. We will then discuss k-anonymity and its variants. Finally we will explain the composition attack, which shows that k-anonymity and related definitions are fundamentally flawed.

3.1 Previous Attacks

In this overview of previous attacks, a de-identified database is a database where personally identifiable information such as the name, address or social security number of individuals is removed. The term re-identification denotes the process of de-anonymizing the de-identified data by associating it with the original individual. Most of the time this is done by using auxiliary information. This can be any additional information that an adversary has access to, including older versions of the original database.

3.1.1 The AOL Search Fiasco

We briefly explained this event in section 1.4. As stated, AOL released around 20 million search queries from 650,000 AOL users. In an attempt to preserve privacy they replaced the usernames with a randomly generated number and also carefully redacted obvious disclosive queries (examples are when individuals search for their own name or social security number) [Dwo11]. However, the complete search history of an individual could still be extracted. In turn this allowed people to identify individuals based on their search history, thus clearly breaching their privacy.


[Figure 3.1 appears here: plots from [Gol06] of the anonymity of the U.S. population by age, showing per age group the percentage that is 1-anonymous, 2-anonymous or less, and 5-anonymous or less given gender, location and full date of birth, and the 10th, 50th and 90th percentiles of k-anonymity given gender, location and year of birth.]

Figure 3.1: Anonymity of the U.S. population when given gender, ZIP code and full date of birth [Gol06]

However, we cannot view this as a true example that preserving privacy is a difficult task, because no real care was given to anonymizing the search results. This can be deduced from the observation that the person who was responsible for releasing the data was quickly fired [And06]. A supervisor was also fired, and the chief technology officer resigned over the incident [MB06]. The release of this data is even more surprising because a few months earlier it was ruled that Google did not have to hand over any user’s search queries to the U.S. government [Won06]. The government originally requested two months’ worth of users’ search queries as part of a subpoena.

3.1.2 The HMO records

The first notable demonstration of a relatively large privacy breach is the de-anonymization of the medical records of the governor of Massachusetts. This was done in a so-called linkage attack between two different databases.

For more background on the attack, we begin with the following observation: one can uniquely identify 63% of the U.S. population knowing only the gender, ZIP code and date of birth of an individual [Gol06]. The ZIP code is the place where the person currently lives. This is illustrated in figure 3.1. The blue line shows what percentage of the population can be uniquely identified given gender, location and full date of birth.


[Figure 3.2 appears here: a diagram from [Swe01] in which the medical data attributes (ethnicity, visit date, diagnosis, procedure, medication, total charge) and the voter list attributes (name, address, date registered, party affiliation, date last voted) overlap on ZIP code, birth date and sex.]

Figure 3.2: Re-identification by linkage attack [Swe01]

The green and red lines show that even if a person is not uniquely identifiable, they still enjoy little privacy. For example, we can see that nearly all people are 5-anonymous or less. This means that at most 5 individuals (including themselves) have the same values for the given attributes. The definition of k-anonymity is explained in more detail in section 3.3.

An interesting observation is that people around the age of 20 seem to have more privacy. The cause of this peak is the college and university towns that have high concentrations of individuals between the ages of 18 and 22 [Gol06]. Also note that for ages above 50 the privacy decreases.

The HMO linkage attack is now performed by combining two databases that share the attributes sex, ZIP code and birth date (see figure 3.2). The first database is the voter registration list for Cambridge, Massachusetts; the second one is the Massachusetts Group Insurance Commission (GIC) medical encounter data. In her attack, Sweeney bought the voter registration list. It included the attributes ZIP code, birth date, gender, name and address of each voter.

Medical data is also being collected, where the National Association of Health Data Organizations (NAHDO) recommends the collection of certain attributes. These include the patient’s ZIP code, birth date, gender and diagnosis. A de-identified subset of this data, containing information for approximately 135,000 state employees, was handed over to the Group Insurance Commission (GIC). This was done because the GIC is responsible for purchasing health insurance for state employees.


They later gave this data to researchers and sold a copy to industry because they believed the data was anonymous [Swe01].

Even though the medical database did not contain explicit personal identifiers such as the name or social security number of a patient, individuals could still be identified by combining both databases. Using the attributes ZIP code, gender and birth date, Sweeney was able to identify the medical records of William Weld. He lived in Cambridge, Massachusetts and was governor of Massachusetts, thus his information was in both tables and could be linked together.

3.1.3 Netflix

Netflix is a company that offers on-demand internet video streaming. To improve their movie recommendation algorithm, Netflix released a dataset containing more than 100 million movie ratings made by 480 thousand users on nearly 18 thousand movie titles [BL07]. To increase the interest of the community, they offered a $1 million prize to whoever could improve the current system by at least 10%.

In an attempt to secure the privacy of its users they removed personally identifying attributes. All that remained were ratings (a score between 1 and 5), the date the rating was given, the anonymous ID of the user that gave the rating, and finally the title of the movie that was rated. Netflix also stated that less than one-tenth of the complete dataset was published, and that the data was also perturbed [Net]. First we will briefly cover the details of their anonymization process, and then we will describe how the dataset was de-anonymized.

We must also note that a similar attack was first performed on the MovieLens dataset by Frankowski, Cosley, et al. [FCS+06]. In fact, the algorithms used in the Netflix attack were improved versions of the methods used in the MovieLens attack.

Anonymization Process

Netflix stated that the released dataset was chosen randomly. However, when we plot the number of ratings X against the number of subscribers in the released dataset who have made X ratings, we get an unusual result. As figure 3.3 shows, there are only few subscribers that rated fewer than 20 movies. One would actually expect that there are many people who only rate a few movies. A possible explanation is that only subscribers with more than 20 ratings were sampled, and that afterwards some movies were deleted.

Two subscribers of Netflix have located their own data in the released dataset and inspected how the data was perturbed. Since they can see their original movie ratings, they could check how many ratings Netflix modified before releasing the dataset.


Figure 3.3: Netflix dataset: number of subscribers with x ratings. The x-axis stands for the number of ratings a user has given in total, and the y-axis for how many people gave x ratings in total [NS08].

One of the subscribers had only 1 out of 306 ratings altered, and the other one had 5 out of 229 altered. This is only a small amount of noise, and it didn’t affect the de-anonymization algorithm much. Here it’s important to take into account that the dataset was released to develop improved recommendation algorithms. Had there been a large amount of perturbation, it would have greatly decreased the dataset’s utility.

De-anonymization

In their paper Narayanan and Shmatikov [NS08] showed that you need very little auxiliary information to de-anonymize the average subscriber in the Netflix dataset. The algorithm they used even works if the knowledge of the adversary is imprecise and contains a few incorrect values.

The actual attack was only a proof of concept where the anonymized Netflix dataset was linked with data crawled from the Internet Movie Database (IMDb). They only gathered a small sample of 50 IMDb users because IMDb’s terms of service place restrictions on crawling. With this sample they were able to identify the records of two users. Because they had no absolute guarantees that these two matches were correct, they compared the matching score of the best candidate record to the second-best candidate record. They observed that the second-best candidate had a dramatically lower matching score, and thus concluded that the best match is highly unlikely to be a false positive.

You might argue that knowing which movies a person has watched is not an important privacy breach. This is not true, since one might be able to deduce important information from the movies a person watches.


The political orientation of a person may be revealed by his opinions about movies such as “Power and Terror: Noam Chomsky in Our Times” and “Fahrenheit 9/11”. Other information such as his religious views, sexual orientation, and so on can be deduced from the movies the person in question has viewed. Even if there is only one person in the Netflix dataset whose privacy is compromised, this is still unacceptable. After all, the privacy of every user must be guaranteed.

3.1.4 Kaggle

De-anonymization is not just a threat to privacy. It can also be used for gaming machine learning contests, as was done by Narayanan, Shi and Rubinstein [NSR11]. Before delving into the details, we will first explain the contest they were able to game.

Their target was the Kaggle social network challenge that promotes research on link prediction between nodes in social networks. In other words, contestants had to predict the relationship between two individuals in a social network (note that in this example relationships are represented by directed edges). The dataset that was created for the contest was made by crawling Flickr—an online social network to share photos—and saving the obtained nodes and edges in an anonymized database. A directed edge was included between two nodes if the corresponding person has marked the other person as a friend. The participants were given 7,237,983 edges of this network and using it they had to predict whether the relationships represented by an additional 8,960 edges are real or fake [Kag10]. To prevent cheating they anonymized the identities of all the nodes. Note that in this situation de-anonymizing the dataset doesn’t breach privacy since the data is already publicly available on the internet. Nevertheless the attack is an important example of the difficulty of releasing anonymized social network graphs.

The researchers first created their own graph by crawling Flickr themselves; this graph will be referred to as the Flickr graph. The graph released for the Kaggle contest will, not surprisingly, be called the Kaggle graph. The actual de-anonymization of the Kaggle graph was done in two steps. The first step was the biggest challenge and consists of de-anonymizing an initial number of nodes. In the second step the de-anonymization is propagated by selecting an arbitrary, still anonymous node, and finding the most similar node in the Flickr graph. The similarity of nodes is measured using the already de-anonymized nodes. We will now briefly cover both of these steps. Note that we will describe an almost identical attack in full detail in chapter 10.


Seed Identification

In this step a small subset of nodes is de-identified. Since both the Kaggle graph and the Flickr graph contain millions of nodes, it’s essential to first reduce the search space. For this they reasoned that it’s sensible to expect that nodes with high in-degree in both graphs roughly correspond to each other [NSR11]. Indeed, in the top 30 nodes in the Kaggle graph and the top 30 nodes in the Flickr graph (with respect to their in-degree), there were 27 highly likely correspondences. Hence, in order to reduce the search space, we can pick two small subsets of nodes, where all nodes have a high in-degree, knowing that with high probability there will be a large correspondence between these subsets.

The next step is to find the actual correspondence between these high in-degree nodes. To do this the cosine similarity of the sets of in-neighbors is calculated for every pair of nodes, for both graphs. The set of in-neighbors of a node consists of all the nodes with a directed edge from themselves to the node under consideration. Here the cosine similarity between two sets of nodes X and Y is defined as

sim(X, Y) = |X ∩ Y| / √(|X| |Y|)

The cosine similarity is then used to set the weight of an “edge” between the two respective high in-degree nodes, even if there wasn’t an edge between those nodes in the original graph. We can then search for a mapping between these two weighted graphs that minimizes the differences of the corresponding edge weights. This is a known problem called weighted graph matching, which can be solved using existing algorithms or heuristics. We end up with a mapping between the nodes in the small subgraphs, and continue with the propagation step.
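A minimal sketch of the set-based cosine similarity used to weight candidate pairs (added here as an illustration; the data and names are hypothetical, not the authors’ actual code):

import math

def cosine_similarity(x, y):
    """Cosine similarity of two sets of in-neighbors: sim(X, Y) = |X ∩ Y| / sqrt(|X| |Y|)."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))

# In-neighbor sets of two high in-degree nodes, one from each graph (toy data).
kaggle_in_neighbors = {1, 2, 3, 4, 5}
flickr_in_neighbors = {3, 4, 5, 6}
print(cosine_similarity(kaggle_in_neighbors, flickr_in_neighbors))   # about 0.67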

Propagation

In order to propagate the de-anonymization, the algorithm stores a partial mapping between the nodes and iteratively extends this mapping. This extension is done by randomly picking an unmapped node in the Kaggle graph and finding the most similar node in the Flickr graph. If the similarity is high enough the nodes get mapped, otherwise no mapping is performed. To measure the similarity we again use the cosine similarity, but only consider the already mapped neighbors of the Kaggle node and the Flickr node. Here nodes mapped to each other are treated as identical when calculating the cosine similarity.


Conclusion

While the de-anonymization of the released social network graph didn’t breach the privacy of any individual (since the dataset was gathered by crawling public Flickr profiles), this is nevertheless a clear demonstration that publishing only nodes and edges can still result in privacy breaches. In fact, anonymizing a social network graph for public release is still an open problem and will be discussed in more detail in the second part of this work.

3.2 Previous Ideas

There have been several common suggestions to preserve privacy when analyzing data [Dwo11]. However, most of these don’t provide the privacy guarantees we are looking for. It’s worth repeating these previous attempts though, because it cannot be stressed enough that preserving privacy can be a difficult task with unexpected problems. This also gives additional insight into the privacy problem.

Large Query Sets

One frequent idea is to prohibit the execution of queries about a specific individual or small set of individuals. It’s easy to show that this strategy is inadequate when we know a specific person X is in the database. What one can do is ask two large queries. The first one can be “How many people in the database are HIV positive?” and the second one “How many people, not named X, are HIV positive?”. From both results we can deduce the HIV status of person X. Both were large queries, yet the privacy of an individual was still breached.

Query Auditing

Another strategy is evaluating each query to the database in the context of the query history, to determine if a response would be disclosive. If the query is deemed disclosive it is refused and no answer is given. This method can be used to detect the previous example about the HIV status deduction about a specific individual. Though this might seem promising, monitoring all the queries is computationally infeasible [KPR00]. Even more problematic is that the refusal to respond to a query may itself be disclosive [KMN05]. For these reasons this approach is useless in practice.

Sampling

Strangely, sampling is sometimes offered as a potential solution to the privacy problem. Here a number of individuals in the database are chosen at random, and their information is released.


Properties and statistics can then be calculated on the returned subsample. When picking a sufficiently large subsample the results are representative of the dataset as a whole.

Considering the subsample is normally small compared to the database size, it’s unlikely that a specific individual is sampled. However, each time a person is sampled his privacy is breached. This is unacceptable considering we want to protect the privacy of every individual in the database! So again this approach is not useful in real situations.

Randomized Response

Randomized response is a rather powerful mechanism that does provide a good form of privacy. Here the data is randomized before it is stored, and statistics can then be computed from this randomized data. Randomized response works by first letting the individual toss a coin. Based on this outcome he either answers the question truthfully or makes up a random answer. Privacy is guaranteed because of the uncertainty of how to interpret a given value. In other words every individual has plausible deniability: the responder is able to claim, with reasonable probability, that he gave a random answer.

This method was created for situations where individuals don’t trust the data collector (curator). The chance that a person returns random data is selected so that when calculating statistics on all the returned data, the error is kept within an acceptable margin. While randomized response does provide privacy, the disadvantage is that it can introduce a lot of noise in the data, and the approach becomes untenable for complex data.
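A minimal simulation of randomized response for a yes/no question with a fair coin, as described above (added illustration, assuming NumPy): each respondent answers truthfully with probability 1/2 and gives a uniformly random answer otherwise, yet the analyst can still recover an unbiased estimate of the true fraction of “yes” answers:

import numpy as np

rng = np.random.default_rng(4)

n = 100_000
true_answers = rng.random(n) < 0.3        # 30% of the population would answer "yes"

answer_truthfully = rng.random(n) < 0.5   # the respondent's first coin toss
random_answers = rng.random(n) < 0.5      # made-up answer when not truthful
reported = np.where(answer_truthfully, true_answers, random_answers)

# Pr[reported yes] = 0.5 * p + 0.5 * 0.5, so p can be estimated as:
estimate = 2 * reported.mean() - 0.5
print(estimate)   # close to 0.3, even though every individual answer is deniable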

Randomized Output

In randomized output, random noise is added to the output. This is different from randomized response because there the data itself is randomized, while for randomized output the curator has access to the original data. Here it’s the curator—also called the privacy mechanism—that will add random noise to the result of queries.

This method has promise and we will return to it in chapter 5 on differential privacy. We note that if done naively the approach will fail [Dwo11]. To see this, suppose the noise has mean zero and that independent randomness is used in generating every response. In this case, if the same query is asked several times, the responses can be summed and averaged, and the true answer will eventually be known.
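The averaging attack on naively randomized output is easy to demonstrate (added sketch, assuming NumPy): if fresh zero-mean noise is added on every query, repeating the query and averaging recovers the true answer.

import numpy as np

rng = np.random.default_rng(5)

true_answer = 412                     # e.g. the true count returned by some query

def noisy_query():
    return true_answer + rng.laplace(loc=0.0, scale=10.0)   # fresh noise each time

answers = [noisy_query() for _ in range(10_000)]
print(np.mean(answers))               # converges to the true answer 412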


3.3 K-anonymity and Variants

K-anonymity is a non-interactive privacy mechanism that was invented by Sweeney [Swe02]. Recall from section 1.6 that in the non-interactive setting a sanitized version of the database is released to the public. At first k-anonymity was considered a potential solution to the privacy problem, mainly because of its conceptual simplicity and the existence of efficient algorithms that guarantee k-anonymity. Unfortunately it soon became clear that it has shortcomings and is unable to protect the privacy of all individuals. We will begin by explaining k-anonymity. Then we consider several improvements of k-anonymity and explain the specific problem each one was designed to avoid.

Notation In the following sections we will let D denote a bag1 of tuples, where each tuple corresponds to a single individual in the database. The anonymized version of D will be represented by R. In the remainder we will frequently use the terms tuple, row and individual interchangeably. Now let A = {A1, A2, . . . , Ar} be a collection of r attributes and t be a tuple in R. We will let the notation t[A] be equivalent to (t[A1], . . . , t[Ar]) where each t[Ai] denotes the value of attribute Ai in table R for t.

3.3.1 K-anonymity

Introduction

The idea behind K-anonymity is that, when given information about an individual from an external source, such as his name, birth date, ZIP code, gender, etc., it should be impossible to (accurately) find the row in the anonymized database corresponding to this individual. Intuitively this should result in being unable to gain information about this individual, since we are unable to locate his information in the database. This can be achieved by removing certain attributes in combination with modifying certain values before releasing the database.

When creating the anonymized database R, the first step is removing attributes that clearly identify individuals, called explicit attributes. These attributes are similar to primary keys: knowing the value of an explicit attribute of an individual allows you to find, with high probability, the row corresponding to this person. Examples of explicit attributes are address, name, social security number, phone number, and so on. However, there are also attributes that, when taken together, can potentially identify an individual. Such a collection of attributes is called a quasi-identifier.

1 A bag is a generalization of a set that can contain duplicate items. A synonym is multiset.


Roughly speaking, a quasi-identifier contains those attributes that are likely to appear in other databases.

The precise definition of a quasi-identifier is based on the separation between sensitive and nonsensitive attributes. The value of a sensitive attribute should remain a secret, which means that in an anonymized database it should be impossible to link a sensitive value (i.e., a row containing a sensitive value) to a specific individual. The reasoning is that it is then impossible to learn the sensitive value of an individual from the anonymized database and that privacy is thus preserved. We note that for this reason a sensitive attribute cannot be an explicit attribute. All attributes other than the sensitive attributes are called nonsensitive attributes. It’s also assumed that the selected sensitive attributes do not appear in other databases. Therefore when knowing the value of a sensitive attribute one cannot find the individual corresponding to this value, which also means that a sensitive attribute is never included in a quasi-identifier. We conclude that a quasi-identifier consists only of nonsensitive attributes. The definition then becomes

Definition 3.1 (Quasi-Identifier). A set of nonsensitive attributes Q = {Q1, . . . , Qr} is called a quasi-identifier for a database D if

∃t ∈ D : ¬(∀t′ ∈ D : t = t′ ∨ t[Q] ≠ t′[Q])

In other words, there is at least one individual in the original database who can be uniquely identified using only the values of the attributes in Q. We denote the set of all quasi-identifiers by QI.

Good examples of quasi-identifiers are {birth date, ZIP code, gender} and {occupation, age, ZIP code}. Technically the explicit identifiers are also quasi-identifiers, so {social security number} and {address} are also examples. However, the explicit identifiers are removed from the database before an anonymous version is created [LLV07]. In order to illustrate the relation between quasi-identifiers and keys, we recall the requirement of a key over the attributes Q for a database D:

∀t ∈ D : ¬(∃t′ ∈ D : t ≠ t′ ∧ t[Q] = t′[Q])

Based on this we can define an anti-key, which although strictly speaking is not the opposite of a key, can be seen as such. The requirement of an anti-key is

∀t ∈ D : ∃t′ ∈ D : t ≠ t′ ∧ t[Q] = t′[Q]

We see that the opposite (negation) of an anti-key is a quasi-identifier. To talk about a group of individuals that have the same values for a set of quasi-identifiers we introduce the definition of an equivalence class2:

2 Sometimes we will use the terminology “group of individuals” as a synonym for an equivalence class.


             Non-Sensitive                        Sensitive
      Zip code    Age     Nationality      Condition
  1   130**       < 30    *                AIDS
  2   130**       < 30    *                Heart Disease
  3   130**       < 30    *                Viral Infection
  4   130**       < 30    *                Viral Infection
  5   1485*       ≥ 40    *                Cancer
  6   1485*       ≥ 40    *                Heart Disease
  7   1485*       ≥ 40    *                Viral Infection
  8   1485*       ≥ 40    *                Viral Infection
  9   130**       3*      *                Cancer
 10   130**       3*      *                Cancer
 11   130**       3*      *                Cancer
 12   130**       3*      *                Cancer

Table 3.1: Example of a 4-anonymous database. The only quasi-identifier is the set of all the nonsensitive attributes, namely {Zip code, Age, Nationality}. [MGK07]

Definition 3.2 (Equivalence Class [GKS08]). An equivalence class for a table R with respect to attributes in A is the set of all tuples t1, t2, . . . , ti ∈ R for which t1[A] = t2[A] = . . . = ti[A]. That is, the projection of each tuple onto the attributes in A is the same.

Take table 3.1 as an example. In the next section we will see that this represents a 4-anonymous database. This means that each equivalence class contains at least 4 individuals with respect to the single quasi-identifier. The equivalence classes are {1, 2, 3, 4}, {5, 6, 7, 8} and {9, 10, 11, 12}, as each class has the same values for Zip code, Age and Nationality.

K-anonymity

K-anonymity now works by assuring there are at least k − 1 other individuals (i.e., rows in the database) that have the same values for all quasi-identifiers. This should assure that if we are searching for a specific individual, we will always get at least k results and thus can’t identify the row of the individual.

Definition 3.3 (k-anonymity [GKS08]). A release R is k-anonymous if for every tuple t ∈ R, and for every quasi-identifier A ∈ QI, there exist at least k − 1 other tuples t1, t2, . . . , tk−1 ∈ R such that t[A] = t1[A] = . . . = tk−1[A].
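A minimal Python sketch of checking definition 3.3 for a single quasi-identifier (added here as an illustration; the toy release and helper names are hypothetical, not from the thesis):

from collections import Counter

def is_k_anonymous(rows, quasi_identifier, k):
    """Check k-anonymity of a release for one quasi-identifier (a tuple of attribute names)."""
    counts = Counter(tuple(row[attr] for attr in quasi_identifier) for row in rows)
    return all(count >= k for count in counts.values())

# A tiny release in the spirit of the first equivalence class of table 3.1.
release = [
    {"zip": "130**", "age": "<30", "condition": "AIDS"},
    {"zip": "130**", "age": "<30", "condition": "Heart Disease"},
    {"zip": "130**", "age": "<30", "condition": "Viral Infection"},
    {"zip": "130**", "age": "<30", "condition": "Viral Infection"},
]

print(is_k_anonymous(release, ("zip", "age"), k=4))   # True
print(is_k_anonymous(release, ("zip", "age"), k=5))   # False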

A technique to achieve k-anonymity is called generalization, which replaces quasi-identifier values with values that are less specific but semantically consistent. As a result, more records will have the same set of quasi-identifier values [LLV07].


An example of a 4-anonymous database is given in table 3.1, where the only quasi-identifier is {Zip code, Age, Nationality}. In the table a “*” denotes a suppressed value so, for example, “age=3*” means the age is in the interval [30, 39]. The definition of k-anonymity assumes that the curator is able to accurately recognize the quasi-identifiers. From the attacks described in section 3.1 we can see that reasoning about quasi-identifiers can be a very hard task and is easy to get wrong. Specifically, determining the separation between sensitive and nonsensitive attributes can be problematic. This is clearly one of the shortcomings of k-anonymity: the assumption that one is able to accurately find all the quasi-identifiers seems to be too strong.

As mentioned previously, k-anonymity is a creation of Sweeney and was an answer to the HMO linkage attack [Swe02]. In her paper she noted that one must be careful with subsequent releases of the same database. If not done correctly, the privacy of some, or possibly all, individuals can be lost. This is a second important downside, as in practice it is almost impossible to know which data was already released previously. We will return to this problem in section 3.4 on the intersection attack.

Achieving k-anonymity

There are two commonly used techniques to achieve k-anonymity: generalization and suppression. In both cases attribute values are modified in order to satisfy the requirement of k-anonymity. In the generalization technique a value for an attribute is transformed to a more general and possibly “abstract” value. For example, given a zipcode such as 02139 we can drop the last number and generalize it to the interval [02130, 02139]. The loss of information that occurs is a price we have to pay in order to achieve privacy. Another example is to generalize the ages of individuals to intervals such as [0, 10[, [10, 20[, and so on.

The second method is suppression. Here a value is simply not released at all. An example is shown in table 3.1 where no value is given for Nationality. Suppression can be seen as a specific form of generalization where the new value is the most generic value possible. We could for example have changed the nationality of each person to Human, which essentially gives us no new information at all. Or we could replace the age of a person with the single interval [0, 200], which again gives no new information about the individual in question.

One can now formulate the problem of achieving k-anonymity whilst minimizing the number of generalizations or suppressions applied. This would result in the optimal anonymized table, where optimal means the least amount of information has been distorted, and thus it can be seen as the most useful anonymized database. Even if we limit the problem to only allowing suppression of values, we can prove that the optimization problem is NP-hard [MW04].


Hence only approximation algorithms are used in practice. Because we will show in section 3.4 that k-anonymity fails to protect against multiple, possibly independent, releases, we will not cover these approximation algorithms to achieve k-anonymity. The focus in this chapter is on explaining the shortcomings of previously suggested privacy mechanisms, and not their precise implementation. Example approximation algorithms can be found in the paper by Aggarwal, Feder, Kenthapadi, Motwani, et al. [AFK+05]. We remark that each of these algorithms can have different properties and advantages.

3.3.2 L-diversity

An improvement of k-anonymity is ℓ-diversity. It was invented by Machanavajjhala, Gehrke and Kifer [MGK07] and was a reaction to the following two attacks on a k-anonymous database:

Homogeneity Attack In the example database shown in table 3.1 we notice that individuals 9, 10, 11 and 12 all have cancer. So if we know that a person is in the database, his zip code starts with 130 and he is between 30 and 40 years old, we can deduce the sensitive information. We observe that the identity of the person is protected (we don’t know his row in the database), but the sensitive attribute value is not protected (because we know the value). In general, if there is no diversity in the value of the sensitive attributes, the privacy of the individual is violated. If the number of tuples heavily outnumbers the possible values of the sensitive attributes, this can be a common situation [MGK07].

Background Knowledge Attack The adversary may also possess background knowledge on the distribution of the sensitive values. As an example, we can have a person aged 21 with zip code 13023, and we know he is in the released database. Thus row 1, 2, 3 or 4 contains his information. We also know he is very active in sports and is thus unlikely to have a heart disease. From this we can conclude that, with high probability, he has a viral infection.

We can prevent the homogeneity attack by assuring there is enough diversity in the sensitive values. Protection against arbitrary background knowledge is nearly impossible. We can however make it more difficult for the adversary: again, if there is greater diversity in the sensitive attributes, more background knowledge will be needed to retrieve the exact value of the sensitive attribute. The definition of ℓ-diversity then becomes:

Definition 3.4 (Entropy ℓ-diversity [GKS08]). For an equivalence class E, let S denote the domain of the sensitive attributes, and Pr[E, s] the fraction of records in E that have sensitive value s; then E is ℓ-diverse if:

−∑_{s∈S} Pr[E, s] log(Pr[E, s]) ≥ log(ℓ)

A table is ℓ-diverse if all its equivalence classes are ℓ-diverse.

             Non-Sensitive                        Sensitive
      Zip code    Age     Nationality      Condition
  1   1305*       ≤ 40    *                Heart Disease
  2   1305*       ≤ 40    *                Viral Infection
  3   1305*       ≤ 40    *                Cancer
  4   1305*       ≤ 40    *                Cancer
  5   1485*       > 40    *                Cancer
  6   1485*       > 40    *                Heart Disease
  7   1485*       > 40    *                Viral Infection
  8   1485*       > 40    *                Viral Infection
  9   1306*       ≤ 40    *                Heart Disease
 10   1306*       ≤ 40    *                Viral Infection
 11   1306*       ≤ 40    *                Cancer
 12   1306*       ≤ 40    *                Cancer

Table 3.2: Example of a 3-diverse database. The quasi-identifiers are all the nonsensitive attributes, namely {Zip code, Age, Nationality}. [LLV07]

There also exists a simplified definition that states that every equivalence class must have at least ℓ different values for the sensitive attribute. An example of a 3-diverse database—according to the simplified definition—is shown in table 3.2. Based on the true definition of ℓ-diversity, the table is actually 2.8-diverse. We note that the equation in the definition is almost the same as the shannon entropy, where the probabilities—and so also the supposed discrete random variable—are now given as fractions over the sensitive attributes. This definition addresses the homogeneity attack, and makes the background knowledge attack harder. The higher ℓ is, the more background knowledge is needed to reveal the sensitive attribute of an individual.
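A sketch of checking entropy ℓ-diversity (added illustration with hypothetical helper names): group the rows into equivalence classes on the quasi-identifier and compare the entropy of the sensitive values in each class with log ℓ. On the first equivalence class of table 3.2 this confirms the value of roughly 2.8 mentioned above:

import math
from collections import Counter, defaultdict

def entropy_l_diverse(rows, quasi_identifier, sensitive, l):
    """Check entropy l-diversity (definition 3.4) of a release."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[attr] for attr in quasi_identifier)].append(row[sensitive])
    for values in classes.values():
        counts = Counter(values)
        total = len(values)
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        if entropy < math.log2(l):
            return False
    return True

# The first equivalence class of table 3.2: three distinct conditions, one of them repeated.
release = [
    {"zip": "1305*", "age": "<=40", "condition": "Heart Disease"},
    {"zip": "1305*", "age": "<=40", "condition": "Viral Infection"},
    {"zip": "1305*", "age": "<=40", "condition": "Cancer"},
    {"zip": "1305*", "age": "<=40", "condition": "Cancer"},
]
print(entropy_l_diverse(release, ("zip", "age"), "condition", l=2.8))   # True
print(entropy_l_diverse(release, ("zip", "age"), "condition", l=3))     # False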

The introduction of ℓ-diversity is an important improvement over k-anonymity. However, it still has several flaws [LLV07]:

Difficult to achieve

Say the original database has only one sensitive attribute, namely the test result for a particular virus. This is basically a boolean value that is either positive or negative. Now assume that 99% of the population doesn’t have the virus, and the database contains the information of 1,000,000 individuals.


If we want to achieve any form of ℓ-diversity, every equivalence class must contain both negative and positive results. This means that there can be at most 1,000,000 × 1% = 10,000 equivalence classes, resulting in a large information loss when generalization is applied. We also note that because the entropy of the sensitive attribute in the overall table is very small, assuring ℓ-diversity can only be done if we pick a small enough ℓ, as for high values of ℓ it would in fact become impossible to assure ℓ-diversity. Although this is not a flaw, we mention it to illustrate the potential difficulty of achieving ℓ-diversity.

Skewness Attack

Consider again the example where the sensitive attribute is the test result for a virus and 99% of the population doesn’t have the virus. The ℓ-diversity principle allows there to be an equivalence class with an equal number of positive and negative records. For any person that falls into this group, his or her privacy is greatly reduced, because they are now considered to have a 50% probability of having the virus, instead of the 1% a priori probability. Here it’s also important to note that if an attacker incorrectly thinks an individual has a certain value of the sensitive attribute, and behaves according to this incorrect perception, this can also harm the individual.

Similarity Attack

Another potential problem is if the sensitive attribute values are distinct but semantically similar. For example, one equivalence class may contain various types of cancer. In this case we know that everyone in the group has cancer. Another example could be the salary of a person. Every individual in a group may have a unique value, but the overall interval can still be small. Hence we could fairly accurately guess the salary of each person by taking the average.

3.3.3 T-closeness

An improvement of ℓ-diversity that attempts to solve the discussed problems is called t-closeness. Here we attempt to assure that the distribution of the sensitive attribute within each equivalence class is close to the distribution of the attribute in the whole table. This results in the following definition.

Definition 3.5 (t-closeness [GKS08]). An equivalence class E is t-close if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is t-close if all its equivalence classes are t-close.

The precise distance measure that is used is of little importance in our discussion. Although t-closeness is a clear advancement, we will see in section 3.4 that it's still vulnerable to a linkage attack. This is because t-closeness—and also the previous methods—focuses on static data that doesn't change. These methods are thus restricted to one-time publication and do not support republication of a new version of the database. This causes problems, because there might be a database maintained by a different organization that stores almost identical information. If that particular organization would also release an anonymized database, the one-time publication assumption is no longer valid. In other words, in the real world it's impossible to prevent multiple publications of the same database, and thus privacy mechanisms assuming one-time publication are considered flawed in practice.

3.4 Intersection Attack

The attack that defeats k-anonymity, ℓ-diversity and t-closeness all at once is called the intersection attack. It's a particular instance of a composition attack, where multiple releases of the database—or overlapping databases released by different entities—are combined. A synonym of composition attack is linkage attack. The intersection attack depends on an important property that a privacy mechanism may possess, namely that of locatability.

Definition 3.6 (Locatability [GKS08]). Let Q be the set of quasi-identifier values of an individual in the original database D. An anonymization mechanism M that produces the anonymized database R on input D satisfies the locatability property if one can identify the set of tuples t1, . . . , tK in R that correspond to Q.

In short it states that, given the values of the quasi-identifiers of an individual, we can find the equivalence class of that person. As explained in section 3.3 on k-anonymity there exist several algorithms that produce an anonymous database when given the original database. Most of these mechanisms satisfy the locatability property, but it does not necessarily hold for all algorithms. For algorithms that do not strictly satisfy locatability, experiments suggest that simple heuristics can still locate a person's equivalence class with good probability [GKS08]. In the intersection attack we will assume the mechanism used to generate the anonymized table satisfies the locatability property.

3.4.1 The Algorithm

The intersection attack is now described in algorithm 3.1. The function GetEquivalenceClass returns the equivalence class in which the individual is located. Since we assumed the privacy mechanism that generated the anonymized releases Rj satisfies the locatability property, this function always exists. The function SensitiveValueSet returns the set of distinct sensitive values for the members of a given equivalence class.


Algorithm 3.1 Intersection Attack

R1, . . . , Rn ← n independent anonymized releases
P ← a set of individuals common to all n releases
for each individual i in P do
    for j = 1 to n do
        eij ← GetEquivalenceClass(Rj, i)
        sij ← SensitiveValueSet(eij)
    end for
    Si ← si1 ∩ si2 ∩ . . . ∩ sin
end for
return S1, S2, . . . , S|P|

In essence the attack is straightforward. We are given a set of individuals P of whom we know the values for the quasi-identifiers, and we know that their information is included in all the releases Rj. The function GetEquivalenceClass relies on the locatability property to retrieve the equivalence class of a given person based on the values of his or her quasi-identifiers. From every anonymized release we then deduce the possible set of values for the sensitive attribute, and we take the intersection of all these sets. If we end up with only one value, we know the sensitive attribute. For example, say that from an anonymous database A we know that Alice has either a heart disease or cancer. From a second anonymous database B we learn that Alice has either cancer or the flu. We can conclude:

S_Alice = {Heart Disease, Cancer} ∩ {The Flu, Cancer} = {Cancer}

Clearly the privacy of Alice has been breached. We call this a perfect breach because we can deduce the exact sensitive value of an individual. In other words the adversary has a confidence level of 100% about the individual's sensitive data. Another possibility is that a partial breach occurred. This happens when the adversary is able to boil down the possible sensitive values to only a few values, which by itself can reveal a lot of information. Take a medical database for example: if we can boil down the sensitive values of the diagnosis to a few values such as Flu, Fever and Cold, we can conclude that the individual is suffering from a viral infection. The confidence level of the adversary in this example is 1/3 ≈ 33%.
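As an illustration, a minimal Python sketch of the intersection attack follows; the representation of a release as a lookup function and the sample data for the Alice example are our own assumptions, not part of [GKS08].

def intersection_attack(releases, individuals):
    # Each release is modelled as a function that, using the locatability
    # property, maps an individual to the set of sensitive values occurring
    # in that individual's equivalence class (GetEquivalenceClass followed
    # by SensitiveValueSet in algorithm 3.1).
    results = {}
    for person in individuals:
        possible = None
        for locate in releases:
            values = set(locate(person))
            possible = values if possible is None else possible & values
        results[person] = possible
    return results

# The Alice example from the text, with two anonymized releases A and B.
release_a = lambda person: {"Heart Disease", "Cancer"}
release_b = lambda person: {"The Flu", "Cancer"}
print(intersection_attack([release_a, release_b], ["Alice"]))
# {'Alice': {'Cancer'}} -- a perfect breach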

3.4.2 Empirical Results

In their paper on the composition attack, Ganta, Kasiviswanathan and Smith have tested the intersection attack on two databases [GKS08]. Both are census databases from the UCI Machine Learning repository. The first one is called the Adult database and has been used extensively in several k-anonymity experiments. The second is the IPUMS database, which contains information from the 1997 census studies. From these inputs they generated k-anonymous, ℓ-diverse and t-close anonymized databases.

Without unnecessarily delving into the details, they noticed that the privacy mechanisms failed to protect the privacy of all individuals. It was also observed that ℓ-diversity and t-closeness performed better than the original k-anonymity definition, but still led to considerable privacy breaches. Another observation is that for ℓ-diversity and t-closeness large equivalence classes had to be created, resulting in a heavy information loss.

Their experimental results also indicated that the severity of the intersection attack increases with the entropy and the domain size of the sensitive attribute [GKS08]. This means that the more possible values the sensitive attribute takes in practice, the more severe the intersection attack becomes. The severity also increases as more independent databases are released to the public.

3.5 Conclusion

After discussing all the previous suggestions and examples it's clear that privacy is a difficult problem. When reasoning about it there are many pitfalls we must avoid. Nevertheless, each of these attempts brings us closer to a strong and usable privacy mechanism.

We also noticed that, both in the real world examples and for theoretical privacy mechanisms such as k-anonymity and its variants, the background information that an adversary may possess is detrimental to the supposed privacy guarantees. These may be overlapping databases, as was the case for the Netflix and IMDb databases. Or these may be different versions of the same database, which is what makes an intersection attack possible. In particular it's very hard to specify the auxiliary information an adversary has access to. Ideally we need a privacy mechanism that can withstand all forms of auxiliary information.


Chapter 4
Impossibility Result

In 1977 Dalenius expressed a desideratum for statistical databases: nothing about an individual should be learnable from the database that cannot be learned without access to the database [Dal77]. This intuitively captures the essence of privacy: you will learn nothing new about the individual. Unfortunately this is impossible, and even the privacy of someone not in the database can be compromised. We will require that our database and privacy mechanism are useful, and we will prove the impossibility of the desideratum in a general setting.

4.1 Motivation

As we have seen in the previous chapters, many ad hoc privacy mechanisms are flawed and can be broken. In order to guarantee adequate privacy for individuals we need a strong definition that can also be achieved in practice. Inspired by the semantically flavored definitions that are also used in cryptography, we turn our attention to the desideratum of Tore Dalenius: Access to a statistical database should not enable one to learn anything about an individual that could not be learned without access.

This is similar to the definition of a semantically secure cryptosystem, where it's stated that an adversary does not gain anything by looking at the ciphertext [Gol01]. Thus, whatever information an attacker can extract from the ciphertext, he could also extract from the information available a priori (without knowing the ciphertext). This definition can be achieved in practice when we limit the computational power of the attacker1.

Unfortunately the desideratum of Dalenius cannot be achieved in general. The problem is that of auxiliary information: information that the adversary already knows without access to the database (say from newspapers, medical studies, other databases, etc.). In any “reasonable” setting there is some piece of information that, together with access to the database, will compromise the privacy of an individual. Note that we have yet to give a proper definition of what privacy means, and in particular of when a privacy breach occurs. This will be discussed in the next section.

1 In cryptography it's common to limit the computational power of the adversary to probabilistic polynomial-time Turing machines.

Before beginning with the formal proof, we can capture the idea behind it with the following parable. An insurance company wants to minimize its losses, and to do so wants to refuse people that have an increased cancer risk. The company also knows that Alice, who is asking for insurance, is a smoker. With only this auxiliary information about Alice the company has no valid reason to deny her insurance (i.e., we assume that it's currently unknown whether smoking causes cancer). Now consider the event that the company gains access to a database that contains medical information about people. From this database they learn that if a person smokes, their risk of getting cancer greatly increases. Combining this newly learned information with the auxiliary information yields the conclusion that Alice has an increased cancer risk, which in this situation can be seen as a privacy breach.

The proof, including several parts of this chapter, is based on the works of Dwork and Naor [Dwo06, DN10].

4.2 The Setting

We start with a high level description of the proof. Assume the privacy mechanism provides some useful information to the analyst that can only be learned by accessing this privacy mechanism. Then consider a sensitive piece of information unknown to the analyst. With this we can construct an auxiliary information generator that uses the useful information provided by the privacy mechanism as a “password” to encrypt the sensitive information. A simple XOR cipher will be used to encrypt and decrypt the sensitive information. An adversary is then able to utilize the useful information to decrypt the auxiliary information and expose the sensitive information, while someone without access to the database will be unable to decrypt the auxiliary information.

4.2.1 Preliminaries

As we have seen in section 1.6 there are two natural models for privacy mechanisms: interactive and non-interactive. The proof is applicable to both settings.

An approach to take into account any information one may have about the database prior to seeing any auxiliary information—in the proof this will mean before one sees the output of the auxiliary information generator—is to define a probability distribution D on databases. For example, this distribution can represent the fact that we know what type of information the database contains: it may contain the ages of people, so we know that the data type of an attribute is an integer, and we also know that the ages must lie between 0 and the age of the oldest known human.

In a sense D can also be seen as the schema of the database.

4.2.2 Assumptions

To prove the impossibility of Dalenius' desideratum we first need to define when a privacy breach occurs, what auxiliary information is available to the adversary, when a privacy mechanism can be considered useful, and how we will model the adversary. We will try to be as abstract as possible in these definitions. This means the proof holds for a wide array of situations, so that in general the assumptions made in our proof will also be satisfied in the real world.

Utility

As already mentioned in the description of this chapter, our proof needs some notion of utility. This means the privacy mechanism needs to output useful results. For example, whilst a mechanism that always outputs the same result or just returns a completely random answer each time preserves privacy, it's clearly useless.

To model usefulness, we require that there is a vector of questions whose answers should be learnable by the analyst. These answers shouldn't be known in advance (at least most of them), otherwise there is still nothing useful about them. We therefore define the utility vector w that represents the answers to these questions. Without loss of generality2 we assume that this is a binary vector of some fixed length κ. Intuitively we can think of this vector as answers to questions about the data. We assume that a large part of w can be computed from the raw data via the sanitization mechanism3. This excludes the case where the answers are given completely randomly, since then w cannot be computed from the raw data.

Since we don’t know most of the answers in advance the utility vector canbe seen as a random variable. To specify that we don’t know all the answersin advance, we will assume that w is sufficiently hard to guess. This notionof “hard to guess” is formalized using the min-entropy measure that was in-troduced in section 2.2. This will—among other things—exclude the trivialcase where the same answer is given to each question. The distribution Don databases induces a distribution on the utility vectors.

2 In this case this means we can easily extend the proof to the case where one considers larger domains.

3 We use the terms sanitization mechanism and privacy mechanism as synonyms.


Privacy Breach

It’s difficult to give a clear definition of when a privacy breach has occurredbecause because it’s such a broad term and can mean different things todifferent people. So instead of quantifying what a privacy breach exactly is,we will use a more abstract idea and “describe” a privacy breach by a Turingmachine C that simply tells us whether or not a breach occurred. It takes asinput a distribution D on databases, a database DB drawn according to thisdistribution, and a string s—the supposed privacy breach—and outputs asingle bit denoting whether the string is a privacy breach or not. We requirethat C always halt and we will use the convention that if s is a privacybreach, C will accept.

We say that an adversary wins with respect to C and for a given (D, DB) pair if it produces a string s such that C(D, DB, s) accepts. Henceforth “with respect to C” will be implicit. It's important to remark that the string s contains the information that is a privacy breach for some individual. The requirements of the breach decider C are as follows:

1. It should be hard to find a privacy breach without access to the database and auxiliary information. In other words, if we only know D it should be unlikely to find a privacy breach. Otherwise there is no privacy to protect, since there is none to give up anyway. This is formalized in Assumption 4.1 part 2, where even with some form of access to the database it is hard to find a breach.

2. If the raw data would be released, it should be easy to find a privacy breach. Otherwise it's again trivial to preserve privacy: even releasing all the data doesn't breach privacy. This is formalized in part 1b of Assumption 4.1.

Auxiliary Information

In the proof we will define a Turing machine that generates the auxiliary information, denoted by the string z. It takes as input a description of the distribution D from which the data is drawn, as well as the database DB itself. In the proof we will give a precise description of how the string z is computed. The auxiliary information z will be given to both the adversary and a simulator. The simulator has no access of any kind to the database, whilst the adversary has access to the database via the privacy mechanism. The auxiliary information generator, including both its inputs, will be represented using the symbol X(D, DB).


4.3 The Proof

The proof consists of two parts. In the first part it's assumed that the complete utility vector w can be learned, and in the second part it's assumed that it can only be partially learned. Although the second case subsumes the first one, the proof when w can be completely learned is simpler and easier to comprehend. We will therefore give a detailed proof for the case where w can be completely learned, and only informally describe the second case.

We will let San() denote the privacy mechanism. It stands for sanitizer and it answers queries in a way that provides some amount of privacy, i.e., it is hard to find a privacy breach using only the results of San(). We can view it as the interface to our database that the analyst must use in order to access the database. One can also see San() as the trusted data curator that attempts to provide privacy. Our results can be summarized using the following meta-theorem [DN10]:

Theorem 4.1 (Meta-Theorem [DN10]). For any privacy mechanism San() and privacy breach decider C there is an auxiliary information generator X and an adversary A such that for all distributions D where D, C and San() satisfy certain assumptions, and for all adversary simulators A∗,

Pr[A(D, San(D, DB), X(D, DB)) wins] − Pr[A∗(D, X(D, DB)) wins] ≥ ∆

where ∆ is a suitably chosen (large) constant. The probability space is over the choice of DB ∈R D and the coin flips of San, D, X, A and A∗.

Here the adversary is represented by a probabilistic Turing machine. The adversary simulator A∗ is the same as a normal adversary, but without access to San(). The probability space consists of the coin flips of all probabilistic Turing machines, as defined in section 2.5, combined with the probability space of D. Roughly the theorem states that—in a reasonable setting—an adversary with access to the database will likely find a privacy breach, while an adversary without access to the database is much less likely to find one. In particular there is a piece of auxiliary information z, such that z alone is useless to someone trying to find a privacy breach, whilst z in combination with access to the data through the privacy mechanism permits the adversary to win with probability arbitrarily close to 1.

4.4 Utility vector w is fully learned

As already stated several times, our proof requires some notion of utility, as explained in section 4.2.2, and in general a “reasonable” (realistic) setting. Assumption 4.1 formalizes these requirements:

57

Page 60: Privacy in Databases - Mathy Vanhoefpapers.mathyvanhoef.com/masterthesis.pdf · increasingly important given the rise of social media. Preface This thesis started as a custom proposal

Assumption 4.1. We assume a reasonable setting, namely that:

1. We assume there exist ℓ ∈ Z+, Υ ∈ R+ and 0 ≤ γ < 1, such that both of the following conditions hold:

   (a) The distribution W on utility vectors has min-entropy at least 3ℓ + Υ, that is, H∞(W) ≥ 3ℓ + Υ.

   (b) Let DB ∈R D. Then with probability at least 1 − γ over D, DB has a privacy breach of length ℓ. Or, in more mathematical notation, Pr[∃y : |y| = ℓ ∧ C(D, DB, y) = 1] ≥ 1 − γ.

2. For all interactive Turing machines B, where µ is a suitably small constant that depends on D and San(), it must hold that

   Pr[B(D, San(D, DB)) wins] ≤ µ

   where the probability is taken over the coin flips of B and the privacy mechanism San(), as well as the choice of DB ∈R D.

Part 1a states that the utility vector contains enough randomness. This also makes sure the utility vector is useful, since this randomness means we don't know the answers in advance, and hence the answers can be considered useful. It also ensures that we can extract ℓ bits of randomness from the utility vector—see claim 4.2—which can be used as a one-time pad4 to hide any privacy breach of the same length. Part 1b assures that with high probability there is some piece of information that can be considered a privacy breach. Finally, part 2 is a nontriviality constraint saying that the sanitization mechanism provides some form of privacy, at least against adversaries that have no access to auxiliary information.

We begin by showing that our assumption implies that, conditioned on most privacy breaches of length ℓ, the min-entropy of the utility vector is at least ℓ + Υ. This is needed so that the auxiliary information generator, given a privacy breach y of length ℓ, can extract enough randomness from w to encrypt y. Before giving the proof we define the following notation:

Definition 4.1 (Conditional Entropy). Given two discrete random variables A and B, the conditional min-entropy of A given the event B = b is defined as

H∞(A | b) := −log( max_{a∈A} Pr[A = a | B = b] )

4 In our scenario a one-time pad can be seen as a secret password used to encrypt and decrypt a message.


We note that, as proven in section 2.1.1, the function Pr[A = a | B = b] (where b is given) defines a new probability space, and thus our definition of conditional entropy is consistent. Using this new notation we can prove the following claim.

Claim 4.2. Let D be as in Assumption 4.1 part 1a. Consider any deterministic procedure P for finding a privacy breach given the database DB, so with probability 1 − γ (see part 1b of the assumptions) over D we have that y = P(DB) is a privacy breach. Let P(DB) = 0^ℓ if there is no breach of length ℓ. Consider the distribution Y on y's obtained by sampling DB ∈ D and then computing P(DB). Then Pr[H∞(W | y) ≥ ℓ + Υ] ≥ 1 − 2^{−ℓ}, where the probability is taken over the choice y ∈ Y.

Proof. It’s given that H∞(W ) ≥ 3`+ Υ, which can be rewritten to

− log

(maxw∈W

Pr [W = w]

)≥ 3`+ Υ (4.1)

maxw∈W

Pr [W = w] ≤ 2−(3`+Υ) (4.2)

∀w ∈W : Pr [W = w] ≤ 2−(3`+Υ) (4.3)

This follows straightforwardly from the definition of min-entropy. We now note that Pr[W = w] = ∑_{y∈Y} Pr[W = w ∧ Y = y]. From this and equation 4.3 it follows that

∀w ∈ W : ∑_{y∈Y} Pr[W = w ∧ Y = y] ≤ 2^{−(3ℓ+Υ)}    (4.4)

Specifically this means it’s true that

∀w ∈ W, ∀y ∈ Y : Pr[W = w ∧ Y = y] ≤ 2^{−(3ℓ+Υ)}    (4.5)

which—surprisingly enough—is the only thing we will need to complete the proof. We continue by first writing what we have to prove in a more readable and understandable form. To do this, let us consider the set G of y's that satisfy the event H∞(W | y) ≥ ℓ + Υ:

G = {y ∈ Y | H∞(W | y) ≥ ℓ + Υ}    (4.6)

We can simplify the definition of G by noting that H∞(W | y) ≥ ℓ + Υ is equivalent to

∀w ∈ W : Pr[W = w | Y = y] ≤ 2^{−(ℓ+Υ)}    (4.7)

From the definition of conditional probability we also have that

Pr[W = w | Y = y] = Pr[W = w ∧ Y = y] / Pr[Y = y]    (4.8)


Using this we can further simplify the definition of set G to

G = {y | ∀w ∈ W : Pr[Y = y] ≥ 2^{ℓ+Υ} Pr[W = w ∧ Y = y]}    (4.9)

Using this, the statement we have to prove can be written as Pr[G] ≥ 1 − 2^{−ℓ}, where the probability is taken over the complete set Y. This can equivalently be written as

∑_{y∈G} Pr[Y = y] ≥ 1 − 2^{−ℓ}    (4.10)
1 − ∑_{y∉G} Pr[Y = y] ≥ 1 − 2^{−ℓ}    (4.11)
∑_{y∉G} Pr[Y = y] ≤ 2^{−ℓ}    (4.12)

The last statement is something we can indeed prove. We use the fact that y ∉ G means that there exists a w_y such that

Pr[Y = y] < 2^{ℓ+Υ} Pr[W = w_y ∧ Y = y]    (4.13)

We can now reason as follows

∑_{y∉G} Pr[Y = y] < ∑_{y∉G} 2^{ℓ+Υ} Pr[W = w_y ∧ Y = y]    (4.14)
                  ≤ 2^{ℓ+Υ} ∑_{y∉G} 2^{−(3ℓ+Υ)}    (4.15)
                  ≤ 2^{ℓ+Υ} · 2^ℓ · 2^{−(3ℓ+Υ)}    (4.16)
                  = 2^{−ℓ}    (4.17)

The first step follows from equation 4.13. The second step uses equation 4.5, which we derived from the given conditions. The third step is true because the number of y's is at most 2^ℓ. This finishes the proof.

To describe the internal operation of the auxiliary information generator we need one last element: a strong randomness extractor.

Definition 4.2 (Strong extractor [Sha02]). A (k, ε)-strong extractor is a function

Ext : {0, 1}^n × {0, 1}^d → {0, 1}^m

such that for every distribution X on {0, 1}^n with H∞(X) ≥ k, the distribution U_d ◦ Ext(X, U_d) is ε-close to the uniform distribution on {0, 1}^{m+d}. The two copies of U_d denote the same random variable5.

5 Here ◦ stands for the concatenation of two strings.


The probability measure of the distribution U_d ◦ Ext(X, U_d) is defined as follows.

Definition 4.3. Let X be a discrete random variable and U_d the uniform distribution on {0, 1}^d. The probability measure Pr corresponding to the probability space of the discrete random variable P := U_d ◦ Ext(X, U_d) is defined as

Pr[u ◦ y] = (1 / 2^d) ∑_{x : Ext(x,u)=y} Pr[X = x]

where y represents the result of the function Ext(x, u). The sample space of P is equal to {0, 1}^{m+d}. We now define the probability of the pairwise disjoint events E = {u_1 ◦ y_1, . . . , u_n ◦ y_n} as

Pr[E] = ∑_{1≤i≤n} Pr[u_i ◦ y_i]    (4.18)

Proof. We will now prove that this is indeed a valid probability measure. We can immediately see that the countable additivity property directly follows from equation 4.18 of our definition. The proof that Pr[Ω] = 1 is

Pr[Ω] = ∑_{u∈{0,1}^d} ∑_{y∈{0,1}^m} Pr[u ◦ y]    (4.19)
      = ∑_{u∈{0,1}^d} ∑_{y∈{0,1}^m} (1 / 2^d) ∑_{x : Ext(x,u)=y} Pr[X = x]    (4.20)
      = ∑_{u∈{0,1}^d} ∑_x Pr[X = x] / 2^d    (4.21)
      = ∑_{u∈{0,1}^d} 1 / 2^d    (4.22)
      = 1    (4.23)

The first step holds because all the events u ◦ y are pairwise disjoint. In the second step the definition of Pr[u ◦ y] is filled in. In the third step the last two summations are combined; taken together they sum over all the possible values of x. For the fourth step we use the trivial property that ∑_x Pr[X = x] = 1. Finally we get our desired result. The proof that the image of Pr is [0, 1] follows by noting that all terms in the calculation of Pr are non-negative and by the just proven property that Pr[Ω] = 1. This concludes the proof.

A randomness extractor can be used to generate a random output that is shorter, but uniformly distributed, given a high-entropy source and a short random seed. In other words it can convert an input with “weak” randomness to an output with “high” uniform randomness. The additional requirement of a strong extractor is that the output, concatenated with the seed, should still yield a distribution that is close to uniform. There are several good constructions of strong extractors [NZ96, Sha02, DN10].

For this proof we can use an extractor based on the Leftover Hash Lemma:

Lemma 4.3 (Leftover Hash Lemma [ILL89]). If H = {h : {0, 1}^n → {0, 1}^m} is a pairwise independent family where m = k − 2 log(1/ε), then Ext(x, h) := h(x) is a strong (k, ε)-extractor. Here we identify a function h with its index in H.
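As a concrete illustration of one family that fits the lemma, the following Python sketch samples a random affine map over GF(2), a standard pairwise independent family; the helper name and the chosen parameters are our own, and the snippet only conveys the idea of hashing a weakly random input down to a nearly uniform output.

import numpy as np

def random_affine_hash(n, m, rng=None):
    # Sample h(x) = (A·x + b) mod 2, a pairwise independent family
    # from {0,1}^n to {0,1}^m; the pair (A, b) is the index of h in H.
    rng = rng or np.random.default_rng()
    A = rng.integers(0, 2, size=(m, n))
    b = rng.integers(0, 2, size=m)
    return lambda x: (A @ np.asarray(x) + b) % 2

# Extract m = 8 nearly uniform bits from a 32-bit weakly random source.
h = random_affine_hash(n=32, m=8)       # choosing h corresponds to the seed
x = np.random.default_rng().integers(0, 2, size=32)
print(h(x))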

For a proof and a more detailed explanation of this lemma, see the work of Impagliazzo et al. [ILL89]. The output of the extractor will be used to encrypt the privacy breach y of length ℓ, which will be hidden inside the auxiliary information z. Therefore we require a strong (k, ε)-extractor with m ≥ ℓ. From this requirement and Lemma 4.3 it follows that:

k − 2 log(1/ε) ≥ ℓ    (4.24)
k ≥ ℓ + 2 log(1/ε)    (4.25)

Hence, to extract enough randomness using the strong extractor from the Leftover Hash Lemma, the min-entropy of the utility vectors W, conditioned on a given privacy breach y, must be at least ℓ + 2 log(1/ε). Combining inequality 4.25 with Claim 4.2, where it's stated that with high probability H∞(W | y) ≥ ℓ + Υ, we can deduce that the choice of Υ and ε must satisfy Υ ≥ 2 log(1/ε).

By utilizing this extractor we can describe how the auxiliary information generator X will produce the auxiliary information. We assume there is a (k, ε)-strong extractor with m ≥ ℓ and k ≥ ℓ + Υ. On input (D, DB ∈R D) it will do the following:

1. Find a privacy breach y for DB of length ℓ, which exists with probability at least 1 − γ over D. The efficiency of this operation will be discussed in section 4.6.

2. Compute the utility vector w using the sanitization mechanism. Remember that the utility vector w represents everything that can be learned from the database, and we assume that it can be fully learned using the interface San(). As described in section 4.2.2 the utility vector is represented by a binary vector of some fixed length κ.


3. Choose a random seed s ∈R {0, 1}^d and use the (k, ε)-strong extractor to obtain from w an ℓ-bit almost uniformly distributed string r. In other words r = Ext(w, s), where r will be ε-close to the uniform distribution.

4. The returned auxiliary information will be z = (s, y ⊕ r).

From Claim 4.2 and the definition of a strong extractor, the distribution of r is within statistical distance ε + 2^{−ℓ} from the uniform distribution on {0, 1}^ℓ. This holds even if s and y are given.

The adversary A that has access to both the sanitization mechanism and the auxiliary information can decrypt the second element of the string z. To do this it first computes the utility vector w, and then from s obtains r = Ext(w, s). The decryption step simply applies the XOR operation to the second component of z to retrieve the privacy breach: (y ⊕ r) ⊕ r = y. Given that a privacy breach occurs with probability at least 1 − γ, the adversary A wins with probability at least 1 − γ.
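The interplay between X and A can be summarized in a short Python sketch. It is only a toy illustration of the construction: SHA-256 stands in for the strong extractor, and the example strings for the utility vector and the privacy breach are invented for the example.

import hashlib, os

def extract(w, seed, length):
    # Stand-in for the strong extractor Ext(w, s): hash the utility vector
    # together with the public seed down to `length` bytes.
    return hashlib.sha256(seed + w).digest()[:length]

def aux_generator(utility_vector, breach):
    # X(D, DB): hide the privacy breach y with the extracted pad r.
    seed = os.urandom(16)                               # the public seed s
    pad = extract(utility_vector, seed, len(breach))    # r = Ext(w, s)
    cipher = bytes(a ^ b for a, b in zip(breach, pad))  # y XOR r
    return seed, cipher                                 # z = (s, y XOR r)

def adversary(utility_vector, aux):
    # A: recompute w via San(), re-derive r, and undo the XOR.
    seed, cipher = aux
    pad = extract(utility_vector, seed, len(cipher))
    return bytes(a ^ b for a, b in zip(cipher, pad))

w = b"answers to the utility queries"   # learned through San()
y = b"Alice has cancer"                 # a privacy breach of length l
z = aux_generator(w, y)
assert adversary(w, z) == y             # with access to San(), y is exposed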

All that is left to do is to investigate with which probability the adversary A∗ can win. We start with part 2 of Assumption 4.1, that is to say Pr[B(D, San(D, DB)) wins] ≤ µ, where the probability is taken over the coin flips of B and the privacy mechanism San(), as well as the choice of DB ∈R D. If we remove access to San the probability that B wins cannot increase:

Pr [B(D) wins] ≤ µ (4.26)

where B denotes the best machine, among all those with access to the given information, at producing a privacy breach. From this we can derive that the probability of correctly guessing a privacy breach y is bounded by µ. If we also give the adversary a uniformly distributed random string—here a better word for string would be noise—it gains no advantage. Specifically, the uniformly chosen bitstrings s, u ∈R {0, 1}^ℓ will be given to B, which cannot help it to win:

Pr [B(D, s, u) wins] ≤ µ (4.27)

Let U_ℓ represent the uniform distribution on ℓ-bit strings. For any fixed string y ∈ {0, 1}^ℓ we have U_ℓ = U_ℓ ⊕ y. In particular this holds for all privacy breaches y of DB:

Pr [B(D, s, u⊕ y) wins] ≤ µ (4.28)

We now want to replace the uniformly chosen random bitstring u with r = Ext(w, s). However, it might be possible that r is correlated with y. If that is the case, y ⊕ r might leak information about the privacy breach y. This is the reason we require Claim 4.2, which states that the min-entropy of W, conditioned on y, must be at least ℓ + Υ. Now the output r = Ext(w, s) will be close to uniform even when y is given. That makes sure that y ⊕ r leaks essentially no information about the privacy breach y.

As we have already seen, the statistical distance between the uniform distribution and the distribution of r is at most ε + 2^{−ℓ}. As we have proven in section 2.6.2, this means that when replacing (s, u ⊕ y) with (s, r ⊕ y) the acceptance probability of B differs by at most ε + 2^{−ℓ}. Hence we get

Pr[B(D, s, r ⊕ y) wins] ≤ µ + ε + 2^{−ℓ}    (4.29)

Here B with this input is equivalent to our simulator A∗. We conclude that the probability that A∗ wins is:

Pr[A∗(D, X(D, DB)) wins] ≤ µ + ε + 2^{−ℓ}    (4.30)

Finally we can calculate a lower bound on the gap between the probability that the adversary wins and the probability that the simulator wins. This gap is at least:

∆ = (1 − γ) − (µ + ε + 2^{−ℓ}) = 1 − (γ + µ + ε + 2^{−ℓ})    (4.31)

This concludes our proof. We have shown that an entity with access to the database can—with high probability and in combination with auxiliary information—breach the privacy of an individual. The entity accesses the database through the privacy mechanism San(), which provides some amount of privacy. We have assumed that the full utility vector w can be learned by using San(). While San() already provided some form of privacy, in combination with auxiliary information the privacy of individuals can still be violated. The main result of our proof is that there always exists a piece of auxiliary information that, together with access to the database, causes a privacy breach. However, without access to the database, this piece of auxiliary information is useless. All this is formalized in the following theorem.

Theorem 4.4 (Full Recovery of Utility Vector [DN10]). Let Ext be a (k, ε)-strong extractor. For any privacy mechanism San() and privacy breach decider C there is an auxiliary information generator X and an adversary A such that for all distributions D where D, C and San() satisfy Assumption 4.1 with k ≥ ℓ + Υ, and for all adversary simulators A∗,

Pr[A(D, San(D, DB), X(D, DB)) wins] − Pr[A∗(D, X(D, DB)) wins] ≥ ∆

where ∆ = 1 − (γ + µ + ε + 2^{−ℓ}). The probability space is over the choice of DB ∈R D and the coin flips of San(), X, A and A∗.

As already mentioned, because we used strong randomness extractors based on the Leftover Hash Lemma, it suffices that Υ ≥ 2 log(1/ε).


4.5 Utility vector w is only partly learned

The proof can be extended to the more realistic case where the utility vector can only be partially learned. This is for example possible if the sanitization mechanism is randomized and may add noise to the real output. A consequence is that one can only assume that the analyst will retrieve a vector w′ within a bounded Hamming distance of the real utility vector. The reason a separate proof is needed is that the auxiliary information generator cannot know which utility vector w′ is seen by the adversary.

The underlying idea of the proof remains the same: the returned utility vector will be used as a secret that can decrypt the auxiliary information. By decrypting the auxiliary information the adversary learns the privacy breach. For this reason we will not explain the details of the proof, as it only adds technical details for handling the fact that the utility vector can only be partially learned.

Without going into the details, the technique that is used to solve our problem is called a fuzzy extractor. Originally, fuzzy extractors were created to derive cryptographic keys from biometric data [DORS08, DS05]. Here the original key (e.g., a fingerprint or iris scan) does not get reproduced precisely each time it is measured. By using fuzzy extractors, however, we can derive a key that is always the same. This matches what we need: in our case the utility vector does not get learned precisely, yet we can still derive a key from it that is always the same. For the details of the proof see the paper by Dwork and Naor [DN10].

4.6 Discussion

Complexity

An objection to the proof might be that performing the attack is computationally infeasible in practice. However, this is only the case if finding a privacy breach for a given database is costly, since the complexity of all the other operations is low [DN10].

Statistical Distance Measure

Although we haven’t given the proof if the utility vector is only partiallylearned, the actual proof of it focuses on the Hamming distance betweentwo distributions. There exist also fuzzy extractors that are based on otherdistance measures, for example on the edit distance or set distance. Even socalled general metric spaces can be used if the utility vector returned by theprivacy mechanism is close to the original in those distance measures [DN10,DORS08].


It’s also possible that the sanitization mechanism returns a utility vec-tor w that must satisfy a certain relation (DB,w), although one may askhimself when such a thing would be useful in practice. In this case theresult of the sanitization mechanism could vary a lot, and the (statistical)distance between two valid answers may be large. It’s an open problem ifthe impossibility result also holds in this case [Nao10b]. In other words, forthis scenario it’s unknown if we can hide a privacy breach in the auxiliaryinformation.

4.7 Conclusion

If the database is useful, there is always some side information that, together with access to the database, will compromise the privacy of an individual. This means that absolute disclosure prevention is impossible: a secret will always be leaked, given that the adversary has enough auxiliary information. The surprising observation is that even the privacy of someone not in the database can be violated. In the parable mentioned in section 4.1, for example, we would learn that Alice has an increased cancer risk even if she wasn't in the medical database. This doesn't mean we should abandon the search for a proper definition that guarantees privacy. Instead we should move from the idea of assuring absolute disclosure prevention to assuring relative disclosure prevention. This is done in the definition of differential privacy, which is introduced in the next chapter.

Also note that we didn’t restrict the computational power of the adver-sary, and perhaps more is realizable if we limit its power. We have alsoassumed that the returned utility vector is within a statistical distance fromthe real utility vector.

It may seem strange that Dalenius' goal cannot be achieved, while—when limiting the computational power of the adversary—semantic security for cryptosystems can be achieved. The reason behind this is that there is no difference between the adversary and a genuine user in our situation, while for cryptosystems they can be separated using secret keys. And since the database is designed to convey useful information, the adversary will inevitably also be able to access this information. Combined with auxiliary information, the adversary is then able to find a privacy breach.


Chapter 5
Differential Privacy

In order to address the problems described in the previous chapters, we turn our attention to differential privacy. This definition takes into account any potential auxiliary information the adversary may have, and moves from absolute disclosure prevention to relative disclosure prevention. It has shown promise to be an adequate definition and is, at the time of writing, the only notion that both handles auxiliary information and is composable. There are also abundant practical implementations that guarantee differential privacy.

5.1 Requirements

Before we give the definition of differential privacy, we will briefly look at two properties that a privacy mechanism must have.

Robustness to Side Information

It must take into account any side information (auxiliary information) that the adversary might have. In the definition we must avoid specifying exactly what the adversary may know, because this is very hard to quantify. Ideally it will assume the adversary knows everything except the individual information of a person.

Composability

If we apply the sanitization mechanism several times on related databases, the adversary should not be able to combine the results to retrieve sensitive private information. As we have seen in Chapter 3 this is where a lot of previous suggestions have failed. For instance, k-anonymity has the problem that if it's done twice for slightly related databases, all privacy can be lost. For differential privacy, general results have been proven about its composability [DRV10].

5.2 Definition

Since a privacy disclosure with respect to a given individual can happen even if that individual is not in the database, Dwork and McSherry propose the following idea: any given disclosure (i.e., any output of the privacy mechanism) should be, within a small multiplicative factor, just as likely whether or not the information of an individual is stored in the database. This means that including your personal information in a database doesn't significantly increase the chance of a disclosure happening. In other words, concealing your information gives only a nominal gain.

We begin by defining what a dataset is and when two datasets are adjacent.

Definition 5.1 (Dataset). A dataset or database is a finite collection of rows. Each row represents the information of one individual in the dataset.

Definition 5.2 (Adjacency). Two datasets are adjacent if one is a subset of the other, and the larger one contains at most one additional element. We also say that two adjacent datasets differ on at most one element.

Adjacent databases are commonly also called neighbours. Since the definition of ε-differential privacy is based on randomized functions, we first have to rigorously define this notion.

Definition 5.3 (Randomized Function). Let F be the set of all finite datasets and 𝒱 be the set of all random variables with image V. A randomized V-valued function K is then defined as

K : F → 𝒱, D ↦ K(D)

where K(D) is a random variable defined separately for every D ∈ F.

Often the term randomized algorithm is used as a synonym of randomized function. Note that these randomized functions can actually be continuous random variables, and are thus different from probabilistic Turing machines. Strictly speaking Range(K) should be 𝒱. However, for ease of notation we will let Range(K) be equal to V. We can now give the definition of differential privacy.

Definition 5.4 (Differential Privacy [DN10]). A randomized function K gives ε-differential privacy if for all data sets D1 and D2 differing on at most one element, and all S ⊆ Range(K),

Pr[K(D1) ∈ S] ≤ exp(ε) × Pr[K(D2) ∈ S]

It may seem unusual that we didn't require the probability distributions to be statistically close or computationally indistinguishable, as is often done in the field of cryptography. The reason we don't use a distance measure such as ε-closeness is that with such a measure there might be an output where the difference in probability is relatively large, while the probabilities of all the other outputs could be equal, resulting in a small statistical distance. As an example, consider the privacy mechanism where we randomly sample an individual and release all his data (i.e., return the row corresponding to the person). The probability of being picked when you participate in the database is 1/n, where n is the size of the database, and 0 when you are not in the database. In this case the statistical distance would be small, yet clearly this does not preserve privacy [DMNS06]. The multiplicative factor exp(ε) in our definition rules out this subsample-and-release paradigm, since it implies that an output whose probability is zero on a given database must have probability zero on any adjacent database, and by recursively applying the definition, also on any other database [Dwo11].

It’s important to remark that a disclosure can still happen. However, thecause of the disclosure will not be the action of giving your individual infor-mation to the databases. Nor was there something you could do that wouldprevent the disclosure in the first place. This should address any concernthe participant might have about the leakage of their personal information.After all, even if the participant removed her data from the database, nooutputs—and hence possible privacy breaches—would become significantlymore or less likely.

5.3 Variants

5.3.1 Discrete Range

To better understand the definition, we will focus on the case where the range of K is a finite or countably infinite set. In this case we can change the definition to:

Definition 5.5 (Differential Privacy (Discrete Sets)). A randomized function K, where Range(K) is a discrete set, gives ε-differential privacy if for all data sets D1 and D2 differing on at most one element,

∀x ∈ Range(K) : Pr[K(D1) = x] ≤ exp(ε) × Pr[K(D2) = x]

If the set Range(K) were uncountable, the probability of the outcome K(D) = x would be zero, and the definition would trivially hold. It's clear that Definition 5.4 implies our new definition. On the other hand we can also show that, assuming Range(K) is finite or countably infinite, the new definition implies Definition 5.4.

Proof. Assume the privacy mechanism satisfies Definition 5.5. We have that for all S ⊆ Range(K), where |S| = n and S = {s1, s2, . . . , sn}:

Pr[K(D1) ∈ S] = Pr[ ⋃_{i=1}^{n} {si = K(D1)} ]    (5.1)
              = ∑_{i=1}^{n} Pr[si = K(D1)]    (5.2)
              ≤ ∑_{i=1}^{n} exp(ε) Pr[si = K(D2)]    (5.3)
              = exp(ε) ∑_{i=1}^{n} Pr[si = K(D2)]    (5.4)
              = exp(ε) × Pr[K(D2) ∈ S]    (5.5)

Here the first and last equalities follow from the countable additivity property of the probability measure Pr. The inequality holds since we assumed that K satisfies Definition 5.5. We can conclude that Definition 5.4 is equivalent, for discrete ranges, to Definition 5.5.

If the set Range(K) were uncountable, the union in equation 5.1 would be over uncountably many events. Therefore the additivity property would no longer be applicable, and the proof does not hold. This gives us an insight as to why the definition of differential privacy uses the subset S ⊆ Range(K): using subsets makes the definition valid for both discrete and continuous variables.

5.3.2 Interpretation

From the definition of ε-differential privacy we notice that if the probability of an output—or a range of outputs for continuous variables—is zero for a specific data set, then this probability must be zero for all data sets. Practically this means the privacy mechanism K is never allowed to return this output, and that therefore the output is not contained in the range of K. Hence, without loss of generality, we can assume that the probability1 is always nonzero.

Considering the definition must hold for data sets (D1, D2) as well as (D2, D1), where D1 and D2 differ in at most one element, we have:

Pr[K(D1) ∈ S] ≤ exp(ε) Pr[K(D2) ∈ S]    (5.6)
Pr[K(D2) ∈ S] ≤ exp(ε) Pr[K(D1) ∈ S]    (5.7)

1 For discrete variables this means the pmf, for continuous variables we're actually talking about the cdf.

Since we can assume these probabilities are nonzero, we can rewrite this to:

Pr[K(D1) ∈ S] / Pr[K(D2) ∈ S] ≤ exp(ε)    (5.8)
Pr[K(D1) ∈ S] / Pr[K(D2) ∈ S] ≥ exp(−ε)    (5.9)

Combining these two inequalities results in:

exp(−ε) ≤ Pr[K(D1) ∈ S] / Pr[K(D2) ∈ S] ≤ exp(ε)    (5.10)

This formulation of Definition 5.4 is easier to interpret. For small ε we observe that exp(ε) ≈ 1 + ε. Intuitively this means that the fraction should be close to one, and consequently that the two probabilities must be close to each other. This confirms our earlier statement that any output of the privacy mechanism is almost just as likely, independent of whether any individual opts in to, or opts out of, the database.

For completeness, note that inequality 5.10 is equivalent to the following inequality:

exp(−ε) ≤ Pr[K(D2) ∈ S] / Pr[K(D1) ∈ S] ≤ exp(ε)    (5.11)

5.4 Choice of Epsilon

We have yet to define what the size of the parameter ε in the definition of differential privacy should be. Instead of letting ε be a fixed constant, we could also have defined it as a function whose image is R. Taking inspiration from the field of cryptography, one may ask whether ε can be negligible as a function of the database size.

Definition 5.6 (Negligible [Gol01]). We call a function µ(x) : N → R negligible if for every positive polynomial p(·) there exists an N such that for all n > N,

µ(n) < 1 / p(n)


We will show that when letting ε be negligible as a function of the database size, denoted by n, the results of the privacy mechanism K become meaningless. Let D and D′ be two data sets, both of size n, that differ in every element. Consider the sequence D = D1, D2, D3, . . . , D2n = D′ where we get from D to D′ by repeatedly first removing an element and then adding the corresponding modified element, until we have D′. For this we need at most 2n steps. All databases Di and Di+1 differ in only one element and are thus adjacent. Let ε = µ(n), where µ is a negligible function. For each pair of adjacent databases we have

Pr[K(Di) ∈ S] ≤ e^{µ(n)} Pr[K(Di+1) ∈ S]    (5.12)

Combining all these inequalities over the whole sequence gives us

Pr[K(D) ∈ S] ≤ e^{µ(n)·2n} Pr[K(D′) ∈ S]    (5.13)

Because µ is a negligible function in n, the function µ′(n) = 2n · µ(n) is also negligible. In other words, if we let ε be negligible in n, even two completely different data sets are treated as if they were adjacent. Therefore the output of K would have to be similar for all databases, and thus these outputs would be useless.

Instead we let the parameter be a public value. Its choice will depend on how much privacy one wants to guarantee: the smaller the value, the more privacy. In most cases we will tend to think of ε as a small value such as 0.01, 0.1, ln(2) or ln(3). If the probability that some bad event will occur is very small, it might be tolerable to increase it by a factor such as 2 or 3, while if the probability is already felt to be close to unacceptable, then even a small increase by a factor of e or e^{0.1} would be intolerable [Dwo08].

5.5 Achieving Differential Privacy

This section will show how one can achieve differential privacy. To get started we first investigate counting queries. Then we will see a more general property to assure differential privacy. Finally, a few relatively more complex algorithms that achieve differential privacy will be explained.

5.5.1 Counting Queries

Assume we want to calculate how many rows satisfy a given predicate P(x). What we will do is calculate the exact result and perturb it with noise to guarantee differential privacy. Let D be a set of rows representing a database, f be the function that computes the true result, Y be a random variable used to add noise, and K be the resulting mechanism that will assure differential privacy. This can be written as


[Figure 5.1: Pdf of noise added to counting queries using ε = ln(2)]

K(D) = f(D) + Y = ∑_{x∈D} P(x) + Y    (5.14)

If Y adds Laplace distributed noise, which has probability density function h(y) = (1/2b) exp(−|y|/b), then we can show that K is differentially private.

Property 5.1. Letting Y add Laplace distributed noise with parameter b = 1/ε guarantees ε-differential privacy.

Proof. First observe that for any two databases D1 and D2 differing in only one row, the counts f(D1) and f(D2) differ by at most one. Consider any S ⊆ Range(K) and take r ∈ S. For K(D1) to output r it must add r − f(D1) noise to the true answer; similarly, K(D2) must add r − f(D2) noise for K to output the same r. Hence, by symmetry of the Laplace density, the probability density that K(D1) returns r is h(f(D1) − r), and for K(D2) the probability density is h(f(D2) − r). We have

h(f(D1) − r) / h(f(D2) − r) ≤ exp( (1/b) |(f(D1) − r) − (f(D2) − r)| )    (5.15)
                            = exp( (1/b) |f(D1) − f(D2)| )    (5.16)
                            ≤ exp(1/b)    (5.17)

The first inequality is proven in section 2.3 and the second inequality follows from the fact that f(D1) and f(D2) differ by at most one. Letting b = 1/ε, integrating over S with respect to r yields ε-differential privacy.

Figure 5.1 shows a graphical representation of how much noise is added to the true result. This is specifically for counting queries where one has chosen an ε value of ln(2). We observe that the noise is symmetrically centered around zero and that, strictly speaking, any value, no matter how small or large, is possible.
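A minimal Python sketch of such a differentially private counting query follows; the example database, the predicate and the function name are invented for illustration.

import numpy as np

def private_count(rows, predicate, epsilon):
    # Counting query with Laplace noise: K(D) = f(D) + Y with Y ~ Lap(1/epsilon).
    # The sensitivity of a counting query is 1, so b = 1/epsilon suffices.
    true_count = sum(1 for row in rows if predicate(row))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical example: how many individuals are older than 40?
database = [{"age": 34}, {"age": 52}, {"age": 47}, {"age": 29}]
print(private_count(database, lambda row: row["age"] > 40, epsilon=np.log(2)))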

5.5.2 General Theorem to Achieve Privacy

An essential part of achieving differential privacy is knowing how much influence a single row has on the output of the query function. We will formally define this as the sensitivity of the query function.

Definition 5.7 (L1 Sensitivity [Dwo11]). The L1 sensitivity of a function f : D → R^d is

∆f = max_{D1,D2} ||f(D1) − f(D2)||_1 = max_{D1,D2} ∑_{i=1}^{d} |f(D1)_i − f(D2)_i|

for all D1, D2 ∈ D differing in at most one row.

As an example, the counting query in the previous section has a sensitivity of one. From the definition of sensitivity one can see that we also allow the function f to return a vector. To assure differential privacy we will add Laplacian noise to each element of this vector. If Y is a vector of d independent Laplace variables, the probability density of a noise vector y = (y1, . . . , yd) is, up to the normalizing constant, the product of the probability densities of the individual elements:

exp(−|y1|/b) × . . . × exp(−|yd|/b) = exp( −(|y1| + . . . + |yd|) / b )    (5.18)
= exp( −||y||_1 / b )    (5.19)

Using this we can prove the following theorem:

Theorem 5.2. For f : D → R^d, the mechanism K that adds independently generated noise with distribution Lap(∆f/ε) to each coordinate of the output assures ε-differential privacy.

Proof. Analogous to the proof of Property 5.1, consider any subset S ⊆ Range(K) and let y ∈ S. The pdf of K(D1) returning y is h(||f(D1) − y||_1), and the pdf of K(D2) returning the same y is h(||f(D2) − y||_1). We combine this as follows


Figure 5.2: Comparison between the pdfs of two counting queries with ∆f = 1 and ε = ln(2). The true results of the queries are 100 and 101.

h(||f(D1) − y||_1) / h(||f(D2) − y||_1) = exp( −||f(D1) − y||_1 / b ) / exp( −||f(D2) − y||_1 / b )    (5.20)
= exp( (||f(D2) − y||_1 − ||f(D1) − y||_1) / b )    (5.21)
≤ exp( ||f(D2) − f(D1)||_1 / b )    (5.22)
≤ exp( ∆f / b )    (5.23)

Here the first inequality follows from the triangle inequality, and the second inequality from the definition of ∆f. Choosing b = ∆f/ε and integrating over S yields ε-differential privacy.

To visualize the effect of differential privacy on adjacent databases, figure 5.2 shows the situation for the case of a counting query where ∆f = 1 and ε = ln(2). The blue line shows the probability density function when the true result is 100, and the green line when the true result is 101. These two results could in practice come from two adjacent databases.
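As a small illustration of Theorem 5.2, the sketch below perturbs every coordinate of a vector-valued answer with Lap(∆f/ε) noise. The function names are ours, and laplace_noise is the same helper used in the counting-query sketch above.

import random

def laplace_noise(scale):
    # Same helper as in the counting-query sketch: one Laplace(0, scale) sample.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Theorem 5.2: perturb each coordinate of f(D) with Lap(sensitivity/epsilon)."""
    scale = sensitivity / epsilon
    return [value + laplace_noise(scale) for value in true_answer]

# Example: a vector-valued query with L1 sensitivity 1, answered with epsilon = 0.5.
print(laplace_mechanism([100, 101], sensitivity=1.0, epsilon=0.5))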

5.5.3 Adaptive Queries

In a real setting we want to execute more than one query. These queries may be independent of each other, or actually depend on previous results. For example, when we want to implement a more complex algorithm such as K-means we will perform multiple queries, where the current query depends on the results of the previous queries. This complicates the nature of the query function, as it is no longer a predetermined function, but rather a strategy for producing queries based on answers given thus far [DMNS06]. We will call these adaptive queries. The question is now how much noise must be added so the total algorithm, and thus the total sequence of queries, is ε-differentially private.

Theorem 5.3. Let f1, f2, . . . , fn be a sequence of queries where each query fi+1 depends on the output of f1, . . . , fi. We define the total sensitivity as δ := max(∆f1 + ∆f2 + . . . + ∆fn), where the maximum is taken over all possible outputs of f1, f2, . . . , fn. The mechanism K that adds independently generated noise with distribution Lap(δ/ε) to the result of every function assures ε-differential privacy.

Proof. For convenience we will let S denote the tuple of sets (S1, S2, . . . , Sn) such that S1 ⊆ Range(f1), . . . , Sn ⊆ Range(fn). We define the notation (a1, . . . , an) ∈ S as equivalent to a1 ∈ S1, . . . , an ∈ Sn. Let f1, f2, . . . , fn be a sequence of queries where each query fi+1 depends on the output of f1, . . . , fi. From this it follows that

Pr[(f1(D1), . . . , fn(D1)) ∈ S] = ∏_{i=1}^{n} Pr[fi(D1) ∈ Si | f1(D1), . . . , fi−1(D1)]    (5.24)

Since for each of these conditional probabilities the results of the previous queries are given, the probability is only over the amount of noise added to the real result of fi, which has a Laplace distribution. The true result of query fi will be written as ri. Recalling that we write h(x) for the probability density function of the Laplace distribution Lap(b), we have

∏_{i=1}^{n} h(||fi(D1) − ri||_1) / ∏_{i=1}^{n} h(||fi(D2) − ri||_1)
= ∏_{i=1}^{n} exp( −||fi(D1) − ri||_1 / b ) / exp( −||fi(D2) − ri||_1 / b )
= ∏_{i=1}^{n} exp( (||fi(D2) − ri||_1 − ||fi(D1) − ri||_1) / b )
≤ ∏_{i=1}^{n} exp( ||fi(D2) − fi(D1)||_1 / b )
= exp( (||f1(D2) − f1(D1)||_1 + . . . + ||fn(D2) − fn(D1)||_1) / b )
≤ exp( δ / b )

The first inequality is based on the triangle inequality. The second inequality follows from the definition of δ. Substituting b = δ/ε and integrating over all Si with respect to every result ri finishes the proof.

In practice it can be difficult to calculate the true value of δ, as it is difficult to mathematically model all possible interactions between the collection of adaptive queries that are being posed. The following inequality is used to avoid this problem:

δ = max(∆f1 + . . . + ∆fn) ≤ max(∆f1) + . . . + max(∆fn)    (5.25)

So instead of calculating the global sensitivity of the set of adaptive queries, we calculate the sum of the sensitivities of the individual queries. Whilst this may result in a larger value of δ, it is much easier to calculate. An important remark is that overestimating δ will result in a Laplace distribution with a higher value for its scale parameter. This in turn results in a higher "amount of noise", meaning the privacy of individuals is actually better protected if we happen to overestimate δ.
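The sketch below illustrates Theorem 5.3 combined with inequality (5.25): the caller fixes an upper bound δ on the summed per-query sensitivities in advance, and every (possibly adaptively chosen) query answer is perturbed with Lap(δ/ε). The class name and the explicit budget check are our own additions, not part of this thesis.

import random

def laplace_noise(scale):
    # One Laplace(0, scale) sample, as in the earlier sketches.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

class AdaptiveQuerySession:
    """Answers adaptive queries with noise Lap(delta/epsilon), where delta is an
    upper bound on the sum of the individual query sensitivities (eq. 5.25)."""

    def __init__(self, database, epsilon, delta):
        self._db = database
        self._scale = delta / epsilon   # the same noise scale is used for every query
        self._remaining = delta         # declared sensitivity still available

    def query(self, f, sensitivity):
        # Refuse queries once the declared bound delta would be exceeded.
        if sensitivity > self._remaining:
            raise ValueError("declared sensitivity bound delta exhausted")
        self._remaining -= sensitivity
        return f(self._db) + laplace_noise(self._scale)

# The analyst may choose the second query only after seeing the first noisy answer.
session = AdaptiveQuerySession(database=[1, 2, 3, 4], epsilon=0.5, delta=2.0)
first = session.query(lambda db: sum(1 for x in db if x > 2), sensitivity=1.0)
second = session.query(lambda db: sum(1 for x in db if x > first), sensitivity=1.0)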

5.5.4 Histogram Queries

A compelling example of Theorem 5.2 is the histogram query. A histogram query is a function that assigns each row or element to one specific bucket, and returns the number of elements in each bucket. More formally, let the rows of the database be elements in the universe D and let D1, . . . , Dd be disjoint regions of D; then the function f : D → N^d that returns the number of elements in each region is called a histogram query. As a concrete example the database may contain the ages of each individual, and the query f is a partitioning of the ages into the ranges [0, 10), [10, 20), . . . , [90, 100), ≥ 100. In this example the function that partitions the database into disjoint regions is "hardcoded" into the query. A consequence is that a user would have to create a separate differentially private query for each possible partitioning, which is inconvenient. Instead we will require that the partitioning function is given as an argument to the histogram query, and that this query will calculate a result that is differentially private for the given value of epsilon. Hence the user only has to specify the partitioning function, and the rest is handled by the histogram query itself.

At first sight a histogram query may look like d independent counting queries, meaning one has to add noise distributed according to Lap(d∆f/ε) = Lap(d/ε). This would significantly disturb the result when the buckets outnumber the rows in the database, and render the output meaningless. However, the entire query actually has sensitivity ∆f = 1. To see this, note that if we remove one row from the database, then only one bucket in the histogram changes, and that bucket only changes by one.

Thus to achieve ε-differential privacy we must add independent random noise drawn from Lap(1/ε) to each coordinate of f (recall that f returns a vector). Most importantly, the noise added is independent of the number of buckets d. When comparing this to older methods such as the SuLQ framework, where the noise added to each coordinate is O(√d/ε), this is a clear improvement [BDMN05, DMNS06].
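A minimal sketch of such a histogram query, with the partitioning function passed as an argument: since the sensitivity is 1, noise Lap(1/ε) is added to every bucket regardless of the number of buckets d. The function names and the age example are our own.

import math
import random

def laplace_noise(scale):
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_histogram(rows, partition, num_buckets, epsilon):
    """partition maps a row to a bucket index in [0, num_buckets); sensitivity is 1."""
    counts = [0] * num_buckets
    for row in rows:
        counts[partition(row)] += 1
    # Lap(1/epsilon) per bucket, independent of the number of buckets.
    return [c + laplace_noise(1.0 / epsilon) for c in counts]

# Ages partitioned into the ranges [0,10), [10,20), ..., [90,100), >= 100.
ages = [23, 35, 37, 42, 51, 67, 102]
noisy = private_histogram(ages, lambda age: min(age // 10, 10), 11, epsilon=math.log(2))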


5.6 K-Means Clustering

To illustrate the practical importance of adaptive queries, we will construct a privacy-preserving alternative to the K-means clustering algorithm. This will be done by using existing queries that were shown to be differentially private. In other words, we are going to use these existing queries as an "Application Programming Interface" to show that powerful algorithms can be constructed using only relatively simple differentially private queries.

5.6.1 Original Algorithm

Clustering algorithms attempt to divide data into several groups (clusters) that are meaningful, useful, or both. Clustering has long played an important role in a wide variety of fields, and there have been many applications of cluster analysis to practical problems [TSK06].

The original K-means algorithm is relatively simple. First, K initial centroids are randomly chosen, where K is a user-specified parameter signifying the desired number of clusters. Next, each point is assigned to the closest centroid. Then all the points assigned to a specific centroid are taken together and are considered to be a cluster. The centroid of each cluster is calculated based upon the points assigned to that cluster. This is repeated until no more changes happen. The formal description is shown in Algorithm 5.1.

Algorithm 5.1 Generic K-means algorithm

1: Select K points as initial centroids.
2: repeat
3:     Form K clusters by assigning each point to its closest centroid.
4:     Recompute the centroid of each cluster.
5: until Centroids do not change.

5.6.2 Private Algorithm

To construct a differentially private version we first make a few changes to the algorithm itself. While the original algorithm allowed a very abstract notion of "points", we will restrict ourselves to points in the d-dimensional unit cube [0, 1]^d. The set of input points from this unit cube will be denoted by P. The resulting procedure is described in Algorithm 5.2.

In the first for loop the points are partitioned into k clusters where each point is associated to the closest centroid µj. The second for loop calculates the new centroid of each cluster. This is repeated until a termination condition has been reached. Examples of termination conditions are that the centroids no longer change, or that a maximum number of allowed iterations has been reached.


Algorithm 5.2 Unit Cube K-means algorithm

1: Initial centroids µ1, . . . , µk are chosen randomly from [0, 1]^d.
2: repeat
3:     for j = 1 to k do
4:         Sj := {p ∈ P | ∀i : ||µj − p|| ≤ ||µi − p||}
5:     end for
6:     for j = 1 to k do
7:         µj := (∑_{p∈Sj} p) / |Sj|
8:     end for
9: until Termination criterion reached.
10: return µ1, µ2, . . . , µk

When attempting to construct a differentially private algorithm, the clustering step of Algorithm 5.2 would breach privacy if we needed to know the actual sets S1, . . . , Sk. However, to compute the average among the points, knowing the size and the sum of each set of points is already enough. Therefore the actual implementation of the algorithm does not need to expose these sets. The conclusion is that we can focus on the centroid update and try to find a method to calculate the new value of µj while assuring differential privacy.

The first observation is that the denominators |Sj| implicitly define a histogram query over all points. Therefore we will pose one single histogram query that returns a vector (|S1|, . . . , |Sk|) containing the size of each set. The partitioning function given as an argument to the histogram query separates the points into k regions as specified in the first for loop (i.e., the partitioning is based on the current values of µi and these represent the clusters). Posing a histogram query has the advantage that very low noise is added in each coordinate; namely, the query has a sensitivity of only one. For the numerators we see that the sum of all the points in a set Sj has sensitivity at most d. This is because the points come from the unit cube [0, 1]^d, and so the removal or addition of a point can affect the sum by at most ||(1, . . . , 1)||_1 = d. More importantly, only one of the sums can be affected. For this reason we will use a variation of the histogram query where we still partition the points, but then return the sum of the points in each cluster instead of the number of points. Thus the query will return a vector with the sum of the points for all the sets Sj, and its sensitivity is at most d.

In each iteration the query sequence has a total sensitivity of at most d + 1. When running the algorithm for a fixed number of N iterations we can add noise distributed by Lap((d + 1)N/ε) to achieve ε-differential privacy, which follows from Theorem 5.3 about adaptive queries. This demonstrates the utility of the theorem: when combining multiple (possibly different) differentially private queries to construct a more complex algorithm, it specifies how much noise must be added to obtain ε-differential privacy. In case one does not know in advance how many iterations are needed, it is possible to increase the noise parameter as the computation proceeds. A possible strategy would be to start by adding noise distributed by Lap((d + 1)/(ε/2)) in the first iteration, Lap((d + 1)/(ε/4)) in the second iteration, and so on. In other words, each time half of the remaining "privacy budget" is being used. It can be proven this indeed assures ε-differential privacy by adjusting the proof of Theorem 5.3.

5.7 Median Mechanism

The question can be asked whether we can do better than independently adding noise to the result of every query. If we can design a mechanism where we need to add less noise, more queries can be answered while maintaining the same accuracy and privacy constraints. One strategy, called the median mechanism, can answer exponentially more queries than the previously discussed mechanism where we added independent Laplace distributed noise to each query result. We will briefly describe the idea behind this strategy.

The median mechanism outperforms the Laplace mechanism by exploiting correlations between different queries. It does this by storing the set of databases that are consistent with the answers given thus far. A query whose answer reduces this set by a constant factor (or more) is considered a "hard" query. All the other queries are classified as "easy". A key intuition here is that if a query is easy, then the user can actually generate the mechanism's answer on his or her own. These easy queries can be answered by adding very little noise, whilst hard queries are answered by adding Laplace distributed noise. Roth and Roughgarden showed in their paper that only a fraction of the possible queries can be hard [RR10]. The result is that most queries are considered easy and thus more queries can be answered while maintaining privacy and using only a fixed amount of noise.

5.8 Disadvantages

Although ε-differential privacy provides a strong privacy guarantee and has many practical implementations, it is not always applicable.

For example, if one wants to publish even vague information about individuals in a database, differential privacy cannot be used. The reason is that it simply prohibits releasing any information about specific individuals. So say you want to calculate the Body Mass Index of every individual for medical research; although it has been agreed that this does not violate privacy, differential privacy still cannot be used. Its privacy guarantees are too strong for such an application.

Another disadvantage is that because differential privacy takes place in an interactive mode, the original database must be preserved while queries are being posed. If only one entity wants to query the database this is not a true problem, as we can simply delete the database once the queries have been executed. However, if one wants to retain the ability to pose queries in the future, the original database cannot be destroyed.


Part II

Privacy in New Settings


Chapter 6

Extensions of Differential Privacy

In the previous chapter we have shown how to assure differential privacy for relatively simple queries. By composing these simple queries one is able to build complex differentially private algorithms. In this chapter we will study new applications of differential privacy. In particular, we will see how to handle cases where adding noise makes no sense, and we consider the case where one wants to assure differential privacy even when the internal state of the algorithm can be leaked. This is based on the work of McSherry and Talwar [MT07] and the work of Dwork, Naor, Pitassi, Rothblum and Yekhanin [DNP+10].

6.1 When Noise Makes No Sense

Up until now our focus has been on functions that return a number or a vector of numbers. Moreover, we have assumed that adding noise to the true output makes sense. This is not always the case. When the output of our query is non-numeric it is easy to comprehend that adding Laplacian distributed noise does not make sense. For example, if the query function computes a string, tree, or any other complex object, it is unclear how we can add noise in a meaningful way. Even when the output is numeric it is still not always the best choice to simply add Laplacian noise.

6.1.1 An Example

Assume we have a database containing points in R^d where each point has the label A or B. We want to make a simple classifier by dividing the d-dimensional space into two separate half-spaces. This is done by finding a d-dimensional vector y describing a (d − 1)-dimensional hyperplane that splits the space in such a way that:

• Points labeled with A have a nonnegative inner product with y.


• Points labeled with B have a negative inner product with y.

Figure 6.1: Example database of points in R^2 having either label A or B.

The returned vector is the one which maximizes the number of points classified correctly.

An example database for points in the space R^2 is shown in figure 6.1. The optimal 1-dimensional hyperplane, which is just a line, classifying these points would be the line coinciding with the y-axis. This hyperplane is defined by the vector (1, 0). Adding noise to each coordinate of the vector would change its direction, causing some points to be classified incorrectly. We notice that, if more than 3 points are misclassified by adding noise, the vector coinciding with the x-axis is a better candidate. After all, slightly changing its direction will have a lesser influence compared to slightly changing the direction of the vector coinciding with the y-axis. Surprisingly this vector is orthogonal to the optimal result, yet appears to be a better choice when noise is being added. We can capture such preferences by defining a scoring function.

6.1.2 The Scoring Function

We can specify the preference of results by defining a scoring function. It can be defined for any privacy mechanism that randomly maps a database D ∈ D to an output object r ∈ R without making any specific assumptions about the domains D and R. In essence the scoring function gives a score to each possible output r for every database D.

Definition 6.1 (Scoring Function). Given the domain of databases D and the domain of possible results R, the scoring function q is defined as

q : D × R → R    (6.1)

and assigns a score to any pair (D, r) ∈ D × R. Higher scores denote a more appealing result.


The idea is now that, given a database D, the privacy mechanism returns an r ∈ R that tends to maximize the score q(D, r) while guaranteeing differential privacy. In our example we can give preference to vectors that correctly classify the most points, i.e., the score of a result is the number of points classified correctly. We can define this particular scoring function as follows

q(D, y) = |{p ∈ D | L(p) = A ∧ p · y ≥ 0}| + |{p ∈ D | L(p) = B ∧ p · y < 0}|

where L(p) denotes the label of point p. The first term counts the number of correctly classified points with label A and the second term the points with label B. An important remark is that although a score for a specific result can be zero or negative, this does not guarantee such a result will never be returned. It merely means that it is an output we do not prefer.
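For concreteness, a small Python sketch of this scoring function; the representation of the database as (point, label) pairs and the toy data are our own choices.

def score(points, y):
    """q(D, y): the number of points classified correctly by the hyperplane normal y.
    points is a list of (vector, label) pairs with label 'A' or 'B'."""
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    correct_a = sum(1 for p, label in points if label == 'A' and dot(p, y) >= 0)
    correct_b = sum(1 for p, label in points if label == 'B' and dot(p, y) < 0)
    return correct_a + correct_b

# A toy data set in the spirit of figure 6.1: B-points left of the y-axis, A-points right.
data = [((1, 1), 'A'), ((1, -1), 'A'), ((-1, 1), 'B'), ((-1, -1), 'B')]
print(score(data, (1, 0)))   # 4: the vector (1, 0) classifies every point correctly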

6.1.3 The Privacy Mechanism

We can now create a differentially private algorithm, called the exponential mechanism [MT07], where noise is introduced in a more meaningful way by utilising the scoring function. But first we must change the definition of L1 sensitivity to be applicable to the scoring function. In order to be precise, the definition has to take into account the fact that the function has two parameters.

Definition 6.2. The sensitivity of the scoring function q : D × R → R is

∆q = max_{r∈R} max_{D1,D2} |q(D1, r) − q(D2, r)|

for all adjacent databases D1, D2 ∈ D.

The definition of the exponential mechanism now becomes

Definition 6.3 (Exponential Mechanism). An exponential mechanism is a randomized function E : D → R that, when given the privacy parameter ε and database instance D ∈ D, outputs a result r ∈ R according to the probability density function

f_{E,D}(r) ∝ exp( ε q(D, r) / (2∆q) )    (6.2)

where q can be any scoring function and ∆q is the sensitivity of q as defined in Definition 6.2.

Notice that in this definition the probability density function is over a generic set R. While we have only given the formal definition of a pdf over the real numbers in section 2.1.3 on continuous variables, this is not a problem. It is possible to give definitions where the image of the random variable can be any generic set as long as the set is measurable. However, these definitions are out of scope for this work, and the general idea behind them remains the same. The precise definition of f_{E,D} such that it satisfies equation 2.9 on page 23 is

f_{E,D}(r) = exp( ε q(D, r) / (2∆q) ) / ∫_R exp( ε q(D, r) / (2∆q) ) dr    (6.3)

We see that the mechanism assigns the highest probability to the best answer, and the probability assigned to any other answer decreases exponentially for lower scores given the current database, hence the name "exponential mechanism" [Dwo11]. We will prove this mechanism indeed assures differential privacy.

Theorem 6.1 (Differential Privacy). An exponential mechanism as defined in Definition 6.3 gives ε-differential privacy.

Proof. Given an exponential mechanism E where q is the scoring function, take S ⊆ R and s ∈ S. We get that for adjacent databases D1, D2 ∈ D:

f_{E,D1}(s) / f_{E,D2}(s) = [ ∫_R exp( ε q(D2, r) / (2∆q) ) dr / ∫_R exp( ε q(D1, r) / (2∆q) ) dr ] · [ exp( ε q(D1, s) / (2∆q) ) / exp( ε q(D2, s) / (2∆q) ) ]    (6.4)

The first quotient becomes

∫_R exp( ε q(D2, r) / (2∆q) ) dr / ∫_R exp( ε q(D1, r) / (2∆q) ) dr
≤ ∫_R exp( ε (q(D1, r) + ∆q) / (2∆q) ) dr / ∫_R exp( ε q(D1, r) / (2∆q) ) dr    (6.5)
= exp(ε/2) · ∫_R exp( ε q(D1, r) / (2∆q) ) dr / ∫_R exp( ε q(D1, r) / (2∆q) ) dr    (6.6)
= exp(ε/2)    (6.7)

And for the second quotient one has

exp( ε q(D1, s) / (2∆q) ) / exp( ε q(D2, s) / (2∆q) ) = exp( ε (q(D1, s) − q(D2, s)) / (2∆q) )    (6.8)
≤ exp( ε ∆q / (2∆q) )    (6.9)
= exp(ε/2)    (6.10)


For both quotients the inequalities hold because ∆q is defined as the largest possible difference and is always positive. Combined this gives

f_{E,D1}(s) / f_{E,D2}(s) ≤ exp(ε)    (6.11)

Integrating over S with respect to s yields the desired result.
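When the output space R is finite, the exponential mechanism can be implemented by direct weighted sampling. The sketch below is a minimal illustration under that assumption (a finite candidate list whose scores are computed exactly); the function names and the toy example are ours, and subtracting the maximum score only improves numerical stability without changing the resulting distribution.

import math
import random

def exponential_mechanism(database, candidates, score, sensitivity, epsilon):
    """Sample r from a finite set with Pr[r] proportional to
    exp(epsilon * q(D, r) / (2 * sensitivity))."""
    scores = [score(database, r) for r in candidates]
    shift = max(scores)   # shifting by a constant leaves the normalized weights unchanged
    weights = [math.exp(epsilon * (s - shift) / (2.0 * sensitivity)) for s in scores]
    return random.choices(candidates, weights=weights, k=1)[0]

# Example: privately pick the most common label in a toy database (score sensitivity 1).
def label_count(db, label):
    return sum(1 for x in db if x == label)

print(exponential_mechanism(["a", "a", "b", "a", "c"],
                            candidates=["a", "b", "c"],
                            score=label_count, sensitivity=1.0, epsilon=1.0))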

6.1.4 Limitations and Future Directions

A problem that hasn’t been addressed yet in this work is the issue of effi-ciently sampling from the distribution fE,D (defined in 6.3). This is neededwhen we actually want to implement the privacy mechanism and return arandom value according to the given probability density function. The workof McSherry and Talwer [MT07] that introduced the exponential mechanismalso leave this problem largely unanswered. They mentioned that if the scor-ing function q is simple, e.g., when it’s piecewise linear or in the cases wherethe output space can be efficiently discretized, an efficient algorithm canbe constructed [MT07]. Unfortunately for more complex scoring functionsthere are currently no efficient methods to sample from fE,D.

Another problem is that computing the score q(D, r) can be a computationally demanding task. If there are no efficient algorithms to calculate the score of a result, the exponential mechanism is also infeasible in practice. In such a situation we would like to use approximation algorithms to calculate the score q(D, r). However, these approximation algorithms ought to have a small sensitivity ∆q, otherwise too much noise would have to be added. Unfortunately, most approximation algorithms do not seem to have this property [MT07].

6.2 Pan Privacy

There are situations where a data curator is pressured to hand over all the data it has collected or is still processing. This can be caused by a legal compulsion such as a subpoena¹. To protect privacy in such a setting we will develop algorithms that retain their privacy properties even if their internal state becomes visible to an adversary.

The prefix pan comes from the Greek pan meaning "all" or "involving all members". So pan privacy signifies privacy of all parts, in this case even the internal state of the algorithm.

¹Informally, a subpoena in such a situation can be seen as a command to a witness to produce certain documents for the court.


6.2.1 Introduction

We will focus on streaming algorithms where the data arrives element by element and therefore it is natural to discard the data that has already been processed. This is not a requirement per se, as batch algorithms working on a complete database may also be pan private. A batch algorithm could for example protect against an intruder that cannot directly access the data because it is stored on a different server [DNP+10]. Pan privacy will not assume the existence of a private (possibly small) internal memory state that can only be accessed by the analyst. Such an assumption would after all conflict with the goal of making the algorithm retain privacy when the internal state has been leaked. This means that we cannot even store the last few records that have been analyzed, otherwise the privacy of the individuals corresponding to those last few records still in memory would be violated. It may seem that without such a secret state one cannot calculate accurate yet private results. However, that is not the case: as we shall see, many properties of a stream can be calculated whilst satisfying pan privacy within certain accuracy bounds.

The privacy guarantees will be based on differential privacy. For this to be possible we will first define when two data streams are adjacent. We can choose between two possible definitions that differ in their level of granularity. The first definition would hide all the elements in the stream of a user (user level pan privacy), whilst the second definition hides the existence of individual events (event level pan privacy).

6.2.2 Preliminary Definitions

As already mentioned, we will study streaming algorithms where the input is a data stream. We assume the incoming data stream is of unbounded length and composed of elements in the universe X. As a motivating example you can imagine that each element describes a visit of a user, represented by his or her IP address, to a particular website [DNP+10]. Pan privacy will then protect against the presence or absence of an IP address in the stream.

The definition of differential privacy is based on adjacent databases and is not directly applicable to streaming algorithms. We begin by defining what a data stream is and when two data streams are considered adjacent.

Data Streams

The precise definition of a data stream is

Definition 6.4 (Data Stream). A data stream over the universe X is a sequence of elements x ∈ X that is possibly unbounded. A more sophisticated way of representing the abstract sequence is with the notation

{x_i}_{i=1}^n

where n is the length of the sequence and ∀i : x_i ∈ X. If the sequence is unbounded we drop the superscript n from the notation.

From this definition we learn that a data stream can have an infinite number of elements. There are many occasions where we want to talk about data streams with a bounded number of elements. Such bounded data streams are frequently called data stream prefixes in other works, and we will also adopt this naming convention.

Definition 6.5 (Data Stream Prefix). A data stream prefix is a data stream of bounded length.

Based on the definition of data streams there are now two general possibilities to interpret adjacent data streams. The first option, which would assure event level pan privacy, considers the presence or absence of only one single element x ∈ X. This means that any single event or action caused by a particular individual remains private, but the presence or absence of the individual herself is not protected. The second and more powerful definition, which will assure user level pan privacy, considers the presence or absence of all elements equal to any x ∈ X. Here the presence or absence of an individual is protected: based on the output of the algorithm one cannot deduce whether the data stream contained information about a particular individual or not. In this chapter our focus will be on user level privacy. Hence we arrive at the following definition of adjacent data streams.

Definition 6.6 (X-adjacent Data Streams [DNP+10]). The data streams S and T over universe X are X-adjacent if there exists an x ∈ X such that the subsequences S′ and T′, obtained by removing all occurrences of the element x from respectively S and T, are identical.

In other words, data streams S and T are X-adjacent if they differ only in the presence or absence of any number of occurrences of a single element x ∈ X. If all occurrences of x are deleted from both streams, then the resulting subsequences ought to be identical. This definition permits strings of radically different lengths to be adjacent, which is seen as a positive property, as it provides the most general privacy guarantees. A relaxed definition of adjacent data streams, in line with user level pan privacy, has been proposed in the paper by Dwork, Naor, Pitassi and Rothblum [DNPR10]. In their work data streams S and S′ are adjacent if there exist x, x′ ∈ X such that when changing some of the instances of x in S to instances of x′ one obtains S′. In this work we will focus on definition 6.6.

Another common operation performed on data stream prefixes is concatenation. It is a straightforward operation, but in order to avoid any ambiguity we will provide a definition of it.


Definition 6.7 (Data Stream Concatenation). Given the data stream prefixes S = {s_i}_{i=1}^n and T = {t_i}_{i=1}^m, their concatenation, denoted by S T, is defined as {c_i}_{i=1}^{n+m} where

c_i = s_i if 1 ≤ i ≤ n, and c_i = t_{i−n} otherwise.    (6.12)

We observe that the concatenation of two data stream prefixes is again a data stream prefix, i.e., the length of the result is always bounded.
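A small sketch of how Definition 6.6 can be checked for two finite data stream prefixes; the function name and the example streams are our own, and the check only applies to bounded prefixes.

def x_adjacent(s, t):
    """Definition 6.6 for finite prefixes: S and T are X-adjacent if removing all
    occurrences of some single element x makes the two sequences identical."""
    if s == t:
        return True   # they differ in zero occurrences of any element
    for x in set(s) | set(t):
        if [e for e in s if e != x] == [e for e in t if e != x]:
            return True
    return False

# The streams differ only in occurrences of "alice", so they are X-adjacent.
print(x_adjacent(["alice", "bob", "alice"], ["bob"]))        # True
print(x_adjacent(["alice", "bob"], ["carol", "bob"]))        # False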

6.2.3 Definition of Pan Privacy

We will model a randomized streaming algorithm using two randomized functions:

• State: This randomized function is called the state mapping function and maps data stream prefixes to an internal state of the algorithm. The returned state is the one the algorithm is in immediately after processing the prefix. For example, if the algorithm is a Turing machine, the returned state is the contents of its tape together with the current state the Turing machine is in. We will abbreviate the function using the symbol S.

• Out: The randomized function Out is called the output mapping function and maps data stream prefixes to an output. This output is the one produced after processing the complete prefix. We will abbreviate the function using the symbol O.

Definition 6.8 formalizes these assumptions about a randomized streaming algorithm. It is based on the paper by Dwork et al. [DNP+10].

Definition 6.8 (Randomized Streaming Algorithm). A randomized streaming algorithm M is represented by a pair of randomized functions

S : S → I
O : S → R

where S is the set of all possible data stream prefixes, I is the set of all possible internal states of M and R the set of all possible outputs.

Without loss of generality we will assume that the output is returned after updating the internal state of the algorithm. Note that the results of the functions S and O are allowed to be dependent on each other, as is mostly the case. We will use M(S) as a shorthand notation for (S(S), O(S)). Similarly, the notation M = (S, O) is used to specify the state and output mapping functions respectively. Additionally, the algorithm does not know the length of the data stream prefix in advance. Conceptually this means it will keep processing input elements until it receives a special stop signal, after which the algorithm will produce an output.

A streaming algorithm can now be thought of as taking steps at discrete time intervals. A new step begins on receiving the next element in the stream, after which the algorithm updates its internal state based on this element. We assume receiving an element and updating the state happen in one atomic step, so that an adversary is not able to see a received element when obtaining access to the internal state. In other words, an intrusion may only occur between atomic steps [DNPR10].

Pan Privacy

We are finally ready to define pan privacy. Remember that the definition is based on differential privacy.

Definition 6.9 (ε-differential Pan Privacy). Let M be a randomized streaming algorithm defined by a state mapping S and output mapping O, with possible internal states I and set of possible outputs R. Algorithm M satisfies ε-differential pan privacy if for all I′ ⊆ I and R′, R′′ ⊆ R, and for all X-adjacent data stream prefixes S = U V and S′ = U′ V′, it holds that

Pr[ (S(U), O(U), O(S)) ∈ (I′, R′, R′′) ] ≤ e^ε · Pr[ (S(U′), O(U′), O(S′)) ∈ (I′, R′, R′′) ]

The probability spaces are over the internal coin flips of S and O.

Pan privacy thus protects against a single intrusion of the internal state. After the intrusion the algorithm can continue processing data and eventually output a result which still protects the privacy of the individuals. Figure 6.2 shows a timeline that illustrates these two events: the intrusion occurs after the prefix U has been processed, and the algorithm produces output once the complete prefix S has been processed.

6.2.4 Density Estimation

As an example application of pan privacy we will design a streaming algorithm that estimates the density of the stream while assuring differential pan privacy (as we will see, it achieves 2ε-differential pan privacy). The density of a data stream is the fraction of items in the universe X that appear at least once. We assume the universe X has finite size.


Figure 6.2: Illustration of when a possible intrusion occurs, and afterwards output is still produced. Each block represents an incoming element of the data stream. The intrusion happens after reading prefix U and the output is produced after reading prefix S.

Algorithm Description

The algorithm will have to store which items have already appeared in the stream so far. However, this has to be done in a way that does not leak too much information, i.e., differential privacy must hold even if the internal state becomes visible to an adversary. The solution to this problem begins by defining two probability distributions D0 and D1(ε) on the set {0, 1}. The distribution D0 returns both zero and one with probability 1/2. Distribution D1(ε) returns one with probability 1/2 + ε/4 and hence returns zero with probability 1/2 − ε/4. Put differently, D1 has a slight bias towards one whilst D0 assigns equal mass to both zero and one.

Algorithm 6.1 Density Estimation

Initialization
1: Create a table of size |X| with a single one-bit entry for each item in X
2: For every item x ∈ X initialize its entry in the table with an independent random draw from D0

Processing
3: Upon receiving a value x ∈ X from the data stream, update x's entry in the table by drawing its new value from D1(ε)

Output
4: Let θ = fraction of entries in the table with value 1
5: Output the density value f′ = 4(θ − 1/2)/ε + Lap(1/(ε|X|))

During the initialization step of the algorithm a table of size |X| is created. Each entry of the table contains one single bit that is set to a draw from the distribution D0. When receiving an element x ∈ X from the data stream, the entry corresponding to x is updated by drawing a value from the distribution D1(ε). Finally, at the end of the algorithm, the returned output is 4(θ − 1/2)/ε + Lap(1/(ε|X|)), where θ is the fraction of entries in the table with the value 1. These steps are detailed in Algorithm 6.1.

Step 3 in Algorithm 6.1 is assumed to be an atomic step and cannot be interrupted. Therefore an intrusion cannot happen while an element x is being processed, hence its value can never be learned by an adversary.
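A sketch of Algorithm 6.1 as a small Python class, assuming a finite universe known up front; the class and helper names are ours, and laplace_noise is the same helper used in the earlier sketches.

import random

def laplace_noise(scale):
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

class PanPrivateDensity:
    """Pan-private density estimation over a finite universe (Algorithm 6.1)."""

    def __init__(self, universe, eps):
        self.eps = eps
        self.index = {x: i for i, x in enumerate(universe)}
        # Initialization: every entry is an independent draw from D0 (a fair bit).
        self.table = [1 if random.random() < 0.5 else 0 for _ in universe]

    def process(self, x):
        # Processing: redraw x's entry from D1(eps), i.e. 1 with probability 1/2 + eps/4.
        self.table[self.index[x]] = 1 if random.random() < 0.5 + self.eps / 4 else 0

    def output(self):
        # Output: theta is the fraction of ones; rescale and add Lap(1/(eps * |X|)).
        n = len(self.table)
        theta = sum(self.table) / n
        return 4 * (theta - 0.5) / self.eps + laplace_noise(1.0 / (self.eps * n))

# Example: a stream of visits over a small universe of IP addresses.
estimator = PanPrivateDensity(universe=["ip1", "ip2", "ip3", "ip4"], eps=0.5)
for visit in ["ip1", "ip3", "ip1"]:
    estimator.process(visit)
print(estimator.output())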

Correctness of Privacy Guarantees

To prove that this algorithm satisfies 2ε-differential pan privacy we begin with the following claim.

Claim 6.2. For 0 ≤ ε ≤ 3/2 the distributions D0 and D1(ε) are ε-differentially private, meaning that for a discrete random variable X0 with distribution D0 and X1 with distribution D1(ε), and for all possible outputs b ∈ {0, 1}, it holds that

exp(−ε) ≤ Pr[X1 = b] / Pr[X0 = b] ≤ exp(ε)

Here we use the formulation of differential privacy as explained in section 5.3.2, which is equivalent to the original definition. We will now prove the previous claim.

Proof. Let us begin by proving the statement for b = 1. We have that

Pr[X1 = 1] / Pr[X0 = 1] = (1/2 + ε/4) / (1/2) = 1 + ε/2    (6.13)

So the inequalities we need to prove become

exp(−ε) ≤ 1 + ε/2 ≤ exp(ε)    (6.14)

For ε = 0 the inequalities are true. Further observe that exp(−ε) is a decreasing function and 1 + ε/2 is an increasing function. This proves that the first inequality is true for ε > 0. To prove the second inequality we know that the function

exp(ε) − 1 − ε/2    (6.15)

is increasing for ε > 0, by taking the derivative and showing that it is nonnegative. From this we know the second inequality is also true for nonnegative ε, completing the case for b = 1.

The case for b = 0 again begins by calculating the fraction of the probabilities, resulting in the inequalities

exp(−ε) ≤ 1 − ε/2 ≤ exp(ε)    (6.16)

For nonnegative ε the second inequality is trivially true as we already proved that 1 + ε/2 ≤ exp(ε). The first inequality is more tricky and is equivalent to proving


f(ε) = 1 − ε/2 − exp(−ε) ≥ 0    (6.17)

The two zero points of the function f are at 0 and at 1.593 . . ., and the function is positive between these two points². Given that 3/2 < 1.593 . . ., the proof is complete.

Utilising this claim we can prove the privacy guarantee of our density estimation algorithm. The proof is based on the one given in [DNP+10].

Claim 6.3. Algorithm 6.1 provides 2ε-differential pan privacy.

Proof. The first part of the proof shows that the internal state of the algorithm is ε-differentially private at the time of intrusion. The second part shows that the final output is also differentially private, even for an adversary who has previously seen the internal state during an intrusion.

Part one: Private internal state. For every item x ∈ X, either the item did not appear in the data stream and its entry in the table is drawn from D0, or it did appear and its entry is drawn from D1(ε). Even if the item appeared multiple times, its current entry is still drawn from D1(ε). By Claim 6.2 these entries all satisfy ε-differential privacy. The users in X are thus guaranteed protection against an unannounced intrusion.

Part two: Private output. The query that would count the fraction of 1's in the table at the end of the algorithm has sensitivity 1/|X|. Therefore adding noise sampled from Lap(1/(ε|X|)) guarantees ε-differential privacy. This is exactly what is done in the algorithm, thus proving the output on its own is ε-differentially private.

From the composability of differential privacy we infer that the total algorithm satisfies 2ε-differential pan privacy. This completes the proof.

Accuracy

All that is left to do is to explain how the return value is calculated and prove that with high probability this result is accurate.

Let f denote the fraction of items in the universe X that appeared at least once. The expected fraction of entries in the table with a value of 1 is then

E[θ] = (1 − f) · 1/2 + f · (1/2 + ε/4) = 1/2 + (ε/4)f    (6.18)

Solving the equation for f gives

f = 4(E[θ] − 1/2)/ε    (6.19)

²The actual value of the second zero point is 2 + LambertW(−2e^{−2}), but such a calculation is out of scope for this work.


which is exactly what our algorithm returns when substituting the expected value of θ with its actual value. It is possible to prove that the result of the density estimation algorithm has good accuracy.

Claim 6.4. For every t > 0 it holds that

Pr[|f − E[f]| ≥ t] ≤ 2 exp( −|X| t² ε² / 8 )

where f is a random variable defined as the output of Algorithm 6.1 and |X| the size of universe X.

Proof. This follows from Hoeffding's inequality applied to the sum of the random variables saved in the table. Let us first restate this inequality (for details see section 2.4.3).

Theorem 6.5 (Hoeffding's Inequality). If X1, X2, . . . , Xn are independent random variables with ∀i : ai ≤ Xi ≤ bi and with

f = (X1 + X2 + . . . + Xn) / n

then for t > 0

Pr[|f − E[f]| ≥ t] ≤ 2 exp( −2n²t² / ∑_{i=1}^{n}(bi − ai)² )    (6.20)

Before we can apply the bound we must rewrite how the random variable f is calculated. Letting n = |X| we get

f = (4/ε)(θ − 1/2)    (6.21)
= (4/ε)( (X1 + X2 + . . . + Xn)/n − 1/2 )    (6.22)

The Xi denote the independent random variables representing all the entries in the table. We continue transforming this formula until we can apply Hoeffding's inequality.

f = (4/ε) · [ (X1 − 1/2) + (X2 − 1/2) + . . . + (Xn − 1/2) ] / n    (6.24)
= [ (4/ε)(X1 − 1/2) + (4/ε)(X2 − 1/2) + . . . + (4/ε)(Xn − 1/2) ] / n    (6.25)
= (X′1 + X′2 + . . . + X′n) / n    (6.26)


The last equation defines the new random variables X′i = (4/ε)(Xi − 1/2). Knowing that each variable Xi takes values in {0, 1}, it is straightforward to see that each X′i takes values in {−2/ε, 2/ε}. The denominator in the exponent of Hoeffding's inequality becomes

∑_{i=1}^{n} ( 2/ε − (−2/ε) )² = n · 16/ε²    (6.28)

We can now directly apply Hoeffding's inequality, which results in

Pr[|f − E[f]| ≥ t] ≤ 2 exp( −n t² ε² / 8 )    (6.29)

And since we choose n = |X| the proof is finished.

Claim 6.4 shows that the probability of a result decreases exponentially as it diverges from the actual value. This substantially biases the distribution towards accurate results. It also tells us that as the size of the universe X grows linearly, the probability of the result diverging from the actual value decreases exponentially.

Finally, we notice that as ε becomes larger the result is more accurate, and vice versa. This matches our intuition, since decreasing the "amount of privacy an individual has" increases the accuracy of a result.


Chapter 7

Handling Network Data

During the last decade there has been a growing interest in network data. Here network data is defined as any kind of data which can be meaningfully represented by a graph. In the previous chapters we have focused mainly on tabular data. While technically a graph can be stored in a tabular database, the existence of relationships between entities profoundly alters many aspects of the privacy problem. In short, the analysis of network data is more complex and varied than the analysis of traditional tabular data, and involves different privacy risks. In this chapter we will formally introduce the graph model and several of its properties. Specific privacy problems that arise when working with network data will also be addressed.

7.1 Motivation

The focus will be on social networks and their inherent privacy problems. As a demonstration of the importance of network data we will give a brief overview of several domains producing and analyzing such data. It also shows that both fully identified and supposedly anonymized network data are widely available, and thus accessible by malicious individuals as well. This overview is partly based on that of Narayanan and Shmatikov [NS09].

Data Mining

There is a multitude of datasets which can be used for research. For example, corporations like AT&T possess a database of nearly 2 trillion phone calls going back decades [Hay06], and they have in-house research facilities to perform analysis on these graphs. But smaller operators without such luxury must share their graphs with external researchers, which is not always done due to privacy concerns. Call graphs might be useful for social scientists and have also proven to be valuable to detect illicit activity such as calling fraud [Wil97]. Another purpose is improving national security, where communication graphs are being studied to determine if command-and-control structures of terrorist cells have a unique structure that can be automatically detected [Hay06].

Health-care professionals, epidemiologists and sociologists also collect various information about geographic, friendship, family and sexual networks to study disease propagation and risk. Potterat et al. [PPPM+02] published a social network which shows a set of individuals related by sexual contacts and shared drug injection. Researchers had to weigh the benefit of public analysis against possible losses of privacy to the individuals involved, without clear knowledge of potential attacks. In another study the National Longitudinal Study of Adolescent Health (Add Health) collected the sexual-relationship network of students of an anonymous Midwestern high school. The graph resulting from the Add Health study has been published in an anonymized form [BMS04].

Online Social Networks

Another common source is online social networks. Such data can be collected using a crawler that automatically visits public user profiles. Researchers working on graph anonymization algorithms usually use these data sets to test the performance of their algorithms, or to show the feasibility of certain attacks [NSR11, BDK07, LT08]. Although this information is already publicly available, the generation of such graphs benefits adversaries who might not have the capabilities to perform large and automated crawling of these networks.

Advertising on online social networks is, most of the time, enhanced by using the information of the social graph. There is empirical evidence that using social network data increases the profit of advertising. As a result, network operators are inclined to share the social graph with advertising partners. If we look at the Facebook privacy policy we conclude that the complete social graph is not handed over to the advertiser; only information of users that give permission to hand over the data is shared, which however might still be a large proportion of the users. The advertiser can specify to whom certain advertisements are shown; for example, it is possible to target a category of user like a "moviegoer" or a "sci-fi fan" [Fac], but most of the data mining appears to be done by Facebook itself. On the other hand, based on the testimony of Chris Kelly, the chief privacy officer at Facebook, non-personally identifying information might be shared with advertisers [Uni08]. This raises the question whether the shared data truly preserves privacy or actually resembles a form of naive anonymization, which does not assure privacy for all users.


Other Sources and Advantages

In network security it is often essential to get a global overview of the network if one wants to detect certain attacks. In particular, it is difficult to defend against global phenomena such as botnets when operators can observe only the traffic in their local domains [RAW+10]. Understandably, network operators are hesitant to share information about the traffic in their networks. Mechanisms that are able to create a global overview of the network while assuring privacy could provide a step in the right direction to solve these problems.

Another domain that can benefit from network data is that of face recognition, by exploiting the fact that users who appear together in photographs are likely to be neighbours in the social network [SZD08]. Since a large number of photos are published on social networks (e.g. Flickr and Facebook), having access to the anonymized graphs can greatly improve large-scale facial re-identification.

7.2 Graph Model

Evidently we will model network data by a graph G = (V, E) where V is the set of nodes and E is a set of edges between the nodes.

It would also have been possible to store the nodes and edges in a tabular database and then apply the previous anonymization techniques such as k-anonymity. However, such a mechanism would ruin the utility of the underlying network. In particular, if viewed as tabular database queries, many graph analyses require the use of joins on the edges of the graph, e.g., when calculating the diameter or transitivity of a graph. So the first disadvantage is that tabular anonymization does not work well if later on you want to run analyses which require many joins.

Another problem is that, if we look at k-anonymity for example, it would require that there are at least k entries representing "similar" edges (i.e., edges that are indistinguishable to the adversary), otherwise an entry would not be indistinguishable from k other entries. Admittedly this depends on what you consider sensitive attributes when applying k-anonymization, but as we shall see the existence of an edge poses privacy risks. Such a modification to the network data would completely destroy any utility it had before.

7.2.1 Assuring Privacy

As always there are multiple ways one can retain privacy. For network data there are three main approaches if the goal is to release data while retaining privacy. These are depicted in figure 7.1. The first option is to release a so-called naively anonymized graph, where identifiers are removed and attributes are coarsened or not released at all.


Figure 7.1: Illustration of the three possible approaches to protecting privacy in graph data [Hay10].

The weaknesses of this paradigm are discussed in chapter 8. The second option is to transform the network so the actual structure of the graph is altered. Chapter 9 will show one method that can be used, but which unfortunately does not retain privacy in all situations. Finally, it is possible to withhold the network data from the analyst, and only allow him or her to pose queries, adding noise to the results in order to retain privacy.

7.2.2 Risk Assessment

Let’s consider the basic case where only the graph is released and all at-tributes have been removed. In effect every node has been given a completelyrandom identifier, which on itself doesn’t disclose any private information.The question is now what the privacy risk of releasing this graph would be.

The answer depends of course on the network data the graph represents. In some cases the privacy risk could be low, but in other cases there could be a high risk. Here we will focus on worst-case scenarios. Our example will be the "HIV graph" of Colorado Springs, where nodes correspond to individuals and the edges stand for sexual relationships or shared drug injection [PPPM+02]. Such data contains sensitive information and can have a high privacy risk. We will discuss the two most important possible privacy violations when only the structure of the graph is released.


Figure 7.2: Preventing identity disclosure doesn't prevent edge disclosure. If an adversary knows an individual corresponds to either A, B or D, his or her identity is protected. However all three nodes have an edge to C, hence the existence of the edge is not protected.

Identity Disclosure

A possible attack is that an adversary learns the location of an individual in the released graph. This by itself could lead to a privacy violation. For example, assume that only people having HIV are included in the graph; then clearly identifying an individual in the graph means the adversary has now learned that he or she has HIV.

Identity disclosure inevitably also leads to degree disclosure. That is, when learning the location of an individual, an adversary will also learn the degree of the corresponding node. This can also have privacy implications. For example, the degree of a node could signify how many sexual relationships an individual has.

Edge Disclosure

Another risk is that of edge disclosure, where the adversary learns of the existence of an edge between two individuals. Again taking the HIV graph as an example, a malicious individual should not be able to learn whether two individuals are in a sexual relationship or not.

Remark that if we prevent identity disclosure, this doesn't necessarily prevent edge disclosure. Figure 7.2 illustrates this point. Say the adversary wants to determine if there is an edge between Alice and Bob. He or she knows that Alice has degree 2 and that Bob has degree 3. Based on this information alone Alice can be node A, B or D, and Bob must be node C. In other words the identity of Alice is protected, as the adversary doesn't know the corresponding node. However notice that all three nodes have an edge to node C. So even though the adversary doesn't know the exact location of Alice, he or she does learn the existence of the edge to Bob.

Power of the Adversary

Our result of chapter 4 also applies to network data. That is, whenever useful information is released, there is a possibility that the privacy of an individual is compromised. This means that if we decide to release a (possibly modified) version of the graph, there will always remain a risk that the privacy of someone can be violated. So we must focus on reducing the chance of a privacy violation.

Commonly, the power of the adversary is first specified in terms of access to auxiliary information and computational power, and then a privacy mechanism is designed to protect against such adversaries. A flexible model to describe how much information an adversary has access to is presented in the paper by Hay et al. [HMJ+08].

7.3 Graph Theory

Graphs are a natural way to represent network data. We will give a brief overview of some basic graph definitions and properties that will be used in forthcoming chapters.

7.3.1 Basic Definitions

Naturally we begin with the definition of a graph.

Definition 7.1 (Graph). A graph is an ordered pair G = (V, E) where V is a set called the nodes, and E ⊆ V × V is a set called the edges.

Commonly we will also talk about simple graphs. These are graphs satisfying the following three constraints:

1. They have no loops, i.e., they have no edges starting and ending at the same node.

2. They are undirected, meaning if the graph has an edge (v1, v2) it must also contain the edge (v2, v1).

3. And finally they must be unweighted, which means edges are not mapped to a number.
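To make the later examples in this chapter concrete, the short sketch below shows one possible in-memory representation of a simple undirected graph as an adjacency-set dictionary. The class name and helper methods are our own illustration and are not taken from any cited work.

import random  # not needed here, but used by later sketches in this chapter

class SimpleGraph:
    """A simple undirected graph: no loops, no edge weights."""

    def __init__(self):
        self.adj = {}  # node -> set of neighbouring nodes

    def add_node(self, v):
        self.adj.setdefault(v, set())

    def add_edge(self, u, v):
        if u == v:
            return  # ignore loops: simple graphs have none
        self.add_node(u)
        self.add_node(v)
        self.adj[u].add(v)  # store both directions, so the edge (u, v)
        self.adj[v].add(u)  # implies the edge (v, u)

    def degree(self, v):
        return len(self.adj[v])

# Example: a triangle on nodes A, B, C plus an isolated node D.
g = SimpleGraph()
for a, b in [("A", "B"), ("B", "C"), ("A", "C")]:
    g.add_edge(a, b)
g.add_node("D")
print(g.degree("A"))  # 2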

Two graphs can have the same structure but their nodes can have different labels. This is formalized by the definition of an isomorphism.

Definition 7.2 (Isomorphism). Graphs G = (Vg, Eg) and H = (Vh, Eh) are isomorphic if there exists a bijection f between the sets of nodes Vg and Vh such that

∀u, v ∈ Vg : (u, v) ∈ Eg ⇐⇒ (f(u), f(v)) ∈ Eh

An isomorphism can be seen as a relabeling of the nodes of a graph so that both graphs become identical. An interesting property of an isomorphism is a fixed point, which will mainly be used in the proofs given in chapter 8.


Definition 7.3 (Fixed Point). A node v is a fixed point of an isomorphism f : V → V′ if v ∈ V ∩ V′ and f(v) = v. Here V and V′ are the sets of nodes of both graphs.

Another common definition is that of a path. This is based on the definition of incident nodes and edges, and on the definition of a walk.

Definition 7.4 (Incident). Given a graph G = (V, E), the nodes u, v ∈ V are incident to the edge e = (u, v). Similarly e is incident to both u and v.

Definition 7.5 (Walk). A walk on a graph is an alternating sequence of nodes and edges, beginning and ending with a node, in which each edge is incident with the node immediately preceding it and the node immediately following it.

Based on these definitions we can give the definition of a path.

Definition 7.6 (Path). A path is a walk in which all edges and nodes are distinct. As an exception the first and last nodes are allowed to be identical, and if this is the case we talk about a closed path, otherwise it's an open path.

Often we will want to talk not about all the nodes and edges of a graph, but only about a selected region of the graph. To do this we will use the notion of a subgraph.

Definition 7.7 (Subgraph). A graph H = (Vh, Eh) is called a subgraph of G = (V, E) if Vh ⊆ V and Eh ⊆ E such that Eh ⊆ Vh × Vh.

Additionally we will call G a supergraph of H if and only if H is a subgraph of G. Another interesting concept is the complement of a graph.

Definition 7.8 (Complement of a Graph). The complement of a simple graph G = (V, E) is the simple graph Ḡ = (V, Ē) where

Ē = {(v1, v2) ∈ V × V | v1 ≠ v2 ∧ (v1, v2) ∉ E}

7.3.2 Graph Properties

When evaluating the impact of anonymization on a graph, we will analyze how certain properties of the graph change. For this reason we will first introduce several commonly used and important properties.

Density

The density of a graph is defined as the number of edges divided by the number of possible edges. Thus for a graph G = (V, E) it is equal to

density = |E| / ( |V| (|V| − 1) / 2 )



Figure 7.3: The degrees of A, B, C and D are 3, 2, 1 and 2 respectively.


Figure 7.4: Graph with a degree sequence of (4, 3, 3, 2, 2, 1, 1, 0).

Degree

The degree of a node in an undirected graph is the number of connections it has to other nodes. In the graph displayed in figure 7.3, node A has a degree of 3, node B a degree of 2 and node C a degree of 1.

Often we will let d(v) stand for the degree of a node v. For directed graphs one can distinguish between the in-degree and the out-degree. As the names suggest, these correspond to the number of ingoing and outgoing edges respectively. In this work we will mainly focus on undirected graphs.

Degree Sequence

The degree sequence is arguably one of the most important properties of a graph. It influences both the structure of a graph and processes that operate on it.

It is defined as a monotonically non-increasing sequence of all the node degrees in a graph. For the example graph in figure 7.4 it is (4, 3, 3, 2, 2, 1, 1, 0). Some degree sequences uniquely correspond to only one possible graph (an extreme example being complete graphs). Others define a world of possible graphs.
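As a small illustration, the following sketch computes the degree sequence, and with the earlier formula also the density, of a graph stored as an adjacency-set dictionary. The representation and the function names are our own.

def degree_sequence(adj):
    """Degrees of all nodes, sorted in non-increasing order."""
    return sorted((len(neigh) for neigh in adj.values()), reverse=True)

def density(adj):
    """Number of edges divided by the number of possible edges."""
    n = len(adj)
    m = sum(len(neigh) for neigh in adj.values()) // 2  # each edge is stored twice
    return m / (n * (n - 1) / 2)

# Example: a triangle A-B-C with one extra pendant node D.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(degree_sequence(adj))  # [3, 2, 2, 1]
print(density(adj))          # 4 edges out of 6 possible, about 0.67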

Clustering Coefficient

The clustering coefficient measures the likelihood that two neighbours of a node are themselves connected. It's commonly defined as


CC(G) = (1/|V|) Σ_{v∈V} ∆(v) / ( d(v)(d(v) − 1)/2 )   (7.1)

for a graph G = (V, E), where ∆(v) denotes the number of triangles (cliques of size 3) containing node v, and d(v) is the degree of v. The denominator in the sum is the number of possible triangles node v can be part of, and the numerator is the actual number of triangles it is part of. Unfortunately this formula is not defined if there are nodes with a degree of 1 or 0 (division by zero). Most of the time the term in the sum is then treated as being equal to zero.

Another common variation of the formula is the following

CC1(G) = Σ_{v∈V} ∆(v) / Σ_{v∈V} ( d(v)(d(v) − 1)/2 )   (7.2)

This avoids the division by zero when some nodes have a degree of zero or one.

Both of these formulas are not ideal, because values that are not defined (division by zero) should preferably have no influence on the result [Kai08]. To avoid this we introduce a variation of formula 7.1. Let S denote the set {v ∈ V | d(v) ≥ 2}, then we have

CC2(G) = (1/|S|) Σ_{v∈S} ∆(v) / ( d(v)(d(v) − 1)/2 )   (7.3)

Intuitively the clustering coefficient is a measure of neighbourhood connectivity. If a node has no neighbourhood it shouldn't contribute to the value of the clustering coefficient, making formula 7.3 a better option. However, when testing the effect certain graph modification algorithms have, we will report the clustering coefficient as calculated in equation 7.1. The reason behind this choice is that equation 7.1 is more frequently used, and to remain consistent we will adopt the same formula.
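The sketch below computes the clustering coefficient according to equations 7.1 and 7.3 for a graph given as an adjacency-set dictionary. Treating undefined terms as zero in equation 7.1 follows the convention mentioned above; the function names are our own.

def local_cc(adj, v):
    """Fraction of possible triangles through v that actually exist."""
    neigh = list(adj[v])
    d = len(neigh)
    if d < 2:
        return None  # undefined for degree 0 or 1
    triangles = sum(1 for i in range(d) for j in range(i + 1, d)
                    if neigh[j] in adj[neigh[i]])  # this is Delta(v)
    return triangles / (d * (d - 1) / 2)

def cc_eq_7_1(adj):
    """Equation 7.1: average over all nodes, undefined terms counted as 0."""
    terms = [local_cc(adj, v) for v in adj]
    return sum(t if t is not None else 0.0 for t in terms) / len(adj)

def cc_eq_7_3(adj):
    """Equation 7.3: average over the nodes with degree at least 2 only."""
    terms = [t for t in (local_cc(adj, v) for v in adj) if t is not None]
    return sum(terms) / len(terms)

# Example (triangle plus pendant node): eq. 7.1 gives about 0.58, eq. 7.3 about 0.78.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(cc_eq_7_1(adj), cc_eq_7_3(adj))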

Average Path Length

First consider the shortest path between two nodes v1, v2 ∈ V and denote its length by d(v1, v2). For unweighted graphs the distance is defined as the number of edges of the shortest path between v1 and v2. If there is no path between the nodes the convention will be that d(v1, v2) = 0. The average path length is then defined as

APL = ( 1 / (n(n − 1)) ) Σ_{i,j} d(vi, vj)   (7.4)


The abbreviation APL will commonly be used in tables and figures. It is a widely used property and can for example stand for the average number of people one will have to communicate through to contact a complete stranger.

Diameter

The diameter is the maximum distance between any pair of nodes. Written formally this becomes

Diameter = max_{vi,vj∈V} d(vi, vj)   (7.5)
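Both quantities can be computed with a breadth-first search from every node, as in the sketch below. Unreachable pairs contribute 0, following the convention above; the function names and data layout are our own.

from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distances (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def apl_and_diameter(adj):
    """Average path length (equation 7.4) and diameter (equation 7.5)."""
    n = len(adj)
    total, diameter = 0, 0
    for u in adj:
        for v, d in bfs_distances(adj, u).items():
            if v != u:
                total += d
                diameter = max(diameter, d)
    return total / (n * (n - 1)), diameter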

Power Law Estimation

The degree distribution of empirically obtained graphs often follows a power law distribution. Intuitively this means that there are a high number of nodes with a low degree and only few nodes with a high degree. A quantity x obeys a power law if it is drawn from a probability distribution

p(x) ∝ x^(−α)   (7.6)

where α is a parameter of the distribution known as the exponent or scaling parameter [CSN09]. Most of the time the degree distribution doesn't obey the power law for all values of x. Particularly it only applies for values greater than some minimum x_min. For most of our analysis we will pick x_min = 3 unless otherwise noted.

A nontrivial problem is how to know whether the degree distribution follows a power law, and if so what the value of α is. To estimate the scaling parameter α we will use the algorithm presented in the paper by Clauset, Shalizi and Newman [CSN09]. Their algorithm can estimate both x_min and α, but for our data we have noticed better results when manually setting x_min equal to 3. When testing the effect of anonymization we will measure how much the scaling parameter α changes.
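For a fixed x_min, [CSN09] give a closed-form maximum-likelihood estimate of α. The sketch below uses their continuous approximation for discrete data (shifting x_min by 1/2); it is only a rough stand-in for the full method, which also selects x_min via a goodness-of-fit criterion. The function name is our own.

import math

def estimate_alpha(degrees, x_min=3):
    """MLE of the scaling parameter alpha for the tail degrees >= x_min.

    Continuous approximation from Clauset, Shalizi and Newman:
    alpha = 1 + n / sum(ln(x_i / (x_min - 1/2))).
    """
    tail = [d for d in degrees if d >= x_min]
    if not tail:
        raise ValueError("no degrees above x_min")
    return 1 + len(tail) / sum(math.log(d / (x_min - 0.5)) for d in tail)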

7.4 Differential Privacy for Graphs

The original definition of differential privacy was given for tabular data. Hence, before being able to apply differential privacy to graphs, we must first define differential privacy for graphs. Note that we will only briefly cover this subject to inform the reader of its existence, and only the most important definitions are given.

7.4.1 Possible Definitions

The definition of differential privacy for graphs mainly depends on what we consider “neighbouring graphs”. A first definition, resembling the opt-in or opt-out semantics of differential privacy, says that two graphs are neighbours if they differ by at most one node and all of its incident edges.

Definition 7.9 (Node neighbouring graphs). Two graphs G = (V, E) and G′ = (V′, E′) are node neighbours if and only if |V ⊕ V′| = 1 and E ⊕ E′ = {(u, v) | u ∈ (V ⊕ V′) or v ∈ (V ⊕ V′)}.

It results in one of the strongest forms of privacy that can be defined, since results should be roughly the same whether or not an individual appears in the graph. For completeness, the definition of differential privacy for graphs becomes

Definition 7.10 (ε-differential privacy for graphs). A randomized function K gives ε-differential privacy if for all neighbouring graphs G = (V, E) and G′ = (V′, E′), and for all outputs S ⊆ Range(K),

Pr[K(G) ∈ S] ≤ exp(ε) · Pr[K(G′) ∈ S]

The definition purposely doesn't specify that the graphs are node neighbouring. It allows us to specify other conditions which define when two graphs are neighbouring, resulting in different privacy guarantees. When assuming node neighbouring graphs, one is assuring node ε-differential privacy. Unfortunately node ε-differential privacy is too strong for many analyses, because the amount of noise that must be added to assure privacy makes the final output meaningless [HLMJ09]. For example, the empty graph consisting of n isolated nodes is a neighbour of the star graph (one node connected to n nodes) for node differential privacy, indicating that certain properties can differ significantly for neighbouring graphs.

A more modest definition considers two graphs G and G′ to be neighbours if one can produce G′ from G by adding or removing one edge, or by adding or removing one isolated node [Hay10].

Definition 7.11 (Edge neighbouring graphs). Two graphs G = (V, E) and G′ = (V′, E′) are edge neighbours if and only if |V ⊕ V′| + |E ⊕ E′| ≤ 1.

Assuming edge neighbouring graphs in the definition of differential privacy for graphs results in edge ε-differential privacy. Using this definition less noise has to be added, making more analyses possible while assuring privacy. However, it may not be a strong enough privacy guarantee in all situations. For example, the degree of a node is not well protected under this definition. When anonymizing graphs about sexual interaction or shared drug usage the degree of a node is sensitive information and must also be protected. The middle ground is captured by k-edge neighbouring graphs, resulting in k-edge differential privacy.

Definition 7.12 (k-edge neighbouring graphs). Two graphs G = (V, E) and G′ = (V′, E′) are k-edge neighbours if and only if |V ⊕ V′| + |E ⊕ E′| ≤ k.


For k-edge differential privacy two graphs are considered neighbours if they differ by at most k edges or nodes. Note that for nodes with a lower degree than k this is effectively the same as node differential privacy. Nodes with a higher degree face more exposure, but this is only natural, since they also have a higher influence on the structure of the graph. In other words high degree nodes are bound to have a higher probability of disclosing some information.

In complete analogy to differential privacy for tabular data, one can now also define the sensitivity of a function for neighbouring graphs. Then it can be proved that adding noise from Lap(∆f/ε), where ∆f is the sensitivity of the function, assures privacy. For more details we refer the reader to [HLMJ09].

7.4.2 Degree Distribution Estimation

Estimating the degree distribution is a good example of the applicability of k-edge differential privacy. In fact it becomes almost trivial to calculate it under k-edge differential privacy. First the real degree distribution is computed. Its sensitivity is 2k, since adding or removing an edge changes the degree of two nodes by one. Therefore adding noise distributed by Lap(2k/ε) to each position in the real degree distribution will assure k-edge differential privacy.
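A minimal sketch of this mechanism, assuming (as described above) that the released object is the vector of node degrees and that independent Laplace noise with scale 2k/ε is added to each entry; the parameter and function names are our own.

import numpy as np

def dp_degree_sequence(adj, epsilon, k=1):
    """Release the degree sequence under k-edge differential privacy
    by adding Laplace(2k/epsilon) noise to every entry."""
    degrees = np.array(sorted((len(n) for n in adj.values()), reverse=True), dtype=float)
    noise = np.random.laplace(loc=0.0, scale=2 * k / epsilon, size=degrees.shape)
    return degrees + noise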

7.5 Asymptotic Notation

Because at times asymptotic notation is abused, which can lead to ambiguity in certain cases, we will give precise definitions of the notations that are used in the remainder of this work and cited papers. Asymptotic notation is used when analyzing algorithms.

Definition 7.13 (Big-O Notation). We write f(x) = O(g(x)) iff

∃c > 0,∃n0 ≥ 0,∀n ≥ n0 : f(n) ≤ cg(n)

and say that f is Big-O of g.

When proving that a function f is indeed Big-O of g the subsequent theorem can be used.

Theorem 7.1. If lim_{x→∞} f(x)/g(x) ∈ R then f(x) = O(g(x)).

Let’s first recall the definition of the limit to positive infinity of a real-valued function.

Definition 7.14 (Limit to Infinity). If we have that

∀ε > 0,∃S > 0,∀x : x > S =⇒ |f(x)− L| < ε


then we say that the limit of f as x approaches infinity is L. We will denote this with

lim_{x→∞} f(x) = L

With this definition in mind the proof of theorem 7.1 becomes rather straightforward.

Proof. From the definition of the limit to infinity it follows that

∀ε > 0, ∃S > 0, ∀x > S : |f(x)/g(x) − L| < ε

By the reverse triangle inequality |f(x)/g(x)| − |L| ≤ |f(x)/g(x) − L|, so for all such x

|f(x)/g(x)| − |L| < ε
|f(x)/g(x)| < ε + |L|
|f(x)| < (ε + |L|) · |g(x)|

Let ε′ = ε+ |L| then

∀ε′ > |L| , ∃S > 0,∀x > S : |f(x)| < ε′ |g(x)|

This surely implies

∃ε′ > 0,∃S > 0,∀x > S : |f(x)| ≤ ε′ |g(x)|

Apart from different variable names this precisely matches the definition of the Big-O notation.

7.5.1 Special Notations

More important is how we deal with abuse of notation. Commonly the Big-O notation is used inside a formula. For example we can have something like

g(x) = h(x) · 2^O(f(x))

which doesn't strictly follow the definition. In such cases we will define this to mean

∃c > 0, ∃n0 > 0, ∀n ≥ n0 : g(n) ≤ h(n) · 2^(c·f(n))

One can even have multiple occurrences of the Big-O notation in a formula. Say you have

g(n) = n^O(n²) · O(log log n)

which would translate to the following

∃c1, c2 > 0, ∃n0 > 0, ∀n ≥ n0 : g(n) ≤ n^(c1·n²) · c2 log log n


In practice, when there are multiple occurrences of the Big-O notation inside a formula, we will commonly define c to be the maximum of all ci and use that constant instead. So the previous formula would become

∃c > 0, ∃n0 > 0, ∀n ≥ n0 : g(n) ≤ n^(c·n²) · c log log n

where c is the maximum of c1 and c2.


Chapter 8

Naive Anonymization and an Active Attack

In this chapter we will show that publishing only the graph of a social network, i.e., only the nodes and edges, can already disclose sensitive information. We will describe an active attack where the adversary must be able to insert nodes and edges in the social network before the graph is released.

8.1 Naive Anonymization

One possibility that was once thought to provide enough privacy is called naive anonymization. In this mechanism the nodes are renamed to a unique but random identifier and the structure of the graph is left unmodified. In other words only the graph of the social network is published. One might reason that because all the identifying attributes of a person are removed, the privacy of all the persons is now protected.

Unfortunately this is not the case. Nodes can be re-identified even if only the graph of a social network is published. This is possible because the edges among the nodes contain a large amount of information. Particularly the neighborhood around a node can be used to uniquely locate a node in the naively anonymized graph.

The exact definition of naive anonymization is

Definition 8.1 (Naive Anonymization [Hay10]). Given a network modeled by an undirected graph G = (V, E), the naive anonymization of G is an isomorphic graph Ga = (Va, Ea) defined by a random bijection f : V → Va.
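A minimal sketch of this mechanism, assuming the graph is given as an adjacency-set dictionary and the random identifiers are simply a random permutation of 0, ..., |V| − 1; this choice and the function name are our own.

import random

def naive_anonymization(adj):
    """Replace every node by a unique random identifier,
    keeping the edge structure unchanged."""
    nodes = list(adj)
    new_ids = list(range(len(nodes)))
    random.shuffle(new_ids)
    f = dict(zip(nodes, new_ids))  # the random bijection f : V -> Va
    return {f[u]: {f[v] for v in neigh} for u, neigh in adj.items()}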

This prevents re-identification when the adversary possesses information about the attributes of a node (e.g. their name, social security number, address, and so on). So in absence of any other information privacy would indeed be preserved. However, the adversary may possess information about the structure of the graph itself. Such information can be obtained by either manipulating the network before it is released, or by having access to auxiliary knowledge about the social network structure. This corresponds to the active and passive attack respectively, as we will see.

8.2 Active Attack

The first attack found on a naively anonymized network was due to Backstrom et al. [BDK07]. They described an active attack where the adversary must be able to construct additional new nodes and edges in the network before the anonymized version of the network is created and released. Although such an attack can be hard to execute in practice, it's nevertheless a perfect example of how anonymity in graph data does not imply privacy. In their work they also suggested a passive attack which is fairly similar to the active attack, but less powerful.

8.3 Background

We will assume the social network is a graph G = (V, E) where V is the set of nodes of size n and E ⊆ V² is the set of edges. Nodes commonly correspond to individual users but can also represent groups of users, e.g., in many social networks such as Facebook a user can choose to be part of certain public or private groups. An edge (u, v) signifies that u has a relationship with v. In this context there is a relationship between two individuals if they are friends on the particular network, they have communicated with each other or they have interacted multiple times using other means. A relationship between an individual and a group occurs when the individual is a member of the particular group.

As mentioned the adversary must be able to create new nodes and edges in the network. He or she can create new nodes by creating new user accounts on the social network. Creating an edge between two nodes can be done by communicating with the other node, befriending the individual, subscribing to the updates of the user, or any other possible action depending on the targeted network. Adding edges between two nodes that are under control of the adversary is easy, and creating directed edges between a node that is controlled by an adversary and another targeted node in the network is also trivial. However it can be hard to make the targeted node respond to messages of the adversary. We conclude that adding directed edges is easy to do, but creating an undirected edge (i.e., creating a directed edge from the targeted node back to our node) can be a nontrivial problem.

The attack becomes easier to carry out if the released anonymized graph data is directed [BDK07]. Because of this the attack will focus on the harder case of undirected graphs. In other words we assume the data curator released a graph where the direction of the edges is removed, and only undirected edges remain. At its core the attack consists of picking a set of targeted users w1, . . . , wb. The goal is then to find these users in the anonymized graph. Once they have been found we can use the anonymized graph to learn whether there is an edge between any of the targeted users wi and wj. Apart from the existence of edges, the adversary will also learn the degree of the targeted users.

8.4 Idea Behind the Attack

Backstrom et al. suggest two active attacks called the walk-based and the cut-based attack. In the cut-based attack fewer nodes need to be created by the adversary. The disadvantage is that the attack is more complicated, easier to detect, and it requires a computationally more demanding algorithm. Therefore we will only consider the walk-based attack in this chapter.

The attack begins by creating k new user accounts for a small value of k. Edges are created between these users, resulting in a subgraph H. These newly created users are then connected to the targeted nodes w1, . . . , wb (e.g. by sending messages to the user or by subscribing to the updates of the user) and possibly other nodes as well. This is done in such a way that we can uniquely identify each targeted user wi given the location of H; how this is achieved will be explained in the following section. When the data curator releases the anonymized graph G it will contain the subgraph H along with the edges to the targeted nodes. The attack proceeds by finding the constructed graph H. Based on this information the adversary is able to determine the location of the targeted nodes w1, . . . , wb, and thus re-identify these users in the supposedly anonymized graph. The adversary will learn the degree of the targeted nodes and also any edges that exist between any of the targeted users. This of course compromises the privacy of the attacked individuals.

The difficulty of the attack lies in constructing a subgraph H that is unique within the global graph G. This is particularly difficult because the adversary doesn't know G at the time of constructing H. He or she must therefore attempt to construct a graph that is unique regardless of the structure of G. Secondly the adversary must be able to efficiently find the subgraph H which has been hidden within the anonymized graph G. In other words he or she must create an instance of the subgraph isomorphism problem that is tractable to solve, even for very large graphs G.

8.5 The Details

The main idea of the attack has already been discussed in the previous section, i.e., we construct a new subgraph which we will then attempt to locate in the anonymized graph. We will let X denote the set of nodes we will add to the network and H = (X, E′) the subgraph created by these nodes, where E′ denotes the edges we will add within H itself. The graph G is the naively anonymized graph released by the data curator which contains the constructed subgraph H. To denote the original graph the notation G − H is used. For a set of nodes S we will let G[S] denote the subgraph of G induced by the nodes in S. Based on this notation we have H = G[X].

8.5.1 Requirements for Success

There are a number of technical requirements the subgraph must satisfy in order for the attack to be successful:

1. There exists no set of nodes S ≠ X such that G[S] and H are isomorphic.

2. Given the anonymized graph G we must be able to efficiently locate the subgraph H.

3. The subgraph H must contain no non-trivial automorphisms.

An automorphism of a graph (V, E) is an isomorphism from the graph to itself, i.e., a bijection from the set of nodes V to itself that preserves the edges. Put differently the nodes are permuted while the structure of the graph is preserved.

If the first requirement holds we can locate the nodes of the subgraph we have constructed. The second requirement assures we can find these nodes within an acceptable amount of time. The third requirement is needed so we can correctly label the nodes as x1, . . . , xk and hence find the targeted nodes.

8.5.2 Construction of the Subgraph H

The construction of the subgraph happens in four steps:

(1) We begin by creating k = (2 + δ) log n new nodes, and represent these using the set X = {x1, . . . , xk}. Then for each node xi ∈ X we pick an integer value ∆i ∈ [d0, d1]. The value ∆i stands for the number of edges node xi will have to nodes in G − H. The two constants d0 ≤ d1 can be chosen arbitrarily, but normally they are chosen in such a way as to reduce the detectability of the subgraph H. Furthermore it's also possible to simply choose each ∆i arbitrarily, but in experiments it works well to choose them uniformly at random from the interval [d0, d1] [BDK07].

(2) Now pick the nodes W = {w1, w2, . . . , wb} which we want to re-identify in the anonymized graph. Here b = O(log² n). For each targeted node wj we choose a set Nj ⊆ X such that all Nj are distinct and each xi


appears in at most ∆i of the sets Nj (note that this places an additional constraint on the possible values b can take). We add edges from each node wj to all nodes in Nj. This allows us to find the targeted nodes once we have located H in the anonymized release.

(3) Edges are added from H to G − H so that each node xi ∈ X has exactly ∆i edges to G − H. These edges must be added while preserving the unique linkage to the targeted nodes, i.e., for each targeted user wj there must be no other nodes in G − H that are also connected to precisely the nodes in Nj. Otherwise we would not be able to uniquely find the targeted users once we have located H in the anonymized graph.

(4) As a last step we add all the internal edges of our subgraph H. We begin by including all the edges (xi, xi+1) for 1 ≤ i ≤ k − 1. This will allow us to uniquely label the nodes of H once we have located its position, provided that it has no non-trivial automorphisms. Finally we include each other edge with probability 1/2. The degree of a node xi ∈ X in the full graph is denoted by ∆′i, which is equal to ∆i plus its number of edges to other nodes in X.

As already mentioned, in order for the attack to be successful, we must be sure that H contains no non-trivial automorphisms. Results from the domain of random graph theory imply that with high probability our constructed graph H has no non-trivial automorphisms [Bol01]. In the remainder of this work we will assume that H indeed has no non-trivial automorphisms.

An example of a subgraph that might be constructed by this algorithm is shown in figure 8.1. Note that the figure doesn't include any additional edges between the nodes xi and gi. These were excluded to prevent the figure from getting too cluttered.
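The following sketch illustrates the internal structure of H built in step (4): a path x1, . . . , xk plus each remaining pair connected independently with probability 1/2. The representation and function name are our own, and connecting H to the targeted nodes and to G − H is omitted.

import random

def build_subgraph_H(k):
    """Internal edges of H: the path x1,...,xk plus random edges with probability 1/2."""
    nodes = [f"x{i}" for i in range(1, k + 1)]
    edges = set()
    for i in range(k - 1):                      # the path x1-x2-...-xk
        edges.add((nodes[i], nodes[i + 1]))
    for i in range(k):                          # every other pair with probability 1/2
        for j in range(i + 2, k):
            if random.random() < 0.5:
                edges.add((nodes[i], nodes[j]))
    return nodes, edges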

8.5.3 Finding our Constructed Subgraph

We attempt to find H in G by searching for the path x1, x2, . . . , xk. This search begins by considering every node in G with degree ∆′1. Then we try to add any node with the degree ∆′2, continuing with nodes of degree ∆′3 and so on. Apart from this degree test we also use a second edge test which verifies that the edges among the nodes currently in the path exactly correspond with the edges that exist in our constructed graph H. In other words our search space is pruned by two constraints. The first one being the degree test, where each possible candidate for node xi must have the correct degree ∆′i. The second one is an internal structure test, which verifies that each candidate for node xi must have the correct subset of edges to the current path x1, x2, . . . , xi−1. All this is done by constructing a search tree T according to the following details:


Figure 8.1: Example construction of a subgraph H with k = 8. The nodes x1 to x8 form the subgraph constructed by the adversary. Nodes w1, w2 and w3 are the targeted nodes. The edges drawn in red are the unique sets Ni used to identify the targeted nodes. For completeness a few nodes gi are also shown; these represent the remainder of the released graph G.

1. Every node in the search tree other than the root corresponds to a node in G. Letting α be a node in T, we will define f(α) as the corresponding node in G. It's possible that a node of G appears multiple times in the search tree. The tree T is now constructed in such a way that for every path of nodes α1, . . . , αℓ from the root to a leaf node, the corresponding nodes f(α1), . . . , f(αℓ) form a path in G with the same sequence of degrees and the same internal edges as x1, . . . , xℓ. Additionally every such path in G corresponds to a distinct rooted path in T [BDK07].

2. The actual construction of T begins with a fixed dummy root node α∗. All the nodes in G having degree ∆′1 are considered children of the root node. For further constructing T we consider every current leaf node α with its associated path α∗ = α0, α1, . . . , αℓ = α from the dummy root node. We know that this path corresponds to a path f(α1), f(α2), . . . , f(αℓ) in the graph G (notice that here we ignore the dummy root node α∗ = α0). The task is now to find all neighbours v of f(αℓ) in G for which the degree of v is ∆′ℓ+1 and which have the correct subset of edges to the path, i.e.,

∀i : 1 ≤ i ≤ ℓ : (f(αi), v) is an edge iff (xi, xℓ+1) is an edge.

For every v satisfying both these tests we create a new child β of α with f(β) = v.


3. Once the tree has been built and there is a unique rooted path of length k, then we know that it corresponds to the nodes in H. Moreover we immediately learn the labels of these nodes since we associated each node α in T with a specific node f(α) in G, and we know that for the k-length path the node α1 corresponds with x1, α2 with x2, and so on.

Using H we can now re-identify the targeted nodes w1, w2, . . . , wb by following the unique set of edges Ni to the target node wi. Note that the running time of this algorithm is only slightly larger than the size of T.
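A compact sketch of this search, written as a depth-first traversal of the (implicit) tree T with the degree test and the internal structure test as pruning rules. The data layout (adjacency sets for G, an edge set for H indexed from 0) and the function name are our own.

def find_subgraph(G_adj, H_edges, full_degrees):
    """Return all node sequences of G matching the path x1,...,xk,
    pruned by the degree test and the internal structure test.

    G_adj: node -> set of neighbours in the released graph G
    H_edges: set of pairs (i, j), 0-based, that are edges inside H
    full_degrees: expected degree (Delta'_i) of each x_i in the full graph G
    """
    k = len(full_degrees)
    matches = []

    def extend(path):
        i = len(path)                       # choosing a candidate for x_{i+1}
        if i == k:
            matches.append(list(path))
            return
        candidates = G_adj if i == 0 else G_adj[path[-1]]
        for v in candidates:
            if v in path or len(G_adj[v]) != full_degrees[i]:
                continue                    # degree test
            ok = all(((j, i) in H_edges or (i, j) in H_edges) == (path[j] in G_adj[v])
                     for j in range(i))     # internal structure test
            if ok:
                path.append(v)
                extend(path)
                path.pop()

    extend([])
    return matches

If there is a unique match, it gives the locations of x1, . . . , xk, after which the targeted nodes are recovered by following the sets Ni.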

8.5.4 Privacy Breach

Having identified the targeted nodes we learn their node degree and whether there are edges between any of the targeted nodes. As explained in chapter 7 this can result in a privacy breach depending on the nature of the network that is being attacked.

8.6 Analysis

Both the correctness of our assumptions and the efficiency of the attack have to be proved. Our assumption is that with high probability the graph H will be a unique subgraph in G. To prove the efficiency we show that the size of the search tree T does not grow too large.

Uniqueness of H

Although describing the attack was not complicated, the analysis and proofs are complex and not easily comprehensible [BDK07]. Since the true focus of this work is to design robust anonymization techniques we will omit these long and intricate proofs. Instead we will focus on a simplified version of the theorem which states that H is unique with high probability. This proof is also included in the paper by Backstrom, Dwork and Kleinberg [BDK07] as a motivating example, and provides the technical intuition and idea behind the more general statement of the theorem. Let us first state the general theorem.

Theorem 8.1. Let k ≥ (2 + δ) log n for an arbitrary positive constant δ > 0, and suppose we use the following process to construct an n-node graph G:

1. We start with an arbitrary graph G′ on n − k nodes and we attach new nodes X = {x1, . . . , xk} and arbitrarily include some edges to nodes in G′.

2. We build a random subgraph H on X by including each edge (xi, xi+1) for i = 1, . . . , k − 1, and including each other edge independently with probability 1/2.


Then with high probability there is no subset of nodes S ≠ X in G such that G[S] is isomorphic to H = G[X].

We will prove a more restricted statement where the subset of nodes S is not merely required to be unequal to X, but must be disjoint from X. This is easier to prove while still capturing the idea behind the more general proof.

Claim 8.2. Let F0 be the event that there is no subset of nodes S disjoint from X such that G[S] is isomorphic to H. Then with high probability the event F0 holds.

Proof. We pick an arbitrary graph G, yet consider it to be fixed during the proof. Pick a specific ordered sequence S = (s1, s2, . . . , sk) of k nodes in G − H having edges between the consecutive nodes si and si+1. Define ES as the event that the function f : S → X given by f(si) = xi is an isomorphism between G[S] and H. Since graph G is fixed, all but k − 1 of the edges in H are chosen independently with probability 1/2, and since S is disjoint from X, we have that

Pr[ES] = (1/2)^(C(k, 2) − (k − 1))   (8.1)

where the probability is over all possible graphs H. Simplifying the exponent using basic operations gives

C(k, 2) − (k − 1) = k(k − 1)/2 − 2(k − 1)/2   (8.2)
                  = (k² − 3k + 2)/2   (8.3)
                  = (k − 1)(k − 2)/2   (8.4)
                  = C(k − 1, 2)   (8.5)

And hence we get

Pr[ES] = (1/2)^C(k−1, 2) = 2^(−C(k−1, 2))   (8.6)

where the probability is again over all possible graphs H. We now use the fact that the complement event F̄0 equals ∪S ES, where the union is over all possible sequences S of k nodes in G − H. Recalling that G has n nodes, the number of different sequences S is bounded by n^k. As we picked k ≥ (2 + δ) log n when constructing the subgraph H we have that n ≤ 2^(k/(2+δ)). This gives


Pr[F̄0] = Pr[∪S ES]   (8.7)
        ≤ n^k · 2^(−C(k−1, 2))   (8.8)
        ≤ 2^(k²/(2+δ)) · 2^(−k²/2) · 2^(1 + 3k/2)   (8.9)
        = 2^(−δk²/(2(2+δ)) + 3k/2 + 1)   (8.10)

where inequality 8.8 follows from a union bound as explained in section 2.4.1. We notice that the probability of the event F̄0 goes to zero exponentially quickly in k, hence F0 holds with high probability. This completes the proof of claim 8.2.

For the complete proof we refer the reader to the paper by Backstrom et al. [BDK07]. An important part of their complete proof is claim 8.3 as stated below.

Claim 8.3. With high probability, for any constant c ≥ 15, the following holds. Let A, B and Y be disjoint sets of nodes in G with A, B ⊆ X, and let f : A ∪ Y → B ∪ Y be an isomorphism. Then the set Y contains at most c log k nodes that are not fixed points of f.

Although no proof of this claim is given here, we will use it when proving that the size of the search tree T is relatively small.

Efficient Recovery

Finally it's possible to show that the search tree T doesn't grow too large in size.

Theorem 8.4. For every ε > 0, with high probability the size of T is O(n^(1+ε)).

Proof. Recall that k denotes the size of H, and let d be the maximum degree in G of a node in H. As a result of the construction of H both of these quantities are O(log n). Let Γ be a random variable equal to the number of nodes in the search tree T. We decompose Γ into the sum of two simpler variables. The first one is Γ′ and stands for the number of paths of G − H corresponding to some node of T. The second variable Γ′′ is the number of paths in G that meet H and correspond to some node in T. We have Γ = Γ′ + Γ′′. Our goal is now to find an upper bound on E[Γ]. During the proof we will assume that the events F0 and F1 hold, where F1 is the high-probability event of claim 8.3.

Part One. We begin by finding an upper bound for E[Γ′]. For any path P in G − H we define the random variable Γ′P as

Γ′P = 1 if P corresponds to a node in T, and 0 otherwise.


Further we call a path P feasible if the degree of every node in P is at most d. If P is not feasible then Γ′P is almost surely (i.e., with probability one) equal to 0.

Now consider any feasible path P of length j ≤ k. For P to be present in T, i.e., for there to exist a node α in T defining a path that corresponds to P, we need the edges among nodes in P to match the edges among {x1, x2, . . . , xj}. In the probability calculations we can imagine that the edges between nodes in {x1, x2, . . . , xj} are being generated after P is chosen. The edges (xi, xi+1) for 1 ≤ i ≤ j − 1 resulting from the path x1, x2, . . . , xj are of course already known to exist. We have, even for the special cases when the length j of the path is 1 or 2, that

E[Γ′P] = 0 · Pr[Γ′P = 0] + 1 · Pr[Γ′P = 1]   (8.11)
       = Pr[Γ′P = 1]   (8.12)
       = (1/2)^(C(j, 2) − (j − 1))   (8.13)
       = 2^(−C(j−1, 2))   (8.14)

A feasible path of length j can start with any node in the graph having degree at most d. At worst these are n possibilities. For every consecutive node in the path we can choose between at most d neighbours of the current node, since the degree of all nodes in a feasible path is at most d. Thus the worst case total number of feasible paths of length j is n·d^(j−1). Hence we obtain

E[Γ′] ≤ Σ_P E[Γ′P]   (8.15)
      ≤ Σ_{j=1}^{k} n·d^(j−1)·2^(−C(j−1, 2))   (8.16)
      = n Σ_{j=1}^{k} ( d·2^(−(j−2)/2) )^(j−1)   (8.17)

To simplify the inequality we use the fact that d = O(log n). In other words there exists a c > 0 and n0 > 0 such that for all n ≥ n0:

E[Γ′] ≤ n Σ_{j=1}^{k} ( c log(n)·2^(−(j−2)/2) )^(j−1)   (8.18)

Let s = 2 log log n + 2 log c + 2 and split the sum into two parts based on this value.


E[Γ′] ≤ n Σ_{j=1}^{s−1} ( c log(n)·2^(−(j−2)/2) )^(j−1) + n Σ_{j=s}^{k} ( c log(n)·2^(−(j−2)/2) )^(j−1)   (8.19)

Focusing on the second sum we notice that every j in it is larger than or equal to s. Hence we obtain the following

j ≥ s = 2 log log n + 2 log c + 2   (8.20)
(j − 2)/2 ≥ log log n + log c   (8.21)
2^(−(j−2)/2) ≤ (c log n)^(−1)   (8.22)
c log(n)·2^(−(j−2)/2) ≤ 1   (8.23)
( c log(n)·2^(−(j−2)/2) )^(j−1) ≤ 1   (8.24)

From this we deduce that each term in the second sum is smaller than or equal to one. In other words, once j is Θ(log log n) each term is at most 1. Since there are fewer than k such terms, the total sum is surely bounded by k. Hence we have found the following upper bound for the second sum

n Σ_{j=s}^{k} ( c log(n)·2^(−(j−2)/2) )^(j−1) ≤ nk   (8.26)

We continue with finding an upper bound for the first sum. Observe that for j ≥ 1 the factor 2^(−j/2) can only decrease the sum. Hence we can drop the factor and introduce an inequality. Further working out this inequality gives us

n Σ_{j=1}^{s−1} ( c log(n)·2^(−(j−2)/2) )^(j−1) = n Σ_{j=1}^{s−1} ( 2c log(n)·2^(−j/2) )^(j−1)   (8.27)
                                               ≤ n Σ_{j=1}^{s−1} ( 2c log(n) )^(j−1)   (8.28)
                                               = n Σ_{j=0}^{s−2} ( 2c log(n) )^j   (8.29)
                                               = n · ( (2c log n)^(s−1) − 1 ) / ( 2c log n − 1 )   (8.30)


The last equality uses the formula for a geometric series. Finally we can rewrite this back into asymptotic notation.

n Σ_{j=1}^{s−1} ( c log(n)·2^(−(j−2)/2) )^(j−1) = n · O(log n)^O(log log n)   (8.31)

After all this we end up with our upper bound

E[Γ′] = n · ( O(log n)^O(log log n) + O(log n) )

which we can further simplify. Let c be the maximum of the constant values used in all three Big-O notations (this is possible because all functions are monotonically increasing). With this we can rewrite the upper bound.

∃n0 > 0, ∀n > n0 : E[Γ′] ≤ n( (c log n)^(c log log n) + c log n )
                         = n( 2^(log(c log n)·c·log log n) + 2^(log(c log n)) )
                         = n( 2^(c·log(c)·log log n + c·(log log n)²) + 2^(log(c) + log log n) )
                         = n·2^(O((log log n)²))

At long last we have the upper bound E[Γ′] = n·2^(O((log log n)²)) and this concludes the first part of the proof.

Part Two. For the second part of the proof we will try to find an upper bound for E[Γ′′]. This is done by decomposing it into separate random variables for each possible pattern in which a path could snake in and out of H. In order to describe such a pattern we will use the notion of a template. A template τ is a sequence of ℓ ≤ k symbols (τ1, . . . , τℓ) where each symbol τi is either a distinct node of H (i.e., τi ∈ X) or the special symbol ⊥. We call the set of all nodes of H that appear in τ the support of τ, and denote it s(τ). For example the support of the template

τ = (x2, x6, ⊥, ⊥, x1, ⊥)

is {x1, x2, x6}. We will say that a path P in G is associated with τ if the ith node on P lies in G − H for τi = ⊥, and otherwise is equal to τi ∈ X. So for our example template, possible paths associated with it are

x2, x6, g1, g4, x1, g33
x2, x6, g5, g2, x1, g12
. . .

where each gi can be any node in G − H. Finally, we say that the reduction of τ, denoted τ̄, is the template for which τ̄i = ⊥ whenever τi = ⊥, and for which τ̄i = xi otherwise. We will call such a template τ̄ a reduced template. The reduced template of our example is

(x1, x2, ⊥, ⊥, x5, ⊥)

Notice that the support of a reduced template need not be the same as the support of the original template.

Let Γ′′τ be a random variable equal to the number of paths P associated with τ that are represented in the search tree T. The length ℓ of τ is also equal to the length of the path P. If at least one such path exists, then there is an isomorphism

f : s(τ) → s(τ̄)

given by f(x) = xi when τi = x. Why is this an isomorphism? Well, if a path P is represented in the search tree T then it passed the internal structure test, which verifies that each node in the path has the same subset of edges as is present in our constructed subgraph. The isomorphism we defined is exactly this correspondence between the nodes of H appearing on the path P (which passed the internal structure test) and the nodes xi of our constructed subgraph H at the matching positions.

We know from claim 8.3, taking A = ∅, Y = s(τ) and B = s(τ̄) − s(τ), that with high probability all but at most O(log k) nodes are fixed points of f. For the remainder of the proof we will assume this is indeed the case. Put differently, we have that τ agrees with τ̄ on all but O(log k) positions.

From this one can deduce that the only templates τ for which Γ′′τ can be non-zero are those that differ in at most O(log k) positions from a reduced template. After all, if a template differs in more than O(log k) positions then, following from the contraposition of claim 8.3, there wouldn't exist an isomorphism between s(τ) and s(τ̄), and hence, again by contraposition, there would be no path P associated with τ that is represented in the search tree T.

To continue with our search for an upper bound we will decompose Γ′′τ into a sum of random variables Γ′′τP for every feasible path P associated with τ. If P is represented in T then Γ′′τP = 1 and otherwise Γ′′τP = 0. We now have that

E[Γ′′] ≤ Σ_{τ,P} E[Γ′′τP] = Σ_{τ,P} Pr[Γ′′τP = 1]

To calculate this sum we will consider all reduced templates with exactly j ⊥'s, and then sum over all possible values of j. First we want to know how many reduced templates there are with j ⊥'s. Note that a template cannot contain only ⊥'s, since then any path associated with it wouldn't contain nodes of H, meaning the minimum length ℓ of a template containing j ⊥'s is j + 1. For a reduced template we only need to pick the locations of the ⊥'s; the other symbols are then determined by their location, i.e., location i will then contain xi. We have that the total number of reduced templates with j ⊥'s is

Σ_{ℓ=j+1}^{k} C(ℓ, j)

It is proven in section 8.11 that k^(j+1) is an upper bound of this sum. Hence there are at most k^(j+1) reduced templates with j ⊥'s, and therefore at most k^O(log k) · k^(j+1) templates τ with j ⊥'s for which Γ′′τ can be non-zero. For each such τ, there are at most d^j feasible paths P associated with τ. Each such path P has a probability of at most 2^(−C(j−1, 2)) of being represented in T. When summing over all j's this results in

E[Γ′′] ≤ Σ_{j=1}^{k−1} k^(j+1)·d^j·k^O(log k)·2^(−C(j−1, 2))
       = k^O(log k) Σ_{j=1}^{k−1} k²d · ( kd / 2^((j−2)/2) )^(j−1)

Similarly to part one of the proof, once j is Θ(log kd) = Θ(log log n) each term is less than 1, so

E[Γ′′] ≤ k^O(log k) · O(log log n) · (kd)^O(log log n)   (8.32)
       = 2^(O((log log n)²))   (8.33)

Conclusion. Combining both upper bounds we can conclude that E[Γ] = n·2^(O((log log n)²)) + 2^(O((log log n)²)) = n·2^(O((log log n)²)).

Lemma 8.5. ∀ε > 0 : f(n) = n·2^(O((log log n)²)) =⇒ f(n) = O(n^(1+ε)).

By lemma 8.5 this implies that E[Γ] = O(n^(1+ε)) for every ε > 0. For a proof of lemma 8.5 see section 8.11. This completes the proof of theorem 8.4.

8.7 Experiments

In order to verify the attack a simulation was carried out on real social network data. The social graph used in the experiments was created by crawling the blogging site LiveJournal.


Figure 8.2: Success rate of the active attack for two given intervals [d0, d1] on the LiveJournal graph. The x-axis shows the size of the constructed subgraph H and the y-axis the probability of uniquely locating H. [BDK07]

Each node in the graph corresponds to a user with a public blog. Users are able to mark other users as their friends, which in turn creates a directed edge in the underlying social graph. Since our focus is on undirected edges we will convert all directed edges to undirected ones. Obviously all the nodes in the crawl are users who have chosen to publish their information publicly on the web, so a selection bias may be present. In total the graph has 4.4 million nodes and 77 million edges, and it exhibits many of the same structural properties as other large online social networks [BDK07]. Anonymization has been simulated by removing all the attributes such as the username, location, age, etc. from the nodes.

The attack succeeds when the adversary can uniquely locate the constructed subgraph H within the LiveJournal social graph. To simulate the attack we randomly create several subgraphs over a range of possible sizes, and check whether we can find a unique match in G. The construction is exactly as explained previously, i.e., during the simulation we randomly select a few nodes to serve as the targeted nodes, and we further include edges to G so that each node xi has the required number ∆i of edges to nodes in G − H. The probability of success is then defined as the number of unique matches divided by the number of attempts. Figure 8.2 shows the success probability for varying values of k, the size of H, and for two different choices of the interval [d0, d1] [BDK07]. These two intervals fall well within the degrees typically found in G.


From figure 8.2 we notice that with k ≥ 7 the attack has a high probability of success. The reason such a low value already results in a successful attack is that, apart from an internal structure test, we also utilize the degree test. In the LiveJournal graph each of the possible Hamiltonian graphs¹ on 7 nodes actually occurs. Hence the degree sequence of H within G is essential in finding a unique match. Note that in theorem 8.1 we base the probability of recovering the constructed subgraph merely on its internal structure. As k = 7 is much lower than the analyzed bound of (2 + δ) log n, the results are not conflicting.

Not only must the adversary be able to recover H, he or she must also be able to do this within an acceptable amount of time. During the analysis we briefly stated that the running time of the algorithm is only a small factor larger than the size of the search tree. Combined with theorem 8.4 this means that the proposed algorithm for recovering the subgraph H is efficient. And indeed, in practice the search tree T is not much larger than the number of nodes u such that d(u) = d(x1), making the recovery fast [BDK07].

8.8 Detectability

We have purposely restricted the degrees of the added nodes to the interval [d0, d1] during the construction phase. If one chooses this interval to match the average degrees found in the social graph, then the constructed subgraph shouldn't stand out based on its degree distribution. We have also assumed that we are working with undirected edges. Based on this one may conjecture that the attack is hard to detect.

In practical settings, however, we commonly have to deal with directed edges. While this would make the active attack easier to execute, it also makes it easier to detect by the data curator. The reason is that the sybil² nodes, i.e., the nodes added by the adversary, will likely have very few incoming edges. This is because most users will have little to no reason to link back to the sybil nodes. In turn this results in the subgraph having many outgoing edges while having almost no incoming edges, which makes it easier to detect. Current research also shows that it's possible to detect sybil attacks on social networks [YGKX08].

¹A Hamiltonian graph is a graph possessing a cycle that visits each node exactly once.
²This comes from the term Sybil attack, an attack in computer security where multiple identities in a system are forged with the goal of exploiting or subverting the system.


8.9 As a Passive Attack

As a final note, we can convert the active attack into a less powerful passive attack. The idea is that, instead of creating a subgraph, users currently participating in the network form a coalition. They then attempt to locate themselves in the released graph. If this is successful it might be possible to re-identify some of their friends.

Such an attack is however of limited effect. A large obstacle is that we cannot target arbitrary users: only users having edges to the subgraph formed by the coalition can potentially be re-identified. This greatly restricts the effectiveness of the attack. On the other hand, it has been found empirically that once a coalition is moderately sized, it can uniquely locate itself and compromise the privacy of at least some users [BDK07]. We also note that forming a coalition need not be hard. An adversary can pay other users for their information, partial information might be retrieved from other public networks, a badly secured account can be hacked, and so on.

8.10 Conclusion

The active attack shows that naive anonymization is flawed and does not assure privacy. On the other hand, the attack requires that the adversary is able to influence the network before any data is released. Moreover, he or she has to be able to create edges to the targeted nodes. This is not always an easy task. Given that in some networks an edge is only created when, say, the targeted node actually accepts a friend request, this could greatly reduce the effect of the active attack. After all, the targeted node will not always accept such a request. Even though the feasibility of the attack can be called into question, it does show that naive anonymization is not an option.

To provide privacy we must find a mathematically rigorous definition of privacy that is applicable to graph data. To satisfy such a definition one will have to change the underlying graph structure, otherwise an active attack remains possible. The search is now on for a definition that provides adequate privacy while also being achievable in practice. This will be our focus in the next chapter.

8.11 Proofs

8.11.1 Lemma 8.5

We first restate the lemma below.

Lemma 8.5. $\forall \varepsilon > 0\colon\ f(n) \in n \cdot 2^{O((\log\log n)^2)} \implies f(n) \in O(n^{1+\varepsilon})$.


Proof. The proof begins by expanding the left-hand side:

\[
\exists c > 0,\ \exists n_0 > 0,\ \forall n > n_0:\quad f(n) \le n \cdot 2^{c(\log\log n)^2} \tag{8.34}
\]

Here the implication can be proven by showing that

\[
\forall \varepsilon > 0,\ \forall c > 0:\quad n \cdot 2^{c(\log\log n)^2} \in O(n^{1+\varepsilon}) \tag{8.35}
\]

This in turn can be proven by using theorem 7.1 and showing that for all positive ε and c it is true that

\[
\lim_{n\to\infty} \frac{n \cdot 2^{c(\log\log n)^2}}{n^{1+\varepsilon}} \in \mathbb{R} \tag{8.36}
\]

Working out this limit can be done as follows:

\[
\lim_{n\to\infty} \left[ \frac{n \cdot 2^{c(\log\log n)^2}}{n^{1+\varepsilon}} \right] = \lim_{n\to\infty} \left[ 2^{c(\log\log n)^2 - \varepsilon\log n} \right] \tag{8.37}
\]

We continue by proving that the exponent in the limit goes to minus infinity.

\[
\lim_{n\to\infty} \left[ c(\log\log n)^2 - \varepsilon\log n \right] = \lim_{n\to\infty} \left[ \log n \left( \frac{c(\log\log n)^2}{\log n} - \varepsilon \right) \right] \tag{8.38}
\]

This can be further worked out by showing that

\begin{align*}
\lim_{n\to\infty} \left[ \frac{c(\log\log n)^2}{\log n} \right] &= c \lim_{n\to\infty} \left[ \frac{2\log\log n}{\log n} \right] \tag{8.39}\\
&= c \lim_{n\to\infty} \left[ \frac{2}{\log n} \right] \tag{8.40}\\
&= c \cdot 0 = 0 \tag{8.41}
\end{align*}

where the first two equations use l'Hôpital's rule. Plugging this result into equation 8.38 we have that, given an ε greater than zero, the exponent in the limit of equation 8.37 goes to minus infinity. In turn this proves that the limit we started with in equation 8.36 goes to zero, which is indeed an element of $\mathbb{R}$.

8.11.2 Lemma 8.6

We prove the following result

Lemma 8.6. For every k and j it holds that

\[
\sum_{\ell=j+1}^{k} \binom{\ell}{j} \;\le\; k^{j+1}
\]

where k and j are natural numbers.

Proof. The proof is by induction. As the base case we will prove the statement is true for all positive k and j = 0.

\begin{align*}
\sum_{\ell=1}^{k} \binom{\ell}{0} &= \sum_{\ell=1}^{k} \frac{\ell!}{0!\,\ell!} \tag{8.42}\\
&= \sum_{\ell=1}^{k} 1 \tag{8.43}\\
&= k \tag{8.44}\\
&\le k^{j+1} \tag{8.45}
\end{align*}

For the inductive step we will assume the statement holds for all k and j = n. We will then prove that it also holds for all k and j = n + 1. To begin we have

\[
\sum_{\ell=n+2}^{k} \binom{\ell}{n+1} \;=\; \sum_{\ell=n+1}^{k} \binom{\ell}{n+1} \;-\; \binom{n+1}{n+1} \tag{8.46}
\]

To reduce this to the induction hypothesis we exploit the following observation:

\begin{align*}
\binom{\ell}{n+1} &= \frac{\ell!}{(n+1)!\,(\ell-n-1)!} \tag{8.47}\\
&= \frac{\ell-n}{n+1} \cdot \frac{\ell!}{n!\,(\ell-n)!} \tag{8.48}\\
&= \frac{\ell-n}{n+1} \cdot \binom{\ell}{n} \tag{8.49}
\end{align*}

Filling this into equation 8.46 gives

\begin{align*}
\sum_{\ell=n+2}^{k} \binom{\ell}{n+1} &= \sum_{\ell=n+1}^{k} \frac{\ell-n}{n+1} \cdot \binom{\ell}{n} \;-\; 1 \tag{8.50}\\
&\le \sum_{\ell=n+1}^{k} \frac{k-n}{n+1} \cdot \binom{\ell}{n} \tag{8.51}\\
&= \frac{k-n}{n+1} \cdot \sum_{\ell=n+1}^{k} \binom{\ell}{n} \tag{8.52}\\
&\le k \cdot k^{n+1} \tag{8.53}\\
&= k^{n+2} \tag{8.54}
\end{align*}

The first inequality is true because ℓ is always less than or equal to k. The second inequality follows from the induction hypothesis and the fact that n ≥ 0. This proves the induction step, and hence by the principle of induction the formula holds for all natural numbers k and j.


Chapter 9

Degree Anonymization

In this chapter we will consider a privacy mechanism called degree anonymization, which attempts to transform the graph in order to guarantee privacy. Both the implementation and evaluation of several algorithms will be discussed in detail. The chapter is based on the paper by Liu and Terzi [LT08].

9.1 Introduction

We make the assumption that the adversary only knows the degree of a targeted node. This can happen when the adversary visits a public profile of the target and can see how many friends the target has. It must be noted that a more powerful adversary can possess more information which can be used to identify an individual. Nevertheless it is an interesting assumption against weak adversaries, or in cases where the network data doesn't contain highly sensitive information and protection against powerful adversaries is not deemed necessary. Admittedly this makes the applicability of degree anonymization rather low. However, it provides good insight into how anonymization can be done, and illustrates the importance of knowing how much auxiliary information the adversary has. Specifically, in chapter 10 we will argue that a powerful adversary can still defeat degree anonymization.

9.2 Anonymity Guarantees

The degree anonymization mechanism is similar to k-anonymity. The goal is to modify the input graph, with all attributes removed, so that each node has the same degree as at least k − 1 other nodes. We will call this privacy definition k-degree anonymity. The definition is based on k-anonymous vectors.


Figure 9.1: On the left is a 4-anonymous graph, and on the right a 2-anonymous graph.

Definition 9.1 (k-anonymous vector). A vector of integers is k-anonymousif every distinct value in it appears at least k times.

For example the vector [5, 5, 3, 3, 2, 2, 2] is 2-anonymous. The definitionof k-degree anonymity now states that the degree sequence of G must bek-anonymous. By first defining k-anonymous vectors it will be easier for usto split the k-degree anonymization algorithm up in two parts. As we shallsee, first we can create a k-anonymous degree sequence, and then constructa graph having this degree sequence that resembles the input graph.
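As a small illustration of definition 9.1, the following Java sketch (a hypothetical helper, not part of the implementation discussed later in this chapter) checks whether a vector is k-anonymous by simply counting how often each distinct value occurs.

```java
import java.util.HashMap;
import java.util.Map;

public class KAnonymousVector {

    /** Returns true if every distinct value in the vector appears at least k times. */
    public static boolean isKAnonymous(int[] vector, int k) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int value : vector) {
            counts.merge(value, 1, Integer::sum);
        }
        for (int count : counts.values()) {
            if (count < k) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] example = {5, 5, 3, 3, 2, 2, 2};
        System.out.println(isKAnonymous(example, 2)); // true
        System.out.println(isKAnonymous(example, 3)); // false: 5 and 3 occur only twice
    }
}
```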

Definition 9.2 (k-degree anonymity). A graph is k-degree anonymous if itsdegree sequence is k-anonymous.

Assuming an adversary only knows the degree of individuals, k-degree anonymity prevents identity disclosure but may not prevent edge disclosure in all cases (this is similar to the homogeneity attack on k-anonymity described in section 3.3.2). An example of a 4-degree anonymous graph and a 2-degree anonymous graph is shown in figure 9.1. Theoretically, transforming the input graph into the complete graph would make it satisfy k-degree anonymity, but would render the graph useless for any study. Moreover, simply copying the graph k − 1 times would also make it satisfy the privacy definition, but clearly provides zero additional privacy. So we want to find a graph satisfying k-degree anonymity without adding or removing nodes, and by using the minimum number of edge modifications on the input graph.

Definition 9.3 (k-degree anonymization). The k-degree anonymization of an input graph G = (V,E) is a graph $\hat{G} = (V, \hat{E})$ that is k-degree anonymous, obtained by modifying the input graph G using the minimum number of insertions or deletions of edges, i.e., it minimizes $|E - \hat{E}| + |\hat{E} - E|$.


The algorithm we describe in this chapter will assume the graph is simple, i.e., the graph is undirected, unweighted and contains no self-loops or multiple edges. From the definition of k-anonymous vectors it's trivial to derive the following claim.

Claim 9.1. If a graph is k1-degree anonymous, then it is also k2-degreeanonymous, for every k2 ≤ k1.

Based on these definitions we will be able to create a graph anonymization algorithm which assures k-degree anonymity. Unfortunately, guaranteeing that only the minimum number of alterations are made to the input graph is a nontrivial problem, and we will have to fall back on approximations.

9.3 Problem Description

Let G = (V,E) be the input graph we want to anonymize. For now we will restrict ourselves to edge additions only. Edge deletions can be simulated by giving the complement of the input graph to the anonymization algorithm: adding an edge in the complement of the graph is identical to removing an edge in the original graph. Algorithm 9.1 shows the details of how to turn an edge addition algorithm into an edge deletion algorithm. The function Anonymization Edge Addition is the anonymization algorithm using only edge additions and returns a k-degree anonymous graph.

Algorithm 9.1 Simulating edge deletions.

1: $\overline{G}$ := the complement of the input graph G
2: $\overline{G}_a$ := Anonymization Edge Addition($\overline{G}$, k)
3: return $G_a$ := the complement of $\overline{G}_a$

An algorithm capable of simultaneously utilizing edge additions and deletions will be considered at the end of this chapter. The goal of the anonymization algorithm we want to design first is formulated in problem 9.1.

Problem 9.1 (Graph Anonymization). Given a graph G = (V,E) and an integer k, find a k-degree anonymous graph $\hat{G} = (V, \hat{E})$ with $E \subseteq \hat{E}$ such that $|\hat{E} - E| = |\hat{E}| - |E|$ is minimized.

This problem always has a solution: the complete graph always satisfies the anonymity constraint, given that a sensible k has been chosen, i.e., k < |V|. Since we need to find a graph $\hat{G}$ that minimizes $|\hat{E}| - |E|$, it will also minimize the L1 distance between the degree sequences of G and $\hat{G}$, since we have the equality


\[
|\hat{E}| - |E| = \frac{1}{2}\, L_1(d, \hat{d}) \tag{9.1}
\]

The equality is true because adding one edge to the graph increases the degrees of two nodes, hence the fraction $\frac{1}{2}$ in the formula. Recall that the $L_1$ distance is defined as follows:

\[
L_1(d, \hat{d}) = \sum_i \bigl|\, d(i) - \hat{d}(i) \,\bigr| \tag{9.2}
\]

Here d stands for the degree sequence of the graph G, and $\hat{d}$ for that of $\hat{G}$. This is a very interesting observation that can be exploited when designing an algorithm. Instead of immediately trying to modify the graph by adding edges, we can first find the degree sequence $\hat{d}$ that is k-anonymous and minimizes the $L_1$ distance. As a second step we can modify the input graph, making sure it has the desired k-anonymous degree sequence. We will continue by formalizing these two subproblems.

Part 1: Degree Anonymization

First, starting from d, we construct a new degree sequence $\hat{d}$ that is k-anonymous and minimizes $L_1(d, \hat{d})$. From equation 9.1 we know that minimizing this distance is equivalent to the requirement in problem 9.1 stating that we must use as few edge additions as possible.

Problem 9.2 (Degree Anonymization). Given d, the degree sequence of graph G = (V,E), and an integer k, construct a k-anonymous sequence $\hat{d}$ by only increasing the degrees of nodes, such that $L_1(d, \hat{d})$ is minimized.

Part 2: Graph Construction

Having the optimal k-anonymous degree sequence $\hat{d}$, the next step consists of constructing a graph with this degree sequence that closely resembles the input graph. If all edges of the input graph are included in the constructed graph, this gives an optimal solution for the k-degree anonymization defined in problem 9.1.

Problem 9.3 (Graph Construction). Given a graph G = (V,E) and a k-anonymous degree sequence $\hat{d}$, construct the graph $\hat{G} = (V, \hat{E})$ such that its degree sequence is equal to $\hat{d}$ and $E \subseteq \hat{E}$.

9.4 Degree Anonymization

Problem 9.2 can be efficiently solved by a dynamic-programming algorithm having a time complexity of $O(n^2)$. We can further optimize this algorithm to obtain a time complexity of $O(nk)$. Hence we can obtain the optimal result within a linear amount of time. First we outline the algorithm running in time $O(n^2)$ because it's easier to explain, and then we will describe how to improve the algorithm.

Since we have restricted ourselves to edge additions only, the degree ofthe nodes can only increase. Furthermore we make the following observation.

Claim 9.2. Let d be the degree sequence of a graph. Then if d(i) = d(j)with i < j, it follows that d(i) = d(i+ 1) = . . . = d(j).

Proof. The proof is by contradiction. Assume that there is a position ℓ with i ≤ ℓ < j such that d(ℓ) > d(ℓ + 1). Because the degree sequence is sorted in non-increasing order, this implies that d(i) > d(j), which leads to a contradiction. Hence there is no position in the subsequence where the degree changes, so all degrees must be equal.

Let DA(d) be the cost of anonymizing the degree sequence d, which is equal to $L_1(d, \hat{d})$ where $\hat{d}$ is the optimal solution as defined in problem 9.2. We will call DA(d) the anonymization cost of a sequence d for a given number k. Furthermore we will let d[i, j] denote the subsequence of d containing the elements d(i), d(i + 1), . . . , d(j). Additionally, I(d[i, j]) is defined as

\[
I(d[i, j]) \;\stackrel{\text{def}}{=}\; \sum_{\ell=i}^{j} \bigl( d(i) - d(\ell) \bigr) \tag{9.3}
\]

which can be seen as the anonymization cost when all nodes i, i + 1, . . . , j are put in the same anonymization group¹ and all degrees are set to d(i).

Using these notations we can derive a set of dynamic programming equations which can be used to solve the Degree Anonymization problem.

For $i < 2k$:
\[
DA(d[1, i]) = I(d[1, i]) \tag{9.4}
\]
For $2k \le i \le n$:
\[
DA(d[1, i]) = \min\Bigl\{\, I(d[1, i]),\ \min_{k \le t \le i-k} \bigl\{ DA(d[1, t]) + I(d[t+1, i]) \bigr\} \Bigr\} \tag{9.5}
\]

The first equation is correct because, when i < 2k, one cannot constructtwo different anonymized groups each of size k. Therefore all nodes in thesubsequence are put into the same anonymization group, and all the degreesare set to d(1).

For i larger or equal to 2k either the complete sequence can be putinto one big anonymization group, or the sequence can be split up into two

1In an anonymization group all nodes have the same degree.


subsequences that are anonymized. The equation is set up such that, whena sequence is split up, both subsequences are guaranteed to have a lengthof at least k. Then the minimum is found over all possible positions wherethe sequences can be split, and also checks if not splitting at all might beoptimal.

Implementing the algorithm is done by first initializing I(d[i, j]) for every 1 ≤ i ≤ j ≤ n, which has a time complexity of $O(n^2)$. Then equation 9.4 is implemented and simply assigns the correct value to every DA(d[1, i]) for i < 2k, taking only $O(k)$ time. Given that k < n, these two preprocessing steps take time at most $O(n^2)$. For the recursive formula 9.5 the algorithm considers every i starting at 2k and ending at n. At every step it calculates the minimum over at most i − 2k + 1 splits. This results in a total running time of $O(n^2)$.
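To make the dynamic program concrete, here is a minimal Java sketch (our own illustration, not the thesis implementation) that computes only the anonymization cost DA(d), assuming the degree sequence is already sorted in non-increasing order. The group cost I(d[i, j]) is evaluated in constant time through prefix sums instead of a precomputed table; recovering the anonymized sequence itself would additionally require remembering the chosen split positions.

```java
public class DegreeAnonymizationCost {

    /**
     * Dynamic program of equations 9.4 and 9.5: returns the minimum L1 distance
     * between d and a k-anonymous sequence obtained by only increasing degrees.
     * The input d must be sorted in non-increasing order.
     */
    public static long anonymizationCost(int[] d, int k) {
        int n = d.length;
        long[] prefix = new long[n + 1];               // prefix[i] = d[0] + ... + d[i-1]
        for (int i = 0; i < n; i++) {
            prefix[i + 1] = prefix[i] + d[i];
        }
        long[] da = new long[n + 1];                   // da[i] = DA(d[1, i]) in 1-based notation
        for (int i = 1; i <= n; i++) {
            if (i < 2 * k) {
                da[i] = groupCost(d, prefix, 1, i);    // equation 9.4: one single group
            } else {
                long best = groupCost(d, prefix, 1, i);            // option: no split at all
                for (int t = k; t <= i - k; t++) {                 // equation 9.5
                    best = Math.min(best, da[t] + groupCost(d, prefix, t + 1, i));
                }
                da[i] = best;
            }
        }
        return da[n];
    }

    /** I(d[i, j]) = sum over l in [i, j] of (d(i) - d(l)), with 1-based indices. */
    private static long groupCost(int[] d, long[] prefix, int i, int j) {
        return (long) (j - i + 1) * d[i - 1] - (prefix[j] - prefix[i - 1]);
    }

    public static void main(String[] args) {
        int[] degrees = {5, 5, 3, 3, 2, 2, 2};         // sorted in non-increasing order
        // Prints 4: an optimal 3-anonymous sequence is [5, 5, 5, 5, 2, 2, 2].
        System.out.println(anonymizationCost(degrees, 3));
    }
}
```

Restricting the inner loop to the window of equation 9.6 below turns this sketch into the O(nk) variant.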

9.4.1 An Improvement

Improving the running time of the algorithm can be done by exploiting thefollowing observation.

Claim 9.3. An anonymization group with a size larger than 2k − 1 can be split into two anonymization groups without increasing the overall anonymization cost.

Proof. Say the original anonymization group would be anonymized so allnodes were given degree `. Then after splitting it into two anonymizationgroups, one can always anonymize them so that in both groups all the nodedegrees are again anonymized to `. This will not change the anonymizationcost and thus completes the proof.

Based on this claim we can change the dynamic programming equations.The first equation stays the same, but for the second equation we nowknow that the outermost minimum function can be removed, so only theinnermost minimum function remains. Furthermore when considering allpossible splits, we know that the second subsequence should have a size ofat most 2k − 1. All this results in the following recursive formula for i’ssatisfying 2k ≤ i ≤ n :

\[
DA(d[1, i]) = \min_{\max\{k,\, i-2k+1\} \,\le\, t \,\le\, i-k} \bigl\{ DA(d[1, t]) + I(d[t+1, i]) \bigr\} \tag{9.6}
\]

Analyzing the time complexity again starts at the initialization steps. To precompute I(d[i, j]) we no longer have to consider all possible pairs of i and j, but can instead focus on subsequences having length at least k and at most 2k − 1. This is done by considering, for every i, only j's such that k ≤ j − i + 1 ≤ 2k − 1. We conclude that the running time of the initialization step drops to $O(nk)$. Secondly, the recursion formula 9.6 needs to consider only $O(k)$ splits for every i, totaling a time complexity of $O(nk)$. This proves the following theorem.

Theorem 9.4. There exists an algorithm solving the Degree Anonymization problem in O(nk) time, which is linear in n for a fixed k.

We conclude that this subproblem is efficiently solvable. The algorithm quickly finds the optimal solution, and implementing it isn't a complex task either. In subsequent sections we will call this algorithm DegreeAnonymization(d, k), where d is a degree sequence and k the privacy parameter. The function returns the optimal k-anonymous degree sequence using only edge additions.

9.5 Graph Construction

9.5.1 Realizability and ConstructGraph

Having a k-anonymous degree sequence the goal is now to construct a graphthat closely resembles the input graph with the given anonymous degreesequence. It is not clear that a graph having the given degree sequence mustexist. After all, not all possible degree sequences have a corresponding simplegraph. We will call a degree sequence that can be achieved by constructinga simple graph realizable.

Definition 9.4 (Realizable). A degree sequence is realizable if and only ifthere exists a simple graph having precisely this degree sequence.

Thanks to Erdos and Gallai [Cho86] we can use the following theoremto test whether a degree sequence is realizable or not.

Theorem 9.5 (Erdos-Gallai Theorem). A degree sequence d, sorted in non-increasing order, is realizable if and only if $\sum_i d(i)$ is even and for every $1 \le \ell \le n - 1$ it holds that

\[
\sum_{i=1}^{\ell} d(i) \;\le\; \ell(\ell - 1) + \sum_{i=\ell+1}^{n} \min\{\ell,\, d(i)\} \tag{9.7}
\]

The first requirement is easy to understand, since the total sum of all the degrees is equal to 2 · |E| and thus always even. The second requirement states that, for every subset of the ℓ highest-degree nodes, we can draw enough edges among these nodes themselves and to the "remaining nodes" such that these nodes can obtain their given degrees. The proof of this theorem is by induction and gives rise to a natural algorithm for constructing a graph with the specified degree sequence [Cho86, MV02]. Algorithm 9.2, called ConstructGraph, shows the details of this algorithm.
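To make the test concrete, the following Java sketch (an illustrative helper, not code from the thesis implementation) evaluates the Erdos-Gallai conditions directly; it assumes the degree sequence is sorted in non-increasing order and simply checks inequality 9.7 for every prefix of the sequence.

```java
public class ErdosGallai {

    /** Tests whether a degree sequence, sorted in non-increasing order, is realizable
     *  by a simple graph, by checking the parity condition and inequality 9.7. */
    public static boolean isRealizable(int[] d) {
        int n = d.length;
        long total = 0;
        for (int degree : d) {
            total += degree;
        }
        if (total % 2 != 0) {
            return false;                              // the sum of all degrees must be even
        }
        for (int l = 1; l <= n; l++) {
            long left = 0;
            for (int i = 0; i < l; i++) {
                left += d[i];                          // sum of the l highest degrees
            }
            long right = (long) l * (l - 1);           // edges among the l highest-degree nodes
            for (int i = l; i < n; i++) {
                right += Math.min(l, d[i]);            // edges towards the remaining nodes
            }
            if (left > right) {
                return false;                          // inequality 9.7 is violated for this l
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isRealizable(new int[] {3, 3, 3, 3, 2, 2})); // true
        System.out.println(isRealizable(new int[] {5, 1, 1, 1}));       // false
    }
}
```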


First, ConstructGraph initializes the graph to be constructed by creating n isolated nodes. During each iteration it picks a random node and adds edges such that the selected node obtains the required degree. This is done by storing the residual degree of every node during the execution of the algorithm. Basically, the residual degree stands for the number of edges that still have to be added to the node in question for it to obtain the required degree. The randomly picked node is connected to the nodes with the highest residual degrees, following the inductive proof of the Erdos-Gallai theorem. If it turns out the degree sequence is not realizable, the algorithm will give some of the nodes a negative residual degree and will then return "No". On the other hand, once all the nodes have a residual degree of zero, the graph has been constructed and is returned.

Algorithm 9.2 The ConstructGraph algorithm.

Input: A degree sequence d of length n.
Output: A graph G = (V,E) with the given degree sequence d, or "No" if d is not realizable.

1: V := {1, . . . , n}, E := ∅
2: if ∑_i d(i) is odd then
3:     return "No"
4: while true do
5:     if ∃i : d(i) < 0 then
6:         return "No"
7:     if ∀i : d(i) = 0 then
8:         return G = (V,E)
9:     Pick a random node v with d(v) > 0
10:    W := the d(v)-highest entries in d other than v
11:    Set d(v) := 0
12:    for each node w ∈ W do
13:        E := E ∪ {(v, w)}
14:        d(w) := d(w) − 1
15:    end for
16: end while

Note that algorithm 9.2 is an oracle for the realizability property of adegree sequence. It will return “No” if and only if the degree sequence isnot realizable.

To analyze the running time of ConstructGraph, let n be the number of nodes and $d_{max} = \max_i d(i)$. One can store all the residual degrees in a hash table with the key being the residual degree. When the residual degree of a node changes, updating the hash table takes only constant time. Also note that checking whether a residual degree is negative can be done while updating it, so we can avoid having to check all the values each iteration. Detecting whether all residual degrees are zero can be done by keeping track of the currently highest residual degree. By using the current maximum residual degree we can also efficiently find the current d(v)-highest residual degrees. Finally, a node is updated at most $d_{max}$ times, giving an overall time complexity of $O(n \cdot d_{max})$.

A possible variation of the algorithm picks the highest-degree node at each iteration. This would result in a topology having a very dense core. On the other hand, starting with the lowest-degree node would result in a very sparse core. The middle ground is obtained by picking a random node. All these variations still have the same time complexity.

9.5.2 SuperGraph: Resemblance to the Input Graph

If one were to use ConstructGraph to construct a graph from a k-anonymized degree sequence of the input graph, the result could differ significantly from the given input graph. This is not what we want. What we need is an algorithm that finds a solution to the Graph Construction problem. That is, the constructed graph $\hat{G} = (V, \hat{E})$ must satisfy the constraint $E \subseteq \hat{E}$, where G = (V,E) is the input graph. For this we first introduce a new definition.

Definition 9.5 (Realizable Subject to a Graph). Given an input graph G = (V,E), a degree sequence d is realizable subject to G if and only if there exists a simple graph $\hat{G} = (V, \hat{E})$ having precisely this degree sequence d and with $E \subseteq \hat{E}$.

Given the above definition one has the following variation of the Erdos-Gallai Theorem [LT08].

Theorem 9.6. Given a degree sequence $\hat{d}$ and a graph G = (V,E) with degree sequence d, let

1. $a = \hat{d} - d$,

2. $V_\ell$ be the ordered set of $\ell$ nodes containing the $\ell$ largest $a(i)$ values, sorted in decreasing order,

3. $d_\ell(i)$ be the degree of node i in the input graph G when counting only edges in G that connect node i to one of the nodes in $V_\ell$.

Then, if $\hat{d}$ is realizable subject to G, $\sum_i a(i)$ is even and for every $1 \le \ell \le n - 1$ it holds that

\[
\sum_{i \in V_\ell} a(i) \;\le\; \sum_{i \in V_\ell} \bigl(\ell - 1 - d_\ell(i)\bigr) \;+\; \sum_{i \in V - V_\ell} \min\bigl\{\ell - d_\ell(i),\ a(i)\bigr\} \tag{9.8}
\]

Notice that for every pair of nodes (u, v) with $u \in V_\ell$ and $v \in V - V_\ell$ it's always true that a(u) ≥ a(v), and that $|V_\ell| = \ell$. It is important to note that the theorem only gives an implication. It gives no guarantee that if the formula is true, the degree sequence is realizable subject to the graph. In other words, the theorem only gives a necessary condition; it might not be a sufficient one. The authors have not published a proof [LT08]. Nonetheless, to gain some insight into the theorem, it's helpful to convince yourself that the following two inequalities always hold:

\[
\forall i:\; d_\ell(i) \le \ell \tag{9.9}
\]
\[
\forall i \in V_\ell:\; d_\ell(i) \le \ell - 1 \tag{9.10}
\]

These make sure that the terms in the sums are never negative.

Even though theorem 9.6 gives no sufficient condition, we can still design an algorithm similar to ConstructGraph that will attempt to construct a graph that is a supergraph of the input graph and has the desired degree sequence. It first checks whether inequality 9.8 holds for all appropriate ℓ. If it doesn't, it returns "No", signaling that no such supergraph exists. Otherwise it will attempt to find the supergraph and return it, or return the value "Unknown", signaling that it didn't find a supergraph but one might nevertheless exist.

Algorithm 9.3 The SuperGraph algorithm.

Input: A degree sequence $\hat{d}$ of length n and a graph G = (V,E).
Output: A graph $\hat{G} = (V, \hat{E})$ with degree sequence $\hat{d}$ and $E \subseteq \hat{E}$; "Unknown" if no supergraph was found; "No" if $\hat{d}$ is not realizable subject to G.

1: a := $\hat{d}$ − d, with d the degree sequence of G
2: if ∑_i a(i) is odd or equation 9.8 doesn't hold then
3:     return "No"
4: $\hat{E}$ := E
5: while true do
6:     if ∃i : a(i) < 0 then
7:         return "Unknown"
8:     if ∀i : a(i) = 0 then
9:         return $\hat{G} = (V, \hat{E})$
10:    Pick a random node v with a(v) > 0
11:    W := the a(v)-highest entries in a other than v, ignoring nodes that are already connected to v in $\hat{G}$
12:    Set a(v) := 0
13:    for each node w ∈ W do
14:        $\hat{E}$ := $\hat{E}$ ∪ {(v, w)}
15:        a(w) := a(w) − 1
16:    end for
17: end while

Algorithm 9.3 gives the pseudocode of SuperGraph. The algorithm works on the residual degrees $a = \hat{d} - d$. These are the degrees that must be added in order to obtain the desired degree sequence $\hat{d}$. First it checks whether the conditions of theorem 9.6 are met; if not, it outputs "No". Otherwise it attempts to add edges by first picking a random node v having a(v) > 0. It then adds edges to the a(v) highest residual degree nodes, while ignoring nodes that are already connected to v in $\hat{G}$. If not enough nodes can be found satisfying this condition, the algorithm will return "Unknown". The remaining details of the algorithm are similar to ConstructGraph. If the algorithm returns "Unknown" it doesn't mean that a graph satisfying the required conditions doesn't exist, it merely means that our algorithm couldn't find it.

Figure 9.2: Example graph to be 2-degree anonymized.

Concerning the time complexity, we can use the same data structures as in ConstructGraph. With $a_{max} = \max_i a(i)$ this gives a time complexity of $O(n \cdot a_{max})$. Here we assume that checking whether an edge already exists can be done in constant time. This is possible when using appropriate data structures, e.g., a hash table containing the existing edges.
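One way to obtain such constant-time edge lookups, for instance, is to store every undirected edge under a canonical key. The sketch below (our own illustration; the actual implementation may use a different representation) packs the smaller and larger endpoint into a single long and keeps the keys in a HashSet.

```java
import java.util.HashSet;
import java.util.Set;

public class EdgeSet {
    private final Set<Long> edges = new HashSet<>();

    /** Canonical key of an undirected edge: the smaller endpoint goes in the high bits. */
    private static long key(int u, int v) {
        int a = Math.min(u, v);
        int b = Math.max(u, v);
        return ((long) a << 32) | (b & 0xFFFFFFFFL);
    }

    /** Adds the undirected edge (u, v) in O(1) expected time. */
    public void add(int u, int v) {
        edges.add(key(u, v));
    }

    /** Checks in O(1) expected time whether the undirected edge (u, v) exists. */
    public boolean contains(int u, int v) {
        return edges.contains(key(u, v));
    }
}
```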

There is one subtlety we haven't addressed yet, which is best explained using an example. Say we want to 2-degree anonymize the graph shown in figure 9.2. The degree sequence d of the graph is [3, 3, 3, 2, 2, 1]. The 2-degree anonymization $\hat{d}$ is [3, 3, 3, 3, 2, 2]. The vector of residual degrees a becomes $\hat{d} - d$ = [0, 0, 0, 1, 0, 1]. Adding an edge between the two nodes having residual degree 1 could complete the anonymization (if this edge doesn't exist already). But we have a problem: we no longer know which node a residual degree belongs to! The two residual degrees of 1 are located at indices 3 and 5 in the vector a. However, these indices do not correspond to node identifiers, as can be clearly seen in figure 9.2. Fortunately this can easily be solved by deviating slightly from the formal definition of a degree sequence and pairing each degree with its node. The vector a then contains pairs (b, v), where b is the residual degree and v represents the node. Slightly abusing notation, in our example a would be equal to

[3, 3, 3, 3, 2, 2]− [(3, 1), (3, 2), (3, 4), (2, 3), (2, 5), (1, 0)]

= [(0, 1), (0, 2), (0, 4), (1, 3), (0, 5), (1, 0)]

Now we do know between which nodes an edge could be added. Based on the calculated value of a, it would mean adding the edge (0, 3).


And indeed, adding (0, 3) makes the graph 2-degree anonymous. However, adding this edge might come as a surprise: visual inspection of figure 9.2 clearly shows that adding the edge (0, 5) also makes the graph 2-degree anonymous. Thus it's possible that there are multiple options that anonymize the graph and minimize the anonymization cost. Depending on how we order nodes with the same degree, different solutions can be found. If the nodes were sorted differently, say in decreasing order of their node id:

[3, 3, 3, 3, 2, 2]− [(3, 4), (3, 2), (3, 1), (2, 5), (2, 3), (1, 0)]

= [(0, 4), (0, 2), (0, 1), (1, 5), (0, 3), (1, 0)]

we see the other solution pops up which adds the edge (0, 5).

A New Improvement?

In general, the way we order nodes with the same degree can not only produce different solutions; sometimes it makes the difference between finding and not finding a solution. For example, we can sort nodes with the same degree in decreasing order of their node id, which could result in the following sequence

[(3, 4), (3, 2), (3, 1), (2, 5), (2, 3), (1, 0)]

Or we could sort them in increasing order, giving rise to the sequence

[(3, 1), (3, 2), (3, 4), (2, 3), (2, 5), (1, 0)]

How we sort nodes having the same degree can impact the realizability of a degree sequence subject to the input graph. Therefore one might propose the following improvement: the order in which nodes with the same degree are put in the degree sequence should be randomized. The ConstructGraph algorithm is then run several times so that multiple orderings are tested. Unfortunately, after several tests on different graphs, we concluded that the performance increase of this modification is very low.

9.5.3 The Probing Scheme

The advantage of SuperGraph is that if it returns a graph, we know that the minimum number of edge additions was made. The disadvantage is that the algorithm doesn't always return a graph (the same is true for ConstructGraph). This is not what we want: we want to return a k-degree anonymous graph that closely resembles the input graph. If this requires slightly more edge additions than calculated by DegreeAnonymization, this is a penalty we must pay. After all, releasing no graph whatsoever is not an option.

A simple solution, called the probing scheme, can be used. If no graph was found, it slightly changes the degree sequence of the input graph G, recomputes the k-anonymous degree sequence $\hat{d}$, and then runs the SuperGraph algorithm on the new degree sequence $\hat{d}$. It continues doing this until a graph has been found. Although rather simple, it has been claimed to work fairly well in practice. We will extensively study the effectiveness of the probing scheme in section 9.7.2. The details are outlined in algorithm 9.4.

Algorithm 9.4 The Probing scheme.

Input: A graph G = (V,E) and an integer k.
Output: A k-degree anonymous graph $\hat{G} = (V, \hat{E})$ with $E \subseteq \hat{E}$.

1: d := the degree sequence of G
2: $\hat{d}$ := DegreeAnonymization(d, k)
3: $\hat{G}$ := SuperGraph($\hat{d}$, G)
4: while $\hat{G}$ is not a graph do
5:     d := AddRandomNoise(d)
6:     $\hat{d}$ := DegreeAnonymization(d, k)
7:     $\hat{G}$ := SuperGraph($\hat{d}$, G)
8: end while
9: return $\hat{G}$

The most important step is the AddRandomNoise function. Its precise implementation determines how well the probing scheme works, and whether it will work at all. First, it must add random noise in such a manner that eventually a graph will always be found. Adding noise at a random location, while ensuring that no degree exceeds |V| − 1, achieves this requirement: eventually the degree sequence d will converge to that of the complete graph, which is always a solution to the anonymization problem. We can conclude that the probing scheme always halts.

Secondly, the question is where noise is best added. A good heuristic is to increase the degree of low-degree nodes. The reasoning behind this is that during the degree anonymization of the degree sequence, nodes having a high degree are put into the same anonymization group. Rarely will these nodes have the same degree, forcing the anonymization step to increase the degrees of the other nodes in the group by a possibly large amount. In turn, this potentially large increase must be compensated by increasing the degrees of low-degree nodes, which the degree anonymization algorithm does not guarantee. Therefore the AddRandomNoise function should be biased towards increasing the degree of low-degree nodes.

Now, the degree sequence of most graphs has many low-degree entries at the tail of the sequence. So when picking the node whose degree we want to increase, simply picking a position uniformly at random gives the desired bias towards increasing low-degree nodes. The resulting implementation is shown in algorithm 9.5.

Algorithm 9.5 The AddRandomNoise function.

Input: A degree sequence d of length n.
Output: A slightly modified degree sequence.

1: pos := a position picked from [1, n] uniformly at random
2: delta := a number picked from [1, maxDelta] uniformly at random
3: d(pos) := Min(n − 1, d(pos) + delta)
4: return Sort(d)

Deciding by how much to increase the selected degree is also a crucial step. Setting this value too low could greatly increase the running time of the algorithm (in algorithm 9.5 this value is maxDelta). On the other hand, picking too large a value could lead to unnecessarily many edge additions, and would produce a result far from optimal. In our implementation we manually selected an appropriate value for each graph. It might be possible to automatically select the value by inspecting the highest, lowest and average degree.

It's regrettable that these details, which all influence the performance of the algorithm, are not discussed by Liu and Terzi in their paper.

9.5.4 Relaxed Graph Construction using GreedySwap

The SuperGraph algorithm creates a k-degree anonymous graph such that $E \subseteq \hat{E}$. This can be a strong requirement: we're actually looking for a graph that looks like the input graph, and it doesn't have to be a supergraph of the input graph. So the real goal is to find a graph such that $\hat{E} \approx E$. Since not all edges of E need to be present in the anonymous graph, this implicitly allows both edge additions and deletions. We will return to simultaneous edge additions and deletions in the next section.

To avoid making multiple modifications to the algorithm at once, we will still use the DegreeAnonymization algorithm designed only for edge additions. Later on we will modify it to also allow edge deletions. So assume that the algorithm DegreeAnonymization returns a k-anonymous degree sequence $\hat{d}$ and that it is realizable, such that ConstructGraph returns a graph $\hat{G}_0$. The goal of the utility function called GreedySwap is, given the input graph G and a k-degree anonymous graph $\hat{G}_0$, to transform $\hat{G}_0$ so that it resembles the input graph as closely as possible, without modifying its degree sequence. The result is an anonymous graph that better resembles the input graph. Put differently, GreedySwap transforms $\hat{G}_0$ into $\hat{G} = (V, \hat{E})$ which is still k-anonymous and with $\hat{E} \approx E$.

Because the degree sequence of $\hat{G}_0$ is not allowed to change, we can only apply transformations that do not change the degree of any node in the graph. Such modifications are called valid swaps.

Definition 9.6 (Valid Swap). Given a graph $\hat{G}_i = (V, \hat{E}_i)$, a valid swap starts from nodes $i, j, k, l \in V$ such that $(i, k) \in \hat{E}_i$ and $(j, l) \in \hat{E}_i$. Two valid swaps might be possible, depending on the following conditions:



Figure 9.3: Illustration of valid swap operations [LT08]

If $(i, j) \notin \hat{E}_i$ and $(k, l) \notin \hat{E}_i$:
\[
\hat{E}_{i+1} := \hat{E}_i - \{(i, k), (j, l)\} \cup \{(i, j), (k, l)\} \tag{9.11}
\]
If $(i, l) \notin \hat{E}_i$ and $(j, k) \notin \hat{E}_i$:
\[
\hat{E}_{i+1} := \hat{E}_i - \{(i, k), (j, l)\} \cup \{(i, l), (j, k)\} \tag{9.12}
\]

The valid swaps given in the previous definition are illustrated in figure 9.3. Importantly, valid swaps can change the structure of the graph, but they do not change its degree sequence.

GreedySwap now finds valid swaps with the maximum increase in edge intersection, i.e., it finds a swap such that
\[
|E \cap \hat{E}_{i+1}| - |E \cap \hat{E}_i|
\]
is maximized, and continues until no more valid swaps are found. The utility function FindMaxSwap is used to find precisely these valid swaps. It returns the increase in edge intersection and the swap belonging to this increase. As shown in algorithm 9.6, GreedySwap will greedily keep swapping edges until there are no more valid swaps that can increase the size of the edge intersection.

Using GreedySwap it's now possible to implement our original idea: given the input graph G, we find the k-anonymous degree sequence, construct a graph with this degree sequence using ConstructGraph, and finally transform this graph so that it resembles the input graph. In case the degree sequence $\hat{d}$ is not realizable, we can invoke a probing scheme identical to the one we have described in algorithm 9.4. The final process is given in algorithm 9.7.

Algorithm 9.6 The GreedySwap algorithm.

Input: The input graph G = (V,E) and an initial graph $\hat{G}_0 = (V, \hat{E}_0)$.
Output: A k-degree anonymous graph $\hat{G} = (V, \hat{E})$ with $\hat{E} \approx E$.

1: $\hat{G}(V, \hat{E})$ := $\hat{G}_0(V, \hat{E}_0)$
2: (c, (e1, e2, e′1, e′2)) := FindMaxSwap($\hat{G}$)
3: while c > 0 do
4:     $\hat{E}$ := $\hat{E}$ − {e1, e2} ∪ {e′1, e′2}
5:     (c, (e1, e2, e′1, e′2)) := FindMaxSwap($\hat{G}$)
6: end while
7: return $\hat{G}$

Algorithm 9.7 The RelaxedConstruction algorithm.

Input: A graph G = (V,E) and an integer k.
Output: A k-degree anonymous graph $\hat{G} = (V, \hat{E})$ with $\hat{E} \approx E$.

1: d := the degree sequence of G
2: $\hat{d}$ := DegreeAnonymization(d, k)
3: $\hat{G}_0$ := ConstructGraph($\hat{d}$)
4: while $\hat{G}_0$ is not a graph do
5:     d := AddRandomNoise(d)
6:     $\hat{d}$ := DegreeAnonymization(d, k)
7:     $\hat{G}_0$ := ConstructGraph($\hat{d}$)
8: end while
9: return $\hat{G}$ := GreedySwap($\hat{G}_0$, G)

To calculate the time complexity we will ignore the probing scheme, since it's hard to find usable upper bounds for it. However, remember that it converges to the complete graph and hence always terminates. A naive implementation of FindMaxSwap would result in a time complexity of $O(|E|^2) = O(n^4)$. Letting I be the number of iterations the GreedySwap algorithm needs, the total time complexity would be $O(I \cdot n^4)$. This is unacceptable for large input graphs. Instead we will sample only $O(\log |E|)$ edges in the FindMaxSwap utility function. As a result we are no longer guaranteed that it will find the real maximum swap, but it still gives good results in practice. Also note that GreedySwap is by itself already an approximation algorithm. The running time becomes $O(I \log^2 n)$.
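As an illustration of what FindMaxSwap has to evaluate for a single candidate pair of edges, the following Java sketch (a hypothetical helper, not the thesis code) computes the gain in edge intersection for the first swap of definition 9.6, assuming edges are stored under canonical long keys as in the earlier EdgeSet sketch and that i, j, k, l are four distinct nodes; sampling candidate pairs and keeping the best gain then yields an approximate FindMaxSwap.

```java
import java.util.Set;

public class SwapGain {

    /** Canonical key of an undirected edge, identical to the earlier EdgeSet sketch. */
    private static long key(int u, int v) {
        int a = Math.min(u, v);
        int b = Math.max(u, v);
        return ((long) a << 32) | (b & 0xFFFFFFFFL);
    }

    /**
     * Gain in |E ∩ Ê| when the edges (i,k) and (j,l) of the current graph Ê are
     * replaced by (i,j) and (k,l), as in equation 9.11. Assumes (i,k) and (j,l)
     * are edges of the current graph and that i, j, k, l are four distinct nodes.
     * Returns Integer.MIN_VALUE when the swap is not valid.
     */
    public static int gain(Set<Long> originalE, Set<Long> currentE,
                           int i, int j, int k, int l) {
        if (currentE.contains(key(i, j)) || currentE.contains(key(k, l))) {
            return Integer.MIN_VALUE;                  // the new edges may not exist already
        }
        int delta = 0;
        if (originalE.contains(key(i, j))) delta++;    // added edge also present in the input graph
        if (originalE.contains(key(k, l))) delta++;
        if (originalE.contains(key(i, k))) delta--;    // removed edge was present in the input graph
        if (originalE.contains(key(j, l))) delta--;
        return delta;
    }
}
```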

An important comment remains: similar to the SuperGraph algorithm, it matters how we decide which nodes in the graph returned by ConstructGraph correspond to which nodes in the input graph. Again, this can be solved by slightly deviating from the formal definition of a degree sequence and pairing each degree with its node. This pairing can then be used in the ConstructGraph algorithm to create the mapping as specified by the degree sequence.

9.5.5 SimultaneousSwap: Edge Additions and Deletions

Finally we move to our algorithm allowing both edge additions and deletions at all steps. First we must modify the DegreeAnonymization algorithm. Surprisingly, it's very easy to include edge deletions and additions at the same time. All we need to update is I(d[i, j]), denoting the anonymization cost when all nodes i, i + 1, . . . , j are put in the same anonymization group. It's now equal to

\[
I(d[i, j]) \;\stackrel{\text{def}}{=}\; \sum_{\ell=i}^{j} \bigl|\, d^* - d(\ell) \,\bigr|
\]

where $d^*$ is

\[
d^* \;\stackrel{\text{def}}{=}\; \arg\min_{t} \sum_{\ell=i}^{j} \bigl|\, t - d(\ell) \,\bigr|
\]

The value of d∗ has been proven to be equal to the median of the valuesd(i), . . . , d(j) [Lee95]. Thus calculating I(d[i, j]) can be done in lineartime. The recursion equations stay exactly the same. So when using the im-provement explained in section 9.4.1 the time complexity of calculating theoptimal k-anonymous degree sequence is O(nk). We will call this algorithmOptimalDegreeAnon.
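A short sketch of this modified group cost (again our own illustration): because the median minimizes the sum of absolute deviations, the only change with respect to the addition-only version is which target degree the group is moved to. The helper below sorts its input to stay self-contained; inside the dynamic program, where the sequence is already sorted, the median is simply the middle element of the slice.

```java
import java.util.Arrays;

public class MedianGroupCost {

    /** I(d[i, j]) when both edge additions and deletions are allowed: every degree
     *  in the group is moved to a median of the group, which minimizes the L1 cost. */
    public static long groupCost(int[] group) {
        int[] sorted = group.clone();
        Arrays.sort(sorted);                       // inside the DP the slice is already sorted
        int median = sorted[sorted.length / 2];
        long cost = 0;
        for (int degree : group) {
            cost += Math.abs(degree - median);
        }
        return cost;
    }

    public static void main(String[] args) {
        // Degrees {7, 3, 2}: moving all of them to the median 3 costs |7-3| + |3-3| + |2-3| = 5.
        System.out.println(groupCost(new int[] {7, 3, 2}));
    }
}
```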

Combined with a probing scheme, ConstructGraph and GreedySwap, this results in algorithm 9.8. First it uses the probing scheme to construct a k-degree anonymous graph. Then it transforms the graph to more closely resemble the input graph. We call it the SimultaneousSwap algorithm.


Algorithm 9.8 The SimultaneousSwap algorithm.

Input: A graph G = (V,E) and an integer k.
Output: A k-degree anonymous graph $\hat{G} = (V, \hat{E})$ with $\hat{E} \approx E$.

1: d := the degree sequence of G
2: $\hat{d}$ := OptimalDegreeAnon(d, k)
3: $\hat{G}_0$ := ConstructGraph($\hat{d}$)
4: while $\hat{G}_0$ is not a graph do
5:     d := AddRandomNoise(d)
6:     $\hat{d}$ := OptimalDegreeAnon(d, k)
7:     $\hat{G}_0$ := ConstructGraph($\hat{d}$)
8: end while
9: return $\hat{G}$ := GreedySwap($\hat{G}_0$, G)

9.6 Experiments and Implementation: Prologue

All algorithms are implemented in Java and accompanied by appropriate unit tests. This should assure that the algorithms function correctly. Although other papers on anonymization tend to use Python, we have chosen Java because of its speed advantages. This allows us to test the algorithms on large graphs within an acceptable amount of time.

9.6.1 Datasets

To test the anonymization algorithms several datasets have been used. Wewill briefly describe how we obtained them and how they were created.

Enron graph

During the Enron² scandal, the Federal Energy Regulatory Commission released all the emails that had been sent between employees of Enron. From these correspondences a social graph has been created. A link is introduced between two people only if both have sent emails to each other, and if in total they have sent more than 5 messages to each other.

A cleaned MySql database created by Shetty and Abidi [SA04] was usedin our experiments. From this we manually derived the social graph. At thetime of writing the dataset was available at http://www.isi.edu/~adibi/Enron/Enron.htm.

Powergrid graph

In this graph, the nodes represent generators, transformers and substations in a powergrid network, and the edges represent high-voltage transmission lines between them [LT08]. A visualization of the dataset is shown in figure 9.4.

2The Enron Corporation was an American energy company.


Figure 9.4: Powergrid graph visualization by AT&T Research (Pajek/USpowerGrid, 4941 nodes, 6594 edges).

At the time of writing the dataset was available at http://www.cise.ufl.edu/research/sparse/matrices/Pajek/USpowerGrid.html as a matrix saved in Matlab format. A small utility function was written to convert the file into a format our own application could better handle.

Facebook100 dataset

In a study on the social structure of Facebook networks, Traud, Mucha and Porter were given an anonymized dataset directly by Adam D'Angelo of Facebook [TMP11]. It contains the friendship graphs of 100 American colleges and universities as they existed in September 2005. Even if users didn't publish their friend list on Facebook at the time, this dataset does contain all their friends! Thus it's a very accurate graph representing a real social network. Later on the Facebook Data Team requested that researchers no longer distribute the dataset. However, copies of it still float around on the internet. Given that this graph contains invaluable information, we still decided to use it. This also allows us to test the algorithms on real social networks, rather than on publicly scraped or randomly generated ones.

For our analysis we picked three graphs: the Texas graph, the Berkeley graph and the American graph. Later on the Caltech and Bowdoin graphs were also included. These are two smaller graphs that will be used when analyzing more computationally demanding algorithms. See table 9.1 for basic properties of these graphs. Looking at the number of edges of the first three mentioned graphs, they are 10 to 100 times bigger than the largest graphs used in the tests by Liu and Terzi, and even though the Caltech and Bowdoin graphs are of a more moderate size, they're still 2 to 8 times bigger.

Co-authors graph

From the DBLP Computer Science Bibliography it's possible to derive the so-called co-authors graph. In this graph each node represents an author, and an edge is drawn between authors that have co-authored a paper. This is a very large graph containing more than 1 million nodes and more than 3.6 million edges. This makes it too big to be used to evaluate the more computationally demanding algorithms. However, when possible, a limited number of tests were also performed on this graph.

An XML file containing the dataset was downloaded from http://www.informatik.uni-trier.de/~ley/db/. The file is almost 1 GB in size and had to be parsed using a SAX parser to avoid running out of memory.
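A minimal sketch of such a streaming parse with Python's xml.sax is shown below; the element names follow the public DBLP dump, and depending on the parser configuration the accompanying DTD may be needed to resolve named character entities.

import xml.sax
from itertools import combinations

PUBLICATION_TAGS = {"article", "inproceedings", "proceedings", "book",
                    "incollection", "phdthesis", "mastersthesis", "www"}

class CoAuthorHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.authors, self.text, self.edges = [], "", set()

    def startElement(self, name, attrs):
        if name in PUBLICATION_TAGS:
            self.authors = []
        self.text = ""

    def characters(self, content):
        self.text += content

    def endElement(self, name):
        if name == "author":
            self.authors.append(self.text.strip())
        elif name in PUBLICATION_TAGS:
            # Connect every pair of authors of this publication.
            self.edges.update(frozenset(p) for p in combinations(set(self.authors), 2))

handler = CoAuthorHandler()
xml.sax.parse("dblp.xml", handler)  # streams the file instead of loading it into memory
print(len(handler.edges), "co-author edges")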

Graph Name   #Nodes     #Edges
Enron        151        502
Caltech      768        16,656
Bowdoin      2,251      84,387
Powergrid    4,939      6,594
American     6,386      217,662
Berkeley     22,937     852,444
Texas        36,371     1,590,655
Co-authors   1,048,435  3,663,990

Table 9.1: Basic graph statistics.

9.6.2 Testing Randomized Algorithms

It’s very important to realize that most of our algorithms are randomized.For example, ConstructGraph picks a random node at every iteration. Asanother example, during the probing scheme a random amount of noise isadded to a random location in the degree sequence. This complicates ouranalysis of the performances of the algorithms, since the output can changeevery run. Strangely Lui and Terzi don’t mention how they dealt with thisinherent randomness, raising questions how they performed their analysis.Either they reported the average value over an unknown number of runs, oronly ran the algorithm once.

Only reporting the average value returned by a randomized algorithm actually gives us little information about its performance! [Sha] This is easily demonstrated by considering two randomized algorithms that have


Figure 9.5: This example shows the fictive response times of two servers for 100 requests. Both servers have the same average response time, but a different variance. This shows that reporting only the average is not sufficient. [Sha]

the same average return value. The problem is that the variances of both algorithms can differ significantly, and the algorithm with the lower variance performs better: its results are more predictable. A visual example is shown in figure 9.5, where both samples have the same average but a different variance.

During our tests we will report both the average and the variance over a selected number of runs. The number of runs is chosen such that the total running time remains within practical limits. This gives a much better understanding of the performance of an algorithm than just reporting the average value. On the other hand, if the variance is low, we will omit it in the graphs.
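Concretely, the measurement loop we have in mind is as simple as the following sketch, where run_algorithm stands for any of the randomized algorithms returning a cost value.

import statistics

def evaluate(run_algorithm, runs=20):
    # Execute the randomized algorithm several times and report
    # both the average and the variance of the returned values.
    results = [run_algorithm() for _ in range(runs)]
    return statistics.mean(results), statistics.pvariance(results)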

9.7 Individual Algorithm Evaluation

First we look at how the algorithms behave on their own. In particular this evaluates the influence of the random steps taken in the algorithms, if they contain any at all.

9.7.1 Degree Anonymization

The Degree Anonymization algorithm was first implemented in Python and later on ported to Java. The reason for switching to Java was to increase the speed of the algorithm, which made it possible to test it on larger and more realistic graphs within an acceptable amount of time.

Both DegreeAnonymization and OptimalDegreeAnon always return the


optimal result. Nevertheless it is interesting to know what the anonymization cost is for certain graphs. Figure 9.6 shows the L1-distance between the original degree sequence and that obtained by DegreeAnonymization and OptimalDegreeAnon. We immediately see that allowing both edge deletions and additions greatly reduces the anonymization cost.

In figure 9.7 the anonymization cost when running the OptimalDegreeAnon algorithm is shown for the American and Powergrid graphs. This illustrates that some graphs are easier to anonymize than others.
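The anonymization cost shown in these figures is simply the L1-distance between two degree sequences, which can be computed as in the following sketch (both sequences are assumed to list the degrees in the same node order).

def l1_distance(original_degrees, anonymized_degrees):
    # Sum of absolute differences between corresponding degrees.
    return sum(abs(a - b) for a, b in zip(original_degrees, anonymized_degrees))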

9.7.2 SuperGraph using Probing Scheme

We continue by evaluating the performance of the SuperGraph algorithm using the probing scheme. The precise algorithm is shown in algorithm 9.4 with maxDelta set to 10. First the L1 distance between d and d′ is measured, where d is the degree sequence returned by DegreeAnonymization and d′ the degree sequence of the graph returned by algorithm 9.4. In effect this measures how much the probing scheme, and in particular the implementation of AddRandomNoise, has to modify the degree sequence in order for it to be realizable with respect to the input graph. For each value of k the algorithm was run 20 times on the input graph. Figure 9.8 shows the average L1 distance as a full line; to visualize the variance, the dotted line shows the variance added to the average. A low variance means both lines are close to each other, while a high variance increases the distance between them.

For the Enron and Powergrid graphs we notice the anonymization parameter k has little effect on the result. If we look at the Caltech and Bowdoin graphs we see that k does influence the result: the higher k, the more the solution differs from the optimal one. In turn this means the probing scheme needs more and more iterations before it finds a solution. Liu and Terzi already noted that in most cases the probing scheme only needs to be invoked a few times. However, it appears that for real social networks the current implementation of the probing scheme will not scale well to large graphs and large values of k. For the even bigger Bowdoin graph it became too time consuming to test all values of k, hence only the first few values are shown. Social networks can be much bigger than these example graphs, raising questions about the performance of the probing scheme in combination with the SuperGraph algorithm.

Probing Scheme 2

We attempted to improve the probing scheme by making the following modifications to the AddRandomNoise function:

• If the sum of degrees is uneven, increase the degree of a random node by only one.


Figure 9.6: L1-distance between the original degree sequence and that obtained by DegreeAnonymization (dashed line), and between the original degree sequence and that obtained by OptimalDegreeAnon (full line). The x-axis shows the privacy parameter k and the y-axis the L1-distance. Panels: (a) Enron graph, (b) Powergrid graph, (c) American graph, (d) Berkeley graph, (e) Texas graph, (f) Co-authors graph.


Figure 9.7: Comparison of the L1-distance between the original degree sequence and that obtained by OptimalDegreeAnon for the American (American75) and Powergrid graphs, showing that some graphs are easier to anonymize than others.


Figure 9.8: The full line is the average L1-distance between the degree sequence returned by OptimalDegreeAnon and that of the graph returned by algorithm 9.4, over 20 runs. The dotted line is obtained by adding the variance to the full line. Panels: (a) Enron graph, (b) Powergrid graph, (c) Caltech graph, (d) Bowdoin graph.


• If the sum of degrees is even, increase the degree of a random node by an even amount.

In retrospect these might seem like obvious improvements. We will call this modification probing scheme 2, and the original version probing scheme 1. Figure 9.9 shows the results when using these modifications. We notice that not much has changed: both the results and the variance are roughly the same, and for low values of k such as 2 and 3 the L1 difference is actually much higher! A comparison between the old and new probing scheme is shown in figure 9.10 for the Powergrid and Caltech graphs. Here we see that for some values of k probing scheme 2 is better, and for some values it's actually worse.
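A minimal sketch of the noise step of probing scheme 2 is shown below; it assumes the degree sequence is a mutable list and that, as in the original scheme, maxDelta bounds the amount of noise that can be added.

import random

def add_random_noise_v2(degrees, max_delta=10):
    i = random.randrange(len(degrees))
    if sum(degrees) % 2 == 1:
        degrees[i] += 1  # an odd degree sum can never be realizable, so make it even
    else:
        degrees[i] += 2 * random.randint(1, max_delta // 2)  # keep the degree sum even
    return degrees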

Probing Scheme 3

When running the tests a more drastic modification was also tried: we modified AddRandomNoise such that it increases the degree of the last node in the degree sequence by one. We will call it probing scheme 3. A predicted downside is that this will increase the running time of the algorithm. The first observation is that on all tested graphs the variance is much lower, making it impossible to visually see the dotted line, and therefore it's no longer included in any graphs dealing with probing scheme 3. A comparison of the L1 distance between the optimal degree sequence and that of the returned graph is shown in figure 9.11 for the Powergrid and Caltech graphs. We observe that when using probing scheme 3 on average fewer edge additions have to be used. A disadvantage is that because only a low amount of noise is added, the algorithms take a longer time to execute on average, as we predicted. From the Caltech graph, however, we observe something very interesting: for most values of k the result is indeed better, yet there are some outliers for which the returned result is actually worse!

Probing Scheme 4

Our last attempt at improving the probing scheme is to increase the degree of a random node located in the second half of the degree sequence by one. The idea is that this will still reduce the variance and decrease the L1 distance as we have seen for probing scheme 3, but that it hopefully reduces the peaks where that scheme performed worse than probing schemes 1 and 2. Figure 9.12 shows the results. The variance is higher than for probing scheme 3, but lower than for schemes 1 and 2. Figure 9.13 shows a comparison between the probing schemes for the Caltech graph. The accuracy is better than for schemes 1 and 2, with fewer outliers than scheme 3.
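For completeness, minimal sketches of the noise steps of probing schemes 3 and 4 are given below; both assume the degree sequence is kept sorted in non-increasing order, so the last positions hold the smallest degrees.

import random

def add_random_noise_v3(degrees):
    # Probing scheme 3: always increase the degree of the last node by one.
    degrees[-1] += 1
    return degrees

def add_random_noise_v4(degrees):
    # Probing scheme 4: increase the degree of a random node in the second half by one.
    i = random.randrange(len(degrees) // 2, len(degrees))
    degrees[i] += 1
    return degrees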

We conclude that improving the probing scheme appears to be a tedious task, and that for some graphs many iterations are needed before a result is


Figure 9.9: The full line is the average L1-distance between the degree sequence returned by OptimalDegreeAnon and that of the graph returned by algorithm 9.4 using probing scheme 2, over 20 runs. The dotted line is obtained by adding the variance to the full line. Panels: (a) Enron graph, (b) Powergrid graph, (c) Caltech graph, (d) Bowdoin graph.


Figure 9.10: Comparison between the average L1-distance of the degree sequence returned by OptimalDegreeAnon and that of the graph returned by algorithm 9.4 using probing schemes 1 and 2, over 20 runs. Panels: (a) Powergrid graph, (b) Caltech graph.

Figure 9.11: Comparison between the average L1-distance of the degree sequence returned by OptimalDegreeAnon and that of the graph returned by algorithm 9.4 using probing schemes 1 and 3, over 20 runs. Panels: (a) Powergrid graph, (b) Caltech graph.


Figure 9.12: The full line is the average L1-distance between the degree sequence returned by OptimalDegreeAnon and that of the graph returned by algorithm 9.4 using probing scheme 4, over 20 runs. The dotted line is obtained by adding the variance to the full line. Panels: (a) Powergrid graph, (b) Caltech graph.

Figure 9.13: Comparison between the average L1-distance of the degree sequence returned by OptimalDegreeAnon and that of the graph returned by algorithm 9.4 using probing schemes 1, 3 and 4, over 20 runs, for the Caltech graph.


found. In the following algorithms we will always use probing scheme 4.

9.7.3 GreedySwap and SimultaneousSwap

Both GreedySwap and SimultaneousSwap use ConstructGraph to create the initial anonymous graph. Both algorithms will then swap edges in the created graph to make it resemble the input graph. The difference is that GreedySwap only uses edge additions when computing the k-anonymous degree sequence, while SimultaneousSwap allows both edge deletions and additions during this step. When the degree sequence isn't realizable a probing scheme is used.

The probing scheme has already been extensively studied for the SuperGraph algorithm, and we will use probing scheme 4 for all algorithms. Furthermore, the variance of the results is similar to that of SuperGraph for the input graphs we used. No particularly interesting observations were made when individually implementing and testing these algorithms. Our main interest is in comparing the results of the algorithms, which will be done in the next section.

9.8 Comparison Between Algorithms

Finally we can compare the performance of the anonymization algorithms against each other. Abbreviated algorithm names have been used to prevent the graphs from becoming too cluttered. We will test the following three:

• SuperGraph: Algorithm 9.4 on page 143

• Relaxed: Algorithm 9.7 on page 146

• SimulSwap: Algorithm 9.8 on page 148

The Enron, Powergrid, Caltech and Bowdoin graphs will be used to test the algorithms. Note that because the Enron graph has only 151 nodes, its results deteriorate quickly as a function of k, and once k is higher than 75 all nodes of the Enron graph are put in the same anonymization group. The other three graphs are large enough to avoid this problem. Again we will run the algorithms a number of times and then take the average of the computed properties. However, because of the size of the graphs, only a limited number of runs for each value of k could be executed. To increase the speed not all values of k were tested; instead k was increased with a predefined interval. An interval of, say, 3 would mean the algorithm was tested for the values k = 2, 5, 8, 11, ..., 98. These parameters are summarized in table 9.2.

Unfortunately there are no special metrics designed to measure “how much two graphs differ in terms of structure” [Hay10]. Hence the effect of anonymization on the graphs will be measured by studying the change in several graph properties.
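The properties themselves can be computed with standard graph libraries; a minimal sketch using NetworkX is shown below. The powerlaw estimation of section 7.3.2 is left out, and the diameter and average path length are computed on the largest connected component.

import networkx as nx

def graph_properties(graph):
    # Diameter and average path length are only defined on connected graphs,
    # so restrict those two measures to the largest connected component.
    largest = graph.subgraph(max(nx.connected_components(graph), key=len))
    return {
        "diameter": nx.diameter(largest),
        "average path length": nx.average_shortest_path_length(largest),
        "density": nx.density(graph),
        "clustering coefficient": nx.average_clustering(graph),
    }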


Graph Name   #Nodes  #Edges  #Runs  Interval Size
Enron        151     502     20     2
Powergrid    4,939   6,594   5      3
Caltech      768     16,656  4      3
Bowdoin      2,251   84,387  2      5

Table 9.2: Graph statistics and testing parameters.

Diameter

The effect on the diameter of the graphs is shown in figure 9.14 on page 163. Although not shown, the variance of this property can be rather high. For the Enron and Caltech graphs the variance is generally around 1, but for the Powergrid and Bowdoin graphs it can be as large as 4 when using the SimulSwap algorithm. When using the SuperGraph or Relaxed algorithms the variance is typically lower. The large variance can be explained by the fact that the diameter is the maximum distance between any pair of nodes: adding or removing a single edge can drastically influence this property.

Our tests indicate that SimulSwap best preserves the diameter of a graph. Remember that the results for the Enron graph deteriorate quickly because it has only a few nodes, and therefore one should not focus too much on the results obtained from this graph.

Average Path Length (APL)

The effect on the average path length of the graphs is shown in figure 9.15 on page 164. The variance of this property is low. Interestingly, SuperGraph seems to have better results for low values of k on the Powergrid graph, but SimulSwap is still better for higher values of k. For the other three graphs SimulSwap is clearly the better algorithm when considering the average path length.

Density

Figure 9.16 on page 165 shows the effect on the density of the graphs. Since only SimulSwap allows both edge additions and deletions when constructing the k-anonymous degree sequence, it clearly performs better. We also see that Relaxed is slightly better than SuperGraph.

Powerlaw Estimation of the Degree Sequence

The effect on the powerlaw estimation of the degree sequence (as explained in section 7.3.2 on page 106) is shown in figure 9.17 on page 166. For the Powergrid graph the results of Relaxed and SuperGraph coincide, and for the Bowdoin graph the results of SimulSwap and Relaxed coincide.


Because the Enron graph contains only 151 nodes, estimating the powerlaw distribution becomes meaningless for higher values of k, since there will be only a few anonymization groups left.

Clustering Coefficient

Figure 9.18 on page 167 shows the change in the clustering coefficient after anonymization. For the Powergrid graph SuperGraph clearly outperforms the other two algorithms. However, for the Caltech and Bowdoin graphs it performs less well, and SimulSwap might be better for very large values of k.

Summary

There is no superior algorithm. Depending on both the graph and the property being measured, some algorithms perform better than others. Thus selecting an anonymization algorithm can be a hard task, although commonly SimulSwap is chosen because it allows both edge additions and removals during all steps of the algorithm [Hay10].

9.9 Relation to Link Prediction

All the previous algorithms add or remove random edges. A possible problem is that the edges being added might actually be unlikely to exist in reality, or that the removed edges are easy to predict. Therefore it might be possible to detect which edges were added to the graph during anonymization, and to predict which edges were removed. This boils down to the link prediction problem, which asks which new interactions among the members of a network are likely to occur in the near future [LNK07]. It might thus be possible to remove part of the noise introduced by the degree anonymization mechanism, and still breach the privacy of certain individuals. Unfortunately we did not have the time to explore this idea, but it could be an interesting direction for future work.

9.10 Conclusion

There are two obstacles that prevent degree anonymization from being useful in practice. First of all, assuming that the adversary only knows the degrees of targeted individuals is a strong assumption: an adversary likely also knows about the existence of certain edges. The second problem is that the current algorithms do not scale well to large graphs, due to, among other things, the probing scheme.


Figure 9.14: Diameter. The x-axis shows the privacy parameter k and the y-axis the diameter of the k-anonymous graph returned by the various anonymization algorithms (SimulSwap, SuperGraph, Relaxed). Panels: (a) Enron graph, (b) Powergrid graph, (c) Caltech graph, (d) Bowdoin graph.


Figure 9.15: Average Path Length (APL). The x-axis shows the privacy parameter k and the y-axis the APL of the k-anonymous graph returned by the various anonymization algorithms (SimulSwap, SuperGraph, Relaxed). Panels: (a) Enron graph, (b) Powergrid graph, (c) Caltech graph, (d) Bowdoin graph.


Figure 9.16: Density. The x-axis shows the privacy parameter k and the y-axis the density of the k-anonymous graph returned by the various anonymization algorithms (SimulSwap, SuperGraph, Relaxed). Panels: (a) Enron graph, (b) Powergrid graph, (c) Caltech graph, (d) Bowdoin graph.


Figure 9.17: Powerlaw. The x-axis shows the privacy parameter k and the y-axis the powerlaw estimation of the degree sequence of the k-anonymous graph returned by the various anonymization algorithms (SimulSwap, SuperGraph, Relaxed). Panels: (a) Enron graph, (b) Powergrid graph, (c) Caltech graph, (d) Bowdoin graph.


Figure 9.18: Clustering Coefficient. The x-axis shows the privacy parameter k and the y-axis the clustering coefficient of the k-anonymous graph returned by the various anonymization algorithms (SimulSwap, SuperGraph, Relaxed). Panels: (a) Enron graph, (b) Powergrid graph, (c) Caltech graph, (d) Bowdoin graph.


Chapter 10

Passive Attack

The degree anonymization mechanism explained in chapter 9 protects only against limited adversaries, and is still vulnerable to attack. Although not causally related to degree anonymization, the passive attack will be presented as an example of a powerful attack an adversary can perform, one which can possibly defeat degree anonymization. In contrast to the active attack, the passive attack can be performed without adding new nodes to the target network. This chapter is based on two papers by Narayanan et al. [NSR11, NS09].

10.1 Background

When applying the degree anonymization mechanism explained in the previous chapter, we assumed the adversary only knows the degree of the user(s) he or she wants to target. We already mentioned that this is not always a realistic assumption, and in certain situations the adversary can possess more detailed information. As a result degree anonymization is not a very powerful anonymization technique that is usable in practice. However, it's an excellent example that illustrates how one can attempt to prevent privacy breaches, yet if an adversary possesses more auxiliary information than assumed, he or she will still be able to re-identify individuals.

Originally the passive attack outlined in this chapter was not devised to demonstrate the weaknesses of degree anonymization. Rather, it was the first algorithm that showed the feasibility of large-scale, passive de-anonymization of real-world social networks [NS09]. Although time restrictions prevent us from testing the passive attack on a degree anonymized network, we believe a significant number of individuals can be re-identified using a passive attack. Even if that were not the case, the passive attack is an important result on its own.

The target of the passive attack is an anonymized social network which


is possibly modified by introducing noise [NS09]. The auxiliary information of the adversary consists of a second social network that has some overlap with the network to be attacked. Note that this overlap doesn't need to be large; even an overlap of only 15% suffices. With this auxiliary information the adversary is able to re-identify a significant number of individuals using only the structure of both networks.

10.2 The Setting

The passive attack attempts to re-identify nodes in an anonymized graph. Anonymization is modeled by publishing only a subset of the attributes, where the released attributes themselves are insufficient for re-identification. Furthermore, the data release process may involve perturbation or sanitization that changes the graph structure in some way to make re-identification attacks harder [NS09]. An example of such sanitization is degree anonymization. We will call the sanitized target network SSAN. Because of how the graphs are obtained and published some nodes might have no outgoing edges; nodes whose outgoing edges are included will be called crawled nodes.

In addition to SSAN the adversary also has access to a different network SAUX whose membership partially overlaps with SSAN. This auxiliary network can be obtained in several ways. It can be created by crawling a similar, but public, social network. A government-level agency can use modern surveillance techniques to create its own social graph. Another possibility is that a hacker obtains an auxiliary graph from a data breach. When carrying out the attack it makes no difference how SAUX was obtained. The only requirement is that there is a partial membership overlap between SSAN and SAUX.

The actual de-anonymization of SSAN is done in two steps. The first step, called seed identification, is the biggest challenge and consists of de-anonymizing an initial number of nodes. In the second step, called propagation, the de-anonymization is propagated by selecting an arbitrary, still anonymous node, and finding the most similar node in the auxiliary graph. Both steps will be discussed in detail.

10.3 Seed Identification

Seed identification consists of creating a mapping from an initial subset of nodes in SSAN to nodes in SAUX. Note that such a mapping effectively de-anonymizes the node in question, as we now know its identity from the auxiliary information. Seed identification can be achieved using two methods. The first one uses individual auxiliary information [NS09], and the second one is based on the weighted graph matching problem [NSR11].


10.3.1 Individual Auxiliary Information

To de-anonymize an initial number of nodes the attacker's individual auxiliary information can be used. It consists of detailed information about a very small number of members of the target network SSAN. With this information the attacker can determine the location of these members in the anonymized graph, or determine that they are not present in the graph. An example used in the original paper of Narayanan and Shmatikov is a clique of k nodes which are present in both SSAN and SAUX. The attacker must also know the degree of each node and the number of common neighbors each pair of nodes in the clique has.

Based on this information the seed identification step searches the target graph for the unique clique with matching node degrees and common neighbor counts. This closely resembles the active attack discussed in chapter 8, and this method is essentially a pattern search. Note that only a few seeds are needed, so only a limited amount of individual auxiliary information is required. An attacker can collect such information by bribing individuals that are included in both networks, by crawling the profiles of individuals who share all their information publicly, or by hacking a selected number of social profiles in order to retrieve enough information.

Originally this method was used to find seeds between two different social networks: in the paper by Narayanan and Shmatikov it was used to find seeds between crawls of Twitter and Flickr.

10.3.2 Combinatorial Optimization Problem

In this method the structure of both graphs is used to find initial seeds. Before we explain it further, it's important to know that this attack was designed to find seed mappings between two networks that originated from the same social network. The target network SSAN was created by Kaggle by crawling Flickr during 21-28 June 2010. Between mid-December 2010 and mid-January 2011 the researchers created the auxiliary network SAUX by also crawling Flickr. The crawls were made by initially downloading random nodes, and subsequently sampling uniformly from the outbound neighbors of these nodes. Not all nodes had outgoing edges in the target graph; such edges may exist in reality but weren't included in the crawls (these are positions where the crawling process was terminated). Recall that nodes whose outgoing edges were included are called crawled nodes.

Since the graphs can contain millions of nodes, it's essential to first reduce the search space. The assumption that will be used is that, with high probability, nodes having a high in-degree in both graphs roughly correspond to each other. In other words, nodes with the highest in-degree will be present in both graphs. Note that for the crawling process used to obtain SSAN and


Figure 10.1: Example with two high in-degree nodes A and B.

SAUX this is highly likely. Now, in order to reduce the search space, we can pick two small subsets of nodes, where all nodes have a high in-degree, knowing that with high probability there will be a large correspondence between these subsets. The next step is to find the actual correspondence between these high in-degree nodes. Because we are not sure whether all edges among the top k nodes are present in the target graph, it's not possible to search for the mapping that minimizes the edge mismatches between them.

Instead, for each graph separately, the cosine similarity of the sets of in-neighbors is calculated for every pair of nodes. The in-neighbors of a node are all the nodes with a directed edge from themselves to the node under consideration. Here the cosine similarity between two sets of nodes X and Y is defined as

sim(X, Y) = |X ∩ Y| / √(|X| · |Y|)

For example, the cosine similarity of the nodes A and B in figure 10.1 is

|{2, 3}| / √(3 · 3) = 2/3
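A direct transcription of this definition into code is straightforward; the sketch below also checks the example above (two in-neighbor sets of size three sharing two nodes).

import math

def cosine_sim(x, y):
    # x and y are sets of node identifiers, e.g. in-neighbor sets.
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))

assert abs(cosine_sim({1, 2, 3}, {2, 3, 4}) - 2 / 3) < 1e-9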

The idea is now that the cosine similarity between two nodes is roughly the same as the cosine similarity between the corresponding nodes in the other graph. Thus we must find a mapping that minimizes the differences between the cosine similarities of all pairs of nodes. This can be reduced to an existing problem by regarding the cosine similarity as the weight of an edge between these two nodes. Doing this for both graphs creates two complete weighted graphs. Now the goal is to find a mapping such that the weights of corresponding edges are as close to each other as possible, which is called the weighted graph matching problem. It can be solved using existing


Figure 10.2: Blue nodes have already been mapped. The similarity between A and B is 2/3. Possible edges between A and A' or B and B' do not influence the similarity. [NSR11, Nar11]

algorithms or heuristics [NSR11]. We end up with a mapping between the high in-degree nodes, completing the seed identification step.

10.4 Propagation

In order to propagate the de-anonymization, the algorithm stores a partial mapping between the nodes and iteratively extends this mapping. How the mapping is extended depends on how the input graphs were obtained. Sometimes different strategies are needed in different situations, as is evidenced by the different algorithms used in the two papers by Narayanan et al. In this section we combine both methods into a general algorithm, but note that in practice the presented algorithm may require further tweaking.

Similarity Measure

First we need to measure the similarity between two unmapped nodes in both graphs. For this we calculate the cosine similarity between the already mapped neighbors of both nodes. Nodes mapped to each other will be treated as identical for the purpose of cosine comparison. Figure 10.2 shows an example. Note that only edges to already mapped nodes influence the cosine similarity. However, this calculation is not trivial, as we also have to take into account edge directionality and the fact that not all nodes are crawled (i.e., some nodes have no outgoing edges). This is solved by ignoring the outgoing edges of the nodes unless both have been crawled. The final similarity measure is shown in algorithm 10.1.


Algorithm 10.1 Similarity scores for two nodes.

Input:
  vsan: a node in the target graph
  vaux: a node in the auxiliary graph
  D: set of nodes in SSAN de-anonymized so far
  map: current partial mapping between nodes in SSAN and SAUX
  CSAN: set of crawled nodes in SSAN
  CAUX: set of crawled nodes in SAUX
  N+(·): out-neighborhood of a node
  N−(·): in-neighborhood of a node
Output: Similarity score between vsan and vaux.

1: NSAN := map[N−(vsan) ∩ D] ∩ CAUX
2: NAUX := N−(vaux) ∩ map[CSAN]
3: if vsan ∈ CSAN and vaux ∈ CAUX then
4:   NSAN := NSAN ∪ map[N+(vsan) ∩ D] ∩ CAUX
5:   NAUX := NAUX ∪ N+(vaux) ∩ map[CSAN]
6: end if
7: return CosineSim(NSAN, NAUX)
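A Python transcription of algorithm 10.1 could look as follows; it is a sketch in which mapping is a dictionary from SSAN nodes to SAUX nodes (the algorithm's map), the neighborhood functions return sets, and cosine_sim is the cosine similarity function sketched earlier.

def similarity_score(v_san, v_aux, D, mapping, C_san, C_aux, in_nb, out_nb):
    # Lines 1-2: map the already de-anonymized in-neighbors of v_san into the
    # auxiliary graph and keep only crawled nodes on both sides.
    mapped_crawled = {mapping[u] for u in C_san if u in mapping}
    n_san = {mapping[u] for u in in_nb(v_san) & D} & C_aux
    n_aux = in_nb(v_aux) & mapped_crawled
    # Lines 3-6: use outgoing edges only when both nodes have been crawled.
    if v_san in C_san and v_aux in C_aux:
        n_san |= {mapping[u] for u in out_nb(v_san) & D} & C_aux
        n_aux |= out_nb(v_aux) & mapped_crawled
    return cosine_sim(n_san, n_aux)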

Further Details

To increase the probability that a mapping is correct, the eccentricity of a mapping is used. It measures how much an item stands out from the rest, and is defined as

    (max(X) − max2(X)) / σ(X)

where max and max2 denote the highest and second highest values, respectively, and where σ is the standard deviation [NS09]. It's now possible to reject a mapping if the eccentricity is below a certain threshold.
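A minimal sketch of this measure; it assumes the list of scores contains at least two values and returns zero when the standard deviation vanishes.

import statistics

def eccentricity(scores):
    values = sorted(scores, reverse=True)
    sigma = statistics.pstdev(values)
    if len(values) < 2 or sigma == 0:
        return 0.0
    return (values[0] - values[1]) / sigma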

During the execution of the algorithm it will revisit already mapped nodes. The newly calculated mapping for a node may be different from the current mapping, and if so it will be updated accordingly. This is done because at the early stages of the algorithm only a few mapped nodes are available, making an incorrect mapping more likely. As more and more mappings become available, such an error can be detected and corrected. Practically this is done by trying to find a mapping for every node during each iteration, even for already mapped ones.

When a possible mapping from a node in SSAN to SAUX is found, the graphs are switched. Only if the same mapping is then found on the switched graphs will it be added. This increases the probability that the mapping is correct. It also means it makes no difference which graph is the target network and which is the auxiliary network: switching them will still result


def matchScores(lgraph, rgraph, mapping, lnode):
    return {rnode: similarity(lnode, rnode) for rnode in rgraph.nodes}

def propagationStep(lgraph, rgraph, mapping):
    for lnode in lgraph.nodes:
        scores = matchScores(lgraph, rgraph, mapping, lnode)
        if eccentricity(scores.values()) < theta:
            continue
        rnode = max(scores, key=scores.get)

        # Switch the graphs and require that the best match maps back to lnode.
        reverseScores = matchScores(rgraph, lgraph, invert(mapping), rnode)
        if eccentricity(reverseScores.values()) < theta:
            continue
        if max(reverseScores, key=reverseScores.get) != lnode:
            continue

        mapping[lnode] = rnode

def passiveAttack():
    # similarity, eccentricity, invert, theta and the convergence test
    # are defined elsewhere.
    while not converged:
        propagationStep(lgraph, rgraph, seedMapping)

Listing 10.1: Passive Attack

in the same output. For this reason we will also use the more general terms lgraph and rgraph, denoting the left graph and right graph, respectively.

Propagation Step

The passive attack can now be implemented. Each iteration it calculates the similarity score of all pairs of nodes, and if the eccentricity is high enough the mapping is added. The parameter theta denotes the threshold on the eccentricity for a mapping to be added. The algorithm continues to iterate over all node pairs until no more changes are made to the mapping. The final algorithm is shown in listing 10.1.

10.5 Experiments

10.5.1 Anonymized Flickr Graph

As already mentioned in section 3.1.4 a passive attack was performed on a supposedly anonymized social graph used in the Kaggle link prediction challenge. For the challenge a subset of the Flickr social network was crawled. Fake edges were added to the resulting graph, and contestants had to use link prediction to determine whether selected edges were fake or genuine.


In an attempt to preserve privacy, and to prevent cheating, the released graph was naively anonymized.

Narayanan, Shi and Rubinstein were able to re-identify 64.7% of the nodes with an accuracy of 95.2% [NSR11]. First they made their own crawl of Flickr to obtain the auxiliary graph used in the passive attack. Because both graphs stem from the same network, the seed identification was done using the combinatorial optimization problem. During the contest they found a seed mapping of 10 nodes among the top 20 nodes by mere visual inspection of the matrix of cosines [NSR11]. Later on they developed an automated approach to the weighted graph matching problem.

10.5.2 Flickr and Twitter

The passive attack was also tested on two different social networks, namely Twitter and Flickr. The graphs were obtained by crawling the public profiles on the social network sites, so again a selection bias may be present due to the crawling process. First a ground truth was established, i.e., the true mapping between the two graphs. These mappings were based on exact matches of the username and name of a profile. Human inspection of a random sample of the resulting ground truth indicated that the error rate is well under 5%.

Seed identification was done by simply selecting 150 pairs of nodes from the ground truth. This mapping was then fed to the passive attack algorithm. When completed, 30.8% of the mappings were correctly re-identified, 12.1% were incorrectly re-identified and 57% of the ground truth mappings were not identified. Interestingly, 41% of the incorrectly identified mappings (6.7% overall) were mapped to nodes that were neighbors of the true mapping. It's suggested that human verification can complete the re-identification process in those situations.

10.6 Conclusion

The passive attack is a potentially powerful attack, given that the adversary has access to enough auxiliary information. The experiments illustrate that one must not underestimate the amount of auxiliary information an adversary can possess.


Chapter 11

Conclusion

We have argued that privacy is increasingly important now that more and more data is being collected. Previous events have taught us that intuitive methods to assure privacy are likely to fail. This is manifested by examples such as the AOL search fiasco, the HMO record linkage attack, the Netflix prize and the Kaggle machine learning contest. In all of these cases superficial attempts were made to assure privacy, which caused all of them to fail. This motivated the search for a mathematical definition of privacy.

The holy grail of privacy definitions would be Dalenius' desideratum, which states that obtaining access to the database doesn't increase the chance of a privacy breach. Unfortunately such a guarantee is impossible, as there is always a piece of auxiliary information that, combined with access to the database, will violate the privacy of an individual. After all, the goal of a database is to store data and learn new information from this data. It's unavoidable that this newly learned information can increase the chance of a possible privacy breach. From this the idea of relative disclosure prevention was suggested.

Relative disclosure prevention states that there is only a small increase in the possibility that a privacy breach occurs. This can give rise to a definition that is achievable in practice. One of the most promising definitions in this area is differential privacy. Roughly speaking it states that handing over your data, or keeping it private, doesn't significantly affect the chance of a privacy breach occurring. Put differently, hiding your data doesn't offer you much benefit when differential privacy is being used.


Bibliography

[AE11] Julian Assange and Laura Emmett. Interview with julianassange by RT. Retrieved 4 May, 2011, from http://rt.com/

news/wikileaks-revelations-assange-interview/, May2011.

[AFK+05] Gagan Aggarwal, Tomas Feder, Krishnaram Kenthapadi, Ra-jeev Motwani, Rina Panigrahy, Dilys Thomas, and An Zhu. Ap-proximation algorithms for k-anonymity. In Proceedings of theInternational Conference on Database Theory (ICDT 2005),November 2005.

[And06] Nate Anderson. The ethics of using AOL search data. Ars Technica, August 2006.

[Arr06] Michael Arrington. AOL proudly releases massive amounts of private data. Tech Crunch, August 2006.

[AW89] Nabil R. Adam and John C. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Computing Surveys (CSUR), 21:515–556, December 1989.

[BDK07] Lars Backstrom, Cynthia Dwork, and Jon Kleinberg. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th international conference on World Wide Web, WWW ’07, pages 181–190. ACM, 2007.

[BDMN05] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the SuLQ framework. In Proceedings of the 24th ACM Symposium on Principles of Database Systems, PODS 2005, pages 128–138. ACM, 2005.

[BL07] James Bennett and Stan Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop, August 2007.


[BMS04] Peter S Bearman, James Moody, and Katherine Stovel. Chains of affection: The structure of adolescent romantic and sexual networks. American Journal of Sociology, 110(1):44–91, 2004.

[Bol01] B. Bollobas. Random Graphs. Cambridge University Press, 2001.

[BZ06] Michael Barbaro and Tom Zeller. A face is exposed for AOL searcher no. 4417749. The New York Times, August 2006.

[Cho86] S.A. Choudum. A simple proof of the Erdős-Gallai theorem on graph sequences. Bulletin of the Australian Mathematical Society, 33:67–70, 1986.

[CSN09] Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Rev., 51:661–703, November 2009.

[Dal77] Tore Dalenius. Towards a methodology for statistical disclosure control. Statistik Tidskrift, 15, 1977.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Theory of Cryptography Conference, pages 265–284, February 2006.

[DN10] Cynthia Dwork and Moni Naor. On the difficulties of disclosure prevention in statistical databases or the case for differential privacy. Journal of Privacy and Confidentiality, 2(1):93–107, 2010.

[DNP+10] Cynthia Dwork, Moni Naor, Toniann Pitassi, Guy N. Rothblum, and Sergey Yekhanin. Pan-private streaming algorithms. In Proceedings of The First Symposium on Innovations in Computer Science (ICS 2010). Tsinghua University Press, January 2010.

[DNPR10] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Proceedings of the 42nd ACM symposium on Theory of computing, STOC ’10, pages 715–724. ACM, 2010.

[DORS08] Yevgeniy Dodis, Rafail Ostrovsky, Leonid Reyzin, and Adam Smith. Fuzzy extractors: How to generate strong keys from biometrics and other noisy data. SIAM Journal on Computing, 38(1):97–139, March 2008.


[DRV10] Cynthia Dwork, Guy N. Rothblum, and Salil Vadhan. Boosting and differential privacy. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS ’10, pages 51–60. IEEE Computer Society, 2010.

[DS05] Yevgeniy Dodis and Adam Smith. Correcting errors without leaking partial information. In Proceedings of the 37th ACM Symposium on Theory of Computing (STOC 2005), pages 654–663. ACM Press, 2005.

[Dwo06] Cynthia Dwork. Differential privacy. In 33rd International Colloquium on Automata, Languages and Programming, part II (ICALP 2006), volume 4052 of Lecture Notes in Computer Science, pages 1–12. Springer, July 2006.

[Dwo08] Cynthia Dwork. Differential privacy: a survey of results. In Proceedings of the 5th international conference on Theory and applications of models of computation, TAMC’08, pages 1–19. Springer, 2008.

[Dwo11] Cynthia Dwork. A firm foundation for private data analysis. Communications of the ACM, 54:86–95, January 2011.

[Fac] Facebook. Facebook’s data use policy: Personalised adverts. Retrieved 6 December, 2011, from https://www.facebook.com/about/privacy/advertising.

[FCS+06] Dan Frankowski, Dan Cosley, Shilad Sen, Loren Terveen, and John Riedl. You are what you say: privacy risks of public mentions. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’06, pages 565–572. ACM Press, 2006.

[FG96] Bert Fristedt and Lawrence Gray. A Modern Approach to Probability Theory. Birkhauser Boston, 1996.

[Gav80] Ruth Gavison. Privacy and the limits of the law. Yale Law Journal, 1980.

[GKS08] Srivatsava Ranjit Ganta, Shiva Kasiviswanathan, and Adam Smith. Composition attacks and auxiliary information in data privacy. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 265–273. ACM Press, 2008.

[Gol01] Oded Goldreich. Foundations of Cryptography: Basic Tools, volume 1 of Foundations of Cryptography. Cambridge University Press, 2001.


[Gol06] Philippe Golle. Revisiting the uniqueness of simple demographics in the US population. In Proceedings of the 5th ACM workshop on Privacy in electronic society, WPES ’06, pages 77–80. ACM Press, 2006.

[GW86] Geoffrey Grimmett and Dominic Welsh. Probability: An Introduction. Oxford University Press, 1986.

[Hay06] Brian Hayes. Can the tools of graph theory and social-network studies unravel the next big plot? American Scientist, 95, October 2006.

[Hay10] Michael Hay. Enabling Accurate Analysis of Private Network Data. PhD thesis, University of Massachusetts, 2010.

[HLMJ09] Michael Hay, Chao Li, Gerome Miklau, and David Jensen. Accurate estimation of the degree distribution of private networks. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining. IEEE Computer Society, 2009.

[HMJ+08] Michael Hay, Gerome Miklau, David Jensen, Don Towsley, and Philipp Weis. Resisting structural re-identification in anonymized social networks. Proc. VLDB Endow., August 2008.

[Hoe63] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.

[ILL89] Russell Impagliazzo, Leonid Anatolievich Levin, and Michael Luby. Pseudo-random generation from one-way functions (extended abstracts). In Proceedings of the 21st annual ACM Symposium on Theory of Computing (STOC ’89), pages 12–24. ACM Press, 1989.

[Kag10] Kaggle. IJCNN social network challenge. Retrieved 14 May, 2011, from http://www.kaggle.com/c/socialNetwork, 2010.

[Kai08] Marcus Kaiser. Mean clustering coefficients: the role of isolated nodes and leafs on clustering measures for small-world networks. New Journal of Physics, 10(8), 2008.

[KMN05] Krishnaram Kenthapadi, Nina Mishra, and Kobbi Nissim. Simulatable auditing. In Proceedings of the 24th ACM Symposium on Principles of Database Systems, PODS 2005, pages 118–127. ACM Press, 2005.


[KPR00] Jon Kleinberg, Christos Papadimitriou, and Prabhakar Raghavan. Auditing boolean attributes. In Proceedings of the 10th ACM Symposium on Principles of Database Systems, PODS 2000, pages 86–91. ACM Press, 2000.

[Kum01] I. Ravi Kumar. Comprehensive Statistical Theory of Communication. Laxmi Publications Ltd, 2001.

[Lee95] Yoong-Sin Lee. Graphical demonstration of an optimality property of the median. The American Statistician, 49(4):369–372, 1995.

[LLV07] Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In 23rd International Conference on Data Engineering, pages 106–115, April 2007.

[LNK07] David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol., 58:1019–1031, May 2007.

[LT08] Kun Liu and Evimaria Terzi. Towards identity anonymization on graphs. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, SIGMOD ’08, pages 93–106. ACM, 2008.

[Mar10] Moxie Marlinspike. BlackHat Europe 2011: Changing threats to privacy. Retrieved 17 April, 2011, from http://www.blackhat.com/html/bh-eu-10/bh-eu-10-archives.html, April 2010.

[MB06] Elinor Mills and Anne Broache. Three workers depart AOL after privacy uproar. CNET News, August 2006.

[MFJ+07] Matthew D. Mailman, Michael Feolo, Yumi Jin, Masato Kimura, Kimberly Tryka, et al. The NCBI dbGaP database of genotypes and phenotypes. Nature Genetics, 39:1181–1186, October 2007.

[MGK07] Ashwin Machanavajjhala, Johannes Gehrke, and Daniel Kifer. ℓ-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, 1, March 2007.

[MT07] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, pages 94–103, Washington, DC, USA, 2007. IEEE Computer Society.


[MV02] Milena Mihail and Nisheeth Vishnoi. On generating graphs with prescribed degree sequences for complex network modeling applications. Position Paper, ARACNE, 2002.

[MW04] Adam Meyerson and Ryan Williams. On the complexity of optimal k-anonymity. In Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS ’04, pages 223–228. ACM, 2004.

[Nao10a] Moni Naor. Foundations of privacy - spring 2010: Lecture 1. Retrieved 21 April, 2011, from http://www.wisdom.weizmann.ac.il/~naor/COURSE/foundations_of_privacy.html, 2010.

[Nao10b] Moni Naor. Foundations of privacy - spring 2010: Lecture 2. Retrieved 21 April, 2011, from http://www.wisdom.weizmann.ac.il/~naor/COURSE/foundations_of_privacy.html, 2010.

[Nar11] Arvind Narayanan. 33 bits of entropy: Link prediction by de-anonymization: How we won the Kaggle social network challenge. Retrieved 7 January, 2012, from http://bit.ly/gkWcIZ, March 2011.

[Net] Netflix. Netflix prize: Frequently asked questions. Retrieved 14 May, 2011, from http://www.netflixprize.com/faq.

[NS08] Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy, pages 111–125. IEEE Computer Society, 2008.

[NS09] Arvind Narayanan and Vitaly Shmatikov. De-anonymizing social networks. In Proceedings of the 2009 30th IEEE Symposium on Security and Privacy, pages 173–187. IEEE Computer Society, 2009.

[NSR11] Arvind Narayanan, Elaine Shi, and Benjamin Rubinstein. Link prediction by de-anonymization: How we won the Kaggle social network challenge. Retrieved 20 April, 2011, from http://arxiv.org/abs/1102.4374, February 2011.

[NZ96] Noam Nisan and David Zuckerman. Randomness is linear in space. Journal of Computer and System Sciences, 52(1):43–52, 1996.


[pan11] Panorama: we zijn gezien. Retrieved 12 May, 2011, from http://www.deredactie.be/cm/vrtnieuws/mediatheek/programmas/panorama/1.1014456, May 2011. Vlaamse Radio- en Televisieomroeporganisatie.

[Pos01] Robert C. Post. Three concepts of privacy. Faculty Scholarship Series, 2001.

[PPPM+02] J J Potterat, L Phillips-Plummer, S Q Muth, R B Rothenberg, D E Woodhouse, T S Maldonado-Long, H P Zimmerman, and J B Muth. Risk network structure in the early epidemic phase of HIV transmission in Colorado Springs. Sexually Transmitted Infections, 78:i159–i163, 2002.

[RAW+10] Jason Reed, Adam J. Aviv, Daniel Wagner, Andreas Haeberlen, Benjamin C. Pierce, and Jonathan M. Smith. Differential privacy for collaborative security. In Proceedings of the Third European Workshop on System Security, EUROSEC ’10, pages 1–7. ACM, 2010.

[RR10] Aaron Roth and Tim Roughgarden. Interactive privacy via the median mechanism. In Proceedings of the 42nd ACM symposium on Theory of computing, STOC ’10, pages 765–774. ACM, 2010.

[SA04] Jitesh Shetty and Jafar Adibi. The Enron email dataset: Database schema and brief statistical report, 2004.

[Sha] Zed Shaw. Programmers need to learn statistics or I will kill them all. Retrieved 2 January, 2012, from http://zedshaw.com/essays/programmer_stats.html.

[Sha02] Ronen Shaltiel. Recent developments in explicit constructions of extractors. Bulletin of the European Association for Theoretical Computer Science: Computational Complexity Column, 77:67–95, June 2002.

[Sip06] Michael Sipser. Introduction to the Theory of Computation. International Thomson Publishing, 2nd edition, 2006.

[Swe01] Latanya Arvette Sweeney. Computational Disclosure Control - A Primer on Data Privacy Protection. PhD thesis, Massachusetts Institute of Technology, 2001.

[Swe02] Latanya Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557–570, October 2002.


[SZD08] Zak Stone, Todd Zickler, and Trevor Darrell. Autotagging Facebook: Social network context improves photo annotation. 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8, 2008.

[TMP11] Amanda L Traud, Peter J Mucha, and Mason A Porter. Social structure of Facebook networks. Social Networks, 1102(February 2004):82, 2011.

[TSK06] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Pearson Education, 2006.

[Uni08] United States Senate Committee on Commerce, Science and Transportation. Testimony of Chris Kelly: Chief privacy officer at Facebook. Retrieved 6 December, 2011, from http://tinyurl.com/dxl25lz, July 2008.

[WB85] Samuel D. Warren and Louis D. Brandeis. The right to privacy, pages 172–183. Wadsworth Publ. Co., 1985.

[Wil97] Graham J. Wills. Nicheworks - interactive visualization of very large graphs. In Proceedings of the 5th International Symposium on Graph Drawing, GD ’97, pages 403–414, 1997.

[Won06] Nicole Wong. Judge tells DoJ “no” on search queries. Official Google Blog, March 2006.

[YGKX08] Haifeng Yu, Phillip B. Gibbons, Michael Kaminsky, and Feng Xiao. SybilLimit: A near-optimal social network defense against sybil attacks. In IEEE Symposium on Security and Privacy, pages 3–17, 2008.
