Xintao Wu, Xiaowei Ying
University of North Carolina at Charlotte
A Tutorial of Privacy-Preservation of
Graphs and Social Networks
National Freedom of Information
2
Data Protection Laws
3
National Laws
USA
HIPAA for health care, passed August 21, 1996
sets the lowest bar; states are welcome to enact more stringent rules (e.g., California State Bill 1386)
Gramm-Leach-Bliley Act of 1999 for financial institutions
COPPA for children's online privacy
etc.
Canada: PIPEDA 2000
Personal Information Protection and Electronic Documents Act
Effective from Jan 2004
European Union (Directive 95/46/EC): passed by the European Parliament Oct 1995 and effective from Oct 1998.
Provides guidelines for member state legislation
Forbids sharing data with states that do not protect privacy
4
Privacy Breach
AOL's publication of the search histories of more than 650,000 of its users has yielded more than just one of the year's bigger privacy scandals. (Aug 6, 2006)
That database does not include names or user identities. Instead, it lists only a unique ID number for each user. AOL user 710794:
an overweight golfer, owner of a 1986 Porsche 944 and 1998 Cadillac SLS, and a fan of the University of Tennessee Volunteers Men's Basketball team;
interested in the Cherokee County School District in Canton, Ga., and has looked up the Suwanee Sports Academy in Suwanee, Ga., which caters to local youth, and the Youth Basketball of America's Georgia affiliate;
regularly searches for "lolitas," a term commonly used to describe photographs and videos of minors who are nude or engaged in sexual acts.
5
Source: AOL's disturbing glimpse into users' lives, by Declan McCullagh, CNET News.com,
August 7, 2006, 8:05 PM PDT
Privacy Preserving Data Mining
Data mining: the goal of data mining is summary results (e.g., classification, clustering, association rules, etc.) from the data (distribution).
Individual privacy: individual values in the database must not be disclosed, or at least no close estimation can be obtained by attackers. Contractual limitations: privacy policies, corporate agreements.
Privacy preserving data mining: how to transform data such that we can build a good data mining model (data utility) while preserving privacy at the record level (privacy)?
6
PPDM on Tabular Data
7
ssn | name | zip   | race  | … | age | Sex | Bal | income | … | IntP
    |      | 28223 | Asian | … | 20  | M   | 10k | 85k    | … | 2k
    |      | 28223 | Asian | … | 30  | F   | 15k | 70k    | … | 18k
    |      | 28262 | Black | … | 20  | M   | 50k | 120k   | … | 35k
    |      | 28261 | White | … | 26  | M   | 45k | 23k    | … | 134k
  . |    . | .     | .     | … | .   | .   | .   | .      | … | .
    |      | 28223 | Asian | … | 20  | M   | 80k | 110k   | … | 15k
69% unique on zip and birth date
87% with zip, birth date and gender
Generalization (k-anonymity, L-diversity,
t-closeness etc.) and Randomization
Refer to a survey book [Aggarwal, 08]
PPDM Tutorials on Tabular Data
Privacy in data systems, Rakesh Agrawal, PODS03
Privacy preserving data mining, Chris Clifton, PKDD02, KDD03
Models and methods for privacy preserving data publishing and analysis, Johannes Gehrke, ICDM05, ICDE06, KDD06
Cryptographic techniques in privacy preserving data mining, Helger Lipmaa, PKDD06
Randomization based privacy preserving data mining, Xintao Wu, PKDD06
Privacy in data publishing, Johannes Gehrke & Ashwin Machanavajjhala, S&P09
Anonymized data: generation, models, usage, Graham Cormode & Divesh Srivastava, SIGMOD09
8
Social Network
9
Network of US political books
(105 nodes, 441 edges)
Books about US politics sold by
Amazon.com. Edges represent
frequent co-purchasing of books by
the same buyers. Nodes have been
given colors of blue, white, or red to
indicate whether they are "liberal",
"neutral", or "conservative".
Social Network
10
Social Network
Network of the political blogs on the 2004 U.S. election
(polblogs, 1,222 nodes and 16,714 edges)
11
Social Network
12
• Collaboration network of scientists [Newman, PRE06]
More Social Network Data
Newman’s collection
http://www-personal.umich.edu/~mejn/netdata/
Enron data
http://www.cs.cmu.edu/~enron/
Stanford large network dataset collection
http://snap.stanford.edu/data/index.html
13
Graph Mining
A very hot research area
Graph properties such as degree distribution
Motif analysis
Community partition and outlier detection
Information spreading
Resiliency/robustness, e.g., against virus propagation
Spectral analysis
Research development
"Managing and mining graph data" by Aggarwal and Wang, Springer 2010.
"Large graph-mining: power tools and a practitioner's guide" by Faloutsos et al., KDD09
14
Network Science and Privacy
15
Source: Jeannette Wing, Computing research: a view from DC, SNOWBIRD, 2008
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
16
Social Network Data Publishing
17
Data owner -> (release) -> Data miner
name sex age disease salary
Ada F 18 cancer 25k
Bob M 25 heart 110k
Cathy F 20 cancer 70k
Dell M 65 flu 65k
Ed M 60 cancer 300k
Fred M 24 flu 20k
George M 22 cancer 45k
Harry M 40 flu 95k
Irene F 45 heart 70k
id Sex age disease salary
5 F Y cancer 25k
3 M Y heart 110k
6 F Y cancer 70k
1 M O flu 65k
7 M O cancer 300k
2 M Y flu 20k
9 M Y cancer 45k
4 M M flu 95k
8 F M heart 70k
Threat of Re-identification
18
id Sex age disease salary
5 F Y cancer 25k
3 M Y heart 110k
6 F Y cancer 70k
1 M O flu 65k
7 M O cancer 300k
2 M Y flu 20k
9 M Y cancer 45k
4 M M flu 95k
8 F M heart 70k
Attacker
attack
Ada’s sensitive information is disclosed.
Privacy breaches:
Identity disclosure
Link disclosure
Attribute disclosure
Deriving Personal Identifying Information [Gross WPES05]
User profiles (e.g., photo, birth date, residence,
interests, friend links) can be used to estimate personal
identifying information such as SSN.
### - ## - ####
Users should pay attention to (default) privacy
preference settings of online social networks.
19
Area number: determined by zip code; followed by the group number and the sequential number.
https://secure.ssa.gov/apps10/poms.nsf/lnx/0100201030
Active and Passive Attacks [Backstrom WWW07]
20
Active attack outline:
1. Join the network by creating some new user accounts;
2. Establish a highly distinguishable subgraph H among the
attacking nodes;
3. Send links to targeted individuals from the
attacking nodes;
4. In the released graph, identify the subgraph H
among the attacking nodes;
5. The targeted individuals and their links are
then identified.
Active and Passive Attacks [Backstrom WWW07]
21
Active attacks & subgraph H
The active attack is based on the subgraph H among the attackers:
1. No other subgraph of G is isomorphic to H;
2. Subgraph H has no non-trivial automorphism;
3. H can be identified efficiently regardless of the structure of G.
Active and Passive Attacks [Backstrom WWW07]
22
Passive attack outline:
1. Observation: most nodes in the network already form a uniquely identifiable subgraph.
2. One adversary recruits k-1 of his neighbors to form the
subgraph H of size k.
3. Work similarly to active attacks.
Drawback: Uniqueness of H is not guaranteed.
23
Structural queries:
A structural query Q represents complete or partial structural
information of a targeted individual that may be available to
adversaries.
Structural queries and identity privacy:
Attacks by Structural Queries [Hay VLDB08]
Attacks by Structural Queries [Hay VLDB08]
24
Degree sequence refinement queries
Attacks by Structural Queries [Hay VLDB08]
25
Subgraph queries
The adversary is capable of gathering a fixed number of
edges around the targeted individual.
Hub fingerprint queries
A hub is a central node in a network with high degree and
high betweenness centrality. A hub fingerprint of node v is
the node's connections to a set of designated hubs within a
certain distance.
Attacks by Combining Multiple Graphs [Narayanan ISSP09]
26
Attack outline:
1. The attacker has two types of auxiliary information:
Aggregate: an auxiliary graph whose members overlap with the
anonymized target graph
Individual: the detailed information on a very small number of
individuals (called seeds) in both the auxiliary graph and the target graph.
2. Identify seeds in the target graph.
3. Identify more nodes by comparing the neighborhoods of the
de-anonymized nodes in the auxiliary graph and the target
graph (propagation).
Deriving Link Structure of Entire Network [Korolova ICDE08]
A different threat in which
An adversary subverts user accounts to get local neighborhoods and pieces them together to build the entire network.
No underlying network is released.
A registered user often can see all the links and nodes incident to him within distance d from him.
d=0 if a user can see who he links to.
d=1 if a user can also see who links to all his friends.
Analysis showed that the number of local neighborhoods needed to cover a fraction of the entire network drops exponentially as the lookahead parameter d increases.
27
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
28
Privacy Preserving Social Network Publishing
Naïve anonymization is not sufficient to prevent privacy
breaches, mainly due to link structure based attacks.
Graph topology has to be modified via
Adding/deleting edges/nodes
Grouping nodes/edges into super-nodes and super-edges
Breaking associations between node attributes and nodes
for rich graphs
How to quantify utility loss and privacy preservation in
the perturbed (and anonymized) graph?
29
Graph Utility
Utility heavily depends on mining tasks.
It is challenging to quantify the information loss in the
perturbed graph data.
Unlike tabular data, we cannot use the sum of the
information loss of each individual record.
We cannot use histograms to approximate the distribution
of graph topology.
It is more challenging when considering both structure
change and node attribute change.
30
31
Graph Utility
Topological features: Structural characteristics of the graph.
Various measures form different perspectives.
Commonly used.
Spectral features: Defined as eigenvalues of the graph's adjacency matrix or other derived
matrices.
Closely related to many topological features.
Aggregate queries: Calculate the aggregate on some paths or subgraphs satisfying the query
condition.
E.g.: the average distance from a medical doctor vertex to a teacher vertex
in a network.
Topological Features
Topological features of networks
Harmonic mean of shortest distance
Transitivity(cluster coefficient)
Subgraph centrality
Modularity (community structure)
And many others (refer to: F. Costa et al., Characterization of Complex Networks:
A Survey of measurements, 2006)
32
Graph and Matrix
Adjacency matrix
1. For an undirected graph, A is symmetric;
2. No self-links: the diagonal entries of A are all 0;
3. For an unweighted graph, A is a 0-1 matrix.
33
Spectral Features
Spectral features of networks
Adjacency spectrum
Laplacian spectrum
34
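Spelled out with the usual notation (assumed here, not shown on the slide):
Adjacency spectrum: the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ of the adjacency matrix A.
Laplacian spectrum: the eigenvalues $0 = \mu_1 \le \mu_2 \le \cdots \le \mu_n$ of $L = D - A$, where $D = \mathrm{diag}(d_1, \ldots, d_n)$ is the degree matrix.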
Topological vs. Spectral Features
Adjacency and Laplacian spectrum:
The maximum degree, chromatic number, clique number, etc. are related to the largest adjacency eigenvalue $\lambda_1$;
The epidemic threshold for a virus propagating in the network is related to $1/\lambda_1$;
The Laplacian spectrum indicates the community structure:
1. k disconnected communities: the k smallest Laplacian eigenvalues are 0;
2. k loosely connected communities: the k smallest Laplacian eigenvalues are close to 0.
35
Topological vs. Spectral Features
Laplacian spectrum & communities
36
Disconnected communities (example Laplacian spectrum; three zero eigenvalues for three components):
0.00 0.00 0.00 1.27 2.59 3.00 3.00 3.00 4.00 4.00 4.00 4.73 5.00 5.41 6.00 6.00 6.00 6.00 6.00
Loosely connected communities (one zero and two near-zero eigenvalues):
0.00 0.11 0.34 1.31 2.60 3.00 3.10 3.36 4.00 4.13 4.59 4.79 5.31 5.58 6.00 6.00 6.00 6.66 7.12
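A small numpy sketch of this effect (the two 6-node example graphs are illustrative, not the ones on the slide):

import numpy as np

def laplacian_eigenvalues(edges, n):
    """Return the sorted Laplacian eigenvalues of an undirected graph."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    L = np.diag(A.sum(axis=1)) - A
    return np.sort(np.linalg.eigvalsh(L))

# Two triangles: {0,1,2} and {3,4,5}
disconnected = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
# Same two triangles plus one bridge edge (2,3): loosely connected
loosely = disconnected + [(2, 3)]

print(laplacian_eigenvalues(disconnected, 6))  # two zero eigenvalues (two components)
print(laplacian_eigenvalues(loosely, 6))       # one zero plus one small eigenvalue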
Eigenspace [Ying SDM09]
37
In the k-dimensional eigenspace spanned by the leading eigenvectors $x_1, x_2, \ldots, x_k$, node i is represented by the point $(x_{i1}, x_{i2}, \ldots, x_{ik})$, i.e., the i-th row of the $n \times k$ matrix $[x_1\ x_2\ \cdots\ x_k]$.
Topological vs. Spectral Features
Topological & spectral features are related:
No. of triangles: $\frac{1}{6}\sum_i \lambda_i^3$ (sum over the adjacency eigenvalues)
Subgraph centrality: $SC = \sum_i e^{\lambda_i}$
Graph diameter: also bounded in terms of the spectrum
38
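As a quick check of the first relation, a small numpy sketch (the 5-node example graph is arbitrary):

import numpy as np

# Small undirected graph: a 4-clique {0,1,2,3} plus a pendant node 4
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
n = 5
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

eig = np.linalg.eigvalsh(A)
triangles_spectral = np.sum(eig ** 3) / 6.0                       # (1/6) * sum of lambda_i^3
triangles_trace = np.trace(np.linalg.matrix_power(A, 3)) / 6.0    # trace(A^3)/6

print(triangles_spectral, triangles_trace)  # both equal 4, the triangles of the 4-clique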
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
39
K-anonymity Privacy Preservation
K-anonymity (Sweeney 1998)
Each individual is indistinguishable from at least K-1 other individuals
A general definition for network data [Hay, VLDB08]
40
K-anonymity
Each node is indistinguishable from at least K-1 other nodes
under topology-based attacks.
The adversary is assumed to have some knowledge of the
target user:
node degree (K-degree)
(immediate) neighborhood (K-neighborhood)
arbitrary subgraph (K-automorphism etc.)
K-anonymity approach guarantees that no node in the
released graph can be linked to a target individual with
success prob. greater than 1/K.
41
K-degree Anonymity [Liu SIGMOD08]
Attacking model:
The attackers know the degree of the targeted individual.
42
K-degree Anonymity [Liu SIGMOD08]
K-degree anonymous: every node shares its degree with at least K-1 other nodes.
Optimize utility: minimize the number of added edges.
43
K-degree Anonymity [Liu SIGMOD08]
Algorithm outline:
1. Degree-sequence anonymization: find a K-anonymous degree sequence as close as possible to the original one (dynamic programming).
2. Graph construction: realize the anonymized degree sequence by adding edges to the original graph.
44
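A minimal sketch of the degree-sequence step only, using a simple greedy grouping instead of the dynamic program of [Liu SIGMOD08] (function and variable names are illustrative; the graph-construction step is not shown):

def k_anonymize_degrees(degrees, K):
    """Return a K-anonymous degree sequence >= the original (greedy sketch).

    Degrees are processed in non-increasing order; each group of K
    consecutive degrees is raised to the group's maximum so that every
    value appears at least K times.
    """
    order = sorted(range(len(degrees)), key=lambda i: -degrees[i])
    anon = list(degrees)
    i = 0
    while i < len(order):
        group = order[i:i + K]
        # if the tail would be left with fewer than K nodes, absorb it
        if len(order) - i < 2 * K:
            group = order[i:]
        target = max(degrees[j] for j in group)
        for j in group:
            anon[j] = target
        i += len(group)
    return anon

print(k_anonymize_degrees([5, 4, 4, 3, 2, 2, 1], K=2))   # [5, 5, 4, 4, 2, 2, 2]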
K-neighborhood Anonymity [Zhou ICDE08]
Attacking model
The attackers know the immediate neighborhood of a targeted
individual.
45
K-neighborhood Anonymity [Zhou ICDE08]
Problem:
46
K-neighborhood Anonymity [Zhou ICDE08]
Algorithm outline
1. Extract the neighborhoods of all vertices in the network.
2. Compare and test all neighborhoods by neighborhood component coding
3. Organize vertices into groups and anonymize the neighborhoods of
vertices in the same group until the graph satisfies K-neighborhood
anonymity.
47
(Figure: the 1-neighborhood of Ada in the naively anonymized graph vs. in the K-neighborhood anonymized graph.)
K-neighborhood Anonymity [Zhou ICDE08]
Graph utility:
1. The nodes have hierarchical label information.
2. Two ways to anonymize the neighborhoods: generalizing labels
and adding edges.
3. Focus on using the anonymized graph to answer aggregate network queries as accurately as possible.
48
K-automorphism Anonymity [ZouVLDB09]
Attacking model:
The attackers can know any subgraph that contains the targeted
individual.
Graph automorphism
49
K-automorphism Anonymity [ZouVLDB09]
K-automorphic graph
50
K-automorphism Anonymity [ZouVLDB09]
Algorithm outline:
1. Partition graph G into several groups of subgraphs; each group contains at least K subgraphs, and no two subgraphs share a node.
2. Block Alignment: make subgraphs within each group
isomorphic to each other.
3. Edge Copy: copy the edges across the subgraphs properly.
51
K-symmetry Model [Wu EDBT10]
Attacking model:
The attackers can know any subgraph that contains the targeted
individual.
K-symmetry approach:
1. A concept similar to K-automorphism (equivalent?)
2. Make the graph K-symmetric by adding fake nodes
52
K-isomorphism Model [Cheng SIGMOD10]
Attacking model:
The attackers can know any subgraph that contains the targeted
individual.
Insufficient protection on link privacy by K-
automorphism approach
53
Example: the adversary cannot identify Alice or Bob, but there must
be a link between them.
K-isomorphism Model [Cheng SIGMOD10]
K-security graph
54
K-anonymity
55
(Figure: the K-degree, K-neighborhood, K-automorphism, K-symmetry, and K-security models compared along privacy protection vs. utility preservation.)
K-obfuscation [Bonchi ICDE11]
56
K-obfuscation [Bonchi ICDE11]
57
Both cases respect 2-candidate anonymity.
K-candidate aims at guaranteeing a lower bound on the amount of
uncertainty.
K-obfuscation measures the uncertainty.
The obfuscation level quantified by means of entropy is always at least as large as the one based on a-posteriori belief probabilities.
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
58
Generalization Approach [Hay VLDB08]
Generalize nodes into super nodes and edges into super
edges
59
Generalization Approach [Hay VLDB08]
The size of possible graph world:
Maximize the graph likelihood function.
Simulated annealing algorithm:
1. Start with a single partition containing all nodes;
2. Update the state by splitting/merging partitions or moving a node to a new partition.
60
Anonymizing Rich Graph
Social networks contain much richer information in addition to the graph topology.
individual attributes including demographic information, such as age, gender and location, and sensitive personal data, such as political and religious preferences.
relationships may denote various types of interactions, and one interaction can also involve more than two participants.
Queries are on the combinations of properties of entities and patterns of the link structure.
K-anonymous models based only on link structure may still leak privacy, e.g., all nodes in a K-anonymous group are associated with the same sensitive information.
61
Link Protection [Zheleva PinKDD07]
Graph model: one type of nodes but multiple types of edges, which are classified as either sensitive or non-sensitive.
The problem of link re-identification is to infer sensitive relationships from non-sensitive ones.
Cluster-edge anonymization is to aggregate edges by type to protect link privacy.
all the anonymized nodes in an equivalence class are collapsed into a single super-node.
release the number of edges of each type between equivalence classes.
62
Clustering Approach [Campan PinKDD08]
Graph model:
edges are not labeled
nodes are associated with attributes (identifier, quasi-
identifier, and sensitive ones).
K-anonymity model via cluster collapsing and
generalization is for both the quasi-identifier attributes
and the quasi-identifier relationship homogeneity.
Two nodes from any cluster are indistinguishable based on
either their relationships or their attributes.
63
Anonymizing Bipartite Graph [Cormode VLDB08]
Graph model
two types of entities
an association only exists between two entities of different types, e.g., customers buy products.
The relationship is sensitive while entity attributes are public.
The anonymization method is to preserve the graph structure exactly by masking the mapping from entities to nodes.
Aggregate queries are used to evaluate utility.
the total number of OTC products bought by NJ customers.
64
Anonymizing Rich Interaction Graph [Bhagat VLDB09]
Hyper-graph: G(V, I, E) represents multiple types of interactions between entities.
Attacking model: the attackers know part of the links and nodes in the graph.
65
Anonymizing Rich Graph [Bhagat VLDB09]
Algorithm outline:
1. Sort the nodes according to their attributes;
2. Group nodes into super nodes satisfying the class safety property, with the size of each super node greater than K.
Class safety: a node cannot have interactions with two or more nodes from the same group.
3. Replace the node identifiers by the label list.
66
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
67
Basic Graph Randomization Operations
1. Rand Add/Del: randomly add k false edges and delete k true edges (number of edges unchanged)
2. Rand Switch: randomly switch a pair of edges, and repeat k times (node degrees unchanged)
(A small code sketch of both operations follows the example figure below.)
68
(Figure: a 5-node example graph before and after Rand Add/Del and Rand Switch.)
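A minimal sketch of the two operations (edges are stored as sorted node pairs; connectivity and mixing issues are ignored):

import random

def rand_add_del(edges, nodes, k):
    """Randomly delete k true edges and add k false (non-existing) edges."""
    original = set(edges)
    perturbed = set(original)
    for e in random.sample(sorted(original), k):
        perturbed.discard(e)
    added = 0
    while added < k:
        i, j = random.sample(nodes, 2)
        e = (min(i, j), max(i, j))
        if e not in original and e not in perturbed:
            perturbed.add(e)
            added += 1
    return perturbed

def rand_switch(edges, k):
    """Repeat k times: pick edges (a,b),(c,d) and rewire to (a,d),(c,b); degrees unchanged."""
    edges = set(edges)
    done = 0
    while done < k:
        (a, b), (c, d) = random.sample(sorted(edges), 2)
        if random.random() < 0.5:          # consider both possible rewirings
            c, d = d, c
        new1 = (min(a, d), max(a, d))
        new2 = (min(c, b), max(c, b))
        if len({a, b, c, d}) == 4 and new1 not in edges and new2 not in edges:
            edges -= {(a, b), (min(c, d), max(c, d))}
            edges |= {new1, new2}
            done += 1
    return edges

example = {(1, 2), (2, 3), (3, 4), (4, 5), (1, 5)}
print(rand_add_del(example, nodes=[1, 2, 3, 4, 5], k=2))
print(rand_switch(example, k=2))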
Resilient to Subgraph Attack
69
Links both within and outside the attack subgraph are changed.
Randomization
Randomized response model [Warner 1965]
70
Purpose: estimate the proportion $\pi_A$ of population members that cheated in the exam (group A), without learning any individual's answer.
Procedure: each respondent uses a randomization device. With probability p the device asks "Do you belong to A?", and with probability 1-p it asks "Do you belong to the complement of A?". The respondent answers the selected question truthfully with "Yes" or "No".
Probability of a "Yes" answer: $\lambda = p\,\pi_A + (1-p)(1-\pi_A)$
Unbiased estimate (for $p \neq 1/2$): $\hat{\pi}_A = \frac{\hat{\lambda} + p - 1}{2p - 1}$, where $\hat{\lambda}$ is the observed fraction of "Yes" answers.
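A small simulation sketch of this procedure and estimator (the population size, cheating rate, and p are arbitrary choices):

import random

def randomized_response(truth, p):
    """Answer about A with probability p, about the complement of A with probability 1-p."""
    ask_about_A = random.random() < p
    return truth if ask_about_A else (not truth)

def estimate_pi_A(answers, p):
    """Unbiased estimate pi_hat = (lambda_hat + p - 1) / (2p - 1), valid for p != 1/2."""
    lam = sum(answers) / len(answers)      # observed fraction of "Yes"
    return (lam + p - 1.0) / (2.0 * p - 1.0)

random.seed(0)
truths = [random.random() < 0.2 for _ in range(100000)]   # 20% of the population cheated
answers = [randomized_response(t, p=0.8) for t in truths]
print(estimate_pi_A(answers, p=0.8))                      # close to 0.2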
Randomization [Agrawal SIGMOD00]
71
(Figure: original records such as 50 | 40K | ... and 30 | 70K | ... pass through a randomizer; e.g., Alice's age 30 becomes 65 (30+35) after adding a random number to Age. The distributions of Age and Salary are then reconstructed from the randomized records (65 | 20K | ..., 25 | 60K | ...) and fed to a classification algorithm to build the model.)
Reconstruction
Given:
the randomized values x1+y1, x2+y2, ..., xn+yn, where the xi are the original values;
the probability distribution of the noise Y.
Estimate the probability distribution of X.
72
Iterative reconstruction algorithm:
$f_X^0 :=$ uniform distribution; $j := 0$
repeat
    $f_X^{j+1}(a) := \frac{1}{n}\sum_{i=1}^{n} \frac{f_Y((x_i+y_i)-a)\, f_X^{j}(a)}{\int f_Y((x_i+y_i)-z)\, f_X^{j}(z)\, dz}$
    $j := j + 1$
until (stopping criterion met)
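A discretized numpy sketch of this iteration (the bins, the Gaussian noise model, and the fixed iteration count are illustrative choices):

import numpy as np

def reconstruct_distribution(z, noise_pdf, bins, iters=200):
    """Iteratively estimate f_X on the given bin centers from z_i = x_i + y_i."""
    fX = np.full(len(bins), 1.0 / len(bins))          # start from the uniform distribution
    for _ in range(iters):
        lik = noise_pdf(z[:, None] - bins[None, :])   # lik[i, a] = f_Y(z_i - bin_a)
        post = lik * fX[None, :]
        post /= post.sum(axis=1, keepdims=True)       # Bayes rule per record
        fX = post.mean(axis=0)                        # average the posteriors
    return fX

# Example: X uniform on [20, 60], noise Y ~ N(0, 10^2)
rng = np.random.default_rng(0)
x = rng.uniform(20, 60, size=5000)
z = x + rng.normal(0, 10, size=5000)
bins = np.linspace(0, 80, 81)
gauss = lambda t: np.exp(-t**2 / (2 * 10**2)) / np.sqrt(2 * np.pi * 10**2)
fX = reconstruct_distribution(z, gauss, bins)
print(fX[(bins >= 20) & (bins <= 60)].sum())          # most mass lands back in [20, 60]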
200
400
600
800
1000
1200
20 60
Age
Nu
mb
er o
f P
eo
ple
Original
Randomized
Reconstructed
[Agarawal SIGMOD00]
Outline on Graph Randomization
Link privacy
the probability of existence of a link (i, j) given the perturbed graph: $P(a_{ij}=1 \mid \tilde{G}) = ?$
Feature preserving randomization
Spectrum preserving randomization
Markov chain based feature preserving randomization
Reconstruction from randomized graph
73
Prior probability:
Posterior probabilities
74
Randomized
graph
i
j
ijm~
Method II:
If similarity is large, link (i. j) is
more likely to be a true link
Method I:
m
kmaaP ijij )1~|1(
)~,1~|1( xmaaP ijijij
2/)1()1(
nn
maP ij
Link Privacy: Posterior Beliefs [Ying PAKDD09]
Link Privacy: Posterior Beliefs [Ying PAKDD09]
75
Posterior probability and similarity measures
(Figures: similarity vs. proportion of existing edges, before and after randomization.)
Link Privacy: Posterior Beliefs [Ying PAKDD09]
76
Posterior probability and similarity measures
$\sum_{i<j} P(a_{ij}=1) = \sum_{i<j} P(a_{ij}=1 \mid \tilde{a}_{ij}) = \sum_{i<j} P(a_{ij}=1 \mid \tilde{a}_{ij}, \tilde{m}_{ij}) = m$ (prior prob., posterior prob. I, posterior prob. II)
How to calculate the posterior probability $P(a_{ij}=1 \mid \tilde{G})$ for general cases?
Exploit graph space to breach link privacy
Link Privacy: Graph Space [Ying SDM09]
77
Sample the graph space when the space is large
1. Start with the randomized graph, construct a Markov chain,
and uniformly sample the graph space.
2. Generate N uniform graph samples
Empirical evaluations show that node pairs with highest
probabilities have serious link disclosure risk (as high as 90%).
Link Privacy: Graph Space [Ying SDM09]
78
Samples: $G_1, G_2, \ldots, G_N \in$ GraphSpace
$\Pr(a_{ij}=1 \mid \text{GraphSpace}) \approx \frac{1}{N}\sum_{k=1}^{N} a_{ij}(G_k)$, i.e., the fraction of sampled graphs that contain edge (i, j).
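A rough sketch of this estimator, reusing the rand_switch sketch above; it walks the space of graphs with the same degree sequence, and the burn-in and accept/reject details needed for true uniform sampling in [Ying SDM09] are omitted:

def link_probabilities(randomized_edges, N=500, k=50):
    """Estimate Pr(a_ij = 1 | graph space) as the fraction of sampled graphs containing (i, j)."""
    counts = {}
    g = set(randomized_edges)
    for _ in range(N):
        g = rand_switch(g, k)              # k switches between consecutive samples
        for e in g:
            counts[e] = counts.get(e, 0) + 1
    return {e: c / N for e, c in counts.items()}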
Graph Features Under Pure Randomization
Topological and spectral features change significantly
along the randomization.
79
(Network of US political books, 105 nodes and 441 edges)
Can we better preserve
the network structure?
Spectrum Preserving Randomization [Ying SDM08]
Graph spectrum is related to many real graph features.
Preserve graph features by preserving some eigenvalues.
80
Spectrum Preserving Randomization [Ying SDM08]
Spectral Switch (applied to the adjacency matrix):
Up-switch to increase the eigenvalue:
Down-switch to decrease the eigenvalue:
81
(Figure: a switch among nodes t, u, v, w guided by their eigenvector components $x_t, x_u, x_v, x_w$.)
Spectral Switch (applied to the Laplacian matrix):
Down-switch to decrease the eigenvalue:
Up-switch to increase the eigenvalue:
(Figure: a switch among nodes t, u, v, w guided by their Laplacian eigenvector components $y_t, y_u, y_v, y_w$.)
82
Spectrum Preserving Randomization [Ying SDM08]
Algorithm outline
83
Spectrum Preserving Randomization [Ying SDM08]
Preserve any graph feature S(G) within a small range
Feature range constraint specified by the user
Markov chain with feature range constraint
Markov Chain Based Feature Preserving
Randomization [Ying SDM09]
84
Feature range constraint: $S(G) \in R = [s_1, s_2]$ and $S(\tilde{G}) \in R = [s_1, s_2]$.
Markov chain with feature range constraint: $G_0 \rightarrow G_1 \rightarrow \cdots \rightarrow G_t = \tilde{G}$, where every intermediate graph satisfies $S(G_i) \in R$
(uniformity on accessible graphs)
Markov Chain Based Feature Preserving Randomization [Ying SDM09]
85
Feature constraint can be used to breach link privacy.
(Figure: original vs. released graph.)
Reconstruction from Randomized Graph [Wu SDM10]
Motivation
From the randomized graph, can we reconstruct a graph
whose features are closer to the true features?
86
Low rank approximation approach
1. Best rank-r approximation by eigen-decomposition: $\tilde{A}_r = \sum_{i=1}^{r} \tilde{\lambda}_i \tilde{x}_i \tilde{x}_i^T$
2. Discretize the low rank matrix back to a 0-1 adjacency matrix
87
Reconstruction from Randomized Graph [Wu SDM10]
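A minimal numpy sketch of the two steps (the rank r and the discretization rule, i.e., keeping the m largest scores as edges, are illustrative choices rather than the exact ones in [Wu SDM10]):

import numpy as np

def low_rank_reconstruct(A_rand, r):
    """Rank-r eigen-approximation of the randomized adjacency matrix, then discretize."""
    n = A_rand.shape[0]
    m = int(A_rand.sum() / 2)                      # keep the same number of edges
    vals, vecs = np.linalg.eigh(A_rand)
    idx = np.argsort(-np.abs(vals))[:r]            # r eigenvalues largest in magnitude
    A_r = (vecs[:, idx] * vals[idx]) @ vecs[:, idx].T
    # discretize: set the m largest off-diagonal scores (i < j) to 1
    iu = np.triu_indices(n, k=1)
    keep = np.argsort(-A_r[iu])[:m]
    A_hat = np.zeros_like(A_rand)
    A_hat[iu[0][keep], iu[1][keep]] = 1.0
    return A_hat + A_hat.T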
Effect of including significant negative eigenvalues
88
(Figure: the original adjacency pattern and reconstructions with r = 1, r = 2, and r = 4.)
Reconstruction from Randomized Graph [Wu SDM10]
89
Reconstruction from Randomized Graph [Wu SDM10]
Algorithm Outline:
1. Compute the eigen-decomposition of the randomized adjacency matrix;
2. Form the best rank-r approximation, including significant negative eigenvalues;
3. Discretize the low-rank matrix into a 0-1 adjacency matrix.
(Figure: feature values of the original graph, the randomized graph, and the reconstructed graph.)
90
Reconstruction from Randomized Graph [Wu SDM10]
Reconstructed graphs do not jeopardize link privacy for
real-world networks.
-- Privacy measured by the proportion of different edges:
91
Reconstruction from Randomized Graph [Wu SDM10]
92
Graphs with low rank may have privacy breached by
reconstruction:
Reconstruction on synthetic low rank graphs
93
Reconstructing Randomized Social Networks & Features [Vuokko SDM10]
Graph and feature data:
1. Original graph
2. Binary feature matrix
3. Two individuals with higher similarity in features are more likely to be connected in the graph.
Reconstruction problem:
Maximum likelihood estimation is adopted to reconstruct the original graph and features.
Random sparsification [Bonchi ICDE11]
only remove edges from the graph without adding new edges.
outperforms random add/del in terms of utility preservation, partially due to the small world phenomenon:
adding random long-haul edges brings nodes close,
while removing an edge does not push nodes far apart, since there exist alternative paths.
utility vs. privacy trade-off for various randomization strategies.
94
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
95
Edge-weighted Graph
Edge weights could be sensitive, e.g., trustworthiness of
user A according to user B, or transaction amount
between two accounts.
96
Anonymizing Edge-weighted Graph [Das TR09]
Some properties of edge weights in terms of some
functions are preserved.
Relative distances between nodes for shortest paths or
kNN queries.
A framework for edge weight anonymization of graph
data that preserves linear properties.
A linear property can be expressed by a specific set of
linear inequalities of edge weights.
Finding new weights for each edge is a linear programming
problem.
97
Gaussian Randomization [Liu SDM09]
Perturb edge weights while preserving global and local
utilities. Graph structure is unchanged.
Gaussian randomization multiplication
The original weight of each edge is multiplied by a random
Gaussian noise with mean 1 and some variance.
In the original graph, if the shortest distance d(A,B) is
much smaller than d(C,D), the order is highly likely to be preserved.
Greedy perturbation
Preserve a set of shortest distances.
98
Anonymizing Multi-graphs [Li SDM11]
How to generate an anonymized collection of graphs where each graph corresponds to an individual’s behavior.
XML representation of attributes about an individual
Click-graph in a user-session
Route for a given individual in a time-period
Condensation based approach
create constrained clusters of size at least K
construct a super-template to represent properties of the group
generate anonymized graphs from super-template
99
Computing Privacy Scores [Liu ICDM09]
The privacy score measures the user’s potential privacy risk due to
her online information sharing behaviors. It increases with
sensitivity of the information being shared
visibility of the revealed information in the network
100
Computing Privacy Scores [Liu ICDM09]
Item Response Theory based model
Used to measure the abilities of examinees, the difficulty
of the questions, and the prob. of an examinee to answer a
question correctly.
Each examinee is mapped to a user, and each question is
mapped to a profile item. The difficulty parameter is to
quantify the sensitivity of a profile item.
The true visibility is estimated from observed profiles.
101
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
102
Output Perturbation
103
Data owner <-> Data miner
name sex age disease salary
Ada F 18 cancer 25k
Bob M 25 heart 110k
Cathy F 20 cancer 70k
Dell M 65 flu 65k
Ed M 60 cancer 300k
Fred M 24 flu 20k
George M 22 cancer 45k
Harry M 40 flu 95k
Irene F 45 heart 70k
Query f
Query result + noise
Differential Guarantee [Dwork, TCC06]
104
name disease
Ada cancer
Bob heart
Cathy cancer
Dell flu
Ed cancer
Fred flu
f count(#cancer)
f(x) + noise
name disease
Ada cancer
Bob heart
Cathy cancer
Dell flu
Ed cancer
Fred flu
K
K
f count(#cancer)
f(x’) + noise
3 + noise
2 + noise
Two databases (x, x’) differ in only one row.
Differential Guarantee
Require that the prob. distribution is essentially the same independent of whether any individual opts in to, or opts out of the database.
Anything that can be learned about a respondent from a statistical database should be learnable without access to the database. (Tore Dalenius 1977 ad omnia privacy)
Independent of adversary knowledge.
Different from prior work on comparing an adversary's prior and posterior views of an individual.
105
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
106
ε-differential privacy
ε is a privacy parameter: smaller ε = stronger privacy
Differential Privacy [Dwork, TCC06 & Dwork CACM11]
Two neighboring datasets are defined in terms of
• Hamming distance: |(x - x') ∪ (x' - x)| = 1
• Symmetric distance
107
ε-differential privacy
ε is a privacy parameter: smaller ε = stronger privacy
Differential Privacy [Dwork, TCC06 & Dwork CACM11]
108
ε is public and its selection is a social question.
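Written out, the definition from [Dwork, TCC06] is: a randomized mechanism K gives ε-differential privacy if, for all neighboring datasets x, x' and for every set of outputs S ⊆ Range(K),
$\Pr[K(x) \in S] \le e^{\epsilon} \cdot \Pr[K(x') \in S]$.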
Calibrating Noise
109
Laplace distribution
Sensitivity of function
global sensitivity
local sensitivity
Multiple queries
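The standard quantities behind these bullets (following [Dwork, TCC06]):
Global sensitivity: $\Delta f = \max_{x, x'} \lVert f(x) - f(x') \rVert_1$ over neighboring datasets x, x'.
Laplace distribution: Lap(b) has density $\frac{1}{2b} e^{-|t|/b}$.
Laplace mechanism: returning $f(x) + \mathrm{Lap}(\Delta f / \epsilon)$, with independent noise in each output coordinate, satisfies ε-differential privacy.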
Laplace Distribution
110
(Figure: density of Lap(b).)
Gaussian Distribution
111
(Figure: density of the Gaussian distribution.)
Sensitivity
112
name sex age disease salary
Ada F 18 cancer 25k
Bob M 25 heart 110k
Cathy F 20 cancer 70k
Dell M 65 flu 65k
Ed M 60 cancer 300k
Fred M 24 flu 20k
George M 22 cancer 45k
Harry M 40 flu 95k
Irene F 45 heart 70k
Function f | Sensitivity
Count(#cancer) | 1
Sum(salary) | u (domain upper bound)
Avg(salary) | u/n
Complex functions or data mining tasks can be decomposed into a sequence of simple functions.
L-1 distance for
vector output
Differential Guarantee
113
name disease
Ada cancer
Bob heart
Cathy cancer
Dell flu
Ed cancer
Fred flu
f count(#cancer)
K: f(x) + Lap(1/ε)
ε = 1, ε = 2
Difference of Prob.
Neighboring Datasets
114
• Two neighboring datasets are defined in terms of
• Hamming distance: |(x - x') ∪ (x' - x)| = 1 --- presence/absence of an individual row
• Symmetric distance --- hiding the value of an individual row
• What about two datasets differing by k rows?
Multiple Queries
115
• The sensitivity of a sequence of two counting
queries is 2. Adding independent noise Lap(2/ε)
to each answer yields ε-differential privacy.
Histogram Query
SELECT count(*)
FROM table
GROUP BY disease
name sex age disease salary
Ada F 18 cancer 25k
Bob M 25 heart 110k
Ed M 60 cancer 300k
George M 22 cancer 45k
Harry M 40 flu 95k
Irene F 45 heart 70k
[3, 2, 1]  (cancer, heart, flu)
[3 + Lap(1/ε), 2 + Lap(1/ε), 1 + Lap(1/ε)]
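A two-line numpy sketch of the noisy histogram release (epsilon = 1 is an arbitrary choice; the counts are those from the slide):

import numpy as np

counts = np.array([3, 2, 1])                 # cancer, heart, flu
epsilon = 1.0
noisy = counts + np.random.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
print(noisy)   # the histogram query has sensitivity 1, so Lap(1/epsilon) per cell suffices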
Complex Data Mining Task
Many data mining tasks can be decomposed
into a series of noisy sum and/or noisy
count queries.
Known data mining algorithms include
SVD, ID3 decision tree, K-means clustering,
association rule etc.
117
Recent Development
[Xiao ICDE10] Dependencies among the queries can be
exploited to improve the accuracy of responses.
[Li PODS10] Matrix mechanism for answering a workload of
predicate counting queries
[Kifer SIGMOD11] Misconceptions of differential privacy.
118
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
119
Private Query Answering on Networks [Hay ICDM09]
120
Two neighboring graphs can be defined to differ by a
single edge, K edges, or a single node.
edge ε-differential privacy
query output is indistinguishable whether any single edge is present or absent.
K-edge ε-differential privacy
query output is indistinguishable whether any set of k edges is present or absent.
node ε-differential privacy
query output is indistinguishable whether any single node (and all its edges) is present or absent.
Lap(Δf/ε)
Lap(K·Δf/ε)
Degree Sequence
The list of degrees of each node in a graph
The degree sequence of a network may be sensitive as it
can be used to determine the graph structure by
incorporating other graph statistics
121
Two Equivalent Queries
122
Degree sequence: D(G) = [1,1,3,3,3,3,2], D(G') = [1,1,3,3,2,2,2]
ΔD = 2, Lap(2/ε) for each component
Degree histogram (# of nodes with degree i, i = 0, ..., n-1): F(G) = [0,2,1,4,0,0,0], F(G') = [0,2,3,2,0,0,0]
ΔF = 4, Lap(4/ε) for each component
Boosting Accuracy [Hay ICDM09]
123
Rewrite query D to get S with constraint Cs
Submit S to the private mechanism K
Perturbed answer A(S)
Perform inference on A(S) with
constraints Cs to derive a better estimation
Formulating Query D
124
Degree sequence D(G)=[1,1,3,3,3,3,2] D(G’)=[1,1,3,3,2,2,2]
ΔD = 2, Lap(2/ε) for each component
Query S: return the i-th smallest degree
S(G) = [1,1,2,3,3,3,3]; the perturbed answer S(G) + Lap(2/ε) could be
[3, 2, …]
A new (and more accurate) sequence can be derived by computing the
closest non-decreasing sequence.
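The post-processing step (closest non-decreasing sequence) is an isotonic regression; below is a small pool-adjacent-violators sketch using the L2 objective, not necessarily the exact formulation in [Hay ICDM09]:

def closest_nondecreasing(noisy):
    """L2-closest non-decreasing sequence via pool-adjacent-violators."""
    merged = []                                # stack of [block mean, block size]
    for v in noisy:
        merged.append([v, 1])
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            m2, m1 = merged.pop(), merged.pop()
            total = m1[0] * m1[1] + m2[0] * m2[1]
            merged.append([total / (m1[1] + m2[1]), m1[1] + m2[1]])
    out = []
    for mean, size in merged:
        out.extend([mean] * size)
    return out

print(closest_nondecreasing([1.2, 0.7, 2.4, 3.1, 2.9, 3.0, 3.3]))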
Accurate Motif Analysis
Measures the frequency of occurrence of small
subgraphs in a network, e.g., # of triangles
125
(Example: two graphs differing in a single edge can contain n-2 triangles vs. 0 triangles.)
High sensitivity!
Weakening Privacy
Statistics such as transitivity, clustering coefficient,
centrality, and path-lengths have high sensitivity values.
Possible solutions
[Nissim STOC07] Smooth sensitivity, adopting local
sensitivity, i.e., the max. change between Q(I) and Q(I’) for any
I’ in neighbor(I).
[Rastogi PODS09] Adversary privacy, limiting assumptions
of the prior knowledge of the adversary
More exploration is needed on robust statistics and
differential privacy.
126
Model based Data Publishing
127
Data owner Data miner
name sex age disease salary
Ada F 18 cancer 25k
Bob M 25 heart 110k
Cathy F 20 cancer 70k
Dell M 65 flu 65k
Ed M 60 cancer 300k
Fred M 24 flu 20k
George M 22 cancer 45k
Harry M 40 flu 95k
Irene F 45 heart 70k
Build models (e.g., contingency table, power-law graph)
Release differentially private model parameters
Generate synthetic data using models with perturbed parameters
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
128
Challenges
How to quantify graph utility-privacy tradeoff especially for (dynamic) rich graphs?
existing data publication techniques do not provide guarantees of accuracy of graph analysis.
how to define utility?
Scalability is always an issue.
Differential privacy preserving social network mining
sensitivity of statistics and relaxation
129
References
[Agrawal, SIGMOD00] R. Agrawal and R. Srikant. Privacy preserving data mining. SIGMOD, 2000.
[Aggarwal, 08] C. C. Aggarwal and P. S. Yu. Privacy-preserving data mining: models and algorithms. Springer, 2008.
[Backstrom, WWW07] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou R3579X? Anonymized social networks, hidden patterns and structural steganography. WWW, 2007.
[Bhagat, VLDB09] S. Bhagat, G. Cormode, B. Krishnamurthy, and D. Srivastava. Class-based graph anonymization for social network data. VLDB, 2009.
[Bonchi, ICDE11] F. Bonchi, A. Gionis, and T. Tassa. Identity Obfuscation in Graphs Through the Information Theoretic Lens. ICDE, 2011.
[Campan, PinKDD08] A. Campan and T. M. Truta. A clustering approach for data and structural anonymity in social network data. PinKDD, 2008.
[Cheng, SIGMOD10] J. Cheng, A. Fu, and J. Liu. K-isomorphism: privacy preserving network publication against structural attacks. SIGMOD, 2010.
[Cormode, VLDB08] G. Cormode, D. Srivastava, T. Yu, and Q. Zhang. Anonymizing bipartite graph data using safe groupings. VLDB, 2008.
130
References
[Das, TR09] S. Das, Omer Egecioglu, and A. E. Abbadi. Anonymizing edge weighted social network graphs. 2009.
[Dwork, CACM11] C. Dwork. A firm foundation for private data analysis. CACM, 2011.
[Dwork, TCC06] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. TCC, 2006.
[Gross, WPES05] R. Gross and A. Acquisti. Information revelation and privacy in online social networks (the Facebook case). WPES, 2005.
[Hanhijarvi, SDM09] S. Hanhijarvi, G. C. Garriga, and K. Puolamaki. Randomization techniques for graphs. SDM, 2009.
[Hay, VLDB08] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting structural re-identification in anonymized social networks. VLDB, 2008.
[Hay, 07] M. Hay, G. Miklau, D. Jensen, P. Weis, and S. Srivastava. Anonymizing social networks. 2007.
131
References
[Hay, 09] M. Hay, G. Miklau and D. Jensen. Enabling accurate analysis of private data analysis. 2009.
[Kifer, SIGMOD11] D. Kifer and A. Machanavajjhala. No Free Lunch in Data Privacy. SIGMOD, 2011.
[Korolova, ICDE08] A. Korolova, R. Motwani, S. Nabar, and Y. Xu. Link privacy in social networks. ICDE, 2008.
[Li, PODS10] C. Li, M. Hay, V. Rastogi, G. Miklau, A. McGregor. Optimizing linear counting queries under differential privacy. PODS, 2010.
[Liu, SIGMOD08] K. Liu and E. Terzi. Towards identity anonymization on graphs. SIGMOD 2008.
[Liu, ICDM09] K. Liu and E. Terzi. A framework for computing the privacy scores of users in online social networks. ICDM 2009.
[Liu, SDM09] L. Liu, J. Wang, J. Liu and J. Zhang. Privacy preserving in social networks against sensitive edge disclosure. SDM, 2009.
[McSherry, FOCS07] F. McSherry and K. Talwar. Mechanism design via differential privacy. FOCS, 2007.
132
References
[Narayanan, ISSP09] A. Narayanan and V. Shmatikov. De-anonymizing social networks. IEEE Symposium on Security and Privacy, 2009.
[Newman, PRE06] M. Newman. Physical Review E, 2006.
[Vuokko, SDM10] N. Vuokko and E. Terzi. Reconstructing randomized social networks. SDM, 2010.
[Wu, SDM10] L. Wu, X. Ying, and X. Wu. Reconstruction of randomized graph via low rank approximation. SDM, 2010.
[Wu, EDBT10] W. Wu, Y. Xiao, W. Wang, Z. He, and Z. Wang. k-symmetry model for identity anonymization in social networks. EDBT, 2010.
[Wu, 09] X. Wu, X. Ying, K. Liu, and L. Chen. A Survey of Algorithms for Privacy-Preservation of Graphs and Social Networks. 2009.
[Ying, SDM08] X. Ying and X. Wu. Randomizing social networks: a spectrum preserving approach. SDM, 2008.
[Ying, SDM09] X. Ying and X. Wu. Graph generation with prescribed feature constraints. SDM, 2009.
133
References [Ying, SDM09-2] X. Ying and X. Wu. On randomness measures for social networks. SDM, 2009.
[Ying, PAKDD09] X. Ying and X. Wu. On link privacy in randomizing social networks. PAKDD, 2009.
[Xiao, ICDE10] X. Xiao, G. Wang, and J. Gehrke. Differential privacy via wavelet transformation. ICDE,
2010.
[Zheleva, PinKDD07] E. Zheleva and L. Getoor. Preserving the privacy of sensitive relationships in
graph data. PinKDD, 2007.
[Zhou, ICDE08] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks.
ICDE, 2008.
[Zou, VLDB09] L. Zou, L. Chen, and M. T. Ozsu. K-automorphism: A general framework for privacy
preserving network publication. VLDB, 2009.
134
Questions?
Acknowledgments
This work was supported in part by U.S. National Science
Foundation IIS-0546027 , CNS-0831204 and CCF-1047621.
Updated version: http://dpl.sis.uncc.edu/ppsn-tut.PDF
Thank You!
135