Xintao Wu, Xiaowei Ying
University of North Carolina at Charlotte
A Tutorial of Privacy-Preservation of
Graphs and Social Networks
National Freedom of Information
2
Data Protection Laws
3
National Laws
USA
HIPAA for health care, passed August 21, 1996
sets the lowest bar; states are welcome to enact more stringent rules (e.g., California State Bill 1386)
Gramm-Leach-Bliley Act of 1999 for financial institutions
COPPA for children's online privacy
etc.
Canada: PIPEDA 2000
Personal Information Protection and Electronic Documents Act
Effective from Jan 2004
European Union (Directive 95/46/EC): passed by the European Parliament Oct 1995 and effective from Oct 1998.
Provides guidelines for member state legislation
Forbids sharing data with states that do not protect privacy
4
Privacy Breach
AOL's publication of the search histories of more than 650,000 of its users has yielded more than just one of the year's bigger privacy scandals. (Aug 6, 2006)
That database does not include names or user identities. Instead, it lists only a unique ID number for each user. AOL user 710794:
an overweight golfer, owner of a 1986 Porsche 944 and 1998 Cadillac SLS, and a fan of the University of Tennessee Volunteers Men's Basketball team;
interested in the Cherokee County School District in Canton, Ga., and has looked up the Suwanee Sports Academy in Suwanee, Ga., which caters to local youth, and the Youth Basketball of America's Georgia affiliate;
regularly searches for "lolitas," a term commonly used to describe photographs and videos of minors who are nude or engaged in sexual acts.
5
Source: AOL's disturbing glimpse into users' lives, by Declan McCullagh, CNET News.com,
August 7, 2006, 8:05 PM PDT
Privacy Preserving Data Mining
Data mining: the goal of data mining is summary results (e.g., classification, clustering, association rules, etc.) from the data (distribution).
Individual privacy: individual values in the database must not be disclosed, or at least no close estimation can be obtained by attackers. Contractual limitations: privacy policies, corporate agreements.
Privacy preserving data mining: how to transform data such that we can build a good data mining model (data utility) while preserving privacy at the record level (privacy)?
6
PPDM on Tabular Data
7
ssn | name | zip   | race  | … | age | Sex | Bal | income | … | IntP
    |      | 28223 | Asian | … | 20  | M   | 10k | 85k    | … | 2k
    |      | 28223 | Asian | … | 30  | F   | 15k | 70k    | … | 18k
    |      | 28262 | Black | … | 20  | M   | 50k | 120k   | … | 35k
    |      | 28261 | White | … | 26  | M   | 45k | 23k    | … | 134k
  . |    . | .     | .     | … | .   | .   | .   | .      | … | .
    |      | 28223 | Asian | … | 20  | M   | 80k | 110k   | … | 15k
69% unique on zip and birth date
87% with zip, birth date and gender
Generalization (k-anonymity, L-diversity,
t-closeness etc.) and Randomization
Refer to a survey book [Aggarwal, 08]
PPDM Tutorials on Tabular Data
Privacy in data systems, Rakesh Agrawal, PODS03
Privacy preserving data mining, Chris Clifton, PKDD02, KDD03
Models and methods for privacy preserving data publishing and analysis, Johannes Gehrke, ICDM05, ICDE06, KDD06
Cryptographic techniques in privacy preserving data mining, Helger Lipmaa, PKDD06
Randomization based privacy preserving data mining, Xintao Wu, PKDD06
Privacy in data publishing, Johannes Gehrke & Ashwin Machanavajjhala, S&P09
Anonymized data: generation, models, usage, Graham Cormode & Divesh Srivastava, SIGMOD09
8
Social Network
9
Network of US political books
(105 nodes, 441 edges)
Books about US politics sold by
Amazon.com. Edges represent
frequent co-purchasing of books by
the same buyers. Nodes have been
given colors of blue, white, or red to
indicate whether they are "liberal",
"neutral", or "conservative".
Social Network
10
Social Network
Network of the political blogs on the 2004 U.S. election
(polblogs, 1,222 nodes and 16,714 edges)
11
Social Network
12
• Collaboration network of scientists [Newman, PRE06]
More Social Network Data
Newman’s collection
http://www-personal.umich.edu/~mejn/netdata/
Enron data
http://www.cs.cmu.edu/~enron/
Stanford large network dataset collection
http://snap.stanford.edu/data/index.html
13
Graph Mining
A very hot research area
Graph properties such as degree distribution
Motif analysis
Community partition and outlier detection
Information spreading
Resiliency/robustness, e.g., against virus propagation
Spectral analysis
Research development
"Managing and mining graph data" by Aggarwal and Wang, Springer 2010.
"Large graph-mining: power tools and a practitioner's guide" by Faloutsos et al., KDD09
14
Network Science and Privacy
15
Source: Jeannette Wing, Computing research: a view from DC, SNOWBIRD, 2008
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
16
Social Network Data Publishing
17
Data owner -> (release) -> Data miner
name sex age disease salary
Ada F 18 cancer 25k
Bob M 25 heart 110k
Cathy F 20 cancer 70k
Dell M 65 flu 65k
Ed M 60 cancer 300k
Fred M 24 flu 20k
George M 22 cancer 45k
Harry M 40 flu 95k
Irene F 45 heart 70k
id Sex age disease salary
5 F Y cancer 25k
3 M Y heart 110k
6 F Y cancer 70k
1 M O flu 65k
7 M O cancer 300k
2 M Y flu 20k
9 M Y cancer 45k
4 M M flu 95k
8 F M heart 70k
Threat of Re-identification
18
id Sex age disease salary
5 F Y cancer 25k
3 M Y heart 110k
6 F Y cancer 70k
1 M O flu 65k
7 M O cancer 300k
2 M Y flu 20k
9 M Y cancer 45k
4 M M flu 95k
8 F M heart 70k
Attacker
attack
Ada’s sensitive information is disclosed.
Privacy breaches:
Identity disclosure
Link disclosure
Attribute disclosure
Deriving Personal Identifying Information [Gross WPES05]
User profiles (e.g., photo, birth date, residence,
interests, friend links) can be used to estimate personal
identifying information such as SSN.
### - ## - ####
Users should pay attention to (default) privacy
preference settings of online social networks.
19
Area number: determined by zip code; followed by the group number and the sequential number.
https://secure.ssa.gov/apps10/poms.nsf/lnx/0100201030
Active and Passive Attacks [Backstrom WWW07]
20
Active attack outline:
1. Join the network by creating some new user accounts;
2. Establish a highly distinguishable subgraph H among the
attacking nodes;
3. Send links to targeted individuals from the
attacking nodes;
4. In the released graph, identify the subgraph H
among the attacking nodes;
5. The targeted individuals and their links are
then identified.
Active and Passive Attacks [Backstrom WWW07]
21
Active attacks & subgraph H
The active attack is based on the subgraph H among the attackers:
1. No other subgraph of G is isomorphic to H;
2. Subgraph H has no non-trivial automorphism;
3. H can be identified efficiently regardless of the structure of G.
Active and Passive Attacks [Backstrom WWW07]
22
Passive attack outline:
1. Observation: most nodes in the network already form a uniquely identifiable subgraph.
2. One adversary recruits k-1 of his neighbors to form the
subgraph H of size k.
3. Work similarly to active attacks.
Drawback: Uniqueness of H is not guaranteed.
23
Structural queries:
A structural query Q represents complete or partial structural
information of a targeted individual that may be available to
adversaries.
Structural queries and identity privacy:
Attacks by Structural Queries [Hay VLDB08]
Attacks by Structural Queries [Hay VLDB08]
24
Degree sequence refinement queries
Attacks by Structural Queries [Hay VLDB08]
25
Subgraph queries
The adversary is capable of gathering a fixed number of
edges around the targeted individual.
Hub fingerprint queries
A hub is a central node in a network with high degree and
high betweenness centrality. A hub fingerprint of node v is
the node's connections to a set of designated hubs within a
certain distance.
Attacks by Combining Multiple Graphs [Narayanan ISSP09]
26
Attack outline:
1. The attacker has two types of auxiliary information:
Aggregate: an auxiliary graph whose members overlap with the
anonymized target graph
Individual: the detailed information on a very small number of
individuals (called seeds) in both the auxiliary graph and the target graph.
2. Identify seeds in the target graph.
3. Identify more nodes by comparing the neighborhoods of the
de-anonymized nodes in the auxiliary graph and the target
graph (propagation).
Deriving Link Structure of Entire Network [Korolova ICDE08]
A different threat in which
An adversary subverts user accounts to get local neighborhoods and pieces them together to build the entire network.
No underlying network is released.
A registered user often can see all the links and nodes incident to him within distance d from him.
d=0 if a user can see who he links to.
d=1 if a user can also see who links to all his friends.
Analysis showed that the number of local neighborhoods needed to cover a fraction of the entire network drops exponentially as the lookahead parameter d increases.
27
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
28
Privacy Preserving Social Network Publishing
Naïve anonymization is not sufficient to prevent privacy
breaches, mainly due to link structure based attacks.
Graph topology has to be modified via
Adding/deleting edges/nodes
Grouping nodes/edges into super-nodes and super-edges
Breaking associations between node attributes and nodes
for rich graphs
How to quantify utility loss and privacy preservation in
the perturbed (and anonymized) graph?
29
Graph Utility
Utility heavily depends on mining tasks.
It is challenging to quantify the information loss in the
perturbed graph data.
Unlike tabular data, we cannot use the sum of the
information loss of each individual record.
We cannot use histograms to approximate the distribution
of graph topology.
It is more challenging when considering both structure
change and node attribute change.
30
31
Graph Utility
Topological features: Structural characteristics of the graph.
Various measures form different perspectives.
Commonly used.
Spectral features: Defined as eigenvalues of the graph's adjacency matrix or other derived
matrices.
Closely related to many topological features.
Aggregate queries: Calculate the aggregate on some paths or subgraphs satisfying the query
condition.
E.g.: the average distance from a medical doctor vertex to a teacher vertex
in a network.
Topological Features
Topological features of networks
Harmonic mean of shortest distance
Transitivity(cluster coefficient)
Subgraph centrality
Modularity (community structure)
And many others (refer to: F. Costa et al., Characterization of Complex Networks:
A Survey of measurements, 2006)
32
Graph and Matrix
Adjacency matrix
1. For an undirected graph, A is symmetric;
2. No self-links: the diagonal entries of A are all 0;
3. For an unweighted graph, A is a 0-1 matrix.
33
Spectral Features
Spectral features of networks
Adjacency spectrum
Laplacian spectrum
34
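Spelled out with the usual notation (assumed here, not shown on the slide):
Adjacency spectrum: the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ of the adjacency matrix A.
Laplacian spectrum: the eigenvalues $0 = \mu_1 \le \mu_2 \le \cdots \le \mu_n$ of $L = D - A$, where $D = \mathrm{diag}(d_1, \ldots, d_n)$ is the degree matrix.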
Topological vs. Spectral Features
Adjacency and Laplacian spectrum:
The maximum degree, chromatic number, clique number, etc. are related to the largest adjacency eigenvalue $\lambda_1$;
The epidemic threshold for a virus propagating in the network is related to $1/\lambda_1$;
The Laplacian spectrum indicates the community structure:
1. k disconnected communities: the k smallest Laplacian eigenvalues are 0;
2. k loosely connected communities: the k smallest Laplacian eigenvalues are close to 0.
35
Topological vs. Spectral Features
Laplacian spectrum & communities
36
Disconnected communities (example Laplacian spectrum; three zero eigenvalues for three components):
0.00 0.00 0.00 1.27 2.59 3.00 3.00 3.00 4.00 4.00 4.00 4.73 5.00 5.41 6.00 6.00 6.00 6.00 6.00
Loosely connected communities (one zero and two near-zero eigenvalues):
0.00 0.11 0.34 1.31 2.60 3.00 3.10 3.36 4.00 4.13 4.59 4.79 5.31 5.58 6.00 6.00 6.00 6.66 7.12
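A small numpy sketch of this effect (the two 6-node example graphs are illustrative, not the ones on the slide):

import numpy as np

def laplacian_eigenvalues(edges, n):
    """Return the sorted Laplacian eigenvalues of an undirected graph."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    L = np.diag(A.sum(axis=1)) - A
    return np.sort(np.linalg.eigvalsh(L))

# Two triangles: {0,1,2} and {3,4,5}
disconnected = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
# Same two triangles plus one bridge edge (2,3): loosely connected
loosely = disconnected + [(2, 3)]

print(laplacian_eigenvalues(disconnected, 6))  # two zero eigenvalues (two components)
print(laplacian_eigenvalues(loosely, 6))       # one zero plus one small eigenvalue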
Eigenspace [Ying SDM09]
37
In the k-dimensional eigenspace spanned by the leading eigenvectors $x_1, x_2, \ldots, x_k$, node i is represented by the point $(x_{i1}, x_{i2}, \ldots, x_{ik})$, i.e., the i-th row of the $n \times k$ matrix $[x_1\ x_2\ \cdots\ x_k]$.
Topological vs. Spectral Features
Topological & spectral features are related:
No. of triangles: $\frac{1}{6}\sum_i \lambda_i^3$ (sum over the adjacency eigenvalues)
Subgraph centrality: $SC = \sum_i e^{\lambda_i}$
Graph diameter: also bounded in terms of the spectrum
38
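As a quick check of the first relation, a small numpy sketch (the 5-node example graph is arbitrary):

import numpy as np

# Small undirected graph: a 4-clique {0,1,2,3} plus a pendant node 4
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
n = 5
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

eig = np.linalg.eigvalsh(A)
triangles_spectral = np.sum(eig ** 3) / 6.0                       # (1/6) * sum of lambda_i^3
triangles_trace = np.trace(np.linalg.matrix_power(A, 3)) / 6.0    # trace(A^3)/6

print(triangles_spectral, triangles_trace)  # both equal 4, the triangles of the 4-clique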
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
39
K-anonymity Privacy Preservation
K-anonymity (Sweeney 1998)
Each individual is indistinguishable from at least K-1 other individuals
A general definition for network data [Hay, VLDB08]
40
K-anonymity
Each node is indistinguishable from at least K-1 other nodes
under topology-based attacks.
The adversary is assumed to have some knowledge of the
target user:
node degree (K-degree)
(immediate) neighborhood (K-neighborhood)
arbitrary subgraph (K-automorphism etc.)
K-anonymity approach guarantees that no node in the
released graph can be linked to a target individual with
success prob. greater than 1/K.
41
K-degree Anonymity [Liu SIGMOD08]
Attacking model:
The attackers know the degree of the targeted individual.
42
K-degree Anonymity [Liu SIGMOD08]
K-degree anonymous: every node shares its degree with at least K-1 other nodes.
Optimize utility: minimize the number of added edges.
43
K-degree Anonymity [Liu SIGMOD08]
Algorithm outline:
1. Degree-sequence anonymization: find a K-anonymous degree sequence as close as possible to the original one (dynamic programming).
2. Graph construction: realize the anonymized degree sequence by adding edges to the original graph.
44
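A minimal sketch of the degree-sequence step only, using a simple greedy grouping instead of the dynamic program of [Liu SIGMOD08] (function and variable names are illustrative; the graph-construction step is not shown):

def k_anonymize_degrees(degrees, K):
    """Return a K-anonymous degree sequence >= the original (greedy sketch).

    Degrees are processed in non-increasing order; each group of K
    consecutive degrees is raised to the group's maximum so that every
    value appears at least K times.
    """
    order = sorted(range(len(degrees)), key=lambda i: -degrees[i])
    anon = list(degrees)
    i = 0
    while i < len(order):
        group = order[i:i + K]
        # if the tail would be left with fewer than K nodes, absorb it
        if len(order) - i < 2 * K:
            group = order[i:]
        target = max(degrees[j] for j in group)
        for j in group:
            anon[j] = target
        i += len(group)
    return anon

print(k_anonymize_degrees([5, 4, 4, 3, 2, 2, 1], K=2))   # [5, 5, 4, 4, 2, 2, 2]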
K-neighborhood Anonymity [Zhou ICDE08]
Attacking model
The attackers know the immediate neighborhood of a targeted
individual.
45
K-neighborhood Anonymity [Zhou ICDE08]
Problem:
46
K-neighborhood Anonymity [Zhou ICDE08]
Algorithm outline
1. Extract the neighborhoods of all vertices in the network.
2. Compare and test all neighborhoods by neighborhood component coding
3. Organize vertices into groups and anonymize the neighborhoods of
vertices in the same group until the graph satisfies K-neighborhood
anonymity.
47
(Figure: the 1-neighborhood of Ada in the naively anonymized graph vs. in the K-neighborhood anonymized graph.)
K-neighborhood Anonymity [Zhou ICDE08]
Graph utility:
1. The nodes have hierarchical label information.
2. Two ways to anonymize the neighborhoods: generalizing labels
and adding edges.
3. Focus on using the anonymized graph to answer aggregate network queries as accurately as possible.
48
K-automorphism Anonymity [ZouVLDB09]
Attacking model:
The attackers can know any subgraph that contains the targeted
individual.
Graph automorphism
49
K-automorphism Anonymity [ZouVLDB09]
K-automorphic graph
50
K-automorphism Anonymity [ZouVLDB09]
Algorithm outline:
1. Partition graph G into several groups of subgraphs; each group contains at least K subgraphs, and no two subgraphs share a node.
2. Block Alignment: make subgraphs within each group
isomorphic to each other.
3. Edge Copy: copy the edges across the subgraphs properly.
51
K-symmetry Model [Wu EDBT10]
Attacking model:
The attackers can know any subgraph that contains the targeted
individual.
K-symmetry approach:
1. A concept similar to K-automorphism (equivalent?)
2. Make the graph K-symmetric by adding fake nodes
52
K-isomorphism Model [Cheng SIGMOD10]
Attacking model:
The attackers can know any subgraph that contains the targeted
individual.
Insufficient protection on link privacy by K-
automorphism approach
53
Example: the adversary cannot identify Alice or Bob, but there must
be a link between them.
K-isomorphism Model [Cheng SIGMOD10]
K-security graph
54
K-anonymity
55
(Figure: the K-degree, K-neighborhood, K-automorphism, K-symmetry, and K-security models compared along privacy protection vs. utility preservation.)
K-obfuscation [Bonchi ICDE11]
56
K-obfuscation [Bonchi ICDE11]
57
Both cases respect 2-candidate anonymity.
K-candidate aims at guaranteeing a lower bound on the amount of
uncertainty.
K-obfuscation measures the uncertainty.
The obfuscation level quantified by means of entropy is always at least as large as the one based on a-posteriori belief probabilities.
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
58
Generalization Approach [Hay VLDB08]
Generalize nodes into super nodes and edges into super
edges
59
Generalization Approach [Hay VLDB08]
The size of possible graph world:
Maximize the graph likelihood function.
Simulated annealing algorithm:
1. Start with a single partition containing all nodes;
2. Update the state by splitting/merging partitions or moving a node to a new partition.
60
Anonymizing Rich Graph
Social networks contain much richer information in addition to the graph topology.
individual attributes including demographic information, such as age, gender and location, and sensitive personal data, such as political and religious preferences.
relationships may denote various types of interactions, and one interaction can also involve more than two participants.
Queries are on the combinations of properties of entities and patterns of the link structure.
K-anonymous models based only on link structure may still leak privacy, e.g., all nodes in a K-anonymous group are associated with the same sensitive information.
61
Link Protection [Zheleva PinKDD07]
Graph model: one type of nodes but multiple types of edges, which are classified as either sensitive or non-sensitive.
The problem of link re-identification is to infer sensitive relationships from non-sensitive ones.
Cluster-edge anonymization is to aggregate edges by type to protect link privacy.
all the anonymized nodes in an equivalence class are collapsed into a single super-node.
release the number of edges of each type between equivalence classes.
62
Clustering Approach [Campan PinKDD08]
Graph model:
edges are not labeled
nodes are associated with attributes (identifier, quasi-
identifier, and sensitive ones).
K-anonymity model via cluster collapsing and
generalization is for both the quasi-identifier attributes
and the quasi-identifier relationship homogeneity.
Two nodes from any cluster are indistinguishable based on
either their relationships or their attributes.
63
Anonymizing Bipartite Graph [Cormode VLDB08]
Graph model
two types of entities
an association only exists between two entities of different types, e.g., customers buy products.
The relationship is sensitive while entity attributes are public.
The anonymization method is to preserve the graph structure exactly by masking the mapping from entities to nodes.
Aggregate queries are used to evaluate utility.
the total number of OTC products bought by NJ customers.
64
Anonymizing Rich Interaction Graph [Bhagat VLDB09]
Hyper-graph: G(V, I, E) represents multiple types of interactions between entities.
Attacking model: the attackers know part of the links and nodes in the graph.
65
Anonymizing Rich Graph [Bhagat VLDB09]
Algorithm outline:
1. Sort the nodes according to their attributes;
2. Group nodes into super nodes satisfying the class safety property, with the size of each super node greater than K.
Class safety: a node cannot have interactions with two or more nodes from the same group.
3. Replace the node identifiers by the label list.
66
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
67
Basic Graph Randomization Operations
1. Rand Add/Del: randomly add k false edges and delete k true edges (number of edges unchanged)
2. Rand Switch: randomly switch a pair of edges, and repeat k times (node degrees unchanged)
(A small code sketch of both operations follows the example figure below.)
68
(Figure: a 5-node example graph before and after Rand Add/Del and Rand Switch.)
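A minimal sketch of the two operations (edges are stored as sorted node pairs; connectivity and mixing issues are ignored):

import random

def rand_add_del(edges, nodes, k):
    """Randomly delete k true edges and add k false (non-existing) edges."""
    original = set(edges)
    perturbed = set(original)
    for e in random.sample(sorted(original), k):
        perturbed.discard(e)
    added = 0
    while added < k:
        i, j = random.sample(nodes, 2)
        e = (min(i, j), max(i, j))
        if e not in original and e not in perturbed:
            perturbed.add(e)
            added += 1
    return perturbed

def rand_switch(edges, k):
    """Repeat k times: pick edges (a,b),(c,d) and rewire to (a,d),(c,b); degrees unchanged."""
    edges = set(edges)
    done = 0
    while done < k:
        (a, b), (c, d) = random.sample(sorted(edges), 2)
        if random.random() < 0.5:          # consider both possible rewirings
            c, d = d, c
        new1 = (min(a, d), max(a, d))
        new2 = (min(c, b), max(c, b))
        if len({a, b, c, d}) == 4 and new1 not in edges and new2 not in edges:
            edges -= {(a, b), (min(c, d), max(c, d))}
            edges |= {new1, new2}
            done += 1
    return edges

example = {(1, 2), (2, 3), (3, 4), (4, 5), (1, 5)}
print(rand_add_del(example, nodes=[1, 2, 3, 4, 5], k=2))
print(rand_switch(example, k=2))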
Resilient to Subgraph Attack
69
Links both within and outside the attack subgraph are changed.
Randomization
Randomized response model [Warner 1965]
70
Purpose: estimate the proportion $\pi_A$ of population members that cheated in the exam (group A), without learning any individual's answer.
Procedure: each respondent uses a randomization device. With probability p the device asks "Do you belong to A?", and with probability 1-p it asks "Do you belong to the complement of A?". The respondent answers the selected question truthfully with "Yes" or "No".
Probability of a "Yes" answer: $\lambda = p\,\pi_A + (1-p)(1-\pi_A)$
Unbiased estimate (for $p \neq 1/2$): $\hat{\pi}_A = \frac{\hat{\lambda} + p - 1}{2p - 1}$, where $\hat{\lambda}$ is the observed fraction of "Yes" answers.
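A small simulation sketch of this procedure and estimator (the population size, cheating rate, and p are arbitrary choices):

import random

def randomized_response(truth, p):
    """Answer about A with probability p, about the complement of A with probability 1-p."""
    ask_about_A = random.random() < p
    return truth if ask_about_A else (not truth)

def estimate_pi_A(answers, p):
    """Unbiased estimate pi_hat = (lambda_hat + p - 1) / (2p - 1), valid for p != 1/2."""
    lam = sum(answers) / len(answers)      # observed fraction of "Yes"
    return (lam + p - 1.0) / (2.0 * p - 1.0)

random.seed(0)
truths = [random.random() < 0.2 for _ in range(100000)]   # 20% of the population cheated
answers = [randomized_response(t, p=0.8) for t in truths]
print(estimate_pi_A(answers, p=0.8))                      # close to 0.2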
Randomization [Agrawal SIGMOD00]
71
(Figure: original records such as 50 | 40K | ... and 30 | 70K | ... pass through a randomizer; e.g., Alice's age 30 becomes 65 (30+35) after adding a random number to Age. The distributions of Age and Salary are then reconstructed from the randomized records (65 | 20K | ..., 25 | 60K | ...) and fed to a classification algorithm to build the model.)
Reconstruction
Given:
the randomized values x1+y1, x2+y2, ..., xn+yn, where the xi are the original values;
the probability distribution of the noise Y.
Estimate the probability distribution of X.
72
Iterative reconstruction algorithm:
$f_X^0 :=$ uniform distribution; $j := 0$
repeat
    $f_X^{j+1}(a) := \frac{1}{n}\sum_{i=1}^{n} \frac{f_Y((x_i+y_i)-a)\, f_X^{j}(a)}{\int f_Y((x_i+y_i)-z)\, f_X^{j}(z)\, dz}$
    $j := j + 1$
until (stopping criterion met)
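A discretized numpy sketch of this iteration (the bins, the Gaussian noise model, and the fixed iteration count are illustrative choices):

import numpy as np

def reconstruct_distribution(z, noise_pdf, bins, iters=200):
    """Iteratively estimate f_X on the given bin centers from z_i = x_i + y_i."""
    fX = np.full(len(bins), 1.0 / len(bins))          # start from the uniform distribution
    for _ in range(iters):
        lik = noise_pdf(z[:, None] - bins[None, :])   # lik[i, a] = f_Y(z_i - bin_a)
        post = lik * fX[None, :]
        post /= post.sum(axis=1, keepdims=True)       # Bayes rule per record
        fX = post.mean(axis=0)                        # average the posteriors
    return fX

# Example: X uniform on [20, 60], noise Y ~ N(0, 10^2)
rng = np.random.default_rng(0)
x = rng.uniform(20, 60, size=5000)
z = x + rng.normal(0, 10, size=5000)
bins = np.linspace(0, 80, 81)
gauss = lambda t: np.exp(-t**2 / (2 * 10**2)) / np.sqrt(2 * np.pi * 10**2)
fX = reconstruct_distribution(z, gauss, bins)
print(fX[(bins >= 20) & (bins <= 60)].sum())          # most mass lands back in [20, 60]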
200
400
600
800
1000
1200
20 60
Age
Nu
mb
er o
f P
eo
ple
Original
Randomized
Reconstructed
[Agarawal SIGMOD00]
Outline on Graph Randomization
Link privacy
the probability of existence of a link (i, j) given the perturbed graph: $P(a_{ij}=1 \mid \tilde{G}) = ?$
Feature preserving randomization
Spectrum preserving randomization
Markov chain based feature preserving randomization
Reconstruction from randomized graph
73
Prior probability:
Posterior probabilities
74
Randomized
graph
i
j
ijm~
Method II:
If similarity is large, link (i. j) is
more likely to be a true link
Method I:
m
kmaaP ijij )1~|1(
)~,1~|1( xmaaP ijijij
2/)1()1(
nn
maP ij
Link Privacy: Posterior Beliefs [Ying PAKDD09]
Link Privacy: Posterior Beliefs [Ying PAKDD09]
75
Posterior probability and similarity measures
(Figures: similarity vs. proportion of existing edges, before and after randomization.)
Link Privacy: Posterior Beliefs [Ying PAKDD09]
76
Posterior probability and similarity measures
$\sum_{i<j} P(a_{ij}=1) = \sum_{i<j} P(a_{ij}=1 \mid \tilde{a}_{ij}) = \sum_{i<j} P(a_{ij}=1 \mid \tilde{a}_{ij}, \tilde{m}_{ij}) = m$ (prior prob., posterior prob. I, posterior prob. II)
How to calculate the posterior probability $P(a_{ij}=1 \mid \tilde{G})$ for general cases?
Exploit graph space to breach link privacy
Link Privacy: Graph Space [Ying SDM09]
77
Sample the graph space when the space is large
1. Start with the randomized graph, construct a Markov chain,
and uniformly sample the graph space.
2. Generate N uniform graph samples
Empirical evaluations show that node pairs with highest
probabilities have serious link disclosure risk (as high as 90%).
Link Privacy: Graph Space [Ying SDM09]
78
Samples: $G_1, G_2, \ldots, G_N \in$ GraphSpace
$\Pr(a_{ij}=1 \mid \text{GraphSpace}) \approx \frac{1}{N}\sum_{k=1}^{N} a_{ij}(G_k)$, i.e., the fraction of sampled graphs that contain edge (i, j).
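A rough sketch of this estimator, reusing the rand_switch sketch above; it walks the space of graphs with the same degree sequence, and the burn-in and accept/reject details needed for true uniform sampling in [Ying SDM09] are omitted:

def link_probabilities(randomized_edges, N=500, k=50):
    """Estimate Pr(a_ij = 1 | graph space) as the fraction of sampled graphs containing (i, j)."""
    counts = {}
    g = set(randomized_edges)
    for _ in range(N):
        g = rand_switch(g, k)              # k switches between consecutive samples
        for e in g:
            counts[e] = counts.get(e, 0) + 1
    return {e: c / N for e, c in counts.items()}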
Graph Features Under Pure Randomization
Topological and spectral features change significantly
along the randomization.
79
(Network of US political books, 105 nodes and 441 edges)
Can we better preserve
the network structure?
Spectrum Preserving Randomization [Ying SDM08]
Graph spectrum is related to many real graph features.
Preserve graph features by preserving some eigenvalues.
80
Spectrum Preserving Randomization [Ying SDM08]
Spectral Switch (applied to the adjacency matrix):
Up-switch to increase the eigenvalue:
Down-switch to decrease the eigenvalue:
81
(Figure: a switch among nodes t, u, v, w guided by their eigenvector components $x_t, x_u, x_v, x_w$.)
Spectral Switch (applied to the Laplacian matrix):
Down-switch to decrease the eigenvalue:
Up-switch to increase the eigenvalue:
(Figure: a switch among nodes t, u, v, w guided by their Laplacian eigenvector components $y_t, y_u, y_v, y_w$.)
82
Spectrum Preserving Randomization [Ying SDM08]
Algorithm outline
83
Spectrum Preserving Randomization [Ying SDM08]
Preserve any graph feature S(G) within a small range
Feature range constraint specified by the user
Markov chain with feature range constraint
Markov Chain Based Feature Preserving
Randomization [Ying SDM09]
84
Feature range constraint: $S(G) \in R = [s_1, s_2]$ and $S(\tilde{G}) \in R = [s_1, s_2]$.
Markov chain with feature range constraint: $G_0 \rightarrow G_1 \rightarrow \cdots \rightarrow G_t = \tilde{G}$, where every intermediate graph satisfies $S(G_i) \in R$
(uniformity on accessible graphs)
Markov Chain Based Feature Preserving Randomization [Ying SDM09]
85
Feature constraint can be used to breach link privacy.
(Figure: original vs. released graph.)
Reconstruction from Randomized Graph [Wu SDM10]
Motivation
From the randomized graph, can we reconstruct a graph
whose features are closer to the true features?
86
Low rank approximation approach
1. Best rank-r approximation by eigen-decomposition: $\tilde{A}_r = \sum_{i=1}^{r} \tilde{\lambda}_i \tilde{x}_i \tilde{x}_i^T$
2. Discretize the low rank matrix back to a 0-1 adjacency matrix
87
Reconstruction from Randomized Graph [Wu SDM10]
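A minimal numpy sketch of the two steps (the rank r and the discretization rule, i.e., keeping the m largest scores as edges, are illustrative choices rather than the exact ones in [Wu SDM10]):

import numpy as np

def low_rank_reconstruct(A_rand, r):
    """Rank-r eigen-approximation of the randomized adjacency matrix, then discretize."""
    n = A_rand.shape[0]
    m = int(A_rand.sum() / 2)                      # keep the same number of edges
    vals, vecs = np.linalg.eigh(A_rand)
    idx = np.argsort(-np.abs(vals))[:r]            # r eigenvalues largest in magnitude
    A_r = (vecs[:, idx] * vals[idx]) @ vecs[:, idx].T
    # discretize: set the m largest off-diagonal scores (i < j) to 1
    iu = np.triu_indices(n, k=1)
    keep = np.argsort(-A_r[iu])[:m]
    A_hat = np.zeros_like(A_rand)
    A_hat[iu[0][keep], iu[1][keep]] = 1.0
    return A_hat + A_hat.T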
Effect of including significant negative eigenvalues
88
(Figure: the original adjacency pattern and reconstructions with r = 1, r = 2, and r = 4.)
Reconstruction from Randomized Graph [Wu SDM10]
89
Reconstruction from Randomized Graph [Wu SDM10]
Algorithm Outline:
1. Compute the eigen-decomposition of the randomized adjacency matrix;
2. Form the best rank-r approximation, including significant negative eigenvalues;
3. Discretize the low-rank matrix into a 0-1 adjacency matrix.
(Figure: feature values of the original graph, the randomized graph, and the reconstructed graph.)
90
Reconstruction from Randomized Graph [Wu SDM10]
Reconstructed graphs do not jeopardize link privacy for
real-world networks.
-- Privacy measured by the proportion of different edges:
91
Reconstruction from Randomized Graph [Wu SDM10]
92
Graphs with low rank may have privacy breached by
reconstruction:
Reconstruction on synthetic low rank graphs
93
Reconstructing Randomized Social Networks & Features [Vuokko SDM10]
Graph and feature data:
1. Original graph
2. Binary feature matrix
3. Two individuals with higher similarity in features are more likely to be connected in the graph.
Reconstruction problem:
Maximum likelihood estimation is adopted to reconstruct the original graph and features.
Random sparsification [Bonchi ICDE11]
only remove edges from the graph without adding new edges.
outperforms random add/del in terms of utility preservation, partially due to the small world phenomenon:
adding random long-haul edges brings nodes close,
while removing an edge does not push nodes far apart, since there exist alternative paths.
utility vs. privacy trade-off for various randomization strategies.
94
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
95
Edge-weighted Graph
Edge weights could be sensitive, e.g., trustworthiness of
user A according to user B, or transaction amount
between two accounts.
96
Anonymizing Edge-weighted Graph [Das TR09]
Some properties of edge weights in terms of some
functions are preserved.
Relative distances between nodes for shortest paths or
kNN queries.
A framework for edge weight anonymization of graph
data that preserves linear properties.
A linear property can be expressed by a specific set of
linear inequalities of edge weights.
Finding new weights for each edge is a linear programming
problem.
97
Gaussian Randomization [Liu SDM09]
Perturb edge weights while preserving global and local
utilities. Graph structure is unchanged.
Gaussian randomization multiplication
The original weight of each edge is multiplied by a random
Gaussian noise with mean 1 and some variance.
In the original graph, if the shortest distance d(A,B) is
much smaller than d(C,D), the order is highly likely to be preserved.
Greedy perturbation
Preserve a set of shortest distances.
98
Anonymizing Multi-graphs [Li SDM11]
How to generate an anonymized collection of graphs where each graph corresponds to an individual’s behavior.
XML representation of attributes about an individual
Click-graph in a user-session
Route for a given individual in a time-period
Condensation based approach
create constrained clusters of size at least K
construct a super-template to represent properties of the group
generate anonymized graphs from super-template
99
Computing Privacy Scores [Liu ICDM09]
The privacy score measures the user’s potential privacy risk due to
her online information sharing behaviors. It increases with
sensitivity of the information being shared
visibility of the revealed information in the network
100
Computing Privacy Scores [Liu ICDM09]
Item Response Theory based model
Used to measure the abilities of examinees, the difficulty
of the questions, and the prob. of an examinee to answer a
question correctly.
Each examinee is mapped to a user, and each question is
mapped to a profile item. The difficulty parameter is to
quantify the sensitivity of a profile item.
The true visibility is estimated from observed profiles.
101
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
102
Output Perturbation
103
Data owner <-> Data miner
name sex age disease salary
Ada F 18 cancer 25k
Bob M 25 heart 110k
Cathy F 20 cancer 70k
Dell M 65 flu 65k
Ed M 60 cancer 300k
Fred M 24 flu 20k
George M 22 cancer 45k
Harry M 40 flu 95k
Irene F 45 heart 70k
Query f
Query result + noise
Differential Guarantee [Dwork, TCC06]
104
name disease
Ada cancer
Bob heart
Cathy cancer
Dell flu
Ed cancer
Fred flu
f count(#cancer)
f(x) + noise
name disease
Ada cancer
Bob heart
Cathy cancer
Dell flu
Ed cancer
Fred flu
K
K
f count(#cancer)
f(x’) + noise
3 + noise
2 + noise
Two databases (x, x’) differ in only one row.
Differential Guarantee
Require that the prob. distribution is essentially the same independent of whether any individual opts in to, or opts out of the database.
Anything that can be learned about a respondent from a statistical database should be learnable without access to the database. (Tore Dalenius 1977 ad omnia privacy)
Independent of adversary knowledge.
Different from prior work on comparing an adversary's prior and posterior views of an individual.
105
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
106
ε-differential privacy
ε is a privacy parameter: smaller ε = stronger privacy
Differential Privacy [Dwork, TCC06 & Dwork CACM11]
Two neighboring datasets are defined in terms of
• Hamming distance: |(x - x') ∪ (x' - x)| = 1
• Symmetric distance
107
ε-differential privacy
ε is a privacy parameter: smaller ε = stronger privacy
Differential Privacy [Dwork, TCC06 & Dwork CACM11]
108
ε is public and its selection is a social question.
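Written out, the definition from [Dwork, TCC06] is: a randomized mechanism K gives ε-differential privacy if, for all neighboring datasets x, x' and for every set of outputs S ⊆ Range(K),
$\Pr[K(x) \in S] \le e^{\epsilon} \cdot \Pr[K(x') \in S]$.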
Calibrating Noise
109
Laplace distribution
Sensitivity of function
global sensitivity
local sensitivity
Multiple queries
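The standard quantities behind these bullets (following [Dwork, TCC06]):
Global sensitivity: $\Delta f = \max_{x, x'} \lVert f(x) - f(x') \rVert_1$ over neighboring datasets x, x'.
Laplace distribution: Lap(b) has density $\frac{1}{2b} e^{-|t|/b}$.
Laplace mechanism: returning $f(x) + \mathrm{Lap}(\Delta f / \epsilon)$, with independent noise in each output coordinate, satisfies ε-differential privacy.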
Laplace Distribution
110
(Figure: density of Lap(b).)
Gaussian Distribution
111
(Figure: density of the Gaussian distribution.)
Sensitivity
112
name sex age disease salary
Ada F 18 cancer 25k
Bob M 25 heart 110k
Cathy F 20 cancer 70k
Dell M 65 flu 65k
Ed M 60 cancer 300k
Fred M 24 flu 20k
George M 22 cancer 45k
Harry M 40 flu 95k
Irene F 45 heart 70k
Function f | Sensitivity
Count(#cancer) | 1
Sum(salary) | u (domain upper bound)
Avg(salary) | u/n
Complex functions or data mining tasks can be decomposed into a sequence of simple functions.
L-1 distance for
vector output
Differential Guarantee
113
name disease
Ada cancer
Bob heart
Cathy cancer
Dell flu
Ed cancer
Fred flu
f count(#cancer)
K: f(x) + Lap(1/ε)
ε = 1, ε = 2
Difference of Prob.
Neighboring Datasets
114
• Two neighboring datasets are defined in terms of
• Hamming distance: |(x - x') ∪ (x' - x)| = 1 --- presence/absence of an individual row
• Symmetric distance --- hiding the value of an individual row
• What about two datasets differing by k rows?
Multiple Queries
115
• The sensitivity of a sequence of two counting
queries is 2. Adding independent noise Lap(2/ε)
to each answer yields ε-differential privacy.
Histogram Query
SELECT count(*)
FROM table
GROUP BY disease
name sex age disease salary
Ada F 18 cancer 25k
Bob M 25 heart 110k
Ed M 60 cancer 300k
George M 22 cancer 45k
Harry M 40 flu 95k
Irene F 45 heart 70k
[3, 2, 1]  (cancer, heart, flu)
[3 + Lap(1/ε), 2 + Lap(1/ε), 1 + Lap(1/ε)]
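A two-line numpy sketch of the noisy histogram release (epsilon = 1 is an arbitrary choice; the counts are those from the slide):

import numpy as np

counts = np.array([3, 2, 1])                 # cancer, heart, flu
epsilon = 1.0
noisy = counts + np.random.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
print(noisy)   # the histogram query has sensitivity 1, so Lap(1/epsilon) per cell suffices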
Complex Data Mining Task
Many data mining tasks can be decomposed
into a series of noisy sum and/or noisy
count queries.
Known data mining algorithms include
SVD, ID3 decision tree, K-means clustering,
association rule etc.
117
Recent Development
[Xiao ICDE10] Dependencies among the queries can be
exploited to improve the accuracy of responses.
[Li PODS10] Matrix mechanism for answering a workload of
predicate counting queries
[Kifer SIGMOD11] Misconceptions of differential privacy.
118
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
119
Private Query Answering on Networks [Hay ICDM09]
120
Two neighboring graphs can be defined to differ by a
single edge, K edges, or a single node.
edge ε-differential privacy
query output is indistinguishable whether any single edge is present or absent.
K-edge ε-differential privacy
query output is indistinguishable whether any set of k edges is present or absent.
node ε-differential privacy
query output is indistinguishable whether any single node (and all its edges) is present or absent.
Lap(Δf/ε)
Lap(K·Δf/ε)
Degree Sequence
The list of degrees of each node in a graph
The degree sequence of a network may be sensitive as it
can be used to determine the graph structure by
incorporating other graph statistics
121
Two Equivalent Queries
122
Degree sequence: D(G) = [1,1,3,3,3,3,2], D(G') = [1,1,3,3,2,2,2]
ΔD = 2, Lap(2/ε) for each component
Degree histogram (# of nodes with degree i, i = 0, ..., n-1): F(G) = [0,2,1,4,0,0,0], F(G') = [0,2,3,2,0,0,0]
ΔF = 4, Lap(4/ε) for each component
Boosting Accuracy [Hay ICDM09]
123
Rewrite query D to get S with constraint Cs
Submit S to the private mechanism K
Perturbed answer A(S)
Perform inference on A(S) with
constraints Cs to derive a better estimation
Formulating Query D
124
Degree sequence D(G)=[1,1,3,3,3,3,2] D(G’)=[1,1,3,3,2,2,2]
ΔD = 2, Lap(2/ε) for each component
Query S: return the i-th smallest degree
S(G) = [1,1,2,3,3,3,3]; the perturbed answer S(G) + Lap(2/ε) could be
[3, 2, …]
A new (and more accurate) sequence can be derived by computing the
closest non-decreasing sequence.
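The post-processing step (closest non-decreasing sequence) is an isotonic regression; below is a small pool-adjacent-violators sketch using the L2 objective, not necessarily the exact formulation in [Hay ICDM09]:

def closest_nondecreasing(noisy):
    """L2-closest non-decreasing sequence via pool-adjacent-violators."""
    merged = []                                # stack of [block mean, block size]
    for v in noisy:
        merged.append([v, 1])
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            m2, m1 = merged.pop(), merged.pop()
            total = m1[0] * m1[1] + m2[0] * m2[1]
            merged.append([total / (m1[1] + m2[1]), m1[1] + m2[1]])
    out = []
    for mean, size in merged:
        out.extend([mean] * size)
    return out

print(closest_nondecreasing([1.2, 0.7, 2.4, 3.1, 2.9, 3.0, 3.3]))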
Accurate Motif Analysis
Measures the frequency of occurrence of small
subgraphs in a network, e.g., # of triangles
125
(Example: two graphs differing in a single edge can contain n-2 triangles vs. 0 triangles.)
High sensitivity!
Weakening Privacy
Statistics such as transitivity, clustering coefficient,
centrality, and path-lengths have high sensitivity values.
Possible solutions
[Nissim STOC07] Smooth sensitivity, adopting local
sensitivity, i.e., the max. change between Q(I) and Q(I’) for any
I’ in neighbor(I).
[Rastogi PODS09] Adversary privacy, limiting assumptions
of the prior knowledge of the adversary
More exploration is needed on robust statistics and
differential privacy.
126
Model based Data Publishing
127
Data owner Data miner
name sex age disease salary
Ada F 18 cancer 25k
Bob M 25 heart 110k
Cathy F 20 cancer 70k
Dell M 65 flu 65k
Ed M 60 cancer 300k
Fred M 24 flu 20k
George M 22 cancer 45k
Harry M 40 flu 95k
Irene F 45 heart 70k
Build models (e.g., contingency table, power-law graph)
Release differentially private model parameters
Generate synthetic data using models with perturbed parameters
Outline
• Attacks on Naively Anonymized Graph
• Privacy Preserving Social Network Publishing
• K-anonymity
• Generalization
• Randomization
• Other Works
• Output Perturbation
• Background on differential privacy
• Accurate analysis of private network data
128
Challenges
How to quantify graph utility-privacy tradeoff especially for (dynamic) rich graphs?
existing data publication techniques do not provide guarantees of accuracy of graph analysis.
how to define utility?
Scalability is always an issue.
Differential privacy preserving social network mining
sensitivity of statistics and relaxation
129
References
[Agrawal, SIGMOD00] R. Agrawal and R. Srikant. Privacy preserving data mining. SIGMOD, 2000.
[Aggarwal, 08] C. C. Aggarwal and P. S. Yu. Privacy-preserving data mining: models and algorithms. Springer, 2008.
[Backstrom, WWW07] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou R3579X? Anonymized social networks, hidden patterns and structural steganography. WWW, 2007.
[Bhagat, VLDB09] S. Bhagat, G. Cormode, B. Krishnamurthy, and D. Srivastava. Class-based graph anonymization for social network data. VLDB, 2009.
[Bonchi, ICDE11] F. Bonchi, A. Gionis, and T. Tassa. Identity Obfuscation in Graphs Through the Information Theoretic Lens. ICDE, 2011.
[Campan, PinKDD08] A. Campan and T. M. Truta. A clustering approach for data and structural anonymity in social network data. PinKDD, 2008.
[Cheng, SIGMOD10] J. Cheng, A. Fu, and J. Liu. K-isomorphism: privacy preserving network publication against structural attacks. SIGMOD, 2010.
[Cormode, VLDB08] G. Cormode, D. Srivastava, T. Yu, and Q. Zhang. Anonymizing bipartite graph data using safe groupings. VLDB, 2008.
130
References
[Das, TR09] S. Das, Omer Egecioglu, and A. E. Abbadi. Anonymizing edge weighted social network graphs. 2009.
[Dwork, CACM11] C. Dwork. A firm foundation for private data analysis. CACM, 2011.
[Dwork, TCC06] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. TCC, 2006.
[Gross, WPES05] R. Gross and A. Acquisti. Information revelation and privacy in online social networks (the Facebook case). WPES, 2005.
[Hanhijarvi, SDM09] S. Hanhijarvi, G. C. Garriga, and K. Puolamaki. Randomization techniques for graphs. SDM, 2009.
[Hay, VLDB08] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting structural re-identification in anonymized social networks. VLDB, 2008.
[Hay, 07] M. Hay, G. Miklau, D. Jensen, P. Weis, and S. Srivastava. Anonymizing social networks. 2007.
131
References
[Hay, 09] M. Hay, G. Miklau and D. Jensen. Enabling accurate analysis of private data analysis. 2009.
[Kifer, SIGMOD11] D. Kifer and A. Machanavajjhala. No Free Lunch in Data Privacy. SIGMOD, 2011.
[Korolova, ICDE08] A. Korolova, R. Motwani, S. Nabar, and Y. Xu. Link privacy in social networks. ICDE, 2008.
[Li, PODS10] C. Li, M. Hay, V. Rastogi, G. Miklau, A. McGregor. Optimizing linear counting queries under differential privacy. PODS, 2010.
[Liu, SIGMOD08] K. Liu and E. Terzi. Towards identity anonymization on graphs. SIGMOD 2008.
[Liu, ICDM09] K. Liu and E. Terzi. A framework for computing the privacy scores of users in online social networks. ICDM 2009.
[Liu, SDM09] L. Liu, J. Wang, J. Liu and J. Zhang. Privacy preserving in social networks against sensitive edge disclosure. SDM, 2009.
[McSherry, FOCS07] F. McSherry and K. Talwar. Mechanism design via differential privacy. FOCS, 2007.
132
References
[Narayanan, ISSP09] A. Narayanan and V. Shmatikov. De-anonymizing social networks. IEEE Symposium on Security and Privacy, 2009.
[Newman, PRE06] M. Newman. Physical Review E, 2006.
[Vuokko, SDM10] N. Vuokko and E. Terzi. Reconstructing randomized social networks. SDM, 2010.
[Wu, SDM10] L. Wu, X. Ying, and X. Wu. Reconstruction of randomized graph via low rank approximation. SDM, 2010.
[Wu, EDBT10] W. Wu, Y. Xiao, W. Wang, Z. He, and Z. Wang. k-symmetry model for identity anonymization in social networks. EDBT, 2010.
[Wu, 09] X. Wu, X. Ying, K. Liu, and L. Chen. A Survey of Algorithms for Privacy-Preservation of Graphs and Social Networks. 2009.
[Ying, SDM08] X. Ying and X. Wu. Randomizing social networks: a spectrum preserving approach. SDM, 2008.
[Ying, SDM09] X. Ying and X. Wu. Graph generation with prescribed feature constraints. SDM, 2009.
133
References [Ying, SDM09-2] X. Ying and X. Wu. On randomness measures for social networks. SDM, 2009.
[Ying, PAKDD09] X. Ying and X. Wu. On link privacy in randomizing social networks. PAKDD, 2009.
[Xiao, ICDE10] X. Xiao, G. Wang, and J. Gehrke. Differential privacy via wavelet transformation. ICDE,
2010.
[Zheleva, PinKDD07] E. Zheleva and L. Getoor. Preserving the privacy of sensitive relationships in
graph data. PinKDD, 2007.
[Zhou, ICDE08] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks.
ICDE, 2008.
[Zou, VLDB09] L. Zou, L. Chen, and M. T. Ozsu. K-automorphism: A general framework for privacy
preserving network publication. VLDB, 2009.
134
Questions?
Acknowledgments
This work was supported in part by U.S. National Science
Foundation IIS-0546027 , CNS-0831204 and CCF-1047621.
Updated version: http://dpl.sis.uncc.edu/ppsn-tut.PDF
Thank You!
135