Privacy Preserving Data Mining
Li Xiong
Department of Mathematics and Computer Science
Department of Biomedical Informatics
Emory University
CS378 Introduction to Data Mining
Netflix Sequel
• 2006, Netflix announced the challenge
• 2007, researchers from the University of Texas identified
individuals by matching the Netflix dataset with IMDb
• July 2009, $1M grand prize awarded
• August 2009, Netflix announced the second challenge
• December 2009, four Netflix users filed a class action
lawsuit against Netflix
• March 2010, Netflix canceled the second challenge
Facebook-Cambridge Analytica
• April 2010, Facebook launched Open Graph
• 2013, 300,000 users took the psychographic personality
test app "thisisyourdigitallife"
• 2016, Trump's campaign invested heavily in Facebook ads
• March 2018, reports revealed that 50 million (later revised
to 87 million) Facebook profiles were harvested for
Cambridge Analytica and used for Trump's campaign
• April 11, 2018, Zuckerberg testified before Congress
• How many people know we are here?
(a) no one
(b) 1-10, i.e., family and friends
(c) 10-100, i.e., colleagues and more (social network friends)
Who Knows What About Me? A Survey of Behind the Scenes Personal Data Sharing to Third Parties by Mobile Apps,
2015-10-30 https://techscience.org/a/2015103001/
• 73% of Android apps shared personal info (e.g., email
address) with third parties, and 33% shared GPS coordinates
• 45% of iOS apps shared email addresses with third parties,
and 47% shared GPS coordinates
[Figure: location data sharing by iOS apps (left) to domains (right)]
The EHR Data Map
Shopping records
Big Data Goes Personal
• Movie ratings
• Social network/media data
• Mobile GPS data
• Electronic medical records
• Shopping history
• Online browsing history
Data Mining

Data Mining … the dark side

[Diagram: Private Data → Privacy Preserving Data Mining → Sanitized Data / Models]
Privacy Preserving Data Mining
• Privacy goal: personal data is not revealed and cannot be
inferred
• Utility goal: data/models as close to the private data as
possible
Privacy preserving data mining
• Differential privacy
• Definition
• Building blocks (primitive mechanisms)
• Composition rules
• Data mining algorithms with differential privacy
• k-means clustering w/ differential privacy
• Frequent pattern mining w/ differential privacy
Differential Privacy

Traditional De-identification and Anonymization

[Diagram: Original Data → de-identification / anonymization → Sanitized View]
• Attribute suppression, perturbation, generalization
• Inference possible with external data
Massachusetts GIC Incident (1990s)
• Massachusetts Group Insurance Commission (GIC) encounter
data ("de-identified"), mid 1990s
• External information: voter roll from the city of Cambridge
• Governor's health records identified
• 87% of Americans can be uniquely identified using zip code,
birthdate, and sex (2000)

  Name     SSN        Birth date  Zip    Diagnosis
  Alice    123456789  44          48202  AIDS
  Bob      323232323  44          48202  AIDS
  Charley  232345656  44          48201  Asthma
  Dave     333333333  55          48310  Asthma
  Eva      666666666  55          48310  Diabetes
AOL Query Log Release (2006)
• 20 million Web search queries released by AOL
• User 4417749:
  • "numb fingers"
  • "60 single men"
  • "dog that urinates on everything"
  • "landscapers in Lilburn, Ga"
  • several people with the last name Arnold
  • "homes sold in shadow lake subdivision
    gwinnett county georgia"

  AnonID  Query                 QueryTime            ItemRank  ClickURL
  217     lottery               2006-03-01 11:58:51  1         http://www.calottery.com
  217     lottery               2006-03-27 14:10:38  1         http://www.calottery.com
  1268    gall stones           2006-05-11 02:12:51
  1268    gallstones            2006-05-11 02:13:02  1         http://www.niddk.nih.gov
  1268    ozark horse blankets  2006-03-01 17:39:28  8         http://www.blanketsnmore.com
The Genome Hacker (2013)
Differential Privacy
• Statistical outcome (view) is indistinguishable regardless
of whether a particular user is included in the data
[Diagram: Private Data D → privacy preserving data mining/sharing mechanism → Models / Data]
Differential Privacy
• View is indistinguishable regardless of the input

[Diagram: a neighboring database D′, differing from D in one user's record, fed to the same mechanism yields an indistinguishable view]

Differential privacy: an example

[Figure: original records, the original histogram, and the histogram perturbed with differential privacy]
Laplace Mechanism

[Figure: Laplace distribution Lap(S/ε)]

[Diagram: a query q is issued against the Private Data; the true answer is q(D), but the mechanism releases q(D) + η, with noise η drawn from Lap(S/ε)]
Laplace Distribution
• PDF: f(x | u, b) = (1 / 2b) · exp(−|x − u| / b)
• Denoted as Lap(b) when u = 0
• Mean u
• Variance 2b²
How much noise for privacy?
Sensitivity: Consider a query q: I → R. S(q) is the smallest number s.t.
for any neighboring tables D, D’,
| q(D) – q(D’) | ≤ S(q)
Theorem: If sensitivity of the query is S, then the algorithm
A(D) = q(D) + Lap(S(q)/ε) guarantees ε-differential privacy
[Dwork et al., TCC 2006]
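A minimal Python sketch of this mechanism (the function name and NumPy-based sampling are illustrative assumptions, not from the slides):

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
        """Release true_answer + eta with eta ~ Lap(sensitivity/epsilon).

        Satisfies epsilon-differential privacy whenever `sensitivity`
        upper-bounds |q(D) - q(D')| over all neighboring tables D, D'.
        """
        rng = rng or np.random.default_rng()
        return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)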
Example: COUNT query
• Number of people having HIV+
• Sensitivity = ?
Example: COUNT query
• Number of people having HIV+
• Sensitivity = 1
• ε-differentially private count: 3 + η, where η is drawn from Lap(1/ε)
• Mean = 0
• Variance = 2/ε2
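Reusing the hypothetical laplace_mechanism sketch above, the noisy count might look like:

    # COUNT has sensitivity 1: adding or removing one person changes
    # the count by at most 1, so the noise scale is 1/epsilon.
    noisy_hiv_count = laplace_mechanism(true_answer=3, sensitivity=1, epsilon=0.1)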
Example: Sum (Average) query
• Sum of Age (suppose Age is in [a,b])
• Sensitivity = ?
Example: Sum (Average) query
• Sum of Age (suppose Age is in [a,b])
• Sensitivity = b (assuming ages are nonnegative, adding or removing one record changes the sum by at most b)
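A sketch of the noisy sum under the stated assumption that every age lies in [a, b] with 0 ≤ a ≤ b; clipping enforces the assumption, and the names are illustrative:

    import numpy as np

    def noisy_age_sum(ages, a, b, epsilon, rng=None):
        """epsilon-DP sum of ages assumed to lie in [a, b]."""
        rng = rng or np.random.default_rng()
        clipped = np.clip(ages, a, b)      # enforce the [a, b] assumption
        # Adding or removing one record changes the sum by at most b,
        # so b is the sensitivity used for the Laplace scale.
        return clipped.sum() + rng.laplace(scale=b / epsilon)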
Composition theorems
• Sequential composition: Σᵢ εᵢ-differential privacy
• Parallel composition: max(εᵢ)-differential privacy
Sequential Composition
• If M1, M2, ..., Mk are algorithms that access a private
database D such that each Mi satisfies εi-differential privacy,
then the combination of their outputs satisfies
ε-differential privacy with ε = ε1 + ... + εk
Parallel Composition
• If M1, M2, ..., Mk are algorithms that access disjoint
databases D1, D2, …, Dk such that each Mi satisfies
εi-differential privacy,
then the combination of their outputs satisfies
ε-differential privacy with ε = max{ε1, ..., εk}
Postprocessing
• If M1 is an ε-differentially private algorithm that accesses a
private database D,
then outputting M2(M1(D)) also satisfies ε-differential
privacy.
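A toy sketch of how these budget-accounting rules combine (the function names are illustrative, not from the slides):

    def sequential_budget(epsilons):
        # Sequential composition: mechanisms that touch the same
        # data pay for every query; budgets add up.
        return sum(epsilons)

    def parallel_budget(epsilons):
        # Parallel composition: mechanisms on disjoint partitions
        # cost only the largest single budget, since one record
        # appears in at most one partition.
        return max(epsilons)

    # Example: a 10-bin histogram releases 10 counts over disjoint bins.
    parallel_budget([0.1] * 10)    # 0.1 -- one record affects one bin
    sequential_budget([0.1] * 10)  # 1.0 -- if the queries had overlapped
    # Postprocessing (e.g., rounding the noisy counts) costs no extra budget.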
Source: Module 2, Tutorial: Differential Privacy in the Wild
Privacy preserving data mining
• Differential privacy
• Definition
• Building blocks (primitive mechanisms)
• Composition rules
• Data mining algorithms with differential privacy
• k-means clustering w/ differential privacy
• Frequent itemset mining w/ differential privacy
Privacy Preserving Data Mining as Constrained
Optimization
• Two goals
• Privacy
• Error (utility)
• Given a task and privacy budget ε, how to design a set of
queries (functions) and allocate the budget such that the
error is minimized?
Data mining algorithms with differential privacy
• General algorithmic framework
• Decompose a data mining algorithm into a set of
functions
• Allocate privacy budget εi to each function
• Implement each function with εi-differential privacy
• Compute noisy output using the Laplace mechanism
based on the sensitivity of the function and εi
• Compose them using the composition theorems
• Optimization techniques
• Decomposition design
• Budget allocation
• Sensitivity reduction for each function
Review: K-means Clustering
K-means Problem
• Partition a set of points x1, x2, …, xn into k clusters S1, S2,
…, Sk such that the SSE is minimized:
  SSE = Σ_{i=1..k} Σ_{x ∈ Si} ||x − μi||², where μi is the mean of cluster Si
K-means Algorithm
• Initialize a set of k centers
• Repeat until convergence
1. Assign each point to its nearest center
2. Update the set of centers
• Output final set of k centers and the points in each cluster
Differentially Private K-means
• Initialize a set of k centers
• Repeat iterations until convergence
• In each iteration (given a set of centers):
1. Assign the points to the closest center
2. Compute the size of each cluster
3. Compute the sum (centroid) of points in each cluster
• Output the final centroid and size of each cluster
[BDMN 05]
Differentially Private K-means
• Initialize a set of k centers
• Suppose we fix the number of iterations to T
• In each iteration (given a set of centers):
  1. Assign the points to the closest center
  2. Compute the noisy size of each cluster
     (sensitivity S = 1; add Laplace(2T/ε) noise)
  3. Compute the noisy sum (centroid) of points in each cluster
     (sensitivity S = Dom; add Laplace(2T·Dom/ε) noise)
• Output the final centroid and size of each cluster
• Each iteration uses ε/T privacy (split evenly between the size
and sum queries), so by sequential composition the total privacy is ε
[BDMN 05]
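A minimal sketch of this recipe, assuming points are pre-scaled so every coordinate lies in [0, dom]; the noise scales mirror the slide annotations, and all names are illustrative:

    import numpy as np

    def dp_kmeans(points, k, T, epsilon, dom=1.0, rng=None):
        """Laplace k-means: T fixed iterations, each spending eps/T,
        split between noisy cluster sizes (sensitivity 1) and noisy
        per-coordinate sums (sensitivity dom)."""
        rng = rng or np.random.default_rng()
        points = np.asarray(points, dtype=float)
        n, d = points.shape
        centers = points[rng.choice(n, size=k, replace=False)].copy()
        for _ in range(T):
            # 1. assign each point to its nearest center
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            for j in range(k):
                members = points[labels == j]
                # 2. noisy size: Lap(2T/eps), as in the slide annotation
                size = len(members) + rng.laplace(scale=2 * T / epsilon)
                # 3. noisy sum: Lap(2T*dom/eps) per coordinate
                total = members.sum(axis=0) + rng.laplace(scale=2 * T * dom / epsilon, size=d)
                if size > 1:                  # guard against tiny/negative noisy sizes
                    centers[j] = total / size  # postprocessing: costs no extra budget
        return centers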
Results (T = 10 iterations, random initialization)

[Figure: clusters found by the original k-means algorithm vs. the Laplace k-means algorithm]
• Laplace k-means can distinguish clusters that are far apart
• Laplace k-means can't distinguish small clusters that are close by
Privacy preserving data mining
• Differential privacy
• Definition
• Building blocks (primitive mechanisms)
• Composition rules
• Data mining algorithms with differential privacy
• k-means clustering w/ differential privacy
• Frequent itemset/sequence mining w/
differential privacy
Frequent Sequence Mining (FSM)

Database D (minimum support = 3):

  ID    Record
  100   a→c→d
  200   b→c→d
  300   a→b→c→e→d
  400   d→b
  500   a→d→c→d

Scan D → C1: cand 1-seqs

  Sequence  Sup.
  {a}       3
  {b}       3
  {c}       4
  {d}       4
  {e}       1

F1: freq 1-seqs

  Sequence  Sup.
  {a}       3
  {b}       3
  {c}       4
  {d}       4

Generate candidates from F1, scan D → C2: cand 2-seqs

  Sequence  Sup.    Sequence  Sup.    Sequence  Sup.    Sequence  Sup.
  {a→a}     0       {b→a}     0       {c→a}     0       {d→a}     0
  {a→b}     1       {b→b}     2       {c→b}     0       {d→b}     1
  {a→c}     3       {b→c}     2       {c→c}     0       {d→c}     1
  {a→d}     3       {b→d}     1       {c→d}     4       {d→d}     0

F2: freq 2-seqs

  Sequence  Sup.
  {a→c}     3
  {a→d}     3
  {c→d}     4

Generate candidates from F2, scan D → C3: cand 3-seqs

  Sequence   Sup.
  {a→c→d}    3

F3: freq 3-seqs

  Sequence   Sup.
  {a→c→d}    3
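A small helper showing how the supports above can be recomputed from D; the subsequence test below is illustrative (not from the slides) and treats each record as an ordered list of items:

    def supports(record, pattern):
        """True if `pattern` occurs as a (not necessarily contiguous)
        subsequence of `record`."""
        it = iter(record)
        return all(item in it for item in pattern)

    db = [list("acd"), list("bcd"), list("abced"), list("db"), list("adcd")]
    print(sum(supports(r, list("acd")) for r in db))  # support of a→c→d: 3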
Baseline Differentially Private FSM

Database D:

  ID    Record
  100   a→c→d
  200   b→c→d
  300   a→b→c→e→d
  400   d→b
  500   a→d→c→d

Scan D → C1: cand 1-seqs, each support perturbed with Lap(|C1| / ε1) noise

  Sequence  Sup.  Noise
  {a}       3     0.2
  {b}       3     −0.4
  {c}       4     0.4
  {d}       4     −0.5
  {e}       1     0.8

F1: freq 1-seqs (noisy support ≥ 3)

  Sequence  Noisy Sup.
  {a}       3.2
  {c}       4.4
  {d}       3.5

Generate candidates from F1, scan D → C2: cand 2-seqs, perturbed with Lap(|C2| / ε2) noise

  Sequence  Sup.  Noise
  {a→a}     0     0.2
  {a→c}     3     0.3
  {a→d}     3     0.2
  {c→a}     0     −0.5
  {c→c}     0     0.8
  {c→d}     4     0.2
  {d→a}     0     0.3
  {d→c}     1     2.1
  {d→d}     0     −0.5

F2: freq 2-seqs (noisy support ≥ 3)

  Sequence  Noisy Sup.
  {a→c}     3.3
  {a→d}     3.2
  {c→d}     4.2
  {d→c}     3.1

(Note that {d→c}, with true support only 1, slips in as a false positive because of its large noise draw of 2.1.)

Generate candidates from F2, scan D → C3: cand 3-seqs, perturbed with Lap(|C3| / ε3) noise

  Sequence   Sup.  Noise
  {a→c→d}    3     0
  {a→d→c}    1     0.3

F3: freq 3-seqs (noisy support ≥ 3)

  Sequence   Noisy Sup.
  {a→c→d}    3
S. Xu, S. Su, X. Cheng, Z. Li, L. Xiong. Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning. ICDE 2015.
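A minimal sketch of the per-level noisy thresholding in this baseline, adding Lap(|Ck|/εk) noise to each candidate's true support (the interface and names are illustrative):

    import numpy as np

    def noisy_frequent(candidate_supports, epsilon_k, threshold, rng=None):
        """Perturb each candidate's support, then keep candidates whose
        noisy support clears the threshold."""
        rng = rng or np.random.default_rng()
        # One record can change up to |C_k| of the counts by 1, so the
        # L1 sensitivity of the support vector is |C_k|.
        scale = len(candidate_supports) / epsilon_k
        noisy = {s: sup + rng.laplace(scale=scale)
                 for s, sup in candidate_supports.items()}
        return {s: v for s, v in noisy.items() if v >= threshold}

    # Level-2 step of the running example (true supports from scanning D):
    c2 = {"a→a": 0, "a→c": 3, "a→d": 3, "c→a": 0, "c→c": 0,
          "c→d": 4, "d→a": 0, "d→c": 1, "d→d": 0}
    f2 = noisy_frequent(c2, epsilon_k=1.0, threshold=3)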
Frequent pattern (subgraph) mining
• Represent each record as a graph
• Modeling the co-occurrence between diagnoses, procedures, and medications
• Frequent subgraph mining with differential privacy
[Figure: input graphs over vertices v1–v4; with threshold = 3, the frequent subgraphs have support = 3]

S. Xu, S. Su, L. Xiong, X. Cheng, K. Xiao. Differentially Private Frequent Subgraph Mining. ICDE 2016.
Acknowledgements
• Research support
• Center for Comprehensive Informatics
• Woodrow Wilson Foundation
• Cisco research award
• Students
• James Gardner
• Yonghui Xiao
• Collaborators
• Andrew Post, CCI
• Fusheng Wang, CCI
• Tyrone Grandison, IBM
• Chun Yuan, Tsinghua
Emory Assured Information
Management and Sharing (AIMS) Lab
• Collect, use, analyze, share data
without compromising privacy