Privacy Preserving Data Mining
Li Xiong
Department of Mathematics and Computer Science
Department of Biomedical Informatics
Emory University
CS378 Introduction to Data Mining
Netflix Sequel
• 2006, Netflix announced the challenge
• 2007, researchers from the University of Texas identified
individuals by matching the Netflix dataset with IMDb
• July 2009, $1M grand prize awarded
• August 2009, Netflix announced the second challenge
• December 2009, four Netflix users filed a class action
lawsuit against Netflix
• March 2010, Netflix canceled the second challenge
Facebook-Cambridge Analytica
• April 2010, Facebook launched Open Graph
• 2013, 300,000 users took the psychographic personality
test app "thisisyourdigitallife"
• 2016, Trump's campaign invested heavily in Facebook ads
• March 2018, reports revealed that 50 million (later revised
to 87 million) Facebook profiles were harvested for
Cambridge Analytica and used for Trump's campaign
• April 11, 2018, Zuckerberg testified before Congress
• How many people know we are here?
(a) no one
(b) 1-10, i.e., family and friends
(c) 10-100, i.e., colleagues and more (social network friends)
Who Knows What About Me? A Survey of Behind the Scenes Personal Data Sharing to Third Parties by Mobile Apps,
2015-10-30 https://techscience.org/a/2015103001/
• 73% of Android apps shared personal info (e.g., email
address) with third parties, and 33% shared GPS coordinates
• 45% of iOS apps shared email addresses with third parties,
and 47% shared GPS coordinates
[Figure: location data sharing by iOS apps (left) to domains (right)]
The EHR Data Map
Shopping records
Big Data Goes Personal
• Movie ratings
• Social network/media data
• Mobile GPS data
• Electronic medical records
• Shopping history
• Online browsing history
Data Mining

Data Mining … the dark side

[Diagram: Private Data → Privacy Preserving Data Mining → Sanitized Data / Models]
Privacy Preserving Data Mining
• Privacy goal: personal data is not revealed and cannot be
inferred
• Utility goal: data/models as close to the private data as
possible
Privacy preserving data mining
• Differential privacy
• Definition
• Building blocks (primitive mechanisms)
• Composition rules
• Data mining algorithms with differential privacy
• k-means clustering w/ differential privacy
• Frequent pattern mining w/ differential privacy
Differential Privacy

Traditional De-identification and Anonymization

[Diagram: Original Data → de-identification / anonymization → Sanitized View]
• Attribute suppression, perturbation, generalization
• Inference possible with external data
Massachusetts GIC Incident (1990s)
• Massachusetts Group Insurance Commission (GIC) encounter
data ("de-identified"), mid 1990s
• External information: voter roll from the city of Cambridge
• Governor's health records identified
• 87% of Americans can be uniquely identified using zip code,
birthdate, and sex (2000)

  Name     SSN        Birth date  Zip    Diagnosis
  Alice    123456789  44          48202  AIDS
  Bob      323232323  44          48202  AIDS
  Charley  232345656  44          48201  Asthma
  Dave     333333333  55          48310  Asthma
  Eva      666666666  55          48310  Diabetes
AOL Query Log Release (2006)
• 20 million Web search queries released by AOL
• User 4417749:
  • "numb fingers"
  • "60 single men"
  • "dog that urinates on everything"
  • "landscapers in Lilburn, Ga"
  • several people with the last name Arnold
  • "homes sold in shadow lake subdivision
    gwinnett county georgia"

  AnonID  Query                 QueryTime            ItemRank  ClickURL
  217     lottery               2006-03-01 11:58:51  1         http://www.calottery.com
  217     lottery               2006-03-27 14:10:38  1         http://www.calottery.com
  1268    gall stones           2006-05-11 02:12:51
  1268    gallstones            2006-05-11 02:13:02  1         http://www.niddk.nih.gov
  1268    ozark horse blankets  2006-03-01 17:39:28  8         http://www.blanketsnmore.com
The Genome Hacker (2013)
Differential Privacy
• Statistical outcome (view) is indistinguishable regardless
of whether a particular user is included in the data
[Diagram: Private Data D → privacy preserving data mining/sharing mechanism → Models / Data]
Differential Privacy
• View is indistinguishable regardless of the input

[Diagram: a neighboring database D′, differing from D in one user's record, fed to the same mechanism yields an indistinguishable view]

Differential privacy: an example

[Figure: original records, the original histogram, and the histogram perturbed with differential privacy]
Laplace Mechanism

[Figure: Laplace distribution Lap(S/ε)]

[Diagram: a query q is issued against the Private Data; the true answer is q(D), but the mechanism releases q(D) + η, with noise η drawn from Lap(S/ε)]
Laplace Distribution
• PDF: f(x | u, b) = (1 / 2b) · exp(−|x − u| / b)
• Denoted as Lap(b) when u = 0
• Mean u
• Variance 2b²
How much noise for privacy?
Sensitivity: Consider a query q: I → R. S(q) is the smallest number s.t.
for any neighboring tables D, D’,
| q(D) – q(D’) | ≤ S(q)
Theorem: If sensitivity of the query is S, then the algorithm
A(D) = q(D) + Lap(S(q)/ε) guarantees ε-differential privacy
[Dwork et al., TCC 2006]
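A minimal Python sketch of this mechanism (the function name and NumPy-based sampling are illustrative assumptions, not from the slides):

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
        """Release true_answer + eta with eta ~ Lap(sensitivity/epsilon).

        Satisfies epsilon-differential privacy whenever `sensitivity`
        upper-bounds |q(D) - q(D')| over all neighboring tables D, D'.
        """
        rng = rng or np.random.default_rng()
        return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)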
Example: COUNT query
• Number of people having HIV+
• Sensitivity = ?
Example: COUNT query
• Number of people having HIV+
• Sensitivity = 1
• ε-differentially private count: 3 + η, where η is drawn from Lap(1/ε)
• Mean = 0
• Variance = 2/ε2
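Reusing the hypothetical laplace_mechanism sketch above, the noisy count might look like:

    # COUNT has sensitivity 1: adding or removing one person changes
    # the count by at most 1, so the noise scale is 1/epsilon.
    noisy_hiv_count = laplace_mechanism(true_answer=3, sensitivity=1, epsilon=0.1)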
Example: Sum (Average) query
• Sum of Age (suppose Age is in [a,b])
• Sensitivity = ?
Example: Sum (Average) query
• Sum of Age (suppose Age is in [a,b])
• Sensitivity = b (assuming ages are nonnegative, adding or removing one record changes the sum by at most b)
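A sketch of the noisy sum under the stated assumption that every age lies in [a, b] with 0 ≤ a ≤ b; clipping enforces the assumption, and the names are illustrative:

    import numpy as np

    def noisy_age_sum(ages, a, b, epsilon, rng=None):
        """epsilon-DP sum of ages assumed to lie in [a, b]."""
        rng = rng or np.random.default_rng()
        clipped = np.clip(ages, a, b)      # enforce the [a, b] assumption
        # Adding or removing one record changes the sum by at most b,
        # so b is the sensitivity used for the Laplace scale.
        return clipped.sum() + rng.laplace(scale=b / epsilon)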
Composition theorems
• Sequential composition: Σᵢ εᵢ-differential privacy
• Parallel composition: max(εᵢ)-differential privacy
Sequential Composition
• If M1, M2, ..., Mk are algorithms that access a private
database D such that each Mi satisfies εi-differential privacy,
then the combination of their outputs satisfies
ε-differential privacy with ε = ε1 + ... + εk
Parallel Composition
• If M1, M2, ..., Mk are algorithms that access disjoint
databases D1, D2, …, Dk such that each Mi satisfies
εi-differential privacy,
then the combination of their outputs satisfies
ε-differential privacy with ε = max{ε1, ..., εk}
Postprocessing
• If M1 is an ε-differentially private algorithm that accesses a
private database D,
then outputting M2(M1(D)) also satisfies ε-differential
privacy.
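A toy sketch of how these budget-accounting rules combine (the function names are illustrative, not from the slides):

    def sequential_budget(epsilons):
        # Sequential composition: mechanisms that touch the same
        # data pay for every query; budgets add up.
        return sum(epsilons)

    def parallel_budget(epsilons):
        # Parallel composition: mechanisms on disjoint partitions
        # cost only the largest single budget, since one record
        # appears in at most one partition.
        return max(epsilons)

    # Example: a 10-bin histogram releases 10 counts over disjoint bins.
    parallel_budget([0.1] * 10)    # 0.1 -- one record affects one bin
    sequential_budget([0.1] * 10)  # 1.0 -- if the queries had overlapped
    # Postprocessing (e.g., rounding the noisy counts) costs no extra budget.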
Source: Module 2, Tutorial: Differential Privacy in the Wild
Privacy preserving data mining
• Differential privacy
• Definition
• Building blocks (primitive mechanisms)
• Composition rules
• Data mining algorithms with differential privacy
• k-means clustering w/ differential privacy
• Frequent itemset mining w/ differential privacy
Privacy Preserving Data Mining as Constrained
Optimization
• Two goals
• Privacy
• Error (utility)
• Given a task and privacy budget ε, how to design a set of
queries (functions) and allocate the budget such that the
error is minimized?
Data mining algorithms with differential privacy
• General algorithmic framework
• Decompose a data mining algorithm into a set of
functions
• Allocate privacy budget εi to each function
• Implement each function with εi-differential privacy
• Compute noisy output using the Laplace mechanism
based on the sensitivity of the function and εi
• Compose them using the composition theorems
• Optimization techniques
• Decomposition design
• Budget allocation
• Sensitivity reduction for each function
Review: K-means Clustering
K-means Problem
• Partition a set of points x1, x2, …, xn into k clusters S1, S2,
…, Sk such that the SSE is minimized:
  SSE = Σ_{i=1..k} Σ_{x ∈ Si} ||x − μi||², where μi is the mean of cluster Si
K-means Algorithm
• Initialize a set of k centers
• Repeat until convergence
1. Assign each point to its nearest center
2. Update the set of centers
• Output final set of k centers and the points in each cluster
Differentially Private K-means
• Initialize a set of k centers
• Repeat iterations until convergence
• In each iteration (given a set of centers):
1. Assign the points to the closest center
2. Compute the size of each cluster
3. Compute the sum (centroid) of points in each cluster
• Output the final centroid and size of each cluster
[BDMN 05]
Differentially Private K-means
• Initialize a set of k centers
• Suppose we fix the number of iterations to T
• In each iteration (given a set of centers):
  1. Assign the points to the closest center
  2. Compute the noisy size of each cluster
     (sensitivity S = 1; add Laplace(2T/ε) noise)
  3. Compute the noisy sum (centroid) of points in each cluster
     (sensitivity S = Dom; add Laplace(2T·Dom/ε) noise)
• Output the final centroid and size of each cluster
• Each iteration uses ε/T privacy (split evenly between the size
and sum queries), so by sequential composition the total privacy is ε
[BDMN 05]
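A minimal sketch of this recipe, assuming points are pre-scaled so every coordinate lies in [0, dom]; the noise scales mirror the slide annotations, and all names are illustrative:

    import numpy as np

    def dp_kmeans(points, k, T, epsilon, dom=1.0, rng=None):
        """Laplace k-means: T fixed iterations, each spending eps/T,
        split between noisy cluster sizes (sensitivity 1) and noisy
        per-coordinate sums (sensitivity dom)."""
        rng = rng or np.random.default_rng()
        points = np.asarray(points, dtype=float)
        n, d = points.shape
        centers = points[rng.choice(n, size=k, replace=False)].copy()
        for _ in range(T):
            # 1. assign each point to its nearest center
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            for j in range(k):
                members = points[labels == j]
                # 2. noisy size: Lap(2T/eps), as in the slide annotation
                size = len(members) + rng.laplace(scale=2 * T / epsilon)
                # 3. noisy sum: Lap(2T*dom/eps) per coordinate
                total = members.sum(axis=0) + rng.laplace(scale=2 * T * dom / epsilon, size=d)
                if size > 1:                  # guard against tiny/negative noisy sizes
                    centers[j] = total / size  # postprocessing: costs no extra budget
        return centers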
Results (T = 10 iterations, random initialization)

[Figure: clusters found by the original k-means algorithm vs. the Laplace k-means algorithm]
• Laplace k-means can distinguish clusters that are far apart
• Laplace k-means can't distinguish small clusters that are close by
Privacy preserving data mining
• Differential privacy
• Definition
• Building blocks (primitive mechanisms)
• Composition rules
• Data mining algorithms with differential privacy
• k-means clustering w/ differential privacy
• Frequent itemset/sequence mining w/
differential privacy
Frequent Sequence Mining (FSM)

Database D (minimum support = 3):

  ID    Record
  100   a→c→d
  200   b→c→d
  300   a→b→c→e→d
  400   d→b
  500   a→d→c→d

Scan D → C1: cand 1-seqs

  Sequence  Sup.
  {a}       3
  {b}       3
  {c}       4
  {d}       4
  {e}       1

F1: freq 1-seqs

  Sequence  Sup.
  {a}       3
  {b}       3
  {c}       4
  {d}       4

Generate candidates from F1, scan D → C2: cand 2-seqs

  Sequence  Sup.    Sequence  Sup.    Sequence  Sup.    Sequence  Sup.
  {a→a}     0       {b→a}     0       {c→a}     0       {d→a}     0
  {a→b}     1       {b→b}     2       {c→b}     0       {d→b}     1
  {a→c}     3       {b→c}     2       {c→c}     0       {d→c}     1
  {a→d}     3       {b→d}     1       {c→d}     4       {d→d}     0

F2: freq 2-seqs

  Sequence  Sup.
  {a→c}     3
  {a→d}     3
  {c→d}     4

Generate candidates from F2, scan D → C3: cand 3-seqs

  Sequence   Sup.
  {a→c→d}    3

F3: freq 3-seqs

  Sequence   Sup.
  {a→c→d}    3
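A small helper showing how the supports above can be recomputed from D; the subsequence test below is illustrative (not from the slides) and treats each record as an ordered list of items:

    def supports(record, pattern):
        """True if `pattern` occurs as a (not necessarily contiguous)
        subsequence of `record`."""
        it = iter(record)
        return all(item in it for item in pattern)

    db = [list("acd"), list("bcd"), list("abced"), list("db"), list("adcd")]
    print(sum(supports(r, list("acd")) for r in db))  # support of a→c→d: 3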
Baseline Differentially Private FSM

Database D:

  ID    Record
  100   a→c→d
  200   b→c→d
  300   a→b→c→e→d
  400   d→b
  500   a→d→c→d

Scan D → C1: cand 1-seqs, each support perturbed with Lap(|C1| / ε1) noise

  Sequence  Sup.  Noise
  {a}       3     0.2
  {b}       3     −0.4
  {c}       4     0.4
  {d}       4     −0.5
  {e}       1     0.8

F1: freq 1-seqs (noisy support ≥ 3)

  Sequence  Noisy Sup.
  {a}       3.2
  {c}       4.4
  {d}       3.5

Generate candidates from F1, scan D → C2: cand 2-seqs, perturbed with Lap(|C2| / ε2) noise

  Sequence  Sup.  Noise
  {a→a}     0     0.2
  {a→c}     3     0.3
  {a→d}     3     0.2
  {c→a}     0     −0.5
  {c→c}     0     0.8
  {c→d}     4     0.2
  {d→a}     0     0.3
  {d→c}     1     2.1
  {d→d}     0     −0.5

F2: freq 2-seqs (noisy support ≥ 3)

  Sequence  Noisy Sup.
  {a→c}     3.3
  {a→d}     3.2
  {c→d}     4.2
  {d→c}     3.1

(Note that {d→c}, with true support only 1, slips in as a false positive because of its large noise draw of 2.1.)

Generate candidates from F2, scan D → C3: cand 3-seqs, perturbed with Lap(|C3| / ε3) noise

  Sequence   Sup.  Noise
  {a→c→d}    3     0
  {a→d→c}    1     0.3

F3: freq 3-seqs (noisy support ≥ 3)

  Sequence   Noisy Sup.
  {a→c→d}    3
S. Xu, S. Su, X. Cheng, Z. Li, L. Xiong. Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning. ICDE 2015.
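A minimal sketch of the per-level noisy thresholding in this baseline, adding Lap(|Ck|/εk) noise to each candidate's true support (the interface and names are illustrative):

    import numpy as np

    def noisy_frequent(candidate_supports, epsilon_k, threshold, rng=None):
        """Perturb each candidate's support, then keep candidates whose
        noisy support clears the threshold."""
        rng = rng or np.random.default_rng()
        # One record can change up to |C_k| of the counts by 1, so the
        # L1 sensitivity of the support vector is |C_k|.
        scale = len(candidate_supports) / epsilon_k
        noisy = {s: sup + rng.laplace(scale=scale)
                 for s, sup in candidate_supports.items()}
        return {s: v for s, v in noisy.items() if v >= threshold}

    # Level-2 step of the running example (true supports from scanning D):
    c2 = {"a→a": 0, "a→c": 3, "a→d": 3, "c→a": 0, "c→c": 0,
          "c→d": 4, "d→a": 0, "d→c": 1, "d→d": 0}
    f2 = noisy_frequent(c2, epsilon_k=1.0, threshold=3)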
Frequent pattern (subgraph) mining
• Represent each record as a graph
• Modeling the co-occurrence between diagnoses, procedures, and medications
• Frequent subgraph mining with differential privacy
[Figure: input graphs over vertices v1–v4; with threshold = 3, the frequent subgraphs have support = 3]

S. Xu, S. Su, L. Xiong, X. Cheng, K. Xiao. Differentially Private Frequent Subgraph Mining. ICDE 2016.
Acknowledgements
• Research support
• Center for Comprehensive Informatics
• Woodrow Wilson Foundation
• Cisco research award
• Students
• James Gardner
• Yonghui Xiao
• Collaborators
• Andrew Post, CCI
• Fusheng Wang, CCI
• Tyrone Grandison, IBM
• Chun Yuan, Tsinghua
Emory Assured Information
Management and Sharing (AIMS) Lab
• Collect, use, analyze, share data
without compromising privacy