Detecting Data Leakage Panagiotis Papadimitriou [email protected] Hector Garcia-Molina [email protected]



Leakage Problem

Stanford Infolab 2

[Figure: applications U1 and U2; the profiles of Kathryn, Jeremy, Sarah and Mark (e.g. "Name: Mark, Sex: Male, …" and "Name: Sarah, Sex: Female, …"); other sources, e.g. Sarah's network.]

Outline

• Problem Description

• Guilt Models

– Pr{U1 leaked data} = 0.7

– Pr{U2 leaked data} = 0.2

• Distribution Strategies


Problem Entities

Entity                              Dataset

Distributor (Facebook)              T: set of all Facebook profiles

Agents (Facebook apps U1, …, Un)    R1, …, Rn, where Ri is the set of profiles of people who have added the application Ui

Leaker                              S: set of leaked profiles


Agents’ Data Requests

• Sample

– 100 profiles of Stanford people

• Explicit

– All people who added application (example we used so far)

– All Stanford profiles


Guilt Models (1/3)

[Figure: a leaked profile may come from other sources, e.g. Sarah's network, with probability p.]

p: posterior probability that a leaked profile comes from other sources

Guilty Agent: Agent who leaks at least one profile

Pr{Gi|S}: probability that agent Ui is guilty, given the leaked set of profiles S

Guilt Models (2/3)

Agents leak each of their data items independently

or

Agents leak all their data items OR nothing

[Figure: leak scenarios with probabilities $(1-p)^2$, $(1-p)p$, $p(1-p)$ and $p^2$.]
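The slides do not spell out a formula for Pr{Gi|S}. As a rough sketch under the independent-leak model, one can assume that each leaked item came from other sources with probability p and otherwise is equally likely to have been leaked by any agent that holds it, which gives Pr{Gi|S} = 1 − Π over t in S ∩ Ri of (1 − (1 − p)/|Vt|), where Vt is the set of agents holding t. The function name, the `allocations` layout and the closed form itself are assumptions made for illustration.

```python
from math import prod


def guilt_probabilities(allocations, leaked, p):
    """Estimate Pr{Gi|S} for every agent (assumed independent-leak model).

    allocations: dict mapping agent name -> set Ri of items given to it
    leaked:      set S of leaked items
    p:           probability that a leaked item came from other sources
    """
    result = {}
    for agent, items in allocations.items():
        factors = []
        for t in leaked & items:
            # V_t: agents that received item t (assumed equally likely leakers).
            holders = [a for a, r in allocations.items() if t in r]
            # Probability that this agent did NOT leak item t.
            factors.append(1 - (1 - p) / len(holders))
        # Guilty means "leaked at least one item", so take the complement of
        # "leaked none of the items in S ∩ Ri". An empty product is 1.
        result[agent] = 1 - prod(factors)
    return result


# Hypothetical example: Sarah's profile is held by both agents.
allocations = {"U1": {"Jeremy", "Sarah"}, "U2": {"Sarah", "Mark"}}
guilt = guilt_probabilities(allocations, leaked={"Jeremy", "Sarah"}, p=0.2)
print(guilt)  # U1 is about 0.88, U2 is about 0.4
```

U1 scores higher here because Jeremy's leaked profile was given to U1 only, while Sarah's profile is shared with U2.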

Guilt Models (3/3)

[Figure: two plots with axes Pr{G1} and Pr{G2}, one for each guilt model: "Independently" and "NOT Independently".]


The Distributor’s Objective (1/2)

[Figure: agents U1, …, U4 each send a request and receive the sets R1, …, R4; S is the leaked set.]

Pr{G1|S} >> Pr{G2|S}

Pr{G1|S} >> Pr{G4|S}

The Distributor’s Objective (2/2)

• To achieve his objective, the distributor has to distribute sets R1, …, Rn that minimize

$$\sum_{i=1}^{n} \frac{1}{|R_i|} \sum_{\substack{j=1 \\ j \neq i}}^{n} |R_i \cap R_j|$$

• Intuition: Minimizing the data shared among agents makes the leaked data reveal the guilty agents
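The objective above can be evaluated directly for any candidate distribution: for each agent, count the items it shares with every other agent and divide by the size of its own set. The sketch below is a minimal illustration; the function name and the list-of-sets representation are assumptions, and every Ri is assumed to be non-empty.

```python
def overlap_objective(sets):
    """Return the sum over i of (1/|Ri|) * sum over j != i of |Ri ∩ Rj|.

    sets: list of non-empty Python sets, one per agent (R1, ..., Rn).
    """
    total = 0.0
    for i, ri in enumerate(sets):
        # Items of agent i that also appear in each other agent's set.
        shared = sum(len(ri & rj) for j, rj in enumerate(sets) if j != i)
        total += shared / len(ri)
    return total


# Hypothetical distributions over the example profiles.
disjoint = [{"Kathryn", "Jeremy"}, {"Sarah", "Mark"}]
overlapping = [{"Kathryn", "Jeremy"}, {"Jeremy", "Sarah"}]
print(overlap_objective(disjoint))     # 0.0
print(overlap_objective(overlapping))  # 1.0
```

A value of 0 means that no item was given to more than one agent, so any leaked item that did not come from other sources points at a single agent.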

Distribution Strategies – Sample (1/4)

• Set T has four profiles:

– Kathryn, Jeremy, Sarah and Mark

• There are 4 agents:

– U1, U2, U3 and U4

• Each agent requests a sample of any 2 profiles of T for a market survey


Distribution Strategies – Sample (2/4)

[Figure: two example distributions of profiles to agents U1, …, U4, one of them labeled "Poor".]

Minimize $\sum_{i \neq j} |R_i \cap R_j|$

Distribution Strategies – Sample (3/4)

• Optimal Distribution

• Avoid full overlaps and minimize

$$\sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|$$

[Figure: distribution of profiles to agents U1, …, U4.]
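To see why avoiding full overlaps matters in addition to minimizing the overlap sum, the sketch below compares two hypothetical distributions for the four-profile, four-agent example. The concrete assignments `poor` and `better` are assumptions made for illustration; they are not recovered from the slide figures.

```python
from itertools import combinations


def pairwise_overlap(sets):
    """Sum of |Ri ∩ Rj| over all ordered pairs with i != j."""
    return 2 * sum(len(a & b) for a, b in combinations(sets, 2))


def full_overlaps(sets):
    """Number of unordered agent pairs that received identical sets."""
    return sum(1 for a, b in combinations(sets, 2) if a == b)


# Hypothetical assignments: each agent receives 2 of the 4 profiles.
poor = [{"Kathryn", "Jeremy"}, {"Kathryn", "Jeremy"},
        {"Sarah", "Mark"}, {"Sarah", "Mark"}]
better = [{"Kathryn", "Jeremy"}, {"Jeremy", "Sarah"},
          {"Sarah", "Mark"}, {"Mark", "Kathryn"}]

print(pairwise_overlap(poor), full_overlaps(poor))      # 8 2
print(pairwise_overlap(better), full_overlaps(better))  # 8 0
```

Both assignments have the same total overlap, but in `poor` two pairs of agents hold identical sets, so a leak of either set cannot distinguish the two agents in a pair. In `better`, any leaked pair of profiles that forms one agent's set is held in full by that agent only.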

Distribution Strategies – Sample (4/4)


Distribution Strategies

Sample Data Requests

• The distributor has the freedom to select the data items to provide the agents with

• General Idea:

– Provide agents with sets of data that are as disjoint as possible

• Problem: There are cases where the distributed data must overlap, e.g., |R1| + … + |Rn| > |T|

Explicit Data Requests

• The distributor must provide agents with the data they request

• General Idea:

– Add fake data to the distributed ones to minimize overlap of distributed data

• Problem: Agents can collude and identify fake data

• NOT COVERED in this talk
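For sample data requests, the general idea of handing out sets that are as disjoint as possible can be sketched with a simple greedy rule: give each agent the items that have been distributed least often so far. This is an illustrative heuristic, not necessarily the allocation algorithm the authors use; the function name and the `requests` layout are assumptions.

```python
def allocate_samples(items, requests):
    """Greedily give each agent the items handed out least often so far.

    items:    list of all items in T
    requests: dict mapping agent name -> number of items requested
    """
    counts = {t: 0 for t in items}  # how many agents hold each item
    allocation = {}
    for agent, m in requests.items():
        # Prefer the least-shared items; sorted() is stable, so ties keep
        # the original order of `items`.
        chosen = sorted(items, key=lambda t: counts[t])[:m]
        for t in chosen:
            counts[t] += 1
        allocation[agent] = set(chosen)
    return allocation


profiles = ["Kathryn", "Jeremy", "Sarah", "Mark"]

# Two requests of size 2 fit into T without any overlap.
two_agents = allocate_samples(profiles, {"U1": 2, "U2": 2})

# A third request forces overlap, since 2 + 2 + 2 > |T| = 4.
three_agents = allocate_samples(profiles, {"U1": 2, "U2": 2, "U3": 2})
```

With two agents the sets come out disjoint. With three, U3 necessarily receives items that other agents already hold. Note that this simple rule does not try to avoid full overlaps between agents; a more careful strategy would also break ties with that in mind.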

Conclusions

• Data Leakage

• Modeled as maximum likelihood problem

• Data distribution strategies that help identify the guilty agents


Thank You!