Differential Privacy: Theoretical & Practical Challenges Salil Vadhan Center for Research on Computation & Society John A. Paulson School of Engineering & Applied Sciences Harvard University on sabbatical at Shing-Tung Yau Center Department of Applied Mathematics National Chiao-Tung University Lecture at Institute of Information Science, Academia Sinica November 9, 2015


Differential Privacy: Theoretical & Practical Challenges

Salil Vadhan

Center for Research on Computation & Society
John A. Paulson School of Engineering & Applied Sciences
Harvard University

on sabbatical at
Shing-Tung Yau Center, Department of Applied Mathematics
National Chiao-Tung University

Lecture at Institute of Information Science, Academia Sinica
November 9, 2015

Data Privacy: The Problem

Given a dataset with sensitive information, such as:
• Census data
• Health records
• Social network activity
• Telecommunications data

How can we:
• enable “desirable uses” of the data
• while protecting the “privacy” of the data subjects?

Examples of desirable uses:
• Academic research
• Informing policy
• Identifying subjects for drug trial
• Searching for terrorists
• Market analysis
• …

Approach 1: Encrypt the Data

Name   Sex  Blood  HIV?
Chen   F    B      Y
Jones  M    A      N
Smith  M    O      N
Ross   M    O      Y
Lu     F    A      N
Shah   M    B      Y

Encrypted:

Name    Sex     Blood   HIV?
100101  001001  110101  110111
101010  111010  111111  001001
001010  100100  011001  110101
001110  010010  110101  100001
110101  000000  111001  010010
111110  110010  000101  110101

Problems?

Approach 2: Anonymize the Data

Name   Sex  Blood  HIV?
Chen   F    B      Y
Jones  M    A      N
Smith  M    O      N
Ross   M    O      Y
Lu     F    A      N
Shah   M    B      Y

Problems? “Re-identification” is often easy [Sweeney ’97].

Approach 3: Mediate Access

[Diagram: data analysts send queries q1, q2, q3 to a trusted “curator” C, which holds the dataset below and returns answers a1, a2, a3.]

Name   Sex  Blood  HIV?
Chen   F    B      Y
Jones  M    A      N
Smith  M    O      N
Ross   M    O      Y
Lu     F    A      N
Shah   M    B      Y

Problems? Even simple “aggregate” statistics can reveal individual info [Dinur-Nissim ’03, Homer et al. ’08, Mukatran et al. ’11, Dwork et al. ’15].
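As a hedged illustration of why exact aggregates can leak, here is a minimal sketch of a "differencing" attack on the toy table from the slides. It assumes a curator that answers exact counting queries over arbitrary row predicates; the function names are hypothetical, and a real attacker would more likely single out the target through a combination of attributes than by name.

```python
# Toy dataset taken from the slides: (name, sex, blood type, HIV positive?).
ROWS = [
    ("Chen", "F", "B", True),
    ("Jones", "M", "A", False),
    ("Smith", "M", "O", False),
    ("Ross", "M", "O", True),
    ("Lu", "F", "A", False),
    ("Shah", "M", "B", True),
]


def count_hiv_positive(rows, predicate):
    """Exact aggregate query: number of HIV-positive rows matching predicate."""
    return sum(1 for row in rows if predicate(row) and row[3])


def differencing_attack(rows, target_name):
    """Recover one person's HIV status from two exact counts."""
    everyone = count_hiv_positive(rows, lambda r: True)
    all_but_target = count_hiv_positive(rows, lambda r: r[0] != target_name)
    # The two answers differ by exactly the target's own contribution.
    return everyone - all_but_target == 1


print(differencing_attack(ROWS, "Ross"))   # True
print(differencing_attack(ROWS, "Jones"))  # False
```

Neither query looks individual-specific on its own; the leak comes from combining them.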

Privacy Models from Theoretical CS

Model                                         Utility                          Privacy                                       Who Holds Data?
Differential Privacy                          statistical analysis of dataset  individual-specific info                      trusted curator
Secure Function Evaluation                    any query desired                everything other than result of query         original users (or semi-trusted delegates)
Fully Homomorphic (or Functional) Encryption  any query desired                everything (except possibly result of query)  untrusted server

Differential privacy
[Dinur-Nissim ’03+Dwork, Dwork-Nissim ’04, Blum-Dwork-McSherry-Nissim ’05, Dwork-McSherry-Nissim-Smith ’06]

[Diagram: data analysts send queries q1, q2, q3 to a curator C, which holds the dataset below and returns answers a1, a2, a3.]

Sex  Blood  HIV?
F    B      Y
M    A      N
M    O      N
M    O      Y
F    A      N
M    B      Y

Requirement: effect of each individual should be “hidden”

Differential privacy
[Dinur-Nissim ’03+Dwork, Dwork-Nissim ’04, Blum-Dwork-McSherry-Nissim ’05, Dwork-McSherry-Nissim-Smith ’06]

[Diagram: an adversary sends queries q1, q2, q3 to the curator C, which holds the dataset (columns Sex, Blood, HIV?) and returns answers a1, a2, a3.]

Requirement: an adversary shouldn’t be able to tell if any one person’s data were changed arbitrarily

[Successive frames repeat this slide with the second row of the dataset (M A N) first removed, then replaced by F A Y.]

Simple approach: random noise

[Diagram: an analyst asks the curator C, which holds the n-row dataset below, “What fraction of people are type B and HIV positive?” and receives Answer + Noise(O(1/n)).]

Sex  Blood  HIV?
F    B      Y
M    A      N
M    O      N
M    O      Y
F    A      N
M    B      Y

• Error → 0 as n → ∞
• Very little noise needed to hide each person as n → ∞
• Note: this is just for one query

Differential privacy
[Dinur-Nissim ’03+Dwork, Dwork-Nissim ’04, Blum-Dwork-McSherry-Nissim ’05, Dwork-McSherry-Nissim-Smith ’06]

[Diagram: an adversary sends queries q1, q2, q3 to a randomized curator C, which holds the dataset below and returns answers a1, a2, a3.]

Sex  Blood  HIV?
F    B      Y
F    A      Y
M    O      N
M    O      Y
F    A      N
M    B      Y

Requirement: for all D, D’ differing on one row, and all q1,…,qt

Distribution of C(D,q1,…,qt) ≈ Distribution of C(D’,q1,…,qt)

Differential privacy
[Dinur-Nissim ’03+Dwork, Dwork-Nissim ’04, Blum-Dwork-McSherry-Nissim ’05, Dwork-McSherry-Nissim-Smith ’06]

[Diagram: an adversary sends queries q1, q2, q3 to a randomized curator C, which holds the dataset (columns Sex, Blood, HIV?) and returns answers a1, a2, a3.]

Requirement: for all D, D’ differing on one row, and all q1,…,qt

∀ sets T, Pr[C(D,q1,…,qt) ∈ T] ≤ (1+ε) · Pr[C(D’,q1,…,qt) ∈ T]

Simple approach: random noise

[Diagram: an analyst asks the curator C, which holds the n-row dataset (columns Sex, Blood, HIV?), “What fraction of people are type B and HIV positive?” and receives Answer + Laplace(1/(εn)). Plot: density of the Laplace noise.]

• Very little noise needed to hide each person as n → ∞
• Note: this is just for one query
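The noisy-answer idea above can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: it assumes the noise scale is 1/(εn), where ε is the privacy parameter and n the number of rows, and all function names are hypothetical. The guarantee checked by `laplace_density` is the standard e^ε form, which is close to the slide's (1+ε) when ε is small.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_fraction(bits, epsilon, rng):
    """Return the fraction of 1s in `bits` plus Laplace noise.

    Changing one row moves the true fraction by at most 1/n, so the
    noise scale is taken to be 1/(epsilon * n) (an assumption here).
    """
    n = len(bits)
    return sum(bits) / n + laplace_noise(1.0 / (epsilon * n), rng)


def laplace_density(x, center, scale):
    """Density at x of a Laplace distribution with the given center and scale."""
    return math.exp(-abs(x - center) / scale) / (2.0 * scale)


# 1 marks a row that is type B and HIV positive in the slide's table.
BITS = [1, 0, 0, 0, 0, 1]
rng = random.Random(0)  # fixed seed only to make the demo repeatable
print(private_fraction(BITS, epsilon=0.5, rng=rng))
```

For two datasets differing in one row, the output densities are centered at most 1/n apart, so their ratio at any point is at most exp((1/n) / (1/(εn))) = e^ε. The noise scale shrinks like 1/n, which is why hiding each person gets cheaper as n grows.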

Answering multiple queries

[Diagram: an analyst asks the curator C, which holds the n-row dataset (columns Sex, Blood, HIV?), k queries such as “What fraction of people are type B and HIV positive?” and receives Answer + Laplace(√k/(εn)) for each.]

Composition Thm [Dwork-Rothblum-V. ’10]: k independent ε-DP algs ⇒ ≈ (√k·ε)-DP

Error → 0 if k ≪ n²
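To make the composition statement concrete, the sketch below compares the total privacy loss of k mechanisms under basic composition (loss k·ε) and under the advanced composition bound of Dwork-Rothblum-Vadhan, which gives (ε', δ)-DP with ε' = ε·√(2k·ln(1/δ)) + k·ε·(e^ε − 1). The function names and the example values of ε, k and δ are illustrative assumptions, not taken from the slides.

```python
import math


def basic_composition(epsilon, k):
    """Total privacy loss of k epsilon-DP mechanisms under basic composition."""
    return k * epsilon


def advanced_composition(epsilon, k, delta):
    """Total epsilon' such that k epsilon-DP mechanisms are (epsilon', delta)-DP.

    Uses the bound epsilon * sqrt(2k ln(1/delta)) + k * epsilon * (e^epsilon - 1).
    """
    return (epsilon * math.sqrt(2.0 * k * math.log(1.0 / delta))
            + k * epsilon * (math.exp(epsilon) - 1.0))


# Illustrative values (assumptions): many queries, each with a small budget.
EPSILON, K, DELTA = 0.01, 10_000, 1e-6
print("basic:   ", basic_composition(EPSILON, K))
print("advanced:", advanced_composition(EPSILON, K, DELTA))
```

For small ε the first term dominates, so the loss grows roughly like √k·ε rather than k·ε. That is what lets the per-query noise scale as √k/(εn) and stay small as long as k is well below n².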

Answering multiple queries

[Diagram: an analyst asks the curator C, which holds the n-row dataset (columns Sex, Blood, HIV?), k queries such as “What fraction of people are type B and HIV positive?” and receives Answer + Laplace(√k/(εn)) for each.]

Error → 0 if k ≪ n²

Theorems: Cannot answer more than ≈ n² queries if:
• answers computed independently [Dwork-Naor-V. ’12]
OR
• data dimensionality is large (e.g. column means) [Bun-Ullman-V. ’14, Dwork-Smith-Steinke-Ullman-V. ’15].


Some Differentially Private Algorithms

• histograms [DMNS06]

• contingency tables [BCDKMT07, GHRU11, TUV12, DNT14]

• machine learning [BDMN05,KLNRS08]

• regression & statistical estimation [CMS11,S11,KST11,ST12,JT13]

• clustering [BDMN05,NRS07]

• social network analysis [HLMJ09,GRU11,KRSY11,KNRS13,BBDS13]

• approximation algorithms [GLMRT10]

• singular value decomposition [HR12, HR13, KT13, DTTZ14]

• streaming algorithms [DNRY10,DNPR10,MMNW11]

• mechanism design [MT07,NST10,X11,NOS12,CCKMV12,HK12,KPRU12]

• …

See Simons Institute Workshop on Big Data & Differential Privacy 12/13
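As a small worked instance of the first item in the list, here is a hedged sketch of a differentially private histogram built by adding Laplace noise to each bin count. It assumes neighboring datasets differ by replacing one row, so one bin drops by 1 and another rises by 1 (L1 sensitivity 2), and it assumes the bin labels are fixed in advance rather than read off the data. The function names are hypothetical.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_histogram(values, bins, epsilon, rng):
    """Return noisy counts for every label in `bins`.

    `bins` must be chosen independently of the data, so that empty bins
    are also released with noise.
    """
    counts = Counter(values)
    scale = 2.0 / epsilon  # L1 sensitivity 2 under row replacement
    return {b: counts.get(b, 0) + laplace_noise(scale, rng) for b in bins}


# Blood-type column of the toy table from the slides.
BLOOD = ["B", "A", "O", "O", "A", "B"]
BLOOD_BINS = ["A", "B", "AB", "O"]
rng = random.Random(0)  # fixed seed only to make the demo repeatable
print(private_histogram(BLOOD, BLOOD_BINS, epsilon=1.0, rng=rng))
```

The noise per bin does not depend on the number of bins, because one person affects at most two of them.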

Differential Privacy: Interpretations

Distribution of C(D,q1,…,qt) ≈ Distribution of C(D’,q1,…,qt)

• Whatever an adversary learns about me, it could have learned from everyone else’s data.
• Mechanism cannot leak “individual-specific” information.
• Above interpretations hold regardless of adversary’s auxiliary information.
• Composes gracefully (k repetitions ⇒ kε differentially private)

But
• No protection for information that is not localized to a few rows.
• No guarantee that subjects won’t be “harmed” by results of analysis.


Amazing possibility: synthetic data
[Blum-Ligett-Roth ’08, Hardt-Rothblum ’10]

[Diagram: the curator C maps the real dataset (n rows, d columns, d ≪ n²) to a synthetic dataset of “fake” people.]

Real data:

Sex  Blood  HIV?
F    B      Y
M    A      N
M    O      N
M    O      Y
F    A      N
M    B      Y

Synthetic data (“fake” people):

Sex  Blood  HIV?
M    B      N
F    B      Y
M    O      Y
F    A      N
F    O      N

Utility: preserves fraction of people with every set of attributes!

Problem: uses computation time exponential in d.

Our result: synthetic data is hard
[Dwork-Naor-Reingold-Rothblum-V. ’09, Ullman-V. ’11]

[Diagram: the curator C maps the real dataset (n rows, d columns, d ≪ n²) to a synthetic dataset of “fake” people.]

Theorem: Any such C requires exponential computing time (else we could break all of cryptography).

Our result: alternative summaries
[Thaler-Ullman-V. ’12, Chandrasekaran-Thaler-Ullman-Wan ’13]

[Diagram: the curator C maps the real dataset (n rows, d columns, d ≪ n²) to a “rich summary”.]

• Contains the same statistical information as synthetic data
• But can be computed in sub-exponential time!

Open: is there a polynomial-time algorithm?

Amazing Possibility II: Statistical Inference & Machine Learning

[Diagram: the curator C maps the n-row dataset (columns Sex, Blood, HIV?) to a hypothesis or model about the world, e.g. a rule for predicting disease from attributes.]

Theorem [KLNRS08,S11]: Differential privacy for vast array of machine learning and statistical estimation problems with little loss in convergence rate as n → ∞.
• Optimizations & practical implementations for logistic regression, ERM, LASSO, SVMs in [RBHT09,CMS11,ST13,JT14].

DP Theory & Practice

Theory: differential privacy research has
• many intriguing theoretical challenges
• rich connections w/other parts of CS theory & mathematics
  e.g. cryptography, learning theory, game theory & mechanism design, convex geometry, pseudorandomness, optimization, approximability, communication complexity, statistics, …

Practice: interest from many communities in seeing whether DP can be brought to practice
  e.g. statistics, databases, medical informatics, privacy law, social science, computer security, programming languages, …

Challenges for DP in Practice

• Accuracy for “small data” (moderate values of n)
• Modelling & managing privacy loss over time
  – Especially over many different analysts & datasets
• Analysts used to working with raw data
  – One approach: “Tiered access”: DP for wide access, raw data only by approval with strict terms of use (cf. Census PUMS vs. RDCs)
• Cases where privacy concerns are not “local” (e.g. privacy for large groups) or utility is not “global” (e.g. targeting)
• Matching guarantees with privacy law & regulation
• …

Some Efforts to Bring DP to Practice

• CMU-Cornell-PennState “Integrating Statistical and Computational Approaches to Privacy” (See http://onthemap.ces.census.gov/)
• Google “RAPPOR”
• UCSD “Integrating Data for Analysis, Anonymization, and Sharing” (iDash)
• UT Austin “Airavat: Security & Privacy for MapReduce”
• UPenn “Putting Differential Privacy to Work”
• Stanford-Berkeley-Microsoft “Towards Practicing Privacy”
• Duke-NISS “Triangle Census Research Network”
• Harvard “Privacy Tools for Sharing Research Data”
• MIT/CSAIL/ALFA “MoocDB Privacy tools for Sharing MOOC data”
• …

Privacy tools for sharing research data
A SaTC Frontier project

Computer Science, Law, Social Science, Statistics

http://privacytools.seas.harvard.edu/

Any opinions, findings, and conclusions or recommendations expressed here are those of the author(s) and do not necessarily reflect the views of the funders of the work.


Target: Data Repositories


Datasets are restricted due to privacy concerns

Goal: enable wider sharing while protecting privacy

Challenges for Sharing Sensitive Data

Complexity of Law
• Thousands of privacy laws in the US alone, at federal, state and local level, usually context-specific: HIPAA, FERPA, CIPSEA, Privacy Act, PPRA, ESRA, ….

Difficulty of Deidentification
• Stripping “PII” usually provides weak protections and/or poor utility [Sweeney ’97]

Inefficient Process for Obtaining Restricted Data
• Can involve months of negotiation between institutions, original researchers

Goal: make sharing easier for researcher without expertise in privacy law/cs/stats

Vision: Integrated Privacy Tools

[Diagram labels: Data Set; Consent from subjects; IRB proposal & review; Deposit in repository; Risk Assessment and De-Identification; Differential Privacy; Customized & Machine-Actionable Terms of Use; Data Tag Generator; Database of Privacy Laws & Regulations; Policy Proposals and Best Practices; Open Access to Sanitized Data Set; Query Access; Restricted Access. Legend: Tools to be developed during project.]


Already can run many statistical analyses (“Zelig methods”) through the Dataverse interface, without downloading data.


A new interactive data exploration & analysis tool to in Dataverse 4.0.

Plan: use differential privacy to enable access to currently restricted datasets


Goals for our DP Tools

• General-purpose: applicable to most datasets uploaded to Dataverse.

• Automated: no differential privacy expert optimizing algorithms for a particular dataset or application

• Tiered access: DP interface for wide access to rough statistical information, helping users decide whether to apply for access to raw data (cf. Census PUMS vs RDCs)

(Limited) prototype on project website: http://privacytools.seas.harvard.edu/

Differential Privacy: Summary

Differential Privacy offers
• Strong, scalable privacy guarantees
• Compatibility with many types of “big data” analyses
• Amazing possibilities for what can be achieved in principle

There are some challenges, but reasons for optimism
• Intensive research effort from many communities
• Some successful uses in practice already
• Differential privacy easier as n → ∞

Privacy Models from Theoretical CS

Model                                         Utility                          Privacy                                       Who Holds Data?
Differential Privacy                          statistical analysis of dataset  individual-specific info                      trusted curator
Secure Function Evaluation                    any query desired                everything other than result of query         original users (or semi-trusted delegates)
Fully Homomorphic (or Functional) Encryption  any query desired                everything (except possibly result of query)  untrusted server

See Shafi Goldwasser’s talk at White House-MIT Big Data Privacy Workshop 3/3/14