Foundations of Privacy
Lecture 1
Lecturer: Moni Naor
What is Privacy?
Extremely overloaded term. Hard to define.
"Privacy is a value so complex, so entangled in competing and contradictory dimensions, so engorged with various and distinct meanings, that I sometimes despair whether it can be usefully addressed at all."
– Robert C. Post, Three Concepts of Privacy, 89 Geo. L.J. 2087 (2001)
Privacy is like oxygen – you only feel it when it is gone
What is Privacy?
Extremely overloaded term
• "the right to be let alone"
– Samuel D. Warren and Louis D. Brandeis, The Right to Privacy, Harv. L. Rev. (1890)
• "our concern over our accessibility to others: the extent to which we are known to others, the extent to which others have physical access to us, and the extent to which we are the subject of others' attention."
– Ruth Gavison, "Privacy and the Limits of the Law," Yale Law Journal (1980)
What is Privacy?
Extremely overloaded term
• Photojournalism
• Census data
• Huge databases collected by companies
– Data deluge
– Example: "Rav-Kav" (Israel's smart public-transit card)
• Public surveillance information
– Cameras
– RFIDs
• Social networks
Louis Brandeis and Samuel Warren: The Right to Privacy, Harvard Law Rev. 1890
Mandatory participation
Must not reveal individual data
Official Description
The availability of fast and cheap computers coupled with massive storage devices has enabled the collection and mining of data on a scale previously unimaginable. This opens the door to potential abuse of individuals' information. There has been considerable research exploring the tension between utility and privacy in this context.
The goal is to explore techniques and issues related to data privacy. In particular:
• Definitions of data privacy
• Techniques for achieving privacy
• Limitations on privacy in various settings
• Privacy issues in specific settings
Planned Topics
Privacy of data analysis
• Differential privacy
– Definition and properties
– Statistical databases
– Dynamic data
• Privacy of learning algorithms
• Privacy of genomic data
Interaction with cryptography
• SFE
• Voting
• Entropic security
• Data structures
• Everlasting security
• Privacy-enhancing technologies
– Mix nets
Course Information
Foundations of Privacy - Spring 2010
Instructor: Moni Naor
When: Mondays, 11:00-13:00 (2 points)
Where: Ziskind 1
• Course web page: www.wisdom.weizmann.ac.il/~naor/COURSE/foundations_of_privacy.html
• Prerequisites: familiarity with algorithms, data structures, probability theory, and linear algebra, at an undergraduate level; a basic course in computability is assumed.
• Requirements:
– Participation in class discussion
• Best: read the papers ahead of time
– Homework: there will be several homework assignments
• Homework assignments should be turned in on time (usually two weeks after they are given)!
– Class project and presentation
– Exam: none planned
Office: Ziskind 248
Phone: 3701
E-mail: moni.naor@
Projects
• Report on a paper
• Apply a notion studied to some known domain
• Check the state of privacy in some setting
Cryptography and Privacy
Extremely relevant, but does not solve the privacy problem.
Secure Function Evaluation (SFE)
• How to distributively compute a function f(X1, X2, …, Xn), where Xj is known to party j
– E.g., f = sum(a, b, c, …)
– Parties should only learn the final output f(X1, …, Xn)
• Many results, depending on
– Number of players
– Means of communication
– The power and model of the adversary
– How the function is represented
More worried about what to compute than how to compute it.
Example: Securely Computing Sums
Inputs satisfy 0 ≤ Xi ≤ P-1; the parties want to compute Σ Xi mod P.
• Party 1 selects r ∈R [0..P-1] and sends Y1 = X1 + r mod P
• Party i receives Yi-1 and sends Yi = Yi-1 + Xi mod P
• Party 1 receives Yn and announces Σ Xi = Yn - r mod P
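A minimal sketch of this ring protocol in Python (the loop simulates the messages Y1, …, Yn passed around the ring; the modulus P is an arbitrary public choice, assumed to exceed the largest possible sum):

```python
import secrets

P = 2**61 - 1  # public modulus, assumed larger than any possible sum

def secure_sum(inputs):
    """Simulate the ring protocol: party 1 masks its input with a random r,
    each party adds its own input mod P, and party 1 removes the mask."""
    r = secrets.randbelow(P)        # party 1's secret mask
    y = (inputs[0] + r) % P         # Y1 = X1 + r
    for x in inputs[1:]:
        y = (y + x) % P             # Yi = Y(i-1) + Xi
    return (y - r) % P              # party 1 announces Yn - r

print(secure_sum([3, 1, 4, 1, 5]))  # 14
```

No single message Yi reveals anything about an individual input, since each is masked by the uniformly random r; the next slide shows why this guarantee collapses under collusion.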
Is this Protocol Secure?
To talk rigorously about cryptographic security:
• Specify the power of the adversary
– Access to the data/system
– Computational power? (It can be all-powerful here)
– "Auxiliary" information?
• Define a break of the system
– What is a compromise?
– What is a "win" for the adversary?
If the adversary controls two players, the protocol is insecure: parties i-1 and i+1 together see Yi-1 and Yi and can compute Xi = Yi - Yi-1 mod P.
The Simulation Paradigm
A protocol is considered secure if:
• For every adversary (of a certain type) there exists a simulator that outputs an indistinguishable "transcript".
Examples:
• Encryption
• Zero-knowledge
• Secure function evaluation
Power of analogy.
SFE: Simulating the Ideal Model
A protocol is considered secure if:
• For every adversary there exists a simulator operating in the "ideal" (trusted party) model that outputs an indistinguishable transcript.
Major result: "Any function f that can be evaluated using polynomial resources can be securely evaluated using polynomial resources."
Breaking = distinguishing!
The Problem with SFE
SFE does not imply privacy:
• The problem is with the ideal model
– E.g., f = sum(a, b)
– Each player learns only what can be deduced from f(a, b) and her own input to f
– If f(a, b) and a yield b, so be it.
Need ways of talking about leakage even in the ideal model.
Statistical Data Analysis
Huge social benefits from analyzing large collections of data:
• Finding correlations
– E.g., medical: genotype/phenotype correlations
• Providing better services
– Improve web search results, fit ads to queries
• Publishing official statistics
– Census, contingency tables
• Data mining
– Clustering, learning association rules, decision trees, separators, principal component analysis
However: the data contains confidential information.
WHAT ABOUT PRIVACY?
• Better privacy ⇒ better data
Example of Utility
[Figure: John Snow's map of cholera cases in the 1854 London epidemic, marking the cholera cases and the suspected pump.]
Modern Privacy of Data Analysis
Is public analysis of private data a meaningful/achievable goal?
The holy grail: get the utility of statistical analysis while protecting the privacy of every individual participant.
Ideally: "privacy-preserving" sanitization allows reasonably accurate answers to meaningful questions.
Sanitization: Traditional View
[Diagram: Data → Curator/Sanitizer → Output]
A trusted curator can access the DB of sensitive information and should publish a privacy-preserving sanitized version.
Traditional View: Interactive Model
[Diagram: an analyst sends query 1, query 2, … to the Sanitizer holding the Data; multiple queries, chosen adaptively.]
Sanitization: Traditional View
[Diagram: Data → Curator/Sanitizer → Output]
How to sanitize? Anonymization?
Auxiliary Information
• Information from any source other than the statistical database
– Other databases, including old releases of this one
– Newspapers
– General comments from insiders
– Government reports, census website
– Inside information from a different organization
• E.g., Google's view, if the attacker/user is a Google employee
Linkage Attacks: Malicious Use of Aux Info
The Netflix Prize
• Netflix recommends movies to its subscribers
– Seeks an improved recommendation system
– Offered $1,000,000 for a 10% improvement
• Not concerned here with how this is measured
– Published training data
Prize won in September 2009 by the "BellKor's Pragmatic Chaos" team.
From the Netflix Prize Rules Page…
• “The training data set consists of more than 100 million ratings from over 480 thousand randomly-chosen, anonymous customers on nearly 18 thousand movie titles.”
• “The ratings are on a scale from 1 to 5 (integral) stars. To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly-assigned ids. The date of each rating and the title and year of release for each movie are provided.”
Netflix Data Release [Narayanan-Shmatikov 2008]
[Diagram: a ratings matrix of users 1..N by items 1..M]
• Ratings for a subset of movies and users
• Usernames replaced with random IDs
• Some additional perturbation
Credit: Arvind Narayanan via Adam Smith
A Source of Auxiliary Information
• Internet Movie Database (IMDb)
– Individuals may register for an account and rate movies
– Need not be anonymous
• Probably want to create some web presence
– Visible material includes ratings, dates, comments
Use Public Reviews from IMDb.com
[Diagram: matching the anonymized Netflix data against public, incomplete IMDb data yields identified Netflix data for Alice, Bob, Charlie, Danielle, Erica, and Frank.]
Credit: Arvind Narayanan via Adam Smith
De-anonymizing the Netflix Dataset
Results:
• "With 8 movie ratings (of which 2 may be completely wrong) and dates that may have a 3-day error, 96% of Netflix subscribers whose records have been released can be uniquely identified in the dataset."
• "For 89%, 2 ratings and dates are enough to reduce the set of plausible records to 8 out of almost 500,000, which can then be inspected by a human for further deanonymization."
Consequences?
– Learn about movies that IMDb users didn't want to tell the world about… sexual orientation, religious beliefs
– Subject of a lawsuit under the Video Privacy Protection Act of 1988 (settled, March 2010)
Credit: Arvind Narayanan via Adam Smith
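To make the linkage concrete, here is a toy sketch of the scoring idea, drastically simplified from the Narayanan-Shmatikov algorithm (all records below are invented, and the real attack uses a weighted score and an eccentricity test rather than a plain maximum):

```python
from datetime import date

def matches(record, aux, day_slack=3):
    """Count auxiliary (movie, rating, date) triples consistent with a
    candidate record, allowing the date to be off by day_slack days."""
    score = 0
    for movie, (rating, day) in aux.items():
        if movie in record:
            r, d = record[movie]
            if r == rating and abs((d - day).days) <= day_slack:
                score += 1
    return score

netflix = {  # "anonymized" release: random IDs, but ratings and dates intact
    "id_017": {"Movie A": (5, date(2005, 3, 1)), "Movie B": (2, date(2005, 4, 9))},
    "id_042": {"Movie A": (3, date(2005, 7, 2)), "Movie C": (4, date(2005, 8, 1))},
}
imdb_alice = {"Movie A": (5, date(2005, 3, 2)), "Movie B": (2, date(2005, 4, 10))}

best = max(netflix, key=lambda rid: matches(netflix[rid], imdb_alice))
print(best)  # id_017 -- Alice's public IMDb reviews single out her record
```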
AOL Search History Release (2006)
• 650,000 users, 20 million queries, 3 months
• AOL's goal:
– Provide real query logs from real users
• Privacy?
– "Identifying information" replaced with random identifiers
– But: different searches by the same user are still linked
Famously re-identified from her linked queries:
Name: Thelma Arnold; Age: 62; Widow; Residence: Lilburn, GA
Other Successful Attacks
• Against anonymized HMO records [Sweeney 98]
– Proposed k-anonymity
• Against k-anonymity [MGK06]
– Proposed l-diversity
• Against l-diversity [XT07]
– Proposed m-invariance
• Against all of the above [GKS08]
"Composition" Attacks [Ganta-Kasiviswanathan-Smith, KDD 2008]
• Example: two hospitals serve overlapping populations. What if they independently release "anonymized" statistics?
• Composition attack: combine the independent releases
[Diagram: an individual's sensitive information reaches two curators, Hospital A and Hospital B, which publish statsA and statsB; the attacker combines them. From statsA: "Adam has either diabetes or emphysema." From statsB: "Adam has either diabetes or high blood pressure." Intersecting the two: Adam has diabetes.]
• "IPUMS" census data set: 70,000 people, randomly split into 2 pieces with an overlap of 5,000.
• With a popular technique (k-anonymity, k=30) applied to each database, one can learn the "sensitive" variable for 40% of the individuals.
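The intersection step is simple enough to spell out in code. A toy sketch (the diagnoses and releases are invented; real releases publish generalized tables rather than explicit sets, but the attack reduces to this intersection):

```python
# Each hospital's "anonymized" release narrows Adam's sensitive value to a
# set of plausible diagnoses; combining independent releases intersects them.
release_a = {"Adam": {"diabetes", "emphysema"}}
release_b = {"Adam": {"diabetes", "high blood pressure"}}

def compose(rel1, rel2):
    """Intersect the plausible sensitive sets from two independent releases."""
    return {person: rel1[person] & rel2[person]
            for person in rel1.keys() & rel2.keys()}

print(compose(release_a, release_b))  # {'Adam': {'diabetes'}}
```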
Analysis of Social Network Graphs
• "Friendship" graph
– Nodes correspond to users
– Users may list others as "friend," creating an edge
• Edges are annotated with directional information
• Hypothetical research question
– How frequently is the "friend" designation reciprocated?
Attack
• Replace node names/labels with random identifiers
• Permits analysis of the structure of the graph
• Privacy hope: randomized identifiers make it hard/impossible to identify nodes with specific individuals,
– thereby hiding who is connected to whom
• Disastrous! [Backstrom-Dwork-Kleinberg 07]
– Vulnerable to active and passive attacks
Flavor of Active Attack
Connections:
• Targets: "Steve" and "Jerry"
• Attack contacts: A and B
• Finding A and B allows finding Steve and Jerry
[Diagram: attacker-controlled nodes A and B, linked to target nodes S and J]
Magic step:
• Isolate lightly linked-in subgraphs from the rest of the graph
• The special structure of the subgraph permits finding A and B
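A toy illustration of the flavor, not the actual Backstrom-Dwork-Kleinberg construction (which plants a subgraph recognizable by its internal structure; here the planted signature is just a pair of implausibly high degrees, and all parameters are invented):

```python
import random

random.seed(0)
n, p = 200, 0.05
graph = {v: set() for v in range(n)}       # random G(n, p) "friendship" graph
for u in range(n):
    for v in range(u + 1, n):
        if random.random() < p:
            graph[u].add(v); graph[v].add(u)

steve, jerry = 10, 20                      # the attacker's chosen targets
for node, target, deg in (("A", steve, 55), ("B", jerry, 60)):
    graph[node] = {target, "B" if node == "A" else "A"}
    graph[target].add(node)
    while len(graph[node]) < deg:          # pad to a degree implausible in G(n, p)
        v = random.randrange(n)
        graph[node].add(v); graph[v].add(node)

# Anonymize: replace every label with a random identifier.
ids = {v: i for i, v in enumerate(random.sample(list(graph), len(graph)))}
anon = {ids[v]: {ids[u] for u in graph[v]} for v in graph}

# Attack: re-find A and B by their planted degrees, then read off the targets.
found_a = next(v for v in anon if len(anon[v]) == 55)
found_b = next(v for v in anon if len(anon[v]) == 60)
print(ids[steve] in anon[found_a], ids[jerry] in anon[found_b])  # True True
```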
Why Settle for Ad Hoc Notions of Privacy?
Dalenius, 1977:
• Anything that can be learned about a respondent from the statistical database can be learned without access to the database
– Captures the possibility that "I" may be an extrovert
– The database doesn't leak personal information
– Adversary is a user
• Analogous to semantic security for crypto [Goldwasser-Micali 1982]
– Anything that can be learned from the ciphertext can be learned without the ciphertext
– Adversary is an eavesdropper
Computational Security of Encryption: Semantic Security
Whatever adversary A can compute on an encrypted string X ∈ {0,1}^n, so can an A' that does not see the encryption of X, yet simulates A's knowledge with respect to X.
A selects:
• a distribution Dn on {0,1}^n
• a relation R(X,Y), computable in probabilistic polynomial time
For every pptm A there is a pptm A' so that for every pptm-computable relation R, for X ∈R Dn,
|Pr[R(X, A(E(X)))] - Pr[R(X, A'(·))]| is negligible.
The outputs of A and A' are indistinguishable even for a tester who knows X.
[Diagram: in both experiments X ∈R Dn; A receives E(X) and outputs Y, while A' receives nothing and outputs Y; the tester checks R(X,Y).]
Making it Slightly Less Vague
Cryptographic rigor applied to privacy:
• Define a break of the system
– What is a compromise?
– What is a "win" for the adversary?
• Specify the power of the adversary
– Access to the data
– Computational power?
– "Auxiliary" information?
• Conservative/paranoid by nature
– Protect against all feasible attacks
In Full Generality: Dalenius' Goal Is Impossible
– The database teaches that smoking causes cancer
– I smoke in public
– Access to the DB teaches that I am at increased risk for cancer
• But what about cases where there is significant knowledge about the database distribution?
Outline
• The framework
• A general impossibility result
– Dalenius' goal cannot be achieved in a very general sense
• The proof
– Simplified
– General case
Two Models
Non-interactive: data are sanitized and released.
[Diagram: Database → San → Sanitized Database]
Interactive: multiple queries, adaptively chosen.
[Diagram: Database → San ↔ user queries]
Auxiliary Information
Common theme in many privacy horror stories:
• Not taking into account side information
– Netflix challenge: not taking into account IMDb [Narayanan-Shmatikov]
[Diagram: the database, sanitized as SAN(DB) = remove names, combined with the auxiliary information.]
Not Learning from the DB
[Diagram: with access to the database, adversary A interacts with San(DB) and sees the auxiliary information; without access, simulator A' sees only the auxiliary information.]
• There is some utility of DB that legitimate users should learn
• Possible breach of privacy
• Goal: users learn the utility without the breach
Want: anything that can be learned about an individual from the database can be learned without access to the database
• ∀D ∀A ∃A' such that, w.h.p. over DB ∈R D, for all auxiliary information z:
|Pr[A(z) ↔ DB wins] - Pr[A'(z) wins]| is small
Illustrative Example of the Difficulty
Want: anything that can be learned about a respondent from the database can be learned without access to the database.
• More formally: ∀D ∀A ∃A' such that, w.h.p. over DB ∈R D, for all auxiliary information z, |Pr[A(z) ↔ DB wins] - Pr[A'(z) wins]| is small
Example: suppose an individual's height is sensitive information
– The average height in DB is not known a priori
• Aux z = "Adam is 5 cm shorter than the average in DB"
– A learns the average height in DB, hence also Adam's height
– A' does not
Defining "Win": The Compromise Function
Notion of privacy compromise:
[Diagram: DB ∈R D; the adversary outputs a candidate breach y; a compromise decider outputs 0/1.]
A privacy compromise should be non-trivial:
• It should not be possible to find a privacy breach from the auxiliary information alone
A privacy breach should exist:
• Given DB, there should be a y that is a privacy breach
• It should be possible to find y efficiently
Basic Concepts
• Distribution D on (finite) databases
– Something about the database must be unknown
– Captures knowledge about the domain
• E.g., rows of the database correspond to owners of 2 pets
• Privacy mechanism San(D, DB)
– Can be interactive or non-interactive
– May have access to the distribution D
• Auxiliary information generator AuxGen(D, DB)
– Has access to the distribution and to DB
– Formalizes partial knowledge about DB
• Utility vector w
– Answers to k questions about the DB
– (Most of) the utility vector can be learned by the user
– Utility: must inherit sufficient min-entropy from the source D
Impossibility Theorem: Informal
• For any* distribution D on databases DB
• For any* reasonable privacy-compromise decider C
• Fix any useful* privacy mechanism San
Then:
• There is an auxiliary information generator AuxGen and an adversary A such that
• For all adversary simulators A':
[A(z) ↔ San(DB)] wins, but [A'(z)] does not win,
where z = AuxGen(DB) tells us information we did not know, and winning means finding a compromise.
Impossibility Theorem
Fix any useful* privacy mechanism San and any reasonable privacy-compromise decider C. Then there is an auxiliary information generator AuxGen and an adversary A such that for "all" distributions D and all adversary simulators A':
Pr[A(D, San(D,DB), AuxGen(D,DB)) wins] - Pr[A'(D, AuxGen(D,DB)) wins] ≥ Δ
for a suitable, large Δ.
The probability spaces are over the choice of DB ∈R D and the coin flips of San, AuxGen, A, and A'.
To completely specify the theorem: need an assumption on the entropy of the utility vector w and on how well San(w) behaves.
Strategy
• The auxiliary information generator will provide a hint that, together with the utility vector w, yields the privacy breach.
• Want AuxGen to work without knowing D, just DB
– Find a privacy breach y and encode it in z
– Make sure z alone does not give y; only together with w
• Complication: is the utility vector w
– completely learned by the user?
– or just an approximation?
Entropy of Random Sources
• Source:
– A probability distribution X on {0,1}^n
– Contains some "randomness"
• Measures of "randomness":
– Shannon entropy: H(X) = -∑x PX(x) log PX(x)
• Represents how much we can compress X on average
• But even a high-entropy source may have a point with probability 0.9
– Min-entropy: H∞(X) = -log maxx PX(x)
• Represents the probability of the most likely value of X
Definition: X is a k-source if H∞(X) ≥ k, i.e., Pr[X=x] ≤ 2^-k for all x.

Min-Entropy
• Definition: X is a k-source if H∞(X) ≥ k, i.e., Pr[X=x] ≤ 2^-k for all x
• Examples:
– Bit-fixing: some k coordinates of X uniform, the rest fixed
• or even depending arbitrarily on the others
– Unpredictable source: ∀ i ∈ [n] and b1, …, bi-1 ∈ {0,1},
k/n ≤ Pr[Xi = 1 | X1, …, Xi-1 = b1, …, bi-1] ≤ 1 - k/n
– Flat k-source: uniform over S ⊆ {0,1}^n, |S| = 2^k
• Fact: every k-source is a convex combination of flat k-sources.
Min-Entropy and Statistical Distance
For a probability distribution X over {0,1}^n:
H∞(X) = -log maxx Pr[X = x]
X is a k-source if H∞(X) ≥ k.
(H∞ represents the probability of the most likely value of X.)
Statistical distance: Δ(X,Y) = ∑a |Pr[X=a] - Pr[Y=a]|
Want: the output close to the uniform distribution.
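A quick numerical sketch of the gap between the two entropy measures, plus the statistical distance (the 0.9-mass source mirrors the example above):

```python
import math

def shannon_entropy(p):
    """H(X) = -sum of p(x) log2 p(x) over the support."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def min_entropy(p):
    """H∞(X) = -log2 of the largest point probability."""
    return -math.log2(max(p.values()))

def stat_distance(p, q):
    """Slide's convention: sum of |p(a) - q(a)| (no 1/2 factor)."""
    return sum(abs(p.get(a, 0) - q.get(a, 0)) for a in p.keys() | q.keys())

# One outcome of probability 0.9, with 2^16 others sharing the rest:
# Shannon entropy is about 2.1 bits, but min-entropy is only 0.15 bits.
n = 2 ** 16
p = {0: 0.9, **{i: 0.1 / n for i in range(1, n + 1)}}
print(shannon_entropy(p), min_entropy(p))

uniform = {0: 0.5, 1: 0.5}
print(stat_distance({0: 0.9, 1: 0.1}, uniform))  # 0.8
```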
Extractors
Universal procedure for "purifying" an imperfect source.
Definition: Ext: {0,1}^n × {0,1}^d → {0,1}^ℓ is a (k,ε)-extractor if, for any k-source X,
Δ(Ext(X, Ud), Uℓ) ≤ ε
[Diagram: a k-source x of length n (one of 2^k strings in {0,1}^n) and a d-bit random seed s enter EXT, which outputs ℓ almost-uniform bits.]
Strong Extractors
Output looks random even after seeing the seed.
Definition: Ext is a (k,ε) strong extractor if Ext'(x,s) = s ∘ Ext(x,s) is a (k,ε)-extractor
• i.e., for every k-source X, for a 1-ε' fraction of seeds s ∈ {0,1}^d,
Ext(X,s) is ε-close to Uℓ.
Extractors from Hash Functions
• Leftover Hash Lemma [ILL89]: universal (pairwise independent) hash functions yield strong extractors
– Output length: ℓ = k - 2 log(1/ε) (i.e., k - O(1) for constant ε)
– Seed length: d = O(n)
– Example: Ext(x, (a,b)) = first ℓ bits of a·x + b in GF[2^n]
• Almost pairwise independence:
– Seed length: d = O(log n + k)
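A minimal sketch of the a·x + b construction above for n = 8 (the field GF(2^8) is represented with the AES polynomial x^8 + x^4 + x^3 + x + 1, an arbitrary choice of irreducible polynomial; a real application would use a much larger n):

```python
import secrets

N = 8
MODULUS = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

def gf2_mul(a: int, b: int) -> int:
    """Carry-less multiplication in GF(2^N), reduced by the field polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> N:          # reduce as soon as the degree reaches N
            a ^= MODULUS
    return result

def ext(x: int, seed: tuple[int, int], ell: int) -> int:
    """Ext(x, (a, b)) = first ell bits of a*x + b in GF(2^N)."""
    a, b = seed
    y = gf2_mul(a, x) ^ b   # addition in GF(2^N) is XOR
    return y >> (N - ell)   # keep the top ell bits

x = 0b10110001                                  # a sample from some k-source
seed = (secrets.randbelow(1 << N), secrets.randbelow(1 << N))
print(ext(x, seed, ell=4))                      # 4 nearly uniform bits
```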
Suppose w Is Learned Completely
AuxGen and A share a secret: w.
AuxGen(DB):
• Find a privacy breach y of DB, of length ℓ
• Find w from DB
– Simulate A
• Choose s ∈R {0,1}^d and compute Ext(w,s)
• Set z = (s, Ext(w,s) ⊕ y)
[Diagram: DB feeds both San and AuxGen; A interacts with San, receives z, and outputs a candidate breach checked by the compromise function C.]
Suppose w Is Learned Completely (cont.)
AuxGen and A share a secret: w.
z = (s, Ext(w,s) ⊕ y)
A learns w from its interaction with San, computes Ext(w,s), and recovers y from z; A' must do without w.
[Diagram: the real adversary A interacts with San(DB) and receives z; the simulator A' receives only z.]
Technical conditions: H∞(W | y) ≥ |y| and |y| "safe".
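A toy end-to-end sketch of this hint (SHA-256 stands in for the seeded strong extractor, purely for illustration; the proof of course uses a true (k,ε)-extractor, and all the strings below are placeholders):

```python
import secrets
import hashlib

def ext(w: bytes, s: bytes, ell: int) -> bytes:
    """Stand-in extractor: hash the source w with seed s, keep ell bytes."""
    return hashlib.sha256(s + w).digest()[:ell]

def aux_gen(w: bytes, y: bytes) -> tuple[bytes, bytes]:
    """AuxGen: choose a random seed s, publish z = (s, Ext(w,s) XOR y)."""
    s = secrets.token_bytes(16)
    pad = ext(w, s, len(y))
    return s, bytes(p ^ c for p, c in zip(pad, y))

def adversary(z: tuple[bytes, bytes], w: bytes) -> bytes:
    """A learns w via San(DB), re-derives the pad, and recovers the breach y."""
    s, masked = z
    pad = ext(w, s, len(masked))
    return bytes(p ^ c for p, c in zip(pad, masked))

w = b"utility vector learned from San"  # placeholder utility vector
y = b"privacy breach of DB"             # placeholder breach found by AuxGen
z = aux_gen(w, y)
assert adversary(z, w) == y             # A wins; to A', z is one-time-padded noise
```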
Why Is It a Compromise?
AuxGen and A share a secret: w.
z = (s, Ext(w,s) ⊕ y)
Why doesn't A' learn y?
• For each possible value of y, (s, Ext(w,s)) is ε-close to uniform
• Hence (s, Ext(w,s) ⊕ y) is ε-close to uniform
Need H∞(W) ≥ 3ℓ + O(1).
Technical conditions: H∞(W | y) ≥ |y| and |y| "safe".
To complete the proof
• Handle the case where not all of w is retrieved