Foundations of Privacy
Lecture 1
Lecturer: Moni Naor
What is Privacy?
Extremely overloaded term. Hard to define.
"Privacy is a value so complex, so entangled in competing and contradictory dimensions, so engorged with various and distinct meanings, that I sometimes despair whether it can be usefully addressed at all."
– Robert C. Post, Three Concepts of Privacy, 89 Geo. L.J. 2087 (2001)
Privacy is like oxygen – you only feel it when it is gone
What is Privacy?
Extremely overloaded term
• "the right to be let alone"
– Samuel D. Warren and Louis D. Brandeis, The Right to Privacy, Harv. L. Rev. (1890)
• "our concern over our accessibility to others: the extent to which we are known to others, the extent to which others have physical access to us, and the extent to which we are the subject of others' attention."
– Ruth Gavison, "Privacy and the Limits of the Law," Yale Law Journal (1980)
What is Privacy?
Extremely overloaded term
• Photojournalism
• Census data
• Huge databases collected by companies
– Data deluge
– Example: "Rav-Kav" (Israel's smart public-transit card)
• Public surveillance information
– Cameras
– RFIDs
• Social networks
Louis Brandeis and Samuel Warren: The Right to Privacy, Harvard Law Rev. 1890
Mandatory participation
Must not reveal individual data
Official Description
The availability of fast and cheap computers coupled with massive storage devices has enabled the collection and mining of data on a scale previously unimaginable. This opens the door to potential abuse of individuals' information. There has been considerable research exploring the tension between utility and privacy in this context.
The goal is to explore techniques and issues related to data privacy. In particular:
• Definitions of data privacy
• Techniques for achieving privacy
• Limitations on privacy in various settings
• Privacy issues in specific settings
Planned Topics
Privacy of data analysis
• Differential privacy
– Definition and properties
– Statistical databases
– Dynamic data
• Privacy of learning algorithms
• Privacy of genomic data
Interaction with cryptography
• SFE
• Voting
• Entropic security
• Data structures
• Everlasting security
• Privacy-enhancing technologies
– Mix nets
Course Information
Foundations of Privacy - Spring 2010
Instructor: Moni Naor
When: Mondays, 11:00-13:00 (2 points)
Where: Ziskind 1
• Course web page: www.wisdom.weizmann.ac.il/~naor/COURSE/foundations_of_privacy.html
• Prerequisites: familiarity with algorithms, data structures, probability theory, and linear algebra, at an undergraduate level; a basic course in computability is assumed.
• Requirements:
– Participation in class discussion
• Best: read the papers ahead of time
– Homework: there will be several homework assignments
• Homework assignments should be turned in on time (usually two weeks after they are given)!
– Class project and presentation
– Exam: none planned
Office: Ziskind 248
Phone: 3701
E-mail: moni.naor@
Projects
• Report on a paper
• Apply a notion studied to some known domain
• Check the state of privacy in some setting
Cryptography and Privacy
Extremely relevant, but does not solve the privacy problem.
Secure Function Evaluation (SFE)
• How to distributively compute a function f(X1, X2, …, Xn), where Xj is known to party j
– E.g., f = sum(a, b, c, …)
– Parties should only learn the final output f(X1, …, Xn)
• Many results, depending on
– Number of players
– Means of communication
– The power and model of the adversary
– How the function is represented
More worried about what to compute than how to compute it.
Example: Securely Computing Sums
Inputs satisfy 0 ≤ Xi ≤ P-1; the parties want to compute Σ Xi mod P.
• Party 1 selects r ∈R [0..P-1] and sends Y1 = X1 + r mod P
• Party i receives Yi-1 and sends Yi = Yi-1 + Xi mod P
• Party 1 receives Yn and announces Σ Xi = Yn - r mod P
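A minimal sketch of this ring protocol in Python (the loop simulates the messages Y1, …, Yn passed around the ring; the modulus P is an arbitrary public choice, assumed to exceed the largest possible sum):

```python
import secrets

P = 2**61 - 1  # public modulus, assumed larger than any possible sum

def secure_sum(inputs):
    """Simulate the ring protocol: party 1 masks its input with a random r,
    each party adds its own input mod P, and party 1 removes the mask."""
    r = secrets.randbelow(P)        # party 1's secret mask
    y = (inputs[0] + r) % P         # Y1 = X1 + r
    for x in inputs[1:]:
        y = (y + x) % P             # Yi = Y(i-1) + Xi
    return (y - r) % P              # party 1 announces Yn - r

print(secure_sum([3, 1, 4, 1, 5]))  # 14
```

No single message Yi reveals anything about an individual input, since each is masked by the uniformly random r; the next slide shows why this guarantee collapses under collusion.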
Is this Protocol Secure?
To talk rigorously about cryptographic security:
• Specify the power of the adversary
– Access to the data/system
– Computational power? (It can be all-powerful here)
– "Auxiliary" information?
• Define a break of the system
– What is a compromise?
– What is a "win" for the adversary?
If the adversary controls two players, the protocol is insecure: parties i-1 and i+1 together see Yi-1 and Yi and can compute Xi = Yi - Yi-1 mod P.
The Simulation Paradigm
A protocol is considered secure if:
• For every adversary (of a certain type) there exists a simulator that outputs an indistinguishable "transcript".
Examples:
• Encryption
• Zero-knowledge
• Secure function evaluation
Power of analogy.
SFE: Simulating the Ideal Model
A protocol is considered secure if:
• For every adversary there exists a simulator operating in the "ideal" (trusted party) model that outputs an indistinguishable transcript.
Major result: "Any function f that can be evaluated using polynomial resources can be securely evaluated using polynomial resources."
Breaking = distinguishing!
The Problem with SFE
SFE does not imply privacy:
• The problem is with the ideal model
– E.g., f = sum(a, b)
– Each player learns only what can be deduced from f(a, b) and her own input to f
– If f(a, b) and a yield b, so be it.
Need ways of talking about leakage even in the ideal model.
Statistical Data Analysis
Huge social benefits from analyzing large collections of data:
• Finding correlations
– E.g., medical: genotype/phenotype correlations
• Providing better services
– Improve web search results, fit ads to queries
• Publishing official statistics
– Census, contingency tables
• Data mining
– Clustering, learning association rules, decision trees, separators, principal component analysis
However: the data contains confidential information.
WHAT ABOUT PRIVACY?
• Better privacy ⇒ better data
Example of Utility
[Figure: John Snow's map of cholera cases in the 1854 London epidemic, marking the cholera cases and the suspected pump.]
Modern Privacy of Data Analysis
Is public analysis of private data a meaningful/achievable goal?
The holy grail: get the utility of statistical analysis while protecting the privacy of every individual participant.
Ideally: "privacy-preserving" sanitization allows reasonably accurate answers to meaningful questions.
Sanitization: Traditional View
[Diagram: Data → Curator/Sanitizer → Output]
A trusted curator can access the DB of sensitive information and should publish a privacy-preserving sanitized version.
Traditional View: Interactive Model
[Diagram: an analyst sends query 1, query 2, … to the Sanitizer holding the Data; multiple queries, chosen adaptively.]
Sanitization: Traditional View
[Diagram: Data → Curator/Sanitizer → Output]
How to sanitize? Anonymization?
Auxiliary Information
• Information from any source other than the statistical database
– Other databases, including old releases of this one
– Newspapers
– General comments from insiders
– Government reports, census website
– Inside information from a different organization
• E.g., Google's view, if the attacker/user is a Google employee
Linkage Attacks: Malicious Use of Aux Info
The Netflix Prize
• Netflix recommends movies to its subscribers
– Seeks an improved recommendation system
– Offered $1,000,000 for a 10% improvement
• Not concerned here with how this is measured
– Published training data
Prize won in September 2009 by the "BellKor's Pragmatic Chaos" team.
From the Netflix Prize Rules Page…
• “The training data set consists of more than 100 million ratings from over 480 thousand randomly-chosen, anonymous customers on nearly 18 thousand movie titles.”
• “The ratings are on a scale from 1 to 5 (integral) stars. To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly-assigned ids. The date of each rating and the title and year of release for each movie are provided.”
Netflix Data Release [Narayanan-Shmatikov 2008]
[Diagram: a ratings matrix of users 1..N by items 1..M]
• Ratings for a subset of movies and users
• Usernames replaced with random IDs
• Some additional perturbation
Credit: Arvind Narayanan via Adam Smith
A Source of Auxiliary Information
• Internet Movie Database (IMDb)
– Individuals may register for an account and rate movies
– Need not be anonymous
• Probably want to create some web presence
– Visible material includes ratings, dates, comments
Use Public Reviews from IMDb.com
[Diagram: matching the anonymized Netflix data against public, incomplete IMDb data yields identified Netflix data for Alice, Bob, Charlie, Danielle, Erica, and Frank.]
Credit: Arvind Narayanan via Adam Smith
De-anonymizing the Netflix Dataset
Results:
• "With 8 movie ratings (of which 2 may be completely wrong) and dates that may have a 3-day error, 96% of Netflix subscribers whose records have been released can be uniquely identified in the dataset."
• "For 89%, 2 ratings and dates are enough to reduce the set of plausible records to 8 out of almost 500,000, which can then be inspected by a human for further deanonymization."
Consequences?
– Learn about movies that IMDb users didn't want to tell the world about… sexual orientation, religious beliefs
– Subject of a lawsuit under the Video Privacy Protection Act of 1988 (settled, March 2010)
Credit: Arvind Narayanan via Adam Smith
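To make the linkage concrete, here is a toy sketch of the scoring idea, drastically simplified from the Narayanan-Shmatikov algorithm (all records below are invented, and the real attack uses a weighted score and an eccentricity test rather than a plain maximum):

```python
from datetime import date

def matches(record, aux, day_slack=3):
    """Count auxiliary (movie, rating, date) triples consistent with a
    candidate record, allowing the date to be off by day_slack days."""
    score = 0
    for movie, (rating, day) in aux.items():
        if movie in record:
            r, d = record[movie]
            if r == rating and abs((d - day).days) <= day_slack:
                score += 1
    return score

netflix = {  # "anonymized" release: random IDs, but ratings and dates intact
    "id_017": {"Movie A": (5, date(2005, 3, 1)), "Movie B": (2, date(2005, 4, 9))},
    "id_042": {"Movie A": (3, date(2005, 7, 2)), "Movie C": (4, date(2005, 8, 1))},
}
imdb_alice = {"Movie A": (5, date(2005, 3, 2)), "Movie B": (2, date(2005, 4, 10))}

best = max(netflix, key=lambda rid: matches(netflix[rid], imdb_alice))
print(best)  # id_017 -- Alice's public IMDb reviews single out her record
```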
AOL Search History Release (2006)
• 650,000 users, 20 million queries, 3 months
• AOL's goal:
– Provide real query logs from real users
• Privacy?
– "Identifying information" replaced with random identifiers
– But: different searches by the same user are still linked
Famously re-identified from her linked queries:
Name: Thelma Arnold; Age: 62; Widow; Residence: Lilburn, GA
Other Successful Attacks
• Against anonymized HMO records [Sweeney 98]
– Proposed k-anonymity
• Against k-anonymity [MGK06]
– Proposed l-diversity
• Against l-diversity [XT07]
– Proposed m-invariance
• Against all of the above [GKS08]
"Composition" Attacks [Ganta-Kasiviswanathan-Smith, KDD 2008]
• Example: two hospitals serve overlapping populations. What if they independently release "anonymized" statistics?
• Composition attack: combine the independent releases
[Diagram: an individual's sensitive information reaches two curators, Hospital A and Hospital B, which publish statsA and statsB; the attacker combines them. From statsA: "Adam has either diabetes or emphysema." From statsB: "Adam has either diabetes or high blood pressure." Intersecting the two: Adam has diabetes.]
• "IPUMS" census data set: 70,000 people, randomly split into 2 pieces with an overlap of 5,000.
• With a popular technique (k-anonymity, k=30) applied to each database, one can learn the "sensitive" variable for 40% of the individuals.
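The intersection step is simple enough to spell out in code. A toy sketch (the diagnoses and releases are invented; real releases publish generalized tables rather than explicit sets, but the attack reduces to this intersection):

```python
# Each hospital's "anonymized" release narrows Adam's sensitive value to a
# set of plausible diagnoses; combining independent releases intersects them.
release_a = {"Adam": {"diabetes", "emphysema"}}
release_b = {"Adam": {"diabetes", "high blood pressure"}}

def compose(rel1, rel2):
    """Intersect the plausible sensitive sets from two independent releases."""
    return {person: rel1[person] & rel2[person]
            for person in rel1.keys() & rel2.keys()}

print(compose(release_a, release_b))  # {'Adam': {'diabetes'}}
```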
Analysis of Social Network Graphs
• "Friendship" graph
– Nodes correspond to users
– Users may list others as "friend," creating an edge
• Edges are annotated with directional information
• Hypothetical research question
– How frequently is the "friend" designation reciprocated?
Attack
• Replace node names/labels with random identifiers
• Permits analysis of the structure of the graph
• Privacy hope: randomized identifiers make it hard/impossible to identify nodes with specific individuals,
– thereby hiding who is connected to whom
• Disastrous! [Backstrom-Dwork-Kleinberg 07]
– Vulnerable to active and passive attacks
Flavor of Active Attack
Connections:
• Targets: "Steve" and "Jerry"
• Attack contacts: A and B
• Finding A and B allows finding Steve and Jerry
[Diagram: attacker-controlled nodes A and B, linked to target nodes S and J]
Magic step:
• Isolate lightly linked-in subgraphs from the rest of the graph
• The special structure of the subgraph permits finding A and B
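A toy illustration of the flavor, not the actual Backstrom-Dwork-Kleinberg construction (which plants a subgraph recognizable by its internal structure; here the planted signature is just a pair of implausibly high degrees, and all parameters are invented):

```python
import random

random.seed(0)
n, p = 200, 0.05
graph = {v: set() for v in range(n)}       # random G(n, p) "friendship" graph
for u in range(n):
    for v in range(u + 1, n):
        if random.random() < p:
            graph[u].add(v); graph[v].add(u)

steve, jerry = 10, 20                      # the attacker's chosen targets
for node, target, deg in (("A", steve, 55), ("B", jerry, 60)):
    graph[node] = {target, "B" if node == "A" else "A"}
    graph[target].add(node)
    while len(graph[node]) < deg:          # pad to a degree implausible in G(n, p)
        v = random.randrange(n)
        graph[node].add(v); graph[v].add(node)

# Anonymize: replace every label with a random identifier.
ids = {v: i for i, v in enumerate(random.sample(list(graph), len(graph)))}
anon = {ids[v]: {ids[u] for u in graph[v]} for v in graph}

# Attack: re-find A and B by their planted degrees, then read off the targets.
found_a = next(v for v in anon if len(anon[v]) == 55)
found_b = next(v for v in anon if len(anon[v]) == 60)
print(ids[steve] in anon[found_a], ids[jerry] in anon[found_b])  # True True
```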
Why Settle for Ad Hoc Notions of Privacy?
Dalenius, 1977:
• Anything that can be learned about a respondent from the statistical database can be learned without access to the database
– Captures the possibility that "I" may be an extrovert
– The database doesn't leak personal information
– Adversary is a user
• Analogous to semantic security for crypto [Goldwasser-Micali 1982]
– Anything that can be learned from the ciphertext can be learned without the ciphertext
– Adversary is an eavesdropper
Computational Security of Encryption: Semantic Security
Whatever adversary A can compute on an encrypted string X ∈ {0,1}^n, so can an A' that does not see the encryption of X, yet simulates A's knowledge with respect to X.
A selects:
• a distribution Dn on {0,1}^n
• a relation R(X,Y), computable in probabilistic polynomial time
For every pptm A there is a pptm A' so that for every pptm-computable relation R, for X ∈R Dn,
|Pr[R(X, A(E(X)))] - Pr[R(X, A'(·))]| is negligible.
The outputs of A and A' are indistinguishable even for a tester who knows X.
[Diagram: in both experiments X ∈R Dn; A receives E(X) and outputs Y, while A' receives nothing and outputs Y; the tester checks R(X,Y).]
Making it Slightly Less Vague
Cryptographic rigor applied to privacy:
• Define a break of the system
– What is a compromise?
– What is a "win" for the adversary?
• Specify the power of the adversary
– Access to the data
– Computational power?
– "Auxiliary" information?
• Conservative/paranoid by nature
– Protect against all feasible attacks
In Full Generality: Dalenius' Goal Is Impossible
– The database teaches that smoking causes cancer
– I smoke in public
– Access to the DB teaches that I am at increased risk for cancer
• But what about cases where there is significant knowledge about the database distribution?
Outline
• The framework
• A general impossibility result
– Dalenius' goal cannot be achieved in a very general sense
• The proof
– Simplified
– General case
Two Models
Non-interactive: data are sanitized and released.
[Diagram: Database → San → Sanitized Database]
Interactive: multiple queries, adaptively chosen.
[Diagram: Database → San ↔ user queries]
Auxiliary Information
Common theme in many privacy horror stories:
• Not taking into account side information
– Netflix challenge: not taking into account IMDb [Narayanan-Shmatikov]
[Diagram: the database, sanitized as SAN(DB) = remove names, combined with the auxiliary information.]
Not Learning from the DB
[Diagram: with access to the database, adversary A interacts with San(DB) and sees the auxiliary information; without access, simulator A' sees only the auxiliary information.]
• There is some utility of DB that legitimate users should learn
• Possible breach of privacy
• Goal: users learn the utility without the breach
Want: anything that can be learned about an individual from the database can be learned without access to the database
• ∀D ∀A ∃A' such that, w.h.p. over DB ∈R D, for all auxiliary information z:
|Pr[A(z) ↔ DB wins] - Pr[A'(z) wins]| is small
Illustrative Example of the Difficulty
Want: anything that can be learned about a respondent from the database can be learned without access to the database.
• More formally: ∀D ∀A ∃A' such that, w.h.p. over DB ∈R D, for all auxiliary information z, |Pr[A(z) ↔ DB wins] - Pr[A'(z) wins]| is small
Example: suppose an individual's height is sensitive information
– The average height in DB is not known a priori
• Aux z = "Adam is 5 cm shorter than the average in DB"
– A learns the average height in DB, hence also Adam's height
– A' does not
Defining "Win": The Compromise Function
Notion of privacy compromise:
[Diagram: DB ∈R D; the adversary outputs a candidate breach y; a compromise decider outputs 0/1.]
A privacy compromise should be non-trivial:
• It should not be possible to find a privacy breach from the auxiliary information alone
A privacy breach should exist:
• Given DB, there should be a y that is a privacy breach
• It should be possible to find y efficiently
Basic Concepts
• Distribution D on (finite) databases
– Something about the database must be unknown
– Captures knowledge about the domain
• E.g., rows of the database correspond to owners of 2 pets
• Privacy mechanism San(D, DB)
– Can be interactive or non-interactive
– May have access to the distribution D
• Auxiliary information generator AuxGen(D, DB)
– Has access to the distribution and to DB
– Formalizes partial knowledge about DB
• Utility vector w
– Answers to k questions about the DB
– (Most of) the utility vector can be learned by the user
– Utility: must inherit sufficient min-entropy from the source D
Impossibility Theorem: Informal
• For any* distribution D on databases DB
• For any* reasonable privacy-compromise decider C
• Fix any useful* privacy mechanism San
Then:
• There is an auxiliary information generator AuxGen and an adversary A such that
• For all adversary simulators A':
[A(z) ↔ San(DB)] wins, but [A'(z)] does not win,
where z = AuxGen(DB) tells us information we did not know, and winning means finding a compromise.
Impossibility Theorem
Fix any useful* privacy mechanism San and any reasonable privacy-compromise decider C. Then there is an auxiliary information generator AuxGen and an adversary A such that for "all" distributions D and all adversary simulators A':
Pr[A(D, San(D,DB), AuxGen(D,DB)) wins] - Pr[A'(D, AuxGen(D,DB)) wins] ≥ Δ
for a suitable, large Δ.
The probability spaces are over the choice of DB ∈R D and the coin flips of San, AuxGen, A, and A'.
To completely specify the theorem: need an assumption on the entropy of the utility vector w and on how well San(w) behaves.
Strategy
• The auxiliary information generator will provide a hint that, together with the utility vector w, yields the privacy breach.
• Want AuxGen to work without knowing D, just DB
– Find a privacy breach y and encode it in z
– Make sure z alone does not give y; only together with w
• Complication: is the utility vector w
– completely learned by the user?
– or just an approximation?
Entropy of Random Sources
• Source:
– A probability distribution X on {0,1}^n
– Contains some "randomness"
• Measures of "randomness":
– Shannon entropy: H(X) = -∑x PX(x) log PX(x)
• Represents how much we can compress X on average
• But even a high-entropy source may have a point with probability 0.9
– Min-entropy: H∞(X) = -log maxx PX(x)
• Represents the probability of the most likely value of X
Definition: X is a k-source if H∞(X) ≥ k, i.e., Pr[X=x] ≤ 2^-k for all x.

Min-Entropy
• Definition: X is a k-source if H∞(X) ≥ k, i.e., Pr[X=x] ≤ 2^-k for all x
• Examples:
– Bit-fixing: some k coordinates of X uniform, the rest fixed
• or even depending arbitrarily on the others
– Unpredictable source: ∀ i ∈ [n] and b1, …, bi-1 ∈ {0,1},
k/n ≤ Pr[Xi = 1 | X1, …, Xi-1 = b1, …, bi-1] ≤ 1 - k/n
– Flat k-source: uniform over S ⊆ {0,1}^n, |S| = 2^k
• Fact: every k-source is a convex combination of flat k-sources.
Min-Entropy and Statistical Distance
For a probability distribution X over {0,1}^n:
H∞(X) = -log maxx Pr[X = x]
X is a k-source if H∞(X) ≥ k.
(H∞ represents the probability of the most likely value of X.)
Statistical distance: Δ(X,Y) = ∑a |Pr[X=a] - Pr[Y=a]|
Want: the output close to the uniform distribution.
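A quick numerical sketch of the gap between the two entropy measures, plus the statistical distance (the 0.9-mass source mirrors the example above):

```python
import math

def shannon_entropy(p):
    """H(X) = -sum of p(x) log2 p(x) over the support."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def min_entropy(p):
    """H∞(X) = -log2 of the largest point probability."""
    return -math.log2(max(p.values()))

def stat_distance(p, q):
    """Slide's convention: sum of |p(a) - q(a)| (no 1/2 factor)."""
    return sum(abs(p.get(a, 0) - q.get(a, 0)) for a in p.keys() | q.keys())

# One outcome of probability 0.9, with 2^16 others sharing the rest:
# Shannon entropy is about 2.1 bits, but min-entropy is only 0.15 bits.
n = 2 ** 16
p = {0: 0.9, **{i: 0.1 / n for i in range(1, n + 1)}}
print(shannon_entropy(p), min_entropy(p))

uniform = {0: 0.5, 1: 0.5}
print(stat_distance({0: 0.9, 1: 0.1}, uniform))  # 0.8
```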
Extractors
Universal procedure for "purifying" an imperfect source.
Definition: Ext: {0,1}^n × {0,1}^d → {0,1}^ℓ is a (k,ε)-extractor if, for any k-source X,
Δ(Ext(X, Ud), Uℓ) ≤ ε
[Diagram: a k-source x of length n (one of 2^k strings in {0,1}^n) and a d-bit random seed s enter EXT, which outputs ℓ almost-uniform bits.]
Strong Extractors
Output looks random even after seeing the seed.
Definition: Ext is a (k,ε) strong extractor if Ext'(x,s) = s ∘ Ext(x,s) is a (k,ε)-extractor
• i.e., for every k-source X, for a 1-ε' fraction of seeds s ∈ {0,1}^d,
Ext(X,s) is ε-close to Uℓ.
Extractors from Hash Functions
• Leftover Hash Lemma [ILL89]: universal (pairwise independent) hash functions yield strong extractors
– Output length: ℓ = k - 2 log(1/ε) (i.e., k - O(1) for constant ε)
– Seed length: d = O(n)
– Example: Ext(x, (a,b)) = first ℓ bits of a·x + b in GF[2^n]
• Almost pairwise independence:
– Seed length: d = O(log n + k)
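A minimal sketch of the a·x + b construction above for n = 8 (the field GF(2^8) is represented with the AES polynomial x^8 + x^4 + x^3 + x + 1, an arbitrary choice of irreducible polynomial; a real application would use a much larger n):

```python
import secrets

N = 8
MODULUS = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

def gf2_mul(a: int, b: int) -> int:
    """Carry-less multiplication in GF(2^N), reduced by the field polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> N:          # reduce as soon as the degree reaches N
            a ^= MODULUS
    return result

def ext(x: int, seed: tuple[int, int], ell: int) -> int:
    """Ext(x, (a, b)) = first ell bits of a*x + b in GF(2^N)."""
    a, b = seed
    y = gf2_mul(a, x) ^ b   # addition in GF(2^N) is XOR
    return y >> (N - ell)   # keep the top ell bits

x = 0b10110001                                  # a sample from some k-source
seed = (secrets.randbelow(1 << N), secrets.randbelow(1 << N))
print(ext(x, seed, ell=4))                      # 4 nearly uniform bits
```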
Suppose w Is Learned Completely
AuxGen and A share a secret: w.
AuxGen(DB):
• Find a privacy breach y of DB, of length ℓ
• Find w from DB
– Simulate A
• Choose s ∈R {0,1}^d and compute Ext(w,s)
• Set z = (s, Ext(w,s) ⊕ y)
[Diagram: DB feeds both San and AuxGen; A interacts with San, receives z, and outputs a candidate breach checked by the compromise function C.]
Suppose w Is Learned Completely (cont.)
AuxGen and A share a secret: w.
z = (s, Ext(w,s) ⊕ y)
A learns w from its interaction with San, computes Ext(w,s), and recovers y from z; A' must do without w.
[Diagram: the real adversary A interacts with San(DB) and receives z; the simulator A' receives only z.]
Technical conditions: H∞(W | y) ≥ |y| and |y| "safe".
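A toy end-to-end sketch of this hint (SHA-256 stands in for the seeded strong extractor, purely for illustration; the proof of course uses a true (k,ε)-extractor, and all the strings below are placeholders):

```python
import secrets
import hashlib

def ext(w: bytes, s: bytes, ell: int) -> bytes:
    """Stand-in extractor: hash the source w with seed s, keep ell bytes."""
    return hashlib.sha256(s + w).digest()[:ell]

def aux_gen(w: bytes, y: bytes) -> tuple[bytes, bytes]:
    """AuxGen: choose a random seed s, publish z = (s, Ext(w,s) XOR y)."""
    s = secrets.token_bytes(16)
    pad = ext(w, s, len(y))
    return s, bytes(p ^ c for p, c in zip(pad, y))

def adversary(z: tuple[bytes, bytes], w: bytes) -> bytes:
    """A learns w via San(DB), re-derives the pad, and recovers the breach y."""
    s, masked = z
    pad = ext(w, s, len(masked))
    return bytes(p ^ c for p, c in zip(pad, masked))

w = b"utility vector learned from San"  # placeholder utility vector
y = b"privacy breach of DB"             # placeholder breach found by AuxGen
z = aux_gen(w, y)
assert adversary(z, w) == y             # A wins; to A', z is one-time-padded noise
```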
Why Is It a Compromise?
AuxGen and A share a secret: w.
z = (s, Ext(w,s) ⊕ y)
Why doesn't A' learn y?
• For each possible value of y, (s, Ext(w,s)) is ε-close to uniform
• Hence (s, Ext(w,s) ⊕ y) is ε-close to uniform
Need H∞(W) ≥ 3ℓ + O(1).
Technical conditions: H∞(W | y) ≥ |y| and |y| "safe".
To complete the proof
• Handle the case where not all of w is retrieved