February 28, 2005, 10th Estonian Winter School in Computer Science
Privacy Preserving Data Mining
Lecture 1
Motivating privacy research, Introducing Crypto
Benny Pinkas, HP Labs, Israel
Course structure
• Lecture 1:
  – Introduction to privacy
  – Introduction to cryptography, in particular to rigorous cryptographic analysis
    • Definitions
    • Proofs of security
• Lecture 2:
  – Cryptographic tools for privacy preserving data mining
• Lecture 3:
  – Non-cryptographic tools for privacy preserving data mining
  – In particular, answer perturbation
Privacy-Preserving Data Mining
• Allow multiple data holders to collaborate in order to compute important information while protecting the privacy of other information.
  – Security-related information
  – Public health information
  – Marketing information
• Advantages of privacy protection
  – Protection of personal information
  – Protection of proprietary or sensitive information
  – Enables collaboration between different data owners (since they may be more willing or able to collaborate if they need not reveal their information)
  – Compliance with the law
Privacy Preserving Data Mining
• Two papers appeared in 2000
  – “Privacy preserving data mining”, Agrawal and Srikant, SIGMOD 2000. (statistical approach)
  – “Privacy preserving data mining”, Lindell and Pinkas, Crypto 2000. (cryptographic approach)
• Why privacy now?
  – Technological changes erode privacy: ubiquitous computing, cheap storage.
  – Public awareness: health coverage, employment, personal relationships.
  – Historical changes: small towns vs. cities vs. connected society.
  – Privacy is a real problem that needs to be solved
Some data privacy cases: hospital data
• Hospital data contains
  – Identifying information: name, ID, address
  – General information: age, marital status
  – Medical information
  – Billing information
• Database access issues:
  – Your doctor should get all the information that is required to take care of you
  – Emergency rooms should get all the medical information that is required to take care of whoever comes there
  – The billing department should only get information relevant to billing
• Problem: how to stop employees from getting information about family, neighbors, celebrities?
Some data privacy cases: Medical Research
• Medical research:
  – Trying to learn patterns in the data, in “aggregate” form.
  – Problem: how to enable learning aggregate data without revealing personal medical information?
  – Hiding names is not enough, since there are many ways to uniquely identify a person
• A single hospital or medical researcher might not have enough data
• How can different organizations share research data without revealing personal data?
Public Data
• Many public records are available in electronic form: birth records, property records, voter registration
• “Your information serves as an error correcting code of your identity”
• Latanya Sweeney:
  – Date of birth uniquely identifies 12% of the population of Cambridge, MA.
  – Date of birth + gender: 29%
  – Date of birth + gender + (9 digit) zip code: 95%
  – Sweeney was therefore able to get her medical information from an “anonymized” database
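Uniqueness figures of this kind can be measured on any table by counting how many rows are the only ones with their combination of attribute values. A minimal sketch follows; the records are made-up toy values (not Sweeney's data) and `fraction_unique` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical toy records: (date_of_birth, gender, zip_code). Not real data.
records = [
    ("1970-01-01", "F", "021380001"),
    ("1970-01-01", "M", "021380001"),
    ("1970-01-01", "F", "021390002"),
    ("1982-05-17", "F", "021380001"),
    ("1982-05-17", "F", "021380001"),
]

def fraction_unique(rows, columns):
    """Fraction of rows whose values on `columns` match no other row."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(rows)

dob_only = fraction_unique(records, [0])
dob_gender = fraction_unique(records, [0, 1])
dob_gender_zip = fraction_unique(records, [0, 1, 2])
```

Adding attributes can only keep or increase the fraction of uniquely identified rows, which is why combinations of harmless-looking fields become identifying.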
Census data
• A trusted party (the census bureau) collects information about individuals
• Collected data:
  – Explicitly identifying data (names, addresses, ...)
  – Implicitly identifying data (combination of several attributes)
  – Private data
• The data is collected to help decision making
  – Partial or aggregate data should therefore be made public
Total Information Awareness (TIA)
• Collects information about transactions (credit card purchases, magazine subscriptions, bank deposits, flights)
  – Early detection of terrorist activity
  – Check out a chemistry book from the library, buy something at a hardware store and something in a pharmacy…
• Early detection of epidemic outbreaks
  – Early symptoms of anthrax are similar to the flu
  – Check non-traditional data sources: grocery and pharmacy data, school attendance records, etc.
  – Such systems are developed and used
• Could the collection of data be done in a privacy preserving manner (without learning about individuals)?
Basic Scenarios
• Single (centralized) database, e.g., census data:
  – This is often a simple abstraction of a more complicated scenario, so we had better solve this one
  – Need to collect data and present it in a privacy preserving way
• Published data (e.g., on a CD)
  – A “trusted” party collects data and then publishes a “sanitized” version
  – Users can do any computation they wish with the sanitized data
  – For example, statistical tabulations.
Basic Scenarios
• Multi database scenarios:
  – Two or more parties with private data want to cooperate.
  – Horizontally split: Each party has a large database. Databases have the same attributes but are about different subjects. For example, the parties are banks which each have information about their customers.
  – Vertically split: Each party has some information about the same set of subjects. For example, the participating parties are government agencies, each with some data about every citizen.
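[Figure: horizontally split data (bank 1 holds subjects u1…un, bank 2 holds subjects u’1…u’n) and vertically split data (houses, bank and taxes attributes for the same subjects u1…un)]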
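The two multi-database layouts can be illustrated on a toy table. This is only a sketch: the table contents and the helper names `horizontal_split` and `vertical_split` are hypothetical, chosen for illustration.

```python
# Hypothetical toy table: one row per subject, one column per attribute.
full_table = {
    "u1": {"bank": 1200, "houses": 1, "taxes": 300},
    "u2": {"bank": 50, "houses": 0, "taxes": 10},
    "u3": {"bank": 9000, "houses": 2, "taxes": 2500},
    "u4": {"bank": 400, "houses": 0, "taxes": 80},
}

def horizontal_split(table, subjects_a):
    """Each party holds all attributes, but for disjoint sets of subjects."""
    part_a = {u: row for u, row in table.items() if u in subjects_a}
    part_b = {u: row for u, row in table.items() if u not in subjects_a}
    return part_a, part_b

def vertical_split(table, attributes_a):
    """Each party holds some of the attributes for every subject."""
    part_a = {u: {k: v for k, v in row.items() if k in attributes_a}
              for u, row in table.items()}
    part_b = {u: {k: v for k, v in row.items() if k not in attributes_a}
              for u, row in table.items()}
    return part_a, part_b

bank1, bank2 = horizontal_split(full_table, {"u1", "u2"})
agency_a, agency_b = vertical_split(full_table, {"bank"})
```

In the horizontal case a joint computation runs over the union of rows; in the vertical case it must first match up the rows that belong to the same subject.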
Issues and Tools
• Best privacy can be achieved by not giving any data, but...
• Privacy tools: cryptography [LP00]
  – Encryption: data is hidden unless you have the decryption key. However, we also want to use the data.
  – Secure function evaluation: two or more parties with private inputs can compute any function they wish without revealing anything else.
  – Strong theory. Starts to be relevant to real applications.
• Non-cryptographic tools [AS00]
  – Query restriction: prevent certain queries from being answered.
  – Data/input/output perturbation: add errors to inputs to hide personal data while keeping aggregates accurate (randomization, rounding, data swapping).
  – Can these be understood as well as we understand crypto? Do they provide the same level of security as crypto?
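The idea behind input perturbation can be sketched as follows: each individual value gets independent zero-mean noise, so single records are distorted while the average stays close to the truth. This is a generic illustration, not the specific scheme of [AS00]; the ages are synthetic and the noise range of 50 is an arbitrary assumption.

```python
import random
from statistics import mean

def perturb(values, noise_range, rng):
    """Add independent zero-mean uniform noise to every value."""
    return [v + rng.uniform(-noise_range, noise_range) for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_ages = [rng.randint(18, 90) for _ in range(10_000)]  # synthetic data
noisy_ages = perturb(true_ages, 50, rng)

true_mean = mean(true_ages)
noisy_mean = mean(noisy_ages)
# Average distortion of a single record (large), versus the shift in the
# aggregate (small, because the noise averages out over many records).
mean_individual_error = mean(abs(a - b) for a, b in zip(true_ages, noisy_ages))
```

Whether such noise actually protects individuals is exactly the open question raised in the last bullet; the sketch only shows why aggregates survive.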
Introduction to Cryptography
Why learn/use crypto to solve privacy issues?
• Why are we referring to crypto?
  – Cryptography is one of the tools we can use for preserving privacy
  – A mature research area: many useful results/tools
  – Can reflect on our thinking: how is “security” defined in cryptography? How should we define “privacy”?
What is Cryptography?
Traditionally: how to maintain secrecy in communication
Alice and Bob talk while Eve tries to listen
[Figure: Alice and Bob communicate over a channel; Eve eavesdrops]
History of Cryptography
• Very ancient occupation
• Up to the mid 70’s: mostly classified military work
  – Exception: Shannon, Turing*
• Since then: explosive growth
  – Commercial applications
  – Scientific work: tight relationship with Computational Complexity Theory
  – Major works: Diffie-Hellman; Rivest, Shamir and Adleman (RSA)
• Recently: more involved models for more diverse tasks.
• Scope: how to maintain the secrecy, integrity and functionality of computer and communication systems.
Relation to computational hardness
• Cryptography uses problems that are infeasible to solve.
• Uses the intractability of some problems in order to construct secure systems.
  – Feasible: computable in probabilistic polynomial time (PPT)
  – Infeasible: no probabilistic polynomial time algorithm
  – Usually average-case hardness is needed
• For example, the discrete log problem
The Discrete Log Problem
• Let G be a group and g an element in G.
• Given $y \in G$, let x be the minimal non-negative integer satisfying the equation $y = g^x$. x is called the discrete log of y to base g.
• Example: $y = g^x \bmod p$ in the multiplicative group $\mathbb{Z}_p^*$ (p is prime). (For example, $p=7$, $g=3$, $y=4$ $\Rightarrow$ $x=4$.)
• In general, it is easy to exponentiate
  – (using repeated squaring and the binary representation of x)
• Computing the discrete log is believed to be hard in $\mathbb{Z}_p^*$ if p is large. (E.g., p is a prime, $|p| > 768$ bits, $p = 2q+1$ and q is also a prime.)
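The asymmetry between the two directions can be seen in a small sketch. Exponentiation by repeated squaring takes a number of steps proportional to the bit length of x, while the naive discrete log search below tries exponents one by one and is only usable for tiny p. The function names are illustrative; Python's built-in `pow(g, x, p)` already performs modular exponentiation.

```python
def mod_exp(g, x, p):
    """Compute g**x mod p by repeated squaring over the bits of x."""
    result = 1
    base = g % p
    while x > 0:
        if x & 1:                     # current low bit of x is set
            result = (result * base) % p
        base = (base * base) % p      # square for the next bit
        x >>= 1
    return result

def discrete_log_brute_force(y, g, p):
    """Return the minimal non-negative x with g**x = y (mod p), or None.

    Exhaustive search: the running time grows with p itself, so this is
    hopeless for the large primes used in practice.
    """
    value = 1
    for x in range(p - 1):
        if value == y % p:
            return x
        value = (value * g) % p
    return None
```

With the numbers from the example, `mod_exp(3, 4, 7)` gives 4 and `discrete_log_brute_force(4, 3, 7)` recovers x = 4.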
Encryption
• Alice wants to send a message $m \in \{0,1\}^n$ to Bob
  – Set-up phase is secret
  – Symmetric encryption: Alice and Bob share a secret key k
• They want to prevent Eve from learning anything about the message
[Figure: Alice sends $E_k(m)$ to Bob; Alice and Bob each hold the key k; Eve observes the channel]
Public key encryption
• Alice generates a private/public key pair (SK, PK)
• Only Alice knows the secret key SK
• Everyone (even Eve) knows the public key PK, and can encrypt messages to Alice
• Only Alice can decrypt (using SK)
[Figure: Bob and Charlie each hold PK and send $E_{PK}(m)$ to Alice, who holds SK; Eve observes the channel]
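To make the (SK, PK) interface concrete, here is a toy sketch using textbook RSA, one of the systems named on the history slide. The slides do not specify a scheme, so this choice is an assumption. The primes are tiny and there is no padding, so it is for illustration only and must not be used as real encryption. The three-argument `pow` with exponent -1 needs Python 3.8 or later.

```python
def generate_keypair(p, q, e):
    """Textbook RSA key generation from two primes p, q and public exponent e."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)          # modular inverse of e modulo phi
    return (n, e), (n, d)        # (PK, SK)

def encrypt(pk, m):
    """Anyone holding PK can encrypt an integer message 0 <= m < n."""
    n, e = pk
    return pow(m, e, n)

def decrypt(sk, c):
    """Only the holder of SK can decrypt."""
    n, d = sk
    return pow(c, d, n)

pk, sk = generate_keypair(61, 53, 17)   # toy parameters
ciphertext = encrypt(pk, 65)
plaintext = decrypt(sk, ciphertext)
```

Note that this textbook variant is deterministic: encrypting the same m twice gives the same ciphertext, so Eve can test guesses for m by encrypting them herself under PK.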
Rigorous Specification of Security
To define the security of a system we must specify:
1. What constitutes a failure of the system
2. The power of the adversary
  – Computational
  – Access to the system
  – What it means to break the system
What does ‘learn’ mean?
• Even if Eve has some prior knowledge of m, she should not have any advantage in
  – the probability of guessing m, or the probability of guessing whether m is $m_0$ or $m_1$, or the probability of computing any other function f of m, or even computing |m|
• Ideally: the message sent is independent of the message m
  – Implies all the above
• Achievable: one-time pad (symmetric encryption)
  – Let $r \in_R \{0,1\}^n$ be the shared key.
  – Let $m \in \{0,1\}^n$
  – To encrypt m, send $z = r \oplus m$
  – To decrypt z, compute $m = z \oplus r$
• Shannon: achievable only if the entropy of the shared secret is at least as large as that of m. Therefore one must use a long key.
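The one-time pad can be sketched in a few lines, working on bytes rather than individual bits. The helper names are illustrative; `secrets.token_bytes` from the standard library supplies the random key r.

```python
import secrets

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("key and message must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))

def otp_keygen(n_bytes):
    """Sample the shared key r uniformly at random."""
    return secrets.token_bytes(n_bytes)

message = b"attack at dawn"
key = otp_keygen(len(message))          # r, as long as the message
ciphertext = xor_bytes(key, message)    # z = r XOR m
recovered = xor_bytes(ciphertext, key)  # m = z XOR r
```

The key is as long as the message and must never be reused, which is the practical cost behind Shannon's bound.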
Defining security
• The power of the adversary
  – Computational: probabilistic polynomial time machine (PPTM)
  – Access to the system: e.g., can it change messages?
  – Passive adversary, (adaptive) chosen plaintext attack, chosen ciphertext attack…
• What constitutes a failure of the system?
  – Recovering plaintext from ciphertext: not enough
    • Allows for the leakage of partial information
    • In general, hard to answer which partial information may/should not be leaked. Application dependent.
    • How would partial information the adversary already holds be combined with what he learns to affect privacy?
  – Better: prevent learning anything about an encrypted message
    • There are two common, equivalent, definitions…
Security of Encryption: Definition 1
Indistinguishability of Encryptions
• Adversary A chooses any $X_0, X_1 \in \{0,1\}^n$
• Receives an encryption of $X_b$ for $b \in_R \{0,1\}$
• Has to decide whether $b = 0$ or $b = 1$.
For every PPTM A, choosing a pair $X_0, X_1 \in \{0,1\}^n$:
$| \Pr[A(E(X_0)) = \text{‘1’}] - \Pr[A(E(X_1)) = \text{‘1’}] | = \mathrm{neg}(n)$
  – (Probability is over the choice of keys, randomization in the encryption and A’s coins)
• Note that a proof of security must be rigorous
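The distinguishing experiment can be run empirically for a fixed adversary. The sketch below estimates the gap between the two probabilities by sampling, once for a one-time pad with a fresh key per message and once for a deliberately broken scheme. `leaky_encrypt` and `last_byte_adversary` are hypothetical constructions for illustration, and a sampled estimate for one adversary is no substitute for a proof over all PPTM adversaries.

```python
import secrets

def otp_encrypt(message):
    """One-time pad with a fresh random key per call (key discarded here)."""
    key = secrets.token_bytes(len(message))
    return bytes(k ^ m for k, m in zip(key, message))

def leaky_encrypt(message):
    """Hypothetical broken scheme: pads every byte except the last one."""
    return otp_encrypt(message[:-1]) + message[-1:]

def last_byte_adversary(ciphertext):
    """Outputs 1 iff the last ciphertext byte is odd."""
    return ciphertext[-1] & 1

def estimate_advantage(encrypt, adversary, x0, x1, trials=2000):
    """Estimate |Pr[A(E(X0)) = 1] - Pr[A(E(X1)) = 1]| by sampling."""
    ones_0 = sum(adversary(encrypt(x0)) for _ in range(trials))
    ones_1 = sum(adversary(encrypt(x1)) for _ in range(trials))
    return abs(ones_0 - ones_1) / trials

# The adversary picks two messages that differ only in the last bit.
x0, x1 = b"msg\x00", b"msg\x01"
adv_leaky = estimate_advantage(leaky_encrypt, last_byte_adversary, x0, x1)
adv_otp = estimate_advantage(otp_encrypt, last_byte_adversary, x0, x1)
```

Against the leaky scheme the adversary always distinguishes; against the one-time pad its estimated advantage stays near zero, up to sampling noise.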
Security of Encryption: Definition 2
Semantic Security
Simulation: Whatever adversary A can compute given an encryption of $X \in \{0,1\}^n$, so can a ‘simulator’ S that does not get to see the encryption of X.
• A selects a distribution $D_n$ on $\{0,1\}^n$ and a relation R(X,Y) computable in PPT (e.g., R(X,Y)=1 iff Y is the last bit of X).
• $X \in_R D_n$ is sampled
• Given E(X), A outputs Y trying to satisfy R(X,Y)
• The simulator S does the same without access to E(X)
• Simulation is successful if A and S have the same success probability
• Successful simulation $\Rightarrow$ semantic security
Security of Encryption (2)
Semantic Security
More formally:
For every PPTM A there is a PPTM S so that for all PPT-computable relations R, for $X \in_R D_n$,
$| \Pr[R(X, A(E(X)))] - \Pr[R(X, S())] |$
is negligible.
In other words: the outputs of A and S are indistinguishable even for a test that is aware of X.
Which is the Right Definition?
• Semantic security seems to convey that the message is protected
• But it is usually easier to prove indistinguishability of encryptions
• Would like to argue that the two definitions are equivalent
• Must define the attack: chosen plaintext attack
  – Adversary can obtain the encryption of any message it chooses, in an adaptive manner
  – More severe attacks: chosen ciphertext
• The Equivalence Theorem: a cryptosystem is semantically secure if and only if it has the indistinguishability of encryptions property
Equivalence Proof (informal)
Semantic security $\Rightarrow$ Indistinguishability of encryptions
• Suppose no indistinguishability:
  – A chooses a pair $X_0, X_1 \in \{0,1\}^n$ for which it can distinguish encryptions with non-negligible advantage $\varepsilon$
• Choose
  – Distribution $D_n = \{X_0, X_1\}$
  – Relation R which is “equality with X”
• For any S that doesn’t get E(X) and outputs Y’, we have $\Pr[R(X, Y')] = 1/2$
• Given $E(X_b)$, run $A(E(X_b))$, get output $b' \in \{0,1\}$, set $Y = X_{b'}$
• Now, $| \Pr[A(E(X_b)) = \text{‘1’} \mid b = 1] - \Pr[A(E(X_b)) = \text{‘1’} \mid b = 0] | > \varepsilon$
• Therefore, $| \Pr[R(X,Y)] - \Pr[R(X,Y')] | > \varepsilon / 2$
Equivalence Proof (informal)
Indistinguishability of encryptions $\Rightarrow$ Semantic security
• Suppose no semantic security: A chooses some distribution $D_n$ and some relation R
• Choose $X_0, X_1 \in_R D_n$, choose $b \in_R \{0,1\}$, compute $E(X_b)$.
  – Give $E(X_b)$ to A, ask A to compute $Y_b = A(E(X_b))$
• For $X_0, X_1 \in_R D_n$ let
  – $\alpha_0 = \Pr[R(X_0, Y_b)]$, $\alpha_1 = \Pr[R(X_1, Y_b)]$
• With noticeable probability $|\alpha_0 - \alpha_1|$ is non-negligible, since otherwise $Y_b$ can be computed without the encryption.
• If $|\alpha_0 - \alpha_1|$ is non-negligible, then we can distinguish between an encryption of $X_0$ and an encryption of $X_1$
Lessons learned?
• Rigorous approach to cryptography
  – Defining security
  – Proving security
References
Books:
• O. Goldreich, Foundations of Cryptography, Vol 1, Basic Tools, Cambridge, 2001
    • Pseudo-randomness, zero-knowledge
  – Vol 2, Basic Applications (to be available May 2004)
    • Encryption, Secure Function Evaluation
  – Other volumes at www.wisdom.weizmann.ac.il/~oded/books.html
Web material/courses:
• S. Goldwasser and M. Bellare, Lecture Notes on Cryptography, http://www-cse.ucsd.edu/~mihir/papers/gb.html
• M. Naor, 9th EWSCS, http://www.cs.ioc.ee/yik/schools/win2004/naor.php
Secure Function Evaluation
• A major topic of cryptographic research
• How to let n parties, $P_1, \ldots, P_n$, compute a function $f(x_1, \ldots, x_n)$
  – Where input $x_i$ is known to party $P_i$
  – Parties learn the final output and nothing else
The Millionaires Problem [Yao]
Alice holds x, Bob holds y.
Whose value is greater?
Leak no other information!
Comparing Information without Leaking it
• Output: Is x = y?
• The following solution is insecure:
  – Use a one-way hash function H()
  – Alice publishes H(x), Bob publishes H(y)
[Figure: Alice holds x, Bob holds y]
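One way to see the weakness, assuming the inputs come from a small or guessable set: H is deterministic and public, so anyone who sees H(x) can hash every candidate value and compare. A minimal sketch with SHA-256 standing in for H; the candidate range and Alice's input are hypothetical.

```python
import hashlib

def H(value):
    """Stand-in for the one-way hash function (SHA-256 of the decimal string)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()

def dictionary_attack(published_hash, candidates):
    """Recover the input by hashing every candidate and comparing."""
    for guess in candidates:
        if H(guess) == published_hash:
            return guess
    return None

alice_x = 52000                 # hypothetical private input
published = H(alice_x)          # what Alice publishes in the protocol
# Bob (or Eve) tries all plausible values, here multiples of 1000 up to 100000.
recovered = dictionary_attack(published, range(0, 100001, 1000))
```

One-wayness only makes inversion hard for inputs with enough entropy; it does not hide an input that can be guessed.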
Secure two-party computation - definition
Input: Alice holds x, Bob holds y
Output: F(x,y) and nothing else
As if… both parties hand their inputs to a trusted third party, which returns F(x,y) to each of them
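The ideal world with a trusted third party can be written down directly; a secure protocol is then one that achieves the same effect without such a party. A sketch, instantiated with the Millionaires' Problem; the class and method names are illustrative, not part of any library.

```python
class TrustedThirdParty:
    """Ideal-world functionality: collects both inputs, reveals only F(x, y)."""

    def __init__(self, function):
        self._function = function
        self._inputs = {}

    def submit(self, party, value):
        """A party privately hands its input to the trusted party."""
        self._inputs[party] = value

    def output(self):
        """Return F(x, y); the inputs themselves are never exposed."""
        if set(self._inputs) != {"alice", "bob"}:
            raise RuntimeError("both parties must submit an input first")
        return self._function(self._inputs["alice"], self._inputs["bob"])

def millionaires(x, y):
    """F(x, y) for the Millionaires' Problem: True iff Alice's value is greater."""
    return x > y

ttp = TrustedThirdParty(millionaires)
ttp.submit("alice", 7_000_000)
ttp.submit("bob", 5_000_000)
result = ttp.output()
```

Each party sees only `result` and whatever can be inferred from it together with its own input.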
Leak no other information
• A protocol is secure if it emulates the ideal solution
• Alice learns F(x,y), and therefore can compute everything that is implied by x, her prior knowledge of y, and F(x,y).
• Alice should not be able to compute anything else
• Simulation:
  – A protocol is considered secure if: for every adversary in the real world there exists a simulator in the ideal world which outputs an indistinguishable “transcript”, given access to the information that the adversary is allowed to learn
More tomorrow…