Privacy Preserving Data Mining
Lecture 1: Motivating privacy research, Introducing Crypto
Benny Pinkas, HP Labs, Israel
10th Estonian Winter School in Computer Science, February 28, 2005



Course structure

• Lecture 1:
  – Introduction to privacy
  – Introduction to cryptography, in particular to rigorous cryptographic analysis:
    • Definitions
    • Proofs of security
• Lecture 2:
  – Cryptographic tools for privacy preserving data mining.
• Lecture 3:
  – Non-cryptographic tools for privacy preserving data mining, in particular answer perturbation.


Privacy-Preserving Data Mining

• Allow multiple data holders to collaborate in order to compute important information while protecting the privacy of other information:
  – Security-related information
  – Public health information
  – Marketing information
• Advantages of privacy protection:
  – Protection of personal information
  – Protection of proprietary or sensitive information
  – Enables collaboration between different data owners (they may be more willing or able to collaborate if they need not reveal their information)
  – Compliance with the law


Privacy Preserving Data Mining

• Two papers appeared in 2000:
  – “Privacy preserving data mining”, Agrawal and Srikant, SIGMOD 2000 (statistical approach).
  – “Privacy preserving data mining”, Lindell and Pinkas, Crypto 2000 (cryptographic approach).
• Why privacy now?
  – Technological changes erode privacy: ubiquitous computing, cheap storage.
  – Public awareness: health coverage, employment, personal relationships.
  – Historical changes: small towns vs. cities vs. the connected society.
  – Privacy is a real problem that needs to be solved.


Some data privacy cases: hospital data

• Hospital data contains:
  – Identifying information: name, id, address
  – General information: age, marital status
  – Medical information
  – Billing information
• Database access issues:
  – Your doctor should get all the information that is required to take care of you.
  – Emergency rooms should get all the medical information that is required to take care of whoever comes in.
  – The billing department should only get information relevant to billing.
• Problem: how do we stop employees from looking up information about family, neighbors, or celebrities?


Some data privacy cases: Medical Research

• Medical research:
  – Trying to learn patterns in the data, in “aggregate” form.
  – Problem: how to enable learning aggregate data without revealing personal medical information?
  – Hiding names is not enough, since there are many other ways to uniquely identify a person.
• A single hospital or medical researcher might not have enough data.
• How can different organizations share research data without revealing personal data?


Public Data

• Many public records are available in electronic form: birth records, property records, voter registration.
• “Your information serves as an error correcting code of your identity.”
• Latanya Sweeney:
  – Date of birth uniquely identifies 12% of the population of Cambridge, MA.
  – Date of birth + gender: 29%.
  – Date of birth + gender + (9 digit) zip code: 95%.
  – Sweeney was therefore able to retrieve her medical information from an “anonymized” database. (A toy illustration of quasi-identifier uniqueness follows below.)
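To get a feel for these numbers, here is a minimal sketch that measures quasi-identifier uniqueness on a synthetic population (the records, field names, and resulting fractions are made up for illustration and will not reproduce Sweeney's figures):

```python
from collections import Counter
import random

random.seed(0)
# Hypothetical synthetic population; attribute names are illustrative only.
people = [{
    "dob": (random.randint(1930, 2004), random.randint(1, 12), random.randint(1, 28)),
    "gender": random.choice("MF"),
    "zip9": random.randint(0, 10**9 - 1),
} for _ in range(5000)]

def unique_fraction(records, keys):
    """Fraction of records whose quasi-identifier combination is unique."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return sum(1 for r in records
               if counts[tuple(r[k] for k in keys)] == 1) / len(records)

# Uniqueness grows quickly as attributes are combined.
for keys in [("dob",), ("dob", "gender"), ("dob", "gender", "zip9")]:
    print(keys, round(unique_fraction(people, keys), 2))
```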


Census data

• A trusted party (the census bureau) collects information about individuals.
• Collected data:
  – Explicitly identifying data (names, address…)
  – Implicitly identifying data (a combination of several attributes)
  – Private data
• The data is collected to help decision making:
  – Partial or aggregate data should therefore be made public.


Total Information Awareness (TIA)

• Collects information about transactions (credit card purchases, magazine subscriptions, bank deposits, flights):
  – Early detection of terrorist activity: e.g., someone checks out a chemistry book from the library, then buys something at a hardware store and something at a pharmacy…
• Early detection of epidemic outbreaks:
  – Early symptoms of Anthrax are similar to the flu.
  – Check non-traditional data sources: grocery and pharmacy data, school attendance records, etc.
  – Such systems are being developed and used.
• Could the collection of data be done in a privacy preserving manner, without learning about individuals?


Basic Scenarios

• Single (centralized) database, e.g., census data:
  – This is often a simple abstraction of a more complicated scenario, so we had better solve this one.
  – Need to collect data and present it in a privacy preserving way.
• Published data (e.g., on a CD):
  – A “trusted” party collects data and then publishes a “sanitized” version.
  – Users can do any computation they wish with the sanitized data, for example statistical tabulations.


Basic Scenarios

• Multi-database scenarios: two or more parties with private data want to cooperate.
  – Horizontally split: each party has a large database. The databases have the same attributes but are about different subjects. For example, the parties are banks which each have information about their own customers.
  – Vertically split: each party has some information about the same set of subjects. For example, the participating parties are government agencies, each with some data about every citizen. (Both partitions are sketched in code after the figure.)

[Figure: horizontal split — bank 1 holds records for users u1…un and bank 2 holds records for users u'1…u'n over the same attributes; vertical split — agencies hold different attributes (houses, bank, taxes) for the same users u1…un.]
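To make the two partitions concrete, here is a minimal sketch in code; all records, user ids, and attribute names are made up for illustration:

```python
# Horizontally split: same attributes (schema), disjoint sets of subjects.
bank1 = {"u1": {"balance": 120, "loans": 1},
         "u2": {"balance": 800, "loans": 0}}
bank2 = {"v1": {"balance": 430, "loans": 2},
         "v2": {"balance": 55,  "loans": 0}}

# Vertically split: same subjects, each party holds different attributes.
housing_agency = {"u1": {"houses": 1}, "u2": {"houses": 0}}
tax_agency     = {"u1": {"taxes": 30}, "u2": {"taxes": 12}}

# A join over the vertical split would reveal everything about each citizen.
# The goal is to compute aggregates (say, average taxes of home owners)
# without performing that join in the clear.
```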


Issues and Tools

• The best privacy can be achieved by not giving out any data, but then the data cannot be used…
• Privacy tools: cryptography [LP00]
  – Encryption: data is hidden unless you have the decryption key. However, we also want to use the data.
  – Secure function evaluation: two or more parties with private inputs can compute any function they wish without revealing anything else.
  – Strong theory. It is starting to become relevant to real applications.
• Non-cryptographic tools [AS00]
  – Query restriction: prevent certain queries from being answered.
  – Data/input/output perturbation: add errors to inputs, hiding personal data while keeping aggregates accurate (randomization, rounding, data swapping). A minimal sketch follows below.
  – Can these be understood as well as we understand crypto? Do they provide the same level of security as crypto?
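As a taste of input perturbation, here is a minimal sketch of classic randomized response, one possible perturbation scheme (the flip probability and the debiasing formula are standard for this scheme, not taken from [AS00]): each respondent flips their answer with probability p, and the analyst inverts the known noise to recover the aggregate.

```python
import random

def perturb(bit: bool, p: float = 0.25) -> bool:
    """Report the true bit with prob. 1-p, its flip with prob. p."""
    return bit ^ (random.random() < p)

def debias(reports, p: float = 0.25) -> float:
    """Estimate the true fraction of 1s: observed = true*(1-p) + (1-true)*p."""
    observed = sum(reports) / len(reports)
    return (observed - p) / (1 - 2 * p)

random.seed(1)
truth = [random.random() < 0.3 for _ in range(100_000)]  # 30% have the property
reports = [perturb(b) for b in truth]
print(round(debias(reports), 3))  # close to 0.3, yet each single report is deniable
```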


Introduction to Cryptography


Why learn/use crypto to solve privacy issues?

• Why are we turning to crypto?
  – Cryptography is one of the tools we can use for preserving privacy.
  – It is a mature research area with many useful results and tools.
  – It can inform our thinking: how is “security” defined in cryptography? How should we define “privacy”?


What is Cryptography?

Traditionally: how to maintain secrecy in communication

Alice and Bob talk while Eve tries to listen



History of Cryptography

• A very ancient occupation.
• Up to the mid 70’s: mostly classified military work.
  – Exceptions: Shannon, Turing.
• Since then: explosive growth.
  – Commercial applications.
  – Scientific work: a tight relationship with computational complexity theory.
  – Major works: Diffie-Hellman; Rivest, Shamir and Adleman (RSA).
• Recently: more involved models for more diverse tasks.
• Scope: how to maintain secrecy, integrity and functionality in computer and communication systems.


Relation to computational hardness

• Cryptography uses problems that are infeasible to solve.
• It uses the intractability of some problems in order to construct secure systems.
  – Feasible: computable in probabilistic polynomial time (PPT).
  – Infeasible: no probabilistic polynomial time algorithm.
  – Usually average-case hardness is needed.
• For example, the discrete log problem.


The Discrete Log Problem

• Let G be a group and g an element of G.
• Given y ∈ G, let x be the minimal non-negative integer satisfying the equation y = g^x. x is called the discrete log of y to base g.
• Example: y = g^x mod p in the multiplicative group Zp* (p prime). (For example, p=7, g=3, y=4 ⇒ x=4, since 3^4 = 81 = 4 mod 7.)
• In general, it is easy to exponentiate (using repeated squaring and the binary representation of x).
• Computing the discrete log is believed to be hard in Zp* if p is large. (E.g., p is a prime, |p| > 768 bits, p = 2q+1 and q is also a prime.) A runnable sketch of this asymmetry follows below.
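A minimal sketch of the asymmetry, using the slide's toy parameters (real groups use primes of hundreds of digits): exponentiation via repeated squaring is fast, while the only generic inversion shown here is brute-force search over the group.

```python
def power_mod(g: int, x: int, p: int) -> int:
    """Repeated squaring over the binary representation of x (same as pow(g, x, p))."""
    result = 1
    g %= p
    while x > 0:
        if x & 1:                 # current bit of x is 1
            result = result * g % p
        g = g * g % p             # square for the next bit
        x >>= 1
    return result

def dlog_bruteforce(y: int, g: int, p: int) -> int:
    """Exhaustive search: time linear in the group order, hopeless for large p."""
    for x in range(p):
        if power_mod(g, x, p) == y:
            return x
    raise ValueError("no discrete log found")

print(power_mod(3, 4, 7))        # 4, the slide's example
print(dlog_bruteforce(4, 3, 7))  # 4 -- only feasible because p is tiny
```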


Encryption

• Alice wants to send a message m ∈ {0,1}^n to Bob.
  – The set-up phase is secret.
  – Symmetric encryption: Alice and Bob share a secret key k.
• They want to prevent Eve from learning anything about the message.

[Figure: Alice sends Ek(m) to Bob; both hold the key k; Eve observes the channel.]


Public key encryption

• Alice generates a private/public key pair (SK, PK).
• Only Alice knows the secret key SK.
• Everyone (even Eve) knows the public key PK, and can encrypt messages to Alice.
• Only Alice can decrypt (using SK). (A toy sketch follows the figure.)

[Figure: Bob and Charlie each know PK and send EPK(m) to Alice, who holds SK; Eve observes the channel.]
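Public key encryption can be built directly on the discrete log problem from the previous slide. The slides do not prescribe a scheme; below is a minimal textbook ElGamal sketch (toy parameters, no padding, and none of the group subtleties needed for semantic security — it only illustrates the key/encrypt/decrypt flow):

```python
import random

p, g = 2**127 - 1, 3   # illustrative prime modulus and base, NOT vetted parameters

def keygen():
    sk = random.randrange(2, p - 1)      # Alice's secret exponent
    pk = pow(g, sk, p)                   # public value: g^sk mod p
    return sk, pk

def encrypt(pk: int, m: int):
    r = random.randrange(2, p - 1)       # fresh randomness per message
    return pow(g, r, p), m * pow(pk, r, p) % p   # (g^r, m * pk^r)

def decrypt(sk: int, c1: int, c2: int) -> int:
    s = pow(c1, sk, p)                   # shared value g^(r*sk)
    return c2 * pow(s, -1, p) % p        # divide it out (Python 3.8+ inverse)

sk, pk = keygen()
c1, c2 = encrypt(pk, 42)                 # anyone who knows pk can do this
print(decrypt(sk, c1, c2))               # 42 -- only the holder of sk can
```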


Rigorous Specification of Security

To define the security of a system we must specify:
1. What constitutes a failure of the system.
2. The power of the adversary:
  – its computational power
  – its access to the system
  – what it means to break the system.


What does “learn” mean?

• Even if Eve has some prior knowledge of m, she should not gain any advantage in:
  – the probability of guessing m, the probability of guessing whether m is m0 or m1, the probability of computing any other function f of m, or even of computing |m|.
• Ideally: the message sent is independent of the message m.
  – This implies all of the above.
• Achievable: the one-time pad (symmetric encryption).
  – Let r ←R {0,1}^n be the shared key, and let m ∈ {0,1}^n.
  – To encrypt m, send z = r ⊕ m.
  – To decrypt z, compute m = z ⊕ r.
• Shannon: this is achievable only if the entropy of the shared secret is at least as large as that of m. Therefore one must use a long key. (A byte-level sketch follows below.)
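A minimal sketch of the one-time pad over bytes rather than single bits (the XOR structure is exactly the slide's r ⊕ m; reusing a pad, of course, voids all guarantees):

```python
import secrets

def otp_encrypt(key: bytes, message: bytes) -> bytes:
    """z = r XOR m; requires a fresh key as long as the message."""
    assert len(key) == len(message)
    return bytes(k ^ b for k, b in zip(key, message))

otp_decrypt = otp_encrypt          # decryption is the same XOR: m = z XOR r

m = b"attack at dawn"
r = secrets.token_bytes(len(m))    # shared uniformly random pad
z = otp_encrypt(r, m)
assert otp_decrypt(r, z) == m
# Without r, z is uniformly distributed and statistically independent of m.
```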


Defining security

• The power of the adversary:
  – Computational: a probabilistic polynomial time machine (PPTM).
  – Access to the system: e.g., can it change messages?
  – Passive adversary, (adaptive) chosen plaintext attack, chosen ciphertext attack…
• What constitutes a failure of the system?
  – Recovering the plaintext from the ciphertext: not enough.
    • This allows for the leakage of partial information.
    • In general, it is hard to say which partial information may or should not be leaked; this is application dependent.
    • How would partial information the adversary already holds be combined with what it learns, to affect privacy?
  – Better: prevent learning anything about an encrypted message.
    • There are two common, equivalent, definitions…


Security of Encryption, Definition 1: Indistinguishability of Encryptions

• The adversary A chooses any X0, X1 ∈ {0,1}^n.
• It receives an encryption of Xb for b ←R {0,1}.
• It has to decide whether b = 0 or b = 1.

For every PPTM A choosing a pair X0, X1 ∈ {0,1}^n:
  | Pr[A(E(X0)) = ‘1’] − Pr[A(E(X1)) = ‘1’] | = neg(n)
  – (The probability is over the choice of keys, the randomization in the encryption, and A’s coins.)
• Note that a proof of security must be rigorous. (A runnable sketch of this game follows below.)
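A minimal sketch of this definition as an executable experiment; the scheme and adversary are placeholder parameters you would plug in, and a scheme fails if some adversary's estimated advantage is non-negligible:

```python
import random, secrets

def ind_experiment(encrypt, keygen, adversary, x0: bytes, x1: bytes, trials=10_000):
    """Estimate |Pr[A(E(x0))=1] - Pr[A(E(x1))=1]| for a candidate scheme."""
    wins = 0
    for _ in range(trials):
        b = random.getrandbits(1)            # secret challenge bit
        c = encrypt(keygen(), [x0, x1][b])   # fresh key per encryption
        if adversary(c) == b:
            wins += 1
    return abs(wins / trials - 0.5) * 2      # advantage over random guessing

# Example: the one-time pad defeats even an adversary inspecting the bytes.
otp_enc = lambda k, m: bytes(a ^ b for a, b in zip(k, m))
keygen = lambda: secrets.token_bytes(4)
guess_by_first_byte = lambda c: int(c[0] > 127)   # a (futile) distinguisher
print(ind_experiment(otp_enc, keygen, guess_by_first_byte, b"AAAA", b"ZZZZ"))
# ~0.0 up to sampling noise: the ciphertext distribution is identical for both
```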


Security of Encryption, Definition 2: Semantic Security

Simulation: whatever the adversary A can compute given an encryption of X ∈ {0,1}^n, a “simulator” S that does not get to see the encryption of X can compute as well.

• A selects a distribution Dn on {0,1}^n and a relation R(X,Y) computable in PPT (e.g., R(X,Y) = 1 iff Y is the last bit of X).
• X ←R Dn is sampled.
• Given E(X), A outputs Y, trying to satisfy R(X,Y).
• The simulator S does the same without access to E(X).
• The simulation is successful if A and S have the same success probability.
• Successful simulation ⇒ semantic security.


Security of Encryption, Definition 2 (cont.): Semantic Security

More formally: for every PPTM A there is a PPTM S such that for every PPT relation R, for X ←R Dn,

  | Pr[R(X, A(E(X)))] − Pr[R(X, S())] |

is negligible.

In other words: the outputs of A and S are indistinguishable, even for a test that is aware of X.
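To see the simulation idea concretely, here is a minimal sketch for the one-time pad with the slide's example relation, R(X,Y) = "Y is the last bit of X": the simulator, which never sees the ciphertext, matches the adversary's success probability.

```python
import random, secrets

n = 8
sample_X = lambda: random.getrandbits(n)          # Dn: uniform over {0,1}^n
R = lambda X, Y: Y == (X & 1)                     # relation: Y is the last bit of X

def adversary(ciphertext: int) -> int:
    return ciphertext & 1          # best effort: last bit of the OTP ciphertext

def simulator() -> int:
    return random.getrandbits(1)   # sees no ciphertext at all; just guesses

trials, a_wins, s_wins = 100_000, 0, 0
for _ in range(trials):
    X = sample_X()
    E_X = X ^ secrets.randbelow(2**n)             # one-time pad encryption
    a_wins += R(X, adversary(E_X))
    s_wins += R(X, simulator())
print(a_wins / trials, s_wins / trials)           # both ~0.5: A learned nothing
```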


Which is the Right Definition?

• Semantic security seems to convey that the message is protected.
• But it is usually easier to prove indistinguishability of encryptions.
• We would like to argue that the two definitions are equivalent.
• We must define the attack: a chosen plaintext attack.
  – The adversary can obtain the encryption of any message it chooses, in an adaptive manner.
  – More severe attacks: chosen ciphertext.
• The Equivalence Theorem: a cryptosystem is semantically secure if and only if it has the indistinguishability of encryptions property.


Equivalence Proof (informal)

Semantic security ⇒ indistinguishability of encryptions:
• Suppose there is no indistinguishability:
  – A chooses a pair X0, X1 ∈ {0,1}^n for which it can distinguish encryptions with non-negligible advantage ε.
• Choose:
  – the distribution Dn = {X0, X1}
  – the relation R which is “equality with X”.
• For any simulator S that doesn’t get E(X) and outputs Y’, we have Pr[R(X, Y’)] = ½.
• Given E(Xb), run A(E(Xb)), get its output b’ ∈ {0,1}, and set Y = Xb’.
• Now | Pr[A(E(Xb)) = ‘1’ | b = 1] − Pr[A(E(Xb)) = ‘1’ | b = 0] | > ε.
• Therefore | Pr[R(X, Y)] − Pr[R(X, Y’)] | > ε/2.


Equivalence Proof (informal)

Indistinguishability of encryptions ⇒ semantic security:
• Suppose there is no semantic security: A chooses some distribution Dn and some relation R.
• Choose X0, X1 ←R Dn, choose b ←R {0,1}, and compute E(Xb).
  – Give E(Xb) to A, and ask A to compute Yb = A(E(Xb)).
• For X0, X1 ←R Dn let
  – α0 = Pr[R(X0, Yb)], α1 = Pr[R(X1, Yb)].
• With noticeable probability |α0 − α1| is non-negligible, since otherwise Yb could be computed without the encryption.
• If |α0 − α1| is non-negligible, then we can distinguish between an encryption of X0 and an encryption of X1.


Lessons learned?

• A rigorous approach to cryptography:
  – defining security
  – proving security


References

Books:
• O. Goldreich, Foundations of Cryptography, Vol. 1: Basic Tools, Cambridge University Press, 2001. (Pseudo-randomness, zero-knowledge.)
• O. Goldreich, Foundations of Cryptography, Vol. 2: Basic Applications (to be available May 2004). (Encryption, secure function evaluation.)
  – Other volumes at www.wisdom.weizmann.ac.il/~oded/books.html

Web material/courses:
• S. Goldwasser and M. Bellare, Lecture Notes on Cryptography, http://www-cse.ucsd.edu/~mihir/papers/gb.html
• M. Naor, 9th EWSCS, http://www.cs.ioc.ee/yik/schools/win2004/naor.php


Secure Function Evaluation

• A major topic of cryptographic research.
• How to let n parties P1,…,Pn compute a function f(x1,…,xn):
  – where input xi is known only to party Pi
  – the parties learn the final output and nothing else.


The Millionaires Problem [Yao]

[Figure: Alice holds x and Bob holds y; they want to learn whose value is greater while leaking no other information.]


Comparing Information without Leaking it

• Alice holds x, Bob holds y. Output: is x = y?
• The following solution is insecure (see the sketch below):
  – Use a one-way hash function H().
  – Alice publishes H(x), Bob publishes H(y).
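Why is publishing hashes insecure? If x comes from a guessable domain, anyone can hash every candidate and recover it. A minimal sketch (the domain and values below are made up for illustration):

```python
import hashlib

H = lambda v: hashlib.sha256(v.encode()).hexdigest()

# Suppose Alice's secret is a salary -- low entropy, so the domain is enumerable.
published = H("73000")                       # what Alice posts publicly

# Eve simply hashes every plausible value until one matches.
recovered = next(s for s in (str(v) for v in range(0, 1_000_000, 1000))
                 if H(s) == published)
print(recovered)                             # '73000' -- no key needed
```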


Secure two-party computation - definition

[Figure: real world — Alice with input x and Bob with input y run a protocol, and each outputs F(x,y) and nothing else. Ideal world (“as if…”) — Alice and Bob send x and y to a trusted third party, which returns F(x,y) to each of them.]
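The ideal world is easy to write down in code, which is exactly why it makes a good definitional target: a real protocol is secure if it reveals no more than this hypothetical sketch does (the millionaires' comparison stands in for F):

```python
from typing import Callable

def ideal_world(F: Callable[[int, int], int], x: int, y: int):
    """Trusted third party: receives both inputs, returns only F(x, y) to each.
    A secure protocol must emulate this without any third party existing."""
    result = F(x, y)
    return result, result            # Alice's output, Bob's output

millionaires = lambda x, y: (x > y) - (x < y)   # -1, 0, or 1: who is richer

alice_out, bob_out = ideal_world(millionaires, 5_000_000, 7_200_000)
print(alice_out, bob_out)            # both learn -1 (Bob is richer), nothing else
# Alice may still deduce anything implied by x and F(x,y) -- e.g., y > 5_000_000 --
# but that leakage is inherent in the output, not a flaw of the protocol.
```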


Leak no other information

• A protocol is secure if it emulates the ideal solution.
• Alice learns F(x,y), and can therefore compute everything that is implied by x, her prior knowledge of y, and F(x,y).
• Alice should not be able to compute anything else.
• Simulation: a protocol is considered secure if:
  – for every adversary in the real world
  – there exists a simulator in the ideal world which outputs an indistinguishable “transcript”, given access only to the information that the adversary is allowed to learn.


More tomorrow…