
Page 1: Multiplicative Data Perturbation for Privacy Preserving ...kunliu1/research/kunliu_phd_defense.pdf · Multiplicative Data Perturbation for Privacy Preserving Data Mining ... databases

Multiplicative Data Perturbation for Privacy Preserving Data Mining

Kun Liu

Ph.D. Dissertation DefenseDept. of Computer Science and Electrical EngineeringUniversity of Maryland, Baltimore County (UMBC)

January 15th, 2007

Page 2:

Growing Privacy Concerns

“Detailed information on an individual’s credit, health, and financial status, on characteristic purchasing patterns, and on other personal preferences is routinely recorded and analyzed by a variety of governmental and commercial organizations.”

- M. J. Cronin, “e-Privacy?” Hoover Digest, 2000.

Page 3:

Privacy-Preserving Data Mining

“the best (and perhaps only) way to overcome the ‘limitations’ of data mining techniques is to do more research in data mining, including areas like data security and privacy-preserving data mining, which are actually active and growing research areas.”
- SIGKDD Executive Committee, “‘Data Mining’ Is NOT Against Civil Liberties,” 2003.

Privacy-preserving data mining is “the study of how to produce valid mining models and patterns without disclosing private information.”
- F. Giannotti and F. Bonchi, “Privacy Preserving Data Mining,” KDUbiq Summer School, 2006.

Page 4:

Privacy-Preserving Data Mining

Data Perturbation

Hiding private data while mining patterns

Secure Multi-Party Computation
Building a model over multi-party distributed databases without knowing the others’ inputs

Knowledge Hiding
Hiding sensitive rules/patterns

Privacy-aware Knowledge Sharing
Do the data mining results themselves violate privacy?

[more]

Page 5:

Data Perturbation

(Figure: the census model — the private database is perturbed before release, and the data miner/researcher builds models from the perturbed database.)

Page 6:

Additive Data Perturbation

R. Agrawal and R. Srikant, “Privacy-preserving data mining,” ACM SIGMOD, 2000.

(Figure: each record, e.g., Age 30 | Salary 70K, passes through a randomizer that adds a random number to each attribute, e.g., Age 30 becomes 65. The distributions of Age and Salary are then reconstructed from the perturbed data, e.g., 65 | 20K, and fed to data mining algorithms to build a model. The plot shows the original, perturbed, and reconstructed distributions of Age.)

Page 7:

Additive vs. Multiplicative Noise

Additive perturbation is not safe.

“in many cases, the original data can be accurately estimated from the perturbed data using a spectral filter that exploits some theoretical properties of random matrices”
- Kargupta et al., “On the Privacy Preserving Properties of Random Data Perturbation Techniques,” ICDM, 2003.

Related work: [Huang05], [Guo06], etc.

How about multiplicative noise?
It has not been carefully studied; it is the topic of this dissertation.

Page 8:

Primary Contributions

We examined the effectiveness of exact Euclidean distance-preserving data perturbation and developed three attack techniques.

K. Liu, C. Giannella, and H. Kargupta, “An attacker's view of distance preserving maps for privacy preserving data mining,” 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06), 2006.

We proposed a random projection-based approximate distance preserving perturbation as a possible remedy, and analyzed its privacy issues.

K. Liu, H. Kargupta, and J. Ryan, “Random projection-based multiplicative perturbation for privacy preserving distributed data mining,” IEEE Transactions on Knowledge and Data Engineering (TKDE), 18(1), 2006.

Page 9:

Roadmap

Traditional Multiplicative Noise

Distance Preserving Data Perturbation
Fundamental Properties
Known Input-Output Attack
Known Sample Attack
Independent Signals Attack

Random Projection-based Perturbation
Fundamental Properties
Bayes Privacy Model
Attacks Revisited

Conclusion and Future Work

Page 10:

Traditional Multiplicative Noise

y_ij = x_ij × r_ij, where x_ij is the private data and r_ij ~ N(1, σ²) [Kim03].

Private Database X:
ID    | 1001    | 1002
Wages | 98,563  | 83,821
Rent  | 1,889   | 1,324
Tax   | 2,899   | 2,578

Perturbed Database Y:
ID    | 1001    | 1002
Wages | 116,166 | 85,396
Rent  | 1,878   | 1,381
Tax   | 2,964   | 2,135

Example: 2,899 × 1.0224 = 2,964

Page 11:

Traditional Multiplicative Noise

Mechanism
Each data element/cell is randomized independently by multiplying it with a random number.

Pros
Summary statistics (e.g., mean, variance) can be estimated from the perturbed data.
Effective if the data disseminator only wants a minor perturbation.
Popular in the statistics community.

Cons
Equivalent to additive perturbation after a logarithmic transformation, hence vulnerable to attacks designed for additive noise.
Does not preserve Euclidean distance, so it is not suitable for many data mining tasks.
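The mechanism above can be sketched in a few lines; the sample values and the noise level σ below are illustrative assumptions, not the slide's numbers.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative private attribute values (stand-ins for a column such as Wages).
x = np.array([98_563.0, 83_821.0, 45_102.0, 120_450.0])

# Traditional multiplicative noise: y_ij = x_ij * r_ij with r_ij ~ N(1, sigma^2).
sigma = 0.05
r = rng.normal(loc=1.0, scale=sigma, size=x.shape)
y = x * r

# Since E[r] = 1, the sample mean of y estimates the mean of x.
print(y.mean(), x.mean())

# After a log transform the scheme is additive: log y = log x + log r,
# which is why attacks designed for additive noise carry over.
log_noise = np.log(y) - np.log(x)
```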

Page 12:

Roadmap

Traditional Multiplicative Noise

Distance Preserving Data Perturbation
Fundamental Properties
Known Input-Output Attack
Known Sample Attack
Independent Signals Attack

Random Projection-based Perturbation
Fundamental Properties
Bayes Privacy Model
Attacks Revisited

Conclusion and Future Work

Page 13:

Distance Preserving Perturbation

A map T: R^n -> R^n is distance preserving if, for all x, y in R^n, ||x - y|| = ||T(x) - T(y)||.

Any distance preserving map is equivalent to x -> Mx + v, for M in O_n and v in R^n, where O_n is the set of all n×n orthogonal matrices.

A distance preserving map with the origin fixed is x -> Mx with M in O_n, i.e., an orthogonal transformation. (Our focus.)

Page 14:

Distance Preserving Perturbation

Perturbation Model: Y = MX
X: original private data, with each column a record
Y: perturbed data
M: orthogonal perturbation matrix

Perturbed data produces exactly the same data mining results:
Clustering [Oliveira04], Classification [Chen05]
Other related work: [Mukherjee06], etc.

Private Database X:
ID    | 1001   | 1002
Wages | 98,563 | 83,821
Rent  | 1,889  | 1,324
Tax   | 2,899  | 2,578

M =
[ -0.2507   0.4556  -0.8542 ]
[ -0.9653  -0.0514   0.2559 ]
[  0.0726   0.8887   0.4527 ]

Perturbed Database Y = MX:
ID    | 1001    | 1002
Wages | -26,326 | -22,613
Rent  | -94,502 | -80,324
Tax   | 10,151  | 8,432
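A hedged sketch of the model Y = MX: draw a random orthogonal M (via QR decomposition of a Gaussian matrix, one common construction) and check that all pairwise record distances survive; the data here is synthetic, not the slide's table.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 3, 6                        # 3 attributes, 6 records (one per column)
X = 10 * rng.normal(size=(n, m))   # synthetic stand-in for the private data

# A random n x n orthogonal matrix via QR decomposition of a Gaussian matrix.
M, _ = np.linalg.qr(rng.normal(size=(n, n)))

Y = M @ X                          # perturbed data, Y = MX

def pairwise_distances(A):
    # Euclidean distances between all pairs of columns of A.
    diff = A[:, :, None] - A[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

# Distances are preserved exactly (up to floating-point roundoff).
print(np.abs(pairwise_distances(X) - pairwise_distances(Y)).max())
```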

Page 15:

Distance Preserving Perturbation

(Figure: the data matrix, with rows as attributes and columns as records.)

Page 16:

Is Dist. Preserving Perturbation Secure?

Known Input-Output Attack: the attacker knows some collection of linearly independent private data records and their corresponding perturbed versions.

Known Sample Attack: the attacker has a collection of independent samples from the same distribution the original data was drawn from.

Independent Signals Attack: all data attributes are non-Gaussian and statistically independent.

Page 17:

Privacy Breach

For any ε > 0, we say that an ε-privacy breach occurs if ||x_i - x̂_i|| ≤ ε ||x_i||, where x̂_i is the attacker's estimate of x_i, the i-th data tuple in X.

Probability of ε-privacy breach:
ρ(x_i, ε) = Prob( ||x_i - x̂_i|| ≤ ε ||x_i|| ),
the probability that an ε-privacy breach occurs.

Page 18:

Known Input-Output Attack

[ Y_{n×k}  Y_{n×(m-k)} ] = M_{n×n} [ X_{n×k}  X_{n×(m-k)} ],  with X_{n×k} and Y_{n×k} KNOWN.

Assumption (can be relaxed): rank(X_{n×k}) = k.

If k = n:
M = Y_{n×k} X_{n×k}^{-1},  X_{n×(m-k)} = M^T Y_{n×(m-k)}.

Probability of privacy breach: ρ(x_i, ε) = 1 for ε = 0 and any i.
The attacker has a perfect recovery of the private data.

If k < n, what is going to happen?
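The k = n case can be checked directly; the data values and matrix sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 10

X = rng.normal(size=(n, m))                  # private data, columns are records
M, _ = np.linalg.qr(rng.normal(size=(n, n))) # secret orthogonal matrix
Y = M @ X                                    # released perturbed data

# The attacker knows k = n linearly independent input-output pairs.
X_known, Y_known = X[:, :n], Y[:, :n]

M_hat = Y_known @ np.linalg.inv(X_known)     # M = Y_{n×k} X_{n×k}^{-1}
X_rest = M_hat.T @ Y[:, n:]                  # X_{n×(m-k)} = M^T Y_{n×(m-k)}

# Perfect recovery, up to floating-point roundoff.
print(np.abs(X_rest - X[:, n:]).max())
```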

Page 19:

Known Input-Output Attack

[ Y_{n×k}  Y_{n×(m-k)} ] = M_{n×n} [ X_{n×k}  X_{n×(m-k)} ],  with X_{n×k} and Y_{n×k} KNOWN.

If k < n, any matrix in the set
Ω = { M̂ ∈ O_n : M̂ X_{n×k} = Y_{n×k} }
can be the original perturbation matrix M_{n×n}, where O_n is the set of all n×n orthogonal matrices. [more]

The attacker chooses one M̂ uniformly from Ω as an estimate of M_{n×n}, uses it to recover the other private data, and computes the probability of privacy breach. [more]

Page 20:

Known Input-Output Attack
Probability of Privacy Breach

ρ(x_i, ε) = Prob( ||x̂_i - x_i|| ≤ ε ||x_i|| ) = Prob( ||M̂ x_i - M x_i|| ≤ ε ||x_i|| )

= (2/π) arcsin( ε ||x_i|| / (2 d(x_i, X_{n×k})) )   if ε ||x_i|| < 2 d(x_i, X_{n×k});
= 1   otherwise,

where d(x_i, X_{n×k}) is the distance of x_i from the column space of X_{n×k}, and M̂ is uniformly chosen from Ω = { M̂ ∈ O_n : M̂ X_{n×k} = Y_{n×k} }. [more]

Page 21:

Known Input-Output Attack Example

Private Data X (columns X1, X2, X3):
[ 25   30   45 ]
[ 75   90  105 ]

Perturbed Data Y (columns Y1, Y2, Y3):
[ -42.0198  -50.4237  -68.5443 ]
[  66.9652   80.3582   91.3875 ]

X1 -> Y1 is KNOWN; X2 and X3 are UNKNOWN.

The distance of X2 from the column space of X1 is 0; therefore ρ(x_2, ε) = 1 for any ε.
The distance of X3 from the column space of X1 is 9.4868; therefore
ρ(x_3, ε) = (2/π) arcsin( ε ||x_3|| / (2 × 9.4868) ),  e.g., ρ(x_3, 0.01) = 3.84%.
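The numbers on this slide can be reproduced directly from the breach-probability formula:

```python
import numpy as np

# Private columns from the slide.
x1 = np.array([25.0, 75.0])
x3 = np.array([45.0, 105.0])

# Distance of x3 from the column space (here, the line) spanned by x1.
proj = (x3 @ x1) / (x1 @ x1) * x1
d = np.linalg.norm(x3 - proj)
print(round(d, 4))          # 9.4868

# Probability of an ε-privacy breach for x3, per the slide's formula.
eps = 0.01
rho = (2 / np.pi) * np.arcsin(eps * np.linalg.norm(x3) / (2 * d))
print(round(100 * rho, 2))  # 3.84 (percent)
```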

Page 22:

Known Sample Attack

Assumptions
Each data record arose as an independent sample from some unknown distribution.
The attacker has a collection of samples independently chosen from the same distribution.
The covariance of the distribution has all distinct eigenvalues (which holds true in most practical situations [Jolliffe02]).

Attack Technique
Exploit the relationship between the principal eigenvectors of the original data and the principal eigenvectors of the perturbed data.
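The eigenvector-matching idea can be sketched as follows. This is a minimal illustration, not the dissertation's exact procedure: it assumes a diagonal covariance with distinct, well-separated eigenvalues and resolves the per-eigenvector sign ambiguity by matching sample means, which is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 3, 10_000

# Synthetic private data: Gaussian with nonzero mean and a covariance whose
# eigenvalues are distinct and well separated (an assumption of this sketch).
mean = np.array([5.0, -2.0, 9.0])
cov = np.diag([1.0, 9.0, 25.0])
X = rng.multivariate_normal(mean, cov, size=m).T       # columns are records

M, _ = np.linalg.qr(rng.normal(size=(n, n)))           # secret orthogonal matrix
Y = M @ X                                              # perturbed data

# The attacker's independent 2% sample from the same distribution.
S = rng.multivariate_normal(mean, cov, size=m // 50).T

def eigvecs(A):
    _, V = np.linalg.eigh(np.cov(A))   # columns ordered by ascending eigenvalue
    return V

V_s, V_y = eigvecs(S), eigvecs(Y)

# M maps eigenvectors of cov(X) to eigenvectors of cov(Y), up to a per-vector
# sign. Resolve the 2^n sign ambiguity by matching sample means.
best, best_err = None, np.inf
for signs in np.ndindex(*(2,) * n):
    D = np.diag([1.0 if s == 0 else -1.0 for s in signs])
    M_hat = V_y @ D @ V_s.T
    err = np.linalg.norm(M_hat.T @ Y.mean(axis=1) - S.mean(axis=1))
    if err < best_err:
        best, best_err = M_hat, err

X_hat = best.T @ Y                     # attacker's recovery of the private data
rel_err = np.linalg.norm(X_hat - X) / np.linalg.norm(X)
print(rel_err)                         # small relative recovery error
```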

Page 23:

Known Sample Attack

Fig. Relationship between original and perturbed principal eigenvectors. [more]

Page 24:

Known Sample Attack Experiments

Fig. Known sample attack on 3D Gaussian data with 10,000 private tuples. The attacker has a 2% sample from the same distribution. The average relative error of the recovered data is 0.0265 (2.65%). [more]

Page 25:

Known Sample Attack Experiments

Fig. Probability of privacy breach w.r.t. attacker’s sample size. The relative error bound ε is fixed to be 0.02. (3D Gaussian data with 10,000 private tuples.)

Fig. Probability of privacy breach w.r.t. the relative error bound ε. The sample ratio is fixed to be 2%. (3D Gaussian data with 10,000 private tuples.)

[more]

Page 26:

Independent Signals Attack

Basic Independent Component Analysis (ICA) Model

Y_{k×m} = M_{k×n} X_{n×m},  with Y KNOWN and M, X UNKNOWN,

where row i of X is one source signal observed at m time points, ( x_i(t_1), x_i(t_2), ..., x_i(t_m) ).

Objective: recover the original signals X from only the observed mixtures Y.

Requirements
Source signals are statistically independent.
All signals must be non-Gaussian, with the exception of at most one.
k ≥ n.
Matrix M must be of full column rank. [more]
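A minimal ICA-based attack can be sketched with a from-scratch deflationary FastICA using the kurtosis (cubic) nonlinearity; the sources, mixing matrix, and sample sizes below are illustrative assumptions, not the dissertation's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 5000

# Two independent, non-Gaussian sources (uniform and Laplace) -- illustrative.
X = np.vstack([rng.uniform(-1, 1, m), rng.laplace(0, 1, m)])

M = rng.normal(size=(n, n))        # full-rank mixing (perturbation) matrix
Y = M @ X                          # observed mixtures: all the attacker sees

# Whiten the mixtures.
Yc = Y - Y.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(Yc))
Z = np.diag(d ** -0.5) @ E.T @ Yc

# Deflationary FastICA with the kurtosis (cubic) nonlinearity.
W = np.zeros((n, n))
for i in range(n):
    w = rng.normal(size=n)
    w /= np.linalg.norm(w)
    for _ in range(300):
        w_new = (Z * (w @ Z) ** 3).mean(axis=1) - 3 * w
        w_new -= W[:i].T @ (W[:i] @ w_new)  # decorrelate from found components
        w_new /= np.linalg.norm(w_new)
        done = abs(abs(w_new @ w) - 1) < 1e-10
        w = w_new
        if done:
            break
    W[i] = w

S = W @ Z   # recovered sources, up to permutation, sign, and scale

# Each original source is strongly correlated with some recovered component.
C = np.abs(np.corrcoef(np.vstack([X, S]))[:n, n:])
print(C.max(axis=1))
```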

Page 27:

Independent Signals Attack Experiments

(Figure: original, perturbed, and recovered signals.)

Page 28:

Page 29:

Distance Preserving Perturbation Summary

Mechanism
The whole data set is perturbed by multiplication with an orthogonal matrix.

Pros
Perturbed data preserves Euclidean distance. Many data mining algorithms can be applied to the perturbed data and produce exactly the same results as if applied to the original data.

Cons
Vulnerable to the known input-output attack.
Vulnerable to the known sample attack.
Vulnerable to the independent signals attack.

Page 30:

Roadmap

Traditional Multiplicative Noise

Distance Preserving Data Perturbation
Fundamental Properties
Known Input-Output Attack
Known Sample Attack
Independent Signals Attack

Random Projection-based Perturbation
Fundamental Properties
Bayes Privacy Model
Attacks Revisited

Conclusion and Future Work

Page 31:

Random Projection

Basic Model

u_{k×1} = (1/(√k σ_r)) R_{k×m} x_{m×1}  and  v_{k×1} = (1/(√k σ_r)) R_{k×m} y_{m×1},

where k < m and each entry r_ij of R is i.i.d. N(0, σ_r²).

x, y: Original Data    u, v: Perturbed Data

Page 32:

Random Projection

Preserving Inner Product

E[ u^T v - x^T y ] = 0  and  Var( u^T v - x^T y ) = (1/k) ( Σ_i x_i² Σ_i y_i² + (Σ_i x_i y_i)² ).

The distortion produced by random projection is zero on average, and its variance is inversely proportional to k, the dimension of the new space. [more]

Preserving Euclidean Distance

Pr( (1-η) ||x - y||² ≤ ||u - v||² ≤ (1+η) ||x - y||² ) = ∫_{(1-η)k}^{(1+η)k} f(t; k) dt,  η > 0,

where f(t; k) is the p.d.f. of the chi-square distribution with k degrees of freedom. The probability that the relative error is bounded within (1±η) increases with k. [more]
Page 33:

Bayes Privacy Model

Primitives
Let x be the private data and y the perturbed data.
Attacker's Prior Belief: f(x)
Attacker's Additional Background Knowledge: θ
Attacker's Posterior Belief: f(x | y, θ)

Information Non-Disclosure Principle
The perturbed data should provide the attacker with little additional information beyond the attacker's prior belief and other background knowledge.

Example
A (ρ1, ρ2)-privacy breach [Evfimevski03] happens when f(x) < ρ1 and f(x | y, θ) > ρ2, OR f(x) > 1 - ρ1 and f(x | y, θ) < 1 - ρ2.

Page 34:

Maximum a Posteriori Probability (MAP) Estimate

(ρ1, ρ2)-privacy works only for discrete data. It assumes statistically independent inputs and outputs, and requires the transition probability to be explicitly defined. It is therefore not appropriate for multiplicative perturbation.

We propose a maximum a posteriori probability (MAP) estimate-based approach:

1. x̂_MAP = argmax_x f(x | y, θ)

2. x̂_MAP is compared with x_i to see whether any extra information is disclosed, e.g., whether ||x̂_MAP - x_i|| ≤ ε ||x_i||.

Page 35:

Why Maximum a Posteriori Probability (MAP) Estimate

It is closely related to maximum a posteriori probability hypothesis testing. [more]

It considers both the prior and posterior belief. In the absence of a priori knowledge, the MAP estimate reduces to the maximum likelihood estimate (MLE).

It often produces estimates whose errors are not much higher than the minimum mean squared error.

It is relatively easy to derive the conditional p.d.f. in the multiplicative data perturbation scenario.

Page 36:

Maximum a Posteriori Probability (MAP) Estimate

Assumption I: the attacker's best knowledge of f(x) is that it is uniformly distributed over a multi-dimensional interval.
Assumption II: the attacker has no other background knowledge, i.e., θ = ∅.

MAP Estimate:
x̂_MAP = argmax_x f(x | y, θ) = argmax_x ( k / (2π x^T x) )^{k/2} exp( - k y^T y / (2 x^T x) ),
where x ∈ R^n and y ∈ R^k.

Solution: any x̂ in the interval that satisfies x̂^T x̂ = y^T y.

Conclusion: MAP does not offer the attacker more information than what has been implied by the properties of random projection itself.

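The stationary point of the density above can be checked numerically: maximizing over s = xᵀx, the maximizer sits at s = yᵀy. The constants k and c below are arbitrary illustrative values.

```python
import numpy as np

k = 10
c = 4.2   # y'y, an arbitrary positive value

# Log of the conditional density on the slide, as a function of s = x'x > 0:
#   f(y|x) = (k / (2*pi*s))^(k/2) * exp(-k * y'y / (2*s))
def log_density(s):
    return 0.5 * k * np.log(k / (2 * np.pi * s)) - k * c / (2 * s)

s_grid = np.linspace(0.1, 20.0, 100_000)
s_star = s_grid[np.argmax(log_density(s_grid))]
print(s_star)   # maximized at s = c, i.e., x'x = y'y
```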
Page 37:

Privacy/Accuracy Control

Random Projection
u_{k×1} = (1/(√k σ_r)) R_{k×m} x_{m×1}  and  v_{k×1} = (1/(√k σ_r)) R_{k×m} y_{m×1},
where k < m and each entry r_ij of R is i.i.d. N(0, σ_r²).

Privacy (¬ ε-privacy breach):
Pr( ||x̂_MAP - x|| > ε ||x|| ) = ∫_{-∞}^{(1-ε)²k} f(t; k) dt + ∫_{(1+ε)²k}^{+∞} f(t; k) dt.

Accuracy:
Pr( (1-η) ||x - y||² ≤ ||u - v||² ≤ (1+η) ||x - y||² ) = ∫_{(1-η)k}^{(1+η)k} f(t; k) dt,  η > 0.

Here f(t; k) is the p.d.f. of the chi-square distribution with k degrees of freedom.
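The two integrals trade off against each other as k varies; a Monte Carlo sketch with chi-square samples (the values of ε, η, and k below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)
eps = eta = 0.2
N = 200_000

def no_breach_and_accuracy(k):
    # chi-square(k) samples stand in for the f(t; k) integrals on the slide.
    t = rng.chisquare(k, N)
    no_breach = np.mean((t < (1 - eps) ** 2 * k) | (t > (1 + eps) ** 2 * k))
    accuracy = np.mean(((1 - eta) * k <= t) & (t <= (1 + eta) * k))
    return no_breach, accuracy

# Larger k: better distance preservation (accuracy) but higher breach risk.
for k in (5, 20, 80):
    print(k, no_breach_and_accuracy(k))
```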

Page 38:

Privacy/Accuracy Control

Page 39:

Roadmap

Traditional Multiplicative Noise

Distance Preserving Data Perturbation
Fundamental Properties
Known Input-Output Attack
Known Sample Attack
Independent Signals Attack

Random Projection-based Perturbation
Fundamental Properties
Bayes Privacy Model
Attacks Revisited

Conclusion and Future Work

Page 40:

Independent Signals Attack

Y_{k×m} = (1/(√k σ_r)) R_{k×n} X_{n×m}

When k < n, at most (k-1) source signals can be separated out [Cao96].

With probability one, linear ICA cannot separate out any of the original signals for any k×n (k ≤ n/2, n ≥ 2) random matrix with i.i.d. entries chosen from a continuous distribution [Liu06a].

Page 41:

Independent Signals Attack Experiments

(Figure: original, perturbed, and recovered signals.)

Page 42:

Page 43:

Known Sample Attack

Fig. Relationship between original and perturbed principal eigenvectors.

Page 44:

Known Input-Output Attack

[ Y_{k×p}  Y_{k×(m-p)} ] = (1/(√k σ_r)) R_{k×n} [ X_{n×p}  X_{n×(m-p)} ],  with X_{n×p} and Y_{k×p} KNOWN.

If p = n and rank(X_{n×p}) = p, R can be recovered, but the system of linear equations for the remaining private data is still under-determined.

The MAP estimate shows that the relative error decreases as the number of known input-output pairs increases, and increases as k decreases. [more]

Page 45:

Random Projection-based Perturbation Summary

Mechanism
Data is projected to a lower-dimensional random space.

Pros
From the perspective of the MAP estimate, random projection does not disclose more information than what has been implied by its distance preservation properties. It offers better privacy protection than orthogonal transformation-based distance preserving perturbation.

Cons
Perturbed data only approximately preserves Euclidean distance, so there is some loss in accuracy.

Page 46:

Conclusions

Traditional Multiplicative Data Perturbation

Distance Preserving Data Perturbation
Known Input-Output Attack (linear algebra, statistics)
Known Sample Attack (PCA)
Independent Signals Attack (ICA)

Random Projection-based Data Perturbation
Accuracy Analysis
Bayes Privacy Model
Maximum a Posteriori Probability (MAP) Estimate
Privacy/Accuracy Control
Attack Analysis

Privacy Issues Are Intrinsically Complex
They need joint efforts from researchers, engineers, sociologists, legal experts, policy makers…

Page 47:

Future Work

A game theoretic framework for large scale distributed privacy preserving data mining
Distributed and ubiquitous computing is becoming popular.
Some participants are cooperative and honest, some malicious.
Computation in such an environment is more like a game.
It is necessary to develop a game theoretic framework.

Combination of cryptographic techniques and perturbation techniques
Cryptographic techniques offer a strong privacy guarantee, but with high communication and computation cost.
Perturbation provides statistically weaker privacy protection, but is more efficient.
It would be ideal to combine them to achieve both efficiency and privacy.

Page 48:

References

[Kargupta03] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, “On the privacy preserving properties of random data perturbation techniques,” in Proceedings of the IEEE International Conference on Data Mining (ICDM’03), Melbourne, FL, November 2003.

[Huang05] Z. Huang, W. Du, and B. Chen, “Deriving private information from randomized data,” in Proceedings of the 2005 ACM SIGMOD Conference (SIGMOD’05), Baltimore, MD, June 2005, pp. 37-48.

[Guo06] S. Guo, X. Wu, and Y. Li, “On the lower bound of reconstruction error for spectral filtering based privacy preserving data mining,” in Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’06), Berlin, Germany, 2006.

[Kim03] J. J. Kim and W. E. Winkler, “Multiplicative noise for masking continuous data,” Statistical Research Division, U. S. Bureau of the Census, Washington D.C., Tech. Rep. Statistics #2004-01, April 2003.

[Liu06a] K. Liu, H. Kargupta, and J. Ryan, “Random projection-based multiplicative data perturbation for privacy preserving distributed data mining,” IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 18, no. 1, pp. 92–106, January 2006.

[Liu06b] K. Liu, C. Giannella, and H. Kargupta, “An attacker's view of distance preserving maps for privacy preserving data mining,” in Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06), Berlin, Germany, pp. 297-308, 2006.

[Mukherjee06] S. Mukherjee, Z. Chen, and A. Gangopadhyay, “A privacy preserving technique for Euclidean distance-based mining algorithms using Fourier-related transforms,” The VLDB Journal, to appear, 2006.

[Chen05] K. Chen and L. Liu, “Privacy preserving data classification with rotation perturbation,” in Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05), Houston, TX, pp. 589–592, November 2005.

Page 49:

References

[Oliveira04] S. R. M. Oliveira and O. R. Zaïane, “Privacy preservation when sharing data for clustering,” in Proceedings of the International Workshop on Secure Data Management in a Connected World, Toronto, Canada, pp. 67–82, August 2004.

[Jolliffe02] I. T. Jolliffe, Principal Component Analysis, 2nd ed., ser. Springer Series in Statistics. Springer, 2002.

[Evfimevski03] A. Evfimevski, J. Gehrke, and R. Srikant, “Limiting privacy breaches in privacy preserving data mining,” in Proceedings of the ACM SIGMOD Conference (SIGMOD’03), San Diego, CA, June 2003.

[Cao96] X.-R. Cao and R.-W. Liu, “A General Approach to Blind Source Separation,” IEEE Transactions on Signal Processing, vol. 44, pp. 562–571, 1996.

Page 50:

I Have a Dream“I have a dream that one day this nation will rise up and live out the true meaning of its creed: 'We hold these truths to be self-evident, that all men are created equal.’"

- Martin Luther King, Jr., 1963.

“I have a dream that one day I will get a Ph.D. degree.”

- Kun Liu, when he was a kid.

Page 51:

Thank you and Questions

Page 52:

Backup Slides

Page 53:

Overview of PPDM (backup)

Data Hiding
  Data Perturbation
    Value Distortion: Additive Perturbation, Multiplicative Perturbation, Data Microaggregation, Data Anonymization, Data Swapping, Other Techniques
    Probability Distribution: Sampling Method, Analytical Method
  Secure Multi-Party Computation / Distributed Data Mining

Rule Hiding
  Association Rule Hiding
  Classification Rule Hiding

Page 54:

Traditional Multiplicative Noise (backup)

y_ij = x_ij × r_ij, where x_ij is the private data and r_ij ~ N(1, σ²) [Kim03].

Properties:
Each data element is randomized independently.
The original mean and variance can be estimated from the perturbed data.
Equivalent to additive perturbation after a logarithmic transformation.
Does not preserve Euclidean distance.

Private Database X:
ID    | 1001    | 1002
Wages | 98,563  | 83,821
Rent  | 1,889   | 1,324
Tax   | 2,899   | 2,578

Perturbed Database Y:
ID    | 1001    | 1002
Wages | 116,166 | 85,396
Rent  | 1,878   | 1,381
Tax   | 2,964   | 2,135

Example: 2,899 × 1.0224 = 2,964

Page 55:

Knowledge Hiding (backup)

What is disclosed?
The data (modified somehow).

What is hidden?
Some “sensitive” knowledge (i.e., secret rules/patterns).

How?
Usually by means of data sanitization: the data to be disclosed is modified so that the sensitive knowledge can no longer be inferred, while the original database is modified as little as possible.


56

Privacy-aware Knowledge Sharing backup

What is disclosed? The intentional knowledge (i.e., rules, patterns, models).

What is hidden? The source data.

The central question: do the data mining results themselves violate privacy?


57

Privacy-aware Knowledge Sharing backup

Published rules:

Age = 27, Zip = 15254, Christian -> American (sup_count = 758, confidence = 99.8%)
Age = 27, Zip = 15254 -> American (sup_count = 1518, confidence = 99.9%)

What the attacker infers:

sup_count(27, 15254, Christian) = 758 / 0.998 ≈ 759.5
sup_count(27, 15254, Christian, not American) ≈ 759.5 × 0.002 ≈ 1.519

sup_count(27, 15254) = 1518 / 0.999 ≈ 1519.5
sup_count(27, 15254, not American) ≈ 1519.5 × 0.001 ≈ 1.5195

Hidden rule: Age = 27, Zip = 15254, not American -> Christian (sup_count ≈ 1.5, confidence ≈ 100.0%)

This information refers to my French neighbor… he is Christian!
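The arithmetic behind this inference can be replayed directly from the two published rules (a sketch; only the support counts and confidences shown above are used):

```python
# Published rules (from the slide above):
#   R1: Age=27, Zip=15254, Christian -> American  (sup_count = 758,  conf = 99.8%)
#   R2: Age=27, Zip=15254 -> American             (sup_count = 1518, conf = 99.9%)
sup_r1, conf_r1 = 758, 0.998
sup_r2, conf_r2 = 1518, 0.999

# Support of each rule's antecedent:
body_r1 = sup_r1 / conf_r1   # people aged 27 in 15254 who are Christian (~759.5)
body_r2 = sup_r2 / conf_r2   # people aged 27 in 15254                   (~1519.5)

# Counter-supports: members of each group who are NOT American.
non_american_christian = body_r1 - sup_r1   # ~1.519
non_american_total = body_r2 - sup_r2       # ~1.5195

# Hidden rule: Age=27, Zip=15254, not American -> Christian
conf_hidden = non_american_christian / non_american_total   # ~100%
```

Since the inferred group contains only one or two individuals and the confidence is essentially 100%, publishing the two "safe" rules discloses an individual's religion.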


58

Known Input-Output Attack backup

Closed-form expression of the set of candidate perturbation matrices Ω̂.

[Y_{n×k} Y_{n×(m−k)}] = M_{n×n} [X_{n×k} X_{n×(m−k)}], where X_{n×k} and Y_{n×k} are KNOWN.

Ω̂ = { M̂ ∈ O_{n×n} : M̂ X_{n×k} = Y_{n×k} }
   = { M (U_k U_k^T + U_{n−k} P U_{n−k}^T) : ∀ P ∈ O_{(n−k)×(n−k)} },

where U_k is the orthonormal basis for the column space of X_{n×k}, and U_{n−k} is the orthonormal basis for the orthogonal complement of the column space of X_{n×k}.
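The characterization can be sanity-checked numerically: every member of Ω̂ is orthogonal and maps the known inputs to the known outputs. A small sketch (n = 5 and k = 2 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2

# Secret orthogonal perturbation M and k known input-output pairs.
M, _ = np.linalg.qr(rng.standard_normal((n, n)))
X_known = rng.standard_normal((n, k))
Y_known = M @ X_known

# U_k: orthonormal basis for the column space of X_known;
# U_perp: orthonormal basis for its orthogonal complement.
U, _, _ = np.linalg.svd(X_known, full_matrices=True)
U_k, U_perp = U[:, :k], U[:, k:]

# Any orthogonal P over the complement gives a consistent candidate.
# (M itself appears below only to verify the characterization; the attacker
# knows M's action on the column space of X_known through the known pairs.)
P, _ = np.linalg.qr(rng.standard_normal((n - k, n - k)))
M_hat = M @ (U_k @ U_k.T + U_perp @ P @ U_perp.T)
```

Because U_perp^T X_known = 0, every such M_hat satisfies M_hat X_known = Y_known, which is why the known pairs pin M down only on their column space.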


59

Known Input-Output Attack backup

Special case in 2-D space (k = 1 and n = 2): the attacker cannot distinguish a rotation from a reflection.


60

Known Input-Output Attack backup

Properties of the probability of privacy breach:

- The attacker can compute the probability of privacy breach for a given private record x_i and a relative error bound ε.
- The larger the ε, the higher the probability of privacy breach.
- The closer the private record is to the column space of the known records, the higher the probability of privacy breach.
- The distance d(x_i, X_{n×k}) can be computed from the perturbed data.


61

Known Sample Attack backup

The principal eigenvectors of the original data undergo the same distance-preserving perturbation as the data itself.

Let Y = MX; then Z_Y = M Z_X D, where Z_Y is the eigenvector matrix of the covariance of Y, Z_X is the eigenvector matrix of the covariance of X, and D is a diagonal matrix with each entry on the diagonal ±1.

- Z_Y can be computed from the perturbed data; Z_X can be estimated from the sample data. (See the dissertation for the choice of D; details omitted.)
- The attacker uses Z_X, Z_Y and D to recover M, and therefore X.
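A sketch of this eigenvector-matching attack in 2-D (the sample sizes, the covariance, and the brute-force search over the sign matrix D are illustrative choices; the dissertation selects D analytically):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)
n, m = 2, 5000

# Private data with well-separated covariance eigenvalues, perturbed by a
# secret orthogonal (distance-preserving) matrix M.
X = np.diag([3.0, 1.0]) @ rng.standard_normal((n, m))
M, _ = np.linalg.qr(rng.standard_normal((n, n)))
Y = M @ X

# The attacker's independent sample from the same distribution.
X_sample = np.diag([3.0, 1.0]) @ rng.standard_normal((n, 2000))

def sorted_eigvecs(data):
    # Eigenvector matrix of the covariance, columns ordered by eigenvalue.
    vals, vecs = np.linalg.eigh(np.cov(data))
    return vecs[:, np.argsort(vals)[::-1]]

Z_Y = sorted_eigvecs(Y)          # computable from the perturbed data
Z_X = sorted_eigvecs(X_sample)   # estimated from the attacker's sample

# Z_Y = M Z_X D with D = diag(+-1); here we brute-force the sign choices and
# score against the truth only to verify that some D recovers X well.
candidates = (Z_Y @ np.diag(d) @ Z_X.T for d in product([1, -1], repeat=n))
best_err = min(np.linalg.norm(M_hat @ X - Y) / np.linalg.norm(Y)
               for M_hat in candidates)
```

The recovery degrades as the covariance eigenvalues get closer together, matching the breach properties discussed later in these backup slides.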


62

Known Sample Attack Experiments backup

Fig. Known sample attack for Adult data with 32,561 private tuples. The attacker has 2% samples from the same distribution. The average relative error of the recovered data is 0.1081 (10.81%).


63

Known Sample Attack Experiments backup

Fig. Probability of privacy breach w.r.t. the attacker's sample size, with the relative error bound ε ranging from 0.10 to 0.20. (Adult data with 32,561 private tuples.)

Fig. Probability of privacy breach w.r.t. the relative error bound ε, with the sample ratio fixed at 2% and 10%. (Adult data with 32,561 private tuples.)


64

Effectiveness of Known Sample Attack backup

Covariance estimation quality:
- A larger sample size gives the attacker better recovery.
- A robust covariance estimator helps to downweight the influence of outliers.

p.d.f. of the data:
- The greater the difference between any pair of eigenvalues of the covariance, the higher the probability of privacy breach.

More details can be found in the dissertation.


65

Independent Component Analysis

Basic Model: Y_{k×m} = M_{k×n} X_{n×m}, where the observed signals Y are unknown linear mixtures of n independent source signals, i.e., y_i(t) = m_i1 x_1(t) + m_i2 x_2(t) + … + m_in x_n(t).

ICA Estimation: find a matrix W such that WY = WMX = X.

Nongaussian is Independent:
- Central limit theorem: a sum of random variables has a distribution closer to Gaussian than any of the original random variables.
- ICA therefore looks for a W that maximizes the nongaussianity of WY.

Measures of Nongaussianity:
- Kurtosis: kurt(x) = E[x⁴] − 3E²[x²]
- Negentropy: J(x) = H(x_gaussian) − H(x)
- Mutual information: I(x_1, x_2, …, x_n) = Σ_{i=1}^{n} H(x_i) − H(x)
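A toy 2-D demonstration of kurtosis-based ICA (the uniform/Laplace sources, the mixing matrix, and the grid search over rotation angles are illustrative assumptions; practical implementations use FastICA-style fixed-point iterations):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 20_000

# Two independent non-Gaussian sources (sub- and super-Gaussian).
X = np.vstack([rng.uniform(-1, 1, m), rng.laplace(size=m)])
M = np.array([[0.8, 0.3], [-0.4, 0.9]])
Y = M @ X                                  # observed mixtures

# Whiten Y so that the remaining unmixing step is a pure rotation.
Yc = Y - Y.mean(axis=1, keepdims=True)
vals, vecs = np.linalg.eigh(np.cov(Yc))
Z = np.diag(vals ** -0.5) @ vecs.T @ Yc

def kurt(s):
    # Sample kurtosis: E[s^4] - 3 E^2[s^2].
    return np.mean(s ** 4) - 3 * np.mean(s ** 2) ** 2

# Search for the rotation W maximizing total nongaussianity of WZ.
best_W, best_score = None, -1.0
for theta in np.linspace(0, np.pi / 2, 500):
    c, s = np.cos(theta), np.sin(theta)
    W = np.array([[c, s], [-s, c]])
    score = sum(abs(kurt(row)) for row in W @ Z)
    if score > best_score:
        best_W, best_score = W, score

S_hat = best_W @ Z   # recovered sources, up to order, sign, and scale
```

The order/sign/scale ambiguity visible here is exactly why ICA-based attacks on multiplicative perturbation recover sources only up to these indeterminacies.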


66

Random Projection backup

Relative Errors in Computing the Inner Product of the Two Attributes:

k            Mean(%)  Var(%)  Min(%)  Max(%)
100 (1%)     9.91     0.41    0.07    23.47
500 (5%)     5.84     0.25    0.12    18.41
1000 (10%)   2.94     0.05    0.03    7.53
2000 (20%)   2.69     0.04    0.01    7.00
3000 (30%)   1.81     0.03    0.27    6.32

Relative Errors in Computing the Euclidean Distance of the Two Attributes:

k            Mean(%)  Var(%)  Min(%)  Max(%)
100 (1%)     10.44    0.67    1.51    32.58
500 (5%)     4.97     0.29    0.23    18.32
1000 (10%)   2.70     0.05    0.11    7.21
2000 (20%)   2.59     0.03    0.31    6.90
3000 (30%)   1.80     0.01    0.61    3.91

Adult data from the UCI Repository; the first 10,000 elements of attributes fnlwgt and education-num.
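The same preservation effect is easy to reproduce with synthetic vectors (a sketch; the dimensions and projection sizes are arbitrary, not the Adult-data setup above):

```python
import numpy as np

rng = np.random.default_rng(11)
d = 10_000                      # original dimensionality (arbitrary)
x = rng.standard_normal(d)
y = rng.standard_normal(d)

errors = {}
for k in (100, 500, 2000):
    # Projection matrix with i.i.d. N(0, 1) entries; the 1/sqrt(k) scaling
    # preserves Euclidean distances in expectation.
    R = rng.standard_normal((k, d))
    u, v = R @ x / np.sqrt(k), R @ y / np.sqrt(k)
    errors[k] = (abs(np.linalg.norm(u - v) - np.linalg.norm(x - y))
                 / np.linalg.norm(x - y))
# errors[k] shrinks roughly like 1/sqrt(k), mirroring the table's trend.
```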


67

Random Projection backup

Fig. The probability of the accuracy of random projection w.r.t. k and ε (curves for ε = 0.05 and ε = 0.10; k from 100 to 1000). Each entry of the random matrix is i.i.d., chosen from a Gaussian distribution with mean zero and constant variance.

The probability that the relative error is bounded within (1 ± η) increases with k:

Pr( (1−η)‖x−y‖² ≤ ‖u−v‖² ≤ (1+η)‖x−y‖² ) = ∫_{(1−η)k}^{(1+η)k} f(t; k) dt,  η > 0,

where f(t; k) is the p.d.f. of the chi-square distribution with k degrees of freedom.


68

Maximum a Posteriori (MAP) Test backup

Given a binary hypothesis testing experiment with outcome s, the following rule leads to the lowest possible value of P_ERROR.

The test design divides S into two sets, A_0 and A_1 = A_0^c. If the outcome s is in A_0, the conclusion is to accept H_0; otherwise, accept H_1.

MAP rule: s ∈ A_0 if Prob[H_0 | s] ≥ Prob[H_1 | s]; otherwise s ∈ A_1.

Here P_ERROR = Prob[A_1 | H_0] Prob[H_0] + Prob[A_0 | H_1] Prob[H_1].
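On a small discrete outcome space, the optimality of this rule can be verified exhaustively (a sketch; the likelihoods and priors below are made-up numbers):

```python
from itertools import product

# Outcome space S = {0,...,5} with two hypotheses and their priors.
p_H0, p_H1 = 0.6, 0.4
lik_H0 = [0.30, 0.25, 0.20, 0.15, 0.07, 0.03]   # Prob[s | H0]
lik_H1 = [0.03, 0.07, 0.15, 0.20, 0.25, 0.30]   # Prob[s | H1]

def error_prob(A0):
    # P_ERROR = Prob[A1 | H0] Prob[H0] + Prob[A0 | H1] Prob[H1]
    p_a1_h0 = sum(lik_H0[s] for s in range(6) if s not in A0)
    p_a0_h1 = sum(lik_H1[s] for s in A0)
    return p_a1_h0 * p_H0 + p_a0_h1 * p_H1

# MAP rule: s in A0 iff Prob[H0 | s] >= Prob[H1 | s], which by Bayes' rule
# is equivalent to lik_H0[s] * p_H0 >= lik_H1[s] * p_H1.
A0_map = {s for s in range(6) if lik_H0[s] * p_H0 >= lik_H1[s] * p_H1}

# Exhaustive check: no other partition of S achieves a lower error.
best = min(error_prob({s for s, b in enumerate(bits) if b})
           for bits in product([0, 1], repeat=6))
```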


69

MAP Known Input-Output Attack backup

Assumption I: the attacker's best knowledge of f(X) is that it is uniform.
Assumption II: the attacker has no other background knowledge, i.e., θ = ∅.

Perturbation model, with X_{n×k} KNOWN:

[Y_{p×k} Y_{p×(m−k)}] = (1/(√k σ_r)) R_{p×n} [X_{n×k} X_{n×(m−k)}]

MAP estimate of a private record x:

x̂_MAP = arg max_x f(x | y, X_{n×k}, Y_{p×k})
       = arg max_x (2π)^{−(k+1)p/2} det^{−p/2}( (1/k) X_p^T X_p ) etr( −(k/2) Y_p (X_p^T X_p)^{−1} Y_p^T ),

where r_ij ∼ N(0,1), X_p = [X_{n×k} x], Y_p = [Y_{p×k} y], and X_p has full column rank.