Attacks on Randomization based Privacy Preserving Data Mining
Xintao Wu
University of North Carolina at Charlotte, Sept 20, 2010
2
Scope
3
Outline
Part I: Attacks on Randomized Numerical Data
  – Additive noise
  – Projection
Part II: Attacks on Randomized Categorical Data
  – Randomized Response
4
Additive Noise Randomization Example
ID   Bal   Income   …   IntP
1    10k   85k      …   2k
2    15k   70k      …   18k
3    50k   120k     …   35k
4    45k   23k      …   134k
…    …     …        …   …
N    80k   110k     …   15k
Y = X + E (Perturbed = Original + Noise), rows = records, columns = (Bal, Income, IntP):

X =
  10   85    2
  15   70   18
  50  120   35
  45   23  134
  80  110   15

E =
  7.334  3.759  0.099
  4.199  7.537  7.939
  9.199  8.447  3.678
  6.208  7.313  1.939
  9.048  5.692  6.318

Y = X + E =
   17.334   88.759    2.099
   19.199   77.537   25.939
   59.199  128.447   38.678
   51.208   30.313  135.939
   89.048  115.692   21.318
5
Individual Value Reconstruction (Additive Noise)
• Methods:
  – Spectral Filtering, Kargupta et al., ICDM 2003
  – PCA, Huang, Du, and Chen, SIGMOD 2005
  – SVD, Guo, Wu and Li, PKDD 2006
• All aim to remove the noise by projecting onto a lower-dimensional space.
6
Individual Reconstruction Algorithm
Apply EVD : Using some published information about V, extract the first k
components of as the principal components. λ1≥ λ2··· ≥ λk ≥ λe and e1, e2, · · · ,ek are the corresponding
eigenvectors. Qk = [e1 e2 · · · ek] forms an orthonormal basis of a subspace X.
Find the orthogonal projection on to X : Get estimate data set: PUU pˆ
TUp QQ
Up
TkkQQP
Up = U + VNoisePerturbed Original
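The projection step above can be sketched in NumPy. This is an illustrative reconstruction, not the authors' code: the low-rank original data, the noise level, and k = 2 are all assumed for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated data standing in for the original U (n records x d
# attributes): two latent factors drive five attributes, so U has an
# (approximately) rank-2 structure that spectral filtering can exploit.
n, d, k = 2000, 5, 2
latent = rng.normal(size=(n, k))
loadings = rng.normal(size=(k, d)) * 3.0
U = latent @ loadings

# Additive perturbation: Up = U + V, with V i.i.d. Gaussian noise.
sigma = 1.0
V = rng.normal(scale=sigma, size=U.shape)
Up = U + V

# EVD of the sample covariance of the perturbed data.
Up_c = Up - Up.mean(axis=0)
cov = Up_c.T @ Up_c / n
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
Qk = eigvecs[:, order[:k]]                  # first k principal directions

# Orthogonal projection P = Qk Qk^T; estimate Uhat (rows are records here).
Uhat = Up_c @ Qk @ Qk.T + Up.mean(axis=0)

err_before = np.linalg.norm(Up - U)
err_after = np.linalg.norm(Uhat - U)
print(err_after < err_before)               # projection removes part of the noise
```

Projecting keeps only the noise energy that falls inside the k-dimensional signal subspace, which is why the estimate is closer to U than the raw perturbed data.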
7
Why it works
• The original data are correlated.
• The noise is not correlated.

[Figure: a 2-d example — original signal + noise = perturbed data; projecting the perturbed data onto the 1st principal vector (1-d estimation) discards most of the noise, which spreads along the 2nd principal vector (2-d estimation keeps it).]
8
Challenging Questions
• Previous work on individual reconstruction is only empirical.

• Attacker question: how close is the estimated data to the original one, i.e., how small is ‖Û − U‖?

• Data owner question: how much noise should be added to preserve privacy at a given tolerated level?
9
Determining k
• Strategy 1 (Huang and Du, SIGMOD 2005): k = max{ i | λ̃_i ≥ σ̃²_V },
  i.e., keep the components whose eigenvalues exceed the estimated noise variance.

• Strategy 2 (Guo, Wu and Li, PKDD 2006): k = min{ i | λ̃_i < 2σ̃²_V } − 1;
  the estimated data Û = P̃ Ũ_p = Q̃_k Q̃_kᵀ Ũ_p using this k is approximately optimal.
10
Additive Noise vs. Projection
• Additive perturbation is not safe:
  – Spectral Filtering Technique — H. Kargupta et al., ICDM 2003
  – PCA Based Technique — Huang et al., SIGMOD 2005
  – SVD based & Bound Analysis — Guo et al., SAC 2006, PKDD 2006

• How about projection based perturbation?
  – Projection models
  – Vulnerabilities
  – Potential attacks

Additive noise model:  Y = X + E   (Perturbed = Original + Noise)
Projection model:      Y = R X     (Perturbed = Transformation × Original)
11
Rotation Randomization Example
The source table (columns Bal, Income, IntP become the rows of X):

ID   Bal   Income   …   IntP
1    10k   85k      …   2k
2    15k   70k      …   18k
3    50k   120k     …   35k
4    45k   23k      …   134k
…    …     …        …   …
N    80k   110k     …   15k

Y = R X, with

R =
   0.3333   0.6667   0.6667
  -0.6667   0.6667  -0.3333
  -0.6667  -0.3333   0.6667

X =
  10   15   50   45   80
  85   70  120   23  110
   2   18   35  134   15

Y = R X =
   61.33   63.67  120.00  119.67  110.00
   49.33   30.67   35.00  -59.33   15.00
  -33.67  -21.33  -50.00   51.67  -80.00

R is orthonormal: R Rᵀ = Rᵀ R = I.
12
Rotation Approach (R is orthonormal)
• When R is an orthonormal matrix (Rᵀ R = R Rᵀ = I):
  – Vector length: |Rx| = |x|
  – Euclidean distance: |Rxᵢ − Rxⱼ| = |xᵢ − xⱼ|
  – Inner product: ⟨Rxᵢ, Rxⱼ⟩ = ⟨xᵢ, xⱼ⟩

• Many clustering and classification methods are invariant to this rotation perturbation:
  – Classification, Chen and Liu, ICDM 2005
  – Distributed data mining, Liu and Kargupta, TKDE 2006
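The three invariants can be checked numerically with the slide's 3×3 rotation matrix, a minimal NumPy sketch (the two sample records are taken from the running example):

```python
import numpy as np

# The orthonormal matrix from the slide's rotation example (4-decimal rounding).
R = np.array([[ 0.3333,  0.6667,  0.6667],
              [-0.6667,  0.6667, -0.3333],
              [-0.6667, -0.3333,  0.6667]])

x1 = np.array([10.0, 85.0, 2.0])    # record 1: (Bal, Income, IntP)
x2 = np.array([15.0, 70.0, 18.0])   # record 2

# Orthonormality, up to the rounding of the printed entries: R R^T ~ I.
print(np.round(R @ R.T, 3))

# Invariants preserved by the rotation perturbation Y = R X:
print(np.isclose(np.linalg.norm(R @ x1), np.linalg.norm(x1), rtol=1e-3))  # length
print(np.isclose(np.linalg.norm(R @ x1 - R @ x2),
                 np.linalg.norm(x1 - x2), rtol=1e-3))                     # distance
print(np.isclose((R @ x1) @ (R @ x2), x1 @ x2, rtol=1e-3))                # inner product
```

Because distances and inner products survive the transformation, any miner (or attacker) working purely from geometry sees the same structure in Y as in X.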
13
Example
Y = R X, with the 2-d rotation

R =
   0.866  -0.500
   0.500   0.866

R Rᵀ = Rᵀ R = I

[Figure: a 2-d data point (0.2902, 1.3086) shown together with its image under the rotation R.]
14
Weakness of Rotation
• Known sample attack: the attacker holds a few original records (known info) in addition to the released perturbed data.
• Regressing the perturbed values on the known originals recovers R (here R = [0.866 −0.500; 0.500 0.866]), and R⁻¹ then exposes all of the original data.

[Figure: given the perturbed point (0.2902, 1.3086), which original point produced it?]
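A minimal sketch of the known-sample regression attack, under the assumption that the attacker can line up a few known original records with their perturbed images; the rotation angle, data, and sample size are invented for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hidden rotation chosen by the data owner (30 degrees, as in the slide).
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

X = rng.normal(size=(2, 200))        # original data: attributes x records
Y = R @ X                            # released perturbed data

# Known-sample attack: the attacker knows k original records (columns of X)
# and which perturbed columns correspond to them.
k = 3
X_known, Y_known = X[:, :k], Y[:, :k]

# Least-squares estimate of R from Y_known ~ R_hat @ X_known.
R_hat = Y_known @ np.linalg.pinv(X_known)

# Invert the estimated transformation to recover the whole data set.
X_rec = np.linalg.inv(R_hat) @ Y
print(np.allclose(X_rec, X, atol=1e-6))
```

In 2-d, any two (generic) known records already pin down R exactly; noise-free rotation offers essentially no protection against a known sample.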
15
General Linear Transformation
• Y = R X + E
  – When R = I: Y = X + E (additive noise model)
  – When R Rᵀ = Rᵀ R = I and E = 0: Y = R X (rotation model)
  – In general, R can be an arbitrary matrix.

Y = R X + E:

R =
  4.751  2.429  2.282
  1.156  4.457  0.093
  3.034  3.811  4.107

X =
  10   15   50   45   80
  85   70  120   23  110
   2   18   35  134   15

E =
  7.334  4.199  9.199  6.208  9.048
  3.759  7.537  8.447  7.313  5.692
  0.099  7.939  3.678  1.939  6.318

Y =
  265.95  286.63  475.68  581.71  520.53
  394.30  338.49  569.58  174.22  277.79
  362.55  394.11  665.37  776.46  463.08
16
Is Y = R X + E Safe?
• R can be an arbitrary matrix, hence the regression based attack won't work.
• How about a direct attack via noisy ICA?

  Y = R X + E   (general linear transformation model)
  X = A S + N   (noisy ICA model)
17
ICA Revisited
• ICA motivation:
  – Blind source separation: recovering unobservable (latent) independent source signals when only mixed signals are observed.
  – The cocktail-party problem.

• What is ICA?
  – A statistical technique that represents a set of random variables as linear combinations of statistically independent component variables.
  – A process for determining the structure that produced a signal.
18
ICA
Linear mixing process (mixing matrix A, sources s, observations x):

  x(t) = A s(t), i.e.

  [ x_1(t) ]   [ A_11 … A_1m ] [ s_1(t) ]
  [   ⋮    ] = [  ⋮        ⋮ ] [   ⋮    ]
  [ x_n(t) ]   [ A_n1 … A_nm ] [ s_m(t) ]

Separation process (demixing matrix W, separated signals y):

  y(t) = W x(t), i.e.

  [ y_1(t) ]   [ W_11 … W_1n ] [ x_1(t) ]
  [   ⋮    ] = [  ⋮        ⋮ ] [   ⋮    ]
  [ y_m(t) ]   [ W_m1 … W_mn ] [ x_n(t) ]

Are the separated components independent? Define a cost function measuring independence and optimize W.
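The mixing/separation pipeline above can be illustrated with a bare-bones, noise-free FastICA in NumPy. The symmetric, tanh-based update below is a standard textbook variant chosen for brevity, not the AK-ICA attack of this talk; the sources and mixing matrix are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Two independent, non-Gaussian (uniform) sources, mixed by an unknown A.
S = rng.uniform(-1.0, 1.0, size=(2, n))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S                                   # observed mixtures

# Center and whiten the observations so that cov(Z) = I.
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / n)
Z = E @ np.diag(d ** -0.5) @ E.T @ X

# Symmetric FastICA with the tanh nonlinearity.
W = rng.normal(size=(2, 2))
for _ in range(200):
    G = np.tanh(W @ Z)
    W = G @ Z.T / n - np.diag((1.0 - G**2).mean(axis=1)) @ W
    U, _, Vt = np.linalg.svd(W)
    W = U @ Vt                              # decorrelation: W <- (W W^T)^(-1/2) W

S_hat = W @ Z                               # recovered sources (up to order/sign)

# Cross-correlations between recovered and true sources: each recovered
# component should match exactly one true source with |corr| near 1.
C = np.corrcoef(np.vstack([S_hat, S]))[:2, 2:]
print(np.round(np.abs(C), 2))
```

ICA can only recover the sources up to permutation and sign, which is exactly the ambiguity the cross-correlation check tolerates.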
19
Restriction of ICA
• Restrictions:
  – All the components s_i should be independent.
  – They must be non-Gaussian, with the possible exception of one component.

• Can we apply ICA directly to Y = RX (treating it as X = AS)? No:
  – There are correlations among the attributes of X.
  – More than one attribute may have a Gaussian distribution.
20
A-priori Knowledge based ICA (AK-ICA) Attack
21
Correctness of AK-ICA
• We prove that a matrix J exists such that X̂ = Â_x J S̃_Y ≈ X.
• J captures the connection between the distributions of S̃_X and S̃_Y.
• For more details, see Guo and Wu, PAKDD 2007.
22
Assumption
• Privacy can be breached when a small subset of the original data X is available to attackers.

• The assumption is reasonable. From "Understanding net users' attitude about online privacy", April 1999:
  – Privacy concern: 56%
  – Refuse: 17%
  – No concern (willing to provide data): 27%

ID   Bal   Income   …   IntP
1    10k   85k      …   2k
2    15k   70k      …   18k
3    50k   120k     …   35k
4    45k   23k      …   134k
…    …     …        …   …
N    80k   110k     …   15k
24
Outline
Part I: Attacks on Randomized Numerical Data
  – Additive noise
  – Projection
Part II: Attacks on Randomized Categorical Data
  – Randomized Response
25
Randomized Response (Stanley Warner, JASA 1965)

Purpose: estimate the proportion π_A of population members that cheated in the exam.
  A: cheated in the exam;  Ā: didn't cheat in the exam.

Procedure: each respondent uses a randomization device that selects, unseen by the interviewer:
  – with probability p: "Do you belong to A?"
  – with probability 1 − p: "Do you belong to Ā?"
The respondent answers "yes" or "no" truthfully; the interviewer never learns which question was asked.

The probability of a "yes" answer is

  λ = π_A p + (1 − π_A)(1 − p),

and an unbiased estimate of π_A is

  π̂_A = (λ̂ + p − 1) / (2p − 1),  p ≠ 1/2,

where λ̂ is the observed proportion of "yes" answers.
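Warner's device and estimator are easy to simulate; the population size, the device probability p, and the true π_A below are arbitrary demo choices.

```python
import numpy as np

rng = np.random.default_rng(3)

pi_A = 0.3        # true (hidden) proportion of cheaters
p = 0.7           # device: probability of getting the sensitive question
n = 200_000

cheated = rng.random(n) < pi_A
asked_A = rng.random(n) < p     # True: "Do you belong to A?", else complement

# Each respondent truthfully answers whichever question the device picked.
answer_yes = np.where(asked_A, cheated, ~cheated)

lam_hat = answer_yes.mean()                   # observed "yes" proportion
pi_hat = (lam_hat + p - 1) / (2 * p - 1)      # Warner's unbiased estimator

print(pi_hat)   # close to the true pi_A = 0.3
```

The interviewer never sees which question was asked, yet the aggregate "yes" rate still pins down π_A through the known mixing probability p.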
26
Matrix Expression
• RR can be expressed in matrix form (0: No, 1: Yes):

  λ = P π, with
  P = [ p      1 − p ]
      [ 1 − p  p     ]

• Unbiased estimate of π: π̂ = P⁻¹ λ̂.
27
Vector Response
• π = (π_1, …, π_t): the true proportions of the population.
• λ = (λ_1, …, λ_t): the observed proportions in the survey.
• P = (p_ij): the randomization device set by the interviewer, where p_ij is the probability of reporting category i when the true category is j.

λ = P π. Example (t = 4):

  [ 0.60  0.20  0.00  0.10 ]   [ 0.10 ]   [ 0.16 ]
  [ 0.20  0.50  0.20  0.10 ] × [ 0.30 ] = [ 0.25 ]
  [ 0.15  0.15  0.70  0.30 ]   [ 0.20 ]   [ 0.32 ]
  [ 0.05  0.15  0.10  0.50 ]   [ 0.40 ]   [ 0.27 ]
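The example can be reproduced directly; since P is invertible, the interviewer recovers the true proportions from the observed ones:

```python
import numpy as np

# Randomization matrix from the slide: P[i, j] = probability of reporting
# category i when the true category is j (each column sums to 1).
P = np.array([[0.60, 0.20, 0.00, 0.10],
              [0.20, 0.50, 0.20, 0.10],
              [0.15, 0.15, 0.70, 0.30],
              [0.05, 0.15, 0.10, 0.50]])

pi = np.array([0.10, 0.30, 0.20, 0.40])   # true proportions
lam = P @ pi                               # expected observed proportions
print(np.round(lam, 2))                    # [0.16 0.25 0.32 0.27]

# Given observed proportions, invert P to estimate the true ones.
pi_hat = np.linalg.inv(P) @ lam
print(np.round(pi_hat, 2))                 # [0.1 0.3 0.2 0.4]
```

In practice λ is estimated from finite survey responses, so π̂ = P⁻¹ λ̂ carries sampling noise that the exact computation above hides.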
28
Extension to Multiple Attributes

• m sensitive attributes A_1, A_2, …, A_m; attribute A_j has t_j categories A_{j1}, …, A_{jt_j}.
• π_{i_1…i_m} denotes the true proportion corresponding to the combination (A_{1i_1}, …, A_{mi_m}), i_j = 1, …, t_j.
• π is the vector with elements π_{i_1…i_m}, arranged lexicographically; e.g., if m = 2, t_1 = 2 and t_2 = 3:
  π = (π_11, π_12, π_13, π_21, π_22, π_23)′.

• Simultaneous model: consider all variables as one compounded variable and apply the regular vector response RR technique.
• Sequential model: P = P_1 ⊗ P_2 ⊗ … ⊗ P_m, where ⊗ stands for the Kronecker product.
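The sequential model's Kronecker construction can be sketched with two invented per-attribute devices, P1 for a 2-category attribute and P2 for a 3-category attribute:

```python
import numpy as np

# Hypothetical per-attribute randomization matrices (columns sum to 1).
P1 = np.array([[0.8, 0.3],
               [0.2, 0.7]])
P2 = np.array([[0.7, 0.2, 0.1],
               [0.2, 0.6, 0.2],
               [0.1, 0.2, 0.7]])

# Sequential model: the device for the compound 2x3 = 6-category variable
# is the Kronecker product of the per-attribute devices.
P = np.kron(P1, P2)
print(P.shape)                        # (6, 6)
print(np.allclose(P.sum(axis=0), 1))  # still a valid randomization matrix
```

Column-stochasticity is preserved because each column of kron(P1, P2) is the Kronecker product of a column of P1 with a column of P2, and their sums multiply to 1.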
29
Disclosure Analysis

• R: a typical response, which is "yes" or "no".
• Posterior probabilities:

  P(A | R) = π_A P(R | A) / [ π_A P(R | A) + (1 − π_A) P(R | Ā) ]
  P(Ā | R) = 1 − P(A | R)

• R is regarded as jeopardizing with respect to A or Ā if:

  P(A | R) > π_A   or   P(Ā | R) > 1 − π_A

• P(R | A) and P(R | Ā) are conditional probabilities set by the investigators.
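For Warner's device, P(R = "yes" | A) = p and P(R = "yes" | Ā) = 1 − p, so the posterior is a one-line Bayes computation; the π_A and p values below are illustrative.

```python
# Posterior P(A | R = "yes") under Warner's randomized response.
pi_A = 0.3   # illustrative prior proportion of A
p = 0.7      # illustrative device probability

def posterior_A(pi_A, p_r_given_A, p_r_given_notA):
    """Bayes rule: P(A|R) = pi_A P(R|A) / (pi_A P(R|A) + (1-pi_A) P(R|notA))."""
    num = pi_A * p_r_given_A
    return num / (num + (1.0 - pi_A) * p_r_given_notA)

post_yes = posterior_A(pi_A, p, 1.0 - p)
print(post_yes)   # 0.21 / 0.42 = 0.5 > pi_A, so a "yes" jeopardizes A here
```

Raising p makes the survey estimate more precise but also pushes P(A | "yes") further above π_A, which is exactly the privacy/utility trade-off the disclosure analysis quantifies.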
31
Q&A

Xintao Wu, [email protected], http://www.sis.uncc.edu/~xwu
Data Privacy Lab: http://www.dpl.sis.uncc.edu