Attacks on Randomization based Privacy Preserving Data Mining
Xintao Wu
University of North Carolina at Charlotte, Sept 20, 2010
2
Scope
3
Outline
Part I: Attacks on Randomized Numerical Data
  – Additive noise
  – Projection
Part II: Attacks on Randomized Categorical Data
  – Randomized Response
4
Additive Noise Randomization Example
ID   Bal   Income   …   IntP
1    10k   85k      …   2k
2    15k   70k      …   18k
3    50k   120k     …   35k
4    45k   23k      …   134k
…    …     …        …   …
N    80k   110k     …   15k
Y = X + E (Perturbed = Original + Noise), rows = records, columns = (Bal, Income, IntP):

X =
  10   85    2
  15   70   18
  50  120   35
  45   23  134
  80  110   15

E =
  7.334  3.759  0.099
  4.199  7.537  7.939
  9.199  8.447  3.678
  6.208  7.313  1.939
  9.048  5.692  6.318

Y = X + E =
   17.334   88.759    2.099
   19.199   77.537   25.939
   59.199  128.447   38.678
   51.208   30.313  135.939
   89.048  115.692   21.318
5
Individual Value Reconstruction (Additive Noise)
• Methods:
  – Spectral Filtering, Kargupta et al., ICDM 2003
  – PCA, Huang, Du, and Chen, SIGMOD 2005
  – SVD, Guo, Wu and Li, PKDD 2006
• All aim to remove the noise by projecting onto a lower-dimensional space.
6
Individual Reconstruction Algorithm
Apply EVD : Using some published information about V, extract the first k
components of as the principal components. λ1≥ λ2··· ≥ λk ≥ λe and e1, e2, · · · ,ek are the corresponding
eigenvectors. Qk = [e1 e2 · · · ek] forms an orthonormal basis of a subspace X.
Find the orthogonal projection on to X : Get estimate data set: PUU pˆ
TUp QQ
Up
TkkQQP
Up = U + VNoisePerturbed Original
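The projection step above can be sketched in NumPy. This is an illustrative reconstruction, not the authors' code: the low-rank original data, the noise level, and k = 2 are all assumed for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated data standing in for the original U (n records x d
# attributes): two latent factors drive five attributes, so U has an
# (approximately) rank-2 structure that spectral filtering can exploit.
n, d, k = 2000, 5, 2
latent = rng.normal(size=(n, k))
loadings = rng.normal(size=(k, d)) * 3.0
U = latent @ loadings

# Additive perturbation: Up = U + V, with V i.i.d. Gaussian noise.
sigma = 1.0
V = rng.normal(scale=sigma, size=U.shape)
Up = U + V

# EVD of the sample covariance of the perturbed data.
Up_c = Up - Up.mean(axis=0)
cov = Up_c.T @ Up_c / n
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
Qk = eigvecs[:, order[:k]]                  # first k principal directions

# Orthogonal projection P = Qk Qk^T; estimate Uhat (rows are records here).
Uhat = Up_c @ Qk @ Qk.T + Up.mean(axis=0)

err_before = np.linalg.norm(Up - U)
err_after = np.linalg.norm(Uhat - U)
print(err_after < err_before)               # projection removes part of the noise
```

Projecting keeps only the noise energy that falls inside the k-dimensional signal subspace, which is why the estimate is closer to U than the raw perturbed data.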
7
Why it works
• The original data are correlated.
• The noise is not correlated.

[Figure: a 2-d example — original signal + noise = perturbed data; projecting the perturbed data onto the 1st principal vector (1-d estimation) discards most of the noise, which spreads along the 2nd principal vector (2-d estimation keeps it).]
8
Challenging Questions
• Previous work on individual reconstruction is only empirical.

• Attacker question: how close is the estimated data to the original one, i.e., how small is ‖Û − U‖?

• Data owner question: how much noise should be added to preserve privacy at a given tolerated level?
9
Determining k
• Strategy 1 (Huang and Du, SIGMOD 2005): k = max{ i | λ̃_i ≥ σ̃²_V },
  i.e., keep the components whose eigenvalues exceed the estimated noise variance.

• Strategy 2 (Guo, Wu and Li, PKDD 2006): k = min{ i | λ̃_i < 2σ̃²_V } − 1;
  the estimated data Û = P̃ Ũ_p = Q̃_k Q̃_kᵀ Ũ_p using this k is approximately optimal.
10
Additive Noise vs. Projection
• Additive perturbation is not safe:
  – Spectral Filtering Technique — H. Kargupta et al., ICDM 2003
  – PCA Based Technique — Huang et al., SIGMOD 2005
  – SVD based & Bound Analysis — Guo et al., SAC 2006, PKDD 2006

• How about projection based perturbation?
  – Projection models
  – Vulnerabilities
  – Potential attacks

Additive noise model:  Y = X + E   (Perturbed = Original + Noise)
Projection model:      Y = R X     (Perturbed = Transformation × Original)
11
Rotation Randomization Example
The source table (columns Bal, Income, IntP become the rows of X):

ID   Bal   Income   …   IntP
1    10k   85k      …   2k
2    15k   70k      …   18k
3    50k   120k     …   35k
4    45k   23k      …   134k
…    …     …        …   …
N    80k   110k     …   15k

Y = R X, with

R =
   0.3333   0.6667   0.6667
  -0.6667   0.6667  -0.3333
  -0.6667  -0.3333   0.6667

X =
  10   15   50   45   80
  85   70  120   23  110
   2   18   35  134   15

Y = R X =
   61.33   63.67  120.00  119.67  110.00
   49.33   30.67   35.00  -59.33   15.00
  -33.67  -21.33  -50.00   51.67  -80.00

R is orthonormal: R Rᵀ = Rᵀ R = I.
12
Rotation Approach (R is orthonormal)
• When R is an orthonormal matrix (Rᵀ R = R Rᵀ = I):
  – Vector length: |Rx| = |x|
  – Euclidean distance: |Rxᵢ − Rxⱼ| = |xᵢ − xⱼ|
  – Inner product: ⟨Rxᵢ, Rxⱼ⟩ = ⟨xᵢ, xⱼ⟩

• Many clustering and classification methods are invariant to this rotation perturbation:
  – Classification, Chen and Liu, ICDM 2005
  – Distributed data mining, Liu and Kargupta, TKDE 2006
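The three invariants can be checked numerically with the slide's 3×3 rotation matrix, a minimal NumPy sketch (the two sample records are taken from the running example):

```python
import numpy as np

# The orthonormal matrix from the slide's rotation example (4-decimal rounding).
R = np.array([[ 0.3333,  0.6667,  0.6667],
              [-0.6667,  0.6667, -0.3333],
              [-0.6667, -0.3333,  0.6667]])

x1 = np.array([10.0, 85.0, 2.0])    # record 1: (Bal, Income, IntP)
x2 = np.array([15.0, 70.0, 18.0])   # record 2

# Orthonormality, up to the rounding of the printed entries: R R^T ~ I.
print(np.round(R @ R.T, 3))

# Invariants preserved by the rotation perturbation Y = R X:
print(np.isclose(np.linalg.norm(R @ x1), np.linalg.norm(x1), rtol=1e-3))  # length
print(np.isclose(np.linalg.norm(R @ x1 - R @ x2),
                 np.linalg.norm(x1 - x2), rtol=1e-3))                     # distance
print(np.isclose((R @ x1) @ (R @ x2), x1 @ x2, rtol=1e-3))                # inner product
```

Because distances and inner products survive the transformation, any miner (or attacker) working purely from geometry sees the same structure in Y as in X.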
13
Example
Y = R X, with the 2-d rotation

R =
   0.866  -0.500
   0.500   0.866

R Rᵀ = Rᵀ R = I

[Figure: a 2-d data point (0.2902, 1.3086) shown together with its image under the rotation R.]
14
Weakness of Rotation
• Known sample attack: the attacker holds a few original records (known info) in addition to the released perturbed data.
• Regressing the perturbed values on the known originals recovers R (here R = [0.866 −0.500; 0.500 0.866]), and R⁻¹ then exposes all of the original data.

[Figure: given the perturbed point (0.2902, 1.3086), which original point produced it?]
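A minimal sketch of the known-sample regression attack, under the assumption that the attacker can line up a few known original records with their perturbed images; the rotation angle, data, and sample size are invented for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hidden rotation chosen by the data owner (30 degrees, as in the slide).
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

X = rng.normal(size=(2, 200))        # original data: attributes x records
Y = R @ X                            # released perturbed data

# Known-sample attack: the attacker knows k original records (columns of X)
# and which perturbed columns correspond to them.
k = 3
X_known, Y_known = X[:, :k], Y[:, :k]

# Least-squares estimate of R from Y_known ~ R_hat @ X_known.
R_hat = Y_known @ np.linalg.pinv(X_known)

# Invert the estimated transformation to recover the whole data set.
X_rec = np.linalg.inv(R_hat) @ Y
print(np.allclose(X_rec, X, atol=1e-6))
```

In 2-d, any two (generic) known records already pin down R exactly; noise-free rotation offers essentially no protection against a known sample.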
15
General Linear Transformation
• Y = R X + E
  – When R = I: Y = X + E (additive noise model)
  – When R Rᵀ = Rᵀ R = I and E = 0: Y = R X (rotation model)
  – In general, R can be an arbitrary matrix.

Y = R X + E:

R =
  4.751  2.429  2.282
  1.156  4.457  0.093
  3.034  3.811  4.107

X =
  10   15   50   45   80
  85   70  120   23  110
   2   18   35  134   15

E =
  7.334  4.199  9.199  6.208  9.048
  3.759  7.537  8.447  7.313  5.692
  0.099  7.939  3.678  1.939  6.318

Y =
  265.95  286.63  475.68  581.71  520.53
  394.30  338.49  569.58  174.22  277.79
  362.55  394.11  665.37  776.46  463.08
16
Is Y = R X + E Safe?
• R can be an arbitrary matrix, hence the regression based attack won't work.
• How about a direct attack via noisy ICA?

  Y = R X + E   (general linear transformation model)
  X = A S + N   (noisy ICA model)
17
ICA Revisited
• ICA motivation:
  – Blind source separation: recovering unobservable (latent) independent source signals when only mixed signals are observed.
  – The cocktail-party problem.

• What is ICA?
  – A statistical technique that represents a set of random variables as linear combinations of statistically independent component variables.
  – A process for determining the structure that produced a signal.
18
ICA
Linear mixing process (mixing matrix A, sources s, observations x):

  x(t) = A s(t), i.e.

  [ x_1(t) ]   [ A_11 … A_1m ] [ s_1(t) ]
  [   ⋮    ] = [  ⋮        ⋮ ] [   ⋮    ]
  [ x_n(t) ]   [ A_n1 … A_nm ] [ s_m(t) ]

Separation process (demixing matrix W, separated signals y):

  y(t) = W x(t), i.e.

  [ y_1(t) ]   [ W_11 … W_1n ] [ x_1(t) ]
  [   ⋮    ] = [  ⋮        ⋮ ] [   ⋮    ]
  [ y_m(t) ]   [ W_m1 … W_mn ] [ x_n(t) ]

Are the separated components independent? Define a cost function measuring independence and optimize W.
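The mixing/separation pipeline above can be illustrated with a bare-bones, noise-free FastICA in NumPy. The symmetric, tanh-based update below is a standard textbook variant chosen for brevity, not the AK-ICA attack of this talk; the sources and mixing matrix are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Two independent, non-Gaussian (uniform) sources, mixed by an unknown A.
S = rng.uniform(-1.0, 1.0, size=(2, n))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S                                   # observed mixtures

# Center and whiten the observations so that cov(Z) = I.
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / n)
Z = E @ np.diag(d ** -0.5) @ E.T @ X

# Symmetric FastICA with the tanh nonlinearity.
W = rng.normal(size=(2, 2))
for _ in range(200):
    G = np.tanh(W @ Z)
    W = G @ Z.T / n - np.diag((1.0 - G**2).mean(axis=1)) @ W
    U, _, Vt = np.linalg.svd(W)
    W = U @ Vt                              # decorrelation: W <- (W W^T)^(-1/2) W

S_hat = W @ Z                               # recovered sources (up to order/sign)

# Cross-correlations between recovered and true sources: each recovered
# component should match exactly one true source with |corr| near 1.
C = np.corrcoef(np.vstack([S_hat, S]))[:2, 2:]
print(np.round(np.abs(C), 2))
```

ICA can only recover the sources up to permutation and sign, which is exactly the ambiguity the cross-correlation check tolerates.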
19
Restriction of ICA
• Restrictions:
  – All the components s_i should be independent.
  – They must be non-Gaussian, with the possible exception of one component.

• Can we apply ICA directly to Y = RX (treating it as X = AS)? No:
  – There are correlations among the attributes of X.
  – More than one attribute may have a Gaussian distribution.
20
A-priori Knowledge based ICA (AK-ICA) Attack
21
Correctness of AK-ICA
• We prove that a matrix J exists such that X̂ = Â_x J S̃_Y ≈ X.
• J captures the connection between the distributions of S̃_X and S̃_Y.
• For more details, see Guo and Wu, PAKDD 2007.
22
Assumption
• Privacy can be breached when a small subset of the original data X is available to attackers.

• The assumption is reasonable. From "Understanding net users' attitude about online privacy", April 1999:
  – Privacy concern: 56%
  – Refuse: 17%
  – No concern (willing to provide data): 27%

ID   Bal   Income   …   IntP
1    10k   85k      …   2k
2    15k   70k      …   18k
3    50k   120k     …   35k
4    45k   23k      …   134k
…    …     …        …   …
N    80k   110k     …   15k
24
Outline
Part I: Attacks on Randomized Numerical Data
  – Additive noise
  – Projection
Part II: Attacks on Randomized Categorical Data
  – Randomized Response
25
Randomized Response (Stanley Warner, JASA 1965)

Purpose: estimate the proportion π_A of population members that cheated in the exam.
  A: cheated in the exam;  Ā: didn't cheat in the exam.

Procedure: each respondent uses a randomization device that selects, unseen by the interviewer:
  – with probability p: "Do you belong to A?"
  – with probability 1 − p: "Do you belong to Ā?"
The respondent answers "yes" or "no" truthfully; the interviewer never learns which question was asked.

The probability of a "yes" answer is

  λ = π_A p + (1 − π_A)(1 − p),

and an unbiased estimate of π_A is

  π̂_A = (λ̂ + p − 1) / (2p − 1),  p ≠ 1/2,

where λ̂ is the observed proportion of "yes" answers.
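Warner's device and estimator are easy to simulate; the population size, the device probability p, and the true π_A below are arbitrary demo choices.

```python
import numpy as np

rng = np.random.default_rng(3)

pi_A = 0.3        # true (hidden) proportion of cheaters
p = 0.7           # device: probability of getting the sensitive question
n = 200_000

cheated = rng.random(n) < pi_A
asked_A = rng.random(n) < p     # True: "Do you belong to A?", else complement

# Each respondent truthfully answers whichever question the device picked.
answer_yes = np.where(asked_A, cheated, ~cheated)

lam_hat = answer_yes.mean()                   # observed "yes" proportion
pi_hat = (lam_hat + p - 1) / (2 * p - 1)      # Warner's unbiased estimator

print(pi_hat)   # close to the true pi_A = 0.3
```

The interviewer never sees which question was asked, yet the aggregate "yes" rate still pins down π_A through the known mixing probability p.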
26
Matrix Expression
• RR can be expressed in matrix form (0: No, 1: Yes):

  λ = P π, with
  P = [ p      1 − p ]
      [ 1 − p  p     ]

• Unbiased estimate of π: π̂ = P⁻¹ λ̂.
27
Vector Response
• π = (π_1, …, π_t): the true proportions of the population.
• λ = (λ_1, …, λ_t): the observed proportions in the survey.
• P = (p_ij): the randomization device set by the interviewer, where p_ij is the probability of reporting category i when the true category is j.

λ = P π. Example (t = 4):

  [ 0.60  0.20  0.00  0.10 ]   [ 0.10 ]   [ 0.16 ]
  [ 0.20  0.50  0.20  0.10 ] × [ 0.30 ] = [ 0.25 ]
  [ 0.15  0.15  0.70  0.30 ]   [ 0.20 ]   [ 0.32 ]
  [ 0.05  0.15  0.10  0.50 ]   [ 0.40 ]   [ 0.27 ]
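The example can be reproduced directly; since P is invertible, the interviewer recovers the true proportions from the observed ones:

```python
import numpy as np

# Randomization matrix from the slide: P[i, j] = probability of reporting
# category i when the true category is j (each column sums to 1).
P = np.array([[0.60, 0.20, 0.00, 0.10],
              [0.20, 0.50, 0.20, 0.10],
              [0.15, 0.15, 0.70, 0.30],
              [0.05, 0.15, 0.10, 0.50]])

pi = np.array([0.10, 0.30, 0.20, 0.40])   # true proportions
lam = P @ pi                               # expected observed proportions
print(np.round(lam, 2))                    # [0.16 0.25 0.32 0.27]

# Given observed proportions, invert P to estimate the true ones.
pi_hat = np.linalg.inv(P) @ lam
print(np.round(pi_hat, 2))                 # [0.1 0.3 0.2 0.4]
```

In practice λ is estimated from finite survey responses, so π̂ = P⁻¹ λ̂ carries sampling noise that the exact computation above hides.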
28
Extension to Multiple Attributes

• m sensitive attributes A_1, A_2, …, A_m; attribute A_j has t_j categories A_{j1}, …, A_{jt_j}.
• π_{i_1…i_m} denotes the true proportion corresponding to the combination (A_{1i_1}, …, A_{mi_m}), i_j = 1, …, t_j.
• π is the vector with elements π_{i_1…i_m}, arranged lexicographically; e.g., if m = 2, t_1 = 2 and t_2 = 3:
  π = (π_11, π_12, π_13, π_21, π_22, π_23)′.

• Simultaneous model: consider all variables as one compounded variable and apply the regular vector response RR technique.
• Sequential model: P = P_1 ⊗ P_2 ⊗ … ⊗ P_m, where ⊗ stands for the Kronecker product.
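The sequential model's Kronecker construction can be sketched with two invented per-attribute devices, P1 for a 2-category attribute and P2 for a 3-category attribute:

```python
import numpy as np

# Hypothetical per-attribute randomization matrices (columns sum to 1).
P1 = np.array([[0.8, 0.3],
               [0.2, 0.7]])
P2 = np.array([[0.7, 0.2, 0.1],
               [0.2, 0.6, 0.2],
               [0.1, 0.2, 0.7]])

# Sequential model: the device for the compound 2x3 = 6-category variable
# is the Kronecker product of the per-attribute devices.
P = np.kron(P1, P2)
print(P.shape)                        # (6, 6)
print(np.allclose(P.sum(axis=0), 1))  # still a valid randomization matrix
```

Column-stochasticity is preserved because each column of kron(P1, P2) is the Kronecker product of a column of P1 with a column of P2, and their sums multiply to 1.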
29
Disclosure Analysis

• R: a typical response, which is "yes" or "no".
• Posterior probabilities:

  P(A | R) = π_A P(R | A) / [ π_A P(R | A) + (1 − π_A) P(R | Ā) ]
  P(Ā | R) = 1 − P(A | R)

• R is regarded as jeopardizing with respect to A or Ā if:

  P(A | R) > π_A   or   P(Ā | R) > 1 − π_A

• P(R | A) and P(R | Ā) are conditional probabilities set by the investigators.
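For Warner's device, P(R = "yes" | A) = p and P(R = "yes" | Ā) = 1 − p, so the posterior is a one-line Bayes computation; the π_A and p values below are illustrative.

```python
# Posterior P(A | R = "yes") under Warner's randomized response.
pi_A = 0.3   # illustrative prior proportion of A
p = 0.7      # illustrative device probability

def posterior_A(pi_A, p_r_given_A, p_r_given_notA):
    """Bayes rule: P(A|R) = pi_A P(R|A) / (pi_A P(R|A) + (1-pi_A) P(R|notA))."""
    num = pi_A * p_r_given_A
    return num / (num + (1.0 - pi_A) * p_r_given_notA)

post_yes = posterior_A(pi_A, p, 1.0 - p)
print(post_yes)   # 0.21 / 0.42 = 0.5 > pi_A, so a "yes" jeopardizes A here
```

Raising p makes the survey estimate more precise but also pushes P(A | "yes") further above π_A, which is exactly the privacy/utility trade-off the disclosure analysis quantifies.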
31
Q&A

Xintao Wu, [email protected], http://www.sis.uncc.edu/~xwu
Data Privacy Lab: http://www.dpl.sis.uncc.edu