Privacy preserving data mining
Li Xiong
CS573 Data Privacy and Anonymity
February 12, 2009
What Is Data Mining?
Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
Also known as knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, information harvesting, business intelligence
Privacy preserving data mining
Support data mining while preserving privacy
- Sensitive raw data
- Sensitive mining results
February 12, 2009 4
Seminal work
Privacy preserving data mining, Agrawal and Srikant, 2000
- Centralized data
- Data randomization (additive noise)
- Decision tree classifier
Privacy preserving data mining, Lindell and Pinkas, 2000
- Distributed data mining
- Secure multi-party computation
- Decision tree classifier
Input Perturbation
Reveal the entire database, but randomize the entries
[Figure: the database holds the true values x1 … xn; the user sees only the randomized entries x1+δ1 … xn+δn]
Add random noise δi to each database entry xi
For example, if the distribution of the noise has mean 0, the user can still compute the average of the xi
Taxonomy of PPDM algorithms
Data distribution
- Centralized
- Distributed – privacy preserving distributed data mining
Approaches
- Input perturbation – additive noise (randomization), multiplicative noise, generalization, swapping, sampling
- Output perturbation – rule hiding
- Crypto techniques – secure multiparty computation
Data mining algorithms
- Classification
- Association rule mining
- Clustering
Randomization techniques
Privacy preserving data mining, Agrawal and Srikant, 2000
- Seminal work on the decision tree classifier
Limiting Privacy Breaches in Privacy-Preserving Data Mining, Evfimievski and Gehrke, 2003
- Refined privacy definition
- Association rule mining
Randomization Based Decision Tree Learning (Agrawal and Srikant ’00)
Basic idea: perturb data with value distortion
- User provides xi + r instead of xi, where r is a random value
  - Uniform: uniform distribution in [-α, +α]
  - Gaussian: normal distribution with mean μ = 0 and standard deviation σ
Hypothesis
- Miner doesn't see the real data and can't reconstruct the real values
- Miner can reconstruct enough information to build a decision tree for classification
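A minimal sketch of this value distortion step (function names are illustrative, not from the paper):

```python
import random

def distort_uniform(x, alpha):
    """Return x + r with r drawn uniformly from [-alpha, +alpha]."""
    return x + random.uniform(-alpha, alpha)

def distort_gaussian(x, sigma):
    """Return x + r with r drawn from a normal distribution, mean 0, std dev sigma."""
    return x + random.gauss(0.0, sigma)

# Example: Alice's age 30 might be reported as, say, 65 (30 + 35)
reported_age = distort_uniform(30, 40)
```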
Randomization Approach
[Figure: each user passes a record through a randomizer before it reaches the classification algorithm. E.g., Alice's record 30 | 70K | … is randomized to 65 | 20K | … (her age 30 becomes 65 = 30 + 35, by adding a random number to Age); the classification algorithm builds its model from the randomized records only.]
Classification
- Predicts categorical class labels (discrete or nominal)
Prediction (Regression)
- Models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
- Credit approval, target marketing, medical diagnosis, fraud detection
Motivating Example for Classification – Fruit Identification

Skin   | Color | Size  | Flesh | Conclusion
Hairy  | Brown | Large | Hard  | Safe
Hairy  | Green | Large | Hard  | Safe
Smooth | Red   | Large | Soft  | Dangerous
Hairy  | Green | Large | Soft  | Safe
Smooth | Red   | Small | Hard  | Dangerous
…      | …     | …     | …     | …
Another Example – Credit Approval

Name   | Age | Income | … | Credit
Clark  | 35  | High   | … | Excellent
Milton | 38  | High   | … | Excellent
Neo    | 25  | Medium | … | Fair
…      | …   | …      | … | …

Classification rule: If age = “31...40” and income = high then credit_rating = excellent
Future customers
- Paul: age = 35, income = high → excellent credit rating
- John: age = 20, income = medium → fair credit rating
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
Training Dataset
age   | income | student | credit_rating | buys_computer
<=30  | high   | no      | fair          | no
<=30  | high   | no      | excellent     | no
31…40 | high   | no      | fair          | yes
>40   | medium | no      | fair          | yes
>40   | low    | yes     | fair          | yes
>40   | low    | yes     | excellent     | no
31…40 | low    | yes     | excellent     | yes
<=30  | medium | no      | fair          | no
<=30  | low    | yes     | fair          | yes
>40   | medium | yes     | fair          | yes
<=30  | medium | yes     | excellent     | yes
31…40 | medium | no      | excellent     | yes
31…40 | high   | yes     | fair          | yes
>40   | medium | no      | excellent     | no
Output: A Decision Tree for “buys_computer”
age?
- <=30 → student?
  - no → no
  - yes → yes
- 31..40 → yes
- >40 → credit rating?
  - excellent → no
  - fair → yes
Algorithm for Decision Tree Induction
Algorithms: ID3 (Iterative Dichotomiser), C4.5, CART (Classification and Regression Trees)
Basic algorithm (a greedy algorithm) – the tree is constructed in a top-down, recursive, divide-and-conquer manner (sketched below)
- At start, all the training examples are at the root
- A test attribute is selected that “best” separates the data into partitions (heuristic or statistical measure)
- Samples are partitioned recursively based on the selected attributes
Conditions for stopping partitioning
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
- There are no samples left
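A compact ID3-style sketch of this greedy procedure, under the assumption that records are dicts mapping attribute names to values (all names here are illustrative, not from any particular system):

```python
from collections import Counter
import math

def entropy(labels):
    """Expected bits needed to classify a tuple with these class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    """Top-down, recursive, divide-and-conquer tree construction."""
    if len(set(labels)) == 1:               # all samples belong to the same class
        return labels[0]
    if not attrs:                           # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]

    def weighted_entropy(a):                # entropy after splitting on attribute a
        parts = {}
        for row, y in zip(rows, labels):
            parts.setdefault(row[a], []).append(y)
        return sum(len(p) / len(labels) * entropy(p) for p in parts.values())

    best = min(attrs, key=weighted_entropy) # "best" separating attribute
    tree = {"attr": best, "children": {}}
    for v in set(r[best] for r in rows):    # partition samples recursively
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == v]
        sub_rows = [r for r, _ in sub]
        sub_labels = [y for _, y in sub]
        tree["children"][v] = build_tree(sub_rows, sub_labels,
                                         [a for a in attrs if a != best])
    return tree
```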
Attribute Selection Measures
Idea: select the attribute that partitions the samples into the most homogeneous groups
Measures
- Information gain (ID3)
- Gain ratio (C4.5)
- Gini index (CART)
Attribute Selection Measure: Information Gain (ID3)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
Expected information (entropy) needed to classify a tuple in D:

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Information needed (after using A to split D into v partitions) to classify D:

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$

Information gain – the difference between the original information requirement and the new information requirement obtained by branching on attribute A:

$$Gain(A) = Info(D) - Info_A(D)$$
Attribute Selection Measure: Gini index (CART)
If a data set D contains examples from n classes, the gini index gini(D) is defined as

$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$

where pj is the relative frequency of class j in D
If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

$$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$$

Reduction in impurity:

$$\Delta gini(A) = gini(D) - gini_A(D)$$

The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node
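The same measures in code, as a minimal sketch (labels are lists of class values; function names are illustrative):

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2, with p_j the relative frequency of class j."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1, d2):
    """gini_A(D) for a binary split of D into subsets D1 and D2."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def gini_reduction(d1, d2):
    """Reduction in impurity: gini(D) - gini_A(D)."""
    return gini(d1 + d2) - gini_split(d1, d2)
```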
Information Gain for Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute
Must determine the best split point for A
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent values is considered as a possible split point:
(ai + ai+1)/2 is the midpoint between the values ai and ai+1
The point with the minimum expected information requirement for A is selected as the split-point for A
Split: D1 is the set of tuples in D satisfying A ≤ split-point, and
D2 is the set of tuples in D satisfying A > split-point
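A short sketch of the candidate-split enumeration (names illustrative):

```python
def candidate_split_points(values):
    """Midpoints (a_i + a_{i+1}) / 2 between adjacent sorted values of A."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

# Each candidate partitions D into D1 (A <= split-point) and D2 (A > split-point);
# the point minimizing the expected information requirement is chosen.
```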
Randomization Approach (recap)
[Figure repeated from earlier: randomized records (e.g., Alice's age 30 becomes 65 = 30 + 35 after adding a random number to Age) feed the classification algorithm. How can an accurate model be built from them?]
Randomization Approach Overview
[Figure: the randomized records (Alice's age 30 becomes 65 = 30 + 35) are first used to reconstruct the distribution of Age and the distribution of Salary; the classification algorithm then builds the model from the reconstructed distributions.]
Original Distribution Reconstruction
- x1, x2, …, xn are the n original data values, drawn from n iid random variables X1, X2, …, Xn, each distributed like X
- Using value distortion, the given values are w1 = x1 + y1, w2 = x2 + y2, …, wn = xn + yn, where the yi are drawn from n iid random variables Y1, Y2, …, Yn, each distributed like Y
- Reconstruction problem: given FY and the wi's, estimate FX
Original Distribution Reconstruction: Method
Bayes' theorem for continuous distributions gives the estimated density function:

$$f_X(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X(a)}{\int f_Y(w_i - z)\, f_X(z)\, dz}$$

Iterative estimation
- The initial estimate for fX at j = 0 is the uniform distribution
- Iterative update:

$$f_X^{\,j+1}(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X^{\,j}(a)}{\int f_Y(w_i - z)\, f_X^{\,j}(z)\, dz}$$

- Stopping criterion: χ² test between successive iterations
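A discretized sketch of this iterative procedure (a bin-based histogram stands in for the continuous density, and a simple L1 stopping rule stands in for the χ² test; all names are assumptions, not from the paper):

```python
import numpy as np

def reconstruct_distribution(w, noise_pdf, bins, iters=100, tol=1e-4):
    """Estimate P(X = bin) from perturbed values w_i = x_i + y_i,
    given the known noise density f_Y."""
    fx = np.full(len(bins), 1.0 / len(bins))        # j = 0: uniform initial estimate
    fy = noise_pdf(np.subtract.outer(w, bins))      # f_Y(w_i - a) for every i and bin a
    for _ in range(iters):
        post = fy * fx                              # Bayes numerator per observation
        post /= post.sum(axis=1, keepdims=True) + 1e-12  # Bayes denominator (guarded)
        fx_new = post.mean(axis=0)                  # average posterior over all w_i
        if np.abs(fx_new - fx).sum() < tol:         # stand-in for the chi-squared test
            return fx_new
        fx = fx_new
    return fx

# Example: ages perturbed with uniform noise in [-20, 20]
# uniform_pdf = lambda t: ((t >= -20) & (t <= 20)) / 41.0
# fx = reconstruct_distribution(np.array(w), uniform_pdf, np.arange(0, 91))
```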
Reconstruction of Distribution
[Figure: Number of People (0–1200) vs. Age (20–60), comparing the Original, Randomized, and Reconstructed distributions.]
Original Distribution Reconstruction for Decision Tree
When to reconstruct distributions?
- Global: reconstruct for each attribute once at the beginning; build the decision tree using the reconstructed data
- ByClass: first split the training data by class; reconstruct for each class separately; build the decision tree using the reconstructed data
- Local: first split the training data by class; reconstruct for each class separately, at each node while building the tree
Accuracy vs. Randomization Level
[Figure: accuracy (40–100%) vs. randomization level (10–200) on test function Fn 3, comparing Original, Randomized, and ByClass.]
More Results
- Global performs worse than ByClass and Local
- ByClass and Local have accuracy within 5% to 15% (absolute error) of the Original accuracy
- Overall, all are much better than the Randomized accuracy
Privacy level
- Is the privacy level sufficiently measured?
How to Measure Privacy Breach
- Weak: no single database entry has been revealed
- Stronger: no single piece of information is revealed (what's the difference from the “weak” version?)
- Strongest: the adversary's beliefs about the data have not changed
Kullback-Leibler Distance
Measures the “difference” between two probability distributions
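For reference, the standard definition (not spelled out on the slide; P is the true distribution, Q the approximating one):

$$KL(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$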
Privacy of Input Perturbation
X is a random variable, R is the randomization operator, Y = R(X) is the perturbed database
Measure mutual information between the original and randomized databases
- Average KL distance between (1) the distribution of X and (2) the distribution of X conditioned on Y = y: E_y[ KL(P_{X|Y=y} || P_X) ]
- Intuition: if this distance is small, then Y leaks little information about the actual values of X
Why is this definition problematic?
Is the randomization sufficient?
[Figure: Name/Age database with true entries Gladys: 85, Doris: 90, Beryl: 82, published as Gladys: 72, Doris: 110, Beryl: 85]
- Age is an integer between 0 and 90
- Randomize database entries by adding random integers between -20 and 20
- The randomization operator has to be public (why?)
- Doris's age is 90!! (a randomized value of 110 can only arise from the maximum age 90 plus the maximum noise 20)
Privacy Definitions
Mutual information can be small on average, but an individual randomized value can still leak a lot of information about the original value
Better: consider some property Q(x)
- Adversary has an a priori probability Pi that Q(xi) is true
- Privacy breach if revealing yi = R(xi) significantly changes the adversary's probability that Q(xi) is true
- Intuition: the adversary learned something about entry xi (namely, the likelihood of property Q holding for this entry)
Example
Data: 0 ≤ x ≤ 1000, p(x=0) = 0.01, p(x=k) = 0.00099 for each k ≠ 0
Reveal y = R(x)
Three possible randomization operators R:
- R1(x) = x with prob. 20%; a uniformly random number with prob. 80%
- R2(x) = x + δ mod 1001, δ uniform in [-100, 100]
- R3(x) = R2(x) with prob. 50%; a uniformly random number with prob. 50%
Which randomization operator is better?
Some Properties
Q1(x): x = 0;  Q2(x): x ∉ {200, ..., 800}
What are the a priori probabilities for a given x that these properties hold?
- Q1(x): 1%,  Q2(x): 40.5%
Now suppose the adversary learned that y = R(x) = 0. What are the probabilities of Q1(x) and Q2(x)?
- If R = R1 then Q1(x): 71.6%, Q2(x): 83%
- If R = R2 then Q1(x): 4.8%, Q2(x): 100%
- If R = R3 then Q1(x): 2.9%, Q2(x): 70.8%
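These posteriors follow directly from Bayes' rule; a quick numerical check of the slide's figures (a sketch, with the distribution and the operators' transition probabilities as defined above):

```python
prior = lambda x: 0.01 if x == 0 else 0.00099      # p(x=0)=0.01, p(x=k)=0.00099

# transition probabilities p(x -> y) of the three operators
R1 = lambda x, y: 0.2 * (x == y) + 0.8 / 1001
R2 = lambda x, y: 1 / 201 if (y - x) % 1001 <= 100 or (y - x) % 1001 >= 901 else 0.0
R3 = lambda x, y: 0.5 * R2(x, y) + 0.5 / 1001

Q1 = lambda x: x == 0                              # Q1(x): x = 0
Q2 = lambda x: not 200 <= x <= 800                 # Q2(x): x not in {200, ..., 800}

def posterior(R, Q, y=0):
    """P(Q(x) | R(x) = y) by Bayes' rule over the domain 0..1000."""
    num = sum(prior(x) * R(x, y) for x in range(1001) if Q(x))
    den = sum(prior(x) * R(x, y) for x in range(1001))
    return num / den

print(posterior(R1, Q1), posterior(R1, Q2))        # ~0.716, ~0.83
print(posterior(R2, Q1), posterior(R2, Q2))        # ~0.048, 1.0
print(posterior(R3, Q1), posterior(R3, Q2))        # ~0.029, ~0.708
```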
Privacy Breaches
R1(x) leaks information about property Q1(x)
- Before seeing R1(x), the adversary thinks that the probability of x = 0 is only 1%, but after noticing that R1(x) = 0, the probability that x = 0 is 72%
R2(x) leaks information about property Q2(x)
- Before seeing R2(x), the adversary thinks that the probability of x ∉ {200, ..., 800} is 41%, but after noticing that R2(x) = 0, the probability that x ∉ {200, ..., 800} is 100%
The randomization operator should be such that the posterior distribution is close to the prior distribution for any property
Privacy Breach: Definitions
Q(x) is some property; ρ1, ρ2 are probabilities, ρ1 ≈ “very unlikely”, ρ2 ≈ “very likely”
Straight privacy breach:
- P(Q(x)) ≤ ρ1, but P(Q(x) | R(x)=y) ≥ ρ2
- Q(x) is unlikely a priori, but likely after seeing the randomized value of x
Inverse privacy breach:
- P(Q(x)) ≥ ρ2, but P(Q(x) | R(x)=y) ≤ ρ1
- Q(x) is likely a priori, but unlikely after seeing the randomized value of x
[Evfimievski et al.]
How to check for privacy breaches
How to ensure that the randomization operator hides every property?
- There are 2^|X| properties
- Often the randomization operator has to be selected even before the distribution P_X is known (why?)
Idea: look at the operator's transition probabilities
- How likely is xi to be mapped to a given y?
- Intuition: if all possible values of xi are equally likely to be randomized to a given y, then revealing y = R(xi) will not reveal much about the actual value of xi
Amplification
Randomization operator is γ-amplifying for y if

$$\forall x_1, x_2 \in V_X: \quad \frac{p(x_1 \to y)}{p(x_2 \to y)} \le \gamma$$

For given ρ1, ρ2, no straight or inverse privacy breaches occur if

$$\frac{\rho_2}{\rho_1} \cdot \frac{1 - \rho_1}{1 - \rho_2} > \gamma$$
[Evfimievski et al.]
Amplification: Example
- R1(x) = x with prob. 20%; a uniformly random number with prob. 80%
- R2(x) = x + δ mod 1001, δ uniform in [-100, 100]
- R3(x) = R2(x) with prob. 50%; a uniformly random number with prob. 50%
For R3:
- p(x → y) = ½ (1/201 + 1/1001) if y ∈ [x-100, x+100]
- p(x → y) = ½ (0 + 1/1001) otherwise
Fractional difference = 1 + 1001/201 < 6 (= γ)
Therefore, no straight or inverse privacy breaches will occur with ρ1 = 14%, ρ2 = 50%
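The arithmetic, checked numerically (a sketch; γ and the breach bound as defined on the Amplification slide):

```python
# gamma for R3: ratio of largest to smallest transition probability p(x -> y)
p_in  = 0.5 * (1 / 201 + 1 / 1001)   # y within [x-100, x+100]
p_out = 0.5 * (0 + 1 / 1001)         # y outside that range
gamma = p_in / p_out                 # 1 + 1001/201, about 5.98

# no straight or inverse breaches if rho2/rho1 * (1-rho1)/(1-rho2) > gamma
rho1, rho2 = 0.14, 0.50
bound = rho2 / rho1 * (1 - rho1) / (1 - rho2)   # about 6.14
print(gamma, bound, gamma < bound)   # 5.98..., 6.14..., True
```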
Coming up
Multiplicative noise Output perturbation
Example: Information Gain
Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)

$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$

age   | pi | ni | I(pi, ni)
<=30  | 2  | 3  | 0.971
31…40 | 4  | 0  | 0
>40   | 3  | 2  | 0.971

$$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

$$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$

Similarly: Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
(Computed over the training dataset shown earlier.)
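A self-contained check of these numbers over the (age, buys_computer) columns of the training dataset (a sketch):

```python
from collections import Counter
import math

def info(labels):
    """Info(D) = -sum_i p_i log2 p_i over the class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

data = [("<=30", "no"), ("<=30", "no"), ("31…40", "yes"), (">40", "yes"),
        (">40", "yes"), (">40", "no"), ("31…40", "yes"), ("<=30", "no"),
        ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31…40", "yes"),
        ("31…40", "yes"), (">40", "no")]

labels = [y for _, y in data]
parts = {}
for age, y in data:
    parts.setdefault(age, []).append(y)

info_D = info(labels)                                                 # 0.940
info_age = sum(len(p) / len(data) * info(p) for p in parts.values())  # 0.694
print(round(info_D, 3), round(info_age, 3), round(info_D - info_age, 3))  # gain 0.246
```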