ONN: the Use of Neural Networks for Data Privacy
Jordi Pont-Tuset
Pau Medrano Gracia
Jordi Nin
Josep Lluís Larriba Pey
Victor Muntés i Mulero
Instituto de Investigación en Inteligencia Artificial
Consejo Superior de Investigaciones Científicas
SOFSEM 2008, Nový Smokovec, Slovakia
Presentation Schema
Motivation
Basic Concepts
Ordered Neural Networks (ONN)
Experimental Results
Conclusions and Future Work
Our Scenario: attribute classification

Name          ID number  Salary    Postal code  Age
John Smith    53124566   20.000 €  17100        32
Michael Grom  34423312   25.000 €  08080        42
Anna Molina   18827364   15.000 €  36227        32
(ID)          (ID)       (C)       (NC)         (NC)

Classification of attributes:
• Identifiers (ID)
• Quasi-identifiers:
  • Confidential (C)
  • Non-Confidential (NC)
Data Privacy and Anonymization
[Diagram: the Original Data (ID, NC, C) is released unchanged; record linkage between its NC attributes and an External Data Source containing (ID, NC) re-identifies individuals.]
Confidential data disclosure!!!
Data Privacy and Anonymization
[Diagram: an anonymization process perturbs NC into NC' before release, so record linkage against the External Data Source (ID, NC) no longer succeeds.]
Goal: ensure protection while preserving statistical usefulness
Trade-off: accuracy vs. privacy
Related areas: Privacy in Statistical Databases (PSD), Privacy-Preserving Data Mining (PPDM)
Best Ranked Protection Methods [DT01]

Rank Swapping (RS-p) [Moore96]: sorts the values of each attribute and swaps them randomly within a restricted range of size p

Microaggregation (MIC-vm-k) [DM02]: builds small clusters of at least k elements from v variables, then replaces each value by the centroid of the cluster to which it belongs
[DT01] Domingo-Ferrer, J., Torra, V.: A quantitative comparison of disclosure control methods for microdata. In: Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, Elsevier Science (2001) 111-133
[Moore96] Moore, R.: Controlled data swapping techniques for masking public use microdata sets. U.S. Bureau of the Census (Unpublished manuscript) (1996)
[DM02] Domingo-Ferrer, J., Mateo-Sanz, J.M.: Practical data-oriented microaggregation for statistical disclosure control. IEEE Trans. on KDE 14 (2002) 189-201
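As an illustration, a minimal sketch of the rank-swapping idea in Python (not Moore's exact procedure; the choice of swap partner within the range p is a simplification):

```python
import random

def rank_swap(values, p, seed=None):
    """Rank Swapping (RS-p) sketch: sort the values of one attribute,
    swap each one with a randomly chosen partner at most p rank
    positions away, then restore the original row order."""
    rng = random.Random(seed)
    order = sorted(range(len(values)), key=lambda i: values[i])
    swapped = [values[i] for i in order]          # values in rank order
    for i in range(len(swapped)):
        j = rng.randint(i, min(i + p, len(swapped) - 1))
        swapped[i], swapped[j] = swapped[j], swapped[i]
    out = [None] * len(values)
    for rank, i in enumerate(order):              # undo the sort
        out[i] = swapped[rank]
    return out

salaries = [20000, 25000, 15000, 30000, 22000]
protected = rank_swap(salaries, p=2, seed=0)
```

Note that the protected attribute is a permutation of the original one, which is why rank swapping preserves univariate statistics exactly.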
Our Contribution
We propose a new perturbative protection method for numerical data, based on the use of neural networks
Basic idea: learning a pseudo-identity function (quasi-learning ANNs)
Anonymizing numerical data sets
Artificial Neural Networks

Each neuron weights its inputs and applies an activation function, e.g. the sigmoid f(x) = 1 / (1 + e^(-x))
For our purpose, we assume ANNs without feedback connections and without layer-bypassing connections
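A minimal sketch of such a neuron, with a slope parameter on the sigmoid (the function and parameter names are illustrative, not taken from ONN's implementation):

```python
import math

def neuron(inputs, weights, bias, c=1.0):
    """One neuron: weighted sum of the inputs followed by a sigmoid
    activation; c controls the slope of the sigmoid."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-c * net))

y = neuron([0.2, 0.7], weights=[0.5, -0.3], bias=0.1)
```

Because the sigmoid maps any net input into (0, 1), the output of every neuron is bounded regardless of the input scale.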
Backpropagation Algorithm

Allows the ANN to learn from a predefined set of input-output example pairs
It adjusts the weights of the ANN iteratively:
• in each iteration, the error at the output layer is computed as the sum of squared differences
• the weights are updated using an iterative steepest-descent method
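The update rule can be sketched on a single sigmoid neuron (a deliberate simplification: full backpropagation propagates these deltas through hidden layers as well):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(pairs, epochs=5000, eta=0.5):
    """Steepest-descent training of one sigmoid neuron on (input, target)
    pairs, minimizing the squared difference at the output."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, t in pairs:
            y = sigmoid(w * x + b)
            delta = (y - t) * y * (1.0 - y)   # d(error)/d(net) for squared error
            w -= eta * delta * x              # steepest-descent weight update
            b -= eta * delta
    return w, b

w, b = train([(0.0, 0.2), (1.0, 0.8)])
```

After training, the neuron approximately reproduces the target outputs for the two example inputs.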
Ordered Neural Networks (ONN)

Key idea: inaccurately learning the original data set, using ANNs, in order to reproduce a similar one:
• similar enough to preserve the properties of the original data set
• different enough not to reveal the original confidential values
Ordered Neural Networks (ONN)

How can we learn the original data set?
Trying to learn the whole data set with a single neural network is TOO COMPLEX
Ordered Neural Networks (ONN)

How can we learn the original data set?
We could sort each attribute independently in order to simplify the learning process, but the pattern to be learnt may still be too complex
Ordered Neural Networks (ONN)

How can we learn the original data set?
Reordering each attribute separately: the concept of tuple is lost!
But why are we so keen on preserving the attribute semantics?
Ordered Neural Networks (ONN)

A different approach:
• we ignore the attribute semantics, mixing all the values in the database
• we sort them to make the learning process easier
• we partition the values into several blocks in order to simplify the learning process
ONN General Schema

[Figure: the ONN pipeline — vectorization, sorting, partitioning, normalization, learning, and protection.]
Vectorization
ONN ignores the attribute semantics to reduce the learning process cost
Sorting
Objective: simplify the learning process and reduce learning time
Partitioning
The size of the data set may be very large
A single ANN would make the learning process very difficult
ONN will use a different ANN for each partition k
Normalization
In order to make the learning process possible, it is necessary to normalize the input data
We normalize the values so that their images fall in the range where the slope of the activation function is relevant [FS91]
[FS91] Freeman, J.A., Skapura, D.M. In: Neural Networks: Algorithms, Applications and Programming Techniques. Addison-Wesley Publishing Company (1991) 1-106
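The preprocessing steps above (vectorization, sorting, partitioning, normalization) can be sketched as follows; the linear normalization into a range B is an assumption, since the slides do not give ONN's exact formula:

```python
def preprocess(records, P, B=(0.1, 0.9)):
    """ONN-style preprocessing sketch: flatten the table ignoring
    attribute semantics (vectorization), sort the values, split them
    into P partitions, and linearly map each partition into the range
    B, where the sigmoid's slope is significant."""
    values = sorted(v for row in records for v in row)   # vectorize + sort
    size = -(-len(values) // P)                          # ceil division
    partitions = [values[i * size:(i + 1) * size] for i in range(P)]
    lo, hi = B
    normalized = []
    for part in partitions:
        mn, mx = min(part), max(part)
        span = (mx - mn) or 1.0                          # avoid div by zero
        normalized.append([lo + (hi - lo) * (v - mn) / span for v in part])
    return partitions, normalized

records = [[20000, 17100, 32], [25000, 8080, 42], [15000, 36227, 32]]
parts, norm = preprocess(records, P=3)
```

Sorting before partitioning means each ANN sees a narrow, monotone slice of the value range, which is what makes the quasi-identity function learnable.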
Learning Step
Given P partitions, we have one ANN per partition
Each ANN is fed with values coming from the P partitions in order to add noise
[Diagram: the ANN of partition 1 is fed one value from each of the P partitions (a, d, g, …); its output a' is compared with the original value a and the error is backpropagated.]
(The same learning process is repeated for every column of values: e.g., inputs c, f, i produce c', which is compared with c.)
Protection Step
[Diagram: the original values of each partition (a, d, g, …) are propagated through their trained ANNs, producing the protected values (a', …).]
First, we propagate the original data set through the trained ANNs
Finally, we de-normalize the generated values
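The de-normalization can be sketched as the inverse of a linear normalization into a range B (an assumption, as the slides do not specify ONN's exact formula):

```python
def denormalize(outputs, part, B=(0.1, 0.9)):
    """Protection-step sketch: map the ANN outputs, which lie in the
    normalized range B, back to the original scale of their partition."""
    lo, hi = B
    mn, mx = min(part), max(part)
    return [mn + (mx - mn) * (y - lo) / (hi - lo) for y in outputs]

part = [15000, 20000, 25000]
# hypothetical outputs of the (imperfectly) trained ANN for this partition
outputs = [0.12, 0.52, 0.88]
protected = denormalize(outputs, part)
```

Because the ANNs only quasi-learned the identity function, the de-normalized values land near, but not exactly on, the originals — which is precisely the perturbation ONN releases.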
Experiments Setup

Data used in the CASC Project (http://neon.vb.cbs.nl/casc), from the US Census Bureau:
• 1080 tuples × 13 attributes = 14040 values to be protected
We compare our algorithm with the best 5 parameterizations presented in the literature for:
• Rank Swapping
• Microaggregation
ONN is parameterized ad hoc
Experiments Setup

ONN parameterization:
• P: number of partitions
• B: normalization range size
• E: learning rate parameter
• C: activation function slope parameter
• H: number of neurons in the hidden layer
Score: Protection Methods Evaluation
We need a protection quality score that measures:
• the difficulty for an intruder to reveal the original data
• the information loss in the protected data set

Score = 0.5 IL + 0.5 DR
IL = 100 (0.2 IL1 + 0.2 IL2 + 0.2 IL3 + 0.2 IL4 + 0.2 IL5)
  IL1 = mean absolute error
  IL2 = mean variation of the averages
  IL3 = mean variation of the variances
  IL4 = mean variation of the covariances
  IL5 = mean variation of the correlations
DR = 0.5 DLD + 0.5 ID
  DLD = number of links obtained using distance-based record linkage (DBRL)
  ID = number of protected values near the original ones
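The aggregation of the score can be sketched directly from the formulas above; how DLD and ID are scaled before being combined is not detailed on the slide, so the inputs here are assumed to be pre-scaled to comparable ranges:

```python
def score(il_components, dld, id_count):
    """Protection quality score sketch: equally weighted information-loss
    components IL1..IL5, and a disclosure risk DR built from DLD and ID.
    Lower is better on both halves."""
    IL = 100 * sum(0.2 * c for c in il_components)   # information loss
    DR = 0.5 * dld + 0.5 * id_count                  # disclosure risk
    return 0.5 * IL + 0.5 * DR

# hypothetical, pre-scaled component values
s = score([0.1, 0.05, 0.07, 0.08, 0.1], dld=12.0, id_count=8.0)
```

The equal 0.5/0.5 weighting makes the trade-off explicit: a method that halves disclosure risk but doubles information loss gains nothing on this score.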
Results
[Figure: score comparison of ONN against Rank Swapping and Microaggregation, using 7 variables and using 13 variables.]
Conclusions & Future Work

• The use of ANNs combined with some preprocessing techniques is promising for protection methods
• In our experiments, ONN is able to improve the protection quality of the best-ranked protection methods in the literature
• As future work, we would like to establish a set of criteria to automatically tune the parameters of ONN
Any questions?
Contact e-mail: [email protected]
DAMA Group Web Site: http://www.dama.upc.edu