ONN: the Use of Neural Networks for Data Privacy
Jordi Pont-Tuset
Pau Medrano Gracia
Jordi Nin
Josep Lluís Larriba Pey
Victor Muntés i Mulero
Instituto de Investigación en Inteligencia Artificial
Consejo Superior de Investigaciones Científicas
SOFSEM 2008, Nový Smokovec, Slovakia
Presentation Schema
Motivation
Basic Concepts
Ordered Neural Networks (ONN)
Experimental Results
Conclusions and Future Work
Our Scenario: attribute classification

Name          ID number  Salary    Postal code  Age
John Smith    53124566   20.000 €  17100        32
Michael Grom  34423312   25.000 €  08080        42
Anna Molina   18827364   15.000 €  36227        32
(ID)          (ID)       (C)       (NC)         (NC)

Classification of attributes:
• Identifiers (ID)
• Quasi-identifiers:
  • Confidential (C)
  • Non-Confidential (NC)
Data Privacy and Anonymization
[Diagram: the Original Data (ID, NC, C) is released unchanged; record linkage between its NC attributes and an External Data Source containing (ID, NC) re-identifies individuals.]
Confidential data disclosure!!!
Data Privacy and Anonymization
[Diagram: an anonymization process perturbs NC into NC' before release, so record linkage against the External Data Source (ID, NC) no longer succeeds.]
Goal: ensure protection while preserving statistical usefulness
Trade-off: accuracy vs. privacy
Related areas: Privacy in Statistical Databases (PSD), Privacy-Preserving Data Mining (PPDM)
Best Ranked Protection Methods [DT01]

Rank Swapping (RS-p) [Moore96]: sorts the values of each attribute and swaps them randomly within a restricted range of size p

Microaggregation (MIC-vm-k) [DM02]: builds small clusters of at least k elements from v variables, then replaces each value by the centroid of the cluster to which it belongs
[DT01] Domingo-Ferrer, J., Torra, V.: A quantitative comparison of disclosure control methods for microdata. In: Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, Elsevier Science (2001) 111-133
[Moore96] Moore, R.: Controlled data swapping techniques for masking public use microdata sets. U.S. Bureau of the Census (Unpublished manuscript) (1996)
[DM02] Domingo-Ferrer, J., Mateo-Sanz, J.M.: Practical data-oriented microaggregation for statistical disclosure control. IEEE Trans. on KDE 14 (2002) 189-201
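As an illustration, a minimal sketch of the rank-swapping idea in Python (not Moore's exact procedure; the choice of swap partner within the range p is a simplification):

```python
import random

def rank_swap(values, p, seed=None):
    """Rank Swapping (RS-p) sketch: sort the values of one attribute,
    swap each one with a randomly chosen partner at most p rank
    positions away, then restore the original row order."""
    rng = random.Random(seed)
    order = sorted(range(len(values)), key=lambda i: values[i])
    swapped = [values[i] for i in order]          # values in rank order
    for i in range(len(swapped)):
        j = rng.randint(i, min(i + p, len(swapped) - 1))
        swapped[i], swapped[j] = swapped[j], swapped[i]
    out = [None] * len(values)
    for rank, i in enumerate(order):              # undo the sort
        out[i] = swapped[rank]
    return out

salaries = [20000, 25000, 15000, 30000, 22000]
protected = rank_swap(salaries, p=2, seed=0)
```

Note that the protected attribute is a permutation of the original one, which is why rank swapping preserves univariate statistics exactly.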
Our Contribution
We propose a new perturbative protection method for numerical data, based on the use of neural networks
Basic idea: learning a pseudo-identity function (quasi-learning ANNs)
Anonymizing numerical data sets
Artificial Neural Networks

Each neuron weights its inputs and applies an activation function, e.g. the sigmoid f(x) = 1 / (1 + e^(-x))
For our purpose, we assume ANNs without feedback connections and without layer-bypassing connections
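A minimal sketch of such a neuron, with a slope parameter on the sigmoid (the function and parameter names are illustrative, not taken from ONN's implementation):

```python
import math

def neuron(inputs, weights, bias, c=1.0):
    """One neuron: weighted sum of the inputs followed by a sigmoid
    activation; c controls the slope of the sigmoid."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-c * net))

y = neuron([0.2, 0.7], weights=[0.5, -0.3], bias=0.1)
```

Because the sigmoid maps any net input into (0, 1), the output of every neuron is bounded regardless of the input scale.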
Backpropagation Algorithm

Allows the ANN to learn from a predefined set of input-output example pairs
It adjusts the weights of the ANN iteratively:
• in each iteration, the error at the output layer is computed as the sum of squared differences
• the weights are updated using an iterative steepest-descent method
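The update rule can be sketched on a single sigmoid neuron (a deliberate simplification: full backpropagation propagates these deltas through hidden layers as well):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(pairs, epochs=5000, eta=0.5):
    """Steepest-descent training of one sigmoid neuron on (input, target)
    pairs, minimizing the squared difference at the output."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, t in pairs:
            y = sigmoid(w * x + b)
            delta = (y - t) * y * (1.0 - y)   # d(error)/d(net) for squared error
            w -= eta * delta * x              # steepest-descent weight update
            b -= eta * delta
    return w, b

w, b = train([(0.0, 0.2), (1.0, 0.8)])
```

After training, the neuron approximately reproduces the target outputs for the two example inputs.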
Ordered Neural Networks (ONN)

Key idea: inaccurately learning the original data set, using ANNs, in order to reproduce a similar one:
• similar enough to preserve the properties of the original data set
• different enough not to reveal the original confidential values
Ordered Neural Networks (ONN)

How can we learn the original data set?
Trying to learn the whole data set with a single neural network is TOO COMPLEX
Ordered Neural Networks (ONN)

How can we learn the original data set?
We could sort each attribute independently in order to simplify the learning process, but the pattern to be learnt may still be too complex
Ordered Neural Networks (ONN)

How can we learn the original data set?
Reordering each attribute separately: the concept of tuple is lost!
But why are we so keen on preserving the attribute semantics?
Ordered Neural Networks (ONN)

A different approach:
• we ignore the attribute semantics, mixing all the values in the database
• we sort them to make the learning process easier
• we partition the values into several blocks in order to simplify the learning process
ONN General Schema

[Figure: the ONN pipeline — vectorization, sorting, partitioning, normalization, learning, and protection.]
Vectorization
ONN ignores the attribute semantics to reduce the learning process cost
Sorting
Objective: simplify the learning process and reduce learning time
Partitioning
The size of the data set may be very large
A single ANN would make the learning process very difficult
ONN will use a different ANN for each partition k
Normalization
In order to make the learning process possible, it is necessary to normalize the input data
We normalize the values so that their images fall in the range where the slope of the activation function is relevant [FS91]
[FS91] Freeman, J.A., Skapura, D.M. In: Neural Networks: Algorithms, Applications and Programming Techniques. Addison-Wesley Publishing Company (1991) 1-106
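The preprocessing steps above (vectorization, sorting, partitioning, normalization) can be sketched as follows; the linear normalization into a range B is an assumption, since the slides do not give ONN's exact formula:

```python
def preprocess(records, P, B=(0.1, 0.9)):
    """ONN-style preprocessing sketch: flatten the table ignoring
    attribute semantics (vectorization), sort the values, split them
    into P partitions, and linearly map each partition into the range
    B, where the sigmoid's slope is significant."""
    values = sorted(v for row in records for v in row)   # vectorize + sort
    size = -(-len(values) // P)                          # ceil division
    partitions = [values[i * size:(i + 1) * size] for i in range(P)]
    lo, hi = B
    normalized = []
    for part in partitions:
        mn, mx = min(part), max(part)
        span = (mx - mn) or 1.0                          # avoid div by zero
        normalized.append([lo + (hi - lo) * (v - mn) / span for v in part])
    return partitions, normalized

records = [[20000, 17100, 32], [25000, 8080, 42], [15000, 36227, 32]]
parts, norm = preprocess(records, P=3)
```

Sorting before partitioning means each ANN sees a narrow, monotone slice of the value range, which is what makes the quasi-identity function learnable.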
Learning Step
Given P partitions, we have one ANN per partition
Each ANN is fed with values coming from the P partitions in order to add noise
[Diagram: the ANN of partition 1 is fed one value from each of the P partitions (a, d, g, …); its output a' is compared with the original value a and the error is backpropagated.]
(The same learning process is repeated for every column of values: e.g., inputs c, f, i produce c', which is compared with c.)
Protection Step
[Diagram: the original values of each partition (a, d, g, …) are propagated through their trained ANNs, producing the protected values (a', …).]
First, we propagate the original data set through the trained ANNs
Finally, we de-normalize the generated values
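The de-normalization can be sketched as the inverse of a linear normalization into a range B (an assumption, as the slides do not specify ONN's exact formula):

```python
def denormalize(outputs, part, B=(0.1, 0.9)):
    """Protection-step sketch: map the ANN outputs, which lie in the
    normalized range B, back to the original scale of their partition."""
    lo, hi = B
    mn, mx = min(part), max(part)
    return [mn + (mx - mn) * (y - lo) / (hi - lo) for y in outputs]

part = [15000, 20000, 25000]
# hypothetical outputs of the (imperfectly) trained ANN for this partition
outputs = [0.12, 0.52, 0.88]
protected = denormalize(outputs, part)
```

Because the ANNs only quasi-learned the identity function, the de-normalized values land near, but not exactly on, the originals — which is precisely the perturbation ONN releases.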
Experiments Setup

Data used in the CASC Project (http://neon.vb.cbs.nl/casc), from the US Census Bureau:
• 1080 tuples × 13 attributes = 14040 values to be protected
We compare our algorithm with the best 5 parameterizations presented in the literature for:
• Rank Swapping
• Microaggregation
ONN is parameterized ad hoc
Experiments Setup

ONN parameterization:
• P: number of partitions
• B: normalization range size
• E: learning rate parameter
• C: activation function slope parameter
• H: number of neurons in the hidden layer
Score: Protection Methods Evaluation
We need a protection quality score that measures:
• the difficulty for an intruder to reveal the original data
• the information loss in the protected data set

Score = 0.5 IL + 0.5 DR
IL = 100 (0.2 IL1 + 0.2 IL2 + 0.2 IL3 + 0.2 IL4 + 0.2 IL5)
  IL1 = mean absolute error
  IL2 = mean variation of the averages
  IL3 = mean variation of the variances
  IL4 = mean variation of the covariances
  IL5 = mean variation of the correlations
DR = 0.5 DLD + 0.5 ID
  DLD = number of links obtained using distance-based record linkage (DBRL)
  ID = number of protected values near the original ones
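The aggregation of the score can be sketched directly from the formulas above; how DLD and ID are scaled before being combined is not detailed on the slide, so the inputs here are assumed to be pre-scaled to comparable ranges:

```python
def score(il_components, dld, id_count):
    """Protection quality score sketch: equally weighted information-loss
    components IL1..IL5, and a disclosure risk DR built from DLD and ID.
    Lower is better on both halves."""
    IL = 100 * sum(0.2 * c for c in il_components)   # information loss
    DR = 0.5 * dld + 0.5 * id_count                  # disclosure risk
    return 0.5 * IL + 0.5 * DR

# hypothetical, pre-scaled component values
s = score([0.1, 0.05, 0.07, 0.08, 0.1], dld=12.0, id_count=8.0)
```

The equal 0.5/0.5 weighting makes the trade-off explicit: a method that halves disclosure risk but doubles information loss gains nothing on this score.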
Results
[Figure: score comparison of ONN against Rank Swapping and Microaggregation, using 7 variables and using 13 variables.]
Conclusions & Future Work

• The use of ANNs combined with some preprocessing techniques is promising for protection methods
• In our experiments, ONN is able to improve the protection quality of the best-ranked protection methods in the literature
• As future work, we would like to establish a set of criteria to automatically tune the parameters of ONN
Any questions?
Contact e-mail: [email protected]
DAMA Group Web Site: http://www.dama.upc.edu