
Under consideration for publication in Knowledge and Information Systems

Geometric Data Perturbation for Privacy Preserving Outsourced Data Mining

Keke Chen1, Ling Liu2

1Department of Computer Science and Engineering, Wright State University, Dayton OH, USA; 2College of Computing, Georgia Institute of Technology, Atlanta GA, USA

Abstract. Data perturbation is a popular technique in privacy-preserving data mining. A major challenge in data perturbation is to balance privacy protection and data utility, which are normally considered a pair of conflicting factors. We argue that selectively preserving the task/model-specific information in perturbation will help achieve better privacy guarantees and better data utility. One type of such information is the multidimensional geometric information, which is implicitly utilized by many data mining models. To preserve this information in data perturbation, we propose the Geometric Data Perturbation (GDP) method. In this paper, we describe several aspects of the GDP method. First, we show that several types of well-known data mining models deliver a comparable level of model quality over the geometrically perturbed dataset as over the original dataset. Second, we discuss the intuition behind the GDP method and compare it with other multidimensional perturbation methods such as random projection perturbation. Third, we propose a multi-column privacy evaluation framework for evaluating the effectiveness of geometric data perturbation with respect to different levels of attacks. Finally, we use this evaluation framework to study a few attacks on geometrically perturbed datasets. Our experimental study also shows that geometric data perturbation can not only provide a satisfactory privacy guarantee but also preserve modeling accuracy well.

Keywords: Privacy-preserving Data Mining, Data Perturbation, Geometric Data Perturbation, Privacy Evaluation, Data Mining Algorithms

1. Introduction

With the rise of cloud computing, service-based computing is becoming the major paradigm (Amazon, n.d.; Google, n.d.). Either to use the cloud platform services (Armbrust, Fox, Griffith, Joseph, Katz, Konwinski, Lee, Patterson, Rabkin, Stoica and Zaharia, 2009), or to use existing services hosted on clouds, users will have to export their private

Received Mar 10, 2010; Revised Oct 03, 2010; Accepted Oct 23, 2010


data to the service provider. Since these service providers are not within the trust boundary, the privacy of the outsourced data has become one of the top-priority problems (Armbrust et al., 2009; Bruening and Treacy, 2009). As data mining is one of the most popular data-intensive tasks, privacy preserving data mining for outsourced data has become an important enabling technology for utilizing public computing resources. Different from other settings of privacy preserving data mining, such as collaboratively mining private datasets from multiple parties (Lindell and Pinkas, 2000; Vaidya and Clifton, 2003; Luo, Fan, Lin, Zhou and Bertino, 2009; Teng and Du, 2009), this paper focuses on the following setting: the data owner exports data to, and then receives a model (with a quality description such as the accuracy of a classifier) from, the service provider. This setting also applies to the situation where the data owner uses public cloud resources for large-scale scalable mining, and the service provider merely provides the computing infrastructure.

We present a new data perturbation technique for privacy preserving outsourced data mining (Aggarwal and Yu, 2004; Chen and Liu, 2005) in this paper. A data perturbation procedure can be simply described as follows. Before the data owners publish their data, they change the data in a certain way to disguise the sensitive information while preserving the particular data property that is critical for building meaningful data mining models. Perturbation techniques have to handle the intrinsic tradeoff between preserving data privacy and preserving data utility, as perturbing data usually reduces data utility. Several perturbation techniques have been proposed for mining purposes recently, but these two factors are not satisfactorily balanced. For example, the random noise addition approach (Agrawal and Srikant, 2000; Evfimievski, Srikant, Agrawal and Gehrke, 2002) is weak against data reconstruction attacks and suitable only for a few specific data mining models. The condensation approach (Aggarwal and Yu, 2004) cannot effectively protect data privacy from naive estimation. Rotation perturbation (Chen and Liu, 2005; Oliveira and Zaiane, 2010) and random projection perturbation (Liu, Kargupta and Ryan, 2006) are both threatened by prior-knowledge-enabled Independent Component Analysis (Hyvarinen, Karhunen and Oja, 2001). Multidimensional k-anonymization (LeFevre, DeWitt and Ramakrishnan, 2006) is designed only for general-purpose utility preservation and may result in low-quality data mining models. In this paper, we propose a new multidimensional data perturbation technique: geometric data perturbation, which can be applied to several categories of popular data mining models with better utility preservation and privacy preservation.

1.1. Data Privacy vs. Data Utility

Perturbation techniques are often evaluated with two basic metrics: the level of preserved privacy guarantee and the level of preserved data utility. Data utility is often task/model-specific and measured by the quality of the learned models. An ultimate goal for all data perturbation algorithms is to maximize both data privacy and data utility, although the two typically represent conflicting goals in most existing perturbation techniques.

Level of Privacy Guarantee: Data privacy is commonly measured by the difficulty of estimating the original data from the perturbed data. Given a data perturbation technique, the more difficult it is to estimate the original values from the perturbed data, the higher the level of data privacy the technique provides. In (Agrawal and Srikant, 2000), the variance of the added random noise is used as the level of difficulty for estimating the original values. However, recent research (Evfimievski, Gehrke and Srikant, 2003; Agrawal and Aggarwal, 2002) reveals that the variance of the added noise alone is


not an effective indicator of privacy guarantee. Further research (Kargupta, Datta, Wang and Sivakumar, 2003; Huang, Du and Chen, 2005) has shown that the privacy guarantee is subject to attacks that can reconstruct the original data (or some records) from the perturbed data. Thus, attack analysis has to be integrated into privacy evaluation. Furthermore, since the amount of the attacker's prior knowledge about the original data determines the type of attacks and their effectiveness, we should also study the privacy guarantee according to the level of prior knowledge the attacker may have. With this study, the data owner can decide whether the perturbed data can be released under the assumption of a certain level of prior knowledge. In this paper, we study the proposed geometric data perturbation under a new privacy evaluation framework that incorporates attack analysis and calculates multi-level privacy guarantees according to the level of the attacker's prior knowledge.

Level of Data Utility: The level of data utility typically refers to the amount of critical information preserved after perturbation. More specifically, the critical information is task or model oriented. For example, decision tree and k-Nearest-Neighbor (kNN) classifiers for classification modeling typically utilize different sets of information about the datasets: decision tree construction primarily concerns the relevant column distributions, while the kNN model relies on the distance relationship, which involves all columns. Most existing perturbation techniques do not explicitly address the fact that the critical information is task/model-specific. We argue that by narrowing down to preserve only the task/model-specific information, we are able to provide better quality guarantees on both privacy and model accuracy. The proposed geometric data perturbation aims to approximately preserve the geometric properties that many data mining models are based on.

It is interesting to note that privacy guarantee and data utility exhibit a contradictory relationship in most data perturbation techniques. Typically, data perturbation algorithms that aim at maximizing the level of privacy guarantee have to bear reduced data utility. The intrinsic correlation between the two factors makes it challenging to find the right balance between them when developing a data perturbation technique.

1.2. Contributions and Scope

Bearing the above issues in mind, we have developed the geometric data perturbation approach to privacy preserving data mining. In contrast to other perturbation approaches (Aggarwal and Yu, 2004; Agrawal and Srikant, 2000; Chen and Liu, 2005; Liu, Kargupta and Ryan, 2006), our method exploits the task- and model-specific multidimensional information about the datasets and produces a robust data perturbation method that not only preserves such critical information well but also provides a better balance between the level of privacy guarantee and the level of data utility. The contributions of this paper can be summarized in three aspects.

First, we articulate that the multidimensional geometric properties of datasets are the critical information for many data mining models. We define a data mining model to be "perturbation invariant" if the model built on the geometrically perturbed dataset presents a quality similar to that over the original dataset. With geometric data perturbation, the perturbed data can be exported to the public platform, where these perturbation-invariant data mining models are applied to obtain equivalent models. We have proved that a batch of data mining models, including kernel methods, SVM classifiers with the three popular kernels, linear classifiers, linear regression, regression trees, and all Euclidean-distance based clustering algorithms, are invariant to geometric data


perturbation with only the rotation and translation components, and we have also studied the effect of the distance perturbation component on the model invariance property.

Second, we study whether random projection perturbation (Liu, Kargupta and Ryan, 2006) can be an alternative component in geometric data perturbation, based on a formal analysis of the effect of multiplicative perturbation on model quality. We use the Gaussian mixture model (McLachlan and Peel, 2000) to show in which situations the multiplicative component can affect model quality. This helps us understand why the rotation component is a better choice than other multiplicative components in terms of preserving model accuracy.

Third, since a random geometric-transformation-based perturbation is a multidimensional perturbation, the privacy guarantee of the multiple dimensions (attributes) should be evaluated collectively, not separately. We use a unified privacy evaluation metric for all dimensions and a generic framework to incorporate attack analysis into privacy evaluation. We also analyze a set of attacks according to the different levels of knowledge an attacker may have. A randomized perturbation optimization algorithm is presented to incorporate the evaluation of attack resilience into the perturbation algorithm design.

The rest of the paper is organized as follows. Section 2 briefly reviews the related work on data perturbation. Section 3 defines some notations and gives the background knowledge about geometric data perturbation. Then, in Sections 4 and 5, we define geometric data perturbation and prove that many major models in classification, regression and clustering modeling are invariant to rotation and translation perturbation. In Section 5, we also extend the discussion to the effect of the noise component and of other choices of multiplicative components, such as random projection, on model quality. In Section 6, we first introduce a generic privacy evaluation model and define a unified privacy metric for multidimensional data perturbation. Then, a few inference attacks are analyzed under the proposed privacy evaluation model, which results in a randomized perturbation optimization algorithm. Finally, we present experimental results in Section 7.

2. Related Work

A considerable amount of work on privacy preserving data mining methods has been reported in recent years (Aggarwal and Yu, 2004; Agrawal and Srikant, 2000; Clifton, 2003; Agrawal and Aggarwal, 2002; Evfimievski et al., 2002; Vaidya and Clifton, 2003). The most relevant work on perturbation techniques for data mining includes the random noise addition methods (Agrawal and Srikant, 2000; Evfimievski et al., 2002), the condensation-based perturbation (Aggarwal and Yu, 2004), rotation perturbation (Oliveira and Zaiane, 2010; Chen and Liu, 2005) and projection perturbation (Liu, Kargupta and Ryan, 2006). In addition, k-anonymization (Sweeney, 2002) can also be regarded as a perturbation technique, and there is a large body of literature focusing on the k-anonymity model (Fung, Wang, Chen and Yu, 2010). Since our work is less relevant to the k-anonymity model, we will focus on the other perturbation techniques.

Noise Additive Perturbation The typical additive perturbation technique (Agrawal and Srikant, 2000) is column-based additive randomization. This type of technique relies on two facts: 1) data owners may not want to equally protect all values in a record, so a column-based value distortion can be applied to perturb some sensitive columns; 2) the data classification models to be used do not necessarily require the individual records, but only the column value distributions (Agrawal and Srikant, 2000), under the assumption of independent columns. The basic method is to disguise the original values by injecting a certain amount of additive random noise, while the specific


information, such as the column distribution, can still be effectively reconstructed from the perturbed data.

A typical random noise addition model (Agrawal and Srikant, 2000) can be precisely described as follows. We treat the original values $(x_1, x_2, \dots, x_n)$ of a column as randomly drawn from a random variable $\mathbf{X}$, which has some distribution. The randomization process changes the original data by adding random noise $\mathbf{R}$ to the original data values, generating a perturbed data column $\mathbf{Y} = \mathbf{X} + \mathbf{R}$. The resulting record $(x_1 + r_1, x_2 + r_2, \dots, x_n + r_n)$ and the distribution of $\mathbf{R}$ are published. The key to random noise addition is the distribution reconstruction algorithm (Agrawal and Srikant, 2000; Agrawal and Aggarwal, 2002) that recovers the column distribution of $\mathbf{X}$ based on the perturbed data and the distribution of $\mathbf{R}$.
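To make this concrete, the following minimal sketch (ours, not from the paper; it assumes NumPy is available) perturbs one sensitive column with additive Gaussian noise and shows the kind of distributional information a reconstruction algorithm works from:

```python
import numpy as np

rng = np.random.default_rng(7)

# Original sensitive column: x_1, ..., x_n drawn from some distribution X.
x = rng.normal(loc=50.0, scale=10.0, size=1000)

# Additive randomization: Y = X + R, with R ~ N(0, sigma_r^2).
sigma_r = 5.0
r = rng.normal(loc=0.0, scale=sigma_r, size=x.shape)
y = x + r

# The data owner publishes the perturbed column y together with the
# distribution of R (here: zero-mean Gaussian, variance sigma_r^2).
# Since X and R are independent, Var(X) = Var(Y) - Var(R), which a
# distribution reconstruction algorithm can exploit.
print("Var(Y) - Var(R) =", y.var() - sigma_r**2, "vs. sample Var(X) =", x.var())
```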

While the randomization approach is simple, several researchers have identified that reconstruction-based attacks are its major weakness (Kargupta et al., 2003; Huang et al., 2005). In particular, the spectral properties of the randomized data can be utilized to separate the noise from the private data. Furthermore, only mining algorithms that meet the assumption of independent columns and work on column distributions only, such as decision-tree algorithms (Agrawal and Srikant, 2000) and association-rule mining algorithms (Evfimievski et al., 2002), can be revised to utilize the reconstructed column distributions from perturbed datasets. As a result, it is inconvenient to apply this method to data mining in practice.

Condensation-based Perturbation The condensation approach (Aggarwal and Yu, 2004) is a typical multidimensional perturbation technique, which aims at preserving the covariance matrix of multiple columns. Thus, some geometric properties, such as the shape of the decision boundary, are well preserved. Different from the randomization approach, it perturbs multiple columns as a whole to generate the entire "perturbed dataset". As the perturbed dataset preserves the covariance matrix, many existing data mining algorithms can be applied directly to the perturbed dataset without requiring any change or new development of algorithms.

The condensation approach can be briefly described as follows. It starts by partitioning the original data into $k$-record groups. Each group is formed in two steps: randomly select a record from the existing records as the center of the group, and then find the $(k-1)$ nearest neighbors of the center to be the other $(k-1)$ members. The selected $k$ records are removed from the original dataset before forming the next group. Since each group has a small locality, it is possible to regenerate a set of $k$ records that approximately preserve the distribution and covariance. The record regeneration algorithm tries to preserve the eigenvectors and eigenvalues of each group, as shown in Figure 1. The authors demonstrated that the condensation approach preserves the accuracy of classification models well if the models are trained with the perturbed data.
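The group-formation step can be sketched as follows (our illustration, assuming NumPy; the regeneration step, which re-samples $k$ records per group from the group's eigen-decomposition, is omitted for brevity):

```python
import numpy as np

def form_groups(X, k, seed=0):
    """Partition rows of X (n x d) into k-record groups: pick a random
    center among the remaining records, take its (k-1) nearest remaining
    neighbors, remove the k records, and repeat."""
    rng = np.random.default_rng(seed)
    remaining = list(range(len(X)))
    groups = []
    while len(remaining) >= k:
        center = remaining[rng.integers(len(remaining))]
        dists = np.linalg.norm(X[remaining] - X[center], axis=1)
        nearest = np.argsort(dists)[:k]        # includes the center (distance 0)
        group = [remaining[i] for i in nearest]
        groups.append(group)
        chosen = set(group)
        remaining = [i for i in remaining if i not in chosen]
    return groups

X = np.random.default_rng(1).normal(size=(100, 3))
print(form_groups(X, k=10)[:2])                # first two groups of record IDs
```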

However, we have observed that the condensation approach is weak in protecting data privacy. As stated by the authors, the smaller the locality of each group, the better the covariance is preserved by the regenerated $k$ records. However, the regenerated $k$ records are then confined to the small spatial locality, as shown in Figure 1. Our result (Section 7) shows that the differences between the regenerated records and their nearest neighbors in the original data are very small on average, and thus the original data records can be estimated from the perturbed data with high confidence.

Rotation Perturbation Rotation perturbation was initially proposed for privacy preserving data clustering (Oliveira and Zaiane, 2004). As one of the major components of geometric perturbation, we first applied rotation perturbation to privacy-preserving data classification in our paper (Chen and Liu, 2005) and addressed the general problem of


[Fig. 1. The condensation approach: the original data (*) and the regenerated data (o); regeneration preserves the eigenvectors of each group while confining the regenerated records to the group's locality.]

privacy evaluation for multiplicative data perturbations. Rotation perturbation is simply defined as $G(X) = RX$, where $R_{d\times d}$ is a randomly generated rotation matrix and $X_{d\times n}$ is the original data. Its unique benefit, and also its major weakness, is distance preservation, which ensures that many modeling methods are perturbation invariant while inviting distance-inference attacks. Distance-inference attacks have been addressed by recent studies (Chen, Liu and Sun, 2007; Liu, Giannella and Kargupta, 2006; Guo and Wu, 2007). In (Chen et al., 2007), we discussed some possible ways to improve its attack resilience, which led to our proposed geometric data perturbation. To be self-contained, we include some attack analysis in this paper under the privacy evaluation framework. In (Oliveira and Zaiane, 2010), a scaling transformation, in addition to rotation perturbation, is also used in privacy preserving clustering. Scaling changes the distances; however, the geometric decision boundary is still preserved.

Random Projection Perturbation Random projection perturbation (Liu, Kargupta and Ryan, 2006) refers to the technique of projecting a set of data points from the original multidimensional space to another randomly chosen space. Let $P_{k\times d}$ be a random projection matrix whose rows are orthonormal (Vempala, 2005). The perturbation $G(X) = \sqrt{d/k}\,PX$ is applied to the dataset $X$. The rationale of projection perturbation is its approximate distance preservation, which is supported by the Johnson-Lindenstrauss Lemma (Johnson and Lindenstrauss, 1984). This lemma shows that any dataset in Euclidean space can be embedded into another space such that the pairwise distance of any two points is maintained with small error. As a result, model quality can be approximately preserved. We will compare random projection perturbation to the proposed geometric data perturbation.

3. Preliminaries

In this section, we first give the notations and then define the components of geometric perturbation. Since geometric perturbation works only on numerical data, by default, the datasets discussed in this paper are all numerical.

3.1. Training Dataset

The training dataset is the part of the data that has to be exported/published in privacy-preserving data classification or clustering. A classifier learns the classification model from the


[Fig. 2. Applying geometric data perturbation to outsourced data: the data owner (A) transforms the data $X$ before releasing it as $f(X)$; the service provider (B) mines models/patterns from the perturbed data, and the owner reconstructs/transforms them back.]

training data and is then applied to classify unclassified data. Suppose that $X$ is a training dataset consisting of $N$ data rows (records) and $d$ columns (attributes, or dimensions). For the convenience of mathematical manipulation, we use $X_{d\times N}$ to denote the dataset, i.e., $X = [\mathbf{x}_1 \dots \mathbf{x}_N]$, where $\mathbf{x}_i$ is a data tuple representing a vector in the real space $R^d$. Each data tuple $\mathbf{x}_i$ belongs to a predefined class if the data is for classification modeling, which is indicated by the class label attribute $y_i$; data for clustering do not have labels. The class label can be nominal (or continuous for regression) and is public, i.e., privacy-insensitive. All other attributes containing private information need to be protected. Unclassified datasets could also be exported/published with privacy protection if necessary.

If we consider $X$ to be a sample dataset from the $d$-dimensional random vector $[\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_d]^T$, we use bold $\mathbf{X}_i$ to represent the random variable for column $i$. In general, we will use bold lower case to represent vectors, bold upper case to represent random variables, and regular upper case to represent matrices.

3.2. Framework and Threat Model for Applying Geometric Data Perturbation

We study geometric data perturbation under the following framework (Figure 2). The data owner wants to use a data mining service provider (or a public cloud service provider). The outsourced data needs to be perturbed first and then sent to the service provider. The service provider develops a model based on the perturbed data and returns it to the data owner, who can use the model either by transforming it back to the original space or by perturbing new data to fit the model. While the service provider develops the model, no additional interaction happens between the two parties. Therefore, the major costs for the data owner are incurred in optimizing the perturbation parameters, which can be done with a sample set of the data, and in perturbing the entire dataset.

We take the popular and reasonable honest-but-curious service provider approach for our threat model. That is, we assume the service provider will honestly provide the data mining services. However, we also assume that the provider might look at the data stored and processed on its platform. Therefore, only well-protected data can be processed and stored in such an untrusted environment.


4. Definition of Geometric Data Perturbation

Geometric data perturbation consists of a sequence of random geometric transformations, including a multiplicative transformation ($R$), a translation transformation ($\Psi$), and a distance perturbation ($\Delta$):

$$G(X) = RX + \Psi + \Delta \quad (1)$$

We briefly define these transformations and describe their properties.

4.1. Multiplicative Transformation

The component $R$ can be a rotation matrix (Chen and Liu, 2005) or a random projection matrix (Liu, Kargupta and Ryan, 2006). A rotation matrix exactly preserves distances, while a random projection matrix only approximately preserves them. We will compare the advantages and disadvantages of the two choices.

It is intuitive to understand a rotation transformation in two-dimensional or three-dimensional (2D or 3D, for short) space. We extend it to represent any orthonormal transformation in multidimensional space. A rotation perturbation is defined as $G(X) = RX$. The matrix $R_{d\times d}$ is an orthonormal matrix (Sadun, 2001), which has some important properties. Let $R^T$ represent the transpose of $R$, $r_{ij}$ represent the $(i,j)$ element of $R$, and $I$ be the identity matrix. Both rows and columns of $R$ are orthonormal: for any column $j$, $\sum_{i=1}^d r_{ij}^2 = 1$, and for any two columns $j$ and $k$, $j \neq k$, $\sum_{i=1}^d r_{ij}r_{ik} = 0$; a similar property holds for rows. This definition implies that $R^T R = R R^T = I$. It also implies that changing the order of the rows or columns of an orthonormal matrix leaves the resulting matrix orthonormal. A random orthonormal matrix can be efficiently generated following the Haar distribution (Stewart, 1980), which preserves some important statistical properties (Jiang, 2005).

A key feature of rotation transformation is that it preserves the Euclidean distance. Let $\mathbf{x}^T$ represent the transpose of a vector $\mathbf{x}$, and $\|\mathbf{x}\| = \sqrt{\mathbf{x}^T\mathbf{x}}$ represent the length of the vector. By the definition of rotation matrix, we have $\|R\mathbf{x}\| = \|\mathbf{x}\|$. Similarly, the inner product is also invariant to rotation. Let $\langle\mathbf{x}, \mathbf{y}\rangle = \mathbf{x}^T\mathbf{y}$ represent the inner product of $\mathbf{x}$ and $\mathbf{y}$. We have $\langle R\mathbf{x}, R\mathbf{y}\rangle = \mathbf{x}^T R^T R\mathbf{y} = \langle\mathbf{x}, \mathbf{y}\rangle$. In general, rotation transformation also completely preserves geometric shapes such as hyperplanes and manifolds in the multidimensional space. Thus, many modeling methods are "rotation-invariant", as we will see. Rotation perturbation is the key component of geometric perturbation, providing the primary protection of the perturbed data from naive estimation attacks. The other components of geometric perturbation are used to protect rotation perturbation from more complicated attacks.
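As an illustration of how such a matrix can be obtained in practice, the sketch below (ours; it assumes NumPy and uses the common QR-of-a-Gaussian-matrix construction with a sign correction, one standard way to sample from the Haar distribution) also checks the length- and inner-product-preserving properties stated above:

```python
import numpy as np

def random_rotation(d, rng):
    """Sample a Haar-distributed orthonormal d x d matrix via the QR
    decomposition of a Gaussian matrix, with a sign fix on Q's columns."""
    Q, R = np.linalg.qr(rng.normal(size=(d, d)))
    return Q * np.sign(np.diag(R))   # column-wise sign correction

rng = np.random.default_rng(0)
d, n = 5, 200
X = rng.normal(size=(d, n))          # columns are data records
R = random_rotation(d, rng)

assert np.allclose(R.T @ R, np.eye(d))                 # R^T R = I
# Rotation preserves vector lengths and inner products:
assert np.allclose(np.linalg.norm(R @ X, axis=0),
                   np.linalg.norm(X, axis=0))
assert np.allclose((R @ X).T @ (R @ X), X.T @ X)
```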

A random projection matrix (Vempala, 2005) $R_{k\times d}$ is defined as $R = \sqrt{d/k}\,R_0$, where $R_0$ is randomly generated and its row vectors are orthonormal (note there is no such requirement on column vectors). The Johnson-Lindenstrauss Lemma (Johnson and Lindenstrauss, 1984) proves that random projection can approximately preserve Euclidean distances if certain conditions are satisfied. Concretely, let $\mathbf{x}$ and $\mathbf{y}$ be any original data vectors. Given $0 < \epsilon < 1$ and $k = O(\ln(N)/\epsilon^2)$, there is a random projection $f: R^d \rightarrow R^k$ such that $(1-\epsilon)\|\mathbf{x}-\mathbf{y}\| \le \|f(\mathbf{x})-f(\mathbf{y})\| \le (1+\epsilon)\|\mathbf{x}-\mathbf{y}\|$, where $\epsilon$ defines the accuracy of distance preservation. Therefore, in order to precisely preserve distances, $k$ has to be large. For a large dataset (large $N$), it would be difficult to preserve distances well with a computationally acceptable $k$. We will discuss the effect of random projection and rotation transformation on the result of perturbation.
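A sketch of the projection component (ours, assuming NumPy): build $R_0$ with orthonormal rows from a QR factorization, scale by $\sqrt{d/k}$, and observe that pairwise distances are only approximately preserved:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 50, 20, 100
X = rng.normal(size=(d, n))

# R0 with orthonormal rows: QR-factorize a d x k Gaussian matrix, transpose.
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # Q: d x k, orthonormal columns
R0 = Q.T                                       # k x d, orthonormal rows
P = np.sqrt(d / k) * R0                        # scaled projection matrix
Y = P @ X

# Compare a few pairwise distances before and after projection.
for i, j in [(0, 1), (2, 3), (4, 5)]:
    orig = np.linalg.norm(X[:, i] - X[:, j])
    proj = np.linalg.norm(Y[:, i] - Y[:, j])
    print(f"pair ({i},{j}): distance ratio = {proj / orig:.3f}")  # near 1, not exact
```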


4.2. Translation Transformation

It is easy to understand a translation in low-dimensional ($<$ 4D) space. We extend the definition to any $d$-dimensional space as follows. $\Psi$ is a translation matrix if $\Psi = [\mathbf{t}, \mathbf{t}, \dots, \mathbf{t}]_{d\times N}$, i.e., $\Psi_{d\times N} = \mathbf{t}_{d\times 1}\mathbf{1}^T_{N\times 1}$, where $\mathbf{1}$ is a vector of ones. A translation transformation is simply $G(X) = X + \Psi$. For any two points $\mathbf{x}$ and $\mathbf{y}$ in the original space, translation gives the distance $\|(\mathbf{x}+\mathbf{t}) - (\mathbf{y}+\mathbf{t})\| \equiv \|\mathbf{x} - \mathbf{y}\|$. Therefore, translation always preserves distances. However, it does not preserve inner products, by the definition of inner product.

Translation perturbation alone does not provide protection for the data: the $\Psi$ component can simply be canceled if the attacker knows that only translation perturbation has been applied. However, when combined with rotation perturbation, translation perturbation increases the overall resilience to attacks.

4.3. Distance Perturbation

The above two components preserve the distance relationship. By preserving distances, a set of important classification models become "perturbation-invariant", which is the core of geometric perturbation. However, distance preserving perturbation may be subject to distance-inference attacks in some situations (Section 6.2). The goal of distance perturbation is to preserve distances approximately while effectively increasing the resilience to distance-inference attacks. We define the third component as a random matrix $\Delta_{d\times N}$, where each entry is an independent sample drawn from the same distribution with zero mean and small variance. By adding this component, the distance between a pair of points is disturbed slightly.

Again, solely applying distance perturbation without the other two components will not preserve privacy, since the noise intensity is low. However, a low-intensity noise component provides sufficient resilience against attacks on rotation and translation perturbation. The major issue brought by distance perturbation is the tradeoff between the reduction of model accuracy and the increase of privacy guarantee. In most cases, if we can assume the original data items are secure and the attacker knows no information about the original data, distance-inference attacks cannot happen and the distance perturbation component can be removed. The data owner can decide to remove or keep this component according to their security assessment.

4.4. Cost Analysis

The major cost of perturbation is determined by Eq. 1 and by a randomized perturbation optimization process that is applied to a sample set of the dataset. The perturbation can be applied to data records in a streaming manner. Based on Eq. 1, it costs $O(d^2)$ to perturb each $d$-dimensional data record. Note that this is a one-time cost; no further cost is incurred while the service provider develops models.
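Putting the three components of Eq. 1 together, a minimal self-contained sketch (ours, assuming NumPy; the parameter choices are illustrative, not the optimized ones produced by the randomized perturbation optimization):

```python
import numpy as np

def geometric_perturbation(X, sigma, rng):
    """G(X) = R X + Psi + Delta for X of shape d x N (Eq. 1).
    R: random rotation; Psi: one translation vector replicated over records;
    Delta: i.i.d. zero-mean noise with small variance sigma^2."""
    d, N = X.shape
    Q, U = np.linalg.qr(rng.normal(size=(d, d)))
    R = Q * np.sign(np.diag(U))                 # random rotation matrix
    t = rng.uniform(size=(d, 1))                # translation vector t
    Psi = t @ np.ones((1, N))                   # Psi = t 1^T
    Delta = rng.normal(scale=sigma, size=(d, N))
    return R @ X + Psi + Delta, (R, t)          # perturbed data + secret key

rng = np.random.default_rng(42)
X = rng.normal(size=(4, 1000))
Y, key = geometric_perturbation(X, sigma=0.05, rng=rng)
# Perturbing one d-dimensional record costs O(d^2) (the matrix-vector product).
```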

5. Perturbation-Invariant Data Mining Models

In this section, we first give the definition of perturbation-invariant data mining models, which is appropriate for our setting of mining outsourced data. Then, we prove that several categories of data mining models are invariant to rotation and translation


perturbation. We also formally analyze the effect of the noise component and of arbitrary multiplicative perturbations (including random projection) on the quality of data mining models, using the Gaussian mixture model.

5.1. A General Definition of Perturbation Invariance

We say a data mining model is invariant to a transformation if the model mined with the transformed data has a model quality similar to that mined with the original data. We formally define this concept as follows.

Let $M$ represent a type of data mining model (or modeling method), $M_X$ be a specific model mined from the dataset $X$, and $Q(M_X, Y)$ be the model quality evaluated on a dataset $Y$, e.g., the accuracy of a classification model. Let $T()$ be any perturbation function, which transforms the dataset $X$ to another dataset $T(X)$. Given a small real number $\varepsilon$, $0 < \varepsilon < 1$:

Definition 5.1. The model $M_X$ is invariant to the perturbation $T()$ if and only if $|Q(M_X, Y) - Q(M_{T(X)}, T(Y))| < \varepsilon$ for any training dataset $X$ and testing dataset $Y$.

If $Q(M_X, Y) \equiv Q(M_{T(X)}, T(Y))$, we say the model is strictly invariant to the perturbation $T()$. In the following subsections, we prove that some data mining models are strictly invariant to the rotation and translation components of geometric data perturbation and discuss how the invariance property is affected by the distance perturbation component.

5.2. Perturbation-Invariant Classification Models

In this section, we show some of the classification models that are invariant to geometric data perturbation (with only the rotation and translation components). The model quality $Q(M_X, Y)$ is the classification accuracy of the trained model on the test dataset.

kNN Classifiers and Kernel Methods: A k-Nearest-Neighbor (kNN) classifier determines the class label of a point by looking at the labels of its $k$ nearest neighbors in the training dataset and classifies the point to the class that most of its neighbors belong to. Since the distance between any pair of points is not changed by rotation and translation, the $k$ nearest neighbors are not changed, and thus the classification result is not changed either.

Theorem 1. kNN classifiers are strictly invariant to rotation and translation perturbations.
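Theorem 1 can be checked empirically along the following lines (our sketch, assuming NumPy; the tiny 1-NN classifier and the synthetic labels are illustrative):

```python
import numpy as np

def nn_predict(train_X, train_y, test_X):
    """1-NN: label each test column by its nearest training column."""
    d2 = ((train_X[:, :, None] - test_X[:, None, :]) ** 2).sum(axis=0)
    return train_y[np.argmin(d2, axis=0)]

rng = np.random.default_rng(3)
d, n = 4, 300
X = rng.normal(size=(d, n))
y = (X[0] + X[1] > 0).astype(int)              # synthetic linear labels
Xtr, ytr, Xte = X[:, :200], y[:200], X[:, 200:]

Q, U = np.linalg.qr(rng.normal(size=(d, d)))
R = Q * np.sign(np.diag(U))
t = rng.uniform(size=(d, 1))
T = lambda Z: R @ Z + t                        # rotation + translation

# Predictions agree: distances, hence nearest neighbors, are unchanged.
assert (nn_predict(Xtr, ytr, Xte) == nn_predict(T(Xtr), ytr, T(Xte))).all()
```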

The kNN classifier is a special case of kernel methods. We assert that any kernel method will be invariant to rotation, too. Like the kNN classifier, a typical kernel method¹ is a local classification method, which classifies a new data record based only on the information of its neighbors in the training data.

Theorem 2. Kernel methods are strictly invariant to rotation and translation.

¹ SVM is also a kind of kernel method, but its training process is different from that of the kernel methods we discuss here.


Proof. Let us define kernel methods first. Like kNN classifiers, a kernel method estimates the class label of a point $\mathbf{x}$ with the class labels of its neighbors. Let $K_\lambda(\mathbf{x}, \mathbf{x}_i)$ be the kernel function used for weighting any point $\mathbf{x}_i$ in $\mathbf{x}$'s neighborhood, and let $\lambda$ define the geometric width of the neighborhood. Let $\{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}$ be the points in $\mathbf{x}$'s neighborhood determined by $\lambda$. A kernel classifier for continuous class labels² is defined as

$$f_X(\mathbf{x}) = \frac{\sum_{i=1}^n K_\lambda(\mathbf{x}, \mathbf{x}_i)\,y_i}{\sum_{i=1}^n K_\lambda(\mathbf{x}, \mathbf{x}_i)} \quad (2)$$

Specifically, the kernel $K_\lambda(\mathbf{x}, \mathbf{x}_i)$ is defined as

$$K_\lambda(\mathbf{x}, \mathbf{x}_i) = D\left(\frac{\|\mathbf{x} - \mathbf{x}_i\|}{\lambda}\right) \quad (3)$$

where $D(t)$ is a function such as the Gaussian kernel $D(t) = \frac{1}{\sqrt{2\pi}}\exp\{-t^2/2\}$. Since $\|R\mathbf{x} - R\mathbf{x}_i\| = \|\mathbf{x} - \mathbf{x}_i\|$ under rotation perturbation and $\lambda$ is constant, $D(t)$ is not changed by rotation, and thus $K_\lambda(R\mathbf{x}, R\mathbf{x}_i) = K_\lambda(\mathbf{x}, \mathbf{x}_i)$. Since the geometric area around the point is also not changed, the points in the neighborhood of $R\mathbf{x}$ are exactly the rotations of those in the neighborhood of $\mathbf{x}$, i.e., $\{R\mathbf{x}_1, R\mathbf{x}_2, \dots, R\mathbf{x}_n\}$, and these $n$ points are used in training $M_{RX}$, which makes $f_{RX}(R\mathbf{x}) = f_X(\mathbf{x})$. The proof that kernel methods are invariant to translation perturbation is similar. □

Support Vector Machines: Support Vector Machine (SVM) classifiers also utilize kernel functions in training and classification. However, an SVM has an explicit training procedure to generate a global model, while kernel methods are local methods that use training samples in classifying new instances. Let $y_i$ be the class label of a tuple $\mathbf{x}_i$ in the training set, and $\alpha_i$ and $\beta_0$ be the parameters determined by training. An SVM classifier calculates the classification result for $\mathbf{x}$ using the following function:

$$f_X(\mathbf{x}) = \sum_{i=1}^N \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + \beta_0 \quad (4)$$

First, we prove that SVM classifiers are invariant to rotation in two key steps: 1) training with the rotated dataset generates the same set of parameters $\alpha_i$ and $\beta_0$; 2) the kernel function $K()$ is invariant to rotation. Second, we prove that some SVM classifiers are also invariant to translation (empirically, SVM classifiers with all of the discussed kernels are invariant to translation).

Theorem 3. SVM classifiers using polynomial, radial basis, and neural network kernels are strictly invariant to rotation, and SVM classifiers using the radial basis kernel are also strictly invariant to translation.

Proof. The SVM training problem is an optimization problem that finds the parameters $\alpha_i$ and $\beta_0$ to maximize the Lagrangian (Wolfe) dual objective function (Hastie, Tibshirani and Friedman, 2001)

$$L_D = \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i,j=1}^N \alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j),$$

² The form differs for discrete class labels, but the proof is similar.


subject to:

$$0 < \alpha_i < \gamma, \qquad \sum_{i=1}^N \alpha_i y_i = 0,$$

where $\gamma$ is a parameter chosen by the user to control the allowed errors around the decision boundary. The training result for $\alpha_i$ is determined only by the form of the kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$. With $\alpha_i$ determined, $\beta_0$ can be determined by solving $y_i f_X(\mathbf{x}_i) = 1$ for any $\mathbf{x}_i$ (Hastie et al., 2001), which is again determined by the kernel function. It is clear that if $K(T(\mathbf{x}), T(\mathbf{x}_i)) = K(\mathbf{x}, \mathbf{x}_i)$ holds, the training procedure generates the same set of parameters.

Three popular choices of kernels have been discussed in the SVM literature (Cristianini and Shawe-Taylor, 2000; Hastie et al., 2001):

$d$-th degree polynomial: $K(\mathbf{x}, \mathbf{x}') = (1 + \langle\mathbf{x}, \mathbf{x}'\rangle)^d$,
radial basis: $K(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|/c)$,
neural network: $K(\mathbf{x}, \mathbf{x}') = \tanh(\kappa_1\langle\mathbf{x}, \mathbf{x}'\rangle + \kappa_2)$.

Note that the three kernels involve only distance and inner product calculations. As discussed in Section 4, both operations are invariant to rotation transformation. Thus, $K(R\mathbf{x}, R\mathbf{x}') = K(\mathbf{x}, \mathbf{x}')$ holds for the three kernels, and training with the rotated data will not change the parameters of SVM classifiers using these three popular kernels. However, with this method we can only prove that the radial basis kernel is invariant to translation, while the other two are not.

It is easy to verify that the classification function (Eq. 4) is invariant to rotation, since it involves only the invariant parameters and the invariant kernel functions. Similarly, we can prove that the classification function with the radial basis kernel is also invariant to translation. □

Although we cannot prove with this method that the polynomial and neural network kernels are also invariant to translation, we use experiments to show that they are.
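The invariance claims can be checked empirically along these lines (our sketch; it assumes scikit-learn, which the paper does not use, and scikit-learn's kernel parameterizations differ slightly from the forms above, so the accuracies match closely rather than exactly):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d = 400, 6
X = rng.normal(size=(n, d))                   # rows are records (sklearn layout)
y = (np.linalg.norm(X, axis=1) > np.sqrt(d)).astype(int)

Q, U = np.linalg.qr(rng.normal(size=(d, d)))
R = Q * np.sign(np.diag(U))                   # random rotation
t = rng.uniform(size=d)
Xp = X @ R.T + t                              # each record x -> R x + t

for kernel in ["rbf", "poly", "sigmoid"]:     # radial basis / polynomial / tanh
    a = SVC(kernel=kernel).fit(X[:300], y[:300]).score(X[300:], y[300:])
    b = SVC(kernel=kernel).fit(Xp[:300], y[:300]).score(Xp[300:], y[300:])
    print(f"{kernel}: accuracy {a:.3f} (original) vs {b:.3f} (perturbed)")
```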

Linear Classifiers: A linear classifier uses a hyperplane to separate the training data. Let the weight vector be $\mathbf{w}^T = [w_1, \dots, w_d]$ and the bias be $\beta_0$; both parameters are determined by the training procedure (Hastie et al., 2001). A trained classifier is represented as

$$f_X(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + \beta_0.$$

Theorem 4. Linear classifiers are strictly invariant to rotation and translation.

Proof. First, it is important to understand the relationship between the parameters and the hyperplane. As Figure 3 shows, the hyperplane can be represented as $\mathbf{w}^T(\mathbf{x} - \mathbf{x}_t) = 0$, where $\mathbf{w}$ is the axis perpendicular to the hyperplane and $\mathbf{x}_t$ represents the deviation of the plane from the origin (i.e., $\beta_0 = -\mathbf{w}^T\mathbf{x}_t$).

Intuitively, rotation rotates both the classification hyperplane and the feature vectors. The perpendicular axis $\mathbf{w}$ is changed to $R\mathbf{w}$ and the deviation $\mathbf{x}_t$ becomes $R\mathbf{x}_t$ after rotation. Let $\mathbf{x}_r$ represent the data in the rotated space. Then, the rotated hyperplane is represented as $(R\mathbf{w})^T(\mathbf{x}_r - R\mathbf{x}_t) = 0$, and the classifier is transformed to $f_{RX}(\mathbf{x}_r) = \mathbf{w}^T R^T(\mathbf{x}_r - R\mathbf{x}_t)$. Since $\mathbf{x}_r = R\mathbf{x}$ and $R^T R = I$, we have $f_{RX}(\mathbf{x}_r) = \mathbf{w}^T R^T R(\mathbf{x} - \mathbf{x}_t) = \mathbf{w}^T(\mathbf{x} - \mathbf{x}_t) = f_X(\mathbf{x})$. The two classifiers are equivalent.


[Fig. 3. A hyperplane and its parameters: the perpendicular axis $\mathbf{w}$, the deviation $\mathbf{x}_t$ from the origin, and the two classes (*/o) it separates.]

It is also easy to prove that linear classifiers are invariant to translation; we omit the proof. □

5.3. Perturbation Invariant Regression Methods

Regression modeling (Hastie et al., 2001) is very similar to classification modeling. The only difference is that the class label changes from discrete to continuous, which requires changing the criterion for model evaluation. A regression model is often evaluated by a loss function $L(f(X), \mathbf{y})$, where $f(X)$ is the response vector from applying the regression function $f()$ to the training instances $X$, and $\mathbf{y}$ is the original target vector (i.e., the class labels in classification modeling). A typical loss function is the mean square error (MSE):

$$L(f(X), \mathbf{y}) = \sum_{i=1}^n (f(\mathbf{x}_i) - y_i)^2$$

As the definition of model quality is instantiated by the loss function $L$, we give the following definition of a perturbation-invariant regression model.

Definition 5.2. A regression method is invariant to a transformation $T$ if and only if $|L(f_X(Y), \mathbf{y}_Y) - L(f_{T(X)}(T(Y)), \mathbf{y}_Y)| < \varepsilon$ for any training dataset $X$ and any testing dataset $Y$, where $0 < \varepsilon < 1$ and $\mathbf{y}_Y$ is the target vector of the testing data $Y$.

Similarly, the strict invariance condition becomes $L(f_X(Y), \mathbf{y}_Y) \equiv L(f_{T(X)}(T(Y)), \mathbf{y}_Y)$. We prove the following:

Theorem 5. The linear regression model using MSE as the loss function is strictly invariant to rotation and translation.

Proof. The linear regression model based on the MSE loss function can be represented as $\mathbf{y} = X^T\beta + \epsilon$, where $\epsilon$ is a vector of random Gaussian noise with mean zero and variance $\sigma^2$. The estimate of $\beta$ is $\hat\beta = (XX^T)^{-1}X\mathbf{y}_X$. Thus, for any testing data $Y$, the estimated model gives $\hat{\mathbf{y}}_Y = Y^T\hat\beta$, and the loss function for the testing data $Y$ is


$L(f_X(Y), \mathbf{y}) = \|Y^T(XX^T)^{-1}X\mathbf{y}_X - \mathbf{y}_Y\|$. After rotation it becomes

$$L(f_{RX}(RY), \mathbf{y}) = \|Y^T R^T (RX(RX)^T)^{-1} RX\mathbf{y}_X - \mathbf{y}_Y\| = \|Y^T R^T (RXX^T R^T)^{-1} RX\mathbf{y}_X - \mathbf{y}_Y\| = \|Y^T R^T (R^T)^{-1}(XX^T)^{-1} R^{-1} RX\mathbf{y}_X - \mathbf{y}_Y\| = \|Y^T (XX^T)^{-1} X\mathbf{y}_X - \mathbf{y}_Y\| \equiv L(f_X(Y), \mathbf{y}) \quad (5)$$

The linear regression model can also be represented as $y = \beta_0 + \sum_{i=1}^d \beta_i x_i$, where $x_i$ is the value of dimension $i$ of the vector $\mathbf{x}$. It is clear that if $\mathbf{x}$ is translated to $\mathbf{x}' = \mathbf{x} + \mathbf{t}$, we can reuse the model parameters except that $\hat\beta_0$ is replaced with $\hat\beta_0 - \sum_{i=1}^d \hat\beta_i t_i$. Thus, the new model does not change the MSE either. □
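The rotation case of Eq. 5 can be verified numerically (a sketch of ours, assuming NumPy; the loss is the sum of squared errors, as defined above):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 3, 200
X = rng.normal(size=(d, n))                     # training data, columns = records
beta = np.array([1.5, -2.0, 0.7])
yX = X.T @ beta + rng.normal(scale=0.1, size=n) # training targets
Y = rng.normal(size=(d, 50))                    # testing data
yY = Y.T @ beta                                 # testing targets

def loss(train, target, test, test_target):
    b = np.linalg.lstsq(train.T, target, rcond=None)[0]  # beta-hat
    return np.sum((test.T @ b - test_target) ** 2)

Q, U = np.linalg.qr(rng.normal(size=(d, d)))
R = Q * np.sign(np.diag(U))

# The loss on rotated train/test data equals the loss on the originals.
print(loss(X, yX, Y, yY), loss(R @ X, yX, R @ Y, yY))
```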

Other regression models, such as regression-tree based methods (Friedman, 2001), which partition the space based on Euclidean distance, are also strictly invariant to rotation and translation. We skip the details here.

5.4. Perturbation Invariant Clustering Algorithms

There are several metrics for evaluating the quality of a clustering result, all of which are based on cluster membership, i.e., record $i$ belongs to cluster $C_j$. Suppose the number of clusters is fixed as $K$. The same clustering algorithm applied to the original data and to the perturbed data generates two clustering results. Since the record IDs do not change under perturbation, we can compare the difference between the two clustering results to evaluate the invariance property. We use the confusion matrix method (Jain, Murty and Flynn, 1999) to evaluate this difference, where each element $c_{ij}$, $1 \le i, j \le K$, represents the number of points from cluster $j$ in the original dataset that are assigned to cluster $i$ by the clustering result on the perturbed data. Since cluster labels may represent different clusters in the two clustering results, let $\{(1), (2), \dots, (K)\}$ be any permutation of the sequence of cluster labels $\{1, 2, \dots, K\}$. There is a permutation that best matches the clustering results before and after data perturbation and maximizes the number of consistent points $m_C$ for clustering algorithm $C$:

$$m_C = \max\Big\{\sum_{i=1}^K c_{i(i)},\ \text{over all permutations } \{(1), (2), \dots, (K)\}\Big\}$$

We define the error rate as $DQ_C(X, T(X)) = 1 - \frac{m_C}{N}$, where $N$ is the total number of points. $DQ_C$ is the quality difference between the two clustering results. The criterion for a perturbation-invariant clustering algorithm can then be defined as follows.

Definition 5.3. A clustering algorithm is invariant to a transformation $T$ if and only if $DQ_C(X, T(X)) < \varepsilon$ for any dataset $X$ and a small value $0 < \varepsilon < 1$.

For strict invariance, $DQ_C(X, T(X)) = 0$.

Theorem 6. Any clustering or cluster visualization algorithm that is based on Euclidean distance is strictly invariant to rotation and translation.

Since geometric data perturbation aims at preserving the Euclidean distance relationship, cluster membership does not change under perturbation. Thus, it is easy to prove that the above theorem holds, and we skip the proof. The best-matching permutation in the definition of $m_C$ can be computed efficiently, as sketched below.
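Because the maximization over permutations is an assignment problem, $m_C$ can be computed without enumerating all $K!$ permutations, e.g., with the Hungarian algorithm. A sketch of ours (assuming NumPy and SciPy):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(labels_orig, labels_pert, K):
    """DQ_C = 1 - m_C / N, where m_C is the best-matched count of
    consistent points over all permutations of cluster labels."""
    C = np.zeros((K, K), dtype=int)              # confusion matrix c_ij
    for i, j in zip(labels_pert, labels_orig):
        C[i, j] += 1
    rows, cols = linear_sum_assignment(-C)       # maximizes sum of c_{i,(i)}
    m_C = C[rows, cols].sum()
    return 1.0 - m_C / len(labels_orig)

a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([2, 2, 0, 0, 1, 1])                 # same clustering, relabeled
print(clustering_error(a, b, K=3))               # 0.0
```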


[Fig. 4. Analyzing the points perturbed out of the boundary: points farther than $\delta = 2\sigma$ from the decision boundary have only a small chance of being perturbed across it.]

5.5. Effect of Noise Perturbation on the Invariance Property

Intuitively, the noise component will affect the quality of the data mining model. In this section, we give a formal analysis of how the noise intensity affects model quality for classification (or clustering) modeling.

Assume the boundary for the data perturbed without the noise component is as shown in Figure 4, and the noises are drawn from the normal distribution $N(0, \sigma^2)$. Let us look at the small band of width $\delta$ (on one side) around the classification or clustering boundary. The increased error rate is determined by the number of points that were originally properly classified or clustered but are now perturbed to the other side of the boundary. Outside the band, points are less likely to be perturbed across the boundary. For a $d$-dimensional point $\mathbf{x} = (x_1, x_2, \dots, x_d)$, its perturbed version (with only the noise component) is $\mathbf{x}' = (x_1 + \epsilon_1, x_2 + \epsilon_2, \dots, x_d + \epsilon_d)$, where each $\epsilon_i$ is drawn from the same distribution $N(0, \sigma^2)$. To further simplify the analysis, assume a decision boundary is perpendicular to one of the dimensions, say $x_i$ (we can always rotate the dataset to meet this setting), and that there are $n$ points uniformly distributed in the $\delta$ band.

According to the normal distribution, for $\delta > 2\sigma$, the points located outside the $\delta$ band have only a small probability ($< 0.025$) of being perturbed to the other side of the boundary. Therefore, we consider only the points within the $\delta = 2\sigma$ band. Let $p(y)$ be the probability that a point at distance $y$ from the boundary is perturbed across it; then the average number of points perturbed out of the boundary is

$$\int_0^{2\sigma} p(y)\,\frac{n}{2\sigma}\,dy = \int_0^{2\sigma}\int_y^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big\{-\frac{x^2}{2\sigma^2}\Big\}\,dx\,\frac{n}{2\sigma}\,dy.$$

Expanding $\exp\{-\frac{x^2}{2\sigma^2}\}$ with a Taylor series (Gallier, 2000) for the first three terms, we obtain $\exp\{-\frac{x^2}{2\sigma^2}\} \approx 1 - \frac{x^2}{2\sigma^2} + \frac{x^4}{8\sigma^4}$. With the fact $\int_y^\infty \frac{1}{\sqrt{2\pi}\,\sigma}\exp\{-\frac{x^2}{2\sigma^2}\}\,dx = \frac{1}{2} - \int_0^y \frac{1}{\sqrt{2\pi}\,\sigma}\exp\{-\frac{x^2}{2\sigma^2}\}\,dx$, we solve the integral and find that the number of out-of-boundary points is approximately $(\frac{1}{2} - \frac{4}{5\sqrt{2\pi}})n \approx 0.18n$. The other side of the boundary has a similar amount of points perturbed across it. Depending on the data distribution and $\sigma$, the amount of affected data points can vary. Borrowing the concept of "margin" from the SVM literature, we understand that if the margin is greater than $2\delta$, the model accuracy is not affected at all; if the margin is less than $2\delta$, the model quality is affected by the amount of points in the $2\delta$ region.
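A quick Monte Carlo check of this estimate (our sketch, assuming NumPy) simulates points uniformly distributed in the $2\sigma$ band; the simulated fraction comes out near 0.195, in the same ballpark as the three-term Taylor approximation of $0.18$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 1.0, 1_000_000

# Points uniformly distributed in the delta = 2*sigma band on one side of a
# boundary at 0; noise epsilon ~ N(0, sigma^2) added along that dimension.
y = rng.uniform(0.0, 2.0 * sigma, size=n)     # distance to the boundary
eps = rng.normal(0.0, sigma, size=n)
crossed = eps < -y                             # perturbed across the boundary
print(crossed.mean())                          # ~0.195 vs. ~0.18 from the Taylor expansion
```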


5.6. Effect of General Multiplicative Perturbation on Model Quality

In geometric data perturbation, the rotation and translation components strictly preserve distances, which are then slightly perturbed by the distance perturbation component. If we relax the condition of strictly preserving distances, what happens to the discussed mining models? This relaxation may use any linear transformation matrix to replace the rotation component, e.g., projection perturbation (Liu, Kargupta and Ryan, 2006). In this section, we discuss the effect of a general multiplicative perturbation $G(\mathbf{x}) = A\mathbf{x}$ on classification model quality, where $A$ is a $k\times d$ matrix and $k$ may not equal $d$. We analyze why arbitrary projection perturbations do not generally preserve geometric decision boundaries, and what alternatives to rotation perturbation can generate multiplicative perturbations that preserve decision boundaries exactly or approximately.

This analysis is based on a simplified model of data distribution: the multidimensional Gaussian mixture model. Assume the dataset can be modeled with multiple data clouds, each of which has an approximately normal (Gaussian) distribution $N(\mu_i, \Sigma_i)$, where $\mu_i$ is the mean vector and $\Sigma_i$ is the covariance matrix. Since a general multiplicative perturbation does not necessarily preserve all of the geometric properties of the dataset, it is not guaranteed that the discussed data mining models will be invariant to such transformations. Let us first consider the more general case that puts no constraint on $k$. The rationale of projection perturbation is the approximate distance preservation supported by the Johnson-Lindenstrauss Lemma (Johnson and Lindenstrauss, 1984).

Theorem 7. For any $0 < \epsilon < 1$ and any integer $n$, let $k$ be a positive integer such that $k \ge \frac{4\log n}{\epsilon^2/2 - \epsilon^3/3}$. Then, for any set $S$ of $n$ data points in the $d$-dimensional space $R^d$, there is a mapping function $f: R^d \rightarrow R^k$ such that, for all $\mathbf{x}, \mathbf{y} \in S$,

$$(1-\epsilon)\|\mathbf{x} - \mathbf{y}\|^2 \le \|f(\mathbf{x}) - f(\mathbf{y})\|^2 \le (1+\epsilon)\|\mathbf{x} - \mathbf{y}\|^2,$$

where $\|\cdot\|$ denotes the vector 2-norm.

This lemma shows that any set of $n$ points in $d$-dimensional Euclidean space can be embedded into an $O(\log n/\epsilon^2)$-dimensional space with some linear transformation $f$, such that the pairwise distance of any two points is maintained with a controlled error. However, there is a cost to achieving high precision in distance preservation. For example, a setting of $n = 1000$, quite a small dataset, with $\epsilon = 0.01$ requires $k \approx 0.5$ million dimensions, which makes the transformation impossible to perform. Increasing $\epsilon$ to 0.1, we still need about $k \approx 6{,}000$. In order to further reduce $k$, we have to increase $\epsilon$ further, which, however, brings larger errors. With increased distance error, the decision boundary may not be well preserved.
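Plugging numbers into the bound of Theorem 7 reproduces these figures (a sketch of ours; we read $\log$ as the natural logarithm):

```python
import math

def jl_dim(n, eps):
    """Minimum target dimension k from the bound in Theorem 7."""
    return math.ceil(4 * math.log(n) / (eps**2 / 2 - eps**3 / 3))

print(jl_dim(1000, 0.01))   # ~556,000 dimensions -- impractical
print(jl_dim(1000, 0.1))    # ~5,900 dimensions -- still very large
```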

We also analyze the effect of the transformation from a more intuitive perspective. In order to see the connections between general linear transformations and data mining models, we use classifiers that are based on geometric decision boundaries as examples.

Below we call a dense area (a set of points similar to each other) a cluster, while the points with the same class label are in the same class. We can approximately model the whole dataset with Gaussian mixtures based on its density property. Without loss of generality, we suppose that a geometrically separable class consists of one or more Gaussian clusters, as shown in Figure 5. Let $\mu$ be the density center and $\Sigma$ be the covariance matrix of one Gaussian cluster. A cluster $C_i$ can be represented with the following distribution:

$$N_d^i(\mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\{-(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)/2\}$$


[Fig. 5. Using a Gaussian mixture (clusters $c_1, \dots, c_7$) to describe the classes and the decision boundary in the original dataset.]

$\mu$ describes the position of the cluster, and $\Sigma$ describes the hyper-elliptic shape of the dense area. After the transformation with an invertible $A$, the center of the cluster is moved to $A\mu$, and the covariance matrix (corresponding to the shape of the dense area) is changed to $A\Sigma A^T$.

Let $\mathbf{x}$ and $\mathbf{y}$ be any two points. After the transformation, the squared distance between the two becomes $D' = \|A(\mathbf{x}-\mathbf{y})\|^2 = (\mathbf{x}-\mathbf{y})^T A^T A(\mathbf{x}-\mathbf{y})$. Comparing this to the original squared distance $D = \|\mathbf{x}-\mathbf{y}\|^2$, we get their difference as

$$D' - D = (\mathbf{x} - \mathbf{y})^T(A^T A - I)(\mathbf{x} - \mathbf{y}) \quad (6)$$

We study the properties of $A^T A - I$ to find how distances change. First, for a random invertible and diagonalizable (Bhatia, 1997) matrix $A$ that preserves dimensions, i.e., $k = d$, $A^T A$ is positive definite for the following reason. Since $A$ is diagonalizable, $A$ can be eigen-decomposed to $U^T\Lambda U$, where $U$ is an orthogonal matrix, $\Lambda$ is the diagonal matrix of eigenvalues, and all eigenvalues are non-zero by the invertibility of $A$. Then we have $A^T A = U^T\Lambda U U^T\Lambda U = U^T\Lambda^2 U$, where all eigenvalues of $\Lambda^2$ are positive. Therefore, $A^T A$ is positive definite. If all eigenvalues of $A^T A$ are greater than 1, then $A^T A - I$ is positive definite and $D' - D > 0$ for all distances. Similarly, if all eigenvalues of $A^T A$ are less than 1, then $A^T A - I$ is negative definite and $D' - D < 0$ for all distances. In any other case, we cannot determine how distances change: they can be lengthened or shortened. Because of the possibly arbitrary change of distances under an arbitrary $A$, the points belonging to one cluster may become members of another cluster. Since we define classes based on clusters, the change of clustering structure may also perturb the decision boundary.

Then, what kinds of perturbation preserve clustering structures? Besides distance preserving perturbations, we may also use a family of distance-ordering preserving perturbations. Assume x, y, u, v are any four points in the original space with ||x − y|| ≤ ||u − v||, i.e., \sum_{i=1}^{d} (x_i − y_i)^2 ≤ \sum_{i=1}^{d} (u_i − v_i)^2, which defines the order of distances, where x_i, y_i, u_i, and v_i are dimensional values. It is easy to verify that if distance ordering is preserved after transformation, i.e., ||G(x) − G(y)|| ≤ ||G(u) − G(v)||, the clustering structure is preserved as well and thus the decision boundary is preserved. Therefore, distance-ordering preserving perturbation is an alternative to rotation perturbation.

In the following we discuss how to find a distance-ordering preserving perturbation. Let λ_i², i = 1, ..., d, be the eigenvalues of A^T A. Then the distance-ordering preserving property requires \sum_{i=1}^{d} λ_i^2 (x_i − y_i)^2 ≤ \sum_{i=1}^{d} λ_i^2 (u_i − v_i)^2. Apparently, for an arbitrary A this condition cannot be satisfied. One simple setting that guarantees to preserve distance ordering is λ_i = λ, where λ is some constant. This results in distance-ordering preserving matrices A = λR, where R is a rotation matrix and λ is an arbitrary constant; we name it scaling of the rotation matrix. Based on this analysis, we can also derive approximate distance-ordering preservation by perturbing λ_i to λ + δ_i, where δ_i is a small value drawn from a distribution. In fact, scaling is also discussed in transformation-based data perturbation for privacy preserving clustering (Oliveira and Zaiane, 2010).
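A minimal sketch of such a scaled rotation, again with numpy assumed: A = λR multiplies every pairwise distance by λ, so the relative order of all distances, and hence the clustering structure, is unchanged.

import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 50
X = rng.normal(size=(d, n))

R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random rotation (orthogonal)
lam = 3.7                                      # arbitrary scaling constant
A = lam * R                                    # distance-ordering preserving

def pairwise(M):
    diff = M[:, :, None] - M[:, None, :]
    return np.sqrt(np.sum(diff ** 2, axis=0))

orig, pert = pairwise(X), pairwise(A @ X)
print(np.allclose(pert, lam * orig))           # True: every distance scaled by lam

iu = np.triu_indices(n, k=1)
o, p = orig[iu], pert[iu]
# every pair of distances keeps its relative order after the perturbation
print(np.all((o[:, None] <= o[None, :]) == (p[:, None] <= p[None, :])))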

6. Attack Analysis and Privacy Guarantee of Geometric Data Perturbation

The goal of random geometric perturbation is twofold: preserving the data utility and preserving the data privacy. The discussion about transformation-invariant classifiers has shown that geometric transformations theoretically guarantee preserving the model accuracy for many models. As a result, numerous such geometric perturbations can deliver the same model accuracy, and we only need to find one that maximizes the privacy guarantee in terms of various potential attacks.

We dedicate this section to discussing how good the geometric perturbation approach is at preserving privacy. The first critical step is to define a multi-column privacy measure for evaluating the privacy guarantee of a geometric perturbation to a given dataset. It should be distinct from the measure used for additive perturbation (Agrawal and Srikant, 2000), which assumes each column is independently perturbed, since geometric perturbation changes the data on all columns (dimensions) together. We will use this multi-column privacy metric to evaluate several attacks and optimize the perturbation in terms of attack resilience.

6.1. A Conceptual Privacy Model for Multidimensional Perturbation

Unlike the existing random noise addition methods, where multiple columns are perturbed independently, random geometric perturbation needs to perturb all columns together. Therefore, the privacy quality of all columns is correlated under one single transformation and should be evaluated under a unified metric. We first present a conceptual model for privacy evaluation in this section, and then we will discuss the design of the unified privacy metric and a framework for incorporating attack evaluation.

In practice, since different columns (attributes) may have different privacy concerns, we consider that the general-purpose privacy metric Φ for the entire dataset should be based on column privacy metrics. A conceptual privacy evaluation model is defined as follows. Let p be the column privacy metric vector, p = (p_1, p_2, ..., p_d), and let there be privacy weights associated with the d columns, denoted as w = (w_1, w_2, ..., w_d). Without loss of generality, we assume that the weights are normalized, i.e., \sum_{i=1}^{d} w_i = 1. Then Φ = Φ(p, w) defines the privacy guarantee. In summary, the design of the specific privacy model should consider the three factors p, w, and the function Φ.

We will leave the concrete discussion about the design of p to the next section and define the other two factors first. We notice that different columns may have different importance in terms of the level of privacy-sensitivity. The first design idea is to take the column importance into consideration. Intuitively, the more important the column is, the higher the level of privacy guarantee required for the perturbed data corresponding to that column. If we use w_i to denote the importance of column i in terms of preserving privacy, p_i/w_i can be used to represent the weighted column privacy for column i.


The second intuition is the concept of minimum privacy guarantee among all columns. Normally, when we measure the privacy quality of a multi-column perturbation, we need to pay special attention to the column that has the lowest weighted column privacy, because such a column could become the breaking point of privacy. Hence, we design the first composition function Φ_1 = \min_{i=1}^{d} \{ p_i / w_i \} and call it the minimum privacy guarantee. Similarly, the average privacy guarantee of the multi-column perturbation, defined by Φ_2 = \frac{1}{d} \sum_{i=1}^{d} p_i / w_i, could be another interesting measure.
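As a small illustration of the two composition functions, a sketch (the function and variable names are ours, not the paper's):

import numpy as np

def privacy_guarantees(p, w):
    # Phi_1 = min_i p_i / w_i   (minimum privacy guarantee)
    # Phi_2 = (1/d) sum_i p_i / w_i   (average privacy guarantee)
    weighted = np.asarray(p, float) / np.asarray(w, float)
    return weighted.min(), weighted.mean()

# toy example: the third column is twice as privacy-sensitive as the others
p = [0.25, 0.30, 0.20, 0.35]       # per-column privacy metrics
w = [0.2, 0.2, 0.4, 0.2]           # normalized importance weights
phi1, phi2 = privacy_guarantees(p, w)
print(phi1, phi2)                  # 0.5 1.25: the sensitive column drives Phi_1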

With the definition of privacy guarantee, we can evaluate the privacy quality of a perturbation for a specific dataset and, most importantly, use it to find a multidimensional perturbation that locally maximizes the privacy guarantees. With random geometric perturbation, we demonstrate that it is convenient to adjust the perturbation method to obtain high privacy guarantees, without the concern of preserving the model accuracy for the discussed classifiers.

6.1.1. A Unified Column Privacy Metric

Intuitively, for a data perturbation approach, the quality of preserved privacy can be understood as the difficulty of estimating the original data from the perturbed data. We name such estimation methods "inference attacks". A unified metric should be a generic metric that can be used to evaluate as many types of inference attacks as possible. In the following, we first derive a unified privacy metric from the mean-square-error method, and then discuss how to apply the metric to evaluate attacks on geometric perturbation.

We compare the original value and the estimated value to determine the uncertainty brought by the perturbation. This uncertainty is the privacy guarantee that protects the original value. Let the difference between the original column data Y and the perturbed/reconstructed data Ŷ be a random variable D. We use the root mean square error (RMSE) to estimate this difference. Assume the original data samples are y_1, y_2, ..., y_N and, correspondingly, the estimated values are ŷ_1, ŷ_2, ..., ŷ_N. The root mean square error of estimation, r, is defined as

r = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (y_i − ŷ_i)^2 }

As we have discussed, to evaluate the privacy quality of a multi-dimensional perturbation, we need to evaluate the privacy of all perturbed columns together. Unfortunately, the single-column metric is subject to the specific column distribution, i.e., the same amount of error is not equally effective for different value scales. For example, the same amount of RMSE for the "age" column gives much stronger protection than for "salary", due to the dramatically different value ranges. One effective way to unify the different value ranges is normalization, e.g., max-min normalization or standardization. We employ the standardization procedure, which is simply described as the transformation y' = (y − µ)/σ, where µ is the mean of the column and σ is the standard deviation. By this procedure all columns are approximately unified into the same data range. The rationale behind the standardization procedure is that for a large sample set (e.g., hundreds of samples) the normal distribution is a good approximation for most distributions (Lehmann and Casella, 1998). The standardization procedure normalizes all distributions to the standard normal distribution (with mean zero and variance one). For the normal distribution, the range [µ − 2σ, µ + 2σ] covers more than 95% of the points in the population. We use this range, i.e., 4σ, to approximately represent the value range.


Fig. 6. The intuition behind the privacy metric: the original value could be in the range [v − r, v + r] around the estimated value v within the ±2σ bounds, and 1 − r/2σ reflects the risk of privacy breach.

We use the normalized values, the definition of RMSE, and the normalized value range to represent the unified privacy metric:

Priv(Y, Ŷ) = \frac{2r}{4σ} = \frac{1}{2} \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i − µ}{σ} − ŷ_i \right)^2 }

This definition³ can be explained with Figure 6. The normalized RMSE r = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} ( \frac{y_i − µ}{σ} − ŷ_i )^2 } represents the average error of value estimation. The real value can be in the range [ŷ_i − r, ŷ_i + r]. The ratio of this range, 2r, to the value range, 4σ, represents the normalized uncertainty of estimation. This ratio can possibly be higher than 1, which means extremely large uncertainty. If the original data is standardized (σ = 1), this metric reduces to p = r/2.
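A direct transcription of this metric into code (a sketch; the estimate is assumed to live in the standardized space, as in the discussion above):

import numpy as np

def column_privacy(y, y_est):
    # Priv = 2r / (4*sigma): standardize the original column, compute the
    # RMSE against the estimate, and divide by 2 (sigma = 1 after standardization).
    y = np.asarray(y, float)
    y_norm = (y - y.mean()) / y.std()
    r = np.sqrt(np.mean((y_norm - np.asarray(y_est, float)) ** 2))
    return r / 2

rng = np.random.default_rng(2)
age = rng.normal(40, 10, size=500)       # original column
naive_guess = np.zeros(500)              # attacker guesses the (standardized) mean
print(column_privacy(age, naive_guess))  # ~0.5: the error spans half of 2*sigma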

6.1.2. Incorporating Attack Analysis into Privacy Evaluation

The proposed metric should compare two datasets: the original dataset and the observed or estimated dataset. With different levels of knowledge, the attacker observes the perturbed dataset differently. The attacks we know so far can be summarized into the following categories: (1) basic statistical methods that estimate the original data directly from the perturbed data (Agrawal and Srikant, 2000; Agrawal and Aggarwal, 2002), without any other knowledge about the data (known as "naive inference"); (2) data reconstruction methods that reconstruct data from the perturbed data with any released information about the data and the perturbation, and then use the reconstructed data to estimate the original data (Kargupta et al., 2003; Huang et al., 2005) (known as "reconstruction-based inference"); and (3) if some particular original records and their images in the perturbed data can be identified, e.g., outliers of the datasets, based on the preserved distance information, the mapping between these points can be used to discover the perturbation (known as "distance-based inference").

Let X be the normalized original dataset, P be the perturbed dataset, and O be the observed dataset. We calculate Φ(X, O), instead of Φ(X, P), in terms of different attacks. Using rotation perturbation G(X) = RX as an example, we can summarize the evaluation of privacy in terms of attacks.

1. Naive inference: O = P; there is no more accurate estimation than the released perturbed data.

2. Reconstruction-based inference: methods like Independent Component Analysis (ICA) are used to estimate R. Let R̂ be the estimate of R; then O = R̂^{−1}P.

3 Note that this definition is improved from the one we gave in the SDM paper (Chen et al., 2007).


3. Distance-based inference: the attacker knows a small set of special points in X that can be mapped to a certain set of points in P, so that the mapping helps to estimate R, and then O = R̂^{−1}P.

4. Subset-based inference: the attacker knows a significant number of original points that can be used to estimate R, and then O = R̂^{−1}P.

The higher the inference level is, the more knowledge about the original dataset the attacker needs to break the perturbation. In the following sections, we analyze some inference attacks and see how geometric perturbation provides resilience to these attacks.

Note that the proposed privacy evaluation method is a generic method that can be used to evaluate the effectiveness of a general perturbation, where nothing but the perturbed data is released to the attacker. It is important to remember that this metric should be evaluated on the original data and the estimated data. We cannot simply treat the perturbed data as the estimated data, as the original additive perturbation work does (Agrawal and Srikant, 2000), which assumes that the attacker has no knowledge about the original data.

6.2. Attack Analysis and Perturbation Optimization

In this section, we use the unified multi-column privacy metric to analyze a few attacks. A similar methodology can be used to analyze any new attack. Based on the analysis, we develop a randomized perturbation optimization algorithm.

6.2.1. Privacy Analysis on Naive Estimation Attack

We start with the analysis of multiplicative perturbation, which is the key component in geometric perturbation. With the proposed metric over the normalized data, we can formally analyze the privacy quality of random rotation perturbation. Let X be the normalized dataset, X' = RX be the rotation of X, and I_d be the d-dimensional identity matrix. The difference matrix X' − X can be used to calculate the privacy metric, and the column-wise metric is based on the element (i,i) in K = (X' − X)(X' − X)^T (note that X and X' are column vector matrices as we defined), i.e., \sqrt{K_{(i,i)}}/2, where K_{(i,i)} is represented as

K_{(i,i)} = ((R − I_d) X X^T (R − I_d)^T)_{(i,i)}    (7)

Since X is normalized, XX^T is also the covariance matrix, whose diagonal elements are the column variances. Let r_{ij} represent the element (i,j) in the matrix R, and c_{ij} be the element (i,j) in the matrix XX^T. Then K_{(i,i)} is transformed to

K_{(i,i)} = \sum_{j=1}^{d} \sum_{k=1}^{d} r_{ij} r_{ik} c_{kj} − 2 \sum_{j=1}^{d} r_{ij} c_{ij} + c_{ii}    (8)

When the random rotation matrix is generated following the Haar distribution, a considerable number of matrix entries approximately follow the independent normal distribution N(0, 1/d) (Jiang, 2005). A full discussion of the numerical characteristics of the random rotation matrix is out of the scope of this paper. For simplicity and ease of understanding, we assume that all entries in a random rotation matrix approximately follow the independent normal distribution N(0, 1/d). Therefore, sampled random rotations should


make K_{(i,i)} vary around the mean value c_{ii}, as shown in the following result:

E[K_{(i,i)}] ∼ \sum_{j=1}^{d} \sum_{k=1}^{d} E[r_{ij}] E[r_{ik}] c_{kj} − 2 \sum_{j=1}^{d} E[r_{ij}] c_{ij} + c_{ii} = c_{ii}

This means that the original column variance could substantially influence the result of random rotation. However, E[K_{(i,i)}] is not the only factor determining the final privacy guarantee. We should also look at the variance of K_{(i,i)}. If the variance is considerably large, we still have a good chance of getting a rotation with large K_{(i,i)} in a set of sampled random rotations, and the larger the variance is, the more likely the randomly generated rotation matrices can provide a high privacy level. With the simplifying assumption, we can also roughly estimate the factors that contribute to the variance.

Var(K_{(i,i)}) ∼ \sum_{j=1}^{d} \sum_{k=1}^{d} Var(r_{ij}) Var(r_{ik}) c_{kj}^2 + 4 \sum_{j=1}^{d} Var(r_{ij}) c_{ij}^2
∼ O\left( \frac{1}{d^2} \sum_{j=1}^{d} \sum_{k=1}^{d} c_{kj}^2 + \frac{4}{d} \sum_{j=1}^{d} c_{ij}^2 \right)    (9)

The above result shows that the variance is approximately related to the average of the squared covariance entries, with more influence from row i of the covariance matrix.

A simple method is to select the best rotation matrix among a batch of randomly generated rotation matrices. But we can do better, and be more efficient, within a limited number of iterations. From Equation 8 we also notice that the i-th row vector of the rotation matrix, i.e., the values r_{i*}, plays a dominating role in calculating the metric. Hence, it is possible to simply swap the rows of R to locally improve the overall privacy guarantee, which leads us to propose a row-swapping based fast local optimization method for finding a better rotation from a given rotation matrix. This method can significantly reduce the search space and thus provides better efficiency. Our experiments show that, with the local optimization, the minimum privacy level can be increased by about 10% or more. We formalize the swapping-maximization method as follows (a code sketch follows Equation 10). Let {(1), (2), ..., (d)} be a permutation of the sequence {1, 2, ..., d}, and let the importance levels of privacy preservation for the columns be w = (w_1, w_2, ..., w_d). The goal is to find the permutation of rows of a given rotation matrix that results in a new rotation matrix maximizing the minimum or average privacy guarantee:

\arg\max_{\{(1),(2),...,(d)\}} \min_{1 ≤ i ≤ d} \left\{ \left( \sum_{j=1}^{d} \sum_{k=1}^{d} r_{(i)j} r_{(i)k} c_{kj} − 2 \sum_{j=1}^{d} r_{(i)j} c_{ij} + c_{ii} \right) / w_i \right\}    (10)

Since the matrix R' generated by swapping the rows of R is still a rotation matrix, the above local optimization step does not change the rotation-invariance property of the discussed classifiers.
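The following sketch illustrates the row-swapping local search (numpy assumed; pairwise swaps stand in for a full permutation search, which keeps the cost low):

import numpy as np

def naive_privacy_vector(R, C):
    # Diagonal of K = (R - I) C (R - I)^T (Equations 7 and 8): per-column
    # squared deviation caused by the rotation, on normalized data.
    M = R - np.eye(R.shape[0])
    return np.diag(M @ C @ M.T)

def swap_optimize(R, C, w, rounds=200, rng=None):
    # Local search over row swaps of R, maximizing min_i K_(i,i) / w_i.
    rng = rng or np.random.default_rng()
    d = R.shape[0]
    best = R.copy()
    best_score = (naive_privacy_vector(best, C) / np.asarray(w, float)).min()
    for _ in range(rounds):
        i, j = rng.choice(d, size=2, replace=False)
        cand = best.copy()
        cand[[i, j]] = cand[[j, i]]        # swapping rows keeps R orthogonal
        score = (naive_privacy_vector(cand, C) / np.asarray(w, float)).min()
        if score > best_score:
            best, best_score = cand, score
    return best, best_score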

Attacks to Rotation Center. The basic rotation perturbation uses the origin as the rotation center. Therefore, the points closely around the origin are still around the origin after the perturbation, which leads to weaker privacy protection for these points. We address this problem with random translation, so that the weakly perturbed points around the rotation center are not detectable due to the randomness of the rotation center.


Fig. 7. Weak points around the default rotation center.

Attacks to translation perturbation depend on the success of the attack to rotation perturbation, which will be described in later sections.

6.2.2. Privacy Analysis on ICA-based Attack

The unified privacy metric evaluates the privacy guarantee and the resilience against the first type of privacy attack, the naive inference. Considering reconstruction-based inference, we identify Independent Component Analysis (ICA) (Hyvarinen et al., 2001) as potentially the most powerful method for estimating the original dataset X when additional column statistics are known by the attacker. We dedicate this section to analyzing ICA-based attacks with the unified privacy metric.

Requirements of Effective ICA. ICA is a fundamental problem in signal processing that is highly effective in several applications, such as blind source separation (Hyvarinen et al., 2001) of mixed electro-encephalographic (EEG) signals, audio signals, and the analysis of functional magnetic resonance imaging (fMRI) data. Let the matrix X be composed of source signals, where row vectors represent source signals. Suppose we can observe the mixed signals X', which are generated by the linear transformation X' = AX. The ICA model can be applied to estimate the independent components (the row vectors) of the original signals X from the mixed signals X', if the following conditions are satisfied:

1. The source signals are independent, i.e., the row vectors of X are independent;
2. All source signals must be non-Gaussian, with the possible exception of one signal;
3. The number of observed signals, i.e., the number of row vectors of X', must be at least as large as the number of independent source signals;
4. The transformation matrix A must be of full column rank.

For rotation matrices and full-rank random projection, the third and fourth conditions are always satisfied. However, the first two conditions, especially the independency condition, although practical for signal processing, seem not very common in data mining. Concretely, there are a few basic difficulties in applying the above ICA-based attack to rotation-based perturbation. First of all, if there is significant dependency between any attributes, ICA fails to precisely reconstruct the original data, and thus cannot be used to effectively detect the private information. Second, even if ICA can be done successfully, the order of the original independent components cannot be preserved or determined through ICA (Hyvarinen et al., 2001). Formally, any permutation matrix P


and its inverse P^{−1} can be substituted in the model to give X' = AP^{−1}PX. ICA could possibly give the estimate for some permuted source PX. Thus, we cannot identify a particular column if the original column distributions are unknown. Finally, even if the ordering of columns can be identified, ICA reconstruction does not guarantee to preserve the variance of the original signal: the estimated signal may scale the original one, but we do not know by how much without knowing the statistical properties of the original column.

In summary, without the necessary knowledge about the original dataset, the attacker cannot simply use ICA reconstruction. In case attackers know enough distributional information, including the maximum/minimum values and the probability density functions (PDFs) of the original columns, the effectiveness of ICA reconstruction will totally depend on the independency condition of the original columns. We observed in experiments that, since pure independency does not exist in real datasets, we can still tune the rotation perturbation to find one resilient enough to ICA-based attacks, even when the attacker knows the column statistics. In the following, we analyze how sophisticated ICA-based attacks can be performed and develop a simulation-based method to evaluate the resilience of a particular perturbation.

ICA Attacks with Known Column Statistics. When basic statistics, such as the max/min values and the PDF of each column, are known, ICA data reconstruction can possibly be done more effectively. We assume that ICA is quite effective on the dataset (i.e., the four conditions are approximately satisfied) and the column PDFs are distinctive. Then the reconstructed columns can be approximately matched to the original columns by comparing the PDFs of the reconstructed and original columns. When the maximum/minimum values of columns are known, the reconstructed data can be scaled to the proper value ranges. We define an enhanced attack with the following procedure.

1. Run the ICA algorithm to get reconstructed data;
2. Estimate column distributions for the reconstructed columns, and for each reconstructed column find the closest match to an original column by comparing their column distributions;
3. Scale the columns with the corresponding maximum/minimum values of the original columns.

Note that if the four conditions for effective ICA are exactly satisfied and the basic statistics and PDFs are all known, the basic rotation perturbation approach will not work. However, in practice, since the independency conditions are not all satisfied for most datasets in classification, we observed that different rotation perturbations may result in different quality of privacy, and it is possible to find one rotation that is considerably resilient to the enhanced ICA-based attacks. For this purpose, we can simulate the enhanced ICA attack to evaluate the privacy guarantee of a rotation perturbation. Concretely, it can be done in the following steps.

The first step is called "PDF alignment". We need to calculate the similarity between the PDF of the original column and that of the reconstructed data column to find the best matches between the two sets of columns. A direct method is to calculate the difference between the two PDF functions. Let f(x) and g(x) be the original PDF and the PDF of the reconstructed column, respectively. A typical method to define the difference of PDFs employs the following function:

ΔPDF = \int |f(x) − g(x)| dx    (11)


Fig. 8. Comparing PDFs in different ranges results in large error. (The lined areas are calculated as the difference between the PDFs.)

In practice, for easy calculation we discretize the PDF into bins, which is equivalent to using the discretized version \sum_{i=1}^{n} |f(b_i) − g(b_i)|, where b_i is the discretized bin i. The discretized version is easy to implement by comparing two histograms with the same number of bins. However, the evaluation is not accurate if the values in the two columns are not in the same range, as shown in Figure 8. Hence, the reconstructed PDF needs to be translated to match the range, which requires knowing the maximum/minimum values of the original column. Since the original column is already scaled to [0, 1] when calculating the unified privacy metric, we can just scale the reconstructed data to [0, 1], making it consistent with the normalized original data (Section 6.1.1). Meanwhile, this also scales the reconstructed data down so that the variance range is consistent with the original column. As a result, after the PDF alignment step, we can directly calculate the privacy metrics between the matched columns to measure the privacy quality.

Without loss of generality, we suppose that the level of confidence for an attack is primarily based on the PDF similarity between the two matched columns. Let O be the reconstruction of the original dataset X, and let ΔPDF(X_i, O_j) represent the PDF difference between column i in X and column j in O. Let {(1), (2), ..., (d)} be a permutation of the sequence {1, 2, ..., d}, denoting a match from the original column i to the reconstructed column (i), and let an optimal match minimize the sum of PDF differences over all pairs of matched columns. We define the minimum privacy guarantee based on the optimal match as follows:

p_min = \min\left\{ \frac{1}{w_k} Priv(X_k, O_{(k)}), 1 ≤ k ≤ d \right\}    (12)

where {(1), (2), ..., (d)} = \arg\min_{\{(1),(2),...,(d)\}} \sum_{i=1}^{d} ΔPDF(X_i, O_{(i)}). Similarly, we can define the average privacy guarantee based on an optimal match.

With the above multi-column metric, we are able to estimate how resilient a rotation perturbation is to an ICA-based attack equipped with known column statistics. We observed in experiments that, although the ICA method may effectively reduce the privacy guarantee for certain rotation perturbations, we can always find some rotation matrices that provide satisfactory privacy guarantees against ICA-based attacks.
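A sketch of how such a simulated attack could look, assuming scikit-learn's FastICA is available. Here col_stats (per-column density histogram over [0, 1], plus min/max) is a hypothetical container for the attacker's known statistics, and a greedy histogram match stands in for the optimal permutation of Equation 12:

import numpy as np
from sklearn.decomposition import FastICA   # assumes scikit-learn is installed

def ica_attack(P, col_stats):
    # P: perturbed data, d x N, columns are records.
    # col_stats: hypothetical list of (hist, lo, hi) per original column.
    d, N = P.shape
    ica = FastICA(n_components=d, random_state=0)
    S = ica.fit_transform(P.T).T            # estimated source rows, d x N

    # scale every reconstructed row to [0, 1] before comparing PDFs
    S = (S - S.min(axis=1, keepdims=True)) / np.ptp(S, axis=1, keepdims=True)

    recon, used = np.empty_like(S), set()
    for i, (hist, lo, hi) in enumerate(col_stats):
        # match the reconstructed row whose histogram is closest (Equation 11)
        diffs = [np.abs(np.histogram(S[j], bins=len(hist), range=(0, 1),
                                     density=True)[0] - hist).sum()
                 if j not in used else np.inf for j in range(d)]
        j = int(np.argmin(diffs))
        used.add(j)
        recon[i] = lo + S[j] * (hi - lo)    # rescale to the known value range
    return recon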

6.2.3. Attacks to Translation Perturbation

Previously, we used random translation to address the weak protection of the points around the rotation center. We now examine how translation perturbation is attacked when the ICA-based attack is applied.

Let each dimensional value of the random translation vector t be uniformly drawn from the range [0, 1], so that the center hides in the normalized data space. The perturbation can be represented as

f(X) = RX + Ψ = R(X + R^{−1}Ψ)


Fig. 9. Using known points and distance relationships to infer the rotation matrix.

It is easy to verify that T = R^{−1}Ψ is also a translation matrix. An effective attack estimating the translation component would be based on the ICA inference of R, followed by removal of the component R^{−1}Ψ based on the unknown distribution of X. Concretely, the process can be described as follows.

By applying the ICA attack, the estimate of X + T is X̂ + T̂ = R̂^{−1}P. Suppose the original column i has maximum and minimum values max_i and min_i, respectively, and the corresponding column of R̂^{−1}P has max'_i and min'_i. Since translation does not change the shape of column PDFs, we can align the column PDFs first. As scaling is one of the major effects of ICA estimation, we rescale the reconstructed column with a factor s, which can be estimated by s ≈ (max_i − min_i) / (max'_i − min'_i). Multiplying by s scales column i of R̂^{−1}P down to the same span as X. Then we can extract the translation t_i for column i with

t_i ≈ min'_i × s − min_i

Since the quality of this estimation depends entirely on the quality of the ICA reconstruction of the rotation perturbation, a good rotation perturbation protects the translation perturbation as well. We will show experimental results on how well the translation component can be protected.

6.2.4. Privacy Analysis on Distance-inference Attack

In the previous sections, we have discussed naive-inference attacks and ICA-based attacks. In the following discussion, we assume that, besides the information necessary to perform these two kinds of attacks, the attacker manages to get more knowledge about the original dataset: s/he also knows at least d+1 original data points, {x_1, x_2, ..., x_{d+1}}, d of which are linearly independent. Since the basic geometric perturbation preserves the distances between points, the attacker can possibly find the mapping between these points and their images in the perturbed dataset, {o_1, o_2, ..., o_{d+1}}, if the point distribution is peculiar, e.g., the points are outliers. With the known mapping, the rotation component R and the translation component t can be calculated accordingly. There is also discussion of the scenario where the attacker knows fewer than d points (Liu, Giannella and Kargupta, 2006).

The mapping might be identified precisely for low-dimensional small datasets (< 4 dimensions). With considerable cost, it is not impossible for higher-dimensional, larger datasets via simple exhaustive search if the known points have a special distribution. There may be multiple matches, but the threat can be substantial.

So far we have assumed the attacker has obtained the right mapping between the known points and their images. In order to protect against the distance-inference attack, we


use the noise component Δ to protect the geometric perturbation: G(X) = RX + Ψ + Δ. After we append the distance perturbation component, the original points and their images become {x_1, x_2, ..., x_{d+1}} → {o_1, o_2, ..., o_{d+1}}, o_i = Rx_i + t + ε_i, where ε_i is the noise. The attacker can perform a linear-regression-based estimation as follows.

1) R is estimated with the known mapping. The translation vector t can be canceled from the perturbation, giving d equations: o_i − o_{d+1} = R(x_i − x_{d+1}) + ε_i − ε_{d+1}, 1 ≤ i ≤ d. Let Ō = [o_1 − o_{d+1}, o_2 − o_{d+1}, ..., o_d − o_{d+1}], X̄ = [x_1 − x_{d+1}, x_2 − x_{d+1}, ..., x_d − x_{d+1}], and ε̄ = [ε_1 − ε_{d+1}, ε_2 − ε_{d+1}, ..., ε_d − ε_{d+1}]. The equations are unified to Ō = RX̄ + ε̄, and estimating R becomes a linear regression problem. The minimum-variance unbiased estimator for R is R̂ = ŌX̄^T(X̄X̄^T)^{−1} (Hastie et al., 2001).

2) With R̂, the translation vector t can also be estimated. Since o_i − Rx_i − ε_i = t and ε_i has mean zero, with R̂ the attacker estimates t as t̂ = \frac{1}{d+1}\{ \sum_{i=1}^{d+1} (o_i − R̂x_i) − \sum_{i=1}^{d+1} ε_i \} ≈ \frac{1}{d+1} \sum_{i=1}^{d+1} (o_i − R̂x_i). t̂ will carry some variance introduced by the components R̂ and ε_i.

3) Finally, the original data X can be estimated as follows. As O = RX + Ψ + Δ, using the estimators R̂ and Ψ̂ = [t̂, ..., t̂], we get X̂ = R̂^{−1}(O − Ψ̂). Due to the variance introduced by R̂, Ψ̂, and Δ, the attacker may need several runs to get the average of the estimated X̂ in practice.
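The attack is straightforward to simulate (a sketch with numpy; d + 1 known points, a rotation R, a translation t, and Gaussian noise of level sigma):

import numpy as np

rng = np.random.default_rng(4)
d, N, sigma = 4, 200, 0.1
X = rng.normal(size=(d, N))                        # original data, columns are records
R, _ = np.linalg.qr(rng.normal(size=(d, d)))       # rotation component
t = rng.uniform(size=(d, 1))                       # translation component
O = R @ X + t + rng.normal(scale=sigma, size=(d, N))   # G(X) = RX + Psi + Delta

# the attacker knows d+1 original points and their images
kx, ko = X[:, :d + 1], O[:, :d + 1]
Xb = kx[:, :d] - kx[:, [d]]                        # step 1: cancel the translation
Ob = ko[:, :d] - ko[:, [d]]
R_hat = Ob @ Xb.T @ np.linalg.inv(Xb @ Xb.T)       # regression estimate of R
t_hat = np.mean(ko - R_hat @ kx, axis=1, keepdims=True)   # step 2: estimate t
X_hat = np.linalg.inv(R_hat) @ (O - t_hat)         # step 3: estimate X

print(np.sqrt(np.mean((X_hat - X) ** 2)))          # estimation error grows with sigma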

By simulating the above process, we are able to estimate the effectiveness of the added noise. As we have discussed, as long as the geometric boundary is preserved, the geometric perturbation with noise addition can preserve the model accuracy. We have formally analyzed the effect of the noise component on model quality in Section 5.5. We will further study the relationship between the noise level, the privacy guarantee, and the model accuracy in the experiments.

6.2.5. Privacy Analysis on Other Attacks

We have studied a few attacks, according to the different levels of knowledge that an attacker may have. There are also studies of the extreme case where the attacker knows a considerable number of points (≫ d) in the original dataset. In this case, classical methods such as Principal Component Analysis (PCA) (Liu, Giannella and Kargupta, 2006) and ICA (Guo, Wu and Li, 2008) can be used to reconstruct the original dataset with the higher-order statistical information derived from both the known points and the perturbed data. In order for these methods to be effective, the known points should be representative of the original data distribution, so that higher-order statistics can be preserved, such as the covariance matrix of the original dataset on which both PCA- and ICA-based methods depend. As a result, what portion of the samples is known by the attacker and how different the known sample distribution is from the original one become the important factors for the success of such attacks. Most importantly, these attacks become less meaningful in practice: when a large number of points have been cracked, it is too late to protect data privacy and security. In addition, outliers in the dataset may be easy targets if additional knowledge about the original outliers is available. Further study should be performed on outlier-based attacks. We leave these issues for future study.

6.2.6. A Randomized Algorithm for Finding a Better Perturbation

We have discussed the unified privacy metric for evaluating the quality of a random geometric perturbation. Three kinds of inference attacks have been analyzed under the framework


of multi-column privacy evaluation, based on which we design an algorithm to choose good geometric perturbations that are resilient to the discussed attacks. In addition, the algorithm itself, even if published, should not be a weak point in privacy protection. Since a deterministic optimization algorithm may provide extra clues to privacy attackers, we bring randomization into the optimization process.

Algorithm 6.1 runs for a given number of iterations, aiming at maximizing the minimum privacy guarantee. At the beginning, a random translation is selected. In each iteration, the algorithm randomly generates a rotation matrix. Local maximization through swapping rows is then applied to find a better rotation matrix. The candidate rotation matrix is then tested against the ICA-based attacks of Section 6.2.2, assuming the attacker knows the column statistics. The rotation matrix is accepted as the currently best perturbation if it provides a higher minimum privacy guarantee, in terms of both naive estimation and ICA-based attacks, than the previous perturbations. Finally, the noise component is appended to the perturbation, so that the distance-inference attack cannot reduce the privacy guarantee below a safety level φ, e.g., φ = 0.2. Algorithm 6.1 outputs the rotation matrix R_t, the random translation matrix Ψ, the noise level σ², and the corresponding minimum privacy guarantee. If the privacy guarantee is lower than the anticipated threshold, the data owner can choose not to release the data. Note that this optimization process is applied to a sample set of the data; the cost is therefore manageable even for a very large original dataset.

Algorithm 6.1 Finding a Good Perturbation (X_{d×N}, w, m)
Input: X_{d×N}: the original dataset; w: weights of attributes in privacy evaluation; m: the number of iterations.
Output: R_t: the selected rotation matrix; Ψ: the random translation; σ²: the noise level; p: the privacy guarantee.
  calculate the covariance matrix C of X;
  p = 0, and randomly generate the translation Ψ;
  for each iteration do
    randomly generate a rotation matrix R;
    swap the rows of R to get R', which maximizes min_{1≤i≤d} { (1/w_i) (Cov(R'X − X))_{(i,i)} };
    p0 = the privacy guarantee of R' under naive estimation; p1 = 0;
    if p0 > p then
      generate the reconstruction X̂ with ICA;
      {(1), (2), ..., (d)} = argmin_{{(1),(2),...,(d)}} Σ_{i=1}^{d} ΔPDF(X_i, X̂_{(i)});
      p1 = min_{1≤k≤d} (1/w_k) Priv(X_k, X̂_{(k)});
    end if
    if p < min(p0, p1) then
      p = min(p0, p1); R_t = R';
    end if
  end for
  p2 = the privacy guarantee under the distance-inference attack with the perturbation G(X) = R_tX + Ψ + Δ;
  tune the noise level σ², so that p2 ≥ p if p < φ, or p2 ≥ φ if p ≥ φ.
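A compact sketch of the main loop, reusing the earlier sketches; min_privacy_under and tune_noise_level are hypothetical helpers standing in for the attack simulations described in Sections 6.2.1-6.2.4:

import numpy as np

def find_good_perturbation(X, w, m, phi=0.2, rng=None):
    # A sketch of Algorithm 6.1: swap_optimize is the earlier row-swapping
    # sketch; min_privacy_under(attack, ...) and tune_noise_level(...) are
    # hypothetical helpers that simulate an attack and return the minimum
    # weighted privacy guarantee / a sufficient noise variance.
    rng = rng or np.random.default_rng()
    d = X.shape[0]
    C = np.cov(X)                          # covariance matrix of X
    psi = rng.uniform(size=(d, 1))         # random translation
    best_p, Rt = 0.0, None
    for _ in range(m):
        R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random rotation
        R, _ = swap_optimize(R, C, w)      # local row-swapping optimization
        p0 = min_privacy_under("naive", R, psi, X, w)
        if p0 > best_p:
            p1 = min_privacy_under("ica", R, psi, X, w)
            if best_p < min(p0, p1):
                best_p, Rt = min(p0, p1), R
    sigma2 = tune_noise_level(Rt, psi, X, w, best_p, phi)
    return Rt, psi, sigma2, best_p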

7. Experiments

We design four sets of experiments to evaluate the geometric perturbation approach. The first set is designed to show that the discussed classifiers are invariant to rotations and translations. In this set of experiments, general linear transformations, including


Dataset N d k | kNN: orig, R, A | SVM(RBF): orig, R, A | Perceptron: orig, R, A

Breast-w 699 10 2 97.6 −0.5± 0.3 −0.5± 0.3 97.2 0± 0 −0.2± 0.2 80.4 −8.7± 0.3 −8.0± 1.5

Credit-a 690 14 2 82.9 0± 0.8 −0.7± 0.8 85.5 0± 0 +0.9± 0.3 73.6 −7.3± 1.0 −8.8± 0.7

Credit-g 1000 24 2 72.9 −1.2± 0.9 −1.8± 0.8 76.3 0± 0 +0.9± 0.9 75.1 0± 0 −0.2± 0.2

Diabetes 768 8 2 73.3 +0.4± 0.5 −0.4± 1.4 77.3 0± 0 −3.6± 1.0 68.9 0.0± 0.7 −2.5± 2.8

E.Coli 336 7 8 85.7 −0.4± 0.8 −2.0± 2.2 78.6 0± 0 −4.3± 1.5 - - -

Heart 270 13 2 80.4 +0.6± 0.5 −1.7± 1.1 84.8 0± 0 −2.3± 1.0 75.6 −5.2± 0.3 −3.9± 1.1

Hepatitis 155 19 2 81.1 +0.8± 1.5 0± 1.4 79.4 0± 0 +3.3± 1.9 77.4 −1.2± 0.4 −1.8± 2.4

Ionosphere 351 34 2 87.4 +0.5± 0.6 −30.0± 1.0 89.7 0± 0 +0.4± 0.4 75.5 −3.5± 1.0 −5.6± 1.0

Iris 150 4 3 96.6 +1.2± 0.4 −2.0± 2.1 96.7 0± 0 −10.2± 3.0 - - -

Tic-tac-toe 958 9 2 99.9 −0.3± 0.4 −8.3± 0.4 98.3 0± 0 +1.6± 6.9 76.4 −5.3± 0.0 −5.2± 0.1

Votes 435 16 2 92.9 0± 0.4 −12.6± 3.1 95.6 0± 0 −0.5± 1.9 90.7 −4.3± 1.0 −8.3± 4.9

Wine 178 13 3 97.7 0± 0.5 −2.0± 0.2 98.9 0± 0 −5.7± 1.3 - - -

Table 1. Experimental results on transformation-invariant classifiers

dimensionality-preserving transformations and projection transformations, are also investigated to see the advantage of distance-preserving transformations. The second set shows the optimization of the privacy guarantee in geometric perturbation without the noise component, in terms of the naive-inference attack and the ICA-based attack. In the third set of experiments, we explore the relationship between the intensity of the noise component, the privacy guarantee, and the model accuracy, in terms of the distance-inference attack. Finally, we compare the overall privacy guarantee provided by our geometric perturbation and another multidimensional perturbation, the condensation approach. All datasets used in the experiments can be found in the UCI machine learning database⁴.

7.1. Classifiers Invariant to Rotation Perturbation

In this experiment, we verify the invariance property of several classifiers discussed in Section 5.1 with respect to rotation perturbation. Three classifiers, the kNN classifier, the SVM classifier with RBF kernel, and perceptron, are used as representatives. To show the advantage of distance-preserving transformations, we also test the invariance property of dimensionality-preserving general linear transformations and projection perturbation.

Each dataset is randomly rotated 10 times in the experiment. Each of the 10 resultant datasets is used to train and cross-validate the classifiers. The reported numbers are the averages over the 10 rounds of tests. We calculate the difference in model accuracy between the classifier trained with the original data and those trained with the rotated data.

In Table 1, 'orig' is the classifier accuracy on the original datasets and 'R' denotes

4 http://www.ics.uci.edu/∼mlearn/Machine-Learning.html


the result of the classifiers trained with rotated data; the numbers in the 'R' columns are the performance differences between the classifiers trained with original and rotated data. For example, "−1.0 ± 0.2" means that the classifiers trained with the rotated data have an accuracy rate 1.0% lower than the original classifier on average, with a standard deviation of 0.2%. We use single-perceptron classifiers in the experiment; therefore, the datasets having more than two classes, such as the "E.Coli", "Iris" and "Wine" datasets, are not evaluated for the perceptron classifier. 'A' denotes arbitrarily generated nonsingular linear perturbations that preserve the dimensionality of the original dataset. From this result, we can see that rotation perturbation almost fully preserves the model accuracy for all three classifiers, except that perceptron might be sensitive to rotation perturbation for some datasets (e.g., "Breast-w"). Arbitrarily generated linear perturbations may degrade the model accuracy considerably for some datasets, such as "Ionosphere" for kNN (−30.0%) and "Iris" for SVM (RBF) (−10.2%).

7.2. The Effect of Random Projection to Model Accuracy

To see whether random projection can safely replace the rotation perturbation component in geometric data perturbation, we perform a set of experiments to check how model accuracy is affected by random projection. We implement the standard random projection method (Vempala, 2005). Random projection is defined as

G(x) = \sqrt{d/k}\, R^T x,

where R is a d × k matrix with orthonormal columns. R can be generated in multiple ways. One simple method is to first generate a random matrix with each element drawn from the standard normal distribution N(0, 1), and then apply the Gram-Schmidt process (Bhatia, 1997) to orthogonalize the columns. From the Johnson-Lindenstrauss Lemma (Johnson and Lindenstrauss, 1984), we understand that the number of projected dimensions is the major factor affecting the accuracy of models. We examine this relationship in the experiment. Each result is based on the average of ten rounds of different random projections.
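A sketch of this construction (numpy's QR factorization plays the role of the Gram-Schmidt process):

import numpy as np

def random_projection(X, k, rng=None):
    # Project d-dimensional column-vector data X down to k dimensions:
    # G(x) = sqrt(d/k) * R^T x, with R a d x k matrix whose orthonormal
    # columns come from orthogonalizing a standard Gaussian random matrix.
    rng = rng or np.random.default_rng()
    d = X.shape[0]
    R, _ = np.linalg.qr(rng.normal(size=(d, k)))   # R: d x k, orthonormal columns
    return np.sqrt(d / k) * R.T @ X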

For clear presentation, for each classifier we select only three datasets that show the most representative patterns. In Figure 10, the x-axis is the difference between the projected dimensionality and the original dimensionality, and the y-axis is the difference between the original model accuracy and the perturbed model accuracy (perturbed accuracy minus original accuracy). Note that random projections that preserve dimensionality (dimension difference = 0) are the same as rotation perturbation. The figure shows that the kNN model accuracy for the three datasets can decrease dramatically with either increased or decreased dimensionality. The numbers are the averages of ten runs for each dimensionality setting. In Figure 11, SVM models also show that the model accuracy is significantly reduced when the dimensionality differs from the original one; the "Diabetes" data is less affected by changed dimensionality. Interestingly, the perceptron models (Figure 12) are less sensitive to changed dimensionality for some datasets, such as "Diabetes" and "Heart", while very sensitive for others, such as "BreastW". In general, the error caused by random projection perturbation that changes dimensionality is so large that the resultant models are not useful.


Fig. 10. The effect of projection perturbation to kNN.

Fig. 11. The effect of projection perturbation to SVM.

Fig. 12. The effect of projection perturbation to Perceptron.

Fig. 13. Resilience to the attack to random translation (sqrt(MSE) of estimating T from R(X+T) for each dataset).

7.3. Effectiveness of Translation Perturbation

The effectiveness of translation perturbation is twofold. First, we show that the translation perturbation cannot be effectively estimated with the discussed techniques. Then, we give complementary experimental results showing that SVMs with polynomial kernel and sigmoid kernel are indeed invariant to translation perturbation.

As we have mentioned, if the translation vector could be precisely estimated, the rotation center would be exposed. We applied the ICA-based attack on the rotation center described in Section 6.2.3. The data in Figure 13 shows stdev(t − t̂). Compared to the range of the elements in t, i.e., [0, 1], the standard deviations are quite large, so we can conclude that the random translation is also hard to estimate if we have optimized the rotation perturbation in terms of ICA-based attacks.

SVMs with polynomial kernel and sigmoid kernel are also invariant to translation transformation. Table 2 lists the experimental results on random translation for the 12 datasets. We randomly translate each dataset ten times; the result is the average of the ten runs. For most datasets, the result shows zero or tiny deviation from the standard model accuracy.


Table 2. Experimental result on random translation

Dataset | SVM(polynomial): orig, Tr | SVM(sigmoid): orig, Tr

breast-w 96.6 0± 0 65.5 0± 0

credit-a 88.7 0± 0 55.5 0± 0

credit-g 87.3 −0.4± 0.4 70 0± 0

diabetes 78.5 0± 0.3 65.1 0± 0

ecoli 89.9 −0.1± 0.5 42.6 0± 0

heart 91.1 −0.2± 0.2 55.6 0± 0

hepatitis 96.7 −0.4± 0.3 79.4 0± 0

ionosphere 98 +0.3± 0 63.5 +0.6± 0

iris 97.3 0± 0 29.3 −1.8± 0.4

tic-tac-toe 100 0± 0 65.3 0± 0

votes 99.2 +0.2± 0.1 65.5 −4.7± 0.6

wine 100 0± 0 39.9 0± 0

Fig. 14. Minimum privacy guarantee generated by local optimization, combined optimization, and the performance of the ICA-based attack.

Fig. 15. Average privacy guarantee generated by local optimization, combined optimization, and the performance of the ICA-based attack.

7.4. Perturbation Optimization against Naive Estimation and ICA-based Attack

We run the randomized optimization algorithm and show how effectively it can generate resilient perturbations⁵. Each column in the experimental datasets is considered equally important in privacy evaluation; thus, the weights are not included in the evaluation.

Figures 14 and 15 summarize the evaluation of privacy quality on the experimental datasets. The results are obtained in 50 iterations with the optimization algorithm

5 Since we slightly changed the definition of privacy guarantee from our SDM paper (Chen et al., 2007), we needed to re-run the experiments that use this metric for comparison. Therefore, the numbers in this section can be slightly different from those in the paper (Chen et al., 2007).


Fig. 16. Optimization of the perturbation for the Diabetes data.

Fig. 17. Optimization of the perturbation for the Votes data.

described in Section 6.2.6. "LocalOPT" represents the locally optimized minimum privacy guarantee addressing naive estimation at a given number of iterations. "Best ICA attack" is the worst perturbation, the one giving the best ICA attack performance, i.e., the lowest privacy guarantee among the perturbations tried in the rounds. "CombinedOPT" is the combined optimization result given by Algorithm 6.1 after a number of iterations. The above values are calculated with the proposed privacy metric based on the estimated dataset and the original dataset. The LocalOPT values can often reach a relatively high level after 50 iterations, which means that the swapping method is very efficient in locally optimizing the privacy quality in terms of naive estimation. In contrast, the best ICA attacks often result in very low privacy guarantees, which means some rotation perturbations are very weak against ICA-based attacks. CombinedOPT values are much higher than the corresponding ICA-based attacks, which supports our conjecture that, in practice, we can always find a perturbation that is sufficiently resilient to ICA-based attacks.

We also show the details of the optimization process for the two datasets "Diabetes" and "Votes" in Figures 16 and 17, respectively. For both datasets, the combined optimal result lies between the curve of the best ICA attacks and the best local optimization result. Different datasets or different randomization processes may produce different patterns of privacy guarantee in the course of optimization. However, after a few rounds the results quickly stabilize around a satisfactory privacy guarantee, which means the proposed optimization method is very efficient.

7.5. Distance Perturbation: the Tradeoff between Privacy and Model Accuracy

Now we extend the geometric perturbation with the random noise component, G(X) = RX + Ψ + Δ, to address potential distance-inference attacks. From the formal analysis, we know that the noise component Δ can conveniently protect the perturbation from the distance-inference attack. Intuitively, the higher the noise level, the better the privacy guarantee. However, with an increasing noise level, the model accuracy could also be affected. In this set of experiments, we first study the relationship between the noise level, represented by its variance σ², and the privacy guarantee, and then the relationship between the noise level and the model accuracy.

Each known I/O attack is simulated by randomly picking a number of records (e.g., 5% of the total records) as the known records and then applying the estimation procedure discussed in Section 6.2.4. After running 500 runs of simulated attacks for each


Fig. 18. The change of minimum privacy guarantee vs. the increase of noise level for the three datasets.

Fig. 19. The change of accuracy of the kNN classifier vs. the increase of noise level.

Fig. 20. The change of accuracy of the SVM(RBF) classifier vs. the increase of noise level.

Fig. 21. The change of accuracy of the perceptron classifier vs. the increase of noise level.

noise level, we take the average of the minimum privacy guarantee. In addition, since the paper (Huang et al., 2005) showed that the PCA-based noise filtering technique may help reduce the noise for some noise-perturbed datasets, we also simulate the PCA filtering method based on the described algorithm (Huang et al., 2005) and check its effectiveness. The results show that in most cases (except for some noise levels for the "Iris" data), the attack is most effective when the number of principal components equals the number of original dimensions (i.e., when no noise reduction is applied). Since the PCA method cannot clearly distinguish the noise from the other perturbation components, removing the smallest principal components inevitably changes the non-noise part as well. Figure 18 shows the best attacking results for different noise levels. Overall, the privacy guarantee increases with the noise level for all three datasets. At the noise level σ = 0.1, the privacy guarantee is in the range 0.1-0.2. Figures 19 and 20 show a trend of decreasing accuracy for the kNN classifier and the SVM (RBF kernel) classifier, respectively. However, with a noise level lower than 0.1, the accuracy of both classifiers is reduced by less than 6%, which is quite acceptable. Meanwhile, the perceptron (Figure 21) is less sensitive to different levels of noise intensity. We perform experiments on all datasets at the noise level σ = 0.1 to see how the model accuracy is affected by the noise component.

We summarize the privacy guarantees at the noise level 0.1 for all experimental datasets⁶ in Figure 22, and also the change of model accuracy for kNN, SVM(RBF),

6 "Ionosphere" is not included because of the existence of a nearly constant value in one column.


Fig. 22. Minimum privacy guarantee at the noise level σ = 0.1.

Fig. 23. The change of model accuracy at the noise level σ = 0.1.

and perceptron in Figure 23. Among the three types of classifiers, kNN is the most stable one, while perceptron classifiers are the most sensitive to distance perturbation. Overall, the distance perturbation component may affect model accuracy, but its impact is limited. Last but not least, it is worth noting that the noise component can be removed if the data owner securely stores the original data. This important feature provides extra and valuable flexibility in geometric perturbation for data mining.

8. Conclusion

We presented a random geometric perturbation approach to privacy-preserving data classification. Random geometric perturbation, G(X) = RX + Ψ + Δ, combines three components: rotation perturbation, translation perturbation, and distance perturbation. Geometric perturbation preserves the important geometric properties, and thus most data mining models that search for geometric class boundaries are well preserved with the perturbed data. We proved that many data mining models, including classifiers, regression models, and clustering methods, are invariant to geometric perturbation.

Geometric perturbation perturbs multiple columns in one transformation, which introduces new challenges in evaluating the privacy guarantee for multidimensional perturbation. We proposed a multi-column privacy evaluation model and designed a unified privacy metric to address these problems. We also thoroughly analyzed the resilience of the rotation perturbation approach against three types of inference attacks: naive-inference attacks, ICA-based attacks, and distance-inference attacks. With the privacy model and the analysis of attacks, we constructed a randomized optimization algorithm that efficiently finds a good geometric perturbation resilient to these attacks. Our experiments show that the geometric perturbation approach not only preserves the accuracy of models, but also provides a much higher privacy guarantee, compared to existing multidimensional perturbation techniques.

9. Acknowledgement

The second author acknowledges partial support from grants from the NSF NetSE program, the NSF CyberTrust program, an IBM SUR grant, and a grant from the Intel Research Council.


References

Aggarwal, C. C. and Yu, P. S. (2004), A condensation approach to privacy preserving data mining, in 'Proceedings of International Conference on Extending Database Technology (EDBT)', Vol. 2992, Springer, Heraklion, Crete, Greece, pp. 183-199.

Agrawal, D. and Aggarwal, C. C. (2002), On the design and quantification of privacy preserving data mining algorithms, in 'Proceedings of ACM Conference on Principles of Database Systems (PODS)', ACM, Madison, Wisconsin.

Agrawal, R. and Srikant, R. (2000), Privacy-preserving data mining, in 'Proceedings of ACM SIGMOD Conference', ACM, Dallas, Texas.

Amazon (n.d.), 'Applications hosted on amazon clouds', http://aws.amazon.com/solutions/case-studies/.

Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I. and Zaharia, M. (2009), 'Above the clouds: A Berkeley view of cloud computing', Technical Report, University of California, Berkeley.

Bhatia, R. (1997), Matrix Analysis, Springer.

Bruening, P. J. and Treacy, B. C. (2009), 'Privacy, security issues raised by cloud computing', BNA Privacy & Security Law Report 8(10).

Chen, K. and Liu, L. (2005), A random rotation perturbation approach to privacy preserving data classification, in 'Proceedings of International Conference on Data Mining (ICDM)', IEEE, Houston, TX.

Chen, K., Liu, L. and Sun, G. (2007), Towards attack-resilient geometric data perturbation, in 'SIAM Data Mining Conference'.

Clifton, C. (2003), Tutorial: Privacy-preserving data mining, in 'Proceedings of ACM SIGKDD Conference'.

Cristianini, N. and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press.

Evfimievski, A., Gehrke, J. and Srikant, R. (2003), Limiting privacy breaches in privacy preserving data mining, in 'Proceedings of ACM Conference on Principles of Database Systems (PODS)'.

Evfimievski, A., Srikant, R., Agrawal, R. and Gehrke, J. (2002), Privacy preserving mining of association rules, in 'Proceedings of ACM SIGKDD Conference'.

Friedman, J. H. (2001), 'Greedy function approximation: A gradient boosting machine', Annals of Statistics 29(5), 1189-1232.

Fung, B. C., Wang, K., Chen, R. and Yu, P. S. (2010), 'Privacy-preserving data publishing: A survey on recent developments', ACM Computing Surveys.

Gallier, J. (2000), Geometric Methods and Applications for Computer Science and Engineering, Springer-Verlag, New York.

Google (n.d.), 'Google appengine gallery', http://appgallery.appspot.com/.

Guo, S. and Wu, X. (2007), Deriving private information from arbitrarily projected data, in 'Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD07)', Warsaw, Poland.

Guo, S., Wu, X. and Li, Y. (2008), 'Determining error bounds for spectral filtering based reconstruction methods in privacy preserving data mining', Knowledge and Information Systems 17(2).

Hastie, T., Tibshirani, R. and Friedman, J. (2001), The Elements of Statistical Learning, Springer-Verlag.

Huang, Z., Du, W. and Chen, B. (2005), Deriving private information from randomized data, in 'Proceedings of ACM SIGMOD Conference'.

Hyvarinen, A., Karhunen, J. and Oja, E. (2001), Independent Component Analysis, Wiley.

Jain, A., Murty, M. and Flynn, P. (1999), 'Data clustering: A review', ACM Computing Surveys 31, 264-323.

Jiang, T. (2005), 'How many entries in a typical orthogonal matrix can be approximated by independent normals', Annals of Probability.

Johnson, W. B. and Lindenstrauss, J. (1984), 'Extensions of Lipschitz mappings into a Hilbert space', Contemporary Mathematics 26.

Kargupta, H., Datta, S., Wang, Q. and Sivakumar, K. (2003), On the privacy preserving properties of random data perturbation techniques, in 'Proceedings of International Conference on Data Mining (ICDM)'.

LeFevre, K., DeWitt, D. J. and Ramakrishnan, R. (2006), Mondrian multidimensional k-anonymity, in 'Proceedings of IEEE International Conference on Data Engineering (ICDE)'.

Lehmann, E. L. and Casella, G. (1998), Theory of Point Estimation, Springer-Verlag.

Lindell, Y. and Pinkas, B. (2000), 'Privacy preserving data mining', Journal of Cryptology 15(3), 177-206.

Liu, K., Giannella, C. and Kargupta, H. (2006), An attacker's view of distance preserving maps for privacy preserving data mining, in 'European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD)', Berlin, Germany.

Liu, K., Kargupta, H. and Ryan, J. (2006), 'Random projection-based multiplicative data perturbation for privacy preserving distributed data mining', IEEE Transactions on Knowledge and Data Engineering (TKDE) 18(1), 92-106.

Luo, H., Fan, J., Lin, X., Zhou, A. and Bertino, E. (2009), 'A distributed approach to enabling privacy-preserving model-based classifier training', Knowledge and Information Systems 20(2).

McLachlan, G. and Peel, D. (2000), Finite Mixture Models, Wiley.

Oliveira, S. R. M. and Zaïane, O. R. (2004), Privacy preservation when sharing data for clustering, in 'Proceedings of the International Workshop on Secure Data Management in a Connected World', Toronto, Canada, pp. 67-82.

Oliveira, S. R. and Zaiane, O. R. (2010), 'Privacy preserving clustering by data transformation', Journal of Information and Data Management (JIDM) 1(1).

Sadun, L. (2001), Applied Linear Algebra: The Decoupling Principle, Prentice Hall.

Stewart, G. (1980), 'The efficient generation of random orthogonal matrices with an application to condition estimation', SIAM Journal on Numerical Analysis 17.

Sweeney, L. (2002), 'k-anonymity: A model for protecting privacy', International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10(5).

Teng, Z. and Du, W. (2009), 'A hybrid multi-group approach for privacy-preserving data mining', Knowledge and Information Systems 19(2).

Vaidya, J. and Clifton, C. (2003), Privacy preserving k-means clustering over vertically partitioned data, in 'Proceedings of ACM SIGKDD Conference'.

Vempala, S. S. (2005), The Random Projection Method, American Mathematical Society.

Author Biographies

Keke Chen is currently an assistant professor at Wright State University, Dayton OH, USA, where he directs the Data Intensive Analytics and Computing lab. He received his PhD degree in Computer Science from the College of Computing at Georgia Tech, Atlanta GA, USA, in 2006. Keke's research focuses on data privacy protection, visual analytics, and distributed data intensive scalable computing, including web search, data mining and visual analytics. From 2002 to 2006, Keke worked with Dr. Ling Liu in the Distributed Data Intensive Systems Lab at Georgia Tech, where he developed a few well-known research prototypes, such as the VISTA visual cluster rendering and validation system, the iVIBRATE framework for large-scale visual data clustering, the "Best K" cluster validation method for categorical data clustering, and the geometric data perturbation approach for outsourced data mining. From 2006 to 2008, he was a senior research scientist at Yahoo! Labs, Santa Clara, CA, working on international web search relevance and data mining algorithms for large distributed datasets on cloud computing. At Yahoo! Labs, he developed the tree adaptation method for ranking function adaptation.


Ling Liu is a full Professor in the School of Computer Science at Georgia Institute of Technology, where she directs the research programs in the Distributed Data Intensive Systems Lab (DiSL), examining various aspects of data intensive systems with a focus on performance, availability, security, privacy, and energy efficiency. Prof. Liu and her students have released a number of open source software tools, including WebCQ, XWRAPElite, PeerCrawl, and GTMobiSim. Currently she is the lead PI on two NSF-sponsored research projects: Location Privacy in Mobile and Wireless Internet Computing, and Privacy Preserving Information Networks for Healthcare Applications. Prof. Liu has published over 250 international journal and conference articles in the areas of databases, distributed systems, and Internet computing. She is a recipient of the best paper awards of ICDCS 2003, WWW 2004, the 2005 Pat Goldberg Memorial Best Paper Award, and the 2008 International Conference on Software Engineering and Data Engineering. Prof. Liu has served as general chair and PC chair of numerous IEEE and ACM conferences in the data engineering, distributed computing, service computing and cloud computing fields, and is a co-editor-in-chief of the five-volume Encyclopedia of Database Systems (Springer). She is currently on the editorial boards of several international journals, such as Distributed and Parallel Databases (DAPD, Springer), Journal of Parallel and Distributed Computing (JPDC), IEEE Transactions on Services Computing (TSC), and Wireless Networks (WINET, Springer). Dr. Liu's current research is primarily sponsored by NSF, IBM, and Intel.

Correspondence and offprint requests to: Keke Chen, Department of Computer Science and Engineering,Wright State University, Dayton, OH 45435, USA. Email: [email protected]
