An Algorithm For k-Degree Anonymity On Large Networks

Jordi Casas-Roma
Universitat Oberta de Catalunya
Barcelona, Spain
Email: [email protected]

Jordi Herrera-Joancomartí
Universitat Autònoma de Barcelona
Bellaterra, Spain
Email: [email protected]

Vicenç Torra
Artificial Intelligence Research Institute
Spanish National Research Council
Bellaterra, Spain
Email: [email protected]

Abstract: In this paper, we consider the problem of anonymization on large networks. Several anonymization methods for networks exist, but most of them cannot be applied to large networks because of their complexity. We present an algorithm for k-degree anonymity on large networks. Given a network G, we construct a k-degree anonymous network, $\tilde{G}$, using the minimum number of edge modifications. We devise a simple and efficient algorithm for solving this problem on large networks. Our algorithm uses univariate micro-aggregation to anonymize the degree sequence, and then modifies the graph structure to meet the k-degree anonymous sequence. We apply our algorithm to different large real datasets and demonstrate its efficiency and practical utility.

I. INTRODUCTION

In recent years, as more and more network data has been made publicly available, anonymization of network data has become an important concern. Backstrom et al. [1] point out that the simple technique of anonymizing networks by removing the identities of the nodes before publishing the actual network does not always guarantee privacy. They show that there exist adversaries that can infer the identity of the nodes by solving a set of restricted graph isomorphism problems. Some approaches and methods have been imported from anonymization of relational data, but network data has peculiarities that prevent those approaches from working directly. In addition, divide-and-conquer methods do not apply to anonymization of network data because records are not separable: removing or adding vertices and edges may affect other vertices and edges as well as the properties of the network [2].

Although some approaches and methods have been developed for graph anonymization, they are applicable only to small and medium networks of, at most, a few thousand nodes and edges. Anonymization of large networks is still an open problem.

In this paper we present our anonymization algorithm for large networks, based on the concept of k-degree anonymity. It works with simple, undirected and unlabelled networks. Because these networks have no attributes or labels on the edges, the only information is in the structure of the network itself, and an adversary can therefore use structural information to attack privacy. k-degree anonymity protects the users' privacy when the attacker has degree-based knowledge of the target nodes.

This paper is organized as follows. In Section II, we review the state of the art of anonymization on networks, specifically the k-anonymity-based methods. We discuss the preliminaries and the problem definition in Section III. Section IV introduces our algorithm for k-degree anonymity on large networks. Then, in Section V, we present the experiments and discuss the results. Finally, in Section VI, we present the conclusions and the future work.

II. k-ANONYMITY ON NETWORKS

There are three main categories of anonymization methods for network data: (1) Network modification approaches, which anonymize a network by modifying (adding and/or deleting) edges or nodes. (2) Clustering-based approaches (also known as generalization), which cluster nodes and edges into groups and anonymize a sub-network into a super-node in order to publish aggregate information about structural properties. (3) Differentially private approaches, which guarantee that individuals are protected under the definition of differential privacy [3]. Differential privacy imposes a guarantee on the data release mechanism rather than on the data itself. The goal is to provide statistical information about the data while preserving the privacy of users.

Our main objective is to allow publishing network data without breaking users' privacy. However, we are interested in publishing the entire network, not a summary or statistical information about it. Clustering-based approaches do not enable analysis of the local structure, and differential privacy does not allow us to release all structural information. So, if we want to analyse both the local and the global structure of the network, we have to use a network modification approach, which allows us to deliver the entire network structure.

Randomization is the simplest way to anonymize a network by modification. Randomization methods are based on adding random noise to the original data. Hay et al. [4] proposed a method to anonymize unlabelled networks, called Random Perturbation, which removes p edges at random and then adds p random false edges. Ying et al. [5] proposed a method called the Blockwise Random Add/Delete strategy, or simply Rand Add/Del-B. This method divides the network into blocks according to the degree sequence and applies modifications (by adding and removing edges) to the nodes at high risk of re-identification, not at random over the entire set of nodes. Notice that the randomization approaches protect against re-identification in a probabilistic manner.
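The Random Perturbation idea described above (remove p edges at random, then add p false edges) can be illustrated with a minimal sketch. The function name `random_perturbation`, the edge representation and the `seed` parameter are assumptions made for this illustration; they are not taken from Hay et al. [4].

```python
import random


def random_perturbation(edges, nodes, p, seed=None):
    """Remove p random edges, then add p random edges absent from the original.

    Illustrative sketch only: `edges` is an iterable of node pairs of a simple
    undirected graph and `nodes` is an iterable of node identifiers.
    """
    rng = random.Random(seed)
    nodes = list(nodes)
    # Store each undirected edge as a frozenset so (u, v) and (v, u) coincide.
    original = {frozenset(e) for e in edges}
    max_non_edges = len(nodes) * (len(nodes) - 1) // 2 - len(original)
    if p > len(original) or p > max_non_edges:
        raise ValueError("p is too large for this network")

    # Step 1: delete p edges chosen uniformly at random.
    removed = set(rng.sample(sorted(original, key=sorted), p))
    kept = original - removed

    # Step 2: add p "false" edges, i.e. pairs not linked in the original.
    added = set()
    while len(added) < p:
        u, v = rng.sample(nodes, 2)
        candidate = frozenset((u, v))
        if candidate not in original:
            added.add(candidate)
    return kept | added
```

The number of edges is preserved by construction, which is why this family of methods only offers probabilistic protection: nothing forces the resulting degrees to be shared by several nodes.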

2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

ASONAM'13, August 25-29, 2013, Niagara, Ontario, CAN. Copyright 2013 ACM 978-1-4503-2240-9/13/08 ...$15.00

Another way to anonymize a network by modification consists of adding and deleting edges to meet desired privacy constraints. One widely adopted strategy is based on the concept of k-anonymity, introduced by Sweeney [6] for privacy preservation on relational data. Formally, the k-anonymity model is defined as follows. Let $RT(A_1, \ldots, A_n)$ be a table and $QI_{RT}$ be the quasi-identifier associated with it. RT is said to satisfy k-anonymity if and only if each sequence of values in $RT[QI_{RT}]$ appears with at least k occurrences in $RT[QI_{RT}]$. The k-anonymity model ensures that an attacker cannot distinguish between k different records even if he manages to find a group of quasi-identifiers. Therefore, the attacker cannot re-identify an individual with a probability greater than $1/k$. In general, the higher the value of k, the greater the anonymization and the information loss. Let G(V,E) be a network where V is the node set and E is the edge set. In the extreme case of k = |V|, all nodes in G(V,E) have the same degree. The probability of re-identification will then be very small, but the information loss will be very large, producing a useless anonymized network.

The k-anonymity model can be applied using different concepts when dealing with networks rather than relational data. A widely used option is to consider the node degree as a quasi-identifier. This corresponds to k-degree anonymity. In short, in k-degree anonymity we presume that the only possible attack is one in which the attacker knows the degree of some nodes. Therefore, if some node can be identified with certainty from this information, we have an information leak. k-anonymity methods modify the network structure (by adding and removing edges) to ensure that all nodes satisfy k-anonymity with respect to their degrees. In other words, the main objective is that every node has at least k − 1 other nodes sharing the same degree. Liu and Terzi [7] developed a method which, given a network G(V,E) and an integer k, finds a k-degree anonymous network $\tilde{G}(V, \tilde{E})$ where $\tilde{E} \cap E \approx E$, trying to minimize the number of changes to the edges.

Zhou and Pei [2] consider the 1-neighbourhood sub-network of the target nodes as quasi-identifier. Let k be a positive integer. A vertex $u \in V$ is k-anonymous in an anonymization G if there are at least k − 1 other vertices $v_1, \ldots, v_{k-1} \in V$ such that $Neigh_G(u), Neigh_G(v_1), \ldots, Neigh_G(v_{k-1})$ are isomorphic. G is k-anonymous if every vertex in G is k-anonymous in G. This is called k-neighbourhood anonymity. Zhou et al. [8] consider all structural information about a target node as quasi-identifier and propose a new model called k-automorphism to anonymize a network and ensure privacy against this attack. They define a k-automorphic network as follows: given a network G, (a) if there exist k − 1 automorphic functions $F_a$ $(a = 1, \ldots, k-1)$ in G, and (b) for each vertex v in G, $F_{a_1}(v) \neq F_{a_2}(v)$ $(1 \le a_1 \neq a_2 \le k-1)$, then G is called a k-automorphic network. Hay et al. [9] go a step further. They propose a method, named k-candidate anonymity, that uses queries as quasi-identifier. In this method, a node $v_i$ is k-candidate anonymous with respect to question Q if there are at least k − 1 other nodes in the network with the same answer. Formally, $|cand_Q(v_i)| \ge k$ where $cand_Q(v_i) = \{v_j \in V \mid Q(v_j) = Q(v_i)\}$. A network is k-candidate anonymous with respect to question Q if all of its nodes are k-candidate anonymous with respect to question Q. The question Q is modelled according to the knowledge the adversary is assumed to have.

Hay et al. [4] [9] proposed a method, called Vertex Refinement Queries, to model the knowledge of the adversary. This class of queries, with increasing attack power, models the local neighbourhood structure of a node in the network. The weakest knowledge query, $H_0(v_j)$, simply returns the label of the node $v_j$. The queries are successively more descriptive: $H_1(v_j)$ returns the degree of $v_j$, $H_2(v_j)$ returns the list of the degrees of its neighbours, and so on. The queries can be defined iteratively, where $H_i(v_j)$ returns the multi-set of values obtained by evaluating $H_{i-1}$ on the set of nodes adjacent to $v_j$:

$$H_i(v_j) = \{ H_{i-1}(v_1), H_{i-1}(v_2), \ldots, H_{i-1}(v_m) \} \tag{1}$$

where $v_1, v_2, \ldots, v_m$ are the nodes adjacent to $v_j$.
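The iterative definition in Equation (1) maps naturally onto a short recursive function. The sketch below assumes an unlabelled graph stored as a dictionary from node to neighbour set (a representation chosen only for this illustration), treats $H_0$ as an empty constant because there are no labels, and represents each multi-set as a sorted tuple.

```python
def vertex_refinement(adj, i, v):
    """Return H_i(v) for a graph given as {node: set_of_neighbours}.

    Assumptions of this sketch: the network is unlabelled, so H_0 is a
    constant; H_1 is the degree; for i >= 2 the multi-set of Equation (1)
    is encoded as a sorted tuple so that equal multi-sets compare equal.
    """
    if i == 0:
        return ()
    if i == 1:
        return len(adj[v])
    return tuple(sorted(vertex_refinement(adj, i - 1, u) for u in adj[v]))


# Path graph 0 - 1 - 2: the end nodes are indistinguishable under H_1 and H_2.
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(vertex_refinement(path, 1, 1))  # degree of the middle node: 2
print(vertex_refinement(path, 2, 1))  # degrees of its neighbours: (1, 1)
```

The recursion revisits neighbours repeatedly, so it is only meant for small examples; for large networks the values of $H_{i-1}$ would normally be cached per node.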

A candidate set $cand_{H_i}$ for a query $H_i$ is the set of all nodes with the same value of $H_i$. Therefore, the cardinality of a candidate set for $H_i$ is the number of indistinguishable nodes in G under $H_i$. Note that if the cardinality of the smallest candidate set under $H_1$ is k, the probability of re-identification is $1/k$. Hence, the k-degree anonymity value for G is k.

$$cand_{H_1} = \{ v_j \in V \mid H_1(v_i) = H_1(v_j) \} \tag{2}$$
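As a small illustration of candidate sets under $H_1$, the following sketch derives the k-degree anonymity value of a degree list and counts how many nodes fall into candidate sets of a given size range. The function names are hypothetical and chosen only for this example.

```python
from collections import Counter


def candidate_set_sizes(degrees):
    """Map each degree value to the number of nodes sharing it (|cand_H1|)."""
    return Counter(degrees)


def k_degree_anonymity_value(degrees):
    """Size of the smallest candidate set under H_1."""
    return min(candidate_set_sizes(degrees).values())


def nodes_with_candidate_size(degrees, low, high):
    """Count nodes whose candidate set size lies in the range [low, high]."""
    sizes = candidate_set_sizes(degrees)
    return sum(size for size in sizes.values() if low <= size <= high)


degrees = [1, 1, 2, 2, 2, 3]
print(k_degree_anonymity_value(degrees))          # 1: the node of degree 3 is unique
print(nodes_with_candidate_size(degrees, 2, 10))  # 5 nodes share their degree
```

Calling `nodes_with_candidate_size(degrees, 1, 1)` gives the number of nodes that can be re-identified with certainty by an adversary who knows only their degree.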

III. PRELIMINARIES

Let G(V,E) be a simple graph, where V is the set of nodes and E the set of edges in G. We define n = |V| to denote the number of nodes and m = |E| to denote the number of edges. We use d to denote the degree sequence of G, where d is a vector of length n and $d_i$ is the value of its i-th element, that is, the degree of node $v_i \in V$. We refer to the ordered degree sequence as a monotonic non-decreasing sequence of the vertex degrees, that is, $d_i \le d_j \ \forall\, i < j$.

Regarding the degree sequence, notice that:

• The number of elements in the degree sequence is fixed by the number of nodes.

• Each $d_i \in d$ must be an integer in the range [0, n − 1], because the values of the degree sequence are node degrees.

• The total number of edges of the graph is half the sum of the degree sequence, since each edge is counted twice in the degree sequence. Therefore, $\sum_{i=1}^{n} d_i$ must be an even number.

The degree sequence is an interesting tool since the concept of k-degree anonymity for a graph can be directly mapped to its degree sequence, as Liu and Terzi showed in [7] and as we recall in the following definitions:

Definition 1: A vector of integers V is k-anonymous if every distinct value $v_i \in V$ appears at least k times.

Definition 2: A graph G(V,E) is k-degree anonymous if the degree sequence of G, d, is k-anonymous.
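Definitions 1 and 2 translate almost literally into code. The sketch below assumes a graph stored as a dictionary from node to neighbour set; the helper names are illustrative and not part of the original paper.

```python
from collections import Counter


def degree_sequence(adj):
    """Ordered (non-decreasing) degree sequence of {node: set_of_neighbours}."""
    return sorted(len(neighbours) for neighbours in adj.values())


def is_k_anonymous(vector, k):
    """Definition 1: every distinct value appears at least k times."""
    return all(count >= k for count in Counter(vector).values())


def is_k_degree_anonymous(adj, k):
    """Definition 2: the degree sequence of the graph is k-anonymous."""
    return is_k_anonymous(degree_sequence(adj), k)


# A 4-cycle: every node has degree 2, so the graph is 4-degree anonymous.
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(is_k_degree_anonymous(cycle, 4))  # True
```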

Fig. 1: Basic operations for graph modification with node invariability: (a) edge switch, (b) edge removal, (c) edge addition. Dashed lines represent deleted edges while solid lines are the added ones.

In order to modify the edge set of a given network, we define three basic operations: edge switch, edge removal and edge addition. The edge switch between three nodes is defined as follows: if $v_i, v_j, v_k \in V$, $(v_i, v_k) \in E$ and $(v_j, v_k) \notin E$, we can delete $(v_i, v_k)$ and create $(v_j, v_k)$. Figure 1a shows this basic operation. In the degree sequence, this modification translates to $\tilde{d}_i = d_i - 1$ and $\tilde{d}_j = d_j + 1$, where $\tilde{d}$ is the degree sequence after the basic operation. Notice that the number of edges in the network remains the same: $v_i$ decreases its degree by one, $v_j$ increases its degree by one and $v_k$ keeps the same degree.

Secondly, we define edge removal as follows: we select four nodes $v_i, v_j, v_k, v_l \in V$ where $(v_i, v_k) \in E$, $(v_j, v_l) \in E$ and $(v_k, v_l) \notin E$. We delete edges $(v_i, v_k)$ and $(v_j, v_l)$ and create a new edge $(v_k, v_l)$, as shown in Figure 1b. Note that $\tilde{d}_i = d_i - 1$, $\tilde{d}_j = d_j - 1$, $\tilde{d}_k = d_k$ and $\tilde{d}_l = d_l$.

Finally, we define edge addition as follows: we select two nodes $v_i, v_j \in V$ where $(v_i, v_j) \notin E$ and create the edge $(v_i, v_j)$, as shown in Figure 1c. Note that $\tilde{d}_i = d_i + 1$ and $\tilde{d}_j = d_j + 1$.
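The three basic operations can be sketched on an adjacency-set representation as follows. The representation, the helper `make_graph` and the function names are assumptions made for this illustration; each function checks the preconditions stated above before touching the edge set.

```python
def make_graph(nodes, edges):
    """Build {node: set_of_neighbours} for a simple undirected graph."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def edge_switch(adj, vi, vj, vk):
    """Replace (vi, vk) by (vj, vk): d_i - 1, d_j + 1, d_k unchanged."""
    if len({vi, vj, vk}) != 3 or vk not in adj[vi] or vk in adj[vj]:
        raise ValueError("edge switch preconditions not met")
    adj[vi].remove(vk); adj[vk].remove(vi)
    adj[vj].add(vk); adj[vk].add(vj)


def edge_removal(adj, vi, vj, vk, vl):
    """Delete (vi, vk) and (vj, vl), create (vk, vl): d_i - 1, d_j - 1."""
    if (len({vi, vj, vk, vl}) != 4 or vk not in adj[vi]
            or vl not in adj[vj] or vl in adj[vk]):
        raise ValueError("edge removal preconditions not met")
    adj[vi].remove(vk); adj[vk].remove(vi)
    adj[vj].remove(vl); adj[vl].remove(vj)
    adj[vk].add(vl); adj[vl].add(vk)


def edge_addition(adj, vi, vj):
    """Create (vi, vj): d_i + 1, d_j + 1."""
    if vi == vj or vj in adj[vi]:
        raise ValueError("edge addition preconditions not met")
    adj[vi].add(vj); adj[vj].add(vi)
```

None of the operations adds or deletes nodes, and only edge removal and edge addition change the total number of edges (by minus one and plus one respectively).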

IV. THE UMGA ALGORITHM

In this section, we present the UMGA (short for Univariate Micro-aggregation for Graph Anonymization) algorithm, designed to achieve k-degree anonymity on large, undirected and unlabelled networks. The algorithm modifies only the edge set E of the original network G(V,E), by applying the three basic operations defined in the previous section. So, the node set V does not change during the anonymization process. UMGA is based on a two-step approach:

1) Degree Sequence's Anonymization. From the degree sequence of the original network G(V,E), $d = \{d_1, \ldots, d_n\}$, we construct a new sequence $\tilde{d}$ that is k-anonymous. We seek to minimize the distance Δ from the anonymized sequence to the original one, computed as:

$$\Delta = \sum_{i=1}^{n} |\tilde{d}_i - d_i| \tag{3}$$

2) Graph modification. By using the three basic edge modification operations defined in the previous section, we build a new network $\tilde{G}(V, \tilde{E})$ whose degree sequence is equal to $\tilde{d}$.

The second step ensures that the new network $\tilde{G}$ is k-degree anonymous. The lower the value of Δ obtained in the first step, the lower the information loss of the anonymized network.

A. Degree Sequence’s Anonymization

The objective of the first step is to anonymize the degree sequence of the original network, d. We use Optimal Univariate Microaggregation [10] to achieve the best group distribution, and then we compute the values for each group that minimize the distance Δ from the original degree sequence.

Without loss of generality, we assume d to be an ordered degree sequence of the original network. Otherwise, we apply a permutation f to the sequence to reorder the elements. Let k be an integer such that 1 ≤ k < n, which is the k-degree anonymity value. Typically, k is much smaller than n. In order to apply the optimal univariate microaggregation, and following Hansen and Mukherjee [10], we construct a new directed network $H_{k,n}$ and obtain the optimal partition, which is exactly the set of groups that corresponds to the arcs of the shortest path from node 0 to node n on this network. We denote by g the optimal partition, where g has p groups with $\frac{n}{2k-1} \le p \le \frac{n}{k}$, and each group $g_j$ has between k and 2k − 1 items. Obviously, each $d_i \in d$ belongs to a specific group $g_j$.
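The shortest-path formulation can be sketched as a dynamic program over the nodes 0, ..., n of $H_{k,n}$. The text above does not spell out the arc weights, so this sketch assumes the usual choice for optimal univariate microaggregation: an arc (i, j) exists when $i + k \le j < i + 2k$, it represents the group formed by the sorted elements i+1, ..., j, and its weight is the within-group sum of squared errors. The function name `optimal_partition` is hypothetical.

```python
def optimal_partition(sorted_degrees, k):
    """Split a sorted list into groups of size k..2k-1 with minimum total SSE.

    Sketch under stated assumptions: arc (i, j) of H_{k,n} is the group
    sorted_degrees[i:j] and is weighted by its sum of squared errors.
    """
    n = len(sorted_degrees)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")

    # Prefix sums give the SSE of any contiguous group in constant time.
    s1, s2 = [0], [0]
    for x in sorted_degrees:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def sse(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    cost = [0.0] + [inf] * n   # cost[j]: cheapest path from node 0 to node j
    prev = [None] * (n + 1)    # predecessor of node j on that path
    for j in range(k, n + 1):
        for i in range(max(0, j - 2 * k + 1), j - k + 1):
            candidate = cost[i] + sse(i, j)
            if candidate < cost[j]:
                cost[j], prev[j] = candidate, i

    # Walk back from node n to node 0, collecting one group per arc.
    groups, j = [], n
    while j > 0:
        i = prev[j]
        groups.append(sorted_degrees[i:j])
        j = i
    return groups[::-1]
```

Each node j has at most k incoming arcs, so the loop performs on the order of n·k weight evaluations, which is what makes this step practical for large degree sequences.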

Next, we compute the matrix of differences, $M_{p \times 2}$, using each group of the partition. The first column contains the sum of differences between each element of the group and the arithmetic mean of all degrees that belong to the group, using the floor function to round the mean value. The second column is computed in the same way, but using the ceiling function instead. Conceptually, the first column of M contains the total amount by which the degrees of the group must decrease to meet k-degree anonymity. These values are always zero or positive. The second column contains the total amount by which the degrees of the group must increase to meet k-degree anonymity. These values are always negative or zero. Formally, for $j = 1, \ldots, p$, the elements $m_{j1}$ and $m_{j2}$ are computed as:

$$m_{j1} = \sum_{d_i \in g_j} \left( d_i - \left\lfloor \frac{\sum_{d_i \in g_j} d_i}{|g_j|} \right\rfloor \right) \tag{4}$$

$$m_{j2} = \sum_{d_i \in g_j} \left( d_i - \left\lceil \frac{\sum_{d_i \in g_j} d_i}{|g_j|} \right\rceil \right) \tag{5}$$

where $|g_j|$ denotes the cardinality of set $g_j$.

A group with zero values in both columns does not need any modification of its items. For example, a group $g_1 = \{2, 2, 2\}$ generates [0, 0] as the first row of M, so the items of $g_1$ keep their values. In contrast, a group $g_2 = \{3, 4, 4\}$ generates the row [2, −1]. So, there are two possibilities to anonymize the group: decrease the values of the second and third items to 3, or increase the value of the first item to 4.
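Equations (4) and (5) and the two example groups can be checked with a few lines of code. The function names below are illustrative; integer arithmetic is used for the floor and ceiling of the group mean to avoid floating-point rounding.

```python
def difference_row(group):
    """Return [m_j1, m_j2] for one group of degrees (Equations 4 and 5)."""
    total, size = sum(group), len(group)
    floor_mean = total // size        # floor of the arithmetic mean
    ceil_mean = -(-total // size)     # ceiling of the arithmetic mean
    m1 = sum(d - floor_mean for d in group)
    m2 = sum(d - ceil_mean for d in group)
    return [m1, m2]


def difference_matrix(groups):
    """Matrix M with one [m_j1, m_j2] row per group of the partition."""
    return [difference_row(group) for group in groups]


print(difference_matrix([[2, 2, 2], [3, 4, 4]]))  # [[0, 0], [2, -1]]
```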

Now we have to compute a solution, that is, a sequence of p values where each $m_j \in \{m_{j1}, m_{j2}\}$. The closer $\left|\sum_{j=1}^{p} m_j\right|$ is to 0, the better the solution. This search is a complex operation with cost $\Theta(2^q)$, where q is the number of groups with $m_j \neq 0$. Because of the power-law degree distribution of real networks, q is typically much smaller than p, but still too high for large networks.

For that reason, we propose two methods in order to achieve the best combination in minimum time:

1) The Exhaustive Method: The first approach is based on exhaustive search. This method selects the values $m_{j1}$ or $m_{j2}$ for $j = 1, \ldots, p$ so that the minimum value of $\left|\sum_{j=1}^{p} m_j\right|$ is achieved. To make this selection, all possible combinations are considered, stopping early if a solution is found with $\sum_{j=1}^{p} m_j = 0$.

2) The Greedy Method: We define an iterative method in which, at each step, the values for $m_j$ are selected according to a probability distribution based on the size of the values $m_{j1}$ and $m_{j2}$. The lower the value, the higher its probability of being chosen. More specifically,

$$p(m_j = m_{j1}) = 1 - \left( \frac{m_{j1}}{m_{j1} + m_{j2}} \right) \tag{6}$$

$$p(m_j = m_{j2}) = 1 - \left( \frac{m_{j2}}{m_{j1} + m_{j2}} \right) \tag{7}$$

Iterations finish when a solution is found with $\sum_{j=1}^{p} m_j = 0$ or when a fixed number of iterations has passed without any improvement in the function $\sum_{j=1}^{p} m_j$.
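The greedy selection can be sketched as repeated random sampling. Two points in this sketch are assumptions rather than statements from the paper: the probabilities of Equations (6) and (7) are evaluated on the absolute values of $m_{j1}$ and $m_{j2}$ (so that they lie between 0 and 1 when the second column is negative), and progress is measured by the absolute value of the sum. The parameter `patience`, the number of iterations allowed without improvement, is a hypothetical name.

```python
import random


def greedy_choice(matrix, patience=100, seed=None):
    """Randomly sample m_j1 or m_j2 per row, keeping the best combination.

    Sketch under stated assumptions: probabilities use |m_j1| and |m_j2|,
    and the search stops on a zero sum or after `patience` iterations
    without improving |sum(m_j)|.
    """
    rng = random.Random(seed)
    best, best_abs, stale = None, None, 0
    while stale < patience:
        choice = []
        for m1, m2 in matrix:
            a1, a2 = abs(m1), abs(m2)
            if a1 + a2 == 0:
                choice.append(0)  # nothing to decide for a zero row
                continue
            # The smaller magnitude is the more likely pick.
            p1 = 1 - a1 / (a1 + a2)
            choice.append(m1 if rng.random() < p1 else m2)
        total = abs(sum(choice))
        if best is None or total < best_abs:
            best, best_abs, stale = choice, total, 0
            if total == 0:
                break
        else:
            stale += 1
    return best
```

Unlike the exhaustive search, the running time here is bounded by the patience parameter rather than by $2^q$, at the price of no guarantee that the best combination is found.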

After these computations, we have a k-degree anonymized sequence, $\tilde{d}$, which minimizes the number of changes from the original degree sequence. However, notice that if we started the process with a non-ordered degree sequence, we must now apply the inverse transform $f^{-1}$ to obtain the correct degree sequence.

B. Graph modification

In the second step, changes are made to the original network in order to convert it into a k-degree anonymous network. The anonymized degree sequence $\tilde{d}$ indicates precisely which nodes must increase or decrease their degree to achieve the desired k-degree anonymity. The changes to the original edges are performed using the basic operations depicted in Figure 1.

The graph modification starts by obtaining the vector of changes between the k-degree anonymous sequence and the original degree sequence, $\delta = \tilde{d} - d$. This vector makes it easy to detect the nodes which have to reduce or increase their degree. From δ, we create the list of nodes which must decrease their degree, $\delta^- = \{v_i \mid \delta_i < 0\}$, and the list of nodes which must increase their degree, $\delta^+ = \{v_i \mid \delta_i > 0\}$.

Let $\sigma(d)$ be the sum of the elements of d, $\sigma(d) = \sum_{i=1}^{n} d_i$, and let $\sigma(\tilde{d})$ be the sum of the elements of $\tilde{d}$, $\sigma(\tilde{d}) = \sum_{i=1}^{n} \tilde{d}_i$. Then, if $\sigma(\tilde{d}) = \sigma(d)$, the original network and the anonymized one have the same number of edges and therefore we only need to apply edge switch modifications.

Otherwise, we have to add or delete $\frac{|\sigma(\tilde{d}) - \sigma(d)|}{2}$ edges in order to anonymize the network.
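The bookkeeping described above (the vector δ, the two node lists and the number of edges to add or delete) can be sketched as follows. The function name and the return format are assumptions of this example; nodes are identified by their position in the degree sequence.

```python
def change_vector(original, anonymized):
    """Compute delta, the node lists delta- and delta+, and the edge balance.

    Illustrative sketch: both arguments are degree lists indexed by node.
    The last returned value is the number of edges to add (positive) or
    to remove (negative); zero means edge switches alone are enough.
    """
    if len(original) != len(anonymized):
        raise ValueError("degree sequences must have the same length")
    delta = [a - o for o, a in zip(original, anonymized)]
    decrease = [i for i, change in enumerate(delta) if change < 0]
    increase = [i for i, change in enumerate(delta) if change > 0]
    difference = sum(anonymized) - sum(original)
    if difference % 2 != 0:
        raise ValueError("the sums of the two sequences differ by an odd number")
    return delta, decrease, increase, difference // 2


delta, minus, plus, edges = change_vector([1, 2, 2, 3], [2, 2, 2, 2])
print(delta, minus, plus, edges)  # [1, 0, 0, -1] [3] [0] 0
```

The parity check reflects the fact that every edge contributes two units to the degree sum, so an odd difference could not be realized by whole edges.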

If $\sigma(\tilde{d}) < \sigma(d)$, we need to remove $\frac{\sigma(d) - \sigma(\tilde{d})}{2}$ edges from the network, as shown in Figure 1b. In order to delete an edge from the network, we choose $v_i, v_j \in \delta^-$ and find two other nodes $v_k, v_l$ where $(v_i, v_k) \in E$ and $(v_j, v_l) \in E$. Then we delete these two edges and create a new one, $(v_k, v_l)$.

On the other hand, if $\sigma(\tilde{d}) > \sigma(d)$, we need to add $\frac{\sigma(\tilde{d}) - \sigma(d)}{2}$ edges to the network, as shown in Figure 1c. To add an edge, we select $v_i, v_j \in \delta^+$ where $(v_i, v_j) \notin E$ and create it. Finally, we have to increase the degree of some nodes and decrease the degree of others until every node has reached its target degree in $\tilde{d}$. This modification is done using edge switch (Figure 1a). For each $v_i \in \delta^-$ and $v_j \in \delta^+$, we find another node $v_k$ where $(v_i, v_k) \in E$. We delete this edge and create a new one, $(v_k, v_j)$.

TABLE I: General properties of tested networks.

| Network | Nodes | Edges | Average degree | k |
|---|---|---|---|---|
| Caida | 26,475 | 53,381 | 2.016 | 1 |
| Amazon | 403,394 | 2,443,408 | 6.057 | 1 |
| Yahoo! | 1,878,736 | 4,079,161 | 2.171 | 1 |

TABLE II: Candidate set size of $H_1$, $cand_{H_1}$, for original tested networks.

| Network | [1, 1] | [2, 10] | [11, 20] | [21, 50] | [51, 100] |
|---|---|---|---|---|---|
| Caida | 70 (0.264%) | 217 (0.819%) | 119 (0.449%) | 294 (1.110%) | 295 (1.114%) |
| Amazon | 90 (0.022%) | 535 (0.132%) | 435 (0.107%) | 711 (0.176%) | 1,165 (0.288%) |
| Yahoo! | 73 (0.003%) | 331 (0.018%) | 196 (0.010%) | 381 (0.020%) | 705 (0.037%) |

V. RESULTS

We have tested our algorithm with three real networks. All these networks are undirected and unlabelled. Table I shows a summary of the networks' main features, including number of nodes, number of edges, average degree and k-anonymity value. Caida [11] is an undirected network of autonomous systems of the Internet connected with each other, from the CAIDA project, collected in 2007. Amazon [12] is the network of items on Amazon that have been mentioned by Amazon's "People who bought X also bought Y" function. Yahoo! Instant Messenger friends connectivity graph (version 1.0) [13] contains a non-random sample of the Yahoo! Messenger friends network from 2003. All tests were run on a 4 CPU Intel Xeon X3430 at 2.40 GHz with 32 GB RAM running Debian GNU/Linux.

Table II shows the candidate set sizes, $cand_{H_1}$, for the original networks. It shows how re-identification risk is distributed over the nodes of each graph. The Caida network has 70 re-identifiable nodes, i.e., nodes with a unique degree value, and 217 nodes at high risk of re-identification, i.e., nodes with a candidate set size between 2 and 10. Amazon has 90 re-identifiable nodes and 535 nodes at high risk of re-identification, and Yahoo! has 73 re-identifiable nodes and 331 nodes at high risk of re-identification.

Table III shows the results of our experiments. We apply the UMGA algorithm, with both the exhaustive and the greedy method, to the three selected networks. We test our algorithm for values of k = {10, 20, 50, 100} on each network and compute the number of possible combinations ($2^q$) in order to provide an approximation of the complexity. For each method we show the computation time of the algorithm (time), the difference in the number of edges between the original edge set E and the anonymized one $\tilde{E}$ (ed), and the percentage of modified edges (%mod).

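TABLE III: UMGA's results for tested networks.

| Network | k | $2^q$ | Exhaustive: time | Exhaustive: ed | Exhaustive: %mod | Greedy: time | Greedy: ed | Greedy: %mod |
|---|---|---|---|---|---|---|---|---|
| Caida | 10 | $2^{24}$ | 0:00:06 | 0 | 6.06% | 0:00:06 | 0 | 6.06% |
| Caida | 20 | $2^{21}$ | 0:00:07 | 0 | 11.65% | 0:00:07 | 0 | 11.65% |
| Caida | 50 | $2^{13}$ | 0:00:08 | -9 | 18.43% | 0:00:08 | -9 | 18.43% |
| Caida | 100 | $2^{9}$ | 0:00:10 | -9 | 25.81% | 0:00:10 | 41 | 25.85% |
| Amazon | 10 | $2^{54}$ | 0:14:16 | 0 | 0.16% | 0:11:54 | 0 | 0.16% |
| Amazon | 20 | $2^{50}$ | 2:57:37 | 0 | 0.26% | 0:11:55 | 0 | 0.26% |
| Amazon | 50 | $2^{32}$ | 0:12:07 | 0 | 0.39% | 0:12:07 | 0 | 0.39% |
| Amazon | 100 | $2^{30}$ | 2:11:47 | -1 | 0.53% | 0:13:57 | -1 | 0.53% |
| Yahoo! | 10 | $2^{38}$ | 1:34:34 | 0 | 0.05% | 1:31:40 | 0 | 0.05% |
| Yahoo! | 20 | $2^{28}$ | 2:19:21 | 0 | 0.07% | 1:31:57 | 0 | 0.07% |
| Yahoo! | 50 | $2^{19}$ | 1:33:09 | 0 | 0.13% | 1:32:12 | 0 | 0.13% |
| Yahoo! | 100 | $2^{16}$ | 1:34:27 | 0 | 0.18% | 1:34:32 | 0 | 0.18% |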

Caida is a quite sparse network. More than 49% of its nodes have a degree between 1 and 10, 21.71% between 11 and 100, 18.66% between 101 and 1,000, and 10.26% have a degree greater than 1,000. The maximum degree is 2,628. Because of this, it is necessary to modify more than 6% of the edges in order to reach k = 10. This percentage grows as the value of k grows. For k = 50 and k = 100 the algorithm needs to change the total number of edges, decreasing it by 9 or increasing it by 41 edges (see Table III). This is about 0.018% of the total edges, so we believe that the noise introduced in the network is negligible. We can see similar times and results for the exhaustive and greedy methods. The number of possible combinations is small and the exhaustive method can deal with it.

Amazon is a larger network than Caida, so we can see greater differences between the exhaustive and greedy methods. The complexity grows up to $2^{54}$ when k = 10. Notice that smaller values of k imply greater complexity, since more groups of nodes ($g_j$) are possible and therefore more combinations. When k = 100 there is no solution with $\sum_{j=1}^{p} m_j = 0$, so the exhaustive method explores all the possible combinations. In this case there are $2^{30}$ of them, which can be explored in a relatively small amount of time. For the other values of k, there is a solution equal to 0, so the exhaustive method does not explore all combinations. Indeed, it explores less than 0.1% of them in all cases. However, the greedy method finds the same result in all experiments, and it spends much less time.

The Yahoo! network is the largest test network, but it is less sparse than the others. 99.21% of the nodes have a degree between 1 and 100, 0.75% between 101 and 1,000, and only 0.03% have a degree greater than 1,000. Nodes with a degree lower than 100 are well protected, and they account for more than 99% of the nodes. These characteristics imply that k-degree anonymous networks from k = 10 to k = 100 can be achieved by modifying less than 0.20% of the edge set. So, the utility of the anonymized network will be almost intact.

VI. CONCLUSION

In this paper we have presented a new algorithm for network anonymization on large networks. It is based on edge set modification in order to achieve the desired k-degree anonymity value. The new algorithm, called Univariate Micro-aggregation for Graph Anonymization (UMGA), is based on the modification of the degree sequence using the univariate micro-aggregation technique. This process obtains an anonymized degree sequence which is k-degree anonymous and minimizes the distance from the original one. Then we use the basic operations to translate the modifications made on the anonymized degree sequence to the network edge set.
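As a rough illustration of the first step, the sketch below applies a simplified, fixed-size univariate micro-aggregation to a degree sequence. It is not the optimal partitioning used in the paper and does not guarantee minimal distance from the original sequence; it only shows how grouping sorted degrees and replacing each group by a common value yields a k-degree anonymous sequence. The function names are hypothetical, and the sketch does not check that the resulting sequence can be realized as a graph.

```python
from collections import Counter

def microaggregate_degrees(degrees, k):
    """Simplified fixed-size micro-aggregation (illustrative assumption).

    Sorts the degrees, cuts them into consecutive groups of k (the last
    group absorbs any leftover so that every group has at least k members),
    and replaces each degree by its rounded group mean. The result is
    returned in the original node order.
    """
    n = len(degrees)
    if k < 1 or n < k:
        raise ValueError("need k >= 1 and at least k degrees")
    order = sorted(range(n), key=lambda i: degrees[i])
    anonymized = [0] * n
    start = 0
    while start < n:
        end = start + k
        if n - end < k:  # fewer than k nodes would remain: merge them here
            end = n
        group = order[start:end]
        mean = round(sum(degrees[i] for i in group) / len(group))
        for i in group:
            anonymized[i] = mean
        start = end
    return anonymized

def is_k_degree_anonymous(degrees, k):
    """True if every degree value is shared by at least k nodes."""
    return all(count >= k for count in Counter(degrees).values())

if __name__ == "__main__":
    original = [1, 2, 2, 3, 7, 8, 9]  # toy degree sequence
    anonymized = microaggregate_degrees(original, k=3)
    print(anonymized, is_k_degree_anonymous(anonymized, 3))
```

The differences between the anonymized and original values indicate how many edge endpoints each node must gain or lose, which is the information that the edge modification step then has to realize on the graph.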

We have shown that the algorithm is able to anonymize large networks. We have used three different real networks to test the algorithm with two variants, based on an exhaustive and a greedy method. Both methods show good results on all networks, but the greedy method spends less time to get a similar (in many cases, the same) result. In addition, the running time of the greedy method remains much more stable than that of the exhaustive method. The tests showed that our algorithm can anonymize large real networks based on the k-degree anonymity concept.

Many interesting directions for future research have been uncovered by this work. It would be very interesting to study how the algorithm can work with other types of networks. Also, it would be interesting to consider other structural properties as quasi-identifiers.

ACKNOWLEDGMENT

This work was partially supported by the Spanish MCYT and the FEDER funds under grants TSI2007-65406-C03 “E-AEGIS”, TIN2010-15764 “N-KHRONOUS”, CONSOLIDER CSD2007-00004 “ARES”, and TIN2011-27076-C03 “CO-PRIVACY”.

REFERENCES

[1] L. Backstrom, C. Dwork, and J. Kleinberg, “Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography” in WWW ’07. New York, NY, USA: ACM, 2007, pp. 181–190.

[2] B. Zhou and J. Pei, “Preserving privacy in social networks against neighborhood attacks” in ICDE ’08. Washington, DC, USA: IEEE Computer Society, Apr. 2008, pp. 506–515.

[3] C. Dwork, “Differential Privacy”. Automata languages and programming, vol. 4052, pp. 1–12, 2006.

[4] M. Hay, G. Miklau, D. Jensen, P. Weis, and S. Srivastava, “Anonymizing Social Networks”. Computer Science Department, University of Massachusetts Amherst, Technical Report No. 07-19, 2007.

[5] X. Ying, K. Pan, X. Wu, and L. Guo, “Comparisons of randomization and k-degree anonymization schemes for privacy preserving social network publishing” in SNA-KDD ’09. New York, NY, USA: ACM, 2009, pp. 10:1–10:10.

[6] L. Sweeney, “k-anonymity: a model for protecting privacy”. Intl. J. of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 557–570, 2002.

[7] K. Liu and E. Terzi, “Towards identity anonymization on graphs” in SIGMOD ’08. New York, NY, USA: ACM, 2008, pp. 93–106.

[8] L. Zou, L. Chen, and M. T. Ozsu, “K-Automorphism: A General Framework For Privacy Preserving Network Publication” in VLDB ’09, vol. 2, no. 1, pp. 946–957, 2009.

[9] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis, “Resisting structural re-identification in anonymized social networks” in VLDB ’08, vol. 1, no. 1, pp. 102–114, 2008.

[10] S. L. Hansen and S. Mukherjee, “A Polynomial Algorithm for Optimal Univariate Microaggregation”. IEEE Trans. Knowl. Data Eng., vol. 15, no. 4, pp. 1043–1044, 2003.

[11] J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graph Evolution: Densification and Shrinking Diameters”. ACM Trans. Knowledge Discovery from Data, vol. 1, no. 1, pp. 1–40, 2007.

[12] J. Leskovec, L. A. Adamic, and B. A. Huberman, “The dynamics of viral marketing”. ACM Transactions on the Web, vol. 1, no. 1, May 2007.

[13] Yahoo! Webscope, “Yahoo! Instant Messenger friends connectivity graph, version 1.0”. http://research.yahoo.com/Academic Relations

2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
