
GraSS: Graph Structure Summarization

Kristen LeFevre∗ Evimaria Terzi†

Abstract

Large graph databases are commonly collected and analyzed in numerous domains. For reasons related either to space efficiency or to privacy protection (e.g., in the case of social network graphs), it sometimes makes sense to replace the original graph with a summary, which removes certain details about the original graph topology. However, this summarization process leaves the database owner with the challenge of processing queries that are expressed in terms of the original graph, but are answered using the summary.

In this paper, we propose a formal semantics for answering queries on summaries of graph structures. At its core, our formulation is based on a random-worlds model. We show that important graph-structure queries (e.g., adjacency, degree, and eigenvector centrality) can be answered efficiently and in closed form using these semantics. Further, based on this approach to query answering, we formulate three novel graph partitioning/compression problems. We develop algorithms for finding a graph summary that least affects the accuracy of query results, and we evaluate our proposed algorithms using both real and synthetic data.

1 Introduction

Graph-structured data is commonly collected and analyzed in a variety of application domains, e.g., the Web, the Internet, local-area networks (LANs), social and biological networks, and many more. In this paper, we investigate the problem of probabilistic query answering using compressed graphs, or graph summaries.

Graph structures are commonly summarized, or replaced with coarser representations, in at least the following two important (and very different) scenarios:

Compression, Space Efficiency: Over the past several years, particularly with the growth of the World Wide Web, the number and magnitude of graph-structured databases have been rapidly increasing, making it increasingly difficult and expensive to store the data contained therein. One common way of dealing with the size of these databases is to expand the amount of computing and storage resources (i.e., hardware and disks). A complementary solution deliberately removes some of the detail from the graphs in order to reduce space consumption.

∗University of Michigan, Ann Arbor, MI. Email: [email protected]
†Boston University, Boston, MA. Email: [email protected]

[Figure 1: Social network graph. Nodes: Alice, Bob, Charlie, Dave, Ellen.]

Privacy and Anonymity for Social Networks: Privacy and anonymity have emerged as important problems when publishing social network graphs (e.g., describing the "friend" relationships between users on Facebook or LinkedIn). Recent work has observed that removing known identifiers (e.g., Name, SSN) is often not enough to prevent re-identification [7, 20, 26]. As a simple example, consider the social network graph in Figure 1, and suppose that we replace it with the graph in Figure 2(a) (i.e., replace user names with meaningless integer pseudonyms). Now consider an attacker who knows that Bob is in the de-identified graph. If the attacker has some simple information about the graph topology surrounding Bob (e.g., Bob has 4 neighbors), then it is easy for the attacker to locate Bob.

In recent work, Hay et al. demonstrated that a graph summarization approach is sufficient to prevent this particular attack, even in the case of an adversary who has strong structural background knowledge (i.e., knows the entire network topology surrounding the target node) [20]. Essentially, by grouping each person (node) into a supernode containing at least k − 1 other nodes, it is guaranteed that even a strong adversary will not be able to pinpoint a target individual beyond a group containing at least k − 1 others.

While graph-structure summarization is important in multiple domains, there has not been much work on how to actually use these summaries for data analysis. The process of summarization, by definition, removes some information from the original graph. Intuitively, this introduces some uncertainty into any query or analysis that takes the summarized graph as input. To address this problem, we propose formal probabilistic semantics for evaluating structural queries on graph summaries.

[Figure 2: Graph summarization example. (a) Input graph on nodes {1, 2, 3, 4, 5}; (b) partitioned graph; (c) graph summary with supernodes {1, 2, 3} (2 edges inside) and {4, 5} (1 edge inside), with 2 edges between them.]

Summary of Contributions. Our first main contribution is a formal semantics for query answering on graph summaries. We provide necessary definitions in Section 2, and then describe the query-answering semantics in Section 3. At its core, our approach is based on a random-worlds framework [6], and queries are answered using an expected-value semantics. Interestingly, we show that several important graph-structure queries (adjacency, degree, and eigenvector centrality) can be answered quickly and in closed form while still preserving our rigorous probabilistic semantics. More complex queries can be answered heuristically (with a high degree of accuracy) using a novel expected adjacency matrix structure.

Motivated by the above, we also study the problem of finding a "good" graph summary. In our problem setting, the quality of a summary is measured by the extent to which it alters the results of queries. In Section 4, we formulate three novel graph summarization problems. Two of these problems are motivated by compression and space efficiency; the third is motivated by privacy. For each of these problems, we describe algorithms for finding a good graph summary in Section 5. The graph summarization problems we study here give rise to three novel graph-partitioning problems that, to the best of our knowledge, have not been studied before. Finally, we provide an extensive experimental evaluation of our techniques in Section 6.

2 Graph Summarization Basics

Throughout this paper, we will consider an input graph G(V, E) that is simple, undirected, and unweighted; V denotes the set of n nodes V = {v1, . . . , vn} and E denotes the set of edges among these nodes. Given an input graph G(V, E), a summary S(G) consists of:

1. A partition of the nodes of V into parts 𝒱 = {V1, . . . , Vk}, such that Vi ⊆ V and Vi ∩ Vj = ∅ for i, j ∈ {1, . . . , k} and i ≠ j. We refer to each group of nodes Vi as a supernode of the summary S.

2. For every supernode Vi ∈ 𝒱, summary S describes the number of edges within the supernode; that is, it counts the number of edges in the input graph G that have both their endpoints in Vi. For supernode Vi we denote this number by Ei:

Ei = |{e(u, v) | u, v ∈ Vi, e(u, v) ∈ E}|.

3. For every pair of supernodes Vi, Vj ∈ 𝒱, summary S also gives the number of edges across the two supernodes; that is, it counts the number of edges in the input graph G that have one endpoint in Vi and the other endpoint in Vj. For supernodes Vi and Vj, we denote this number by Eij:

Eij = |{e(u, v) | u ∈ Vi, v ∈ Vj, e(u, v) ∈ E}|.

Example 1. Consider the input graph in Figure 2(a). In this case, the nodes are uniquely named with integer values. Figure 2(b) shows a possible partition of the nodes of the input. Figure 2(c) shows the corresponding summary, which consists of two supernodes.
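The edge counts Ei and Eij are straightforward to compute from a node partition in a single pass over the edge list. The following sketch (the function and variable names are ours, not the paper's) illustrates this on the graph of Figure 2(a):

```python
def build_summary(edges, partition):
    """Compute the summary edge counts for a given node partition:
    E_in[i] = edges with both endpoints inside supernode i (the E_i counts);
    E_cross[(i, j)] = edges between supernodes i and j, i < j (the E_ij counts)."""
    part_of = {v: i for i, group in enumerate(partition) for v in group}
    E_in = [0] * len(partition)
    E_cross = {}
    for u, v in edges:
        i, j = part_of[u], part_of[v]
        if i == j:
            E_in[i] += 1
        else:
            key = (min(i, j), max(i, j))
            E_cross[key] = E_cross.get(key, 0) + 1
    return E_in, E_cross

# Figure 2(a): edges of the input graph; Figure 2(b): partition {1,2,3}, {4,5}.
edges = [(1, 2), (2, 3), (2, 4), (2, 5), (4, 5)]
E_in, E_cross = build_summary(edges, [[1, 2, 3], [4, 5]])
# E_in == [2, 1] and E_cross == {(0, 1): 2}, matching Figure 2(c)
```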

Note that for any input graph G(V, E) there exist many possible summaries S; in fact, there are as many as there are partitions of the nodes in V. We use 𝒮(G) to denote the set of all (exponentially many) possible summaries that can be extracted from an input graph G(V, E). In Section 4, we discuss techniques for finding the best such partition/summary.

3 Probabilistic Queries on Graph Summaries

Replacing original graph G with summary S introduces uncertainty with respect to the structure of G. In this section, we formalize this intuition by defining the semantics of graph-structure queries on summaries.

Our formalism is based on a random-worlds framework [6]. Intuitively, given a summary, there are many possible original graphs that could have produced it. Following the principle of indifference [6], in the absence of additional information, it is reasonable to assume that each such reconstruction is equally likely. Using this assumption, we define the expected-value response for any real-valued query Q(). Further, for several important kinds of graph-structure queries (adjacency, degree, and eigenvector centrality) we show that the expected-value response is easily computed from the summary in polynomial time.

3.1 Random Worlds and Reconstructions. Consider the case where we are given only a summary S, but we want to find the answer to some structural query Q() with respect to the original graph G. In the absence of additional information, there are typically several graphs that could have produced summary S. We refer to each such graph as a reconstruction of S.¹

Definition 1. (Graph Reconstruction) Let S be a summary graph consisting of k supernodes 𝒱 = {V1, . . . , Vk} and edge counts Ei, Eij for i, j ∈ {1, . . . , k}. The set of valid reconstructions of S, denoted by R(S), is the set of graphs G(V, E) such that:

• For every i ∈ {1, . . . , k}, |{e(u, v) | u, v ∈ Vi, e(u, v) ∈ E}| = Ei.

• For every i, j ∈ {1, . . . , k} with i ≠ j, |{e(u, v) | u ∈ Vi, v ∈ Vj, e(u, v) ∈ E}| = Eij.

3.2 Expected-Value Semantics. To capture the uncertainty introduced by summarization, the semantics of query Q() on S should be defined, conceptually, with respect to the set of all valid reconstructions. For related problems, two main approaches have been proposed: expected-value semantics and range semantics (e.g., [5]).

Intuitively, using expected-value semantics, the answer to a query Q() is the expected result, given a distribution over possible reconstructions. In the absence of additional information, the principle of indifference suggests that it is reasonable to assume a uniform distribution.

Definition 2. (Expected Value Semantics) Let R(S) denote the set of all valid reconstructions from summary S, and let Q() denote a query on G with a boolean or real-valued response.² Under expected-value semantics, the answer to Q() is defined to be the real number e such that

e = ( Σ_{G ∈ R(S)} Q(G) ) / |R(S)|.

Note that the above equation assumes a uniform probability distribution over all graphs. Incorporating prior knowledge about the graph structure (e.g., giving preference to scale-free graphs) can easily be done by multiplying Q(G) by the prior probability P(G) of the graph G.

¹To use the terminology of [6], each reconstruction constitutes a possible world.

²For boolean queries, true = 1 and false = 0.

      1    2    3    4    5
 1    0    1    0    0    0
 2    1    0    1    1    1
 3    0    1    0    0    0
 4    0    1    0    0    1
 5    0    1    0    1    0

(a) Adjacency matrix for the graph in Figure 2(a)

      1    2    3    4    5
 1    0   2/3  2/3  1/3  1/3
 2   2/3   0   2/3  1/3  1/3
 3   2/3  2/3   0   1/3  1/3
 4   1/3  1/3  1/3   0    1
 5   1/3  1/3  1/3   1    0

(b) Expected adjacency matrix for the summary in Figure 2(c)

Figure 3: Expected adjacency matrix example

3.3 Adjacency Queries. One of the simplest and most common graph-structure queries is the adjacency query, which simply asks: given graph G(V, E) and two nodes u, v ∈ V, does there exist an edge (u, v) ∈ E? Given a summary S, the expected adjacency matrix captures the answers to all possible adjacency queries under expected-value semantics.

Definition 3. (Expected Adjacency Matrix) Let S be a summary graph. The expected adjacency matrix Ā for S is a |V| × |V| matrix, whose entries are real numbers in the range [0, 1], defined as follows:

Ā(u, v) = |{G(V, E) | G ∈ R(S), (u, v) ∈ E}| / |R(S)|.

Given a graph summary, each of the entries in the expected adjacency matrix is easily computed in closed form.

Theorem 3.1. Given summary S, the entries of the expected adjacency matrix Ā can be computed as follows:

1. If u, v ∈ V are distinct nodes in the same supernode Vi, then

   Ā(u, v) = 2Ei / (|Vi|(|Vi| − 1)).    (3.1)

2. If u, v ∈ V are distinct nodes in different supernodes Vi and Vj, then

   Ā(u, v) = Eij / (|Vi| × |Vj|).    (3.2)

3. Otherwise (if u = v),

   Ā(u, v) = 0.    (3.3)

The proof of the above theorem is based on basic probability manipulations and is omitted for space.

Example 2. Figure 3(a) shows the adjacency matrix for the graph in Figure 2(a), and Figure 3(b) shows the expected adjacency matrix given the summary in Figure 2(c).
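A direct implementation of Theorem 3.1 takes only a few lines. The sketch below (our own helper, not the paper's code) builds the expected adjacency matrix from a partition and the summary edge counts, using exact fractions; on the summary of Figure 2(c) it reproduces the matrix of Figure 3(b):

```python
from fractions import Fraction

def expected_adjacency(partition, E_in, E_cross):
    """Expected adjacency matrix of Theorem 3.1. `partition` lists the
    supernodes, `E_in[i]` is the edge count inside supernode i, and
    `E_cross[(i, j)]` (i < j) the count between supernodes i and j."""
    nodes = sorted(v for group in partition for v in group)
    pos = {v: p for p, v in enumerate(nodes)}
    part_of = {v: i for i, group in enumerate(partition) for v in group}
    n = len(nodes)
    A = [[Fraction(0)] * n for _ in range(n)]
    for u in nodes:
        for v in nodes:
            if u == v:
                continue                              # Eq. (3.3): A(u, u) = 0
            i, j = part_of[u], part_of[v]
            if i == j:                                # Eq. (3.1)
                s = len(partition[i])
                A[pos[u]][pos[v]] = Fraction(2 * E_in[i], s * (s - 1))
            else:                                     # Eq. (3.2)
                key = (min(i, j), max(i, j))
                A[pos[u]][pos[v]] = Fraction(E_cross.get(key, 0),
                                             len(partition[i]) * len(partition[j]))
    return A

# Summary of Figure 2(c): supernodes {1,2,3} (2 edges), {4,5} (1 edge), 2 across.
Abar = expected_adjacency([[1, 2, 3], [4, 5]], [2, 1], {(0, 1): 2})
# Abar[0][1] == 2/3, Abar[0][3] == 1/3, Abar[3][4] == 1, as in Figure 3(b)
```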

3.4 Degree Queries. A degree query is another simple query, of the form: given a node v ∈ V, how many edges in E touch v? Under expected-value semantics, we can define the answers to these queries using the expected adjacency matrix.

Theorem 3.2. The expected degree of a node v ∈ V is computed by d̄(v) = Σ_{j=1}^{|V|} Ā(v, j).

Proof. Suppose that there are m valid reconstructions, and the degree of node v in reconstruction i is given by di(v). Then, the expected degree of v is by definition d̄(v) = (Σ_{i=1}^{m} di(v)) / m. Let Ai(u, v) denote the adjacency-matrix entry for nodes u and v in the i-th reconstruction. Of course, the degree of a node v can be computed using a graph's adjacency matrix: di(v) = Σ_{j=1}^{|V|} Ai(v, j). Thus,

d̄(v) = ( Σ_{i=1}^{m} Σ_{j=1}^{|V|} Ai(v, j) ) / m = Σ_{j=1}^{|V|} Ā(v, j).

Example 3. Using the expected adjacency matrix in Figure 3(b), which is constructed for the summary in Figure 2(c), we have d̄(1) = 2, d̄(2) = 2, d̄(3) = 2, d̄(4) = 2, and d̄(5) = 2.
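Theorem 3.2 reduces degree queries on a summary to row sums. As a quick check (the matrix entries are copied from Figure 3(b); exact fractions keep the arithmetic clean):

```python
from fractions import Fraction as F

# Expected adjacency matrix of Figure 3(b), rows/columns indexed by nodes 1..5.
Abar = [
    [F(0),    F(2, 3), F(2, 3), F(1, 3), F(1, 3)],
    [F(2, 3), F(0),    F(2, 3), F(1, 3), F(1, 3)],
    [F(2, 3), F(2, 3), F(0),    F(1, 3), F(1, 3)],
    [F(1, 3), F(1, 3), F(1, 3), F(0),    F(1)],
    [F(1, 3), F(1, 3), F(1, 3), F(1),    F(0)],
]

# Theorem 3.2: the expected degree of node v is the sum of row v.
expected_degrees = [sum(row) for row in Abar]
# expected_degrees == [2, 2, 2, 2, 2], matching Example 3
```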

3.5 Eigenvector-Centrality Queries. Eigenvector centrality is one common way of measuring the importance of a node in a network.

Definition 4. (Eigenvector Centrality) Consider G(V, E). The eigenvector-centrality score of node vi, denoted p(vi), is the probability of being at node vi after infinitely many steps of a random walk on G. Alternatively, let A be the adjacency matrix of G, let D = diag(d(v1), . . . , d(vn)) be the diagonal matrix of the degrees of the n nodes in G, and let M = D⁻¹A. Then the eigenvector centrality p(vi) of vi is the i-th element of the left eigenvector of M that corresponds to the largest eigenvalue.

Note that for the eigenvector centrality to be defined, the Markov chain that corresponds to G needs to be connected and aperiodic. Therefore, the theoretical analysis that follows refers only to such graphs. In practice, we can guarantee aperiodicity, with small impact on the final centrality scores, by adding self-loops with very small weights to each input node.

Theorem 3.3. The expected eigenvector-centrality score of a node v ∈ V is computed in closed form by p̄(v) = d̄(v) / (2|E|).

Proof. Recall that in an undirected graph G(V, E), the eigenvector-centrality score of v ∈ V can be computed from its degree as p(v) = d(v) / (2|E|), where d(v) is the degree of v in G (for details see [25], Chapter 6). Now, suppose that there are m valid reconstructions given a summary S; note that every reconstruction has the same number of edges |E|, since the summary fixes the edge counts. If the degree (centrality score) of node v in reconstruction i is denoted by di(v) (pi(v)), then we have

p̄(v) = ( Σ_{i=1}^{m} pi(v) ) / m = ( Σ_{i=1}^{m} di(v) / (2|E|) ) / m = d̄(v) / (2|E|).

Example 4. For the graph summary in Figure 2(c), we have p̄(1) = 1/5, p̄(2) = 1/5, p̄(3) = 1/5, p̄(4) = 1/5, and p̄(5) = 1/5.
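Since Theorem 3.3 expresses the expected centrality purely in terms of expected degrees and the (fixed) edge count, it costs essentially nothing beyond a degree query. A sketch for the summary of Figure 2(c), where every expected degree is 2 and the summary fixes |E| = 2 + 1 + 2 = 5:

```python
# Expected degrees from Example 3 (nodes 1..5 of Figure 2).
expected_degree = {1: 2, 2: 2, 3: 2, 4: 2, 5: 2}

# Handshake lemma: the degrees sum to 2|E|, so |E| = 5 here.
num_edges = sum(expected_degree.values()) // 2

# Theorem 3.3: p(v) = d(v) / (2|E|).
centrality = {v: d / (2 * num_edges) for v, d in expected_degree.items()}
# centrality[v] == 0.2 for every node v, matching Example 4
```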

3.6 Complex Queries. In the previous three subsections, we have demonstrated that three simple kinds of queries (adjacency, degree, and eigenvector centrality) can be answered in closed form using expected-value semantics. While strict adherence to the formal semantics is a worthy goal, and it is useful to continue to search for closed-form solutions for other queries, we observe that many queries can be answered heuristically, with reasonable accuracy, by viewing the expected adjacency matrix as a weighted graph. For example, in Section 6, we provide some experimental results for PageRank, indicating that this approach works fairly well.

4 Summarization Problems

In this section, we turn our attention to the problem of finding a good summary, either when the goal is compression (or space efficiency), or when the goal is privacy protection. In both cases, we measure the quality of a summary based on how well the summary S(G) describes the input graph G. Given an input graph G, we want to find the summary S(G) such that G's adjacency matrix A and the expected adjacency matrix Ā of summary S(G) are as similar as possible. This intuition is captured by the reconstruction error. Intuitively, the reconstruction error measures the average absolute error (resulting from using summary S rather than G) across all possible adjacency queries.

Definition 5. (Reconstruction Error (Re)) Let G(V, E) be an input graph described by a (|V| × |V|) adjacency matrix A. Let S be a summary of G, and let Ā be the expected adjacency matrix for S. We define the (normalized) reconstruction error of S with respect to G as follows:

Re(A | Ā) = (1/|V|²) Σ_{i=1}^{|V|} Σ_{j=1}^{|V|} |A(i, j) − Ā(i, j)|.

[Figure 4: Novel partitioning problem. (a) Input graph (a star on nodes 1–5); (b) optimal partition (k = 2).]
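For the running example, the reconstruction error of Definition 5 can be computed directly by comparing the two matrices of Figure 3 (a sketch; exact fractions keep the arithmetic clean):

```python
from fractions import Fraction as F

# Figure 3(a): adjacency matrix of the input graph (nodes 1..5).
A = [[0, 1, 0, 0, 0],
     [1, 0, 1, 1, 1],
     [0, 1, 0, 0, 0],
     [0, 1, 0, 0, 1],
     [0, 1, 0, 1, 0]]

# Figure 3(b): expected adjacency matrix of the summary in Figure 2(c).
Abar = [
    [F(0),    F(2, 3), F(2, 3), F(1, 3), F(1, 3)],
    [F(2, 3), F(0),    F(2, 3), F(1, 3), F(1, 3)],
    [F(2, 3), F(2, 3), F(0),    F(1, 3), F(1, 3)],
    [F(1, 3), F(1, 3), F(1, 3), F(0),    F(1)],
    [F(1, 3), F(1, 3), F(1, 3), F(1),    F(0)],
]

n = len(A)
# Definition 5: average absolute difference over all n^2 entries.
Re = sum(abs(A[i][j] - Abar[i][j]) for i in range(n) for j in range(n)) / F(n * n)
# Re == 8/25 for this example
```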

In the rest of this section we formally define the optimization problems that we study in the rest of the paper. Namely, we define three problems: k-Graph Summarization (k-Gs), Graph Summarization (Gs), and k-Capacitated Graph Summarization (k-CGs). k-Gs and Gs are inspired by the space-efficiency problem; k-CGs is inspired by the privacy problem.

4.1 Space Efficiency Problems (k-Gs and Gs). In the k-Gs problem, the maximum number of supernodes k in the output summary is specified as a constraint. Then, for a given input graph, the goal is to find the best summary graph with at most k supernodes. In contrast, in the Gs problem, we adopt a Minimum Description Length (MDL) formulation, rather than assuming that a fixed k is specified in advance. In that case, the goal is to find the best summary graph S so that exactly the right number of supernodes is included in the summary. The right number of supernodes is defined using information-theoretic arguments, and it depends on the complexity of the summary and the error that the summary introduces in the description of the data.

In the rest of the section we formally define each of the above problems.

Problem 1. (k-Gs) Given input graph G(V, E) and integer k, find a summary graph S for G with at most k supernodes 𝒱 (|𝒱| ≤ k), such that Re(G | S) is minimized.

Note that Problem 1 is a novel graph-partitioning problem. Although other variants of graph partitioning have been extensively studied (see Section 7), their focus has been on finding dense graph components, or (approximately) disconnected parts of the graph. Our problem has a different flavor: the nodes that belong to a partition (supernode) need not necessarily be densely connected; being disconnected is equally good. The special nature of the problem is illustrated by the following example.

Example 5. Consider the star-shaped graph shown in Figure 4(a). The optimal partitioning solution to Problem 1 for k = 2 is shown in Figure 4(b). The summary graph resulting from this partitioning has reconstruction error Re = 0. Such a partitioning is not optimal with respect to any other graph-partitioning problem.

We devote the rest of this subsection to describing a variant of Problem 1 where the maximum number of supernodes k is not given as part of the input; rather, the objective function itself determines the optimal number of supernodes. We achieve this balance by using the MDL principle to guide our problem formulation.

The Minimum Description Length (MDL) principle: The MDL principle states the following: assume two parties P and P′ want to communicate with one another. More specifically, assume that P wants to send a graph G to P′ using as few bits as possible. In order for P to minimize the communication cost, she has to select a model S from the class of models 𝒮 and use S to describe her data. Then, she can send P′ model S, plus the additional information required to describe G given the transmitted model. In our case, the class of models we are considering is the class of summary graphs; thus, P has to encode the summary graph, plus the data given the summary graph. The quality of the selected model (summary graph) can be evaluated based on the total number of bits required to encode the model and the data given the model.

MDL discourages complex models with small data cost and simple models with large data cost. Intuitively, it strikes a balance between these two extremes. Clearly, the MDL principle is more general than our graph-summarization problem. In the past, it has been successfully applied in a variety of settings, including decision-tree classifiers [24], genetic-sequence modelling [23], patterns in sets of strings [22], and many more. We now describe our instantiation of the MDL principle.

Cost of the data given the model: Our input data is graph G(V, E), while our model is the summary S of the graph. Let A and Ā be the adjacency matrix of G and the expected adjacency matrix given S, respectively.

For now, assume that the permutations of the rows (and columns) of A and Ā are the same; that is, the i-th row (column) of A corresponds to the i-th row (column) of Ā. Recall that the value of Ā(i, j) corresponds to the probability of the existence of an edge between nodes i and j of the original graph. Then, the probability of A(i, j) = 1 is

Pr[A(i, j) = 1] = Ā(i, j),

while the probability of A(i, j) = 0 is

Pr[A(i, j) = 0] = 1 − Ā(i, j).

Therefore, given model S described by the expected adjacency matrix Ā, the probability of observing the original graph G with adjacency matrix A is

Pr[A | Ā] = Π_{i,j} Ā(i, j)^{A(i,j)} (1 − Ā(i, j))^{1−A(i,j)}.

Subsequently, the number of bits required to describe A given Ā is − log Pr[A | Ā],³ and thus

B(A | Ā) = − log Pr[A | Ā] = Σ_{i,j} ( −A(i, j) log Ā(i, j) − (1 − A(i, j)) log(1 − Ā(i, j)) ).

Cost of describing the model: Our model is the summary S, which is, in principle, described by the expected adjacency matrix Ā. Here we describe the cost (in bits) of encoding Ā. Assume that the summary graph S has k supernodes. Matrix Ā is of size n × n (|V| = n), but the n rows and columns can be rearranged so that all the nodes that belong to the same supernode in S are close to each other. Such a rearrangement produces a matrix that consists of k × k blocks. Notice that the entries of matrix Ā within every such block are constant; that is, there is a single value that describes the probability of an edge between nodes in the block. In order to encode this matrix we need to encode the following information:

1. We need to encode the permutation of the rows (and columns) used for the rearrangement of matrix Ā. Since there are n = |V| rows, n log n bits are needed to encode this permutation.

2. We need to specify the partitioning of the rows (and columns) of Ā into k groups. Given the permutation, it is enough to specify the indices of the rows where a new block starts. For k blocks, k − 1 indices need to be specified, which requires a total of (k − 1) log n bits. Note that we do not need to specify the indices of the columns, since the permutations of the rows and the columns of matrix Ā are identical.

³Recall that the number of bits needed to describe an event e that occurs with probability pe is − log pe.

3. Each one of the k × k blocks of matrix Ā is characterized by a single parameter, namely the probability of an edge between any two nodes in the supernode(s) associated with the block. This parameter is a real number, and by standard information-theoretic arguments we need (1/2) log n² = log n bits to encode it. Since there are k² such parameters, the cost of their description sums up to k² log n bits.

Therefore, the total number of bits required to encode the model is

B(Ā) = n log n + (k − 1) log n + k² log n = (n + k² + k − 1) log n.    (4.4)

Putting it all together: Using the standard MDL methodology, we define our problem as that of finding the model S (or Ā) that minimizes the total number of bits. This is formalized in the following problem definition.

Problem 2. (Gs) Given input graph G(V, E) with adjacency matrix A, find a summary S of this graph, with expected adjacency matrix Ā, such that the total number of bits

Tb(Ā) = B(Ā) + B(A | Ā)    (4.5)

is minimized.
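The two cost terms of Equation 4.5 are easy to evaluate directly. A sketch (function names are ours), using the convention 0 · log 0 = 0 and assuming Ā comes from a valid summary, so an entry Ā(i, j) of 0 or 1 always agrees with A(i, j):

```python
from math import log2

def model_bits(n, k):
    """Equation (4.4): bits to encode a summary with k supernodes over n nodes."""
    return (n + k * k + k - 1) * log2(n)

def data_bits(A, Abar):
    """B(A | Abar): bits to encode the adjacency matrix given the expected one."""
    def entry_bits(a, p):
        bits = 0.0
        if p > 0:                     # convention: 0 * log 0 = 0
            bits -= a * log2(p)
        if p < 1:
            bits -= (1 - a) * log2(1 - p)
        return bits
    n = len(A)
    return sum(entry_bits(A[i][j], Abar[i][j]) for i in range(n) for j in range(n))
```

For the summary of Figure 2(c) (n = 5, k = 2), the model cost is 10 log₂ 5 ≈ 23.2 bits, and the data cost of encoding the matrix of Figure 3(a) given the matrix of Figure 3(b) comes to 18 log₂ 3 − 12 ≈ 16.5 bits.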

4.2 Privacy Problem (k-CGs). In the k-CGs problem we still want to find the graph summary that minimizes the reconstruction error. However, the difference from the k-Gs problem is that we restrict the graph summary so that every supernode in the summary must contain at least k nodes from the original graph.⁴

This formalizes the k-anonymity requirement [30, 31] on graphs: every node in the original graph must be hidden in the anonymity of the supernode to which it belongs. At the same time, we seek to maximize the utility of the anonymized graph. The k-CGs problem is formalized as follows.

Problem 3. (k-CGs) Given input graph G(V, E) and integer k, find a summary graph S for G with supernodes 𝒱 such that Re(G | S) is minimized and, for every V′ ∈ 𝒱, |V′| ≥ k.

⁴The parameter k is given as input to the problem; note that its semantics differs from that of parameter k in the k-Gs problem!

5 Algorithms

In this section we give algorithms for Problems 1, 2, and 3. Note that for Problems 1 and 2 we use the same Greedy algorithm, while our algorithmic solution to Problem 3 is different.

5.1 Solving the k-Gs and Gs problems. We start by describing a natural greedy algorithm for Problems 1 and 2. The Greedy algorithm starts with a summary graph in which each node is placed in a separate supernode. In each step, it merges the two supernodes that cause the largest reduction in the objective function. The algorithm repeats until a stopping condition is met. For the k-Gs problem, the stopping condition is that the summary graph contains k supernodes; for the Gs version of the problem, the stopping condition is that no merging of supernodes can improve the objective function Tb (Equation 4.5).

We give a detailed description of Greedy for the case where the goal is to minimize the Re objective function; the algorithm is identical when used to optimize objective function Tb. As mentioned before, the algorithm proceeds in iterations. Originally, the summary graph S0 is identical to the input graph. At the beginning of the t-th step, t − 1 merges have been performed, and the current summary St−1 has n − t + 1 supernodes. At each step t the algorithm needs to perform two tasks: (a) find the pair of supernodes in St−1 such that the summary graph St generated after their merge has the property that Re(A | Āt−1) − Re(A | Āt) is maximum, and (b) actually perform the merge of that pair of supernodes and update summary St−1 to summary St.

Step (a) requires going through all pairs of nodes in S_{t−1} and computing from scratch the reduction in Re that their merge would cause.⁵ This can be naively done in O(n³) time, since one has to go through all pairs of supernodes (O(n²)) and, for each such pair, compute the total reduction in cost function Re. This latter step can be done, with the right bookkeeping, in O(n) time. Step (b) requires updating S_{t−1} to S_t by simply removing the two individual supernodes that have been merged and adding the new supernode that results from the merge. Thus, the time required for step (b) is just O(n), since both the old supernodes that are removed and the new supernode that is added can have at most O(n) neighbors. Therefore, the total running time of the above implementation of steps (a) and (b) is O(n³ + n) = O(n³). For the k-Gs problem, steps (a) and (b) need to be repeated n − k times, so the total running time of the algorithm is O(n⁴). Clearly, this running time is prohibitive for large datasets. The algorithms described in the next two subsections aim at reducing this running time.

⁵The reduction in the value of Re can be both positive and negative.
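The bottom-up merging loop can be sketched as follows. This is a minimal illustrative sketch, not the bookkeeping-optimized version analyzed above, and it assumes a particular form for the reconstruction error: the expected answer to an adjacency query is the number of edges between the two endpoints' supernodes divided by the number of node pairs those supernodes cover, and Re averages the absolute query error over all node pairs. The function names and data layout are ours.

```python
from itertools import combinations

def reconstruction_error(adj, parts):
    """Average absolute adjacency-query error of summary `parts` (a list of node sets).

    adj: dict mapping each node to its set of neighbors (undirected graph).
    """
    part_of = {u: i for i, part in enumerate(parts) for u in part}
    # Count original edges between (and within) supernodes.
    e = {}
    for u in adj:
        for v in adj[u]:
            if u < v:
                key = tuple(sorted((part_of[u], part_of[v])))
                e[key] = e.get(key, 0) + 1
    nodes = sorted(adj)
    total = 0.0
    for u, v in combinations(nodes, 2):
        i, j = part_of[u], part_of[v]
        if i == j:
            s = len(parts[i])
            pairs = s * (s - 1) // 2
            exp = e.get((i, i), 0) / pairs if pairs else 0.0
        else:
            key = tuple(sorted((i, j)))
            exp = e.get(key, 0) / (len(parts[i]) * len(parts[j]))
        actual = 1.0 if v in adj[u] else 0.0
        total += abs(actual - exp)
    n = len(nodes)
    return total / (n * (n - 1) / 2)

def greedy_k_gs(adj, k):
    """Naive Greedy for k-Gs: merge the supernode pair that most reduces Re,
    recomputing Re from scratch for every candidate merge, until k supernodes remain."""
    parts = [{u} for u in adj]
    while len(parts) > k:
        best = None
        for a, b in combinations(range(len(parts)), 2):
            merged = [p for i, p in enumerate(parts) if i not in (a, b)] + [parts[a] | parts[b]]
            err = reconstruction_error(adj, merged)
            if best is None or err < best[0]:
                best = (err, merged)
        parts = best[1]
    return parts
```

Note that a summary of two disjoint triangles, with one supernode per triangle, has zero reconstruction error: every within-supernode pair is an edge and every cross-supernode pair is a non-edge.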

5.1.1 The SamplePairs algorithm As for the Greedy algorithm, we describe the SamplePairs algorithm for the k-Gs problem and the Re optimization function.

In fact, the SamplePairs algorithm is an instance of the Greedy algorithm. Thus, SamplePairs also proceeds in rounds. At every round t a summary S_t is constructed by merging two supernodes of the summary S_{t−1}. The pair of supernodes to be merged are the ones that, after merging, cause the smallest increase (or the largest reduction) in the objective function. The only difference here is that not all pairs of supernodes are checked for the impact that their merge has on the optimization function. Instead, in every iteration t, a sample P(t) of the supernode pairs is selected uniformly at random. The pair in P(t) that increases the objective function the least (or decreases it the most) is picked. The next iteration is executed analogously by picking a new set of candidate supernode pairs for merging.

By following the same line of thought as in the previous running-time analysis, the running time of the SamplePairs algorithm is O(|P(t)| n²). Therefore, if we pick a constant number of pairs K in every round, the running time of the SamplePairs algorithm is quadratic in the number of nodes in the input graph.

Alternatively, at every iteration t we may sample a set of pairs P(t) with cardinality proportional to the number of supernodes n(t) in the graph summary at iteration t: |P(t)| = c · n(t). In this case, the running time is O(n³), which is still an improvement over the Greedy algorithm.
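A single SamplePairs round can be sketched generically, with the objective function passed in as a black box (in the paper this would be Re or T_B; the toy objective used in testing is ours, purely for illustration):

```python
import random
from itertools import combinations

def sample_pairs_step(parts, objective, c=1.0, rng=random):
    """One SamplePairs round: sample |P(t)| = max(1, c * n(t)) supernode pairs
    uniformly at random and perform the sampled merge that minimizes the
    objective (i.e., smallest increase / largest reduction)."""
    def merged(a, b):
        return [p for i, p in enumerate(parts) if i not in (a, b)] + [parts[a] | parts[b]]
    all_pairs = list(combinations(range(len(parts)), 2))
    size = min(len(all_pairs), max(1, int(c * len(parts))))
    sample = rng.sample(all_pairs, size)
    a, b = min(sample, key=lambda ab: objective(merged(*ab)))
    return merged(a, b)
```

With c large enough that all pairs are sampled, a round degenerates to a full Greedy step; with constant sample size it costs only O(|P(t)| n) per round.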

5.1.2 The LinearCheck algorithm The LinearCheck algorithm is similar in spirit to the SamplePairs algorithm. Recall that in SamplePairs, at every iteration t, P(t) pairs of supernodes were sampled as candidates for merging. In the LinearCheck algorithm, at every iteration t, a single supernode V_t is picked uniformly at random among all the supernodes in summary S_{t−1}. Then, the gain in the objective function obtained by merging V_t with each of the other supernodes in S_{t−1} is evaluated. The merge with the best gain is picked and the summary S_t is constructed.

By following the analysis of the previous two algorithms, it is easy to see that the worst-case running time of the LinearCheck algorithm is O(n³). However, note that for LinearCheck, as well as for the Greedy and SamplePairs algorithms, these are all worst-case running times. In practice, depending on the number of neighbors of each node in the graph, the running time of the algorithms is much less than the corresponding worst-case bounds.
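One LinearCheck round then differs from SamplePairs only in how candidates are chosen: one endpoint of the merge is fixed by a uniform random draw, and only its O(n) possible partners are evaluated (sketch with the objective as a black box, as before):

```python
import random

def linear_check_step(parts, objective, rng=random):
    """One LinearCheck round: pick a single supernode uniformly at random and
    evaluate only its merges with each of the other supernodes, i.e., O(n)
    candidate merges instead of O(n^2)."""
    t = rng.randrange(len(parts))
    def merged(a, b):
        return [p for i, p in enumerate(parts) if i not in (a, b)] + [parts[a] | parts[b]]
    best = min((j for j in range(len(parts)) if j != t),
               key=lambda j: objective(merged(t, j)))
    return merged(t, best)
```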

5.2 Solving the k-CGs problem The algorithm we use for solving the k-CGs problem is again an iterative algorithm; we will call the algorithm Condense due to its similarity to the condensation algorithm proposed by Aggarwal and Yu for (non-graphical) tabular data [3].

Let S_i be the summary graph when iteration i starts. Also let V_i be the set of original nodes that have not yet been assigned to a supernode in S_i. In iteration i, the Condense algorithm randomly selects a node v from V_i, and forms a "ball" around v. This ball, denoted by B(v), consists of k − 1 other nodes from V_i, chosen so that when combined with v in a supernode they cause the least increase in the current value of the reconstruction error. On each iteration, the nodes B(v) ∪ {v} are added to the summary graph S_i as a new supernode.

At every step of the Condense algorithm, the supernodes that are added to the summary have size exactly k. If in the last iteration fewer than k nodes remain without being assigned to a supernode, we assign them to already-existing supernodes. The assignment is done sequentially (in random order) so that the reconstruction error is minimized at each step.

Finally, on each iteration, we must find the ball B(v) around node v that minimizes the increase in reconstruction error. This may not be a computationally easy task. Thus, we use a simple greedy process for forming balls around the nodes in the original graph.
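The Condense loop with a greedy ball-forming step might look as follows. The cost function here, counting a candidate's non-neighbors inside the growing ball, is a stand-in of our own choosing for the increase in reconstruction error; the paper's actual cost is the change in Re.

```python
import random

def condense(adj, k, seed=0):
    """Sketch of Condense for k-CGs: pick an unassigned node v at random,
    greedily grow a ball of k - 1 more unassigned nodes (each minimizing the
    stand-in cost), and emit the ball plus v as a new supernode of size exactly
    k; fewer-than-k leftovers join the existing supernode of least cost."""
    rng = random.Random(seed)
    cost = lambda ball, w: sum(1 for u in ball if u not in adj[w])
    unassigned = set(adj)
    parts = []
    while len(unassigned) >= k:
        v = rng.choice(sorted(unassigned))
        ball = {v}
        unassigned.remove(v)
        while len(ball) < k:
            u = min(sorted(unassigned), key=lambda w: cost(ball, w))
            ball.add(u)
            unassigned.remove(u)
        parts.append(ball)
    for w in sorted(unassigned):  # fewer than k nodes remain
        parts[min(range(len(parts)), key=lambda i: cost(parts[i], w))].add(w)
    return parts
```

On a graph made of two disjoint triangles with k = 3, the greedy ball around any start node collects its two triangle-mates (cost 0) before any cross-triangle node (cost > 0), so the two triangles are recovered as supernodes.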

6 Experimental Evaluation

6.1 Datasets We evaluated our proposed algorithms using both real and synthetic graph data. The real data sets were as follows:

• Authors: The first data set is an undirected co-authorship graph. We studied the bibliographies for VLDB, SIGMOD, and PODS available at the Collection of Computer Science Bibliographies⁶ and extracted the lists of authors from each paper. A selection of papers with at least 2 authors yielded 3109 papers with a total of 9538 authors (nodes in the graph). However, in some experiments, we reduced the size of the graph by sampling a smaller subset of the nodes.

• Italian Wikipedia: The second graph was extracted from the link structure of the Italian-language version of Wikipedia, and the full graph contains 15,000 nodes. While the original graph is directed, in all of our experiments we view the graph as undirected.

⁶http://liinwww.ira.uka.de/bibliography/

In addition, we constructed a synthetic data generator, which generates undirected graphs according to two distributions:

• Uniform(n, d): The edges in these graphs are chosen according to a uniform distribution based on a density parameter d. That is, suppose we want to generate a uniform graph with n nodes. Each of the n(n − 1)/2 possible edges (no self-edges) is selected with probability d.

• Barabasi-Albert(n, m0, m): These graphs are generated according to the procedure proposed by Barabasi and Albert [8]. The graph begins with an initial network of m0 nodes, and new nodes are added one at a time. Each new node is connected to m of the existing nodes with probability proportional to the number of links each existing node already has.
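Under the descriptions above, the two generators can be sketched as follows. The path-shaped seed network and the urn-based implementation of preferential attachment are implementation choices of ours, since the text does not pin them down:

```python
import random

def uniform_graph(n, d, seed=0):
    """Uniform(n, d): include each of the n(n-1)/2 possible undirected edges
    (no self-loops) independently with probability d."""
    rng = random.Random(seed)
    return {(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < d}

def barabasi_albert(n, m0, m, seed=0):
    """Barabasi-Albert(n, m0, m): start from m0 seed nodes (here a simple path);
    each new node attaches to m distinct existing nodes with probability
    proportional to their current degree (urn with one entry per degree unit)."""
    rng = random.Random(seed)
    edges = {(u, u + 1) for u in range(m0 - 1)}
    urn = [u for e in edges for u in e]          # each node repeated once per edge endpoint
    for v in range(m0, n):
        targets = set()
        while len(targets) < m:                  # requires m <= number of existing nodes
            targets.add(rng.choice(urn))
        for t in targets:
            edges.add((t, v))
            urn += [t, v]
    return edges
```

A Barabasi-Albert(n, m0, m) graph built this way has exactly (m0 − 1) + m·(n − m0) edges, which is how one matches edge counts with a Uniform graph of a given density.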

6.2 Algorithms for k-Gs and Gs Our first set of experiments evaluates the algorithms that we proposed for the k-Gs and Gs problems. (These problems are the space-limited and MDL problem formulations, respectively.)

6.2.1 Comparing Algorithms Our first set of experiments compares the algorithms proposed for these problems in Section 5.1 – SamplePairs and LinearCheck – in terms of resulting reconstruction error. Note that due to the high time complexity of the Greedy algorithm, we did not run experiments using the full enumeration of possible merges. However, we do observe that as we increase parameter c, the results for SamplePairs appear to converge. In addition, for the sake of comparison, we include two other algorithms: the first is a naive Random algorithm, which operates in a bottom-up manner and at each iteration selects a random pair of nodes to merge. The second is a state-of-the-art community detection algorithm called FastGreedy [12]. This latter algorithm is designed to solve a different problem: that of detecting dense, relatively isolated components of the input graph.

Figures 5, 6, and 7 show our results for the Authors data set (samples of size 500 and 5,000) and the Italy data set (sample of size 5,000). Recall that the reconstruction error intuitively measures the average absolute error over all possible adjacency queries. As expected, the LinearCheck and SamplePairs algorithms significantly outperform Random, and within SamplePairs, as we increase the value of c, the overall quality of the results improves, appearing to near a point of convergence.

[Figure 5: k-Gs, Authors(500). Reconstruction error vs. number of supernodes for Random, FastGreedy, LinearCheck, and SamplePairs (c = 0.1, 1.0, 10.0).]

[Figure 6: k-Gs, Authors(5000). Reconstruction error vs. number of supernodes for Random, FastGreedy, LinearCheck, and SamplePairs (c = 0.1, 1.0).]

[Figure 7: k-Gs, Italy(5000). Reconstruction error vs. number of supernodes for Random, FastGreedy, LinearCheck, and SamplePairs (c = 0.1, 1.0).]

[Figure 8: k-Gs, Authors(500). Average degree error vs. number of supernodes for Random and SamplePairs (c = 0.1, 1.0).]

[Figure 9: k-Gs, Authors(500). Average centrality error vs. number of supernodes for Random and SamplePairs (c = 0.1, 1.0).]

[Figure 10: k-Gs, Authors(500). Average PageRank error vs. number of supernodes for Random and SamplePairs (c = 0.1, 1.0).]

It is also interesting to observe that LinearCheck and SamplePairs (c = 1) have the same time complexity. However, in all of our experiments, SamplePairs significantly outperformed LinearCheck. By not fixing either of the supernodes to participate in the merge, and instead selecting pairs at random, it appears that we achieve a better (though still incomplete) exploration of the space of possible merges during each iteration.

Finally, it is important to observe that our algorithms yield smaller reconstruction errors than FastGreedy, which indicates that it is important to consider the problem formulation based on reconstruction error when the goal is accurate query answering.

6.2.2 Querying Graph Summaries In addition to reconstruction error, we conducted additional experiments to understand the effects of summarization on other kinds of queries and analyses.

The first set of experiments measured the average absolute degree error across all nodes in the original graph G(V, E). That is, if d(v) and d̂(v) denote the original and expected responses, respectively, to the query requesting the degree of node v, then we measure the average error as (1/|V|) Σ_{v_i ∈ V} |d(v_i) − d̂(v_i)|. Figure 8 shows our results for Authors(500). Clearly, SamplePairs reduces the degree error substantially compared to Random. (We also obtained similar results for the Italy dataset, but they are omitted in the interest of space.)
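For concreteness, the degree-error measure can be computed directly from a summary. This sketch assumes (our choice, not taken verbatim from the paper) that the expected adjacency value between two nodes is the edge count between their supernodes divided by the number of node pairs the two supernodes cover, and that the expected degree d̂(v) sums these values over all other nodes:

```python
def avg_degree_error(adj, parts):
    """(1/|V|) * sum over v of |d(v) - d_hat(v)| for summary `parts`.

    adj: dict node -> set of neighbors; parts: list of node sets (supernodes).
    """
    part_of = {u: i for i, p in enumerate(parts) for u in p}
    e = {}
    for u in adj:
        for v in adj[u]:
            if u < v:
                key = tuple(sorted((part_of[u], part_of[v])))
                e[key] = e.get(key, 0) + 1
    def expected(u, v):
        i, j = part_of[u], part_of[v]
        if i == j:
            s = len(parts[i])
            return 2 * e.get((i, i), 0) / (s * (s - 1)) if s > 1 else 0.0
        key = tuple(sorted((i, j)))
        return e.get(key, 0) / (len(parts[i]) * len(parts[j]))
    total = 0.0
    for u in adj:
        d_hat = sum(expected(u, v) for v in adj if v != u)
        total += abs(len(adj[u]) - d_hat)
    return total / len(adj)
```

For example, on a 4-node star summarized as {center, leaf} and {leaf, leaf}, the center's expected degree is smoothed down and one leaf's is smoothed up, while the other two leaves are answered exactly.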

The second set of experiments measured the average absolute error in centrality score, also across all nodes in G(V, E). Figure 9 shows results for Authors(500); as expected, the results are proportional to those for degree error.

Finally, while we have a closed-form solution for computing the expected centrality score (Theorem 3.3), we also consider (empirically) computing the related PageRank score.⁷ An intuitive heuristic for computing the expected PageRank score is to view the expected adjacency matrix as a weighted graph and, conceptually, use this as input. Figure 10 shows that, despite the heuristic query evaluation semantics, the use of clever summarization algorithms is effective in reducing the average PageRank error. (In Figure 10, we compute PageRank with β = 0.85.)
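This heuristic can be sketched as a standard damped power iteration over the weighted (expected-adjacency) graph. The row-normalization of weights into transition probabilities and the uniform handling of zero-weight rows are our own implementation choices, not prescribed by the text:

```python
def pagerank_weighted(W, beta=0.85, iters=100):
    """Power iteration for PageRank on a weighted graph W
    (dict node -> dict of neighbor -> nonnegative weight)."""
    nodes = list(W)
    n = len(nodes)
    r = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1.0 - beta) / n for u in nodes}
        for u in nodes:
            out = sum(W[u].values())
            if out == 0:
                for v in nodes:                      # dangling row: spread uniformly
                    nxt[v] += beta * r[u] / n
            else:
                for v, w in W[u].items():            # follow edges proportionally to weight
                    nxt[v] += beta * r[u] * w / out
        r = nxt
    return r
```

On a symmetric, equally weighted graph (e.g., a triangle) the scores stay uniform, matching the β = 1.0 case's agreement with eigenvector centrality.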

6.2.3 Graph Compressibility Our final set of experiments is intended to understand the degree to which different kinds of graphs can be compressed using our summarization scheme (Section 2). For these experiments, we used synthetic data with varied generative models and parameters.

The first experiment used uniform graphs with varied density values. Intuitively, we expect that very sparse graphs and very dense graphs will be the most easily compressed. For example, consider a set of nodes

⁷PageRank [10] introduces an additional damping factor β. PageRank is equivalent to eigenvector centrality when β = 1.0.

[Figure 11: k-Gs, Uniform(500). Reconstruction error vs. number of supernodes for densities 50%, 75%, 90%, and 99%.]

[Figure 12: k-Gs, Uniform(500). Reconstruction error vs. number of supernodes for densities 50%, 25%, 10%, and 1%.]

[Figure 13: Gs, Uniform(500). Minimum value of T_B vs. density (%).]

that form a clique (or an independent set). If these nodes are grouped together in the same supernode, then there is no reconstruction error within this supernode. Similarly, consider the nodes in two supernodes s1 and s2. If every node in s1 is connected to every node in s2 (or no node in s1 is connected to any node in s2), then there is no between-supernode reconstruction error.

We confirmed this intuition experimentally. For synthetic graphs consisting of 500 nodes each, Figures 11 and 12 show the reconstruction error for varied numbers of supernodes. Each of these lines was obtained using the SamplePairs algorithm (c = 5.0). Clearly, the very sparse (density = 1%) and very dense (density = 99%) graphs produce the least reconstruction error. (In fact, the symmetric pairs 1% and 99%, 25% and 75%, and 10% and 90% were so similar that we had to display the results in two separate graphs.) In addition, Figure 13 shows, for varied density, the minimum value of T_B (Equation 4.5), which measures the cost in bits of encoding the summary and the data given the summary. This also indicates that sparse and dense graphs are the most compressible.

Finally, Figure 14 shows results comparing Barabasi-Albert graphs and Uniform graphs. (Notice that Barabasi(5) and Uniform(1%), and Barabasi(50) and Uniform(10%), each contain the same total number of edges.) Our empirical results indicate that the uniform graphs are somewhat more easily compressed.

6.3 Algorithms for k-CGs In this section, we evaluate our proposed algorithm for the k-CGs problem. This problem is inspired by the k-anonymity requirement for graphs. Our proposed algorithm – Condense – is described in Section 5.2.

To gauge the effectiveness of our algorithm, we compared it to a strawman. In this case, the strawman iteratively chooses a random group of k nodes from G, and assigns these nodes to a single supernode. (At the end, the final supernode will contain between k

[Figure 14: k-Gs, Barabasi-Albert(500). Reconstruction error vs. number of supernodes for Barabasi (m = 50), Uniform (density = 10%), Barabasi (m = 5), and Uniform (density = 1%).]

and 2k − 1 nodes from the original graph.) Figures 15 and 16 compare this strawman algorithm (Random) with the Condense algorithm using the Authors(500) and Italy(500) data sets, respectively. In each comparison, we use several different values of k, which is the required minimum occupancy for a supernode. Clearly, Condense outperforms Random.

7 Related Work

Graph-partitioning and dense subgraph problems: Closely related to our work are problems associated with graph partitioning, where an input graph G(V, E) must be partitioned into k disjoint parts according to a certain optimization criterion. Several versions of this problem have been studied.⁸ The use of eigenvectors for the purposes of partitioning was first introduced by Donath and Hoffman in [14] and has been studied extensively since. Although our problem seems similar to existing graph-partitioning problems, our optimization function (reconstruction error) is different from the standard optimization functions of previous work.

Given G(V, E), the densest subgraph problem asks for a subset V′ ⊆ V of arbitrary size such that the vertex-induced subgraph has maximum average degree.

⁸See http://en.wikipedia.org/wiki/Graph_partitioning for a short overview of such problems and references.

[Figure 15: k-CGs, Authors(500). Reconstruction error vs. k (minimum occupancy constraint) for Random and Condense.]

[Figure 16: k-CGs, Italy(500). Reconstruction error vs. k (minimum occupancy constraint) for Random and Condense.]

This problem can be solved in polynomial time using flow techniques [17]. When a cardinality constraint |V′| ≤ k is given as part of the input, the problem is NP-hard [15]. The fundamental difference between this problem and our problem is that we want to partition the graph rather than find a single subgraph with maximum edge density.

Discovering communities: In practice, several instances of graph-partitioning and dense-subgraph problems have arisen in the context of social networks and the Web. The goal there has been to identify online communities. For example, a practical algorithm for finding large, dense subgraphs is presented in [19]. The algorithm is based on shingles and has been successfully applied to studying the subgraph structure of the Web. Algorithms for identifying communities have also been proposed in [16, 18]. This line of research is rather orthogonal to ours, since either the communities identified by these algorithms do not define a partition of the nodes, or the optimization function is different. A recent line of research relates to a graph clustering index called modularity, proposed in [28]. Algorithms for optimizing the modularity of a partitioning have been proposed in [12, 29, 33, 35].

Graph compression: The problem of graph summarization has also been considered recently [27, 32]. Although the motivation of [27] is also to create condensed representations of a graph using the MDL principle, our setting is rather different. We deal with arbitrary graphs, while [27] only compresses bipartite graphs. The objective function in [27] is also different from ours. Tian et al. [32] proposed two database-style aggregation operations for specifying graph summaries, but did not consider the problem of answering queries using these summaries. Thus, the objective function in [32] is also different from ours.

The most extensive literature on graph compression comes from studies of the Web graph (see, for example, [2, 9]); the goal of this work is to find compact Web-graph representations to be used to compute PageRank [10] or other measures. Most of this work, however, focuses on reducing the number of bits required to encode a link between two pages, and not on constructing graph summaries that can be used for structural discovery instead of the original graph.

Querying uncertain, incomplete and imprecise data: Finally, significant classical and recent work has focused on problems related to querying data that is uncertain, incomplete, or imprecise [1, 4, 6, 11, 13, 21, 34]. A common approach to handling this ambiguity is to define the semantics of important queries in terms of possible worlds [1, 6]. For example, work in probabilistic databases extends the relational data model by associating an inclusion probability with each tuple t, which is interpreted to mean that t appears in this proportion of true database instances. A big challenge is evaluating queries expressed in terms of this and related data models [13, 34]. Other recent work aims to capture data uncertainty and imprecision in the OLAP data model and to evaluate aggregate queries over it [11].

8 Conclusion

In this paper, we proposed probabilistic semantics for answering structural queries atop "coarsened" graph summaries. Using this as motivation, we proposed algorithms for finding high-quality, space-efficient, and privacy-preserving summaries.

There are several interesting opportunities for future work. In particular, we are considering various approaches for scaling our summarization algorithms to large datasets, and techniques for incrementally maintaining a high-quality summary when new nodes are inserted into the underlying graph G. In addition, we are considering extensions to our query answering semantics that incorporate attribute values in addition to simple graph topologies.

References

[1] S. Abiteboul, P. Kanellakis, and G. Grahne. On the representation and querying of sets of possible worlds. In SIGMOD, 1987.

[2] M. Adler and M. Mitzenmacher. Towards compressing web graphs. In Proceedings of the Data Compression Conference, 2001.

[3] C. Aggarwal and P. Yu. A condensation approach to privacy-preserving data mining. In EDBT, 2004.

[4] L. Antova, C. Koch, and D. Olteanu. 10^(10^6) worlds and beyond: Efficient representation and processing of incomplete information. In ICDE, 2007.

[5] M. Arenas, L. Bertossi, J. Chomicki, X. He, V. Raghavan, and J. Spinrad. Scalar aggregation in inconsistent databases. Theoretical Computer Science, 296(3), 2003.

[6] F. Bacchus, A. Grove, J. Halpern, and D. Koller. From statistical knowledge bases to degrees of belief. Artificial Intelligence, 87(1-2), 1996.

[7] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou R3579X? Anonymized social networks, hidden patterns, and structural steganography. In WWW, 2007.

[8] A. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.

[9] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In WWW, 2004.

[10] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW, 1998.

[11] D. Burdick, P. Deshpande, T. Jayram, and R. Ramakrishnan. OLAP over uncertain and imprecise data. In VLDB, 2005.

[12] A. Clauset, M. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, 70, 2004.

[13] N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.

[14] W. E. Donath and A. J. Hoffman. Algorithms for partitioning of graphs and computer logic based on eigenvectors of connection matrices. IBM Technical Disclosure Bulletin, 15, 1972.

[15] U. Feige, G. Kortsarz, and D. Peleg. The dense k-subgraph problem. Algorithmica, 29(3):410–421, 2001.

[16] G. W. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In KDD, 2000.

[17] G. Gallo, M. D. Grigoriadis, and R. E. Tarjan. A fast parametric maximum flow algorithm and applications. SIAM Journal on Computing, 18:30–55, 1989.

[18] D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In Hypertext, 1998.

[19] D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive graphs. In VLDB, 2005.

[20] M. Hay, G. Miklau, D. Jensen, and P. Weis. Resisting structural re-identification in anonymized social networks. In VLDB, 2008.

[21] T. Imielinski and W. Lipski. Incomplete information in relational databases. Journal of the ACM, 31(4):761–791, 1984.

[22] P. Kilpelainen, H. Mannila, and E. Ukkonen. MDL learning of unions of simple pattern languages from positive examples. In EuroCOLT, 1995.

[23] M. Koivisto, M. Perola, T. Varilo, et al. An MDL method for finding haplotype blocks and for estimating the strength of haplotype block boundaries. In Pacific Symposium on Biocomputing, 2003.

[24] M. Mehta, J. Rissanen, and R. Agrawal. MDL-based decision tree pruning. In KDD, 1995.

[25] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.

[26] A. Narayanan and V. Shmatikov. De-anonymizing social networks. In IEEE Symposium on Security and Privacy, 2009.

[27] S. Navlakha, R. Rastogi, and N. Shrivastava. Graph summarization with bounded error. In SIGMOD, 2008.

[28] M. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69, 2004.

[29] M. Newman and M. Girvan. Algorithms for detecting community structure in networks. Physical Review E, 71, 2005.

[30] P. Samarati. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 2001.

[31] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557–570, 2002.

[32] Y. Tian, R. Hankins, and J. Patel. Efficient aggregation for graph summarization. In SIGMOD, 2008.

[33] S. White and P. Smyth. A spectral clustering approach to finding communities in graphs. In SDM, 2005.

[34] J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. In CIDR, 2005.

[35] E. Ziv, M. Middendorf, and C. Wiggins. An information-theoretic approach to network modularity. Physical Review E, 71, 2005.