Efficient processing of label-constraint reachability queries in large graphs

Lei Zou a,*, Kun Xu a, Jeffrey Xu Yu b, Lei Chen c, Yanghua Xiao d, Dongyan Zhao a

a Peking University, No.5 Yiheyuan Road Haidian District, Beijing, China
b The Chinese University of Hong Kong, Shatin, NT, Hong Kong
c Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
d Fudan University, Shanghai, China

Information Systems 40 (2014) 47–66. Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/infosys

0306-4379/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.is.2013.10.003

* Corresponding author at: Institute of Computer Science and Technology, Peking University, No.5 Yiheyuan Road Haidian District, Beijing 100871, China. Tel.: +86 10 82529643. E-mail addresses: [email protected], zoulei@pku.edu.cn (L. Zou).

Article history: Received 22 November 2012; received in revised form 13 June 2013; accepted 1 October 2013; recommended by Xifeng Yan; available online 18 October 2013.

Keywords: Graph database; Reachability query

Abstract

In this paper, we study a variant of reachability queries, called label-constraint reachability (LCR) queries. Specifically, given a label set S and two vertices $u_1$ and $u_2$ in a large directed graph G, we check the existence of a directed path from $u_1$ to $u_2$ such that the edge labels along the path are a subset of S. We propose the path-label transitive closure method to answer LCR queries. Specifically, we transform an edge-labeled directed graph into an augmented DAG by replacing the maximal strongly connected components with bipartite graphs. We also propose a Dijkstra-like algorithm to compute the path-label transitive closure by redefining the "distance" of a path. Compared with the existing solutions, we prove that our method is optimal in terms of the search space. Furthermore, we propose a simple yet effective partition-based framework (local path-label transitive closure + online traversal) to answer LCR queries in large graphs. We prove that finding the optimal graph partition to minimize query processing cost is an NP-hard problem. Therefore, we propose a sampling-based solution to find a sub-optimal partition. Moreover, we address the index maintenance issues that arise when answering LCR queries over dynamic graphs. Extensive experiments confirm the superiority of our method.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

The growing popularity of graph databases has generated many interesting data management problems. One important type of query over graphs is the reachability query [8,10,13,14,20,21]. Specifically, given two vertices $u_1$ and $u_2$ in a directed graph G, we want to verify whether there exists a directed path¹ from $u_1$ to $u_2$. There are many applications of reachability queries, such as pathway finding in biological networks [16], inference over RDF (resource description framework) graphs [17], and relationship discovery in social networks [23]. There are two extreme solutions to answering reachability queries. One approach is to materialize the transitive closure of a graph, enabling one to answer reachability queries efficiently. At the other extreme, we can perform DFS (depth-first search) or BFS (breadth-first search) over graph G on the fly. Obviously, neither method works in a large graph G, since the former needs $O(|V|^2)$ space to store the transitive closure (large index space cost), and the latter needs $O(|V|)$ time to answer a reachability query (slow query response time), where V is the set of vertices in G. The key issue in reachability queries is how to find a good trade-off between the two extreme solutions. Therefore, many algorithms have been proposed, such as 2-hop [8,7,4], GRIPP [20], path-cover [10], tree-cover [20,21], pathtree [14] and 3-hop [13].

¹ In this paper, all "paths" refer to "simple paths" unless otherwise specified.


In many real applications, edge labels are utilized to denote different relationships between two vertices. For example, edge labels in RDF graphs denote different properties. We can also use edge labels to define different relationships in social networks. In this paper, we study a variant of reachability queries, called Label-Constraint Reachability (LCR) queries, which were originally proposed in [11]. Specifically, given two vertices u and v and a label set S, an LCR query checks whether there exists a directed path from u to v such that the edge labels along the path are a subset of S. Here, we give some motivating examples to demonstrate the usefulness of LCR queries.

We model a social network as a graph G, in which each vertex denotes an individual and an edge indicates the association between two users. Edge labels denote the relationship types, such as isFriendOf, isColleagueOf, isRelativeOf, isSchoolmateOf, isCoauthorOf, isAdvisorOf and so on. In some social network analysis tasks, we are only interested in finding some specified relationships between two individuals. For example, we may want to see whether two suspects are remote relatives (in a terrorist network G) by checking the existence of a path between the two corresponding vertices, where the edge labels along the path are all "isRelativeOf".

LCR queries are also useful for understanding how metabolic chain reactions take place in metabolic networks. A metabolic network can also be modeled as a graph G, where each vertex corresponds to a chemical compound and each directed edge indicates a chemical reaction from one compound to another. Enzymes catalyze these reactions. Thus, we can use edge labels to denote different enzymes. A metabolic pathway involves the step-by-step modification of an initial molecule to form another product. From the perspective of graph theory, a pathway is a directed path from the initial node in the metabolic network to the target. A common query is as follows: given the availability of a set of enzymes, is there a pathway from one compound to another? Obviously, this is an LCR query over a metabolic network.

As demonstrated above, LCR queries are quite useful; however, it is non-trivial to answer LCR queries over a large directed graph. Traditional reachability queries do not consider edge labels along the path [11]. For example, vertex 1 can reach vertex 4 in graph G (in Fig. 1). However, if the constraint label set S is $\{b, c\}$, the LCR query answer is NO, since we cannot find a path from 1 to 4 where all edge labels along the path are a subset of S. Generally speaking, existing reachability indexes are compact data structures of the transitive closure. As traditional reachability queries do not consider edge labels, in order to compute the transitive closure, we only need to consider a single directed path from one vertex to another (if any). However, in order to compute the transitive closure for LCR queries, we have to consider all possible paths between two vertices, because different paths may have different edge labels. Therefore, computing the transitive closure for LCR queries is much more complicated. Existing index techniques for traditional reachability queries cannot be applied to LCR queries either. For example, in order to answer reachability queries, we always transform a directed graph G into a directed acyclic graph (DAG) by coalescing each strongly connected component (in G) into a single vertex. However, this method cannot work for LCR queries, since the edges inside each strongly connected component carry different labels.

Fig. 1. A running example.

In order to address LCR queries efficiently, we make the following contributions in this work:

(1) Given an edge-labeled directed graph G, we find all maximal strongly connected components in G and replace them by bipartite graphs. Then, the directed graph G is transformed into an augmented DAG with labels. Based on the augmented DAG, we propose a method to compute the path-label transitive closure (Definition 3.8), from which LCR queries can be answered directly.

(2) We redefine the "distance" of a path as the number of distinct edge labels along the path, and also propose a Dijkstra-like algorithm to compute a single-source path-label transitive closure (Definition 3.8). We prove that our algorithm is optimal in terms of search space.

(3) In order to speed up query processing over large graphs, we propose an effective partition-based framework (local path-label transitive closure + online traversal) to answer LCR queries in large graphs. We prove that finding the optimal partition in terms of minimizing the number of traversal steps is NP-hard. Based on the complexity analysis, we design a sampling-based solution to find a sub-optimal partition.

(4) In order to handle graph updates, we propose an efficient index maintenance algorithm.

(5) Last but not least, extensive experiments confirm that our method is faster than the existing ones by orders of magnitude. For example, given a random network following the ER model with 100K vertices and 150K edges, the method in [11] consumes 277 h for index building. Given the same graph, our method only needs 0.5 h for index building. Furthermore, our method works well on a very large RDF graph (the Yago dataset) with more than 2 million vertices, 6 million edges and 97 edge labels.

The rest of this paper is organized as follows. The related work is discussed in Section 2. We formally define the problem and discuss existing solutions in Section 3. Then, we propose several novel techniques for computing path-label transitive closures in Section 4. The partition-based solution is discussed in Section 5. We also discuss how to handle dynamic graphs in Section 6. We evaluate our method in Section 7. Section 8 concludes this paper.

2. Related work

Recently, reachability queries have attracted much attention in the database community [22,2,5,12]. Generally speaking, given two vertices $u_i$ and $u_j$ in a directed graph G, a reachability query verifies the existence of a directed path from $u_i$ to $u_j$. The reachability query is a fundamental operation over graph data, which can be used in different applications. For example, reachability queries can be used to find pathways between two compounds in metabolic networks to understand chain reactions. We can also find semantic associations on the semantic web [3]. Usually, a directed graph G is transformed into a directed acyclic graph by coalescing all strongly connected components into vertices. It is easy to prove that the directed acyclic graph preserves the reachability information of G. Thus, all existing approaches assume that graph G is a directed acyclic graph. So far, there have been many proposals to address this issue. Basically, these existing approaches can be classified into three categories: chain-cover [10], tree-cover [20,21] and 2-hop labeling [4,8,7].

The chain cover decomposes a graph G into pairwise disjoint chains. A chain is more general than a path. Each vertex has a distinct code based on its position on a chain. Given any two vertices $u_1$ and $u_2$, if $u_1$ can reach $u_2$ along a chain, it is not necessary to record the reachability information between $u_1$ and $u_2$, because it can be deduced from their codes on the chain. Two typical solutions are proposed in [10] and [5].

The tree cover uses a spanning tree instead of chains to compress the transitive closure. All edges covered by the tree are called tree edges, and the others are called non-tree edges. Tree codes are designed to check the existence of a directed tree path (a path containing only tree edges) from vertex $u_1$ to another vertex $u_2$. Many different approaches have been proposed to handle non-tree edges. In [20], Trißl and Leser propose a traversal-based solution. Instead, Wang et al. propose a dual-labeling scheme to materialize the reachability information through non-tree edges [21].

Different from the above methods, a 2-hop labeling method over a large graph G assigns to each vertex $u \in V(G)$ a label $L(u) = (L_{in}(u), L_{out}(u))$, where $L_{in}(u), L_{out}(u) \subseteq V(G)$. Vertices in $L_{in}(u)$ and $L_{out}(u)$ are called centers. For reachability labeling, given any two vertices $u_1, u_2 \in V(G)$, there is a path from $u_1$ to $u_2$ (denoted as $u_1 \to u_2$) if and only if $L_{out}(u_1) \cap L_{in}(u_2) \neq \emptyset$. It is an NP-hard problem to find the minimal number of hops. Many heuristic approaches have been proposed, such as [8,7,4]. In [8], Cohen et al. adopt a set-cover solution to find hops. However, the method of [8] is not scalable on large graphs. In order to address this issue, Cheng and Yu [6] propose efficient algorithms to compute 2-hops.

The LCR query is first proposed in [11] and is a special case of regular path queries [1]. Different from traditional reachability queries, an LCR query needs to consider the edge labels along the path. In this case, some existing techniques cannot be used. For example, we cannot simply replace all strongly connected components by vertices, since this transformation loses the edge labels inside each strongly connected component. In [11], Jin et al. use a spanning tree and some local transitive closures to support LCR queries. The intuition of their method is that the full transitive closure can be reconstructed from the spanning tree and the local transitive closures. However, this kind of tree-cover method suffers on dense graphs, since the local transitive closures may be very large. Label-constraint semantics are also considered in shortest path queries over road networks [18]. Different from [18], our work focuses on "reachability" queries, following the same problem definition as in [1].

This paper is a heavily expanded journal version of a research paper entitled "Answering Label-Constraint Reachability in Large Graphs" presented at CIKM 2011. In this journal version, we make the following new contributions. First, we propose an optimized solution to compute the local transitive closure in each maximal strongly connected component in Section 4.3. Second, we propose a partition-based method to answer LCR queries over large graphs. Third, we discuss the index maintenance issues on dynamic graphs. Most existing approaches assume that the graph is static and do not discuss how to update indexes. However, many real-life graphs, such as social networks and RDF graphs, are always evolving over time. Thus, we should consider reachability over dynamic graphs. Furthermore, in this journal version, we introduce six more experiments (Exp3–Exp8) to evaluate the newly proposed techniques.

3. Background

3.1. Problem definition

We formally define our problem in this section. Table 1 shows some frequently used symbols throughout this paper.

Table 1. Frequently used notations.

Notation                              Description
$e$, $\lambda(e)$                     Edge e and the edge label of e (Definition 3.1)
$p$, $L(p)$                           Path p and the path label L(p) (Definition 3.1)
$S = \{l_1, …, l_n\}$                 The label constraint set (Definition 3.2)
$P(u_1, u_2)$, $LS(u_1, u_2)$         All paths and path labels between $u_1$ and $u_2$ (Definition 3.4)
$M_G(u_1, u_2)$                       All non-redundant path labels between $u_1$ and $u_2$ (Definition 3.6)
$M_G(u, \cdot)$                       Single-source path-label transitive closure in graph G (Definition 3.8)
$M_G$                                 The path-label transitive closure of graph G (Definition 3.8)
$Prune(\cdot)$                        Remove all redundant paths (Definition 3.9)
$\odot$                               The concatenation of two path label sets (Definition 3.10)
$\circ$                               The concatenation of two paths


Definition 3.1. A directed edge-labeled graph G is denoted as $G = (V, E, \Sigma, \lambda)$, where (1) V is a set of vertices, (2) $E \subseteq V \times V$ is a set of directed edges, (3) $\Sigma$ is a set of edge labels, and (4) the labeling function $\lambda$ defines the mapping $E \to \Sigma$. Given a path p from $u_1$ to $u_2$ in graph G, the path-label of p is denoted as $L(p) = \bigcup_{e \in p} \lambda(e)$, where $\lambda(e)$ denotes e's edge label.

Given the graph G in Fig. 1, the numbers inside vertices are vertex IDs that we introduce to simplify the description of the graph, and the letters beside edges are edge labels. Considering path $p_1 = (1, 2, 5, 6)$, the path-label of $p_1$ is $L(p_1) = \{ac\}$.

Definition 3.2. Given two vertices $u_1$ and $u_2$ in graph G and a label constraint (set) $S = \{l_1, …, l_n\}$, where $l_i$ is a label, $i = 1, …, n$, we say that $u_1$ can reach $u_2$ under label constraint S (denoted as $u_1 \xrightarrow{S} u_2$) if and only if there exists a path p from $u_1$ to $u_2$ with $L(p) \subseteq S$.

Definition 3.3 (Problem Definition). Given two vertices $u_1$ and $u_2$ in graph G and a label set $S = \{l_1, …, l_n\}$, a label-constraint reachability (LCR) query verifies whether $u_1$ can reach $u_2$ under the label constraint S, denoted as $LCR(u_1, u_2, S, G)$.

For example, given two vertices 1 and 6 in graph G in Fig. 1 and label constraint $S = \{ac\}$, it is easy to see that 1 can reach 6 under label constraint S, i.e., $LCR(1, 6, S, G) = true$, since there exists a path $p_1 = (1, 2, 5, 6)$ with $L(p_1) \subseteq S$. If $S = \{bc\}$, then $LCR(1, 6, S, G) = false$.
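As a baseline for Definition 3.3, an LCR query can be answered by an online traversal that simply ignores edges whose label is outside S. The sketch below is a minimal illustration of that idea. The edge list `EDGES` is an assumption chosen to agree with the examples quoted in the text; it is not a transcription of Fig. 1, and the function name `lcr_bfs` is hypothetical.

```python
from collections import deque

# Hypothetical edge list (source, target, label); an assumption, not Fig. 1 itself.
EDGES = [(1, 2, "a"), (2, 3, "a"), (2, 5, "c"), (3, 4, "a"), (4, 5, "a"), (5, 6, "a")]


def lcr_bfs(edges, u1, u2, allowed):
    """Return True if u2 is reachable from u1 using only edges labeled within `allowed`."""
    adj = {}
    for src, dst, label in edges:
        adj.setdefault(src, []).append((dst, label))
    seen = {u1}
    queue = deque([u1])
    while queue:
        v = queue.popleft()
        if v == u2:
            return True
        for nxt, label in adj.get(v, []):
            # Skip edges whose label violates the constraint S.
            if label in allowed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


print(lcr_bfs(EDGES, 1, 6, {"a", "c"}))  # True
print(lcr_bfs(EDGES, 1, 6, {"b", "c"}))  # False
```

This traversal needs no index but touches the graph at query time, which is the cost the index-based methods discussed later try to avoid.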

Definition 3.4. Given two vertices $u_1$ and $u_2$ in graph G, $P(u_1, u_2)$ denotes the set of all paths from $u_1$ to $u_2$. The path-label set from $u_1$ to $u_2$ is defined as $LS(u_1, u_2) = \{L(p) \mid p \in P(u_1, u_2)\}$.

Note that $P(u_1, u_2)$ and $LS(u_1, u_2)$ may be very large and we do not compute them in our method. We only use the concept of $P(u_1, u_2)$ to define the minimal path-label set from $u_1$ to $u_2$ (Definition 3.6). Consider two paths (1, 2, 3, 4, 5, 6) and (1, 2, 5, 6) in $P(1, 6)$, where $L(1, 2, 3, 4, 5, 6) \subseteq L(1, 2, 5, 6)$. Obviously, if path (1, 2, 5, 6) satisfies some label constraint S, path (1, 2, 3, 4, 5, 6) also satisfies S. Therefore, path (1, 2, 5, 6) is redundant (Definition 3.5) for any LCR query.

Definition 3.5. Considering two paths p and p′ from vertex $u_1$ to $u_2$, if $L(p) \subseteq L(p')$, we say L(p) covers L(p′). In this case, p′ is a redundant path, and L(p′) is also redundant in the path-label set $LS(u_1, u_2)$. Considering one path p from vertex $u_1$ to $u_2$, if there exists no other path p′ from $u_1$ to $u_2$ that covers p, then p is a non-redundant path.

Definition 3.6. The minimal path-label set from $u_1$ to $u_2$ in graph G is defined as $M_G(u_1, u_2)$, where (1) $M_G(u_1, u_2) \subseteq LS(u_1, u_2)$; (2) there exists no redundant path-label in $M_G(u_1, u_2)$; and (3) the path-labels in $M_G(u_1, u_2)$ cover (as defined in Definition 3.5) all path labels in $LS(u_1, u_2)$.

Definition 3.7. Given two vertices $u_1$ and $u_2$ and a label constraint (set) S, we say $M_G(u_1, u_2)$ covers S if and only if there exists a path p from $u_1$ to $u_2$, where $S \supseteq L(p)$ and $L(p) \in M_G(u_1, u_2)$.
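Definition 3.7 amounts to a subset test against each stored path-label. A minimal sketch, assuming path-labels are represented as Python `frozenset` objects of single-character labels (the function name `covers` and this representation are illustrative choices, not the paper's):

```python
def covers(minimal_labels, constraint):
    """Definition 3.7 as a subset test: some stored path-label L(p) satisfies L(p) <= S."""
    s = frozenset(constraint)
    return any(label <= s for label in minimal_labels)


# M_G(1, 6) = {a} in the running example of the text.
m_1_6 = {frozenset("a")}

print(covers(m_1_6, {"a", "c"}))  # True
print(covers(m_1_6, {"b", "c"}))  # False
```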

Definition 3.8. Given a graph G, the path-label transitive closure is the collection of minimal path-label sets between any two vertices in G. Specifically, $M_G = [M_G(u_1, u_2)]_{|V(G)| \times |V(G)|}$, where $u_1, u_2 \in V(G)$, and a single-source path-label transitive closure is a vector $M_G(u, \cdot) = [M_G(u, u_i)]_{1 \times |V(G)|}$, where $u_i \in V(G)$.

When the context is clear, for simplicity of presentation, we use transitive closure instead of path-label transitive closure. For ease of presentation, we borrow two operator definitions ($Prune$ and $\odot$) from Ref. [11].

Definition 3.9. Given a path-label set $LS(u_1, u_2)$ from $u_1$ to $u_2$, $Prune(LS(u_1, u_2))$ is defined to delete all redundant path labels (defined in Definition 3.5) in $LS(u_1, u_2)$, i.e., $Prune(LS(u_1, u_2)) = M_G(u_1, u_2)$.

For example, in Fig. 1, $Prune(LS(1, 6)) = Prune(\{a, ac\}) = M_G(1, 6) = \{a\}$, since $\{a\} \subseteq \{ac\}$.
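The Prune operator of Definition 3.9 keeps only the label sets that have no proper subset in the collection. A minimal sketch, again assuming `frozenset` path-labels (the helper name `prune` is illustrative):

```python
def prune(label_sets):
    """Keep only path-labels that are not proper supersets of another path-label."""
    sets = {frozenset(ls) for ls in label_sets}
    # `other < ls` is Python's proper-subset test on sets.
    return {ls for ls in sets if not any(other < ls for other in sets)}


# LS(1, 6) = {a, ac} from the running example.
print(prune([{"a"}, {"a", "c"}]) == {frozenset({"a"})})  # True
```

The quadratic scan is fine for illustration; it makes no claim about how [11] or this paper implements the operator.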

Definition 3.10. Given a path-label L(p) and a path-label set $LS = \{L(p_1), …, L(p_n)\}$, $L(p) \odot LS = \{L(p) \cup L(p_1), …, L(p) \cup L(p_n)\}$.

Given two path-label sets $LS_1 = \{L(p_{11}), …, L(p_{1n})\}$ and $LS_2 = \{L(p_{21}), …, L(p_{2m})\}$, $LS_1 \odot LS_2 = \{L(p_{11}) \odot LS_2, L(p_{12}) \odot LS_2, …, L(p_{1n}) \odot LS_2\} = \{L(p_{11}) \cup L(p_{21}), …, L(p_{1n}) \cup L(p_{2m})\}$.

Given a path-label set $LS_1$ and a vector $M_G = [LS_{21}, …, LS_{2n}]_{1 \times n}$, where each $LS_{2i}$ ($i = 1, …, n$) is a path-label set, $LS_1 \odot M_G = [LS_1 \odot LS_{21}, …, LS_1 \odot LS_{2n}]_{1 \times n}$.

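For instance, given $M_G(5, 6) = \{a\}$ and $M_G(3, 6) = \{a\}$, we have $\{\lambda(2, 5)\} \odot M_G(5, 6) = \{ac\}$ and $\{\lambda(2, 3)\} \odot M_G(3, 6) = \{a\}$. It is easy to prove that

$M_G(u_i, u_j) = Prune\left(\bigcup_{\overrightarrow{u_i u'} \in E(G)} \{\lambda(\overrightarrow{u_i u'})\} \odot M_G(u', u_j)\right)$.

For example, $M_G(2, 6) = Prune(\{\lambda(2, 5)\} \odot M_G(5, 6) \cup \{\lambda(2, 3)\} \odot M_G(3, 6)) = Prune(\{a, ac\}) = \{a\}$.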
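The recurrence above can be checked on the $M_G(2, 6)$ example with a few lines of code. This is a sketch under the same `frozenset` representation; `odot` and `prune` are illustrative names for the operators of Definitions 3.10 and 3.9, and the edge labels c for (2, 5) and a for (2, 3) are the ones implied by the example in the text.

```python
def prune(label_sets):
    """Definition 3.9: drop path-labels that properly contain another path-label."""
    sets = {frozenset(ls) for ls in label_sets}
    return {ls for ls in sets if not any(other < ls for other in sets)}


def odot(ls1, ls2):
    """Definition 3.10: pairwise union of the path-labels in two path-label sets."""
    return {frozenset(x) | frozenset(y) for x in ls1 for y in ls2}


m_5_6 = {frozenset("a")}  # M_G(5, 6) = {a}
m_3_6 = {frozenset("a")}  # M_G(3, 6) = {a}

via_5 = odot({frozenset("c")}, m_5_6)  # {lambda(2,5)} combined with M_G(5,6) gives {ac}
via_3 = odot({frozenset("a")}, m_3_6)  # {lambda(2,3)} combined with M_G(3,6) gives {a}
m_2_6 = prune(via_5 | via_3)

print(m_2_6 == {frozenset("a")})  # True
```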

An extreme approach to answering LCR queries is to materialize the transitive closure $M_G$. At run time, a query $LCR(u_1, u_2, S, G)$ can be answered by simply checking $M_G(u_1, u_2)$. However, computing $M_G$ is much more complicated than computing the traditional transitive closure. We introduce the following theorem, which is used in our LCR query algorithms and the performance analysis.

Theorem 3.1 (Apriori property). Given a path p, p must be redundant if one of its subpaths is redundant.

Proof. Assume that one subpath $p_1$ of p is redundant. Then we can find another path $p_2$ that has the same end points as $p_1$ and is non-redundant, i.e., $L(p_2)$ covers $L(p_1)$. We get another path p′ by replacing $p_1$ by $p_2$ in p. It is easy to prove that L(p′) covers L(p). Thus, p must be a redundant path. □

3.2. Existing approaches

LCR queries are proposed in [11]. Generally speaking, the method in [11] employs a spanning tree T and a partial transitive closure NT to compress the full transitive closure. Specifically, a spanning tree T is found in the graph G. Based on T, all pairwise paths are partitioned into three categories $P_n$, $P_s$ and $P_e$. $P_n$ contains all pairwise paths whose starting edges and ending edges are both non-tree edges. $P_s$ (and $P_e$) contains all pairwise paths whose starting (and ending) edges are tree edges. In Fig. 2(a), (4, 5, 6, 1) is a path in $P_n$, since $\overrightarrow{4,5}$ and $\overrightarrow{6,1}$ are non-tree edges. $NT(u, v)$ contains all path labels between u and v in $P_n$. $M_G(u, v)$ can be reconstructed by Eq. (1). Therefore, we can reconstruct the full transitive closure from the spanning tree T and the partial transitive closure $NT = \{NT(u, v), u, v \in V(G)\}$.

$M_G(u, v) = \{\{L(P_T(u, u'))\} \odot NT(u', v') \odot \{L(P_T(v', v))\} \mid u' \in Succ(u) \text{ and } v' \in Pred(v)\}$    (1)

where u′ is reachable from u in the spanning tree T and $L(P_T(u, u'))$ denotes the corresponding path label in T; and v′ can reach v in the spanning tree T and $L(P_T(v', v))$ denotes the corresponding path label in T.

Fig. 2. Existing solution. (a) An example of tree-cover. (b) Running example of Algorithm 1.

Obviously, different spanning trees lead to different NT. In order to minimize the size of NT, Jin et al. introduce a "weight" w(e) for each edge e, where w(e) reflects the number of path-labels that can be removed from NT if e is in the tree. Therefore, they propose to use the maximal spanning tree in G. However, it is quite expensive to assign exact edge weights w(e). Thus, they propose a sampling method. For each sampling seed (vertex), they compute the single-source transitive closure, based on which they propose some heuristic methods to define edge weights.

However, the method in [11] has two limitations. First, similar to its counterparts for traditional reachability queries [21,14], a single spanning tree cannot compress the transitive closure greatly, especially in dense graphs. Consequently, NT may be very large. Second, in order to find the optimal spanning tree T, Jin et al. [11] propose an algorithm to compute the single-source transitive closure for each sampling seed (vertex). However, the search space of their algorithm is not minimal, as it contains a large number of redundant paths in intermediate results. The redundant paths affect the performance greatly. We analyze this shortcoming in detail shortly. These two problems (large index size and expensive index building process) limit the scalability of the method in [11].

Since computing the single-source transitive closure is also a building block in our method, we prove that our method is optimal in terms of search space. In order to understand the superiority of our method, we first analyze the algorithm in [11].

Given a vertex u in graph G, the single-source transitive closure of vertex u is a vector $M(u, \cdot) = [M(u, u_1), …, M(u, u_{|V(G)|})]$, where $u_i \in V(G)$ ($i = 1, …, |V(G)|$). The method in [11] adopts a generalization of the Bellman–Ford algorithm to compute $M(u, \cdot)$. Algorithm 1 lists the pseudo-code of their method. Generally speaking, Algorithm 1 adopts a BFS strategy to broadcast one vertex's path-labels to its neighbors in each iteration. Fig. 2(b) shows a running example of computing $M(1, \cdot)$ in graph G. A problem of this method is that if one path-label $L(u, u_i)$ from u to $u_i$ is redundant (Definition 3.5), it may infect $u_i$'s neighbors. For example, in Step 2, there is a redundant path-label $\{ac\}$ in $M(1, 5)$ (it will be pruned in Step 4), and it infects its neighbor vertex 6 in Step 3. Thus, there is also a redundant path-label $\{ac\}$ in $M(1, 6)$ in Step 3. Actually, these redundant path-labels should be pruned from the search space to avoid unnecessary computation. Given a redundant path p, the redundant path-label of p will infect all its super paths. The infection affects the performance greatly, especially in large and dense graphs. An optimal algorithm should "magically" stop the infection as early as possible. For example, a magical method can remove vertex 5 from $V_1$ in Step 3. A key problem in Algorithm 1 is that we cannot know that $\{ac\}$ in $M(1, 5)$ is redundant until Step 4. Therefore, the infection of redundancy cannot be avoided.

In our proposed method (Section 4.1), we can guarantee that if one path p is redundant, all its super paths are pruned from the search space. For example, in our algorithm, path (1, 2, 5, 6) can be pruned, since its subpath (1, 2, 5) is redundant. It means that it is impossible to generate the intermediate result $M(1, 6) = \{ac\}$. Consequently, compared with Algorithm 1, our method reduces the search space greatly.

Algorithm 1. Single-source transitive closure computation [11].

Input: A graph G and a vertex u in G.

Output: Single-source transitive closure $M(u, \cdot)$.

    $M(u, \cdot) \leftarrow NULL$; $V_1 \leftarrow \{u\}$
    while $V_1 \neq NULL$ do
        $V_2 \leftarrow NULL$
        for each vertex $v \in V_1$ do
            for each vertex $v' \in N(v) = \{v' : (v, v') \in E(G)\}$ do
                $New \leftarrow Prune(M(u, v') \cup M(u, v) \odot \{\lambda(v, v')\})$
                if $New \neq M(u, v')$ then
                    $M(u, v') \leftarrow New$; $V_2 \leftarrow V_2 \cup \{v'\}$
                end if
            end for
        end for
        $V_1 \leftarrow V_2$
    end while
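For readers who prefer executable code, the following is a hedged Python rendering of Algorithm 1. It is a sketch, not the implementation of [11]: the edge list `EDGES` is an assumption chosen to agree with the examples in the text (it is not Fig. 1), path-labels are `frozenset` objects, and the source entry $M(u, u)$ is seeded with the empty path-label so that the first extension is well defined, whereas the pseudo-code only writes NULL.

```python
# Hypothetical edge list (source, target, label); an assumption, not Fig. 1 itself.
EDGES = [(1, 2, "a"), (2, 3, "a"), (2, 5, "c"), (3, 4, "a"), (4, 5, "a"), (5, 6, "a")]


def prune(label_sets):
    """Drop path-labels that properly contain another path-label."""
    sets = set(label_sets)
    return {ls for ls in sets if not any(other < ls for other in sets)}


def single_source_closure_bfs(edges, u):
    """Bellman-Ford-style propagation of path-labels, following Algorithm 1."""
    adj = {}
    for src, dst, label in edges:
        adj.setdefault(src, []).append((dst, label))
    closure = {u: {frozenset()}}  # assumption: the empty path has the empty label set
    frontier = {u}                # V1 in the pseudo-code
    while frontier:
        next_frontier = set()     # V2 in the pseudo-code
        for v in frontier:
            for v2, label in adj.get(v, []):
                extended = {ls | {label} for ls in closure[v]}
                new = prune(closure.get(v2, set()) | extended)
                if new != closure.get(v2, set()):
                    closure[v2] = new
                    next_frontier.add(v2)
        frontier = next_frontier
    return closure


M = single_source_closure_bfs(EDGES, 1)
print(M[6] == {frozenset("a")})  # True
```

On this toy graph the redundant path-label {a, c} reaches vertex 5 before the label {a} does and is propagated on to vertex 6 before being pruned, which mirrors the "infection" described above.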

In [9], Fan et al. add regular expressions to graph reachability queries. Specifically, given two vertices $u_1$ and $u_2$, the method in [9] verifies whether there is a directed path P where all edge labels along the path satisfy the specified regular expression. Obviously, the LCR query is a special case of the problem in [9]. Fan et al. propose a bi-directional BFS algorithm at runtime. We can utilize the method in [9] for LCR queries. Given two vertices $u_1$ and $u_2$ in a large graph G and a label set S, two sets are maintained for $u_1$ and $u_2$. Each set records the vertices that are reachable from (resp. to) $u_1$ (resp. $u_2$) only via edges with labels in S. We expand the smaller set at a time until either the two sets intersect, or they cannot be further expanded (i.e., $u_2$ is unreachable). This method works very well in small graphs, but the bidirectional BFS is slow on large graphs.

Fig. 3. Algorithm process.
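The bidirectional traversal adapted from [9] for LCR queries can be sketched as follows. This is a minimal illustration of the idea as summarized in the text (expand the smaller reachable set until the sets meet or one side is exhausted), not the algorithm of [9] itself; the function name and the edge-list format are assumptions.

```python
def lcr_bidirectional(edges, u1, u2, allowed):
    """Bidirectional BFS restricted to edges whose label is in `allowed`."""
    fwd, bwd = {}, {}
    for src, dst, label in edges:
        if label in allowed:
            fwd.setdefault(src, []).append(dst)
            bwd.setdefault(dst, []).append(src)
    reach_f, reach_b = {u1}, {u2}   # vertices reachable from u1 / reaching u2
    front_f, front_b = {u1}, {u2}   # current BFS frontiers
    while front_f and front_b:
        if reach_f & reach_b:
            return True
        if len(reach_f) <= len(reach_b):
            # Expand the smaller set one level forward.
            front_f = {w for v in front_f for w in fwd.get(v, []) if w not in reach_f}
            reach_f |= front_f
        else:
            # Expand the smaller set one level backward.
            front_b = {w for v in front_b for w in bwd.get(v, []) if w not in reach_b}
            reach_b |= front_b
    # One side is exhausted, so its reachable set is complete.
    return bool(reach_f & reach_b)


edges = [(1, 2, "a"), (2, 3, "a"), (2, 5, "c"), (3, 4, "a"), (4, 5, "a"), (5, 6, "a")]
print(lcr_bidirectional(edges, 1, 6, {"a", "c"}))  # True
print(lcr_bidirectional(edges, 1, 6, {"b", "c"}))  # False
```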

4. Computing transitive closure

As mentioned earlier, compared with the traditional transitive closure, it is much more challenging to compute the path-label transitive closure (Definition 3.8). This section focuses on computing the path-label transitive closure efficiently. We first propose a Dijkstra-like algorithm to compute the single-source transitive closure efficiently (Section 4.1). However, it is very expensive to iterate the single-source transitive closure computation from each vertex in $G$ to compute $M_G$. To address this issue, we propose the augmented DAG (aDAG for short), which represents all strongly connected components as bipartite graphs. The aDAG-based solution to compute transitive closures over $G$ is discussed in Section 4.2. We discuss how to optimize the computation of the local transitive closure in each strongly connected component in Section 4.3.

4.1. Single-source transitive closure

This subsection focuses on computing the single-source transitive closure efficiently. As discussed in Section 3.2, the method in [11] is not optimal in terms of search space. The key problem is that some redundant paths are visited before their corresponding non-redundant paths. To address this issue, we propose a Dijkstra-like algorithm. In each step of Dijkstra's algorithm, we always access the unvisited vertex that has the minimal distance from the origin vertex. In our algorithm, we redefine the “distance” of a path as the number of distinct edge labels along the path (Definition 4.1). Given a redundant path $p_1$, there must exist a non-redundant path $p_2$ with $L(p_2) \subset L(p_1)$. It follows that the “distance” of $p_1$ is larger than that of $p_2$. Dijkstra's algorithm only finds the shortest paths between the origin vertex and other vertices. Therefore, the redundant path $p_1$ is excluded from the search space in our single-source transitive closure computation (Theorem 4.1).

Definition 4.1. The distance of a path $p$ is defined as the number of distinct edge labels in $p$.

Given the graph $G$ in Fig. 1, Fig. 3 demonstrates how to compute $M_G(1, -)$ from vertex 1 with our algorithm (i.e., Algorithm 2). Initially, we set vertex 1 as the source. All of vertex 1's neighbors are put into the heap $H$. Each neighbor is denoted as a neighbor triple $[L(p), p, d]$, where $d$ denotes the neighbor's ID, $p$ specifies one path from source $s$ to $d$, and $L(p)$ is the path-label set of $p$. All neighbor triples are ranked according to the total order defined in Definition 4.2. Since $[\{a\}, (1, 2), 2]$ is the heap head (see Fig. 3), it is moved to path set $RS$. When we move the heap head $T_1$ into path set $RS$, we check whether $T_1$ is covered (Definition 4.3) by some neighbor triple $T_2$ in $RS$ (Line 5 in Algorithm 2). If so, we ignore $T_1$ (Lines 5–6); otherwise, we insert $T_1$ into $RS$ (Lines 7–8).

Definition 4.2. Given two neighbor triples $T_1 = [L(p_1), p_1, d_1]$ and $T_2 = [L(p_2), p_2, d_2]$ in the heap $H$, $T_1 \le T_2$ if and only if $|L(p_1)| \le |L(p_2)|$, where $|L(p_1)|$ is the number of distinct edge labels along path $p_1$.

Definition 4.3. Given one neighbor triple $T_1 = [L(p_1), p_1, d_1]$, $T_1$ is redundant if and only if there exists another neighbor triple $T_2 = [L(p_2), p_2, d_2]$, where $L(p_1) \supseteq L(p_2)$ and $d_1 = d_2$. In this case, we say that $T_1$ is covered by $T_2$.

Definition 4.4. Given two paths $p_1$ and $p_2$, whose lengths are $n$ and $n+1$, respectively, $p_1$ is a parent path of $p_2$ if and only if $p_2 = p_1 \circ e$, where “$\circ$” denotes the concatenation of a path $p_1$ and an edge $e$.

Definition 4.5. Given two neighbor triples $T_1 = [L(p_1), p_1, d_1]$ and $T_2 = [L(p_2), p_2, d_2]$, if $p_1$ is a parent path of $p_2$, we say that $T_1$ is a parent neighbor triple of $T_2$ and that $T_2$ is a child neighbor triple of $T_1$.

Algorithm 2. Single-source transitive closure computation.

Input: A graph $G$ and a vertex $u$ in $G$;

Output: Single-source transitive closure $M_G(u, -)$.

1: Set $u$ as the source. Set answer set $RS = \emptyset$ and heap $H = \emptyset$.
2: Put all neighbor triples of $u$ into $H$.
3: while $H \neq \emptyset$ do
4:   Let $T_1 = [L(p_1), p_1, d]$ denote the head in $H$.
5:   if $T_1$ is covered by some neighbor triple $T_2$ in $RS$ then
6:     Delete $T_1$ from $H$
7:   else
8:     Move $T_1$ into $RS$ and put $L(p_1)$ into $M_G(u, d)$
9:     for each child neighbor triple $T' = [L(p'), p', d']$ of $T_1$ do
10:      if $p'$ is a non-simple path then
11:        continue
12:      end if
13:      if $T'$ is not covered by some neighbor triple $T''$ in $H$ then
14:        Insert $T'$ into $H$
15:      end if
16:      if $T'$ covers some neighbor triple $T''$ in $H$ then
17:        Delete $T''$ from $H$
18:      end if
19:    end for
20:  end if
21: end while
22: $M_G(u, -) = [M_G(u, u_1), \ldots, M_G(u, u_n)]$ and return $M_G(u, -)$

Then, we put all child neighbor triples (Definition 4.5) of $[\{a\}, (1, 2), 2]$ into heap $H$. Considering one neighbor of vertex 2, such as vertex 3, we put neighbor triple $[\{a\} \cup L(\overrightarrow{2,3}) = \{a\}, (1, 2, 3), 3]$ into $H$, where (1, 2, 3) is (1, 2)'s child path. Analogously, we put $[\{a\} \cup L(\overrightarrow{2,5}) = \{ac\}, (1, 2, 5), 5]$ into $H$. When we insert some neighbor triple $T' = [L(p'), p', d']$ into $H$, we first check whether $p'$ is a non-simple path, and we ignore $T'$ if so (Lines 10–11). Furthermore, we also check whether there exists another triple $T''$ already in $H$ such that $T'$ is covered by $T''$, or $T'$ covers $T''$ (Lines 13–18). If $T'$ is covered by $T''$, we ignore $T'$; otherwise, $T'$ is inserted into $H$. If $T'$ covers some triple $T''$ in $H$, $T''$ is deleted from $H$. At Step 2, the heap head is $[\{a\}, (1, 3), 3]$, which is moved to path set $RS$.

Iteratively, we put all child neighbor triples of $[\{a\}, (1, 3), 3]$ into heap $H$. At Step 4, we find that $[\{ac\}, (1, 2, 5), 5]$ is covered by $[\{a\}, (1, 2, 3, 4, 5), 5]$. Therefore, we remove $[\{ac\}, (1, 2, 5), 5]$ from $H$. Fig. 3 illustrates the whole process. All paths and path-labels in $RS$ are non-redundant. From $RS$, it is straightforward to obtain $M_G(1, -)$. Note that our algorithm prevents a redundant path from propagating to its child paths (Theorem 4.1). For example, path (1, 2, 5, 6) is pruned from the search space in our algorithm.
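The Dijkstra-like search can be sketched in Python as below. This is a simplified illustration rather than a line-by-line transcription of Algorithm 2: the adjacency-dict input and the name `single_source_closure` are assumptions, and the in-heap coverage checks (Lines 13–18) are replaced by a lazy check when a triple is popped, which yields the same minimal label sets but may keep more entries in the heap.

```python
import heapq
from itertools import count


def single_source_closure(graph, source):
    """graph: dict vertex -> list of (neighbour, edge_label)  (assumed format).

    Returns a dict target -> list of minimal label sets (frozensets),
    discovered in non-decreasing order of the number of distinct labels.
    """
    closure = {}            # plays the role of RS, projected onto label sets
    heap = []               # entries: (distance, tie-breaker, labels, path)
    tie = count()           # keeps heap comparisons away from the frozensets

    def push(labels, path):
        heapq.heappush(heap, (len(labels), next(tie), labels, path))

    for nbr, lab in graph.get(source, ()):
        if nbr != source:
            push(frozenset([lab]), (source, nbr))

    while heap:
        _, _, labels, path = heapq.heappop(heap)
        d = path[-1]
        found = closure.setdefault(d, [])
        if any(prev <= labels for prev in found):
            continue        # covered by an accepted triple: do not expand it
        found.append(labels)
        for nbr, lab in graph.get(d, ()):
            if nbr in path:  # skip non-simple paths
                continue
            push(labels | {lab}, path + (nbr,))
    return closure
```

Because triples leave the heap in order of label-set size, a label set is accepted only if no smaller or equal accepted set for the same target is contained in it, so redundant paths are never expanded.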

Analysis of Algorithm 2:

Theorem 4.1. Given a vertex $u$ in graph $G$, the following claims about Algorithm 2 hold:

1. All non-redundant paths can be found by Algorithm 2.
2. The paths found by Algorithm 2 (i.e., the paths that are inserted into $RS$) are non-redundant paths.

Proof. (1) Proof of the first claim (by induction).

(a) (Base case): According to Line 2 of Algorithm 2, all length-1 non-redundant paths beginning from vertex $u$ are pushed into $H$. Obviously, these length-1 non-redundant paths are not covered by any path in $H$ or $RS$. According to the loop (Lines 3–21), each of these length-1 non-redundant paths will be the heap head of $H$ at some iteration step. At this moment, it is inserted into result set $RS$. Therefore, Algorithm 2 will not miss these length-1 non-redundant paths.

(b) (Hypothesis): Assume that all length-$n$ non-redundant paths beginning from vertex $u$ can be found by Algorithm 2.

(Induction): Given a length-$(n+1)$ non-redundant path $p$, its parent path is denoted as $p'$. The length of $p'$ is $n$, and $p'$ is non-redundant by the Apriori property (Theorem 3.1), so $p'$ must be found by Algorithm 2 according to the above assumption. Since $p$ is $p'$'s child, $p$ must be considered in Lines 9–19. Since $p$ is a non-redundant path, $p$ cannot be covered by any path in $H$ or $RS$. Therefore, $p$ must be inserted into $H$. At some iteration step, $p$ is the heap head, which will be moved into $RS$.

(c) (Conclusion): According to the above analysis, we can always obtain any length-$(n+1)$ non-redundant path $p$ once we have obtained its length-$n$ parent $p'$. Furthermore, all length-1 non-redundant paths can be found in $RS$ (proved in the base case). Therefore, by induction, we can find all non-redundant paths by Algorithm 2.

(2) Proof of the second claim. Given a heap head $T_1 = [L(p_1), p_1, d_1]$, if $p_1$ is a redundant path, its child paths must be redundant paths (proved in Theorem 3.1). According to Lines 5–6, $p_1$ will be deleted. Also, Algorithm 2 does not expand $p_1$ to generate redundant paths. Thus, all paths found by Algorithm 2 are non-redundant paths. □

Theorem 4.2. Algorithm 2 is optimal in terms of the search space for computing the single-source transitive closure.

Proof. In order to compute the single-source transitive closure, we must find all non-redundant paths beginning from the original vertex. Otherwise, we will miss reachability information in the single-source transitive closure. According to Theorem 4.1, Algorithm 2 generates all non-redundant paths beginning from the original vertex but does not generate any redundant path. Therefore, Algorithm 2 is optimal in the search space. □

Theorem 4.3. The time complexity of Algorithm 2 in the worst case is $O(D^d)$, where $D$ is the maximal outgoing degree and $d$ is the diameter of the graph.

Proof. The worst case is that all edges have distinct edge labels. In this case, all paths (i.e., simple paths) are non-redundant. Thus, Algorithm 2 needs to evaluate all paths (beginning from the original vertex) to compute the single-source path-label transitive closure. There are at most $O(D^d)$ such paths, where $D$ is the maximal outgoing degree and $d$ is the diameter of the graph. Therefore, the time complexity is $O(D^d)$. □

Although Algorithm 2 has the same worst-case time complexity as the method in [11] (i.e., Algorithm 1), it avoids visiting redundant paths (Theorem 4.1). In practice, our method is significantly faster than the method in [11].

4.2. Computing transitive closures

Given a graph $G$, we can iterate Algorithm 2 from each vertex in $G$ to compute $M_G$. However, this is an inefficient solution. Intuitively, two adjacent vertices $u_1$ and $u_2$ share many steps when computing $M_G(u_1, -)$ and $M_G(u_2, -)$ by Algorithm 2. Therefore, an efficient algorithm should avoid this redundant computation. Usually, a directed graph $G$ can be transformed into a DAG by coalescing each strongly connected component into a single vertex to compute the transitive closure efficiently. However, this method cannot be used for LCR queries since it loses some edge labels. Instead, we propose an augmented DAG $D$ obtained by replacing all strongly connected components with bipartite graphs. Note that we allow trivial maximal strongly connected components, which include a single vertex. Then, we can compute the single-source transitive closure $M_G(u, -)$ according to the reverse topological order of $D$. During the computation, $M_G(u, -)$ is always transmitted to its parent vertices in $D$. In this way, redundant computation can be avoided.


Algorithm 3. Building an augmented DAG $D$ for a directed graph $G$.

Input: A directed graph $G$;

Output: The augmented DAG $D$.

1: Find all maximal strongly connected components in $G$.
2: for each maximal strongly connected component $C_i$ do
3:   Replace $C_i$ by a bipartite graph $B_i = (V_{i1}, V_{i2})$, where $V_{i1}$ contains all in-portal vertices in $C_i$ and $V_{i2}$ contains all out-portal vertices in $C_i$.
4:   For any two vertices $u_1 \in V_{i1}$ and $u_2 \in V_{i2}$, we introduce a directed edge from $u_1$ to $u_2$, whose edge label is $M_{C_i}(u_1, u_2)$.
5: end for
6: Set $D$ to be the updated graph.
7: Return $D$

Given a directed graph $G$, Algorithm 3 shows how to get an augmented DAG $D$. Specifically, we first find all maximal strongly connected components in $G$. Then, we replace each maximal strongly connected component $C_i$ by a bipartite graph $B_i = (V_{i1}, V_{i2})$, where $V_{i1}$ contains all in-portal vertices in $C_i$ and $V_{i2}$ contains all out-portal vertices in $C_i$. A vertex $u$ in $C_i$ is called an in-portal if and only if it has at least one incoming edge from a vertex outside $C_i$. A vertex $u$ in $C_i$ is called an out-portal if and only if it has at least one outgoing edge to a vertex outside $C_i$. If vertex $u$ is both an in-portal and an out-portal, it has two instances $u$ and $u'$ that occur in $V_{i1}$ and $V_{i2}$, respectively. For any two vertices $u_1 \in V_{i1}$ and $u_2 \in V_{i2}$, we introduce a directed edge from $u_1$ to $u_2$, whose edge label is $M_{C_i}(u_1, u_2)$. Theorem 4.4 proves that the graph generated by Algorithm 3 is a directed acyclic graph (DAG).

Given the graph $G$ in Fig. 4(a), only a single maximal strongly connected component $C_1$ is identified in $G$. We compute $M_{C_1}$ for $C_1$. In $C_1$, the in-portals are $V_1 = \{2, 3\}$ and the out-portals are $V_2 = \{3, 4\}$. Note that vertex 3 is both an in-portal and an out-portal. Thus, we introduce two instances of vertex 3. We build a bipartite graph $B_1$ from the in-portal vertices and the out-portal vertices. We introduce a directed edge for every pair of vertices between $V_1$ and $V_2$. The edge label is the minimal path-label set between the two vertices.
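Identifying the portals of each component is a single pass over the edges. The sketch below assumes that a component assignment `comp` (vertex to component id) has already been computed by any standard SCC algorithm; the function name `portals` and the input format are hypothetical.

```python
def portals(edges, comp):
    """Find in-portals and out-portals of every component.

    edges: iterable of (u, v, label); comp: dict vertex -> component id
    (both formats are assumptions for this sketch).
    Returns (in_portals, out_portals), each a dict component id -> set of vertices.
    """
    in_portals, out_portals = {}, {}
    for u, v, _label in edges:
        if comp[u] != comp[v]:                        # edge leaves one component and enters another
            out_portals.setdefault(comp[u], set()).add(u)
            in_portals.setdefault(comp[v], set()).add(v)
    return in_portals, out_portals
```

A vertex that appears in both sets of its component corresponds to the case above where two instances of the vertex are introduced in the bipartite graph.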

Theorem 4.4. The updated graph $D$ in Algorithm 3 is a directed acyclic graph (DAG).

Fig. 4. Augmented DAG. (a) Graph $G$. (b) Augmented DAG $D$.

Proof. Suppose that $D$ is not a DAG. Then there must exist at least one cycle in $D$. Such a cycle either corresponds to one maximal strongly connected component or is embedded in one, so the cycle must occur inside some maximal strongly connected component. However, every such component has been replaced by a bipartite graph $B_i$. Therefore, no such cycle exists in $D$. □

Note that, in the traditional reachability problem, a directed graph $G$ is transformed into a DAG by coalescing each maximal strongly connected component into a single vertex. To differentiate the DAG generated by Algorithm 3 from the DAG generated in the traditional reachability problem, we call the DAG generated by Algorithm 3 the augmented DAG (aDAG for short).

Definition 4.6. Given two maximal strongly connected components $C_i$ and $C_j$, $C_i$ is called an ancestor component of $C_j$ if and only if there exists a directed path from $C_i$ to $C_j$ in the augmented DAG.

For example, $C_5$ is an ancestor component of $C_2$, since there is a directed path $C_5 \to C_1 \to C_2$ in the augmented DAG.

Algorithm 4. Compute transitive closure of graph $G$.

Input: A graph $G$;

Output: $M_G$.

1: Set $M_G(u, -) = \{\emptyset, \ldots, \emptyset\}$ for each vertex $u$.
2: Identify all maximal strongly connected components $C_i$ ($i = 1, \ldots, n$) and build aDAG $D$ by employing Algorithm 3.
3: for each maximal strongly connected component $C_i$, $i = 1, \ldots, n$ do
4:   for each vertex $u$ in $C_i$ do
5:     Compute $M_{C_i}(u, -)$ by employing Algorithm 2.
6:   end for
7: end for
8: Call Function aDAG to compute $M_G$
9: Return $M_G$

Function aDAG$(D, M_{C_1}, \ldots, M_{C_n})$

1: for each vertex $u$ in $D$ according to the reverse topological order do
2:   $M_G(u, -) = \mathrm{Prune}(\bigcup_{i=1,\ldots,n} (\lambda(\overrightarrow{uc_i}) \odot M_G(c_i, -)))$
3: end for
4: for each maximal strongly connected component $C_i$ do
5:   for each intra-vertex $u \in C_i$ do
6:     $M_G(u, -) = M_{C_i}(u, -)$
7:     for each out-portal $u_i$ in $V_{i2}$ do
8:       $M_G(u, -) = \mathrm{Prune}(M_G(u, -) \cup \mathrm{Prune}(M_{C_i}(u, u_i) \odot M_G(u_i, -)))$
9:     end for
10:  end for
11:  for each vertex $u' \notin C_i$ do
12:    for each intra-vertex $u \in C_i$ do
13:      for each in-portal $u_i$ in $V_{i1}$ do
14:        $M_G(u', u) = \mathrm{Prune}(M_G(u', u) \cup \mathrm{Prune}(M_G(u', u_i) \odot M_{C_i}(u_i, u)))$
15:      end for
16:    end for
17:  end for
18: end for

Fig. 5. Theorem 4.7.

Given a graph $G$, Algorithm 4 shows the pseudo-code to compute the transitive closure for $G$. First, we initialize $M_G(u, -) = \{\emptyset, \ldots, \emptyset\}$ and identify all maximal strongly connected components $C_i$ in $G$ (Lines 1–2 in Algorithm 4). We build an aDAG by calling Algorithm 3. For each maximal strongly connected component $C_i$, we iterate Algorithm 2 from each vertex $u$ in $C_i$ to compute $M_{C_i}$ (Lines 3–7). Finally, we call Function aDAG to compute $M_G$.

In Function aDAG, we first perform a topological sort over $D$. We process each vertex in $D$ according to the reverse topological order of $D$. If a vertex $u$ has $n$ children $c_i$, $i = 1, \ldots, n$, we set $M_G(u, -) = \mathrm{Prune}(\bigcup_{i=1,\ldots,n} (\lambda(\overrightarrow{uc_i}) \odot M_G(c_i, -)))$. In this way, we can obtain $M_G(u, -)$ for each vertex $u$ in aDAG $D$.
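The reverse-topological propagation can be sketched as follows, assuming Python 3.9+ for the standard-library `graphlib` module. The DAG encoding (each edge carries a set of label sets, so that both ordinary edges and bipartite edges labelled with $M_{C_i}(u_1, u_2)$ fit the same shape) and the names `prune` and `dag_closure` are assumptions of this sketch. It also records the edge itself as a path to the child, which the formula above leaves implicit.

```python
from graphlib import TopologicalSorter


def prune(label_sets):
    """Keep only minimal label sets."""
    sets = set(label_sets)
    return {s for s in sets if not any(t < s for t in sets)}


def dag_closure(dag):
    """dag: dict vertex -> list of (child, edge_sets), where edge_sets is a
    set of frozensets (assumed format). Returns M with M[u][d] = minimal
    label sets over all paths from u to d.
    """
    # TopologicalSorter expects node -> predecessors. Passing children instead
    # makes static_order() emit children before parents, i.e. a reverse
    # topological order of the DAG.
    order = TopologicalSorter(
        {u: [c for c, _ in kids] for u, kids in dag.items()}
    ).static_order()

    M = {}
    for u in order:
        row = {}
        for c, edge_sets in dag.get(u, ()):
            row.setdefault(c, set()).update(edge_sets)       # the edge u -> c itself
            for d, sets in M[c].items():                     # extend every path from c
                row.setdefault(d, set()).update(e | s for e in edge_sets for s in sets)
        M[u] = {d: prune(s) for d, s in row.items()}
    return M
```

Each vertex reuses the already finished rows of its children, so no path below a child is explored twice.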

Now, we need to consider the “intra-vertices” in each cluster $C_i$ (i.e., the vertices that are not in $D$), such as vertices 1, 5 and 6 in Fig. 4. Given an intra-vertex $u$ in $C_i$, we initialize $M_G(u, -) = M_{C_i}(u, -)$. We first consider how an intra-vertex $u$ (in cluster $C_i$) reaches other vertices outside $C_i$. If $u$ reaches another vertex outside $C_i$, the path must go through an out-portal in $C_i$. Thus, for each out-portal $u_i$ in $V_{i2}$, we update $M_G(u, -) = \mathrm{Prune}(M_G(u, -) \cup \mathrm{Prune}(M_{C_i}(u, u_i) \odot M_G(u_i, -)))$ iteratively.

Then, we consider how other vertices $u'$ (outside $C_i$) reach an intra-vertex $u$ in $C_i$. Obviously, the path from $u'$ to $u$ must go through one in-portal vertex in $C_i$. Consider any vertex $u' \notin C_i$. Given an intra-vertex $u$ in $C_i$, we compute $M_G(u', u)$ as follows: initially, we set $M_G(u', u) = \emptyset$. For each in-portal $u_i$ in $V_{i1}$, we update $M_G(u', u) = \mathrm{Prune}(M_G(u', u) \cup \mathrm{Prune}(M_G(u', u_i) \odot M_{C_i}(u_i, u)))$ iteratively.

Theorem 4.5. The time complexity of Function aDAG in Algorithm 4 is $O(|V(G)|^3)$, where $|V(G)|$ is the number of vertices in $G$.

Proof. The time complexity of Lines 1–3 is $O(|V(G)|^2)$, since one vertex has at most $|V(G)|$ children in the aDAG and Lines 1–3 loop $|V(G)|$ times. A maximal strongly connected component has at most $|V(G)|$ in-portals, out-portals and intra-vertices. Thus, the time complexity of computing the transitive closures of intra-vertices is $O(|V(G)|^2)$ (Lines 5–17). Since there are at most $|V(G)|$ blocks, the total time complexity is $O(|V(G)|^3)$. □

Theorem 4.6. The time complexity of Algorithm 4 is $O(\max(|V(G)|^3, |V(G)|^2 \times D^d))$, where $|V(G)|$ is the number of vertices in $G$, $D$ is the maximal outgoing vertex degree in some maximal strongly connected component $C_i$, and $d$ is the maximal diameter of some maximal strongly connected component $C_i$.

Proof. There are at most $|V(G)|$ maximal strongly connected components and each of them has at most $|V(G)|$ vertices. Hence Algorithm 2 is called at most $|V(G)|^2$ times. The time complexity of Algorithm 2 is given in Theorem 4.3. Thus, the time complexity of Lines 1–7 is $O(|V(G)|^2 \times D^d)$. Together with the time complexity of Function aDAG (Theorem 4.5), Theorem 4.6 holds. □

4.3. Optimization: computing local transitive closure in strongly connected components

In Lines 3–7 of Algorithm 4, we need to compute the local transitive closure $M_{C_i}$ for each maximal strongly connected component $C_i$. To compute $M_{C_i}$, we iterate Algorithm 2 from each vertex $u \in C_i$. Let us consider $C_1$ in graph $G$ in Fig. 4(a). Assume that we have computed $M_{C_1}(3, -)$. It is unnecessary to compute $M_{C_1}(2, -)$ by running Algorithm 2 from vertex 2 from scratch, since some search branches of computing $M_{C_1}(2, -)$ can be terminated early by using $M_{C_1}(3, -)$. In general, given two vertices $u$ and $u'$ in a strongly connected component $C$ with a directed edge $\overrightarrow{uu'}$ in $C$, if $M_C(u', -)$ has been computed, part of the search space of computing $M_C(u, -)$ can be pruned by using $M_C(u', -)$ to save computation cost. For example, in Fig. 5, if there exist two different paths $p_1$ and $p_3$ from $u$ to vertex $d$, where $p_1$ does not go through $u'$, $p_3$ goes through $u'$, and $L(p_1) = L(p_3)$, it is not necessary to further extend $p_1$ to reach another vertex $d'$, since we are guaranteed to find another path by extending $p_3$ to reach $d'$, and the two extended paths have the same labels. Specifically, in Fig. 5, $L(p_1 + p_2) = L(p_3 + p_2)$, where $p_1 + p_2$ means concatenating paths $p_1$ and $p_2$. Theorem 4.7 shows the details.

Theorem 4.7. Given a non-redundant path $p_1$ from $u$ to $d$, if there exists another non-redundant path $p_3$ from $u$ to $d$ that goes through $u'$, where $u'$ is a neighbor of $u$, and $L(p_1) = L(p_3)$, then for any non-redundant path $p$ from $u$ to another vertex $d'$ that goes through vertex $d$, the following holds:

$L(p) \in (\lambda(\overrightarrow{u,u'}) \odot M_C(u', d'))$

where $M_C(u', d')$ is defined in Definition 3.6.

Proof. (1) Assume that $p$ goes through $u'$. The subpath of $p$ from $u'$ to $d'$ must be a non-redundant path, since $p$ is a non-redundant path. It means that $L(p) \in \lambda(\overrightarrow{u,u'}) \odot M_C(u', d')$.

(2) Assume that $p$ does not go through $u'$. Let the subpath of $p$ from vertex $d$ to $d'$ be denoted as $p_2$, as shown in Fig. 5.


Fig. 6. Optimization technique.


Since $L(p_1) \in (\lambda(\overrightarrow{u,u'}) \odot M_C(u', -))$, there exists a path $p_3$ from $u$ to $d$ through vertex $u'$ with $L(p_1) = L(p_3)$. Let $p' = p_3 \circ p_2$, i.e., path $p'$ is formed by concatenating $p_3$ and $p_2$. It is straightforward to see that $L(p') = L(p)$. Since $p$ is a non-redundant path, $p'$ must also be a non-redundant path. The subpath of $p'$ from $u'$ to $d'$ is denoted as $p_4$. According to the Apriori property (Theorem 3.1), $p_4$ is also a non-redundant path. It means that $L(p_4) \in M_C(u', d')$. Therefore, $L(p) = L(p') = L(\overrightarrow{uu'} \circ p_4) \in \lambda(\overrightarrow{u,u'}) \odot M_C(u', d')$. □

Assume that $M_C(u', -)$ is computed before computing $M_C(u, -)$. When computing $M_C(u, -)$ by Algorithm 2, we get a head triple $T_1 = [L(p_1), p_1, d]$ in Line 4. If $L(p_1) \in \mathrm{Prune}(\lambda(\overrightarrow{u,u'}) \odot M_C(u', d))$, it is not necessary to extend the path $p_1$ to reach another vertex $d'$, since another path that goes through $u'$ reaches $d'$ with the same label set. Specifically, Algorithm 5 shows how to revise Algorithm 2 to save computation in computing $M_C(u, -)$. For simplicity, Algorithm 5 only shows the differences between Algorithms 5 and 2, where Lines 1-1, 1-2, 1-3 in Algorithm 5 replace Line 1 in Algorithm 2 and Lines 8-1, 8-2, 8-3 in Algorithm 5 replace Line 8 in Algorithm 2. Given a vertex $u$ in $C$, assume that $u$ has $m$ outgoing neighbors $u_1, \ldots, u_m$ for which $M_C(u_j, -)$ has been computed, $j = 1, \ldots, m$. Initially, we set $M_C(u, -) = \emptyset$. For each outgoing neighbor $u_j$ of $u$, we set $M_C(u, -) = \mathrm{Prune}(M_C(u, -) \cup \{\lambda(\overrightarrow{u,u_j}) \odot M_C(u_j, -)\})$ (Lines 1-1, 1-2, 1-3 in Algorithm 5). Then, we employ Algorithm 2 from vertex $u$. However, when we pop the neighbor triple $T_1 = [L(p_1), p_1, d]$ from heap $H$ (Line 8 in Algorithm 2), we need to check whether $L(p_1)$ is in $M_C(u, d)$. If so, according to Theorem 4.7, we can terminate the branch. Otherwise, we continue with the following steps of Algorithm 2 (Lines 8-1, 8-2, 8-3 in Algorithm 5).

Algorithm 5. Optimization.

1-1) Set $u$ as the source. Set answer set $RS = \emptyset$, heap $H = \emptyset$ and $M_C(u, -) = \emptyset$
1-2) $M_C(u, -) = \bigcup_{\overrightarrow{u,u_j} \in E(C)} \{\lambda(\overrightarrow{u,u_j}) \odot M_C(u_j, -)\}$.
……
8-1) if path label $L(p_1)$ exists in $M_C(u, d)$
8-2)   continue
8-3) Move $T_1$ into $RS$ and put $L(p_1)$ into $M_C(u, d)$

Let us consider $C_1$ in Fig. 4(a). Assume that we have computed $M_{C_1}(3, -)$, as shown in Fig. 6(a). According to Algorithm 5, we first set $M_{C_1}(2, -) = \mathrm{Prune}(\lambda(\overrightarrow{2,3}) \odot M_{C_1}(3, -))$ in Fig. 6(b). Then, we begin Algorithm 2 from vertex 2. In the first step, there are two neighbor triples $[\{b\}, (2, 5), 5]$ and $[\{a\}, (2, 3), 3]$, as shown in Fig. 6(d). Since triple $[\{b\}, (2, 5), 5]$ is not covered by any path label in $M_{C_1}(2, 5)$, we insert $\{b\}$ into $M_{C_1}(2, 5)$. Then, we insert the neighbor triples of 5 into $H$. At the second step, neighbor triple $[\{a\}, (2, 3), 3]$ is removed from heap $H$, since there is a path label $\{a\}$ in $M_{C_1}(2, 3)$. Furthermore, neighbor triple $[\{ab\}, (2, 5, 6), 6]$ is also removed from heap $H$, since $\{ab\}$ is covered by a path label $\{a\}$ in $M_{C_1}(2, 6)$. Therefore, the whole process terminates at the second step. We get $M_{C_1}(2, -)$ in Fig. 6(c).

According to Algorithm 5, we have the following method to compute the local transitive closure $M_C$ for a strongly connected component $C$. Specifically, we first find a vertex $u'$ with the largest incoming degree in $C$. We compute $M_C(u', -)$ by Algorithm 2. Then, we perform a breadth-first search over $C$ from this vertex. When visiting some node $u$ in $C$, we utilize Algorithm 5 to compute $M_C(u, -)$.

5. LCR query over large graphs

Given a large graph $G$, it is very expensive to compute path-label transitive closures despite their capability to answer LCR queries. As discussed earlier, another extreme approach to answering reachability queries is to traverse $G$ on the fly. The intuition behind our method is that we can combine the two extreme approaches to find a good trade-off between offline and online costs.

In our method, we partition a large graph $G$ into several blocks $P_i$, $i = 1, \ldots, k$. For each block $P_i$, we employ the method in Section 4 to compute the path-label transitive closure of $P_i$. All boundary vertices and crossing edges (Definition 5.1) are collected to form a skeleton graph $G^*$ (Definition 5.2). Obviously, $G^*$ is much smaller than $G$. With local transitive closures, we can answer LCR queries over $G$ by traversing $G^*$ on the fly. Obviously, different partitions lead to different query performance. We delay the discussion of finding the optimal partition to enhance query performance until Section 5.2, since it is related to our query algorithm in Section 5.1.

5.1. Query algorithm

Definition 5.1. Given a vertex $u$ in a block $P$, $u$ is a boundary vertex if and only if $u$ has at least one neighbor vertex that is outside of block $P$. An edge $e = \overrightarrow{u_1u_2}$ is called a crossing edge if and only if $u_1$ and $u_2$ are boundary vertices in two different blocks.


Definition 5.2. Given a large graph $G$ that is partitioned into $k$ blocks $P_i$, $i = 1, \ldots, k$, all boundary vertices and crossing edges are collected to form a skeleton graph $G^*$.
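Definitions 5.1 and 5.2 translate directly into a small helper. The sketch below assumes an edge list of `(u, v, label)` triples and a `block` map from vertex to block id; the function name `skeleton` is hypothetical.

```python
def skeleton(edges, block):
    """Collect the skeleton graph of a partition.

    edges: iterable of (u, v, label); block: dict vertex -> block id
    (assumed formats). Returns (boundary_vertices, crossing_edges).
    """
    # An edge is a crossing edge when its endpoints lie in different blocks.
    crossing = [(u, v, lab) for u, v, lab in edges if block[u] != block[v]]
    # A vertex is a boundary vertex when it has a neighbour in another block,
    # i.e. when it is an endpoint of some crossing edge.
    boundary = {x for u, v, _ in crossing for x in (u, v)}
    return boundary, crossing
```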

Algorithm 6. Answer LCR queries by traversal over the skeleton graph.

Input: Two vertices $u_1$ and $u_2$ and a label set $S$.

Output: If $u_1 \xrightarrow{S} u_2$, return True; otherwise, return False.

1: Let $P_1$ and $P_2$ denote the corresponding blocks of $u_1$ and $u_2$, respectively.
2: Let $Visited = \emptyset$.
3: if $P_1 = P_2$ then
4:   if $M_{P_1}(u_1, u_2)$ covers $S$ then
5:     Return True
6:   end if
7: end if
8: Let $BD = \{b \mid b$ is a boundary vertex in $P_1$ and $M_{P_1}(u_1, b)$ covers $S\}$
9: for each $b \in BD$ do
10:  Assume that $b$ has $m$ outgoing neighbors $b_i$ ($i = 1, \ldots, m$) that are in blocks other than $P_1$ and $\lambda(\overrightarrow{bb_i}) \in S$.
11:  for each outgoing neighbor $b_i$, $i = 1, \ldots, m$ do
12:    Let $P_i$ denote the block where $b_i$ resides.
13:    if $(b_i, P_i) \in Visited$ then
14:      continue
15:    else
16:      Call Function Travel$(b_i, P_i)$
17:    end if
18:  end for
19: end for

Function Travel$(b', P')$

1: Insert $(b', P')$ into $Visited$
2: if $P' = P_2$ then
3:   if $M_{P'}(b', u_2)$ covers $S$ then
4:     Return True
5:   end if
6: end if
7: Let $BD = \{b \mid b$ is a boundary vertex in $P'$ and $M_{P'}(b', b)$ covers $S\}$
8: for each $b \in BD$ do
9:   Assume that $b$ has $m$ outgoing neighbors $b_i$ ($i = 1, \ldots, m$) that are in blocks other than $P'$ and $\lambda(\overrightarrow{bb_i}) \in S$.
10:  for each outgoing neighbor $b_i$, $i = 1, \ldots, m$ do
11:    Let $P_i$ denote the block where $b_i$ resides.
12:    if $(b_i, P_i) \in Visited$ then
13:      continue
14:    else
15:      Call Function Travel$(b_i, P_i)$
16:    end if
17:  end for
18: end for
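A compact way to read Algorithm 6 is as a search over block entry points. The sketch below is an illustration under assumptions made here, not the paper's implementation: the local closures are passed in as a precomputed dict `local[(u, v)]` of minimal label sets, crossing edges as `cross_out[b]`, and a single global visited set over entry vertices is used, which suffices for a yes/no reachability answer. It also returns False explicitly when the search is exhausted.

```python
def lcr_query(u1, u2, S, block, local, cross_out):
    """Answer an LCR query by combining local closures with skeleton traversal.

    block:     dict vertex -> block id
    local:     dict (u, v) -> iterable of minimal label sets for paths from u
               to v inside their common block (precomputed, assumed input)
    cross_out: dict boundary vertex -> list of (neighbour_in_other_block, label)
    """
    S = frozenset(S)

    def within(u, v):
        # Reachable inside one block if some minimal label set is a subset of S.
        return u == v or any(ls <= S for ls in local.get((u, v), ()))

    visited = {u1}
    stack = [u1]                       # entry vertices still to be explored
    while stack:
        entry = stack.pop()
        if block[entry] == block[u2] and within(entry, u2):
            return True
        for b, outs in cross_out.items():
            if block[b] != block[entry] or not within(entry, b):
                continue               # b is not a reachable boundary vertex here
            for nxt, lab in outs:
                if lab in S and nxt not in visited:
                    visited.add(nxt)
                    stack.append(nxt)
    return False
```

Scanning all of `cross_out` for each entry vertex keeps the sketch short; an index from block id to its boundary vertices would avoid that scan.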

Algorithm 6 shows the pseudo-code for LCR testing over two vertices $u_1$ and $u_2$ under label constraint $S$. Lines 3–5 verify whether $u_1$ and $u_2$ are in the same block $P_1$ and $LCR(u_1, u_2, S, P_1) = true$, and if so, return true. Otherwise, Line 8 finds all boundary vertices (in $P_1$) that are reachable from $u_1$ under constraint $S$. For the adjacent edges $e = \overrightarrow{bb_i}$ of these boundary vertices, if $\lambda(\overrightarrow{bb_i}) \in S$, the traversal can reach boundary vertices of other blocks under constraint $S$. Note that we use $Visited$ to store all vertices that have been visited in one search branch (Line 1 in Function Travel). A vertex can be visited multiple times (Line 14 in Function Travel) in different branches, but at most once in the same branch. Otherwise, it may generate duplicate results. If the traversal reaches the destination block $P_2$, Lines 2–4 (in Function Travel) verify whether $LCR(u_1, u_2, S, P_2) = true$, and if so, return true. Otherwise, the traversal continues.

Fig. 7. Partition vs. query performance. (a) Graph $G$. (b) A partition. (c) Another partition.

5.2. Finding the optimal partition

As mentioned earlier, different partitions lead to different query performance. For example, given a graph $G$, there are two kinds of graph partition over $G$, as shown in Fig. 7. Consider a query $LCR(1, 2, \{a\}, G)$. In the first partition in Fig. 7(b), our query algorithm can answer $LCR(1, 2, \{a\}, G)$ based on the local transitive closure, since path (1, 3, 2) is contained in one block. However, given the second partition, we have to access another block to answer the same query. Obviously, the former is faster than the latter, since the latter leads to more I/O cost.

The optimal partition should minimize the overall query workload. In this subsection, we formalize the graph partition problem and prove that it is an NP-hard problem, since the min-cut graph partitioning problem can be reduced to this optimization problem. Therefore, we adopt some heuristic methods to find a “good” partition to speed up Algorithm 6. Generally speaking, we reduce a classical min-cut edge-weighted graph partition problem to our problem. The main contribution of our method is how to assign edge weights in our problem. Note that, in the following discussion, we assume that the partition number $k$ is given. We will discuss how to set $k$ in Section 5.3.

Fig. 8. Crossing segments and non-crossing segments.

² Inserting an isolated vertex does not affect the reachability information. Deleting a vertex means deleting all adjacent edges to the vertex.

Considering one path $p$ from $u_1$ to $u_2$ in graph $G$, if there are $l$ crossing edges, $p$ is divided into $2 \times l + 1$ segments. If a segment contains some consecutive non-crossing edges, it is called a non-crossing segment. Otherwise, it is called a crossing segment. For example, given a path $p = (1, 2, 3, 4, 5, 6, 7, 8)$ in Fig. 8, since there are two crossing edges, $p$ is divided into five segments, (1, 2, 3), (3, 4), (4, 5, 6), (6, 7) and (7, 8), where (1, 2, 3), (4, 5, 6) and (7, 8) are non-crossing segments, and the others are crossing segments. Let us recall Algorithm 6. We employ local transitive closures to find a non-crossing segment, whose cost is defined as $\alpha$. We employ online traversal to find a crossing segment, whose cost is defined as $\beta$. Thus, we define the cost of finding one path $p$ as follows:

Definition 5.3. Given a path $p$, if there are $l$ crossing edges, the cost of finding $p$ is defined as $Cost(p) = (l + 1) \times \alpha + l \times \beta = \alpha + (\alpha + \beta) \times l$.

Given a graph $G$, any non-redundant path may be an answer to an LCR query. Therefore, given a partition over graph $G$, we define the overall cost for LCR queries as follows:

Definition 5.4. Given a partition over graph $G$, the overall cost of this partition is defined as follows:

$Cost = \sum_{p_i \in PC} Cost(p_i) = \alpha \times |PC| + (\alpha + \beta) \sum_{p_i \in PC} l_i \qquad (2)$

where $PC$ denotes all non-redundant paths and $l_i$ denotes the number of crossing edges in path $p_i$.

Obviously, given a graph $G$, the first part of Eq. (2) is a constant. Different partitions over $G$ lead to different values of the second part of Eq. (2).

Definition 5.5. A partition over graph $G$ is optimal if and only if its cost (Definition 5.4) is minimal.

Theorem 5.1. Given a graph $G$ and a number $k$, finding the optimal partition (Definition 5.5) that divides $G$ into $k$ disjoint blocks is NP-hard.

Proof. Generally speaking, the classical min-cut graph partition problem, i.e., partitioning $G$ into $k$ disjoint blocks such that the sum of all crossing edge weights is minimized, can be reduced to our optimization problem. In order to prove the theorem, we first introduce the following definition.

Definition 5.6. Given an edge $e$ in graph $G$, its edge weight is defined as follows:

$w(e) = |\{p_i \mid e \in p_i \wedge p_i \in PC\}|$

where $PC$ contains all non-redundant paths. In other words, $w(e)$ denotes the number of non-redundant paths (in $PC$) that contain $e$.

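Given a partition over $G$, all crossing edges are denoted as $CE$. It is straightforward to see that the following equation holds:

$Cost = \alpha \times |PC| + (\alpha + \beta) \times \sum_{p_i \in PC} l_i = \alpha \times |PC| + (\alpha + \beta) \times \sum_{e \in CE} w(e) \qquad (3)$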

Eq. (3) means that the optimal partition for our query algorithm is the partition with the minimal sum of crossing edge weights (Definition 5.6), which is exactly the min-cut graph partitioning problem, a classical NP-hard problem. □
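The second equality in Eq. (3) follows by exchanging the order of summation: counting the crossing edges on each path and summing over paths gives the same total as counting, for each crossing edge, the paths that contain it. Written out, with $[\,\cdot\,]$ denoting the indicator of a condition (a notation introduced here only for this step):

```latex
\sum_{p_i \in PC} l_i
  = \sum_{p_i \in PC} \sum_{e \in CE} [\, e \in p_i \,]
  = \sum_{e \in CE} \bigl|\{\, p_i \in PC : e \in p_i \,\}\bigr|
  = \sum_{e \in CE} w(e)
```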

Actually, the proof of Theorem 5.1 suggests a way to search for the optimal partition. Specifically, all non-redundant paths are enumerated. Then, each edge weight $w(e_i)$ can be determined by Definition 5.6. Finally, a classical min-cut graph partitioning algorithm, such as METIS [15], can be utilized to find a good partition. However, it is prohibitively expensive to enumerate all non-redundant paths (i.e., $PC$) in practice. Therefore, we utilize sampling to estimate the edge weights. Specifically, we randomly select $\Delta$ seed vertices from $G$, denoted as $s_i$, $i = 1, \ldots, \Delta$. Then, for each seed $s_i$, we employ Algorithm 2 to enumerate all non-redundant paths beginning from $s_i$. The set of these paths is denoted as $PC'$. The estimated edge weight of $e$ is defined as follows:

$$w'(e) = |\{p \mid p \in PC' \wedge e \in p\}|$$

Finally, we employ METIS to partition graph $G$ based on these estimated edge weights.
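The sampling step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: Algorithm 2 is not reproduced here, so the hypothetical helper `simple_paths_from` enumerates all simple paths from a seed as a stand-in (it ignores labels and redundancy pruning and is exponential in general), and the call to METIS is omitted.

```python
import random
from collections import Counter


def simple_paths_from(graph, source):
    """Enumerate all simple paths (vertex lists) starting at `source`.

    Stand-in for Algorithm 2: no labels, no pruning of redundant paths.
    """
    stack = [[source]]
    while stack:
        path = stack.pop()
        for v in graph.get(path[-1], ()):
            if v not in path:
                new_path = path + [v]
                yield new_path
                stack.append(new_path)


def estimate_edge_weights(graph, num_seeds, rng):
    """Return w'(e): the number of sampled paths that contain edge e."""
    seeds = rng.sample(sorted(graph), num_seeds)
    weights = Counter()
    for s in seeds:
        for path in simple_paths_from(graph, s):
            weights.update(zip(path, path[1:]))
    return weights


# Hypothetical toy graph as an adjacency list; every vertex appears as a key.
graph = {1: [2], 2: [3, 4], 3: [4], 4: []}
weights = estimate_edge_weights(graph, num_seeds=4, rng=random.Random(0))
print(dict(weights))
```

The resulting weights would then be passed to a min-cut partitioner as edge weights.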

5.3. Setting k

Now, we discuss how to set the partition number $k$. The larger $k$ is, the smaller each block size $|P|$ is. On the other hand, a larger $k$ means more blocks in the skeleton graph $G^*$, which enlarges the search space in the online traversal of Algorithm 6. Therefore, in order to optimize query performance, $k$ should be as small as possible. However, a small $k$ leads to a more expensive offline cost. In the extreme, $k = 1$ means computing the transitive closure over the whole graph. In practice, we can tune $k$ to obtain a good trade-off between offline and online performance.

6. LCR query over dynamic graphs

In this section, we address index maintenance issues in dynamic graphs. In real-life applications, such as social networks, the graph structure evolves over time. We model the updates over graphs as a series of edge insertions and deletions. In this section, we only discuss how to handle these two basic operations (edge insertions and deletions).

Assume that a graph $G$ is decomposed into $k$ blocks $P_i$, $i = 1, \ldots, k$. All boundary vertices and crossing edges are collected to form a skeleton graph $G^*$. The index maintenance involves the updates to the skeleton graph $G^*$ and to the local transitive closures. We first discuss the general framework of our method in Section 6.1. The key problem is how to update the local transitive closures, which is presented in Section 6.2.

6.1. Overview

Consider that we insert an edge $e = \overrightarrow{u_i u_j}$ into graph $G$. There are five cases for $e$:

(1) If $u_i \notin G$ AND $u_j \notin G$, inserting an isolated edge $e$ does not affect the reachability information of the original graph. It is a trivial case.

(2) If $u_i \notin G$ AND $u_j \in G$ AND $u_j \in P_j$, we update the local transitive closure of block $P_j$ by the method in Section 6.2.1.

(3) If $u_i \in G$ AND $u_j \notin G$ AND $u_i \in P_i$, we update the local transitive closure of $P_i$ by the method in Section 6.2.1.

(4) If $u_i \in G$ AND $u_j \in G$ AND $\overrightarrow{u_i u_j} \in P_i$, we update the local transitive closure of $P_i$ by the method in Section 6.2.1.

(5) If $u_i \in G$ AND $u_j \in G$ AND $u_i \in P_i$ AND $u_j \in P_j$ AND $P_i \neq P_j$, meaning that $\overrightarrow{u_i u_j}$ is a crossing edge, we introduce edge $e$ into $G^*$ directly.

Consider that we delete an edge $e = \overrightarrow{u_i u_j}$ from graph $G$. There are two cases for $e$:

(1) If $e$ is a crossing edge in $G^*$, we delete $e$ from $G^*$ directly.

(2) If $e$ is in one block $P_i$, we delete $e$ from $P_i$ and update the local transitive closure of $P_i$ by the method in Section 6.2.2.
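The case analysis above amounts to a small dispatch step. The sketch below assumes a hypothetical dictionary `block_of` that maps every vertex currently in $G$ to the id of its block; the function names and return convention are illustrative and do not come from the paper.

```python
def classify_insertion(u_i, u_j, block_of):
    """Map an inserted edge (u_i, u_j) to one of the five insertion cases.

    Returns (case number, id of the block whose local closure must be
    updated, or None when no local closure is touched).
    """
    in_i, in_j = u_i in block_of, u_j in block_of
    if not in_i and not in_j:
        return 1, None                    # isolated edge, trivial case
    if not in_i:
        return 2, block_of[u_j]           # new source vertex
    if not in_j:
        return 3, block_of[u_i]           # new target vertex
    if block_of[u_i] == block_of[u_j]:
        return 4, block_of[u_i]           # edge inside one block
    return 5, None                        # crossing edge, goes into G*


def classify_deletion(u_i, u_j, block_of):
    """Return (1, None) for a crossing edge, otherwise (2, block id)."""
    if block_of[u_i] != block_of[u_j]:
        return 1, None
    return 2, block_of[u_i]


# Hypothetical partition: vertices 1 and 2 in block 0, vertex 3 in block 1.
block_of = {1: 0, 2: 0, 3: 1}
print(classify_insertion(1, 3, block_of), classify_deletion(1, 2, block_of))
```

Only cases 2 to 4 of the insertion and case 2 of the deletion require the local maintenance of Section 6.2.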

6.2. Local transitive closure maintenance

6.2.1. Edge insertion

Assume that we have computed the local transitive closure of block $P$, denoted as $M_P$. This subsection discusses how to update the local transitive closure if a new edge $e = \overrightarrow{u_i u_j}$ is inserted into $P$. Let $P'$ be the updated block. There are four cases for $e = \overrightarrow{u_i u_j}$, and Algorithm 7 shows the pseudocode:

(1) $u_i \notin P \wedge u_j \notin P$; (2) $u_i \in P \wedge u_j \notin P$; (3) $u_i \notin P \wedge u_j \in P$; (4) $u_i \in P \wedge u_j \in P$.

It is very easy to update the transitive closure $M_P$ in the first three cases. In Case 1 (Fig. 9(a)), the inserted edge does not affect the transitive closure of other vertices, i.e., $\forall u \in P$, $M_P(u, \cdot) = M_{P'}(u, \cdot)$, where $M_P(u, \cdot)$ denotes the single source transitive closure from vertex $u$ in block $P$.

Let us consider Case 2 in Fig. 9(b). Initially, for each vertex $u \in P$, we set $M_{P'}(u, \cdot) = M_P(u, \cdot)$. Then, for each vertex $u \in P$, $M_{P'}(u, u_j) = Prune(M_P(u, u_i) \odot \lambda(\overrightarrow{u_i u_j}))$. We set $M_{P'}(u_j, \cdot) = \{\phi, \ldots, \phi\}$.

Let us consider Case 3 in Fig. 9(c). For each vertex $u \in P$, we set $M_{P'}(u, \cdot) = M_P(u, \cdot)$. Then, $M_{P'}(u_i, \cdot) = Prune(\lambda(\overrightarrow{u_i u_j}) \odot M_P(u_j, \cdot))$.

The key issue is how to compute $M_{P'}$ in Case 4 in Fig. 9(d), which is the focus of this subsection.
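Cases 2 and 3 can be sketched in a few lines. The code below is an illustration under several assumptions that are not taken from the paper: the closure is stored as a nested dictionary `M[u][v]` holding a set of `frozenset` label sets, `prune` keeps only label sets that are not proper supersets of another one, and the empty path from $u_i$ to itself is added explicitly so that the new edge alone yields the label set $\{\lambda(e)\}$.

```python
def prune(label_sets):
    """Keep only minimal label sets (drop any proper superset of another)."""
    sets = set(label_sets)
    return {s for s in sets if not any(t < s for t in sets)}


def extend(label_sets, label):
    """Join every label set with one edge label."""
    return {s | {label} for s in label_sets}


def insert_case2(M, u_i, u_j, label):
    """Case 2: u_i is in P and u_j is a new vertex. Returns an updated copy."""
    new_M = {u: dict(row) for u, row in M.items()}
    for u in M:
        reach_ui = M[u].get(u_i, set())
        if u == u_i:
            reach_ui = reach_ui | {frozenset()}   # assumed empty path u_i -> u_i
        if reach_ui:
            new_M[u][u_j] = prune(extend(reach_ui, label))
    new_M[u_j] = {}                               # u_j reaches nothing yet
    return new_M


def insert_case3(M, u_i, u_j, label):
    """Case 3: u_i is a new vertex and u_j is in P. Returns an updated copy."""
    new_M = {u: dict(row) for u, row in M.items()}
    row = {u_j: {frozenset({label})}}             # the new edge itself
    for v, sets in M[u_j].items():
        row[v] = prune(row.get(v, set()) | extend(sets, label))
    new_M[u_i] = row
    return new_M


# Hypothetical block: vertex 1 reaches vertex 2 with label set {a}.
M = {1: {2: {frozenset("a")}}, 2: {}}
M2 = insert_case2(M, 2, 3, "b")   # new edge 2 -> 3 labelled b
M3 = insert_case3(M, 0, 1, "c")   # new edge 0 -> 1 labelled c
```

Both functions touch one column or one row of the closure, which is consistent with the linear cost reported for these cases in Table 2.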

Fig. 9. Four cases.

Algorithm 7. Insertion.

Input: The local transitive closure $M_P$ over block $P$, and an inserted edge $e = \overrightarrow{u_i u_j}$;
Output: The updated block $P'$ and the updated transitive closure $M_{P'}$.

Insert $e$ into $P$ to get $P'$.
$M_{P'} = M_P$.
// Case 1
if $u_i \notin P \wedge u_j \notin P$ then
    set $M_{P'}(u_i, u_j) = \lambda(e)$ and insert $M_{P'}(u_i, u_j)$ into $M_{P'}$
    Return $M_{P'}$
end if
// Case 2
if $u_i \in P \wedge u_j \notin P$ then
    for each vertex $u$ in $P$ do
        $M_{P'}(u, u_j) = Prune(M_P(u, u_i) \odot \lambda(\overrightarrow{u_i u_j}))$
    end for
    set $M_{P'}(u_j, \cdot) = \{\phi, \ldots, \phi\}$ and insert $M_{P'}(u_j, \cdot)$ into $M_{P'}$
    Return $M_{P'}$
end if
// Case 3
if $u_i \notin P \wedge u_j \in P$ then
    set $M_{P'}(u_i, \cdot) = Prune(\lambda(\overrightarrow{u_i u_j}) \odot M_P(u_j, \cdot))$ and put $M_{P'}(u_i, \cdot)$ into $M_{P'}$
    Return $M_{P'}$
end if
// Case 4
if $u_i \in P \wedge u_j \in P$ then
    Get an inversed graph $\bar{P}$ by reversing each edge's direction in $P'$
    Employ Algorithm 2 in $\bar{P}$ from vertex $u_j$, replacing Line 2 by "put neighbor $u_i$ of $u_j$ into $H$", to get $M_{\bar{P}, \overrightarrow{u_j u_i}}(u_j, \cdot)$
    for each vertex $u$ in $P$ do
        Set $M_{P', \overrightarrow{u_i u_j}}(u, u_j) = M_{\bar{P}, \overrightarrow{u_j u_i}}(u_j, u)$
        Set $M_{P'}(u, \cdot) = Prune(M_P(u, \cdot) \cup M_{P', \overrightarrow{u_i u_j}}(u, u_j) \odot M_P(u_j, \cdot))$
    end for
    Return $M_{P'}$
end if

Theorem 6.1. Let $P$ be changed into $P'$ by inserting an edge $e$ into $P$. Given any two vertices $u_1$ and $u_2$ in block $P$, $M_P(u_1, u_2) \neq M_{P'}(u_1, u_2)$ if and only if the following conditions hold, where $M_P(u_1, u_2)$ denotes the minimal path label sets in block $P$ (Definition 3.6).

(1) there exists a path $p$ from $u_1$ to $u_2$ in $P'$, where $p$ goes through $e$; and

(2) $M_P(u_1, u_2)$ does not cover $L(p)$, where $L(p)$ denotes the path label of $p$.

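Table 2
Time complexity analysis of inserting edge $e$ into block $P$.

| $u_i \notin P \wedge u_j \notin P$ | $u_i \in P \wedge u_j \notin P$ | $u_i \notin P \wedge u_j \in P$ | $u_i \in P \wedge u_j \in P$ |
|---|---|---|---|
| $O(1)$ | $O(\lvert V(P) \rvert)$ | $O(\lvert V(P) \rvert)$ | $O(D^d)$ |

Note: $e = \overrightarrow{u_i u_j}$, $D$ is the maximal degree in block $P$ and $d$ is the diameter of $P$.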

Proof. It is straightforward to know that if the above two conditions hold, $M_P(u_1, u_2) \neq M_{P'}(u_1, u_2)$.

Given two vertices $u_1$ and $u_2$, if there exists no path $p$ from $u_1$ to $u_2$ that goes through $e$, the insertion of $e$ does not affect $M_P(u_1, u_2)$. Thus, $M_P(u_1, u_2) = M_{P'}(u_1, u_2)$.

Given two vertices $u_1$ and $u_2$, if there exists a path $p$ from $u_1$ to $u_2$ that goes through $e$, but $M_P(u_1, u_2)$ covers $L(p)$, there must exist another path $p'$ from $u_1$ to $u_2$ whose path label $L(p')$ covers $L(p)$. Thus, the insertion of $e$ does not affect $M_P(u_1, u_2)$, i.e., $M_P(u_1, u_2) = M_{P'}(u_1, u_2)$. □

According to Theorem 6.1, we design an algorithm to handle the insertion in Case 4. Assume that edge $e = \overrightarrow{u_i u_j}$ is inserted into block $P$. Initially, for each vertex $u \in P'$, we set $M_{P'}(u, \cdot) = M_P(u, \cdot)$. Due to the inserted edge $\overrightarrow{u_i u_j}$, some reachability information needs to be updated. For example, due to the inserted edge $\overrightarrow{10, 4}$ in Fig. 9(d), vertex 10 can reach vertices {4, 5, 6, 1, 2, 3, 8} in $G'$, since vertex 4 can reach these vertices in $G$ and vertex 10 can reach 4 in $G'$. Therefore, following the reverse direction of the edges, we need to propagate $M_P(4, \cdot)$ to other vertices iteratively through $\overrightarrow{10, 4}$. Note that the propagation process also follows the "best-first" strategy in Algorithm 2 to avoid redundant paths.

Specifically, we design the following algorithm. We first insert edge $\overrightarrow{u_i u_j}$ into block $P$ to obtain $P'$. We get an inversed graph $\bar{P}$ by reversing each edge's direction in $P'$. Then, we employ Algorithm 2 in $\bar{P}$ from vertex $u_j$. Note that, in the first step, we only consider neighbor $u_i$ in graph $\bar{P}$. In this way, we know how $u_j$ can reach other vertices in $\bar{P}$ through edge $\overrightarrow{u_j u_i}$. It also means that we know how each vertex $u$ ($\in P'$) can reach $u_j$ through edge $\overrightarrow{u_i u_j}$ in graph $P'$, denoted as $M_{P', \overrightarrow{u_i u_j}}(u, u_j)$. Finally, for each vertex $u \in P'$, we update $M_{P'}(u, \cdot) = Prune(M_P(u, \cdot) \cup M_{P', \overrightarrow{u_i u_j}}(u, u_j) \odot M_P(u_j, \cdot))$.

For example, we get a block $P'$ by inserting edge $e = \overrightarrow{10, 4}$ into $P$ in Fig. 9(d). Initially, we set $M_{P'}(u, \cdot) = M_P(u, \cdot)$, where $u \in P'$. We get a reverse graph $\bar{G}$, as shown in Fig. 10(a). Then, we employ Algorithm 2 in $\bar{P}$ from vertex 4. Note that, in the first step, we only consider neighbor 10 in graph $\bar{P}$. Fig. 10(b) shows the process. Note that we only consider which vertices can be reached from vertex 4 through edge $\overrightarrow{4, 10}$ in $\bar{P}$. Then, we know that $M_{P', \overrightarrow{10, 4}}(10, 4) = \{c\}$ and $M_{P', \overrightarrow{10, 4}}(9, 4) = \{b, c\}$. For other vertices $u \in P'$, $M_{P', \overrightarrow{10, 4}}(u, 4) = \phi$. Finally, we update $M_{P'}(10, \cdot) = M_{P', \overrightarrow{10, 4}}(10, 4) \odot M_{P'}(4, \cdot)$ and $M_{P'}(9, \cdot) = Prune(M_P(9, \cdot) \cup M_{P', \overrightarrow{10, 4}}(9, 4) \odot M_{P'}(4, \cdot))$. The time complexity analysis is given in Table 2.

Fig. 10. Algorithm process. (a) A reverse graph $\bar{G}$. (b) Process.

Algorithm 8. Deletion.

Input: The local transitive closure $M_P$ over block $P$, and a deleted edge $e = \overrightarrow{u_i u_j}$;
Output: The updated block $P'$ and the updated transitive closure $M_{P'}$.

Delete $e = \overrightarrow{u_i u_j}$ from $P$ to get $P'$.
Let $C_i$ and $C_j$ be the two maximal strongly connected components that $u_i$ and $u_j$ reside in.
// Case 1
if $C_i \neq C_j$ then
    // All maximal strongly connected components do not change in $P'$.
    Rebuild the aDAG $D'$ by calling Algorithm 3.
    Recompute $M_{P'}$ by calling Function aDAG in Algorithm 4.
end if
// Case 2
if $C_i = C_j$ then
    Recompute all maximal strongly connected components.
    for each maximal strongly connected component $C'$ do
        if it is the same as some maximal strongly connected component $C$ in $P$ then
            $M_{C'} = M_C$
        else
            Recompute $M_{C'}$ by calling Algorithm 2
        end if
    end for
    Recompute $M_{P'}$ by calling Function aDAG in Algorithm 4.
end if
Return $M_{P'}$

Fig. 11. Two cases. (a) Case 1. (b) Case 2.

Table 3
Time complexity analysis of deleting edge $e$ from block $P$.

| $u_i \in C_i \wedge u_j \in C_j \wedge C_i \neq C_j$ | $u_i \in C_i \wedge u_j \in C_j \wedge C_i = C_j$ |
|---|---|
| $O(\lvert V(P) \rvert^3)$ | $O(\mathrm{Max}(\lvert V(P) \rvert^3, \lvert V(P) \rvert^2 \times D^d))$ |

Note: $D$ is the maximal degree in block $P$ and $d$ is the diameter of $P$.

For edge insertion, the time complexity of the first three cases is trivial, as shown in Table 2. For the last case, we need to employ Algorithm 2, whose time complexity has been given in Theorem 4.3. Thus, the time complexity of the last case is $O(D^d)$, where $D$ is the maximal vertex outgoing degree in block $P$ and $d$ is the diameter of $P$.
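The final combination step of Case 4 can be sketched as follows. This is only an illustration: the label sets of the paths that end with the inserted edge (obtained in the paper by running Algorithm 2 on the reversed block) are simply passed in as the hypothetical argument `via_edge`, the closure layout `M[u][v]` as a set of `frozenset` label sets is an assumption, and self-entries `M[u][u]` are skipped by choice. The values for vertices 10 and 9 follow the example above, while the rows of `M` are made up.

```python
def prune(label_sets):
    """Keep only minimal label sets (drop any proper superset of another)."""
    sets = set(label_sets)
    return {s for s in sets if not any(t < s for t in sets)}


def join(left, right):
    """All unions s | t with s taken from `left` and t taken from `right`."""
    return {s | t for s in left for t in right}


def insert_case4_combine(M, via_edge, u_j):
    """Apply M'(u, .) = Prune(M(u, .) U via_edge[u] (.) M(u_j, .)).

    via_edge[u] holds the label sets of the paths from u to u_j whose last
    edge is the inserted edge. Returns an updated copy of M.
    """
    new_M = {u: dict(row) for u, row in M.items()}
    for u, to_uj in via_edge.items():
        row = new_M.setdefault(u, {})
        targets = dict(M.get(u_j, {}))
        # u_j itself is reached by the empty continuation.
        targets[u_j] = targets.get(u_j, set()) | {frozenset()}
        for v, sets in targets.items():
            if v == u:
                continue                  # assumption: no self-entries
            row[v] = prune(row.get(v, set()) | join(to_uj, sets))
    return new_M


fs = frozenset
# Hypothetical closure before inserting the edge 10 -> 4.
M = {4: {5: {fs("a")}}, 9: {5: {fs("ab")}, 10: {fs("b")}}, 10: {}, 5: {}}
via_edge = {10: {fs("c")}, 9: {fs("bc")}}
new_M = insert_case4_combine(M, via_edge, u_j=4)
```

In this toy input, the new path from vertex 9 to vertex 5 carries the labels {a, b, c}, which is covered by the existing label set {a, b} and is therefore pruned.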

6.2.2. Edge deletion

In this section, we discuss how to update the local transitive closure if we delete one edge $e = \overrightarrow{u_i u_j}$ from block $P$. Let $P'$ be the updated block after the edge deletion. $C_i$ and $C_j$ are the two maximal strongly connected components containing $u_i$ and $u_j$ in the original block $P$, respectively. There are two cases for edge $e = \overrightarrow{u_i u_j}$. Fig. 11 demonstrates the two cases.

1. $u_i$ and $u_j$ are not in the same maximal strongly connected component, i.e., $C_i \neq C_j$.

2. $u_i$ and $u_j$ are in the same maximal strongly connected component, i.e., $C_i = C_j$.

If $u_i$ and $u_j$ are in two different maximal strongly connected components, it is straightforward to know that all maximal strongly connected components in $P'$ are the same as those in $P$. According to Algorithm 3, we can compute the updated aDAG, denoted as $D'$. Finally, we re-compute the local transitive closure by calling Function aDAG in Algorithm 4 again. Actually, the computing process can be optimized by the following theorem. According to Theorem 6.2, we only need to begin the propagation from $C_j$.

Theorem 6.2. Let $P'$ be the updated block obtained by deleting an edge $e = \overrightarrow{u_i u_j}$ from $P$, where $u_i$ and $u_j$ are in two different maximal strongly connected components $C_i$ and $C_j$, respectively. Consider a vertex $u$ ($\in P'$) that is included in a maximal strongly connected component $C$. $M_{P'}(u, \cdot) \neq M_P(u, \cdot)$ only if there exists a directed path from $C$ to $C_i$ in aDAG $D$, where $D$ is the augmented DAG of block $P$.

Proof. If there is no directed path from $C$ to $C_i$ in aDAG $D$, there exists no path from vertex $u$ to other vertices in $P$ that goes through the deleted edge $e = \overrightarrow{u_i u_j}$. Therefore, the single source transitive closure is unchanged, i.e., $M_{P'}(u, \cdot) = M_P(u, \cdot)$. □
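Theorem 6.2 restricts the recomputation to the components that can reach $C_i$ in the aDAG. A minimal sketch of this filter is given below, assuming that the aDAG is available as a hypothetical dictionary `dag` from each component id to its successor components and that $C_i$ itself counts as reaching $C_i$.

```python
from collections import deque


def components_to_recompute(dag, c_i):
    """Return C_i together with every component that has a path to C_i."""
    # Build predecessor lists so that the search can walk edges backwards.
    preds = {c: [] for c in dag}
    for c, succs in dag.items():
        for s in succs:
            preds.setdefault(s, []).append(c)
    affected, queue = {c_i}, deque([c_i])
    while queue:
        c = queue.popleft()
        for p in preds.get(c, ()):
            if p not in affected:
                affected.add(p)
                queue.append(p)
    return affected


# Hypothetical aDAG: A -> B -> C and D -> C; the deleted edge leaves B.
dag = {"A": ["B"], "B": ["C"], "C": [], "D": ["C"]}
affected = components_to_recompute(dag, "B")
```

Every component outside the returned set keeps its single source transitive closures unchanged.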

If $u_i$ and $u_j$ are in the same maximal strongly connected component $C$, we propose the following algorithm to handle the updates. First, we identify all maximal strongly connected components in the updated graph $G'$. Then, according to the method in Section 4, we represent $G'$ as a new aDAG $D'$. For each maximal strongly connected component $C'$ in $G'$, if $C'$ is the same as some maximal strongly connected component in $G$, it is not necessary to recompute $M_{C'}$. Otherwise, we need to recompute $M_{C'}$. Since we only delete one edge, most components are not changed in $G'$. According to the reverse topological order of $D'$, we find the lowest component $C'$ whose local transitive closure (i.e., $M_{C'}$) is changed. Finally, we call Function aDAG in Algorithm 4 from the component $C'$ to recompute the transitive closure.

In the first case, the maximal strongly connected components do not change after deleting an edge. We only need to re-compute the topological sorting over $D'$ and re-compute $M_{P'}$ by calling Function aDAG in Algorithm 4. Thus, the time complexity of the first case is the same as that of Function aDAG, i.e., $O(|V(P)|^3)$. In the second case, we have to re-compute $M_{C_i}$ for some maximal strongly connected components. Thus, the time complexity of the second case is $O(\mathrm{Max}(|V(P)|^3, |V(P)|^2 \times D^d))$.
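The reuse test of the second case can be sketched as follows. The code is a simple illustration rather than the paper's procedure: components are found by intersecting forward and backward reachability (not linear time), and a component is treated as reusable only if it has the same vertex set as before and does not contain both endpoints of the deleted edge, which is an assumption about what "the same" means here.

```python
def reachable(adj, start):
    """Vertices reachable from `start` (including `start`) in `adj`."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def strongly_connected_components(vertices, edges):
    """Return the SCCs as a set of frozensets of vertices."""
    fwd, bwd = {}, {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)
        bwd.setdefault(v, []).append(u)
    components, assigned = set(), set()
    for v in vertices:
        if v not in assigned:
            comp = frozenset(reachable(fwd, v) & reachable(bwd, v))
            components.add(comp)
            assigned |= comp
    return components


def split_after_deletion(vertices, edges, deleted_edge):
    """Return (components whose closure can be reused, components to recompute)."""
    before = strongly_connected_components(vertices, edges)
    remaining = [e for e in edges if e != deleted_edge]
    after = strongly_connected_components(vertices, remaining)
    u, v = deleted_edge
    reusable = {c for c in after & before if not (u in c and v in c)}
    return reusable, after - reusable


# Hypothetical block: cycle 1 -> 2 -> 3 -> 1, edge 3 -> 4, cycle 4 -> 5 -> 4.
vertices = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4)]
reusable, recompute = split_after_deletion(vertices, edges, (2, 3))
```

In this toy block, deleting the edge from 2 to 3 breaks the first cycle into three singleton components, while the component {4, 5} is untouched.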

7. Experiments

In this section, we evaluate our methods over both random networks and real datasets, and compare them with the existing solution, the sampling-tree method in [11]. Specifically, we experimentally study the performance of four approaches: (1) the sampling-tree method proposed in [11]; (2) the transitive closure method, which computes the path-label transitive closure by Algorithm 4 and answers LCR queries based on it; (3) the partition-based approach proposed in Algorithm 6; (4) the bi-directional search proposed in [9]. The code of the sampling-tree method was provided by the authors of [11]. Our methods, including the transitive closure method and the partition-based approach, are implemented in C++, and our experiments are conducted on a P4 3.0 GHz machine with 2 GB RAM running Ubuntu Linux.

7.1. Datasets

Two types of synthetic datasets are used in our experiments, namely, the Erdős–Rényi model (ER) and the scale-free model (SF). ER is a classical random graph model. It defines a random graph as $|V|$ vertices connected by $|E|$ edges, chosen randomly from the $|V|(|V|-1)$ possible edges. In our experiments, we vary the density $|E|/|V|$ from 1.5 to 5.0, and vary $|V|$ from 1K to 200K. SF defines a random network with $|V|$ vertices whose vertex degrees satisfy a power-law distribution. In our implementation, we use the graph generator gengraphwin (http://fabien.viger.free.fr/liafa/generation/) to generate a large graph $G$ satisfying a power-law distribution. Usually, the power-law distribution parameter $\gamma$ is between 2.0 and 3.0 to simulate real complex networks [19]. Thus, the default value of parameter $\gamma$ is set to 2.5 in this work. In order to study the scalability, we also vary $|V|$ in SF networks from 1K to 200K. The number of edge labels ($|\Sigma|$) is 20. The labels are generated according to a uniform distribution.
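An edge-labeled ER graph of the kind described above can be generated with a few lines of standard-library Python. This is a sketch under assumptions: the function name, the label names and the fixed seed are illustrative, each vertex pair carries at most one label, and the generator actually used in the experiments is not specified in the text.

```python
import random


def generate_er_graph(num_vertices, num_edges, labels, rng):
    """Sample `num_edges` distinct directed edges (no self-loops) uniformly
    at random and assign each one a uniformly chosen label.

    Requires num_edges <= num_vertices * (num_vertices - 1).
    """
    edges = set()
    while len(edges) < num_edges:
        u = rng.randrange(num_vertices)
        v = rng.randrange(num_vertices)
        if u != v:
            edges.add((u, v))
    return {(u, v): rng.choice(labels) for (u, v) in sorted(edges)}


labels = [f"l{i}" for i in range(20)]          # 20 edge labels
graph = generate_er_graph(1000, 1500, labels, random.Random(42))  # density 1.5
```

Rejection sampling is adequate here because the graphs are sparse, so repeated pairs are rare.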

We also employ four real graph datasets (Yeast, Small-Yago, Large-Yago and DBLP) in our experiments. The firsttwo datasets are provided by authors in [11].

(1) Yeast is a protein-to-protein interaction network in budding yeast. Each vertex denotes a protein and an edge denotes the interaction between two corresponding proteins. The Yeast graph contains 3063 vertices (genes) with density 2.4. It has 5 edge labels, which correspond to different types of interactions, such as protein–DNA and protein–protein interactions.

(2) Small-YAGO is a graph sampled from a large RDF dataset, containing 5000 vertices with 66 labels; its density is $|E|/|V| = 5.7$.

(3) Large-YAGO is the full version of the RDF graph corresponding to the YAGO dataset, which is a knowledge base containing information harvested from Wikipedia and linked to WordNet. In our experiments, we delete all "literal" vertices from the RDF graph and keep all "entity" and "class" vertices. Each edge label corresponds to one property. Generally speaking, Large-YAGO has 2 million vertices, 6 million edges and 97 edge labels. The average density is $|E|/|V| = 2.7$.

(4) DBLP contains a large number of bibliographic descriptions of major computer science journals and proceedings. We use an RDF version of the DBLP dataset, which is available at http://sw.deri.org/aharth/2004/07/dblp/. We also delete all "literal" vertices from the DBLP RDF graph. There are 1,145,882 vertices, 1,699,117 edges and 5 edge labels in the RDF graph. Each edge label denotes one property. The average density is $|E|/|V| = 1.48$.

7.2. Performance of transitive closure method

In this section, we use Algorithm 4 to compute the transitive closure for a graph $G$. We report the index construction time (IT), index size (IS) and average query response time (QT) for the experiments on the synthetic datasets. IT-opt refers to the index construction time when we utilize the optimization technique in Section 4.3. Note that the default query constraint size ($|S|$) is $30\% \times |\Sigma| = 6$. Furthermore, we also compare our method with the sampling-tree method. Note that, in the following experiments, we always randomly generate 1000 queries to evaluate query performance. QT is reported as the average response time for one query. In these experiments, we evaluate the performance with regard to the graph size, the graph density and the label constraint size $|S|$. Furthermore, we also test the performance of bi-directional search in Table 4. Since bi-directional search does not need offline processing, we only report its QT in the following experiments.

Exp1. varying graph size ($|V|$) on ER graphs: In this experiment, we fix the density $|E|/|V| = 1.5$ and the label constraint size $|S| = 6$, and vary $|V|$ from 1000 to 10,000 to study the performance under varying graph sizes. Table 4 reports the detailed performance, namely index sizes (IS), index building times (IT) and average query response time (QT). From Table 4, we know that the transitive closure method is faster than the sampling-tree method in offline processing by orders of magnitude. Furthermore, the optimization technique (IT-opt) in Section 4.3 can further speed up offline processing, as shown in Table 4. For example, when $|V| = 1$K, the transitive closure method spends 15 s to build the index and the optimization technique only needs 3 s, but the sampling-tree method needs 113 s. The index size of our method is much smaller than that of the sampling-tree method. Furthermore, our query performance is also better than that of the sampling-tree method. From Table 4, we know that bi-directional search is also very fast for LCR queries. However, when the constraint label size $|S|$ increases, the performance of the bi-directional search method degrades greatly, as evaluated in Table 8 in Exp4.

Table 4
Performance vs. $|V|$ in ER graphs ($d = 1.5$).

| $\lvert V \rvert$ | Transitive closure IT (s) | Transitive closure IT-opt (s) | Transitive closure IS (KB) | Transitive closure QT (ms) | Sampling-tree IT (s) | Sampling-tree IS (KB) | Sampling-tree QT (ms) | Bi-directional search QT (ms) |
|---|---|---|---|---|---|---|---|---|
| 1K | 15 | 3 | 415 | 0.01 | 113 | 13,275 | 0.05 | 0.01 |
| 2K | 23 | 5 | 1396 | 0.01 | 3493 | 51,020 | 0.09 | 0.02 |
| 4K | 217 | 7 | 4920 | 0.02 | 5680 | 78,920 | 0.10 | 0.02 |
| 6K | 623 | 10 | 9000 | 0.03 | 9689 | 92,890 | 0.15 | 0.03 |
| 8K | 964 | 12 | 13,200 | 0.03 | …290 | 100,890 | 0.18 | 0.04 |
| 10K | 5065 | 15 | 33,000 | 0.04 | 100,560 | 123,450 | 0.29 | 0.05 |

Table 7
Performance vs. $|V|$ in SF graphs ($d = 1.5$).

| $\lvert V \rvert$ | Transitive closure IT (s) | Transitive closure IT-opt (s) | Transitive closure IS (KB) | Transitive closure QT (ms) | Sampling-tree IT (s) | Sampling-tree IS (KB) | Sampling-tree QT (ms) | Bi-directional search QT (ms) |
|---|---|---|---|---|---|---|---|---|
| 1K | 0.1 | 0.1 | 43 | 0.01 | 0.6 | 1576 | 0.16 | 0.01 |
| 2K | 0.2 | 0.1 | 94 | 0.01 | 2.6 | 2583 | 0.17 | 0.02 |
| 4K | 0.2 | 0.1 | 140 | 0.01 | 9.5 | 4870 | 0.18 | 0.02 |
| 6K | 0.4 | 0.2 | 289 | 0.01 | 27.8 | 8854 | 0.20 | 0.03 |
| 8K | 0.4 | 0.2 | 390 | 0.01 | 45.5 | 11393 | 0.22 | 0.03 |
| 10K | 1.1 | 0.3 | 492 | 0.01 | 80.4 | 23931 | 0.24 | 0.03 |

Exp2. varying density ($|E|/|V|$) on ER graphs: In these experiments, we fix $|V| = 10{,}000$ and vary the density $|E|/|V|$ from 2 to 5 to study the performance of our method in dense graphs. From Table 5, we know that the index building time and the index size increase when varying $|E|/|V|$ from 2 to 5 in both methods. Furthermore, the sampling-tree method cannot finish index building in 48 h when $|E|/|V| \geq 4$. From Table 5, we know that the transitive closure method has better scalability with regard to the graph density $|E|/|V|$ than the sampling-tree method. Actually, both methods need to compute $M_G(u, \cdot)$ (i.e., the single-source transitive closure). As proven in Theorem 4.1, our method has the minimal search space, but the search space of the sampling-tree method is not minimal. Thus, the large search space limits the scalability of the sampling-tree method. Furthermore, our query performance is also better than that of the sampling-tree method.

Table 5
Performance vs. density in ER graphs.

| Degree $d$ | Transitive closure IT (s) | Transitive closure IT-opt (s) | Transitive closure IS (KB) | Transitive closure QT (ms) | Sampling-tree IT (s) | Sampling-tree IS (KB) | Sampling-tree QT (ms) | Bi-directional search QT (ms) |
|---|---|---|---|---|---|---|---|---|
| 2 | 6890 | 20 | 26.3 | 0.08 | 123,890 | 95.6 | 0.31 | 0.05 |
| 3 | 11,112 | 23 | 102.3 | 0.09 | 378,563 | 232.7 | 0.53 | 0.09 |
| 4 | 25,347 | 35 | 160 | 0.12 | F | F | F | 0.15 |
| 5 | 33,169 | 80 | 186 | 0.23 | F | F | F | 0.18 |

Another observation is that our optimization technique (Section 4.3) for computing the transitive closure in each strongly connected component (SCC) works very well, since IT-opt is much faster than IT in Table 5, especially in dense graphs. As we know, computing the transitive closure of graph $G$ involves two steps. The first step is to compute the local transitive closure in each SCC. The naive solution for this step is to run Algorithm 2 from each vertex in the SCC. However, according to the analysis in Section 4.3, we can speed up the process by sharing computation. In ER graphs, there are many SCCs, especially in high-degree graphs. Table 6 shows the fraction of vertices that are in non-trivial SCCs (i.e., size > 1). We observe that most vertices are in non-trivial SCCs in ER graphs. Furthermore, the fraction grows as the average vertex degree increases. It means that computing the local transitive closure in each SCC accounts for a large proportion of the whole computation. Therefore, IT-opt works much better than IT in ER graphs.

Table 6
The fraction of vertices in non-trivial SCCs.

| Degree | 1.5 | 2.0 | 2.5 | 3.0 | 3.5 | 4.0 | 4.5 | 5.0 |
|---|---|---|---|---|---|---|---|---|
| ER10K (%) | 34.47 | 63.57 | 79.16 | 88.19 | 93.42 | 96.14 | 97.58 | 98.58 |

Exp3. varying graph size ($|V|$) on SF graphs: Similar to Exp1, we study the performance of our method by varying $|V|$ from 1K to 10K in SF graphs. Table 7 shows that our method is also better than the sampling-tree method in all performance measures. Note that our method is much faster in SF graphs than in ER graphs. The reason is that most vertices in SF graphs have very small degrees, which reduces the search space in offline processing.

Exp4. varying query constraint size ($|S|$) on ER and SF graphs: In ER graphs, we fix $|V| = 10{,}000$ and $|E| = 15{,}000$. In SF graphs, we fix $|V| = 10{,}000$ and the power-law distribution parameter $\gamma = 2.5$. We vary $|S|$ from $30\% \times |\Sigma| = 6$ to $80\% \times |\Sigma| = 16$. Note that the offline process does not depend on the label constraint size $|S|$. Thus, we only report the query response time in Table 8. The online performance of the transitive closure method is stable with respect to $|S|$. We also report the performance of the sampling-tree method in Table 8. From Table 8, we have two findings: (1) the transitive closure method is faster than the sampling-tree method in query response time; (2) the performance of the bi-directional search degrades significantly when $|S|$ increases. The reason is that the search space becomes very large when $|S|$ increases.


Table 8
Performance vs. $|S|$ in ER and SF graphs (QT in ms; $|V| = 10$K in all graphs, $d = 1.5$ in ER graphs).

| $\lvert S \rvert$ | Transitive closure, ER | Transitive closure, SF | Sampling-tree, ER | Sampling-tree, SF | Bi-directional search, ER | Bi-directional search, SF |
|---|---|---|---|---|---|---|
| 6 | 0.04 | 0.01 | 0.29 | 0.24 | 0.05 | 0.03 |
| 8 | 0.04 | 0.01 | 0.32 | 0.25 | 0.08 | 0.05 |
| 10 | 0.05 | 0.02 | 0.32 | 0.27 | 0.25 | 0.1 |
| 12 | 0.06 | 0.03 | 0.39 | 0.29 | 1.20 | 0.7 |
| 14 | 0.06 | 0.04 | 0.40 | 0.32 | 3.49 | 1.2 |
| 16 | 0.07 | 0.04 | 0.52 | 0.34 | 10.90 | 3.59 |

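Table 9
Offline performance in real datasets.

| Dataset | Transitive closure IT (s) | Transitive closure IT-opt (s) | Transitive closure IS (MB) | Sampling-tree IT (s) | Sampling-tree IS (MB) |
|---|---|---|---|---|---|
| Yeast | 60 | 10 | 3.18 | 877 | 151 |
| Small Yago | 32 | 6 | 1.58 | 945 | 90 |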

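Table 10
Online performance in real datasets (QT in ms).

| $\lvert S \rvert$ / all labels | Transitive closure, Yeast | Transitive closure, Small Yago | Sampling-tree, Yeast | Sampling-tree, Small Yago | Bi-directional search, Yeast | Bi-directional search, Small Yago |
|---|---|---|---|---|---|---|
| 40% | 0.14 | 0.12 | 0.68 | 1.20 | 0.09 | 0.15 |
| 60% | 0.17 | 0.23 | 0.76 | 1.30 | 1.28 | 2.05 |
| 80% | 0.20 | 0.36 | 1.23 | 1.50 | 3.59 | 3.87 |
| 100% | 0.27 | 0.46 | 1.56 | 1.60 | 9.78 | 10.56 |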


Exp5. performance on real graphs: In this experiment, we evaluate the transitive closure method and the sampling-tree method on two small real graphs, Yeast and Small-Yago. Table 9 confirms that our method is much better than the sampling-tree approach in all performance measures. For example, the index building time of our method is only about 1/10 to 1/30 of that of the sampling-tree method. The index size of our method is also much smaller than that of the sampling-tree method. Furthermore, the bi-directional search method cannot work well when $|S|$ is large. We also report the online performance on real datasets in Table 10.

7.3. Performance of partition-based approach

Although the transitive closure method has good performance, it suffers from a high offline processing cost in a large graph. Therefore, we evaluate the performance of the partition-based approach in this section.

Setup. Given a large graph $G$, we partition $G$ into $k = \lceil |V|/|P| \rceil$ blocks. The index building time can be estimated as follows: $IT(G) = \lceil |V|/|P| \rceil \times IT(P)$, where $IT(P)$ is the index building time for block $P$. According to the statistics in Tables 4 and 7, we can set $|P|$ to minimize the overall index building time, i.e., $IT(G)$. Therefore, we set $|P| = 2$K in ER graphs and $|P| = 8$K in SF graphs.

In order to obtain a good partition for query processing, we utilize the method in Section 5.2 to partition graph $G$ into $k$ blocks. Specifically, given a graph $G$, we randomly select $\Delta = |V(G)| \times 1\%$ vertices as seeds. Then, based on these seeds, we can estimate the edge weight $w(e)$. Finally, we employ the METIS algorithm to partition $G$.
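The choice of the block size can be reproduced with the estimate $IT(G) = \lceil |V|/|P| \rceil \times IT(P)$. The sketch below plugs in the IT column of Table 4 for ER graphs; using the IT column rather than IT-opt, and a graph of 200K vertices, are assumptions made for this illustration.

```python
import math


def estimated_index_time(num_vertices, block_size, block_time):
    """IT(G) = ceil(|V| / |P|) * IT(P), in seconds."""
    return math.ceil(num_vertices / block_size) * block_time


def best_block_size(num_vertices, block_time):
    """Return the block size with the smallest estimated IT(G)."""
    return min(
        block_time,
        key=lambda p: estimated_index_time(num_vertices, p, block_time[p]),
    )


# IT (s) per block size for ER graphs, copied from the IT column of Table 4.
er_block_time = {1000: 15, 2000: 23, 4000: 217, 6000: 623, 8000: 964, 10000: 5065}

best = best_block_size(200_000, er_block_time)
estimate = estimated_index_time(200_000, best, er_block_time[best])
print(best, estimate)
```

With these inputs the minimum falls at a block size of 2K, which agrees with the setting chosen for ER graphs, and the estimate of 2300 s is close to the 2233 s reported for 200K vertices in Table 11.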

Exp6. varying graph size ($|V|$) on large ER and SF graphs: In this experiment, we fix the density $|E|/|V| = 1.5$ and the label constraint size $|S| = 6$, and vary $|V|$ from 20K to 200K to study the performance. From Table 11, we know that the sampling-tree method cannot finish index building in a reasonable time (< 48 h) when $|V| > 20$K. Generally speaking, the index building time and the index size are linear in the graph size $|V|$ in the partition-based approach. A promising finding in Table 11 is that the query response time of the partition-based approach is less than 0.1 s. For example, given a graph with 200K vertices and 300K edges, the query response time of the partition-based approach is about 90 ms, as shown in Table 11. Although the bi-directional search method has good performance in Table 11, in which $|S| = 6$, it cannot work well when $|S|$ increases, as shown in Table 13. We can observe similar results in large SF graphs in Table 12.

Table 11
Performance vs. $|V|$ in ER graphs ($d = 1.5$).

| $\lvert V \rvert$ | Partition-based IT (s) | Partition-based IT-opt (s) | Partition-based IS (KB) | Partition-based QT (ms) | Sampling-tree IT (s) | Sampling-tree IS (KB) | Sampling-tree QT (ms) | Bi-directional search QT (ms) |
|---|---|---|---|---|---|---|---|---|
| 20K | 210 | 10 | 0.71 | 20.8 | 253,236 | 367.98 | 0.95 | 0.25 |
| 40K | 396 | 23 | 6.64 | 25.6 | F | F | F | 1.68 |
| 60K | 622 | 35 | 5.55 | 29.8 | F | F | F | 5.36 |
| 80K | 1018 | 60 | 6.14 | 32.1 | F | F | F | 16.89 |
| 100K | 1104 | 65 | 6.88 | 45.5 | F | F | F | 30.10 |
| 120K | 1281 | 72 | 7.81 | 51.6 | F | F | F | 40.68 |
| 140K | 1463 | 81 | 8.88 | 58.9 | F | F | F | 69.59 |
| 160K | 1686 | 85 | 9.83 | 65.4 | F | F | F | 106.80 |
| 180K | 1807 | 120 | 10.70 | 72.6 | F | F | F | 120.56 |
| 200K | 2233 | 178 | 11.80 | 89.9 | F | F | F | 160.58 |

Table 12
Performance vs. $|V|$ in SF graphs.

| $\lvert V \rvert$ | Partition-based IT (s) | Partition-based IT-opt (s) | Partition-based IS (KB) | Partition-based QT (ms) | Sampling-tree IT (s) | Sampling-tree IS (KB) | Sampling-tree QT (ms) | Bi-directional search QT (ms) |
|---|---|---|---|---|---|---|---|---|
| 20K | 3 | 2 | 1.61 | 1.95 | 123.90 | 48.73 | 0.25 | 0.16 |
| 40K | 10 | 7 | 2.15 | 2.89 | 356.80 | 120.50 | 0.31 | 0.96 |
| 60K | 21 | 10 | 3.25 | 3.20 | 835.70 | 378.90 | 0.35 | 2.68 |
| 80K | 50 | 30 | 4.36 | 3.58 | F | F | F | 4.89 |
| 100K | 66 | 41 | 5.47 | 4.84 | F | F | F | 8.45 |
| 120K | 81 | 52 | 6.87 | 5.80 | F | F | F | 10.59 |
| 140K | 171 | 63 | 8.07 | 6.50 | F | F | F | 25.56 |
| 160K | 187 | 75 | 9.27 | 9.56 | F | F | F | 30.43 |
| 180K | 211 | 81 | 10.5 | 10.89 | F | F | F | 40.56 |
| 200K | 233 | 109 | 16.8 | 12.96 | F | F | F | 50.98 |

Table 13
Performance vs. $|S|$ in large ER and SF graphs (QT in ms; $|V| = 200$K in all graphs, $d = 1.5$ in ER graphs).

| $\lvert S \rvert$ | Partition-based, ER | Partition-based, SF | Bi-directional search, ER | Bi-directional search, SF |
|---|---|---|---|---|
| 6 | 89.9 | 12.96 | 160.58 | 50.98 |
| 8 | 115.2 | 15.58 | 197.32 | 85.60 |
| 10 | 125.8 | 18.90 | 235.35 | 156.89 |
| 12 | 153.6 | 22.39 | 302.60 | 198.9 |
| 14 | 180.9 | 25.60 | 430.90 | 285.3 |
| 16 | 210.3 | 30.56 | 560.68 | 300.56 |

7.4. Performance of updates

Exp8. performance on updates: Table 14 lists the average time for inserting/deleting one edge in the four real datasets. Note that, in Large Yago and DBLP, we adopt the partition-based method; thus, we only need to update the local transitive closure of each partition. Table 14 shows that the average time to insert/delete one edge is less than 100 ms. Specifically, in order to evaluate the insertion performance, we randomly generate 100 edges to be inserted. Since there are four cases of inserted edges, each case is generated with probability 1/4 in our update workload. We report the average insertion time in Table 14. We use a similar setting for deletion: we randomly delete 100 edges, half of which are from the first case and the others from the second case.

Table 14
Evaluating index maintenance.

| Data set | Insertion (ms) | Deletion (ms) |
|---|---|---|
| Yeast | 35 | 12 |
| Small Yago | 56 | 37 |
| Large Yago | 63 | 83 |
| DBLP | 51 | 65 |

8. Conclusions

In this paper, we address label-constraint reachability (LCR) queries over large graphs. Theoretically, we propose several methods to optimize the computation of the path-label transitive closure. In order to address the scalability issue, we propose a partition-based approach based on graph partitioning. We prove that finding the optimal partition is NP-hard. Thus, we propose a sampling-based solution to find a good partition to speed up LCR queries. Last but not least, extensive experiments on both real and synthetic datasets confirm that our methods are faster than the existing solution by orders of magnitude in both offline and online processing.

Acknowledgments

Lei Zou's work was supported by NSFC under Grant 61370055. Dongyan Zhao was supported by NSFC under Grant 61272344 and China 863 Project under Grant no. 2012AA011101. Jeffrey Xu Yu's work was supported by Research Grants Council of the Hong Kong SAR, China under Grant no. 418512. Lei Chen's work was supported in part by the Hong Kong RGC GRF 611411, National Grand Fundamental Research 973 Program of China under Grant 2012-CB316200, Microsoft Research Asia Grant, Huawei Noah's Ark Lab project HWLB06-15C03212/13PN and Google Faculty Award 2013. Yanghua Xiao was supported by NSFC (No. 61003001, 61170006, 61171132, 61033010); Specialized Research Fund for the Doctoral Program of Higher Education No. 20100071120032; Shanghai Municipal Science and Technology Commission with Funding No. 13511505302; NSF of Jiangsu Province (No. BK2010280).

References

[1] S. Abiteboul, V. Vianu, Regular path queries with constraints, J. Comput. Syst. Sci. 58 (3) (1999).

[2] R. Agrawal, A. Borgida, H.V. Jagadish, Efficient management of transitive relationships in large data and knowledge bases, in: SIGMOD Conference, 1989.

[3] K. Anyanwu, A.P. Sheth, ρ-queries: enabling querying for semantic associations on the semantic web, in: WWW, 2003.

[4] R. Bramandia, B. Choi, W.K. Ng, Incremental maintenance of 2-hop labeling of large graphs, IEEE Trans. Knowl. Data Eng. 22 (5) (2010).

[5] Y. Chen, Y. Chen, An efficient algorithm for answering graph reachability queries, in: ICDE, 2008, pp. 893–902.

[6] J. Cheng, J.X. Yu, On-line exact shortest distance query processing, in: EDBT, 2009, pp. 481–492.

[7] J. Cheng, J.X. Yu, X. Lin, H. Wang, P.S. Yu, Fast computing reachability labelings for large graphs with high compression rate, in: EDBT, 2008.

[8] E. Cohen, E. Halperin, H. Kaplan, U. Zwick, Reachability and distance queries via 2-hop labels, SIAM J. Comput. 32 (5) (2003).

[9] W. Fan, J. Li, S. Ma, N. Tang, Y. Wu, Adding regular expressions to graph reachability and pattern queries, in: ICDE, 2011.

[10] H.V. Jagadish, A compression technique to materialize transitive closure, ACM Trans. Database Syst. 15 (4) (1990).

[11] R. Jin, H. Hong, H. Wang, N. Ruan, Y. Xiang, Computing label-constraint reachability in graph databases, in: SIGMOD, 2010.

[12] R. Jin, N. Ruan, S. Dey, J.X. Yu, Scarab: scaling reachability computation on large graphs, in: SIGMOD Conference, 2012, pp. 169–180.

[13] R. Jin, Y. Xiang, N. Ruan, D. Fuhry, 3-hop: a high-compression indexing scheme for reachability query, in: SIGMOD, 2009.

[14] R. Jin, Y. Xiang, N. Ruan, H. Wang, Efficiently answering reachability queries on very large directed graphs, in: SIGMOD, 2008, pp. 595–608.

[15] G. Karypis, V. Kumar, Multilevel k-way partitioning scheme for irregular graphs, J. Parallel Distrib. Comput. 48 (1) (1998).

[16] S. Lu, F. Zhang, J. Chen, S.-H. Sze, Finding pathway structures in protein interaction networks, Algorithmica 48 (4) (2007).

[17] J.P. McGlothlin, L.R. Khan, RDFKB: efficient support for RDF inference queries and knowledge management, in: IDEAS, 2009.

[18] M. Rice, V.J. Tsotras, Graph indexing of road networks for shortest path queries with label restrictions, PVLDB 4 (2) (2010) 69–80.

[19] R. Albert, A.-L. Barabási, Statistical mechanics of complex networks, Rev. Mod. Phys. 74 (2002) 47–97.

[20] S. Trißl, U. Leser, Fast and practical indexing and querying of very large graphs, in: SIGMOD, 2007.

[21] H. Wang, H. He, J. Yang, P.S. Yu, J.X. Yu, Dual labeling: answering graph reachability queries in constant time, in: ICDE, 2006.

[22] J.X. Yu, Graph Reachability Queries: A Survey (Book Chapter), Kluwer Academic Publishers, Boston, Dordrecht, London, 2010.

[23] J. Zhu, Z. Nie, X. Liu, B. Zhang, J.-R. Wen, StatSnowball: a statistical approach to extracting entity relationships, in: WWW, 2009.