- Home
- Documents
- Top-K Subgraph Matching Query in A Large Subgraph Matching Query in A Large Graph ... we propose a novel sub-graph matching query in single-graph database, called Top-K Subgraph Matching Query.

Published on

04-May-2018View

212Download

0

Transcript

Top-K Subgraph Matching Query in A Large GraphLei Zou1 Lei Chen2 Yansheng Lu11Huazhong Univ.of Sci. & Tech., Wuhan, China, {zoulei,lys}@mail.hust.edu.cn2Hong Kong Univ.of Sci. & Tech., Hong Kong, China, leichen@cse.ust.hkAbstractRecently, due to its wide applications, subgraph search has at-tracted a lot of attention from database and data mining commu-nity. Sub-graph search is defined as follows: given a query graphQ, we report all data graphs containing Q in the database. How-ever, there is little work about sub-graph search in a single largegraph, which has been used in many applications, such as biolog-ical network and social network.In this paper, we address top-k sub-graph matching query prob-lem, which is defined as follows: given a query graph Q, we lo-cate top-k matchings of Q in a large data graph G according toa score function. The score function is defined as the sum of thepairwise similarity between a vertex in Q and its matching vertexin G. Specifically, we first design a balanced tree (that is G-Tree)to index the large data graph. Then, based on G-Tree, we proposean efficient query algorithm (that is Ranked Matching algorithm).Our extensive experiment results show that, due to efficiency ofpruning strategy, given a query with up to 20 vertices, we canlocate the top-100 matchings in less than 10 seconds in a largedata graph with 100K vertices. Furthermore, our approach out-performs the alternative method by orders of magnitude.1. IntroductionGraph is an important data structure in computer science, whichcan model objects as vertices and their pairwise relationships asedges. Graph database finds many applications in chem-informatics[22], or pattern recognition [4] and complex network [24]. Basi-cally, graph databases can be divided into two categories [13]:First, graph-transaction setting. There is a large set of rel-atively small graphs (10K vertices), such as biological network [14],social network [21] and WWW [1]. There are many interestingproblems in this category, such as community detection [2] andfrequent sub-graph mining [13] and connectivity [20].Among of these graph-based problems, (sub)graph matching[8] is one of fundamental problems with many practical applica-tions, such as compound search [10] [22], pattern recognition [4].On the other hand, as we know, sub-graph isomorphism is NP-complete [8]. In fact, it is exact the intrinsical hardness and popu-lar applications that motivate people to design practical algorithmsfor graph matching problems, such as [19, 23, 11, 25, 9, 6, 3]1.1 MotivationIn this paper, we propose a novel sub-graph matching queryin single-graph database, called Top-K Subgraph Matching Query.Given a query graph Q, we report the top-k matching positions inthe large graph G with the k largest scores. Before the formaldefinition in Section 1.2, let us consider some motivating examplesas follows.Recently, in life science, due to rapid advances in experimentaldesigns, biologists can obtain a large amount of data describingbiological interactions at a genome scale. We can model theseinteraction networks as large graphs, such as protein-protein in-teraction networks (PPI) [17], metabolic networks [14] and so on.Vertices and edges are used to represent biological entities andinteractions between them. As a motivation example for our prob-lem, let us consider the following scenario: A scientist working onan unwell-studied species has constructed a small portion of a pro-tein pathway complex Q based on analysis of various experimentaldata. The scientist is interested in predicting more biological ac-tivity about the unwell-studied species. This task can be expressedas locating some similar matchings of Q in a large PPI networkG about a well-studied species. For each matching, we can com-pute its scores. We only report top-k matchings. The score of amatching is defined as the sum of the pairwise similarity betweena vertex (this is a protein) in Q and its matching vertex in G. Wecan evaluate the similarity between two vertices in terms of theprotein sequence similarity [18]. Most existing approaches alwaysassume that the query pathway complex is a path or tree, or relaxmatching definitions (not sub-graph isomorphism, only consider-ing pairwise vertex distances)[24]. Furthermore, it is difficult forthese methods to work on a very large data graph.Actually, top-k sub-graph matching query problem has manyother applications, such as social network. Assume that an officialTable 1. Meanings of Symbols UsedDia(Q) The Diameter of Query QSim(u, v) The similarity between u and vScore(X) The matching score of XS(N) the node area of NES(N, d) the d-extension node area of Nsecurity department has built a large social network. Each ver-tex corresponds to a dangerous individual, and each edge denotesthe interaction between two corresponding individuals, as shownin Figure 1. Some personal characters about a danger individualare recorded, such as appearance, accent and so on. In order tofind a terrorist group, through a long time detection, polices haveobtained some personal characters about 4 terrorists (such as ap-pearance and so on), and the interactions among them. Then, theydraw a query graph Q in Figure 1. Q is not always a completegraph, since the interactions among members in the terrorist groupare always well pre-defined. Some members have no interactionwith others, even though they are in the same terrorist group. InFigure 1, the vertex color is used to denote the personal character.Similarity between personal characters is denoted by the similar-ity between the vertex colors. In order to locate the terrorist groupin the network, we want to find 4 matching vertices (individuals)in G, which are similar with vertices in Q respectively. Further-more, the interaction among them are the same with that in Q.Though vertex 1 in Figure 1 is similar with one query vertex, itis eliminated due to no interactions with other suspects. The ver-tices (5, 9, 12, 14) is a matching of Q in G. There may exist manymatchings of Q in the network. Polices should pay more attentionto top-k good matchings.Figure 1. Terrorist Network1.2 Problem DefinitionFirst, we briefly review the terminologies that we will use inthis paper. Table 1 lists commonly used symbols in this paper.DEFINITION 1.1. Graph. A graph G is defined as G = (V, E),where V is the set of vertices, and E is a set of vertex pairs.In this paper, we use graphs to model a large network, suchas protein-protein interaction network (PPI), social network andWWW. Notice that, different from graph-transaction databases,vertices have no category labels. They correspond to different ob-jects. For example, in PPI, each vertex corresponds to a proteinand the edge corresponds to the interaction between two proteins.In social network, we always use a vertex to denote an individualand an edge to denote the interaction between each other.DEFINITION 1.2. Graph Isomorphism. Assume that we havetwo graphs Q < V1, E1 > and G < V2, E2 >. Graph Q isisomorphism to graph G, iff 1 there exists at least one bijective1iff: if and only iffunction f : V1 V2 such that for any edge uv E1, there is anedge f(u)f(v) E2.DEFINITION 1.3. Sub-Graph Isomorphism. Assume that wehave two graphs Q < V1, E1 > and G < V2, E2 >. If thereexists at least one sub-graph X in graph G, graph Q is isomor-phism to X under the bijective function f , graph Q is sub-graphisomorphism to graph G.As mentioned above, a vertex in a graph corresponds to an ob-ject. For a vertex u in query Q and a vertex v in data graph G,we use Sim(u, v) to evaluate the similarity between two objectsassociated with vertices u and v. Sim(u, v) depends on differ-ent applications. In the PPI example of Section 1.1, Sim(u, v) isdefined as the similarity between two corresponding proteins. Interrorist network example, Sim(u, v) is defined as the similaritybetween two corresponding individual characters.How to evaluate Sim(u, v) is not focus of this paper, whichdepends on different applications. In our problem, Sim(u, v) forany pair (u, v) is also the input of the algorithm, where u and vare two vertices in query Q and data graph G respectively. Givena vertex u in query Q, a vertex v in the graph G can be a matchingvertex to u iff Sim(u, v) , where is user specified parameter.Therefore, we have the following definition.DEFINITION 1.4. Sub-Graph Matching. Assume that we havetwo graphs Q < V1, E1 > and G < V2, E2 >. Q is isomorphismto X under some bijective function f , where X is a sub-graphof G. We call X a sub-graph matching (matching for short) ofQ in graph G, iff, for any vertex u in Q, Sim(u, f(u)) .Furthermore, the vertex f(u) in G is called the matching vertexw.r.t 2 vertex u in query Q.It is clear that, given a query graph Q, there may exist many match-ings of Q in the large data graph. For each matching X , we definethe matching score Score(X) as follows:DEFINITION 1.5. Matching Score. Given a query graph Qand a large data graph G, X is a matching of Q in graph G. Thescore of X is defined asScore(X) =|V (Q)|i=1 Sim(vi, f(vi)) (1), where |V (Q)| is the number of vertices in Q.We would like to point out that matching score function alsodepends on different applications. In fact, our algorithm proposedin this paper holds for any monotone aggregate matching scorefunction. For the clear presentation, we use sum function as Score(X)in this paper. Our top-k subgraph matching query is defined as fol-lows:DEFINITION 1.6. (Problem Definition) Top-K Subgraph Match-ing Query (top-k matching for short). Given a query graph Q, alarge data graph G and Sim(u, v) for any vertex pair (u, v) (uand v are two vertices in Q and G respectively), we want to locatek matchings Xi in G, i=1...k, whose matching scores are the firstk largest.Furthermore, in our problem, |V (Q)| < log|V (G)|, namely, queryQ is a small graph compared with the data graph G.Example 1 (Running Example). Given a query graph Q and alarge data graph G in Figure 2, we want to locate Top-2 match-ings. The vertex similarity threshold is = 0.1. In Figure 2, weonly report all vertex pairs whose similarities are no less than .2w.r.t: with regard toAB CQuery QA 1 0.1Similarity:12 341715 1614138 10912115 67Graph GA 9 0.6A 5 0.5A 16 0.7A 15 0.8B 3 0.3B 10 0.8B 7 0.7B 17 0.6C 2 0.3C 12 0.8C 11 0.7C 14 0.6B 4 0.1A 8 0.1VertexID Adjacent Lists1 2, 3, 172 1, 3, 4, 83 1, 2, 44 2, 3, 85 10, 6, 76 5, 7, 147 5, 68 2, 4, 9, 119 8, 11, 12, 1010 9, 12, 511 8, 9, 1312 9, 10, 1313 11, 1214 15, 16, 615 14, 1716 14, 1717 1, 15, 14, 16Graph File ( adjacency file)C 16 0.8Figure 2. Running ExampleNotice that, each vertex in query Q and data graph G corre-sponds to one object. A and 1 in Figure 2 are vertex ID, notvertex labels.1.3 Our ApproachesIn this problem, we have to face sub-graph isomorphism duringquery processing, known to be NP-complete. Obviously, for anymatching X of Q in the data graph G, X is located into a local areawith diameter Dia(Q) in G, where Dia(Q) is the diameter ofquery Q. We can check the local area around each vertex in G bysub-graph isomorphism until we find all matchings of Q in G. Inorder to reduce the number of sub-graph isomorphism checking,we can apply feature-based pruning strategy. If a feature (that issub-structure) of a query Q does not exist in some local area of G,the local area cannot contain any matching of Q, namely, the lo-cal area can be pruned safely. Feature-based pruning is a popularmethod in sub-graph search problem [19, 23, 25, 3]. However, inour problem, given a query Q, the matchings of Q exist in mostlocal areas of the data graph G, especially when |Q| is small. Itmeans that we can prune few local areas in G. The underlying rea-son is that all vertices in our problem are non-labeled. Therefore,feature-based pruning cannot work well in the problem.In this paper, we adopt another kind of pruning strategy, thatis score-upper bound pruning: for a local area in the data graphG, if there exist no matching whose score can be in top-k match-ing scores, the local area can be filtered out safely. The efficiencyof score-upper bound pruning is derived from clusters in the largegraph. In many graph clusters, matching scores are small. Since,in these clusters, there exit few vertices having similar charactersto that in query Q. In summary, we made the following contribu-tions in this paper:1. We propose a novel sub-graph search problem in a singlelarge graph, which locates and reports top-k matchings.2. We design a height-balanced tree, G-Tree, to index a largegraph.3. Based on G-Tree index, to answer top-k matching query, wepropose Ranked Matching algorithm.4. Extensive experiments on two real data sets evaluate the ef-ficiency of our methods.The remainder of the paper is organized as follows: Relatedwork is discussed in Section 2. We design G-tree index and pro-pose novel Ranked Matching algorithms in Section 3. We evaluateour method in the extensive experiments in Section 4. Finally, Weconclude the paper in Section 5.2. Related WorkRecently, there is increasing research interest in graph databases.Here, we make a brief review of the related work, which canbe categorized into three groups: 1) sub-graph search in graph-transaction database; 2) pathway search in the biological network;3) graph cluster and community detection.Sub-graph search in graph-transaction database. Given a querygraph Q, we need to report all data graphs containing Q in thedatabase [19, 23, 11, 25, 9, 6, 3]. The popular pruning strategyin sub-graph search problem is feature-based pruning: If a feature(that is sub-structure) of query Q does not exits in some data graphGi, Gi can be filtered out safely. The straightforward method ofextending the strategy to our problem is that, if a local area of thelarge graph G does not consist a feature of query, the local area canbe pruned. However, it has no good pruning power. The underly-ing reason is that all vertices in our problem has no categoricallabels.Pathway search in the biological network. Given a pathwaycomplex Q, we want to find a sub-structure X in the biologi-cal network that is most similar to Q. In the recent work [24],authors propose GraphMatch algorithm. The matching defi-nition in [24] replaces sub-graph isomorphism by pair-wise ver-tex distance, which is different from ours. In experiments, wechoose GraphMatch as an alternative method for comparison.GraphMatch can work well in biological network (about 5K ver-tices), but it can not work on a very large graph (100K vertices).Graph cluster and community detection The goal of communitydetection is to cluster the vertices in the large graph into groupsthat share common characteristics. Many graph partition and clus-tering algorithms have been proposed in the literature, such asspectral clustering [15], METIS [12] and so on. In fact, the prun-ing effect of our method benefits from community and graphclusters, which is discussed in Section 3.2.3. Ranked Matching AlgorithmIn this section, we design G-Tree to index a large data graphin Section 3.1. Then, based on G-Tree, we propose RM algorithmfor top-k matching query problem in Section 3.2. Advanced RMalgorithm is proposed in Section 3.3.3.1 G-Tree StructureMany real large graphs alway display the hierarchical topol-ogy [16], such as PPI, social network and WWW. In these largegraphs, there exist many clusters. The vertices in a cluster arehighly connected, and they have only a few or no neighbors out-side this cluster. In PPI, these clusters always correspond to dif-ferent functional groups. In friendship network, they denote thecommunities with shared interests. Iteratively, small clusters arealso densely interconnected, which form larger clusters. Inspiredby the hierarchical topology in the real large graphs, we propose aG-Tree index as follows.DEFINITION 3.1. G-Tree. Given a large data graph G, G-Treeindex is a height balanced tree, where:1. Each node N in G-Tree corresponds to a set of vertices ingraph G. The set of vertices is called Node Area w.r.t nodeN , denoted as S(N). In the same level of G-Tree, all nodeareas form a graph Gs partition.2. Given an intermediate node N and one of its children M inG-Tree, S(M) is a subset of S(N).3. Each node has at least m children unless it is root, m >2.Each node has at most M children and (M+1)2 m.In this paper, for presentation convenience, we preserve nodefor G-Tree and vertex for the data graph and query graph. TheG-tree index for the running example is illustrated in Figure 3.Tree Construction The straightforward approach to build G-Treeis to perform insertion sequentially. Here, we discuss G-Tree con-struction in up-bottom fashion. We can use graph partition algo-rithm, such as METIS [12], on the data graph to obtain nodes inthe first level. Iteratively, for each node, we obtain its child nodesby graph partition. We terminate the above process when each leafnode area can be stored in a disk page.Vertexes: 1 ~ 17Vertexes: 1 ~ 4, 8 ~ 13 Vertexes: 5~7 , 14~17 Vertexes: 1 ~ 4 Vertexes: 8 ~ 13 Vertexes: 5 ~ 7 Vertexes: 14 ~ 17Root RN1 N2N3 N4 N5 N6Figure 3. G-TreeInsertion and Split To insert a vertex v, an insertion operationbegins at the root of G-Tree and iteratively chooses a child nodeuntil it reaches a leaf node. The main challenge of insertion isthe criterion for choosing a child node. In order to handle theproblem, we propose the Cluster Coefficient in Definition 3.2. Wealways finds a child node Ni that maximizes the cluster coefficientCluster(v Ni) among all Ni.DEFINITION 3.2. Cluster Coefficient.Cluster(N) =|AllEdges(N)| |OutEdges(N)||AllEdges(N)|, where |OutEdges(N)| is the number of out edges linking ver-tices in S(N) to vertices out of S(N). |AllEdges(N)| is thenumber of all edges adjacent to vertices in S(N).Analogous to other balanced tree, such as R-tree, for a nodeN , we need to split N into two nodes if the number of childrenis larger than M , where M is maximal fanout. We perform graphpartition on N in order to obtain new nodes N1 and N2. Splittingmay cause the parent node to split as well and this procedure mayrepeat all the way up to the root.Deletion and Merge To delete a vertex v from the graph G, wefind the leaf node in G-Tree where v is stored. For any node Nalong the path, we delete v from node area S(N). After deletion,if node N has less than m children, where m is minimal fanoutof G-Tree, we merge N into another Ni in the same level. Thecriterion for choosing Ni is to maximize Cluster(NiN) amongall Ni. The procedure may propagate up to the root.3.2 RM AlgorithmThe large search space is the challenge to the top-k matchingquery problem. According to analysis in Section 1.3, we adopt thescore upper-bound pruning strategy in this paper. First, we discussthe topological relationships between a matching and a node area(defined in Definition 3.1) in G-Tree as follows.DEFINITION 3.3. Contain, Overlap and Outside. Given a queryQ and a large graph G, a subgraph X of G is a matching of Q. N isa node of G-Tree. We use S(X) to denote the set of vertices in X.The topological relationship between the matching X and the nodearea S(N) are:1) Contain. S(X) S(N);2) Overlap. S(X) S(N) 6= AND S(X) * S(N);2) Outside. S(X) S(N) = .138 1091211AB CQuery QAB CQuery Q138 1091211AB CQuery QN4 N4(1) Contain (2) Overlap (3) Outside12 34138 1091211N424Figure 4. The Relationship Between a NodeArea and a MatchingFigure 4 shows the examples about contain, overlap and outside.DEFINITION 3.4. Boundary Vertex. Given a large graph Gand a node N in G-tree for graph G, for a vertex v in node areaS(N), if there exists at least one neighbor vertex that is not inS(N), v is a Boundary Vertex w.r.t S(N).DEFINITION 3.5. d-Extension Node Area. Given a node Nin G-Tree, a vertex v is a boundary vertex w.r.t S(N). N(v, d) ={vj |the distance between vj and v is no larger than d}. Weassume that there are m boundary vertices vi in S(N). The d-Extension Node Area is denoted as ES(N, d) = (i=mi=1 N(vi, d))S(N).12 34178N31-Extension N3Figure 5. d-Extension Node AreaFigure 5 shows 1-Extension N3 (that is ES(N3, 1)) in the runningexample.THEOREM 3.1. Given a query graph Q and a large graph G,X is a matching of Q in G. For a node area S(N), if X is con-tained in S(N) or overlapped with S(N), X must be containedin ES(N, Dia(Q)), where Dia(Q) is the Qs diameter.Given a query graph Q with n vertices and Extension node areaES(N, Dia(Q)), X is a matching contained in ES(N, Dia(Q)).To evaluate the upper bound of Score(X), we work as follows.For each vertex ui in query graph Q, we find all candidate match-ing vertices vij in ES(N), where j = 1...m. Then we defineSim(ui, ES(N, Dia(Q))) = Max(Sim(ui, vij)). Accordingto Definition 1.4, if Sim(ui, ES(N, Dia(Q))) < , it is clear toknow that there exists no matching X contained in ES(N, Dia(Q)).It means that the node N in G-Tree can be pruned safely. Other-wise, we use the following theorem to evaluate the upper bound ofScore(X).THEOREM 3.2. Given a query graph Q with n vertices ui, X isa matching of Q contained in or overlapped with node area S(N).We define UpScore(Q, N) =i=ni=1 Sim(ui, ES(N, |Dia(Q)|)).UpScore(Q, N) is the upper bound of Score(X).Given a query Q in Figure 2 and a node N3 of G-Tree in Fig-ure 3, UpScore (Q, N3)= Sim(A, ES(N3, Dia(Q)))+ Sim(B,ES(N3, Dia(Q)))+ Sim(C, ES(N3, Dia(Q))) =0.1 + 0.6 +0.3 = 1.0.The pseudo-code for RM algorithm is given in Algorithm 1. Allchild nodes Ni of root are inserted into the max-heap H in non-ascending order of UpScore(Q, Ni) (Line 1). Initially, we set = (Line 2). If the heap head N is a leaf node, we removeN from the heap and perform sub-graph isomorphism to find allmatchings of Q in ES(N, Dia(Q)). They are inserted into themax-heap R in non-ascending of their matching scores. We update to be the k-th largest matching score in R (Lines 7-10). If theheap head N is non-leaf node, we remove N from the heap andinsert all its child nodes Ci into the heap in non-ascending order ofUpScore(Q, Ci) (Lines 12-13). RM algorithm terminates when is not larger than UpScore(Q, N), where N is the heap Hshead. (Line 5)Action Heap contents RAccess rootExpand N2matching 2 matching 3 matching 1 Matching ID Vertex ID listsCheck N6 Expand N4Check N4Figure 6. Heap ContentIn the running example, we first access the root and insert allits child nodes into the max-heap H , as shown in Figure 6. SinceUpSocre(Q, N2) = 2.4 > UpSocre(Q, N1) = 2.2, N2 is be-fore N1 in the heap. N2 is heap head, therefore, we expand N2.N5 and N6 are children of N2, which are inserted into the heap.N6 is the next expanded node. Since N6 is a leaf node, we re-move N6 from the heap and perform sub-graph isomorphism tofind all matchings of Q in ES(N6, Dia(Q)). We find matching1 in ES(N6, Dia(Q)) with score 2.0, which is inserted into theresult set R. Now, N1 is heap node. N1s children, that are N3and N4, are inserted into the heap. Next, since N4 is a leaf node,we remove N4 from the heap and perform sub-graph isomorphismto find all matchings in ES(N4, Dia(Q)), that are matching 2and matching 3. We keep the threshold to be the k-th largestmatching score by now (k=2 in the running example). Now, =2.0 > UpSocre(Q, N5) = 1.9, where N5 is the heap head. Itindicates that all left nodes can be pruned safely. We report top-2 matchings in the heap R (that are matching 2 and 1) and thealgorithm terminates.THEOREM 3.3. (Algorithm Correctness) RM algorithm can re-port the correct top-K matchings.Algorithm 1 Ranked Matching (RM for short)Require: Input: query Q, a large graph G and G-Tree index T .Output: Top-k matchings of Q in G1: Insert all child nodes Ni of T s root into the max-heap H accordingto UpScore(Q, ES(Ni, Dia(Q))).2: Set = .3: while |H| > 0 do4: the head of heap H is node N .5: if > UpScore(Q, N) then6: Break7: if N is a leaf node then8: Remove N from the heap H and find all matchings of Q inES(N, Dia(Q)) by sub-graph isomorphism.9: The matchings are inserted into the max-heap R in descendingorder of their matching scores.10: Update to be the k-th largest matching score.11: else12: Remove N from the heap H13: For all child nodes Ci of N , they are inserted into max-heap Hin descending order of UpScore(Ci)14: Report the top-k matchings in C.Proof.(sketch) It is straightforward to know that, if we assess ev-ery leaf node by sub-graph isomorphism, we can find the correcttop-k matchings. If the algorithm terminates when |H| = 0 (Line3), it means that each leaf node has been checked. Therefore, thealgorithm can report the correct top-k answers. If the algorithmterminates when > UpScore(Q, N) (Line 6), it means that,for un-assessed leaf nodes, there exits no matching whose score islarger than ( is the k-th largest matching score by far). It in-dicates that we have found the correct top-k matchings. Thereforewe prove the correctness of RM algorithm. 2THEOREM 3.4. Given a top-k query Q, for any leaf node L,the sufficient and necessary condition for performing sub-graphisomorphism between Q and ES(L, Dia(Q)) is that UpScore(Q, L) , where is the k-th largest matching score in the final re-sults.Proof.(sketch) 1) sufficient condition. If there exists a leaf nodeL that has not been checked by sub-graph isomorphism so far,where UpScore(Q, L) , there must exist L or its ancestorin the heap H according to RM algorithm. It is clear to know ( is the k-th largest matching score by far). SinceUpScore(Q, L) , UpScore(Q, L) . It means that thealgorithm cannot terminate, according to Line 5. Therefore, forany leaf node L, if UpScore(Q, L) , it must be checked bysub-graph isomorphism before algorithm termination.2) necessary condition. Assume that there exits a leaf node L,where UpScore(Q, L) < , and L needs to be checked bysub-graph isomorphism before algorithm termination. Since Lneeds to be checked, it means that L is the recent heap head andUpScore(Q, L) > . Since UpScore(Q, L) < and L isthe recent heap head, it means that we cannot find any matchingwith score in un-accessed leaf nodes. It also indicates thatthe final k-th largest matching score < , which leads to con-tradiction ( is the final k-th largest matching score). Therefore,there exits no leaf node L, where UpScore(Q, L) < , and Lneeds to be checked by sub-graph isomorphism.In conclusion, we prove the correctness of Theorem 3.4.2Pruning Effect Discussion. To evaluate the pruning effect ofRM algorithm, we define the assess ratio to be the ratio of thenumber of assessed leaf nodes to the total number of leaf nodes inG-Tree. Theorem 3.4 shows the underlying reason that RM algo-rithm can obtain good pruning power. In fact, each leaf node corre-sponds to a cluster in the large data graph. Different clusters (leafnodes) have different characteristics. For example, in PPI network,these clusters correspond to different functional groups. There-fore, given a query Q, for most leaf nodes L, UpScore(Q, L) issmall, since Q and L have no common characteristics. Accordingto Theorem 3.4, these leaf nodes will not be assessed due to smallUpScore(Q, L). In experiments, we will evaluate our method bytwo important real Datasets: S.cerevisiae and DBLP dataset. Ex-tensive results confirm our intuition about pruning effect.However, when Dia(Q) is large, we always obtain a large ex-tension node area ES(N, Dia(Q)) during query processing. Itmeans that we will spend more time to perform sub-graph isomor-phism. In order to handel the problem, we propose the advancedRM algorithm in the following section.3.3 Advanced RM algorithmThe pseudo-code of Advanced RM algorithm is shown in Al-gorithm 3.3. We partition the query Q into m parts Qi (Line 1),and Max(Dia(Qi)) < (we discuss setting later). For eachquery part Qi, we maintain a max-heap Ri to store all matchingsof Qi that have been founded by now. The algorithm is iteratedfrom n=1 (Line 712). In each iteration step, for each Qi, wecall function Sub RM(Qi, n, Hi, Ri) to return the correct top-n matchings of Qi in the max-heap Ri (Line 9 10)(Sub RMfunction is analogous to RM algorithm). Then, for the first nmatchings of Qi in each Ri, we check whether there exist somematchings that can be assembled into a matching of Q. Theseassembled matchings of Q are inserted into a max-heap R (Line11). We set to be the k-th largest matching score of Q in Rby now, and set i to be the n-th largest matching score of Qi inRi by now (Line 12). If >i=mi=1 i, the iteration terminates(Line 7). The algorithm report the top-k matchings in R (Line13). The algorithm correctness can derived from the correctnessof RM and TA algorithm [7]. Due to space limited, we omit theproof. Notice that re-calling function Sub RM at the nth iterationonly needs to compute the nth largest matching of Qi, and it doesnot re-compute the first (n-1) largest matchings of Qi.Algorithm 2 Advanced Ranked Matching (ARM for short)Require: Input: query Q, a large graph G and G-Tree index T .Output: Top-k matching of Q in G.1: Partition Q into m parts Q1...Qm, and each Dia(Qi) < .2: set n=03: Set and all i to be -.4: Set max-heaps R and Ri to be .5: for each Qi do6: Insert all child nodes Ni of Ts root into the max-heap Hi accordingto UpScore(Qi, ES(Ni, Dia(Qi))).7: while 1000 [5]. Our experiment study also confirms that. Therefore, wemaximize when Max(|ES(N, Dia(Qi))|) < 1000.Assembling matchings of Qi. Assume that a query Q is parti-tioned into m parts Qi, i=1...m. The edges connecting these Qiare called cut edges. In Line 11 of Advanced RM algorithm, if1) for each max-heap Ri, there is a matching Xi of Qi, and 2) foreach cut edge in Q, there exists a edge connecting correspondingXi in the data graph G, and 3) Xi Xj = , i 6= j, then thesematching Xi can be assembled into a matching X of Q.Sub RM(Qi, n, Hi, Ri)1: while |Hi| > 0 do2: the head of heap Hi is node Ni.3: set i to be the n-th largest matching score of Qi in Ri bynow.4: if i > UpScore(Qi, Ni) then5: Break6: if Ni is a leaf node then7: Remove Ni from the heap Hi and find all matchings ofQi in ES(Ni, Dia(Qi)) by sub-graph isomorphism.8: The matchings are inserted into the max-heap Ri in de-scending order of their matching scores.9: else10: Remove Ni from the heap Hi11: For all child nodes Ci of Ni, they are inserted into max-heap H in descending order of UpScore(Ci)4. ExperimentsIn this section, we report our experiment results to evaluate ourmethods in two real data sets. All experiments are implementedby standard C++ with STL library support and compiled by VisualStudio 2003 in a P4 1.7GHz machine of 1G RAM running Win-dows XP. To our best knowledge, there is no existing algorithm ontop-k matching problem proposed in this paper. The most relatedand recent work is GraphMatch [24]. In fact, there are somedifferences between each other. First, GraphMatch relaxes sub-graph isomorphism matching by only considering pair-wise ver-tex distance. Second, GraphMatch allows existing un-matchingvertices in query Q and indel vertices in the matching of Q in thedata graph. In experiments, to enable performance comparison,we set parameter indel=0 (not allowing indel vertices) and set abig penalty value for each un-matching vertex in Q (not allow-ing un-matching vertices in Q in top-k matchings). We downloadGraphMatch from http://faculty.cs.tamu.edu/shsze/graphmatch.Furthermore, we design a TA variation algorithm for top-k match-ing query problem, called TA-matching. Given a query Q, for eachvertex u in Q, we rank the vertices vj in data graph G in non-ascending order of Sim(ui, vj) to form the list Li. Similar withTA-algorithm [7], we scan all lists Li in parallel. For each en-countered vertex vj , we perform sub-graph isomorphism betweenQ and the local area around vj in the data graph G. The algo-rithm continues until all left vertices cannot form a top-k matching(analogy to TA algorithm). TA-matching has a significant limita-tion: estimation about the upper bound of the left matching scoresis not tight.Datasets We use two real datasets, that are DBLP and protein in-teraction network of S.cerevisiae.1) DBLP Dataset: We construct a co-author network G: everyauthor is denoted as a vertex in G; and the edge is introducedwhen the corresponding two authors have at least one co-authoredpaper. We consider about 100 important conferences in differ-ent areas to construct G. On the whole, there are about 100Kvertices and about 400K edges in G. We describe each vertex(a author) by a paper title string title, combining the titles ofall papers that he or she published. To generate a query Q, werandomly extract a sub-graph from G with the pre-defined vertexnumber. To evaluate the vertex similarity Sim(m, n) in Defini-tion 1.5, we define it by the similarity between paper title stringsof two corresponding authors. To be specific, for two vertices nand m in query Q and the co-author network G respectively, wedefine Sim(n, m) = |n.titlem.title||n.title| , where |n.titlem.title| isthe number of common words in n.title and m.title. Obviously,0 Sim(n, m) 1.2) S.cerevisiae Dataset: We download it from DIP 3 and representit by an undirected graph G in which each vertex represents a pro-tein and each edge represents interactions between two proteins.There are about 4934 vertices and 17346 edges in G. Similar with[24], for two vertices n and m in query Q and the data graph G,we define Sim(n, m) to be as the minus-log E-value between twocorresponding protein sequence. To generate a synthetic query,we randomly extract a sub-graph from G. Then, each vertex in thesub-graph is perturbed by pre-specified amount of point mutationsin its corresponding protein sequences.All experiment results in this section are the average results of100 queries.1K 2K 3K 4K 5K051015Data Graph SizeTotal Response Time (seconds)RMTAmatchingGraphMatch(a) Response Time in Top-100 Query,|Dia| = 5, |Q| = 10, = 0.0, inS.cerevisiae dataset20K 40K 60K 80K 100K100101102Data Graph SizeTotal Response Time (seconds)RMTAmatchingGraphMatch(b) Response Time in Top-100 Query,|Dia| = 5, |Q| = 20, = 0.5, in DBLPdatasetFigure 7. Scalability TestScalability Study We test scalability of our methods in Figure 7on both datasets, and compare them with GraphMatch. It is diffi-cult for GraphMatch to work when the data graph has more than60K vertices. Our methods have good scalability.Performance Study Since there are only about 5,000 vertices in3http://dip.doe-mbi.ucla.eduS.cerevisiae network, in order to test the performance of RM al-gorithm in a large graph, we only report the experiment results inDBLP dataset as follows. The default vertex cardinality of the datagraph G is 100K in the following experiments. We only evaluateRM algorithm and TA-matching algorithm proposed in the paper,since it is difficult for GraphMatch to work on the large graphwith 100K vertices.Iteratively, we build the G-Tree by running graph partition al-gorithm METIS [12] on the data graph G. We set the fanout ofG-tree to be 50. First, we fix the diameter of query Q to be 5 andthreshold =0.5 ( is defined in Definition 1.5), and vary the sizeof Q (that is number of vertices in Q) from 10 to 25. Total responsetime about top-100 queries is shown in Figure 8(a). We also eval-uate the pruning effects of G-Tree in Figure 8(b) by assess ratioin RM algorithm. Assess ratio is defined as the ratio of the num-ber of assessed leaf nodes in RM algorithm to the total number ofleaf nodes in G-Tree. RM algorithm only performs expensive sub-graph isomorphism when it assesses the leaf nodes. Therefore,less assess ratio indicates faster response time. RM algorithm out-performs TA-matching algorithm in Figure 8(a). Observed fromFigure 8(a), we find that the query response time in RM algorithmdecreases with increasing query size. The underlying reason isthat, when the query graph is large, G-tree provides more efficientpruning effect, as shown in Figure 8(b). Notice that, the trend ofresponse time for RM algorithm in Figure 8 happens when we fixDia(Q). If the query diameter increases with the query size, theresponse time will increase. We evaluate RM performance withregard to the query diameter as follows.10 15 20 250102030405060Query SizeTotal Response Time (seconds)RMTAmatching(a) Response Time in Top-100 Query,|Dia| = 5, = 0.510 15 20 2500.511.522.53Query SizeAssess Ratio (%)RM(b) Assess Ratio in RM algorithm in Top-100 Query, |Dia| = 5, = 0.5Figure 8. Varying Query SizeWe fix the query size to be 20 and =0.5 and vary the diam-eter of query from 2 to 10. Figure 9(a) reports the response timeand pruning effect of G-tree. The clear trend is that the responsetime increases with increasing query diameter. When RM algo-rithm assesses a leaf node N , it is clear to know that we obtain alarge extend node area ES(N, Dia(Q)) when Dia(Q) is large.It means that we will spend more time to perform sub-graph iso-morphism between Q and ES(N, Dia(Q)). There is the similarproblem in TA-matching algorithm.To address large query diameter problem, we propose the Ad-vanced RM algorithm in Section 3.3. We evaluate the AdvancedRM algorithm in Figure 10. We fix the query size to be 20 andvary the query diameter from 6 to 14. Results are reported in Fig-ure 10. Though, assess ratio of Advanced RM is larger than that inRM , Advanced RM algorithm improves the query performance(response time). The underlying reason is that we need less timeto perform sub-graph isomorphism in Advanced RM algorithm,since |ES(N, Dia(Qi))| is small. We partition the query Q intodifferent parts and Max(Dia(Qi)) < = 5.2 4 6 8 10020406080100Diameter SizeTotal Response Time (seconds)RMTAmatching(a) Response Time in Top-100 Query,|Q| = 20, = 0.52 4 6 8 1000.511.522.5Diameter SizeAssess Ratio (%)RM(b) Assess Ratio in RM algorithm in Top-100 Query, |Q| = 20, = 0.5Figure 9. Varying Diameter Size6 8 10 12 14050100150200Diameter SizeTotal Response Time (seconds)RMAdvanced RM(a) Response Time in Top-100 Query,|Q| = 20, = 0.56 8 10 12 14012345Diameter SizeAssess Ratio (%)RMAdvanced RM(b) Assess Ratio in Top-100 Query,|Q| = 20, = 0.5Figure 10. Evaluate Advanced RM algorithmAt last, we test RM algorithm with decreasing . We fix thequery size to be 20 and |Dia(Q)|=5. When =0.5, for each vertexin query Q, there are about 1,000 candidate matching vertices indata graph G. When =0.1, for each vertex in query Q, thereare more than 10,000 candidate matching vertices in data graphG. It is clear to know, the search space of sub-graph isomorphismwhen =0.1 is larger than that in = 0.5. Therefore, Figure11(a) shows that total response time increases with decreasing .The assess ratio of RM only depends on the final top-k answers.The sufficient and necessary condition that a leaf node needs to beperformed sub-graph isomorphism is discussed in Theorem 3.4.Usually, when k is small, decreasing does not affect the finaltop-k answers. Therefore, assess ratio keeps invariable.0.5 0.4 0.3 0.2 0.1050100150200Gamma ValueTotal Response Time (seconds)RMTAmatching(a) Response Time in Top-100 Query,|Q| = 20, |Dia| = 50.1 0.2 0.3 0.4 0.50.80.850.90.951Gamma ValueAssess Ratio (%)RM(b) Assess Ratio in RM algorithm in Top-100 Query, |Q| = 20, |Dia| = 5Figure 11. Vary the Threshold 5. ConclusionsIn this paper, we propose a novel sub-graph matching queryproblem in a very large data graph. It finds many applicationsin biological network and social network. To address the prob-lem, we design a balanced tree G-Tree to index the large graph.Adopting score-upper bound pruning strategy, based on G-Tree,we propose the RM algorithm to locate the top-k matchings ofquery Q in the large data graph. Extensive experiments shows thatour algorithm has good scalability and fast response time, whichoutperforms the alternative method by orders of magnitude.6. References[1] A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan,R. Stata, A. Tomkins, and J. L. Wiener. Graph structure in the web.Computer Networks, 33(1-6), 2000.[2] D. Cai, Z. Shao, X. He, X. Yan, and J. Han. Community miningfrom multi-relational networks. In PKDD, 2005.[3] J. Cheng, Y. Ke, W. Ng, and A. Lu. fg-index: Towardsverification-free query processing on graph databases. In SIGMOD,2007.[4] D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years ofgraph matching in pattern recognition. IJPRAI, 18(3), 2004.[5] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (sub)graphisomorphism algorithm for matching large graphs. IEEE Trans.Pattern Anal. Mach. Intell., 26, 2004.[6] J. H. D.W. Williams and W. Wang. Graph database indexing usingstructured graph decomposition. In ICDE, 2007.[7] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithmsfor middleware. In PODS, pages 102113, 2001.[8] S. Fortin. The graph isomorphism problem. Department ofComputing Science, University of Alberta, 1996.[9] P. Y. H. Jiang, H. Wang and S. Zhou. Gstring: A novel approach forefficient search in graph databases. In ICDE, 2007.[10] T. R. Hagadone. Molecular substructure similarity searching:Efficient retrieval in two-dimensional structure databases. J.Chem.Inf. Comput. Sci, 32, 1992.[11] H. He and A. K. Singh. Closure-tree: An index structure for graphqueries. In ICDE, 2006.[12] G. Karypis and V. Kumar. A fast and highly quality multilevelscheme for partitioning irregular graphs. SIAM Journal on ScientificComputing, 1995.[13] M. Kuramochi and G. Karypis. Finding frequent patterns in a largesparse graph. Data Mining and Knowledge Discovery, 2005.[14] V. Lacroix, C. G. Fernandes, and M.-F. Sagot. Motif search ingraphs: Application to metabolic networks. IEEE/ACM Trans.Comput. Biol. Bioinformatics, 3(4), 2006.[15] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysisand an algorithm. In NIPS, 2001.[16] E. Ravasz and A.-L. Barabasi. Hierarchical organization in complexnetworks. Physical Review E, 67, 2003.[17] J. Scott, T. Ideker, R. M. Karp, and R. Sharan. Efficient algorithmsfor detecting signaling pathways in protein interaction networks. InRECOMB, 2005.[18] A. SF, G. W, M. W, M. EW, and L. DJ. Basic local alignment searchtool. J Mol Biol., 215(3), 1990.[19] D. Shasha, J. T.-L. Wang, and R. Giugno. Algorithmics andapplications of tree and graph searching. In PODS, 2002.[20] S. Tril and U. Leser. Fast and practical indexing and querying ofvery large graphs. In SIGMOD Conference, pages 845856, 2007.[21] D. J. Watts. Six Degrees:The Science of a Connected Age. W. W.Norton & Company, 2004.[22] P. Willett. Chemical similarity searching. J. Chem. Inf. Comput. Sci,38(6), 1998.[23] X. Yan, P. S. Yu, and J. Han. Graph indexing: A frequentstructure-based approach. In SIGMOD, 2004.[24] Q. Yang and S. hoi Sze. Path matching and graph matching inbiological networks. J Comput Biol., 14(1), 2007.[25] S. Zhang, M. Hu, and J. Yang. Treepi: A novel graph indexingmethod. In ICDE, 2007.