10

Click here to load reader

Efficient Algorithms for Generalized Subgraph Query … · Efficient Algorithms for Generalized Subgraph Query Processing ... Graph Databases, Graph Matching Algorithm, ... and

Embed Size (px)

Citation preview

Page 1: Efficient Algorithms for Generalized Subgraph Query … · Efficient Algorithms for Generalized Subgraph Query Processing ... Graph Databases, Graph Matching Algorithm, ... and

Efficient Algorithms for Generalized Subgraph QueryProcessing

Wenqing Lin, Xiaokui Xiao, James Cheng, Sourav S. BhowmickSchool of Computer Engineering, Nanyang Technological University, Singapore

{wlin1, xkxiao, assourav}@ntu.edu.sg [email protected]

ABSTRACTWe study a new type of graph queries, which injectivelymaps its edges to paths of the graphs in a given database,where the length of each path is constrained by a giventhreshold specified by the weight of the correspondingmatching edge. We give important applications of the newgraph query and identify new challenges of processing sucha query. Then, we devise the cost model of the branch-and-bound algorithm framework for processing the graphquery, and propose an efficient algorithm to minimize thecost overhead. We also develop three indexing techniquesto efficiently answer the queries online. Finally, we verifythe efficiency of our proposed indexes with extensive exper-iments on large real and synthetic datasets.

Categories and Subject DescriptorsH.2.4 [Database Management]: System—Query process-ing

General TermsAlgorithms, Experimentation, Performance

KeywordsGraph Databases, Graph Matching Algorithm, Graph In-dexing, Graph Querying

1. INTRODUCTIONGraph is a powerful data model that can naturally repre-

sent various entities and their relationships. Graph datais ubiquitous today and being able to query such graphdata is beneficial to many applications. For example, inbio-informatics and chemical informatics, graphs can modelcompounds and proteins, and graph queries can be usedfor screening, drug design, motif discovery in protein struc-tures, and protein interaction analysis. In computer vision,graphs represent organization of entities in images and graphqueries can be used to identify objects and scenes. In het-erogeneous web-based data sources and e-commerce sites,

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.CIKM’12, October 29–November 2, 2012, Maui, HI, USA.Copyright 2012 ACM 978-1-4503-1156-4/12/10 ...$15.00.

graphs model schemas and graph matching can be applied tosolve problems of schema matching and integration. Thereare also many other applications, such as program flows,software and data engineering, taxonomies, etc., where datais modeled as graphs and it is essential to search and querythe graph data.

Existing research focuses on mainly two types of graphdatasets, one consisting of a single large graph (e.g., an on-line social network or the entire citation graph in a certaindomain) and the other consisting of a large set of small ormedium-sized graphs. We focus on the later, which are alsovery popular in real life (e.g., most of the examples we listedearlier belong to this type).

To query a graph database, G, that consists of many smallgraphs, there are three types of queries commonly studiedin the literature. Let q be a query graph. The first one issubgraph query [4,9,13,15,18,21,23], which finds the subsetof graphs A of G such that q is a subgraph of any graph inA. The second one is supergraph query [2, 3, 14, 22], whichfinds the subset of graphs A of G such that q is a supergraphof any graph in A. The third one is similarity query [12,19,20, 24], which finds the subset of graphs A of G such thatq is a similar graph of any graph in A according to a givensimilarity measure.

The three types of queries are useful in different appli-cations. However, both subgraph queries and supergraphqueries are too rigid and therefore similarity queries areproposed as an alternative. Existing similarity queries aremostly measured by the edit distance [20, 24] or maximumcommon subgraph [12,19], which is reasonable for some ap-plications but often fails to capture meaningful patterns ortargets in applications where critical objects or entities mayhave to be matched or they may be within a distance fromeach other that is beyond the specified similarity distance(i.e., the similarity threshold). We show such an applica-tion, where both exact queries and similarity queries arenot applicable, by the following example.

Example 1. Consider a drug design system, which sup-ports the inventive process of finding new medications basedon the knowledge of the biological target. Figure 1 showssome compounds in the database, i.e., g1, g2, and g3. Acompound can be naturally modeled as a graph, where atomsare vertices, the chemical name of the atom is the label ofthe correspond vertex, and the chemical bonds between anytwo atoms are modeled as edges in the graph. Among manydrug design methods, the pharmacophore model is the mostpopular one whose goal is to find the substructures that areclosely matched to the objective. ALADDIN [17] is a com-

Page 2: Efficient Algorithms for Generalized Subgraph Query … · Efficient Algorithms for Generalized Subgraph Query Processing ... Graph Databases, Graph Matching Algorithm, ... and

C

2

2N 3 H

O

q1

u3

u4

u1 u2

C

1

CC N

C

OO

O H

C C

C

C g1

CC C

O

NN

N

HCC

O HH

O

HHOO H

OH

g2

N 1 C

q2

u5 u61 O

u7

OC NH

O

Cv1

v7

v2 v3 v5

v4g3

OOC

v6

Figure 1: A drug database, G = {g1, g2, g3}, and twoquery graphs, q1 and q2

Table 1: A graph query in ALADDIN languagePOINT N ; POINT H ; POINT O ; POINT C ;DISTANCE (1, 2) 1 3 ; DISTANCE (1, 3) 1 2 ;DISTANCE (1, 4) 1 2 ; DISTANCE (3, 4) 1 1 ;

puter program for the design and recognition of compoundsthat meet geometric, steric, and sub-structural criteria. AL-ADDIN also uses a precise geometric description languageto define the properties of a designed molecule.

The query shown in Table 1 is written in the ALADDINlanguage, which is to find a graph pattern where

1. there are four atoms: N, H, O, and C, whose positionsare at 1, 2, 3, and 4, respectively;

2. the distance between N and H is 1 to 3, and similarly1 to 2 between N and O, 1 to 2 between N and C, andexactly 1 between O and C.

Since the distance between all pairs of atoms can be esti-mated [11], the distance can be further modeled as the num-ber of bonds that connect the atoms. Thus, the query inTable 1 can be converted to a graph query which finds allgraphs in the database such that

1. there exist four vertices u1 to u4 in the graph whoselabels are N, H, O, and C, respectively;

2. let P = 〈ui, . . . , uj〉 be a path that connects ui and uj

and |P | be the length of the path:

there exist paths P1 = 〈u1, . . . , u2〉, P2 = 〈u1, . . . , u3〉,P3 = 〈u1, . . . , u4〉, and P4 = 〈u3, . . . , u4〉, respectively,where 1 ≤ |P1| ≤ 3, 1 ≤ |P2| ≤ 2, 1 ≤ |P3| ≤ 2, and|P4| = 1.

Such a query can be naturally represented as the querygraph1 q1 shown in Figure 1, and the answer to this queryis {g3}. In such a query, subgraph query cannot be applied,while similarity query is also not suitable when the matchingpaths are long.

In this paper, we study this new type of graph queries asdescribed in Example 1, which will be formally defined inSection 2. But intuitively, the new query is a generalizationof the subgraph query, which generalizes exact edge match-ing to path matching constrained by a path length; that is,instead of matching each edge as in a subgraph query, we finda path with two matching end vertices for each edge in the

1We assume that path length cannot be negative.

query graph, where the length of the matching path must bewithin the specified edge weight. Thus, the new query hasa much stronger expressive power than a subgraph query.

Such a query is also useful in many other applications.For example, in querying user online traversal graphs, onemay be only interested in whether users have visited certainimportant sites within a certain number of clicks, while anexact or a quality similar matching may not exist. In search-ing pictures in an image database, it is often rare to find anexact or even similar matching due to the huge amount of ir-relevant information in the background (note that similaritymeasure by edit distance or maximum common subgraph of-ten counts all such irrelevant information in the matching);in this case, we can specify a few features to be focused inthe matching while relaxing the links between the featuresby some reasonable edge weight.

Processing the new query, however, is significantly morechallenging. For both subgraph and supergraph query pro-cessing, it involves subgraph isomorphism which is NP-hard.The relaxation in the new query from exact edge matchingto approximate path matching essentially further explodesthe already exponential search space. Existing pruning tech-niques cannot be directly applied or they are simply not ad-equate, since our generalized query graph is different fromthe indexed features. Therefore, this paper proposes neweffective pruning techniques and efficient data structures tosolve this challenging problem.

Our contributions. The contributions of this paperare four-fold. First, we propose the problem of generalizedsubgraph query processing, which is useful in applicationswhere subgraph queries are too restrictive to apply whilesimilarity queries may return low quality answers due tolarge edit distance arisen from abundant irrelevant informa-tion. Second, we devise a fast algorithm for generalized sub-graph matching, which is a significantly more complicatedmatching problem than subgraph isomorphism. Third, wedevelop three indexes for the efficient processing of gener-alized subgraph queries, namely, a distance-based index, afrequent-pattern-based index, and a star-structure-based in-dex. We discuss in details the strengths and limitations ofthe indexes. Fourth, we verify the efficiency of our matchingalgorithm (for candidate verification) and our indexes (forfiltering) using both real and synthetic datasets.

Paper Organization. Section 2 gives the notations andformally defines the problem. Section 3 presents the gener-alized subgraph matching algorithm. Section 4 discusses indetails the three indexes. Section 5 reports the experimentalresults. Section 6 discusses the related work and Section 7concludes the paper.

2. PROBLEM STATEMENTLet G be a database that contains a set of simple and

labeled graphs. We denote each graph g ∈ G as a tripletg = (Vg, Eg, lg), where Vg and Eg are the sets of verticesand edges in g, respectively, and lg is a labelling functionthat maps each vertex in g to a label in a finite alphabet.For ease of exposition, we assume that all edges in g areundirected; our results can be easily extended for directedgraphs.

For any vertices u and v in a graph g ∈ G, we define thedistance between u and v, denoted as distg(u, v), as the num-ber edges in the shortest path between u and v. For instance,

Page 3: Efficient Algorithms for Generalized Subgraph Query … · Efficient Algorithms for Generalized Subgraph Query Processing ... Graph Databases, Graph Matching Algorithm, ... and

in the graph g3 in Figure 1, we have distg3(v1, v3) = 2,since the shortest path between v1 and v3 contains two edges(v1, v2) and (v2, v3).

We aim to support generalized subgraph queries on G. Inparticular, a generalized subgraph q is a simple, undirected,and labelled graph where each edge carries a positive integerweight. We denote q as a quadruple (Vq, Eq, lq, t), where Vand E are the sets of vertices and edges in q, respectively, lqis the labelling function for q, and t is a function that mapseach edge in q to its weight. We say that a graph g ∈ Gmatches q, if there exists an injective function f from Vq toVg, such that for any edge (u, v) in q, (i) the labels of u andf(u) are the same, (ii) the labels of v and f(v) are the same,and (iii) the distance between f(u) and f(v) in g is no morethan the weight of (u, v).

For example, in Figure 1, the graph g3 matches the gen-eralized subgraph q1. To explain this, let us consider aninjective function f that maps u1 to v3, u2 to v1, u3 to v6,and u4 to v5. For the edge (u1, u2) in q1, we have f(u1) = v3and f(u2) = v1, and the distance distg3 between v3 and v1in g3 equals 2, which is no more than the weight associatedwith (u1, u2). The cases for the other edges in q1 can beverified in a similar manner.

Given a generalized subgraph q, a generalized subgraphquery on G returns the graphs in G that match q. For con-venience, we refer to q as the query graph, and the graphs inG as the data graphs. In addition, we say that a data graphg contains q (denoted by q ⊆ g), if g matches q.

3. GENERALIZED SUBGRAPH MATCH-ING ALGORITHM

To enable generalized subgraph queries on G, we need tofirst address a crucial problem: How do we decide whethera data graph g ∈ G matches the query graph q? We referto this problem as the generalized subgraph matching prob-lem. It is not hard to see that this problem is NP-hard; inparticular, when the weights of all edges in q equal 1, test-ing whether a data graph g matches q is equivalent to thesubgraph isomorphism problem, which has been shown to beNP-complete [5].

Given that generalized subgraph matching is theoreticallyintractable, we resort to heuristics and propose a solutionthat provides practical efficiency. The core of our solution isa cost-based matching approach that significantly extendsand improves the existing heuristic algorithms [13, 16] forthe subgraph isomorphism problem. In what follows, wewill first introduce the existing methods for subgraph iso-morphism (in Section 3.1), and then present the details ofour solution (in Section 3.2).

3.1 Existing Algorithms for Subgraph Iso-morphism

The classic solution for the subgraph isomorphism is Ull-mann’s algorithm [16], which matches the vertices in thequery graph q to the vertices in the data graph g in an iter-ative manner. Specifically, in each iteration, the algorithmselects an unmatched vertex u in q, maps it to an unmatchedvertex in g with the same label, and then checks whether themapping is feasible, i.e., whether any two matched verticesin q that induce an edge in q are mapped to two vertices ing that induce an edge in g. If the mapping is feasible, the al-gorithm will enter the next iteration to match the remainingvertices in q. Otherwise, the algorithm will try matching u

to another unmatched vertex in g. If there is no vertex thatu can be matched to, the algorithm backtracks to the lastmatched vertex u′ in q, re-maps u′ to an unmatched vertexin g, and then re-starts the current iteration.

For example, in Figure 1, given the query graph q2 andthe data graph g3, Ullmann’s algorithm may first map u5 tov3, and then map u6 to v5. In that case, u5 and u6 inducean edge in q2, while v3 and v5 also induce an edge in g3, i.e.,the matching is feasible. Assume that, in the next iteration,the algorithm maps u7 to v4. Then, u6 and u7 induce anedge in q2, but the vertices that they are mapped to (i.e.,v5 and v4) do not induce any edge in g3. As a consequence,the mapping is infeasible, and hence, the algorithm wouldproceed to re-map u7 to another unmatched vertex in g3.

Intuitively, the efficiency of Ullmann’s algorithm dependson the order in which the vertices in q are matched. For in-stance, assume that q contains only two vertices u1 and u2,such that u1 has the same label with only one vertex v1 inthe data graph g, whereas u2 has the same label with almostall vertices in g. If we invoke Ullmann’s algorithm and mapu1 to v1 in the first iteration, then in the remaining itera-tions, we only need to examine whether u2 can be mappedto a vertex adjacent to v1. In contrast, if the first iterationmaps u2 (instead of u1) to some vertex v in g, then in theremaining iterations, we not only need to try mapping u1

to the neighbors of v, but also need to consider other possi-ble mappings that match u2 to other vertices in g, i.e., thesearch space of the algorithm becomes significantly larger.

Despite the importance of vertex mapping order, it is nottaken into account in Ullmann’s algorithm. This motivates amore advanced method called QuickSI [13], which improvesover Ullmann’s algorithm by heuristically choosing a map-ping order that is likely to reduce computation cost. Specif-ically, QuickSI decides the vertex mapping order based ontwo sets of statistics pre-computed from the graph databaseG. First, for any vertex u that can possibly appear in aquery graph, QuickSI pre-computes its frequency in G, i.e.,the average number of vertices in each data graph (in G)that have the same label with u. Second, for any edge e thatmay appear in a query graph, QuickSI also pre-computes itsfrequency in G, i.e., the average number of edges in eachdata graph that have endpoints with labels matching thoseof the endpoints of e. With these statistics, for any givenquery graph q, QuickSI first generates a spanning tree of q,such that vertices and edges closer to the root of tree tendto have lower frequencies in G. After that, QuickSI gener-ates an ordering of the vertices in q following a traversalof the spanning tree that recursively visits the branch withthe least frequent edge. The resulting vertex order is thenused whenever QuickSI compares a data graph g with q. In-tuitively, this vertex order improves efficiency, as it tendsto ensure that the search space of the matching algorithmwould be reduced significantly after each iteration.

3.2 A Cost-based Approach for GeneralizedSubgraph Isomorphism

Both Ullmann’s algorithm and QuickSI can be extendedfor generalized subgraph isomorphism, with a modified fea-sibility check in each iteration. Specifically, each time afterwe map a vertex u in the query graph q to a vertex v inthe data graph g, we would decide whether the mappingis feasible by examining every edge e in q that is inducedby u and any vertex u′ in q that has been matched. Let

Page 4: Efficient Algorithms for Generalized Subgraph Query … · Efficient Algorithms for Generalized Subgraph Query Processing ... Graph Databases, Graph Matching Algorithm, ... and

v′ be the vertex in g that u′ is mapped to. If for each e,the distance distg(v, v

′) between v and v′ is no more thanthe weight w(e) of e, then the mapping is feasible, and wewould proceed to the next iteration. Otherwise, we wouldre-map u to other unmatched vertex in g; if there does notexist any feasible mapping for u, we would backtrack to thelast matched vertex in q and re-map it (as with the case ofsubgraph isomorphism).

The aforementioned extensions of Ullmann’s algorithmand QuickSI, however, leave much room for improvements.In particular, Ullmann’s algorithm does not exploit the orderof vertex mapping for efficiency; QuickSI heuristically tunesthe vertex mapping order, but its tuning method is ratherah hoc and is without a formal model that justifies mappinga vertex ahead of any other. To remedy this, we propose anovel algorithm for generalized subgraph isomorphism thatincorporates a cost model for selecting a preferable order ofvertex mapping. In the following, we will first present therationale behind our method, and then provide the detailsabout our cost model and algorithm.

Assume that the query graph q and the data graph gcontain m and n vertices, respectively. Totally, there existP (n,m) = n!/(n−m)! different ways to map the vertices inq to distinct vertices in g, and these P (n,m) possible match-ings constitute the search space for the generalized subgraphisomorphism algorithm. (P (m,n) denotes the number of m-permutations of n.) To efficiently decide whether g matchesq, it is essential that the algorithm should traverse the searchspace in a judicious order that enables it to pinpoint a so-lution (if any) as quickly as possible. This motivates us tomatch vertices in q in an order based on how likely they canreduce the search space that we need to explore. Note thatwe use the same node mapping order for all data graphs (asin QuickSI), so as to avoid the overhead of re-computing thenode order for each data graph.

Specifically, to pick the first vertex in q to be matched, wewould inspect each edge e in q, and examine the frequency ofe (denoted as c(e)) in the data graphs in G. The frequency of(u′, u∗) in q is defined as the average number of vertex pairs(v′, v∗) in each data graph in G, such that (i) the labels ofu′ and v′ are the same, (ii) the labels of u∗ and v∗ are thesame, and (iii) the distance between v′ and v∗ is no morethan the weight of (u′, u∗). (To facilitate this step of thealgorithm, we pre-compute the frequency of any edge thatmay appear in the query graph.)

For each e, we intuitively estimate that it can be matchedto c(e) vertex pairs in the data graph. Given this estimation,if a vertex u is an endpoint of e and we choose to matchu first, then the search space size induced by mapping ucan be estimated as c(e) · P (n− 1,m− 1), where n denotesthe average number of vertices in the data graphs. Therationale here is that u is expected to be mapped to aroundc(e) vertices in a data graph, and the other unmatched m−1vertices in q are expected to be matched to around n − 1vertices in g in P (n − 1, m − 1) different ways; therefore,the number of possible matchings that remain to exploredcan be estimated as c(e) · P (n − 1,m − 1). Accordingly,we pick a vertex u incident to the edge e with the smallestc(e), and set u as the first vertex to be matched. The termP (n− 1, m− 1) is ignored since its value is the same for allvertices in q. (This helps us avoid the pathological case whenn < m, in which case P (n− 1,m − 1) is undefined.) Given

that the edge e with the smallest c(e) has two endpoints, wechoose the endpoint u with the smaller frequency c(u).

The order of the remaining vertices is decided in a similarmanner. Assume that we have picked a set S of k verticesand we are about to choose the next vertex to be matched.Let u′ be any vertex that has not be selected. If u′ is notconnected to any vertex in S by an edge in q, then we esti-mate the search space size induced by mapping u′ as

minany edge e adjacent to u′ c(e) ·N(S) ·P (n−k−1,m−k−1), (1)

where N(S) denotes the number of ways to match the firstk vertices, and P (n−k−1,m−k−1) is the number of waysto match the remaining m−k−1 vertices except u′. As willbe shown shortly, we do not need to compute the values ofN(S) and P (n− k − 1,m− k − 1).

On the other hand, if u′ has some edges that are incidentto the vertices in S, then our estimation of the search spacesize would take those edges into account. Let E be the set ofedges in q that connect u′ to the vertices in S. For each e inE that connects u′ to a vertex u∗, we examine the frequencyof u∗ (denoted as c(u∗)) in G, as well as the frequency of e(denoted as c(e)). Given c(u∗) and c(e), we intuitively esti-mate that the vertex u∗ is connected to around c(e)/c(u∗)vertices that have the same label with u′. Therefore, thesearch space size induced by mapping u′ is estimated as

c(e)/c(u∗) ·N(S) · P (n− k − 1, m− k − 1), (2)

where∑

u∈S c(u) and P (n−k−1,m−k−1) are as explainedin Equation 1. We refer to c(e)/c(u∗) as the matching rateof u′ implied by e, and we denote it as r(u′, e).

Observe that each edge e ∈ E may imply a differentmatching rate of u′, leading to different estimations of thesearch space size. We combine all estimations by taking thesmallest one, i.e., the size of the search space is estimated as

mine∈E

r(u′, e) ·N(S) · P (n− k − 1,m− k − 1). (3)

For convenience, we let r(u′) = mine∈E r(u′, e) if u′ isconnected to the vertices in S by at least one edge in the datagraph, otherwise we let r(u′) be the minimum frequency ofan edge in q that is adjacent to u′. Given Equations 1 and4, we choose the next vertex u′ to be matched as the onethat minimizes the estimated search space size, i.e.,

u′ = argminu

{r(u)

}. (4)

Note that Equation 4 does not involve the terms N(S) andP (n − k − 1,m − k − 1) (which appear in both Equations1 and 4). This is because their values are the same for allpossible u′, and hence, they have no effect on the selectionof u′.

In summary, our algorithm optimizes the vertex matchingorder by a qualitative prediction of how each vertex may helpreduce the search space size. As will be shown in Section 5,our experimental results demonstrate the superiority of ouralgorithm over both Ullmann’s algorithm and QuickSI onboth standard and generalized subgraph isomorphism tests.

4. INDEXING TECHNIQUESAlthough in Section 3 we proposed a reasonably fast al-

gorithm for generalized subgraph matching, it is still im-practical to answer a query by sequentially scanning theinput database and matching the query graph with each

Page 5: Efficient Algorithms for Generalized Subgraph Query … · Efficient Algorithms for Generalized Subgraph Query Processing ... Graph Databases, Graph Matching Algorithm, ... and

data graph, especially if the database is large. We applythe filtering-and-verification strategy to reduce the match-ing cost, that is, we first filter out as many unmatching datagraphs as possible and then verify the remaining candidatedata graphs by matching them with the query graph one byone. To do this, it is important to design an effective index-ing technique to filter out the unmatching data graphs.

In this section, we propose three indexing techniques: D-Index, FP-Index and S-Index. First, in Section 4.1 wepresent D-Index, which can be easily constructed but itspruning power is relatively weak. Then, we propose FP-Index in Section 4.2, which has an expensive constructioncost but is partially verification-free. Lastly, in Section 4.3we propose S-Index, which explores the star structures toachieve effective pruning as well as a low construction cost.

4.1 Distance IndexWe first present D-index, which is constructed based on

the distance among pairs of vertices in each data graph.Given a data graph g = (Vg, Eg, lg) ∈ G, we obtain the dis-tance set (DS) of all triplets of every two vertices consistingof their ordered labels and the correspond distance in g asfollows.

DS(g) = {(lg(u), lg(v), distg(u, v)) : u, v ∈ Vg, lg(u) ≤ lg(v)}A distance triplet (l1, l2, d) ∈ DS(g) is subsumed by an-

other distance triplet (l1, l2, d′) ∈ DS(g) if d > d′. We say

that a subset DSmin(g) ⊆ DS(g) is minimal if each distancetriplet in DSmin(g) is not subsumed by any other distancetriplet, that is, for each (l1, l2, d) ∈ DSmin(g), there doesnot exist (l1, l2, d

′) ∈ DSmin(g) such that d′ < d.

Example 2. Assume that vertex labels are ordered lexi-cographically, i.e., O<N<H<C. Consider the data graph g3in Figure 1, the distance set of g3 is DS(g3) = {(C,C,1),(C,C,2), (C,C,3), (H,C,1), (H,C,2), (H,C,3), (N,C,1), (N,C,2),(O,C,1), (O,C,2), (O,C,3), (O,C,4),(N,H,2), (O,H,3), (O,H,4),(O,N,1), (O,N,2), (O,O,1)}, and DSmin(g3) = {(C,C,1),(H,C,1), (N,C,1), (O,C,1), (N,H,2), (O,H,3), (O,N,1), (O,O,1)}.Note that |DS(g3)| = 18 while |DSmin(g3)| = 8.

The minimal set of distinct distance triplets in thedatabase is then given by

DS = ∪g∈GDSmin(g).

For each distance triplet (l1, l2, d) ∈ DS , the set of datagraphs that contain (l1, l2, d) is given by

A(l1, l2, d) = {g : (l1, l2, d) ∈ DSmin(g)}.The Distance Index (D-index) is constructed on DS andA(l1, l2, d) for each (l1, l2, d) ∈ DS, which is to be detailedas follows.

4.1.1 Index ConstructionThe structure of D-index consists of the following parts:

• A B+-tree index, called the Label Pair Index (LPI ),stores all pairs of labels of the triplets in DS.• A sorted list of distance values for each label pair

(l1, l2) ∈ LPI , denoted by LPI(l1, l2).DV , that is,LPI(l1, l2).DV = {d : (l1, l2, d) ∈ DS}.• Each distance value d ∈ LPI(l1, l2).DV for

any (l1, l2) ∈ LPI is associated with a set ofdata graphs A(l1, l2, d), we further denote it asLPI(l1, l2).DV (d) = A(l1, l2, d).

Algorithm 1: Build-DIndex(G)input : the graph database, Goutput: the D-index, LPI

1 for g ∈ G do2 Compute DSmin(g) ;3 for (l1, l2, d) ∈ DSmin(g) do4 LPI ← (l1, l2) ;5 LPI(l1, l2).DV ← d ;6 LPI(l1, l2).DV (d)← g ;

7 return LPI

Algorithm 2: Query-DIndex(q, LPI,G)input : the query graph, q = (Vq, Eq, lq, t)

D-Index, LPIthe graph database, G

output: the candidate set of q, C(q)1 C(q)← G ;2 for (l1, l2, d) ∈ DSmin(q) do3 C(l1, l2, d)← ∅ ;4 for k ∈ LPI(l1, l2).DV and k ≤ d do5 C(l1, l2, d)← C(l1, l2, d) ∪ LPI(l1, l2).DV (k) ;

6 C(q)← C(q) ∩ C(l1, l2, d) ;7 return C(q)

The algorithm for D-index construction, Build-DIndex,is shown in Algorithm 1. For each data graph g ∈ G, wefirst compute its minimal distance set DSmin(g) (Line 2).Then, for each distance triplet (l1, l2, d) ∈ DSmin(g), weassign (l1, l2) to LPI and put the distance value d to thesorted list LPI(l1, l2).DV (Line 4-5). Finally, g is includedin LPI(l1, l2).DV (d) (Line 6).

4.1.2 Query ProcessingGiven a query graph q = (Vq, Eq, lq , t), we first obtain its

minimal distance set DSmin(q) by all pairs shortest pathalgorithm. For each distance triplet (l1, l2, d) ∈ DSmin(q),the correspond candidate set C(l1, l2, d) can be obtained bymerging all the graph sets associated with LPI(l1, l2).DV (k)for 1 ≤ k ≤ d. The final candidate set C(q) for verificationis the intersection of all the candidate sets of each distancetriplet (l1, l2, d) ∈ DSmin(q), that is

C(q) = ∩(l1,l2,d)∈DSmin(q)C(l1, l2, d).As shown in Algorithm 2, processing the query graph q by

D-index obtains the candidate set for each distance triplet(l1, l2, d) ∈ DSmin(q) (Lines 3-5), and then intersects themto output the candidate set of q (Line 6).

Lemma 1. Given a query graph q = (Vq, Eq, lq, t), its an-swer set A(q) is a subset of Build-DIndex(q, LPI,G).

Proof. Consider a data graph g = (Vg, Eg, lg) ∈G that matches q. For each edge (v, u) ∈ Eq, wecan map it to a path P = 〈f(v), . . . , f(u)〉 such that|P | ≤ t(v, u). Without the loss of generality, we assumethat lg(f(v)) ≤ lg(f(u)). Consider the distance triplet(lg(f(v)), lg(f(u)), d) ∈ DSmin(g), we have d ≤ |P |, thatis d ≤ t(v, u), which completes the proof.

Page 6: Efficient Algorithms for Generalized Subgraph Query … · Efficient Algorithms for Generalized Subgraph Query Processing ... Graph Databases, Graph Matching Algorithm, ... and

4.1.3 Complexity AnalysisAssume that α is the average number of vertices and β is

the average number of edges for the graphs in G, the spacecomplexity of D-index is O(α2|G|). Since the constructiontime for each DSmin(g) is O(αβ) by starting a BFS fromeach vertex in g, the time complexity of constructing D-index is O(αβ|G|).

The generation of minimal distance set of query graphq can be done in O(|Vq |3) time. For each distance tripletin DSmin(q), the response time of D-index is O(log(md))where m is the number of distinct labels in G and d is thelargest distance. Thus, the index response time for the graphpattern is O(|Vq |3 + k2 log(md)), where k is the number ofdistinct labels in q. Note that, in practice, both |Vq| and kare very small.

4.2 Frequent Pattern IndexThe shortcoming of D-index is that it loses the structural

information of data graphs. As a result, the filtering is noteffective enough, leading to a lot of unmatched candidategraphs. Although there are graph indexing approaches thatretain structural information of data graphs [4,9,15,18,23],they cannot be directly applied to answer generalized sub-graph queries.

In order to employ structural information for answeringgeneralized subgraph queries, we propose the concept of fre-quent generalized subgraph (FGG) patterns and apply FGGsto design a structural index called Frequent Pattern Index(FP-index). The challenges, however, are 1) how to effi-ciently mine the FGGs, 2) how to apply and index FGGsfor filtering. We address the two challenges as follows.

The first challenge can be addressed by the pattern-growthapproach [1]. The difference is that in an FGG, edges areweighted. To obtain weighted edges for FGGs, we grow thefrequent patterns from weighted edges. We initialize theset of weighted edges by taking the set of distinct distancetriplets E = ∪g∈GDS(g) introduced in Section 4.1, whereeach distance triplet (l1, l2, d) ∈ E is considered as an edge(l1, l2) with weight d, while |C(l1, l2, d)| is the frequency ofthe edge.

A subgraph pattern f is frequent if its frequency is greaterthan a pre-defined threshold σ. Since the number of FGGscan be too large and indexing a large number of FGGs willincrease the index size and hence the search time, we applya maximum pattern size threshold, γ, and a maximum edgeweight threshold, ρ, to obtain only FGGs with size at mostγ and any edge weight at most ρ. In our experiments, weset these thresholds as the best possible values such that theFGGs can fit in the machine memory.

4.2.1 Index ConstructionThe FP-index consists of two parts: frequent pattern graph

index (FPG-index) and edge index (E-index). FPG-indexstores the FGGs in a B+-tree, with the key as an FGGand the data value as the set of data graphs containing theFGG. To answer queries that may contain infrequent edges,we also construct E-index that builds a B+-tree on the setof infrequent edges and frequent large-weight edges whoseweights are larger than ρ, with the key as an edge and thedata value as the set of data graphs containing the edge.

4.2.2 Query ProcessingGiven a query graph q, we process the query with FP-

index as follows:

O

2

3H

CN

v3

v4

v7v2

H

1

H2

4O

C

v6v1

O

1

1N

HC

v2

v3

v1

N

2

N2

2O

C

v6v3

v7

sg3(v1) sg3(v3)Cv5

3

Cv5 C1

Figure 2: Two star structures: sg3(v1) and sg3(v3)

• Case 1: q is an FGG indexed by FPG-index. In thiscase, we obtain q’s answer set from the index directlywithout verification.

• Case 2: q is not an FGG indexed by FPG-index. If qcontains infrequent edges, then the candidate set of q isrelatively small (at most σ|G|), which can be retrievedfrom E-index. Otherwise, we reduce the weight of someedges in q to obtain an FGG q′, then we have a partialanswer, A(q′), of q, without verification. And then,we obtain the rest of the answer, i.e., A(q) \ A(q′), asfollows. We decompose q into several FGGs in FPG-index and frequent large-weight edges in E-index, asf1, f2, . . . , fk, and compute the candidate set by in-tersecting the answer sets of these frequent patterns:∩1≤i≤kA(fi).

4.2.3 Complexity AnalysisSimilar to other structural graph indexes such as FG-index

[4], the construction cost of FP-index is dominated by thecost of mining FGGs and the index size is dominated bythe overall size of FGGs. Likewise, the query processingcomplexity also heavily depends on the number of FGGsindexed as well as the value of σ|G|. However, the complexityof mining FGGs, as well as the size and number of FGGs,may vary significantly from database to database and weare not aware of any formal analysis for these factors in theliterature.

4.3 Star IndexSince the number of FGGs can be large, FP-index can only

be used to process queries of small size efficiently. Moreover,mining FGGs may also be too expensive. Thus, we proposeanother index, called Star Index (S-index), which uses onlystar structures (instead of subgraph structures) to reduceboth the index construction and storage overhead, while stillcapturing much of the structural information for effectivefiltering in query processing.

For a vertex v in a data graph g = (Vg, Eg, lg) ∈ G, wedefine the star structure of v as sg(v) = (Vg, E

vg , lg , w), where

(1) v is the center of the star structure; (2) Evg consists of the

edges from v to other vertices in Vg, that is Evg = {v}×(Vg \

{v}); (3) w is a function that assigns the distance distg(v, u)to each edge (v, u) ∈ Ev

g , i.e., w(v, u) = distg(v, u). Forexample, Figure 2 shows the star structures of v1 and v3 ofthe graph g3 in Figure 1.

We group the weights of the edges in a star structure bythe label of the non-center end vertex. Let L = {lg(u) : u ∈Vg \ {v}}. We obtain a multiset of weights (called weightmultiset) for each label as follows

Wg(v, l) = {w(u, v) : lg(u) = l, l ∈ L, and u ∈ Vg \ {v}}.The weight values in the weight multiset are sorted in

ascending order. Table 2 lists all the weight multisets for

Page 7: Efficient Algorithms for Generalized Subgraph Query … · Efficient Algorithms for Generalized Subgraph Query Processing ... Graph Databases, Graph Matching Algorithm, ... and

Table 2: Weight multisets of g3Vertex v lg3(v) N C O H

v1 H 2 1,2,3 3,4v2 C 1 1,2 2,3 1v3 N 1,1,2 1,2 2v4 O 1 2,2,3 1 3v5 C 1 2,3 1,2 3v6 O 2 1,3,4 1 4v7 C 2 1,3 3,4 2

Table 3: Compressed weight multisets of g3Label N C O H

N 1,1,2 1,2 2C 1 1,2 1,2 1O 1 1,2,3 1 3H 2 1,2,3 3,4

each label and each star structure of g3. For example, for thestar structure sg3(v1), Wg3(v1, C) = {1, 2, 3}, Wg3(v1, O) ={3, 4}, and Wg3(v1, N) = {2}.

Given two weight multisetsW1 = {w1, . . . , wk} andW2 ={w′

1, . . . , w′t}, where k ≤ t. We define the merge of W1 and

W2 as follows.

W1 ∩W2 = {min(w1, w′1), . . . ,min(wk, w

′k), w

′k+1, . . . , w

′t}.

For some vertices in a graph, they may share the same la-bel. So, we further compress the weight multisets as follows.

Wg(l1, l2) = ∩lg(v)=l1,v∈VgWg(v, l2).

Table 3 lists the compressed weight multisets. For exam-ple, v2 and v5 share the same label C, so Wg3(v2, H) = {1}and Wg3(v5, H) = {3} are merged into one Wg3(C, H) ={min(1, 3)} = {1}.

The set of distinct compressed weight multisets of each la-bel pair (l1, l2) in the database G can be obtained as follows.

W(l1, l2) = ∪g∈GWg(l1, l2).

For each compressed weight multiset w ∈ W(l1, l2), theset of data graphs whose corresponding compressed weightmultiset is w is defined as follows.

A(w) = {g :Wg(l1, l2) = w and g ∈ G}.

4.3.1 Index StructureS-index is constructed based onW(l1, l2) for each distinct

label pair (l1, l2) and A(w) for each w ∈ W(l1, l2). Theindex structure of S-index consists of the following parts:

• A B+-tree on the set of distinct label pairs with eachpair (l1, l2) as the key and W(l1, l2) as the data value.We name this B+-tree as SI .

• A nested B+-tree on the set of distinct compressedweight multisets W(l1, l2) for each pair (l1, l2), whereeach compressed weight multiset w ∈ W(l1, l2) is thekey and A(w) is the data value. We name this nestedB+-tree as SI(l1, l2).

Algorithm 3 outlines the construction of S-index. For eachdata graph g ∈ G, for each distinct label pair (l1, l2) in g, weobtain the compressed weight multiset Wg(l1, l2), and storeit in the record with the key (l1, l2) in SI . Then, we accessthe record with the key Wg(l1, l2) in the nested B+-treeSI(l1, l2) and add g to the corresponding record.

Algorithm 3: Build-SIndex(G)input : the graph database, Goutput: the S-index, SI

1 for g ∈ G do2 Let L(g) = {(lg(u), lg(v)) : u, v ∈ Vg and u �= v} ;3 for (l1, l2) ∈ L(g) do4 Compute w =Wg(l1, l2) ;5 Add w to the record in SI with key (l1, l2) ;6 Add g to the record in SI(l1, l2) with key w ;

7 return SI

Algorithm 4: Query-SIndex(q, SI,G)input : the query graph, q = (Vq, Eq, lq, t)

S-Index, SIthe graph database G

output: the candidate set of q, C(q)1 C(q)← G ;2 Let L(q) = {(lg(u), lg(v)) : u, v ∈ Vg and u �= v} ;3 for (l1, l2) ∈ L(q) do4 Compute Wq(l1, l2) ;5 C(l1, l2)← ∅ ;6 for w ∈ SI(l1, l2) and w ≤ Wq(l1, l2) do7 C(l1, l2)← C(l1, l2) ∪ A(w) ;

8 C(q)← C(q) ∩ C(l1, l2) ;9 return C(q)

4.3.2 Query ProcessingWe now discuss query processing by S-index. Given two

compressed weight multisets W1 = {w1, . . . , wk} and W2 ={w′

1, . . . , w′t}, we say that W1 ≤ W2 if and only if 1) k ≥ t,

and 2) wr ≤ w′r for 1 ≤ r ≤ t.

Lemma 2. Let sg(u) and sq(v) be two star structures,such that u and v are vertices in graphs g = (Vg, Eg, lg)and q = (Vq, Eq, lq, t), respectively. If sg(u) matches sq(v),then we have Wg(l1, l2) ≤ Wq(l1, l2) for all distinct labelpairs (l1, l2) of q.

Proof. Consider a label pair (l1, l2) of q, l1 is the labelof the center vertex of a star structure of q and l2 is thelabel of a non-center vertex. Since g matches q, the numberof vertices in Vg with label l2 is no smaller than the numberof vertices in Vq with the same label. Therefore, we have|Wg(l1, l2)| ≥ |Wq(l1, l2)|. Moreover, for each vertex v′ ∈Vq of label l2, there exists a vertex u′ ∈ Vg such that u′

maps to v′ and distg(u, u′) ≤ distq(v, v

′). Thus, the proofis complete.

According to Lemma 2, we process a query by S-index asshown in Algorithm 4. For each distinct label pair (l1, l2)of q, we merge all the candidate sets associated with (l1, l2)that are smaller than Wq(l1, l2), which gives C(l1, l2). Thecandidate set of q is then obtained by intersecting C(l1, l2)for all distinct pairs (l1, l2) of q.

4.3.3 Complexity AnalysisLet m be the average number of distinct labels in a

data graph. Thus, the number of distinct label pairs isO(m2), and the number of compressed weight multisets in

Page 8: Efficient Algorithms for Generalized Subgraph Query … · Efficient Algorithms for Generalized Subgraph Query Processing ... Graph Databases, Graph Matching Algorithm, ... and

QuickSI CBAUllmann

0

5

10

15

20

25

Q4 Q8 Q12 Q16 Q20 Q24

Computation Time (sec)

0

0.2

0.4

0.6

0.8

1.0

1.2

Q4 Q8 Q12 Q16 Q20 Q24

Computation Time (sec)

(a) Average Edge Weight = 1.5 (b) Average Edge Weight = 1

Figure 3: Efficiency of Generalized Subgraph Iso-morphism Algorithms

the database G is O(m2|G|). Assume that the average num-ber of vertices in a data graph is α, then the average sizeof a compressed weight multiset is α/m. Thus, the spacecomplexity of S-index is O(αm|G|).

Assume that the number of distinct labels in a query graphq is t. The index response time is the summation of the timefor searching the compressed weight multisets of q, thus therunning time complexity is O(t2 log(m2|G|)). Note that, inreal life queries, the number of distinct labels is often small.

5. EXPERIMENTAL RESULTSThis section experimentally evaluates our indices and al-

gorithms for generalized subgraph matching. Section 5.1describes the experimental settings. Section 5.2 evaluatesour algorithms for the generalized subgraph isomorphismproblem, and Section 5.3 tunes the parameters for the pro-posed FP-Index. After that, Sections 5.4 and 5.5 demon-strate the efficiency of our indexing methods on real andsynthetic datasets, respectively.

5.1 Experimental SettingsDatasets. We use two benchmark datasets commonlyadopted in the literature [8, 18]. Both datasets containgraphs that represent chemical molecules. The first oneis the AIDS Antiviral Screen Dataset [18], which consistsof 10, 000 graphs. The second dataset is referred to asPubChem [18], and it contains 100, 000 graphs. We usePubChem.mK to denote a sample set of PubChem withm thousands of graphs. In addition, we use syntheticdatasets produced from GraphGen2, a public available syn-thetic graph generator.

Query sets. For the AIDS dataset, we adopt the querysets from [18], but we ignore the label on each edge andadd a weight on the edge (since we target at query graphswhere the edges are unlabelled and weighted). For the otherdatasets, we generate the query sets by first extracting gen-eralized subgraphs from the graphs in the datasets, suchthat the number of data graphs matching each extractedgeneralized subgraph is at most 10% of the total number ofdata graphs. In other words, we avoid generating general-ized subgraph matching queries that would return excessivenumbers of results.

All of our experiments are conducted on a machine witha Intel Xeon 2.4GHz CPU with 48GB RAM.

5.2 Generalized Subgraph IsomorphismOur first set of experiments compares three algorithms

for generalized subgraph isomorphism: our cost-based ap-

2http://www.cse.ust.hk/graphgen/

102

103

104

105

1 2 3 4

Maximum Edge Weight

Number of FGGs

103

104

105

1 2 3 4

Maximum Edge Weight

Computation Time (sec)

(a) σ = 0.05, γ = 4 (b) σ = 0.05, γ = 4

104

105

0.01 0.02 0.03 0.04 0.05

frequency threshold σ

Number of FGGs

104

105

0.01 0.02 0.03 0.04 0.05

frequency threshold σ

Computation Time (sec)

(c) ρ = 3, γ = 4 (d) ρ = 3, γ = 4

Figure 4: Space and Pre-computation Costs of theFP-index

proach (denoted as CBA), as well as the extensions of Ull-mann’s algorithm and QuickSI. Figure 3 illustrates the aver-age running time required by each algorithm to match eachquery graphs in query set Qi to all data graphs in the AIDSdataset. In particular, each query set Qi contains 1000 querygraphs, and each query graph in Qi contains i vertices. Fig-ure 3a shows the results when the edges in the query graphhave average weight 1.5. Observe that CBA considerablyoutperforms the extension of QuickSI, which in turn is su-perior than the extension of Ullman’s algorithm. Figure 3bshows the results when the average edge weight in the querygraph equals 1, i.e., when the generalized subgraph isomor-phism problem degenerates to the standard subgraph iso-morphism problem. Even in this degenerated case, CBA stillconsistently outperforms QuickSI and Ullmann’s algorithm.This demonstrates the superiority of our cost-based methodfor optimizing vertex matching order. We have conducteda similar set of experiments on the PubChem datasets, andwe found that the results are qualitative similar; we omitthose results for the interests of space.

5.3 Tuning the FP-indexThe second set of our experiments evaluation the space

and pre-computation costs of the FP-index (presented inSection 4.2) on a set of 40 thousands data graphs sampledfrom the PubChem dataset. Figure 4a illustrates the num-ber of Frequent Generalized subGraphs (FGG) that needto be stored in the FP-index, varying the maximum edgeweight ρ in the graph patterns from 1 to 4, with the fre-quency threshold set to σ = 0.05 and the maximum numberof vertices in the FGGs set to γ = 4. Note that the num-ber of FGGs increases exponentially with maximum edgeweight ρ. Figure 4b shows the time required to mine theFGGs, which also exhibits an exponential growth with theincrease of ρ. These results indicate that maximum edgeweight adopted in the construction in the FP-index have tobe carefully selected and has to be reasonably small.

Figures 4c and 4d illustrate the number of FGGs and pre-computation time required by FP-index, respectively, vary-ing the frequency threshold σ from 0.01 to 0.05, with themaximum edge weight set to ρ = 3 and the maximum num-ber of vertices in the FGGs set to γ = 4. Observe that

Page 9: Efficient Algorithms for Generalized Subgraph Query … · Efficient Algorithms for Generalized Subgraph Query Processing ... Graph Databases, Graph Matching Algorithm, ... and

D-index S-index FP-index

0

10

20

30

40

50

60

10K 20K 40K 60K 80K 100K

Computation Time (ms)

02000400060008000

1000012000

10K 20K 40K 60K 80K 100K

Size of Candidate Set

(a) Query Time (b) Size of Candidate Set

100

101

102

103

10K 20K 40K 60K 80K 100K

Index Size (MB)

100

101

102

103

104

10K 20K 40K 60K 80K 100K

Construction Time (sec)

(c) Index Size (d) Construction Time

Figure 5: Index Performance v.s. Dataset Size

both the number of FGGs and the pre-computation timedecreases exponentially when the frequency threshold σ in-creases. Hence, we may use a large σ to reduce the spaceand pre-computation cost of the FP-index. One may betempted to set σ to be even larger than 0.05, which, how-ever, may significantly reduce the effectiveness of FP-index,as an excessively large σ would make it difficult for FP-indexto answer a query without invoking the verification process.

Based on the results in Figure 4, we set ρ = 3, σ = 0.05,and γ = 4 for the FP-index in all following experiments.

5.4 Performance of IndicesOur next set of experiments compares the performance

of the three proposed indices (i.e., D-index, S-index, andFP-index) in terms of query processing performance, spaceoverhead, and construction time. For these experiments, weuse sample sets of PubChem dataset with sizes varying from10K to 100K. We do not use the AIDS dataset as it containsonly a small number of data graphs.

Figure 5a illustrates average query processing time of eachindex for a query set with 1000 graphs, such that on aver-age each graph has 5 vertices and 7 edges, and the averageedge weight equals 2.5. Both the S-index and the FP-indexsignificantly outperforms the D-index, and the FP-index isslightly better than the S-index. This is consistent with theresults in Figure 5b, which shows the average size of thecandidate set induced by each index during query process-ing. As shown in Figures 5c and 5d, however, the space andconstruction overheads of FP-index are significantly higherthan those of the S-index, which in turn are higher thanthose of the D-index.

Figure 6a shows the average query time of each index forquery sets ViEj on the dataset with 40K data graphs, suchthat each query set ViEj contains 1000 query graphs, eachof which has i vertices and j edges, and the average weightof the edges equals 2.5. The FP-index achieves the smallestquery time when the numbers of vertices and edges in thequery graphs are small, but it is outperformed by the S-index on large query graphs. In addition, the D-index isconsistently slower than both the FP-index and the S-index.

Figure 6b illustrates the average query time of each indexfor query set V5E7, with the average edge weight varying

D-index S-index FP-index

0

10

20

30

40

50

60

V3E2 V4E4 V5E7 V6E9 V7E13 V8E17

Computation Time (ms)

0

10

20

30

1 2 3 4 5

Average Edge Weight

Computation Time (ms)

(a) Varying Query Graph Size (b) Varying Max. Edge Weight

Figure 6: Index Performance v.s. Parameters of theQuery Graphs

D-index S-index FP-index

00.5

11.5

22.5

3

0.3 0.5 0.7

graph density

Computation Time (ms)

50100

200

300

400

500

0.3 0.5 0.7

graph density

Size of Candidate Set

(a) Query Time (b) Size of Candidate Size

10-1

100

101

102

0.3 0.5 0.7

graph density

Index Size (MB)

10-1100101102103104105106

0.3 0.5 0.7

graph density

Construction Time (sec)

(c) Index Size (d) Construction Time

Figure 7: Index Performance v.s. Graph Density

from 1 to 5. The FP-index performs the best when the av-erage edge weight is no more than 3, which is the maximumedge weight handled in its preprocessing step. When theaverage edge weight is larger than 3, however, the perfor-mance of the FP-index degrades, and the S-index becomesthe most efficient one.

5.5 Performance on Synthetic DatasetThe experiments use synthetic datasets to evaluate the

performance of our indices with respect to a parameter thathas not been investigated in the previous experiments, i.e.,the densities of the data graphs. In particular, the densityof a data graph with n vertices and m edges equals m/

(n2

).

We generate synthetic graphs with densities varying from0.3 to 0.7, and we use them to construct datasets, such thateach dataset contains 10K data graphs, each of which has 30edges and a fixed density. The query graphs for each datasetis constructed in a manner similar to previous experiments,such that each query graph on average has 5 vertices, 7edges, with an average edge weight 2.5.

Figure 7 illustrates the performance of each index as afunction of the data graph density. As with our previousexperiments, the FP-index achieves the best query perfor-mance, but it incurs the highest space and pre-computationoverheads. The D-index requires the smallest space and pre-processing time, but its query time is the largest. The S-index consistently lands on the middle ground between theFP-index and the D-index.

Page 10: Efficient Algorithms for Generalized Subgraph Query … · Efficient Algorithms for Generalized Subgraph Query Processing ... Graph Databases, Graph Matching Algorithm, ... and

Summary. Our experiments show that the FP-index offerssuperior query performance at the cost of space and pre-computation time. Therefore, it is suitable for the applica-tions where (i) efficient query processing is crucial, and (ii)space and pre-computation overheads are not a major con-cern. In contrast, the D-index entails relatively high querycost, but it incurs minimal space and preprocessing over-head. This renders it preferable in the scenarios with strin-gent requirements on space consumption or pre-computationtime. Finally, the S-index’s space and pre-computation costsare only slightly higher than that of the D-index, but itsquery efficiency is almost comparable to that of the FP-index. Hence, it offers user a choice to strike a good balancebetween query processing and space (preprocessing) over-heads.

6. RELATED WORKThere are some existing studies of graph matching prob-

lem by allowing edges to map to paths for graphs [6,7,10,25].However, their queries are too rigid by fixing the length ofall the mapping paths [25], or too relax by allowing nodesimilarity matching [7]. Besides, all these works are tailoredto query a single large graph making them unsuitable forquerying a large set of small or medium-sized graphs.

On the other hand, various types of graph query process-ing on a large set of small or medium-sized graphs havebeen studied in the literature in recent years and we restrictour discussion on the closely related ones, namely subgraphquery processing [4, 9, 13, 15, 18, 21, 23], supergraph queryprocessing [2, 3, 14, 22], and similarity graph query process-ing [12, 19, 20, 24]. All these works proposed some indexingtechniques to filter out as many unmatching data graphsas possible. Although many different types of graph index-ing techniques have been proposed, none of them is simi-lar to our indexes except FG-index [4], which is similar toFP-index. However, the only similarity lies on the use offrequent patterns to avoid verification and the use of infre-quent edges to reduce the candidate set size, while the indexstructure of FP-index (which builds on B+-trees) is totallydifferent from that of FG-index (which is an unbalanced treebuilt on the clusters of frequent patterns). Apart from that,both D-index and S-index are entirely different from all ex-isting indexes. In addition, our work is the first to proposeindexes for processing generalized subgraph queries.

7. CONCLUSIONSWe studied a new type of graph queries, generalized sub-

graph queries. We proposed a succinct and effective costmodel to minimize the cost of generalized subgraph isomor-phism. We also developed three indexes that can effectivelyfilter out unmatching data graphs, which significantly re-duces the total query response time. We evaluated our al-gorithms with experiments on both real datasets and syn-thetic datasets. The results show that our matching algo-rithm is efficient in candidate verification as it considerablyoutperforms the direct extension of existing graph matchingalgorithms, while our indexes are also effective in filtering.Thus, the results verify that our method is efficient in queryprocessing (in both filtering and candidate verification). Al-though some of the indexes have weaknesses, we show howthe weaknesses are addressed by another index; in particu-lar, our results show that S-index achieves both a low indexconstruction cost and a short query response time.

8. ACKNOWLEDGMENTSXiaokui Xiao was supported by Nanyang Technological

University under SUG Grant M58020016 and AcRF Tier1 Grant RG 35/09, and by the A*STAR SERG Grants1021580074. James Cheng was supported in part by theA*STAR TSRP Grants 1021580034 and 1121720013.

9. REFERENCES[1] C. C. Aggarwal and H. Wang, editors. Managing and Mining

Graph Data, volume 40 of Advances in Database Systems.Springer, 2010.

[2] C. Chen, X. Yan, P. S. Yu, J. Han, D.-Q. Zhang, and X. Gu.Towards graph containment search and indexing. In VLDB,pages 926–937, 2007.

[3] J. Cheng, Y. Ke, A. W.-C. Fu, and J. X. Yu. Fast graph queryprocessing with a low-cost index. VLDB J., 20(4):521–539,2011.

[4] J. Cheng, Y. Ke, W. Ng, and A. Lu. Fg-index: towardsverification-free query processing on graph databases. InSIGMOD, pages 857–872, 2007.

[5] S. A. Cook. The complexity of theorem-proving procedures. InSTOC, pages 151–158, 1971.

[6] W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Adding regularexpressions to graph reachability and pattern queries. In ICDE,pages 39–50, 2011.

[7] W. Fan, J. Li, S. Ma, H. Wang, and Y. Wu. Graphhomomorphism revisited for graph matching. PVLDB,3(1):1161–1172, 2010.

[8] W.-S. Han, J. Lee, M.-D. Pham, and J. X. Yu. igraph: Aframework for comparisons of disk-based graph indexingtechniques. PVLDB, 3(1):449–459, 2010.

[9] H. He and A. K. Singh. Closure-tree: An index structure forgraph queries. In ICDE, page 38, 2006.

[10] W. Jin and J. Yang. A flexible graph pattern matchingframework via indexing. In SSDBM, pages 293–311, 2011.

[11] C. L. S. John W. Moore and P. C. Jurs. Chemistry: TheMolecular Science, volume 2. Brooks Cole, 2007.

[12] H. Shang, X. Lin, Y. Zhang, J. X. Yu, and W. Wang.Connected substructure similarity search. In SIGMOD, pages903–914, 2010.

[13] H. Shang, Y. Zhang, X. Lin, and J. X. Yu. Taming verificationhardness: An efficient algorithm for testing subgraphisomorphism. In VLDB, 2008.

[14] H. Shang, K. Zhu, X. Lin, Y. Zhang, and R. Ichise. Similaritysearch on supergraph containment. In ICDE, pages 637–648,2010.

[15] D. Shasha, J. T.-L. Wang, and R. Giugno. Algorithmics andapplications of tree and graph searching. In PODS, pages39–52, 2002.

[16] J. R. Ullmann. An algorithm for subgraph isomorphism. J.ACM, 23(1):31–42, 1976.

[17] J. H. Van Drie, D. Weininger, and Y. C. Martin. Aladdin: Anintegrated tool for computer-assisted molecular design andpharmacophore recognition from geometric, steric, andsubstructure searching of three-dimensional molecularstructures. Journal of Computer-Aided Molecular Design,3:225–251, 1989.

[18] X. Yan, P. S. Yu, and J. Han. Graph indexing: A frequentstructure-based approach. In SIGMOD, pages 335–346, 2004.

[19] X. Yan, P. S. Yu, and J. Han. Substructure similarity search ingraph databases. In SIGMOD, pages 766–777, 2005.

[20] Z. Zeng, A. K. H. Tung, J. Wang, J. Feng, and L. Zhou.Comparing stars: On approximating graph edit distance.PVLDB, 2(1):25–36, 2009.

[21] S. Zhang, M. Hu, and J. Yang. Treepi: A novel graph indexingmethod. In ICDE, pages 966–975, 2007.

[22] S. Zhang, J. Li, H. Gao, and Z. Zou. A novel approach forefficient supergraph query processing on graph databases. InEDBT, pages 204–215, 2009.

[23] P. Zhao, J. X. Yu, and P. S. Yu. Graph indexing: Tree + delta>= graph. In VLDB, pages 938–949, 2007.

[24] X. Zhao, C. Xiao, X. Lin, and W. Wang. Efficient graphsimilarity joins with edit distance constraints. In ICDE, pages834–845, 2012.

[25] L. Zou, L. Chen, and M. T. Ozsu. Distancejoin: Pattern matchquery in a large graph database. PVLDB, 2(1):886–897, 2009.