Download pdf - TALE: A Tool for Approximate Large Graph Matching · ... A Tool for Approximate Large Graph Matching ... mate subgraph matching of large graph queries. ... exible graph similarity

TALE: A Tool for Approximate Large Graph Matching

Yuanyuan Tian, Jignesh M. Patel

EECS Department, University of Michigan,

Ann Arbor, Michigan, USA

{ytian, jignesh}@eecs.umich.edu

Abstract

Large graph datasets are common in many emergingdatabase applications, and most notably in large-scalescientific applications. To fully exploit the wealth ofinformation encoded in graphs, effective and efficientgraph matching tools are critical. Due to the noisyand incomplete nature of real graph datasets, approx-imate, rather than exact, graph matching is required.Furthermore, many modern applications need to querylarge graphs, each of which has hundreds to thousandsof nodes and edges.

This paper presents a novel technique for approxi-mate matching of large graph queries. We propose anovel indexing method that incorporates graph struc-tural information in a hybrid index structure. This in-dexing technique achieves high pruning power and theindex size scales linearly with the database size. In ad-dition, we propose an innovative matching paradigm toquery large graphs. This technique distinguishes nodesby their importance in the graph structure. The match-ing algorithm first matches the important nodes of aquery and then progressively extends these matches.Through experiments on several real datasets, this pa-per demonstrates the effectiveness and efficiency of theproposed method.

1 Introduction

Graphs provide a natural way to model data ina wide variety of applications, such as social net-works, road networks, network topology, protein inter-action networks and protein structures. Many graphdatabases are growing rapidly in size. The growth isboth in the number of graphs and the sizes of graphs(the number of nodes and the number of edges). Forexample, the number of interactions (edges in proteininteraction networks) in the BIND database [3] grew

about 10 folds from 2002 September to 2004 Septem-ber, and almost doubled after that. The number of pro-tein structures (graphs) in the ASTRAL database [8]has increased more than 3 folds since 2002. There isa critical need for efficient and effective graph query-ing tools for querying and mining these growing graphdatabases.

The database community has had a long-standinginterest in querying graph databases [6, 9, 17, 19–25].These previous studies have mostly been carried outwithin the context of precise graph data, and have fo-cused on exact graph or subgraph matching queries.However, many real graph datasets are noisy and in-complete in nature. For example, it is well knownthat protein interaction networks produced by high-throughput methods contain many false positives [18].Moreover, the discovered interactions only represent asmall fraction of the true network. As a result, ex-act graph or subgraph matching often fails to produceuseful results.

In contrast, approximate graph or subgraph match-ing plays a critical role in these applications. Approx-imate matching allows node/edge insertions and dele-tions, and node/edge mismatches. Furthermore, manynew graph applications prefer approximate matchingresults rather than exact ones as they can provide moreinformation such as what might be missing or spuriousin a query or a database graph.

In addition, most existing graph matching methodsare applicable to databases that contain graphs withsmall sizes, i.e. each graph has a small number (tens) ofnodes and edges. Moreover, the query graphs allowedin these methods are also small in size. However, inmany new applications, both the query and databasegraphs are “large”. Each graph can contain hundredsto thousands of nodes and edges. For example, in lifesciences applications, protein interaction networks forindividual species are often matched to determine sim-ilarities and differences across species. Each protein in-teraction network is large, and typically contains hun-

1

dreds to thousands of nodes and edges in each graph.

The problem that we address in this paper is ap-proximate subgraph matching of large query graphs.Namely, given a large query graph, with hundreds tothousands of nodes and edges, and a database of largegraphs, we want to find the subgraphs in the databasethat are similar to the query.

In this paper we present an index-based method forapproximate subgraph matching, called TALE (a Toolfor Approximate Subgraph Matching of Large QueriesEfficiently). TALE employs a novel graph indexingmethod, called NH-Index (Neighborhood Index). Mostexisting graph indexing methods only index subgraphs(paths, trees or general subgraphs), which can lead toindex sizes that are exponential in the database size.The indexing unit of NH-Index is the neighborhood ofeach database node. The neighborhood concept cap-tures the local graph structure around each node, andresults in an index with a high pruning power. At thesame time, the number of indexing units is equal tothe number of nodes in the database, which allows theindex to grow linearly with the database size. Further-more, NH-Index is a disk-based index, which allows itto handle graph databases that do not fit in memory.It employs a hybrid index that uses existing commondisk-based index structures, which makes implementa-tion in existing DBMSs straightforward.

We also propose an innovative matching paradigmfor querying large graphs. Unlike most previous graphmatching tools which treat every node in a graphequally, this matching technique distinguishes nodesby their importance in the graph structure. The algo-rithm first probes the NH-Index to match the impor-tant nodes in a query graph, and then progressively ex-tends the matches by enclosing satisfiable nearby nodesof already matched nodes.

We have applied TALE to three real biologicaldatasets. Our experiments demonstrate that TALE isable to produce useful and meaningful results in all thethree cases. In addition, our experimental evaluationshows that TALE is very efficient for large queries, andthat the execution time grows gracefully with increas-ing number of graphs in the database. Through com-parisons with other existing tools, we also show thatTALE is significantly faster than existing methods.

The main contributions of this paper are as follows:

(1) We propose TALE – a general tool for approxi-mate subgraph matching of large graph queries. TALEuses a novel disk-based indexing method, which indexesthe neighborhood of each database node. It achieveshigh pruning power and its size scales linearly withthe database size. We introduce an innovative graphmatching paradigm, which distinguishes nodes by their

importance in the graph structure, and accordinglytreats them differently in the matching process.

(2) By applying TALE to real applications, we showits effectiveness, significant performance improvementsover existing methods, and ability to gracefully handlelarge graph queries and databases.

The remainder of this paper is organized as follows:Related work is presented in Section 2. Section 3 de-fines the preliminary concepts. Section 4 describesour indexing mechanism, and Section 5 introduces theTALE algorithm. Experimental results are presentedin Section 6, and Section 7 contains our conclusionsand directions for future work.

2 Related Work

There is a long history of database research on meth-ods for querying graphs. However, most previous workshave focused on exact graph or subgraph matching,i.e. graph or subgraph isomorphism. Subgraph iso-morphism was proved to be NP-complete in [5]. Ull-mann [20] proposed a subgraph matching algorithmbased on a state space search method with backtrack-ing. However, this algorithm is prohibitively expensivefor querying against database with a large number ofgraphs. To reduce the search space, GraphGrep [17],GIndex [22] and TreePi [25] index substructures of thedatabase (paths, frequent subgraphs and trees respec-tively) to filter out graphs that do not match the query.

Several index-based methods for approximate sub-graph matching have also been proposed. However,most of these techniques only apply to small graphs andallow limited approximation. Grafil [23] and PIS [24]are both built on top of the exact subgraph matchingmethod GIndex. However, neither method allows nodeinsertion or deletion in their match models. CDIn-dex [21] only applies to graphs with limited sizes, as itexhaustedly enumerates and indexes all the subgraphsin the database. GString [9] utilizes sequence matchingto answer graph queries, but it only applies to appli-cations in which the graphs contain a small number ofbasic substructures. C-Tree [6], which employs an R-tree like index structure, is a more general tool thanthe above methods. In Section 6, we compare TALEwith C-Tree. A recent method [19], called SAGA, em-ploys a flexible graph similarity model. While SAGAis very efficient for small graph queries, it is compu-tationally expensive when applied to large graphs. Incontrast, TALE focuses on approximate matching forlarge graph queries. In Section 6, we also compareTALE with SAGA.

The life science community has produced vastamount of protein interaction networks. Several tools

2

for comparing protein interaction networks have beenproposed. These include PathBlast [10], its successorNetworkBlast [16], MaWIsh [13], and Graemlin [4]. Ofthese, Graemlin is the latest method and in many wayssuperior to the other methods for comparing proteininteraction networks. In Section 6, we compare TALEwith Graemlin.

3 Preliminaries

A graph G is denoted as (V,E), where V is the setof nodes and E ⊆ V ×V is the set of (directed or undi-rected) edges. Nodes and edges can have labels speci-fied by mappings φ : V → Σv and ψ : E → Σe respec-tively, where Σv is the set of node labels and Σe is theset of edge labels. In order to uniquely identify a node,we assign an unique id to each node in a graph. We alsoimpose an order on the ids. Our indexing method andmatching algorithm support both directed and undi-rected graphs with labeled nodes and/or labeled edges.For ease of presentation, we present our method usingundirected graphs with labeled nodes. Adaptations ofour method to other graph types are fairly straight-forward unless discussed. The simple adaptations areomitted in the interest of space.

Let G1 = (V1, E1) and G2 = (V2, E2) be two graphs.An exact graph match (graph isomorphism) is a bijec-tion mapping function λ : V1 ↔ V2, in which for everyv ∈ V1, φ(v) = φ(λv), and (u, v) ∈ E1 if and onlyif (λu, λv) ∈ E2. An exact subgraph match (subgraphisomorphism) from G1 (the query) to G2 (the target) isdefined as ∃G′

2⊆ G2, and G′

2is an exact graph match

for G1.Approximate graph matching allows node mis-

matches (i.e. φ(v) 6= φ(λv)), and node/edge insertionsand deletions. We define an approximate graph matchas a bijection mapping λ : V ′

1↔ V ′

2, where V ′

1⊆ V1 and

V ′

2⊆ V2. Similarly, an approximate subgraph match

from G1 (the query) to G2 (the target) is defined as∃G′

2⊆ G2, and G′

2is an approximate graph match for

G1.An approximate subgraph matching tool often re-

turns a large number of matches for a query. Often theuser is only interested in the top-K results. To returnthe top-K results, TALE has to sort the matches basedon their similarities to the query. Several graph similar-ity or distance models have been proposed, e.g. [2,19].Each model is meaningful for some applications, butthere is no “universal” model that fits all applications.We do not want to limit the generality of TALE bytailoring it to a particular similarity model. Instead,we let the users customize the similarity method thatbest models their application, thereby allowing TALE

to serve as flexible graph matching tool that can beused in a variety of graph matching applications. Sec-tion 6 shows examples of how this similarity model canbe customized in practice.

4 The NH-Index

In this section, we introduce the novel indexing tech-nique, Neighborhood Index (NH-Index).

4.1 Indexing Unit

The first question that arises with a graph indexingmethod is the graph entities, e.g. nodes, edges, sub-graphs, etc., that should be indexed. The NH-Index isused by the matching algorithm to match the impor-tant nodes in the query graph. These initial matchesfor the important nodes are then extended to producethe final matching results. A naive indexing method isto index all the nodes in the database. This method hasthe benefit that the index size grows linearly with thenumber of nodes in the database, but suffers from lowpruning power, as each query node can have many falsepositive matches (matches that cannot be extendedlater). Our NH-Index size is linear in the numberof nodes in the database and also has a high prun-ing power. NH-Index achieves this by incorporatingneighborhood information into the naive node indexingmethod. When matching a query node, instead of look-ing at the node in isolation, NH-Index also considersits neighborhood. A database node matches the querynode, only if the two nodes match and their neighbor-hoods also match. Using this technique, a large fractionof false positives can be eliminated.

A neighborhood is defined as the induced subgraphof a node and its neighbors (adjacent nodes). Thereare three main properties that characterize the neigh-borhood of a node: the number of neighbors, how theneighbors connect to each other, and the labels of theactual neighbors. The number of neighbors is simplythe degree of the node. To quantify the “connected-ness” amongst the neighbors, we define neighbor con-nection as the number of edges between the neighbors.For example, the neighbor connection of the black nodein Figure 1 is 5.

To capture the neighbors of a node, a naive methodis to simply enumerate the labels of the neighbors.However, this naive approach results in variable-lengthindex entries as well as large index size (in the worstcase of a clique, the storage cost is O(n2), where n isthe number of nodes in the database). An alternativeto the naive approach is to use a compact bit array tocapture the neighbors set. In the simple case when the

3

Figure 1. An example graph

total number of different labels in the problem domainis small (i.e. |Σv| is small), we can use a deterministicbit array to store the neighbors. The size of the bit ar-ray is equal to |Σv|, and each bit in the array indicateswhether a neighbor with a specific label exists (set to1) or not (set to 0). We call this bit array neighbor ar-ray. When |Σv| is a large number, using a deterministicbit array is very expensive. To handle this situation,we employ the Bloom filter approach [1]. We fix thesize of the bit array to be Sbit, where Sbit is a user-controllable parameter. A hash function is utilized tomap a node label to a bit array position. To improveprecision, multiple bit arrays and hash functions can beused to characterize the neighbors of a node. For sim-plicity, we only use one bit array to store the neighborinformation in this work.

In summary, the indexing unit of the NH-Index con-tains the following information: (label, degree, nbCon-nection, nbArray), where nbConnection is the neighborconnection of the node, and nbArray is the neighborarray.

4.2 Matching a Query Node

In the previous section, we discussed the indexingunit of the NH-Index. Next, given a query node, weexamine how our method finds the matching databasenodes. For ease of presentation, we first investigate thematching conditions for exact subgraph matching, andthen extend it to approximate subgraph matching.

For exact subgraph matching, in order to match aquery node to a database node, the two nodes musthave the same label. The degree of the query nodeshould be no more than that of the database node.The same condition holds for neighbor connections.Besides, the neighbors of the query node should havecorresponding matching nodes in the neighborhood ofthe database node.

For approximate matching, we want to tolerate somemisses in the match. We introduce a single user-definedparameter ρ, which is used to control the degree of ap-proximation. Intuitively, ρ is the percentage of neigh-bors of a query node that can have no correspondingmatches in the neighborhood of a database node. Inother words, nbmiss = (ρ×Nq.degree) neighbors of thequery node can be missing in the match to a databasenode. If nbmiss nodes are allowed to be missing, then at

most nbcmiss = nbmiss× (nbmiss−1)/2+(Nq.degree−nbmiss)×nbmiss neighbor connections are allowed to bemissing in the match, i.e. in the worst case, the nbmiss

nodes all connect to each other, and also connect to allof the remaining (Nq.degree− nbmiss) nodes.

Note that we also support node mismatches (nodeswith different labels are matched) in TALE. For easeof presentation, we delay the discussion of node mis-matches to Section 4.5.1, and for now assume thatmatching nodes are required to have the same label.

Formally, the conditions for matching a query nodeto an NH-index entry for approximate subgraph match-ing is:

Ndb.label = Nq.label (4.1)

Ndb.degree ≥ Nq.degree− nbmiss (4.2)

Sbit∑

i=1

Miss(Ndb.nbArray[i], Nq.nbArray[i]) ≤ nbmiss

(4.3)

Ndb.nbConnection ≥ Nq.nbConnection− nbcmiss

(4.4)The Miss function in Equation 4.3 is defined as fol-

lows:

Miss(x, y) =

{1 if x = 0 and y = 1

0 otherwise

In fact, exact subgraph matching can be viewed asa special case of approximate subgraph matching whenρ = 0.

Note that the conditions expressed in Equations 4.1to 4.4 can result in producing some false positives.Our index serves as a filtering mechanism to prune thesearch space. These matches are then refined in thematching algorithm (Section 5).

4.2.1 Node Match Quality

Given a query node, there can be more than onedatabase node that satisfies the conditions specified inEquations 4.1 to 4.4. Each of these matches can have adifferent match quality. Therefore, we need to measurethe quality of the node matches. This quality metricwill then be used at a later step (see Section 5.2) fol-lowing the index probe. In this section, we describe thematch quality computation.

Let nbmiss be the actual number of missing neigh-

bors in the node match, and nbcmiss be the actual num-ber of missing neighbor connections. Then the fractionof missing neighbors of the query node can be defined as

fnb =fnbmiss

Nq.degree. And the fraction of missing neighbor

4

��

��

��

��

� � � �

� � � �

��

�

�

�

�

�

�

��

��

Figure 2. The hybrid index structure

connections can be defined as fnbc =gnbcmiss

Nq.nbConnection.

Then, we define the quality of a node match, w, as:

w =

{2 − fnbc if nbmiss = 0

2 − (fnb + fnbc

fnbmiss

) otherwise(4.5)

Note that fnbc is correlated with fnb, as more miss-ing neighbors is likely to result in more missing neigh-bor connections in the match. Therefore, we amor-

tize fnbc by the number of missing neighbors nbmiss inEquation 4.5. The value of (fnb + fnbc

fnbmiss

) falls between

0 and 2. We subtract this value from 2, so that higherw value means a better node match.

4.3 Index Structure

Next, we consider the index structure to implementthe NH-index. Rather than designing a new indexstructure, which makes adoption and implementationhard, it is desirable to consider using existing indexstructures that can implement the NH-index efficiently.A suitable index structure needs to support the condi-tions specified in Equations 4.1 through 4.4. We pro-pose a simple hybrid index structure (see Figure 2) forthe NH-Index.

This hybrid index structure has two levels. Thehighest level of the index structure is a B+-tree indexon node label, degree and neighbor connection. Thispart of the index is used for fast evaluation of the equal-ity search on node labels (Equation 4.1), range searchon node degrees (Equation 4.2) and neighbor connec-tions (Equation 4.4). Each leaf entry in the B+-tree in-dex points to a second-level index. This second-level in-dex has two components. The first is a list of databasenode ids that are represented by the B+-tree leaf in-dex entry. (Recall from Section 3 that every databasegraph node has a unique node id.) These nodes havethe same unique label, degree and neighbor connection.The second component is a bitmap index for the neigh-bor arrays of these database nodes. Each node has onecorresponding bit array in the bitmap. Figure 2 showsan example bitmap index for a B+-tree leaf entry that

is mapped to six distinct database nodes with the samelabel, degree and neighbor connection. The bitmap in-dex is used to expedite the evaluation of Equation 4.3using Algorithm 1 (discussed in detail below).

Note that our hybrid index structure is easily im-plemented in existing relational systems. The secondlevel indices can be implemented simply as a relationwith two attributes: one that stores the list of databasenodes, and the other that stores a bitmap (using anextensible data type). The first level index is simply aB+-tree built on this table. This simple implementa-tion is robust and allows us to easily realize the NH-Index.

4.4 Index Probing

Given a query node, we first utilize the label, de-gree and neighbor connection information to probe theB+-Tree index. Then, we obtain a list of bitmaps thatmust be further examined using the conditions spec-ified in Equation 4.3. An efficient algorithm for thisevaluation is shown in Algorithm 1. This algorithmcontains two steps. The first step (line 1 to 17) countsthe number of missing neighbors of the query node inthe match to each database node in a bitmap. The sec-ond step (line 18 to 30) prunes all the database nodeswith the number of missing neighbors higher than theuser threshold. We discuss these two steps in detailbelow.

If a position in the query neighbor array is set to1, but the corresponding position in a database neigh-bor array is 0, we count it as one miss. Step 1 of Al-gorithm 1 simulates the binary addition operation tocount the total number of misses. We keep a counterof countSize+1 bits (countSize = blog

2(nbmiss)c+1)

for each database node to record the number of misses.These counters are stored in the countSize + 1 bitvectors Count[0] to Count[countSize], i.e. vectorCount[0] stores the bit position 0 for all the counters,and so on. The algorithm scans through the queryneighbor array from the lowest bit (position 0) to thehighest bit (position Sbit − 1). If the current bit is1, then the algorithm negates the bits in the corre-sponding column of the bitmap index and adds all thebit values to the counters of the database nodes. Toavoid overflow, the highest bit Count[countSize] for adatabase node is set to 1 when the number of missesexceeds countSize bits. An example of the first step isshown in Step 1 of Figure 3.

The second step of Algorithm 1 prunes all thedatabase nodes with more than nbmiss misses. We usetwo bit vectors Resulteq and Resultlt to record thenodes with nbmiss misses and less than nbmiss misses,

5

Algorithm 1 Bitmap Probe for ApproximateSubgraph Matching (Nq, Bitmap, ρ)

Input: Nq is the query node, Bitmap is the bitmap in-dex to be probed, assuming that there are n nodesin the bitmap index and the size of neighbor ar-ray is Sbit, ρ is the percentage of neighbors of aquery node that can be missing in the match to adatabase node

Output: Resultle is the bit vector indicating whichnodes satisfy the query

1: // [Step 1]: count the number of missing neighbors2: nbmiss = bρ×Nq.degreec // the threshold for the

number of missing neighbors3: countSize = blog

2(nbmiss)c + 1

4: for i from 0 to countSize do5: Count[i] = (0, 0, ..., 0) // Count[i] is a bit vector

of size n6: end for7: for j from 0 to Sbit − 1 do8: if Nq.nbArray[j] = 1 then9: Carries = NOT Bitmap.Bj

10: for k from 0 to countSize− 1 do11: Temp = Count[k] AND Carries12: Count[k] = Count[k] XOR Carries13: Carries = Temp14: end for15: Count[countSize] = Count[countSize] OR

Carries16: end if17: end for18: // [Step 2]: only return nodes with no more than

nbmiss missing neighbors19: Resultlt = (0, 0, ..., 0) // Resultlt is a bit vector of

size n20: Resulteq = (1, 1, ..., 1) // Resulteq is a bit vector

of size n21: for k from countSize to 0 do22: if bit k of nbmiss’s binary format is 1 then23: Resultlt = Resultlt OR (Resulteq AND (NOT

Count[k]))24: Resulteq = Resulteq AND Count[k]25: else26: Resulteq = Resulteq AND (NOT Count[k])27: end if28: end for29: Resultle = Resultlt OR Resulteq

30: return Resultle

� � � � �

�

�

�

� � � �

� � � �

� � � �

��

��

��

��

��

� � � � �

��

��

��

��

�

�

�

�

�

�

�

�

��

� ��

��

�

�

�

�

�

�

�

�

!�� "��# !�� "�"�

� �

��$��$��

�

�

�

�

�

�

�

�

!�� "��# !�� "�"�

� �

�

�

�

�

�

�

�

�

!�� "��# !�� "�"�

� �

�

�

�

�

!�� "�"�

%&'(��

%&'(��

Figure 3. Example demonstrating Algorithm 1

respectively. As the algorithm scans the binary formatof nbmiss from the highest bit (position countSize) tothe lowest bit (position 0), it updates Resulteq andResultlt. Finally, the bitwise OR of the two vectorsgives us the right answer. Each position in the resultvector indicates whether the corresponding databasenode is in the query result or not. Figure 3 also showsan example of this step. This second step of Algo-rithm 1 resembles the algorithm for range predicatewith a bit-sliced index introduced in [15].

Next, we analyze the complexity of Algorithm 1.This algorithm takes O(Sbit × log(ρ× d)) bitwise oper-ations in step 1, where d is the degree of the query node.And step 2 takes O(Sbit) bitwise operations. Therefore,the complexity of Algorithm 1 is O(Sbit × log(ρ × d))bitwise operations on bit vectors. Usually, ρ×d is verysmall value, thus log(ρ × d) is even smaller, and oftennegligible.

We have also compared Algorithm 1 with a naivebitmap index probing method, which scans throughevery neighbor array in the bitmap index, and decideswhether the neighbor array satisfies the condition spec-ified in Equation 4.3. We set up a simulation to test theefficiency of Algorithm 1 against this naive method. Werandomly generated 12 bitmap indexes with increasingsizes. The smallest bitmap index contains neighbor ar-rays for 16 nodes, while the largest one contains neigh-bor arrays for 32768 nodes. Each neighbor array inthe bitmap has 32 bits. We use 50 randomly generatedquery neighbor arrays to probe these bitmap indexes.Algorithm 1 shows significant performance advantageover the naive method – the speedup ranges from 2Xfor the smallest index to more than 12X for the largestindex.

6

4.5 Extensions to the Basic Approach

Next we introduce several extensions to the basicindexing technique to improve the basic approach andhandle more general cases.

4.5.1 Node Mismatches

In the above indexing method, TALE requires twomatching nodes to have the same label. However,real applications often need to allow matchings be-tween nodes with different labels. We adopt the nodemismatch model introduced in [19], which implicitlygroups nodes based on a specific notion of similarity.In this model, the grouping of nodes is defined basedon the application domain, and two nodes are allowedto match only if they belong to the same group. Forexample, if a node represents a gene, then its groupmembership is defined by the orthologous group thatit belongs to (orthologous groups are organized basedon similar gene functionalities), and two nodes matchif they belong to the same orthologous group. To ac-commodate this model, we extend the basic indexingapproach by replacing the node labels with their cor-responding group labels and hashing the group labelsfor the bit arrays. The remaining indexing methodremains unchanged. In Section 6, we show how TALEcan be applied to real applications using this node mis-match model.

4.5.2 Directed Graphs

The above indexing method works for undirectedgraphs. However, it is fairly easy to extend it to handledirected graphs. In a directed graph, every edge hasdirection. Given a node, an adjacent edge either goestowards the node or away from the node. Therefore,the indexing unit becomes (label, in-degree, out-degree,in-nbConnection, out-nbConnection, in-nbArray, out-nbArray). For the index structure, we can build oneB+-Tree index on label, in-degree, out-degree, in-nbConnection and out-nbConnection. And each leafentry in the B+-Tree points to one bitmap index forthe in-nbArray and another bitmap index for the out-nbArray. Other candidate index structures are alsopossible.

4.5.3 Edge Labels

A simple extension can be made to the basic index-ing method proposed above to handle graphs with la-beled edges. In the basic method, we hash the labels ofneighbors to get the neighbor array. To handle labelededges, we hash (node label, edge label) pairs to produce

��

��

��

��

��

��

��

��

��

��

Figure 4. Overview of the matching algorithm

the neighbor arrays. The remaining index method isunchanged.

Any of the above extensions can be combined to-gether to meet the requirement of different applica-tions.

5 The Matching Algorithm

In this section, we introduce the approximate sub-graph matching algorithm. We first start with anoverview of this algorithm in Section 5.1, and then de-scribe the algorithm in detail in Sections 5.2 and 5.3.

5.1 Algorithm Overview

Our approximate subgraph matching algorithm isbased on the following two observations.

Observation 1: Some nodes in a graph play moreimportance roles in the graph structure than others. Asshown in Figure 1, some nodes (e.g. the black node)connect to many other nodes. If these nodes are ab-sent, then the graph structure quickly gets fragmented.In contrast, some nodes (e.g. the gray node) sit on theperiphery of the graph and only connect to few othernodes. The overall graph structure will not be dra-matically affected by removing these nodes. There arevarious ways of measuring the importance of a nodein a graph. For simplicity, we use the degree central-ity measure in this work. In this measure, nodes withhigh degrees are considered more important than nodeswith low degrees. In Section 6.5, we will evaluate theeffectiveness of this importance measure. Note that thedefinition of “importance” is flexible in TALE and cus-tomizable for specific application needs. TALE can beeasily extended to use other measures of node impor-tance, such as closeness, betweenness, and eigenvectorcentralities.

7

Algorithm 2 GrowMatch (Gq, Gdb, Mimp)

Input: Gq is the query graph, Gdb is the databasegraph, Mimp contains the matches for the impor-tant nodes in Gq

Output: M contains the node matches for the result-ing graph match

1: put all node matches fromMimp to a priority queueQ sorted by their qualities

2: while Q is not empty do3: pop up the best node match (Nq, Ndb) from Q4: put (Nq, Ndb) into M5: ExamineNodesNearBy(Gq, Gdb, Nq, Ndb, M ,

Q) // finding new matches for nodes nearby Nq

6: end while7: return M

Observation 2: A good approximate match shouldbe more tolerant towards missing unimportant nodesin the query than missing important nodes. In otherwords, most of the important nodes in the query shouldbe present in the match, while missing unimportantnodes is more tolerated. In addition, the number ofmatched important nodes, and the qualities of thesenode matches can be used to estimate the quality of anapproximate subgraph match.

Based on these two observations, we introduce anew approximate subgraph matching algorithm. Theoverview of this algorithm is as follows: First, thealgorithm selects a number of important nodes fromthe query based on the specified importance measure(degree centrality in this work), and then probes theNH-Index to find matching nodes for these importantquery nodes. These matching node pairs serve as an-chor points for producing graph matches. In the sec-ond step, for each matching database graph, the algo-rithm extends the graph match from the anchor pointsby progressively adding satisfiable nearby nodes of al-ready matched nodes. The entire matching process isoutlined in Figure 4.

5.2 Step 1: Match the Important Nodes

In this step, the algorithm selects a number of im-portant nodes from the query and probes the NH-Indexto match these important nodes.

The algorithm first needs to decide how many nodescount as important nodes. We introduce a parameterPimp, defined as the fraction of important nodes in thequery. Given Pimp, we sort the nodes in the query bytheir importance (degree centrality in this work) andselect the top Pimp percent as the important nodes.In Section 6.2, we show how to choose the Pimp value

Algorithm 3 ExamineNodesNearBy (Gq, Gdb,Nq, Ndb, Mc, Qc)

Input: Gq is the query graph, Gdb is the databasegraph, Nq is a node in Gq, Ndb is the node in Gdb

matched to Nq, Mc contains all the current nodematches found so far, Qc contains all the candidatenode matches to be examined

1: NB1q = immediate neighbors of Nq that have nomatches in Mc

2: NB2q = nodes two hops away from Nq that haveno matches in Mc

3: NB1db = immediate neighbors of Ndb that have nomatches in either Mc or Qc

4: NB2db = nodes two hops away from Ndb that haveno matches in either Mc or Qc

5: MatchNodes(Gq, Gdb, NB1q, NB1db, Mc, Qc)6: MatchNodes(Gq, Gdb, NB1q, NB2db, Mc, Qc)7: MatchNodes(Gq, Gdb, NB2q, NB1db, Mc, Qc)

based on graph properties of specific applications.After selecting the important nodes, the algorithm

probes the NH-Index for each important node as dis-cussed in Section 4.4. After the index probe, we obtaina list of database graphs that have matches for someor all of the important nodes in the query. A matchscore is also calculated for each matching node pair us-ing Equation 4.5. In the results produced by the indexprobes, a single important query node can be mappedto multiple database nodes and vice versa. Since themain purpose of this first step is to find the anchorpoints that can be expanded in the next step, we needto find one-to-one node mappings between the queryand database nodes. For this part, we use a maxi-mum weighted bipartite graph matching algorithm (us-ing node match scores as weights) from the LEDA-R 3.2 library (http://www.algorithmic-solutions.com/index.htm).

5.3 Step 2: Extend the Match

Step 1 of the matching algorithm produces a list ofcandidate database graphs. For each candidate graph,Step 2 of the algorithm utilizes the node matches pro-duced by Step 1 as anchor points to match the remain-ing nodes in the database and query graphs, and pro-duces the final graph match.

The overall idea of this step is as follows. For eachnode that is already matched, we try to match its“nearby” nodes (as described below these includes notjust the adjacent nodes, but also nodes that are twohops away). We perform this extension progressivelyuntil no more nodes can be added to the match. The

8

detailed algorithm is shown in Algorithm 2, 3 and 4.Algorithm 2 is the main procedure for step 2. It

first puts all the important node matches (the anchorpoints) into a priority queue sorted by the qualities ofthe node matches (cf. Section 4.2.1). In each iterationof the loop, we pop up the best node match (with thehighest quality) from the queue and put it into thefinal graph match. In addition, we examine the nearbynodes of the query node, as well as the nearby nodesof the database node, to see whether any of them canbe matched. If so, we add these new node matches tothe priority queue. This process ends when the priorityqueue is empty.

Algorithm 3 implements the ExamineNodesNearByfunction called by Algorithm 2. Based on a pair of al-ready matched nodes, this function tries to match theirnearby nodes. In order to allow more flexibility in theapproximate matching, we do not limit the matchingextensions to just adjacent nodes of the query nodeand the database node. Instead, this algorithm ex-amines nodes at most two-hops away from the querynode and the database node. Note that this algorithmis generic. It can be easily extended to match nodesmore than two-hops away to allow more approximation(at the expense of an increased computational cost).

Algorithm 4 shows the details of the MatchNodesfunction called by Algorithm 3. For each node fromthe given set of query nodes, this algorithm finds thebest matching node from the set of database nodes. Ifthe new node match does not conflict with any existingones in the priority queue, it is simply put into thepriority queue. However, if this node match is betterthan an existing match in the queue, the existing oneis replaced with the new one.

Note that our algorithm only produces one match foreach database graph. In some applications, users maywant more than one match for each database graph.In this case, we can extend our matching algorithm toretain more than one set of anchor points (in step 1,instead of only retaining the maximum weighted bipar-tite matching, also retain other high weighted maximalbipartite matchings) and then extend each of them toproduce a match.

6 Evaluation

In this section, we apply TALE to three real biolog-ical applications, and present results evaluating TALEwith three measures: effectiveness (whether the resultsproduced by the tool are useful and meaningful in reallife applications), efficiency and scalability.

Note that while the applications discussed in thispaper are from life sciences, TALE can be applied to

Algorithm 4 MatchNodes(Gq, Gdb, Sq, Sdb, Mc,Qc)

Input: Gq is the query graph, Gdb is the databasegraph, Sq is a set of nodes in Gq, Sdb is a set ofnodes in Gdb, Mc contains all the current matchesfound so far, Qc contains all the candidate matchesto be examined

1: for every node Nq in Sq do2: Ndb=the best mapping of Nq in Sdb

3: if Ndb=null then4: continue5: end if6: if Nq has no matches in Qc then7: put (Nq, Ndb) into Qc

8: remove Ndb from Sdb

9: else if (Nq, Ndb) is a better node match then10: remove the existing match of Nq from Qc

11: put (Nq, Ndb) into Qc

12: remove Ndb from Sdb

13: end if14: end for

any area in which there is a need for approximate sub-graph matching. Other such areas include comparingRDF graphs in semantic web applications, and compar-ing parse trees produced by natural language parsersfor literature mining. We have chosen to focus on lifesciences applications since we have actual collaboratorswho have ready applications for our tool.

TALE is implemented in C++ on top of PostgreSQL(http://www.postgresql.org). The execution timesreported in this section correspond to the runningtimes of this C++ program including the DBMS accesstimes. All experiments were run on a 2.8GHz Pentium4 Fedora Core 2 machine, with 2GB memory, and a250GB SATA disk. We use PostgreSQL version 8.1.3and set the buffer pool size to 512MB.

6.1 Experimental Datasets

BIND Dataset: We use the BIND [3] dataset (ver-sion May 25, 2006) to demonstrate the applicationof TALE for comparing Protein Interaction Networks(PINs). A PIN is a large graph, in which nodes repre-sent proteins and edges indicate protein-protein inter-actions. Comparing PINs of different species allows abiologist to discover the evolutionary conserved func-tional units across species. However, due to the higherror rate of detection methods, PINs are noisy in na-ture [18]. Therefore, approximate subgraph matchingis useful for comparing PINs.

KEGG Dataset: This dataset consists of biologi-

9

Degree

Freq

uenc

y

0 50 100 150

0.0

0.2

0.4

0.6

0.8

Degree

Freq

uenc

y

1 2 3 4 5 6 7 8

0.0

0.4

0.8

Degree

Freq

uenc

y

2 4 6 8 10 12 14

0.00

0.10

0.20

Figure 5. Degree distribu-tion for the BIND dataset

Figure 6. Degree distribu-tion for the KEGG dataset

Figure 7. Degree distribu-tion for the ASTRAL dataset

cal pathways from the well-known KEGG database [12](downloaded on Feb 28, 2007). We use this datasetto demonstrate the application of TALE for biologi-cal pathways analysis. A pathway is a directed graphwith nodes representing cellular entities such as pro-teins and regulatory elements, and edges representingtheir interactions. The graph shows the sequence ofactions that lead to different cellular entities interact-ing to carry out some function. Similar to PINs, eachinteraction in a pathway is based on an experimentallyobserved phenomenon. Therefore, pathway data areoften noisy and incomplete. Nature is very effective atmodularizing complex actions and reusing subcompo-nents. There are a lot of common building blocks inthe cellular machinery that often get “reused” in differ-ent pathways. By investigating the similarities sharedby different pathways, a biologist can generate varioushypotheses that can help refine the understanding of apathway of interest.

ASTRAL Dataset: To demonstrate the potentialapplication of TALE for Protein Structure Matching,we use the ASTRAL [8] dataset (version 1.71). Thisdataset contains the 3D structures of protein domains.A domain is an independent, self-stabilizing unit of aprotein, usually pertinent to the function of the pro-tein they belong to. In biology, structure similarity isoften a good indicator of function similarity. 3D struc-tures can be translated into contact graphs, and struc-ture matching can be achieved by approximate sub-graph matching on the corresponding contact graphs.In a contact graph, nodes represent amino acids (sincethere are 20 different kinds of amino acids, there are 20distinct node labels) and edges indicate that the cor-responding amino acids physically interact with eachother. This physical interaction is usually decided by athreshold of the contact distance. In our experiment,we used the widely used 7A threshold [7] to converteach domain 3D structure into a contact graph.

6.2 Parameterizations

In this section, we demonstrate how to choose thevalues of the parameters used in TALE for the threeexperimental datasets. TALE requires the setting ofthe following three parameters: the neighbor array sizeSbit in the NH-Index, the approximation ratio ρ, andthe fraction of important nodes Pimp in a query graph.

The size of the neighbor bit array is related to thenumber of node labels (or the number of group labelswhen allowing node mismatches, cf. Section 4.5.1) inan application. For protein structure matching, thereare only 20 amino acids. Two nodes of the contactgraphs can be matched only if they represent the sameamino acid. Therefore, we set Sbit to be 32 (to make itfit in an 32-bit integer), and we use the exact value ofthe amino acids to set the bit array (instead of using ahash function).

For pathway analysis and protein interaction net-works comparison, we need to match nodes based onfunction or sequence similarity. For the KEGG dataset,we utilize the KEGG Orthologous group (which classi-fies proteins based on function similarity). Two nodescan be matched only if they belong to the same KEGGOrthologous group. For the BIND dataset, we usedCD-HIT [11] to cluster the proteins based on their se-quence similarity1. And two nodes can be matchedonly if they belong to the same cluster. There are to-tally 8814 KEGG Orthologous groups and 22311 CD-HIT clusters. We set Sbit to be 64 and 96 for KEGGand BIND datasets, respectively. In fact, we have ex-perimented with other Sbit values, and there is no sig-nificant difference in performance for different Sbit val-ues. Therefore, in the interest of space, we do not showthe effect of different neighbor array sizes.

The approximation ratio ρ indicates the percentageof neighbors of a query node that can have no corre-sponding matchings in the neighborhood of a database

1Proteins in each cluster share at least 40% sequence identity.

10

Table 1. PINs of human, mouse and rat# nodes # edges

human 8470 11260mouse 2991 3347

rat 830 942

node. It is related to the similarity requirement of aspecific application. For simplicity, we set this param-eter to be 25% for all three applications.

The fraction of important nodes in a query graphPimp is highly associated with the graph properties inan application. As we use degree centrality to measurethe importance of a node, we study the degree distribu-tions of the 3 applications. The representative degreedistributions of the 3 datasets are shown in Figures 5through 7. The degree distribution for the KEGG andBIND datasets is highly right skewed. In fact, studieshave shown that both pathways and PINs show power-law degree distribution [14]. Only very small fractionof nodes have high degrees. We set Pimp = 15% for theKEGG and BIND datasets. For the ASTRAL dataset,the degree distribution is bell shape. Around 25% ofnodes have degree more than 8, which we consider asimportant nodes (Pimp = 25%).

6.3 Effectiveness Evaluation

In this section, we present results examining the ef-fectiveness of TALE. We also compare TALE to C-Tree [6], SAGA [19], and Graemlin [4].

6.3.1 Protein Interaction Networks Compari-son

Graph matching techniques are used on PINs to findconserved components shared between the query net-work and each network in the database. The PIN for awell studied species is usually a large graph with hun-dreds to thousands of nodes and edges. C-Tree [6] isnot applicable for comparing PINs as the implementa-tion does not allow node mismatches (nodes with dif-ferent labels to be matched), which is a requirementfor this application. On the other hand, TALE handlesnode mismatches by utilizing the group labels producedby existing protein clustering tools (see Section 6.2).SAGA can be used for querying PINs, but queryingsuch large graphs using SAGA is prohibitively expen-sive.

For comparing PINs, the tools most closely relatedto TALE are NetworkBlast [16], MaWISh [13] andGraemlin [4]. Since these tools largely deal with pair-wise comparison, we only focus on pairwise PIN com-

Table 2. Effectiveness for comparing PINs# KEGGs KEGG time

hit coverage (sec)rat vs. human

Graemlin 0 NA 910.0TALE 6 3.2% 0.3

mouse vs. humanGraemlin 18 5.0% 16305.5

TALE 42 13.6% 0.8

parison in this experiment. In [4], the authors showedthat Graemlin is better at identifying conserved func-tional modules than the other methods. Therefore, weonly compare TALE with Graemlin.

We choose the PINs of three well studied mammals:human, mouse and rat for this experiment. The statis-tics for these three networks are described in Table 1.

We use both TALE and Graemlin (using code down-load from http://graemlin.stanford.edu/) to querythe rat and the mouse PINs against the human PIN.We compare the two methods using the effectivenessmeasures: the number of KEGGs hit and the averageKEGG coverage as proposed in [4]. The number ofKEGGs hit is the number of pathways in the KEGGdatabase [12] aligned between 2 species. A KEGGpathway is considered as a hit if at least 3 proteinsin the pathway are aligned to their counterparts in thepathway of the other species. KEGG coverage is thefraction of proteins aligned within a pathway.

As shown in Table 2, TALE achieves significantlarger number of KEGGs hit and better average KEGGcoverage than Graemlin. Most noticeable is the big dif-ference in running time. TALE only takes about 1 sec-ond for the two queries while Graemlin takes 4.8 hours.In addition, TALE only takes about 1 second to buildthe index on the human PIN.

Table 3. The statistics of KEGG pathways forthe 7 well-studied model species

#pathways avg #nodes avg #edgeshuman 173 83.3 38.5mouse 169 83.8 38.3

rat 161 83.6 30.3fly 103 97.3 12.7

worm 97 100.4 13.4yeast 87 106.1 20.8ecoli 95 102.1 25.1total 885 91.2 28.3

11

prec

isio

n

0

0.2

0.4

0.6

0.8

1

recall0 0.2 0.4 0.6 0.8 1

TALESAGA pr

ecis

ion

0

0.2

0.4

0.6

0.8

1

recall0 0.2 0.4 0.6 0.8 1

TALESAGA pr

ecis

ion

0

0.2

0.4

0.6

0.8

1

recall0 0.2 0.4 0.6 0.8 1

TALESAGA

Figure 8. ROC curves forhuman pathways

Figure 9. ROC curves formouse pathways

Figure 10. ROC curves forrat pathways

6.3.2 Biological Pathways Analysis

This experiment uses the KEGG pathway dataset.Again since we allow node mismatches in pathway anal-ysis, C-Tree is not applicable. Therefore for this exper-iment, we only compare TALE with SAGA. SAGA wasconfigured using the parameter settings in [19].

Close related species in evolution (e.g. mouse andrat) share significant similarities in their correspondingpathways. For example, the WNT pathways of mouseand rat are very similar to each other, but are moredifferent from the WNT pathway of fly. This providesus a way to evaluate the effectiveness of TALE for theapplication of biological pathways analysis.

For this experiment, we choose the pathways of the7 well-studied model species in KEGG: human, mouse,rat, fly, worm, yeast and ecoli. The statistics of thisdataset is summarized in Table 3. Human, mouse andrat are more closely related to each other in evolution-ary than the other 4 species. Therefore, we expect theresults produced by TALE can reflect the fact that thepathways for these 3 species will be more similar toeach other.

We used every pathway for human, mouse and ratto query the database. Note that for some large queries(e.g. human fatty acid biosynthesis pathway with 163nodes and 151 edges), SAGA could not finish in a rea-sonable amount of time (taking over 1 hour), whileTALE can finish every query within 1.8 seconds. Forthe queries SAGA can finish within 1 hour, the averagerunning time for SAGA is about 12 seconds, while theaverage time for TALE is 0.14 seconds.

To evaluate the effectiveness of the results, we em-ploy the measures: recall and precision. Recall is de-fined as the fraction of the retrieved relevant resultsout of all the relevant results. Precision is the fractionof the retrieved relevant results out of all the retrievedresults. A matching result is considered relevant if it isthe same pathway from a species close in evolution. Forexample, if the query is human WNT pathway, then arelevant result can be human, mouse or rat WNT path-

way.To keep this experiment manageable, we kill any

SAGA query if it runs over 1 hour. For fair compari-son, we only compare TALE with SAGA for the queryresults that SAGA can finish within the time limit.We employ the SAGA distance model (using the de-fault parameters in [19]) to rank the results returnedby both TALE and SAGA. We compute the averageprecision and recall values for human, mouse and ratpathway queries. The ROC curves are shown in Fig-ure 8 to Figure 10. SAGA and TALE show very com-parable effectiveness for pathway analysis, with SAGAhaving a slight advantage.

6.3.3 Protein Structure Matching

In this experiment, we evaluate the effectiveness ofTALE for protein structure matching using the AS-TRAL dataset.

This application generally does not require nodemismatches, therefore we can compare TALE with C-Tree. However, the C-Tree implementation that we gotfrom the authors is memory-based. In other words, thewhole index needs to reside in memory for query pro-cessing. Naturally, as the database size increases, theindex will soon grow out of memory. For example,C-Tree cannot build an index on the entire ASTRALdataset (which has 75626 domains). In contrast, NH-Index is a disk-based index technique and is not limitedby the memory size. As we will show in Section 6.4.3,TALE can easily handle the entire ASTRAL dataset,and our disk-based index structure scales nicely withincreasing database sizes. For a fair comparison, weemploy the similarity model used by C-Tree [6] to rankthe matching results.

ASTRAL contains 75626 domains, which are classi-fied into 7275 families. Domains in each family presentsignificant structural similarity. This provides us witha way of evaluating the effectiveness of TALE: largefraction of the top matching results are expected tobelong to the same family of the query domain.

12

prec

isio

n

0

0.2

0.4

0.6

0.8

1

recall0 0.2 0.4 0.6 0.8 1

TALEC−Tree

Figure 11. ROC curves using the ASTRALdatasetWe test TALE and C-Tree on a subset of ASTRAL,

so that C-Tree can hold the index in memory. Thedataset is created as follows: We randomly choose 1300families (with more than 10 domains in each family),and then randomly choose 10 domains from each fam-ily. The average number of nodes and edges for eachgraph are 186.6 and 734.2, respectively.

We randomly choose 20 queries (with 346.4 nodesand 971.6 edges per graph on average) from the 13000domains. We gradually increase the number of resultsreturned by TALE and C-Tree, and measure the meanrecall and mean precision for both methods. The recalland precision ROC curves are shown in Figure 11. Theprecision for both methods stays very high until the re-call reaches round 0.6. This is because both methodsreturn relevant results as their top results. However,as the recall further increases, the precision drops moresteeply. After the recall reaches around 0.8, return-ing more results will not improve the recall any more.This is because the classification system in ASTRALis not purely based on structure similarity, but alsoon extensive domain knowledge. No method based onpure structural similarity is likely to perfectly matchthis classification system. However, TALE could po-tentially be used for classifying novel family membersin combination with the domain knowledge providedby experts.

Although TALE and C-Tree are very comparable intheir effectiveness for this dataset, TALE is faster thanC-Tree. The average running time for the 20 queries

Table 4. Four BIND sub-datasets for the scal-ability experiment

avg avg index index#graphs #nodes #edges size time

D1 10 939.1 1093.2 1.4MB 13.2sD2 20 938.5 1691.9 2.9MB 31.1sD3 30 939.5 1920.7 4.5MB 50.4sD4 40 940.1 1743.6 5.7MB 62.7s

Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10

Run

ing

Tim

e (s

ec)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Queries (#nodes, #edges)

(63,

52)

(93,

77)

(100

,67)

(146

,99)

(182

,127

)

(306

,313

)

(601

,407

)

(185

0,15

93)

(299

1,33

47)

(305

9,48

50)

D1 D2 D3 D4

Figure 12. Scalability Experiment using the BINDdataset

is 34.8 seconds using TALE, but 61.9 seconds using C-Tree. TALE is almost 2 times faster than C-Tree (eventhough it is a disk-based implementation and is goingthrough PostgreSQL).

6.4 Efficiency and Scalability Evaluation

In this experiment, we test the efficiency and scala-bility of TALE for the three applications.

6.4.1 Experiment on BIND Dataset

In this experiment, we evaluate the efficiency and scal-ability of TALE on the BIND dataset. BIND has PINsfor 757 species, but most PINs are incomplete. Wechoose the largest 40 PINs from BIND. The largestgraph contains 8470 nodes and 11260 edges. The small-est of these 40 PINs contains 45 nodes and 105 edges.On average, each graph has 940.1 nodes and 1743.6edges. The characteristic of this data is that it con-tains large-sized graphs. To measure the scalability ofTALE, we formed 4 datasets D1 to D4 with increasingsizes 2. The statistics of the four datasets are sum-marized in Table 4. The index sizes and the indexconstruction times are also shown in this table. As thedatabase size increases, the index size grows at a near-linear rate and the index construction time increasessteadily.

We choose the 10 graphs in dataset D1 as thequeries. For this experiment, we do not restrict thenumber of results returned by each query. The execu-tion time for the 10 queries on the 4 datasets is shownin Figure 12. Even for the largest query with 3059

2The 4 datasets are formed as follows. We first divide the 40PINs into 4 balanced groups each with 10 PINs and roughly sametotal number of nodes. We randomly select one group as D1,randomly add another group to D1 to form D2, then randomlyadd one of the remaining groups to D2 to form D3, finally D4contains all the 4 groups.

13

nodes and 4850 edges on the largest D4 dataset, thequery executes in about 0.7 seconds. The executiontime grows as the size of the database increases. Formost queries, the growth ratio shows near-linear trend.Note that query execution time is not just influencedby the query and database sizes, but also by the resultcardinality. In Figure 12, Q2, Q3 and Q4 increase inthe query size, but the execution time increases fromQ2 to Q3 while decreases from Q3 to Q4 for D2, D3 andD4 datasets. The reason is that Q3 has more databasematches than Q2 and Q4. (Recall that in this experi-ment, we do not restrict the number of results returnedby each query.) For Q3, there is a jump from D1 to D2,because more matching graphs are found in D2. Butthe number of matches remain roughly the same fromD2 to D4 (and so does the execution time). Similarexplanations apply to other queries in this figure.

6.4.2 Experiment on KEGG Dataset

In this section, we test the efficiency and scalability ofTALE on increasing sizes of KEGG pathway databases.The smallest dataset contains all the human and mousepathways. We increase the database size by includingpathways of more species until it contains all the 41550KEGG pathways of 538 species. The index construc-tion time and index sizes for these increasing databasesare shown in Figure 13 and Figure 14, respectively.Our indexing technique indexes the neighborhood ofeach database node. This novel technique gives us thenear linear increase in the index construction time andindex size as shown in Figure 13 and Figure 14.

To test query execution time, we randomly selected20 human pathways (76.7 nodes and 29.9 edges pergraph on average). For each query, we ran TALE toget the top 20 results. The average running time for the20 queries on increasing database sizes is reported inFigure 15. As shown in this figure, the query executiontime increases steadily with the database size.

6.4.3 Experiment on ASTRAL Dataset

In this experiment, we evaluate the efficiency and scala-bility of TALE on the ASTRAL datasets with increas-ing sizes. The smallest dataset contains 200 graphs,while the largest one contains all the 75626 graphs inASTRAL. As shown in Figure 16 and Figure 17, theindex construction time and index size show steadygrowth with increasing database size.

We randomly selected 20 queries (153.1 nodes and592.0 edges per graph on average) from the smallestdataset, and ran it on the increasing sized databases.For each query, we only retain the top 20 results. Theaverage execution time for the 20 queries is shown in

Figure 18. The running time scales nicely with thedatabase size.

6.5 Discussion and Summary

We note that TALE is a heuristic algorithm. It doesnot guarantee that it will find the best or all matches.However, given that finding the best/all matches isNP-hard [2] and infeasible in practice, heuristics areinevitable. For most real graphs, our heuristics achievehigh accuracy compared with existing tools, as shownin our experiments.

In this work, we have used degree centrality to mea-sure the importance of nodes. To show the effectivenessof this measure, we compare TALE to a variant calledTALE-Random, where the “important” nodes are sim-ply a randomly selected subset of the nodes. We ranthe BIND mouse vs human test (Table 2, Row 3) usingTALE-Random. We compare the number of matchingnodes, the number of matching edges, the number ofKEGGs hit and the average KEGG coverage for thetwo methods. The results are 106, 61, 42, 13.6% forTALE and 85, 24, 8, 5.8% for TALE-Random. Thistest shows the effectiveness of this node importancemeasure for this application.

To summarize the experimental section, our exten-sive empirical evaluation demonstrates the effective-ness, efficiency and scalability of TALE. We have com-pared TALE to three existing tools, SAGA, C-Tree andGraemlin. TALE is a flexible tool and the only toolthat can easily be applied across the three applicationsconsidered in our evaluation. Furthermore, TALE pro-duces useful and meaningful results for all the threeapplications, and is also significantly faster than theseexisting tools. Our results also show that TALE is scal-able for large queries and large databases.

7 Conclusions and Future Work

In this paper we have presented TALE – an ap-proximate subgraph matching tool for matching graphqueries with a large number of nodes and edges. TALEemploys a novel indexing technique, which achievesa high pruning power and scales linearly with thedatabase size. This index structure can be easily im-plemented in existing relational systems. The innova-tive matching algorithm used by TALE distinguishesnodes by their importance to the graph structure. Thisalgorithm first matches the important nodes in thequery, and then extends them to produce larger graphmatches. TALE is a general tool for approximate sub-graph matching queries, and can be easily customizedto meet the requirement of different applications. Our

14

Con

stru

ctio

n Ti

me

(sec

)

0

20

40

60

80

100

120

Database Size (# graphs)0 7000 14000 21000 28000 35000 42000

Inde

x S

ize

(MB

)

0

100

200

300

400


Exe

cutio

n Ti

me

(sec

)

0

0.2

0.4

0.6

0.8

1

1.2


Figure 13. Index Construc-tion Time with IncreasingKEGG Database Size

Figure 14. Index Size withIncreasing KEGG DatabaseSize

Figure 15. Query ExecutionTime with Increasing KEGGDatabase Size

Con

stru

ctio

n Ti

me

(sec

)

0

1000

2000

3000

4000

5000

6000

Database Size (# graphs)0 19000 38000 57000 76000

Inde

x S

ize

(MB

)

0

440

880

1320

1760

2200


Exe

cutio

n Ti

me

(sec

)

0

15

30

45

60

75

90


Figure 16. Index construc-tion time for the ASTRALdataset

Figure 17. Index size for theASTRAL dataset

Figure 18. Query execu-tion time for the ASTRALdataset

empirical evaluations demonstrate the improved effec-tiveness and efficiency of TALE over existing methods.As part of future work, we plan on applying TALE toother applications, such as social networks and RDFgraph datasets, to further evaluate the generality ofTALE.

Acknowledgment

This research was primarily supported by the Na-tional Science Foundation under grant DBI-0543272,the National Institutes of Health under grant 1-U54-DA021519-01A1 and by an unrestricted research giftfrom Microsoft Corp.

References

[1] B. H. Bloom. Space/time trade-offs in hash codingwith allowable errors. Commun. ACM, 13(7):422–426, 1970.

[2] H. Bunke. On a relation between graph edit dis-tance and maximum common subgraph. PatternRecogn. Lett., 18(8):689–694, 1997.

[3] C. Alfarano et al. The biomolecular interactionnetwork database and related tools 2005 update.Nucleic Acids Res., 33:D418–D424, 2005.

[4] J. Flannick, A. Novak, B. S. Srinivasan, H. H.McAdams, and S. Batzoglou. Græmlin: Generaland robust alignment of multiple large interactionnetworks. Genome Res., 16:1169–1181, 2006.

[5] M. R. Garey and D. S. Johnson. Computersand Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York,NY, USA, 1979.

[6] H. He and A. K. Singh. Closure-tree: an indexstructure for graph queries. In ICDE, 2006.

[7] J. Hu, X. Shen, Y. Shao, C. Bystroff, and M. J.Zaki. Mining protein contact maps. In BIOKDD,2002.

[8] J. Chandonia et al. The astral compendium in2004. Nucleic Acids Res., 32:D189–D192, 2004.

[9] H. Jiang, H. Wang, P. S. Yu, and S. Zhou. Gstring:A novel approach for efficient search in graphdatabases. In ICDE, 2007.

[10] B. P. Kelley, B. Yuan, F. Lewitter, R. Sharan,B. R. Stockwel, and T. Ideker. Pathblast: a toolfor alignment of protein interaction networks. Nu-cleic Acids Res., pages W83–W88, 2004.

15

[11] W. Li, L. Jaroszewski, and A. Godzik. Clusteringof highly homologous sequences to reduce the sizeof large protein database. Bioinformatics, 17:282–283, 2001.

[12] M. Kanehisa et al. The kegg resources for deci-phering the genome. Nucleic Acids Res., 32:D277–D280, 2004.

[13] M. Koyuturk et al. Pairwise alignment of proteininteraction networks. Journal of ComputationalBiology, 13(2):182–199, 2006.

[14] M. E. J. Newman. The structure and function ofcomplex networks. SIAM Review, 45:167, 2003.

[15] P. O’Neil and D. Quass. Improved query perfor-mance with variant indexes. In SIGMOD, 1997.

[16] R. Sharan et al. Conserved patterns of proteininteraction in multiple species. PNAS, 102:1974–1979, 2005.

[17] D. Shasha, J. T.-L. Wang, and R. Giugno. Algo-rithmics and applications of tree and graph search-ing. In PODS, 2002.

[18] E. Sprinzak, S. Sattath, and H. Margalit. How re-liable are experimental protein-protein interaction

data? Journal of Molecular Biology, 327(5):919–923, 2003.

[19] Y. Tian, R. C. McEachin, C. Santos, D. J. States,and J. M. Patel. SAGA: a subgraph matching toolfor biological graphs. Bioinformatics, 23(2):232–239, 2007.

[20] J. R. Ullmann. An algorithm for subgraph isomor-phism. J. ACM, 23(1):31–42, 1976.

[21] D. Williams, J. Huan, and W. Wang. Graphdatabase indexing using structured graph decom-position. In ICDE, 2007.

[22] X. Yan, P. S. Yu, and J. Han. Graph indexing: afrequent structure-based approach. In SIGMOD,2004.

[23] X. Yan, P. S. Yu, and J. Han. Substructure sim-ilarity search in graph databases. In SIGMOD,2005.

[24] X. Yan, F. Zhu, J. Han, and P. S. Yu. Search-ing substructures with superimposed distance. InICDE, 2006.

[25] S. Zhang, M. Hu, and J. Yang. Treepi: A newgraph indexing method. In ICDE, 2007.

16