TALE: A Tool for Approximate Large Graph Matching
Yuanyuan Tian, Jignesh M. Patel
EECS Department, University of Michigan,
Ann Arbor, Michigan, USA
Large graph datasets are common in many emerging database applications, most notably in large-scale scientific applications. To fully exploit the wealth of information encoded in graphs, effective and efficient graph matching tools are critical. Due to the noisy and incomplete nature of real graph datasets, approximate, rather than exact, graph matching is required. Furthermore, many modern applications need to query large graphs, each of which has hundreds to thousands of nodes and edges.
This paper presents a novel technique for approximate matching of large graph queries. We propose a novel indexing method that incorporates graph structural information in a hybrid index structure. This indexing technique achieves high pruning power and the index size scales linearly with the database size. In addition, we propose an innovative matching paradigm to query large graphs. This technique distinguishes nodes by their importance in the graph structure. The matching algorithm first matches the important nodes of a query and then progressively extends these matches. Through experiments on several real datasets, this paper demonstrates the effectiveness and efficiency of the proposed method.
Graphs provide a natural way to model data in a wide variety of applications, such as social networks, road networks, network topology, protein interaction networks and protein structures. Many graph databases are growing rapidly in size. The growth is both in the number of graphs and the sizes of graphs (the number of nodes and the number of edges). For example, the number of interactions (edges in protein interaction networks) in the BIND database grew about 10-fold from September 2002 to September 2004, and almost doubled after that. The number of protein structures (graphs) in the ASTRAL database has increased more than 3-fold since 2002. There is a critical need for efficient and effective graph querying tools for querying and mining these growing graph databases.
The database community has had a long-standing interest in querying graph databases [6, 9, 17, 19-25]. These previous studies have mostly been carried out within the context of precise graph data, and have focused on exact graph or subgraph matching queries. However, many real graph datasets are noisy and incomplete in nature. For example, it is well known that protein interaction networks produced by high-throughput methods contain many false positives. Moreover, the discovered interactions only represent a small fraction of the true network. As a result, exact graph or subgraph matching often fails to produce useful results.
In contrast, approximate graph or subgraph matching plays a critical role in these applications. Approximate matching allows node/edge insertions and deletions, and node/edge mismatches. Furthermore, many new graph applications prefer approximate matching results rather than exact ones, as they can provide more information such as what might be missing or spurious in a query or a database graph.
In addition, most existing graph matching methods are applicable to databases that contain graphs with small sizes, i.e. each graph has a small number (tens) of nodes and edges. Moreover, the query graphs allowed in these methods are also small in size. However, in many new applications, both the query and database graphs are large. Each graph can contain hundreds to thousands of nodes and edges. For example, in life sciences applications, protein interaction networks for individual species are often matched to determine similarities and differences across species. Each protein interaction network is large, and typically contains hundreds to thousands of nodes and edges.
The problem that we address in this paper is approximate subgraph matching of large query graphs. Namely, given a large query graph, with hundreds to thousands of nodes and edges, and a database of large graphs, we want to find the subgraphs in the database that are similar to the query.
In this paper we present an index-based method for approximate subgraph matching, called TALE (a Tool for Approximate Subgraph Matching of Large Queries Efficiently). TALE employs a novel graph indexing method, called NH-Index (Neighborhood Index). Most existing graph indexing methods only index subgraphs (paths, trees or general subgraphs), which can lead to index sizes that are exponential in the database size. The indexing unit of NH-Index is the neighborhood of each database node. The neighborhood concept captures the local graph structure around each node, and results in an index with a high pruning power. At the same time, the number of indexing units is equal to the number of nodes in the database, which allows the index to grow linearly with the database size. Furthermore, NH-Index is a disk-based index, which allows it to handle graph databases that do not fit in memory. It employs a hybrid index that uses existing common disk-based index structures, which makes implementation in existing DBMSs straightforward.
We also propose an innovative matching paradigm for querying large graphs. Unlike most previous graph matching tools, which treat every node in a graph equally, this matching technique distinguishes nodes by their importance in the graph structure. The algorithm first probes the NH-Index to match the important nodes in a query graph, and then progressively extends the matches by enclosing satisfiable nearby nodes of already matched nodes.
We have applied TALE to three real biological datasets. Our experiments demonstrate that TALE is able to produce useful and meaningful results in all three cases. In addition, our experimental evaluation shows that TALE is very efficient for large queries, and that the execution time grows gracefully with an increasing number of graphs in the database. Through comparisons with other existing tools, we also show that TALE is significantly faster than existing methods.
The main contributions of this paper are as follows:
(1) We propose TALE, a general tool for approximate subgraph matching of large graph queries. TALE uses a novel disk-based indexing method, which indexes the neighborhood of each database node. It achieves high pruning power and its size scales linearly with the database size. We introduce an innovative graph matching paradigm, which distinguishes nodes by their importance in the graph structure, and accordingly treats them differently in the matching process.
(2) By applying TALE to real applications, we show its effectiveness, significant performance improvements over existing methods, and ability to gracefully handle large graph queries and databases.
The remainder of this paper is organized as follows: Related work is presented in Section 2. Section 3 defines the preliminary concepts. Section 4 describes our indexing mechanism, and Section 5 introduces the TALE algorithm. Experimental results are presented in Section 6, and Section 7 contains our conclusions and directions for future work.
2 Related Work
There is a long history of database research on methods for querying graphs. However, most previous work has focused on exact graph or subgraph matching, i.e. graph or subgraph isomorphism. Subgraph isomorphism is known to be NP-complete. Ullmann proposed a subgraph matching algorithm based on a state space search method with backtracking. However, this algorithm is prohibitively expensive for querying a database with a large number of graphs. To reduce the search space, GraphGrep, GIndex and TreePi index substructures of the database (paths, frequent subgraphs and trees respectively) to filter out graphs that do not match the query.
Several index-based methods for approximate subgraph matching have also been proposed. However, most of these techniques only apply to small graphs and allow limited approximation. Grafil and PIS are both built on top of the exact subgraph matching method GIndex. However, neither method allows node insertion or deletion in its match model. CDIndex only applies to graphs with limited sizes, as it exhaustively enumerates and indexes all the subgraphs in the database. GString utilizes sequence matching to answer graph queries, but it only applies to applications in which the graphs contain a small number of basic substructures. C-Tree, which employs an R-tree-like index structure, is a more general tool than the above methods. In Section 6, we compare TALE with C-Tree. A recent method, called SAGA, employs a flexible graph similarity model. While SAGA is very efficient for small graph queries, it is computationally expensive when applied to large graphs. In contrast, TALE focuses on approximate matching for large graph queries. In Section 6, we also compare TALE with SAGA.
The life science community has produced a vast number of protein interaction networks. Several tools for comparing protein interaction networks have been proposed. These include PathBlast, its successor NetworkBlast, MaWIsh, and Graemlin. Of these, Graemlin is the latest method and in many ways superior to the other methods for comparing protein interaction networks. In Section 6, we compare TALE with Graemlin.
A graph $G$ is denoted as $(V, E)$, where $V$ is the set of nodes and $E \subseteq V \times V$ is the set of (directed or undirected) edges. Nodes and edges can have labels specified by mappings $\phi: V \to \Sigma_V$ and $\psi: E \to \Sigma_E$ respectively, where $\Sigma_V$ is the set of node labels and $\Sigma_E$ is the set of edge labels. In order to uniquely identify a node, we assign a unique id to each node in a graph. We also impose an order on the ids. Our indexing method and matching algorithm support both directed and undirected graphs with labeled nodes and/or labeled edges. For ease of presentation, we present our method using undirected graphs with labeled nodes. Adaptations of our method to other graph types are fairly straightforward unless otherwise discussed. The simple adaptations are omitted in the interest of space.
Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be two graphs. An exact graph match (graph isomorphism) is a bijection mapping function $\lambda: V_1 \to V_2$, in which for every $v \in V_1$, $\phi(v) = \phi(\lambda(v))$, and $(u, v) \in E_1$ if and only if $(\lambda(u), \lambda(v)) \in E_2$. An exact subgraph match (subgraph isomorphism) from $G_1$ (the query) to $G_2$ (the target) is defined as $G_2' \subseteq G_2$, and $G_2'$ is an exact graph match for $G_1$.
Approximate graph matching allows node mismatches (i.e. $\phi(v) \neq \phi(\lambda(v))$), and node/edge insertions and deletions. We define an approximate graph match as a bijection mapping $\lambda: V_1' \leftrightarrow V_2'$, where $V_1' \subseteq V_1$ and $V_2' \subseteq V_2$. Similarly, an approximate subgraph match from $G_1$ (the query) to $G_2$ (the target) is defined as $G_2' \subseteq G_2$, and $G_2'$ is an approximate graph match for $G_1$.
An approximate subgraph matching tool often returns a large number of matches for a query. Often the user is only interested in the top-K results. To return the top-K results, TALE has to sort the matches based on their similarities to the query. Several graph similarity or distance models have been proposed, e.g. [2, 19]. Each model is meaningful for some applications, but there is no universal model that fits all applications. We do not want to limit the generality of TALE by tailoring it to a particular similarity model. Instead, we let the users customize the similarity method that best models their application, thereby allowing TALE to serve as a flexible graph matching tool that can be used in a variety of graph matching applications. Section 6 shows examples of how this similarity model can be customized in practice.
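As an illustration of the definitions in this section, the following minimal Python sketch represents an approximate match as a dictionary from query node ids to target node ids and ranks candidate matches with a user-supplied similarity function. The names `is_partial_bijection`, `count_similarity` and `top_k`, and the scoring rule itself (matched nodes plus preserved query edges), are assumptions made for this example and are not the similarity model used by TALE.

```python
import heapq


def is_partial_bijection(match, query_nodes, target_nodes):
    """Check that `match` (dict: query node -> target node) is a one-to-one
    mapping between a subset of the query nodes and a subset of the target nodes."""
    targets = list(match.values())
    return (
        set(match) <= set(query_nodes)
        and set(targets) <= set(target_nodes)
        and len(set(targets)) == len(targets)  # no target node is used twice
    )


def count_similarity(match, query_edges, target_edges):
    """Hypothetical similarity: number of matched nodes plus the number of
    query edges whose endpoints are both matched and adjacent in the target."""
    preserved = sum(
        1
        for u, v in query_edges
        if u in match
        and v in match
        and (
            (match[u], match[v]) in target_edges
            or (match[v], match[u]) in target_edges  # undirected edges
        )
    )
    return len(match) + preserved


def top_k(matches, k, similarity):
    """Return the k matches with the highest score under `similarity`."""
    return heapq.nlargest(k, matches, key=similarity)


# Toy query (path 1-2-3) and toy target (path a-b-c-d).
query_edges = {(1, 2), (2, 3)}
target_edges = {("a", "b"), ("b", "c"), ("c", "d")}
candidates = [
    {1: "a", 3: "d"},          # 2 nodes, 0 preserved edges
    {1: "a", 2: "b", 3: "c"},  # 3 nodes, 2 preserved edges
    {1: "a", 2: "b"},          # 2 nodes, 1 preserved edge
]
best = top_k(candidates, 2, lambda m: count_similarity(m, query_edges, target_edges))
```

Because the scoring function is passed in as an argument, a different similarity model can be substituted without changing the ranking code.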
4 The NH-Index
In this section, we introduce the novel indexing tech-nique, Neighborhood Index (NH-Index).
4.1 Indexing Unit
The first question that arises with a graph indexing method is which graph entities (e.g. nodes, edges, subgraphs) should be indexed. The NH-Index is used by the matching algorithm to match the important nodes in the query graph. These initial matches for the important nodes are then extended to produce the final matching results. A naive indexing method is to index all the nodes in the database. This method has the benefit that the index size grows linearly with the number of nodes in the database, but suffers from low pruning power, as each query node can have many false positive matches (matches that cannot be extended later). Our NH-Index size is linear in the number of nodes in the database and also has a high pruning power. NH-Index achieves this by incorporating neighborhood information into the naive node indexing method. When matching a query node, instead of looking at the node in isolation, NH-Index also considers its neighborhood. A database node matches the query node only if the two nodes match and their neighborhoods also match. Using this technique, a large fraction of false positives can be eliminated.
A neighborhood is defined as the induced subgraph of a node and its neighbors (adjacent nodes). There are three main properties that characterize the neighborhood of a node: the number of neighbors, how the neighbors connect to each other, and the labels of the actual neighbors. The number of neighbors is simply the degree of the node. To quantify the connectedness amongst the neighbors, we define neighbor connection as the number of edges between the neighbors. For example, the neighbor connection of the black node in Figure 1 is 5.
Figure 1. An example graph
To capture the neighbors of a node, a naive method is to simply enumerate the labels of the neighbors. However, this naive approach results in variable-length index entries as well as a large index size (in the worst case of a clique, the storage cost is $O(n^2)$, where $n$ is the number of nodes in the database). An alternative to the naive approach is to use a compact bit array to capture the neighbor set. In the simple case when the total number of different labels in the problem domain is small (i.e. $|\Sigma_V|$ is small), we can use a deterministic bit array to store the neighbors. The size of the bit array is equal to $|\Sigma_V|$, and each bit in the array indicates whether a neighbor with a specific label exists (set to 1) or not (set to 0). We call this bit array the neighbor array. When $|\Sigma_V|$ is a large number, using a deterministic bit array is very expensive. To handle this situation, we employ the Bloom filter approach. We fix the size of the bit array to be $S_{bit}$, where $S_{bit}$ is a user-controllable parameter. A hash function is utilized to map a node label to a bit array position. To improve precision, multiple bit arrays and hash functions can be used to characterize the neighbors of a node. For simplicity, we only use one bit array to store the neighbor information in this work.
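The two neighbor array variants described above can be sketched as follows. This is a minimal illustration in Python under stated assumptions: node labels are strings, the label alphabet is given as an ordered list, and `zlib.crc32` stands in for the unspecified hash function. The function names are hypothetical.

```python
import zlib


def deterministic_nb_array(neighbor_labels, alphabet):
    """One bit per label in `alphabet`; suitable when the label domain is small."""
    present = set(neighbor_labels)
    return [1 if label in present else 0 for label in alphabet]


def _position(label, s_bit):
    # Assumed hash function: CRC32 of the UTF-8 label, reduced modulo the array size.
    return zlib.crc32(label.encode("utf-8")) % s_bit


def hashed_nb_array(neighbor_labels, s_bit):
    """Fixed-size bit array with a single hash function (Bloom filter style)."""
    bits = [0] * s_bit
    for label in neighbor_labels:
        bits[_position(label, s_bit)] = 1
    return bits


def may_contain(bits, label):
    """True if a neighbor with `label` may exist; False means it definitely does not."""
    return bits[_position(label, len(bits))] == 1


small = deterministic_nb_array(["A", "C", "A"], ["A", "B", "C", "D"])
print(small)  # [1, 0, 1, 0]

large = hashed_nb_array(["kinase", "ligase"], 32)
print(may_contain(large, "kinase"))  # True
```

With a hashed array, two different labels can map to the same position, so a set bit can produce a false positive but never a false negative. A larger $S_{bit}$ reduces the collision rate at the cost of a larger index entry.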
In summary, the indexing unit of the NH-Index contains the following information: (label, degree, nbConnection, nbArray), where nbConnection is the neighbor connection of the node,...