TALE: A Tool for Approximate Large Graph Matching
Yuanyuan Tian, Jignesh M. Patel
EECS Department, University of Michigan,
Ann Arbor, Michigan, USA
Large graph datasets are common in many emerging database applications, and most notably in large-scale scientific applications. To fully exploit the wealth of information encoded in graphs, effective and efficient graph matching tools are critical. Due to the noisy and incomplete nature of real graph datasets, approximate, rather than exact, graph matching is required. Furthermore, many modern applications need to query large graphs, each of which has hundreds to thousands of nodes and edges.
This paper presents a novel technique for approximate matching of large graph queries. We propose a novel indexing method that incorporates graph structural information in a hybrid index structure. This indexing technique achieves high pruning power and the index size scales linearly with the database size. In addition, we propose an innovative matching paradigm to query large graphs. This technique distinguishes nodes by their importance in the graph structure. The matching algorithm first matches the important nodes of a query and then progressively extends these matches. Through experiments on several real datasets, this paper demonstrates the effectiveness and efficiency of the proposed method.
1 Introduction
Graphs provide a natural way to model data in a wide variety of applications, such as social networks, road networks, network topology, protein interaction networks and protein structures. Many graph databases are growing rapidly in size. The growth is both in the number of graphs and the sizes of graphs (the number of nodes and the number of edges). For example, the number of interactions (edges in protein interaction networks) in the BIND database grew about tenfold from September 2002 to September 2004, and almost doubled after that. The number of protein structures (graphs) in the ASTRAL database has more than tripled since 2002. There is a critical need for efficient and effective graph querying tools for querying and mining these growing graph databases.
The database community has had a long-standing interest in querying graph databases [6, 9, 17, 19-25]. These previous studies have mostly been carried out within the context of precise graph data, and have focused on exact graph or subgraph matching queries. However, many real graph datasets are noisy and incomplete in nature. For example, it is well known that protein interaction networks produced by high-throughput methods contain many false positives. Moreover, the discovered interactions only represent a small fraction of the true network. As a result, exact graph or subgraph matching often fails to produce useful results.
In contrast, approximate graph or subgraph matching plays a critical role in these applications. Approximate matching allows node/edge insertions and deletions, and node/edge mismatches. Furthermore, many new graph applications prefer approximate matching results rather than exact ones as they can provide more information such as what might be missing or spurious in a query or a database graph.
In addition, most existing graph matching methods are applicable to databases that contain graphs with small sizes, i.e. each graph has a small number (tens) of nodes and edges. Moreover, the query graphs allowed in these methods are also small in size. However, in many new applications, both the query and database graphs are large. Each graph can contain hundreds to thousands of nodes and edges. For example, in life sciences applications, protein interaction networks for individual species are often matched to determine similarities and differences across species. Each protein interaction network is large, and typically contains hundreds to thousands of nodes and edges.
The problem that we address in this paper is approximate subgraph matching of large query graphs. Namely, given a large query graph, with hundreds to thousands of nodes and edges, and a database of large graphs, we want to find the subgraphs in the database that are similar to the query.
In this paper we present an index-based method for approximate subgraph matching, called TALE (a Tool for Approximate Subgraph Matching of Large Queries Efficiently). TALE employs a novel graph indexing method, called NH-Index (Neighborhood Index). Most existing graph indexing methods only index subgraphs (paths, trees or general subgraphs), which can lead to index sizes that are exponential in the database size. The indexing unit of NH-Index is the neighborhood of each database node. The neighborhood concept captures the local graph structure around each node, and results in an index with a high pruning power. At the same time, the number of indexing units is equal to the number of nodes in the database, which allows the index to grow linearly with the database size. Furthermore, NH-Index is a disk-based index, which allows it to handle graph databases that do not fit in memory. It employs a hybrid index that uses existing common disk-based index structures, which makes implementation in existing DBMSs straightforward.
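To make the idea of one indexing unit per database node concrete, the following is a minimal in-memory sketch in Python. It is not the NH-Index itself: the fields stored per unit (node label, degree, and the set of neighbor labels), the function names `neighborhood_units` and `candidates`, and the `max_missing` tolerance are all assumptions chosen for illustration, and the disk-based hybrid structure is not modeled.

```python
from collections import defaultdict


def neighborhood_units(graphs):
    """Build one indexing unit per database node (illustrative only).

    graphs: dict mapping graph id -> (labels, edges), where labels maps
    node id -> label and edges is an iterable of undirected (u, v) pairs.
    The fields kept per unit are an assumption made for this sketch.
    """
    units = []
    for gid, (labels, edges) in graphs.items():
        adj = defaultdict(set)
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        for node, label in labels.items():
            nbrs = adj[node]
            units.append({
                "graph": gid,
                "node": node,
                "label": label,
                "degree": len(nbrs),
                "nbr_labels": frozenset(labels[n] for n in nbrs),
            })
    return units


def candidates(units, label, degree, nbr_labels, max_missing=0):
    """Return (graph, node) pairs whose neighborhood is compatible with a
    query node, tolerating up to max_missing missing neighbors
    (a hypothetical approximation rule, not taken from the paper)."""
    wanted = set(nbr_labels)
    hits = []
    for u in units:
        if u["label"] != label:
            continue  # prune on node label
        if u["degree"] < degree - max_missing:
            continue  # prune on degree
        if len(wanted - u["nbr_labels"]) > max_missing:
            continue  # prune on neighbor labels
        hits.append((u["graph"], u["node"]))
    return hits
```

The list `units` has exactly one entry per database node, which mirrors the linear growth described above. Raising `max_missing` loosens the filter, which is one simple way to picture how a neighborhood-based index could admit approximate matches.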
We also propose an innovative matching paradigm for querying large graphs. Unlike most previous graph matching tools which treat every node in a graph equally, this matching technique distinguishes nodes by their importance in the graph structure. The algorithm first probes the NH-Index to match the important nodes in a query graph, and then progressively extends the matches by enclosing satisfiable nearby nodes of already matched nodes.
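The two-phase paradigm can be pictured with the following Python sketch. It rests on several assumptions that are not stated in this section: importance is approximated by node degree, the seed matches for the important nodes are supplied by the caller rather than obtained from an index probe, and the extension step is a simple greedy rule. The names `important_nodes` and `extend_match` are hypothetical.

```python
from collections import deque


def important_nodes(adj, fraction=0.2):
    """Pick the top fraction of nodes, ranked by degree.

    Degree is an assumed stand-in for structural importance.
    """
    ranked = sorted(adj, key=lambda n: (-len(adj[n]), n))
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


def extend_match(q_adj, q_label, d_adj, d_label, seeds):
    """Greedily grow a match outward from seed pairs (query node -> db node).

    A query node with no compatible database neighbor is left unmatched,
    which is how this sketch tolerates missing nodes.
    """
    match = dict(seeds)
    used = set(match.values())
    frontier = deque(match)
    while frontier:
        q = frontier.popleft()
        d = match[q]
        for qn in sorted(q_adj[q]):
            if qn in match:
                continue
            # Look for an unused database neighbor with the same label.
            for dn in sorted(d_adj[d]):
                if dn not in used and d_label[dn] == q_label[qn]:
                    match[qn] = dn
                    used.add(dn)
                    frontier.append(qn)
                    break
    return match
```

For a query path q1-q2-q3 whose central node q2 is seeded onto a database node, `extend_match` pulls in whichever neighbors have compatible labels and leaves the rest unmatched, rather than rejecting the whole match as an exact method would.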
We have applied TALE to three real biological datasets. Our experiments demonstrate that TALE is able to produce useful and meaningful results in all three cases. In addition, our experimental evaluation shows that TALE is very efficient for large queries, and that the execution time grows gracefully with an increasing number of graphs in the database. Through comparisons with other existing tools, we also show that TALE is significantly faster than existing methods.
The main contributions of this paper are as follows:
(1) We propose TALE, a general tool for approximate subgraph matching of large graph queries. TALE uses a novel disk-based indexing method, which indexes the neighborhood of each database node. It achieves high pruning power and its size scales linearly with the database size. We introduce an innovative graph matching paradigm, which distinguishes nodes by their importance in the graph structure, and accordingly treats them differently in the matching process.
(2) By applying TALE to real applications, we show its effectiveness, significant performance improvements over existing methods, and ability to gracefully handle large graph queries and databases.
The remainder of this paper is organized as follows: Related work is presented in Section 2. Section 3 defines the preliminary concepts. Section 4 describes our indexing mechanism, and Section 5 introduces the TALE algorithm. Experimental results are presented in Section 6, and Section 7 contains our conclusions and directions for future work.
2 Related Work
There is a long history of database research on methods for querying graphs. However, most previous work has focused on exact graph or subgraph matching, i.e. graph or subgraph isomorphism. Subgraph isomorphism was proved to be NP-complete. Ullmann proposed a subgraph matching algorithm based on a state space search method with backtracking. However, this algorithm is prohibitively expensive for querying a database with a large number of graphs. To reduce the search space, GraphGrep, GIndex and TreePi index substructures of the database (paths, frequent subgraphs and trees respectively) to filter out graphs that do not match the query.
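As a point of reference for exact matching, the following Python sketch shows a plain backtracking search over the state space of partial node mappings. It is a simplified illustration and not Ullmann's algorithm: it omits the refinement step that prunes candidates, and the function name `subgraph_match` and the dictionary-based graph representation are assumptions made here.

```python
def subgraph_match(q_adj, q_label, d_adj, d_label):
    """Find one label-preserving mapping of the query into the data graph
    such that every query edge maps onto a data edge, or return None.

    Graphs are undirected: adj maps node -> set of neighbors.
    """
    # Place high-degree query nodes first so conflicts surface early.
    order = sorted(q_adj, key=lambda n: (-len(q_adj[n]), n))
    mapping = {}
    used = set()

    def consistent(q, d):
        if q_label[q] != d_label[d]:
            return False
        # Every already-mapped query neighbor must map to a neighbor of d.
        return all(mapping[qn] in d_adj[d] for qn in q_adj[q] if qn in mapping)

    def search(i):
        if i == len(order):
            return True
        q = order[i]
        for d in sorted(d_adj):
            if d not in used and consistent(q, d):
                mapping[q] = d
                used.add(d)
                if search(i + 1):
                    return True
                # Undo the assignment and try the next candidate.
                del mapping[q]
                used.discard(d)
        return False

    return dict(mapping) if search(0) else None
```

The search returns `None` as soon as a single query node or edge has no exact counterpart, which illustrates why exact matching is brittle on noisy or incomplete data. In the worst case it explores a number of partial mappings that is exponential in the query size.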
Several index-based methods for approximate subgraph matching have also been proposed. However, most of these techniques only apply to small graphs and allow limited approximation. Grafil and PIS are both built on top of the exact subgraph matching method GIndex. However, neither method allows node insertion or deletion in its match model. CDIndex only applies to graphs with limited sizes, as it exhaustively enumerates and indexes all the subgraphs in the database. GString utilizes sequence matching to answer graph queries, but it only applies to applications in which the graphs contain a small number of basic substructures. C-Tree, which employs an R-tree like index structure, is a more general tool than the above methods. In Section 6, we compare TALE with C-Tree. A recent method, called SAGA, employs a flexible graph similarity model. While SAGA is very efficient for small graph queries, it is computationally expensive when applied to large graphs. In contrast, TALE focuses on approximate matching for large graph queries. In Section 6, we also compare TALE with SAGA.
The life science community has produced a vast number of protein interaction networks. Several tools for comparing protein interaction networks have been proposed. These include PathBlast, its successor NetworkBlast, MaWIsh, and Graemlin. Of these, Graemlin is the latest method and in many ways superior to the other methods for comparing protein interaction networks. In Section 6, we compare TALE with Graemlin.
3 Preliminaries
A graph $G$ is denoted as $(V, E)$, where $V$ is the set of nodes and $E \subseteq V \times V$ is the set of (directed or undirected) edges. Nodes and edges can have labels specified by mappings $\phi: V \rightarrow \Sigma_v$ and $\psi: E \rightarrow \Sigma_e$ respectively, where $\Sigma_v$ is the set of node labels and $\Sigma_e$ is the set of edge labels. In order to uniquely identify a node, we assign a unique id to each node in a graph. W