- Home
- Documents
*RASCAL: Calculation of Graph Similarity using Maximum ... Calculation of Graph Similarity using ......*

prev

next

out of 14

View

215Category

## Documents

Embed Size (px)

c British Computer Society 2002

RASCAL: Calculation of GraphSimilarity using Maximum Common

Edge SubgraphsJOHN W. RAYMOND1, ELEANOR J. GARDINER2 AND PETER WILLETT2

1Pfizer Global Research and Development, Ann Arbor Laboratories, 2800 Plymouth Road, Ann Arbor,Michigan 48105, USA

2Krebs Institute for Biomolecular Research and Department of Information Studies,University of Sheffield, Western Bank, Sheffield S10 2TN, UK

Email: john.raymond@pfizer.com

A new graph similarity calculation procedure is introduced for comparing labeled graphs. Given aminimum similarity threshold, the procedure consists of an initial screening process to determinewhether it is possible for the measure of similarity between the two graphs to exceed the minimumthreshold, followed by a rigorous maximum common edge subgraph (MCES) detection algorithmto compute the exact degree and composition of similarity. The proposed MCES algorithm is basedon a maximum clique formulation of the problem and is a significant improvement over otherpublished algorithms. It presents new approaches to both lower and upper bounding as well as

vertex selection.

Received 16 August 2001; revised 29 April 2002

1. INTRODUCTION

It is often desired in many applications to compare objectsrepresented as graphs and to determine the degree ofsimilarity between the objects. This is sometimes referredto as inexact graph matching. Inexact graph matchingis of importance in many applications such as proteinligand docking [1], video indexing [2] and computer vision[3, 4]. One algorithmic approach to this problem isto use the maximum common subgraph (MCS) betweenthe graphs being compared. In this paper, a novel andefficient algorithm based on the edge induced formulationof the MCS problem is presented. The algorithm hasbeen developed for use in chemical similarity searchingapplications where molecular structures are represented asgraphs such that the atoms correspond to labeled verticesand the chemical bonds correspond to labeled edges, butit is directly applicable to a comparison of any collectionof labeled graphs. For a more detailed description of thegraphical representation of a chemical structure, the readeris referred to a text on chemical graph theory [5].

The similarities between the graphs representing mole-cules play an important role in many aspects of chemistryand, increasingly, biology. Examples include databasesearching [6], the prediction of biological activity [7], thedesign of combinatorial syntheses [8] and the interpretationof molecular spectra [9]. A wide range of types ofsimilarity measures has been described [10]. However,many of these are too time-consuming when very largenumbers of molecules need to be considered, as is the

case for applications involving the searching and clusteringof chemical databases which typically contain tens oreven hundreds of thousands of molecules. In such cases,molecules are represented by a simple bit-string, in whichbits are set to denote the presence in a molecule ofpre-defined molecular substructures, such as a particularring system or functional group. The similarity betweentwo molecules is then calculated using an associationcoefficient based on the numbers of bits in commonbetween the bit-strings describing those two molecules [10].Although widely used for similarity purposes, such simplemolecular representations have several limitations [11];most importantly, a bit-string does not preserve theconnectivity of the parent molecule. MCS-based similaritymeasures do take full account of molecular connectivity buthave been little used to date in chemical information systemsbecause of their computational requirements. The workreported here has been undertaken to address this situation.

The proposed inexact graph matching procedureRASCAL (RApid Similarity CALculation) consists oftwo components, screening and rigorous graph matching.The screening procedure is intended to determine rapidlywhether the graphs being compared exceed some specifiedminimum similarity threshold (without resulting in any falsedismissals) in order to avoid unnecessary calls to the morecomputationally demanding, graph matching procedure.The screening procedure is described in Section 2 alongwith an example calculation. The latter graph matchingprocess consists of an efficient determination of the MCSusing the graph similarity concept. Section 3 formulates the

THE COMPUTER JOURNAL, Vol. 45, No. 6, 2002

at Princeton U

niversity Library on April 19, 2011

comjnl.oxfordjournals.org

Dow

nloaded from

http://comjnl.oxfordjournals.org/

632 J. W. RAYMOND, E. J. GARDINER AND P. WILLETT

a)

b)

FIGURE 1. (a) Maximum common induced subgraph;(b) maximum common edge subgraph.

MCS graph matching procedure in terms of the maximumclique problem and presents improvements to existing cliquedetection algorithms for this purpose, including severalnew contributions. In Section 4, the RASCAL algorithmis compared with other published algorithms in severalexperimental trials where it is shown to be considerablymore efficient.

1.1. Definitions and terminology

All graphs referred to in the following text are assumed tobe simple, undirected graphs. For an introduction to graph-related concepts and notation, the reader is referred to anintroductory text on graph theory [12]. NG(vi) will denotethe set of vertices adjacent to a vertex vi . A maximum clique,(G), of a graph G is the largest set of mutually adjacentvertices in G. A line graph L(G) is a graph whose vertexset consists of the edge set of G; therefore, if (vi , vj ) is anedge in G it is also a vertex in L(G). A pair of verticesin L(G) are adjacent if the two corresponding edges in Gare incident on each other.

A pair of graphs are said to be isomorphic if there isa one-to-one correspondence between their vertices and anedge only exists between two vertices in one graph if anedge exists between the two corresponding vertices in theother graph. An induced subgraph is a set S of vertices ofa graph G and those edges of G with both endpoints in S.A graph G12 is a common induced subgraph of graphs G1and G2 if G12 is isomorphic to induced subgraphs of G1and G2. A maximum common induced subgraph (MCIS)consists of a graph G12 with the largest number of verticesmeeting the aforementioned property. Related to the MCISis the maximum common edge subgraph (MCES). An MCESis a subgraph consisting of the largest number of edgescommon to both G1 and G2. Note that the MCIS or MCESbetween two graphs is not necessarily connected or uniqueby definition. Figure 1a illustrates an MCIS between twographs (highlighted in bold), and Figure 1b demonstrates anMCES between the same two graphs.

2. SCREENING PROCEDURE

Using the classification of Sanfeliu and Fu [13], graphdistance/similarity measures can be categorized into twoclasses:

(1) Feature-based distances. A set of features or invariantsis established from a structural description of agraph, and these features are then used in a vectorrepresentation to which various distance or similaritymeasures can be applied.

(2) Cost-based distance. The distance or similaritybetween two graphs reflects the number of prescribededit operations that are required in order to transformone graph into the other.

The proposed two-tiered screening system is a cost-basedmethod that has been developed to be used in conjunctionwith an MCES algorithm. It will be referred to as the cost-vector approach in order to differentiate it from feature-based methods. While the new screening system is almost assimple as the feature-based approach from a computationalperspective, it has the desirable features that it cannot resultin any false dismissals of graphs that are in fact similar, andit provides an upper bound on the size of the MCES betweentwo molecular graphs. The procedure forms a screeninghierarchy whereby the calculated upper bound for the secondtier is always less than or equal in magnitude to the valuecalculated in the first tier. Following the formal treatment inSections 2.1 and 2.2, we present an example illustrating theproposed screening procedure.

2.1. First tier cost-vector

Degree sequences of a graph have been used by other authorsto establish upper bounds on graph invariants [14, 15]and recently for indexing graph databases [16]. In ourproposed first tier screening procedure, we have used degreesequences to develop a novel method to calculate an upperbound on the size of an MCES between a pair of graphs.First, the set of vertices in each graph is partitioned intol partitions by label type, and then sorted in non-increasingorder by degree. Let L1i and L

2i denote the sorted degree

sequences in partition i in graphs G1 and G2, respectively.An upper-bound on the similarity between a pair of graphscan be given as follows:

V (G1,G2) =l

i=1min{|L1i |, |L2i |}

E(G1,G2) = l

i=1

max{|L1i |,|L2i |}j=1

min{d(v1j ), d(v2j )}2

simcv(G1,G2)

= (V (G1,G2) + E(G1,G2))2

(|V (G1)| + |E(G1)|) (|V (G2)| + |E(G2)|) .

Note that the term to the right of E(G1,G2) is divided bytwo since each edge in this formulation is counted twice and,if |L1i | = |L2i |, then entries of value zero are appended to theshorter sequence to make the sequences of equal length. It isclear that the similarity measure simcv(G1,G2) ranges from

THE COMPUTER JOURNAL, Vol. 45, No. 6, 2002

at Princeton U

niversity Library on April 19, 2011

comjnl.oxfordjournals.org

Dow

nloaded from

http://comjnl.oxfordjournals.org/

RASCAL 633

0 to 1 and that it obeys the following inequality,

simcv(G1,G2)

(|V (G12)| + |E(G12)|)2

(|V (G1)| + |E(G1)|) (|V (G2)| + |E(G2)|) ,

where G12 is the MCES between g