Median graph: A new exact algorithm using a distance based on the maximum common subgraph

Pattern Recognition Letters 30 (2009) 579–588


M. Ferrer a,*, E. Valveny a, F. Serratosa b

a Centre de Visió per Computador, Departament de Ciències de la Computació, Universitat Autònoma de Barcelona, 08193 Bellaterra, Spain

b Departament d’Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, 43007 Tarragona, Spain

Article info

Article history:
Received 15 October 2007
Received in revised form 4 December 2008
Available online 12 January 2009

Communicated by S. Dickinson

Keywords: Median graph; Maximum common subgraph; Minimum common supergraph; Graph matching

0167-8655/$ - see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2008.12.014

* Corresponding author. Tel.: +34 93 581 23 01; fax: +34 93 581 16 70. E-mail addresses: [email protected], mferrer@iri.upc.edu (M. Ferrer).

Abstract

Median graphs have been presented as a useful tool for capturing the essential information of a set of graphs. Nevertheless, computing optimal solutions is a very hard problem. In this work we present a new and more efficient optimal algorithm for median graph computation. By using a particular cost function that permits the definition of the graph edit distance in terms of the maximum common subgraph, together with a prediction function in the backtracking algorithm, we reduce the size of the search space, avoiding the evaluation of a large number of states while still obtaining the exact median. We present a set of experiments comparing our new algorithm against the previously existing exact algorithm using synthetic data. In addition, we present the first application of exact median graph computation to real data, and we compare the results against an approximate algorithm based on genetic search. These experimental results show that our algorithm outperforms the previously existing exact algorithm and, in addition, show the potential applicability of exact solutions to real problems.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

In structural pattern recognition, graphs have proven to be a useful data structure for representing complex objects. The nodes of the graph typically represent parts of the object and the edges represent the relations between them. This high representational power has been shown to be very useful in many areas of computer vision and pattern recognition, such as character recognition (Lu et al., 1991), shape analysis (Pearce et al., 1994) and 3D object recognition (Wong, 1992). Nevertheless, the main drawback of graphs is their combinatorial nature. The comparison between two objects, which is quite simple when they are represented by feature vectors, for instance, becomes highly complex when they are represented by graphs. The time required by any of the exact algorithms to compare two graphs may, in the worst case, be exponential in the size of the graphs. Approximate algorithms, in contrast, have polynomial time complexity but are not guaranteed to give the optimal solution.

In this paper, we are interested in inferring the representative of a set of objects given a set of noisy samples of such an object. In the graph domain it is not easy to define the representative of a set. In this context, the generic concept of median, which has been widely used in areas such as statistics and probability theory, turns out to be very useful. In the structural domain the median graph (Jiang et al., 2001) has been defined in order to represent this concept. Given a set of graphs, the median graph is defined as the graph that has the smallest sum of distances (SOD) to all graphs in the set. However, the computation of the median graph is exponential both in the number of input graphs and in their size (Bunke et al., 1999). A number of algorithms have been presented in the past to compute the generalized median graph. The only exact algorithm proposed up to now (Münger, 1998) is based on an A* algorithm handled by a data structure called multimatch. As the computational cost of this algorithm is very high, a set of approximate algorithms has also been developed, based on different approaches such as genetic search (Jiang et al., 2001; Münger, 1998), greedy algorithms (Hlaoui and Wang, 2006) and spectral graph theory (Ferrer et al., 2005; White and Wilson, 2006).

The main contributions of this paper are two-fold. Firstly, from a theoretical point of view, we will show that, under a particular cost function and a graph edit distance based on the maximum common subgraph, both introduced in (Bunke, 1997), the search space for the median graph can be drastically reduced. In addition, under the same conditions, we define a heuristic strategy that avoids the exploration of some states of this new search space. Secondly, based on this theoretical result, we will present a new, more efficient exact algorithm for the computation of the generalized median graph.

To validate the theoretical part, we have carried out a set of experiments comparing our new algorithm with the previously existing exact algorithm using synthetic data. We will show how the new algorithm clearly outperforms the multimatch algorithm. In addition, we have applied the new algorithm to a set of real data using a database of graphs representing molecules. In this case we have compared our approach with an approximate algorithm based on genetic search. We will show how the new exact algorithm compares well with the approximate algorithm in terms of computation time, while it is clearly better in terms of the accuracy of the median graph. This result suggests that the exact computation of the median graph can be extended for the first time (in a limited way) to real applications. The previously existing algorithms could only be applied to synthetic datasets containing a small number of small graphs. Moreover, in spite of the complexity, being able to obtain exact solutions for the median graph can be useful to better understand the nature of this concept and may help to improve the existing approximate algorithms or to introduce new approximate solutions.

The rest of the paper is organized as follows. In Section 2, we define some basic concepts and introduce our notation. Section 3 explains the concept of median graph and the different algorithms to compute it. The new algorithm, including the reduction of the search space and the heuristic prediction function, is described in Section 4. Experimental results are shown in Section 5 and, finally, the main conclusions are stated in Section 6.

Fig. 1. A set $S$ of three graphs (a), and a possible $mcs(S)$ of them (b).

Fig. 2. Two graphs $g_1$ (a) and $g_2$ (b), a possible $MCS(g_1, g_2)$ (c) and an edit path between $g_1$ and $g_2$ (d).

2. Definitions and notation

2.1. Basic definitions

Let $L_V$ and $L_E$ denote the sets of node and edge labels, respectively. A graph is a triple $g = (V, \alpha, \beta)$, where $V$ is the finite set of nodes, $\alpha$ is the node labeling function ($\alpha : V \to L_V$), and $\beta$ is the edge labeling function ($\beta : V \times V \to L_E$). We assume that our graphs are fully connected. Consequently, the set of edges is implicitly given (i.e. $E = V \times V$). This assumption is only for notational convenience and does not restrict the generality of our results. Where no edge exists between two given nodes, we include the special null label $\varepsilon$ in the set of labels $L_E$ to model this situation. If $V = \emptyset$, then $g$ is called the empty graph. Finally, the number of nodes of a graph $g$ is denoted by $|g|$.

Given two graphs, $g_1 = (V_1, \alpha_1, \beta_1)$ and $g_2 = (V_2, \alpha_2, \beta_2)$, $g_2$ is a subgraph of $g_1$, denoted by $g_2 \subseteq g_1$, if

• $V_2 \subseteq V_1$

• $\alpha_2(v) = \alpha_1(v)$ for all $v \in V_2$

• $\beta_2((u, v)) = \beta_1((u, v))$ for all $(u, v) \in V_2 \times V_2$

From this definition, it follows that, given a graph $g_1 = (V_1, \alpha_1, \beta_1)$, a subset $V_2 \subseteq V_1$ of its vertices uniquely defines a subgraph. This subgraph is called the subgraph induced by $V_2$.

Finally, it is important to be able to check whether two graphs are identical or not. Graph isomorphism allows us to check this condition. Given two graphs $g_1 = (V_1, \alpha_1, \beta_1)$ and $g_2 = (V_2, \alpha_2, \beta_2)$, a graph isomorphism between $g_1$ and $g_2$ is a bijective mapping $f : V_1 \to V_2$ such that

• $\alpha_1(x) = \alpha_2(f(x))$ for all $x \in V_1$

• $\beta_1((x, y)) = \beta_2((f(x), f(y)))$ for all $(x, y) \in V_1 \times V_1$

If there exists a graph isomorphism between two given graphs $g_1$ and $g_2$, we say that $g_1$ and $g_2$ are isomorphic.
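The definitions of this section can be made concrete with a minimal Python sketch. The `Graph` class, the use of `None` for the null label $\varepsilon$, and the helper names `induced_subgraph` and `is_isomorphic` are illustrative assumptions and not part of the original formulation. The isomorphism test is a brute-force search over all bijections, so it is only practical for very small graphs.

```python
from itertools import permutations

NULL = None  # assumption: None stands in for the null edge label (epsilon)


class Graph:
    """Fully connected labeled graph; pairs without an entry carry NULL."""

    def __init__(self, node_labels, edge_labels=None):
        self.nodes = dict(node_labels)        # node id -> node label (alpha)
        self.edges = dict(edge_labels or {})  # (u, v) -> edge label (beta)

    def edge(self, u, v):
        return self.edges.get((u, v), NULL)

    def __len__(self):
        return len(self.nodes)                # |g| = number of nodes


def induced_subgraph(g, keep):
    """Subgraph of g induced by the node subset `keep`."""
    keep = set(keep)
    nodes = {v: g.nodes[v] for v in keep}
    edges = {(u, v): lab for (u, v), lab in g.edges.items()
             if u in keep and v in keep}
    return Graph(nodes, edges)


def is_isomorphic(g1, g2):
    """Try every bijection f: V1 -> V2 and check both labeling conditions."""
    if len(g1) != len(g2):
        return False
    v1 = list(g1.nodes)
    for image in permutations(g2.nodes):
        f = dict(zip(v1, image))
        nodes_ok = all(g1.nodes[x] == g2.nodes[f[x]] for x in v1)
        edges_ok = all(g1.edge(x, y) == g2.edge(f[x], f[y])
                       for x in v1 for y in v1)
        if nodes_ok and edges_ok:
            return True
    return False
```

Edges are stored as ordered pairs, following $\beta : V \times V \to L_E$; an undirected edge is represented by storing both `(u, v)` and `(v, u)`.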

2.2. Minimum common supergraph

Let $g_1$, $g_2$ and $g_3$ be three graphs. If both $g_1$ and $g_2$ are subgraphs of $g_3$, then $g_3$ is called a common supergraph of $g_1$ and $g_2$ (Bunke et al., 2000). Generalizing the same idea to a set $S = \{g_1, g_2, \ldots, g_n\}$, a graph $g_{m_S}$ is called a minimum common supergraph of $S$ (also denoted by $mcs(S)$) if $g_1, g_2, \ldots, g_n$ are subgraphs of $g_{m_S}$ and there is no other common supergraph of $\{g_1, g_2, \ldots, g_n\}$ having fewer nodes than $g_{m_S}$. A set of three graphs and a possible minimum common supergraph of them are shown in Fig. 1a and b, respectively.

2.3. Maximum common subgraph

Let $g_1$ and $g_2$ be two graphs, and $g'_1 \subseteq g_1$, $g'_2 \subseteq g_2$. If there exists a graph isomorphism between $g'_1$ and $g'_2$, then both $g'_1$ and $g'_2$ are called a common subgraph of $g_1$ and $g_2$. A graph $g_M$ is called a maximum common subgraph of $g_1$ and $g_2$ (usually denoted by $MCS(g_1, g_2)$) if $g_M$ is a common subgraph of $g_1$ and $g_2$ and there is no other common subgraph of both $g_1$ and $g_2$ having more nodes than $g_M$.

An example of two graphs $g_1$ and $g_2$ and a possible $MCS(g_1, g_2)$ is shown in Fig. 2. Notice that, according to the definition of graph given in Section 2.1, these graphs are fully connected; the null (non-existent) edges are represented by dashed lines in the figure. Notice also that, according to the definition of subgraph given in Section 2.1, $g_2$ is not a subgraph of $g_1$: $g_2$ has no edge between the white and black nodes and therefore, given the same set of nodes in $g_1$ and $g_2$, $g_2$ has a different set of edges. A possible $MCS(g_1, g_2)$ is the graph $g_M$ shown in Fig. 2c. Here $g_M$ consists of a subset of the nodes of both $g_1$ and $g_2$ for which the connecting edges are the same in $g_1$ and in $g_2$. Thus, since $g_M$ is an induced subgraph of both $g_1$ and $g_2$ and there is no other common subgraph of these two graphs having more nodes, $g_M$ is a possible $MCS(g_1, g_2)$. This fact will be an important point in Section 4.

In recent years, several papers on MCS computation have been presented, based on different approaches and algorithms (Balas and Yu, 1986; Durand et al., 1999; McGregor, 1982; Wang and Maple, 2005). An explanation of these methods and a comparison between them can be found in (Bunke et al., 2002; Conte et al., 2007).


2.4. Graph edit distance

The graph edit distance (Bunke and Allerman, 1983; Sanfeliu et al., 1983) is one of the most widely used methods to compute the dissimilarity between two graphs.

The basic idea behind the graph edit distance is to define the dissimilarity of two graphs as the minimum amount of distortion required to transform one graph into the other. To this end, a number of distortion or edit operations $e$, consisting of the insertion, deletion and substitution of both nodes and edges, are defined. Given these edit operations, for every pair of graphs $g_1$ and $g_2$ there exists a sequence of edit operations, or edit path, $p(g_1, g_2) = (e_1, \ldots, e_k)$ (where each $e_i$ denotes an edit operation) that transforms $g_1$ into $g_2$. In general, several edit paths exist between two given graphs. This set of edit paths is denoted by $\wp(g_1, g_2)$. In order to evaluate quantitatively which edit path is the best, edit costs are introduced. The basic idea is to assign a penalty cost $c$ to each edit operation according to the amount of distortion it introduces in the transformation. The edit distance between two graphs $g_1$ and $g_2$, denoted by $d(g_1, g_2)$, is the cost of the minimum cost edit path that transforms one graph into the other.

2.4.1. Graph edit distance and the maximum common subgraph

Several exact and approximate algorithms for computing the distance between two graphs have been presented in the past (Justice and Hero, 2006; Neuhaus et al., 2006; Robles-Kelly and Hancock, 2005). In this work, we will use a particular cost function (Bunke, 1997) that allows the edit distance of two graphs $g_1$ and $g_2$ to be defined in terms of their MCS. Table 1 summarizes this cost function.

Under this cost function, deletions and insertions of nodes always have a cost of 1, deletions and insertions of edges always have a cost of 0, and node and edge substitutions have a cost of 0 or 1 depending on whether the substitution is identical or not, respectively. Note that, since the definition of graph given in Section 2.1 states that graphs are fully connected, edge deletions only occur when a node is deleted, in which case all its incident edges are deleted too. Other situations, as in the case of the graphs of Fig. 2a and b, are treated as a substitution of the edge by a null-labeled edge, not as an edge deletion. The practical consequences of this observation will be explained in more detail in Section 4.1.

Under this cost function, the edit distance $d(g_1, g_2)$ between two given graphs $g_1$ and $g_2$ is related to their MCS in the following way:

$$d(g_1, g_2) = |g_1| + |g_2| - 2\,|MCS(g_1, g_2)| \qquad (1)$$

This result demonstrates the intuitive idea that the more twographs have in common, the lower their distance. Beyond the theo-retical implications of this result there are also important practicalimplications. For instance, an immediate consequence of this resultis that any algorithm that computes the graph edit distance may beused for maximum common subgraph computation if it is run un-der this cost function.
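As a hedged illustration of Eq. (1), the following Python sketch computes the distance through an exhaustive maximum common subgraph search. The dictionary-based graph representation, the helper names and the example graphs are assumptions made for this sketch (the example only mimics the situation described for Fig. 2, it is not the figure's data), and the brute-force search is exponential, so it should be read as a specification rather than as a practical MCS algorithm.

```python
from itertools import combinations, permutations


def make_graph(labels, undirected_edges=()):
    """Hypothetical helper: node id -> label, plus symmetric labeled edges."""
    edges = {}
    for u, v in undirected_edges:
        edges[(u, v)] = edges[(v, u)] = "edge"
    return {"nodes": dict(labels), "edges": edges}


def edge(g, u, v):
    # None plays the role of the null label for non-existent edges.
    return g["edges"].get((u, v))


def mcs_size(g1, g2):
    """Number of nodes of a maximum common induced subgraph (brute force)."""
    n1, n2 = list(g1["nodes"]), list(g2["nodes"])
    for k in range(min(len(n1), len(n2)), 0, -1):
        for sub1 in combinations(n1, k):
            for sub2 in permutations(n2, k):
                f = dict(zip(sub1, sub2))
                same_nodes = all(g1["nodes"][x] == g2["nodes"][f[x]] for x in sub1)
                same_edges = all(edge(g1, x, y) == edge(g2, f[x], f[y])
                                 for x in sub1 for y in sub1 if x != y)
                if same_nodes and same_edges:
                    return k
    return 0


def edit_distance(g1, g2):
    """Eq. (1): d(g1, g2) = |g1| + |g2| - 2 |MCS(g1, g2)|."""
    return len(g1["nodes"]) + len(g2["nodes"]) - 2 * mcs_size(g1, g2)


labels = {1: "white", 2: "black", 3: "grey", 4: "striped"}
g1 = make_graph(labels, [(1, 2), (2, 3), (3, 4)])
g2 = make_graph(labels, [(2, 3), (3, 4)])   # same nodes, white-black edge missing
print(edit_distance(g1, g2))  # 2
```

In this example the two graphs share the same four nodes but differ in one edge, so the largest common induced subgraph has three nodes and the distance is 4 + 4 - 6 = 2.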

Table 1. Detail of the cost function.

| Operation | Cost |
|---|---|
| Node deletion $c_{nd}(u)$ | Always equal to 1 |
| Node insertion $c_{ni}(u)$ | Always equal to 1 |
| Node substitution $c_{ns}(u, v)$ | 0 if $\alpha_1(u) = \alpha_2(v)$; 1 otherwise |
| Edge deletion $c_{ed}(e)$ | Always equal to 0 |
| Edge insertion $c_{ei}(e)$ | Always equal to 0 |
| Edge substitution $c_{es}(e_1, e_2)$ | 0 if $\beta_1(e_1) = \beta_2(e_2)$; 1 otherwise |

Here $\alpha_i$ and $\beta_i$ are the node and edge labeling functions of a graph $g_i$.

More recently, the same cost function has been used in (Bunke and Kandel, 2000) to establish the relation between the MCS and the mean of two graphs. Concretely, it is shown that, under this cost function, the MCS of two graphs is also a mean, that is, a graph that minimizes the sum of distances to these two graphs. Finally, the relation between the MCS and the minimum common supergraph (mcs) of two graphs is shown in (Bunke et al., 2000). In that work, it is demonstrated that under an extensive family of cost functions, including this particular cost function, the mcs can be computed from the MCS.

It is therefore clear that this cost function has important potential applications in different graph matching problems. Thus, in the rest of the paper we will assume that the distance between two graphs is computed according to Eq. (1).

3. Generalized median graph

Given a set of graphs, the concept of median graph has been presented as a useful tool to compute a representative of such a set. Let $U$ be the set of graphs that can be constructed using labels from $L$. Given $S = \{g_1, g_2, \ldots, g_n\} \subseteq U$, the generalized median graph of $S$ is defined as the graph $\bar{g} \in U$ such that its sum of distances (SOD) to all the graphs in $S$ is minimum:

$$\bar{g} = \arg\min_{g \in U} \sum_{g_i \in S} d(g, g_i) \qquad (2)$$

Notice that $\bar{g}$ is usually not a member of $S$, and in general more than one generalized median graph may exist for a given set $S$.
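To make Eq. (2) tangible, here is a small Python sketch under a strong simplifying assumption that is not made in the paper: every graph is edgeless (all edges carry the null label), so a graph reduces to a multiset of node labels and $|MCS|$ is the size of the multiset intersection. The candidate set is also restricted, for illustration only, to sub-multisets of the labels occurring in $S$, which keeps the exhaustive search finite. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def distance(g1, g2):
    """Eq. (1) for edgeless graphs given as tuples of node labels."""
    common = sum((Counter(g1) & Counter(g2)).values())  # |MCS| in this special case
    return len(g1) + len(g2) - 2 * common


def sod(candidate, graphs):
    """Sum of distances from a candidate to every graph of the set."""
    return sum(distance(candidate, g) for g in graphs)


def generalized_medians(graphs):
    """Exhaustively minimise the SOD over the restricted candidate set."""
    pool = Counter()
    for g in graphs:
        pool |= Counter(g)                      # multiset union of all labels
    items = sorted(pool.elements())
    candidates = {c for k in range(len(items) + 1)
                  for c in combinations(items, k)}
    best = min(sod(c, graphs) for c in candidates)
    return sorted(c for c in candidates if sod(c, graphs) == best), best


S = [("a", "b"), ("a", "c"), ("b", "c")]
print(generalized_medians(S))  # ([('a', 'b', 'c')], 3)
```

In this toy set the unique minimiser has three nodes and SOD 3, while each member of $S$ has SOD 4, which illustrates that the generalized median need not belong to $S$.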

As shown in Eq. (2), a graph distance between every candidate median $g$ and each graph $g_i \in S$ must be computed. Since the computation of the graph edit distance is a well-known NP-complete problem, the computation of the generalized median graph can only be done in exponential time, both in the number of graphs in $S$ and in their size. As a consequence, in real applications we are forced to use suboptimal methods in order to obtain approximate solutions for the generalized median graph in a reasonable time. Such approximate methods generally apply some kind of heuristics in order to reduce both the complexity of the graph edit distance computation and the size of the search space.

Both exact (Münger, 1998) and approximate (Ferrer et al., 2005; Hlaoui and Wang, 2006; Jiang et al., 2001) algorithms have been presented in the past for median graph computation. In this work, we are interested in the exact computation of the median.

The only optimal algorithm presented up to now (Münger, 1998) is based on a combinatorial search exploiting the fact that the number of nodes of the candidate median must lie between 0 and the sum of the numbers of nodes of all graphs in the set $S$ (Jiang et al., 2001). The algorithm implements an A*-based search procedure handled by a structure called multimatch. This structure encodes a simultaneous transformation (edit operations) from a candidate median graph to all the graphs in the set. The operations on the edges are implicitly given by the structure. The labels for each node and edge are selected from the set of labels in such a way that the SOD of the encoded candidate median to all the graphs is minimized. Unfortunately, this approach suffers from a high complexity. The results presented in (Bunke et al., 1999) show that for a set of two graphs having six nodes each, the time needed to compute the median graph grows up to 104 s (all the experiments were run on an IBM Thinnode Power 2 computer).

4. New exact algorithm for the generalized median graph

In this section, we will present a new and more efficient algorithm for exact median graph computation. To do this, we will tackle the problem from two different angles. Firstly, in Section 4.1 we will show how, using the particular cost function presented in Section 2.4.1, the space in which the median has to be searched for can be drastically reduced. After that, in Section 4.2, we will present a heuristic prediction function that avoids the evaluation of some states in the new search space. These two improvements lead to the new algorithm for median graph computation presented in Section 4.3.

Fig. 3. Search space including repeated combinations (a) and without these repeated combinations (b).

4.1. Reduction of the search space

In the multimatch approach the search space for the median graph computation is determined as follows: suppose that $S = \{g_1, g_2, \ldots, g_n\}$ and let $k = \sum_{i=1}^{n} |g_i|$ be the sum of the sizes of all the graphs in $S$. Then, the algorithm explores all possible graphs having a size between 0 and $k$ as possible candidate medians. For each number of nodes, all possible combinations of labels for these nodes and all possible combinations of edges (including the null edge) linking these nodes have to be tested. Thus, a huge number of combinations must be taken into account, which makes this approach unfeasible even for a small number of graphs.

Two observations regarding this approach can be made that allow the size of the search space to be reduced. The first is that in (Münger, 1998) the union graph $g_u$, the graph containing all the nodes and all the edges of the graphs in $S$, is used as the starting point of the search space. This definition allows all the possible combinations of nodes and edges to be explored, but it also includes the repetition of some states in the search space, corresponding to combinations that can be found in more than one graph of the set. In order to avoid the generation of these repeated states we propose the use of the minimum common supergraph of $S$, $g_{m_S}$, instead of the union graph $g_u$. The use of $g_{m_S}$ allows us to explore all the possible combinations (all the possible graphs generated from the union graph can be generated using the minimum common supergraph), while avoiding the repeated combinations shared by different graphs of the set.

The second is that, given a set of nodes of $g_u$, all the edges in the graphs of $S$ (including the null edge) are taken into account as possible candidates to construct the intermediate median graph. However, this number of combinations can be reduced by taking into account an important property of the cost function given in (Bunke et al., 2000). It states that there always exists an optimal edit path between two given graphs that involves neither non-identical node substitutions nor non-identical edge substitutions.

This result can be seen in the following example. Given the graphs $g_1$ and $g_2$ of Fig. 2, one could suppose that the cheapest way to convert $g_1$ into $g_2$ is by deleting the edge linking the white and black nodes in $g_1$. Notice that, because of the definition of graph given in Section 2.1, deleting an edge is equivalent to substituting this edge by an edge labelled as "null". This operation is not allowed, however, because it implies a non-identical edge substitution. A possible sequence of edit operations (Fig. 2d) then consists in deleting the white node from $g_1$ (including the deletion of its incident edges) and inserting the same node again, together with its incident edges as they appear in $g_2$. This sequence of edit operations has a cost equal to 2: one node deletion and one node insertion. This result is consistent with the result obtained by applying Eq. (1), since $|g_1| = 4$, $|g_2| = 4$ and $|MCS(g_1, g_2)| = 3$, which gives $d(g_1, g_2) = 4 + 4 - 2 \cdot 3 = 2$.

Considering this property of the cost function, given a subset of nodes in $g_u$, only the labels of the edges connecting these nodes in $g_u$ have to be considered when searching for the median graph. Any other label would lead to non-identical edge substitutions in some of the graphs of the set and therefore to a non-optimal edit path that would never be applied. Thus, only the induced subgraphs of $g_u$ are valid options in the search space.

Conclusion: Merging both observations, the new search space is composed only of the induced subgraphs of the minimum common supergraph of $S$, $g_{m_S}$.

4.1.1. Generation of the new search space

In this new scenario, the following approach can be adopted. We take the minimum common supergraph of $S$, $g_{m_S} = mcs(S)$ with $|g_{m_S}| = p$, as the first candidate median graph. From this initial candidate median graph, new candidates can be generated by simply removing nodes and their incident edges from this first combination. A naive exploration of all these possible candidate median graphs with a size between 0 and $p$ may be carried out by a simple backtracking-based algorithm. The root node corresponds to $g_{m_S}$. The tree is expanded in a depth-first manner, exploring all the combinations produced by removing nodes from the root. The edges are implicitly deleted when a node is removed from a graph. Unfortunately, this straightforward approach still generates duplicate combinations of candidate medians, because different branches may lead to the same combinations of nodes, as can be seen in Fig. 3a, where $g_{m_S}$ is located in the root node. Notice that some of the combinations with 1 and 0 nodes are repeated in the tree. They are shown in the figure by means of dashed circles.
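The duplication produced by the naive backtracking can be checked with a short Python sketch. It assumes, for illustration, that a candidate is identified by the set of supergraph nodes it keeps and that every node removal opens a new branch; the function name `naive_expand` is hypothetical.

```python
def naive_expand(nodes):
    """Yield every state of the naive depth-first tree rooted at `nodes`."""
    yield nodes
    for v in nodes:
        # Each removal starts a new branch, whatever was generated before.
        yield from naive_expand(nodes - {v})


states = list(naive_expand(frozenset("abc")))   # supergraph with p = 3 nodes
print(len(states), len(set(states)))            # 16 8
```

For $p = 3$ the tree contains 16 states but only 8 distinct node subsets, and the gap widens quickly as $p$ grows.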

Then, a different approach can be followed where only non-repeated combinations are generated (Fig. 3b). The number of different possible candidate medians can be easily computed. For a given $i$, with $0 \le i \le p$, the number of possible candidate median graphs of size $i$ is $C^p_i = \frac{p!}{i!\,(p-i)!}$, that is, the number of combinations of $i$ nodes chosen from $p$ nodes. As the search space is composed of all the induced subgraphs of $g_{m_S}$ with sizes between 0 and $p$, the total number of different candidates is the sum $\sum_{i=0}^{p} C^p_i$, which is exactly $2^p$. These $2^p$ different combinations can be generated using a breadth-first approach. The search space may then be thought of as a rhombus (Fig. 4). At the top of this rhombus (level 0) there is $g_{m_S} = mcs(S)$ with $|g_{m_S}| = p$. Below it are all the induced subgraphs of $g_{m_S}$ having $p - 1$ nodes (level 1); concretely, there are $C^p_{p-1} = p$ such graphs. At the next level, each of these induced subgraphs generates new induced subgraphs of $g_{m_S}$ having $p - 2$ nodes. The number of combinations increases until the central row of the rhombus is reached (the maximum number of combinations). From this point to the bottom, the number of combinations decreases until the last row (level $p$) is reached. At this point there is $C^p_0 = 1$ combination, corresponding to the empty graph $g_e$. This will be the new search space used to find the median graph.
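The level-by-level generation of the rhombus can be sketched in a few lines of Python. As an assumption of this sketch, each candidate is represented only by the set of nodes of $g_{m_S}$ that it keeps (the induced edges follow implicitly), and duplicates are removed by storing each level in a set; `rhombus_levels` is a hypothetical name.

```python
from math import comb


def rhombus_levels(nodes):
    """Return the levels of the search space; level k holds subsets of size p - k."""
    root = frozenset(nodes)
    levels = [{root}]
    for _ in range(len(root)):
        # Expand every subset by removing one node; the set drops duplicates.
        levels.append({s - {v} for s in levels[-1] for v in s})
    return levels


levels = rhombus_levels("abcd")                 # p = 4
widths = [len(level) for level in levels]
print(widths, sum(widths))                      # [1, 4, 6, 4, 1] 16
assert widths == [comb(4, 4 - k) for k in range(5)]
```

The widths follow the binomial coefficients $C^p_{p-k}$ and add up to $2^p$, which matches the rhombus shape described above.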

4.2. Prediction of the cost function

This search space may still have a huge number of states. It is therefore desirable to avoid the evaluation of as many states as possible. To this end, we will show in this section that, from a given candidate median in the search tree, we can introduce a prediction of the cost associated with the candidate medians generated from the current node at the next level of the search space. By evaluating this prediction function we will be able to decide whether or not it is worth exploring these candidate medians.

Fig. 4. Detail of the rhombus search space. Level $k$ ($k = 0, 1, \ldots, p$) contains the induced subgraphs of size $p - k$.

Let us start by defining the evaluation function, $f^*$, of a candidate median, $\bar{g}^*$. This function is just the sum of distances of the candidate median to all the graphs $g_i$ in the set. Then, taking the definition of the graph edit distance (Eq. (1)), we can express $f^*(\bar{g}^*)$ in the following way:

$$f^*(\bar{g}^*) = SOD(\bar{g}^*) = \sum_{i=1}^{n} d(g_i, \bar{g}^*) = \sum_{i=1}^{n} \left( |g_i| + |\bar{g}^*| - 2\,|MCS(g_i, \bar{g}^*)| \right) = n\,|\bar{g}^*| + \sum_{i=1}^{n} |g_i| - 2 \sum_{i=1}^{n} |MCS(g_i, \bar{g}^*)|$$

The complexity of this expression depends on the cost of computing the MCS, which, in the general case, is exponential in the size of the graphs involved. It is therefore desirable to avoid as many evaluations of this function as possible. To this end, we can try to infer the value of this function as we traverse the search space, without having to compute the MCS between the candidate median and all the graphs. Let us take two candidate median graphs, $\bar{g}_1$ and $\bar{g}_2$. As we will be traversing the search space by generating all the induced subgraphs of $g_{m_S}$, let us also suppose, without loss of generality, that $\bar{g}_2$ is an induced subgraph of $\bar{g}_1$. Using this assumption, we will try to find a relation between $f^*(\bar{g}_1)$ and $f^*(\bar{g}_2)$. Let us denote by $h^*(\bar{g}_1, \bar{g}_2)$ the function that computes the difference between both evaluation functions:

$$h^*(\bar{g}_1, \bar{g}_2) = f^*(\bar{g}_2) - f^*(\bar{g}_1) = n\,(|\bar{g}_2| - |\bar{g}_1|) - 2 \sum_{i=1}^{n} |MCS(g_i, \bar{g}_2)| + 2 \sum_{i=1}^{n} |MCS(g_i, \bar{g}_1)| \qquad (3)$$

We will call $h^*$ the prediction function, since the evaluation function of any graph $\bar{g}_2$ in the search space can be expressed in terms of the evaluation function of any previous graph $\bar{g}_1$ and the function $h^*(\bar{g}_1, \bar{g}_2)$.

Now, let us analyze this prediction function. If we are traversing the search space from node $\bar{g}_1$ to node $\bar{g}_2$ and we have already evaluated $f^*(\bar{g}_1)$, we know the value of all the terms in $h^*(\bar{g}_1, \bar{g}_2)$ except for $|MCS(g_i, \bar{g}_2)|$. However, we can infer an upper limit for this term and can therefore define an estimation of this prediction function. We will denote this estimation by $h(\bar{g}_1, \bar{g}_2)$.

Let us observe that the MCS of two graphs will never have more nodes than either of them. In addition, we have assumed that we are traversing the search space from $\bar{g}_1$ to $\bar{g}_2$ and, therefore, $\bar{g}_2$ is a subgraph of $\bar{g}_1$. Thus, we can easily conclude that $|MCS(g_i, \bar{g}_2)| \le |MCS(g_i, \bar{g}_1)|$. Therefore, the following condition holds:

$$|MCS(g_i, \bar{g}_2)| \le \min\left( |g_i|,\ |\bar{g}_2|,\ |MCS(g_i, \bar{g}_1)| \right) \qquad (4)$$

Then, combining Eqs. (3) and (4), we obtain the following estimation of the prediction function:

$$h^*(\bar{g}_1, \bar{g}_2) \ge h(\bar{g}_1, \bar{g}_2) = n\,(|\bar{g}_2| - |\bar{g}_1|) + 2 \sum_{i=1}^{n} |MCS(g_i, \bar{g}_1)| - 2 \sum_{i=1}^{n} \min\left( |g_i|,\ |\bar{g}_2|,\ |MCS(g_i, \bar{g}_1)| \right) \qquad (5)$$

Finally, $h(\bar{g}_1, \bar{g}_2)$ can be used to obtain an estimation of the evaluation function of node $\bar{g}_2$, $f(\bar{g}_2)$:

$$f(\bar{g}_2) = f^*(\bar{g}_1) + h(\bar{g}_1, \bar{g}_2) \qquad (6)$$

Using this estimation we can reduce the computation time of the algorithm described in Section 4.3 by reducing the number of candidate graphs whose evaluation function needs to be explicitly computed. Given a candidate median graph, $\bar{g}^*$, we can compute this estimation for all the graphs in the search space that can be reached from this initial graph. Since Eq. (5) implies $f(\bar{g}_j) \le f^*(\bar{g}_j)$, every node $\bar{g}_j$ whose estimation $f(\bar{g}_j)$ is greater than the actual evaluation function of the current median graph $\bar{g}^*$ can be discarded automatically and does not have to be evaluated.

One difficulty in applying this strategy is that, given a graph in the search space, we cannot guarantee that the prediction function is either positive or negative for all its induced subgraphs. Therefore, as we are generating the search space level by level, discarding one state of the search space using the prediction function does not allow us to discard all its remaining induced subgraphs as well. Consequently, in the algorithm presented in the next section, we use a strategy in which the prediction function discards states only at the next level of the search space.
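The estimation of Eqs. (5) and (6) only needs graph sizes and the MCS sizes already computed for the parent candidate, as the following Python sketch shows. The function and parameter names are illustrative assumptions; the numbers in the example are made up to show the arithmetic and are not taken from the experiments.

```python
def predicted_sod(parent_sod, parent_size, child_size, graph_sizes, mcs_with_parent):
    """Eq. (6): lower bound f on the SOD of a child candidate.

    parent_sod      -- f*(g1), the evaluated SOD of the parent candidate
    graph_sizes     -- |g_i| for every graph of the set S
    mcs_with_parent -- |MCS(g_i, g1)| for every graph of the set S
    """
    n = len(graph_sizes)
    h = (n * (child_size - parent_size)
         + 2 * sum(mcs_with_parent)
         - 2 * sum(min(g, child_size, m)
                   for g, m in zip(graph_sizes, mcs_with_parent)))  # Eq. (5)
    return parent_sod + h


def can_discard(prediction, best_sod):
    """A child is skipped only if even its lower bound exceeds the best SOD."""
    return prediction > best_sod


# Three graphs with 3 nodes each; the parent has 2 nodes and is contained in all
# of them, so its SOD is 3*2 + 9 - 2*6 = 3.
f_child = predicted_sod(3, 2, 1, [3, 3, 3], [2, 2, 2])
print(f_child, can_discard(f_child, best_sod=3))  # 6 True
```

Because the bound depends on the child only through its size, all children of a candidate at the next level share the same estimation in this sketch.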

Table 2. Entire database used in the experiments.


4.3. New exact algorithm

Keeping in mind the reductions of the search space proposed so far (including the heuristic function), and assuming that the minimum common supergraph of $S$, $g_{m_S}$, has been computed, we are able to present a new and more efficient exact algorithm to compute the generalized median graph. For the sake of completeness, it is presented as Algorithm 1.

The algorithm receives a set $S$ of graphs as input and returns a complete list of true median graphs, all of them with the same SOD. The first task is to compute $g_{m_S}$. This is carried out using an approach similar to that presented in (Bunke et al., 2003). Once $g_{m_S}$ is computed, the algorithm computes the term $SOD(g_{m_S}) = \sum_i d(g_i, g_{m_S})$ using the function ComputeSOD($g_{m_S}$, $S$). The distance between two graphs $d(g_1, g_2)$ is computed using Eq. (1). The pair ($g_{m_S}$, SOD($g_{m_S}$)) is stored in the list of candidate medians (line 3). After that, all the possible induced subgraphs of $g_{m_S}$ with size $|g_{m_S}| - 1$ are generated using the function Expand($g_{m_S}$) (line 5) and stored in a table. Given a graph $g$, the function Expand($g$) generates all its induced subgraphs of size $|g| - 1$ by taking every combination of $|g| - 1$ of its nodes and building the subgraph of $g$ induced by these nodes. Once all the induced subgraphs of $g_{m_S}$ are generated, they are marked as valid or not by using the heuristic function explained in Section 4.2. The SOD is computed only for valid graphs and compared with the best SOD found so far. If it is better, the graph becomes the new current median graph. However, for all the graphs at the current level (valid or not), all their induced subgraphs are generated, again using the function Expand, in order to make them available in the next iteration. This is due to the fact, explained at the end of the previous section, that we can only discard states at the next level of the search space. In addition, since different graphs may generate the same induced subgraph (the search space is a rhombus, not a tree), we keep a hash table at each level of the search space that records whether an induced subgraph has already been generated and whether it is valid or not. The process is repeated until all the levels of the rhombus search space have been visited.

Algorithm 1. Exact median algorithm

Require: A set of graphs $S = \{g_1, g_2, \ldots, g_n\}$
Ensure: A list L of true median graphs

 1: $g_{m_S}$ = Compute mcs(S)
 2: ComputeSOD($g_{m_S}$, S)
 3: Insert pair ($g_{m_S}$, SOD($g_{m_S}$)) in L
 4: Let SOD($g_{m_S}$) be the minimum SOD
 5: Expand($g_{m_S}$)
 6: for size = $|g_{m_S}| - 1$ to 0 do
 7:     while RemainInducedSubGraph(size) do
 8:         g = GetNextInducedSubgraph()
 9:         Expand(g)
10:         if IsValid(g) then
11:             ComputeSOD(g, S)
12:             if SOD(g) is less than the minimum SOD then
13:                 Let SOD(g) be the minimum SOD
14:                 Delete L
15:                 Insert pair (g, SOD(g)) in L
16:             else
17:                 if SOD(g) is equal to the minimum SOD then
18:                     Insert pair (g, SOD(g)) in L
19:                 end if
20:             end if
21:         end if
22:     end while
23: end for
24: Return L
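The control flow of Algorithm 1 can be sketched in Python under a deliberately simplified toy model. The assumptions are ours, not the paper's: every graph is an induced subgraph of one fixed graph with uniquely labelled nodes, so a graph reduces to its set of node labels, the maximum common subgraph of two graphs is the intersection of their label sets, and the starting graph $g_{m_S}$ is the union of all label sets. Eq. (1) is assumed to have the form $d(g_1, g_2) = |g_1| + |g_2| - 2|mcs(g_1, g_2)|$, and `is_valid` is a hypothetical stand-in for the heuristic of Section 4.2 that accepts every graph.

```python
def distance(g1, g2):
    # Assumed form of Eq. (1): d = |g1| + |g2| - 2 |mcs(g1, g2)|.
    # Toy simplification: with unique labels, mcs is the label intersection.
    return len(g1) + len(g2) - 2 * len(g1 & g2)


def sod(g, graphs):
    """Sum of distances from candidate g to every graph in the set."""
    return sum(distance(g, gi) for gi in graphs)


def expand(g):
    """All induced subgraphs with one node fewer (a set removes duplicates)."""
    return {g - {v} for v in g}


def is_valid(g):
    # Hypothetical placeholder for the heuristic of Section 4.2.
    return True


def exact_median(graphs):
    graphs = [frozenset(g) for g in graphs]
    g_ms = frozenset().union(*graphs)            # line 1 (toy: union of labels)
    min_sod = sod(g_ms, graphs)                  # lines 2 and 4
    medians = [(g_ms, min_sod)]                  # line 3
    level = expand(g_ms)                         # line 5
    for _size in range(len(g_ms) - 1, -1, -1):   # line 6
        next_level = set()                       # per-level table of subgraphs
        for g in level:                          # lines 7-8
            next_level |= expand(g)              # line 9: expand valid or not
            if not is_valid(g):                  # line 10
                continue
            s = sod(g, graphs)                   # line 11
            if s < min_sod:                      # lines 12-15
                min_sod = s
                medians = [(g, s)]
            elif s == min_sod:                   # lines 17-18
                medians.append((g, s))
        level = next_level
    return medians                               # line 24


S = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b"}]
print(exact_median(S))
```

For this set the search returns the single median {a, b} with SOD 2. The toy model removes the expensive part of the real problem (computing maximum common subgraphs), so it illustrates only the level-by-level search and the handling of ties in L.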

5. Experimental setup

In this section, we provide the results of an experimental evaluation of the proposed method for the generalized median graph computation in order to validate the improvements we have introduced with our new algorithm. We have separated the experiments into two parts. Firstly, in Section 5.1 we compare our new algorithm with the multimatch approach using a dataset of synthetic data. After that, in Section 5.2 we extend the applicability of our algorithm to a real database and compare it against the genetic algorithm presented in (Jiang et al., 2001). All these experiments were run on an Apple iMac computer equipped with a 2.16 GHz Intel Core 2 Duo processor and 2 GB of main memory.

5.1. Application to synthetic data

In these experiments, the graph dataset was composed of graphs that represent capital letters. It was originally designed manually at the University of Bern. From the original set composed of 15 letters, we only used a subset of six letters: L, V, N, T, K and M. Then, from each original model, we manually generated four distorted instances. Therefore, our dataset is composed of 30 elements and six classes. Each class represents a letter and contains five different instances of that letter. Table 2 shows the original letters in the first row and the four distorted letters in the rest of the rows. Such distortions include moving or deleting nodes and deleting edges. Thus, graphs with different cardinality may appear in the same class. The letters are represented by graphs as follows. The straight lines are represented by edges and the terminal points of the lines by nodes. Nodes are labelled with a two-dimensional attribute that represents the position (x, y) of the terminal point. Edges have a one-dimensional binary attribute that represents existence or non-existence.
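The letter representation described above could be held in a small data structure such as the following. This is a sketch only: the class name `LetterGraph`, its methods and the coordinates of the letter L are illustrative assumptions and are not taken from the original dataset.

```python
from dataclasses import dataclass, field


@dataclass
class LetterGraph:
    # node id -> (x, y) position of a terminal point
    positions: dict = field(default_factory=dict)
    # undirected edges; membership encodes the binary "exists" attribute
    edges: set = field(default_factory=set)

    def add_node(self, node_id, x, y):
        self.positions[node_id] = (x, y)

    def add_line(self, u, v):
        if u not in self.positions or v not in self.positions:
            raise KeyError("both end points must be existing nodes")
        self.edges.add(frozenset((u, v)))

    def has_line(self, u, v):
        return frozenset((u, v)) in self.edges

    def remove_node(self, node_id):
        # Deleting a node also deletes the lines that end at it.
        del self.positions[node_id]
        self.edges = {e for e in self.edges if node_id not in e}


def letter_L():
    """Three terminal points joined by two straight lines (made-up coordinates)."""
    g = LetterGraph()
    g.add_node(0, 0.0, 2.0)   # top of the vertical stroke
    g.add_node(1, 0.0, 0.0)   # corner
    g.add_node(2, 1.0, 0.0)   # end of the horizontal stroke
    g.add_line(0, 1)
    g.add_line(1, 2)
    return g


model = letter_L()

moved = letter_L()
moved.add_node(2, 1.2, 0.1)   # distortion: move a node

shrunk = letter_L()
shrunk.remove_node(2)         # distortion: delete a node (and its line)
```

The two distorted instances show why graphs of the same class may differ both in node attributes and in cardinality.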

The experiments consisted of the computation of the generalized median graphs using different numbers of graphs in the training set. For each median, the elapsed time and the number of SOD computations needed to compute it were recorded.

5.1.1. Experimental results

To compare the computation time and the number of SOD computations of our new algorithm with respect to the multimatch algorithm, we defined 24 sets of graphs. For every class, we formed four different sets composed of 2, 3, 4 and 5 graphs, respectively. Table 3 shows, for each letter (first column), the number of nodes of the graph that represents it (in brackets) and, for each of the four sets, the maximum sum of nodes of the set.

Notice that, because of computation time, not all the possible combinations could be used to compute the generalized median graph using the multimatch algorithm. The combinations used are marked with the symbol * in Table 3. In contrast, all the possible combinations could be applied to our new exact algorithm.

Table 3
Sum of nodes in the set S.

Letters (nodes/graph)    No. graphs in S
                         2      3      4      5
L, V (3)                 6*     9*     12*    15
N, T (4)                 8*     12*    16     19
K, M (5)                 10*    15     20     24

* Combinations used in the multimatch algorithm.

Fig. 5. Example of some computed medians.

In Fig. 5, we can see some of the median graphs obtained with our algorithm. The figure shows the median graphs obtained using five graphs for every letter. A first important remark is that in four out of the six letters (Fig. 5a–d) the median graph is also the maximum common subgraph of all the graphs in the set. Although it is only an empirical result, this fact suggests a possible relationship between the maximum common subgraph and the median graph, as has already been established in (Bunke and Kandel, 2000) for two graphs, but extended to a set of an arbitrary number of graphs.

5.1.2. Computation time and number of SOD computations

The results of the computation time (including the mcs computation) and the SOD computations required to calculate the median graph as a function of the total number of nodes in S are shown in Fig. 6a and b, respectively. Notice that the results are the mean values obtained for all sets S having the same number of nodes.

The first important result is that, due to the combinatorial explosion of the multimatch algorithm, it could only be applied to sets of graphs whose sum of nodes is up to 12. Beyond this limit, the time required by this algorithm becomes infeasible. This result is consistent with the results presented in (Jiang et al., 2001). In contrast, our algorithm could be applied with reasonable computation times to sets having up to 24 nodes. In addition, the computation time required by our algorithm is considerably lower than the time required by the multimatch algorithm, even for small sums of nodes in S. This difference in time becomes more evident as the sum of nodes in S grows. All these facts can be appreciated in the results shown in Fig. 6a.

A similar behavior can be appreciated in the number of SOD computations needed by the two algorithms (Fig. 6b). Again, a significant difference between the number of SOD computations required by the multimatch algorithm and by our algorithm can be observed in the results. This reduction in the number of SOD computations can be attributed both to the reduction of the search space and to the heuristic function presented in Section 4.2.

Fig. 6. Computation time (a) and number of SOD computations (b) as a function of the total number of nodes of the graphs in S.

Table 4
Percentage of time and SOD computations required by the new exact algorithm with respect to the multimatch algorithm.

               Time (s)                              SOD computations
$\sum |g_i|$   Multimatch   New     %Reduction      Multimatch   New   %Reduction
6              0.51         0.103   79.81           1794         16    99.11
8              12.25        0.181   98.53           2943         32    98.92
9              25.40        0.184   99.28           14,092       40    99.72
10             9712         0.283   99.98           33,961       72    99.79
12             46,794.85    0.352   99.99           157,176      76    99.95
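The %Reduction columns of Table 4 can be recomputed from the raw counts. The short sketch below does so for the SOD computations, assuming the reduction is defined as 100 · (1 − new / multimatch); the function name `reduction_percent` and the variable names are ours. Under this assumption the recomputed values agree with the table to within rounding of the second decimal.

```python
def reduction_percent(reference, new):
    """Reduction of `new` relative to `reference`, as a percentage."""
    return 100.0 * (1.0 - new / reference)


# (sum of nodes, multimatch SOD computations, new-algorithm SOD computations)
sod_rows = [
    (6, 1794, 16),
    (8, 2943, 32),
    (9, 14092, 40),
    (10, 33961, 72),
    (12, 157176, 76),
]

reductions = [reduction_percent(ref, new) for _, ref, new in sod_rows]
required = [100.0 - r for r in reductions]   # share of the reference still needed
mean_required = sum(required) / len(required)

for (nodes, _, _), r in zip(sod_rows, reductions):
    print(f"sum of nodes {nodes:2d}: reduction {r:.2f}%")
print(f"mean required share: {mean_required:.2f}%")
```

The mean share of SOD computations still required comes out at roughly half a percent of the multimatch count.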

Fig. 7. Active compounds (a) and inactive compounds (b).


5.1.3. Reduction in time and SOD computations

In order to quantify the gain achieved by our algorithm with respect to the multimatch algorithm, we compare in Table 4 the time and number of SOD computations required by both algorithms to compute the generalized median graphs. The values of the computation time of the multimatch algorithm and of our new exact algorithm are shown in columns 2 and 3, respectively. The next column (labelled %Reduction) represents the reduction in time (as a percentage) achieved by our algorithm, taking the value of the multimatch algorithm as reference. The same reasoning applies to the number of SOD computations in the rest of the table.

We can observe that a significant reduction in the time needed for the median graph computation is achieved by our new algorithm with respect to the multimatch algorithm. This reduction grows as the sum of nodes in the set increases. Notice that, on average, our algorithm needs about 4.5% of the time required by the multimatch algorithm to compute the median graph (a reduction of 95.5%). Similar reasoning applies to the number of SOD computations: the mean percentage of required SOD computations is about 0.5% (a reduction of 99.5%).

5.2. Application to real data

In the previous section, it has been shown that the computation of the median graph by means of our new algorithm is not restricted to a very small number of graphs with a small number of nodes, as is the case for the multimatch algorithm. In this section we go a step further and extend the exact computation of the median graph to real data. In particular, we use our algorithm to obtain prototypes of certain molecules. We compare the results with those obtained by our own implementation of the genetic algorithm presented in (Jiang et al., 2001).

Fig. 8. Computation time for active (a) and inactive (b) compounds as a function of the size of mcs(S).

The molecule database consists of graphs representing molecular compounds. These graphs are extracted from the AIDS Antiviral Screen Database of Active Compounds (DTP, 2004). This database consists of two different classes of molecules: active and inactive, depending on whether they show activity against HIV or not, respectively. The molecules are converted into graphs in a straightforward way, representing the atoms as nodes and the covalent bonds as edges. The nodes are labelled with the number of the corresponding chemical symbol. The edges are labelled with the valence of the linkage. Some examples of each class are shown in Fig. 7. In order to simplify the representation, in these examples different chemical symbols are represented using a different shade of gray.
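The conversion of a molecule into a labelled graph can be sketched as below. Several details are assumptions made for illustration: we read "the number of the corresponding chemical symbol" as the atomic number, the lookup table covers only four elements, and the function name `molecule_to_graph` and the input format (a list of symbols plus a list of bonds) are hypothetical rather than the format of the DTP database.

```python
# Atomic numbers of a few common elements (assumed node labelling).
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8}


def molecule_to_graph(atoms, bonds):
    """Build node and edge label maps from atoms and covalent bonds.

    atoms: list of chemical symbols, indexed by atom id
    bonds: list of (atom id, atom id, valence) triples
    """
    node_labels = {i: ATOMIC_NUMBER[symbol] for i, symbol in enumerate(atoms)}
    edge_labels = {}
    for u, v, valence in bonds:
        if u == v or u not in node_labels or v not in node_labels:
            raise ValueError(f"invalid bond ({u}, {v})")
        # Undirected edge labelled with the valence of the linkage.
        edge_labels[frozenset((u, v))] = valence
    return node_labels, edge_labels


# Formaldehyde, H2C=O: one double bond and two single bonds.
atoms = ["C", "O", "H", "H"]
bonds = [(0, 1, 2), (0, 2, 1), (0, 3, 1)]
nodes, edges = molecule_to_graph(atoms, bonds)
print(nodes)
print(len(edges))
```

Because nodes carry element labels rather than unique identifiers, two atoms of the same element are interchangeable during matching, which is part of what makes the common subgraph computation on molecules costly.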

For each class we have 50 different instances or molecules. The experiments consisted of generating, for each class, 40 sets having a different number of graphs. The graphs in each set had different sizes and were chosen randomly from the original set of 50 instances. In this case, for each set, the time required and the SOD of the median graph obtained by each algorithm were recorded. It is important to notice that the sum of nodes in the sets ranged from 4 to 23 nodes. The results are shown in the next sections.

5.2.1. Computation time

The required computation times for both active and inactive compounds are shown in Fig. 8a and b, respectively. In both charts, the x-axis represents the size of the minimum common supergraph of the set, independently of the sum of nodes in the set.

First of all, it is important to notice that for minimum common supergraphs of sizes up to 10–11, the time required by our algorithm is lower than the time required by the genetic algorithm. This difference is especially relevant for small sizes of the minimum common supergraph (up to 8), and also for the case of inactive molecules, where the time required by our exact algorithm is two orders of magnitude lower than that of the genetic algorithm. It is important to mention that, for sizes of the minimum common supergraph up to 11 nodes, we can find sets whose sum of nodes was 20 nodes. This result suggests that for sets of graphs sharing large structures (where, consequently, the size of the minimum common supergraph tends to the mean size of the graphs in the set), our algorithm may outperform the genetic algorithm.

As our algorithm is an optimal one, the computation time for larger sizes of the minimum common supergraph increases rapidly. Nevertheless, the computation time needed in these cases is not infeasible (the sum of nodes of the sets with a minimum common supergraph of 14 nodes was 23 nodes).

5.2.2. SOD

The definition of the median graph implies the computation of the sum of distances of the candidate median to all the graphs in the set. In this sense, a measure of how good the median graph is can be obtained by computing the term SOD for the computed medians. The SOD achieved by the medians computed by both algorithms is shown in Fig. 9a and b, for the active and inactive compounds, respectively. Again, the x-axis represents the size of the minimum common supergraph of the set, independently of the sum of nodes in the set.

As expected, the term SOD for the genetic algorithm is always greater than or equal to the term SOD achieved by the exact algorithm. This behavior is clearer in the case of inactive compounds, where for sizes of the minimum common supergraph greater than 9, the difference in the term SOD increases significantly (Fig. 9b). The difference is less evident in the case of active compounds.

This difference in the term SOD suggests that the medians obtained by the genetic algorithm tend to diverge as the size of the minimum common supergraph increases (that is, as the graphs in the set become more dissimilar). This may lead to median graphs that do not accurately represent the set of graphs. In contrast, the exact algorithm always finds an optimal median graph. Thus it always finds the best representative of the set (following the criteria of the definition of the median graph).

Fig. 9. SOD of computed median for active (a) and inactive (b) compounds as a function of the size of mcs(S). (Plots of SOD against the size of $g_{m_S}$ for the genetic algorithm, GA, and the new algorithm, New.)

6. Conclusions

Median graphs have been presented as a good alternative to compute the representative of a set of graphs. Unfortunately, the exact computation of the median graph has an exponential complexity. Only one exact algorithm has been presented up to now. Although exact solutions are in most cases not feasible in real applications, it is still important to be able to obtain them as they will always be a better representation of a set. In addition, finding exact solutions may help us to understand the nature of this concept and can be used to obtain better approximate solutions too.

The main contributions of this paper are two-fold. Firstly, and from a theoretical point of view, we have shown that using a particular cost function and a distance measure based on the maximum common subgraph, the size of the search space where the median graph has to be computed can be drastically reduced. In addition, a heuristic strategy has also been introduced in order to improve the basic version of this new algorithm, avoiding the evaluation of some states in the search space. Secondly, and based on this theoretical result, a new exact algorithm for the computation of the generalized median graph has been presented.

We have compared this new exact algorithm to the previously existing exact algorithm using synthetic data. The results show that our algorithm clearly outperforms the previous algorithm. Encouraged by these results, we have applied our algorithm to the computation of median graphs using real data and we have compared the results with those obtained by an approximate algorithm based on genetic search. The results show that, although the application of the exact algorithm is still limited, it can be used in real problems for the first time.

There are still a number of issues to be investigated in the future. These results open a door towards the application of the new theoretical properties and the new algorithm to obtain more accurate approximate solutions of the median graph and also to increase the efficiency of the existing approximate algorithms. In this sense, we have already presented in (Ferrer et al., 2007a,b) two theoretical results related to some bounds on the size and the maximum SOD of the median graph, which may help to obtain better approximate algorithms and also to extend the applicability of the median graph. In addition, the possible relation between the maximum common subgraph and the median graph suggested in Section 5 can also be further investigated in order to see whether the algorithms for the computation of the MCS could be used to obtain exact or approximate solutions of the median graph.

Acknowledgements

This work has been supported by the research Fellowship number 401-027 (UAB), the Cicyt project TIN2006-15694-C02-02 (Ministerio Ciencia y Tecnología) and the Spanish research programme Consolider Ingenio 2010: MIPRCV (CSD2007-00018). We would like to thank K. Riesen from the University of Bern for his help with the databases.

References

Balas, E., Yu, C.S., 1986. Finding a maximum clique in an arbitrary graph. SIAM J. Comput. 15 (4), 1054–1068.

Bunke, H., 1997. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Lett. 18 (8), 689–694.

Bunke, H., Allerman, G., 1983. Inexact graph matching for structural pattern recognition. Pattern Recognition Lett. 1 (4), 245–253.

Bunke, H., Kandel, A., 2000. Mean and maximum common subgraph of two graphs. Pattern Recognition Lett. 21 (2), 163–168.

Bunke, H., Münger, A., Jiang, X., 1999. Combinatorial search versus genetic algorithms: A case study based on the generalized median graph problem. Pattern Recognition Lett. 20 (11-13), 1271–1277.

Bunke, H., Jiang, X., Kandel, A., 2000. On the minimum common supergraph of two graphs. Computing 65 (1), 13–25.

Bunke, H., Foggia, P., Guidobaldi, C., Sansone, C., Vento, M., 2002. A comparison of algorithms for maximum common subgraph on randomly connected graphs. In: Structural, Syntactic, and Statistical Pattern Recognition, Joint IAPR Internat. Workshops SSPR 2002 and SPR 2002, Proc., Windsor, Ontario, Canada, August 6–9, 2002. Lecture Notes in Computer Science, vol. 2396.

Bunke, H., Foggia, P., Guidobaldi, C., Vento, M., 2003. Graph clustering using the weighted minimum common supergraph. In: Graph Based Representations in Pattern Recognition, 4th IAPR Internat. Workshop, GbRPR 2003, Proc., York, UK, June 30–July 2, 2003. Lecture Notes in Computer Science, vol. 2726.

Conte, D., Foggia, P., Vento, M., 2007. Challenging complexity of maximum common subgraph detection algorithms: A performance analysis of three algorithms on a wide database of graphs. J. Graph Algorithms Appl. 11 (1), 99–143.

Development Therapeutics Program, DTP, 2004. AIDS Antiviral Screen. <http://dtp.nci.nih.gov/docs/aids/aids_data.html>.

Durand, P.J., Pasari, R., Baker, J.W., Tsai, C.C., 1999. An efficient algorithm for similarity analysis of molecules. Internet J. Chem. 2 (17).

Ferrer, M., Serratosa, F., Sanfeliu, A., 2005. Synthesis of median spectral graph. In: Marques, J.S., de la Blanca, N.P., Pina, P. (Eds.), IbPRIA (2), Lecture Notes in Computer Science, vol. 3523. Springer.

Ferrer, M., Serratosa, F., Valveny, E., 2007a. On the relation between the median and the maximum common subgraph of a set of graphs. In: Escolano, F., Vento, M. (Eds.), Graph-Based Representations in Pattern Recognition, 6th IAPR-TC-15 Internat. Workshop, GbRPR 2007, Proc., Alicante, Spain, June 11–13, 2007. Lecture Notes in Computer Science, vol. 4538. Springer.

Ferrer, M., Valveny, E., Serratosa, F., 2007b. Bounding the size of the median graph. In: Martí, J., Benedí, J.-M., Mendonça, A.M., Serrat, J. (Eds.), IbPRIA (2), Lecture Notes in Computer Science, vol. 4478. Springer.

Hlaoui, A., Wang, S., 2006. Median graph computation for graph clustering. Soft Comput. 10 (1), 47–53.

Jiang, X., Münger, A., Bunke, H., 2001. On median graphs: Properties, algorithms, and applications. IEEE Trans. Pattern Anal. Machine Intell. 23 (10), 1144–1151.

Justice, D., Hero, A.O., 2006. A binary linear programming formulation of the graph edit distance. IEEE Trans. Pattern Anal. Machine Intell. 28 (8), 1200–1214.

Lu, S.W., Ren, Y., Suen, C.Y., 1991. Hierarchical attributed graph representation and recognition of handwritten chinese characters. Pattern Recognition 24 (7), 617–632.

McGregor, J.J., 1982. Backtrack search algorithms and the maximal common subgraph problem. Software Pract. Exper. 12 (1), 23–24.

Münger, A., 1998. Synthesis of prototype graphs from sample graphs. Diploma Thesis, University of Bern (in German).

Neuhaus, M., Riesen, K., Bunke, H., 2006. Fast suboptimal algorithms for the computation of graph edit distance. In: Structural, Syntactic, and Statistical Pattern Recognition, Joint IAPR Internat. Workshops, SSPR 2006 and SPR 2006, Proc., Hong Kong, China, August 17–19, 2006. Lecture Notes in Computer Science, vol. 4109.

Pearce, A.R., Caelli, T.M., Bischof, W.F., 1994. Rulegraphs for graph matching in pattern recognition. Pattern Recognition 27 (9), 1231–1247.

Robles-Kelly, A., Hancock, E.R., 2005. Graph edit distance from spectral seriation. IEEE Trans. Pattern Anal. Machine Intell. 27 (3), 365–378.

Sanfeliu, A., Fu, K., 1983. A distance measure between attributed relational graphs for pattern recognition. IEEE Trans. Systems Man Cybernet. 13 (3), 353–362.

Wang, Y., Maple, C., 2005. A novel efficient algorithm for determining maximum common subgraphs. In: Proc. 9th Internat. Conf. on Information Visualisation, IV 2005, July 6–8, 2005. IEEE Computer Society, London, UK.

White, D., Wilson, R.C., 2006. Mixing spectral representations of graphs. In: Proc. 18th Internat. Conf. on Pattern Recognition (ICPR 2006), August 20–24, 2006. IEEE Computer Society, Hong Kong, China.

Wong, E.K., 1992. Model matching in robot vision by subgraph isomorphism. Pattern Recognition 25 (3), 287–303.
