On the use of similarity metrics for approximate graph matching


S. Kpodjedo 1, P. Galinier, G. Antoniol

a DGIGL, Ecole Polytechnique de Montreal, Canada

Abstract

In this paper, we investigate heuristics for approximate graph matching, in particular its formulation as a Maximum Common Edge Subgraph problem. Our experiments suggest that a small percentage of accurate node matches is sufficient to get near-optimal solutions using simple hill climbing. The real challenge could then be to steer the search into this zone. For this purpose, we discuss the use of similarity measures. We present and assess the performance of two similarity measures. Very good results were obtained on labeled graphs.

Keywords: approximate graph matching, maximum common edge subgraph, similarity measure, heuristics

1 Introduction

In many applications, a frequent task is to find an approximate matching between two given objects represented as graphs while optimizing a particular criterion. Depending on the application domain, the graphs to be matched can represent images (computer vision) [8], molecules (computational chemistry) [9], software artifacts (software engineering) [1,7], etc. Particular cases of approximate graph matching (AGM) problems are the Maximum Common Edge Subgraph (MCES) problem [5] and the Error-Tolerant Graph Matching (ETGM) problem [3]. Techniques proposed to solve graph matching problems encompass exact algorithms (for very small graphs), such as the RASCAL algorithm for MCES [4], and meta-heuristics [2]. In an AGM problem, a matching is any relation between the sets of vertices of the two graphs such that any vertex in one graph is matched to at most one vertex in the other graph (a 1-to-1 constraint); pairs of vertices contained in a matching are named node matches. Figure 1 represents two graphs with labels on their edges and a matching {(a, α), (b, β), (d, δ), (e, ε)} between them. In this matching, we can observe (i) two correct edge correspondences (between (b, d) and (β, δ) and between (e, b) and (ε, β)); (ii) one bad edge correspondence (between (a, b) and (α, β), as those do not have the same label); and (iii) two edges, (e, a) and (ε, δ), that are not matched to an edge in the other graph. In an AGM problem, the score of a matching depends on bonuses credited for correct correspondences and/or penalties assigned for errors. The goal of the problem is to find a matching with a maximum score. Notice that, when the two graphs are highly similar, near-optimal solutions tend to share a large number of node matches. These node matches will be named correct node matches in the following.

1 Email: [email protected]

Electronic Notes in Discrete Mathematics 36 (2010) 687–694

1571-0653/$ – see front matter © 2010 Elsevier B.V. All rights reserved.

www.elsevier.com/locate/endm

doi:10.1016/j.endm.2010.05.087

Fig. 1. Example of graph matching
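As a minimal sketch of how a matching can be scored when only correct edge correspondences earn a bonus (the MCES setting, with no penalties), the following Python snippet counts the correct correspondences for the matching of Figure 1. The function name `mces_score`, the dictionary representation of the graphs, the edge directions and the concrete labels "x" and "y" are assumptions made for illustration; the text only states which edges share a label, not what the labels are.

```python
def mces_score(edges1, edges2, matching):
    """Count edges of G1 mapped by `matching` onto an edge of G2 with the same label.

    edges1, edges2: dict mapping a directed edge (u, v) to its label.
    matching: dict mapping matched vertices of G1 to vertices of G2 (1-to-1).
    """
    score = 0
    for (u, v), label in edges1.items():
        if u in matching and v in matching:
            # A correspondence is correct only if the image edge exists
            # in G2 and carries the same label.
            if edges2.get((matching[u], matching[v])) == label:
                score += 1
    return score


# Graphs of Figure 1; the labels "x" and "y" are hypothetical.
g1 = {("b", "d"): "x", ("e", "b"): "y", ("a", "b"): "x", ("e", "a"): "y"}
g2 = {("beta", "delta"): "x", ("epsilon", "beta"): "y",
      ("alpha", "beta"): "y", ("epsilon", "delta"): "x"}
matching = {"a": "alpha", "b": "beta", "d": "delta", "e": "epsilon"}

score = mces_score(g1, g2, matching)
print(score)  # 2: (b, d) and (e, b) are correctly matched
```

The bad correspondence between (a, b) and (α, β) and the two unmatched edges contribute nothing, which is why the score stays at 2.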

Our current work is devoted to investigating heuristics for the AGM problem. We found that, even with very small graphs (graphs with as few as 20 vertices), local search (LS) techniques (such as tabu search) guided only by the score of the matching provide very poor solutions when they are initialized with an empty or a randomly generated matching. In a previous paper [6], we investigated a powerful approach which consists in guiding the local search by using a similarity measure between the pairs of vertices of the two graphs. The proposed similarity measure is based on local information related to the edges adjacent to the two vertices. The principle is to encourage pairs of vertices with a high similarity to be introduced into the solution.

In this paper, we performed new and more detailed experiments in order to get more insight into the nature of the search space and how similarity can be used to simplify the matching task. The particular approximate graph matching problem considered in these experiments is MCES (a matching model with no penalties for bad correspondences), and we consider graphs labeled on their edges (and not on their vertices). Our experiments are performed with pairs of random graphs of various sizes and densities, and with controlled distortion between the two graphs to be matched.

In a first experiment, we observe that a greedy heuristic guided by the objective function is inefficient, due to erroneous choices frequently made during the first iterations. However, the heuristic obtains good results when it is initialized with correct node matches. It is interesting to observe that a very small proportion of node matches is sufficient to obtain good matchings, at least when the degree of similarity of the two graphs is high.

In a second experiment, we evaluate the ability of the proposed similarity measure to predict correct node matches. Very good results were obtained on labeled graphs with high similarity.

In the conclusion, we discuss the results of the experiments and two different algorithms that combine local search (and greedy search) with the use of similarity measures.

2 Benchmark

The graphs used in our experiments have labels on their edges but not on their nodes. In order to produce these pairs of graphs, we used a random generator controlled by the following parameters:

• Parameter n represents the number of vertices of the two graphs.

• Parameter d is a function of the expected density of the graphs and represents the expected mean of the in and out degree of a vertex.

• Parameter nl indicates the number of labels.

• Parameter q (0 ≤ q ≤ 1) is used to control the similarity between the two graphs. The larger the value of q, the more similar the two graphs. In particular, for q = 0, the two graphs are built independently; and for q = 1, the two graphs are isomorphic.

Given a quadruplet (n, d, nl, q) of parameters, the generator builds the two graphs G1 and G2. The two graphs have the same number n of vertices, and an initial (complete) random matching μ0 is built. A pair of vertices in G1 or G2 is assigned an edge with probability d/(n − 1) = density. A label is assigned to each edge according to a uniform distribution. Corresponding pairs of vertices in the two graphs are constrained to be assigned the same label with probability q (see [6] for more details). In addition to the two graphs, the generator returns the matching μ0 used during the construction. The matching μ0 and its score are used throughout this paper as a reference.
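The generator can be sketched as follows in Python. This is only an illustration under stated assumptions: the exact distortion procedure is given in [6] and not here, so the rule used below (with probability q the pair in G2 copies the edge status and label of the corresponding pair in G1, otherwise it is drawn independently) is a guess chosen to reproduce the two stated extremes, q = 0 (independent graphs) and q = 1 (isomorphic graphs). The function name `generate_pair`, the integer labels, the exclusion of self-loops and the `seed` argument are likewise hypothetical.

```python
import random


def generate_pair(n, d, nl, q, seed=None):
    """Build two random edge-labeled directed graphs and the reference matching mu0."""
    rng = random.Random(seed)
    density = d / (n - 1)

    # mu0: a complete random matching, vertex i of G1 -> perm[i] of G2.
    perm = list(range(n))
    rng.shuffle(perm)
    mu0 = dict(enumerate(perm))

    def draw():
        """Return a uniformly drawn label with probability `density`, else None (no edge)."""
        if rng.random() < density:
            return rng.randrange(nl)
        return None

    g1, g2 = {}, {}
    for u in range(n):
        for v in range(n):
            if u == v:
                continue  # assumption: no self-loops
            e1 = draw()
            # Assumed distortion rule: copy G1 with probability q, else redraw.
            e2 = e1 if rng.random() < q else draw()
            if e1 is not None:
                g1[(u, v)] = e1
            if e2 is not None:
                g2[(mu0[u], mu0[v])] = e2
    return g1, g2, mu0


g1, g2, mu0 = generate_pair(n=20, d=2, nl=4, q=1.0, seed=1)
```

With q = 1.0 every pair is copied, so G2 is the image of G1 under μ0; lower values of q introduce progressively more distortion.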

For our experiments, we use the following values for the parameters:

• Number of vertices: n = 300, 1000 or 3000;

• Expected mean of in and out degree of a vertex: d = 2 or 5;

• Number of labels: nl = 1 or 4;

• Similarity parameter: q = 0.5, 0.6, 0.7, 0.8, 0.9 or 1;

We have generated 72 pairs of graphs, one for each possible combination of the parameters n, d, nl and q.
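The count of 72 follows directly from the parameter values listed above (3 sizes × 2 degrees × 2 label counts × 6 similarity values). A short Python check, with variable names chosen here for illustration:

```python
from itertools import product

sizes = [300, 1000, 3000]          # n
degrees = [2, 5]                   # d
label_counts = [1, 4]              # nl
similarities = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  # q

# One benchmark instance per (n, d, nl, q) combination.
grid = list(product(sizes, degrees, label_counts, similarities))
print(len(grid))  # 72
```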

3 Assessing the importance of a good start

In the experiment presented in this section, we evaluate the quality of the solutions obtained when the search is initialized with a varying number of correct node matches taken from μ0. Our greedy heuristic builds a solution step by step by inserting a new node match while respecting the 1-to-1 constraint. The new node match is the one that produces the largest increase in the score function (the largest number of new edges correctly matched). We performed experiments using our 72 pairs of graphs. For each pair of graphs, we performed six series of experiments using different sample sizes: 0 (no initialisation), 1, 2, 5, 10 and 20% of n (the number of vertices). Ten runs (with different random seeds) of our greedy algorithm were performed in each series of experiments.
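The greedy heuristic described above can be sketched in Python as follows. This is a naive illustration, not the implementation used in the experiments: the names `gain` and `greedy_match`, the random tie-breaking, and the choice to continue until every vertex is matched are assumptions. The optional `seed_matches` argument stands for the correct node matches sampled from μ0.

```python
import random


def gain(u, v, g1, g2, matching):
    """Number of new correctly matched edges if node match (u, v) is inserted."""
    g = 0
    for w, w2 in matching.items():
        # Edge u -> w in G1 against v -> w2 in G2 (same label required).
        if (u, w) in g1 and g1[(u, w)] == g2.get((v, w2)):
            g += 1
        # Edge w -> u in G1 against w2 -> v in G2.
        if (w, u) in g1 and g1[(w, u)] == g2.get((w2, v)):
            g += 1
    return g


def greedy_match(nodes1, nodes2, g1, g2, seed_matches=None, rng=None):
    """Insert, at each step, the node match with the largest score increase."""
    rng = rng or random.Random(0)
    matching = dict(seed_matches or {})
    used2 = set(matching.values())
    free1 = [u for u in nodes1 if u not in matching]
    free2 = [v for v in nodes2 if v not in used2]
    while free1 and free2:
        best, best_pairs = -1, []
        for u in free1:
            for v in free2:
                g = gain(u, v, g1, g2, matching)
                if g > best:
                    best, best_pairs = g, [(u, v)]
                elif g == best:
                    best_pairs.append((u, v))
        u, v = rng.choice(best_pairs)  # assumption: ties are broken at random
        matching[u] = v                # 1-to-1: both vertices leave the free lists
        free1.remove(u)
        free2.remove(v)
    return matching


# Tiny example: G2 is G1 relabeled by 0 -> 2, 1 -> 0, 2 -> 1.
g1 = {(0, 1): 0, (1, 2): 1}
g2 = {(2, 0): 0, (0, 1): 1}
result = greedy_match([0, 1, 2], [0, 1, 2], g1, g2, seed_matches={0: 2})
print(result)  # {0: 2, 1: 0, 2: 1}
```

Starting from an empty matching, every candidate has a gain of zero at the first step, so the first choice is purely random; this is consistent with the erroneous early choices reported for the unseeded heuristic.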

Figures 2 and 3 report results obtained for graphs with n = 1000 and d = 2. Figure 2 presents the scores of the final solutions (on the y axis, as a percentage of the score of μ0) when they are initialised with different percentages of correct node matches (on the x axis), while Figure 3 presents the Hamming distance of those final solutions from μ0 (on the y axis, as a percentage of the number of nodes). Data on unlabeled (nl = 1) and labeled (nl = 4) graphs are separated and, on each figure, the different values of the similarity between the pair of graphs are represented as trend lines. For instance, the lines with points drawn as stars represent results on pairs of graphs with a similarity of 0.9. From this line, we can see that for labeled graphs (right-hand chart of Figure 2), an initialisation with an empty solution (0%) generates solutions with an average of only 32% of the μ0 score, while an initialisation with as few as 1% (10 nodes for graphs of 1000 nodes) of correct node matches is enough to get, on average, solutions up to 84% of the μ0 score. An initialisation with 2% (20 nodes) raises the average score to 90%. For the same line, we can see on the right-hand chart of Figure 3 that those high scores correspond to solutions very close to μ0: a distance of 16% (resp. 11%) for solutions initialised with 1% (resp. 2%) of μ0. This illustrates the fact that the obtained solutions are only a few node matches away from μ0, a small gap one can probably fill with techniques more powerful than a greedy algorithm.

Fig. 2. Score of the final solutions: (a) Unlabeled, (b) Labeled

Fig. 3. Hamming distance from μ0 of the final solutions: (a) Unlabeled, (b) Labeled
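The two quantities plotted in Figures 2 and 3 can be computed as in the following Python sketch. The function names are hypothetical, and the Hamming distance is assumed here to count the vertices of the first graph whose partner differs from the one given by μ0 (an unmatched vertex counts as differing).

```python
def relative_score(score, mu0_score):
    """Score of a solution as a percentage of the score of mu0."""
    return 100.0 * score / mu0_score


def hamming_distance_pct(matching, mu0):
    """Share of vertices (in %) whose partner differs from the one in mu0."""
    differing = sum(1 for u, v in mu0.items() if matching.get(u) != v)
    return 100.0 * differing / len(mu0)


mu0 = {0: 0, 1: 1, 2: 2, 3: 3}
solution = {0: 0, 1: 2, 2: 1, 3: 3}   # vertices 1 and 2 are swapped
distance = hamming_distance_pct(solution, mu0)
print(distance)  # 50.0
```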

When looking at the whole picture, we notice that the more similar the graphs, the smaller the percentage needed to reach good solutions and to get near-optimal solutions. There is also a significant difference between unlabeled and labeled graphs: the unlabeled graphs have worse results for all initialisation percentages and all similarity values. In contrast, on labeled graphs, even an initialisation with 5% of pairs from μ0 is enough to reach an average of more than 90% for similarity values ≥ 0.8.

As for experiments with graphs other than the ones presented in Figures 2 and 3, it is worth noting that the size of the graph did not significantly modify the displayed results, while a higher density (d = 5) provides significantly better results, probably due to the higher connectivity of nodes: a good node match is more likely to generate more good matches as the matched nodes affect more neighbors. In summary, we can see on those figures that a small percentage of node matches from μ0 is enough to lead a greedy algorithm to very good solutions. This is especially true for labeled graphs with similarity > 0.7.

4 Using similarity to predict the good node matches

We conjecture that similarity measures between nodes of the two graphs to be matched can greatly help in identifying the correct node matches, those likely to be in a near-optimal matching. We propose a way to compute similarity between the vertices of two graphs to be matched.

Let $G = (V, E, \Sigma)$ be a graph, $\Sigma$ being the alphabet of edge labels. For any vertex $x \in V$ and any label $l \in \Sigma$, we denote by $f^{+}(x, l)$ and $f^{-}(x, l)$ the number of ingoing and outgoing edges, respectively, adjacent to $x$ and whose label is $l$. Consider two graphs $G_1$ and $G_2$ and a matching $\mu$ between them. Given $(x_1, x_2) \in V_1 \times V_2$, we denote by $\mathrm{potential}(x_1, x_2)$ the maximum number of edges adjacent to $x_1$ and $x_2$ that can be correctly matched (assuming that $x_1$ would be matched to $x_2$):

$$\mathrm{potential}(x_1, x_2) = \sum_{l \in \Sigma} \Big( \min\big(f^{+}(x_1, l), f^{+}(x_2, l)\big) + \min\big(f^{-}(x_1, l), f^{-}(x_2, l)\big) \Big)$$

The similarity $\mathrm{simil}(x_1, x_2)$ between two nodes $x_1$ and $x_2$ is computed as:

(i) $$\mathrm{simil}_1(x_1, x_2) = \frac{2 \times \mathrm{potential}(x_1, x_2)}{\deg(x_1) + \deg(x_2)}$$

(ii) $$\mathrm{simil}_2(x_1, x_2) = \mathrm{simil}_1(x_1, x_2) \times \frac{\mathrm{potential}(x_1, x_2)}{\mathrm{maxPotential}}$$

where $\mathrm{maxPotential} = \max_{(x_1, x_2) \in V_1 \times V_2} \mathrm{potential}(x_1, x_2)$.
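The definitions above translate fairly directly into code. The following Python sketch computes the potential and both similarity measures for every pair of vertices. It assumes that deg(x) counts incoming plus outgoing edges (the text does not define it explicitly) and that pairs of isolated vertices get a similarity of 0; the function names and the dictionary graph representation are also illustrative. The example graphs reuse the structure of Figure 1 with hypothetical labels "x" and "y".

```python
from collections import Counter


def label_profiles(nodes, edges):
    """Return (f_in, f_out): per-vertex Counters of incoming/outgoing edge labels."""
    f_in = {x: Counter() for x in nodes}
    f_out = {x: Counter() for x in nodes}
    for (u, v), label in edges.items():
        f_out[u][label] += 1
        f_in[v][label] += 1
    return f_in, f_out


def potential(x1, x2, prof1, prof2):
    """Sum over labels of min(incoming counts) plus min(outgoing counts)."""
    (in1, out1), (in2, out2) = prof1, prof2
    # `Counter & Counter` keeps, for each label, the minimum of the two counts.
    return (sum((in1[x1] & in2[x2]).values())
            + sum((out1[x1] & out2[x2]).values()))


def similarities(nodes1, edges1, nodes2, edges2):
    """Compute simil1 and simil2 for every pair in V1 x V2."""
    prof1 = label_profiles(nodes1, edges1)
    prof2 = label_profiles(nodes2, edges2)

    def deg(x, prof):
        # Assumption: deg counts incoming plus outgoing edges.
        return sum(prof[0][x].values()) + sum(prof[1][x].values())

    pot = {(a, b): potential(a, b, prof1, prof2)
           for a in nodes1 for b in nodes2}
    max_pot = max(pot.values(), default=0)
    simil1, simil2 = {}, {}
    for (a, b), p in pot.items():
        total = deg(a, prof1) + deg(b, prof2)
        simil1[(a, b)] = 2 * p / total if total else 0.0
        simil2[(a, b)] = simil1[(a, b)] * p / max_pot if max_pot else 0.0
    return simil1, simil2


# Structure of Figure 1; labels "x" and "y" are hypothetical.
nodes1 = ["a", "b", "d", "e"]
nodes2 = ["alpha", "beta", "delta", "epsilon"]
g1 = {("b", "d"): "x", ("e", "b"): "y", ("a", "b"): "x", ("e", "a"): "y"}
g2 = {("beta", "delta"): "x", ("epsilon", "beta"): "y",
      ("alpha", "beta"): "y", ("epsilon", "delta"): "x"}

simil1, simil2 = similarities(nodes1, g1, nodes2, g2)
pot_b_beta = potential("b", "beta",
                       label_profiles(nodes1, g1), label_profiles(nodes2, g2))
print(pot_b_beta)  # 2
```

For the pair (b, β), one incoming "y" edge and one outgoing "x" edge can be matched, giving a potential of 2 and simil1 = 2 × 2 / (3 + 3) = 2/3.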

The first similarity measure (simil1) is a simple assessment of the similarity of two nodes using their potential. The second measure (simil2) multiplies simil1 by a factor meant to discriminate against vertices x1 and x2 with high similarity but low degrees. Indeed, if two vertices have low degrees, their high similarity can be due to mere chance. This is why we decrease their similarity score. Experiments were conducted for the 72 pairs of graphs. For each pair of graphs, the top similar pairs of nodes are considered. Given a percentage r of the number of nodes of the graphs, pr is the precision in terms of node matches belonging to μ0 that one can find in the top r × n pairs of nodes. Figure 4 reports results for graphs with n = 1000 and d = 2, using simil2.

Fig. 4. Precision in top similar pairs of nodes: (a) Unlabeled, (b) Labeled

While results for unlabeled graphs are very poor, interesting results were observed on labeled graphs. The more similar the graphs, the better the results. Notably, for graphs with similarity higher than 0.7, more than 90% of the top 2% node matches are correct. Overall, for all the 36 pairs of labeled graphs, the second similarity measure provided good results, especially for the first node matches: on average, more than 85% of the top 2% node matches belong to μ0 for very similar graphs (q = 0.8, 0.9 and 1). Due to space constraints, results for simil1 are not reported, as they appear consistently and significantly worse (by about 20%) than those of simil2.
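The precision pr used in this experiment can be sketched in Python as follows. The function name `precision_at_top` and the dictionary of similarity values keyed by node pairs are assumptions for illustration; ties in similarity are broken by sort order here, which the text does not specify.

```python
def precision_at_top(simil, mu0, r):
    """Fraction of the top r * n most similar pairs that belong to mu0.

    simil: dict mapping a pair (x1, x2) to its similarity value.
    mu0:   dict mapping each vertex of G1 to its reference partner in G2.
    r:     fraction of the number of nodes n (e.g. 0.02 for the top 2%).
    """
    n = len(mu0)
    k = max(1, round(r * n))
    # Rank all candidate pairs by decreasing similarity and keep the first k.
    top = sorted(simil, key=simil.get, reverse=True)[:k]
    hits = sum(1 for (x1, x2) in top if mu0.get(x1) == x2)
    return hits / len(top)


simil = {(0, 0): 0.9, (0, 1): 0.8, (1, 1): 0.7, (1, 0): 0.1}
mu0 = {0: 0, 1: 1}
p_half = precision_at_top(simil, mu0, 0.5)   # top 1 pair: (0, 0)
p_full = precision_at_top(simil, mu0, 1.0)   # top 2 pairs: (0, 0), (0, 1)
print(p_half, p_full)  # 1.0 0.5
```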

5 Concluding remarks

There are many ways of using the similarity measures in a greedy algorithm. The results of the above experiments suggest a very simple algorithm in which the top similar pairs of nodes are used to initialise solutions, to which any given heuristic is then applied. We implemented and tested it and, while it provided much better results than starting with an empty solution, we discovered it could be outperformed by a close alternative. One way that proved efficient was to combine similarity values and actual scores for every node match considered for insertion in the solution being built. We found that the weight of the similarity measure ought to be maximum at the beginning, because the similarity is the only reliable information about the number of perfect edge correspondences a node match might ultimately bring; then, it should decrease up to the point where the similarity is only used to break ties between equally scored moves. This technique was used in [6] with excellent results (100% of the μ0 score) on labeled graphs with high similarity. However, as consistently shown in this paper, more work is needed on unlabeled graphs to get better similarity measures and algorithms.
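One possible way to realise such a decreasing weight is sketched below in Python. The exact combination rule is not given in this paper, so everything here is an assumption: the additive form gain + w × similarity, the linear decay, the bounds `w_start` and `w_end`, and the function name `combined_value`. Since the similarity lies in [0, 1] and gains are integers, any final weight below 1 makes the similarity act only as a tie-breaker.

```python
def combined_value(gain, similarity, step, total_steps, w_start=10.0, w_end=0.5):
    """Value of a candidate node match: score gain plus a decaying similarity term.

    The weight decreases linearly from w_start (first step) to w_end (last step).
    """
    t = step / max(1, total_steps - 1)
    w = w_start + (w_end - w_start) * t
    return gain + w * similarity


# Early in the search, a highly similar pair beats a pair with a larger gain.
early_similar = combined_value(gain=0, similarity=0.9, step=0, total_steps=10)
early_gain = combined_value(gain=2, similarity=0.1, step=0, total_steps=10)

# At the end, the gain dominates and similarity only separates equal gains.
late_similar = combined_value(gain=0, similarity=0.9, step=9, total_steps=10)
late_gain = combined_value(gain=2, similarity=0.1, step=9, total_steps=10)
late_tie = combined_value(gain=2, similarity=0.9, step=9, total_steps=10)
```

In a greedy loop, the candidate with the largest combined value would be inserted at each step in place of the candidate with the largest gain alone.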

References

[1] M. Abi-Antoun, J. Aldrich, N. Nahas, B. Schmerl, and D. Garlan. Differencing and merging of architectural views. Automated Software Engineering, 15(1):35–74, Mar. 2008.

[2] T. Barecke and M. Detyniecki. Memetic algorithms for inexact graph matching. In CEC: IEEE Congress on Evolutionary Computation, 2007.

[3] H. Bunke. Error-tolerant graph matching: a formal framework and algorithms. In Proc. Advances in Pattern Recognition, pages 1–14, 1998.

[4] J.W. Raymond, E.J. Gardiner, and P. Willett. RASCAL: calculation of graph similarity using maximum common edge subgraphs. Computer Journal, 45(6):631–44, 2002.

[5] S. Bokhari. On the mapping problem. IEEE Trans. Comput., pages 207–214.

[6] S. Kpodjedo, P. Galinier, and G. Antoniol. Enhancing a tabu algorithm for approximate graph matching with a similarity measure. In Evolutionary Computation in Combinatorial Optimisation, Eur. Conf. on, 2010.

[7] S. Kpodjedo, F. Ricca, P. Galinier, and G. Antoniol. Recovering the evolution stable part using an ECGM algorithm: Is there a tunnel in Mozilla? In Software Maintenance and Reengineering, Eur. Conf. on, volume 0, pages 179–188, 2009.

[8] A. Toshev, S. Jianbo, and K. Daniilidis. Image matching via saliency region correspondences. In CVPR ’07, IEEE Conf. on Computer Vision and Pattern Recognition, pages 33–40, 2007.

[9] Y. Wang, F. Makedon, J. Ford, and H. Huang. A bipartite graph matching framework for finding correspondences between structural elements in two proteins. In Proc. Int. Conf. of the IEEE Engineering in Medicine and Biology Society, pages 2972–2975, 2004.
