Paper Presentation (Graph)

Error Correcting Graph Matching Application to Software Evolution

2008 15th Working Conference on Reverse Engineering

www.company.com

PRESENTED BY

Falguni Roy

MSSE-0209

Authors

• Segla Kpodjedo

• Filippo Ricca

• Philippe Galinier

• Giuliano Antoniol

Introduction and Problem Statement

• Graphs are suited to modeling all kinds of real-life objects and problems

• A legitimate question about two graphs: how similar (quantitatively and qualitatively) are they?

• One way to answer this question is to match, with respect to some constraints, the nodes and edges of the first graph to the nodes and edges of the second graph

• Two families of approaches: exact matching and approximate graph matching

• Approximate graph matching allows matching two nodes even when the match violates constraints such as

  • the edge-preservation constraint,

  • exact correspondence of edges, or

  • any other characteristic, such as node/edge labels or weights

• A penalty is assigned to each constraint violation, depending on the specific problem and the desired results

• The best matching is the one that minimizes the overall penalty cost

• Finding the best matching is NP-hard

• Optimal algorithms suffer from prohibitive computation times on medium and large graphs

Background

• Class diagrams can be thought of as labeled graphs

  • nodes are the classes

  • edges represent the relations between classes

• Labels on edges can specify the type of the edge

• Node labels can specify properties such as the class name

• To apply Error Correcting Graph Matching (ECGM) algorithms to the study of software evolution, the software artifacts must first be represented as graphs

• A mapping between the graphs is then built with an ECGM algorithm

• The algorithm finds an optimal or near-optimal mapping

• It exploits similarities of nodes based on both

  • their number of edges and

  • their hierarchical position in the overall graph structure

The Error Correcting Graph Matching Model

• A graph with labels from two finite alphabets of symbols (node labels ΣV and edge labels ΣE) is defined as a triple (V, LV, LE), where

  • V is the finite set of elements, called nodes or vertices

  • LV : V → ΣV is the node labeling function

  • LE : V × V → ΣE is the edge labeling function

• Let g1 = (V1, LV1, LE1) and g2 = (V2, LV2, LE2) be two graphs

• An ECGM from g1 to g2 is a bijective function m : Ṽ1 → Ṽ2, where Ṽ1 ⊆ V1 and Ṽ2 ⊆ V2

• A node x ∈ Ṽ1 is matched to node y ∈ Ṽ2 if m(x) = y

• Any node from V1 − Ṽ1 is said to be deleted from g1 under m

• Any node from V2 − Ṽ2 is said to be inserted in g2 under m

• Any ECGM can be thought of as a set of edit operations that transform a given graph g1 into another graph g2

• A node matching is a couple (n1, m(n1)) ∈ Ṽ1 × Ṽ2
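
The definitions above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `LabeledGraph` class, the `edit_operations` helper, and the example class names are hypothetical and do not come from the paper. A matching m is stored as a dictionary from nodes of g1 to nodes of g2; the nodes left out on either side are the deleted and inserted ones.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """A graph (V, L_V, L_E): labels on nodes and on ordered node pairs."""
    node_labels: dict                                 # L_V: node -> label
    edge_labels: dict = field(default_factory=dict)   # L_E: (u, v) -> label

    @property
    def nodes(self):
        return set(self.node_labels)


def edit_operations(g1, g2, m):
    """Split the nodes into matched couples, deleted nodes and inserted nodes."""
    # m must be injective so that it is a bijection onto the matched part of g2.
    if len(set(m.values())) != len(m):
        raise ValueError("a node of g2 is matched more than once")
    if not set(m) <= g1.nodes or not set(m.values()) <= g2.nodes:
        raise ValueError("the matching refers to unknown nodes")
    matched = set(m.items())               # couples (n1, m(n1))
    deleted = g1.nodes - set(m)            # V1 minus the matched part of V1
    inserted = g2.nodes - set(m.values())  # V2 minus the matched part of V2
    return matched, deleted, inserted


# Hypothetical class diagrams: two snapshots of the same small system.
g1 = LabeledGraph({"A": "Order", "B": "Item", "C": "Logger"},
                  {("A", "B"): "association"})
g2 = LabeledGraph({"X": "Order", "Y": "Item", "Z": "Invoice"},
                  {("X", "Y"): "aggregation"})
matched, deleted, inserted = edit_operations(g1, g2, {"A": "X", "B": "Y"})
```

Here `Logger` ends up deleted and `Invoice` inserted, while the two matched couples carry the node and edge matching errors discussed on the next slides.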

• An ECGM solution, called a matching, is then a set of such couples, with the constraint that a node is matched to at most one node

• Penalties are assigned to every distortion found by the solution

• Edit operations leading to distortions:

  • node/edge deletions,

  • node/edge insertions, and

  • node/edge matching errors

• Given (n1, m(n1)), a node matching error refers to the dissimilarity between n1 and m(n1)

• Edge matching refers to any edge replacement from Ṽ1 × Ṽ1 to Ṽ2 × Ṽ2

• Two types of edge matching errors are to be considered:

  • replacing a missing edge (insertion) by an existing edge (structural error), or

  • replacing one edge by another (label error)

• As a result, there are seven possible edit operations or distortions, and each one is assigned a cost that depends on the problem at hand

• The ECGM cost function can then be parameterized by the costs of the seven edit operations:

  • Node matching, deletion and insertion: Cnm, Cnd, Cni

  • Edge deletion and insertion applied to edges of deleted and added nodes: respectively Ced (cost of deleting an edge of a node deleted from g1) and Cei (cost of adding an edge for a node added into g2)

  • Edge matching: edge structural error Cems, when an edge is inserted or deleted between two matched nodes, and edge label error Ceml (for example, an association is mapped into an aggregation)

• The costs of adding and of deleting a node can be considered identical, and likewise for an edge, so there is no need to specify two different values for Cnd, Cni or for Ced, Cei

• Five positive real values therefore suffice to define a cost function: (Cnm, Cno, Ceo, Cems, Ceml), where Cno stands for both Cnd and Cni, and Ceo for both Ced and Cei
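
As a rough sketch of how the five values could drive a cost function, the Python function below counts each kind of distortion and multiplies it by its unit cost. The way the terms are combined, the 0/1 label dissimilarity used for node matching errors, the function name `ecgm_cost`, and the numeric cost values are all assumptions made for illustration; the slides do not give the paper's exact formula.

```python
def ecgm_cost(nodes1, edges1, nodes2, edges2, m, costs, dissim=None):
    """Cost of matching m (dict: g1 node -> g2 node).

    nodes*: dict node -> label, edges*: dict (u, v) -> label,
    costs: (Cnm, Cno, Ceo, Cems, Ceml).
    """
    c_nm, c_no, c_eo, c_ems, c_eml = costs
    if dissim is None:
        # Assumed dissimilarity: 0 for equal labels, 1 otherwise.
        dissim = lambda a, b: 0.0 if a == b else 1.0
    inverse = {v: u for u, v in m.items()}
    total = 0.0
    # Node matching errors.
    total += c_nm * sum(dissim(nodes1[u], nodes2[v]) for u, v in m.items())
    # Node deletions (from g1) and insertions (into g2).
    total += c_no * ((len(nodes1) - len(m)) + (len(nodes2) - len(m)))
    for (u, v), label in edges1.items():
        if u not in m or v not in m:
            total += c_eo     # edge attached to a deleted node
        elif (m[u], m[v]) not in edges2:
            total += c_ems    # edge vanished between two matched nodes
        elif edges2[(m[u], m[v])] != label:
            total += c_eml    # edge kept but relabeled
    for (x, y) in edges2:
        if x not in inverse or y not in inverse:
            total += c_eo     # edge attached to an inserted node
        elif (inverse[x], inverse[y]) not in edges1:
            total += c_ems    # edge appeared between two matched nodes
    return total


# Hypothetical example: one relabeled edge, one deleted and one inserted class.
nodes1 = {"A": "Order", "B": "Item", "C": "Logger"}
edges1 = {("A", "B"): "association", ("A", "C"): "association"}
nodes2 = {"X": "Order", "Y": "Item", "Z": "Invoice"}
edges2 = {("X", "Y"): "aggregation", ("X", "Z"): "association"}
cost = ecgm_cost(nodes1, edges1, nodes2, edges2,
                 {"A": "X", "B": "Y"}, costs=(1.0, 2.0, 0.5, 3.0, 1.5))
print(cost)
```

With these made-up costs the total is two node operations (2 × 2.0), one label error (1.5) and two edges of unmatched nodes (2 × 0.5).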

Modeling software evolution as an ECGM

• Important elements in software evolution are modeled as node properties and matched by the ECGM algorithm.

• For each class, only a subset of the possible class characteristics is considered: the class name, the number of attributes, and the number of methods

• Given two classes and their features (label, number of methods, number of attributes), v1 (l1, #m1, #a1) and v2 (l2, #m2, #a2), their internal similarity is computed from these features [formula shown as an image on the original slide]

Tabu Search Algorithm

• Given a function f (the cost function) to be minimized (or maximized) over some set S (the search space), a local search technique starts from some initial feasible point (solution) in the search space and proceeds iteratively (by moves) from one point in S to another (a neighbor) until some termination criterion is met

• To prevent cycles in the search, tabu search (TS) introduces one or several tabu lists, used to exclude moves that would tend to take the search back to a previously visited solution
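
The general scheme can be sketched in a few lines of Python. This is a generic, simplified variant and not the authors' ECGM implementation: it keeps a single tabu list of recently visited solutions (rather than of moves), and the names `tabu_search`, `landscape`, and `neighbors`, as well as the toy one-dimensional search space, are assumptions chosen for illustration.

```python
from collections import deque


def tabu_search(f, start, neighbors, max_iters=100, tabu_size=5):
    """Minimize f by local search with a short-term tabu list."""
    current = best = start
    tabu = deque([start], maxlen=tabu_size)  # oldest entries drop out
    for _ in range(max_iters):
        candidates = [s for s in neighbors(current) if s not in tabu]
        if not candidates:
            break  # every neighbor is tabu: stop
        # Move to the best admissible neighbor, even if it is worse
        # than the current solution; this is what escapes local minima.
        current = min(candidates, key=f)
        tabu.append(current)
        if f(current) < f(best):
            best = current
    return best


# Toy search space: positions 0..6 with a local minimum at position 1.
landscape = [5, 3, 4, 6, 2, 1, 4]


def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]


best = tabu_search(lambda i: landscape[i], start=0,
                   neighbors=neighbors, max_iters=20)
print(best, landscape[best])
```

A plain descent from position 0 would stop at position 1, because both of its neighbors are worse. The tabu list forbids stepping back, so the search climbs over the hill at position 3 and reaches the better solution at position 5.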

• For ECGM, a move either adds a new match or deletes one that is in the current solution

• Before matching two nodes, consider their

  • internal similarity

  • external similarity

• Consider the whole graph structure and the positions of the considered nodes in their respective graphs

• Consider local features of a node, such as its incoming and outgoing edges

• Using PageRank, a metric representative of the global structure can easily be computed for each vertex of a given graph. Combined with the local metric, it gives a more accurate assessment of the structural similarity of two nodes, and this structural similarity is used to guide the tabu search
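
For reference, a PageRank score per vertex can be computed by plain power iteration, as in the hedged Python sketch below. The damping factor of 0.85, the fixed iteration count, the uniform handling of nodes without outgoing edges, and the function name `pagerank` are common conventions assumed here; the slides do not say which settings the authors used, nor how the score is combined with the local metric.

```python
def pagerank(nodes, edges, damping=0.85, iters=50):
    """PageRank by power iteration on a directed graph.

    nodes: iterable of node ids, edges: iterable of (source, target) pairs.
    """
    nodes = list(nodes)
    n = len(nodes)
    out = {u: [] for u in nodes}
    for u, v in edges:
        out[u].append(v)
    rank = {u: 1.0 / n for u in nodes}  # start from the uniform distribution
    for _ in range(iters):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not out[u])
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in out[u]:
                new[v] += damping * rank[u] / len(out[u])
        rank = new
    return rank


# Hypothetical class diagram: classes A and B both refer to class C.
rank = pagerank(["A", "B", "C"], [("A", "C"), ("B", "C")])
print(rank)
```

In this toy graph C receives rank from both A and B, so it gets the highest score, while A and B, which play symmetric roles, get equal scores.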

Conclusion

• The final version of Latazza consists of 17 Java classes and 37 relations (associations, aggregations and generalizations), for a total of 6184 lines of code

• Ten of the 17 classes in the last version of Latazza were in the stable part

• This means that 59% of the classes belong to the tunnel. Regarding the edges, 16 of the 37 edges in the last Latazza snapshot are in the tunnel, and 13 kept the same value throughout it

Thank You
