Directed Graph Pattern Matching and Topological Embedding

JOURNAL OF ALGORITHMS 22, 372-391 (1997)
ARTICLE NO. AL960818


James Jianghai Fu

Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1

Received May 1, 1995

Pattern matching in directed graphs is a natural extension of pattern matching in trees and has many applications in different areas. In this paper, we study several pattern matching problems in ordered labeled directed graphs. For the rooted directed graph pattern matching problem, we present an efficient algorithm which, given a pattern graph $P$ and a target graph $T$, runs in time and space $O(|E_P||V_T| + |E_T|)$. It is faster than the best known method by a factor of $\min\{|E_T|, |E_P||V_T|\}$. This algorithm can also solve the directed graph pattern matching problem without increasing time or space complexity. Our solution to this problem outperforms the best existing method by Katzenelson, Pinter, and Schenfeld by a factor of $\min\{|V_P||E_T|, |V_P||E_P||V_T|\}$. We also present an algorithm for the directed graph topological embedding problem which runs in time $O(|V_P||E_T| + |E_P|)$ and space $O(|V_P||V_T| + |E_P| + |E_T|)$. To our knowledge, this algorithm is the first one for this problem. © 1997 Academic Press

1. INTRODUCTION

Pattern matching in trees has been successful in a number of application areas. However, because trees lack a mechanism for expressing recursive structures, they are not suitable for representing some complex objects in which explicit expressions of recursive relations are useful. As a consequence, there has been increasing demand in recent years that pattern matching be extended to more general graphs [2, 3, 5, 6, 8, 12, 14, 15]. A directed graph is a natural choice for expressing recursive relations.

Pattern matching in directed graphs has been used by several research groups for different purposes in different areas. One example is Katzenelson, Pinter, and Schenfeld's type-checking system [14], in which type expressions are represented by type-graphs, which are directed graphs. A key step in their system is to identify all equivalent subgraphs in a type-graph so that redundant parts in the type-graph can be eliminated. Using the technique of pattern matching, Katzenelson, Pinter, and Schenfeld [14] describe a method that can identify all equivalent subgraphs in a type-graph $G$ in time $O(|V_G|^2 |E_G|^2)$.

0196-6774/97 $25.00. Copyright © 1997 by Academic Press. All rights of reproduction in any form reserved.

Pattern matching in directed graphs is also used in Holm's system, in which graph concepts are used in semantics descriptions of functional languages [12]. In Holm's system, recursively typed languages are represented by directed graphs. The task of pattern matching can be described as follows: the patterns describe a class of objects and are represented using directed graphs. When checking a recursive type, the recursive type value is matched against the patterns, i.e., the algorithm checks if the type value is an instance of a pattern. The matching process is implemented in a simple graph reduction machine that supports primitive graph operations and provides a base for language implementations.

Directed graphs are well suited for representing regular tree expressions [2]. Aiken and Murphy [2] discuss the implementation of regular tree expressions that are used in type inference and program analysis algorithms [3, 10]. The most commonly used of the fundamental operations in the implementation is the test for inclusion relations, which can be implemented using a pattern matching technique.

Pattern matching is a crucial component of term rewriting systems. Directed graphs are well suited for expressing the infinite terms with a finite number of distinct subterms in a cyclic term graph rewriting system [5, 6]. Motivated by the need for providing a satisfactory interpretation for cyclic term graph rewriting, Corradini [5] discusses the extension of the classical theory of term rewriting systems to infinite and partial terms. A theoretical treatment of pattern matching in directed graphs has an important impact on the research of infinite term rewriting systems.

Although pattern matching in directed graphs has been used by individual research groups, a thorough study of the problems has not been carried out. This paper is a step towards an extensive investigation of various pattern matching problems in directed graphs and the development of techniques for these problems. Efficient solutions to these problems not only are of theoretical interest in their own right, but also improve the performance of many systems such as those listed above.

In this paper, we consider ordered labeled directed graph pattern matching and topological embedding problems, where an ordered labeled directed graph is a directed graph in which every node is associated with a label in an alphabet $\Sigma$, and the left-to-right order of siblings is significant. In terms of mappings, a directed graph $P$ matches a directed graph $T$ if there is a mapping $f$ from the nodes in $P$ to the nodes in $T$ such that $f$ preserves label, degree for internal nodes in $P$, and the parent relationship. Topological embedding relaxes the restriction on preserving the parent relationship; it requires $f$ to preserve the ancestor relationship, i.e., for each node $\alpha$ in $P$, the $i$th child of $\alpha$ from the left can be mapped to either the $i$th child $c$ of $f(\alpha)$ or a descendant of $c$.

An important notion in the design of our algorithms is that of the rooted directed graph, where a rooted directed graph is a directed graph with a single node serving as the root. The precise definitions of rooted directed graph and the problems with which we are concerned are given in Section 2. There are no existing algorithms for our problems, but some algorithms for more general problems can be adapted to solve them. These algorithms are reviewed in Section 3.

In Section 4 we present an algorithm for the rooted directed graph pattern matching problem. Unlike in the tree pattern matching problem [7, 11, 16], which is a special case of the rooted directed graph pattern matching problem, we face the difficulty of dealing with cycles. The main contribution of our algorithm is a scheme for breaking cyclic dependencies during the course of dynamic programming. Given a pattern graph $P$ and a target graph $T$, our algorithm runs in time and space $O(|E_P||V_T| + |E_T|)$. Extension of this algorithm to the directed graph pattern matching problem without an increase in time or space complexity is also discussed.

The algorithm for the directed graph topological embedding problem is presented in Section 5. It employs dynamic programming and uses the same scheme to handle the cyclic dependencies as in our pattern matching algorithm. The time and space complexities of our algorithm are $O(|V_P||E_T| + |E_P|)$ and $O(|V_P||V_T| + |E_P| + |E_T|)$, respectively.

2. PRELIMINARIES

2.1. Rooted Directed Graph

DEFINITION 1. A rooted directed graph (RDG) is a directed graph in which there is a node designated as the root, from which there is a path to every other node.

We use $V_G$ and $E_G$ to denote the set of nodes and the set of edges in a graph $G$. The following definitions pertain to an RDG $G$.

DEFINITION 2. The $\mathrm{depth}(x)$ of a node $x \in V_G$ is the number of edges in the shortest path from the root of $G$ to $x$. An edge $(x_1, x_2)$ is an ordinary edge if $\mathrm{depth}(x_1)$ is less than $\mathrm{depth}(x_2)$; an edge $(x_1, x_2)$ is a cross edge if $\mathrm{depth}(x_1)$ equals $\mathrm{depth}(x_2)$; an edge $(x_1, x_2)$ is a back edge if $\mathrm{depth}(x_1)$ is greater than $\mathrm{depth}(x_2)$.
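Definition 2 can be illustrated with a small sketch. The adjacency-list representation (a dict mapping each node to its ordered list of children), the function names, and the example graph are assumptions made for illustration; they do not come from the paper. Depths are computed by breadth-first search, which yields shortest-path distances from the root.

```python
from collections import deque

def node_depths(children, root):
    """Return depth(x) for every node: edges on a shortest path from the root."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        x = queue.popleft()
        for c in children[x]:
            if c not in depth:
                depth[c] = depth[x] + 1
                queue.append(c)
    return depth

def classify_edges(children, root):
    """Label each edge (x1, x2) as ordinary, cross, or back (Definition 2)."""
    depth = node_depths(children, root)
    kinds = {}
    for x1, kids in children.items():
        for x2 in kids:
            if depth[x1] < depth[x2]:
                kinds[(x1, x2)] = "ordinary"
            elif depth[x1] == depth[x2]:
                kinds[(x1, x2)] = "cross"
            else:
                kinds[(x1, x2)] = "back"
    return kinds

# Hypothetical RDG with root 1: edges 1->2, 1->3, 2->3, 3->1.
example = {1: [2, 3], 2: [3], 3: [1]}
kinds = classify_edges(example, 1)
```

In this example the edge from 2 to 3 is a cross edge because both endpoints have depth 1, and the edge from 3 to 1 is a back edge. The sketch keys edges by their endpoints, so it does not distinguish parallel edges.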


DEFINITION 3. Let $x$ and $y$ be two nodes in an RDG $G$. We say that $x$ is a parent of $y$ if there is an edge pointing from $x$ to $y$. Node $y$ is a child of $x$ if and only if $x$ is a parent of $y$. We say that $x$ is an ancestor of $y$ if there exist zero or more nodes $x_1, x_2, \ldots, x_q$ in $G$ such that $x$ is a parent of $x_1$, $x_q$ is a parent of $y$, and $x_i$ is a parent of $x_{i+1}$ for $1 \le i < q$. Node $y$ is a descendant of $x$ if and only if $x$ is an ancestor of $y$. Nodes $x$ and $y$ are siblings if they have the same depth and a common parent. A leaf in $G$ is a node from which there is no edge emanating. A subgraph of $G$ rooted at a node $x \in V_G$ is an RDG which contains all descendants of $x$ and the induced edges.

In this paper, we use $\Sigma$ to denote an alphabet of labels, $r(G)$ to denote the root of an RDG $G$, and $l(x)$, $d(x)$, $\mathrm{sub}(x)$, and $c_i(x)$ to denote the label of a node $x$, the degree of $x$, the subgraph rooted at $x$, and the $i$th child of $x$ from the left, respectively. Unless otherwise specified, we use $P$ and $T$ to denote a pattern RDG and a target RDG, respectively, and use Greek symbols to denote the nodes in $P$ and the Roman alphabet or Arabic numerals to denote the nodes in $T$.

2.2. Problems

Given an RDG $T$, a tree called the unfolded RDG of $T$, denoted $\mathrm{unfold}(T)$, can be obtained by unfolding $T$ as follows.

  (i) $r(\mathrm{unfold}(T)) = r(T)$;
  (ii) for each $i$, $1 \le i \le d(r(T))$, the subtree rooted at the $i$th child of $r(\mathrm{unfold}(T))$ is obtained by unfolding the subgraph rooted at the $i$th child of $r(T)$.

If there are cycles in $T$, $\mathrm{unfold}(T)$ is an infinite tree. As far as pattern matching is concerned, an unfolded RDG is often a more precise representation than an RDG. In applications such as Katzenelson, Pinter, and Schenfeld's type-checking system [14], two different RDGs represent the same object if they have the same unfolded RDG. As such, we define RDG pattern matching and topological embedding in terms of unfolded RDGs. Figure 1 shows an example of two RDGs having the same unfolded RDG.
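Because the unfolded RDG is infinite when the graph has cycles, a program can only build a finite prefix of it. The sketch below unfolds a graph down to a chosen depth `m` (a parameter of this illustration) and returns a nested `(label, subtrees)` tuple. The dict-based representation and the two example graphs are assumptions made for illustration; they are not the graphs of Fig. 1.

```python
def unfold(children, labels, node, m):
    """Unfold the subgraph rooted at `node` into a tree, cut off below depth m.

    Returns (label, [subtrees]); nodes at depth m are kept without children.
    """
    if m == 0:
        return (labels[node], [])
    return (labels[node],
            [unfold(children, labels, c, m - 1) for c in children[node]])

# Hypothetical example: a one-node cycle and a two-node cycle, all labeled "a".
g1, l1 = {0: [0]}, {0: "a"}
g2, l2 = {0: [1], 1: [0]}, {0: "a", 1: "a"}

# The two graphs differ, yet their unfoldings agree at every tested depth.
same = all(unfold(g1, l1, 0, m) == unfold(g2, l2, 0, m) for m in range(6))
```

The running time of this sketch grows with the size of the unfolded prefix, which can be exponential in `m` for branching graphs; it is meant only to make the definition concrete.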

The rank of a node $x$ in $\mathrm{unfold}(T)$ is the number of edges in $\mathrm{unfold}(T)$ of the path from $r(\mathrm{unfold}(T))$ to $x$. Let $T^m$ denote the tree consisting of all nodes of $\mathrm{unfold}(T)$ whose ranks are less than or equal to $m$ and the induced edges. Let $\mathrm{subtree}(x)$ denote the subtree rooted at node $x$. The following are the definitions of pattern matching between unfolded RDGs and between RDGs.


FIG. 1. The RDG in (a) and the RDG in (c) have the same unfolded RDG shown in (b).

DEFINITION 4. $P^m$ matches $T^m$ if

  (i) $l(r(P^m)) = l(r(T^m))$,
  (ii) $d(r(P^m)) = 0$ or $d(r(P^m)) = d(r(T^m))$, and
  (iii) $\mathrm{subtree}(c_i(r(P^m)))$ matches $\mathrm{subtree}(c_i(r(T^m)))$ for each $i$, $1 \le i \le d(r(P^m))$.

DEFINITION 5. An RDG $P$ matches an RDG $T$, denoted $P \cong_m T$, if for every $m \ge 0$, $P^m$ matches $T^m$.
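Definition 4 translates directly into a recursive check for one fixed $m$. The sketch below is only an illustration of the definition, not the algorithm of this paper: it walks both graphs in lockstep down to depth `m`, so its cost can be exponential in `m`. The representation and the example graphs are assumptions.

```python
def matches_upto(pc, pl, a, tc, tl, x, m):
    """Check whether sub(a)^m matches sub(x)^m in the sense of Definition 4.

    pc/tc map nodes to ordered child lists; pl/tl map nodes to labels.
    """
    if pl[a] != tl[x]:
        return False              # condition (i) fails
    if m == 0 or not pc[a]:
        return True               # pattern root has degree 0 in the truncated tree
    if len(pc[a]) != len(tc[x]):
        return False              # condition (ii) fails
    return all(matches_upto(pc, pl, ca, tc, tl, cx, m - 1)   # condition (iii)
               for ca, cx in zip(pc[a], tc[x]))

# Hypothetical pattern: node 0 labeled "f" with children (leaf "a", itself).
P, PL = {0: [1, 0], 1: []}, {0: "f", 1: "a"}
# Hypothetical target with the same unfolding, spread over four nodes.
T = {0: [1, 2], 1: [], 2: [3, 0], 3: []}
TL = {0: "f", 1: "a", 2: "f", 3: "a"}
# Same target, but node 3 relabeled so that the unfoldings differ at rank 2.
TL_bad = {0: "f", 1: "a", 2: "f", 3: "b"}
```

With the relabeled target, the check succeeds for `m = 1` and fails for `m = 2`, which shows why Definition 5 quantifies over every $m$ rather than a single depth.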

We write $P \not\cong_m T$ if $P$ does not match $T$. Topological embedding can be considered a generalization of pattern matching. In topological embedding, the parent relationship does not have to be preserved, i.e., to obtain an embedding of $P$ in $T$, the subgraph rooted at a child of $r(P)$ may be embedded in a subgraph rooted at a descendant of $r(T)$. Figure 2 shows an example of topological embedding of an RDG $P$ in an RDG $T$. The following formally define the topological embedding between unfolded RDGs and between RDGs.

DEFINITION 6. $P^m$ is topologically embeddable in $T^j$ if

  (i) $l(r(P^m)) = l(r(T^j))$,
  (ii) $d(r(P^m)) = 0$ or $d(r(P^m)) = d(r(T^j))$, and
  (iii) $\mathrm{subtree}(c_i(r(P^m)))$ is topologically embeddable in $\mathrm{subtree}(d_i)$ for each $i$, $1 \le i \le d(r(P^m))$, where $d_i$ is either $c_i(r(T^j))$ or a descendant of $c_i(r(T^j))$.

FIG. 2. $\mathrm{sub}(\gamma)$ is topologically embeddable in $\mathrm{sub}(4)$; $\mathrm{sub}(\alpha)$ is topologically embeddable in $\mathrm{sub}(1)$; $\mathrm{sub}(\beta)$ is topologically embeddable in both $\mathrm{sub}(2)$ and $\mathrm{sub}(3)$.

DEFINITION 7. An RDG $P$ is topologically embeddable in an RDG $T$, denoted $P \cong_e T$, if for every $m \ge 0$, there exists a $j \ge m$ such that $P^m$ is topologically embeddable in $T^j$.

We write $P \not\cong_e T$ if $P$ is not topologically embeddable in $T$. The following are the definitions of the problems we consider in this paper.

DEFINITION 8. The RDG pattern matching problem is to determine whether $P \cong_m T'$ for every subgraph $T'$ of $T$. The directed graph pattern matching problem is to determine whether $P' \cong_m T'$ for every subgraph $P'$ of $P$ and every subgraph $T'$ of $T$. The directed graph topological embedding problem is to determine whether $P' \cong_e T'$ for every subgraph $P'$ of $P$ and every subgraph $T'$ of $T$.

3. RELATED WORK

A special case of the RDG pattern matching problem is, given RDGs $P$ and $T$, to determine whether $P \cong_m T$. For ordered RDGs, two algorithms can be adapted to solve this problem: one is the unification algorithm by Aho, Sethi, and Ullman [1] and the other is the type equivalence testing algorithm by Katzenelson, Pinter, and Schenfeld [14]. The two algorithms are based on the same idea, which can be described as follows when adapted to perform matching.

Function matching($\alpha$, $x$)
    If $\alpha$ has been processed against $x$ then return MATCH;
    If $l(\alpha) \neq l(x)$ or $d(\alpha) \neq d(x)$ then return NOT-MATCH;
    Mark $\alpha$ as processed against $x$;
    For $i$ from 1 to $d(\alpha)$ do
        If matching($c_i(\alpha)$, $c_i(x)$) returns NOT-MATCH then return NOT-MATCH;
    End for;
    Return MATCH;
End.
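The function above can be sketched in Python as follows. This is an illustration under stated assumptions, not code from either cited paper: graphs are dicts from nodes to ordered child lists, a set of processed pairs stands in for the $|V_P| \times |V_T|$ matrix, and a pair is recorded as processed before the recursion descends into its children, which is what lets the recursion terminate on cycles.

```python
def naive_match(pc, pl, tc, tl, a, x, processed=None):
    """Return True if sub(a) and sub(x) agree under the naive scheme."""
    if processed is None:
        processed = set()
    if (a, x) in processed:
        return True                       # already processed: assume a match
    if pl[a] != tl[x] or len(pc[a]) != len(tc[x]):
        return False                      # label or degree mismatch
    processed.add((a, x))                 # record the pair before recursing
    for ca, cx in zip(pc[a], tc[x]):
        if not naive_match(pc, pl, tc, tl, ca, cx, processed):
            return False
    return True

# Hypothetical graphs: the pattern is a two-node graph with a self-loop,
# the target has the same unfolding spread over four nodes.
P, PL = {0: [1, 0], 1: []}, {0: "f", 1: "a"}
T = {0: [1, 2], 1: [], 2: [3, 0], 3: []}
TL = {0: "f", 1: "a", 2: "f", 3: "a"}
TL_bad = {0: "f", 1: "a", 2: "f", 3: "b"}

ok = naive_match(P, PL, T, TL, 0, 0)
bad = naive_match(P, PL, T, TL_bad, 0, 0)
```

A single NOT-MATCH aborts the whole computation, so the set of processed pairs is only meaningful for one root pair at a time; this is why comparing $P$ with every subgraph of $T$ requires restarting the function for each subgraph.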

The problem is solved by applying the function to the roots of $P$ and $T$. A $|V_P| \times |V_T|$ matrix can be used to store all pairs that have been processed so that, given a pair of nodes, one can check in constant time whether or not they have been processed. The time and space complexities of the above algorithms are $O(|E_P||E_T|)$ and $O(|V_P||V_T| + |E_P| + |E_T|)$, respectively. These two algorithms can be extended to the ordered labeled RDG pattern matching problem by comparing $P$ with every subgraph of $T$. The time complexity would then be $O(|E_P||E_T||V_T|)$, which is slower than our solution to this problem by a factor of $\min\{|E_T|, |E_P||V_T|\}$.

Katzenelson, Pinter, and Schenfeld [14] also considered the problem of finding all equivalent subgraphs in an ordered labeled RDG $G$, which is essentially the same as the directed graph pattern matching problem. Their solution runs in time $O(|V_G|^2 |E_G|^2)$, which is $O(|V_P||V_T||E_P||E_T|)$ when it is applied to solving the directed graph pattern matching problem. With a slightly greater space complexity, our algorithm solves this problem faster than theirs by a factor of $\min\{|V_P||E_T|, |V_P||E_P||V_T|\}$. As this algorithm is not the main subject of their paper [14], its efficiency may not have been their main concern.

The unordered RDG pattern matching problem is different from the ordered RDG pattern matching problem in that one can change the left-to-right order of the children of any pair of nodes when checking for the matches between the children of the pair. The problem is NP-complete [4]; as a result, the unordered directed graph pattern matching problem and topological embedding problem are NP-hard.

4. PATTERN MATCHING

The major challenge in designing an efficient algorithm for the directed graph pattern matching problem is handling the cycles in graphs. No matter in what order the matching process proceeds, there are cyclic dependencies to be resolved. For example, in Fig. 3 it is easy to determine whether $\mathrm{sub}(\alpha) \cong_m \mathrm{sub}(2)$, but it is not so clear whether $\mathrm{sub}(\beta) \cong_m \mathrm{sub}(1)$ before it has been determined whether $\mathrm{sub}(\delta) \cong_m \mathrm{sub}(4)$. However, we cannot determine whether $\mathrm{sub}(\delta) \cong_m \mathrm{sub}(4)$ without knowing whether $\mathrm{sub}(\gamma) \cong_m \mathrm{sub}(3)$, which in turn depends on whether $\mathrm{sub}(\beta) \cong_m \mathrm{sub}(1)$. We need some way to break the dependency cycle.

FIG. 3. Dependency cycle.


4.1. Matching-Condition

The dependencies can be described using matching-conditions, defined below.

DEFINITION 9. A pair $(\beta, y)$ is a matching-condition, denoted $\mathrm{mc}(\beta, y)$, on which a pair $(\alpha, x)$ depends if $\mathrm{sub}(\beta) \not\cong_m \mathrm{sub}(y)$ implies that $\mathrm{sub}(\alpha) \not\cong_m \mathrm{sub}(x)$.

Remark. If $(\alpha, x)$ depends on $\mathrm{mc}(\beta, y)$ and $(\beta, y)$ depends on $\mathrm{mc}(\gamma, z)$, then $(\alpha, x)$ depends on $\mathrm{mc}(\gamma, z)$.

DEFINITION 10. An $\mathrm{mc}(\beta, y)$ is undetermined if $\mathrm{sub}(\beta)$ has not been processed against $\mathrm{sub}(y)$ during the matching process. An $\mathrm{mc}(\beta, y)$ is violated if $\mathrm{sub}(\beta) \not\cong_m \mathrm{sub}(y)$ has been determined.

Given a $\mathrm{sub}(\alpha)$ of $P$ and a $\mathrm{sub}(x)$ of $T$, although we cannot determine whether $\mathrm{sub}(\alpha) \cong_m \mathrm{sub}(x)$ when a dependency cycle exists, some evidence, such as $l(\alpha) \neq l(x)$, $d(\alpha) \neq d(x)$, or a violated matching-condition on which $(\alpha, x)$ depends, implies that $\mathrm{sub}(\alpha) \not\cong_m \mathrm{sub}(x)$. If no such evidence can be found, then we say that $\mathrm{sub}(\alpha)$ conditionally matches $\mathrm{sub}(x)$. Formally,

DEFINITION 11. A $\mathrm{sub}(\alpha)$ conditionally matches $\mathrm{sub}(x)$, denoted $\mathrm{sub}(\alpha) \cong_{cm} \mathrm{sub}(x)$, if $l(\alpha) = l(x)$, $d(\alpha) = d(x)$, no matching-condition on which $(\alpha, x)$ depends has been violated, and there exist some undetermined matching-conditions on which $(\alpha, x)$ depends.

DEFINITION 12. An $\mathrm{mc}(\beta, y)$ is confirmed if $\mathrm{sub}(\beta) \cong_m \mathrm{sub}(y)$ or $\mathrm{sub}(\beta) \cong_{cm} \mathrm{sub}(y)$ has been determined.

LEMMA 1. For any $\alpha \in V_P$ and $x \in V_T$, $\mathrm{sub}(\alpha) \cong_m \mathrm{sub}(x)$ if and only if (i) $l(\alpha) = l(x)$, (ii) $d(\alpha) = d(x)$, and (iii) all matching-conditions on which $(\alpha, x)$ depends are confirmed.

Proof. The "only if" part is obvious. For the "if" part, we prove by induction on $m$ that if the above (i), (ii), and (iii) are true, then for any $m \ge 0$, $\mathrm{sub}(\alpha)^m$ matches $\mathrm{sub}(x)^m$.

Basis. Since $l(\alpha) = l(x)$, $\mathrm{sub}(\alpha)^0$ matches $\mathrm{sub}(x)^0$.

Induction Step. Suppose that for any pair of nodes $\alpha$ and $x$, if the above conditions (i), (ii), and (iii) are true, then $\mathrm{sub}(\alpha)^m$ matches $\mathrm{sub}(x)^m$. We now prove that $\mathrm{sub}(\alpha)^{m+1}$ matches $\mathrm{sub}(x)^{m+1}$. Consider $c_i(\alpha)$ and $c_i(x)$, where $1 \le i \le d(\alpha)$. For each $i$, since $l(c_i(\alpha)) = l(c_i(x))$, $d(c_i(\alpha)) = d(c_i(x))$, and all matching-conditions on which $(c_i(\alpha), c_i(x))$ depends are confirmed, $\mathrm{sub}(c_i(\alpha))^m$ matches $\mathrm{sub}(c_i(x))^m$ by the induction hypothesis. This yields the fact that $\mathrm{sub}(\alpha)^{m+1}$ matches $\mathrm{sub}(x)^{m+1}$.


4.2. Algorithm for the RDG Pattern Matching Problem

Our algorithm works in a bottom-up fashion: given $P$ and $T$, it traverses $P$ in postorder, where the ordering is obtained by ignoring the back edges and cross edges, and for each $\alpha \in V_P$, it determines whether $\mathrm{sub}(\alpha) \cong_m \mathrm{sub}(x)$ and whether $\mathrm{sub}(\alpha) \cong_{cm} \mathrm{sub}(x)$ for each $x \in V_T$ in postorder. After the bottom-up traversal, if all matching-conditions on which $(\alpha, x)$ depends are confirmed, then $\mathrm{sub}(\alpha) \cong_m \mathrm{sub}(x)$; otherwise, if one of the matching-conditions on which $(\alpha, x)$ depends is violated, then $\mathrm{sub}(\alpha) \not\cong_m \mathrm{sub}(x)$.

We illustrate this idea using Fig. 3: when processing $\mathrm{sub}(\beta)$ against $\mathrm{sub}(1)$, we decide that $\mathrm{sub}(\beta) \cong_{cm} \mathrm{sub}(1)$ under $\mathrm{mc}(\delta, 4)$. Since $P$ and $T$ are processed in a bottom-up fashion, we will eventually process $\mathrm{sub}(\delta)$ against $\mathrm{sub}(4)$ and will confirm $\mathrm{mc}(\delta, 4)$. Then we conclude that $\mathrm{sub}(\beta) \cong_m \mathrm{sub}(1)$ and $\mathrm{sub}(\delta) \cong_m \mathrm{sub}(4)$. Suppose $l(3)$ were not $e$. We again decide that $\mathrm{sub}(\beta) \cong_{cm} \mathrm{sub}(1)$ under the same matching-condition as above. However, the later computation shows that $\mathrm{sub}(\gamma) \not\cong_m \mathrm{sub}(3)$ and thus $\mathrm{sub}(\delta) \not\cong_m \mathrm{sub}(4)$, i.e., $\mathrm{mc}(\delta, 4)$ is violated. Then we conclude that $\mathrm{sub}(\beta) \not\cong_m \mathrm{sub}(1)$, $\mathrm{sub}(\gamma) \not\cong_m \mathrm{sub}(3)$, and $\mathrm{sub}(\delta) \not\cong_m \mathrm{sub}(4)$.

A $|V_P| \times |V_T|$ matrix $M$ is used to store the intermediate information about the matches between the subgraphs of $P$ and $T$. Each entry $M[\alpha, x]$ may have one of four values: UNKNOWN if $\mathrm{sub}(\alpha)$ has not been processed against $\mathrm{sub}(x)$ yet, MATCH if $\mathrm{sub}(\alpha) \cong_m \mathrm{sub}(x)$, NOT-MATCH if $\mathrm{sub}(\alpha) \not\cong_m \mathrm{sub}(x)$, and COND-MATCH if $\mathrm{sub}(\alpha) \cong_{cm} \mathrm{sub}(x)$. We say that an entry $M[\alpha, x]$ depends on an $\mathrm{mc}(\beta, y)$ if $(\alpha, x)$ depends on $\mathrm{mc}(\beta, y)$. After the traversal, each entry $M[\beta, y]$ that has value NOT-MATCH signals a violated matching-condition. For those entries that have been assigned COND-MATCH and depend on $\mathrm{mc}(\beta, y)$, their values must be changed to NOT-MATCH. This is done by a propagation step in which NOT-MATCH is propagated from each entry $M[\beta, y]$ that has value NOT-MATCH to every entry $M[\alpha, x]$ depending on $\mathrm{mc}(\beta, y)$. We will prove later that after the propagation step an entry which has value MATCH or COND-MATCH signals a match.

To efficiently propagate NOT-MATCH, we associate with each entry a prop-list to record the entries to which NOT-MATCH is propagated when needed: if $M[\alpha, x]$ = COND-MATCH, then $(\alpha, x)$ is in $M[c_i(\alpha), c_i(x)]$'s prop-list for each $i$, $1 \le i \le d(\alpha)$. The prop-lists link entries together in such a way that if there is a chain of $(\alpha_1, x_1), (\alpha_2, x_2), \ldots, (\alpha_n, x_n)$ where $(\alpha_i, x_i)$ is in $M[\alpha_{i+1}, x_{i+1}]$'s prop-list for $1 \le i \le n - 1$, then $(\alpha_1, x_1)$ depends on $\mathrm{mc}(\alpha_n, x_n)$.

Now we give the algorithm, which consists of a function match, a procedure propagate, and the main body. Function match takes as input $\alpha \in V_P$ and $x \in V_T$ and determines whether $\mathrm{sub}(\alpha) \cong_m \mathrm{sub}(x)$ and $\mathrm{sub}(\alpha) \cong_{cm} \mathrm{sub}(x)$, given that $M$ stores the information about the matches between the subgraphs rooted at the children of $\alpha$ and the subgraphs rooted at the children of $x$. The value returned from function match may be MATCH, COND-MATCH, or NOT-MATCH. Procedure propagate takes as input $\alpha \in V_P$ and $x \in V_T$ and propagates NOT-MATCH to the entries of $M$ depending on $\mathrm{mc}(\alpha, x)$. To make sure that each pair of nodes $\alpha$ and $x$ is passed to procedure propagate only once, we mark $M[\alpha, x]$ when $\alpha$ and $x$ are passed to procedure propagate.

Pattern Matching Algorithm ($P$, $T$)
    Create a $|V_P| \times |V_T|$ matrix $M$ and initialize all entries of $M$ to UNKNOWN;
    For each $\alpha \in V_P$ in postorder do
        For each $x \in V_T$ in postorder do $M[\alpha, x]$ = match($\alpha$, $x$);
    End for;
    For each unmarked entry $M[\alpha, x]$ that has value NOT-MATCH do
        Mark $M[\alpha, x]$ and propagate($\alpha$, $x$);
    End for;
    Output all $(r(P), x)$ s.t. $M[r(P), x]$ is MATCH or COND-MATCH;
End.

Function match($\alpha$, $x$)
    If $l(\alpha) = l(x)$, $d(\alpha) = 0$ or $d(\alpha) = d(x)$, and for each $k$ $M[c_k(\alpha), c_k(x)]$ = MATCH then return MATCH;
    Else if $l(\alpha) \neq l(x)$, $d(\alpha) \neq d(x)$, or there exists $k$ s.t. $M[c_k(\alpha), c_k(x)]$ = NOT-MATCH then return NOT-MATCH;
    Else
        For each $k$ do put $(\alpha, x)$ in $M[c_k(\alpha), c_k(x)]$'s prop-list;
        Return COND-MATCH;
    End if;
End.

Procedure propagate($\alpha$, $x$)
    For each $(\beta, y)$ in $M[\alpha, x]$'s prop-list s.t. $M[\beta, y]$ is not marked do
        Assign NOT-MATCH to $M[\beta, y]$;
        Mark $M[\beta, y]$ and propagate($\beta$, $y$);
    End for;
    Remove $M[\alpha, x]$'s prop-list;
End.
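The three pieces above can be put together in a compact Python sketch. It follows the pseudocode closely but makes several assumptions that are not in the paper: graphs are dicts from nodes to ordered child lists, the matrix and the prop-lists are dictionaries keyed by node pairs, a missing key plays the role of UNKNOWN, the recursive procedure propagate is replaced by an explicit stack, and the postorder is taken from a depth-first walk over ordinary edges only.

```python
from collections import deque

UNKNOWN, MATCH, COND_MATCH, NOT_MATCH = "UNKNOWN", "MATCH", "COND-MATCH", "NOT-MATCH"

def postorder(children, root):
    """Postorder over ordinary edges only (back and cross edges ignored)."""
    depth, queue = {root: 0}, deque([root])
    while queue:                                   # BFS gives shortest-path depths
        x = queue.popleft()
        for c in children[x]:
            if c not in depth:
                depth[c] = depth[x] + 1
                queue.append(c)
    order, seen = [], set()
    def visit(x):
        seen.add(x)
        for c in children[x]:
            if depth[x] < depth[c] and c not in seen:
                visit(c)
        order.append(x)
    visit(root)
    return order

def pattern_match(pc, pl, p_root, tc, tl, t_root):
    M, prop = {}, {}        # prop[(b, y)]: pairs that depend on mc(b, y)

    def match(a, x):
        if pl[a] != tl[x]:
            return NOT_MATCH
        if not pc[a]:
            return MATCH                            # d(a) = 0
        if len(pc[a]) != len(tc[x]):
            return NOT_MATCH
        pairs = list(zip(pc[a], tc[x]))
        vals = [M.get(p, UNKNOWN) for p in pairs]
        if all(v == MATCH for v in vals):
            return MATCH
        if any(v == NOT_MATCH for v in vals):
            return NOT_MATCH
        for p in pairs:                             # register the dependencies
            prop.setdefault(p, []).append((a, x))
        return COND_MATCH

    t_order = postorder(tc, t_root)
    for a in postorder(pc, p_root):                 # bottom-up traversal
        for x in t_order:
            M[(a, x)] = match(a, x)

    stack = [p for p, v in M.items() if v == NOT_MATCH]   # propagation step
    marked = set(stack)
    while stack:
        for dep in prop.pop(stack.pop(), []):
            if dep not in marked:
                M[dep] = NOT_MATCH
                marked.add(dep)
                stack.append(dep)
    return M

# Hypothetical inputs. First pair: the two graphs have the same unfolding.
P, PL = {0: [1, 0], 1: []}, {0: "f", 1: "a"}
T, TL = {0: [1, 2], 1: [], 2: [3, 0], 3: []}, {0: "f", 1: "a", 2: "f", 3: "a"}
M1 = pattern_match(P, PL, 0, T, TL, 0)
# Second pair: a self-loop labeled "f" against a two-node cycle labeled g, f.
M2 = pattern_match({0: [0]}, {0: "f"}, 0, {0: [1], 1: [0]}, {0: "g", 1: "f"}, 0)
```

In the first run the root pair ends as COND-MATCH, which the algorithm reports as a match. In the second run the pair (0, 1) is first assigned COND-MATCH because it waits on the pair (0, 0); once (0, 0) is found to be NOT-MATCH, the propagation step overwrites (0, 1) as well. The recursive `visit` is adequate for small inputs only.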

LEMMA 2. After the bottom-up traversal step and before the propagation step, if $\mathrm{sub}(\alpha) \cong_m \mathrm{sub}(x)$ then $M[\alpha, x]$ is MATCH or COND-MATCH; if $\mathrm{sub}(\alpha) \not\cong_m \mathrm{sub}(x)$ then $M[\alpha, x]$ is NOT-MATCH or COND-MATCH.

Proof. Guaranteed by function match.


LEMMA 3. An entry $M[\alpha, x]$ has value NOT-MATCH after the propagation step if and only if $\mathrm{sub}(\alpha) \not\cong_m \mathrm{sub}(x)$.

Proof. It is obvious that if $M[\alpha, x]$ = NOT-MATCH after the propagation step, then $\mathrm{sub}(\alpha) \not\cong_m \mathrm{sub}(x)$.

If $\mathrm{sub}(\alpha) \not\cong_m \mathrm{sub}(x)$, then at least one of the following is true: (i) $l(\alpha) \neq l(x)$, (ii) $d(\alpha) \neq d(x)$, or (iii) there exists an $\mathrm{mc}(\beta, y)$ on which $(\alpha, x)$ depends such that $\mathrm{sub}(\beta) \not\cong_m \mathrm{sub}(y)$. During the bottom-up traversal step, if the above (i) or (ii) is true, then $M[\alpha, x]$ is assigned NOT-MATCH; otherwise, if the above (iii) is true, then $M[\alpha, x]$ may be assigned COND-MATCH. If $M[\alpha, x]$ is assigned COND-MATCH, then there must exist $\mathrm{mc}(\beta, y)$ such that $M[\beta, y]$ = NOT-MATCH and $(\alpha, x)$ depends on $\mathrm{mc}(\beta, y)$, because otherwise $\mathrm{sub}(\alpha)$ would match $\mathrm{sub}(x)$ according to Lemma 1. Furthermore, there must exist $(\alpha_1, x_1), (\alpha_2, x_2), \ldots, (\alpha_n, x_n)$ such that $(\alpha, x)$ is in $M[\alpha_1, x_1]$'s prop-list, $(\alpha_n, x_n)$ is in $M[\beta, y]$'s prop-list, and $(\alpha_i, x_i)$ is in $M[\alpha_{i+1}, x_{i+1}]$'s prop-list for $1 \le i \le n - 1$. Therefore, NOT-MATCH will be propagated from $M[\beta, y]$ to $M[\alpha, x]$ in procedure propagate.

THEOREM 1. The algorithm correctly solves the ordered labeled RDG pattern matching problem in time and space $O(|E_P||V_T| + |E_T|)$.

Proof. According to Lemma 3, each entry $M[r(P), y]$ which has value COND-MATCH or MATCH after the propagation step signals $P \cong_m \mathrm{sub}(y)$.

Function match takes at most $O(\sum_{\alpha \in V_P} d(\alpha) |V_T|) = O(|E_P||V_T|)$ steps during the whole course of computation. It is clear that the total number of elements in all prop-lists is bounded by $|E_P||V_T|$. Since each element in any prop-list is visited at most once during the propagation step, the propagation takes at most $O(|E_P||V_T|)$ steps in total. The postorder of $P$ and $T$ can be computed in time $O(|E_P| + |E_T|)$. Therefore, the time complexity of the algorithm is $O(|E_P||V_T| + |E_T|)$. The space cost of our algorithm is clearly $O(|E_P||V_T| + |E_T|)$.
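The bound on function match rests on a counting step that can be spelled out. Reading $d(\alpha)$ as the number of edges leaving $\alpha$, every edge of $P$ is counted exactly once when the degrees are summed, so the work over all pairs $(\alpha, x)$ is:

```latex
\sum_{\alpha \in V_P} \sum_{x \in V_T} d(\alpha)
  \;=\; |V_T| \sum_{\alpha \in V_P} d(\alpha)
  \;=\; |V_T| \, |E_P| .
```

The same count bounds the total prop-list size, since a pair $(\alpha, x)$ is added to at most $d(\alpha)$ prop-lists.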

4.3. Extension

Our algorithm establishes not only the match between $P$ and every subgraph of $T$, but also the match between every subgraph of $P$ and every subgraph of $T$. This means that it solves not only the ordered labeled RDG pattern matching problem, but also the ordered labeled directed graph pattern matching problem at the same time.

THEOREM 2. The algorithm correctly solves the ordered labeled directed graph pattern matching problem in time and space $O(|E_P||V_T| + |E_T|)$.


5. TOPOLOGICAL EMBEDDING

5.1. Order of Traversal

Like the pattern matching algorithm, our algorithm for the topological embedding problem adopts the bottom-up approach. Let $\mathrm{dag}(G)$ denote the directed acyclic graph obtained by replacing each strongly connected component (SCC) in a directed graph $G$ with a single node. A $\mathrm{dag}(G)$ has all edges in $G$ except those edges between the nodes in the same SCC. The SCCs in a directed graph can be identified using Tarjan's linear time algorithm [17]. We associate with each node $x$ in $\mathrm{dag}(G)$ a unique number $i$, $1 \le i \le |V_{\mathrm{dag}(G)}|$, such that $i$ is greater than all numbers associated with the descendants of $x$. This can be done by applying topological sort to the subgraphs of $\mathrm{dag}(G)$.

Each node $i$ in $\mathrm{dag}(G)$ corresponds to one or more nodes in $G$. Throughout the description, we use $\mathrm{scc}(i)$ to denote the node or nodes in $G$ represented by node $i$ in $\mathrm{dag}(G)$. Figure 4 shows an example of $\mathrm{dag}(T)$ and the numbers associated with the nodes. Note that the nodes in $\mathrm{dag}(G)$ do not have labels, since $\mathrm{dag}(G)$ is created to provide an order of traversal.
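One way to obtain such a numbering in a single pass is sketched below. It relies on a known property of Tarjan's algorithm: a component is completed only after every component reachable from it, so numbering the components in completion order already makes each number greater than those of all descendants. The function name, the dict-based representation, and the example graph are assumptions; the edges of the example were chosen only to agree with the correspondence stated in the caption of Fig. 4 and are not taken from the figure itself.

```python
def scc_numbering(children):
    """Tarjan's SCC algorithm.

    Returns (number, sccs): number[x] is the 1-based index of the component
    containing x, and sccs[i - 1] lists the nodes of component i.
    """
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs, number = [], {}

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in children[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:            # v is the first-visited node of its SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(comp)
            for w in comp:
                number[w] = len(sccs)

    for v in children:
        if v not in index:
            visit(v)
    return number, sccs

# Hypothetical graph: 5 -> 1, the cycle 1 -> 3 -> 4 -> 1, and 3 -> 2.
T = {5: [1], 1: [3], 3: [4, 2], 4: [1], 2: []}
number, sccs = scc_numbering(T)
```

Here node 2 receives number 1, the cycle through nodes 1, 3, and 4 receives number 2, and node 5 receives number 3. The recursion depth equals the longest depth-first path, so large graphs would need an iterative variant.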

5.2. Main Framework

Our algorithm traverses $\mathrm{dag}(P)$ and $\mathrm{dag}(T)$ in ascending order of the numbers associated with the nodes so that a node is visited only after all its children have been visited. During the traversal, it compares each pair of nodes $i$ in $\mathrm{dag}(P)$ and $j$ in $\mathrm{dag}(T)$ and determines whether or not the subgraph or subgraphs of $P$ represented by node $i$ are topologically embeddable in the subgraph or subgraphs represented by node $j$. If node $j$ corresponds to a node $x \in V_T$, all information needed to determine whether the subgraph or subgraphs represented by node $i$ are topologically embeddable in the subgraph represented by node $j$ is readily available; if node $j$ corresponds to an SCC in $T$, we need to deal with cyclic dependence, and we will discuss how to deal with it when describing procedure cycle-embed. Note that we distinguish between a node and an SCC consisting of one node. If node $j$ corresponds to an SCC consisting of one node, we use procedure cycle-embed to deal with it.

FIG. 4. Node 1 in $\mathrm{dag}(T)$ corresponds to node 2 in $T$; node 2 in $\mathrm{dag}(T)$ corresponds to the SCC consisting of node 1, node 3, and node 4 in $T$; node 3 in $\mathrm{dag}(T)$ corresponds to node 5 in $T$.

The topological embedding problem is different from the pattern matching problem in that an edge in $P$ can be mapped to a path in $T$. For $\alpha \in V_P$ and $x \in V_T$ such that $\mathrm{sub}(\alpha) \cong_e \mathrm{sub}(x)$, there may be more than one set of mappings for the nodes in $\mathrm{sub}(\alpha)$ to the nodes in $\mathrm{sub}(x)$ which can result in the embedding of $\mathrm{sub}(\alpha)$ in $\mathrm{sub}(x)$. Therefore, in order to prove that $\mathrm{sub}(\alpha) \not\cong_e \mathrm{sub}(x)$, one has to show not only $\mathrm{sub}(c_k(\alpha)) \not\cong_e \mathrm{sub}(c_k(x))$, but also $\mathrm{sub}(c_k(\alpha)) \not\cong_e S$ for any subgraph $S$ of $\mathrm{sub}(c_k(x))$, $1 \le k \le d(\alpha)$. We use desc-embeddable to express such a relation.

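DEFINITION 13. A $\mathrm{sub}(\alpha)$ is desc-embeddable in a $\mathrm{sub}(x)$, denoted $\mathrm{sub}(\alpha) \cong_{de} \mathrm{sub}(x)$, if $\mathrm{sub}(\alpha) \cong_e S$, where $S$ is a subgraph of $\mathrm{sub}(x)$.

We write $\mathrm{sub}(\alpha) \not\cong_{de} \mathrm{sub}(x)$ if $\mathrm{sub}(\alpha)$ is not desc-embeddable in $\mathrm{sub}(x)$, i.e., $\mathrm{sub}(\alpha) \not\cong_{de} \mathrm{sub}(x)$ if $\mathrm{sub}(\alpha)$ is not topologically embeddable in any subgraph of $\mathrm{sub}(x)$, including $\mathrm{sub}(x)$ itself.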

LEMMA 4. $\mathrm{sub}(\alpha) \cong_e \mathrm{sub}(x)$ if and only if $l(\alpha) = l(x)$, $d(\alpha) = d(x)$, and for each $k$, $1 \le k \le d(\alpha)$, $\mathrm{sub}(c_k(\alpha)) \cong_{de} \mathrm{sub}(c_k(x))$.

Proof. Immediate from the definitions of topological embedding and desc-embeddability.

During the traversal, we record not only whether the subgraphs of $P$ are topologically embeddable in the subgraphs of $T$, but also whether the former are desc-embeddable in the latter. A $|V_P| \times |V_T|$ matrix $M$ serves this purpose. Each entry $M[\alpha, x]$ may have one of four values: UNKNOWN if $\alpha$ has not been processed against $x$ yet, EMBED if $\mathrm{sub}(\alpha) \cong_e \mathrm{sub}(x)$, DESC-EMBED if $\mathrm{sub}(\alpha) \cong_{de} \mathrm{sub}(x)$ and $\mathrm{sub}(\alpha) \not\cong_e \mathrm{sub}(x)$, and NOT-DESC-EMBED if $\mathrm{sub}(\alpha) \not\cong_{de} \mathrm{sub}(x)$. The following is the algorithm; procedure cycle-embed is described in the next section.

Topological Embedding Algorithm ($P$, $T$)
    Compute $\mathrm{dag}(P)$ and $\mathrm{dag}(T)$ and number the nodes;
    Create a $|V_P| \times |V_T|$ matrix $M$ and initialize each entry to UNKNOWN;
    For $i$ from 1 to $|V_{\mathrm{dag}(P)}|$ do
        For $j$ from 1 to $|V_{\mathrm{dag}(T)}|$ do
            If $j$ corresponds to an SCC in $T$ then cycle-embed($i$, $j$);
            Else for each $\alpha \in \mathrm{scc}(i)$ do dag-embed($\alpha$, $\mathrm{scc}(j)$);
        End for;
    End for;
    Output all $(\alpha, x)$ s.t. $M[\alpha, x]$ = EMBED;
End.

Procedure dag-embed($\alpha$, $x$)
    If $l(\alpha) = l(x)$, $d(\alpha) = 0$ or $d(\alpha) = d(x)$, and for each $k$ $M[c_k(\alpha), c_k(x)]$ is EMBED or DESC-EMBED then
        $M[\alpha, x]$ = EMBED;
    Else if there exists a child $c$ of $x$ s.t. $M[\alpha, c]$ is DESC-EMBED or EMBED then
        $M[\alpha, x]$ = DESC-EMBED;
    Else $M[\alpha, x]$ = NOT-DESC-EMBED;
End.
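Procedure dag-embed can be tried out on its own in the restricted case where both graphs are acyclic, so that cycle-embed is never called. The sketch below covers only that restricted case; the function names, the dict-based representation, the use of a plain depth-first postorder in place of the dag numbering, and the example graphs are all assumptions made for illustration.

```python
EMBED, DESC_EMBED, NOT_DESC_EMBED = "EMBED", "DESC-EMBED", "NOT-DESC-EMBED"

def children_first(children):
    """Order the nodes of an acyclic graph so that children precede parents."""
    order, seen = [], set()
    def visit(v):
        seen.add(v)
        for w in children[v]:
            if w not in seen:
                visit(w)
        order.append(v)
    for v in children:
        if v not in seen:
            visit(v)
    return order

def embed_acyclic(pc, pl, tc, tl):
    """Fill M for acyclic P and T using the rule of procedure dag-embed."""
    M = {}
    ok = (EMBED, DESC_EMBED)

    def dag_embed(a, x):
        pa, tx = pc[a], tc[x]
        if pl[a] == tl[x] and (
                not pa or (len(pa) == len(tx)
                           and all(M[(ca, cx)] in ok for ca, cx in zip(pa, tx)))):
            return EMBED
        if any(M[(a, c)] in ok for c in tx):   # embeddable somewhere below x
            return DESC_EMBED
        return NOT_DESC_EMBED

    t_order = children_first(tc)
    for a in children_first(pc):
        for x in t_order:
            M[(a, x)] = dag_embed(a, x)
    return M

# Hypothetical pattern f(a, b) and target f(g(a), b): the edge from the
# pattern root to its first child must be mapped to a path of length two.
P, PL = {0: [1, 2], 1: [], 2: []}, {0: "f", 1: "a", 2: "b"}
T, TL = {0: [1, 2], 1: [3], 2: [], 3: []}, {0: "f", 1: "g", 2: "b", 3: "a"}
M = embed_acyclic(P, PL, T, TL)
```

In this example the pair of roots is assigned EMBED even though the first child of the pattern root does not match the first child of the target root directly: the entry for that child pair is DESC-EMBED, which is exactly the situation that separates topological embedding from pattern matching.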

5.3. Handling Dependency Cycles

The following notation is needed to describe our algorithm.

DEFINITION 14. A pair $(\beta, y)$ is an embedding-condition, denoted $\mathrm{ec}(\beta, y)$, on which a pair $(\alpha, x)$ depends if $\alpha$ and $x$ are parents of $\beta$ and $y$, respectively, and if $\mathrm{sub}(\beta) \not\cong_{de} \mathrm{sub}(y)$ implies that $\mathrm{sub}(\alpha) \not\cong_e \mathrm{sub}(x)$.

DEFINITION 15. An $\mathrm{ec}(\beta, y)$ is confirmed if $\mathrm{sub}(\beta) \cong_{de} \mathrm{sub}(y)$ has been determined. An $\mathrm{ec}(\beta, y)$ is violated if $\mathrm{sub}(\beta) \not\cong_{de} \mathrm{sub}(y)$ has been determined.

We use conditionally topologically embeddable to describe the situation when there is not enough evidence to determine whether a $\mathrm{sub}(\alpha)$ is topologically embeddable in a $\mathrm{sub}(x)$.

DEFINITION 16. A $\mathrm{sub}(\alpha)$ is conditionally topologically embeddable in $\mathrm{sub}(x)$, denoted $\mathrm{sub}(\alpha) \cong_{ce} \mathrm{sub}(x)$, if $l(\alpha) = l(x)$, $d(\alpha) = d(x)$, no embedding-condition on which $(\alpha, x)$ depends has been violated, and there exists some embedding-condition on which $(\alpha, x)$ depends that has not been confirmed.

Ž . Ž .LEMMA 5. Let a and x be two nodes in scc i and scc j , respecti ely. IfŽ . Ž . Ž . Ž .sub a $ sub x and if for each b g scc i there exists a node y in sub xce

Ž . Ž . Ž . Ž . Ž . Ž .such that sub b $ sub y or sub b $ sub y , then sub a $ sub x .e ce e

Proof. We prove by induction on m that for any m G 0, there exists jŽ .m Ž . jsuch that sub a is topologically embeddable in sub x .Ž . Ž . Ž .0 Ž .0Basis. Since l a s l x , sub a is topologically embeddable in sub x .

Induction Step. Suppose that for any m F n, there exists j such thatmŽ .m Ž . jmsub a is topologically embeddable in sub x . We now prove that there

Ž .nq1 Ž . jnq 1exists j such that sub a is topologically embeddable in sub x .nq1


For each $k$, $1 \le k \le d(\alpha)$,

    1. if $c_k(\alpha) \notin \mathrm{scc}(i)$ or $c_k(x) \notin \mathrm{scc}(j)$, then there exists a subgraph $S$ of $\mathrm{sub}(c_k(x))$ such that $\mathrm{sub}(c_k(\alpha)) \preceq_{e} S$, because otherwise $\mathrm{sub}(\alpha)$ would not be conditionally topologically embeddable in $\mathrm{sub}(x)$;

    2. if $c_k(\alpha) \in \mathrm{scc}(i)$ and $c_k(x) \in \mathrm{scc}(j)$, then since $\mathrm{sub}(x)$ is a subgraph of $\mathrm{sub}(c_k(x))$, there exists a subgraph $S$ of $\mathrm{sub}(c_k(x))$ such that either $\mathrm{sub}(c_k(\alpha)) \preceq_{e} S$ or $\mathrm{sub}(c_k(\alpha)) \preceq_{ce} S$. If $\mathrm{sub}(c_k(\alpha)) \preceq_{ce} S$, then $r(S)$ must be in $\mathrm{scc}(j)$ and $\mathrm{sub}(x)$ is a subgraph of $S$. Thus for each $\beta \in \mathrm{scc}(i)$, there exists a subgraph $S'$ of $S$ such that $\mathrm{sub}(\beta) \preceq_{e} S'$ or $\mathrm{sub}(\beta) \preceq_{ce} S'$.

All the above cases yield the fact that for each $k$, there exists $j_{n_k}$ and a subgraph $S_k$ of $\mathrm{sub}(c_k(x))$ such that $\mathrm{sub}(c_k(\alpha))^{n}$ is topologically embeddable in $S_k^{\,j_{n_k}}$. Therefore, $\mathrm{sub}(\alpha)^{n+1}$ is topologically embeddable in $\mathrm{sub}(x)^{\max_k\{j_{n_k} + r_k\}}$, where $r_k$ is the rank of $r(S_k)$ and $\max_k\{j_{n_k} + r_k\}$ is the maximum among all $j_{n_k} + r_k$ for $1 \le k \le d(\alpha)$.

The following lemma complements Lemma 5.

LEMMA 6. If there exists $\alpha \in \mathrm{scc}(i)$ such that $\mathrm{sub}(\alpha) \not\preceq_{de} S$ for $S$ a subgraph of $T$, then $\mathrm{sub}(\beta) \not\preceq_{de} S$ for each $\beta \in \mathrm{scc}(i)$.

Proof. Suppose that there were a node $\beta$ in $\mathrm{scc}(i)$ such that $\mathrm{sub}(\beta) \preceq_{de} S$. Then there would exist a subgraph $S'$ of $S$ such that $\mathrm{sub}(\beta) \preceq_{e} S'$. Since $\alpha$ is a descendant of $\beta$, there would exist a subgraph $S''$ of $S'$ such that $\mathrm{sub}(\alpha) \preceq_{e} S''$. This contradicts the fact that $\mathrm{sub}(\alpha) \not\preceq_{de} S$.

Our scheme for handling dependency cycles directly follows from Lemma 5 and Lemma 6. Given $\mathrm{scc}(i)$ and $\mathrm{scc}(j)$, it tries to find a mapping $f$ from the nodes in $\mathrm{scc}(i)$ to the nodes in $\mathrm{scc}(j)$ and their descendants such that $\mathrm{sub}(\alpha) \preceq_{e} \mathrm{sub}(f(\alpha))$ or $\mathrm{sub}(\alpha) \preceq_{ce} \mathrm{sub}(f(\alpha))$ for each $\alpha \in \mathrm{scc}(i)$. This is done by a traversal of $\mathrm{scc}(i)$ and $\mathrm{scc}(j)$ in some order. During the traversal, we determine for each $\alpha \in \mathrm{scc}(i)$ and each $x \in \mathrm{scc}(j)$ whether $\mathrm{sub}(\alpha) \preceq_{ce} \mathrm{sub}(x)$ and whether $\mathrm{sub}(\alpha) \preceq_{de} \mathrm{sub}(x)$. After the traversal terminates, if for each $\alpha \in \mathrm{scc}(i)$ there exists $x \in \mathrm{scc}(j)$ such that $\mathrm{sub}(\alpha) \preceq_{ce} \mathrm{sub}(x)$ or $\mathrm{sub}(\alpha) \preceq_{de} \mathrm{sub}(x)$, i.e., a mapping $f$ exists, then according to Lemma 5, we conclude that $\mathrm{sub}(\beta) \preceq_{e} \mathrm{sub}(y)$ if $\mathrm{sub}(\beta) \preceq_{ce} \mathrm{sub}(y)$ for each $\beta \in \mathrm{scc}(i)$ and $y \in \mathrm{scc}(j)$; otherwise, according to Lemma 6, we conclude that $\mathrm{sub}(\alpha) \not\preceq_{de} \mathrm{sub}(x)$ for each $\alpha \in \mathrm{scc}(i)$ and $x \in \mathrm{scc}(j)$.

There are some differences between procedure cycle-embed and our pattern matching algorithm. One difference is that when determining whether $\mathrm{sub}(\alpha) \preceq_{ce} \mathrm{sub}(x)$, we do not need to check whether $\mathrm{sub}(c_k(\alpha)) \preceq_{ce} \mathrm{sub}(c_k(x))$ for $1 \le k \le d(\alpha)$, since the latter has no effect on the former. Another difference is that $\mathrm{scc}(i)$ and $\mathrm{scc}(j)$ may be traversed in any order, since during the traversal, the decision we make for any pair of nodes has nothing to do with the decisions for other pairs of nodes in $\mathrm{scc}(i)$ and $\mathrm{scc}(j)$. A third difference is that the propagation scheme cannot be used in this algorithm, since $\mathrm{sub}(c_k(\alpha)) \not\preceq_{e} \mathrm{sub}(c_k(x))$ does not imply that $\mathrm{sub}(\alpha) \not\preceq_{e} \mathrm{sub}(x)$.

A $|\mathrm{scc}(i)| \times |\mathrm{scc}(j)|$ matrix EM is used to record the results from the traversal, where $|\mathrm{scc}(i)|$ and $|\mathrm{scc}(j)|$ are the numbers of nodes in $\mathrm{scc}(i)$ and $\mathrm{scc}(j)$, respectively. Each entry $EM[\alpha, x]$ may have one of four values: UNKNOWN if $\mathrm{sub}(\alpha)$ has not been processed against $\mathrm{sub}(x)$ yet, EMBED if $\mathrm{sub}(\alpha) \preceq_{e} \mathrm{sub}(x)$, NOT-EMBED if $\mathrm{sub}(\alpha) \not\preceq_{e} \mathrm{sub}(x)$, and COND-EMBED if $\mathrm{sub}(\alpha) \preceq_{ce} \mathrm{sub}(x)$. EMBED is used only when $i$ corresponds to a single node in $P$, in which case all children of that node have already been processed against the nodes in $\mathrm{scc}(j)$ before procedure cycle-embed is called.

We also use an array DE of size $|\mathrm{scc}(i)|$ to record whether each subgraph rooted at a node in $\mathrm{scc}(i)$ is desc-embeddable in any subgraph rooted at a node in $\mathrm{scc}(j)$: $DE[\alpha] = 1$ if there exists $x \in \mathrm{scc}(j)$ such that $\mathrm{sub}(\alpha) \preceq_{ce} \mathrm{sub}(x)$ or $\mathrm{sub}(\alpha) \preceq_{de} \mathrm{sub}(x)$; $DE[\alpha] = 0$ otherwise. A function embed is used to determine the value of the entries of EM.

Procedure cycle-embed($i$, $j$)
    Create matrix EM and array DE and initialize each entry of EM to UNKNOWN and each entry of DE to 0;
    For each $\alpha \in \mathrm{scc}(i)$ do
        For each $x \in \mathrm{scc}(j)$ do
            $EM[\alpha, x]$ = embed($\alpha$, $x$);
            If $EM[\alpha, x]$ is EMBED or COND-EMBED or there exists $k$ s.t. $M[\alpha, c_k(x)]$ is EMBED or DESC-EMBED then $DE[\alpha]$ = 1;
        End for;
    End for;
    If there exists $\beta \in \mathrm{scc}(i)$ s.t. $DE[\beta]$ = 0 then
        For each $\alpha \in \mathrm{scc}(i)$ and $x \in \mathrm{scc}(j)$ do $M[\alpha, x]$ = NOT-DESC-EMBED;
    Else for each $\alpha \in \mathrm{scc}(i)$ and $x \in \mathrm{scc}(j)$ do
        If $EM[\alpha, x]$ is EMBED or COND-EMBED then $M[\alpha, x]$ = EMBED;
        Else $M[\alpha, x]$ = DESC-EMBED;
    End if;
    Remove matrix EM and array DE;
End.


Function embed($\alpha$, $x$)
    If $l(\alpha) = l(x)$, $d(\alpha) = 0$ or $d(\alpha) = d(x)$, and for each $k$, $M[c_k(\alpha), c_k(x)]$ is EMBED or DESC-EMBED then return EMBED;
    Else if $l(\alpha) \ne l(x)$, $d(\alpha) \ne d(x)$ or there exists $k$ s.t. $M[c_k(\alpha), c_k(x)]$ = NOT-DESC-EMBED then return NOT-EMBED;
    Else return COND-EMBED;
End.
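To make the interplay between function embed and procedure cycle-embed concrete, here is a minimal Python sketch of both. The dictionary-based graph encoding (`labels_p`, `kids_p`, and so on), the string constants, and the two toy inputs are assumptions of this sketch. An entry missing from `M` stands for a pair that has not been decided yet, which is what produces COND-EMBED.

```python
EMBED, DESC_EMBED, NOT_DESC_EMBED = "EMBED", "DESC-EMBED", "NOT-DESC-EMBED"
NOT_EMBED, COND_EMBED = "NOT-EMBED", "COND-EMBED"
GOOD = (EMBED, DESC_EMBED)


def embed(a, x, labels_p, labels_t, kids_p, kids_t, M):
    """Three-way test; pairs absent from M are still undetermined."""
    ca, cx = kids_p[a], kids_t[x]
    same_label = labels_p[a] == labels_t[x]
    pairs = list(zip(ca, cx))
    if (same_label and (not ca or len(ca) == len(cx))
            and all(M.get(pair) in GOOD for pair in pairs)):
        return EMBED
    if (not same_label or len(ca) != len(cx)
            or any(M.get(pair) == NOT_DESC_EMBED for pair in pairs)):
        return NOT_EMBED
    return COND_EMBED


def cycle_embed(scc_i, scc_j, labels_p, labels_t, kids_p, kids_t, M):
    """Decide M[(a, x)] for all a in scc_i, x in scc_j; M is only written at the end."""
    EM = {}
    DE = {a: 0 for a in scc_i}
    for a in scc_i:
        for x in scc_j:
            EM[(a, x)] = embed(a, x, labels_p, labels_t, kids_p, kids_t, M)
            below = any(M.get((a, c)) in GOOD for c in kids_t[x])
            if EM[(a, x)] in (EMBED, COND_EMBED) or below:
                DE[a] = 1
    every_node_placed = all(DE[a] == 1 for a in scc_i)
    for a in scc_i:
        for x in scc_j:
            if not every_node_placed:
                M[(a, x)] = NOT_DESC_EMBED
            elif EM[(a, x)] in (EMBED, COND_EMBED):
                M[(a, x)] = EMBED
            else:
                M[(a, x)] = DESC_EMBED


# Hypothetical toy input: pattern cycle b <-> g and target cycle 2 <-> 3.
labels_p = {"b": "B", "g": "G"}
kids_p = {"b": ["g"], "g": ["b"]}
kids_t = {2: [3], 3: [2]}

M1 = {}  # target labels agree with the pattern
cycle_embed(["b", "g"], [2, 3], labels_p, {2: "B", 3: "G"}, kids_p, kids_t, M1)

M2 = {}  # no target node carries label "G", so g cannot be placed
cycle_embed(["b", "g"], [2, 3], labels_p, {2: "B", 3: "X"}, kids_p, kids_t, M2)
```

In the first run every pattern node finds a conditional match, so the conditional entries are promoted to EMBED; in the second run one pattern node has no candidate and all pairs are rejected together.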

THEOREM 3. The algorithm correctly solves the ordered labeled directed graph topological embedding problem in time $O(|V_P||E_T| + |E_P|)$ and space $O(|V_P||V_T| + |E_P| + |E_T|)$.

Proof. The correctness of procedure cycle-embed immediately follows from Lemma 5 and Lemma 6. The correctness of the rest of the algorithm follows from Lemma 4 and the order in which $P$ and $T$ are traversed.

It takes $O(|E_P| + |E_T|)$ steps to compute $\mathrm{dag}(P)$ and $\mathrm{dag}(T)$ and the number of nodes in them. Let $A = \{x \mid x \text{ is in an SCC in } T\}$ and $B = V_T - A$. Procedure dag-embed takes at most $O(|V_P|(\sum_{x \in B} d(x)))$ steps during the whole course of computation. Procedure cycle-embed and function embed take at most $O(|V_P|(\sum_{x \in A} d(x)))$ steps during the whole course of computation. Therefore, the time complexity of this algorithm is

$$O\Bigl(|V_P|\Bigl(\sum_{x \in B} d(x)\Bigr)\Bigr) + O\Bigl(|V_P|\Bigl(\sum_{x \in A} d(x)\Bigr)\Bigr) + O(|E_P| + |E_T|) = O(|V_P||E_T| + |E_P|).$$

The space cost of this algorithm is clearly $O(|V_P||V_T| + |E_P| + |E_T|)$.
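The final equality in the time bound rests on two facts that are easy to pass over: $A$ and $B$ partition $V_T$, and the degrees $d(x)$ of all target nodes sum to $|E_T|$. Assuming that $d(x)$ counts the children of $x$, as in the procedures, and that $|V_P| \ge 1$ so that $|E_T| \le |V_P||E_T|$, the step can be spelled out as follows.

```latex
\begin{align*}
\sum_{x \in B} d(x) + \sum_{x \in A} d(x) &= \sum_{x \in V_T} d(x) = |E_T|, \\
O\bigl(|V_P|\,|E_T|\bigr) + O\bigl(|E_P| + |E_T|\bigr) &= O\bigl(|V_P|\,|E_T| + |E_P|\bigr).
\end{align*}
```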

An example (see Fig. 5) is given to show how procedure cycle-embed handles $\mathrm{scc}(2)$ in $P$ and $\mathrm{scc}(2)$ in $T$. Before the procedure is called for these two SCCs, node $\alpha$ has been processed against all nodes in $T$ and node $\beta$ and node $\gamma$ have been processed against node 1. During the postorder traversal, we first compare node $\beta$ against node 2. The function call to embed returns COND-EMBED and thus $EM[\beta, 2]$ is assigned COND-EMBED and $DE[\beta]$ is assigned 1. We then compare node $\beta$ against node 3 and node 4, and then compare node $\gamma$ against node 2, node 3, and node 4. Table 1 shows the values of the entries of matrix EM and array DE after the traversal.

Since all entries of DE are 1, $M[\beta, 2]$, $M[\beta, 3]$ and $M[\gamma, 4]$ are assigned EMBED, and $M[\beta, 4]$, $M[\gamma, 2]$ and $M[\gamma, 3]$ are assigned DESC-EMBED.
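The last step of this example can be replayed mechanically. The short Python sketch below takes the entries of Table 1 as input and applies only the final phase of procedure cycle-embed; the function name `finalize` and the spelled-out node names `"beta"` and `"gamma"` are assumptions made for illustration.

```python
# Entries of matrix EM and array DE as listed in Table 1.
EM = {
    ("beta", 2): "COND-EMBED", ("beta", 3): "COND-EMBED", ("beta", 4): "NOT-EMBED",
    ("gamma", 2): "NOT-EMBED", ("gamma", 3): "NOT-EMBED", ("gamma", 4): "COND-EMBED",
}
DE = {"beta": 1, "gamma": 1}


def finalize(EM, DE):
    """Final phase of cycle-embed: turn EM and DE into entries of M."""
    if any(flag == 0 for flag in DE.values()):
        # Some pattern node has no candidate, so every pair is rejected.
        return {pair: "NOT-DESC-EMBED" for pair in EM}
    return {
        pair: "EMBED" if value in ("EMBED", "COND-EMBED") else "DESC-EMBED"
        for pair, value in EM.items()
    }


M = finalize(EM, DE)
for pair in sorted(M):
    print(pair, M[pair])
```

Setting either entry of `DE` to 0 makes `finalize` return NOT-DESC-EMBED for all six pairs, which is the branch of the procedure justified by Lemma 6.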


FIG. 5. Node 1 and node 2 in $\mathrm{dag}(P)$ correspond to node $\alpha$ and the SCC consisting of node $\beta$ and node $\gamma$ in $P$, respectively; node 1 and node 2 in $\mathrm{dag}(T)$ correspond to node 1 and the SCC consisting of node 2, node 3 and node 4, respectively.

TABLE 1
Matrix EM and Array DE after Traversal

                                Matrix EM
                    Node 2        Node 3        Node 4        Array DE

Node $\beta$        COND-EMBED    COND-EMBED    NOT-EMBED     1
Node $\gamma$       NOT-EMBED     NOT-EMBED     COND-EMBED    1

6. CONCLUSION

We have presented efficient algorithms for the RDG pattern matching problem, the directed graph pattern matching problem and the directed graph topological embedding problem. The algorithms can easily be extended to handle multiple patterns and wild-card nodes, where a wild-card node is a node in a pattern such that the subgraph rooted at the wild-card node matches any subgraph of a target.

Our algorithms use dynamic programming techniques together with the schemes for handling cycles in directed graphs. It would be interesting to see whether these schemes can be adapted to solve other pattern matching problems in directed graphs. It would also be interesting to see whether the running times of our algorithms can be improved. One possible way is using a preprocessing technique, as in Hoffmann and O'Donnell's tree pattern matching algorithms [11]. The major difficulty would be handling cycles in directed graphs.

ACKNOWLEDGMENT

The author wishes to thank Naomi Nishimura, who read the early versions of the manuscript and provided many valuable comments.

REFERENCES

1. A. V. Aho, R. Sethi, and J. D. Ullman, "Compilers: Principles, Techniques, and Tools," Chapter 6.7, Addison-Wesley, Reading, MA, 1986.

2. A. Aiken and B. R. Murphy, Implementing regular tree expressions, in "Proceedings, 5th ACM Conference on Functional Programming Languages and Computer Architecture, 1991," pp. 427-447.

3. A. Aiken and B. R. Murphy, Static type inference in a dynamically typed language, in "Proceedings, Eighteenth Annual ACM Symposium on Principles of Programming Languages, 1991," pp. 279-290.

4. S. A. Cook, The complexity of theorem-proving procedures, in "Proceedings, 3rd Annual Symposium on the Theory of Computing, 1971," pp. 151-158.

5. A. Corradini, Term rewriting in $CT_\Sigma$, in "Proceedings, International Joint Conference on Theory and Practice of Software Development, 1993," pp. 468-484.

6. N. Dershowitz and S. Kaplan, Rewrite, rewrite, rewrite, rewrite, rewrite, ..., in "Proceedings, Sixteenth Annual Symposium on Principles of Programming Languages, 1989," pp. 250-259.

7. M. Dubiner, Z. Galil, and E. Magen, Faster tree pattern matching, in "Proceedings, IEEE Symposium on Foundations of Computer Science, 1990," pp. 145-150.

8. J. R. W. Glauert, J. R. Kennaway, and M. R. Sleep, Dactl: An experimental graph rewriting language, in "Proceedings, Fourth International Workshop on Graph Grammars and Their Application to Computer Science, 1990," pp. 378-395.

9. J. Fu, Pattern matching in directed graphs, in "Proceedings, Sixth Annual Symposium on Combinatorial Pattern Matching, 1995," pp. 64-77.

10. N. Heintze and J. Jaffar, A finite presentation theorem for approximating logic programs, in "Proceedings, Seventeenth Annual ACM Symposium on Principles of Programming Languages, 1990," pp. 197-209.

11. C. M. Hoffmann and M. J. O'Donnell, Pattern matching in trees, J. Assoc. Comput. Mach. 29 (1982), 68-95.

12. K. H. Holm, Graph matching in operational semantics and typing, in "Proceedings, Colloquium on Trees in Algebra and Programming, 1990," pp. 191-205.

13. J. E. Hopcroft and R. M. Karp, An $n^{5/2}$ algorithm for maximum matchings in bipartite graphs, SIAM J. Comput. 3(4) (1973), 225-231.

14. J. Katzenelson, S. S. Pinter, and E. Schenfeld, Type matching, type-graphs, and the Schanuel conjecture, ACM Trans. Programming Lang. Syst. 14(4) (1992), 574-588.

15. J. W. Klop, Term rewriting systems, Technical Report CS-R9073, Free University, Department of Mathematics and Computer Science, 1990.

16. S. R. Kosaraju, Efficient tree pattern matching, in "Proceedings, IEEE Symposium on Foundations of Computer Science, 1980," pp. 178-183.

17. R. Tarjan, Depth first search and linear graph algorithms, SIAM J. Comput. 1(2) (1972), 146-160.
