Data & Knowledge Engineering 59 (2006) 652–680

www.elsevier.com/locate/datak

A graph grammars based framework for querying graph-like data ☆

Sergio Flesca a, Filippo Furfaro a,*, Sergio Greco a,b

a Dipto Elettronica Informatica e ISI-CNR, Universita della Calabria, Sistemistica, Via P Rucci 41 C, 87030 Rende-Cosenza, Italy

b ICAR-CNR, 87030 Rende, Italy

Received 1 November 2005; accepted 1 November 2005

Available online 1 December 2005

Abstract

The widespread use of graph-based models for representing data collections (e.g. object-oriented data, XML data, etc.) has stimulated the database research community to investigate the problem of defining declarative languages for querying graph-like databases. In this paper, a new framework for querying graph-like data based on graph grammars is proposed. The new paradigm allows us to verify structural properties of graphs and to extract sub-graphs. More specifically, a new form of query (namely graph query) is proposed, consisting of a particular graph grammar which defines a class of graphs to be matched on the graph representing the database. Thus, unlike path queries, the answer of a graph query is not just a set of nodes, but a subgraph, extracted from the input graph, which satisfies the structural properties defined by the graph grammar. Expressiveness and complexity of different forms of graph queries are discussed, and some practical applications are shown.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Query language; Semistructured data; Graph grammars

1. Introduction

The widespread use of graph based models for representing data collections (e.g. object-oriented data, XML data, etc.) has stimulated the database research community to investigate the problem of defining declarative languages for querying graph-like databases [1,3,17,18,22]. Recently, several languages and prototypes have been proposed for searching both generic graph-like data and specific types of graph data such as XML. The most widely used mechanism for extracting information from graphs is that of path queries, due to its simplicity and declarative nature. Basically, path queries are navigational queries expressed by means of regular expressions denoting paths in the graph [2,4,6,12,19,23]. A path query of the form ⟨C, r⟩, where C is a set of node labels and r a regular expression, defines the query “find all the nodes reachable from a node whose label belongs to C through paths spelling a string of the language defined by the regular expression r”. However, this kind of navigational query is not completely satisfactory since in many cases we would like to express queries verifying whether the graph has a given structure (e.g. a tree or a chain), or we need to extract from the input graph not simply a set of nodes, but a complex subgraph satisfying a given property [7,20,24].

0169-023X/$ - see front matter © 2005 Elsevier B.V. All rights reserved.

doi:10.1016/j.datak.2005.11.001

☆ Work partially supported by a MURST grant under the project “D2I”.

* Corresponding author.

E-mail addresses: [email protected] (S. Flesca), [email protected] (F. Furfaro), [email protected] (S. Greco).

Example 1. Consider the labeled rooted graph shown in Fig. 1, where some pieces of information about a collection of books are represented.

In the above graph, book details (name and surname of authors, title, publisher) are represented by labels associated with leaf nodes, whereas edge labels describe the type of information contained in the descending “subtree”. For instance, any sub-tree identified by an edge with label “written_by” contains information about the book authors.

Assume now that we want to extract the sub-trees corresponding to the books written by Ullman. We could use path queries to extract separately all the pieces of the available information about the desired book, but we cannot use path queries to extract the desired information preserving its structure. That is, we could extract the titles of all Ullman’s books (“A First course in Database Systems”, “Principles of Databases and Knowledge Systems”) by means of the path query ⟨{“Ullman”}, surname.author.written_by.title⟩, and extract their publishers by means of the path query ⟨{“Ullman”}, surname.author.written_by.pub⟩. But the information returned by the former query is disjoint from the information returned by the latter one. As path queries return sets of nodes, we are not able to reconstruct the correspondence between titles and publishers. It is worth noting that if the rooted graph is replaced by an acyclic rooted digraph (tree), even the single pieces of information cannot be extracted, as arcs cannot be navigated from the destination node to the source node.

In order to overcome the limited expressive power of path queries, without completely giving up their simplicity and declarative nature, some languages (such as XQuery [27]) embed the path query mechanism into a more general and more expressive query paradigm. However, this results in procedural query languages, which makes query specification rather complex. Our proposal consists of a framework for defining graph patterns which can be used to extract information from the input graph. A pattern is a graph which defines the shape (or, more generally, structural properties) and the content of the subgraph to be extracted from the input graph. An example of pattern (called query graph) is shown on the left-hand side of Fig. 2. Such a query graph defines a class of graphs to be matched on the input data graph, where labels associated with nodes may be specific (e.g. “Ullman”) or generic (e.g. $n). The matching between any graph in the class defined by the query graph and the input data graph permits us to extract the portions of the graph of Fig. 1 corresponding to Ullman’s books. Node labels whose first symbol is $ are used to define variables: the node with label $u “extracts” the name of Ullman, nodes with labels $n, $s are associated with the name and the surname of possible co-authors of Ullman, whereas nodes with labels $t, $p are associated, respectively, with the title and the publisher of each of Ullman’s books. The two sub-graphs extracted by means of this pattern, containing the available information about the two books written by Ullman, are shown on the right-hand side of Fig. 2.

Fig. 1. A tree containing some information about books.

Fig. 2. A graph pattern.

Thus, in this paper we investigate the problem of extracting a subgraph (a consistent subset of nodes and edges) satisfying a certain property from a given graph (representing a database). We propose a new form of queries, called graph queries, whose answers are (marked) subgraphs having a particular structure. A graph query is based on a graph grammar, that is used to define the structural properties of the subgraph that is to be extracted [8–11,14]. A graph grammar is a graph rewriting system consisting of a set of rewriting rules (or productions). Just as a production of a standard grammar defines how to substitute a non-terminal symbol (or a group of symbols) with a string, a production of a graph grammar defines how to replace a node (or an edge) in a graph with a sub-graph. A graph grammar defines a class of graphs which have common structural properties (e.g. the class of complete graphs, the class of trees, etc.): such classes are named graph languages. Discussing whether a graph belongs to a certain graph language is equivalent to discussing whether the structure of such a graph satisfies the structural property of that language. For instance, we can state that a certain graph is a tree by simply defining a graph grammar generating trees and then demonstrating that the graph belongs to the language generated by that grammar.

1.1. Related work

Several languages have been proposed in the literature for querying and re-structuring graph-like data. Some of these languages adopt a graphical formalism to define queries, such as GraphLog [7], GOOD [20] and G-Log [24]. GraphLog provides a visual formalism whose semantics is defined in terms of Datalog programs. Basically, it allows a limited form of negation (corresponding to stratified linear Datalog programs), and its expressiveness is equivalent to FO(TC) (i.e. first order logic augmented with transitive closure [21]). In [7] the authors also define a monotonic version of the language, whose expressive power is not greater than that of positive linear Datalog.

In GOOD, data manipulation is expressed in terms of graph transformations. While basic operations (addition and removal of nodes and edges) have a declarative nature, the semantics of the whole language is procedural, since the extracting/re-structuring mechanism exploits calls to recursive methods. The full language is Turing-complete, while the fragment without methods cannot express recursive queries, and its expressiveness is equivalent to that of nested relational algebra.

G-Log queries are expressed by means of logic-based graph-transformation rules defining the structure of the output graph. G-Log is non-deterministic complete, i.e. it enables every non-deterministic database query to be expressed. In [24] an efficiently computable fragment of G-Log, which is as expressive as Datalog, has been defined.

Some languages adopting a non-visual formalism for manipulating graph-like data have also been defined. Among these, UnQL [5] is based on a rooted-graph data model. It uses bisimulation to test data equality and its semantics is defined in terms of structural recursion. The complexity of answering UnQL queries is PTIME, and the expressiveness of the language corresponds to FO(TC). XQuery [27] is the standard language for querying semi-structured data represented as XML documents [25]. It adopts a tree-based data model, and its extraction paradigm is based on an advanced form of navigational queries, where paths are denoted by means of XPath expressions [26]. XQuery is a procedural query language enabling function calls and iterations. The whole language is Turing-complete.

1.2. Contributions

The main contribution of this paper is the definition of a graph grammar based framework for querying graph data and the introduction of several different types of graph grammar suitable for expressing graph queries. The adoption of graph grammars as a query model results in a very intuitive visual query language with a declarative semantics. Previously defined visual languages do not fully meet both of these requirements. More specifically:

• We introduce a framework, based on node replacement (NR) graph grammars, for extracting sub-graphs from a given graph.

• We specialize NR graph grammars into Conditional Node Replacement (cNR) grammars, and show that the latter ones are more suitable and expressive for issuing queries on graphs. We also discuss their complexity and expressiveness.

• We study the effects of introducing an ordering criterion in the set of productions of a cNR grammar, thus defining Partially Ordered Conditional Node Replacement (POcNR) graph grammars and POcNR graph queries.

• We show that POcNR graph queries are less expressive than cNR graph queries, but can be evaluated more efficiently, and they suffice to express several classical problems on graphs. We also show that POcNR are more expressive than path queries.

• We show how POcNR graph grammars can be profitably used for querying XML documents.

Although in this paper we only consider the extraction of subgraphs, it is worth noting that graph grammars can also be used to re-structure the information extracted by means of graph queries. A graph grammar based system for extracting and restructuring XML documents is presented in [16].

Plan of the paper. The paper is organized as follows. In Section 2 we introduce basic definitions and notations of graphs, and illustrate the notion of Node Replacement (NR) Graph Grammar. In Section 3 we show how NR grammars can be used to query graph data and introduce Conditional Node Replacement (cNR) grammars. In Section 4 we present Partially Ordered Conditional Node Replacement (POcNR) graph grammars and study the complexity and expressiveness of POcNR graph queries. In Section 5 we sketch a practical application of our framework for querying XML documents. In Appendix A we provide a brief summary of the notations used throughout the paper.

2. Basic definitions

2.1. Graphs

Let Γ be an alphabet of labels. A graph over Γ is a tuple α = (N, E, λ) where N is a set of nodes, E ⊆ {(u, σ, v) | u, v ∈ N, σ ∈ Γ} is a set of labeled edges and λ : N → Γ is a node labelling function. The set of symbols Γ consists of two disjoint subsets Γ_t and Γ_n denoting, respectively, terminal and nonterminal symbols. A node x of a graph α is said to be terminal if λ(x) ∈ Γ_t; we say that α is terminal if all its nodes are terminal (i.e. α is defined over Γ_t). We assume that the set of terminal symbols Γ_t is partitioned into two distinct sets: the set of constant terminal symbols Γ_c and the set of variable terminal symbols Γ_v. In the following, constants are represented by strings starting with a digit or a lowercase letter (e.g. b1), variables are denoted by strings preceded by a dollar (e.g. $b1), and non-terminal symbols are denoted by strings starting with capital letters (e.g. X).

Given an alphabet of labels Γ = Γ_c ∪ Γ_v ∪ Γ_n, a graph over Γ is called query graph, a graph over Γ_c is called data graph, and a graph over Γ_t is called terminal query graph. The components of a graph α will also be denoted by N_α, E_α and λ_α, respectively. Analogously, the components of an edge e will be denoted by e[1], e[2] (or label(e)) and e[3], respectively. Graphs with unlabelled nodes and edges are modelled by taking Γ = Γ_c = {#}. Given two graphs α_1 and α_2, we say that α_1 is a subgraph of α_2 iff (i) N_{α_1} ⊆ N_{α_2}, (ii) E_{α_1} ⊆ E_{α_2} and (iii) ∀x ∈ N_{α_1}, λ_{α_1}(x) = λ_{α_2}(x).
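As an illustration only, the three subgraph conditions can be checked directly once a graph is stored as a node set, an edge set and a labelling dictionary. The `Graph` tuple and the function name `is_subgraph` are assumptions of this sketch and do not come from the paper.

```python
from typing import Dict, Hashable, NamedTuple, Set, Tuple

Node = Hashable
Edge = Tuple[Node, str, Node]  # (source, edge label, destination)


class Graph(NamedTuple):
    nodes: Set[Node]          # the node set N
    edges: Set[Edge]          # the labeled edge set E
    label: Dict[Node, str]    # the node labelling function


def is_subgraph(a1: Graph, a2: Graph) -> bool:
    """Check conditions (i)-(iii): node inclusion, edge inclusion, equal labels."""
    return (
        a1.nodes <= a2.nodes
        and a1.edges <= a2.edges
        and all(a1.label[x] == a2.label[x] for x in a1.nodes)
    )


# Hypothetical toy graphs used only to exercise the check.
big = Graph({1, 2, 3}, {(1, "a", 2), (2, "b", 3)}, {1: "x", 2: "y", 3: "z"})
small = Graph({1, 2}, {(1, "a", 2)}, {1: "x", 2: "y"})
relabelled = Graph({1, 2}, {(1, "a", 2)}, {1: "x", 2: "q"})
```

Here `small` satisfies all three conditions with respect to `big`, whereas `relabelled` fails condition (iii) because node 2 carries a different label.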

2.2. Path queries

A path over a graph α is a sequence p = (v_1, e_1, v_2, e_2, …, v_n) where v_i ∈ N, e_j ∈ E, e_i[1] = v_i and e_i[3] = v_{i+1}. The label of the path p, denoted as label(p), is an element of Γ* defined as e_1[2], …, e_{n−1}[2]. Given a regular expression r defined on Γ and a string w ∈ Γ*, we say that a path p on α spells the string w if w = label(p), and we say that p satisfies r if label(p) ∈ ℒ(r), i.e. the string spelled by p belongs to the language defined by r. Given a graph α over Γ and a regular expression r over Γ, we denote with ℒ(r, α) = {label(p) | p is a path on α ∧ label(p) ∈ ℒ(r)} the set of strings in ℒ(r) spelling paths in α. Moreover, given a node x_0 of α, we denote with ℒ(r, α, x_0) the set of strings in ℒ(r, α) associated with paths starting from x_0.

A path query Q is a pair ⟨Γ_0, r⟩ where Γ_0 ⊆ Γ is a set of labels identifying the set of source nodes, and r is a regular expression over a given alphabet. The application of a path query Q = ⟨Γ_0, r⟩ over a graph α = (N, E, λ) is defined as the set of node labels Γ_1 ⊆ Γ such that there exist a node x ∈ N with λ(x) ∈ Γ_0 and a node y ∈ N with λ(y) ∈ Γ_1 which are connected by means of a path satisfying r.
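The application of a path query can be sketched as a bounded enumeration of paths. The code below is a minimal illustration under several assumptions that are not part of the paper: the regular expression is given in Python `re` syntax over edge labels joined by ".", paths are cut off at `max_edges` edges so that cycles cannot cause non-termination, and the toy book graph with its node labels is hypothetical.

```python
import re
from typing import Dict, Hashable, List, Set, Tuple

Node = Hashable
Edge = Tuple[Node, str, Node]


def path_query(nodes, edges, label, sources, regex, max_edges=6):
    """Labels of nodes reached from a source-labelled node through a path whose
    edge labels, joined by '.', fully match `regex` (paths of bounded length)."""
    pattern = re.compile(regex)
    out: Dict[Node, List[Edge]] = {}
    for e in edges:
        out.setdefault(e[0], []).append(e)

    answer: Set[str] = set()
    # Depth-first enumeration of all paths with at most `max_edges` edges.
    stack = [(x, ()) for x in nodes if label[x] in sources]
    while stack:
        node, spelled = stack.pop()
        if pattern.fullmatch(".".join(spelled)):
            answer.add(label[node])
        if len(spelled) < max_edges:
            for (_, sigma, target) in out.get(node, []):
                stack.append((target, spelled + (sigma,)))
    return answer


# Hypothetical data graph: one book with one author and a title.
nodes = {0, 1, 2, 3, 4, 5}
label = {0: "bib", 1: "book", 2: "authors", 3: "person", 4: "Ullman", 5: "Principles"}
edges = {(0, "book", 1), (1, "written_by", 2), (2, "author", 3),
         (3, "surname", 4), (1, "title", 5)}

surnames = path_query(nodes, edges, label, {"book"}, r"written_by\.author\.surname")
children = path_query(nodes, edges, label, {"bib"}, r"book\.(title|written_by)")
```

A production-quality evaluator would rather explore the product of the graph with an automaton for r; the exhaustive enumeration is used here only because it mirrors the definition closely.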

2.3. NR graph grammars

Node replacement (NR) graph grammars [15] generate labeled directed graphs. A production of a NR graph grammar is of the form X → (β, C) where X is a nonterminal node label, β is a graph and C is the set of connection instructions. A rewriting step of a graph α according to such a production consists of removing a node u labeled with X from α, adding β to α, and adding edges between nodes of α and nodes of β as specified by the connection instructions in C. The pair (β, C) can be viewed as a new type of object and the rewriting step can be viewed as the substitution of the node u with (β, C) in the graph α. Intuitively, this kind of object is quite natural: it is a graph ready to be embedded in an environment. Its formal definition is as follows.

Let Γ be an alphabet of labels. A graph with embedding is a pair (β, C) where β is a graph over Γ, and C ⊆ Γ × Γ × Γ × N × {in, out} is the connection relation. Each element (γ, σ_1, σ_2, v, d) ∈ C is a connection instruction and is generally written as (γ, σ_1/σ_2, v, d). Intuitively, for a graph with embedding (β, C), the meaning of a connection instruction (γ, σ_1/σ_2, v, out) is as follows: if there was a σ_1-labeled arc from a node u which has been substituted with β to a γ-labeled node w, then the embedding mechanism defines a σ_2-labeled edge from v to w. Similarly, the meaning of a connection instruction (γ, σ_1/σ_2, v, in) is as follows: if there was a σ_1-labeled arc from a γ-labeled node w to a node u which has been substituted with β, then the embedding mechanism defines a σ_2-labeled edge from w to v. The feature which replaces edge labels is called dynamic edge labelling.

Let α be a graph over Γ, (β, C) be a graph with embedding over the same alphabet, and let v ∈ N_α. The substitution of v with (β, C) in α is denoted as α[v/(β, C)]. In the following, all the connection rules of the form (γ, σ/σ, v, d) (i.e. rules which do not re-label edges) are simply written as (γ, σ, v, d).
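The substitution step for directed graphs can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: the `Graph` tuple, the function `substitute` and the five-field encoding of connection instructions are assumptions, and the replacement graph is assumed to use node identifiers that do not occur in the host graph. The example production is hypothetical and is not the grammar of Fig. 3.

```python
from typing import Dict, Hashable, List, NamedTuple, Set, Tuple

Node = Hashable
Edge = Tuple[Node, str, Node]


class Graph(NamedTuple):
    nodes: Set[Node]
    edges: Set[Edge]
    label: Dict[Node, str]


# A connection instruction is stored as (gamma, sigma1, sigma2, v, d), d in {"in", "out"}.
Instruction = Tuple[str, str, str, Node, str]


def substitute(alpha: Graph, u: Node, beta: Graph, conn: List[Instruction]) -> Graph:
    """Replace node u of alpha with beta and reconnect according to conn."""
    assert not (beta.nodes & (alpha.nodes - {u})), "beta must use fresh node ids"
    nodes = (alpha.nodes - {u}) | beta.nodes
    label = {x: l for x, l in alpha.label.items() if x != u}
    label.update(beta.label)
    # Keep the edges of alpha that do not touch u, plus all edges of beta.
    edges = {e for e in alpha.edges if u not in (e[0], e[2])} | set(beta.edges)
    for (gamma, s1, s2, v, d) in conn:
        for (src, sigma, dst) in alpha.edges:
            if sigma != s1:
                continue
            if d == "out" and src == u and dst != u and alpha.label[dst] == gamma:
                edges.add((v, s2, dst))   # old arc u -> w becomes v -> w
            if d == "in" and dst == u and src != u and alpha.label[src] == gamma:
                edges.add((src, s2, v))   # old arc w -> u becomes w -> v
    return Graph(nodes, edges, label)


# Hypothetical production X -> (beta, C) that lengthens a chain by one node.
alpha = Graph({1, 2}, {(1, "next", 2)}, {1: "a", 2: "X"})
beta = Graph({3, 4}, {(3, "next", 4)}, {3: "a", 4: "X"})
conn = [("a", "next", "next", 3, "in")]
result = substitute(alpha, 2, beta, conn)
```

In the example, the arc entering the replaced node 2 from the a-labeled node 1 is redirected to node 3 of the replacement graph, so the chain grows by one terminal node and still ends with a nonterminal.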

NR graph grammars can also be used to generate labeled undirected graphs. In such a case, in a production X → (β, C) the graph β is undirected and the rules in C are of the form (γ, σ_1/σ_2, v) (i.e. the last element is omitted); the meaning of such a rule is the following: if there was a σ_1-labeled edge connecting a node u, which has been substituted with β, to a γ-labeled node w, then the embedding mechanism defines a σ_2-labeled edge joining v and w. The following definition formally introduces NR graph grammars.

Definition 1. A node replacement (NR) graph grammar is a tuple G = (Γ, Γ_t, P, S) where Γ is the alphabet of labels, Γ_t ⊆ Γ is the alphabet of terminal labels, P is the finite set of productions, and S ∈ Γ − Γ_t is the initial nonterminal symbol (axiom). A production is of the form X → (β, C) where X ∈ Γ − Γ_t and (β, C) is a graph with embedding.

The graph appearing on the right-hand side of a production can be empty and a production of the form X → (∅, ∅) will be simply denoted as X → ε. In order to make graph grammars syntax more intuitive, production rules are represented graphically. For instance, the productions ρ_1 and ρ_2 on the left-hand side of Fig. 3 can be represented graphically, as shown on the right-hand side. In the graphical representation of production rules, the replaced node and the replacing graph are enclosed in dashed squares.

Fig. 3. A grammar producing chains.

Fig. 4. Chain derivation.

Let G = (Γ, Γ_t, P, S) be a NR grammar, α_1 and α_2 be two graphs whose labels are in Γ, v be a node in N_{α_1} and ρ : X → (β, C) be a production of P. Then, we say that α_2 is directly derived from α_1 (and write α_1 ⇒_{v,ρ} α_2, or just α_1 ⇒ α_2), if λ_{α_1}(v) = X and α_2 = α_1[v/(β, C)]. Moreover, we say that α_n is derived from α_1 if there exists a finite sequence α_1 ⇒ α_2 ⇒ ⋯ ⇒ α_n.

In the following, given a production rule ρ : X → (β, C), we will denote the symbol X appearing on the left-hand side of ρ as lhs(ρ), and the graph β on the right-hand side as rhs(ρ).

Example 2. The graph grammar G defined by the productions ρ_1 and ρ_2 of Fig. 3 describes a language containing only chains. The number drawn inside each node is its identifier, whereas the symbol beside a node is its label. Fig. 4 illustrates a chain derivation by means of G productions.

3. A framework for querying data graphs

In this section we investigate the use of graph grammars for querying graph data. We first discuss how the various types of graphs, which have been introduced in Section 2.1, will be used in our framework. Then, we investigate the problem of using graph grammars for identifying subgraphs having a desired property. In order to identify subgraphs, we use a query graph which permits us to mark vertices and edges of the identified subgraph with different symbols. Finally, we show how query graphs can be generated using graph grammars and introduce a new class of graph grammars specialized in deriving query graphs.

Our framework will use the notions of data graph and query graph as follows:

• a data graph contains only terminal nodes labeled with constants and will be used to represent the input database;

• a query graph contains nodes labeled with constants, variables and nonterminal symbols; it will be used to represent the graphs generated at the intermediate steps of a graph grammar derivation;

• a terminal query graph is a query graph which contains only nodes labeled with terminal labels (constants and variables); it will be used to represent a “final graph” generated by a graph grammar.

The language generated by a graph grammar G, denoted by ℒ(G), is the set of terminal query graphs which can be derived from G.

Fig. 5. A graph grammar defining trees.

Fig. 6. Graph derivation.

Example 3. The graph grammar G consisting of the productions of Fig. 5 generates query graphs. The language ℒ(G) consists of trees; an example of derivation of a tree in ℒ(G) is shown in Fig. 6.

3.1. Extracting sub-graphs

Now, we show how a graph grammar G can be used to query a data graph D, that is to identify sub-structures of D which satisfy a property defined by G. To this end, we define a function which maps a query graph on a given data graph, and define a new entity (called mapping pair) consisting of a query graph and a mapping function which associates the query graph to a subgraph of the data graph.

Definition 2. Let α = (N_α, E_α, λ_α) be a terminal query graph, and D = (N_D, E_D, λ_D) a data graph. A mapping φ from α to D is a total function mapping, respectively, nodes in N_α to nodes in N_D and edges in E_α to edges in E_D such that:

• for each node n ∈ N_α either λ_α(n) = λ_D(φ(n)) or λ_α(n) is a variable label,

• for each arc (u, σ, v) ∈ E_α there is an arc (φ(u), σ′, φ(v)) ∈ E_D such that either σ = σ′ or σ is a variable, and

• there are no two distinct nodes u and v in N_α such that λ_α(u) = λ_α(v) and φ(u) = φ(v) (i.e. two distinct nodes of α with the same label cannot be associated to the same node of D).
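The three conditions of Definition 2 can be checked mechanically for a candidate assignment of query nodes to data nodes. The sketch below is illustrative only: the function names are assumptions, the mapping is represented on nodes alone (the image of each query edge is implied by the existence check in the second condition), and the toy graphs are hypothetical.

```python
def is_variable(symbol):
    # Convention of Section 2.1: variable labels start with a dollar sign.
    return symbol.startswith("$")


def is_mapping(q_nodes, q_edges, q_label, d_nodes, d_edges, d_label, phi):
    """Check the conditions of Definition 2 for a node assignment phi (a dict)."""
    q_nodes, d_nodes = set(q_nodes), set(d_nodes)
    # phi must be total on the query nodes and take values among the data nodes.
    if set(phi) != q_nodes or not set(phi.values()) <= d_nodes:
        return False
    # Condition 1: node labels agree unless the query label is a variable.
    for n in q_nodes:
        if not is_variable(q_label[n]) and q_label[n] != d_label[phi[n]]:
            return False
    # Condition 2: every query arc has a compatible arc between the images.
    for (u, sigma, v) in q_edges:
        if not any(s == phi[u] and t == phi[v]
                   and (is_variable(sigma) or sigma == sigma2)
                   for (s, sigma2, t) in d_edges):
            return False
    # Condition 3: query nodes with the same label have distinct images.
    images = [(q_label[n], phi[n]) for n in q_nodes]
    return len(images) == len(set(images))


# Hypothetical data graph (not a tree) and a tree-shaped query graph.
d_nodes = {1, 2, 3}
d_label = {1: "a", 2: "b", 3: "c"}
d_edges = {(1, "e", 2), (1, "e", 3), (2, "e", 3)}

q_nodes = {"r", "x", "y", "z"}
q_label = {"r": "a", "x": "$n", "y": "$m", "z": "$k"}
q_edges = {("r", "$l", "x"), ("r", "e", "y"), ("x", "$l", "z")}
phi = {"r": 1, "x": 2, "y": 3, "z": 3}
```

In the example, the query nodes y and z are both sent to data node 3, which is allowed because their variable labels differ; giving them the same label would violate the third condition.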

We point out that a query graph does not need to be identical to the subgraph it is mapped on. For instance, the query graph on the right-hand side of Fig. 7 is a tree, whereas the graph which it is mapped onto is not a tree.

That is, the mapping between a query graph and a data graph is not an isomorphism, but it could be forced to be injective (by labelling all the nodes in the query graph with the same variable symbol). Indeed, a node of the data graph can be associated to many distinct nodes of the query graph, if these nodes are labeled with different variable symbols. This feature is useful to easily express many graph problems such as node reachability, spanning tree, Hamiltonian path and others. For instance, to find all the nodes reachable from a given node u in a data graph D, it is easier to construct a tree T representing the paths from u to any node in D, rather than generating exactly the subgraph of D containing all the nodes reachable from u. Clearly T is not a subgraph of D, but can be mapped to a subgraph of D containing all the paths starting from u (see, for instance, the mapping shown in Fig. 22).

Fig. 7. Mapping example.

The following theorem characterizes the complexity of finding a mapping between a query graph and a data graph.

Theorem 1. Let D be a data graph and α a terminal query graph. The problem of deciding whether there exists a mapping φ from α to D is NP-complete.

Proof. (Membership) Let f be a total function from α to D which associates each node and each edge of α, respectively, to a node and an edge of D. Verifying whether f is a mapping can be done in polynomial time.

(Completeness) We show that there exists a reduction from HAMILTONIAN CYCLE to the problem of finding a mapping between a query graph and a data graph. Let D be a connected graph with n nodes. We can build a connected query graph α (see Fig. 8) with n nodes v_1, …, v_n labeled with the same symbol $c and forming a cycle, where every edge e′_1, …, e′_n has the same (variable) label $e.

Now we show that if there exists a Hamiltonian cycle in D, then there exists a mapping between the query graph α and D. Let u_1, e_1, u_2, e_2, …, u_n, e_n, u_1 be a Hamiltonian cycle in D, where u_1, …, u_n are the nodes of D and e_1, …, e_n are edges in D (where, for each i ∈ [1, …, n − 1], e_i connects the node u_i to the node u_{i+1}, and e_n connects u_n to u_1). Consider the function φ defined on N_α ∪ E_α in the following way:

(1) for each i ∈ [1, …, n], φ(v_i) = u_i;

(2) for each i ∈ [1, …, n], φ(e′_i) = e_i.

The function φ is a mapping since: (i) it is total; (ii) all the nodes and the edges of α are labeled by variables; (iii) there are no two nodes v_i and v_j (with i ≠ j) such that φ(v_i) = φ(v_j).

Now we show that if there exists a mapping between α and D then there exists a Hamiltonian cycle in D. Suppose φ is a mapping from α to D. From the definition of mapping we have that φ is total on α and, therefore, it associates all the n nodes of α to nodes of D. In particular, as all the nodes in α have the same label, they are mapped on distinct nodes of D: hence, all the nodes of D are associated by φ with nodes of α labeled with $c. Moreover, from the definition of mapping, we have that an edge of α connecting v_i to v_j can only be mapped on an edge of D connecting φ(v_i) to φ(v_j). Therefore, the set of edges of D which are associated to edges of α by φ represents a Hamiltonian cycle. □
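The reduction used in the proof can be made concrete with a small brute-force sketch. It builds the cycle-shaped query graph whose nodes all carry the label $c and whose edges all carry the label $e, then searches for a mapping by trying every injective assignment, which takes exponential time in the worst case. The function names are assumptions of this sketch, the data graph is treated as directed, and the two test graphs are hypothetical.

```python
from itertools import permutations


def cycle_query(n):
    """Query graph of the reduction: n nodes labeled $c on a directed cycle of $e arcs."""
    nodes = list(range(n))
    edges = [(i, "$e", (i + 1) % n) for i in range(n)]
    label = {i: "$c" for i in nodes}
    return nodes, edges, label


def find_mapping(q_nodes, q_edges, d_nodes, d_edges):
    """Brute-force search for a mapping when all query nodes share one variable
    label, so the assignment must be injective and edge labels are unconstrained."""
    arcs = {(s, t) for (s, _, t) in d_edges}
    for image in permutations(d_nodes, len(q_nodes)):
        phi = dict(zip(q_nodes, image))
        if all((phi[u], phi[v]) in arcs for (u, _, v) in q_edges):
            return phi
    return None


def has_hamiltonian_cycle(d_nodes, d_edges):
    """Decide HAMILTONIAN CYCLE through the existence of a mapping."""
    q_nodes, q_edges, _ = cycle_query(len(d_nodes))
    return find_mapping(q_nodes, q_edges, d_nodes, d_edges) is not None


# Hypothetical data graphs: a directed 4-cycle, and a graph where node 1 has no incoming arc.
square = [(1, "x", 2), (2, "x", 3), (3, "x", 4), (4, "x", 1)]
no_cycle = [(1, "x", 2), (2, "x", 3), (3, "x", 4), (4, "x", 2)]
```

Any mapping returned for a data graph with n nodes visits each node exactly once and follows existing arcs around the query cycle, which is exactly a Hamiltonian cycle of the data graph.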

Fig. 8. A query graph representing a Hamiltonian path.

It is worth noting that finding a mapping is not simpler than graph isomorphism,¹ and is not harder than subgraph isomorphism, which is NP-complete too. As discussed above, we use this kind of mapping to achieve more flexibility in expressing queries, not to speed up the query evaluation process. Indeed, if there exists a mapping between a terminal query graph generated by a graph grammar and a given data graph, such a mapping identifies nodes and edges of the data graph which have the structural properties described by the grammar.

In order to process queries more efficiently, instead of generating a terminal query graph α and then searching for a mapping from α to a subgraph of the data graph D, we introduce a new type of object which gives us the possibility of searching for possible mappings during the derivation process and not afterwards.

Given a query graph α, we shall denote with Terminal(α) the sub-graph derived from α by deleting nodes marked with non-terminal symbols and arcs connected to deleted nodes.

Definition 3. Let D be a data graph. A mapping pair on D is a pair (α, φ) where α is a query graph and φ is a mapping from Terminal(α) to D. A mapping pair (α, φ) on D is said to be terminal iff α = Terminal(α).

Given a mapping pair M = (α, φ) on the data graph D = (N_D, E_D, λ_D), where α = (N_α, E_α, λ_α), a node u ∈ N_D and a node label γ, we say that M marks the node u with the symbol γ if there exists a node v ∈ N_α such that λ_α(v) = γ and φ(v) = u (i.e. M marks the node u ∈ N_D with the symbol γ if there exists a node in α labeled with γ which is mapped by M to u).
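The operator Terminal and the notion of marking can be sketched in a few lines. The helper names `is_nonterminal`, `terminal` and `marks` are assumptions of this sketch; the test for nonterminal labels relies on the convention of Section 2.1 that nonterminal symbols start with a capital letter, and the query graph and data node names are hypothetical.

```python
def is_nonterminal(symbol):
    # Convention of Section 2.1: nonterminal symbols start with a capital letter.
    return symbol[:1].isupper()


def terminal(nodes, edges, label):
    """Drop the nonterminal nodes and every arc connected to a dropped node."""
    kept = {n for n in nodes if not is_nonterminal(label[n])}
    kept_edges = {(s, sigma, t) for (s, sigma, t) in edges if s in kept and t in kept}
    return kept, kept_edges, {n: label[n] for n in kept}


def marks(label, phi, u, symbol):
    """True iff some query node labeled `symbol` is mapped by phi to data node u."""
    return any(label[v] == symbol and phi.get(v) == u for v in label)


# Hypothetical intermediate query graph: two terminal nodes followed by a nonterminal.
nodes = {1, 2, 3}
label = {1: "$n", 2: "$n", 3: "X"}
edges = {(1, "$e", 2), (2, "$e", 3)}
t_nodes, t_edges, t_label = terminal(nodes, edges, label)

# A mapping defined on the terminal part only, towards data nodes "d1" and "d2".
phi = {1: "d1", 2: "d2"}
```

A mapping pair built on this query graph is not terminal, since `terminal` removes node 3; the pair marks data nodes "d1" and "d2" with the symbol $n.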

In the following, we will show how, given a graph grammar G and a data graph D, a terminal mapping pair on D can be produced by extending the derivation of query graphs to the derivation of mapping pairs.

Let D be a data graph and G a graph grammar. We say that a mapping pair (α_1, φ_1) is directly derived from (α_0, φ_0) through a production ρ of G (and write (α_0, φ_0) ⇒_ρ (α_1, φ_1)) if and only if α_0 ⇒_ρ α_1 and φ_1 extends φ_0.² Moreover, we say that a mapping pair (α_n, φ_n) is derived from a mapping pair (α_0, φ_0) over a data graph D if (α_0, φ_0) ⇒_{ρ_1} (α_1, φ_1) ⇒_{ρ_2} ⋯ ⇒_{ρ_n} (α_n, φ_n). Given a graph grammar G = (Γ, Γ_t, P, S) and a data graph D, Φ(G, D) denotes the set of mapping pairs over D derived from (α_0, ∅), where (i) α_0 is a query graph containing a unique (non-terminal) node whose label is S (i.e. the axiom of G), and (ii) ∅ denotes an empty mapping. Observe that, as stated by the following proposition, each terminal mapping pair (α, φ) on a given data graph D corresponds to (at least) one derivation of (α, φ) on D.

Proposition 1. Let G be a graph grammar, D a data graph and a a terminal query graph which can be generated by G. A mapping u from a to D exists iff (a,u) ∈ U(G,D).

Proof. The if part is straightforward since, by definition of U(G,D), (a,u) ∈ U(G,D) implies that u is a mapping from a to D.

We show the only if part reasoning by contradiction. Assume that a mapping u from a to D exists but (a,u) ∉ U(G,D). From the hypothesis we know that a can be generated by G through a derivation a0 ⇒q1 a1 ⇒q2 ⋯ ⇒qn a. Let u_ai be the restriction of u to the nodes and the edges of ai. It is straightforward that u_ai is a mapping of ai on D since both relabelling and deleting arcs between two terminal nodes are forbidden. Moreover, since Terminal(ai) is a subgraph of Terminal(ai+1), it holds that u_ai is a restriction of u_ai+1. Hence, the derivation (a0,∅) ⇒q1 (a1,u_a1) ⇒q2 ⋯ ⇒qn (a,u) is a valid derivation of (a,u) using G. This contradicts the hypothesis that (a,u) ∉ U(G,D). □

The above proposition will be useful to define an algorithm computing a terminal mapping pair generated by a graph grammar applied to a given data graph. Let us now show an example of derivation of mapping pairs to better understand how the derivation process works.

Example 4. Consider the graph grammar of Example 3, the derivation shown in Fig. 5, and the data graph D shown on the left-hand side of Fig. 9.

1 The GRAPH ISOMORPHISM problem is one of the few problems in the class NP that is neither known to be NP-complete nor known to be solvable in polynomial time. On the other hand, the related problem SUB-GRAPH ISOMORPHISM is NP-complete.

2 (u0 ⊆ u1).


Fig. 9. Extraction of a tree.


The query graphs produced respectively at the third and at the last step of the derivation can be mapped on D, as shown in the center and on the right-hand side of Fig. 9. Since all terminal nodes of the generated query graphs are labeled with the same symbol (i.e. $n), every tree generated by such a grammar can only be mapped to a tree (two nodes with the same label cannot be mapped onto the same node). Although not represented in the figure, the arcs in the query graph are mapped to arcs in the data graph.
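The two constraints recalled in this example (equally labeled query nodes go to distinct data nodes, and query arcs go to data arcs) can be checked mechanically. The formal definition of a mapping is given earlier in the paper and is not repeated here, so the sketch below is an approximation under stated assumptions: labels starting with "$" are treated as variables that match any data label, constant labels must coincide with the data label, and the function name `is_mapping` is hypothetical.

```python
def is_mapping(q_nodes, q_edges, d_nodes, d_edges, u):
    """Approximate validity check for a mapping u from a query graph to a data graph."""
    # Every query node must be mapped to an existing data node.
    if any(u.get(v) not in d_nodes for v in q_nodes):
        return False
    seen = set()
    for v, lab in q_nodes.items():
        # Assumption: constant labels must agree with the data label.
        if not lab.startswith("$") and d_nodes[u[v]] != lab:
            return False
        # Two query nodes with the same label need distinct images.
        if (lab, u[v]) in seen:
            return False
        seen.add((lab, u[v]))
    # Every query arc must correspond to a data arc between the images.
    for s, lab, t in q_edges:
        found = {dl for (ds, dl, dt) in d_edges if ds == u[s] and dt == u[t]}
        if not found or (not lab.startswith("$") and lab not in found):
            return False
    return True


# Hypothetical data graph (a small tree) and a two-level query tree.
d_nodes = {"x": "r", "y": "c1", "z": "c2"}
d_edges = {("x", "e", "y"), ("x", "e", "z")}
q_nodes = {1: "$n", 2: "$n", 3: "$n"}
q_edges = {(1, "$e", 2), (1, "$e", 3)}
good = is_mapping(q_nodes, q_edges, d_nodes, d_edges, {1: "x", 2: "y", 3: "z"})
bad = is_mapping(q_nodes, q_edges, d_nodes, d_edges, {1: "x", 2: "y", 3: "y"})
```

The second call fails because two nodes labeled $n would share the image y.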

3.2. cNR graph grammars

As shown in the previous subsection, the application of a NR graph grammar to a data graph D returns a set of subgraphs of D. Each of these subgraphs corresponds to a mapping from a terminal query graph generated by the grammar to a set of nodes and edges in D. However, this framework permits us only to express some ‘positive’ conditions about the structure of the extracted subgraphs (i.e. the existence of edges and nodes in D corresponding to edges and nodes in a).

In many cases, it would be useful to specify more expressive conditions about the structure and the content (i.e. value of labels) of the extracted subgraph, like the absence of an arc connecting two nodes or the presence of a node labeled with a certain value. To this end, in this section we introduce a new form of graph grammars, called Conditional Node Replacement Graph Grammars, whose productions are associated to Application Conditions. Basically, an application condition associated to a production q defines a property that the mapping pair, obtained after applying q, must satisfy. That is, q is successfully applied in a derivation process only if it derives a mapping pair which satisfies the specified application condition. Before formally defining application conditions, we introduce some notations.

A mapping pair (a,u) over a data graph D can be represented by means of the database MD consisting of the following relations:

• the binary relations NodeD = {(x, l) | x ∈ ND ∧ kD(x) = l} and Nodea = {(x, l) | x ∈ Na ∧ ka(x) = l} representing, respectively, the nodes of D and a with their labels;
• the ternary relations EdgeD and Edgea representing, respectively, the arcs of D and a;
• the mapping relation u.
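The relational view listed above can be materialised directly as sets of tuples. This is a sketch only, assuming graphs are given as label dictionaries plus sets of (source, label, target) arcs; the function `build_md` and the sample graphs are hypothetical.

```python
def build_md(d_nodes, d_edges, q_nodes, q_edges, u):
    """Build the relations of the database MD for a mapping pair (a, u) on D."""
    return {
        "NodeD": set(d_nodes.items()),   # (node id, label) pairs of D
        "Nodea": set(q_nodes.items()),   # (node id, label) pairs of a
        "EdgeD": set(d_edges),           # (source, label, target) arcs of D
        "Edgea": set(q_edges),           # (source, label, target) arcs of a
        "u": set(u.items()),             # (query node, data node) pairs
    }


md = build_md(
    d_nodes={"d1": "start", "d2": "end"},
    d_edges={("d1", "e", "d2")},
    q_nodes={1: "$n", 2: "$n"},
    q_edges={(1, "$e", 2)},
    u={1: "d1", 2: "d2"},
)
```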

When a production q : X → (b,C) is applied to a query graph a, a graph b′ isomorphic to b is inserted into a producing the new query graph a′. The graph b′ is obtained from b by renaming node identifiers with new ones not appearing in a. The association between node ids in b and nodes in b′, corresponding to the latest application of a production rule, is stored into the binary relation Embed. Thus, a tuple (u,v) in the relation Embed means that the node u, appearing on the right-hand side of the latest applied production rule, generates the node v in the derived query graph. The current state of the derivation process is stored in the database MD ∪ Embed.

An application condition v is a quantified FO formula without free variables defined over MD ∪ Embed. Each production q of a graph grammar can be associated to an application condition vq that expresses a constraint on its applicability: given a mapping pair (a,u) on a data graph D, a mapping pair (a′,u′) generated by the derivation step s : (a,u) ⇒q (a′,u′) is valid w.r.t. vq iff vq is satisfied on the updated database MD ∪ Embed. The notion of application condition can be viewed as a natural specialization of post-application conditions defined for category-based graph transformations [13].


Fig. 10. Extracting a path between two specified nodes.


Example 5. Consider the grammar of Fig. 10. The application condition ∃(X,Y)(Embed(1,X) ∧ u(X,Y) ∧ NodeD(Y, ‘start’)) associated to the production rule q1 states that q1 is applied successfully only if the node X generated in the query graph (corresponding to the node 1 in rhs(q1)) is mapped onto a node Y labeled with ‘start’. Analogously, the condition ∃(X,Y)(Embed(1,X) ∧ u(X,Y) ∧ NodeD(Y, ‘end’)) associated to the production rule q3 states that q3 is applied successfully only if the node X generated in the query graph (corresponding to the node 1 in rhs(q3)) is mapped onto a node Y labeled with ‘end’. The grammar defined by the productions in Fig. 10 finds all the paths connecting two nodes labeled, respectively, with ‘start’ and ‘end’.

Writing an application condition like the one in the above example can be tedious, as it often requires citing the relations Embed and u to identify the nodes of the data graph which have been mapped by the application of the rule. Therefore, a complex constraint will probably yield a verbose application condition. Clearly, there are many other ways of specifying application conditions (e.g. graphically). In this paper we will not investigate this issue. We will just use in our examples a simplified syntax, obtained by introducing the macro (predicate) Node(X,Id,L) defined as follows:

Node(X,Id,L) ≡ Embed(X,Y) ∧ u(Y,Id) ∧ NodeD(Id,L).

That is, the ternary predicate Node(X,Id,L) is used to avoid explicit joins involving the relations Embed, u and NodeD.
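To make the join behind the macro concrete, the sketch below evaluates Node(X,Id,L) over relations stored as sets of tuples and then checks a condition in the style of the one attached to q1 in Example 5. The encoding, the function name `node_macro` and the sample tuples are assumptions made for illustration.

```python
def node_macro(embed, u, node_d):
    """Node(X, Id, L): join Embed(X, Y), u(Y, Id) and NodeD(Id, L), projecting Y away."""
    return {
        (x, ident, lab)
        for (x, y) in embed
        for (y2, ident) in u if y2 == y
        for (ident2, lab) in node_d if ident2 == ident
    }


# Hypothetical state after one derivation step: rhs node 1 became query node 10,
# which is mapped to the data node d1 labeled 'start'.
embed = {(1, 10)}
u = {(10, "d1")}
node_d = {("d1", "start"), ("d2", "end")}

node = node_macro(embed, u, node_d)
# Condition of q1, rewritten with the macro: some tuple Node(1, Id, 'start') exists.
q1_condition_holds = any(x == 1 and lab == "start" for (x, _, lab) in node)
```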

Example 6. The grammar defined by the productions in Fig. 11, refining the grammar of Example 5, uses the macro introduced above to express application conditions. The query finds all the paths connecting two nodes labeled, respectively, with ‘start’ and ‘end’ without crossing any node connected to a node with label ‘forb’.

Observe that verifying whether a mapping pair generated by the derivation step s : (a,u) ⇒q (a′,u′) is valid w.r.t. an application condition vq can be done in polynomial time w.r.t. the size of the generated mapping, and thus it is polynomial w.r.t. the size of the data graph too.

We point out that the application conditions defined above are slightly different from the well known application conditions used in the context of graph rewriting systems. The main difference between the two approaches is that graph rewriting systems work on a unique graph (corresponding to ‘our’ query graph), whereas we also use the data graph and the mapping.

Fig. 11. Extracting a path without crossing forbidden nodes.


Now we can formally define a class of graph grammars whose productions are associated with an application condition. In order to ensure the finiteness of every possible derivation, we impose the constraint that the right-hand side of each production must contain at least one terminal node or must be the empty graph.

Definition 4. A conditional node replacement (cNR) graph grammar is a tuple CG = (C,Ct,P,S) where C is the alphabet of labels, Ct ⊆ C is the alphabet of terminal labels, P is a finite set of conditional productions, and S ∈ C − Ct is the initial nonterminal symbol (axiom). A conditional production is of the form X → (b,C,v) where X ∈ C − Ct, (b,C) is a graph with embedding where b either contains at least one terminal node or is empty, and v is an application condition.

In the following, given a cNR graph grammar CG, we denote with St(CG) the corresponding ‘standard’ NR graph grammar, obtained by removing all the application conditions. Moreover we denote with X → ε the production X → (∅,∅,∅).

The set U(CG,D) of all the mapping pairs generated by applying a cNR graph grammar CG to a data graph D is defined as for ‘standard’ NR graph grammars. We point out that in the general case U(CG,D) ⊆ U(St(CG),D), and thus Proposition 1 is no longer valid. This means that the validity of a mapping pair (a,u) generated by a cNR grammar strictly depends on the derivation which leads to (a,u).

Observe that, given a cNR graph grammar CG and a data graph D, since for each production q : X → (b,C,v) in CG, with b not empty, b contains at least one terminal node, then for each terminal mapping pair (a,u) in U(CG,D) there is a finite derivation for (a,u). In more detail, the length of each derivation of a mapping pair in U(CG,D) is polynomially bounded by the number of nodes in D as stated by the following lemma.

Lemma 1. Let CG = (C,Ct,P,S) be a cNR graph grammar, D = (ND,ED,kD) a data graph, l be the number of variables in C and k the max number of non-terminal nodes appearing in a production in P. The length of each derivation d of a terminal mapping pair in U(CG,D) is bounded by (k + 1) · (|ND|l + |ND|).

Proof. Since a mapping must associate two nodes of a query graph labeled with the same symbol to distinct nodes in the data graph D, the maximum number of nodes in a query graph is |ND|l + |ND|. Furthermore, since in each production of the form q : X → (b,C,v) in CG, with b not empty, b contains at least one terminal node, it is not possible to apply non-empty productions more than |ND|l + |ND| times in a derivation. Moreover the maximum number of non-terminal nodes produced by these applications is bounded by k · (|ND|l + |ND|) and these non-terminal nodes can only be expanded using a production of the form q : X → ε. Thus, the maximum number of derivation steps is bounded by (k + 1) · (|ND|l + |ND|). □

We now characterize the complexity of finding a terminal mapping pair generated by the application of a cNR graph grammar to a data graph.

Theorem 2. Let D be a data graph and CG = (C,Ct,P,S) a cNR graph grammar. The problem of deciding whether U(CG,D) is not empty is NP-complete.

Proof. (Membership) A polynomial size certificate showing that U(CG,D) is not empty is a derivation of a terminal mapping pair M in U(CG,D). It follows from Lemma 1 that the length of such a derivation is O(|ND|l), where l is the number of distinct variable symbols appearing in CG. The validity of the derivation can be checked by simply verifying the correctness of each derivation step. To check if a derivation step is correct, it suffices to verify the validity of the mapping and the validity of the application condition. Since both these tasks can be performed in polynomial time w.r.t. the size of the data graph, the certificate is verifiable in polynomial time.

(Completeness) We prove the completeness by reducing the NP-complete problem of finding an even simple path between two nodes of a labeled directed graph to the problem of deciding if the set of mapping pairs generated by the cNR graph grammar CG shown in Fig. 12 is non-empty.

Let (T, s, e) be an instance of even simple path, where T = (NT,ET) is a labeled directed graph and s, e ∈ NT are respectively the source and destination node. Let D = (ND,ED) be a data graph such that ND = NT, ED = ET, k(s) = ‘start’ and k(e) = ‘end’. It is straightforward that a mapping pair M ∈ U(CG,D) exists iff there exists an even simple path connecting s and e in T. Moreover, the size of the grammar CG does not depend on the size of T, and constructing the data graph D can be done by simply defining the function k for all the nodes in ND. Obviously, this reduction is a logspace reduction, thus the problem of testing if U(CG,D) is not empty is complete for NP. □

Fig. 12. Extracting an even simple path.

Thus, the application of a cNR graph grammar CG to a data graph D can be regarded as a query on D, and every mapping pair in U(CG,D) is an answer of such a query.

3.3. Graph queries

cNR graph grammars are a powerful tool for querying graphs, since they combine a user-friendly graphic formalism with the flexibility of first-order logic. However, users might prefer to express queries following a generate and check paradigm, rather than checking the validity of the generated answer at each step of the derivation. Therefore, we introduce the possibility of also specifying a terminal condition which will be checked after a terminal query graph has been generated.

We point out that every query which can be expressed using a terminal condition can be translated into an equivalent query which uses only application conditions. We now formally define a graph query.

Definition 5. A cNR graph query Q is a pair (CG,P) where CG is a cNR graph grammar and P is an FO formula without free variables defined on MD.

We combine the cNR graph grammar CG with an FO formula P to express in a simple way a property that the terminal mapping pairs generated using CG must satisfy. When such a property is defined, the generation process ends successfully only if a terminal graph satisfying the terminal condition has been produced. Otherwise, the generation process must be continued until an “acceptable” terminal graph has been produced or no other terminal graph can be generated. Clearly a similar behavior can be also achieved using only application conditions.

The evaluation of a formula P on a data graph D and a mapping pair M on D, denoted by P(MD), gives a boolean value (true if the formula is satisfied and false otherwise). It is worth noting that the formula P is applied at the end of the derivation process and, as it only involves the query graph and the data graph, the relation Embed is not used.

Definition 6. The set of answers to a query Q = (CG,P) over a data graph D, denoted by Q(D), is the set of terminal mapping pairs over D derived from CG and satisfying P, i.e. Q(D) = {M | M ∈ U(CG,D) ∧ P(MD)}.

Observe that any mapping pair in Q(D) is considered a valid answer.

Corollary 1. Let Q = (CG,P) be a cNR graph query and D a data graph. The problem of checking whether Q(D) is not empty is NP-complete.

Proof. It follows directly from Theorem 2. □


Fig. 13. A grammar extracting a clique.

Fig. 14. A terminal mapping pair extracting a clique.


Example 7. Consider the cNR grammar CG defined by the three productions shown in Fig. 13. Let P be the terminal condition defined as follows: P = ∄(w, l)(NodeD(w, l) ∧ ∄y u(y,w)) = ∀(w, l)(NodeD(w, l) ⇒ ∃y u(y,w)).

The terminal condition P states that a terminal mapping pair generated by applying CG to a data graph D is accepted only if all the nodes of D are marked by at least one variable symbol.

Let D be the data graph on the left-hand side of Fig. 14. Fig. 14 shows a terminal mapping pair obtained applying CG on D (the mapping relation between edges is not shown in the figure, since it is trivially implied by the mapping relation between nodes). Edges labeled with $e in the terminal query graph are mapped on edges identifying a clique of size |ND|/2.
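The generate-and-check reading of Definition 6 can be sketched as a filter over candidate terminal mapping pairs, using the terminal condition of this example as the check. The sketch assumes that the candidates have already been produced by some derivation procedure; the names `all_nodes_marked` and `answers`, and the sample candidates, are hypothetical.

```python
def all_nodes_marked(node_d, u):
    """Terminal condition of Example 7: every node of D is the image of some query node."""
    images = {w for (_, w) in u}
    return all(w in images for (w, _) in node_d)


def answers(candidates, node_d, condition):
    """Keep the terminal mapping pairs (a, u) whose mapping satisfies the condition."""
    return [(a, u) for (a, u) in candidates if condition(node_d, u)]


node_d = {("d1", "x"), ("d2", "y")}
candidates = [
    ("a_full", {(1, "d1"), (2, "d2")}),  # covers both data nodes
    ("a_partial", {(1, "d1")}),          # leaves d2 unmarked
]
accepted = answers(candidates, node_d, all_nodes_marked)
```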

3.4. Expressiveness

We now discuss the expressiveness of graph queries, and we show that every property on a graph which is expressible in monadic existential second order logic (MSO∃) can be expressed by graph queries.

Theorem 3. Let D be a graph and w an MSO∃ formula on the relations ND and ED. Then, there exists a cNR graph query Q = (CG,P) which expresses w.

Proof. Let w be expressed in the prenex normal form ∃R1, …, ∃Rn φ, where R1, …, Rn are unary relation variables and φ is a first order formula defined on the relations ND, ED, R1, …, Rn. The relations R1, …, Rn are defined over the domain of node identifiers, node labels and edge labels.

For each relation Ri appearing in w, we define: (i) the variables $ri, $li, $si, $ti, (ii) the non-terminal label Ti, and (iii) four production rules qi1, qi2, qi3, qi4 as in Fig. 15.


Fig. 15. The productions qi associated to the relation Ri.


The bindings of the variables $ri and $li will be used to represent the relations R1, …, Rn, as explained in the following. The variables $s1, …, $sn and $t1, …, $tn will mark pairs of nodes, and are used to identify the edges connecting such nodes.

Let P be the set of every production rule constructed as described above and containing the production rule qSε : S → ε and the production rules qTiε : Ti → ε, for each Ti. The rule qSε and the rules qTiε are free from mapping conditions.

Let CG be the cNR graph grammar (C,Ct,P,S), where Ct = {$r1, …, $rn, $l1, …, $ln, $s1, …, $sn, $t1, …, $tn} and C = Ct ∪ {S, T1, …, Tn}. Let P be the first order formula obtained from w by substituting each occurrence of Ri(X) with the disjunction of the formulas φ1_Ri, φ2_Ri and φ3_Ri shown below.

φ1_Ri = ∃Y (Nodea(Y, $ri) ∧ u(Y, X))

φ2_Ri = ∃(Y, Z)(Nodea(Y, $li) ∧ u(Y, Z) ∧ Nodea(Z, X))

φ3_Ri = ∃(Y, Z, Z′, Y′)(Edgea(Y, $li, Z) ∧ u(Y, Y′) ∧ u(Z, Z′) ∧ EdgeD(Y′, X, Z′))

The formula φ1_Ri(X) defines all the identifiers X of nodes of D which are marked by $ri. The formula φ2_Ri(X) defines all the labels X of nodes of D which are marked by $li. Analogously, φ3_Ri(X) defines all the labels X of edges of D which are marked by $li. Therefore, the disjunction of these three formulas defines a relation containing node ids, node labels and edge labels.

The process for answering the query (CG,P) on D ends when either a mapping pair satisfying P is generated by CG, or no other mapping can be generated. This is the same as trying all the possible assignments of node identifiers to the variables $r1, …, $rn, and all the possible assignments of $l1, …, $ln to nodes and edges, until P is satisfied. That is, it is the same as proving the existence of unary relations satisfying P. Thus, the cNR graph query Q = (CG,P) is equivalent to the formula w. □

Theorem 3 can be motivated by observing that each variable label used by a cNR graph grammar corresponds to a unary relation, so that the (unary) relations and the existential quantifiers of a monadic second order formula are describable by means of a cNR grammar. The terminal condition of a cNR graph query can be used for expressing the first order part of an MSO∃ formula.

We point out that Theorem 3 states that every MSO∃ sentence can be expressed by graph queries, and its proof proposes a “generic” translation of an MSO∃ formula into a cNR graph query. Observe that the translation used in the above proof leads to a cNR graph query which could be expensive to compute. However, it is possible to define the problems in the class directly, using a more efficient formulation.

It is worth noting that graph queries are strictly more expressive than MSO∃ logic, since problems such as EVEN SIMPLE PATH, which is not expressible in MSO∃, can be formulated using graph queries (see Theorem 2).

4. Partially ordered conditional graph grammars

Until now, we have shown how graph grammars can be used to express queries on data graphs. However, the use of cNR graph grammars to express queries on graphs may not be completely satisfactory, since, in many cases, finding a terminal mapping pair (i.e. an answer of the cNR query) is computationally expensive.


Often one would like to write queries with a limited form of non-determinism so that, without significant loss of expressive power, they can be computed more efficiently. The following example shows a cNR graph grammar where the non-deterministic derivation process is useless, as the condition associated with the production rule q3 states that the application of q2 should be tried before the application of q3.

Example 8. Consider the cNR grammar CG defined by the three productions in Fig. 16. The query CG(D, ∅) marks every node in D which is reachable from the node ‘start’ with the symbol $n.

The application condition associated with the production q3 of the cNR grammar means that this production can be applied only when the production q2 cannot be applied successfully.

In the above example a non-terminal node A is expanded into an empty graph only if the rule q2 cannot be applied, that is if the parent of node A is mapped into a node which is not connected to any node that has not been visited (i.e. which has not been marked by the grammar with the symbol $n). It would be easier to express this kind of constraint by imposing the order q2 < q3 over the set of productions, so that the production rule q3 can be applied only if the production rule q2 (preceding q3) cannot be applied successfully.

We now introduce a new type of graph grammar, called Partially Ordered Conditional Graph Grammar (POcNR grammar), which is specialized to express the kind of queries discussed above. POcNR grammars are graph grammars whose production rules are partially ordered according to the conditions introduced in the following definition.

Definition 7. A Partially Ordered Conditional Graph Grammar is a tuple PCG = (C,Ct,P,<P,S) such that (C,Ct,P,S) is a cNR Graph Grammar, <P is a partial order on P and the following conditions hold:

(1) for each symbol X ∈ Cn there is a production X → ε in P whose application condition is ‘true’,
(2) for each pair of productions qi : X → (b,C), with b not empty, and qj : X → ε, it holds that qi < qj.
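The two conditions can be checked mechanically on a description of the productions. The sketch below uses an assumed, simplified encoding: each production is reduced to its left-hand side, a flag saying whether its right-hand side is empty, and a flag saying whether its application condition is ‘true’; the order is given as a set of pairs assumed to be transitively closed. All names are hypothetical.

```python
def satisfies_definition_7(productions, order, nonterminals):
    """productions: dict name -> (lhs, rhs_is_empty, condition_is_true)
    order: set of (qi, qj) pairs meaning qi < qj."""
    # Condition (1): every non-terminal has an unconditional empty production.
    for x in nonterminals:
        if not any(lhs == x and empty and cond_true
                   for (lhs, empty, cond_true) in productions.values()):
            return False
    # Condition (2): non-empty productions precede the empty one with the same lhs.
    for qi, (lhs_i, empty_i, _) in productions.items():
        for qj, (lhs_j, empty_j, _) in productions.items():
            if lhs_i == lhs_j and not empty_i and empty_j and (qi, qj) not in order:
                return False
    return True


productions = {
    "q1": ("S", False, False),
    "qS": ("S", True, True),
    "q2": ("A", False, False),
    "qA": ("A", True, True),
}
ok = satisfies_definition_7(productions, {("q1", "qS"), ("q2", "qA")}, {"S", "A"})
missing_order = satisfies_definition_7(productions, {("q1", "qS")}, {"S", "A"})
```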

Before investigating the implications of imposing an ordering criterion on the set of productions, we briefly discuss Condition 1 in the above definition. It implies that for every non-terminal mapping pair M = (a,u) generated by a cNR grammar, there exists at least one production (i.e. the ε production) which can be applied successfully to each non-terminal node in a. In other words, even if no other production can be applied successfully on a non-terminal node, the ε production can be unconditionally applied. Therefore, given a non-terminal mapping pair M generated applying a cNR grammar to a data graph, there exists a terminal mapping pair M′ such that M ⇒* M′. From a computational point of view, Condition 1 makes backtracking unnecessary during the derivation process. This implies that computing a terminal mapping (randomly chosen) can be done efficiently (see Theorem 4), but the expressive power of the language is reduced.

The partial order on the productions defines a partial order on the derivations of mapping pairs. Given a data graph D, a POcNR grammar PCG, and two productions q and m of PCG such that q < m, we say that a derivation d1 of a pair (a1,u1) from a pair (a,u) precedes a derivation d2 of a pair (a2,u2) from (a,u) (written d1 ≺ d2), if (1) d1 = (a,u) ⇒q (ai,ui) ⇒* (a1,u1) and d2 = (a,u) ⇒m (aj,uj) ⇒* (a2,u2), or (2) there are three derivations d, d3 and d4 such that d1 = dd3 and d2 = dd4 and d3 ≺ d4.

Fig. 16. Extracting all the nodes reachable from a starting node.


Definition 8. Let PCG = (C,Ct,P,<P,S) be a POcNR graph grammar and D a data graph. Given two mapping pairs M1, M2 ∈ U(PCG,D), we say that M1 is preferable to M2 (written M1 <PCG M2, or simply M1 < M2) if for each derivation d2 = S ⇒* M2 there is a derivation d1 = S ⇒* M1 such that d1 ≺ d2. Moreover, a mapping pair M1 such that S ⇒* M1 is said to be preferred, with respect to PCG, if there is no mapping pair M2 such that S ⇒* M2 and M2 < M1.

The relation <PCG on the elements of U(PCG,D), introduced in Definition 8, is a partial order on U(PCG,D). Actually, this order permits us to drive the derivation process towards terminal mapping pairs that have some desirable properties. Observe that Condition 2 in Definition 7 leads to selecting “maximal” subgraphs. That is, during a derivation process we first try to substitute non-terminal symbols with non-empty graphs, thus expanding the current graph. Afterwards, if no other production can be applied successfully, we apply the ε production.

Definition 9. A POcNR graph query is a POcNR graph grammar. The set of answers to a POcNR graph query Q over a data graph D, denoted by Q(D), is the set of preferred terminal mapping pairs over D derived from Q.

The following theorem characterizes the complexity of answering a POcNR graph query.

Theorem 4. Let D be a data graph and PCG a POcNR graph query on D. Computing an answer of PCG can be done in polynomial time (w.r.t. the size of D).

Proof. From Definition 9, an answer of PCG is a preferred mapping pair in U(PCG,D). An algorithm computing a preferred mapping pair in U(PCG,D) is shown in Fig. 17.

The mapping pair computed by Algorithm 1 is preferred in U(PCG,D) as at each step it selects the minimum production q that is applicable to the current mapping pair.

We now show that this algorithm can be executed in polynomial time. Let l be the number of distinct node labels appearing in PCG. Steps 2–4 of the algorithm are executed at most O(|N|l) times, since the maximum length of a derivation is O(|N|l) (see Lemma 1).

Steps 1 and 4 require constant time. Steps 2 and 3 can be done in polynomial time (w.r.t. the size of D and PCG). Indeed, the selection of a production q that is applicable to M requires only choosing the first production from the list of productions that is applicable to M. Checking whether a production can be applied to a mapping pair can be done in polynomial time, because it requires only extending the current mapping pair and verifying the application condition. □
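The selection strategy used in this proof (always commit to the first applicable production in priority order, never backtrack) can be sketched as a simple loop. The code below is not the algorithm of Fig. 17, which is not reproduced here; it is a simplified illustration under strong assumptions. The query graph is abstracted to a stack of pending non-terminal nodes, each identified by the data node its parent is mapped to, plus the set of data nodes already marked; the rules `expand` and `erase` are hypothetical stand-ins loosely modelled on q2 and q3 of Example 8.

```python
def greedy_derivation(state, productions, is_terminal):
    """Repeatedly apply the first applicable production (list is in priority order)."""
    applied = []
    while not is_terminal(state):
        for name, rule in productions:
            new_state = rule(state)
            if new_state is not None:  # rule applicable: commit, never backtrack
                state = new_state
                applied.append(name)
                break
        else:
            raise RuntimeError("no applicable production")
    return state, applied


# Hypothetical data graph (adjacency lists); "d" is not reachable from "start".
SUCC = {"start": ["a", "b"], "a": ["c"], "b": [], "c": ["a"], "d": ["a"]}


def expand(state):
    """Mark one unvisited successor of the node on top of the stack, if any."""
    pending, marked = state
    for s in SUCC[pending[-1]]:
        if s not in marked:
            return pending + (s,), marked | {s}
    return None


def erase(state):
    """Empty production: drop the pending non-terminal (always applicable)."""
    pending, marked = state
    return pending[:-1], marked


initial = (("start",), frozenset({"start"}))
final, applied = greedy_derivation(
    initial, [("q2", expand), ("q3", erase)], lambda st: not st[0]
)
```

Because the empty production is always applicable, the loop cannot get stuck, which mirrors the role of Condition 1 in Definition 7.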

Example 9. Consider the data graph D depicted in Fig. 18, representing a collection of books. Each book is represented by means of a node whose label is the book title. For each book whose publisher is known, the publisher is represented as a node connected to the book node. Therefore, nodes representing books sharing the same publisher are connected to the same publisher node, and some book nodes are not connected to any publisher node.

Fig. 17. Algorithm computing a preferred mapping pair.


Fig. 18. A data graph representing a collection of books.

Fig. 19. A POcNR graph grammar extracting books and publishers from the data graph in Fig. 18.


The POcNR graph query corresponding to the POcNR graph grammar shown in Fig. 19 (where q2 < q3) extracts from D all books and their publishers (rules S → ε, B → ε and B1 → ε are implied).

Basically, rule q1 extracts books associated to a publisher which has not been already marked with $p, and rule q3 extracts all the books sharing a publisher already marked with $p. As q1 < q2, rule q2 is applied only if q1 cannot be applied, i.e. it extracts books which are not associated to any publisher. Rules B → ε and B1 → ε are applied after all books have been marked with $b.

4.1. POcNR graph queries vs. path queries

In this section we compare the expressive power of POcNR graph queries with the expressiveness of path queries. First, we introduce an example of a POcNR grammar which marks the nodes of a data graph corresponding to the answer of a given path query. Next, we formally prove that every path query can be “translated” into an equivalent POcNR graph query.

In the following, we admit the use of “generic variables” (denoted by strings preceded by the symbol %) inside connection rules to denote any label in the domain Cc. Thus, given the alphabet Cc = {a,b}, the connection rule of the form {(r,a/a,1), (r,b/b,1)} can be rewritten as {(r,%x/%x,1)}. This shortcut is also used in the graphical notation, as shown in Fig. 20.

Example 10. The grammar of Fig. 21 extracts a sub-graph containing all the nodes which can be reached from a node labeled with n1 by means of paths spelling a word in the language defined by the regular expression (a|bc)+. These nodes are marked with $f, whereas “intermediate” nodes are marked with $i.


Fig. 20. Two equivalent productions.

Fig. 21. A POcNR graph grammar expressing a path query.

Fig. 22. Terminal mapping pair and extracted subgraph.


On the left-hand side of Fig. 22, a terminal mapping pair obtained by applying the grammar of Fig. 21 is shown. The extracted graph, identified by this terminal mapping pair, is shown on the right-hand side of Fig. 22.

Observe that the answer of the path query ⟨{n1}, (a|bc)+⟩ of the above example is the set of nodes marked with $f, that is the set {n2,n3,n6,n7,n8,n10}. The relation between path queries and POcNR graph grammars is established by the following theorem.
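For comparison, the answer of a path query of this form can be computed directly by exploring the product of the data graph with a finite automaton for the regular expression. The sketch below is a standard product-search evaluation, not the grammar-based method of the paper; the automaton for (a|bc)+ is hand-built, node identifiers are assumed to coincide with node labels, and the data graph is a hypothetical one (it is not the graph of Fig. 22).

```python
from collections import deque

# Hand-built automaton for (a|bc)+ : 0 = start, 1 = accepting, 2 = "b read, c expected".
DELTA = {(0, "a"): 1, (0, "b"): 2, (2, "c"): 1, (1, "a"): 1, (1, "b"): 2}
ACCEPTING = {1}


def path_query(edges, start_nodes):
    """Nodes reachable from a start node through a path spelling a word of (a|bc)+."""
    out = {}
    for s, lab, t in edges:
        out.setdefault(s, []).append((lab, t))
    seen = {(n, 0) for n in start_nodes}
    queue = deque(seen)
    answer = set()
    while queue:
        node, state = queue.popleft()
        for lab, target in out.get(node, []):
            nxt = DELTA.get((state, lab))
            if nxt is None:
                continue  # the automaton cannot read this arc label here
            if nxt in ACCEPTING:
                answer.add(target)
            if (target, nxt) not in seen:
                seen.add((target, nxt))
                queue.append((target, nxt))
    return answer


# Hypothetical data graph given as (source, label, target) arcs.
edges = [("n1", "a", "n2"), ("n1", "b", "n3"), ("n3", "c", "n4"),
         ("n2", "c", "n5"), ("n4", "a", "n6")]
result = path_query(edges, {"n1"})
```

In this toy graph n3 plays the role of an intermediate node: it lies on an accepted path but is not itself part of the answer.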

Theorem 5. Let C be an alphabet, $f be a variable in Cv, D a data graph defined on Cc, and Q = ⟨C0, r⟩ a path query over D (with C0 ⊆ Cc), where r is a regular expression on Cc not denoting empty strings. Then, there exists a POcNR graph grammar PCG = (C,Ct,P,<P,S) such that the answer of Q coincides with the set of nodes in D which are marked by PCG with the symbol $f.

Proof. We first show that, given a regular expression exp which does not denote empty strings, there exists a POcNR graph grammar which marks every node of D with the symbol $s, and with the symbol $f those nodes of D which are reachable from any node of D through a path spelling exp. Next we complete the proof by showing that, given a set of node labels C0 ⊆ Cc and the regular expression exp, there exists a POcNR grammar which marks with the symbol $f all nodes of D which are reachable from any node of D whose label belongs to C0 through a path spelling exp. In the following, for a given set of production rules P, we will denote as <′P the partial ordering criterion on P which imposes that every production of the form qi : X → ε is preceded by any other production qj s.t. lhs(qj) = lhs(qi).


The first thesis can be proved inductively on the form of the regular expression exp. Observe that at each step of the induction (including the base case) we consider and generate production rules of the “form” shown in Fig. 23. That is, the left-hand side of the rules consists of a unique context node connected to the non-terminal node, and the right-hand side is a tree rooted in the context node. This root has two children: a non-terminal node and a terminal node. The latter is the parent of a possibly empty set of non-terminal nodes.

(1) exp = r (where r ∈ Cc).

Let PCG1 be the POcNR graph grammar defined by the productions q_S and q_X shown in Fig. 24 (and by the implicit empty productions q′_S : S → ε and q′_X : X → ε). Let M = (a, u) be a preferred mapping pair in U(PCG1, D). Since M is preferred, all nodes of D are mapped to distinct nodes with label $s in the query graph a. Otherwise, if there were any node of D not mapped to any node of a with label $s, M would not be preferred: from the definitions of POcNR graph grammar and preferred mapping pair we have that q_S < q′_S and that, in each derivation that leads to M, the production q′_S is applied only after q_S cannot be used any more (that is, after each node of D is mapped to a node labeled with $s in the generated query graph).

Analogously, since q_X < q′_X and since M is preferred, it can be proved that all nodes which are connected to any node of D by an edge labeled with r are mapped to a node of a labeled with $f.

These considerations can be applied to any preferred mapping pair in U(PCG1, D), since they are based only on the assumption that the mapping pair is preferred, and not on any other property of the mapping pair.

Thus, we conclude that PCG1 = ({S, X, r, $s}, {r, $s}, {q_S, q′_S, q_X, q′_X}, <′_P, S) marks every node of D with the symbol $s, and those nodes which are connected to any node by means of an edge r with the symbol $f.

(2) exp = exp1 · exp2 (where exp1 and exp2 are regular expressions, denoting non-empty strings, defined on Cc, and the symbol '·' denotes the concatenation).

Let PCG1 = (C1, C1t, P1, <′_{P1}, S1) (resp. PCG2 = (C2, C2t, P2, <′_{P2}, S2)) be a POcNR graph grammar labelling every node n in D with $s1 (resp. $s2) if n is reachable from any node of D through a path spelling exp1 (resp. exp2) and with $f1 (resp. $f2) otherwise. Without loss of generality assume that C1 ∩ C2 = ∅ and that $s ∉ C1 ∪ C2 and $f ∉ C1 ∪ C2.

Let P′1 be the set of productions obtained from P1 by substituting in each production the symbol $s1 with the symbol $s, and let P′2 be the set of productions obtained from P2 by substituting in each production the symbol $f2 with the symbol $f.

Fig. 23. The form of the production rules constructed at each induction step.

Fig. 24. A POcNR grammar for exp = r.


Let q_{S2} : S2 → (a_{S2}, C_{S2}) be the production in P′2 associated to the axiom S2, where a_{S2} = (N_{a_{S2}}, E_{a_{S2}}, k_{a_{S2}}), and let Conn($s2) be the set containing a pair ⟨r, c⟩ for each edge labeled with r which connects in a_{S2} the node labeled with $s2 to a node labeled with c:

Conn($s2) = {⟨r, c⟩ | ∃u, v ∈ N_{a_{S2}} ∃r ∈ C2t : k(u) = $s2 ∧ k(v) = c ∧ (u, r, v) ∈ E_{a_{S2}}}.

Let Final(P′1) be the subset of P′1 containing all the productions q_i : X → (a_i, C_i) where the graph a_i = (N_{a_i}, E_{a_i}, k_{a_i}) contains a node labeled with $f1. For each production q_i : X → (a_i, C_i) of Final(P′1), we define the production q′_i : X → (a′_i, C′_i) in the following way:

(a) a′_i is obtained by adding, for each pair ⟨r, c⟩ ∈ Conn($s2), a new node labeled with c to a_i, and connecting this node to the node labeled with $f1 by means of an edge labeled with r;

(b) C′_i = C_i.
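The two steps above (collecting the Conn set and extending the right-hand side of a final production) can be sketched on a toy representation. The sketch assumes that a graph is given by a dictionary from node identifiers to labels and a set of triples (source, edge label, target), and that the new edge goes from the node marked as final to the fresh node, mirroring the direction of the original edge. The function names `conn` and `extend_final`, the node identifiers and the sample graphs are all hypothetical; connection rules are not modelled.

```python
def conn(labels, edges, marker):
    """Pairs (r, c) such that an edge labeled r leaves a node labeled
    `marker` and enters a node labeled c."""
    return {(r, labels[v]) for (u, r, v) in edges if labels[u] == marker}


def extend_final(labels, edges, final_marker, pairs):
    """Copy the graph and, for each (r, c) in `pairs`, attach a fresh node
    labeled c to every node labeled `final_marker` through an r-edge."""
    new_labels, new_edges = dict(labels), set(edges)
    fresh = 0
    for node, label in labels.items():
        if label != final_marker:
            continue
        for r, c in sorted(pairs):
            # Pick an identifier that is not used yet.
            while f"new{fresh}" in new_labels:
                fresh += 1
            new_labels[f"new{fresh}"] = c
            new_edges.add((node, r, f"new{fresh}"))
    return new_labels, new_edges


# Hypothetical right-hand side of the axiom production of the second grammar.
labels_s2 = {"u0": "$s2", "u1": "c"}
edges_s2 = {("u0", "b", "u1")}

# Hypothetical right-hand side of a production containing a node labeled $f1.
labels_i = {"v0": "ctx", "v1": "$f1"}
edges_i = {("v0", "a", "v1")}

pairs = conn(labels_s2, edges_s2, "$s2")
labels_new, edges_new = extend_final(labels_i, edges_i, "$f1", pairs)
print(pairs)
print(sorted(edges_new))
```

In the extended graph the node labeled $f1 plays the role that the node labeled $s2 had in the axiom production of the second grammar, which is what lets a path spelling exp2 start where a path spelling exp1 ends.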

Fig. 25 represents the construction of q′_i starting from q_i and q_{S2} graphically.

Let Final′(P′1) be the set of all the productions q′_i obtained from the productions q_i of Final(P′1) as described above, and let P be the set of productions defined as P = (P′1 \ Final(P′1)) ∪ P′2 ∪ Final′(P′1). The POcNR graph grammar PCG = (C1 ∪ C2 ∪ {$s, $f}, C1t ∪ C2t ∪ {$s, $f}, P, <′_P, S1) marks every node of D with the symbol $s, marks those nodes which are connected to any node by a path spelling exp1 with the symbol $f1, and marks those nodes which are connected to any node labeled with $f1 by a path spelling exp2 with the symbol $f.

Thus, PCG marks every node of D with the symbol $s and those nodes which are connected to any node by a path spelling exp1 · exp2 with the symbol $f.

(3) exp = exp1 | exp2 (where exp1 and exp2 are regular expressions, not denoting empty strings, defined on Cc, and the symbol '|' denotes the disjunction).

Let PCG1 = (C1, C1t, P1, <′_{P1}, S1) and PCG2 = (C2, C2t, P2, <′_{P2}, S2) be two POcNR graph grammars labelling with the symbol $s all nodes of D, and with the symbol $f the nodes which are reachable from any node of D through a path spelling, respectively, exp1 and exp2. Assume that C1 ∩ C2 = C1t ∩ C2t = {$s, $f} (i.e. PCG1 and PCG2 use different node labels, except $s and $f).

Fig. 25. Merging q_{S2} and q_i into q′_i.


Let q_{S1} ∈ P1 be the production S1 → (a_{S1}, C_{S1}), and q_{S2} ∈ P2 be the production S2 → (a_{S2}, C_{S2}). Let Conn($s) be the set containing a pair ⟨r, c⟩ for each edge labeled with r which connects in a_{S2} the node labeled with $s to a node labeled with c.

We define the production q_S : S → (a_S, C_S) in the following way:

(a) S is a non-terminal symbol not belonging to C1 ∪ C2;

(b) a_S is a graph over C1 ∪ C2 built by executing the following steps:
(i) add every node and edge of a_{S1} to a_S, except the node with label S1;
(ii) for each pair ⟨r, c⟩ ∈ Conn($s), with c ≠ S2, add a node labeled with c to a_{S1} and connect it to the node labeled with $s by means of an edge labeled with r;
(iii) add the non-terminal node ID_S with label S to a_S.

(c) C_S is the set of connection rules built as follows:
(i) add to C_S every rule (c, r, v, d) of C_{S1} ∪ C_{S2} such that k(v) ≠ S1 and k(v) ≠ S2;
(ii) for each rule (c, r, v, d) in C_{S1} ∪ C_{S2} such that k(v) = S1 or k(v) = S2, add the rule (c, %x, ID_S, d) to C_S.

Fig. 26 represents the construction of q_S graphically.

Let P be the set of productions P1 ∪ P2 ∪ {q_S}. The POcNR graph grammar PCG = (C1 ∪ C2, C1t ∪ C2t, P, <′_P, S) marks every node of D with the symbol $s and marks those nodes which are connected to any node by a path spelling exp1 or exp2 with the symbol $f.

Fig. 26. Merging q_{S1} and q_{S2} into q_S.

(4) exp = exp1+ (where exp1 is a regular expression, not denoting empty strings, defined on Cc, and the symbol '+' denotes the closure).

Let PCG1 = (C1, C1t, P1, <′_{P1}, S1) be a POcNR graph grammar which marks with the symbol $s all nodes of D, and with the symbol $f those nodes which are reachable from any node of D through a path spelling exp1. Let q_{S1} ∈ P1 be the production S1 → (a_{S1}, C_{S1}), and let Conn($s) be the set containing a pair ⟨r, c⟩ for each edge labeled with r which connects in a_{S1} the node labeled with $s to a node labeled with c. Let Final(P1) be the subset of P1 containing every production q_i : X → (a_i, C_i) where a_i has a node labeled with $f. For each production q_i : X → (a_i, C_i) of Final(P1), we define the production q′_i : X → (a′_i, C′_i) in the following way:

(a) a′_i is obtained by adding, for each pair ⟨r, c⟩ ∈ Conn($s), a new node labeled with c to a_i, and connecting this node to the node labeled with $f by means of an edge labeled with r;

(b) C′_i = C_i.

Fig. 27 shows the construction of q′_i starting from q_i and q_{S1} graphically.

Let Final′(P1) be the set of productions containing every production q′_i obtained from the productions in Final(P1) as described above. Let P be the set of productions defined as: P = P1 \ (Final(P1) ∪ Final′(P1)). The POcNR graph grammar PCG = (C1, C1t, P, <′_P, S1) marks every node of D with the symbol $s and marks those nodes which are connected to any node by a path spelling exp1+ with the symbol $f.

Thus, we have shown that there exists a POcNR graph grammar which marks every node of D with the symbol $s, and marks those nodes connected to any node of D by means of a path spelling exp with the symbol $f. That is, there exists a POcNR graph grammar which marks every node belonging to the answer of the path query ⟨C, Exp_Q⟩ with the symbol $f. Now, we show how to build a POcNR graph grammar expressing a path query where the set of starting nodes is a subset of C.
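The set of nodes that such a grammar marks with $f can also be computed directly, by a recursion that follows the same four cases (single symbol, concatenation, disjunction, closure). The sketch below is not the grammar construction itself: it is a plain evaluation of the path relation, given as a way to check small examples. The tuple encoding of regular expressions, the functions `path_pairs` and `answer`, and the sample graph are assumptions made for illustration.

```python
def path_pairs(exp, edges):
    """Set of pairs (u, v) such that some path from u to v spells exp.

    exp is a nested tuple: ("sym", r), ("cat", e1, e2), ("alt", e1, e2)
    or ("plus", e1). Edges are triples (source, label, target).
    """
    kind = exp[0]
    if kind == "sym":
        return {(u, v) for (u, r, v) in edges if r == exp[1]}
    if kind == "cat":
        left, right = path_pairs(exp[1], edges), path_pairs(exp[2], edges)
        return {(u, w) for (u, v) in left for (v2, w) in right if v == v2}
    if kind == "alt":
        return path_pairs(exp[1], edges) | path_pairs(exp[2], edges)
    if kind == "plus":
        base = path_pairs(exp[1], edges)
        closure = set(base)
        while True:
            # Extend every known path by one more repetition of exp1.
            step = {(u, w) for (u, v) in closure for (v2, w) in base if v == v2}
            if step <= closure:
                return closure
            closure |= step
    raise ValueError(f"unknown expression kind: {kind!r}")


def answer(start_labels, exp, labels, edges):
    """Nodes reachable, through a path spelling exp, from a node whose
    label belongs to start_labels."""
    return {v for (u, v) in path_pairs(exp, edges) if labels[u] in start_labels}


# Hypothetical data graph in which every node is labeled with its own name.
labels = {n: n for n in ("n1", "n2", "n3", "n4", "n5")}
edges = {("n1", "a", "n2"), ("n2", "b", "n3"), ("n3", "c", "n4"), ("n4", "a", "n5")}

# The expression (a|bc)+.
exp = ("plus", ("alt", ("sym", "a"), ("cat", ("sym", "b"), ("sym", "c"))))
print(sorted(answer({"n1"}, exp, labels, edges)))  # ['n2', 'n4', 'n5']
```

Node n3 is not in the result because it is only reached in the middle of the sub-path spelling bc, never at the end of a path spelling the whole expression.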

We denote as PCG′ = (C1, C1t, P1, <_{P1}, S1) a POcNR graph grammar corresponding to the path query ⟨C′, Exp_Q⟩, and let q_{S1} : S1 → (a_{S1}, C_{S1}) be the production in P1 corresponding to the axiom S1.

Let S be a node label not belonging to C1. For each node label c_i ∈ C′ we define the production q_i : S → (a_S, C_S) in the following way:

(1) a_S is obtained from a_{S1} by substituting the node label $s with the node label c_i, and the node label S1 with the node label S;

(2) C_S = ∅.

Fig. 27. Merging q_{S1} and q_i into q′_i.


Fig. 28. Obtaining q_i from q_{S1}.


Fig. 28 shows the construction of q_i starting from q_{S1} graphically.

Let P′1 be the set of the productions q_i defined above (the number of productions in P′1 is |C′|), and let P be the set of productions defined as: P = P1 ∪ P′1 ∪ {q : S → ε}. The POcNR grammar PCG = (C1 ∪ C′ ∪ {S}, C1t ∪ C′, P, <′_P, S) marks every node of D which is reachable from any node whose label is in C′ with the symbol $f. □

The above-reported theorem states that path queries can be expressed using POcNR graph grammars. Vice versa, POcNR graph grammars allow us to express more general structural properties than path queries do. In the next section we will sketch an application of POcNR graph grammars to the problem of querying XML documents. We point out that POcNR graph queries can also be used to build new graphs, and a system based on this type of grammars for extracting and restructuring XML documents has been proposed in [16].

5. Using POcNR graph grammars for querying XML data

5.1. A brief overview of XML

XML (eXtensible Markup Language) [25] is a new standard, adopted by the World Wide Web Consortium (W3C), which complements HTML for representing and exchanging data on the Web. Like HTML, XML makes use of tags (words bracketed by "<" and ">") and attributes (of the form name = "value"). Unlike HTML, where tags and attributes are fixed and are used to specify how the text (or the referenced image, sound, etc.) between them will look in a browser, XML uses user-defined tags whose function is only to delimit pieces of data. In particular, the name of a pair of XML tags usually describes the meaning of the data which are enclosed, and suggests the interpretation of such data to any application that reads it. Moreover, tags can be nested in order to represent possible relationships between the contained data. As a result, the nested structure of an XML document and the possibility of customizing tags give a meaningful representation of data.

The basic component of an XML document is the element, that is, a piece of text between a start-tag and an end-tag. As explained above, tags are defined by users and describe the meaning of the contained text. For instance, personal information about people can be organized in an XML document like that of Fig. 29.

Fig. 29. A piece of XML document.


Fig. 30. A piece of XML document.


In the above document, the information about a person is put between the start-tag <person> and the end-tag </person>. Each element person contains some sub-elements (i.e. name, email, phone, city).

The elements of an XML document can be characterized by attributes. An attribute is a pair (name, value) which defines a property of the associated element. For instance, in the XML document of Fig. 30, the elements person are characterized by an id (which uniquely identifies a person), by an (optional) attribute friendOf which defines a relation of friendship between different elements, and by an attribute state.

5.2. A graph model for XML

In this section we present a data model for representing XML documents by means of graphs, and then specialize POcNR grammars to extract data from XML graphs.

An XML document can be represented as an ordered labeled oriented graph where:

• the containment relation between two elements is represented by an arc labeled with the tag of the sub-element;
• references are represented by arcs connecting the referencing element to the referenced one, which are labeled with the name of the reference attribute;
• each node contains the set of the attributes which characterize the corresponding element;
• if an element contains text and does not contain any sub-elements, the text is assimilated to the value of an attribute value;
• if an element contains both sub-elements and text strings, text strings are assimilated to sub-elements with a tag <text> and an attribute string = "string-value".
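The rules listed above can be sketched as a small conversion routine built on Python's standard `xml.etree.ElementTree` parser. The sketch makes several assumptions that are not stated in the paper: nodes are numbered in document order, the identifying attribute is called `id`, the attributes to be read as references are passed explicitly (here `friendOf`), reference attributes are also kept among the attributes of the node, and the order of elements is ignored. The function name `xml_to_graph` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def xml_to_graph(xml_text, id_attr="id", ref_attrs=("friendOf",)):
    """Return (nodes, tag_arcs, ref_arcs, root) for an XML document.

    nodes    : node id -> dictionary of attribute/value pairs
    tag_arcs : triples (parent, tag, child) for element containment
    ref_arcs : triples (source, attribute name, target) for references
    """
    nodes, tag_arcs, pending, by_id = {}, [], [], {}

    def new_node(attrs):
        node = len(nodes)
        nodes[node] = attrs
        return node

    def visit(elem):
        node = new_node(dict(elem.attrib))
        if id_attr in elem.attrib:
            by_id[elem.attrib[id_attr]] = node
        for name in ref_attrs:
            for target in elem.attrib.get(name, "").split():
                pending.append((node, name, target))
        children = list(elem)
        texts = [elem.text] + [child.tail for child in children]
        texts = [t.strip() for t in texts if t and t.strip()]
        if not children and texts:
            # Text-only element: the text becomes the attribute "value".
            nodes[node]["value"] = texts[0]
        for t in (texts if children else []):
            # Mixed content: each text string becomes a <text> sub-element.
            tag_arcs.append((node, "text", new_node({"string": t})))
        for child in children:
            tag_arcs.append((node, child.tag, visit(child)))
        return node

    root = visit(ET.fromstring(xml_text))
    ref_arcs = [(src, name, by_id[t]) for (src, name, t) in pending if t in by_id]
    return nodes, tag_arcs, ref_arcs, root


DOC = """<people>
  <person id="p1" state="IT"><name>Ann</name></person>
  <person id="p2" friendOf="p1"><name>Bob</name></person>
</people>"""

nodes, tag_arcs, ref_arcs, root = xml_to_graph(DOC)
print(sorted(tag_arcs))
print(ref_arcs)
```

References whose target identifier does not occur in the document are silently dropped in this sketch; a stricter variant could raise an error instead.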

The representation that we adopt in this paper does not consider the ordering of elements; it can be easily understood by examining the document and the corresponding unordered graph of Example 11.

Example 11. Fig. 31 shows an XML document containing IDREFs. The associated graph contains arcs of different types which could both be navigated. Observe that the dotted arcs denote attributes, whereas solid arcs denote elements.

In the following definition we formally identify the structure of an XML Graph.

Definition 10. Let A be a set of attribute names, T a set of tag names, and V a set of attribute values. An unordered XML graph is a labeled oriented graph G = ⟨N, Er ∪ Et, f, r⟩ where:

Fig. 31. An XML document and the corresponding XML graph.


• N is the set of nodes,
• Er ⊆ {(u, r, v) | u, v ∈ N and r ∈ A} is a set of reference arcs,
• Et ⊆ {(u, r, v) | u, v ∈ N and r ∈ T} is a set of tag arcs,
• f : N → 2^(A×V) is the function associating to each node a set of attribute/value pairs,
• ⟨N, Et⟩ is a tree with root r.

An ordered XML graph is a quintuple G = ⟨N, Er ∪ Et, f, g, r⟩ where N, Er, Et, f and r are defined as above and g : Et → Z+ is a function associating a unique ordinal number to each arc in Et.
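The structural conditions of the unordered case can be checked mechanically. The following sketch assumes that arcs are given as triples (source, label, target) and reads "tree with root r" as: the root has no incoming tag arc, every other node has exactly one, and every node is reachable from the root along tag arcs. The function name `is_unordered_xml_graph` and the sample data are hypothetical; the attribute function f is not checked.

```python
def is_unordered_xml_graph(nodes, ref_arcs, tag_arcs, root, attr_names, tag_names):
    """Check the arc and tree conditions of an unordered XML graph."""
    nodes = set(nodes)
    if root not in nodes:
        return False
    # Arcs must connect known nodes and carry labels of the right kind.
    for arcs, names in ((ref_arcs, attr_names), (tag_arcs, tag_names)):
        for (u, label, v) in arcs:
            if u not in nodes or v not in nodes or label not in names:
                return False
    children = {n: [] for n in nodes}
    has_parent = set()
    for (u, _, v) in tag_arcs:
        if v == root or v in has_parent:
            return False  # root with a parent, or node with two parents
        has_parent.add(v)
        children[u].append(v)
    # Every node must be reachable from the root through tag arcs.
    reached, stack = {root}, [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in reached:
                reached.add(child)
                stack.append(child)
    return reached == nodes


# Hypothetical graph: a root, one book and its author, plus one reference arc.
sample_nodes = {0, 1, 2}
sample_tag_arcs = [(0, "book", 1), (1, "author", 2)]
sample_ref_arcs = [(2, "friendOf", 1)]

ok = is_unordered_xml_graph(sample_nodes, sample_ref_arcs, sample_tag_arcs, 0,
                            {"friendOf"}, {"book", "author"})
bad = is_unordered_xml_graph(sample_nodes, sample_ref_arcs,
                             sample_tag_arcs + [(0, "author", 2)], 0,
                             {"friendOf"}, {"book", "author"})
print(ok, bad)  # True False
```

Reference arcs are deliberately left out of the tree check: they may introduce cycles or shared targets without violating the definition.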

5.3. Using POcNR to query XML data

In the previous sections, we showed that POcNR graph grammars can be used to extract information from a source data graph. Now we "specialize" POcNR graph grammars to deal with XML graphs.

The only difference with respect to the querying of graphs introduced in the previous section is that each node in an XML graph has a set of attributes and a value. Attributes and value can be identified using a "dot" notation. For instance, an attribute attr of an element el corresponding to a node with label L is denoted by L.attr. Analogously, the value of the label L is L.value.

The following example shows how XML sub-graphs can be extracted from XML documents.

Example 12. The POcNR grammar PCG defined by the rules of Fig. 32 extracts from the document described in Fig. 31 the authors of the books whose title is not "Title2". The nodes representing book elements are marked with the symbol $b, whereas the nodes corresponding to author elements are marked with $a. The production B → ε, expanding the non-terminal node B into ε, is implied.

Observe that we have assumed that a book is written by a unique author. Otherwise, if a book can be written by many authors, the grammar must be modified as follows: (i) the terminal node $a in q2 must be replaced with a non-terminal node A, and (ii) a new production q3 expanding A into a list of terminal nodes $a must be added to the set of productions.
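For comparison, the result that the query of this example is meant to return can be obtained with a few lines of ordinary code. The sketch below does not run a graph grammar: it evaluates the condition "title different from Title2" directly with Python's standard `xml.etree.ElementTree` module, and it already covers the case of several authors per book. Since the document of Fig. 31 is not reproduced here, the sample document, its tag names and the function name `authors_of_books_not_titled` are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document standing in for the one of Fig. 31.
DOC = """<bib>
  <book><title>Title1</title><author>Author1</author></book>
  <book><title>Title2</title><author>Author2</author></book>
  <book><title>Title3</title><author>Author3</author><author>Author4</author></book>
</bib>"""


def authors_of_books_not_titled(xml_text, excluded_title):
    """Return a list of (book, authors) pairs.

    Each selected book element plays the role of a node marked $b and
    each of its author elements the role of a node marked $a.
    """
    root = ET.fromstring(xml_text)
    selected = []
    for book in root.iter("book"):
        if book.findtext("title") != excluded_title:
            selected.append((book, book.findall("author")))
    return selected


result = [[author.text for author in authors]
          for _, authors in authors_of_books_not_titled(DOC, "Title2")]
print(result)  # [['Author1'], ['Author3', 'Author4']]
```

Unlike this direct evaluation, the grammar-based query returns the matched sub-graph itself, with the book and author nodes marked, rather than a list of text values.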

We conclude by reconsidering the problem presented in Section 1 (Example 1).

Example 13. The rooted graph of Fig. 1 can be transformed into an XML graph by assigning a direction to the edges and translating edge labels into tags (for instance, the edge label book is rewritten as <book>). The problem discussed in Example 1 (retrieving all the sub-graphs corresponding to books written by Ullman) can be expressed by means of the following POcNR graph grammar, where the productions S → ε and A → ε are omitted (Fig. 33).

The symbol S is expanded into a non-empty graph as many times as the number of Ullman's books. For each of these books, the symbol A is expanded into a non-empty graph as many times as the number of co-authors.

Fig. 32. A POcNR grammar extracting XML data.


Fig. 33. A POcNR extracting all Ullman’s books from the graph of Fig. 1.


6. Conclusions

In this paper we have proposed a general framework for querying graph-like data based on graph grammars. The use of graph grammars allows us to both verify whether the input graph satisfies a given property and extract a subgraph satisfying this property. We have extended classical NR graph grammars to make them well-suited for specifying queries. Adopting our form of graph grammar (namely, cNR graph grammar) results in a highly expressive query language for which the query answering problem is NP-complete. Moreover, we have defined a restriction of cNR graph grammars (namely, POcNR graph grammars) which can be evaluated efficiently and which allows us to express problems with polynomial time complexity.

Acknowledgement

The authors would like to thank Bruno Courcelle for reading a preliminary version of the paper and providing several suggestions.

Appendix A. Notation

In this section we present a brief summary which illustrates the notations used in the rest of the paper:

Symbol      Meaning
N           Set of nodes of a graph
u, v        Generic nodes (u, v ∈ N)
E           Set of edges of a graph
k           Function labelling nodes
C           Alphabet of node and edge labels
Cn          Alphabet of non-terminal node labels (Cn ⊆ C)
Cv          Alphabet of labels denoting variables (Cv ⊆ Ct)
Cc          Alphabet of labels denoting constants (data values) (Cc ⊆ Ct)
Ct          Alphabet of terminal node labels (Ct = Cc ∪ Cv)
c           Generic symbol in C used for labelling nodes
r, s        Generic symbols in Ct used for labelling edges
G           NR Graph Grammar
CG          Conditional Graph Grammar
PCG         Partially Ordered Conditional Graph Grammar
S           Axiom of a graph grammar
X           Generic non-terminal symbol (X ∈ Cn) used to denote the left-hand side of a production rule
b           Graph on the right-hand side of a production rule
C           Connection rule
P           Set of production rules of a graph grammar
q           Generic production rule of P
D           Data graph
a           Query graph representing each step of the derivation
a0          Query graph containing a unique non-terminal node whose label is the axiom S
u           Mapping
M           Mapping pair (of the form: M = (a, u))
U(CG, D)    Set of mapping pairs generated by the graph grammar CG on the data graph D
Q           Query

References

[1] S. Abiteboul, P. Buneman, D. Suciu, Data on the Web: From Relations to Semistructured Data and XML, Morgan Kaufmann, 1999.
[2] S. Abiteboul, V. Vianu, Regular path queries with constraints, in: Proc. 16th ACM SIGACT–SIGMOD–SIGART Symposium on Principles of Database Systems (PODS), 1997, pp. 122–133.
[3] P. Buneman, Semistructured data, in: Proc. 16th ACM SIGACT–SIGMOD–SIGART Symposium on Principles of Database Systems (PODS), 1997, pp. 117–121.
[4] P. Buneman, W. Fan, S. Weinstein, Path constraints in semistructured and structured databases, in: Proc. 17th ACM SIGACT–SIGMOD–SIGART Symposium on Principles of Database Systems (PODS), 1998, pp. 129–138.
[5] P. Buneman, M.F. Fernandez, D. Suciu, UnQL: a query language and algebra for semistructured data based on structural recursion, The VLDB Journal 9 (1) (2000) 76–110.
[6] V. Christophides, S. Cluet, G. Moerkotte, Evaluating queries with generalized path expressions, in: Proc. 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD), 1996, pp. 413–422.
[7] M.P. Consens, A.O. Mendelzon, GraphLog: a visual formalism for real life recursion, in: Proc. 9th ACM SIGACT–SIGMOD–SIGART Symposium on Principles of Database Systems (PODS), 1990, pp. 404–416.
[8] B. Courcelle, Recursive queries and context-free graph grammars, Theoretical Computer Science (TCS) 78 (1) (1991) 217–244.
[9] B. Courcelle, Structural properties of context-free sets of graphs generated by vertex replacement, Information and Computation (IC) 116 (2) (1995) 275–293.
[10] B. Courcelle, On the expression of graph properties in some fragments of monadic second-order logic, in: N. Immerman, P. Kolaitis (Eds.), Descriptive Complexity and Finite Models, DIMACS Series in Discrete Mathematics and Theoretical Computer Sciences, vol. 31, 1997, pp. 33–62.
[11] B. Courcelle, The expression of graph properties and graph transformations in monadic second-order logic, in: G. Rozenberg (Ed.), Handbook of Graph Grammars and Computing by Graph Transformations, Foundations, vol. 1, World Scientific, New-Jersey, London, 1997, pp. 313–400.
[12] I. Cruz, A. Mendelzon, P. Wood, G+: recursive queries without recursion, in: Proc. 2nd International Conference on Expert Database Systems (EDS), 1988, pp. 355–368.
[13] H. Ehrig, A. Habel, Graph grammars with application conditions, in: G. Rozenberg, A. Salomaa (Eds.), The Book of L, Springer-Verlag, 1986, pp. 87–100.
[14] J. Engelfriet, Context-free graph grammars, in: G. Rozenberg, A. Salomaa (Eds.), Handbook of Formal Languages, Beyond Words, vol. 3, Springer-Verlag, 1997, pp. 125–213.
[15] J. Engelfriet, G. Rozenberg, Node replacement graph grammars, in: G. Rozenberg (Ed.), Handbook of Graph Grammars and Computing by Graph Transformations, Foundations, vol. 1, World Scientific, New-Jersey, London, 1997, pp. 1–94.
[16] S. Flesca, S. Greco, F. Furfaro, A query language for XML based on graph grammars, World Wide Web Journal 5 (2) (2002) 125–157.
[17] S. Flesca, S. Greco, Querying graph databases, in: Proc. 7th International Conference on Extending Database Technology (EDBT), 2000, pp. 510–524.
[18] S. Flesca, S. Greco, Partially ordered regular languages for graph queries, in: Proc. 26th International Colloquium on Automata, Languages and Programming (ICALP), 1999, pp. 321–330.
[19] G. Grahne, A. Thomo, An optimization technique for answering regular path queries, in: Proc. 3rd WebDB Workshop, 2000, pp. 99–104.
[20] M. Gyssens, J. Paredaens, J. Van den Bussche, D. Van Gucht, A graph-oriented object database model, IEEE Transactions on Knowledge and Data Engineering (TKDE) 6 (4) (1994) 572–586.
[21] N. Immerman, Descriptive Complexity, Springer-Verlag, 1999.
[22] M. Kifer, W. Kim, Y. Sagiv, Querying object-oriented databases, in: Proc. 1992 ACM SIGMOD International Conference on Management of Data (SIGMOD), 1992, pp. 393–402.
[23] A. Mendelzon, P. Wood, Finding regular simple paths in graph databases, SIAM Journal on Computing 24 (6) (1995) 1235–1258.
[24] J. Paredaens, P. Peelman, L. Tanca, Merging graph-based and rule-based computation: The language G-Log, Data & Knowledge Engineering, vol. 25, Elsevier, 1998, pp. 267–300.
[25] Extensible Markup Language (XML) 1.0, W3C Recommendation. Available from: <http://www.w3.org/TR/2000/REC-xml-20001006>.
[26] XML Path Language (XPath) 2.0, W3C Working Draft. Available from: <http://www.w3.org/TR/2005/WD-xpath20-20050915>.
[27] XQuery 1.0: An XML Query Language, W3C Working Draft. Available from: <http://www.w3.org/TR/2005/WD-xquery-20050915>.

Sergio Flesca received his Ph.D. in Computer Science Engineering from the University of Calabria, Italy. Currently he is an associate professor at the Engineering Faculty, University of Calabria. He was a visiting researcher at the Computer Science Department of Vienna University of Technology. His research interests include inconsistent data management, semistructured data, XML query languages, and Web information extraction.

Filippo Furfaro received his Ph.D. in Computer Science Engineering from the University of Calabria, Italy. Currently he is an assistant professor at the Engineering Faculty, University of Calabria. His research interests include database theory, logic programming, inconsistent data management, semistructured data, XML query languages, compression techniques for multidimensional data, and computation on grids.

Sergio Greco received the Laurea degree in Electrical Engineering from the University of Calabria, Italy. Currently, he is a full professor at the Faculty of Engineering and chair of the Department of Computer and System Sciences of the University of Calabria. Prior to this, he was a researcher at CRAI, and a visiting researcher at the Microelectronics and Computer Corporation center and at the Computer Science Department of the University of California at Los Angeles. His primary research interests include database theory, logic programming, deductive databases, database integration, intelligent information integration over the web, web search engines, and query languages for semistructured data. He is a member of the IEEE and the IEEE Computer Society.