
IEICE TRANS. INF. & SYST., VOL.E89–D, NO.4 APRIL 2006 1359

PAPER Special Section on Knowledge-Based Software Engineering

High-Volume Continuous XPath Querying in XML Message Brokers

Hyunho LEE†a), Student Member and Wonsuk LEE†b), Nonmember

SUMMARY The core technical issue in XML message brokers, which play a key role in exchanging information in ubiquitous environments, is processing a large set of continuous XPath queries over incoming XML streams. In this paper, a new system is proposed as a solution to this issue. The system is designed to minimize the runtime workload of continuous query processing by transforming XPath expressions and XML streams into newly proposed data structures and matching them efficiently. System performance is also estimated in terms of both space and time, and verified through a variety of experimental studies, which show that the proposed system is practically linear-scalable and stable in processing a set of XPath queries in a continuous and timely fashion.
key words: XML message broker, XML stream, continuous XPath queries

1. Introduction

Beyond a substitution for HTML on the web, XML stands in the spotlight as the basic data format in ubiquitous environments including web services such as SOAP and WSDL, B2B (business to business) transactions and personalized content delivery [6]. In these fields, one of the main research issues is XML message brokers that enable applications to exchange information by sending XML messages and subscribing to such messages [7]. The core technical challenge in such systems is to process a large set of XPath [1] queries over a continuously incoming stream of XML packets.

XPath expressions are composed of a sequence of location steps, each consisting of an axis, a node test and zero or more predicates [4]. An axis specifies the hierarchical relationship between the nodes: parent-child (/) or descendant-or-self (//). A node test is typically a name test, which can be a fragment name or a wildcard (*) that matches any fragment name.

Data streams are continuous, unbounded, possibly rapid and time-varying. Similarly to long-running continuous queries [8] in these data streams, XPath queries over XML streams are expected to produce answers in a continuous and timely fashion. This paper focuses on minimizing the workload of query processing at runtime by transforming XPath expressions into newly proposed efficient data structures.

In order to process continuous XPath queries, it is necessary to define the logical unit of an infinite XML stream in which the queries are processed. The logical unit should preserve not only the semantics of the stream data but also the syntactic principles to make automatic separations possible by some rules. In this paper, the pattern of an XML stream is assumed to be an infinite repetition of the uppermost element except the virtual root element on the DTD and its contents, such as a series of XML packets or messages. In this pattern, a logical unit for an XML stream should be the uppermost element except the root and its contents. This logical unit is called a chunk of the stream. In particular, the uppermost element of each chunk in the stream is called a seed element of the stream.

Manuscript received June 30, 2005.
Manuscript revised October 4, 2005.
†The authors are with the Department of Computer Science, Yonsei University, Seoul, 120–749, Korea.
a) E-mail: [email protected]
b) E-mail: [email protected]
DOI: 10.1093/ietisy/e89–d.4.1359

One of the major difficulties in processing XPath queries is dealing with descendant-or-self (//) axes and wildcard (*) elements that match any element, since they make the queries non-deterministic. To solve this problem, navigation-based algorithms have been proposed that either employ an NFA (Non-deterministic Finite Automaton) with a secondary data structure such as a runtime stack [4] or transform an NFA into a DFA (Deterministic Finite Automaton) at runtime [7], [9]. The major drawback of these approaches is the increase in runtime overhead caused by resolving non-deterministic states lazily at runtime. To lessen this burden, this paper proposes a strategy in which each descendant-or-self axis or wildcard in an XPath query is transformed into a sequence of simple parent-child (/) axes by referring to the DTD of the XML stream. This strategy not only makes any XPath query deterministic and simple but also minimizes the runtime overhead, because non-deterministic states in the query are resolved before runtime.

Another difficulty in processing XPath queries involves processing repeated elements. When a repeated element is located at the branching point of a non-linear XPath query containing predicates, it is not easy to judge whether the same instance of the branching element is used on the paths to its branched descendant fragments (elements or attributes) that the query matches. To cope with this problem, this paper uses a global sequence number that is sequentially assigned to each fragment in a stream chunk. This sequence number is called the version of the fragment, and it is checked whether or not the versions of the branching elements on the paths of the matched branched fragments are equal to each other. For example, the query q_1 in Fig. 1 doesn't match the example XML stream S, since there is no branching element y that satisfies both of the conditions for its branched fragments c and w. This query result can be acquired by showing that the versions of the branching elements y for the matched fragments c and w are not equal: respectively 4 and 10.

Copyright © 2006 The Institute of Electronics, Information and Communication Engineers

Fig. 1 An XML stream and five XPath queries.

The goal of this paper can be summarized as follows: Given an XPath query set Q and a target XML stream S, propose efficient data structures and a matching algorithm for processing the queries in Q concurrently over the stream S in a continuous and timely fashion.

Constraints. The proposed approach may not work well if the given DTD is recursive, i.e., allows recursive occurrences of the same element. This case can still be handled, however, if a few constraints on the XML documents generated from the DTD are given, such as the maximum depth of the document and the maximum number of occurrences of a recursive element.

Related works. Several studies have concentrated on how to evaluate multiple XPath queries over large XML datasets. XFilter [2] builds an FSM (Finite State Machine) for each XPath query and employs a query index on all the FSMs. YFilter [4] combines all of the XPath queries into a single NFA, supporting shared processing of the common prefixes among all navigation paths. Index-Filter [5] uses indexes built over the document tags, in order to filter out large portions of the input XML document that aren't guaranteed to be part of any match. XTrie [3] handles tree-shaped path expressions involving predicates in a trie structure. XPush [7] proposes a single deterministic pushdown automaton (PDA) that is lazily constructed from XPath queries in order to prevent the rapid increase of states. While the work of this paper is based on the concept of continuous query (CQ) processing, these studies don't give explicit solutions to processing XPath queries over XML streams in a continuous fashion. Also, as stated previously, these studies try to resolve non-deterministic situations at runtime, resulting in an increase in runtime workload with respect to both space and time.

Paper outline. The rest of this paper is organized as follows: In Sect. 2, the proposed system and matching algorithm are described in detail. In Sect. 3, the performance of the proposed system is estimated. In Sect. 4, the system performance is analyzed through a series of experiments. Section 5 presents our conclusions.

2. Architecture and Algorithm

2.1 Notations and Examples

When multiple XPath queries are processed concurrently, it is likely that significant commonalities between the specifications of the queries exist. To eliminate redundant processing while answering the XPath queries, it is recommended to identify the query commonalities and combine the specifications of the XPath queries into a single identical structure, called a prefix tree [5]. All of the XPath queries can be combined into the one prefix tree rooted at a root element R.
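As a rough illustration of how shared prefixes collapse into one tree, the following Python sketch inserts queries into a trie keyed by location step. It assumes the queries are already tokenized into (axis, label) steps and ignores predicates; the names `PrefixNode`, `insert`, `count_nodes` and the three sample queries are hypothetical and do not come from the paper.

```python
class PrefixNode:
    """One node of a prefix tree over XPath location steps (illustrative only)."""

    def __init__(self, label):
        self.label = label
        self.children = {}   # (axis, label) -> PrefixNode
        self.qset = set()    # ids of queries whose path passes through this node


def insert(root, query_id, steps):
    """Walk/extend the tree along steps, sharing any existing prefix."""
    node = root
    for axis, label in steps:
        key = (axis, label)
        if key not in node.children:
            node.children[key] = PrefixNode(label)
        node = node.children[key]
        node.qset.add(query_id)
    return node


def count_nodes(node):
    """Total number of nodes in the subtree rooted at node."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


# Hypothetical queries: /x/y/c, /x/y/w and /x//p under a virtual root R.
root = PrefixNode("R")
insert(root, "qa", [("/", "x"), ("/", "y"), ("/", "c")])
insert(root, "qb", [("/", "x"), ("/", "y"), ("/", "w")])
insert(root, "qc", [("/", "x"), ("//", "p")])
```

The three queries contain eight location steps in total, but the tree needs only five nodes below R because the steps `/x` and `/x/y` are stored once and shared.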

For the query q, some notations are assumed.

• prefix_q(x) returns a sequence of location steps from a root element R to the fragment x on the query q.
• branching_q(x) returns a set of branching elements for the fragment x on the query q.

For example, given an XPath query q = /a[@b > 10]/c[d/text() < 20]/e, some notations for the query q are as follows: prefix_q(@b) = R/a/@b, prefix_q(c) = R/a/c, branching_q(@b) = {a} and branching_q(e) = {a, c}.

Given a prefix-tree T(Q) = (V_T, E_T) for a query set Q, some notations are assumed.

• For a node v ∈ V_T, label(v) returns the label (fragment name) associated with the node v. A node is classified into one of two types: i-node and p-node. If the fragment corresponding to the node is located in a predicate condition or at the last position of a query q ∈ Q, the node is a p-node. Otherwise, the node is an i-node.
• For a node v ∈ V_T, qset(v) returns the set of XPath queries that are associated with the node v. For a query q ∈ Q, vset(q) returns the set of nodes that are associated with the query q.
• Given the seed node corresponding to the seed element, denoted by v_s, path(v) returns the sequence of labels and axes from v_s to the node v.

Figure 1 shows an XML stream S, the DTD of stream S and a set of five XPath queries Q, which are used as examples in the following sections. The stream S consists of a virtual root element R defined in its DTD, and a series of two chunks, C_1 and C_2. Also, the uppermost element x in each chunk plays a role as the seed element of stream S.

2.2 System Overview

As shown in Fig. 2, the target system consists of two kinds of components: query-side components and stream-side components. Query-side components are constructed by compiling a set of target XPath queries. The first step in compiling a set of queries is to construct a prefix-tree, called an XP-tree, for the query set. The XP-tree is then transformed, by referring to the DTD of the target XML stream, into a set of lists called an XP-table.

Fig. 2 The overall system architecture.

Stream-side components are constructed at runtime. An incoming stream is transformed, one chunk at a time, into a stream relation called an SR through a SAX parser. Each tuple of the SR includes the prefix and text value of a fragment, and also includes the versions of the fragments that make up its prefix, in order to cope with non-linear queries. The result of matching the XP-table over an SR is written to a sparse matrix called a MEM, which is evaluated through the XP-expression of each query to output the final result of that query.

2.3 Query-Side Components

Query-side components consist of three data structures: XP-tree, XP-table and XP-expression. An XP-tree is a prefix-tree for representing a set of continuous XPath queries.

Definition 1. (XP-tree)
Given a prefix-tree T(Q) = (V_T, E_T) for a continuous query set Q, T(Q) is an XP-tree for the query set Q, denoted by XP-tree(Q), if the nodes in V_T satisfy the following:

• For a p-node v ∈ V_T, the node v keeps a set of (q, op, val) triples, denoted by qov_set(v), where q ∈ qset(v). op and val are respectively an operator and an operand associated with the attribute label(v) or the child text element, i.e., text(), of the element label(v) on the query q. For a triple (q, op, val) ∈ qov_set(v), the set of values that satisfies the op and val conditions is denoted by nqv_set_v(q).
• For a node v_1 ∈ vset(q) except v_R and a set of nodes V ⊂ vset(q), v_1 keeps a set of (q, V) pairs, denoted by bran_set(v_1), satisfying that q ∈ qset(v_1) and label(v_1) ∈ branching_q(label(v_2)) for any p-node v_2 ∈ V. □

Figure 3 shows the XP-tree for the five XPath queries in Fig. 1. In Fig. 3, qov_set(v_4) = {(q_1, null, null), (q_2, null, null)}, since the op and val parts of qov_set(v_4) are meaningless. The role of the node v_4 in the queries {q_1, q_2} is only to check whether the prefix corresponding to path(v_4) exists in a target XML stream. In this case, both nqv_set_{v_4}(q_1) and nqv_set_{v_4}(q_2) are (−∞, ∞).

Fig. 3 An XP-tree for five XPath queries in Fig. 1.

The bran_set(v) for the node v is kept in order to compare the versions of the branching element label(v) on the prefixes of the fragments corresponding to the branched p-nodes in a non-linear XPath query. For example, the node v_2 has bran_set(v_2) = {(q_1, {v_3, v_4}), (q_2, {v_4, v_5})} in the XP-tree of Fig. 3. The pair (q_1, {v_3, v_4}) indicates that the version of the element y on prefix_{q_1}(c) and the version of the element y on prefix_{q_1}(w) should be the same. The pair (q_2, {v_4, v_5}) is interpreted in the same way. Note that bran_set(v_s) is defined to be empty, since a seed element appears only once within a chunk, i.e., it always has the same version.

As described in Sect. 2.2, an XP-tree is transformed into a sequence of lists called an XP-table. Each list (row) in the XP-table is constructed by the f-sequence, which is converted from the XP-tree node.

Definition 2. (f-sequence)
Given an XP-tree(Q) = (V_T, E_T), a set of f-sequences fseq(v) for a node v ∈ V_T satisfies the following:

• If qov_set(v) = φ, fseq(v) = φ.
• Otherwise, for an f-sequence f ∈ fseq(v), f is a sequence of fragments, from the seed element to the fragment corresponding to the node v. The relationship between two adjacent fragments in the f-sequence f is always a parent-child relationship, making certain that each relationship between two adjacent nodes on the path(v) of the XP-tree(Q) always matches the relationship between corresponding fragments on the f-sequence f. □

The conversion from the node v to its fseq(v) is performed by referring to the DTD of a target XML stream. If there are neither ancestor-descendant edges nor wildcard nodes on the path(v), the conversion is straightforward. Otherwise, all the possible f-sequences that can match the path(v) should be searched on the specified DTD recursively in a depth-first manner, since the number of f-sequences in the fseq(v) can be more than one (non-deterministic). For example, since path(v_6) = R/x//p has an ancestor-descendant edge in the XP-tree of Fig. 3, fseq(v_6) has two members: xyp and xup. Identical f-sequences converted from different nodes in the XP-tree(Q) are shared, such as the f-sequence xyp converted from the three nodes {v_5, v_6, v_8} in the XP-tree of Fig. 3. This case is denoted by xyp ∈ fseq(V), where V = {v_5, v_6, v_8}.
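The depth-first search over the DTD can be sketched in Python as follows. This is only an illustration under stated assumptions: the DTD is reduced to a parent-to-children dictionary, the `dtd` contents below are a guess that merely reproduces the xyp and xup example (the actual DTD of Fig. 1 is not reproduced here), the DTD is non-recursive, and the function name `expand` is hypothetical.

```python
def expand(path_steps, dtd):
    """Return every parent-child-only label sequence that matches path_steps.

    path_steps: list of (axis, label) pairs starting at the seed element,
                where axis is "/" or "//" and label may be "*".
    dtd:        dict mapping an element name to the list of its child names.
    """
    results = set()

    def walk(prefix, steps):
        if not steps:
            results.add(tuple(prefix))
            return
        (axis, name), rest = steps[0], steps[1:]
        for child in dtd.get(prefix[-1], []):
            if name in ("*", child):
                # The child satisfies the pending step: consume the step.
                walk(prefix + [child], rest)
            if axis == "//":
                # Descend one more level while keeping the same pending step.
                walk(prefix + [child], steps)

    (_, seed), rest = path_steps[0], path_steps[1:]
    walk([seed], rest)
    return sorted(results)


# Assumed DTD fragment: x has children y and u; y has c, w, p; u has p.
dtd = {"x": ["y", "u"], "y": ["c", "w", "p"], "u": ["p"]}
fseq_v6 = expand([("/", "x"), ("//", "p")], dtd)
```

With a recursive DTD this search would not terminate, which is consistent with the depth and occurrence bounds mentioned under Constraints in Sect. 1.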

Each f-sequence converted from the nodes of XP-tree constructs its own list, which constitutes an XP-table.

Definition 3. (XP-table)
Given an XP-tree(Q) = (V_T, E_T), the XP-table for the query set Q, denoted by XP-table(Q), is defined as a sequence of lists. Each list corresponds to an f-sequence f where f ∈ fseq(V_T).

The ith list of the XP-table(Q), denoted by XP-table_i(Q), consists of its own f-sequence, denoted by f_i, and an ordered list denoted by B_i. Given the ordered-list constructor ρ and the jth cell b_j of B_i, the XP-table(Q) is defined as follows:

$$\text{XP-table}(Q) = \mathop{\rho}_{i=1}^{n} \text{XP-table}_i(Q) = \mathop{\rho}_{i=1}^{n} (f_i, B_i) = \mathop{\rho}_{i=1}^{n} \Bigl(f_i, \mathop{\rho}_{j=1}^{m} b_j\Bigr)$$

n is the number of distinct f-sequences in fseq(V_T), and m is the number of cells in the list B_i. The cell b_j ∈ B_i has its interval [lb_j, ub_j], where lb_j is a lower boundary and ub_j is an upper boundary, and its set of (q, v) pairs, denoted by qn_pair_{b_j}, where q ∈ qset(v) and v ∈ vset(q). For a pair (q, v) ∈ qn_pair_{b_j} and a value x ∈ [lb_j, ub_j], x ∈ nqv_set_v(q). □

In Fig. 4, the list whose f-sequence is xyw, which is converted from the node v_4 of the XP-tree in Fig. 3, has one cell with its interval (−∞, ∞) and its set of (q, v) pairs {(q_1, v_4), (q_2, v_4), (q_5, v_4)}. This means that the f-sequence xyw appearing in a target XML stream matches the queries {q_1, q_2, q_5}, whatever the child text value of the last element w is. Another example is the list whose f-sequence is xyp, which is converted from a set of nodes V = {v_5, v_6, v_8}. This list has four cells with intervals (−∞, 100), [100, 140), [140, 180] and (180, ∞) respectively, since the three members of qov_set(V) = {(q_2, ≥, 100), (q_3, ≥, 140), (q_4, ≤, 180)} divide the whole interval (−∞, ∞) into four sub-intervals with the boundaries 100, 140 and 180.
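One way to arrive at such cells is sketched below in Python. The construction is an assumption, not the paper's stated procedure: it splits the real line at every boundary value, evaluates each predicate on a sample point of every piece, and merges neighbouring pieces that satisfy the same queries. For brevity a cell stores query ids rather than (q, v) pairs, and the names `OPS` and `build_cells` are hypothetical.

```python
import operator

INF = float("inf")
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
       "<": operator.lt, "=": operator.eq}


def build_cells(triples):
    """triples: list of (query, op, val); op None means 'existence only'.

    Returns cells (lo, lo_closed, hi, hi_closed, queries) covering (-inf, inf).
    """
    bounds = sorted({val for _, op, val in triples if op is not None})
    # Elementary pieces: every boundary point plus the open gaps around them.
    pieces, prev = [], -INF
    for b in bounds:
        pieces.append((prev, False, b, False))   # open gap (prev, b)
        pieces.append((b, True, b, True))        # single point [b, b]
        prev = b
    pieces.append((prev, False, INF, False))

    def sample(lo, hi):
        """Pick one value inside the piece."""
        if lo == hi:
            return lo
        if lo == -INF:
            return hi - 1 if hi != INF else 0.0
        return lo + 1 if hi == INF else (lo + hi) / 2

    cells = []
    for lo, lo_c, hi, hi_c in pieces:
        x = sample(lo, hi)
        qs = frozenset(q for q, op, val in triples
                       if op is None or OPS[op](x, val))
        if cells and cells[-1][4] == qs:
            # Same query set as the previous piece: extend that cell.
            cells[-1] = (cells[-1][0], cells[-1][1], hi, hi_c, qs)
        else:
            cells.append((lo, lo_c, hi, hi_c, qs))
    return cells


cells_xyp = build_cells([("q2", ">=", 100), ("q3", ">=", 140), ("q4", "<=", 180)])
```

For the three triples of the xyp example this yields the four intervals quoted above: (−∞, 100), [100, 140), [140, 180] and (180, ∞).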

Fig. 4 The XP-table for the XP-tree in Fig. 3.

The final result of a query q ∈ Q can be acquired by an expression, denoted by node_exp(q), which combines the matching results of the nodes related to the query q. If the query q is non-linear, then for each pair (q, V) ∈ bran_set(v) in the XP-tree(Q), the versions of the element corresponding to the branching node v on the paths of the nodes in the node set V should be compared with each other. For these version comparisons for the query q, a new structure, denoted by ver_comp(q), is introduced.

Definition 4. (XP-expression)
Given an XP-tree(Q) = (V_T, E_T) for the query set Q and the stream S, an XP-expression(q) for the query q ∈ Q consists of a node_exp(q) and a ver_comp(q). Each is defined as follows:

• node_exp(q) = $\mathop{\varphi}_{v \in V_q} M(v, S_c)$ where ϕ = {∧, ∨, ¬} and q ∈ qset(v). M(v, S_c) returns true if the fseq(v) exists in the chunk S_c in the way that the value associated with the fragment label(v) is included in nqv_set_v(q). Otherwise, it returns false.
• ver_comp(q) = {(|fseq(v)|, vset(q)) | (q, vset(q)) ∈ bran_set(v)} where |fseq(v)| is the length of fseq(v). □

In order to find the location of the element corresponding to the node v on the f-sequence, |fseq(v)| in the ver_comp(q) is used.

2.4 Stream-Side Components

In the target system, stream-side components consist of a SAX parser and a stream relation, called an SR, produced by the SAX parser. Each tuple of SR corresponds to a fragment in an XML stream. Each tuple has three attributes: a sequence of fragments from the root element to the current fragment, the value of the current fragment and a list of versions for the fragment sequence.

Definition 5. (SR)
Given an XML stream S and a fragment f ∈ S, an SR for the stream S, denoted by SR(S), is a stream relation whose tuple corresponds to the fragment f. Each tuple in SR(S) has three attributes: <frag_seq>, <val>, and <ver_list>. Let fragseq(f) = f_1 f_2 ⋯ f be the sequence of fragments from the root element R to the fragment f on the stream S, val(f) be the value of the fragment f, and ver(f) be the version of the fragment f. The three attributes of the tuple corresponding to the fragment f in SR(S) are

• <frag_seq> = fragseq(f), <val> = val(f)
• <ver_list> = $\mathop{\rho}_{x=f_1}^{f} \mathrm{ver}(x)$ where x ∈ fragseq(f) and ρ is an ordered list.

val(f) is the value of the fragment f if f is an attribute. Otherwise, val(f) is the child text value of the fragment f. □

Fig. 5 An SR for the chunk C_1 of the stream S in Fig. 1.

An SR is constructed by parsing a target XML stream with a SAX parser. A SAX parser generates several types of events when it reads an XML document. The proposed system provides three call-back functions [startElement(), characters(), endElement()] corresponding to these event types. To build an SR from the stream S, it is enough to maintain only two additional global variables together with these call-back functions. One is a variable, denoted by fragseq_S, for keeping the fragseq(f) of the current fragment f, and the other is a variable, denoted by ver_S, for keeping the ver(f) of the fragment f. The implementations of these functions are simple, as follows:

• startElement(f)
  1. makes a new fragseq_S by concatenating the current fragseq_S and the current fragment f
  2. increments ver_S by 1, and sets ver(f) = ver_S
  3. inserts a new tuple (fragseq_S, NULL, <ver_list> for fragseq_S)
• endElement(f)
  1. excludes the current fragment f from fragseq_S
• characters(s)
  1. updates the attribute <val> of the tuple newly inserted in startElement(f) to s
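The three call-backs map naturally onto Python's standard `xml.sax.ContentHandler`, so a minimal sketch of SR construction for one chunk could look as follows. Several choices here are assumptions rather than details from the paper: the class name `SRBuilder` and helper `build_sr` are hypothetical, fragment sequences are stored as "/"-joined strings starting at the seed element, attributes are not turned into fragments, and a small stack of row indices is used so that text is attached to the element that is currently open.

```python
import xml.sax


class SRBuilder(xml.sax.ContentHandler):
    """Builds a stream relation (list of rows) for one chunk."""

    def __init__(self):
        super().__init__()
        self.frag_seq = []    # plays the role of fragseq_S
        self.ver = 0          # plays the role of ver_S
        self.ver_stack = []   # versions of the fragments in frag_seq
        self.open_rows = []   # row index of each currently open element
        self.rows = []        # [frag_seq, val, ver_list] per fragment

    def startElement(self, name, attrs):
        self.frag_seq.append(name)
        self.ver += 1
        self.ver_stack.append(self.ver)
        self.rows.append(["/".join(self.frag_seq), None, tuple(self.ver_stack)])
        self.open_rows.append(len(self.rows) - 1)

    def endElement(self, name):
        self.frag_seq.pop()
        self.ver_stack.pop()
        self.open_rows.pop()

    def characters(self, content):
        # SAX may deliver text in several pieces; skip pure whitespace.
        if content.strip():
            row = self.rows[self.open_rows[-1]]
            row[1] = (row[1] or "") + content


def build_sr(chunk_xml):
    """Parse one chunk given as a string and return its SR as a list of tuples."""
    handler = SRBuilder()
    xml.sax.parseString(chunk_xml.encode("utf-8"), handler)
    return [tuple(row) for row in handler.rows]


# Hypothetical chunk with a repeated element y under the seed element x.
sr = build_sr("<x><y><w>5</w></y><y><c>7</c></y></x>")
```

In this example the two y elements receive versions 2 and 4, so the rows for w and c carry the version lists (1, 2, 3) and (1, 4, 5); this difference is what later allows a non-linear query to tell the two y instances apart.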

Figure 5 shows the SR for the chunk C_1 of the stream S in Fig. 1. Since each SAX event is processed in O(1) time, building an SR from a target XML stream requires O(1) processing time per event.

2.5 Basic Matching Algorithm

An XPath query is repeatedly performed for each chunk in a target XML stream. Algorithm 1 shows the basic matching algorithm of the query set Q over the stream S. The matching process of Q and S is summarized as follows: visiting each list in the XP-table(Q) in sequence, find the tuples of SR(S_c) whose <frag_seq> is consistent with the f-sequence of the visited list, and scan the cells of the visited list that match the <val>'s of the found tuples (lines 5–13 in Algorithm 1).

The results of cell scanning are written to a matching evaluation matrix, called a MEM.

Definition 6. (MEM)
Given a set of XPath queries Q = {q_1, ..., q_n}, its XP-tree(Q) = (V_T, E_T), the current chunk S_c of the stream S and an ordered list ρ, a MEM for the query set Q over the chunk S_c, denoted by MEM(Q, S_c), is a sparse matrix, defined as follows:

$$\mathrm{MEM}(Q, S_c) = \mathop{\rho}_{i=1}^{n}\ \mathop{\rho}_{j=0}^{|V_T|-1} \bigl(q_i, v_j, \mathrm{verlists}(q_i, v_j)\bigr)$$

where q_i ∈ Q, v_j ∈ V_T, |V_T| is the number of nodes in the XP-tree(Q), and

$$\mathrm{verlists}(q_i, v_j) = \begin{cases} \Bigl\{ \mathop{\rho}_{x=f_1}^{f} \mathrm{ver}(x) \Bigm| \mathrm{fseq}(v_j) = f_1 \cdots f,\ f = \mathrm{label}(v_j) \Bigr\} & \text{if } v_j \in \mathrm{vset}(q_i), \\ \phi & \text{if } v_j \notin \mathrm{vset}(q_i). \end{cases}$$

□

Algorithm 1 Basic matching algorithm of XPath queries over an XML stream.
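The probing step summarized above (visit each list, find the SR tuples with the same <frag_seq>, scan the cells, record version lists in the MEM) might be sketched in Python as follows. This is not Algorithm 1 itself. The data layout is assumed: the XP-table is a dictionary from f-sequence to cells of the form (lo, lo_closed, hi, hi_closed, pairs), the SR is a list of (frag_seq, val, ver_list) tuples, the MEM is a dictionary keyed by (query, node), and the function names and sample data are hypothetical.

```python
from collections import defaultdict

INF = float("inf")


def value_in_cell(val, cell):
    """True if the fragment value falls inside the cell's interval."""
    lo, lo_closed, hi, hi_closed, _ = cell
    if lo == -INF and hi == INF:
        return True                      # existence test only, any value matches
    try:
        x = float(val)
    except (TypeError, ValueError):
        return False                     # missing or non-numeric value
    above = x >= lo if lo_closed else x > lo
    below = x <= hi if hi_closed else x < hi
    return above and below


def match_chunk(xp_table, sr):
    """Probe the SR with every XP-table list and fill a sparse MEM."""
    by_seq = defaultdict(list)           # hash index on <frag_seq>
    for frag_seq, val, ver_list in sr:
        by_seq[frag_seq].append((val, ver_list))

    mem = defaultdict(list)              # (query, node) -> list of ver_lists
    for f_seq, cells in xp_table.items():
        for val, ver_list in by_seq.get(f_seq, []):
            for cell in cells:
                if value_in_cell(val, cell):
                    for pair in cell[4]:
                        mem[pair].append(ver_list)
    return dict(mem)


# Hypothetical XP-table with two lists and a four-row SR.
xp_table = {
    "x/y/w": [(-INF, False, INF, False, {("q1", "v4")})],
    "x/y/p": [(-INF, False, 100, False, set()),
              (100, True, 140, False, {("q2", "v5")})],
}
sr = [("x", None, (1,)), ("x/y", None, (1, 2)),
      ("x/y/w", None, (1, 2, 3)), ("x/y/p", "120", (1, 2, 4))]
mem = match_chunk(xp_table, sr)
```

Indexing the SR by <frag_seq> is what makes each probe a constant-time hash lookup on average; scanning the cells linearly is a simplification, and a binary search over the sorted intervals would serve the same purpose.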

After a MEM(Q, S_c) is constructed, for each query q ∈ Q, the expression node_exp(q) is evaluated as follows: each M(q, v) in node_exp(q) is evaluated as true if verlists(q, v) ≠ φ in the element (q, v, verlists(q, v)) ∈ MEM(Q, S_c), and its result is combined logically with the others (line 16 in Algorithm 1). If the expression node_exp(q) is true, the expression ver_comp(q) is evaluated (lines 17–23 in Algorithm 1) in order to output the final matching result of the query q for the chunk S_c.

Fig. 6 MEM and XP-expression for the query result.

Figure 6 shows the MEM, the XP-expression and final results for matching five queries with the chunk C_1 of the stream S in Fig. 1. In Fig. 6, the final result of the query q_1 is false, in spite of M(q_1, v_1) ∧ M(q_1, v_3) ∧ M(q_1, v_4) = true, since η(verlists(q_1, v_3), 2) = η({1−10−11}, 2) = {10} and η(verlists(q_1, v_4), 2) = η({1−4−7}, 2) = {4}.
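The version comparison in this example can be reproduced with a few lines of Python. The reading of η used here is an assumption inferred from the example: η(verlists, k) is taken to collect the k-th version (counting from 1) of every version list, and a branching test is taken to succeed when the sets obtained for all branched nodes share at least one version. The function names `eta` and `same_branch_instance` are hypothetical.

```python
def eta(verlists, k):
    """Collect the k-th (1-based) version of every version list."""
    return {ver_list[k - 1] for ver_list in verlists}


def same_branch_instance(mem, query, nodes, k):
    """True if some single instance of the branching element (position k on
    the f-sequence) lies on a matched path of every node in nodes."""
    sets = [eta(mem.get((query, v), []), k) for v in nodes]
    return bool(set.intersection(*sets)) if sets else True


# Version lists quoted in the text for q1: 1-10-11 for v3 and 1-4-7 for v4.
mem_q1 = {("q1", "v3"): [(1, 10, 11)], ("q1", "v4"): [(1, 4, 7)]}
q1_result = same_branch_instance(mem_q1, "q1", ["v3", "v4"], 2)
```

Here the two sets are {10} and {4}, their intersection is empty, and the check fails, which agrees with the final result false reported for q_1. A query with several nested branching points would need one such check per entry of ver_comp(q), and a set intersection per level may be too permissive in that case, so this should be read as a sketch of the idea only.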

3. Estimations

A query having p p-nodes on the XP-tree(Q) is called a p-degree query. Given a set of N p-degree queries Q = {q_1, ..., q_n}, an XML stream S and an XP-tree(Q) = (V_T, E_T), let M be the number of lists in the XP-table(Q), f_i be the f-sequence of the list XP-table_i(Q), V_XP-table = {v | qov_set(v) ≠ φ, v ∈ V_T}, and V_{f_i} = {v | f_i ∈ fseq(v), v ∈ V_T}. In addition, let T_i be the number of (q, op, val) triples in qov_set(V_{f_i}), C_i be the number of distinct (op, val) pairs in qov_set(V_{f_i}), and

$$C = \frac{1}{M}\sum_{i=1}^{M} C_i .$$

C_i is also equal to the number of cells in the list XP-table_i(Q). It is supposed that the cells in the lists of XP-table(Q) are evenly distributed, i.e., C_i ≈ C for 1 ≤ i ≤ M, and that the (q, v) pairs in the cells of each list are also evenly distributed.

The size of XP-table is mainly decided by the number of (q, v) pairs in the cells of its lists. Since the number of f-sequences for a node v ∈ V_XP-table, where path(v) contains descendant-or-self axes or wildcards, can be more than one, qov_set(v) may be converted redundantly into several lists whose f-sequences correspond to the path(v). The redundancy coefficient concerned with this conversion, denoted by ξ, is defined as

$$\xi = \frac{\sum_{v \in V_{\text{XP-table}}} |\mathrm{fseq}(v)|}{|V_{\text{XP-table}}|} .$$

In qov_set(V_XP-table), the ratio of the number of (q, op, val) triples where op is an equal operator (=) to the total number of (q, op, val) triples is denoted by p_e. The ratio of the other cases, where op is a non-equal or null operator, is denoted by p_n. Since p_e and p_n cover all of the cases of (q, op, val) triples, p_e + p_n = 1.

Theorem 1. The size of XP-table(Q), denoted by |XP-table(Q)|, is measured by the total number of (q, v) pairs in the lists of the XP-table(Q) as follows:

$$|\text{XP-table}(Q)| \approx \xi N p \left(p_e + \frac{C}{2}\, p_n\right).$$

Proof. For a triple $(q, op, val) \in \mathit{qov\_set}(v)$ of a node $v \in V_{f_i}$, the triple is converted to a $(q, v)$ pair in the corresponding cell of the list XP-table$_i$(Q) if $op$ is an equality operator. Otherwise, the triple is converted to a $(q, v)$ pair redundantly in half of the cells in the list XP-table$_i$(Q) on average. Since the number of cells in XP-table$_i$(Q) is $C_i$,

$$|\text{XP-table}_i(Q)| = p_e T_i + p_n \frac{C_i}{2} T_i.$$

Therefore,

$$|\text{XP-table}(Q)| = \sum_{i=1}^{M} |\text{XP-table}_i(Q)| = \sum_{i=1}^{M} \left( p_e T_i + p_n \frac{C_i}{2} T_i \right) \approx \xi N p \left( p_e + \frac{C}{2}\, p_n \right)$$

by $\sum_{i=1}^{M} T_i = \xi N p$ and $C_i \approx C$. □
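The estimate of Theorem 1 can be checked numerically with a short sketch. The Python code below evaluates both the per-list sum used in the proof and the closed form, under the even-distribution assumption stated above. The function names and all parameter values are illustrative assumptions and do not correspond to the settings in Table 1.

```python
import math


def xp_table_list_size(t_i: float, c_i: float, p_e: float) -> float:
    """Size of one list: p_e*T_i + p_n*(C_i/2)*T_i, with p_n = 1 - p_e."""
    p_n = 1.0 - p_e
    return p_e * t_i + p_n * (c_i / 2.0) * t_i


def xp_table_size_estimate(xi: float, n: int, p: int, p_e: float, c: float) -> float:
    """Closed form of Theorem 1: xi*N*p*(p_e + (C/2)*p_n)."""
    p_n = 1.0 - p_e
    return xi * n * p * (p_e + (c / 2.0) * p_n)


# Illustrative values (assumptions): N = 1000 queries of degree p = 2,
# xi = 1.5, and M = 10 lists with C_i = 8 cells each.
# The triples are spread evenly, so T_i = xi*N*p / M = 300 per list.
N, P, XI, P_E, M, C = 1000, 2, 1.5, 0.75, 10, 8
lists = [(XI * N * P / M, C)] * M  # (T_i, C_i) for every list

exact = sum(xp_table_list_size(t_i, c_i, P_E) for t_i, c_i in lists)
approx = xp_table_size_estimate(XI, N, P, P_E, C)
print(exact, approx)
```

When every list has exactly $C$ cells, the per-list sum and the closed form coincide. With uneven $C_i$ the closed form is only an approximation.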

At runtime, the required memory is determined by an SR and a MEM. The average number of fragments in the current chunk $S_c$ of the stream $S$, denoted by $|S_c|$, differs to some extent from the total number of descendant-or-self fragments of the seed element defined in the DTD of the stream $S$, denoted by $|S_{dtd}|$, because of optional and repeated fragments. The average ratio of $|S_c|$ to $|S_{dtd}|$ is defined as the fragment repetition coefficient for the stream $S$, denoted by $\tau$.

Theorem 2. The size of MEM(Q, $S_c$), denoted by $|\text{MEM}(Q, S_c)|$, is measured by the total number of verlists in the successfully matched elements constituting the MEM(Q, $S_c$), as follows:

$$|\text{MEM}(Q, S_c)| \approx \tau N p \left( \frac{p_e}{C} + \frac{p_n}{2} \right).$$

Proof. The number of all possible verlists in the elements of MEM(Q, $S_c$) is $\tau N p$. For a query $q \in Q$, the average matching ratio of a p-node $v$ having an equality operator in its $\mathit{qov\_set}(v)$ is $1/C$, since the corresponding $(q, v)$ pair exists in only one cell and $C_i \approx C$ for $1 \le i \le M$ by the above suppositions. On the other hand, the average matching ratio of a p-node $v$ having a non-equality or null operator in its $\mathit{qov\_set}(v)$ is $1/2$, since the corresponding $(q, v)$ pair may exist in half of the cells in the corresponding list on average. Therefore, the average matching ratio of each element in the MEM(Q, $S_c$) is $\frac{p_e}{C} + \frac{p_n}{2}$. □

As shown in Algorithm 1, query processing in the target system consists of two parts: the matching process of an XP-table over an SR and an evaluation for a MEM. If an SR is implemented as a hash table whose key is a <frag_seq>, it takes O(1) to probe the SR from an XP-table. Also, it takes O(1) to write the probing result of the SR into a MEM if the MEM is implemented as a sparse array or a hash table whose key is a pair $(q_i, v_j)$.

Theorem 3. Given a total probing time constant $H_{SR(S_c)}$ for the SR($S_c$), a total writing time constant $H_{MEM(Q,S_c)}$ for the MEM(Q, $S_c$), and the selectivity (matching ratio) $s$ of the query set $Q$ for the chunk $S_c$, the processing time of the query set $Q$ for the chunk $S_c$, denoted by $t(Q, S_c)$, is

$$t(Q, S_c) \approx M \log_2 C + N p\,(\xi + 2 s \tau) \left( \frac{p_e}{C} + \frac{p_n}{2} \right) + H_{SR(S_c)} + H_{MEM(Q,S_c)}.$$

Proof. The total time for searching each target cell in the XP-table(Q) is approximately $M \log_2 C$ because the cells of the lists can be searched in a binary manner. Since the time that it takes to visit each $(q, v)$ pair in the target cell of the list XP-table$_i$ is proportional to the number of $(q, v)$ pairs in the cell, it can be expressed as the average number of $(q, v)$ pairs in the cell: $\frac{T_i}{C_i} p_e + \frac{T_i}{2} p_n$. Therefore, the total time for visiting all of the $(q, v)$ pairs in the target cells in the XP-table(Q) is

$$\sum_{i=1}^{M} \left( \frac{T_i}{C_i}\, p_e + \frac{T_i}{2}\, p_n \right) \approx \xi N p \left( \frac{p_e}{C} + \frac{p_n}{2} \right).$$

The number of matched queries in the query set $Q$ is $sN$. Comparing the versions of branching elements of a matched query $q \in Q$ can be performed among the verlists of nodes in $\mathit{ver\_comp}(q)$, in the sort-merge join manner. Since the overhead of version comparison is the highest when the p-nodes in the query $q$ exist in a complete canonical form in their XP-tree(Q), the maximum version comparison time is

$$sN \cdot 2(p - 1)\,\tau \left( \frac{p_e}{C} + \frac{p_n}{2} \right) \approx 2 s N p\, \tau \left( \frac{p_e}{C} + \frac{p_n}{2} \right).$$ □
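The estimates of Theorems 2 and 3 share the matching-ratio factor $\frac{p_e}{C} + \frac{p_n}{2}$, so they can be evaluated together. The following Python sketch does this. The function names and the parameter values are hypothetical, the result of the time estimate is an abstract operation count rather than seconds, and the two hashing constants are simply passed in as numbers.

```python
import math


def matching_ratio(p_e: float, c: float) -> float:
    """Average matching ratio p_e/C + p_n/2, with p_n = 1 - p_e."""
    return p_e / c + (1.0 - p_e) / 2.0


def mem_size_estimate(tau: float, n: int, p: int, p_e: float, c: float) -> float:
    """Theorem 2: |MEM(Q, S_c)| ~ tau*N*p*(p_e/C + p_n/2)."""
    return tau * n * p * matching_ratio(p_e, c)


def processing_time_estimate(m: int, c: float, n: int, p: int, xi: float,
                             s: float, tau: float, p_e: float,
                             h_sr: float = 0.0, h_mem: float = 0.0) -> float:
    """Theorem 3: M*log2(C) + N*p*(xi + 2*s*tau)*(p_e/C + p_n/2) + H_SR + H_MEM."""
    search = m * math.log2(c)  # binary search for the target cell of each list
    visit_and_compare = n * p * (xi + 2.0 * s * tau) * matching_ratio(p_e, c)
    return search + visit_and_compare + h_sr + h_mem


# Illustrative values (assumptions, not the settings of Table 1).
mem = mem_size_estimate(tau=0.5, n=1000, p=2, p_e=0.5, c=8)
t = processing_time_estimate(m=10, c=8, n=1000, p=2, xi=1.5,
                             s=0.25, tau=0.5, p_e=0.5)
print(mem, t)
```

Because $C$ appears in the denominator of the equality term, increasing the number of cells per list lowers both estimates whenever $p_e > 0$, apart from the logarithmic search term.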

4. Experimental Results

The techniques proposed in this paper are implemented in C, and all of the experiments are performed on a Pentium III 1 GHz processor with 1 GB of main memory, running on Linux 9.0. The experiments are run on a 9.71 MB XML document (2,500 chunks) from the Protein dataset (pir.georgetown.edu), whose maximum depth is 7. In order to apply exactly the concept that an XML stream is the infinite repetition of XML chunks, the Protein dataset and its DTD are slightly modified so that the seed element is <ProteinEntry>, which is defined as a repeated element, in place of <ProteinDatabase>. Text values in this dataset are replaced with varied values using a SAX parser, in order to vary the selectivity of the target query set. In order to generate a set of synthetic XPath queries, a modified version of YFilter's XPath generator (yfilter.cs.berkeley.edu) is used, which generates predicate conditions using text values that can be true on at least some of the target XML chunks. The settings of several experimental parameters introduced in Sect. 3 are given in Table 1.

Figure 7 shows the size of XP-table(Q), $|\text{XP-table}(Q)|$, for the given query set $Q$, varying the number of queries in $Q$. In addition, this figure shows how much the probability of a descendant-or-self axis and wildcard, denoted by $p_d$, affects $|\text{XP-table}(Q)|$. The figure shows that non-deterministic queries do not severely degrade the performance of the target system, although the extent of the effect may vary somewhat with the experimental dataset.

Table 1 Experimental parameters.

Fig. 7 Memory requirements for the XP-table.

In Fig. 8, the size of MEM(Q, $S_c$), $|\text{MEM}(Q, S_c)|$, is measured as an average size per chunk over the 2,500 chunks constituting the Protein dataset, and the selectivity $s$ of the query set is measured as an average selectivity per chunk. The $|S_{dtd}|$ of the dataset is 131. Figure 8 (a) shows $|\text{MEM}(Q, S_c)|$ and the selectivity $s$ of the query set for the chunk $S_c$ according to the fragment repetition coefficient $\tau$ introduced in Theorem 2, varying the number of queries in $Q$. This figure shows that the fragment repetition coefficient $\tau$, which indirectly represents the size of the chunk, is one of the important factors determining the runtime space and the query selectivity for the chunk: a chunk whose fragments are repeated frequently with changing text values can increase both the number of verlists kept in the MEM(Q, $S_c$) and the matching ratio of each query in $Q$. Figure 8 (b) shows $|\text{MEM}(Q, S_c)|$ and the selectivity $s$ of the query set for the chunk $S_c$, varying the number of cells per list in the XP-table(Q). As shown in this figure, both $|\text{MEM}(Q, S_c)|$ and the selectivity $s$ vary in inverse proportion to the number of cells per list $C$, since an increase in the number of cells per list means a decrease in the number of $(q, v)$ pairs per cell, where $q \in Q$ and $v \in \mathit{vset}(q)$, which diminishes the matching ratios of the elements constituting the XP-expression(q) for the query $q$.

The resulting processing time of each experiment in Fig. 9 is the total time that it takes to process the whole dataset of 2,500 chunks. To simplify these experiments, a new parameter $\mu$, such that $\mu = p(\xi + 2s\tau)$, is introduced. With Theorem 3, the processing time $t(Q, S)$ in each experiment in Fig. 9 can be expressed as follows:


Fig. 8 Runtime memory requirements for the MEM(Q, $S_c$): (a) varying $N$ and $\tau$; (b) varying $C$ and $N$.

Fig. 9 Processing time: (a) varying $N$ and $\mu$; (b) varying $C$ and $N$.

$$t(Q, S) = \sum_{k=1}^{2500} t(Q, S_k) = \sum_{k=1}^{2500} \left\{ M \log_2 C + N \mu \left( \frac{p_e}{C} + \frac{p_n}{2} \right) + H_{SR(S_k)} + H_{MEM(Q,S_k)} \right\}$$

Figure 9 (a) shows $t(Q, S)$ according to the parameter $\mu$, varying the number of queries in the query set $Q$. As shown in this figure, $t(Q, S)$ increases in proportion to the number of queries in $Q$ as well as to the parameter $\mu$, as asserted in Theorem 3. Figure 9 (b) shows $t(Q, S)$ according to the number of cells per list in the XP-table(Q), together with the $t(Q, S)$ that results from applying YFilter's algorithm [4] to 2,500 queries under the same conditions. The experimental results in this figure are due to the decrease of $s$ as $C$ increases, similarly to the case of Fig. 8 (b). On the other hand, there is little difference in the processing time of YFilter in spite of the decrease of $s$.

Figure 10 shows the processing time per 100 chunks in the Protein dataset, according to the number of queries in a given query set $Q$. In this experiment, the time that it takes to process each newly arriving batch of 100 chunks is recorded, as the query set $Q$ is processed for the dataset arriving in a streaming manner. As shown in this figure, the processing time for each batch of 100 chunks is practically steady for every query set, although there are some variations during the process due to the differences in the size of the stream chunks.

Fig. 10 Processing time per 100 stream chunks.
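A measurement of this kind can be sketched as follows. The Python code below is only an illustration of recording the elapsed time per batch of 100 chunks; it is not the C implementation used in the experiments, and `time_per_batch` and the `process_chunk` callback are hypothetical names standing in for the query processing of one chunk.

```python
import time
from typing import Callable, Iterable, List, TypeVar

Chunk = TypeVar("Chunk")


def time_per_batch(chunks: Iterable[Chunk],
                   process_chunk: Callable[[Chunk], object],
                   batch_size: int = 100) -> List[float]:
    """Return the elapsed time (seconds) for each full batch of chunks.

    Chunks are consumed one by one, as they would arrive on a stream.
    A trailing partial batch is not reported.
    """
    timings: List[float] = []
    count = 0
    start = time.perf_counter()
    for chunk in chunks:
        process_chunk(chunk)  # hypothetical: evaluate the query set on this chunk
        count += 1
        if count % batch_size == 0:
            now = time.perf_counter()
            timings.append(now - start)
            start = now
    return timings


# Usage with a placeholder processor over 2,500 dummy chunks.
batch_times = time_per_batch(range(2500), lambda chunk: None)
print(len(batch_times))
```

Plotting `batch_times` against the batch index gives a curve of the kind described above, where a flat curve indicates a steady processing rate.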

5. Conclusions

In this paper, data structures and algorithms for minimizing the runtime workload of multiple continuous XPath queries over XML streams are proposed. Non-deterministic situations in XPath queries can be resolved before runtime by query transformations that refer to the DTD. Non-linear queries with predicates can be processed simply by evaluating logical expressions and comparing the versions of branching elements on the paths of the queries. Also, various experiments verify the performance estimations, confirming the scalability and stability of the system. The proposed system provides a foundation for advanced topics such as XQuery processing [12], stream join algorithms [10], and aggregation and approximation such as load shedding [11]. Future research will explore extending the system to support these facilities.

References

[1] J. Clark and S. DeRose, “XML Path Language (XPath) version 1.0,” W3C Recommendation, http://www.w3.org/TR/1999/REC-xpath-19991116, 1999.

[2] M. Altinel and M.J. Franklin, “Efficient filtering of XML documents for selective dissemination of information,” Proc. VLDB Conf., pp.53–64, 2000.

[3] C.Y. Chan, P. Felber, M. Garofalakis, and R. Rastogi, “Efficient filtering of XML documents with XPath expressions,” VLDB Journal, vol.11, pp.354–379, 2002.

[4] Y. Diao, P. Fischer, M.J. Franklin, and R. To, “YFilter: Efficient and scalable filtering of XML documents,” Proc. ICDE, pp.341–342, 2002.

[5] N. Bruno, L. Gravano, N. Koudas, and D. Srivastava, “Navigation- vs. index-based XML multi-query processing,” Proc. ICDE, pp.139–150, 2003.

[6] Y. Diao and M. Franklin, “Query processing for high-volume XML message brokering,” Proc. VLDB Conf., 2003.

[7] A.K. Gupta and D. Suciu, “Stream processing of XPath queries with predicates,” SIGMOD, pp.419–430, 2003.

[8] R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma, “Query processing, resource management, and approximation in a data stream management system,” Proc. CIDR Conf., pp.245–256, 2003.

[9] T.J. Green, G. Miklau, M. Onizuka, and D. Suciu, “Processing XML streams with deterministic automata,” ICDT, pp.173–189, 2003.

[10] L. Ding, E.A. Rundensteiner, and G. Heineman, “MJoin: A metadata-aware stream join operator,” DEBS, 2003.

[11] N. Tatbul, U. Cetintemel, S. Zdonik, M. Cherniack, and M. Stonebraker, “Load shedding in a data stream manager,” Proc. VLDB Conf., 2003.

[12] V. Josifovski, M. Fontoura, and A. Barta, “Querying XML streams,” VLDB Journal, vol.14, pp.197–210, 2004.

Hyunho Lee received the B.S. and M.S. degrees in the Department of Computer Science from Yonsei Univ., Korea. He is currently a doctoral candidate in the same Department. His current research interests include XML streams, XML query processing and ubiquitous computing.

Wonsuk Lee received the B.S. degree in Computer Engineering from Boston University, Boston and the M.S. and Ph.D. degrees in Electrical and Computer Engineering from Purdue University, West Lafayette, IN. He is currently a professor of the Department of Computer Science at Yonsei Univ., Korea. His current research interests include data streams, stream query processing and data mining.