
On Efficient Matching of Streaming XML Documents and Queries

Laks V.S. Lakshmanan1

P. Sailaja2

1 University of British Columbia, Canada

2 Indian Inst. of Tech., Bombay, India (work performed while visiting IIT-Bombay).

UBC, Canada EDBT 2002, Prague.

Outline

I. Motivating Applications

II. Problem

III. Dual Index

IV. Algorithms

V. Experiments

VI. Summary & Future Work


Motivating Application 1

Information dissemination in the large

Numerous data sources on the web

Traditional means: search and browse

Alternative: publish and subscribe

System matches (new) data to subscribers’ interests

Periodic notification


Motivating Application 2

Supply chain automation

Catalog of products and services from suppliers (data)

Registered sets of requirements (subscriptions) from manufacturing units

Notify relevant consumers upon arrival of new data

Other applications include electronic auctioning, online shopping, etc.


Problem

Matching specifications (of products, services, etc.) to requirements (subscriptions) efficiently.

Specs are akin to data; requirements are akin to queries.

Data may stream through.

Quickly determine which subscribers/users a piece of data is relevant to.


Problem

Traditional setting:
Large DB
One (at most a few) query at a time

Our problem:
A small DB (a tuple, XML doc, etc.)
Large no. of queries
Dual to traditional problem

Focus of this paper: data = XML docs; queries = a fragment of XPath


Problem (Formalized)

Given:
an XML document
a large number of XPath queries

Determine which queries are answered by each element (formalized using matching).

Query labeling: label each node with sets of queries answered by the subtree rooted there.

Naïve approach doesn’t scale with the no. of queries.

Main challenge: a small no. of passes (1 or 2) over the data tree.


An Example Query

<Result>
  FOR $p IN document(“catalog.xml”)//part,
      $b IN $p/brand,
      $q IN $p//part
  WHERE A2D IN $q/name AND AMD IN $q/brand
  RETURN $p
</Result>
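One way to read the query above is as a tree pattern whose edges are either parent-child (`/`) or ancestor-descendant (`//`), with the returned variable as the distinguished node. The Python sketch below encodes that reading. The `QueryNode` class, its field names, and the treatment of the constants A2D and AMD as values required at leaves are illustrative assumptions, not notation from the talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QueryNode:
    """Hypothetical tree-pattern node (names are assumptions for illustration)."""
    tag: str
    edge: Optional[str] = None      # edge to the parent: "pc" for '/', "ad" for '//'
    value: Optional[str] = None     # constant required at a leaf, if any
    distinguished: bool = False     # True for the node the query returns
    children: List["QueryNode"] = field(default_factory=list)


# $q IN $p//part, with A2D IN $q/name and AMD IN $q/brand
q = QueryNode("part", edge="ad", children=[
    QueryNode("name", edge="pc", value="A2D"),
    QueryNode("brand", edge="pc", value="AMD"),
])

# $p IN document(...)//part, $b IN $p/brand, RETURN $p
p = QueryNode("part", edge="ad", distinguished=True, children=[
    QueryNode("brand", edge="pc"),
    q,
])


def edges(node):
    """Yield (parent tag, edge type, child tag) for every edge below node."""
    for child in node.children:
        yield (node.tag, child.edge, child.tag)
        yield from edges(child)
```

Calling `list(edges(p))` walks the pattern top-down, which makes the mix of pc- and ad-edges in a single query explicit.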


Problem (An Example)


Dual Index

Traditional index: quickly localize search for data matching query pattern

Dual index: for each primitive pattern, determine (sub)queries to which they are relevant

Choice of primitive patterns depends on type of data (e.g., XML vs. relational)

And on classes of queries considered (e.g., chains vs. trees)


Tree Dual Index

Primitive “access path” questions to be answered:

For a constant c, what are leaf appearances?

For a tag t, what are non-leaf appearances? What query nodes are its pc- and ad-children?

Example:

[Figure: query trees P and Q. Nodes (tag, number) by level. P: a1; b2, c3; a4; b5, c6. Q: b1; c2, a6; c3; b5, a4.]

Index entry for a:

DI(a)[L]: (P, 3, {}), (Q, 6, {4,6})

DI(a)[N]: (P, 1, F, {2,3}, {}), (P, 4, T, {6}, {5}).
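A minimal Python sketch of how such index entries could be derived from the two query trees is given below. Several things in it are assumptions rather than facts from the slide: the edges and edge types of P and Q are reconstructed from the figure layout, the boolean in a non-leaf entry is taken to mean "attached to its parent by an ad-edge" (the slide does not define it), and a leaf entry is emitted for every tag a query uses, even with an empty node set, to mirror the (P, 3, {}) entry. The function name `build_dual_index` is hypothetical.

```python
from collections import defaultdict

# node id -> (tag, parent id, edge to parent); edge types are assumptions.
P = {1: ("a", None, None), 2: ("b", 1, "pc"), 3: ("c", 1, "pc"),
     4: ("a", 3, "ad"), 5: ("b", 4, "ad"), 6: ("c", 4, "pc")}
Q = {1: ("b", None, None), 2: ("c", 1, "pc"), 3: ("c", 2, "pc"),
     4: ("a", 3, "pc"), 5: ("b", 3, "ad"), 6: ("a", 1, "ad")}
QUERIES = {"P": (P, 3), "Q": (Q, 6)}  # query id -> (tree, distinguished node)


def build_dual_index(queries):
    """Return (leaf, nonleaf) maps from tag to index entries."""
    leaf = defaultdict(list)     # tag -> [(query, distinguished, leaf nodes)]
    nonleaf = defaultdict(list)  # tag -> [(query, node, B, pc-children, ad-children)]
    for qid, (tree, dist) in queries.items():
        pc, ad = defaultdict(set), defaultdict(set)
        for n, (_tag, par, edge) in tree.items():
            if par is not None:
                (pc if edge == "pc" else ad)[par].add(n)
        leaves_by_tag = defaultdict(set)
        for n, (tag, _par, edge) in tree.items():
            if pc[n] or ad[n]:
                # B is assumed to flag an ad-edge to the parent.
                nonleaf[tag].append(
                    (qid, n, edge == "ad", frozenset(pc[n]), frozenset(ad[n])))
            else:
                leaves_by_tag[tag].add(n)
        for tag in {t for t, _p, _e in tree.values()}:
            leaf[tag].append((qid, dist, frozenset(leaves_by_tag[tag])))
    return leaf, nonleaf


LEAF, NONLEAF = build_dual_index(QUERIES)
```

Under these assumptions, `LEAF["a"]` and `NONLEAF["a"]` hold the same tuples as the DI(a)[L] and DI(a)[N] entries listed on the slide, with T and F written as `True` and `False`.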


Tree Labeling Algorithm – 3 Lists

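[Figure: example data tree. Nodes (tag, number) by level: a1; b2, c9; c3, a8, d10; c4, a11, b15; a5, d6, c12, d13; b7, b14.]

3 lists (conceptually), kept for each data node u:

TML(u): tuples (query, query node, distinguished node (DN), answer node)

PL(u): entries (P,l,m,x):rel, where rel is child or desc

QL(u): query ids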


Tree Labeling Algo. – TML base case

[Figure: the example data tree, as on the “3 Lists” slide.]

(P, m, {v1, …, vk}) ∈ DI(t)[L] ⇒ (P,v1,m,?), …, (P,vk,m,?) ∈ TML(u), whenever u.tag = t.

If vi = m, ? is replaced by u.

E.g.: DI(a)[L] has (Q,6,{4,6}).

So, add (Q,4,6,?) and (Q,6,6,?) to TML(u) for every data node u with tag a.

(Q,6,6,?) becomes (Q,6,6,i) at node i, i = 1, 5, 8, 11.
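The base case can be sketched in a few lines of Python. The node tags are those of the data tree in the figure and the single index entry is the one quoted on the slide. The function name `tml_base` and the use of `None` for the unknown answer node "?" are assumptions made for illustration.

```python
from collections import defaultdict

# Data tree from the slides: node number -> tag.
TAG = {1: "a", 2: "b", 3: "c", 4: "c", 5: "a", 6: "d", 7: "b", 8: "a",
       9: "c", 10: "d", 11: "a", 12: "c", 13: "d", 14: "b", 15: "b"}

# Leaf part of the dual index, restricted to the entry quoted on the slide.
DI_LEAF = {"a": [("Q", 6, {4, 6})]}


def tml_base(tag_of, di_leaf):
    """Seed TML(u) for every data node u from the leaf entries for u's tag."""
    tml = defaultdict(set)
    for u, t in tag_of.items():
        for query, m, leaf_nodes in di_leaf.get(t, []):
            for v in leaf_nodes:
                # The answer node is known only when the leaf is the
                # distinguished node itself; None stands for '?'.
                answer = u if v == m else None
                tml[u].add((query, v, m, answer))
    return tml


TML = tml_base(TAG, DI_LEAF)
```

With this input only the four a-nodes (1, 5, 8, 11) receive tuples, and each gets one tuple for query node 4 and one for query node 6.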


Tree Labeling Algo. – TML → PL

[Figure: the example data tree, as on the “3 Lists” slide.]

(P,l,m,x) ∈ TML(u) ⇒

(P,l,m,x):child ∈ PL(parent(u)).

(P,l,m,x):desc ∈ PL(anc(u)).

E.g.: (Q,4,6,?) ∈ TML(5).

So, (Q,4,6,?):child ∈ PL(4).

And (Q,4,6,?):desc ∈ PL(i), i = 3, 2, 1.

Optimizations possible, but suppressed.
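The propagation rule can be sketched as follows. The parent map is reconstructed from the figure layout and from the examples on the slides, so it is an assumption, as are the function name `push_up` and the representation of a PL entry as a `(tuple, rel)` pair. None of the optimizations mentioned above are attempted: every tuple is simply copied to every ancestor.

```python
from collections import defaultdict

# Parent of each data node (assumed reconstruction of the figure).
PARENT = {1: None, 2: 1, 3: 2, 4: 3, 5: 4, 6: 4, 7: 6, 8: 2,
          9: 1, 10: 9, 11: 10, 12: 11, 13: 11, 14: 13, 15: 10}


def push_up(u, tml_u, parent, pl):
    """Copy the tuples of TML(u) into the PL lists of u's ancestors."""
    p = parent[u]
    if p is None:          # the root has nothing above it
        return
    for entry in tml_u:
        pl[p].add((entry, "child"))
    a = parent[p]
    while a is not None:   # every ancestor above the parent
        for entry in tml_u:
            pl[a].add((entry, "desc"))
        a = parent[a]


PL = defaultdict(set)
push_up(5, {("Q", 4, 6, None)}, PARENT, PL)
```

After the call, node 4 holds the tuple tagged `child` and nodes 3, 2 and 1 hold it tagged `desc`, as in the example.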


Tree Labeling Algo. – TML inductive case

[Figure: the example data tree, as on the “3 Lists” slide.]

(P,l,B,C,D) ∈ DI(t)[N], and

∀ c ∈ C: (P,c,m,y):child ∈ PL(u), and

∀ d ∈ D: (P,d,m,y):rel ∈ PL(u)

⇒ (P,l,m,x) ∈ TML(u).

If l = m, x is u.

E.g.: (P,4,T,{6},{5}) ∈ DI(a)[N].

(P,6,3,?) ∈ TML(12), so (P,6,3,?):child ∈ PL(11). Similarly, (P,5,3,?):desc ∈ PL(11).

So, (P,4,3,?) ∈ TML(11).
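A hedged Python sketch of the inductive step at a single data node is shown below. The slide does not say how x is obtained when l differs from m; the sketch assumes it is inherited from whichever child tuple already carries a known answer node. The function name `tml_inductive`, the `(tuple, rel)` encoding of PL entries and `None` for "?" are likewise assumptions. The boolean B is carried along but not interpreted.

```python
def tml_inductive(u, tag_u, pl_u, di_nonleaf):
    """Return the TML tuples derivable at data node u from PL(u)."""
    derived = set()
    for query, l, _b, pc_children, ad_children in di_nonleaf.get(tag_u, []):
        by_child, by_any, m = {}, {}, None
        for (q, c, dist, y), rel in pl_u:
            if q != query:
                continue
            m = dist                       # one distinguished node per query
            by_any.setdefault(c, set()).add(y)
            if rel == "child":
                by_child.setdefault(c, set()).add(y)
        if m is None:
            continue
        # pc-children must be matched by children, ad-children by any descendant.
        if not all(c in by_child for c in pc_children):
            continue
        if not all(d in by_any for d in ad_children):
            continue
        if l == m:
            derived.add((query, l, m, u))
            continue
        answers = set()
        for c in pc_children:
            answers |= by_child[c]
        for d in ad_children:
            answers |= by_any[d]
        answers.discard(None)
        for x in answers or {None}:        # assumption: inherit a known answer
            derived.add((query, l, m, x))
    return derived


DI_NONLEAF = {"a": [("P", 1, False, {2, 3}, set()),
                    ("P", 4, True, {6}, {5})]}
PL_11 = {(("P", 6, 3, None), "child"), (("P", 5, 3, None), "desc")}
RESULT_11 = tml_inductive(11, "a", PL_11, DI_NONLEAF)
```

For the slide's example, `RESULT_11` contains the single tuple for query node 4 of P with an unknown answer node, because the distinguished node 3 does not lie below node 4.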


Tree Labeling Algo. – QL

[Figure: the example data tree, as on the “3 Lists” slide.]

• TML and PL feed each other.

• QL – special case of TML.

• P ∈ QL(x) iff (P,1,m,x) ∈ TML(u) for some data node u.

• E.g.: (P,1,3,9) ∈ TML(1), so P ∈ QL(9); and (Q,1,6,5) ∈ TML(2), so Q ∈ QL(5).
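Read against the two examples, a query is recorded at the data node that answers it once its root (query node 1) has matched somewhere with a known answer node. The Python sketch below derives QL from a given TML under that reading. It scans all tuples rather than modeling a top-down pass, and the function name `query_labels` and the use of `None` for "?" are assumptions.

```python
from collections import defaultdict


def query_labels(tml, root=1):
    """Map each answer node x to the queries whose root tuple names x."""
    ql = defaultdict(set)
    for _u, tuples in tml.items():
        for query, l, _m, x in tuples:
            if l == root and x is not None:
                ql[x].add(query)
    return ql


# TML tuples taken from the examples on the slides.
TML_EXAMPLE = {
    1: {("P", 1, 3, 9)},
    2: {("Q", 1, 6, 5)},
    11: {("P", 4, 3, None)},   # not a root tuple, so it yields no label
}
QL = query_labels(TML_EXAMPLE)
```

Here P is attached to node 9 and Q to node 5, while the non-root tuple at node 11 contributes nothing.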


Tree Labeling – Summary

Labeling completed in two passes:
pass 1: compute TML/PL (bottom-up)
pass 2: compute QL (top-down)

No. of I/O invocations is 2 × (no. of data tree nodes).

Other algorithms in paper:
chain labeling
chain split labeling of trees


Experiments

matchMaker implementation:
JDK 1.3 and C++
storage: BerkeleyDB 3.17
dual index stored on disk
lists manipulated in memory
Intel PIII, 1GB RAM, 512K cache, Linux 7.0

Data sets: generated using IBM’s XML Gen tool, conforming to the GEDCOM DTD (genealogical data; about 120 elements)


Experiments

Document depth 10; avg fanout in [2, 5]

Chain labeling algorithm is at least 5 times faster than the query-at-a-time approach

For tree labeling, query-at-a-time doesn’t produce results in reasonable time!

Focus of experiments (for trees): direct tree labeling algorithm vs. chain split algorithm (not discussed)



Related Work

Documents – user profile match (IR)

Notion of standing queries has a long history: e.g., Tapestry, TriggerMan, NiagaraCQ, etc.

Publish-and-subscribe – Fabret et al. 00, 01. Patterns: boolean combo of relOp comp value

XFilter 00, 01.
Only determines if a doc contains an answer
Multiple answers in one doc not considered


Related Work

XTrie approach:
Decompose query tree into ad-free chains
Index using trie
Determine only if a doc contains an answer

Main distinguishing features of matchMaker:
Answers located
Multiple answers per doc
All proposed algorithms have guaranteed resource bounds (e.g., #passes, I/O)


Summary & Future Work

Matching large no. of queries to XML data trees (as they stream through)

Dual to usual query processing

Dual index (chains vs. trees)

Algorithms for query labeling of data trees

Making algorithms more efficient (single pass algorithm for chains: done)

Expanding classes of queries handled

Algebra for this dual query processing problem?