On Efficient Matching of Streaming XML Documents and Queries

Laks V.S. Lakshmanan (1)

P. Sailaja (2)

(1) University of British Columbia, Canada

(2) Indian Inst. of Tech., Bombay, India (work performed while visiting IIT-Bombay).

UBC, Canada EDBT 2002, Prague. 2

Outline

I. Motivating Applications

II. Problem

III. Dual Index

IV. Algorithms

V. Experiments

VI. Summary & Future Work


Motivating Application 1

• Information dissemination in the large
  • Numerous data sources on the web
  • Traditional means: search and browse
  • Alternative: publish and subscribe
  • System matches (new) data to subscribers' interests
  • Periodic notification


Motivating Application 2

• Supply chain automation
  • Catalog of products and services from suppliers (data)
  • Registered sets of requirements (subscriptions) from manufacturing units
  • Notify relevant consumers upon arrival of new data
• Other applications include electronic auctioning, online shopping, etc.


Problem

• Matching specifications (of products, services, etc.) to requirements (subscriptions) efficiently.
• Specs are akin to data; requirements are akin to queries.
• Data may stream through.
• Quickly determine which subscribers/users a piece of data is relevant to.


Problem

• Traditional setting:
  • Large DB
  • One (at most a few) query at a time
• Our problem:
  • A small DB (a tuple, XML doc, etc.)
  • Large no. of queries
  • Dual to traditional problem
• Focus of this paper:
  • Data = XML docs
  • Queries = a fragment of XPath


Problem (Formalized)

• Given
  • an XML document
  • a large number of XPath queries
• Determine which queries are answered by each element (formalized using matching)
• Query labeling: label each node with the sets of queries answered by the subtree rooted there
• Naïve approach doesn't scale with the no. of queries
• Main challenge: a small number (1 or 2) of passes over the data tree
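To make the scaling concern concrete, here is a minimal sketch of a query-at-a-time baseline, assuming Python's standard `xml.etree.ElementTree` and its limited XPath subset. The document, the query ids and the function name `naive_label` are hypothetical illustrations, not taken from the paper. Each query triggers its own traversal of the document, so the work grows with the number of queries.

```python
import xml.etree.ElementTree as ET


def naive_label(root, queries):
    """Evaluate every query separately and record which elements it selects."""
    labels = {}  # element -> set of ids of the queries that select it
    for qid, path in queries.items():
        # One complete evaluation over the document per query.
        for elem in root.findall(path):
            labels.setdefault(elem, set()).add(qid)
    return labels


# Hypothetical catalog document and two hypothetical path queries.
doc = ET.fromstring(
    "<catalog><part><brand>AMD</brand>"
    "<part><name>A2D</name><brand>AMD</brand></part></part></catalog>"
)
queries = {"P": ".//part", "Q": ".//part/brand"}
labels = naive_label(doc, queries)

for elem, qids in labels.items():
    print(elem.tag, sorted(qids))
```

With thousands of registered queries this loop would revisit the same elements thousands of times, which is the cost a shared index over the queries is meant to avoid.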


An Example Query

<Result>
  FOR $p IN document("catalog.xml")//part,
      $b IN $p/brand,
      $q IN $p//part
  WHERE A2D IN $q/name AND AMD IN $q/brand
  RETURN $p
</Result>


Problem (An Example)


Dual Index

• Traditional index: quickly localize search for data matching a query pattern
• Dual index: for each primitive pattern, determine the (sub)queries to which it is relevant
• Choice of primitive patterns depends on the type of data (e.g., XML vs. relational)
• And on the classes of queries considered (e.g., chains vs. trees)


Tree Dual Index

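• Primitive “access path” questions to be answered:
  • For a constant c, what are its leaf appearances?
  • For a tag t, what are its non-leaf appearances? What query nodes are its pc- (parent-child) and ad- (ancestor-descendant) children?
• Example:

[Figure: query trees P and Q; each node is written as tag and node id, with "/" separating tree levels]
P: a 1 / b 2, c 3 / a 4 / b 5, c 6
Q: b 1 / c 2, a 6 / c 3 / b 5, a 4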

Index entry for a:

DI(a)[L]: (P, 3, {}), (Q, 6, {4,6})* *

DI(a)[N]: (P, 1, F, {2,3}, {}),

(P, 4, T, {6}, {5}).
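A rough sketch of how such an index could be built from a set of query trees. The layout is an assumption for illustration: each query is a dict mapping a node id to (tag, pc-children, ad-children), the query "H" is hypothetical, and the boolean flag and the set component that appear in the slide's entries are left out because their meaning is not spelled out here.

```python
from collections import defaultdict


def build_dual_index(queries):
    """Map each tag to the query nodes where it appears, split into leaf (L) and non-leaf (N)."""
    index = defaultdict(lambda: {"L": [], "N": []})
    for qid, nodes in queries.items():
        for node_id, (tag, pc_children, ad_children) in nodes.items():
            if pc_children or ad_children:
                # Non-leaf appearance: remember which query nodes hang below it.
                index[tag]["N"].append(
                    (qid, node_id, set(pc_children), set(ad_children))
                )
            else:
                # Leaf appearance: nothing below it in the query tree.
                index[tag]["L"].append((qid, node_id))
    return index


# Hypothetical query H: node 1 (tag a) has pc-child 2 (tag b) and ad-child 3 (tag c).
queries = {"H": {1: ("a", [2], [3]), 2: ("b", [], []), 3: ("c", [], [])}}
di = build_dual_index(queries)

print(di["a"]["N"])
print(di["b"]["L"])
```

Looking up a data element's tag in this structure returns every query node it could match in one step, rather than probing each query in turn.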


Tree Labeling Algorithm – 3 Lists

[Figure: data tree; each node is written as tag and node id, one tree level per line]
a 1
b 2, c 9
c 3, a 8, d 10
c 4, a 11, b 15
a 5, d 6, c 12, d 13
b 7, b 14

3 lists (conceptually)

TML(u): (Query, query node, DN, ans-node)

PL(u): (P,l,m,x): rel

QL(u): Query Ids


Tree Labeling Algo. – TML base case

[Figure: the same data tree as on the "3 Lists" slide (nodes 1–15)]

(P, m, {v1, …, vk}) ∈ DI(t)[L]
⇒ (P, v1, m, ?), …, (P, vk, m, ?) ∈ TML(u), whenever u.tag = t.

E.g.: DI(a)[L] has (Q,6,{4,6}). So, add (Q,4,6,?) and (Q,6,6,?) to TML(u) for each data node u with tag a.

(Q,6,6,?) becomes (Q,6,6,i), i = 1, 5, 8, 11.

If vi = m, ? = u.


Tree Labeling Algo. – TML → PL

[Figure: the same data tree as on the "3 Lists" slide (nodes 1–15)]

(P,l,m,x) ∈ TML(u) ⇒
  (P,l,m,x):child ∈ PL(parent(u)),
  (P,l,m,x):desc ∈ PL(anc(u)).

E.g.: (Q,4,6,?) ∈ PL(5). So, (Q,4,6,?):child ∈ PL(4), and (Q,4,6,?):desc ∈ PL(i), i = 3, 2, 1.

Optimizations possible, but suppressed.
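The propagation step can be sketched as follows. The parent map is a reconstruction of the data tree figure, assuming node ids were assigned in preorder; the function name `propagate` and the tuple layout are illustrative choices. As in the slide's example, the parent receives the entry with relation child, and the remaining ancestors receive it with relation desc.

```python
from collections import defaultdict

# Parent of each data node (assumed reconstruction of the figure, preorder ids).
PARENT = {2: 1, 9: 1, 3: 2, 8: 2, 4: 3, 5: 4, 6: 4, 7: 6,
          10: 9, 11: 10, 15: 10, 12: 11, 13: 11, 14: 13}


def propagate(u, entry, parent, pl):
    """Push an entry found at data node u into the PL lists of u's parent and ancestors."""
    p = parent.get(u)
    if p is None:
        return  # u is the root: nothing above it
    pl[p].add((entry, "child"))
    a = parent.get(p)
    while a is not None:
        pl[a].add((entry, "desc"))
        a = parent.get(a)


pl = defaultdict(set)
propagate(5, ("Q", 4, 6, "?"), PARENT, pl)

for node in sorted(pl):
    print(node, pl[node])
```

Running it for node 5 places the entry in PL(4) as child and in PL(3), PL(2) and PL(1) as desc, matching the example on the slide. A literal implementation like this walks the whole ancestor chain for every entry, which is presumably where the suppressed optimizations come in.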


Tree Labeling Algo. – TML inductive case

[Figure: the same data tree as on the "3 Lists" slide (nodes 1–15)]

(P,l,B,C,D) ∈ DI(t)[N],
∀c ∈ C: (P,c,m,y):child ∈ PL(u), and
∀d ∈ D: (P,d,m,y):rel ∈ PL(u)
⇒ (P,l,m,x) ∈ TML(u).

If l = m, x = u.

E.g.: (P,4,T,{6},{5}) ∈ DI(a)[N].

(P,6,3,?) ∈ TML(12), so (P,6,3,?):child ∈ PL(11). Similarly, (P,5,3,?):desc ∈ PL(11).

So, (P,4,3,?) ∈ TML(11).
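A simplified bottom-up pass in the spirit of the base and inductive cases, offered as a sketch rather than the paper's algorithm. It tracks only which query nodes match at each data node and drops the m and x components of the tuples. The tree encoding assumes preorder node ids, and the query H (a with a pc-child c and an ad-child b) is a hypothetical pattern.

```python
from collections import defaultdict

# Data tree from the figure (assumed reconstruction, preorder ids).
TAG = {1: "a", 2: "b", 3: "c", 4: "c", 5: "a", 6: "d", 7: "b", 8: "a",
       9: "c", 10: "d", 11: "a", 12: "c", 13: "d", 14: "b", 15: "b"}
CHILDREN = {1: [2, 9], 2: [3, 8], 3: [4], 4: [5, 6], 6: [7],
            9: [10], 10: [11, 15], 11: [12, 13], 13: [14]}

# Hypothetical query H: node 1 (tag a) with pc-child 2 (tag c) and ad-child 3 (tag b).
QUERY = {1: ("a", [2], [3]), 2: ("c", [], []), 3: ("b", [], [])}


def label_bottom_up(root, tag, children, query):
    """Return, for each data node, the set of query nodes that match there."""
    tml = defaultdict(set)          # query nodes matched at the node itself
    below_child = defaultdict(set)  # query nodes matched at some child
    below_any = defaultdict(set)    # query nodes matched at some proper descendant

    def visit(u):
        for c in children.get(u, []):
            visit(c)  # children are fully labeled before their parent
            below_child[u] |= tml[c]
            below_any[u] |= tml[c] | below_any[c]
        for qnode, (qtag, pc, ad) in query.items():
            if (qtag == tag[u]
                    and all(c in below_child[u] for c in pc)
                    and all(d in below_any[u] for d in ad)):
                tml[u].add(qnode)

    visit(root)
    return tml


tml = label_bottom_up(1, TAG, CHILDREN, QUERY)
matches = sorted(u for u in TAG if 1 in tml[u])
print(matches)
```

On this tree the root of H matches at data nodes 1 and 11. The leaf a nodes 5 and 8 have nothing below them, so they fail the child and descendant conditions.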


Tree Labeling Algo. – QL

[Figure: the same data tree as on the "3 Lists" slide (nodes 1–15)]

• TML and PL feed each other.
• QL: special case of TML.
• P ∈ QL(u) iff (P,1,m,x) ∈ TML(x).
• E.g.: (P,1,3,9) ∈ TML(1), so P ∈ QL(9); and (Q,1,6,5) ∈ TML(2), so Q ∈ QL(5).


Tree Labeling – Summary

• Labeling completed in two passes:
  • pass 1: compute TML/PL (bottom-up)
  • pass 2: compute QL (top-down)
• No. of I/O invocations is 2 × (no. of data tree nodes).
• Other algorithms in paper:
  • chain labeling
  • chain split labeling of trees


Experiments

• matchMaker implementation:
  • JDK 1.3 and C++
  • storage: BerkeleyDB 3.17
  • dual index stored on disk
  • lists manipulated in memory
  • Intel PIII, 1GB RAM, 512K cache, Linux 7.0
• Data sets:
  • generated using IBM's XML Gen tool
  • conforming to GEDCOM DTD (genealogical data) (about 120 elements)


Experiments

• Document depth 10; avg fanout: [2, 5]
• Chain labeling algorithm is at least 5 times faster than the query-at-a-time approach
• For tree labeling, query-at-a-time doesn't produce results in reasonable time!
• Focus of experiments (for trees): direct tree labeling algorithm vs. chain split algorithm (not discussed)


Related Work

• Documents – user profile match (IR)
• Notion of standing queries has a long history: e.g., Tapestry, TriggerMan, NiagaraCQ, etc.
• Publish-and-subscribe: Fabret et al. 00, 01. Patterns: boolean combo of relOp comp value
• XFilter 00, 01.
  • Only determine if doc contains an answer
  • Multiple answers in one doc not considered


Related Work

• XTrie approach
  • Decompose query tree into ad-free chains
  • Index using trie
  • Determine only if a doc contains an answer
• Main distinguishing features of matchMaker:
  • Answers located
  • Multiple answers per doc
  • All proposed algorithms have guaranteed resource bounds (e.g., #passes, I/O)


Summary & Future Work

• Matching large no. of queries to XML data trees (as they stream through)
• Dual to usual query processing
• Dual index (chains vs. trees)
• Algorithms for query labeling of data trees
• Making algorithms more efficient (single pass algorithm for chains: done)
• Expanding classes of queries handled
• Algebra for this dual query processing problem?
