On Efficient Matching of Streaming XML Documents and Queries

Laks V.S. Lakshmanan¹

P. Sailaja²

¹ University of British Columbia, Canada

² Indian Inst. of Tech., Bombay, India (work performed while visiting IIT-Bombay).

UBC, Canada EDBT 2002, Prague. 2

Outline

I. Motivating Applications

II. Problem

III. Dual Index

IV. Algorithms

V. Experiments

VI. Summary & Future Work

Motivating Application 1

• Information dissemination in the large
  • Numerous data sources on the web
  • Traditional means: search and browse
  • Alternative: publish and subscribe
  • System matches (new) data to subscribers’ interests
  • Periodic notification

Motivating Application 2

• Supply chain automation
  • Catalog of products and services from suppliers (data)
  • Registered sets of requirements (subscriptions) from manufacturing units
  • Notify relevant consumers upon arrival of new data
• Other applications include electronic auctioning, online shopping, etc.

Problem

• Matching specifications (of products, services, etc.) to requirements (subscriptions) efficiently.
• Specs are akin to data; requirements are akin to queries.
• Data may stream through.
• Quickly determine which subscribers/users a piece of data is relevant to.

Problem

• Traditional setting:
  • Large DB
  • One (at most a few) query at a time
• Our problem:
  • A small DB (a tuple, XML doc, etc.)
  • Large no. of queries
  • Dual to traditional problem
• Focus of this paper:
  • Data = XML docs
  • Queries = a fragment of XPath

Problem (Formalized)

• Given
  • an XML document
  • a large number of XPath queries
• Determine which queries are answered by each element (formalized using matching)
• Query labeling: label each node with sets of queries answered by the subtree rooted there
• Naïve approach doesn’t scale with the no. of queries
• Main challenge: a small number (1 or 2) of passes over the data tree
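To make the baseline concrete, here is a minimal sketch of the naïve query-at-a-time labeling that the slide says does not scale. It uses Python's `xml.etree.ElementTree`, which supports only a limited XPath subset; the catalog document, the query ids, and the `label_naive` helper are hypothetical and not taken from the paper.

```python
import xml.etree.ElementTree as ET


def label_naive(root, queries):
    """Evaluate each query on its own; return {element: set of query ids}."""
    labels = {}
    for qid, path in queries.items():
        # One full evaluation over the document per registered query.
        for elem in root.findall(path):
            labels.setdefault(elem, set()).add(qid)
    return labels


# Hypothetical document and subscriptions, for illustration only.
doc = ET.fromstring(
    "<catalog>"
    "<part><name>A2D</name><brand>AMD</brand></part>"
    "<part><name>X1</name><brand>Intel</brand></part>"
    "</catalog>"
)
queries = {
    "q1": ".//part",
    "q2": ".//part[brand='AMD']",
    "q3": ".//part/name",
}

labels = label_naive(doc, queries)
for elem, qids in labels.items():
    print(elem.tag, sorted(qids))
```

Each additional query costs another evaluation over the document, which is the cost the dual index is meant to avoid.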

An Example Query

<Result>
  FOR $p IN document("catalog.xml")//part,
      $b IN $p/brand,
      $q IN $p//part
  WHERE A2D IN $q/name AND AMD IN $q/brand
  RETURN $p
</Result>

Problem (An Example)

Dual Index

• Traditional index: quickly localize search for data matching query pattern
• Dual index: for each primitive pattern, determine the (sub)queries to which it is relevant
• Choice of primitive patterns depends on the type of data (e.g., XML vs. relational)
• And on the classes of queries considered (e.g., chains vs. trees)

Tree Dual Index

Primitive “access path” questions to be answered:
• For a constant c, what are its leaf appearances?
• For a tag t, what are its non-leaf appearances? What query nodes are its pc- and ad-children?

Example:

[Figure: example query trees P (nodes a 1, b 2, c 3, a 4, b 5, c 6) and Q (nodes b 1, c 2, c 3, a 4, b 5, a 6)]

Index entry for a:

DI(a)[L]: (P, 3, {}), (Q, 6, {4,6})* *

DI(a)[N]: (P, 1, F, {2,3}, {}),

(P, 4, T, {6}, {5}).
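As a rough illustration of how such an index could be assembled from the registered queries, here is a minimal Python sketch. It reads a leaf entry as (query, distinguished node, leaf nodes carrying the tag), which is how the TML base-case rule later uses it, and a non-leaf entry as (query, node, pc-children, ad-children). The boolean flag in the slide's non-leaf entries is omitted because its meaning is not spelled out here, and entries with an empty node set such as (P, 3, {}) are not emitted. The tree encoding, the `build_dual_index` helper, and every edge not implied by the DI(a)[N] entries are assumptions.

```python
from collections import defaultdict

# Hypothetical encoding: node id -> (tag, pc_children, ad_children).
# The children of P's nodes 1 and 4 come from the DI(a)[N] entries above;
# every other edge is a guess, since the slide's figure is not recoverable.
P_NODES = {
    1: ("a", [2, 3], []),
    2: ("b", [], []),
    3: ("c", [], [4]),      # guessed edge
    4: ("a", [6], [5]),
    5: ("b", [], []),
    6: ("c", [], []),
}
Q_NODES = {
    1: ("b", [2], [6]),     # guessed edges
    2: ("c", [3], []),      # guessed edge
    3: ("c", [4], [5]),     # guessed edges
    4: ("a", [], []),
    5: ("b", [], []),
    6: ("a", [], []),
}
# Query name -> (nodes, distinguished node).
QUERIES = {"P": (P_NODES, 3), "Q": (Q_NODES, 6)}


def build_dual_index(queries):
    """Return (leaf, nonleaf) dictionaries keyed by tag."""
    leaf = defaultdict(list)
    nonleaf = defaultdict(list)
    for name, (nodes, dn) in queries.items():
        leaves_by_tag = defaultdict(set)
        for node, (tag, pc, ad) in nodes.items():
            if pc or ad:
                # Non-leaf appearance: record its pc- and ad-children.
                nonleaf[tag].append((name, node, set(pc), set(ad)))
            else:
                leaves_by_tag[tag].add(node)
        for tag, leaf_nodes in leaves_by_tag.items():
            # Leaf appearances of this tag, grouped per query.
            leaf[tag].append((name, dn, leaf_nodes))
    return leaf, nonleaf


leaf, nonleaf = build_dual_index(QUERIES)
print(leaf["a"])      # leaf appearances of tag a
print(nonleaf["a"])   # non-leaf appearances of tag a
```

Keying both dictionaries by tag means a data node only ever looks up the entries for its own tag, regardless of how many queries are registered.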

Tree Labeling Algorithm – 3 Lists

[Figure: data tree, listed level by level from the root: a 1 | b 2, c 9 | c 3, a 8, d 10 | c 4, a 11, b 15 | a 5, d 6, c 12, d 13 | b 7, b 14]

3 lists (conceptually)

TML(u): (Query, query node, DN, ans-node)

PL(u): (P,l,m,x): rel

QL(u): Query Ids

Tree Labeling Algo. – TML base case

[Figure: data tree, listed level by level from the root: a 1 | b 2, c 9 | c 3, a 8, d 10 | c 4, a 11, b 15 | a 5, d 6, c 12, d 13 | b 7, b 14]

(P, m, {v1, …, vk}) ∈ DI(t)[L]
⇒ (P, v1, m, ?), …, (P, vk, m, ?) ∈ TML(u), whenever u.tag = t;

e.g.: DI(a)[L] has (Q, 6, {4, 6}).
So, add (Q, 4, 6, ?) and (Q, 6, 6, ?) to TML(u) for each data node u tagged a.
(Q, 6, 6, ?) becomes (Q, 6, 6, i) for i = 1, 5, 8, 11.

If vi = m, ? = u.
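The base case can be sketched in a few lines of Python. This is only an illustration of the rule on this slide: the `TAGS` table transcribes the tags of the data tree, `None` stands for the unbound answer node "?", and the `tml_base` helper and the dictionary layout are assumptions rather than the paper's data structures.

```python
# Tags of the 15 data nodes, as listed in the slide's data tree.
TAGS = {1: "a", 2: "b", 3: "c", 4: "c", 5: "a", 6: "d", 7: "b", 8: "a",
        9: "c", 10: "d", 11: "a", 12: "c", 13: "d", 14: "b", 15: "b"}

# DI(t)[L] entries as (query, m, {v1, ..., vk}); only the Q entry for tag a.
DI_LEAF = {"a": [("Q", 6, {4, 6})]}


def tml_base(u, tags, di_leaf):
    """Base-case TML tuples for data node u; None stands for '?'."""
    tuples = set()
    for query, m, leaf_nodes in di_leaf.get(tags[u], []):
        for v in leaf_nodes:
            # The answer node is bound to u only when v is the node m.
            answer = u if v == m else None
            tuples.add((query, v, m, answer))
    return tuples


for u in (1, 5, 8, 11):
    print(u, sorted(tml_base(u, TAGS, DI_LEAF), key=str))
```

Data nodes whose tag has no leaf entry, such as node 2, get an empty base-case set.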

Tree Labeling Algo. – TML → PL

[Figure: data tree, listed level by level from the root: a 1 | b 2, c 9 | c 3, a 8, d 10 | c 4, a 11, b 15 | a 5, d 6, c 12, d 13 | b 7, b 14]

(P, l, m, x) ∈ TML(u) ⇒
(P, l, m, x):child ∈ PL(parent(u)),
(P, l, m, x):desc ∈ PL(anc(u)).

e.g.: (Q, 4, 6, ?) ∈ TML(5).
So, (Q, 4, 6, ?):child ∈ PL(4).
And (Q, 4, 6, ?):desc ∈ PL(i), i = 3, 2, 1.

Optimizations possible, but suppressed.
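A small Python sketch of this propagation step follows. It assumes the data nodes are numbered in preorder, which agrees with the example (node 5 has parent 4 and ancestors 3, 2, 1); the `PARENT` table and the `push_up` helper are hypothetical, and none of the suppressed optimizations are attempted.

```python
from collections import defaultdict

# Parent of each data node, assuming preorder numbering of the slide's tree.
PARENT = {1: None, 2: 1, 3: 2, 4: 3, 5: 4, 6: 4, 7: 6, 8: 2,
          9: 1, 10: 9, 11: 10, 12: 11, 13: 11, 14: 13, 15: 10}


def push_up(u, tml_u, parent, pl=None):
    """Add every tuple of TML(u) to the PL of u's parent and ancestors."""
    pl = defaultdict(set) if pl is None else pl
    rel, node = "child", parent[u]
    while node is not None:
        for t in tml_u:
            pl[node].add((t, rel))
        # Everything above the parent sees u as a descendant.
        rel, node = "desc", parent[node]
    return pl


pl = push_up(5, {("Q", 4, 6, None)}, PARENT)
for node in sorted(pl):
    print(node, pl[node])
```

Walking the whole ancestor chain for every tuple is the unoptimized form of the rule.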

Tree Labeling Algo. – TML inductive case

[Figure: data tree, listed level by level from the root: a 1 | b 2, c 9 | c 3, a 8, d 10 | c 4, a 11, b 15 | a 5, d 6, c 12, d 13 | b 7, b 14]

(P, l, B, C, D) ∈ DI(t)[N] &
∀c ∈ C: (P, c, m, y):child ∈ PL(u) &
∀d ∈ D: (P, d, m, y):rel ∈ PL(u)
⇒ (P, l, m, x) ∈ TML(u).

If l = m, x = u.

e.g.: (P, 4, T, {6}, {5}) ∈ DI(a)[N].
(P, 6, 3, ?) ∈ TML(12), so (P, 6, 3, ?):child ∈ PL(11). Similarly, (P, 5, 3, ?):desc ∈ PL(11).
So, (P, 4, 3, ?) ∈ TML(11).
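The inductive rule can be checked against the slide's example with a short Python sketch. Several things here are assumptions: the boolean flag B is dropped because its role is not explained on the slide, "rel" for ad-children is taken to mean either child or desc, and when l ≠ m the answer node is inherited from whichever child tuple has one bound (otherwise it stays `None`, standing for "?"). The `tml_inductive` helper is hypothetical.

```python
# PL(11) from the slide's example; None stands for '?'.
PL_11 = {
    (("P", 6, 3, None), "child"),
    (("P", 5, 3, None), "desc"),
}

# DI(a)[N] entries as (query, l, C, D), with the boolean flag left out.
DI_A_NONLEAF = [("P", 1, {2, 3}, set()), ("P", 4, {6}, {5})]


def tml_inductive(u, pl_u, nonleaf_entries):
    """Derive TML(u) tuples for the non-leaf query nodes that match at u."""
    derived = set()
    for query, l, pc, ad in nonleaf_entries:
        for m in {t[2] for t, _ in pl_u if t[0] == query}:
            def answers(node, rels):
                # Answer nodes of PL tuples for this query node and relation.
                return {t[3] for t, rel in pl_u
                        if t[:3] == (query, node, m) and rel in rels}

            hits = [answers(c, {"child"}) for c in pc]
            hits += [answers(d, {"child", "desc"}) for d in ad]
            if not all(hits):
                continue  # some pc- or ad-child has no match below u
            if l == m:
                xs = {u}
            else:
                xs = {y for h in hits for y in h if y is not None} or {None}
            derived |= {(query, l, m, x) for x in xs}
    return derived


print(tml_inductive(11, PL_11, DI_A_NONLEAF))
```

Query node 1 of P is rejected at data node 11 because PL(11) holds no tuples for its pc-children 2 and 3.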

Tree Labeling Algo. – QL

[Figure: data tree, listed level by level from the root: a 1 | b 2, c 9 | c 3, a 8, d 10 | c 4, a 11, b 15 | a 5, d 6, c 12, d 13 | b 7, b 14]

• TML and PL feed each other.
• QL is a special case of TML.
• P ∈ QL(x) iff (P, 1, m, x) ∈ TML(u) for some u.
• e.g.: (P, 1, 3, 9) ∈ TML(1), so P ∈ QL(9); and (Q, 1, 6, 5) ∈ TML(2), so Q ∈ QL(5).
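Read this way, QL falls out of the TML tuples whose query node is the query root. The Python sketch below assumes the root of every query is numbered 1, as in the slides, and uses `None` for an unbound answer node; the `query_labels` helper and the sample TML contents (taken from the examples on these slides) are for illustration only.

```python
from collections import defaultdict


def query_labels(tml):
    """Map each answer node x to the queries whose root tuple names x."""
    ql = defaultdict(set)
    for tuples in tml.values():
        for query, l, m, x in tuples:
            # Only tuples for the query root (node 1) with a bound answer count.
            if l == 1 and x is not None:
                ql[x].add(query)
    return ql


# TML contents taken from the examples on the slides.
tml = {
    1: {("P", 1, 3, 9)},
    2: {("Q", 1, 6, 5)},
    11: {("P", 4, 3, None)},
}
ql = query_labels(tml)
print(dict(ql))
```

The tuple at node 11 contributes nothing because it is for query node 4, not the root.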

Tree Labeling – Summary

• Labeling completed in two passes
  • Pass 1: compute TML/PL (bottom-up)
  • Pass 2: compute QL (top-down)
• No. of I/O invocations is 2 × (no. of data tree nodes).
• Other algorithms in paper:
  • Chain labeling
  • Chain split labeling of trees
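For readers who want to experiment, the following Python sketch labels the slides' data tree end to end. It is a simplified in-memory stand-in, not the paper's two-pass, disk-based algorithm: it folds both passes into one recursive traversal and carries sets of answer nodes upward instead of TML/PL tuples. The parent links assume preorder numbering, the query edge types are guesses chosen to agree with the slides' examples, and `label` and `merge` are hypothetical helpers.

```python
from collections import defaultdict

# Data tree: node -> (tag, parent). Parents assume preorder numbering.
DATA = {1: ("a", None), 2: ("b", 1), 3: ("c", 2), 4: ("c", 3), 5: ("a", 4),
        6: ("d", 4), 7: ("b", 6), 8: ("a", 2), 9: ("c", 1), 10: ("d", 9),
        11: ("a", 10), 12: ("c", 11), 13: ("d", 11), 14: ("b", 13),
        15: ("b", 10)}

# Query -> (nodes, distinguished node); node -> (tag, pc_children, ad_children).
# Edge types are guesses chosen to agree with the slides' examples.
QUERIES = {
    "P": ({1: ("a", [2, 3], []), 2: ("b", [], []), 3: ("c", [], [4]),
           4: ("a", [6], [5]), 5: ("b", [], []), 6: ("c", [], [])}, 3),
    "Q": ({1: ("b", [2], [6]), 2: ("c", [3], []), 3: ("c", [4], [5]),
           4: ("a", [], []), 5: ("b", [], []), 6: ("a", [], [])}, 6),
}


def merge(dst, src):
    """Union the answer sets of src into dst, key by key."""
    for key, answers in src.items():
        dst.setdefault(key, set()).update(answers)


def label(data, queries):
    children = defaultdict(list)
    root = None
    for u, (_, parent) in data.items():
        if parent is None:
            root = u
        else:
            children[parent].append(u)

    def visit(u):
        # (query, node) -> answer nodes, for matches at children of u
        # and at any proper descendant of u.
        child_hits, desc_hits = {}, {}
        for v in children[u]:
            at_v, sub_v = visit(v)
            merge(child_hits, at_v)
            merge(desc_hits, sub_v)
        at_u = {}
        for q, (nodes, dn) in queries.items():
            for l, (tag, pc, ad) in nodes.items():
                if tag != data[u][0]:
                    continue
                if not all((q, c) in child_hits for c in pc):
                    continue
                if not all((q, d) in desc_hits for d in ad):
                    continue
                answers = {u} if l == dn else set()
                for c in pc:
                    answers |= child_hits[(q, c)]
                for d in ad:
                    answers |= desc_hits[(q, d)]
                at_u[(q, l)] = answers
        merge(desc_hits, at_u)          # now covers u's whole subtree
        return at_u, desc_hits

    _, everywhere = visit(root)
    ql = defaultdict(set)
    for (q, l), answers in everywhere.items():
        if l == 1:                      # query roots are numbered 1
            for x in answers:
                ql[x].add(q)
    return ql


ql = label(DATA, QUERIES)
print({x: sorted(qs) for x, qs in sorted(ql.items())})
```

With these assumed trees the sketch labels node 9 with P and nodes 5 and 8 with Q; the first two agree with the examples on the QL slide, while the label on node 8 follows only from the guessed ad-edge in Q.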

Experiments

• matchMaker implementation:
  • JDK 1.3 and C++
  • Storage: BerkeleyDB 3.17
  • Dual index stored on disk
  • Lists manipulated in memory
  • Intel PIII, 1GB RAM, 512K cache, Linux 7.0
• Data sets:
  • Generated using IBM’s XML Gen tool
  • Conforming to GEDCOM DTD (genealogical data; about 120 elements)

Experiments

• Document depth 10; avg fanout in [2, 5]
• Chain labeling algorithm is at least 5 times faster than query-at-a-time approach
• For tree labeling, query-at-a-time doesn’t produce results in reasonable time!
• Focus of experiments (for trees): direct tree labeling algorithm vs. chain split algorithm (not discussed)

Related Work

• Documents – user profile match (IR)
• Notion of standing queries has a long history: e.g., Tapestry, TriggerMan, NiagaraCQ, etc.
• Publish-and-subscribe: Fabret et al. 00, 01.
  • Patterns: boolean combo of relOp comp value
• XFilter 00, 01.
  • Only determine if doc contains an answer
  • Multiple answers in one doc not considered

Related Work

• XTrie approach
  • Decompose query tree into ad-free chains
  • Index using trie
  • Determine only if a doc contains an answer
• Main distinguishing features of matchMaker:
  • Answers located
  • Multiple answers per doc
  • All proposed algorithms have guaranteed resource bounds (e.g., #passes, I/O)

Summary & Future Work

• Matching large no. of queries to XML data trees (as they stream through)
  • Dual to usual query processing
  • Dual index (chains vs. trees)
  • Algorithms for query labeling of data trees
• Making algorithms more efficient (single pass algorithm for chains: done)
• Expanding classes of queries handled
• Algebra for this dual query processing problem?