A Uniform and Layered Algebraic Framework for XQueries on XML Streams Hong Su Jinhui Jian Elke A....

Preview:

Citation preview

A Uniform and Layered Algebraic Framework for XQueries on XML Streams

Hong Su

Jinhui Jian

Elke A. Rundensteiner

Worcester Polytechnic Institute

CIKM, Nov 5, 2003

2

Need for Stream Processing

New computing environment Data sources can be anywhere/anytime

On-line arriving data Data requests can be anywhere/anytime

Real-time response requirement New applications

Relational Sensor networks

XML Analysis of XML web logs Selective dissemination of XML information (e.g., news)

3

What’s Special for XML Stream Processing<Biditems>

<book year=“2001">

<title>Dream Catcher</title>

<author><last>King</last><first>S.</first></author>

<publisher>Bt Bound </publisher>

<price> 30 </initial>

</book>

<biditems> <book> <title> Dream Catcher </title> …

Token-by-Token access manner

timeline

Pattern retrieval + Filtering + Restructuring

FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 20Return <Inexpensive> $t </Inexpensive>

Token: not a direct counterpart of a tuple

30Bt BoundS.KingDream2001

pricepublisherfirstlasttitleyear

Pattern Retrieval on Token Streams

4

Two Computation Paradigms

Automata-based [yfilter02, xscan01, xsm02, xsq03, xpush03…]

Algebraic [niagara00, …]

This Raindrop framework intends to integrate both paradigms into one

5

Automata-Based Paradigm

FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 20Return <Inexpensive> $t </Inexpensive>

1book*

2

4title

price

Auxiliary structures for:

1. Buffering data

2. Filtering

3. Restructuring

//book

//book/title

//book/price3

6

Algebraic Computation

book bookbook

title author

last first

publisher price

Text

Text Text

Text Text

<book> …</book>

<title>… </title>

<book>… </book>

Navigate$b, /title -> $t

Navigate $b, /price->$p

Navigate $b, /title->

$t

Tagger

Select $p < 30

Logic Plan

Navigate //$b, /title->$t

Rewrite by “pushing down selection”

Navigate $b,/price->$p

Select $p < 30

Tagger

Rewritten Logic Plan

Navigate-Index $b, /price -> $p

Select $p < 30

Tagger

Navigate-Scan $b, /title -> $t

Physical Plan

Choose low-level implementation alternatives

FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 20Return <Inexpensive> $t </Inexpensive>

$b

$b $t

7

Observations

Either paradigm has deficiencies

Both paradigms complement each other

Automata Paradigm Algebra Paradigm

Good for pattern retrieval on tokens Does not support token inputs

Need patches for filtering and restructuring

Good for filtering and restructuring

Present all details on same low level Support multiple descriptive levels (declarative->procedural)

Little studied as query processing paradigm

Well studied as query process paradigm

8

How to Integrate Two Paradigms

9

How to Integrate Two Models? Design choices

Extend algebraic paradigm to support automata? Extend automata paradigm to support algebra? Come up with completely new paradigm?

Extend algebraic paradigm to support automata Practical

Reuse & extend existing algebraic query processing engines Natural

Present details of automata computation at low level Present semantics of automata computation (target patterns)

at high level

10

Raindrop: Four-Level Framework

Semantics-focused PlanSemantics-focused Plan

Stream Logic PlanStream Logic Plan

Stream Physical PlanStream Physical Plan

Stream Execution PlanStream Execution Plan

Abstraction Level

High (Declarative)

Low (Procedural)

11

Level I: Semantics-focused Plan [Rainbow-ZPR02] Express query semantics regardless of store

d or stream input sources Reuse existing techniques for stored XML pr

ocessing Query parser Initial plan constructor Rewriting optimization

Decorrelation Selection push down …

12

FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 20Return <Inexpensive> $t </Inexpensive>

<Biditems>

<book year=“2001">

<title>Dream Catcher</title>

<author><last>King</last><first>S.</first></author>

<publisher>Bt Bound </publisher>

<price> 30 </initial>

</book>

$S1<Biditems> … </Biditems>

$S1<Biditems>… </Biditems>

$b<book> … </book>

<Biditems> … </Biditems>

<book> … </book>

$S1<Biditems>… </Biditems>

$b<book>… </book>

$p

30

<Biditems> … </Biditems>

<book>… </book>

$S1<Biditems>…

</Biditems>

$b<book>… </book>

$p 30

$t Dream

Catcher

<Biditems>… </Biditems>

<book>. .. </book>

… …

NavUnnest $S1, //book ->$b

NavNest $b, /price/text() ->$p

NavNest $b, /title ->$t

Select$p<30

Tagger “Inexpensive”, $t->$r

Example Semantics-focused Plan

13

Level II: Stream Logical Plan

Extend semantics-focused plan to accommodate tokenized stream inputs New input data format:

contextualized tokens New operators:

StreamSource, Nav, ExtractUnnest, ExtractNest, StructuralJoin

New rewrite rules: Push-into-Automata

14

One Uniform Algebraic View

Token-based plan (automata plan)

Tuple-based plan

Tuple stream

XML data stream

Query answer

Algebraic Stream Logical Plan

15

Modeling the Automata in Algebraic Plan:Black Box[XScan01] vs. White Box

$b := //book$p := $b/price$t := $b/title

Black Box

FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 20Return <Inexpensive> $t </Inexpensive>

XScan

StructuralJoin$b

ExtractNest $b, $p

ExtractNest $b, $t

White Box

Navigate $b, /price->$p

Navigate $b, /title->$t

Navigate $S1, //book ->$b

16

Example Uniform Algebraic Plan

FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 30Return <Inexpensive> $t </Inexpensive>

Tuple-based plan

Token-based plan (automata plan)

17

Example Uniform Algebraic Plan

FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 30Return <Inexpensive> $t </Inexpensive>

StructuralJoin$b

ExtractNest $b, $p

ExtractNest $b, $t

Navigate $b, /price->$p

Navigate $b, /title->$t

Navigate $S1, //book ->$b

Tuple-based plan

18

Example Uniform Algebraic Plan

FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 30Return <Inexpensive> $t </Inexpensive>

StructuralJoin$b

ExtractNest $b, $p

ExtractNest $b, $t

Navigate $b, /price->$p

Navigate $b, /title->$t

Navigate $S1, //book ->$b

Select$p<30

Tagger “Inexpensive”, $t->$r

19

From Semantics-focused Plan to Stream Logical Plan

StructuralJoin$b

ExtractNest $b, $p

ExtractNest $b, $t

Nav $b, /price/text()->$p

Nav $b, /title->$t

Nav $S1, //book ->$b

Select$p<30

Tagger “Inexpensive”, $t->$r

NavUnnest $S1, //book ->$b

NavNest $b, /price/text() ->$p

NavNest $b, /title ->$t

Select$p<30

Tagger “Inexpensive”, $t->$r

Apply “push into automata”

Apply “push into automata”

20

Level III: Stream Physical Plan

For each stream logical operator, define how to generate outputs when given some inputs Multiple physical implementations may be provided fo

r a single logical operator Automata details of some physical implementation are

exposed at this level Nav, ExtractNest, ExtractUnnest, Structural Join

21

One Implementation of Extract/Structural Join

1book

title*

4price

3

2

ExtractNest$b, $t

ExtractNest/$b, $p

SJoin//book

<title>…</title> <price>…</price>

<biditems> <book> <title> Dream Catcher </title> … </book>…

<price>…</price><title>…</title>

Nav $b, /price->$p

Nav $b, /title->$t

Nav ., //book ->$b

22

Level IV: Stream Execution Plan

Describe coordination between operators regarding when to fetch the inputs When input operator generates one output tuple When input operator generates a batch When a time period has elapsed …

Potentially unstable data arrival rate in stream makes fixed scheduling strategy unsuitable Delayed data under scheduling may stall engine Bursty data not under scheduling may cause overflow

23

Raindrop: Four-Level Framework (Recap)

Semantics-focused PlanSemantics-focused Plan

Stream Logic PlanStream Logic Plan

Stream Physical PlanStream Physical Plan

Stream Execution PlanStream Execution Plan

Express the semantics of query regardless of

input sources

Accommodate tokenized

input streams

Describe how operators manipulate

given data

Decides the Coordination

among operators

24

Optimization Opportunities

25

Optimization Opportunities

Semantics-focused PlanSemantics-focused Plan

Stream Logic PlanStream Logic Plan

Stream Physical PlanStream Physical Plan

Stream Execution PlanStream Execution Plan

General rewriting (e.g., selection push down)

General rewriting (e.g., selection push down)

Break-linear-navigation rewriting

Break-linear-navigation rewriting

Physical implementations choosing

Physical implementations choosing

Execution strategy choosing

Execution strategy choosing

26

From Semantics-focused to Stream Logical Plan: In or Out?

Token-based plan (automata plan)

Tuple-based Plan

Tuple stream

XML data stream

Query answer

Pattern retrieval in Semantics-focused plan

Apply “push into automata”

27

Plan Alternatives

Nav $b, /price->$p

ExtractNest $b, $p

ExtractNest $b, $t

SJoin //book

Select price < 30

Tagger

Nav $b, /title->$t

Nav $S1, //book->$b

ExtractNest $S1, $b

Navigate /price

Select price<30

Navigate book/title

Tagger

Nav $S1, //book->$b

NavUnnest $S1, //book ->$b

NavNest $b, /price ->$p

NavNest $b, /title ->$t

Select$p<30

Tagger “Inexpensive”, $t->$r

Out In

28

Experimentation Results

Execution Time on 85M XML Stream Under Various Selectivity

25000

30000

35000

40000

45000

50000

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Selectivity of Selection

Exe

cutio

n Ti

me

(ms)

1 Nav

2 Navs

3 Navs

4 Navs

5 Navs

29

Contributions

Combined automata and algebra based paradigms into one uniform algebraic paradigm

Provided four layers in algebraic paradigm Query semantics expressed at high layer Automata computation on streams hidden at low layer

Supported optimization at an iterative manner (from high abstraction level to low abstraction level)

Illustrated enriched optimization opportunities by experiments

30

Email: suhong@cs.wpi.edu

Recommended