Raindrop: An Algebra-Automata Combined XQuery Engine over XML Streams. Hong Su, Elke Rundensteiner, Murali Mani, Ming Li. Worcester Polytechnic Institute, Worcester, MA. VLDB 2004.
Raindrop:
An Algebra-Automata Combined XQuery Engine over XML Streams
Hong Su, Elke Rundensteiner, Murali Mani, Ming Li
Worcester Polytechnic Institute
Worcester, MA
VLDB 2004
Stream Processing
data sources
Networks
data requesters
What’s Special for XML Stream Processing
Token-by-Token access manner (figure: tokens such as <auctions> arriving one at a time along a timeline)
Pattern retrieval + Filtering + Restructuring
FOR $a in stream(bids)//auction,
    $b in $a/seller[homepage],
    $c in $a/bidder[sameAddr]
WHERE $b/*/phone = "508"
RETURN <auction> $b, $c </auction>
Token: not a counterpart of a self-contained tuple
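To make the token-by-token access manner concrete, here is a minimal sketch that turns XML text into a flat sequence of start, text and end tokens using Python's standard SAX parser. The names `TokenCollector` and `tokenize` are hypothetical and are not part of Raindrop; the sketch only illustrates why a single token, unlike a relational tuple, is not self-contained.

```python
import xml.sax


class TokenCollector(xml.sax.ContentHandler):
    """Collects SAX callbacks as (kind, value) tokens (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def startElement(self, name, attrs):
        self.tokens.append(("start", name))

    def endElement(self, name):
        self.tokens.append(("end", name))

    def characters(self, content):
        # Whitespace between tags is ignored. A parser may split long text
        # into several calls; this sketch does not merge such pieces.
        if content.strip():
            self.tokens.append(("text", content.strip()))


def tokenize(xml_text):
    """Return the token sequence for a complete XML document."""
    handler = TokenCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens
```

A `<phone>` start token alone says nothing about its value or its enclosing seller; only the surrounding tokens give it meaning, which is why pattern retrieval has to be done over the token sequence.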
Pattern Retrieval on Token Streams
<auction>
<seller>
<primary>
<phone>
Two Computation Paradigms
Automata-based [yfilter, xscan, xsm, xsq, xpush…]
Algebraic [niagara00, …]
FOR $a in stream(bids)//auction,
    $b in $a/seller[homepage],
    $c in $a/bidder[sameAddr]
WHERE $b/*/phone = "508"
RETURN <auction> $b, $c </auction>
Automata (figure: automaton with states 1 to 9; transition labels auction, *, seller, bidder, homepage, sameAddr, phone, bid)
Algebra (figure: operator tree)
Tagger
Navigate $a, /seller->$b
Navigate $a, /bidder->$c
Navigate stream(bids), //auction->$a
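The automata-based paradigm can be illustrated with a small stack-based matcher over tokens. This is a sketch under assumptions, not the automaton drawn on the slide: the `(axis, name)` step encoding and the function `match_path` are hypothetical, and predicates such as [homepage] are left out.

```python
def match_path(tokens, steps):
    """Return indexes of start tokens at which the path `steps` is matched.

    `steps` is a list of (axis, name) pairs, axis being "/" or "//" and
    name an element name or "*". State k means "k steps matched so far".
    """
    final = len(steps)
    stack = [{0}]            # one set of active states per open element
    matches = []
    for i, (kind, value) in enumerate(tokens):
        if kind == "start":
            nxt = set()
            for state in stack[-1]:
                if state == final:
                    continue                  # path already complete here
                axis, name = steps[state]
                if axis == "//":
                    nxt.add(state)            # descendant axis: keep waiting deeper
                if name in ("*", value):
                    nxt.add(state + 1)        # this element advances the path
            if final in nxt:
                matches.append(i)
            stack.append(nxt)
        elif kind == "end":
            stack.pop()                       # restore the parent's states
    return matches


SAMPLE_TOKENS = [
    ("start", "auctions"), ("start", "auction"),
    ("start", "seller"), ("start", "primary"), ("start", "phone"),
    ("text", "508"), ("end", "phone"), ("end", "primary"), ("end", "seller"),
    ("start", "bid"), ("start", "bidder"), ("end", "bidder"), ("end", "bid"),
    ("end", "auction"), ("end", "auctions"),
]
```

For example, `match_path(SAMPLE_TOKENS, [("//", "auction"), ("/", "seller"), ("/", "*"), ("/", "phone")])` reports the position of the `<phone>` start token. Filtering on the phone value and building the `<auction>` result would need extra machinery on top of this, which is the "patches" deficiency the comparison slide attributes to the automata paradigm.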
Comparison of Two Paradigms
Automata Paradigm | Algebra Paradigm
Good for pattern retrieval on tokens | Does not support token inputs
Need patches for filtering and restructuring | Good for filtering and restructuring
Present all details on same low level | Support multiple descriptive levels (e.g., logical plan, physical plan)
Little studied as query processing paradigm | Well studied as query processing paradigm
Each paradigm has deficiencies
Both paradigms complement each other
Four-Level Algebraic Framework
Abstraction Level: from High (Declarative) to Low (Procedural)
Semantics-Focused Plan: Express the semantics of query regardless of input sources
Stream Logical Plan: Accommodate tokenized streams/automata computation
Stream Physical Plan: Describe implementation details of operators
Stream Execution Plan: Decide how an operator is invoked (scheduling)
The Raindrop framework aims to integrate both paradigms into one
Level I: Semantics-Focused Plan
Express query semantics regardless of stored or stream input sources [Rainbow-ZPR02]
Reuse existing general optimization techniques
  Decorrelation
  Cancel duplicate navigation operators
  …
Example Semantics-Focused Plan
Query:
FOR $a in stream(bids)//auction,
    $b in $a/seller[homepage],
    $c in $a/bidder[sameAddr]
WHERE $b/*/phone = "508"
RETURN <auction> $b, $c </auction>
Stream Data:
<auctions>
  <auction>
    <seller>
      <primary><phone>508</phone></primary>
      <secondary><phone>613</phone></secondary>
    </seller>
    <bid><bidder>…</bidder><bidder>…</bidder></bid>
  </auction>
  …
Plan and Input/output Data:
Input: source = <auctions> … </auctions>
NavUnnest stream(bids), //auction->$a
  Output: source = <auctions> … </auctions> | $a = <auction> … </auction>
NavUnnest $a, /seller->$b
  Output: source = <auctions> … </auctions> | $a = <auction> … </auction> | $b = <seller> … </seller>
NavUnnest $a, /bid/bidder->$c
  Output: source = <auctions> … </auctions> | $a = <auction> … </auction> | $b = <seller> … </seller> | $c = <bidder> … </bidder>
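The chain of NavUnnest operators in this plan can be sketched on an in-memory document, since the semantics-focused level ignores whether the input is stored or streamed. The function `nav_unnest`, the dictionary-based tuple layout and the bidder contents `b1` and `b2` are assumptions made for illustration; they are not taken from the Rainbow or Raindrop code.

```python
import xml.etree.ElementTree as ET

# Sample data modelled on the slide; bidder contents are placeholders.
DOC = """<auctions><auction>
  <seller>
    <primary><phone>508</phone></primary>
    <secondary><phone>613</phone></secondary>
  </seller>
  <bid><bidder>b1</bidder><bidder>b2</bidder></bid>
</auction></auctions>"""


def nav_unnest(tuples, src, path, dst):
    """For each tuple, bind `dst` to every element reached from tuple[src]
    via `path`, emitting one output tuple per element (unnesting)."""
    out = []
    for row in tuples:
        for elem in row[src].findall(path):
            out.append({**row, dst: elem})
    return out


def example_plan(doc=DOC):
    """Apply the three NavUnnest operators bottom-up."""
    rows = [{"source": ET.fromstring(doc)}]
    rows = nav_unnest(rows, "source", ".//auction", "$a")
    rows = nav_unnest(rows, "$a", "seller", "$b")
    rows = nav_unnest(rows, "$a", "bid/bidder", "$c")
    return rows
```

Each operator widens the tuple by one column, matching the growing source, $a, $b, $c columns shown in the plan's input/output data.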
Level II: Stream Logical Plan
Extend semantics-focused plan to accommodate tokenized stream inputs
New input data format: Tokens
New operators: StreamSource, TokenNavigate, ExtractUnnest, ExtractNest, StructuralJoin
New rewrite rules: Push-into/Pull-out-of Automata
One Uniform Algebraic View
Algebraic Stream Logical Plan (figure: data flow)
XML data stream -> Token-based plan (automata plan) -> Tuple stream -> Tuple-based plan -> Query answer
Modeling Automata in Algebraic Plan: Black Box [XScan01] vs. White Box
Black Box
XScan
  $a := stream(bids)//auction
  $b := $a/seller
  $c := $a/bid/bidder
StructuralJoin $a
ExtractUnnest $a, $b
ExtractUnnest $a, $c
White Box
TokenNavigate $a, /seller->$b
TokenNavigate $a, /bid/bidder->$c
TokenNavigate stream(bids), //auction->$a
FOR $a in stream(bids)//auction,
    $b in $a/seller[homepage],
    $c in $a/bid/bidder[sameAddr]
WHERE $b/*/phone = "508"
RETURN <auction> $b, $c </auction>
Data Model in Algebraic Plan Modeling Automata
StructuralJoin $a
ExtractUnnest $a, $b
ExtractUnnest $a, $c
TokenNavigate $a, /seller->$b
TokenNavigate $a, /bid/bidder->$c
TokenNavigate stream(bids), //auction->$a
StreamSource
(figure: data passed between the operators. Individual tokens such as <auctions>, <auction>, <seller>, <primary>, <phone>, 508, </phone>, </primary>, …, <bidder>, <bidderid>, 0314, … appear near StreamSource and the TokenNavigate operators; composed elements <seller>…</seller> and <bidder>...</bidder> appear near the ExtractUnnest operators; pairs of <bidder>...</bidder> and <seller>…</seller> appear near StructuralJoin)
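The interplay of the token-level and tuple-level operators in this plan can be sketched in a single pass over the tokens. This is a simplified illustration, not Raindrop's implementation: the function `run_plan` is hypothetical, a list of open element names stands in for automaton states, nested auctions are not handled, and the bidder texts `b1` and `b2` are placeholders.

```python
def run_plan(tokens):
    """Return (seller_tokens, bidder_tokens) pairs, one per seller/bidder
    combination inside the same auction."""
    path = []                    # open element names (stand-in for automaton states)
    sellers, bidders = [], []    # elements extracted for the current auction
    buffer, target, depth = None, None, 0
    results = []
    for kind, value in tokens:
        if kind == "start":
            path.append(value)
            if buffer is None:
                # TokenNavigate: recognise the two paths below $a.
                if path[-2:] == ["auction", "seller"]:
                    buffer, target, depth = [], sellers, len(path)
                elif path[-3:] == ["auction", "bid", "bidder"]:
                    buffer, target, depth = [], bidders, len(path)
        if buffer is not None:
            buffer.append((kind, value))   # ExtractUnnest: collect the element's tokens
        if kind == "end":
            if buffer is not None and len(path) == depth:
                target.append(buffer)      # element complete
                buffer = None
            path.pop()
            if value == "auction":
                # StructuralJoin on $a: pair what was extracted in this auction.
                results.extend((s, b) for s in sellers for b in bidders)
                sellers, bidders = [], []
    return results


SAMPLE_TOKENS = [
    ("start", "auctions"), ("start", "auction"),
    ("start", "seller"), ("start", "primary"), ("start", "phone"),
    ("text", "508"), ("end", "phone"), ("end", "primary"), ("end", "seller"),
    ("start", "bid"),
    ("start", "bidder"), ("text", "b1"), ("end", "bidder"),
    ("start", "bidder"), ("text", "b2"), ("end", "bidder"),
    ("end", "bid"), ("end", "auction"), ("end", "auctions"),
]
```

The join needs no value comparison: sellers and bidders are paired simply because they were extracted between the same `<auction>` start and end tokens.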
For details of Levels III and IV, please refer to:
“Automaton Meets Query Algebra: Towards a Unified Model for XQuery Evaluation over XML Data Streams”, ER 2003
“Raindrop: A Uniform and Layered Algebraic Framework for XQueries on XML Streams”, CIKM 2003
“Raindrop: A Uniform and Layered Algebraic Framework for XQueries on XML Streams”, Journal Submission 2004
Optimization I: Computation Into or Out of Automata?
TokenNavigate $a, /bid/bidder->$c
ExtractUnnest $a, $c
ExtractUnnest $a, $b
StructuralJoin $a
TokenNavigate $a, /seller->$b
TokenNavigate stream(bids), //auction->$a
ExtractUnnest stream(bids), $a
NavigateUnnest $a, /seller->$b
NavigateUnnest $a, /bid/bidder->$c
TokenNavigate stream(bids), //auction->$a
NavUnnest stream(bids), //auction->$a
NavigateUnnest $a, /seller->$b
NavigateUnnest $a, /bid/bidder->$c
Out of Automata
Into Automata
Automata Plan
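A plan that keeps only //auction inside the automaton and pulls the seller and bidder navigation out of it can be sketched as follows. Whole auction elements are composed from tokens and the remaining navigation runs on the in-memory tree. The functions `extract_auctions` and `out_of_automata` are hypothetical names, nested auctions are not handled, and the bidder texts are placeholders.

```python
import xml.etree.ElementTree as ET


def extract_auctions(tokens):
    """Locate auction elements on the token stream and compose each one
    into an ElementTree element."""
    builder, depth = None, 0
    for kind, value in tokens:
        if kind == "start" and value == "auction" and builder is None:
            builder, depth = ET.TreeBuilder(), 0
        if builder is None:
            continue                      # token lies outside any auction
        if kind == "start":
            builder.start(value, {})
            depth += 1
        elif kind == "text":
            builder.data(value)
        else:                             # "end"
            builder.end(value)
            depth -= 1
            if depth == 0:
                yield builder.close()     # auction element is complete
                builder = None


def out_of_automata(tokens):
    """Navigate to sellers and bidders on the composed tree, not on tokens."""
    for a in extract_auctions(tokens):
        for b in a.findall("seller"):
            for c in a.findall("bid/bidder"):
                yield a, b, c


SAMPLE_TOKENS = [
    ("start", "auctions"), ("start", "auction"),
    ("start", "seller"), ("start", "primary"), ("start", "phone"),
    ("text", "508"), ("end", "phone"), ("end", "primary"), ("end", "seller"),
    ("start", "bid"),
    ("start", "bidder"), ("text", "b1"), ("end", "bidder"),
    ("start", "bidder"), ("text", "b2"), ("end", "bidder"),
    ("end", "bid"), ("end", "auction"), ("end", "auctions"),
]
```

The trade-off is visible in the sketch: the automaton part becomes trivial, but every auction is fully materialized before any navigation happens, even if a later filter discards it.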
Experimentation Results
Figure: Execution Time on 85M XML Stream Under Various Selectivity. x-axis: Selectivity of Selection (0.1 to 0.9); y-axis: Execution Time (ms), ticks from 25000 to 50000; series: 1 Nav, 2 Navs, 3 Navs, 4 Navs, 5 Navs.
Optimization II: Semantic Query Optimization
General schema-based optimizations
  Eliminate predicate/join, …
  Focus on operators manipulating flat values
XML specific schema-based optimizations
  Focus on pattern retrieval
  Fall into two categories
    General XML SQO
      • Minimize query tree [YCL+-AT&T 01]
    Stream XML SQO (our focus)
Stream-Specific XML SQO
Observations
  Pattern retrieval over tokens solely relies on document-order traversal
  Schema constraints help expedite document-order traversal
State-of-the-Art
  [XPush03] covers limited query (boolean XPath match) and one type of constraints
Our goals:
  Support more powerful query (XQuery)
  Support more types of constraints (XSchema)
Step I: Construct Query Graph
(a) Example Query
FOR $a in stream(bids)//auction,
    $b in $a/seller[homepage],
    $c in $a/bid/bidder[sameAddr]
WHERE $b/*/phone = "508"
RETURN <auction> $b, $c </auction>
(b) Query Tree
Example XML Schema
Step II: Apply Optimization Rules
Offer optimization rules utilizing
  occurrence constraints
  exclusive constraints
  order constraints
Apply rules in an order ensuring
  no beneficial rule missed
  no redundant rule introduced
Step III: Translate Rewritten Query Graph Back to Plan (I)
When </phone> has been encountered twice, check /*/phone; if the predicate fails, suspend states s2 and s3
Utilize Occurrence Constraints
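The early failure enabled by an occurrence constraint can be sketched as below, assuming (hypothetically) that the schema allows at most two phone elements per seller. The function `filter_auction` and its parameters are illustrative only; returning early stands in for suspending automaton states, and the sketch does not check that each phone sits under seller/*.

```python
def filter_auction(tokens, max_phones=2, wanted="508"):
    """Evaluate the phone predicate over the tokens of one auction.

    Returns (qualifies, tokens_inspected). Once `max_phones` phone
    elements have closed without a match, no later token can satisfy
    the predicate, so processing stops.
    """
    seen, matched, inspected = 0, False, 0
    last_text = None
    for kind, value in tokens:
        inspected += 1
        if kind == "text":
            last_text = value
        elif kind == "end" and value == "phone":
            seen += 1
            matched = matched or last_text == wanted
            if seen == max_phones and not matched:
                return False, inspected   # predicate can no longer succeed
    return matched, inspected
```

Without the constraint, the engine would have to keep watching for further phone elements until the end of the seller before declaring the predicate failed.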
Step III: Translate Rewritten Query Graph Back to Plan (II)
When <billTo> or <shipTo> is encountered once, suspend states s2 and s9
Utilize Exclusive Constraints
Step III: Translate Rewritten Query Graph Back to Plan (III)
When <primary> is encountered once, check /homepage; if it is absent, suspend states s10, s3 and s2
Utilize Order Constraints
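An order constraint permits a similar shortcut. The sketch below assumes (hypothetically) that the schema requires homepage to precede primary among a seller's children, so seeing `<primary>` first proves that homepage is absent. The function `seller_has_homepage` is an illustrative name, and the early return stands in for suspending automaton states.

```python
def seller_has_homepage(tokens):
    """Decide the [homepage] predicate over the tokens of one seller.

    Returns (has_homepage, tokens_inspected). Depth 1 is the seller
    element itself; depth 2 holds its direct children.
    """
    depth = 0
    for i, (kind, value) in enumerate(tokens, start=1):
        if kind == "start":
            depth += 1
            if depth == 2 and value == "homepage":
                return True, i
            if depth == 2 and value == "primary":
                return False, i   # homepage would have come before primary
        elif kind == "end":
            depth -= 1
    return False, len(tokens)
```

In both outcomes the decision is reached at the second token of the seller, without waiting for `</seller>`.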
http://davis.wpi.edu/dsrg/raindrop/
Thanks to the WPI DSRG Rainbow Team for XAT Algebra Support