From Searching Text to Querying XML Streams

Dan Suciu
Univ. of Washington

www.cs.washington.edu/homes/suciu

About Me

• Born 1957, Romania
• BS: Bucharest, PhD: University of Pennsylvania
• Now: University of Washington (Seattle)

My work is on semistructured data

• Book: Data on the Web: From Relations to Semistructured Data and XML

Past/present projects:

• XML-QL = precursor of XQuery
• XMill = the XML compressor
• XML toolkit

Motivation

• Text databases
  – Studied over the past 15 years
  – Traditional client/server model
  – Struggled with lack of standard text syntax

• Recently, new standard: XML
  – Traditional client/server: in today’s DBMS
  – New applications: stream processing

• This talk: processing streaming XML data
  – My motivation: work on the XML Toolkit project

Outline

• Background

• The XML stream processing problem

• Basic XML processing with automata

• Adapting automata to XML

• Stream indexes

• Conclusions

Background: Relational Databases

• Structured, stored in tables
• Schema separate from data
• Queries: precise, refer to schema and data (SQL)

BOOKS:

ISBN        Title                     Year  Publisher
0201537710  Foundations of Databases  1995  AW
155860622X  Data on the Web           1999  MK

AUTHOR:

AID  Name       Country
44   Abiteboul  FR
06   Buneman    UK
62   Hull       USA
12   Suciu      USA
29   Vianu      USA

WROTE:

ISBN        AID
0201537710  44
0201537710  62
0201537710  29
155860622X  44
155860622X  06
155860622X  12

Hard to publish, easy to query precisely

Background: Text Databases

• Unstructured, stored in documents
• No schema, only data
• Queries: imprecise, refer to data only (keywords)

Foundations of Databases,
Abiteboul (FR), Hull (USA), Vianu (USA)
Addison Wesley,
1995

Data on the Web
Abiteboul (FR), Buneman (UK), Suciu (USA)
Morgan Kaufmann,
1999

Easy to publish, hard to query precisely

Background: XML Data

• Semistructured
• Schema and data are together: self-describing
• Queries: precise, refer to schema and data (SQL)

<bib>
  <book>
    <title> Foundations… </title>
    <author> <name> Abiteboul </name> <country> FR </country> </author>
    <author> <name> Hull </name> <country> USA </country> </author>
    <author> <name> Vianu </name> <country> USA </country> </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bib>

XML: Easier to publish, easy to query precisely

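Background: XML Data

[Figure: the bibliography drawn as a tree. The root is bib; its children are book, paper, and book nodes; below them are title, author, publisher, and journal nodes; author nodes have name and country children. Leaf values shown include "Data on the Web", "Abiteboul", "FR", "Buneman", "UK", and "Addison Wesley".]

Data model = tree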

Background: XML Data

• Querying with XPath (and XQuery)
• This talk: XPath queries restricted to:

  tag    /    //    *    [ ]    path="constant"

Background: XPath in One Slide

/bib/book/author/name
/bib/book//name/*/zip
/bib/book[author/name="Abiteboul"]
/bib/book[year="1995" and author[name="Abiteboul" and country="FR"]]

tag, /    //, *    [ ]

• Navigate partially known structure
• Conjunctive queries à la SQL

This is precisely the “region algebra”
E.g. use proximal nodes [Navarro&Baeza-Yates’97]

Outline

• Background

• The XML stream processing problem

• Basic XML processing with automata

• Adapting automata to XML

• Stream indexes

• Conclusions

Main Application: XML Packet Routing

• Selective Dissemination of Information [Altinel&Franklin’00, Chan et al.’02]

• XML content routing [Snoeren et al.’01]

• SOAP message routing in application servers

XML Packet Routing

[Figure: a large number of XML packets in transit, each of the form <doc> <tag> value </tag> </doc>.]

XPath expressions:

/bib/book/publisher="MK"
/bib/book[category="recent"]/title="Web"
/bib/book//address//*/zip="123"
/bib/book//address//*="Galaxy"
/bib/book/category="recent"
/bib/book/address="123"
/bib/book/address/field="567"

[Figure: an input XML stream (<bib> <book> ... </bib>) is matched against the XPath expressions and forwarded to the output XML streams.]

The XML Stream Processing Problem

Given:
• A set of XPath expressions
• An incoming stream of XML documents

Decide:
• For each document, which expressions it matches

Hard:
• Large number of XPath expressions, e.g. 10^3 to 10^6
• Streaming XML data, high throughput, e.g. 5MB/s

Easy:
• Shallow XML data, e.g. depth=20
• Short XPath expressions
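
To make the problem statement concrete, here is a naive baseline, not a method from the talk: it parses every document into a tree and evaluates each expression separately, so the work per document grows with the number of expressions. It is a sketch under stated assumptions. The function `route`, the subscription ids, and the sample documents are hypothetical, and the expressions are rewritten in the limited XPath subset that Python's `xml.etree.ElementTree` accepts.

```python
import xml.etree.ElementTree as ET


def route(document: str, subscriptions: dict[str, str]) -> list[str]:
    """Return the ids of all subscriptions whose path matches the document."""
    root = ET.fromstring(document)             # materialises the whole tree
    return [sid for sid, path in subscriptions.items()
            if root.find(path) is not None]    # one evaluation per expression


# Hypothetical workload; paths are relative to the <bib> root element.
SUBSCRIPTIONS = {
    "q1": "./book[publisher='MK']",
    "q2": "./book[category='recent']/title",
    "q3": "./book//zip",
}

STREAM = [
    "<bib><book><publisher>MK</publisher><title>Web</title></book></bib>",
    "<bib><book><category>recent</category><title>XML</title></book></bib>",
]

for doc in STREAM:
    print(route(doc, SUBSCRIPTIONS))
```

With 10^6 subscriptions this loop would evaluate 10^6 paths per document, which is the scaling problem the automata-based techniques in the rest of the talk address.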

The Approaches

Basic techniques
• NFA plus optimizations:
  – XFilter/YFilter [Altinel&Franklin’00]
  – XTrie [Chan et al.’02]
• DFA:
  – XML Toolkit

Beyond the obvious
• Stream indexes (XML Toolkit)
• Stream views

Outline

• Background

• The XML stream processing problem

• Basic XML processing with automata

• Adapting automata to XML

• Stream indexes

• Conclusions

From XPath to NFA

/catalog/product[category="tools"][*/price = 200]/quantity//price

[Figure: the NFA for the expression above, with transitions labeled catalog, product, category, "tools", *, price, 200, quantity, *, and price.]

Extra processing needed to combine branches (not in this talk)

Basic NFA Evaluation

XPath expressions:

/bib/book/publisher="MK"
/bib/book[category="recent"]/title
/bib/book//address//*/zip="123"
/bib/book//address//*="Galaxy"
/bib/book/category="recent"
/bib/book/address="123"
/bib/book/address/field="567"
/bib/book/tag="some"
/bib/book[category="recent"]/title
/bib/book//address//*="Seattle"
/bib/book//address//*="Galaxy"
/bib/book/category="recent"
/bib/book/address="Lisbon"
/bib/book/address/field="some"
. . .
/bib/book/publisher="AW"
/bib/book[category="recent"]/title
/bib/book//address//*="123"
/bib/book//address//*="Galaxy"
/bib/book/category="new"
/bib/book/address="London"
/bib/book/address/field="some"
/bib/book/category="old"

[Figure: the XPath expressions are compiled into one NFA. SAX events from the input stream (<bib> <book> ... </bib>) drive the NFA. A STACK holds one set of current states per open element, e.g. {1, 55, 99, ...}, {2, 3, 543, 43, 254}, {3, 66, 102, 4534, ...}.]

Basic NFA Evaluation

Properties:
• Space = linear
• Throughput = decreases linearly with the number of XPath expressions

Systems:
• XFilter [Altinel&Franklin’00], YFilter
• XTrie [Chan et al.’02]
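
The stack-of-state-sets evaluation described on these two slides can be sketched as follows. This is a minimal illustration, not the XFilter, YFilter, or XTrie implementation. It assumes linear expressions built only from `/`, `//`, tags, and `*` (value tests such as `="MK"` are left out), and the names `parse_path` and `NFAFilter` are hypothetical.

```python
import re
import xml.sax


def parse_path(path):
    """Split '/a//b/*' into [('/', 'a'), ('//', 'b'), ('/', '*')]."""
    return re.findall(r"(//|/)([^/]+)", path)


class NFAFilter(xml.sax.ContentHandler):
    """Runs all expressions as one NFA; the stack holds a state set per open element."""

    def __init__(self, paths):
        super().__init__()
        self.steps = [parse_path(p) for p in paths]
        # An NFA state is (query index, number of steps matched so far).
        self.stack = [{(q, 0) for q in range(len(paths))}]
        self.matched = set()

    def startElement(self, name, attrs):
        active = set()
        for q, i in self.stack[-1]:
            steps = self.steps[q]
            if i == len(steps):
                continue                    # accepting state: no outgoing edges
            axis, tag = steps[i]
            if tag in ("*", name):
                active.add((q, i + 1))      # advance one step along the path
                if i + 1 == len(steps):
                    self.matched.add(q)
            if axis == "//":
                active.add((q, i))          # self-loop: skip over this element
        self.stack.append(active)

    def endElement(self, name):
        self.stack.pop()                    # restore the parent's state set


paths = ["/bib/book/author/name", "/bib/book//name/*/zip", "//country"]
doc = (b"<bib><book><author><name>Abiteboul</name>"
       b"<country>FR</country></author></book></bib>")
nfa = NFAFilter(paths)
xml.sax.parseString(doc, nfa)
print(sorted(nfa.matched))  # [0, 2]
```

Each start tag costs time proportional to the size of the current state set, which is why throughput degrades as more expressions are added, while memory stays proportional to the size of the expressions.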

Basic DFA Evaluation

XPath expressions:

/bib/book/publisher="MK"
/bib/book[category="recent"]/title
/bib/book//address//*/zip="123"
/bib/book//address//*="Galaxy"
/bib/book/category="recent"
/bib/book/address="123"
/bib/book/address/field="567"
/bib/book/tag="some"
/bib/book[category="recent"]/title
/bib/book//address//*="Seattle"
/bib/book//address//*="Galaxy"
/bib/book/category="recent"
/bib/book/address="Lisbon"
/bib/book/address/field="some"
. . .
/bib/book/publisher="AW"
/bib/book[category="recent"]/title
/bib/book//address//*="123"
/bib/book//address//*="Galaxy"
/bib/book/category="new"
/bib/book/address="London"
/bib/book/address/field="some"
/bib/book/category="old"

[Figure: the XPath expressions are compiled into DFAs. SAX events from the input stream (<bib> <book> ... </bib>) drive the automaton. The STACK holds a single current state per open element, e.g. 1, 552, 399.]

Basic DFA Evaluation

Properties:
• Throughput = constant!
• Space = GOOD QUESTION

System:
• XML Toolkit [University of Washington] http://xmltk.sourceforge.net

Outline

• Background

• The XML stream processing problem

• Basic XML processing with automata

• Adapting automata to XML

• Stream indexes

• Conclusions

The Size of the DFA

//a/b/a/a/b

[Figure: the NFA has states 0 to 5, a * self-loop on state 0, and transitions a, b, a, a, b. The equivalent DFA has states 0, 01, 02, 013, 014, 025, with a and [other] back edges.]

DFA for //P has 1+|P| states [KMP]

The Size of the DFA

//a/*/*/*/b

[Figure: the NFA has states 0 to 5 and transitions a, *, *, *, b. The DFA is shown as a fragment, without back edges; its states include 0, 01, 012, 0123, 01234, 012345, 02, 023, 013, 03, 0234, 0134, 034, 0345, 0245, 045.]

Size of DFA = exponential in *’s (not a real concern)
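
The contrast between the two expressions can be checked with a textbook subset construction. The sketch below is an illustration under assumptions: it treats the expression as a word automaton over a three-symbol alphabet (`a`, `b`, and one symbol standing for every other tag), numbers the NFA states by steps matched, and uses the hypothetical helper names `parse_path` and `eager_dfa_size`. The state numbering need not coincide with the one drawn on the slides.

```python
import re


def parse_path(path):
    """Split '//a/*/b' into [('//', 'a'), ('/', '*'), ('/', 'b')]."""
    return re.findall(r"(//|/)([^/]+)", path)


def eager_dfa_size(path, alphabet):
    """Count the DFA states reachable by subset construction for one linear path."""
    steps = parse_path(path)
    n = len(steps)

    def move(state, symbol):
        nxt = set()
        for i in state:
            if i == n:
                continue                  # accepting state: no outgoing edges
            axis, tag = steps[i]
            if tag in ("*", symbol):
                nxt.add(i + 1)            # the step matches this symbol
            if axis == "//":
                nxt.add(i)                # descendant axis: stay and keep looking
        return frozenset(nxt)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        state = todo.pop()
        for symbol in alphabet:
            nxt = move(state, symbol)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)


ALPHABET = ["a", "b", "other"]
print(eager_dfa_size("//a/b/a/a/b", ALPHABET))  # 6, i.e. 1 + |P|
print(eager_dfa_size("//a/*/*/*/b", ALPHABET))  # 24
```

Without wildcards the DFA stays linear in the length of the path. With wildcards it has to remember which of the last few symbols were `a`, which is where the exponential growth in the number of `*` comes from.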

The Size of the DFA

Theorem [GMOS’02] The number of states in the DFA for one linear XPath expression P is at most:

k + |P| k s^m

where

k = number of //

s = size of the alphabet (number of tags)

m = max number of * between two consecutive //

Size of DFA: Multiple Expressions

//section/table/footnote
//table/footnote
//section/figure/footnote
. . . . .
//abstract/footnote/table

DFA = Trie, has linear number of states [Aho&Corasick]

Size of DFA: Multiple Expressions

//section//footnote
//table//footnote
//figure//footnote
. . . . .
//abstract//footnote

100 expressions: 2^100 states !!

There is a theorem here too, but it’s not useful…

Solution: Compute the DFA Lazily

• Also used in text searching
• But will it work for 10^6 XPath expressions?
• YES!
• For XPath it is provably effective, for two reasons:
  – XML data is not very deep
  – The nesting structure in XML data tends to be predictable

Lazy DFA and “Simple” DTDs

• Document Type Definition (DTD)
  – Part of the XML standard
  – Will be replaced by XML Schema
• Example DTD:

<!ELEMENT document (section*)>
<!ELEMENT section ((section|abstract|table|figure)*)>
<!ELEMENT figure (table?,footnote*)>
. . . . .

Definition: A DTD is simple if all cycles in its element graph are self-loops (an element that may directly contain itself, like section above)

Lazy DFA and “Simple” DTDs

Simple DTD:

[Figure: the DTD graph, with nodes document, section, table, figure, abstract, and footnote.]

XPath expressions:

//section//footnote
//table//footnote
//figure//footnote
//abstract//footnote

Eager DFA “remembers” 2^4 sets

Lazy DFA “remembers” only 4 sets
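
A lazy DFA can be sketched by memoising subset-construction steps as the data asks for them. The code below is an illustration, not the XML Toolkit implementation: `LazyDFA`, `eager_size`, the sample document, and the tag list are hypothetical, and only linear expressions over `/`, `//`, tags, and `*` are handled. The counts it prints include every DFA state (initial and accepting ones too), so they are not directly the "sets" counted on the slide, but the gap between eager and lazy is the same effect.

```python
import re
import xml.sax


def parse_path(path):
    """Split '//a//b' into [('//', 'a'), ('//', 'b')]."""
    return re.findall(r"(//|/)([^/]+)", path)


class LazyDFA(xml.sax.ContentHandler):
    """DFA whose transitions are computed the first time the data needs them."""

    def __init__(self, paths):
        super().__init__()
        self.steps = [parse_path(p) for p in paths]
        start = frozenset((q, 0) for q in range(len(paths)))
        self.table = {}             # (state, tag) -> state, filled on demand
        self.states = {start}       # DFA states materialised so far
        self.stack = [start]        # one DFA state per open element
        self.matched = set()

    def move(self, state, tag):
        """One subset-construction step over the combined NFA."""
        nxt = set()
        for q, i in state:
            steps = self.steps[q]
            if i == len(steps):
                continue
            axis, name = steps[i]
            if name in ("*", tag):
                nxt.add((q, i + 1))
            if axis == "//":
                nxt.add((q, i))
        return frozenset(nxt)

    def startElement(self, name, attrs):
        key = (self.stack[-1], name)
        if key not in self.table:             # warm-up: build the transition once
            self.table[key] = self.move(*key)
            self.states.add(self.table[key])
        state = self.table[key]               # afterwards: a single lookup
        self.matched.update(q for q, i in state if i == len(self.steps[q]))
        self.stack.append(state)

    def endElement(self, name):
        self.stack.pop()


def eager_size(dfa, alphabet):
    """Number of states an eager construction would build over the given tags."""
    start = frozenset((q, 0) for q in range(len(dfa.steps)))
    seen, todo = {start}, [start]
    while todo:
        state = todo.pop()
        for tag in alphabet:
            nxt = dfa.move(state, tag)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)


PATHS = ["//section//footnote", "//table//footnote",
         "//figure//footnote", "//abstract//footnote"]
TAGS = ["document", "section", "table", "figure", "abstract", "footnote"]
DOC = (b"<document><section>"
       b"<table><footnote/></table>"
       b"<figure><footnote/></figure>"
       b"<abstract><footnote/></abstract>"
       b"</section></document>")

lazy = LazyDFA(PATHS)
xml.sax.parseString(DOC, lazy)
print(len(lazy.states), eager_size(lazy, TAGS))  # 8 31
```

The eager construction has to prepare for every combination of section, table, figure, and abstract ancestors, while the lazy one only meets the combinations the DTD lets the data produce.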

Lazy DFA and “Simple” DTDs

Theorem [GMOS’02] If the XML data has a “simple” DTD, then the lazy DFA has at most:

1 + D(1+n)^d

states, where

n = max depth of XPath expressions

D = size of the “unfolded” DTD

d = max depth of self-loops in the DTD

Fact of life: “Data-like” XML has simple DTDs

Lazy DFA and Data Guides

• “Non-simple” DTDs are useless for the lazy DFA

• “Everything may contain everything”

<!ELEMENT document (section*)>
<!ELEMENT section ((section|table|figure|abstract|footnote)*)>
<!ELEMENT table ((section|table|figure|abstract|footnote)*)>
<!ELEMENT figure ((section|table|figure|abstract|footnote)*)>
<!ELEMENT abstract ((section|table|figure|abstract|footnote)*)>

Fact of life: “Text”-like XML has non-simple DTDs

Lazy DFA and Data Guides

Definition [Goldman&Widom’97]

The data guide for an XML data instance is the Trie of all its root-to-leaf paths
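
As a small illustration of this definition, the sketch below builds the trie of root-to-leaf tag paths for one document and counts its nodes. The nested-dict representation, the function names `data_guide` and `count_nodes`, and the sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def data_guide(xml_text):
    """Return the data guide as a nested dict mapping tag -> sub-trie."""
    def merge(element, trie):
        node = trie.setdefault(element.tag, {})   # same path shares one trie node
        for child in element:
            merge(child, node)

    guide = {}
    merge(ET.fromstring(xml_text), guide)
    return guide


def count_nodes(trie):
    """Number of nodes in the trie (G in the talk's notation)."""
    return sum(1 + count_nodes(sub) for sub in trie.values())


DOC = ("<document>"
       "<section><table/><section><table/><figure/></section></section>"
       "<section><table/><figure/></section>"
       "</document>")

guide = data_guide(DOC)
print(count_nodes(guide))  # 7 trie nodes for a document with 9 elements
```

Repeated structure collapses: the two top-level sections share one trie node, so the guide can stay small even when the document is large.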

Lazy DFA and Data Guides

[Figure: an XML data tree over the tags document, section, table, and figure, shown next to its data guide.]

Fact of life: real XML data has “small” data guide [Liefke&S.’00]

Lazy DFA and “Simple” DTDs

Theorem [GMOS’02] If the XML data has a data guide with G nodes, then the number of states in the lazy DFA is at most:

1 + G

G = number of nodes in the data guide

[Figure: Number of Lazy DFA States - SYNTHETIC Data. Number of states (log scale, 1 to 100000) for the data sets simple, prov, ebBPSS, protein, nasa, and treebank, with 10^3, 10^4, and 10^5 XPath expressions. Annotation: 4000 states.]

[Figure: Number of Lazy DFA States - REAL Data. Number of states (log scale, 1 to 100000) for the data sets protein, nasa, and treebank, with 10^3, 10^4, and 10^5 XPath expressions. Annotations: 95 states; 40000 states; G = 350000.]

Number of States in the Lazy DFA

                    Real XML data                Synthetic XML data
Data-style DTD      Theorem: lazy DFA is small   Theorem: lazy DFA is small
Document-style DTD  Theorem: lazy DFA is small   Fact: lazy DFA is HUGE

Lazy DFA in the XML Toolkit

• The XML toolkit uses a lazy DFA to process XML streams

• “warm-up” phase, followed by very high throughput

[Figure: Throughput for 10^3, 10^4, 10^5, 10^6 XPath expressions, with prob(*)=10% and prob(//)=10%. Throughput (log scale, 0.0001MB/s to 100MB/s) against total input size (5MB to 25MB) for the parser, for lazyDFA, and for xfilter, each at 10^3, 10^4, 10^5, and 10^6 XPath expressions. Annotations: Parser: 10MB/s; Lazy DFA: 5.4MB/s.]

Summary of Lazy DFA and XML

• Linear XPath expressions:
  – Process with one lazy DFA

• XPath expressions with branches:
  – Process with Deterministic Pushdown Automata (ongoing work at the University of Washington)

Outline

• Background

• The XML stream processing problem

• Basic XML processing with automata

• Adapting automata to XML

• Stream indexes

• Conclusions

Stream IndeX (SIX)

Main observation:
• Parsing is major bottleneck

Definition: The SIX of an XML document is a binary table of (begin, end) offsets

Idea:
• Use SIX to reduce amount of parsing
• Works well with (lazy) DFA
• Implemented in the XML toolkit

Stream IndeX (SIX)

XML:

<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>

SIX:

           beginOffset  endOffset
bib        0            1490124
book       3            409023
publisher  12           423
author     426          879
author     978          . . .
. . .
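
How such a table lets a processor skip parsing can be sketched as follows. This is an illustration under simplifying assumptions, not the XML Toolkit's format: the tokenizer is a regular expression that only handles plain start and end tags (no self-closing tags, comments, or CDATA), the SIX is kept as a Python dict rather than a binary table, and the names `build_six` and `find_titles` as well as the fixed path `/bib/book/title` are hypothetical.

```python
import re

# Matches a start tag or an end tag; group(1) is b"/" for end tags.
TAG = re.compile(rb"<(/?)([A-Za-z][\w.-]*)[^<>]*>")


def build_six(data):
    """Map the begin offset of each start tag to the offset just past its end tag."""
    six, open_tags = {}, []
    for m in TAG.finditer(data):
        if m.group(1):
            six[open_tags.pop()] = m.end()
        else:
            open_tags.append(m.start())
    return six


def find_titles(data, six, use_six=True):
    """Collect the text of /bib/book/title and count how many tags were tokenised."""
    wanted = [b"bib", b"book", b"title"]
    path, titles, parsed, pos = [], [], 0, 0
    while (m := TAG.search(data, pos)):
        parsed += 1
        pos = m.end()
        if m.group(1):                        # end tag
            path.pop()
            continue
        path.append(m.group(2))
        on_path = len(path) <= len(wanted) and path == wanted[:len(path)]
        if path == wanted:
            text = data[m.end():data.index(b"<", m.end())]
            titles.append(text.strip().decode())
        elif use_six and not on_path:
            pos = six[m.start()]              # jump past the whole element
            path.pop()
    return titles, parsed


DATA = (b"<bib><book><publisher>Addison-Wesley</publisher>"
        b"<author><first-name>Rick</first-name><last-name>Hull</last-name></author>"
        b"<title>Foundations of Databases</title><year>1995</year></book></bib>")

six = build_six(DATA)
print(find_titles(DATA, six, use_six=False))  # (['Foundations of Databases'], 16)
print(find_titles(DATA, six, use_six=True))   # (['Foundations of Databases'], 9)
```

In this sketch the decision to skip comes from a hard-coded path. In the setting of the talk it would come from the automaton, which knows when no expression can match inside the current element.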

Stream IndeX (SIX)

[Figure: each XML document in the stream (<bib> <book> ... </bib>) is accompanied by its SIX, a list of (begin, end) offset pairs such as (0, 205), (30, 66), (72, 188), (90, 110), (95, 98). XML and SIX travel together (e.g. DIME).]

The SIX stream is about 6% of the data stream

And can be made MUCH smaller

[Figure: Throughput improvements from SIX (stable). Throughput (MB/s, 0 to 35) against XML stream size (55MB to 105MB) for Theta=3%, Theta=8%, and Theta=14%, each with and without SIX.]

Stream Views

Idea:
• Given a workload of XPath expressions with branches
• Precompute some views for each document to speed up the entire workload
• Views header has to be small

Stream Views

Queries (three groups in the figure, labeled Servers):

/a[b=11][c=22][e=23]

/a[b=33][d=44][e=55]
/a[c=66][f=77]
/a[f=34][g=56]

/a[b=88][c=99]
/a[c=99][e=00]

3 Views:

/a/c    /a/e    /a/f

Short circuit evaluation!
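
The short-circuit idea can be sketched as follows. This is a hedged illustration, not the talk's implementation: the header is modeled as a dict from view name to the set of values found in the document, a query such as `/a[b=11][c=22]` is written as the dict `{"b": "11", "c": "22"}`, and the names `make_header` and `matches` are hypothetical.

```python
import xml.etree.ElementTree as ET

VIEWS = ("c", "e", "f")      # the three views /a/c, /a/e, /a/f


def make_header(document):
    """Computed once per document: the set of values under each view path."""
    root = ET.fromstring(document)
    return {v: {el.text for el in root.findall(v)} for v in VIEWS}


def matches(query, header, document, stats):
    """Decide a conjunctive query, consulting the views header before parsing."""
    for tag, value in query.items():
        if tag in header and value not in header[tag]:
            stats["short_circuited"] += 1
            return False                      # rejected from the header alone
    stats["parsed"] += 1
    root = ET.fromstring(document)            # fall back to the full document
    return all(any(el.text == v for el in root.findall(t))
               for t, v in query.items())


QUERIES = [
    {"b": "11", "c": "22", "e": "23"},
    {"b": "33", "d": "44", "e": "55"},
    {"c": "66", "f": "77"},
    {"f": "34", "g": "56"},
    {"b": "88", "c": "99"},
    {"c": "99", "e": "00"},
]
DOC = "<a><b>11</b><c>22</c><e>23</e></a>"

header = make_header(DOC)
stats = {"parsed": 0, "short_circuited": 0}
results = [matches(q, header, DOC, stats) for q in QUERIES]
print(results, stats)
```

For this document five of the six queries are rejected from the header, and only the first one requires parsing the document body. Every query in the workload mentions at least one of c, e, or f, which is why three views are enough to cover it.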

Stream Views

• Views header (binary offsets)

[Figure: each XML document in the stream (<bib> <book> ... </bib>) is preceded by a header of offsets, e.g. 0, 30, 72.]

100x speedup on a hit

Choosing the views: Difficult problem

Outline

• Background

• The XML stream processing problem

• Basic XML processing with automata

• Adapting automata to XML

• Stream indexes

• Conclusions

Summary

• XML stream processing problem:
  – Fixed XPath queries, transient XML data
  – Large number of queries
  – High data throughput

• Relationship to text processing techniques:
  – Still regular expressions
  – Still automata and lazy DFAs
  – Different scale

• Techniques:
  – Lazy DFAs work for reasons specific to XML
  – Stream indexes and views: ongoing research

Future Work

• Handle branches in XPath expressions

• View selection for a given workload

• Network configuration

Thank you !

Links:

www.cs.washington.edu/homes/suciu

www.cs.washington.edu/homes/suciu/XMLTK

xmltk.sourceforge.net