Page 1: Distributed Structural and Value  XML Filtering

Distributed Structural and Value XML Filtering

Iris Miliaraki and Manolis Koubarakis

Department of Informatics and Telecommunications
National and Kapodistrian University of Athens

*The paper will be presented at the "4th ACM International Conference on Distributed Event-Based Systems (DEBS 2010)", Cambridge, UK.

9th Hellenic Data Management Symposium, Ayia Napa, Cyprus

Page 2: Distributed Structural and Value  XML Filtering

Outline

• XML Filtering scenario
• Background
  – DHTs
  – Structural matching
• Value matching
• Experiments
• Sum up and future work

Page 3: Distributed Structural and Value  XML Filtering

XML Filtering scenario

[Figure: Subscribers and Publishers connected through an XML Filtering system; label: XPath/XQuery?]

Centralized: YFilter, XTrie, FiST, Index-Filter, XPush

Distributed: ONYX, Gong et al. [ICDE05], Parallel/Hierarchical XTrie, Snoeren [SOSP 2001], Miliaraki [WWW 2008]

Page 4: Distributed Structural and Value  XML Filtering

XML Filtering scenario

[Figure: Subscribers and Publishers; label: XPath/XQuery?]

Page 5: Distributed Structural and Value  XML Filtering

Background: DHTs

Structured overlay networks

Solve the item location problem in a distributed and dynamic network of nodes (in O(log N) hops): let x be some data item. Find x!

Distributed version of the hash table data structure: id = Hash(K)

Main operations:
– Put: given a key (for a data item), map the key onto a node.
– Get: find the location of a data item with a given key.
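The Put and Get operations can be illustrated with a minimal single-process sketch. It assumes a ring of node identifiers where each key is stored at its successor node; the class name ToyDHT, the peer names and the 16-bit identifier space are hypothetical choices made for illustration. It does not model routing, so the O(log N) hop bound is not demonstrated here.

```python
import hashlib
from bisect import bisect_left


def hash_id(key: str, bits: int = 16) -> int:
    """Map a string to an identifier in [0, 2**bits): id = Hash(K)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class ToyDHT:
    """Single-process stand-in for a DHT: no routing, no node churn."""

    def __init__(self, node_names):
        # Sort nodes by identifier to form the ring.
        self.ring = sorted((hash_id(name), name) for name in node_names)
        self.storage = {name: {} for name in node_names}

    def responsible_node(self, key):
        # The first node whose identifier is >= hash(key), wrapping around.
        ids = [node_id for node_id, _ in self.ring]
        pos = bisect_left(ids, hash_id(key)) % len(self.ring)
        return self.ring[pos][1]

    def put(self, key, item):
        self.storage[self.responsible_node(key)][key] = item

    def get(self, key):
        return self.storage[self.responsible_node(key)].get(key)


dht = ToyDHT(["peer-%d" % i for i in range(8)])
dht.put("x", "some data item")
found = dht.get("x")
```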

Page 6: Distributed Structural and Value  XML Filtering

XML Filtering scenario

[Figure: Subscribers and Publishers connected through a DHT; label: XPath/XQuery?]

Page 7: Distributed Structural and Value  XML Filtering

XML data model - example

<bib>
  <article title="XML Filtering" conf="VLDB" year="2007">
    <school>Univ. of Athens</school>
    <author institute="Harvard">John Smith</author>
  </article>
</bib>

Q1: /bib/*/author[text()="John Smith"]

Q2: /bib/phdthesis[@published=2005]/author[@nationality=greek]

Q3: /bib/article[@conf=www]

Q4: /bib/article[@year=2009]/author[@degree-from="UOA"]

Q5: /bib/article[@year=2009]/cite[@paper-id=2392]

Q6: /bib/article/cite[@paper-id=2770]

Structural matching

Value matching
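To make the distinction between structural and value matching concrete, the following minimal sketch checks Q1 and Q3 against the example document using Python's standard xml.etree.ElementTree. Splitting each query by hand into a path lookup followed by a value test is an assumption made for illustration; it is not the filtering method of the talk.

```python
import xml.etree.ElementTree as ET

DOC = """<bib>
  <article title="XML Filtering" conf="VLDB" year="2007">
    <school>Univ. of Athens</school>
    <author institute="Harvard">John Smith</author>
  </article>
</bib>"""

root = ET.fromstring(DOC)  # root is the <bib> element

# Q1: /bib/*/author[text()="John Smith"]
# Structural part: any author that is a grandchild of bib.
q1_structural = root.findall("./*/author")
# Value part: the text of the author element.
q1_matches = [a for a in q1_structural if (a.text or "").strip() == "John Smith"]

# Q3: /bib/article[@conf=www]
# The structure matches, but the attribute value does not (conf is VLDB).
q3_structural = root.findall("./article")
q3_matches = [a for a in q3_structural if a.get("conf") == "www"]
```

Q3 shows why structure alone is not enough: the path /bib/article exists in the document, yet the query must not be reported as a match.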

Page 8: Distributed Structural and Value  XML Filtering

Automata-based approaches

XFilter and YFilter, ONYX, XTrie, IndexFilter, FiST etc.

Main idea
– Construct an automaton from a set of XPath/XQuery queries
– Use it as a matching engine against the XML documents

Page 9: Distributed Structural and Value  XML Filtering

Example NFA (YFilter)

Q1: /bib/phdthesis/year = '2008'
Q2: /bib/proceedings/school = 'Univ. of Athens'
Q3: /bib/proceedings/title = 'XML Dissemination'
Q4: /bib/*/author = 'John Smith'
Q5: //*/cite [@id = 12743]

[Figure: NFA with states 0 to 11; accepting states 3 (year, Q1), 5 (school, Q2), 6 (title, Q3), 8 (author, Q4) and 11 (cite, Q5); transition labels: bib, phdthesis, proceedings, year, school, title, author, cite, *, ε]
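The idea of one shared automaton for many path queries can be sketched as follows. This is a simplified construction in the spirit of YFilter, not its actual implementation: it handles only the structural part of the five queries above (child axis, descendant axis and wildcard) and runs over a single root-to-node path of element names. The class and method names are hypothetical.

```python
import re


class PathNFA:
    def __init__(self):
        self.trans = [{}]      # state -> {label: next state}
        self.eps = [None]      # state -> state reached by an epsilon move
        self.accept = [set()]  # state -> queries accepted in this state
        self.loops = set()     # states with a self-loop on any element

    def _new_state(self):
        self.trans.append({})
        self.eps.append(None)
        self.accept.append(set())
        return len(self.trans) - 1

    def add_query(self, name, path):
        state = 0
        for axis, label in re.findall(r"(//|/)([^/]+)", path):
            if axis == "//":
                # Descendant axis: epsilon move to a state that loops on "*".
                if self.eps[state] is None:
                    self.eps[state] = self._new_state()
                    self.loops.add(self.eps[state])
                state = self.eps[state]
            if label not in self.trans[state]:
                self.trans[state][label] = self._new_state()  # shared prefix otherwise
            state = self.trans[state][label]
        self.accept[state].add(name)

    def _closure(self, states):
        result, stack = set(), list(states)
        while stack:
            s = stack.pop()
            if s not in result:
                result.add(s)
                if self.eps[s] is not None:
                    stack.append(self.eps[s])
        return result

    def run(self, path):
        """Return the queries matched by any prefix of a root-to-node path."""
        active, matched = self._closure({0}), set()
        for element in path:
            nxt = set()
            for s in active:
                for label in (element, "*"):
                    if label in self.trans[s]:
                        nxt.add(self.trans[s][label])
                if s in self.loops:
                    nxt.add(s)
            active = self._closure(nxt)
            for s in active:
                matched |= self.accept[s]
        return matched


nfa = PathNFA()
nfa.add_query("Q1", "/bib/phdthesis/year")
nfa.add_query("Q2", "/bib/proceedings/school")
nfa.add_query("Q3", "/bib/proceedings/title")
nfa.add_query("Q4", "/bib/*/author")
nfa.add_query("Q5", "//*/cite")
```

Because Q2 and Q3 share the prefix /bib/proceedings, they share the corresponding states, which is the main source of the automaton's compactness.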

Page 10: Distributed Structural and Value  XML Filtering

Distributed structural matching

Utilize a distributed version of a state-of-the-art approach, YFilter

Instead of a centralized NFA, distribute the NFA in the DHT

I. Miliaraki, Z. Kaoudi and M. Koubarakis. XML Data Dissemination using automata on top of structured overlay networks. In WWW 2008.
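One way to picture an NFA spread over a DHT is sketched below. It assumes that each state is identified by the concatenation of the labels on its path from the start state and that the peer responsible for a state is chosen by hashing that key; both are assumptions made for this illustration, as are the peer names and the modulo-based peer selection. Only the child axis and the wildcard are handled.

```python
import hashlib

PEERS = ["peer-0", "peer-1", "peer-2", "peer-3"]  # hypothetical peer names


def responsible_peer(state_key):
    # Stand-in for a DHT lookup: hash the state key onto one of the peers.
    digest = hashlib.sha1(state_key.encode("utf-8")).digest()
    return PEERS[int.from_bytes(digest, "big") % len(PEERS)]


def _state(fragments, key):
    # Each peer holds a fragment: the NFA states whose keys hash to it.
    fragment = fragments.setdefault(responsible_peer(key), {})
    return fragment.setdefault(key, {"next": {}, "accept": set()})


def index_query(fragments, query_id, steps):
    key = "start"
    for label in steps:
        child_key = key + "/" + label
        _state(fragments, key)["next"][label] = child_key
        key = child_key
    _state(fragments, key)["accept"].add(query_id)


def match_path(fragments, path):
    """Return matched queries and the peers contacted along the way."""
    active, matched, contacted = {"start"}, set(), []
    for element in path:
        nxt = set()
        for key in active:
            contacted.append(responsible_peer(key))  # one message per expanded state
            state = _state(fragments, key)
            for label in (element, "*"):
                if label in state["next"]:
                    nxt.add(state["next"][label])
        active = nxt
        for key in active:
            matched |= _state(fragments, key)["accept"]
    return matched, contacted


fragments = {}
index_query(fragments, "Q1", ["bib", "phdthesis", "year"])
index_query(fragments, "Q4", ["bib", "*", "author"])
matched, contacted = match_path(fragments, ["bib", "phdthesis", "author"])
```

The list of contacted peers gives a rough feel for the network cost of expanding states that live on different peers.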

Page 11: Distributed Structural and Value  XML Filtering

Distributed NFA

Structural matching!

What about value matching?

Page 12: Distributed Structural and Value  XML Filtering

What about value matching?

• Automata-based approaches are efficient for structural matching

• Queries, apart from defining a structural path, also contain value-based predicates

/dblp/phdthesis[@year=2005]/author[@nationality=greek]

• Our goal: scale with both the size of the query set and the number of predicates per query

Page 13: Distributed Structural and Value  XML Filtering

Definitions

Attribute predicates: element[@attr op value]
e.g. /bib/phdthesis[@published=2007]

Textual predicates: element[text() op value]
e.g. /bib/*/author[text()="John Smith"]
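The two predicate forms can be captured by one small data structure. The sketch below is an assumption made for illustration: the Predicate class, its field names and the set of supported operators are hypothetical, and values are compared as strings, so ordering operators compare lexicographically.

```python
import operator
from dataclasses import dataclass

OPS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


@dataclass(frozen=True)
class Predicate:
    element: str  # element name, or "*"
    target: str   # "@attr" for attribute predicates, "text()" for textual ones
    op: str
    value: str

    def holds(self, tag, attributes, text):
        """Evaluate the predicate against one parsed element."""
        if self.element not in ("*", tag):
            return False
        if self.target == "text()":
            actual = text.strip()
        else:
            actual = attributes.get(self.target[1:])  # drop the leading "@"
        if actual is None:
            return False
        return OPS[self.op](actual, self.value)


published = Predicate("phdthesis", "@published", "=", "2007")
author = Predicate("author", "text()", "=", "John Smith")
```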

Page 14: Distributed Structural and Value  XML Filtering

Direct evaluation with automaton/trie

Treat predicates as elements!

Lazy DFA [Gupta and Suciu, 2003]: the hope is that only a small set of DFA states will be computed at runtime

Q1: /dblp/phdthesis[@year=2005]/author[@nationality=greek]
Q2: /bib/*/author[text()=Michael Smith]
Q3: /bib/article/conference[text()=WWW 2009]

[Figure: NFA with states 0 to 11 in which predicates appear as additional transitions; transition labels: bib, phdthesis, article, *, author, conference, year, nationality, text()]

Huge increase of NFA states!

Destroys sharing of path expressions!
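The loss of sharing can be quantified on a small, hypothetical workload: 100 queries that follow the same path /bib/article/author and differ only in a year predicate on article. The sketch counts the states of a prefix-sharing trie with and without predicates treated as steps; the workload and the function name are assumptions made for illustration.

```python
def count_states(queries):
    """Count the states of a trie that shares common prefixes (root included)."""
    root, count = {}, 1
    for steps in queries:
        node = root
        for step in steps:
            if step not in node:
                node[step] = {}
                count += 1
            node = node[step]
    return count


years = range(1900, 2000)

# Structure only: every query follows the same three steps.
structure_only = [["bib", "article", "author"] for _ in years]

# Predicates treated as elements: the year value becomes an extra step,
# so the author step below it can no longer be shared.
predicates_inlined = [["bib", "article", "@year=%d" % y, "author"] for y in years]

states_structure_only = count_states(structure_only)
states_inlined = count_states(predicates_inlined)
```

With structure only, all 100 queries share one path; with inlined predicates, each distinct value adds its own branch.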

Page 15: Distributed Structural and Value  XML Filtering

Bottom-up evaluation

Common rule in relational query optimization: apply selections as early as possible

Works well for relational query processing

pFist [Kwon et al. 2005]

A lot of effort is spent evaluating predicates while the structure may not be matched

Page 16: Distributed Structural and Value  XML Filtering

Step-by-step evaluation

XPath queries consist of distinct steps

Each step contains one or more value-based predicates

Perform value matching with structural matching in a stepwise manner

YFilter – Inline [Diao et al. 2003]: process predicates when NFA state is reached

Effort spent for evaluating predicates while the structure may not be fully matched

Page 17: Distributed Structural and Value  XML Filtering

Top-down evaluation

Check predicates after structural matching

YFilter – Selection-Postponed [Diao et al. 2003]: performs predicate evaluation after the execution of the NFA

VA-RoXSum [Vagena et al. 2007]: focus on message aggregation

Depending on predicate selectivity, the number of false positives may be very large

Page 18: Distributed Structural and Value  XML Filtering

Moving on to details

Parse XML document and generate a set of candidate predicates to perform predicate evaluation

Enriched parsing events

Candidate predicates
CP1: article[@title="XML Filtering"]
CP2: article[@conf=VLDB]
CP3: article[@year=2007]
CP4: author[text()="John Smith"]
CP5: author[@institute=Harvard]
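Generating candidate predicates from a document can be sketched with the standard xml.etree.ElementTree parser. Representing each candidate as an (element, target, value) tuple is an assumption made for illustration. Note that this sketch also emits a textual candidate for the school element, which the slide's list of five candidates does not show.

```python
import xml.etree.ElementTree as ET

DOC = """<bib>
  <article title="XML Filtering" conf="VLDB" year="2007">
    <school>Univ. of Athens</school>
    <author institute="Harvard">John Smith</author>
  </article>
</bib>"""


def candidate_predicates(xml_text):
    """Emit one candidate per attribute and per non-empty text node."""
    candidates = []
    for element in ET.fromstring(xml_text).iter():
        for attr, value in element.attrib.items():
            candidates.append((element.tag, "@" + attr, value))
        text = (element.text or "").strip()
        if text:
            candidates.append((element.tag, "text()", text))
    return candidates


candidates = candidate_predicates(DOC)
```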

Page 19: Distributed Structural and Value  XML Filtering

Step-by-step evaluation

Top-down evaluation

Top-down evaluation with pruning

Bottom-up evaluation

Page 20: Distributed Structural and Value  XML Filtering

Step-by-step evaluation

• Associate NFA states with relevant predicate information organized using a hash index

• At each step of the execution
  – check predicates
  – update list with partially matched queries Q
  – continue with expanding state if Q not empty

Check value-predicates while matching the structure

Page 21: Distributed Structural and Value  XML Filtering

Example

Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek]

Q2: /bib/*/author[text()="John Smith"]

Q3: /bib/article[@conf=www]

Q4: /bib/article[@year=2009]/author[@degree-from="UOA"]

Q5: /bib/article[@year=2009]/cite[@paper-id=2392]

Q6: /bib/article/cite[@paper-id=2770]

[Figure: NFA with states 0 to 8; transition labels: bib, phdthesis, *, article, author, conference, cite]

PREDICATE       QUERY LIST
true            {Q1,Q2,Q3,Q4,Q5,Q6}

PREDICATE       QUERY LIST
true            {Q6}
[@conf=www]     {Q3}
[@year=2009]    {Q4,Q5}

Candidate predicates
CP1: article[@title="XML Filtering"]
CP2: article[@conf=VLDB]
CP3: article[@year=2007]
CP4: author[text()="John Smith"]
CP5: author[@institute=Harvard]

At each step of the execution
1. check predicates
2. update list with partially matched queries Q
3. continue with expanding state if Q not empty
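The three steps can be sketched on a single-process automaton. The sketch assumes that each state stores a hash index from predicates to the queries that require them at that step, with None standing for the "true" entry; the data layout and function names are hypothetical, and only the child axis, the wildcard and equality predicates are handled. Query numbering follows this example (Q2 is the John Smith query).

```python
def new_state():
    return {"next": {}, "index": {}, "accept": set()}


def index_query(root, qid, steps):
    """steps: list of (label, predicate); predicate is None or (target, value)."""
    state = root
    for label, predicate in steps:
        state = state["next"].setdefault(label, new_state())
        state["index"].setdefault(predicate, set()).add(qid)
    state["accept"].add(qid)


def satisfied(predicate, element):
    if predicate is None:  # the "true" entry of the predicate table
        return True
    target, value = predicate
    _tag, attrs, text = element
    actual = text.strip() if target == "text()" else attrs.get(target[1:])
    return actual == value


def step_by_step(root, path):
    """path: list of (tag, attributes, text) from the document root downwards."""
    matched = set()
    active = [(root, None)]  # (state, partially matched queries Q)
    for element in path:
        nxt = []
        for state, alive in active:
            for label in {element[0], "*"}:
                child = state["next"].get(label)
                if child is None:
                    continue
                ok = set()
                for predicate, qids in child["index"].items():  # 1. check predicates
                    if satisfied(predicate, element):
                        ok |= qids
                q = ok if alive is None else alive & ok         # 2. update Q
                if q:                                           # 3. expand only if Q not empty
                    matched |= child["accept"] & q
                    nxt.append((child, q))
        active = nxt
    return matched


root = new_state()
index_query(root, "Q2", [("bib", None), ("*", None), ("author", ("text()", "John Smith"))])
index_query(root, "Q3", [("bib", None), ("article", ("@conf", "www"))])
index_query(root, "Q4", [("bib", None), ("article", ("@year", "2009")),
                         ("author", ("@degree-from", "UOA"))])
index_query(root, "Q6", [("bib", None), ("article", None), ("cite", ("@paper-id", "2770"))])

path = [
    ("bib", {}, ""),
    ("article", {"title": "XML Filtering", "conf": "VLDB", "year": "2007"}, ""),
    ("author", {"institute": "Harvard"}, "John Smith"),
]
result = step_by_step(root, path)
```

At the article state only Q6 survives, so the author transition below it is never accepted for Q4, while the wildcard branch carries Q2 to its accepting state.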

Page 22: Distributed Structural and Value  XML Filtering

Step-by-step evaluation

Top-down evaluation

Top-down evaluation with pruning

Bottom-up evaluation

Page 23: Distributed Structural and Value  XML Filtering

Top-down evaluation

Execute distributed NFA

Only check predicates if an accepting state is reached

Each peer uses a local index mapping predicates to the list of queries that contain them (hash index)

Delay value matching until after structural matching
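The top-down order of work can be sketched as a post-filter over the structurally matched queries. The sketch assumes the structural matches are already known, represents predicates as (element, target, value) tuples, and counts index hits per query; these choices and the function names are hypothetical. As a simplification it does not track which element instance satisfied a predicate.

```python
from collections import defaultdict


def build_predicate_index(query_predicates):
    """Hash index: predicate -> set of queries that contain it."""
    index = defaultdict(set)
    for qid, predicates in query_predicates.items():
        for predicate in predicates:
            index[predicate].add(qid)
    return index


def top_down(structural_matches, query_predicates, index, candidates):
    """Keep the structurally matched queries whose predicates all hold."""
    hits = defaultdict(int)
    for candidate in set(candidates):
        for qid in index.get(candidate, ()):
            if qid in structural_matches:
                hits[qid] += 1
    return {q for q in structural_matches if hits[q] == len(query_predicates[q])}


query_predicates = {
    "Q2": {("author", "text()", "John Smith")},
    "Q3": {("article", "@conf", "www")},
    "Q4": {("article", "@year", "2009"), ("author", "@degree-from", "UOA")},
}
candidates = [
    ("article", "@title", "XML Filtering"),
    ("article", "@conf", "VLDB"),
    ("article", "@year", "2007"),
    ("author", "text()", "John Smith"),
    ("author", "@institute", "Harvard"),
]
index = build_predicate_index(query_predicates)

# Q2, Q3 and Q4 all match the path bib/article/author structurally.
result = top_down({"Q2", "Q3", "Q4"}, query_predicates, index, candidates)
```

Here Q3 and Q4 are structural false positives that are only discarded at the end, which is the cost this method accepts.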

Page 24: Distributed Structural and Value  XML Filtering

Example

Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek]

Q2: /bib/*/author[text()="John Smith"]

Q3: /bib/article[@conf=www]

Q4: /bib/article[@year=2009]/author[@degree-from="UOA"]

Q5: /bib/article[@year=2009]/cite[@paper-id=2392]

Q6: /bib/article/cite[@paper-id=2770]

[Figure: NFA with states 0 to 8; transition labels: bib, phdthesis, *, article, author, conference, cite]

PREDICATE           QUERY LIST
[@conf=WWW]         {Q3}

PREDICATE           QUERY LIST
[@paper-id=2770]    {Q6}
[@paper-id=2392]    {Q5}
[@year=2009]        {Q5}

Candidate predicates
CP1: article[@title="XML Filtering"]
CP2: article[@conf=VLDB]
CP3: article[@year=2007]
CP4: author[text()="John Smith"]
CP5: author[@institute=Harvard]

Page 25: Distributed Structural and Value  XML Filtering

Step-by-step evaluation

Top-down evaluation

Top-down evaluation with pruning

Bottom-up evaluation

Page 26: Distributed Structural and Value  XML Filtering

Top-down evaluation with pruning

At each step of the execution, part of the NFA is revealed

Applies to equality predicates

IDEA: Use a compact summary of predicate information to stop NFA execution (prune) if we can deduce that no match can be found

[Figure: NFA with states 0 to 8; transition labels: bib, phdthesis, *, article, author, conference, cite]

Page 27: Distributed Structural and Value  XML Filtering

TD with pruning – Details

• Each peer is responsible for storing many NFA fragments

• Each peer keeps one Bloom filter, the value filter (VF), which summarizes the predicates of the queries indexed in the relevant NFA fragments

• Assuming a peer p and a state st, for each query q whose NFA accepting path contains st, we insert one predicate of q in the VF of p

Page 28: Distributed Structural and Value  XML Filtering

TD with pruning - Main idea cont.

• Predicates are inserted as a whole in VFs using their string representation:
  – element[@attr=value] → element + attr + value
  – element[text()=value] → element + text + value

• VFs are updated during query indexing

• Since we traverse the NFA accepting path of a query to index it, all relevant VFs will be updated
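A value filter and the pruning test can be sketched with a small Bloom filter. The filter size, the number of hash functions and the use of salted SHA-1 digests are assumptions made for illustration; the predicate key follows the string representation described above (plain concatenation of element, attribute name or "text", and value).

```python
import hashlib


class ValueFilter:
    """Bloom filter over predicate strings: no false negatives, some false positives."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # a Python int used as an m-bit array

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha1(("%d:%s" % (i, key)).encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def insert(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def predicate_key(element, name, value):
    # element[@attr=value] -> element + attr + value
    # element[text()=value] -> element + "text" + value
    return element + name + value


def should_prune(value_filter, candidate_keys):
    """Stop NFA execution if no candidate predicate can be in the filter."""
    return not any(value_filter.might_contain(k) for k in candidate_keys)


vf = ValueFilter()
vf.insert(predicate_key("article", "conf", "www"))      # one predicate of Q3
vf.insert(predicate_key("cite", "paper-id", "2770"))    # one predicate of Q6

candidate_keys = [
    predicate_key("article", "title", "XML Filtering"),
    predicate_key("article", "conf", "VLDB"),
    predicate_key("article", "year", "2007"),
    predicate_key("author", "text", "John Smith"),
    predicate_key("author", "institute", "Harvard"),
]
prune = should_prune(vf, candidate_keys)
```

Because a Bloom filter never reports a false negative, pruning on a miss cannot lose a match; a false positive only means that execution continues when it could have stopped.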

Page 29: Distributed Structural and Value  XML Filtering

Example: Constructing Value filters

Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek]
Q2: /bib/*/author[text()="John Smith"]
Q3: /bib/article[@conf=www]
Q4: /bib/article[@year=2009]/author[@degree-from="UOA"]
Q5: /bib/article[@year=2009]/cite[@paper-id=2392]
Q6: /bib/article/cite[@paper-id=2770]

[Figure: a peer in the DHT is responsible for a fragment of the NFA (states 0 to 8; transition labels: bib, phdthesis, *, article, author, conference, cite) and keeps a value filter, an m-bit filter shown as 100…101]

Select 1 predicate per query to insert

Page 30: Distributed Structural and Value  XML Filtering

Example – Querying Value filters

Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek]
Q2: /bib/*/author[text()="John Smith"]
Q3: /bib/article[@conf=www]
Q4: /bib/article[@year=2009]/author[@degree-from="UOA"]
Q5: /bib/article[@year=2009]/cite[@paper-id=2392]
Q6: /bib/article/cite[@paper-id=2770]
Q7: //article[@year=2007]

Candidate predicates
CP1: article[@title="XML Filtering"]
CP2: article[@conf=VLDB]
CP3: article[@year=2007]
CP4: author[text()="John Smith"]
CP5: author[@institute=Harvard]

Step 1: expanding state 0, check value filter: MATCH! Execution continues

Step 2: expanding state 1, check value filter: MISS! Execution stops

[Figure: peers in the DHT with value filters shown as 100…101; NFA with states 0 to 10; transition labels: bib, phdthesis, *, article, author, conference, cite, ε]

Page 31: Distributed Structural and Value  XML Filtering

Online selectivity estimation

• TD with pruning selects one of the predicates of each query to insert in the value filter
  – Randomly
  – Or... the most selective predicate

• Example
/bib/article[@year=2009]/author[text()="John Smith"]

• It is not feasible to store the entire set of XML data that have been processed by our system

Sampling!
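One possible way to estimate selectivity from a sample is sketched below. The slides only say that sampling is used; reservoir sampling over the candidate predicates of each processed document is a technique swapped in here as an assumption, and the class and method names are hypothetical. The selectivity of a predicate is estimated as the fraction of sampled documents that satisfy it, and the most selective predicate is the one with the smallest fraction.

```python
import random


class DocumentSample:
    """Fixed-size uniform sample of processed documents (reservoir sampling)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.seen = 0
        self.sample = []
        self.rng = random.Random(seed)

    def observe(self, candidate_predicates):
        self.seen += 1
        doc = frozenset(candidate_predicates)
        if len(self.sample) < self.capacity:
            self.sample.append(doc)
        else:
            # Replace a random slot with probability capacity / seen.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.sample[slot] = doc

    def selectivity(self, predicate):
        if not self.sample:
            return 1.0  # no information yet: assume the predicate always holds
        return sum(predicate in doc for doc in self.sample) / len(self.sample)

    def most_selective(self, predicates):
        return min(predicates, key=self.selectivity)


YEAR = ("article", "@year", "2009")
AUTHOR = ("author", "text()", "John Smith")

sample = DocumentSample(capacity=10)
for i in range(10):
    doc = set()
    if i < 8:
        doc.add(YEAR)    # a common value in this hypothetical stream
    if i == 0:
        doc.add(AUTHOR)  # a rare value
    sample.observe(doc)

chosen = sample.most_selective([YEAR, AUTHOR])
```

For the example query, the author predicate would be inserted in the value filter because far fewer sampled documents satisfy it.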

Page 32: Distributed Structural and Value  XML Filtering

Step-by-step evaluation

Top-down evaluation

Top-down evaluation with pruning

Bottom-up evaluation

Page 33: Distributed Structural and Value  XML Filtering

Bottom-up evaluation

Queries are indexed in the network using their predicates

For each distinct predicate in the query set, select a responsible peer using the DHT

The peer organizes its queries using a local index mapping predicates to the list of queries that contain them

This indexing model resembles work from the area of Information Filtering

Check values as early as possible

Page 34: Distributed Structural and Value  XML Filtering

Bottom-up evaluation cont.

• Construct set of candidate predicates

• For each candidate predicate contact responsible peer
  – Peer checks its local index
  – Performs locally structural matching

Check values as early as possible
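The bottom-up flow can be sketched in a single process. The sketch assumes that every query is stored under each of its predicates together with its structural steps, that the responsible peer is found by hashing the predicate, and that the peer checks the remaining predicates and the structure locally; the peer names, data layout and function names are hypothetical. Queries without predicates are not reachable in this simplified version.

```python
import hashlib
from collections import defaultdict

PEERS = ["peer-0", "peer-1", "peer-2", "peer-3"]  # hypothetical peer names


def responsible_peer(predicate):
    # Stand-in for a DHT lookup keyed by the predicate.
    digest = hashlib.sha1("|".join(predicate).encode("utf-8")).digest()
    return PEERS[int.from_bytes(digest, "big") % len(PEERS)]


def index_query(network, qid, steps, predicates):
    for predicate in predicates:
        local_index = network[responsible_peer(predicate)]
        local_index[predicate].append((qid, tuple(steps), frozenset(predicates)))


def structure_matches(steps, doc_paths):
    # Local structural matching: child axis and wildcard only.
    return any(
        len(steps) == len(path) and all(s in ("*", e) for s, e in zip(steps, path))
        for path in doc_paths
    )


def bottom_up(network, doc_paths, candidates):
    matched, candidate_set = set(), set(candidates)
    for candidate in candidate_set:
        local_index = network[responsible_peer(candidate)]  # contact responsible peer
        for qid, steps, predicates in local_index.get(candidate, ()):
            if predicates <= candidate_set and structure_matches(steps, doc_paths):
                matched.add(qid)
    return matched


network = defaultdict(lambda: defaultdict(list))
index_query(network, "Q2", ["bib", "*", "author"], [("author", "text()", "John Smith")])
index_query(network, "Q3", ["bib", "article"], [("article", "@conf", "www")])

doc_paths = {("bib",), ("bib", "article"), ("bib", "article", "school"),
             ("bib", "article", "author")}
candidates = [
    ("article", "@title", "XML Filtering"),
    ("article", "@conf", "VLDB"),
    ("article", "@year", "2007"),
    ("author", "text()", "John Smith"),
    ("author", "@institute", "Harvard"),
]
result = bottom_up(network, doc_paths, candidates)
```

Only peers that hold a matching predicate do any structural work, which is the sense in which values are checked as early as possible.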

Page 35: Distributed Structural and Value  XML Filtering

Example

Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek]

[Figure: for each predicate, find the responsible peer in the DHT]

Candidate predicates
CP1: article[@title="XML Filtering"]
CP2: article[@conf=VLDB]
CP3: article[@year=2007]
CP4: author[text()="John Smith"]
CP5: author[@institute=Harvard]

Page 36: Distributed Structural and Value  XML Filtering

Experiments

• Implemented methods using FreePastry release in Java

• Environment
  – Cluster (http://www.grid.tuc.gr/)
  – 28 machines (4 peers per machine)
  – 253 peers from Planetlab network

• Queries
  – Sets of 1000000 queries
  – 1 to 8 predicates per query

• Data
  – NITF DTD

• Bloom filter size
  – 100K bits

Page 37: Distributed Structural and Value  XML Filtering

Cluster (2 predicates per query)

Page 38: Distributed Structural and Value  XML Filtering

Cluster (4 predicates per query)

Page 39: Distributed Structural and Value  XML Filtering

Cluster (4 predicates per query)

Page 40: Distributed Structural and Value  XML Filtering

Network traffic

Page 41: Distributed Structural and Value  XML Filtering

Sum up & future work

Described methods to combine both structural and value XML filtering in a distributed environment

Experimental evaluation of our methods

Future work
– Potential improvements for SBS method
– More sophisticated methods for selectivity estimation
– Range predicates
– Textual predicates

Page 42: Distributed Structural and Value  XML Filtering

Questions?

Page 43: Distributed Structural and Value  XML Filtering

Planetlab (2 predicates per query)

Page 44: Distributed Structural and Value  XML Filtering

Performance improvement

Page 46: Distributed Structural and Value  XML Filtering

Structural vs. value matching (small query set)

Page 47: Distributed Structural and Value  XML Filtering

Structural vs. value matching (large query set)

Page 54: Distributed Structural and Value  XML Filtering

<?xml version="1.0" encoding="UTF-8"?>
<statuses>
  <status>
    <created_at>Tue Apr 07 22:52:51 +0000 2009</created_at>
    <id>1472669360</id>
    <text>At least I can get your humor through tweets. RT @abdur: I don't mean this in a bad way, but genetically speaking your a cul-de-sac.</text>
    <source>&lt;a href="http://www.tweetdeck.com/">TweetDeck&lt;/a></source>
    <truncated>false</truncated>
    <in_reply_to_status_id></in_reply_to_status_id>
    <in_reply_to_user_id></in_reply_to_user_id>
    <favorited>false</favorited>
    <in_reply_to_screen_name></in_reply_to_screen_name>
    <user>
      <id>1401881</id>
      <name>Doug Williams</name>
      <screen_name>dougw</screen_name>
      <location>San Francisco, CA</location>
      <description>Twitter API Support. Internet, greed, users, dougw and opportunities are my passions.</description>
      <profile_image_url>http://s3.amazonaws.com/twitter_production/profile_images/59648642/avatar_normal.png</profile_image_url>
      <url>http://www.igudo.com</url>
      <protected>false</protected>
      <followers_count>1027</followers_count>
      <profile_background_color>9ae4e8</profile_background_color>
      <profile_text_color>000000</profile_text_color>
      <profile_link_color>0000ff</profile_link_color>
      <profile_sidebar_fill_color>e0ff92</profile_sidebar_fill_color>
      <profile_sidebar_border_color>87bc44</profile_sidebar_border_color>
      <friends_count>293</friends_count>
      <created_at>Sun Mar 18 06:42:26 +0000 2007</created_at>
      <favourites_count>0</favourites_count>
      <utc_offset>-18000</utc_offset>
      <time_zone>Eastern Time (US &amp; Canada)</time_zone>
      <profile_background_image_url>http://s3.amazonaws.com/twitter_production/profile_background_images/2752608/twitter_bg_grass.jpg</profile_background_image_url>
      <profile_background_tile>false</profile_background_tile>
      <statuses_count>3390</statuses_count>
      <notifications>false</notifications>
      <following>false</following>
      <verified>true</verified>
    </user>
    <geo/>
  </status>
  ... truncated ...
</statuses>