Efficient processing of XPath queries with structured overlay networks

Gleb Skobeltsyn, Manfred Hauswirth, Karl Aberer. Ecole Polytechnique Fédérale de Lausanne, Switzerland. Presenter: Gleb Skobeltsyn. OnTheMove - OTM 2005 Federated Conferences, Agia Napa, Cyprus, 2 November 2005. This work was (partially) funded by the EU project BRICKS, http://www.brickscommunity.org

Page 1: Efficient processing of XPath queries with structured overlay networks

Ecole Polytechnique Fédérale de Lausanne, Switzerland

Efficient processing of XPath queries with structured overlay networks

Gleb Skobeltsyn, Manfred Hauswirth, Karl Aberer

Agia Napa, Cyprus. 2 November 2005.

Presenter: Gleb Skobeltsyn

This work was (partially) funded by the EU project BRICKS http://www.brickscommunity.org

OnTheMove - OTM 2005 Federated Conferences

Page 2: Efficient processing of XPath queries with structured overlay networks

Contents

• Motivation & Problem statement

1. P-Grid short overview

2. Indexing strategy
– Basic Index
– Caching strategy

3. Simulation results

• Conclusions

Page 3: Efficient processing of XPath queries with structured overlay networks

Motivation

• Complex queries are easy to answer in unstructured P2P networks, e.g., Edutella.

• But this approach doesn’t scale because of its high bandwidth consumption.

• Structured P2P networks typically offer logarithmic search complexity, but require a special index.

• Open question: which indexing structure can support XPath queries over a distributed XML warehouse?

Page 4: Efficient processing of XPath queries with structured overlay networks

Problem statement (1/2)

• Problem: to be able to answer structured queries (e.g. XPath) in an XML warehouse distributed over a structured P2P network.

• We assume the use of 2 different indices:
– For indexing structure (e.g. pure XML path);
– For indexing values.

• In this paper we concentrate on the first index, i.e. on indexing structure.

Page 5: Efficient processing of XPath queries with structured overlay networks

Problem statement (2/2)

• We support XPath{*,//} queries, i.e. queries containing:
– Child axes (“/”);
– Descendant axes (“//”);
– Wildcards (“*”).

• Example: //A/B/*/C

• We propose an indexing structure to answer such queries in a large distributed P2P XML warehouse.

• We try to minimize the consumed bandwidth, measured in P2P overlay hops.
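To make the XPath{*,//} fragment concrete, here is a minimal sketch of how such a query could be checked against a single root-to-element path. It is not the presentation's implementation: the helper names `xpath_to_regex` and `matches` are hypothetical, paths are assumed to be written as labels joined by “/”, a query without a leading “//” is assumed to start at the root, and the last query step is assumed to match the last label of the path.

```python
import re

# Zero or more intermediate labels, each followed by "/" (used for "//").
ANY_DEPTH = r"(?:[^/]+/)*"

def xpath_to_regex(query: str) -> re.Pattern:
    """Translate an XPath{*,//} query into a regex over label paths.

    Assumption (not from the slides): "*" is a whole step matching exactly
    one label, "/" is a child step, "//" allows any number of labels between.
    """
    tokens = re.findall(r"//|/|[^/]+", query)
    pattern = ""
    separator = None      # separator seen just before the next label
    first_label = True
    for token in tokens:
        if token in ("/", "//"):
            separator = token
            continue
        label = r"[^/]+" if token == "*" else re.escape(token)
        if not first_label:
            pattern += "/"
        if separator == "//":
            pattern += ANY_DEPTH
        pattern += label
        first_label = False
        separator = None
    return re.compile(pattern)

def matches(query: str, path: str) -> bool:
    """True if the whole path satisfies the query."""
    return xpath_to_regex(query).fullmatch(path) is not None

if __name__ == "__main__":
    print(matches("//A/B/*/C", "store/A/B/x/C"))
    print(matches("A//C/D//E", "A/C/D/Z/E"))
```

A peer that holds a set of indexed paths could apply such a check locally to filter the paths that satisfy a query; how the original system evaluates the match is not stated in the slides.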

Page 6: Efficient processing of XPath queries with structured overlay networks

P-Grid (1/3): introduction

• P-Grid is a trie-based DHT P2P system, similar to Chord, Pastry, etc. (more info at http://www.p-grid.org/).

• In P-Grid each peer is responsible for the set of binary keys that start with the peer’s prefix.

• Routing is based on longest prefix matching (logarithmic search cost, also for skewed trees):

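[Figure: P-Grid binary trie with the prefixes 0*, 1*, 00*, 01*, 10*, 11* and the peers A to F. Each peer keeps a routing table that maps complementary prefixes to other peers; the tables shown are “1* : E, 01* : B”, “1* : C, D, 00* : F”, “0* : A, F, 11* : E” and “0* : B, F, 10* : D”. A query for ‘100’ is forwarded from peer to peer until the key is found.]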
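The prefix routing described above can be sketched in a few lines. This is a simplified illustration, not P-Grid code: peers are identified directly by their binary path (the figure's peer names and replicas are ignored), each peer keeps exactly one reference per level, and that reference is simply the first peer found in the complementary sub-tree. The function names are hypothetical.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two bit strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def build_routing_tables(peer_paths):
    """For every peer and every level of its path, store one reference to a
    peer whose path starts with the complementary prefix at that level.
    Assumption: the paths form a complete trie, so a candidate always exists."""
    tables = {}
    for path in peer_paths:
        refs = []
        for level, bit in enumerate(path):
            complement = path[:level] + ("1" if bit == "0" else "0")
            refs.append(next(p for p in peer_paths if p.startswith(complement)))
        tables[path] = refs
    return tables

def route(key: str, start: str, tables) -> list:
    """Forward the query until a peer's path is a prefix of the key
    (one responsible peer) or the key is a prefix of the peer's path
    (the peer lies inside the sub-tree responsible for the key)."""
    hops = [start]
    current = start
    while True:
        level = common_prefix_len(current, key)
        if level == len(current) or level == len(key):
            return hops
        # Each hop extends the matched prefix by at least one bit.
        current = tables[current][level]
        hops.append(current)

if __name__ == "__main__":
    peers = ["00", "01", "10", "11"]
    tables = build_routing_tables(peers)
    print(route("100", "00", tables))
```

Because every hop resolves at least one more bit of the key, the number of hops is bounded by the length of the responsible peer's path.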

Page 7: Efficient processing of XPath queries with structured overlay networks

P-Grid (2/3): storing indexing information

• Information is stored in data items.
• A data item is a {key, data} tuple.
• Each peer in the P-Grid network stores the data items whose keys start with the peer’s prefix.

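[Figure: P-Grid trie with the peer prefixes 00*, 010*, 011*, 10*, 1100*, 1101* and 111*. Each peer stores the data items whose keys fall under its prefix, e.g. 00001 (data1) and 0011 (data2) under 00*, 01001… under 010*, 011011… under 011*, 1010010… under 10*, 11000 and 110010 under 1100*, 11011 under 1101*, and 1111… under 111*.]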

Page 8: Efficient processing of XPath queries with structured overlay networks

P-Grid (3/3): order preserving hash function

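• Keys are generated using a P-Grid order-preserving hash function h( ):

$\forall s_1, s_2:\ s_1 \subseteq s_2 \Rightarrow h(s_1) \subseteq h(s_2)$, where $\subseteq$ denotes the prefix relation.

• Example: the key h(“comp”) is a prefix of the keys h(“computer”), h(“complexity”), h(“comp*”).

• Routing to the key h(“comp”) may lead to two cases:

1. The peer responsible for h(comp): a single peer, e.g. with the prefix h(co)*, stores h(complexity), h(computer), h(corporation), …

2. The sub-tree responsible for h(comp): several peers, e.g. with the prefixes h(compl)* and h(compu)*, store h(complexity), … and h(computer), … respectively.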
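The prefix property can be illustrated with a toy hash. The function below is an assumption made for illustration only and is not P-Grid's actual hash function: it maps every byte of the string to 8 bits, so a string prefix always yields a key prefix, and the lexicographic order of the keys follows the byte order of the strings.

```python
def toy_hash(s: str) -> str:
    """Toy prefix- and order-preserving hash (hypothetical, not P-Grid's h):
    concatenate the 8-bit binary code of every UTF-8 byte of the string."""
    return "".join(format(byte, "08b") for byte in s.encode("utf-8"))

if __name__ == "__main__":
    words = ["computer", "complexity", "corporation"]
    for word in words:
        # h("comp") is a prefix of h("computer") and h("complexity") only.
        print(word, toy_hash(word).startswith(toy_hash("comp")))
    # Sorting by key gives the same order as sorting the strings.
    print(sorted(words, key=toy_hash) == sorted(words))
```

Such a fixed-width encoding produces long keys and does not balance the key space; it only serves to show why routing to h(“comp”) ends either at one peer or at the root of a sub-tree that holds all keys starting with h(“comp”).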

Page 9: Efficient processing of XPath queries with structured overlay networks

Basic Index (1/4): introduction

• We index XML paths found in the document.

• Given a path P = l1/l2/…/lm, m data items are stored in P-Grid, using the following sub-paths (suffixes) as keys:

{ l1/l2/.../lm, l2/.../lm, …, lm }

• Each data item stores the original path and the URI of the document.

• Example: given a path P = “store/book/title”, 3 data items are created:

Key | Original Path | URI
h(“store/book/title”) | “store/book/title” | Link to the document
h(“book/title”) | “store/book/title” | Link to the document
h(“title”) | “store/book/title” | Link to the document
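The suffix-based indexing rule can be sketched as follows. The `DataItem` type, the function name `index_path` and the example URI are hypothetical; the hash function is passed in as a parameter, and the default identity function is only a stand-in that keeps the keys readable.

```python
from typing import Callable, List, NamedTuple

class DataItem(NamedTuple):
    key: str    # h(suffix of the path)
    path: str   # original path
    uri: str    # link to the document

def index_path(path: str, uri: str,
               h: Callable[[str], str] = lambda s: s) -> List[DataItem]:
    """Create one data item per suffix of the path, as in the table above.
    Assumption: labels are separated by "/" and h is the key function."""
    labels = path.split("/")
    return [DataItem(h("/".join(labels[i:])), path, uri)
            for i in range(len(labels))]

if __name__ == "__main__":
    # "doc1.xml" is a made-up URI used only for this example.
    for item in index_path("store/book/title", "doc1.xml"):
        print(item)
```

A path with m labels therefore costs m insertions, which is what makes every label sequence ending at the last label reachable through a single key lookup.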

Page 10: Efficient processing of XPath queries with structured overlay networks

Basic Index (2/4): search

• Given an XPath query Q = l1 s1 l2 … sk-1 lk, where each si ∈ {/, //, *}.

• The first longest sequence of labels separated by “/” is defined as qB.

• Example: for “A//C/D//E”, qB = “C/D”.

• The query is answered by routing to the peer responsible for h(qB).

• There are 2 cases:
– One peer is responsible for h(qB): this peer answers the query.
– A set (sub-tree) of peers is responsible for h(qB): a shower broadcast is executed over this set.
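The choice of qB can be sketched as below. The function names are hypothetical, and the sketch assumes that “//” and “*” both end a label sequence and that ties between equally long sequences are resolved in favour of the first one, as the definition above suggests.

```python
from typing import List

def label_sequences(query: str) -> List[List[str]]:
    """Split a query into maximal runs of labels joined only by "/".
    Assumption: "//" and the wildcard "*" both terminate a run."""
    sequences = []
    for piece in query.split("//"):
        current = []
        for step in piece.split("/"):
            if step and step != "*":
                current.append(step)
            else:
                if current:
                    sequences.append(current)
                current = []
        if current:
            sequences.append(current)
    return sequences

def q_b(query: str) -> str:
    """Return the first longest label sequence of the query."""
    sequences = label_sequences(query)
    if not sequences:
        raise ValueError("query contains no labels")
    # max() returns the first maximal element, i.e. the first longest run.
    return "/".join(max(sequences, key=len))

if __name__ == "__main__":
    print(q_b("A//C/D//E"))
    print(q_b("//A/B/*/C"))
```

The longer qB is, the longer the key h(qB) and the more likely routing ends at a single peer instead of a sub-tree that needs a broadcast.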

Page 11: Efficient processing of XPath queries with structured overlay networks

Basic index (3/4): shower broadcast

[Figure: P-Grid trie with the prefixes *, 0*, 1*, 00*, 01*, 10*, 11*, 010*, 011*, 110*, 111*, 1100* and 1101*; the leaf prefixes 00*, 010*, 011*, 10*, 1100*, 1101* and 111* are highlighted.]

• Shower broadcast – propagates a message (query) among all peers in the sub-tree:
– Recursive algorithm, works in parallel fashion;
– Each peer in the sub-tree is visited only once.
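A minimal sketch of the recursion behind the shower broadcast, under simplifying assumptions that are not taken from the slides: peers are identified by their binary path, each peer holds one reference per level (the first peer found in the complementary sub-tree), and the forwarding is simulated sequentially although the real algorithm works in parallel. The function names are hypothetical.

```python
def build_routing_tables(peer_paths):
    """One reference per level to a peer in the complementary sub-tree.
    Assumption: the peer paths form a complete trie."""
    tables = {}
    for path in peer_paths:
        refs = []
        for level, bit in enumerate(path):
            complement = path[:level] + ("1" if bit == "0" else "0")
            refs.append(next(p for p in peer_paths if p.startswith(complement)))
        tables[path] = refs
    return tables

def shower_broadcast(peer, level, tables, visited):
    """Deliver the message to `peer`, which then covers the sub-tree given by
    the first `level` bits of its path: for every deeper level it forwards
    the message to its reference in the complementary sub-tree."""
    visited.append(peer)
    for i in range(level, len(peer)):
        shower_broadcast(tables[peer][i], i + 1, tables, visited)
    return visited

if __name__ == "__main__":
    peers = ["00", "010", "011", "10", "1100", "1101", "111"]
    tables = build_routing_tables(peers)
    # Broadcast inside the sub-tree 11*, entered at peer 1100.
    print(shower_broadcast("1100", 2, tables, []))
```

Each forwarded message hands over a disjoint part of the sub-tree, which is why no peer receives the message twice.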

Page 12: Efficient processing of XPath queries with structured overlay networks

Basic Index (4/4): properties

• The basic index is sufficient to answer XPath{*,//} queries.

• The shower broadcast is efficient in time and distributes the computation, but it consumes bandwidth.

• The improvement is to cache the results of the most frequent queries locally and avoid shower broadcasts for them.

Page 13: Efficient processing of XPath queries with structured overlay networks

Caching strategy (1/4): introduction

• Types of queries:

1. Queries that can be answered by one peer locally.
Example: “A/B/C//E” at the peer responsible for h(“A/B”).

2. Queries that require additional broadcast and contain only one sub-path (q=qB).
Example: “A” at the peer responsible for h(“A/B”).

3. Queries that require additional broadcast and contain more than one sub-path (q≠qB).
Example: “A//C//E” at the peer responsible for h(“A/C”).

• We suggest caching the most popular queries of type 3 to reduce the number of shower broadcasts.

Page 14: Efficient processing of XPath queries with structured overlay networks

Caching strategy (2/4): search

• The key used for routing is no longer h(qB), but qC = concat(Pl1, Pl2, …, Plk), where qB = Pl1.

• Example: the query A//C/D//E consists of the sub-paths A, C/D and E.

[Figure: construction of P and qC for the example query; the diagram is not recoverable from the extracted text.]

• The query is routed to a relevant peer which may (or may not) answer the query from cache.

• If the query is of type 3 and cannot be answered locally, its result can be cached.

• Similarly, an existing cache entry can be deleted.

Page 15: Efficient processing of XPath queries with structured overlay networks

Caching strategy (3/4): example

[Figure: worked example for the query Q = A//C/D//E issued by a query originator. Recoverable labels: a search for the key h(“CDAE”) (step 1); peers I to V in the sub-tree of h(CD), each holding data items (key, path, URI) and a list of cached queries LC; the query A//C/D//E being added to the LC lists (step 2); a cache entry holding the paths ACDZE and ABCDE (step 3); the result ACDZE, ABCDE (step 4); and, after a data item with the key h(CDFE) and the path ACDFE is added (step 2+), the result ACDZE, ABCDE, ACDFE. Data items shown (key → path): h(CD) → CD, h(C) → YC, h(CDZE) → ACDZE, h(CDE) → ABCDE, h(CDF) → ACDF.]

Page 16: Efficient processing of XPath queries with structured overlay networks

Caching strategy (4/4): analysis

• A query is profitable to cache if:

UpdateCost * UpdateRate(subtree) < SearchCost(subtree) * SearchRate(query)

• Where:
– UpdateCost – the cost of one cache update (log N)
– UpdateRate – average update rate in the sub-tree
– SearchCost – the cost of search (routing + broadcast)
– SearchRate – the query’s frequency (estimated locally)

• The indexing strategy adapts to the search/update ratio and tries to keep the messaging costs optimal.

[Slide annotation: “gathered from neighbours”]
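The profitability condition above can be written as a small check. This is a sketch with hypothetical function names and made-up numbers; in particular, the base of the logarithm in the update cost is an assumption (the slide only states log N).

```python
import math

def estimate_update_cost(num_peers: int) -> float:
    """Cost of one cache update in messages.
    Assumption: log N is taken as log base 2 of the number of peers."""
    return math.log2(num_peers)

def is_profitable(update_cost: float, update_rate: float,
                  search_cost: float, search_rate: float) -> bool:
    """UpdateCost * UpdateRate(subtree) < SearchCost(subtree) * SearchRate(query)."""
    return update_cost * update_rate < search_cost * search_rate

if __name__ == "__main__":
    # Illustrative values only: 1024 peers, 2 updates per time unit in the
    # sub-tree, 25 messages per uncached search, 1 query per time unit.
    cost = estimate_update_cost(1024)
    print(is_profitable(cost, update_rate=2.0, search_cost=25.0, search_rate=1.0))
```

With these numbers the left-hand side is 20 messages per time unit and the right-hand side is 25, so caching pays off; halving the query rate reverses the decision.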

Page 17: Efficient processing of XPath queries with structured overlay networks

Simulations (1/4): testbed

• Java application, stores data locally in a DBMS.

• 50 XML documents, >5k unique paths
• ~20k data items
• In each experiment we used 10k queries randomly generated from the paths

Page 18: Efficient processing of XPath queries with structured overlay networks

Simulations (2/4): search cost

• Parameter t – fraction of “cachable” queries
• All “cachable” queries are cached

[Figure: average cost of answering a query (# msg) versus the number of peers (0 to 1400), with curves for t=0, t=0.5, t=0.75, t=1 and “no broadcasts”.]

Page 19: Efficient processing of XPath queries with structured overlay networks

Simulations (3/4): search cost

• 1000 peers
• t=0.5 (50% of queries can be cached)

[Figure: average cost of answering a query (# msg) versus the percentage of cached queries (0 to 5 %), with curves for no caching, Zipf s=0, Zipf s=0.8 and Zipf s=1.2.]

Page 20: Efficient processing of XPath queries with structured overlay networks

Simulations (4/4): average costs

[Figure: average cost of one query/update operation (# msg) versus the percentage of cached queries (0 to 4 %), with curves for search/update ratio = 1:2 and search/update ratio = 2:1.]

• 1000 peers, t=0.5, Zipf s=1.2.
• For a given search/update ratio there is an optimal point.

Page 21: Efficient processing of XPath queries with structured overlay networks

Conclusions

• An efficient solution for indexing XML structure in structured overlay networks is proposed.

• The presented solution can be used in a P2P XML querying engine for answering structural (sub)queries.

Page 22: Efficient processing of XPath queries with structured overlay networks

Last slide

Thank you for your attention! Questions?