Efficient Top-k Search across Heterogeneous XML Data Sources

Jianxin Li (1), Chengfei Liu (1), Jeffrey Xu Yu (2), Rui Zhou (1)

(1) Swinburne University of Technology
(2) Chinese University of Hong Kong




Outline

Motivation
Related Work
Preliminary and Problem Statement
BT-based Scheduling Strategy
Case Study
Experiments
Conclusions


Motivation

Top-k queries

Approximate answers are required when exact results cannot be found.

Returning a large number of results is not desirable.

Multiple XML data sources

As XML data becomes more widely used, users are sometimes interested in results retrieved from several data sources at the same time.

Answering top-k queries over multiple XML data sources is still an open problem.


Related Work

Top-k queries in XML

Amelie Marian et al. Adaptive processing of top-k queries in XML. ICDE 2005.

Martin Theobald et al. An efficient and versatile query engine for TopX search. VLDB 2005.

Raghav Kaushik et al. On the integration of structure indexes and inverted lists. SIGMOD 2004.

Top-k queries in relational databases

Upper, MPro, TPUT, etc.

We focus on top-k queries over multiple XML data sources!


Preliminary – XML Query Relaxation

XML data and relevant schemas

Fig. 1: bookshop S1
Fig. 2: schema d1 of S1
Fig. 3: bookshop S2
Fig. 4: schema d2 of S2


Preliminary – XML Query Relaxation

Relaxed results

Fig. 5: an original query q

Fig. 6: a relaxed query to d1 (RankScore = 2.28)

Fig. 7: a relaxed query to d2 (RankScore = 4.88)

We keep the changed weight for each edge in relaxed queries.


Problem Statement

Given a weighted query q and a number of data sources {S1, S2, …, Sn} conforming to DTDs {d1, d2, …, dn}, let {q1, q2, …, qn} be the set of weighted relaxed query templates of q w.r.t. the set of DTDs. Our aim is to efficiently search for the top-k results by scheduling the evaluation of {q1, q2, …, qn} over {S1, S2, …, Sn}.


BT-based Scheduling Strategy

Data source determination and switching
Result determination
Edge selection


Data source determination and switching

Computing the ranking scores {U(1), …, U(n)} of the relaxed queries {q1, q2, …, qn} w.r.t. the data sources {S1, S2, …, Sn}.

Sorting the ranking scores in descending order as U = {U(k1), …, U(kn)}.

Taking the data source Sk1 to be evaluated first, and U(k2) as the current threshold σ.

Example:

The relaxed query q1 w.r.t. d1: U(1) = 2.28

The relaxed query q2 w.r.t. d2: U(2) = 4.88

Threshold σ = 2.28
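
The selection step on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `pick_source_and_threshold` and the dictionary layout are hypothetical, the descending sort order is read off the slide's example (S2 with U(2) = 4.88 is evaluated first and σ = U(1) = 2.28), and the single-source fallback of negative infinity is an assumption.

```python
def pick_source_and_threshold(upper_scores):
    """Pick the first source to evaluate and the initial threshold.

    upper_scores maps a source name to the ranking score U(i) of its
    relaxed query (hypothetical layout, chosen for this sketch).
    """
    # Order sources by ranking score, highest first (assumed order).
    ranked = sorted(upper_scores, key=upper_scores.get, reverse=True)
    first = ranked[0]
    # The second-highest score becomes the threshold sigma; with a single
    # source there is nothing to compare against (assumption: -inf).
    sigma = upper_scores[ranked[1]] if len(ranked) > 1 else float("-inf")
    return first, sigma


# Values from the slide: U(1) = 2.28 for S1, U(2) = 4.88 for S2.
source, sigma = pick_source_and_threshold({"S1": 2.28, "S2": 4.88})
print(source, sigma)  # S2 2.28
```

With the slide's values, S2 is evaluated first and results from S2 are compared against σ = 2.28, the best score any result from S1 could reach.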


Result determination

We adjust the lower bound L and upper bound U during query evaluation. When L becomes equal to or larger than the current threshold, we can process the current candidates as follows:

The number of candidates is equal to k – Stop
The number of candidates is less than k – Continue to search
The number of candidates is larger than k – Refine candidates
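
The three cases can be written as a small decision function. This is a sketch under stated assumptions: the name `decide_action` and the returned labels are hypothetical, and the `"keep_evaluating"` branch for L below the threshold is an assumption, since the slide only describes what happens once L reaches the threshold.

```python
def decide_action(num_candidates, k, lower_bound, threshold):
    """Return the next action for the current candidate set."""
    if lower_bound < threshold:
        # Assumption: the bounds are not tight enough yet, so the
        # query evaluation simply goes on.
        return "keep_evaluating"
    if num_candidates == k:
        return "stop"
    if num_candidates < k:
        return "continue_search"
    return "refine_candidates"


# Hypothetical inputs: three candidates, k = 2, L = 2.5, sigma = 2.28.
print(decide_action(3, 2, 2.5, 2.28))  # refine_candidates
```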


Edge selection

Random
Min_weight
Max_weight
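
The slide only names the three strategies. One plausible reading, shown here as a hedged sketch, is that each strategy picks the next unevaluated query edge either at random, by smallest weight, or by largest weight. The function `select_edge`, the edge tuples, and the weights below are all hypothetical.

```python
import random


def select_edge(edges, strategy, rng=random):
    """Choose the next query edge to evaluate.

    edges maps an edge, e.g. ("book", "title"), to its weight
    (hypothetical layout). strategy is one of the slide's three names.
    """
    if not edges:
        raise ValueError("no edges left to evaluate")
    if strategy == "Random":
        # Sort first so the choice is reproducible for a seeded rng.
        return rng.choice(sorted(edges))
    if strategy == "Min_weight":
        return min(edges, key=edges.get)
    if strategy == "Max_weight":
        return max(edges, key=edges.get)
    raise ValueError(f"unknown strategy: {strategy}")


# Illustrative weights only; they are not taken from the slides.
weights = {("book", "title"): 0.9, ("book", "info"): 0.8, ("info", "price"): 0.5}
print(select_edge(weights, "Min_weight"))  # ('info', 'price')
print(select_edge(weights, "Max_weight"))  # ('book', 'title')
```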


Case Study

U(2) = 4.88, σ = 2.28

Query pattern: book (title, info)

Candidates: B1, B2, B4

L(2) = 1.70 < σ

Query pattern: book (title, info, price)

Candidates: B1; B2, B4

L(2)(G1) = 3.5 > σ

L(2)(G2) = 1.70 < σ, U(2)(G2) = 3.08 > σ

Top-1 result found!

Query pattern: book (title, info, price, year)

Candidates: B2; B4

L(2)(G3) = 4.4 > σ

L(2)(G4) = 1.70 < σ, U(2)(G4) = 2.18 < σ

Top-2 result found!

Switching data source to search for the top-3 result!
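
The bound checks in this case study can be replayed with a short sketch. The function `classify_group` and its labels are hypothetical, and the interpretation is an assumption drawn from the numbers on the slide: a group whose lower bound reaches σ yields a result, a group whose upper bound still exceeds σ is refined further, and a group whose upper bound falls below σ is set aside, which is the point at which the slide switches data source. Treating an upper bound exactly equal to σ as "defer" is also an assumption.

```python
def classify_group(lower, sigma, upper=None):
    """Classify a candidate group by its bounds against threshold sigma."""
    if lower >= sigma:
        return "output"   # lower bound already reaches the threshold
    if upper is None:
        raise ValueError("upper bound needed when lower < sigma")
    if upper > sigma:
        return "refine"   # may still beat the threshold
    return "defer"        # cannot beat the threshold (assumed reading)


sigma = 2.28
# (lower bound, upper bound) per group, as given on the slide;
# the slide gives no upper bound for G1 and G3.
groups = {
    "G1": (3.5, None),
    "G2": (1.70, 3.08),
    "G3": (4.4, None),
    "G4": (1.70, 2.18),
}
for name, (low, up) in groups.items():
    print(name, classify_group(low, sigma, up))
```

Under these assumptions G1 and G3 are output as the top-1 and top-2 results, G2 is refined, and G4 is deferred.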


Experiments

Experimental setup

We ran all algorithms in Java on an Intel P4 3 GHz PC with 512 MB of memory. The Wutka DTD parser was used to analyze the structures of the DTDs.

Dataset and selected queries

We used the XMark XML data generator to produce the dataset. Three queries were designed:

q1: //item[./description/parlist]
q2: //item[./description/parlist/mailbox/mail[./text]]
q3: //item[./mailbox/mail/text[./keyword and ./xxx] and ./name and ./xxx]


Experiments

[Figure: Static sort vs. Dynamic sort, varying top-k size]

[Figure: No schedule vs. BT schedule, varying top-k size]


Conclusions

Contributions:

Proposed a BT-based scheduling strategy for evaluating top-k queries over multiple XML data sources;

Output results immediately without waiting for the end of query evaluation;

Implemented the relevant algorithms and demonstrated their effectiveness and efficiency with XMark data sets.


Thanks & Questions