Approximate XML Query Answers Alkis Polyzotis (UC Santa Cruz) Minos Garofalakis (Bell Labs) Yannis Ioannidis (U. of Athens, Hellas)


Page 1: Approximate XML Query Answers

Approximate XML Query Answers

Alkis Polyzotis (UC Santa Cruz), Minos Garofalakis (Bell Labs), Yannis Ioannidis (U. of Athens, Hellas)

Page 2: Approximate XML Query Answers

Motivation

• XML: de facto standard for data exchange
• Development of the “XML Warehouse”
• Conflict between “on-line” and query execution cost
    • Increased query response times
    • Users might wait for uninteresting results

[Figure: a query Q is evaluated over the XML data warehouse and returns the XML result R]

Page 3: Approximate XML Query Answers

Approximate Query Answers

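• Evaluate the query over a concise data synopsis and obtain an approximation R’ of the true result R
• Use the approximate result as timely feedback
    • User can assess the “value” of the query
• Goal: reduce the number of evaluated queries

[Figure: a query Q is evaluated over a synopsis of the XML data warehouse and returns the approximate XML result R’ in place of the true XML result R]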

Page 4: Approximate XML Query Answers

Contributions

• TreeSketch Synopses
    • Structural summaries for XML data
    • Approximate answers for complex twig queries
    • Summarization model: structural clustering of elements
    • Efficient processing and construction
• Element Simulation Distance
    • Novel distance metric for XML data
    • Captures “approximate” similarity between two XML trees
• Experimental Results
    • Accurate approximate answers for low space budgets
    • Low-error selectivity estimates
    • Efficient construction algorithm

Page 5: Approximate XML Query Answers

Outline

• Preliminaries
• TreeSketches
    • Synopsis model
    • Computing approximate answers
    • Summary construction
• Element Simulation Distance
• Experimental Study
• Conclusions

Page 6: Approximate XML Query Answers

Data and Query Model

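[Figure: an XML document tree (elements r, p1, s2, s3, f4, f5, f6, f7, e8, c9, e10, c11, c12, c13), a twig query with nodes q0, q1, q2, q3 and edge labels //section, .//equation, ./figure, and the resulting nesting tree (nodes r, s2, e11, e13, f5, f7 as extracted)]

Binding Tuples

q0   q1   q2   q3
r    s2   f4   e8
r    s2   f4   e10
r    s2   f5   e8
r    s2   f5   e10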

Page 7: Approximate XML Query Answers

Problem Definition

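• Process a twig query over a synopsis
• Compute an approximation of the nesting tree

[Figure: the twig query (nodes q0, q1, q2, q3; edge labels //section, .//equation, ./figure) evaluated over the XML data yields the true nesting tree (nodes r, s2, e11, e13, f5, f7); evaluated over the synopsis it yields the approximate nesting tree (nodes r, s, e, e, f)]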

Page 8: Approximate XML Query Answers

Graph Synopsis

• Synopsis node: set of elements of the same tag
• Synopsis edge: document edge(s)

[Figure: the XML document (elements r, p1, s2, s3, f4, f5, f6, f7, e8, c9, e10, c11, c12, c13) next to its graph synopsis with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4)]

Page 9: Approximate XML Query Answers

TreeSketch Synopsis

• Augment the graph synopsis with edge counts
• count[u,v]: mean number of children in v per element in u

[Figure: the XML document next to its TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4), edge counts of 1 and 2, and a #F annotation for each of the two F nodes]

Page 10: Approximate XML Query Answers

TreeSketch Synopsis

• Augment the graph synopsis with edge counts
• count[u,v]: mean number of children in v per element in u

[Figure: the same XML document next to a smaller TreeSketch in which the two F nodes are merged, with nodes R(1), P(1), S(2), F(4), E(2), C(4), edge counts such as 2, 1, and 0.5, and a single #F annotation]
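The edge-count definition above can be illustrated with a minimal Python sketch. The function name `edge_counts`, the dictionary-based tree encoding, and the toy document are assumptions made for illustration; they are not taken from the slides.

```python
from collections import defaultdict

def edge_counts(children, cluster_of):
    """Return {(u, v): mean number of children in node v per element in node u}.

    children:   element id -> list of child element ids (hypothetical encoding)
    cluster_of: element id -> synopsis node id
    """
    size = defaultdict(int)    # number of elements in each synopsis node
    total = defaultdict(int)   # number of document edges from node u into node v
    for elem, cluster in cluster_of.items():
        size[cluster] += 1
        for child in children.get(elem, []):
            total[(cluster, cluster_of[child])] += 1
    return {(u, v): n / size[u] for (u, v), n in total.items()}

# Hypothetical toy document: one section has two figures, the other has one.
children = {"r": ["s1", "s2"], "s1": ["f1", "f2"], "s2": ["f3"]}
cluster_of = {"r": "R", "s1": "S", "s2": "S", "f1": "F", "f2": "F", "f3": "F"}
counts = edge_counts(children, cluster_of)
print(counts)
```

In this toy input the two sections differ in their number of figures, so the S-to-F edge count is an average that neither section matches exactly, which is the kind of approximation the clustering error is meant to capture.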

Page 11: Approximate XML Query Answers

TreeSketches and Clustering

• TreeSketch: clustering based on structure
    • All elements in a node are mapped to a “centroid”
    • Tight clusters yield an accurate synopsis
    • The perfect synopsis corresponds to a perfect clustering
• Synopsis quality is quantified by the clustering error
    • Options: Manhattan distance, squared error, …
    • Quality can be measured independently of a workload
    • Key for effective construction

$$\mathrm{Error} = \sum_{u} \sum_{e \in u} (c_e - c_u)^2$$
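A small Python sketch of the squared-error option follows. It assumes that c_e is the vector of child counts of element e (one entry per target synopsis node) and that c_u is the mean of those vectors over the elements of u; the slides do not spell out these definitions, and the function name and data layout are hypothetical.

```python
from collections import defaultdict

def squared_error(child_counts, cluster_of):
    """Sum, over nodes u and elements e in u, of the squared distance
    between e's child-count vector and the centroid of u (assumed reading)."""
    members = defaultdict(list)
    for elem, cluster in cluster_of.items():
        members[cluster].append(elem)
    error = 0.0
    for elems in members.values():
        # Every target node that some element of this cluster has children in.
        targets = {t for e in elems for t in child_counts.get(e, {})}
        for t in targets:
            values = [child_counts.get(e, {}).get(t, 0) for e in elems]
            mean = sum(values) / len(values)
            error += sum((v - mean) ** 2 for v in values)
    return error

# Hypothetical data: section s1 has 2 figure children, section s2 has 1.
child_counts = {"s1": {"F": 2}, "s2": {"F": 1}}
error_merged = squared_error(child_counts, {"s1": "S", "s2": "S"})
error_split = squared_error(child_counts, {"s1": "S1", "s2": "S2"})
print(error_merged, error_split)
```

Keeping the two sections in separate nodes gives zero error, which matches the statement that the perfect synopsis corresponds to a perfect clustering.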

Page 12: Approximate XML Query Answers

Computing Approximate Answers

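• Compute the TreeSketch of the approximate answer
• Accuracy depends on the quality of the clustering

[Figure: the twig query (nodes q0, q1, q2, q3; edge labels //section, .//equation, .//caption) is evaluated over the TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4), producing an approximate nesting tree over R, S, E, and C; the edge counts shown on the result include 2, 1, and 1+1=2]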
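As a rough illustration of how edge counts can drive an estimate, the Python sketch below multiplies counts along matching synopsis paths and sums over alternative paths. It is a simplification under stated assumptions: it handles only a linear sequence of child-axis steps (not the descendant axis or branching twigs used on the slide), and the function name, the synopsis encoding, and the numbers are hypothetical.

```python
def estimate_path_count(tag, count, start, steps):
    """Estimate how many elements are reached from one element of `start`
    by following child steps whose tags are given in `steps`.

    tag:   synopsis node id -> element tag
    count: (u, v) -> mean number of children in v per element in u
    """
    frontier = {start: 1.0}  # expected number of reached elements per synopsis node
    for step in steps:
        reached = {}
        for u, weight in frontier.items():
            for (src, dst), c in count.items():
                if src == u and tag[dst] == step:
                    reached[dst] = reached.get(dst, 0.0) + weight * c
        frontier = reached
    return sum(frontier.values())

# Hypothetical synopsis with two figure nodes that both lead to captions.
tag = {"R": "r", "S": "section", "F1": "figure", "F2": "figure", "C": "caption"}
count = {("R", "S"): 2.0, ("S", "F1"): 1.0, ("S", "F2"): 1.0,
         ("F1", "C"): 1.0, ("F2", "C"): 1.0}
estimate = estimate_path_count(tag, count, "R", ["section", "figure", "caption"])
print(estimate)
```

Contributions from the two figure nodes are added at the caption step, in the same spirit as the summed count shown in the figure.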

Page 13: Approximate XML Query Answers

TreeSketch Construction

• Given an XML tree T, build a TreeSketch of size B
• Difficult clustering problem
    • Space dimensionality depends on the clustering itself
• Construction based on bottom-up clustering
    • Compress the perfect synopsis by merging clusters
    • Best merge determined by marginal gains

[Figure: the perfect synopsis is compressed until it fits the space budget]
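The bottom-up idea can be sketched in Python as a plain greedy loop. This is not the authors' algorithm: it assumes fixed-length count vectors (so it ignores the point that the dimensionality changes as clusters merge), it measures the budget as a number of clusters rather than bytes, it picks the merge with the smallest increase in squared error as a stand-in for the marginal-gain criterion, and all names are hypothetical.

```python
from itertools import combinations

def sse(vectors):
    """Squared error of a cluster of equal-length count vectors around its centroid."""
    n = len(vectors)
    centroid = [sum(col) / n for col in zip(*vectors)]
    return sum((x - m) ** 2 for vec in vectors for x, m in zip(vec, centroid))

def greedy_merge(clusters, budget):
    """Merge same-tag clusters until at most `budget` clusters remain.

    clusters: list of (tag, list of count vectors); budget: maximum cluster count.
    """
    clusters = [(tag, list(vecs)) for tag, vecs in clusters]
    while len(clusters) > budget:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            if clusters[i][0] != clusters[j][0]:
                continue  # only elements with the same tag may share a node
            merged = clusters[i][1] + clusters[j][1]
            cost = sse(merged) - sse(clusters[i][1]) - sse(clusters[j][1])
            if best is None or cost < best[0]:
                best = (cost, i, j)
        if best is None:
            break  # no two clusters share a tag, so no further merge is possible
        _, i, j = best
        merged = (clusters[i][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Hypothetical start: three section clusters and one figure cluster.
start = [("S", [[2]]), ("S", [[2]]), ("S", [[5]]), ("F", [[0]])]
result = greedy_merge(start, budget=3)
print(result)
```

The two identical section clusters are merged first because that merge adds no error; examining all pairs on every iteration is quadratic, which is one reason to restrict the candidate merges in practice.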

Page 14: Approximate XML Query Answers

Depth-Guided Merging

• Key observation: two elements have similar structure if their children have similar structure
    • Children clusters should be merged first
• Bottom-up merging, based on depth
    • Depth: distance from the leaves of the tree
    • Build a pool of candidate merges by increasing depth
    • Replenish the pool when it falls below a given threshold
• Improved construction time, good performance

Page 15: Approximate XML Query Answers

Outline

• Preliminaries
• TreeSketches
    • Synopsis model
    • Computing approximate answers
    • Summary construction
• Element Simulation Distance
• Experimental Study
• Conclusions

Page 16: Approximate XML Query Answers

Error of Approximation

• Error: distance between R’ and R
• Popular metric: tree-edit distance
    • Min-cost sequence of operations that transform R’ to R
    • Measures syntactic differences between R and R’
• Not intuitive for approximate answers!

[Figure: three trees T1, T, and T2 over r, s, e, and f nodes with edge counts; one comparison is labeled "Different counts, similar trait" and the other "Same counts, opposite trait"]

Page 17: Approximate XML Query Answers

Element Simulation Distance

• Capture approximate similarity between R and R’
• u simulates v: u and v have identical structure
• ESD(u,v): “degree” of simulation between u and v
    • How well the structure of u matches the structure of v
• Modeled as the distance between multi-sets
• Efficient computation using perfect summaries

[Figure: the three trees T1, T, and T2 over r, s, e, and f nodes with edge counts, as on the previous slide]

Page 18: Approximate XML Query Answers

Outline

• Preliminaries
• TreeSketches
    • Synopsis model
    • Computing approximate answers
    • Summary construction
• Element Simulation Distance
• Experimental Study
• Conclusions

Page 19: Approximate XML Query Answers

Experimental Methodology

• Data sets: XMark, DBLP, IMDB, SwissProt
• Workload: 1000 random twig queries
• Evaluation metrics:
    • Average ESD for approximate answers
    • Mean absolute relative error for selectivity estimation

$$\frac{1}{|W|} \times \sum_{q \in W} \frac{|\mathrm{estim}(q) - \mathrm{count}(q)|}{\mathrm{count}(q)}$$
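The selectivity metric above is straightforward to compute. The Python sketch below assumes every query in the workload has a non-zero true count, since the formula divides by count(q); the function name and the sample numbers are hypothetical.

```python
def mean_abs_relative_error(estimates, true_counts):
    """Average of |estim(q) - count(q)| / count(q) over a workload W.

    estimates and true_counts are parallel sequences, one entry per query.
    """
    if not estimates or len(estimates) != len(true_counts):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(e - c) / c for e, c in zip(estimates, true_counts))
    return total / len(true_counts)

# Hypothetical workload of two queries, each estimate off by 10%.
err = mean_abs_relative_error([90, 220], [100, 200])
print(err)
```

Multiplying the result by 100 gives a percentage, the unit used on the error axes of the plots that follow.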

Page 20: Approximate XML Query Answers

Approximate Answers

[Figure: mean ESD (y-axis, 0 to 20,000) versus summary size (x-axis, 10 to 50 KB) for TreeSketch and XSketch on IMDB (~102K elements; avg. result size: 3,477 tuples)]

Page 21: Approximate XML Query Answers

Selectivity Estimation - SwissProt

[Figure: estimation error in % (y-axis, 0 to 160) versus summary size (x-axis, 10 to 50 KB) for TreeSketch and XSketch on SwissProt (~182K elements; avg. result size: 104,592 tuples)]

Page 22: Approximate XML Query Answers

Selectivity Estimation

[Figure: error in % (y-axis, 0 to 30) versus summary storage (x-axis, 10 to 50 KB) for DBLP, IMDB, SwissProt, and XMark]

Data Set   #Elements (×10^3)   #Tuples (×10^3)
DBLP       1,500               78
IMDB       236                 13
S-Prot     473                 365
XMark      2,000               145

Data Set   Construction Time (min)
DBLP       11
IMDB       2.5
S-Prot     38
XMark      240

Page 23: Approximate XML Query Answers

Conclusions

• Approximate query answering for XML databases
• TreeSketch Synopses
    • Structural summaries for tree-structured XML
    • Approximate answers for twig queries
    • Model: graph synopsis + edge counts
    • Efficient processing and construction
• Element Simulation Distance
    • Captures approximate similarity between XML trees
• Experimental Results
    • High accuracy for low space budgets
    • Efficient construction

Page 24: Approximate XML Query Answers


Questions?

Page 25: Approximate XML Query Answers

TreeSketch Model (2/2)

Average number of children <--> Edge count

[Figure: an XML document (elements r, p1, s2, s3, f5, f7, f9, e11, c12, e13, c14, c17) next to its TreeSketch with nodes R, P(1), S(2), F(2), F(2), E(2), C(4), edge counts of 1 and 2, and #E and #C annotations with value 1]

Page 26: Approximate XML Query Answers

XML

[Figure: an XML document tree with elements r, p1, s2, s3, f5, f7, f9, e11, c12, e13, c14, c17]

Legend: p: paper, s: section, c: caption, t: title, f: figure, e: equation