Generating Efficient Execution Plans for Vertically ...Vertical Fragmentation Specification...

Preview:

Citation preview

Generating Efficient Execution Plans for

Vertically Partitioned XML Databases

Patrick Kling, M. Tamer Ozsu, and Khuzaima Daudjee

University of WaterlooDavid R. Cheriton School of Computer Science

VLDB 2011

1

The Problem

• Centralized query evaluation techniques for XML wellunderstood

• These techniques do not scale to large collection sizes andheavy workloads

• Goal: use distribution to improve scalability

• Focus on end-to-end cost of query evaluation

2

Distributed XML Query Evaluation: Two Scenarios

• Integrating multiple data sources• Fragmentation is determined by existing data sources• Need flexible fragmentation model to express this

• Distribution for performance• Choose fragmentation to suit workload• Can use more constrained fragmentation model• Fragmentation specification allows for distributed query

optimization

3

Distributed XML Query Evaluation: Two Scenarios

• Integrating multiple data sources• Fragmentation is determined by existing data sources• Need flexible fragmentation model to express this

• Distribution for performance• Choose fragmentation to suit workload• Can use more constrained fragmentation model• Fragmentation specification allows for distributed query

optimization

3

Outline

1 Fragmenting XML Collections

2 Querying Distributed XML CollectionsQuery ModelDistributed Query EvaluationImproving Performance

3 Performance Evaluation

4 Conclusion

4

Outline

1 Fragmenting XML Collections

2 Querying Distributed XML CollectionsQuery ModelDistributed Query EvaluationImproving Performance

3 Performance Evaluation

4 Conclusion

5

Fragmenting XML Collections

• Ad-hoc fragmentation

• Structure-based fragmentation

6

Ad-hoc fragmentation

• Cut arbitrary edges in document tree

• Highly flexible (good for data integration)

• No explicit fragmentation specification

• Limited potential for exploiting fragmentation characteristicsfor query optimization

• Not a suitable choice for this work

7

Structure-based Fragmentation

• Fragmentation according to characteristics of data or schema

• Yields a fragmentation specification that can be exploited forquery optimization

• Better choice when distributing for performance

8

Our Fragmentation Model

• Focus on simplicity and precise fragmentation specification

• Focus on partitioning collection (replication is orthogonal)

• Follow semantics of relational fragmentation techniques• Horizontal fragmentation (based on predicates/selection)• Vertical fragmentation (based on partitioning of

schema/projection)• Hybrid fragmentation (combination of horizontal and vertical

steps)

9

Our Fragmentation Model

• Focus on simplicity and precise fragmentation specification

• Focus on partitioning collection (replication is orthogonal)

• Follow semantics of relational fragmentation techniques• Horizontal fragmentation (based on predicates/selection)• Vertical fragmentation (based on partitioning of

schema/projection)• Hybrid fragmentation (combination of horizontal and vertical

steps)

9

Vertical Fragmentation

author2

P1→213 P1→3

14

f V1

RP1→213

name2

first2

Jane

last2

Dean

f V2

RP1→314

pubs2

f V3

10

Vertical Fragmentation Specification

Vertical fragmentation is specified by a fragmentation schema.

author

agent

OPT

f V1

pubs

book

MULT

f V3

name

first

ONCE

last

ONCE

f V2

chapter

reference

OPT ONCE

f V4

ONCE

ONCE

ONCE

MULT

11

Outline

1 Fragmenting XML Collections

2 Querying Distributed XML CollectionsQuery ModelDistributed Query EvaluationImproving Performance

3 Performance Evaluation

4 Conclusion

12

Query model

XQ, subset of XPath

• Nested paths with child and descendant steps

• Explicit node tests and wild cards

• Value constraints (numeric or textual)• Q := σ | ∗ | Q//Q | Q/Q |Q[q]

q := Q | . = / 6= str | . = / 6= / ≤ / < / ≥ / > num

13

Query Example

“Find all references in publications written by authors whose firstname is ‘William’ and whose last name is ‘Shakespeare’ ”

14

Query Example

“Find all references in publications written by authors whose firstname is ‘William’ and whose last name is ‘Shakespeare’ ”

/author[./name[./first = “William”and

./last = “Shakespeare”]]//reference

14

Query Example

“Find all references in publications written by authors whose firstname is ‘William’ and whose last name is ‘Shakespeare’ ”

/author[./name[./first = “William”and

./last = “Shakespeare”]]//reference

• Node tests

14

Query Example

“Find all references in publications written by authors whose firstname is ‘William’ and whose last name is ‘Shakespeare’ ”

/author[./name[./first = “William”and

./last = “Shakespeare”]]//reference

• Node tests

• Value constraints

14

Query Example

“Find all references in publications written by authors whose firstname is ‘William’ and whose last name is ‘Shakespeare’ ”

/author[./name[./first = “William”and

./last = “Shakespeare”]]//reference

• Node tests

• Value constraints

• Structural constraints

14

Tree Patterns

author

name

/

first.=’William’

/

last.=’Shakespeare’

/

reference

//

15

Tree Patterns

author

name

/

first.=’William’

/

last.=’Shakespeare’

/

reference

//

• Pattern nodes with nodetests and value constraints

15

Tree Patterns

author

name

/

first.=’William’

/

last.=’Shakespeare’

/

reference

//

• Pattern nodes with nodetests and value constraints

15

Tree Patterns

author

name

/

first.=’William’

/

last.=’Shakespeare’

/

reference

//

• Pattern nodes with nodetests and value constraints

• Edges annotated with XPathaxes

15

Tree Patterns

author

name

/

first.=’William’

/

last.=’Shakespeare’

/

reference

//

• Pattern nodes with nodetests and value constraints

• Edges annotated with XPathaxes

• Extraction point nodes

15

Evaluating Tree Pattern Queries

author

name

/

first.=’William’

/

last.=’Shakespeare’

/

ae1reference

//

author4

name4

first4

William

last4

Shakespeare

pubs4

book4

chapter4

reference4

chapter5

16

Evaluating Tree Pattern Queries

author

name

/

first.=’William’

/

last.=’Shakespeare’

/

ae1reference

//

author4

name4

first4

William

last4

Shakespeare

pubs4

book4

chapter4

reference4

chapter5

16

Evaluating Tree Pattern Queries

author

name

/

first.=’William’

/

last.=’Shakespeare’

/

ae1reference

//

author4

name4

first4

William

last4

Shakespeare

pubs4

book4

chapter4

reference4

chapter5

16

Evaluating Tree Pattern Queries

author

name

/

first.=’William’

/

last.=’Shakespeare’

/

ae1reference

//

author4

name4

first4

William

last4

Shakespeare

pubs4

book4

chapter4

reference4

chapter5

16

Evaluating Tree Pattern Queries

author

name

/

first.=’William’

/

last.=’Shakespeare’

/

ae1reference

//

author4

name4

first4

William

last4

Shakespeare

pubs4

book4

chapter4

reference4

chapter5

16

Evaluating Tree Pattern Queries

author

name

/

first.=’William’

/

last.=’Shakespeare’

/

ae1reference

//

author4

name4

first4

William

last4

Shakespeare

pubs4

book4

chapter4

reference4

chapter5

16

Evaluating Tree Pattern Queries

author

name

/

first.=’William’

/

last.=’Shakespeare’

/

ae1reference

//

author4

name4

first4

William

last4

Shakespeare

pubs4

book4

chapter4

reference4

chapter5

16

Evaluating Tree Pattern Queries

author

name

/

first.=’William’

/

last.=’Shakespeare’

/

ae1reference

//

author4

name4

first4

William

last4

Shakespeare

pubs4

book4

chapter4

reference4

chapter5

16

Evaluating Tree Pattern Queries

author

name

/

first.=’William’

/

last.=’Shakespeare’

/

ae1reference

//

author4

name4

first4

William

last4

Shakespeare

pubs4

book4

chapter4

reference4

chapter5

[ae1 = reference4]16

Evaluating Tree Pattern Queries

• Various centralized approaches exist• Navigating document trees• Structural joins

• We leverage these for distributed query evaluation

17

Querying Vertically Distributed XML Collections

• Input• Fragmentation-unaware tree pattern query• Fragmentation schema

• Tasks• Annotate tree pattern nodes with corresponding fragments• Decompose tree pattern into sub-patterns for individual

fragments• Convert sub-patterns to local plans using existing techniques

(each site is free to choose local strategy)• Generate distributed execution plan that specifies how results

are combined

18

Querying Vertically Distributed XML Collections

• Annotate tree pattern nodes

• Decompose tree pattern

• Convert sub-patterns into local plans

• Generate distributed execution plan

author

name

/

first.=’William’

/

last.=’Shakespeare’

/

reference

//

19

Querying Vertically Distributed XML Collections

• Annotate tree pattern nodes

• Decompose tree pattern

• Convert sub-patterns into local plans

• Generate distributed execution plan

author f V1

name f V2

/

first.=’William’ f V2

/

last.=’Shakespeare’ f V2

/

reference f V4

//

19

Querying Vertically Distributed XML Collections

• Annotate tree patternnodes

• Decompose tree pattern

• Convert sub-patternsinto local plans

• Generate distributedexecution plan

author f V1

name f V2

/

first.=’William’ f V2

/

last.=’Shakespeare’ f V2

/

reference f V4

//

20

Querying Vertically Distributed XML Collections

• Annotate tree patternnodes

• Decompose tree pattern

• Convert sub-patternsinto local plans

• Generate distributedexecution plan

author f V1

name f V2

/

first.=’William’ f V2

/

last.=’Shakespeare’ f V2

/

reference f V4

//

author

ap2

P1→2∗

/

ap3

P1→3∗

//

q11(f

V1 )

arp2

RP1→2∗

name

/

first.=’William’

/

last.=’Shakespeare’

/

q21(f

V2 )

arp3

RP1→3∗

ap4

P3→4∗

//

q31(f

V3 )

arp4

RP3→4∗

ae1reference

//

q41(f

V4 )

20

Querying Vertically Distributed XML Collections

• Annotate tree patternnodes

• Decompose tree pattern

• Convert sub-patternsinto local plans

• Generate distributedexecution plan

author f V1

name f V2

/

first.=’William’ f V2

/

last.=’Shakespeare’ f V2

/

reference f V4

//

author

ap2

P1→2∗

/

ap3

P1→3∗

//

q11(f

V1 )

arp2

RP1→2∗

name

/

first.=’William’

/

last.=’Shakespeare’

/

q21(f

V2 )

arp3

RP1→3∗

ap4

P3→4∗

//

q31(f

V3 )

arp4

RP3→4∗

ae1reference

//

q41(f

V4 )

20

Querying Vertically Distributed XML Collections

• Annotate tree patternnodes

• Decompose tree pattern

• Convert sub-patternsinto local plans

• Generate distributedexecution plan

⋊⋉id(ap3)=id(arp3 )

⋊⋉id(ap2)=id(arp2 )

p11(fV1 ) p21(f

V2 )

⋊⋉id(ap4)=id(arp4 )

p31(fV3 ) p41(f

V4 )

author

ap2

P1→2∗

/

ap3

P1→3∗

//

q11(f

V1 )

arp2

RP1→2∗

name

/

first.=’William’

/

last.=’Shakespeare’

/

q21(f

V2 )

arp3

RP1→3∗

ap4

P3→4∗

//

q31(f

V3 )

arp4

RP3→4∗

ae1reference

//

q41(f

V4 )

21

Querying Vertically Distributed XML Collections

• Annotate tree patternnodes

• Decompose tree pattern

• Convert sub-patternsinto local plans

• Generate distributedexecution plan

⋊⋉id(ap3)=id(arp3 )

⋊⋉id(ap2)=id(arp2 )

p11(fV1 ) p21(f

V2 )

⋊⋉id(ap4)=id(arp4 )

p31(fV3 ) p41(f

V4 )

author

ap2

P1→2∗

/

ap3

P1→3∗

//

q11(f

V1 )

arp2

RP1→2∗

name

/

first.=’William’

/

last.=’Shakespeare’

/

q21(f

V2 )

arp3

RP1→3∗

ap4

P3→4∗

//

q31(f

V3 )

arp4

RP3→4∗

ae1reference

//

q41(f

V4 )

21

Querying Vertically Distributed XML Collections

• Annotate tree patternnodes

• Decompose tree pattern

• Convert sub-patternsinto local plans

• Generate distributedexecution plan

⋊⋉id(ap3)=id(arp3 )

⋊⋉id(ap2)=id(arp2 )

p11(fV1 ) p21(f

V2 )

⋊⋉id(ap4)=id(arp4 )

p31(fV3 ) p41(f

V4 )

author

ap2

P1→2∗

/

ap3

P1→3∗

//

q11(f

V1 )

arp2

RP1→2∗

name

/

first.=’William’

/

last.=’Shakespeare’

/

q21(f

V2 )

arp3

RP1→3∗

ap4

P3→4∗

//

q31(f

V3 )

arp4

RP3→4∗

ae1reference

//

q41(f

V4 )

21

Querying Vertically Distributed XML Collections

• Annotate tree patternnodes

• Decompose tree pattern

• Convert sub-patternsinto local plans

• Generate distributedexecution plan

⋊⋉id(ap3)=id(arp3 )

⋊⋉id(ap2)=id(arp2 )

p11(fV1 ) p21(f

V2 )

⋊⋉id(ap4)=id(arp4 )

p31(fV3 ) p41(f

V4 )

author

ap2

P1→2∗

/

ap3

P1→3∗

//

q11(f

V1 )

arp2

RP1→2∗

name

/

first.=’William’

/

last.=’Shakespeare’

/

q21(f

V2 )

arp3

RP1→3∗

ap4

P3→4∗

//

q31(f

V3 )

arp4

RP3→4∗

ae1reference

//

q41(f

V4 )

21

Improving Distributed Execution Plans

• Pruning irrelevant fragments

• Join order

• Push cross-fragment joins into local plans

22

Improving Distributed Execution Plans

• Pruning irrelevant fragments

• Join order

• Push cross-fragment joins into local plans

22

Pushing Cross-Fragment Joins

Large fraction of local results are discarded by cross-fragment join

RP3→419

chapter2

reference2

RP3→420

chapter3

reference3

RP3→421

chapter4

reference4

f V4

23

Pushing Cross-Fragment Joins

Large fraction of local results are discarded by cross-fragment join

RP3→419

chapter2

reference2

RP3→420

chapter3

reference3

RP3→421

chapter4

reference4

f V4

23

Pushing Cross-Fragment Joins

Large fraction of local results are discarded by cross-fragment join

RP3→419

chapter2

reference2

RP3→420

chapter3

reference3

RP3→421

chapter4

reference4

f V4

• Idea: only access relevant sub-trees in fragment

• Avoid computing irrelevant local results

• Use pipelining to push cross-fragment join into local plan

23

A Local Query Plan

πarp2

⋉a1/a3

⋉a1/a2

⋊⋉arp2 /a1

scanarp2 :RP1→2∗

scana1:name

σa2=’William’

scana2:first

σa3=’Shakespeare’

scana3:last

p21(f

V2 )

24

A Local Query Plan

πarp2

⋉a1/a3

⋉a1/a2

⋊⋉arp2 /a1

scanarp2 :RP1→2∗

scana1:name

σa2=’William’

scana2:first

σa3=’Shakespeare’

scana3:last

p21(f

V2 )

• Plan scans root proxynodes in fragment

• Idea: filter these rootproxy nodes beforeevaluating remainderof plan

• Works for navigatingplans and plans basedon structural joins(shown here)

24

A Local Query Plan

πarp2 ,...

⋉a1/a3

⋉a1/a2

⋊⋉arp2 /a1

⋊⋉

. . . scanarp2 :RP1→2∗

scana1:name

σa2=’William’

scana2:first

σa3=’Shakespeare’

scana3:last

p21(f

V2 )

• Plan scans root proxynodes in fragment

• Idea: filter these rootproxy nodes beforeevaluating remainderof plan

• Works for navigatingplans and plans basedon structural joins(shown here)

25

Pushing Cross-Fragment Joins

p4′

1 (fV4 )

⋊⋉id(ap4)=id(arp4 )

p3′

1 (fV3 )

⋊⋉id(ap3)=id(arp3 )

p2′

1 (fV2 )

⋊⋉id(ap2)=id(arp2 )

p11(fV1 ) scanarp2 :RP1→2

scanarp3 :RP1→3∗

scanarp4 :RP3→4∗

26

Pushing Cross-Fragment Joins

p4′

1 (fV4 )

⋊⋉id(ap4)=id(arp4 )

p3′

1 (fV3 )

⋊⋉id(ap3)=id(arp3 )

p2′

1 (fV2 )

⋊⋉id(ap2)=id(arp2 )

p11(fV1 ) scanarp2 :RP1→2

scanarp3 :RP1→3∗

scanarp4 :RP3→4∗

26

Pushing Cross-Fragment Joins

p4′

1 (fV4 )

⋊⋉id(ap4)=id(arp4 )

p3′

1 (fV3 )

⋊⋉id(ap3)=id(arp3 )

p2′

1 (fV2 )

⋊⋉id(ap2)=id(arp2 )

p11(fV1 ) scanarp2 :RP1→2

scanarp3 :RP1→3∗

scanarp4 :RP3→4∗

26

Pushing Cross-Fragment Joins

p4′

1 (fV4 )

⋊⋉id(ap4)=id(arp4 )

p3′

1 (fV3 )

⋊⋉id(ap3)=id(arp3 )

p2′

1 (fV2 )

⋊⋉id(ap2)=id(arp2 )

p11(fV1 ) scanarp2 :RP1→2

scanarp3 :RP1→3∗

scanarp4 :RP3→4∗

26

Pushing Cross Fragment Joins: Implementation

• Can use full pipelining if both inputs to join are ordered

• Alternatively, can use index on root proxy nodes

• Full parallelism after first tuple received by local plan

27

Pushing Cross Fragment Joins

• Avoids accessing large portion of sub-trees within a fragment

• Can only be fully used in left-deep plans

• Decreases flexibility (e.g., where joins are performed)

28

Label Path Filtering

• Cross-fragment join pushing works well but decreases flexibility

• Goal: find a solution that can obtain partial benefit forscenarios where join pushing cannot be applied

• Idea: use selection instead of join to filter out some root proxynodes

29

Label Path Filtering

• Assign to each proxy node the label path from the documentroot

• Filter for label paths that are compatible with the query

author4

name4

first4

William

last4

Shakespeare

pubs4

book4

chapter4

reference4

chapter5

30

Label Path Filtering

• Assign to each proxy node the label path from the documentroot

• Filter for label paths that are compatible with the query

author4

name4

first4

William

last4

Shakespeare

pubs4

book4

chapter4

reference4

chapter5

/author/pubs/book

30

Label Path Filtering

⋊⋉id(arp4 )=id(ap4)

⋊⋉id(arp3 )=id(ap3)

⋊⋉id(arp2 )=id(ap2)

p11(fV1 ) p21 ∗ (f

V2 )

p31(fV3 )

p4′

1 (fV4 )

σlabel(arp4 )=/author/pubs/book

scanarp4 =(RP3→4∗

)

• Assume there are twotypes of publications:book and article

• Can use selection tofilter chapters basedon publication type

31

Label Path Filtering

• Can be used in more cases

• Retains higher degree of flexibility

• Benefit is more limited (does not filter all irrelevant root proxynodes)

32

Determining the Best Distributed Execution Plan

• Join pushing and label path filtering are not alwaysadvantageous

• Determine best execution plan using cost model

33

Outline

1 Fragmenting XML Collections

2 Querying Distributed XML CollectionsQuery ModelDistributed Query EvaluationImproving Performance

3 Performance Evaluation

4 Conclusion

34

Performance Evaluation

• Implemented techniques within Natix

• 12 GB XMark collection (auction data)

• 1 Amazon EC2 instance for each of each of 10 verticalfragments

35

Performance Evaluation

• XPathMark queries (with few filtering value constraints)

• Modified, more selective XPathMark queries

A1 /site/closed auctions/closed auction/annotation/description/text/keyword

A2 //closed auction//keyword

A3 /site/closed auctions/closed auction//keyword

A4 /site/closed auctions/closed auction[annotation/description/text/keyword]/date

A5 /site/closed auctions/closed auction[descendant::keyword]/date

A6 /site/people/person[profile/gender and profile/age]/name

B7 //person[profile/@income]/name

A1S /site/closed auctions/closed auction[price > 600]/annotation/description/text/keyword

A2S //closed auction[price > 600]//keyword

A3S /site/closed auctions/closed auction[price > 600]//keyword

A4S /site/closed auctions/closed auction[price > 600][annotation/description/text/keyword]/date

A5S /site/closed auctions/closed auction[price > 600][descendant::keyword]/date

A6S /site/people/person[starts-with(name, ’Ry’)][profile/gender and profile/age]/name

B7S //person[starts-with(name, ’Ry’)][profile/@income]/name

36

Performance Evaluation: XPathMark

200

400

600

800

1000

1200

1400

A1 A2 A3 A4 A5 A6 B7

Res

pons

e tim

e (s

econ

ds)

Plan

centdist

push

37

Performance Evaluation: Selective XPathMark

0

200

400

600

800

1000

1200

1400

A1 A2 A3 A4 A5 A6 B7

Res

pons

e tim

e (s

econ

ds)

Plan

centdist

push

38

Conclusions

• Distribution can make XML query evaluation more scalable

• Join pushing can significantly improve query performance

• A cost model is essential for finding the optimal technique fora given query

39

References

[1] Patrick Kling, M. Tamer Ozsu, Khuzaima Daudjee: GeneratingEfficient Execution Plans for Vertically Partitioned XMLDatabases, PVLDB 2010.

[2] Patrick Kling, M. Tamer Ozsu, Khuzaima Daudjee: ScalingXML Query Processing: Distribution, Localization and Pruning,DAPD 2011.

40

Recommended