
Generating Efficient Execution Plans for Vertically Partitioned XML Databases

Patrick Kling, M. Tamer Ozsu, and Khuzaima Daudjee

University of Waterloo
David R. Cheriton School of Computer Science

VLDB 2011


The Problem

• Centralized query evaluation techniques for XML are well understood

• These techniques do not scale to large collection sizes and heavy workloads

• Goal: use distribution to improve scalability

• Focus on end-to-end cost of query evaluation


Distributed XML Query Evaluation: Two Scenarios

• Integrating multiple data sources
  • Fragmentation is determined by existing data sources
  • Need flexible fragmentation model to express this

• Distribution for performance
  • Choose fragmentation to suit workload
  • Can use more constrained fragmentation model
  • Fragmentation specification allows for distributed query optimization


Outline

1 Fragmenting XML Collections

2 Querying Distributed XML Collections
    Query Model
    Distributed Query Evaluation
    Improving Performance

3 Performance Evaluation

4 Conclusion


Fragmenting XML Collections

• Ad-hoc fragmentation

• Structure-based fragmentation


Ad-hoc fragmentation

• Cut arbitrary edges in document tree

• Highly flexible (good for data integration)

• No explicit fragmentation specification

• Limited potential for exploiting fragmentation characteristics for query optimization

• Not a suitable choice for this work


Structure-based Fragmentation

• Fragmentation according to characteristics of data or schema

• Yields a fragmentation specification that can be exploited for query optimization

• Better choice when distributing for performance


Our Fragmentation Model

• Focus on simplicity and precise fragmentation specification

• Focus on partitioning collection (replication is orthogonal)

• Follow semantics of relational fragmentation techniques
  • Horizontal fragmentation (based on predicates/selection)
  • Vertical fragmentation (based on partitioning of schema/projection)
  • Hybrid fragmentation (combination of horizontal and vertical steps)


Vertical Fragmentation

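[Figure: a document split into three vertical fragments. f^V_1 contains author_2 with proxy nodes P^{1→2}_13 and P^{1→3}_14. f^V_2 contains root proxy node RP^{1→2}_13 with name_2, first_2 ("Jane"), and last_2 ("Dean"). f^V_3 contains root proxy node RP^{1→3}_14 with pubs_2.]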

Vertical Fragmentation Specification

Vertical fragmentation is specified by a fragmentation schema.

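[Figure: fragmentation schema. Element types author and agent form f^V_1; name, first, and last form f^V_2; pubs and book form f^V_3; chapter and reference form f^V_4. Edges carry cardinality labels (ONCE, OPT, MULT); the leaves first and last are marked ∗.]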
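As an illustration only, a fragmentation schema of this kind could be held in a small Python structure. The element-to-fragment assignment follows the example in these slides; the names `FRAGMENT_OF`, `SCHEMA_EDGES`, and `proxy_pairs`, as well as the exact parent/child edges, are assumptions made for this sketch and are not taken from the talk.

```python
# Hypothetical encoding of the example fragmentation schema.
# Element type -> fragment, as read off the example slides.
FRAGMENT_OF = {
    "author": "fV1", "agent": "fV1",
    "name": "fV2", "first": "fV2", "last": "fV2",
    "pubs": "fV3", "book": "fV3",
    "chapter": "fV4", "reference": "fV4",
}

# Assumed schema edges (parent, child). Cardinalities (ONCE, OPT, MULT)
# are left out because this sketch does not use them.
SCHEMA_EDGES = [
    ("author", "agent"), ("author", "name"), ("author", "pubs"),
    ("name", "first"), ("name", "last"),
    ("pubs", "book"), ("book", "chapter"), ("chapter", "reference"),
]


def proxy_pairs(edges, fragment_of):
    """Return the (parent fragment, child fragment) pairs connected by a
    schema edge that crosses a fragment boundary. Each such pair needs
    proxy nodes on the parent side and root proxy nodes on the child side."""
    return sorted({(fragment_of[p], fragment_of[c])
                   for p, c in edges
                   if fragment_of[p] != fragment_of[c]})


print(proxy_pairs(SCHEMA_EDGES, FRAGMENT_OF))
```

Under these assumptions the crossing pairs correspond to the proxy types P^{1→2}, P^{1→3}, and P^{3→4} that appear in the figures.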


Query model

XQ, subset of XPath

• Nested paths with child and descendant steps

• Explicit node tests and wild cards

• Value constraints (numeric or textual)

Q := σ | ∗ | Q//Q | Q/Q | Q[q]
q := Q | . = / ≠ str | . = / ≠ / ≤ / < / ≥ / > num


Query Example

“Find all references in publications written by authors whose first name is ‘William’ and whose last name is ‘Shakespeare’.”

/author[./name[./first = "William" and ./last = "Shakespeare"]]//reference

• Node tests

• Value constraints

• Structural constraints


Tree Patterns

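[Figure: tree pattern for the example query. author is the root; a child edge (/) leads to name, which has child edges (/) to first (. = 'William') and last (. = 'Shakespeare'); a descendant edge (//) leads from author to reference.]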


• Pattern nodes with node tests and value constraints


• Edges annotated with XPath axes

• Extraction point nodes
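The three ingredients listed above (pattern nodes, axis-annotated edges, extraction points) can be sketched as a small Python data structure. This is only one possible encoding; the class name `PatternNode` and its fields are hypothetical and are not part of the talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PatternNode:
    test: str                     # node test: element name or "*"
    axis: str = "/"               # axis on the edge from the parent: "/" or "//"
    value: Optional[str] = None   # optional value constraint (. = value)
    extract: bool = False         # True for extraction point nodes
    children: List["PatternNode"] = field(default_factory=list)


# Tree pattern for
# /author[./name[./first = "William" and ./last = "Shakespeare"]]//reference
pattern = PatternNode("author", children=[
    PatternNode("name", "/", children=[
        PatternNode("first", "/", value="William"),
        PatternNode("last", "/", value="Shakespeare"),
    ]),
    PatternNode("reference", "//", extract=True),
])


def extraction_points(node):
    """Collect the node tests of all extraction point nodes, in pre-order."""
    found = [node.test] if node.extract else []
    for child in node.children:
        found += extraction_points(child)
    return found


print(extraction_points(pattern))
```

In this encoding the axis is stored on the child, so every edge of the pattern is described by exactly one node.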


Evaluating Tree Pattern Queries

[Figure: step-by-step matching of the tree pattern (extraction point a^e_1 on reference) against a document tree containing author_4, name_4, first_4 ("William"), last_4 ("Shakespeare"), pubs_4, book_4, chapter_4, reference_4, and chapter_5. Result: [a^e_1 = reference_4].]


• Various centralized approaches exist
  • Navigating document trees
  • Structural joins

• We leverage these for distributed query evaluation
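To make the navigational approach concrete, here is a deliberately naive Python sketch that matches a tree pattern against a document tree by recursive navigation. It is not the evaluation strategy of any particular system. The `Node` class, the tuple encoding of pattern nodes, and the placement of reference_4 under chapter_4 are assumptions made for the example.

```python
class Node:
    """Document tree node with a label, optional text, and an identifier."""
    def __init__(self, label, ident, text=None, children=()):
        self.label, self.ident, self.text = label, ident, text
        self.children = list(children)


def descendants(node):
    """Yield all proper descendants of node in document order."""
    for child in node.children:
        yield child
        yield from descendants(child)


# A pattern node is a tuple: (test, axis, value, is_extraction_point, children)
def matches(pnode, dnode):
    """Return (ok, results): whether dnode matches pnode, and the document
    nodes bound to extraction points below it."""
    test, _axis, value, extract, pchildren = pnode
    if test != "*" and test != dnode.label:
        return False, []
    if value is not None and dnode.text != value:
        return False, []
    results = [dnode] if extract else []
    for pchild in pchildren:
        # Child axis looks at children only; descendant axis at all descendants.
        candidates = dnode.children if pchild[1] == "/" else descendants(dnode)
        found = False
        for cand in candidates:
            ok, sub = matches(pchild, cand)
            if ok:
                found = True
                results += sub
        if not found:
            return False, []
    return True, results


doc = Node("author", "author4", children=[
    Node("name", "name4", children=[
        Node("first", "first4", "William"),
        Node("last", "last4", "Shakespeare"),
    ]),
    Node("pubs", "pubs4", children=[Node("book", "book4", children=[
        Node("chapter", "chapter4",
             children=[Node("reference", "reference4")]),
        Node("chapter", "chapter5"),
    ])]),
])

pattern = ("author", "/", None, False, [
    ("name", "/", None, False, [
        ("first", "/", "William", False, []),
        ("last", "/", "Shakespeare", False, []),
    ]),
    ("reference", "//", None, True, []),
])

ok, result = matches(pattern, doc)
print(ok, [n.ident for n in result])
```

The sketch ignores efficiency entirely and may report a node more than once for patterns with nested descendant steps; structural-join approaches avoid repeated navigation of this kind.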


Querying Vertically Distributed XML Collections

• Input
  • Fragmentation-unaware tree pattern query
  • Fragmentation schema

• Tasks
  • Annotate tree pattern nodes with corresponding fragments
  • Decompose tree pattern into sub-patterns for individual fragments
  • Convert sub-patterns to local plans using existing techniques (each site is free to choose local strategy)
  • Generate distributed execution plan that specifies how results are combined
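The first two tasks can be sketched in Python as follows. This is a simplification under stated assumptions: the mapping `FRAGMENT_OF` is read off the example schema, the triple encoding of the pattern and the function names `annotate` and `decompose` are hypothetical, and wildcards are not handled.

```python
# Assumed element-to-fragment mapping, taken from the example schema.
FRAGMENT_OF = {"author": "fV1", "name": "fV2", "first": "fV2",
               "last": "fV2", "pubs": "fV3", "book": "fV3",
               "chapter": "fV4", "reference": "fV4"}

# Tree pattern as (node, parent, axis) triples; the root has parent None.
PATTERN = [("author", None, None), ("name", "author", "/"),
           ("first", "name", "/"), ("last", "name", "/"),
           ("reference", "author", "//")]


def annotate(pattern, fragment_of):
    """Task 1: label every pattern node with the fragment that stores it."""
    return {node: fragment_of[node] for node, _parent, _axis in pattern}


def decompose(pattern, annotation):
    """Task 2 (simplified): group pattern nodes by fragment and list the
    pattern edges that cross a fragment boundary."""
    sub_patterns, cross_edges = {}, []
    for node, parent, axis in pattern:
        sub_patterns.setdefault(annotation[node], []).append(node)
        if parent is not None and annotation[parent] != annotation[node]:
            cross_edges.append((parent, axis, node))
    return sub_patterns, cross_edges


annotation = annotate(PATTERN, FRAGMENT_OF)
sub_patterns, cross_edges = decompose(PATTERN, annotation)
print(sub_patterns)
print(cross_edges)
```

Note what the sketch leaves out: a descendant edge such as author//reference may pass through an intermediate fragment (here f^V_3), which needs its own sub-pattern connecting a root proxy node to a proxy node. A full decomposition would consult the fragmentation schema to add those.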


[Figure: tree pattern annotated with fragments. author: f^V_1; name, first (. = 'William'), and last (. = 'Shakespeare'): f^V_2; reference: f^V_4.]


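[Figure: decomposition of the annotated tree pattern into one sub-pattern per fragment:
  q^1_1(f^V_1): author, with child edge (/) to a^p_2 : P^{1→2}_∗ and descendant edge (//) to a^p_3 : P^{1→3}_∗
  q^2_1(f^V_2): a^rp_2 : RP^{1→2}_∗, with child edge (/) to name, which has child edges (/) to first and last
  q^3_1(f^V_3): a^rp_3 : RP^{1→3}_∗, with descendant edge (//) to a^p_4 : P^{3→4}_∗
  q^4_1(f^V_4): a^rp_4 : RP^{3→4}_∗, with descendant edge (//) to a^e_1 : reference]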


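[Figure: distributed execution plan that combines the local plans p^1_1(f^V_1), p^2_1(f^V_2), p^3_1(f^V_3), and p^4_1(f^V_4) with cross-fragment joins on proxy and root proxy IDs, for example ⋈ id(a^p_3) = id(a^rp_3).]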
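A cross-fragment join of this kind pairs a proxy node in one fragment with the root proxy node of the same ID in another. The following Python sketch uses a hash join; the row format (dictionaries keyed by pattern variable), the function name `join_on_proxy_id`, and the sample IDs are all hypothetical.

```python
def join_on_proxy_id(left, right, proxy_var, root_proxy_var):
    """Hash join of two local result lists on
    id(proxy node) == id(root proxy node)."""
    index = {}
    for row in right:
        index.setdefault(row[root_proxy_var], []).append(row)
    return [{**l, **r}
            for l in left
            for r in index.get(l[proxy_var], [])]


# Hypothetical local results: each row binds pattern variables to node IDs.
p1 = [{"ap2": 13, "ap3": 14}]                  # from fragment fV1
p3 = [{"arp3": 14, "ap4": 21},                 # from fragment fV3
      {"arp3": 15, "ap4": 22}]

joined = join_on_proxy_id(p1, p3, "ap3", "arp3")
print(joined)
```

Only the row of `p3` whose root proxy ID matches a proxy ID produced by `p1` survives; the other local result was computed for nothing, which is the inefficiency that join pushing targets.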


Improving Distributed Execution Plans

• Pruning irrelevant fragments

• Join order

• Push cross-fragment joins into local plans


Pushing Cross-Fragment Joins

Large fraction of local results are discarded by cross-fragment join

[Figure: fragment f^V_4 with three sub-trees, rooted at RP^{3→4}_19 (chapter_2, reference_2), RP^{3→4}_20 (chapter_3, reference_3), and RP^{3→4}_21 (chapter_4, reference_4).]


• Idea: only access relevant sub-trees in fragment

• Avoid computing irrelevant local results

• Use pipelining to push cross-fragment join into local plan
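The difference between joining after local evaluation and pushing the join into the local plan can be sketched with Python generators. The fragment layout (a dictionary from root proxy ID to the labels in its sub-tree), the stand-in `local_plan`, and all function names are assumptions made for this illustration.

```python
# Hypothetical fragment fV4: root proxy ID -> labels in its sub-tree.
FRAGMENT_V4 = {
    19: ["chapter2", "reference2"],
    20: ["chapter3", "reference3"],
    21: ["chapter4", "reference4"],
}


def local_plan(subtree):
    """Stand-in for the local plan: return the 'reference' nodes of a sub-tree."""
    return [label for label in subtree if label.startswith("reference")]


def plan_without_pushing(fragment, wanted_ids):
    """Evaluate every sub-tree, then discard most results in the join."""
    results = [(rid, local_plan(tree)) for rid, tree in fragment.items()]
    return [(rid, refs) for rid, refs in results if rid in wanted_ids]


def plan_with_pushing(fragment, incoming_ids):
    """Filter root proxy nodes against IDs streamed from the upstream plan,
    so that irrelevant sub-trees are never touched."""
    for rid in incoming_ids:          # pipelined input, consumed lazily
        if rid in fragment:           # lookup by root proxy ID
            yield rid, local_plan(fragment[rid])


print(plan_without_pushing(FRAGMENT_V4, {21}))
print(list(plan_with_pushing(FRAGMENT_V4, iter([21]))))
```

Both plans return the same result, but the pushed variant calls `local_plan` once instead of three times.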


A Local Query Plan

Local plan p^2_1(f^V_2):

π a^rp_2
  ⋉ a1/a3
    ⋉ a1/a2
      ⋈ a^rp_2/a1
        scan a^rp_2 : RP^{1→2}_∗
        scan a1 : name
      σ a2 = 'William'
        scan a2 : first
    σ a3 = 'Shakespeare'
      scan a3 : last


• Plan scans root proxy nodes in fragment

• Idea: filter these root proxy nodes before evaluating remainder of plan

• Works for navigating plans and plans based on structural joins (shown here)

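Local plan p^2_1(f^V_2) with the cross-fragment join pushed in (". . ." is the input from outside the fragment):

π a^rp_2, ...
  ⋉ a1/a3
    ⋉ a1/a2
      ⋈ a^rp_2/a1
        ⋈
          . . .
          scan a^rp_2 : RP^{1→2}_∗
        scan a1 : name
      σ a2 = 'William'
        scan a2 : first
    σ a3 = 'Shakespeare'
      scan a3 : last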



[Figure: distributed plan with cross-fragment joins pushed into the local plans. From top to bottom: p^{4′}_1(f^V_4), p^{3′}_1(f^V_3), p^{2′}_1(f^V_2), and p^1_1(f^V_1), shown next to scan a^rp_2 : RP^{1→2}_∗.]

Pushing Cross-Fragment Joins: Implementation

• Can use full pipelining if both inputs to join are ordered

• Alternatively, can use index on root proxy nodes

• Full parallelism after first tuple received by local plan
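When both inputs arrive in ascending ID order, the pushed join can run as a one-pass merge that emits matches as soon as they are found. The Python sketch below assumes IDs are unique within each input; the function name `merge_join_ids` and the sample IDs are hypothetical.

```python
def merge_join_ids(upstream_ids, root_proxy_ids):
    """Yield the IDs present in both ascending inputs, in one pass and
    without buffering either input. Assumes unique IDs per input."""
    left, right = iter(upstream_ids), iter(root_proxy_ids)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l == r:
            yield l                       # match: emit immediately
            l, r = next(left, None), next(right, None)
        elif l < r:
            l = next(left, None)          # advance the smaller side
        else:
            r = next(right, None)


# IDs arriving from the upstream plan vs. root proxy IDs in the fragment.
print(list(merge_join_ids([14, 21, 30], [19, 20, 21, 30])))
```

Because the generator yields each match as it is found, the downstream part of the local plan can start working after the first matching tuple. If the inputs are not ordered, an index lookup on root proxy IDs is the alternative named on the slide.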


Pushing Cross-Fragment Joins

• Avoids accessing large portion of sub-trees within a fragment

• Can only be fully used in left-deep plans

• Decreases flexibility (e.g., where joins are performed)


Label Path Filtering

• Cross-fragment join pushing works well but decreases flexibility

• Goal: find a solution that can obtain partial benefit for scenarios where join pushing cannot be applied

• Idea: use selection instead of join to filter out some root proxy nodes



• Assign to each proxy node the label path from the document root

• Filter for label paths that are compatible with the query

[Figure: the document tree from the evaluation example (author_4, name_4, first_4, last_4, pubs_4, book_4, chapter_4, reference_4, chapter_5), with the label path /author/pubs/book highlighted.]


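[Figure: distributed plan over p^1_1(f^V_1), p^2_1(f^V_2), p^3_1(f^V_3), and p^{4′}_1(f^V_4) with the cross-fragment join ⋈ id(a^rp_4) = id(a^p_4). Inside p^{4′}_1(f^V_4), the selection σ label(a^rp_4) = /author/pubs/book is applied to scan a^rp_4 : RP^{3→4}_∗.]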

• Assume there are two types of publications: book and article

• Can use selection to filter chapters based on publication type
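One way to test whether a label path is compatible with a query is to translate the relevant part of the query into a regular expression over label paths. The Python sketch below does this for paths built from child steps, descendant steps, and wildcards. The function names, the sample root proxy IDs, and the `/author/pubs/article` path are hypothetical; the article case only mirrors the assumption on the slide.

```python
import re


def path_to_regex(path_expr):
    """Translate a path made of / and // steps (with * wildcards) into a
    regular expression that matches whole label paths."""
    pattern = ""
    for axis, test in re.findall(r"(//|/)([^/]+)", path_expr):
        label = "[^/]+" if test == "*" else re.escape(test)
        # A descendant step may skip any number of intermediate labels.
        pattern += ("(?:/[^/]+)*/" if axis == "//" else "/") + label
    return re.compile(pattern + r"\Z")


def filter_root_proxies(label_paths, path_expr):
    """Keep the root proxy IDs whose label path is compatible with path_expr."""
    regex = path_to_regex(path_expr)
    return [rid for rid, path in label_paths.items() if regex.match(path)]


# Hypothetical label paths stored with the root proxy nodes of fV4.
LABEL_PATHS = {
    19: "/author/pubs/book",
    20: "/author/pubs/article",
    21: "/author/pubs/book",
}

print(filter_root_proxies(LABEL_PATHS, "/author//book"))
```

The selection needs no input from other fragments, so it can be applied wherever the root proxy nodes are scanned. It cannot remove root proxy nodes that have a compatible label path but belong to an author who fails the value constraints, which is why its benefit is more limited than that of join pushing.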



• Can be used in more cases

• Retains higher degree of flexibility

• Benefit is more limited (does not filter all irrelevant root proxy nodes)


Determining the Best Distributed Execution Plan

• Join pushing and label path filtering are not always advantageous

• Determine best execution plan using cost model


Performance Evaluation

• Implemented techniques within Natix

• 12 GB XMark collection (auction data)

• One Amazon EC2 instance for each of 10 vertical fragments


Performance Evaluation

• XPathMark queries (with few filtering value constraints)

• Modified, more selective XPathMark queries

A1   /site/closed_auctions/closed_auction/annotation/description/text/keyword
A2   //closed_auction//keyword
A3   /site/closed_auctions/closed_auction//keyword
A4   /site/closed_auctions/closed_auction[annotation/description/text/keyword]/date
A5   /site/closed_auctions/closed_auction[descendant::keyword]/date
A6   /site/people/person[profile/gender and profile/age]/name
B7   //person[profile/@income]/name

A1S  /site/closed_auctions/closed_auction[price > 600]/annotation/description/text/keyword
A2S  //closed_auction[price > 600]//keyword
A3S  /site/closed_auctions/closed_auction[price > 600]//keyword
A4S  /site/closed_auctions/closed_auction[price > 600][annotation/description/text/keyword]/date
A5S  /site/closed_auctions/closed_auction[price > 600][descendant::keyword]/date
A6S  /site/people/person[starts-with(name, 'Ry')][profile/gender and profile/age]/name
B7S  //person[starts-with(name, 'Ry')][profile/@income]/name


Performance Evaluation: XPathMark

[Figure: bar chart of response time (seconds, axis up to 1400) for queries A1, A2, A3, A4, A5, A6, and B7 under three plans: cent, dist, and push.]

Performance Evaluation: Selective XPathMark

[Figure: bar chart of response time (seconds, axis from 0 to 1400) for the selective variants of queries A1, A2, A3, A4, A5, A6, and B7 under three plans: cent, dist, and push.]

Conclusions

• Distribution can make XML query evaluation more scalable

• Join pushing can significantly improve query performance

• A cost model is essential for finding the optimal technique for a given query


References

[1] Patrick Kling, M. Tamer Ozsu, Khuzaima Daudjee: Generating Efficient Execution Plans for Vertically Partitioned XML Databases, PVLDB 2010.

[2] Patrick Kling, M. Tamer Ozsu, Khuzaima Daudjee: Scaling XML Query Processing: Distribution, Localization and Pruning, DAPD 2011.
