Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Generating Efficient Execution Plans for
Vertically Partitioned XML Databases
Patrick Kling, M. Tamer Ozsu, and Khuzaima Daudjee
University of WaterlooDavid R. Cheriton School of Computer Science
VLDB 2011
1
The Problem
• Centralized query evaluation techniques for XML wellunderstood
• These techniques do not scale to large collection sizes andheavy workloads
• Goal: use distribution to improve scalability
• Focus on end-to-end cost of query evaluation
2
Distributed XML Query Evaluation: Two Scenarios
• Integrating multiple data sources• Fragmentation is determined by existing data sources• Need flexible fragmentation model to express this
• Distribution for performance• Choose fragmentation to suit workload• Can use more constrained fragmentation model• Fragmentation specification allows for distributed query
optimization
3
Distributed XML Query Evaluation: Two Scenarios
• Integrating multiple data sources• Fragmentation is determined by existing data sources• Need flexible fragmentation model to express this
• Distribution for performance• Choose fragmentation to suit workload• Can use more constrained fragmentation model• Fragmentation specification allows for distributed query
optimization
3
Outline
1 Fragmenting XML Collections
2 Querying Distributed XML CollectionsQuery ModelDistributed Query EvaluationImproving Performance
3 Performance Evaluation
4 Conclusion
4
Outline
1 Fragmenting XML Collections
2 Querying Distributed XML CollectionsQuery ModelDistributed Query EvaluationImproving Performance
3 Performance Evaluation
4 Conclusion
5
Fragmenting XML Collections
• Ad-hoc fragmentation
• Structure-based fragmentation
6
Ad-hoc fragmentation
• Cut arbitrary edges in document tree
• Highly flexible (good for data integration)
• No explicit fragmentation specification
• Limited potential for exploiting fragmentation characteristicsfor query optimization
• Not a suitable choice for this work
7
Structure-based Fragmentation
• Fragmentation according to characteristics of data or schema
• Yields a fragmentation specification that can be exploited forquery optimization
• Better choice when distributing for performance
8
Our Fragmentation Model
• Focus on simplicity and precise fragmentation specification
• Focus on partitioning collection (replication is orthogonal)
• Follow semantics of relational fragmentation techniques• Horizontal fragmentation (based on predicates/selection)• Vertical fragmentation (based on partitioning of
schema/projection)• Hybrid fragmentation (combination of horizontal and vertical
steps)
9
Our Fragmentation Model
• Focus on simplicity and precise fragmentation specification
• Focus on partitioning collection (replication is orthogonal)
• Follow semantics of relational fragmentation techniques• Horizontal fragmentation (based on predicates/selection)• Vertical fragmentation (based on partitioning of
schema/projection)• Hybrid fragmentation (combination of horizontal and vertical
steps)
9
Vertical Fragmentation
author2
P1→213 P1→3
14
f V1
RP1→213
name2
first2
Jane
last2
Dean
f V2
RP1→314
pubs2
f V3
10
Vertical Fragmentation Specification
Vertical fragmentation is specified by a fragmentation schema.
author
agent
OPT
f V1
pubs
book
MULT
f V3
name
first
ONCE
∗
last
ONCE
∗
f V2
chapter
reference
OPT ONCE
f V4
ONCE
ONCE
ONCE
MULT
11
Outline
1 Fragmenting XML Collections
2 Querying Distributed XML CollectionsQuery ModelDistributed Query EvaluationImproving Performance
3 Performance Evaluation
4 Conclusion
12
Query model
XQ, subset of XPath
• Nested paths with child and descendant steps
• Explicit node tests and wild cards
• Value constraints (numeric or textual)• Q := σ | ∗ | Q//Q | Q/Q |Q[q]
q := Q | . = / 6= str | . = / 6= / ≤ / < / ≥ / > num
13
Query Example
“Find all references in publications written by authors whose firstname is ‘William’ and whose last name is ‘Shakespeare’ ”
14
Query Example
“Find all references in publications written by authors whose firstname is ‘William’ and whose last name is ‘Shakespeare’ ”
/author[./name[./first = “William”and
./last = “Shakespeare”]]//reference
14
Query Example
“Find all references in publications written by authors whose firstname is ‘William’ and whose last name is ‘Shakespeare’ ”
/author[./name[./first = “William”and
./last = “Shakespeare”]]//reference
• Node tests
14
Query Example
“Find all references in publications written by authors whose firstname is ‘William’ and whose last name is ‘Shakespeare’ ”
/author[./name[./first = “William”and
./last = “Shakespeare”]]//reference
• Node tests
• Value constraints
14
Query Example
“Find all references in publications written by authors whose firstname is ‘William’ and whose last name is ‘Shakespeare’ ”
/author[./name[./first = “William”and
./last = “Shakespeare”]]//reference
• Node tests
• Value constraints
• Structural constraints
14
Tree Patterns
author
name
/
first.=’William’
/
last.=’Shakespeare’
/
reference
//
15
Tree Patterns
author
name
/
first.=’William’
/
last.=’Shakespeare’
/
reference
//
• Pattern nodes with nodetests and value constraints
15
Tree Patterns
author
name
/
first.=’William’
/
last.=’Shakespeare’
/
reference
//
• Pattern nodes with nodetests and value constraints
15
Tree Patterns
author
name
/
first.=’William’
/
last.=’Shakespeare’
/
reference
//
• Pattern nodes with nodetests and value constraints
• Edges annotated with XPathaxes
15
Tree Patterns
author
name
/
first.=’William’
/
last.=’Shakespeare’
/
reference
//
• Pattern nodes with nodetests and value constraints
• Edges annotated with XPathaxes
• Extraction point nodes
15
Evaluating Tree Pattern Queries
author
name
/
first.=’William’
/
last.=’Shakespeare’
/
ae1reference
//
author4
name4
first4
William
last4
Shakespeare
pubs4
book4
chapter4
reference4
chapter5
16
Evaluating Tree Pattern Queries
author
name
/
first.=’William’
/
last.=’Shakespeare’
/
ae1reference
//
author4
name4
first4
William
last4
Shakespeare
pubs4
book4
chapter4
reference4
chapter5
16
Evaluating Tree Pattern Queries
author
name
/
first.=’William’
/
last.=’Shakespeare’
/
ae1reference
//
author4
name4
first4
William
last4
Shakespeare
pubs4
book4
chapter4
reference4
chapter5
16
Evaluating Tree Pattern Queries
author
name
/
first.=’William’
/
last.=’Shakespeare’
/
ae1reference
//
author4
name4
first4
William
last4
Shakespeare
pubs4
book4
chapter4
reference4
chapter5
16
Evaluating Tree Pattern Queries
author
name
/
first.=’William’
/
last.=’Shakespeare’
/
ae1reference
//
author4
name4
first4
William
last4
Shakespeare
pubs4
book4
chapter4
reference4
chapter5
16
Evaluating Tree Pattern Queries
author
name
/
first.=’William’
/
last.=’Shakespeare’
/
ae1reference
//
author4
name4
first4
William
last4
Shakespeare
pubs4
book4
chapter4
reference4
chapter5
16
Evaluating Tree Pattern Queries
author
name
/
first.=’William’
/
last.=’Shakespeare’
/
ae1reference
//
author4
name4
first4
William
last4
Shakespeare
pubs4
book4
chapter4
reference4
chapter5
16
Evaluating Tree Pattern Queries
author
name
/
first.=’William’
/
last.=’Shakespeare’
/
ae1reference
//
author4
name4
first4
William
last4
Shakespeare
pubs4
book4
chapter4
reference4
chapter5
16
Evaluating Tree Pattern Queries
author
name
/
first.=’William’
/
last.=’Shakespeare’
/
ae1reference
//
author4
name4
first4
William
last4
Shakespeare
pubs4
book4
chapter4
reference4
chapter5
[ae1 = reference4]16
Evaluating Tree Pattern Queries
• Various centralized approaches exist• Navigating document trees• Structural joins
• We leverage these for distributed query evaluation
17
Querying Vertically Distributed XML Collections
• Input• Fragmentation-unaware tree pattern query• Fragmentation schema
• Tasks• Annotate tree pattern nodes with corresponding fragments• Decompose tree pattern into sub-patterns for individual
fragments• Convert sub-patterns to local plans using existing techniques
(each site is free to choose local strategy)• Generate distributed execution plan that specifies how results
are combined
18
Querying Vertically Distributed XML Collections
• Annotate tree pattern nodes
• Decompose tree pattern
• Convert sub-patterns into local plans
• Generate distributed execution plan
author
name
/
first.=’William’
/
last.=’Shakespeare’
/
reference
//
19
Querying Vertically Distributed XML Collections
• Annotate tree pattern nodes
• Decompose tree pattern
• Convert sub-patterns into local plans
• Generate distributed execution plan
author f V1
name f V2
/
first.=’William’ f V2
/
last.=’Shakespeare’ f V2
/
reference f V4
//
19
Querying Vertically Distributed XML Collections
• Annotate tree patternnodes
• Decompose tree pattern
• Convert sub-patternsinto local plans
• Generate distributedexecution plan
author f V1
name f V2
/
first.=’William’ f V2
/
last.=’Shakespeare’ f V2
/
reference f V4
//
20
Querying Vertically Distributed XML Collections
• Annotate tree patternnodes
• Decompose tree pattern
• Convert sub-patternsinto local plans
• Generate distributedexecution plan
author f V1
name f V2
/
first.=’William’ f V2
/
last.=’Shakespeare’ f V2
/
reference f V4
//
author
ap2
P1→2∗
/
ap3
P1→3∗
//
q11(f
V1 )
arp2
RP1→2∗
name
/
first.=’William’
/
last.=’Shakespeare’
/
q21(f
V2 )
arp3
RP1→3∗
ap4
P3→4∗
//
q31(f
V3 )
arp4
RP3→4∗
ae1reference
//
q41(f
V4 )
20
Querying Vertically Distributed XML Collections
• Annotate tree patternnodes
• Decompose tree pattern
• Convert sub-patternsinto local plans
• Generate distributedexecution plan
author f V1
name f V2
/
first.=’William’ f V2
/
last.=’Shakespeare’ f V2
/
reference f V4
//
author
ap2
P1→2∗
/
ap3
P1→3∗
//
q11(f
V1 )
arp2
RP1→2∗
name
/
first.=’William’
/
last.=’Shakespeare’
/
q21(f
V2 )
arp3
RP1→3∗
ap4
P3→4∗
//
q31(f
V3 )
arp4
RP3→4∗
ae1reference
//
q41(f
V4 )
20
Querying Vertically Distributed XML Collections
• Annotate tree patternnodes
• Decompose tree pattern
• Convert sub-patternsinto local plans
• Generate distributedexecution plan
⋊⋉id(ap3)=id(arp3 )
⋊⋉id(ap2)=id(arp2 )
p11(fV1 ) p21(f
V2 )
⋊⋉id(ap4)=id(arp4 )
p31(fV3 ) p41(f
V4 )
author
ap2
P1→2∗
/
ap3
P1→3∗
//
q11(f
V1 )
arp2
RP1→2∗
name
/
first.=’William’
/
last.=’Shakespeare’
/
q21(f
V2 )
arp3
RP1→3∗
ap4
P3→4∗
//
q31(f
V3 )
arp4
RP3→4∗
ae1reference
//
q41(f
V4 )
21
Querying Vertically Distributed XML Collections
• Annotate tree patternnodes
• Decompose tree pattern
• Convert sub-patternsinto local plans
• Generate distributedexecution plan
⋊⋉id(ap3)=id(arp3 )
⋊⋉id(ap2)=id(arp2 )
p11(fV1 ) p21(f
V2 )
⋊⋉id(ap4)=id(arp4 )
p31(fV3 ) p41(f
V4 )
author
ap2
P1→2∗
/
ap3
P1→3∗
//
q11(f
V1 )
arp2
RP1→2∗
name
/
first.=’William’
/
last.=’Shakespeare’
/
q21(f
V2 )
arp3
RP1→3∗
ap4
P3→4∗
//
q31(f
V3 )
arp4
RP3→4∗
ae1reference
//
q41(f
V4 )
21
Querying Vertically Distributed XML Collections
• Annotate tree patternnodes
• Decompose tree pattern
• Convert sub-patternsinto local plans
• Generate distributedexecution plan
⋊⋉id(ap3)=id(arp3 )
⋊⋉id(ap2)=id(arp2 )
p11(fV1 ) p21(f
V2 )
⋊⋉id(ap4)=id(arp4 )
p31(fV3 ) p41(f
V4 )
author
ap2
P1→2∗
/
ap3
P1→3∗
//
q11(f
V1 )
arp2
RP1→2∗
name
/
first.=’William’
/
last.=’Shakespeare’
/
q21(f
V2 )
arp3
RP1→3∗
ap4
P3→4∗
//
q31(f
V3 )
arp4
RP3→4∗
ae1reference
//
q41(f
V4 )
21
Querying Vertically Distributed XML Collections
• Annotate tree patternnodes
• Decompose tree pattern
• Convert sub-patternsinto local plans
• Generate distributedexecution plan
⋊⋉id(ap3)=id(arp3 )
⋊⋉id(ap2)=id(arp2 )
p11(fV1 ) p21(f
V2 )
⋊⋉id(ap4)=id(arp4 )
p31(fV3 ) p41(f
V4 )
author
ap2
P1→2∗
/
ap3
P1→3∗
//
q11(f
V1 )
arp2
RP1→2∗
name
/
first.=’William’
/
last.=’Shakespeare’
/
q21(f
V2 )
arp3
RP1→3∗
ap4
P3→4∗
//
q31(f
V3 )
arp4
RP3→4∗
ae1reference
//
q41(f
V4 )
21
Improving Distributed Execution Plans
• Pruning irrelevant fragments
• Join order
• Push cross-fragment joins into local plans
22
Improving Distributed Execution Plans
• Pruning irrelevant fragments
• Join order
• Push cross-fragment joins into local plans
22
Pushing Cross-Fragment Joins
Large fraction of local results are discarded by cross-fragment join
RP3→419
chapter2
reference2
RP3→420
chapter3
reference3
RP3→421
chapter4
reference4
f V4
23
Pushing Cross-Fragment Joins
Large fraction of local results are discarded by cross-fragment join
RP3→419
chapter2
reference2
RP3→420
chapter3
reference3
RP3→421
chapter4
reference4
f V4
23
Pushing Cross-Fragment Joins
Large fraction of local results are discarded by cross-fragment join
RP3→419
chapter2
reference2
RP3→420
chapter3
reference3
RP3→421
chapter4
reference4
f V4
• Idea: only access relevant sub-trees in fragment
• Avoid computing irrelevant local results
• Use pipelining to push cross-fragment join into local plan
23
A Local Query Plan
πarp2
⋉a1/a3
⋉a1/a2
⋊⋉arp2 /a1
scanarp2 :RP1→2∗
scana1:name
σa2=’William’
scana2:first
σa3=’Shakespeare’
scana3:last
p21(f
V2 )
24
A Local Query Plan
πarp2
⋉a1/a3
⋉a1/a2
⋊⋉arp2 /a1
scanarp2 :RP1→2∗
scana1:name
σa2=’William’
scana2:first
σa3=’Shakespeare’
scana3:last
p21(f
V2 )
• Plan scans root proxynodes in fragment
• Idea: filter these rootproxy nodes beforeevaluating remainderof plan
• Works for navigatingplans and plans basedon structural joins(shown here)
24
A Local Query Plan
πarp2 ,...
⋉a1/a3
⋉a1/a2
⋊⋉arp2 /a1
⋊⋉
. . . scanarp2 :RP1→2∗
scana1:name
σa2=’William’
scana2:first
σa3=’Shakespeare’
scana3:last
p21(f
V2 )
• Plan scans root proxynodes in fragment
• Idea: filter these rootproxy nodes beforeevaluating remainderof plan
• Works for navigatingplans and plans basedon structural joins(shown here)
25
Pushing Cross-Fragment Joins
p4′
1 (fV4 )
⋊⋉id(ap4)=id(arp4 )
p3′
1 (fV3 )
⋊⋉id(ap3)=id(arp3 )
p2′
1 (fV2 )
⋊⋉id(ap2)=id(arp2 )
p11(fV1 ) scanarp2 :RP1→2
∗
scanarp3 :RP1→3∗
scanarp4 :RP3→4∗
26
Pushing Cross-Fragment Joins
p4′
1 (fV4 )
⋊⋉id(ap4)=id(arp4 )
p3′
1 (fV3 )
⋊⋉id(ap3)=id(arp3 )
p2′
1 (fV2 )
⋊⋉id(ap2)=id(arp2 )
p11(fV1 ) scanarp2 :RP1→2
∗
scanarp3 :RP1→3∗
scanarp4 :RP3→4∗
26
Pushing Cross-Fragment Joins
p4′
1 (fV4 )
⋊⋉id(ap4)=id(arp4 )
p3′
1 (fV3 )
⋊⋉id(ap3)=id(arp3 )
p2′
1 (fV2 )
⋊⋉id(ap2)=id(arp2 )
p11(fV1 ) scanarp2 :RP1→2
∗
scanarp3 :RP1→3∗
scanarp4 :RP3→4∗
26
Pushing Cross-Fragment Joins
p4′
1 (fV4 )
⋊⋉id(ap4)=id(arp4 )
p3′
1 (fV3 )
⋊⋉id(ap3)=id(arp3 )
p2′
1 (fV2 )
⋊⋉id(ap2)=id(arp2 )
p11(fV1 ) scanarp2 :RP1→2
∗
scanarp3 :RP1→3∗
scanarp4 :RP3→4∗
26
Pushing Cross Fragment Joins: Implementation
• Can use full pipelining if both inputs to join are ordered
• Alternatively, can use index on root proxy nodes
• Full parallelism after first tuple received by local plan
27
Pushing Cross Fragment Joins
• Avoids accessing large portion of sub-trees within a fragment
• Can only be fully used in left-deep plans
• Decreases flexibility (e.g., where joins are performed)
28
Label Path Filtering
• Cross-fragment join pushing works well but decreases flexibility
• Goal: find a solution that can obtain partial benefit forscenarios where join pushing cannot be applied
• Idea: use selection instead of join to filter out some root proxynodes
29
Label Path Filtering
• Assign to each proxy node the label path from the documentroot
• Filter for label paths that are compatible with the query
author4
name4
first4
William
last4
Shakespeare
pubs4
book4
chapter4
reference4
chapter5
30
Label Path Filtering
• Assign to each proxy node the label path from the documentroot
• Filter for label paths that are compatible with the query
author4
name4
first4
William
last4
Shakespeare
pubs4
book4
chapter4
reference4
chapter5
/author/pubs/book
30
Label Path Filtering
⋊⋉id(arp4 )=id(ap4)
⋊⋉id(arp3 )=id(ap3)
⋊⋉id(arp2 )=id(ap2)
p11(fV1 ) p21 ∗ (f
V2 )
p31(fV3 )
p4′
1 (fV4 )
σlabel(arp4 )=/author/pubs/book
scanarp4 =(RP3→4∗
)
• Assume there are twotypes of publications:book and article
• Can use selection tofilter chapters basedon publication type
31
Label Path Filtering
• Can be used in more cases
• Retains higher degree of flexibility
• Benefit is more limited (does not filter all irrelevant root proxynodes)
32
Determining the Best Distributed Execution Plan
• Join pushing and label path filtering are not alwaysadvantageous
• Determine best execution plan using cost model
33
Outline
1 Fragmenting XML Collections
2 Querying Distributed XML CollectionsQuery ModelDistributed Query EvaluationImproving Performance
3 Performance Evaluation
4 Conclusion
34
Performance Evaluation
• Implemented techniques within Natix
• 12 GB XMark collection (auction data)
• 1 Amazon EC2 instance for each of each of 10 verticalfragments
35
Performance Evaluation
• XPathMark queries (with few filtering value constraints)
• Modified, more selective XPathMark queries
A1 /site/closed auctions/closed auction/annotation/description/text/keyword
A2 //closed auction//keyword
A3 /site/closed auctions/closed auction//keyword
A4 /site/closed auctions/closed auction[annotation/description/text/keyword]/date
A5 /site/closed auctions/closed auction[descendant::keyword]/date
A6 /site/people/person[profile/gender and profile/age]/name
B7 //person[profile/@income]/name
A1S /site/closed auctions/closed auction[price > 600]/annotation/description/text/keyword
A2S //closed auction[price > 600]//keyword
A3S /site/closed auctions/closed auction[price > 600]//keyword
A4S /site/closed auctions/closed auction[price > 600][annotation/description/text/keyword]/date
A5S /site/closed auctions/closed auction[price > 600][descendant::keyword]/date
A6S /site/people/person[starts-with(name, ’Ry’)][profile/gender and profile/age]/name
B7S //person[starts-with(name, ’Ry’)][profile/@income]/name
36
Performance Evaluation: XPathMark
200
400
600
800
1000
1200
1400
A1 A2 A3 A4 A5 A6 B7
Res
pons
e tim
e (s
econ
ds)
Plan
centdist
push
37
Performance Evaluation: Selective XPathMark
0
200
400
600
800
1000
1200
1400
A1 A2 A3 A4 A5 A6 B7
Res
pons
e tim
e (s
econ
ds)
Plan
centdist
push
38
Conclusions
• Distribution can make XML query evaluation more scalable
• Join pushing can significantly improve query performance
• A cost model is essential for finding the optimal technique fora given query
39
References
[1] Patrick Kling, M. Tamer Ozsu, Khuzaima Daudjee: GeneratingEfficient Execution Plans for Vertically Partitioned XMLDatabases, PVLDB 2010.
[2] Patrick Kling, M. Tamer Ozsu, Khuzaima Daudjee: ScalingXML Query Processing: Distribution, Localization and Pruning,DAPD 2011.
40