Automatic Physical Design Tuning: Workload as a Sequence

Sanjay Agrawal, Microsoft Research
Eric Chu, University of Wisconsin-Madison
Vivek Narasayya, Microsoft Research

04/22/23 SIGMOD 2006 2

Automatic Physical Design Tuning
DB applications are becoming more complex and varied.
Considerable time is spent on tuning.
Goal: reduce the cost of ownership of the RDBMS.
Automatically recommend a physical design.
Supported by DB vendors:
  Database Engine Tuning Advisor (Microsoft)
  Design Advisor (IBM)
  SQL Access Advisor (Oracle)

Microsoft Database Engine Tuning Advisor

[Figure: architecture. Applications produce a workload (a set of queries and updates). The Database Engine Tuning Advisor takes the workload as input, uses the "what-if" interface of the extended query optimizer in Microsoft SQL Server 2005, and outputs a recommendation (a set of indexes, materialized views, and horizontal partitions).]

Workload as a Sequence: Motivation
Data warehousing: query by day, update at night.
Set: no index is recommended when update costs outweigh benefits.
Sequence: may exploit the benefits of indexes without incurring update costs.
  Insert "create" and "drop" index statements into the workload.
  Exploit the order of statements.

[Figure: timeline of Queries (day), Updates (night), Queries (day). Indexes are created before each block of queries and dropped before the updates.]

Set vs. Sequence
Set-based:
  Recommendation is robust to changes in the order of statement arrival.
  Can miss good recommendations compared to the sequence-based approach.
Outputs are different:
  Set: which indexes to create or drop?
  Sequence: which indexes to create or drop, and where in the sequence?

[Figure: sequence of Queries, Updates, Queries annotated with Create Indexes, Drop Indexes, Create Indexes.]

Model Workload as a Sequence
Motivation
Problem Definition
Optimal Algorithm
Disjoint Sequences
Greedy-SEQ
Experiments

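Problem Setting
Workload: S = [S1, S2, …, SN], where Si ∈ {Select, Insert, Delete, Update}.
Cost(Si, Ci): cost of executing Si with configuration Ci.
TC(C1, C2): transition cost from configuration C1 to configuration C2.
Sequence execution cost:

$$\sum_{k=1}^{N} \bigl( \mathrm{Cost}(S_k, C_k) + \mathrm{TC}(C_{k-1}, C_k) \bigr) + \mathrm{TC}(C_N, C_{N+1})$$

[Figure: statements S1, S2, S3, …, SN executed in order; Sk runs under configuration Ck, with C0 before S1 and CN+1 after SN.]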

Problem Definition
Given: database D, workload W = [S1, …, SN], initial configuration C0, and storage bound M.
Find configurations C1, C2, …, CN+1 that minimize the sequence execution cost

$$\sum_{k=1}^{N} \bigl( \mathrm{Cost}(S_k, C_k) + \mathrm{TC}(C_{k-1}, C_k) \bigr) + \mathrm{TC}(C_N, C_{N+1})$$

subject to: storage of Ci ≤ M, for all i.
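The objective above can be evaluated directly for any given list of configurations. The following is a minimal sketch, not the authors' implementation: configurations are modelled as frozensets of index names, and `toy_cost` and `toy_tc` are hypothetical stand-ins for the optimizer-estimated Cost and TC functions.

```python
def sequence_cost(statements, configs, cost, tc):
    """Sequence execution cost for statements [S1..SN] under
    configs [C0, C1, ..., CN, CN+1]."""
    n = len(statements)
    if len(configs) != n + 2:
        raise ValueError("expected N + 2 configurations: C0 .. CN+1")
    total = 0.0
    for k in range(1, n + 1):
        # TC(C_{k-1}, C_k) + Cost(S_k, C_k)
        total += tc(configs[k - 1], configs[k]) + cost(statements[k - 1], configs[k])
    # Final transition TC(C_N, C_{N+1})
    return total + tc(configs[n], configs[n + 1])


def within_bound(configs, size, bound):
    """True if every configuration respects the storage bound."""
    return all(sum(size[i] for i in c) <= bound for c in configs)


# Hypothetical cost model, for illustration only.
def toy_cost(stmt, config):
    if stmt == "Q":                      # a query is cheap when an index exists
        return 2.0 if config else 10.0
    return 1.0 + 7.0 * len(config)       # an update pays for every index


def toy_tc(old, new):
    # 3 per index created, 1 per index dropped
    return 3.0 * len(new - old) + 1.0 * len(old - new)
```

With the workload ["Q", "U", "Q"], creating the index before each query and dropping it before the update costs 13 under this toy model, against 21 with no index at all.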

Search Space
Given N statements and M indexes:
Sequence-based tuning
  2^M distinct configurations for each statement.
  2^(M(N+1)) possible execution sequences.
Set-based tuning
  2^M configurations.
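To make the gap concrete, the two counts can be computed for small inputs. This is only a sketch of the arithmetic; the function names are illustrative.

```python
def set_search_space(m):
    """Number of configurations over m candidate indexes."""
    return 2 ** m


def sequence_search_space(n, m):
    """One of 2^m configurations at each of the N + 1 positions C1 .. CN+1."""
    return (2 ** m) ** (n + 1)
```

For example, 3 indexes give 8 configurations in the set model, while a sequence of only 2 statements over the same indexes already has 8^3 = 512 possible execution sequences.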


Optimal Algorithm for Single-Index Case
Node costs: Cost(Si, { }) and Cost(Si, {I}).
Edge costs: 0, Ic (creating index I), and Id (dropping index I).
The cost of the shortest path includes node and edge costs.

[Figure: DAG for single index, N statements. Each statement S1, S2, …, SN has two nodes, { } and {I}, between SOURCE and DESTINATION. Edges cost 0 when the configuration is unchanged, Ic from { } to {I}, and Id from {I} to { }.]
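Because the DAG is staged, its shortest path can be found with a two-state dynamic program. The sketch below assumes the source configuration is empty and that the final configuration CN+1 is free (so the last edge costs 0); both are simplifying assumptions, and the function name and argument layout are illustrative.

```python
def single_index_plan(cost_without, cost_with, create_cost, drop_cost):
    """Shortest path through the single-index DAG.

    cost_without[k] = Cost(S_{k+1}, { }), cost_with[k] = Cost(S_{k+1}, {I}).
    Returns (total cost, [True if I is present for statement k, ...]).
    """
    if len(cost_without) != len(cost_with):
        raise ValueError("need one cost pair per statement")
    # Edge costs between consecutive stages, keyed by (index before, index after).
    edge = {(False, False): 0.0, (False, True): create_cost,
            (True, False): drop_cost, (True, True): 0.0}
    # best[state] = (cheapest cost to reach that state, plan so far)
    best = {False: (0.0, []), True: (float("inf"), [])}
    for without, with_index in zip(cost_without, cost_with):
        node = {False: without, True: with_index}
        nxt = {}
        for state in (False, True):
            prev = min((False, True), key=lambda p: best[p][0] + edge[(p, state)])
            nxt[state] = (best[prev][0] + edge[(prev, state)] + node[state],
                          best[prev][1] + [state])
        best = nxt
    return min(best.values(), key=lambda entry: entry[0])
```

Called as `single_index_plan([10, 1, 10], [2, 8, 2], create_cost=3, drop_cost=1)`, it returns a total of 12 with the plan `[True, False, True]`: build the index for the first query, drop it for the update, and rebuild it for the last query.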

General Case: Multiple Indexes
At each stage, enumerate all possible configurations from the set of indexes.
The algorithm (EXHAUSTIVE) is linear in the number of nodes and edges of the DAG.
However, the number of nodes in the DAG is exponential in the number of indexes: M indexes => O(N·2^M) nodes and O(N·2^(2M)) edges.

[Figure: DAG for EXHAUSTIVE with source C0 and destination CN+1; each stage S1, S2, …, SN contains one node per configuration.]
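A minimal sketch of this exhaustive search is shown below. It is not the authors' code: configurations are frozensets of index names, `size` and `bound` model the storage constraint, the final configuration is assumed free, and `toy_cost` and `toy_tc` are hypothetical stand-ins for optimizer estimates.

```python
from itertools import combinations


def all_configs(indexes, size, bound):
    """Every subset of the candidate indexes that fits in the storage bound."""
    return [frozenset(combo)
            for r in range(len(indexes) + 1)
            for combo in combinations(indexes, r)
            if sum(size[i] for i in combo) <= bound]


def shortest_path(statements, stage_configs, c0, cost, tc):
    """Cheapest path through the staged DAG; returns (total cost, plan)."""
    best = {c0: (0.0, [])}               # config -> (cost so far, plan so far)
    for stmt, configs in zip(statements, stage_configs):
        nxt = {}
        for c in configs:
            prev = min(best, key=lambda p: best[p][0] + tc(p, c))
            nxt[c] = (best[prev][0] + tc(prev, c) + cost(stmt, c),
                      best[prev][1] + [c])
        best = nxt
    return min(best.values(), key=lambda entry: entry[0])


def exhaustive(statements, indexes, size, bound, c0, cost, tc):
    """Use every feasible configuration as a node at every stage."""
    configs = all_configs(indexes, size, bound)
    return shortest_path(statements, [configs] * len(statements), c0, cost, tc)


# Hypothetical cost model: a statement is (kind, index it benefits from).
def toy_cost(stmt, config):
    kind, idx = stmt
    if kind == "Q":
        return 2.0 if idx in config else 10.0
    return 1.0 + 7.0 * len(config)       # updates pay for every index present


def toy_tc(old, new):
    return 3.0 * len(new - old) + 1.0 * len(old - new)
```

The inner `min` scans every configuration of the previous stage for each configuration of the current one, which is where the quadratic-in-2^M edge count per stage shows up in practice.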

Optimal Solution
Input: sequence, constraints
  Candidate set of structures
  Solve sequence using EXHAUSTIVE
Output: recommendation

Search-Space Pruning
Techniques to reduce the number of nodes:
Cost-based Pruning
  Leverages shortest-path solutions of individual indexes.
  Prunes configurations at each stage without loss of optimality.
Disjoint Sequences
  Divide-and-conquer approach.
  Splits the input sequence and candidate index set.
Greedy-SEQ
  Guarantees a polynomial number of nodes.


Exploiting Disjoint Sequences
Two sequences X and Y are disjoint if they share neither statements nor indexes.
Disjoint sequences are common, e.g., when a server hosts multiple applications that touch different databases.
Approach:
  Split the workload into disjoint sequences.
  Solve each sequence independently.
  Merge to get the final solution.
Idea: the DAG for each disjoint sequence has fewer nodes.
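One way to sketch the split step is with a union-find over index names: statements whose relevant indexes overlap, directly or transitively, land in the same sequence. The mapping `relevant` from each statement to its candidate indexes is an assumption of this sketch; the slides do not say how that mapping is obtained.

```python
def split(workload, relevant):
    """Split an ordered workload into disjoint sequences.

    workload: statement ids in execution order.
    relevant: dict mapping a statement id to the set of index names it can use.
    Returns a list of (statements in original order, indexes) pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    # Indexes relevant to the same statement must end up in the same group.
    for stmt in workload:
        idxs = sorted(relevant.get(stmt, ()))
        for other in idxs[1:]:
            parent[find(idxs[0])] = find(other)

    groups = {}
    for stmt in workload:
        idxs = sorted(relevant.get(stmt, ()))
        # A statement with no relevant index forms its own sequence.
        key = find(idxs[0]) if idxs else ("stmt", stmt)
        stmts, used = groups.setdefault(key, ([], set()))
        stmts.append(stmt)
        used.update(idxs)
    return list(groups.values())
```

Each returned pair can then be handed to the solver on its own, with only its own indexes as candidates.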

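Efficiency Gain with Disjoint Sequences

[Figure: workload W = [S1, S2, S3, S4, S5, S6, S7] with candidate indexes {I1, I2, I3} needs 8 nodes at each stage. Splitting gives W1 = [S1, S3, S4] with {I1}, W2 = [S2, S5, S6] with {I2}, and W3 = [S7] with {I3}, which need 2 nodes at each stage for each sequence.]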

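Merge solutions of W1, W2, and W3: No storage violations

Solutions of the disjoint sequences (SRC and DEST are the source and destination nodes):
  W1 = [S1, S3, S4]: SRC { }, S1 {I1}, S3 {I1}, S4 { }, DEST { }  (create I1 before S1, drop I1 before S4)
  W2 = [S2, S5, S6]: SRC { }, S2 {I2}, S5 {I2}, S6 { }, DEST { }  (create I2 before S2, drop I2 before S6)
  W3 = [S7]: SRC { }, S7 {I3}, DEST {I3}  (create I3 before S7)

Merged solution Pu:
  SRC { }, S1 {I1}, S2 {I1,I2}, S3 {I1,I2}, S4 {I2}, S5 {I2}, S6 { }, S7 {I3}, DEST {I3}

Pu is optimal when there are no storage violations.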
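The merged solution in this example can be reproduced by taking, at every statement of the original workload, the union of each sub-solution's current configuration. The sketch below assumes that a sub-solution keeps the configuration of its most recent statement until its next statement arrives; that carry-forward rule is an assumption chosen to match the example, and the function names are illustrative.

```python
def merge_union(order, solutions, initial=frozenset()):
    """Stage-wise union of per-sequence solutions.

    order: statements of the full workload, in execution order.
    solutions: one dict per disjoint sequence, statement -> configuration.
    Returns the merged configuration for each statement in `order`.
    """
    current = [initial for _ in solutions]   # latest configuration per sequence
    merged = []
    for stmt in order:
        for j, solution in enumerate(solutions):
            if stmt in solution:
                current[j] = solution[stmt]
        merged.append(frozenset().union(*current))
    return merged


def storage_violations(merged, size, bound):
    """Positions in the merged solution whose configuration exceeds the bound."""
    return [k for k, config in enumerate(merged)
            if sum(size[i] for i in config) > bound]
```

When `storage_violations` returns an empty list the union can be used as is; otherwise some stages hold more indexes than the bound allows and the merge has to do further work.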

Merge in the presence of storage violation
Suppose the storage bound allows only 1 index.
Pu is not a valid solution because it has configurations with storage violations ({I1,I2} at S2 and S3):
  SRC { }, S1 {I1}, S2 {I1,I2}, S3 {I1,I2}, S4 {I2}, S5 {I2}, S6 { }, S7 {I3}, DEST {I3}

Pu' = merge of P1, P2, and P3 that gives a valid solution:
  SRC { }, S1 {I1}, S2 {I1}, S3 {I2}, S4 {I2}, S5 {I2}, S6 { }, S7 {I3}, DEST {I3}

Note that the cost of Pu is a lower bound on the cost of any valid solution.

Solution with Split and Merge
Input: sequence, constraints
  Apply Split operator to get disjoint sequences
  Solve each sequence independently using EXHAUSTIVE or GREEDY-SEQ
  Merge results of disjoint sequences
Output: recommendation


Greedy Approach
Goal:
  Explore a polynomial number of good configurations.
  Run shortest path over the DAG constructed with these configurations.
  Obtain a solution close to optimal.
Greedy-SEQ: adaptation of an existing greedy technique to the sequence model.

Greedy-SEQ
Steps of Greedy-SEQ:

1. Get the optimal solution for each index. Record configurations.

2. Initialize current best to be the lowest-cost solution seen so far.

3. Improve current best by combining it with other solutions and resetting current best. Record new configurations of current best. Repeat until no more improvement.

4. Run shortest-path over the configurations collected.
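The four steps can be sketched as follows. This is one possible reading, not the authors' algorithm: in particular, "combining" two solutions is assumed here to mean running shortest path over a DAG whose nodes at each stage are the two solutions' configurations and their union. `fits` is a hypothetical storage check that is assumed to accept the empty configuration, and `cost` and `tc` are caller-supplied estimates.

```python
def shortest_path(statements, stage_configs, c0, cost, tc):
    """Cheapest path through a staged DAG; returns (total cost, plan)."""
    best = {c0: (0.0, [])}
    for stmt, configs in zip(statements, stage_configs):
        nxt = {}
        for c in configs:
            prev = min(best, key=lambda p: best[p][0] + tc(p, c))
            nxt[c] = (best[prev][0] + tc(prev, c) + cost(stmt, c),
                      best[prev][1] + [c])
        best = nxt
    return min(best.values(), key=lambda entry: entry[0])


def unique(configs, fits):
    """Deduplicate in order, keeping only configurations within the bound."""
    return [c for c in dict.fromkeys(configs) if fits(c)]


def greedy_seq(statements, indexes, c0, cost, tc, fits):
    n = len(statements)
    empty = frozenset()
    # Step 1: optimal solution for each index on its own.
    singles = [shortest_path(statements,
                             [unique([empty, frozenset([i])], fits)] * n,
                             c0, cost, tc)
               for i in indexes]
    if not singles:
        return shortest_path(statements, [[empty]] * n, c0, cost, tc)
    recorded = [[] for _ in range(n)]        # configurations seen per stage

    def record(plan):
        for k, config in enumerate(plan):
            if config not in recorded[k]:
                recorded[k].append(config)

    for _, plan in singles:
        record(plan)
    # Step 2: start from the cheapest single-index solution.
    best_cost, best_plan = min(singles, key=lambda entry: entry[0])
    # Step 3: combine with other solutions until nothing improves.
    while True:
        candidates = []
        for _, other in singles:
            stages = [unique([a, b, a | b], fits)
                      for a, b in zip(best_plan, other)]
            candidates.append(shortest_path(statements, stages, c0, cost, tc))
        cand_cost, cand_plan = min(candidates, key=lambda entry: entry[0])
        if cand_cost >= best_cost:
            break
        best_cost, best_plan = cand_cost, cand_plan
        record(best_plan)
    # Step 4: shortest path over everything recorded.
    return shortest_path(statements, recorded, c0, cost, tc)
```

Each stage of the final DAG holds at most one configuration per single-index solution plus one per improvement round, which is how this sketch keeps the node count small compared with enumerating all subsets.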

Combining Two Single-Index Solutions

[Figure: the solution for I1 and the solution for I2 are each drawn as a row of per-stage configurations over S0, S1, S2, …, SK, …, SL, …, SN, SN+1. The combined row (I1, I2) takes its per-stage configurations from { }, {I1}, {I2}, and {I1,I2}.]



End-to-End Solution
Input: sequence, constraints
  Apply split operator to get disjoint sequences
  Apply cost-based pruning on each sequence
  Solve each sequence independently using EXHAUSTIVE or GREEDY-SEQ
  Merge results of disjoint sequences
Output: recommendation


Sequence vs. Set-based Approaches
Numbers are the % improvement relative to the optimal set-based solution.
Sequence is better when the workload contains updates and/or the storage bound is low.

Workload            M = 1.2 GB    M = 3 GB
TPCH-22             19%           0%
TPCH-22-I-10-MID    22%           16%
TPCH-22-I-10-END    25%           28%

Greedy-SEQ vs. Exhaustive
Greedy-SEQ is much faster with minimal degradation in quality.

Workload      % reduction in running time                 % reduction in quality
TPCH-3        50%                                         <1%
TPCH-5-M-5    98.4%                                       2.3%
TPCH-22       Exhaustive was terminated after 24 hours    Not available

Effectiveness of Split and Merge
With split and merge (SPMR) vs. without (WO-SPMR)

Workload     % reduction in running time    % reduction in quality
             compared to WO-SPMR            compared to WO-SPMR
TPCH-22      <0.1%                          0%
WKLD1        89.9%                          0%
WKLD1-LOW    71.4%                          3.0%

Conclusion
The sequence model allows more optimization opportunities than the set model.
Model the problem as finding the shortest path over a DAG.
Heuristics give nearly optimal solutions with much better performance.
