An Integer Linear Programming Approach to Database Design

Stratos Papadomanolakis, Anastassia Ailamaki

Computer Science Department

Carnegie Mellon University

5000 Forbes Avenue, Pittsburgh, PA 15213

{stratos,natassa}@cs.cmu.edu

Abstract

Existing index selection tools rely on heuristics to efficiently search the large space of alternative solutions and to minimize the overhead of using the query optimizer for cost estimation. Index selection heuristics, despite being practical, are hard to analyze, and it is hard to formally compute how close they get to the optimal solution. In this paper we propose a model for index selection based on Integer Linear Programming (ILP). The ILP formulation enables a wealth of combinatorial optimization techniques for providing quality guarantees, approximate solutions and even for computing optimal solutions. We present a system architecture for ILP-based index selection, in the context of commercial database systems. Our ILP-based approach offers higher solution quality, efficiency and scalability without sacrificing any of the precision offered by existing index selection tools.

1. Introduction

Automating database physical design is a major challenge in building self-tuning database systems. Previously proposed tools for index selection rely on heuristic algorithms, such as greedy search [6, 2, 8], augmented with techniques to reduce the number of candidate indexes they consider and the number of calls to the query optimizer.

A problem with existing techniques is that there is no way to estimate how good the solution they provide is with respect to the optimal. More recent approaches [4] compute a lower bound for a workload’s cost by considering the optimal indexes for each query individually and disregarding any storage constraints. Although this approach is helpful, the derived bounds are not realistic for storage-constrained scenarios.

In this paper we present a new framework based on an Integer Linear Program (ILP) formulation for the index selection problem. The ILP formulation allows the application of standard linear optimization techniques to index selection, which remove the shortcomings of existing heuristic techniques. Specifically, through the application of Linear Programming (LP) relaxation, we are able to obtain a tight bound on the quality of the optimal solution for a given problem instance, which we can use to characterize the quality of any solution. The LP relaxation provides useful information about the problem, allowing us for example to find approximate solutions that have optimal performance but exceed the amount of available storage.

To further improve the quality of an approximate solution, we apply a branch-and-bound technique for solving ILP problems. Branch-and-bound is advantageous in terms of quality: If run to completion, branch-and-bound will return the optimal solution, while if interrupted before completion, it will return a sub-optimal solution, along with a bound for its distance from the optimal. Finally, the ILP formulation can be tuned so that the solution algorithm has comparable performance to existing index selection tools while examining a much larger number of alternative solutions.

The contributions of this paper are the following. First, we present an ILP formulation for the index selection problem. Second, we describe a system architecture for efficiently solving large ILP problem instances. Third, we report preliminary results that demonstrate that our approach has good performance and outperforms existing index selection tools.

2. An ILP Framework For Index Selection

2.1. Formulation

Consider a workload consisting of $m$ queries and a set of $n$ indexes $I_1, \ldots, I_n$, with sizes $s_1, \ldots, s_n$. We want our model to account for the fact that a query has different costs depending on the combination of indexes it uses. A configuration is a subset $C_k = \{I_{k1}, I_{k2}, \ldots\}$ of indexes with the property that all of the indexes in $C_k$ are used by some query. This definition is equivalent to the atomic configuration definition in [6].

1-4244-0832-6/07/$20.00 ©2007 IEEE.

Let $P$ be the set of all the configurations that can be constructed using the indexes in $I$ and that can potentially be useful for a query. For example, if a query accesses tables $T_1$, $T_2$ and $T_3$ then $P$ contains all the elements in the set (indexes in $I$ on $T_1$) × (indexes in $I$ on $T_2$) × (indexes in $I$ on $T_3$).
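The cross-product construction above can be sketched in a few lines of Python. This is only an illustration: the table names, the index names and the helper `configurations_for_query` are hypothetical and do not come from our implementation.

```python
from itertools import product

# Hypothetical candidate indexes, grouped by the table they are built on.
indexes_by_table = {
    "T1": ["I1", "I2"],
    "T2": ["I3"],
    "T3": ["I4", "I5"],
}

def configurations_for_query(tables, indexes_by_table):
    """Return one configuration per way of picking a single index on each table."""
    per_table = [indexes_by_table[t] for t in tables]
    return [frozenset(choice) for choice in product(*per_table)]

configs = configurations_for_query(["T1", "T2", "T3"], indexes_by_table)
print(len(configs))  # 2 * 1 * 2 = 4 configurations
```

The number of configurations is the product of the per-table candidate counts, which is why it grows quickly with the size of the candidate set.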

The cost of a query $i$ when accessing a configuration $C_k$ is $c(i, C_k)$ and $c(i, \{\})$ denotes the cost of the query on an unindexed database. We define the benefit of a configuration $C_k$ for query $i$ by $b_{ik} = \max(0, c(i, \{\}) - c(i, C_k))$.

Let $y_j$ be a binary decision variable that is 1 if the index $I_j$ is actually implemented and 0 otherwise. In addition, let $x_{ik}$ be a binary decision variable that is equal to 1 if query $i$ uses configuration $C_k$ and 0 otherwise.

Using $x_{ik}$ and $b_{ik}$, the benefit for the workload $Z$ is

\[ Z = \sum_{i=1}^{m} \sum_{k=1}^{p} b_{ik} \times x_{ik} \tag{1} \]

where $p = |P|$. The values of $x_{ik}$ depend on the values for $y_j$: We cannot have a query using $C_k$ if a member of $C_k$ is not implemented. Also, we require that a query uses at most one configuration at a time. For instance, a query cannot be simultaneously using both $C_1 = \{I_1, I_2, I_3\}$ and $C_2 = \{I_1, I_2\}$. Finally, we require that the set of selected indexes consumes no more than $S$ units of storage. Thus the formal specification of the index selection problem is as follows.

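\[ \text{maximize} \quad Z = \sum_{i=1}^{m} \sum_{k=1}^{p} b_{ik} \times x_{ik} \tag{2} \]

subject to

\[ \sum_{k=1}^{p} x_{ik} \le 1 \quad \forall i \tag{3} \]

\[ x_{ik} \le y_j \quad \forall i, \forall j, k : I_j \in C_k \tag{4} \]

\[ \sum_{j=1}^{n} s_j \times y_j \le S \tag{5} \]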

Constraints (3) guarantee that a query uses at most one configuration. Constraints (4) ensure that we cannot use a configuration $k$ unless all the indexes in it are built and constraint (5) expresses the available storage. We can also use constraints to restrict the usage of clustered indexes, but we omit the details due to lack of space.

Figure 1 shows an example with 2 queries and 4 indexes, listing all the relevant configurations for each query. Assume only indexes $I_1$ and $I_2$ are relevant to $Q_1$, whose cost varies depending on whether it uses a single-index configuration ($c_1$ or $c_2$) or a pair ($c_3$). The same holds for $Q_2$ and indexes $I_3$ and $I_4$. Assume we want to optimize workload benefit given total storage capacity of $S = 200$ units.

Figure 1. Index selection example.

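The equivalent ILP instance is as follows:

\[ \text{maximize} \quad Z = b_{11} \times x_{11} + b_{12} \times x_{12} + b_{13} \times x_{13} + b_{24} \times x_{24} + b_{25} \times x_{25} + b_{26} \times x_{26} \tag{6} \]

subject to

\[ \sum_{k=1}^{3} x_{1k} \le 1, \qquad \sum_{k=4}^{6} x_{2k} \le 1, \]

\[ x_{11} \le y_1, \quad x_{12} \le y_2, \quad x_{13} \le y_1, \quad x_{13} \le y_2, \]

\[ x_{24} \le y_3, \quad x_{25} \le y_4, \quad x_{26} \le y_3, \quad x_{26} \le y_4, \]

\[ \sum_{j=1}^{4} s_j \times y_j \le 200 \]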

By inspection we determine the optimal solution

\[ y_1 = 1, \quad y_2 = 1, \quad y_3 = 0, \quad y_4 = 0, \]

\[ x_{11} = 0, \quad x_{12} = 0, \quad x_{13} = 1, \quad x_{24} = 0, \quad x_{25} = 0, \quad x_{26} = 0 \]

The set of indexes $I_1$ and $I_2$ is preferable because their combination has a large benefit for $Q_1$ and outperforms any other alternative. Notice that the commonly used greedy search would fail to identify the optimal solution. In the first iteration it would pick index $I_3$ and in the second $I_4$.
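The behavior described above can be reproduced on a toy instance. The sketch below is a minimal illustration under stated assumptions: the benefit values and index sizes of Figure 1 are not given in the text, so the numbers here are hypothetical and chosen only so that greedy search behaves as described. The $x_{ik}$ variables are implicit: for a given set of built indexes, each query simply uses its best usable configuration, which satisfies constraints (3) and (4).

```python
from itertools import product

# Hypothetical sizes s_j and benefits b_ik (not the values of Figure 1).
sizes = {"I1": 100, "I2": 100, "I3": 100, "I4": 100}
S = 200
benefits = {
    "Q1": {frozenset({"I1"}): 10, frozenset({"I2"}): 10, frozenset({"I1", "I2"}): 100},
    "Q2": {frozenset({"I3"}): 30, frozenset({"I4"}): 30, frozenset({"I3", "I4"}): 45},
}

def storage(built):
    return sum(sizes[j] for j in built)

def workload_benefit(built):
    """Z for a set of built indexes: each query uses its best usable configuration."""
    return sum(max((b for c, b in cfgs.items() if c <= built), default=0)
               for cfgs in benefits.values())

def brute_force():
    """Enumerate every y assignment and keep the best one that fits in S."""
    names = sorted(sizes)
    best = (0, frozenset())
    for bits in product([0, 1], repeat=len(names)):
        built = frozenset(n for n, bit in zip(names, bits) if bit)
        if storage(built) <= S and workload_benefit(built) > best[0]:
            best = (workload_benefit(built), built)
    return best

def greedy():
    """Repeatedly add the single index that increases the workload benefit most."""
    built = frozenset()
    while True:
        options = [built | {j} for j in sorted(sizes)
                   if j not in built and storage(built | {j}) <= S]
        if not options:
            return workload_benefit(built), built
        nxt = max(options, key=workload_benefit)
        if workload_benefit(nxt) <= workload_benefit(built):
            return workload_benefit(built), built
        built = nxt

print(brute_force())  # benefit 100 with indexes I1 and I2
print(greedy())       # benefit 45 with indexes I3 and I4
```

With these numbers greedy search commits to $I_3$ first because no single index on $Q_1$ looks attractive in isolation, and it never discovers the benefit of the pair $\{I_1, I_2\}$.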

The exact solution provided by the ILP formulation is optimal for the given initial selection of indexes. If we were to include all the possible indexes that are relevant to the given workload, it would give us the globally optimal solution. Considering the set of all the possible indexes is prohibitively expensive and thus a candidate selection module is necessary. The ILP approach is flexible in that we can use it with an arbitrary candidate index set.


2.2. Handling Updates

The ILP formulation of Section 2.1 can be extended to handle updates in the workload (SQL INSERT, UPDATE or DELETE statements). We model an update statement as a sequence of two sub-statements, “select” and “modify”. The “select” part is just another query selecting the set of rows to be modified or deleted and thus is handled by the formulation of Section 2.1 (INSERT statements do not have a selection part and thus get a zero benefit value for all configurations). The “modify” part is a statement that simply updates the set of rows returned by the “select” part. It behaves differently, because an index configuration $C_k$ can have a negative benefit for the modify part, due to the additional cost of updating the relevant indexes in $C_k$. Specifically, the benefit $b^U_{lk}$ of configuration $C_k$ for the update sub-statement $U_l$ is

\[ b^U_{lk} = \mathit{cost}_{update}(l, \{\}) - \mathit{cost}_{update}(l, C_k) \tag{7} \]

$\mathit{cost}_{update}(l, \{\})$ is zero, while $\mathit{cost}_{update}(l, C_k)$ is equal to the sum of the costs for updating all the indexes in $C_k$:

\[ b^U_{lk} = - \sum_{I_j \in C_k} \mathit{cost}_{update}(l, I_j) \tag{8} \]

Generally, for every index $I_j$ we can associate a (negative) benefit value $-f_j$, which is computed by summing up all the $-\mathit{cost}_{update}(l, I_j)$ values over all the update statements $U_l$ and corresponds to the update overhead introduced by that index. To model updates, we only need to modify the objective function of Section 2.1 (Equation 2) to take into account the negative benefit values $-f_j$.

\[ \text{maximize} \quad Z = \sum_{i=1}^{m+m_1} \sum_{k=1}^{p} b_{ik} \times x_{ik} - \sum_{j=1}^{n} f_j \times y_j \tag{9} \]

Equation 9 describes the workload benefit in the presence of $m$ queries and $m_1$ update statements. The second term simply states that if index $I_j$ is constructed as part of the solution, it will cost $f_j$ units of benefit to maintain it in the presence of the $m_1$ update statements.
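As a small illustration of how the $f_j$ terms enter Equation 9, the following sketch aggregates per-statement maintenance costs into one overhead value per index. The cost numbers and the helper names `maintenance_overhead` and `objective` are hypothetical and serve only to make the arithmetic concrete.

```python
# Hypothetical maintenance costs cost_update(l, I_j), keyed by update statement U_l.
cost_update = {
    "U1": {"I1": 5.0, "I2": 0.0},
    "U2": {"I1": 2.0, "I2": 4.0},
}

def maintenance_overhead(cost_update):
    """f_j: sum of cost_update(l, I_j) over all update statements U_l."""
    f = {}
    for per_index in cost_update.values():
        for index, cost in per_index.items():
            f[index] = f.get(index, 0.0) + cost
    return f

def objective(configuration_benefit, built, f):
    """Equation (9): benefit of the chosen configurations minus f_j for each built index."""
    return configuration_benefit - sum(f[j] for j in built)

f = maintenance_overhead(cost_update)
print(f)                             # {'I1': 7.0, 'I2': 4.0}
print(objective(20.0, {"I1"}, f))    # 20.0 - 7.0 = 13.0
```

Because $f_j$ multiplies $y_j$ rather than $x_{ik}$, an index pays its maintenance cost once, no matter how many queries use it.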

2.3. Using the ILP Formulation

The ILP model enables the application of new solution techniques to the index selection problem. First, it allows us to compute a tight upper bound $Z^*$ on the maximum benefit (Equation 2) achievable for a specific problem instance. The upper bound indicates how far any solution is from the optimal without actually having the optimal solution, thus allowing us to provide quality guarantees. The upper bound derived from an ILP formulation is more accurate compared to previous techniques that completely eliminate the storage constraint, thus significantly overestimating the maximum benefit achievable.

Second, the ILP formulation allows for the efficient computation of an initial approximate solution. Although this solution is not necessarily optimal, the tool characterizes its benefit by comparing it to the $Z^*$ bound and presents the system administrator with the choice of keeping it (if the benefit is acceptable) or investing more time in improving it. A system can also utilize the ILP formulation to compute an initial solution that is optimal in terms of benefit, but violates storage constraints. The system administrator is then presented with the option of deploying the additional storage and getting the performance, or looking for a better solution. This option is particularly attractive in cases where the focus is more on exploring the design space than on reaching strictly optimal solutions.

Finally, it is possible to improve the benefit offered by the initial solution through a branch-and-bound search algorithm that can terminate when a solution of acceptable benefit is reached, or execute until the optimal solution is found.

2.4. Linear Program Relaxation

The Linear Program (LP) relaxation of an ILP has the same objective function and constraints as the original ILP, but its solution is no longer required to be integer. For the ILP formulation of the previous section, the LP relaxation accepts fractional values for the $y_j$, $x_{ik}$ variables, as long as $0 \le y_j, x_{ik} \le 1$.

Without the requirement for integer solutions, linear optimization problems can be solved efficiently by polynomial algorithms. In the context of index selection, the LP relaxation quickly provides useful information about the optimal solution of the ILP and its storage requirements.

Let $Z^*$ be the optimal benefit value (Equation (2)) for the LP relaxation and $Z_{opt}$ be the optimal benefit value for the original ILP problem. It can be shown that

\[ Z_{opt} \le Z^* \tag{10} \]

Thus $Z^*$, the optimal value of Equation (2) under the LP relaxation, is an upper bound for the benefit of the optimal ILP solution. For a general ILP problem, it is not possible to determine how close the solution of the LP relaxation ($Z^*$) is to the solution of the original ILP ($Z_{opt}$).
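To make the gap in Equation (10) concrete, consider a deliberately tiny instance. The numbers are illustrative assumptions and are not taken from our experiments: one query, one configuration $C_1 = \{I_1\}$ with benefit $b_{11} = 10$, index size $s_1 = 2$ and storage $S = 1$.

```latex
\begin{align*}
\text{ILP:}\quad
  & 2 y_1 \le 1,\ y_1 \in \{0, 1\}
  \;\Rightarrow\; y_1 = 0,\ x_{11} = 0,\ Z_{opt} = 0 \\
\text{LP relaxation:}\quad
  & 2 y_1 \le 1,\ 0 \le x_{11} \le y_1 \le 1
  \;\Rightarrow\; y_1 = x_{11} = \tfrac{1}{2},\ Z^* = 10 \times \tfrac{1}{2} = 5
\end{align*}
```

Here $Z_{opt} = 0 \le 5 = Z^*$, and scaling $b_{11}$ makes the difference as large as desired, so the tightness of the bound has to be established for the problem class at hand.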

Certain classes of ILP problems, however, are “well-behaved” in that their LP relaxation provides a very good indicator for the objective function value for the optimal ILP solution. We are currently actively working on theoretically and experimentally characterizing the relationship between the ILP and LP relaxation solutions for index selection. Our preliminary results, presented in Section 5, suggest that the best benefit value computed through the LP relaxation is actually very close to that of the optimal ILP solution (within 15%). In any case, the LP relaxation provides a much tighter bound for the optimal solution compared to previous approaches that compute bounds by eliminating the storage constraint (as Section 5 verifies).

The LP relaxation can also be used to derive an “initial” solution that has a benefit of at least $Z^*$, but violates the storage constraint, requiring $S' \ge S$ storage. Consider the solution $x^*_{ik}$, $y^*_j$ of the LP relaxation. We derive an integer solution by first rounding the values $y^*_j$ so that every non-zero variable $y^*_j$ is set to 1. Given the new $y^*_j$ values, new $x^*_{ik}$ can be selected so that each query has a maximum benefit under the constraints (3)-(4) of Section 2.1.
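The rounding step can be sketched as follows. This is a minimal illustration, not our implementation: the function name `round_lp_solution`, the tolerance `eps` and all input values are assumptions, and the fractional values are made up rather than produced by an LP solver.

```python
def round_lp_solution(y_frac, sizes, benefits, eps=1e-9):
    """Set every non-zero y_j to 1, then give each query its best usable configuration."""
    built = frozenset(j for j, value in y_frac.items() if value > eps)
    choice = {}
    total = 0.0
    for query, configs in benefits.items():
        # Constraint (4): a configuration is usable only if all its indexes are built.
        usable = {c: b for c, b in configs.items() if c <= built}
        if usable:
            best = max(usable, key=usable.get)  # constraint (3): one configuration
            choice[query] = best
            total += usable[best]
    required_storage = sum(sizes[j] for j in built)  # S', possibly larger than S
    return built, choice, total, required_storage

# Hypothetical fractional solution and problem data.
y_frac = {"I1": 1.0, "I2": 0.76, "I3": 0.0}
sizes = {"I1": 100, "I2": 100, "I3": 100}
benefits = {
    "Q1": {frozenset({"I1"}): 10, frozenset({"I1", "I2"}): 50},
    "Q2": {frozenset({"I3"}): 20},
}
built, choice, total, required_storage = round_lp_solution(y_frac, sizes, benefits)
print(sorted(built), total, required_storage)  # ['I1', 'I2'] 50.0 200
```

The returned storage requirement is what the system administrator would compare against $S$ before deciding whether the rounded solution is worth the extra space.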

If the rounded solution requires only a small increase in storage, it might still be of use to the system administrator. Rounding works best if most of the fractional $y^*_j$ variables are either 0 or close to 1, in which case the increase in storage by setting them to 1 is small. Our experiments suggest that this is the case in practice: An intuitive explanation is that $y^*_j$ variables that are non-zero but close to zero don’t offer much in terms of benefit (due to the constraints (4) of Section 2.1) and therefore will not often occur in solutions.

2.5. Branch and Bound

Branch-and-bound algorithms are commonly used in the solution of ILPs. Although they make use of the LP relaxation method, they are guaranteed to derive the optimal integer solution, if run to completion. In addition, the branch-and-bound algorithm can be interrupted before completion, in which case a sub-optimal solution is returned along with a bound on its distance from the optimal.

The general branch-and-bound approach for ILP is to choose a decision variable and fix it to be either 0 or 1, thus generating two subproblems with one fewer decision variable (branching). The LP relaxation technique is applied to the subproblems, computing bounds for their benefit values. The branching is repeated for the subproblems and can be performed in a depth-first or breadth-first order. Once a first integer solution is reached (where all the variables have been fixed) or if an initial valid solution already exists (for example through manipulating a rounded LP relaxation result) its value $Z$ can be used to prune those subproblems that have an upper bound $Z^* \le Z$, as no integer assignment resulting from the subproblem will be better than the existing integer solution. The algorithm terminates when it runs out of time or all the variables have been assigned and there are no new subproblems.
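The branching and pruning steps above can be sketched in a depth-first form. Note one deliberate substitution: to keep the sketch free of solver dependencies, the upper bound of a subproblem is not the LP relaxation but a weaker bound that treats every undecided index as built at no storage cost. The problem data and the function name are hypothetical.

```python
def branch_and_bound(sizes, benefits, S):
    """Depth-first branch-and-bound over the y_j variables."""
    names = sorted(sizes)

    def benefit(built):
        # Each query uses its best usable configuration (constraints (3)-(4)).
        return sum(max((b for c, b in cfgs.items() if c <= built), default=0)
                   for cfgs in benefits.values())

    best = {"z": 0, "built": frozenset()}  # incumbent: the empty design

    def visit(pos, built, used):
        # Optimistic bound (assumption: replaces the LP relaxation in this sketch).
        bound = benefit(built | frozenset(names[pos:]))
        if bound <= best["z"]:
            return  # prune: the subproblem cannot beat the incumbent
        if pos == len(names):
            best["z"], best["built"] = bound, built  # all variables fixed
            return
        j = names[pos]
        if used + sizes[j] <= S:
            visit(pos + 1, built | {j}, used + sizes[j])  # branch y_j = 1
        visit(pos + 1, built, used)                       # branch y_j = 0

    visit(0, frozenset(), 0)
    return best["z"], best["built"]

# Hypothetical instance with two queries and four candidate indexes.
sizes = {"I1": 100, "I2": 100, "I3": 100, "I4": 100}
benefits = {
    "Q1": {frozenset({"I1"}): 10, frozenset({"I2"}): 10, frozenset({"I1", "I2"}): 100},
    "Q2": {frozenset({"I3"}): 30, frozenset({"I4"}): 30, frozenset({"I3", "I4"}): 45},
}
result = branch_and_bound(sizes, benefits, 200)
print(result[0], sorted(result[1]))  # 100 ['I1', 'I2']
```

On this instance the first leaf reached already has benefit 100, and both remaining subtrees have a bound of 55, so they are pruned without being explored. A tighter bound such as the LP relaxation prunes more aggressively on larger instances.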

Combining the ILP model with a branch-and-bound algorithm has several advantages over the existing heuristic search framework. Quality-wise, branch-and-bound avoids the problem of missing “interactions” between indexes, as long as the interactions are captured in the ILP model. Furthermore, branch-and-bound does not have the problem of being “trapped” in locally optimal solutions. If run to completion, it is guaranteed to find the optimal solution.

Figure 2. Architecture for an ILP-based index selection algorithm.

Performance-wise, running to completion and determining the optimal solution might not be feasible for large problems. We propose using the branch-and-bound framework only for a few iterations and stopping the search once a solution with an acceptable distance from the optimal has been reached. The bounds computed during the branch-and-bound process can directly be used to determine when a solution of acceptable quality has been reached and terminate the algorithm as early as possible.

Finally, it is possible to “customize” branch-and-bound search by incorporating all the search heuristics developed in the literature. For instance, a greedy search algorithm can be used to derive an initial integer solution that can be used to prune subproblems and improve algorithm performance. In a sense, the branch-and-bound approach can provide at least the quality and performance of existing algorithms, while having additional quality benefits and the added advantage of a bound on the distance from the optimal solution.

3. A System for Index Selection Using ILPs

3.1. System Architecture

Figure 2 shows the architecture of our ILP-based approach. The ILP Model module takes as input the query workload and a storage constraint and produces an ILP specification. The candidate indexes used in the ILP formulation are provided by the Candidate Selection module. Our ILP approach is flexible in that it can accept any set of indexes as its input, thus allowing the use of existing candidate selection techniques. The input index set is processed by the ILP Model module to produce a set of configurations for each query (represented by the $x_{ik}$ variables).

The benefit $b_{ik}$ for each configuration is computed by the Index Usage Model (INUM), which efficiently provides query cost estimates with optimizer precision. The INUM itself is initialized by performing a small number of key optimizer calls, the results of which are used in query cost estimation. Once the ILP model is complete, the LP and branch-and-bound solver modules are used to derive solutions. The solver result is either the optimal solution or an approximate one, along with a tight bound for its distance from the optimal.

3.2. The Index Usage Model

The Index Usage Model [10] is a framework for efficient cost estimation. Given a query and an index configuration, it computes a value for the query cost that is equal (or in the worst case very close) to the value returned by the optimizer, without actually performing an optimizer invocation. Since the optimizer is not involved, the INUM can achieve 3 orders of magnitude faster cost estimation while maintaining optimizer precision. The INUM is independent of the index selection algorithm used; however, it is a direct match for our ILP-based approach, which requires evaluating a large number of configurations.

The intuition behind the INUM is that although design tools must examine a large number of alternative designs, the number of different optimal query execution plans and thus the range of different optimizer outputs is much smaller. Therefore it makes sense to reuse the optimizer output, instead of repeatedly computing the same plan. The INUM works by first performing a small number of key optimizer calls per query in a precomputation phase and caching the optimizer output (query plans along with statistics and costs for the individual operators). During normal operation, query costs are derived exclusively from the precomputed information without any further optimizer invocation. The derivation involves a simple calculation (similar to computing the value of an analytical model) and thus is significantly faster compared to the complex query optimization code.

Besides its accuracy and performance benefits, an aspect of the INUM that is relevant to our ILP-based approach is the time consumed by its precomputation phase. The INUM can efficiently provide cost estimates for tens of thousands of configurations, but it must first be initialized by performing a small number of key optimizer calls. The duration of the initialization phase is therefore part of the ILP solving time and must be taken into account when comparing with existing index selection tools. We therefore include INUM setup timing results in our experimental section.

4. Performance Considerations

Since our solution algorithms (described in Sections 2.4, 2.5) rely on solving the LP relaxation of the original ILP formulation once or multiple times, the performance of solving LP problems is critical for the overall feasibility of our approach. The efficiency of solving an LP problem depends on the number of decision variables and the number of constraints. In the context of index selection, efficiency is determined by the number of candidate indexes ($y_j$) and the number of the $x_{ik}$ variables.

Our approach allows us to view the number of candidates and configurations as separate “tuning knobs” and use them to control performance. Our difference from existing solutions is that they provide no guarantees over the quality of their heuristics. The mathematical structure of the ILP formulation allows us to modify the number of candidates or the number of configurations to improve performance, while quantifying the impact on the optimality of the solution. In the next sections we describe candidate selection and configuration selection in the context of our ILP formulation and present sensitivity analysis as a way to estimate solution quality loss.

4.1. Candidate Selection

Candidate selection (shown also in Figure 2) determines the set of indexes that will be used in the ILP formulation and therefore the number of $y_j$ variables. Candidate selection is necessary because in practice, incorporating all the relevant indexes in an ILP formulation results in huge models that are not practical to solve. Naturally, omitting indexes from consideration can result in suboptimal solutions. It is the job of the candidate selection module to identify a “good” set of indexes that minimizes quality loss. Our formulation does not make any assumptions on the algorithm used and can therefore incorporate previously proposed approaches [4, 6, 8].

Compared to previous approaches, the ILP formulation allows for a much larger number of candidates. Our experiments indicate that a fairly standard high performance linear solver (in MATLAB’s Optimization Toolbox) can solve LP relaxations of 10000 variables in under a minute on a high-end server. The available LP solver performance, combined with the use of the INUM (Section 3.2) for fast cost estimation, allows us to grow the number of candidates ($y_j$ variables) into the thousands, a much larger number compared to previous techniques.

The flexibility in selecting a large number of candidates reduces the chances that some important index will be missed by the candidate selection. In addition, we propose the use of sensitivity analysis techniques (described in Section 4.3) to estimate the loss in solution quality from a particular set of candidates and to guide the design of effective index selection algorithms.

4.2. Configuration Selection

Configuration selection is the process of determining which configurations will be used in the model. In other words, configuration selection determines the $x_{ik}$ variables and the corresponding $b_{ik}$ benefits that need to be considered. Reducing the number of configurations is critical for performance, as their number can grow very large in practice. Intuitively, configuration selection sets the benefit values for the omitted configurations to zero, artificially reducing the “importance” of certain index combinations. Therefore, configuration selection can also result in quality loss.

Again, the ILP formulation allows for a much larger number of configurations to be evaluated. Given the performance statistics of the previous section (10000 variables in 1 minute), for a workload of 50 queries, the model can easily accommodate 200 different configurations per query (50 × 200 = 10000 variables). Existing index selection tools for a similarly-sized workload examine a much smaller number of candidates, as each atomic configuration/query combination requires an optimizer call, and performing more than a few hundred calls is prohibitively expensive.

4.3. Sensitivity Analysis

Sensitivity analysis in linear optimization is the process of estimating the stability of the solution to a given LP with respect to changes in the LP’s parameters. For instance, there exist sensitivity analysis techniques for computing intervals for the parameters of linear programs, such that changing a parameter within those intervals does not change the optimal solution. There also exist techniques for incrementally determining a new optimal solution if some of the input parameters change.

We plan to investigate the use of sensitivity analysis to model candidate and configuration selection. For example, we can use a modern LP solver to obtain intervals for all the benefit ($b_{ik}$) and index storage ($s_j$) parameters of an LP. The intervals are actually computed at no extra cost, as they are a product of the LP solution.

We can then use the obtained intervals to reason about indexes that are not included in the candidate set: If we can prove that there exist no indexes that can cause LP parameters to exceed the computed intervals (for example, indexes with high benefit values or low storage), we can conclude that the existing candidate set is “good enough”. If not, we can estimate the potential benefit loss and decide if additional indexes need to be added to the candidate set. Formalizing the application of sensitivity analysis techniques to index and candidate selection is part of our ongoing work.

Figure 3. Benefit values for various solutions.

5. Comparing with Existing Approaches

In this section we present preliminary results from our ILP-based implementation, using TPC-H queries on a commercial DBMS. We compare the performance and quality of our approach to that of the commercial index selection tool.

5.1. Experimental Setup

Our implementation interfaces to a commercial DBMS that has its own index selection tool. Our implementation uses Java for the INUM, database interfaces and model construction and MATLAB’s Optimization Toolbox for the solvers. We obtain a set of candidate indexes for our input workload through the commercial index selection tool, by observing the virtual indexes it builds using a profiler. Our workload consists of 5 TPC-H queries, on a 1GB TPC-H database. The storage constraint was set to 60000 pages or 480MB. With this specification, the commercial index selection tool examined 64 candidate indexes. We enumerated all the configurations that could be constructed from the candidate set and generated 1589 configuration decision variables. All experiments were run on an Intel Xeon 3GHz server. Performance improvements are of the form $\%\mathit{improvement} = 1 - \mathit{performance}_{optimized} / \mathit{performance}_{orig}$. The benefit numbers returned by our algorithm were verified by actually implementing the indexes and obtaining real optimizer cost estimates.
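For reference, the improvement metric and the storage arithmetic of this setup can be written out as below. The helper name `improvement` and the sample costs are hypothetical, and the 8 KB page size is an assumption inferred from the stated equivalence of 60000 pages and 480MB.

```python
def improvement(cost_orig, cost_optimized):
    """Fractional improvement: 1 - optimized / original."""
    if cost_orig <= 0:
        raise ValueError("original cost must be positive")
    return 1.0 - cost_optimized / cost_orig

# Assumption: 8 KB pages, implied by 60000 pages = 480MB (with 1 MB = 1000 KB).
PAGE_KB = 8
storage_mb = 60000 * PAGE_KB / 1000

print(improvement(200.0, 50.0))  # 0.75, i.e. a 75% improvement
print(storage_mb)                # 480.0
```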

5.2. Experimental Results

The solution of the LP relaxation of the ILP had 5 non-zero index ($y_j$) decision variables. Three of them were integer (equal to 1) and the others had a value of 0.76. Figure 3 compares the maximum benefit value for the LP relaxation (LP optimal) to the benefit bound computed by removing the storage constraint (unlimited). We obtain the latter simply by invoking the commercial tool without any storage constraint. LP optimal provides a much tighter bound to the optimal value compared to unlimited.

By truncating the LP relaxation result we obtain an integer solution with identical benefit that requires only 10% more space. (For this example, the truncated solution becomes the optimal if we increase the storage by 10%.) Assuming that the original storage constraint is hard, the truncated integer solution is not feasible. We derive a feasible solution, LP approx, by removing one index (setting its $y_j$ value to 0) so that the remaining indexes fit into the available storage. According to Figure 3, LP approx is within 32% of the optimal ILP solution.

We ran the branch-and-bound algorithm in MATLAB’s Optimization Toolbox in order to improve the approximate solution, using its benefit value for pruning. The branch-and-bound algorithm returned the optimal solution for the original problem, which is within 15% of the optimal LP relaxation solution, confirming that for this example the optimal value of the LP relaxation is a good approximation of the ILP optimal value. The commercial bar in Figure 3 shows the benefit achievable by the solution provided by the commercial index selection tool. The quality of the commercial tool solution is 56% lower than that of the approximate LP solution (LP approx).

MATLAB took 1.3s to solve the LP relaxation, while the commercial tool reported 1 minute of running time. The time for INUM construction (Section 3.2) was 7s. Overall, the time for constructing the INUM and the ILP model and for obtaining initial solutions through the LP relaxation was much less than the running time of the commercial tool.

MATLAB’s branch-and-bound procedure took several hours to complete. The reason is that generic solvers cannot distinguish between the index decision variables (y_j) and the configuration variables (x_ik) and do not exploit the fact that the latter depend on the former. Developing a specialized branch-and-bound procedure to replace MATLAB’s generic solver is part of our ongoing work.
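
To illustrate how the dependence of the configuration variables on the index variables could be exploited, the following Python sketch branches only on the index decisions and derives each query's best configuration from the chosen indexes. It is a toy sketch under our own assumptions, not the specialized procedure under development: the data is hypothetical, the optimistic bound simply assumes that every undecided index is available while ignoring its storage, and the incumbent plays the role of an approximate solution used for pruning.

```python
def workload_benefit(available, queries):
    """Each query uses its best configuration whose indexes are all available."""
    return sum(
        max((b for idxs, b in configs if idxs <= available), default=0.0)
        for configs in queries
    )


def branch_and_bound(sizes, budget, queries, incumbent=frozenset()):
    """Maximize workload benefit subject to the storage budget.

    Branches on index decisions only; configuration choices are implied.
    `incumbent` is a known feasible index set used to prune from the start.
    """
    names = sorted(sizes)
    best_set = frozenset(incumbent)
    best_value = workload_benefit(best_set, queries)

    def visit(pos, chosen, used):
        nonlocal best_set, best_value
        value = workload_benefit(chosen, queries)
        if value > best_value:
            best_set, best_value = chosen, value
        if pos == len(names):
            return
        # Optimistic bound: pretend all undecided indexes are available.
        optimistic = workload_benefit(chosen | frozenset(names[pos:]), queries)
        if optimistic <= best_value:
            return  # this subtree cannot beat the incumbent
        j = names[pos]
        if used + sizes[j] <= budget:
            visit(pos + 1, chosen | {j}, used + sizes[j])
        visit(pos + 1, chosen, used)

    visit(0, frozenset(), 0)
    return best_set, best_value


# Hypothetical instance: per query, a list of (required indexes, benefit).
sizes = {"a": 4, "b": 3, "c": 2}
queries = [
    [(frozenset({"a"}), 10.0), (frozenset({"b"}), 6.0),
     (frozenset({"b", "c"}), 11.0)],
    [(frozenset({"c"}), 4.0), (frozenset({"a"}), 3.0)],
]
best_set, best_value = branch_and_bound(
    sizes, budget=5, queries=queries, incumbent=frozenset({"a"}))
```

The bound is valid because the benefit of a query can only grow when more indexes become available. A tighter bound, for example from the LP relaxation of the remaining subproblem, would prune more aggressively.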

6. Related Work

Microsoft’s Index Tuning Wizard and Database Tuning Advisor [6, 1] use a “candidate selection” phase to identify promising indexes, a “merging” step for augmenting the candidate set [7] and a greedy search for selecting a locally-optimal solution. The DB2 Advisor [8] uses the optimizer to select an initial set of indexes and formulates a knapsack problem that is solved by a greedy search.
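
For intuition about greedy search on a knapsack formulation, consider the following Python sketch. It is a generic textbook greedy heuristic (highest benefit per unit of storage first) on hypothetical data, not the actual algorithm of any of the tools cited above, and it treats index benefits as independent.

```python
def greedy_knapsack(candidates, budget):
    """Pick candidates by decreasing benefit/size ratio while they fit.

    candidates: list of (name, benefit, size) tuples.
    Returns the chosen names and the storage they use.
    """
    chosen, used = [], 0
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    for name, benefit, size in ranked:
        if used + size <= budget:
            chosen.append(name)
            used += size
    return chosen, used


# Hypothetical candidate indexes: (name, benefit, size).
candidates = [("i1", 60.0, 10), ("i2", 100.0, 20), ("i3", 120.0, 30)]
chosen, used = greedy_knapsack(candidates, budget=50)
```

On this instance the greedy choice is i1 and i2, with a total benefit of 160, whereas i2 and i3 together also fit into the budget and yield 220. The gap between the two is exactly the kind of quantity that heuristics alone do not reveal.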

Our ILP formulation is based on [5]. Their formulation accounts for queries using more than one index and also models index update costs. They also provide a specialized branch-and-bound procedure for its solution. Our work is concerned with applying the formalism in [5] to real-world problems that involve commercial database systems and workloads.

In terms of modeling, we integrate the model in [5] with the query optimizer in existing systems. We also deal with the problems of selecting candidates and configurations using sensitivity analysis, so that we derive reasonably-sized ILP instances that can be solved efficiently. In terms of solutions, we do not focus on algorithms for finding optimal solutions, as the model is in any case an approximation, due to candidate and configuration selection. Our work proposes the use of LP relaxation to find approximate solutions and compute optimality bounds.

Finally, to accurately model real-world constraints, we include storage limits in our formulation. [5] does not consider storage and it is unclear whether their analysis and proposed algorithms apply to problem instances resulting from our formulation.

[9] describes an ILP and a solution based on randomized rounding, assuming a single index per query. Their solution has optimal performance but requires a bounded amount of additional storage. Since the algorithms in [9] assume the use of a single index per query, they ignore “interactions” between indexes and are therefore not directly applicable to commercial relational systems.
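
For readers unfamiliar with randomized rounding, the following Python sketch shows the generic technique: each index is selected independently with probability equal to its fractional LP value. This is only the textbook idea on hypothetical values; it is not the algorithm of [9], whose details and guarantees are not reproduced here.

```python
import random


def randomized_round(y_lp, rng):
    """Select index j independently with probability y_lp[j]."""
    return {j: 1 if rng.random() < p else 0 for j, p in y_lp.items()}


# Hypothetical fractional LP solution.
y_lp = {"I1": 1.0, "I2": 0.76, "I3": 0.76, "I4": 0.0}

rng = random.Random(0)  # fixed seed so that runs are repeatable
samples = [randomized_round(y_lp, rng) for _ in range(1000)]
freq_i2 = sum(s["I2"] for s in samples) / len(samples)
```

By linearity of expectation, the expected storage of a rounded solution equals the storage of the fractional solution, although any individual sample may exceed the budget.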

Besides index selection, there is work on designing additional features, such as materialized views and table partitions [2, 11, 3]. The proposed algorithms are similar to their index selection counterparts, except that they face a combinatorial explosion in the number of alternative designs that arises when exploring feature combinations. We believe our modeling approach will be beneficial for problems combining multiple features, because it can better capture the “interaction” between features and it offers higher scalability.

7. Conclusion

In this paper we present an Integer Linear Programming (ILP) model for the index selection problem. We apply standard optimization techniques to compute optimality bounds, derive approximate solutions with known distance from the optimal and improve approximate solutions. We describe an efficient implementation architecture that makes use of optimizer estimates similarly to commercial tools. Our preliminary experiments indicate that ILP-based index selection is efficient and offers higher quality solutions compared to existing tools.

References

[1] S. Agrawal, S. Chaudhuri, L. Kollar, A. P. Marathe, V. R. Narasayya, and M. Syamala. Database tuning advisor for Microsoft SQL Server 2005. In VLDB 2004.

[2] S. Agrawal, S. Chaudhuri, and V. R. Narasayya. Automated selection of materialized views and indexes in SQL databases. In VLDB 2000.

[3] S. Agrawal, V. Narasayya, and B. Yang. Integrating vertical and horizontal partitioning into automated physical database design. In Proceedings of the SIGMOD Conference, New York, NY, USA, 2004. ACM Press.

[4] N. Bruno and S. Chaudhuri. Automatic physical database tuning: a relaxation-based approach. In SIGMOD ’05.

[5] A. Caprara and J. Salazar. A branch-and-cut algorithm for a generalization of the uncapacitated facility location problem, 1996.

[6] S. Chaudhuri and V. R. Narasayya. An efficient cost-driven index selection tool for Microsoft SQL Server. In Proceedings of VLDB 1997.

[7] S. Chaudhuri and V. R. Narasayya. Index merging. In Proceedings of ICDE 1999.

[8] G. Valentin, M. Zuliani, D. Zilio, and G. Lohman. DB2 advisor: An optimizer smart enough to recommend its own indexes. In Proceedings of ICDE 2000.

[9] C. Heeren, H. V. Jagadish, and L. Pitt. Optimal indexing using near-minimal space. In PODS ’03: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2003.

[10] S. Papadomanolakis, D. Dash, and A. Ailamaki. Intelligent use of the query optimizer in automated database design. Technical Report CMU-CS-06-151, Computer Science Department, Carnegie Mellon University, 2006.

[11] D. C. Zilio, J. Rao, S. Lightstone, G. M. Lohman, A. Storm, C. Garcia-Arellano, and S. Fadden. DB2 design advisor: Integrated automatic physical database design. In VLDB, pages 1087–1097, 2004.
