
    Reducing the Braking Distance of an SQL Query Engine

Michael J. Carey, IBM Almaden Research Center, San Jose, CA 95120, [email protected]

Donald Kossmann, University of Passau, 94030 Passau, Germany, kossmann@db.fmi.uni-passau.de

    Abstract

In a recent paper, we proposed adding a STOP AFTER clause to SQL to permit the cardinality of a query result to be explicitly limited by query writers and query tools. We demonstrated the usefulness of having this clause, showed how to extend a traditional cost-based query optimizer to accommodate it, and demonstrated via DB2-based simulations that large performance gains are possible when STOP AFTER queries are explicitly supported by the database engine. In this paper, we present several new strategies for efficiently processing STOP AFTER queries. These strategies, based largely on the use of range partitioning techniques, offer significant additional savings for handling STOP AFTER queries that yield sizeable result sets. We describe classes of queries where such savings would indeed arise and present experimental measurements that show the benefits and tradeoffs associated with the new processing strategies.

    1 Introduction

In decision support applications, it is not uncommon to wish to pose a query and then to examine and process at most some number (N) of the result tuples. In most database systems, until recently, applications could only do this by using a cursor, i.e., by submitting the entire query and fetching only the first N results. Obviously, this can be very inefficient, leading to a significant amount of wasted query processing. In a recent paper [CK97], we proposed adding a STOP AFTER clause to SQL to enable query writers to limit the size of a query's result set to a specified number of tuples; related SQL extensions have been proposed in [KS95, CG96]. The STOP AFTER clause essentially provides a declarative way for a user to say "enough already" in the context of an SQL query, enabling the system to avoid computing unwanted results in many cases. In our previous work we showed the usefulness of the new

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 24th VLDB Conference, New York, USA, 1998

clause, discussed how a cost-based query optimizer can be extended to exploit it, and used DB2-based simulations to demonstrate the large performance gains that are possible when STOP AFTER query support is explicitly added to the database engine.

In this paper, we build upon our previous work by presenting several new strategies for efficiently processing STOP AFTER queries. Although we discussed STOP AFTER query processing in general in [CK97], the major focus of our initial attention was on optimizing queries where N is relatively small (e.g., "top ten" queries). An example of a typical query that our previously proposed processing schemes will handle well is:

SELECT e.name, e.salary
FROM Emp e
WHERE e.age > 50
ORDER BY e.salary DESC
STOP AFTER 10;

This query asks for the names and salaries of the ten most highly-paid older employees in the company. Our previous schemes will also work well for primary-key/foreign-key join queries such as:

SELECT e.name, e.salary, d.name
FROM Emp e, Dept d
WHERE e.age > 50
AND e.works-in = d.dno
ORDER BY e.salary DESC
STOP AFTER 10;

This query asks for the employees' department names as well as their names and salaries. For queries such as these, it is possible for the query processor to manage its sorted, cardinality-reduced intermediate results using a main memory heap structure, thereby avoiding large volumes of wasted sorting I/O as compared to processing the query without a STOP AFTER clause and then discarding the unwanted employee information.

In cases where the stopping cardinality N is large, our original approaches would each end up sorting and then discarding a significant amount of data, albeit early (i.e., before the join in the example above), which still leads to a significant savings compared to the naive approach. The strategies presented in this paper seek to avoid this wasted effort as well. Our new strategies are based upon borrowing ideas from existing query processing techniques


such as range partitioning (commonly used in parallel sorting and parallel join computations), RID-list processing (commonly used in text processing and set query processing), and semi-joins (commonly used in distributed environments to reduce join processing costs). As we will show, adapting these techniques for use in STOP AFTER query processing can provide significant additional savings for certain important classes of queries. We have implemented the techniques in the context of an experimental query processing system at the University of Passau, and we will demonstrate the efficacy of our techniques by presenting measurements of query plans running there.

Before proceeding, it is worth noting that proprietary SQL extensions closely related to our proposed STOP AFTER clause can be found in current products from a number of major database system vendors. In addition, most of them include some degree of optimizer support for getting the first query results back quickly (e.g., heuristically favoring pipelined query plans over otherwise cheaper, but blocking, non-pipelined plans). For example, Informix includes a FIRST-ROWS optimizer hint and a FIRST n clause for truncating an SQL query's result set. Similarly, Microsoft SQL Server provides an OPTION FAST n clause and a session-level SET ROWCOUNT n statement for these purposes. IBM's DB2 UDB system allows users to include OPTIMIZE FOR n ROWS and/or FETCH FIRST n ROWS ONLY clauses when entering an SQL query. Oracle Rdb (originally a DEC product) added a LIMIT TO n ROWS clause to SQL, while Oracle Server makes a virtual ROWNUM attribute part of its query results to support cardinality limits; including the predicate ROWNUM <= n in a query's WHERE clause limits its result to the first n rows.


[Figure 1: Traditional Plans for Query 1. Classic Sort-Stop: sort-stop(N) over tbscan(Emp, age>50). Conventional Sort: scan-stop(N) over sort(salary) over tbscan(Emp, age>50). Traditional Index-Scan: scan-stop(N) over ridscan(age>50) over idxscan(Emp.salary).]

first N tuples of its input are inserted into the heap, and each remaining tuple is then incrementally tested against the heap's current membership bound to determine whether or not it warrants insertion into the heap of the top (or bottom) N tuples. For larger values of N, external sorting is required to compute the desired Sort-Stop results; we simply used an ordinary external Sort operator followed by a Scan-Stop operator in such cases in [CK97].
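The membership-bound logic of the heap-based sort-stop operator can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation (which organizes the heap with 2-3 trees); it uses a binary min-heap whose root serves as the membership bound for a top-N query.

```python
import heapq

def sort_stop(tuples, n, key):
    """Heap-based Sort-Stop: retain only the current top-n tuples.

    The first n input tuples seed the heap; every remaining tuple is
    tested against the heap's membership bound (the smallest key held)
    and admitted only if it beats that bound."""
    heap = []  # min-heap of (key, arrival index, tuple)
    for i, t in enumerate(tuples):
        k = key(t)
        if len(heap) < n:
            heapq.heappush(heap, (k, i, t))
        elif k > heap[0][0]:                  # beats the membership bound?
            heapq.heapreplace(heap, (k, i, t))
    # emit in descending key order, as ORDER BY ... DESC STOP AFTER n would
    return [t for _, _, t in sorted(heap, key=lambda e: e[0], reverse=True)]
```

Each input tuple costs at most O(log N) heap work, which is why this plan is attractive whenever the N-entry heap fits in memory.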

For illustration purposes, consider a slightly more general version of the first example query from the Introduction (we will call this Query 1 in the following):

Query 1: SELECT *
         FROM Emp
         WHERE age > 50
         ORDER BY salary DESC
         STOP AFTER N;

Figure 1 depicts three of the possible execution plans that can be constructed for this query by combining one of our physical Stop operators with other, pre-existing query operators. The first plan, the Classic Sort-Stop plan, uses a table scan (tbscan) operator to find employees in the appropriate age range followed by a heap-based sort-stop(N) operator to limit the results to the N highest paid older employees. This plan is viable as long as N is small enough for the heap to indeed be a main memory structure. The second plan, Conventional Sort, instead uses an external sort on salary followed by a scan-stop(N) to obtain the desired result. This would be the preferred plan for large N in the absence of a salary index. Of course, plans similar to these two, but with an Emp age index scan used to produce the inputs to the Stop-related operators, are possible as well. The third plan in Figure 1, the Traditional Index-Scan plan, would also become viable in the presence of an index on Emp.salary. This plan performs an index scan (in descending order) on the salary index, uses the resulting record ids (RIDs) to fetch high-salaried employees and applies the age predicate to them, and then uses a scan-stop operator to select the top N results, since the index scan produces its output in the desired salary order. This third plan does very well if the salary index is a clustered index or N is small. If N is large and the index is unclustered, however, it would do too many random I/Os to be cost-effective, especially if the age predicate is highly selective (in which case many of the high-salaried employees found using the index would subsequently be eliminated).

We introduced two policies to govern the placement of Stop operators in query plans in [CK97]. One was a Conservative policy, which inserts Stop operators as early as possible in a query plan subject to the constraint that no tuple that might end up participating in the final N-tuple query result can be discarded by a Stop operation. We also proposed an Aggressive policy that seeks to introduce Stop operators in a query plan even earlier, placing a Stop operator wherever it can first provide a beneficial cardinality reduction. The Aggressive policy uses result size estimation to choose the stopping cardinality for the Stop operator; at runtime, if the stopping cardinality estimate turns out to have been too low, the query is restarted in order to get the missing tuples. This is accomplished by placing a restart operator in the query plan; this operator's job is to ensure that, above its point in the plan, all N tuples will be generated. Thus, if its input stream runs out before all N tuples are received, it will restart the query subplan beneath it to obtain the missing results.

    2.2 Range Partitioning

Range partitioning is a well-known technique that has been applied with much success to numerous problems in the parallel database algorithm area [DG92]. One successful example is parallel sorting [DNS91b], while another is load-balanced parallel join computation [DNSS92]; yet another example is the computation of so-called band joins [DNS91a]. The basic idea of range partitioning is extremely simple: the data is divided into separately processable buckets by placing tuples with attribute values in one range into bucket 1, tuples with attribute values in the next range into bucket 2, and so on. In the case of parallel sorting, each node in a k-node database machine partitions its data into k buckets in parallel, based on the sorting attribute(s), streaming each bucket's contents to that bucket's designated receiver node while the data is being partitioned. At the end of this process, the individual buckets can be sorted in parallel with no further inter-node interaction. Figure 2 illustrates this process. Successful partitioning in this manner produces virtually linear sorting speedup, and sampling (or histogram) techniques can aid in the determination of a good set of partition boundary values [DNS91b] at relatively low cost.

In Section 3, we will propose and analyze the use of several possible partitioning-based approaches for improving the efficiency of STOP AFTER query processing. We will cover the details later, but the basic idea is simple: the relevant data can be range-partitioned on the query's ORDER BY attribute into a number of buckets. The buckets can then be processed one at a time until N results have been output; buckets that are not accessed in this process need never be sorted at all. This provides a way to implement a Stop(N) operation that scales beyond main memory sizes without requiring full sorting of the input stream. In addition, we will see that in certain contexts, additional significant savings are possible, e.g., cases involving uses of a STOP AFTER clause in a subquery.
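The bucket-at-a-time idea can be sketched in a few lines of Python. This is a sketch under assumed conventions (descending buckets, split values already chosen; the function and parameter names are ours, not operators from the paper):

```python
def stop_after_partitioned(rows, n, key, boundaries):
    """Range-based Stop(n): partition rows on the ranking attribute using
    the split values in `boundaries` (given in descending order), then
    sort and emit buckets one at a time until n results are out.
    Buckets that are never reached are never sorted."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for r in rows:
        k = key(r)
        # bucket 0 holds the largest keys, the last bucket the smallest
        for i, b in enumerate(boundaries):
            if k > b:
                buckets[i].append(r)
                break
        else:
            buckets[-1].append(r)
    out = []
    for b in buckets:                 # process buckets one at a time
        if len(out) >= n:
            break                     # remaining buckets are never sorted
        b.sort(key=key, reverse=True)
        out.extend(b)
    return out[:n]
```

Only the prefix of buckets needed to reach n results is ever sorted; with well-chosen boundaries this is a small fraction of the input.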

[Figure 2: Sorting by Range Partitioning. A base table is range-partitioned into buckets (e.g., 5, 8, ...; 13, ...; 20, 37, ...), which can then be sorted independently.]

2.3 Experimental Environment

As we work our way through the presentation of the proposed new approaches for executing STOP AFTER queries, we will be presenting results from performance experiments that demonstrate the tradeoffs related to the approaches and that quantitatively explore the extent to which they are able to reduce the costs of STOP AFTER queries. Like Query 1 above, our test queries will be queries over a simple employee database with the following self-explanatory schema:

Emp(eno, works-in, age, salary, address)
Dept(dno, budget, description)

Our instance of this employee database is fairly small, with a 50 MB Emp table and a 10 MB Dept table. We kept the database small in order to achieve acceptable running times and because we had somewhat limited disk space available for performing our experiments. The Emp table has 500,000 tuples which are generated as follows: eno is set by counting the tuples from 1 to 500,000, while works-in, age, and salary are set randomly using a uniform distribution on their particular domains; address simply pads the tuples with garbage characters to ensure that each Emp tuple is 100 bytes long. The domain of works-in is, of course, the same as that of Dept.dno (described below), the domain of age is integers in the range from 10 to 60 so that about 100,000 Emps (20%) are older than 50, and the domain of salary is integers in the range of 1 to 500,000. Our test database has no correlations; as an example relevant to our experiments, a young Emp is just as likely to have a high salary as an old Emp is.

The Dept table has 100,000 tuples which are generated as follows: dno is set by counting the Dept tuples from 1 to 100,000; budget is set to 10,000 for all Depts, and description pads the Dept tuples out to 100 bytes.
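For reference, test tables with these value domains can be generated along the following lines. This is a sketch of the distributions described above; the authors' actual generator and the exact padding characters are not shown in the paper.

```python
import random

def make_test_db(n_emp=500_000, n_dept=100_000, seed=0):
    """Generate Emp and Dept tables with the domains described above:
    uniform works-in, age in 10..60 (so ~20% of Emps are older than 50),
    uniform salary in 1..500,000, and padding to fill out each tuple."""
    rng = random.Random(seed)
    emp = [
        {
            "eno": i,                               # 1 .. n_emp
            "works-in": rng.randint(1, n_dept),     # uniform over Dept.dno
            "age": rng.randint(10, 60),             # ~20% older than 50
            "salary": rng.randint(1, 500_000),      # uniform, uncorrelated
            "address": "x" * 60,                    # pads the tuple out
        }
        for i in range(1, n_emp + 1)
    ]
    dept = [
        {"dno": i, "budget": 10_000, "description": "x" * 80}
        for i in range(1, n_dept + 1)
    ]
    return emp, dept
```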

In terms of indexes, our test database has clustered B+ tree indexes on the primary key attributes of the tables (i.e., eno and dno) because clustered indexes on primary keys are relatively common. To study plans such as the Traditional Index-Scan plan of Figure 1, we also have an Emp.salary B+ tree; naturally, this index is unclustered.

Our experiments have been performed on an experimental database system called AODB [WKHM98]. AODB is essentially a textbook relational database system that uses standard implementations for sorting, various kinds of joins, group-by operations, and so on. We extended AODB with implementations for the scan-stop, sort-stop (using 2-3 trees to organize the heap [AHU83]), and restart operators described above; we also added support for the forms of range partitioning described in the next section. We ran AODB on a Sun workstation 10 with a 33 MHz SPARC processor, 64 MB of main memory, and a 4 GB disk drive that is used to store code, the database, and temporary results of queries. The operating system is Solaris 2.6, and we used Solaris' direct I/O feature; this disables operating system caching in a manner similar to raw disk devices. Since our database is small, we limited the size of the database buffer pool proportionally, to 4 MB, in those cases where we do not explicitly say otherwise. Of these 4 MB, we always gave at least 100 KB to each operator that reads or writes data to disk in order to enable large block I/O operations and avoid excessive disk seeks.

3 Range-Based Braking Algorithms

We now turn our attention to the development of new techniques for processing STOP AFTER queries with less effort, i.e., techniques for reducing the stopping distance of an SQL query engine. The primary tool that we will be using is range partitioning. In this section, we present several algorithms that use this tool to help the engine limit wasted work, thereby finishing sooner for STOP AFTER queries; we refer to these as range-based braking algorithms. We start by describing query plan components that can realize the algorithms and illustrating them using a typical example query. We then study their performance, and we close this section by explaining how to choose an appropriate number of partitions and an effective set of partition sizes.

    3.1 Range-Based Braking

As mentioned earlier, the problem of extracting the top (or bottom) N elements from a large data set, where N is large as well, can be dealt with by first range-partitioning the data into a number of buckets on the query's ranking attribute(s) and then processing the resulting buckets one at a time until all N elements have been output. As an example, consider again Query 1 of Section 2.1, which selects the names and salaries of the N highest paid employees over 50 years old. Let us suppose we have a corporation with 100,000 older employees (as in our test database) and that N is 10,000. We could, for instance, partition the company's old employees into three buckets: those with salaries over 250,000 per year, those who earn between 50,000 and 250,000 annually, and those who earn less than 50,000. Suppose that we do this and find that the first (highest salary) partition ends up with 1,000 tuples, the second with 12,000 tuples, and the third with 87,000 tuples. If this is the case, we need not sort the tuples in the last partition, as the 10,000 employees in the answer set clearly lie in the first two partitions.
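The arithmetic in this example reduces to a cumulative-count check over the bucket sizes; a minimal sketch (the helper name is ours):

```python
def partitions_to_sort(bucket_sizes, n):
    """Given per-bucket tuple counts (highest-salary bucket first) and a
    stopping cardinality n, return how many leading buckets must be
    sorted to produce the top n tuples; later buckets are never sorted."""
    total = 0
    for i, size in enumerate(bucket_sizes):
        total += size
        if total >= n:
            return i + 1
    return len(bucket_sizes)

# The example above: buckets of 1,000 / 12,000 / 87,000 tuples and
# n = 10,000 mean only the first two partitions need sorting.
```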

While the basic idea of range-based braking is simple, there are several possible variations on this theme with costs and benefits that depend on the nature of the query


[Figure 4: Response Time (secs), Query 1, 4 MB Buffer, Perfect Partitioning]
[Figure 5: Response Time (secs) vs. Number of Partitions, Query 1, 600 KB Buffer, N = 20,000, Perfect Cut]

and the partition-based Materialize, Reread, and Hybrid approaches, which we just introduced, and which are shown in Figure 3. To investigate the quantitative tradeoffs between the five approaches, we constructed each of the query plans and conducted a series of experiments using our test employee database and the AODB system, which are both described in Section 2.3. Figure 4 shows the overall Query 1 response time results that we obtained by experimenting with stopping cardinality values ranging from 1 to 100,000 (i.e., from one up to all of the old employees).

The Conventional Sort plan has the worst performance throughout most of Figure 4 because it sorts all of the data before it is able to identify the top N results. The performance figures for this plan thus always include the cost of a two-phase sort of 10 MB, which takes approximately two minutes in our test environment. The cost increase for larger values of N is due to the N-dependence of the final merge phase of the sort; for N = 1, only the first page of each run is read, while for N = 100,000 all pages of all runs are read and merged. The Classic Sort-Stop plan provides much better performance than the Conventional Sort plan as long as it is applicable; its curve stops at N = 10,000 because its sorted heap structure no longer fits in the buffer pool beyond that point. The relative performance seen for these approaches is essentially just as predicted in [CK97].

We now turn to the three partition-based approaches in Figure 4. In this experiment, we assume that the optimizer's selectivity estimator has access to accurate distribution data; this means that we assume that partitions are ideally sized (a notion that we will examine more closely in the next subsection). As a result, all partition-based plans end up with just over N tuples in their first partition. Looking at the performance results for Query 1, we see that the Materialize plan ends up being the worst performer among the three partitioning plans because it always materializes 10 MB of temporary partition data, much of which is subsequently not needed. Despite this cost, though, it outperforms the Conventional Sort plan for all values of N except N = 100,000 (where their performance becomes essentially the same). The other two partitioning approaches end up providing the best overall performance for this query; both sort only the required amount of data here, neither materializes any excess data in this case, and no excess scans occur, either. These two partitioning plans even outperform the Classic Sort-Stop plan for small N; they use quicksort to sort their first (and only) partition, which makes them slightly less costly here than Classic Sort-Stop, which uses its heap to order the results. Finally, as N increases, the differences between the approaches diminish because all of them end up sorting the same amount of data when N reaches 100,000.

3.4 Partitioning and Safety Padding

The preceding experiment provides quite a bit of insight into the relative performance of our old and new Sort-Stop processing schemes for a basic top N query; however, it assumed perfect partitioning. Before we accept its results, we need to explore the sensitivity of the partitioned plans to the number and sizes of partitions used. We need to do so for two reasons. First, we need to understand how to choose the partitioning parameters for each type of plan. Second, we need to find out how costly partitioning errors are so we know how to pad the partition sizes for safety; in practice these values will be selected based on a combination of database statistics and optimizer estimates, and they will therefore be imperfectly chosen.

Figure 5 shows the results of a series of experiments that differ in two ways from the ones we just looked at. Here, we have fixed N at 20,000 and decreased the buffer pool size by a factor of about seven to 600 KB (i.e., 150 pages of 4 KB). We use an even smaller buffer pool here so as to stress the need for partitioning; this is similar to scaling up the size of the employee table, but keeps the cost of the experiments within reason. The x-axis in Figure 5 is the number of partitions utilized, and the y-axis is the overall query response time (as before). In this graph, the number of partitions is varied with a perfect cut, meaning that for P partitions and a query stopping cardinality setting of N, we have P - 1 winner partitions with N/(P - 1) tuples in each one plus one loser partition with the leftover tuples in it. For comparison and baseline-setting purposes, in addition to showing the performance of the three partition-based plans, the graph also shows the timing results for the Conventional Sort plan (which do not vary since N is fixed); the Classic Sort-Stop is not applicable here since its

[Figure 6: Response Time (secs) vs. Cut-Point, Query 1, 600 KB Buffer, N = 20,000, Perfect Partitioning]


heap will not fit in the available buffer memory.

The results shown in Figure 5 provide clear insights into how each of the partition-based plans should be handled with respect to choosing the number of partitions. The Reread approach, as one would expect, is very sensitive to the number of partitions used; to avoid costly re-reads, two partitions (one winner, one loser) is the optimal choice. The Materialize approach performs best when the winner tuples are partitioned into memory-sized pieces so that each partition can be sorted in memory in a single pass; this is the case in the figure with five (or more) partitions. The Hybrid approach has a similar optimal point, for the same reason; it outperforms Materialize by about 20 seconds' worth of response time since it does not materialize the 80,000 loser tuples.
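A perfect cut, as used in the Figure 5 experiments, can be computed directly; a small sketch (our own helper; any rounding remainder is folded into the last winner partition, which is an assumption on our part):

```python
def perfect_cut(p, n, total):
    """Perfect-cut layout: p - 1 equal winner partitions holding the n
    winner tuples, plus one loser partition with the leftover tuples."""
    winners = [n // (p - 1)] * (p - 1)
    winners[-1] += n - sum(winners)      # absorb integer-division remainder
    return winners + [total - n]
```

For the experiment above (P = 5, N = 20,000, 100,000 old employees), this yields four 5,000-tuple winner partitions and an 80,000-tuple loser partition.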

Figure 6 shows the results of a series of experiments that explores the question of what happens to the different partitioning plans when the winner/loser cut-point (which is the x-axis in the figure) has been incorrectly estimated; let us call this cut-point C from now on. The goal of this experiment is to obtain insights that we can use to guide the sizing of partitions (e.g., so we know which direction it is better to err in). As before, the query used for the experiments here is Query 1, the buffer pool size is 150 pages, and N = 20,000. Learning from the results of Figure 5, Reread has one winner and one loser partition in this experiment, whereas Materialize and Hybrid further partition the winner tuples into memory-sized pieces.

When C is set too low, Figure 6 shows that all of the algorithms get into fairly big trouble. This is because some of the winner tuples, which belong in the query result, end up being placed into the large loser partition. In this case, Reread and Hybrid both have to re-scan the entire 50 MB employee table to get at these tuples, while Materialize must sort even the large loser partition. Under these circumstances, the Conventional Sort plan is able to beat all three of the partitioning plans. Materialize performs the worst here because it does a great deal of expensive reading and writing, and this ends up actually being more costly than a second sequential scan of the employee table.

When C is set too high, Figure 6 shows that the partitioning algorithms still manage to do quite well for Query 1. Materialize has roughly constant cost in this region, as it always materializes all of the old employees (independent of C). Hybrid grows slowly more expensive as C increases because it materializes C (but not all) old tuples. Reread grows costly much faster with an increasing cut-point over-estimate. To see why, we need to look at the amount of sorting carried out in the three plans: Reread, which has only one winner partition (with C tuples), must sort this whole winner partition in order to find the top N tuples. On the other hand, Materialize and Hybrid partition their C winner tuples into several small winner partitions so that they need not sort all of these winner partitions in cases in which C is set too high. In any case, the bottom line of this experiment is that all of the partition-based algorithms suffer quite strongly if the number of tuples placed into winner partitions ends up being too small, and suffer much less if too many tuples are classified as likely winners. Thus, it is better to err on the high side, placing too few tuples into the loser partition (i.e., too many into winner partitions), to avoid potential performance instabilities.

    3.5 Choosing a Partitioning Vector

Before moving on to other queries, it is worth discussing the issue of how the partitioning vector, i.e., the splitting values that control which attribute value ranges are associated with which partitions, can be chosen for the partition-based plans. The preceding subsection showed us how to choose the partition cardinalities, so the remaining open problem is one of successfully mapping these desired cardinalities back into attribute ranges for the ORDER BY attribute(s) of a STOP AFTER query. This is essentially the dual of the selectivity estimation problem, which takes a query's attribute value ranges and attempts to estimate cardinalities from those ranges; moreover, it is amenable to the same techniques.

There are essentially two potential answers here. The first is histograms, which have already been thoroughly studied (e.g., [PIHS96]) and are available in most database systems today. In particular, equi-depth histograms that provide good accuracy even in the presence of skewed data are well understood [Koo80, PSC84, MD88], and in the case of correlated attributes, multi-dimensional histograms will help [PI97]. Thus, if histograms are available, they can be used to determine partition vectors at query compilation time. If no histograms are available, or it is known that the available histograms provide insufficient accuracy (e.g., for complex queries with many unpredictable joins or grouping operations), then sampling at run time can be used. Sampling has also been thoroughly studied in the database context (e.g., [LNS90, HS92]), and it has also been shown to be quite cheap [DNS91b]. To conclude, at this point, we rely on existing technology, and the new partition-based approaches we propose in this paper can directly take advantage of any improvements made in this field in the future.
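Either way, the end product is a partitioning vector: attribute values that cut the data so each partition receives roughly a desired number of tuples. A sampling-based sketch of this mapping (an illustrative helper of ours, not an operator from the paper; an equi-depth histogram's bucket boundaries could stand in for the sample):

```python
def split_values(sample, target_fracs):
    """Map desired cardinalities back to attribute ranges: given a sample
    of the ORDER BY attribute and the cumulative fraction of tuples
    wanted above each cut (e.g., 0.1 = top 10%), return the splitting
    values for a descending partitioning vector."""
    s = sorted(sample, reverse=True)
    return [s[min(int(f * len(s)), len(s) - 1)] for f in target_fracs]
```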

    4 Other Examples

Thus far we have seen experimental results involving our favorite query that demonstrated the basic tradeoffs related to the alternative range partitioning techniques and that showed some of the advantages of using range partitioning for STOP AFTER queries. In this section, we present three additional examples with experimental results that highlight several other advantages of range partitioning. These example queries include a percent query, a nested query, and a join query. Since the tradeoffs between the three alternative partitioning approaches are very similar for all STOP AFTER queries, we will focus on Hybrid plans, which use our preferred partitioning method, in this section.


[Figure 7: Response Time (secs), Query 2, 4 MB Buffer, Perfect Partitioning]
[Figure 8: Response Time (secs) vs. N, Query 3, 4 MB Buffer, Perfect Partitioning]
[Figure 9: Query 3, Varying L, 4 MB Buffer, N = 20,000, C = 25,000]

    4.1 Percent Queries

Our first additional example (Query 2) is a so-called percent query. The query asks for the x% highest paid Emps that are more than 50 years old.

Query 2: SELECT *
         FROM Emp
         WHERE age > 50
         ORDER BY salary DESC
         STOP AFTER x%;

Percent queries are interesting because an additional counting step is required in order to find out how many tuples are to be returned. That is, we need to count the number of Emps that are over 50 before we can actually start STOPping (so to speak). Looking back at the plans studied in the previous section, we see that we can directly apply the Conventional Sort plan (Figure 1) to this percent query: the counting step can be carried out as part of the tbscan(Emp, age>50) operator, and the result of this counting step times x% can be propagated to the scan-stop(N) operator before the scan-stop(N) operator starts to produce tuples; this is possible because the sort in between is a pipeline-breaking operator. Likewise, all three partitioning plans of Figure 3 can be applied to our percent query: again, counting can be carried out as part of the tbscan(Emp, age>50) operators and propagated to the scan-stop(N) operators because there exists a pipeline-breaking operator in between (sort and/or part-mat or part-hybrid). The Classic Sort-Stop plan of Figure 1, however, cannot be directly applied in this case. To use a sort-stop operator, we can either read the whole Emp table twice (once to carry out the counting step and once to find the top x%), or we can read the whole Emp table once, thereby carrying out the counting step and materializing the Emps with age > 50 as a temporary table, and then read the temporary table in order to find the top x%. Which one of these two plans is better depends on the selectivity of the age predicate; in our particular example, the second plan is better because only one out of five Emps is older than 50 in our test database. In any case, both of these adjusted Classic Sort-Stop plans are more expensive than the (inapplicable) Classic Sort-Stop plan of Figure 1.
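The counting step itself is trivial once the scan has run; a sketch (how fractional counts are rounded is our assumption, as the paper does not specify):

```python
import math

def percent_stop_cardinality(qualifying_count, x_percent):
    """Turn a percent query's x% into an absolute stopping cardinality:
    the tbscan counts the tuples satisfying the WHERE clause, and that
    count times x% is what gets propagated to the scan-stop operator."""
    return math.ceil(qualifying_count * x_percent / 100.0)
```

For our test database, 100,000 Emps are older than 50, so x = 2 yields a stopping cardinality of 2,000.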

Figure 7 shows the running times of the Conventional Sort, the adjusted Classic Sort-Stop (with a materialization step), and the partition-based Hybrid plans for this percent query. As described above, the Conventional Sort and Hybrid plans have almost the same running times here as for Query 1 in Section 3, whereas the Classic Sort-Stop plan has a higher cost due to writing and reading temporary results from disk. As a result, the Hybrid plan is the clear winner for all x ≤ 50% for this percent query. Note that the Classic Sort-Stop plan cannot be applied for x > 50% because its memory requirements then exceed 4 MB.

To find the proper partitioning vector for the Hybrid plan for this query, the observations of Section 3.4 essentially still apply. That is, we should partition the data into memory-sized portions and play it safe by materializing too many rather than too few Emp tuples in the part-hybrid operator.

    4.2 STOP in a Subquery

    In the second example of this section (Query 3), we consider a query that has a STOP in a subquery. Our example query asks for the average salary of the N best paid Emps with age > 50.

    Query 3:  SELECT AVG(e.salary)
              FROM (SELECT salary
                    FROM Emp
                    WHERE age > 50
                    ORDER BY salary DESC
                    STOP AFTER N) e;

    Both the Conventional Sort and the Classic Sort-Stop plan of Figure 1 can be applied to this query; they simply need an aggregate operator at the top in order to compute the average. These traditional plans, however, perform a great deal of wasted work since they produce their output in salary order and this ordering is not needed to compute the average. With a partition-based plan, most of this sorting can be avoided by partitioning the Emps into three partitions: one partition containing the top L Emps (L slightly smaller than N), one partition containing the next M Emps such that M is small and L + M ≥ N, and one partition with all of the other "loser" Emps. In this case, only the M Emps in the second partition need to be sorted in order to find the N - L highest paid Emps in that partition.
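This three-way partitioning can be sketched as follows. The sketch is a simplified in-memory stand-in for the part-hybrid plan: l_cut and m_cut are hypothetical partitioning-vector entries assumed to classify correctly (i.e., fewer than N salaries exceed l_cut, and at least N exceed m_cut); the real plan would restart if they did not.

```python
def avg_top_n(salaries, n, l_cut, m_cut):
    """Sketch of the partition-based plan for Query 3: range-partition
    the salaries into winners (> l_cut, assumed to hold L < n values),
    a middle band (m_cut < s <= l_cut, with L + M >= n), and losers.
    Only the middle band is ever sorted."""
    winners = [s for s in salaries if s > l_cut]           # top L, no sort needed
    middle = sorted((s for s in salaries if m_cut < s <= l_cut), reverse=True)
    top = winners + middle[: n - len(winners)]             # n - L more from the band
    return sum(top) / n                                    # aggregate at the top
```

Only the middle band is sorted; the winners pass through unsorted and the losers are never touched, which is the source of the nearly constant cost of the Hybrid plan for this query.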

    Figure 8 shows the running times of the three alternative plans for this query. We can see that the Conventional Sort plan again has the highest cost because it sorts all 100,000



    Emps with age > 50, independent of N. Also, as in the previous experiments, the cost of the Classic Sort-Stop plan increases with N and is in between the Hybrid and Conventional Sort plans. What makes Query 3 and this experiment special is that the cost of the Hybrid plan is almost constant here because Hybrid sorts very few Emps, independent of N; only for very large N does the cost of the Hybrid plan slightly increase, due to materializing many Emp tuples. As a result, the differences in cost between the Classic Sort-Stop and Hybrid plans increase sharply with N, and the Hybrid plan outperforms the Conventional Sort plan even for N = 100,000. It should be noted that the cost of the Conventional Sort plan is lower for this query than in all previous experiments because this query can be evaluated using only the salary column of Emps (i.e., the other columns are projected out after the tbscan), permitting the sort to be carried out in one pass in memory. Similarly, the Classic Sort-Stop plan can be used for all N for this query without exhausting the buffer space.

    It is somewhat trickier to find a perfect partitioning vector for this query than for Queries 1 and 2. If we set C = L + M (C is the cut-point between winners and losers, as in Section 3.4), then we need to make sure that L < N in addition to C ≥ N and C as small as possible. In other words, here we need to find a good left cutting point, L, in addition to a good right cutting point, C, whereas we only needed to find a good right cutting point for Queries 1 and 2. Figure 9 shows the sensitivity of the cost of the Hybrid plan towards cases in which L is set imperfectly for N = 20,000 and C = 25,000. Obviously, the Hybrid plan performs best if L is close to N. However, the figure shows that the penalty for a poor setting of L is not severe (20% at most) due to the fact that the additional work is proportional to the number of misclassified tuples (i.e., errors here do not cause entire additional partitions to become involved in the query plan); in any case the Hybrid plan outperforms both traditional plans. (Figure 9 shows the cost of the Classic Sort-Stop plan, the better of the two traditional plans, as a baseline.)

    4.3 Join Queries

    The last example query of this section involves a join; this example shows that partitioning becomes even more attractive for more complex queries. The query asks for the N highest paid Emps that have age > 50 and that work in a Dept with budget > 1,000.

    Query 4:  SELECT *
              FROM Emp e, Dept d
              WHERE e.age > 50
                AND d.budget > 1000
                AND e.works-in = d.dno
              ORDER BY e.salary DESC
              STOP AFTER N;

    Figure 11 shows two traditional plans for this query. The first plan is based on (conventionally) sorting the Emp table into salary order and then probing the top Emps one by one in order to find out whether they work in a Dept with a high budget (i.e., it uses an index nested-loop join). The second plan carries out the join first, in order to find all Emps that work in a Dept with a high budget (Grace hash join is best for this purpose in our test database), and then it finds the N highest paid of these Emps using a sort-stop operator. As an alternative to these two traditional plans, Figure 12 shows two partitioning-based plans for this query. The idea here is to partition the Emp table before the join, and then to join one Emp partition at a time with the Dept table until at least N Emps that survive the join have been found. Thus, just as partitioning was used in the previous examples to avoid unnecessary sorting work, partitioning is utilized in these two join plans to avoid unnecessary sorting and join work. The difference between these two plans is that the first one uses index nested-loops for the join, whereas the second one uses hashing. Note that for small N, the hash join of the Part+HJ plan can be carried out in one pass if the part-hybrid operator partitions the data into memory-sized portions.
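The partition-at-a-time join loop at the heart of both partitioning plans can be sketched as follows. This is an illustrative in-memory rendering, not the engine's code: dept_ok stands in for the join with Dept plus the budget predicate, and cut_points plays the role of the (descending) salary partitioning vector.

```python
def partitioned_top_n_join(emps, dept_ok, n, cut_points):
    """Sketch of the Part+HJ/Part+NLJ idea for Query 4: range-partition
    Emps on salary (cut_points descending), join one partition at a time
    with Dept, and stop partitioning/join work as soon as at least n
    join survivors have been found."""
    survivors = []
    lo_bounds = list(cut_points) + [float("-inf")]     # last band catches the rest
    hi = float("inf")
    for lo in lo_bounds:
        part = [e for e in emps if lo < e["salary"] <= hi]  # one range partition
        survivors += [e for e in part if dept_ok(e)]        # join this partition only
        hi = lo
        if len(survivors) >= n:                             # later partitions untouched
            break
    survivors.sort(key=lambda e: e["salary"], reverse=True)
    return survivors[:n]
```

If the first partition already yields N survivors, the remaining partitions are never joined at all, which is where the savings over the GHJ+Sort-Stop plan come from.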

    Figure 10 shows the running times of the four plans, varying N and using our test database in which all Depts actually have a budget > 1,000. We see immediately that the partitioning plans clearly outperform the two traditional plans. The Sort+NLJ plan has the highest cost, independent of N, because it always sorts all 10 MB of Emps with age > 50. For N > 1000, it has extremely high costs because, in addition to the expensive sort, the NLJ becomes very costly because many Emp tuples generate probes, resulting in an excessive amount of random disk I/O. The GHJ+Sort-Stop plan has low sorting costs for small N, but it has poor performance because it performs a full-fledged Grace hash join. For N ≥ 50,000, the GHJ+Sort-Stop plan is again inapplicable because the buffer requirements of the sort-stop(N) operator exceed the limit of 4 MB. (Replacing the sort-stop(N) operator by a conventional sort and a scan-stop(N) operator would yield an execution time of about 250 secs here.) Both partitioning variants avoid unnecessary sorting and joining of Emp tuples. The Part+NLJ plan performs best for small N, but its performance deteriorates for N > 1,000 due to the high cost of the NLJ, just as in the Sort+NLJ plan. The Part+HJ plan shows better performance in these cases because hash joins are better than index nested-loop joins when both input tables are large.

    In terms of sensitivity, the points mentioned for Queries 1 and 2 still basically apply; we should make sure that the first partition contains all of the Emp tuples needed to answer the query. We must keep in mind, however, that for the Part+HJ plan, the penalty for setting the cut-point too low is higher than for the partitioning plans for the simple sort queries because a restart involves not only re-scanning the Emp table, but also re-scanning the Dept table in the Part+HJ plan. Since the Part+NLJ plan never actually scans the Dept table, the Part+NLJ plan does not pay this additional penalty for restarts.



    [Figure 11: Traditional Plans for Query 4 (Sort+NLJ and GHJ+Sort-Stop operator trees)]

    [Figure 10: Resp. Time (secs), Query 4; 4 MB Buffer, Perfect Partitioning]

    [Figure 12: Range Partitioning Plans for Query 4 (Part+NLJ and Part+HJ operator trees)]

    5 Other Techniques

    We have seen that range partitioning can be very helpful to improve the response time of several different types of STOP AFTER queries. In this section, we will show how two other techniques can be applied to evaluate STOP AFTER queries. The first technique is also based on partitioning, but it uses ordered indexes (e.g., B+ trees) to partition the data. The second technique is based on using semi-joins to reduce the size of temporary results.

    5.1 Partitioning with Indexes

    Let us return to Query 1, which asks for the N Emps with the highest salary and age > 50, and see what happens when we have a B+ tree on Emp.salary. The Traditional Index-Scan plan that executes this query using the Emp.salary index was shown in Figure 1 and discussed in Section 2.1; it reads the RIDs of the Emps one at a time in salary order from the index, then fetches the age, address, etc. fields and applies the age predicate, until the top N old Emps have been found. In Section 2.1, we noted that this plan would be very good if the Emp.salary index is clustered or N is very small, but that it would have a high cost if N is large and/or the age predicate filters out many highly paid Emps because, in this case, the ridscan operator would lead to a great deal of random I/O and many page faults for rereading pages of the Emp table if the buffer is too small to hold all of the relevant pages of the Emp table.

    For large N and unclustered indexes, we can do better by using, of course, partitioning. The idea is to read the RIDs of the top N' Emps from the index, sort these N' RIDs in page id order, do the ridscan with the predicate, re-sort the result into salary order, and cut off the top N tuples, or repeat if fewer than N of the top N' Emps have age > 50.

    Similar RID sorting ideas are known as RID-list processing and have been commonly exploited in text databases (e.g., [BCC94]) and set query processing (e.g., [Yao79]), but they can only be applied in the STOP AFTER context if they are combined with partitioning. The beauty of this Part-Index approach is that the ridscan operator becomes quite cheap since it reads the Emp pages sequentially and reads no Emp page more than once from disk. On the negative side, this approach involves two sorting steps. If N and N' are small, however, these sorts are fairly cheap because they can be carried out in one pass in memory.
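The Part-Index loop, including its restart behavior, can be sketched as follows. This is an illustrative in-memory model, not engine code: the index is a list of (salary, rid) pairs in descending salary order, a rid is a (page_id, slot) pair, and pages[page_id][slot] stands in for the base table page. All names are our own.

```python
def part_index_topn(index, pages, n, n_prime, pred):
    """Sketch of the Part-Index plan: take the next n_prime RIDs from the
    salary-descending index, sort them in page-id order so the ridscan
    touches each page at most once and in sequence, apply the predicate,
    and repeat (a cheap restart: just continue the index scan) until n
    survivors are found; finally re-sort survivors into salary order."""
    survivors, pos = [], 0
    while len(survivors) < n and pos < len(index):
        batch = index[pos: pos + n_prime]            # next n_prime index entries
        pos += n_prime
        rids = sorted(rid for _, rid in batch)       # RID sort: page-id order
        fetched = [pages[p][s] for p, s in rids]     # sequential, once-per-page fetch
        survivors += [t for t in fetched if pred(t)]
    survivors.sort(key=lambda t: t["salary"], reverse=True)  # back into salary order
    return survivors[:n]
```

Note how cheap a restart is here compared with the table-scan plans: it merely continues the index scan and fetches the next N' tuples.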

    Figure 13 shows the running times of the Traditional Index-Scan plan and a Part-Index plan for Query 1, varying N. As baselines, the figure also shows the running times of the Hybrid plan that does not use the Emp.salary index (as in Figure 4) and the ideal running time for Query 1, generated by running the Traditional Index-Scan plan on a special version of our test database in which the Emp.salary index is clustered. We see that the Part-Index plan clearly outperforms the Traditional Index-Scan plan for a large range of N. While the Traditional Index-Scan plan is only attractive for N ≤ 100, the Part-Index plan shows almost ideal performance up to N = 10,000. (After that its sorts become too expensive.) Only for N = 10 does the Traditional Index-Scan plan slightly outperform the Part-Index plan (0.9 secs vs. 1 sec).

    The right setting of N', of course, depends upon both N and the selectivity of the age predicate. In this example, N' should be set to 5 * N because every fifth Emp is older than 50 in our test database and the values of the salary and age columns are not correlated. It should be noted, however, that the penalty for restarts in the Part-Index plan is very low: rather than re-scanning the entire Emp table, a restart simply involves continuing the Emp.salary index scan and fetching the next N' Emp tuples.



    [Figure 13: Resp. Time (secs), Query 1; 4 MB Buffer, Unclustered Emp.salary Index]

    5.2 A Semi-Join-Like Technique

    The idea of semi-joins is to reduce the cost of I/O-intensive operations, such as sorts, joins, and group-by operations, by projecting out all but those columns actually needed for an operation; doing so reduces the size of the temporary results that need to be read and/or written to disk or shipped through a network. The disadvantage of semi-joins is that columns that were projected out must be re-fetched (using a ridscan operator) after the operation in order to carry out subsequent operations or because they are part of the query result. Semi-join techniques have been extensively used in early distributed database systems (e.g., [BGW+81]), but they have not been widely used in centralized database systems. One reason for this is the potentially prohibitively high and unpredictable cost of re-fetches, though the cost of the re-fetches can be reduced with the same RID sorting trick described in the previous subsection. What makes semi-join-like techniques attractive for STOP AFTER queries is that the costs of the re-fetches are limited and can be predicted accurately during query optimization if the re-fetches are always carried out at the end of query execution: if a query asks for the top N tuples, then at most N re-fetches are required at the end.
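A minimal sketch of this semi-join-like strategy with the RID sorting trick (a Python list indexed by position stands in for the base table and its RIDs; the names are our own, illustrative choices):

```python
def semijoin_topn(emps, n, pred):
    """Sketch of an SJ+RIDsort plan for a top-N sort query: sort only
    slim (salary, rid) pairs for the qualifying tuples (small enough for
    a one-pass in-memory sort), keep the top n RIDs, then re-fetch at
    most n full tuples, sorting those RIDs first so the re-fetch reads
    pages in storage order."""
    slim = [(e["salary"], rid) for rid, e in enumerate(emps) if pred(e)]  # project
    slim.sort(reverse=True)                        # cheap in-memory sort of slim tuples
    top_rids = [rid for _, rid in slim[:n]]        # at most n re-fetches needed
    fetched = {rid: emps[rid] for rid in sorted(top_rids)}  # RID-sorted re-fetch
    return [fetched[rid] for rid in top_rids]      # result back in salary order
```

The key property for optimization is visible in the sketch: however large the table, the number of re-fetches is bounded by n, so the re-fetch cost is predictable at plan time.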

    We studied two different semi-join-like plans for Query 1. In both plans, the sorting of the Emp tuples can be carried out in one pass in main memory because only the RIDs and the salary fields of the Emp tuples are kept. The difference between the two plans is that the first plan, which we call the Standard SJ Plan, does not apply the RID sorting trick described in the previous subsection in order to improve the re-fetches, whereas the second plan, the SJ+RIDsort Plan, does apply this trick. We ran both plans with varying N, and Figure 14 shows the results of these experiments. The figure also shows the results of the Conventional Sort and Hybrid plans of Section 3 as baselines. First, we see that the impact of the RID sorting trick is less pronounced than in the previous experiments with the Part-Index plan. The reason is that in both semi-join plans, the Emp table has already been read once, and the old Emps have already been filtered out, so the ridscan gets more hits in the buffer pool and is applied to fewer tuples. Second, we observe that for small N, the two semi-join plans indeed outperform the Conventional Sort plan, while for N ≥ 1000, the performance of the semi-join plans deteriorates due to their high re-fetching costs. Finally, we see that the semi-join plans are clearly outperformed for all N by the partition-based Hybrid plan; like the semi-join plans, the Hybrid plan of Section 3 carries out its sort(salary) in one pass (for N ≤ 10,000), and the Hybrid plan has the additional advantage of sorting slightly more than N rather than all 100,000 Emps with age > 50. We note that it is not too difficult to find other example queries where a semi-join plan would actually outperform a partitioning plan (e.g., for very large and highly selective joins with small N); sometimes the best plan to execute a query may be a combination of partitioning and semi-joins.

    [Figure 14: Resp. Time (secs), Query 1; 4 MB Buffer, Semi-join Plans]

    6 Conclusions

    In this paper, we presented several new strategies for efficiently processing STOP AFTER queries. These strategies, based largely on the use of range partitioning techniques, were shown to provide significant savings for handling important classes of STOP AFTER queries. We presented examples including basic top N queries, percent queries, subqueries, and joins; we saw benefits from the use of the new partitioning-based techniques in each case due to the reduction in wasted sorting and/or joining effort that they offer. We showed that range partitioning can be useful for indexed as well as non-indexed plans, and we also showed that semi-join-like techniques can provide additional savings in some cases.

    There are several areas that appear fruitful for future work. One area of interest is STOP AFTER query processing for parallel database systems. Our techniques should be immediately applicable there, and they may offer even greater benefits by reducing the amount of data communication required to process such queries. Another avenue for future investigation would be experimentation with the effectiveness of histogram and/or sampling techniques for determining the partitioning vector entries for STOP AFTER queries on real data sets.



    Acknowledgments

    We would like to thank Paul Brown of Informix, Jim Gray of Microsoft, Anil Nori of Oracle (and DEC, in a previous job), and Donovan Schneider of Red Brick Systems for informing us about the support for top N query processing in their respective companies' current database products.

    References

    [AHU83] A. Aho, J. Hopcroft, and J. Ullman. Data Structures and Algorithms. Addison-Wesley, Reading, MA, USA, 1983.

    [BCC94] E. Brown, J. Callan, and B. Croft. Fast incremental indexing for full-text information retrieval. In Proc. of the Conf. on Very Large Data Bases (VLDB), pages 192-202, Santiago, Chile, September 1994.

    [BGW+81] P. Bernstein, N. Goodman, E. Wong, C. Reeve, and J. Rothnie. Query processing in a system for distributed databases (SDD-1). ACM Trans. on Database Systems, 6(4), December 1981.

    [CG96] S. Chaudhuri and L. Gravano. Optimizing queries over multimedia repositories. In Proc. of the ACM SIGMOD Conf. on Management of Data, pages 91-102, Montreal, Canada, June 1996.

    [CK97] M. Carey and D. Kossmann. On saying "enough already!" in SQL. In Proc. of the ACM SIGMOD Conf. on Management of Data, pages 219-230, Tucson, AZ, USA, May 1997.

    [DG92] D. DeWitt and J. Gray. Parallel database systems: The future of high performance database systems. Communications of the ACM, 35(6):85-98, June 1992.

    [DNS91a] D. DeWitt, J. Naughton, and D. Schneider. An evaluation of non-equijoin algorithms. In Proc. of the Conf. on Very Large Data Bases (VLDB), pages 443-452, Barcelona, Spain, September 1991.

    [DNS91b] D. DeWitt, J. Naughton, and D. Schneider. Parallel sorting on a shared-nothing architecture using probabilistic splitting. In Proc. of the Intl. IEEE Conf. on Parallel and Distributed Information Systems, pages 280-291, Miami, FL, USA, December 1991.

    [DNSS92] D. DeWitt, J. Naughton, D. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In Proc. of the Conf. on Very Large Data Bases (VLDB), pages 27-40, Vancouver, Canada, August 1992.

    [HS92] P. Haas and A. Swami. Sequential sampling procedures for query size estimation. In Proc. of the ACM SIGMOD Conf. on Management of Data, pages 341-350, June 1992.

    [Knu73] D. Knuth. The Art of Computer Programming - Sorting and Searching, volume 3. Addison-Wesley, Reading, MA, USA, 1973.

    [Koo80] R. Kooi. The optimization of queries in relational databases. PhD thesis, Case Western Reserve University, September 1980.

    [KS95] R. Kimball and K. Strehlo. Why decision support fails and how to fix it. ACM SIGMOD Record, 24(3):92-97, September 1995.

    [LNS90] R. Lipton, J. Naughton, and D. Schneider. Practical selectivity estimation through adaptive sampling. In Proc. of the ACM SIGMOD Conf. on Management of Data, pages 1-11, Atlantic City, USA, April 1990.

    [MD88] M. Muralikrishna and D. DeWitt. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In Proc. of the ACM SIGMOD Conf. on Management of Data, pages 28-36, Chicago, IL, USA, May 1988.

    [PI97] V. Poosala and Y. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proc. of the Conf. on Very Large Data Bases (VLDB), pages 486-495, Athens, Greece, August 1997.

    [PIHS96] V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. Improved histograms for selectivity estimation of range predicates. In Proc. of the ACM SIGMOD Conf. on Management of Data, pages 294-305, Montreal, Canada, June 1996.

    [PSC84] G. Piatetsky-Shapiro and C. Connell. Accurate estimation of the number of tuples satisfying a condition. In Proc. of the ACM SIGMOD Conf. on Management of Data, pages 256-276, Boston, USA, June 1984.

    [WKHM98] T. Westmann, D. Kossmann, S. Helmer, and G. Moerkotte. The implementation and performance of compressed databases. 1998. Submitted for publication.

    [Yao79] S. Yao. Optimization of query evaluation algorithms. ACM Trans. on Database Systems, 4(2):133-155, 1979.
