46
SPRING 2006 CENG 352 Database Management System s 1 Query Optimization

SPRING 2006CENG 352 Database Management Systems1 Query Optimization

  • View
    231

  • Download
    4

Embed Size (px)

Citation preview

Page 1: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 1

Query Optimization

Page 2: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 2

Query Evaluation

• Problem: An SQL query is declarative – does not specify a query execution plan.

• A relational algebra expression is procedural – there is an associated query execution plan.

• Solution: Convert SQL query to an equivalent relational algebra and evaluate it using the associated query execution plan.– But which equivalent expression is best?

Page 3: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 3

Naive ConversionSELECT DISTINCT TargetListFROM R1, R2, …, RNWHERE Condition

is equivalent to

TargetList (Condition (R1 R2 ... RN))

but this may imply a very inefficient query execution plan.

Example: Name (Id=ProfId CrsCode=‘CS532’ (ProfessorProfessor TeachingTeaching))

• Result can be < 100 bytes• But if each relation is 50K then we end up computing an intermediate result ProfessorProfessor TeachingTeaching of size 1G before shrinking it down to just a few bytes.

Problem: Find an equivalent relational algebra expression that can be evaluated “efficiently”.

Page 4: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 4

Query Processing Architecture

Page 5: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 5

Query Optimizer

• Uses heuristic algorithms to evaluate relational algebra expressions. This involves:

1. Enumerating alternative plans (Typically a subset of all possible plans).• Needs equivalence rules

2. Estimating the cost of each enumerated plan and choosing the plan with the lowest estimated cost.

• Query optimizers actually do not “optimize” – just try to find “reasonably good” evaluation strategies

Page 6: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 6

Selection and Projection Rules

• Break complex selection into simpler ones: Cond1Cond2 (R) Cond1 (Cond2 (R) )

• Break projection into stages: attr (R) attr ( attr (R)), if attr attr

• Commute projection and selection: attr (Cond(R)) Cond ( attr (R)),

if attr all attributes in Cond

Page 7: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 7

Commutativity and Associativity of Join (and Cartesian Product as Special Case)

• Join commutativity: R S S R– used to reduce cost of nested loop evaluation strategies (smaller relation

should be in outer loop)

• Join associativity: R (S T) (R S) T– used to reduce the size of intermediate relations in computation of multi-

relational join – first compute the join that yields smaller intermediate result

• N-way join has T(N) N! different evaluation plans– T(N) is the number of different binary trees with N leaf nodes

– N! is the number of permutations

• Query optimizer cannot look at all plans (might take longer to find an optimal plan than to compute query brute-force). Hence it does not necessarily produce optimal plan

Page 8: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 8

Pushing Selections and Projections

Cond (R S) R Cond S – Cond relates attributes of both R and S– Reduces size of intermediate relation since rows can be

discarded sooner

Cond (R S) Cond (R) S – Cond involves only the attributes of R– Reduces size of intermediate relation since rows of R

are discarded sooner

attr(R S) attr(attr (R) S),

if attributes(R) attr attr– reduces the size of an operand of product

Page 9: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 9

Equivalence Example

C1 C2 C3 (RR SS) C1 (C2 (C3 (RR SS) ) )

C1 (C2 (RR) C3 (SS) )

C2 (RR) C1 C3 (SS)

assuming C2 involves only attributes of RR, C3 involves only attributes of SS, and C1 relates attributes of R and SS

Page 10: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 10

SELECT P.NameFROM ProfessorProfessor P, TeachingTeaching TWHERE P.Id = T.ProfId -- join condition

AND P. DeptId = ‘CS’ AND T.Semester = ‘F1994’

Name(DeptId=‘CS’ Semester=‘F1994’(ProfessorProfessor Id=ProfId TeachingTeaching))

Estimating Cost - Example 1

Name

DeptId=‘CS’ Semester=‘F1994’

Id=ProfId

ProfessorProfessor TeachingTeaching

Master query execution plan (nothing pushed)

Page 11: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 11

Metadata on Tables (in system catalog)

– ProfessorProfessor (Id, Name, DeptId)• size: 200 pages, 1000 rows, 50 departments

• indexes: clustered, 2-level B+tree on DeptId, hash on Id

– TeachingTeaching (ProfId, CrsCode, Semester)• size: 1000 pages, 10,000 rows, 4 semesters

• indexes: clustered, 2-level B+tree on Semester;

hash on ProfId

– Definition: WeightWeight of an attribute – average number of rows that have a particular value

• weight of Id = 1 (it is a key)

• weight of ProfId = 10 (10,000 classes/1000 professors)

Page 12: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 12

Estimating Cost - Example 1• Join - block-nested loops with 52 page buffer (50

pages – input for Professor, Professor, 1 page – input for TeachingTeaching, 1 – output page– Scanning ProfessorProfessor (outer loop): 200 page transfers,

(4 iterations, 50 transfers each)– Finding matching rows in TeachingTeaching (inner loop):

1000 page transfers for each iteration of outer loop– Total cost = 200+4*1000 = 4200 page transfers

Page 13: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 13

Estimating Cost - Example 1 (cont’d)

• Selection and projection – scan rows of intermediate file, discard those that don’t satisfy selection, project on those that do, write result when output buffer is full.

• Complete algorithm:– do join, write result to intermediate file on disk– read intermediate file, do select/project, write final

result– Problem: unnecessary I/O

Page 14: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 14

Pipelining• Solution: use pipeliningpipelining:

– join and select/project act as coroutines, operate as producer/consumer sharing a buffer in main memory.

• When join fills buffer; select/project filters it and outputs result

• Process is repeated until select/project has processed last output from join

– Performing select/project adds no additional cost

joinjoin selectselect/projectprojectIntermediate

result

outputfinal result

buffer

Page 15: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 15

Estimating Cost - Example 1 (cont’d)

• Total cost:4200 + (cost of outputting final result)

– We will disregard the cost of outputting final result in comparing with other query evaluation strategies, since this will be same for all

Page 16: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 16

The resulting Query Execution Plan 1

Name

DeptId=‘CS’ Semester=‘F1994’

Id=ProfIdProfessorProfessor TeachingTeaching

(Block Nested Loops)

(On-the-fly)

(On-the-fly)Pipelining

– Note: different join algorithms (index-nested loop, sort-merge etc.) could have been used for the same tree to get different query execution plans with possibly different costs.

Page 17: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 17

Name(Semester=‘F1994’ (DeptId=‘CS’ (ProfessorProfessor) Id=ProfId TeachingTeaching))

Cost Example 2SELECT P.NameFROM ProfessorProfessor P, TeachingTeaching TWHERE P.Id = T.ProfId AND

P. DeptId = ‘CS’ AND T.Semester = ‘F1994’

Name

Semester=‘F1994’

DeptId=‘CS’

ProfessorProfessor TeachingTeaching

Id=ProfId

Partially pushed plan: selection pushed to ProfessorProfessor

Page 18: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 18

Cost Example 2 -- selection• Compute DeptId=‘CS’ (ProfessorProfessor) (to reduce size of one

join table) using clustered, 2-level B+ tree on DeptId.– 50 departments and 1000 professors; hence weight

of DeptId is 20 (roughly 20 CS professors). These rows are in ~4 consecutive pages in ProfessorProfessor.

• Cost = 4 (to get rows) + 2 (to search index) = 6

• keep resulting 4 pages in memory and pipe to next step

clustered indexon DeptId rows of

ProfessorProfessor

Page 19: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 19

Cost Example 2 -- join

• Index-nested loops join using hash index on ProfId of TeachingTeaching and looping on the selected professors (computed on previous slide)– Since selection on Semester was not pushed, hash

index on ProfId of TeachingTeaching can be used – Note: if selection on Semester were pushed, the

index on ProfId would have been lost – an advantage of not using a fully pushed query execution plan

Page 20: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 20

Cost Example 2 – join (cont’d)– Each professor matches ~10 TeachingTeaching rows. Since 20 CS

professors, hence 200 teaching records.

– All index entries for a particular ProfId are in same bucket. Assume ~1.2 I/Os to get a bucket.

• Cost = 1.2 20 (to fetch index entries for 20 CS professors) + 200 (to fetch TeachingTeaching rows, since hash index is unclustered) = 224

TeachingTeachinghash

1.220 10

ProfId

Page 21: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 21

Cost Example 2 – select/project

• Pipe result of join to select (on Semester) and project (on Name) at no I/O cost

• Cost of output same as for Example 1

• Total cost:6 (select on ProfessorProfessor) + 224 (join) = 230

• Comparison: 4200 (example 1) vs. 230 (example 2) !!!

Page 22: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 22

Resulting Query Execution Plan 2

Name

Semester=‘F1994’

DeptId=‘CS’

ProfessorProfessor TeachingTeaching

Id=ProfIduse clustered B+tree index

indexed nested loop

on the fly

on the fly

Note : Other possibilities are block nested-loops, sort-merge.

Page 23: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 23

Other Possible Query Trees

Name

ProfessorProfessor TeachingTeaching

Name

DeptId=‘CS’

Semester=‘F1994’

ProfessorProfessor TeachingTeaching

Id=ProfId

DeptId=‘CS’ Semester=‘F1994’

Id=ProfId

Page 24: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 24

Estimating Output Size

• It is important to estimate the size of the output of a relational expression – size serves as input to the next stage and affects the choice of how the next stage will be evaluated.

• Size estimation uses the following measures on a particular instance of RR:– Tuples(RR): number of tuples – Blocks(RR): number of blocks– Values(RR.A): number of distinct values of A– MaxVal(RR.A): maximum value of A– MinVal(RR.A): minimum value of A

Page 25: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems

25

Reduction Factor

• Consider a query block:

• Every term in WHERE eliminates some of the result tuples.

• A reduction factor is associated with each term.

• The number of tuples in the result is estimated as the maximum size (i.e. product of cardinalities) times the product of reduction factors.

SELECT attribute list

FROM relation list

WHERE term1 term2 ... termn

Page 26: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 26

• reduction (RRii.A=val) =

– if no index and no statistics about the distinct values: 1/10 (arbitrarily).

• reduction (RRii.A > val) =

– No index, not of arithmetic type: 1/2 (arbitrarily).

MaxVal(RRii.A) – val

MaxVal(RRii.A) – MinVal(Ri.A)

Computing reduction factors 1Values(RRii.A)

Page 27: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 27

Reduction factors

• reduction (RRii.A=RRjj.B) =

– if only one of the columns have an index:

– if no index: 1/10 arbitrarily

1max(Values(RRii.A), Values(RRjj.B))

1

Values(R.A)

Page 28: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 28

Reduction Due to Complex Conditions

• reduction(Cond1 AND Cond2) = reduction(Cond1) reduction(Cond2)

• reduction(Cond1 OR Cond2) =

min(1, reduction(Cond1) + reduction(Cond2))

Page 29: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 29

Reduction Due to TargetList

• reduction(TargetList) =

number-of-attributes (TargetList) i number-of-attributes (RRi)

Page 30: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 30

Choosing Query Execution Plan

• Step 1: Choose a logical plan

• Step 2: Reduce search space

• Step 3: Use a heuristic search to further reduce complexity

Page 31: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 31

Step 1: Choosing a Logical Plan

• Involves choosing a query tree, which indicates the order in which algebraic operations are applied

• Heuristic: Pushed trees are good, but sometimes “nearly fully pushed” trees are better due to indexing (as we saw in the example)

• So: Take the initial “master plan” tree and produce a fully pushed tree plus several nearly fully pushed trees.

Page 32: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 32

Step 2: Reduce Search Space

• Deal with associativity of binary operators (join, union, …)

A B C DA B C D

A B C DA B C D

DD

CC

A BA B

Logical queryexecution plan

Equivalentquery tree

Equivalent left left deep query treedeep query tree

Page 33: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 33

Step 2 (cont’d)

• Two issues:– Choose a particular shape of a tree (like in the

previous slide)• Equals the number of ways to parenthesize N-way

join – grows very rapidly

– Choose a particular permutation of the leaves• E.g., 4! permutations of the leaves A, B, C, DA, B, C, D

Page 34: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 34

Step 2: Dealing With Associativity• Too many trees to evaluate: settle on a particular

shape: left-deep tree.left-deep tree. – Used because it allows pipeliningpipelining:

– Property: once a row of XX has been output by P1, it need not be output again (but C may have to be processed several times in P2 for successive portions of XX)

– Advantage: none of the intermediate relations (X, YX, Y) have to be completely materialized and saved on disk.

• Important if one such relation is very large, but the final result is small

A BA B X CX C Y DY DXX YY

P1 P2 P3

Page 35: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 35

Step 3: Heuristic Search

• The choice of left-deep trees still leaves open too many options (N! permutations):– (((A BA B) CC) DD), – (((C AC A) DD) BB), …..

• A heuristic (often dynamic programming based) algorithm is used to get a ‘good’ plan

Page 36: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems 36

Step 3: Dynamic Programming Algorithm

• To compute a join of E1, E2, …, EN in a left-deep manner:– Start with 1-relation expressions (can involve , )– Choose the best and “nearly best” plans for each (a

plan is considered nearly best if its output has some “interesting” form, e.g., is sorted)

– Combine these 1-relation plans into 2-relation expressions. Retain only the best and nearly best 2-relation plans

– Do same for 3-relation expressions, etc.

Page 37: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems

37

Enumeration of Plans

• ORDER BY, GROUP BY, aggregates etc. handled as a final step, using either an `interestingly ordered’ plan or an additonal sorting operator.

• An N-1 way plan is not combined with an additional relation unless there is a join condition between them, unless all predicates in WHERE have been used up.– i.e., avoid Cartesian products if possible.

• In spite of pruning plan space, this approach is still exponential in the # of tables.

Page 38: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems

38

Example 1

• Pass1:– Sailors: B+ tree matches rating>5, and is probably

cheapest. However, if this selection is expected to retrieve a lot of tuples, and index is unclustered, file scan may be cheaper.

• Still, B+ tree plan kept (because tuples are in rating order).

– Reserves: B+ tree on bid matches bid=100; cheapest.

Sailors: B+ tree on rating Hash on sidReserves: B+ tree on bid

Pass 2:– We consider each plan retained from Pass 1 as the outer, and consider how to join it with the (only) other relation.

e.g., Reserves as outer: Hash index can be used to get Sailors tuples

that satisfy sid = outer tuple’s sid value.

Reserves Sailors

sid=sid

bid=100 rating > 5

sname

Page 39: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems

39

Example 2

• Available Indexes :

SELECT S.sid, COUNT(*) As numresFROM Boats B, Reserves R, Sailors SWHERE R.sid = S.sid and B.bid = R.bid and B.color = ‘redGROUP BY S.sid

Reserves

Sailors

sid=sid

color = red

sid,count(*) as numres

GROUP BYsid

bid=bid

Boats

Sailors: B+ tree on sid Hash on sidReserves: B+ tree on sid B+ tree on bid (clustered)Boats: B+ tree on color Hash on color

Page 40: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems

40

• Pass 1: Find best plans for each file.– Sailors: File scan

– Reserves: File scan

– Boats: hash index on color, B+ tree index on color

• Pass 2: Possible joins are considered.– Reserves Boats

– Reserves Sailors

– Sailors Boats

– Sailors Reserves

– Boats (access B+ tree) Sailors

– Boats (access B+ tree) Reserves

– Boats (access hash) Sailors

– Boats (access hash) Reserves

•For each such pair, consider every join method and for each join method consider every access path for the inner relation.

•For each pair keep the cheapest of the plans & the cheapest plan that generates tuples in sorted order

Page 41: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems

41

• Pass 3: For each plan retained in pass 2, taken as outer relation, consider how to join remaining relation as inner one. Example plan for this pass:

Boats (via hash) Reserves (via B+tree) (sort-merge)

Result is joined with Sailors (via B+ tree) (sort merge)

• GROUP BY is considered after all joins. It requires sorting on sid.

Page 42: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems

42

Translating Complex SQL Queries into R.A.

• As an example consider the following SQL query: SELECT sname, age

FROM SailorsWHERE rating > SELECT max(rating)

FROM Sailors WHERE age = 30

• Decompose into 2 blocks:

SELECT max(rating)

FROM Sailors

WHERE age = 30

SELECT sname, age

FROM Sailors

WHERE rating > constant

Max(rating (age = 30 (Sailors)) sname,age (rating>constant (Sailors))

Page 43: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems

43

Query Blocks: Units of Optimization

• An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.

• Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.

SELECT S.snameFROM Sailors SWHERE S.age IN (SELECT MAX (S2.age) FROM Sailors S2 GROUP BY S2.rating)

Nested blockOuter block For each block, the plans considered are:

– All available access methods, for each reln in FROM clause.

– All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner reln in the FROM clause, considering all reln permutations and join methods.)

Page 44: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems

44

Nested Queries

• Nested block is optimized independently, with the outer tuple considered as providing a selection condition.

• Outer block is optimized with the cost of `calling’ nested block computation taken into account.

• Implicit ordering of these blocks means that some good strategies are not considered. The non-nested version of the query is typically optimized better.

SELECT S.snameFROM Sailors SWHERE EXISTS (SELECT * FROM Reserves R WHERE R.bid=103 AND R.sid=S.sid)

Nested block to optimize: SELECT * FROM Reserves R WHERE R.bid=103 AND S.sid= outer valueEquivalent non-nested query:

SELECT S.snameFROM Sailors S, Reserves RWHERE S.sid=R.sid AND R.bid=103

Page 45: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems

45

Summary

• Query optimization is an important task in a relational DBMS.

• Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).

• Two parts to optimizing a query:– Consider a set of alternative plans.

• Must prune search space; typically, left-deep plans only.

– Must estimate cost of each plan that is considered.• Must estimate size of result and cost for each plan node.

• Key issues: Statistics, indexes, operator implementations.

Page 46: SPRING 2006CENG 352 Database Management Systems1 Query Optimization

SPRING 2006 CENG 352 Database Management Systems

46

Summary (Contd.)• Single-relation queries:

– All access paths considered, cheapest is chosen.– Issues: Selections that match index, whether index key has all

needed fields and/or provides tuples in a desired order.

• Multiple-relation queries:– All single-relation plans are first enumerated.

• Selections/projections considered as early as possible.

– Next, for each 1-relation plan, all ways of joining another relation (as inner) are considered.

– Next, for each 2-relation plan that is `retained’, all ways of joining another relation (as inner) are considered, etc.

– At each level, for each subset of relations, only best plan for each interesting order of tuples is `retained’.