Relational Database Performance CSCI 6442 Copyright 2013, David C. Roberts, all rights reserved


Relational Database Performance

CSCI 6442

Copyright 2013, David C. Roberts, all rights reserved

“Optimization”

More properly called access path selection

“Optimizer” selects a strategy for processing

Approaches:

◦ Cost-based: estimate total cost to process by different approaches, choose lowest estimate

◦ Heuristic: use rules to decide how to process

Cost-based optimization is typically used by database systems today

RUNSTATS

RUNSTATS is the name of a statistics-gathering utility first included in IBM’s DB2

It scans the database, gathers statistics used for estimating costs for access path selection

DBA determines how often and when to run the utility

What statistics do you think are gathered?
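One plausible answer, sketched in Python: a single scan that records the row count and, per column, the number of distinct values and the low and high values. The names `ColumnStats` and `gather_stats` and the sample `EMP` rows are assumptions made for illustration; this is not the set of statistics RUNSTATS actually collects.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    distinct_values: int
    low_value: object
    high_value: object


def gather_stats(rows, columns):
    """Scan every row once and collect simple per-column statistics."""
    seen = {c: set() for c in columns}
    for row in rows:
        for c in columns:
            seen[c].add(row[c])
    stats = {c: ColumnStats(len(v), min(v), max(v)) for c, v in seen.items()}
    return len(rows), stats


# Hypothetical sample data, not taken from the slides.
EMP = [
    {"ENAME": "SMITH", "JOB": "CLERK", "SAL": 8},
    {"ENAME": "ALLEN", "JOB": "SALESMAN", "SAL": 16},
    {"ENAME": "JONES", "JOB": "CLERK", "SAL": 30},
    {"ENAME": "KING", "JOB": "PRESIDENT", "SAL": 50},
]

row_count, emp_stats = gather_stats(EMP, ["JOB", "SAL"])
print(row_count, emp_stats["SAL"])
```

The scan touches every row, which is why the cost of gathering statistics grows with the size of the database.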

Quandary

The more that RUNSTATS collects, the better job the optimizer can do of selecting efficient processing methods

However, RUNSTATS uses a lot of resources, scanning every relation

Use of RUNSTATS must be balanced against its cost

The Optimizer

Selects which indexes to use

Chooses the order of using indexes

Chooses algorithms to use

Decides when to apply predicates

SQL Statement Parts of Interest

Simple query:

SELECT ENAME, JOB
FROM EMP
WHERE SAL > 20
   OR JOB = 'VP'
   OR JOB LIKE 'PRES%';

We’re interested in the FROM clause (that tells us the table names) and the WHERE clause (that tells us the predicates)

SINGLE-TABLE QUERIES


Predicates

WHERE clause of SQL statement is made up of predicates

Each predicate is a condition

Each condition references a column

Conditions may be equality, inequality, range, LIKE

The first three conditions use an index if one exists, scan the table if no index exists

Example

SAL > 20 OR
JOB = 'VP' OR
JOB LIKE 'PRES%';

For each predicate, do we use an index to retrieve rows that make it true, then examine each row for the other predicates?

Predicate Selectivity

Selectivity: an estimate of the fraction of rows of a table that make a predicate true

Classes of Predicates

Predicate: condition in the WHERE clause

Predicates are combined using AND, OR to make WHERE clauses

Classes of predicates:

◦ Sargable: search arguments that can be processed close to the data

◦ Residual: not sargable, such as complex use of nesting

Access Paths

Five possible access paths:

◦ Table scan

◦ Non-selective index scan

◦ Selective index scan

◦ Index only access

◦ Fully qualified unique index

Each of these types of scans has different cost estimates for its use

Predicate Selectivity

Selectivity function f(p): fraction of rows retrieved on average by predicate p

Number of rows retrieved is strongly related to the cost to carry out the operation

n = number of rows in table

Form of p                 f(p)

column = value            1/n
column != value           1 - 1/n (nearly 1)
column > value            (high value - search value)/(high value - low value)
column LIKE 'value'       n
p1 or p2                  f(p1) + f(p2)
p1 and p2                 f(p1) * f(p2)
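The formulas above can be tried out with a short Python sketch. The function names and the sample numbers (1000 rows, SAL between 0 and 100) are assumptions for illustration. The LIKE row is left out, and capping the OR estimate at 1 is an addition not present in the table.

```python
def sel_equal(n):
    """column = value, with n rows in the table."""
    return 1 / n


def sel_not_equal(n):
    """column != value."""
    return 1 - 1 / n


def sel_greater(search_value, low_value, high_value):
    """column > value, from the low and high values of the column."""
    return (high_value - search_value) / (high_value - low_value)


def sel_or(f1, f2):
    """p1 or p2: sum of selectivities, capped at 1 (the cap is an assumption)."""
    return min(1.0, f1 + f2)


def sel_and(f1, f2):
    """p1 and p2: product of selectivities."""
    return f1 * f2


# Hypothetical statistics: 1000 rows, SAL ranges from 0 to 100.
n = 1000
f_sal = sel_greater(20, 0, 100)    # SAL > 20
f_job = sel_equal(n)               # JOB = 'VP'
f_where = sel_or(f_sal, f_job)     # SAL > 20 OR JOB = 'VP'
estimated_rows = f_where * n
print(f_sal, f_job, f_where, estimated_rows)
```

Multiplying the selectivity by n gives the estimated number of rows retrieved, which is the quantity the cost estimate depends on.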

Single-Table Queries

Find out which columns that are referenced in the WHERE clause have indexes

Find out selectivity of indexes

Estimate selectivity of each predicate

Use most selective index-predicate combination to retrieve rows that satisfy one predicate

Examine each row for other predicates
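A minimal sketch of these steps in Python, assuming the predicates are combined with AND and that each selectivity has already been estimated. The function name `choose_driving_predicate`, the dictionary layout, and the sample numbers are hypothetical.

```python
def choose_driving_predicate(predicates, indexed_columns):
    """Pick the indexed predicate with the lowest selectivity value.

    Returns (driver, others). A driver of None means no index applies,
    so the table is scanned and every predicate is checked per row.
    """
    candidates = [p for p in predicates if p["column"] in indexed_columns]
    if not candidates:
        return None, list(predicates)
    driver = min(candidates, key=lambda p: p["selectivity"])
    others = [p for p in predicates if p is not driver]
    return driver, others


# Hypothetical predicates and selectivity estimates.
predicates = [
    {"text": "SAL > 20", "column": "SAL", "selectivity": 0.8},
    {"text": "JOB = 'VP'", "column": "JOB", "selectivity": 0.001},
    {"text": "ENAME LIKE 'S%'", "column": "ENAME", "selectivity": 0.05},
]

driver, others = choose_driving_predicate(predicates, {"SAL", "JOB"})
print("retrieve rows with:", driver["text"])
print("then check each row for:", [p["text"] for p in others])
```

Here ENAME has no index, so it can only be checked row by row, and the JOB predicate wins over SAL because it is expected to return far fewer rows.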

MULTI-TABLE QUERIES (I.E., JOINS)

Join

Result of a join is a subset of the Cartesian product of the tables being joined

Cartesian product of two tables with m and n rows is a new table of mn rows, where every row of the product consists of one row of the first table and one row of the second table

Example Join

Simple Join Processing Algorithm

1. Form the Cartesian product of all tables involved in the join

2. Scan rows of the Cartesian product, testing each against all of the predicates

3. Eliminate rows of the Cartesian product that don’t meet the predicates

What’s wrong with this picture?

Think about two tables of 1 million rows. Cartesian product would be 1 thousand billion rows!
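The three steps can be sketched in a few lines of Python. The function name `cartesian_join` and the sample tables are assumptions for illustration.

```python
from itertools import product


def cartesian_join(r, s, predicate):
    """Form every pair of rows, then keep only the pairs that pass the predicate."""
    return [(a, b) for a, b in product(r, s) if predicate(a, b)]


# Hypothetical sample tables.
EMP = [
    {"ENAME": "SMITH", "DEPTNO": 20},
    {"ENAME": "ALLEN", "DEPTNO": 10},
    {"ENAME": "KING", "DEPTNO": 30},
]
DEPT = [
    {"DEPTNO": 10, "DNAME": "SALES"},
    {"DEPTNO": 20, "DNAME": "RESEARCH"},
]

pairs_examined = len(EMP) * len(DEPT)
joined = cartesian_join(EMP, DEPT, lambda e, d: e["DEPTNO"] == d["DEPTNO"])
print(pairs_examined, len(joined))

# The same arithmetic for two tables of 1 million rows each.
big_product_rows = 1_000_000 * 1_000_000
print(big_product_rows)
```

Every pair is examined even though most are discarded, so the work grows with the product of the table sizes rather than with the size of the result.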

Joining More than 2 Relations

A join of more than two relations is processed 2 relations at a time

Part of access path planning is to select that sequence

We will talk about algorithms for joining 2 tables and then choosing the order of processing a multi-table join

Joins

An equijoin is based on equality of an attribute of each of two relations

A full outer join includes all rows of both tables even if some rows do not have a matching value

A theta-join can be based on inequalities as the relationship

Join-Processing Algorithms

Nested loop join

◦ Each tuple of outer relation is compared to all rows of the inner relation

Sort-merge join

Hash-based join

Nested-Loop Join

The algorithm:

For efficiency, the relation with higher cardinality (R) is chosen as the inner relation

Number of operations: NR + NR * NS

What if there is an index?
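A minimal Python sketch of a nested-loop equijoin, with a second version in which a dictionary stands in for an index on the inner relation. The function names and sample data are assumptions; a real index would be a b-tree on disk, not an in-memory dictionary.

```python
def nested_loop_join(outer, inner, key):
    """Compare every outer tuple with every inner tuple."""
    result = []
    comparisons = 0
    for o in outer:          # each outer tuple is read once
        for i in inner:      # the inner relation is rescanned for each outer tuple
            comparisons += 1
            if o[key] == i[key]:
                result.append((o, i))
    return result, comparisons


def index_nested_loop_join(outer, inner, key):
    """Same join, but look up matches instead of rescanning the inner relation."""
    index = {}               # stand-in for an index on inner[key]
    for i in inner:
        index.setdefault(i[key], []).append(i)
    result = []
    for o in outer:
        for i in index.get(o[key], []):
            result.append((o, i))
    return result


# Hypothetical sample tables.
DEPT = [{"DEPTNO": 10, "DNAME": "SALES"}, {"DEPTNO": 20, "DNAME": "RESEARCH"}]
EMP = [
    {"ENAME": "SMITH", "DEPTNO": 20},
    {"ENAME": "ALLEN", "DEPTNO": 10},
    {"ENAME": "KING", "DEPTNO": 30},
]

joined, comparisons = nested_loop_join(DEPT, EMP, "DEPTNO")
joined_with_index = index_nested_loop_join(DEPT, EMP, "DEPTNO")
print(len(joined), comparisons)
```

The comparison count is the product of the two cardinalities. With an index on the inner relation, each outer tuple needs only a lookup, so the inner relation is not rescanned.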

Nested-Loop

Join Order

For JOIN queries, the “outer” table is accessed first, “inner” second

Order for joining tables must be selected

Most selective first

Least costly joins first

Merge Join

First, each relation is sorted on the join attribute

Then both relations are scanned in the order of the join attributes

Tuples that satisfy the join predicate are concatenated and placed in the output relation

Number of operations: NR + NS (after the sort!)

What if there is an index on R or S or both?

Merge Join Algorithm
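A minimal Python sketch of a sort-merge equijoin, written from the description of the steps (sort, scan both in join-attribute order, concatenate matches). The function name `merge_join` and the sample data are assumptions, and the handling of duplicate join values is one possible choice.

```python
def merge_join(r, s, key):
    """Sort both relations on the join attribute, then merge them in one pass."""
    r = sorted(r, key=lambda t: t[key])
    s = sorted(s, key=lambda t: t[key])
    result = []
    i = j = 0
    while i < len(r) and j < len(s):
        kr, ks = r[i][key], s[j][key]
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            # Find the run of s tuples that share this join value.
            j_end = j
            while j_end < len(s) and s[j_end][key] == kr:
                j_end += 1
            # Pair every r tuple with this value against that run.
            while i < len(r) and r[i][key] == kr:
                for t in s[j:j_end]:
                    result.append((r[i], t))
                i += 1
            j = j_end
    return result


# Hypothetical sample tables.
DEPT = [{"DEPTNO": 20, "DNAME": "RESEARCH"}, {"DEPTNO": 10, "DNAME": "SALES"}]
EMP = [
    {"ENAME": "SMITH", "DEPTNO": 20},
    {"ENAME": "ALLEN", "DEPTNO": 10},
    {"ENAME": "FORD", "DEPTNO": 20},
    {"ENAME": "KING", "DEPTNO": 30},
]

merged = merge_join(DEPT, EMP, "DEPTNO")
print([(d["DNAME"], e["ENAME"]) for d, e in merged])
```

If an index already delivers a relation in join-attribute order, the corresponding sort can be skipped, which is where most of the cost of this method lies.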


Hash-Join

The joins we have looked at compare tuples in the first relation with tuples in the second relation that cannot possibly be part of the join

The goal of the hash join is to compare only those tuples that might be part of the join

Hashing is used to identify those tuples

There are many variations of hash-join


Simple Hash-Join Algorithm
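A minimal Python sketch of one simple in-memory variant: build a hash table on one relation, then probe it with each tuple of the other. The function name `hash_join` and the sample data are assumptions, and Python's built-in dictionary hashing stands in for the hashing algorithm.

```python
from collections import defaultdict


def hash_join(build, probe, key):
    """Build a hash table on one relation, then probe it with the other."""
    buckets = defaultdict(list)
    for b in build:                           # build phase: one pass over build
        buckets[b[key]].append(b)
    result = []
    for p in probe:                           # probe phase: one pass over probe
        for b in buckets.get(p[key], []):     # only tuples with the same key are seen
            result.append((b, p))
    return result


# Hypothetical sample tables.
DEPT = [{"DEPTNO": 10, "DNAME": "SALES"}, {"DEPTNO": 20, "DNAME": "RESEARCH"}]
EMP = [
    {"ENAME": "SMITH", "DEPTNO": 20},
    {"ENAME": "ALLEN", "DEPTNO": 10},
    {"ENAME": "KING", "DEPTNO": 30},
]

hashed = hash_join(DEPT, EMP, "DEPTNO")
print([(d["DNAME"], e["ENAME"]) for d, e in hashed])
```

KING is never compared with any DEPT row, because no bucket exists for DEPTNO 30. Building on the smaller relation keeps the hash table small enough to stay in memory.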

Hash-Join Performance

Performance of hash-join can be superior to other join algorithms

Performance depends on the hashing algorithm (although note Lum’s research)

Perfect hashing algorithm could find match or non-match with a single probe

With hash table in RAM, processing would be very fast

Indexes

Impact of a b-tree index on performance of these algorithms is obvious

But the index must be maintained itself

When an attribute that’s indexed is changed in a relation, the value in the index must also be changed

And note that the changes must be synchronized (and locked together)

Order of Processing Joins

Typically, all combinations of order of processing are considered and a cost developed for each

Selectivity of predicates, selectivity of indexes, cardinality of relations all are factors in cost analysis

Goal is to minimize the size of intermediate results produced during processing

Usually, predicates with low selectivity values (that is, the most selective predicates) are processed first
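A minimal Python sketch of enumerating join orders, assuming the cost of a plan is simply the total number of rows in its intermediate results. The cost model, the function names, the cardinalities, and the join selectivities are all assumptions for illustration; a real optimizer uses a richer cost model and prunes the search.

```python
from itertools import permutations


def plan_cost(order, card, join_sel):
    """Sum the estimated sizes of the intermediate results for one join order."""
    size = card[order[0]]
    joined = [order[0]]
    cost = 0
    for rel in order[1:]:
        size *= card[rel]
        for prev in joined:
            # No join predicate between two relations means selectivity 1.
            size *= join_sel.get(frozenset((prev, rel)), 1.0)
        joined.append(rel)
        cost += size
    return cost


def best_order(card, join_sel):
    """Try every order and keep the one with the lowest estimated cost."""
    return min(permutations(card), key=lambda o: plan_cost(o, card, join_sel))


# Hypothetical cardinalities and join selectivities.
card = {"EMP": 10000, "DEPT": 100, "PROJ": 1000}
join_sel = {
    frozenset(("EMP", "DEPT")): 0.01,
    frozenset(("DEPT", "PROJ")): 0.01,
}

best = best_order(card, join_sel)
print(best, plan_cost(best, card, join_sel))
print(plan_cost(("EMP", "PROJ", "DEPT"), card, join_sel))
```

Joining EMP with PROJ first has no predicate to apply, so it forms a Cartesian product and produces a far larger intermediate result than the orders that start with DEPT and PROJ.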

Summary

Single-table queries

Multi-table queries

◦ Nested loop

◦ Sort-merge

◦ Hash

Order of processing joins

But Note:

We have left out a LOT

Relations may be partitioned and joins processed by partition

Many other parts of the DBMS affect performance

If you are responsible for database performance, buy a book and dig in

Remember not to give up on normalization to get performance

And Now: What You Do for Performance

How to Start

First, don’t even consider denormalization

You have many tools to get the performance you need without ruining the data model (and the applications)

Performance test the applications

Look for SQL operations that are taking a long time

EXPLAIN

IBM invented the EXPLAIN utility; it explains the processing strategy for each WHERE clause

Run it for operations that are taking too long

Look for table scans, cartesian product joins

Provide indexes to speed things up
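The workflow can be tried without a DB2 or Oracle installation. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` from Python's standard `sqlite3` module as a stand-in, which is an assumption of this example and not the IBM utility. The table contents and the index name `EMP_JOB_IDX` are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP (ENAME TEXT, JOB TEXT, SAL INTEGER)")
conn.executemany(
    "INSERT INTO EMP VALUES (?, ?, ?)",
    [("SMITH", "CLERK", 8), ("ALLEN", "SALESMAN", 16), ("KING", "VP", 50)],
)

QUERY = "SELECT ENAME, JOB FROM EMP WHERE JOB = 'VP'"


def plan(sql):
    """Return the plan text; the last column of each row holds the description."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(QUERY)                      # no index yet: expect a table scan
conn.execute("CREATE INDEX EMP_JOB_IDX ON EMP (JOB)")
plan_after = plan(QUERY)                       # the new index should now be used

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing the two plans shows the table scan being replaced by an index search once the index exists, which is the pattern to look for in slow statements.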

EXPLAIN PLAN

Tells you the execution plan an Oracle database follows for a SQL statement

Inserts a row describing each step of the execution plan into a specified table

Determines total cost of execution

Beyond EXPLAIN

There are many indexing options, other options to control physical characteristics of the database

Learn about them, learn how to control them

But you will go very far with EXPLAIN and providing indexes

THANK YOU!