Physical Database Design and Tuning
R&G - Chapter 20
Contents
• Physical Database Design
• Database Workloads
• Physical Design and Tuning Decisions
• Need for Tuning
• Guidelines for Index Selection
• Clustering and Indexing; Tools for Index Selection
• Database Tuning: Tuning Indexes
• Tuning the Conceptual Schema
• Tuning Queries and Views
• Impact of Concurrency
• Benchmarking
Physical Database Design
• The process of producing a description of the implementation of the database on secondary storage.
• It describes the base relations, file organizations, and indexes used to achieve efficient access to the data, and any associated integrity constraints and security measures.
Physical Database Design
• We will describe the plan for how to build the tables, including appropriate data types, field sizes, attribute domains, and indexes.
• The plan should have enough detail that if someone else were to use it to build a database, the database they build would be the same as the one you intended to create.
• The conceptual and logical designs were independent of physical considerations. Now we not only know that we want a relational model, we have also selected a database management system (DBMS) such as Access or Oracle, and we focus on those physical considerations.
Logical vs. Physical Design
• Logical database design is concerned with what to store;
• physical database design is concerned with how to store it.
Introduction
• We will be talking at length about "database design"
  – Conceptual schema: info to capture, tables, columns, views, etc.
  – Physical schema: indexes, clustering, etc.
• Physical design is linked tightly to query optimization
  – So we'll study this "bottom up"
  – But note: DB design is usually "top-down": conceptual, then physical. Then iterate.
• We must begin by understanding the workload:
  – The most important queries and how often they arise.
  – The most important updates and how often they arise.
  – The desired performance for these queries and updates.
Understanding the Workload
• For each query in the workload:
  – Which relations does it access?
  – Which attributes are retrieved?
  – Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
• For each update in the workload:
  – Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
  – The type of update (INSERT/DELETE/UPDATE), and the attributes that are affected.
  – For UPDATE commands, the fields that are modified.
Creating an ISUD Chart
Employee Table

Transaction   Frequency   % of table   Name   Salary   Address
Payroll Run   monthly     100          S      S        S
Add Emps      daily       0.1          I      I        I
Delete Emps   daily       0.1          D      D        D
Give Raises   monthly     10           S      U        NA
Insert, Select, Update, Delete Frequencies
Physical Design and Tuning Decisions
• Choice of indexes to create
  – Which relations to index, and which fields
  – What field(s) should be the search key
  – Should we build several indexes?
  – For each index, should it be clustered or unclustered?
• Tuning the conceptual schema
  – Alternative normalization
  – Denormalization
  – Vertical partitioning
  – Views
• Query and transaction tuning
  – Frequently executed queries are rewritten to run faster.
Need for Database Tuning
• It is hard to obtain a detailed workload at initial design time.
• The distinction between design and tuning is somewhat arbitrary.
• We consider the design process to be over once the conceptual schema and an initial set of clustering and indexing decisions are made.
• The tuning process then consists of subsequent changes to the conceptual schema or the indexes.
Index Selection
• One approach:
  – Consider the most important queries.
  – Consider the best plan using the current indexes, and see if a better plan is possible with an additional index.
  – If so, create it.
• Before creating an index, we must also consider the impact on updates in the workload.
  – Trade-off: slowing some updates in order to speed up some queries.
Whether to Index (Guideline 1)
• Do not build an index unless some query (including the query components of updates) benefits from it.
• Whenever possible, choose indexes that speed up more than one query.
Multi-attribute Search Keys (Guideline 3)
• Two situations should be considered:
  1. A WHERE clause includes conditions on more than one attribute of a relation.
  2. They enable index-only evaluation strategies (i.e., accessing the relation can be avoided) for important queries.
Whether to Cluster (Guideline 4)
• As a rule of thumb, range queries are likely to benefit the most from clustering.
• If an index enables an index-only evaluation strategy for the query it is intended to speed up, the index need not be clustered.
Hash versus Tree Index (Guideline 5)
• A hash index is better in the following situations:
  – The index is intended to support index nested loops join, the indexed relation is the inner relation, and the search key includes the join columns.
  – There is a very important equality query, and no range queries, involving the search key attributes.
Balancing the Cost of Index Maintenance (Guideline 6)
• If maintaining an index slows down frequent update operations, consider dropping the index.
• Keep in mind, however, that adding an index may well speed up a given update operation.
  – E.g., an index on employee IDs could speed up the operation of increasing the salary of an employee (specified by ID).
Example 1
SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE D.dname='Toy' AND E.dno=D.dno

• A hash index on D.dname supports the 'Toy' selection.
  – Given this, an index on D.dno is not needed.
• A hash index on E.dno allows us to get matching (inner) Emp tuples for each selected (outer) Dept tuple.
• What if the WHERE clause also included "... AND E.age=25"?
  – Could retrieve Emp tuples using an index on E.age, then join with Dept tuples satisfying the dname selection. Comparable to the strategy that used the E.dno index.
  – So, if an E.age index is already created, this query provides much less motivation for adding an E.dno index.
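The index choices above can be sketched in a few lines with Python's sqlite3 module. SQLite offers only B-tree indexes, so they stand in here for the hash indexes discussed; the table contents and index names are invented for illustration.

```python
import sqlite3

# Sketch of Example 1's index choices in SQLite (B-tree indexes stand in
# for the hash indexes discussed above; the sample rows are invented).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Dept (dno INTEGER, dname TEXT, mgr TEXT)")
cur.execute("CREATE TABLE Emp (eid INTEGER, ename TEXT, dno INTEGER, age INTEGER)")
cur.executemany("INSERT INTO Dept VALUES (?, ?, ?)",
                [(1, "Toy", "Alice"), (2, "Shoe", "Bob")])
cur.executemany("INSERT INTO Emp VALUES (?, ?, ?, ?)",
                [(10, "Carol", 1, 25), (11, "Dave", 2, 40)])

# Index on D.dname supports the 'Toy' selection; index on E.dno lets the
# join probe for matching inner Emp tuples per selected Dept tuple.
cur.execute("CREATE INDEX idx_dept_dname ON Dept(dname)")
cur.execute("CREATE INDEX idx_emp_dno ON Emp(dno)")

rows = cur.execute("""
    SELECT E.ename, D.mgr
    FROM Emp E, Dept D
    WHERE D.dname = 'Toy' AND E.dno = D.dno
""").fetchall()
print(rows)  # [('Carol', 'Alice')]
```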
Example 2

SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE E.sal BETWEEN 10000 AND 20000
  AND E.hobby='Stamps' AND E.dno=D.dno

• All selections are on Emp, so it should be the outer relation in any index nested loops join.
  – This suggests that we build a B+ tree index on D.dno.
• What index should we build on Emp?
  – A B+ tree on E.sal could be used, or an index on E.hobby could be used. Only one of these is needed, and which is better depends upon the selectivity of the conditions.
  – As a rule of thumb, equality selections are more selective than range selections.
• As both examples indicate, our choice of indexes is guided by the plan(s) that we expect an optimizer to consider for a query. We have to understand optimizers!
Clustering and Indexing

SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE D.dname='Toy' AND E.dno=D.dno

• Clustered indexes can be especially important when accessing the inner relation in an index nested loops join.
• Revisiting Example 1: should the indexes used be clustered?
  – An unclustered index on D.dname suffices for the dname selection.
  – On the other hand, Emp is the inner relation in the index NL join and dno is not a candidate key of Emp, so many Emp tuples can match each Dept tuple: the index on E.dno should be clustered.
Examples of Clustering
• A B+ tree index on E.age can be used to get qualifying tuples.
  – How selective is the condition?
  – Is the index clustered?

SELECT E.dno
FROM Emp E
WHERE E.age>40

• Consider the GROUP BY query.
  – If many tuples have E.age > 10, using the E.age index and sorting the retrieved tuples may be costly.
  – A clustered E.dno index may be better!

SELECT E.dno, COUNT (*)
FROM Emp E
WHERE E.age>10
GROUP BY E.dno

• Equality queries and duplicates:
  – Clustering on E.hobby helps!

SELECT E.dno
FROM Emp E
WHERE E.hobby='Stamps'
Impact of Clustering
Co-clustering Two Relations
• It can speed up joins, in particular key-foreign key joins corresponding to 1:N relationships.
• A sequential scan of either relation becomes slower.
• All inserts, deletes, and updates that alter record lengths become slower, because of the overhead involved in maintaining the clustering.
Index-Only Plans
• A number of queries can be answered without retrieving any tuples from one or more of the relations involved, if a suitable index is available. For each query below, an index key that enables an index-only plan is shown.

<E.dno>:
SELECT D.mgr
FROM Dept D, Emp E
WHERE D.dno=E.dno

<E.dno,E.eid>:
SELECT D.mgr, E.eid
FROM Dept D, Emp E
WHERE D.dno=E.dno

<E.dno>:
SELECT E.dno, COUNT(*)
FROM Emp E
GROUP BY E.dno

<E.dno,E.sal> (B-tree trick!):
SELECT E.dno, MIN(E.sal)
FROM Emp E
GROUP BY E.dno

<E.age,E.sal> or <E.sal,E.age>:
SELECT AVG(E.sal)
FROM Emp E
WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000
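As a small sanity check, the <E.dno, E.sal> case can be sketched with SQLite via Python, where an index-only plan shows up as a "covering index" in the plan output; the Emp instance is invented for illustration.

```python
import sqlite3

# Sketch of an index-only plan: the composite index <dno, sal> can
# answer the MIN(sal)-per-dno query without touching the Emp table
# itself (SQLite typically reports "COVERING INDEX" in its plan).
# The sample data is invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Emp (eid INTEGER, dno INTEGER, sal INTEGER)")
cur.executemany("INSERT INTO Emp VALUES (?, ?, ?)",
                [(1, 10, 3000), (2, 10, 5000), (3, 20, 4000)])
cur.execute("CREATE INDEX idx_emp_dno_sal ON Emp(dno, sal)")

plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT dno, MIN(sal) FROM Emp GROUP BY dno"
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)  # typically mentions COVERING INDEX idx_emp_dno_sal

rows = cur.execute("SELECT dno, MIN(sal) FROM Emp GROUP BY dno").fetchall()
print(sorted(rows))  # [(10, 3000), (20, 4000)]
```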
Tools to Assist in Index Selection
• First generation of such tools:
  – Index tuning wizards, or
  – Index advisors
• Drawback of these systems: they had to replicate the database query optimizer's cost model.
• The DB2 Index Advisor
  – A tool for automatic index recommendation given a workload.
  – Workload table: ADVISE_WORKLOAD. It is populated either:
    • with SQL statements from the DB2 dynamic SQL statement cache (recently executed statements),
    • with SQL statements from packages (statically compiled statements), or
    • with SQL statements from an online monitor called Query Patroller.
  – Output: SQL DDL statements whose execution creates the recommended indexes.
Tools to Assist in Index Selection
• The Microsoft SQL Server 2000 Index Tuning Wizard
  – A tuning wizard integrated with the database query optimizer.
  – Three tuning modes let the user trade off the running time of the analysis against the number of candidate index configurations examined: fast, medium, and thorough, with fast having the lowest running time and thorough examining the largest number of configurations.
  – Other options: a maximum space allowed for indexes, table scaling, and a sampling mode that reduces running time.
Overview of Database Tuning
• Actual use of the database provides a valuable source of detailed information that can be used to refine the initial design.
• Original assumptions are replaced.
• The initial workload is validated.
• Initial guesses about the size of the data can be replaced with actual statistics.
• Tuning is important to get the best possible performance.
• Three kinds of tuning: tuning indexes, tuning the conceptual schema, and tuning queries.
Tuning Indexes
• Queries and updates considered important at the initial design stage may turn out not to be very frequent.
• The observed workload may also identify some new queries and updates.
• The initial choice of indexes has to be reviewed in light of this new information.
• Some original indexes may be dropped and new ones added.
• E.g., for the query below, the optimizer may use an index-only scan with Emp as the inner relation; if the query takes an unexpectedly long time to execute, replace that plan with one based on a clustered index on the dno field.
Tuning Indexes, continued

SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE D.dname='Toy' AND E.dno=D.dno
• In addition, we have to periodically reorganize indexes.
  – E.g., a static index (an ISAM index) may have developed long overflow chains; dropping and rebuilding it, if feasible, improves access time through this index.
  – For a dynamic structure (a B+ tree): if the implementation does not merge pages on deletes, space occupancy can decrease considerably in some situations. This in turn makes the size of the index (in pages) larger than necessary, and could increase the height and therefore the access time.
Tuning the Conceptual Schema
• If the initial schema does not meet our performance objectives for the given workload with any set of physical design choices, we must redesign the conceptual schema.
• Such a change is called schema evolution.
• Issues involved in tuning the conceptual schema:
  – We may decide to settle for a 3NF design instead of BCNF.
  – If both are possible, our choice should be guided by the workload.
  – Sometimes we might decide to further decompose a relation that is already in BCNF.
  – We might denormalize.
  – We might consider partitioning.
Tuning Queries and Views
• If a query runs slower than expected, check if an index needs to be rebuilt, or if the statistics are too old and need to be refreshed.
• Sometimes, the DBMS may not be executing the plan you had in mind. Common areas of optimizer weakness:
  – Selections involving null values (bad selectivity estimates).
  – Selections involving arithmetic or string expressions (ditto).
  – Selections involving OR conditions (ditto).
  – Complex, correlated subqueries.
  – Lack of evaluation features like index-only strategies or certain join methods, or poor size estimation.
• Check the plan that is being used! Then adjust the choice of indexes or rewrite the query/view.
  – E.g., check via the POSTGRES EXPLAIN command.
  – Some systems rewrite for you under the covers (e.g., DB2).
    • This can be confusing and/or helpful!
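Plan checking can be sketched with SQLite's EXPLAIN QUERY PLAN, standing in here for the POSTGRES EXPLAIN command; the table, data, and index names are invented.

```python
import sqlite3

# Minimal sketch of "check the plan that is being used": inspect the
# plan before and after adding an index. SQLite's EXPLAIN QUERY PLAN
# stands in for POSTGRES's EXPLAIN; names and data are invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Emp (eid INTEGER, dno INTEGER)")
cur.executemany("INSERT INTO Emp VALUES (?, ?)", [(i, i % 5) for i in range(100)])

def plan(sql):
    """Return the query plan as one string."""
    return " ".join(r[-1] for r in cur.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT eid FROM Emp WHERE dno = 3"
before = plan(query)   # without an index: a full scan of Emp
cur.execute("CREATE INDEX idx_emp_dno ON Emp(dno)")
after = plan(query)    # now a search using idx_emp_dno
print(before)
print(after)
```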
More Guidelines for Query Tuning
• Minimize the use of DISTINCT: it is not needed if duplicates are acceptable, or if the answer contains a key.
• Minimize the use of GROUP BY and HAVING:

SELECT MIN (E.age)
FROM Employee E
GROUP BY E.dno
HAVING E.dno=102

can be rewritten as

SELECT MIN (E.age)
FROM Employee E
WHERE E.dno=102
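A small sketch (SQLite via Python, with invented Employee rows) checks that the HAVING and WHERE forms agree; the WHERE form restricts to dno = 102 before aggregating rather than grouping every department first.

```python
import sqlite3

# Sketch checking that the HAVING and WHERE forms above return the same
# answer. Sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Employee (eid INTEGER, dno INTEGER, age INTEGER)")
cur.executemany("INSERT INTO Employee VALUES (?, ?, ?)",
                [(1, 102, 30), (2, 102, 25), (3, 200, 40)])

having_form = cur.execute(
    "SELECT MIN(E.age) FROM Employee E GROUP BY E.dno HAVING E.dno = 102"
).fetchall()
where_form = cur.execute(
    "SELECT MIN(E.age) FROM Employee E WHERE E.dno = 102"
).fetchall()
print(having_form, where_form)  # [(25,)] [(25,)]
```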
• Consider the DBMS's use of indexes when writing arithmetic expressions: E.age=2*D.age will benefit from an index on E.age, but might not benefit from an index on D.age!
Guidelines for Query Tuning (Contd.)
• Avoid using intermediate relations:

SELECT * INTO Temp
FROM Emp E, Dept D
WHERE E.dno=D.dno
  AND D.mgrname='Joe'

and

SELECT T.dno, AVG(T.sal)
FROM Temp T
GROUP BY T.dno

vs.

SELECT E.dno, AVG(E.sal)
FROM Emp E, Dept D
WHERE E.dno=D.dno
  AND D.mgrname='Joe'
GROUP BY E.dno

The second version does not materialize the intermediate relation Temp.
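The contrast can be sketched with SQLite via Python. SELECT ... INTO is not SQLite syntax, so CREATE TABLE ... AS SELECT stands in for it; the sample rows are invented.

```python
import sqlite3

# Sketch contrasting the two formulations above: materializing Temp
# first versus one query with no intermediate relation. CREATE TABLE
# ... AS stands in for SELECT INTO; sample rows are invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Emp (eid INTEGER, dno INTEGER, sal REAL)")
cur.execute("CREATE TABLE Dept (dno INTEGER, mgrname TEXT)")
cur.executemany("INSERT INTO Emp VALUES (?, ?, ?)",
                [(1, 10, 1000.0), (2, 10, 3000.0), (3, 20, 2000.0)])
cur.executemany("INSERT INTO Dept VALUES (?, ?)", [(10, "Joe"), (20, "Sue")])

# First version: materialize the intermediate relation Temp, then aggregate.
cur.execute("""
    CREATE TABLE Temp AS
    SELECT E.* FROM Emp E, Dept D
    WHERE E.dno = D.dno AND D.mgrname = 'Joe'
""")
with_temp = cur.execute(
    "SELECT dno, AVG(sal) FROM Temp GROUP BY dno").fetchall()

# Second version: one query, nothing is materialized.
single = cur.execute("""
    SELECT E.dno, AVG(E.sal) FROM Emp E, Dept D
    WHERE E.dno = D.dno AND D.mgrname = 'Joe'
    GROUP BY E.dno
""").fetchall()
print(with_temp, single)  # both [(10, 2000.0)]
```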
Choices in Tuning the Conceptual Schema
• Consider the following schema:
  – Contracts(cid: integer, supplierid: integer, projectid: integer, deptid: integer, partid: integer, qty: integer, value: real)
  – Departments(did: integer, budget: real, annualreport: varchar)
  – Parts(pid: integer, cost: integer)
  – Projects(jid: integer, mgr: char(20))
  – Suppliers(sid: integer, address: char(50))
Choices in Tuning the Conceptual Schema, contd.
• The relation Contracts is denoted as CSJDPQV.
  – The meaning of a tuple in this relation is that the contract with cid C is an agreement that supplier S (with sid equal to supplierid) will supply Q items of part P (with pid equal to partid) to project J (with jid equal to projectid) associated with department D (with deptid equal to did), and that the value V of this contract is equal to value.
Choices in Tuning the Conceptual Schema, contd.
• There are two known integrity constraints with respect to Contracts:
  1. A project purchases a given part using a single contract: JP → C
  2. A department purchases at most one part from any given supplier: SD → P
Settling for a Weaker Normal Form
• Consider the Contracts relation. What normal form is it in?
  – The candidate keys for this relation are C and JP.
  – The only nonkey dependency is SD → P, and P is a prime attribute because it is part of the candidate key JP.
  – So it is in 3NF.
• We can decompose it to convert it into BCNF:
  – We obtain a lossless-join and dependency-preserving decomposition into BCNF; decomposing the schema yields CJP, SDP, and CSJDQV.
Horizontal Decompositions
• The usual definition of decomposition: a relation is replaced by a collection of relations that are projections. This is the most important case.
  – We talk about this at length as part of conceptual DB design.
• Sometimes, we might want to replace a relation by a collection of relations that are selections.
  – Each new relation has the same schema as the original, but a subset of its rows.
  – Collectively, the new relations contain all rows of the original.
  – Typically, the new relations are disjoint.
Horizontal Decompositions (Contd.)
• Contracts (Cid, Sid, Jid, Did, Pid, Qty, Val)
• Suppose that contracts with value > 10000 are subject to different rules.
  – So queries on Contracts will often say WHERE val>10000.
• One approach: a clustered B+ tree index on the val field.
• Second approach: replace Contracts by two new relations, LargeContracts and SmallContracts, with the same attributes (CSJDPQV).
  – Performs like the index on such queries, but with no index overhead.
  – Can build clustered indexes on other attributes, in addition!
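The second approach can be sketched with SQLite via Python: the two new relations share the Contracts schema and partition its rows on val. The sample contracts are invented.

```python
import sqlite3

# Sketch of the horizontal decomposition of Contracts into
# LargeContracts (val > 10000) and SmallContracts (val <= 10000).
# Sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE Contracts
               (cid INTEGER, sid INTEGER, jid INTEGER, did INTEGER,
                pid INTEGER, qty INTEGER, val REAL)""")
cur.executemany("INSERT INTO Contracts VALUES (?,?,?,?,?,?,?)",
                [(1, 1, 1, 1, 1, 5, 25000.0),
                 (2, 1, 2, 1, 2, 3, 4000.0)])
cur.execute("CREATE TABLE LargeContracts AS "
            "SELECT * FROM Contracts WHERE val > 10000")
cur.execute("CREATE TABLE SmallContracts AS "
            "SELECT * FROM Contracts WHERE val <= 10000")

# Queries that would have said WHERE val > 10000 now just read
# LargeContracts, with no index on val needed.
large = cur.execute("SELECT cid FROM LargeContracts").fetchall()
small = cur.execute("SELECT cid FROM SmallContracts").fetchall()
print(large, small)  # [(1,)] [(2,)]
```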
Masking Conceptual Schema Changes
• The horizontal decomposition from above can be masked by a view:

CREATE VIEW Contracts(cid, sid, jid, did, pid, qty, val)
AS SELECT *
FROM LargeContracts
UNION
SELECT *
FROM SmallContracts

• NOTE: queries with the condition val>10000 must be asked against LargeContracts for efficiency, so some users may have to be aware of the change.
  – I.e., the users who were having performance problems.
  – Arguably that's OK: they wanted a solution!
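A runnable sketch of the masking view (SQLite via Python, invented rows): queries written against the old Contracts relation still work, reading the union of the two partitions.

```python
import sqlite3

# Sketch of masking the decomposition with a view: legacy queries
# against Contracts still work unchanged. Sample rows are invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE LargeContracts (cid, sid, jid, did, pid, qty, val);
    CREATE TABLE SmallContracts (cid, sid, jid, did, pid, qty, val);
    INSERT INTO LargeContracts VALUES (1, 1, 1, 1, 1, 5, 25000);
    INSERT INTO SmallContracts VALUES (2, 1, 2, 1, 2, 3, 4000);
    CREATE VIEW Contracts(cid, sid, jid, did, pid, qty, val) AS
        SELECT * FROM LargeContracts
        UNION
        SELECT * FROM SmallContracts;
""")
rows = cur.execute("SELECT cid, val FROM Contracts ORDER BY cid").fetchall()
print(rows)  # [(1, 25000), (2, 4000)]
```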
Impact of Concurrency
• In a system with many concurrent users, several additional points must be considered.
• A transaction obtains locks on the pages that it reads or writes, and other transactions may be blocked.
• Two specific ways to reduce blocking:
  – Reduce the time that transactions hold locks.
  – Reduce hot spots.
Reducing Lock Durations
• Delay lock requests
  – Tune the transaction by writing to local program variables and deferring changes to the database until the end of the transaction.
• Make transactions faster
  – Tune indexes and rewrite queries.
  – Carefully partition the tuples in a relation and its associated indexes across a collection of disks.
• Replace long transactions by short ones
  – Rewrite them as two or more smaller transactions.
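The "delay lock requests" idea can be sketched with SQLite via Python: compute changes in local program variables first, then apply them in one short transaction so write locks are held only briefly. The Account table and its rows are invented.

```python
import sqlite3

# Sketch of deferring database changes to the end of the transaction.
# Table and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account (id INTEGER PRIMARY KEY, bal INTEGER)")
conn.executemany("INSERT INTO Account VALUES (?, ?)", [(1, 100), (2, 200)])
conn.commit()

# Read and compute in local program variables first ...
new_balances = [(bal + 10, acct_id)
                for acct_id, bal in conn.execute("SELECT id, bal FROM Account")]

# ... then write everything in a single short transaction, so write
# locks are held only for the final batch of updates.
with conn:  # opens a transaction and commits on success
    conn.executemany("UPDATE Account SET bal = ? WHERE id = ?", new_balances)

final = conn.execute("SELECT bal FROM Account ORDER BY id").fetchall()
print(final)  # [(110,), (210,)]
```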
Reducing Lock Durations, contd.
• Build a warehouse
  – Complex queries, such as statistical analyses of business trends, can hold shared locks for a long time.
  – They can instead run on a copy of the data that is a little out of date.
• Consider a lower isolation level
  – In many situations, such as queries generating aggregate information or statistical summaries,
  – a lower SQL isolation level such as REPEATABLE READ or READ COMMITTED can be used.
Reducing Hot Spots
• Delay operations on hot spots
  – i.e., requests against frequently used objects.
• Optimize access patterns
  – e.g., the pattern of updates.
• Partition operations on hot spots
  – e.g., batch appends.
• Choice of index
  – In a frequently updated relation, B+ tree indexes can become a bottleneck: the root and upper index pages become hot spots.
  – Specialized locking protocols help (fine-granularity locks).
  – This motivates ISAM-style indexes, where only the leaf pages need locks.
DBMS Benchmarking
• Includes benchmarks for measuring the performance of a certain class of applications (e.g., the TPC benchmarks), and
• benchmarks for measuring how well a DBMS performs various operations (e.g., the Wisconsin benchmark).
  – Benchmarks should be portable, easy to understand, and scale naturally to larger problem instances. They should measure peak performance (e.g., transactions per second, or tps) as well as price/performance ratios (e.g., $/tps) for typical workloads in a given application domain.
• The Transaction Processing Performance Council (TPC) was created to define benchmarks for transaction processing and database systems.
• Well-known DBMS benchmarks:
  – The TPC-A and TPC-B benchmarks constitute the standard definitions of the tps and $/tps measures.
  – TPC-A measures the performance and price of a computer network in addition to the DBMS,
  – whereas the TPC-B benchmark considers the DBMS by itself.
DBMS Benchmarking
• The TPC-C benchmark is a more complex suite of transactional tasks than TPC-A and TPC-B.
  – It models a warehouse that tracks items supplied to customers and involves five types of transactions.
  – It is much more expensive to run than TPC-A and TPC-B, and exercises a much wider range of system capabilities.
• TPC-D represents a broad range of decision support (DS) applications that require complex, long-running queries against large, complex data structures.
DBMS Benchmarking
• The TPC Benchmark H (TPC-H) is a decision support benchmark.
• It consists of a suite of business-oriented ad hoc queries and concurrent data modifications.
• The queries and the data populating the database have been chosen to have broad industry-wide relevance.
• This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.
Points to Remember
• Indexes must be chosen to speed up important queries (and perhaps some updates!).
  – Index maintenance adds overhead on updates to key fields.
  – Choose indexes that can help many queries, if possible.
  – Build indexes to support index-only strategies.
  – Clustering is an important decision; only one index on a given relation can be clustered!
  – The order of fields in a composite index key can be important.
• Static indexes may have to be periodically rebuilt.
• Statistics have to be periodically updated.
Points to Remember (Contd.)
• Over time, indexes have to be fine-tuned (dropped, created, re-clustered, ...) for performance.
  – Determine the plan used by the system, and adjust the choice of indexes appropriately.
• The system may still not find a good plan:
  – Perhaps it considers only left-deep plans?
  – Null values, arithmetic conditions, string expressions, the use of ORs, nested queries, etc. can confuse an optimizer.
• So, we may have to rewrite the query/view:
  – Avoid nested queries, temporary relations, complex conditions, and operations like DISTINCT and GROUP BY.