Database Techniek Martin Kersten Peter Boncz CWI


© Silberschatz, Korth and Sudarshan, Database System Concepts

Outline

Introduction & Course Organization

Recap of Introductory Database Course

SQL

Relational Algebra (X100 flavor)

Storage and File Structures


Why a DBMS?

Main Advantages

Centralization (at least conceptually)

Data Independence (physical changes don’t break legacy apps)

Declarative Data Integrity Constraints

Atomic actions (DBMS recovers consistently from system crash)

Consistency under Multi-User Concurrent Updates

Declarative & Powerful Query Language, Automatically Optimized

Multi-user security

DBMS now is the basic building block of all information systems

Almost everybody in IT works with DBMS on a daily basis


Goal

Gain insight into the implementation techniques inside a relational DBMS

Grading: grade = (2*exam + practicum)/3

exam >= 6, practicum >= 6

Literature: A. Silberschatz et al., 'Database System Concepts', 4th ed., McGraw-Hill, 2002

http://www.cwi.nl/~manegold/teaching/DBtech/


Lectures

No.  Topic                Material        Lecturer        Date
1    SQL + X100 Alg       Ch. 4 + X100 doc  Kersten/Boncz   Feb 8
     Storage + B-Trees    Ch. 11-12
2    Query Processing     Ch. 13          Boncz           Feb 15
     Query Optimization   Ch. 14          Boncz           Feb 22
3    Transactions         Ch. 15-17       Kersten         Mar 1
4    MonetDB/SQL                          Kersten/Nes     Mar 8
5    MonetDB/XQuery                       Kersten/Boncz   Mar 15

Exam: last week of March


Practicum

Assignment 0:

• Hands-on experience with relational DBMSs & SQL

Assignment 1:

• Translating SQL to X100 algebra ("by hand")

Assignment 2: (choose one of)

a) Building logical cost functions for X100 algebra operations ("by hand")

b) Analyse and explain the behaviour of a query optimizer

Supervisor: Marc Makkes ([email protected])

Hard deadlines (first: Saturday, February 17, 2007, 23:59:59 CET!)

Work in pairs


Outline

Introduction & Course Organization

Recap of Introductory Database Course

SQL

Relational Algebra (X100 flavor)



SQL re-cap: Basic Structure

A typical SQL query has the form:

    select A1, A2, ..., An
    from r1, r2, ..., rm
    where P

• the Ai represent attributes
• the ri represent relations
• P is a predicate

This query is equivalent to the relational algebra expression:

    project_{A1, A2, ..., An}( select_P( r1 join_true r2 join_true ... join_true rm ) )

The result of an SQL query is again a relation.

SQL relations may have duplicates

Use select distinct to get a set
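The equivalence above can be sketched in a few lines of Python. This is only an illustration, not part of the course material: relations are modelled as lists of dicts (bags of rows), and the helper names cross, select and project as well as the toy relations r1 and r2 are assumptions made for this example.

```python
from itertools import product

def cross(*relations):
    """Cartesian product (join with predicate 'true') of bags of rows."""
    result = []
    for combo in product(*relations):
        row = {}
        for part in combo:
            row.update(part)  # assumes attribute names are unique across relations
        result.append(row)
    return result

def select(rows, predicate):
    """Keep the rows for which the predicate P holds."""
    return [row for row in rows if predicate(row)]

def project(rows, attributes):
    """Keep only the listed attributes; duplicates are NOT removed."""
    return [tuple(row[a] for a in attributes) for row in rows]

# Hypothetical toy relations, not taken from the slides.
r1 = [{"a": 1, "b": 10}, {"a": 2, "b": 10}]
r2 = [{"c": 10}, {"c": 10}]

# select b from r1, r2 where b = c
bag = project(select(cross(r1, r2), lambda t: t["b"] == t["c"]), ["b"])

# select distinct b from r1, r2 where b = c
distinct = sorted(set(bag))
```

Here bag holds four identical rows, one per matching combination, while distinct holds a single row, which is the difference between select and select distinct.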


Relational algebra

SQL
  -> parsing, normalization
logical algebra
  -> logical query optimization
  -> physical query optimization
physical algebra
  -> query execution


The Practicum

SQL
  -> parsing, normalization
X100 algebra
  -> logical query optimization
  -> physical query optimization
physical algebra
  -> X100 system


X100 relational algebra

MonetDB/X100 is a CWI research project

http://www.cwi.nl/~boncz/x100.html

High-performance experimental DBMS for e.g.
• Data warehousing
• Data mining
• Information Retrieval
• Video databases (retrieval by content)

Research goal:

study interaction between modern hardware and database internals

• High-performance algorithms, compression
• E.g. exploit CPU caches, multi-processors, MEMS


X100 relational algebra (Cont.)

X100 has a relational algebra interface

    Table ::= table(Identifier)
            | select(Table, Expr<bool>)
            | project(Table, [ Expr<T> ])
            | join(Table, Table, Expr<bool>)
            | aggr(Table, [ Expr<T> ], [ AggrFcn<T> ])
            | order(Table, [ Expr<T> ])
            | topn(Table, [ Expr<T> ], Expr<int>)

    Identifier = Table


select(Table, Expr<bool>)

• Relation r(A, B, C, D); the A and B columns held symbols that are not reproduced here:

     C    D
     1    7
     5    7
    12    3
    23   10

• select(r, and( ==(A,B), >(D, int('5')) ) )

     C    D
     1    7
    23   10

Functional C-like notation: A = B and D > 5 is written and( ==(A,B), >(D, int('5')) )

All constants are denoted as a cast: TYPE('string')


project(Table, [ Expr<T> ] )

Relation r(A, B, C); the A column held symbols that are not reproduced here:

     B    C
    10    1
    20    1
    30    1
    40    2

project(r, [ A, D=*(C,int('10')) ] )

     D
    10
    10
    10
    20

X100 is a bag algebra:

no duplicate elimination
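A minimal Python sketch of this bag behaviour, assuming rows are dicts. The B and C values follow the slide; the A values "x" and "y" are hypothetical stand-ins, and the project helper is an illustration rather than the X100 implementation.

```python
from collections import Counter

# Relation r(A, B, C). B and C follow the slide; the A values are
# hypothetical stand-ins chosen for this example.
r = [
    {"A": "x", "B": 10, "C": 1},
    {"A": "x", "B": 20, "C": 1},
    {"A": "y", "B": 30, "C": 1},
    {"A": "y", "B": 40, "C": 2},
]

def project(rows, outputs):
    """outputs maps each result column name to a function of the input row."""
    return [{name: fn(row) for name, fn in outputs.items()} for row in rows]

# project(r, [ A, D=*(C,int('10')) ])
result = project(r, {"A": lambda t: t["A"], "D": lambda t: t["C"] * 10})

# Bag semantics: 4 input rows give 4 output rows, duplicates included.
multiplicity = Counter((t["A"], t["D"]) for t in result)
```

With these stand-in values the row ("x", 10) appears twice in the result, because project never removes duplicates.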


join(Table, Table, Expr<bool>)

Relations r(A, B, C, D) and t(E, C); only the numeric and letter columns are reproduced here:

    r:   B   D          t:   E
         1   a               1
         2   a               3
         4   b               1
         1   a               2
         2   b               3

X100 join result is the union of all attributes.

Name conflicts must be resolved with an extra project:

    s = project( t, [ E, F=C ] )

    join(r, s, ==(B,E))

Result as drawn on the slide (columns A, B, C, D, F):

         B   D
         1   a
         1   a
         1   a
         1   a
         2   b


aggr(Table, [Expr<T>], [AggrFcn<T>])

Relation account grouped by branch-name:

    branch-name   account-number   balance
    Perryridge    A-102            400
    Perryridge    A-201            900
    Brighton      A-217            750
    Brighton      A-215            750
    Redwood       A-222            700

aggr( account, [ branch-name ], [ balance = sum(balance) ] )

    branch-name   balance
    Perryridge    1300
    Brighton      1500
    Redwood        700

Identifier = AggrFcn(Identifier)

    AggrFcn<T> ::= count<uint>() | avg<T>(T) | sum<T>(T) | min<T>(T) | max<T>(T)

Empty groupby-list: global aggregate

Relation r(A, B, C); only column C is reproduced here: C = 7, 7, 3, 10

aggr( r, [], [total = sum(C)])

    total
    27

Empty AggrFcn-list: duplicate elimination

aggr( account, [ branch-name ], [] )

    branch-name
    Perryridge
    Brighton
    Redwood
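The three uses of aggr above (grouped sum, global aggregate, duplicate elimination) can be tried out with a small Python sketch. The function aggr_sum and its index-based column addressing are assumptions for illustration; only the account data comes from the slide.

```python
from collections import defaultdict

account = [
    ("Perryridge", "A-102", 400),
    ("Perryridge", "A-201", 900),
    ("Brighton", "A-217", 750),
    ("Brighton", "A-215", 750),
    ("Redwood", "A-222", 700),
]

def aggr_sum(rows, group_by, value):
    """Group rows on the column indexes in group_by and sum column `value`.

    value=None models an empty aggregate list: only the distinct
    group keys are returned (duplicate elimination).
    """
    groups = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in group_by)
        groups[key] += row[value] if value is not None else 0
    if value is None:
        return list(groups)
    return [key + (total,) for key, total in groups.items()]

per_branch = aggr_sum(account, [0], 2)    # grouped sum of balance
global_total = aggr_sum(account, [], 2)   # empty groupby-list
branches = aggr_sum(account, [0], None)   # empty AggrFcn-list
```

Groups come out in order of first appearance because Python dicts keep insertion order; a real engine gives no such ordering guarantee without an explicit order operator.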


order (Table, [ Expr<T>])

Relation r(A, B, C, D); only columns C and D are reproduced here:

     C    D
    23   10
    12    9
    35    7
    25    7

• orderby(r, [D,C desc])

     C    D
    35    7
    25    7
    12    9
    23   10


topn(Table, [ Expr<T>], int)

Same relation r as above.

• topn(r, [D,C desc], int('2') )

     C    D
    35    7
    25    7
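The ordering and top-n results above can be reproduced with Python's standard library. This is a sketch under the assumption that the sort key [D, C desc] means D ascending, then C descending; rows are (C, D) pairs since only those two columns are shown.

```python
import heapq

# Columns C and D of relation r as shown on the slide (A and B omitted).
r = [(23, 10), (12, 9), (35, 7), (25, 7)]  # (C, D)

def sort_key(row):
    c, d = row
    return (d, -c)  # D ascending, then C descending

# order: full sort of the bag
ordered = sorted(r, key=sort_key)

# topn: only the first 2 rows in that order, without sorting everything
top2 = heapq.nsmallest(2, r, key=sort_key)
```

The design point of a separate topn operator is visible here: heapq.nsmallest keeps only n candidates in memory, whereas sorted materialises the whole ordered result first.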


TPC-H: Data Warehousing Scenario

http://www.tpc.org
• TPC-C: transaction processing
• TPC-H: data warehousing

Large repository of data about Orders, consisting of Lineitems, delivered to Customers.

CUSTOMER 1:n ORDER 1:n LINEITEM

Query 3:

"Give date, priority and sum of the top 10 high revenue orders for construction customers that had been ordered but not yet shipped on March 15"


SQL Data Warehousing Query (TPC-H Query 3)

    select l_orderkey, o_orderdate, o_shippriority,
           sum(l_extendedprice * (1 - l_discount)) as revenue
    from customer, orders, lineitem
    where c_custkey = o_custkey
      and l_orderkey = o_orderkey
      and c_mktsegment = 'BUILDING'
      and o_orderdate < date '1995-03-15'
      and l_shipdate > date '1995-03-15'
    group by l_orderkey, o_orderdate, o_shippriority
    order by revenue desc, o_orderdate
    limit 10;


SQL -> Algebra translation

(Animation: the clauses of Query 3, ending in "limit 10;", are mapped step by step onto the algebra operators join, select, aggr and topn.)


Query in X100 Algebra



Outline

Introduction & Course Organization

Recap of Introductory Database Course

SQL

Relational Algebra (X100 flavor)



Storage Hierarchy

                          size     bandwidth          latency       EUR/GB   unit
    Tape (HP)             300GB    80MB/s             10 min        0.10     32KB
    Magnetic disk (IDE)   300GB    80MB/s             10000000ns    0.30     8KB
    NAND Flash            4GB      60MB/s (20MB/s)    100000ns      20       2KB
    RAM (DDR2)            2GB      3000MB/s           70ns          60       64B
    L2 CPU cache          2MB      7000MB/s           10ns                   64B
    L1 CPU cache          64KB     24000MB/s          1ns                    64B
    CPU registers         128B     24000MB/s          1                      8B


Hardware Trends

(Chart of trends over time for: CPU speed (KHz), RAM size (KB), disk size (MB), RAM bandwidth (MB/s), disk bandwidth (MB/s), RAM latency (ns), disk latency (ms).)


Storage Hierarchy (Cont.)

primary storage: fastest media but volatile (cache, main memory).

secondary storage: next level in hierarchy, non-volatile, moderately fast access time
• also called on-line storage
• E.g. flash memory, magnetic disks

tertiary storage: lowest level in hierarchy, non-volatile, slow access time
• also called off-line storage
• E.g. magnetic tape, optical storage


Magnetic Hard Disk Mechanism

NOTE: Diagram is schematic, and simplifies the structure of actual disk drives


Performance Measures of Disks

Access time – the time it takes from when a read or write request is issued to when data transfer begins. Consists of:
• Seek time – time it takes to reposition the arm over the correct track.
  – Average seek time is 1/2 the worst case seek time.
    » Would be 1/3 if all tracks had the same number of sectors, and we ignore the time to start and stop arm movement
  – 4 to 10 milliseconds on typical disks
• Rotational latency – time it takes for the sector to be accessed to appear under the head.
  – Average latency is 1/2 of the worst case latency.
  – 4 to 11 milliseconds on typical disks (5400 to 15000 r.p.m.)

Data-transfer rate – the rate at which data can be retrieved from or stored to the disk.
• 20 to 60 MB per second is typical
• Multiple disks may share a controller, so the rate that the controller can handle is also important
  – E.g. ATA: 100 MB/second, SCSI: 320 MB/second
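As a rough worked example of these measures, the sketch below adds up seek time, rotational latency and transfer time for one read. The function names and the sample parameters (8 ms seek, 7200 r.p.m., 40 MB/s, 8 KB block) are assumptions for illustration. Note that one full revolution at 15000 to 5400 r.p.m. takes about 4 to 11 ms, so the average rotational latency is half of that.

```python
def rotation_ms(rpm):
    """Time for one full revolution, in milliseconds (worst case latency)."""
    return 60_000.0 / rpm

def avg_access_ms(avg_seek_ms, rpm):
    """Average access time = average seek + half a revolution."""
    return avg_seek_ms + rotation_ms(rpm) / 2

def read_ms(n_bytes, avg_seek_ms, rpm, mb_per_s):
    """Access time plus transfer time for one contiguous read."""
    transfer_ms = n_bytes / (mb_per_s * 1_000_000) * 1000
    return avg_access_ms(avg_seek_ms, rpm) + transfer_ms

fast = rotation_ms(15000)   # one revolution at 15000 r.p.m.
slow = rotation_ms(5400)    # one revolution at 5400 r.p.m.

# Hypothetical disk: 8 ms average seek, 7200 r.p.m., 40 MB/s, one 8 KB block.
one_block = read_ms(8192, 8, 7200, 40)
```

For the hypothetical disk, reading one 8 KB block takes a little over 12 ms, of which only about 0.2 ms is actual data transfer; the rest is positioning.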


Magnetic Disk Hardware Trends


Performance Measures (Cont.)

Mean time to failure (MTTF) – the average time the disk is expected to run continuously without any failure.
• Typically 3 to 5 years
• Probability of failure of new disks is quite low, corresponding to a "theoretical MTTF" of 30,000 to 1,200,000 hours for a new disk
  – E.g., an MTTF of 1,200,000 hours for a new disk means that given 1000 relatively new disks, on average one will fail every 1200 hours
• MTTF decreases as disk ages


RAID

RAID: Redundant Arrays of Independent Disks
• disk organization techniques that manage a large number of disks, providing a view of a single disk of
  – high capacity and high speed by using multiple disks in parallel, and
  – high reliability by storing data redundantly, so that data can be recovered even if a disk fails

The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail.
• E.g., a system with 100 disks, each with MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of 1000 hours (approx. 41 days)
• Techniques for using redundancy to avoid data loss are critical with large numbers of disks


Improvement of Reliability via Redundancy

Redundancy – store extra information that can be used to rebuild information lost in a disk failure

E.g., Mirroring (or shadowing)
• Duplicate every disk. Logical disk consists of two physical disks.
• Every write is carried out on both disks
  – Reads can take place from either disk
• If one disk in a pair fails, data still available in the other
  – Data loss would occur only if a disk fails, and its mirror disk also fails before the system is repaired
    » Probability of combined event is very small
    » Except for dependent failure modes such as fire or building collapse or electrical power surges

Mean time to data loss depends on mean time to failure, and mean time to repair
• E.g. MTTF of 100,000 hours, mean time to repair of 10 hours gives mean time to data loss of 500*10^6 hours (or 57,000 years) for a mirrored pair of disks (ignoring dependent failure modes)
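The reliability figures quoted on these slides can be checked with a few lines of arithmetic. The sketch below assumes independent disk failures and uses the standard approximation MTTF^2 / (2 * MTTR) for a mirrored pair, which reproduces the 500*10^6 hours above; the function names are chosen for this example.

```python
HOURS_PER_YEAR = 24 * 365

def hours_between_failures(mttf_hours, n_disks):
    """Expected time between failures somewhere in a set of n independent disks."""
    return mttf_hours / n_disks

def mirrored_pair_mttdl(mttf_hours, mttr_hours):
    """Mean time to data loss of a mirrored pair, assuming independent
    failures: MTTF^2 / (2 * MTTR)."""
    return mttf_hours ** 2 / (2 * mttr_hours)

one_in_1000 = hours_between_failures(1_200_000, 1000)  # 1000 new disks
array_of_100 = hours_between_failures(100_000, 100)    # 100-disk system
array_days = array_of_100 / 24

pair = mirrored_pair_mttdl(100_000, 10)                 # mirrored pair
pair_years = pair / HOURS_PER_YEAR
```

The numbers come out as 1200 hours, 1000 hours (roughly 41 days) and 5 * 10^8 hours (roughly 57,000 years), matching the slides.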


RAID Levels

Schemes to provide redundancy at lower cost by using disk striping combined with parity bits
• Different RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics

RAID Level 1: Mirrored disks with block striping
• Offers best write performance.
• Popular for applications such as storing log files in a database system.

RAID Level 0: Block striping; non-redundant.
• Used in high-performance applications where data loss is not critical.


RAID Levels (Cont.)

RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk.
• E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.
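A small Python sketch of the Level 5 layout: it places the parity block with the (n mod 5) + 1 rule and uses bytewise XOR as the parity function. The block contents and helper names are hypothetical; XOR parity is the usual choice, but the slide itself only describes the placement.

```python
from functools import reduce

N_DISKS = 5

def parity_disk(n):
    """Disk (numbered 1..5) holding the parity block of the n-th set of blocks."""
    return (n % N_DISKS) + 1

def data_disks(n):
    """The other 4 disks, which hold the data blocks of the n-th set."""
    return [d for d in range(1, N_DISKS + 1) if d != parity_disk(n)]

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Four hypothetical 2-byte data blocks of one stripe.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f", b"\xa0\x0a"]
parity = xor_blocks(data)

# If one data block is lost, XOR of the survivors and the parity restores it.
rebuilt = xor_blocks([data[0], data[1], data[3], parity])

# Small write: the new parity needs only the old parity and the old data block,
# i.e. 2 block reads followed by 2 block writes.
new_block = b"\xff\x00"
new_parity = xor_blocks([parity, data[2], new_block])
```

Rotating the parity disk with n spreads parity updates over all disks instead of making one dedicated parity disk the bottleneck.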


Choice of RAID Level

Level 0 provides maximum performance, no safety

Level 1 provides much better write performance than level 5
• Level 5 requires at least 2 block reads and 2 block writes to write a single block, whereas Level 1 only requires 2 block writes
• Level 1 preferred for high update environments such as log disks

Level 1 has higher storage cost than level 5
• disk drive capacities are increasing rapidly (50%/year) whereas disk access times have decreased much less (x 3 in 10 years)
• I/O requirements have increased greatly, e.g. for Web servers
• When enough disks have been bought to satisfy the required rate of I/O, they often have spare storage capacity
  – so there is often no extra monetary cost for Level 1!

Level 5 is preferred for applications with low update rate, and large amounts of data

Level 1 is preferred for all other applications


Hardware Issues

Hot swapping: replacement of disk while system is running, without power down
• Supported by some hardware RAID systems
• reduces time to recovery, and improves availability greatly

Many systems maintain spare disks which are kept online, and used as replacements for failed disks immediately on detection of failure
• Reduces time to recovery greatly

Many hardware RAID systems ensure that a single point of failure will not stop the functioning of the system by using
• Redundant power supplies with battery backup
• Multiple controllers and multiple interconnections to guard against controller/interconnection failures


Index Classification

Primary vs. Secondary
• primary – the index on the primary key
• unique – an index on a candidate key
• secondary – not primary

Clustered vs. Unclustered
• clustered – key order corresponds with record order
  – E.g. B-tree separate from record file
  – Index-organized table: B-tree leaves store records (no file)
• unclustered – index contains record-IDs in random order


B+Tree, n=4

    Root:       [100]
    Non-leaf:   [30]                   [120 150 180]
    Leaves:     [3 5 11] [30 35]       [100 101 110] [120 130] [150 156 179] [180 200]


Sample non-leaf

    Keys:       57 | 81 | 95

    Pointers:   to keys k < 57 | to keys 57 ≤ k < 81 | to keys 81 ≤ k < 95 | to keys k ≥ 95


Sample leaf node:

From non-leaf node

    Keys:       57 | 81 | 95 | pointer to next leaf in sequence

    Pointers:   to record with key 57 | to record with key 81 | to record with key 95


Non-root nodes have to be at least half-full

Use at least

Non-leaf: ⌈n/2⌉ children

Leaf: ⌈(n-1)/2⌉ pointers to data


    n=4          Full node        Min. node
    Non-leaf     [120 150 180]    [30]
    Leaf         [3 5 11]         [30 35]


Insert into B+tree

(a) simple case: space available in leaf

(b) leaf overflow

(c) non-leaf overflow

(d) new root


(a) Simple case: insert key = 32, n=4

    Before:  subtree for keys < 100: non-leaf [30], leaves [3 5 11] [30 31]
    After:   leaves [3 5 11] [30 31 32]


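(b) Leaf overflow: insert key = 7, n=4

    Before:  subtree for keys < 100: non-leaf [30], leaves [3 5 11] [30 31]
    After:   leaf [3 5 11] splits into [3 5] and [7 11]; key 7 is added to the non-leaf node, giving [7 30]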


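(c) Internal overflow: insert key = 160, n=4

    Before:  root key 100, non-leaf [120 150 180], leaves [150 156 179] [180 200]
    After:   leaf [150 156 179] splits into [150 156] and [160 179];
             the full non-leaf node splits into [120 150] and [180];
             key 160 moves up next to 100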


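(d) New root: insert key = 45, n=4

    Before:  root [10 20 30], leaves [1 2 3] [10 12] [20 25] [30 32 40]
    After:   leaf [30 32 40] splits into [30 32] and [40 45];
             the full root splits into [10 20] and [40];
             key 30 moves up into a new root [30]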
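Every overflow case above starts with the same step: a leaf receives one key too many and is split. A minimal Python sketch of that leaf step, assuming n = 4 is the maximum number of pointers per node (so at most 3 keys per leaf). The function insert_into_leaf is a hypothetical helper; it does not handle the propagation into non-leaf nodes or the creation of a new root.

```python
import bisect
import math

N = 4                                    # max pointers per node
MAX_KEYS = N - 1                         # so at most 3 keys per leaf
MIN_LEAF_KEYS = math.ceil((N - 1) / 2)   # at least half-full: 2 keys

def insert_into_leaf(leaf, key):
    """Insert key into a sorted leaf.

    Returns (left, right, separator). right and separator are None when
    the key fits; otherwise the leaf is split and separator is the key
    that must be copied up into the parent.
    """
    keys = list(leaf)
    bisect.insort(keys, key)
    if len(keys) <= MAX_KEYS:
        return keys, None, None
    mid = math.ceil(len(keys) / 2)
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]

simple = insert_into_leaf([30, 31], 32)          # case (a)
overflow = insert_into_leaf([3, 5, 11], 7)       # case (b)
internal = insert_into_leaf([150, 156, 179], 160)  # leaf step of case (c)
new_root = insert_into_leaf([30, 32, 40], 45)    # leaf step of case (d)
```

Both halves of a split leaf hold 2 keys here, which satisfies the half-full rule for n = 4. The separator is copied (not moved) because leaves must keep every key.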


insert:

1, 2, 10, 20, 3, 12, 30, 32, 25, 40, 45

n=4


Interesting problem:

For B+tree, how large should n be?

…

n is number of keys / node


Assumptions

You have the right to set the disk page size for the disk where a B-tree will reside.

Compute the optimum page size n assuming that
• The items are 4 bytes long and the pointers are also 4 bytes long.
• Time to read a node from disk is 10 + 0.0002n
• Time to process a block in memory is unimportant
• B+tree is full (i.e., every page has the maximum number of items and pointers)


Find n_opt by f'(n) = 0

What happens to n_opt as
• Disk bandwidth increases?
• Access time stays behind?
• CPU gets faster?


f(n) = time to find a record
     = log_n(T) * (10 + 0.0002n)

1994 (book): n=500          2004 (now): n=4000

                     1994            2004
    Table            1M records      10M records
    Access time      10ms            6ms
    Bandwidth        4MB/s           40MB/s
    n                ~500-1000       ~1000-4000
    Pages            4KB / 8KB       8KB / 32KB

1994: be conservative to limit RAM consumption

2004: relative benefit decreases so don't overdo it


Find n_opt by f'(n) = 0

Answer should be n_opt = "few thousand"

What happens to n_opt as
• Disk bandwidth increases?
• Access time stays behind?
• CPU gets faster?

block sizes are increasing..
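For the cost model f(n) = log_n(T) * (10 + 0.0002n), setting f'(n) = 0 gives the condition n * (ln n - 1) = 10 / 0.0002 = 50000, which does not depend on T. The Python sketch below solves this numerically by bisection. It treats log_n(T) as a continuous quantity (no rounding up to whole tree levels), and the constant and function names are chosen for this example.

```python
import math

ACCESS_MS = 10.0       # fixed cost per node read
PER_KEY_MS = 0.0002    # transfer cost per key in the node

def f(n, T):
    """Time to find a record: tree height log_n(T) times the cost of one node read."""
    return math.log(T) / math.log(n) * (ACCESS_MS + PER_KEY_MS * n)

def n_opt(lo=3.0, hi=1e7, tol=1e-6):
    """Solve f'(n) = 0, i.e. n * (ln n - 1) = ACCESS_MS / PER_KEY_MS, by bisection."""
    target = ACCESS_MS / PER_KEY_MS

    def g(n):
        # g is increasing for n > 1, negative at lo and positive at hi
        return n * (math.log(n) - 1.0) - target

    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

best = n_opt()
```

Under these assumptions best comes out at roughly 6400 keys per node, in line with the "few thousand" answer. A smaller access time or a larger per-key transfer cost lowers the right-hand side 10 / 0.0002 and therefore lowers n_opt; higher bandwidth raises it.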


Primary or Auxiliary Structure

Primary index
• Leaf blocks in sequence: clustered index
• Main storage structure for a database table
  – E.g. B+-tree organized file / hash structured files
• Typically an index on a unique key
  – But not necessarily
• Normally, you can have only one clustered index!

Secondary index
• Also called unclustered index
• A separate file from where the table is stored
• Refers with (block/offset) pointers to records in the table file
• You can define as many as you want (to maintain)


Clustered vs. Unclustered Index

(Figure: key range from low to high.)

Primary B-Tree index: 1 access only (rest is 'just' bandwidth)

Secondary B-tree index: pay N times access cost


Are Unclustered Indices a Good Idea?

Secondary indices depend on random I/O
• can do asynchronous I/O (multiple I/Os at-a-time)
• degenerates into full table scans


Block size for sequential reads?


When do random I/Os make sense?


Are Unclustered Indices a Good Idea? (Cont.)

Is not using an index at all better?
• I.e. read the entire table sequentially without any index

Use redundant clustered orderings
– Materialized views
– C-STORE (Stonebraker et al, VLDB 2005), MonetDB/X100
– Database Cracking (Kersten, CIDR 2005+2007)
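A back-of-the-envelope sketch of when a secondary index stops paying off, using the magnetic disk figures from the storage hierarchy table (80 MB/s, 10 ms per random access). The 100-byte record size, the 10M-record table and the function names are assumptions for illustration, and the model takes the worst case of one random access per matching record (no asynchronous I/O, no caching).

```python
def scan_seconds(table_bytes, bandwidth_bytes_per_s):
    """Time to read the whole table sequentially."""
    return table_bytes / bandwidth_bytes_per_s

def unclustered_seconds(n_matches, access_s):
    """Worst case: every matching record costs one random disk access."""
    return n_matches * access_s

def break_even_matches(table_bytes, bandwidth_bytes_per_s, access_s):
    """Number of matching records above which a sequential scan is faster."""
    return scan_seconds(table_bytes, bandwidth_bytes_per_s) / access_s

# Assumed workload: 10M records of 100 bytes (hypothetical record size).
N_RECORDS = 10_000_000
table_bytes = N_RECORDS * 100

full_scan = scan_seconds(table_bytes, 80_000_000)
limit = break_even_matches(table_bytes, 80_000_000, 0.010)
fraction = limit / N_RECORDS
```

With these numbers the full scan takes 12.5 seconds and the break-even point is about 1250 records, or roughly 0.0125% of the table. Queries that select more than that are served faster by reading the table sequentially.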
