
Database Techniek Martin Kersten Peter Boncz CWI


©Silberschatz, Korth and Sudarshan, Database System Concepts

Outline

Introduction & Course Organization

Recap of Introductory Database Course

SQL

Relational Algebra (X100 flavor)

Storage and File Structures

Why a DBMS?

Main Advantages

Centralization (at least conceptually)

Data Independence (physical changes don’t break legacy apps)

Declarative Data Integrity Constraints

Atomic actions (DBMS recovers consistently from system crash)

Consistency under Multi-User Concurrent Updates

Declarative & Powerful Query Language, Automatically Optimized

Multi-user security

The DBMS is now the basic building block of all information systems

Almost everybody in IT works with a DBMS on a daily basis

Goal

Gain insight into the implementation techniques inside a relational DBMS

Assessment: Grade = (2*exam + practicum)/3

exam >= 6, practicum >= 6

Literature: A. Silberschatz et al., 'Database System Concepts', 4th ed., McGraw-Hill, 2002

http://www.cwi.nl/~manegold/teaching/DBtech/

Lectures

#   Date     Lecturer        Material                       Topic
1   Feb 8    Kersten/Boncz   Ch. 4 + X100 doc; Ch. 11-12    SQL + X100 algebra; Storage + B-Trees
2   Feb 15   Boncz           Ch. 13                         Query Processing
    Feb 22   Boncz           Ch. 14                         Query Optimization
3   Mar 1    Kersten         Ch. 15-17                      Transactions
4   Mar 8    Kersten/Nes                                    MonetDB/SQL
5   Mar 15   Kersten/Boncz                                  MonetDB/XQuery

Exam: last week of March

Practicum

Assignment 0:

• Hands-on experience with relational DBMSs & SQL

Assignment 1:

• Translating SQL to X100 algebra ("by hand")

Assignment 2: (choose one of)

a) Building logical cost functions for X100 algebra operations ("by hand")

b) Analyse and explain the behaviour of a query optimizer

Supervisor: Marc Makkes ([email protected])

Hard deadlines (first: Saturday, February 17, 2007, 23:59:59 CET!)

Work in pairs

Outline

Introduction & Course Organization

Recap of Introductory Database Course

SQL

Relational Algebra (X100 flavor)

Storage and File Structures

SQL re-cap: Basic Structure

A typical SQL query has the form:

select A1, A2, ..., An
from r1, r2, ..., rm
where P

The Ai represent attributes, the ri represent relations, and P is a predicate.

This query is equivalent to the relational algebra expression:

project_{A1, A2, ..., An}( select_P( r1 join_true r2 join_true ... join_true rm ) )

The result of an SQL query is again a relation. SQL relations may have duplicates.

Use select distinct to get a set
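The equivalence above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the helper `sql_select` and the toy relations `r1` and `r2` are hypothetical, rows are plain dictionaries, and attribute names are assumed to be unique across relations.

```python
from itertools import product

def sql_select(attrs, relations, predicate, distinct=False):
    """Evaluate select-from-where as project(select(join_true of all relations))."""
    rows = []
    for combo in product(*relations):        # join_true: every combination of rows
        row = {}
        for part in combo:
            row.update(part)                 # assumes unique attribute names
        if predicate(row):                   # select_P
            rows.append(tuple(row[a] for a in attrs))   # project
    if distinct:
        rows = list(dict.fromkeys(rows))     # order-preserving duplicate removal
    return rows

# Hypothetical toy relations
r1 = [{"A": 1, "B": 10}, {"A": 1, "B": 20}]
r2 = [{"C": 10}, {"C": 20}]

bag = sql_select(["A"], [r1, r2], lambda t: t["B"] == t["C"])
as_set = sql_select(["A"], [r1, r2], lambda t: t["B"] == t["C"], distinct=True)
print(bag)     # [(1,), (1,)]
print(as_set)  # [(1,)]
```

Without `distinct` the result keeps both copies of the row, which is the bag behaviour of SQL relations mentioned above.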

Relational algebra

SQL -> (parsing, normalization) -> logical algebra -> (logical query optimization, physical query optimization) -> physical algebra -> (query execution)

The Practicum

SQL -> (parsing, normalization) -> X100 algebra -> (logical query optimization, physical query optimization) -> physical algebra -> (X100 system)

X100 relational algebra

MonetDB/X100 is a CWI research project

http://www.cwi.nl/~boncz/x100.html

High-performance experimental DBMS for e.g.

• Data warehousing
• Data mining
• Information Retrieval
• Video databases (retrieval by content)

Research goal:

study the interaction between modern hardware and database internals

• High-performance algorithms, compression
• E.g. exploit CPU caches, Multi-Processors, MEMS

X100 relational algebra (Cont.)

X100 has a relational algebra interface

Table ::= table(Identifier)
          select(Table, Expr<bool>)
          project(Table, [ Expr<T> ] )
          join(Table, Table, Expr<bool>)
          aggr(Table, [ Expr<T> ], [ AggrFcn<T> ] )
          order(Table, [ Expr<T> ] )
          topn(Table, [ Expr<T> ], Expr<int> )
          Identifier = Table

select(Table, Expr<bool>)

• Relation r (the values of A and B are symbols that are missing from this copy; only C and D are shown):

C    D
1    7
5    7
12   3
23   10

• select (r, and( ==(A,B), >(D, int('5')) ) )

C    D
1    7
23   10

Functional notation for the C-like expression: A = B and D > 5

All constants are denoted as a cast: TYPE('string')
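The functional notation can be mimicked with small expression builders. This is a sketch, not the X100 implementation: the helper names (`col`, `int_`, `eq`, `gt`, `and_`) are hypothetical, the C and D values follow the slide, and the A and B values are invented placeholders chosen so that rows 1, 3 and 4 have A equal to B.

```python
# Each builder returns a function from a row (a dict) to a value.
def col(name):
    return lambda row: row[name]

def int_(text):
    value = int(text)            # constants are casts from strings, as in int('5')
    return lambda row: value

def eq(a, b):
    return lambda row: a(row) == b(row)

def gt(a, b):
    return lambda row: a(row) > b(row)

def and_(a, b):
    return lambda row: a(row) and b(row)

def select(table, expr):
    """Keep the rows for which the boolean expression holds."""
    return [row for row in table if expr(row)]

# C and D follow the slide; the A and B values are invented placeholders.
r = [
    {"A": "x", "B": "x", "C": 1,  "D": 7},
    {"A": "x", "B": "y", "C": 5,  "D": 7},
    {"A": "y", "B": "y", "C": 12, "D": 3},
    {"A": "y", "B": "y", "C": 23, "D": 10},
]

result = select(r, and_(eq(col("A"), col("B")), gt(col("D"), int_("5"))))
print([(row["C"], row["D"]) for row in result])  # [(1, 7), (23, 10)]
```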

project(Table, [ Expr<T> ] )

Relation r (the values of A are symbols that are missing from this copy):

B    C
10   1
20   1
30   1
40   2

project (r, [ A, D=*(C,int('10')) ] )

D
10
10
10
20

X100 is a bag algebra:

project does no duplicate elimination

join(Table, Table, Expr<bool>)

Relations r (attributes A, B, C, D) and t (attributes E, C):

[Figure: tables r, t, s and the join result. Recoverable columns: r.B = 1, 2, 4, 1, 2; r.D = a, a, b, a, b; E = 1, 3, 1, 2, 3; result B = 1, 1, 1, 1, 2; result D = a, a, a, a, b]

s = project( t, [ E, F=C ] )

join(r, s, ==(B,E))

The X100 join result is the union of all attributes.

Name conflicts must be resolved with an extra project

aggr(Table, [Expr<T>], [AggrFcn<T>])

Relation account grouped by branch-name:

branch-name   account-number   balance
Perryridge    A-102            400
Perryridge    A-201            900
Brighton      A-217            750
Brighton      A-215            750
Redwood       A-222            700

aggr( account, [ branch-name ], [ balance = sum(balance) ] )

branch-name   balance
Perryridge    1300
Brighton      1500
Redwood       700

Each aggregate has the form Identifier = AggrFcn(Identifier)

AggrFcn<T> ::= count<uint>()
               avg<T>(T)
               sum<T>(T)
               min<T>(T)
               max<T>(T)

aggr(Table, [Expr<T>], [AggrFcn<T>])

Relation r (attributes A, B, C; only the C values survive in this copy):

C
7
7
3
10

aggr( r, [], [total = sum(C)])

total
27

An empty group-by list gives a global aggregate

aggr(Table, [Expr<T>], [AggrFcn<T>])

Relation account grouped by branch-name, without aggregates:

aggr( account, [ branch-name ], [] )

branch-name
Perryridge
Brighton
Redwood

An empty AggrFcn list gives duplicate elimination
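The three uses of aggr (grouped aggregate, global aggregate, duplicate elimination) can be sketched with one function. This is a minimal illustration, assuming rows are dictionaries; the signature of `aggr` below is a hypothetical Python rendering, not the X100 API. The account data is taken from the slides.

```python
def aggr(table, group_by, aggregates):
    """group_by: list of column names.
    aggregates: dict mapping output name -> (function over a list, input column)."""
    groups = {}                                  # dicts preserve insertion order
    for row in table:
        key = tuple(row[c] for c in group_by)
        groups.setdefault(key, []).append(row)
    out = []
    for key, rows in groups.items():
        result = dict(zip(group_by, key))
        for name, (fn, column) in aggregates.items():
            result[name] = fn([r[column] for r in rows])
        out.append(result)
    return out

account = [
    {"branch-name": "Perryridge", "account-number": "A-102", "balance": 400},
    {"branch-name": "Perryridge", "account-number": "A-201", "balance": 900},
    {"branch-name": "Brighton",   "account-number": "A-217", "balance": 750},
    {"branch-name": "Brighton",   "account-number": "A-215", "balance": 750},
    {"branch-name": "Redwood",    "account-number": "A-222", "balance": 700},
]

per_branch = aggr(account, ["branch-name"], {"balance": (sum, "balance")})
global_total = aggr(account, [], {"total": (sum, "balance")})
branches = aggr(account, ["branch-name"], {})

print(per_branch)    # one row per branch with the summed balance
print(global_total)  # [{'total': 3500}]
print(branches)      # the three distinct branch names
```

With an empty group-by list every row falls into the single group `()`, and with an empty aggregate list only the distinct group keys remain.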

order (Table, [ Expr<T>])

• Relation r (attributes A, B, C, D; only the C and D values survive in this copy):

C    D
23   10
12   9
35   7
25   7

• orderby(r, [D,C desc])

C    D
35   7
25   7
12   9
23   10

topn(Table, [ Expr<T>], int)

• Relation r (attributes A, B, C, D; only the C and D values survive in this copy):

C    D
23   10
12   9
35   7
25   7

• topn(r, [D,C desc], int('2') )

C    D
35   7
25   7
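Ordering on several keys with mixed directions, and topn as a prefix of that order, can be sketched as follows. The function signatures are a hypothetical Python rendering; the C and D values come from the slides. A real system would typically avoid a full sort for topn (for example with a bounded heap); the sketch keeps the simplest form.

```python
def order(table, keys):
    """keys: list of (column, descending) pairs, most significant first."""
    rows = list(table)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields the combined ordering.
    for column, descending in reversed(keys):
        rows.sort(key=lambda row: row[column], reverse=descending)
    return rows

def topn(table, keys, n):
    """First n rows of the ordered table."""
    return order(table, keys)[:n]

r = [
    {"C": 23, "D": 10},
    {"C": 12, "D": 9},
    {"C": 35, "D": 7},
    {"C": 25, "D": 7},
]
keys = [("D", False), ("C", True)]      # D ascending, then C descending

ordered = order(r, keys)
top2 = topn(r, keys, 2)
print([(row["C"], row["D"]) for row in ordered])  # [(35, 7), (25, 7), (12, 9), (23, 10)]
print([(row["C"], row["D"]) for row in top2])     # [(35, 7), (25, 7)]
```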

TPC-H: Data Warehousing Scenario

http://www.tpc.org
• TPC-C: transaction processing
• TPC-H: data warehousing

Large repository of data about Orders, consisting of Lineitems, delivered to Customers.

CUSTOMER 1:n ORDER 1:n LINEITEM

Query 3:

"Give date, priority and sum of the top 10 high revenue orders for construction customers that had been ordered but not yet shipped on March 15"

SQL Data Warehousing Query (TPC-H Query 3)

select l_orderkey, o_orderdate, o_shippriority,
       sum(l_extendedprice * (1 - l_discount)) as revenue
from customer, orders, lineitem
where c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and c_mktsegment = 'BUILDING'
  and o_orderdate < date '1995-03-15'
  and l_shipdate > date '1995-03-15'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate
limit 10;

SQL to Algebra translation

The same query, annotated step by step with the algebra operator each part maps to:

join: c_custkey = o_custkey and l_orderkey = o_orderkey

select: c_mktsegment = 'BUILDING' and o_orderdate < date '1995-03-15' and l_shipdate > date '1995-03-15'

aggr: group by l_orderkey, o_orderdate, o_shippriority with sum(l_extendedprice * (1 - l_discount)) as revenue

topn: order by revenue desc, o_orderdate limit 10

Query in X100 Algebra
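The content of this slide did not survive in this copy. As a stand-in, the sketch below shows one way TPC-H Query 3 could be composed from join, select, aggr and topn. It is not the X100 plan from the lecture: the Python helpers, the tiny tables and the choice to apply the selections before the joins are all assumptions made for illustration.

```python
from datetime import date

def select(table, pred):
    return [row for row in table if pred(row)]

def join(left, right, pred):
    """Nested-loop join; the result row is the union of all attributes."""
    return [{**l, **r} for l in left for r in right if pred({**l, **r})]

def aggr(table, group_by, name, fn):
    groups = {}
    for row in table:
        groups.setdefault(tuple(row[c] for c in group_by), []).append(row)
    return [{**dict(zip(group_by, key)), name: fn(rows)} for key, rows in groups.items()]

def topn(table, sort_key, n):
    return sorted(table, key=sort_key)[:n]

# Hypothetical miniature data set
customer = [{"c_custkey": 1, "c_mktsegment": "BUILDING"},
            {"c_custkey": 2, "c_mktsegment": "AUTOMOBILE"}]
orders = [{"o_orderkey": 10, "o_custkey": 1, "o_orderdate": date(1995, 3, 1), "o_shippriority": 0},
          {"o_orderkey": 11, "o_custkey": 1, "o_orderdate": date(1995, 3, 20), "o_shippriority": 0},
          {"o_orderkey": 12, "o_custkey": 2, "o_orderdate": date(1995, 3, 1), "o_shippriority": 0}]
lineitem = [{"l_orderkey": 10, "l_extendedprice": 100.0, "l_discount": 0.5, "l_shipdate": date(1995, 3, 20)},
            {"l_orderkey": 10, "l_extendedprice": 200.0, "l_discount": 0.0, "l_shipdate": date(1995, 3, 25)},
            {"l_orderkey": 10, "l_extendedprice": 50.0, "l_discount": 0.0, "l_shipdate": date(1995, 3, 10)},
            {"l_orderkey": 12, "l_extendedprice": 300.0, "l_discount": 0.0, "l_shipdate": date(1995, 4, 1)}]

cutoff = date(1995, 3, 15)
plan = topn(
    aggr(
        join(
            join(
                select(customer, lambda r: r["c_mktsegment"] == "BUILDING"),
                select(orders, lambda r: r["o_orderdate"] < cutoff),
                lambda r: r["c_custkey"] == r["o_custkey"]),
            select(lineitem, lambda r: r["l_shipdate"] > cutoff),
            lambda r: r["l_orderkey"] == r["o_orderkey"]),
        ["l_orderkey", "o_orderdate", "o_shippriority"],
        "revenue",
        lambda rows: sum(r["l_extendedprice"] * (1 - r["l_discount"]) for r in rows)),
    lambda r: (-r["revenue"], r["o_orderdate"]),   # revenue desc, o_orderdate asc
    10)
print(plan)
```

On this data only order 10 qualifies, with revenue 100.0 * 0.5 + 200.0 = 250.0.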

Outline

Introduction & Course Organization

Recap of Introductory Database Course

SQL

Relational Algebra (X100 flavor)

Storage and File Structures

Storage Hierarchy

                      size    bandwidth          latency       EUR/GB   unit
Tape (HP)             300GB   80MB/s             10 min        0.10     32KB
Magnetic disk (IDE)   300GB   80MB/s             10000000ns    0.30     8KB
NAND Flash            4GB     60MB/s (20MB/s)    100000ns      20       2KB
RAM (DDR2)            2GB     3000MB/s           70ns          60       64B
L2 CPU cache          2MB     7000MB/s           10ns                   64B
L1 CPU cache          64KB    24000MB/s          1ns                    64B
CPU registers         128B    24000MB/s          1                      8B

Hardware Trends

[Plot; series: CPU speed (KHz), RAM size (KB), disk size (MB), RAM bandwidth (MB/s), disk bandwidth (MB/s), RAM latency (ns), disk latency (ms)]

Storage Hierarchy (Cont.)

primary storage: fastest media but volatile (cache, main memory).

secondary storage: next level in hierarchy, non-volatile, moderately fast access time
• also called on-line storage
• E.g. flash memory, magnetic disks

tertiary storage: lowest level in hierarchy, non-volatile, slow access time
• also called off-line storage
• E.g. magnetic tape, optical storage

Magnetic Hard Disk Mechanism

NOTE: Diagram is schematic, and simplifies the structure of actual disk drives

Performance Measures of Disks

Access time – the time it takes from when a read or write request is issued to when data transfer begins. Consists of:

• Seek time – time it takes to reposition the arm over the correct track.
  Average seek time is 1/2 the worst case seek time.
  – Would be 1/3 if all tracks had the same number of sectors, and we ignore the time to start and stop arm movement
  4 to 10 milliseconds on typical disks

• Rotational latency – time it takes for the sector to be accessed to appear under the head.
  Average latency is 1/2 of the worst case latency.
  4 to 11 milliseconds on typical disks (5400 to 15000 r.p.m.)

Data-transfer rate – the rate at which data can be retrieved from or stored to the disk.

• 20 to 60 MB per second is typical
• Multiple disks may share a controller, so the rate that the controller can handle is also important
  E.g. ATA: 100 MB/second, SCSI: 320 MB/second

Magnetic Disk Hardware Trends

Performance Measures (Cont.)

Mean time to failure (MTTF) – the average time the disk is expected to run continuously without any failure.

• Typically 3 to 5 years

• Probability of failure of new disks is quite low, corresponding to a "theoretical MTTF" of 30,000 to 1,200,000 hours for a new disk

  E.g., an MTTF of 1,200,000 hours for a new disk means that given 1000 relatively new disks, on average one will fail every 1200 hours

• MTTF decreases as the disk ages

RAID

RAID: Redundant Arrays of Independent Disks

Disk organization techniques that manage a large number of disks, providing a view of a single disk of

• high capacity and high speed by using multiple disks in parallel, and
• high reliability by storing data redundantly, so that data can be recovered even if a disk fails

The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail.

E.g., a system with 100 disks, each with MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of 1000 hours (approx. 41 days)

Techniques for using redundancy to avoid data loss are critical with large numbers of disks
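The arithmetic behind the 100-disk example can be checked with a few lines. The sketch assumes independent disk failures at a constant rate, which is the usual simplification behind this rule of thumb; the function name is hypothetical.

```python
def system_mttf_hours(disk_mttf_hours, n_disks):
    """Expected time until the first of n_disks fails.

    Assumes independent failures at a constant rate, so failure rates add up.
    """
    return disk_mttf_hours / n_disks

hours = system_mttf_hours(100_000, 100)
print(hours)        # 1000.0 hours
print(hours / 24)   # roughly 41.7 days

# The earlier MTTF example: 1000 disks with an MTTF of 1,200,000 hours each
print(system_mttf_hours(1_200_000, 1000))   # 1200.0 hours between failures
```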

Improvement of Reliability via Redundancy

Redundancy – store extra information that can be used to rebuild information lost in a disk failure

E.g., Mirroring (or shadowing)

• Duplicate every disk. Logical disk consists of two physical disks.
• Every write is carried out on both disks
  Reads can take place from either disk
• If one disk in a pair fails, data is still available in the other
  Data loss would occur only if a disk fails, and its mirror disk also fails before the system is repaired
  – Probability of combined event is very small
    » Except for dependent failure modes such as fire or building collapse or electrical power surges

Mean time to data loss depends on mean time to failure, and mean time to repair

E.g. MTTF of 100,000 hours, mean time to repair of 10 hours gives mean time to data loss of 500*10^6 hours (or 57,000 years) for a mirrored pair of disks (ignoring dependent failure modes)
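The figure for the mirrored pair follows from the standard approximation MTTF^2 / (2 * MTTR): one of the two disks fails every MTTF/2 hours on average, and data is lost only if the partner also fails within the repair window, which happens with probability of about MTTR/MTTF. The sketch below reproduces the number under that assumption (independent failures, repair time much shorter than MTTF); the function name is hypothetical.

```python
def mirrored_pair_mttdl_hours(mttf_hours, mttr_hours):
    """Approximate mean time to data loss of a mirrored pair: MTTF^2 / (2 * MTTR).

    Assumes independent failures and a repair time much shorter than the MTTF.
    """
    return mttf_hours ** 2 / (2 * mttr_hours)

mttdl = mirrored_pair_mttdl_hours(100_000, 10)
years = mttdl / (24 * 365)
print(mttdl)   # 500000000.0 hours
print(years)   # roughly 57,000 years
```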

RAID Levels

Schemes to provide redundancy at lower cost by using disk striping combined with parity bits

Different RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics

RAID Level 0: Block striping; non-redundant.
• Used in high-performance applications where data loss is not critical.

RAID Level 1: Mirrored disks with block striping
• Offers best write performance.
• Popular for applications such as storing log files in a database system.

RAID Levels (Cont.)

RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk.

E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.
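The placement rule can be written out directly. The function names are hypothetical; disks are numbered from 1 as on the slide.

```python
def parity_disk(n, disks=5):
    """Disk (numbered from 1) holding the parity block of the n-th set of blocks."""
    return (n % disks) + 1

def data_disks(n, disks=5):
    """The remaining disks, which hold the data blocks of the n-th set."""
    p = parity_disk(n, disks)
    return [d for d in range(1, disks + 1) if d != p]

for n in range(6):
    print(n, parity_disk(n), data_disks(n))
# The parity block rotates over disks 1, 2, 3, 4, 5 and then wraps to disk 1,
# so no single disk becomes the parity bottleneck.
```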

Choice of RAID Level

Level 0 provides maximum performance, no safety

Level 1 provides much better write performance than level 5
• Level 5 requires at least 2 block reads and 2 block writes to write a single block, whereas Level 1 only requires 2 block writes
• Level 1 preferred for high update environments such as log disks

Level 1 has higher storage cost than level 5
• disk drive capacities are increasing rapidly (50%/year) whereas disk access times have decreased much less (x 3 in 10 years)
• I/O requirements have increased greatly, e.g. for Web servers
• When enough disks have been bought to satisfy the required rate of I/O, they often have spare storage capacity, so there is often no extra monetary cost for Level 1!

Level 5 is preferred for applications with low update rate, and large amounts of data

Level 1 is preferred for all other applications

Hardware Issues

Hot swapping: replacement of disk while system is running, without power down
• Supported by some hardware RAID systems
• Reduces time to recovery, and improves availability greatly

Many systems maintain spare disks which are kept online, and used as replacements for failed disks immediately on detection of failure
• Reduces time to recovery greatly

Many hardware RAID systems ensure that a single point of failure will not stop the functioning of the system by using
• Redundant power supplies with battery backup
• Multiple controllers and multiple interconnections to guard against controller/interconnection failures

Index Classification

Primary vs. Secondary
• primary – the index on the primary key
• unique – an index on a candidate key
• secondary – not primary

Clustered vs. Unclustered
• clustered – key order corresponds with record order
  – E.g. B-tree separate from record file
  – Index-organized table: B-tree leaves store records (no file)
• unclustered – index contains record-IDs in random order

B+Tree, n=4

[Figure: example tree]

Root: [100]

Non-leaf nodes: [30] and [120, 150, 180]

Leaves: [3, 5, 11] [30, 35] [100, 101, 110] [120, 130] [150, 156, 179] [180, 200]

Sample non-leaf

Keys: 57, 81, 95

Pointers, left to right:
• to keys k < 57
• to keys 57 ≤ k < 81
• to keys 81 ≤ k < 95
• to keys k ≥ 95
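Choosing the child pointer in such a node is a search over the sorted keys. A minimal sketch using Python's standard `bisect` module; the function name `child_index` is hypothetical and the pointers are represented simply by their position 0 to 3.

```python
from bisect import bisect_right

def child_index(keys, k):
    """Index of the child pointer to follow for search key k.

    Child i covers keys[i-1] <= k < keys[i]; the first child covers
    k < keys[0] and the last child covers k >= keys[-1].
    """
    return bisect_right(keys, k)

keys = [57, 81, 95]
for k in (10, 57, 80, 81, 95, 200):
    print(k, child_index(keys, k))
```

`bisect_right` counts the keys that are less than or equal to k, which is exactly the pointer position in this layout.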

Sample leaf node:

Reached from a non-leaf node

Keys: 57, 81, 95

Pointers:
• to record with key 57
• to record with key 81
• to record with key 95
• to next leaf in sequence

Non-root nodes have to be at least half-full

Use at least

Non-leaf: ⌈n/2⌉ children

Leaf: ⌈(n-1)/2⌉ pointers to data
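The half-full rule can be evaluated for a given n. The sketch assumes that the divisions are rounded up (ceilings), which matches the n=4 examples in these slides, where a minimal non-leaf node has 2 children and a minimal leaf holds 2 keys; the function names are hypothetical.

```python
import math

def min_children_nonleaf(n):
    """Minimum number of children of a non-root, non-leaf node (assumes rounding up)."""
    return math.ceil(n / 2)

def min_pointers_leaf(n):
    """Minimum number of data pointers in a non-root leaf (assumes rounding up)."""
    return math.ceil((n - 1) / 2)

for n in (4, 5, 100):
    print(n, min_children_nonleaf(n), min_pointers_leaf(n))
```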

n=4         Full node        Min. node
Non-leaf    120, 150, 180    30
Leaf        3, 5, 11         30, 35

Insert into B+tree

(a) simple case: space available in leaf

(b) leaf overflow

(c) non-leaf overflow

(d) new root

(simple case) Insert key = 32, n=4

[Figure: parent keys 30, 100; leaves [3, 5, 11] and [30, 31]. Key 32 goes into the free slot of leaf [30, 31], giving [30, 31, 32]]

(leaf overflow) Insert key = 7, n=4

[Figure: parent keys 30, 100; leaves [3, 5, 11] and [30, 31]. The full leaf [3, 5, 11] splits into [3, 5] and [7, 11]; key 7 is added to the parent]

(internal overflow) Insert key = 160, n=4

[Figure: root [100]; non-leaf node [120, 150, 180] is full; leaves [150, 156, 179] and [180, 200]. The leaf [150, 156, 179] splits into [150, 156] and [160, 179]; the non-leaf node has no room for key 160, so it splits into [120, 150] and [180], and 160 moves up into the root]

(new root) Insert key = 45, n=4

[Figure: root [10, 20, 30] is full; leaves [1, 2, 3], [10, 12], [20, 25], [30, 32, 40]. The leaf [30, 32, 40] splits into [30, 32] and [40, 45]; the old root splits into [10, 20] and [40], and a new root [30] is created]

Exercise, n=4

insert:

1, 2, 10, 20, 3, 12, 30, 32, 25, 40, 45

Interesting problem:

For B+tree, how large should n be?

n is the number of keys per node

Assumptions

You have the right to set the disk page size for the disk where a B-tree will reside.

Compute the optimum page size n assuming that

• The items are 4 bytes long and the pointers are also 4 bytes long.

• Time to read a node from disk is 10 + 0.0002n

• Time to process a block in memory is unimportant

• B+tree is full (i.e., every page has the maximum number of items and pointers)

FIND n_opt by f'(n) = 0

What happens to n_opt as

• Disk bandwidth increases?

• Access time stays behind?

• CPUs get faster?

f(n) = time to find a record = log_n(T) * (10 + 0.0002n), where T is the number of records in the table

1994 (book): n = 500; 2004 (now): n = 4000

1994

• Table 1M records
• 10ms access time
• 4MB/s bandwidth
• n ~ 500-1000
• 4KB / 8KB pages
• Be conservative to limit RAM consumption

2004

• Table 10M records
• 6ms access time
• 40MB/s bandwidth
• n ~ 1000-4000
• 8KB / 32KB pages
• relative benefit decreases so don't overdo it

FIND n_opt by f'(n) = 0

Answer should be n_opt = "few thousand"

What happens to n_opt as

• block sizes are increasing..

• Disk bandwidth increases?

• Access time stays behind?

• CPUs get faster?
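The "few thousand" answer can be checked numerically. For f(n) = log_n(T) * (a + b*n), setting f'(n) = 0 gives b * n * (ln n - 1) = a, which does not depend on T. The sketch below solves that equation by bisection for the constants a = 10 and b = 0.0002 used in these slides; the function and parameter names are hypothetical.

```python
import math

def find_time(n, T, access=10.0, per_key=0.0002):
    """f(n) = log_n(T) * (access + per_key * n): tree height times cost per node read."""
    return math.log(T, n) * (access + per_key * n)

def optimal_n(access=10.0, per_key=0.0002):
    """Solve per_key * n * (ln n - 1) = access, the condition f'(n) = 0, by bisection."""
    def g(n):
        return per_key * n * (math.log(n) - 1) - access
    lo, hi = 3.0, 1e9            # g(lo) < 0 < g(hi), and g is increasing for n > 1
    while hi - lo > 1e-6:
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n_opt = optimal_n()
print(round(n_opt))                # a few thousand keys per node
print(find_time(500, 1e6))         # lookup time with n = 500
print(find_time(n_opt, 1e6))       # lookup time at the optimum
```

Calling `optimal_n` with a different `access` or `per_key` value is one way to explore the questions above about faster disks and lagging access times.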

Primary or Auxiliary Structure

Primary index
• Leaf blocks in sequence: clustered index
• Main storage structure for a database table
  – E.g. B+-tree organized file / hash structured files
• Typically an index on a unique key
  – But not necessarily
• Normally, you can have only one clustered index!

Secondary index
• Also called unclustered index
• A separate file from where the table is stored
• Refers with (block/offset) pointers to records in the table file
• You can define as many as you want (to maintain)

Clustered vs. Unclustered Index

Range query from key low to key high:

Primary B-Tree index: 1 access only (rest is 'just' bandwidth)

Secondary B-tree index: pay N times access cost

Are Unclustered Indices a Good Idea?

Secondary indices depend on random I/O

• can do asynchronous I/O (multiple I/Os at a time)

• degenerates into full table scans

Block size for sequential reads?

When do random I/Os make sense?

Are Unclustered Indices a Good Idea? (Cont.)

Is not using an index at all better?

I.e. read the entire table sequentially without any index

Use redundant clustered orderings

– Materialized views

– C-STORE (Stonebraker et al., VLDB 2005), MonetDB/X100

– Database Cracking (Kersten, CIDR 2005+2007)
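The question whether a sequential scan beats an unclustered index can be made quantitative with a simple cost model. The sketch uses the 2004 figures from these slides (6ms access time, 40MB/s bandwidth, 10M records); the record size of 100 bytes is an invented assumption, and the model ignores caching, asynchronous I/O and several matching records sharing one block.

```python
def scan_seconds(table_bytes, bandwidth_bytes_per_s, access_s):
    """Full sequential scan: one random access, then pure bandwidth."""
    return access_s + table_bytes / bandwidth_bytes_per_s

def unclustered_seconds(n_records, access_s):
    """Unclustered index lookup: pay one random access per matching record."""
    return n_records * access_s

def break_even_records(table_bytes, bandwidth_bytes_per_s, access_s):
    """Number of matching records at which both strategies cost the same."""
    return scan_seconds(table_bytes, bandwidth_bytes_per_s, access_s) / access_s

n_records = 10_000_000
record_bytes = 100                       # hypothetical record size
table_bytes = n_records * record_bytes   # 1 GB table
access_s = 0.006                         # 6ms access time
bandwidth = 40_000_000                   # 40MB/s

k = break_even_records(table_bytes, bandwidth, access_s)
print(scan_seconds(table_bytes, bandwidth, access_s))   # about 25 seconds
print(k)                                                # about 4168 records
print(100 * k / n_records)                              # about 0.04 percent of the table
```

Under these assumptions the unclustered index only pays off when a query touches well under 0.1 percent of the records, which is the motivation for the redundant clustered orderings listed above.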