CENG 351 File Structures 1

Multilevel Indexing and B+ Trees

Problems with simple indexes

If index does not fit in memory:

1. Seeking the index is slow (binary search):

– We don’t want more than 3 or 4 seeks for a search.

2. Insertions and deletions take O(N) disk accesses.

Seeks needed by a binary search over N keys:

N            log2(N+1)
15           4
1,000        ~10
100,000      ~17
1,000,000    ~20
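
The seek counts in this table can be reproduced with a few lines of Python. This is a minimal sketch that assumes the logarithm is base 2 and is rounded up to a whole number of seeks; the function name `binary_search_seeks` is made up for illustration.

```python
import math

def binary_search_seeks(n_keys: int) -> int:
    """Worst-case number of probes for a binary search over n_keys sorted keys."""
    return math.ceil(math.log2(n_keys + 1))

# Key counts taken from the slide.
for n in (15, 1_000, 100_000, 1_000_000):
    print(f"{n:>9} keys: {binary_search_seeks(n)} seeks")
```

Even one million keys need about 20 probes, far more than the 3 or 4 seeks the slide sets as a target, which motivates the multilevel structures that follow.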

Indexed Sequential Files

• Provide a choice between two alternative views of a file:

1. Indexed: the file can be seen as a set of records that is indexed by key; or

2. Sequential: the file can be accessed sequentially (physically contiguous records), returning records in order by key.

Example of applications

• Student record system in a university:

– Indexed view: access to individual records

– Sequential view: batch processing when posting grades

• Credit card system:

– Indexed view: interactive check of accounts

– Sequential view: batch processing of payments

The initial idea

• Maintain a sequence set:

– Group the records into blocks in a sorted way.

– Maintain the order in the blocks as records are added or deleted through splitting, concatenation, and redistribution.

• Construct a simple, single-level index for these blocks.

– Choose to build an index that contains the key for the last record in each block.

Maintaining a Sequence Set

• Sorting and reorganizing the whole file after every insertion or deletion is out of the question. Instead, we organize the sequence set in the following way:

– Records are grouped in blocks.

– Blocks should be at least half full.

– Link fields are used to point to the preceding block and the following block (similar to doubly linked lists)

– Changes (insertion/deletion) are localized into blocks by performing:

• Block splitting when insertion causes overflow

• Block merging or redistribution when deletion causes underflow.

Example: insertion

• Block size = 4

• Key: Last name

Block 1: ADAMS … BIXBY … CARSON … COLE …

• Insert “BAIRD …”:

Block 1: ADAMS … BAIRD … BIXBY …

Block 2: CARSON … COLE …
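
The split in this example can be sketched in Python. This is only an illustration under stated assumptions: records are reduced to their keys, a Python list of lists stands in for the linked blocks (list order plays the role of the link fields), and the helper name `insert_record` is hypothetical.

```python
from bisect import insort

BLOCK_SIZE = 4  # records per block, as on the slide

def insert_record(blocks, key):
    """Insert key into a non-empty list of sorted blocks, splitting on overflow."""
    # Choose the first block whose last key is >= key, else the last block.
    i = next((j for j, b in enumerate(blocks) if key <= b[-1]), len(blocks) - 1)
    block = blocks[i]
    insort(block, key)                      # keep the block sorted
    if len(block) > BLOCK_SIZE:             # overflow: split into two blocks
        mid = (len(block) + 1) // 2         # left block keeps the extra record
        blocks[i:i + 1] = [block[:mid], block[mid:]]
    return blocks

blocks = [["ADAMS", "BIXBY", "CARSON", "COLE"]]
insert_record(blocks, "BAIRD")
print(blocks)
```

After the split both blocks are at least half full, which is the invariant the sequence set maintains.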

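Example: deletion

Block 1: ADAMS … BAIRD … BIXBY … BOONE …

Block 2: BYNUM … CARSON … CARTER …

Block 3: DENVER … ELLIS …

Block 4: COLE … DAVIS …

(In key order, Block 4 falls between Block 2 and Block 3; the link fields, not the physical positions, give the sequence.)

• Delete “DAVIS”, “BYNUM”, “CARTER”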

Add an Index set

Key      Block
BERNE    1
CAGE     2
DUTTON   3
EVANS    4
FOLK     5
GADDIS   6
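
A lookup in this single-level index can be sketched as follows. The sketch assumes each index key is the last key of its block, as stated on the earlier slide; the names `INDEX_KEYS`, `BLOCK_NUMBERS` and `find_block` are illustrative only.

```python
from bisect import bisect_left

# Last key of each block and the matching block number, from the slide.
INDEX_KEYS = ["BERNE", "CAGE", "DUTTON", "EVANS", "FOLK", "GADDIS"]
BLOCK_NUMBERS = [1, 2, 3, 4, 5, 6]

def find_block(key):
    """Return the block that could hold key, or None if key is past the last block."""
    i = bisect_left(INDEX_KEYS, key)   # first index key >= search key
    if i == len(INDEX_KEYS):
        return None
    return BLOCK_NUMBERS[i]

print(find_block("BOLEN"))   # lies between BERNE and CAGE
```

Because the index holds the last key of every block, the first index key that is greater than or equal to the search key identifies the block.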

Tree indexes

• This simple scheme is nice if the index fits in memory.

• If index doesn’t fit in memory:

– Divide the index structure into blocks,

– Organize these blocks in the same way, building a tree structure.

• Tree indexes:

– B Trees

– B+ Trees

– Simple prefix B+ Trees

– …

Separators

Block   Range of Keys    Separator (between this block and the next)
1       ADAMS-BERNE      BOLEN
2       BOLEN-CAGE       CAMP
3       CAMP-DUTTON      EMBRY
4       EMBRY-EVANS      FABER
5       FABER-FOLK       FOLKS
6       FOLKS-GADDIS

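Index set

Root: EMBRY
  Index block: BOLEN CAMP -> blocks 1, 2, 3
  Index block: FABER FOLKS -> blocks 4, 5, 6

Sequence set

Block 1: ADAMS-BERNE
Block 2: BOLEN-CAGE
Block 3: CAMP-DUTTON
Block 4: EMBRY-EVANS
Block 5: FABER-FOLK
Block 6: FOLKS-GADDIS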
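
A search through this two-level index set can be sketched in Python. The sketch reads the figure as a root holding EMBRY over two index blocks (BOLEN, CAMP) and (FABER, FOLKS); that reading, the tuple representation of nodes, and the name `find_block` are assumptions made for illustration.

```python
from bisect import bisect_right

# Each node is (separators, children); a child is a nested node or a block number.
LEFT = (["BOLEN", "CAMP"], [1, 2, 3])
RIGHT = (["FABER", "FOLKS"], [4, 5, 6])
ROOT = (["EMBRY"], [LEFT, RIGHT])

def find_block(key, node=ROOT):
    """Follow separators down the index set to a sequence-set block number."""
    separators, children = node
    # Keys below the first separator go left; keys equal to a separator go right.
    child = children[bisect_right(separators, key)]
    return child if isinstance(child, int) else find_block(key, child)

print(find_block("CAGE"))
```

Each lookup touches one node per level, which is what keeps the number of seeks small when the index no longer fits in memory.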

B Trees

• The B-tree is one of the most important data structures in computer science.

• What does B stand for? (Not binary!)

• A B-tree is a multiway search tree.

• Several versions of B-trees have been proposed, but only B+ trees have been used with large files.

• A B+ tree is a B-tree in which data records are in leaf nodes, and faster sequential access is possible.

Formal definition of B+ Tree Properties

• Properties of a B+ Tree of order v:

– Every internal node (except the root) has at least v keys and at most 2v keys.

– The root has at least 2 children unless it’s a leaf.

– All leaves are on the same level.

– An internal node with k keys has k+1 children.

B+ tree: Internal/root node structure

$P_0 \; K_1 \; P_1 \; K_2 \; \dots \; P_{n-1} \; K_n \; P_n$

Requirements:

$K_1 < K_2 < \dots < K_n$

For any search key value $K$ in the subtree pointed to by $P_i$:

If $P_i = P_0$, we require $K < K_1$

If $P_i = P_n$, we require $K_n \le K$

If $P_i = P_1, \dots, P_{n-1}$, we require $K_i \le K < K_{i+1}$

Each $P_i$ is a pointer to a child node; each $K_i$ is a search key value

Number of search key values = $n$, number of pointers = $n+1$
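
The three requirements above amount to a single rule for choosing a pointer: follow the pointer whose index equals the number of keys that are less than or equal to the search key. A minimal Python sketch, with the function name `child_index` chosen for illustration:

```python
from bisect import bisect_right

def child_index(keys, k):
    """Index i of the pointer P_i to follow for search key k.

    keys is the sorted list [K_1, ..., K_n] of one internal node:
      k < K_1            -> 0
      K_i <= k < K_{i+1} -> i
      K_n <= k           -> n
    """
    return bisect_right(keys, k)

keys = [13, 17, 24, 30]          # a node with n = 4 keys and 5 pointers
print([child_index(keys, k) for k in (5, 13, 16, 24, 99)])
```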

B+ tree: leaf node structure

$L \; K_1 \; r_1 \; K_2 \; \dots \; K_n \; r_n \; R$

Pointer L points to the left neighbor; R points to the right neighbor

$K_1 < K_2 < \dots < K_n$

$v \le n \le 2v$ (v is the order of this B+ tree)

We will use $K_i$* for the pair $\langle K_i, r_i \rangle$ and omit L and R for simplicity

Example: B+ tree with order of 1

• Each node must hold at least 1 entry, and at most 2 entries

Root: 40
  Index node: 20 33
    Leaves: 10* 15* | 20* 27* | 33* 37*
  Index node: 51 63
    Leaves: 40* 46* | 51* 55* | 63* 97*

Example: Search in a B+ tree of order 2

• Search: how to find the records with a given search key value?

– Begin at root, and use key comparisons to go to leaf

• Examples: search for 5*, 16*, all data entries >= 24* ...

– The last one is a range search: we need to do a sequential scan, starting from the first leaf containing a value >= 24.

Root: 13 17 24 30
  Leaves: 2* 3* 5* 7* | 14* 15* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*
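
The three example searches can be traced with a small Python sketch of this tree. It is a simplification: the tree has only a root and one leaf level, the position of a leaf in the `LEAVES` list stands in for the right-neighbor pointer, and all names are illustrative.

```python
from bisect import bisect_right

ROOT_KEYS = [13, 17, 24, 30]
LEAVES = [[2, 3, 5, 7], [14, 15], [19, 20, 22], [24, 27, 29], [33, 34, 38, 39]]

def find_leaf(key):
    """Index of the leaf that would contain key."""
    return bisect_right(ROOT_KEYS, key)

def search(key):
    """Equality search: one root comparison step, then look inside one leaf."""
    return key in LEAVES[find_leaf(key)]

def range_from(low):
    """Range search: locate the first leaf, then scan leaves left to right."""
    i = find_leaf(low)
    result = [k for k in LEAVES[i] if k >= low]
    for leaf in LEAVES[i + 1:]:      # follow the chain of right neighbors
        result.extend(leaf)
    return result

print(search(5), search(16), range_from(24))
```

Searching for 16* ends in the leaf holding 14* and 15* and reports that the entry is absent; the range query descends once and then never returns to the index.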

Cost for searching a value in B+ tree

• Typically, a node is a page (block or cluster)

• Let H be the height of the B+ tree: we need to read H+1 pages to reach a leaf node

• Let F be the (average) number of pointers in a node (for an internal node, called the fanout)

– Level 1 = 1 page = $F^0$ page

– Level 2 = F pages = $F^1$ pages

– Level 3 = F * F pages = $F^2$ pages

– Level H+1 = … = $F^H$ pages (i.e., leaf nodes)

– Suppose there are D data entries. So there are D/(F-1) leaf nodes

– $D/(F-1) = F^H$. That is, $H = \log_F\left(\frac{D}{F-1}\right)$

[Figure: tree levels 1, 2, … shown against heights 0, 1, …]

B+ Trees in Practice

• Typical order: 100. Typical fill-factor: 67%.

– average fanout = 133 (i.e., # of pointers in internal node)

• Can often hold top levels in buffer pool:

– Level 1 = 1 page = 8 Kbytes

– Level 2 = 133 pages ≈ 1 Mbyte

– Level 3 = 17,689 pages ≈ 133 MBytes

• Suppose there are 1,000,000,000 data entries.

– $H = \log_{133}(1000000000/132) < 4$

– The cost is 5 pages read (H + 1, rounding H up to 4)
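
The arithmetic on this slide can be checked with a short Python sketch. It assumes, as the slides do, that every node has the average fanout F and that a leaf holds F-1 data entries; the function name `btree_height` is illustrative.

```python
import math

def btree_height(data_entries: int, fanout: int) -> float:
    """H = log_F(D / (F - 1)), the height formula from the slides."""
    leaves = data_entries / (fanout - 1)
    return math.log(leaves, fanout)

F = 133                      # average fanout from the slide
D = 1_000_000_000            # data entries from the slide

h = btree_height(D, F)
pages_read = math.ceil(h) + 1            # H + 1 pages, with H rounded up
top_levels = [F ** i for i in range(3)]  # pages on levels 1, 2, 3

print(f"H = {h:.2f}, pages read = {pages_read}, top levels = {top_levels}")
```

With H a little above 3, a lookup among a billion entries costs at most 5 page reads, and fewer physical reads when the top levels stay in the buffer pool.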

How to Insert a Data Entry into a B+ Tree?

• Let’s look at several examples first.

Inserting 16*, 8* into Example B+ tree

Root: 13 17 24 30

• 16* fits into its leaf: 14* 15* 16*

• 8* belongs in the leaf 2* 3* 5* 7*, which would then hold 2* 3* 5* 7* 8*: overflow!

• One new child (leaf node) is generated; we must add one more pointer to its parent, and thus one more key value as well.

Inserting 8* (cont.)

• Copy up the middle value (leaf split)

Leaf after split: 2* 3* | 5* 7* 8*

Entry to be inserted in parent node: 5

(Note that 5 is copied up and continues to appear in the leaf.)

Parent before insertion: 13 17 24 30

Parent after insertion: 5 13 17 24 30 (overflow!)

Insertion into B+ tree (cont.)

• Understand difference between copy-up and push-up

• Observe how minimum occupancy is guaranteed in both leaf and index page splits.

Overflowing index node: 5 13 17 24 30

We split this node, redistribute entries evenly, and push up middle key.

Index nodes after split: 5 13 | 24 30

Entry to be inserted in parent node: 17

(Note that 17 is pushed up and only appears once in the index. Contrast this with a leaf split.)

Example B+ Tree After Inserting 8*

Notice that root was split, leading to increase in height.

Root: 17
  Index node: 5 13
    Leaves: 2* 3* | 5* 7* 8* | 14* 15*
  Index node: 24 30
    Leaves: 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*

Inserting a Data Entry into a B+ Tree: Summary

• Find correct leaf L.

• Put data entry onto L.

– If L has enough space, done!

– Else, must split L (into L and a new node L2)

• Redistribute entries evenly, put middle key in L2

• copy up middle key.

• Insert index entry pointing to L2 into parent of L.

• This can happen recursively

– To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)

• Splits “grow” tree; root split increases height.

– Tree growth: gets wider or one level taller at top.
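
The insertion summary can be sketched in Python. This is a minimal illustration, not a full implementation: data entries are bare keys, leaf neighbor pointers are omitted, `d` is the order, and the names `Node`, `insert` and `insert_root` are invented for the sketch. The example rebuilds the order-2 tree used on the earlier slides and inserts 8*.

```python
from bisect import bisect_right, insort

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys            # sorted keys (data entries in a leaf)
        self.children = children    # None for a leaf, list of Nodes otherwise

def insert(node, key, d):
    """Insert key under node. Returns (key_for_parent, new_right_node) on split."""
    if node.children is None:                       # leaf
        insort(node.keys, key)
        if len(node.keys) <= 2 * d:
            return None                             # enough space, done
        mid = len(node.keys) // 2
        right = Node(node.keys[mid:])               # middle key goes into L2
        node.keys = node.keys[:mid]
        return right.keys[0], right                 # COPY up the middle key
    i = bisect_right(node.keys, key)
    split = insert(node.children[i], key, d)
    if split is None:
        return None
    sep, new_child = split
    node.keys.insert(i, sep)                        # index entry for new child
    node.children.insert(i + 1, new_child)
    if len(node.keys) <= 2 * d:
        return None
    mid = len(node.keys) // 2
    up = node.keys[mid]                             # PUSH up: leaves this node
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
    return up, right

def insert_root(root, key, d):
    """Insert key and return the (possibly new) root."""
    split = insert(root, key, d)
    if split is None:
        return root
    sep, right = split
    return Node([sep], [root, right])               # root split: height grows

leaves = [Node(ks) for ks in ([2, 3, 5, 7], [14, 15], [19, 20, 22],
                              [24, 27, 29], [33, 34, 38, 39])]
root = insert_root(Node([13, 17, 24, 30], leaves), 8, d=2)
print(root.keys, [c.keys for c in root.children])
```

The leaf split keeps 5 in the new leaf and also sends it up, while the index split removes 17 from both halves, which is the copy-up versus push-up contrast the summary points out.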

Deleting a Data Entry from a B+ Tree

• Examine examples first …

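Delete 19* and 20*

Root: 17
  Index node: 5 13
    Leaves: 2* 3* | 5* 7* 8* | 14* 16*
  Index node: 24 30
    Leaves: 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*

After 19* and 20* are deleted, the leaf holds only 22* and underflows. Redistributing with its right sibling gives 22* 24* | 27* 29*.

Have we forgotten something?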

Deleting 19* and 20* (cont.)

• Notice how 27 is copied up.

• But can we move it up?

• Now we want to delete 24

• Underflow again! But can we redistribute this time?

Root: 17
  Index node: 5 13
    Leaves: 2* 3* | 5* 7* 8* | 14* 16*
  Index node: 27 30
    Leaves: 22* 24* | 27* 29* | 33* 34* 38* 39*

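Deleting 24*

• Observe the two leaf nodes are merged, and 27 is discarded from their parent, but …

• Observe `pull down’ of index entry (below).

After the leaf merge (right subtree only):
  Index node: 30
    Leaves: 22* 27* 29* | 33* 34* 38* 39*

After the index nodes are merged:
  New root: 5 13 17 30
    Leaves: 2* 3* | 5* 7* 8* | 14* 16* | 22* 27* 29* | 33* 34* 38* 39*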

Deleting a Data Entry from a B+ Tree: Summary

• Start at root, find leaf L where entry belongs.

• Remove the entry.

– If L is at least half-full, done!

– If L has only d-1 entries (d is the order of the tree, written v on the earlier slides),

• Try to re-distribute, borrowing from sibling (adjacent node with same parent as L).

• If re-distribution fails, merge L and sibling.

• If merge occurred, must delete entry (pointing to L or sibling) from parent of L.

• Merge could propagate to root, decreasing height.
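
The leaf-level part of this summary can be sketched in Python. The sketch is deliberately limited: it handles one index node and its leaves only, prefers the right sibling, leaves propagation up the tree to the caller, and uses an invented function name `delete_from_leaf`. The data is the right subtree from the deletion example (index keys 24 30).

```python
from bisect import bisect_right

def delete_from_leaf(parent_keys, leaves, key, d):
    """Delete key below one index node; assumes at least two leaves.

    Returns True if the index node itself now has fewer than d keys,
    which the caller would have to repair one level up.
    """
    i = bisect_right(parent_keys, key)
    leaves[i].remove(key)                    # raises ValueError if key is absent
    if len(leaves[i]) >= d:
        return False                         # still at least half full, done
    li = i if i + 1 < len(leaves) else i - 1   # pair with right sibling if any
    left, right = leaves[li], leaves[li + 1]
    if len(left) + len(right) >= 2 * d:      # sibling can spare entries
        entries = left + right
        mid = len(entries) // 2
        left[:], right[:] = entries[:mid], entries[mid:]
        parent_keys[li] = right[0]           # copy up the new separator
    else:                                    # redistribution fails: merge
        left.extend(right)
        del leaves[li + 1]
        del parent_keys[li]                  # drop entry pointing to merged leaf
    return len(parent_keys) < d

keys = [24, 30]
leaves = [[19, 20, 22], [24, 27, 29], [33, 34, 38, 39]]
delete_from_leaf(keys, leaves, 19, d=2)
delete_from_leaf(keys, leaves, 20, d=2)      # underflow, fixed by redistribution
after_redistribution = (list(keys), [list(l) for l in leaves])
underflow = delete_from_leaf(keys, leaves, 24, d=2)   # underflow, fixed by merge
print(after_redistribution, keys, leaves, underflow)
```

The final call reports that the index node is left with a single key, which is the point where the slides continue with a merge or redistribution among index nodes.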

Example of Non-leaf Re-distribution

• Tree is shown below during deletion of 24*. (What could be a possible initial tree?)

• In contrast to previous example, can re-distribute entry from left child of root to right child.

Root: 22
  Index node: 5 13 17 20
    Leaves: 2* 3* | 5* 7* 8* | 14* 16* | 17* 18* | 20* 21*
  Index node: 30
    Leaves: 22* 27* 29* | 33* 34* 38* 39*

After Re-distribution

• Intuitively, entries are re-distributed by `pushing through’ the splitting entry in the parent node.

• It suffices to re-distribute index entry with key 20; we’ve re-distributed 17 as well for illustration.

Root: 17
  Index node: 5 13
    Leaves: 2* 3* | 5* 7* 8* | 14* 16*
  Index node: 20 22 30
    Leaves: 17* 18* | 20* 21* | 22* 27* 29* | 33* 34* 38* 39*
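
The "pushing through" step can be sketched in Python. The representation is an assumption made for illustration: index nodes are dicts with `keys` and `children`, leaves are named by placeholder strings L1 to L7 in left-to-right order, and `push_through_right` is a hypothetical helper.

```python
def push_through_right(left, parent_keys, sep_i, right):
    """Move one entry from index node `left` to its right sibling `right`
    by rotating it through the separator parent_keys[sep_i]."""
    right["keys"].insert(0, parent_keys[sep_i])           # separator comes down
    right["children"].insert(0, left["children"].pop())   # last subtree moves over
    parent_keys[sep_i] = left["keys"].pop()               # last left key goes up

# State during deletion of 24*, before re-distribution.
left = {"keys": [5, 13, 17, 20], "children": ["L1", "L2", "L3", "L4", "L5"]}
right = {"keys": [30], "children": ["L6", "L7"]}
root_keys = [22]

push_through_right(left, root_keys, 0, right)   # 20 goes up, 22 comes down
push_through_right(left, root_keys, 0, right)   # 17 goes up, 20 comes down
print(left["keys"], root_keys, right["keys"])
```

One rotation is already enough to repair the underflow; the second mirrors the slide, which moves 17 as well for illustration.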

Primary vs Secondary Index

• Note: We were assuming the data items were stored in sorted order of the search key

– An index on that key is called a primary index

• Secondary index:

– Built on an attribute that the file is not sorted on.

• Can have many different indexes on the same file.

A Secondary B+-Tree index

Root: 17
  Index node: 5 13
    Leaves: 2 3 | 5 7 8 | 14 16
  Index node: 20 22 30
    Leaves: 17 18 | 20 21 | 22 27 29 | 33 34 38 39

Actual data buckets (not in key order): 2* 16* 5* 39* 20* 7* 38* 29* 22* 3* 14* 8*

More…

• Hash-based Indexes

– Static Hashing

– Extendible Hashing

– Linear Hashing

• Grid-files

• R-Trees

• etc…

Terminology

• Bucket Factor: the number of records which can fit in a leaf node.

• Fan-out : the average number of children of an internal node.

• A B+tree index can be used either as a primary index or a secondary index.

– Primary index: determines the way the records are actually stored (such an index can be sparse, with one entry per block)

– Secondary index: the records in the file are not grouped in buckets according to keys of secondary indexes (such an index must be dense, with one entry per record)

Summary

• Tree-structured indexes are ideal for range-searches, also good for equality searches.

• B+ tree is a dynamic structure.

– Inserts/deletes leave tree height-balanced; high fanout (F) means depth rarely more than 3 or 4.

– Almost always better than maintaining a sorted file.

– Typically, 67% occupancy on average.

– If data entries are data records, splits can change rids!

• Most widely used index in database management systems because of its versatility. One of the most optimized components of a DBMS.