12.1 Chapter 12: Indexing and Hashing Spring 2009 Sections 12.1-12.4, 12.6-12.8, 12.10 Problems 12.1-12.4, 12.7, 12.8, 12.13, 12.15, 12.18


12.1 Basic Concepts

Indexing - to speed up access to data

Search key - attribute or attributes used to look up records in a file

Index file - records (index entries) of the form (search-key, pointer)

Two kinds of indices:

ordered: search keys are stored in sorted order

hash: search keys are distributed uniformly across “buckets” using a hash function

Evaluation criteria:

access types supported efficiently, e.g.,

records with a specified value in the attribute

records with a value falling in a specified range of values

record access, insertion, and deletion times

index space overhead

12.2 Ordered Indices

Index entries sorted on the search key value

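Primary index: in a sequentially ordered file, an index whose search key specifies the sequential order of the file; the search key is often, but not necessarily, the primary key

Secondary index: an index whose search key specifies an order different from the file's sequential order

Dense index: an index record appears for every search-key value in the file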

A Sparse Index

How do we insert/delete records when there is an index?

E.g. insert a record for Othertown?

E.g. delete the A-110 record? The A-215 record?

Multilevel Index

If a primary index does not fit in memory, access becomes expensive

Use a sparse index on a dense index to reduce the number of disk accesses

outer index – a sparse index of the primary index

inner index – the primary index file

Store the outer index in main memory

Insertion/deletion?
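A minimal Python sketch of a two-level lookup, assuming the inner index is a sorted list of (search-key, pointer) pairs cut into fixed-size blocks and the outer index keeps the first key of each block in memory. The names build_outer_index, lookup, and block_size are illustrative, not from the slides.

```python
import bisect

def build_outer_index(inner_index, block_size):
    """Cut the sorted inner index into blocks and keep the first key of each
    block as a sparse outer index (assumed to fit in main memory)."""
    blocks = [inner_index[i:i + block_size]
              for i in range(0, len(inner_index), block_size)]
    outer = [block[0][0] for block in blocks]
    return outer, blocks

def lookup(outer, blocks, key):
    """Return the pointer stored for key, or None if the key is absent."""
    # Last outer entry whose key is <= the search key.
    b = bisect.bisect_right(outer, key) - 1
    if b < 0:
        return None
    # Reading blocks[b] stands in for the single disk access to the inner index.
    for k, pointer in blocks[b]:
        if k == key:
            return pointer
    return None

inner = [("Brighton", 0), ("Downtown", 1), ("Mianus", 2),
         ("Perryridge", 3), ("Redwood", 4), ("Round Hill", 5)]
outer, blocks = build_outer_index(inner, block_size=2)
```

Only one block of the inner index is read per lookup; the search over the outer index happens entirely in memory.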

Secondary Indices

To search on some attribute other than a primary key

E.g. the balance field of account

Secondary indices have to be dense

B+-Tree Index Files

Problems with indexed-sequential files:

performance degrades as the file grows (many overflow blocks)

periodic reorganization of the entire file is required

Typical node (size n)

Ki: search-key values (ordered in a node)

Pi: pointers to children (for non-leaf nodes) or buckets of records (for leaf nodes)

B+-Tree Index Files

Properties

all paths from root to leaf are of the same length

root node has between 2 and n children

non-root, non-leaf nodes have between ⌈n/2⌉ and n children (pointers)

leaf nodes have between ⌈(n–1)/2⌉ and n–1 values

insertions and deletions are done in logarithmic time

Automatic reorganization with small, local changes

⌈n/2⌉ is the smallest integer ≥ n/2
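The occupancy rules can be checked with a few lines of Python. This is only a sketch; the function name occupancy_bounds and the dictionary keys are made up for illustration.

```python
import math

def occupancy_bounds(n):
    """Return (minimum, maximum) occupancy per node type for node size n."""
    return {
        "root_children": (2, n),
        "non_leaf_children": (math.ceil(n / 2), n),
        "leaf_values": (math.ceil((n - 1) / 2), n - 1),
    }

bounds_n3 = occupancy_bounds(3)
bounds_n5 = occupancy_bounds(5)
```

For n = 3 this gives 2 to 3 children for non-root internal nodes and 1 to 2 values per leaf; for n = 5 it gives 3 to 5 children and 2 to 4 values.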

Non-Leaf Nodes in B+-Trees

A multi-level sparse index on the leaf nodes

Properties:

all the search-keys in the subtree to which P1 points are less than K1

for 2 ≤ i ≤ n – 1, all the search-keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki

all the search-keys in the subtree to which Pn points have values ≥ Kn–1

E.g. (n=3) components are P1 K1 P2 K2 P3
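A sketch of how these properties pick the child to follow during a search, assuming the keys K1..Kn–1 of a non-leaf node are held in a sorted Python list. The helper child_index and the sample keys are illustrative; it returns a 0-based position, so 0 means P1.

```python
import bisect

def child_index(keys, search_key):
    """Return the 0-based index of the pointer to follow for search_key.

    bisect_right counts the keys that are <= search_key, which matches the
    rule: follow P1 if search_key < K1, Pi if Ki-1 <= search_key < Ki,
    and Pn if search_key >= Kn-1.
    """
    return bisect.bisect_right(keys, search_key)

keys = ["Mianus", "Redwood"]  # K1, K2 of a hypothetical node with n = 3
```

With these keys, Downtown follows P1, Mianus and Perryridge follow P2, and Redwood and Round Hill follow P3.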

Leaf Nodes in B+-Trees

For i = 1, 2, . . ., n–1, Pi either points to

a file record with search-key value Ki, or

a bucket of pointers to file records, each record having search-key value Ki

(bucket structure only if search-key is not a primary key)

B+-tree for account (n = 3)

Root has at least 2 children

Other non-leaf nodes have between 2 and 3 children (⌈n/2⌉ and n)

Leaf nodes have between 1 and 2 values (⌈(n–1)/2⌉ and n–1)

Queries: how would you find Downtown and Round Hill?

B+-tree with n = 5

Leaf nodes have between 2 and 4 values (⌈(n–1)/2⌉ and n–1, with n = 5)

Non-leaf nodes other than the root have between 3 and 5 children (⌈n/2⌉ and n, with n = 5)

Root has at least 2 children

Efficiency of Queries on B+-Trees

Processing a query: traverse from the root to a leaf node

With K search-key values, the path is no longer than ⌈log_⌈n/2⌉(K)⌉ nodes

A node is generally the same size as a disk block

With 1 million search-key values and n = 100, at most ⌈log_50(1,000,000)⌉ = 4 nodes are accessed in a lookup

Balanced binary tree from CS 132: ~20 nodes are accessed in a lookup

The difference is significant since every node access may need a disk I/O
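The two figures can be reproduced with a short Python sketch. It uses integer arithmetic rather than floating-point logarithms to compute the ceiling of the logarithm; the name levels_needed is made up for illustration.

```python
def levels_needed(num_keys, fanout):
    """Smallest number of levels L such that fanout**L >= num_keys,
    i.e. the ceiling of log base fanout of num_keys."""
    levels, reach = 0, 1
    while reach < num_keys:
        reach *= fanout
        levels += 1
    return levels

# Worst-case fanout of a B+-tree with n = 100 is ceil(100 / 2) = 50.
bplus_levels = levels_needed(1_000_000, 50)
# A balanced binary tree has fanout 2.
binary_levels = levels_needed(1_000_000, 2)
```

Here bplus_levels is 4 and binary_levels is 20, matching the numbers on the slide.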

Insertion in B+-Trees

A record for Perryridge? Follow the tree and add the record to the existing bucket

A record for Othertown? Put the key to the right of Mianus in its leaf and add the record to the database

A record for Clearview? The leaf is full, so we need to add a new node

Insertion in B+-Trees

Splitting a node: take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first ⌈n/2⌉ in the original node, and the rest in a new node.

Let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split. If the parent is full, split it and propagate the split further up.

The splitting proceeds upwards until a node that is not full is found

In the worst case the root node is split, increasing the tree height by 1

Result of inserting Clearview in the node containing Brighton and Downtown. Now there must be an entry for Downtown in the next level up
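The leaf split can be sketched in Python as follows, assuming a leaf is a sorted list of (search-key value, pointer) pairs. The function split_leaf and the pointer names r1, r2, r3 are hypothetical; propagating the split to the parent is left out.

```python
import math

def split_leaf(entries, new_entry, n):
    """Split a full leaf (n - 1 entries) that receives one more entry.

    Returns (left, right, k): the entries kept in the original node, the
    entries moved to the new node, and the least key of the new node, which
    is the key to insert into the parent together with a pointer to it.
    """
    pairs = sorted(entries + [new_entry], key=lambda pair: pair[0])
    assert len(pairs) == n, "a split handles exactly n pairs"
    keep = math.ceil(n / 2)
    left, right = pairs[:keep], pairs[keep:]
    return left, right, right[0][0]

left, right, k = split_leaf(
    [("Brighton", "r1"), ("Downtown", "r2")], ("Clearview", "r3"), n=3)
```

For the Clearview example with n = 3, Brighton and Clearview stay in the original node, Downtown moves to the new node, and Downtown is the key passed up to the parent.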

Insertion in B+-Trees

Before and after inserting “Clearview”. Now try: "Dashfield"

Deletion in B+-Trees

Find the record to be deleted and remove it from the main file and from the bucket (if present)

Remove (search-key value, pointer) from the leaf node if there is no bucket or if the bucket has become empty

If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then

insert all the search-key values in the two nodes into a single node (the one on the left), and delete the other node

delete the pair (Ki–1, Pi), where Pi is the pointer to the deleted node, from its parent, recursively using the above procedure

If the node has too few pointers due to the removal, but the entries in the node and a sibling do not fit into a single node, then

redistribute the pointers between the node and a sibling

update the corresponding search-key value in the node's parent

Node deletions may cascade upwards until a node with ⌈n/2⌉ or more pointers is found
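A sketch of the choice between merging and redistributing after a removal, assuming leaf occupancy is counted in values (capacity n–1) and non-leaf occupancy in pointers (capacity n). The function rebalance_action and its return strings are illustrative only; it decides what to do but does not modify any tree.

```python
import math

def rebalance_action(node_count, sibling_count, n, leaf=True):
    """Decide how to repair a node after one entry was removed from it."""
    capacity = n - 1 if leaf else n
    minimum = math.ceil((n - 1) / 2) if leaf else math.ceil(n / 2)
    if node_count >= minimum:
        return "ok"            # node is still full enough
    if node_count + sibling_count <= capacity:
        return "merge"         # everything fits into a single node
    return "redistribute"      # borrow from the sibling instead
```

With n = 3, a non-leaf node left with 1 pointer merges with a sibling holding 2 pointers but borrows from a sibling holding 3.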

Examples of B+-Tree Deletion

Before and after deleting “Downtown”

Removing the leaf node containing “Downtown” did not leave its parent with too few pointers. Cascaded deletions didn't go beyond the parent.

Examples of B+-Tree Deletion (Cont.)

Node with “Perryridge” became underfull (empty) and was merged with its sibling

As a result “Perryridge” node’s parent became underfull, and was merged with its sibling (and an entry was deleted from their parent)

Root node then had only one child and was deleted

Delete “Perryridge”

Example of B+-tree Deletion (Cont.)

Parent of leaf containing Perryridge became underfull, and borrowed a pointer from its left sibling

Search-key value in the parent’s parent changes as a result

Delete “Perryridge” from earlier example

B+-Tree File Organization

Index file degradation is addressed using B+-Tree indices

Data file degradation is addressed using B+-Tree file organization

Leaf nodes in a B+-tree file store records, instead of pointers

Records use more space than pointers

Try to keep at least ⌊2n/3⌋ entries in each sibling (data) node

B-Tree Index File

Similar to a B+-tree, but each search-key value appears only once

[Figure: a B-tree (with labels "Brighton bucket", "Clearview bucket") and the B+-tree on the same data]

B-Tree Index Files (Cont.)

Advantages:

fewer tree nodes

may find search-key value before reaching leaf node

Disadvantages:

only small fraction of all search-key values are found early

non-leaf nodes are larger, so n is smaller and the B-Tree deeper

insertion and deletion more complicated

implementation harder

Typically, the advantages of B-Trees do not outweigh the disadvantages

Static Hashing

Bucket: unit of storage containing one or more records (typically a disk block)

Hash file organization: obtain the bucket of a record directly from its search-key value using a hash function

Hash function: h(K) = B, where K is a search-key value and B is a bucket address

Used to locate records for access, insertion, and deletion

If records with different search-key values are mapped to the same bucket, search the bucket sequentially to locate a record

Examples of Hash File Organization

Assume 10 buckets

Let a→1, b→2, ...

Method 1: h(k) returns the representation of the first letter in k, mod 10. E.g. h(Perryridge) = 6, h(Brighton) = 2

Is this a good hash function?

Method 2: h(k) returns the sum of the characters' representations, mod 10. E.g. h(Perryridge) = 5, h(Brighton) = 3

(Reducing each letter mod 10 first: B→2, r→8, i→9, g→7, h→8, t→0, o→5, n→4; 2+8+9+7+8+0+5+4 = 43, and 43 mod 10 = 3)

An ideal hash function is

uniform: each bucket is assigned the same number of search-key values from the set of all possible values

random: each bucket has the same number of records assigned to it, irrespective of the actual distribution of search-key values
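Both methods can be written out in a few lines of Python. This is a sketch under two assumptions not stated on the slide: letters are compared case-insensitively, and non-letter characters such as spaces are skipped. The names letter_value, h_first_letter, and h_letter_sum are illustrative.

```python
def letter_value(ch):
    """Map a -> 1, b -> 2, ..., z -> 26 (case-insensitive, ASCII letters)."""
    return ord(ch.lower()) - ord("a") + 1

def h_first_letter(key, buckets=10):
    """Method 1: value of the first letter, mod the number of buckets."""
    return letter_value(key[0]) % buckets

def h_letter_sum(key, buckets=10):
    """Method 2: sum of all letter values, mod the number of buckets."""
    return sum(letter_value(c) for c in key if c.isalpha()) % buckets
```

Method 1 sends every key that starts with the same letter to the same bucket, which is why it distributes branch names poorly; Method 2 depends on every character of the key.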

Example of Hash File Organization

Hash file for account, using branch-name as key and method 2

Handling Bucket Overflows

Overflow chaining – the overflow buckets of a given bucket are chained together in a linked list

This scheme is called closed hashing

An alternative, open hashing (the data indexed by the hash goes in the next available slot), is not suitable for databases
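Overflow chaining can be sketched in Python as below. The classes Bucket and StaticHashFile, the bucket capacity of 2, and the account numbers are all assumptions made for illustration; the hash function is the letter-sum method with 10 buckets.

```python
class Bucket:
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []      # (key, record) pairs stored in this bucket
        self.overflow = None   # next overflow bucket in the chain, if any

class StaticHashFile:
    def __init__(self, num_buckets=10, bucket_capacity=2):
        self.buckets = [Bucket(bucket_capacity) for _ in range(num_buckets)]

    def _h(self, key):
        # Sum of letter values (a -> 1, b -> 2, ...) mod the bucket count.
        total = sum(ord(c.lower()) - ord("a") + 1 for c in key if c.isalpha())
        return total % len(self.buckets)

    def insert(self, key, record):
        bucket = self.buckets[self._h(key)]
        # Walk the chain until a bucket with free space is found,
        # appending a new overflow bucket when the chain is exhausted.
        while len(bucket.records) >= bucket.capacity:
            if bucket.overflow is None:
                bucket.overflow = Bucket(bucket.capacity)
            bucket = bucket.overflow
        bucket.records.append((key, record))

    def lookup(self, key):
        # Search the primary bucket and every overflow bucket sequentially.
        bucket, found = self.buckets[self._h(key)], []
        while bucket is not None:
            found.extend(r for k, r in bucket.records if k == key)
            bucket = bucket.overflow
        return found

f = StaticHashFile()
for account in ("A-102", "A-201", "A-218"):
    f.insert("Perryridge", account)
```

The third Perryridge record does not fit in bucket 5, so it lands in an overflow bucket chained to it; a lookup walks the whole chain.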

Hash Indices

Hashing can be used for file organization and to create an index

This is a secondary index (not on primary key)

Deficiencies of Static Hashing

Hash function h maps search-keys to a fixed set of bucket addresses

databases grow with time. If initial number of buckets is too small, performance will degrade due to overflows

if file size at some point in the future is anticipated and number of buckets allocated accordingly, significant amount of space will be wasted initially

if database shrinks, space will be wasted

Expensive option: periodic file re-organization with new hash function

There are also techniques that allow a dynamic number of buckets, good for databases that grow and shrink in size; we will skip these

Ordered Indexing versus Hashing

Hashing is usually better at retrieving records with a specified key value

Ordered indices are preferred if range queries are common

Index Definition in SQL

Create an index

create index <index-name> on <relation-name> (<attribute-list>)

E.g. create index b-index on branch(branch-name)

create index b-index using btree on branch(branch-name)

create index b-index using hash on branch(branch-name)

To drop an index

drop index <index-name>
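To try the create and drop statements without a database server, here is a small sketch using Python's built-in sqlite3 module. It assumes an in-memory database and a one-column branch table, and it renames the hyphenated identifiers to b_index and branch_name because SQLite does not accept hyphens in unquoted names. SQLite has no "using btree" or "using hash" clause (its indices are B-trees), so only the plain form is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table branch (branch_name text)")
conn.execute("create index b_index on branch (branch_name)")

def index_names(connection):
    """List the names of all indices currently defined in the database."""
    rows = connection.execute(
        "select name from sqlite_master where type = 'index'")
    return [name for (name,) in rows]

after_create = index_names(conn)   # index exists after "create index"
conn.execute("drop index b_index")
after_drop = index_names(conn)     # and is gone after "drop index"
```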
