Chapter 12: Indexing and Hashing
Spring 2009
Sections 12.1-12.4, 12.6-12.8, 12.10
Problems 12.1-12.4, 12.7, 12.8, 12.13, 12.15, 12.18
12.1 Basic Concepts
Indexing - to speed up access to data
Search key - attribute or attributes used to look up records in a file
Index file - records of the form (search-key, pointer)
Two kinds of indices:
  ordered: search keys are stored in some order
  hash: search keys are distributed uniformly across “buckets” using a hash function
Evaluation criteria:
  access types supported efficiently, e.g.,
    records with a specified value in the attribute
    records with a value falling in a specified range of values
  record access, insertion, and deletion times
  index overhead
12.2 Ordered Indices
Index entries sorted on the search key value
Primary index: in a sequentially ordered file, an index whose search key specifies the sequential order of the file (often the primary key)
Secondary index: an index whose search key specifies an order different from the file's sequential order
Dense index: an index record appears for every search-key value in the file
A Sparse Index
How do we insert/delete records when there is an index?
E.g. insert record for Othertown?
E.g. delete A-110 record? A-215 record?
Multilevel Index
If a primary index does not fit in memory, access becomes expensive
Use a sparse index on a dense index to reduce the number of disk accesses
  outer index - a sparse index of the primary index
  inner index - the primary index file
Store outer index in main memory
Insertion/deletion?
Secondary Indices
To search on some attribute other than a primary key
E.g. the balance field of account
Secondary indices have to be dense
B+-Tree Index Files
Problems with indexed-sequential files:
  performance degrades as file grows (many overflow blocks)
  periodic reorganization of entire file is required
Typical node (size n)
Ki: search-key values (ordered in a node)
Pi: pointers to children (for non-leaf nodes) or buckets of records (for leaf nodes)
B+-Tree Index Files
Properties
  all paths from root to leaf are of the same length
  root node has between 2 and n children
  nodes that are neither root nor leaf have between ⌈n/2⌉ and n children (pointers)
  leaf nodes have between ⌈(n–1)/2⌉ and n–1 values
  insertions/deletions done in logarithmic time
Automatic reorganization with small, local changes
⌈n/2⌉ is the smallest integer ≥ n/2
Non-Leaf Nodes in B+-Trees
A multi-level sparse index on the leaf nodes
Properties:
  all the search-keys in the subtree to which P1 points are less than K1
  for 2 ≤ i ≤ n–1, all the search-keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki
  Pn points to the subtree whose search-keys have values ≥ Kn–1
E.g. (n=3) components are P1 K1 P2 K2 P3
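The pointer rule above can be sketched in a few lines of Python. This is only an illustration: the function name child_index and the sample keys are assumptions, and the node is modelled as a plain sorted list of keys rather than a disk block.

```python
from bisect import bisect_right

def child_index(keys, value):
    """Return the 0-based index of the pointer to follow for `value`.

    `keys` is the sorted list K1..Km stored in a non-leaf node, which
    therefore holds m + 1 pointers P1..P(m+1).
    """
    # bisect_right counts the keys that are <= value. That count is the
    # 0-based position of the pointer whose subtree covers `value`:
    # 0 keys <= value means P1, 1 key <= value means P2, and so on.
    return bisect_right(keys, value)

keys = ["Mianus", "Redwood"]            # hypothetical node, n = 3: P1 K1 P2 K2 P3
print(child_index(keys, "Brighton"))    # 0 -> follow P1 (value < K1)
print(child_index(keys, "Mianus"))      # 1 -> follow P2 (K1 <= value < K2)
print(child_index(keys, "Round Hill"))  # 2 -> follow P3 (value >= K2)
```

A binary search is used here, but a linear scan over the keys of one node would implement the same rule.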
Leaf Nodes in B+-Trees
For i = 1, 2, . . ., n–1, Pi either points to
a file record with search-key value Ki, or
a bucket of pointers to file records, each record having search-key value Ki
(bucket structure only if search-key is not a primary key)
B+-tree for account (n = 3)
Root has at least 2 children
Other non-leaf nodes have between 2 and 3 children (⌈n/2⌉ and n)
Leaf nodes have between 1 and 2 values (⌈(n–1)/2⌉ and n–1)
Queries: how would you find Downtown and Round Hill?
B+-tree with n = 5
Leaf nodes have between 2 and 4 values (⌈(n–1)/2⌉ and n–1, with n = 5)
Non-leaf nodes other than root have between 3 and 5 children (⌈n/2⌉ and n, with n = 5)
Root has at least 2 children
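The occupancy limits for any n follow directly from the two ceiling formulas. The helper below is a small sketch (the name occupancy_bounds and the dictionary layout are assumptions) that evaluates them for the two trees used in these slides.

```python
from math import ceil

def occupancy_bounds(n):
    """Return (min, max) occupancy for a B+-tree whose nodes hold up to n pointers."""
    return {
        # non-leaf nodes other than the root: ceil(n/2) .. n children
        "non_leaf_children": (ceil(n / 2), n),
        # leaf nodes: ceil((n-1)/2) .. n-1 search-key values
        "leaf_values": (ceil((n - 1) / 2), n - 1),
        # the root, when it is not a leaf: 2 .. n children
        "root_children": (2, n),
    }

for n in (3, 5):
    print(n, occupancy_bounds(n))
# n = 3: non-leaf 2..3 children, leaf 1..2 values
# n = 5: non-leaf 3..5 children, leaf 2..4 values
```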
Efficiency of Queries on B+-Trees
Processing a query: traverse from the root to a leaf node
With K search-key values, the path length is ≤ ⌈log_⌈n/2⌉(K)⌉ (logarithm to the base ⌈n/2⌉)
A node is generally the same size as a disk block
With 1 million search key values and n = 100, at most ⌈log_50(1,000,000)⌉ = 4 nodes are accessed in a lookup
Balanced binary tree from CS 132: ~20 nodes are accessed in a lookup
  significant since every node access may need a disk I/O
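The two figures quoted above can be checked with a short calculation. The sketch below uses integer arithmetic to find the smallest h with fanout^h ≥ K, which equals ⌈log_fanout(K)⌉; the function name levels_needed is an assumption made for illustration.

```python
def levels_needed(num_keys, fanout):
    """Smallest h such that fanout**h >= num_keys, i.e. ceil(log_fanout(num_keys))."""
    height, reach = 0, 1
    while reach < num_keys:
        reach *= fanout   # one more level multiplies the reachable keys by the fanout
        height += 1
    return height

n = 100
min_fanout = (n + 1) // 2                       # ceil(n/2) = 50, the worst-case fanout
print(levels_needed(1_000_000, min_fanout))     # 4  (B+-tree, n = 100)
print(levels_needed(1_000_000, 2))              # 20 (balanced binary tree)
```

The gap between 4 and 20 node accesses is what matters when each access may cost a disk I/O.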
Insertion in B+-Trees
A record for Perryridge? Follow the tree and add to the bucket
A record for Othertown? Put it to the right of Mianus and add the record to the database
A record for Clearview? We need to add a new node
Insertion in B+-Trees
Splitting a node:
  take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first ⌈n/2⌉ in the original node, and the rest in a new node.
  let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split. If the parent is full, split it and propagate the split further up.
The splitting proceeds upwards until a node that is not full is found
In the worst case the root node is split, increasing the tree height by 1
Result of inserting Clearview in node containing Brighton and Downtown. Now there must be an entry for Downtown in the next level up
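The leaf split can be sketched as follows. This is a minimal illustration, not a full B+-tree: the function name split_leaf, the tuple representation of (search-key value, pointer) pairs, and the record labels are all assumptions, and propagating the split into the parent is left out.

```python
from math import ceil

def split_leaf(entries, new_entry, n):
    """Split a full leaf that holds n - 1 (key, pointer) pairs.

    Returns (left_entries, right_entries, separator_key). The separator is
    the least key of the new right-hand node, i.e. the key k that has to be
    inserted into the parent together with a pointer to the new node.
    """
    assert len(entries) == n - 1, "only a full leaf needs to be split"
    merged = sorted(entries + [new_entry], key=lambda pair: pair[0])  # n pairs
    cut = ceil(n / 2)                     # the first ceil(n/2) pairs stay put
    left, right = merged[:cut], merged[cut:]
    return left, right, right[0][0]

# The example from the slide, with n = 3 and made-up record labels.
leaf = [("Brighton", "rec-1"), ("Downtown", "rec-2")]
left, right, k = split_leaf(leaf, ("Clearview", "rec-3"), n=3)
print([key for key, _ in left])    # ['Brighton', 'Clearview']
print([key for key, _ in right])   # ['Downtown']
print(k)                           # Downtown
```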
Insertion in B+-Trees
Before and after inserting “Clearview”. Now try: “Dashfield”
Deletion in B+-Trees
Find the record to be deleted and remove it from the main file and from the bucket (if present)
Remove (search-key value, pointer) from the leaf node if there is no bucket or if the bucket has become empty
If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then
  insert all the search-key values in the two nodes into a single node (the one on the left), and delete the other node
  delete the pair (Ki–1, Pi), where Pi is the pointer to the deleted node, from its parent, recursively using the above procedure
If the node has too few pointers due to the removal, but the entries in the node and a sibling do not fit into a single node, then
  redistribute the pointers between the node and a sibling
  update the corresponding search-key value in the node's parent
Deletions cascade up until a node with ⌈n/2⌉ or more pointers is found
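The choice between merging and redistributing can be sketched for a single leaf and its left sibling. Everything here is an assumption made for illustration: the name rebalance_leaf, the representation of a leaf as a sorted list of keys, and the even split used when redistributing (the slides only require that both nodes end up with enough entries). Updating the parent is not shown.

```python
from math import ceil

def rebalance_leaf(node, left_sibling, n):
    """Decide how to repair a leaf after a deletion.

    `node` and `left_sibling` are sorted lists of keys. Returns a tuple
    (action, left_keys, right_keys) where action is 'none', 'merge' or
    'redistribute'.
    """
    min_keys, max_keys = ceil((n - 1) / 2), n - 1
    if len(node) >= min_keys:
        return "none", left_sibling, node          # the leaf is not underfull
    combined = left_sibling + node
    if len(combined) <= max_keys:
        # Everything fits in one node: keep the left node, delete the right one.
        return "merge", combined, []
    # Too many entries for one node: share them out (evenly, by assumption).
    cut = ceil(len(combined) / 2)
    return "redistribute", combined[:cut], combined[cut:]

# n = 5: leaves hold between 2 and 4 keys.
print(rebalance_leaf(["Perryridge"], ["Downtown", "Mianus"], n=5))
print(rebalance_leaf(["Perryridge"],
                     ["Brighton", "Clearview", "Downtown", "Mianus"], n=5))
```

After a redistribution, the first key of the right-hand list is the value that replaces the old separator in the parent.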
Examples of B+-Tree Deletion
Before and after deleting “Downtown”
Removing the leaf node containing “Downtown” did not leave its parent with too few pointers. Cascaded deletions didn't go beyond the parent.
Examples of B+-Tree Deletion (Cont.)
Delete “Perryridge”
Node with “Perryridge” becomes underfull (empty) and is merged with its sibling
As a result, the “Perryridge” node’s parent became underfull, and was merged with its sibling (and an entry was deleted from their parent)
Root node then had only one child and was deleted
Example of B+-tree Deletion (Cont.)
Delete “Perryridge” from earlier example
Parent of leaf containing Perryridge became underfull, and borrowed a pointer from its left sibling
Search-key value in the parent’s parent changes as a result
B+-Tree File Organization
Index file degradation is addressed using B+-Tree indices
Data file degradation is addressed using B+-Tree file organization
Leaf nodes in a B+-tree file store records, instead of pointers
Records use more space than pointers
Try to keep at least ⌊2n/3⌋ entries in each sibling (data) node
B-Tree Index File
Similar to B+-tree, but search-key values appear only once
[Figure: B-tree and B+-tree on the same data]
B-Tree Index Files (Cont.)
Advantages:
fewer tree nodes
may find search-key value before reaching leaf node
Disadvantages
only small fraction of all search-key values are found early
non-leaf nodes are larger, so n is smaller and the B-Tree deeper
insertion and deletion more complicated
implementation harder
Typically, advantages of B-Trees do not outweigh disadvantages
Static Hashing
Bucket: unit of storage containing one or more records (typically a disk block)
Hash file organization: obtain the bucket of a record directly from its search-key value using a hash function
Hash function: h(K) = B
  K a search-key value, B a bucket address
Used to locate records for access, insertion, and deletion
If records with different search-key values are mapped to the same bucket, search the bucket sequentially to locate a record
Examples of Hash File Organization
Assume 10 buckets
Let a→1, b→2, ...
Method 1: h(k) returns the representation of the first letter in k, mod 10. E.g. h(Perryridge) = 6, h(Brighton) = 2
  Is this a good hash function?
Method 2: h(k) returns the sum of the representations of the characters in k, mod 10. E.g. h(Perryridge) = 5, h(Brighton) = 3
  (reducing each letter mod 10: B→2, r→8, i→9, g→7, h→8, t→0, o→5, n→4; 2+8+9+7+8+0+5+4 = 43, and 43 mod 10 = 3)
An ideal hash function is
  uniform: each bucket is assigned the same number of search-key values from the set of all possible values
  random: each bucket is assigned the same number of records, irrespective of the actual distribution of search-key values
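Both methods are easy to try out. The sketch below follows the a→1, b→2, ... encoding from the slide; the function names are made up, and ignoring case and skipping non-letter characters (such as the space in "Round Hill") are assumptions the slide does not state.

```python
NUM_BUCKETS = 10

def letter_value(ch):
    """a -> 1, b -> 2, ..., z -> 26 (case is ignored, by assumption)."""
    return ord(ch.lower()) - ord("a") + 1

def h_first_letter(key):
    """Method 1: representation of the first letter, mod 10."""
    return letter_value(key[0]) % NUM_BUCKETS

def h_letter_sum(key):
    """Method 2: sum of the representations of all letters, mod 10."""
    # Non-letters are skipped; this is an assumption, not part of the slide.
    return sum(letter_value(c) for c in key if c.isalpha()) % NUM_BUCKETS

for name in ("Perryridge", "Brighton"):
    print(name, h_first_letter(name), h_letter_sum(name))
# Perryridge 6 5
# Brighton 2 3
```

Method 1 sends every key with the same first letter to the same bucket, so it is only as uniform as the distribution of first letters; Method 2 uses every character of the key.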
Example of Hash File Organization
Hash file for account, using branch-name as key and method 2
Handling Bucket Overflows
Overflow chaining - the overflow buckets of a given bucket are chained together in a linked list
This scheme is called closed hashing
An alternative, open hashing (the data indexed by the hash goes in the next available slot), is not suitable for databases
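Overflow chaining can be sketched with in-memory lists standing in for disk blocks. The class name ChainedHashFile, the toy hash function (key length), and the account labels are all assumptions chosen to keep the example short.

```python
class ChainedHashFile:
    """Static hash file: a fixed number of buckets, each with an overflow chain."""

    def __init__(self, num_buckets, bucket_capacity, hash_fn):
        self.capacity = bucket_capacity
        self.hash_fn = hash_fn
        # chains[b] is a list of blocks; block 0 is the primary bucket and
        # any further blocks are its chained overflow buckets.
        self.chains = [[[]] for _ in range(num_buckets)]

    def _chain(self, key):
        return self.chains[self.hash_fn(key) % len(self.chains)]

    def insert(self, key, record):
        chain = self._chain(key)
        if len(chain[-1]) == self.capacity:
            chain.append([])              # last block is full: chain a new one
        chain[-1].append((key, record))

    def lookup(self, key):
        # Search the primary bucket and then each overflow bucket in turn.
        return [rec for block in self._chain(key) for k, rec in block if k == key]

    def overflow_count(self, bucket_no):
        return len(self.chains[bucket_no]) - 1

hf = ChainedHashFile(num_buckets=4, bucket_capacity=2, hash_fn=len)  # toy hash
for key, rec in [("Mianus", "A-1"), ("Brighton", "A-2"),
                 ("Perryridge", "A-3"), ("Round Hill", "A-4")]:
    hf.insert(key, rec)
print(hf.lookup("Round Hill"))   # ['A-4']
print(hf.overflow_count(2))      # 1: three keys hash to bucket 2, capacity is 2
```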
Hash Indices
Hashing can be used for file organization and to create an index
This is a secondary index (not on primary key)
Deficiencies of Static Hashing
Hash function h maps search-keys to a fixed set of bucket addresses
  databases grow with time. If the initial number of buckets is too small, performance will degrade due to overflows
  if file size at some point in the future is anticipated and the number of buckets is allocated accordingly, a significant amount of space will be wasted initially
  if the database shrinks, space will be wasted
Expensive option: periodic file reorganization with a new hash function
There are also techniques that allow a dynamic number of buckets, which is good for databases that grow and shrink in size (skipped here)
Ordered Indexing versus Hashing
Hashing is usually better at retrieving records with a specified key value
Ordered indices are preferred if range queries are common
Index Definition in SQL
Create an index
create index <index-name> on <relation-name> (<attribute-list>)
E.g. create index b-index on branch(branch-name)
create index b-index using btree on branch(branch-name)
create index b-index using hash on branch(branch-name)
To drop an index
drop index <index-name>
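The create/drop statements can be tried from Python with the standard sqlite3 module. This is a sketch under a few assumptions: the branch columns are illustrative, identifiers use underscores because hyphens are not valid in unquoted SQL names, and the "using btree" / "using hash" forms are left out because SQLite's create index has no such clause (its indices are B-trees).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative schema; only branch_name matters for the index.
conn.execute("create table branch (branch_name text, branch_city text, assets integer)")
conn.execute("create index b_index on branch(branch_name)")

def index_names(connection):
    """List the indices recorded in SQLite's catalog table."""
    rows = connection.execute("select name from sqlite_master where type = 'index'")
    return [name for (name,) in rows]

print(index_names(conn))   # ['b_index']
conn.execute("drop index b_index")
print(index_names(conn))   # []
```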