File Organizations - Indexing
R McFadyen ACS - 3902
•Tree terms
•root, internal, leaf, subtree
•parent, child, sibling
•balanced, unbalanced
•b+-tree
- split on overflow; merge on underflow
- in practice it is usually 3 or 4 levels deep
•search, insert, delete algorithms
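The "3 or 4 levels" remark can be checked with a rough calculation. The sketch below assumes, purely for illustration, that every node (leaf or internal) holds the same number of entries, called `fanout` here; real systems differ, and both the function name and the numbers are hypothetical.

```python
def levels_needed(n_records, fanout):
    """Smallest number of levels whose leaf level can address n_records,
    assuming every node holds `fanout` entries (an illustrative assumption)."""
    levels, capacity = 1, fanout
    while capacity < n_records:
        capacity *= fanout   # one more level multiplies the reach by the fan-out
        levels += 1
    return levels

one_million = levels_needed(1_000_000, 100)          # 3 levels
hundred_million = levels_needed(100_000_000, 100)    # 4 levels
```

With a fan-out of 100, one million records need 3 levels and one hundred million need 4, which is in line with the figure quoted above.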
ACS-3902
b-trees and b+-trees are used for indexes
b+-trees and other index organizations are used in practice
This course covers the b+-tree from a theoretical perspective
Implementations vary across database systems
Database systems mostly use b+-trees
MySQL – simplified syntax
CREATE [UNIQUE] INDEX index_name
ON tbl_name (index_col_name,...)
USING {BTREE | HASH};
CREATE INDEX index1 ON Employees (dno);
PostgreSQL – simplified syntax
CREATE [UNIQUE] INDEX index_name
ON table_name
[USING {btree | hash | gist | spgist | gin | brin}]
(index_col_name,...)
[WHERE predicate];
CREATE INDEX index1 ON Employee (dno);
CREATE INDEX index2 ON Employee (lname, fname);
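To see an index being chosen by a query planner, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite is an assumption made only so the example runs without a server; it is not MySQL or PostgreSQL, and its plan output differs from theirs. The `ssn` column is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (ssn TEXT, lname TEXT, fname TEXT, dno INTEGER)")
conn.execute("CREATE INDEX index1 ON Employee (dno)")
conn.execute("CREATE INDEX index2 ON Employee (lname, fname)")

def plan(sql):
    """Return the 'detail' text of EXPLAIN QUERY PLAN for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)   # detail is the last column

plan_dno = plan("SELECT * FROM Employee WHERE dno = 5")
plan_name = plan("SELECT * FROM Employee WHERE lname = 'Smith' AND fname = 'Ann'")
plan_fname = plan("SELECT * FROM Employee WHERE fname = 'Ann'")
print(plan_dno)
print(plan_name)
print(plan_fname)
```

The first two plans name `index1` and `index2`. The third query filters only on `fname`, which is not the leading column of `index2`, so the planner falls back to scanning the table.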
Indexes are automatically created for:
PRIMARY KEY
A constraint that enforces entity integrity for a given column or columns through a unique index.
Only one PRIMARY KEY constraint can be created per table.
UNIQUE
A constraint that provides entity integrity for a given column or columns through a unique index.
A table can have multiple UNIQUE constraints.
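The automatic indexes can be observed directly. This is a small sketch with Python's built-in `sqlite3` module, chosen as an assumption so the example is self-contained; other database systems name and expose these indexes differently. The `ssn` and `email` columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employee (
        ssn   TEXT PRIMARY KEY,
        email TEXT UNIQUE,
        dno   INTEGER
    )
""")

# PRAGMA index_list returns one row per index; column 1 is the index name.
index_names = [row[1] for row in conn.execute("PRAGMA index_list('Employee')")]
auto_indexes = [n for n in index_names if n.startswith("sqlite_autoindex_")]

# The unique index is what rejects a duplicate value.
conn.execute("INSERT INTO Employee VALUES ('111', 'a@example.com', 5)")
try:
    conn.execute("INSERT INTO Employee VALUES ('222', 'a@example.com', 4)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

No `CREATE INDEX` statement was issued, yet the table has two indexes: one for the PRIMARY KEY and one for the UNIQUE constraint.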
Clustering
The physical order of the rows is the same as the indexed order of the rows.
If index entries are logically close, then the data will be close together physically.
A primary key index is normally clustered, although this depends on the database system.
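Why clustering matters for a range of keys can be shown with a toy model. The sketch assumes 100 records with integer keys 0..99 stored 10 per page; the "unclustered" layout scatters records with an arbitrary permutation. All names and numbers are illustrative assumptions.

```python
PAGE_SIZE = 10   # records per page (assumed)
N = 100          # number of records (assumed)

def pages_touched(position_of, low, high):
    """Number of distinct pages read to fetch every key in [low, high]."""
    return len({position_of(k) // PAGE_SIZE for k in range(low, high + 1)})

# Clustered: the record with key k sits at physical position k.
clustered = pages_touched(lambda k: k, 20, 29)

# Unclustered: positions scattered by a fixed permutation of 0..99.
unclustered = pages_touched(lambda k: (k * 37) % N, 20, 29)
```

Fetching keys 20 to 29 reads 1 page in the clustered layout and 9 pages in the scattered one.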
Motivation (finding one record given its key)
•Scanning a file is time consuming
•b+-tree provides a short access path
[Figure: file of records (page1, page2, page3) indexed by a B+-tree]
Motivation
•A b+-tree for a file (table) is stored in a separate file.
•A file (table) could have many b+-trees
[Figure: file of records (bucket 1, bucket 2, bucket 3) indexed by a B+-tree]
b+-tree
•based on the b-tree (the 'b' is variously attributed to Bayer, balanced, Boeing, bushy)
•dynamic
[Figure: tree with the root at the top, internal nodes below it, and leaf nodes at the bottom]
•in 3902: horizontal pointers at the leaf level
•typical of implementations
•provides for sequential access by key
Node structure for b+-tree of order p
non-leaf node (internal node or a root)
• < P1, K1, P2, K2, …, Pq-1, Kq-1, Pq > (q ≤ p)
• keys are in sequence
K1 < K2 < ... < Kq-1
• for any key value, X, in the subtree pointed to by Pi
•Ki-1 < X ≤ Ki for 1 < i < q
•X ≤ K1 for i = 1
•Kq-1 < X for i = q
•each internal node has at most p pointers
•each internal node except the root must have at least ⌈p/2⌉ pointers
•the root, if it has some children, must have at least 2 pointers
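The rule for which pointer to follow can be sketched in a few lines. Because keys equal to Ki belong to the subtree of Pi, a "leftmost position" binary search gives the pointer directly. The function name `child_index` is hypothetical, and pointers are numbered from 0 here rather than from 1.

```python
from bisect import bisect_left

def child_index(keys, x):
    """0-based index of the pointer to follow for search value x.

    bisect_left returns the number of keys strictly less than x, so:
      x <= K1          -> 0   (pointer P1)
      Ki-1 < x <= Ki   -> i-1 (pointer Pi)
      Kq-1 < x         -> q-1 (pointer Pq)
    """
    return bisect_left(keys, x)

keys = ["Cory", "Marshall"]            # an internal node with three pointers
followed = [child_index(keys, x) for x in ["Amy", "Cory", "Diane", "Marshall", "Zena"]]
# followed == [0, 0, 1, 1, 2]
```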
Node structure for b+-tree of order p
leaf node (terminal node)
•< (K1, Pr1), (K2, Pr2), …, (Kq-1, Prq-1), Pnext >
•K1 < K2 < ... < Kq-1
•Pri points to a record with key value Ki, or Pri points to a block containing a record with key value Ki
•each leaf has at least ⌈p/2⌉ keys (a leaf that is also the root may have fewer)
•maximum of p keys
•all leaves are at the same level (balanced)
•Pnext points to the next leaf for key sequencing
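A leaf and its Pnext chain can be modelled as a small linked structure. In this sketch the class name `Leaf`, the field names, and the integer record pointers are all hypothetical; it only illustrates how Pnext supports reading keys in sequence.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Leaf:
    keys: list                      # K1 < K2 < ...
    record_ptrs: list               # Pr_i, here just made-up record ids
    next: Optional["Leaf"] = None   # Pnext

def range_scan(start_leaf, low, high):
    """Yield (key, record_ptr) for low <= key <= high by following Pnext."""
    leaf = start_leaf
    while leaf is not None:
        for k, ptr in zip(leaf.keys, leaf.record_ptrs):
            if k > high:
                return              # keys are ordered, nothing further can match
            if k >= low:
                yield k, ptr
        leaf = leaf.next

c = Leaf(["Miranda", "Ramon"], [103, 104])
b = Leaf(["Diane", "Marshall"], [102, 105], next=c)
a = Leaf(["Amy", "Cory"], [100, 101], next=b)
result = list(range_scan(a, "Cory", "Miranda"))
# result == [("Cory", 101), ("Diane", 102), ("Marshall", 105), ("Miranda", 103)]
```

In a full tree the scan would start at the leaf found by searching for `low`, not at the leftmost leaf.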
Example
•insert records with key values
Diane, Cory, Ramon, Amy, Miranda,
Marshall, Zena, Rhonda, Vincent, Simon, Mary
into a b+-tree with p=3.
internal node: minimum 2 pointers and maximum 3 pointers; inserting a fourth will cause a split
leaf node: at least 2 key/pointer pairs and a maximum of 3 key/pointer pairs; inserting a fourth will cause a split
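The minimum and maximum figures for p=3 follow from the ⌈p/2⌉ and p bounds in the node definitions. A short sketch, with a hypothetical helper name, that computes them for any order:

```python
from math import ceil

def occupancy(p):
    """(minimum, maximum) occupancy for a b+-tree of order p,
    using the bounds from the node definitions (root excepted)."""
    return {
        "internal_pointers": (ceil(p / 2), p),
        "leaf_keys": (ceil(p / 2), p),
    }

bounds_p3 = occupancy(3)
# bounds_p3 == {"internal_pointers": (2, 3), "leaf_keys": (2, 3)}
```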
insert Diane
[Diane]
•each key has a pointer to data (wherever the record for Diane is actually stored)
•the leaf has a pointer to the next leaf in ascending key sequence (the horizontal pointer)
insert Cory
[Cory , Diane]
•only leaf nodes at this point; a split is needed before there are internal nodes
Example
insert Ramon
[Cory , Diane , Ramon]
•only leaf nodes at this point
inserting Amy will cause the node to overflow:
[Amy , Cory , Diane , Ramon] (this must split)
Example
[Amy , Cory , Diane , Ramon]
This is logically correct but it exceeds the space available, so the node must split into two leaves.
Do a 50/50 split
split the node into two nodes
[Amy , Cory] [Diane , Ramon]
Need to promote a key value upwards
• this must be Cory because it's the highest key value in the left subtree
split the node and promote a key value upwards (this must be Cory because it's the highest key value in the left subtree)
root: [Cory]
leaves: [Amy , Cory] [Diane , Ramon]
(keys ≤ Cory are reached through the left pointer)
When the root node splits, the tree has grown one level
Splitting Nodes
Any value being promoted upwards will come from the node that
is splitting.
•When a leaf splits, a ‘copy’ of a key value is promoted.
•When an internal node splits, the middle key value ‘moves’
from a child to a parent node.
There are three situations to be concerned with:
•a leaf splits,
•an internal node splits,
•a new root is generated.
Leaf splitting
When a leaf splits, a new leaf is allocated
•the original leaf is the left sibling, the new one is the right sibling
•key and pointer pairs of the overflowing node are redistributed: the left sibling will have lesser keys than the right sibling
•a 'copy' of the key value which is the largest of the keys in the left sibling is promoted to the parent
Two situations arise: the parent exists or not.
If the parent exists, then a copy of the key value and the pointer to the right sibling are promoted upwards. Otherwise, the b+-tree is just beginning to grow
...
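The leaf split described above can be sketched as a small function. The name `split_leaf` and the integer record pointers are hypothetical, and the sketch assumes the 50/50 rule from the example (the left sibling keeps the first half, rounded up).

```python
from math import ceil

def split_leaf(pairs):
    """Split an overflowing leaf.

    pairs: sorted list of (key, record_ptr), already containing the new entry.
    Returns (left_pairs, right_pairs, promoted_key).
    """
    mid = ceil(len(pairs) / 2)          # 50/50 split
    left, right = pairs[:mid], pairs[mid:]
    promoted = left[-1][0]              # a COPY of the largest key in the left sibling
    return left, right, promoted

overflow = [("Amy", 4), ("Cory", 2), ("Diane", 1), ("Ramon", 3)]
left, right, promoted = split_leaf(overflow)
# left     == [("Amy", 4), ("Cory", 2)]
# right    == [("Diane", 1), ("Ramon", 3)]
# promoted == "Cory"
```

Note that Cory stays in the left leaf after the split; only a copy of the key goes up.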
Internal node splitting
If an internal node splits and it is not the root,
•insert the key and pointer and then determine the middle key
•a new 'right' sibling is allocated
•everything to the left of the middle key stays in the left sibling
•everything to the right of the middle key goes into the right sibling
•the middle key value, along with the pointer to the new right sibling, is promoted to the parent (the middle key value 'moves' to the parent to become the discriminator between the left and right siblings)
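The internal-node split can be sketched in the same style. The function name `split_internal` and the string stand-ins for child pointers are hypothetical. Unlike the leaf case, the middle key appears in neither sibling afterwards.

```python
def split_internal(keys, pointers):
    """Split an overflowing internal node.

    keys/pointers already include the newly promoted key and pointer,
    so len(pointers) == len(keys) + 1.
    Returns ((left_keys, left_ptrs), middle_key, (right_keys, right_ptrs)).
    """
    assert len(pointers) == len(keys) + 1
    mid = len(keys) // 2
    left = (keys[:mid], pointers[:mid + 1])
    right = (keys[mid + 1:], pointers[mid + 1:])
    return left, keys[mid], right       # the middle key MOVES up

left, middle, right = split_internal(
    ["Cory", "Marshall", "Ramon"], ["leaf1", "leaf2", "leaf3", "leaf4"]
)
# left   == (["Cory"], ["leaf1", "leaf2"])
# middle == "Marshall"
# right  == (["Ramon"], ["leaf3", "leaf4"])
```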
Internal node splitting
When a new root is formed, a key value and two pointers (to the left and right siblings) must be placed into it.
A sample trace
Insert Diane, Cory, Ramon, Amy, Miranda, Marshall, Zena, Rhonda, Vincent, Simon, Mary into a b+-tree with p=3.
after inserting Diane, Cory, Ramon, Amy:
root: [Cory]
leaves: [Amy , Cory] [Diane , Ramon]
insert Miranda
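after inserting Miranda:
root: [Cory]
leaves: [Amy , Cory] [Diane , Miranda , Ramon]
insert Marshall: the leaf overflows; after the 50/50 split Marshall is the discriminator
root: [Cory , Marshall]
leaves: [Amy , Cory] [Diane , Marshall] [Miranda , Ramon]
insert Zena: Zena fits in the node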
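after inserting Zena:
root: [Cory , Marshall]
leaves: [Amy , Cory] [Diane , Marshall] [Miranda , Ramon , Zena]
insert Rhonda: causes a leaf split; the discriminator (Ramon) is promoted
root: [Cory , Marshall , Ramon] (four pointers, so the root must split)
leaves: [Amy , Cory] [Diane , Marshall] [Miranda , Ramon] [Rhonda , Zena]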
after the root splits (Marshall moves up):
root: [Marshall]
internal: [Cory] [Ramon]
leaves: [Amy , Cory] [Diane , Marshall] [Miranda , Ramon] [Rhonda , Zena]
insert Vincent
after inserting Vincent:
root: [Marshall]
internal: [Cory] [Ramon]
leaves: [Amy , Cory] [Diane , Marshall] [Miranda , Ramon] [Rhonda , Vincent , Zena]
insert Simon: causes a split; Simon is promoted
after inserting Simon:
root: [Marshall]
internal: [Cory] [Ramon , Simon]
leaves: [Amy , Cory] [Diane , Marshall] [Miranda , Ramon] [Rhonda , Simon] [Vincent , Zena]
insert Mary: fits in the leaf [Miranda , Ramon]
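The whole trace can be replayed with a compact insertion routine. This is a minimal sketch, not a production design: it stores keys only (no record pointers), keeps nodes in memory, uses the 50/50 leaf split and middle-key internal split described earlier, and every class and function name in it is hypothetical.

```python
from bisect import bisect_left
from math import ceil

P = 3  # order: max pointers per internal node and max keys per leaf

class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []
        self.children = []   # used by internal nodes only
        self.next = None     # used by leaves only: horizontal pointer

def insert(root, key):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key)
    if split is None:
        return root
    promoted, right = split
    new_root = Node(leaf=False)          # the tree grows one level
    new_root.keys = [promoted]
    new_root.children = [root, right]
    return new_root

def _insert(node, key):
    """Return None, or (promoted_key, new_right_sibling) if node split."""
    if node.leaf:
        node.keys.insert(bisect_left(node.keys, key), key)
        if len(node.keys) <= P:
            return None
        mid = ceil(len(node.keys) / 2)   # 50/50 split
        right = Node(leaf=True)
        right.keys, node.keys = node.keys[mid:], node.keys[:mid]
        right.next, node.next = node.next, right
        return node.keys[-1], right      # copy of largest key in left sibling
    i = bisect_left(node.keys, key)
    split = _insert(node.children[i], key)
    if split is None:
        return None
    promoted, right = split
    node.keys.insert(i, promoted)
    node.children.insert(i + 1, right)
    if len(node.children) <= P:
        return None
    mid = len(node.keys) // 2            # the middle key moves up
    middle = node.keys[mid]
    sibling = Node(leaf=False)
    sibling.keys = node.keys[mid + 1:]
    sibling.children = node.children[mid + 1:]
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return middle, sibling

def levels(root):
    """Keys of every node, listed level by level from the root down."""
    out, level = [], [root]
    while level:
        out.append([n.keys for n in level])
        level = [c for n in level for c in n.children]
    return out

names = ["Diane", "Cory", "Ramon", "Amy", "Miranda", "Marshall",
         "Zena", "Rhonda", "Vincent", "Simon", "Mary"]
root = Node(leaf=True)
for name in names:
    root = insert(root, name)
final_levels = levels(root)
```

After all eleven inserts, `final_levels` holds a root `[Marshall]`, internal nodes `[Cory]` and `[Ramon, Simon]`, and the five leaves of the trace, with Mary placed in the leaf that held Miranda and Ramon.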
b+-tree operations
•search - always the same search length (tree height + 1), because all leaves are at the same level
•retrieval - sequential access is facilitated because the lowest level is typically linked
•insert - may cause overflow; the tree may grow
•delete - may cause underflow; the tree may shrink
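The constant search length can be checked on the final tree of the sample trace. In this sketch the tree is written out by hand: an internal node is a hypothetical `(keys, children)` tuple and a leaf is a plain list of keys.

```python
from bisect import bisect_left

leaves = [["Amy", "Cory"], ["Diane", "Marshall"], ["Mary", "Miranda", "Ramon"],
          ["Rhonda", "Simon"], ["Vincent", "Zena"]]
tree = (["Marshall"], [(["Cory"], leaves[0:2]),
                       (["Ramon", "Simon"], leaves[2:5])])

def search(node, key):
    """Return (found, nodes_visited) for a lookup starting at node."""
    visited = 1
    while isinstance(node, tuple):              # descend through internal nodes
        keys, children = node
        node = children[bisect_left(keys, key)]
        visited += 1
    return key in node, visited                 # node is now a leaf

hit = search(tree, "Zena")     # (True, 3)
miss = search(tree, "Bob")     # (False, 3)
```

Every lookup, successful or not, visits exactly three nodes here: the root, one internal node, and one leaf.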