Ch12: Indexing and Hashing

• Basic Concepts
• Ordered Indices
• B+-Tree Index Files
• B-Tree Index Files
• Hashing
• Static and Dynamic Hashing
• Comparison of Ordered Indexing and Hashing
• Index Definition in SQL
• Multiple-Key Access

Basic Concepts

• Indexing mechanisms are used to speed up access to desired data, e.g., the author catalog in a library.
• Search key: an attribute or set of attributes used to look up records in a file.
• An index file consists of records (called index entries) of the form:

    search-key | pointer

Basic Concepts (cont.)

• Index files are typically much smaller than the original file.
• Two basic kinds of indices:
  • Ordered indices: search keys are stored in sorted order.
  • Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”.

Index Evaluation Metrics

Indexing techniques are evaluated on the basis of:

• The types of access supported efficiently, e.g.,
  • records with a specified value in the attribute
  • records with an attribute value falling in a specified range of values.
• Access time
• Insertion time
  • includes the time to find the place to insert
• Deletion time
  • includes access time
• Space overhead

Ordered Indices

• An ordered index stores the values of search keys in sorted order and associates with each search key the records that contain it.
  • E.g., the author catalog in a library.
• Primary index (also called clustering index): in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
  • The search key of a primary index is usually, but not necessarily, the primary key.
• Secondary index (also called non-clustering index): an index whose search key specifies an order different from the sequential order of the file.
• Index-sequential file: an ordered sequential file with a primary index.

Dense Index Files

Dense index: an index record appears for every search-key value in the file.

Sparse Index Files

Sparse index: contains index records for only some search-key values.

• Applicable when records are sequentially ordered on the search key.
• To locate a record with search-key value $K$:
  • Find the index record with the largest search-key value $\le K$.
  • Search the file sequentially, starting at the record to which the index record points.

Sparse vs. Dense Index Files

Sparse indexes:

• Less space and less maintenance overhead for insertions and deletions.
• Generally slower than a dense index for locating records.
• Good tradeoff: a sparse index with one index entry for every block in the file, corresponding to the least search-key value in the block.
• The dominant cost in processing a query is the time taken to bring a block from disk into memory.
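The lookup rule for a one-entry-per-block sparse index can be sketched in a few lines of Python. This is only an illustration: the in-memory `blocks` list stands in for disk blocks, the branch names and balances are made-up sample data, and the sketch assumes that records sharing a search-key value do not straddle a block boundary.

```python
from bisect import bisect_right

# Hypothetical data file: each inner list is one block of (search_key, record)
# pairs, and the whole file is sorted on the search key.
blocks = [
    [("Brighton", 750), ("Downtown", 500)],
    [("Mianus", 700), ("Perryridge", 400)],
    [("Redwood", 700), ("Round Hill", 350)],
]

# Sparse index: one entry per block, holding the least search key in the block.
index_keys = [block[0][0] for block in blocks]


def sparse_lookup(key):
    """Return all records with the given search key, or [] if there are none."""
    # Find the index entry with the largest search-key value <= key.
    pos = bisect_right(index_keys, key) - 1
    if pos < 0:
        return []  # key is smaller than every key in the file
    matches = []
    # Scan the file sequentially, starting at the block the entry points to.
    for block in blocks[pos:]:
        for k, record in block:
            if k == key:
                matches.append(record)
            elif k > key:
                return matches  # the file is sorted, so the scan can stop
    return matches
```

Only one block is read for a key that exists, which is why an entry per block is a good compromise: the index stays small while the lookup still costs a single block transfer.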

Multilevel Index

• If the primary index does not fit in memory, access becomes expensive.
• To reduce the number of disk accesses to index records, treat the primary index kept on disk as a sequential file and construct a sparse index on it.
  • outer index: a sparse index of the primary index
  • inner index: the primary index file
• If even the outer index is too large to fit in main memory, yet another level of index can be created, and so on.
• Indices at all levels must be updated on insertion into or deletion from the file.
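A two-level lookup might look like the following Python sketch. The integer keys, block numbers, and the names `inner_blocks`, `outer_keys`, and `find_data_block` are all hypothetical; the point is only that the outer index is itself a sparse index over the blocks of the inner index.

```python
from bisect import bisect_right


def largest_leq(keys, key):
    """Position of the largest entry <= key in a sorted list, or -1 if none."""
    return bisect_right(keys, key) - 1


# Hypothetical inner (primary) index, stored as blocks of
# (search_key, data_block_number) entries.
inner_blocks = [
    [(10, 0), (20, 1), (30, 2)],
    [(40, 3), (50, 4), (60, 5)],
    [(70, 6), (80, 7), (90, 8)],
]

# Outer index: a sparse index holding the least key of each inner index block.
outer_keys = [block[0][0] for block in inner_blocks]


def find_data_block(key):
    """Return the number of the data block that may hold `key`, or None."""
    outer_pos = largest_leq(outer_keys, key)
    if outer_pos < 0:
        return None  # key is smaller than every indexed key
    inner = inner_blocks[outer_pos]  # costs one index-block read
    inner_pos = largest_leq([k for k, _ in inner], key)
    return inner[inner_pos][1]
```

If the outer index fits in memory, a lookup reads one inner index block plus one data block, instead of searching the whole inner index on disk.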

Example of a Multilevel Index

Index Update: Deletion

Particular case: if the deleted record was the only record in the file with its particular search-key value, the search key is also deleted from the index.

• Single-level index deletion:
  • Perform a lookup using the search-key value appearing in the record to be deleted.
  • Dense indices: deletion of the search key is similar to file record deletion.
  • Sparse indices: if an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order). If the next search-key value already has an index entry, the entry is deleted instead of being replaced.

Index Update: Insertion

• Single-level index insertion:
  • Perform a lookup using the search-key value appearing in the record to be inserted.
  • Dense indices: if the search-key value does not appear in the index, insert it.
  • Sparse indices: if the index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created. In this case, the first search-key value appearing in the new block is inserted into the index.
• Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level algorithms.

Secondary Indices

• Frequently, one wants to find all the records whose values in a certain field (which is not the search key of the primary index) satisfy some condition.
  • Ex1: In the account table stored sequentially by account number, we may want to find all accounts in a particular branch.
• A secondary index must be dense.
• When the search key is not a candidate key, we can have a secondary index with an index record for each search-key value; the index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
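The bucket arrangement can be illustrated with a small Python sketch. The account rows below are invented sample data, list positions stand in for record pointers, and `build_secondary_index` is a hypothetical helper, not part of any database API.

```python
from collections import defaultdict

# Hypothetical account file, stored sequentially by account number.
accounts = [
    ("A-101", "Downtown", 500),
    ("A-102", "Perryridge", 400),
    ("A-110", "Downtown", 600),
    ("A-215", "Mianus", 700),
    ("A-217", "Brighton", 750),
    ("A-218", "Perryridge", 700),
]


def build_secondary_index(records, field):
    """Dense secondary index: every search-key value maps to a bucket of
    record positions (the positions play the role of record pointers)."""
    buckets = defaultdict(list)
    for position, record in enumerate(records):
        buckets[record[field]].append(position)
    # Keep the index entries in search-key order, as an ordered index would.
    return dict(sorted(buckets.items()))


branch_index = build_secondary_index(accounts, field=1)


def accounts_in_branch(branch):
    """Follow the bucket of pointers for one branch name."""
    return [accounts[p] for p in branch_index.get(branch, [])]
```

Because the file is ordered by account number rather than branch name, the positions in a bucket are scattered across the file, which is why the index needs an entry for every search-key value.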

Secondary Index on balance field of account

Primary and Secondary Indices

• Indices offer substantial benefits when searching for records.
• When a file is modified, every index on the file must be updated.
  • Updating indices imposes overhead on database modification.
• A sequential scan using the primary index is efficient, but a sequential scan using a secondary index is expensive:
  • each record access may fetch a new block from disk.

B+-Tree Index Files

B+-tree indices are an alternative to indexed-sequential files.

• Disadvantage of indexed-sequential files: performance degrades as the file grows, since many overflow blocks get created. Periodic reorganization of the entire file is required.
• Advantage of B+-tree index files: the index automatically reorganizes itself with small, local changes in the face of insertions and deletions. Reorganization of the entire file is not required to maintain performance.
• Disadvantage of B+-trees: extra insertion and deletion overhead, space overhead.
• The advantages of B+-trees outweigh the disadvantages, and they are used extensively.

B+-Tree Index Files - definition

A B+-tree is a rooted tree satisfying the following properties:

• All paths from root to leaf are of the same length.
• Each node that is not a root or a leaf has between $\lceil n/2 \rceil$ and $n$ children.
• A leaf node has between $\lceil (n-1)/2 \rceil$ and $n-1$ values.
• Special cases:
  • If the root is not a leaf, it has at least 2 children.
  • If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and $n-1$ values.

B+-Tree Node Structure

Typical node: $P_1 \mid K_1 \mid P_2 \mid \dots \mid P_{n-1} \mid K_{n-1} \mid P_n$

• $K_i$ are the search-key values.
• $P_i$ are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
• The search keys in a node are ordered:

    $K_1 < K_2 < K_3 < \dots < K_{n-1}$

Properties of Leaf Nodes

• For $i = 1, 2, \dots, n-1$, pointer $P_i$ either points to a file record with search-key value $K_i$, or to a bucket of pointers to file records, each record having search-key value $K_i$.
  • The bucket structure is needed only if the search key does not form a primary key.
• If $L_i$, $L_j$ are leaf nodes and $i < j$, $L_i$’s search-key values are less than $L_j$’s search-key values.
• $P_n$ points to the next leaf node in search-key order.

Non-Leaf Nodes

• Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with $m$ pointers:
  • All the search keys in the subtree to which $P_1$ points are less than $K_1$.
  • For $2 \le i \le m-1$, all the search keys in the subtree to which $P_i$ points have values greater than or equal to $K_{i-1}$ and less than $K_i$.
  • All the search keys in the subtree to which $P_m$ points have values greater than or equal to $K_{m-1}$.

Example of a B+-tree

B+-tree for account file (n = 3)

Example of B+-tree

• Leaf nodes must have between 2 and 4 values ($\lceil (n-1)/2 \rceil$ and $n-1$, with $n = 5$).
• Non-leaf nodes other than the root must have between 3 and 5 children ($\lceil n/2 \rceil$ and $n$, with $n = 5$).
• The root must have at least 2 children.

B+-tree for account file (n = 5)
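The occupancy limits quoted for $n = 5$ follow directly from the definition, and a small helper makes the arithmetic easy to check for other values of $n$. The function name `bplus_bounds` and the dictionary keys are illustrative choices only.

```python
from math import ceil


def bplus_bounds(n):
    """Occupancy limits for a B+-tree whose nodes hold at most n pointers."""
    return {
        # A leaf holds between ceil((n-1)/2) and n-1 search-key values.
        "leaf_values": (ceil((n - 1) / 2), n - 1),
        # A non-root, non-leaf node has between ceil(n/2) and n children.
        "nonleaf_children": (ceil(n / 2), n),
        # Applies only when the root is not itself a leaf.
        "root_children_min": 2,
    }
```

For $n = 5$ this gives 2 to 4 values per leaf and 3 to 5 children per internal node; for $n = 3$ it gives 1 to 2 values per leaf and 2 to 3 children.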

Observations about B+-trees

• Since the inter-node connections are done by pointers, “logically” close blocks need not be “physically” close.
• The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
• The B+-tree contains a relatively small number of levels (logarithmic in the size of the main file), thus searches can be conducted efficiently.
• Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time (as we shall see).

Queries (searches) on B+-Trees

• Find all records with a search-key value of $k$.
  1. Start with the root node.
     a. Examine the node for the smallest search-key value $> k$.
     b. If such a value exists, assume it is $K_i$. Then follow $P_i$ to the child node.
     c. Otherwise $k \ge K_{m-1}$, where there are $m$ pointers in the node. Then follow $P_m$ to the child node.
  2. If the node reached by following the pointer above is not a leaf node, repeat the above procedure on the node, and follow the corresponding pointer.
  3. Eventually reach a leaf node. If for some $i$, key $K_i = k$, follow pointer $P_i$ to the desired record or bucket. Otherwise, no record with search-key value $k$ exists.
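The search procedure above can be sketched in Python as follows. The `Node` class is a simplified, hypothetical in-memory layout: lists are 0-based, so `pointers[i]` plays the role of $P_{i+1}$, and the leaf's last pointer $P_n$ to the next leaf is left out because a point lookup does not need it.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list                                    # K1 < K2 < ... in sorted order
    pointers: list = field(default_factory=list)  # children, or records in a leaf
    is_leaf: bool = False


def find(root, k):
    """Return the record (or bucket) stored under search key k, or None."""
    node = root
    while not node.is_leaf:
        # bisect_right returns the 0-based position of the smallest key > k,
        # which is also the position of the pointer to follow. When no key is
        # greater than k it returns len(keys), i.e. the last pointer.
        node = node.pointers[bisect_right(node.keys, k)]
    for key, pointer in zip(node.keys, node.pointers):
        if key == k:
            return pointer
    return None


# A tiny two-level tree with made-up record labels.
leaf1 = Node(["Brighton", "Downtown"], ["rec-Brighton", "rec-Downtown"], True)
leaf2 = Node(["Mianus", "Perryridge"], ["rec-Mianus", "rec-Perryridge"], True)
leaf3 = Node(["Redwood", "Round Hill"], ["rec-Redwood", "rec-Round Hill"], True)
root = Node(["Mianus", "Redwood"], [leaf1, leaf2, leaf3])
```

With this tree, `find(root, "Perryridge")` descends through the middle pointer of the root and returns the label stored in the second leaf, while a key that is not present, such as `"Clearview"`, yields `None`.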

Queries on B+-Trees (Cont.)

• When processing a query, a path is traversed in the tree from the root to some leaf node.
• If there are $K$ search-key values in the file, the path is no longer than $\lceil \log_{\lceil n/2 \rceil}(K) \rceil$.
• A node is generally the same size as a disk block, typically 4 kilobytes, and $n$ is typically around 100 (40 bytes per index entry).
• With 1 million search-key values and $n = 100$, at most $\lceil \log_{50}(1{,}000{,}000) \rceil = 4$ nodes are accessed in a lookup.
• Contrast this with a balanced binary tree with 1 million search-key values:
  • around 20 nodes are accessed in a lookup.
  • The above difference is significant since every node access may need a disk I/O, costing around 20 milliseconds!
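The node counts quoted above can be reproduced with a few lines of arithmetic. The helper name `max_path_length` is an illustrative choice, and the 20 ms figure is simply the per-access cost quoted in the text.

```python
from math import ceil, log


def max_path_length(num_keys, n):
    """Upper bound ceil(log_{ceil(n/2)}(K)) on the nodes visited in a lookup."""
    fanout = ceil(n / 2)  # minimum number of children of an internal node
    return ceil(log(num_keys, fanout))


bplus_nodes = max_path_length(1_000_000, 100)   # B+-tree with n = 100
binary_nodes = ceil(log(1_000_000, 2))          # balanced binary tree

DISK_IO_MS = 20  # per-node access cost assumed in the text
bplus_cost_ms = bplus_nodes * DISK_IO_MS
binary_cost_ms = binary_nodes * DISK_IO_MS
```

Under these assumptions the B+-tree lookup touches 4 nodes (about 80 ms of I/O) while the binary tree touches about 20 nodes (about 400 ms).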

Updates on B+-Trees: Insertion

• Find the leaf node in which the search-key value would appear.
• If the search-key value is already there in the leaf node, the record is added to the file and, if necessary, a pointer is inserted into the bucket.
• If the search-key value is not there, then add the record to the main file and create a bucket if necessary. Then:
  • If there is room in the leaf node, insert the (key-value, pointer) pair in the leaf node.
  • Otherwise, split the node along with the new (key-value, pointer) entry.

Example: splitting a node

• Want to insert a record with branch-name value of “Clearview”

Splitting a node

• Take the $n$ (search-key value, pointer) pairs (including the one being inserted) in sorted order.
• Place the first $\lceil n/2 \rceil$ in the original node, and the rest in a new node.
• Let the new node be $p$, and let $k$ be the least key value in $p$.
• Insert $(k, p)$ in the parent of the node being split.
  • If the parent is full, split it and propagate the split further up.
• The splitting of nodes proceeds upwards until a node that is not full is found.
  • In the worst case the root node may be split, increasing the height of the tree by 1.

B+-Tree before and after insertion of “Clearview”
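The leaf-level part of the split can be sketched as a small Python function. It covers only the leaf itself: inserting $(k, p)$ into the parent and propagating the split upwards are left out. The record labels `r1` to `r3` are placeholders, and `split_leaf` is a hypothetical name.

```python
from math import ceil


def split_leaf(entries, new_entry, n):
    """Split a full leaf holding n-1 (key, pointer) pairs while inserting one more.

    Returns (left, right, k): the entries that stay in the original node, the
    entries for the new node, and the least key k of the new node. The pair
    (k, pointer-to-new-node) must then be inserted into the parent.
    """
    assert len(entries) == n - 1, "only a full leaf needs splitting"
    # The n pairs, including the one being inserted, in sorted key order.
    merged = sorted(entries + [new_entry], key=lambda entry: entry[0])
    cut = ceil(n / 2)  # the first ceil(n/2) pairs stay in the original node
    left, right = merged[:cut], merged[cut:]
    return left, right, right[0][0]


# Inserting "Clearview" into a full leaf of a tree with n = 3.
left, right, k = split_leaf(
    [("Brighton", "r1"), ("Downtown", "r2")], ("Clearview", "r3"), n=3
)
```

Here the original node keeps Brighton and Clearview, the new node receives Downtown, and `k` is `"Downtown"`, the key that goes up to the parent.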

Updates on B+-Trees: Deletion

• Find the record to be deleted, and remove it from the main file and from the bucket (if present).
• Remove (search-key value, pointer) from the leaf node if there is no bucket or if the bucket has become empty.
• If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then
  • Insert all the search-key values in the two nodes into a single node (the one on the left), and delete the other node (coalesce).
  • Delete the pair $(K_{i-1}, P_i)$, where $P_i$ is the pointer to the deleted node, from its parent, recursively using the above procedure.

Updates on B+-Trees: Deletion

• Otherwise, if the node has too few entries due to the removal, and the entries in the node and a sibling do not fit into a single node, then
  • Redistribute the pointers between the node and a sibling such that both have more than the minimum number of entries.
  • Update the corresponding search-key value in the parent of the node.
• The node deletions may cascade upwards until a node which has $\lceil n/2 \rceil$ or more pointers is found. If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.
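The choice between coalescing and redistributing depends only on entry counts, so it can be sketched without building a tree. The function names below are hypothetical, counts are search-key values for a leaf and pointers for a non-leaf node, and the root's special cases are deliberately left out.

```python
from math import ceil


def min_entries(n, is_leaf):
    """Minimum occupancy: values for a leaf, pointers for a non-leaf node."""
    return ceil((n - 1) / 2) if is_leaf else ceil(n / 2)


def max_entries(n, is_leaf):
    """Maximum occupancy: n-1 values for a leaf, n pointers otherwise."""
    return n - 1 if is_leaf else n


def after_delete_action(node_count, sibling_count, n, is_leaf=True):
    """Decide what a non-root node needs once an entry has been removed."""
    if node_count >= min_entries(n, is_leaf):
        return "nothing"        # still at least half full
    if node_count + sibling_count <= max_entries(n, is_leaf):
        return "coalesce"       # merge into one node, delete the other
    return "redistribute"       # borrow from the sibling, update the parent key
```

For $n = 5$, a leaf left with 1 value next to a sibling with 2 values is coalesced, whereas next to a full sibling with 4 values it borrows entries instead.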

Before & after deleting “Downtown”

The removal of the leaf node containing “Downtown” did not result in its parent having too few pointers. So the cascaded deletions stopped with the deleted leaf node’s parent.

Deletion of “Perryridge”: ex1

• The node with “Perryridge” becomes underfull (actually empty, in this special case) and is merged with its sibling.
• As a result, the “Perryridge” node’s parent became underfull, and was merged with its sibling (and an entry was deleted from their parent).
• The root node then had only one child, and was deleted; its child became the new root node.

Before and after deletion of “Perryridge”: ex2

• The parent of the leaf containing “Perryridge” became underfull, and borrowed a pointer from its left sibling.
• The search-key value in the parent’s parent changes as a result.

B+-Tree File Organization

• Index file degradation problem: solved by using B+-tree indices.
• Data file degradation problem: solved by using B+-tree file organization.
• The leaf nodes in a B+-tree file organization store records, instead of pointers.
• Since records are larger than pointers, the maximum number of records that can be stored in a leaf node is less than the number of pointers in a nonleaf node.
• Leaf nodes are still required to be half full.
• Insertion and deletion are handled in the same way as insertion and deletion of entries in a B+-tree index.

B+-Tree File Organization - Ex

• Good space utilization is important since records use more space than pointers.
• To improve space utilization, involve more sibling nodes in redistribution during splits and merges.
  • Involving 2 siblings in redistribution (to avoid split / merge where possible) results in each node having at least $\lfloor 2n/3 \rfloor$ entries.

B-Tree Index Files

• Similar to B+-tree, but B-tree allows search-key values to appear only once; eliminates redundant storage of search keys.
• Search keys in nonleaf nodes appear nowhere else in the B-tree; an additional pointer field for each search key in a nonleaf node must be included.
• Generalized B-tree leaf node and nonleaf node (figure): in the nonleaf node, pointers $B_i$ are the bucket or file record pointers.

B-Tree Index File Example

B-tree (above) and B+-tree (below) on the same data

B-Tree Index Files (Cont.)

Advantages of B-Tree indices:

• May use fewer tree nodes than a corresponding B+-Tree.
• Sometimes possible to find the search-key value before reaching a leaf node.

Disadvantages of B-Tree indices:

• Only a small fraction of all search-key values are found early.
• Non-leaf nodes are larger, so fan-out is reduced. Thus, B-Trees typically have greater depth than the corresponding B+-Tree.
• Insertion and deletion are more complicated than in B+-Trees.
• Implementation is harder than for B+-Trees.

Typically, the advantages of B-Trees do not outweigh the disadvantages.

Hash File Organization

• A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).
• In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.
  • Hash function $h$ is a function from the set of all search-key values $K$ to the set of all bucket addresses $B$.
  • The hash function is used to locate records for access, insertion, and deletion.
• Records with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate a record.

Example of Hash File Organization

Hash file organization of account file, using branch-name as key:

• There are 10 buckets.
• The binary representation of the $i$th letter of the alphabet is assumed to be the integer $i$.
• The hash function returns the sum of the binary representations of the characters modulo 10.
  • E.g., h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3
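The example hash function is simple enough to write out and check against the values quoted above. This sketch assumes that letter case is ignored and that non-letter characters such as the space in “Round Hill” contribute nothing; the account numbers and balances are made-up sample records.

```python
NUM_BUCKETS = 10


def h(branch_name):
    """Sum of letter positions (a = 1, ..., z = 26) modulo the bucket count."""
    total = sum(
        ord(c) - ord("a") + 1
        for c in branch_name.lower()
        if "a" <= c <= "z"  # assumption: spaces and other characters count as 0
    )
    return total % NUM_BUCKETS


# Place a few hypothetical account records into their buckets.
buckets = [[] for _ in range(NUM_BUCKETS)]
for record in [
    ("A-102", "Perryridge", 400),
    ("A-305", "Round Hill", 350),
    ("A-217", "Brighton", 750),
]:
    buckets[h(record[1])].append(record)
```

Round Hill and Brighton both hash to bucket 3, so a lookup for either one has to scan that whole bucket, which is the collision behaviour described for hash file organization.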

Hash Functions

• Worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.
• An ideal hash function is
  • uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.
  • random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.
• Typical hash functions perform computation on the internal binary representation of the search-key.
  • E.g., for a string search-key, the binary representations of all the characters in the string could be added and the sum modulo the number of buckets could be returned.
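To make the contrast between the worst case and a typical hash function concrete, the sketch below counts how many keys land in each bucket. It is only an illustration: the key set (`branch-0` to `branch-99`) is invented, `byte_sum_hash` reads "internal binary representation" as the UTF-8 bytes of the string, and all function names are hypothetical.

```python
from collections import Counter


def worst_hash(key: str, n_buckets: int) -> int:
    """Worst case from the slide: every search-key value maps to the same bucket."""
    return 0


def byte_sum_hash(key: str, n_buckets: int) -> int:
    """Add the byte values of the string and take the sum modulo the bucket count."""
    return sum(key.encode("utf-8")) % n_buckets


def bucket_loads(keys, hash_fn, n_buckets):
    """Return a list whose b-th element is the number of keys hashed to bucket b."""
    counts = Counter(hash_fn(k, n_buckets) for k in keys)
    return [counts.get(b, 0) for b in range(n_buckets)]


keys = [f"branch-{n}" for n in range(100)]  # hypothetical search-key values
print("worst:   ", bucket_loads(keys, worst_hash, 10))
print("byte sum:", bucket_loads(keys, byte_sum_hash, 10))
```

With the worst hash function a lookup degenerates into a scan of all 100 keys; the byte-sum function spreads them over several buckets, though nothing guarantees that the spread is uniform.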

Handling of Bucket Overflows

• Bucket overflow can occur because of
  • Insufficient buckets
  • Skew in distribution of records. This can occur due to two reasons:
    • multiple records have same search-key value
    • chosen hash function produces non-uniform distribution of key values
• Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.

Bucket Overflows

• Overflow chaining (or closed hashing): the overflow buckets of a given bucket are chained together in a linked list.
• An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications.

Hash Indices

• Hashing can be used not only for file organization, but also for index-structure creation.
• A hash index organizes the search keys, with their associated record pointers, into a hash file structure.
• Strictly speaking, hash indices are always secondary indices
  • if the file itself is organized using hashing, a separate primary hash index on it using the same search-key is unnecessary.
  • However, we use the term hash index to refer to both secondary index structures and hash organized files.
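As a rough illustration of a hash index, the sketch below stores (search-key, pointer) pairs in buckets, where a pointer is simply the position of the record in a list standing in for the file. The account records, the choice of 7 buckets, and all function names are hypothetical.

```python
def bucket_of(key: str, n_buckets: int) -> int:
    """Hypothetical hash function: byte sum modulo the number of buckets."""
    return sum(key.encode("utf-8")) % n_buckets


def build_hash_index(records, key_field, n_buckets=7):
    """Place one (search_key, pointer) entry per record into its bucket."""
    buckets = [[] for _ in range(n_buckets)]
    for pointer, record in enumerate(records):
        key = record[key_field]
        buckets[bucket_of(key, n_buckets)].append((key, pointer))
    return buckets


def lookup(index, records, key):
    """Hash the key, scan its bucket, and follow the matching pointers."""
    bucket = index[bucket_of(key, len(index))]
    return [records[ptr] for k, ptr in bucket if k == key]


# Hypothetical account file; list positions play the role of record pointers.
accounts = [
    {"account_number": "A-217", "branch_name": "Brighton", "balance": 750},
    {"account_number": "A-101", "branch_name": "Downtown", "balance": 500},
    {"account_number": "A-215", "branch_name": "Mianus", "balance": 700},
    {"account_number": "A-102", "branch_name": "Perryridge", "balance": 400},
]
index = build_hash_index(accounts, "account_number")
print(lookup(index, accounts, "A-215"))
```

The file itself is left in its original order; only the index entries are hashed, which is what makes this a secondary index.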

Example of Hash Index

Deficiencies of Static Hashing

• Function h maps search-key values to a fixed set B of bucket addresses.
• Databases grow with time. If the initial number of buckets is too small, performance will degrade due to too many overflows.
• If file size at some point in the future is anticipated and the number of buckets allocated accordingly, a significant amount of space will be wasted initially.
• If the database shrinks, again space will be wasted.
• One option is periodic re-organization of the file with a new hash function, but it is very expensive.
• These problems can be avoided by using techniques that allow the number of buckets to be modified dynamically.

Dynamic Hashing

• Good for a database that grows and shrinks in size
• Allows the hash function to be modified dynamically
• Extendable hashing: one form of dynamic hashing
  • Hash function generates values over a large range, typically b-bit integers, with b = 32.
  • At any time use only a prefix of the hash function to index into a table of bucket addresses.
    • Let the length of the prefix be i bits, 0 ≤ i ≤ 32.
    • Bucket address table size = 2^i. Initially i = 0.
    • Value of i grows and shrinks as the size of the database grows and shrinks.
    • Multiple entries in the bucket address table may point to a bucket.
    • Thus, the actual number of buckets is at most 2^i.
    • The number of buckets also changes dynamically due to coalescing and splitting of buckets.
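Taking an i-bit prefix of a b-bit hash value is a single right shift. The sketch below assumes b = 32 as on the slide and that hash values are non-negative integers below 2^32; the names `prefix` and `table_size` are illustrative only.

```python
HASH_BITS = 32  # b on the slide


def prefix(hash_value: int, i: int) -> int:
    """Return the first i high-order bits of a 32-bit hash value."""
    if not 0 <= i <= HASH_BITS:
        raise ValueError("prefix length must satisfy 0 <= i <= 32")
    return hash_value >> (HASH_BITS - i)  # i = 0 shifts everything out, giving 0


def table_size(i: int) -> int:
    """Bucket address table size for prefix length i."""
    return 1 << i


x = 0b1011 << 28  # a hash value whose four high-order bits are 1011
for i in (0, 1, 2, 4):
    print(i, table_size(i), format(prefix(x, i), "b"))
```

With i = 0 every key maps to entry 0 of a one-entry table, which matches the initial state described above.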

General Extendable Hash Structure

In this structure, i_2 = i_3 = i, whereas i_1 = i - 1 (see next slide for details)

Use of Extendable Hash Structure

• Each bucket j stores a value i_j; all the entries that point to the same bucket have the same values on the first i_j bits.
• To locate the bucket containing search-key K_j:
  1. Compute h(K_j) = X
  2. Use the first i high order bits of X as a displacement into bucket address table, and follow the pointer to appropriate bucket
• To insert a record with search-key value K_j
  • follow same procedure as look-up and locate the bucket, say j.
  • If there is room in the bucket j insert record in the bucket.
  • Else the bucket must be split and insertion re-attempted (next slide.)

Updates

To split a bucket j when inserting record with search-key value K_j:

• If i > i_j (more than one pointer to bucket j)
  • allocate a new bucket z, and set i_j and i_z to the old i_j + 1.
  • make the second half of the bucket address table entries pointing to j to point to z
  • remove and reinsert each record in bucket j.
  • recompute new bucket for K_j and insert record in the bucket (further splitting is required if the bucket is still full)
• If i = i_j (only one pointer to bucket j)
  • increment i and double the size of the bucket address table.
  • replace each entry in the table by two entries that point to the same bucket.
  • recompute new bucket address table entry for K_j. Now i > i_j so use the first case above.
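The lookup, insert and split rules can be put together in a compact sketch. It is one possible reading of the slides, not a reference implementation. The class and method names are invented, the default hash function (`zlib.crc32`) is an arbitrary stand-in for h, and once a bucket reaches `max_depth` it is simply allowed to grow past its capacity as a simplified stand-in for an overflow bucket.

```python
import zlib

HASH_BITS = 32  # b on the slides


class Bucket:
    def __init__(self, depth):
        self.depth = depth  # i_j: number of prefix bits shared by this bucket's keys
        self.records = []   # (search_key, value) pairs


class ExtendableHash:
    def __init__(self, capacity=2, max_depth=8, hash_fn=None):
        self.capacity = capacity
        self.max_depth = max_depth  # assumed limit on i, to keep the table small
        self.hash_fn = hash_fn or (lambda key: zlib.crc32(key.encode("utf-8")))
        self.i = 0                  # current prefix length
        self.table = [Bucket(0)]    # bucket address table of size 2^i

    def _slot(self, key):
        # First i high-order bits of h(key), used as displacement into the table.
        return self.hash_fn(key) >> (HASH_BITS - self.i)

    def lookup(self, key):
        return [v for k, v in self.table[self._slot(key)].records if k == key]

    def insert(self, key, value):
        while True:
            bucket = self.table[self._slot(key)]
            # At max_depth the bucket grows instead of splitting (overflow stand-in).
            if len(bucket.records) < self.capacity or bucket.depth >= self.max_depth:
                bucket.records.append((key, value))
                return
            self._split(bucket)  # then re-attempt the insertion

    def _split(self, j):
        if self.i == j.depth:
            # Only one pointer to j: replace every entry by two, doubling the table.
            self.table = [b for b in self.table for _ in range(2)]
            self.i += 1
        # Now i > i_j: allocate z and give it the second half of j's entries.
        j.depth += 1
        z = Bucket(j.depth)
        slots = [s for s, b in enumerate(self.table) if b is j]
        for s in slots[len(slots) // 2:]:
            self.table[s] = z
        # Remove and reinsert each record of j; each lands in j or z.
        old_records, j.records = j.records, []
        for key, value in old_records:
            self.table[self._slot(key)].records.append((key, value))


# Hypothetical 32-bit hash values, chosen so the splits are easy to follow by hand.
fake_hash = {"A": 0x00000000, "B": 0x40000000, "C": 0x80000000,
             "D": 0xC0000000, "E": 0xE0000000}
eh = ExtendableHash(capacity=2, hash_fn=fake_hash.__getitem__)
for n, key in enumerate("ABCDE", start=1):
    eh.insert(key, n)
print(eh.i, [(b.depth, [k for k, _ in b.records]) for b in eh.table])
```

In this run the table is doubled twice (on inserting C and on inserting E) and ends with four entries pointing to three buckets, since the bucket holding A and B is shared by two entries.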

Updates (Cont.)

• When inserting a value, if the bucket is full after several splits (that is, i reaches some limit b) create an overflow bucket instead of splitting bucket entry table further.
• To delete a key value,
  • locate it in its bucket and remove it.
  • The bucket itself can be removed if it becomes empty (with appropriate updates to the bucket address table).
  • Coalescing of buckets can be done (can coalesce only with a “buddy” bucket having same value of i_j and same i_j - 1 prefix, if it is present)
  • Decreasing bucket address table size is also possible
    • Note: decreasing bucket address table size is an expensive operation and should be done only if number of buckets becomes much smaller than the size of the table
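The "buddy" condition can be expressed with two small helpers. This sketch assumes a bucket is identified by its i_j-bit prefix written as an integer; the function names are hypothetical. The buddy is the bucket whose prefix differs only in the last of the i_j bits, so both share the same i_j - 1 leading bits.

```python
def buddy_prefix(prefix: int, depth: int) -> int:
    """Prefix of the buddy: same first depth - 1 bits, last bit flipped."""
    if depth < 1:
        raise ValueError("a bucket with i_j = 0 has no buddy")
    return prefix ^ 1


def table_slots(prefix: int, depth: int, i: int) -> range:
    """Entries of a 2^i-entry bucket address table that point to the bucket."""
    span = 1 << (i - depth)  # 2^(i - i_j) entries share the bucket
    start = prefix << (i - depth)
    return range(start, start + span)


# A bucket with i_j = 3 and prefix 101 can coalesce only with the bucket 100.
print(format(buddy_prefix(0b101, 3), "03b"))
# After coalescing, the merged bucket has i_j = 2 and prefix 10; with i = 3
# it is referenced from two table entries.
print(list(table_slots(0b10, 2, 3)))
```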

Use of Extendable Hash Structure: Example

Initial Hash structure, bucket size = 2

Hash structure after insertion of one Brighton and two Downtown records

Hash structure after insertion of Mianus record

Hash structure after insertion of three Perryridge records

Hash structure after insertion of Redwood and Round Hill records

Extendable Hashing vs. Other Schemes

Benefits of extendable hashing

• Hash performance does not degrade with growth of file
• Minimal space overhead

Disadvantages of extendable hashing

• Extra level of indirection to find desired record
• Bucket address table may itself become very big (larger than memory)
  • Need a tree structure to locate desired record in the structure!
• Changing size of bucket address table is an expensive operation
• Linear hashing is an alternative mechanism which avoids these disadvantages at the possible cost of more bucket overflows

Comparison of Ordered Indexing and Hashing

• Cost of periodic re-organization
• Relative frequency of insertions and deletions
• Is it desirable to optimize average access time at the expense of worst-case access time?
• Expected type of queries:
  • Hashing is generally better at retrieving records having a specified value of the key.
  • If range queries are common, ordered indices are to be preferred
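The difference between the two query types can be seen with in-memory stand-ins: a Python `dict` plays the role of a hash index, and a sorted list searched with `bisect` plays the role of an ordered index. The account data and function names are hypothetical, and the sketch ignores disk I/O entirely.

```python
import bisect

# Hypothetical (account_number, balance) records.
accounts = [("A-101", 500), ("A-215", 700), ("A-102", 400),
            ("A-305", 350), ("A-201", 900), ("A-222", 700)]

# Hash-style index on balance: good for "balance = v".
hash_index = {}
for number, balance in accounts:
    hash_index.setdefault(balance, []).append(number)

# Ordered index on balance: entries kept sorted by search key.
ordered_index = sorted((balance, number) for number, balance in accounts)
ordered_keys = [balance for balance, _ in ordered_index]


def point_query(balance):
    """Equality lookup: one probe of the hash index."""
    return hash_index.get(balance, [])


def range_query(lo, hi):
    """lo <= balance <= hi: two binary searches, then one contiguous slice."""
    left = bisect.bisect_left(ordered_keys, lo)
    right = bisect.bisect_right(ordered_keys, hi)
    return [number for _, number in ordered_index[left:right]]


print(point_query(700))
print(range_query(400, 700))
```

Answering the range query from the hash index alone would mean probing every possible balance in the range or scanning all buckets, because hashing scatters neighbouring key values.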

Index Definition in SQL

• Create an index
    create index <index-name> on <relation-name> (<attribute-list>)
  E.g.: create index b-index on branch(branch-name)
• Use create unique index to indirectly specify and enforce the condition that the search key is a candidate key.
  • Not really required if SQL unique integrity constraint is supported
• To drop an index
    drop index <index-name>
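These statements can be tried out with Python's built-in `sqlite3` module. The sketch makes some assumptions: hyphens in the slide's names are replaced by underscores (`b_index`, `branch_name`) because unquoted SQL identifiers cannot contain hyphens, the columns and row values of `branch` are invented, and SQLite's dialect stands in for whichever DBMS the course uses.

```python
import sqlite3


def index_names(conn):
    """Names of the indices currently defined, read from SQLite's catalog."""
    rows = conn.execute("select name from sqlite_master where type = 'index'")
    return {row[0] for row in rows}


conn = sqlite3.connect(":memory:")
conn.execute("create table branch (branch_name text, branch_city text, assets integer)")

# create index <index-name> on <relation-name> (<attribute-list>)
conn.execute("create index b_index on branch(branch_name)")
# A unique index also enforces that branch_name is a candidate key.
conn.execute("create unique index b_unique on branch(branch_name)")
names_before = index_names(conn)

conn.execute("insert into branch values ('Perryridge', 'Horseneck', 1700000)")
try:
    conn.execute("insert into branch values ('Perryridge', 'Elsewhere', 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True  # the unique index refused the second Perryridge row

# drop index <index-name>
conn.execute("drop index b_index")
names_after = index_names(conn)
print(names_before, duplicate_rejected, names_after)
```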

Multiple-Key Access

• Use multiple indices for certain types of queries.
• Example:
    select account-number
    from account
    where branch-name = “Perryridge” and balance = 1000
• Possible strategies for processing query using indices on single attributes:
  1. Use index on branch-name to find accounts of the Perryridge branch; test balance = 1000.
  2. Use index on balance to find accounts with balances of $1000; test branch-name = “Perryridge”.
  3. Use branch-name index to find pointers to all records pertaining to the Perryridge branch. Similarly use index on balance. Take intersection of both sets of pointers obtained.
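Strategy 3 amounts to a set intersection over record pointers. In the sketch below each single-attribute index maps a key value to the set of positions of matching records; the account rows and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical account file: (account_number, branch_name, balance).
accounts = [
    ("A-102", "Perryridge", 400),
    ("A-201", "Perryridge", 900),
    ("A-218", "Perryridge", 1000),
    ("A-305", "Round Hill", 1000),
    ("A-217", "Brighton", 750),
]


def build_index(records, field):
    """Single-attribute index: key value -> set of record pointers (positions)."""
    index = defaultdict(set)
    for pointer, record in enumerate(records):
        index[record[field]].add(pointer)
    return index


branch_index = build_index(accounts, 1)
balance_index = build_index(accounts, 2)


def accounts_at(branch, balance):
    """Intersect the two pointer sets, then fetch only the surviving records."""
    pointers = branch_index.get(branch, set()) & balance_index.get(balance, set())
    return sorted(accounts[p][0] for p in pointers)


print(accounts_at("Perryridge", 1000))
```

Here the branch-name index yields three pointers and the balance index two, but only one record is actually fetched; strategies 1 and 2 would fetch three and two records respectively before testing the other condition.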

Indices on Multiple Attributes

Suppose we have an index on combined search-key (branch-name, balance).

• With the where clause
    where branch-name = “Perryridge” and balance = 1000
  the index on the combined search-key will fetch only records that satisfy both conditions.
  Using separate indices is less efficient: we may fetch many records (or pointers) that satisfy only one of the conditions.
• Can also efficiently handle
    where branch-name = “Perryridge” and balance < 1000
• But cannot efficiently handle
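A sorted list of (branch-name, balance) pairs is enough to see why the two where clauses shown above are cheap on a combined index: in both cases the matching entries form one contiguous run. The entries and function names below are hypothetical, and a sorted Python list stands in for an ordered index.

```python
import bisect

# Hypothetical index entries: (branch_name, balance, account_number),
# kept sorted on the combined search-key (branch_name, balance).
entries = sorted([
    ("Brighton", 750, "A-217"),
    ("Perryridge", 400, "A-102"),
    ("Perryridge", 900, "A-201"),
    ("Perryridge", 1000, "A-218"),
    ("Redwood", 700, "A-222"),
    ("Round Hill", 1000, "A-305"),
])
keys = [(branch, balance) for branch, balance, _ in entries]


def branch_and_balance_equal(branch, balance):
    """branch-name = branch and balance = balance."""
    lo = bisect.bisect_left(keys, (branch, balance))
    hi = bisect.bisect_right(keys, (branch, balance))
    return [entries[p][2] for p in range(lo, hi)]


def branch_equal_balance_below(branch, limit):
    """branch-name = branch and balance < limit."""
    # (branch,) sorts before every (branch, x), so it marks the start of the branch.
    lo = bisect.bisect_left(keys, (branch,))
    hi = bisect.bisect_left(keys, (branch, limit))
    return [entries[p][2] for p in range(lo, hi)]


print(branch_and_balance_equal("Perryridge", 1000))
print(branch_equal_balance_below("Perryridge", 1000))
```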

End of Chapter

Partitioned Hashing

• Hash values are split into segments that depend on each attribute of the search-key.
    (A_1, A_2, . . . , A_n) for n attribute search-key
• Example: n = 2, for customer, search-key being (customer-street, customer-city)

    search-key value      hash value
    (Main, Harrison)      101 111
    (Main, Brooklyn)      101 001
    (Park, Palo Alto)     010 010
    (Spring, Brooklyn)    001 001
    (Alma, Palo Alto)     110 010

• To answer equality query on single attribute, need to look up multiple buckets. Similar in effect to grid files.
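The table can be turned into a small sketch of partitioned hashing. The 3-bit segment values are taken directly from the table; how each segment is computed from the attribute value is not given in the slides, so they are stored as lookup tables here. The function names are hypothetical.

```python
SEGMENT_BITS = 3

# Per-attribute hash segments, copied from the example table.
street_segment = {"Main": 0b101, "Park": 0b010, "Spring": 0b001, "Alma": 0b110}
city_segment = {"Harrison": 0b111, "Brooklyn": 0b001, "Palo Alto": 0b010}


def partitioned_hash(street: str, city: str) -> int:
    """Concatenate the street segment and the city segment into one hash value."""
    return (street_segment[street] << SEGMENT_BITS) | city_segment[city]


def buckets_for_city(city: str):
    """Equality query on customer-city alone: the street segment is unknown,
    so every possible street segment has to be tried."""
    c = city_segment[city]
    return [(s << SEGMENT_BITS) | c for s in range(1 << SEGMENT_BITS)]


print(format(partitioned_hash("Main", "Harrison"), "06b"))
print([format(b, "06b") for b in buckets_for_city("Brooklyn")])
```

A query on both attributes touches one bucket, while a query on customer-city alone touches 2^3 = 8 of the 64 possible hash values, which illustrates the "look up multiple buckets" remark.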
