11266 Ch12 Indexing and Hashing-2

8/13/2019 11266 Ch12 Indexing and Hashing-2

1/31

Chapter 12: Indexing and Hashing


2/31

Organization of Records in Files

Several of the possible ways of organizing records in files are:

Heap file organization.

Any record can be placed anywhere in the file where there is space for the record.

There is no ordering of records.

Typically, there is a single file for each relation.


3/31

Organization of Records in Files

Sequential file organization.

Records are stored in sequential order, according to the value of a search key of each record.

Hashing file organization.

A hash function is computed on some attribute of each record.

The result of the hash function specifies in which block of the file the record should be placed.


4/31

Sequential File Organization

A sequential file is designed for efficient processing of records in sorted order based on some search key.

A search key is any attribute or set of attributes; it need not be the primary key, or even a superkey.


5/31

Clustering File Organization

Many relational-database systems store each relation in a separate file.

A clustering file organization is a file organization that stores related records of two or more relations in each block.

Such a file organization allows us to read records that would satisfy the join condition by using one block read.


6/31


7/31

Chapter 12: Indexing and Hashing

Basic Concepts

Ordered Indices

Multiple-Key Access

Static Hashing

Dynamic Hashing

Comparison of Ordered Indexing and Hashing


8/31

Basic Concepts

Indexing mechanisms are used to speed up access to desired data.

E.g., author catalog in library

Search key: an attribute or set of attributes used to look up records in a file.

An index file consists of records (called index entries) of the form (search-key, pointer).

Index files are typically much smaller than the original file.

Two basic kinds of indices:

Ordered indices: search keys are stored in sorted order.

Hash indices: search keys are distributed uniformly across buckets, and the entries in a bucket are accessed using a hash function.


9/31

Index Evaluation Metrics

Each technique must be evaluated on the basis of these factors:

Access types: finding records with a specified attribute value, and finding records whose attribute values fall in a specified range.

Access time: The time it takes to find a particular data item.

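Insertion time: The time it takes to insert a new data item. This includes the time to find the place to insert and the time to update the index structure.

Deletion time: The time it takes to delete a data item. This includes the time to find the item and the time to update the index structure.

Space overhead: The additional space occupied by an index structure.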


10/31

Ordered Indices

In an ordered index, index entries are stored sorted on the search-key value. E.g., author catalog in library.

Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.

Also called clustering index.

The search key of a primary index is usually but not necessarily the primary key.

Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called non-clustering index.

Index-sequential file: ordered sequential file with a primary index.


11/31

Dense and Sparse Index

Dense index:

An index record appears for every search-key value in the file.

The index record contains the search-key value and a pointer to the first data record with that search-key value.

Sparse index: An index record is created for only some of the search-key values. Each index record contains a search-key value and a pointer to the first data record with that value.


12/31

Dense Index Files

Dense index: an index record appears for every search-key value in the file.
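A minimal sketch of a dense index, assuming the file is an in-memory list of records sorted on the search key and that a "pointer" is simply a list position. The sample records and the helper names `build_dense_index` and `dense_lookup` are illustrative and do not come from the slides.

```python
import bisect

# Hypothetical sequential file: (branch_name, balance) records sorted on branch_name.
FILE = [
    ("Brighton", 750),
    ("Downtown", 500),
    ("Downtown", 600),
    ("Mianus", 700),
    ("Perryridge", 400),
    ("Perryridge", 900),
    ("Redwood", 700),
]

def build_dense_index(records):
    """One entry per distinct search-key value, pointing at its first record."""
    index = []
    for pos, (key, _) in enumerate(records):
        if not index or index[-1][0] != key:
            index.append((key, pos))
    return index

def dense_lookup(index, records, key):
    """Return all records with the given key, or [] if the key is not indexed."""
    # (key,) sorts just before (key, pos), so this finds the entry for key.
    i = bisect.bisect_left(index, (key,))
    if i == len(index) or index[i][0] != key:
        return []
    pos = index[i][1]
    result = []
    # Records with equal keys are adjacent because the file is sorted.
    while pos < len(records) and records[pos][0] == key:
        result.append(records[pos])
        pos += 1
    return result

dense_index = build_dense_index(FILE)
```

Because every search-key value has an entry, a failed binary search on the index is enough to conclude that no matching record exists, without touching the file.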


13/31

Sparse Index Files

Sparse Index: contains index records for only some search-key values.

Applicable when records are sequentially ordered on search-key

To locate a record with search-key value K, we:

Find the index record with the largest search-key value ≤ K

Search the file sequentially starting at the record to which the index record points
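The lookup steps can be sketched as below. This is an illustration under stated assumptions: the file holds one record per key, pointers are list positions, the index has an entry for the first record, and the name `sparse_lookup` is hypothetical. The sketch picks the largest indexed key that is less than or equal to K, so a lookup for a key that is itself indexed starts at its own entry.

```python
import bisect

# Hypothetical file: search-key values in sorted order (one record per key).
FILE_KEYS = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# Sparse index: entries for only some keys, each with the position of its record.
SPARSE_INDEX = [(10, 0), (40, 3), (70, 6)]

def sparse_lookup(index, file_keys, key):
    """Return the file position of key, or None if it is absent."""
    index_keys = [k for k, _ in index]
    # Position of the largest indexed key that is <= key.
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None  # key precedes the first indexed key, so it is not in the file
    pos = index[i][1]
    # Scan the file sequentially from the record the index entry points to.
    while pos < len(file_keys) and file_keys[pos] <= key:
        if file_keys[pos] == key:
            return pos
        pos += 1
    return None
```

For example, `sparse_lookup(SPARSE_INDEX, FILE_KEYS, 50)` starts at the entry for 40 and scans forward one record to reach 50.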


14/31

Sparse Index Files (Cont.)

Compared to dense indices:

Less space and less maintenance overhead for insertions and deletions.

Generally slower than dense index for locating records.

Good tradeoff: sparse index with an index entry for every block in the file, corresponding to the least search-key value in the block.


15/31

Multilevel Index

If the primary index does not fit in memory, access becomes expensive.

Solution: treat the primary index kept on disk as a sequential file and construct a sparse index on it.

outer index: a sparse index of the primary index

inner index: the primary index file

If even the outer index is too large to fit in main memory, yet another level of index can be created, and so on.

Indices at all levels must be updated on insertion or deletion from the file.


16/31

Multilevel Index (Cont.)

Indices themselves may become too large for efficient processing.

Example:

Consider a file with 100,000 records, with 10 records in a block.

With a sparse index and one index entry per block, we have about 10,000 index entries.

Assuming 100 index entries fit into a block, we need about 100 blocks for the index.

It is desirable to keep the index file in main memory.

Problem: Searching a large index file becomes expensive.
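The arithmetic in this example can be checked, and extended to further index levels, with a short calculation. The function name `index_levels` and the rule of adding levels until the outermost index fits in a single block are assumptions made for illustration.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return (a + b - 1) // b

def index_levels(num_records, records_per_block, entries_per_block):
    """Return (data blocks, [index blocks per level]), innermost level first.

    Each level is a sparse index with one entry per block of the level below.
    Levels are added until the outermost index fits in one block.
    """
    data_blocks = ceil_div(num_records, records_per_block)
    levels = []
    entries = data_blocks
    while True:
        blocks = ceil_div(entries, entries_per_block)
        levels.append(blocks)
        if blocks <= 1:
            break
        entries = blocks
    return data_blocks, levels

data_blocks, levels = index_levels(100_000, 10, 100)
```

With the numbers from the example, the file occupies 10,000 blocks, the inner index occupies 100 blocks, and a single outer index block is enough to cover the inner index.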


17/31

Multilevel Index (Cont.)


18/31

Index Update: Record Deletion

If deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also.

Single-level index deletion:

Dense indices: deletion of the search-key is similar to file record deletion.

Sparse indices:

If the deleted key value exists in the index, the value is replaced by the next search-key value in the file (in search-key order).

If the next search-key value already has an index entry, the entry is deleted instead of being replaced.
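The sparse-index deletion rule can be sketched as follows. The sketch assumes one record per search-key value, keeps only the indexed key values (pointers are omitted), and uses the hypothetical name `delete_record`.

```python
import bisect

def delete_record(file_keys, sparse_index, key):
    """Delete key from the sorted file and adjust the sparse index in place."""
    pos = bisect.bisect_left(file_keys, key)
    if pos == len(file_keys) or file_keys[pos] != key:
        return  # no such record
    del file_keys[pos]

    i = bisect.bisect_left(sparse_index, key)
    if i == len(sparse_index) or sparse_index[i] != key:
        return  # the deleted key had no index entry; nothing to update

    # After the deletion, file_keys[pos] is the next key in search-key order.
    next_key = file_keys[pos] if pos < len(file_keys) else None
    next_is_indexed = i + 1 < len(sparse_index) and sparse_index[i + 1] == next_key
    if next_key is None or next_is_indexed:
        del sparse_index[i]          # delete the entry instead of replacing it
    else:
        sparse_index[i] = next_key   # replace with the next search-key value

file_keys = [10, 20, 30, 40, 50, 60]
sparse_index = [10, 30, 50]
delete_record(file_keys, sparse_index, 30)   # 30 is replaced by 40
after_first = list(sparse_index)
delete_record(file_keys, sparse_index, 40)   # next key 50 is already indexed
after_second = list(sparse_index)
```

Deleting the entry when there is no next key at all (the deleted record was the last one in the file) is an assumption of this sketch; the slide does not cover that case.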


19/31

Index Update: Record Insertion

Single-level index insertion:

Perform a lookup using the search-key value from the inserted record.

Dense indices: if the search-key value does not appear in the index, insert it.

Sparse indices: if the index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created.

If a new block is created, the first search-key value appearing in the new block is inserted into the index.

Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level algorithms.
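For the dense case, the insertion rule amounts to a lookup followed by a conditional insert. The sketch below assumes the index is a sorted list of (key, pointer) pairs and that a new record with an already indexed key is stored after the existing records with that key, so the existing pointer stays valid. The name `dense_insert` and the sample entries are illustrative.

```python
import bisect

def dense_insert(index, key, pointer):
    """Add (key, pointer) to a sorted dense index only if key is not yet indexed.

    Returns True when a new index entry was created.
    """
    i = bisect.bisect_left(index, (key,))
    if i < len(index) and index[i][0] == key:
        return False  # key already appears in the index; keep the existing entry
    index.insert(i, (key, pointer))
    return True

index = [("Brighton", 0), ("Mianus", 3), ("Redwood", 6)]
added_new = dense_insert(index, "Downtown", 7)   # new search-key value
added_dup = dense_insert(index, "Mianus", 8)     # value already indexed
```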


20/31

Secondary Indices Example

Index record points to a bucket that contains pointers to all the actual records with that particular search-key value.

Secondary indices have to be dense.

Secondary index on balance field of account
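A small sketch of such a secondary index, assuming an account file stored in branch_name order and list positions as record pointers. The sample records and the name `build_secondary_index` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical account file, stored in branch_name order (not balance order).
ACCOUNTS = [
    ("A-217", "Brighton", 750),
    ("A-101", "Downtown", 500),
    ("A-110", "Downtown", 600),
    ("A-215", "Mianus", 700),
    ("A-102", "Perryridge", 400),
    ("A-201", "Perryridge", 900),
    ("A-222", "Redwood", 700),
]

def build_secondary_index(records, field):
    """Dense index on a field: one sorted entry per value, each with a bucket of pointers."""
    buckets = defaultdict(list)
    for pos, record in enumerate(records):
        buckets[record[field]].append(pos)
    return sorted(buckets.items())

balance_index = build_secondary_index(ACCOUNTS, field=2)
```

The two accounts with balance 700 are not adjacent in the file, so the index entry for 700 needs a bucket holding both pointers. This is also why a secondary index cannot be sparse: records between two indexed values are not stored in search-key order.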


21/31

Hashing


22/31

Static Hashing

In a hash file organization, we obtain the address of the disk block containing a desired record directly by computing a function on the search-key value of the record.

A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).

In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.

Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.

Hash function is used to locate records for access, insertion as well as deletion.


23/31

Example of Hash File Organization

There are 10 buckets.

The binary representation of the ith letter of the alphabet is assumed to be the integer i.

The hash function returns the sum of the binary representations of the characters modulo 10.

E.g., h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3

Hash file organization of account file, using branch_name as key. (See figure in next slide.)
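The example hash function can be written out directly. The sketch assumes that letter case is ignored and that non-letters such as the space in "Round Hill" contribute nothing, which is consistent with the three values given above.

```python
def h(branch_name, num_buckets=10):
    """Sum of alphabet positions (a=1, ..., z=26) of the letters, modulo num_buckets."""
    total = sum(
        ord(c) - ord("a") + 1
        for c in branch_name.lower()
        if "a" <= c <= "z"  # skip spaces and other non-letters (assumption)
    )
    return total % num_buckets

bucket_of = {name: h(name) for name in ("Perryridge", "Round Hill", "Brighton")}
```

For Perryridge the letter values sum to 125, so the record goes to bucket 5. Round Hill (113) and Brighton (93) both land in bucket 3, which shows that distinct keys can share a bucket.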


24/31

Example of Hash File Organization

Hash file organization of account file, using branch_name as key (see previous slide for details).


25/31

Hash Functions

Worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.

An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.

Ideal hash function is random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.


26/31

Handling of Bucket Overflows

If the bucket does not have enough space, a bucket overflow is said to occur.

Bucket overflow can occur because of

Insufficient buckets

Skew in distribution of records. Some buckets are assigned more records than are others, so a bucket may overflow even when other buckets still have space.

Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.



27/31

Handling of Bucket Overflows (Cont.)

Overflow chaining: the overflow buckets of a given bucket are chained together in a linked list.

Above scheme is called closed hashing.

An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications.
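Overflow chaining can be sketched with a bucket that holds a link to its overflow bucket. The class names `Bucket` and `HashFile`, the fixed capacity, and the identity hash function used in the demonstration are assumptions chosen to keep the example small.

```python
class Bucket:
    """A fixed-capacity bucket with a link to an overflow bucket."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []
        self.overflow = None  # next bucket in the chain, if any

    def insert(self, record):
        bucket = self
        # Walk the chain to the first bucket with free space, extending it if needed.
        while len(bucket.records) >= bucket.capacity:
            if bucket.overflow is None:
                bucket.overflow = Bucket(bucket.capacity)
            bucket = bucket.overflow
        bucket.records.append(record)

    def search(self, key):
        """Return all records with this key, following the overflow chain."""
        found, bucket = [], self
        while bucket is not None:
            found.extend(r for r in bucket.records if r[0] == key)
            bucket = bucket.overflow
        return found

    def chain_length(self):
        n, bucket = 0, self
        while bucket is not None:
            n += 1
            bucket = bucket.overflow
        return n


class HashFile:
    def __init__(self, num_buckets, capacity, hash_fn):
        self.buckets = [Bucket(capacity) for _ in range(num_buckets)]
        self.hash_fn = hash_fn

    def insert(self, record):
        self.buckets[self.hash_fn(record[0]) % len(self.buckets)].insert(record)

    def search(self, key):
        return self.buckets[self.hash_fn(key) % len(self.buckets)].search(key)


# Skewed keys: 3, 6, 9, 12 and 15 all map to bucket 0 of 3.
f = HashFile(num_buckets=3, capacity=2, hash_fn=lambda k: k)
for key in (3, 6, 9, 12, 15, 4):
    f.insert((key, f"record-{key}"))
```

In this run bucket 0 ends up with a chain of three buckets while bucket 2 stays empty, so a search in bucket 0 must examine the whole chain.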


28/31

Hash Indices

Hashing can be used not only for file organization, but also for index-structure creation.

A hash index organizes the search keys, with their associated record pointers, into a hash file structure.

Strictly speaking, hash indices are always secondary indices:

if the file itself is organized using hashing, a separate primary hash index on it using the same search-key is unnecessary.

However, we use the term hash index to refer to both secondary index structures and hash organized files.


29/31

Example of Hash Index


30/31

Deficiencies of Static Hashing

In static hashing, function h maps search-key values to a fixed set B of bucket addresses.

Databases grow or shrink with time.

If the initial number of buckets is too small, and the file grows, performance will degrade due to too many overflows.

If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull).

If the database shrinks, again space will be wasted.

One solution: periodic re-organization of the file with a new hash function

Expensive, disrupts normal operations

Better solution: allow the number of buckets to be modified dynamically.


31/31

Dynamic Hashing

Good for databases that grow and shrink in size.

Allows the hash function to be modified dynamically.

With static hashing, by contrast, there are only three options for such a database:

1. Choose a hash function based on the current file size. This option will result in performance degradation as the database grows.

2. Choose a hash function based on the anticipated size of the file at some point in the future. Although performance degradation is avoided, a significant amount of space may be wasted initially.

3. Periodically reorganize the hash structure in response to file growth. Such a reorganization involves choosing a new hash function, re-computing the hash function on every record in the file, and generating new bucket assignments.

This reorganization is a massive, time-consuming operation.
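To see why periodic reorganization is costly, consider a minimal sketch in which buckets are plain lists of keys and the new hash function differs only in the number of buckets. The name `reorganize` and the modulo-based hash are assumptions made for illustration.

```python
def reorganize(buckets, new_num_buckets, hash_fn):
    """Rebuild the bucket array under h(k) = hash_fn(k) mod new_num_buckets.

    Every record is rehashed, so the work is proportional to the file size.
    Returns the new buckets and the number of records that were rehashed.
    """
    new_buckets = [[] for _ in range(new_num_buckets)]
    rehashed = 0
    for bucket in buckets:
        for key in bucket:
            new_buckets[hash_fn(key) % new_num_buckets].append(key)
            rehashed += 1
    return new_buckets, rehashed

# A file of 12 keys spread over 2 buckets, then reorganized into 4 buckets.
old = [[] for _ in range(2)]
for key in range(12):
    old[key % 2].append(key)

new, rehashed = reorganize(old, 4, hash_fn=lambda k: k)
```

All 12 records are touched even though only the bucket count changed. Dynamic hashing aims to avoid exactly this full pass over the file.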
