11266 Ch12 Indexing and Hashing-2


  • 8/13/2019 11266 Ch12 Indexing and Hashing-2


    Chapter 12: Indexing and Hashing


    Organization of Records in Files

    Several of the possible ways of organizing records in files are:

    Heap file organization.

    Any record can be placed anywhere in the file where there is space for the record.

    There is no ordering of records.

    Typically, there is a single file for each relation.


    Sequential file organization.

    Records are stored in sequential order, according to the value of a search key of each record.

    Hashing file organization.

    A hash function is computed on some attribute of each record.

    The result of the hash function specifies in which block of the file the record should be placed.


    Sequential File Organization

    A sequential file is designed for efficient processing of records in sorted order based on some search key.

    A search key is any attribute or set of attributes; it need not be the primary key, or even a superkey.


    Clustering File Organization

    Relational-database systems typically store each relation in a separate file.

    A clustering file organization stores related records of two or more relations in each block.

    Such a file organization allows us to read records that would satisfy the join condition by using one block read.


    Chapter 12: Indexing and Hashing

    Basic Concepts

    Ordered Indices

    Multiple-Key Access

    Static Hashing

    Dynamic Hashing

    Comparison of Ordered Indexing and Hashing


    Basic Concepts

    Indexing mechanisms are used to speed up access to desired data.

    E.g., author catalog in library.

    Search key: an attribute or set of attributes used to look up records in a file.

    An index file consists of records (called index entries) of the form (search-key, pointer).

    Index files are typically much smaller than the original file.

    Two basic kinds of indices:

    Ordered indices: search keys are stored in sorted order.

    Hash indices: search keys are distributed uniformly across buckets using a hash function.


    Index Evaluation Metrics

    Each technique must be evaluated on the basis of these factors:

    Access type: finding records with a specified value, or finding records whose attribute values fall in a specified range.

    Access time: the time it takes to find a particular data item.

    Insertion time: the time it takes to insert a new data item. This includes the time to find the place to insert and the time to update the index structure.

    Deletion time: the time it takes to delete a data item.

    Space overhead: the additional space occupied by an index structure.


    Ordered Indices

    In an ordered index, index entries are stored sorted on the search-key value. E.g., author catalog in library.

    Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.

    Also called clustering index.

    The search key of a primary index is usually, but not necessarily, the primary key.

    Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called non-clustering index.

    Index-sequential file: ordered sequential file with a primary index.


    Dense and Sparse Index

    Dense index:

    An index record appears for every search-key value in the file.

    The index record contains the search-key value and a pointer to the first data record with that search-key value.

    Sparse index:

    An index record is created only for some search-key values.

    Each index record contains a search-key value and a pointer to the first data record with that value.
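    The two definitions can be compared side by side in a short Python sketch. It assumes the file is a list of (search key, payload) tuples already sorted on the search key, that a pointer is just a list position, and that the sparse index keeps one entry per block of `block_size` records. The function names and the sample data are illustrative only.

```python
def build_dense_index(records):
    """One (key, pointer) entry per distinct search-key value.

    The pointer is the position of the first record with that key.
    Assumes `records` is sorted on the search key.
    """
    index = []
    for pos, (key, _payload) in enumerate(records):
        if not index or index[-1][0] != key:
            index.append((key, pos))
    return index


def build_sparse_index(records, block_size):
    """One (key, pointer) entry per block: the first record of each block."""
    return [(records[pos][0], pos) for pos in range(0, len(records), block_size)]


# Illustrative data, sorted on the first field.
records = [
    ("Brighton", 750), ("Downtown", 500), ("Downtown", 600),
    ("Mianus", 700), ("Perryridge", 400), ("Perryridge", 900),
    ("Redwood", 700), ("Round Hill", 350),
]
dense = build_dense_index(records)
sparse = build_sparse_index(records, block_size=3)
```

    With this data the dense index has six entries (one per distinct key), while the sparse index has only three.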


    Dense Index Files

    Dense index: an index record appears for every search-key value in the file.


    Sparse Index Files

    Sparse index: contains index records for only some search-key values.

    Applicable when records are sequentially ordered on the search key.

    To locate a record with search-key value K, we:

    Find the index record with the largest search-key value ≤ K.

    Search the file sequentially, starting at the record to which the index record points.
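    The two lookup steps can be sketched in Python as follows. This is a minimal illustration, assuming the index is held as two parallel sorted lists (keys and record positions) and the file is a list of (key, payload) tuples sorted on the key. The name `sparse_lookup` and the sample data are hypothetical.

```python
from bisect import bisect_right


def sparse_lookup(index_keys, index_ptrs, records, k):
    """Return the first record whose search key equals k, or None."""
    # Step 1: index entry with the largest search-key value <= k.
    i = bisect_right(index_keys, k) - 1
    if i < 0:
        return None  # k is smaller than every indexed key
    # Step 2: scan the file sequentially from the record it points to.
    for key, payload in records[index_ptrs[i]:]:
        if key == k:
            return (key, payload)
        if key > k:
            break  # passed the position where k would have been
    return None


# Illustrative file (sorted on the key) and a sparse index over it.
records = [
    ("Brighton", 750), ("Downtown", 500), ("Mianus", 700),
    ("Perryridge", 400), ("Redwood", 700), ("Round Hill", 350),
]
index_keys = ["Brighton", "Mianus", "Redwood"]
index_ptrs = [0, 2, 4]
```

    Looking up "Perryridge" selects the index entry for "Mianus" and then scans forward one record; a key that is absent from the file ends the scan as soon as a larger key is seen.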


    Sparse Index Files (Cont.)

    Compared to dense indices:

    Less space and less maintenance overhead for insertions and deletions.

    Generally slower than dense index for locating records.

    Good tradeoff: sparse index with an index entry for every block in the file, corresponding to the least search-key value in the block.


    Multilevel Index

    If the primary index does not fit in memory, access becomes expensive.

    Solution: treat the primary index kept on disk as a sequential file and construct a sparse index on it.

    Outer index: a sparse index of the primary index.

    Inner index: the primary index file.

    If even the outer index is too large to fit in main memory, yet another level of index can be created, and so on.

    Indices at all levels must be updated on insertion or deletion from the file.


    Multilevel Index (Cont.)

    Indices themselves may become too large for efficient processing.

    Example:

    Consider a file with 100,000 records, with 10 records in a block.

    With a sparse index and one index entry per block, we have about 10,000 index entries.

    Assuming 100 index entries fit into a block, we need about 100 blocks.

    It is desirable to keep the index file in main memory.

    Problem: searching a large index file becomes expensive.
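    The block counts in this example can be reproduced with a few lines of Python. The sketch assumes one sparse index entry per block at every level and keeps adding levels until the outermost index fits in a single block. The function names are illustrative and not taken from the slides.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return (a + b - 1) // b


def index_blocks_per_level(num_records, records_per_block, entries_per_block):
    """Blocks needed by each index level, innermost level first."""
    if entries_per_block < 2:
        raise ValueError("entries_per_block must be at least 2")
    # One sparse index entry per data block.
    entries = ceil_div(num_records, records_per_block)
    levels = []
    while True:
        blocks = ceil_div(entries, entries_per_block)
        levels.append(blocks)
        if blocks <= 1:
            break
        # The next (outer) level has one entry per block of this level.
        entries = blocks
    return levels


levels = index_blocks_per_level(100_000, 10, 100)
```

    For the figures above, the inner index needs 100 blocks; a sparse outer index over those 100 blocks has 100 entries, which fit in one block.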



    Index Update: Record Deletion

    If the deleted record was the only record in the file with its particular search-key value, the search key is deleted from the index also.

    Single-level index deletion:

    Dense indices: deletion of the search key is similar to file record deletion.

    Sparse indices:

    If the deleted key value exists in the index, the value is replaced by the next search-key value in the file (in search-key order).

    If the next search-key value already has an index entry, the entry is deleted instead of being replaced.
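    The sparse-index rule can be sketched in Python. To keep it short, this illustration tracks only search-key values (no record pointers), assumes every key in the file is distinct, and assumes that when the deleted key was the last one in the file its index entry is simply dropped. The name `sparse_delete` and the sample keys are hypothetical.

```python
def sparse_delete(index, file_keys, k):
    """Delete key k from the file and fix up the sparse index in place.

    index:     sorted list of the search-key values that have index entries
    file_keys: sorted list of the distinct search-key values in the file
    Raises ValueError if k is not in the file.
    """
    pos = file_keys.index(k)
    file_keys.pop(pos)
    if k not in index:
        return  # the deleted key had no index entry: nothing to update
    i = index.index(k)
    # Next search-key value in the file, in search-key order (if any).
    next_key = file_keys[pos] if pos < len(file_keys) else None
    if next_key is None or next_key in index:
        index.pop(i)         # next key is already indexed: drop the entry
    else:
        index[i] = next_key  # otherwise the entry takes over the next key


file_keys = ["Brighton", "Downtown", "Mianus", "Perryridge", "Redwood", "Round Hill"]
index = ["Brighton", "Mianus", "Redwood"]

sparse_delete(index, file_keys, "Mianus")      # entry replaced by "Perryridge"
after_first = list(index)
sparse_delete(index, file_keys, "Perryridge")  # "Redwood" already indexed: entry dropped
after_second = list(index)
sparse_delete(index, file_keys, "Downtown")    # not indexed: index unchanged
```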


    Index Update: Record Insertion

    Single-level index insertion:

    Perform a lookup using the search-key value of the inserted record.

    Dense indices: if the search-key value does not appear in the index, insert it.

    Sparse indices: if the index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created.

    If a new block is created, the first search-key value appearing in the new block is inserted into the index.

    Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level algorithms.


    Secondary Indices Example

    Index record points to a bucket that contains pointers to all the actual records with that particular search-key value.

    Secondary indices have to be dense.

    Secondary index on balance field of account.
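    A small Python sketch of this bucket structure follows. It assumes the account file is a list of dictionaries ordered by branch_name, and that a pointer is a list position. The function name and the sample rows are illustrative, not taken from the slides.

```python
from collections import defaultdict


def build_secondary_index(records, field):
    """Map every value of `field` to a bucket of pointers (record positions)."""
    buckets = defaultdict(list)
    for pos, rec in enumerate(records):
        buckets[rec[field]].append(pos)
    # Dense: one entry per distinct value, kept in search-key order.
    return dict(sorted(buckets.items()))


# Illustrative account file, stored in branch_name order.
account = [
    {"account_number": "A-217", "branch_name": "Brighton", "balance": 750},
    {"account_number": "A-101", "branch_name": "Downtown", "balance": 500},
    {"account_number": "A-215", "branch_name": "Mianus", "balance": 700},
    {"account_number": "A-102", "branch_name": "Perryridge", "balance": 400},
    {"account_number": "A-222", "branch_name": "Redwood", "balance": 700},
    {"account_number": "A-305", "branch_name": "Round Hill", "balance": 350},
]
balance_index = build_secondary_index(account, "balance")
```

    Because the file is not ordered on balance, records with the same balance are scattered, which is why every value needs its own entry: the bucket for 700 points to positions 2 and 4.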


    Hashing


    Static Hashing

    In a hash file organization, we obtain the address of the disk block containing a desired record directly by computing a function on the search-key value of the record.

    A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).

    A hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.

    The hash function is used to locate records for access, insertion, and deletion.


    Example of Hash File Organization

    There are 10 buckets.

    The binary representation of the ith character of the alphabet is assumed to be the integer i.

    The hash function returns the sum of the binary representations of the characters modulo 10.

    E.g., h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3.

    Hash file organization of the account file, using branch_name as key. (See figure in next slide.)
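    This hash function is easy to write out in Python. The sketch assumes that letters are treated case-insensitively (a = 1 through z = 26) and that any other character, such as the space in "Round Hill", contributes nothing to the sum. Those two assumptions are not stated on the slide, but they reproduce the three values given above.

```python
def h(branch_name, num_buckets=10):
    """Sum of the alphabet positions of the letters, modulo num_buckets."""
    total = sum(
        ord(c) - ord("a") + 1          # a -> 1, b -> 2, ..., z -> 26
        for c in branch_name.lower()
        if "a" <= c <= "z"             # assumption: skip spaces and other characters
    )
    return total % num_buckets


bucket_of = {name: h(name) for name in ("Perryridge", "Round Hill", "Brighton")}
```

    For example, the letters of "Perryridge" sum to 125, so the record goes to bucket 5. "Round Hill" and "Brighton" both land in bucket 3.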


    Example of Hash File Organization

    Hash file organization of the account file, using branch_name as key (see previous slide for details).


    Hash Functions

    The worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.

    An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.

    An ideal hash function is also random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.
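    The difference between a degenerate and a better-spread hash function shows up directly in bucket occupancy. The Python sketch below counts keys per bucket for two functions: a constant function (the worst case) and a letter-sum function used here purely as an example. The function names and the six sample keys are illustrative assumptions.

```python
from collections import Counter


def occupancy(keys, hash_fn, num_buckets):
    """Number of keys that fall into each bucket, as a list indexed by bucket."""
    counts = Counter(hash_fn(k) % num_buckets for k in keys)
    return [counts.get(b, 0) for b in range(num_buckets)]


def worst_hash(key):
    """Degenerate hash: every key maps to bucket 0."""
    return 0


def letter_sum_hash(key):
    """Example hash: sum of alphabet positions of the letters (a=1 ... z=26)."""
    return sum(ord(c) - ord("a") + 1 for c in key.lower() if "a" <= c <= "z")


keys = ["Brighton", "Downtown", "Mianus", "Perryridge", "Redwood", "Round Hill"]
worst = occupancy(keys, worst_hash, 10)
spread = occupancy(keys, letter_sum_hash, 10)
```

    With the constant function all six keys share one bucket, so a lookup degenerates into a scan of every key. The letter-sum function spreads the same keys over five buckets, though it is still not perfectly uniform.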


    Handling of Bucket Overflows

    If a bucket does not have enough space, a bucket overflow is said to occur. Bucket overflow can occur because of:

    Insufficient buckets.

    Skew in the distribution of records. Some buckets are assigned more records than others, so a bucket may overflow even when other buckets still have space.

    Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.


    Handling of Bucket Overflows (Cont.)

    Overflow chaining: the overflow buckets of a given bucket are chained together in a linked list.

    The above scheme is called closed hashing.

    An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications.
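    Overflow chaining can be sketched in Python with a small in-memory model. It assumes every bucket holds a fixed number of records and carries a link to at most one overflow bucket, which in turn may link to another. The class and method names are hypothetical, and real systems would chain disk blocks rather than Python objects.

```python
class Bucket:
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []     # (key, record) pairs stored in this bucket
        self.overflow = None  # next bucket in the overflow chain, if any


class HashFile:
    def __init__(self, num_buckets, capacity, hash_fn):
        self.capacity = capacity
        self.hash_fn = hash_fn
        self.buckets = [Bucket(capacity) for _ in range(num_buckets)]

    def _primary(self, key):
        return self.buckets[self.hash_fn(key) % len(self.buckets)]

    def insert(self, key, record):
        b = self._primary(key)
        # Walk the chain to the first bucket with free space,
        # appending a new overflow bucket when the chain is full.
        while len(b.records) >= b.capacity:
            if b.overflow is None:
                b.overflow = Bucket(self.capacity)
            b = b.overflow
        b.records.append((key, record))

    def lookup(self, key):
        # The primary bucket and its whole overflow chain must be searched.
        found, b = [], self._primary(key)
        while b is not None:
            found.extend(r for k, r in b.records if k == key)
            b = b.overflow
        return found

    def chain_length(self, i):
        n, b = 0, self.buckets[i].overflow
        while b is not None:
            n, b = n + 1, b.overflow
        return n


hf = HashFile(num_buckets=2, capacity=2, hash_fn=lambda k: k)
for key in (0, 2, 4, 6, 8):  # all even keys collide in bucket 0
    hf.insert(key, f"r{key}")
```

    Five colliding records with a capacity of two per bucket fill the primary bucket and two overflow buckets, while bucket 1 stays empty. That is the skew case described earlier.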


    Hash Indices

    Hashing can be used not only for file organization, but also for index-structure creation.

    A hash index organizes the search keys, with their associated record pointers, into a hash file structure.

    Strictly speaking, hash indices are always secondary indices: if the file itself is organized using hashing, a separate primary hash index on it using the same search key is unnecessary.

    However, we use the term hash index to refer to both secondary index structures and hash-organized files.


    Deficiencies of Static Hashing

    In static hashing, function h maps search-key values to a fixed set B of bucket addresses.

    Databases grow or shrink with time.

    If the initial number of buckets is too small and the file grows, performance will degrade due to too many overflows.

    If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull).

    If the database shrinks, space will again be wasted.

    One solution: periodic reorganization of the file with a new hash function.

    Expensive, disrupts normal operations.

    Better solution: allow the number of buckets to be modified dynamically.


    Dynamic Hashing

    Good for a database that grows and shrinks in size.

    Allows the hash function to be modified dynamically.

    With static hashing, by contrast, there are three options for a file that grows:

    1. Choose a hash function based on the current file size. This option will result in performance degradation as the database grows.

    2. Choose a hash function based on the anticipated size of the file at some point in the future. Although performance degradation is avoided, a significant amount of space may be wasted initially.

    3. Periodically reorganize the hash structure in response to file growth. Such a reorganization involves choosing a new hash function, recomputing the hash function on every record in the file, and generating new bucket assignments.

    This reorganization is a massive, time-consuming operation.
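    The cost of such a reorganization is visible even in a toy Python model. The sketch assumes buckets are plain lists of (key, record) pairs and that the "new hash function" is the same base function taken modulo a new bucket count. The name `reorganize` is illustrative.

```python
def reorganize(buckets, new_num_buckets, hash_fn):
    """Rebuild all bucket assignments for a new number of buckets.

    Every record in the file is visited and rehashed, which is why the
    operation is expensive for a large file.
    """
    new_buckets = [[] for _ in range(new_num_buckets)]
    for bucket in buckets:
        for key, record in bucket:
            new_buckets[hash_fn(key) % new_num_buckets].append((key, record))
    return new_buckets


# Two crowded buckets (keys hashed modulo 2) rebuilt as four buckets.
old = [[(0, "a"), (2, "c"), (4, "e")], [(1, "b"), (3, "d")]]
new = reorganize(old, 4, hash_fn=lambda k: k)
```

    The work is proportional to the number of records in the file, and the old bucket layout cannot be used for lookups while the new one is being built.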