Indexing Methods

Storage Requirements of Databases

• Need data to be stored “permanently” or persistently for long periods of time

• Usually too big to fit in main memory

• Low cost of storage per unit of data and the definition of “very large databases”

• Main cost incurred after storage is of searching the database

• Primary and secondary (auxiliary) file organizations

File Organizations

• Relations usually stored in files as logical “records” and read in terms of physical “blocks”

• File organization refers to the way records are stored in terms of blocks and the way blocks are placed on the storage medium and interlinked.

• Types of organizations
  – Unsorted
  – Sorted
  – Hashing

Records

• Represents a tuple in a relation

• A file is a sequence of records

• Records could be either fixed-length or variable-length

• Records comprise a sequence of fields (columns, attributes)

Blocks

• Refer to physical units of storage in storage devices (for example, sectors in hard disks or pages in virtual memory)

• Of fixed length, based on physical characteristics of the storage/computing device and operating system

• Storage device is either defragmented or fragmented depending on whether contiguous sets of records lie in contiguous blocks

Blocking Factor

The number of records that are stored in a block is called the “blocking factor”. Blocking factor is constant across blocks if record length is fixed, or variable otherwise.

If B is block size and R is record size, then blocking factor is:

bfr = ⌊B/R⌋ (the integer part of B/R)

Since R may not exactly divide B, there could be some left-over space in each block equal to:

B – (bfr * R) bytes.
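The arithmetic above can be sketched in a few lines of Python. The function names and the sample sizes (a 512-byte block, 100-byte records) are illustrative assumptions, not values from the slides.

```python
def blocking_factor(block_size: int, record_size: int) -> int:
    """Number of whole records that fit in one block (unspanned storage)."""
    if block_size <= 0 or record_size <= 0:
        raise ValueError("sizes must be positive")
    return block_size // record_size  # integer part of B/R


def unused_bytes(block_size: int, record_size: int) -> int:
    """Left-over space per block: B - (bfr * R)."""
    return block_size - blocking_factor(block_size, record_size) * record_size


# Illustrative values only.
B, R = 512, 100
print(blocking_factor(B, R))  # 5
print(unused_bytes(B, R))     # 12
```

A blocking factor of 0 signals that R > B, the case in which records cannot be stored unspanned.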

Spanned and Unspanned Records

When extra space in blocks is left unused, the record organization is said to be “unspanned”.

[Figure: a block holding Record 1, Record 2 and Record 3, followed by unused space.]

Spanned and Unspanned Records

In “spanned” record storage, records can be split so that they “span” across blocks.

[Figure: Block m holds Record 1, Record 2, Record 3 and part of Record 4; a pointer p leads to Block p, which holds the remaining part of Record 4.]

Spanned and Unspanned Records

When record size is greater than block size (i.e. R > B), use of “spanned” record storage is compulsory.

Indexes

• Index Files
  – Secondary or auxiliary files that help speed up data access in primary files

• Indexes or access structures
  – Data structures (and search methods) used for fast access

• Single level index
  – Index file maps directly to the block or the address of the record

• Multi-level index
  – Multiple levels of indirection among indexes

Definitions

• Indexing field (indexing attribute): The field on which an index structure is built (searching is fast on this field)

• Primary index: An index structure that is defined on the ordering key field (the field that is used to physically order records on disk in sorted file organizations)

Definitions

• Clustering index: When the ordering field is not a key field (i.e. not unique) a clustering index is used instead of a primary index

• Secondary index: An index structure defined on a non-ordering field.

Primary Indexes

• Comprises an ordered file of fixed-length records having two fields

• The first field of same data type as ordering key (primary key), and second field is of the type block address.

Primary index records are represented by a pair:

(k(i), a(i))

–Where k(i) is the key for the ith record and a(i) is the block address containing the ith record.

Primary Index

[Figure: an index file of (K(i), a(i)) entries, each holding a RollNo value (2003-0101, 2003-0121, 2003-0181, …) and the address of a block in the data file, whose columns are RollNo, Name, Age, Gender, Grade.]

Primary Index

• The number of entries in the index is equal to the number of disk blocks in the ordered data file

• The first record in each block of the file is indexed (in sparse indexes). These records are called anchor records

• A sparse index has index entries for only some of the search values

• A dense index has an index for every search key value (every record in the data file). Dense indexes are not beneficial on ordered data files.

Primary Index

• Search:
  – Easy. Perform binary search on the index file to identify the block containing the required record

• Insertion / Deletion:
  – Easy if key values in records are fixed length and statically allocated to blocks without block spanning (results in wasted space, however)
  – Else, re-computation of the index is required on insertion / deletion. Use of overflow buffers may be necessary.
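The search step can be illustrated with a minimal sketch, assuming the data file is a list of sorted blocks held in memory and that each index entry stores the anchor key of one block. The data values, `blocks`, `index` and `lookup` are hypothetical names introduced for the example.

```python
from bisect import bisect_right

# Hypothetical data file: each block is a sorted list of (key, payload) records.
blocks = [
    [(101, "a"), (105, "b"), (120, "c")],
    [(121, "d"), (140, "e")],
    [(181, "f"), (190, "g"), (200, "h")],
]

# Sparse primary index: one (anchor key, block number) entry per data block.
index = [(block[0][0], block_no) for block_no, block in enumerate(blocks)]
anchor_keys = [k for k, _ in index]


def lookup(key):
    """Return the payload stored under key, or None if the key is absent."""
    # Binary search for the last index entry whose anchor key is <= key.
    pos = bisect_right(anchor_keys, key) - 1
    if pos < 0:
        return None  # key is smaller than every anchor key
    block_no = index[pos][1]
    for k, payload in blocks[block_no]:  # scan inside the single block fetched
        if k == key:
            return payload
    return None


print(lookup(140))  # e
print(lookup(150))  # None
```

Only one data block is read per lookup; the binary search touches the index alone.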

Clustering Index

• Clustering field: A non-key ordering field. That is, blocks are ordered on this field which does not have the UNIQUE constraint

• Structure of index file similar to primary index file, but each index points to the first block having the given value in its clustering field.

• One index entry for every distinct value of the clustering field.

Clustering Index

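[Figure: a clustering index of (K(I), A(I)) entries on Dept No; each entry points to the first block of the data file (columns Dept No, Name, Gender, DOB, Job) that contains that Dept No value. A block may hold records for more than one Dept No value.]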

Clustering Index

• A sparse index, since only distinct values are indexed

• Insertion and deletion cause problems when a block can hold more than one value for clustering field

• Alternative solution: Allocate blocks for each value of clustering field.

Clustering Index

[Figure: the same clustering index on Dept No, with separate blocks allocated for each Dept No value; additional records for a value are held in further blocks linked to the first.]

Secondary Index

• Used to index fields that are not ordering fields; the indexed field may be a key or a non-key field.

• Many secondary indexes possible on a single file.

• One index entry for every record in the data file (dense index), containing the value of the indexed attribute and a pointer to the block / record.

Secondary Index on Key Field

[Figure: a sorted index of (K(i), A(i)) entries on RollNo (2003-0101, 2003-0102, 2003-0103, …), each pointing into an unsorted data file with columns RollNo, Name, Age, Dept No, Job.]

Has as many index entries as the number of records.

Secondary Index on Key Field

• Since key fields are unique, the number of index entries equals the number of records

• Data file need not be sorted on disk

• Fixed length records for index file

Secondary Index on non-key Field

• When a non-key field is indexed, duplicate values have to be handled.

• There are three different techniques for handling duplicates:
  – Duplicate index entries
  – Variable length records
  – Extra redirection levels

Duplicate Index Entries

[Figure: an index of (K(i), A(i)) entries in which the same key value appears in several consecutive entries, one per matching record.]

Index entries are repeated for each duplicate occurrence of the non-key attribute.

Binary search becomes more complicated. Mid-point of a search may have duplicate entries on either side.

Insertion of records may need restructuring of index table.
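One way to cope with duplicates on either side of the midpoint is to search for both ends of the run of equal values. The sketch below assumes an in-memory sorted list of (value, record address) pairs; the data and the name `addresses_for` are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical secondary index on a non-key field: sorted (value, address)
# pairs with one entry per record, so values repeat.
index = [(1, "r7"), (2, "r1"), (2, "r4"), (2, "r9"), (3, "r2"), (3, "r5")]
values = [v for v, _ in index]


def addresses_for(value):
    """Return all record addresses whose indexed field equals value."""
    lo = bisect_left(values, value)   # first entry with this value
    hi = bisect_right(values, value)  # one past the last entry with this value
    return [addr for _, addr in index[lo:hi]]


print(addresses_for(2))  # ['r1', 'r4', 'r9']
print(addresses_for(4))  # []
```

Two binary searches replace one, which is the extra complication the slide refers to.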

Variable Length Records

• Use variable length records for index table in order to accommodate duplicate key entries

• For a given key K(i), there is a set of address pointers instead of a single address pointer

• Binary search becomes complicated since address mid points cannot be computed efficiently

• Insertion of records may need restructuring of the index table

Extra Redirection Levels

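[Figure: an index of (K(I), A(I)) entries on LabId; each entry points to an address block, and each address block holds pointers to the records of the data file (columns RollNo, Name, Age, LabId, Grade) that have that LabId value.]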

Extra Indirection Levels

• Most frequently used technique

• Index records are of fixed length

• A(i) in an index record points to a block of address fields

• Block overflows handled by chaining

• Retrieval requires sequential search within blocks

• Insertion of records straightforward
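The points above can be sketched as follows. This is an in-memory illustration only: `AddressBlock`, `IndirectIndex` and the capacity of two addresses per block are assumptions made for the example, not details from the slides.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Optional

BLOCK_CAPACITY = 2  # hypothetical number of record addresses per address block


@dataclass
class AddressBlock:
    addresses: List[str] = field(default_factory=list)
    overflow: Optional["AddressBlock"] = None  # chained block used when full

    def add(self, addr: str) -> None:
        block = self
        while len(block.addresses) >= BLOCK_CAPACITY:
            if block.overflow is None:
                block.overflow = AddressBlock()
            block = block.overflow
        block.addresses.append(addr)

    def scan(self) -> List[str]:
        out, block = [], self
        while block is not None:  # sequential search along the chain
            out.extend(block.addresses)
            block = block.overflow
        return out


class IndirectIndex:
    def __init__(self) -> None:
        self.keys: List[int] = []             # sorted, one entry per distinct value
        self.blocks: List[AddressBlock] = []  # blocks[i] belongs to keys[i]

    def insert(self, key: int, addr: str) -> None:
        i = bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            self.keys.insert(i, key)
            self.blocks.insert(i, AddressBlock())
        self.blocks[i].add(addr)  # existing index entries are left unchanged

    def lookup(self, key: int) -> List[str]:
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.blocks[i].scan()
        return []
```

Because the index holds one fixed-length entry per distinct value, a plain binary search works again, and inserting a duplicate value only touches its address block.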

Multi-level Indexes

• Binary search in single-level indexes requires a search time of the order of log2(b) block accesses, where b is the number of blocks in the index file

• If the bfr of the index file is greater than 2, the number of block accesses can be reduced even further

• Multi-level indexes are meant for such a reduction.

Multi-level Indexes

• Contains several levels of the index file

• Each index block at a given level connects to a maximum of fo blocks at the next level. Here fo is called the “fan out” of the index structure

• Block accesses are reduced from log2(b) to log_fo(b) on average.
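To get a feel for the reduction, the sketch below compares the two estimates using integer arithmetic. It assumes one block read per index level and uses illustrative figures (1024 first-level index blocks, fan-out 32); the function names are hypothetical.

```python
def binary_search_accesses(b: int) -> int:
    """ceil(log2(b)): the order-of-magnitude estimate for binary search over b blocks."""
    accesses = 0
    while (1 << accesses) < b:
        accesses += 1
    return accesses


def multilevel_levels(b: int, fo: int) -> int:
    """Index levels needed until the top level fits in a single block."""
    if fo < 2:
        raise ValueError("fan-out must be at least 2")
    levels = 1
    while b > 1:
        b = -(-b // fo)  # ceiling division: blocks needed one level up
        levels += 1
    return levels


print(binary_search_accesses(1024))  # 10
print(multilevel_levels(1024, 32))   # 3
```

With these figures the multi-level index reads 3 index blocks (one per level) where binary search reads about 10.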

A Two-level Index Structure

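[Figure: a two-level index. The second (top) level holds the keys 2 and 10; the first (base) level holds the keys 2, 5, 10 and 15; these point to Block 1 and Block 2, which hold the sorted key values 2, 4, 5, 8, 10, 12, 15, 18.]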

Two-level Index Structure

• First (Base) level is the usual primary index that is maintained in a sorted file

• Second (Top) level is a primary index into the first level index file

• The process can be repeated to any number of levels

• Each level reduces the number of entries of its next level by a factor of fo.

Summary

Types of Indexes

                 Ordering field      Non-ordering field
Key field        Primary index       Secondary index (key)
Non-key field    Clustering index    Secondary index (non-key)

Summary

Properties of Indexes

Index type            Number of (first-level) index entries                   Dense or non-dense
Primary               Number of blocks in data file                           Non-dense
Clustering            Number of distinct index field values                   Non-dense
Secondary (key)       Number of records in data file                          Dense
Secondary (non-key)   Number of records or number of distinct field values    Dense or non-dense

Summary

• Multi-level indexes: Several levels of index files

• Characteristic “fan out” property. Fan out fo preferably greater than 2

• Reduces the number of block accesses to the order of log_fo(b).

Dynamic Multi-level Indexes

Balanced and Unbalanced Index Trees

[Figure: an unbalanced index tree, with search cost O(n), next to a balanced index tree, with search cost O(log_fo n).]

Insertions and Deletions

• Balanced property of index trees should be maintained during insertions and deletions

• Insertions and deletions are problematic in multi-level index, since all index files are physically sorted files

• An approach to overcome this is to use dynamic multi-level indexes

B-Trees

• A Tree data structure where each node has a predetermined maximum fan-out p

• Terminologies: root node, leaf nodes, internal nodes, parent, children

Structure of a Node

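[Figure: a B-tree node with keys K1, K2, …, Ki-1, Ki, each key with a data pointer. The left-most subtree holds values X < K1, the subtree between K1 and K2 holds values K1 < X < K2, and the right-most subtree holds values X > Ki.]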

B-Tree constraints

• For a node containing p-1 keys (and hence p subtrees), the following condition must always hold:
  – K1 < K2 < … < Kp-1

• For any data element X in subtree Pi, it should always be the case that:
  – Ki-1 < X < Ki for 1 < i < p; X < K1 in the left-most subtree; and X > Kp-1 in the right-most subtree

B-Tree Constraints

• Each node has at most p tree pointers

• Each node, except the root and leaf nodes, has at least p/2 tree pointers (tree balancing constraint)

• The root node has at least 2 tree pointers unless it is the only node in the tree

• All leaf nodes are at the same level. In a leaf node, all tree pointers are null.
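The constraints listed above can be turned into a small checker. This is a sketch under stated assumptions: `Node` and `check_btree` are hypothetical names, keys are integers, a leaf is modelled as a node with an empty `children` list, and p/2 is rounded up for odd p.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    keys: List[int]
    children: List["Node"] = field(default_factory=list)  # empty for a leaf


def check_btree(root: Node, p: int) -> bool:
    """Return True if the tree satisfies the B-tree constraints for fan-out p."""
    leaf_depths = set()

    def visit(node, depth, lo, hi, is_root):
        # Keys strictly increasing within the node.
        if any(a >= b for a, b in zip(node.keys, node.keys[1:])):
            return False
        # Every key lies strictly inside the range fixed by the ancestors.
        if any((lo is not None and k <= lo) or (hi is not None and k >= hi)
               for k in node.keys):
            return False
        if not node.children:  # leaf: all tree pointers are null
            leaf_depths.add(depth)
            return len(node.keys) <= p - 1
        n = len(node.children)
        if n != len(node.keys) + 1 or n > p:
            return False
        min_children = 2 if is_root else -(-p // 2)  # ceil(p/2): an assumption
        if n < min_children:
            return False
        bounds = [lo] + node.keys + [hi]
        return all(visit(c, depth + 1, bounds[i], bounds[i + 1], False)
                   for i, c in enumerate(node.children))

    # All leaves must sit at the same level.
    return visit(root, 0, None, None, True) and len(leaf_depths) == 1
```

For example, `check_btree(Node([5], [Node([3]), Node([7, 8])]), 3)` accepts the tree, while a tree whose leaves sit at different depths is rejected.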

B+ Trees

• Most common index structures in RDBMS.

• Leaf and non-leaf nodes have different structures: data pointers are stored only at the leaf nodes

• Leaf nodes form a “dense index” containing every entry for the search field and its corresponding record pointer

• Leaf nodes are linked to provide ordered access to data file records.

Non-leaf Nodes in B+ Trees

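[Figure: a non-leaf B+ tree node with keys K1, K2, …, Ki-1, Ki and tree pointers only. The left-most subtree holds values X < K1, the subtree between K1 and K2 holds values K1 < X < K2, and the right-most subtree holds values X > Ki.]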

Leaf Nodes in B+ Trees

[Figure: a B+ tree leaf node with keys K1, K2, …, Ki-1, Ki, a data pointer for each key, and a pointer to the next leaf node in the tree.]

Properties of Leaf Nodes

• Keys along the chain of leaf nodes are organized in sorted order
  – K1 < K2 < … < Kn

• Each leaf node has at least p/2 values

• All leaf nodes are at the same level

Searching in B+ Trees

Generalization of Binary Search.

1. Given a search key k, start from the root node.
2. If the key is present in the current node, then success; else
3. If the current node is a leaf node and the key is not present in the node, then the key is not in the database.
4. Search for a tree pointer Pi such that Ki-1 < k ≤ Ki, and follow it.
5. Return to step 2 to continue the search.
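A minimal sketch of the search follows. It departs from step 2 in one respect, stated here as an assumption: because data pointers live only in the leaves of a B+ tree, the sketch always descends to a leaf before reporting success. `Leaf`, `Internal`, `search` and the sample tree are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Leaf:
    keys: List[int]
    pointers: List[str]            # hypothetical record addresses
    next: Optional["Leaf"] = None  # link to the next leaf in key order


@dataclass
class Internal:
    keys: List[int]
    children: list                 # len(children) == len(keys) + 1


def search(node, k):
    """Return the data pointer for key k, or None if k is not in the tree."""
    while isinstance(node, Internal):
        # Choose the child Pi such that K(i-1) < k <= K(i).
        node = node.children[bisect_left(node.keys, k)]
    i = bisect_left(node.keys, k)
    if i < len(node.keys) and node.keys[i] == k:
        return node.pointers[i]
    return None


# A small hand-built tree: separators 3 and 7 over three linked leaves.
l1 = Leaf([2, 3], ["r2", "r3"])
l2 = Leaf([5, 7], ["r5", "r7"])
l3 = Leaf([8, 9], ["r8", "r9"])
l1.next, l2.next = l2, l3
root = Internal([3, 7], [l1, l2, l3])

print(search(root, 5))  # r5
print(search(root, 4))  # None
```

Within each node a binary search replaces the two-way comparison of ordinary binary search, which is the sense in which the procedure generalizes it.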

Insertion

• Originally, tree begins with only the root node.

• As and when nodes fill up, they are “split” and made children of a new node.

• Keys are split uniformly across the three nodes.
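The split of a full leaf can be sketched as below. The convention used here is an assumption rather than something the slides specify: the left half keeps the extra key when the count is odd, and the largest key of the left half is copied up as the separator. `insert_into_leaf` is a hypothetical helper.

```python
from bisect import insort


def insert_into_leaf(keys, key, capacity):
    """Insert key into a sorted leaf and split the leaf if it overflows.

    Returns (left, right, separator); right and separator are None when
    no split was needed.
    """
    keys = list(keys)
    insort(keys, key)
    if len(keys) <= capacity:
        return keys, None, None
    mid = (len(keys) + 1) // 2    # left half takes the extra key
    left, right = keys[:mid], keys[mid:]
    return left, right, left[-1]  # largest left key is copied to the parent


print(insert_into_leaf([5, 8], 3, capacity=2))  # ([3, 5], [8], 5)
print(insert_into_leaf([8], 7, capacity=2))     # ([7, 8], None, None)
```

When the parent itself overflows after receiving the separator, the same split is applied one level up, which is how a new root comes into being.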

Insertion

[Figure: a single root node holding the keys 5 and 8.]

Let p = 2.

Let insertion sequence of keys be: 5, 8, 3, 7, 2, 9, 17, 10, …

Tree, after insertion of 5 and 8.

Insertion of next key 3 causes overflow requiring a split.

Insertion

[Figure: the tree after inserting 3 and splitting the root node.]

7 is inserted into this node. No overflow.

Insertion

[Figure: the tree after inserting 7.]

Insertion of 2 causes overflows that need to be cascaded to upper levels.

Insertion

[Figure: the tree after inserting 2.]

Insertion of 9…

Insertion

[Figure: the tree after inserting 9.]

Deletion

• Deletion of keys may cause underflows which have to be handled separately

• An underflow occurs when a node contains fewer than p/2 keys

• Nodes are merged with their siblings when underflows occur

Indexes on Multiple Attributes

• All index structures explored so far assume simple attributes, comprising only one value

• Many applications require multi-attribute (composite) keys

Ordered Index on Multi-attributes

• Considers a composite key as a tuple of simple keys (k1, k2, …kn)

• Ordered index files maintained by ordering each key in sequence.
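Python tuples compare component by component, so they give a convenient way to illustrate this ordering. The sketch below assumes a composite key of (dept_no, roll_no); the data and the helper names are made up for the example.

```python
from bisect import bisect_left, insort

# Hypothetical composite key: (dept_no, roll_no).
index = []  # sorted list of (composite key, record address)
for key, addr in [((2, 17), "r1"), ((1, 40), "r2"), ((2, 3), "r3"), ((1, 5), "r4")]:
    insort(index, (key, addr))

keys = [k for k, _ in index]


def lookup(key):
    """Exact match on the full composite key."""
    i = bisect_left(keys, key)
    return index[i][1] if i < len(keys) and keys[i] == key else None


def prefix_scan(k1):
    """All entries whose first component equals k1, ordered by the second."""
    i = bisect_left(keys, (k1,))  # (k1,) sorts before every (k1, x)
    out = []
    while i < len(keys) and keys[i][0] == k1:
        out.append(index[i])
        i += 1
    return out


print(keys)            # [(1, 5), (1, 40), (2, 3), (2, 17)]
print(lookup((2, 3)))  # r3
print(prefix_scan(2))  # [((2, 3), 'r3'), ((2, 17), 'r1')]
```

The ordering helps searches on the leading component; a search on the second component alone gets no benefit from it.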

Partitioned Hashing

• Given a composite key (k1, k2, …kn), partitioned hashing returns n different bucket numbers

• Hash bucket is determined by concatenating the n numbers.
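A minimal sketch of the idea, under assumptions chosen for the example: each component is hashed to 3 bits using CRC32 (picked only because it is deterministic across runs), and the bucket numbers are concatenated bit-wise. The names below are hypothetical.

```python
import zlib

BITS_PER_COMPONENT = 3  # hypothetical: each component contributes 3 bits


def component_hash(value) -> int:
    """Hash one component of the key to a small bucket number."""
    return zlib.crc32(repr(value).encode("utf-8")) % (1 << BITS_PER_COMPONENT)


def partitioned_hash(composite_key) -> int:
    """Concatenate the per-component bucket numbers into one bucket address."""
    address = 0
    for value in composite_key:
        address = (address << BITS_PER_COMPONENT) | component_hash(value)
    return address


key = ("CS", 42)  # illustrative (department, roll number) pair
print(partitioned_hash(key) < 2 ** (BITS_PER_COMPONENT * len(key)))  # True
```

With n components of 3 bits each, the address space has 2^(3n) buckets.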

Grid Files

• Partitions the range of key values for each key into several buckets

• Combinations of buckets of each key form a “grid”

• A grid file stores a grid in either a row major or a column major form.

Grid Files

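[Figure: a grid with Roll No. partitions 1 to 5 as columns and Grades A, B, C, D as rows; grid cells point into a bucket pool.]

Roll No. partitions:

1  001 – 025
2  026 – 050
3  051 – 075
4  076 – 100
5  101 – 125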
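The mapping from a (Roll No., Grade) pair to a grid cell stored in row-major form can be sketched as follows. The Roll No. ranges follow the figure; treating grades as rows, and the function names, are assumptions made for the example.

```python
from bisect import bisect_left

# Linear scales. Upper bounds of Roll No. partitions 1..5, as in the figure.
ROLL_UPPER_BOUNDS = [25, 50, 75, 100, 125]
GRADES = ["A", "B", "C", "D"]  # assumed to be the rows of the grid


def roll_partition(roll_no: int) -> int:
    """0-based column for a roll number in 1..125."""
    if not 1 <= roll_no <= ROLL_UPPER_BOUNDS[-1]:
        raise ValueError("roll number outside the grid")
    return bisect_left(ROLL_UPPER_BOUNDS, roll_no)


def cell_offset(roll_no: int, grade: str) -> int:
    """Row-major position of the grid cell for (roll_no, grade)."""
    row = GRADES.index(grade)
    col = roll_partition(roll_no)
    return row * len(ROLL_UPPER_BOUNDS) + col


print(cell_offset(30, "B"))   # 6
print(cell_offset(125, "D"))  # 19
```

Each cell at the computed offset would then hold a pointer into the bucket pool.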

Summary

• Multi-level Indexes

• Trees, root node, leaf nodes, non-leaf (internal) nodes

• Dynamic multi-level indexes, B-trees and B+ trees

• Insertion and deletion in B+ trees

• Indexes on multiple attributes.
