Lt20 21 Index

7/29/2019 Lt20 21 Index

1/28

Database Systems and

Applications

Lecture 20-21

Indexing

7/29/2019 Lt20 21 Index

2/28

Physical model in 3 level design?

Recall 3 levels of database design

Conceptual model: high level abstract description

Logical model: description of a concrete realization

Physical model: implementation using basic components

Analogy with vehicles

Conceptual model: mechanisms to move, turn, stop, ...

Logical models:

Car: accelerator pedal, steering wheel, brake pedal,

Bicycle: pedal forward to move, turn handle, pull brakes on handle

Physical models :

Car: engine, transmission, master cylinder, break lines, brake pads,

Bicycle: chain from pedal to wheels, gears, wire from handle to brakepads

7/29/2019 Lt20 21 Index

3/28

What is a physical data model?

What is a physical data model of a database?

Concepts to implement logical data model

Using current components, e.g. computer hardware,operating systems

In an efficient and fault-tolerant manner

Why learn physical data model concepts?

To be able to use DBMS facilities for performancetuning

For example, If a query is running slow,

one may create an index to speed it up

For example, if loading of a large number of tuplestakes for ever

one may drop indices on the table before the inserts

and recreate index after inserts are done!

7/29/2019 Lt20 21 Index

4/28

Concepts in a physical data model

Database concepts

Conceptual data model - entity, (multi-valued) attributes,relationship,

Logical model - relations, atomic attributes, primary andforeign keys

Physical model - secondary storage hardware, file structures,indices,

Examples of physical model concepts from relational DBMS

Secondary storage hardware: Disk drives

File structures - sorted

Auxiliary search structure -

search trees (hierarchical collections of one-dimensional ranges)

7/29/2019 Lt20 21 Index

5/28

Storage Hierarchy in Computers

Computers have several componentsCentral Processing Unit (CPU)Input, output devices, e.g. mouse, keyword, monitors, printersCommunication mechanisms, e.g. internal bus, network card,modemStorage Hierarchy

Types of storage DevicesMain memories - fast but content is lost when power is offSecondary storage - slower, retains content without powerTertiary storage - very slow, retains content, very large capacity

DBMS usually manage dataon secondary storage, e.g. disksUse main memory to improve performanceUser tertiary storage (e.g. tapes) for backup, archival etc.

7/29/2019 Lt20 21 Index

6/28

Software view of Disks: Fields,Records and File

Views of secondary storage (e.g. disks)Hardware views previous slides.

Software views - Data on disks is organized into fields, records, files

ConceptsField presents a property or attribute of a relation or an entity

Records represent a row in a relational table

Collection of fields for attributes in relational schema of the table

Files are collections of records /A relation is typically stored as a file ofrecords.

Homogeneous collection of records may represent a relation

Heterogeneous collections may be a union of related relations.

7/29/2019 Lt20 21 Index

7/28

File organization

7/29/2019 Lt20 21 Index

8/28

DBMS Plan Executor ParserQuery

EvaluationOperator Evaluator Optimizer

Engine%

File and Access Methods

Transaction

Manager

Buffer ManagerRecovery

Manager

Lock

ManagerDisk Space Manager

Concurrency Control%

Index FilesSystem

Data FilesCatalog

Web Forms Application Front Ends SQL Interface

SQL Commands

DBMS ARCHITECTURE

7/29/2019 Lt20 21 Index

9/28

Data on External Storage Disks: Can retrieve random page at fixed cost.

But reading several consecutive pages is much cheaper thanreading them in random order.

Tapes: Can only read pages in sequence. Cheaper than disks; used for archival storage.

File organization: Method of arranging a file of recordson external storage. Record id (rid) is sufficient to physically locate record.

Indexes are data structures that allow us to find the record ids ofrecords with given values in index search key fields.

Architecture: Buffer manager fetches pages fromexternal storage to main memory buffer pool. File and

index layers make calls to the buffer manager.

7/29/2019 Lt20 21 Index

10/28

Alternative File Organizations Many alternatives exist, each ideal for some situations,

and not so good in others:

Heap (random order) files: Suitable when typical access is a filescan retrieving all records.

Sorted Files: Best if records must be retrieved in some order, oronly a `range of records is needed.

Indexes: Data structures to organize records via trees orhashing.

Like sorted files, they speed up searches for a subset of records, basedon values in certain (search key) fields.

Updates are much faster than in sorted files.

7/29/2019 Lt20 21 Index

11/28

Indexes

An indexon a file speeds up selections on the

search key fields for the index. Any subset of the fields of a relation can be the

search key for an index on the relation.

Search keyis not the same as key(minimal set offields that uniquely identify a record in a relation).

An index contains a collection ofdata entries,and supports efficient retrieval of all dataentries k* with a given key value k.

A t t D t E t

7/29/2019 Lt20 21 Index

12/28

A ternat ves or Data Entry nIndex

Three alternatives: Data record with key value k

Choice of alternative for data entries is

orthogonal (list of axes) to the indexingtechnique used to locate data entries with agiven key value k. Examples of indexing techniques: B+ trees, hash-

based structures.

Typically, index contains auxiliary information thatdirects searches to the desired data entries.

Al i f D E i

7/29/2019 Lt20 21 Index

13/28

Alternatives for Data Entries(Contd.)

Alternative 1: If this is used, index structure is a file

organization for data records (instead of aHeap file or sorted file).

At most one index on a given collection of datarecords can use Alternative 1. (Otherwise,data records are duplicated, leading toredundant storage and potentialinconsistency.)

If data records are very large, # of pagescontaining data entries is high.

7/29/2019 Lt20 21 Index

14/28

erna ves or a a n r es(Contd.)

Alternatives 2 and 3: Data entries typically much smaller than data

records. So, better than Alternative 1 withlarge data records, especially if search keysare small. (Portion of index structure used to

direct search, which depends on size of dataentries, is much smaller than with Alternative1.)

Alternative 3 more compact than Alternative 2,

but leads to variable sized data entries even ifsearch keys are of fixed length.

7/29/2019 Lt20 21 Index

15/28

Example of Alternative 1

8bluerectangle6

4bluesquare5

2blueround4

Red

Red

Red

colour

3

2

1

Location

8rectangle

4square

2round

holesshape

6 data entries,sorted by colour

7/29/2019 Lt20 21 Index

16/28


blue6

blue5

blue4

Red

Red

Red

colour

3

2

1

Location

6 data entries,sorted bycolour

7/29/2019 Lt20 21 Index

17/28


Locations

colour

1, 2, 3 Red

4,5,6 Blue

2 dataentries,

variablelength

7/29/2019 Lt20 21 Index

18/28

Index Classification

Primaryvs. secondary: If search key contains primary key,

then called primary index. Unique index: Search key contains a candidate key.

Clustered vs. unclustered: If order of data records is thesame as, or `close to, order of data entries, then calledclustered index.

Alternative 1 implies clustered; in practice, clustered also impliesAlternative 1 (since sorted files are rare).

A file can be clustered on at most one search key.

Cost of retrieving data records through index varies greatlybasedon whether index is clustered or not!

7/29/2019 Lt20 21 Index

19/28

Suppose that Alternative (2) is used fordata entries, and that the data recordsare stored in a Heap file.

To build clustered index, first sort theHeap file (with some free space on eachpage for future inserts).

Overflow pages may be needed for inserts.(Thus, order of data recs is `close to, but

not identical to, the sort order.)

Index Classification(cntd)

7/29/2019 Lt20 21 Index

20/28

Clustered vs. Unclustered Index

CLUSTERED Index entries UNCLUSTEREDdirect searchfor data entries

Data entries Data entries

(index file)

(data file)

Data Records DataRecords

7/29/2019 Lt20 21 Index

21/28

Hash-Based Indexes

Good for equality selections. Index is a collection ofbuckets.Bucket =primarypage plus zero or more

overflow pages.

Hashing function h: h(r) = bucket in whichrecord rbelongs. h looks at the search keyfields ofr.

If Alternative (1) is used, the buckets contain the datarecords; otherwise, they contain

or pairs.

7/29/2019 Lt20 21 Index

22/28

Samu,44,3000

Jones,40,6003

Tracy,44,5004Sahu,25,3000

Kones,33,4003

Tincy,29,2007Sanju,50,5004

John,22,6003

h1

age

h(age)=00

h(age)=10

h(age)=01

3000

3000

5004

5004

4003

2007

6003

6003

h2

sal

h(sal)=00

h(sal)=11

7/29/2019 Lt20 21 Index

23/28

Tree Indexes

Non-leafpages

Leaf pages contain data entries and are chained (prev. &next).

Non-leaf pages contain index entries and direct searches.

leaf

pages

7/29/2019 Lt20 21 Index

24/28

Cost Model for Our Analysis

We ignore CPU costs, for simplicity: B: The number of data pages R: Number of records per page D: (Average) time to read or write disk page

Measuring number of page I/Os ignores gains ofpre-fetching a sequence of pages; thus, even I/O

cost is only approximated. Average-case analysis; based on several

simplistic assumptions.* Good enough to show the overall trends!

7/29/2019 Lt20 21 Index

25/28

Comparing File Organizations

Heap files (random order; insert at eof)

Sorted files, sorted on

Clustered B+ tree file, Alternative (1),search key

Heap file with unclustered B + treeindex on search key < age, sal >

Heap file with unclustered hash index onsearch key < age, sal >

7/29/2019 Lt20 21 Index

26/28

Operations to Compare

Scan: Fetch all records from disk

Equality search

Range selection Insert a record

Delete a record

7/29/2019 Lt20 21 Index

27/28

Assumptions in Our Analysis

Heap Files: Equality selection on key; exactly one match.

Sorted Files:

Files compacted after deletions. Indexes:

Alt (2), (3): data entry size = 10% size of record

Hash: No overflow buckets.

80% page occupancy => File size = 1.25 data sizeTree: 67% occupancy (this is typical).

Implies file size = 1.5 data size

7/29/2019 Lt20 21 Index

28/28

(a)scan

(b)equality

(c) range (d)insert

(e)delete

1.Heap BD 0.5BD BD 2D Searc

h+D

2.Sorted BD Dlog2B Dlog2B+#matchingpages

Search+BD

Search+BD

3.Clustered 1.5BD DlogF1.5B

DlogF1.5B+#matchingpages

Search+D

Search+D

4.Unclusteredtree index

BD(R+0.15)

D(1+logF0.15B)

D(logF0.15B+#matchingrecords)

D(3+logF0.15B)

Search+2D

5.Unclusteredhash index

BD(R+0.125)

2D BD 4D Search+2D

Documents

Lt20 21 Index