Lt20 21 Index

Embed Size (px)

Citation preview

  • 7/29/2019 Lt20 21 Index

    1/28

    Database Systems and

    Applications

    Lecture 20-21

    Indexing

  • 7/29/2019 Lt20 21 Index

    2/28

    Physical model in 3 level design?

    Recall 3 levels of database design

    Conceptual model: high level abstract description

    Logical model: description of a concrete realization

    Physical model: implementation using basic components

    Analogy with vehicles

    Conceptual model: mechanisms to move, turn, stop, ...

    Logical models:

    Car: accelerator pedal, steering wheel, brake pedal,

    Bicycle: pedal forward to move, turn handle, pull brakes on handle

    Physical models :

    Car: engine, transmission, master cylinder, break lines, brake pads,

    Bicycle: chain from pedal to wheels, gears, wire from handle to brakepads

  • 7/29/2019 Lt20 21 Index

    3/28

    What is a physical data model?

    What is a physical data model of a database?

    Concepts to implement logical data model

    Using current components, e.g. computer hardware,operating systems

    In an efficient and fault-tolerant manner

    Why learn physical data model concepts?

    To be able to use DBMS facilities for performancetuning

    For example, If a query is running slow,

    one may create an index to speed it up

    For example, if loading of a large number of tuplestakes for ever

    one may drop indices on the table before the inserts

    and recreate index after inserts are done!

  • 7/29/2019 Lt20 21 Index

    4/28

    Concepts in a physical data model

    Database concepts

    Conceptual data model - entity, (multi-valued) attributes,relationship,

    Logical model - relations, atomic attributes, primary andforeign keys

    Physical model - secondary storage hardware, file structures,indices,

    Examples of physical model concepts from relational DBMS

    Secondary storage hardware: Disk drives

    File structures - sorted

    Auxiliary search structure -

    search trees (hierarchical collections of one-dimensional ranges)

  • 7/29/2019 Lt20 21 Index

    5/28

    Storage Hierarchy in Computers

    Computers have several componentsCentral Processing Unit (CPU)Input, output devices, e.g. mouse, keyword, monitors, printersCommunication mechanisms, e.g. internal bus, network card,modemStorage Hierarchy

    Types of storage DevicesMain memories - fast but content is lost when power is offSecondary storage - slower, retains content without powerTertiary storage - very slow, retains content, very large capacity

    DBMS usually manage dataon secondary storage, e.g. disksUse main memory to improve performanceUser tertiary storage (e.g. tapes) for backup, archival etc.

  • 7/29/2019 Lt20 21 Index

    6/28

    Software view of Disks: Fields,Records and File

    Views of secondary storage (e.g. disks)Hardware views previous slides.

    Software views - Data on disks is organized into fields, records, files

    ConceptsField presents a property or attribute of a relation or an entity

    Records represent a row in a relational table

    Collection of fields for attributes in relational schema of the table

    Files are collections of records /A relation is typically stored as a file ofrecords.

    Homogeneous collection of records may represent a relation

    Heterogeneous collections may be a union of related relations.

  • 7/29/2019 Lt20 21 Index

    7/28

    File organization

  • 7/29/2019 Lt20 21 Index

    8/28

    DBMS Plan Executor ParserQuery

    EvaluationOperator Evaluator Optimizer

    Engine%

    File and Access Methods

    Transaction

    Manager

    Buffer ManagerRecovery

    Manager

    Lock

    ManagerDisk Space Manager

    Concurrency Control%

    Index FilesSystem

    Data FilesCatalog

    Web Forms Application Front Ends SQL Interface

    SQL Commands

    DBMS ARCHITECTURE

  • 7/29/2019 Lt20 21 Index

    9/28

    Data on External Storage Disks: Can retrieve random page at fixed cost.

    But reading several consecutive pages is much cheaper thanreading them in random order.

    Tapes: Can only read pages in sequence. Cheaper than disks; used for archival storage.

    File organization: Method of arranging a file of recordson external storage. Record id (rid) is sufficient to physically locate record.

    Indexes are data structures that allow us to find the record ids ofrecords with given values in index search key fields.

    Architecture: Buffer manager fetches pages fromexternal storage to main memory buffer pool. File and

    index layers make calls to the buffer manager.

  • 7/29/2019 Lt20 21 Index

    10/28

    Alternative File Organizations Many alternatives exist, each ideal for some situations,

    and not so good in others:

    Heap (random order) files: Suitable when typical access is a filescan retrieving all records.

    Sorted Files: Best if records must be retrieved in some order, oronly a `range of records is needed.

    Indexes: Data structures to organize records via trees orhashing.

    Like sorted files, they speed up searches for a subset of records, basedon values in certain (search key) fields.

    Updates are much faster than in sorted files.

  • 7/29/2019 Lt20 21 Index

    11/28

    Indexes

    An indexon a file speeds up selections on the

    search key fields for the index. Any subset of the fields of a relation can be the

    search key for an index on the relation.

    Search keyis not the same as key(minimal set offields that uniquely identify a record in a relation).

    An index contains a collection ofdata entries,and supports efficient retrieval of all dataentries k* with a given key value k.

    A t t D t E t

  • 7/29/2019 Lt20 21 Index

    12/28

    A ternat ves or Data Entry nIndex

    Three alternatives: Data record with key value k

    Choice of alternative for data entries is

    orthogonal (list of axes) to the indexingtechnique used to locate data entries with agiven key value k. Examples of indexing techniques: B+ trees, hash-

    based structures.

    Typically, index contains auxiliary information thatdirects searches to the desired data entries.

    Al i f D E i

  • 7/29/2019 Lt20 21 Index

    13/28

    Alternatives for Data Entries(Contd.)

    Alternative 1: If this is used, index structure is a file

    organization for data records (instead of aHeap file or sorted file).

    At most one index on a given collection of datarecords can use Alternative 1. (Otherwise,data records are duplicated, leading toredundant storage and potentialinconsistency.)

    If data records are very large, # of pagescontaining data entries is high.

  • 7/29/2019 Lt20 21 Index

    14/28

    erna ves or a a n r es(Contd.)

    Alternatives 2 and 3: Data entries typically much smaller than data

    records. So, better than Alternative 1 withlarge data records, especially if search keysare small. (Portion of index structure used to

    direct search, which depends on size of dataentries, is much smaller than with Alternative1.)

    Alternative 3 more compact than Alternative 2,

    but leads to variable sized data entries even ifsearch keys are of fixed length.

  • 7/29/2019 Lt20 21 Index

    15/28

    Example of Alternative 1

    8bluerectangle6

    4bluesquare5

    2blueround4

    Red

    Red

    Red

    colour

    3

    2

    1

    Location

    8rectangle

    4square

    2round

    holesshape

    6 data entries,sorted by colour

  • 7/29/2019 Lt20 21 Index

    16/28

    Example of Alternative 2

    blue6

    blue5

    blue4

    Red

    Red

    Red

    colour

    3

    2

    1

    Location

    6 data entries,sorted bycolour

  • 7/29/2019 Lt20 21 Index

    17/28

    Example of Alternative 3

    Locations

    colour

    1, 2, 3 Red

    4,5,6 Blue

    2 dataentries,

    variablelength

  • 7/29/2019 Lt20 21 Index

    18/28

    Index Classification

    Primaryvs. secondary: If search key contains primary key,

    then called primary index. Unique index: Search key contains a candidate key.

    Clustered vs. unclustered: If order of data records is thesame as, or `close to, order of data entries, then calledclustered index.

    Alternative 1 implies clustered; in practice, clustered also impliesAlternative 1 (since sorted files are rare).

    A file can be clustered on at most one search key.

    Cost of retrieving data records through index varies greatlybasedon whether index is clustered or not!

  • 7/29/2019 Lt20 21 Index

    19/28

    Suppose that Alternative (2) is used fordata entries, and that the data recordsare stored in a Heap file.

    To build clustered index, first sort theHeap file (with some free space on eachpage for future inserts).

    Overflow pages may be needed for inserts.(Thus, order of data recs is `close to, but

    not identical to, the sort order.)

    Index Classification(cntd)

  • 7/29/2019 Lt20 21 Index

    20/28

    Clustered vs. Unclustered Index

    CLUSTERED Index entries UNCLUSTEREDdirect searchfor data entries

    Data entries Data entries

    (index file)

    (data file)

    Data Records DataRecords

  • 7/29/2019 Lt20 21 Index

    21/28

    Hash-Based Indexes

    Good for equality selections. Index is a collection ofbuckets.Bucket =primarypage plus zero or more

    overflow pages.

    Hashing function h: h(r) = bucket in whichrecord rbelongs. h looks at the search keyfields ofr.

    If Alternative (1) is used, the buckets contain the datarecords; otherwise, they contain

    or pairs.

  • 7/29/2019 Lt20 21 Index

    22/28

    Samu,44,3000

    Jones,40,6003

    Tracy,44,5004Sahu,25,3000

    Kones,33,4003

    Tincy,29,2007Sanju,50,5004

    John,22,6003

    h1

    age

    h(age)=00

    h(age)=10

    h(age)=01

    3000

    3000

    5004

    5004

    4003

    2007

    6003

    6003

    h2

    sal

    h(sal)=00

    h(sal)=11

  • 7/29/2019 Lt20 21 Index

    23/28

    Tree Indexes

    Non-leafpages

    Leaf pages contain data entries and are chained (prev. &next).

    Non-leaf pages contain index entries and direct searches.

    leaf

    pages

  • 7/29/2019 Lt20 21 Index

    24/28

    Cost Model for Our Analysis

    We ignore CPU costs, for simplicity: B: The number of data pages R: Number of records per page D: (Average) time to read or write disk page

    Measuring number of page I/Os ignores gains ofpre-fetching a sequence of pages; thus, even I/O

    cost is only approximated. Average-case analysis; based on several

    simplistic assumptions.* Good enough to show the overall trends!

  • 7/29/2019 Lt20 21 Index

    25/28

    Comparing File Organizations

    Heap files (random order; insert at eof)

    Sorted files, sorted on

    Clustered B+ tree file, Alternative (1),search key

    Heap file with unclustered B + treeindex on search key < age, sal >

    Heap file with unclustered hash index onsearch key < age, sal >

  • 7/29/2019 Lt20 21 Index

    26/28

    Operations to Compare

    Scan: Fetch all records from disk

    Equality search

    Range selection Insert a record

    Delete a record

  • 7/29/2019 Lt20 21 Index

    27/28

    Assumptions in Our Analysis

    Heap Files: Equality selection on key; exactly one match.

    Sorted Files:

    Files compacted after deletions. Indexes:

    Alt (2), (3): data entry size = 10% size of record

    Hash: No overflow buckets.

    80% page occupancy => File size = 1.25 data sizeTree: 67% occupancy (this is typical).

    Implies file size = 1.5 data size

  • 7/29/2019 Lt20 21 Index

    28/28

    (a)scan

    (b)equality

    (c) range (d)insert

    (e)delete

    1.Heap BD 0.5BD BD 2D Searc

    h+D

    2.Sorted BD Dlog2B Dlog2B+#matchingpages

    Search+BD

    Search+BD

    3.Clustered 1.5BD DlogF1.5B

    DlogF1.5B+#matchingpages

    Search+D

    Search+D

    4.Unclusteredtree index

    BD(R+0.15)

    D(1+logF0.15B)

    D(logF0.15B+#matchingrecords)

    D(3+logF0.15B)

    Search+2D

    5.Unclusteredhash index

    BD(R+0.125)

    2D BD 4D Search+2D