DataB Ch3 File

Embed Size (px)

Citation preview

  • 7/28/2019 DataB Ch3 File

    1/15

    File Organization Chapter_3 (1 / 15)

    ___________________________________________________________________________________

    _________________________________________________________________________________

    Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University

    File Organization for DBMS

    Physical Storage Media

    Several types of data storage exist in most computer systems. These storage

    media are classified by the speed with which data can be accessed, by the cost per unit

    of data to buy the memory, and by how reliable they are. Among the media

    available are:

    Cache :This is the fastest and most costly form of storage. The size ofcache memory is very small and the use of cache is managed by theoperating system.

    Main Memory : This is the storage media used for data that isavailable to be operated on. The general-purpose machine instructions

    operate on main memory. The main memory is sometimes referred to

    as core memory. The contents of the main memory are usually lost

    if a power failure or system crash occurs.

    Disk Storage :This is the primary medium for the long-term storageof data. Typically the entire data base is stored on disk. Data must be

    moved from disk to main memory in order for the data to be operated

    on. After operations are performed, data must be returned to disk.

    Disk storage is referred to as direct-access storage because it is

    possible to read data on disk in any order. Disk storage usually

    survives power failure and system crashes.

    Tape Storage :This is storage used primarily for backup and archivaldata. Although tape is much cheaper than disk, access to data is much

    slower, since the tape must be read sequentially from the beginning.

    For this reason, tape storage is referred to as sequential-access

    storage and is used primarily for recovery from failures. Tape devices

    are less complex than disks; thus, they are more reliable.

    The figure below shows a simple disk. The head is a device which stays closeto the surface of the platter and reads or writes information encoded magnetically on

  • 7/28/2019 DataB Ch3 File

    2/15

    File Organization Chapter_3 (2 / 15)

    ___________________________________________________________________________________

    _________________________________________________________________________________

    Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University

    the platter. The platter is organized into concentric tracks of data. The arm can be

    positioned over any one of the tracks. The platter is spun at a high speed. To read or

    write information, the arm is positioned over the correct track; and, when the data to

    be accessed passes under the head, the read or write operation is performed.

    Since the platter rotates at a high speed, it does not take very long for the

    contents of an entire track to pass under the head. This time is known as the disk

    latency time. The time for repositioning the arm, known as seek time, is relatively

    longer than the latencytime. It is useful to store related information on the same track

    or on physically close tracks in order to minimize the seek time.

    It is typical to have high-capacity disks with multiple platters. Multiple-platter

    disks are called disk packs. All the disk arms are moved as a unit by the actuator.Each arm has two heads: one to read and write the top surface of the platter below it,

    Platter

    Head

    Arm

    Track 0

    Track 1

    Track n

    Disk latency time

    Seek time

    Platter 0

    actuator

    Arm 0

    Arm 1

    Arm 2

    Platter 1Platter 2

    Platter 3Disk pack

  • 7/28/2019 DataB Ch3 File

    3/15

    File Organization Chapter_3 (3 / 15)

    ___________________________________________________________________________________

    _________________________________________________________________________________

    Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University

    and one to read and write the bottom surface of the platter above it. At any moment,

    the set of tracks over which the heads are located form a cylinder. This cylinder holds

    the data that is accessible without any movement of the actuator. That is, all data

    within this cylinder is accessible within the disk latency time. Just as it is efficient to

    store related data in a single track or a collection of close tracks, so it is efficient to

    store related data in the same cylinder, or in cylinders that are close to one another.

    Data is transferred between disk and main memory in units called blocks. A

    block is a contiguous sequence of bytes from a single track of one platter. Block sizes

    range from 512 bytes to several thousand bytes.

    File Organization

    A f i leis organized logically as a sequence of records. These recordsare mapped onto disk blocks. Files are provided as a basic construct in operating

    systems. Although blocks are of a fixed size determined by the physical properties of

    the disk and by the operating system, record sizes vary. For example, in a network

    database, it is likely that he owner record type is of a different size than the member

    record type. One approach to mapping the database to files is to use several files and

    store records of only one fixed length in any file. An alternative is to structure the

    files such that we can accommodate multiple lengths for records.

    Let us consider ofdepositrecords for our bank database. Each record of this

    file is defined as follows:

    type deposi t= recordbranch_name : string[20];account_number : integer;customer_name : string[20];balance : real;end;

    Table 1 :

    record 0 Aden 102 Ali 12,000record 1 Arwa 200 Ahmed 50,000

    blocks

  • 7/28/2019 DataB Ch3 File

    4/15

    File Organization Chapter_3 (4 / 15)

    ___________________________________________________________________________________

    _________________________________________________________________________________

    Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University

    record 2 National 115 Abdulla 20,000

    record 3 Watni 333 Salem 1000

    record 4 Taiz 209 Omer 30,000

    record 5 Islamic 191 Aziz 10,000

    record 6 Crater 555 Mohd 10,000

    record 7 Maala 524 Jamil 12,000

    record 8 Sanaa 882 Salah 15,000

    If we assume that each character occupies a byte, an integer occupies 4 bytes,

    and a real 8 bytes, our deposit record is 52 bytes long. A simple approach is to use

    the first 52 bytes for the first record, next 52 bytes for the second record, and so on

    (Table 1). The problem here are :

    It is difficult to delete a record.

    Unless the block size happens to be a multiple of 52, some records will

    cross block boundaries. That is part of the record will be stored in oneblock and part in another. It would thus require two block accesses to

    read or write such a record.

    When a record is deleted, we could move the record that came after it into the

    space formerly occupied by the deleted record, and so on, until every record following

    the deleted record has been moved ahead. (Table 2). This approach will require

    moving a large number of records. The second approach is to move the last record of

    the file into the space of the deleted record.(Table 3).

    Table 2 :record 0 Aden 102 Ali 12,000

    record 1 Arwa 200 Ahmed 50,000

    record 3 Watni 333 Salem 1000

    record 4 Taiz 209 Omer 30,000

    record 5 Islamic 191 Aziz 10,000

    record 6 Crater 555 Mohd 10,000

    record 7 Maala 524 Jamil 12,000

    record 8 Sanaa 882 Salah 15,000

    Table 3 :record 0 Aden 102 Ali 12,000

    record 1 Arwa 200 Ahmed 50,000

    record 8 Sanaa 882 Salah 15,000

    record 3 Watni 333 Salem 1000

    record 4 Taiz 209 Omer 30,000

    record 5 Islamic 191 Aziz 10,000

    record 6 Crater 555 Mohd 10,000

    record 7 Maala 524 Jamil 12,000

    Since the insertion tent to be more frequent than deletions, it is acceptable to

    leave the space of the deleted record open. At the beginning of the file, we allocate af i le header. This header contains a variety of information about the file. It may store

  • 7/28/2019 DataB Ch3 File

    5/15

    File Organization Chapter_3 (5 / 15)

    ___________________________________________________________________________________

    _________________________________________________________________________________

    Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University

    the address of the first record whose contents are deleted. In addition, we use this first

    record to keep the address of the next record whose contents are deleted, and so on.

    We may think of these stored addresses as pointers. (Table 4).

    Table 4:

    record 0 Aden 102 Ali 12,000

    record 2 National 115 Abdulla 20,000

    record 3 Watni 333 Salem 1000

    record 5 Islamic 191 Aziz 10,000

    record 7 Maala 524 Jamil 12,000

    record 8 Sanaa 882 Salah 15,000

    f i le organization

    Just as arrays, trees, and other data structures are used to implement data

    organization in main memory. A number of strategies are used to support theorganization of data in secondary memory. The four fundamental file organization

    techniques are:

    (1) Sequential,(2) Indexed Sequential,(3) Relative,(4) Multi-Key.

    The file organizations differ two basic ways. First the organization determines

    the filesrecordsequencing, which is the physical ordering of the records onstorage. Second, the file organization determines the set of operation necessary

    to fine particular records. The organization most appropriate for a particular

    file is determine by the operational characteristics of the storage medium used

    and the nature of the operations to be performed on the data. The most

    important storage characteristic of the storage device is that whether the device

    allows direct accessto particular record, or allows only sequenti al accessto

    record occurrences. Magnetic disks are examples of direct access storage

    devices (DASDs), and magnetic tapes are examples of sequential storage

    devices.

  • 7/28/2019 DataB Ch3 File

    6/15

    File Organization Chapter_3 (6 / 15)

    ___________________________________________________________________________________

    _________________________________________________________________________________

    Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University

    SEQUENTIAL FILEORGANIZATION

    A sequentialf i leis designed for efficient processing of records in sorted order

    based on some searchkey. The file records are written consecutively when thefile is created and must be accessed consecutively when the file used. The

    records are maintained in the logical sequence of theirprimarykey values.

    The processing of a sequential file is simple but inefficient for random access.

    However, if access to the file is strictly sequential, a sequential file is suitable.

    The file could be stored on a sequential storage device such as a magnetic tape.

    Search for a given record in a sequential file requires, on average, access to

    half the records in the file. Updating usually requires the creation of a new file.

    To maintain file sequence, records are copied to the point where the updating is

    required. The changes are then made and copied into the new file. Followingthis, the remaining records are copied. This method of updating a sequential file

    creates an automatic backup copy. Addition can be handled similar to updating.

    Inversely, deletion of a record requires a compression of the file space, achieved

    by shifting of records.

    The basic advantage offered by a sequential file is the ease of access to the

    next record, the simplicity of organization and the absence of auxiliary data

    structure. However, replies to simple queries are time consuming for large files.

    A single update is an expensive proposition if a new file must be created. To

    reduce the cost per update, all the requests are batched, sorted in the order of the

    sequential file and then executed in one pass. Such a file that contains the

    updates is called a transaction f il e.

    A possible method of reducing the creation of a new file at each update run is to

    create original file with holes (space left for the addition of new records). As such,

    if a block could hold K records, then at initial creation it is made to contain only L*Krecords, where 0

  • 7/28/2019 DataB Ch3 File

    7/15

    File Organization Chapter_3 (7 / 15)

    ___________________________________________________________________________________

    _________________________________________________________________________________

    Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University

    An index is a set ( x , address ) pairs. Indexing associates a set of objects toa set of orderable quantities, which are usually smaller in number or their

    properties provide a mechanism for faster search. The purpose of indexing is to

    speed the search process. A sequential (or sorted on primary keys) file that is

    indexed is called an index sequential file. The index provides for randomaccess to records, while the sequential nature of the file provides easy access to

    the subsequent records. An additional feature of this file system is the

    OVER_FLOWarea, which provides additional space for record addition without

    necessitating the creation of a new file.

    TYPES OF INDEXES

    The idea behind an index access structure is similar to that behind the indexesused in textbooks. An index is usually defined on a single field of a file, called an

    indexing field. The index stores each value of the indexfieldalong with a list ofpointers to all disks blocks that contain a record with that field value. The values

    of the index are ordered so that we can do a binarysearch on the index. There

    are several types of indexes.

    Primary index: Is an index specified on the orderingkey field of an orderedfile of records. The ordering key field is used to physically order the file records

    on disk, and every record has a unique value for that field.

    Clustering index: If the ordering field is not a key field, that is, severalrecords in the file can have the same value for the ordering field.

    Secondary index: Can be specified on any non-ordering field of a file. Thefile can have several secondary indexes in addition to its primary access method.

    a) PRIMARY INDEXES

  • 7/28/2019 DataB Ch3 File

    8/15

    File Organization Chapter_3 (8 / 15)

    ___________________________________________________________________________________

    _________________________________________________________________________________

    Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University

    A primary index is an ordered file whose records are of fixed length with two

    fields. Thefirst fieldis of the same data types as the ordering key field of the data

    file, and thesecond fieldis a pointer to a disk block. The ordering key field is called

    the primary key of the data file.

    The first record in each block of the data file is called the anchor record of the

    block.

    BLOCK BLOCK

    ANCHOR POINTER

    Aaron

    Adams

    Wright

    NAME SSN SALARY

    ADDRESS

    Aaron 100 3000 112 road

    Abbot 101 12000 Keen 10

    Acosta

    Adams

    .

    Akers

    Wright

    .

    .

    Zimmer

    b) CLUSTERING INDEXES

    If records of a file are physically ordered on a non-key field that does not have

    a distinct value for each record, that field is called the clustering field of the file.

    A clustering index is an ordered file with two fields, the first field is of the same type

    as the clustering field of the data file, and the second field is a block pointer.

    CLUSTERING BLOCK

    FIELD POINTER

    1

    2

    8DEPARTMENT NAME SSN SALARY ADDRESS

    1 Aaron 100 3000 112 road

    1 Abbot 101 12000 Keen 10

    1

    1 Acosta

    2

  • 7/28/2019 DataB Ch3 File

    9/15

    File Organization Chapter_3 (9 / 15)

    ___________________________________________________________________________________

    _________________________________________________________________________________

    Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University

    2

    2

    2

    8

    8

    8

    8

    c) SECONDARY INDEXES

    A secondary index is also an ordered file with two fields, the second field is a

    pointer to a disk block.

    1

    2

    3

    45

    6

    9 Aaron 100

    6 Abbot 101

    2

    1

    3

    4

    5

    15

    ..

    The first field is of the same data type as some non-ordering field of the data

    file. The field on which the secondary index is constructed is called an indexing

    field of the file, whether its values are distinct for every record or not.

    A secondary index will usually need substantially more storage space than a primary

    index because of its larger number of entries.

    d) MULTI-LEVELINDEXING SCHEMES

    For a large file, it is possible to create a hierarchy of indexes with the lowest

    level index pointing to the records, while the higher-level index point to the indexes

    below them. The lowest level index consists of the pair of each record

    in the file.

    key key key key

  • 7/28/2019 DataB Ch3 File

    10/15

    File Organization Chapter_3 (10 / 15)

    ___________________________________________________________________________________

    _________________________________________________________________________________

    Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University

    1000 I11 200 I21 60 I31 10 P11

    2000 I12 450 I22 99 I32 29 P12

    780 I23 150 55

    1000 200 60

    999

    1100 210 7688

    2000 450 99

  • 7/28/2019 DataB Ch3 File

    11/15

    File Organization Chapter_3 (11 / 15)

    ___________________________________________________________________________________

    _________________________________________________________________________________

    Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University

    ISAM Technique

    When a record is stored by ISAM (Index Sequential Access Method), itsrecord key must be one of the fields in the record. Each record is stored on one of the

    tracks of a disk. If room does not permit, they are spilled over onto the next track in

    the same cylinder.

    Figure below shows two cylinders of records, but only their keys are shown.

    When ISAM retrieves a record, it needs to know the cylinder, the track address, and

    the record key.

    CYLINDER

    1

    1 50 60 70 80 90

    2 100 110 120 130 1403 150 160

    930 940

    19 950 960 970 980 990

    20 1000 1010 1020 1050 1060

    CYLIND

    ER

    2

    1 1090 1100 1230 1250 1300

    2 1345 1560 1600 1700 1711

    3 1900

    19

    20 2990 3001

    Cylinder 0Cylinder 1

    Cylinder 2Top not used

    Track

    Track

    TrackTrack

  • 7/28/2019 DataB Ch3 File

    12/15

    File Organization Chapter_3 (12 / 15)

    ___________________________________________________________________________________

    _________________________________________________________________________________

    Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University

    DIRECT FILEORGANIZATION

    In the index-sequential file organization, the mapping from the search-key

    value to the storage location is via index entries.In direct file organization, the key value is mapped directly to the storage location.

    The usual method of direct mapping is by performing some arithmetic manipulation

    of the key value. This process is called HASHING.

    Let us consider a hash function hthat maps the key valueKto the value h(K). The

    value h(K) is used as an address.

    It is obvious that a hash function that maps many different key values to a

    single address is a bad hash function. A collision is said to occur when two distinct

    key values are mapped to the same storage location.

    Consider the hash key h(K) = K % 100 which produce a set of indices from 0 to 999.

    Position0 4967000

    1 4967700

    2 8421845

    990 00009978

    999 00018773

    RESOLVING HASH CLASHES

    The simplest method of resolving hash clashes is to place the record in the

    next available position in the array of locations.

    Othermethod is called L inear Probing or Rehashing. In general, with a rehash

    function rh, if the location h(K) is already occupied by a record, a rh(K)is used.

    Another solution is to use a linked list for every position.

    HashAddressKey

  • 7/28/2019 DataB Ch3 File

    13/15

    File Organization Chapter_3 (13 / 15)

    ___________________________________________________________________________________

    _________________________________________________________________________________

    Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University

    MULTI-KEY FILEORGANIZATION

    The organization, which enables a single data file to support multiple access

    paths, each by a different key. Consider the following table;

    Recor

    d

    K# Name Occup at io

    n

    Degre

    e

    Sex Salary

    A 800 Aaa Programmer

    M.Tech

    M 10000

    B 510 Bbb Analyst B.Sc F 15000

    C 950 Ccc Analyst B.Sc F 12000

    D 750 Ddd Programmer

    M.Sc F 12000

    E 620 Eee programmer

    B.Sc M 9000

    K# Upper Value

    700 Record B Record E

    900 Record D Record A

    1100 Record C

    Linking together all records of the same type.

    K# INDEX OCCUPATIONINDEX

    Max K 700 900 1100 Value Analyst Programmer

    Numbers

    2 2 1 Length 2 3

    Pointer B D C Pointer B E

    K# 0 0 0

    Occupation C 0 0

    Sex index D A C 0 0

    Salaryindex

    O 0 C D 0

    B E D A C

    Value Female Male Value

  • 7/28/2019 DataB Ch3 File

    14/15

    File Organization Chapter_3 (14 / 15)

    ___________________________________________________________________________________

    _________________________________________________________________________________

    Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University

    Length

    3 2 Length 1 3 1

    Pointer

    B E Pointer E A B

    SEX INDEX SALARY INDEX

  • 7/28/2019 DataB Ch3 File

    15/15

    File Organization Chapter_3 (15 / 15)

    ___________________________________________________________________________________

    _________________________________________________________________________________

    Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University