DataB Ch3 File

7/28/2019 DataB Ch3 File

File Organization, Chapter 3

Dr. Mohammed Fadle Abdulla, Computer Sci & Engg. Department, Aden University

File Organization for DBMS

Physical Storage Media

Several types of data storage exist in most computer systems. These storage

media are classified by the speed with which data can be accessed, by the cost per unit

of data to buy the memory, and by how reliable they are. Among the media

available are:

Cache: This is the fastest and most costly form of storage. The size of cache memory is very small and the use of cache is managed by the operating system.

Main Memory: This is the storage medium used for data that is available to be operated on. The general-purpose machine instructions

operate on main memory. The main memory is sometimes referred to

as core memory. The contents of the main memory are usually lost

if a power failure or system crash occurs.

Disk Storage: This is the primary medium for the long-term storage of data. Typically the entire database is stored on disk. Data must be

moved from disk to main memory in order for the data to be operated

on. After operations are performed, data must be returned to disk.

Disk storage is referred to as direct-access storage because it is

possible to read data on disk in any order. Disk storage usually

survives power failure and system crashes.

Tape Storage: This is storage used primarily for backup and archival data. Although tape is much cheaper than disk, access to data is much

slower, since the tape must be read sequentially from the beginning.

For this reason, tape storage is referred to as sequential-access

storage and is used primarily for recovery from failures. Tape devices

are less complex than disks; thus, they are more reliable.

The figure below shows a simple disk. The head is a device which stays close to the surface of the platter and reads or writes information encoded magnetically on




the platter. The platter is organized into concentric tracks of data. The arm can be

positioned over any one of the tracks. The platter is spun at a high speed. To read or

write information, the arm is positioned over the correct track; and, when the data to

be accessed passes under the head, the read or write operation is performed.

Since the platter rotates at a high speed, it does not take very long for the contents of an entire track to pass under the head. This time is known as the disk latency time. The time for repositioning the arm, known as seek time, is usually longer than the latency time. It is useful to store related information on the same track or on physically close tracks in order to minimize the seek time.

[Figure: a simple disk, showing the platter, the head, the arm, and tracks 0 to n, with the disk latency time and the seek time marked.]

It is typical to have high-capacity disks with multiple platters. Multiple-platter disks are called disk packs. All the disk arms are moved as a unit by the actuator. Each arm has two heads: one to read and write the top surface of the platter below it, and one to read and write the bottom surface of the platter above it. At any moment, the set of tracks over which the heads are located forms a cylinder. This cylinder holds the data that is accessible without any movement of the actuator. That is, all data within this cylinder is accessible within the disk latency time. Just as it is efficient to store related data in a single track or a collection of close tracks, so it is efficient to store related data in the same cylinder, or in cylinders that are close to one another.

[Figure: a disk pack, showing platters 0 to 3, arms 0 to 2, and the actuator.]

Data is transferred between disk and main memory in units called blocks. A

block is a contiguous sequence of bytes from a single track of one platter. Block sizes

range from 512 bytes to several thousand bytes.

File Organization

A file is organized logically as a sequence of records. These records are mapped onto disk blocks. Files are provided as a basic construct in operating

systems. Although blocks are of a fixed size determined by the physical properties of

the disk and by the operating system, record sizes vary. For example, in a network

database, it is likely that the owner record type is of a different size than the member

record type. One approach to mapping the database to files is to use several files and

store records of only one fixed length in any file. An alternative is to structure the

files such that we can accommodate multiple lengths for records.

Let us consider a file of deposit records for our bank database. Each record of this file is defined as follows:

type deposit = record
    branch_name : string[20];
    account_number : integer;
    customer_name : string[20];
    balance : real;
end;

Table 1:

record 0    Aden       102   Ali       12,000
record 1    Arwa       200   Ahmed     50,000
record 2    National   115   Abdulla   20,000
record 3    Watni      333   Salem     1,000
record 4    Taiz       209   Omer      30,000
record 5    Islamic    191   Aziz      10,000
record 6    Crater     555   Mohd      10,000
record 7    Maala      524   Jamil     12,000
record 8    Sanaa      882   Salah     15,000

If we assume that each character occupies a byte, an integer occupies 4 bytes, and a real 8 bytes, our deposit record is 52 bytes long (20 + 4 + 20 + 8). A simple approach is to use the first 52 bytes for the first record, the next 52 bytes for the second record, and so on (Table 1). This approach has two problems:

(1) It is difficult to delete a record.

(2) Unless the block size happens to be a multiple of 52, some records will cross block boundaries. That is, part of the record will be stored in one block and part in another. It would thus require two block accesses to read or write such a record.
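The record size and the block-boundary problem can be checked with a short sketch. It assumes Python's `struct` module with an unpadded layout matching the field sizes stated above; the 512-byte block size and the helper names are illustrative assumptions, not part of the notes.

```python
import struct

# Layout assumed from the text: a 20-byte string, a 4-byte integer,
# a 20-byte string, and an 8-byte real. "<" disables padding so the
# field sizes add up exactly.
DEPOSIT = struct.Struct("<20si20sd")
BLOCK_SIZE = 512  # assumed block size, not taken from the notes


def pack_deposit(branch, account, customer, balance):
    """Encode one deposit record as a fixed-length byte string."""
    return DEPOSIT.pack(branch.encode("ascii"), account,
                        customer.encode("ascii"), balance)


def crosses_block_boundary(record_no, record_size=DEPOSIT.size,
                           block_size=BLOCK_SIZE):
    """True if record number `record_no` is split across two blocks."""
    first_byte = record_no * record_size
    last_byte = first_byte + record_size - 1
    return first_byte // block_size != last_byte // block_size


record = pack_deposit("Aden", 102, "Ali", 12000.0)
# Record numbers among the first 20 that straddle a block boundary.
split = [n for n in range(20) if crosses_block_boundary(n)]
```

With these assumptions, records 9 and 19 straddle a boundary, because 512 is not a multiple of 52.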

When a record is deleted, we could move the record that came after it into the space formerly occupied by the deleted record, and so on, until every record following the deleted record has been moved ahead (Table 2). This approach requires moving a large number of records. A second approach is to move the last record of the file into the space of the deleted record (Table 3).

Table 2:

record 0    Aden       102   Ali       12,000
record 1    Arwa       200   Ahmed     50,000







Table 3:

record 0    Aden       102   Ali       12,000
record 1    Arwa       200   Ahmed     50,000
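The two deletion strategies can be compared with a minimal sketch. It models the file as a Python list of records; the function names are illustrative assumptions.

```python
def delete_by_shifting(records, i):
    """Delete records[i] by moving every later record up one slot.

    Preserves record order but moves len(records) - i - 1 records.
    """
    for j in range(i, len(records) - 1):
        records[j] = records[j + 1]
    records.pop()  # the last slot is now a duplicate (or the target); drop it


def delete_by_moving_last(records, i):
    """Delete records[i] by overwriting it with the last record.

    Moves at most one record but does not preserve record order.
    """
    last = records.pop()
    if i < len(records):
        records[i] = last
```

Deleting record 2 from a nine-record file leaves the order 0, 1, 3, 4, 5, 6, 7, 8 with the first strategy and 0, 1, 8, 3, 4, 5, 6, 7 with the second.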







Since insertions tend to be more frequent than deletions, it is acceptable to leave the space of the deleted record open and to reuse it for a later insertion. At the beginning of the file, we allocate a file header. This header contains a variety of information about the file. It may store the address of the first record whose contents are deleted. In addition, we use this first record to keep the address of the next record whose contents are deleted, and so on. We may think of these stored addresses as pointers (Table 4).

Table 4:

record 0    Aden       102   Ali       12,000
record 2    National   115   Abdulla   20,000
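The chain of deleted records described above behaves like a linked free list. The following is a minimal in-memory sketch; the class name, the `("FREE", next)` marker, and the use of list positions as addresses are assumptions made for illustration.

```python
class FixedLengthFile:
    """In-memory model of a file of fixed-length record slots."""

    def __init__(self):
        self.free_head = None   # file header: first deleted slot, if any
        self.slots = []         # each slot holds a record or a free-list link

    def delete(self, slot_no):
        # The freed slot stores the address of the next free slot,
        # and the header now points at the freed slot.
        self.slots[slot_no] = ("FREE", self.free_head)
        self.free_head = slot_no

    def insert(self, record):
        if self.free_head is None:          # no holes: append at the end
            self.slots.append(record)
            return len(self.slots) - 1
        slot_no = self.free_head            # reuse the first free slot
        _, next_free = self.slots[slot_no]
        self.free_head = next_free
        self.slots[slot_no] = record
        return slot_no
```

Insertions reuse the most recently freed slot first, so no records are moved on deletion and the file grows only when the free list is empty.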





File Organization

Just as arrays, trees, and other data structures are used to implement data organization in main memory, a number of strategies are used to support the organization of data in secondary memory. The four fundamental file organization techniques are:

(1) Sequential, (2) Indexed Sequential, (3) Relative, (4) Multi-Key.

The file organizations differ in two basic ways. First, the organization determines the file's record sequencing, which is the physical ordering of the records on storage. Second, the file organization determines the set of operations necessary to find particular records. The organization most appropriate for a particular file is determined by the operational characteristics of the storage medium used and the nature of the operations to be performed on the data. The most important characteristic of a storage device is whether it allows direct access to a particular record or allows only sequential access to record occurrences. Magnetic disks are examples of direct access storage devices (DASDs), and magnetic tapes are examples of sequential storage devices.




SEQUENTIAL FILE ORGANIZATION

A sequential file is designed for efficient processing of records in sorted order based on some search key. The file records are written consecutively when the file is created and must be accessed consecutively when the file is used. The records are maintained in the logical sequence of their primary key values. The processing of a sequential file is simple but inefficient for random access. However, if access to the file is strictly sequential, a sequential file is suitable. The file could be stored on a sequential storage device such as a magnetic tape.

Searching for a given record in a sequential file requires, on average, access to half the records in the file. Updating usually requires the creation of a new file. To maintain the file sequence, records are copied up to the point where the update is required. The changes are then made and copied into the new file. Following this, the remaining records are copied. This method of updating a sequential file creates an automatic backup copy. Additions can be handled in a similar way to updates. Conversely, deletion of a record requires a compression of the file space, achieved by shifting records.

The basic advantages offered by a sequential file are the ease of access to the next record, the simplicity of organization, and the absence of auxiliary data structures. However, replies to simple queries are time consuming for large files. A single update is an expensive proposition if a new file must be created. To reduce the cost per update, all the update requests are batched, sorted in the order of the sequential file, and then executed in one pass. The file that contains the updates is called a transaction file.
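A batched update run can be sketched as a single merge pass over the old master file and the sorted transaction file. The record and transaction tuple formats, the operation names, and the rule of at most one transaction per key are assumptions made for illustration.

```python
def apply_batch(master, transactions):
    """Merge a sorted master file with a sorted transaction file.

    master: list of (key, data) sorted by key.
    transactions: list of (key, op, data) sorted by key, at most one per
    key, where op is "insert", "update", or "delete".
    Returns the new master file; the old one is left untouched.
    """
    new_master = []
    i = 0
    for key, op, data in transactions:
        # Copy records up to the point where the change applies.
        while i < len(master) and master[i][0] < key:
            new_master.append(master[i])
            i += 1
        found = i < len(master) and master[i][0] == key
        if op == "insert" and not found:
            new_master.append((key, data))
        elif op == "update" and found:
            new_master.append((key, data))
            i += 1
        elif op == "delete" and found:
            i += 1                      # skip the record: it is not copied
        else:
            raise ValueError(f"cannot {op} key {key}")
    new_master.extend(master[i:])       # copy the remaining records
    return new_master
```

Because the old master file is only read, it remains available as the backup copy mentioned in the text.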

A possible method of reducing the need to create a new file at each update run is to create the original file with holes (space left for the addition of new records). As such, if a block could hold K records, then at initial creation it is made to contain only L*K records, where 0




An index is a set of (x, address) pairs. Indexing associates a set of objects with a set of orderable quantities, which are usually smaller in number or whose properties provide a mechanism for faster search. The purpose of indexing is to speed up the search process. A sequential file (one sorted on primary keys) that is indexed is called an index sequential file. The index provides random access to records, while the sequential nature of the file provides easy access to the subsequent records. An additional feature of this file system is the OVER_FLOW area, which provides additional space for record addition without necessitating the creation of a new file.

TYPES OF INDEXES

The idea behind an index access structure is similar to that behind the indexes used in textbooks. An index is usually defined on a single field of a file, called an indexing field. The index stores each value of the index field along with a list of pointers to all disk blocks that contain a record with that field value. The values in the index are ordered so that we can do a binary search on the index. There are several types of indexes.

Primary index: An index specified on the ordering key field of an ordered file of records. The ordering key field is used to physically order the file records on disk, and every record has a unique value for that field.

Clustering index: An index used when the ordering field is not a key field, that is, when several records in the file can have the same value for the ordering field.

Secondary index: An index that can be specified on any non-ordering field of a file. The file can have several secondary indexes in addition to its primary access method.

a) PRIMARY INDEXES




A primary index is an ordered file whose records are of fixed length with two fields. The first field is of the same data type as the ordering key field of the data file, and the second field is a pointer to a disk block. The ordering key field is called the primary key of the data file.

The first record in each block of the data file is called the anchor record of the

block.

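[Figure: primary index on the ordering key field NAME of the data file.]

Index file (BLOCK ANCHOR, BLOCK POINTER), one entry per data block:

Aaron
Adams
Wright

Data file:

NAME     SSN   SALARY   ADDRESS
Aaron    100   3000     112 road
Abbot    101   12000    Keen 10
Acosta
Adams
...
Akers
Wright
...
Zimmer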
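A lookup through a primary index can be sketched as a binary search over the anchor keys followed by a single block read. The grouping of names into blocks and the function names below are assumptions made for illustration; only the anchor names come from the figure.

```python
from bisect import bisect_right


def build_primary_index(blocks):
    """One (anchor key, block number) entry per data block."""
    return [(block[0], n) for n, block in enumerate(blocks)]


def lookup(index, blocks, key):
    """Find `key` using a binary search on the index, then one block read."""
    anchors = [anchor for anchor, _ in index]
    pos = bisect_right(anchors, key) - 1   # last anchor <= key
    if pos < 0:
        return None                        # key sorts before the first anchor
    block_no = index[pos][1]
    return block_no if key in blocks[block_no] else None


# Hypothetical block contents, ordered on the key field.
blocks = [["Aaron", "Abbot", "Acosta"], ["Adams", "Akers"], ["Wright", "Zimmer"]]
index = build_primary_index(blocks)
```

The index holds one entry per block rather than one per record, which is why a primary index stays much smaller than the data file.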

b) CLUSTERING INDEXES

If records of a file are physically ordered on a non-key field that does not have

a distinct value for each record, that field is called the clustering field of the file.

A clustering index is an ordered file with two fields: the first field is of the same type

as the clustering field of the data file, and the second field is a block pointer.

[Figure: clustering index on the non-key ordering field DEPARTMENT of the data file.]

Index file (CLUSTERING FIELD, BLOCK POINTER), one entry per distinct value:

1
2
8

Data file:

DEPARTMENT   NAME     SSN   SALARY   ADDRESS
1            Aaron    100   3000     112 road
1            Abbot    101   12000    Keen 10
1
1            Acosta
2
2
2
2
8
8
8
8
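Building a clustering index amounts to recording, for each distinct value of the clustering field, the first block in which that value appears. This is a minimal sketch; the block contents beyond the department values 1, 2, and 8 are hypothetical.

```python
def build_clustering_index(blocks):
    """Map each distinct clustering value to the first block containing it.

    blocks: list of blocks; each block is a list of records whose first
    component is the clustering field. The file is ordered on that field.
    """
    index = []
    for block_no, block in enumerate(blocks):
        for record in block:
            value = record[0]
            if not index or index[-1][0] != value:
                index.append((value, block_no))
    return index


# Hypothetical data file, three records per block, ordered by department.
blocks = [
    [(1, "Aaron"), (1, "Abbot"), (1, "Acosta")],
    [(1, "Adams"), (2, "Baker"), (2, "Clark")],
    [(2, "Davis"), (8, "Evans"), (8, "Ford")],
]
clustering_index = build_clustering_index(blocks)
```

A search for department 2 starts at block 1 and continues into following blocks for as long as the value still matches.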

c) SECONDARY INDEXES

A secondary index is also an ordered file with two fields. The first field is of the same data type as some non-ordering field of the data file, and the second field is a pointer to a disk block. The field on which the secondary index is constructed is called an indexing field of the file, whether its values are distinct for every record or not. A secondary index will usually need substantially more storage space than a primary index because of its larger number of entries.

[Figure: secondary index on a non-ordering field of the data file. The index holds the field values in sorted order (1, 2, 3, 4, 5, 6, ...); the data file holds them in stored order (9, 6, 2, 1, 3, 4, 5, 15, ...), the first two records being 9 Aaron 100 and 6 Abbot 101.]
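The larger size of a secondary index follows from its needing one entry per record, since the data file is not ordered on the indexing field. The sketch below assumes records stored as dictionaries and an illustrative field name `no`; neither comes from the notes.

```python
from bisect import bisect_left


def build_secondary_index(blocks, field):
    """Dense index: one sorted (value, block number) entry per record."""
    entries = [(record[field], block_no)
               for block_no, block in enumerate(blocks)
               for record in block]
    entries.sort()
    return entries


def blocks_with_value(index, value):
    """Binary-search the index and return every block holding `value`."""
    pos = bisect_left(index, (value,))   # first entry whose value >= `value`
    found = []
    while pos < len(index) and index[pos][0] == value:
        found.append(index[pos][1])
        pos += 1
    return found


# Hypothetical data file: ordered by name, indexed on the unordered field "no".
blocks = [
    [{"name": "Aaron", "no": 9}, {"name": "Abbot", "no": 6}],
    [{"name": "Acosta", "no": 2}, {"name": "Adams", "no": 9}],
]
secondary = build_secondary_index(blocks, "no")
```

Here four records produce four index entries, whereas a primary index over the same two blocks would need only two.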

d) MULTI-LEVEL INDEXING SCHEMES

For a large file, it is possible to create a hierarchy of indexes, with the lowest-level index pointing to the records, while the higher-level indexes point to the indexes below them. The lowest-level index consists of the <key, address> pair of each record in the file.

[Figure: multi-level index with four levels of (key, pointer) entries, from the highest level to the lowest.]

Level 1:   1000 I11,   2000 I12
Level 2:   200 I21,   450 I22,   780 I23
Level 3:   60 I31,   99 I32,   150
Level 4:   10 P11,   29 P12,   55
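A search through a hierarchy of indexes can be sketched as one small binary search per level, starting at the top. The sketch assumes that each higher level keeps the first key of every group of `fanout` entries in the level below; that arrangement, the function names, and the way the key values are grouped are assumptions, not the layout of the figure.

```python
from bisect import bisect_right


def build_levels(keys, fanout):
    """Build index levels bottom-up over a sorted list of record keys.

    levels[0] is the lowest level (one entry per record); each higher
    level keeps the first key of every group of `fanout` entries below.
    """
    levels = [list(keys)]
    while len(levels[-1]) > fanout:
        levels.append(levels[-1][::fanout])
    return levels


def search(levels, fanout, key):
    """Return the position of `key` in the lowest level, or None."""
    start = 0
    for depth in range(len(levels) - 1, -1, -1):   # top level first
        node = levels[depth][start:start + fanout]
        pos = bisect_right(node, key) - 1          # last entry <= key
        if pos < 0:
            return None
        # Descend to the group below, or stop at the entry on level 0.
        start = (start + pos) * (fanout if depth > 0 else 1)
    return start if levels[0][start] == key else None


keys = [10, 29, 55, 60, 99, 150, 200, 450, 780, 1000, 2000]
levels = build_levels(keys, 3)
```

Each level examined costs one small search, so the number of index accesses grows with the number of levels rather than with the number of records.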




ISAM Technique

When a record is stored by ISAM (Index Sequential Access Method), its record key must be one of the fields in the record. Each record is stored on one of the tracks of a disk. If a track has no room, records spill over onto the next track in the same cylinder.

The figure below shows two cylinders of records, but only their keys are shown. When ISAM retrieves a record, it needs to know the cylinder, the track address, and the record key.

[Figure: a disk pack with cylinders 0, 1, and 2; the top surface is not used. Keys stored on each track:]

CYLINDER 1

Track   Keys
1       50 60 70 80 90
2       100 110 120 130 140
3       150 160
...     930 940
19      950 960 970 980 990
20      1000 1010 1020 1050 1060

CYLINDER 2

Track   Keys
1       1090 1100 1230 1250 1300
2       1345 1560 1600 1700 1711
3       1900
...
19
20      2990 3001
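Locating a record by cylinder, track, and key can be sketched as follows. The notes do not spell out the index layout, so the sketch assumes the classic ISAM arrangement: a cylinder index holding the highest key in each cylinder, and a track index per cylinder holding the highest key on each track. Only a subset of the tracks in the figure is used, and all names are illustrative.

```python
from bisect import bisect_left

# Keys per track, keyed by (cylinder, track); a subset of the figure.
TRACKS = {
    (1, 1): [50, 60, 70, 80, 90],
    (1, 2): [100, 110, 120, 130, 140],
    (1, 20): [1000, 1010, 1020, 1050, 1060],
    (2, 1): [1090, 1100, 1230, 1250, 1300],
    (2, 2): [1345, 1560, 1600, 1700, 1711],
}


def build_indexes(tracks):
    """Return (cylinder index, track indexes) holding highest keys."""
    track_index = {}
    for (cyl, trk), keys in sorted(tracks.items()):
        track_index.setdefault(cyl, []).append((keys[-1], trk))
    cylinder_index = [(entries[-1][0], cyl)
                      for cyl, entries in sorted(track_index.items())]
    return cylinder_index, track_index


def locate(key, tracks, cylinder_index, track_index):
    """Return (cylinder, track) of the record with `key`, or None."""
    c = bisect_left([high for high, _ in cylinder_index], key)
    if c == len(cylinder_index):
        return None                       # key is above every cylinder
    cyl = cylinder_index[c][1]
    entries = track_index[cyl]
    t = bisect_left([high for high, _ in entries], key)
    trk = entries[t][1]
    return (cyl, trk) if key in tracks[(cyl, trk)] else None


CYL_INDEX, TRK_INDEX = build_indexes(TRACKS)
```

Only the final step reads a data track; the cylinder and track are chosen from the two small indexes.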




DIRECT FILE ORGANIZATION

In the index-sequential file organization, the mapping from the search-key value to the storage location is via index entries. In direct file organization, the key value is mapped directly to the storage location. The usual method of direct mapping is by performing some arithmetic manipulation of the key value. This process is called HASHING.

Let us consider a hash function h that maps the key value K to the value h(K). The value h(K) is used as an address.

It is obvious that a hash function that maps many different key values to a

single address is a bad hash function. A collision is said to occur when two distinct

key values are mapped to the same storage location.

Consider the hash function h(K) = K % 1000, which produces a set of addresses from 0 to 999.

Position   Key
0          4967000
1          4967700
2          8421845
...
990        00009978
999        00018773
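Division hashing and the detection of collisions can be sketched in a few lines. The sketch assumes a table of 1000 positions, so that addresses run from 0 to 999; the sample keys used to exercise it are hypothetical.

```python
TABLE_SIZE = 1000  # assumed number of storage positions


def h(key):
    """Division hashing: the address is the remainder of the key."""
    return key % TABLE_SIZE


def find_collisions(keys):
    """Group distinct keys by address and keep addresses hit more than once."""
    by_address = {}
    for key in set(keys):
        by_address.setdefault(h(key), []).append(key)
    return {addr: sorted(ks) for addr, ks in by_address.items() if len(ks) > 1}
```

Any two keys that share their last three decimal digits collide under this function, which is why the choice of hash function should take the distribution of key values into account.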

RESOLVING HASH CLASHES

The simplest method of resolving hash clashes is to place the record in the next available position in the array of locations. This method is called linear probing, and it is one form of rehashing. In general, with a rehash function rh, if the location h(K) is already occupied by another record, rh is applied to find a further location to try.

Another solution is to use a linked list for every position.
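Linear probing can be sketched as follows: on a clash, the next position is tried, wrapping around at the end of the array. The class name, the use of `None` for empty slots, and the absence of deletion are simplifying assumptions; the linked-list alternative is not shown.

```python
class LinearProbingTable:
    """Open-addressing hash table that resolves clashes by linear probing."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size

    def _probe(self, key):
        """Yield h(K), then each following position, wrapping around."""
        start = key % self.size
        for step in range(self.size):
            yield (start + step) % self.size

    def insert(self, key):
        for pos in self._probe(key):
            if self.slots[pos] is None or self.slots[pos] == key:
                self.slots[pos] = key
                return pos
        raise OverflowError("hash table is full")

    def find(self, key):
        for pos in self._probe(key):
            if self.slots[pos] is None:
                return None          # an empty slot ends the search
            if self.slots[pos] == key:
                return pos
        return None
```

With a table of size 10, the keys 12, 22, and 32 all hash to position 2 and end up in positions 2, 3, and 4.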



MULTI-KEY FILE ORGANIZATION

A multi-key file organization enables a single data file to support multiple access paths, each by a different key. Consider the following table:

Record   K#    Name   Occupation   Degree   Sex   Salary
A        800   Aaa    Programmer   M.Tech   M     10000
B        510   Bbb    Analyst      B.Sc     F     15000
C        950   Ccc    Analyst      B.Sc     F     12000
D        750   Ddd    Programmer   M.Sc     F     12000
E        620   Eee    Programmer   B.Sc     M     9000

K# Upper Value   Records
700              Record B, Record E
900              Record D, Record A
1100             Record C

Linking together all records of the same type.

K# INDEX
Max K#     700   900   1100
Numbers    2     2     1
Pointer    B     D     C

OCCUPATION INDEX
Value      Analyst   Programmer
Length     2         3
Pointer    B         E

SEX INDEX
Value      Female   Male
Length     3        2
Pointer    B        E

SALARY INDEX
Value
Length     1   3   1
Pointer    E   A   B

Link fields stored with each record (0 marks the end of a list, ? marks an entry that is not recoverable):

Record             B   E   D   A   C
K# link            E   0   A   0   0
Occupation link    C   ?   ?   ?   0
Sex link           D   A   C   0   0
Salary link        0   0   C   D   0
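Multiple access paths over the same records can be sketched with one index per field. Note that this sketch uses inverted lists (each index value maps directly to the list of matching records) rather than the linked pointers stored inside the records as in the figure; the function names are illustrative, and the occupation of record E is normalized to "Programmer".

```python
RECORDS = {
    "A": {"K#": 800, "Occupation": "Programmer", "Sex": "M", "Salary": 10000},
    "B": {"K#": 510, "Occupation": "Analyst", "Sex": "F", "Salary": 15000},
    "C": {"K#": 950, "Occupation": "Analyst", "Sex": "F", "Salary": 12000},
    "D": {"K#": 750, "Occupation": "Programmer", "Sex": "F", "Salary": 12000},
    "E": {"K#": 620, "Occupation": "Programmer", "Sex": "M", "Salary": 9000},
}


def build_inverted_index(records, field):
    """Map each value of `field` to the sorted list of records holding it."""
    index = {}
    for rec_id, rec in records.items():
        index.setdefault(rec[field], []).append(rec_id)
    return {value: sorted(ids) for value, ids in index.items()}


def query(records, **conditions):
    """Records that satisfy every field=value condition, via the indexes."""
    result = set(records)
    for field, value in conditions.items():
        index = build_inverted_index(records, field)
        result &= set(index.get(value, []))
    return sorted(result)
```

For example, `query(RECORDS, Occupation="Programmer", Sex="F")` intersects the two lists and returns only record D. The list lengths match the Length rows of the occupation and sex indexes above.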

