31
Introduction to File Structures Objectives At the end of the lesson, the student should be able to: Explain the concepts behind physical file organization Differentiate logical and physical files Determine the advantages of using primary over secondary storage devices Identify the characteristics, capacities and access costs of different storage devices Explain the concepts behind file organization and file access Identify the different indexing mechanisms Use file compression techniques to improve space utilization and file access performance CS 165 Database Systems 1 Lesson 1: Introduction to File Structures

Slides 1 Introduction to File Structures

Embed Size (px)

Citation preview

Page 1: Slides 1 Introduction to File Structures

Introduction to File Structures

Objectives

At the end of the lesson, the student should be able to:

• Explain the concepts behind physical file organization

• Differentiate logical and physical files

• Determine the advantages of using primary over secondary storage devices

• Identify the characteristics, capacities and access costs of different storage devices

• Explain the concepts behind file organization and file access

• Identify the different indexing mechanisms

• Use file compression techniques to improve space utilization and file access performance

CS 165 Database Systems 1 Lesson 1: Introduction to File Structures

Page 2: Slides 1 Introduction to File Structures

Issues in Physical File Organization

• Slow disk access• Volative primary storage• Virtually infinite secondary storage• Cheaper secondary storage• Dynamic nature of data• Need for fast searching and retrieval

Ideally, we would like to get information with one access to the disk. If it is not possible, even with just as few accesses as possible. We also want to have file structures with information groups, or what we call records, to get a bulk of information in just one access to the disk.

Solutions Made

• Sequential access• Indexed access• Binary trees• Tree-structured file organization: B-tree, B*-tree and B+-tree• Direct access design in the form of hashing

CS 165 Database Systems 2 Lesson 1: Introduction to File Structures

Page 3: Slides 1 Introduction to File Structures

Logical vs. Physical Files

physical file ⇒ a file in a disk, which means it exists physicallylogical file ⇒ a file as viewed by the user

Example: assign(input_file,’input.txt’);

input_file ⇒ logical, input.txt ⇒ physical

Secondary Storage Devices

⇒ hold files that are not currently being used

⇒ used for mid-term to long-term storage

Types of Storage Devices

• Direct Access Storage Devices (DASDs) ⇒ also known as random access devices⇒ magnetic disks such as hard disks, floppy disks and optical disks

• Serial Devices ⇒ use media such as magnetic tapes that permit only serial access.

CS 165 Database Systems 3 Lesson 1: Introduction to File Structures

Page 4: Slides 1 Introduction to File Structures

Disks

Use magnetism to store the data on a magnetic surfaceAdvantages: high storage capacity, reliable, gives direct access to data A drive spins the disk very quickly underneath a read/write head

Organization of Disks

All magnetic disks are similarly formatted, or divided into areas, called tracks, sectors and cylinders.

• Platters • Tracks • Sectors• Disk pack •

• Cylinder

Seeking ⇒ Moving the r/w head to find the right location on the disk

⇒ Usually the slowest part of reading information

CS 165 Database Systems 4 Lesson 1: Introduction to File Structures

Page 5: Slides 1 Introduction to File Structures

Disk Capacity

• Amount of data on a track depends upon how densely bits can be stored Low density High density

• To compute track, cylinder and drive capacity:

CT = Track Capacity = #sectors/track x #bytes/sectorCC = Cylinder Capacity = #tracks/cylinder x CT

CD = Drive Capacity = #cylinders/drive x CC

Example: What is the capacity of the drive with the following characteristics?

#bytes/sector = 512#sectors/track = 40#tracks/cylinder = 11#cylinders = 1331

CS 165 Database Systems 5 Lesson 1: Introduction to File Structures

Page 6: Slides 1 Introduction to File Structures

Cost of a Disk Access (Latency)

Factors affecting disk access time: seek time [ts], rotational delay [tr], block transfer time [tbt]

Step Measured as:(in ms)

1. seek move the head to proper track

seek time

2. rotaterotate disk under the head to the correct sector

rotational delay

3. settlehead lowers to disk;wait for vibrations from moving to stop (actually touches only on floppies)

settling time

4. data transfercopy data to main memory

data transfer time

CS 165 Database Systems 6 Lesson 1: Introduction to File Structures

Page 7: Slides 1 Introduction to File Structures

Seek Time

• Time it takes to move the arm to the correct cylinder

• Depends on the physical characteristics of the disk drive as well as on the #cylinders.

• Assumption: all cylinders are equally likely to be sources or destinations.

minimum seek time ⇒ time it takes to move the access arm from a track to its adjacent track.

maximum seek time ⇒ time it takes to move the access arm from the outermost to the innermost track.

average seek time ⇒ the average of the minimum and maximum seek times.

Rotational Delay

• Time it takes for the head to reach the right place on the track.

• Assumption: Correct block can be recognized by some flag or block identifier at the beginning.

rotational time (tr ) ⇒ time needed for a disk pack to complete one whole revolution.

CS 165 Database Systems 7 Lesson 1: Introduction to File Structures

Page 8: Slides 1 Introduction to File Structures

• Average latency time: [tr/2]= time it takes the disk drive to make ½ revolution.

tr/2 = (1/2) (60 x 1000) / rpm

where rpm is revolutions per minute

or if tr is given:

tr/2 = tr / 2

• Hard disks usually rotates at about 7,200 rpm ~ 1 rev. per 8.33 ms so the average rotational delay is 4.167 msec.

• For floppy disk: 360 rpm, tr/2 = 83.3 ms.

Block Transfer Time

• Time it takes for the head to pass over a block

• The bytes of data can be transferred from disk to RAM and vice versa.

• Depends on rotational speed, recording density (#bytes/track), and #bytes in a block.

CS 165 Database Systems 8 Lesson 1: Introduction to File Structures

Page 9: Slides 1 Introduction to File Structures

Let tbt = block transfer time rbt = block transfer rate B = #bytes transferred

tbt = B / rbt

Example: Assume a block size of 1,000 bytes and that the blocks are stored randomly. The disk drive has the following characteristics:

ts/2 = 30 mstr = 16.67 mstr/2 = 8.3 msrbt = 806 KB/sec = 825,344 bytes/sec

• What is the average access time per block?

Average access time per block= seek + rotational latency + transfer= ts/2 + tr/2 + tbt

=30 ms/block + 8.3 ms/block + 1,000bytes/block 825,344bytes/sec

= (30 + 8.3 + 1.21) ms /block= 39.51 ms/block

CS 165 Database Systems 9 Lesson 1: Introduction to File Structures

Page 10: Slides 1 Introduction to File Structures

Magnetic Tape

• Sequential access

• Logical position of a byte within a file corresponds directly to its physical location relative to the start of the file.

• Good medium for archival storage, for transportation of data

• Tape drives come in many shapes, sizes and speeds. Performance of each drive can be measured in terms of the following quantities:Tape (or recording) density ⇒ number of characters or bytes of data that can be stored per inch (bpi)

• Tape Speed ⇒ commonly 30 to 200 inches per second (ips)

Disk vs. Tape

DISK TAPERandom Access Sequential AccessUse to store files in shorter terms Long-term storage of filesGenerally serves many processes Dedicated to one process Expensive Less expensiveUsed as main secondary storage Considered to be a tertiary storage

CS 165 Database Systems 10 Lesson 1: Introduction to File Structures

Page 11: Slides 1 Introduction to File Structures

RAID: Improving File Access Performance by Parallel Processing

Redundant Array of Inexpensive Disks

A set of physical disk drives that appear to the database users and programs as if they form one large logical storage unit.

Parallel database processing speeds up writes and reads. This can be accomplished through using a RAID. RAID does not change the logical or physical structure of application programs or database queries.

CS 165 Database Systems 11 Lesson 1: Introduction to File Structures

Page 12: Slides 1 Introduction to File Structures

RAID Levels

CS 165 Database Systems 12 Lesson 1: Introduction to File Structures

Page 13: Slides 1 Introduction to File Structures

CS 165 Database Systems 13 Lesson 1: Introduction to File Structures

Page 14: Slides 1 Introduction to File Structures

Fundamental File Structure Concepts

Data is stored in a physical storage for later retrieval. In order to organize the data for easy retrieval, physical file organization and access mechanism have to be determined.

Field and Record Organization⇒ Refers to the physical structure of records to store⇒ Fixed or variable in length

A Stream File

PROGRAM: writstrmget output file name and open it with the logical name OUTPUTget LAST name as inputwhile (LAST name has a length > 0)

get FIRST name, ADDRESS, CITY, STATE, and ZIP as inputwrite LAST to the file OUTPUTwrite FIRST to the file OUTPUTwrite ADDRESS to the file OUTPUTwrite CITY to the file OUTPUTwrite STATE to the file OUTPUTwrite ZIP to the file OUTPUTget LAST name as input

endwhileclose OUTPUTend PROGRAM

CS 165 Database Systems 14 Lesson 1: Introduction to File Structures

⇒ Writes a stream of bytes containing no added information: hard to get the data back since there is no integrity of the fundamental organizational units of the input data.

Page 15: Slides 1 Introduction to File Structures

Field Structures

FieldA conceptual tool Smallest logically meaningful unit of information in a fileA subdivision of a record containing a single attribute of the entity the record

describes

Methods of Field Organization

1. Fixed-length fields

• Disadvantage: wasted space, insufficient space

2.Length indicator at the beginning of each field

3.Delimiters to separate fields

4.“keyword = value” expression to identify fields

• used in combination with delimiters• advantage: self-descriptive• disadvantage: wasted space

CS 165 Database Systems 15 Lesson 1: Introduction to File Structures

Page 16: Slides 1 Introduction to File Structures

Record Structures

Records

Also a conceptual toolSet of fields that belong together when the file is viewed in terms of a higher level of

organization

Methods of Record Organization

1. Fixed-length records

• Does not imply that the sizes or number of fields in the record must be fixed

2. Fixed number of fields

3. Length indicator at the start of each record

4. Use of index to keep track of addresses

• Index in another file or at the file's header.

5. Delimiter at the end of each record

CS 165 Database Systems 16 Lesson 1: Introduction to File Structures

Page 17: Slides 1 Introduction to File Structures

File Organization

The following are the main types of file organization:• Sequential • Relative

Sequential File Organization

⇒ Oldest type of file organization since during the 1950s and 1960s, the foundation of many information systems was sequential processing

Sequential Files ⇒Files that are read from beginning to end

Physical Characteristics

• Physical order of the records is the same as the logical order• Logical representation: (record 1) ⇒ (record 2) ⇒ (record 3) ⇒ … ⇒ (record n)

Two Types of Sequential Files

• Unordered pile files• Sorted sequential files

CS 165 Database Systems 17 Lesson 1: Introduction to File Structures

Page 18: Slides 1 Introduction to File Structures

Relative File Organization

• There exists a predictable relationship between the key used to identify a record and that record’s absolute address on an external file

• Allows access to a record directly given only the key, regardless of the position of the record in the file

• Characterized as providing random access because the logical organization of the file need not correspond to its physical organization

• Simplest relative file organization: key value corresponds directly to the physical location of the record in the file

Useful for dense keys, i.e., values of consecutive keys differ only by one.

If the key collection is not dense, it might result to wasted space. This can be solved by mapping the large range of non-dense key values into smaller range of record positions in the file. ⇒ The key is no longer the address of the record in the file. Hashing or indexing may be used.

CS 165 Database Systems 18 Lesson 1: Introduction to File Structures

Page 19: Slides 1 Introduction to File Structures

File Access

⇒ Refers to the manner data is retrieved

File Access Methods

Sequential Access • O(n) – expensive if done directly to the disk

Relative Access

• Addresses of records can be obtained directly from a key

• Uses an indexing technique that allows a user to locate a record in a file with as few accesses as possible, ideally with just one access

• The indexing scheme could be one of the following: • Direct addressing• Binary search tree• B-trees• Multiple-key indexing• Hashing

If access is sequential, file organization used will not have significant effect on the access cost. However, relative access entails fixing the size of records or using an index.

CS 165 Database Systems 19 Lesson 1: Introduction to File Structures

Page 20: Slides 1 Introduction to File Structures

Indexing Mechanisms

Index

Consists of keys and reference fieldsWorks by indirectionUsed to provide multiple access path to a file Gives keyed access to variable-length record files

Key

Identifies a record based on the record’s contents not on the sequence of the records

Primary keys

Keys that uniquely identify a single record

Primary keys should be unchanging

Secondary keys

Used to overcome the shortcomings of the primary key

Used to access records according to data content

Types of Indexes

• Primary Index on Unordered Files • Primary Index on Ordered Files• Clustering Index on Ordered Files• Secondary Index

CS 165 Database Systems 20 Lesson 1: Introduction to File Structures

Page 21: Slides 1 Introduction to File Structures

Primary Index on Unordered Files

• The index is a simple arrays of structures that contain the keys and reference fields• Allows binary search• The index provides the order to unordered physical records• Physical files are entry-sequenced

CS 165 Database Systems 21 Lesson 1: Introduction to File Structures

Page 22: Slides 1 Introduction to File Structures

Primary Index on Ordered Files

• Physical records are ordered based on the primary key• The index is ordered but only one index record for each block• Reduces the index requirement, enabling binary search over the values without

having to read the entire file to perform binary search

CS 165 Database Systems 22 Lesson 1: Introduction to File Structures

Page 23: Slides 1 Introduction to File Structures

Clustering Index on Ordered Files

CS 165 Database Systems 23 Lesson 1: Introduction to File Structures

Page 24: Slides 1 Introduction to File Structures

Secondary Index

• Used to facilitate faster access to commonly queried non-primary fields• Typically point to the primary index

Advantage: Record deletion and update cause less workDisadvantage: Less efficient

Retrieval Using Combination of Secondary Keys⇒ use boolean AND operation, specifying the intersection of two subsets of the data file

CS 165 Database Systems 24 Lesson 1: Introduction to File Structures

Page 25: Slides 1 Introduction to File Structures

Hashing

• Used to obtain addresses from keys. A hash function is used to map a range of key values into a smaller range of relative addresses.

• Hashing is like indexing in that it involves associating a key with a relative record address.

• Unlike indexing, with hashing, there is no obvious connection between the key and the address generated since the function "randomly selects" a relative address for a specific key value, without regard to the physical sequence of the records in the file. Thus it is also referred to as randomizing scheme.

• Problem in hashing: Presence of collision. Collision happens when two or more input keys generate the same address when the same hash function is used.

• Solution: Collision resolution technique or use of perfect hash functions.

CS 165 Database Systems 25 Lesson 1: Introduction to File Structures

Page 26: Slides 1 Introduction to File Structures

Common Hashing Techniques Perfect Hashing TechniquesHash functions which are able to generate random addresses (not necessarily unique)

Hash functions which are able to generate a perfectly uniform distribution of addresses

• Prime Number Division Method • Digit Extraction• Folding• Mid-Square• Radix Conversion

• Quotient Reduction• Remainder Reduction• Associated Value• Reciprocal Hashing

AdvantagesEasier to implement

Provide 1-1 mapping of keys into addresses (no collisions)

Disadvantages• Records may collide• Records may be unevenly

distributed. In the worst case, all records hash into a single address

• Requires knowledge of the set of key values in order to generate a perfectly uniform distribution of keys

• Quite complicated to use• Has a lot of pre-requisites • It is hard to find a function that produces no

collision

CS 165 Database Systems 26 Lesson 1: Introduction to File Structures

Page 27: Slides 1 Introduction to File Structures

Other Indexing Mechanisms

• Binary Search Trees: BST, Height-Balanced (AVL, Red-Black)

• B-Trees

Indexes That Are Too Large to Hold in Memory

Index too large for memory must be maintained on secondary storage ⇒ a lot of seeking is required.

Alternatives: Use hashing for direct access or tree-structured index for both keyed access and sequential access

CS 165 Database Systems 27 Lesson 1: Introduction to File Structures

Page 28: Slides 1 Introduction to File Structures

Organizing Files for Performance

• Used to improve space utilization and file access times• Some methods: data compression and reclaiming unused space

Data Compression⇒ The process of making files smaller by encoding the information to take up less space

Methods• Using a Different Notation• Suppressing Repeated Sequences• Assigning Variable-Length Codes

Redundancy Reduction – compression by reducing redundancy in data representation

Using a Different Notation

Compact Notation ⇒ a compression technique which we decrease the number of bits in data representation.

Fixed-length fields of a record are good candidates for compressiono Choosing minimal length that is still conservative enougho Using alternative values: for example, course code instead of course name

CS 165 Database Systems 28 Lesson 1: Introduction to File Structures

Page 29: Slides 1 Introduction to File Structures

Suppressing Repeated Sequences

Run-Length Encoding (RLE)

• A compression technique in which runs of repeated codes are replaced by a count of the number of repetitions of the code, followed by the code that is repeated

• Images with repeated pixels are good candidates for RLE

• The Algorithm: Read through the pixels in the image, copying the values to the file sequence,

except where the same pixel value occurs more that one successively. Substitute with the following 3 bytes the pixel values which occurred in succession:

• Run-length code indicator;• Pixel value that is repeated; and• The number of times it is repeated.

For example, we want to compress the following, with 0xFF not included in the image:22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24

The compressed version is22 23 FF 24 07 25 FF 26 06 25 24

• RLE does not guarantee any particular amount of space savings

CS 165 Database Systems 29 Lesson 1: Introduction to File Structures

Page 30: Slides 1 Introduction to File Structures

Assigning Variable-Length Codes

The most frequently used piece of data is assigned to have the shortest code

Example 1: Morse Code

E • T I • • M S • • • O H • • • •

However, Morse Code still need some delimiter to recognize characters

CS 165 Database Systems 30 Lesson 1: Introduction to File Structures

Page 31: Slides 1 Introduction to File Structures

Example 2: Huffman Code

Suppose we have an alphabet consisting of only seven characters:

Letter a b c d e f gProbability 0.4 0.1 0.1 0.1 0.1 0.1 0.1Code 1 010 011 0000 0001 0010 0011

• Each letter here occurs with the probability indicated

• Letter a has the greatest probability to occur more frequently, so it is assigned the one bit code

• So the string abacde is represented as ‘1010101100000001’

• Seven letters can be stored using three bits only, but in this example, as much as four bits are used to ensure that the distinct codes can be stored together without delimiters and still could be recognized

CS 165 Database Systems 31 Lesson 1: Introduction to File Structures