File System Reliability


Page 1:

File System Reliability

Page 2:

Today’s Lecture

• Review of Disk Scheduling
• Storage reliability

Page 3:

Disk Scheduling

• The operating system is responsible for using hardware efficiently
– For the disk drives, this means having fast access time and high disk bandwidth.
• Access time has two major components
– Seek time is the time for the disk arm to move the heads to the cylinder containing the desired sector.
– Rotational latency is the additional time waiting for the disk to rotate the desired sector to the disk head.
• Minimize seek time
• Seek time ≈ seek distance
• Disk bandwidth is the total number of bytes transferred, divided by the total time between the first request for service and the completion of the last transfer.

Page 4:

Disk Performance Parameters

• The actual details of disk I/O operation depend on many things
– A general timing diagram of disk I/O transfer is shown here.

Dave Bremer, Otago Polytechnic, NZ. © 2008, Prentice Hall

Page 5:

Disk Scheduling (Cont.)

• Several algorithms exist to schedule the servicing of disk I/O requests.

• We illustrate them with a request queue (0-199).

98, 183, 37, 122, 14, 124, 65, 67

Head pointer 53

Page 6:

FCFS: First Come First Serve

Illustration shows total head movement of 640 cylinders.
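
As a sanity check on that figure, here is a minimal Python sketch (not part of the original slides) that sums the head movement when the queue is served in arrival order:

```python
def fcfs_head_movement(queue, head):
    """Total cylinders traveled when requests are served in arrival order."""
    total = 0
    for cylinder in queue:
        total += abs(cylinder - head)
        head = cylinder
    return total

print(fcfs_head_movement([98, 183, 37, 122, 14, 124, 65, 67], 53))  # 640
```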

Page 7:

SSTF: Shortest Seek Time First

• Selects the request with the minimum seek time from the current head position.

• SSTF scheduling is a form of SJF scheduling; may cause starvation of some requests.

• Illustration shows total head movement of 236 cylinders.
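
The same check for SSTF, again as an illustrative sketch rather than anything from the slides: greedily serve whichever pending cylinder is closest to the head.

```python
def sstf_head_movement(queue, head):
    """Greedily serve the pending request closest to the current head position."""
    pending = list(queue)
    total = 0
    while pending:
        nearest = min(pending, key=lambda c: abs(c - head))
        total += abs(nearest - head)
        head = nearest
        pending.remove(nearest)
    return total

print(sstf_head_movement([98, 183, 37, 122, 14, 124, 65, 67], 53))  # 236
```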

Page 8:

SSTF (Cont.)

Page 9:

SCAN

• The disk arm starts at one end of the disk, and moves toward the other end, servicing requests until it gets to the other end of the disk, where the head movement is reversed and servicing continues.

• Sometimes called the elevator algorithm.
• Illustration shows total head movement of 208 cylinders.

Page 10:

SCAN (Cont.)

Page 11:

C-SCAN

• Provides a more uniform wait time than SCAN.
• The head moves from one end of the disk to the other, servicing requests as it goes. When it reaches the other end, however, it immediately returns to the beginning of the disk, without servicing any requests on the return trip.
• Treats the cylinders as a circular list that wraps around from the last cylinder to the first one.

Page 12:

C-SCAN (Cont.)

Page 13:

C-LOOK

• Version of C-SCAN
• Arm only goes as far as the last request in each direction, then reverses direction immediately, without first going all the way to the end of the disk.

Page 14:

C-LOOK (Cont.)

Page 15:

Selecting a Disk-Scheduling Algorithm

• SSTF is common and has a natural appeal.
• SCAN and C-SCAN perform better for systems that place a heavy load on the disk.
• Performance depends on the number and types of requests.
• Requests for disk service can be influenced by the file-allocation method.
• The disk-scheduling algorithm should be written as a separate module of the operating system, allowing it to be replaced with a different algorithm if necessary.
• Either SSTF or LOOK is a reasonable choice for the default algorithm.

Page 16:

Reliability: Main Points

• Problem posed by machine/disk failures
• Transaction concept
• Three approaches to reliability
– Careful sequencing of file system operations
– Copy-on-write (WAFL, ZFS)
– Journalling (NTFS, Linux ext4)
• Approaches to availability
– RAID

Page 17:

File System Reliability

• What can happen if disk loses power or machine software crashes?
– Some operations in progress may complete
– Some operations in progress may be lost
– Overwrite of a block may only partially complete
• File system wants durability (as a minimum!)
– Data previously stored can be retrieved (maybe after some recovery step), regardless of failure

Page 18:

Storage Reliability Problem

• Single logical file operation can involve updates to multiple physical disk blocks
– inode, indirect block, data block, bitmap, …
– With remapping, single update to physical disk block can require multiple (even lower level) updates
• At a physical level, operations complete one at a time
– Want concurrent operations for performance
• How do we guarantee consistency regardless of when crash occurs?

Page 19:

Types of Storage Media

• Volatile storage – information stored here does not survive system crashes
– Example: main memory, cache
• Nonvolatile storage – information usually survives crashes
– Example: disk and tape
• Stable storage – information never lost
– Not actually possible, so approximated via replication or RAID to devices with independent failure modes
• Goal is to assure transaction atomicity where failures cause loss of information on volatile storage

Page 20:

Reliability Approach #1: Careful Ordering

• Sequence operations in a specific order
– Careful design to allow sequence to be interrupted safely
• Post-crash recovery
– Read data structures to see if there were any operations in progress
– Clean up/finish as needed
• Approach taken in FAT, FFS (fsck), and many app-level recovery schemes (e.g., Word)

Page 21:

FAT: Append Data to File

• Add data block
• Add pointer to data block
• Update file tail to point to new MFT entry
• Update access time at head of file


Page 23:

FAT: Append Data to File

Normal operation:
• Add data block
• Add pointer to data block
• Update file tail to point to new MFT entry
• Update access time at head of file

Recovery:
• Scan MFT
• If entry is unlinked, delete data block
• If access time is incorrect, update

Page 24:

FAT: Create New File

Normal operation:
• Allocate data block
• Update MFT entry to point to data block
• Update directory with file name -> file number
– What if directory spans multiple disk blocks?
• Update modify time for directory

Recovery:
• Scan MFT
• If any unlinked files (not in any directory), delete
• Scan directories for missing update times

Page 25:

FFS: Create a File

Normal operation:
• Allocate data block
• Write data block
• Allocate inode
• Write inode block
• Update bitmap of free blocks
• Update directory with file name -> file number
• Update modify time for directory

Recovery:
• Scan inode table
• If any unlinked files (not in any directory), delete
• Compare free block bitmap against inode trees
• Scan directories for missing update/access times

Time proportional to size of disk

Page 26:

FFS: Move a File

Normal operation:
• Remove filename from old directory
• Add filename to new directory

Recovery:
• Scan all directories to determine set of live files
• Consider files with valid inodes and not in any directory
– New file being created?
– File move?
– File deletion?

Page 27:

Application Level

Normal operation:
• Write name of each open file to app folder
• Write changes to backup file
• Rename backup file to be the file (atomic operation provided by the file system; see the sketch below)
• Delete list in app folder on clean shutdown

Recovery:
• On startup, see if any files were left open
• If so, look for backup file
• If so, ask user to compare versions
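
To make the rename step concrete, here is a minimal sketch of the write-backup-then-rename pattern (not from the slides; the file name and helper function are illustrative). It leans on the atomic rename the file system provides, exposed in Python as os.replace:

```python
import os

def save_atomically(path, data):
    """Write new contents to a backup file, then atomically rename it over the original."""
    backup = path + ".tmp"            # illustrative backup-file name
    with open(backup, "w") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # force the backup's contents to disk first
    os.replace(backup, path)          # atomic rename provided by the file system

save_atomically("document.txt", "new contents")
```

After a crash, either the old file or the complete new file is present, never a half-written mix.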

Page 28:

Careful Ordering

• Pros
– Works with minimal support in the disk drive
– Works for most multi-step operations
• Cons
– Can require time-consuming recovery after a failure
– Difficult to reduce every operation to a safely interruptible sequence of writes
– Difficult to achieve consistency when multiple operations occur concurrently

Page 29:

Reliability Approach #2: Copy-on-Write File Layout

• To update the file system, write a new version of the file system containing the update
– Never update in place
– Reuse existing unchanged disk blocks
• Seems expensive! But
– Updates can be batched
– Almost all disk writes can occur in parallel
• Approach taken in network file server appliances (WAFL, ZFS)

Page 30:

Copy on Write/Write Anywhere

Page 31:

Copy on Write/Write Anywhere

Page 32:

Copy on Write Batch Update

Page 33:

Copy on Write Garbage Collection

• For write efficiency, want contiguous sequences of free blocks
– Spread across all block groups
– Updates leave dead blocks scattered
• For read efficiency, want data read together to be in the same block group
– Write anywhere leaves related data scattered
=> Background coalescing of live/dead blocks

Page 34:

Copy On Write

• Pros
– Correct behavior regardless of failures
– Fast recovery (root block array)
– High throughput (best if updates are batched)
• Cons
– Potential for high latency
– Small changes require many writes
– Garbage collection essential for performance

Page 35:

Reliability Approach #3: Log Structured File Systems

• Log structured (or journaling) file systems record each update to the file system as a transaction
• All transactions are written to a log
– A transaction is considered committed once it is written to the log
– However, the file system may not yet be updated
• The transactions in the log are asynchronously written to the file system
– When the file system is modified, the transaction is removed from the log
• If the file system crashes, all remaining transactions in the log must still be performed

Page 36:

Transaction

• Assures that operations happen as a single logical unit of work, in its entirety or not at all
• Challenge is assuring atomicity despite computer system failures
• Transaction: a collection of instructions or operations that performs a single logical function
– Here we are concerned with changes to stable storage (disk)
– Transaction is a series of read and write operations
– Terminated by a commit (transaction successful) or abort (transaction failed) operation
– An aborted transaction must be rolled back to undo any changes it performed

Page 37:

Transaction Concept

• Transaction is a group of operations
– Atomic: operations appear to happen as a group, or not at all (at logical level)
• At physical level, only single disk/flash write is atomic
– Consistency: sequential memory model
– Isolation: other transactions do not see results of earlier transactions until they are committed
– Durable: operations that complete stay completed
• Future failures do not corrupt previously stored data

Page 38:

Logging File Systems

• Instead of modifying data structures on disk directly, write changes to a journal/log
– Intention list: set of changes we intend to make
– Log/journal is append-only
• Once changes are on log, safe to apply changes to data structures on disk
– Recovery can read log to see what changes were intended
• Once changes are copied, safe to remove log

Page 39:

Log-Based Recovery

• Record to stable storage information about all modifications by a transaction
• Most common is write-ahead logging
– Log on stable storage; each log record describes a single transaction write operation, including:
• Transaction name
• Data item name
• Old value
• New value
– <Ti starts> written to log when transaction Ti starts
– <Ti commits> written when Ti commits
• Log entry must reach stable storage before operation on data occurs

Page 40:

Log-Based Recovery Algorithm

• Using the log, system can handle any volatile memory errors
– Undo(Ti) restores value of all data updated by Ti
– Redo(Ti) sets values of all data in transaction Ti to new values
• Undo(Ti) and redo(Ti) must be idempotent
– Multiple executions must have the same result as one execution
• If system fails, restore state of all updated data via log
– If log contains <Ti starts> without <Ti commits>, undo(Ti)
– If log contains <Ti starts> and <Ti commits>, redo(Ti)
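
A minimal sketch (not from the slides) of the recovery scan these rules describe, assuming the log is a simple in-memory list of records and the undo/redo callbacks are idempotent:

```python
def recover(log, undo, redo):
    """Redo committed transactions, undo uncommitted ones.

    `log` is a list of records such as ("start", ti), ("commit", ti) or
    ("write", ti, item, old, new); undo/redo take a transaction id.
    """
    started = {rec[1] for rec in log if rec[0] == "start"}
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for ti in started:
        if ti in committed:
            redo(ti)    # <Ti starts> and <Ti commits> both in the log
        else:
            undo(ti)    # <Ti starts> without <Ti commits>
    # Because undo/redo are idempotent, re-running recovery after another crash is safe.
```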

Page 41:

Redo Logging

• Prepare
– Write all changes (in transaction) to log
• Commit
– Single disk write to make transaction durable
• Redo
– Copy changes to disk
• Garbage collection
– Reclaim space in log
• Recovery
– Read log
– Redo any operations for committed transactions
– Garbage collect log
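
The write path can be sketched the same way (illustrative only, not from the slides): every change reaches the log first, a single commit record makes the transaction durable, and write-back plus garbage collection happen later.

```python
class RedoLog:
    """Toy redo log: all changes reach the log before any reach the 'disk'."""

    def __init__(self):
        self.log = []      # append-only log (stand-in for stable storage)
        self.disk = {}     # block number -> contents

    def commit(self, tid, changes):
        # Prepare: write every change in the transaction to the log.
        for block, value in changes.items():
            self.log.append(("write", tid, block, value))
        # Commit: one final log append makes the transaction durable.
        self.log.append(("commit", tid))

    def redo(self):
        # Copy committed transactions' changes to their disk locations,
        # then garbage collect the log.
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        for rec in self.log:
            if rec[0] == "write" and rec[1] in committed:
                self.disk[rec[2]] = rec[3]
        self.log.clear()
```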

Page 42:

Before Transaction Start

Page 43:

After Updates Are Logged

Page 44:

After Commit Logged

Page 45:

After Copy Back

Page 46:

After Garbage Collection


Page 48:

Performance

• Log written sequentially
– Often kept in flash storage
• Asynchronous write back
– Any order, as long as all changes are logged before commit, and all write backs occur after commit
• Can process multiple transactions
– Transaction ID in each log entry
– Transaction completed iff its commit record is in log

Page 49:

Concurrent Transactions

• Must be equivalent to serial execution – serializability

• Could perform all transactions in critical section
– Inefficient, too restrictive

• Concurrency-control algorithms provide serializability

Page 50:

Serializability

• Consider two data items A and B
• Consider Transactions T0 and T1
• Execute T0, T1 atomically
• Execution sequence called schedule
• Atomically executed transaction order called serial schedule
• For N transactions, there are N! valid serial schedules

Page 51:

Schedule 1: T0 then T1

Page 52:

Nonserial Schedule

• Nonserial schedule allows overlapped execution
– Resulting execution not necessarily incorrect
• Consider schedule S, operations Oi, Oj
– Conflict if they access the same data item, with at least one write
• If Oi, Oj are consecutive operations of different transactions and Oi and Oj don't conflict
– Then S' with swapped order Oj, Oi is equivalent to S
• If S can become S' via swapping nonconflicting operations
– S is conflict serializable

Page 53:

Schedule 2: Concurrent Serializable Schedule

Page 54:

Locking Protocol

• Ensure serializability by associating a lock with each data item
– Follow locking protocol for access control
• Locks
– Shared: Ti has shared-mode lock (S) on item Q; Ti can read Q but not write Q
– Exclusive: Ti has exclusive-mode lock (X) on Q; Ti can read and write Q
• Require every transaction on item Q to acquire appropriate lock
• If lock already held, new request may have to wait
– Similar to readers-writers algorithm

Page 55:

Two-phase Locking Protocol

• Generally ensures conflict serializability
• Each transaction issues lock and unlock requests in two phases
– Growing: obtaining locks
– Shrinking: releasing locks

• Does not prevent deadlock
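
A minimal sketch of the two-phase discipline (not from the slides; the data items and the threading primitives are illustrative):

```python
import threading

locks = {"A": threading.Lock(), "B": threading.Lock()}   # one lock per data item

def transfer(data, amount):
    # Growing phase: acquire every lock the transaction needs before releasing any.
    for item in ("A", "B"):
        locks[item].acquire()
    try:
        data["A"] -= amount
        data["B"] += amount
    finally:
        # Shrinking phase: release locks only after all acquisitions are done.
        for item in ("A", "B"):
            locks[item].release()
```

Acquiring the locks in a fixed global order, as above, also sidesteps the deadlock that two-phase locking by itself does not prevent.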

Page 56:

Timestamp-based Protocols

• Select order among transactions in advance – timestamp-ordering
• Transaction Ti associated with timestamp TS(Ti) before Ti starts
– TS(Ti) < TS(Tj) if Ti entered system before Tj
– TS can be generated from system clock or as logical counter incremented at each entry of transaction
• Timestamps determine serializability order
– If TS(Ti) < TS(Tj), system must ensure produced schedule equivalent to serial schedule where Ti appears before Tj

Page 57:

Timestamp-based Protocol Implementation

• Data item Q gets two timestamps
– W-timestamp(Q): largest timestamp of any transaction that executed write(Q) successfully
– R-timestamp(Q): largest timestamp of successful read(Q)
– Updated whenever read(Q) or write(Q) executed
• Timestamp-ordering protocol assures any conflicting read and write executed in timestamp order
• Suppose Ti executes read(Q)
– If TS(Ti) < W-timestamp(Q), Ti needs to read value of Q that was already overwritten
• read operation rejected and Ti rolled back
– If TS(Ti) ≥ W-timestamp(Q)
• read executed, R-timestamp(Q) set to max(R-timestamp(Q), TS(Ti))

Page 58:

Timestamp-ordering Protocol

• Suppose Ti executes write(Q)
– If TS(Ti) < R-timestamp(Q), the value of Q produced by Ti was needed previously and Ti assumed it would never be produced
• Write operation rejected, Ti rolled back
– If TS(Ti) < W-timestamp(Q), Ti is attempting to write an obsolete value of Q
• Write operation rejected and Ti rolled back
– Otherwise, write executed
• Any rolled back transaction Ti is assigned a new timestamp and restarted
• Algorithm ensures conflict serializability and freedom from deadlock
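
A minimal sketch (not from the slides) of these read and write rules; the item class and the rollback exception are illustrative:

```python
class RollbackError(Exception):
    """The transaction must be rolled back and restarted with a new timestamp."""

class Item:
    """Data item carrying the R- and W-timestamps used by timestamp ordering."""
    def __init__(self, value):
        self.value = value
        self.r_ts = 0   # largest timestamp of a successful read(Q)
        self.w_ts = 0   # largest timestamp of a successful write(Q)

def read(item, ts):
    if ts < item.w_ts:
        raise RollbackError()       # Ti would read a value that was already overwritten
    item.r_ts = max(item.r_ts, ts)
    return item.value

def write(item, ts, value):
    if ts < item.r_ts or ts < item.w_ts:
        raise RollbackError()       # the write arrives too late, or would be obsolete
    item.value, item.w_ts = value, ts
```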

Page 59:

Schedule Possible Under Timestamp Protocol

Page 60:

Log Structure

• Log is the data storage; no copy back
– Storage split into contiguous fixed size segments
• Flash: size of erasure block
• Disk: efficient transfer size (e.g., 1 MB)
– Log new blocks into empty segment
• Garbage collect dead blocks to create empty segments
– Each segment contains extra level of indirection
• Which blocks are stored in that segment
• Recovery
– Find last successfully written segment

Page 61:

Storage Availability

• Storage reliability: data fetched is what you stored
– Transactions, redo logging, etc.
• Storage availability: data is there when you want it
– More disks => higher probability of some disk failing
– Data available ~ Prob(disk working)^k
• If failures are independent and data is spread across k disks
– For large k, probability system works -> 0
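
A quick illustration with assumed numbers (not from the slides): if each disk works with probability 0.99 and the data needs all k disks, availability falls quickly as k grows.

```python
p_disk = 0.99                        # assumed probability that one disk is working
for k in (1, 10, 100, 500):
    print(k, round(p_disk ** k, 3))  # 0.99, 0.904, 0.366, 0.007
```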

Page 62:

Hardware Reliability Solution: RAID

• RAID (originally redundant array of inexpensive disks; now commonly redundant array of independent disks)
• Replicate data for availability
– RAID 0: no replication
– RAID 1: mirror data across two or more disks
• Google File System replicated its data on three disks, spread across multiple racks
– RAID 5: split data across disks, with redundancy to recover from a single disk failure
– RAID 6: RAID 5, with extra redundancy to recover from two disk failures

Page 63:

RAID 1: Mirroring

• Replicate writes to both disks

• Reads can go to either disk

Page 64:

Parity

• Parity block: Block1 xor block2 xor block3 …

10001101  block 1
01101100  block 2
11000110  block 3
--------------
00100111  parity block

• Can reconstruct any missing block from the others
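
A small sketch (not from the slides) that checks the parity arithmetic above and rebuilds a lost block:

```python
blocks = [0b10001101, 0b01101100, 0b11000110]
parity = blocks[0] ^ blocks[1] ^ blocks[2]
print(format(parity, "08b"))               # 00100111, matching the slide

# Pretend block 2 is lost: XOR the parity with the surviving blocks to rebuild it.
rebuilt = parity ^ blocks[0] ^ blocks[2]
print(format(rebuilt, "08b"))              # 01101100
```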

Page 65:

RAID 5: Rotating Parity

Page 66:

RAID Update

• Mirroring
– Write every mirror
• RAID-5: to write one block
– Read old data block
– Read old parity block
– Write new data block
– Write new parity block
• New parity = old data xor old parity xor new data (see the sketch below)
• RAID-5: to write entire stripe
– Write data blocks and parity
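
A minimal sketch (not from the slides) of the RAID-5 small write; the stripe is just a list of integers here:

```python
def raid5_small_write(data_blocks, parity, index, new_data):
    """Read-modify-write of one block in a stripe; returns updated blocks and parity."""
    old_data = data_blocks[index]                  # read old data block
    new_parity = old_data ^ parity ^ new_data      # old data xor old parity xor new data
    blocks = list(data_blocks)
    blocks[index] = new_data                       # write new data block
    return blocks, new_parity                      # write new parity block

blocks, parity = raid5_small_write(
    [0b10001101, 0b01101100, 0b11000110], 0b00100111, 1, 0b11111111)
assert parity == blocks[0] ^ blocks[1] ^ blocks[2]   # parity still covers the stripe
```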

Page 67:

Non-Recoverable Read Errors

• Disk devices can lose data
– One sector per 10^15 bits read
– Causes:
• Physical wear
• Repeated writes to nearby tracks

• What impact does this have on RAID recovery?

Page 68:

Read Errors and RAID recovery

• Example
– 10 disks of 1 TB each, and 1 fails
– Read remaining disks to reconstruct missing data
• Probability of recovery = (1 – 10^-15)^(9 disks × 8 bits/byte × 10^12 bytes/disk) ≈ 93%
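
Checking that arithmetic (a sketch using the error rate quoted above):

```python
bits_to_read = 9 * 8 * 10**12     # nine surviving 1 TB disks, 8 bits per byte
p_bit_ok = 1 - 1e-15              # non-recoverable read error rate from the previous slide
print(p_bit_ok ** bits_to_read)   # about 0.93, i.e. roughly a 7% chance the rebuild fails
```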

Page 69:

RAID 6 Dual Redundancy

• Two different parity calculations are carried out – stored in separate blocks on different disks.

• Can recover from two disks failing