IRON File Systems
Remzi Arpaci-Dusseau
University of Wisconsin, Madison
Understanding How Things Fail Is Important
How Disks Fail
Classic Failure Model: “Fail Stop”
As defined [Schneider ‘90]:
• Stop: Upon failure, halt
• Make known: But first, switch to a state such that other components can detect that you have failed

Very simple model of disk failure
• Used by all early file and storage systems (once controllers could detect failure)
• But is it realistic?
Assertion: Modern Disks Are Not Whole-Disk Fail Stop
Real Failures
Latent sector errors [Kari ’93, Bairavasundaram ‘07]
• A block or set of blocks becomes inaccessible
Data corruption [Weinberg ‘04, Greene ’05, Bairavasundaram ‘08]
• Controller bugs, not bit rot
Transient errors too [Talagala ‘99]
• Bus stuttering, etc.
Result: Partial failures are a reality
So What Should We Do?
High-end Systems: Extra Measures
Disk Scrubbing [Kari ‘93]
• Proactively scan drives in search of latent errors
• When detected, correct from a redundant copy on another disk
Extra redundancy [Corbett ‘04]
• RAID system with two parity disks
Checksums [Bartlett ‘04, Weinberg ‘04]
• Extra computation over data
• Guard against corruption
But What About Desktop File Systems?
Desktop FS’s: Lost In The Past?
Desktop file systems are important
• Home use: Photos, movies, tax returns, ...
• Cluster use too: GoogleFS built on local FS’s
Performance policies are well known
• e.g., FFS placement policy
But what is their fault-handling policy?
• Do they handle partial disk failures?
• How can we tell?
Two Questions
Questions I Will Answer
Question 1: How do local file systems react to the more realistic set of disk failures?

Question 2: How can we change file systems to better handle these types of faults?
How Disks Fail: The Details
The Storage Stack
Not just a file system on top of the disk
• Many layers

Lots of software
• Even within the disk!
Failures occur at all levels
[Figure: the storage stack. Host side: File System, Generic I/O, Device Driver, Device Controller. Transport (bus). Disk side: Cache, Firmware, Electrical/Mechanical, Media.]
Latent Sector Errors
Disks experience partial failures
• “a small portion of data on disk becomes temporarily or permanently unavailable” [Corbett ‘04]

Root causes:
• Scratched surface, inaccurate arm movement, interconnect problems

Bottom line: A single read or write can fail
Data Corruption
Sun’s ZFS [Weinberg ‘04]
• Misdirected writes: Right data, wrong location
• Phantom/Lost writes: “Yes I wrote the data!” (but didn’t)

EIDE interface on motherboards [Greene ‘05]
• Read reported as “done” when it wasn’t (a race)
• Similar problem at Google [Ghemawat ‘03]

Network Appliance [Lewis ‘99]
• Disk occasionally returns byte-shifted data
Transient Errors
18-month study of a large disk farm [Talagala ‘99]
• Most machines had SCSI timeout errors (loose cables, bad cables?)
• SCSI parity errors were common too (data corrupted while moving across the bus)

Failures can be transient too
• Might work if just retried
Even Worse With ATA (Not SCSI)
ATA drives: Less reliable [Anderson ‘03, Hughes & Murray ‘05]
• Few are returned for “failure analysis”
• Some are “partially flaw marked during testing”
• Test conditions not as harsh (power, temperature)
• High-end reliability features missing (filters to remove particles, chemicals to control humidity)

Cheap disks -> less testing -> less reliability
• But cost drives many purchasing decisions…
Trend: More Problems, Not Less
Denser drives: Capacity sells drives
• More logic -> more complexity
• More complexity -> more bugs

Cost per byte dominates: “Pennies matter”
• Manufacturers will cut corners
• Reliability features are the first to go

Increasing amount of software:
• ~400K lines of code in a modern Seagate drive
• Hard to write, hard to debug
The Fail-Partial Failure Model
Disk failure: The entire disk may fail

Block failure: Part of the disk may fail

Block corruption: Part of the disk may get corrupted
All can be either transient or sticky
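
To make the taxonomy concrete, here is a minimal C sketch of how the fail-partial model could be encoded as types; the names are hypothetical, since the talk defines the model only in prose.

    /* Hypothetical encoding of the fail-partial failure model.
     * The talk defines the model in prose; these names are illustrative. */
    enum fault_scope    { FAULT_WHOLE_DISK, FAULT_BLOCK, FAULT_CORRUPTION };
    enum fault_duration { FAULT_TRANSIENT, FAULT_STICKY };

    struct partial_fault {
        enum fault_scope    scope;    /* what failed: disk, block, or contents */
        enum fault_duration duration; /* does the fault persist across retries? */
        unsigned long long  block;    /* affected block (unused for whole-disk) */
    };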
Important Parameters
Locality
• Are partial faults independent of each other?

Frequency
• How often do partial faults occur?
Frequency of Failures

Study of Latent Sector Errors [Bairavasundaram et al. ‘07]
• 1.53 million disks, 3+ years of data
• ATA: 8.5% - SCSI: 1.9%
• Latent sector errors are not independent
• Spatial locality exists; disk capacity matters
Study of Block Corruption [Bairavasundaram et al. ‘08]
• Same data set
• ATA: 0.6% - SCSI: 0.06%
• Corruptions within a disk are not independent
• Spatial locality exists
• The “bad block number” problem
How Do File Systems React To Partial Failures?
How To Detect & Handle Failures?
Need: Classification of techniques
• Detection: Discovering a failure took place
• Recovery: Recovering from the failure

Detection + Recovery = IRON
• File systems with Internal RObustNess
• IRON Taxonomy: Classify techniques
IRON Detection Taxonomy
How to detect block failure or corruption?
Possible strategies:
• Zero: No detection technique used
• Error Code: Check return codes from the disk
• Sanity: Check data structures for consistency
• Redundancy: Add checksums or other forms of computed replication to detect problems
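
As a rough illustration of how the non-trivial detection levels compose, here is a C sketch of a superblock read that layers ErrorCode, Sanity, and Redundancy checks. read_block() and stored_checksum() are hypothetical helpers, and the additive checksum is a stand-in for a real CRC.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 4096
    #define EXT2_SUPER_MAGIC 0xEF53  /* s_magic lives at byte 56 of the superblock */

    int read_block(uint64_t blocknum, uint8_t buf[BLOCK_SIZE]);  /* hypothetical */
    uint32_t stored_checksum(uint64_t blocknum);                 /* hypothetical */

    static uint32_t checksum(const uint8_t *buf, size_t len)
    {
        uint32_t sum = 0;            /* toy checksum; a real FS would use a CRC */
        for (size_t i = 0; i < len; i++)
            sum = sum * 31 + buf[i];
        return sum;
    }

    int iron_read_superblock(uint64_t blocknum, uint8_t buf[BLOCK_SIZE])
    {
        int err = read_block(blocknum, buf);
        if (err < 0)
            return err;                          /* ErrorCode: trust the disk's report */

        uint16_t magic;
        memcpy(&magic, buf + 56, sizeof magic);
        if (magic != EXT2_SUPER_MAGIC)
            return -1;                           /* Sanity: wrong structure type */

        if (checksum(buf, BLOCK_SIZE) != stored_checksum(blocknum))
            return -2;                           /* Redundancy: contents corrupted */
        return 0;
    }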
IRON Recovery Taxonomy
How to recover from a detected failure?
Possible strategies:
• Zero: Don’t do anything
• Propagate: Pass the error on to a higher level
• Stop: Halt activity (“fail stop”)
• Guess: Manufacture data, return it to the user
• Retry: Assume the failure is transient
• Repair: If an inconsistency is detected
• Remap: Redirect to another block
• Redundancy: Use another copy of the block
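
A similarly hedged sketch of how the recovery levels might layer: Retry first (the fault may be transient), then fall back to Redundancy, and finally Propagate the error upward. read_block() and replica_of() are hypothetical helpers.

    #include <stdint.h>

    #define MAX_RETRIES 3

    int read_block(uint64_t blocknum, void *buf);    /* hypothetical */
    uint64_t replica_of(uint64_t blocknum);          /* hypothetical */

    int iron_read(uint64_t blocknum, void *buf)
    {
        int err = 0;
        for (int i = 0; i < MAX_RETRIES; i++) {      /* Retry: fault may be transient */
            err = read_block(blocknum, buf);
            if (err == 0)
                return 0;
        }
        err = read_block(replica_of(blocknum), buf); /* Redundancy: use another copy */
        if (err == 0)
            return 0;
        return err;                                  /* Propagate: let the caller decide */
    }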
What IRON Techniques Do Modern File Systems Use?
Fault Injection
Typical fault injection:
• Insert failures at random disk locations/times
• Watch the system to see what happens

Not good enough:
• May miss interesting behavior
• May find problems, but not explain them

What we do: Space- and Time-aware injection
• A “gray box” approach to testing
Space Awareness
File systems are composed of many on-disk structures
• e.g., superblocks, inodes, etc.

Idea: Make the fault-injection layer aware of file system structures
• Inject faults across all block types (a sketch follows below)
[Figure: example on-disk layout — superblock, inodes, data blocks]
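
A minimal sketch of what space-aware injection might look like: a shim between the file system and the disk that classifies each block by structure type and fails I/O aimed at a chosen type. classify() and disk_read() are hypothetical; the real injector would parse the file system’s on-disk layout to build the classifier.

    #include <errno.h>
    #include <stdint.h>

    enum block_type { BT_SUPER, BT_INODE, BT_JOURNAL, BT_DATA, BT_OTHER };

    enum block_type classify(uint64_t blocknum);  /* hypothetical: block -> type */
    int disk_read(uint64_t blocknum, void *buf);  /* hypothetical: real disk read */

    static enum block_type target = BT_INODE;     /* e.g., fault all inode reads */

    int injected_read(uint64_t blocknum, void *buf)
    {
        if (classify(blocknum) == target)
            return -EIO;       /* inject a read error for the targeted block type */
        return disk_read(blocknum, buf);
    }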
Time Awareness

Time is key to testing as well
• e.g., update sequence
Idea: Build model of file system I/O activity
[Figure: simplified journaling write sequence — journal writes (J), commit (C), checkpoint (K), superblock (S)]

Use the model to induce faults at crucial times
• Don’t miss interesting behaviors
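
To show what time-aware injection could look like in code, here is a sketch of a write-path shim that tracks the (simplified) journaling sequence and fails exactly the commit write, a moment random injection would rarely hit. The helpers are hypothetical.

    #include <errno.h>
    #include <stdint.h>

    enum jstate { IDLE, JOURNALING, COMMITTED };

    int in_journal(uint64_t blocknum);                   /* hypothetical: journal area? */
    int is_commit_record(const void *buf);               /* hypothetical: commit magic? */
    int disk_write(uint64_t blocknum, const void *buf);  /* hypothetical */

    static enum jstate state = IDLE;

    int injected_write(uint64_t blocknum, const void *buf)
    {
        if (in_journal(blocknum)) {
            if (state == JOURNALING && is_commit_record(buf)) {
                state = COMMITTED;
                return -EIO;      /* fail the commit write at the crucial moment */
            }
            state = JOURNALING;   /* journal (J) blocks precede the commit (C) */
        }
        return disk_write(blocknum, buf);
    }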
Making It Comprehensive
Workloads
• Exercise as much of the FS as possible

Two types of workloads:
• Singlets: Stress a single system call (open, lstat, rename, symlink, write, etc.)
• Generics: Stress common functionality (path traversal, recovery, log writes, etc.)
Injecting Faults
Disk: Hard to do -> it’s hardware

Software approach:
• Easy
• Desirable

Fail-partial faults to inject:
• Read, write errors
• Read corruption
[Figure: the storage stack again, with a software Fault Injector layer inserted between the file system and the disk]
The File Systems We Tested
Linux ext3
• Popular, simple, compatible Linux file system

Linux ReiserFS
• Scalable, “database-like” file system

Linux IBM JFS
• Big Blue’s classic journaling file system

Windows NTFS
• Yes, a non-Linux file system
Result Matrix
[Figure: result matrix — rows are data structures (e.g., inode), columns are workloads (e.g., read()); each cell is marked with the observed technique: Zero, Stop, Propagate, Retry, Redundancy, or N/A]
Read Errors: Recovery
Ext3: Stop and propagate (doesn’t tolerate transience)

ReiserFS: Mostly propagate

JFS: Stop, propagate, retry

All: Some cases missed
[Figure: read-error recovery matrix for Ext3, ReiserFS, and JFS — techniques: Zero, Stop, Propagate, Retry, Redundancy]
Write Errors: Recovery
Ext3/JFS: Ignore write faults
• No detection -> no recovery
• Can corrupt the entire volume

ReiserFS: Always calls panic
• Exception: indirect blocks
[Figure: write-error recovery matrix for Ext3, ReiserFS, and JFS — techniques: Zero, Stop, Propagate, Retry, Redundancy]
Corruption: Recovery
Ext3/Reiser/JFS:
• Some sanity checking used
• Stop/Propagate common
• Sanity checking not enough
[Figure: corruption recovery matrix for Ext3, ReiserFS, and JFS — techniques: Zero, Stop, Propagate, Retry, Redundancy]
File System Specific Results
Ext3: Overall simplicity
• Checks error codes, modest sanity checking, propagates errors, aborts the operation
• Overreacts on read errors -> halts instead of propagating
• But some write errors are ignored

ReiserFS: First, do no harm
• At the slightest sign of failure, panic() the file system
• Preserves integrity; overreacts to transients

IBM JFS: The kitchen sink
• Uses the broadest range of techniques

Windows NTFS: Persistence is a virtue
• Liberal retry (understands disks can be flaky)
General Results (1 of 3)
Illogical inconsistency is common
• Similar faults -> different reactions (e.g., JFS failed read of superblock)

Bugs are common
• Code not stress-tested enough? (e.g., ReiserFS indirect-block code paths)

Error codes are sometimes ignored
• Highly surprising: Easiest to detect (but sometimes hard to act upon)
General Results (2 of 3)
Sanity checking is of limited utility
• Doesn’t help if you read the right type of structure but the wrong block
• Hard to do for some structures (e.g., bitmaps)

Stop is useful (if used correctly)
• ReiserFS halts on write errors
• Ext3 tries to do this (but aborts too late)

Stop should not be overused
• Faults can be transient
• Faults can be sticky, too!
General Results (3 of 3)
Retry is underutilized
• JFS does it some, NTFS quite a bit
• But transient faults do occur

Automatic repair is rare
• Almost all “stop” actions involve administrator intervention/repair (running fsck, rebooting, etc.)

Redundancy is rarely used
• Only superblocks are replicated, and only sometimes
Towards an IRON File System
IRON ext3: ixt3
Prototype of an IRON file system
• First cut: Many other possibilities still exist

Start with Linux ext3
• Add checksums: To detect corruption
• Add replication: For important structures (e.g., meta-data)
• Add parity: For user data
Result: IRON ext3 (ixt3)
Ixt3 Implementation
Checksums:
• Initially written to the ext3 log, then checkpointed to their final location

Meta-data replicas:
• Written to a replica log, checkpointed later to their final on-disk location

Parity protection for data:
• One parity block per file, with an extra pointer in the inode (see the sketch below)

Performance issues:
• Space overhead: Low
• Time overhead?
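
For the parity piece, a minimal C sketch of the idea (not ixt3’s actual code): one parity block per file, computed as the XOR of all data blocks, so any single lost block can be rebuilt from the survivors plus parity.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 4096

    /* Parity block = XOR of all the file's data blocks. */
    void compute_parity(uint8_t (*blocks)[BLOCK_SIZE], int nblocks,
                        uint8_t parity[BLOCK_SIZE])
    {
        memset(parity, 0, BLOCK_SIZE);
        for (int b = 0; b < nblocks; b++)
            for (int i = 0; i < BLOCK_SIZE; i++)
                parity[i] ^= blocks[b][i];
    }

    /* Rebuild one lost block: XOR the parity with every surviving block. */
    void rebuild_block(uint8_t (*blocks)[BLOCK_SIZE], int nblocks, int lost,
                       const uint8_t parity[BLOCK_SIZE], uint8_t out[BLOCK_SIZE])
    {
        memcpy(out, parity, BLOCK_SIZE);
        for (int b = 0; b < nblocks; b++)
            if (b != lost)
                for (int i = 0; i < BLOCK_SIZE; i++)
                    out[i] ^= blocks[b][i];
    }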
Ixt3 Performance Evaluation
For “home use” or read-mostly workloads: No overhead
• Has a cost for write-intensive workloads
Workload     Metadata   Data   Both
SSH-Build      1.00     1.00   1.00
Web Server     1.00     1.00   1.00
PostMark       1.19     1.13   1.37
TPC-B          1.20     1.10   1.42

(Running time relative to unmodified ext3; columns show protection applied to meta-data only, user data only, or both.)
Wrapping Up
Summary
File systems are important
• Used everywhere, in many different ways

Disks fail in interesting ways
• New model: Fail-partial failure model

Local file systems: Not ready for local faults
• Illogical inconsistencies, bugs, and little recovery

Need: IRON file systems
• Ixt3: Low-cost protection from partial failures
Challenges and Directions
Need to rethink how we build file systems
• Performance policy isn’t the only policy
• Fault-handling policy is critical

Testing and beyond testing
• Failure handling must be tested (continuously?)
• Beyond testing: Code analysis too?

Guiding principles
• Lessons from networking
• Put simply: Don’t trust the disk
ADvanced Systems Lab (ADSL)
www.cs.wisc.edu/adsl
ADvanced Systems Lab (ADSL)
Who did the real work:
• Nitin Agrawal
• Lakshmi Bairavasundaram
• Haryadi Gunawi
• Vijayan Prabhakaran
Backup Slides
Read Errors: Detection Techniques
Across all three file systems:
• Error codes are checked for read errors (rarely ignored)

[Figure: read-error detection matrix for Ext3, ReiserFS, and JFS — detection levels: Zero, ErrorCode, Sanity]
Write Errors: Detection Techniques
Ext3 and JFS ignore write errors!
• Either ignored altogether or not used meaningfully

ReiserFS: Much more careful

[Figure: write-error detection matrix for Ext3, ReiserFS, and JFS — detection levels: Zero, ErrorCode, Sanity]
Corruption: Detection Techniques
Sanity checking is used across all three file systems

Sanity checking is not sufficient
• e.g., when you read a block of a similar type

[Figure: corruption detection matrix for Ext3, ReiserFS, and JFS — detection levels: Zero, ErrorCode, Sanity]
File Systems: The Manager of Your Data
Why File Systems Are Important
The file system: The manager of “most” data
• Consists of named files: Linear arrays of bytes
• Organized into directories: /this/is/my/file
• Access methods: open(), read(), write(), close()

Where we use them: Everywhere
• Home use: Photos, tax returns, home movies
• Servers: Network file servers, the Google search engine

Why we use them:
• Simple, convenient
• Good performance: Subject of much research
• Reliable? Depends on how disks fail…
File System Background
Meta-data: Structures the file system uses to track what it needs to track
• Superblock: File-system wide parameters
• Inodes: Information about a file
• Data: Blocks to hold user data

[Figure: on-disk layout — superblock, inodes, data blocks]
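
For readers new to these structures, a toy C rendering of the pieces named above; real layouts (e.g., ext2’s) carry many more fields.

    #include <stdint.h>

    struct superblock {          /* file-system wide parameters */
        uint32_t block_count;
        uint32_t inode_count;
        uint32_t block_size;
        uint16_t magic;          /* value used for sanity checks */
    };

    struct inode {               /* information about one file */
        uint32_t size;           /* file length in bytes */
        uint32_t direct[12];     /* direct pointers to data blocks */
        uint32_t indirect;       /* pointer to a block of block pointers */
    };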