IRON File Systems
Remzi Arpaci-Dusseau
University of Wisconsin, Madison
Understanding How Things Fail Is Important
How Disks Fail
Classic Failure Model: “Fail Stop”
As defined [Schneider ‘90]:
• Stop: Upon failure, halt
• Make known: But first, switch to a state such that other components can detect that you have failed

Very simple model of disk failure
• Used by all early file and storage systems (once controllers could detect failure)
• But is it realistic?
Assertion: Modern Disks Are Not Whole-Disk Fail Stop
Real Failures
Latent sector errors [Kari ’93, Bairavasundaram ‘07]
• A block or set of blocks becomes inaccessible
Data corruption [Weinberg ‘04, Greene ’05, Bairavasundaram ‘08]
• Controller bugs, not bit rot
Transient errors too [Talagala ‘99]
• Bus stuttering, etc.
Result: Partial failures are a reality
So What Should We Do?
High-end Systems: Extra Measures
Disk Scrubbing [Kari ‘93]
• Proactively scan drives in search of latent errors
• When detected, correct from a redundant copy on another disk
Extra redundancy [Corbett ‘04]
• RAID system with two parity disks
Checksums [Bartlett ‘04, Weinberg ‘04]
• Extra computation over data
• Guard against corruption
But What About Desktop File Systems?
Desktop FS’s: Lost In The Past?
Desktop file systems are important
• Home use: Photos, movies, tax returns, ...
• Cluster use too: GoogleFS built on local FS’s
Performance policies are well known
• e.g., FFS placement policy
But what is their fault-handling policy?
• Do they handle partial disk failures?
• How can we tell?
Two Questions
Questions I Will Answer
Question 1: How do local file systems react to the more realistic set of disk failures?

Question 2: How can we change file systems to better handle these types of faults?
How Disks Fail: The Details
The Storage Stack
Not just a file system on top of the disk
• Many layers

Lots of software
• Even within the disk!
Failures occur at all levels
[Figure: the storage stack. Host side: File System, Generic I/O, Device Driver, Device Controller. Transport (bus). Disk side: Cache, Firmware, Electrical/Mechanical, Media.]
Latent Sector Errors
Disks experience partial failures
• “a small portion of data on disk becomes temporarily or permanently unavailable” [Corbett ‘04]

Root causes:
• Scratched surface, inaccurate arm movement, interconnect problems

Bottom line: A single read or write can fail
Data Corruption
Sun’s ZFS [Weinberg ‘04]
• Misdirected writes: Right data, wrong location
• Phantom/Lost writes: “Yes I wrote the data!” (but didn’t)

EIDE interface on motherboards [Greene ‘05]
• Read reported as “done” when it wasn’t (a race)
• Similar problem at Google [Ghemawat ‘03]

Network Appliance [Lewis ‘99]
• Disk occasionally returns byte-shifted data
Transient Errors
18-month study of a large disk farm [Talagala ‘99]
• Most machines had SCSI timeout errors (loose cables, bad cables?)
• SCSI parity errors were common too (data corrupted while moving across the bus)

Failures can be transient too
• Might work if just retried
Even Worse With ATA (Not SCSI)
ATA drives: Less reliable [Anderson ‘03, Hughes & Murray ‘05]
• Few are returned for “failure analysis”
• Some are “partially flaw marked during testing”
• Test conditions not as harsh (power, temperature)
• High-end reliability features missing (filters to remove particles, chemicals to control humidity)

Cheap disks -> less testing -> less reliability
• But cost drives many purchasing decisions…
Trend: More Problems, Not Less
Denser drives: Capacity sells drives
• More logic -> more complexity
• More complexity -> more bugs

Cost per byte dominates: “Pennies matter”
• Manufacturers will cut corners
• Reliability features are the first to go

Increasing amount of software:
• ~400K lines of code in a modern Seagate drive
• Hard to write, hard to debug
The Fail-Partial Failure Model
Disk failure: The entire disk may fail

Block failure: Part of the disk may fail

Block corruption: Part of the disk may get corrupted
All can be either transient or sticky
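
To make the taxonomy concrete, here is a minimal C sketch of how the fail-partial model could be encoded as types; the names are hypothetical, since the talk defines the model only in prose.

    /* Hypothetical encoding of the fail-partial failure model.
     * The talk defines the model in prose; these names are illustrative. */
    enum fault_scope    { FAULT_WHOLE_DISK, FAULT_BLOCK, FAULT_CORRUPTION };
    enum fault_duration { FAULT_TRANSIENT, FAULT_STICKY };

    struct partial_fault {
        enum fault_scope    scope;    /* what failed: disk, block, or contents */
        enum fault_duration duration; /* does the fault persist across retries? */
        unsigned long long  block;    /* affected block (unused for whole-disk) */
    };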
Important Parameters
Locality
• Are partial faults independent of each other?

Frequency
• How often do partial faults occur?
Frequency of Failures

Study of Latent Sector Errors [Bairavasundaram et al. ‘07]
• 1.53 million disks, 3+ years of data
• ATA: 8.5% - SCSI: 1.9%
• Latent sector errors are not independent
• Spatial locality exists; disk capacity matters
Study of Block Corruption [Bairavasundaram et al. ‘08]
• Same data set
• ATA: 0.6% - SCSI: 0.06%
• Corruptions within a disk are not independent
• Spatial locality exists
• The “bad block number” problem
How Do File Systems React To Partial Failures?
How To Detect & Handle Failures?
Need: Classification of techniques
• Detection: Discovering a failure took place
• Recovery: Recovering from the failure

Detection + Recovery = IRON
• File systems with Internal RObustNess
• IRON Taxonomy: Classify techniques
IRON Detection Taxonomy
How to detect block failure or corruption?
Possible strategies:
• Zero: No detection technique used
• Error Code: Check return codes from the disk
• Sanity: Check data structures for consistency
• Redundancy: Add checksums or other forms of computed replication to detect problems
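
As a rough illustration of how the non-trivial detection levels compose, here is a C sketch of a superblock read that layers ErrorCode, Sanity, and Redundancy checks. read_block() and stored_checksum() are hypothetical helpers, and the additive checksum is a stand-in for a real CRC.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 4096
    #define EXT2_SUPER_MAGIC 0xEF53  /* s_magic lives at byte 56 of the superblock */

    int read_block(uint64_t blocknum, uint8_t buf[BLOCK_SIZE]);  /* hypothetical */
    uint32_t stored_checksum(uint64_t blocknum);                 /* hypothetical */

    static uint32_t checksum(const uint8_t *buf, size_t len)
    {
        uint32_t sum = 0;            /* toy checksum; a real FS would use a CRC */
        for (size_t i = 0; i < len; i++)
            sum = sum * 31 + buf[i];
        return sum;
    }

    int iron_read_superblock(uint64_t blocknum, uint8_t buf[BLOCK_SIZE])
    {
        int err = read_block(blocknum, buf);
        if (err < 0)
            return err;                          /* ErrorCode: trust the disk's report */

        uint16_t magic;
        memcpy(&magic, buf + 56, sizeof magic);
        if (magic != EXT2_SUPER_MAGIC)
            return -1;                           /* Sanity: wrong structure type */

        if (checksum(buf, BLOCK_SIZE) != stored_checksum(blocknum))
            return -2;                           /* Redundancy: contents corrupted */
        return 0;
    }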
IRON Recovery Taxonomy
How to recover from a detected failure?
Possible strategies:
• Zero: Don’t do anything
• Propagate: Pass the error on to a higher level
• Stop: Halt activity (“fail stop”)
• Guess: Manufacture data, return it to the user
• Retry: Assume the failure is transient
• Repair: If an inconsistency is detected
• Remap: Redirect to another block
• Redundancy: Use another copy of the block
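
A similarly hedged sketch of how the recovery levels might layer: Retry first (the fault may be transient), then fall back to Redundancy, and finally Propagate the error upward. read_block() and replica_of() are hypothetical helpers.

    #include <stdint.h>

    #define MAX_RETRIES 3

    int read_block(uint64_t blocknum, void *buf);    /* hypothetical */
    uint64_t replica_of(uint64_t blocknum);          /* hypothetical */

    int iron_read(uint64_t blocknum, void *buf)
    {
        int err = 0;
        for (int i = 0; i < MAX_RETRIES; i++) {      /* Retry: fault may be transient */
            err = read_block(blocknum, buf);
            if (err == 0)
                return 0;
        }
        err = read_block(replica_of(blocknum), buf); /* Redundancy: use another copy */
        if (err == 0)
            return 0;
        return err;                                  /* Propagate: let the caller decide */
    }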
What IRON Techniques Do Modern File Systems Use?
Fault Injection
Typical fault injection:
• Insert failures at random disk locations/times
• Watch the system to see what happens

Not good enough:
• May miss interesting behavior
• May find problems, but not explain them

What we do: Space- and Time-aware injection
• A “gray box” approach to testing
Space Awareness
File systems are composed of many on-disk structures
• e.g., superblocks, inodes, etc.

Idea: Make the fault-injection layer aware of file system structures
• Inject faults across all block types (a sketch follows below)
[Figure: example on-disk layout — superblock, inodes, data blocks]
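
A minimal sketch of what space-aware injection might look like: a shim between the file system and the disk that classifies each block by structure type and fails I/O aimed at a chosen type. classify() and disk_read() are hypothetical; the real injector would parse the file system’s on-disk layout to build the classifier.

    #include <errno.h>
    #include <stdint.h>

    enum block_type { BT_SUPER, BT_INODE, BT_JOURNAL, BT_DATA, BT_OTHER };

    enum block_type classify(uint64_t blocknum);  /* hypothetical: block -> type */
    int disk_read(uint64_t blocknum, void *buf);  /* hypothetical: real disk read */

    static enum block_type target = BT_INODE;     /* e.g., fault all inode reads */

    int injected_read(uint64_t blocknum, void *buf)
    {
        if (classify(blocknum) == target)
            return -EIO;       /* inject a read error for the targeted block type */
        return disk_read(blocknum, buf);
    }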
Time Awareness

Time is key to testing as well
• e.g., update sequence
Idea: Build model of file system I/O activity
[Figure: simplified journaling write sequence — journal writes (J), commit (C), checkpoint (K), superblock (S)]

Use the model to induce faults at crucial times
• Don’t miss interesting behaviors
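
To show what time-aware injection could look like in code, here is a sketch of a write-path shim that tracks the (simplified) journaling sequence and fails exactly the commit write, a moment random injection would rarely hit. The helpers are hypothetical.

    #include <errno.h>
    #include <stdint.h>

    enum jstate { IDLE, JOURNALING, COMMITTED };

    int in_journal(uint64_t blocknum);                   /* hypothetical: journal area? */
    int is_commit_record(const void *buf);               /* hypothetical: commit magic? */
    int disk_write(uint64_t blocknum, const void *buf);  /* hypothetical */

    static enum jstate state = IDLE;

    int injected_write(uint64_t blocknum, const void *buf)
    {
        if (in_journal(blocknum)) {
            if (state == JOURNALING && is_commit_record(buf)) {
                state = COMMITTED;
                return -EIO;      /* fail the commit write at the crucial moment */
            }
            state = JOURNALING;   /* journal (J) blocks precede the commit (C) */
        }
        return disk_write(blocknum, buf);
    }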
Making It Comprehensive
Workloads
• Exercise as much of the FS as possible

Two types of workloads:
• Singlets: Stress a single system call (open, lstat, rename, symlink, write, etc.)
• Generics: Stress common functionality (path traversal, recovery, log writes, etc.)
Injecting Faults
Disk: Hard to do -> it’s hardware

Software approach:
• Easy
• Desirable

Fail-partial faults to inject:
• Read, write errors
• Read corruption
[Figure: the storage stack again, with a software Fault Injector layer inserted between the file system and the disk]
The File Systems We Tested
Linux ext3
• Popular, simple, compatible Linux file system

Linux ReiserFS
• Scalable, “database-like” file system

Linux IBM JFS
• Big Blue’s classic journaling file system

Windows NTFS
• Yes, a non-Linux file system
Result Matrix
[Figure: result matrix — rows are data structures (e.g., inode), columns are workloads (e.g., read()); each cell is marked with the observed technique: Zero, Stop, Propagate, Retry, Redundancy, or N/A]
Read Errors: Recovery
Ext3: Stop and propagate (doesn’t tolerate transience)

ReiserFS: Mostly propagate

JFS: Stop, propagate, retry

All: Some cases missed
[Figure: read-error recovery matrix for Ext3, ReiserFS, and JFS — techniques: Zero, Stop, Propagate, Retry, Redundancy]
Write Errors: Recovery
Ext3/JFS: Ignore write faults
• No detection -> no recovery
• Can corrupt the entire volume

ReiserFS: Always calls panic
• Exception: indirect blocks
[Figure: write-error recovery matrix for Ext3, ReiserFS, and JFS — techniques: Zero, Stop, Propagate, Retry, Redundancy]
Corruption: Recovery
Ext3/Reiser/JFS:
• Some sanity checking used
• Stop/Propagate common
• Sanity checking not enough
[Figure: corruption recovery matrix for Ext3, ReiserFS, and JFS — techniques: Zero, Stop, Propagate, Retry, Redundancy]
File System Specific Results
Ext3: Overall simplicity
• Checks error codes, modest sanity checking, propagates errors, aborts the operation
• Overreacts on read errors -> halts instead of propagating
• But some write errors are ignored

ReiserFS: First, do no harm
• At the slightest sign of failure, panic() the file system
• Preserves integrity; overreacts to transients

IBM JFS: The kitchen sink
• Uses the broadest range of techniques

Windows NTFS: Persistence is a virtue
• Liberal retry (understands disks can be flaky)
General Results (1 of 3)
Illogical inconsistency is common
• Similar faults -> different reactions (e.g., JFS failed read of superblock)

Bugs are common
• Code not stress-tested enough? (e.g., ReiserFS indirect-block code paths)

Error codes are sometimes ignored
• Highly surprising: Easiest to detect (but sometimes hard to act upon)
General Results (2 of 3)
Sanity checking is of limited utility
• Doesn’t help if you read the right type of structure but the wrong block
• Hard to do for some structures (e.g., bitmaps)

Stop is useful (if used correctly)
• ReiserFS halts on write errors
• Ext3 tries to do this (but aborts too late)

Stop should not be overused
• Faults can be transient
• Faults can be sticky, too!
General Results (3 of 3)
Retry is underutilized
• JFS does it some, NTFS quite a bit
• But transient faults do occur

Automatic repair is rare
• Almost all “stop” actions involve administrator intervention/repair (running fsck, rebooting, etc.)

Redundancy is rarely used
• Only superblocks are replicated, and only sometimes
Towards an IRON File System
IRON ext3: ixt3
Prototype of an IRON file system
• First cut: Many other possibilities still exist

Start with Linux ext3
• Add checksums: To detect corruption
• Add replication: For important structures (e.g., meta-data)
• Add parity: For user data
Result: IRON ext3 (ixt3)
Ixt3 Implementation
Checksums:
• Initially written to the ext3 log, then checkpointed to their final location

Meta-data replicas:
• Written to a replica log, checkpointed later to their final on-disk location

Parity protection for data:
• One parity block per file, with an extra pointer in the inode (see the sketch below)

Performance issues:
• Space overhead: Low
• Time overhead?
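
For the parity piece, a minimal C sketch of the idea (not ixt3’s actual code): one parity block per file, computed as the XOR of all data blocks, so any single lost block can be rebuilt from the survivors plus parity.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 4096

    /* Parity block = XOR of all the file's data blocks. */
    void compute_parity(uint8_t (*blocks)[BLOCK_SIZE], int nblocks,
                        uint8_t parity[BLOCK_SIZE])
    {
        memset(parity, 0, BLOCK_SIZE);
        for (int b = 0; b < nblocks; b++)
            for (int i = 0; i < BLOCK_SIZE; i++)
                parity[i] ^= blocks[b][i];
    }

    /* Rebuild one lost block: XOR the parity with every surviving block. */
    void rebuild_block(uint8_t (*blocks)[BLOCK_SIZE], int nblocks, int lost,
                       const uint8_t parity[BLOCK_SIZE], uint8_t out[BLOCK_SIZE])
    {
        memcpy(out, parity, BLOCK_SIZE);
        for (int b = 0; b < nblocks; b++)
            if (b != lost)
                for (int i = 0; i < BLOCK_SIZE; i++)
                    out[i] ^= blocks[b][i];
    }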
Ixt3 Performance Evaluation
For “home use” or read-mostly workloads: No overhead
• Has a cost for write-intensive workloads
Workload     Metadata   Data   Both
SSH-Build      1.00     1.00   1.00
Web Server     1.00     1.00   1.00
PostMark       1.19     1.13   1.37
TPC-B          1.20     1.10   1.42

(Running time relative to unmodified ext3; columns show protection applied to meta-data only, user data only, or both.)
Wrapping Up
Summary
File systems are important
• Used everywhere, in many different ways

Disks fail in interesting ways
• New model: Fail-partial failure model

Local file systems: Not ready for local faults
• Illogical inconsistencies, bugs, and little recovery

Need: IRON file systems
• Ixt3: Low-cost protection from partial failures
Challenges and Directions
Need to rethink how we build file systems
• Performance policy isn’t the only policy
• Fault-handling policy is critical

Testing and beyond testing
• Failure handling must be tested (continuously?)
• Beyond testing: Code analysis too?

Guiding principles
• Lessons from networking
• Put simply: Don’t trust the disk
ADvanced Systems Lab (ADSL)
www.cs.wisc.edu/adsl
ADvanced Systems Lab (ADSL)
Who did the real work:
• Nitin Agrawal
• Lakshmi Bairavasundaram
• Haryadi Gunawi
• Vijayan Prabhakaran
Backup Slides
Read Errors: Detection Techniques
Across all three file systems:
• Error codes are checked for read errors (rarely ignored)

[Figure: read-error detection matrix for Ext3, ReiserFS, and JFS — detection levels: Zero, ErrorCode, Sanity]
Write Errors: Detection Techniques
Ext3 and JFS ignore write errors!
• Either ignored altogether or not used meaningfully

ReiserFS: Much more careful

[Figure: write-error detection matrix for Ext3, ReiserFS, and JFS — detection levels: Zero, ErrorCode, Sanity]
Corruption: Detection Techniques
Sanity checking is used across all three file systems

Sanity checking is not sufficient
• e.g., when you read a block of a similar type

[Figure: corruption detection matrix for Ext3, ReiserFS, and JFS — detection levels: Zero, ErrorCode, Sanity]
File Systems: The Manager of Your Data
Why File Systems Are Important
The file system: The manager of “most” data
• Consists of named files: Linear arrays of bytes
• Organized into directories: /this/is/my/file
• Access methods: open(), read(), write(), close()

Where we use them: Everywhere
• Home use: Photos, tax returns, home movies
• Servers: Network file servers, the Google search engine

Why we use them:
• Simple, convenient
• Good performance: Subject of much research
• Reliable? Depends on how disks fail…
File System Background
Meta-data: Structures the file system uses to track what it needs to track
• Superblock: File-system wide parameters
• Inodes: Information about a file
• Data: Blocks to hold user data

[Figure: on-disk layout — superblock, inodes, data blocks]
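
For readers new to these structures, a toy C rendering of the pieces named above; real layouts (e.g., ext2’s) carry many more fields.

    #include <stdint.h>

    struct superblock {          /* file-system wide parameters */
        uint32_t block_count;
        uint32_t inode_count;
        uint32_t block_size;
        uint16_t magic;          /* value used for sanity checks */
    };

    struct inode {               /* information about one file */
        uint32_t size;           /* file length in bytes */
        uint32_t direct[12];     /* direct pointers to data blocks */
        uint32_t indirect;       /* pointer to a block of block pointers */
    };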