IRON File Systems Remzi Arpaci-Dusseau University of Wisconsin, Madison


Page 1:

IRON File Systems

Remzi Arpaci-Dusseau

University of Wisconsin, Madison

Page 2:

Understanding How Things Fail Is Important

Page 3:

How Disks Fail

Page 4:

Classic Failure Model: “Fail Stop”

As defined [Schneider ‘90]:
• Stop: Upon failure, halt
• Make known: But first, switch to a state s.t. other components can detect that you have failed

Very simple model of disk failure
• Used by all early file and storage systems (once controllers could detect failure)
• But is it realistic?

Page 5:

Assertion: Modern Disks Are Not Whole-Disk Fail Stop

Page 6:

Real Failures

Latent sector errors [Kari ’93, Bairavasundaram ‘07]

• A block or blocks become inaccessible

Data corruption [Weinberg ‘04, Greene ’05, Bairavasundaram ‘08]

• Controller bugs, not bit rot

Transient errors too [Talagala ‘99]

• Bus stuttering, etc.

Result: Partial failures are a reality

Page 7:

So What Should We Do?

Page 8:

High-end Systems: Extra Measures

Disk scrubbing [Kari ‘93]
• Proactively scan drives in search of latent errors
• When detected, correct from a redundant copy on another disk

Extra redundancy [Corbett ‘04]
• RAID system with two parity disks

Checksums [Bartlett ‘04, Weinberg ‘04]
• Extra computation over data
• Guard against corruption
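The checksum idea can be sketched in a few lines of C. This is an illustrative Fletcher-style sum, not the function any particular system uses; the point is computed redundancy that detects corruption on read.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch only: a Fletcher-style checksum over a block.
 * Real systems use stronger functions; this just shows the idea of
 * computed redundancy guarding against corruption. */
static uint32_t block_checksum(const uint8_t *buf, size_t len)
{
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % 65535;
        b = (b + a) % 65535;
    }
    return (b << 16) | a;
}

/* On every read, recompute and compare with the stored value. */
static int block_is_corrupt(const uint8_t *buf, size_t len,
                            uint32_t stored)
{
    return block_checksum(buf, len) != stored;
}
```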

Page 9:

But What About Desktop File Systems?

Page 10:

Desktop FS’s: Lost In The Past?

Desktop file systems are important
• Home use: Photos, movies, tax returns, ...
• Cluster use too: GoogleFS built on local FS’s

Performance policies are well known
• e.g., FFS placement policy

But what is their fault-handling policy?
• Do they handle partial disk failures?
• How can we tell?

Page 11:

Two Questions

Page 12:

Questions I Will Answer

Question 1: How do local file systems react to the more realistic set of disk failures?

Question 2: How can we change file systems to better handle these types of faults?

Page 13:

How Disks Fail: The Details

Page 14:

The Storage Stack

Not just a file system on top of the disk
• Many layers

Lots of software
• Even within the disk!

Failures occur at all levels

[Figure: the storage stack, from host to disk: File System, Generic I/O, Device Driver, Device Controller, Transport, Cache, Firmware, Electrical, Mechanical, Media]

Page 15:

Latent Sector Errors

Disks experience partial failures
• “a small portion of data on disk becomes temporarily or permanently unavailable” [Corbett ‘04]

Root causes:
• Surface is scratched, inaccurate arm movement, interconnect problems

Bottom line: A single read or write can fail

Page 16:

Data Corruption

Sun’s ZFS [Weinberg ‘04]
• Misdirected writes: Right data, wrong location
• Phantom/Lost writes: “Yes I wrote the data!” (but didn’t)

EIDE interface on motherboards [Greene ‘05]
• Read reported as “done” when not (race)
• Similar problem at Google [Ghemawat ‘03]

Network Appliance [Lewis ‘99]
• Disk occasionally returns byte-shifted data

Page 17:

Transient Errors

18-month study of a large disk farm [Talagala ‘99]
• Most machines had SCSI timeout errors (loose cables, bad cables?)
• SCSI parity errors were common too (data corrupted when moving across the bus)

Failures can be transient too
• Might work if just retried
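The recovery this suggests is a bounded retry loop; a sketch, where `sim_disk_read` is a stand-in for a real block read that illustrates a transient fault:

```c
#include <unistd.h>

#define MAX_RETRIES 3

/* For illustration: a simulated device whose first reads fail with a
 * transient fault, then succeed. */
static int sim_fail_count = 2;
static int sim_disk_read(long blk)
{
    (void)blk;
    if (sim_fail_count > 0) { sim_fail_count--; return -1; }
    return 0;
}

/* Sketch: retry assumes the fault is transient; if it is sticky,
 * give up after a few attempts and propagate the error instead. */
static int read_with_retry(int (*disk_read)(long), long blk)
{
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        if (disk_read(blk) == 0)
            return 0;              /* the retry masked the fault */
        usleep(1000 << attempt);   /* brief backoff between tries */
    }
    return -1;                     /* sticky: caller must handle it */
}
```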

Page 18:

Even Worse With ATA (Not SCSI)

ATA drives: Less reliable [Anderson ‘03, Hughes & Murray ‘05]
• Few are returned for “failure analysis”
• Some are “partially flaw marked during testing”
• Test conditions not as harsh (power, temp.)
• High-end reliability features missing (filters: remove particles; chemicals: humidity)

Cheap disks -> less testing -> less reliability
• But cost drives many purchasing decisions…

Page 19:

Trend: More Problems, Not Less

Denser drives: Capacity sells drives
• More logic -> more complexity
• More complexity -> more bugs

Cost per byte dominates: “Pennies matter”
• Manufacturers will cut corners
• Reliability features are the first to go

Increasing amount of software:
• ~400K lines of code in a modern Seagate drive
• Hard to write, hard to debug

Page 20:

The Fail-Partial Failure Model

Page 21:

The Fail-Partial Failure Model

Disk failure:
• Entire disk may fail

Block failure:
• Part of the disk may fail

Block corruption:
• Part of the disk may get corrupted

All can be either transient or sticky

Page 22:

Important Parameters

Locality
• Are partial faults independent of each other?

Frequency
• How often do partial faults occur?

Page 23:

Frequency of Failures

Study of latent sector errors [Bairavasundaram et al. ‘07]
• 1.53 million disks, 3+ years of data
• ATA: 8.5% - SCSI: 1.9%
• Latent sector errors are not independent
• Spatial locality exists; disk capacity matters

Study of block corruption [Bairavasundaram et al. ‘08]
• Same data set
• ATA: 0.6% - SCSI: 0.06%
• Corruptions within a disk are not independent
• Spatial locality exists
• The “bad block number” problem

Page 24:

How Do File Systems React To Partial Failures?

Page 25:

How To Detect & Handle Failures?

Need: Classification of techniques
• Detection: Discovering that a failure took place
• Recovery: Recovering from the failure

Detection + Recovery = IRON
• File systems with Internal RObustNess
• IRON taxonomy: Classify techniques

Page 26:

IRON Detection Taxonomy

How to detect block failure or corruption?

Possible strategies:
• Zero: No detection technique used
• Error Code: Check return codes from disk
• Sanity: Check data structures for consistency
• Redundancy: Add checksums or other forms of computed replication to detect problems

Page 27:

IRON Recovery Taxonomy

How to recover from a detected failure?

Possible strategies:
• Zero: Don’t do anything
• Propagate: Pass error on to higher level
• Stop: Halt activity (“fail stop”)
• Guess: Manufacture data, return to user
• Retry: Assume failure is transient
• Repair: If inconsistency is detected
• Remap: Redirect to another block
• Redundancy: Use another copy of block
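One way to make such a policy explicit in code is a per-structure recovery table. This is a hypothetical sketch of the taxonomy, not how any of the tested file systems are organized:

```c
/* The recovery strategies of the IRON taxonomy. */
enum iron_recovery {
    R_ZERO,        /* don't do anything */
    R_PROPAGATE,   /* pass error on to higher level */
    R_STOP,        /* halt activity ("fail stop") */
    R_GUESS,       /* manufacture data, return to user */
    R_RETRY,       /* assume failure is transient */
    R_REPAIR,      /* fix a detected inconsistency */
    R_REMAP,       /* redirect to another block */
    R_REDUNDANCY,  /* use another copy of the block */
};

/* Hypothetical per-block-type policy: different structures can be
 * given different recovery strategies. */
enum block_type { B_SUPER, B_INODE, B_DATA, B_NTYPES };

static const enum iron_recovery policy[B_NTYPES] = {
    [B_SUPER] = R_REDUNDANCY,  /* superblock: fall back to a replica */
    [B_INODE] = R_RETRY,       /* inode: the fault may be transient */
    [B_DATA]  = R_PROPAGATE,   /* data: report the error upward */
};
```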

Page 28:

What IRON Techniques Do Modern File Systems Use?

Page 29:

Fault Injection

Typical fault injection:
• Insert failures at random disk locations/times
• Watch the system to see what happens

Not good enough:
• May miss interesting behavior
• May find problems, but not explain them

What we do: Space- and time-aware injection
• A “gray box” approach to testing

Page 30:

Space Awareness

File systems are comprised of many on-disk structures
• e.g., superblocks, inodes, etc.

Idea: Make the fault-injection layer aware of file system structures
• Inject faults across all block types

[Figure: on-disk layout: Super, Inodes, Data]
Page 31:

Time Awareness

Time is key to testing as well
• e.g., update sequence

Idea: Build a model of file system I/O activity

[Figure: simplified journaling write model with states S1, S2 and transitions labeled J (Journal), C (Commit), K (Checkpoint), S (Superblock)]

Use the model to induce faults at crucial times
• Don’t miss interesting behaviors
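A minimal version of such a model is a state machine over the journaling write sequence. States follow the legend above; the transitions here are a simplified assumption, not ext3’s actual protocol:

```c
/* Simplified model of a journaling write sequence: journal writes,
 * then the commit block, then checkpoint to the final location.
 * The linear transition order is an assumption for illustration. */
enum jstate { ST_JOURNAL, ST_COMMIT, ST_CHECKPOINT, ST_DONE };

static enum jstate next_state(enum jstate s)
{
    switch (s) {
    case ST_JOURNAL:    return ST_COMMIT;      /* J -> C */
    case ST_COMMIT:     return ST_CHECKPOINT;  /* C -> K */
    case ST_CHECKPOINT: return ST_DONE;        /* K -> done */
    default:            return ST_DONE;
    }
}

/* Count how many writes happen before `target`, so a fault can be
 * injected at exactly that step of the sequence. */
static int steps_until(enum jstate target)
{
    int n = 0;
    for (enum jstate s = ST_JOURNAL; s != ST_DONE; s = next_state(s), n++)
        if (s == target)
            return n;
    return -1;  /* target not reached in this run */
}
```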

Page 32:

Making It Comprehensive

Workloads
• Exercise as much of the FS as possible

Two types of workloads
• Singlets: Stress a single system call (open, lstat, rename, symlink, write, etc.)
• Generics: Stress common functionality (path traversal, recovery, log writes, etc.)

Page 33:

Injecting Faults

Disk: Hard to do -> it’s hardware

Software approach:
• Easy
• Desirable

Fail-partial faults:
• Read, write errors
• Read corruption

[Figure: the same storage stack, with a software fault-injection layer added between the file system and the disk]

Page 34:

The File Systems We Tested

Linux ext3
• Popular, simple, compatible Linux file system

Linux ReiserFS
• Scalable, “database-like” file system

Linux IBM JFS
• Big Blue’s classic journaling file system

Windows NTFS
• Yes, a non-Linux file system

Page 35:

Result Matrix

[Figure: the result matrix. Rows: file system data structures (e.g., inode). Columns: workloads (e.g., read()). Each cell records the observed technique: Zero, Stop, Propagate, Retry, Redundancy, or N/A]
Page 36:

Read Errors: Recovery

Ext3: Stop and propagate (don’t tolerate transience)

ReiserFS: Mostly propagate

JFS: Stop, propagate, retry

All: Some cases missed

[Figure: read-error recovery matrices for Ext3, ReiserFS, and JFS; cell values are Zero, Stop, Propagate, Retry, or Redundancy]

Page 37:

Write Errors: Recovery

Ext3/JFS: Ignore write faults
• No detection -> no recovery
• Can corrupt the entire volume

ReiserFS always calls panic
• Exception: indirect blocks

[Figure: write-error recovery matrices for Ext3, ReiserFS, and JFS]

Page 38:

Corruption: Recovery

Ext3/Reiser/JFS:
• Some sanity checking used
• Stop/Propagate common
• Sanity checking not enough

[Figure: corruption recovery matrices for Ext3, ReiserFS, and JFS]

Page 39:

File System Specific Results

Ext3: Overall simplicity
• Checks error codes, modest sanity checking, propagates errors, aborts operation
• Overreacts to read errors -> halts instead of propagating
• But some write errors are ignored

ReiserFS: First, do no harm
• At the slightest sign of failure, panic() the file system
• Preserves integrity; overreacts to transients

IBM JFS: The kitchen sink
• Uses the broadest range of techniques

Windows NTFS: Persistence is a virtue
• Liberal retry (understands disks can be flaky)

Page 40:

General Results (1 of 3)

Illogical inconsistency is common
• Similar faults -> different reactions (e.g., JFS failed read of superblock)

Bugs are common
• Code not stress-tested enough? (e.g., ReiserFS indirect block code paths)

Error codes are sometimes ignored
• Highly surprising: Easiest to detect (but sometimes hard to act upon)

Page 41:

General Results (2 of 3)

Sanity checking is of limited utility
• Doesn’t help if you read the right type but the wrong block
• Hard to do for some structures (e.g., bitmaps)
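The limit is easy to see in code: a sanity check on a hypothetical inode catches wild field values, but a perfectly valid inode read from the wrong block passes every test.

```c
#include <stdint.h>

/* A hypothetical on-disk inode, invented for illustration. */
struct demo_inode {
    uint16_t mode;    /* type and permission bits */
    uint32_t size;    /* length in bytes */
    uint32_t blocks;  /* allocated block count */
};

/* Sanity check: rejects impossible field values, but a valid inode
 * fetched from the wrong location still passes. */
static int inode_sane(const struct demo_inode *i, uint32_t max_size)
{
    if (i->size > max_size)
        return 0;  /* larger than the file system allows */
    if (i->blocks > i->size / 512 + 8)
        return 0;  /* far more blocks than the size needs */
    return 1;      /* plausible -- but maybe the wrong inode */
}
```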

Stop is useful (if used correctly)
• ReiserFS halts on write errors
• Ext3 tries to do this (but aborts too late)

Stop should not be overused
• Faults can be transient
• Faults can be sticky, too!

Page 42:

General Results (3 of 3)

Retry is underutilized
• JFS does it some, NTFS quite a bit
• But transient faults occur

Automatic repair is rare
• Almost all “stop” actions involve administrator intervention/repair (running fsck, reboot, etc.)

Redundancy is rarely used
• Only superblocks are replicated, sometimes

Page 43:

Towards an IRON File System

Page 44:

IRON ext3: ixt3

Prototype of an IRON file system
• First cut: Many other possibilities still exist

Start with Linux ext3
• Add checksums: To detect corruption
• Add replication: For important structures (e.g., meta-data)
• Add parity: For user data

Result: IRON ext3 (ixt3)

Page 45:

Ixt3 Implementation

Checksums:
• Initially write to the ext3 log, then checkpoint them to their final location

Meta-data replicas:
• Write to a replica log, checkpoint later to their final on-disk location

Parity protection for data:
• One parity block per file, extra pointer in the inode
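The data-parity scheme can be sketched with XOR: one parity block per file lets any single lost data block be rebuilt from the parity and the survivors. Block size and function names here are illustrative:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BSIZE 4096  /* illustrative block size */

/* Parity P = D0 ^ D1 ^ ... ^ D(n-1), one parity block per file. */
static void parity_compute(uint8_t p[BSIZE],
                           const uint8_t blocks[][BSIZE], size_t n)
{
    memset(p, 0, BSIZE);
    for (size_t b = 0; b < n; b++)
        for (size_t i = 0; i < BSIZE; i++)
            p[i] ^= blocks[b][i];
}

/* If one data block is lost, XOR the parity with the survivors. */
static void parity_rebuild(uint8_t out[BSIZE], const uint8_t p[BSIZE],
                           const uint8_t blocks[][BSIZE],
                           size_t n, size_t lost)
{
    memcpy(out, p, BSIZE);
    for (size_t b = 0; b < n; b++)
        if (b != lost)
            for (size_t i = 0; i < BSIZE; i++)
                out[i] ^= blocks[b][i];
}
```

Note that, unlike parity across disks in RAID, a single in-file parity block tolerates the loss of only one data block per file.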

Performance issues:
• Space overhead: Low
• Time overhead?

Page 46:

Ixt3 Performance Evaluation

For “home use” or read-mostly workloads: No overhead
• Has cost for write-intensive workloads

Slowdown (1.00 = no overhead):

Workload     Metadata   Data   Both
SSH-Build    1.00       1.00   1.00
Web Server   1.00       1.00   1.00
PostMark     1.19       1.13   1.37
TPC-B        1.20       1.10   1.42

Page 47:

Wrapping Up

Page 48:

Summary

File systems are important
• Used everywhere, in many different ways

Disks fail in interesting ways
• New model: Fail-partial failure model

Local file systems: Not ready for local faults
• Illogical inconsistencies, bugs, and little recovery

Need: IRON file systems
• Ixt3: Low-cost protection from partial failures

Page 49:

Challenges and Directions

Need to rethink how we build file systems
• Performance policy isn’t the only policy
• Fault-handling policy is critical

Testing and beyond testing
• Failure handling must be tested (continuously?)
• Beyond testing: Code analysis too?

Guiding principles
• Lessons from networking
• Put simply: Don’t trust the disk

Page 50:

ADvanced Systems Lab (ADSL)

www.cs.wisc.edu/adsl

Page 51:

ADvanced Systems Lab (ADSL)

Who did the real work:
• Nitin Agrawal
• Lakshmi Bairavasundaram
• Haryadi Gunawi
• Vijayan Prabhakaran

Page 52:

Backup Slides

Page 53:

Read Errors: Detection Techniques

Across all three file systems:
• Error codes checked for read errors (rarely ignored)

[Figure: read-error detection matrix for Ext3, ReiserFS, and JFS; rows are Zero, ErrorCode, Sanity]

Page 54:

Write Errors: Detection Techniques

Ext3, JFS ignore write errors!
• Either ignored altogether or not used meaningfully

ReiserFS: Much more careful

[Figure: write-error detection matrix for Ext3, ReiserFS, and JFS; rows are Zero, ErrorCode, Sanity]

Page 55:

Corruption: Detection Techniques

Sanity checking used across all three file systems

Sanity checking not sufficient
• e.g., when you read a block of a similar type

[Figure: corruption detection matrix for Ext3, ReiserFS, and JFS; rows are Zero, ErrorCode, Sanity]

Page 56:

File Systems: The Manager of Your Data

Page 57:

Why File Systems Are Important

The file system: The manager of “most” data
• Consists of named files: Linear array of bytes
• Organized in directories: /this/is/my/file
• Access methods: open(), read(), write(), close()
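The access methods listed above in a minimal, self-contained example (the helper name and scratch-file path are made up for demonstration):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* The classic interface: open a named file, read bytes, close.
 * Every one of these calls can fail and returns an error code. */
static int read_first_bytes(const char *path, char *buf, size_t n)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;                 /* e.g., no such file */
    ssize_t got = read(fd, buf, n);
    close(fd);
    return got < 0 ? -1 : (int)got;
}
```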

Where we use them: Everywhere
• Home use: Photos, tax returns, home movies
• Servers: Network file servers, Google search engine

Why we use them:
• Simple, convenient
• Good performance: Subject of much research
• Reliable? Depends on how disks fail…

Page 58:

File System Background

Meta-data: Structures the file system uses to track what it needs to track
• Superblock: File-system-wide parameters
• Inodes: Information about a file
• Data: Blocks to hold user data

[Figure: on-disk layout: Super, Inodes, Data]