Coerced Cache Eviction and Discreet-Mode Journaling: Dealing with Misbehaving Disks
Abhishek Rajimwale*, Vijay Chidambaram, Deepak Ramamurthi, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau
*Data Domain Inc.; University of Wisconsin-Madison
Disks are not perfect
DSN 11 2 3/13/12
• Expanding disk fault model
• Latent Sector Errors [Bairavasundaram SIGMETRICS 07] – RAID-‐6
• Block Corruption [Bairavasundaram FAST 08] – Checksums
• The disk cache – Always trusted so far
[Figure: disk surface and disk cache]
Disk Caches
• Disk cache improves performance
  – But at the risk of data loss
• Order of writes issued by file system:
  – A, B, C
• Disks reorder writes during destaging:
  – B, A, C
• File systems flush the disk cache to ensure correct ordering of writes
  – A, flush, B, flush, C
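The "A, flush, B, flush, C" pattern can be sketched from user space as follows. This is a minimal illustration, not the file system's own code: `write_in_order` is a hypothetical helper, and whether `os.fsync` actually reaches the drive's cache depends on the operating system, the file system, and the drive itself.

```python
import os

def write_in_order(path, blocks):
    """Write each block and request a flush before issuing the next one."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for block in blocks:
            view = memoryview(block)
            while view:                      # os.write may write fewer bytes than asked
                written = os.write(fd, view)
                view = view[written:]
            # Ordering point: ask the OS to push this block to stable storage
            # before the next block is written. A misbehaving disk may still
            # hold the data in its own cache.
            os.fsync(fd)
    finally:
        os.close(fd)
```

Calling `write_in_order("demo.bin", [b"A", b"B", b"C"])` issues three writes separated by flush requests, which is the ordering a well-behaved disk would honor.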
[Figure: write to disk, showing the disk cache and the disk surface]
Problem: Flushing doesn’t work
• Disks can fail to flush data upon request
• One reason: Bugs
  – Errors in the storage stack [Bairavasundaram FAST 08]
  – Improper propagation of error codes [Bairavasundaram FAST 08]
  – Inadequate failure policies [Prabhakaran SOSP 05]
  – Bugs in the firmware [Ghemawat SOSP 03]
Disks can lie!
[Figure: sequential writes; average time (msec, axis 0 to 50) versus write size (4k, 16k, 64k, 128k, 512k, 1m), with and without the disk cache]
• Misbehaving disks ignore or delay flush requests
• Increases risk of data loss
  – File systems usually blamed for such loss
Disks can lie!
F_FULLFSYNC
• From the fcntl man page in Mac OSX: Does the same thing as fsync(2) then asks the drive to flush all buffered data to the permanent storage device (arg is ignored). This is currently implemented on HFS, MS-DOS (FAT), and Universal Disk Format (UDF) file systems. The operation may take quite a while to complete. Certain FireWire drives have also been known to ignore the request to flush their buffered data.
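An application can request the stronger flush where the platform offers it. The sketch below is only illustrative: `full_sync` is a hypothetical helper name, `F_FULLFSYNC` is looked up dynamically because Python's `fcntl` module exposes it only on macOS, and, as the man page warns, a drive may still ignore the request.

```python
import fcntl
import os

def full_sync(fd):
    """Flush fd as strongly as the platform allows; return the method used."""
    full = getattr(fcntl, "F_FULLFSYNC", None)   # defined on macOS only
    if full is not None:
        try:
            fcntl.fcntl(fd, full)                # fsync plus a drive-level flush request
            return "F_FULLFSYNC"
        except OSError:
            pass                                 # not supported on this file system
    os.fsync(fd)                                 # portable fallback
    return "fsync"
```

The return value only reports which call succeeded; it says nothing about whether the drive actually emptied its cache.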
• Evidence from industry experts
  – Microsoft
  – Seagate
Ordering points are essential
• All modern file systems depend on ordering points
  – Journaling file systems (ext3, ext4)
    • Data before the commit block
  – Copy-on-write file systems (ZFS)
    • Data before the uber-block
• If ordering points are not enforced:
  – Data corruption
  – Inconsistent file system
Summary
• We present Coerced Cache Eviction (CCE)
  – Write extra data into the cache to evict target blocks
• We show how to characterize the caches of 9 SATA disk drives
  – Examine the wide range of caching policies
• We implement CCE in ext3
  – Well-known journaling file system
• CCE provides stronger enforcement for ordering points
  – At acceptable overheads
Outline
• Motivation
• Background
• Coerced Cache Eviction
• Cache Fingerprinting
• Discreet Mode Journaling
• Evaluation
• Conclusion
File System Background
• Consider deleting a file
  – Removing its directory entry
  – Freeing the space occupied by the file and its metadata
• Journaling file system
  – Makes sure all changes get to disk or none do
  – Groups writes into transactions
  – Writes everything to a log first
  – Checkpoints to disk later
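The log-first, checkpoint-later idea can be illustrated with a toy in-memory model. This is a sketch under simplifying assumptions, not ext3's on-disk format: `ToyJournal`, its record tuples, and its method names are all hypothetical.

```python
class ToyJournal:
    """In-memory model of a journal: log writes first, checkpoint later."""

    def __init__(self):
        self.log = []     # append-only journal records
        self.disk = {}    # fixed (home) locations: block number -> data

    def log_transaction(self, txid, writes, commit=True):
        """Append every block of the transaction, then its commit record."""
        for blockno, data in writes.items():
            self.log.append(("data", txid, blockno, data))
        if commit:
            # The commit record must reach disk only after the data records.
            self.log.append(("commit", txid))

    def checkpoint(self):
        """Copy blocks of committed transactions to their fixed locations."""
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        for rec in self.log:
            if rec[0] == "data" and rec[1] in committed:
                _, _, blockno, data = rec
                self.disk[blockno] = data
```

A transaction logged with `commit=False` models a crash before the commit record was written: `checkpoint` skips it, so either all of a transaction's changes reach the fixed locations or none do.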
File System Background
• Ext3 file system
  – Semi-modern journaling file system
  – Well known, well understood
• Variants of journaling
  – Data journaling mode
    • Everything (data, metadata) goes to the log first
  – Ordered journaling mode
    • Only metadata is logged
Data Journaling
[Figure: memory holding blocks labeled B, D, M, and C; disk cache; disk surface with the journal and fixed locations]
Coerced Cache Eviction
• Ensures that cache has been truly flushed
• Key idea:
  – Extra writes to flush the disk cache
  – Desired order of writes: A, B, C
  – With CCE:
    • Write A
    • Write to flush zone
    • Write B
    • Write to flush zone
    • Write C
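The write sequence above can be tried against a toy model of a disk that ignores flush requests. Everything here is an assumption made for illustration: `LyingDisk` is hypothetical, its cache evicts the least recently written block, and the flush zone holds exactly as many blocks as the cache. Real caches are more complex, which is why the disk has to be fingerprinted.

```python
from collections import OrderedDict

class LyingDisk:
    """Toy disk whose write-back cache ignores flush requests."""

    def __init__(self, cache_blocks):
        self.cache_blocks = cache_blocks
        self.cache = OrderedDict()   # block -> data, least recently written first
        self.destaged = []           # order in which blocks reached the surface

    def write(self, block, data):
        self.cache[block] = data
        self.cache.move_to_end(block)
        while len(self.cache) > self.cache_blocks:
            old_block, _ = self.cache.popitem(last=False)
            self.destaged.append(old_block)

    def flush(self):
        pass  # acknowledges the flush but leaves the cache untouched

def write_with_cce(disk, ordered_blocks, flush_zone):
    """Follow each target write with a full flush-zone workload."""
    for block in ordered_blocks:
        disk.write(block, b"payload")
        for fz_block in flush_zone:
            disk.write(fz_block, b"F")   # extra writes that push the target out
```

With a four-block cache and a four-block flush zone, `write_with_cce(disk, ["A", "B", "C"], ["F0", "F1", "F2", "F3"])` leaves A, B, and C in `disk.destaged` in that order, even though `flush()` does nothing.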
Coerced Cache Eviction
[Figure: memory holding blocks labeled B, D, M, C, and flush blocks F; disk cache; disk surface with the journal, fixed locations, and the flush zone]
Coerced Cache Eviction
• Desired properties:
  – High probability of flushing target blocks
  – Low performance overhead
• Need to understand the disk cache to design the flush workload
Cache Fingerprinting
• Manufacturers don’t expose details about disk caches
• Disk caches can vary in:
  – Read/Write partition size
  – Number of segments
  – Replacement policy
• Poorly characterized in literature
Cache Fingerprinting
• Flush micro-benchmark:
  – Write target block
  – Write varied flush workload (measure cost)
  – fsync()
  – Read target (infer eviction)
• Micro-benchmark is repeated
  – Probability of eviction is calculated
• Vary in each workload:
  – Number of writes
  – Amount of data in each write
  – Sequential/Random writes
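The repeat-and-count structure of the micro-benchmark can be sketched against a simulated cache. This is not the benchmark itself: the real one runs against a physical drive and infers eviction from reading the target, whereas this sketch inspects a hypothetical `RandomReplacementCache` directly. The slot count, trial count, and seed are arbitrary assumptions.

```python
import random

class RandomReplacementCache:
    """Simulated write cache that evicts a random block when full."""

    def __init__(self, slots, rng):
        self.slots = slots
        self.rng = rng
        self.blocks = []

    def write(self, block):
        if block in self.blocks:
            return
        if len(self.blocks) >= self.slots:
            self.blocks.pop(self.rng.randrange(len(self.blocks)))
        self.blocks.append(block)

    def holds(self, block):
        return block in self.blocks

def eviction_probability(slots, num_flush_writes, trials=1000, seed=0):
    """Repeat: write target, write flush workload, check whether target left."""
    rng = random.Random(seed)
    evicted = 0
    for _ in range(trials):
        cache = RandomReplacementCache(slots, rng)
        cache.write("target")
        for i in range(num_flush_writes):
            cache.write(("flush", i))
        if not cache.holds("target"):
            evicted += 1
    return evicted / trials
```

In this model the probability stays at zero until the flush workload fills the cache, then rises with the number of writes without ever being guaranteed to reach 100%. Sweeping `num_flush_writes` gives one row of an eviction fingerprint.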
Cache Fingerprinting
• Eviction fingerprint
  – Probability of eviction is visually shown
  – Darker region indicates higher probability
[Figure: eviction fingerprint; legend bands for eviction probability: 90–100%, 70–90%, 50–70%, 30–50%, 10–30%, 0–10%]
Cache Fingerprinting
• Performance fingerprint
  – Time taken to write flush workload
  – Darker region indicates more time
[Figure: performance fingerprint; legend bands for flush latency: 500+ ms, 100–500 ms, 50–100 ms, 10–50 ms, 0–10 ms]
Cache Fingerprinting
• Selecting a flush workload:
  – Combine information from both fingerprints
  – High probability of eviction
    • Dark region in eviction fingerprint
  – Low performance cost
    • Light region in performance fingerprint
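One way to combine the two fingerprints is to keep only workloads whose eviction probability clears a threshold and then take the cheapest. The sketch below assumes each candidate is a dictionary with hypothetical keys `eviction_probability` and `latency_ms`; the 0.9 threshold is an arbitrary default, not a value from the talk.

```python
def select_flush_workload(candidates, min_probability=0.9):
    """Pick the lowest-latency workload that evicts reliably enough."""
    eligible = [
        c for c in candidates
        if c["eviction_probability"] >= min_probability
    ]
    if not eligible:
        return None   # no workload is reliable enough for this disk
    return min(eligible, key=lambda c: c["latency_ms"])
```

A `None` result corresponds to a disk for which no measured workload falls in a dark region of the eviction fingerprint.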
Cache Fingerprinting
Manufacturer      Cache (MB)   Capacity (GB)
Hitachi           8            80
Hitachi           32           1024
Samsung           8            250
Samsung           16           250
Western Digital   16           320
Western Digital   64           800
Seagate           8            250
Seagate           16           320
Seagate           32           750
Cache Fingerprinting
Sequential writes may be ineffective at flushing
  – Regardless of the size of the write
A number of random writes are required
[Figure: eviction fingerprint; eviction probability legend from 0–10% to 90–100%]
Cache Fingerprinting
Vertical stripes indicate that the cache is segmented
  – Each write, regardless of size, is sent to one segment
[Figure: eviction fingerprint; eviction probability legend from 0–10% to 90–100%]
Cache Fingerprinting
Cache behavior of disks from the same manufacturer is qualitatively similar across their different models
[Figure: eviction fingerprints; eviction probability legend from 0–10% to 90–100%]
Cache Fingerprinting
• It’s not all good news, however:
  – Some caches appear to use random replacement policies
  – For such caches, we cannot evict blocks with 100% certainty
  – A large number of random writes are required to get high eviction probability
Cache Fingerprinting – Results
Drive                   Number of writes   Total data (MB)   Eviction probability (%)   Time (s)
Hitachi 8 MB            1                  2.38              100                        0.05
Hitachi 32 MB           1                  11                100                        0.087
Seagate 8 MB            256                31                100                        0.87
Seagate 16 MB           128                17                100                        0.342
Seagate 64 MB           128                37                100                        0.396
Samsung 8 MB            128                49                ~90                        1.328
Samsung 16 MB           256                128               ~90                        2.872
Western Digital 16 MB   1792               19                ~90                        5.107
Western Digital 64 MB   256                1                 100                        7.705
Discreet Mode Journaling
• Incorporating CCE into ext3
  – Fingerprint the disk to find optimal flush workload
  – Create flush zone with suitable size
  – Modify ext3 to issue flush zone writes:
    • One at each ordering point
    • # of CCE operations = # of ordering points
• Can be used with any disk:
  – As long as the disk is fingerprinted first
Evaluation
• Goal:
  – CCE provides higher reliability
  – At what cost? Is it practical to use?
• Experimental setup:
  – File system: Ext3
  – Disk: Hitachi 8 MB
  – Journaling mode: Data journaling
    • (See paper for ordered journaling results)
  – Operating system: Linux 2.6.13, Linux 2.6.23
Evaluation
• What we compare:
  – Regular journaling with disk cache turned off
    • “Safe” but slow
    • Disk might not obey command to turn off cache!
  – Regular journaling with disk cache turned on
    • Unsafe but fast
  – Discreet mode journaling
    • Midway option: safe but with cost
Evaluation
• Benchmarks:
  – OpenSSH
    • copy, untar, configure, make
  – Postmark
    • Simulates a mail server
    • Single threaded
  – Filebench Webserver
    • I/O intensive
  – Filebench Varmail
    • Multithreaded postmark
Evaluation – OpenSSH
[Figure: data journaling mode; time (s), axis 0 to 45, for regular w/o cache, discreet, and regular w/ cache]
Evaluation – Postmark
[Figure: data journaling mode; time (s), axis 0 to 800, for regular w/o cache, discreet, and regular w/ cache]
Evaluation – Filebench Webserver
[Figure: data journaling mode; throughput (MB/s), axis 0 to 300, for regular w/o cache, discreet, and regular w/ cache]
Evaluation – Filebench Varmail
[Figure: data journaling mode; throughput (MB/s), axis 0 to 12, for regular w/o cache, discreet, and regular w/ cache]
Evaluation – Filebench Varmail
• Workload writes a small amount of data and calls fsync() repeatedly
• Each fsync() causes 3 CCEs
• A number of optimizations:
  – Incorporate group commit in varmail
    • Improves throughput for all modes
  – We use a few other techniques as well (see paper)
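A rough count shows why group commit helps a workload like this. The sketch assumes a simple model in which updates are committed in fixed-size batches with one fsync() per batch. `cce_operations` and `batch_size` are hypothetical names, and only the figure of 3 CCEs per fsync() comes from the slide.

```python
import math

def cce_operations(num_updates, batch_size=1, cces_per_fsync=3):
    """Count CCE operations when updates are committed in batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    fsyncs = math.ceil(num_updates / batch_size)   # one fsync() per batch
    return fsyncs * cces_per_fsync
```

Under this model, 1000 updates committed one at a time cost 3000 CCE operations, while batches of 10 cost 300.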
Evaluation – Filebench Varmail
[Figure: two panels, original performance and with optimizations; throughput (MB/s), axis 0 to 25, for regular w/o cache, discreet, and regular w/ cache]
Summary
• Coerced Cache Eviction (CCE):
  – Run file systems reliably on top of misbehaving disks
• Characterization of 9 SATA disk caches through fingerprints
• Discreet Mode Journaling:
  – Implementation of CCE for the ext3 file system
  – Acceptable performance on 3 workloads
    • Only if the cache doesn’t use random replacement
  – High overhead for apps which call fsync() frequently
Conclusion
• Trust in disks is weakening:
  – Latent Sector Errors
  – Block corruption
  – Cache flushing
• Cloud computing systems:
  – Virtualized hardware
  – Large software stack
• Can such hardware be trusted?
• Will coercion be more widely used?
Thank you!
Advanced Systems Lab (ADSL), University of Wisconsin-Madison, http://www.cs.wisc.edu/adsl