2012 Storage Developer Conference. © Carnegie Mellon University. All Rights Reserved.
Storage Systems for Shingled Disks
Garth Gibson Carnegie Mellon University
and Panasas Inc
Anand Suresh, Jainam Shah, Xu Zhang, Swapnil Patil, Greg Ganger
Kryder’s Law for Magnetic Disks
• Market expects ever more dense disks
• Future is multi-terabit per square inch
• Real challenge is making money at $100/disk when engineering is this hard
G. Gibson, Sept 2012
Directions in High Capacity Disks
• Heat-Assisted Magnetic Recording (HAMR)
  • Small bits need high-coercivity media to retain orientation
  • High coercivity can't be changed by normal writing
  • Heated media lowers coercivity
  • Include lasers?
• Bit-Patterned Media (BPM)
  • Small bits retain orientation more easily if kept apart
  • Pattern the media so only a single dot is written per bit
  • Tera-dots per sq. inch?
Shingled Magnetic Recording (SMR)
File systems do far too much small random writing
Disk becomes tape !!
What About Reading?
• Read head is possibly thinner than the write head
  • If the target is 2-3× density, maybe not too hard
• Targeting higher density sees lots of crosstalk
  • Signal processing in two dimensions (TDMR)
• One approach to TDMR involves gathering signal from 1-2 adjacent tracks on both sides
  • Means 3 to 5 revs to read a single sector
  • Not likely to be accepted by the marketplace
• Safe plan is to "see" the residual track with only one head
Geometry Model: Getting a handle on the parameters
Shingled writing: need big bands
• Reason for doing it: density
  • Shingling projected at 1.5-2.5× track density
• Can mix shingled and non-shingled tracks
  • so, e.g., separate sequential from random data
  • just lose some of the density gains
• Can break up sets of shingled tracks ("bands")
  • allowing overwrite of individual bands
  • but they need to be big… like 32 to 256 MB
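To get a feel for those sizes: with a circa-2012 track holding on the order of 1-2 MB, a band of a few dozen to a few hundred overlapped tracks lands in the quoted 32-256 MB range. A quick sketch, where the per-track capacity is an assumed round number, not a figure from the talk:

```python
# Back-of-envelope band sizing. TRACK_MB is an assumed average
# per-track capacity (illustrative, not from the slides).
TRACK_MB = 1.5
for B in (32, 64, 128, 256):                 # tracks per shingled band
    print(f"B={B:3d} tracks -> band ~{B * TRACK_MB:.0f} MB")
```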
Simple Geometry Model
• SMR allows wider write heads: w' > w
• SMR reduces gaps, g, from one per track to one per band (B tracks)
• Residual (readable) track width r after overlapping is a key factor
• A fraction f of tracks left unshingled allows some random sector writing
• SMR's increase in areal density follows from a simple model: a band of B shingled tracks occupies (B-1)·r + w' + g, versus B·(w + g) for conventional tracks, so the track-density gain is

    gain(B) = (w + g) / [ (1 - f)·((B - 1)·r + w' + g)/B + f·(w + g) ]
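The geometry model lends itself to a quick numeric check. The closed form below is a reconstruction from the slide's variable definitions (a band of B shingled tracks occupying (B-1)·r + w' + g, blended with a fraction f of conventional tracks), not the talk's exact formula; the default parameters are the example values used on the next slide.

```python
# Sketch of the SMR band-geometry model (reconstructed, not verbatim).
def smr_track_density_gain(B, w=25.0, g=5.0, w_prime=70.0, r=13.0, f=0.01):
    """Track-density gain of SMR over conventional recording.

    B: tracks per band               w: conventional track width (nm)
    g: inter-track gap (nm)          w_prime: SMR write-head width (nm)
    r: residual readable track width after overlap (nm)
    f: fraction of tracks left unshingled for random writes
    """
    conv_pitch = w + g                            # conventional track pitch
    smr_pitch = ((B - 1) * r + w_prime + g) / B   # avg pitch inside a band
    effective = (1 - f) * smr_pitch + f * conv_pitch
    return conv_pitch / effective

for B in (1, 10, 100):
    print(f"B={B:4d}: gain ~{smr_track_density_gain(B):.2f}x")
```

The trend matches the slides: tiny bands lose density, and the projected 1.5-2.5× gain only shows up once B reaches tens to hundreds of tracks.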
Areal Density Favors Large Bands
E.g., w = 25, g = 5, w' = 70, r = 10, 13, 20 nm; f = 0%, 1%, 10%
• 1% unshingled is affordable; 10% only if r < w
• Small B is bad news
• r ≈ w needs large B (~100+)
• r < w allows smallish B (~10), but not soon…
Systems should plan for large bands
Coping with SMR at the system level
Convergence with Flash
Transparent STL/FTL approach
• Shingled disks implement "translation"
  • Same types of algorithms as flash
  • Data will be correct using existing program code
• But not performance-transparent
  • Erase block: 100-1000× bigger
  • Read-erase-write: 1000-10000× longer
  • Sure to exceed tolerable latency thresholds
• And not cost-transparent
  • Disk margins < flash margins
  • Yet a disk STL needs more resources
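The scale gap can be put in rough numbers. All figures below are assumed, round, 2012-era values (not from the talk), just to show why rewriting a band lands orders of magnitude beyond a flash block erase:

```python
# Rough cost of rewriting one STL "erase unit":
# a flash erase block vs. an SMR band. All numbers are assumed.
flash_block_mb, smr_band_mb = 0.5, 64.0   # erase-unit sizes (MB)
flash_erase_ms = 2.0                      # typical flash block erase time
disk_mb_per_s = 100.0                     # sustained disk bandwidth

# Rewriting a whole SMR band = read it out + write it back.
band_rmw_ms = 2 * (smr_band_mb / disk_mb_per_s) * 1000
print(f"erase unit: {smr_band_mb / flash_block_mb:.0f}x bigger")
print(f"band rewrite ~{band_rmw_ms:.0f} ms vs ~{flash_erase_ms} ms erase "
      f"(~{band_rmw_ms / flash_erase_ms:.0f}x longer)")
```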
Non-transparent SMR interface
• Define an interface exposing key differences
  • Bands, non-shingled regions, trim, …
• Modify systems software to avoid and minimize read-modify-write
  • Log-structured file systems are 20 years old
  • STL-like technology is not costly in the host
  • Cloud storage writes in 64 MB chunks (HDFS)
  • Flash, PCM, etc. may be available to the host
Non-transparent SMR interface
• Standards processes exist in T13 and T10
• Key idea: a disk attribute says "sequential writing"
• Each band has a write cursor for the next write LBA
• Writes before and reads after the cursor are "bad"
• Software can reset the cursor, mostly to the start of a band
• Software can ask for a map of bands & cursors
Experimenting with File Systems for SMR
Project Plan
• Demonstrate systems using the SMR interface
  • Mock interface models an SMR device
• Cloud/Big Data is the initial target application space
  • Hadoop / HDFS is the first example
    • Chunks ~= bands
    • HDFS is write-once, so it is easier to pack frags
• Log-structured merge tree / LFS?
  • Implement directories and inodes as table entries
  • Logs of changes in tables are written as bands
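The table idea can be sketched in a few lines: metadata mutations append to an in-memory log, and a flush merges the log into a sorted, immutable table that can go to disk as one sequential, band-sized write. Names and layout here are illustrative, not the project's actual schema.

```python
# Minimal LSM-style metadata table: directories and inodes are
# sorted key/value rows; changes accumulate in a log and are merged
# out with a single sequential write (one "band").
import bisect

class MetadataTable:
    def __init__(self):
        self.log = []        # in-memory log of (key, value) changes
        self.sstable = []    # sorted, immutable rows

    def put(self, key, value):
        self.log.append((key, value))

    def flush(self):
        """Merge the log into a new sorted table -- one sequential write."""
        merged = dict(self.sstable)
        merged.update(self.log)
        self.sstable = sorted(merged.items())
        self.log = []

    def get(self, key):
        for k, v in reversed(self.log):       # newest log entry wins
            if k == key:
                return v
        i = bisect.bisect_left(self.sstable, (key,))
        if i < len(self.sstable) and self.sstable[i][0] == key:
            return self.sstable[i][1]
        return None

md = MetadataTable()
md.put("/home", {"type": "dir"})
md.put("/home/f1.txt", {"type": "file", "inode": 17})
md.flush()                                    # band-sized sequential write
print(md.get("/home/f1.txt"))
```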
Start w/ class project framework
[Diagram: Application → FUSE file system (melangefs) → SSD (ext2) and HDD (ext2)]
1) App does create(f1.txt)
2) MelangeFS creates "f1.txt" on the SSD
3) Ext2 on the SSD returns a handle for "f1.txt" to FUSE
4) FUSE "translates" that handle into another handle, which is returned to the app
5) App uses the returned handle to write to "f1.txt" on the SSD
6) When "f1.txt" grows big, MelangeFS moves it to the HDD, and "f1.txt" on the SSD becomes a symlink to the file on the HDD
7) Because this migration has to be transparent, the app continues to write as before (all writes go to the HDD)
Experimental Platform Today
[Diagram: Hadoop/HDFS runs over shingledfs, a user-level FUSE file system on ext3, which talks to a user-level SMR device emulator exposing the T13 interface model. Shingledfs does file-to-band/block translation across a shingled and an unshingled disk partition, and keeps an open-file cache (F1, F2, …): open(F2, w) and metadata ops for F2 go to the unshingled partition, writes to open files go to the cache, and F2 is written to the SMR partition on close().]
Prototype Banded Disk API
• CMU view of API essentials
  • Edi_modesense()
    • Discover band information (number, size)
  • Edi_managebands(OP, band, offset, length)
    • GET: where is next_write_offset?
    • SET: change next_write_offset (mostly to 0)
  • Edi_read(band, offset, length)
    • Offset must be less than next_write_offset
  • Edi_write(band, offset, length)
    • Offset must equal next_write_offset (else reject)
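A toy in-memory model makes the cursor rules concrete. The call names and semantics follow the slide (reads only below the cursor, writes only at the cursor); the band count, band size, and error handling are arbitrary choices for illustration.

```python
# Toy in-memory model of the banded-disk API above (illustrative only).
class BandedDisk:
    def __init__(self, num_bands=4, band_size=8):
        self.bands = [bytearray() for _ in range(num_bands)]
        self.band_size = band_size

    def edi_modesense(self):
        # Discover band information (number, size).
        return {"num_bands": len(self.bands), "band_size": self.band_size}

    def edi_managebands(self, op, band, offset=0):
        if op == "GET":                      # where is next_write_offset?
            return len(self.bands[band])
        if op == "SET":                      # reset cursor, usually to 0
            del self.bands[band][offset:]
            return offset

    def edi_read(self, band, offset, length):
        if offset + length > len(self.bands[band]):
            raise IOError("read past write cursor")
        return bytes(self.bands[band][offset:offset + length])

    def edi_write(self, band, offset, data):
        if offset != len(self.bands[band]):  # must write at the cursor
            raise IOError("non-sequential write rejected")
        self.bands[band].extend(data)

d = BandedDisk()
d.edi_write(0, 0, b"abc")
print(d.edi_managebands("GET", 0))   # cursor advanced to 3
print(d.edi_read(0, 0, 3))
```

Resetting the cursor with SET discards everything from the reset point on, which is exactly why host software must clean and rewrite whole bands.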
Hadoop Sort Benchmark
• 6-node Hadoop cluster: write, sort, verify X GBs
• Compare local disk, FUSE-local, and FUSE-SMR
• FUSE causes most of the overhead
• No cleaning during tests
• An SMR file system can support Big Data apps
Ongoing Project Directions
Future Work: General Workloads
• Compile Linux 2.6.35 on SMR
  • Bigger overheads, especially untar
  • Lots of small files, lots of directory operations, etc.
Future Work: Pack Metadata
• Change traditional file systems in the unshingled region
• Use an LSM tree for directories and inodes
  • E.g., LevelDB
  • Most metadata on disk in SSTable blobs
  • Initial experiments reduce disk seeks for metadata ops
Summary of status
• Experiments: SMR is appropriate for Big Data apps
• Deployment: embed in HDFS DataNode servers or a local file system
• Implementation greatly simplified by
  • one file: one band
  • files open for write are held in memory until close
  • Hadoop/HDFS is write-once
• Next steps:
  • Cleaning overhead; cluster soon-to-delete files
  • Log-structured merge tree to pack metadata
Further reading
• www.pdl.cmu.edu technical reports:
  • CMU-PDL-12-105: Big Data experiments
  • CMU-PDL-11-107: Principles of operations
  • CMU-PDL-12-103: TableFS approach
Thanks to our sponsors: Seagate and the PDL Consortium (Actifio, APC, EMC, Emulex, Facebook, Fusion-IO, Google, HP Labs, Hitachi, Huawei, Intel, Microsoft, NEC, NetApp, Oracle, Panasas, Riverbed, Samsung, STEC, Symantec, VMWare, Western Digital)