Commodity Reliability And Practices
or
Building Reliable Systems with CRAP
Thomas M. Ruwart
Chief Scientist
[email protected]@sherwoodinfo.com
University of Minnesota
Digital Technology Center
Minneapolis, MN
October 8, 2007

Why you are here
• To learn about a relatively new concept in the storage industry
• Concepts behind CRAP
• Related issues and problems
• Addressing those issues and problems

Orientation
• A bit of history
• How did we get to this point?
• Design requirements for enterprise-class storage
• Design requirements for consumer-class storage
• Principles of Commodity Reliability And Practices
• Why we need CRAP
• Conclusions

A bit of History
• What disk drives were like in the “olden days”
• 1967-2007 in 10-year increments – the view from someone who lived through it
• Future history – where are we headed

1967
• Disk drive platters were 30 inches in diameter
• Disk drives were the size of a large clothes washing machine
• A million dollars per disk drive that held only a few megabytes
• Disk drives required daily maintenance of cleaning heads and platters
• Several people were crushed under the weight of their iPods which held only one song at a time

IBM 305 RAMAC

1977
• Disk drive platters are now 14 inches in diameter
• Disk drive capacity up to about 315MB in a small washing-machine sized box
• Disk drives are still only used in computer centers
• Disk drives were just beginning to be sealed so that no physical maintenance was required for the heads or media
• Media maintenance for dealing with bad spots was required on a monthly basis
• iPods now carried by pack mules and elephants and as such, never caught on in the US

1987
• The 8-inch disk drive form factor is the standard for “enterprise-class” disk drives and uses the SMD or emerging IPI interface
• The PC revolution gives rise to “consumer-class” hard disk drives
• Consumer-class disk drives start with the small form factor 5.25-inch full height
• ST506, ATA/IDE, SCSI, and ESDI are the new hard disk drive interfaces for consumer use
• Maintenance required to manage bad sectors but no other physical maintenance required
• RAID is just being invented… again
• iPods become practical to the extent that they are smaller but battery life is about 3 seconds so the Walkman wins

1997
• The 8-inch and 5.25-inch hard disk drive form factors give way to the 3.5-inch form factor (half-height)
• 3.5-inch form factor is dominant in both enterprise-class and consumer-class disk drives
• Parallel SCSI and Fibre Channel SCSI replace all other interfaces for enterprise-class disk drives
• IDE and EIDE are the standard consumer-class disk drive interfaces
• Capacities/Densities on both enterprise-class and consumer-class disk drives are equivalent
• No maintenance required – bad sector management is an integral part of each disk drive
• Zone-bit recording significantly increases disk drive capacities
• RAID Arrays are in wide use to increase data integrity and disk storage reliability as well as performance with minimal tradeoff in capacity
• iPods are deemed a waste of time because everything will be on miniDISC or DAT

2007
• The 3.5-inch, low profile (1-inch high) form factor is dominant in both enterprise-class and consumer-class storage
• The 2.5-inch form factor is starting to make an appearance in the data center
• 2.5-inch (laptop drive) and 1.8-inch (iPod) disk drives are dominant in mobile devices
• Sub-1.8-inch disk drives are being replaced by FLASH
• Consumer-class disk drives maintain a consistently higher bit density and drive capacity than Enterprise-class disk drives
• Enterprise-class disk drives maintain higher RPM and faster access times than consumer-class disk drives, at 3-4x the drive cost
• Enterprise-class disk drives are giving way to consumer-class disk drives in previously Enterprise-class applications
• Concerns rise over use of Consumer-class storage in Enterprise-class applications
• iPods take over and begin to network themselves together to form another, higher intelligent life-form based solely on pop music. We call it:

2017
• The 2.5-inch and 1.8-inch form factors are dominant in both enterprise-class and some consumer-class storage
• Data centers beginning to give up on disk drives and want to revert to using punch-cards
• Mobile devices are primarily FLASH-based
• Sub-1.8-inch disk drives being replaced by FLASH
• Consumer-class disk drives dominant in Enterprise-class applications – distinction between Consumer-class and Enterprise-class disk drives is narrow
• Concerns rise over the US President’s demand to paint the White House purple

Where Drive Form Factors Come From.
[Figure labels: 7”, 5.25”, 5.25”, 3.5”]

What’s inside a Video IPOD
I knew it was too good to be true….

Let’s talk about the Lunatics
• High-End Computing (HEC) Community
  – BIG data or LOTS of data, locally and widely distributed, high bandwidth access or high transaction rate, relatively few users, secure, short-term and long-term retention
• High Energy Physics (HEP) – Fermilab, CERN, DESY
  – BIG data, locally distributed, widely available, moderate number of users, sparse access, long-term retention
• DARPA – Interagency High Productivity Computing Systems
  – Design and build a peta-scale computer system that is usable for the year 2010

HEP – Fermilab and CMS
• The Compact Muon Solenoid (CMS - http://cms.cern.ch )
  – $750M Experiment being built at CERN in Switzerland
  – Will be active in 2007
• The Easy Part – collecting the data
  – Data rate from the detectors is ~1 PB/sec
  – Data rate after filtering is a few GB/sec
• The Hard Part: Storing and Access
  – Dataset for a single experiment is ~1PB
  – Several experiments per year are run
  – Must be made available to 5000 scientists all over the planet (Earth primarily) for the next 10-25 years
  – Dense dataset, sparse data access by any one scientist
  – Access patterns are not deterministic

LHC Data Grid Hierarchy
CMS as example, Atlas is similar
[Figure: tiered data grid. Levels shown: Online System, Tier 0 +1, Tier 1, Tier 2, Tier 3, Tier 4. Nodes shown: event reconstruction, event simulation, analysis, French Regional Center, German Regional Center, Italian Center, FermiLab USA Regional Center, Tier2 Centers, Institutes (~0.25 TIPS), physics data cache, workstations. Link rates shown: ~PByte/sec, ~100 MBytes/sec, ~2.5 Gbits/sec, ~0.6-2.5 Gbps, 100 - 1000 Mbits/sec.]
• CERN/CMS data goes to 6-8 Tier 1 regional centers, and from each of these to 6-10 Tier 2 centers.
• Physicists work on analysis “channels” at 135 institutes. Each institute has ~10 physicists working on one or more channels.
• 2000 physicists in 31 countries are involved in this 20-year experiment in which DOE is a major player.
• CMS detector: 15m X 15m X 22m, 12,500 tons, $700M (human = 2m, shown for scale).
Courtesy Harvey Newman, CalTech and CERN

What are the DARPA requirements?
• HEC Community – The High Productivity Computing Systems (HPCS) from DARPA
  – 10^15 computations per second – Peta-scale computing
  – 1-10 trillion files in a single file system
  – 100’s of thousands of processors
  – Millions of process threads all needing and generating data
  – 1-100 TBytes/sec aggregate bandwidth to disk
  – 30,000+ file creations per second
  – Focus on ease of use, efficiency, and RAS

Lots of things have to scale

File System Attributes    1999       2002        2005         2008
Teraflops                 3.9        30          100          400
Memory size (TB)          2.6        13-20       32-67        44-167
File system size (TB)     75         200-600     500-2,000    20,000
Number of Client Tasks    8192       16384       32,768       65,536
Number of Users           1,000      4,000       6,000        10,000
Number of Directories     5.0*10^6   1.5*10^7    1.8*10^7     2*10^8
Metadata Rates            500/sec    2000/sec    20,000/sec   50,000/sec
                          (1 mds)    (1 mds)     (n mds)      (n mds)
Data Rate                 3 GB/sec   30 GB/sec   100 GB/sec   400 GB/sec
Number of Files           1.0*10^9   4.0*10^9    1.0*10^10    2.0*10^12

What are we getting ourselves into?
• 1TB/sec
  – 20,000 disk drives
    • @ 50MB/sec/disk average
    • @ 10ms average access time ≈ 2 million IOPS
    • @ 1TB/disk ≈ 20PB raw capacity
    • @ 25watts/disk (including cooling power) ≈ 500 KWatts
  – 40,000 disk drives in a real design to include redundancy
    • Space and power/cooling increase by 2x ≈ 1MWatt
  – And that is just the beginning….
  – 10TB/sec would be up to 400,000 disk drives
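
The sizing arithmetic on this slide can be reproduced with a short sketch. The function name, its parameters, and the use of decimal units (1 TB = 1,000,000 MB) are assumptions made for illustration; the per-disk figures are the ones quoted above.

```python
def size_disk_farm(target_mb_per_sec, disk_mb_per_sec=50, access_time_ms=10,
                   disk_capacity_tb=1, watts_per_disk=25, redundancy_factor=2):
    """Rough sizing of a disk farm for a target aggregate bandwidth.

    Defaults are the per-disk figures from the slide. Decimal units
    (1 TB = 1,000,000 MB) are an assumption of this sketch.
    """
    data_disks = target_mb_per_sec // disk_mb_per_sec
    iops_per_disk = 1000 // access_time_ms        # one access every 10 ms = 100 IOPS
    total_disks = data_disks * redundancy_factor  # the slide doubles the count for redundancy
    return {
        "data_disks": data_disks,
        "total_iops": data_disks * iops_per_disk,
        "raw_capacity_pb": data_disks * disk_capacity_tb / 1000,
        "power_kw": data_disks * watts_per_disk / 1000,
        "total_disks": total_disks,
        "total_power_kw": total_disks * watts_per_disk / 1000,
    }


one_tb = size_disk_farm(1_000_000)    # 1 TB/sec target
ten_tb = size_disk_farm(10_000_000)   # 10 TB/sec target
print(one_tb)
print(ten_tb["total_disks"])
```

With the slide's inputs this gives 20,000 data disks, 2 million IOPS, 20 PB raw, and 500 kW, doubling to 40,000 disks and 1 MW with redundancy, and 400,000 disks at 10TB/sec.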

What does 1TB/sec really mean?
• To what?
  – 1,000 processes @ 1GB/sec each?
  – 100,000 processes at 10MB/sec each?
  – Assumes a process/processor can absorb/generate data at that rate
  – Current data:instruction ratio is about 10:1
    • Therefore, 1TB/sec implies 100GFlops
    • Thus 1PFlop implies a data rate of 100TB/sec – oops.

Digging ourselves in deeper?
• 1 Trillion Files
  – 30,000 file creations per second for 1 year = 1 trillion files
  – 1PB of MetaData to describe 1 Trillion files
  – Finding any one file within 1 Trillion files
  – Finding anything inside of the 1 Trillion files
  – This is a major transactional problem not a bandwidth problem
  – Traditional file systems and associated [POSIX] semantics break down at these scales – need new/relaxed semantics
  – Is the concept of a “file” still valid in this context?
The Growing Disk Drive BottleneckThe Growing Disk Drive Bottleneck Subsystem
19931
2007E1
Increase
Network I/O2
0.001
2
2000x Intel CPU
0.48
100
200x
Storage Channel I/O3
0.05
4
80x PCI
7
0.13
16
123x
Intel Front Side Processor Bus
0.53
13
24x Random Disk IOPS
5 90 150 1.7x
Random Disk IOPS per Gbyte5,6
43 4.2 -10x
Sequential Disk I/O4
0.005 0.1 20x
Sequential Disk BW/Gbyte 0.005 0.0001 -50x
Notes: 1 Speed of subsystem in GBps
2 Ethernet
3 SCSI and Fibre Channel
4 IBM 3.5 inch drives internal data rate
5 IBM 3.5 inch drives se ek + rotational latency
6 Horison/Fred Moore
7 PCI versus 16xPCIe
Source: www.ArchiveBuilders.com, "Evolution of Intel Microprocessors: 1971 to 2001”

Need more disks, not higher capacity ones
• Disk drive capacity improves faster than
  – Data transfer rate
  – Seek time
  – Rotational Latency

Access Density

Serious Questions
• How do you package it?
• How do you maintain it?
• How do you connect it all together?
• How do you access/use a storage system with 250,000 disk drives?

How do you package this?
• Conservatively one hundred 3½ inch disks per rack with controllers
• 400 racks of disk drives and controllers
• 8,000 square feet
• 10TB/sec is 10 times this or about the size of two football fields (~100,000 sq ft)

How do you maintain it?
• Assume
  – 40,000 disk configuration
  – 2,000,000 hours MTBF per Enterprise-class disk
  – 300,000 hours MTBF per Consumer-class disk
• ~4 disk failures per week for Enterprise-class disks
• ~20 failures per week for Consumer-class disks
• Continual rebuilds in progress
• 10TB/sec is 10 times this
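
The weekly failure counts follow from the MTBF figures if each disk is assumed to fail at a constant rate of 1/MTBF per powered-on hour. That constant-rate model and the function name are assumptions of this sketch, not something the slide states.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def expected_failures_per_week(disk_count, mtbf_hours):
    """Expected failures per week, assuming a constant failure rate of
    1/MTBF per disk-hour and all disks powered on continuously."""
    return disk_count * HOURS_PER_WEEK / mtbf_hours


enterprise = expected_failures_per_week(40_000, 2_000_000)
consumer = expected_failures_per_week(40_000, 300_000)
print(round(enterprise, 2), round(consumer, 2))
```

Under these assumptions the result is about 3.4 failures per week for Enterprise-class disks and about 22 for Consumer-class disks, which the slide rounds to ~4 and ~20.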

How do you connect it all together?
• 10Gbit/sec/channel → 1,000 channels @ 100% efficiency
• Implies a 2,000 channel non-blocking switch fabric
• What about transceiver failure rates?
• When it breaks, how do you find the broken transceiver?
• 10TB/sec – who on earth would want to do that? (don’t ask)

How do you use this?
• Current file system technology is based on 30+ year-old designs and does not scale
• Disk I/O software stack is 30+ years old and does not scale
• Need lots of innovation in many areas
  – Common shared file system interfaces
  – Data Life Cycle Management and seamless integration into existing HEC environments
  – Changes to standards that offer greater scalability without sacrificing data integrity
  – Streaming I/O from zillions of single nodes
  – Data alignment, small-block, large-block, and RAID issues
  – File System Metadata
[Figure: two stacks, each layered Application / Operating System / Storage and Transport]

Commodity Reliability And Practices
• Processors, Networks, Graphics Engines have for the most part gone “commodity”
• Disk drives are still largely “enterprise-class”
• Significant pressure to move toward more use of commodity disk drives
• Requires a fundamental change in how we think about RAS for storage – i.e. Fail-In-Place
• Assumes something is always in the process of breaking
• Must re-orient engineering to think about how to build reliable systems using unreliable components
• AKA – How to build reliable systems using CRAP

What is CRAP?
• Successful systems are designed to be “fault-tolerant” – faults are a normal occurrence
• Until recently, systems were designed and assumed to be highly-reliable and faults were an anomaly rather than a normal occurrence
• That was possible due to the relatively “low” part count in a given system
• In order to make computations go faster, significantly more parallelism is needed
• Parallelism implies a far greater “part count”
• A natural consequence of higher part count is a higher failure rate of some part in the system as a whole
• However, there is a great deal of duplication in highly parallel or high-part-count systems
• Need to take advantage of this parallelism

So, what is CRAP?
• High part-count systems must contain a large number of “commodity” components to be commercially viable
• Commodity components are less expensive and arguably less reliable in enterprise-class applications
• Thus, Systems should be designed with Commodity Reliability in mind
• Commodity Reliability means that at any point in time, something is in the process of failing within the system as a whole
• Given that something is always in the process of breaking, appropriate Engineering Practices must be employed to maintain:
  – Data Access
  – Data Integrity
  – System Performance
• Requires a broader, systemic engineering view when designing with CRAP in order to ensure that things above and below a point of failure are minimally impacted – software and hardware

What’s happening now?
• Areal Density is at about 150 Gigabits per square inch
• 3.5-inch form factor is currently the standard
• 2.5-inch form factor is emerging in the enterprise
• SAS and SATA are getting significant traction
• OSD has been demonstrated and is in active development
• Consumer-grade storage is cheap cheap cheap
• Commodity interface speeds are up to 10 Gigabits/sec
• Storage and Network processing engines are available
• New applications for storage are rapidly evolving
• Relaxed POSIX standards
• NFS V4 and Parallel NFS

Reaching Error Detection Code Limits
• Error Detection Codes are statistical in nature
• Current codes are rated at about 1 “undetected” error in 10^15 bits transferred or about 1 bit in every 100 TeraBytes
• As storage capacities and transfer rates increase, the current error “detection” codes will no longer be sufficient to guard against silent data corruption
• Need “stronger”, non-intrusive error “detection” codes at multiple levels
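
To see why the rating stops being sufficient as transfer rates grow, the sketch below converts the 1-in-10^15-bits figure into an average time between undetected errors at a given sustained rate. Treating the rating as a simple long-run average, and the function name itself, are assumptions made for illustration.

```python
def seconds_between_undetected_errors(bytes_per_sec, bits_per_undetected_error=10**15):
    """Average seconds between undetected errors at a sustained transfer rate,
    treating the rating (1 error per 10^15 bits) as a long-run average."""
    bits_per_sec = bytes_per_sec * 8
    return bits_per_undetected_error / bits_per_sec


at_3_gb_per_sec = seconds_between_undetected_errors(3 * 10**9)   # 3 GB/sec
at_1_tb_per_sec = seconds_between_undetected_errors(10**12)      # 1 TB/sec
print(at_3_gb_per_sec / 3600, "hours at 3 GB/sec")
print(at_1_tb_per_sec, "seconds at 1 TB/sec")
```

Under that assumption a system moving 3 GB/sec sees one undetected error roughly every 11.6 hours, while a system moving 1 TB/sec sees one about every 125 seconds.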

Conclusions
• Need CRAP to design the next generation of scalable, high-performance systems
• System design is highly dependent on leveraging commodity components

Zone Bit Recording
• Commonly used technique to maximize use of media area
• More sectors recorded on the outer tracks than inner tracks
• Bit density remains constant but the data rate changes from zone to zone
• Typically 10-100 zones on a disk – varies from mfg to mfg, model to model
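
The relationship between zones and data rate can be sketched as follows. With constant linear bit density and constant RPM, the bits passing under the head per revolution scale with track circumference, so the data rate scales with radius. The radii, zone count, and outer-zone rate below are hypothetical values chosen for illustration, not measurements of any drive, and real drives place zone boundaries differently.

```python
def zone_data_rates(outer_radius_mm, inner_radius_mm, zones, outer_rate_mb_per_sec):
    """Idealized per-zone data rate under zone bit recording.

    Assumes constant linear bit density and constant RPM, so the data rate
    is proportional to the radius at which each zone starts. Zones are
    spaced evenly from the outer to the inner radius (an assumption).
    """
    if zones < 2:
        raise ValueError("need at least two zones")
    step = (outer_radius_mm - inner_radius_mm) / (zones - 1)
    rates = []
    for zone in range(zones):
        radius = outer_radius_mm - step * zone
        rates.append(outer_rate_mb_per_sec * radius / outer_radius_mm)
    return rates


# Hypothetical drive: 30 mm outer radius, 15 mm inner radius, 4 zones, 34 MB/sec outermost
rates = zone_data_rates(30, 15, 4, 34)
print([round(r, 1) for r in rates])
```

In this idealized model the innermost zone runs at half the rate of the outermost one because its radius is half as large.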

Example of Zones in ZBR
[Figure: Zone Map of the Fujitsu 60 GB 5400 RPM 2.5-inch SATA Disk. X axis: percent of disk from outer to inner tracks (0 to 95). Y axis: bandwidth in MB/sec (15 to 35).]

Adaptive Formatting
• ZBR on steroids
• Each head is “tuned” to maximize the signal-to-noise ratio of each individual track
• Results in different effective data rate for each track
• Results in a variable number of sectors per track

Example of “zones” in AF
[Figure: Seagate 100 GB 5400 RPM Momentus. X axis: percent of disk from outer to inner tracks (0 to 90). Y axis: bandwidth in MB/sec (0 to 40).]