ECE7130: Advanced Computer Architecture: Storage Systems
Dr. Xubin He
http://iweb.tntech.edu/hexb
Email: [email protected]
Tel: 931-3723462, Brown Hall 319
Outline
• Quick overview: storage systems
• RAID
• Advanced Dependability/Reliability/Availability
• I/O Benchmarks, Performance and Dependability
Storage Architectures
Disk Figure of Merit: Areal Density
• Bits recorded along a track
  – Metric is Bits Per Inch (BPI)
• Number of tracks per surface
  – Metric is Tracks Per Inch (TPI)
• Disk designs brag about bit density per unit area
  – Metric is Bits Per Square Inch: Areal Density = BPI x TPI
Year    Areal Density
1973    2
1979    8
1989    63
1997    3,090
2000    17,100
2006    130,000

[Chart: areal density vs. year, 1970 to 2010, log scale from 1 to 1,000,000]
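As a quick worked instance of the formula above (the BPI/TPI numbers below are made up for illustration, not taken from any specific drive):

```python
def areal_density(bpi, tpi):
    """Areal density in bits per square inch: bits per inch along a track
    times tracks per inch across the surface."""
    return bpi * tpi

# Hypothetical values: 800,000 BPI x 160,000 TPI
# = 1.28e11 bits per square inch (about 128 Gbit/sq. inch)
density = areal_density(800_000, 160_000)
```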
Historical Perspective
• 1956 IBM RAMAC through early-1970s Winchester
  – Developed for mainframe computers, proprietary interfaces
  – Steady shrink in form factor: 27 in. to 14 in.
  – Form factor and capacity drive the market more than performance
• 1970s developments
  – 5.25-inch floppy disk form factor (microcode into mainframe)
  – Emergence of industry-standard disk interfaces
• Early 1980s: PCs and first-generation workstations
• Mid 1980s: client/server computing
  – Centralized storage on file server
    » Accelerates disk downsizing: 8 inch to 5.25 inch
  – Mass-market disk drives become a reality
    » Industry standards: SCSI, IPI, IDE
    » 5.25-inch to 3.5-inch drives for PCs; end of proprietary interfaces
• 1990s: laptops => 2.5-inch drives
• 2000s: what new devices will lead to new drives?
Storage Media
• Magnetic
  – Hard drive
• Optical
  – CD
• Solid-state semiconductor
  – Flash SSD
  – RAM SSD
• MEMS-based
  – Significantly cheaper than DRAM but much faster than traditional HDD; high density
• Tape: sequentially accessed
Driving forces
Source: Anderson & Whittington, Seagate, FAST’07
HDD Market
Source: Anderson & Whittington, Seagate, FAST’07
HDD Technology Trend
Source: Anderson & Whittington, Seagate, FAST’07
Errors
• Errors happen
  – Media causes: bit flips, noise…
• Data detection and decoding logic, ECC
Right HDD for right application
Source: Anderson & Whittington, Seagate, FAST’07
Interfaces
• Internal
  – SCSI: Wide SCSI, Ultra Wide SCSI…
  – IDE/ATA: Parallel ATA (PATA): PATA 66/100/133
  – Serial ATA (SATA): SATA 150, SATA II 300
  – Serial Attached SCSI (SAS)
• External
  – USB (1.0/2.0)
  – FireWire (400/800)
  – eSATA: external SATA
SATA/SAS compatibility
Source: Anderson & Whittington, Seagate, FAST’07
Top HDD manufacturers
• Western Digital
• Seagate (also owns Maxtor)
• Fujitsu
• Samsung
• Hitachi
• IBM used to make hard disk drives but sold that division to Hitachi.
Top companies to provide storage solutions
• Adaptec
• EMC
• Qlogic
• IBM
• Hitachi
• Brocade
• Cisco
• HP
• Network Appliance
• Emulex
Fastest growing storage companies (07/2008) Source: storagesearch.com
Company Yearly growth Main product
Coraid 370% NAS
ExaGrid Systems 337% Disk to disk backup
RELDATA 300% iSCSI
Voltaire 194% InfiniBand
Transcend Information 166% SSD
OnStor 157% NAS
Bluepoint Data Storage 140% Online Backup and Storage
Compellent 124% NAS
Alacritech 100% iSCSI
Intransa 100% iSCSI
Solid State Drive
• A data storage device that uses solid-state memory to store persistent data.
• An SSD uses either Flash non-volatile memory (Flash SSD) or DRAM volatile memory (RAM SSD)
Advantages of SSD
• Faster startup: no spin-up
• Fast random access for read
• Extremely low read/write latency: much smaller seek time
• Quiet: no moving parts
• High mechanical reliability: endures shock and vibration
• Balanced performance across entire storage device
Disadvantages of SSD
• Price: unit price of an SSD is roughly 20x that of an HDD
• Limited write cycles (Flash SSD): 30–50 million
• Slower write speed (Flash SSD): erase blocks
• Lower storage density
• Vulnerable to some effects: abrupt power loss (RAM SSD), magnetic fields and electric/static charges
Top SSD manufacturers (as of 1Q 2008; source: storagesearch.com)
Rank Manufacturer SSD Technology
1 BitMICRO Networks Flash SSD
2 STEC Flash SSD
3 Mtron Flash SSD
4 Memoright Flash SSD
5 SanDisk Flash SSD
6 Samsung Flash SSD
7 Adtron Flash SSD
8 Texas Memory Systems RAM/Flash SSD
9 Toshiba Flash SSD
10 Violin Memory RAM SSD
Networked Storage
• Network Attached Storage (NAS)
  – Storage accessed over TCP/IP, using industry-standard file sharing protocols like NFS, HTTP, and Windows Networking
  – Provides file system functionality
  – Takes LAN bandwidth away from servers
• Storage Area Network (SAN)
  – Storage accessed over a Fibre Channel switching fabric, using encapsulated SCSI
  – Block-level storage system
  – Fibre Channel SAN
  – IP SAN
    » Implements SAN over well-known TCP/IP
    » iSCSI: cost-effective, SCSI over TCP/IP
Advantages
• Consolidation
• Centralized Data Management
• Scalability
• Fault Resiliency
NAS and SAN shortcomings
• SAN shortcomings
  – Data to desktop
  – Sharing between NT and UNIX
  – Lack of standards for file access and locking
• NAS shortcomings
  – Shared tape resources
  – Number of drives
  – Distance to tapes/disks
• NAS: focuses on applications, users, and the files and data that they share
• SAN: focuses on disks, tapes, and a scalable, reliable infrastructure to connect them
• NAS plus SAN: the complete solution, from desktop to data center to storage device
Organizations
• IEEE Computer Society Mass Storage Systems Technical Committee (MSSTC or TCMS): http://www.msstc.org
• IETF: www.ietf.org
  – IPS, IMSS
• International Committee for Information Technology Standards: www.incits.org
– Storage: B11, T10, T11, T13
• Storage mailing list: [email protected]
• INSIC: Information Storage Industry Consortium: www.insic.org
• SNIA: Storage Networking Industry Association: http://www.snia.org
  – Technical work groups
Conferences
• FAST: Usenix: http://www.usenix.org/event/fast09/
• SC: IEEE/ACM: http://sc08.supercomputing.org/
• MSST: IEEE: http://storageconference.org/
  – SNAPI, SISW, CMPD, DAPS
• NAS: IEEE: http://www.eece.maine.edu/iwnas/
• SNW: SNIA: Storage Networking World
• SNIA Storage Developer Conference: http://www.snia.org/events/storage-developer2008/
• Other conferences with storage components:– IPDPS, ICDCS, ICPP, AReS, PDCS, Mascots, HPDC, IPCCC, ccGRID…
Awards in Storage Systems
• IEEE Reynold B. Johnson Information Storage Systems Award
  – Sponsored by: IBM Almaden Research Center
• 2008 Recipient:
  – Alan J. Smith, Professor, University of California at Berkeley, Berkeley, CA, USA
• 2007 Co-Recipients:
  – Dave Hitz, Executive Vice President and Co-Founder, Network Appliance, Sunnyvale, CA
  – James Lau, Chief Strategy Officer, Executive Vice President and Co-Founder, Network Appliance, Sunnyvale, CA
• 2006 Recipient:
  – Jaishankar M. Menon, Director of Storage Systems Architecture and Design, IBM, San Jose, CA
  – "For pioneering work in the theory and application of RAID storage systems."
• More: http://www.ieee.org/portal/pages/about/awards/sums/johnson.html
HPC storage challenge: SC
• HPC systems comprise three major subsystems: processing, networking, and storage. In different applications, any one of these subsystems can limit overall system performance. The HPC Storage Challenge is a competition showcasing effective approaches using the storage subsystem, which is often the limiting subsystem, with actual applications.
• Participants must describe their implementations and present measurements of performance, scalability, and storage subsystem utilization. Judging will be based on these measurements as well as innovation and effectiveness; maximum size and peak performance are not the sole criteria.
• Finalists will be chosen on the basis of submissions which are in the form of a proposal; submissions are encouraged to include reports of work in progress. Participants with access to either large or small HPC systems are encouraged to enter this challenge.
Research hotspots
• Energy Efficient Storage: CMU, UIUC
• Scalable Storage Meets Petaflops: IBM, UCSC
• High availability and Reliable Storage Systems: UC Berkeley, CMU, TTU, UCSC
• Object Storage (OSD): CMU, HUST, UCSC
• Storage virtualization: CMU
• Intelligent Storage Systems: UC Berkeley, CMU
Energy Efficient Storage
• Energy aware:
  – Disk level: SSD
  – I/O and file system level: efficient memory, cache, networked storage
  – Application level: data layout
• Green Storage Initiative (GSI): SNIA
  – Energy-efficient storage networking solutions
  – Storage administrators and infrastructure
Peta-scale scalable storage
• Challenge: building storage that can keep pace with ever-increasing speeds, multi-core designs, and petaflop computing capabilities.
• Latencies to disks are not keeping up with peta-scale computing.
• Inventive approaches are needed.
My research in high performance and reliable storage systems
• Active/Active Service for High-Availability Computing
  – Active/Active Metadata Service
• Networked Storage Systems
  – STICS
  – iRAID
• A Unified Multiple-Level Cache for High Performance Storage Systems
  – iPVFS
  – iCache
• Performance-Adaptive UDP for Bulk Data Transfer over Dedicated Network Links: PA-UDP
Improving disk performance
• Use large sectors to improve bandwidth
• Use track caches and read-ahead:
  – Read the entire track into an on-controller cache
  – Exploit locality (improves both latency and bandwidth)
• Design file systems to maximize locality
  – Allocate files sequentially on disk (exploit the track cache)
  – Locate similar files in the same cylinder (reduce seeks)
  – Locate similar files in nearby cylinders (reduce seek distance)
• Pack bits closer together to improve transfer rate and density
• Use a collection of small disks to form a large, high-performance one: a disk array
  – Striping data across multiple disks allows parallel I/O, improving performance
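The striping idea in the last bullet is just an address mapping; `locate_block` below is a hypothetical helper sketching round-robin block striping, not code from any real array controller:

```python
def locate_block(logical_block, num_disks):
    """Round-robin striping: consecutive logical blocks land on
    consecutive disks, so N requests can proceed in parallel."""
    disk = logical_block % num_disks      # which disk holds the block
    offset = logical_block // num_disks   # block index within that disk
    return disk, offset
```

With 4 disks, logical blocks 0–3 map to disks 0–3 at offset 0, so a 4-block sequential read can touch all four disks at once.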
Use Arrays of Small Disks?
[Figure: conventional designs use four disk form factors (14", 10", 5.25", 3.5") from low end to high end; a disk array uses a single 3.5" design]

• Katz and Patterson asked in 1987: can smaller disks be used to close the gap in performance between disks and CPUs?
Replace Small Number of Large Disks with Large Number of Small Disks! (1988 Disks)

            IBM 3390K    IBM 3.5" 0061   x70 (array)   Improvement (x70 vs. 3390K)
Capacity    20 GBytes    320 MBytes      23 GBytes
Volume      97 cu. ft.   0.1 cu. ft.     11 cu. ft.    9X
Power       3 KW         11 W            1 KW          3X
Data Rate   15 MB/s      1.5 MB/s        110 MB/s      8X
I/O Rate    600 I/Os/s   55 I/Os/s       3900 I/Os/s   6X
MTTF        250 KHrs     50 KHrs         ??? Hrs
Cost        $250K        $2K             $150K

Disk arrays have potential for large data and I/O rates, high MB per cu. ft., high MB per KW, but what about reliability?
Array Reliability
• MTTF: Mean Time To Failure: the average time that a non-repairable component will operate before experiencing failure
• Reliability of N disks = reliability of 1 disk ÷ N
  – 50,000 hours ÷ 70 disks = 700 hours
  – Disk system MTTF drops from 6 years to 1 month!
• Arrays without redundancy are too unreliable to be useful!
• Solution: redundancy
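The arithmetic above, as a one-liner (this assumes independent disk failures and a constant failure rate, the same simplification the slide makes):

```python
def array_mttf(disk_mttf_hours, num_disks):
    """MTTF of an array with no redundancy: a single disk's MTTF divided
    by N, since any one disk failure takes the whole array down."""
    return disk_mttf_hours / num_disks

# 70 disks at 50,000 hours each: roughly 714 hours, about one month
mttf = array_mttf(50_000, 70)
```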
Redundant Arrays of (Inexpensive) Disks
• Replicate data over several disks so that no data will be lost if one disk fails
• Redundancy yields high data availability
  – Availability: service still provided to the user, even if some components have failed
• Disks will still fail
  – Contents are reconstructed from data redundantly stored in the array
  – Capacity penalty to store redundant info
  – Bandwidth penalty to update redundant info
Levels of RAID
• Original RAID paper described five categories (RAID levels 1-5). (Patterson et al, “A case for redundant arrays of inexpensive disks (RAID)”, ACM SIGMOD, 1988)
• Disk striping with no redundancy is now called RAID 0 or JBOD (just a bunch of disks).
• Other kinds have been proposed in the literature: Level 6 (P+Q redundancy), Level 10, RAID 53, etc.
• Except for RAID 0, all RAID levels trade disk capacity for reliability, and the extra reliability makes parallelism a practical way to improve performance.
RAID 0: Nonredundant (JBOD)
file data: block 0 | block 1 | block 2 | block 3
           Disk 0    Disk 1    Disk 2    Disk 3

• High I/O performance
• Data is not saved redundantly: a single copy of the data is striped across multiple disks
• Low cost
• Lack of redundancy
• Least reliable: a single disk failure leads to data loss
Redundant Arrays of Inexpensive Disks
RAID 1: Disk Mirroring/Shadowing
• Each disk is fully duplicated onto its "mirror" in a recovery group
  – Very high availability can be achieved
• Bandwidth sacrifice on write: logical write = two physical writes
• Reads may be optimized: minimize queueing and disk search time
• Most expensive solution: 100% capacity overhead
• Targeted for high-I/O-rate, high-availability environments
RAID 2: Memory-Style ECC
[Figure: data bits b0–b3 on data disks, ECC bits f0(b), f1(b) on ECC disks, and a parity disk P(b)]

• Multiple disks record the ECC information needed to determine which disk is at fault
• A parity disk is then used to reconstruct corrupted or lost data
• Needs log2(number of disks) redundancy disks
RAID 3: Bit (Byte) Interleaved Parity
• Only need one parity disk
• Writes/reads access all disks
• Only one request can be serviced at a time
• Easy to implement
• Provides high bandwidth but not high I/O rates
• Targeted for high-bandwidth applications: multimedia, image processing

[Figure: a logical record is bit- or byte-striped across the data disks as physical records; the parity of each stripe is stored on the parity disk P]
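The parity used by RAID 3 (and by RAID 4/5 below) is a bitwise XOR across the stripe, which is exactly what makes single-disk recovery possible. A minimal sketch over byte strings:

```python
def stripe_parity(blocks):
    """XOR all blocks of a stripe byte-wise to produce the parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def recover_block(surviving_blocks, parity):
    """XOR of the parity with all surviving blocks reconstructs the lost one,
    because each byte XORs back out of the running sum."""
    return stripe_parity(list(surviving_blocks) + [parity])
```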
RAID 3
• Sum computed across the recovery group to protect against hard disk failures, stored on the P disk
• Logically a single high-capacity, high-transfer-rate disk: good for large transfers
• Wider arrays reduce capacity cost but decrease availability
  – 12.5% capacity cost for parity in this configuration
• RAID 3 relies on the parity disk to discover errors on read
  – But every sector has an error detection field
  – Relying on the error detection field to catch errors on read, rather than on the parity disk, allows independent reads to different disks simultaneously
• Inspiration for RAID 4
RAID 4: Block Interleaved Parity
Disk 0     Disk 1     Disk 2     Disk 3     Parity disk
block 0    block 1    block 2    block 3    P(0-3)
block 4    block 5    block 6    block 7    P(4-7)
block 8    block 9    block 10   block 11   P(8-11)
block 12   block 13   block 14   block 15   P(12-15)
• Blocks are the striping units
• Allows parallel access by multiple I/O requests: high I/O rates
• Multiple small reads are now faster than before (small read requests can be restricted to a single disk)
• Large writes (full stripe) compute the new parity directly:
  P' = d0' + d1' + d2' + d3'
• Small writes (e.g., a write to d0) update the parity:
  P  = d0 + d1 + d2 + d3
  P' = d0' + d1 + d2 + d3 = P + d0' + d0
  (where "+" denotes XOR)
• However, writes are still very slow, since the parity disk is the bottleneck
Problems of Disk Arrays: Small Writes (read-modify-write procedure)

[Figure: RAID-5 small-write algorithm on a stripe D0 D1 D2 D3 P.
 To write new data D0': (1. Read) old data D0 and (2. Read) old parity P;
 XOR the old data, new data, and old parity to form the new parity P';
 then (3. Write) D0' and (4. Write) P'.]

1 logical write = 2 physical reads + 2 physical writes
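The read-modify-write procedure can be expressed directly in terms of XOR. In this sketch, `read` and `write` are hypothetical callbacks standing in for physical disk I/O, not a real controller API:

```python
def xor_block(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(read, write, data_disk, parity_disk, new_data):
    """RAID-4/5 small write: 1 logical write = 2 physical reads + 2 physical writes."""
    old_data = read(data_disk)              # 1. read old data
    old_parity = read(parity_disk)          # 2. read old parity
    # New parity flips exactly the bits that changed: P' = P xor D0 xor D0'
    new_parity = xor_block(old_parity, xor_block(old_data, new_data))
    write(data_disk, new_data)              # 3. write new data
    write(parity_disk, new_parity)          # 4. write new parity
```

Note that the other data disks in the stripe are never touched, which is why this beats recomputing parity from the full stripe for small writes.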
Inspiration for RAID 5
• RAID 4 works well for small reads
• Small writes (write to one disk):
  – Option 1: read the other data disks, create the new sum, and write it to the parity disk
  – Option 2: since P holds the old sum, compare old data to new data and add the difference to P
• Small writes are limited by the parity disk: writes to D0 and D5 must both also write to the P disk. The parity disk must be updated on every write!

[Figure: RAID 4 layout — stripe 1: D0 D1 D2 D3 P; stripe 2: D4 D5 D6 D7 P]
Redundant Arrays of Inexpensive Disks
RAID 5: High I/O Rate Interleaved Parity
• Independent writes are possible because of interleaved parity

[Figure: parity rotates across the disk columns as logical disk addresses increase:
 D0   D1   D2   D3   P
 D4   D5   D6   P    D7
 D8   D9   P    D10  D11
 D12  P    D13  D14  D15
 P    D16  D17  D18  D19
 D20  D21  D22  D23  P
 ...]

• Example: a write to D0 and D5 uses disks 0, 1, 3, and 4
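The rotated-parity placement in the figure follows a simple pattern. This sketch (the function name is ours, not from any RAID implementation) reproduces the 5-disk layout shown above:

```python
def raid5_layout(num_stripes, num_disks):
    """Rotating parity as in the figure: the parity block of stripe s sits
    on disk (num_disks - 1 - s) mod num_disks, so parity updates are
    spread over all disks and no single disk is a bottleneck."""
    layout, block = [], 0
    for s in range(num_stripes):
        parity_disk = (num_disks - 1 - s) % num_disks
        row = []
        for d in range(num_disks):
            if d == parity_disk:
                row.append("P")
            else:
                row.append(f"D{block}")
                block += 1
        layout.append(row)
    return layout
```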
Comparison of RAID Levels (N disks, each with capacity C)

RAID level | Technique               | Capacity | Advantage                                    | Disadvantage
0          | Striping                | NxC      | Maximum data transfer rate and sizes         | No redundancy
1          | Mirroring               | (N/2)xC  | High performance, fastest write              | Cost
3          | Bit (byte)-level parity | (N-1)xC  | Easy to implement, high error recoverability | Low performance
4          | Block-level parity      | (N-1)xC  | High redundancy and better performance       | Write-related bottleneck
5          | Interleaved parity      | (N-1)xC  | High performance, reliability                | Small-write problem
RAID 6: Recovering from 2 failures
• Why recover from more than one failure?
  – An operator may accidentally replace the wrong disk during a failure
  – Since disk bandwidth is growing more slowly than disk capacity, the mean time to repair a disk in a RAID system is increasing, which raises the chance of a second failure during the longer repair window
  – Reading much more data during reconstruction increases the chance of an uncorrectable media failure, which would result in data loss
RAID 6: Recovering from 2 failures
• Network Appliance’s row-diagonal parity, or RAID-DP
• Like the standard RAID schemes, it uses redundant space based on a parity calculation per stripe
• Since it protects against a double failure, it adds two check blocks per stripe of data
  – If p+1 disks total, p-1 disks have data; assume p = 5
• The row parity disk is just like in RAID 4
  – Even parity across the other 4 data blocks in its stripe
• Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal
Example p = 5
• Row-diagonal parity starts by recovering one of the 4 blocks on the failed disk using diagonal parity
  – Since each diagonal misses one disk, and all diagonals miss a different disk, 2 diagonals are missing only 1 block
• Once the data for those blocks is recovered, the standard RAID recovery scheme can be used to recover two more blocks in the standard RAID 4 stripes
• The process continues until both failed disks are fully restored

Diagonal numbering (each cell shows the diagonal its block belongs to):

Data Disk 0 | Data Disk 1 | Data Disk 2 | Data Disk 3 | Row Parity | Diagonal Parity
0           | 1           | 2           | 3           | 4          | 0
1           | 2           | 3           | 4           | 0          | 1
2           | 3           | 4           | 0           | 1          | 2
3           | 4           | 0           | 1           | 2          | 3
4           | 0           | 1           | 2           | 3          | 4
0           | 1           | 2           | 3           | 4          | 0
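The diagonal numbering in the table follows one rule: a block in row r on disk d belongs to diagonal (r + d) mod p. A tiny sketch for the p = 5 example (variable and function names are ours):

```python
P = 5  # the prime parameter p from the example above

def diagonal(row, disk):
    """Diagonal group of the block at (row, disk) in the numbering table:
    diagonals wrap around modulo p."""
    return (row + disk) % P
```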
Summary: RAID Techniques (goal was performance; popularity due to reliability of storage)
• Disk Mirroring/Shadowing (RAID 1)
  – Each disk is fully duplicated onto its "shadow"
  – Logical write = two physical writes
  – 100% capacity overhead
• Parity Data Bandwidth Array (RAID 3)
  – Parity computed horizontally
  – Logically a single high-data-bandwidth disk
• High I/O Rate Parity Array (RAID 5)
  – Interleaved parity blocks
  – Independent reads and writes
  – Logical write = 2 reads + 2 writes

[Figure: bit patterns illustrating mirroring and horizontal parity]
Recommended