Agenda ??
• Where are we today ??
• File systems
• Interconnects
• Disk technologies
• Solid state devices
• Recipes and Solutions
• Final thoughts …
Pan Galactic Gargle Blaster
"Like having your brains smashed out by a slice of lemon wrapped
around a large gold brick.”
Current Top10 …..
Rank | Name | Computer | Site | Total Cores | Rmax (GFlop/s) | Rpeak (GFlop/s) | Power (kW) | File system | Size | Perf
1 | Tianhe-2 | TH-IVB-FEP Cluster, Xeon E5-2692 12C 2.2GHz, TH Express-2, Intel Xeon Phi | National Super Computer Center in Guangzhou | 3,120,000 | 33,862,700 | 54,902,400 | 17,808 | Lustre/H2FS | 12.4 PB | ~750 GB/s
2 | Titan | Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x | DOE/SC/Oak Ridge National Laboratory | 560,640 | 17,590,000 | 27,112,550 | 8,209 | Lustre | 10.5 PB | 240 GB/s
3 | Sequoia | BlueGene/Q, Power BQC 16C 1.60GHz, Custom Interconnect | DOE/NNSA/LLNL | 1,572,864 | 17,173,224 | 20,132,659 | 7,890 | Lustre | 55 PB | 850 GB/s
4 | K computer | Fujitsu, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | RIKEN AICS | 705,024 | 10,510,000 | 11,280,384 | 12,659 | Lustre | 40 PB | 965 GB/s
5 | Mira | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | DOE/SC/Argonne National Laboratory | 786,432 | 8,586,612 | 10,066,330 | 3,945 | GPFS | 7.6 PB | 88 GB/s
6 | Piz Daint | Cray XC30, Xeon E5-2670 8C 2.60GHz, Aries interconnect, NVIDIA K20x | Swiss National Supercomputing Centre (CSCS) | 115,984 | 6,271,000 | 7,788,853 | 2,325 | Lustre | 2.5 PB | 138 GB/s
7 | Stampede | PowerEdge C8220, Xeon E5-2680 8C 2.7GHz, IB FDR, Intel Xeon Phi | TACC / Univ. of Texas | 462,462 | 5,168,110 | 8,520,112 | 4,510 | Lustre | 14 PB | 150 GB/s
8 | JUQUEEN | BlueGene/Q, Power BQC 16C 1.60GHz, Custom Interconnect | Forschungszentrum Juelich (FZJ) | 458,752 | 5,008,857 | 5,872,025 | 2,301 | GPFS | 5.6 PB | 33 GB/s
9 | Vulcan | BlueGene/Q, Power BQC 16C 1.60GHz, Custom Interconnect | DOE/NNSA/LLNL | 393,216 | 4,293,306 | 5,033,165 | 1,972 | Lustre | 55 PB | 850 GB/s
10 | SuperMUC | iDataPlex DX360M4, Xeon E5-2680 8C 2.70GHz, Infiniband FDR | Leibniz Rechenzentrum | 147,456 | 2,897,000 | 3,185,050 | 3,423 | GPFS | 10 PB | 200 GB/s

N.B. NCSA Blue Waters: 24 PB, 1100 GB/s (Lustre 2.1.3)
Other parallel file systems
• GPFS – Running out of steam ?? – Let me qualify !! (and he then rambles on …..)
• Fraunhofer (now BeeGFS)
  – Excellent metadata performance
  – Many modern features
  – No real HA
• Ceph
  – New, interesting and with a LOT of good features
  – Immature and with a limited track record
• Panasas – Still RAID 5 and running out of steam ….
Object based storage
• A traditional file system includes a hierarchy of files and directories
  – Accessed via a file system driver in the OS
• Object storage is "flat": objects are located by direct reference
  – Accessed via custom APIs
    o Swift, S3, librados, etc.
• The difference boils down to 2 questions:
  o How do you find files?
  o Where do you store metadata?
• Object store + metadata + driver is a filesystem (see the sketch below)
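To make that last bullet concrete, here is a toy sketch (the class names and structure are hypothetical illustrations of mine, not the Swift/S3/librados APIs): the object store only understands opaque IDs, and the whole directory hierarchy lives in a separate metadata map, which is exactly what a file system driver layers on top.

```python
# Toy illustration: a flat object store plus a metadata layer behaves like a file system.
import uuid

class ObjectStore:
    """Flat object store: objects are addressed by opaque IDs, no hierarchy."""
    def __init__(self):
        self._objects = {}              # object_id -> bytes

    def put(self, data: bytes) -> str:
        oid = uuid.uuid4().hex          # direct reference, no path semantics
        self._objects[oid] = data
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]

class ToyFilesystem:
    """'Object store + metadata + driver': paths exist only in the metadata layer."""
    def __init__(self, store: ObjectStore):
        self.store = store
        self.metadata = {}              # path -> object_id (plus size, owner, ... in real systems)

    def write(self, path: str, data: bytes):
        self.metadata[path] = self.store.put(data)

    def read(self, path: str) -> bytes:
        return self.store.get(self.metadata[path])

fs = ToyFilesystem(ObjectStore())
fs.write("/home/alice/results.dat", b"42")
print(fs.read("/home/alice/results.dat"))   # b'42'
```

Where the metadata map lives, and how a path lookup is performed, is exactly where the two questions above come in.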
Object Storage Backend: Why?
• It's more flexible. Interfaces can be presented in other ways, without the FS overhead: a generalized storage architecture vs. a file system
• It's more scalable. POSIX was never intended for clusters, concurrent access, multi-level caching, ILM, usage hints, striping control, etc.
• It's simpler. With the file-system-"isms" removed, an elegant (= scalable, flexible, reliable) foundation can be laid
Elephants all the way down...
Most clustered file systems and object stores are built on local file systems …. and inherit their problems
• Native FS
  o XFS, ext4, ZFS, btrfs
• Object store on FS
  o Ceph on btrfs
  o Swift on XFS
• FS on object store on FS
  o CephFS on Ceph on btrfs
  o Lustre on OSS on ext4
  o Lustre on OSS on ZFS
The way forward ..
• Object-store-based solutions offer a lot of flexibility:
  – Next-generation design, for exascale-level size, performance, and robustness
  – Implemented from scratch
• "If we could design the perfect exascale storage system..."
  – Not limited to POSIX
  – Non-blocking availability of data
  – Multi-core aware
  – Non-blocking execution model with one thread per core (see the toy sketch below)
  – Support for non-uniform hardware: flash, non-volatile memory, NUMA
  – Using abstractions, guided interfaces can be implemented
    o e.g., for burst buffer management (pre-staging and de-staging)
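The "one thread per core" bullet describes a shared-nothing execution pattern rather than any specific product. A minimal Python sketch of the idea follows (the worker layout, hash-based routing and queue plumbing are illustrative choices of mine; processes stand in for core-pinned threads):

```python
# Shared-nothing, per-core workers: each worker owns its shard outright, so there
# are no locks; cross-core coordination is pure message passing.
import multiprocessing as mp

def worker(inbox: mp.Queue, outbox: mp.Queue) -> None:
    shard = {}                              # object_id -> bytes, private to this worker
    while True:
        msg = inbox.get()
        if msg is None:                     # shutdown sentinel
            break
        op, oid, data = msg
        if op == "put":
            shard[oid] = data
            outbox.put((oid, "ok"))
        elif op == "get":
            outbox.put((oid, shard.get(oid)))

if __name__ == "__main__":
    n_workers = 4
    inboxes = [mp.Queue() for _ in range(n_workers)]
    results = mp.Queue()
    procs = [mp.Process(target=worker, args=(q, results)) for q in inboxes]
    for p in procs:
        p.start()
    # Route each object to exactly one owner by hashing its id.
    inboxes[hash("obj-1") % n_workers].put(("put", "obj-1", b"payload"))
    print(results.get())                    # ('obj-1', 'ok')
    for q in inboxes:
        q.put(None)
    for p in procs:
        p.join()
```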
Interconnects (Disk and Fabrics)
• S-ATA 6 Gbit
• FC-AL 8 Gbit
• SAS 6 Gbit
• SAS 12 Gbit
• PCI-E direct attach
• Ethernet
• InfiniBand
• Next gen interconnects …
12 Gbit SAS
• Doubles bandwidth compared to SAS 6 Gbit
• Triples the IOPS !!
• Same connectors and cables :-)
• 4.8 GB/s in each direction, with 9.6 GB/s to follow
  – 2 streams moving to 4 streams (see the lane arithmetic below)
• 24 Gb/s SAS is on the drawing board ….
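As a quick sanity check on those figures, the lane arithmetic (a back-of-the-envelope sketch assuming the usual SAS 8b/10b encoding and a 4-lane wide port; the helper name is mine):

```python
# Back-of-the-envelope SAS throughput: line rate * encoding efficiency * lanes.
# SAS uses 8b/10b encoding, so only 8 of every 10 transmitted bits are payload.
def sas_wide_port_gbytes_per_s(line_rate_gbit: float, lanes: int) -> float:
    payload_gbit = line_rate_gbit * 8 / 10   # strip the 8b/10b overhead
    return payload_gbit * lanes / 8          # bits -> bytes

print(sas_wide_port_gbytes_per_s(6, 4))      # 6 Gbit SAS, x4 wide port:  ~2.4 GB/s per direction
print(sas_wide_port_gbytes_per_s(12, 4))     # 12 Gbit SAS, x4 wide port: ~4.8 GB/s per direction
print(sas_wide_port_gbytes_per_s(12, 8))     # doubling the streams again: ~9.6 GB/s
```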
PCI-E direct attach storage

PCIe Arch | Raw Bit Rate | Data Encoding | Interconnect Bandwidth | BW / Lane / Direction | Total BW for x16 link
PCIe 1.x  | 2.5 GT/s | 8b/10b    | 2 Gb/s | ~250 MB/s | ~8 GB/s
PCIe 2.0  | 5.0 GT/s | 8b/10b    | 4 Gb/s | ~500 MB/s | ~16 GB/s
PCIe 3.0  | 8.0 GT/s | 128b/130b | 8 Gb/s | ~1 GB/s   | ~32 GB/s
PCIe 4.0  | ??       | ??        | ??     | ??        | ??

• M.2 Solid State Storage Modules
• Lowest latency / highest bandwidth
• Limitations in the # of PCI-E channels available
  – Ivy Bridge has up to 40 lanes per chip
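The per-lane and x16 columns in the table follow directly from the raw bit rate and the encoding overhead; a short sketch of that arithmetic:

```python
# PCIe per-lane bandwidth: raw transfer rate * encoding efficiency, in GB/s per direction.
def pcie_lane_gbytes_per_s(gt_per_s: float, payload_bits: int, total_bits: int) -> float:
    return gt_per_s * payload_bits / total_bits / 8   # GT/s -> GB/s per lane

gen1 = pcie_lane_gbytes_per_s(2.5, 8, 10)       # ~0.25 GB/s  (8b/10b)
gen2 = pcie_lane_gbytes_per_s(5.0, 8, 10)       # ~0.5 GB/s   (8b/10b)
gen3 = pcie_lane_gbytes_per_s(8.0, 128, 130)    # ~0.985 GB/s (128b/130b)
print(gen3 * 16)        # x16 link, one direction:  ~15.8 GB/s
print(gen3 * 16 * 2)    # both directions combined: ~31.5 GB/s (the "~32 GB/s" in the table)
```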
Ethernet – Still going strong ??
• Ethernet has now been around for 40 years !!!
• Currently around 41% of Top500 systems …
  – 28% 1 GbE, 13% 10 GbE
• 40 GbE shipping in volume
• 100 GbE being demonstrated
  – Volume shipments expected in 2015
• 400 GbE and 1 TbE are on the drawing board
  – 400 GbE planned for 2017
Next gen interconnects …
• Intel acquired QLogic's InfiniBand team …
• Intel acquired Cray's interconnect technologies …
• Intel has published and demonstrated silicon photonics …
• And this means WHAT ????
Areal density futures
[Chart: areal density (Gigabit / sq. inch, log scale from 10^0 to 10^5) vs. year (1992 to 2020), tracing successive recording technologies (inductive read/write; inductive writing with MR/GMR reading; perpendicular write with GMR reading; HAMR; HAMR + BPM) approaching the superparamagnetic (single-particle) limit.]
Hard drive futures …
• Sealed helium drives (Hitachi)
  – Higher density: 6 platters / 12 heads
  – Less power (~1.6 W idle)
  – Less heat (~4 °C lower temperature)
• SMR drives (Seagate)
  – Denser packaging on current technology
  – Aimed at read-intensive application areas
• Hybrid drives (SSHD)
  – Enterprise edition
  – Transparent SSD/HDD combination (aka Fusion drives)
• eMLC + SAS
Hard drive futures …
• HAMR drives (Seagate)
  – Using a laser to heat the magnetic substrate (iron/platinum alloy)
  – Projected capacity: 30-60 TB per 3.5-inch drive …
  – 2016 timeframe ….
• BPM (bit-patterned media recording)
  – Stores one bit per cell, as opposed to regular hard-drive technology, where each bit is stored across a few hundred magnetic grains
  – Projected capacity: 100+ TB per 3.5-inch drive …
What about RAID ?
• RAID 5 – No longer viable
• RAID 6 – Still OK
  – But rebuild times are becoming prohibitive (see the rebuild-time estimate below)
• RAID 10
  – OK for SSDs and small arrays
• RAID-Z/Z2 etc.
  – A choice, but limited functionality on Linux
• Parity declustered RAID
  – Gaining a foothold everywhere
  – But …. PD-RAID ≠ PD-RAID ≠ PD-RAID ….
• No RAID ???
  – Using multiple distributed copies works, but …
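To put "prohibitive" into numbers, a lower-bound estimate of a classic RAID rebuild (the drive size and write rate below are illustrative assumptions, not measurements):

```python
# A traditional rebuild must rewrite the entire replacement drive, so
# time >= capacity / sustained write rate. Real rebuilds are slower still,
# because the array keeps serving foreground I/O while rebuilding.
def rebuild_hours(capacity_tb: float, write_mb_per_s: float) -> float:
    return capacity_tb * 1_000_000 / write_mb_per_s / 3600

print(rebuild_hours(4, 150))    # 4 TB drive at 150 MB/s:  ~7.4 hours (best case)
print(rebuild_hours(8, 150))    # 8 TB drive at 150 MB/s: ~14.8 hours (best case)
# Parity-declustered RAID spreads the rebuild across many drives, which is why
# it scales where classic RAID 5/6 rebuild times do not.
```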
Flash (NAND)
• Supposed to "Take over the World" [cf. Pinky and the Brain]
• But for high-performance storage there are issues ….
  – Price and density not following the predicted evolution
  – Reliability (even on SLC) not as expected
• Latency issues
  – SLC access ~25 µs, MLC ~50 µs …
  – Larger chips increase contention
    o Once a flash die is accessed, other dies on the same bus must wait
    o Up to 8 flash dies share a bus
  – Address translation, garbage collection and especially wear leveling add significant latency
Flash (NAND)
• MLC – 3-4 bits per cell @ 10K duty cycles
• SLC – 1 bit per cell @ 100K duty cycles
• eMLC – 2 bits per cell @ 30K duty cycles (see the endurance sketch below)
• Disk drive formats
  – S-ATA / SAS bandwidth limitations
• PCI-E accelerators
• PCI-E direct attach
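For a rough sense of what those duty-cycle figures buy in drive lifetime, a sketch that ignores write amplification and over-provisioning (the 1 TB capacity and one-drive-write-per-day workload are illustrative assumptions):

```python
# Endurance budget = capacity * rated program/erase ("duty") cycles.
# Lifetime then depends on how hard the device is written.
def lifetime_years(capacity_tb: float, pe_cycles: int, writes_tb_per_day: float) -> float:
    total_writable_tb = capacity_tb * pe_cycles
    return total_writable_tb / writes_tb_per_day / 365

# 1 TB device, written over once per day (1 drive-write per day):
print(lifetime_years(1, 10_000, 1))    # MLC  @ 10K cycles  -> ~27 years
print(lifetime_years(1, 30_000, 1))    # eMLC @ 30K cycles  -> ~82 years
print(lifetime_years(1, 100_000, 1))   # SLC  @ 100K cycles -> ~274 years
# Write amplification (often 2-10x) and uneven wear cut these numbers down
# sharply, which is why garbage collection and wear leveling matter so much.
```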
NV-RAM
• Flash is essentially NV-RAM, but ….
• Phase Change Memory (PCM)
  – Significantly faster and denser than NAND
  – Based on chalcogenide glass
    o Thermal vs. electronic process
  – More resistant to external factors
• Currently the expected solution for burst buffers etc …
  – but there's always Hybrid Memory Cubes ……
3D RAM (aka Hybrid Memory Cube)
• Comes in many flavours:
  – Stacked DDR3 + logic
  – Stacked NAND
• Demonstrated performance: 480 GB/s from a single memory device
Solutions …
• Size does matter …..
  – 2014 – 2016: >20 proposals for 40+ PB file systems running at 1 – 4 TB/s !!!!
• Heterogeneity is the new buzzword
  – Burst buffers, data capacitors, cache off-loaders …
• Mixed workloads are now taken seriously ….
• Data integrity is paramount
  – T10-PI/DIX is a decent start, but …
• Storage system resiliency is equally important
  – PD-RAID needs to evolve and become system-wide
• Multi-tier storage as standard configs …
• Geographically distributed solutions commonplace
Final thoughts ???
• Object based storage
  – Not an IF but a WHEN ….
  – Flavor(s) still TBD: DAOS, Exascale10, XEOS, ….
• Data management core to any solution
  – Self-aware data, real-time analytics, resource management ranging from the job scheduler to the disk block ….
  – HPC storage = Big Data
• Live data
  – Cache ↔ Near line ↔ Tier 2 ↔ Tape ? ↔ Cloud ↔ Ice
• Compute with storage ➜ Storage with compute

Storage is no longer a 2nd class citizen