
1

HPC Storage Current Status and Futures

Torben Kling Petersen, PhD Principal Architect, HPC

2

• Where are we today ??
• File systems
• Interconnects
• Disk technologies
• Solid state devices
• Recipes and Solutions
• Final thoughts …

Agenda ??

Pan Galactic Gargle Blaster

"Like having your brains smashed out by a slice of lemon wrapped

around a large gold brick.”

3

Current Top10 …..

n.b. NCSA Bluewaters 24 PB 1100 GB/s (Lustre 2.1.3)

Rank | Name | Computer | Site | Total Cores | Rmax (GFlop/s) | Rpeak (GFlop/s) | Power (kW) | File system | Size | Perf
1 | Tianhe-2 | TH-IVB-FEP Cluster, Xeon E5-2692 12C 2.2GHz, TH Express-2, Intel Xeon Phi | National Super Computer Center in Guangzhou | 3,120,000 | 33,862,700 | 54,902,400 | 17,808 | Lustre/H2FS | 12.4 PB | ~750 GB/s
2 | Titan | Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x | DOE/SC/Oak Ridge National Laboratory | 560,640 | 17,590,000 | 27,112,550 | 8,209 | Lustre | 10.5 PB | 240 GB/s
3 | Sequoia | BlueGene/Q, Power BQC 16C 1.60GHz, Custom interconnect | DOE/NNSA/LLNL | 1,572,864 | 17,173,224 | 20,132,659 | 7,890 | Lustre | 55 PB | 850 GB/s
4 | K computer | Fujitsu, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | RIKEN AICS | 705,024 | 10,510,000 | 11,280,384 | 12,659 | Lustre | 40 PB | 965 GB/s
5 | Mira | BlueGene/Q, Power BQC 16C 1.60GHz, Custom interconnect | DOE/SC/Argonne National Laboratory | 786,432 | 8,586,612 | 10,066,330 | 3,945 | GPFS | 7.6 PB | 88 GB/s
6 | Piz Daint | Cray XC30, Xeon E5-2670 8C 2.60GHz, Aries interconnect, NVIDIA K20x | Swiss National Supercomputing Centre (CSCS) | 115,984 | 6,271,000 | 7,788,853 | 2,325 | Lustre | 2.5 PB | 138 GB/s
7 | Stampede | PowerEdge C8220, Xeon E5-2680 8C 2.7GHz, IB FDR, Intel Xeon Phi | TACC/Univ. of Texas | 462,462 | 5,168,110 | 8,520,112 | 4,510 | Lustre | 14 PB | 150 GB/s
8 | JUQUEEN | BlueGene/Q, Power BQC 16C 1.60GHz, Custom interconnect | Forschungszentrum Juelich (FZJ) | 458,752 | 5,008,857 | 5,872,025 | 2,301 | GPFS | 5.6 PB | 33 GB/s
9 | Vulcan | BlueGene/Q, Power BQC 16C 1.60GHz, Custom interconnect | DOE/NNSA/LLNL | 393,216 | 4,293,306 | 5,033,165 | 1,972 | Lustre | 55 PB | 850 GB/s
10 | SuperMUC | iDataPlex DX360M4, Xeon E5-2680 8C 2.70GHz, Infiniband FDR | Leibniz Rechenzentrum | 147,456 | 2,897,000 | 3,185,050 | 3,423 | GPFS | 10 PB | 200 GB/s

4

Lustre® Roadmap

5

• GPFS
  – Running out of steam ?? Let me qualify !! (and he then rambles on …..)
• Fraunhofer (now BeeGFS)
  – Excellent metadata performance
  – Many modern features
  – No real HA
• Ceph
  – New, interesting and with a LOT of good features
  – Immature and with a limited track record
• Panasas
  – Still RAID 5 and running out of steam ….

Other parallel file systems

6

• A traditional file system includes a hierarchy of files and directories
  o Accessed via a file system driver in the OS
• Object storage is "flat"; objects are located by direct reference
  o Accessed via custom APIs
    – Swift, S3, librados, etc.
• The difference boils down to 2 questions:
  o How do you find files?
  o Where do you store metadata?
• Object store + metadata + driver is a file system (a toy sketch follows this slide)

Object based storage
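To make the "object store + metadata + driver is a file system" point concrete, here is a minimal, purely illustrative Python sketch. Every class and method name in it is invented for this example (it is not the S3, Swift or librados API): a flat store addressed only by object IDs, plus a separate metadata layer that turns path lookups into direct references.

    import uuid

    class FlatObjectStore:
        """Flat namespace: objects are found only by their ID (direct reference)."""
        def __init__(self):
            self._objects = {}            # object_id -> bytes

        def put(self, data: bytes) -> str:
            oid = uuid.uuid4().hex        # the store hands back an opaque ID
            self._objects[oid] = data
            return oid

        def get(self, oid: str) -> bytes:
            return self._objects[oid]

    class MetadataLayer:
        """The hierarchy lives here, not in the store: path -> object ID."""
        def __init__(self):
            self._index = {}              # "/dir/file" -> object_id

        def bind(self, path: str, oid: str):
            self._index[path] = oid

        def lookup(self, path: str) -> str:
            return self._index[path]

    class ToyFilesystem:
        """Object store + metadata + a thin 'driver' = a file system view."""
        def __init__(self):
            self.store = FlatObjectStore()
            self.meta = MetadataLayer()

        def write(self, path: str, data: bytes):
            self.meta.bind(path, self.store.put(data))

        def read(self, path: str) -> bytes:
            return self.store.get(self.meta.lookup(path))

    fs = ToyFilesystem()
    fs.write("/scratch/job42/output.dat", b"results")
    print(fs.read("/scratch/job42/output.dat"))       # b'results'

The two questions from the slide map directly onto the two classes: "how do you find files?" is answered by MetadataLayer, and "where do you store metadata?" is whatever backs its index; the object store itself stays flat.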

7

• It's more flexible. Interfaces can be presented in other ways, without the FS overhead: a generalized storage architecture vs. a file system.
• It's more scalable. POSIX was never intended for clusters, concurrent access, multi-level caching, ILM, usage hints, striping control, etc.
• It's simpler. With the file-system-"isms" removed, an elegant (= scalable, flexible, reliable) foundation can be laid.

Object Storage Backend: Why?

8

Most clustered file systems and object stores are built on local file systems – and inherit their problems:
• Native FS
  o XFS, ext4, ZFS, btrfs
• Object store on FS
  o Ceph on btrfs
  o Swift on XFS
• FS on object store on FS
  o CephFS on Ceph on btrfs
  o Lustre on OSS on ext4
  o Lustre on OSS on ZFS

Elephants all the way down...

9

• Object-store based solutions offer a lot of flexibility:
  – Next-generation design, for exascale-level size, performance, and robustness
  – Implemented from scratch
    • "If we could design the perfect exascale storage system..."
  – Not limited to POSIX
  – Non-blocking availability of data
  – Multi-core aware
  – Non-blocking execution model with thread-per-core
  – Support for non-uniform hardware
    • Flash, non-volatile memory, NUMA
  – Using abstractions, guided interfaces can be implemented
    • e.g., for burst buffer management (pre-staging and de-staging)

The way forward ..

10

• S-ATA 6 Gbit
• FC-AL 8 Gbit
• SAS 6 Gbit
• SAS 12 Gbit
• PCI-E direct attach

• Ethernet
• Infiniband
• Next gen interconnect …

Interconnects (Disk and Fabrics)

11

• Doubles bandwidth compared to SAS 6 Gbit
• Triples the IOPS !!
• Same connectors and cables
• 4.8 GB/s in each direction, with 9.6 GB/s to follow
  – 2 streams moving to 4 streams
• 24 Gb/s SAS is on the drawing board …. (a back-of-the-envelope check of the 4.8 GB/s figure follows this slide)

12 Gbit SAS
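The 4.8 GB/s figure above can be sanity-checked from the line rate; a small sketch, assuming 8b/10b encoding and a 4-lane wide port (stated here as assumptions):

    # Back-of-the-envelope check of the 4.8 GB/s per direction figure,
    # assuming 8b/10b line encoding and a 4-lane (wide port) SAS 12G link.
    line_rate_gbps = 12.0            # Gbit/s per lane
    encoding_efficiency = 8 / 10     # 8b/10b: 8 payload bits per 10 line bits
    lanes = 4                        # typical wide port

    per_lane_GBps = line_rate_gbps * encoding_efficiency / 8   # Gbit/s -> GByte/s
    wide_port_GBps = per_lane_GBps * lanes

    print(f"per lane:  {per_lane_GBps:.1f} GB/s")    # 1.2 GB/s
    print(f"wide port: {wide_port_GBps:.1f} GB/s")   # 4.8 GB/s per direction

In this simple model the projected 9.6 GB/s figure is the same calculation with twice the lane count.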

12

PCI-E direct attach storage

PCIe Arch | Raw Bit Rate | Data Encoding | Interconnect Bandwidth | BW per Lane per Direction | Total BW for x16 Link
PCIe 1.x | 2.5 GT/s | 8b/10b | 2 Gb/s | ~250 MB/s | ~8 GB/s
PCIe 2.0 | 5.0 GT/s | 8b/10b | 4 Gb/s | ~500 MB/s | ~16 GB/s
PCIe 3.0 | 8.0 GT/s | 128b/130b | 8 Gb/s | ~1 GB/s | ~32 GB/s
PCIe 4.0 | ?? | ?? | ?? | ?? | ??

• M.2 Solid State Storage Modules
• Lowest latency / highest bandwidth
• Limitations in # of PCI-E channels available
  – Ivy Bridge has up to 40 lanes per chip
(the per-lane figures in the table are reproduced in the sketch after this slide)
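The per-lane and x16 numbers in the table follow directly from the raw bit rate and the encoding overhead; a short sketch reproducing them (the x16 totals count both directions, matching the table's convention):

    # Reproduce the PCIe per-lane and x16 bandwidth figures from
    # raw bit rate and encoding efficiency; x16 totals count both directions.
    generations = {
        # name: (raw GT/s, payload bits, line bits)
        "PCIe 1.x": (2.5, 8, 10),
        "PCIe 2.0": (5.0, 8, 10),
        "PCIe 3.0": (8.0, 128, 130),
    }

    for name, (gt_s, payload, line) in generations.items():
        per_lane_GBps = gt_s * payload / line / 8     # GT/s -> GB/s per lane per direction
        x16_total_GBps = per_lane_GBps * 16 * 2       # 16 lanes, both directions
        print(f"{name}: ~{per_lane_GBps:.2f} GB/s per lane, ~{x16_total_GBps:.0f} GB/s for x16")

Running it gives ~0.25, ~0.50 and ~0.98 GB/s per lane, and ~8, ~16 and ~32 GB/s for x16, which matches the rounded entries in the table.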

13

• Ethernet has now been around for 40 years !!!
• Currently around 41% of Top500 systems …
  – 28% 1 GbE
  – 13% 10 GbE
• 40 GbE shipping in volume
• 100 GbE being demonstrated
  – Volume shipments expected in 2015
• 400 GbE and 1 TbE are on the drawing board
  – 400 GbE planned for 2017

Ethernet – Still going strong ??

14

Infiniband …

15

• Intel acquired the QLogic Infiniband team …
• Intel acquired Cray's interconnect technologies …
• Intel has published and shown silicon photonics …

• And this means WHAT ????

Next gen interconnects …

16

Disk drive technologies

NEW Seagate 1 PB drive !!!

17

Areal density futures

[Chart: areal density (Gbit per square inch, log scale from 10^0 to 10^5) versus year, 1992 to 2020, showing successive recording technologies – inductive read/write, inductive writing with MR/GMR reading, perpendicular write with GMR reading, HAMR, and HAMR + BPM – approaching the superparamagnetic (single-particle) limit.]

18

Disk write technology ….


19

• Sealed Helium Drives (Hitachi)
  – Higher density – 6 platters / 12 heads
  – Less power (~1.6 W idle)
  – Less heat (~4°C lower temperature)
• SMR drives (Seagate)
  – Denser packaging on current technology
  – Aimed at read-intensive application areas
• Hybrid drives (SSHD)
  – Enterprise edition
  – Transparent SSD/HDD combination (aka Fusion drives)
  – eMLC + SAS

Hard drive futures …

20

SMR drive deep-dive

21

• HAMR drives (Seagate)
  – Using a laser to heat the magnetic substrate (iron/platinum alloy)
  – Projected capacity – 30-60 TB per 3.5 inch drive …
  – 2016 timeframe ….
• BPM (bit-patterned media recording)
  – Stores one bit per cell, as opposed to regular hard-drive technology, where each bit is stored across a few hundred magnetic grains
  – Projected capacity – 100+ TB per 3.5 inch drive …

Hard drive futures …

22

• RAID 5 – no longer viable
• RAID 6 – still OK
  – But rebuild times are becoming prohibitive (see the rough estimate after this slide)
• RAID 10
  – OK for SSDs and small arrays
• RAID Z/Z2 etc.
  – A choice, but limited functionality on Linux
• Parity Declustered RAID
  – Gaining a foothold everywhere
  – But …. PD-RAID ≠ PD-RAID ≠ PD-RAID ….
• No RAID ???
  – Using multiple distributed copies works, but …

What about RAID ?
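To put "rebuild times are becoming prohibitive" into numbers, here is a rough, illustrative estimate; the drive capacity and sustained rate below are assumed figures, not measurements, and real rebuilds run slower because the array keeps serving I/O:

    # Rough lower bound on RAID rebuild time: the replacement drive must be
    # rewritten end to end at, at best, its sustained sequential rate.
    # Both input numbers are illustrative assumptions.
    drive_capacity_TB = 4.0          # assumed nearline drive size
    sustained_rate_MBps = 150.0      # assumed sequential write rate

    capacity_MB = drive_capacity_TB * 1e6
    hours = capacity_MB / sustained_rate_MBps / 3600
    print(f"best-case rebuild: {hours:.1f} hours")        # ~7.4 hours

    # With the array still in production the rebuild rate often drops sharply;
    # at a quarter of the rate the same drive takes about four times as long.
    print(f"at 25% of the rate: {hours * 4:.0f} hours")   # ~30 hours

Doubling the drive size doubles both numbers, which is why ever-larger drives push RAID 6 rebuild windows toward days and make declustered approaches attractive.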

23

• Supposed to "Take over the World" [cf. Pinky and the Brain]
• But for high performance storage there are issues ….
  – Price and density not following predicted evolution
  – Reliability (even on SLC) not as expected
• Latency issues
  – SLC access ~25 µs, MLC ~50 µs …
  – Larger chips increase contention
    • Once a flash die is accessed, other dies on the same bus must wait
    • Up to 8 flash dies share a bus
• Address translation, garbage collection and especially wear leveling add significant latency

Flash (NAND)

24

• MLC – 3-4 bits per cell @ 10K duty cycles
• SLC – 1 bit per cell @ 100K duty cycles
• eMLC – 2 bits per cell @ 30K duty cycles
• Disk drive formats
  – S-ATA / SAS bandwidth limitations
• PCI-E accelerators
• PCI-E direct attach
(a quick endurance estimate based on these cycle counts follows this slide)

Flash (NAND)
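One way to read those cycle counts: total write endurance scales roughly with capacity times program/erase cycles. A minimal sketch with an assumed 400 GB device and the cycle counts from the slide (write amplification is ignored, so real-world endurance is lower):

    # Illustrative endurance estimate from the cycle counts above.
    # The 400 GB capacity is an assumed example; write amplification is ignored.
    capacity_GB = 400

    for cell_type, cycles in [("SLC", 100_000), ("eMLC", 30_000), ("MLC", 10_000)]:
        lifetime_writes_PB = capacity_GB * cycles / 1e6    # GB * cycles -> PB written
        print(f"{cell_type:5s}: ~{lifetime_writes_PB:,.0f} PB written over the device lifetime")

For this assumed device that works out to roughly 40 PB for SLC, 12 PB for eMLC and 4 PB for MLC, which is why SLC and eMLC dominate enterprise parts despite the cost per bit.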

25

• Flash is essentially NV-RAM but ….
• Phase Change Memory (PCM)
  – Significantly faster and more dense than NAND
  – Based on chalcogenide glass
• Thermal vs electronic process
  – More resistant to external factors
• Currently the expected solution for burst buffers etc …
  – But there's always Hybrid Memory Cubes ……

NV-RAM

26

• Comes in many flavours:
  – Stacked DDR3 + logic
  – Stacked NAND
• Demonstrated performance: 480 GB/s from a single memory device

3D RAM (aka Hybrid Memory Cube)

27

• Size does matter …..
  – 2014 – 2016: >20 proposals for 40+ PB file systems, running at 1 – 4 TB/s !!!!
• Heterogeneity is the new buzzword
  – Burst buffers, data capacitors, cache off-loaders …
• Mixed workloads are now taken seriously ….
• Data integrity is paramount
  – T10-PI/DIX is a decent start but …
• Storage system resiliency is equally important
  – PD-RAID needs to evolve and become system wide
• Multi-tier storage as standard configs …
• Geographically distributed solutions commonplace

Solutions …

28

• Object based storage
  – Not an IF but a WHEN ….
  – Flavor(s) still TBD – DAOS, Exascale10, XEOS, ….
• Data management core to any solution
  – Self-aware data, real-time analytics, resource management ranging from job scheduler to disk block ….
  – HPC storage = Big Data
• Live data
  – Cache ↔ Near line ↔ Tier2 ↔ Tape ? ↔ Cloud ↔ Ice
• Compute with storage ➜ Storage with compute

Final thoughts ???

Storage is no longer a 2nd class citizen

29

Thank You [email protected]