Site-Wide Storage Use Case and Early User Experience with Infinite Memory Engine Tommy Minyard Texas Advanced Computing Center DDN User Group Meeting November 17, 2014




In this deck from the DDN User Group at SC14, Tommy Minyard from TACC presents: Site-wide Storage Use Case and Early User Experience with Infinite Memory Engine. "IME unleashes a new I/O provisioning paradigm. This breakthrough, software defined storage application introduces a whole new tier of transparent, extendable, non-volatile memory (NVM) that provides game-changing latency reduction and greater bandwidth and IOPS performance for the next generation of performance hungry scientific, analytic and big data applications – all while offering significantly greater economic and operational efficiency than today’s traditional disk-based and all flash array storage approaches that are currently used to scale performance." Watch the video presentation: http://insidehpc.com/2014/12/site-wide-storage-use-case-early-user-experience-infinite-memory-engine/


Page 1: Tacc Infinite Memory Engine

Site-Wide Storage Use Case and Early User Experience with Infinite Memory Engine

Tommy Minyard Texas Advanced Computing Center

DDN User Group Meeting November 17, 2014

Page 2: Tacc Infinite Memory Engine

TACC Mission & Strategy The mission of the Texas Advanced Computing Center is to enable scientific discovery and enhance society through the application of advanced computing technologies.

To accomplish this mission, TACC:

–  Evaluates, acquires & operates advanced computing systems

–  Provides training, consulting, and documentation to users

–  Collaborates with researchers to apply advanced computing techniques

–  Conducts research & development to produce new computational technologies

[Diagram: Resources & Services / Research & Development]

Page 3: Tacc Infinite Memory Engine

TACC Storage Needs

•  Cluster-specific storage
  –  High performance (tens to hundreds of GB/s bandwidth)
  –  Large capacity (~2TB per Teraflop), purged frequently
  –  Very scalable to thousands of clients

•  Center-wide persistent storage
  –  Global filesystem available on all systems
  –  Very large capacity, quota enabled
  –  Moderate performance, very reliable, high availability

•  Permanent archival storage
  –  Maximum capacity, tens of PBs
  –  Slow performance, tape-based offline storage with spinning storage cache

Page 4: Tacc Infinite Memory Engine

History of DDN at TACC

•  2006 – Lonestar 3 with DDN S2A9500 controllers and 120TB of disk

•  2008 – Corral with DDN S2A9900 controller and 1.2PB of disk

•  2010 – Lonestar 4 with DDN SFA10000 controllers with 1.8PB of disk

•  2011 – Corral upgrade with DDN SFA10000 controllers and 5PB of disk

Page 5: Tacc Infinite Memory Engine

Global Filesystem Requirements

•  User requests for persistent storage available on all production systems
  –  Corral limited to UT System users only

•  RFP issued for storage system capable of:
  –  At least 20PB of usable storage
  –  At least 100GB/s aggregate bandwidth
  –  High availability and reliability

•  DDN proposal selected for project

Page 6: Tacc Infinite Memory Engine

Stockyard: Design and Setup

•  A Lustre 2.4.2-based global filesystem, with scalability for future upgrades

•  Scalable Unit (SU): 16 OSS nodes providing access to 168 OSTs of RAID6 arrays from two SFA12k couplets, corresponding to 5PB capacity and 25+ GB/s throughput per SU

•  Four SUs provide 25PB raw with >100GB/s

•  16 initial LNET routers for external mounts

Page 7: Tacc Infinite Memory Engine

Scalable Unit (One server rack with two DDN SFA12k couplet racks)

Page 8: Tacc Infinite Memory Engine

Scalable Unit Hardware Details

•  SFA12k rack: 50U rack with 8x L6-30p
•  SFA12k couplet with 16 IB FDR ports (direct attachment to the 16 OSS servers)
•  84-slot SS8460 drive enclosures (10 per rack, 20 enclosures per SU)
•  4TB 7200RPM NL-SAS drives
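
A back-of-the-envelope check (not from the slides) of the per-SU capacity and system totals, assuming each OST is an 8+2 RAID6 array; that layout is an assumption, though it is consistent with 1,680 drives feeding 168 OSTs per SU:

# Sanity check of the Stockyard scalable-unit (SU) numbers quoted above.
# Assumption: RAID6 8+2 OSTs (10 drives each), consistent with
# 20 enclosures x 84 slots = 1680 drives and 168 OSTs per SU.

DRIVE_TB = 4                # 4TB NL-SAS drives
ENCLOSURES_PER_SU = 20      # 10 per SFA12k rack, two racks per SU
SLOTS_PER_ENCLOSURE = 84
OSTS_PER_SU = 168
DATA_DRIVES_PER_OST = 8     # assumed 8+2 RAID6
SU_COUNT = 4

drives_per_su = ENCLOSURES_PER_SU * SLOTS_PER_ENCLOSURE                   # 1680
raw_pb_per_su = drives_per_su * DRIVE_TB / 1000.0                         # ~6.7 PB raw
usable_pb_per_su = OSTS_PER_SU * DATA_DRIVES_PER_OST * DRIVE_TB / 1000.0  # ~5.4 PB usable

print(f"per SU: {raw_pb_per_su:.1f} PB raw, {usable_pb_per_su:.1f} PB usable")
print(f"system: {SU_COUNT * raw_pb_per_su:.0f} PB raw, {SU_COUNT * usable_pb_per_su:.0f} PB usable")
# Roughly matches the quoted ~5 PB per SU, ~25 PB raw and 20 PB usable
# (hot spares and filesystem overhead account for the remaining gap).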

Page 9: Tacc Infinite Memory Engine

Stockyard Logical Layout

Page 10: Tacc Infinite Memory Engine

Stockyard: Installation

Page 11: Tacc Infinite Memory Engine

Stockyard: Capabilities and Features

•  20PB usable capacity with 100+ GB/s aggregate bandwidth

•  Client systems can add LNET routers to connect to the Stockyard core IB switches, or connect to the built-in LNET routers using either IB (FDR14) or TCP (10GigE)

•  Automatic failover with Corosync and Pacemaker

Page 12: Tacc Infinite Memory Engine

Stockyard: Performance

•  Local storage testing surpassed 100GB/s

•  Initial bandwidth from Stampede compute clients using Lustre 2.1.6 and 16 routers: 65GB/s with 256 clients (IOR, POSIX, file-per-process, 8 tasks per node; a sketch of this type of run follows below)

•  After upgrade of Stampede clients to Lustre 2.5.2: 75GB/s

•  Added 8 LNET routers to connect Maverick visualization system: 38GB/s
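
For reference, a minimal sketch of the kind of file-per-process IOR run described above; the launcher, task counts, block/transfer sizes, and target path are illustrative assumptions rather than the exact parameters used on Stampede:

# Minimal sketch of a POSIX, file-per-process IOR run (as in the bullet above).
# Launcher, sizes, and paths are assumptions, not the actual TACC settings.
import subprocess

NODES = 32
TASKS_PER_NODE = 8                          # slide: 8 IOR tasks per node
TARGET = "/stockyard/ior_test/testfile"     # hypothetical path on the global filesystem

cmd = [
    "mpirun", "-np", str(NODES * TASKS_PER_NODE),
    "ior",
    "-a", "POSIX",   # POSIX I/O API
    "-F",            # file per process ("fpp")
    "-b", "4g",      # per-task block size (assumed)
    "-t", "4m",      # transfer size (assumed)
    "-w", "-r",      # measure writes and then reads
    "-e",            # fsync after writes so results reflect storage, not page cache
    "-o", TARGET,
]
subprocess.run(cmd, check=True)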

Page 13: Tacc Infinite Memory Engine

Failover Testing

•  OSS failover test setup and results

•  Procedure (see the scripted sketch at the end of this slide):
  –  Identify the OSTs for the test pair
  –  Initiate write processes targeted to the particular OSTs, each about 67GB in size so that it does not finish before the failover
  –  Interrupt one of the OSS servers with a shutdown using ipmitool
  –  Record the individual write process outputs as well as server- and client-side Lustre messages
  –  Compare and confirm the recovery and operation of the failover pair with all OSTs

•  All I/O completes within 2 minutes of failover

Client log:
Oct 9 14:25:43 gsfs-lnet-006 kernel: : LustreError: 11-0: gsfs-OST00a1-osc-ffff88181bc84000: Communicating with 192.168.202.13@o2ib100, operation ost_write failed with -19.
Oct 9 14:25:43 gsfs-lnet-006 kernel: : Lustre: gsfs-OST00a1-osc-ffff88181bc84000: Connection to gsfs-OST00a1 (at 192.168.202.13@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
Oct 9 14:26:08 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1381346761/real 0] req@ffff8815889f1c00 x1448277233942752/t0(0) o8->[email protected]@o2ib100:28/4 lens 400/544 e 0 to 1 dl 1381346768 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 9 14:26:25 gsfs-lnet-006 kernel: : Lustre: 13697:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1381346761/real 1381346785] req@ffff881420cc3400 x1448277233942464/t0(0) o400->[email protected]@o2ib100:28/4 lens 224/224 e 0 to 1 dl 1381346768 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 9 14:26:25 gsfs-lnet-006 kernel: : Lustre: gsfs-OST0059-osc-ffff88181bc84000: Connection to gsfs-OST0059 (at 192.168.202.13@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
Oct 9 14:26:35 gsfs-lnet-006 kernel: : Lustre: gsfs-OST0099-osc-ffff88181bc84000: Connection to gsfs-OST0099 (at 192.168.202.13@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
Oct 9 14:27:02 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1381346816/real 0] req@ffff880e4fc36400 x1448277235383180/t0(0) o8->[email protected]@o2ib100:28/4 lens 400/544 e 0 to 1 dl 1381346822 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 9 14:27:02 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 24 previous similar messages
Oct 9 14:27:11 gsfs-lnet-006 kernel: : Lustre: gsfs-OST0059-osc-ffff88181bc84000: Connection restored to gsfs-OST0059 (at 192.168.202.12@o2ib100)
Oct 9 14:27:11 gsfs-lnet-006 kernel: : Lustre: Skipped 5 previous similar messages
Oct 9 14:27:21 gsfs-lnet-006 kernel: : Lustre: gsfs-OST0061-osc-ffff88181bc84000: Connection restored to gsfs-OST0061 (at 192.168.202.12@o2ib100)
Oct 9 14:27:30 gsfs-lnet-006 kernel: : Lustre: gsfs-OST0071-osc-ffff88181bc84000: Connection restored to gsfs-OST0071 (at 192.168.202.12@o2ib100)
Oct 9 14:27:46 gsfs-lnet-006 kernel: : Lustre: gsfs-OST00a1-osc-ffff88181bc84000: Connection restored to gsfs-OST00a1 (at 192.168.202.12@o2ib100)
Oct 9 14:27:46 gsfs-lnet-006 kernel: : Lustre: Skipped 6 previous similar messages
Oct 9 14:39:51 gsfs-lnet-006 kernel: : LustreError: 11-0: gsfs-OST0059-osc-ffff88181bc84000: Communicating with 192.168.202.12@o2ib100, operation obd_ping failed with -107.
Oct 9 14:39:51 gsfs-lnet-006 kernel: : Lustre: gsfs-OST0061-osc-ffff88181bc84000: Connection to gsfs-OST0061 (at 192.168.202.12@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
Oct 9 14:39:51 gsfs-lnet-006 kernel: : LustreError: Skipped 6 previous similar messages
Oct 9 14:40:16 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1381347616/real 1381347616] req@ffff88180d37e800 x1448277242472716/t0(0) o8->[email protected]@o2ib100:28/4 lens 400/544 e 0 to 1 dl 1381347622 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 9 14:40:16 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Oct 9 14:41:16 gsfs-lnet-006 kernel: : Lustre: gsfs-OST0059-osc-ffff88181bc84000: Connection restored to gsfs-OST0059 (at 192.168.202.13@o2ib100)
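
A sketch of how the procedure above could be scripted; hostnames, OST indices, credentials, and the exact ipmitool action are placeholders standing in for whatever was actually used:

# Sketch of the OSS failover test procedure on this slide.
# OST indices, paths, BMC hostname/credentials and the ipmitool action are
# placeholders; the slide only states that ipmitool was used for the shutdown.
import os
import subprocess

OST_INDICES = [0x59, 0x61, 0x71, 0x99, 0xA1]   # example OSTs served by the OSS under test
TEST_DIR = "/stockyard/failover_test"          # hypothetical test directory on the filesystem
OSS_BMC = "oss-under-test-ipmi"                # BMC of the OSS being interrupted
WRITE_GB = 67                                  # large enough not to finish before the failover

os.makedirs(TEST_DIR, exist_ok=True)
writers = []
for idx in OST_INDICES:
    path = f"{TEST_DIR}/ost{idx:04x}.dat"
    # Pin each file to a single OST so the writes exercise the failover pair.
    subprocess.run(["lfs", "setstripe", "-c", "1", "-i", str(idx), path], check=True)
    # Long-running write; dd is one simple way to generate it.
    writers.append(subprocess.Popen(
        ["dd", "if=/dev/zero", f"of={path}", "bs=1M", f"count={WRITE_GB * 1024}"]))

# Interrupt one OSS of the pair via its BMC while the writes are in flight.
subprocess.run(["ipmitool", "-H", OSS_BMC, "-U", "admin", "-P", "password",
                "chassis", "power", "off"], check=True)

# With failover working, all writes should complete once recovery finishes.
for w in writers:
    w.wait()
print("writes complete; compare client- and server-side Lustre logs for recovery")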

Page 14: Tacc Infinite Memory Engine

Failover Testing (cont’d)

•  Similarly for the MDS pair: the same sequence of interrupted I/O and collection of Lustre messages on both servers and clients; the client-side log shows the recovery:

–  Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1381348698/real 0] req@ffff88180cfcd000 x1448277242593528/t0(0) o250->MGC192.168.200.10@[email protected]@o2ib100:26/25 lens 400/544 e 0 to 1 dl 1381348704 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1

–  Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 1 previous similar message

–  Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: Evicted from MGS (at MGC192.168.200.10@o2ib100_1) after server handle changed from 0xb9929a99b6d258cd to 0x6282da9e97a66646

–  Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: MGC192.168.200.10@o2ib100: Connection restored to MGS (at 192.168.200.11@o2ib100)

Page 15: Tacc Infinite Memory Engine

Infinite Memory Engine Evaluation

•  As with most HPC filesystems, applications rarely sustain the full bandwidth capability of the filesystem

•  Really need the capacity of many disk spindles while handling bursts of I/O activity

•  Stampede was used to evaluate IME at scale, with the old /work filesystem as backend storage

Page 16: Tacc Infinite Memory Engine

IME Evaluation Hardware

•  Old Stampede /work filesystem hardware
  –  Eight storage servers, 64 drives each
  –  Lustre 2.5.2 server version
  –  Capable of 24GB/s peak performance
  –  At ~50% of capacity from previous use

•  IME hardware configuration
  –  Eight DDN IME servers fully populated with SSDs
  –  Two FDR IB connections per server
  –  80GB/s peak performance

Page 17: Tacc Infinite Memory Engine

Initial IME Evaluation

•  First testing showed bottlenecks, with write performance reaching only 40GB/s

•  IB topology identified as the culprit: 12 of the IB ports were connected to a single IB switch with only 8 uplinks to the core switches (a rough check of this follows below)

•  Redistributing the IME IB links to switches without oversubscription resolved the bottleneck

•  Performance increased to almost 80GB/s after moving the IB connections
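
A rough check (not from the slides) of why the original cabling capped out near 40GB/s, assuming roughly 6 GB/s of usable bandwidth per FDR link; the per-link figure is an assumption:

# Rough check of the IB oversubscription bottleneck described above.
# Assumption: ~6 GB/s of deliverable bandwidth per FDR 4x link.

FDR_GBS = 6.0           # assumed usable GB/s per FDR link
IME_LINKS_TOTAL = 16    # 8 IME servers x 2 FDR ports each
LINKS_ON_ONE_LEAF = 12  # IME ports that landed on a single leaf switch
UPLINKS_ON_LEAF = 8     # uplinks from that leaf to the core switches

demand = LINKS_ON_ONE_LEAF * FDR_GBS    # ~72 GB/s offered through the leaf
ceiling = UPLINKS_ON_LEAF * FDR_GBS     # ~48 GB/s the uplinks can carry
print(f"oversubscription: {LINKS_ON_ONE_LEAF / UPLINKS_ON_LEAF:.2f}:1")
print(f"leaf demand ~{demand:.0f} GB/s vs uplink ceiling ~{ceiling:.0f} GB/s")
# With static IB routing the achievable throughput typically falls short of the
# ceiling, consistent with the ~40GB/s initially observed; spreading the 16
# links across non-oversubscribed switches lets the servers reach ~80GB/s.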

Page 18: Tacc Infinite Memory Engine

HACC_IO @ TACC Cosmology Kernel

[Diagram: compute cluster I/O through the IME burst buffer at 80 GB/s vs. directly to the Lustre PFS at 17 GB/s]

Particles per Process   Num. Clients   IME Write (GB/s)   IME Read (GB/s)   PFS Write (GB/s)   PFS Read (GB/s)
34M                     128            62.8               63.7              2.2                9.8
34M                     256            68.9               71.2              4.6                6.5
34M                     512            73.2               71.4              9.1                7.5
34M                     1024           63.2               70.8              17.3               8.2

IME acceleration: 3.7x-28x (writes), 6.5x-11x (reads)
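
The acceleration range on this slide is simply the per-row ratio of IME to PFS bandwidth; a quick recomputation from the table values above:

# Recompute the HACC_IO acceleration ranges from the table above
# (per-row ratio of IME bandwidth to PFS bandwidth).
rows = [
    # clients, ime_write, ime_read, pfs_write, pfs_read  (GB/s)
    (128,  62.8, 63.7,  2.2, 9.8),
    (256,  68.9, 71.2,  4.6, 6.5),
    (512,  73.2, 71.4,  9.1, 7.5),
    (1024, 63.2, 70.8, 17.3, 8.2),
]
write_accel = [iw / pw for _, iw, _, pw, _ in rows]
read_accel = [ir / pr for _, _, ir, _, pr in rows]
print(f"write acceleration: {min(write_accel):.1f}x - {max(write_accel):.1f}x")  # ~3.7x - 28.5x
print(f"read acceleration:  {min(read_accel):.1f}x - {max(read_accel):.1f}x")    # ~6.5x - 11.0x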

Page 19: Tacc Infinite Memory Engine

S3D @ TACC Turbulent Combustion Kernel

[Diagram: compute cluster I/O through the IME burst buffer at 60.8 GB/s vs. directly to the Lustre PFS at 3.3 GB/s]

Processes   X      Y       Z     IME Write (GB/s)   PFS Write (GB/s)   Acceleration
16          1024   1024    128   8.2                1.2                6.8x
32          1024   2048    128   14.0               1.5                9.3x
64          1024   4096    128   22.3               1.5                14.9x
128         1024   8192    128   31.8               3.0                10.6x
256         1024   16384   128   44.7               2.6                17.2x
512         1024   32768   128   53.5               2.4                22.3x
1024        1024   65536   128   60.8               3.3                18.4x

Page 20: Tacc Infinite Memory Engine

MADBench @ TACC

[Diagram: compute cluster I/O through the IME burst buffer at 70+ GB/s vs. directly to the Lustre PFS at 8.7 GB/s]

Phase   IME Read (GB/s)   IME Write (GB/s)   PFS Read (GB/s)   PFS Write (GB/s)
S       -                 71.9               -                 7.1
W       74.6              75.5               7.8               8.7
C       74.7              -                  11.9              -

IME acceleration: 6.2x-9.6x (reads), 8.7x-10.1x (writes)

Application configuration: NP = 3136, #Bins = 8, #pix = 265K

Page 21: Tacc Infinite Memory Engine

Summary

•  Storage capacity and performance needs growing at exponential rate

•  High-performance and reliable filesystems critical for HPC productivity

•  Current best solution for cost, performance and scalability is Lustre-based filesystem

•  Initial IME testing demonstrated scalability and capability on large scale system