DDN | Burst Buffer 1 HPC Advisory Council Perth July … · ddn.com Any statements or representations around future events are subject to change. 1! 1 DDN | Burst Buffer HPC Advisory

ddn.com © 2016 DataDirect Networks, Inc. * Other names and brands may be claimed as the property of others. Any statements or representations around future events are subject to change.

1!

1 DDN | Burst Buffer HPC Advisory Council Perth July 2017 Justin Glen & Daniel Richards


2! Agenda

▶  What is Burst Buffer? ▶  What is it for? ▶  Recommendations ▶  Regional Solutions ▶  Performance Benchmarks


3!DDN | About Us Solving HPC, Enterprise Big Data & Web Scale Challenges

World-Renowned & Award Winning

History

Founded in ’98

World’s Largest Private Storage Company

Double Digit Growth, Profitable, Self Funded

Headquarters Chatsworth, CA


4! Sample Customers in Australia


5!

During acceptance testing

Why Burst Buffer?


6!

After the acceptance testing

Burst Buffer is a cache that caters for real-world IO patterns


7!Why Cache Matters in HPC A Day In the Life of A Multi-Disciplinary Cluster

•  90% of I/Os in a standard HPC environment are <32KB

•  Cache is critical in aligning all-too-frequent unaligned writes and capturing small writes to preserve performance

•  As a minimum include storage controllers with battery-backed mirrored write cache

DDN Proprietary & Confidential INTERNAL ONLY

11% 89%

Total Reads Total Writes

4KB 23%

8KB 23%

16KB 42%

32KB 5%

128KB 0%

256KB 0%

512KB 2%

1024KB 5%

Write Distribution for Multi-Disciplinary HPC Cluster

Sampling from DDN University

Lustre Customer


8!Where to put Cache (Flash) in HPC workflow?

▶  Add flash (NVMe) either 1.  On the compute nodes for lower cost but more difficult to

use efficiently (may require changes in code) and share between compute nodes or

2.  In the file system for easiest to use and share but greater cost and poor performance or

3.  In a burst buffer between file system and compute nodes for optimal mix of price & performance. Benefit is data sharing, resilience & excellent performance.

-  Adding flash to the file system does not

necessarily improve the performance. -  Burst Buffer is about the software not the

hardware.


9!Burst Buffer | 3 Key Areas of Workflow Optimization

S3D Turbulent Flow Model

1. MITIGATES POOR PFS PERFORMANCE caused by PFS

locking, small I/O, and mal-aligned, fragmented I/O patterns

Burst Buffer “makes bad apps run well” and also prevents a poor-

behaving app from impacting the entire supercomputer

This is especially valuable to diverse

workload environments and ISV applications

IOR benchmarks indicate a

3x – 20x speedup on I/Os <32KB

2) PROVIDES HIGHER PERFORMANCE I/O (bandwidth and latency) to the application At ISC, we demonstrated three orders of magnitude speed-up due to this high performance tier

3) IME DRIVES SIGNIFICANTLY MORE EFFICIENT I/O TO THE PFS by re-aligning and coalescing data within the non-volatile storage At ISC, we demonstrated two orders of magnitude speed-up due to this efficiency

4 GB/s

25 MB/s

50 GB/s


10! What is it good for ?

▶  Initially designed for check pointing • Shared file write

▶  Developing into both read and write ▶  Extract the most benefit from Flash based

storage, design around POSIX/PFS limitations • Random IO • Small IO • Extreme number of threads • Slow components


11! Sizing a burst buffer

▶  Include a burst buffer in your new HPC system • Bandwidth of Burst Buffer is typically 3x to 5x back end

parallel file system bandwidth • Aggregate NVMe capacity of Burst Buffer around 0.5 to 3

times the aggregate memory capacity of the cluster • Burst Buffer investment is 25% to 50% of total working

storage


12!A*Star – Peak scientific system for Singapore


13! JCAHPC Oakforest-PACS

25PF 550K KNL cores

Burst Buffer 25xIME14K 1500GB/s,940TB Raw

Scratch 10xES14KX

400GB/s 26PB useable

OPA Interconnect


14!Proposed Solution

Overview OPA KNL based Compute

Burst Buffer 40xIME240

800GB/s,800TB Raw

Scratch 9xES14KX 300GB/s

15.27PB useable

OPA Interconnect

Home & Apps ES14KX

30GB/s, 1PB Home & Apps

Metadata

Scratch Metadata

Management Servers

Purge Server

Testbed ES14KX


15! Performance ▶  Like PFS’s:

•  Excellent for some workloads •  Good for some •  Not so good for others

▶  Excellent for •  Random IO, large number of threads •  Shared file •  Small IO

▶  Good for •  Sequential

▶  Not so good for (at this point) •  Metadata intensive

Also consider repeatability of execution, and performance under fault conditions


16! Brain Simulation Workflow

(Source: I/O Challenges in Brain Tissue Simulation (IME Neuromapp), Judit Planas. ISC17 DDN User group)


17! WRF on Lustre 48 job, total 240 compute node – throughput – focus, 5 minute window

~100GB/s


18! WRF on IME 48 job, total 240 compute node – throughput – focus, 5 minute window

IME capacity consumption

WRF output WRF checkpoint

~390GB/s


19! WRF on IME at scale Application runtime speedup over 1.5x


20!

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34

I/O timing variance, in seconds 48 concurrent job, 960 sample, 20 sample/job

IME Lustre

very small variance in IO times for jobs when run concurrently (fastest times don't show)

Lustre IO times vary widely from 2s to 34s


21!

21 Thank You [email protected], [email protected]

Documents

DDN | Burst Buffer 1 HPC Advisory Council Perth July … · ddn.com Any statements or representations around future events are subject to change. 1! 1 DDN | Burst Buffer HPC Advisory