
Creating ADIOS-2 for scientific exascale data

HPC Advisory Council China Workshop, Xi’an, China

October 26, 2016


Computation: Fusion

• Develop a high-fidelity Whole Device Model (WDM) of magnetically confined fusion plasmas to predict the performance of ITER

• Couple existing, well-established extreme-scale gyrokinetic codes

• GENE continuum code for the core plasma

• XGC particle-in-cell (PIC) code for the edge plasma

• Data challenges

• Couple codes (XGC to GENE) using a service framework

• Need to save O(10) PB of data from a single simulation

• PMI, dust, RF, neutral particles, Ohmic power supply, poloidal field, magnetic equilibrium, RMP coils

• Fusion reaction: α-particles, neutrons


Experiments: ITER

• Volume: Initially 90 TB per day, 18 PB per year, maturing to 2.2 PB per day, 440 PB per year

• Value: All data are taken from expensive instruments for valuable reasons

• Velocity: Peak 50 GB/s, with near real-time analysis needs

• Variety: ~100 different types of instruments and sensors, numbering in the thousands, producing interdependent data in various formats

• Veracity: The quality of the data can vary greatly depending upon the instruments and sensors
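The volume and velocity figures above can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes decimal units (1 PB = 1,000,000 GB), and the function names are made up for this example rather than taken from the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def operating_days_per_year(pb_per_year: float, pb_per_day: float) -> float:
    """Days of data taking implied by the yearly and daily volumes."""
    return pb_per_year / pb_per_day

def average_rate_gb_per_s(pb_per_day: float) -> float:
    """Average rate if one day's volume were spread evenly over 24 hours."""
    return pb_per_day * 1_000_000 / SECONDS_PER_DAY

print(operating_days_per_year(18, 0.09))   # initial phase: 90 TB/day, 18 PB/year
print(operating_days_per_year(440, 2.2))   # mature phase: 2.2 PB/day, 440 PB/year
print(average_rate_gb_per_s(2.2))          # daily average in GB/s, mature phase
```

Both phases imply roughly 200 days of data taking per year, and the mature daily volume averages about 25 GB/s, around half of the quoted 50 GB/s peak.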


What is ADIOS

• An extendable framework that allows developers to plug in

• I/O methods: Aggregate, Posix, MPI

• Services: Compression, Decompression

• File Formats: HDF5, netcdf, …

• Stream Format: ADIOS-BP

• Plug-ins: Analytic, Visualization

• Indexing: FastBit, ISABELLA-QA

• Incorporates the “best” practices in the I/O middleware layer

• Incorporates self-describing data streams and files

• Released twice a year, now at version 1.9, under the completely free BSD license

• https://www.olcf.ornl.gov/center-projects/adios/, https://github.com/ornladios/ADIOS

• Available at ALCF, OLCF, NERSC, CSCS, Tianhe-1 and Tianhe-2, Pawsey SC, Ostrava

• Applications are supported through the OLCF INCITE program

• Outreach via online manuals and live tutorials

[Figure: ADIOS software architecture. Labels: interface to apps for description of data (ADIOS, etc.); buffering, feedback, schedule; multi-resolution methods; data compression methods; data indexing (FastBit) methods; data management services; workflow engine, runtime engine, data movement, provenance; plugins to the hybrid staging area (visualization plugins, analysis plugins); parallel and distributed file system (IDX, HDF5, ADIOS-BP, pnetcdf, “raw” data, image data); viz. client.]


ADIOS applications

1. Accelerator: PIConGPU, Warp

2. Astronomy: SKA

3. Astrophysics: Chimera

4. Combustion: S3D

5. CFD: FINE/Turbo, OpenFOAM

6. Fusion: XGC, GTC, GTC-P, M3D, M3D-C1, M3D-K, Pixie3D

7. Geoscience: SPECFEM3D_GLOBE, AWP-ODC, RTM

8. Materials Science: QMCPack, LAMMPS

9. Medical Imaging: Cancer pathology

10. Quantum Turbulence: QLG2Q

11. Relativity: Maya

12. Weather: GRAPES

13. Visualization: Paraview, Visit, VTK, ITK, OpenCV, VTK-m

(LCF/NERSC codes are marked in red on the original slide.)

Impact on industry:

• NUMECA (FINE/Turbo): Allowed time-varying interaction of turbomachinery-related aerodynamic phenomena

• TOTAL (RTM): Allowed running higher fidelity seismic simulations

• FM Global (OpenFOAM): Allowed running higher fidelity fire propagation simulations

Over 1B LCF hours from ADIOS-enabled apps in 2015

Over 1,500 citations


Impact from running large applications at scale

• Typical applications on LCF machines observed a 10X improvement in I/O performance


How to use ADIOS

• ADIOS is provided as a library to users; use it like other I/O libraries, except

• ADIOS has a declarative approach for I/O

• User defines in application source code: “what” and “when”

• ADIOS takes care of the “how”

• Biggest hurdle for users:

• Forget all the tricks to gain I/O performance: just say what you want to write/read

• Performance Portability:

• Write once, perform well anywhere

• It comes naturally, with different I/O strategies for different motifs and systems

• Predictable performance

• Staging, I/O throttling: allow scientists to use different computational technologies to achieve good, predictable performance
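The split between “what/when” and “how” can be illustrated with a toy model. The sketch below is not the ADIOS API: `OutputGroup`, `METHODS`, and the two strategy functions are hypothetical names invented for this example, and the “file” is just an in-memory list of records.

```python
from typing import Callable, Dict, List

# Hypothetical "how" strategies. A real I/O library would select these from
# configuration; here they only differ in how records are laid out.
def one_record_per_variable(step: int, data: Dict[str, list]) -> List[str]:
    return [f"step={step} {name}={values}" for name, values in sorted(data.items())]

def one_record_per_step(step: int, data: Dict[str, list]) -> List[str]:
    body = " ".join(f"{name}={values}" for name, values in sorted(data.items()))
    return [f"step={step} {body}"]

METHODS: Dict[str, Callable[[int, Dict[str, list]], List[str]]] = {
    "per_variable": one_record_per_variable,
    "per_step": one_record_per_step,
}

class OutputGroup:
    """Toy declarative output: the caller states what and when, never how."""

    def __init__(self, variables: List[str], method: str = "per_step"):
        self.variables = set(variables)
        self.method = METHODS[method]
        self.pending: Dict[str, list] = {}
        self.records: List[str] = []  # stands in for a file or stream

    def write(self, name: str, values: list) -> None:
        if name not in self.variables:
            raise KeyError(f"undeclared variable: {name}")
        self.pending[name] = list(values)

    def end_step(self, step: int) -> None:
        self.records.extend(self.method(step, self.pending))
        self.pending = {}

out = OutputGroup(["temperature", "pressure"], method="per_step")
out.write("temperature", [1.0, 2.0])  # what
out.write("pressure", [3.0])
out.end_step(0)                       # when
print(out.records)
```

Switching `method` to `"per_variable"` changes the layout of the output without touching the `write` and `end_step` calls, which is the point of the declarative style.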


ADIOS project goals and current status

• Utilize the best practices for ALL of the platforms DOE researchers utilize

• File system, topology, and memory optimizations

• Domain Specific Optimizations

• Particle simulations, Data Driven: Visualization, Medical Images, …

• Predictable Performance Optimizations

• Hybrid Staging Techniques, I/O Throttling

• Reduce I/O load

• In situ indexing, queries, executable code, data refactoring

• In situ infrastructure for code coupling and in situ/transit processing

• Hybrid Staging, Burst Buffers, learning techniques for caching

• Usability and Sustainability

• Partnering with Kitware, along with user surveys to create better software for our users


ADIOS Research-Development-Production cycle

[Figure: the ADIOS research-development-production cycle. Labels: Research (universities and ORNL), Development (ORNL), Production (Kitware, ORNL), Influencers, Collaborators, HDF5, Applications, Technology.]


Key ideas for good performance of ADIOS for large writes

• Avoid latency (of small writes)

• Buffer data for large bursts

• Avoid global communication

• ADIOS has that for metadata only, which can even be postponed for post-processing

• Later: Topology-aware data movement that takes advantage of topology

• Find the closest I/O node to each writer

• Minimize data movement across racks/mid-planes (on Blue Gene/Q)
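The first two ideas, avoiding the latency of small writes by buffering for large bursts, can be sketched in a few lines. This is an illustrative stand-in rather than ADIOS code: `BurstWriter` and its `burst_bytes` threshold are assumed names, and an in-memory `io.BytesIO` plays the role of the file system.

```python
import io

class BurstWriter:
    """Collect small writes in memory and flush them as one large write."""

    def __init__(self, sink, burst_bytes: int):
        self.sink = sink                # any object with a write(bytes) method
        self.burst_bytes = burst_bytes  # flush threshold (an assumed tuning knob)
        self.chunks = []
        self.buffered = 0
        self.sink_writes = 0            # number of writes that reached the sink

    def write(self, data: bytes) -> None:
        self.chunks.append(data)
        self.buffered += len(data)
        if self.buffered >= self.burst_bytes:
            self.flush()

    def flush(self) -> None:
        if self.chunks:
            self.sink.write(b"".join(self.chunks))
            self.sink_writes += 1
            self.chunks = []
            self.buffered = 0

sink = io.BytesIO()
writer = BurstWriter(sink, burst_bytes=1024)
for _ in range(1000):
    writer.write(b"12345678")  # 1000 small 8-byte writes
writer.flush()
print(writer.sink_writes, len(sink.getvalue()))
```

The per-write overhead is paid once per burst instead of once per small write; the right burst size depends on the target system and is not specified in the slides.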

ADIOS-BP stream/file format

• Allows each node to write its data, with metadata, independently of the others

• Ability to create a separate metadata file when “sub-files” are generated

• Allows variables to be individually compressed

• Has a schema to introspect the information, on each process

• Has workflows embedded into the data streams

• Format is for “data-in-motion” and “data-at-rest”

[Figure label: Burst Buffer]


Impact to two LCF applications

• Accelerators: PIConGPU

• M. Bussmann, et al., HZDR

• Study laser-driven acceleration of ion beams and its use for cancer therapy

• Computational laboratory for real-time processing to optimize the parameters of the laser

• Over 184 GB/s on 16K nodes on Titan

• Seismic imaging: RTM by Total Inc.

• Pierre-Yves Aquilanti, TOTAL E&P, in the context of a CRADA

• TBs of input; outputs PBs of results along with intermediate data

• The company conducted comparison tests among several I/O solutions. ADIOS is their choice for other codes: FWI, Kirchhoff

http://rice2016oghpc.rice.edu/program/

http://www.sciencedaily.com/releases/2016/01/160126130823.htm


Small Writes From Many Cores at high velocities

• Small writes

• Overheads are bigger than the cost of outputting a small number of bytes

• Avoid accessing a file system target from many processes at once

• Aggregate to a small number of actual writers: O(file system targets)

• Avoid lock contention

• By striping correctly and writing to subfiles

• Reading many small data blocks

• Data reorganization to merge data into bigger blocks

• Reading with different access patterns than the data was written

• Data reorganization to optimize overall read access
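The aggregation idea above, many processes funnelling their data to a small number of actual writers, comes down to a mapping from ranks to aggregators. The sketch below uses contiguous groups of ranks; that grouping and the function names are assumptions made for illustration, not necessarily what ADIOS does.

```python
from collections import defaultdict

def aggregator_of(rank: int, n_ranks: int, n_aggregators: int) -> int:
    """Return the rank that writes on behalf of `rank` (contiguous groups)."""
    group_size = -(-n_ranks // n_aggregators)  # ceiling division
    return (rank // group_size) * group_size

def plan_subfiles(n_ranks: int, n_aggregators: int) -> dict:
    """Map each aggregator rank to the list of ranks whose data it writes."""
    plan = defaultdict(list)
    for rank in range(n_ranks):
        plan[aggregator_of(rank, n_ranks, n_aggregators)].append(rank)
    return dict(plan)

plan = plan_subfiles(n_ranks=16, n_aggregators=4)
for writer_rank, members in plan.items():
    print(f"rank {writer_rank} writes one subfile for ranks {members}")
```

With `n_aggregators` chosen close to the number of file system targets, each target sees about one writer, and each writer owns its own subfile, which avoids lock contention on a shared file.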

Q. Liu, J. Logan, Y. Tian, H. Abbasi, N. Podhorszki, J. Y. Choi, S. Klasky, R. Tchoua, J. Lofstead, R. Oldfield, et al. Hello ADIOS: the challenges and lessons of developing leadership class I/O frameworks. Concurrency and Computation: Practice and Experience 2014, 26, 1453–1473.

http://www.hpcwire.com/2009/10/29/adios_ignites_combustion_simulations/


Further impact: OpenFOAM CFD simulations

Challenges

• Open C++ framework for all CFD simulations

• FM Global: fireFOAM application for fire simulations

• Output is stream-based, no central IO class

• Hundreds of objects contribute to the output writing

• Many small files with even more tiny write operations

• Per processor and simulation variable

• Add a scalable and fast IO solution to OpenFOAM

Approach

• Use ADIOS to write a single file per timestep

• Use the OpenFOAM functionObject approach for defining extra functionality (used to extend OpenFOAM with new computations, I/O, diagnostics, etc.)

• User specifies the output objects in a configuration file (in the usual way for OpenFOAM)

Research Products/Artifacts

• Prototype implementation of ADIOS IO in OpenFOAM 2.2 source as an extension

• https://github.com/pnorbert/IOADIOSWrite

Select Results

• Tests performed by FM Global on a local cluster with an NFS file system as well as at OLCF with a Lustre file system

“ADIOS was able to achieve a 12X performance improvement from the original IO”

Yi Wang, Karl Meredith (FM Global); S. Klasky, N. Podhorszki (ORNL)

https://www.olcf.ornl.gov/2016/01/05/fighting-fire-with-firefoam/

[Figure caption: Stacking commodities on wood pallets slows horizontal fire spread, versus absence of pallets]


Staging to address I/O performance and variability issues

• Simplistic approach to staging

• Decouple application performance from storage performance (burst buffer)

• Move data directly to remote memory in a “staging” area

• Write data to disk from staging

• Built on past work with threaded buffered I/O

• Buffered asynchronous data movement with a single memory copy for networks that support RDMA

• Application blocks for a very short time to copy data to the outbound buffer

• Data is moved asynchronously using server-directed remote reads

• Exploits network hardware for fast data transfer to remote memory
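The decoupling described above can be mimicked inside a single process with a queue and a background thread. This is only an analogy: it does not model RDMA, remote memory, or server-directed reads, and `StagingWriter` with its `slow_write` callable is a hypothetical name for this sketch.

```python
import queue
import threading

class StagingWriter:
    """Copy data to an outbound buffer and let a background thread drain it."""

    def __init__(self, slow_write):
        self.slow_write = slow_write  # callable standing in for storage
        self.outbound = queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def put(self, step: int, data) -> None:
        # The application only pays for this one copy, then keeps computing.
        self.outbound.put((step, bytes(data)))

    def _drain(self) -> None:
        while True:
            item = self.outbound.get()
            if item is None:  # sentinel: no more data
                break
            self.slow_write(*item)

    def close(self) -> None:
        self.outbound.put(None)
        self.worker.join()

written = []
stager = StagingWriter(lambda step, data: written.append((step, len(data))))
for step in range(3):
    stager.put(step, bytearray(b"x" * 100))  # returns immediately
stager.close()
print(written)
```

Because `put` returns as soon as the copy is queued, slow or variable storage delays the background thread rather than the simulation loop.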


Overheads with XGC

• 1 PB of total output per 24 hours, but really wanted 10 PB/24 hours


XGC I/O Variability

• Staging of XGC output

• Observed high I/O variations with files on XGC check-pointing

• Research on Staging I/O with Rutgers and GA Tech: Decoupling app and I/O

• Reduced I/O variations with staging methods as well as high I/O throughput

• Creates new research opportunities with BurstBuffer (ADIOS + Staging + BurstBuffer)

• Impact: I/O performance is much more predictable

[Figure: XGC I/O variations on checkpoint writing, comparing file I/O and staging I/O on Titan. Diagram labels: XGC, ADIOS, File, Staging, BurstBuffer.]


Performance Portability

• ADIOS has methods optimized for different platforms

• BGQ-GPFS, Cray XK-Lustre, IB clusters

• Different methods can be changed for R/W optimizations

• E.g., the quantum physics QLG2Q code

[Figure: Performance island navigation with ADIOS]


“Compute-Data Gap”

• Filesystem/network bandwidth does not keep up with CPU/memory advances

• Can’t write as many bits/FLOP

• Swap more I/O for (de)compression CPU cycles
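Whether trading I/O for compression cycles pays off can be estimated with a simple model. The sketch below assumes compression and writing do not overlap, and every number in it is hypothetical: for data size S, bandwidth B, compression ratio r, and compression throughput C, the plain write takes S/B and the compressed write takes S/C + S/(rB), so compression wins when C > B·r/(r − 1).

```python
def write_time(size_gb: float, bandwidth_gb_s: float) -> float:
    """Seconds to write the data uncompressed."""
    return size_gb / bandwidth_gb_s

def compressed_write_time(size_gb: float, bandwidth_gb_s: float,
                          ratio: float, compress_gb_s: float) -> float:
    """Seconds to compress the data, then write the smaller result."""
    return size_gb / compress_gb_s + (size_gb / ratio) / bandwidth_gb_s

def break_even_compress_rate(bandwidth_gb_s: float, ratio: float) -> float:
    """Compression throughput above which compressing first is faster."""
    return bandwidth_gb_s * ratio / (ratio - 1)

# Hypothetical numbers: 100 GB of output, 1 GB/s to storage,
# 4x compression ratio, 2 GB/s compression throughput.
print(write_time(100, 1))
print(compressed_write_time(100, 1, ratio=4, compress_gb_s=2))
print(break_even_compress_rate(1, ratio=4))
```

High-entropy data pushes the ratio toward 1, which drives the break-even throughput up sharply; in that regime compression offers no benefit.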

Exascale systems require a well-tuned SDM core


CODAR: Center for Online Data Analysis and Reduction

[Figure: CODAR connects applications, data services, and exascale platforms.]

PI: Ian Foster (a)

Co-Is: Scott Klasky (o), Kerstin Kleese-Van Dam (b), Todd Munson (a)

Participants: Mark Ainsworth (d), Franck Cappello (a), Barbara Chapman (b, s), Jong Choi (o), Emil Constantinescu (a), Hanqi Guo (a), Tahsin Kurc (s), Qing Liu (o), Jeremy Logan, Klaus Mueller (s), George Ostrouchov (o), Manish Parashar (r), Tom Peterka (a), Norbert Podhorszki (o), Dave Pugmire (o), Rangan Sukumar (o), Stefan Wild (a), Matthew Wolf (o), Justin Wozniak (a), Wei Xu (b), Shinjae Yoo (b)

a: Argonne National Laboratory

b: Brookhaven National Laboratory

o: Oak Ridge National Laboratory


Reduction comes with challenges

• Handling high entropy

• Performance: no benefit otherwise

• Not only errors in the variable itself,

$$E \equiv f - \tilde{f},$$

must also consider the impact on derived quantities:

$$E \equiv g_l^t\big(f(\vec{x}, t)\big) - \tilde{g}_l^t\big(\tilde{f}_l^t(\vec{x}, t)\big)$$

[Figure annotation: “Where did it go?”]
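A small numerical experiment shows why the error in a derived quantity matters separately from the error in the variable itself. The sketch below makes its own choices for everything: uniform quantization stands in for the reduction, a sine wave for f, and a forward-difference derivative for the derived quantity g. None of these come from the slides.

```python
import math

def quantize(values, step):
    """Lossy reduction stand-in: round each value to a multiple of `step`."""
    return [round(v / step) * step for v in values]

def forward_difference(values, dx):
    """Derived quantity g: a first-order finite-difference derivative."""
    return [(b - a) / dx for a, b in zip(values, values[1:])]

def max_abs_error(exact, approx):
    return max(abs(a - b) for a, b in zip(exact, approx))

n = 1000
dx = 2 * math.pi / n
f = [math.sin(i * dx) for i in range(n + 1)]
f_reduced = quantize(f, step=0.01)

error_f = max_abs_error(f, f_reduced)
error_g = max_abs_error(forward_difference(f, dx),
                        forward_difference(f_reduced, dx))
print(f"max error in f: {error_f:.4f}")
print(f"max error in g: {error_g:.4f}")
```

The error in f stays within half a quantization step, while the error in the derivative is far larger because differencing divides the quantization noise by the small grid spacing.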


ADIOS + Indexing/Queries

• Goal: reduce the time to read data by understanding what to read

• Designed for querying data in multi-dimensional arrays

• Unified handling of conditions on coordinates and values

• Support parallel queries through explicit data partitioning

• Key challenges

• Organize indexing data structures efficiently

• Reading efficiently

• Impact

• S3D data from a run on Jaguar can be queried and read much faster than reading the entire dataset

• E.g., S3D data on a 1100x1080x1048 mesh
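The goal of reading less by knowing what to read can be illustrated with the simplest possible index: a minimum and maximum per block. This is a deliberately reduced stand-in, not FastBit's bitmap indexes or the ADIOS query interface, and the function names are hypothetical.

```python
def build_block_index(values, block_size):
    """Record (start, stop, min, max) for each fixed-size block of `values`."""
    index = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        index.append((start, start + len(block), min(block), max(block)))
    return index

def query_greater_than(values, index, threshold):
    """Return (matching positions, number of blocks actually read)."""
    hits, blocks_read = [], 0
    for start, stop, _lo, hi in index:
        if hi <= threshold:
            continue  # no value in this block can match, so skip reading it
        blocks_read += 1
        hits.extend(i for i in range(start, stop) if values[i] > threshold)
    return hits, blocks_read

# Synthetic data: mostly zeros with one small hot spot.
data = [0.0] * 1000
data[420:430] = [5.0] * 10
index = build_block_index(data, block_size=100)
hits, blocks_read = query_greater_than(data, index, threshold=1.0)
print(f"{len(hits)} hits found by reading {blocks_read} of {len(index)} blocks")
```

The same pruning works per process when the blocks are partitioned across ranks, which is one way to read the “explicit data partitioning” bullet.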


I/O Pipelines in Staging

• Use the staging nodes and create a workflow in the staging nodes

• Prepare the data for future analytics

• Mitigate performance impact of I/O by using asynchronous data movement

[Figure: staging pipeline. Stage labels: Sort, Bitmap Indexing, Histogram, 2D Histogram, BP writer, Plotter. Data labels: particle array, sorted array, BP file, index file.]

Total simulation runtime improved by 2.7%

Online processing of 260GB of data from 16K cores in 40s


DataSpaces

• Goal: Seamlessly support in situ processing for coupled simulations using hybrid staging

• Support for application usage modes such as I/O offloading, in situ and in-transit analytics, tight/loose coupling, fan-out workflows, etc.

• Autonomic data placement and movement leveraging deep storage and memory hierarchies

• Impact

• Loose coupling: XGC1 + M3D +Viz

• Tight Coupling: XGC1 + XGCa

• Loose coupling: S3D + analytics

• Tight coupling: LES + DNS

• Results: Performance improvement for fusion workflow using ADIOS + DataSpaces


Data Management Tradeoffs at Exascale to hybrid staging

• Balance of memory size and speed

• Feedback for node designs with NVRAM, larger memory, on-chip NIC

• Network throughput and latency impact on SDMA tasks

• Placement of operations in concert with solver and network topology

Explore node design choices for data management


In-situ and In-transit Partitioning of Workflows

• Primary resources execute the main simulation and in situ computations

• Secondary resources provide a staging area whose cores act as buckets for in transit computations

• 4896 cores total (4480 simulation/in situ; 256 in transit; 160 task scheduling/data movement)

• Simulation size: 1600x1372x430

• All measurements are per simulation time step


Data staging considering idle CPU resources

• MPI simulations with OpenMP show idle resources during data movement

• Challenges

• Interference between simulation and analytics due to contention on the memory hierarchy

• Select suitable idle periods to amortize scheduling costs

• Approach: harvest idle resources for in situ analytics

• Dynamically predict idle resource availability

• Reduce interference with execution throttling

• Impact: GTS simulation with parallel coordinate visualization

• Improves time to solution and resource efficiency

• Indicates that co-scheduling analytics services on underutilized resources can improve in situ processing of data

• Scales up to 12288 cores on Hopper Cray XE6

[Figure, first panel: main loop time broken down by percentage (0% to 100%) into Other Sequential, MPI, and OpenMP, at 1536 and 3072 cores, for GTC, GTS, GROMACS (D.DPPC), GROMACS (ADH_Cubic), LAMMPS (LJ), LAMMPS (EAM), LAMMPS (Chain), LAMMPS (Rhodopsin), BT-MZ (Classes C, D, E), and SP-MZ (Classes C, D, E).]

[Figure, second panel: percentage (%) of counts and of aggregated time versus length of period (milliseconds), with bins from 0 to >14.4.]
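The two steps of the approach, predicting idle periods and using only those long enough to amortize scheduling costs, can be sketched as follows. The predictor (a moving average over recent observations), the site labels, and the factor of 5 are all assumptions made for this illustration; the slides do not state which predictor or threshold was used.

```python
from collections import defaultdict, deque

class IdlePeriodPredictor:
    """Predict the length of a recurring idle period from its recent history."""

    def __init__(self, history: int = 4):
        self.samples = defaultdict(lambda: deque(maxlen=history))

    def observe(self, site: str, idle_ms: float) -> None:
        self.samples[site].append(idle_ms)

    def predict(self, site: str) -> float:
        seen = self.samples.get(site)
        return sum(seen) / len(seen) if seen else 0.0

def should_run_analytics(predicted_idle_ms: float,
                         scheduling_cost_ms: float,
                         factor: float = 5.0) -> bool:
    """Use an idle period only if it dwarfs the cost of scheduling into it."""
    return predicted_idle_ms >= factor * scheduling_cost_ms

predictor = IdlePeriodPredictor()
for ms in (12.0, 14.0):
    predictor.observe("wait_in_mpi", ms)      # hypothetical long idle period
for ms in (0.3, 0.5):
    predictor.observe("short_barrier", ms)    # hypothetical short idle period

for site in ("wait_in_mpi", "short_barrier"):
    predicted = predictor.predict(site)
    print(site, predicted, should_run_analytics(predicted, scheduling_cost_ms=1.0))
```

Short periods are skipped because the scheduling cost would consume most of them, which is consistent with the challenge of selecting suitable idle periods.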


In Situ/Transit/LAN/WAN Data Staging for EOD data

• ICEE

– Using EVPath package (GATech)

– Support uniform network interface for TCP/IP and RDMA

– Allows us to stage data over the LAN/WAN

– Used for EOD data in the fusion community

• DataSpaces (with sockets)

– Support TCP/IP and RDMA

• Select only areas of interest and send (e.g., blobs)

• Reduce payload on average by about 5X

[Figure: a frame of 512 by 640 pixels divided into sub-chunks; areas of interest are kept and the rest is filtered out.]
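Selecting only the areas of interest before sending can be sketched as a per-sub-chunk filter. The selection rule here (keep a sub-chunk if its peak value exceeds a threshold) and the function names are assumptions for illustration; the slides do not say how blobs are detected, and the 5X figure above is a measured average that this toy does not reproduce.

```python
def select_sub_chunks(frame, chunk, threshold):
    """Return {(row, col): sub-chunk} for sub-chunks whose peak exceeds threshold."""
    kept = {}
    for r in range(0, len(frame), chunk):
        for c in range(0, len(frame[0]), chunk):
            block = [row[c:c + chunk] for row in frame[r:r + chunk]]
            if max(max(row) for row in block) > threshold:
                kept[(r, c)] = block
    return kept

def payload_reduction(frame, kept):
    """Ratio of pixels in the full frame to pixels actually sent."""
    total = len(frame) * len(frame[0])
    sent = sum(len(row) for block in kept.values() for row in block)
    return total / sent if sent else float("inf")

# Synthetic 8x8 frame with a single bright pixel standing in for a blob.
frame = [[0] * 8 for _ in range(8)]
frame[1][2] = 9
kept = select_sub_chunks(frame, chunk=4, threshold=5)
print(sorted(kept), payload_reduction(frame, kept))
```

Only the sub-chunks in `kept`, together with their offsets, would need to cross the LAN/WAN.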


Creating living I/O MiniApps/skeletons to study these workflows

• Extended BP metadata to support automatic replay, by saving provenance

• Allows new MiniApps to be created from any previous simulation

• Incorporates performance information for post-hoc performance understanding

• Workflow Model provides input and output data for each component, uses a DAG to describe data flow

• Code generators are built using templates
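A workflow described as a DAG, plus a template-driven generator, can be sketched with the Python standard library alone. The workflow contents, the template text, and the `generate_skeleton` name below are hypothetical; the slides do not show the actual model or templates.

```python
from graphlib import TopologicalSorter
from string import Template

# Hypothetical workflow: each component maps to the components whose
# output it reads (its predecessors in the DAG).
WORKFLOW = {
    "simulation": [],
    "sort": ["simulation"],
    "histogram": ["sort"],
    "plotter": ["histogram"],
}

STEP_TEMPLATE = Template("run $name < $inputs")

def generate_skeleton(workflow):
    """Emit one templated line per component, in dependency order."""
    order = TopologicalSorter(workflow).static_order()
    lines = []
    for name in order:
        inputs = ", ".join(workflow[name]) or "none"
        lines.append(STEP_TEMPLATE.substitute(name=name, inputs=inputs))
    return lines

for line in generate_skeleton(WORKFLOW):
    print(line)
```

Because the order comes from the DAG rather than from the way the dictionary was written, components can be added in any order and the generated skeleton still runs producers before consumers.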


ADIOS and VTK-m

• VTK-m and XViz are efforts to prepare for the increasing complexity of extreme-scale visualization

• Addresses: emerging processors, in situ visualization, and usability

• Minimizes resources required for analysis and visualization

• Processor/accelerator awareness provides execution flexibility

• Idea is to incorporate VTK-m into the ADIOS software ecosystem


XSSA: eXtreme Scale Service Architecture

• Philosophy based on Service-Oriented Architecture

• System management

• Changing requirements

• Evolving target platforms

• Diverse, distributed teams

• Applications built by assembling services

• Universal view of functionality

• Well defined API

• Implementations can be easily modified and assembled

• Manage complexity while maintaining performance and scalability

• Scientific problems and codes

• Underlying disruptive infrastructure

• Coordination across codes and research teams

• End-to-end workflows


ADIOS roadmap to Exascale

2017

• Create test harness

• Create a clearer, more modular layering of application interfaces, data abstractions, and runtime components

• Burst buffer support

• New methods for CORAL optimizations

2018

• Code coupling support with hybrid staging

• Living workflow

• Support for new programming models

• WAN staging

2019

• EOD integration for validation workflows

• Ensemble workflow optimizations

• Data Model support for software ecosystem


Questions

Acknowledgements

• EPSI
• ICEE
• OLCF
• RSVP
• MONA
• Big Data Express
• XViz
• VTK-m
• SDAV
• Exact
• SENSEI
• Sirius
• Fusion ECP