Creating ADIOS-2 for scientific exascale data
HPC Advisory Council China Workshop, Xi'an, China
October 26, 2016
Computation: Fusion
•Develop high-fidelity Whole Device Model (WDM) of magnetically confined fusion plasmas to predict the performance of ITER
•Couple existing, well established extreme-scale gyrokinetic codes
• GENE continuum code for the core plasma and the
• XGC particle-in-cell (PIC) code for the edge plasma
•Data challenges
• Couple codes (XGC to GENE) using a service framework
• Need to save O(10)PB of data from a single simulation
• PMI, Dust, RF, Neutral particles, Ohmic power supply, Poloidal field, Magnetic Equilibrium, RMP coils
• Fusion reaction- α-particles, neutrons
Experiments: ITER
• Volume: Initially 90 TB per day, 18 PB per year, maturing to 2.2 PB per day, 440 PB per year
• Value: All data are taken from expensive instruments for valuable reasons
• Velocity: Peak 50 GB/s, with near real-time analysis needs
• Variety: ~100 different types of instruments and sensors, numbering in the thousands, producing interdependent data in various formats
• Veracity: The quality of the data can vary greatly depending upon the instruments and sensors
What is ADIOS
• An extendable framework that allows developers to plug in
  • I/O methods: Aggregate, POSIX, MPI
  • Services: Compression, Decompression
  • File formats: HDF5, netCDF, …
  • Stream format: ADIOS-BP
  • Plug-ins: Analytics, Visualization
  • Indexing: FastBit, ISABELA-QA
• Incorporates the "best" practices in the I/O middleware layer
• Incorporates self-describing data streams and files
• Released twice a year, now version 1.9, under the completely free BSD license
• https://www.olcf.ornl.gov/center-projects/adios/, https://github.com/ornladios/ADIOS
• Available at ALCF, OLCF, NERSC, CSCS, Tianhe-1,2, Pawsey SC, Ostrava
• Applications are supported through OLCF INCITE program
• Outreach via on-line manuals, and live tutorials
[Architecture diagram: hybrid staging stack]
• Interface to apps for description of data (ADIOS, etc.)
• Buffering, Feedback, Schedule
• Multi-resolution methods; Data compression methods; Data indexing (FastBit) methods
• Data Management Services: Workflow engine, Runtime engine, Data movement, Provenance
• Plugins to the hybrid staging area: Visualization plugins, Analysis plugins
• Parallel and Distributed File System: IDX, HDF5, ADIOS-BP, pnetcdf, "raw" data, image data
• Viz. client
1. Accelerator: PIConGPU, Warp
2. Astronomy: SKA
3. Astrophysics: Chimera
4. Combustion: S3D
5. CFD: FINE/Turbo, OpenFOAM
6. Fusion: XGC, GTC, GTC-P, M3D, M3D-C1, M3D-K, Pixie3D
7. Geoscience: SPECFEM3D_GLOBE, AWP-ODC, RTM
8. Materials Science: QMCPack, LAMMPS
9. Medical Imaging: Cancer pathology
10. Quantum Turbulence: QLG2Q
11. Relativity: Maya
12. Weather: GRAPES
13. Visualization: ParaView, VisIt, VTK, ITK, OpenCV, VTK-m
ADIOS applications
Impact on industry:
• NUMECA (FINE/Turbo) – Allowed time-varying interaction of turbomachinery-related aerodynamic phenomena
• TOTAL (RTM) – Allowed running higher-fidelity seismic simulations
• FM Global (OpenFOAM) – Allowed running higher-fidelity fire propagation simulations
LCF/NERSC codes in red
Over 1B LCF hours from ADIOS-enabled apps in 2015
Over 1,500 citations
Typical applications on LCF machines observed a 10X performance improvement in I/O
Impact from running large-scale applications
How to use ADIOS
• ADIOS is provided as a library to users; use it like other I/O libraries, except
• ADIOS has a declarative approach for I/O
• User defines in application source code: “what” and “when”
• ADIOS takes care of the “how”
• Biggest hurdle for users:
• Forget all the tricks to gain I/O performance: just say what you want to write/read
• Performance Portability:
• Write once, perform well anywhere
• It comes naturally, with different I/O strategies for different: motifs, systems
• Predictable performance
• Staging, I/O throttling: Allow scientists to use different computational technologies to achieve good, predictable performance
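The split between the declarative "what/when" and the hidden "how" can be sketched in a few lines of plain Python (illustrative only — these class and method names are not the actual ADIOS API):

```python
# Sketch of declarative I/O: the application declares *what* to write and
# *when* the step ends; a swappable strategy object decides *how*.

class OneFilePerVariable:
    """Naive strategy: one file per variable per step."""
    def write(self, step, variables):
        return [f"{name}.{step}.dat" for name in sorted(variables)]

class AggregatedFile:
    """Aggregating strategy: everything lands in one file per step."""
    def write(self, step, variables):
        return [f"output.{step}.bp"]

class Output:
    def __init__(self, strategy):
        self.strategy = strategy
        self.declared = {}
    def declare(self, name, data):                # the "what"
        self.declared[name] = data
    def end_step(self, step):                     # the "when"
        files = self.strategy.write(step, self.declared)  # the hidden "how"
        self.declared = {}
        return files

out = Output(AggregatedFile())   # swap strategies without touching app code
out.declare("temperature", b"\x00" * 8)
out.declare("pressure", b"\x00" * 8)
print(out.end_step(0))  # ['output.0.bp']
```

Swapping `OneFilePerVariable` for `AggregatedFile` changes the I/O motif without modifying the code that declares the variables — the essence of "write once, perform well anywhere".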
ADIOS project goals and current status
• Utilize the best practices for ALL of the platforms DOE researchers utilize
• File system, topology, and memory optimizations
• Domain Specific Optimizations
• Particle simulations, Data Driven: Visualization, Medical Images, …
• Predictable Performance Optimizations
• Hybrid Staging Techniques, I/O Throttling
• Reduce I/O load
• In situ indexing, queries, executable code, data refactoring
• In situ infrastructure for code coupling and in situ/transit processing
• Hybrid Staging, Burst Buffers, learning techniques for caching
• Usability and Sustainability
• Partnering with Kitware, along with user surveys to create better software for our users
ADIOS Research-Development-Production cycle
[Diagram: Research (universities & ORNL) feeds Development (ORNL, with influencers and collaborators such as HDF5, applications, and technology), which feeds Production (ORNL and Kitware)]
Key ideas for good performance of ADIOS for large writes
• Avoid latency (of small writes)
• Buffer data for large bursts
• Avoid global communication
  • ADIOS uses it for metadata only, and even that can be postponed to post-processing
• Later: topology-aware data movement
• Find the closest I/O node to each writer
• Minimize data movement across racks/mid-planes (on Bluegene/Q)
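The first two bullets — avoiding latency by buffering data into large bursts — can be sketched with a toy write buffer (illustrative Python, not ADIOS code; the 4 MiB default capacity is an assumption):

```python
class WriteBuffer:
    """Sketch of latency avoidance: coalesce small writes in memory and
    flush one large, contiguous burst when the buffer fills."""
    def __init__(self, capacity=4 * 1024 * 1024):
        self.capacity = capacity
        self.chunks = []
        self.size = 0
        self.flushes = 0      # count of actual storage operations
    def write(self, data: bytes):
        self.chunks.append(data)
        self.size += len(data)
        if self.size >= self.capacity:
            self.flush()
    def flush(self):
        if self.size:
            self.flushes += 1  # one large write instead of many tiny ones
            self.chunks.clear()
            self.size = 0

buf = WriteBuffer(capacity=1024)
for _ in range(1000):
    buf.write(b"x" * 8)        # 1000 tiny 8-byte writes
buf.flush()
print(buf.flushes)  # 8 storage operations instead of 1000
```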
ADIOS-BP stream/file format
• Allows data from each node to be written independently of the others, together with its metadata
• Ability to create a separate metadata file when “sub-files” are generated
• Allows variables to be individually compressed
• Has a schema to introspect the information, on each process
• Has workflows embedded into the data streams
• Format is for “data-in-motion” and “data-at-rest”
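A minimal self-describing block might look like the following sketch (a toy layout for illustration — the real BP format is different): each payload carries its own length-prefixed header, so any reader can introspect name, type, and shape without an external schema.

```python
import json
import struct

def pack_block(name, dtype, shape, payload: bytes) -> bytes:
    """Toy self-describing data block: a length-prefixed JSON header
    describing the payload, followed by the raw bytes."""
    header = json.dumps({"name": name, "dtype": dtype, "shape": shape}).encode()
    return struct.pack("<I", len(header)) + header + payload

def unpack_block(buf: bytes):
    """Introspect a block without any external schema."""
    (hlen,) = struct.unpack_from("<I", buf, 0)
    header = json.loads(buf[4:4 + hlen])
    return header, buf[4 + hlen:]

blk = pack_block("temperature", "f8", [4],
                 struct.pack("<4d", 1.0, 2.0, 3.0, 4.0))
meta, payload = unpack_block(blk)
print(meta["name"], meta["shape"])  # temperature [4]
```

Because every process can emit such blocks independently, a lightweight global index over the per-block headers is all that is needed to locate variables later.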
[Figure label: Burst Buffer]
Impact on two LCF applications
• Accelerators – PIConGPU
  • M. Bussmann, et al. – HZDR
  • Study laser-driven acceleration of ion beams and its use for cancer therapy
  • Computational laboratory for real-time processing to optimize laser parameters
  • Over 184 GB/s on 16K nodes on Titan
• Seismic imaging – RTM by TOTAL
  • Pierre-Yves Aquilanti, TOTAL E&P, in the context of a CRADA
  • TBs as inputs; outputs PBs of results along with intermediate data
  • The company ran comparison tests among several I/O solutions; ADIOS is their choice for other codes: FWI, Kirchhoff
http://rice2016oghpc.rice.edu/program/
http://www.sciencedaily.com/releases/2016/01/160126130823.htm
Small writes from many cores at high velocities
• Small writes: the overheads are bigger than the cost of outputting the few bytes themselves
• Avoid accessing a file system target from many processes at once
  • Aggregate to a small number of actual writers: O(file system targets)
• Avoid lock contention by striping correctly and writing to subfiles
• Reading many small data blocks: reorganize data to merge it into bigger blocks
• Reading with access patterns different from how the data was written: reorganize data to optimize overall read access
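The aggregation bullet can be made concrete with a small sketch (a hypothetical helper, not ADIOS code): map N writing processes onto O(file-system-targets) aggregators, so the file system only ever sees a few concurrent writers.

```python
def assign_aggregators(nprocs, ntargets):
    """Sketch: map each of nprocs writers to one of ntargets aggregators,
    so the file system sees only ntargets concurrent writers."""
    group = (nprocs + ntargets - 1) // ntargets   # writers per aggregator
    return [rank // group for rank in range(nprocs)]

agg = assign_aggregators(nprocs=1024, ntargets=8)
print(len(set(agg)))  # 8 aggregators stand in for 1024 writers
```

Each aggregator gathers its group's buffers and issues one large write, combining this idea with the burst-buffering sketch above.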
Q. Liu, J. Logan, Y. Tian, H. Abbasi, N. Podhorszki, J. Y. Choi, S. Klasky, R. Tchoua, J. Lofstead, R. Oldfield, et al. "Hello ADIOS: the challenges and lessons of developing leadership class I/O frameworks." Concurrency and Computation: Practice and Experience 2014, 26, 1453–1473.
http://www.hpcwire.com/2009/10/29/adios_ignites_combustion_simulations/
• Use ADIOS to write a single file per timestep
• Use the OpenFOAM functionObject approach for defining extra functionality (used to extend OpenFOAM with new computations, I/O, diagnostics, etc.)
• The user specifies the output objects in a configuration file (the usual way for OpenFOAM)
Approach / Challenges
Further impact: OpenFOAM CFD simulations
• Open C++ framework for all CFD simulations
• FM Global: fireFOAM application for fire simulations
• Output is stream-based, no central IO class
• Hundreds of objects contribute to the output writing
• Many small files with even more tiny write operations
• Per processor and simulation variable
• Add a scalable and fast IO solution to OpenFOAM
Research Products/Artifacts
Yi Wang, Karl Meredith – FM Global; S. Klasky, N. Podhorszki – ORNL
Select Results
• Prototype implementation of ADIOS I/O in the OpenFOAM 2.2 source as an extension
• https://github.com/pnorbert/IOADIOSWrite
• Tests performed by FM Global on a local cluster with an NFS file system, as well as at OLCF with a Lustre file system
"ADIOS was able to achieve a 12X performance improvement over the original I/O"
https://www.olcf.ornl.gov/2016/01/05/fighting-fire-with-firefoam/
Stacking commodities on wood pallets slows horizontal fire spread, versus the absence of pallets
Staging to address I/O performance and variability
• Simplistic approach to staging
  • Decouple application performance from storage performance (burst buffer)
  • Move data directly to remote memory in a "staging" area
  • Write data to disk from staging
• Built on past work with threaded, buffered I/O
  • Buffered asynchronous data movement with a single memory copy for networks that support RDMA
  • The application blocks for a very short time to copy data to an outbound buffer
  • Data is moved asynchronously using server-directed remote reads
  • Exploits network hardware for fast data transfer to remote memory
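The staging pattern above — block briefly to copy data into an outbound buffer, then let it drain asynchronously — can be sketched with a thread and a bounded queue (plain Python standing in for the RDMA-based staging area; names are illustrative):

```python
import queue
import threading

def staged_writer(slow_write, depth=2):
    """Sketch of staging: write() copies data into an outbound buffer and
    returns immediately; a background thread drains buffers to (slow)
    storage, decoupling compute from I/O variability."""
    q = queue.Queue(maxsize=depth)       # bounded "staging area"
    def drain():
        while True:
            data = q.get()
            if data is None:             # shutdown sentinel
                break
            slow_write(data)             # asynchronous relative to the app
    t = threading.Thread(target=drain)
    t.start()
    def write(data):
        q.put(bytes(data))               # short block: one memory copy
    def close():
        q.put(None)
        t.join()
    return write, close

written = []
write, close = staged_writer(written.append)
for step in range(5):
    write(b"checkpoint-%d" % step)       # app returns to compute quickly
close()
print(len(written))  # 5
```

The bounded queue depth models finite staging memory: if storage falls too far behind, the application throttles naturally instead of exhausting buffers.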
Overheads with XGC
• 1 PB of total output per 24 hours, but really wanted 10 PB/24 hours
XGC I/O Variability
• Staging of XGC output
• Observed high I/O variation with file-based XGC checkpointing
• Research on Staging I/O with Rutgers and GA Tech: Decoupling app and I/O
• Reduced I/O variations with staging methods as well as high I/O throughput
• Creates new research opportunities with BurstBuffer (ADIOS + Staging + BurstBuffer)
• Impact: I/O performance is much more predictable
XGC I/O variations on checkpoint writing. Compared File I/O and Staging I/O on Titan.
Performance Portability
• ADIOS has methods optimized for different platforms
• BGQ-GPFS, Cray XK-Lustre, IB clusters
• Different methods can be changed for R/W optimizations
• E.g., the quantum physics QLG2Q code
Performance-island navigation with ADIOS
“Compute-Data Gap”
• Filesystem/network bandwidth does not keep up with CPU/memory advances
• Can’t write as many bits/FLOP
• Swap more I/O for (de)compression CPU cycles
Exascale systems require a well-tuned SDM core
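The tradeoff behind swapping I/O for (de)compression cycles can be modeled with back-of-the-envelope arithmetic (all numbers below are illustrative, not measured):

```python
def effective_bandwidth(fs_bw, comp_ratio=1.0, comp_bw=float("inf")):
    """Time to move one byte = compression time + transfer of the reduced
    byte. fs_bw and comp_bw in GB/s; comp_ratio = original/compressed size."""
    t = 1.0 / comp_bw + (1.0 / comp_ratio) / fs_bw
    return 1.0 / t

raw = effective_bandwidth(fs_bw=2.0)                              # no compression
lossy = effective_bandwidth(fs_bw=2.0, comp_ratio=10.0, comp_bw=5.0)
print(raw, lossy)  # 2.0 4.0 -> compression doubles effective bandwidth here
```

Compression wins whenever its CPU cost is amortized by the reduction ratio — exactly the "swap I/O for CPU cycles" argument as the bits/FLOP budget shrinks.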
CODAR: Center for Online Data Analysis and Reduction
[Diagram: CODAR data services sit between applications and exascale platforms]
PI: Ian Foster (a)
Co-Is: Scott Klasky (o), Kerstin Kleese-Van Dam (b), Todd Munson (a)
Participants: Mark Ainsworth (d), Franck Cappello (a), Barbara Chapman (b,s), Jong Choi (o), Emil Constantinescu (a), Hanqi Guo (a), Tahsin Kurc (s), Qing Liu (o), Jeremy Logan, Klaus Mueller (s), George Ostrouchov (o), Manish Parashar (r), Tom Peterka (a), Norbert Podhorszki (o), Dave Pugmire (o), Rangan Sukumar (o), Stefan Wild (a), Matthew Wolf (o), Justin Wozniak (a), Wei Xu (b), Shinjae Yoo (b)
a: Argonne National Laboratory
b: Brookhaven National Laboratory
o: Oak Ridge National Laboratory
Reduction comes with challenges
• Handling high entropy
• Performance – no benefit otherwise
• Not only errors in the variable itself, E ≡ f − f̃; we must also consider the impact on derived quantities:
  E ≡ g(f(x⃗, t)) − g̃(f̃(x⃗, t))
Where did it go?
ADIOS + Indexing/Queries
• GOAL: reduce the time to read data by understanding what to read
• Designed for querying data in multi-dimensional arrays
  • Unified handling of conditions on coordinates and values
  • Support for parallel queries through explicit data partitioning
• Key challenges
  • Organizing indexing data structures efficiently
  • Reading efficiently
• Impact
  • S3D data from a run on Jaguar can be queried and read much faster than reading the entire dataset
  • E.g., S3D data on a 1100x1080x1048 mesh
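The idea behind index-accelerated reads can be sketched with a drastically simplified bitmap-style index (FastBit itself is far more sophisticated): bin the values once at write time, then answer range queries by unioning whole bins instead of scanning the array.

```python
def build_index(values, bins):
    """bins: sorted bin edges; returns one set of row-ids per bin
    (a crude stand-in for one compressed bitmap per bin)."""
    index = [set() for _ in range(len(bins) + 1)]
    for row, v in enumerate(values):
        b = sum(v >= edge for edge in bins)   # which bin v falls into
        index[b].add(row)
    return index

def query_ge(index, bins, threshold):
    """Rows with value >= threshold, touching only candidate bins.
    Exact when threshold is a bin edge; otherwise the boundary bin
    would need a per-row check (omitted for brevity)."""
    first = sum(threshold >= edge for edge in bins)
    hits = set()
    for b in range(first, len(index)):
        hits |= index[b]                      # union instead of full scan
    return hits

vals = [0.1, 5.0, 2.5, 9.9, 4.2]
idx = build_index(vals, bins=[2.0, 4.0, 8.0])
print(sorted(query_ge(idx, [2.0, 4.0, 8.0], 4.0)))  # [1, 3, 4]
```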
I/O Pipelines in Staging
• Use the staging nodes and create a workflow in the staging nodes
• Prepare the data for future analytics
• Mitigate performance impact of I/O by using asynchronous data movement
[Diagram: staging pipeline — particle array → sort → bitmap indexing → histogram and 2D histogram → BP writer, producing a sorted array, a BP file, an index file, and plotter outputs]
Total simulation runtime improved by 2.7%
Online processing of 260GB of data from 16K cores in 40s
DataSpaces
• Goal: seamlessly support in situ processing for coupled simulations using hybrid staging
• Support for application usage modes such as I/O offloading, in situ and in-transit analytics, tight/loose coupling, fan-out workflows, etc.
• Autonomic data placement and movement leveraging deep storage and memory hierarchies
• Impact
  • Loose coupling: XGC1 + M3D + Viz
  • Tight coupling: XGC1 + XGCa
  • Loose coupling: S3D + analytics
  • Tight coupling: LES + DNS
• Results: performance improvement for the fusion workflow using ADIOS + DataSpaces
Data Management Tradeoffs at Exascale to hybrid staging
• Balance of memory size and speed
• Feedback for node designs with NVRAM, larger memory, on-chip NIC
• Network throughput and latency impact on SDMA tasks
• Placement of operations in concert with solver and network topology
Explore node design choices for data management
In-situ and in-transit partitioning of workflows
• Primary resources execute the main simulation and in situ computations
• Secondary resources provide a staging area whose cores act as buckets for in-transit computations
• 4896 cores total (4480 simulation/in situ; 256 in transit; 160 task scheduling/data movement)
• Simulation size: 1600x1372x430
• All measurements are per simulation time step
• MPI simulations with OpenMP show idle resources during data movement
• Challenges
  • Interference between simulation and analytics due to contention in the memory hierarchy
  • Selecting suitable idle periods to amortize scheduling costs
• Approach: harvest idle resources for in situ analytics
  • Dynamically predict idle resource availability
  • Reduce interference with execution throttling
• Impact: GTS simulation with parallel-coordinate visualization
  • Improved time to solution and resource efficiency
  • Indicates that co-scheduling analytics services on underutilized resources can improve in situ processing of data
  • Scales up to 12288 cores on Hopper (Cray XE6)
[Charts: main-loop time breakdown (other sequential / MPI / OpenMP, as percentages) at 1536 and 3072 cores for GTC, GTS, GROMACS (D.DPPC, ADH_Cubic), LAMMPS (LJ, EAM, Chain, Rhodopsin), BT-MZ (Classes C–E), and SP-MZ (Classes C–E); plus a histogram of idle-period lengths (0 to >14.4 ms) by count and aggregated time]
Data Staging considering Idle CPU Resources
In situ/in-transit/LAN/WAN data staging for EOD data
• ICEE
  • Uses the EVPath package (Georgia Tech)
  • Supports a uniform network interface for TCP/IP and RDMA
  • Allows us to stage data over the LAN/WAN
  • Used for EOD data in the fusion community
• DataSpaces (with sockets)
  • Supports TCP/IP and RDMA
• Select only areas of interest and send them (e.g., blobs)
• Reduces the payload on average by about 5X
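The area-of-interest selection can be sketched as follows (a deliberately crude row-level "blob" filter for illustration — not the ICEE implementation):

```python
def areas_of_interest(frame, threshold):
    """Keep only rows of a 2D frame where any value exceeds threshold,
    a toy stand-in for shipping bounding boxes around blobs."""
    return [row for row in frame if max(row) > threshold]

frame = [[0, 0, 0],
         [0, 9, 0],   # a blob
         [0, 0, 0],
         [0, 0, 0],
         [7, 0, 0]]   # another blob
roi = areas_of_interest(frame, threshold=5)
full = sum(len(r) for r in frame)
sent = sum(len(r) for r in roi)
print(full, sent)  # 15 6 -> payload reduced ~2.5X in this toy frame
```

A real system would cut tight 2D sub-chunks around each blob rather than whole rows, which is how larger reductions (the ~5X average above) are achieved.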
[Figure: a 512x640-pixel frame showing filtered-out regions, areas of interest, and a sub-chunk]
Creating living I/O MiniApps/skeletons to study these workflows
• Extended BP metadata to support automatic replay, by saving provenance
• Allows new MiniApps to be created from any previous simulation
• Incorporates performance information for post-hoc performance understanding
• Workflow Model provides input and output data for each component, uses a DAG to describe data flow
• Code generators are built using templates
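The workflow-model bullet — components with declared inputs and outputs, replayed over a DAG — can be sketched with Python's `graphlib` (requires Python 3.9+; the component names below are hypothetical):

```python
from graphlib import TopologicalSorter

def replay_order(components):
    """components: name -> (inputs, outputs). There is an edge u -> v
    whenever an output of u is an input of v; a topological order over
    that DAG drives the MiniApp replay."""
    producers = {o: n for n, (_, outs) in components.items() for o in outs}
    deps = {n: {producers[i] for i in ins if i in producers}
            for n, (ins, _) in components.items()}
    return list(TopologicalSorter(deps).static_order())

wf = {
    "sim":   ([], ["field.bp"]),
    "index": (["field.bp"], ["field.idx"]),
    "viz":   (["field.bp", "field.idx"], ["frame.png"]),
}
print(replay_order(wf))  # ['sim', 'index', 'viz']
```

With recorded provenance, each node can be replaced by a skeleton that replays the recorded I/O behavior instead of the real computation.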
ADIOS and VTK-m
• VTK-m and XViz are efforts to prepare for the increasing complexity of extreme-scale visualization
  • Addresses emerging processors, in situ visualization, and usability
• Minimizes the resources required for analysis and visualization
• Processor/accelerator awareness provides execution flexibility
• The idea is to incorporate VTK-m into the ADIOS software ecosystem
XSSA: eXtreme Scale Service Architecture
• Philosophy based on a service-oriented architecture
  • System management
  • Changing requirements
  • Evolving target platforms
  • Diverse, distributed teams
• Applications built by assembling services
  • Universal view of functionality
  • Well-defined APIs
  • Implementations can be easily modified and assembled
• Manage complexity while maintaining performance and scalability
  • Scientific problems and codes
  • Underlying disruptive infrastructure
  • Coordination across codes and research teams
  • End-to-end workflows
ADIOS roadmap to Exascale
2017
• Create test harness
• Create a clearer, more modular layering of application interfaces, data abstractions, and runtime components
• Burst buffer support
• New methods for CORAL optimizations
2018
• Code coupling support with hybrid staging
• Living workflow
• Support for new programming models
• WAN staging
2019
• EOD integration for validation workflows
• Ensemble workflow optimizations
• Data Model support for software ecosystem
Questions
Acknowledgements
• EPSI • ICEE • OLCF • RSVP • MONA • Big Data Express • XViz • VTK-m • SDAV • Exact • SENSEI • Sirius • Fusion ECP