ORNL is managed by UT-Battelle for the US Department of Energy
Accelerated Applications Readiness at OLCF
Valentine Anantharaj
National Center for Computational Sciences, Oak Ridge National Laboratory
ESPC Air-Ocean-Land-Ice Global Coupled Prediction Project Meeting, 19-20 November 2014, Boulder, CO
Slide 2
Application Readiness at OLCF
- Oak Ridge Leadership Computing Facility: delivering breakthrough science
- Center for Accelerated Application Readiness (CAAR): motivation and lessons learned
- Going forward: Application Readiness for OLCF-4 (Summit)
Slide 3
Oak Ridge Leadership Computing Facility
Mission: Deploy and operate the computational resources required to tackle global challenges
- Providing world-leading computational and data resources and specialized services for the most computationally intensive problems
- Providing a stable hardware/software path of increasing scale to maximize productive applications development
- Providing the resources to investigate otherwise inaccessible systems at every scale: from galaxy formation to supernovae to earth systems to automobiles to nanomaterials
- With our partners, delivering breakthrough and transformative discoveries in materials, biology, climate, energy technologies, and basic science & engineering
Slide 4
ORNL has a long history in high performance computing:
- 1954: ORACLE
- 1969: IBM 360/91
- 1985: Cray X-MP
- 1992-1995: Intel Paragons
- 1996-2002: IBM Power 2/3/4
- 2003-2005: Cray X1/X1E
- 2007: IBM Blue Gene/P
ORNL has had 22 systems on the TOP500 list.
Slide 5
ORNL has increased system performance by 1,000 times [2004-2010]
Slide 6
Breakthrough science at OLCF
Slide 7
CORAL System
- Our science requires that we continue to advance OLCF's computational capability over the next decade on the roadmap to exascale.
- Since clock-rate scaling ended in 2003, HPC performance has been achieved through increased parallelism. Jaguar scaled to 300,000 CPU cores. Titan and beyond deliver hierarchical parallelism with very powerful nodes: MPI plus thread-level parallelism through OpenACC or OpenMP, plus vectors.
Roadmap (2010-2022):
- Jaguar (2010): 2.3 PF, multi-core CPU, 7 MW
- Titan (2012): 27 PF, hybrid GPU/CPU, 9 MW
- Summit (2017): 5-10x Titan, hybrid GPU/CPU, 10 MW
- OLCF-5 (2022): 5-10x Summit, ~20 MW
Slide 8
ORNL's Titan Hybrid System: Cray XK7
System specifications:
- Peak performance of 27.1 PF (24.5 GPU + 2.6 CPU)
- 18,688 compute nodes, each with: 16-core AMD Opteron CPU, NVIDIA Tesla K20X GPU, 32 + 6 GB memory
- 512 service and I/O nodes
- 200 cabinets, 4,352 ft² (404 m²)
- 710 TB total system memory
- Cray Gemini 3D torus interconnect
- 8.9 MW peak power
Titan was an upgrade of the XT5 Jaguar system.
Slide 9
The Hybrid Programming Model Imperative
- On Jaguar, with 299,008 cores, we were seeing scaling issues for single-level, MPI-only applications.
- To take advantage of the vastly larger parallelism in Titan, users need to use hierarchical parallelism in their codes (a minimal sketch of this layering follows):
  - Distributed memory: MPI, SHMEM, PGAS
  - Node local: OpenMP, Pthreads, local MPI communicators
  - Within threads: vector constructs on the GPU, libraries, OpenACC
- These are the same types of constructs needed on all multi-PFLOPS computers to scale to the full size of the systems!
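The layering above can be made concrete in a few lines. The following is a minimal, hedged C++ sketch, not taken from any CAAR code: MPI provides the distributed-memory level, OpenMP the node-local threading, and an OpenACC directive the fine-grained vector/GPU level. All names and sizes are illustrative, and a compiler and toolchain supporting MPI, OpenMP, and OpenACC together are assumed.

```cpp
// Minimal sketch of hierarchical parallelism (illustrative only):
//   MPI across nodes, OpenMP across cores within a node, OpenACC for GPU/vector work.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                       // distributed-memory level
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                        // this rank's share of the problem
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double* pa = a.data();
    double* pb = b.data();

    // Node-local level: OpenMP threads share this rank's data.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        pb[i] = 2.0 * pb[i];

    // Fine-grained level: offload the expensive loop to the accelerator with OpenACC.
    #pragma acc parallel loop copy(pa[0:n]) copyin(pb[0:n])
    for (int i = 0; i < n; ++i)
        pa[i] += 0.5 * pb[i];

    // Distributed-memory level again: combine results across ranks.
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n; ++i) local += pa[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```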
Slide 10
Center for Accelerated Application Readiness (CAAR): Enabling Science on Day One
- Created as part of the Titan (OLCF-3) project to help prepare applications for accelerated architectures
- Goals:
  - Work with code teams to develop and implement strategies for exposing hierarchical parallelism in our users' applications
  - Maintain code portability across modern architectures
  - Learn from and share our results
- We selected six applications from across different science domains and algorithmic motifs
Slide 11
Center for Accelerated Application Readiness (CAAR)
Intent: Ensure Day-One GPU-enabled science on Titan
- WL-LSMS: Role of material disorder, statistics, and fluctuations in nanoscale materials and systems.
- S3D: Understanding turbulent combustion through direct numerical simulation with complex chemistry.
- CAM-SE: Answer questions about specific climate change adaptation and mitigation scenarios; realistically represent features like precipitation patterns/statistics and tropical storms.
- Denovo: Discrete ordinates radiation transport calculations that can be used in a variety of nuclear energy and technology applications.
- LAMMPS: Biofuels: an atomistic model of cellulose (blue) surrounded by lignin molecules, comprising a total of 3.3 million atoms.
- NRDF: Radiation transport critical to astrophysics, laser fusion, combustion, atmospheric dynamics, and medical imaging.
Slide 13
CAAR Approach
- Comprehensive team assigned to each app: OLCF application lead, Cray engineer, NVIDIA developer; others: other application developers, local tool/library developers
- Single early-science problem targeted for each app; success on this problem is the primary metric for success
- Particular plan of attack differs for each app: WL-LSMS depended on an accelerated ZGEMM; CAM-SE required pervasive and widespread custom acceleration
- Multiple acceleration methods explored: WL-LSMS: CULA, MAGMA, custom ZGEMM; CAM-SE: CUDA Fortran, directives
- Two-fold aim: maximum acceleration for the science target(s), and determination of an optimal, reproducible acceleration path for other science problems
- We believe we now have a good recipe for scalable MPI or MPI+OpenMP codes to enable acceleration with GPUs (a ZGEMM offload sketch follows)
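Since WL-LSMS's acceleration path hinged on a fast complex double-precision matrix multiply (ZGEMM), here is a minimal, hedged C++ sketch of offloading a single ZGEMM to the GPU through cuBLAS. It shows the generic library-call pattern, not the CAAR team's actual code; the matrix size, data, and omitted error checking are placeholder choices.

```cpp
// Minimal sketch: offload one complex double-precision GEMM (ZGEMM) to cuBLAS.
#include <cublas_v2.h>
#include <cuComplex.h>
#include <cuda_runtime.h>
#include <complex>
#include <vector>

int main() {
    const int n = 1024;                                        // square matrices for simplicity
    std::vector<std::complex<double>> A(n * n, {1.0, 0.0}),
                                      B(n * n, {0.0, 1.0}),
                                      C(n * n, {0.0, 0.0});

    cuDoubleComplex *dA, *dB, *dC;
    const size_t bytes = sizeof(cuDoubleComplex) * n * n;
    cudaMalloc(&dA, bytes);  cudaMalloc(&dB, bytes);  cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);

    // C = alpha * A * B + beta * C, computed entirely on the GPU.
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA);  cudaFree(dB);  cudaFree(dC);
    return 0;
}
```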
Slide 14
Some birth pangs and complications
- All of the chosen apps were/are under constant development
- Groups had, in many cases, already begun to explore GPU acceleration on their own [good and bad; need for buy-in; science priorities; etc.]
- Production-level tools, compilers, libraries, etc. were just beginning to become available at the start of CAAR
- Multiple paths were available, with multifarious trade-offs: ease of use, (potential) portability, performance
Slide 15
All codes will need rework to scale!
- Up to 1-2 person-years required to port each code from Jaguar to Titan
- Takes work, but it is an unavoidable step required for the next generation regardless of the type of processors; it comes from the required level of parallelism on the node
- Also pays off for other systems: the ported codes often run significantly faster CPU-only (Denovo 2x, CAM-SE >1.7x)
- We estimate possibly 70-80% of developer time is spent in code restructuring, regardless of whether using OpenMP, CUDA, OpenCL, OpenACC, ...
- Each code team must make its own choice of OpenMP vs. CUDA vs. OpenCL vs. OpenACC, based on the specific case; the conclusion may be different for each code
- Our users and their sponsors must plan for this expense
Slide 16
Think and rethink algorithms: are there more opportunities here?
- Heterogeneous architectures can make previously infeasible or inefficient models and implementations viable
- Alternative methods for electrostatics that perform slower on traditional x86 can be significantly faster on GPUs (Nguyen et al., J. Chem. Theory Comput., 2013, 73-83)
- 3-body coarse-grain simulations of water with greater concurrency can allow >100x simulation rates when compared to the fastest atomistic models, even though both are run on GPUs (Brown et al., submitted)
Slide 17
Something more to keep in mind and think about
- Science codes are under active development: porting to the GPU can be pursuing a moving target, which is challenging to manage
- More available FLOPS on the node should lead us to think of new science opportunities enabled, e.g., more degrees of freedom per grid cell
- We may need to look in unconventional places to get another ~30x thread parallelism that may be needed for exascale, e.g., parallelism in time
Slide 18
Lessons learned along the way [1]
- Major code restructurings were required; this is required regardless of the parallel API used. 70-80% of the effort was in code restructuring for all CAAR codes
- Isolating CUDA-specific constructs in one place in the code is good defensive programming to prepare for programming models that may change (a sketch follows after this list)
- It is challenging to negotiate the conflict between heavy code optimization and good SWE practice; it is not always easy to have both, in general and specifically when using CUDA (e.g., assertions)
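As a hedged illustration of the isolation point, the sketch below keeps every CUDA-specific construct behind one small interface so the rest of the code never sees it. The function name, kernel, and build-time switch are hypothetical, not taken from any CAAR code.

```cpp
// axpy.h -- the only thing the rest of the code sees; nothing CUDA-specific leaks out.
void axpy(int n, double alpha, const double* x, double* y);

// axpy_device.cu -- ALL CUDA-specific constructs live behind the interface above.
#ifdef USE_CUDA
#include <cuda_runtime.h>

__global__ void axpy_kernel(int n, double alpha, const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += alpha * x[i];
}

void axpy(int n, double alpha, const double* x, double* y) {
    double *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, n * sizeof(double));
    cudaMalloc(&dy, n * sizeof(double));
    cudaMemcpy(dx, x, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, n * sizeof(double), cudaMemcpyHostToDevice);
    axpy_kernel<<<(n + 255) / 256, 256>>>(n, alpha, dx, dy);
    cudaMemcpy(y, dy, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dx);  cudaFree(dy);
}

#else
// axpy_host.cpp -- plain CPU fallback with the identical interface.
void axpy(int n, double alpha, const double* x, double* y) {
    for (int i = 0; i < n; ++i) y[i] += alpha * x[i];
}
#endif
```

Callers link against one implementation or the other; if the programming model later changes (say, to OpenACC or OpenMP offload), only the device file needs rewriting.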
Slide 19
Lessons learned along the way [2]
- It is helpful to develop a performance model based on flop rate, memory bandwidth, and algorithm tuning knobs, to guide the mapping of the algorithm to the GPU
- It is worthwhile to write small codes to test performance for simple operations, and to incorporate this insight into algorithm design (a micro-benchmark sketch follows after this list)
- It is a challenge to understand what the processor is doing under the abstractions
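As a hedged example of such a small test code, here is a minimal triad-style streaming micro-benchmark in the spirit of STREAM; the array size, repetition count, and OpenMP threading are arbitrary assumptions, not the CAAR teams' actual harness. The measured bandwidth, combined with a kernel's flops-per-byte, feeds a simple performance model: attainable GF/s is roughly min(peak flop rate, bandwidth × arithmetic intensity).

```cpp
// Minimal streaming micro-benchmark: measures effective memory bandwidth of a
// triad loop, the kind of number a simple flops/bandwidth performance model needs.
// Illustrative only: size, repetitions, and threading are arbitrary choices.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1 << 26;                       // 64M doubles per array (~512 MB each)
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);
    const double scalar = 3.0;
    const int reps = 10;

    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        #pragma omp parallel for
        for (size_t i = 0; i < n; ++i)
            a[i] = b[i] + scalar * c[i];            // triad: 2 flops, 24 bytes moved
    }
    const auto t1 = std::chrono::steady_clock::now();

    const double secs  = std::chrono::duration<double>(t1 - t0).count();
    const double bytes = 3.0 * sizeof(double) * double(n) * reps;  // read b, read c, write a
    std::printf("triad bandwidth: %.1f GB/s\n", bytes / secs / 1e9);
    return 0;
}
```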
Slide 20
Lessons learned along the way [3]
- It is difficult to know beforehand what the best strategy for parallelization will be, or what the final outcome will be; a porting effort could easily fail if the GPU has inadequate register space for the planned algorithm mapping
- Performance can be very sensitive to small tweaks in the code; the best way to write the code must be determined empirically
- Often, the GPU porting effort for the algorithm also improves performance on the CPU (in all the CAAR cases, in fact, ~2x)
Slide 21
Summary
- Critically important problems require leadership computing facilities that provide the most powerful compute and data infrastructure
- Computer system performance increases through parallelism: clock speed trends flat to slower over coming years; applications must utilize all inherent parallelism
- Accelerated, hybrid multi-core computing solutions are performing well on real, complex scientific applications
- But you must work to expose the parallelism in your codes; this refactoring of codes is largely common to all massively parallel architectures
- OLCF resources are available to industry, academia, and labs through open, peer-reviewed allocation mechanisms
Slide 22
How effective are GPUs on scalable applications and specific problems?
- Cray XK7: Fermi GPU plus AMD 16-core Opteron CPU
- Cray XE6: 2x AMD 16-core Opteron CPUs
Slide 23
Application Power Efficiency of the Cray XK7: WL-LSMS for CPU-Only and Accelerated Computing
- Runtime is 8.6x faster for the accelerated code
- Energy consumed is 7.3x less:
  - GPU-accelerated code consumed 3,500 kW-hr
  - CPU-only code consumed 25,700 kW-hr
Power consumption traces for identical WL-LSMS runs with 1024 Fe atoms on 18,561 Titan nodes (99% of Titan)
Slide 24
OLCF User Requirements Survey: Key Findings
- Surveys are a lagging indicator that tend to tell us what problems the users are seeing now, not what they expect to see in the future
- For the first time ever, FLOPS was NOT the #1 ranked requirement
- Local memory capacity was not a driver for most users, perhaps in recognition of cost trends
- 76% of users said there is still a moderate to large amount of parallelism to extract in their code, but 85% of respondents rated the difficulty of extracting that parallelism as moderate or difficult, underlining the need for more training and assistance in this area
- Data sharing, long-term aggregate archival storage of 100s of PB, data analytics, and faster WAN connections also showed up as important requirements
Slide 25
2017 OLCF Leadership System
- Hybrid CPU/GPU architecture
- Vendors: IBM (prime) / NVIDIA / Mellanox Technologies
- At least 5x Titan's application performance
- Approximately 3,400 nodes, each with:
  - Multiple IBM POWER9 CPUs and multiple NVIDIA Tesla GPUs using the NVIDIA Volta architecture
  - CPUs and GPUs completely connected with high-speed NVLink
  - Large coherent memory: over 512 GB (HBM + DDR4), all directly addressable from the CPUs and GPUs
  - An additional 800 GB of NVRAM, which can be configured as either a burst buffer or as extended memory
  - Over 40 TF peak performance
- Dual-rail Mellanox EDR-IB full, non-blocking fat-tree interconnect
- IBM Elastic Storage (GPFS): 1 TB/s I/O and 120 PB disk capacity
Slide 26
CAAR Call for Proposals: applications being accepted now!
- For information on how to apply, visit: https://www.olcf.ornl.gov/summit/caar-call-for-proposals/
- February 20, 2015: CAAR proposal deadline
OLCF Distinguished Postdoctoral Associates Program
- Apply now: 8 positions available immediately!
- https://www.olcf.ornl.gov/summit/olcf-distinguished-postdoctoral-associates-program/
Slide 27
Acknowledgment
- OLCF-3 CAAR team: Bronson Messer (lead), Wayne Joubert, Mike Brown, Matt Norman, Markus Eisenbach, Ramanan Sankaran; and many others, including the OLCF/NCCS UAO and SciComp groups
- Jack Wells and Arthur (Buddy) Bland
- OLCF-3 vendor partners: Cray, AMD, and NVIDIA
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Slide 28
[email protected]
This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
Want to join our team? ORNL is hiring. Contact us at https://jobs.ornl.gov
GPU Hackathon, October 2014
Slide 29
Additional material
Slide 30
How does Summit compare to Titan?

Feature                                          | Summit                     | Titan
Application performance                          | 5-10x Titan                | Baseline
Number of nodes                                  | ~3,400                     | 18,688
Node performance                                 | > 40 TF                    | 1.4 TF
Memory per node                                  | > 512 GB (HBM + DDR4)      | 38 GB (GDDR5 + DDR3)
NVRAM per node                                   | 800 GB                     | 0
Node interconnect                                | NVLink (5-12x PCIe 3)      | PCIe 2
System interconnect (node injection bandwidth)   | Dual-rail EDR-IB (23 GB/s) | Gemini (6.4 GB/s)
Interconnect topology                            | Non-blocking fat tree      | 3D torus
Processors                                       | IBM POWER9, NVIDIA Volta   | AMD Opteron, NVIDIA Kepler
File system                                      | 120 PB, 1 TB/s, GPFS       | 32 PB, 1 TB/s, Lustre
Peak power consumption                           | 10 MW                      | 9 MW
Slide 31
Allocations on Titan
- DD and INCITE are open to international researchers; ALCC (ASCR Leadership Computing Challenge) is primarily intended for US DOE and industry
- INCITE: one call per year (usually due 30 June); INCITE proposals are subject to scientific peer review and computational readiness review
- DD: web-form application, accepted continuously; DD applications are reviewed weekly by the Resource Utilization Council
- Typically have 2 scientific referees
- For new users it is most important to say what development you plan to perform and whether you anticipate competing in INCITE in the future
- Awards are typically O(1M) Titan core-hours
Slide 32
OLCF-4
Slide 33
Where do we go from here?
- Provide the leadership computing capabilities needed for the DOE Office of Science mission from 2018 through 2022; capabilities for INCITE and ALCC science projects
- CORAL was formed by grouping the three labs that would be acquiring leadership computers in the same timeframe (2017). Benefits include:
  - Shared technical expertise
  - Decreased risk due to the broader experiences and broader range of expertise of the collaboration
  - Lower collective cost for developing and responding to the RFP
Slide 34
CORAL Collaboration (ORNL, ANL, LLNL) Leadership Computers
- RFP requests >100 PF, 2 GB/core main memory, local NVRAM, and science performance 4x-8x Titan or Sequoia
- Objective: procure 3 leadership computers to be sited at Argonne, Oak Ridge, and Lawrence Livermore in 2017. Two of the contracts have been awarded, with the Argonne contract in process.
- Approach:
  - Competitive process: one RFP (issued by LLNL) leading to 2 R&D contracts and 3 computer procurement contracts
  - For risk reduction and to meet a broad set of requirements, 2 architectural paths will be selected, and Oak Ridge and Argonne must choose different architectures
  - Once selected, a multi-year lab-awardee relationship to co-design the computers
  - Both R&D contracts are jointly managed by the 3 labs
  - Each lab manages and negotiates its own computer procurement contract, and may exercise options to meet its specific needs
  - Understanding that long procurement lead times may impact architectural characteristics and designs of procured computers
Current DOE leadership computers: Sequoia (LLNL) 2012-2017, Mira (ANL) 2012-2017, Titan (ORNL) 2012-2017
Slide 35
Two Architecture Paths for Today's and Future Leadership Systems
Hybrid multi-core (like Titan):
- CPU/GPU hybrid systems
- Likely to have multiple CPUs and GPUs per node
- Small number of very powerful nodes
- Expect data movement issues to be much easier than on previous systems: coherent shared memory within a node
- Multiple levels of memory: on package, DDR, and non-volatile
Many core (like Sequoia/Mira):
- 10s of thousands of nodes with millions of cores
- Homogeneous cores
- Multiple levels of memory: on package, DDR, and non-volatile
- Unlike prior generations, future products are likely to be self-hosted
Power concerns for large supercomputers are driving the largest systems to either hybrid or many-core architectures.
Slide 36
Summit Key Software Components
System:
- Linux
- IBM Elastic Storage (GPFS)
- IBM Platform Computing (LSF)
- IBM Platform Cluster Manager (xCAT)
Programming environment:
- Compilers supporting OpenMP and OpenACC: IBM XL, PGI, LLVM, GNU, NVIDIA
- Libraries: IBM Engineering and Scientific Subroutine Library (ESSL); FFTW, ScaLAPACK, PETSc, Trilinos, BLAS-1,-2,-3, NVBLAS; cuFFT, cuSPARSE, cuRAND, NPP, Thrust
- Debugging: Allinea DDT, IBM Parallel Environment Runtime Edition (pdb), cuda-gdb, cuda-memcheck, valgrind, memcheck, helgrind, stacktrace
- Profiling: IBM Parallel Environment Developer Edition (HPC Toolkit), VAMPIR, TAU, Open|SpeedShop, nvprof, gprof, Rice HPCToolkit
Slide 38
CAM-SE: Community Atmosphere Model, Spectral Elements (CUDA Fortran)
Code description:
- Employs an equal-angle, cubed-sphere grid and terrain-following coordinate system
- Scaled to 172,800 cores on the XT5
- Exactly conserves dry mass without the need for ad hoc fixes
- Original baseline code achieves parallelism through domain decomposition using one MPI task per element
Porting strategy:
- Use hybrid MPI/OpenMP parallelism
- Intensive kernels are coded in CUDA Fortran
- Migration to OpenACC in progress
Early performance results on XK7:
- Refactored code was 1.7x faster on Cray XT5
- XK7 outperforms XE6 by 1.5x
Science target (20 PF Titan):
- CAM simulation using MOZART tropospheric chemistry with 106 constituents at 14 km horizontal grid resolution
- Using the realistic MOZART chemical tracer network, tracer transport (i.e., advection) dominates the run time
Image: cubed-sphere grid of the CAM spectral element model; each cube panel is divided into elements (http://www-personal.umich.edu/~paullric/A_CubedSphere.png)
Slide 39
DENOVO: 3D Neutron Transport for Nuclear Reactor Design (CUDA)
Code description:
- Linear Boltzmann radiation transport
- Discrete ordinates method
- Iterative eigenvalue solution
- Multigrid, preconditioned linear solves
- C++ with F95 kernels
Porting strategy:
- SWEEP kernel rewritten in C++ & CUDA; runs on CPU or GPU
- Scaling to over 200K cores, with opportunities for increased parallelism on GPUs
- Reintegrate SWEEP into DENOVO
Early performance results on XK7:
- Refactored code was 2x faster on Cray XT5
- XK7 performance exceeds XE6 by 3.8x
Science target (20 PF Titan):
- DENOVO is a component of the DOE CASL Hub, necessary to achieve CASL challenge problems
- 3-D, full-reactor radiation transport for CASL challenge problems within 12 wall-clock hours. This is the CASL stretch-goal problem!
Slide 40
S3D: Direct Numerical Simulation of Turbulent Combustion (OpenACC)
Code description:
- Compressible Navier-Stokes equations
- 3D Cartesian grid, 8th-order finite difference
- Explicit 4th-order Runge-Kutta integration
- Fortran; 3D domain decomposition, non-blocking MPI
Porting strategy:
- Hybrid MPI/OpenMP/OpenACC application
- All intensive calculations can be on the accelerator
- Redesigned message passing to overlap communication and computation (an overlap sketch follows after this list)
Early performance results on XK7:
- Refactored code was 2x faster on Cray XT5
- OpenACC acceleration with minimal overhead
- XK7 outperforms XE6 by 2.2x
Science target (20 PF Titan):
- Increased chemical complexity for combustion: Jaguar: 9-22 species (H2, syngas, ethylene); Titan: 30-100 species (ethanol, n-heptane, iso-octane)
- DNS provides unique fundamental insight into the chemistry-turbulence interaction
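The overlap of communication and computation that S3D's redesign exploited can be sketched as below; this is a hedged C++/MPI/OpenACC illustration only. S3D itself is Fortran and uses a 3D decomposition, so the 1D ring, single ghost cell, stencil, and function name here are simplifying assumptions.

```cpp
// Minimal sketch of overlapping halo exchange (non-blocking MPI) with interior
// work (OpenACC). A 1D ring decomposition with one ghost cell per side; n >= 5.
#include <mpi.h>
#include <vector>

void step(std::vector<double>& u, int rank, int nranks) {
    const int n = static_cast<int>(u.size());        // includes 2 ghost cells
    double* p = u.data();
    const int left  = (rank + nranks - 1) % nranks;
    const int right = (rank + 1) % nranks;
    MPI_Request reqs[4];

    // Start the halo exchange, then compute the interior points that do not
    // depend on the ghost cells while the messages are in flight.
    MPI_Irecv(&p[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&p[n - 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&p[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&p[n - 2], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

    std::vector<double> unew(u);
    double* q = unew.data();

    // Interior update on the accelerator; reads only p[1..n-2], which the
    // in-flight messages never touch.
    #pragma acc parallel loop copyin(p[1:n-2]) copyout(q[2:n-4])
    for (int i = 2; i < n - 2; ++i)
        q[i] = 0.25 * (p[i - 1] + 2.0 * p[i] + p[i + 1]);

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);        // halos have now arrived

    // Boundary-adjacent points that needed the ghost cells.
    q[1]     = 0.25 * (p[0] + 2.0 * p[1] + p[2]);
    q[n - 2] = 0.25 * (p[n - 3] + 2.0 * p[n - 2] + p[n - 1]);
    u.swap(unew);
}
```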
Slide 41
Wang-Landau LSMS: First-Principles Statistical Mechanics of Magnetic Materials (Libraries)
Code description:
- Combines classical statistical mechanics (Wang-Landau) for atomic magnetic moment distributions with first-principles calculations (LSMS) of the associated energies
- Main computational effort is dense linear algebra for complex numbers
- F77 with some F90, and C++ for the statistical mechanics driver
Porting strategy:
- Leverage accelerated linear algebra libraries, e.g., cuBLAS + CULA, LibSci_acc
- Parallelization over (1) Wang-Landau Monte Carlo walkers, (2) atoms through MPI processes, (3) OpenMP on CPU sections (a communicator-splitting sketch follows after this list)
- Restructured communications: moved outside the energy loop
Early performance results on XK7:
- XK7 outperforms XE6 by 3.8x
Science target (20 PF Titan):
- Compute the magnetic structure and thermodynamics of low-dimensional magnetic structures
- Calculate both the magnetization and the free energy for magnetic materials
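The walker/atom hierarchy above maps naturally onto nested MPI communicators. Below is a minimal, hedged C++ sketch of that split; the group size, names, and placeholder "energy" are hypothetical and not taken from WL-LSMS itself.

```cpp
// Minimal sketch of two-level MPI parallelism: split the world communicator into
// groups, one per Monte Carlo walker; the ranks inside a group share that
// walker's atoms for the energy calculation. Illustrative only.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int world_rank = 0, world_size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    const int ranks_per_walker = 4;                    // e.g., 4 ranks share one walker's atoms
    const int walker_id = world_rank / ranks_per_walker;

    // Local communicator: all ranks working on the same walker.
    MPI_Comm walker_comm;
    MPI_Comm_split(MPI_COMM_WORLD, walker_id, world_rank, &walker_comm);

    int atom_rank = 0;
    MPI_Comm_rank(walker_comm, &atom_rank);

    // Each group evaluates one walker's energy (atoms distributed over the
    // group's ranks); a reduction inside the group assembles that energy.
    double my_partial_energy = 1.0 * (atom_rank + 1);  // placeholder work
    double walker_energy = 0.0;
    MPI_Reduce(&my_partial_energy, &walker_energy, 1, MPI_DOUBLE, MPI_SUM, 0, walker_comm);

    if (atom_rank == 0)
        std::printf("walker %d energy (placeholder): %f\n", walker_id, walker_energy);

    MPI_Comm_free(&walker_comm);
    MPI_Finalize();
    return 0;
}
```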
Slide 42
CAAR algorithmic coverage
[Table: each application (S3D, CAM, LSMS, LAMMPS, Denovo, PFLOTRAN (AMR)) marked against the algorithmic motifs it covers: FFT, dense linear algebra, sparse linear algebra, particles, Monte Carlo, structured grids, unstructured grids]
- Selected applications represented the bulk of use for 6 INCITE allocations totaling 212M CPU-hours (2009)
- Represented 35% of 2009 INCITE allocations and 23% of 2010 INCITE allocations (in CPU-hours)
Slide 43
Extra Slides
Slide 44
Today's Leadership System: Titan
- Hybrid CPU/GPU architecture, hierarchical parallelism
- Vendors: Cray / NVIDIA
- 27 PF peak
- 18,688 compute nodes, each with 1.45 TF peak:
  - NVIDIA Kepler GPU: 1,311 GF, 6 GB GDDR5 memory
  - AMD Opteron: 141 GF, 32 GB DDR3 memory
  - PCIe2 link between GPU and CPU
- Cray Gemini 3-D torus interconnect
- 32 PB / 1 TB/s Lustre file system
Slide 45
Transition to Operations (T2O)
- T2O: the period between acceptance and full user operations; much of the first quarter of 2013 was in T2O
- Purpose: harden the machine using mature codes running big jobs
- OLCF-ported apps were the initial codes to run in T2O; other very ready codes also ran in T2O
- Ramped up INCITE & ALCC over a few-month period
- Even with an extended T2O period, OLCF delivered INCITE and ALCC commitments
[Chart axis: Weeks (2013)]
Slide 46
Scientific Progress at All Scales
- Fusion energy: A Princeton Plasma Physics Laboratory team led by C.S. Chang was able to simulate and explain nonlinear coherent turbulence structures at the plasma edge in a fusion reactor by exploiting the increased performance of the Titan architecture.
- Turbomachinery: Ramgen Power Systems is using the Titan system to significantly reduce time to solution in the optimization of novel designs based on aerospace shock wave compression technology for gas compression systems, such as carbon dioxide compressors.
- Nuclear reactors: Consortium for Advanced Simulation of Light Water Reactors (CASL) investigators successfully performed full-core physics power-up simulations of the Westinghouse AP1000 pressurized water reactor core using their Virtual Environment for Reactor Applications code, which was designed on the Titan architecture.
- Liquid crystal film stability: ORNL postdoctoral fellow Trung Nguyen ran unprecedented large-scale molecular dynamics simulations on Titan to model the beginnings of ruptures in thin films wetting a solid substrate.
- Earthquake hazard assessment: To better prepare California for large seismic events, SCEC joint researchers have simulated earthquakes at high frequencies, allowing structural engineers to conduct realistic physics-based probabilistic seismic hazard analyses.
- Climate science: Simulations on OLCF resources recreated the evolution of the climate state during the first half of the last deglaciation period, allowing scientists to explain the mechanisms that triggered the last period of Southern Hemisphere warming and deglaciation.