ORNL is managed by UT-Battelle for the US Department of Energy
Accelerated Applications Readiness at OLCF
Valentine Anantharaj
National Center for Computational Sciences, Oak Ridge National Laboratory
ESPC Air-Ocean-Land-Ice Global Coupled Prediction Project Meeting, 19-20 November 2014, Boulder, CO
Slide 2
Application Readiness at OLCF
- Oak Ridge Leadership Computing Facility: delivering breakthrough science
- Center for Accelerated Application Readiness (CAAR): motivation and lessons learned
- Going forward: Application Readiness for OLCF-4 (Summit)
Slide 3
Oak Ridge Leadership Computing Facility
Mission: Deploy and operate the computational resources required to tackle global challenges
- Providing world-leading computational and data resources and specialized services for the most computationally intensive problems
- Providing a stable hardware/software path of increasing scale to maximize productive applications development
- Providing the resources to investigate otherwise inaccessible systems at every scale: from galaxy formation to supernovae to earth systems to automobiles to nanomaterials
- With our partners, delivering breakthrough and transformative discoveries in materials, biology, climate, energy technologies, and basic science & engineering
Slide 4
ORNL has a long history in high performance computing:
- 1954: ORACLE
- 1969: IBM 360/91
- 1985: Cray X-MP
- 1992-1995: Intel Paragons
- 1996-2002: IBM Power 2/3/4
- 2003-2005: Cray X1/X1E
- 2007: IBM Blue Gene/P
ORNL has had 22 systems on the TOP500 list.
Slide 5
ORNL has increased system performance by 1,000 times [2004-2010]
Slide 6
Breakthrough science at OLCF
Slide 7
CORAL System
- Our science requires that we continue to advance OLCF's computational capability over the next decade on the roadmap to exascale.
- Since clock-rate scaling ended in 2003, HPC performance has been achieved through increased parallelism. Jaguar scaled to 300,000 CPU cores. Titan and beyond deliver hierarchical parallelism with very powerful nodes: MPI plus thread-level parallelism through OpenACC or OpenMP, plus vectors.
Roadmap (2010-2022):
- Jaguar (2010): 2.3 PF, multi-core CPU, 7 MW
- Titan (2012): 27 PF, hybrid GPU/CPU, 9 MW
- Summit (2017): 5-10x Titan, hybrid GPU/CPU, 10 MW
- OLCF-5 (2022): 5-10x Summit, ~20 MW
Slide 8
ORNL's Titan Hybrid System: Cray XK7
System specifications:
- Peak performance of 27.1 PF (24.5 GPU + 2.6 CPU)
- 18,688 compute nodes, each with: 16-core AMD Opteron CPU, NVIDIA Tesla K20X GPU, 32 + 6 GB memory
- 512 service and I/O nodes
- 200 cabinets, 4,352 ft² (404 m²)
- 710 TB total system memory
- Cray Gemini 3D torus interconnect
- 8.9 MW peak power
Titan was an upgrade of the XT5 Jaguar system.
Slide 9
The Hybrid Programming Model Imperative
- On Jaguar, with 299,008 cores, we were seeing scaling issues for single-level, MPI-only applications.
- To take advantage of the vastly larger parallelism in Titan, users need to use hierarchical parallelism in their codes (a minimal sketch of this layering follows):
  - Distributed memory: MPI, SHMEM, PGAS
  - Node local: OpenMP, Pthreads, local MPI communicators
  - Within threads: vector constructs on the GPU, libraries, OpenACC
- These are the same types of constructs needed on all multi-PFLOPS computers to scale to the full size of the systems!
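The layering above can be made concrete in a few lines. The following is a minimal, hedged C++ sketch, not taken from any CAAR code: MPI provides the distributed-memory level, OpenMP the node-local threading, and an OpenACC directive the fine-grained vector/GPU level. All names and sizes are illustrative, and a compiler and toolchain supporting MPI, OpenMP, and OpenACC together are assumed.

```cpp
// Minimal sketch of hierarchical parallelism (illustrative only):
//   MPI across nodes, OpenMP across cores within a node, OpenACC for GPU/vector work.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                       // distributed-memory level
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                        // this rank's share of the problem
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double* pa = a.data();
    double* pb = b.data();

    // Node-local level: OpenMP threads share this rank's data.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        pb[i] = 2.0 * pb[i];

    // Fine-grained level: offload the expensive loop to the accelerator with OpenACC.
    #pragma acc parallel loop copy(pa[0:n]) copyin(pb[0:n])
    for (int i = 0; i < n; ++i)
        pa[i] += 0.5 * pb[i];

    // Distributed-memory level again: combine results across ranks.
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n; ++i) local += pa[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```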
Slide 10
Center for Accelerated Application Readiness (CAAR): Enabling Science on Day One
- Created as part of the Titan (OLCF-3) project to help prepare applications for accelerated architectures
- Goals:
  - Work with code teams to develop and implement strategies for exposing hierarchical parallelism in our users' applications
  - Maintain code portability across modern architectures
  - Learn from and share our results
- We selected six applications from across different science domains and algorithmic motifs
Slide 11
Center for Accelerated Application Readiness (CAAR)
Intent: Ensure Day-One GPU-enabled science on Titan
- WL-LSMS: Role of material disorder, statistics, and fluctuations in nanoscale materials and systems.
- S3D: Understanding turbulent combustion through direct numerical simulation with complex chemistry.
- CAM-SE: Answer questions about specific climate change adaptation and mitigation scenarios; realistically represent features like precipitation patterns/statistics and tropical storms.
- Denovo: Discrete ordinates radiation transport calculations that can be used in a variety of nuclear energy and technology applications.
- LAMMPS: Biofuels: an atomistic model of cellulose (blue) surrounded by lignin molecules, comprising a total of 3.3 million atoms.
- NRDF: Radiation transport critical to astrophysics, laser fusion, combustion, atmospheric dynamics, and medical imaging.
Slide 13
CAAR Approach
- Comprehensive team assigned to each app: OLCF application lead, Cray engineer, NVIDIA developer; others: other application developers, local tool/library developers
- Single early-science problem targeted for each app; success on this problem is the primary metric for success
- Particular plan of attack differs for each app: WL-LSMS depended on an accelerated ZGEMM; CAM-SE required pervasive and widespread custom acceleration
- Multiple acceleration methods explored: WL-LSMS: CULA, MAGMA, custom ZGEMM; CAM-SE: CUDA Fortran, directives
- Two-fold aim: maximum acceleration for the science target(s), and determination of an optimal, reproducible acceleration path for other science problems
- We believe we now have a good recipe for scalable MPI or MPI+OpenMP codes to enable acceleration with GPUs (a ZGEMM offload sketch follows)
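Since WL-LSMS's acceleration path hinged on a fast complex double-precision matrix multiply (ZGEMM), here is a minimal, hedged C++ sketch of offloading a single ZGEMM to the GPU through cuBLAS. It shows the generic library-call pattern, not the CAAR team's actual code; the matrix size, data, and omitted error checking are placeholder choices.

```cpp
// Minimal sketch: offload one complex double-precision GEMM (ZGEMM) to cuBLAS.
#include <cublas_v2.h>
#include <cuComplex.h>
#include <cuda_runtime.h>
#include <complex>
#include <vector>

int main() {
    const int n = 1024;                                        // square matrices for simplicity
    std::vector<std::complex<double>> A(n * n, {1.0, 0.0}),
                                      B(n * n, {0.0, 1.0}),
                                      C(n * n, {0.0, 0.0});

    cuDoubleComplex *dA, *dB, *dC;
    const size_t bytes = sizeof(cuDoubleComplex) * n * n;
    cudaMalloc(&dA, bytes);  cudaMalloc(&dB, bytes);  cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);

    // C = alpha * A * B + beta * C, computed entirely on the GPU.
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA);  cudaFree(dB);  cudaFree(dC);
    return 0;
}
```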
Slide 14
Some birth pangs and complications
- All of the chosen apps were/are under constant development
- Groups had, in many cases, already begun to explore GPU acceleration on their own [good and bad; need for buy-in; science priorities; etc.]
- Production-level tools, compilers, libraries, etc. were just beginning to become available at the start of CAAR
- Multiple paths were available, with multifarious trade-offs: ease of use, (potential) portability, performance
Slide 15
All codes will need rework to scale!
- Up to 1-2 person-years required to port each code from Jaguar to Titan
- Takes work, but it is an unavoidable step required for the next generation regardless of the type of processors; it comes from the required level of parallelism on the node
- Also pays off for other systems: the ported codes often run significantly faster CPU-only (Denovo 2x, CAM-SE >1.7x)
- We estimate possibly 70-80% of developer time is spent in code restructuring, regardless of whether using OpenMP, CUDA, OpenCL, OpenACC, ...
- Each code team must make its own choice of OpenMP vs. CUDA vs. OpenCL vs. OpenACC, based on the specific case; the conclusion may be different for each code
- Our users and their sponsors must plan for this expense
Slide 16
Think and rethink algorithms: are there more opportunities here?
- Heterogeneous architectures can make previously infeasible or inefficient models and implementations viable
- Alternative methods for electrostatics that perform slower on traditional x86 can be significantly faster on GPUs (Nguyen et al., J. Chem. Theory Comput., 2013, 73-83)
- 3-body coarse-grain simulations of water with greater concurrency can allow >100x simulation rates when compared to the fastest atomistic models, even though both are run on GPUs (Brown et al., submitted)
Slide 17
Something more to keep in mind and think about
- Science codes are under active development: porting to the GPU can be pursuing a moving target, which is challenging to manage
- More available FLOPS on the node should lead us to think of new science opportunities enabled, e.g., more degrees of freedom per grid cell
- We may need to look in unconventional places to get another ~30x thread parallelism that may be needed for exascale, e.g., parallelism in time
Slide 18
Lessons learned along the way [1]
- Major code restructurings were required; this is required regardless of the parallel API used. 70-80% of the effort was in code restructuring for all CAAR codes
- Isolating CUDA-specific constructs in one place in the code is good defensive programming to prepare for programming models that may change (a sketch follows after this list)
- It is challenging to negotiate the conflict between heavy code optimization and good SWE practice; it is not always easy to have both, in general and specifically when using CUDA (e.g., assertions)
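As a hedged illustration of the isolation point, the sketch below keeps every CUDA-specific construct behind one small interface so the rest of the code never sees it. The function name, kernel, and build-time switch are hypothetical, not taken from any CAAR code.

```cpp
// axpy.h -- the only thing the rest of the code sees; nothing CUDA-specific leaks out.
void axpy(int n, double alpha, const double* x, double* y);

// axpy_device.cu -- ALL CUDA-specific constructs live behind the interface above.
#ifdef USE_CUDA
#include <cuda_runtime.h>

__global__ void axpy_kernel(int n, double alpha, const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += alpha * x[i];
}

void axpy(int n, double alpha, const double* x, double* y) {
    double *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, n * sizeof(double));
    cudaMalloc(&dy, n * sizeof(double));
    cudaMemcpy(dx, x, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, n * sizeof(double), cudaMemcpyHostToDevice);
    axpy_kernel<<<(n + 255) / 256, 256>>>(n, alpha, dx, dy);
    cudaMemcpy(y, dy, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dx);  cudaFree(dy);
}

#else
// axpy_host.cpp -- plain CPU fallback with the identical interface.
void axpy(int n, double alpha, const double* x, double* y) {
    for (int i = 0; i < n; ++i) y[i] += alpha * x[i];
}
#endif
```

Callers link against one implementation or the other; if the programming model later changes (say, to OpenACC or OpenMP offload), only the device file needs rewriting.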
Slide 19
Lessons learned along the way [2]
- It is helpful to develop a performance model based on flop rate, memory bandwidth, and algorithm tuning knobs, to guide the mapping of the algorithm to the GPU
- It is worthwhile to write small codes to test performance for simple operations, and to incorporate this insight into algorithm design (a micro-benchmark sketch follows after this list)
- It is a challenge to understand what the processor is doing under the abstractions
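As a hedged example of such a small test code, here is a minimal triad-style streaming micro-benchmark in the spirit of STREAM; the array size, repetition count, and OpenMP threading are arbitrary assumptions, not the CAAR teams' actual harness. The measured bandwidth, combined with a kernel's flops-per-byte, feeds a simple performance model: attainable GF/s is roughly min(peak flop rate, bandwidth × arithmetic intensity).

```cpp
// Minimal streaming micro-benchmark: measures effective memory bandwidth of a
// triad loop, the kind of number a simple flops/bandwidth performance model needs.
// Illustrative only: size, repetitions, and threading are arbitrary choices.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1 << 26;                       // 64M doubles per array (~512 MB each)
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);
    const double scalar = 3.0;
    const int reps = 10;

    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        #pragma omp parallel for
        for (size_t i = 0; i < n; ++i)
            a[i] = b[i] + scalar * c[i];            // triad: 2 flops, 24 bytes moved
    }
    const auto t1 = std::chrono::steady_clock::now();

    const double secs  = std::chrono::duration<double>(t1 - t0).count();
    const double bytes = 3.0 * sizeof(double) * double(n) * reps;  // read b, read c, write a
    std::printf("triad bandwidth: %.1f GB/s\n", bytes / secs / 1e9);
    return 0;
}
```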
Slide 20
Lessons learned along the way [3]
- It is difficult to know beforehand what the best strategy for parallelization will be, or what the final outcome will be; a porting effort could easily fail if the GPU has inadequate register space for the planned algorithm mapping
- Performance can be very sensitive to small tweaks in the code; the best way to write the code must be determined empirically
- Often, the GPU porting effort for the algorithm also improves performance on the CPU (in all the CAAR cases, in fact, ~2x)
Slide 21
Summary
- Critically important problems require leadership computing facilities that provide the most powerful compute and data infrastructure
- Computer system performance increases through parallelism: clock speed trends flat to slower over coming years; applications must utilize all inherent parallelism
- Accelerated, hybrid multi-core computing solutions are performing well on real, complex scientific applications
- But you must work to expose the parallelism in your codes; this refactoring of codes is largely common to all massively parallel architectures
- OLCF resources are available to industry, academia, and labs through open, peer-reviewed allocation mechanisms
Slide 22
How effective are GPUs on scalable applications and specific problems?
- Cray XK7: Fermi GPU plus AMD 16-core Opteron CPU
- Cray XE6: 2x AMD 16-core Opteron CPUs
Slide 23
Application Power Efficiency of the Cray XK7: WL-LSMS for CPU-Only and Accelerated Computing
- Runtime is 8.6x faster for the accelerated code
- Energy consumed is 7.3x less:
  - GPU-accelerated code consumed 3,500 kW-hr
  - CPU-only code consumed 25,700 kW-hr
Power consumption traces for identical WL-LSMS runs with 1024 Fe atoms on 18,561 Titan nodes (99% of Titan)
Slide 24
OLCF User Requirements Survey: Key Findings
- Surveys are a lagging indicator that tend to tell us what problems the users are seeing now, not what they expect to see in the future
- For the first time ever, FLOPS was NOT the #1 ranked requirement
- Local memory capacity was not a driver for most users, perhaps in recognition of cost trends
- 76% of users said there is still a moderate to large amount of parallelism to extract in their code, but 85% of respondents rated the difficulty of extracting that parallelism as moderate or difficult, underlining the need for more training and assistance in this area
- Data sharing, long-term aggregate archival storage of 100s of PB, data analytics, and faster WAN connections also showed up as important requirements
Slide 25
2017 OLCF Leadership System
- Hybrid CPU/GPU architecture
- Vendors: IBM (prime) / NVIDIA / Mellanox Technologies
- At least 5x Titan's application performance
- Approximately 3,400 nodes, each with:
  - Multiple IBM POWER9 CPUs and multiple NVIDIA Tesla GPUs using the NVIDIA Volta architecture
  - CPUs and GPUs completely connected with high-speed NVLink
  - Large coherent memory: over 512 GB (HBM + DDR4), all directly addressable from the CPUs and GPUs
  - An additional 800 GB of NVRAM, which can be configured as either a burst buffer or as extended memory
  - Over 40 TF peak performance
- Dual-rail Mellanox EDR-IB full, non-blocking fat-tree interconnect
- IBM Elastic Storage (GPFS): 1 TB/s I/O and 120 PB disk capacity
Slide 26
CAAR Call for Proposals: applications being accepted now!
- For information on how to apply, visit: https://www.olcf.ornl.gov/summit/caar-call-for-proposals/
- February 20, 2015: CAAR proposal deadline
OLCF Distinguished Postdoctoral Associates Program
- Apply now: 8 positions available immediately!
- https://www.olcf.ornl.gov/summit/olcf-distinguished-postdoctoral-associates-program/
Slide 27
Acknowledgment
- OLCF-3 CAAR team: Bronson Messer (lead), Wayne Joubert, Mike Brown, Matt Norman, Markus Eisenbach, Ramanan Sankaran; and many others, including the OLCF/NCCS UAO and SciComp groups
- Jack Wells and Arthur (Buddy) Bland
- OLCF-3 vendor partners: Cray, AMD, and NVIDIA
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Slide 28
[email protected]
This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
Want to join our team? ORNL is hiring. Contact us at https://jobs.ornl.gov
GPU Hackathon, October 2014
Slide 29
Additional material
Slide 30
How does Summit compare to Titan?

Feature                                          | Summit                     | Titan
Application performance                          | 5-10x Titan                | Baseline
Number of nodes                                  | ~3,400                     | 18,688
Node performance                                 | > 40 TF                    | 1.4 TF
Memory per node                                  | > 512 GB (HBM + DDR4)      | 38 GB (GDDR5 + DDR3)
NVRAM per node                                   | 800 GB                     | 0
Node interconnect                                | NVLink (5-12x PCIe 3)      | PCIe 2
System interconnect (node injection bandwidth)   | Dual-rail EDR-IB (23 GB/s) | Gemini (6.4 GB/s)
Interconnect topology                            | Non-blocking fat tree      | 3D torus
Processors                                       | IBM POWER9, NVIDIA Volta   | AMD Opteron, NVIDIA Kepler
File system                                      | 120 PB, 1 TB/s, GPFS       | 32 PB, 1 TB/s, Lustre
Peak power consumption                           | 10 MW                      | 9 MW
Slide 31
Allocations on Titan
- DD and INCITE are open to international researchers; ALCC (ASCR Leadership Computing Challenge) is primarily intended for US DOE and industry
- INCITE: one call per year (usually due 30 June); INCITE proposals are subject to scientific peer review and computational readiness review
- DD: web-form application, accepted continuously; DD applications are reviewed weekly by the Resource Utilization Council
- Typically have 2 scientific referees
- For new users it is most important to say what development you plan to perform and whether you anticipate competing in INCITE in the future
- Awards are typically O(1M) Titan core-hours
Slide 32
OLCF-4
Slide 33
Where do we go from here?
- Provide the leadership computing capabilities needed for the DOE Office of Science mission from 2018 through 2022; capabilities for INCITE and ALCC science projects
- CORAL was formed by grouping the three labs that would be acquiring leadership computers in the same timeframe (2017). Benefits include:
  - Shared technical expertise
  - Decreased risk due to the broader experiences and broader range of expertise of the collaboration
  - Lower collective cost for developing and responding to the RFP
Slide 34
CORAL Collaboration (ORNL, ANL, LLNL) Leadership Computers
- RFP requests >100 PF, 2 GB/core main memory, local NVRAM, and science performance 4x-8x Titan or Sequoia
- Objective: procure 3 leadership computers to be sited at Argonne, Oak Ridge, and Lawrence Livermore in 2017. Two of the contracts have been awarded, with the Argonne contract in process.
- Approach:
  - Competitive process: one RFP (issued by LLNL) leading to 2 R&D contracts and 3 computer procurement contracts
  - For risk reduction and to meet a broad set of requirements, 2 architectural paths will be selected, and Oak Ridge and Argonne must choose different architectures
  - Once selected, a multi-year lab-awardee relationship to co-design the computers
  - Both R&D contracts are jointly managed by the 3 labs
  - Each lab manages and negotiates its own computer procurement contract, and may exercise options to meet its specific needs
  - Understanding that long procurement lead times may impact architectural characteristics and designs of procured computers
Current DOE leadership computers: Sequoia (LLNL) 2012-2017, Mira (ANL) 2012-2017, Titan (ORNL) 2012-2017
Slide 35
Two Architecture Paths for Today's and Future Leadership Systems
Hybrid multi-core (like Titan):
- CPU/GPU hybrid systems
- Likely to have multiple CPUs and GPUs per node
- Small number of very powerful nodes
- Expect data movement issues to be much easier than on previous systems: coherent shared memory within a node
- Multiple levels of memory: on package, DDR, and non-volatile
Many core (like Sequoia/Mira):
- 10s of thousands of nodes with millions of cores
- Homogeneous cores
- Multiple levels of memory: on package, DDR, and non-volatile
- Unlike prior generations, future products are likely to be self-hosted
Power concerns for large supercomputers are driving the largest systems to either hybrid or many-core architectures.
Slide 36
Summit Key Software Components
System:
- Linux
- IBM Elastic Storage (GPFS)
- IBM Platform Computing (LSF)
- IBM Platform Cluster Manager (xCAT)
Programming environment:
- Compilers supporting OpenMP and OpenACC: IBM XL, PGI, LLVM, GNU, NVIDIA
- Libraries: IBM Engineering and Scientific Subroutine Library (ESSL); FFTW, ScaLAPACK, PETSc, Trilinos, BLAS-1,-2,-3, NVBLAS; cuFFT, cuSPARSE, cuRAND, NPP, Thrust
- Debugging: Allinea DDT, IBM Parallel Environment Runtime Edition (pdb), cuda-gdb, cuda-memcheck, valgrind, memcheck, helgrind, stacktrace
- Profiling: IBM Parallel Environment Developer Edition (HPC Toolkit), VAMPIR, TAU, Open|SpeedShop, nvprof, gprof, Rice HPCToolkit
Slide 38
CAM-SE: Community Atmosphere Model, Spectral Elements (CUDA Fortran)
Code description:
- Employs an equal-angle, cubed-sphere grid and terrain-following coordinate system
- Scaled to 172,800 cores on the XT5
- Exactly conserves dry mass without the need for ad hoc fixes
- Original baseline code achieves parallelism through domain decomposition using one MPI task per element
Porting strategy:
- Use hybrid MPI/OpenMP parallelism
- Intensive kernels are coded in CUDA Fortran
- Migration to OpenACC in progress
Early performance results on XK7:
- Refactored code was 1.7x faster on Cray XT5
- XK7 outperforms XE6 by 1.5x
Science target (20 PF Titan):
- CAM simulation using MOZART tropospheric chemistry with 106 constituents at 14 km horizontal grid resolution
- Using the realistic MOZART chemical tracer network, tracer transport (i.e., advection) dominates the run time
Image: cubed-sphere grid of the CAM spectral element model; each cube panel is divided into elements (http://www-personal.umich.edu/~paullric/A_CubedSphere.png)
Slide 39
DENOVO: 3D Neutron Transport for Nuclear Reactor Design (CUDA)
Code description:
- Linear Boltzmann radiation transport
- Discrete ordinates method
- Iterative eigenvalue solution
- Multigrid, preconditioned linear solves
- C++ with F95 kernels
Porting strategy:
- SWEEP kernel rewritten in C++ & CUDA; runs on CPU or GPU
- Scaling to over 200K cores, with opportunities for increased parallelism on GPUs
- Reintegrate SWEEP into DENOVO
Early performance results on XK7:
- Refactored code was 2x faster on Cray XT5
- XK7 performance exceeds XE6 by 3.8x
Science target (20 PF Titan):
- DENOVO is a component of the DOE CASL Hub, necessary to achieve CASL challenge problems
- 3-D, full-reactor radiation transport for CASL challenge problems within 12 wall-clock hours. This is the CASL stretch-goal problem!
Slide 40
S3D: Direct Numerical Simulation of Turbulent Combustion (OpenACC)
Code description:
- Compressible Navier-Stokes equations
- 3D Cartesian grid, 8th-order finite difference
- Explicit 4th-order Runge-Kutta integration
- Fortran; 3D domain decomposition, non-blocking MPI
Porting strategy:
- Hybrid MPI/OpenMP/OpenACC application
- All intensive calculations can be on the accelerator
- Redesigned message passing to overlap communication and computation (an overlap sketch follows after this list)
Early performance results on XK7:
- Refactored code was 2x faster on Cray XT5
- OpenACC acceleration with minimal overhead
- XK7 outperforms XE6 by 2.2x
Science target (20 PF Titan):
- Increased chemical complexity for combustion: Jaguar: 9-22 species (H2, syngas, ethylene); Titan: 30-100 species (ethanol, n-heptane, iso-octane)
- DNS provides unique fundamental insight into the chemistry-turbulence interaction
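The overlap of communication and computation that S3D's redesign exploited can be sketched as below; this is a hedged C++/MPI/OpenACC illustration only. S3D itself is Fortran and uses a 3D decomposition, so the 1D ring, single ghost cell, stencil, and function name here are simplifying assumptions.

```cpp
// Minimal sketch of overlapping halo exchange (non-blocking MPI) with interior
// work (OpenACC). A 1D ring decomposition with one ghost cell per side; n >= 5.
#include <mpi.h>
#include <vector>

void step(std::vector<double>& u, int rank, int nranks) {
    const int n = static_cast<int>(u.size());        // includes 2 ghost cells
    double* p = u.data();
    const int left  = (rank + nranks - 1) % nranks;
    const int right = (rank + 1) % nranks;
    MPI_Request reqs[4];

    // Start the halo exchange, then compute the interior points that do not
    // depend on the ghost cells while the messages are in flight.
    MPI_Irecv(&p[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&p[n - 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&p[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&p[n - 2], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

    std::vector<double> unew(u);
    double* q = unew.data();

    // Interior update on the accelerator; reads only p[1..n-2], which the
    // in-flight messages never touch.
    #pragma acc parallel loop copyin(p[1:n-2]) copyout(q[2:n-4])
    for (int i = 2; i < n - 2; ++i)
        q[i] = 0.25 * (p[i - 1] + 2.0 * p[i] + p[i + 1]);

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);        // halos have now arrived

    // Boundary-adjacent points that needed the ghost cells.
    q[1]     = 0.25 * (p[0] + 2.0 * p[1] + p[2]);
    q[n - 2] = 0.25 * (p[n - 3] + 2.0 * p[n - 2] + p[n - 1]);
    u.swap(unew);
}
```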
Slide 41
Wang-Landau LSMS: First-Principles Statistical Mechanics of Magnetic Materials (Libraries)
Code description:
- Combines classical statistical mechanics (Wang-Landau) for atomic magnetic moment distributions with first-principles calculations (LSMS) of the associated energies
- Main computational effort is dense linear algebra for complex numbers
- F77 with some F90, and C++ for the statistical mechanics driver
Porting strategy:
- Leverage accelerated linear algebra libraries, e.g., cuBLAS + CULA, LibSci_acc
- Parallelization over (1) Wang-Landau Monte Carlo walkers, (2) atoms through MPI processes, (3) OpenMP on CPU sections (a communicator-splitting sketch follows after this list)
- Restructured communications: moved outside the energy loop
Early performance results on XK7:
- XK7 outperforms XE6 by 3.8x
Science target (20 PF Titan):
- Compute the magnetic structure and thermodynamics of low-dimensional magnetic structures
- Calculate both the magnetization and the free energy for magnetic materials
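The walker/atom hierarchy above maps naturally onto nested MPI communicators. Below is a minimal, hedged C++ sketch of that split; the group size, names, and placeholder "energy" are hypothetical and not taken from WL-LSMS itself.

```cpp
// Minimal sketch of two-level MPI parallelism: split the world communicator into
// groups, one per Monte Carlo walker; the ranks inside a group share that
// walker's atoms for the energy calculation. Illustrative only.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int world_rank = 0, world_size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    const int ranks_per_walker = 4;                    // e.g., 4 ranks share one walker's atoms
    const int walker_id = world_rank / ranks_per_walker;

    // Local communicator: all ranks working on the same walker.
    MPI_Comm walker_comm;
    MPI_Comm_split(MPI_COMM_WORLD, walker_id, world_rank, &walker_comm);

    int atom_rank = 0;
    MPI_Comm_rank(walker_comm, &atom_rank);

    // Each group evaluates one walker's energy (atoms distributed over the
    // group's ranks); a reduction inside the group assembles that energy.
    double my_partial_energy = 1.0 * (atom_rank + 1);  // placeholder work
    double walker_energy = 0.0;
    MPI_Reduce(&my_partial_energy, &walker_energy, 1, MPI_DOUBLE, MPI_SUM, 0, walker_comm);

    if (atom_rank == 0)
        std::printf("walker %d energy (placeholder): %f\n", walker_id, walker_energy);

    MPI_Comm_free(&walker_comm);
    MPI_Finalize();
    return 0;
}
```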
Slide 42
CAAR algorithmic coverage
[Table: each application (S3D, CAM, LSMS, LAMMPS, Denovo, PFLOTRAN (AMR)) marked against the algorithmic motifs it covers: FFT, dense linear algebra, sparse linear algebra, particles, Monte Carlo, structured grids, unstructured grids]
- Selected applications represented the bulk of use for 6 INCITE allocations totaling 212M CPU-hours (2009)
- Represented 35% of 2009 INCITE allocations and 23% of 2010 INCITE allocations (in CPU-hours)
Slide 43
Extra Slides
Slide 44
Today's Leadership System: Titan
- Hybrid CPU/GPU architecture, hierarchical parallelism
- Vendors: Cray / NVIDIA
- 27 PF peak
- 18,688 compute nodes, each with 1.45 TF peak:
  - NVIDIA Kepler GPU: 1,311 GF, 6 GB GDDR5 memory
  - AMD Opteron: 141 GF, 32 GB DDR3 memory
  - PCIe2 link between GPU and CPU
- Cray Gemini 3-D torus interconnect
- 32 PB / 1 TB/s Lustre file system
Slide 45
Transition to Operations (T2O)
- T2O: the period between acceptance and full user operations; much of the first quarter of 2013 was in T2O
- Purpose: harden the machine using mature codes running big jobs
- OLCF-ported apps were the initial codes to run in T2O; other very ready codes also ran in T2O
- Ramped up INCITE & ALCC over a few-month period
- Even with an extended T2O period, OLCF delivered INCITE and ALCC commitments
[Chart axis: Weeks (2013)]
Slide 46
Scientific Progress at All Scales
- Fusion energy: A Princeton Plasma Physics Laboratory team led by C.S. Chang was able to simulate and explain nonlinear coherent turbulence structures at the plasma edge in a fusion reactor by exploiting the increased performance of the Titan architecture.
- Turbomachinery: Ramgen Power Systems is using the Titan system to significantly reduce time to solution in the optimization of novel designs based on aerospace shock wave compression technology for gas compression systems, such as carbon dioxide compressors.
- Nuclear reactors: Consortium for Advanced Simulation of Light Water Reactors (CASL) investigators successfully performed full-core physics power-up simulations of the Westinghouse AP1000 pressurized water reactor core using their Virtual Environment for Reactor Applications code, which was designed on the Titan architecture.
- Liquid crystal film stability: ORNL postdoctoral fellow Trung Nguyen ran unprecedented large-scale molecular dynamics simulations on Titan to model the beginnings of ruptures in thin films wetting a solid substrate.
- Earthquake hazard assessment: To better prepare California for large seismic events, SCEC joint researchers have simulated earthquakes at high frequencies, allowing structural engineers to conduct realistic physics-based probabilistic seismic hazard analyses.
- Climate science: Simulations on OLCF resources recreated the evolution of the climate state during the first half of the last deglaciation period, allowing scientists to explain the mechanisms that triggered the last period of Southern Hemisphere warming and deglaciation.