
D4.1 “Preliminary report of progress about the porting of the full-scale scientific applications”

Version 2.0

Document Information

Contract Number: 288777
Project Website: www.montblanc-project.eu
Contractual Deadline: PM12 (30 Sept 2012)
Dissemination Level: PU
Nature: R
Author: Stéphane Requena (GENCI)
Contributors: B. Videau (IMAG), A. Delorme (IMAG), M. Culpo (CINECA), R. Halver (JSC), S. Mohanty (JSC), D. Broemmel (JSC), J. Meincke (JSC), V. Moureau (CORIA/CNRS), M. Allalen (LRZ), X. Saez (BSC)
Reviewer: J. Costa (BSC)
Keywords: Exascale, scientific applications, porting, profiling, optimisation

Notices: The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 288777.

© 2011 Mont-Blanc Consortium Partners. All rights reserved.


Change Log

Version | Description of Change
v1.0    | Initial version released
v1.1    | Final draft for review
v2.0    | Final version


Table of Contents

Executive Summary
1 Introduction
  1.1 Organisation of WP4
2 Hardware and software platforms used by WP4
  2.1 Snowball boards
  2.2 The Tibidabo cluster
3 Report on the different applications
  3.1 BIGDFT
    3.1.1 Description of the code
    3.1.2 Report of the progress on the porting of the code
  3.2 BQCD
    3.2.1 Description of the code
    3.2.2 Power consumption of BQCD on CooLMUC
    3.2.3 Report of the progress on the porting of the code
  3.3 COSMO
    3.3.1 Description of the code
    3.3.2 Report on the progress of the porting of the code
  3.4 EUTERPE
    3.4.1 Description of the code
    3.4.2 Report on the progress of the porting of the code
  3.5 MP2C
    3.5.1 Description of the code
    3.5.2 Report on the progress of the porting of the code
  3.6 PEPC
    3.6.1 Description of the code
    3.6.2 Report on the progress of the porting of the code
  3.7 ProFASI
    3.7.1 Description of the code
    3.7.2 Report on the progress of the porting of the code
  3.8 QuantumEspresso
    3.8.1 Description of the code
    3.8.2 Report on the progress of the porting of the code
  3.9 SMMP
    3.9.1 Description of the code
    3.9.2 Report on the progress of the porting of the code
  3.10 SPECFEM3D
    3.10.1 Description of the code
    3.10.2 Report on the progress of the porting of the code
  3.11 YALES2
    3.11.1 Description of the code
    3.11.2 Report on the progress of the porting of the code
4 Interactions with other Mont Blanc work packages
5 Perspectives
6 Conclusion
List of figures
List of tables
Acronyms and Abbreviations


Executive Summary

The Mont Blanc project aims to assess the potential of clusters built from low-power embedded components to address future Exascale HPC needs. The role of work package 4 (WP4, "Exascale applications") is to port, co-design and optimise up to 11 real exascale-class scientific applications on the different generations of platforms made available by the project, in order to assess the global programmability and the performance of such systems. The first section introduces the different applications and their characteristics, the second section describes the platforms used by WP4 during the first year, the third section reports on the progress of the porting and profiling of each of the 11 applications during the first year, and the last sections cover the interactions with the other Mont Blanc work packages, the perspectives for WP4 activities and the conclusions.


1 Introduction

The Mont Blanc project aims to assess the potential of clusters based on low-power embedded components to address future Exascale HPC needs. Complementing the activities of work package 3 (WP3, "Optimized application kernels"), part of the activities of Mont Blanc is to assess, on the different generations of platforms made available by the project, the behaviour of up to 11 real exascale-class scientific applications. The objective of work package 4 (WP4, "Exascale applications") is to evaluate the global programmability and the performance (in terms of time and energy to solution) of the architecture and to assess the efficiency of the hybrid OmpSs/MPI programming model. These eleven real scientific applications, used by academia and industry and running daily in production on existing European (PRACE Tier-0) or national HPC facilities, have been selected by the different partners in order to cover a wide range of scientific domains (geophysics, fusion, materials, particle physics, life sciences, combustion, weather forecasting) as well as hardware and software needs. Some of these applications are also part of the PRACE benchmark suite (flagged with the PRACE banner in the following figure):

Figure 1 - List of the 11 WP4 scientific applications

1.1 Organisation of WP4

The work performed during the lifetime of the Mont Blanc project by WP4 is divided into the following three tasks:


• Task 4.1: Port a list of 11 representative full-scale European scientific applications to the low-power prototypes (m1:m24)

• Task 4.2: Select a subset of applications and perform specific optimisations and performance/energy evaluations (m18:m36)

• Task 4.3 (m24:m36):
  o Assess the available programming model and evaluate the global programmability of the system
  o Write best practices for an efficient usage of the system

The overall progress of WP4 was reported by each code owner of the 11 applications through regular WP4 conference calls, using the tools provided by WP2. Since the activity of WP4 is strongly linked with that of WP3 for the extraction of kernels from real applications and their porting/optimisation using OmpSs, joint teleconferences between WP3 and WP4 have also been organised during the first year, especially during the second half of the year for issuing the D3.1 document (assessment and selection of the kernels). WP3 and WP4 also decided to set up a joint forge for sharing kernels, codes and results; this tool has been proposed and is hosted by CINECA. WP4 is also collaborating with WP5 regarding the porting and usage of components of the software stack (compilers, MPI and I/O libraries, numerical libraries, runtime systems, …) provided by WP5. Finally, WP4 is using the different generations of hardware and software platforms provided by WP7. During the first period WP4 used Tibidabo, the first ARM-powered cluster of the project, for performing the initial porting of the applications, and provided feedback on performance and scalability to WP7.

One of the first activities of WP4 was to classify the 11 applications by scientific domain, programming language, parallel programming paradigm, I/O requirements, external library dependencies, portability and scalability. Before porting these applications, each code owner worked on gathering pertinent datasets and on running comparative simulations on x86 or IBM BG/P systems, in order to be able to compare with the results obtained on the different generations of low-power prototypes. The outcome for each of the 11 applications was the following:

| Code | Language | Prog. model | Dependencies and I/O needs | Scalability |
|------|----------|-------------|----------------------------|-------------|
| YALES2 | F90 | MPI | HDF5, FFTW3, BLAS, PARMETIS, LAPACK; //I/O (HDF5) | >32k cores |
| EUTERPE | F90, C | MPI, OpenMP | BLAS, FFT, PETSc | >60k cores (BG/P) |
| SPECFEM3D | F90 | MPI, CUDA, StarSs | None; may have large I/O | >150k cores (Cray XE), 1152 GPUs |
| MP2C | F90 | MPI | None | >65k cores (BG/P) |
| BigDFT | F90, C | MPI, OpenMP, CUDA, OpenCL | BLAS | >2000 cores, >300 GPUs |
| Quantum Espresso | F90 | MPI, OpenMP, CUDA | ScaLAPACK, BLAS | Good |
| PEPC | F90, C | MPI + pthreads, SMPSs | None; //I/O (SIONlib) | >300k cores (BG/P) |
| SMMP | F77/F90, C, Python | MPI, OpenCL | None | 256 cores (energy), 16k cores (// tempering) |
| ProFASI | C++ | MPI | None | Good |
| COSMO | F90 | MPI, OpenMP | BLAS, LAPACK, Grib/NetCDF | - |
| BQCD | F90, C | MPI, OpenMP | ILDG, LIME, BLAS, ScaLAPACK; //I/O (LIME) | >294k cores (BG/P) |

Finally, during the phase of porting and profiling of the 11 applications, each code owner worked with WP3 to identify, for each application, a set of representative kernels which were characterised and will be ported and optimised by WP3.

2 Hardware and software platforms used by WP4

During the first period the partners involved in WP4 activities used multiple platforms to perform their work, from x86 and BG/P systems to low-power systems: individual boards and Tibidabo, the first cluster made available by the project.

2.1 Snowball boards

The Snowball board is a fully embedded computer with state-of-the-art features and a powerful CPU developed by ST-Ericsson, the A9500. This SoC (System on Chip) is a dual-core 1 GHz ARM Cortex-A9 with an integrated Mali 400 GPU and a NEON floating-point unit.

Figure 2 - Picture of a Snowball board


The main hardware features of the boards we are using are:

• ARM dual-core Cortex-A9 @ 1 GHz
• 8 GB e-MMC
• 1 GB LP-DDR2
• 1x Micro-SD
• 1x HDMI Full HD
• 1x Ethernet 10/100 Mbit/s
• 1x USB OTG HS (480 Mbit/s)

The board can be powered by a battery, a mains adapter or USB alone, so its power consumption is below 2.5 W (the maximum available from USB). The board can host a full Linux system with a graphical interface, controlled via a USB mouse and keyboard. It can also be controlled via its Ethernet interface, once the network is set up. It also provides a serial-to-USB port, allowing low-level control via a USB cable and a serial console tool such as minicom or picocom.

The Snowball board can run multiple Linux-based systems, such as Android or MeeGo, but our focus will be on the Ubuntu-based distribution Linaro. Linaro is a non-profit organisation aiming to develop a better Linux experience for ARM platforms. With the support of ARM manufacturers, they maintain an Ubuntu and an Android version for several development platforms (such as the Snowball or the Pandaboard). To gather developers around the board and provide better user support, a community has been built around it: the Igloo Community1. This community, driven by ST-Ericsson and Movial (a Finnish company that provides commercial support for the Snowball), works together with Linaro in order to provide up-to-date systems and packages for the Snowball board. An IRC channel, a forum and several mailing lists allow the community to stay active and keep growing. Right now the board supports all recent versions of Ubuntu, including the future Precise Pangolin version. For Android, an early release of the 4th version, Ice Cream Sandwich, is already available, although not yet fully operational. Images for deployment on the board are hosted on the Igloo repository. They can be deployed either to the internal memory of the board or to an external SD card, via the Linaro tools (package linaro-image-tools in Ubuntu/Debian) or the riff tool for direct writes to the internal flash memory.

Some early tests were conducted in order to compare the performance of Snowball boards against regular x86 systems based on Intel Xeon X550 processors, and also against low-power systems such as the iPad 2 and Tegra 2 based devices. The High Performance LINPACK benchmark is the reference benchmark used in the HPC world to compare the performance of supercomputers. It is based on a linear algebra workload that focuses on the floating-point performance of the system. It is not the best benchmark for real-world applications, but it is used to compare the raw compute power of systems. It is also used on mobile platforms to compare mobile CPUs, but these versions of LINPACK, embedded in iPhone or Android applications, are executed through a heavy runtime engine and cannot really be compared to the Linux-optimised ones; therefore the results for the iPad 2 or the Tegra 2 are given as-is and should not be taken into consideration. We note that the theoretical value given for the CPU is 2 GFlops, but this assumes that the NEON unit is fully used (hard float). Right now we only have a slower soft-float version of the system available, and we cannot validate these theoretical results.

Taking into account the 2.5 W maximum value for the Snowball, and the maximum TDP of 95W for a recent Intel Xeon processor, we obtain the overall flops/Watt table below.

In this benchmark we observe a slightly better Flops/Watt performance for the Snowball board than for the Xeon system. We will investigate the lack of scaling of the benchmark on two boards once the CPU frequency problem is solved. At the time of the submission of the Mont-Blanc project, the Tegra 2 SoC, with a dual-core ARM Cortex-A9 at 1 GHz like the A9500, was rated at 2 GFlops. This number is still theoretical: we currently reach only about 40% of this value, and will try again with the hard-float version of the ABI when it becomes available.
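As a rough back-of-the-envelope check using only the figures quoted above: 40% of the 2 GFlops theoretical peak is about 0.8 GFlops, which at the 2.5 W USB power budget corresponds to roughly 0.32 GFlops/W for the current soft-float configuration, against the 0.8 GFlops/W that the full theoretical peak would imply.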

2.2 The Tibidabo cluster

Tibidabo is the first real low-power platform provided by the Mont Blanc project (WP7) to all the partners of the project. Tibidabo is a one-rack cluster made of 32 multi-board modules, each module composed of 8 independent boards, and each board containing one Tegra 2 chipset with one dual-core ARM Cortex-A9 socket, 1 GB of memory and one GbE link.


Figure 3 - Architecture of the Tibidabo cluster

The characteristics of the cluster, from the SoC level up to the full rack (as summarised in Figure 3), are the following:

• Tegra 2 SoC: 2x ARM Cortex-A9 cores, 2 GFLOPS, 0.5 W
• Tegra 2 Q7 module: 1x Tegra 2 SoC (2x ARM Cortex-A9 cores), 1 GB DDR2 DRAM, 2 GFLOPS, ~4 W, 1 GbE interconnect
• 1U multi-board container: 8x Q7 carrier boards (8x Tegra 2 SoCs, 16x ARM Cortex-A9 cores), 8 GB DDR2 DRAM, 16 GFLOPS, ~35 W
• Rack: 32x board containers, 10x 48-port 1 GbE switches, 256x Q7 carrier boards (256x Tegra 2 SoCs, 512x ARM Cortex-A9 cores), 256 GB DDR2 DRAM, 512 GFLOPS, ~1.7 kW, 300 MFLOPS/W

Tibidabo is targeted at the planned porting of the kernels, the applications and the software stack on a large-scale set of low-power nodes interconnected by a regular Ethernet switch. Even if WP3 and WP4 made some performance/power measurements on this cluster with kernels and codes, the purpose of Tibidabo was not to provide a performance-oriented platform. Such features will come with the next generation of Mont Blanc platforms, such as PedraForca, in which each compute node will be boosted by a mobile GPU and will have a much higher-performance interconnect.

3 Report on the different applications

This section reports, in a code-by-code description, the progress made during the first period on all 11 applications, in terms of initial porting (Task 4.1) and profiling and optimisation (Task 4.2).

3.1 BIGDFT

3.1.1 Description of the code

BigDFT2 is an ab-initio simulation software package based on the Daubechies wavelet family. The software computes the electronic orbital occupations and energies. Several execution modes are available, depending on the problem under investigation. Cases can be periodic in various dimensions and use k-points to increase the accuracy along the periodic dimensions; test cases can also be isolated.

Figure 4 - Nitrogen (N2) electronic orbitals

The BigDFT project was initiated during a European project (FP6-NEST) from 2005 to 2008. Four institutions were involved in the BigDFT project at that time:

• Commissariat à l'Énergie Atomique (T. Deutsch),


• University of Basel (S. Goedecker),
• Université catholique de Louvain (X. Gonze),
• Christian-Albrechts-Universität zu Kiel (R. Schneider).

Since 2010, four laboratories have been contributing: L_Sim (CEA), UNIBAS, LIG and ESRF. BigDFT is mainly used by academics.

BigDFT is an Open Source project and the code is available at: http://inac.cea.fr/L_Sim/BigDFT/.

It is written mainly in Fortran (121k lines) with parts in C/C++ (20k lines), and it is parallelised using MPI, OpenMP and OpenCL. It also uses the BLAS and LAPACK libraries.

BigDFT scalability is good: runs using 2000 cores of a Blue Gene/P have been conducted. Hybrid runs based on an MPI+CUDA version using 288 GPUs have also been performed.

3.1.2 Report of the progress on the porting of the code

Porting BigDFT to ARM using the GNU toolchain was fairly straightforward. As BigDFT possesses a large non-regression test library, asserting its proper behaviour was simple. Porting BigDFT to ARM using the BSC toolchain is an ongoing task. In order to have comparison opportunities and to conduct some benchmarks before the availability of the Mont-Blanc prototypes, we also used a Snowball Cortex-A9 ARM platform for testing. Results obtained on the Snowball boards are similar to those of Tibidabo, with about 5% improvement in favour of Tibidabo. The test used in this case is a non-regression test of BigDFT, which computes the electronic density surrounding an isolated carbon atom. We used two configurations for both test systems, using one and two cores. On the ARM platform the options used are: -O2 -mfpu=neon -funsafe-math-optimizations.

Table 1 - Initial BigDFT timings on ARM vs x86

The benchmarking of BigDFT revealed a communication issue with the network infrastructure of the prototype. Without network problems, the Tibidabo prototype was found to be 10 times slower than a Xeon processor but 4 times more power efficient, using a worst-case scenario. A network issue due to congestion observed at the level of the Ethernet switches causes poor scaling of the application, as shown in the following figure (speedup on the Y-axis against the number of cores on the X-axis):



Figure 5 - BigDFT scaling on the Tibidabo prototype

We investigated this problem and it was found to be due to collective communication problems in which some nodes participating in the communication lost packets. Those packets had to be resent, adding an additional delay to the communication. Sometimes even those resent packets would be lost and the communication suffered yet another delay. The delay is 10 times longer than the communication it hindered, and the problem is more likely to occur when the number of nodes is high. This explains the poor scaling in the previous figure. In order to investigate those problems, we instrumented the code using Extrae and PAPI. The results we obtained are shown in the next figure:

Figure 6 - Traces of BigDFT on Tibidabo using 36 cores (18 boards) showing the delayed communications problem


3.2 BQCD

3.2.1 Description of the code

BQCD3 is used in benchmarks for supercomputer procurement at LRZ, as well as in the DEISA and PRACE projects4, and it is a code basis in the QPACE project5. The benchmark code is well written in Fortran 90. BQCD is a program that simulates QCD with the Hybrid Monte-Carlo algorithm. QCD is the theory of strongly interacting elementary particles; it describes particle properties such as masses and decay constants from first principles. The starting point of QCD is an infinite-dimensional integral. In order to study the theory on a computer, the space-time continuum is replaced by a four-dimensional regular finite lattice with (anti-)periodic boundary conditions. After this discretisation, the integral is finite-dimensional but still rather high-dimensional. This high-dimensional integral is solved by Monte-Carlo methods.

Hybrid Monte-Carlo programs have a compute-intensive kernel, which is an iterative solver of a large system of linear equations. In BQCD we use a standard conjugate gradient (CG) solver. Depending on the physical parameters, 80% or even more than 95% of the execution time is spent in this solver. The dominant operation in the solver is the matrix-vector multiplication. In the context of QCD the matrix involved is called the hopping matrix; it is large and sparse, and the entries in a row are the eight nearest neighbours of one site of the four-dimensional lattice. QCD programs are parallelised by domain decomposition. The nearest-neighbour structure of the hopping matrix implies that the boundary values (surfaces) of the input vector have to be exchanged between neighbouring processes at every iteration of the solver.

We have compared the hybrid performance of BQCD against the pure MPI performance on a Cray XT5 using approximately 65k cores and on an SGI Altix 4700 using up to 8k cores6. The XT5 refers to the petaflops machine (Jaguar) located at Oak Ridge National Laboratory (US), and the SGI Altix 4700 to the machine at the LRZ supercomputing centre. On those machines, we observed that the pure MPI version of the code is faster up to 4096 cores, but beyond that the hybrid OpenMP/MPI version is faster. The hybrid implementations showed different sensitivities to network speed, depending on the parallelisation strategy employed and the platform used. Running BQCD on the full BlueGene/P at JSC (Figure 7) was very helpful for benchmarking activities7. With regard to benchmarking future supercomputers, it is important to know from experience that there are no practical limitations in scaling the program to extreme numbers of cores.
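To illustrate the structure of the solver described above, here is a minimal, generic conjugate-gradient sketch in Fortran. It is not BQCD's implementation: the placeholder operator apply_op stands in for the hopping-matrix multiplication, which in the real parallel code is also where the boundary (halo) values are exchanged between neighbouring processes, and the dot products become global MPI reductions.

```fortran
! Minimal, generic conjugate-gradient sketch (illustrative only; not BQCD's
! implementation).  apply_op stands in for the hopping-matrix multiplication;
! in the parallel code this is where boundary values are exchanged between
! neighbouring MPI processes.
program cg_sketch
  implicit none
  integer, parameter :: npts = 100
  real(8) :: rhs(npts), sol(npts)
  rhs = 1.0d0
  sol = 0.0d0
  call cg_solve(npts, rhs, sol, 1.0d-10, 500)
  print *, 'solution sample:', sol(1), sol(npts/2)
contains
  subroutine apply_op(n, x, y)              ! y = A*x, here a simple SPD stencil
    integer, intent(in)  :: n
    real(8), intent(in)  :: x(n)
    real(8), intent(out) :: y(n)
    integer :: i
    do i = 1, n
      y(i) = 4.0d0*x(i)
      if (i > 1) y(i) = y(i) - x(i-1)
      if (i < n) y(i) = y(i) - x(i+1)
    end do
  end subroutine apply_op

  subroutine cg_solve(n, b, x, tol, maxit)  ! solve A*x = b, A symmetric positive definite
    integer, intent(in)    :: n, maxit
    real(8), intent(in)    :: b(n), tol
    real(8), intent(inout) :: x(n)
    real(8) :: r(n), p(n), q(n), alpha, beta, rho, rho_old
    integer :: it
    call apply_op(n, x, q)
    r = b - q
    p = r
    rho = dot_product(r, r)
    do it = 1, maxit
      call apply_op(n, p, q)                ! dominant cost: matrix-vector product
      alpha = rho / dot_product(p, q)       ! dot products become MPI reductions
      x = x + alpha*p
      r = r - alpha*q
      rho_old = rho
      rho = dot_product(r, r)
      if (sqrt(rho) < tol) exit
      beta = rho / rho_old
      p = r + beta*p
    end do
  end subroutine cg_solve
end program cg_sketch
```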


Figure 7 - Strong scaling of BQCD on a BG/P system for 64³×128 and 96³×192 lattices

It was nice to observe that the approach to I/O, which was implemented 10 years ago using MPI-1, worked reasonably well even on the full BlueGene/P machine. We measured I/O rates between 3 and 5 GB/s, which includes reordering and gathering data as well as calculating checksums on the fly. If the communication overhead becomes bigger, one might profit from a hybrid parallelisation; BQCD has this capability. However, we found that hybrid runs on BlueGene/P were always a few percent slower than pure MPI runs.

3.2.2 Power consumption of BQCD on CooLMUC

The CooLMUC cluster is the first direct liquid-cooled cluster based on the AMD architecture; it is mainly used for assessing different cooling technologies from the chip to the datacentre. CooLMUC is equipped with fine-grained profiling of the power consumption of applications. On CooLMUC, we primarily wanted to analyse the effects of frequency scaling, so we used a constant number of 256 cores and a lattice size of 56x56x56x56 for maximum memory consumption. The cores were utilised with various hybridisations, ranging from one OpenMP thread per MPI task with 256 MPI tasks (B_A01) to 16 OpenMP threads per MPI task with 16 MPI tasks (B_A16).


Due to the NUMA architecture of CooLMUC, 16 OpenMP threads are not very efficient; also, frequency scaling usually has little effect between 1.4 and 2.0 GHz. Interestingly, there is a clear minimum in duration and energy-to-solution with two OpenMP threads at 2.0 GHz. Between one and eight OpenMP threads there is nearly no difference: the B_A08 jobs only take a few seconds longer than the B_A01 jobs.

Figure 8 - Impact of clock frequency on BQCD timings

3.2.3 Report of the progress on the porting of the code

Porting the application to the ARM system was mostly straightforward, using the system-specific files created as templates for Tibidabo. However, there were difficulties with running the code with the standard lattice size due to memory problems; all our runs are limited to a lattice size of 8x8x8x16. The strong scaling results are plotted in the next figure; we see that the scaling is almost linear up to 128 cores.


Figure 9 - Strong scaling of the CG solver for the 8x8x8x16 lattice

The hybrid version shows a strange behaviour between 30 and 60 cores. The reason for this is currently not clear and needs to be investigated in more detail.

3.3 COSMO

3.3.1 Description of the code

The COSMO application is focused on the field of climatology and weather prediction. The principal objective of COSMO (COnsortium for Small-scale MOdeling) is the creation of a meso-to-micro-scale prediction and simulation system. This system is intended to be used as a flexible tool for specific tasks of weather services as well as for various scientific applications on a broad range of spatial scales. COSMO is developed by a consortium and is available under a licence at http://www.cosmo-model.org/content/default.htm. The package is used in both academia and industry. COSMO is a Fortran 90 code, parallelised using MPI, and relies on the NetCDF library for all I/O operations. COSMO has been ported to standard Linux clusters (PPC, x86 and ARM cores); some efforts have been undertaken outside of this project to port it to GPUs. Scalability curves on the PLX hybrid cluster at CINECA (http://www.hpc.cineca.it/content/ibm-plx-gpu-user-guide-0) are reported in D3.1 of the Mont Blanc project.


3.3.2 Report on the progress of the porting of the code

The code has been ported to Tibidabo using the GNU toolchain; no particular issues were encountered in the process. NetCDF 4.1.3 was compiled and installed on the cluster to satisfy the library dependency of COSMO. This required us to install:

o HDF5 1.8.8
o zlib 1.2.6
o szip 2.1

in order to fulfil the chain of dependencies of NetCDF. The libraries and COSMO were both compiled with GCC 4.4.5 and linked against OpenMPI 1.6.0.

To correctly profile the code, a major effort has been undertaken to port TAU to the ARM platform. During the first period contacts were established with the TAU developers, and a close collaboration with them led to a complete port of the toolkit to ARM platforms starting from TAU 2.21.3. The test suite coming with the source code of COSMO was executed successfully after the porting, as well as a small production run that will be used as a benchmark during the rest of the project. The first tests undertaken to assess the performance of COSMO on Tibidabo are somewhat contradictory, in the sense that while they point to a nice scaling of the computational part of the code, they also reveal some problems with the network of the cluster when using more than approximately 16 cores.

This is evident, for instance, from the two figures below, showing the mean exclusive time spent inside functions and its standard deviation. The huge increase in the time needed to perform point-to-point and collective MPI communications, not encountered on other platforms, suggests some malfunctioning or under-calibration of the network resources.

Figure 10 - Mean exclusive time spent inside functions and relative standard deviation for a 16-core run


Figure 10 - Mean exclusive time spent inside functions and relative standard deviation for a 64-core run

3.4 EUTERPE

3.4.1 Description of the code

EUTERPE is a code for the simulation of micro-turbulence in fusion plasmas, so it is focused on the plasma physics area. The EUTERPE code solves the gyro-averaged Vlasov equation for the distribution function of each kinetically treated species (ions, electrons and a third species). The code follows the particle-in-cell (PIC) scheme, in which the distribution function is discretised using markers.
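As a purely illustrative sketch of the PIC idea, the following shows a generic 1D charge deposition with linear (cloud-in-cell) weighting, in which each marker spreads its weight over the two surrounding grid points. This is not EUTERPE's gyro-averaged scheme; the grid size, marker count and weights are made-up example values.

```fortran
! Generic 1D particle-in-cell deposition sketch (illustrative only; not
! EUTERPE's gyrokinetic scheme).  Each marker spreads its weight onto the
! two nearest grid points with linear (cloud-in-cell) weighting.
program pic_deposit_sketch
  implicit none
  integer, parameter :: nmark = 1000, ngrid = 64
  real(8) :: xmark(nmark), wmark(nmark), rho(0:ngrid-1)
  real(8) :: dx, s
  integer :: ip, j
  dx = 1.0d0 / ngrid                     ! periodic domain [0,1)
  call random_number(xmark)              ! random marker positions
  wmark = 1.0d0 / nmark                  ! equal marker weights
  rho = 0.0d0
  do ip = 1, nmark
    s = xmark(ip) / dx
    j = int(s)                           ! left grid point of the cell
    s = s - j                            ! fractional offset inside the cell
    rho(j)               = rho(j)               + wmark(ip)*(1.0d0 - s)
    rho(mod(j+1, ngrid)) = rho(mod(j+1, ngrid)) + wmark(ip)*s
  end do
  print *, 'total deposited weight =', sum(rho)   ! should be ~1
end program pic_deposit_sketch
```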

 

EUTERPE was created at the Centre de Recherche en Physique des Plasmas (CRPP) in Lausanne as a global linear PIC code. Subsequently, it has been further developed at the Max-Planck-Institut für Plasmaphysik (IPP) and has been adapted to different computing platforms. The code is co-developed and exploited at several European institutes, including the Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas (CIEMAT) and the Barcelona Supercomputing Center (BSC) in Spain.

Figure 11 - Energy spectrum

Figure 12 - Electrostatic potential in the plane (r,z) for phi=0

EUTERPE is an academic code which can be provided as source code. To obtain it, one has to contact the authors, who retain the final decision on whether EUTERPE will actually be provided; basically, they want to know the purpose of the collaboration. EUTERPE is mainly written in Fortran 90 with a few C preprocessor directives. It has been parallelised using MPI, and the Barcelona Supercomputing Center (BSC) introduced OpenMP in version 2.61. The application uses the following free libraries: FFTW (for computing Fast Fourier Transforms) and PETSc (for solving sparse linear systems of equations). Moreover, the application includes a tool to generate the electrostatic equilibrium used as input. The I/O activity of EUTERPE can be summarised as: an initialisation phase, periodic updates of a histogram file throughout the execution, periodic storage of restart states, and finalisation with diagnostic information. Finally, EUTERPE has been ported to different platforms such as BlueGene, Xeon and PowerPC. In particular, it has shown excellent performance up to several thousand processors on the Huygens supercomputer (PowerPC) and it has also performed relatively well on up to 61,440 processors on the Jugene supercomputer (BlueGene).

 

3.4.2 Report on the progress of the porting of the code

The porting activity of EUTERPE started using the GNU toolchain and the Mercurium compiler. Currently, a correct compilation of the code has not yet been achieved, due to issues in the Mercurium compiler. To date, several tickets have been reported using the Mercurium bug/issue tracking system. Some of the problems found are:

• ghost errors due to an INTERFACE structure,
• wrong type conversion when computing an array initialiser,
• OpenMP directives not being recognised,
• an internal error when a substring is declared without a lower index,
• wrong identification of a parameter in the intrinsic function SUM,
• incorrect detection of the reduced variable in a reduction,
• an internal error when the _OPENMP macro appears.

Since the porting using the Mercurium compiler is still not successful, EUTERPE has been compiled with the GFORTRAN compiler on the Tibidabo computer in order to get the first performance results. To validate the binary, a small test case based on a 32x32x16 grid was developed. Nevertheless, the runs end abruptly with an error message pointing to a probable out-of-range memory access. The aim is now to find the source of this problem and to rule out the limited memory capacity of the nodes as the cause.


3.5 MP2C

3.5.1 Description of the code

MP2C (Massively Parallel Multi-Particle Collision) is a highly scalable parallel program which couples Multi-Particle Collision Dynamics (MPC) to Molecular Dynamics (MD). MPC is a particle-based method to model hydrodynamics on a mesoscopic scale. It is a local algorithm, which performs random collisions between particles on a cell level, requiring only the knowledge of particle properties from locally neighbouring particles. The collision algorithm is energy and momentum conserving and therefore reproduces hydrodynamic effects on larger spatial scales. Coupling this method to standard schemes makes it possible to take into account, e.g., hydrodynamic interactions between solvated molecules.

The algorithm consists basically of two steps: (i) a streaming step, where particles are propagated ballistically in space according to their current velocity and the applied time step; (ii) a collision or momentum transfer step, where collisions between multiple particles are performed. Different variants of MPC vary in the way the momentum exchange between particles within a collision cell is performed. Collision cells are organised in a grid which is (in order to ensure Galilean invariance) stochastically shifted over the system in every time step. The current version of MP2C implements an MPC variant called Stochastic Rotation Dynamics (SRD). In this method, the collision step is performed such that the velocity relative to the centre-of-mass velocity inside a locally defined collision cell is rotated by a given angle around a randomly chosen axis.

In addition to the hydrodynamics, MP2C is able to simulate short-range interactions based on Molecular Dynamics (MD). It is possible to simulate different kinds of potentials between particles and to include different kinds of molecular bonds. The MD part can be used as a stand-alone version or can be coupled to the MPC part. This coupling allows one to study the dynamics of, e.g., polymer chains including hydrodynamic interactions, or driven particle-laden flows.

The code is developed by the Simulation Laboratory Molecular Systems at the Jülich Supercomputing Centre (Forschungszentrum Jülich). As of now, no final licence policy has been agreed upon. Its current usage is academic and it is used within different national and international projects. MP2C is written in Fortran 90 and uses MPI for parallelisation. There are efforts to include other parallelisation models, such as OpenMP or CUDA, but these versions are still experimental and under development. It is possible to use the parallel SIONlib I/O library, which on the one hand greatly improves the I/O performance but on the other hand limits the use to a specific file format. The use of SIONlib is not mandatory but is strongly recommended for large amounts of data. Furthermore, it is possible to create restart files, in order to compute long simulations that need more than one program run. The amount of I/O can be regulated via input files that determine the interval at which program information is printed. Until now, no problems were detected while porting the code to different computer architectures. The code runs on all Jülich machines and supports the GNU (gfortran), Intel (ifort) and IBM (xlf) compilers.
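To illustrate the SRD collision step described above, the following is a minimal serial sketch for a single collision cell (illustrative only, with made-up particle data and an arbitrary, though commonly used, rotation angle; it is not MP2C's parallel implementation). The velocities relative to the cell's centre-of-mass velocity are rotated by a fixed angle around a randomly chosen axis.

```fortran
! Minimal sketch of one SRD collision step for a single collision cell
! (illustrative only; not MP2C's implementation).  Relative velocities are
! rotated by a fixed angle around a random axis (Rodrigues' formula), which
! conserves the momentum and kinetic energy of the cell.
program srd_cell_sketch
  implicit none
  integer, parameter :: np = 10                       ! particles in the cell
  real(8), parameter :: pi = 3.141592653589793d0
  real(8), parameter :: alpha = 130.0d0*pi/180.0d0    ! rotation angle (example value)
  real(8) :: v(3, np), vcm(3), axis(3), dv(3), rnd(2), ca, sa
  integer :: i
  call random_number(v)                               ! dummy initial velocities
  vcm = sum(v, dim=2) / np                            ! centre-of-mass velocity
  call random_number(rnd)                             ! random unit axis on the sphere
  axis(3) = 2.0d0*rnd(1) - 1.0d0
  axis(1) = sqrt(1.0d0 - axis(3)**2) * cos(2.0d0*pi*rnd(2))
  axis(2) = sqrt(1.0d0 - axis(3)**2) * sin(2.0d0*pi*rnd(2))
  ca = cos(alpha)
  sa = sin(alpha)
  do i = 1, np
    dv = v(:, i) - vcm                                ! velocity relative to the cell
    ! Rodrigues rotation of dv around 'axis' by angle alpha
    v(:, i) = vcm + dv*ca + cross(axis, dv)*sa + axis*dot_product(axis, dv)*(1.0d0 - ca)
  end do
  print *, 'momentum change (should be ~0):', sum(v, dim=2)/np - vcm
contains
  function cross(a, b) result(c)
    real(8), intent(in) :: a(3), b(3)
    real(8) :: c(3)
    c = (/ a(2)*b(3) - a(3)*b(2), a(3)*b(1) - a(1)*b(3), a(1)*b(2) - a(2)*b(1) /)
  end function cross
end program srd_cell_sketch
```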


 Figure 13 - System of 100 polymer chains consisting of 250 monomers each

Each sphere represents one monomer for which interactions and bonds are simulated with MD, while the whole system is embedded into a liquid under shear simulated with MPC. The same system is used for the scaling plots below.

3.5.2 Report on the progress of the porting of the code

Within the project, the code was ported to Tibidabo during the first period. This porting was straightforward and no big adjustments had to be made; a reason for this might be that the code runs fine with the GNU Fortran compiler, which was used for the porting to Tibidabo. The calculated results on Tibidabo were in accordance with results from other platforms. One fact that should be mentioned is that some particle configurations could not be tested with all numbers of processes due to memory restrictions on Tibidabo, since larger systems require a lot of memory.

Together with Xavier Martorell from BSC, the porting to the Mercurium compiler was done on JUDGE, and the code was supplied to help with the further development of the Mercurium compiler. In July 2012 this development was concluded, and the current version of the code (without support for the external library) works and produces correct results in comparison to the GNU version. These tests were done on the JUDGE platform in Juelich. The porting of the code to OmpSs will start once the OmpSs training for Fortran has taken place in October.

Performance profiling of the code was done with time measurements on Tibidabo, and the results were supplied to WP3. More intensive profiling (e.g. using Extrae) may follow in the near future. The results of the first performance analysis showed that the scaling behaviour of the code on the Tibidabo platform is as good as on the JUDGE x86 platform, but the absolute runtimes are longer by a factor of about six to eight.


 Figure 14 - Scaling of MP2C on Tibidabo

On Tibidabo, only runs with 8 and 16 processes could be tested, since only these configurations worked due to memory limitations.

3.6 PEPC

3.6.1 Description of the code

PEPC is an N-body solver for Coulomb or gravitational systems. It is used by diverse user communities in areas such as warm dense matter, magnetic fusion, astrophysics, complex atomistic systems and vortex fluid dynamics. PEPC has also formed part of the extended PRACE benchmark suite used to evaluate petaflops computer architectures. Current projects use PEPC for laser- or particle-beam-plasma interactions as well as plasma-wall interactions in tokamaks, for simulating fluid turbulence using the vortex particle method, and for investigating planet formation in circumstellar discs consisting of gas and dust.

Figure 15 - Example of one of the applications of PEPC: simulation of vortex dynamics using the vortex particle method


PEPC is based on the generic Barnes-Hut tree algorithm and exploits multipole expansions to create hierarchical groupings of more distant particles. This reduces the computational effort of the force calculation from the generally unaffordable O(N²) operations needed for brute-force summation to a more amenable O(N log N) complexity. The code is open source, developed at Juelich8 within the Simulation Laboratory Plasma Physics9, and released under the GPL license. PEPC is written in Fortran 2003 with some C wrappers to enable pthreads support, thus making use of a hybrid MPI/pthreads programming model. There is also a branch of PEPC that uses the hybrid MPI/SMPSs programming model instead. The only external dependency is a library for parallel sorting written in C that is included in the source tree. The different applications of PEPC are split into different front-ends and can be combined with different interaction-specific modules.
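The key decision in a Barnes-Hut traversal is whether a tree node is far enough away for its multipole expansion to be used instead of descending into its children. The following is a minimal sketch of the classical opening-angle test (a generic illustration with made-up numbers; PEPC's actual acceptance criterion and data structures are more elaborate):

```fortran
! Generic Barnes-Hut multipole acceptance test (illustrative; PEPC's actual
! criterion is more sophisticated).  A node of edge length s at distance d
! from the target particle is accepted if s/d < theta; otherwise its
! children are examined.
program mac_sketch
  implicit none
  real(8), parameter :: theta = 0.5d0        ! opening angle parameter (example)
  real(8) :: part(3), node_centre(3), node_size
  part        = (/ 0.0d0, 0.0d0, 0.0d0 /)
  node_centre = (/ 4.0d0, 3.0d0, 0.0d0 /)    ! distance 5 from the particle
  node_size   = 2.0d0
  if (accept(part, node_centre, node_size)) then
    print *, 'use the node''s multipole expansion'
  else
    print *, 'open the node and descend to its children'
  end if
contains
  logical function accept(x, centre, s)
    real(8), intent(in) :: x(3), centre(3), s
    real(8) :: d
    d = sqrt(sum((centre - x)**2))
    accept = (s / d < theta)
  end function accept
end program mac_sketch
```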

Figure 16 - Structure of the Tree code framework with the different modules and front-ends

Checkpointing via MPI-I/O is possible but has to be enabled by the different front-ends of PEPC. The current implementation allows for an excellent scaling of the code on different architectures with up to 2,048,000,000 particles across up to 294,912 processors10. PEPC runs on IBM Blue Gene/P (Jugene) and /Q (Juqueen) architectures, the Nehalem Cluster JuRoPA, standard Linux clusters and workstations as well as Mac OSX machines. In principle, it should be portable to any Unix-based parallel architecture.


Figure 17 - Scaling of PEPC on JUGENE with different problem sizes

3.6.2 Report on the progress of the porting of the code

Since PEPC supports a variety of applications, we chose the demo application called pepc-mini for this project. For that application, the initial porting of PEPC to ARM is complete and did not pose any problems, since PEPC conforms to the Fortran standard and can be compiled with a number of different compilers, including the GCC suite. First tests on the Tibidabo cluster have already been performed. Porting to OmpSs has started by testing the Mercurium compiler. These tests revealed several missing features in the compiler, some of which have already been rectified; others are less crucial for pepc-mini, or work-arounds are possible. The ticketing system has been used to report those problems, see e.g. tickets 990, 1016, 1017, 1018, 1021, 1022, and 1023. Major issues included, for example, missing support for (Fortran) compiler-supplied modules. More intensive testing is planned for the OmpSs-Fortran training event in October. In terms of taskifying PEPC we are already in very good shape thanks to the experience gained with the SMPSs version of the code developed within the EU TEXT11 project. Tasks and dependencies have already been identified and implemented, so we anticipate a smooth transition to OmpSs as soon as it is ready. Benchmarks with the SMPSs code show that it performs at least as well as the MPI/pthreads version.


Table 2 - Comparison of execution times of the pthreads and SMPSs versions of PEPC on JuRoPA

In the table, both versions show similar scaling with the number of ranks and threads used. The highlighted diagonal marks a constant number of cores used. We will be able to start from the SMPSs version once OmpSs is ready, so we already have tasks identified and implemented and can hopefully switch between SMPSs and OmpSs. Apart from the profiling efforts within WP3, some early tests have been performed on Tibidabo to compare the performance differences between the ARM and Intel architectures. We found an execution time on Intel that is shorter by approximately one order of magnitude for similar problem sizes. We note, however, that efficiency and execution time on a varying number of nodes depend on the problem size. Since Tibidabo and JuRoPA have different memory sizes, an identical problem size may not be the best basis for comparison. Looking at the speed-up of PEPC on those two architectures gives us the possibility to pick a different problem size for each. Such a performance evaluation is done just after the first time step and an initial phase of load balancing of all the particles among the different subdomains. Different problem sizes have been used to better suit the available memory, and we find similar, promising speed-ups on the ARM as on the Intel architecture. In particular, we could not observe a notable difference in speed-ups when a varying number of threads (via pthreads) was used.


Figure 18 - Speedup of the first time step after load balancing on the ARM and Intel platforms

3.7 ProFASI

3.7.1 Description of the code

ProFASI (PROtein Folding and Aggregation SImulator)12 is a C++ program package for Monte Carlo simulations of protein folding and aggregation. It provides an implementation of an all-atom protein model with fixed bond lengths and bond angles, an implicit-water simplified force field, and a set of tools to perform Monte Carlo simulations with the model. ProFASI has been used in a number of studies of protein folding and thermodynamics with proteins of helical, β-sheet and mixed structures. It has also been used to study amyloid aggregation, with up to 30 peptide chains in full atomic detail. Other interesting applications include studies of mechanical and thermal unfolding of globular proteins as well as studies of small semiconductor-binding peptides. The ProFASI package is developed by A. Irbäck and S. Mohanty, licensed under the GPL and mainly used in academia. It is composed of almost 40k lines of C++ code and parallelised using MPI; a hybrid version running on NVIDIA GPUs has also been developed using the CUDA framework. The code is very portable, from GNU/Linux systems to large-scale IBM BlueGene/P platforms.


3.7.2 Report on the progress of the porting of the code

The code has been ported quite easily to Tibidabo using the GNU toolchain; performance measurements showed that the ARM-based version of the code running on Tibidabo was 8x slower than the x86 version. The code owners have started to identify kernels for WP3 based on the energy calculation, which represents almost 90% of the time spent by the code for average-sized proteins. Currently the team is also working on new features of the code regarding force fields, and such developments will be backported and tested on Tibidabo at the end of 2012.

Figure 19 - ProFASI successfully describes the folding behavior of a variety of proteins with diverse structures

3.8 QuantumEspresso

3.8.1 Description of the code

QUANTUM ESPRESSO13 is an integrated suite of computer codes for electronic-structure calculations and materials modelling at the nanoscale, based on density-functional theory, plane waves, and pseudo-potentials (norm-conserving, ultrasoft, and PAW). QUANTUM ESPRESSO stands for opEn Source Package for Research in Electronic Structure, Simulation, and Optimization. Quantum ESPRESSO (QE) is an initiative of the DEMOCRITOS National Simulation Center (Trieste) and SISSA (Trieste), in collaboration with the CINECA National Supercomputing Center, the École Polytechnique Fédérale de Lausanne, Université Pierre et Marie Curie, Princeton University, and Oxford University. Courses on modern electronic-structure theory with hands-on tutorials on the Quantum ESPRESSO codes are offered on a regular basis in collaboration with the Abdus Salam International Centre for Theoretical Physics in Trieste. The code is distributed under a GPL licence and used in both academia and industry.

QE is mainly written in Fortran 90, but it contains some auxiliary libraries written in C and Fortran 77. The whole distribution is approximately 500K lines of code, even though the core computational kernels (CP and PW) are roughly 50K lines each. Both data and computations are distributed in a hierarchical way across the available processors, ending up with multiple parallelisation levels that can be tuned to the specific application and to the specific architecture. More in detail, the various parallelisation levels are organised into a hierarchy of processor groups, identified by different MPI communicators. A single task can take advantage both of shared-memory nodes, using OpenMP parallelisation, and of NVIDIA accelerator devices, thanks to the CUDA drivers implemented for the most time-consuming subroutines.

The QE distribution is by default self-contained: all that is needed is a working Fortran and C compiler. Nevertheless, it can be linked with the most common external libraries, such as FFTW, MKL, ACML, ESSL, ScaLAPACK and many others. External libraries for FFT and linear algebra kernels are necessary to obtain optimal performance. QE contains dedicated drivers for the FFTW, ACML, MKL, ESSL, SCSL and SUNPERF FFT-specific subroutines.

Quantum ESPRESSO is not an I/O-intensive application: it performs significant I/O activity only at the end of the simulation, to save the electronic wave functions used both for post-processing and as a checkpoint restart. As a consequence, I/O activity is also expected at the beginning of the simulation in a restart run. Each task saves its own bunch of data using Fortran direct I/O primitives.

The code has been ported to almost all platforms. Its scalability depends very much on the simulated system. Usually, on architectures with a high-performance interconnect, the code displays strong scalability over two orders of magnitude of processors (e.g. between 1 and 100), considering a dataset size that saturates the memory of the nodes used as the basis for the computation of the relative speed-up. On the other hand, the code displays good weak scalability. Recently, on a large simulation, good scalability up to 65K cores has been obtained (see figure below).

 

Figure 20 - Scalability of the CP kernel of QE on a BG/Q system using the CNT10POR8 benchmark

[Plot: seconds/step for the routines calphi, dforce, rhoofr, updatc and ortho as a function of core count, from 4096 to 65536 virtual cores (2048 to 32768 real cores, 1 to 16 band groups).]
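To make the notion of a hierarchy of processor groups concrete, here is a generic Fortran/MPI sketch of a two-level communicator split. This is an illustration only: it does not reproduce QE's actual group layout, and ngroups is an assumed, made-up parameter.

```fortran
! Generic sketch of a two-level hierarchy of MPI process groups
! (illustrative only; not Quantum ESPRESSO's actual communicator layout).
! MPI_COMM_WORLD is split into 'ngroups' groups; work (e.g. bands or
! k-points) would be distributed over the groups, and each group
! parallelises its own share over its intra-group communicator.
program comm_hierarchy_sketch
  use mpi
  implicit none
  integer, parameter :: ngroups = 4          ! assumed number of groups
  integer :: ierr, world_rank, world_size
  integer :: my_group, intra_comm, intra_rank, inter_comm

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, world_size, ierr)

  my_group = mod(world_rank, ngroups)        ! which group this rank joins
  ! Ranks with the same colour (my_group) end up in the same communicator.
  call MPI_Comm_split(MPI_COMM_WORLD, my_group, world_rank, intra_comm, ierr)
  call MPI_Comm_rank(intra_comm, intra_rank, ierr)

  ! A second split links the ranks holding the same position in each group,
  ! e.g. for reductions across groups.
  call MPI_Comm_split(MPI_COMM_WORLD, intra_rank, world_rank, inter_comm, ierr)

  print *, 'world rank', world_rank, 'group', my_group, 'intra rank', intra_rank

  call MPI_Comm_free(intra_comm, ierr)
  call MPI_Comm_free(inter_comm, ierr)
  call MPI_Finalize(ierr)
end program comm_hierarchy_sketch
```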


3.8.2 Report on the progress of the porting of the code

To reduce possible porting problems, we have configured QuantumESPRESSO to use its internal libraries rather than external ones. In fact, QuantumESPRESSO is self-contained and external libraries can be used as an optimisation step, so we do not link the code against external BLAS, FFT, LAPACK or ScaLAPACK. The source code of QuantumESPRESSO is mainly Fortran 90 with a small subset of C source code. Moreover, the compilation of the whole package using gfortran and gcc is routinely checked, so we did not find any problem related to the compilation of the code. The code has been compiled with MPI and OpenMP. To validate the porting we selected a well-known test case (a water molecule), already used on many other systems and with different codes. To profile the code we used the internal profiling feature of QuantumESPRESSO, which allows us to monitor the performance of the most time-consuming subroutines of the application, and we compared them with the behaviour on other machines in order to check whether, apart from the absolute performance, there were relative differences. The overall performance on Tibidabo against x86 or BG/P systems is very poor for the moment (even if Tibidabo is mainly a porting platform and not a performance-oriented one). We have performed tests using different combinations of tasks/threads per node.

3.9 SMMP

3.9.1 Description of the code

SMMP14 is used to study the thermodynamics of peptides and small proteins using Monte Carlo methods. It uses a protein model with fixed bond lengths and bond angles, reducing the number of degrees of freedom significantly while maintaining a complete atomistic description of the protein under investigation. Currently, four different force fields, which describe the interactions between atoms, are available. The interaction with water is approximated with the help of implicit solvent models. The strength of SMMP is the availability of a variety of advanced Monte Carlo methods that can be used to study, for example, the folding behaviour of a protein. The comparatively simple program structure makes it easy to implement new algorithms.

The software is developed by Jan H. Meinke, Sandipan Mohanty, Ulrich H.E. Hansmann, Shura Hayryan, Frank Eisenmenger, and Chin-Kun Hu, and distributed through a GPL licence, primarily to academia. SMMP is written in Fortran and includes Python bindings (PySMMP). Parallelisation is done using MPI and OpenCL (or CUDA). The parallelisation of the energy function (using MPI, OpenCL, or CUDA) is often combined with parallel tempering, leading to a two-level parallelisation. SMMP itself only requires a Fortran 90 compiler, and an MPI library if parallelisation is wanted. PySMMP needs Python and NumPy to work.
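As an illustration of the parallel tempering layer mentioned above, the following sketches the standard Metropolis-style replica-exchange test (a generic sketch with assumed example values, not SMMP's implementation): two replicas at inverse temperatures beta_i and beta_j with energies E_i and E_j swap configurations with probability min(1, exp((beta_i - beta_j)(E_i - E_j))).

```fortran
! Generic replica-exchange (parallel tempering) acceptance test
! (illustrative only; not SMMP's implementation).  Two replicas at inverse
! temperatures beta_i and beta_j, with current energies e_i and e_j, swap
! configurations with probability min(1, exp((beta_i-beta_j)*(e_i-e_j))).
program pt_exchange_sketch
  implicit none
  real(8) :: beta_i, beta_j, e_i, e_j
  beta_i = 1.0d0 / 300.0d0     ! assumed example temperatures and energies
  beta_j = 1.0d0 / 320.0d0
  e_i    = -105.0d0
  e_j    = -98.0d0
  if (swap_accepted(beta_i, e_i, beta_j, e_j)) then
    print *, 'swap accepted: exchange the two replica configurations'
  else
    print *, 'swap rejected: both replicas keep their configurations'
  end if
contains
  logical function swap_accepted(bi, ei, bj, ej)
    real(8), intent(in) :: bi, ei, bj, ej
    real(8) :: delta, r
    delta = (bi - bj) * (ei - ej)
    call random_number(r)
    swap_accepted = (delta >= 0.0d0) .or. (r < exp(delta))
  end function swap_accepted
end program pt_exchange_sketch
```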


The amount of I/O generated depends on the user settings but is usually moderate, even when trajectories are stored. SMMP creates checkpoints at user-defined intervals. These checkpoints are quick to write, but the current implementation requires that all data previously written to disk be read again on restart.

Figure 21 - Simulation of the folding of the 67-residue designed protein GS-a3W starting from random initial conditions

The program is written in standard Fortran and has been ported to a large number of platforms including Intel x86 and Xeon Phi, IBM Blue Gene L/P, Power 7, and CellBE. Usually, only the compiler settings in the Makefile need to be adjusted to build the program; optimized implementations take more effort. Porting to the CellBE required a special kernel to take advantage of the Synergistic Processing Elements. For the Blue Gene line, we had to adjust the communication patterns to take advantage of the network topology. SMMP employs parallel algorithms and a parallelized version of the energy function. Using this kind of parallelization, simulations of GS-α3W (see above) scaled up to 16384 cores (128 replicas with 128 cores per replica) on an IBM Blue Gene/P with 50% parallel efficiency.

3.9.2 Report on the progress of the porting of the code

After adjusting the Makefile, SMMP compiled and ran immediately on Tibidabo. The results are consistent with simulations performed on other platforms. Performance and scaling results were disappointing, however. Using gfortran version 4.4.4, the code ran about 5 times slower than on an Intel Xeon 5650. Using more recent versions of gfortran on Tibidabo, single-core performance dropped by a further factor of about 13, making the code almost 67 times slower than an Intel Xeon 5650 processor, unless -mfloat-abi=softfp is set. Parts of the code have been ported to CUDA in anticipation of the availability of CUDA-capable GPUs.


Figure 22 - Scaling of parallel tempering with 1 (blue), 2 (green) and 4 (red) replicas, compiled with gfortran 4.7.0

In Figure 22, the total number of cores used is the number of replicas times the number of cores per replica. The black dashed line indicates 50% parallel efficiency. Processor counts for which the speedup lies above this line are usually considered acceptable for production runs. The plot shows that for a single replica this point is reached at 16 cores per replica.
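For reference, the standard strong-scaling definitions behind the 50% line are recalled below (textbook definitions, not SMMP-specific quantities):

```latex
% Speed-up and parallel efficiency on p cores, with T(p) the time to solution
\[
  S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p},
\]
\[
  E(p) \ge 0.5 \;\Longleftrightarrow\; S(p) \ge \frac{p}{2}
  \quad\text{(the dashed 50\% efficiency line in Figure 22).}
\]
```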

3.10 SPECFEM3D

3.10.1 Description of the code

SPECFEM3D is an application that models seismic wave propagation in complex 3D geological models using the spectral element method (SEM). This approach, which combines finite element and pseudo-spectral methods, allows the seismic wave equations to be formulated with greater accuracy and flexibility than more traditional methodologies.


Figure 23 - SPECFEM3D simulation applied to the 2008 Sichuan earthquake

SPECFEM3D is an Open Source project and the code is available at: http://www.geodynamics.org/cig/software/specfem3d. Two versions coexist: the simple version and the globe version, the latter aimed at simulating larger meshes (continents or the whole planet, and thus needing more HPC power). SPECFEM3D is a Fortran application, but a subset of the globe version has been ported to C to experiment with CUDA, and with StarSs for the TEXT project. This subset contains the main computation loop of the application. The full application is composed of 50k lines of Fortran, while the subset contains 3k lines of C. SPECFEM3D scalability is excellent, showing strong scaling up to 896 GPUs and to more than 150k cores on a Cray XT5 system.
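To give a flavour of the StarSs/OmpSs approach used for this C subset, a minimal OmpSs-style taskification sketch follows (hedged: it assumes the Mercurium/Nanos++ toolchain, the function and array names are hypothetical, and the real SPECFEM3D loop is far more involved):

```c
/* OmpSs-style taskification sketch (Mercurium/Nanos++ toolchain assumed).
 * Names are hypothetical; this is not the actual SPECFEM3D subset. */
#include <stdio.h>

#define NELEM 1024

/* One task per block of elements: reads 'field', updates 'accel'.
 * The [first;nblk] array-section clauses express the data each task
 * touches, so the runtime can schedule independent blocks concurrently. */
#pragma omp task in(field[first;nblk]) inout(accel[first;nblk])
void compute_forces(const float *field, float *accel, int first, int nblk)
{
    for (int e = first; e < first + nblk; ++e)
        accel[e] += 0.5f * field[e];      /* stand-in for the SEM kernel */
}

int main(void)
{
    static float field[NELEM], accel[NELEM];
    const int blk = 128;

    for (int i = 0; i < NELEM; ++i) { field[i] = (float)i; accel[i] = 0.0f; }

    for (int first = 0; first < NELEM; first += blk)
        compute_forces(field, accel, first, blk); /* each call spawns a task */

    /* Wait for all outstanding tasks before using the results */
    #pragma omp taskwait
    printf("accel[0] = %f, accel[%d] = %f\n",
           accel[0], NELEM - 1, accel[NELEM - 1]);
    return 0;
}
```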

3.10.2 Report on the progress of the porting of the code

Porting SPECFEM3D to ARM using the GNU toolchain was straightforward. It should be noted that SPECFEM3D GLOBE was also ported and benchmarked. As SPECFEM3D had already been ported to StarSs, the OmpSs port of SPECFEM3D was also easy to achieve. In order to have comparison points and to conduct some benchmarks before the availability of the Mont-Blanc prototypes, we also used a Snowball Cortex-A9 ARM platform for testing. Results obtained on the Snowball cards are similar to those of Tibidabo, with about a 5% improvement in favour of Tibidabo. The scaling on Tibidabo is very good; the reference is for 4 cores (2 nodes):


Figure 24 - SPECFEM3D scaling on the Tibidabo prototype

SPECFEM3D does not suffer from the collective communication problem that plagues other applications, as it only uses point-to-point communications. There is a slight inflexion after 32 cores that has yet to be explained. The benchmarking of SPECFEM3D revealed a problem with some processors of Tibidabo that were slowing down to about 70% of their nominal performance and required a reboot to come back to normal. Without these problems, the Tibidabo prototype was found to be 8 times slower than a Xeon processor but 5 times more power efficient, using a worst-case scenario. In order to investigate the slowdown we instrumented the code using Extrae and PAPI. The results we obtained are shown in the next figure. The first graph presents the traces of computations while the second shows the communication patterns:


Figure 25 - Traces of SPECFEM3D on Tibidabo using 4 cores (2 boards)

This trace reveals that the first thread is slower than the three other threads. Those threads have to wait for the computations on the first one to finish in order to proceed. As this performance problem is random, bigger jobs are more likely to be affected. This explains the inflexion we saw in the scaling figure.
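The per-core hardware-counter measurements behind such traces can be sketched with the PAPI low-level API as follows (a minimal illustration of the approach only; the actual measurements were collected through Extrae, which drives PAPI internally):

```c
/* Minimal PAPI counter sketch (illustrative; event choice is an example). */
#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return EXIT_FAILURE;
    }
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles */
    PAPI_add_event(eventset, PAPI_TOT_INS);   /* total instructions */

    PAPI_start(eventset);
    /* ... region of interest, e.g. one solver time step ... */
    volatile double s = 0.0;
    for (int i = 0; i < 1000000; ++i) s += i * 1e-6;
    PAPI_stop(eventset, counts);

    printf("cycles=%lld instructions=%lld IPC=%.2f\n",
           counts[0], counts[1], (double)counts[1] / counts[0]);
    return 0;
}
```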

3.11 YALES2

3.11.1 Description of the code

YALES215 is a research code developed by V. Moureau and G. Lartigue at CORIA together with several other researchers in other labs (success.coria-cfd.fr). It targets the simulation of two-phase combustion, from primary atomization to pollutant prediction, on massive complex meshes. It can efficiently handle unstructured meshes with several billion elements, thus enabling the Direct Numerical Simulation of laboratory and semi-industrial configurations. The solvers of YALES2 cover a wide range of phenomena and applications, and they may be assembled to address multi-physics problems. YALES2 solves the low-Mach Navier-Stokes equations with a projection method for constant- and variable-density flows. These equations are discretized with a 4th-order central scheme in space and a 4th-order Runge-Kutta-like scheme in time. The efficiency of projection approaches is usually driven by the performance of the Poisson solver; in YALES2, the linear solver is a highly efficient Deflated Preconditioned Conjugate Gradient with two mesh levels. YALES2 has a free academic licence for French labs and it is also used through some industrial partnerships. YALES2 is written in Fortran90, parallelised using MPI-1, and relies on external libraries such as HDF5, PETSc, FFTW, METIS, SCOTCH and SUNDIALS for pre/post-processing and numerical methods. The code also implements in-house parallel I/O with checkpointing and restart.
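As an indication of the kind of kernel that dominates the solver cost, a plain preconditioned conjugate gradient iteration is sketched below in C (hedged: YALES2 actually uses a two-level deflated PCG on distributed unstructured meshes, which is considerably more elaborate than this dense toy version):

```c
/* Toy Jacobi-preconditioned CG on a small dense SPD system.
 * Illustrative only; not the YALES2 solver. */
#include <math.h>
#include <stdio.h>

#define N 4

static double dot(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; ++i) s += a[i] * b[i];
    return s;
}

int main(void)
{
    /* SPD test matrix, right-hand side and initial guess */
    double A[N][N] = {{4,1,0,0},{1,4,1,0},{0,1,4,1},{0,0,1,4}};
    double b[N] = {1,2,3,4}, x[N] = {0,0,0,0};
    double r[N], z[N], p[N], Ap[N];

    for (int i = 0; i < N; ++i) { r[i] = b[i]; z[i] = r[i] / A[i][i]; p[i] = z[i]; }
    double rz = dot(r, z);

    for (int it = 0; it < 50 && sqrt(dot(r, r)) > 1e-12; ++it) {
        for (int i = 0; i < N; ++i) {                     /* Ap = A * p */
            Ap[i] = 0.0;
            for (int j = 0; j < N; ++j) Ap[i] += A[i][j] * p[j];
        }
        double alpha = rz / dot(p, Ap);
        for (int i = 0; i < N; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        for (int i = 0; i < N; ++i) z[i] = r[i] / A[i][i]; /* Jacobi precond. */
        double rz_new = dot(r, z);
        double beta = rz_new / rz;
        for (int i = 0; i < N; ++i) p[i] = z[i] + beta * p[i];
        rz = rz_new;
    }
    printf("x = %g %g %g %g\n", x[0], x[1], x[2], x[3]);
    return 0;
}
```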


Figure 26 - Simulations of small vortices of the turbulent flow of an industrial swirl burner

The code has been ported to various platforms: IBM Blue Gene/P (Babel @ IDRIS, Jugene @ JUELICH), IBM Blue Gene/Q (Turing @ IDRIS), BullX Intel clusters (Curie @ TGCC and Airain @ CEA), and IBM Power 6 (Vargas @ IDRIS).

Figure 27 - YALES2 solvers performance on BG/P system (IDRIS) with 2.2 billion elements (speed-up versus number of cores, up to 16384 cores, for the A-DEF2 and RA-DEF2(d) solvers compared with linear scaling)


3.11.2 Report on the progress of the porting of the code

The code was ported to Tibidabo with the gcc toolchain. After the compilation of the missing required external libraries (the HDF5 library for I/O), the porting of the code itself went smoothly. The validation was performed on the simulation of the flow around a 2D cylinder:

Figure 28 - Simulation of the wake behind a 2D cylinder at Re=100. The color represents the velocity magnitude and the white dots are the Lagrangian particles emitted from the cylinder.

Profiling and thermal efficiency measurements (comparison with a 6-core x86 Xeon X5675 3.07 GHz processor, assuming a 95 W power consumption) were performed. The reduced thermal efficiency, i.e. the amount of energy required to advance the simulation by one time step (iteration) for one control volume (mesh node) on a single core, is 50 to 60% better for the ARM cores than for the x86 Xeon.

Reduced time efficiency (µs * nb_core / nb_ite / nb_nodes):

Number of cores    X86 Xeon    ARM      Ratio (ARM/X86)
1                  11.8        295.2    25.1
2                  11.4        300.3    26.3
4                  14.3        442.5    31.0

Reduced thermal efficiency (µJ * nb_core / nb_ite / nb_nodes):

Number of cores    X86 Xeon    ARM      Ratio (ARM/X86)
1                  186.4       73.8     0.396
2                  181.1       75.1     0.414
4                  225.9       110.6    0.490
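As a cross-check of how the two tables relate, the thermal figures follow from the time figures multiplied by the assumed per-core power; for the single-core entries this gives (note that the 0.25 W per ARM core is implied by the table values, not stated explicitly in this report):

```latex
% Thermal efficiency = time efficiency x assumed power per core
\[
  \text{x86, 1 core:}\quad
  11.8\,\mu\mathrm{s}\times\frac{95\,\mathrm{W}}{6\ \text{cores}}
  \approx 187\,\mu\mathrm{J}
  \quad(\text{table value } 186.4\,\mu\mathrm{J}),
\]
\[
  \text{ARM, 1 core:}\quad
  \frac{73.8\,\mu\mathrm{J}}{295.2\,\mu\mathrm{s}} \approx 0.25\,\mathrm{W}
  \ \text{per core (implied).}
\]
```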


Figure 29 - YALES2 strong scaling on Tibidabo (speed-up versus number of cores for the ARM and X86 Xeon platforms, with ideal scaling shown)

4 Interactions with the other Mont-Blanc work packages

The interactions between WP4 and the other work packages of the project can be summarised using the following map:



Figure 30 - Map of interactions between WP4 and the other Mont-Blanc WPs

The main interaction is with WP3, through the extraction of representative kernels from the WP4 applications, their optimisation and porting using OmpSs, and their final inclusion into mini-apps or the full versions of the scientific applications. During the second half of the first period, WP3 and WP4 collaborated by organising regular joint teleconferences to steer the work of extracting and characterising kernels. The outcome has been reported in D3.1 “Kernel selection criteria and assessment report”. More precisely, for some of the WP4 applications the work with WP3 has been the following:

• BigDFT The profiling effort allowed us to get a breakdown of the time spent in different functions of BigDFT:

• Preconditioning: 27%
• Wave-Function Transposition: 21%
• Projectors Applications: 16%
• Density Computation: 8%
• Kinetic Energy: 8%
• ...

Two kernels were identified and transmitted to WP3:

1. The preconditioning subroutine, where most of the time is spent. This kernel is composed of many different operations representative of the BigDFT wavelet approach. As it does not contain any communications, we added to it the two Wave-Function Transpositions that precede it, as well as the Cholesky kernel in between.

2. The second kernel selected is the much smaller magicfilter. It is called frequently and computes the potential energy of the system. This kernel is very sensitive to optimisations: for instance, unrolling the outer loop of the kernel correctly can yield an improvement of a factor of 2.5 (see the illustrative unrolling sketch after this list). The results obtained with this kernel will be directly transposable to the other convolutions in BigDFT.

• Cosmo A detailed profiling of the application on PLX (an x86 hybrid cluster at CINECA) has been provided to WP3 (and reported in D3.1) to start the identification of possible task sections. In the near future, a close collaboration between WP3 and WP4 is foreseen to port the code to the OmpSs shared memory paradigm.

• Euterpe A number of small and frequent operations in the most time-consuming sections of the code (corresponding to the particle push, the charge/current calculation on the grid and the field solver) have been identified as potential candidate kernels for WP3. The preliminary candidate kernels are the following functions: getfield, equil_grad, equil_n, equil_t, equil_absb, equil_phieq.

• QuantumEspresso

The main computational kernels of QuantumESPRESSO are matrix multiplication, the solution of (dense) eigenproblems and the FFT; all of them, depending on the computation and the configuration, may be used in either the serial or the parallel version. The optimization of any one of these kernels will have a significant impact on the performance of the whole application. For testing purposes, the parallel matrix multiplication driver of Quantum ESPRESSO has been extracted from the code and may be used by WP3 to test its performance.

• MP2C As mentioned before, a performance analysis was performed for WP3 in order to identify possible kernels. Based on this analysis, the thermostat of the MPC part of the code was identified and proposed as a kernel candidate. The extraction of this kernel is still work in progress and it will be provided to WP3 as soon as this work is finished.

• PEPC The task of WP3 is optimising/tuning selected kernels of various applications. As mentioned above, PEPC itself has a modular structure with front-end and interaction parts that are specific to the physics application, while the tree algorithm is common to all of them. From that point of view, the PEPC library containing the core tree code may be considered a kernel. However, since the execution of PEPC is non-deterministic (in the way the tree is traversed and particles are distributed to processors), it is not possible to isolate a meaningful set of input parameters for single functions. It has thus been decided to include the most minimalistic example application, the demo 'pepc-mini', as a kernel for this work package.
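As referenced in the BigDFT item above, the kind of outer-loop unrolling that pays off in the magicfilter kernel can be sketched with a generic 1D convolution (hedged: this is not the BigDFT code; the filter length and unroll factor are illustrative, and the input array is assumed to hold n + L - 1 values):

```c
/* Generic convolution with the outer loop unrolled by four, sketching the
 * kind of transformation applied to magicfilter. Not the BigDFT code. */
#define L 16                       /* illustrative filter length */

void conv_ref(int n, const double *in, const double *f, double *out)
{
    for (int i = 0; i < n; ++i) {             /* reference version */
        double s = 0.0;
        for (int k = 0; k < L; ++k) s += f[k] * in[i + k];
        out[i] = s;
    }
}

void conv_unroll4(int n, const double *in, const double *f, double *out)
{
    int i;
    for (i = 0; i + 3 < n; i += 4) {          /* outer loop unrolled by 4:  */
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (int k = 0; k < L; ++k) {         /* each filter tap is reused  */
            double fk = f[k];                 /* for four neighbouring      */
            s0 += fk * in[i + k];             /* outputs, improving register */
            s1 += fk * in[i + 1 + k];         /* and cache reuse             */
            s2 += fk * in[i + 2 + k];
            s3 += fk * in[i + 3 + k];
        }
        out[i] = s0; out[i + 1] = s1; out[i + 2] = s2; out[i + 3] = s3;
    }
    for (; i < n; ++i) {                      /* remainder loop */
        double s = 0.0;
        for (int k = 0; k < L; ++k) s += f[k] * in[i + k];
        out[i] = s;
    }
}
```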

During the first period WP4 also worked with WP5 in order to install and assess on Tibidabo the software stack deployed by WP5. Mandatory global components such as compilers (the GNU toolchain as well as the Mercurium compiler from BSC), MPI libraries, parallel numerical libraries, I/O packages, … have been compiled, tuned for the architecture and installed in order to facilitate the porting effort of WP3 and WP4.

With WP7, in charge of setting up platforms for the use of WP3, WP4 and WP5, we had interactions during the first period by using Tibidabo (and soon Pedraforca) and providing performance feedback. As an example, the profiling of WP4 applications showed strong limitations in the scalability of the codes due to network congestion at the level of the Ethernet switches. Even though Tibidabo targets portability and the assessment of the programmability of the platform rather than performance, this kind of information is useful for the configuration of Tibidabo and of the future systems deployed by WP7.

With WP2, in charge of the dissemination of the results of the project, interactions took place once the initial porting of the WP4 applications was finished, in order to advertise these good preliminary results. Finally, interactions with WP6 will need to be developed following the publication of D3.1 as well as of this document. General Mont-Blanc telcos, WP4 telcos and face-to-face meetings were the opportunity for us to share information with WP5 and WP7.

5 Perspectives

During the next period, following the availability of new hardware platforms to the partners, a subset of the 11 original applications will be ported to and assessed on these platforms. The Pedraforca cluster, currently being installed at BSC, is made of Tegra 3 based boards (built around an ARM Cortex-A9 1.5 GHz SoC) coupled with a mobile GPU. Around half of the 11 scientific applications already have a hybrid version of the code in CUDA, OpenCL or OpenACC. WP4 will also continue its collaboration with WP3 by working on the porting of the application kernels using OmpSs. In that respect, an interesting issue will be to assess how well OmpSs-based kernels and applications overlap the data transfers between the host and the mobile GPU with computations.

6 Conclusion

During the first period, WP4 activity was focused on Task 4.1: porting all 11 exascale-class scientific applications to Tibidabo, the first full low-power cluster made available by the project. The first feedback showed that such porting was quite straightforward thanks to the availability of a full GNU toolchain and of the other mandatory tools (MPI, HDF5, numerical libraries, …) provided by WP5.


Some early experiments started with the Mercurium compiler provided by BSC in order to use the OmpSs programming environment. Some performance/portability issues were raised and joint work with BSC is ongoing to solve them. Beyond the full initial porting of the applications, WP4 also worked on Task 4.2 by extracting kernels and profiling the performance and the power consumption of a subset of applications. WP4 established strong relations with WP3 for working on kernels extracted from the applications, with WP5 for using and providing feedback on the installed software stack, and with WP7 for the use of the Tibidabo platform.


List of figures

Figure 1 - List of the 11 WP4 scientific applications .... 5
Figure 2 - Picture of a Snowball board .... 7
Figure 3 - Architecture of the Tibidabo cluster .... 10
Figure 4 - Nitrogen (N2) electronic orbitals .... 11
Figure 5 - BigDFT scaling on the Tibidabo prototype .... 13
Figure 6 - Traces of BigDFT on Tibidabo using 36 cores (18 boards) showing the delayed communications problem .... 13
Figure 7 - Strong scaling of BQCD on a BG/P system on 64³*128 and 96³*192 lattices .... 15
Figure 8 - Impact of clock frequency on BQCD timings .... 16
Figure 9 - Strong scaling of the CG solver for 58³*16 lattices .... 17
Figure 10 - Mean exclusive time spent inside functions and relative standard deviation for a 16 cores run .... 19
Figure 11 - Energy spectrum .... 20
Figure 12 - Electrostatic potential in the plane (r,z) for phi=0 .... 20
Figure 13 - System of 100 polymer chains consisting of 250 monomers each .... 22
Figure 14 - Scaling of MP2C on Tibidabo .... 23
Figure 15 - Example of one of the applications of PEPC .... 23
Figure 16 - Structure of the Tree code framework with the different modules and front-ends .... 24
Figure 17 - Scaling of PEPC on JUGENE with different problem sizes .... 25
Figure 18 - Speedup of the 1st time step after load balancing on ARM and Intel platforms .... 27
Figure 19 - ProFASI successfully describes the folding behavior of a variety of proteins with diverse structures .... 28
Figure 20 - Scalability of the CP kernel of QE on BG/Q system using the CNT10POR8 benchmark .... 29
Figure 21 - Simulation of the folding of the 67-residue designed protein GS-a3W starting from random initial conditions .... 31
Figure 22 - Scaling of parallel tempering with 1 (blue), 2 (green) and 4 (red) replicas .... 32
Figure 23 - SPECFEM3D simulation applied to the 2008 Sichuan earthquake .... 33
Figure 24 - SPECFEM3D scaling on the Tibidabo prototype .... 34
Figure 25 - Traces of SPECFEM3D on Tibidabo using 4 cores (2 boards) .... 35
Figure 26 - Simulations of small vortices of the turbulent flow of an industrial swirl burner .... 36
Figure 27 - YALES2 solvers performance on BG/P system (IDRIS) with 2.2 billion elements .... 36
Figure 28 - Simulation of the wake behind a 2D cylinder at Re=100. The color represents the velocity magnitude and the white dots are the Lagrangian particles emitted from the cylinder .... 37
Figure 29 - YALES2 strong scaling on Tibidabo .... 38
Figure 30 - Map of interactions between WP4 and the other Mont-Blanc WPs .... 39

D4.1 “Preliminary report of progress about the porting of the full-scale scientific applications” Version 2.0

44

List of tables

Table 1 - Initial BigDFT timings on ARM vs x86 .... 12
Table 2 - Comparison of execution times of pthread and SMPSs versions of PEPC on JuRoPA .... 26

Acronyms and Abbreviations

- DEISA Distributed European Infrastructure for Supercomputing Application
- GbE Gigabit Ethernet
- GPL General Public Licence
- GPU Graphics Processing Unit
- HPC High Performance Computing
- I/O Input (read), Output (write) operations on memory or on disks/tapes
- MD Molecular Dynamics
- PRACE Partnership for Advanced Computing in Europe (http://www.prace-ri.eu)
- SoC System On Chip
- TDP Thermal Dissipation Power
- WP2 Work Package 2 (“Dissemination and Exploitation”)
- WP3 Work Package 3 (“Optimized application kernels”)
- WP4 Work Package 4 (“Exascale applications”)
- WP5 Work Package 5 (“System software”)
- WP6 Work Package 6 (“Next-generation system architecture”)
- WP7 Work Package 7 (“Prototype system architecture”)

List of References

1 http://www.igloocommunity.org
2 http://inac.cea.fr/L_Sim/BigDFT/
3 http://www.deisa.eu/science/benchmarking/codes/bqcd
4 http://www.prace-project.eu/documents/public-deliverables/PublicRelease-D7.2.pdf
5 http://en.wikipedia.org/wiki/QPACE
6 M. Allalen, M. Brehm and H. Stuben, “Performance of quantum chromodynamics (QCD) simulations on the SGI Altix 4700”, Computational Methods in Science and Technology, CMST 14(2), 2008.
7 S. Krieg and T. Lippert, “Tuning Lattice QCD to Petascale on Blue Gene/P”, Forschungszentrum Juelich, IAS Series Vol. 3 (2010), 155-164.
8 http://www.fz-juelich.de/ias/jsc/pepc
9 http://www.fz-juelich.de/ias/jsc/slpp
10 M. Winkel et al., “A massively parallel, multi-disciplinary Barnes-Hut treecode for extreme-scale N-body simulations”, Comput. Phys. Commun. 183, 880 (2012).
11 J. Labarta, “Towards EXaflop Applications”, TEXT Final Report, EU Grant RI261580, http://www.project-text.eu
12 A. Irbäck and S. Mohanty, “PROFASI: A Monte Carlo simulation package for protein folding and aggregation”, J. Comput. Chem. 27, 1548-1555 (2006).
13 http://www.quantum-espresso.org
14 http://smmp.berlios.de
15 www.coria-cfd.fr