
AuroraScience Project
Report on the First Phase
July 31st, 2009 - April 22nd, 2011

The AuroraScience Collaboration:
S. a Beccara (e), R. Alfieri (b), F. Artico (f), G. Bilardi (f), M. Brambilla (a,b), F. Boso (f), V. M. Cappelleri (f), A. Cestaro (c), E. Cilia (c), M. Cristoforetti (a), M. Dalla Brida (a), M. D'Antonio (a,c), F. Di Renzo (b), C. Destri (b), P. Faccioli (e), C. Fantozzi (f), P. Fontana (c), G. Gargano (d), L. Giusti (b), M. Grossi (b), A. Y. Illarionov (e), R. Leonardi (d), W. Leidemann (e), G. Marchesini (b), E. Milani (f), C. Moser (c), E. Onofri (b), G. Orlandini (e), F. Pederiva (e), F. Peruch (f), E. Peserico (f), A. Pietracaprina (f), M. Pivanti (b), F. Pozzati (a), G. Pucci (f), F. Rapuano (b), A. Richter (a), M. Sega (e) [1], S. F. Schifano (b), M. Schimd (f), M. Schwarz (d), L. Scorzato (a), H. Simma (b) [2], T. Skrbic (a,e), E. Tagliavini (b), M. Traini (e), R. Tripiccione (b), R. Velasco (c), P. Verrocchio (e), F. Versaci (f), L. Yuan (a,e), N. Zago (f).

(a) European Center for Theoretical Studies in Nuclear Physics and Related Areas - Fondazione Bruno Kessler (ECT*-FBK),

(b) Istituto Nazionale di Fisica Nucleare (INFN),
(c) Istituto Agrario di San Michele all'Adige - Fondazione Edmund Mach (IASMA-FEM),
(d) Agenzia Provinciale per la Protonterapia (ATreP),
(e) Dipartimento di Fisica - Università di Trento (FIS-UNITN),
(f) Dipartimento di Elettronica e Informatica - Università di Padova (DEI-UNIPD).

[1] Current address: ICP - University of Stuttgart.

[2] Current address: DESY - Zeuthen.

1. Summary of the Results of the Project

This document is a report on the results obtained within the first phase of the AuroraScience project. Here, the main achievements are summarized, before describing them in detail in the following sections.

The AuroraScience project is a direct collaboration of the Fondazione Bruno Kessler (FBK) with the Istituto Nazionale di Fisica Nucleare (INFN) and the following partner institutions: Fondazione Edmund Mach (FEM), Agenzia Provinciale per la Protonterapia (ATreP), the Physics Department of the University of Trento (FIS-UNITN) and the Dipartimento di Elettronica e Informatica of the University of Padova (DEI-UNIPD). The project is funded by the Provincia Autonoma di Trento (PAT) and by INFN, and it is administratively managed by FBK, through the European Center for Theoretical Studies in Nuclear Physics and Related Areas (ECT*-FBK), and by INFN.

The main goal of the project is the development of a computer architecture suitable for high-performance scientific computing, the installation of a working prototype of that architecture and the exploitation of this machine for a variety of scientific applications, all requiring substantial number-crunching capability.

In the original proposal of the AuroraScience project (at that time still called simply Aurora), the milestones set for the first phase of the project were stated as follows (see Section 3.3.1 of the PAT-FBK and FBK-INFN agreements):

• Hardware, system, and algorithmic development (Work Package 1, WP1):

1. Development of the Aurora system: availability of a small system with at least 2 fully populated chassis (...). The system should run the operating system and enable communications between the processors. The system should be able to execute demonstrative parallel versions of some of the applications in WP2 (...). In our opinion the achievement of this milestone should be considered a sufficient condition for the start of the second phase.

2. Optimization methodologies: a) definition of a general computational model for multicore systems with a nearest-neighbor network, which would provide a solid base for the optimization of algorithms and codes. b) Specialization of the model for Aurora. c) Preliminary evaluation of the impact of the network on the algorithmic performance of the scientific applications of Aurora.

• Scientific applications (Work Package 2, WP2):

3. Lattice QCD: evidence of the functioning on the Aurora prototype of a full LQCD program, able to produce significant results in the second phase.

4. Nuclear Physics: adaptation to Aurora and test of existing codes for Auxiliary Field Diffusion Monte Carlo and for Few-Body calculations. In this activity the optimal parallelization strategy, able to exploit the features of Aurora, will be studied.

5. Protein Folding: evidence of the functioning on the Aurora prototype of the basic routines for the Dominant Reaction Pathway (DRP) method, using simple techniques of parallelization.


6. Bio-informatics: development of an interface between the data produced in the laboratory and the Aurora system, and development of the main parallel computing routines.

7. Radiotherapy: numerical simulation on Aurora of radiation with complexity analogous to that needed for the Radiotherapy Treatment Plan, using existing versions of the programs and test geometries.

The main results obtained in the first phase of the project, concerning points 1 and 2 above (i.e. WP1), are the following:

• The design and the testing of the Aurora processing board, which is the result of a research and development collaboration with the industrial partner Eurotech, have been finalized.

• The prototype has been installed and commissioned at the premises of FBK in Trento. The current installation has two full chassis, with 64 processing nodes, corresponding to a peak performance of more than 10 Tflops (see Section 2.1).

• A management infrastructure (storage, network connection, queueing system), enabling users to exploit the machine in a reasonably friendly way, has been developed (see Section 2.1).

• The firmware supporting the toroidal interconnection network for our massively parallel machine has been developed (see Section 2.2).

• The low-level and middle level software layers supporting the network have been developed (see Section 2.3).

• Algorithms suitable to support general communication patterns via the FPGA devices have been investigated (see Section 2.4).

As a result of these achievements, the 10 Tflops prototype has been available for scientific computations since November 25th, 2010.

The main results of the AuroraScience project in the area of the computational sciences (i.e. WP2) are the following:

• The parallelization strategy and the communication routines of the tmLQCD code (a comprehensive Lattice QCD package) have been adapted to efficiently exploit the Aurora parallel architecture. The full tmLQCD code is now working on the Aurora system and it is used for two physics projects. Two more Lattice QCD codes have been adapted to Aurora (see Section 3.1).

• First significant physical results have been obtained by using the prototype for Lattice QCD computations (see Sections 3.1.2 and 3.1.4).

• A fluid dynamics code for turbulent compressible flows in 2D, based on Lattice Boltzmann algorithms, has been ported to Aurora. The code has been fully tested, and various versions of the code show a very satisfactory efficiency in the range of 30% - 35% of the theoretical peak (see Section 3.2).

• A novel strategy that exploits the toroidal network to rebalance the population across nodes within a Diffusion Monte Carlo algorithm has been implemented and tested (see Section 3.3).

• The computational requirements of the Auxiliary Field Diffusion Monte Carlo algorithm for many-nucleon systems and of the Lorentz transform method have been analyzed, in view of the efficient use of the Aurora system (see Section 3.3).

• A code supporting the DRP approach, used in the study of protein dynamics, has been parallelized (see Section 3.4).

• The folding mechanism of knotted proteins has been extensively investigated on Aurora via Molecular Dynamics and Monte Carlo simulations (see Section 3.4).

• Design and functional tests of a Monte Carlo code that can produce a realistic physical description of a therapeutic beam (i.e. with the appropriate range of energy, beam size, shape, direction, position, etc.), and successful simulation of beam transport through a patient geometry described with computed tomography data (see Section 3.5).

• Acquisition of experimental data from proton therapy centers to be used as benchmarks of our simulations (see Section 3.5).

• An algorithm to improve genome assembly, exploiting information from next-generation sequencing data and the characteristics of the Aurora architecture, has been developed and tested (see Section 3.6).

Besides the scientific applications already foreseen in the original proposal and listed above, the AuroraScience project has attracted interest from many other groups. In particular, the INFN groups working on Lattice QCD, under the iniziativa specifica ROMA123, have used the machine and produced the results that are reported in Section 3.1.4 of this document. Moreover, new groups from UNITN and FBK, working in condensed matter physics (polymers, soft matter, materials science), have ported their codes and tested the machine, as reported in Sections 3.4.2 and 3.4.3.

One further measure of the success of the project is the number of new collaborations initiated among the groups. In particular, the ECT*-FBK and INFN groups are collaborating in the program described in Section 3.1; the FIS-UNITN, ECT*-FBK and INFN groups in the program of Section 3.3; the FIS-UNITN, DEI-UNIPD and ECT*-FBK groups in the program of Section 3.4; the DEI-UNIPD and IASMA-FEM groups in the program of Section 3.6; the DEI-UNIPD and INFN groups in the program of Section 2.4; the ECT*-FBK, INFN and FIS-UNITN groups in the program of Section 2.3; finally, all groups are actively participating, at least by pointing out problems and suggesting solutions which are crucial for the developments of Section 2.1.

All these collaborations were possible thanks, in particular, to the funding made available, within the AuroraScience project, for the opening of new research positions. The following researchers were hired:

• Luigi Scorzato (local coordinator of the project, ECT*-FBK),
• Marco Cristoforetti (ECT*-FBK),
• Enrico Tagliavini (ECT*-FBK),
• Laura Sartori (ECT*-FBK, working @ INFN Ferrara),
• Fabio Pozzati (ECT*-FBK, working @ INFN Ferrara),
• Michele Brambilla (ECT*-FBK, working @ INFN Parma),
• Marco Grossi (ECT*-FBK, working @ INFN Parma),
• Mattia D'Antonio (ECT*-FBK, working with IASMA-FEM),
• Tatjana Skrbic (FIS-UNITN and ECT*-FBK),
• Luping Yuan (FIS-UNITN and ECT*-FBK),
• Alexei Yu. Illarionov (FIS-UNITN),
• Silvio a Beccara (FIS-UNITN),
• Elisa Cilia (IASMA-FEM),
• Gianfranco Gargano (ATreP),
• Francesco Versaci (DEI-UNIPD),
• Fausto Artico (DEI-UNIPD),
• Michele Schimd (DEI-UNIPD),
• Emanuele Milani (DEI-UNIPD),
• Federica Boso (DEI-UNIPD),
• Nicola Zago (DEI-UNIPD),
• Vincenzo Maria Cappelleri (DEI-UNIPD),
• Francesco Peruch (DEI-UNIPD).

AuroraScience has also been the main driving force that led to the foundation of the new Interdisciplinary Laboratory for Computational Sciences (LISC), which gathers scientists from the Physics Department of the University of Trento, from the Materials Science Division and from ECT* at FBK. The presence of Aurora is today of vital importance for the development of the research of the various groups involved, and has already started an intense exchange and collaboration on the technical issues related to the use of the new machine.

The results of the project have been presented at a number of international conferences, including the International Supercomputing Conference ISC10 (June 2010, Hamburg), the Lattice Conference 2010 (June 2010, Villasimius) and the Supercomputing Conference SC10 (November 2010, New Orleans). The system has also been presented to the general public on the occasion of the “notte dei ricercatori” (September 2010) and in a radio interview (October 2010, Radio Dolomiti). Finally, an Aurora school for PhD students was organized at ECT* between September 20th and October 1st, 2010, with lectures on the computational aspects of Lattice QCD, Statistical Mechanics, Few- and Many-Body Nuclear Physics, and including training sessions on Aurora (see Appendix B).

In conclusion, the project has reached the milestones it had set for itself for the development of the Aurora architecture, has pursued and extended its scientific activities, and looks forward to fully exploiting this powerful machine in a variety of science projects in the second phase. To this end, it is necessary to proceed with the development of the computing system and with the continued funding of research positions for those people who have been, and remain, essential for the scientific activities described in this document.

Sections 2 and 3 of this document describe in detail all the activities and results summarized so far for Work Packages WP1 and WP2, respectively. Section 4 provides an explanation of how the resources were used. The status of the budget, as of March 31st, 2011, is also attached to this document as Appendix A. Finally, the report of the Aurora School can be found in Appendix B.

2. Development of the Computing System (WP1)

2.1 Development and installation of the Prototype (ECT*-FBK, INFN, FIS-UNITN groups)

The main building blocks and features of the Aurora computing system, which is installed at the premises of FBK, are the following:

• The basic computing element is the Aurora board (also called node), hosting two Intel CPUs (X5680 Westmere, 6 cores @ 3.33 GHz) with 12 GB of DDR3 RAM each. The distinctive feature of the board is its connectivity: on top of a 40 Gbit/s Infiniband adapter, it hosts 1 FPGA (Altera Stratix IV GX230) and 6 PMC-Sierra quad-link PHYs, enabling the deployment of the FPGA-based toroidal network firmware. Each board has its own coldplate, ensuring liquid cooling.

• 16 boards, combined with a root-card (on the top), a DC/DC tray (on the bottom) and a backplane (on the back), constitute a half-chassis. The backplane provides both toroidal and Infiniband connectivity. The root-card hosts a 36-port Infiniband switch, and also provides power management and monitoring functionality.

• Liquid cooling is a key feature of Aurora. It enables the high computing density which is a distinctive trait of Aurora, and dramatically reduces energy consumption. Moreover, the so-called quick-disconnect technology (used here for the first time in HPC) ensures that each board can be disconnected at any time, without halting the system.

• Two half-chassis sit back to back to form a chassis. This is the basic module on which a full three dimensional torus (8x2x2) can be closed.

• Up to eight chassis can be hosted in a rack, which also includes the AC/DC power supplies, the power distribution and the water distribution systems.

The present installation can be seen in Figure 2.1.1. It includes two full chassis (64 boards), corresponding to a peak performance of 10 Tflops. A fifth half-chassis, equipped with an older version of the boards and Nehalem processors, is currently kept in Ferrara for development purposes.

A very first installation was tested in Ferrara. Later, while the cooling system at FBK was under construction, a second installation was tested on a system hosted at the Eurotech laboratories in Amaro. The actual deployment phase at FBK required the joint efforts of many people, mainly including the ECT*-FBK team, the INFN-Parma team and a system manager from the University of Trento, with constant support from the INFN-Ferrara team. Needless to say, there was constant cooperation with people from Eurotech. As usual in such a phase, on top of the implementation of all the planned activities, extra work was needed to solve the minor problems that one inevitably encounters in the deployment of a system like Aurora. The success in overcoming these difficulties has been, all in all, convincing evidence of the overall robustness of the project.

The realization of the cooling system required a significant effort and was essentially ready by the end of June 2010. The system implements state-of-the-art technologies for energy saving and sustainable operation, including 'heat recovery' and 'free cooling'. Thanks to the favorable heat capacity of water, the temperature difference between incoming and outgoing water is very modest (25 °C in, 30 °C out). This is an ideal condition for free cooling, which can be used instead of the chiller for most of the year (whenever the external temperature goes below 20 °C).
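As a rough illustration (not a figure taken from the project documentation), the heat removed by the cooling loop is related to the water flow rate and to this 5 °C temperature rise by the usual relation

\[
P = \dot{m}\, c_p\, \Delta T , \qquad c_p \approx 4.19\ \mathrm{kJ/(kg\,K)} ,
\]

so a flow of about 1 kg/s of water carries away roughly 21 kW; this is why a rack dissipating tens of kW can be cooled with a modest water flow and a small temperature rise.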

Since August 2010, the system has grown to the present level of two fully populated chassis, and has been continuously used for test applications, benchmarks and production runs.

The system can be reached via a front-end server (aurora1.fbk.eu) and is managed by four powerful servers, whose configuration has been tested on a prototype system in Amaro. The servers provide the general services, such as I/O operations (see below), management of the queue system, code compilation, node management and monitoring. The installation of the nodes uses tools initially developed by the Ferrara team and later improved by the Parma and ECT*-FBK teams. The system can be monitored via Ganglia. Full remote control of all diagnostic signals (temperatures and flux of the cooling liquid, voltages, status of the UPS and of the chiller) is available to the Servizio Tecnico of FBK. The Infiniband switches are working with the expected performance, and various tests have been, and are still being, performed to achieve the optimal tuning of the network, both for inter-board communications and for I/O.

Figure 2.1.1. The present installation of Aurora at LISC-FBK, seen from both sides.

During the installation phase, a 500 GB /home partition was made available to users, exported to all nodes via NFS over Infiniband. Later on, a parallel file system (GPFS) was installed. At the moment, two of the servers manage I/O operations in parallel; in the future, all four servers will share the load. Concerning the storage, two different SAN systems have been evaluated and a third one (an improvement of the second) has been purchased. The final system, which was delivered on January 28th, includes 12 disks optimized for performance (SAS 15K rpm, 600 GB each) and 12 disks optimized for storage space (SAS 7.2K rpm, 2 TB each), for a total of more than 31 TB (because of RAID 6 redundancy, only about 24 TB are actually available). By now, two GPFS storage areas (/home and /work) are exported to the systems via NFS; the SAS 15K rpm disks are used only for /work (to be used as scratch space), while the SAS 7.2K rpm disks are used for both /home and /work (which increases the available throughput under adverse conditions).

After installation, I/O tests have been performed, focusing in particular on libraries dedicated to Lattice QCD I/O tasks (support has been provided by A. Deutzman, one of the authors of the LEMON library, which relies on MPI I/O functions) and on a program dedicated to I/O testing (Bonnie), besides more basic I/O routines. The results of these tests are very satisfactory: for very large data blocks, we measured that the system can sustain on the /work partition at least 800 MB/s and 700 MB/s for concurrent read and concurrent write operations, respectively. Work is in progress to overcome operating system limitations which at the moment prevent the nodes from achieving the best performance. Performance is considerably lower on the /home partition, which should not be used for intensive I/O operations, but rather for storage.

A dedicated service ("atnadmin") has been designed to manage the partitioning of the system with respect to the torus network connectivity: users can list available and active partitions, monitor their status in terms of active processes, and activate or delete partitions. A customized queue system has been developed on top of PBS Torque to manage job submission. Via the "atnsub" command, users can submit a job to be run on a given partition/connectivity. A variety of commands customize the environment for runs using the TORUS library (see below).

2.2 Development of the 3D-Torus Network (INFN-Ferrara group)

Most of the logic of the network processor is an architecture-specific port to Aurora of the FTNW design (developed by M. Pivanti, S.F. Schifano and H. Simma [2]). The FTNW design includes the definition and implementation of several firmware and software components, in detail:

• the logic for transmitting and receiving data packets,
• the physical- and data-link protocol,
• the interface with the CPU, to move data to and from the main memory,
• the APIs and the functions for implementing a low-level communication library supported by a Linux driver.

These parts have been taken from the FTNW project, adapted, and implemented on the specific field-programmable gate array (FPGA) installed on the node-cards of the Aurora machine.

The activities carried out during the last year can be divided into different areas.

First, we have tested the components of the FTNW design on the specific Altera FPGA installed on the boards of the Aurora system. This required a new definition of the top-level and pin-out of the design, and a tuning of the components to fit the resources available on the FPGA of the Aurora boards. During this phase we have adapted the processor interface of the FTNW module for the specific Intel CPUs, and tuned the operations that move data to and from the CPU over the PCI-Express bus. We have also developed and included at the top level of the design a module that allows the FPGA configuration firmware to be loaded onto the flash memory, using a program running under Linux. This makes it simple and quick to perform any required update of the firmware of the network processor.

The driver and the low-level communication library for the torus have been adapted and optimized for the specific instruction set of the Intel CPUs. Figure 2.2.1 shows the aggregate bandwidth provided by the implementation with 1, 2 and 3 links concurrently active. The measurement has been performed by running two threads exchanging messages with various payload sizes. The maximum bandwidth achieved is ≈ 500 MB/s per link, for message payloads larger than 8 KBytes, and it is not affected by the number of active links. The latency, i.e. the time to transfer a message of 128 Bytes, is ≈ 1.7 μs. To validate the functionality and reliability of the network processor we have also developed specific tests. In particular, we have tested communications along all six physical directions, with various payloads and different virtual channels. These tests have also been used to validate part of the design of the node-cards, and to verify machine stability at system level.

A second area of activity was the development of tools to manage the installation and operation of large machines, and the implementation of application programs used for network-processor reliability tests and as performance benchmarks. We have implemented and made available to the project a set of tools to install and configure the root file-system of the Aurora nodes, cloning it from that of the front-end master. We have also provided low-level basic programs to set up the configuration of the network and the torus topology for different partitions of the machine.

Figure 2.2.1. Aggregate bandwidth of the torus network with 1, 2 and 3 links concurrently active.

The last area of activity was the development, implementation and testing of two simulation programs. The former is a Monte Carlo simulation of a Heisenberg spin glass, and the latter is a solver for compressible 2D fluid-dynamics flows based on the Lattice Boltzmann method, including thermal and combustion effects. Both implementations have been developed to exploit parallelism over the cores of the CPUs and over the nodes of the machine. The program executed by each thread (we use one thread per core) uses the Intel SSE instruction set to also exploit streaming parallelism. The threads communicate over the torus network using the low-level communication library. Both simulations have been extensively used to test the stability of the machine at system level, and to obtain early benchmarks for the evaluation of the performance. The spin glass code has not been developed further, since it was intended only as a test environment for the machine, while the Lattice Boltzmann code has now evolved into a full production-level physics code; it is described in Section 3.2, together with the results that have been obtained with it.

2.3 High Level Torus Libraries (ECT*-FBK, INFN, FIS-UNITN groups)

In order to make the torus network functionality available to programs originally developed with MPI, a higher-level TORUS library has been developed on top of the low-level FTNW library. The TORUS library relies on a "proxy" service which is in charge of supporting the FTNW communication functions (which are thread-safe) in a multi-process environment. One part of the TORUS library (TORMPIlib) ensures support for standard MPI communication functions. A second part (TORUSlib) has been developed to optimize nearest-neighbor communications in an MPI program. In all cases the solutions are designed to route communications via the torus network, via standard MPI over Infiniband, or via dedicated shared memory areas (trying to minimize latencies). While the libraries have been developed in C, Fortran programs are also supported.
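As an illustration of the communication pattern that TORUSlib is designed to accelerate, the following is a minimal sketch of a nearest-neighbor halo exchange written in plain MPI over a periodic 3D Cartesian communicator. The actual TORUS/FTNW API is not reproduced here; buffer names and sizes are purely illustrative.

#include <mpi.h>
#include <stdlib.h>

#define HALO 1024   /* illustrative size of one face buffer (doubles) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* periodic 3D grid of ranks, mirroring the torus topology */
    int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1}, nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 3, dims);
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    double *sendbuf = calloc(HALO, sizeof(double));
    double *recvbuf = calloc(HALO, sizeof(double));

    /* exchange one face with the neighbours along each of the 3 directions;
       a layer such as TORUSlib can route exactly this pattern over the
       FPGA torus instead of Infiniband                                      */
    for (int d = 0; d < 3; d++) {
        int minus, plus;
        MPI_Cart_shift(cart, d, 1, &minus, &plus);
        MPI_Sendrecv(sendbuf, HALO, MPI_DOUBLE, plus,  d,
                     recvbuf, HALO, MPI_DOUBLE, minus, d, cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(sendbuf, HALO, MPI_DOUBLE, minus, d + 3,
                     recvbuf, HALO, MPI_DOUBLE, plus,  d + 3, cart, MPI_STATUS_IGNORE);
    }

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}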

At the very first stage of development of the FTNW library, an emulator (ATNemu) has been developed to emulate the FTNW communication environment, which has since then been used to support the design of the high level libraries and test programs.

One of the first and most important applications of the TORUS library has been the porting of the tmLQCD code (see Section 3.1 below). Through it, the full tmLQCD code now has access to the FTNW communication routines, wherever they may offer better performance than the Infiniband network. Tests have shown that the overhead introduced by this layer is absolutely acceptable. A completely independent confirmation of the effectiveness of the TORUS library came from a totally different LQCD code: for benchmark purposes, a Fortran LQCD program implementing the overlap regularization of fermions was easily adapted to the Aurora toroidal network by making use of the TORUS library.

The TORUS library is being used also by a Monte Carlo code for Many-Body Nuclear Physics, adapted to exploit the nearest-neighbor network (see Section 3.3). Moreover, the TORUS library has been extended to support the GROMACS code, which is used for Molecular Dynamics simulations.

2.4 Algorithmic Developments (under the coordination of the DEI-UNIPD group)

The research unit DEI-UNIPD has carried out work along the following lines.

(a) Models of computation.

(a1) Tori of multi-core nodes. A model of computation is being developed for computing systems consisting of multi-core computational nodes connected by a toroidal network. This model should provide a theoretical foundation for optimizing the implementation of algorithms on the Aurora machine. A target case-study on multidimensional discrete Fourier transforms will be developed in collaboration with the ECT*-FBK research unit.

(a2) Portability of performance across networks. The general issue of portability of software has been studied in the framework of network obliviousness. Particular attention is being devoted to the analysis of algorithms that deploy near-neighbor communication patterns on regular grids. This study will enable us to investigate the portability of algorithms optimized for the Aurora toroidal network to a variety of other interconnections with optimal or near-optimal performance. The results apply even to the general-purpose Infiniband network of the Aurora machine.

(b) Routing algorithms. The current design of Aurora includes a toroidal network that can support near-neighbor communication, particularly useful for a class of scientific applications. The network and the routing algorithm are implemented on FPGA hardware. It would be of great interest to equip future generations of the Aurora architecture with generalized routing capabilities, enabling messages to be sent to nodes in arbitrary positions. This theme is being investigated in collaboration with the research unit INFN-Ferrara. As a first step, the performance of some algorithms is being explored via simulations. The algorithms that will appear more promising in simulation will be implemented in the FPGA network node and their performance analyzed experimentally.

(c) Algorithm optimization for Genome Assembly. In collaboration with the IASMA-FEM research unit, DEI-UNIPD is investigating algorithms relevant to genome assembly in the context of next-generation sequencing technology. The focus of the research is the optimization for platforms with parallel computational resources and hierarchical memories. In particular, the target is the closure of the gaps left in the genome sequences obtained by traditional techniques.

References for Section 2.

[1] “AuroraScience”, The AuroraScience Collaboration. PoS (Lattice 2010) 039.
[2] “The AuroraScience Project: The Machine”, M. Pivanti, S.F. Schifano, H. Simma. PoS (Lattice 2010) 038.
[3] “The AuroraScience Project”, AuroraScience Collaboration. SC10 conference, New Orleans, LA, USA, November 13-19, 2010.
[4] “New Area-Time Lower Bounds for the Multidimensional DFT”, G. Bilardi and C. Fantozzi. 17th Computing: the Australasian Theory Symposium (CATS 2011), Perth, Australia, January 17-20, 2011.
[5] “Efficient Stack Distance Computation for Priority Replacement Policies”, G. Bilardi, K. Ekanadham, P. Pattnaik. 8th ACM International Conference on Computing Frontiers (CF'11), Ischia, Italy, May 3-5, 2011.
[6] “psort, yet another fast stable sorting software”, P. Bertasi, M. Bressan, E. Peserico. ACM Journal of Experimental Algorithmics. Accepted for publication.

3. Scientific Applications (WP2)

3.1 Lattice QCD (ECT*-FBK, INFN groups)

Within the AuroraScience project, research on Lattice QCD is being pursued by the groups at ECT*-FBK and INFN, in particular the group of Parma. Three positions have been opened by ECT*-FBK with a focus on Lattice QCD: one taken by Luigi Scorzato (since September 2009), one by Marco Cristoforetti (since April 2010) and one by Michele Brambilla (since January 2011), who is working with the INFN group of Parma. Luigi Scorzato is one of the developers of the Monte Carlo code (tmLQCD) used by the European Twisted Mass Collaboration, which is one of the main Lattice QCD collaborations worldwide (see e.g. [1]) and includes a large group of INFN associates. For this reason, tmLQCD has been the main reference code for Lattice QCD in this project. However, two other codes have also been used: the PRlgt code (see Section 3.1.5) and a Fortran code for overlap fermions.

3.1.1 Algorithmic Development and Benchmarks

Algorithmic developments for Lattice QCD can be divided into two main activities:

1. The first activity was to adapt the overall structure of the tmLQCD code to better exploit the features of a 3D-torus network with low latency and high bandwidth. A suitable parallelization strategy (called time-split) for similar architectures was proposed in [2]. Its implementation required the re-organization of the data layout in the tmLQCD code. This task was done in a portable way by using MPI directives, while access to the 3D-torus network is enabled by the TORUS library. By now many different implementations of the Dirac operator are available and they have been tested on the Aurora prototype.

2. A second activity was to adapt tmLQCD to better exploit the multicore structure of modern processors. In particular, the Intel Westmere processor has 6 cores per chip, and two such processors are mounted in each Aurora node card and see the same RAM. The original tmLQCD code was based on a "distributed memory" paradigm implemented via MPI. However, for multicore systems it may be convenient to adopt a "shared memory" paradigm, using openMP directives instead (a schematic illustration of the hybrid approach is sketched below). This activity started in May 2010, and by now openMP directives are implemented and tested in all the critical tmLQCD routines. Extensive tests have been performed for the different parallelization strategies inside the node card (only MPI, only openMP, or a mixture of the two) and allowed us to select the optimal configuration in different setups.
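The hybrid strategy can be illustrated with the following minimal sketch. This is not tmLQCD code: the kernel below is a generic linear-algebra operation of the kind found in iterative solvers, with illustrative names and sizes. OpenMP threads split the local lattice sites inside a node, while MPI combines results across nodes.

#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* request funnelled threading: only the master thread calls MPI */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    const int local_sites = 1 << 20;        /* illustrative local volume */
    double *x = malloc(local_sites * sizeof(double));
    double *y = malloc(local_sites * sizeof(double));
    for (int i = 0; i < local_sites; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* OpenMP threads share the node memory and split the local sites ... */
    double local_dot = 0.0;
    #pragma omp parallel for reduction(+ : local_dot)
    for (int i = 0; i < local_sites; i++) {
        y[i] += 0.5 * x[i];                 /* stand-in for the lattice kernel */
        local_dot += x[i] * y[i];
    }

    /* ... while MPI combines the per-node partial results across nodes */
    double global_dot;
    MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    free(x); free(y);
    MPI_Finalize();
    return 0;
}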

The results are the following. First, we compared the performance on the Aurora prototype of the newly developed time-split algorithm with that of the previously most efficient version of the code (called half-spinor). The half-spinor algorithm is the one currently used by the ETMC collaboration on BlueGene/P systems.

The performance gain on the largest physical lattice that we considered (48³ × 96) is very satisfactory: a factor between 1.8 (using 4 node cards) and 1.5 (using one chassis = 32 node cards). On smaller physical lattices (32³ × 64) the gain goes down to a factor between 1.6 and 1.2. On the smallest physical lattices (16³ × 32) the half-spinor algorithm is more efficient. This behavior is expected, since the time-split algorithm is optimized for large lattices. Some points of this comparison are shown in Figure 3.1.1.

[Figure: performance per node card (GFlops) vs. number of node cards, for lattices 48³ × 96 and 32³ × 64.]

Figure 3.1.2. Example of strong scaling (left) and weak scaling (right). In the case of weak scaling, the lattice sizes in the legend refer to 32 nodes.

With the new algorithm we obtained up to 12% of the peak performance (which is 150 Gflops per node card). In Figure 3.1.2 (left) we show an example of strong scaling, in which the problem size is kept fixed and distributed across an increasing number of nodes. In the right panel of the same figure we show an example of weak scaling, in which the size of the problem in a single node is kept fixed while the full problem size grows proportionally to the number of nodes. Weak scaling is nearly ideal, which means that the Infiniband network is still very efficient up to the sizes that we considered. Strong scaling is sensitive to more effects, but it appears to be still quite good up to 32 nodes. The code can also use the 3D network, but the present size of the Aurora prototype is too small to see the advantage. More details can be found in Ref. [4].
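For reference, the two scaling modes can be quantified with the standard efficiency definitions (not specific to this code), where T(N) is the wall-clock time on N node cards for a fixed global problem and t(N) is the time per iteration when the per-node problem size is held fixed:

\[
E_{\mathrm{strong}}(N) = \frac{T(1)}{N\,T(N)}, \qquad
E_{\mathrm{weak}}(N) = \frac{t(1)}{t(N)} .
\]

Ideal weak scaling corresponds to an efficiency close to 1, which is what the right panel of Figure 3.1.2 shows.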

3.1.2 Production of Dynamical Nf=4 Gauge Configurations for the Computation of the Renormalization Factors

The computation of the renormalization factors (RF) is an essential ingredient for the precise computation of many physical quantities with Lattice QCD techniques. Since the RF are defined in the chiral limit, the best way to compute them non-perturbatively is via dedicated simulations with 4 degenerate light quarks. In this case the ordinary HMC algorithm is the simplest choice.

[Figure: performance per node card (GFlops) vs. number of node cards, 48³ × 96 lattice, time-split vs. half-spinor.]

Figure 3.1.1. Performance gain of the time-split algorithm.

The time-split algorithm is now being used in Aurora to produce dynamical Nf = 4, L = 24, Twisted Mass, Iwasaki gauge configurations with the goal of computing non perturbatively the renormalization factors. With these parameters the code is performing a factor 1.5 better than the previous version using the half-spinor algorithm. Production runs have now been started and the Aurora prototype has already collected 150 gauge configurations on the first simulation point.

3.1.3 Implementation of the Laplacian-Heaviside Method for the Computation of Hadronic Correlation Functions

The computation of multi-hadron correlators and, more generally, of correlators involving annihilation diagrams has always been a great challenge for the Lattice approach to QCD. A new technique, which builds on a long experience of previous attempts at challenging LQCD computations, has recently been proposed in [3].

Given the high relevance of these calculations for Nuclear Physics and our previous experience with similar techniques, the group at ECT*-FBK (in collaboration with the University of Cyprus) has implemented this method in the tmLQCD code. The essential ingredient of the method is the computation of the low-lying eigenvalues (and associated eigenvectors) of the 3D covariant Laplacian operator.

[Figure: effective mass Meff vs. t/a.]

Figure 3.1.3. Pion correlation function computed with different techniques. The one-end-LapH method (red) enjoys both reduced noise, as the one-end method (blue), and an earlier plateau, as the LapH method (green).

This was tested on a set of L=16 tmLQCD configurations and the advantages expected from [3] were confirmed. The green points of Figure 3.1.3 show (for the simple case of the pion) that indeed a plateau is reached earlier than with the standard technique used for the blue points.

Besides this, we also showed that the techniques of noise reduction developed previously by the ETM Collaboration [5] can be effectively combined with the novel method proposed in [3], which leads to both an earlier plateau and reduced noise, as shown by the red points in Figure 3.1.3.

3.1.4 Computation of the Mass Difference between Up and Down Quarks

The INFN group of Roma (“Iniziativa specifica RM123”) has used the Aurora prototype system to compute for the first time the difference between the down and up quark masses in the Twisted Mass regularization. This was possible thanks to a novel method, proposed by the Roma group (as yet unpublished), which allows the QCD isospin-breaking effect to be treated as a first-order perturbation of the QCD Lagrangian. By computing pseudo-scalar correlation functions on a lattice of 24³ × 48 points with a lattice spacing of 0.085 fm, it has been possible to obtain a value of (m_d - m_u) = 2.5(5) MeV.

Figure 3.1.4 shows the chiral extrapolation to the physical point of (m_d - m_u), computed at unphysical pion masses ranging from 300 to 500 MeV. The contours in the plot represent the 68% C.L. interval of the extrapolation. For each point in the plot, 150 gauge configurations have been used. All the correlation functions used for this analysis have been computed on Aurora.

Figure 3.1.4. Chiral extrapolation of the mass difference m_d - m_u.

3.1.5 Numerical Stochastic Perturbation Theory

The Numerical Stochastic Perturbation Theory (NSPT) code PRlgt has also run on Aurora. Configurations have been generated in multi-thread mode exploiting spare nodes. The NSPT code has been the playground for the joint exploitation of the torus and Infiniband networks. Although still preliminary, this code shows that such a programming approach is indeed viable. In the NSPT order-by-order inversion of the Dirac operator, computations proceed by going back and forth between configuration and momentum space. One thus needs an FFT, which can easily be implemented over Infiniband, overlapped with configuration-space computations which instead make use of the torus network.
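The overlap of the two networks can be sketched as follows. This is a schematic skeleton, not the PRlgt code: the non-blocking point-to-point exchange stands in for the FFT data movement over Infiniband, and the local loop for the configuration-space work that, in the real code, communicates over the torus library. Names and sizes are illustrative.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int chunk = 1 << 16;              /* illustrative message size */
    double *send = calloc(chunk, sizeof(double));
    double *recv = calloc(chunk, sizeof(double));
    double *conf = calloc(1 << 20, sizeof(double));

    /* start the momentum-space (FFT) data exchange over Infiniband ... */
    MPI_Request req[2];
    int next = (rank + 1) % nprocs, prev = (rank + nprocs - 1) % nprocs;
    MPI_Irecv(recv, chunk, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(send, chunk, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &req[1]);

    /* ... and overlap it with configuration-space work, which in the real
       code would communicate over the torus network via the low-level library */
    for (int i = 0; i < (1 << 20); i++)
        conf[i] = 0.9 * conf[i] + 0.1;

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);   /* FFT data now available */

    free(send); free(recv); free(conf);
    MPI_Finalize();
    return 0;
}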

References for Section 3.1

[1] “Light meson physics from maximally twisted mass lattice QCD”, ETM Collaboration. JHEP 1008 (2010) 097.
[2] “QCD on the Cell Broadband Engine”, F. Belletti et al. PoS LAT2007 (2007) 039.
[3] “A Novel Quark-field Creation Operator Construction for Hadronic Physics in Lattice QCD”, J. Bulava et al. Phys. Rev. D 79 (2009) 034505.
[4] “AuroraScience”, The AuroraScience Collaboration. PoS (Lattice 2010) 039.
[5] “Dynamical twisted mass fermions with light quarks: simulation and analysis details”, ETM Collaboration. Comput. Phys. Commun. 179 (2008) 695.

3.2 Fluid-dynamics (INFN-Ferrara group)

The Lattice Boltzmann (LB) method is a very flexible approach, able to cope with many different fluid equations (e.g., multiphase, multicomponent and thermal fluids) and to handle complex geometries and boundary conditions. LB builds on the fact that the details of the interaction among the fluid components at the microscopic level do not change the structure of the equations of motion at the macroscopic level, but only modulate the values of their parameters. The idea is then to create on the computer a simple synthetic dynamics of fictitious particles that evolve explicitly in time and, appropriately averaged, provide the correct values of the macroscopic quantities of the flow.

From the computational point of view, LB is local (it does not require the computation of non-local fields, such as the pressure), so it is easy to parallelize on massively parallel machines. The challenge is to exploit the easily identifiable parallelism available in the algorithm by matching it with all parallel resources, at both the inter-processor and the intra-processor scale.

In this first step, we have ported and tested a 2D code that uses 37 populations (a so-called D2P37 code, in LB jargon).

Lx/Np × Ly     480 × 32000    864 × 16000    3996 × 4000
comm           6.35-38.49%    8.83-25.16%    0.88-5.20%
stream         41.66%         39.12%         39.15%
bc             0.07%          0.14%          0.59%
collide        57.54%         56.70%         56.01%
Rmax           29.66%         30.04%         29.91%
T/site (μs)    0.16           0.16           0.16
MLUps          6.1            6.2            6.1

Table 3.2.1. Performance of the code without combustion.

Our LB code runs in two different versions, one of which includes the effects of combustion; from the computational point of view, combustion requires that the divergence of the velocity field be computed at each time step. Table 3.2.1 summarizes the performance results for our optimized code. In the table, the first row reports the size of the sub-lattice allocated to each node of the machine; the next rows display the relative time budget associated with the key steps of the algorithm, the sustained efficiency achieved (Rmax), the time needed to process one lattice site (T/site), and its inverse, customarily measured in Millions of Lattice-cell Updates per second (MLUps).

Most of the time is spent in the computation of the stream and collide phases, while the time spent on communication completely overlaps with the computation of the stream phase. The high variability of the communication time is due to the de-synchronization of the nodes during each iteration. This configuration achieves a maximum efficiency of ≈ 30%, corresponding to ≈ 0.768 Tflops when run on a 16-board machine.
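A minimal skeleton of one LB time step, with the communication overlapped with the stream phase as described above, could look as follows. This is a schematic outline, not the actual D2P37 code; the kernel functions are empty placeholders standing for the per-cell operations on the 37 populations.

#include <mpi.h>

/* Placeholder kernels: in the real code these act on the populations of
   every lattice cell (bulk = interior sites, frame = border sites).      */
static void start_halo_exchange(MPI_Request req[4])
{
    /* real code: post non-blocking sends/recvs of the border columns */
    for (int i = 0; i < 4; i++) req[i] = MPI_REQUEST_NULL;
}
static void stream_bulk(void)  { /* propagate populations on interior cells */ }
static void stream_frame(void) { /* propagate populations on border cells   */ }
static void apply_bc(void)     { /* boundary conditions at the walls (bc)   */ }
static void collide_all(void)  { /* local collision step on every cell      */ }

/* One LB time step: communication is overlapped with the bulk stream phase. */
static void lb_time_step(void)
{
    MPI_Request req[4];
    start_halo_exchange(req);                 /* comm                          */
    stream_bulk();                            /* stream, overlapping comm      */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    stream_frame();                           /* stream on cells needing halo  */
    apply_bc();                               /* bc                            */
    /* (with combustion enabled, a second exchange "comm2" would go here)      */
    collide_all();                            /* collide                       */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    for (int t = 0; t < 10; t++)              /* illustrative number of steps  */
        lb_time_step();
    MPI_Finalize();
    return 0;
}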

If the computation steps associated with the combustion process are enabled, the code performs a slightly different sequence of computational steps, with a non-negligible impact on performance. Computing the temperature while taking into account the divergence of the velocity field requires in fact an additional communication step (comm2), scheduled after the computation of the boundary conditions and before the computation of the collide step. This is required to propagate the velocity values of the Y-border columns to the neighboring nodes. Moreover, during the computation of the collide phase, the code performs a new set of scattered memory accesses, similar to those of the stream phase, to access field values of cells at distance 1, 2 and 3.

Lx/Np × Ly     432 × 32000    864 × 16000    3996 × 4000
comm           4.99%          3.52%          0.87%
stream         38.74%         39.26%         40.32%
bc             0.07%          0.14%          0.61%
comm2          5.66%          4.18%          0.89%
collide        52.07%         53.53%         53.14%
Rmax           25.36%         25.64%         26.18%
T/site (μs)    0.19           0.19           0.19
MLUps          5.13           5.19           5.30

Table 3.2.2. Performance of the code with combustion.

Performance results for this code are shown in Table 3.2.2. The performance is lower than for the previous code: we reach a sustained efficiency of ≈ 25%, which translates to ≈ 0.64 Tflops on a 16-board machine.

If combustion is not enabled, the structure of the code can be further optimized. The computation of the stream and collide phases can be merged into one single computational step applied to each cell of the lattice, saving the time needed to access the main memory when storing the results of the stream phase and re-loading them during the execution of the collide phase. The performance of this implementation is shown in Table 3.2.3. Most of the execution time is spent in step 3, which includes the collide computation, the most floating-point-intensive part of our code. This code achieves a maximum efficiency of ≈ 36-38%, which translates to ≈ 1 Tflops. Table 3.2.4 reports the performance of the code running on 4, 8 and 16 nodes, for a lattice size of 3840 × 16000 cells. Efficiency remains in the range of 31-37%, showing an almost linear scaling of the implementation.


Lx/Np × Ly     432 × 32000    864 × 16000    3996 × 4000
step1          7.14%          5.10%          1.15%
step2          0.18%          0.24%          0.75%
step3          89.76%         91.28%         92.63%
Rmax           36.22%         37.37%         38.78%
T/site (μs)    0.13           0.13           0.12
MLUps          7.42           7.65           7.94

Table 3.2.3. Performance results of the optimized code with the stream and collide routines merged in a single step (this optimization can only be applied if combustion is not enabled).

Lx × Ly        3840 × 16000   3840 × 16000   3840 × 16000
Np             4              8              16
step1          4.5%           5.7%           10.7%
step2          0.22%          0.25%          0.22%
step3          92.13%         89.92%         84.75%
Rmax           37.46%         33.55%         31.43%
T/site (μs)    0.13           0.14           0.15
MLUps          7.68           6.88           6.45

Table 3.2.4. Scaling of our code for a simulation of lattice size Lx × Ly = 3840 × 16000 cells.

Figure 3.2.5 shows snapshots of temperature and vorticity during the evolution of a Rayleigh-Taylor instability, triggered by a cold fluid mixing with a hot fluid under the action of gravity. The snapshots have been produced by simulating a lattice of 512 × 1000 grid points on an 8-node prototype of the Aurora machine. The results described here have been presented in Refs. [1-3].

References for Section 3.2

[1] “The AuroraScience Project: The Machine”, M. Pivanti, S.F. Schifano, H. Simma. PoS (Lattice 2010) 038.
[2] “The AuroraScience Project”, AuroraScience Collaboration. SC10 conference, New Orleans, LA, USA, November 13-19, 2010.
[3] “Lattice Boltzmann Methods Simulations on Massively Parallel Multi-core Architectures”, L. Biferale et al. Submitted to HPC 2011: 19th High Performance Computing Symposium, Boston, February 2011.

Figure 3.2.5. Snapshots of temperature and vorticity in the evolution of a Rayleigh-Taylor instability, simulated on a grid of 512 × 1000 sites, using 8 processing nodes.

3.3 Nuclear Physics: Structure and Reactions (FIS-UNITN group)

3.3.1 Quantum Monte Carlo Methods

The Auxiliary Field Diffusion Monte Carlo (AFDMC) method, an extension of the DMC method for many-nucleon systems, allows for accurate calculations of the binding energy and other quantities both in nuclei with mass number up to A=40 [1] and in nuclear matter [2]. Recent applications include the accurate computation of the 1S0 gap in superfluid neutron matter [3] and the computation of the equation of state of nuclear matter with a density dependent potential, which accurately describes constraints coming from astrophysical observations and heavy ion collisions [4].

In order to better exploit the features of Aurora it was necessary to isolate the most CPU-time-consuming sections of the code. First, the performance was measured on a single Nehalem processor. As expected, it was found that the critical section is the computation of the Slater determinants needed for the evaluation of the wave functions. This computation involves the use of standard linear algebra routines. We concluded that at the present stage the performance on a single processor is sufficient, and there is no need to parallelize this section of the code.

DMC algorithms are based on the evolution of a population of points (walkers) in configuration space, and for each point the value of the wave function and of its derivatives must be computed. Each computation scales as A³, where A is the number of nucleons in the system studied. A simple way of exploiting the parallel architecture consists in distributing the walkers to be processed across the computing cores. However, the population is not constant in time, and the points must be re-distributed among the processors in order to keep the working load balanced. This part of the algorithm can be efficiently implemented on the toroidal network; it has been tested first on the Aurora emulator, and more recently on the first installation of the machine, with promising results. However, sizable effects on the scaling will be observed only for larger-scale computations, which will require the machine in its next-stage configuration. The activity was carried out in cooperation with Alexey Yu. Illarionov, who was hired on AuroraScience funds.
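The rebalancing step can be sketched as follows. This is a minimal illustration written in plain MPI over a periodic Cartesian communicator (the production code relies on the TORUS library instead), with a simple diffusive rule and purely illustrative sizes and names: each node compares its walker count with its neighbour in the positive direction and ships half of any surplus there.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define WK 198  /* doubles per walker, e.g. 3 coordinates x 66 particles (illustrative) */

/* Send 'nsend' walkers from the tail of the local population to 'dst' and
   append whatever 'src' sends us; the population size *n is updated.       */
static void shift_walkers(double **w, int *n, int nsend, int dst, int src, MPI_Comm comm)
{
    int nrecv = 0;
    MPI_Sendrecv(&nsend, 1, MPI_INT, dst, 1, &nrecv, 1, MPI_INT, src, 1,
                 comm, MPI_STATUS_IGNORE);

    /* stage outgoing walkers separately (MPI send/recv buffers must not overlap) */
    double *out = malloc((size_t)nsend * WK * sizeof(double));
    memcpy(out, *w + (size_t)(*n - nsend) * WK, (size_t)nsend * WK * sizeof(double));

    *w = realloc(*w, (size_t)(*n - nsend + nrecv) * WK * sizeof(double));
    MPI_Sendrecv(out, nsend * WK, MPI_DOUBLE, dst, 2,
                 *w + (size_t)(*n - nsend) * WK, nrecv * WK, MPI_DOUBLE, src, 2,
                 comm, MPI_STATUS_IGNORE);
    free(out);
    *n += nrecv - nsend;
}

/* One diffusive sweep along torus direction 'dim': learn the population of
   the +dim neighbour and ship half of any surplus in that direction.       */
static void rebalance_dim(MPI_Comm cart, int dim, double **w, int *n)
{
    int minus, plus, n_plus = 0;
    MPI_Cart_shift(cart, dim, 1, &minus, &plus);
    MPI_Sendrecv(n, 1, MPI_INT, minus, 0, &n_plus, 1, MPI_INT, plus, 0,
                 cart, MPI_STATUS_IGNORE);
    int surplus = (*n - n_plus) / 2;
    shift_walkers(w, n, surplus > 0 ? surplus : 0, plus, minus, cart);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nprocs, dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 3, dims);
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    int n_local = 1000;                          /* illustrative walker count */
    double *walkers = calloc((size_t)n_local * WK, sizeof(double));

    /* after each branching step, sweep the three torus directions */
    for (int d = 0; d < 3; d++)
        rebalance_dim(cart, d, &walkers, &n_local);

    free(walkers);
    MPI_Finalize();
    return 0;
}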

Porting the AFDMC code to Aurora is essential for the development of this research line. Figure 3.3.1 reports recent results on the thermal contribution to the free energy per nucleon F at temperature T, defined as ΔF = F(T) - F(T=0), as a function of the baryon density. These calculations, which employ both Fermi Hyper-Netted Chain (FHNC) and AFDMC methods, are essential for the development of a temperature-dependent equation of state of nuclear and neutron matter. In this kind of calculation, a precise determination of the finite-size corrections to the value of the binding energy per nucleon is necessary. This implies calculations for a system of A nucleons, with 28 ≤ A ≤ 132, in a simulation box under periodic boundary conditions. The computational time required for each simulation scales at least as A³, and the overall CPU time required for a single computation quickly becomes unmanageable unless a very good scaling of the time with the number of processors used is reached.

Within the framework of the research on Quantum Monte Carlo methods, the porting has been started of a code, developed in collaboration with the Lawrence Livermore National Laboratory (LLNL), implementing the Fermion Monte Carlo algorithm, which is meant to overcome the present accuracy limitations due to the fermion sign problem.

References for Section 3.3.1

[1] “Auxiliary Field Diffusion Monte Carlo calculation of nuclei with A ≤ 40 with tensor interaction”, S. Gandolfi, F. Pederiva, S. Fantoni, and K.E. Schmidt. Phys. Rev. Lett. 99, 022507 (2007).
[2] “Quantum Monte Carlo calculations of symmetric nuclear matter”, S. Gandolfi, F. Pederiva, S. Fantoni, and K.E. Schmidt. Phys. Rev. Lett. 98, 102503 (2007).
[3] “Equation of state of low-density neutron matter, and the 1S0 pairing gap”, S. Gandolfi, A.Yu. Illarionov, F. Pederiva, K.E. Schmidt, and S. Fantoni. Phys. Rev. C 80, 045802 (2009).
[4] “Microscopic calculation of the equation of state of neutron matter and neutron star structure”, S. Gandolfi, A.Yu. Illarionov, S. Fantoni, J.C. Miller, F. Pederiva, K.E. Schmidt. Month. Not. Royal Ast. Soc. 404, L35 (2010).

Figure 3.3.1: Thermal contribution to the free energy per nucleon ΔF/A (in MeV) as a function of the baryon number density (in fm-3) for a system of A=66 neutrons with periodic boundary conditions. Different point styles refer to calculations at different temperatures reported in the legend. Lines are analytic fits to the numerical results.

[Figure axes: ΔF(ρ,T)/A in MeV vs. ρ in fm⁻³; legend: T = 1, 5, 10, 15, 20, 25, 30 MeV; the two panels are labelled SNM and PNM.]

3.3.2 Lorentz Integral Transform

The Lorentz integral transform (LIT) method is presently the only approach for ab initio calculations of reactions involving the many-body continuum for more than three particles [1,2]. The LIT method was developed by the Trento group in collaboration with V.D. Efros (Kurchatov Institute, Moscow) about 15 years ago. Since then, more than fifty LIT applications have been published. The bulk of the papers concentrated on electromagnetic and weak reactions with nuclear three- and four-body systems using realistic two- and three-nucleon forces. In order to go to more complex systems like 6Li or 6He one needs on the one hand high-performance computational resources and on the other hand further improvements of the codes.

Recently we have made an important step in this direction: in [3] we succeeded in incorporating non-local nuclear forces into our effective-interaction hyperspherical harmonics code. Figure 3.3.2 shows that the use of the Lee-Suzuki transformation, when applied to a non-local potential to generate the corresponding effective interaction, speeds up the convergence considerably. The results obtained in [3] enable us to work with chiral nuclear forces within the LIT approach as well. Such forces, obtained within chiral effective field theory, have two very important advantages: (i) the principal one is that they can be directly related to the symmetries that govern QCD, and (ii) a more technical one, i.e. their rather soft short-range repulsion, which allows a considerable decrease of the necessary numerical effort.

A further important achievement of the past year is the optimization and parallelization of our codes, tailored to the Aurora system. This work has been accomplished with the help of Luping Yuan, who is supported by AuroraScience. With an optimized parallel version of the code and a high-performance computer available, calculations can now be pushed to more realistic nuclear interactions. Preliminary tests on Aurora have shown that, in a case involving huge matrices, a single run using 8 nodes is almost 10 times faster than the implementation on standard clusters.

Finally, we would like to mention that we have been able to extend the LIT method to include isobar degrees of freedom (Delta) in the calculation of an electromagnetic reaction [4]. The resulting coupled equations have been treated in the impulse approximation, solving the three-nucleon and the two-nucleon-plus-one-Delta channels separately. A very interesting result has been found at higher momentum transfer: the Delta-isobar current contributions are largely canceled by the relativistic contribution from the one-body currents. This cancellation leads to good agreement of our theoretical results with experimental data at very low energy transfer, while the experimental data are underestimated at somewhat higher energies.

References for Section 3.3.2

[1] “Response functions from integral transforms with a Lorentz kernel”, V.D. Efros, W. Leidemann, and G. Orlandini. Phys. Lett. B 338, 130 (1994).
[2] “The Lorentz Integral Transform (LIT) method and its applications to perturbation-induced interactions”, V.D. Efros, W. Leidemann, G. Orlandini and N. Barnea. J. Phys. G 34, R459 (2007).
[3] “Hyperspherical effective interaction for nonlocal potentials”, N. Barnea, W. Leidemann and G. Orlandini. Phys. Rev. C 81, 064001 (2010).
[4] “Transverse electron scattering response function of 3He with Delta-isobar degrees of freedom”, L.P. Yuan, V.D. Efros, W. Leidemann, E.L. Tomusiak. Phys. Rev. C 82 (2010) 054003.

Figure 3.3.2. The ground-state energies of 6He and 6Li with the non-local bare JISP16 nuclear force and with the corresponding effective interaction.

3.4 Condensed Matter Simulations (FIS-UNITN group)

3.4.1 Simulation of the Dynamics of Bio- and Macro-molecules by Dominant Reaction Pathway (DRP) Methods

Part of the AuroraScience project concerns the development of an approach based on high-performance computing to investigate rare, thermally activated chemical and conformational transitions of (bio-)molecules. Rare noise-driven reactions of molecular systems cannot be efficiently investigated using Molecular Dynamics (MD) simulations, since in such an algorithm almost all of the computational time is wasted describing the thermal oscillations in the initial state. To overcome this problem, the Trento group has developed in recent years the Dominant Reaction Pathways (DRP) approach, which relies on the functional integral formulation of the stochastic Langevin dynamics [1,2,3].
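For orientation, the DRP approach starts from the standard overdamped Langevin dynamics and the associated path weight; schematically (textbook form, omitting the discretization-dependent Laplacian correction term):

\[
m\gamma\,\dot{\mathbf{x}} = -\nabla U(\mathbf{x}) + \boldsymbol{\eta}(t),
\qquad
\langle \eta_i(t)\,\eta_j(t') \rangle = 2\, m\gamma\, k_B T\, \delta_{ij}\,\delta(t-t'),
\]

so that the probability of a path is proportional to \( e^{-S_{\mathrm{OM}}[\mathbf{x}]} \), with the Onsager-Machlup functional

\[
S_{\mathrm{OM}}[\mathbf{x}] = \frac{1}{4\,m\gamma\,k_B T}\int_0^t d\tau\,
\big( m\gamma\,\dot{\mathbf{x}} + \nabla U(\mathbf{x}) \big)^2 ,
\]

and the dominant reaction pathway is, schematically, the trajectory that minimizes this functional between the given initial and final configurations.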

Two post-docs partially supported with AuroraScience funds, Silvio a Beccara and Marcello Sega, have played a crucial role in the development and optimization of a parallel code that implements the DRP approach for molecular systems at an all-atom level of detail, including an implicit description of the solvent-induced interactions. The molecular energy and the forces can be obtained either from classical, so-called "empirical force fields", or directly from the quantum mechanical determination of the electronic structure (e.g. through density functional calculations). Using this code, the AuroraScience group calculated the most probable pathway in the cyclobutene-butadiene transition [4], a reaction which simply cannot be studied with Car-Parrinello methods. Recently, we have also performed the first calculation of the folding of a peptide based on quantum mechanical calculations of the electronic structure [5].

The research work is concerned with the folding dynamics of biomolecules, in particular peptides and poly-peptides. The application of DRP to these systems is very computationally intensive, since it deals with macromolecules and requires the calculation of the Laplacian of the molecular potential energy at each simulation step. Typically, a minimum of 32 processors is used for very small molecules, while up to 400 processors are needed for medium-sized molecules.

In test runs, up to 192 computing cores of the Aurora machine were used. Benchmarks revealed nearly linear scaling of the computing speed with the number of cores used. In particular, we benefited from the very high performance of the Westmere processors and from the high speed of the network and of the storage system. From the experience gathered so far, we can conclude that Aurora will be a very useful tool for our biophysical computational work. To further exploit the peculiar characteristics of the machine, we are working on a torus version of our MPI code.
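A minimal sketch of what such a torus mapping could look like with mpi4py is shown below; the 3D decomposition and the neighbour exchange are purely illustrative and do not reproduce the actual communication pattern of our code.

from mpi4py import MPI

comm = MPI.COMM_WORLD
# factor the available ranks into a 3D grid and make it periodic (a torus)
dims = MPI.Compute_dims(comm.Get_size(), 3)
cart = comm.Create_cart(dims, periods=[True, True, True], reorder=True)
coords = cart.Get_coords(cart.Get_rank())

# nearest-neighbour exchange along the first torus direction
source, dest = cart.Shift(0, 1)
neighbour = cart.sendrecv({"rank": cart.Get_rank(), "coords": coords},
                          dest=dest, source=source)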

Tatjana Skrbic, a postdoc entirely supported with AuroraScience funds since January 2010, has developed a Monte Carlo code that has been used within the research group to investigate protein folding and protein-protein interactions at a coarse-grained level of resolution. The actual model implemented was adopted from Ref. [6]. The first results, based entirely on simulations performed on Aurora, show that specific so-called non-native interactions, which take place in the early stage of the folding reaction, play a crucial role in the formation of knots in proteins with topologically non-trivial native structures.

The principal quantity that describes the two-state kinetics leading to the formation of a protein complex AB from the two protein types A and B at equilibrium is the so-called binding affinity Kd, i.e. the concentration at which the probability of finding the complex AB formed equals that of finding the separate biomolecules A and B in solution. Alternatively, Kd can be determined as the concentration at which the specific heat of the system, C = <(U_tot − <U_tot>)²>, calculated as the ensemble average of the fluctuations of the protein-protein interaction energy U_tot, exhibits its maximum.
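In practice, C is estimated from the Monte Carlo samples of U_tot generated at each concentration, and Kd is read off from the position of the peak. A minimal numpy sketch of this post-processing step (with hypothetical input arrays, not the group's analysis code) is:

import numpy as np

def specific_heat(u_samples):
    """C = <(U_tot - <U_tot>)^2>: ensemble fluctuation of the sampled
    protein-protein interaction energies at one concentration."""
    u = np.asarray(u_samples, dtype=float)
    return np.mean((u - u.mean()) ** 2)

def binding_affinity(concentrations, specific_heats):
    """Take Kd as the concentration at which the specific heat peaks."""
    c = np.asarray(concentrations, dtype=float)
    return float(c[np.argmax(specific_heats)])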

Figure 3.4.3: Specific heat vs. protein concentration for the Ub/UIM protein complex at room temperature, calculated for different models: model A (squares), model B (circles), model C (upper triangles), model E (down triangles) and model F (stars). The positions of the peaks correspond to the binding affinities Kd for the different models (A-F) adopted.

[Plot for Figure 3.4.3: specific heat [(kcal/mol)²] versus concentration [µM], curves A-F.]

As an illustration of the results obtained with Aurora, Figure 3.4.3 shows the specific heat vs. protein concentration of the protein complex consisting of ubiquitin (Ub) and the so-called ubiquitin-interacting motif (UIM), i.e. a specific protein sequence that appears in all the proteins that interact with Ub. The specific heat curves are calculated employing different variants of the model (A-F), which differ in the weights with which the various protein parts (more or less exposed to the solvent) participate in the overall protein-protein interaction.

Figure 3.4.4 highlights the single specific heat curve whose peak is located at the experimentally measured binding affinity Kd = 280 μM of the Ub/UIM complex (Ref. [7]). We therefore conclude that model A is the most suitable one, and we will employ it in our studies.

References for Section 3.4.1

[1] “Dominant Protein Folding Pathways”, P.Faccioli, M.Sega, F.Pederiva, and H.Orland. Phys. Rev. Lett. 97, 108101 (2006).
[2] “Quantitative Protein Dynamics from Dominant Folding Pathways”, M.Sega, P.Faccioli, F.Pederiva, G.Garberoglio and H.Orland. Phys. Rev. Lett. 99, 118102 (2007).
[3] “Dominant Reaction Pathways in High Dimensional Systems”, E.Autieri, P.Faccioli, M.Sega, F.Pederiva and H.Orland. J. Chem. Phys. 130, 064106 (2009).
[4] “Ab-initio Dynamics of Rare Thermally Activated Reactions”, S.a Beccara, G.Garberoglio, P.Faccioli and F.Pederiva. J. Chem. Phys. 132, 111102 (2010).
[5] “Dominant folding pathways of a peptide chain, from ab-initio quantum-mechanical simulations”, S.a Beccara, P.Faccioli, G.Garberoglio, M.Sega, F.Pederiva, H.Orland. arXiv:1007.5235 (presently under review).
[6] “Coarse-grained models for simulations of multiprotein complexes: application to Ubiquitin binding”, Y.C.Kim and G.Hummer. J. Mol. Biol. 375, 1416 (2008).
[7] “Solution structure of Vps27 UIM-ubiquitin complex important for endosomal sorting and receptor downregulation”, K.A.Swanson, R.S.Kang, S.D.Stamenova, L.Hicke, I.Radhakrishnan. EMBO J. 22, 4597 (2003).

Figure 3.4.4: Specific heat vs. protein concentration for the Ub/UIM complex at room temperature, calculated for model A, for which the peak is located at a concentration of 280 μM, in agreement with the experimental findings (Ref. [7]).

[Plot for Figure 3.4.4: specific heat [(kcal/mol)²] versus concentration [µM], model A.]

3.4.2 Materials Simulations: Quantum ESPRESSO

We installed on Aurora the quantum chemistry suite "Quantum ESPRESSO". This is a set of programs that calculate electronic and structural properties of condensed matter systems (usually solids), using Density Functional Theory. A typical calculation aimed at studying modern materials requires 128 processors for at least 12 hours. The performance of the code is largely dependent on the presence of a fast network connecting the computing nodes. We verified that the Quantum ESPRESSO suite of programs scales quite well on Aurora.

3.4.3 Simulating Slow Scales in Soft Systems

The low-temperature phases of polydisperse systems are far from being fully understood. Three different scenarios have been proposed:

a) the terminal polydispersity scenario, where for large enough polydispersities no ordered phase is permitted;

b) the fractionation scenario, according to which the system undergoes a phase separation into many almost monodisperse crystals;

c) the crystal-amorphous scenario, which predicts a phase separation between a (barely polydisperse) ordered phase and a (largely polydisperse) disordered phase.

In order to gain physical insight into this fundamental issue for the statistical mechanics of off-lattice systems, we plan to perform on Aurora extensive Monte Carlo (MC) simulations of a simple model of polydisperse (24%) soft spheres. Moderately large systems (about 1000 particles) are presently under investigation. The main difficulty is related to the existence of very slow time scales, due both to glassiness and to the exponential critical slowing down at first-order transitions.

In order to achieve thermodynamic equilibrium, the Swap Monte Carlo algorithm (which exchanges the positions of two randomly chosen particles) has been used. In this initial stage, parallel computing is implemented at a very basic level, by means of the replica-exchange algorithm, which does not rely on massive message passing. For a system of 864 polydisperse particles, the code runs at a speed of about 1000 MC steps per second. As a reference, on previously used PC clusters the speed was typically about 500 MC steps per second.
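For illustration, a single swap move with its Metropolis acceptance test could be sketched in Python as follows; total_energy is a hypothetical callable for the soft-sphere potential, and the production code is of course written for performance rather than clarity.

import math
import random

def swap_move(positions, diameters, total_energy, beta):
    """One Swap Monte Carlo move: exchange the positions of two randomly
    chosen particles and accept or reject with the Metropolis rule."""
    i, j = random.sample(range(len(positions)), 2)
    old_energy = total_energy(positions, diameters)
    positions[i], positions[j] = positions[j], positions[i]
    new_energy = total_energy(positions, diameters)
    if new_energy <= old_energy or \
            random.random() < math.exp(-beta * (new_energy - old_energy)):
        return True                                           # move accepted
    positions[i], positions[j] = positions[j], positions[i]   # undo the swap
    return False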

Finally, the forthcoming stage of the project aims at simulating much larger systems (of the order of 10000 particles); thus, more refined and demanding parallel techniques, such as domain decomposition, will be needed.

References for Section 3.4.3

[1] "Separation and fractionation of order and disorder in highly polydispersesystems”, L.A.Fernandez, V.Martin-mayor, B.Seoane, P.Verrocchio. Phys. Rev. E 82, 021501 (2010).[2] "Phase diagram of a polydisperse soft-spheres model for liquids and colloids",L.A.Fernandez, V.Martin-Mayor, P.Verrocchio. in Phys. Rev. Lett. 98, 085702 (2007).

3.5 Monte Carlo for Protontherapy (ATreP group)

The GEANT4 Monte Carlo code was chosen as the library to be installed on the first Aurora prototype. GEANT4 and its ancillary libraries were installed successfully on an Aurora node. A few GEANT4-based test applications of increasing complexity were installed and run for functionality tests. These are applications not initially designed for multi-CPU/multi-core architectures. The most complex of them simulated the transport of proton beams through a beam line used for eye treatment. Being aimed at testing functionality, these runs did not simulate a number of events high enough to accumulate sufficient statistics.

The contract with FBK was finalized in early 2010, thus allowing ATreP to start a formal procedure to cover the position funded within AuroraScience. Two very good prospective candidates went through the selection procedure, which ended in May, but both declined when offered the position. The search for a collaborator continued from early June, but remained unsuccessful for a few months. In October, ATreP decided to add internal funds in order to be able to offer a 2-year position. A second round therefore started, which attracted four new candidates. This allowed us to select a good candidate at the post-doc level, with previous experience in computational applications in medical physics, who started his work on the 15th of January 2011. In his first days of work, GEANT4 was successfully installed on the new Aurora system.

[Plot for Figure 3.5.1: dose [a.u.] versus range [mm]; Monte Carlo simulated data (30000 protons, 152 MeV ± 0.7 MeV) and Essen measured data.]

Figure 3.5.1. Comparison between the real data from the Essen proton beam (red) and the results of the Monte Carlo simulations performed on Aurora (blue).

We have now developed a Monte Carlo simulation that can produce a realistic physical description of a therapeutic beam. The beam is simulated in a treatment room that reproduces the geometric setup of a real gantry for proton beam delivery. The treatment room can contain either a water phantom of any desired shape and size or a patient anatomy acquired by means of CT DICOM images. We have now started the "commissioning" phase of our code, in which our calculations are benchmarked against experimental data. For this purpose we are using data from WPE Essen (http://www.uk-essen.de/wpe/), a proton therapy center using beam delivery technology very similar to what will be available in Trento. Preliminary results show good quantitative agreement between real and simulated data. Figure 3.5.1 shows the real and simulated data for the Essen proton beam, and Figure 3.5.2 shows the spatial dose distribution of a 100 MeV proton beam in a real patient anatomy. This is a promising initial step towards the final aim of achieving mm to sub-mm agreement in the spatial dose distribution over the whole energy range needed for proton therapy.
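As an example of the kind of quantitative check performed during commissioning, the short Python sketch below extracts the distal range of a depth-dose curve (the depth at which the dose falls to a given fraction of the Bragg-peak maximum), so that measured and simulated ranges can be compared; the function and array names are hypothetical and do not belong to the actual commissioning tools.

import numpy as np

def distal_range(depth_mm, dose, level=0.8):
    """Depth at which the dose drops to `level` of its maximum on the
    distal side of the Bragg peak (linear interpolation)."""
    dose = np.asarray(dose, dtype=float) / np.max(dose)
    depth = np.asarray(depth_mm, dtype=float)
    i_peak = int(np.argmax(dose))
    distal_dose, distal_depth = dose[i_peak:], depth[i_peak:]
    # past the peak the dose decreases, so reverse the arrays for np.interp
    return float(np.interp(level, distal_dose[::-1], distal_depth[::-1]))

# hypothetical usage with measured and simulated depth-dose curves:
# range_shift_mm = distal_range(z_sim, d_sim) - distal_range(z_meas, d_meas)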

Figure 3.5.2 Spatial dose distribution of a 100 MeV proton beam in real patient anatomy

3.6 Bioinformatics (IASMA-FEM and DEI-UNIPD groups)

The genome sequence of an organism contains invaluable information that can speed up comparative and evolutionary genomic studies. The reduction of sequencing costs has recently led to an increase in the number of genome sequencing projects, which in turn has produced an explosion of draft assemblies deposited in the public databases.

Next Generation Sequencing (NGS) can be a valuable technology to improve draft genome assemblies (ordered contigs, possibly with gaps among them). The throughput of next-generation sequencers is orders of magnitude higher than that of earlier techniques, but the data are characterized by shorter DNA fragments (reads).

Genome finishing is a manual and labor-intensive procedure to fill the gaps among contigs. To our knowledge, only one tool is currently available to raise the quality of a draft assembly [1]. This tool is based on a Perl pipeline that automates some steps of the process by exploiting an external de novo assembler (Velvet). The approach has some limitations: first, there is no full control over the assembly process, which is particularly important for resolving repeated regions; second, parallelizing the algorithm to exploit recent multi-core computer architectures is very difficult.

We are developing a novel algorithm for draft genome finishing that exploits mate-pair information (reads at a known distance along the genome) from NGS data. The algorithm is optimized for speed and memory usage and is conceived to run on the Aurora High Performance Computing machine.

Starting from a draft genome assembly [2], the algorithm takes a pair of consecutive contigs, searches for putative reads that align on their boundaries and iteratively extends the two sequences toward each other to fill the gap between them. A graph-theoretical approach was chosen for combining the reads based on their sequence similarity. It consists in building a graph (a De Bruijn graph) of overlapping k-mers extracted from the reads and searching for Eulerian paths through it. An Eulerian path in the De Bruijn graph corresponds to a local assembly of the reads. Mate-pair data are indexed based on the k-mers they contain, for faster retrieval of the desired reads during the different phases of the algorithm.
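The following simplified Python sketch illustrates the principle on a toy example: it builds a De Bruijn graph from a handful of reads and reconstructs a local assembly by walking an Eulerian path (Hierholzer's algorithm). It ignores reverse complements, sequencing errors and the choice of the start node, all of which the real algorithm must handle.

from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer in the reads contributes
    one edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def eulerian_path(graph, start):
    """Hierholzer's algorithm: traverse every edge exactly once, starting
    from a node assumed to be a valid start of an Eulerian path."""
    adjacency = {node: list(succ) for node, succ in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if adjacency.get(node):
            stack.append(adjacency[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

# toy reads spanning a small gap: the walk spells out the local assembly
reads = ["ACGTGC", "GTGCAA", "GCAATT"]
walk = eulerian_path(de_bruijn(reads, k=4), start="ACG")
print(walk[0] + "".join(node[-1] for node in walk[1:]))   # -> ACGTGCAATT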

The algorithm implements a manager/worker paradigm in which many worker processes in parallel (a) build a De Bruijn graph based on the reads selected from two contig boundaries, (b) search for Eulerian paths through the graph and (c) return an assembly or an extension of the contigs. The algorithm has a dynamic number of tasks that varies according to how many pairs of contigs need to be joined; those tasks are assigned to processes at run-time. We distinguished the following parallel tasks (a minimal sketch of the manager/worker dispatch is given after the list):

• master/manager: collects contigs and distributes them among workers. Manages the whole genome finishing process.

• k-mers indexer: builds and maintains the index of k-mers. Listens and answers to queries from the workers.

• mate-pair indexer: builds and maintains the mate-pairs index. Listens and answers to queries from the workers.

• worker: given a pair of contigs, builds the De Bruijn graph and iteratively tries to fill the gap.
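The sketch below shows the manager/worker dispatch with mpi4py; the contig pairs and the gap-filling step are placeholders, and the two indexer processes are omitted for brevity, so this is an illustration of the communication pattern rather than the actual implementation.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
STOP = None                                    # sentinel work unit

if rank == 0:                                  # master/manager process
    work = [("contig_%d" % i, "contig_%d" % (i + 1)) for i in range(10)]
    results, active_workers = [], size - 1
    while active_workers > 0:
        status = MPI.Status()
        message = comm.recv(source=MPI.ANY_SOURCE, status=status)
        if message is not None:
            results.append(message)            # an assembly or an extension
        if work:
            comm.send(work.pop(), dest=status.Get_source())
        else:
            comm.send(STOP, dest=status.Get_source())
            active_workers -= 1
    print("pairs processed:", len(results))
else:                                          # worker process
    comm.send(None, dest=0)                    # signal readiness
    while True:
        pair = comm.recv(source=0)
        if pair is STOP:
            break
        # placeholder: build the De Bruijn graph for this pair, search
        # Eulerian paths and try to fill the gap between the two contigs
        comm.send(("result for", pair), dest=0)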

We designed and implemented the basic communication framework (see Figure 3.6.1) among the different processes using the Message Passing Interface (MPI) protocol and, at different levels of detail, some of the algorithm steps:

a) The mate-pair indexer program, needed each time a worker finds a read that aligns on a contig: this information is used to gather its mate-end, which can potentially be useful in the gap-filling phase.

b) The k-mer indexer program, which builds the index of the k-mers contained in the whole set of reads. The index is stored in an efficient hybrid data structure, conceived for parallelization with MPI: a quaternary tree with hash maps attached at the leaves. This solution trades off space occupancy against access time; each computational node stores a subset of k-mers based on the nucleotides they contain. This corresponds to cutting the quaternary tree at a certain depth and distributing the subtrees among the nodes. In order to save as much memory as possible, the hash map keys are stored with their base-4 encoding (a minimal sketch of this encoding is given below, after Figure 3.6.1).

c) The part of the master program that reads the ordered set of contigs to be distributed to the worker processes, and the worker procedure for selecting all the mate-pairs that share a k-mer with the considered contig.

d) The worker procedure for building the De Bruijn graph starting from the k-mers of the selected reads, and a procedure for collapsing chains of nodes and saving the resulting graph. The built graph is efficiently stored in a hash map as proposed in [3]: nodes (corresponding to k-mers) are stored as keys in the hash map, and the presence or absence of each ingoing and outgoing edge is stored in one bit. Since each node in the graph can have at most four ingoing and four outgoing edges, one for each of the four bases, only 8 bits of information are needed. Furthermore, the edges are weighted according to the frequency with which each k-mer is found in the reads: this information is very useful for identifying repeated regions and for eliminating sequencing errors during the graph traversal.

Figure 3.6.1. Diagrammatic representation of the basic communication framework: after a synchronization barrier and initialization, the workers signal that they are ready, the master sends work units, the workers query the k-mer and mate-pair indexes, report completion and finally receive a termination message.
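A minimal Python sketch of the two storage ideas described in items b) and d), i.e. the base-4 encoding of k-mers (2 bits per nucleotide) and the packing of the eight possible edges of a node into a single byte, is shown below; the routing by tree depth is purely illustrative and the names are hypothetical.

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer):
    """Base-4 encoding of a k-mer: 2 bits per nucleotide."""
    code = 0
    for base in kmer:
        code = (code << 2) | BASE[base]
    return code

def subtree(kmer, depth):
    """Index of the quaternary subtree (i.e. of the computational node)
    that stores this k-mer, given the depth at which the tree is cut."""
    return encode(kmer[:depth])

def set_edges(bits, outgoing=None, ingoing=None):
    """Mark an outgoing edge (bits 0-3) and/or an ingoing edge (bits 4-7),
    one bit per base, so that 8 bits describe all the edges of a node."""
    if outgoing is not None:
        bits |= 1 << BASE[outgoing]
    if ingoing is not None:
        bits |= 1 << (4 + BASE[ingoing])
    return bits

# hypothetical usage: k-mer "ACGT" with an outgoing edge on C and an
# ingoing edge on G, routed to one of 16 subtrees (tree cut at depth 2)
node = {"key": encode("ACGT"), "edges": set_edges(0, outgoing="C", ingoing="G")}
print(subtree("ACGT", depth=2), bin(node["edges"]))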

The next steps involve the implementation of the iterative procedure for searching paths in the graph and extending the contig boundaries until the gap is filled or nothing else can be done. The current implementation of steps a) to d) was tested on Aurora on small, artificially generated genomes (1 Mbp), following a simple generative approach for producing simulated mate-pairs of SOLiD reads. Sequencing errors have not been modeled yet. Tests were performed on partitions of 8 nodes, and execution times were measured by varying the length of the k-mers to be indexed. In the next months we plan to analyze real SOLiD data from the apple genome by running the algorithm on larger partitions of 16 or 32 nodes.

References for Section 3.6

[1] “Improving Draft Assemblies by Iterative Mapping and Assembly of Short Reads to Eliminate Gaps”, IJ. Tsai, TD. Otto, and M. Berriman. Genome Biol. 11, R41 (2010).

[2] “The Genome of the Domesticated Apple (Malus × Domestica Borkh.)”, R.Velasco, A.Zharkikh, J.Affourtit, et al. Nat. Genet. 42, 833-839 (2010)

[3] “ABySS: a Parallel Assembler for Short Read Sequence Data”, JT. Simpson, K. Wong, SD. Jackman, et al. Genome Res. 19, 1117-1123 (2009).

Acknowledgments

We are very grateful to the General Secretary, Ing. Andrea Simoni, for the support of the whole FBK in many aspects of this project. In particular, we are especially grateful to Ing. Stefano Bernini and his excellent staff for directing the work on the cooling system of Aurora and for promptly solving the many issues connected with such a complex installation. We also thank Giovanni Garberoglio (LISC-FBK) for the content of Section 3.4.2, and Silvano Simula and Francesco Sanfilippo (INFN-ROMA) for the content of Section 3.1.6. Finally, we are very grateful to all the members of LISC for creating a great scientific environment, which is an ideal place for Aurora to grow.

4. Explanation of the Expenditures

We attach to this report the detailed budget table. The first column shows the original subdivision of the budget chapters (Personale, Collaborazioni, Prototipo, Attrezzature, Materiali e Beni di Consumo, Missioni and Alta Formazione). The second column includes all the expenses incurred up to 31/3/2011. The third column also includes all the expenses that are planned for the remaining four months of the first phase of the project (i.e. up to 31/7/2011). The fourth and fifth columns contain the reference code of each expense (whenever applicable) and its status (C=closed, A=open).

As one can see, there has been a delay in filling some research positions. This was the effect of a combination of factors. The project had originally foreseen a number of senior research positions. However, the breakup of the project into two phases (decided a few months before the official start of the project), and the resulting impossibility of offering positions on a timeframe longer than 18 months, made those positions much less attractive to experienced researchers. Moreover, a serious problem in the formalization of the agreements between FBK and the partner institutions, which was the result of an unexpected legal issue, further delayed the opening of some positions. Fortunately, the exceptional commitment of the permanent staff working in the partner institutions and the high quality of the young researchers hired in the project largely compensated for these problems and ensured that the original milestones could be reached. The modest usage of travel money is a natural consequence of the above considerations.

Another point to note is that the costs associated with the prototype have been somewhat higher than expected. This was mostly due to the necessity of installing the chiller for the cooling system on the roof of the FBK north building, and the water pumps underground. This considerably increased the plumbing, electrical and engineering costs associated with the cooling system. The cooling infrastructure also includes a heat-recovery system that will provide heating to the north building of FBK, which justifies the direct extra contribution of 25kE from FBK, coming from resources other than the Aurora budget. These items are marked with the code 'FBK' in the fourth column of the budget. Moreover, the system includes a free-cooling mode that, thanks to the favourable heat capacity of water and to the weather conditions in Trento, can be activated for most of the year and hence will make the running costs for cooling essentially negligible.

The budget for Alta Formazione was partially used to organize an Aurora school for PhD students in late summer 2010 (see Section 1). In the first phase, no PhD grant was offered with Aurora funds; however, some research contracts were offered to graduate and undergraduate students. In fact, the funding of a PhD grant must be planned at least one year before the actual PhD selection; this was a problem, because the Aurora funding could be spent only within the timeframe of the project. However, we are confident that some of these formal problems can be solved in the second phase, when we do plan to fund PhD grants.

It was also recognized that the best way in which the AuroraScience project could support higher education in computational sciences was to provide high-class computing tools and technical support to students working on their Laurea or PhD theses. It was therefore decided that AuroraScience should encourage graduate and undergraduate students (even those working outside the AuroraScience collaboration) to run their research projects on Aurora. To this purpose, a considerable amount of the budget devoted to higher education (alta formazione) has been used to purchase more computing nodes for Aurora. A corresponding quota of the Aurora system will be reserved for projects run by students and will be appropriately advertised. The support that these young users will need represents an extra burden on the system administrators and developers of the AuroraScience collaboration, but we are sure that their exceptional feedback will be precious to the project and that, in turn, the project will be most effective in promoting HPC education in Trentino.

Finally, one can observe that, as of March 31st 2011, only about 2/3 of the budget has been spent, although we still plan to spend the full budget allocated for the first phase by the end of July 2011. There are two main reasons for this. One reason is the delay in opening some research positions, as already explained above. In fact, most of the budget that will be spent between April and July (about 300kE) is for salaries. The second reason is that only recently have we been able to define, together with the industrial partner Eurotech, the detailed features of the final prototype that we are planning to install in the second phase. This is described in great detail in the Proposal for the Second Phase of the AuroraScience Project, which accompanies the present document. On the basis of this plan, we are now able to order the last parts of the computing system that we want to install in the first phase, in a way that is consistent, both technically and economically, with the overall plan of the project.

Appendix B. AURORA SCHOOL REPORT

TITLE AURORA SCHOOL

DATE September 20th - October 1st 2010

ORGANIZERS
F. Pederiva, University of Trento, Italy
F. Di Renzo, University of Parma, Italy
W. Leidemann, University of Trento, Italy
L. Scorzato, ECT*, Italy
P. Verrocchio, University of Trento, Italy

NUMBER OF PARTICIPANTS 34

MAIN TOPICS
The school covered a few selected topics in computational methods commonly employed in nuclear physics, high-energy physics, and statistical physics, with the addition of technical sessions describing the architecture of the novel high performance computing system Aurora, related software and applications. Two seminars were devoted to computational applications in other areas. The detailed list is the following:
• Monte Carlo Methods for Statistical Physics and Critical Phenomena
• Few-body Methods in Nuclear Physics
• Lattice QCD
• Quantum Monte Carlo Methods for Nuclear Physics
• The Aurora Architecture
• Software and Applications for the Aurora System
• Genomics Applications on Aurora
• Nucleon Transport Simulations on Aurora for Proton Therapy

SPEAKERS:
Scientific Sessions:
Alexandrou C., University of Cyprus, Cyprus
Barnea N., Hebrew University of Jerusalem, Israel
Pelissetto A., University of Rome “La Sapienza”, Italy
Schmidt K.E., Arizona State University, USA

Technical Sessions:

Schifano F. and Pivanti M., University of Ferrara and INFN, Italy
Alfieri R. and Grossi M., University of Ferrara and INFN, Italy

Topical Session:

Schwarz M., A.Tre.P., Italy
Cilia E., Edmund Mach Foundation, Italy

SCIENTIFIC REPORT:

The First Aurora School was organized in the framework of the AuroraScience collaboration. AuroraScience is a research project at the crossroads of computational science and computer architecture. It builds on the combined know-how available to the members of the collaboration on: the design, development and operation of application-driven high-performance computer systems (e.g., the series of APE machines developed by INFN); algorithm development and physics analysis in computation-intensive areas of physics (Lattice Gauge Theory, Computational Fluid Dynamics, Molecular Dynamics, Few- and Many-body Nuclear Physics); Quantitative Biology (protein folding); Bioinformatics (gene sequencing); and Medical Physics.

The school aimed at educating young scientists on the architecture and the possible applications of the Aurora machine, together with high-quality short courses on selected topics in computational science given by world-class scientists. The main goal has been to create a wide basis of possible users in different institutions in Europe and worldwide. In this first school the focus was kept on algorithms and applications in nuclear physics and high-energy physics. The targeted attendees were graduate students and Ph.D. students working on related topics. The participation of young post-docs was also encouraged.

The level of the scientific courses was meant to be introductory. The first week hosted the classes of Andrea Pelissetto on basic Monte Carlo techniques for statistical physics, with an introduction to specific techniques for improving the sampling of statistical ensembles near a phase transition, which are in common use also in other fields of physics. In the same week, Nir Barnea illustrated methods commonly employed in few-body simulations, from the direct diagonalization of the Hamiltonian to more sophisticated methods such as the Lorentz Integral Transform. In the second week, Constantia Alexandrou presented an introduction to LQCD methods, and Kevin Schmidt illustrated the advanced Quantum Monte Carlo methods employed nowadays to study non-relativistic nuclear Hamiltonians. Although the topics appear to be spread over a wide range of physical applications, the technical aspects are in many cases very similar, and the participants were stimulated to find the connections between the different topics introduced.

The technical sessions were divided into two sections. The first section gave a basic introduction to the architecture of the Aurora system, mainly in order to familiarize the participants with the concept of the double data-transport network that characterizes the machine. This part was coordinated by Fabio Schifano. In the second week the focus moved to the software that has been developed to operate the machine. In both sessions the students had the possibility to directly explore and exercise the notions presented in the classes. The presence of two speakers operating in fields not directly covered by the school (genomics and proton transport) was meant to give the attendees a sense of the broad range of topics covered by the AuroraScience collaboration, and to stimulate interest in other possible applications.

GENERAL EVALUATION

The school was characterized by a very high standard of lectures and by the active participation of the students. Attendance at all the proposed courses and seminars was regular and constant. The limited overlap in the presence of the various speakers did not foster new interactions; however, there were occasions for an exchange of scientific and technical competences, in particular concerning the peculiar aspects of the Aurora architecture.

The response among the students was rather positive. Many of them actively contributed and kept the discussion at a high standard. The practical sessions were also appreciated, although some of the attendees expected even more focus on practical applications and real programming. This aspect has to be taken into account in view of the possibility of organizing a second edition of the Aurora school.

SCHEDULE

-------------------------------------------------------------------------
FIRST WEEK - Mon Sep. 20th - Fri Sep. 24th
-------------------------------------------------------------------------

Mon Sep. 20th

9:00 (Monday only) Achim Richter, Director ECT*
Greetings and presentation of ECT*

9:25 (Monday only) Communications

9:30-11:00 Andrea Pelissetto Monte Carlo Methods for Statistical Physics and Critical Phenomena

Coffee Break

11:30-13:00 Nir Barnea Few-body Methods in Nuclear Physics

Lunch break

14:30-16:30 (Monday - Thursday) Fabio Schifano, Marcello Pivanti
Aurora Architecture

14:30-16:00 (Friday) Elisa Cilia
The Aurora of Assembly Algorithms for Next Generation Sequencing

---------------------------------------------------------------------------
SECOND WEEK - Mon Sep. 27th - Fri Oct. 1st
---------------------------------------------------------------------------

9:30-11:00 Constantia Alexandrou Introduction to Lattice QCD

Coffee Break

11:30-13:00 Kevin E. Schmidt Quantum Monte Carlo Methods for Many Nucleon Systems

Lunch break

14:30-16:30 (Monday - Thursday) Roberto Alfieri, Marco Grossi
Aurora Software and Applications

14:30-16:00 (Friday) Marco Schwarz
Role of High Performance Computing in Medical Physics: the Case of Monte Carlo Simulations for Radiation Transport
--------------------------------------------------------------------------