A Multi-platform Co-Array Fortran Compiler for High-Performance Computing
Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey
{dotsenko, ccristi,


Programming Ultra-scale Parallel Systems

CHALLENGES
  - High performance and good scalability
  - Programmer productivity

MPI: a library-based parallel programming model
  - Portable and widely used
  - The developer has explicit control over data and communication placement
  - Difficult and error prone to program
  - Most of the burden for communication optimization falls on application developers; compiler support is underutilized

CAF: a promising near-term alternative
  - As expressive as MPI
  - Simpler to program than MPI
  - More amenable to compiler optimizations
  - User has control over performance-critical factors

Co-Array Fortran Language
  - Parallel extension of Fortran 90
  - SPMD programming model
      - fixed number of images during execution
      - images operate asynchronously
  - Both private and shared data
      real a(20,20)        private: a 20x20 array in each image
      real a(20,20) [*]    shared: a 20x20 array in each image
  - Simple one-sided communication (PUT & GET)
      x(:,j:j+2) = a(r,:) [p:p+2]    copy row r from images p:p+2 into local columns j:j+2
  - Flexible explicit synchronization
      sync_team(team [,wait])
        team = a vector of process ids to synchronize with
        wait = a vector of processes to wait for
  - Pointers and dynamic allocation

Rice CAF Compiler
  - Source-to-source code generation
  - Open source compiler
  - Built on the Open64/SL infrastructure
  - Support for core language features
  - Code generation:
      - library-based communication: portable ARMCI and GASNet communication libraries and the CHASM array descriptor library
      - load/store communication: on shared-memory platforms
  - Operating systems: Linux IA64/IA32, Alpha Tru64, SGI IRIX64
  - Interconnects & Platforms: Quadrics QSNet (Elan 3), QSNet II (Elan 4); Myrinet 2000; Ethernet; SGI Altix 3000, SGI Origin 2000

  Example:
      integer :: a(10,20)[*]
      if (this_image() > 1) a(1:10,1:2)=a(1:10,19:20)[this_image( )-1]
  [Figure: a(10,20) on image 1, image 2, ..., image N; each image copies from its left neighbor]

CAF Model Refinements
  - Point-to-point synchronization
      sync_notify(p)
      sync_wait(p)
  - Less restrictive memory fences at call site
  - Collective operations

Current Optimizations
  - Procedure splitting
  - Hints for non-blocking communication
  - Library-based and load/store communication
  - Packing of strided communication

Planned Optimizations
  - Communication vectorization and aggregation
  - Synchronization strength-reduction
  - Automatic split-phase communication
  - Platform-driven communication optimizations
      - transform communication from one-sided into two-sided and collective, if useful
      - multi-model code for hierarchical architectures
      - convert GETs into PUTs
  - Multi-buffer co-arrays for asynchrony tolerance
  - Employ virtualization for latency tolerance
  - Interoperability with other programming models

CAF Applications and Benchmarks
  - Sweep3D: wave-front parallelism
  - Spark98: sparse matrix vector multiply
  - NAS Parallel Benchmarks 2.3: MG, CG, SP, BT, LU
  - Random Access, STREAM

Neutron transport problem: Sweep3D
  [Figure: wave-front parallelism]
  [Plots: Sweep3D on Itanium2+Myrinet; Sweep3D on Itanium2+Quadrics; Sweep3D on SGI Altix 3000]

San Fernando Valley Earthquake Simulation: Spark98
  - Sparse matrix vector multiply (sf2 traces)
  - Performance of all CAF versions is comparable to that of MPI, and better on large numbers of CPUs
  - The CAF GETs version is simpler and more natural to code, but up to 13% slower
  - Without considering locality, applications do not scale on NUMA architectures (Hybrid)
  - The ARMCI library is more efficient than MPI
  [Figures: NSF Mesh; Partitioned Mesh]
  [Plot: Spark98 on SGI Altix 3000]

Computational Fluid Dynamics: Cluster Platforms
  [Plots: NAS BT C on Itanium2+Myrinet 2000; NAS MG C on Itanium2+Myrinet 2000]

Computational Fluid Dynamics: SGI Altix
  [Plots: NAS BT B on SGI Altix 3000; NAS MG C on SGI Altix 3000]
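
The co-array statement a(1:10,1:2)=a(1:10,19:20)[this_image( )-1] can be hard to picture without the poster's diagram. The sketch below is a serial Python model of what that one-sided GET does across images. It is not CAF code and not part of the Rice compiler. The helper names (make_images, shift_from_left_neighbor) and the fill pattern (image number times 100 plus column number) are assumptions made purely for illustration. A real CAF program would also need synchronization around the copy, which this serial model does not represent.

```python
ROWS, COLS = 10, 20  # shape of the co-array a(10,20)[*]


def make_images(num_images):
    """Build one copy of a(10,20) per image (hypothetical helper).

    Each element holds image*100 + column so its origin is easy to see.
    """
    return [
        [[img * 100 + col for col in range(1, COLS + 1)] for _ in range(ROWS)]
        for img in range(1, num_images + 1)
    ]


def shift_from_left_neighbor(images):
    """Model: if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]."""
    for me in range(2, len(images) + 1):   # 1-based, like this_image()
        local = images[me - 1]             # this image's copy of a
        remote = images[me - 2]            # left neighbor's copy of a
        for row in range(ROWS):
            # GET columns 19:20 of the neighbor into local columns 1:2.
            local[row][0:2] = remote[row][18:20]
    return images


if __name__ == "__main__":
    imgs = shift_from_left_neighbor(make_images(3))
    for number, image in enumerate(imgs, start=1):
        print("image", number, "first row, columns 1:2 =", image[0][:2])
```

Image 1 has no left neighbor, so its data is unchanged. Every other image overwrites only its first two columns, so the columns its own right neighbor reads (19:20) are never modified by the shift. That is why the serial loop order does not affect the result in this model.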