Page 1

Alan Gray, Developer Technology Engineer, NVIDIA

GTC, March 26-29 2018

PERFORMANCE OPTIMIZATION FOR SCIENTIFIC APPLICATIONS

Page 2

AGENDA

• Introduction

• Single-GPU
  • Exposing parallelism
  • Memory Coalescing
  • Data Management Optimization
  • Interoperability
  • ESCAPE performance results

• Multi-GPU
  • CUDA-aware MPI
  • AlltoAll optimization
  • DGX-2 (with NVSwitch) vs DGX-1V

Page 3

INTRODUCTION

Page 4

ESCAPE

• NVIDIA’s role in ESCAPE is to take existing GPU-enabled codes and optimize them.

Page 5

ECMWF

• The European Centre for Medium-Range Weather Forecasts (ECMWF) is an intergovernmental organisation.

• Global forecasts:
  • used by more than 22 countries to drive their regional forecasts.
  • provide accurate weekly, monthly and seasonal predictions, including early warnings of severe events.
  • In 2012 ECMWF provided the most accurate prediction for the trajectory and landfall of Hurricane Sandy: information that undoubtedly saved lives.

• We are working closely with ECMWF (and other partners) in ESCAPE to evaluate and improve the algorithms, techniques and software on GPUs.

• This is done through the use of dwarves: mini-apps designed to represent the key properties of the real simulation applications.

ESCAPE Project Leaders

Page 6

ESCAPE DWARVES

• ECMWF’s Integrated Forecasting System (IFS) is a global prediction system: the entire earth’s atmosphere is represented as a spherical grid.

• Info in “grid-point” space can be equivalently represented in “spectral” space, i.e. in terms of the frequencies of the fluctuating waves, which is more suited to some calculations.

• IFS therefore repeatedly transforms between these representations: Fourier transforms (FFTs) in longitude and Legendre transforms (DGEMMs) in latitude, with AlltoAll data movement in-between.

• This dwarf represents the spectral transforms from IFS.

• NB. The number of points varies (e.g. most round the equator, fewest at the poles). Additionally, there exist multiple altitude “levels” in the third dimension away from the surface of the earth, each with 3 “fields”.

Spherical Harmonics (SH) Dwarf

Page 7

ESCAPE DWARVES

• Advection: horizontal transport

• Uses unstructured grid with nearest-neighbour stencils

• MPDATA scheme already used within COSMO-EULAG (PSNC), and of interest to ECMWF for future developments

• Both the SH and MPDATA dwarves use Fortran+OpenACC+MPI. SH also interfaces to CUDA libraries.

• Many of the optimizations I will present are transferable to other applications/languages etc.

MPDATA Dwarf

Page 8

SINGLE-GPU: EXPOSING PARALLELISM

Page 9

EXPOSING PARALLELISM

OpenACC Loop Mapping

Typical structure of an application (usually spanning multiple routines/files):

Loop over timesteps
  Loop over 1st dimension
    Loop over 2nd dimension
      Loop over fields
        Operation (involving multidimensional arrays)
  Another loop over dimensions…

• Aim is to expose as much parallelism as possible in the inner spatial/field loops (the red box on the slide), as flexibly as possible.

Page 10

EXPOSING PARALLELISM

Loop over 1st dimension

Loop over 2nd dimension

Loop over fields

Operation

OpenACC Loop Mapping

Before Optimization:

SH
• Loop over 1st dimension was sequential: parallelism not exposed to GPU.

MPDATA
• Naïve mapping of loops, using “kernels” and/or “loop” directives without restructuring.
• Resulting decomposition chosen by compiler did not work well since runtime loop extents didn’t map well to GPU architecture.

Page 11

EXPOSING PARALLELISM

OpenACC Loop Mapping

!$ACC parallel loop collapse(3)
Loop over 1st dimension
  Loop over 2nd dimension
    Loop over fields
      Operation

• Assuming loops are independent, better to restructure such that loops are tightly nested, and use “collapse” clause.

• Loops will be collapsed into a single loop, and compiler will map all parallelism to GPU blocks/threads in an efficient way.

• This can require extensive restructuring, depending on how application is originally written.
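A minimal Fortran sketch of the restructured form, assuming independent loops; the routine and array names (update_fields, a, b) are hypothetical and only illustrate the pattern:

  subroutine update_fields(a, b, n1, n2, nfields)
    implicit none
    integer, intent(in) :: n1, n2, nfields
    double precision, intent(in)  :: a(n1,n2,nfields)
    double precision, intent(out) :: b(n1,n2,nfields)
    integer :: i, j, f
    ! Tightly nested, independent loops collapsed into one parallel iteration
    ! space; the compiler maps it to GPU blocks/threads.
    !$acc parallel loop collapse(3) copyin(a) copyout(b)
    do f = 1, nfields
      do j = 1, n2
        do i = 1, n1
          b(i,j,f) = 2.0d0 * a(i,j,f)
        end do
      end do
    end do
  end subroutine update_fields

Note that the innermost loop index i is also the fastest-moving array index, which matters for the memory coalescing discussion later.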

Page 12

EXPOSING PARALLELISM

OpenACC Loop Mapping

!$ACC parallel loop collapse(2)
Loop over 1st dimension
  Loop over 2nd dimension
    !$ACC loop seq
    Loop with dependencies
      Operation

• Sometimes we have loop-carried dependencies.

• These can be performed sequentially by each thread at the innermost level.

• Can still perform well if there is enough parallelism in outermost loops.
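A minimal sketch of this pattern, again with hypothetical names: the outer two loops are collapsed, while the k loop carries a dependency (a running sum) and is executed sequentially by each thread:

  subroutine cumulative_sum(a, c, ni, nj, nk)
    implicit none
    integer, intent(in) :: ni, nj, nk
    double precision, intent(in)  :: a(ni,nj,nk)
    double precision, intent(out) :: c(ni,nj,nk)
    integer :: i, j, k
    !$acc parallel loop collapse(2) copyin(a) copyout(c)
    do j = 1, nj
      do i = 1, ni
        c(i,j,1) = a(i,j,1)
        !$acc loop seq
        do k = 2, nk
          c(i,j,k) = c(i,j,k-1) + a(i,j,k)   ! depends on the previous k iteration
        end do
      end do
    end do
  end subroutine cumulative_sum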

Page 13

EXPOSING PARALLELISM

OpenACC Loop Mapping

• Sometimes the extent of a loop depends on the index of another (outer) loop, which prevents loop collapsing.

• Can replace the extent with its max value, and use a conditional statement in the loop body.

Before:

  do i=1, N
    do j=1, i
      do k=1, P
        ! Operation
      end do
    end do
  end do

After:

  !$ACC parallel loop collapse(3)
  do i=1, N
    do j=1, MAX_J
      do k=1, P
        if (j .le. i) then
          ! Operation
        end if
      end do
    end do
  end do

Page 14

SINGLE-GPU: MEMORY COALESCING

Page 15

KERNEL OPTIMIZATION

Memory Coalescing: data layout

• For memory coalescing, the fastest-moving index in the array access should correspond to the vector level (CUDA thread), which will correspond to the innermost collapsed loop index.

  !$ACC parallel loop collapse(3)
  do k=1, …
    do j=1, …
      do i=1, …
        Array(i,j,k)=…

  !$ACC parallel loop collapse(2)
  do m=1, …
    do n=1, …
      !$ACC loop seq
      do p=1, …
        Array(n,m,p)=…

Page 16

KERNEL OPTIMIZATION

Memory Coalescing: transposing with tile clause

  !$ACC parallel loop tile(16,32)
  do j=1, …
    do i=1, …
      array_t(i,j)=array(j,i)

• If you need to transpose, either read or write will be in wrong layout for coalescing.
• But can use OpenACC “tile” clause to improve performance (see the sketch below):
  • Compiler tiles the operation by generating new innermost loops
  • For each tile, data is staged on-chip
  • Results in better global memory access patterns
• Experiment with tile sizes
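A self-contained sketch of this transpose, with illustrative array names and the 16x32 tile shape from the slide (worth tuning per kernel):

  subroutine transpose_tiled(a, at, n, m)
    implicit none
    integer, intent(in) :: n, m
    double precision, intent(in)  :: a(n,m)
    double precision, intent(out) :: at(m,n)
    integer :: i, j
    ! The tile clause blocks the 2D iteration space; each tile is staged
    ! on-chip, improving the access pattern of the strided side of the transpose.
    !$acc parallel loop tile(16,32) copyin(a) copyout(at)
    do j = 1, n
      do i = 1, m
        at(i,j) = a(j,i)
      end do
    end do
  end subroutine transpose_tiled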

Page 17

KERNEL OPTIMIZATION

Memory Coalescing: transposing within CUDA BLAS

• In some of our cases, the transpose kernels were adjacent to CUDA BLAS matrix multiplication (DGEMM) calls.

• Coalescing was facilitated by replacing C = AB matrix multiplications with the equivalent C^T = B^T A^T.

• This allows transpose operations to be pushed into the DGEMM library calls, which have much higher-performing implementations of transposed data accesses (see the sketch below).
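The identity can be checked with the standard host BLAS dgemm interface (link against any BLAS); in the dwarf the same swap is applied to the cublasDgemm calls. Sizes and names here are illustrative:

  program transpose_via_gemm
    implicit none
    integer, parameter :: m = 3, k = 4, n = 2
    double precision :: A(m,k), B(k,n), C(m,n), CT(n,m)
    external :: dgemm
    call random_number(A)
    call random_number(B)
    ! Ordinary product: C = A*B
    call dgemm('N', 'N', m, n, k, 1.0d0, A, m, B, k, 0.0d0, C, m)
    ! Equivalent transposed product: C^T = B^T * A^T, so the transposed data
    ! accesses happen inside the library call rather than in a separate kernel.
    call dgemm('T', 'T', n, m, k, 1.0d0, B, k, A, m, 0.0d0, CT, n)
    print *, 'max |C - (CT)^T| =', maxval(abs(C - transpose(CT)))
  end program transpose_via_gemm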

Page 18

SINGLE-GPU: DATA MANAGEMENT OPTIMIZATION

Page 19

DATA MANAGEMENT OPTIMIZATION

Minimizing data allocation and movement

Data allocation and movement is expensive. Many codes have:

Loop over timestep:
  loops over spatial dims

• Important to keep as much data as possible resident on GPU within timestep loop.
• All allocations/frees should be outside timestep loop.
• Copies for constant data should be outside main timestep loop.
• Re-use temporary scratch arrays on device.
• Any necessary repeated copies (e.g. halo regions in MPI code): volume copied should be minimized.
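A minimal OpenACC sketch of this structure, with hypothetical array names: allocation and the host-to-device copy happen once, outside the timestep loop, and the scratch array stays resident on the device:

  program resident_data
    implicit none
    integer, parameter :: n = 1024, nsteps = 100
    double precision, allocatable :: field(:), scratch(:)
    double precision :: dt
    integer :: i, step
    allocate(field(n), scratch(n))      ! allocate once, outside the timestep loop
    field = 1.0d0
    dt = 0.1d0
    !$acc enter data copyin(field) create(scratch)   ! one-off host-to-device copy
    do step = 1, nsteps
      !$acc parallel loop present(field, scratch)
      do i = 1, n
        scratch(i) = dt * field(i)      ! scratch array re-used every step, stays on GPU
      end do
      !$acc parallel loop present(field, scratch)
      do i = 1, n
        field(i) = field(i) + scratch(i)
      end do
    end do
    !$acc exit data copyout(field) delete(scratch)   ! single copy back at the end
    print *, 'field(1) =', field(1)
    deallocate(field, scratch)
  end program resident_data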

Page 20

SINGLE-GPU: INTEROPERABILITY

Page 21

INTEROPERABILITY

Simple example: Calling C/CUDA from PGI Fortran

  !main.f90
  program main
    interface
      subroutine kernel_from_f(arg) &
           bind(C,name='kernel_wrapper')
        use iso_c_binding
        integer(c_int),value :: arg
      end subroutine kernel_from_f
    end interface
    call kernel_from_f(1)
  end program main

  //kernel.cu
  #include <stdio.h>

  __global__ void kernel(int arg){
    if (threadIdx.x==0)
      printf("hello from kernel\n");
    return;
  }

  extern "C" void kernel_wrapper(int arg){
    kernel <<<1,1>>> (arg);
    cudaDeviceSynchronize();
    return;
  }

  $ nvcc -c -arch=sm_60 kernel.cu -o kernel.o
  $ pgf90 -c main.f90 -o main.o
  $ pgf90 main.o kernel.o -o code -L $CUDA_HOME/lib64 -lcudart
  $ ./code
  hello from kernel

CUDA libraries can be called from C code in the usual manner.

Page 22

INTEROPERABILITY AND LIBRARIES: SH DWARF

(Diagram: OpenACC compute regions interleaved with cuFFT and cublasDgemm library calls.)

Base language Fortran, MPI for multi-GPU communications.

Page 23

BLAS/FFT LIBRARY CALLS IN SH DWARF

• At each timestep, SH dwarf performs transforms using Matrix Multiplications and FFTs.

• Multiple operations - one for each:
  • Field (associated with vertical levels)
  • Longitude (Matmult) / Latitude (FFT)

• Can batch over fields, since sizes are the same. But different longitudes/latitudes have different sizes: not supported by batched versions of cublasDgemm/cuFFT.

• So, originally we had many small calls: low parallelism exposure and launch latency sensitivity.

• For DGEMM, we pad with zeros up to largest size and batch over longitudes as well as fields: single call to library; extra operations do not contribute to result.

• But FFT does not allow padding in the same way. Worked around launch latency problem by removing sync after each call: allows launch latency to be hidden behind execution.

• As will be seen, however, this is the only part of the dwarf which remains suboptimal. Future: batched FFT with differing sizes should improve performance.
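A small sketch of the zero-padding idea only (illustrative sizes; plain matmul stands in for the single batched cuBLAS DGEMM call used on the GPU): matrices of differing width are embedded in a zero-padded batch of uniform shape, and the extra zero columns simply produce zero columns in the result.

  program pad_batch
    implicit none
    integer, parameter :: m = 4, k = 3, nbatch = 2, maxn = 5
    integer, parameter :: ncols(nbatch) = [2, 5]    ! the "real" widths differ
    double precision :: A(m,k), Bpad(k,maxn,nbatch), Cpad(m,maxn,nbatch)
    integer :: ib
    call random_number(A)
    Bpad = 0.0d0                                    ! pad with zeros up to the largest size
    do ib = 1, nbatch
      call random_number(Bpad(:,1:ncols(ib),ib))    ! fill only the real columns
    end do
    ! All padded matrices now share one shape; on the GPU this loop would be one
    ! batched DGEMM over the whole set. The padded columns do not affect the
    ! meaningful part of the result.
    do ib = 1, nbatch
      Cpad(:,:,ib) = matmul(A, Bpad(:,:,ib))
    end do
    print *, 'meaningful result columns per batch entry:', ncols
  end program pad_batch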

Page 24

SINGLE-GPU: ESCAPE RESULTS

Page 25

MPDATA OPTIMIZATION: P100

Before:

After:

Page 26

OPTIMIZED MPDATA: P100 VS V100

P100

V100

Page 27

ESCAPE DWARF V100 PERFORMANCE

Page 28

MPDATA KERNEL PERFORMANCE

• 100% Roofline is STREAM benchmark throughput, since all kernels are memory bandwidth bound

Page 29

ESCAPE DWARF V100 PERFORMANCE

Page 30

SH KERNEL PERFORMANCE

• 100% Roofline is peak DP Performance (compute bound kernels) or STREAM benchmark throughput (memory bandwidth bound kernels)

Experiments with batching (using average size) show a 4.6X speedup.

Page 31

MULTI-GPU: CUDA-AWARE MPI

Page 32

CUDA-AWARE MPI

• Modern MPI implementations are CUDA-aware. This means that pointers to GPU memory can be passed directly into the MPI calls, avoiding unnecessary transfers (both in the application and in the underlying MPI implementation).

  ! non CUDA-aware:
  !$ACC update host(array,…)
  call MPI_Alltoallv(array,…)
  !$ACC update device(array,…)

  ! CUDA-aware:
  !$ACC host_data use_device(array,…)
  call MPI_Alltoallv(array,…)
  !$ACC end host_data

• Also for point-to-point etc. Note that if explicit buffer packing is involved, this will also need to be ported from CPU to GPU.
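A minimal sketch of device-side buffer packing combined with CUDA-aware point-to-point MPI (hypothetical routine and array names; arrays are assumed already present on the device):

  subroutine exchange_halo(field, sendbuf, recvbuf, n, neighbour, comm)
    use mpi
    implicit none
    integer, intent(in) :: n, neighbour, comm
    double precision, intent(in)    :: field(n,n)
    double precision, intent(inout) :: sendbuf(n), recvbuf(n)
    integer :: i, ierr, reqs(2), stats(MPI_STATUS_SIZE,2)

    ! Pack the halo on the device, so the data never visits the host.
    !$acc parallel loop present(field, sendbuf)
    do i = 1, n
      sendbuf(i) = field(i, n)     ! e.g. last column is the halo to send
    end do

    ! Pass device pointers straight to the (CUDA-aware) MPI library.
    !$acc host_data use_device(sendbuf, recvbuf)
    call MPI_Irecv(recvbuf, n, MPI_DOUBLE_PRECISION, neighbour, 0, comm, reqs(1), ierr)
    call MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, neighbour, 0, comm, reqs(2), ierr)
    !$acc end host_data
    call MPI_Waitall(2, reqs, stats, ierr)
  end subroutine exchange_halo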

Page 33

MULTI-GPU: ALLTOALL OPTIMIZATION

Page 34

ALLTOALL: 4 GPUS ON DGX-1

• SH dwarf performs AlltoAll operations. Even with CUDA-aware MPI, there are inefficiencies.

• Can instead use CUDA IPC directly with CUDA streams to overlap.

• Future MPI implementations are expected to take care of this

Page 35

CUDA IPC ALLTOALL

As a starting point, we used the 0_Simple/simpleIPC code from the CUDA samples.

Setup:

• On each MPI rank, get IPC memory handle for array locally

• Share IPC memory handles via MPI

• Setup CUDA streams

AlltoAll:

• On each rank, loop over all targetranks (including own)

• targetrank=(targetrank+rank)%number_of_gpus (for better balance)

• Push message to targetrank (in stream[targetrank])

• Sync all streams
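For illustration only, a tiny Fortran program printing the rotated target ordering implied by the formula above (the dwarf itself pushes each message with CUDA IPC and one stream per target, not with MPI): starting each rank's loop at its own rank index keeps all GPU pairs busy simultaneously instead of every rank targeting rank 0 first.

  program rotation_order
    implicit none
    integer, parameter :: nranks = 4
    integer :: rank, step, tgt
    do rank = 0, nranks-1
      write(*,'(a,i0,a)', advance='no') 'rank ', rank, ' sends to:'
      do step = 0, nranks-1
        tgt = mod(step + rank, nranks)   ! rotated target order, includes own rank
        write(*,'(1x,i0)', advance='no') tgt
      end do
      write(*,*)
    end do
  end program rotation_order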

Page 36

SH RESULTS ON 4 GPUS

Page 37

MULTI-GPU: DGX-2 (WITH NVSWITCH) VS DGX-1V

Page 38

SPHERICAL HARMONICS: SCALING BEYOND 4 GPUS

• When using all 8 GPUs in DGX-1V:
  • No AlltoAll NVLINK connectivity: some messages go through PCIe and system memory
  • This limits performance

• When using 16 GPUs across 2 DGX-1V servers:
  • Some messages go across the InfiniBand network
  • Further bottleneck

Page 39

DGX-2 WITH NVSWITCH

• AlltoAll network architecture with NVSwitch maps perfectly to the problem.

• Full bandwidth between each GPU pair.

Page 40

SPHERICAL HARMONICS: DGX-2 VS DGX-1V

Page 41

SUMMARY AND FUTURE WORK

• Optimizing the exposure of parallelism, memory coalescing and data management can have dramatic effects on performance.

• MPDATA single-GPU performance is now optimal.

• MPDATA multi-GPU: halo exchange is currently implemented via the MPI-abstracted, CPU-based ATLAS library (also developed by ECMWF). To be fully optimized, ATLAS needs to be made CUDA-aware, and this is currently being worked on by others.

• SH single-GPU performance is vastly improved, but FFT part remains sub-optimal.

• Implementing batching that allows different sizes within each batch is expected to fix this.

• DGX-2/NVSwitch all-to-all connectivity allows SH to scale to all 16 GPUs.

• These results indicate that multi-GPU systems can be exploited effectively, allowing forecasting agencies to continue to improve weather predictions.

Page 42