12/7/2015 1 Performance Optimizations for running NIM on GPUs Jacques Middlecoff NOAA/OAR/ESRL/GSD/AB [email protected] Mark Govett, Tom Henderson

04/21/23 1

Performance Optimizations for running NIM on GPUs

Jacques Middlecoff, NOAA/OAR/ESRL/GSD/AB

[email protected]

Mark Govett, Tom Henderson

Jim Rosinski


Goal for NIM


Optimizations to be discussed

NIM: The halo to be communicated between processors is packed and unpacked on the GPU
  No copy of the entire variable to and from the CPU
  About the same speed as the CPU

Halo computation
Overlapping communication with computation
Mapped, pinned memory
NVIDIA GPUDirect technology


Halo Computation

Redundant computation to avoid communication
  Calculate values in the halo instead of sending them with MPI
  Trades computation time for communication time
  GPUs create more opportunity for halo computation

NIM uses halo computation for everything that does not require extra communication

The next step for NIM is to look at halo computations that require new but less frequent communication
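The trade described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not NIM code: the 1-D line of cells, the three-point `stencil` function, and all variable names are assumptions made for the example. Two dependent sweeps (a -> b -> c) are computed on one rank. Instead of exchanging `b` between the sweeps, the rank evaluates `b` redundantly on one extra ring of halo cells.

```python
def stencil(field, cells):
    """Three-point average evaluated at every index in `cells`."""
    return {i: (field[i - 1] + field[i] + field[i + 1]) / 3.0 for i in cells}

# Serial reference on a 12-cell line: two dependent stencil sweeps, a -> b -> c.
a_global = {i: float(i * i) for i in range(12)}
b_ref = stencil(a_global, range(1, 11))
c_ref = stencil(b_ref, range(2, 10))

# One rank owns cells 4..7 and has already received a 2-deep halo of `a` (cells 2..9).
owned = range(4, 8)
a_local = {i: a_global[i] for i in range(2, 10)}

# Halo computation: evaluate b on the owned cells plus one halo ring (cells 3..8).
b_local = stencil(a_local, range(3, 9))

# c on the owned cells needs b at cells 3..8, all computed locally: no exchange of b.
c_local = stencil(b_local, owned)

redundant_cells = len(range(3, 9)) - len(owned)  # extra work traded for one exchange
matches_serial = all(c_local[i] == c_ref[i] for i in owned)
print(redundant_cells, matches_serial)
```

The cost is a deeper halo for `a` and a few redundant cell updates. The gain is one fewer exchange per pair of sweeps.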


Overlapping Communication with Computation

Works best with a co-processor to handle communication

Overlap communication with other calculations between when a variable is set and when it is used
  Not enough computation time on the GPU

Calculate the perimeter first, then communicate while calculating the interior
  Loop level: Not enough computation on the GPU
  Subroutine level: Not enough computation time
  Entire dynamics: Not feasible for NIM
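The perimeter-first ordering can be sketched as follows. This is a minimal Python illustration under stated assumptions: a helper thread stands in for the co-processor, `send_halo` and `mailbox` are hypothetical stand-ins for a non-blocking MPI send and the neighbor's receive buffer, and `compute` is an arbitrary placeholder calculation. Python threads do not give true parallel speedup here, so the sketch shows only the order of operations.

```python
import threading

def compute(field, cells):
    """Placeholder calculation applied to the listed cells."""
    return {i: 2.0 * field[i] for i in cells}

def send_halo(perimeter_values, mailbox):
    """Hypothetical stand-in for a non-blocking send handled by a helper."""
    mailbox.update(perimeter_values)

field = {i: float(i) for i in range(10)}
perimeter = [0, 9]        # cells whose results a neighbor needs
interior = range(1, 9)    # cells no neighbor needs

result = compute(field, perimeter)                 # 1. perimeter first
mailbox = {}
comm = threading.Thread(
    target=send_halo,
    args=({i: result[i] for i in perimeter}, mailbox),
)
comm.start()                                       # 2. start the communication
result.update(compute(field, interior))            # 3. interior while data is in flight
comm.join()                                        # 4. wait before the halo is used

print(sorted(mailbox.items()))
```

The overlap pays off only if step 3 takes at least as long as the communication, which is the condition the slide says is not met at loop or subroutine level on the GPU.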


Overlapping Communication with Computation: Entire Dynamics

14 exchanges per time step

3-iteration Runge-Kutta (RK) loop

Exchanges in the RK loop

Results in a 7-deep halo

[Figure: perimeter and interior regions]

Way too much communication
  More halo computation?
  Move exchanges out of the RK loop?

Considerable code restructuring required.


Mapped, Pinned Memory: Theory

Mapped, pinned memory is CPU memory
  Mapped so the GPU can access it across the PCIe bus
  Page-locked so the OS can’t swap it out
  Limited amount

Integrated GPUs: Always a performance gain

Discrete GPUs (what we have): Advantageous only in certain cases
  The data is not cached on the GPU
  Global loads and stores must be coalesced

Zero-copy: Both GPU and CPU can access the data


Mapped, Pinned Memory: Practice

Using mapped, pinned memory for fast copy
  SendBuf is mapped and pinned
  Regular GPU array (d_buff) is packed on the GPU
  d_buff is copied to SendBuf
  Twice as fast as copying d_buff to a CPU array

Pack the halo on the GPU: SendBuf = VAR
  Zero-copy is 2.7X slower. Why?

Unpack the halo on the GPU: VAR = RecvBuf
  Zero-copy unpack is the same speed but needs no copy
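The pack and unpack steps (SendBuf = VAR, VAR = RecvBuf) amount to a gather into a contiguous buffer and a scatter back out of it. The sketch below shows only that index logic in plain Python on the CPU. It does not model mapped or pinned memory, GPU kernels, or timing. The function names, the column layout, and the cell indices are assumptions made for the example.

```python
NZ = 3  # vertical levels per column; kept small for the demo

def pack_halo(var, send_cells):
    """Gather the columns named in send_cells into one contiguous buffer."""
    buf = []
    for cell in send_cells:      # on a GPU, this loop would be spread over threads
        buf.extend(var[cell])
    return buf

def unpack_halo(var, recv_cells, buf):
    """Scatter a contiguous buffer back into the halo columns of var."""
    for n, cell in enumerate(recv_cells):
        var[cell] = buf[n * NZ:(n + 1) * NZ]

# Sender owns 4 columns; receiver owns 4 columns plus 2 halo columns (indices 4, 5).
var_sender = [[10 * c + k for k in range(NZ)] for c in range(4)]
var_receiver = [[0] * NZ for _ in range(6)]

send_buf = pack_halo(var_sender, send_cells=[2, 3])         # SendBuf = VAR (halo only)
recv_buf = list(send_buf)                                   # stand-in for the MPI transfer
unpack_halo(var_receiver, recv_cells=[4, 5], buf=recv_buf)  # VAR = RecvBuf

print(len(send_buf), "values moved instead of", 4 * NZ)
```

Only the halo columns cross the bus, which is why packing on the GPU avoids copying the entire variable to and from the CPU.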


Mapped, Pinned Memory: Results

NIM: 10242 horizontal points, 96 vertical levels
10 processors
Lowest value selected to avoid skew


Mapped, Pinned Memory: Results


NVIDIA GPUDirect Technology

Eliminates the CPU in interprocessor communication

Based on an interface between the GPU and InfiniBand
  Both devices share pinned memory buffers
  Data written by the GPU can be sent immediately by InfiniBand

Overlapping communication with computation?
  No longer a co-processor to do the communication?

We have this technology but have yet to install it


Questions?
