Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method
Josh Romero, Massimiliano Fatica - NVIDIA
Vamsi Spandan, Roberto Verzicco - Physics of Fluids, University of Twente
HPC Advisory Council Workshop, Stanford, CA, February 2018
Outline
● Introduction and Motivation
● Solver Details
● GPU implementation in CUDA Fortran
● Benchmarking and Results
● Conclusions
Introduction and Motivation
● Increased availability of GPU compute resources:
○ Explosion of interest in Machine Learning
○ Focus on energy efficiency for exascale
● Lots of choices to make:
○ OpenACC vs. CUDA
○ CUDA C vs. CUDA Fortran
● Getting existing Fortran codes up and running on GPUs can be easy if you use the right tools
● This talk focuses on getting up and running with low effort.
Solver Details
● Incompressible CFD solver for DNS computations in structured domains
● IB + structural solver using method described in [1]
○ Immersed interface contributes forcing term to fluid
○ Interface structural dynamics treated as triangulated network of springs
[1] Spandan et al., Journal of Computational Physics, 2017
[Figure: solver flowchart with boxes Initialize Solver, Compute RK step, Compute IB forcing term, and Structural update, and loop labels RK Loop and Timestep Loop]
GPU Implementation in CUDA Fortran
CUDA Fortran
● Baseline CPU code is written in Fortran so natural choice for GPU port is CUDA Fortran.
● Benefits:
○ More control than OpenACC:
■ Explicit GPU kernels written natively in Fortran are supported
■ Full control of host/device data movement
○ Directive-based programming available via CUF kernels
○ Easier to maintain than mixed CUDA C and Fortran approaches
● Requires PGI compiler (community edition available now for free)
Profiling with NVPROF + NVVP + NVTX
● NVPROF:
○ Can be used to gather detailed kernel properties and timing information
● NVIDIA Visual Profiler (NVVP):
○ Graphical interface to visualize and analyze NVPROF generated profiles
○ Does not show CPU activity out of the box
● NVIDIA Tools EXtension (NVTX) markers:
○ Enables annotation with labeled ranges within program
○ Useful for categorizing parts of profile to put activity into context
○ Can be used to visualize normally hidden CPU activity (e.g. MPI communication)
NVIDIA Visual Profiler with NVTX Markers
GPU Porting of Key Computational Routines
● In many CFD (and similar) codes, common code patterns appear:
○ Tightly-nested loop computations (computation of derivatives using stencils)
○ Common mathematical computations (Fourier transforms, matrix-algebra)
● But there are also unique patterns specific to a given application:
○ Computation of IB forcing on flow field
○ Computation of interface structural forces
Case 1: Tightly-nested loops
Consider the original CPU subroutine to compute the divergence.
subroutine divg
  use param
  use local_arrays, only: q1, q2, q3, &
                          dph, jpv, ipv, &
                          udx3m
  ...
  do kc = kstart,kend
    do jc = 1,n2m
      do ic = 1,n1m
        kp = kc+1; jp = jpv(jc); ip = ipv(ic)
        dqcap = (q1(ip,jc,kc) - q1(ic,jc,kc)) * dx1 &
               +(q2(ic,jp,kc) - q2(ic,jc,kc)) * dx2 &
               +(q3(ic,jc,kp) - q3(ic,jc,kc)) * udx3m(kc)
        dph(ic,jc,kc) = dqcap*usdtal
      enddo
    enddo
  enddo
end subroutine divg
Now, consider the version for GPU using CUF kernel directives.
subroutine divg
  use param
  use local_arrays, only: q1=>q1_d, q2=>q2_d, q3=>q3_d, &
                          dph=>dph_d, jpv=>jpv_d, ipv=>ipv_d, &
                          udx3m=>udx3m_d
  ...
  !$cuf kernel do(3)
  do kc = kstart,kend
    do jc = 1,n2m
      do ic = 1,n1m
        kp = kc+1; jp = jpv(jc); ip = ipv(ic)
        dqcap = (q1(ip,jc,kc) - q1(ic,jc,kc)) * dx1 &
               +(q2(ic,jp,kc) - q2(ic,jc,kc)) * dx2 &
               +(q3(ic,jc,kp) - q3(ic,jc,kc)) * udx3m(kc)
        dph(ic,jc,kc) = dqcap*usdtal
      enddo
    enddo
  enddo
end subroutine divg
● CUF kernel directive automatically generates GPU kernels for tightly nested loops.
● Scalar data passed by value to device.
● Array data must already be resident on device.
● To get data onto the device, CUDA Fortran allows straightforward declaration and allocation of device data.
module local_arrays
  real(8), allocatable :: q1(:,:,:)
  real(8), device, allocatable :: q1_d(:,:,:)
  ...
end module local_arrays
allocate(q1(nx,ny,nz)); q1 = 0.d0
allocate(q1_d(nx,ny,nz)); q1_d = q1
Alternative using sourced allocation:
allocate(q1_d, source = q1)
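Sourced allocation is a standard Fortran feature, so its shape-and-value semantics can be checked on the host alone. The following is a minimal host-only sketch; the program name, the array `q1_copy`, and the sizes are illustrative assumptions, not part of the solver. In the CUDA Fortran form shown on the slide, the allocated array carries the `device` attribute.

```fortran
program sourced_allocation_demo
  implicit none
  integer, parameter :: nx = 4, ny = 3, nz = 2   ! illustrative sizes (assumed)
  real(8), allocatable :: q1(:,:,:), q1_copy(:,:,:)

  allocate(q1(nx,ny,nz)); q1 = 1.5d0

  ! Sourced allocation: q1_copy takes both the shape and the values of q1.
  allocate(q1_copy, source = q1)

  ! Self-check: stop with an error if shape or contents differ.
  if (any(shape(q1_copy) /= shape(q1))) error stop "shape mismatch"
  if (any(q1_copy /= q1)) error stop "value mismatch"
  print *, "shape of copy:", shape(q1_copy)
end program sourced_allocation_demo
```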
Additional CUF kernel features
● CUF kernels can be used to perform reductions of scalar device data.
● Final reduced result can be on the host or device.
subroutine calculate_volume_gpu (Volume,nv,nf,xyz,vert_of_face)
  integer, dimension (3,nf), device, intent(in) :: vert_of_face
  real(8), dimension (nv,3), device, intent(in) :: xyz
  real(8), intent(out) :: Volume
  ...
  Volume = 0.d0
  !$cuf kernel do (1)
  do i = 1,nf
    v1 = vert_of_face(1,i)
    v2 = vert_of_face(2,i)
    v3 = vert_of_face(3,i)
    x1 = xyz(v1,1); x2 = xyz(v2,1); x3 = xyz(v3,1)
    y1 = xyz(v1,2); y2 = xyz(v2,2); y3 = xyz(v3,2)
    z1 = xyz(v1,3); z2 = xyz(v2,3); z3 = xyz(v3,3)
    Volume = Volume + (x1 * (y2*z3 - z2*y3) + &
                       x2 * (y3*z1 - z3*y1) + &
                       x3 * (y1*z2 - z1*y2))/6.d0
  enddo
end subroutine calculate_volume_gpu
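The summed quantity is the signed volume of the tetrahedron formed by each face and the origin, so the total is the enclosed volume of a closed, consistently oriented surface. As a sanity check, the same loop can be run on the host for a shape with a known volume. This is a host-only sketch: the unit tetrahedron test data and the program name are assumptions made for illustration, and the directive is left out because the arrays here are not device arrays.

```fortran
program volume_check
  implicit none
  integer, parameter :: nv = 4, nf = 4
  integer :: vert_of_face(3,nf)
  real(8) :: xyz(nv,3), volume
  real(8) :: x1, x2, x3, y1, y2, y3, z1, z2, z3
  integer :: i, v1, v2, v3

  ! Assumed test shape: tetrahedron with vertices at the origin and on each axis.
  xyz(1,:) = [0.d0, 0.d0, 0.d0]
  xyz(2,:) = [1.d0, 0.d0, 0.d0]
  xyz(3,:) = [0.d0, 1.d0, 0.d0]
  xyz(4,:) = [0.d0, 0.d0, 1.d0]

  ! Faces ordered so that every normal points out of the tetrahedron.
  vert_of_face(:,1) = [1, 3, 2]
  vert_of_face(:,2) = [1, 2, 4]
  vert_of_face(:,3) = [1, 4, 3]
  vert_of_face(:,4) = [2, 3, 4]

  ! Same loop body as the slide; on the GPU this loop would carry the
  ! CUF kernel directive and the sum would become a reduction.
  volume = 0.d0
  do i = 1, nf
    v1 = vert_of_face(1,i)
    v2 = vert_of_face(2,i)
    v3 = vert_of_face(3,i)
    x1 = xyz(v1,1); x2 = xyz(v2,1); x3 = xyz(v3,1)
    y1 = xyz(v1,2); y2 = xyz(v2,2); y3 = xyz(v3,2)
    z1 = xyz(v1,3); z2 = xyz(v2,3); z3 = xyz(v3,3)
    volume = volume + (x1 * (y2*z3 - z2*y3) + &
                       x2 * (y3*z1 - z3*y1) + &
                       x3 * (y1*z2 - z1*y2))/6.d0
  enddo

  ! The exact volume of this tetrahedron is 1/6.
  if (abs(volume - 1.d0/6.d0) > 1.d-12) error stop "unexpected volume"
  print *, "volume =", volume
end program volume_check
```

Reversing the vertex order of every face flips the sign of the result, which is a quick way to detect an inconsistently oriented mesh.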
Case 2: Common Mathematical Computations
● Beyond loop-based computations, many codes use common math computations for which there are GPU libraries readily available:
○ FFT: CUFFT
○ BLAS: CUBLAS
○ Linear Algebra: CUSOLVER
● Use wisely: Favor batched implementations when available, avoid many repeated calls of small operations
Consider the original CPU code for computing a real-to-complex FFT using the FFTW library.
coefnorm = 1.d0/(dble(n1m) * dble(n2m))
do k = kstart,kend
  do j = 1,n2m
    do i = 1,n1m
      xr(j,i) = dph(i,j,k)
    enddo
  enddo
  call dfftw_execute_dft_r2c(fwd_plan,xr,xa)
  do j = 1,n2m/2 + 1
    do i = 1,n1m
      dpho(i,j,k) = dreal(xa(j,i)) * coefnorm
      dpho(i,j+n2mh,k) = dimag(xa(j,i)) * coefnorm
    enddo
  enddo
end do
Now consider the version for GPU using the CUFFT library.
● Modified to use batched 2D FFTs
● Final loop merged with later packing loop ← kernel fusion
coefnorm = 1.d0/(dble(n1m) * dble(n2m))
!$cuf kernel do (3)
do k = kstart,kend
  do j = 1,n2m
    do i = 1,n1m
      xr_d(j,i,k) = dph_d(i,j,k)
    enddo
  enddo
enddo
istat = cufftExecD2Z(cufft_fwd_plan, xr_d, xa_d)
! Scaling/rearrangement combined with subsequent loop
integer :: cufft_fwd_plan
integer :: rank(2), inembed(2), onembed(2)
rank(1) = n1m; rank(2) = n2m
inembed(1) = n1m; inembed(2) = n2m
onembed(1) = n1m; onembed(2) = n2m/2 + 1
istat = cufftPlanMany(cufft_fwd_plan, 2, rank, inembed, 1, &
                      n1m*n2m, onembed, 1, n1m*(n2m/2 + 1), &
                      CUFFT_D2Z, kend-kstart+1)
Interfaces for BLAS routines
● PGI provides overloaded interfaces for BLAS routines.
● Calls with device-resident arrays are automatically passed to the CUBLAS library.
use cudafor
use cublas
integer :: m, n, k
real(8) :: alpha, beta
real(8) :: a(m,k), b(k,n), c(m,n)
real(8), device :: a_d(m,k), b_d(k,n), c_d(m,n)
...
! DGEMM using linked CPU library
call DGEMM('N', 'N', m, n, k, alpha, a, m, b, k, &
           beta, c, m)
! DGEMM using CUBLAS
call DGEMM('N', 'N', m, n, k, alpha, a_d, m, b_d, k, &
           beta, c_d, m)
Case 3: Unique computations
● The need for custom kernels arises in most programs:
○ Unique computations not amenable to a CUF kernel
○ Common mathematical operation, but no good GPU library implementation:
■ Tridiagonal LU factorization/solves with multiple RHS
○ Pattern of library usage that would be poor performing on GPU:
■ Data interpolation from flow grid to structural grid involves many small matrix and vector computations.
Example 1: Batched Tridiagonal Solver
● Flow solver requires tridiagonal LU factorization/solves with multiple RHS
● Wrote batched tridiagonal solver using Thomas algorithm
● One GPU thread assigned per RHS
● To ensure coalesced access of RHS values by threads, data transposition required:
rhs_d(1:N1*N2, 1:NRHS) → rhs_t_d(1:NRHS, 1:N1*N2)
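The approach above can be illustrated with a host-only sketch of the Thomas algorithm applied to several right-hand sides stored in the transposed layout. Everything specific here is an assumption for illustration: the names `n`, `nrhs`, `a`, `b`, `c`, `cp`, and `rhs_t`, the test matrix, and the choice to share one matrix across all right-hand sides. The loop over `r` stands in for the GPU thread index, and this is not the solver's actual kernel.

```fortran
program batched_thomas_demo
  implicit none
  integer, parameter :: n = 5, nrhs = 3       ! illustrative sizes (assumed)
  real(8) :: a(n), b(n), c(n)                 ! sub-, main, and super-diagonal
  real(8) :: cp(n)                            ! modified super-diagonal
  real(8) :: x_exact(nrhs,n), rhs_t(nrhs,n)   ! RHS index is the fastest index
  real(8) :: denom
  integer :: r, k

  ! Diagonally dominant test matrix; a(1) and c(n) are never used.
  a = -1.d0; b = 4.d0; c = -1.d0

  ! Build right-hand sides from known solutions so the result can be checked.
  do k = 1, n
    do r = 1, nrhs
      x_exact(r,k) = dble(r) + 0.1d0*dble(k)
    enddo
  enddo
  do k = 1, n
    do r = 1, nrhs
      rhs_t(r,k) = b(k)*x_exact(r,k)
      if (k > 1) rhs_t(r,k) = rhs_t(r,k) + a(k)*x_exact(r,k-1)
      if (k < n) rhs_t(r,k) = rhs_t(r,k) + c(k)*x_exact(r,k+1)
    enddo
  enddo

  ! One "thread" per right-hand side: at a given step k, neighbouring r values
  ! touch neighbouring memory locations rhs_t(r,k), which is what makes the
  ! accesses coalesced on a GPU.
  do r = 1, nrhs
    ! Forward elimination (cp is recomputed per r here for simplicity).
    cp(1) = c(1)/b(1)
    rhs_t(r,1) = rhs_t(r,1)/b(1)
    do k = 2, n
      denom = b(k) - a(k)*cp(k-1)
      cp(k) = c(k)/denom
      rhs_t(r,k) = (rhs_t(r,k) - a(k)*rhs_t(r,k-1))/denom
    enddo
    ! Back substitution; the solution overwrites rhs_t.
    do k = n-1, 1, -1
      rhs_t(r,k) = rhs_t(r,k) - cp(k)*rhs_t(r,k+1)
    enddo
  enddo

  if (maxval(abs(rhs_t - x_exact)) > 1.d-12) error stop "solve failed"
  print *, "max error:", maxval(abs(rhs_t - x_exact))
end program batched_thomas_demo
```

With the untransposed layout, the same step `k` would have neighbouring threads reading locations a full system length apart, which is the access pattern the transposition avoids.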
Example 2: Data Interpolation Between Grids
This is the most time-consuming operation in the immersed boundary method (IBM) portion of the solver.
The goal is to compute an interpolated value on the structural grid from the flow grid.
For a given triangle i:
● Form 27-point support domain around triangle centroid.
● Compute transfer function, using support point and centroid data.
● Final centroid result scattered back to support points or to triangle vertices.
Computation of transfer function for each triangle requires:
● 4 x 4 matrix inversion
● Several small matrix-vector multiplies:
○ [1 x 4][4 x 4] and [1 x 4][4 x 27]
Final computation of the interpolated value is an inner product of 27 values.
GPU strategy:
● Process each triangle using a warp (a unit of 32 threads), map threads to support points
● Data is warp-local → most matrix algebra can be completed efficiently using warp shuffle intrinsics.
● Scattering of final result completed using atomic adds.
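To make the warp-level pattern concrete, the sketch below emulates on the host how a 27-value inner product could be reduced across 32 lanes by repeated shuffle-down style exchanges. It is only an emulation under stated assumptions: the arrays `w` and `f`, their values, and the program name are hypothetical, the array shift stands in for the hardware shuffle, and the talk does not show the actual kernel.

```fortran
program warp_reduce_emulation
  implicit none
  integer, parameter :: warp_size = 32, npts = 27
  real(8) :: w(npts), f(npts)        ! hypothetical weights and field samples
  real(8) :: lane(warp_size), shifted(warp_size)
  real(8) :: expected
  integer :: i, offset

  do i = 1, npts
    w(i) = 1.d0/dble(npts)
    f(i) = dble(i)
  enddo
  expected = dot_product(w, f)

  ! Each lane holds one product; lanes 28..32 are idle and contribute zero.
  lane = 0.d0
  lane(1:npts) = w*f

  ! Tree reduction: at each step, lane i adds the value held by lane i+offset.
  ! Lanes without a partner add zero in this emulation; only the result in
  ! lane 1 is used, and it does not depend on what those lanes receive.
  offset = warp_size/2
  do while (offset >= 1)
    shifted = 0.d0
    shifted(1:warp_size-offset) = lane(1+offset:warp_size)
    lane = lane + shifted
    offset = offset/2
  enddo

  ! Lane 1 now holds the full inner product (up to rounding differences
  ! caused by the different summation order).
  if (abs(lane(1) - expected) > 1.d-12) error stop "reduction mismatch"
  print *, "inner product:", lane(1)
end program warp_reduce_emulation
```

The reduction finishes in five exchange steps for 32 lanes, with no shared-memory traffic, which is why mapping one triangle to one warp suits a 27-point support domain.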
Benchmarking and Results
Verification Case
Benchmarking Case
● Unit cube, quiescent flow
● N = 128, 256, 384
● # of Particles = 1, 8, 27, 64
● Particle Resolution = 1280, 5120, 20480 triangles
● Run on:
○ 1x 16-core Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
○ 1x NVIDIA Tesla V100 PCIe
Grid Resolution
Fixed # of Particles = 8
Particle Resolution = 5120 triangles
Fluid:
● 10 to 14x speedup vs. CPU
IB + Structural:
● 40 to 100x speedup vs. CPU
● Percentage of time:
○ CPU: 72% to 14%
○ GPU: 20% to 6%
Particle Resolution
Fixed N = 256
Fixed # of Particles = 8
IB + structural solver time increases at reduced rate on GPU:
● CPU: 15% to 55%
● GPU: 6% to 13%
Number of Particles
Fixed N = 256
Particle Resolution = 5120 triangles
IB + Structural solver time increases at similar rates:
● CPU: 14% to 59%
● GPU: 5% to 22%
Conclusions
● Porting research codes to GPUs is worth the investment
○ Faster runtimes enable larger cases, more rapid experimentation
● Large performance gains can be achieved with low effort using CUDA Fortran
○ CUF kernel directives
○ CUDA-enabled libraries
○ Custom kernels when all else fails
● Working with developers to apply current code to challenging research cases
● Some previous work with these developers can be found on GitHub: https://github.com/PhysicsofFluids/AFiD_GPU_opensource