
Page 1: GPU acceleration of a non-hydrostatic ocean model with a multigrid Poisson/Helmholtz solver

GPU acceleration of a non-hydrostatic ocean model with a

multigrid Poisson/Helmholtz solver

Takateru Yamagishi1, Yoshimasa Matsumura2

1 Research Organization for Information Science and Technology

2 Institute of Low Temperature Science, Hokkaido University

6th International Workshop on Advances in High-Performance Computational Earth Sciences: Applications & Frameworks

Page 2

Table of Contents

Motivation

Numerical ocean model ‘kinaco’

GPU implementation and Optimization

Evaluation and validation

Summary

Page 3

Motivation

Significance of numerical ocean modelling
Global climate, weather, marine resources, etc.

GPU's high computational performance
Explicit and detailed representation, long simulations, many experiment cases

Previous studies
Bleichrodt et al. (2012), Milakov et al. (2013), Werkhoven et al. (2013), Xu et al. (2015)
They showed high performance, but were limited to experimental studies

We aim at realistic and practical studies

Page 4

Non-hydrostatic numerical ocean model ‘kinaco’

Formation of Antarctic bottom water in the southern Weddell Sea

We try to accelerate this model on the GPU

Page 5

Basic equation of dynamics in kinaco

3D Navier-Stokes equations (fluid dynamics)

Poisson/Helmholtz equations: Δp = f,  (Δ + c)h = 0

Discretization: stencil access to the six adjacent grid cells

Solving the system of equations Ax = b: sparse matrix-vector multiplication

An efficient solver for Ax = b is required
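Since the operator only touches the six neighbours plus the centre, the SpMV can be written matrix-free. A minimal Python/NumPy sketch of such a 7-point stencil multiply (the coefficient ordering and the zero-cost halo handling are assumptions for illustration, not kinaco's actual layout):

```python
import numpy as np

def spmv_stencil(a, x):
    """Matrix-free 7-point stencil multiply, out = A @ x.

    a : (7, n1, n2, n3) coefficients, ordered here (an assumption) as
        the (k-1, j-1, i-1, centre, i+1, j+1, k+1) neighbours
    x : (n1+2, n2+2, n3+2) field with a one-cell halo
    """
    c = slice(1, -1)                     # interior points
    out = (a[0] * x[c, c, :-2]           # x(i,  j,  k-1)
         + a[1] * x[c, :-2, c]           # x(i,  j-1,k  )
         + a[2] * x[:-2, c, c]           # x(i-1,j,  k  )
         + a[3] * x[c, c, c]             # x(i,  j,  k  )
         + a[4] * x[2:, c, c]            # x(i+1,j,  k  )
         + a[5] * x[c, 2:, c]            # x(i,  j+1,k  )
         + a[6] * x[c, c, 2:])           # x(i,  j,  k+1)
    return out
```

As a sanity check, the discrete Laplacian (centre coefficient -6, neighbours 1) applied to a field that is linear in one index vanishes in the interior.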

Page 6

CG method with multigrid preconditioner (MGCG)

Fast and scalable iterative method

Matsumura and Hasumi (2008)

Preconditioner: multigrid method

Solves the equation on grids of various resolutions

multigrid method
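The MGCG structure above — a CG iteration with a preconditioner applied to the residual at every step — can be sketched as a preconditioned CG loop with a pluggable preconditioner. A minimal Python/NumPy sketch, in which a simple Jacobi step stands in for the multigrid V-cycle (the names pcg and M_inv are mine, not kinaco's):

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, maxiter=500):
    """Preconditioned conjugate gradients for a SPD matrix A.

    M_inv(r) applies the preconditioner to the residual; in MGCG this
    would be one multigrid V-cycle, but any SPD approximation of
    A^{-1} keeps the iteration convergent.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for it in range(maxiter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, it + 1
```

On a 1-D Poisson matrix, a Jacobi preconditioner is simply division by the diagonal, e.g. `pcg(A, b, lambda r: r / 2.0)`.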

Page 7

Implementation on the GPU: CUDA Fortran

kinaco is written in Fortran 90, and CUDA instructions are available

Almost the same as CUDA C; we follow the original structure of the CPU code

Good performance vs CPU is achieved

We aimed at further acceleration!

Page 8

Optimization of the MGCG solver

The MGCG solver accounts for 21% of total simulation time

Mainly consists of sparse matrix-vector multiplication

Optimization
1. Memory access
2. Hide latency by thread-/instruction-level parallelism
3. Mixed-precision preconditioner for MGCG

Page 9

Memory access in CPU kernel

DO k = 1, n3
  DO j = 1, n2
    DO i = 1, n1
      out(i,j,k) = a(-3,i,j,k) * x(i,  j,  k-1) &
                 + a(-2,i,j,k) * x(i,  j-1,k  ) &
                 + a(-1,i,j,k) * x(i-1,j,  k  ) &
                 + a( 0,i,j,k) * x(i,  j,  k  ) &
                 + a( 1,i,j,k) * x(i+1,j,  k  ) &
                 + a( 2,i,j,k) * x(i,  j+1,k  ) &
                 + a( 3,i,j,k) * x(i,  j,  k+1)
    END DO
  END DO
END DO

Sparse matrix-vector kernel in the CPU code

[Figure] Stencil locations of the matrix coefficients a(-3,i,j,k) to a(3,i,j,k).

The CPU thread loads the array 'a' within a cache line.

Page 10

Memory access in GPU kernel

Original layout a(-3:3,i,j,k): each GPU thread accesses array "a" with a stride of 7, because adjacent threads thread(id), thread(id+1), thread(id+2) read a(-3,i,j,k), a(-3,i+1,j,k), a(-3,i+2,j,k).

Transposed layout a(i,j,k,-3:3): adjacent threads read the consecutive elements a(i,j,k,-3), a(i+1,j,k,-3), a(i+2,j,k,-3), giving coalesced access to array "a".
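The effect of the index reordering can be checked numerically: emulating Fortran's column-major layout with NumPy's order='F', the memory distance between the elements read by neighbouring threads drops from 7 to 1 (the names a_orig/a_opt are mine, for illustration):

```python
import numpy as np

n1 = n2 = n3 = 8

# Fortran-order arrays: the FIRST index varies fastest in memory,
# matching the CUDA Fortran code in the slides.
a_orig = np.zeros((7, n1, n2, n3), order='F')   # a(-3:3, i, j, k)
a_opt  = np.zeros((n1, n2, n3, 7), order='F')   # a(i, j, k, -3:3)

elem = a_orig.itemsize

# Memory distance between the elements read by thread(id) and
# thread(id+1), i.e. between indices (c, i, j, k) and (c, i+1, j, k):
stride_orig = a_orig.strides[1] // elem   # 7 -> strided, uncoalesced
stride_opt  = a_opt.strides[0]  // elem   # 1 -> contiguous, coalesced
print(stride_orig, stride_opt)            # -> 7 1
```

A stride of 1 means a warp's loads fall in consecutive addresses and can be coalesced into few memory transactions.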

Page 11

Hide latency by thread/Instruction-level parallelism

Hide latency = do other operations while waiting on memory

Thread-level parallelism: switch threads to hide latency

Instruction-level parallelism (Volkov, 2010): one thread with several independent operations

Comparison of the two parallelism

Page 12

Case 1: Thread-level parallelism

i = threadidx%x + blockdim%x * (blockidx%x - 1)
j = threadidx%y + blockdim%y * (blockidx%y - 1)
k = threadidx%z + blockdim%z * (blockidx%z - 1)

out(i,j,k) = a(i,j,k,-3) * x(i,  j,  k-1) &
           + a(i,j,k,-2) * x(i,  j-1,k  ) &
           + a(i,j,k,-1) * x(i-1,j,  k  ) &
           + a(i,j,k, 0) * x(i,  j,  k  ) &
           + a(i,j,k, 1) * x(i+1,j,  k  ) &
           + a(i,j,k, 2) * x(i,  j+1,k  ) &
           + a(i,j,k, 3) * x(i,  j,  k+1)

Set as many threads as possible over (i, j, k)

• 3D (i, j, k) threads are set
• One thread per grid point

Hide latency by switching among many threads

Page 13

Case 2: Instruction-level parallelism

Independent operations are repeated

i = threadidx%x + blockdim%x * (blockidx%x - 1)
j = threadidx%y + blockdim%y * (blockidx%y - 1)

DO k = 1, n3
  out(i,j,k) = a(i,j,k,-3) * x(i,  j,  k-1) &
             + a(i,j,k,-2) * x(i,  j-1,k  ) &
             + a(i,j,k,-1) * x(i-1,j,  k  ) &
             + a(i,j,k, 0) * x(i,  j,  k  ) &
             + a(i,j,k, 1) * x(i+1,j,  k  ) &
             + a(i,j,k, 2) * x(i,  j+1,k  ) &
             + a(i,j,k, 3) * x(i,  j,  k+1)
END DO

Hide latency with independent instructions

• 2D (i, j) threads are set
• One thread per vertical column

Case 2 is faster

Page 14

Mixed precision for multigrid preconditioning

Low precision: better utilization of GPU resources

Preconditioning: low precision is sufficient

GPU: performance deteriorates on coarse grids

multigrid method

The number of CG iterations is unchanged with/without mixed precision
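A minimal Python/NumPy sketch of the idea: the CG iteration stays in float64 while the preconditioner is applied in float32. A Jacobi step stands in for the multigrid V-cycle here, and all names and the test problem are illustrative rather than kinaco's.

```python
import numpy as np

def pcg_mixed(A, b, M_inv, tol=1e-9, maxiter=500):
    """CG in float64 whose preconditioner is applied in float32.

    Only the preconditioner input/output is demoted to single
    precision; the matrix-vector products, dot products and the
    solution update all stay in double precision.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv(r.astype(np.float32)).astype(np.float64)
    p = z.copy()
    rz = r @ z
    for it in range(maxiter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r.astype(np.float32)).astype(np.float64)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, it + 1
```

On a small 1-D Poisson problem the float32 preconditioner still drives the double-precision residual below the tolerance, in line with the slide's observation that preconditioning does not need full precision.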

Page 15

Evaluation: experimental setting

CPU (Fujitsu SPARC64 VIIIfx) vs GPU (NVIDIA K20c), 1 CPU vs 1 GPU

Study of baroclinic instability: Visbeck et al. (1996)

Forcing: Coriolis force, temperature forcing

Structured, isotropic domain; size: (256, 256, 32)

Time step: 2 min; simulation time: 5 hours (150 steps) and 5 days (3600 steps)


Page 16

Performance

                           CPU    GPU_1  GPU_2  GPU_3  Speedup (GPU_3)
all components            174.2    42.6   39.2   37.3   4.7
Poisson/Helmholtz solver   36.8    15.8   12.4   10.5   3.5
others                    137.4    26.9   26.8   26.8   5.1

Elapsed time [s]: CPU vs GPU

CPU  : original CPU code
GPU_1: basic, typical implementation on the GPU
GPU_2: GPU_1 + memory-access optimization, latency hiding
GPU_3: GPU_2 + mixed-precision preconditioning

GPU achieved 4.7 times speedup vs CPU

5 hours (150 steps)

Page 17

Surface ocean current/velocity field

(panels: CPU, GPU_2, GPU_3)

Good reproduction of growing meanders due to baroclinic instability

Page 18

Temperature at the cross section

Good reproduction of vertical convection of water

(panels: CPU, GPU_2)

Page 19

Summary and future works

Numerical ocean model on the GPU (K20c) vs the CPU (SPARC64 VIIIfx)

4.7x faster than the CPU

Errors due to the GPU implementation are not significant for oceanic studies

Further works: application of mixed precision to other kernels, MPI implementation, realistic experiments