GPGPU Seminar: Acceleration of the Lattice Boltzmann Method Using CUDA Fortran

Tomohiro Degawa (出川智啓), Department of Electrical, Electronics and Information Engineering, Nagaoka University of Technology

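Goals of this seminar

Learn how to use the advanced GPGPU simulation system
Learn how to put GPUs to use
Learn CUDA programming techniques
Learn parallel computing methods

GPGPU Seminar, 2016/1/13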

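Today's contents

Accelerating a fluid application with CUDA Fortran
  Lattice Boltzmann method
  D2Q9 model
  Naive GPU implementation
  Optimizing memory usage and data structures
  Miscellaneous acceleration techniques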


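Scales

Macroscopic scale
  Continuum approximation
  Uses numerical methods for partial differential equations
    Finite difference, finite element, finite volume methods, etc.
  Difficulties such as nonlinearity and large systems of linear equations

Microscopic scale
  Molecular dynamics
  Tracks the behavior of individual atoms
  Impractical computational cost (10^22 particles per liter)

Ensembles of particles
  Microscopic model and macroscopic equations of motion
  Computes the time evolution equation of the particle distribution function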

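Governing equations

Time evolution equation of the particle distribution function

\[ \frac{\partial f_p(\mathbf{x},t)}{\partial t} + \mathbf{c}_p \cdot \nabla f_p(\mathbf{x},t) = \Omega_p(f) \]

f: particle distribution function, c: advection velocity of the particles, t: time, x: position vector on the orthogonal lattice, p: particle number (direction), Ω: collision term

BGK model
  Simplifies the collision term
  Bhatnagar-Gross-Krook equation

\[ \frac{\partial f_p(\mathbf{x},t)}{\partial t} + \mathbf{c}_p \cdot \nabla f_p(\mathbf{x},t) = -\frac{1}{\tau}\left( f_p(\mathbf{x},t) - f_p^{eq}(\mathbf{x},t) \right) \]

τ: relaxation time, f^eq: local equilibrium distribution function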

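Discretization of the equation

Lattice BGK equation for the particle distribution function
  Solved as an initial value and boundary value problem

Starting from the BGK equation

\[ \frac{\partial f_p(\mathbf{x},t)}{\partial t} + \mathbf{c}_p \cdot \nabla f_p(\mathbf{x},t) = -\frac{1}{\tau}\left( f_p(\mathbf{x},t) - f_p^{eq}(\mathbf{x},t) \right) \]

  Time discretization (first-order Euler method)
  Space discretization (first-order upwind difference)

the lattice BGK equation is obtained:

\[ f_p(\mathbf{x} + \mathbf{c}_p \Delta t,\, t + \Delta t) = f_p(\mathbf{x},t) - \frac{1}{\tau}\left( f_p(\mathbf{x},t) - f_p^{eq}(\mathbf{x},t) \right) \]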

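Computing the macroscopic quantities

Definition of the macroscopic quantities (the usual physical quantities of a fluid)
  Temperature variation is not treated

Density ρ

\[ \rho = \sum_{p=0}^{N-1} f_p \]

Velocity vector u

\[ \rho u_i = \sum_{p=0}^{N-1} f_p c_{pi} \quad\text{hence}\quad u_i = \frac{1}{\rho} \sum_{p=0}^{N-1} f_p c_{pi} \]

N: number of particles at position x (particles are numbered 0 to N−1)
i: spatial direction (x1, x2)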

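D2Q9 model

[Figure: the nine directions 0 to 8 of the D2Q9 lattice]

Each lattice point holds nine particles; after Δt seconds they move to the eight surrounding lattice points
  One particle stays where it is
  An advection velocity is defined for each direction of motion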

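D2Q9 model

Distribution function, weight coefficients, and advection velocity vectors

\[ f_p^{eq} = w_p \rho \left[ 1 + 3(\mathbf{c}_p \cdot \mathbf{u}) + \frac{9}{2}(\mathbf{c}_p \cdot \mathbf{u})^2 - 1.5\,\mathbf{u}^2 \right] \]

Direction p   Advection velocity   Weight coefficient
0             ( 0,  0)             4/9
1             ( 1,  0)             1/9
2             ( 0,  1)             1/9
3             (-1,  0)             1/9
4             ( 0, -1)             1/9
5             ( 1,  1)             1/36
6             (-1,  1)             1/36
7             (-1, -1)             1/36
8             ( 1, -1)             1/36

[Figure: directions 0 to 8 on the lattice points (i−1, i, i+1) × (j−1, j, j+1)]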
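The equilibrium formula and the table above can be checked at a single lattice point. The following is a minimal sketch in plain Fortran, not the seminar's own routine: the module name `d2q9_equilibrium_sketch` and the function `equilibrium` are hypothetical names chosen for illustration, while the weights and velocities are copied from the table.

```fortran
module d2q9_equilibrium_sketch
  implicit none
  integer, parameter :: dp = kind(1d0)
  ! Weights and advection velocities for directions 0..8 (from the table above)
  real(dp), parameter :: w(0:8) = [4d0/9d0, 1d0/9d0, 1d0/9d0, 1d0/9d0, 1d0/9d0, &
                                   1d0/36d0, 1d0/36d0, 1d0/36d0, 1d0/36d0]
  integer, parameter :: cx(0:8) = [0, 1, 0, -1,  0, 1, -1, -1,  1]
  integer, parameter :: cy(0:8) = [0, 0, 1,  0, -1, 1,  1, -1, -1]
contains
  ! Local equilibrium distribution at one lattice point (hypothetical helper)
  pure function equilibrium(rho, u, v) result(feq)
    real(dp), intent(in) :: rho, u, v
    real(dp) :: feq(0:8)
    real(dp) :: cu, usq
    integer :: p
    usq = u*u + v*v
    do p = 0, 8
      cu = cx(p)*u + cy(p)*v          ! c_p . u
      feq(p) = w(p)*rho*(1d0 + 3d0*cu + 4.5d0*cu*cu - 1.5d0*usq)
    end do
  end function equilibrium
end module d2q9_equilibrium_sketch
```

Summing the returned array over p recovers rho, and summing it weighted by `cx` or `cy` recovers rho*u or rho*v, which matches the definitions of the macroscopic quantities.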

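Collision Step

Computing the collision term
  Interaction that accompanies the motion of the particles
  Needs no information about particles on other lattice points
  A simple, local computation (self-contained at a single point)
  Well suited to parallel computation

\[ \tilde{f}_p(\mathbf{x},t) = f_p(\mathbf{x},t) - \frac{1}{\tau}\left( f_p(\mathbf{x},t) - f_p^{eq}(\mathbf{x},t) \right) \]

    f(p,i,j) = f(p,i,j)-(f(p,i,j)-f_eq(p,i,j))/τ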

Stream Step

Moving the particles
  Each particle moves to a neighboring lattice point according to its own advection velocity
  A simple memory copy

\[ f_p(\mathbf{x} + \mathbf{c}_p \Delta t,\, t + \Delta t) = \tilde{f}_p(\mathbf{x},t) \]

    f(0,i  ,j  ) = f(0,i,j)
    f(1,i+1,j  ) = f(1,i,j)
    f(2,i  ,j+1) = f(2,i,j)
    f(3,i-1,j  ) = f(3,i,j)
    f(4,i  ,j-1) = f(4,i,j)
    :

[Figure: particles 0 to 8 at lattice point (i, j) moving to the neighboring points (i−1, i, i+1) × (j−1, j, j+1)]

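Boundary conditions

The particle distribution functions are determined from boundary conditions on the macroscopic quantities

Inflow and outflow boundary conditions
  Not treated here

Fixed wall boundary condition
  The wall lies on the lattice points
  No-slip condition
    Bounce Back

Moving wall boundary condition (Zou-He boundary condition)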

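Bounce Back

Boundary condition for a no-slip wall
  A particle that hits a solid wall bounces back in the direction it came from
  Simple but very effective

[Figure: at a wall lattice point, the particles directed into the wall are replaced by the particles of the opposite direction]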
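As a rough illustration of the bounce-back rule, the sketch below applies it to the nine distribution values of a single point on the bottom (south) wall. It is plain Fortran with hypothetical names (`bounce_back_sketch`, `bounce_back_south`); the `opposite` table simply pairs each direction number with the one pointing the other way, under the direction numbering used in these slides.

```fortran
module bounce_back_sketch
  implicit none
  integer, parameter :: dp = kind(1d0)
  ! opposite(p) is the direction pointing the other way from p (0 maps to itself)
  integer, parameter :: opposite(0:8) = [0, 3, 4, 1, 2, 7, 8, 5, 6]
contains
  ! Bounce back at one lattice point on the south wall (hypothetical helper):
  ! the three directions pointing up into the fluid (2, 5, 6) are overwritten
  ! by the values of their opposite directions (4, 7, 8).
  pure subroutine bounce_back_south(f)
    real(dp), intent(inout) :: f(0:8)
    f(2) = f(opposite(2))
    f(5) = f(opposite(5))
    f(6) = f(opposite(6))
  end subroutine bounce_back_south
end module bounce_back_sketch
```

After the call, the wall-normal momentum f(2)+f(5)+f(6)-f(4)-f(7)-f(8) is zero, which is the property the no-slip wall relies on.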

Zou-He boundary condition

Determines the distribution functions when the velocity is prescribed on the boundary
  A local method that needs no information from other lattice points
  Solves simultaneously for the distribution functions inside the domain, the density ρ, and the momentum flux (ρu1, ρu2)
  Zou, Q. and He, X., Phys. Fluids, 9 (1997), 1591-1598
  Also applicable to inflow boundaries (where a normal velocity exists)

On the top wall moving with velocity U, f_7, f_4, f_8 are determined from f_0, f_1, f_2, f_3, f_5, f_6 and the velocity on the boundary:

\[ \rho_B = f_0 + f_1 + f_3 + 2(f_2 + f_5 + f_6), \qquad f_4 = f_2 \]
\[ f_7 = f_5 - \rho_B U / 6 \]
\[ f_8 = f_6 + \rho_B U / 6 \]

[Figure: lattice point on the top wall moving with velocity U; directions 4, 7, 8 are the unknowns]
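The three relations for the moving wall can be tried on one lattice point. This is a minimal sketch in plain Fortran that follows the formulas exactly as they appear on the slide; the names `moving_wall_sketch` and `moving_wall_north` are hypothetical and do not come from the seminar's source files.

```fortran
module moving_wall_sketch
  implicit none
  integer, parameter :: dp = kind(1d0)
contains
  ! Moving-wall condition at one lattice point on the north wall
  ! (hypothetical helper following the slide's formulas).
  pure subroutine moving_wall_north(f, u_wall)
    real(dp), intent(inout) :: f(0:8)
    real(dp), intent(in)    :: u_wall
    real(dp) :: rho_b
    ! Wall density from the known directions 0, 1, 2, 3, 5, 6
    rho_b = f(0) + f(1) + f(3) + 2d0*(f(2) + f(5) + f(6))
    ! Unknown directions 4, 7, 8 point from the wall into the fluid
    f(4) = f(2)
    f(7) = f(5) - rho_b*u_wall/6d0
    f(8) = f(6) + rho_b*u_wall/6d0
  end subroutine moving_wall_north
end module moving_wall_sketch
```

With these assignments the sum of all nine values equals rho_b and the wall-normal momentum vanishes, so no mass crosses the wall.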

Illustration of particle motion (Stream)

[Figure: particles 0 to 8 on every lattice point of the cavity after the Stream Step; the top wall moves with velocity U]

Illustration of particle motion (boundary conditions)

[Figure: particles 0 to 8 on every lattice point of the cavity after the boundary conditions are applied]

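Computational procedure

1. Set the macroscopic density ρ and the velocities u1, u2 of the initial flow field
2. Compute the local equilibrium distribution function f^eq
3. Compute the collision term
4. Stream the particles
5. Apply the boundary conditions
6. Compute the macroscopic density ρ and the velocities u1, u2
7. Take f^eq as the distribution function f, return to step 2, and repeat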

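Writing the LBM program

Uses Fortran 90/95 and CUDA Fortran

Computes a cavity flow
  A lid placed over a cavity moves at constant velocity

Initial conditions
  Fluid at rest
    Constant density
    Zero velocity

Boundary conditions
  The left, right, and bottom walls are fixed walls
  Only the top wall is a moving wall

[Figure: cavity with x and y axes]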

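Simulation parameters

Parameters in physical space must be mapped to parameters in the discrete space of the lattice Boltzmann method

Parameters needed for incompressible viscous flow
  Length
  Time
  Kinematic viscosity
    Reynolds number = length × velocity / kinematic viscosity = length² / time / kinematic viscosity
    Velocity = length / time

Physical space
  Characteristic length [m]       L
  Characteristic velocity [m/s]   U
  Characteristic time [s]         T = L/U
  Kinematic viscosity [m²/s]      ν
  Reynolds number [-]             Re = LU/ν = L²/T/ν

Physical space (nondimensionalized)
  Characteristic length [-]       L* = 1
  Characteristic time [-]         T* = 1
  Characteristic velocity [-]     U* = L*/T* = 1
  Kinematic viscosity [-]         ν* = 1/Re

LBM discrete space
  Characteristic length [-]       L_LB
  Characteristic time [-]         T_LB
  Velocity [-]                    U_LB = T_LB U*/L_LB = T_LB/L_LB
  Kinematic viscosity [-]         ν_LB = T_LB/L_LB²/Re
  Relaxation time [-]             τ = 3ν_LB + 0.5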
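A small worked example may make the mapping concrete. The sketch below is plain Fortran with hypothetical names (`lbm_parameter_sketch`, `lattice_viscosity`, `relaxation_time`); it assumes, as the accompanying `SimulationParameter` module does, that the lattice spacing is dx = 1/(number of cells) and that the wall velocity Uwall plays the role of dt/dx.

```fortran
module lbm_parameter_sketch
  implicit none
  integer, parameter :: dp = kind(1d0)
contains
  ! Kinematic viscosity in lattice units, nu_LB = (dt/dx)/dx/Re with dt/dx = u_wall
  pure function lattice_viscosity(re, num_cell, u_wall) result(nu_lb)
    real(dp), intent(in) :: re, u_wall
    integer,  intent(in) :: num_cell
    real(dp) :: nu_lb, dx
    dx = 1d0/real(num_cell, dp)
    nu_lb = u_wall/dx/re
  end function lattice_viscosity

  ! Relaxation time tau = 3*nu_LB + 0.5
  pure function relaxation_time(nu_lb) result(tau)
    real(dp), intent(in) :: nu_lb
    real(dp) :: tau
    tau = 3d0*nu_lb + 0.5d0
  end function relaxation_time
end module lbm_parameter_sketch
```

For Re = 1000, 512 cells, and a wall velocity of 0.5, this gives ν_LB = 0.5 × 512 / 1000 = 0.256 and τ = 3 × 0.256 + 0.5 = 1.268.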

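Choosing the characteristic time

Characteristic time T_LB ≈ L_LB²
  Derived from an analysis of the error associated with compressibility
  Analogous to the numerical stability condition of Euler-type explicit schemes such as the finite difference method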

File: module_SimulationParameter.f90 (located in source/cpu/)

    module SimulationParameter
      implicit none
      integer,parameter :: Nt = 50000
      real(8),parameter :: Re = 1000d0

      integer,parameter :: NumCell_x = 512
      integer,parameter :: NumCell_y = NumCell_x
      integer,parameter :: Nx = NumCell_x
      integer,parameter :: Ny = NumCell_y

      real(8),parameter :: dx = 1d0/dble(NumCell_x)
      real(8),parameter :: Uwall = 0.5d0 !dt/dx
      real(8),parameter :: dt = Uwall*dx !dx**2
      real(8),parameter :: KineticViscosity = Uwall/dx/Re

      real(8),parameter :: RelaxTime = 3d0*KineticViscosity + 0.5d0
    end module SimulationParameter

Note: properly, dt should be chosen first and Uwall derived from it.

D2Q9 model (parameters)

File: module_D2Q9Model.f90

    integer,parameter :: Center    = 0
    integer,parameter :: Right     = 1
    integer,parameter :: Up        = 2
    integer,parameter :: Left      = 3
    integer,parameter :: Down      = 4
    integer,parameter :: UpRight   = 5
    integer,parameter :: UpLeft    = 6
    integer,parameter :: DownLeft  = 7
    integer,parameter :: DownRight = 8
    integer,parameter :: First = Center
    integer,parameter :: Last  = DownRight
    integer,parameter :: Opposite(First:Last) = (/ Center, Left, Down, Right, Up,&
                                                   DownLeft, DownRight, UpRight, UpLeft /)

    real(8),parameter :: Weight(First:Last) = (/ 4d0/ 9d0,&
                                                 1d0/ 9d0, 1d0/ 9d0, 1d0/ 9d0, 1d0/ 9d0,&
                                                 1d0/36d0, 1d0/36d0, 1d0/36d0, 1d0/36d0 /)

    integer,parameter :: ConvVelx(First:Last) = (/ 0, 1, 0,-1, 0, 1,-1,-1, 1 /)
    integer,parameter :: ConvVely(First:Last) = (/ 0, 0, 1, 0,-1, 1, 1,-1,-1 /)

Setting the initial macroscopic quantities

    subroutine computeIntialMacroQuantities(velx,vely,dens)
      use SimulationParameter
      implicit none
      real(8),intent(inout) :: velx(Nx,Ny)
      real(8),intent(inout) :: vely(Nx,Ny)
      real(8),intent(inout) :: dens(Nx,Ny)

      integer :: i,j

      velx(:,:) = 0d0
      vely(:,:) = 0d0
      dens(:,:) = 1d0
      velx(2:Nx-1,Ny) = Uwall   !moving top wall

    end subroutine computeIntialMacroQuantities


Computing the macroscopic density ρ and the velocities u1, u2

    subroutine computeMacroQuantities(f,velx,vely,dens)
      use SimulationParameter
      implicit none
      real(8),intent(in)    :: f(First:Last,1:Nx,1:Ny)
      real(8),intent(inout) :: velx(1:Nx,1:Ny)
      real(8),intent(inout) :: vely(1:Nx,1:Ny)
      real(8),intent(inout) :: dens(1:Nx,1:Ny)
      integer :: i,j
      real(8) :: f_boundary, f_exterior

      !density at every lattice point
      do j=1,Ny
        do i=1,Nx
          dens(i,j) = f(Center ,i,j) + f(Right   ,i,j) + f(Up       ,i,j)&
                    + f(Left   ,i,j) + f(Down    ,i,j) + f(UpRight  ,i,j)&
                    + f(UpLeft ,i,j) + f(DownLeft,i,j) + f(DownRight,i,j)
        end do
      end do
      !density on the moving (north) wall
      do i=2,Nx-1
        f_boundary = f(Center,i,Ny) + f(  Right,i,Ny) + f(  Left,i,Ny)
        f_exterior = f(Up    ,i,Ny) + f(UpRight,i,Ny) + f(UpLeft,i,Ny)
        dens(i,Ny) = f_boundary + 2d0*f_exterior
      end do

      !velocity at the interior lattice points
      do j=2,Ny-1
        do i=2,Nx-1
          velx(i,j) = ( f(Center   ,i,j)*ConvVelx(Center   )&
                       +f(Right    ,i,j)*ConvVelx(Right    )&
                       +f(Up       ,i,j)*ConvVelx(Up       )&
                       +f(Left     ,i,j)*ConvVelx(Left     )&
                       +f(Down     ,i,j)*ConvVelx(Down     )&
                       +f(UpRight  ,i,j)*ConvVelx(UpRight  )&
                       +f(UpLeft   ,i,j)*ConvVelx(UpLeft   )&
                       +f(DownLeft ,i,j)*ConvVelx(DownLeft )&
                       +f(DownRight,i,j)*ConvVelx(DownRight))/dens(i,j)

          vely(i,j) = ( f(Center   ,i,j)*ConvVely(Center   )&
                       +f(Right    ,i,j)*ConvVely(Right    )&
                       +f(Up       ,i,j)*ConvVely(Up       )&
                       +f(Left     ,i,j)*ConvVely(Left     )&
                       +f(Down     ,i,j)*ConvVely(Down     )&
                       +f(UpRight  ,i,j)*ConvVely(UpRight  )&
                       +f(UpLeft   ,i,j)*ConvVely(UpLeft   )&
                       +f(DownLeft ,i,j)*ConvVely(DownLeft )&
                       +f(DownRight,i,j)*ConvVely(DownRight))/dens(i,j)
        end do
      end do

    end subroutine computeMacroQuantities


Local equilibrium distribution function

    subroutine computeLocalEquilibriumFunction(f_eq,velx,vely,dens)
      use SimulationParameter
      implicit none
      real(8),intent(inout) :: f_eq(First:Last,1:Nx,1:Ny)
      real(8),intent(in)    :: velx(1:Nx,1:Ny)
      real(8),intent(in)    :: vely(1:Nx,1:Ny)
      real(8),intent(in)    :: dens(1:Nx,1:Ny)
      real(8) :: u,v,conv_velo,velo_square
      integer :: i,j,direction

      do j=1,Ny
        do i=1,Nx
          u = velx(i,j)
          v = vely(i,j)
          velo_square = u*u + v*v
          do direction = First,Last
            conv_velo = u*ConvVelx(direction) + v*ConvVely(direction)
            f_eq(direction,i,j) = Weight(direction)*dens(i,j)&
              *(1d0 + 3d0*conv_velo + 4.5d0*conv_velo*conv_velo - 1.5d0*velo_square)
          end do
        end do
      end do

    end subroutine computeLocalEquilibriumFunction


Collision Step

    subroutine collide(f,f_eq)
      use SimulationParameter
      implicit none
      real(8),intent(inout) :: f   (First:Last,1:Nx,1:Ny)
      real(8),intent(in)    :: f_eq(First:Last,1:Nx,1:Ny)

      integer :: i,j,direction

      do j=1,Ny
        do i=1,Nx
          f(:,i,j) = f(:,i,j) + (f_eq(:,i,j)-f(:,i,j))/RelaxTime
        end do
      end do

    end subroutine collide


Stream Step

    subroutine stream(f)
      use SimulationParameter
      implicit none
      real(8),intent(inout) :: f(First:Last,1:Nx,1:Ny)
      integer :: i,j

      do j=1,Ny
        do i=Nx,2,-1 !RIGHT TO LEFT
          f(Right,i,j)=f(Right,i-1,j)
        end do
        do i=1,Nx-1 !LEFT TO RIGHT
          f(Left,i,j)=f(Left,i+1,j)
        end do
      end do

      do j=Ny,2,-1 !TOP TO BOTTOM
        do i=1,Nx
          f(Up,i,j)=f(Up,i,j-1)
        end do
        do i=Nx,2,-1
          f(UpRight,i,j)=f(UpRight,i-1,j-1)
        end do
        do i=1,Nx-1
          f(UpLeft,i,j)=f(UpLeft,i+1,j-1)
        end do
      end do

      do j=1,Ny-1 !BOTTOM TO TOP
        do i=1,Nx
          f(Down,i,j)=f(Down,i,j+1)
        end do
        do i=1,Nx-1
          f(DownLeft,i,j)=f(DownLeft,i+1,j+1)
        end do
        do i=Nx,2,-1
          f(DownRight,i,j)=f(DownRight,i-1,j+1)
        end do
      end do
    end subroutine stream


Boundary conditions

    subroutine imposeBoundayCondition(f)
      use SimulationParameter
      implicit none

      real(8),intent(inout) :: f(First:Last,1:Nx,1:Ny)

      integer :: i,j
      real(8) :: dens_wall
      real(8) :: f_boundary, f_exterior

      do j=1,Ny
        !bounce back on west boundary
        f(    Right, 1,j) = f(Opposite(    Right), 1,j)
        f(  UpRight, 1,j) = f(Opposite(  UpRight), 1,j)
        f(DownRight, 1,j) = f(Opposite(DownRight), 1,j)
        !bounce back on east boundary
        f(    Left ,Nx,j) = f(Opposite(    Left ),Nx,j)
        f(DownLeft ,Nx,j) = f(Opposite(DownLeft ),Nx,j)
        f(  UpLeft ,Nx,j) = f(Opposite(  UpLeft ),Nx,j)
      end do

      !bounce back on south boundary
      do i=1,Nx
        f(Up     ,i,1)=f(Opposite(Up     ),i,1)
        f(UpRight,i,1)=f(Opposite(UpRight),i,1)
        f(UpLeft ,i,1)=f(Opposite(UpLeft ),i,1)
      end do
      !moving wall, north boundary
      do i=2,Nx-1
        f_boundary = f(Center,i,Ny)+f(  Right,i,Ny)+f(  Left,i,Ny)
        f_exterior = f(Up    ,i,Ny)+f(UpRight,i,Ny)+f(UpLeft,i,Ny)
        dens_wall = f_boundary + 2d0*f_exterior
        f(Down     ,i,Ny)=f(Opposite(Down     ),i,Ny)
        f(DownRight,i,Ny)=f(Opposite(DownRight),i,Ny) + dens_wall*Uwall/6d0
        f(DownLeft ,i,Ny)=f(Opposite(DownLeft ),i,Ny) - dens_wall*Uwall/6d0
      end do

    end subroutine imposeBoundayCondition


Main routine

File: lbm_cavity.f90

    program LBM_Cavity
      use SimulationParameter
      use D2Q9Model
      implicit none
      real(8),allocatable :: velx(:,:)   !macroscopic velocity vector and density
      real(8),allocatable :: vely(:,:)   !
      real(8),allocatable :: dens(:,:)   !
      real(8),allocatable :: f   (:,:,:) !distribution function
      real(8),allocatable :: f_eq(:,:,:) !local equilibrium distribution function
      integer :: n

      allocate( velx(1:Nx,1:Ny))
      allocate( vely(1:Nx,1:Ny))
      allocate( dens(1:Nx,1:Ny))
      allocate(f   (First:Last,1:Nx,1:Ny)); f    = 0d0 !zero-initialized, as the GPU version does
      allocate(f_eq(First:Last,1:Nx,1:Ny)); f_eq = 0d0

      call computeIntialMacroQuantities(velx,vely,dens)
      do n=1,Nt
        call computeLocalEquilibriumFunction(f_eq,velx,vely,dens)
        call collide(f,f_eq)
        call stream(f)
        call imposeBoundayCondition(f)
        call computeMacroQuantities(f,velx,vely,dens)
      end do

      deallocate(f   )
      deallocate(f_eq)
      deallocate(velx)
      deallocate(vely)
      deallocate(dens)

    end program LBM_Cavity

Compiling the program

The compiler is pgfortran
  pgf90 also works

Generate the object files without linking

    $ pgf90 -c module_SimulationParameter.f90
    $ pgf90 -c module_D2Q9Model.f90
    $ pgf90 -c lbm_cavity.f90

Link the object files to generate the executable

    $ pgf90 -o lbm_cavity *.o

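Results

Computational conditions
  Number of lattice points   512
  Moving wall velocity       0.5
  Reynolds number            1000
  Time steps                 0 to 50000

Execution time
  512×512      52 ms/step
  1024×1024    214 ms/step
  2048×2048    900 ms/step

[Figure: distribution of u1 in the cavity, color scale from −0.2 to 0.5]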

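Porting to the GPU

If all that is needed for now is to get the program running on the GPU...
  Change the file extension to .cuf
  Add use cudafor
  Account for the requirements of the GPU
    Add attributes(global) to the subroutines
    Insert <<<1,1>>> between the kernel name and its arguments
    Give the device attribute to memory used on the GPU
      No change to allocate() is needed
    Use the assignment operator (=) to exchange data with the GPU

Optimization can be considered later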

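Handling the D2Q9 model parameters

Host scalar variables with the parameter attribute can be referenced directly from kernels
  No transfer to the GPU is needed
  Possible since relatively old versions of CUDA Fortran

Arrays cannot be referenced, even when they have the parameter attribute*
  For the weight coefficients and advection velocities of the D2Q9 model, declare GPU-side variables and copy the values

* Recent versions of CUDA Fortran allow kernels to reference arrays with the parameter attribute directly
  Array subscripts are forced to start at 1
    Even if declared as integer,parameter :: a(0:8), the array must be used as a(1:9) inside a kernel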



File: module_D2Q9Model.cuf (located in source/gpu/serial/)

    attributes(global) subroutine computeIntialMacroQuantities(velx,vely,dens)
      use SimulationParameter
      implicit none
      real(8),intent(inout),device :: velx(Nx,Ny)
      real(8),intent(inout),device :: vely(Nx,Ny)
      real(8),intent(inout),device :: dens(Nx,Ny)
      integer :: i,j

      do j=1,Ny
        do i=1,Nx
          velx(i,j) = 0d0
          vely(i,j) = 0d0
          dens(i,j) = 1d0
        end do
      end do
      j=Ny
      do i=2,Nx-1
        velx(i,j) = Uwall
      end do

    end subroutine computeIntialMacroQuantities



    attributes(global) subroutine computeMacroQuantities(f,velx,vely,dens,ConvVelx,ConvVely)
      use SimulationParameter
      implicit none
      real(8),intent(in)   ,device :: f(First:Last,1:Nx,1:Ny)
      real(8),intent(inout),device :: velx(1:Nx,1:Ny)
      real(8),intent(inout),device :: vely(1:Nx,1:Ny)
      real(8),intent(inout),device :: dens(1:Nx,1:Ny)
      integer,intent(in)   ,device :: ConvVelx(First:Last) !advection velocity
      integer,intent(in)   ,device :: ConvVely(First:Last) !
      integer :: i,j
      real(8) :: f_boundary, f_exterior

      do j=1,Ny
        do i=1,Nx
          dens(i,j) = f(Center ,i,j)+f(Right   ,i,j)+f(Up       ,i,j)&
                     +f(Left   ,i,j)+f(Down    ,i,j)+f(UpRight  ,i,j)&
                     +f(UpLeft ,i,j)+f(DownLeft,i,j)+f(DownRight,i,j)
        end do
      end do
      do i=2,Nx-1
        f_boundary = f(Center,i,Ny)+f(  Right,i,Ny)+f(  Left,i,Ny)
        f_exterior = f(Up    ,i,Ny)+f(UpRight,i,Ny)+f(UpLeft,i,Ny)
        dens(i,Ny) = f_boundary + 2d0*f_exterior
      end do

      do j=2,Ny-1
        do i=2,Nx-1
          velx(i,j) = ( f(Center   ,i,j)*ConvVelx(Center   )&
                       +f(Right    ,i,j)*ConvVelx(Right    )&
                       +f(Up       ,i,j)*ConvVelx(Up       )&
                       +f(Left     ,i,j)*ConvVelx(Left     )&
                       +f(Down     ,i,j)*ConvVelx(Down     )&
                       +f(UpRight  ,i,j)*ConvVelx(UpRight  )&
                       +f(UpLeft   ,i,j)*ConvVelx(UpLeft   )&
                       +f(DownLeft ,i,j)*ConvVelx(DownLeft )&
                       +f(DownRight,i,j)*ConvVelx(DownRight))/dens(i,j)
          : !vely(i,j) is not shown on this slide
        end do
      end do

    end subroutine computeMacroQuantities




    attributes(global) &
    subroutine computeLocalEquilibriumFunction(f_eq,velx,vely,dens,ConvVelx,ConvVely,Weight)
      use SimulationParameter
      implicit none
      real(8),intent(inout),device :: f_eq(First:Last,1:Nx,1:Ny)
      real(8),intent(in)   ,device :: velx(1:Nx,1:Ny)
      real(8),intent(in)   ,device :: vely(1:Nx,1:Ny)
      real(8),intent(in)   ,device :: dens(1:Nx,1:Ny)
      integer,intent(in)   ,device :: ConvVelx(First:Last) !advection velocity
      integer,intent(in)   ,device :: ConvVely(First:Last) !
      real(8),intent(in)   ,device :: Weight(First:Last)   !weight coefficients
      real(8) :: u,v,conv_velo,velo_square
      integer :: i,j,direction

      do j=1,Ny
        do i=1,Nx
          : !loop body not shown on this slide
        end do
      end do
      :



Collision Step

    attributes(global) subroutine collide(f,f_eq)
      use SimulationParameter
      implicit none

      real(8),intent(inout),device :: f   (First:Last,1:Nx,1:Ny)
      real(8),intent(in)   ,device :: f_eq(First:Last,1:Nx,1:Ny)

      integer :: i,j,direction

      do j=1,Ny
        do i=1,Nx
          f(:,i,j) = f(:,i,j) + (f_eq(:,i,j)-f(:,i,j))/RelaxTime
        end do
      end do

    end subroutine collide


Stream Step

    attributes(global) subroutine stream(f)
      use SimulationParameter
      implicit none
      real(8),intent(inout),device :: f(First:Last,1:Nx,1:Ny)
      integer :: i,j

      do j=1,Ny
        do i=Nx,2,-1 !RIGHT TO LEFT
          f(Right,i,j)=f(Right,i-1,j)
        end do
        do i=1,Nx-1 !LEFT TO RIGHT
          f(Left,i,j)=f(Left,i+1,j)
        end do
      end do

      do j=Ny,2,-1 !TOP TO BOTTOM
        do i=1,Nx
          f(Up,i,j)=f(Up,i,j-1)
        end do
        do i=Nx,2,-1
          f(UpRight,i,j)=f(UpRight,i-1,j-1)
        end do
        do i=1,Nx-1
          f(UpLeft,i,j)=f(UpLeft,i+1,j-1)
        end do
      end do

      do j=1,Ny-1 !BOTTOM TO TOP
        do i=1,Nx
          f(Down,i,j)=f(Down,i,j+1)
        end do
        do i=1,Nx-1
          f(DownLeft,i,j)=f(DownLeft,i+1,j+1)
        end do
        do i=Nx,2,-1
          f(DownRight,i,j)=f(DownRight,i-1,j+1)
        end do
      end do
    end subroutine stream


Boundary conditions

    attributes(global) subroutine imposeBoundayCondition(f,Opposite)
      use SimulationParameter
      implicit none

      real(8),intent(inout),device :: f(First:Last,1:Nx,1:Ny)
      integer,intent(in)   ,device :: Opposite(First:Last)

      integer :: i,j
      real(8) :: dens_wall
      real(8) :: f_boundary, f_exterior

      do j=1,Ny
        !bounce back on west boundary
        f(    Right, 1,j) = f(Opposite(    Right), 1,j)
        f(  UpRight, 1,j) = f(Opposite(  UpRight), 1,j)
        f(DownRight, 1,j) = f(Opposite(DownRight), 1,j)
        !bounce back on east boundary
        f(    Left ,Nx,j) = f(Opposite(    Left ),Nx,j)
        f(DownLeft ,Nx,j) = f(Opposite(DownLeft ),Nx,j)
        f(  UpLeft ,Nx,j) = f(Opposite(  UpLeft ),Nx,j)
      end do

      !bounce back on south boundary
      do i=1,Nx
        f(Up     ,i,1)=f(Opposite(Up     ),i,1)
        f(UpRight,i,1)=f(Opposite(UpRight),i,1)
        f(UpLeft ,i,1)=f(Opposite(UpLeft ),i,1)
      end do
      !moving wall, north boundary
      do i=2,Nx-1
        : !loop body not shown on this slide
      end do

    end subroutine imposeBoundayCondition




File: lbm_cavity.cuf

    program LBM_Cavity
      use cudafor
      use SimulationParameter
      use D2Q9Model
      implicit none
      real(8),allocatable,device :: velx(:,:)
      real(8),allocatable,device :: vely(:,:)
      real(8),allocatable,device :: dens(:,:)

      real(8),allocatable,device :: f   (:,:,:)
      real(8),allocatable,device :: f_eq(:,:,:)
      real(8),allocatable,device :: dev_Weight(:)
      integer,allocatable,device :: dev_ConvVelx(:)
      integer,allocatable,device :: dev_ConvVely(:)
      integer,allocatable,device :: dev_Opposite(:)

      integer :: n,stat

      allocate( velx(1:Nx,1:Ny))
      allocate( vely(1:Nx,1:Ny))
      allocate( dens(1:Nx,1:Ny))

      allocate(f   (First:Last,1:Nx,1:Ny)); f   =0d0
      allocate(f_eq(First:Last,1:Nx,1:Ny)); f_eq=0d0
      allocate(dev_Weight(First:Last));   dev_Weight  =Weight
      allocate(dev_ConvVelx(First:Last)); dev_ConvVelx=ConvVelx
      allocate(dev_ConvVely(First:Last)); dev_ConvVely=ConvVely
      allocate(dev_Opposite(First:Last)); dev_Opposite=Opposite

      call computeIntialMacroQuantities<<<1,1>>>(velx,vely,dens)
      do n=1,Nt
        call computeLocalEquilibriumFunction<<<1,1>>>&
             (f_eq,velx,vely,dens,dev_ConvVelx,dev_ConvVely,dev_Weight)
        call collide<<<1,1>>>(f,f_eq)
        call stream<<<1,1>>>(f)
        call imposeBoundayCondition<<<1,1>>>(f,dev_Opposite)
        call computeMacroQuantities<<<1,1>>>(f,velx,vely,dens,dev_ConvVelx,dev_ConvVely)
        stat = cudaThreadSynchronize() !cudaDeviceSynchronize is unavailable because the installed version is old
      end do

      deallocate(f   )
      deallocate(f_eq)
      deallocate(velx)
      deallocate(vely)
      deallocate(dens)
      deallocate(dev_Weight)
      deallocate(dev_ConvVelx)
      deallocate(dev_ConvVely)
      deallocate(dev_Opposite)

    end program LBM_Cavity

Compiling the program

The compiler is pgfortran
  pgf90 also works

Generate the object files without linking

    $ pgf90 -c module_SimulationParameter.f90
    $ pgf90 -Mcuda=cc20 -c module_D2Q9Model.cuf
    $ pgf90 -Mcuda=cc20 -c lbm_cavity.cuf

Link the object files to generate the executable

    $ pgf90 -Mcuda=cc20 -o lbm_cavity *.o

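Results of the single-thread implementation

The results match the CPU version
  Execution is far too slow to be usable

Execution time
  512×512      about 3 s/step
  1024×1024    about 10 s/step
  2048×2048    about 50 s/step

A GPU is slow unless the computation is parallelized
  Not every computation gets faster on a GPU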

One thread computes one lattice point (nine particles)

Changes from the single-thread implementation
  1. Launch the kernels with multiple threads
  2. Kernel contents
    1. With do loops over i and j, a single thread would compute multiple points
    2. Map thread indices to lattice point indices
    3. Split the kernel that handles the boundary conditions

[Figure: lattice points of the cavity, each assigned to one thread (threads 13 to 16 are labeled); the top wall moves with velocity U]
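The mapping from thread indices to lattice point indices can be tried on the CPU before writing a kernel. The sketch below is plain Fortran, not CUDA Fortran: the arguments `block_idx`, `block_dim`, and `thread_idx` stand in for blockIdx%x, blockDim%x, and threadIdx%x, and the module and function names are hypothetical. It assumes the grid size is a multiple of the block size.

```fortran
module thread_index_sketch
  implicit none
contains
  ! 1-based global index, mirroring i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  pure integer function global_index(block_idx, block_dim, thread_idx)
    integer, intent(in) :: block_idx, block_dim, thread_idx
    global_index = (block_idx - 1)*block_dim + thread_idx
  end function global_index

  ! Emulates a launch of n/block_dim blocks and counts how often each
  ! lattice index 1..n is visited (assumes n is a multiple of block_dim).
  pure function coverage(n, block_dim) result(hits)
    integer, intent(in) :: n, block_dim
    integer :: hits(n)
    integer :: b, t, i
    hits = 0
    do b = 1, n/block_dim
      do t = 1, block_dim
        i = global_index(b, block_dim, t)
        hits(i) = hits(i) + 1
      end do
    end do
  end function coverage
end module thread_index_sketch
```

With 512 lattice points and 64 threads per block, every index from 1 to 512 is visited exactly once, which is why the do loops over i and j can be dropped from the kernels.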




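One thread computes one lattice point (nine particles)

Changes from the single-thread implementation
  Implementation of the Stream Step
    A temporary array is required

Updating f in place:

    thread 32: f(Right,33,1)=f(Right,32,1)
    thread 33: f(Right,34,1)=f(Right,33,1)

  This is correct only if thread 33 is guaranteed to run first, or if threads 32 and 33 are guaranteed to run at exactly the same time
    In CUDA, threads run cooperatively in groups of a fixed size
    Execution proceeds by switching between these groups

Using a temporary array:

    thread 32: f_new(Right,33,1)=f(Right,32,1)
    thread 33: f_new(Right,34,1)=f(Right,33,1)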
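The hazard can be reproduced on the CPU with a one-dimensional array. The sketch below is plain Fortran with hypothetical names; the ascending loop plays the role of threads that happen to run in the "wrong" order, and it is only an emulation of the ordering problem, not of real GPU scheduling.

```fortran
module stream_hazard_sketch
  implicit none
  integer, parameter :: dp = kind(1d0)
contains
  ! In-place shift to the right, processed in ascending order:
  ! each element is overwritten before its old value has been moved on.
  pure subroutine stream_right_in_place_ascending(f)
    real(dp), intent(inout) :: f(:)
    integer :: i
    do i = 1, size(f) - 1
      f(i+1) = f(i)
    end do
  end subroutine stream_right_in_place_ascending

  ! Shift to the right through a separate output array:
  ! the result no longer depends on the processing order.
  ! f_new(1) is left untouched, as it would be set by a boundary condition.
  pure subroutine stream_right_two_arrays(f, f_new)
    real(dp), intent(in)    :: f(:)
    real(dp), intent(inout) :: f_new(:)
    integer :: i
    do i = 1, size(f) - 1
      f_new(i+1) = f(i)
    end do
  end subroutine stream_right_two_arrays
end module stream_hazard_sketch
```

Starting from the values 1, 2, 3, 4, the in-place version ends with 1, 1, 1, 1, while the two-array version writes 1, 2, 3 into positions 2 to 4 of `f_new`.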

Parameters for GPU execution (newly added)

File: module_GPUParameter.cuf (located in source/gpu/naive/)

    module GPUParameter
      use cudafor
      use SimulationParameter,only:Nx,Ny
      implicit none

      !reference value for the number of threads per block
      integer,parameter :: num_Thread = 64

      !degree of parallelism of the kernels other than the boundary conditions
      integer,parameter :: Thread_x = min(Nx,num_Thread)
      integer,parameter :: Thread_y = 1
      integer,parameter ::  Block_x = Nx/Thread_x
      integer,parameter ::  Block_y = Ny/Thread_y
      type(dim3),parameter :: Thread = dim3(Thread_x, Thread_y, 1) !the dim3 derived type specifies
      type(dim3),parameter ::  Block = dim3( Block_x,  Block_y, 1) !the degree of parallelism of a kernel

      !degree of parallelism of the kernel for the x-direction boundary conditions
      integer,parameter :: ThreadBCx_x = min(Nx,num_Thread)
      integer,parameter :: ThreadBCx_y = 1
      integer,parameter ::  BlockBCx_x = Nx/ThreadBCx_x
      integer,parameter ::  BlockBCx_y = 1 !number of blocks in y is fixed to 1
      type(dim3),parameter :: ThreadBCx = dim3(ThreadBCx_x, ThreadBCx_y, 1)
      type(dim3),parameter ::  BlockBCx = dim3( BlockBCx_x,  BlockBCx_y, 1)

      !degree of parallelism of the kernel for the y-direction boundary conditions
      integer,parameter :: ThreadBCy_x = 1
      integer,parameter :: ThreadBCy_y = min(Ny,num_Thread)
      integer,parameter ::  BlockBCy_x = 1 !number of blocks in x is fixed to 1
      integer,parameter ::  BlockBCy_y = Ny/ThreadBCy_y
      type(dim3),parameter :: ThreadBCy = dim3(ThreadBCy_x, ThreadBCy_y, 1)
      type(dim3),parameter ::  BlockBCy = dim3( BlockBCy_x,  BlockBCy_y, 1)

    end module GPUParameter
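The block counts above use integer division (Nx/Thread_x), which covers every lattice point only when the grid size is a multiple of the block size, as it is for 512, 1024, and 2048 with 64 threads. If other sizes were needed, one common alternative is to round the division up. The sketch below is plain Fortran with hypothetical names and is not part of the seminar's source; kernels launched this way would also need a guard such as `if (i<=Nx)` so that surplus threads do nothing.

```fortran
module block_count_sketch
  implicit none
contains
  ! Number of blocks needed to cover n points with the given block size,
  ! rounding up (hypothetical helper).
  pure integer function num_blocks(n, threads_per_block)
    integer, intent(in) :: n, threads_per_block
    num_blocks = (n + threads_per_block - 1)/threads_per_block
  end function num_blocks
end module block_count_sketch
```

For example, 512 points with 64 threads per block still give 8 blocks, while 513 points give 9 blocks, of which the last one uses only a single thread.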



    attributes(global) subroutine computeIntialMacroQuantities(velx,vely,dens)
      use SimulationParameter
      implicit none

      real(8),intent(inout),device :: velx(Nx,Ny)
      real(8),intent(inout),device :: vely(Nx,Ny)
      real(8),intent(inout),device :: dens(Nx,Ny)

      integer :: i,j

      i = (blockIdx%x-1)*blockDim%x + threadIdx%x !map thread indices to array subscripts
      j = (blockIdx%y-1)*blockDim%y + threadIdx%y !

      !one thread handles one lattice point, so the do loops over i,j are removed
      velx(i,j) = 0d0
      vely(i,j) = 0d0
      dens(i,j) = 1d0

      if (2<=i.and.i<=Nx-1 .and. j==Ny) then !a do loop can no longer select the lattice points (i,j), so an if statement is used
        velx(i,j) = Uwall
      end if
    end subroutine computeIntialMacroQuantities




    attributes(global) subroutine computeMacroQuantities(f,velx,vely,dens,ConvVelx,ConvVely)
      use SimulationParameter
      implicit none
      real(8),intent(in)   ,device :: f(First:Last,1:Nx,1:Ny)
      real(8),intent(inout),device :: velx(1:Nx,1:Ny)
      real(8),intent(inout),device :: vely(1:Nx,1:Ny)
      real(8),intent(inout),device :: dens(1:Nx,1:Ny)
      integer,intent(in)   ,device :: ConvVelx(First:Last)
      integer,intent(in)   ,device :: ConvVely(First:Last)
      integer :: i,j
      real(8) :: f_boundary, f_exterior

      i = (blockIdx%x-1)*blockDim%x + threadIdx%x
      j = (blockIdx%y-1)*blockDim%y + threadIdx%y
      :
      if (2<=i.and.i<=Nx-1 .and. j==Ny) then
        f_boundary = f(Center,i,j)+f(  Right,i,j)+f(  Left,i,j)
        f_exterior = f(Up    ,i,j)+f(UpRight,i,j)+f(UpLeft,i,j)
        dens(i,j) = f_boundary + 2d0*f_exterior
      end if
      :
      if (2<=i.and.i<=Nx-1 .and. 2<=j.and.j<=Ny-1) then
        velx(i,j) = ( f(Center   ,i,j)*ConvVelx(Center   )&
                     +f(Right    ,i,j)*ConvVelx(Right    )&
                     +f(Up       ,i,j)*ConvVelx(Up       )&
                     +f(Left     ,i,j)*ConvVelx(Left     )&
                     +f(Down     ,i,j)*ConvVelx(Down     )&
                     +f(UpRight  ,i,j)*ConvVelx(UpRight  )&
                     +f(UpLeft   ,i,j)*ConvVelx(UpLeft   )&
                     +f(DownLeft ,i,j)*ConvVelx(DownLeft )&
                     +f(DownRight,i,j)*ConvVelx(DownRight))/dens(i,j)
        :
      end if
    end subroutine computeMacroQuantities




    attributes(global) &
    subroutine computeLocalEquilibriumFunction(f_eq,velx,vely,dens,ConvVelx,ConvVely,Weight)
      use SimulationParameter
      implicit none
      real(8),intent(inout),device :: f_eq(First:Last,1:Nx,1:Ny)
      real(8),intent(in)   ,device :: velx(1:Nx,1:Ny)
      real(8),intent(in)   ,device :: vely(1:Nx,1:Ny)
      real(8),intent(in)   ,device :: dens(1:Nx,1:Ny)
      integer,intent(in)   ,device :: ConvVelx(First:Last)
      integer,intent(in)   ,device :: ConvVely(First:Last)
      real(8),intent(in)   ,device :: Weight(First:Last)
      real(8) :: u,v,conv_velo,velo_square
      integer :: i,j,direction
      :
      u = velx(i,j)
      v = vely(i,j)
      velo_square = u*u + v*v
      !one thread computes all nine particles, so the do loop over the particle number remains
      do direction = First,Last
      :





Collision Step

    attributes(global) subroutine collide(f,f_eq)
      use SimulationParameter
      implicit none
      real(8),intent(inout),device :: f   (First:Last,1:Nx,1:Ny)
      real(8),intent(in)   ,device :: f_eq(First:Last,1:Nx,1:Ny)
      integer :: i,j,direction
      :
      f(:,i,j) = f(:,i,j) + (f_eq(:,i,j)-f(:,i,j))/RelaxTime
    end subroutine collide


Stream Step

    attributes(global) subroutine stream(f,f_new)
      use SimulationParameter
      implicit none
      real(8),intent(in)   ,device :: f    (First:Last,1:Nx,1:Ny)
      real(8),intent(inout),device :: f_new(First:Last,1:Nx,1:Ny)
      integer :: i,j
      :
      f_new(Center,i,j) = f(Center,i,j) !uses the temporary array f_new
      if (1<=i .and. i<=Nx-1) then
        f_new(Right,i+1,j) = f(Right,i,j)
      end if
      if (1<=j .and. j<=Ny-1) then
        f_new(Up,i,j+1) = f(Up,i,j)
      end if
      if (2<=i .and. i<=Nx) then
        f_new(Left,i-1,j) = f(Left,i,j)
      end if
      if (2<=j .and. j<=Ny) then
        f_new(Down,i,j-1) = f(Down,i,j)
      end if

      if (1<=i .and. i<=Nx-1 .and. 1<=j .and. j<=Ny-1) then
        f_new(UpRight,i+1,j+1) = f(UpRight,i,j)
      end if
      if (2<=i .and. i<=Nx   .and. 1<=j .and. j<=Ny-1) then
        f_new(UpLeft,i-1,j+1) = f(UpLeft,i,j)
      end if
      if (2<=i .and. i<=Nx   .and. 2<=j .and. j<=Ny) then
        f_new(DownLeft ,i-1,j-1) = f(DownLeft ,i,j)
      end if
      if (1<=i .and. i<=Nx-1 .and. 2<=j .and. j<=Ny) then
        f_new(DownRight,i+1,j-1) = f(DownRight,i,j)
      end if

    end subroutine stream


Boundary conditions in the x direction

Launch one row of threads; each thread computes the boundary values at two points
  One row of blocks is sufficient
    The number of blocks in the y direction is fixed to 1
  In the x direction, thread indices can be mapped to array elements
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  In the y direction, the index is given directly as a number
    j=1 and j=Ny

[Figure: one row of blocks over the array f(:,:,:), at j=1 and at j=Ny]

Boundary conditions in the x direction

    attributes(global) subroutine imposeBoundayCondition_x(f,Opposite)
      use SimulationParameter
      implicit none
      :
      integer :: i
      real(8) :: dens_wall
      real(8) :: f_boundary, f_exterior

      i = (blockIdx%x-1)*blockDim%x + threadIdx%x

      !bounce back on south boundary
      f(Up     ,i,1)=f(Opposite(Up     ),i,1)
      f(UpRight,i,1)=f(Opposite(UpRight),i,1)
      f(UpLeft ,i,1)=f(Opposite(UpLeft ),i,1)

      !moving wall, north boundary
      if (2<=i.and.i<=Nx-1) then
        :
      end if

    end subroutine imposeBoundayCondition_x


Boundary conditions in the y direction

Launch one column of threads; each thread computes the boundary values at two points
  One column of blocks is sufficient
    The number of blocks in the x direction is fixed to 1
  In the y direction, thread indices can be mapped to array elements
    j = (blockIdx%y-1)*blockDim%y + threadIdx%y
  In the x direction, the index is given directly as a number
    i=1 and i=Nx

[Figure: one column of blocks over the array f(:,:,:), at i=1 and at i=Nx]

Boundary conditions in the y direction

    attributes(global) subroutine imposeBoundayCondition_y(f,Opposite)
      use SimulationParameter
      implicit none
      :
      integer :: j

      j = (blockIdx%y-1)*blockDim%y + threadIdx%y

      !bounce back on west boundary
      f(    Right, 1,j) = f(Opposite(    Right), 1,j)
      f(  UpRight, 1,j) = f(Opposite(  UpRight), 1,j)
      f(DownRight, 1,j) = f(Opposite(DownRight), 1,j)
      !bounce back on east boundary
      f(    Left ,Nx,j) = f(Opposite(    Left ),Nx,j)
      f(DownLeft ,Nx,j) = f(Opposite(DownLeft ),Nx,j)
      f(  UpLeft ,Nx,j) = f(Opposite(  UpLeft ),Nx,j)

    end subroutine imposeBoundayCondition_y




File: lbm_cavity.cuf

    program LBM_Cavity
      use cudafor
      use SimulationParameter
      use D2Q9Model
      use GPUParameter
      implicit none

      real(8),allocatable,device :: velx(:,:)
      real(8),allocatable,device :: vely(:,:)
      real(8),allocatable,device :: dens(:,:)

      real(8),allocatable,device :: f    (:,:,:)
      real(8),allocatable,device :: f_eq (:,:,:)
      real(8),allocatable,device :: f_new(:,:,:) !temporary array

      real(8),allocatable,device :: dev_Weight(:)
      integer,allocatable,device :: dev_ConvVelx(:)
      integer,allocatable,device :: dev_ConvVely(:)
      integer,allocatable,device :: dev_Opposite(:)

      integer :: n,stat
      :
      allocate(f    (First:Last,1:Nx,1:Ny)); f    =0d0
      allocate(f_eq (First:Last,1:Nx,1:Ny)); f_eq =0d0
      allocate(f_new(First:Last,1:Nx,1:Ny)); f_new=0d0

      allocate(dev_Weight(First:Last));   dev_Weight  =Weight
      allocate(dev_ConvVelx(First:Last)); dev_ConvVelx=ConvVelx
      allocate(dev_ConvVely(First:Last)); dev_ConvVely=ConvVely
      allocate(dev_Opposite(First:Last)); dev_Opposite=Opposite

      call computeIntialMacroQuantities<<<Block,Thread>>>(velx,vely,dens)
      do n=1,Nt
        call computeLocalEquilibriumFunction<<<Block,Thread>>>&
             (f_eq,velx,vely,dens,dev_ConvVelx,dev_ConvVely,dev_Weight)
        call collide<<<Block,Thread>>>(f,f_eq)
        call stream<<<Block,Thread>>>(f,f_new)
        call imposeBoundayCondition_x<<<BlockBCx,ThreadBCx>>>(f_new,dev_Opposite)
        call imposeBoundayCondition_y<<<BlockBCy,ThreadBCy>>>(f_new,dev_Opposite)
        call computeMacroQuantities<<<Block,Thread>>>&
             (f_new,velx,vely,dens,dev_ConvVelx,dev_ConvVely)
        f = f_new !copy the temporary array f_new into f (the copy runs synchronously, so cudaThreadSynchronize was removed)
      end do

      deallocate(f    )
      deallocate(f_eq )
      deallocate(f_new)
      deallocate(velx )
      deallocate(vely )
      deallocate(dens )
      deallocate(dev_Weight)
      deallocate(dev_ConvVelx)
      deallocate(dev_ConvVely)
      deallocate(dev_Opposite)

    end program LBM_Cavity

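Results (one thread per lattice point)

Execution time (with 64 threads per block)
  512×512      about 9 ms/step
  1024×1024    about 35 ms/step
  2048×2048    about 150 ms/step

Even the naive implementation is about 6 times faster than the CPU

Lattice points   CPU [ms/step]   GPU [ms/step]   Speedup (CPU/GPU)
512×512          52              9               5.8
1024×1024        214             35              6.1
2048×2048        900             150             6.0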

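Execution time versus the number of threads per block

For every grid size, 64 threads per block is the fastest
  The optimizations that follow also use 64 threads/block

[Figure: execution time [s/step] versus Number of Threads/Block, one panel for 512×512 and 1024×1024 and one panel for 2048×2048]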

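Choosing the memory that holds the parameters

Change the memory that holds the weight coefficients and advection velocities
  Use constant memory
  The kernels become simpler because these are no longer passed as arguments
  All threads access the same data, so the constant cache is expected to speed things up

Constant memory
  Off-chip memory (outside the GPU chip) that is read-only from the GPU
  Reads themselves are not fast
  When multiple threads access the same data, the constant cache is used
  Declared at global scope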

D2Q9 model parameters

(located in source/gpu/constant/)

    !define the parameters
      :
    !constant memory is declared with the constant attribute
    real(8),constant :: cWeight(First:Last)
    integer,constant :: cConvVelx(First:Last)
    integer,constant :: cConvVely(First:Last)
    integer,constant :: cOpposite(First:Last)
      :



    attributes(global) subroutine computeMacroQuantities(f,velx,vely,dens)
      use SimulationParameter
      implicit none
      real(8),intent(in)   ,device :: f(First:Last,1:Nx,1:Ny)
      real(8),intent(inout),device :: velx(1:Nx,1:Ny)
      real(8),intent(inout),device :: vely(1:Nx,1:Ny)
      real(8),intent(inout),device :: dens(1:Nx,1:Ny)
      integer :: i,j
      real(8) :: f_boundary, f_exterior
      :
      if (2<=i.and.i<=Nx-1 .and. j==Ny) then
        f_boundary = f(Center,i,j)+f(  Right,i,j)+f(  Left,i,j)
        f_exterior = f(Up    ,i,j)+f(UpRight,i,j)+f(UpLeft,i,j)
        dens(i,j) = f_boundary + 2d0*f_exterior
      end if
      :
      if (2<=i.and.i<=Nx-1 .and. 2<=j.and.j<=Ny-1) then
        velx(i,j) = ( f(Center   ,i,j)*cConvVelx(Center   )&
                     +f(Right    ,i,j)*cConvVelx(Right    )&
                     +f(Up       ,i,j)*cConvVelx(Up       )&
                     +f(Left     ,i,j)*cConvVelx(Left     )&
                     +f(Down     ,i,j)*cConvVelx(Down     )&
                     +f(UpRight  ,i,j)*cConvVelx(UpRight  )&
                     +f(UpLeft   ,i,j)*cConvVelx(UpLeft   )&
                     +f(DownLeft ,i,j)*cConvVelx(DownLeft )&
                     +f(DownRight,i,j)*cConvVelx(DownRight))/dens(i,j)

        vely(i,j) = ( f(Center   ,i,j)*cConvVely(Center   )&
                     +f(Right    ,i,j)*cConvVely(Right    )&
                     +f(Up       ,i,j)*cConvVely(Up       )&
                     +f(Left     ,i,j)*cConvVely(Left     )&
                     +f(Down     ,i,j)*cConvVely(Down     )&
                     +f(UpRight  ,i,j)*cConvVely(UpRight  )&
                     +f(UpLeft   ,i,j)*cConvVely(UpLeft   )&
                     +f(DownLeft ,i,j)*cConvVely(DownLeft )&
                     +f(DownRight,i,j)*cConvVely(DownRight))/dens(i,j)
      :




    attributes(global) subroutine computeLocalEquilibriumFunction(f_eq,velx,vely,dens)
      use SimulationParameter
      implicit none
      real(8),intent(inout),device :: f_eq(First:Last,1:Nx,1:Ny)
      real(8),intent(in)   ,device :: velx(1:Nx,1:Ny)
      real(8),intent(in)   ,device :: vely(1:Nx,1:Ny)
      real(8),intent(in)   ,device :: dens(1:Nx,1:Ny)
      real(8) :: u,v,conv_velo,velo_square
      integer :: i,j,direction
      i = (blockIdx%x-1)*blockDim%x + threadIdx%x
      j = (blockIdx%y-1)*blockDim%y + threadIdx%y
      :
        conv_velo = u*cConvVelx(direction) + v*cConvVely(direction)
        f_eq(direction,i,j) = cWeight(direction)*dens(i,j)&
      :




Boundary conditions in the x direction

    attributes(global) subroutine imposeBoundayCondition_x(f)
      use SimulationParameter
      implicit none
      real(8),intent(inout),device :: f(First:Last,1:Nx,1:Ny)
      real(8) :: dens_wall
      real(8) :: f_boundary, f_exterior
      integer :: i
      i = (blockIdx%x-1)*blockDim%x + threadIdx%x

      !bounce back on south boundary
      f(Up     ,i,1)=f(cOpposite(Up     ),i,1)
      f(UpRight,i,1)=f(cOpposite(UpRight),i,1)
      f(UpLeft ,i,1)=f(cOpposite(UpLeft ),i,1)
      !moving wall, north boundary
      if (2<=i.and.i<=Nx-1) then
        f_boundary = f(Center,i,Ny)+f(  Right,i,Ny)+f(  Left,i,Ny)
        f_exterior = f(Up    ,i,Ny)+f(UpRight,i,Ny)+f(UpLeft,i,Ny)
        dens_wall = f_boundary + 2d0*f_exterior
        f(Down     ,i,Ny)=f(cOpposite(Down     ),i,Ny)
        f(DownRight,i,Ny)=f(cOpposite(DownRight),i,Ny) + dens_wall*Uwall/6d0
        f(DownLeft ,i,Ny)=f(cOpposite(DownLeft ),i,Ny) - dens_wall*Uwall/6d0
      end if
    end subroutine imposeBoundayCondition_x


Boundary conditions in the y direction

    attributes(global) subroutine imposeBoundayCondition_y(f)
      use SimulationParameter
      implicit none

      real(8),intent(inout),device :: f(First:Last,1:Nx,1:Ny)

      integer :: j

      j = (blockIdx%y-1)*blockDim%y + threadIdx%y

      !bounce back on west boundary
      f(    Right, 1,j) = f(cOpposite(    Right), 1,j)
      f(  UpRight, 1,j) = f(cOpposite(  UpRight), 1,j)
      f(DownRight, 1,j) = f(cOpposite(DownRight), 1,j)
      !bounce back on east boundary
      f(    Left ,Nx,j) = f(cOpposite(    Left ),Nx,j)
      f(DownLeft ,Nx,j) = f(cOpposite(DownLeft ),Nx,j)
      f(  UpLeft ,Nx,j) = f(cOpposite(  UpLeft ),Nx,j)

    end subroutine imposeBoundayCondition_y





program LBM_Cavityuse cudaforuse SimulationParameteruse D2Q9Modeluse GPUParameterimplicit none

real(8),allocatable,device :: velx(:,:)real(8),allocatable,device :: vely(:,:)real(8),allocatable,device :: dens(:,:)

real(8),allocatable,device :: f (:,:,:)real(8),allocatable,device :: f_eq (:,:,:)real(8),allocatable,device :: f_new(:,:,:)

integer :: n,stat

lbm_cavity.cuf




allocate(f (First:Last,1:Nx,1:Ny));f =0d0allocate(f_eq (First:Last,1:Nx,1:Ny));f_eq =0d0allocate(f_new(First:Last,1:Nx,1:Ny));f_new=0d0

!CPUのメモリからコンスタントメモリへ転送!メモリのallocateは不要cWeight =WeightcConvVelx=ConvVelxcConvVely=ConvVelycOpposite=Opposite

lbm_cavity.cuf




    call computeLocalEquilibriumFunction<<<Block,Thread>>>(f_eq,velx,vely,dens)
    call collide<<<Block,Thread>>>(f,f_eq)
    call stream<<<Block,Thread>>>(f,f_new)
    call imposeBoundayCondition_x<<<BlockBCx,ThreadBCx>>>(f_new)
    call imposeBoundayCondition_y<<<BlockBCy,ThreadBCy>>>(f_new)
    call computeMacroQuantities<<<Block,Thread>>>(f_new,velx,vely,dens)
    f = f_new
  end do

  deallocate(f    )
  deallocate(f_eq )
  deallocate(f_new)
  deallocate(velx )
  deallocate(vely )
  deallocate(dens )


lbm_cavity.cuf

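Results (using constant memory)

Execution time (2048×2048): slightly faster than the baseline (naïve) implementation

[Figure: bar chart of execution time [μs/step] by implementation. Totals: naïve 146,633; constant memory 142,500.]

The computation of the macroscopic quantities is significantly faster (18,632 → 15,098)

*CPU execution time: 900,000 μs/step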

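Kernel fusion

Fusing the computation of the local equilibrium distribution function with the Collision Step

The local equilibrium distribution function f_eq is used only in the Collision Step

If the kernel that computes the local equilibrium distribution function is merged with the Collision Step kernel:

The array f_eq is no longer needed

Writes to f_eq and reads from f_eq are no longer needed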
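To get a feel for how much global-memory traffic the fusion removes, the following host-only Fortran sketch counts the bytes of one write pass and one read pass over f_eq. The module and function names are hypothetical and introduced only for this illustration; it assumes 9 values of real(8) per lattice point, as in the D2Q9 kernels of this seminar.

```fortran
! Host-only sketch: global-memory traffic removed by fusing the two kernels.
! Names are hypothetical; 9 directions of real(8) per lattice point assumed.
module FusionTrafficSketch
  implicit none
  integer(8), parameter :: BytesPerValue = 8_8   ! one real(8)
  integer(8), parameter :: NumDirections = 9_8   ! D2Q9
contains
  ! Bytes moved by one full pass (all reads or all writes) over f_eq.
  pure function bytesPerPass(nx, ny) result(bytes)
    integer, intent(in) :: nx, ny
    integer(8) :: bytes
    bytes = int(nx,8)*int(ny,8)*NumDirections*BytesPerValue
  end function bytesPerPass

  ! Fusion removes one write pass and one read pass of f_eq per time step.
  pure function bytesSavedPerStep(nx, ny) result(bytes)
    integer, intent(in) :: nx, ny
    integer(8) :: bytes
    bytes = 2_8*bytesPerPass(nx, ny)
  end function bytesSavedPerStep
end module FusionTrafficSketch

program estimateFusionSavings
  use FusionTrafficSketch
  implicit none
  print '(a,i0)', 'size of f_eq [bytes]:           ', bytesPerPass(2048, 2048)
  print '(a,i0)', 'traffic saved per step [bytes]: ', bytesSavedPerStep(2048, 2048)
end program estimateFusionSavings
```

For the 2048×2048 grid this gives 301,989,888 bytes for the f_eq array and 603,979,776 bytes of traffic avoided per time step. It is only a byte count, not a prediction of the measured speedup.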

Computing the local equilibrium distribution function and the collision term


attributes(global) &
subroutine computeLocalEquilibriumFunctionAndCollision(f,velx,vely,dens)
  use SimulationParameter
  implicit none
  real(8),intent(inout),device :: f(First:Last,1:Nx,1:Ny)
  real(8),intent(in)   ,device :: velx(1:Nx,1:Ny)
  real(8),intent(in)   ,device :: vely(1:Nx,1:Ny)
  real(8),intent(in)   ,device :: dens(1:Nx,1:Ny)
  real(8) :: u,v,conv_velo,velo_square,f_eq   !f_eq is held in a register
  integer :: i,j,direction

  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  j = (blockIdx%y-1)*blockDim%y + threadIdx%y

  u = velx(i,j)
  v = vely(i,j)
  velo_square = u*u + v*v

  do direction = First,Last
    conv_velo = u*cConvVelx(direction)&
              + v*cConvVely(direction)

    !the collision term is computed right after f_eq
    f_eq = cWeight(direction)*dens(i,j)&
         *(1d0 + 3d0*conv_velo + 4.5d0*conv_velo*conv_velo - 1.5d0*velo_square)

    f(direction,i,j) = f(direction,i,j) + (f_eq-f(direction,i,j))/RelaxTime
  end do
end subroutine computeLocalEquilibriumFunctionAndCollision


The source files are in source/gpu/fusion/



program LBM_Cavity
  use cudafor
  use SimulationParameter
  use D2Q9Model
  use GPUParameter
  implicit none
  real(8),allocatable,device :: velx(:,:)
  real(8),allocatable,device :: vely(:,:)
  real(8),allocatable,device :: dens(:,:)
  real(8),allocatable,device :: f    (:,:,:)
  real(8),allocatable,device :: f_new(:,:,:)   !f_eq has been removed
  integer :: n,stat

  allocate( velx(1:Nx,1:Ny))
  allocate( vely(1:Nx,1:Ny))
  allocate( dens(1:Nx,1:Ny))
  !same layout as the kernel arguments at this stage: f(First:Last,1:Nx,1:Ny)
  allocate(f    (First:Last,1:Nx,1:Ny)); f    =0d0
  allocate(f_new(First:Last,1:Nx,1:Ny)); f_new=0d0

  cWeight  =Weight
  cConvVelx=ConvVelx
  cConvVely=ConvVely
  cOpposite=Opposite

lbm_cavity.cuf




    call computeLocalEquilibriumFunctionAndCollision<<<Block,Thread>>>(f,velx,vely,dens)
    call stream<<<Block,Thread>>>(f,f_new)
    call imposeBoundayCondition_x<<<BlockBCx,ThreadBCx>>>(f_new)
    call imposeBoundayCondition_y<<<BlockBCy,ThreadBCy>>>(f_new)
    call computeMacroQuantities<<<Block,Thread>>>(f_new,velx,vely,dens)
    f = f_new
  end do

  deallocate(f    )
  deallocate(f_new)
  deallocate(velx )
  deallocate(vely )
  deallocate(dens )


lbm_cavity.cuf

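Results (kernel fusion)

Execution time (2048×2048): the computation of the local equilibrium distribution function and the collision term is dramatically faster (83,610 → 17,712)

[Figure: bar chart of execution time [μs/step] by implementation. Totals: naïve 146,633; constant memory 142,500; kernel fusion 76,474.]

*CPU execution time: 900,000 μs/step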

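Optimizing the array layout

Particle distribution function f(:,:,:): the x and y coordinates and the data of the 9 particles are handled in a single array

Structure of the 3-D array

particle × x coordinate × y coordinate

This ordering is not optimal for the GPU

Changing the structure of the 3-D array

particle (9 entries) × x coordinate × y coordinate → x coordinate × y coordinate × particle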



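Characteristics of GPU memory (global memory)

Reads are performed together in chunks of a fixed size

A group of threads accesses memory cooperatively

Efficient access requires certain conditions

Coalesced access

Data size (one of 4, 8, or 16 bytes)

First address accessed (a multiple of 64 or 128 bytes)

Adjacent addresses

The addresses accessed by the thread group are adjacent, in thread-number order

[Figure: threads 1, 2, 3, ..., i-1, i access the consecutive global-memory addresses ..., A128, A132, A136, ...]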
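The effect of the adjacency condition can be illustrated by counting how many fixed-size memory segments a group of threads touches for a given address stride. The host-only Fortran sketch below does this under stated assumptions that are not taken from the slides: a group of 32 threads, 128-byte segments, a base address aligned to a segment boundary, and one real(8) per thread. All names are hypothetical.

```fortran
! Host-only sketch: number of memory segments touched by one thread group.
! Assumptions: 32 threads per group, 128-byte segments, aligned base address.
module CoalescingSketch
  implicit none
  integer, parameter :: ThreadsPerGroup = 32   ! assumed thread-group size
  integer, parameter :: SegmentBytes    = 128  ! assumed segment size
  integer, parameter :: ElementBytes    = 8    ! one real(8)
contains
  ! Thread k (k = 0..ThreadsPerGroup-1) reads one element at byte address
  ! k*strideBytes; the result is the number of distinct segments touched.
  pure function segmentsTouched(strideBytes) result(nSegments)
    integer, intent(in) :: strideBytes   ! must be >= 0
    integer :: nSegments, k, segLo, segHi, prevHi
    nSegments = 0
    prevHi = -1
    do k = 0, ThreadsPerGroup-1
      segLo = (k*strideBytes)/SegmentBytes
      segHi = (k*strideBytes + ElementBytes - 1)/SegmentBytes
      ! Addresses grow with k, so only segments above prevHi are new.
      if (segHi > prevHi) nSegments = nSegments + segHi - max(segLo, prevHi+1) + 1
      prevHi = max(prevHi, segHi)
    end do
  end function segmentsTouched
end module CoalescingSketch

program countSegments
  use CoalescingSketch
  implicit none
  print '(a,i0)', 'stride  8 bytes (adjacent real(8)): ', segmentsTouched(8)
  print '(a,i0)', 'stride 72 bytes (9 x real(8)):      ', segmentsTouched(72)
end program countSegments
```

Under these assumptions, adjacent 8-byte elements touch 2 segments, while a 72-byte stride touches 18 segments for the same amount of useful data. Real hardware behavior depends on the GPU generation and its caches, so the counts are only indicative.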



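Previous array layout and memory access by the thread group

f(p,i,j): Fortran stores the array contiguously in the order p, i, j

f(1,1,1), f(2,1,1), f(3,1,1) are contiguous in this order

The i and j directions are parallelized

Each thread accesses its particles one after another

Neighboring threads access global memory 9 particles × 8 bytes = 72 bytes apart

The access is not coalesced

[Figure: the thread group accessing the p × i × j array]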



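Optimized array layout and memory access by the thread group

The array layout is changed to x coordinate × y coordinate × particle: f(i,j,p)

The i and j directions are parallelized

Each thread accesses its particles one after another

Neighboring threads access consecutive addresses

The access is coalesced

A single thread reads its 9 particles at a stride of (number of grid points in x) × (number of grid points in y) × 8 bytes

[Figure: the thread group accessing the i × j × p array]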
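The strides quoted for the two layouts follow directly from Fortran's column-major storage order. The host-only sketch below computes the byte offsets of f(p,i,j) and f(i,j,p) and prints the distance between neighboring threads and between consecutive particles. The function names and the direction range First = 0, Last = 8 are assumptions made for this illustration.

```fortran
! Host-only sketch: byte offsets of the two array layouts (column-major).
! Names and the direction index range 0..8 are assumptions.
module LayoutOffsetSketch
  implicit none
  integer, parameter :: First = 0, Last = 8        ! assumed direction range
  integer, parameter :: NumDir = Last - First + 1
  integer(8), parameter :: ElementBytes = 8_8      ! one real(8)
contains
  ! Offset of f(p,i,j) for an array declared f(First:Last,1:nx,1:ny).
  pure function offsetPIJ(p, i, j, nx) result(bytes)
    integer, intent(in) :: p, i, j, nx
    integer(8) :: bytes
    bytes = (int(p-First,8) + NumDir*(int(i-1,8) + int(nx,8)*int(j-1,8))) &
            *ElementBytes
  end function offsetPIJ

  ! Offset of f(i,j,p) for an array declared f(1:nx,1:ny,First:Last).
  pure function offsetIJP(i, j, p, nx, ny) result(bytes)
    integer, intent(in) :: i, j, p, nx, ny
    integer(8) :: bytes
    bytes = (int(i-1,8) + int(nx,8)*(int(j-1,8) + int(ny,8)*int(p-First,8))) &
            *ElementBytes
  end function offsetIJP
end module LayoutOffsetSketch

program compareLayouts
  use LayoutOffsetSketch
  implicit none
  integer, parameter :: nx = 2048, ny = 2048
  ! Distance between threads i and i+1 reading the same direction.
  print '(a,i0)', 'f(p,i,j) thread stride [bytes]:    ', &
        offsetPIJ(First,2,1,nx) - offsetPIJ(First,1,1,nx)
  print '(a,i0)', 'f(i,j,p) thread stride [bytes]:    ', &
        offsetIJP(2,1,First,nx,ny) - offsetIJP(1,1,First,nx,ny)
  ! Distance between consecutive directions read by one thread.
  print '(a,i0)', 'f(p,i,j) direction stride [bytes]: ', &
        offsetPIJ(First+1,1,1,nx) - offsetPIJ(First,1,1,nx)
  print '(a,i0)', 'f(i,j,p) direction stride [bytes]: ', &
        offsetIJP(1,1,First+1,nx,ny) - offsetIJP(1,1,First,nx,ny)
end program compareLayouts
```

For the 2048×2048 grid the thread stride drops from 72 bytes to 8 bytes, while the stride between particles within one thread grows from 8 bytes to 33,554,432 bytes. The second number matters little for coalescing, because each particle is read by the whole thread group at once.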

Computing the local equilibrium distribution function and the collision term

attributes(global) &
subroutine computeLocalEquilibriumFunctionAndCollision(f,velx,vely,dens)
  use SimulationParameter
  implicit none
  real(8),intent(inout),device :: f(1:Nx,1:Ny,First:Last)
  real(8),intent(in)   ,device :: velx(1:Nx,1:Ny)
  real(8),intent(in)   ,device :: vely(1:Nx,1:Ny)
  real(8),intent(in)   ,device :: dens(1:Nx,1:Ny)
  real(8) :: u,v,conv_velo,velo_square,f_eq
  integer :: i,j,direction

  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  j = (blockIdx%y-1)*blockDim%y + threadIdx%y

  u = velx(i,j)
  v = vely(i,j)
  velo_square = u*u + v*v

  do direction = First,Last
    conv_velo = u*cConvVelx(direction) + v*cConvVely(direction)
    f_eq = cWeight(direction)*dens(i,j)&
         *(1d0 + 3d0*conv_velo + 4.5d0*conv_velo*conv_velo - 1.5d0*velo_square)
    f(i,j,direction) = f(i,j,direction) + (f_eq-f(i,j,direction))/RelaxTime
  end do
end subroutine computeLocalEquilibriumFunctionAndCollision

The source files are in source/gpu/memory_layout/



attributes(global) subroutine computeMacroQuantities(f,velx,vely,dens)
  use SimulationParameter
  implicit none
  real(8),intent(in)   ,device :: f(1:Nx,1:Ny,First:Last)
  real(8),intent(inout),device :: velx(1:Nx,1:Ny)
  real(8),intent(inout),device :: vely(1:Nx,1:Ny)
  real(8),intent(inout),device :: dens(1:Nx,1:Ny)
  integer :: i,j
  real(8) :: f_boundary, f_exterior

  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  j = (blockIdx%y-1)*blockDim%y + threadIdx%y

  dens(i,j) = f(i,j,Center   )+f(i,j,Right    )+f(i,j,Up       )&
             +f(i,j,Left     )+f(i,j,Down     )+f(i,j,UpRight  )&
             +f(i,j,UpLeft   )+f(i,j,DownLeft )+f(i,j,DownRight)

  !density on the moving wall (north boundary)
  if (2<=i.and.i<=Nx-1 .and. j==Ny) then
    f_boundary = f(i,j,Center)+f(i,j,  Right)+f(i,j,  Left)
    f_exterior = f(i,j,Up    )+f(i,j,UpRight)+f(i,j,UpLeft)
    dens(i,j) = f_boundary + 2d0*f_exterior
  end if

  !velocity at interior points only
  if (2<=i.and.i<=Nx-1 .and. 2<=j.and.j<=Ny-1) then
    velx(i,j) = ( f(i,j,Center   )*cConvVelx(Center   )&
                 +f(i,j,Right    )*cConvVelx(Right    )&
                 +f(i,j,Up       )*cConvVelx(Up       )&
                 +f(i,j,Left     )*cConvVelx(Left     )&
                 +f(i,j,Down     )*cConvVelx(Down     )&
                 +f(i,j,UpRight  )*cConvVelx(UpRight  )&
                 +f(i,j,UpLeft   )*cConvVelx(UpLeft   )&
                 +f(i,j,DownLeft )*cConvVelx(DownLeft )&
                 +f(i,j,DownRight)*cConvVelx(DownRight))/dens(i,j)

    vely(i,j) = ( f(i,j,Center   )*cConvVely(Center   )&
                 +f(i,j,Right    )*cConvVely(Right    )&
                 +f(i,j,Up       )*cConvVely(Up       )&
                 +f(i,j,Left     )*cConvVely(Left     )&
                 +f(i,j,Down     )*cConvVely(Down     )&
                 +f(i,j,UpRight  )*cConvVely(UpRight  )&
                 +f(i,j,UpLeft   )*cConvVely(UpLeft   )&
                 +f(i,j,DownLeft )*cConvVely(DownLeft )&
                 +f(i,j,DownRight)*cConvVely(DownRight))/dens(i,j)
  end if
end subroutine computeMacroQuantities



Stream Step

attributes(global) subroutine stream(f,f_new)
  use SimulationParameter
  implicit none
  real(8),intent(in)   ,device :: f    (1:Nx,1:Ny,First:Last)
  real(8),intent(inout),device :: f_new(1:Nx,1:Ny,First:Last)
  integer :: i,j

  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  j = (blockIdx%y-1)*blockDim%y + threadIdx%y

  f_new(i,j,Center) = f(i,j,Center)

  if (1<=i .and. i<=Nx-1) then
    f_new(i+1,j,Right) = f(i,j,Right)
  end if
  if (1<=j .and. j<=Ny-1) then
    f_new(i,j+1,Up) = f(i,j,Up)
  end if
  if (2<=i .and. i<=Nx) then
    f_new(i-1,j,Left) = f(i,j,Left)
  end if
  if (2<=j .and. j<=Ny) then
    f_new(i,j-1,Down) = f(i,j,Down)
  end if

  if (1<=i .and. i<=Nx-1 .and. 1<=j .and. j<=Ny-1) then
    f_new(i+1,j+1,UpRight) = f(i,j,UpRight)
  end if
  if (2<=i .and. i<=Nx .and. 1<=j .and. j<=Ny-1) then
    f_new(i-1,j+1,UpLeft) = f(i,j,UpLeft)
  end if
  if (2<=i .and. i<=Nx .and. 2<=j .and. j<=Ny) then
    f_new(i-1,j-1,DownLeft) = f(i,j,DownLeft)
  end if
  if (1<=i .and. i<=Nx-1 .and. 2<=j .and. j<=Ny) then
    f_new(i+1,j-1,DownRight) = f(i,j,DownRight)
  end if
end subroutine stream

module_D2Q9Model.cuf


Boundary conditions

attributes(global) subroutine imposeBoundayCondition_x(f)
  use SimulationParameter
  implicit none
  real(8),intent(inout),device :: f(1:Nx,1:Ny,First:Last)
  integer :: i
  real(8) :: dens_wall
  real(8) :: f_boundary, f_exterior

  i = (blockIdx%x-1)*blockDim%x + threadIdx%x

  !bounce back on south boundary
  f(i,1,Up     )=f(i,1,cOpposite(Up     ))
  f(i,1,UpRight)=f(i,1,cOpposite(UpRight))
  f(i,1,UpLeft )=f(i,1,cOpposite(UpLeft ))

  !moving wall, north boundary
  if (2<=i.and.i<=Nx-1) then
    f_boundary = f(i,Ny,Center)+f(i,Ny,  Right)+f(i,Ny,  Left)
    f_exterior = f(i,Ny,Up    )+f(i,Ny,UpRight)+f(i,Ny,UpLeft)
    dens_wall = f_boundary + 2d0*f_exterior
    f(i,Ny,Down     )=f(i,Ny,cOpposite(Down     ))
    f(i,Ny,DownRight)=f(i,Ny,cOpposite(DownRight)) + dens_wall*Uwall/6.0
    f(i,Ny,DownLeft )=f(i,Ny,cOpposite(DownLeft )) - dens_wall*Uwall/6.0
  end if
end subroutine imposeBoundayCondition_x



Boundary conditions

attributes(global) subroutine imposeBoundayCondition_y(f)
  use SimulationParameter
  implicit none
  real(8),intent(inout),device :: f(1:Nx,1:Ny,First:Last)
  integer :: j

  j = (blockIdx%y-1)*blockDim%y + threadIdx%y

  !bounce back on west boundary
  f( 1,j,    Right) = f( 1,j,cOpposite(    Right))
  f( 1,j,  UpRight) = f( 1,j,cOpposite(  UpRight))
  f( 1,j,DownRight) = f( 1,j,cOpposite(DownRight))

  !bounce back on east boundary
  f(Nx,j,    Left ) = f(Nx,j,cOpposite(    Left ))
  f(Nx,j,DownLeft ) = f(Nx,j,cOpposite(DownLeft ))
  f(Nx,j,  UpLeft ) = f(Nx,j,cOpposite(  UpLeft ))
end subroutine imposeBoundayCondition_y





program LBM_Cavity
  use cudafor
  use SimulationParameter
  use D2Q9Model
  use GPUParameter
  implicit none
  real(8),allocatable,device :: velx(:,:)
  real(8),allocatable,device :: vely(:,:)
  real(8),allocatable,device :: dens(:,:)
  real(8),allocatable,device :: f    (:,:,:)
  real(8),allocatable,device :: f_new(:,:,:)
  integer :: n,stat

  allocate( velx(1:Nx,1:Ny))
  allocate( vely(1:Nx,1:Ny))
  allocate( dens(1:Nx,1:Ny))
  allocate(f    (1:Nx,1:Ny,First:Last)); f    =0d0
  allocate(f_new(1:Nx,1:Ny,First:Last)); f_new=0d0

  cWeight  =Weight
  cConvVelx=ConvVelx
  cConvVely=ConvVely
  cOpposite=Opposite

lbm_cavity.cuf




    call computeLocalEquilibriumFunctionAndCollision<<<Block,Thread>>>(f,velx,vely,dens)
    call stream<<<Block,Thread>>>(f,f_new)
    call imposeBoundayCondition_x<<<BlockBCx,ThreadBCx>>>(f_new)
    call imposeBoundayCondition_y<<<BlockBCy,ThreadBCy>>>(f_new)
    call computeMacroQuantities<<<Block,Thread>>>(f_new,velx,vely,dens)
    f = f_new
  end do

  deallocate(f    )
  deallocate(f_new)
  deallocate(velx )
  deallocate(vely )
  deallocate(dens )


lbm_cavity.cuf

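Results (array layout optimization)

Execution time (2048×2048): reduced to roughly 1/3

[Figure: bar chart of execution time [μs/step] by implementation. Totals: naïve 146,633; constant memory 142,500; kernel fusion 76,474; array layout optimization 21,577.]

*CPU execution time: 900,000 μs/step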

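Other miscellaneous speedups

Every little bit adds up

No dramatic speedup can be expected, but a reliable gain is achievable

They reveal what the GPU is good and bad at

Replace division with multiplication by the reciprocal

In the collision term, replace the division by the relaxation time with a multiplication by its reciprocal

Use registers as a managed cache

Store the value used as the divisor in a register and reuse it

Try removing indirect references

Remove the Opposite() lookups that appear in the bounce-back boundary condition and write the directions out explicitly

f(i,j,Down) = f(i,j,Opposite(Down))  →  f(i,j,Down) = f(i,j,Up)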

Simulation parameters

!set the parameters
:
real(8),parameter :: CoefRelax = 1d0/(3d0*KineticViscosity + 0.5d0)

module_SimulationParameter.cuf

The source files are in source/gpu/misc/
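The collision update with a division and the update with the precomputed reciprocal are algebraically the same: f + (f_eq - f)/τ = ω f_eq + (1 - ω) f with ω = 1/τ. The host-only sketch below checks this numerically. It assumes that CoefRelax is the reciprocal of RelaxTime and that RelaxTime = 3*KineticViscosity + 0.5, which is what the CoefRelax definition suggests; the function names and the sample values are hypothetical.

```fortran
! Host-only sketch: division by the relaxation time versus multiplication
! by its reciprocal. Names and sample values are hypothetical.
module RelaxationSketch
  implicit none
contains
  ! Collision written with a division by the relaxation time.
  pure function collideDivide(f, f_eq, relaxTime) result(f_post)
    real(8), intent(in) :: f, f_eq, relaxTime
    real(8) :: f_post
    f_post = f + (f_eq - f)/relaxTime
  end function collideDivide

  ! Same update with the precomputed reciprocal coefRelax = 1/relaxTime.
  pure function collideMultiply(f, f_eq, coefRelax) result(f_post)
    real(8), intent(in) :: f, f_eq, coefRelax
    real(8) :: f_post
    f_post = coefRelax*f_eq + (1d0 - coefRelax)*f
  end function collideMultiply
end module RelaxationSketch

program checkRelaxation
  use RelaxationSketch
  implicit none
  real(8), parameter :: KineticViscosity = 0.01d0   ! hypothetical value
  ! Assumed relation between the two parameters (see lead-in).
  real(8), parameter :: RelaxTime = 3d0*KineticViscosity + 0.5d0
  real(8), parameter :: CoefRelax = 1d0/RelaxTime
  real(8) :: a, b
  a = collideDivide(0.11d0, 0.12d0, RelaxTime)
  b = collideMultiply(0.11d0, 0.12d0, CoefRelax)
  print '(a,es23.15)', 'with division:       ', a
  print '(a,es23.15)', 'with multiplication: ', b
  print '(a,es10.2)',  'absolute difference: ', abs(a - b)
end program checkRelaxation
```

The two results agree up to floating-point rounding, so the rewrite changes only the instruction mix (no division inside the kernel), not the scheme.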



attributes(global) subroutine computeMacroQuantities(f,velx,vely,dens)
  use SimulationParameter
  implicit none
  real(8),intent(in)   ,device :: f(1:Nx,1:Ny,First:Last)
  real(8),intent(inout),device :: velx(1:Nx,1:Ny)
  real(8),intent(inout),device :: vely(1:Nx,1:Ny)
  real(8),intent(inout),device :: dens(1:Nx,1:Ny)
  real(8) :: f_boundary, f_exterior, rho
  integer :: i,j

  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  j = (blockIdx%y-1)*blockDim%y + threadIdx%y

  !keep the density in a register and reuse it as the divisor below
  rho = f(i,j,Center   )+f(i,j,Right    )+f(i,j,Up       )&
       +f(i,j,Left     )+f(i,j,Down     )+f(i,j,UpRight  )&
       +f(i,j,UpLeft   )+f(i,j,DownLeft )+f(i,j,DownRight)

  dens(i,j) = rho

  !density on the moving wall (north boundary)
  if (2<=i.and.i<=Nx-1 .and. j==Ny) then
    f_boundary = f(i,j,Center)+f(i,j,  Right)+f(i,j,  Left)
    f_exterior = f(i,j,Up    )+f(i,j,UpRight)+f(i,j,UpLeft)
    dens(i,j) = f_boundary + 2d0*f_exterior
  end if

  !velocity at interior points only
  if (2<=i.and.i<=Nx-1 .and. 2<=j.and.j<=Ny-1) then
    velx(i,j) = ( f(i,j,Center   )*cConvVelx(Center   )&
                 +f(i,j,Right    )*cConvVelx(Right    )&
                 +f(i,j,Up       )*cConvVelx(Up       )&
                 +f(i,j,Left     )*cConvVelx(Left     )&
                 +f(i,j,Down     )*cConvVelx(Down     )&
                 +f(i,j,UpRight  )*cConvVelx(UpRight  )&
                 +f(i,j,UpLeft   )*cConvVelx(UpLeft   )&
                 +f(i,j,DownLeft )*cConvVelx(DownLeft )&
                 +f(i,j,DownRight)*cConvVelx(DownRight))/rho

    vely(i,j) = ( f(i,j,Center   )*cConvVely(Center   )&
                 +f(i,j,Right    )*cConvVely(Right    )&
                 +f(i,j,Up       )*cConvVely(Up       )&
                 +f(i,j,Left     )*cConvVely(Left     )&
                 +f(i,j,Down     )*cConvVely(Down     )&
                 +f(i,j,UpRight  )*cConvVely(UpRight  )&
                 +f(i,j,UpLeft   )*cConvVely(UpLeft   )&
                 +f(i,j,DownLeft )*cConvVely(DownLeft )&
                 +f(i,j,DownRight)*cConvVely(DownRight))/rho
  end if
end subroutine computeMacroQuantities



Computing the local equilibrium distribution function and the collision term

attributes(global) &
subroutine computeLocalEquilibriumFunctionAndCollision(f,velx,vely,dens)
  use SimulationParameter
  implicit none
  real(8),intent(inout),device :: f(1:Nx,1:Ny,First:Last)
  real(8),intent(in)   ,device :: velx(1:Nx,1:Ny)
  real(8),intent(in)   ,device :: vely(1:Nx,1:Ny)
  real(8),intent(in)   ,device :: dens(1:Nx,1:Ny)
  real(8) :: u,v,conv_velo,velo_square,f_eq
  integer :: i,j,direction

  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  j = (blockIdx%y-1)*blockDim%y + threadIdx%y

  u = velx(i,j)
  v = vely(i,j)
  velo_square = u*u + v*v

  do direction = First,Last
    conv_velo = u*cConvVelx(direction) + v*cConvVely(direction)
    f_eq = cWeight(direction)*dens(i,j)&
         *(1d0 + 3d0*conv_velo + 4.5d0*conv_velo*conv_velo - 1.5d0*velo_square)
    !multiplication by the reciprocal CoefRelax instead of division by RelaxTime
    f(i,j,direction) = CoefRelax*f_eq + (1d0-CoefRelax)*f(i,j,direction)
  end do
end subroutine computeLocalEquilibriumFunctionAndCollision


Boundary conditions in the x direction

attributes(global) subroutine imposeBoundayCondition_x(f)
  use SimulationParameter
  implicit none
  real(8),intent(inout),device :: f(1:Nx,1:Ny,First:Last)
  real(8) :: dens_wall
  real(8) :: f_boundary, f_exterior
  integer :: i

  i = (blockIdx%x-1)*blockDim%x + threadIdx%x

  !bounce back on south boundary
  f(i,1,Up     )=f(i,1,Down     )
  f(i,1,UpRight)=f(i,1,DownLeft )
  f(i,1,UpLeft )=f(i,1,DownRight)

  !moving wall, north boundary
  if (2<=i.and.i<=Nx-1) then
    f_boundary = f(i,Ny,Center)+f(i,Ny,  Right)+f(i,Ny,  Left)
    f_exterior = f(i,Ny,Up    )+f(i,Ny,UpRight)+f(i,Ny,UpLeft)
    dens_wall = f_boundary + 2d0*f_exterior
    f(i,Ny,Down     )=f(i,Ny,Up     )
    f(i,Ny,DownRight)=f(i,Ny,UpLeft ) + dens_wall*Uwall/6.0
    f(i,Ny,DownLeft )=f(i,Ny,UpRight) - dens_wall*Uwall/6.0
  end if
end subroutine imposeBoundayCondition_x



Boundary conditions in the y direction

attributes(global) subroutine imposeBoundayCondition_y(f)
  use SimulationParameter
  implicit none
  real(8),intent(inout),device :: f(1:Nx,1:Ny,First:Last)
  integer :: j

  j = (blockIdx%y-1)*blockDim%y + threadIdx%y

  !bounce back on west boundary
  f( 1,j,    Right) = f( 1,j,     Left)
  f( 1,j,  UpRight) = f( 1,j, DownLeft)
  f( 1,j,DownRight) = f( 1,j,   UpLeft)

  !bounce back on east boundary
  f(Nx,j,    Left ) = f(Nx,j,    Right)
  f(Nx,j,DownLeft ) = f(Nx,j,  UpRight)
  f(Nx,j,  UpLeft ) = f(Nx,j,DownRight)
end subroutine imposeBoundayCondition_y



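Results (miscellaneous optimizations)

Execution time (2048×2048): some kernels become 2-3% faster

[Figure: bar chart of execution time [μs/step] by implementation. Totals: naïve 146,633; constant memory 142,500; kernel fusion 76,474; array layout optimization 21,577; miscellaneous optimizations 21,248. Per-kernel data labels: 7,030 → 6,884 and 5,584 → 5,406.]

Replacing the division is effective

Using registers is effective

The benefit of removing indirect references is unclear (the kernels are too light): x-direction boundary condition 6 μs → 5 μs, y-direction boundary condition 36 μs → 36 μs

*CPU execution time: 900,000 μs/step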

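Summary

A numerical method well suited to parallelization and to GPUs

There is no reason not to run it on a GPU

Results of the GPU implementation and several optimizations

About 7 times faster than the naïve GPU implementation after optimization

Up to 42 times faster than a naïve CPU implementation

Many other optimizations could still be applied

Fusing the local equilibrium distribution function, the collision term, and the Stream Step

Using shared memory and registers

Using texture memory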
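The two speedup factors quoted in the summary can be checked against the total times per step read from the result charts, assuming the naïve GPU total is 146,633, the final optimized total is 21,248, and the CPU reference is 900,000, all in the same unit:

```latex
\[
\frac{146{,}633}{21{,}248} \approx 6.9,
\qquad
\frac{900{,}000}{21{,}248} \approx 42.4
\]
```

These ratios match the "about 7 times" and "up to 42 times" figures stated above.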
