Computing Spherical Harmonic Transforms on CUDA-Compatible GPUs

Wangqun Lin, Fengshun Lu
College of Computer, National University of Defense Technology
CACHES 2011, Tucson, Arizona, June 4th, 2011

Outline

Motivation
Spherical Harmonic Transforms (SHT)
Methods
  Direct Method
  Efficiency of Thread Utilization
  Reshaped Method
  Concurrent Kernel Execution
Experiments

Motivation

Computing the S.H.T. with GPUs
  The S.H.T. is widely used
  But its computational complexity is O(N^3)
  GPUs are powerful

Performance metric at the SM (streaming multiprocessor) level
  Existing practice emphasizes only OCCUPANCY
  Goal: find another metric that measures how efficiently the launched threads are used

Spherical Harmonic Transforms (1/2)

$$ \xi(\lambda, \mu) = \sum_{m=-M}^{M} \sum_{n=|m|}^{N(m)} \xi_n^m P_n^m(\mu) e^{im\lambda} \quad (1) $$

ξ: state variable
$\xi_n^m$: spectral coefficients of state variable ξ
μ: Gaussian latitude
λ: longitude
M: model truncation wavenumber
N(m): highest degree of the associated Legendre function for wavenumber m
$P_n^m(\mu) e^{im\lambda}$: spherical harmonic functions, where $P_n^m(\mu)$ are the associated Legendre functions

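Spherical Harmonic Transforms (2/2)

Forward FFT
$$ \xi^m(\mu_j) = \frac{1}{I} \sum_{i=1}^{I} \xi(\lambda_i, \mu_j) e^{-im\lambda_i} \quad (2) $$

Forward Legendre
$$ \xi_n^m = \sum_{j=1}^{J} \xi^m(\mu_j) P_n^m(\mu_j) w_j \quad (3) $$

Inverse Legendre
$$ \xi^m(\mu_j) = \sum_{n=|m|}^{M} \xi_n^m P_n^m(\mu_j) \quad (4) $$

Inverse FFT
$$ \xi(\lambda_i, \mu_j) = \sum_{m=-M}^{M} \xi^m(\mu_j) e^{im\lambda_i} \quad (5) $$

I: number of longitudes $\lambda_i$; J: number of Gaussian latitudes $\mu_j$; $w_j$: Gaussian quadrature weights

State Variable $\xi(\lambda, \mu)$  --Forward Fourier-->  Fourier Coefficient $\xi^m(\mu)$  --Forward Legendre-->  Spectral Coefficient $\xi_n^m$
Spectral Coefficient $\xi_n^m$  --Inverse Legendre-->  Fourier Coefficient $\xi^m(\mu)$  --Inverse Fourier-->  State Variable $\xi(\lambda, \mu)$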
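The forward and inverse Fourier steps of the transform chain can be illustrated along a single latitude circle. The following is a minimal sketch, not the authors' implementation: it assumes equispaced longitudes λ_i = 2πi/I with i counted from 0 (the slides count from 1), uses a naive O(I·M) sum rather than an FFT, and the function names are hypothetical.

```python
import cmath
import math

def forward_fourier(grid_values, M):
    """Analysis along one latitude: Fourier coefficients for m = -M..M."""
    I = len(grid_values)
    coeffs = {}
    for m in range(-M, M + 1):
        total = 0j
        for i, value in enumerate(grid_values):
            lam = 2.0 * math.pi * i / I  # assumed equispaced longitude
            total += value * cmath.exp(-1j * m * lam)
        coeffs[m] = total / I
    return coeffs

def inverse_fourier(coeffs, I):
    """Synthesis along one latitude: grid values from coefficients m = -M..M."""
    M = max(coeffs)
    grid = []
    for i in range(I):
        lam = 2.0 * math.pi * i / I
        grid.append(sum(coeffs[m] * cmath.exp(1j * m * lam)
                        for m in range(-M, M + 1)))
    return grid

# Round trip with arbitrary coefficients; I > 2M keeps the wavenumbers distinct.
M, I = 3, 8
original = {m: complex(m + 1, -m) for m in range(-M, M + 1)}
grid = inverse_fourier(original, I)
recovered = forward_fourier(grid, M)
max_error = max(abs(recovered[m] - original[m]) for m in original)
print(max_error < 1e-12)
```

The round trip recovers the coefficients up to rounding error because the field is band-limited to |m| ≤ M and I exceeds 2M.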

Methods – Direct (1/9)

Forward Legendre (m ≤ n)

$$ \xi_n^m = \sum_{j=1}^{J} \xi^m(\mu_j) P_n^m(\mu_j) w_j $$

[Figure: triangular array of spectral coefficients $\xi_n^m$, from $\xi_0^0$ to $\xi_M^M$ (m = 0 … M, n = 0 … M), with a legend marking a CUDA thread and a thread block.]
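The direct forward Legendre step is a weighted sum over latitudes for every coefficient with m ≤ n. The sketch below emulates it serially, with the two outer loops standing in for whatever thread and block assignment the GPU kernel uses (that assignment is an assumption here, as are the function name and the array layout). The input numbers are placeholders, not real Legendre values or Gaussian weights.

```python
def forward_legendre(fourier, legendre, weights, M):
    """spec[n][m] = sum_j fourier[m][j] * legendre[n][m][j] * weights[j], m <= n."""
    J = len(weights)
    spec = [[0] * (n + 1) for n in range(M + 1)]
    for n in range(M + 1):        # one row of the triangular array
        for m in range(n + 1):    # one coefficient (n, m) per "thread"
            acc = 0
            for j in range(J):    # quadrature over the latitudes
                acc += fourier[m][j] * legendre[n][m][j] * weights[j]
            spec[n][m] = acc
    return spec

# Placeholder data for M = 1 and J = 2 latitudes.
weights = [1, 2]
fourier = [[1, 2],                # fourier[m][j]
           [3, 4]]
legendre = [[[1, 1]],             # legendre[n][m][j], only m <= n stored
            [[1, -1], [2, 0]]]
spec = forward_legendre(fourier, legendre, weights, 1)
print(spec)
```

Only the lower triangle (m ≤ n) is stored, which mirrors the triangular array of coefficients shown on the slide.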

Methods – Direct (2/9)

Inverse Legendre (m ≤ n)

$$ \xi^m(\mu_j) = \sum_{n=m}^{M} \xi_n^m P_n^m(\mu_j) $$

[Figure: CUDA threads of block j. Thread m (m = 0 … M) accumulates $\xi^m(\mu_j)$ over the coefficient pairs (n, m), n = 0 … M, of the triangular array; the legend distinguishes busy threads from idle threads.]
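The busy and idle threads in the direct inverse Legendre layout can be made visible with a serial emulation of one thread block. This is a sketch under assumptions: block j handles latitude j, thread m handles wavenumber m, and all threads step through n together. The names and the placeholder data are hypothetical.

```python
def inverse_legendre_block(spec, legendre, j, M):
    """Emulate block j: M+1 threads, thread m accumulates the sum over n >= m."""
    acc = [0] * (M + 1)            # stands in for a per-block accumulation buffer
    busy_per_step = []
    for n in range(M + 1):         # every thread walks through n = 0..M
        busy = 0
        for m in range(M + 1):     # "thread" m
            if m <= n:             # threads with m > n have no term at this step
                acc[m] += spec[n][m] * legendre[n][m][j]
                busy += 1
        busy_per_step.append(busy)
    return acc, busy_per_step

# Placeholder data for M = 2 and a single latitude (j = 0).
M = 2
spec = [[1], [2, 3], [4, 5, 6]]                                # spec[n][m]
legendre = [[[n + 1] for m in range(n + 1)] for n in range(M + 1)]
acc, busy_per_step = inverse_legendre_block(spec, legendre, 0, M)
print(acc, busy_per_step)
```

At step n only n+1 of the M+1 threads contribute, so roughly half of the thread-steps in the loop are idle.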

Methods – ETU Metric (3/9)

Efficiency of Thread Utilization (ETU)
  Measures the proportion of launched threads doing useful work during the entire execution interval
  Mainly used as an algorithm design guideline
Assumption
  Algorithms consist of many micro steps
tu(t, s) function
  t: thread
  s: micro step

$$ tu(t, s) = \begin{cases} 1, & \text{if thread } t \text{ is doing useful work at micro step } s \\ 0, & \text{otherwise} \end{cases} $$
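The proportion described above can be computed by averaging tu(t, s) over all launched threads and all micro steps. The sketch below assumes 0-based thread and step indices. The toy kernel is hypothetical and only illustrates the bookkeeping.

```python
from fractions import Fraction

def etu(tu, tm, sm):
    """Fraction of (thread, micro step) slots in which useful work is done."""
    useful = sum(tu(t, s) for s in range(sm) for t in range(tm))
    return Fraction(useful, sm * tm)

def toy_tu(t, s):
    """Hypothetical kernel: every thread works at step 0, then only even threads."""
    return 1 if (s == 0 or t % 2 == 0) else 0

toy_value = etu(toy_tu, tm=8, sm=4)
print(toy_value)
```

Using exact fractions avoids rounding when the value is later compared with a closed-form expression.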

Methods – ETU (4/9)

ETU Metric

$$ ETU = \frac{\sum_{s=1}^{sm} \sum_{t=1}^{tm} tu(t, s)}{sm \times tm} $$

sm: number of micro steps; tm: number of launched threads

Algorithm 2: Direct Inverse Legendre Transform (DILT)

Input: $\xi_n^m$, $P_n^m$, J, M
Output: $\xi^m$
Execution configuration: (J, M+1)
Declaration: tid, bid, fc_sh(M+1)   // fc_sh: shared memory

  initialize fc_sh(tid) to zero;                      // 1 m_s (micro step)
  for n = 0 to M do                                   // M+1 m_s
    if tid ≤ n then
      fc_sh(tid) += ξ_n^tid × P_n^tid(μ_bid);
    end if
  end for
  ξ^tid(μ_bid) = fc_sh(tid);                          // 1 m_s

Example

$$ ETU = \frac{J(M+1) + J(M+1)(M+2)/2 + J(M+1)}{J(M+1)(M+3)} = \frac{1 + (M+2)/2 + 1}{M+3} = \frac{M+6}{2(M+3)} $$
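The closed form in the example can be checked by counting busy thread-steps directly. The sketch below follows the micro-step accounting annotated in Algorithm 2 (one initialisation step, M+1 loop steps, one write-back step), with J blocks of M+1 threads. The function names are hypothetical.

```python
from fractions import Fraction

def dilt_etu(J, M):
    """Count useful (thread, micro step) slots of the direct inverse transform."""
    threads = J * (M + 1)
    steps = 1 + (M + 1) + 1
    useful = 0
    for bid in range(J):
        for tid in range(M + 1):
            useful += 1                  # initialisation: every thread works
            for n in range(M + 1):
                if tid <= n:
                    useful += 1          # loop step with a term to accumulate
            useful += 1                  # write-back: every thread works
    return Fraction(useful, threads * steps)

def closed_form(M):
    """ETU = (M + 6) / (2 (M + 3)), as derived in the example."""
    return Fraction(M + 6, 2 * (M + 3))

matches = all(dilt_etu(3, M) == closed_form(M) for M in range(0, 40))
print(matches, float(closed_form(213)))
```

The value does not depend on J, and for a truncation such as M = 213 it is only slightly above 1/2.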

Methods – Reshaped (5/9)

Forward Legendre

[Figure: the triangular array of spectral coefficients ($\xi_0^0$ … $\xi_M^M$) before and after the reshape.]

Before the reshape (ETU ≈ 1/2)
  Blocks blk 0 … blk M, threads T 0 … T M
  blk x: x+1 threads busy, M−x threads idle
After the reshape (ETU ≈ 1)
  Blocks blk 0 … blk (M−1)/2, threads T 0 … T M+1
  All threads of block x busy
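One plausible reading of the reshape is that the short row x of the triangle is paired with the long row M−x, so that every block holds M+2 coefficients. The sketch below builds that pairing and compares the fraction of threads that have a coefficient to compute in both layouts. The pairing rule, the odd-M restriction, and the function names are assumptions for illustration. The ratio counts only thread assignment, not the full micro-step ETU.

```python
from fractions import Fraction

def direct_utilization(M):
    """Blocks 0..M with M+1 threads each; block x has x+1 busy threads."""
    busy = sum(x + 1 for x in range(M + 1))
    return Fraction(busy, (M + 1) * (M + 1))

def reshaped_mapping(M):
    """Assumed pairing: block x holds row x (x+1 entries) and row M-x (M-x+1 entries)."""
    if M % 2 != 1:
        raise ValueError("this sketch assumes an odd M")
    blocks = []
    for x in range((M + 1) // 2):
        entries = [(x, m) for m in range(x + 1)]
        entries += [(M - x, m) for m in range(M - x + 1)]
        blocks.append(entries)   # entry t is the (n, m) pair handled by thread t
    return blocks

M = 213
blocks = reshaped_mapping(M)
reshaped_utilization = Fraction(sum(len(b) for b in blocks),
                                len(blocks) * (M + 2))
print(direct_utilization(M), reshaped_utilization)
```

Under this pairing the number of blocks drops to (M+1)/2 and each of the M+2 threads per block has exactly one coefficient, which is consistent with the block and thread ranges on the slide.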


Inverse Legendre, T213 model

[Figure: the triangular array of coefficient pairs (n, m) for M = 213, from (0,0) to (213,213), cut into row bands n = 0–9, 10–19, 20–29, 30–39, 40–49, 50–59 (10 rows each), 60–93 and 94–127 (34 rows each), and 128–213 (86 rows). Two trapezia are labelled α and β, and the bands are reshaped into thread blocks of different block sizes.]

$$ \xi^m(\mu_j) = \sum_{n=0}^{213} \xi_n^m P_n^m(\mu_j) = \sum_{n=0}^{59} \xi_n^m P_n^m(\mu_j) + \sum_{n=60}^{127} \xi_n^m P_n^m(\mu_j) + \sum_{n=128}^{213} \xi_n^m P_n^m(\mu_j) $$

(Terms with n < m are absent, since m ≤ n.)

[Figure: inverse Legendre transform for M = 213, "reconstruct" step. Thread m (m = 0 … 213) of a block accumulates $\xi^m(\mu_j)$ over the pairs (n, m). The band boundaries 10, 20, 30, 40, 50 and 60 are associated with shared-memory buffer sh1, the boundaries 94 and 128 with sh2, and 214 with sh3. Circled numbers ① to ④ mark the steps, and a block size is indicated for each band.]



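Computation for trapezia α and β

$$ \sum_{n=60}^{127} \xi_n^m P_n^m(\mu_j) = \sum_{n=60}^{93} \xi_n^m P_n^m(\mu_j) + \sum_{n=94}^{127} \xi_n^m P_n^m(\mu_j) $$

[Figure: the two partial sums are combined in shared-memory buffer sh2 (boundaries 94 and 128), step ①.]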
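The band decomposition works because the sum over n can be cut at any boundaries and the partial sums added afterwards. The sketch below checks this for the T213 boundaries that appear on the slides (60, 94, 128, 214). The wavenumber m = 75 and the integer terms are placeholders for $\xi_n^m P_n^m(\mu_j)$, not real data.

```python
def segmented_sums(terms, boundaries):
    """Partial sums over [b0, b1), [b1, b2), ... for consecutive boundaries."""
    return [sum(terms[lo:hi]) for lo, hi in zip(boundaries, boundaries[1:])]

M = 213
m = 75  # hypothetical wavenumber
# Placeholder terms; entries with n < m are zero because only m <= n contributes.
terms = [(n * n - 3 * n + 1) if n >= m else 0 for n in range(M + 1)]

coarse = segmented_sums(terms, [0, 60, 128, 214])       # three-way split
fine = segmented_sums(terms, [0, 60, 94, 128, 214])     # middle band split in two
print(sum(coarse) == sum(terms), coarse[1] == fine[1] + fine[2])
```

For this m the first band contributes nothing, which is the kind of empty work the reshaped layout is meant to avoid.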

Methods – Concurrent Kernel (9/9)

Concurrent Kernel Execution
  Supported by Fermi and later architectures
  Programs with many small kernels can be executed efficiently on GPUs
  Takes future software scalability into consideration

T213 model

         Concurrent Forward Legendre          Concurrent Inverse Legendre
Kernel   n           Grid size   Block size   m           Grid size   Block size
1        [0, 53]     54          64           [0, 53]     320         64
2        [54, 117]   64          128          [54, 117]   320         64
3        [118, 213]  96          224          [118, 213]  320         96
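The index ranges in the table can be checked for consistency with a few lines of arithmetic. The sketch below verifies that the three ranges tile 0 … 213 and that their widths equal the forward grid sizes. It also notes that the inverse block sizes coincide with each width rounded up to a multiple of 32, the CUDA warp size. Whether the authors chose the block sizes that way is an assumption, not something the slides state.

```python
WARP_SIZE = 32  # threads per warp on CUDA GPUs

def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple

ranges = [(0, 53), (54, 117), (118, 213)]           # per-kernel index ranges
forward_grid_sizes = [54, 64, 96]                   # from the table
inverse_block_sizes = [64, 64, 96]                  # from the table

widths = [hi - lo + 1 for lo, hi in ranges]
padded = [round_up(w, WARP_SIZE) for w in widths]
covers_all = [i for lo, hi in ranges for i in range(lo, hi + 1)] == list(range(214))
print(widths, padded, covers_all)
```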

Experiments (1/4)

Validation of the ETU metric
  T341 model
  Variable block size

Observations
  In general, a larger ETU indicates better performance
  No direct relationship appears between OCCUPANCY and performance
  The same OCCUPANCY does not mean equal performance
  At the same OCCUPANCY, a larger ETU gives better performance

BS (block size)   ETU      OCCUPANCY   Time (ms)
96                0.8039   0.312       1.975
128               0.7480   0.417       2.239
160               0.7831   0.417       2.038
192               0.6519   0.625       2.198
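The observations can be read off the table mechanically. The sketch below encodes the four reported rows and checks three of them: the highest-ETU configuration is the fastest, the highest-OCCUPANCY configuration is not, and among the two rows with equal OCCUPANCY the one with the larger ETU is faster. The field names are hypothetical, and the numbers are the ones reported on the slide.

```python
rows = [
    {"bs": 96,  "etu": 0.8039, "occupancy": 0.312, "time_ms": 1.975},
    {"bs": 128, "etu": 0.7480, "occupancy": 0.417, "time_ms": 2.239},
    {"bs": 160, "etu": 0.7831, "occupancy": 0.417, "time_ms": 2.038},
    {"bs": 192, "etu": 0.6519, "occupancy": 0.625, "time_ms": 2.198},
]

fastest = min(rows, key=lambda r: r["time_ms"])
highest_etu = max(rows, key=lambda r: r["etu"])
highest_occupancy = max(rows, key=lambda r: r["occupancy"])

# Rows sharing an occupancy value, ordered from larger to smaller ETU.
same_occupancy = sorted((r for r in rows if r["occupancy"] == 0.417),
                        key=lambda r: r["etu"], reverse=True)
print(fastest["bs"], highest_etu["bs"], highest_occupancy["bs"],
      [r["bs"] for r in same_occupancy])
```

The ordering by ETU does not match the ordering by time for every pair of rows (block sizes 128 and 192 are the exception), which fits the slide's wording that a larger ETU "basically" indicates better performance.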

Experiments (2/4)

Performance

[Figure: performance results in two panels, Forward Legendre and Inverse Legendre.]

Experiments (3/4)

Case Study: STSWM
  A global shallow water model based on the S.H.T.
  Exhibits many mathematical and computational properties of more complete models
  Used to investigate and compare numerical methods for simulating atmospheric models
  T213 truncation

Forward Legendre: ftrnve, ftrndi and ftrnpi
Inverse Legendre: shtrns

Experiments (4/4)

Case Study: STSWM


Review

Motivation
Spherical Harmonic Transforms
Methods
  Direct Method
  Efficiency of Thread Utilization
  Reshaped Method
  Concurrent Kernel Execution
Experiments
