Computing Spherical Harmonic Transforms on CUDA-Compatible GPUs
Wangqun Lin, Fengshun Lu
College of Computer, National University of Defense Technology
CACHES 2011, Tucson, Arizona, June 4th, 2011
Outline
Motivation
Spherical Harmonic Transforms (SHT)
Methods
  Direct Method
  Efficiency of Threads Utilization
  Reshaped Method
  Concurrent Kernel Execution
Experiments
Motivation
Computing the SHT with GPUs
  The SHT is widely used, but its complexity is O(N^3)
  GPUs are powerful
Performance metric at the SM level
  Existing practice emphasizes only OCCUPANCY
  Goal: find another metric that measures how efficiently the launched threads are used
Spherical Harmonic Transforms (1/2)
$$\xi(\lambda, \mu) = \sum_{m=-M}^{M} \sum_{n=|m|}^{N(m)} \xi_n^m \, P_n^m(\mu) \, e^{im\lambda}$$
  ξ: state variable
  ξ_n^m: spectral coefficients of the state variable ξ
  μ: Gaussian latitude
  λ: longitude
  M: model truncation wavenumber
  N(m): highest degree of the associated Legendre function for wavenumber m
  P_n^m(μ): associated Legendre functions; P_n^m(μ) e^{imλ} are the spherical harmonics
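Spherical Harmonic Transforms (2/2)
Forward FFT
$$\xi^m(\mu_j) = \frac{1}{I} \sum_{i=1}^{I} \xi(\lambda_i, \mu_j) \, e^{-im\lambda_i}$$
Forward Legendre
$$\xi_n^m = \sum_{j=1}^{J} \xi^m(\mu_j) \, P_n^m(\mu_j) \, w_j \qquad (3)$$
Inverse Legendre
$$\xi^m(\mu_j) = \sum_{n=m}^{M} \xi_n^m \, P_n^m(\mu_j)$$
Inverse FFT
$$\xi(\lambda_i, \mu_j) = \sum_{m=-M}^{M} \xi^m(\mu_j) \, e^{im\lambda_i} \qquad (5)$$
  I, J: number of longitudes and Gaussian latitudes
  w_j: Gaussian weight at latitude μ_j
State variable ξ(λ, μ) <-> Fourier coefficient ξ^m(μ) <-> spectral coefficient ξ_n^m
  Forward direction: forward Fourier, then forward Legendre
  Inverse direction: inverse Legendre, then inverse Fourier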
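As a rough illustration of the forward and inverse Fourier steps on this slide, the sketch below evaluates both sums directly for a single latitude. It assumes equally spaced longitudes λ_i = 2πi/I with i counted from 0, and the function names are hypothetical. Direct summation stands in for the FFT here, so this shows the arithmetic rather than a fast algorithm.

```python
import cmath
import math

def longitudes(I):
    # Assumption: I equally spaced longitudes, indexed from 0.
    return [2.0 * math.pi * i / I for i in range(I)]

def forward_fourier(values, M):
    """xi^m = (1/I) * sum_i xi(lambda_i) * exp(-i*m*lambda_i) for m = -M..M."""
    I = len(values)
    lam = longitudes(I)
    return {
        m: sum(v * cmath.exp(-1j * m * l) for v, l in zip(values, lam)) / I
        for m in range(-M, M + 1)
    }

def inverse_fourier(coeffs, I):
    """xi(lambda_i) = sum over m of xi^m * exp(i*m*lambda_i)."""
    return [
        sum(c * cmath.exp(1j * m * l) for m, c in coeffs.items())
        for l in longitudes(I)
    ]

# A field that contains only wavenumbers |m| <= 3, sampled at I = 8 points.
M, I = 3, 8
field = [1.0 + 2.0 * math.cos(l) - 0.5 * math.sin(3 * l) for l in longitudes(I)]
coeffs = forward_fourier(field, M)
rebuilt = inverse_fourier(coeffs, I)
```

Because the sample field contains no wavenumber above M and I exceeds 2M, the inverse step reproduces the sampled values up to rounding error.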
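Methods – Direct (1/9)
Forward Legendre, m ≤ n
$$\xi_n^m = \sum_{j=1}^{J} \xi^m(\mu_j) \, P_n^m(\mu_j) \, w_j$$
[Figure: triangular array of spectral coefficients ξ_n^m, from ξ_0^0 to ξ_M^M, with columns m = 0, 1, 2, …, M-1, M and rows n = 0, 1, 2, …, M-1, M. Legend: CUDA thread, thread block; busy threads vs. idle threads.]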
Methods – Direct (2/9)
Inverse Legendre, m ≤ n
$$\xi^m(\mu_j) = \sum_{n=m}^{M} \xi_n^m \, P_n^m(\mu_j)$$
[Figure: triangular array of index pairs (n, m), from (0,0) to (M,M), with columns m = 0, 1, 2, 3, 4, …, M and rows n = 0, 1, 2, 3, 4, …, M. Column m feeds the output ξ^m(μ_j), giving ξ^0(μ_j), ξ^1(μ_j), …, ξ^M(μ_j). The columns are labelled as the CUDA threads of block j. Legend: busy threads vs. idle threads.]
Methods – ETU Metric (3/9)
Efficiency of Thread Utilization (ETU)
  Measures the proportion of launched threads doing useful work during the entire execution interval
  Mainly used as an algorithm design guideline
Assumption
  Algorithms consist of many micro steps
tu(t, s) function
  t: thread
  s: micro step
$$tu(t, s) = \begin{cases} 1, & \text{if thread } t \text{ does useful work at micro step } s \\ 0, & \text{otherwise} \end{cases}$$
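The definition can be made concrete with a small sketch. The function name `etu` and the trace layout (one row per micro step, one 0/1 entry per thread) are illustrative assumptions, not something the slides prescribe.

```python
def etu(trace):
    """Fraction of (thread, micro step) slots in which useful work is done.

    trace[s][t] plays the role of tu(t, s): 1 if thread t does useful
    work at micro step s, 0 otherwise.
    """
    steps = len(trace)
    threads = len(trace[0])
    useful = sum(sum(row) for row in trace)
    return useful / (steps * threads)

# Hypothetical 3-step kernel on 4 threads: every thread works in the
# first step, two threads in the second, one thread in the third.
trace = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
example_etu = etu(trace)  # 7 useful slots out of 12
```

Unlike occupancy, which only counts how many threads are resident, this ratio drops whenever launched threads sit behind a false branch condition.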
Methods – ETU (4/9)
ETU Metric
$$ETU = \frac{\sum_{s=1}^{sm} \sum_{t=1}^{tm} tu(t, s)}{sm \times tm}$$
  sm: number of micro steps
  tm: number of launched threads
Algorithm 2: Direct Inverse Legendre Transform (DILT)
Input: ξ_n^m, P_n^m, J, M
Output: ξ^m
Execution configuration: (J, M+1)
Declaration: tid, bid, fc_sh(M+1) // fc_sh: shared memory
initialize fc_sh(tid) to 0; // 1 micro step (m_s)
for n = 0 to M do // M+1 m_s
  if tid ≤ n then
    fc_sh(tid) += ξ_n^tid × P_n^tid(μ_bid);
  end if
end for
ξ^tid(μ_bid) = fc_sh(tid); // 1 m_s
Example
$$ETU = \frac{J(M+1) + J(M+1)(M+2)/2 + J(M+1)}{J(M+1)(M+3)} = \frac{1 + (M+2)/2 + 1}{M+3} = \frac{M+6}{2(M+3)}$$
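The ETU of Algorithm 2 can be checked by emulating one thread block on the CPU and counting the busy slots. This is a minimal sketch, not the authors' CUDA kernel: the names `dilt_block` and `dilt_etu` are hypothetical, and the nested list layout `spec[n][m]` for ξ_n^m is an assumption made for readability. Since every block behaves identically, the factor J cancels and one block is enough.

```python
def dilt_block(spec, leg, M):
    """Emulate one thread block of the direct inverse Legendre transform.

    spec[n][m] holds xi_n^m and leg[n][m] holds P_n^m(mu_j), for m <= n.
    Returns (fourier, busy): the M + 1 results xi^m(mu_j) and the number
    of (thread, micro step) slots that did useful work.
    """
    threads = M + 1
    fc_sh = [0.0] * threads            # micro step 1: each thread clears its slot
    busy = threads
    for n in range(M + 1):             # M + 1 micro steps, one per degree n
        for tid in range(threads):     # emulates the threads of the block
            if tid <= n:               # only threads with m <= n have a term
                fc_sh[tid] += spec[n][tid] * leg[n][tid]
                busy += 1
    busy += threads                    # last micro step: each thread stores its result
    return fc_sh, busy

def dilt_etu(M):
    """ETU of the emulated block: busy slots over (M + 3) steps times (M + 1) threads."""
    ones = [[1.0] * (n + 1) for n in range(M + 1)]
    _, busy = dilt_block(ones, ones, M)
    return busy / ((M + 3) * (M + 1))

etu_t213 = dilt_etu(213)
```

The counted value matches the closed form (M+6)/(2(M+3)) and approaches 1/2 as M grows, which is the inefficiency the reshaped method targets.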
Methods – Reshaped (5/9)
Forward Legendre
[Figure, left (direct layout): blocks blk 0, blk 1, blk 2, …, blk M-1, blk M, one per row of the triangular array ξ_0^0 … ξ_M^M, each with threads T 0, T 1, T 2, …, T M-1, T M. Block x has x+1 busy threads and M-x idle threads. ETU ≈ 1/2.]
[Figure, right (after reshape): blocks blk 0, blk 1, blk 2, …, blk (M-3)/2, blk (M-1)/2, each with threads T 0, T 1, T 2, …, T M, T M+1. Rows of the triangle are combined so that all threads of block x are busy. ETU ≈ 1.]
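The effect of the reshape can be sketched by counting launched and busy threads in both layouts. The pairing used here, row x combined with row M-x so that each block holds (x+1) + (M-x+1) = M+2 coefficients, is an assumption read off the figure labels, and the function names are hypothetical. The count covers only the threads that own a coefficient, which is the quantity behind the "ETU ≈ 1/2" and "ETU ≈ 1" annotations.

```python
def direct_layout(M):
    """One block per degree n with M + 1 threads; thread m is busy iff m <= n."""
    return [(M + 1, n + 1) for n in range(M + 1)]        # (launched, busy)

def reshaped_layout(M):
    """Assumed pairing: block x holds row x and row M - x, M + 2 threads in total."""
    if M % 2 == 0:
        raise ValueError("this sketch assumes an odd truncation M")
    return [(M + 2, (x + 1) + (M - x + 1)) for x in range((M + 1) // 2)]

def utilization(layout):
    """Busy threads divided by launched threads, summed over all blocks."""
    launched = sum(l for l, _ in layout)
    busy = sum(b for _, b in layout)
    return busy / launched

M = 213
direct = utilization(direct_layout(M))       # (M + 2) / (2 * (M + 1))
reshaped = utilization(reshaped_layout(M))   # every launched thread is busy
```

Both layouts cover the same (M+1)(M+2)/2 coefficients; the reshaped one simply launches about half as many threads to do it.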
Methods – Reshaped (6/9)
Inverse Legendre, T213 model
$$\xi^m(\mu_j) = \sum_{n=0}^{213} \xi_n^m P_n^m(\mu_j) = \sum_{n=0}^{59} \xi_n^m P_n^m(\mu_j) + \sum_{n=60}^{127} \xi_n^m P_n^m(\mu_j) + \sum_{n=128}^{213} \xi_n^m P_n^m(\mu_j)$$
  Only terms with m ≤ n contribute to each sum.
[Figure: the triangular index space (n, m) for M = 213, from (0,0) to (213,213), with outputs ξ^0(μ_j) … ξ^M(μ_j), reshaped into bands of rows: n = 0–9, 10–19, 20–29, 30–39, 40–49 and 50–59 (10 rows each), n = 60–93 and n = 94–127 (34 rows each, containing the trapezia α and β), and n = 128–213 (86 rows). Axis label: block size.]
Methods – Reshaped (7/9)
Inverse Legendre, T213 model
[Figure: the bands are reconstructed into three shapes along the block-size axis: sh1 (marks at 10, 20, 30, 40, 50, 60), sh2 (marks at 94 and 128) and sh3 (mark at 214). The pieces are numbered ① to ④.]
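Methods – Reshaped (8/9)
Inverse Legendre, T213 model
Computation for trapezia α and β
$$\sum_{n=60}^{127} \xi_n^m P_n^m(\mu_j) = \sum_{n=60}^{93} \xi_n^m P_n^m(\mu_j) + \sum_{n=94}^{127} \xi_n^m P_n^m(\mu_j)$$
[Figure: shape sh2, with marks at 94 and 128, piece ①.]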
Methods – Concurrent Kernel (9/9)
Concurrent Kernel Execution
  Supported by Fermi and later architectures
  Programs with many small kernels can be executed efficiently on GPUs
  Takes future software scalability into consideration
T213 model

          Concurrent Forward Legendre           Concurrent Inverse Legendre
Kernel    n            Grid size  Block size    m            Grid size  Block size
1         [0,53]       54         64            [0,53]       320        64
2         [54,117]     64         128           [54,117]     320        64
3         [118,213]    96         224           [118,213]    320        96
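One possible reading of the forward Legendre columns is that each kernel gets a block size just large enough for its range of n, so fewer launched threads sit idle. The sketch below tests that reading under two assumptions that are not stated on the slide: one block per degree n with thread m busy only when m ≤ n, and a hypothetical single-kernel baseline that uses block size 224 for every n.

```python
# (first n, last n, grid size, block size), taken from the forward Legendre columns.
KERNELS = [
    (0, 53, 54, 64),
    (54, 117, 64, 128),
    (118, 213, 96, 224),
]

# Hypothetical baseline: one kernel covering all n with the largest block size.
SINGLE = [(0, 213, 214, 224)]

def busy_fraction(kernels):
    """Share of launched threads that own a coefficient (m <= n)."""
    launched = busy = 0
    for first, last, grid, block in kernels:
        assert grid == last - first + 1      # assumed: one block per degree n
        for n in range(first, last + 1):
            launched += block
            busy += n + 1                    # threads m = 0..n have work
    return busy / launched

split_fraction = busy_fraction(KERNELS)
single_fraction = busy_fraction(SINGLE)
```

Under these assumptions the three-kernel split keeps a larger share of its launched threads busy than the single-kernel baseline, and the three kernels can overlap on hardware that supports concurrent kernel execution.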
Experiments (1/4)
Validation of ETU metric
  T341 model
  Variable block size
Observations
  In general, a larger ETU indicates better performance
  No direct relationship appears between OCCUPANCY and performance
  The same OCCUPANCY does not mean equal performance
  At the same OCCUPANCY, a larger ETU gives better performance

BS     ETU      OCCUPANCY   Time (ms)
96     0.8039   0.312       1.975
128    0.7480   0.417       2.239
160    0.7831   0.417       2.038
192    0.6519   0.625       2.198
Experiments (2/4)
Performance
[Figure panels: Forward Legendre; Inverse Legendre]
Experiments (3/4)
Case Study: STSWM
  A global shallow water model based on the SHT
  Exhibits many mathematical and computational properties of more complete models
  Used to investigate and compare numerical methods for simulating atmospheric models
  T213 truncation
Forward Legendre: ftrnve, ftrndi and ftrnpi
Inverse Legendre: shtrns
Experiments (4/4)
Case Study: STSWM
Review
Motivation
Spherical Harmonic Transforms
Methods
  Direct Method
  Efficiency of Threads Utilization
  Reshaped Method
  Concurrent Kernel Execution
Experiments