CUDA Linear Equations Solver Based on Modified Gaussian Elimination
By Xinggao Xia and Jong Chul Lee
Applications of Systems of Linear Equations
Biology: e.g., determining how many specimens must be tested in an experiment
Physics: e.g., finding unknown currents from given voltages
Chemistry: e.g., balancing reactions such as 2H2 + O2 -> 2H2O
Technique               Additions   Multiplications/Divisions
Gauss-Jordan            n³/2        n³/2
Gaussian Elimination    n³/3        n³/3
Cramer's Rule           n⁴/3        n⁴/3
(n: the dimension of the matrix)
Table 1: Computational Complexity of Various Solving Techniques
Comments: Computational complexity increases tremendously as the dimension of the matrix grows. The Gaussian Elimination solver has an obvious complexity advantage as the matrix size increases.
[Figure: worked 3x3 example of modified Gaussian elimination on Ax = B, showing the pivot row, the multiplier column m, and the evolving augmented matrix over Iteration No.1, Iteration No.2, and Iteration No.3]
Normalization
[Figure: over Iteration No.1 through Iteration No.N the coefficient matrix is reduced to diagonal form; a final normalization step scales each row so the solution can be read off directly]
Inter-iteration parallelism
[Figure: multiplier column (m[0][i], m[1][i], m[2][i], m[3][i]) for iteration i]
For iteration i: A[j][] = A[j][] - m[j][i] * pivot row
The multiplier array m must be determined before each iteration.
SIMD
This update pattern perfectly fits the CUDA architecture, so modified Gaussian Elimination is chosen for the CUDA linear equations solver:
- More parallelism
- No back substitution
- Partial pivoting guarantees the accuracy of the solution
[Figure: side-by-side worked example on the same 3x3 system. Traditional Gaussian Elimination (initial state, Iteration No.1, Iteration No.2) reduces the matrix to upper-triangular form; Modified Gaussian Elimination (initial state, Iterations No.1 through No.3) also eliminates above each pivot and reaches the identity, yielding the solution directly.]
[Figure: pivot row i and pivot column i highlighted in the matrix]
For iteration i: Row j = Row j - m[j] * Row i
Traditional Gaussian Elimination eliminates only below the pivot; modified Gaussian Elimination adds the eliminations above the pivot as well. More parallelism!
Traditional Gaussian Linear Solver (Gaussian Elimination):
[Figure: Iteration No.1 through Iteration No.N-1 reduce the matrix to upper-triangular form, followed by a Back Substitution step]
Modified Gaussian Linear Solver (Modified Gaussian Elimination):
[Figure: Iteration No.1 through Iteration No.N reduce the matrix to the identity; no back substitution is needed]
For (i = 0; i < N; i++) {
    Partial pivoting {
        Transfer the ith column back to host;
        Search the maximum of this column and return the index; (Host)
        Switch rows if necessary; (Device)
    }
    Determine the multiplier column; (Device)
    Modified Gaussian elimination; (Device)
}
Normalize the solution; (Device)
Transfer solution back to host;
Threads Architecture
- Matrix handling (e.g., the modified Gaussian Elimination kernel): each thread performs one element update A[j][k] = A[j][k] - m[j] * A[i][k] for iteration i; a two-dimensional grid and block are used, for a total of N*N threads in the kernel.
- Row or column handling (e.g., partial pivoting): each thread handles one element of the row or column; a one-dimensional grid and block are used, for a total of N threads in the kernel.
[Figure: partial pivoting flow. Kernel 1 gathers pivot column i into d_temp; cudaMemcpy (Device to Host) transfers it to h_temp; the host searches the column for its maximum (c in the example); Kernels 2, 3, and 4 then switch the rows on the device.]
Minimizing device-to-host transportation: rows are switched by a kernel on the device, so only the single pivot column is transferred per iteration.
For iteration i, each thread handles one element update: A[j][k] = A[j][k] - m[j] * A[i][k]
[Figure: iteration-i data partitioning. The N x N matrix is tiled into an (N/BLOCK_SIZE) x (N/BLOCK_SIZE) grid of blocks B(by, bx), each containing BLOCK_SIZE x BLOCK_SIZE threads T(ty, tx).]
[Figure: shared-memory layout. For iteration i, each block loads its slice of pivot row i and of the N x 1 multiplier column m into shared memory; every thread then computes A[j][k] = A[j][k] - m[j] * A[i][k].]
Platform configuration:
GPU: GeForce 8400 GS, 1 SM, 8 cores, clock rate 1.40 GHz
CPU: Intel Core 2 Quad Q6600, clock rate 2.39 GHz
Time (ms) by matrix size:

Implementation                          512     1024    2048    4096
Serial Traditional Gaussian Solver      47      403     5214    46098
Serial Modified Gaussian Solver         71      564     8412    69949
Global Memory (1 SM)                    1718    13488   108916  862580
Shared Memory (1 SM)                    662     4806    38923   312787
Global Memory (scaled by 16)            107     843     6807    53911
Shared Memory (scaled by 16)            41      300     2433    19549
Comments: The GPU implementation (global or shared memory, 1 SM) is much slower than the CPU implementation. To mimic a Tesla-class GPU (16 SM), GPU times are scaled by 16.
[Charts: CPU runtime (ms) vs. matrix size, Traditional GE vs. Modified GE]
Comments: The CPU prefers the traditional GE solver to the modified GE solver. The GPU shared-memory implementation is consistently 2-3 times faster than the global-memory implementation. The GPU (16 SM) shared-memory implementation achieves around a 2x speedup over the traditional GE solver.
[Chart: runtime (ms) vs. matrix size for Traditional GE, Modified GE, and the GPU Global/Shared (16 SM) implementations]
Profiler results:

Method              #Calls   GPU (usec)   %GPU time
Global GE_kernel    1024     1.3e+07      99.11
Shared GE_kernel    1024     4.8e+06      97.6

Implementation   gld uncoalesced   gld coalesced   % uncoalesced
Global           1048576           131072          89
Shared           61440             73728           45
For the 1024 case (1 SM), the global-memory implementation takes 13488 ms, while the shared-memory implementation takes 4806 ms.
Conclusion:
- A linear equations solver based on modified Gaussian Elimination is implemented on CUDA.
- The shared-memory implementation is about 3 times faster than the global-memory implementation.
- On a 16 SM GPU, the shared-memory implementation is expected to be about 3 times faster than the serial traditional Gaussian Elimination solver.
- Partial pivoting guarantees stability and accuracy (error less than 0.001 compared to the serial code).
Problem found:
- More uncoalesced global-memory accesses offset the advantage gained from the extra parallelism in modified Gaussian Elimination.