CUDA Linear Equations Solver Based on Modified Gaussian Elimination

By Xinggao Xia and Jong Chul Lee




Page 1:

By Xinggao Xia and Jong Chul Lee

CUDA Linear Equations Solver Based on Modified Gaussian Elimination

Page 2:

Applications of Systems of Linear Equations

Biology

Ex) Determining how many specimens must be experimented on

Physics

Ex) Finding missing currents from given voltages

Chemistry

Ex) Balancing reactions such as 2H2 + O2 -> 2H2O

Page 3:

Technique               Additions  Multiplications/Divisions
Gauss-Jordan            n^3/2      n^3/2
Gaussian Elimination    n^3/3      n^3/3
Cramer's Rule           n^4/3      n^4/3

n: the order (number of rows) of the matrix

Table 1: Computational Complexity of Various Solving Techniques

Comments: Computational complexity increases tremendously as the matrix dimension grows. The Gaussian elimination solver has a clear advantage in complexity as the matrix size increases.

Page 4:

[Figure: worked example of the elimination on a 3x3 augmented system [A|b] (Ax = b), showing the pivot row and the multipliers m21, m31 at each step, the states after Iteration No.1, Iteration No.2, Iteration No.3, and the final Normalization of the solution. A second diagram shows the general pattern: over Iteration No.1, No.2, ..., No.N the left-hand side is reduced column by column.]

Page 5:

Inter-iteration parallelism

[Figure: the multiplier column (m0i, m1i, m2i, m3i) held against the pivot row of iteration i.]

For iteration i: Row j = Row j - m[j][i] * (pivot row i), i.e. A[j][k] = A[j][k] - m[j][i] * A[i][k] for every column k

The multiplier array m must be determined before each iteration

SIMD: once m is known, all row updates within an iteration are independent

Perfectly fits the CUDA architecture
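The independence of the updates can be made explicit in a serial C sketch (the function name and the row-major augmented-matrix layout are my choices): the multiplier column is computed first, after which every element update is independent and could be one CUDA thread.

```c
#include <stdlib.h>

/* One iteration i of the elimination step described above, for an
   n x (n+1) augmented matrix [A|b] stored row-major.  The multiplier
   column m is filled first; afterwards every update
   A[j][k] -= m[j] * A[i][k] touches only its own element, so all
   updates can run in parallel.  Illustrative serial sketch. */
void eliminate_iteration(int n, double a[], int i)
{
    int cols = n + 1;
    double *m = malloc(n * sizeof *m);
    for (int j = 0; j < n; j++)          /* multiplier column first */
        m[j] = (j == i) ? 0.0 : a[j * cols + i] / a[i * cols + i];
    for (int j = 0; j < n; j++)          /* independent element updates */
        for (int k = 0; k < cols; k++)
            a[j * cols + k] -= m[j] * a[i * cols + k];
    free(m);
}
```

After iteration i, column i is zero everywhere except in row i, which is exactly the pattern shown in the worked example on the previous slide.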

Page 6:

Modified Gaussian elimination is chosen for the CUDA linear equations solver:

More parallelism; no back substitution

Partial pivoting guarantees the accuracy of the solution
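A serial C reference for the modified scheme, partial pivoting included, might look like the following. This is a sketch of my reading of the slides, not the authors' code: each iteration clears column i in every other row (above and below the pivot), so the loop ends with a diagonal system and only a normalization pass remains.

```c
#include <math.h>

/* Modified Gaussian elimination with partial pivoting.
   a is an n x (n+1) augmented matrix [A|b], row-major; x receives the
   solution.  Returns 0 on success, -1 if the matrix is singular. */
int modified_gauss_solve(int n, double a[], double x[])
{
    int cols = n + 1;
    for (int i = 0; i < n; i++) {
        /* partial pivoting: largest |A[j][i]| among rows i..n-1 */
        int p = i;
        for (int j = i + 1; j < n; j++)
            if (fabs(a[j*cols + i]) > fabs(a[p*cols + i])) p = j;
        if (a[p*cols + i] == 0.0) return -1;     /* singular */
        if (p != i)
            for (int k = 0; k < cols; k++) {
                double t = a[i*cols + k];
                a[i*cols + k] = a[p*cols + k];
                a[p*cols + k] = t;
            }
        /* eliminate column i in every other row (above AND below) */
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double m = a[j*cols + i] / a[i*cols + i];
            for (int k = i; k < cols; k++)
                a[j*cols + k] -= m * a[i*cols + k];
        }
    }
    for (int i = 0; i < n; i++)                  /* normalization */
        x[i] = a[i*cols + n] / a[i*cols + i];
    return 0;
}
```

Note there is no back-substitution loop: the matrix is already diagonal when the iterations finish, so normalization alone yields x.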

Page 7:

[Figure: the same 3x3 augmented system solved both ways. Traditional Gaussian elimination (top): Initial state, Iteration No.1, Iteration No.2, ending in an upper-triangular system. Modified Gaussian elimination (bottom): Initial state, Iteration No.1, Iteration No.2, Iteration No.3, where each iteration also clears the entries above the pivot, ending with the identity on the left.]

Page 8:

[Figure: at iteration i, columns 0..i-1 already hold the identity pattern; row i and column i are highlighted.]

For iteration i: Row j = Row j - m[j] * Row i

Traditional Gaussian elimination applies this only to the rows below the pivot (j > i); the modified version adds the same elimination above the pivot (j < i) as well.

More parallelism!

Page 9:

Traditional Gaussian linear solver (Gaussian elimination): Iteration No.1, No.2, ..., No.N-1 reduce the matrix to upper-triangular form, which must then be finished by back substitution.

Modified Gaussian linear solver (modified Gaussian elimination): Iteration No.1, No.2, ..., No.N reduce the matrix directly to diagonal form, so no back substitution is needed.

[Figure: the two iteration sequences side by side, triangular fill-in versus diagonal fill-in.]
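For contrast, the back-substitution step that the traditional solver needs (and the modified solver avoids) can be sketched in C. Note the loop-carried dependence: each x[i] needs every x[j] with j > i, so the outer loop is inherently sequential and resists the thread-per-element mapping used elsewhere in this design. Names are illustrative.

```c
/* Back substitution for an upper-triangular n x (n+1) augmented
   matrix u (row-major).  x[i] depends on x[i+1..n-1], so iterations
   of the outer loop cannot run in parallel.  Sketch only. */
void back_substitute(int n, const double u[], double x[])
{
    int cols = n + 1;
    for (int i = n - 1; i >= 0; i--) {
        double s = u[i*cols + n];            /* right-hand side */
        for (int j = i + 1; j < n; j++)
            s -= u[i*cols + j] * x[j];       /* needs later x[j] */
        x[i] = s / u[i*cols + i];
    }
}
```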

Page 10:

For (i = 0; i < N; i++) {
    Partial pivoting {
        Transfer the i-th column back to host;
        Search the maximum of this column and return its index;  (Host)
        Switch rows if necessary;                                (Device)
    }
    Determine the multiplier column;                             (Device)
    Modified Gaussian elimination;                               (Device)
}
Normalize the solution;              (Device)
Transfer the solution back to host;

Threads architecture:

Matrix handling (e.g. the modified Gaussian elimination kernel): for iteration i, each thread handles one operation A[j][k] = A[j][k] - m[j] * A[i][k]; a two-dimensional grid and block are used, for a total of N*N threads in the kernel.

Row or column handling (e.g. partial pivoting and others): each thread handles one element of the row or column; a one-dimensional grid and block are used, for a total of N threads in the kernel.

Page 11:

[Figure: partial-pivoting data flow for column i. Kernel 1 gathers the column into d_temp; cudaMemcpy (device to host) copies it to h_temp; the host searches for the maximum element (here c); Kernels 2-4 then perform the row switch on the device.]

Minimizing device-host transfers: rows are switched by a kernel on the device.

Page 12:

For the i-th iteration, each thread handles one element update: A[j][k] = A[j][k] - m[j] * A[i][k]

[Figure: iteration-i data partitioning. The N x N matrix is tiled into BLOCK_SIZE x BLOCK_SIZE thread blocks B(0,0), B(0,1), ..., B(0,N-1), ..., B(N-1,1); each block holds threads T(0,0), T(0,1), ..., T(0,M-1). Row i and column i of the matrix feed every block.]

Page 13:

[Figure: for iteration i, each block B(i,j) stages pivot row i and the N x 1 multiplier column m in shared memory before updating its tile.]

For the i-th iteration: A[j][k] = A[j][k] - m[j] * A[i][k]

Page 14:

Platform configuration:
GPU: GeForce 8400 GS, 1 SM, 8 cores, clock rate 1.40 GHz
CPU: Intel Core2 Quad Q6600, clock rate 2.39 GHz

Execution time (ms) by matrix size:

Matrix size                                 512    1024    2048     4096
Serial Traditional Gaussian Linear Solver    47     403    5214    46098
Serial Modified Gaussian Linear Solver       71     564    8412    69949
Global Memory (1 SM)                       1718   13488  108916   862580
Shared Memory (1 SM)                        662    4806   38923   312787
Global Memory (scaled by 16)                107     843    6807    53911
Shared Memory (scaled by 16)                 41     300    2433    19549

Comments: The GPU implementation (global or shared memory) on 1 SM is much slower than the CPU implementation. To mimic a Tesla (16 SMs), the GPU times are also shown scaled by 16.

Page 15:

[Charts: CPU execution time (ms) vs. matrix size, Traditional GE vs. Modified GE, for sizes 512-1024 (0-900 ms) and 2048-4096 (0-80000 ms).]

Comments: The CPU prefers the traditional GE solver to the modified GE solver. The GPU shared-memory implementation is consistently 2-3 times faster than the global-memory implementation. The GPU (16 SM) shared-memory implementation achieves around a 2x speedup over the traditional GE solver.

[Chart: execution time (ms) vs. matrix size (512-4096, 0-80000 ms) for Traditional GE, Modified GE, and Global (16 SM).]

Page 16:

Profiler results for the 1024 case:

                 Method     #Calls  GPU (usec)  %GPU time
Global           GE_kernel    1024     1.3e+07      99.11
Shared           GE_kernel    1024     4.8e+06      97.6

                 gld uncoalesced  gld coalesced  % uncoalesced
Global                   1048576         131072             89
Shared                     61440          73728             45

For the 1024 case (1 SM), the global-memory implementation takes 13488 ms and the shared-memory implementation 4806 ms.

Page 17:

Conclusion:

A linear equations solver based on modified Gaussian elimination has been implemented on CUDA.

The shared-memory implementation is about 3 times faster than the global-memory implementation.

The shared-memory implementation is expected to be about 3 times faster than the serial traditional Gaussian elimination solver on a 16-SM GPU.

Partial pivoting guarantees stability and accuracy (error less than 0.001 compared to the serial code).

Problem found:

The additional uncoalesced global-memory accesses offset the advantage gained from the extra parallelism of modified Gaussian elimination.