Jacobi iteration for Laplace's equation in OpenMP® and CUDA™


Here are the results of running two straightforward implementations to solve the two-dimensional Laplace's equation. One implementation runs on an Intel® Core™ i7-950 Processor using OpenMP® with up to 4 threads; the other on an NVIDIA® GeForce® GTX 580 using CUDA™.

The following Jacobi iteration computes an approximate solution of the two-dimensional Laplace's equation on a regular grid:

$$u_{i,j}^{[k+1]} = \frac{1}{4} (u_{i-1,j}^{[k]} + u_{i+1,j}^{[k]} + u_{i,j-1}^{[k]} + u_{i,j+1}^{[k]})$$

In pseudo-code this is:

~~~
while (not converged) {
    for (i,j)
        u[i][j] = (uold[i+1][j] + uold[i-1][j] +
                   uold[i][j+1] + uold[i][j-1])/4;
    for (i,j)
        uold[i][j] = u[i][j];
}
~~~

Instead of checking for convergence, a fixed number of iterations is computed, and instead of copying the new values back to uold, the two pointers are swapped. In all cases the data was obtained by running 10000 iterations.

\newpage

On the CPU the performance for single and double precision was very similar, so only the results for double precision are shown. For OpenMP we have the following results with a grid of 1000x1000:

 threads   time (s)      speedup   GFLOPs
--------- ------------- --------- --------
       1    220.838198               0.18
       2    113.171384     1.95      0.35
       3     79.301204     2.78      0.50
       4     59.054578     3.74      0.68
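
These GFLOPs figures appear to be consistent with counting four floating-point operations (three additions and one multiplication) per grid point per iteration over the full 1000x1000 grid; for example, for the four-thread run:

$$\mathrm{GFLOPs} \approx \frac{4 \cdot 1000^2 \cdot 10000}{59.054578 \cdot 10^9} \approx 0.68$$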

\newpage

Next, the results for CUDA, also using a grid of 1000x1000:

 precision   time (s)    GFLOPs
----------- ----------- --------
 single      1.048223    38.16
 double      1.814037    22.05

\newpage

Finally, a comparison between these two implementations using double precision on grids of size N x N (times in seconds; the speedup is the ratio of the OpenMP time to the CUDA time):

    N   OMP GFLOPs   CUDA GFLOPs   OMP time   CUDA time   speedup
-----  -----------  ------------  ---------  ----------  --------
  100         1.18          3.99       0.34        0.10      3.39
  200         0.90         12.18       1.77        0.13     13.46
  300         1.03         13.59       3.48        0.26     13.15
  400         0.99         17.85       6.47        0.36     18.05
  500         0.98         19.87      10.23        0.50     20.32
  600         0.94         20.26      15.24        0.71     21.44
  700         0.82         21.74      23.97        0.90     26.59
  800         0.71         22.16      36.12        1.16     31.27
  900         0.71         21.57      45.50        1.50     30.28
 1000         0.68         22.05      59.05        1.81     32.55

The implementation in C using OpenMP is straightforward:

~~~ {#ompjacobi .C}
void omp_jacobi(_T ***u, _T ***uold, int n, int iters)
{
    int i, j, k;
    _T **temp;

    for (k = 0; k < iters; k ++) {
        /* Update every interior point from the previous iterate; the j
           (column) loop is distributed among the OpenMP threads. */
        #pragma omp parallel for private(i)
        for (j = 1; j < n-1; j ++) {
            for (i = 1; i < n-1; i ++)
                (*u)[i][j] = ((_T)0.25)*((*uold)[i+1][j] + (*uold)[i-1][j] +
                                         (*uold)[i][j+1] + (*uold)[i][j-1]);
        }

        /* Swap */
        temp = *uold;
        *uold = *u;
        *u = temp;
    }

    /* Last iteration result is in uold, swap again */
    temp = *uold;
    *uold = *u;
    *u = temp;
}
~~~
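
The text does not show how the grids are allocated or how omp_jacobi is called. A minimal sketch of one possible driver is given below; _T being a typedef for double, the alloc_grid helper, and the zero initialization are all assumptions and do not appear in the original:

~~~ {.C}
#include <stdlib.h>

typedef double _T;            /* assumption: _T selects the working precision */

/* Hypothetical helper: allocate an n x n grid of zeros as an array of rows. */
static _T **alloc_grid(int n)
{
    _T **g = malloc(n * sizeof(_T *));
    for (int i = 0; i < n; i ++)
        g[i] = calloc(n, sizeof(_T));
    return g;
}

int main(void)
{
    int n = 1000, iters = 10000;
    _T **u = alloc_grid(n), **uold = alloc_grid(n);

    /* ... set the boundary conditions on both grids here ... */

    omp_jacobi(&u, &uold, n, iters);  /* final result ends up in u */

    /* ... use or write out u, then free both grids ... */
    return 0;
}
~~~

Compiled together with omp_jacobi above (e.g. with gcc -fopenmp), this would run the 1000x1000 case from the tables.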

A straightforward implementation is also done in CUDA. The kernel that computes one iteration is:

~~~ {#cujacobi .C}
__global__ void cuda_jacobi1(_T *u, _T *uold)
{
    int i, j;

    /* Each block handles one value of j, each thread one value of i;
       the +1 offsets skip the boundary. */
    i = threadIdx.x + 1;
    j = blockIdx.x + 1;

    u[GET_I1(i,j)] = ((_T)0.25)*(uold[GET_I1(i+1,j)] + uold[GET_I1(i-1,j)] +
                                 uold[GET_I1(i,j+1)] + uold[GET_I1(i,j-1)]);
}
~~~
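
GET_I1 is not defined anywhere in the text; it presumably maps the 2D index (i, j) into the flat device array. One plausible definition, assuming row length N and a layout in which consecutive values of i are adjacent in memory (so that accesses within a block are coalesced), would be:

~~~ {.C}
/* Assumed indexing macro -- the actual definition is not in the original. */
#define GET_I1(i, j) ((j) * N + (i))
~~~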

The kernel is then launched multiple times, once for each iteration:

~~~ {#cusolvejacobi .C}
void cuda_solve_jacobi1(_T *ud, _T *uoldd, _T *u)
{
    int i;
    _T *temp;

    for (i = 0; i < IT; i ++) {
        /* N-2 blocks of N-2 threads: one block per j, one thread per i
           (launch configuration assumed from the kernel's indexing). */
        cuda_jacobi1<<<N-2, N-2>>>(ud, uoldd);
        cudaDeviceSynchronize();

        /* Swap */
        temp = uoldd;
        uoldd = ud;
        ud = temp;
    }
    cudaMemcpy(u, uoldd, N*N*sizeof(_T), cudaMemcpyDeviceToHost);
}
~~~
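
The surrounding host code (device allocation and the initial copy of the grid to the device) is not shown. Under the same assumptions about N, IT, and _T, it might look roughly like this; the boundary-condition setup is a placeholder:

~~~ {.C}
/* Sketch of possible host-side setup; not from the original text.
   Assumes N, IT, _T, GET_I1 and the functions above are already defined. */
#include <stdlib.h>

int main(void)
{
    _T *u, *ud, *uoldd;

    u = (_T *)malloc(N * N * sizeof(_T));
    /* ... initialize u, including the boundary conditions ... */

    cudaMalloc((void **)&ud,    N * N * sizeof(_T));
    cudaMalloc((void **)&uoldd, N * N * sizeof(_T));

    /* Both device buffers start from the same initial grid. */
    cudaMemcpy(ud,    u, N * N * sizeof(_T), cudaMemcpyHostToDevice);
    cudaMemcpy(uoldd, u, N * N * sizeof(_T), cudaMemcpyHostToDevice);

    cuda_solve_jacobi1(ud, uoldd, u);   /* result is copied back into u */

    cudaFree(ud);
    cudaFree(uoldd);
    free(u);
    return 0;
}
~~~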

One limitation of this approach is that the size of the grid is limited by the maximum number of threads per block on the device. The data shown here was obtained on an NVIDIA GeForce GTX 580, which supports up to 1024 threads per block.
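
A common way to remove this limitation, which is not part of the implementation above, is to tile the domain with a two-dimensional grid of two-dimensional blocks, so that the number of interior points per dimension is no longer tied to the block size. A sketch of such a kernel, reusing the assumed GET_I1 macro, might look like this:

~~~ {.C}
/* Sketch only: a 2D-tiled variant, not from the original text. Each thread
   updates one interior point; the grid of blocks covers an arbitrary n x n
   domain regardless of the 1024-threads-per-block limit. */
__global__ void cuda_jacobi_tiled(_T *u, _T *uold, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;

    if (i < n - 1 && j < n - 1)
        u[GET_I1(i,j)] = ((_T)0.25)*(uold[GET_I1(i+1,j)] + uold[GET_I1(i-1,j)] +
                                     uold[GET_I1(i,j+1)] + uold[GET_I1(i,j-1)]);
}

/* Possible launch: 16x16 threads per block, enough blocks to cover the grid:
       dim3 block(16, 16);
       dim3 grid((n - 2 + 15) / 16, (n - 2 + 15) / 16);
       cuda_jacobi_tiled<<<grid, block>>>(ud, uoldd, n);                      */
~~~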