Jacobi iteration for Laplace's equation in OpenMP® and CUDA™


Here are the results of running two straightforward implementations to solve the two-dimensional Laplace's equation. One implementation runs on an Intel® Core™ i7-950 Processor using OpenMP® with up to 4 threads; the other on an NVIDIA® GeForce® GTX 580 using CUDA™.

The following Jacobi iteration computes an approximate solution of the two-dimensional Laplace's equation on a regular grid:

$$u_{i,j}^{[k+1]} = \frac{1}{4} (u_{i-1,j}^{[k]} + u_{i+1,j}^{[k]} + u_{i,j-1}^{[k]} + u_{i,j+1}^{[k]})$$

In pseudo-code this is:

~~~
while (not converged) {
    for (i,j)
        u[i][j] = (uold[i+1][j] + uold[i-1][j] +
                   uold[i][j+1] + uold[i][j-1])/4;
    for (i,j)
        uold[i][j] = u[i][j];
}
~~~

Instead of checking for convergence, a fixed number of iterations is computed, and instead of copying the new values back to uold, the two pointers are swapped. In all cases the data was obtained by running 10000 iterations.

\newpage

On the CPU the performance for single and double precision was very similar, so only the results for double precision are shown. For OpenMP we have the following results with a grid of 1000x1000:

 threads   time (s)      speedup   GFLOPs
--------- ------------- --------- --------
       1    220.838198               0.18
       2    113.171384     1.95      0.35
       3     79.301204     2.78      0.50
       4     59.054578     3.74      0.68
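
These GFLOPs figures appear to be consistent with counting four floating-point operations (three additions and one multiplication) per grid point per iteration over the full 1000x1000 grid; for example, for the four-thread run:

$$\mathrm{GFLOPs} \approx \frac{4 \cdot 1000^2 \cdot 10000}{59.054578 \cdot 10^9} \approx 0.68$$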

\newpage

Next, the results for CUDA, also using a grid of 1000x1000:

 precision   time (s)    GFLOPs
----------- ----------- --------
 single      1.048223    38.16
 double      1.814037    22.05

\newpage

Finally, a comparison between these two implementations using double precision on grids of size N x N (times in seconds; the speedup is the ratio of the OpenMP time to the CUDA time):

    N   OMP GFLOPs   CUDA GFLOPs   OMP time   CUDA time   speedup
-----  -----------  ------------  ---------  ----------  --------
  100         1.18          3.99       0.34        0.10      3.39
  200         0.90         12.18       1.77        0.13     13.46
  300         1.03         13.59       3.48        0.26     13.15
  400         0.99         17.85       6.47        0.36     18.05
  500         0.98         19.87      10.23        0.50     20.32
  600         0.94         20.26      15.24        0.71     21.44
  700         0.82         21.74      23.97        0.90     26.59
  800         0.71         22.16      36.12        1.16     31.27
  900         0.71         21.57      45.50        1.50     30.28
 1000         0.68         22.05      59.05        1.81     32.55

The implementation in C using OpenMP is straightforward:

~~~ {#ompjacobi .C}
void omp_jacobi(_T ***u, _T ***uold, int n, int iters)
{
    int i, j, k;
    _T **temp;

    for (k = 0; k < iters; k ++) {
        /* Update every interior point from the previous iterate; the j
           (column) loop is distributed among the OpenMP threads. */
        #pragma omp parallel for private(i)
        for (j = 1; j < n-1; j ++) {
            for (i = 1; i < n-1; i ++)
                (*u)[i][j] = ((_T)0.25)*((*uold)[i+1][j] + (*uold)[i-1][j] +
                                         (*uold)[i][j+1] + (*uold)[i][j-1]);
        }

        /* Swap */
        temp = *uold;
        *uold = *u;
        *u = temp;
    }

    /* Last iteration result is in uold, swap again */
    temp = *uold;
    *uold = *u;
    *u = temp;
}
~~~
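
The text does not show how the grids are allocated or how omp_jacobi is called. A minimal sketch of one possible driver is given below; _T being a typedef for double, the alloc_grid helper, and the zero initialization are all assumptions and do not appear in the original:

~~~ {.C}
#include <stdlib.h>

typedef double _T;            /* assumption: _T selects the working precision */

/* Hypothetical helper: allocate an n x n grid of zeros as an array of rows. */
static _T **alloc_grid(int n)
{
    _T **g = malloc(n * sizeof(_T *));
    for (int i = 0; i < n; i ++)
        g[i] = calloc(n, sizeof(_T));
    return g;
}

int main(void)
{
    int n = 1000, iters = 10000;
    _T **u = alloc_grid(n), **uold = alloc_grid(n);

    /* ... set the boundary conditions on both grids here ... */

    omp_jacobi(&u, &uold, n, iters);  /* final result ends up in u */

    /* ... use or write out u, then free both grids ... */
    return 0;
}
~~~

Compiled together with omp_jacobi above (e.g. with gcc -fopenmp), this would run the 1000x1000 case from the tables.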

A straightforward implementation is also done in CUDA. The kernel that computes one iteration is:

~~~ {#cujacobi .C}
__global__ void cuda_jacobi1(_T *u, _T *uold)
{
    int i, j;

    /* Each block handles one value of j, each thread one value of i;
       the +1 offsets skip the boundary. */
    i = threadIdx.x + 1;
    j = blockIdx.x + 1;

    u[GET_I1(i,j)] = ((_T)0.25)*(uold[GET_I1(i+1,j)] + uold[GET_I1(i-1,j)] +
                                 uold[GET_I1(i,j+1)] + uold[GET_I1(i,j-1)]);
}
~~~
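
GET_I1 is not defined anywhere in the text; it presumably maps the 2D index (i, j) into the flat device array. One plausible definition, assuming row length N and a layout in which consecutive values of i are adjacent in memory (so that accesses within a block are coalesced), would be:

~~~ {.C}
/* Assumed indexing macro -- the actual definition is not in the original. */
#define GET_I1(i, j) ((j) * N + (i))
~~~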

The kernel is then launched multiple times, once for each iteration:

~~~ {#cusolvejacobi .C}
void cuda_solve_jacobi1(_T *ud, _T *uoldd, _T *u)
{
    int i;
    _T *temp;

    for (i = 0; i < IT; i ++) {
        /* N-2 blocks of N-2 threads: one block per j, one thread per i
           (launch configuration assumed from the kernel's indexing). */
        cuda_jacobi1<<<N-2, N-2>>>(ud, uoldd);
        cudaDeviceSynchronize();

        /* Swap */
        temp = uoldd;
        uoldd = ud;
        ud = temp;
    }
    cudaMemcpy(u, uoldd, N*N*sizeof(_T), cudaMemcpyDeviceToHost);
}
~~~
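
The surrounding host code (device allocation and the initial copy of the grid to the device) is not shown. Under the same assumptions about N, IT, and _T, it might look roughly like this; the boundary-condition setup is a placeholder:

~~~ {.C}
/* Sketch of possible host-side setup; not from the original text.
   Assumes N, IT, _T, GET_I1 and the functions above are already defined. */
#include <stdlib.h>

int main(void)
{
    _T *u, *ud, *uoldd;

    u = (_T *)malloc(N * N * sizeof(_T));
    /* ... initialize u, including the boundary conditions ... */

    cudaMalloc((void **)&ud,    N * N * sizeof(_T));
    cudaMalloc((void **)&uoldd, N * N * sizeof(_T));

    /* Both device buffers start from the same initial grid. */
    cudaMemcpy(ud,    u, N * N * sizeof(_T), cudaMemcpyHostToDevice);
    cudaMemcpy(uoldd, u, N * N * sizeof(_T), cudaMemcpyHostToDevice);

    cuda_solve_jacobi1(ud, uoldd, u);   /* result is copied back into u */

    cudaFree(ud);
    cudaFree(uoldd);
    free(u);
    return 0;
}
~~~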

One limitation of this approach is that the size of the grid is limited by the maximum number of threads per block on the device. The data shown here was obtained on an NVIDIA GeForce GTX 580, which supports up to 1024 threads per block.
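
A common way to remove this limitation, which is not part of the implementation above, is to tile the domain with a two-dimensional grid of two-dimensional blocks, so that the number of interior points per dimension is no longer tied to the block size. A sketch of such a kernel, reusing the assumed GET_I1 macro, might look like this:

~~~ {.C}
/* Sketch only: a 2D-tiled variant, not from the original text. Each thread
   updates one interior point; the grid of blocks covers an arbitrary n x n
   domain regardless of the 1024-threads-per-block limit. */
__global__ void cuda_jacobi_tiled(_T *u, _T *uold, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;

    if (i < n - 1 && j < n - 1)
        u[GET_I1(i,j)] = ((_T)0.25)*(uold[GET_I1(i+1,j)] + uold[GET_I1(i-1,j)] +
                                     uold[GET_I1(i,j+1)] + uold[GET_I1(i,j-1)]);
}

/* Possible launch: 16x16 threads per block, enough blocks to cover the grid:
       dim3 block(16, 16);
       dim3 grid((n - 2 + 15) / 16, (n - 2 + 15) / 16);
       cuda_jacobi_tiled<<<grid, block>>>(ud, uoldd, n);                      */
~~~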