Page 1:

INVESTIGATE AND PARALLEL PROCESSING USING E1350 IBM ESERVER CLUSTER
Ayaz ul Hassan Khan (g201002860)

Page 2:

OBJECTIVES

Explore the architecture of the E1350 IBM eServer Cluster

Parallel programming models: OpenMP, MPI, MPI+OpenMP

Analyze the effects of the above programming models on speedup

Identify overheads and optimize as much as possible

Page 3:

IBM E1350 CLUSTER

Page 4:

CLUSTER SYSTEM

The cluster is unique in its dual-boot capability, running both Microsoft Windows HPC Server 2008 and Red Hat Enterprise Linux 5.

The cluster has 3 master nodes: one for Red Hat Linux, one for Windows HPC Server 2008, and one for cluster management.

The cluster has 128 compute nodes. Each compute node is a dual-processor x3550 server with two 2.0 GHz quad-core Intel Xeon E5405 processors, giving 128 x 2 x 4 = 1024 cores in total.

Each master node has 1 TB of hard disk space; each compute node has 500 GB. Each master node has 8 GB of RAM; each compute node has 4 GB. The interconnect is 10GBASE-SR.

Page 5:

EXPERIMENTAL ENVIRONMENT

Nodes: hpc081, hpc082, hpc083, hpc084

Compilers:
  icc: for sequential and OpenMP programs
  mpiicc: for MPI and MPI+OpenMP programs

Profiling tools:
  ompP: for OpenMP profiling
  mpiP: for MPI profiling

Page 6:

APPLICATIONS USED/IMPLEMENTED

Jacobi Iterative Method
  Max Speedup = 7.1 (OpenMP, Threads = 8)
  Max Speedup = 3.7 (MPI, Nodes = 4)
  Max Speedup = 9.3 (MPI+OpenMP, Nodes = 2, Threads = 8)

Alternating Direction Integration (ADI)
  Max Speedup = 5.0 (OpenMP, Threads = 8)
  Max Speedup = 0.8 (MPI, Nodes = 1)
  Max Speedup = 1.7 (MPI+OpenMP, Nodes = 1, Threads = 8)
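The speedups above appear to follow the usual definition against the sequential run time (my reading of the plots that follow, not an explicit statement on the slide):

\mathrm{Speedup}(p) = \frac{T_{\mathrm{sequential}}}{T_{\mathrm{parallel}}(p)}

On this definition a value below 1, such as the 0.8 reported for ADI with MPI, means the parallel version runs slower than the sequential code.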

Page 7:

JACOBI ITERATIVE METHOD

Solving systems of linear equations A x = b:

x_i^{(k+1)} = \left( b_i - \sum_{j \neq i} a_{ij}\, x_j^{(k)} \right) / a_{ii}
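For context, the update above is the standard Jacobi splitting of A into its diagonal and off-diagonal parts (a textbook identity, not something stated on the slide):

A = D + R, \quad D = \mathrm{diag}(a_{11}, \ldots, a_{NN})
A x = b \;\Rightarrow\; D x = b - R x \;\Rightarrow\; x^{(k+1)} = D^{-1}\left( b - R\, x^{(k)} \right)

Componentwise this is exactly the new_x[i] = (b[i] - sum)/a[i][i] line in the code on the following slides.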

Page 8:

JACOBI ITERATIVE METHOD Sequential Code

/* initial guess: start from the right-hand side */
for(i = 0; i < N; i++){
    x[i] = b[i];
}

/* one sweep: new_x[i] = (b[i] - sum of a[i][j]*x[j], j != i) / a[i][i] */
for(i = 0; i < N; i++){
    sum = 0.0;
    for(j = 0; j < N; j++){
        if(i != j){
            sum += a[i][j] * x[j];
            new_x[i] = (b[i] - sum) / a[i][i];
        }
    }
}

/* copy the new iterate back into x */
for(i = 0; i < N; i++)
    x[i] = new_x[i];

[Figure: sequential execution time (secs) vs. space size (N), for N = 128 to 768]
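The slide shows a single sweep without declarations or the outer iteration loop; a minimal self-contained sketch is given below (the problem size, iteration count, and diagonally dominant test matrix are my own illustrative choices, not taken from the slides):

/* Hypothetical stand-alone Jacobi solver: A x = b, fixed number of sweeps */
#include <stdio.h>

#define N        768
#define MAX_ITER 100

static double a[N][N], b[N], x[N], new_x[N];

int main(void)
{
    int i, j, k;
    double sum;

    /* diagonally dominant test system so that Jacobi converges */
    for(i = 0; i < N; i++){
        b[i] = 1.0;
        for(j = 0; j < N; j++)
            a[i][j] = (i == j) ? 2.0 * N : 1.0;
    }

    for(i = 0; i < N; i++)          /* initial guess */
        x[i] = b[i];

    for(k = 0; k < MAX_ITER; k++){
        for(i = 0; i < N; i++){
            sum = 0.0;
            for(j = 0; j < N; j++)
                if(i != j)
                    sum += a[i][j] * x[j];
            new_x[i] = (b[i] - sum) / a[i][i];   /* one division per row */
        }
        for(i = 0; i < N; i++)
            x[i] = new_x[i];
    }

    printf("x[0] = %f\n", x[0]);
    return 0;
}

In this sketch the division by a[i][i] is done once per row, after the inner sum is complete, rather than inside the inner loop as on the slide; both forms give the same result, but this one avoids redundant divisions.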

Page 9:

JACOBI ITERATIVE METHOD OpenMP Code

#pragma omp parallel private(k, i, j, sum)
{
    for(k = 0; k < MAX_ITER; k++){
        #pragma omp for
        for(i = 0; i < N; i++){
            sum = 0.0;
            for(j = 0; j < N; j++){
                if(i != j){
                    sum += a[i][j] * x[j];
                    new_x[i] = (b[i] - sum) / a[i][i];
                }
            }
        }
        #pragma omp for
        for(i = 0; i < N; i++)
            x[i] = new_x[i];
    }
}
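The performance results that follow compare a "barrier" and a "nowait" variant, but only the barrier code is shown. Presumably the nowait variant adds a nowait clause to the first worksharing loop, along the lines of this sketch (my reconstruction, not the original source):

#pragma omp parallel private(k, i, j, sum)
{
    for(k = 0; k < MAX_ITER; k++){
        #pragma omp for nowait          /* threads skip the implicit barrier here */
        for(i = 0; i < N; i++){
            sum = 0.0;
            for(j = 0; j < N; j++)
                if(i != j)
                    sum += a[i][j] * x[j];
            new_x[i] = (b[i] - sum) / a[i][i];
        }
        #pragma omp for
        for(i = 0; i < N; i++)
            x[i] = new_x[i];
    }
}

Removing the barrier lowers the exitBarT time (visible in the ompP nowait output a few slides later), but a fast thread can start copying new_x into x while a slower thread is still reading x in the first loop, so the iterates may differ slightly from the barrier version.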

Page 10:

JACOBI ITERATIVE METHOD OpenMP Performance

[Figures: speedup and overhead vs. space size (N = 128 to 768) for the OpenMP barrier and nowait variants, with 2, 4, and 8 cores]

Page 11:

JACOBI ITERATIVE METHOD ompP results (barrier)

R00002 jacobi_openmp.c (46-55) LOOP
 TID   execT  execC  bodyT  exitBarT  taskT
   0    0.09    100   0.07      0.01   0.00
   1    0.08    100   0.07      0.00   0.00
   2    0.08    100   0.07      0.01   0.00
   3    0.08    100   0.07      0.01   0.00
   4    0.08    100   0.07      0.01   0.00
   5    0.08    100   0.07      0.01   0.00
   6    0.08    100   0.07      0.01   0.00
   7    0.08    100   0.07      0.01   0.00
 SUM    0.65    800   0.59      0.06   0.00

R00003 jacobi_openmp.c (56-58) LOOP
 TID   execT  execC  bodyT  exitBarT  taskT
   0    0.00    100   0.00      0.00   0.00
   1    0.00    100   0.00      0.00   0.00
   2    0.00    100   0.00      0.00   0.00
   3    0.00    100   0.00      0.00   0.00
   4    0.00    100   0.00      0.00   0.00
   5    0.00    100   0.00      0.00   0.00
   6    0.00    100   0.00      0.00   0.00
   7    0.00    100   0.00      0.00   0.00
 SUM    0.01    800   0.00      0.01   0.00

Page 12:

JACOBI ITERATIVE METHOD ompP results (nowait)

R00002 jacobi_openmp.c (43-52) LOOP
 TID   execT  execC  bodyT  exitBarT  taskT
   0    0.08    100   0.08      0.00   0.00
   1    0.08    100   0.08      0.00   0.00
   2    0.08    100   0.08      0.00   0.00
   3    0.08    100   0.08      0.00   0.00
   4    0.08    100   0.08      0.00   0.00
   5    0.08    100   0.08      0.00   0.00
   6    0.08    100   0.08      0.00   0.00
   7    0.08    100   0.08      0.00   0.00
 SUM    0.63    800   0.63      0.00   0.00

R00003 jacobi_openmp.c (53-55) LOOP
 TID   execT  execC  bodyT  exitBarT  taskT
   0    0.00    100   0.00      0.00   0.00
   1    0.00    100   0.00      0.00   0.00
   2    0.00    100   0.00      0.00   0.00
   3    0.00    100   0.00      0.00   0.00
   4    0.00    100   0.00      0.00   0.00
   5    0.00    100   0.00      0.00   0.00
   6    0.00    100   0.00      0.00   0.00
   7    0.00    100   0.00      0.00   0.00
 SUM    0.00    800   0.00      0.00   0.00

Page 13:

JACOBI ITERATIVE METHOD MPI Code

/* distribute N/P rows of a to each rank; broadcast the initial x (= b) */
MPI_Scatter(a, N * N/P, MPI_DOUBLE, apart, N * N/P, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(x, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

/* local copy of this rank's slice of the right-hand side */
for(i = myrank*N/P, k = 0; k < N/P; i++, k++)
    bpart[k] = x[i];

for(k = 0; k < MAX_ITER; k++){
    for(i = 0; i < N/P; i++){
        sum = 0.0;
        for(j = 0; j < N; j++){
            index = i + ((N/P) * myrank);   /* global row index */
            if(index != j){
                sum += apart[i][j] * x[j];
                new_x[i] = (bpart[i] - sum) / apart[i][index];
            }
        }
    }
    /* exchange the locally updated entries so every rank has the full x */
    MPI_Allgather(new_x, N/P, MPI_DOUBLE, x, N/P, MPI_DOUBLE, MPI_COMM_WORLD);
}
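The fragment assumes the usual MPI boilerplate around it; a minimal sketch of the surrounding setup (the names myrank and P are taken from the slide, the rest is assumed) could be:

#include <mpi.h>

int main(int argc, char **argv)
{
    int myrank, P;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* this rank's id */
    MPI_Comm_size(MPI_COMM_WORLD, &P);        /* number of ranks */

    /* rank 0 allocates and fills a, b, x; every rank allocates its
       N/P-row slices apart and bpart plus the buffers x and new_x */

    /* ... scatter, iterate, allgather as on the slide ... */

    MPI_Finalize();
    return 0;
}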

Page 14:

JACOBI ITERATIVE METHOD MPI Performance

[Figures: speedup and max MPI-time-to-app-time ratio (%) vs. space size (N = 128 to 768) for MPI on 1, 2, and 4 nodes]

Page 15:

JACOBI ITERATIVE METHOD mpiP results

@--- Aggregate Time (top twenty, descending, milliseconds) ---
Call        Site   Time     App%   MPI%   COV
Allgather   1      60.1     6.24   19.16  0.00
Allgather   2      58.8     6.11   18.77  0.00
Allgather   3      57.3     5.96   18.29  0.00
Scatter     4      34.6     3.59   11.03  0.00
Scatter     3      31.8     3.30   10.14  0.00
Scatter     1      30.1     3.13    9.61  0.00
Scatter     2      27       2.81    8.62  0.00
Bcast       2      7.05     0.73    2.25  0.00
Allgather   4      4.33     0.45    1.38  0.00
Bcast       3      2.25     0.23    0.72  0.00
Bcast       1      0.083    0.01    0.03  0.00
Bcast       4      0.029    0.00    0.01  0.00

Page 16:

JACOBI ITERATIVE METHOD MPI+OpenMP Code

MPI_Scatter(a, N * N/P, MPI_DOUBLE, apart, N * N/P, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(x, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
for(i = myrank*N/P, k = 0; k < N/P; i++, k++)
    bpart[k] = x[i];

omp_set_num_threads(T);
#pragma omp parallel private(k, i, j, index, sum)
{
    for(k = 0; k < MAX_ITER; k++){
        #pragma omp for
        for(i = 0; i < N/P; i++){
            sum = 0.0;
            for(j = 0; j < N; j++){
                index = i + ((N/P) * myrank);
                if(index != j){
                    sum += apart[i][j] * x[j];
                    new_x[i] = (bpart[i] - sum) / apart[i][index];
                }
            }
        }
        #pragma omp master
        {
            MPI_Allgather(new_x, N/P, MPI_DOUBLE, x, N/P, MPI_DOUBLE, MPI_COMM_WORLD);
        }
    }
}
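One detail worth flagging (my observation, not from the slides): #pragma omp master has no implied barrier, so non-master threads can enter the next k iteration and read x before the Allgather has refreshed it. The ADI hybrid version later in the deck places an explicit barrier after its MPI_Gather master region; the equivalent guard here would look like:

        #pragma omp master
        {
            MPI_Allgather(new_x, N/P, MPI_DOUBLE, x, N/P, MPI_DOUBLE, MPI_COMM_WORLD);
        }
        #pragma omp barrier   /* every thread waits for the refreshed x */

The barrier adds synchronization cost, but it removes the dependence on thread timing for correctness of the iterates.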

Page 17:

JACOBI ITERATIVE METHOD MPI+OpenMP Performance

[Figures: speedup, overhead, and max MPI-time-to-app-time ratio (%) vs. space size (N = 128 to 768) for MPI+OpenMP on 1, 2, and 4 nodes]

Page 18:

JACOBI ITERATIVE METHOD ompP results (MPI+OpenMP)

R00002 jacobi_mpi_openmp.c (55-65) LOOP
 TID   execT  execC  bodyT  exitBarT  taskT
   0    0.03    100   0.02      0.01   0.00
   1    0.24    100   0.02      0.23   0.00
   2    0.24    100   0.02      0.22   0.00
   3    0.24    100   0.02      0.22   0.00
   4    0.24    100   0.02      0.22   0.00
   5    0.24    100   0.02      0.22   0.00
   6    0.24    100   0.02      0.22   0.00
   7    0.24    100   0.02      0.22   0.00
 SUM    1.72    800   0.15      1.56   0.00

R00003 jacobi_mpi_openmp.c (67-70) MASTER
 TID   execT  execC
   0    0.22    100
 SUM    0.22    100

Page 19:

JACOBI ITERATIVE METHOD mpiP results (MPI+OpenMP)

@--- Aggregate Time (top twenty, descending, milliseconds) ---
Call        Site   Time     App%   MPI%   COV
Scatter     8      34.7     9.62   14.11  0.00
Allgather   1      32.6     9.05   13.28  0.00
Scatter     6      31.3     8.70   12.76  0.00
Scatter     2      30.2     8.39   12.31  0.00
Allgather   3      29.9     8.30   12.18  0.00
Allgather   5      27.6     7.67   11.25  0.00
Scatter     4      27.1     7.51   11.02  0.00
Allgather   7      22.1     6.14    9.00  0.00
Bcast       4      7.12     1.98    2.90  0.00
Bcast       6      2.81     0.78    1.14  0.00
Bcast       2      0.09     0.02    0.04  0.00
Bcast       8      0.033    0.01    0.01  0.00

Page 20:

ADI (Alternating Direction Integration)

Forward sweep updates along rows (the column sweep is symmetric):

x_{i,j} \leftarrow x_{i,j} - x_{i,j-1}\, a_{i,j} / b_{i,j-1}
b_{i,j} \leftarrow b_{i,j} - a_{i,j}^{2} / b_{i,j-1}
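Each sweep in the code on the following slides is, in effect, the forward elimination and back substitution of a symmetric tridiagonal solve along one grid line (this reading is mine, inferred from the loops, not stated on the slide):

b'_{j} = b_{j} - a_{j}^{2} / b'_{j-1}, \qquad
x'_{j} = x_{j} - a_{j}\, x'_{j-1} / b'_{j-1}, \qquad
x_{j} = \left( x'_{j} - a_{j+1}\, x_{j+1} \right) / b'_{j}, \quad j = N-2, \ldots, 2

where primes denote values already overwritten in place by the forward pass.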

Page 21:

ADI Sequential Code

/* ADI forward & backward sweep along rows */
for (i = 0; i < N; i++){
    for (j = 1; j < N; j++){
        x[i][j] = x[i][j] - x[i][j-1]*a[i][j]/b[i][j-1];
        b[i][j] = b[i][j] - a[i][j]*a[i][j]/b[i][j-1];
    }
    x[i][N-1] = x[i][N-1]/b[i][N-1];
}
for (i = 0; i < N; i++)
    for (j = N-2; j > 1; j--)
        x[i][j] = (x[i][j] - a[i][j+1]*x[i][j+1])/b[i][j];

/* ADI forward & backward sweep along columns */
for (j = 0; j < N; j++){
    for (i = 1; i < N; i++){
        x[i][j] = x[i][j] - x[i-1][j]*a[i][j]/b[i-1][j];
        b[i][j] = b[i][j] - a[i][j]*a[i][j]/b[i-1][j];
    }
    x[N-1][j] = x[N-1][j]/b[N-1][j];
}
for (j = 0; j < N; j++)
    for (i = N-2; i > 1; i--)
        x[i][j] = (x[i][j] - a[i+1][j]*x[i+1][j])/b[i][j];

[Figure: sequential execution time (secs) vs. space size (N = 128 to 768)]

Page 22:

ADI OpenMP Code

#pragma omp parallel private(iter)
{
    for(iter = 1; iter <= MAXITER; iter++){
        /* ADI forward & backward sweep along rows */
        #pragma omp for private(i,j) nowait
        for (i = 0; i < N; i++){
            for (j = 1; j < N; j++){
                x[i][j] = x[i][j] - x[i][j-1]*a[i][j]/b[i][j-1];
                b[i][j] = b[i][j] - a[i][j]*a[i][j]/b[i][j-1];
            }
            x[i][N-1] = x[i][N-1]/b[i][N-1];
        }
        #pragma omp for private(i,j)
        for (i = 0; i < N; i++)
            for (j = N-2; j > 1; j--)
                x[i][j] = (x[i][j] - a[i][j+1]*x[i][j+1])/b[i][j];

        /* ADI forward & backward sweep along columns */
        #pragma omp for private(i,j) nowait
        for (j = 0; j < N; j++){
            for (i = 1; i < N; i++){
                x[i][j] = x[i][j] - x[i-1][j]*a[i][j]/b[i-1][j];
                b[i][j] = b[i][j] - a[i][j]*a[i][j]/b[i-1][j];
            }
            x[N-1][j] = x[N-1][j]/b[N-1][j];
        }
        #pragma omp for private(i,j)
        for (j = 0; j < N; j++)
            for (i = N-2; i > 1; i--)
                x[i][j] = (x[i][j] - a[i+1][j]*x[i+1][j])/b[i][j];
    }
}

Page 23:

ADI OpenMP Performance

[Figures: speedup and overhead vs. space size (N = 128 to 768) for OpenMP with 2, 4, and 8 cores]

Page 24:

ADI ompP results

R00002 adi_openmp.c (43-50) LOOP
 TID   execT  execC  bodyT  exitBarT  taskT
   0    0.18    100   0.18      0.00   0.00
   1    0.18    100   0.18      0.00   0.00
   2    0.18    100   0.18      0.00   0.00
   3    0.18    100   0.18      0.00   0.00
   4    0.18    100   0.18      0.00   0.00
   5    0.18    100   0.18      0.00   0.00
   6    0.18    100   0.18      0.00   0.00
   7    0.18    100   0.18      0.00   0.00
 SUM    1.47    800   1.47      0.00   0.00

R00003 adi_openmp.c (52-57) LOOP
 TID   execT  execC  bodyT  exitBarT  taskT
   0    0.11    100   0.10      0.01   0.00
   1    0.11    100   0.10      0.01   0.00
   2    0.11    100   0.10      0.01   0.00
   3    0.10    100   0.10      0.00   0.00
   4    0.11    100   0.10      0.01   0.00
   5    0.10    100   0.10      0.01   0.00
   6    0.10    100   0.10      0.01   0.00
   7    0.10    100   0.10      0.00   0.00
 SUM    0.84    800   0.78      0.06   0.00

R00004 adi_openmp.c (61-68) LOOP
 TID   execT  execC  bodyT  exitBarT  taskT
   0    0.38    100   0.38      0.00   0.00
   1    0.31    100   0.31      0.00   0.00
   2    0.35    100   0.35      0.00   0.00
   3    0.29    100   0.29      0.00   0.00
   4    0.35    100   0.35      0.00   0.00
   5    0.36    100   0.36      0.00   0.00
   6    0.36    100   0.36      0.00   0.00
   7    0.37    100   0.37      0.00   0.00
 SUM    2.77    800   2.77      0.00   0.00

R00005 adi_openmp.c (70-75) LOOP
 TID   execT  execC  bodyT  exitBarT  taskT
   0    0.16    100   0.16      0.00   0.00
   1    0.23    100   0.15      0.07   0.00
   2    0.19    100   0.14      0.05   0.00
   3    0.25    100   0.16      0.09   0.00
   4    0.19    100   0.14      0.05   0.00
   5    0.18    100   0.17      0.01   0.00
   6    0.18    100   0.17      0.01   0.00
   7    0.17    100   0.17      0.01   0.00
 SUM    1.55    800   1.26      0.29   0.00

Page 25:

ADI MPI Code

MPI_Bcast(a, N * N, MPI_FLOAT, 0, MPI_COMM_WORLD);
MPI_Scatter(x, N * N/P, MPI_FLOAT, xpart, N * N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);
MPI_Scatter(b, N * N/P, MPI_FLOAT, bpart, N * N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);

/* local copy of this rank's rows of a */
for(i = myrank*(N/P), k = 0; k < N/P; i++, k++)
    for(j = 0; j < N; j++)
        apart[k][j] = a[i][j];

for(iter = 1; iter <= 2*MAXITER; iter++){
    /* ADI forward & backward sweep along rows */
    for (i = 0; i < N/P; i++){
        for (j = 1; j < N; j++){
            xpart[i][j] = xpart[i][j] - xpart[i][j-1]*apart[i][j]/bpart[i][j-1];
            bpart[i][j] = bpart[i][j] - apart[i][j]*apart[i][j]/bpart[i][j-1];
        }
        xpart[i][N-1] = xpart[i][N-1]/bpart[i][N-1];
    }
    for (i = 0; i < N/P; i++){
        for (j = N-2; j > 1; j--)
            xpart[i][j] = (xpart[i][j] - apart[i][j+1]*xpart[i][j+1])/bpart[i][j];
    }

Page 26:

ADI MPI Code (continued)

    /* gather the row-swept data, transpose, and redistribute for the column sweep */
    MPI_Gather(xpart, N*N/P, MPI_FLOAT, x, N*N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Gather(bpart, N*N/P, MPI_FLOAT, b, N*N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* transpose matrices */
    trans(x, N, N);
    trans(b, N, N);
    trans(a, N, N);

    MPI_Scatter(x, N * N/P, MPI_FLOAT, xpart, N * N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, N * N/P, MPI_FLOAT, bpart, N * N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);

    for(i = myrank*(N/P), k = 0; k < N/P; i++, k++)
        for(j = 0; j < N; j++)
            apart[k][j] = a[i][j];
}
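The trans() helper is called but never shown on the slides; a plausible in-place transpose matching the trans(m, N, N) call (the implementation is assumed, and it presumes the matrices are declared as float m[N][N] with a compile-time N) is:

/* hypothetical helper: in-place transpose of an N x N matrix of floats */
void trans(float m[N][N], int rows, int cols)
{
    int i, j;
    float tmp;

    /* rows == cols == N here; swap each element with its mirror */
    for (i = 0; i < rows; i++){
        for (j = i + 1; j < cols; j++){
            tmp = m[i][j];
            m[i][j] = m[j][i];
            m[j][i] = tmp;
        }
    }
}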

Page 27:

ADI MPI Performance

[Figures: speedup and max MPI-time-to-app-time ratio (%) vs. space size (N = 128 to 768) for MPI on 1, 2, and 4 nodes]

Page 28:

ADI mpiP results

@--- Aggregate Time (top twenty, descending, milliseconds) ---
Call      Site   Time       App%   MPI%   COV
Gather    1      8.63e+04   22.83  23.54  0.00
Gather    3      6.29e+04   16.63  17.15  0.00
Gather    2      6.08e+04   16.10  16.60  0.00
Gather    4      5.83e+04   15.43  15.91  0.00
Scatter   4      3.31e+04    8.76   9.03  0.00
Scatter   2      3.08e+04    8.14   8.39  0.00
Scatter   3      2.87e+04    7.58   7.81  0.00
Scatter   1      5.53e+03    1.46   1.51  0.00
Bcast     2      50.8        0.01   0.01  0.00
Bcast     4      50.8        0.01   0.01  0.00
Bcast     3      49.5        0.01   0.01  0.00
Bcast     1      40.4        0.01   0.01  0.00
Reduce    1      2.57        0.00   0.00  0.00
Reduce    3      0.259       0.00   0.00  0.00
Reduce    2      0.056       0.00   0.00  0.00
Reduce    4      0.052       0.00   0.00  0.00

Page 29:

ADI MPI+OpenMP Code

MPI_Bcast(a, N * N, MPI_FLOAT, 0, MPI_COMM_WORLD);
MPI_Scatter(x, N * N/P, MPI_FLOAT, xpart, N * N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);
MPI_Scatter(b, N * N/P, MPI_FLOAT, bpart, N * N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);

omp_set_num_threads(T);

#pragma omp parallel private(iter)
{
    int id, sindex, eindex;
    int m, n;
    id = omp_get_thread_num();
    sindex = id * node_rows/T;          /* this thread's first local row */
    eindex = sindex + node_rows/T;      /* one past its last local row  */
    int l = myrank*(N/P);               /* this rank's first global row */

    /* each thread copies its share of this rank's rows of a */
    for(m = sindex; m < eindex; m++)
        for(n = 0; n < N; n++)
            apart[m][n] = a[l+m][n];

Page 30:

ADI MPI+OpenMP Code (continued)

    for(iter = 1; iter <= 2*MAXITER; iter++){
        /* ADI forward & backward sweep along rows */
        #pragma omp for private(i,j) nowait
        for (i = 0; i < N/P; i++){
            for (j = 1; j < N; j++){
                xpart[i][j] = xpart[i][j] - xpart[i][j-1]*apart[i][j]/bpart[i][j-1];
                bpart[i][j] = bpart[i][j] - apart[i][j]*apart[i][j]/bpart[i][j-1];
            }
            xpart[i][N-1] = xpart[i][N-1]/bpart[i][N-1];
        }

        #pragma omp for private(i,j)
        for (i = 0; i < N/P; i++)
            for (j = N-2; j > 1; j--)
                xpart[i][j] = (xpart[i][j] - apart[i][j+1]*xpart[i][j+1])/bpart[i][j];

        #pragma omp master
        {
            MPI_Gather(xpart, N*N/P, MPI_FLOAT, x, N*N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);
            MPI_Gather(bpart, N*N/P, MPI_FLOAT, b, N*N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);
        }

        #pragma omp barrier

Page 31:

ADI MPI+OpenMP Code (continued)

        /* transpose x, b, a in parallel, one matrix per section */
        #pragma omp sections
        {
            #pragma omp section
            { trans(x, N, N); }
            #pragma omp section
            { trans(b, N, N); }
            #pragma omp section
            { trans(a, N, N); }
        }
        #pragma omp barrier

        #pragma omp master
        {
            MPI_Scatter(x, N * N/P, MPI_FLOAT, xpart, N * N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);
            MPI_Scatter(b, N * N/P, MPI_FLOAT, bpart, N * N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);
        }

        l = myrank*(N/P);
        for(m = sindex; m < eindex; m++)
            for(n = 0; n < N; n++)
                apart[m][n] = a[l+m][n];
    }                        /* end of iter loop */
    #pragma omp barrier
}                            /* end of parallel region */

Page 32:

ADI MPI+OpenMP Performance

[Figures: speedup, overhead, and max MPI-time-to-app-time ratio (%) vs. space size (N = 128 to 768) for MPI+OpenMP on 1, 2, and 4 nodes]

Page 33:

ADI ompP results

R00002 adi_mpi_scatter_openmp.c (89-96) LOOP
 TID   execT   execC  bodyT  exitBarT  taskT
   0    0.05     200   0.05      0.00   0.00
   1    0.05     200   0.05      0.00   0.00
   2    0.08     200   0.08      0.00   0.00
   3    0.08     200   0.08      0.00   0.00
   4    0.08     200   0.08      0.00   0.00
   5    0.08     200   0.08      0.00   0.00
   6    0.08     200   0.08      0.00   0.00
   7    0.08     200   0.08      0.00   0.00
 SUM    0.58    1600   0.58      0.00   0.00

R00003 adi_mpi_scatter_openmp.c (99-104) LOOP
 TID   execT   execC  bodyT  exitBarT  taskT
   0    0.06     200   0.05      0.01   0.00
   1   34.23     200   0.05     34.18   0.00
   2   34.22     200   0.05     34.17   0.00
   3   34.22     200   0.05     34.17   0.00
   4   34.21     200   0.05     34.16   0.00
   5   34.20     200   0.05     34.15   0.00
   6   34.21     200   0.05     34.16   0.00
   7   34.20     200   0.05     34.15   0.00
 SUM  239.54    1600   0.39    239.14   0.00

Page 34:

ADI ompP results

R00005 adi_mpi_scatter_openmp.c (113) BARRIER
 TID   execT   execC  taskT
   0    0.00     200   0.00
   1   64.29     200   0.00
   2   64.29     200   0.00
   3   64.29     200   0.00
   4   64.29     200   0.00
   5   64.29     200   0.00
   6   64.29     200   0.00
   7   64.29     200   0.00
 SUM  450.02    1600   0.00

R00004 adi_mpi_scatter_openmp.c (106-111) MASTER
 TID   execT  execC
   0   64.28    200
 SUM   64.28    200

R00006 adi_mpi_scatter_openmp.c (116-130) SECTIONS
 TID   execT  execC  sectT  sectC  exitBarT  mgmtT  taskT
   0    0.85    200   0.85    200      0.00   0.00   0.00
   1    0.85    200   0.83    200      0.02   0.00   0.00
   2    0.85    200   0.44    200      0.41   0.00   0.00
   3    0.85    200   0.00      0      0.85   0.00   0.00
   4    0.85    200   0.00      0      0.85   0.00   0.00
   5    0.85    200   0.00      0      0.85   0.00   0.00
   6    0.85    200   0.00      0      0.85   0.00   0.00
   7    0.85    200   0.00      0      0.85   0.00   0.00
 SUM    6.80   1600   2.12    600      4.67   0.01   0.00

Page 35:

ADI ompP results

R00007 adi_mpi_scatter_openmp.c (132) BARRIER
 TID   execT  execC  taskT
   0    0.00    200   0.00
   1    0.00    200   0.00
   2    0.00    200   0.00
   3    0.00    200   0.00
   4    0.00    200   0.00
   5    0.00    200   0.00
   6    0.00    200   0.00
   7    0.00    200   0.00
 SUM    0.01   1600   0.00

R00008 adi_mpi_scatter_openmp.c (134-138) MASTER
 TID   execT  execC
   0   34.46    200
 SUM   34.46    200

R00009 adi_mpi_scatter_openmp.c (149) BARRIER
 TID   execT  execC  taskT
   0    0.00      1   0.00
   1    0.28      1   0.00
   2    0.28      1   0.00
   3    0.28      1   0.00
   4    0.28      1   0.00
   5    0.28      1   0.00
   6    0.28      1   0.00
   7    0.28      1   0.00
 SUM    1.94      8   0.00

Page 36:

ADI mpiP results

@--- Aggregate Time (top twenty, descending, milliseconds) ---
Call      Site   Time       App%   MPI%   COV
Gather    2      8.98e+04   23.32  23.52  0.00
Gather    6      6.57e+04   17.05  17.19  0.00
Gather    8      6.45e+04   16.74  16.89  0.00
Gather    4      6.17e+04   16.03  16.16  0.00
Scatter   4      3.39e+04    8.79   8.87  0.00
Scatter   8      3.1e+04     8.06   8.13  0.00
Scatter   6      2.96e+04    7.68   7.75  0.00
Scatter   2      5.4e+03     1.40   1.41  0.00
Bcast     7      49.5        0.01   0.01  0.00
Bcast     3      49.3        0.01   0.01  0.00
Bcast     5      47.8        0.01   0.01  0.00
Bcast     1      40          0.01   0.01  0.00
Scatter   1      30.5        0.01   0.01  0.00
Scatter   5      30.3        0.01   0.01  0.00
Scatter   7      30.3        0.01   0.01  0.00
Scatter   3      28.8        0.01   0.01  0.00
Reduce    1      1.8         0.00   0.00  0.00
Reduce    5      0.062       0.00   0.00  0.00
Reduce    3      0.049       0.00   0.00  0.00
Reduce    7      0.049       0.00   0.00  0.00

Page 37:

THANKS

Q & A
Any suggestions?