Towards Structured Parallel Programming
Antonio Dorta, Jesús A. González, Casiano Rodríguez and Francisco de Sande ([email protected])
University of La Laguna, Tenerife, Canary Islands, Spain
Rome, September 19, 2002
Outline

- Skeletons
  - Basic Skeletons
- Goals
- Related Work
- The OTOSP Model
  - Examples
- The llCoMP compiler
- Example(s)
- Computational results
- Conclusions
- Future Work
Skeletons

- Skeletons are software components that capture the common patterns of most parallel programs
- The goal of the skeleton approach is to develop a viable, formally well-founded methodology for parallel programming
- Programs are built from a restricted set of structures, avoiding the send/receive mechanism
- The analogy in sequential programming is Dijkstra's structured programming, which embraced the for, while, repeat, etc. constructs and rejected the unstructured goto
Skeleton Characteristics

- A good skeleton should be a piece of code that is:
  - Carefully designed
  - Reusable
  - Parametrised
  - Shipped with pre-packaged implementations for different architectures
- These codes are named skeletons because they have structure but lack detail
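To make "structure without detail" concrete, here is a minimal C sketch (our illustration, not code from the talk; the names farm, task_t and result_t are hypothetical): the coordination structure is written once, and the application-specific detail arrives as a function parameter.

    #include <stddef.h>

    typedef struct { int id; double data;  } task_t;    /* hypothetical types */
    typedef struct { int id; double value; } result_t;

    /* The skeleton fixes the pattern; the user supplies only the per-task
       detail through `worker`.  A parallel implementation would keep this
       interface unchanged and replace the loop body. */
    void farm(result_t (*worker)(task_t),
              const task_t *tasks, size_t ntasks, result_t *results)
    {
        for (size_t i = 0; i < ntasks; i++)
            results[i] = worker(tasks[i]);   /* tasks are independent */
    }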
Basic Skeletons

- Although there are many flavours of parallel skeletons, these are clearly the most important:
  - FARM / WorkQueuing
  - PIPE
  - MAP / forall
  - REDUCE, SCAN
- Susana Pelagatti, Structured Development of Parallel Programs, Taylor & Francis, 1997
The Farm / Workqueuing skeleton

- Models a set of identical workers computing in parallel a stream of independent tasks

[Figure: the Master holds a TaskList and a ResultList; Tasks flow out to the Workers and Results flow back]
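For contrast with the annotated style shown later, this is a hand-coded C+MPI sketch of the pattern the Farm abstracts (our illustration of the send/receive mechanism the skeleton hides; task count, tag names and the worker body are made up): rank 0 deals task indices to workers on demand and collects their results.

    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS    100
    #define TAG_READY 0
    #define TAG_TASK  1
    #define TAG_STOP  2

    static double run_task(int t) { return (double)t * t; }  /* stand-in body */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                  /* master: TaskList and ResultList */
            int next = 0, active = size - 1;
            while (active > 0) {
                double res;
                MPI_Status st;
                MPI_Recv(&res, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_TASK)          /* a result, not a hello */
                    printf("got %g from worker %d\n", res, st.MPI_SOURCE);
                if (next < NTASKS) {                 /* deal the next task */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK,
                             MPI_COMM_WORLD);
                    next++;
                } else {                             /* no tasks left: stop it */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    active--;
                }
            }
        } else {                          /* worker: request, compute, return */
            double res = 0.0;
            MPI_Send(&res, 1, MPI_DOUBLE, 0, TAG_READY, MPI_COMM_WORLD);
            for (;;) {
                int t;
                MPI_Status st;
                MPI_Recv(&t, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP)
                    break;
                res = run_task(t);
                MPI_Send(&res, 1, MPI_DOUBLE, 0, TAG_TASK, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }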
The Pipe skeleton

- Exploits parallelism in the evaluation of a cascade of stages

[Figure: N data items streaming through Stage0, Stage1, ..., StageNp-2, StageNp-1]
The Map / forall skeleton

- Models independent data-parallel computations in which the same function is applied to all the elements of a data array

[Figure: elements 0, 1, 2, ..., P-1 mapped independently to f(0), f(1), f(2), ..., f(P-1)]
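On shared memory this pattern is exactly an OpenMP parallel loop; a minimal sketch (our example, with a made-up element function f):

    static double f(double x) { return 2.0 * x + 1.0; }  /* any pure function */

    void map(double *a, int n)
    {
        /* the same function applied independently to every element */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] = f(a[i]);
    }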
The Reduce / Scan skeleton

- Implements parallel reduction and prefix computations over the elements of an array by means of an associative and commutative operator ⊕

[Figure: elements d[0], d[1], ..., d[i], ..., d[Np-2], d[Np-1] combined as ⊕ d[i], i = 1, ..., Np]
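On distributed memory both operations map directly onto MPI collectives; a minimal sketch assuming one value per processor and + as the operator (our illustration):

    #include <mpi.h>

    /* Combine one value per processor with the associative, commutative
       operator +: the return value is the global total on every processor,
       and *prefix receives the inclusive prefix sum up to this rank. */
    double reduce_and_scan(double x, double *prefix)
    {
        double total;
        MPI_Allreduce(&x, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        MPI_Scan(&x, prefix, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return total;
    }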
Our goals

- We are working on the design and implementation of a language giving support to these basic skeletons
- The language follows the OpenMP syntax wherever OpenMP provides one for a skeleton
- Our constructs extend the OpenMP directives with new annotations where necessary
- The language should allow the efficient nested combination of any of the basic skeletons
- We want the compiler to produce code for both shared and distributed memory architectures
Related Work

- Skeletons:
  - The P3L project (Prof. S. Pelagatti): a "structured" parallel programming language
  - Project eSkel (Prof. M. Cole): a library-based approach
  - Project COFFE (Prof. S. Gorlatch): intensive and pragmatic use of collective operations
  - The skeleton library (Prof. H. Kuchen)
Related Work

- Nesting forall clauses:
  - The NANOS Project. Ayguadé E., Martorell X., Labarta J., González M. and Navarro N., "Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study", Proc. of the 1999 International Conference on Parallel Processing, Aizu (Japan), September 1999
  - The OMNI Project. Yoshizumi Tanaka, Kenjiro Taura, Mitsuhisa Sato and Akinori Yonezawa, "Performance Evaluation of OpenMP Applications with Nested Parallelism", Languages, Compilers, and Run-Time Systems for Scalable Computers, pp. 100-112, 2000
- The FARM / Workqueuing skeleton:
  - KAI-Intel Group (Shah, Petersen, Throop), "Flexible control structures for parallelism in OpenMP", EWOMP 1999
The OTOSP model

- The One Thread is One Set of Processors (OTOSP) model facilitates the interpretation of how we intend to map the skeletons onto distributed memory machines
An OTOSP computer

- Consider an (ideal) OTOSP computer:
  - It is composed of an infinite number of processors connected through a network
  - Each processor is a RAM machine with its own private memory; the only difference among them is an internal register containing an integer, the NAME (or number) of the processor
  - The processors are organized in sets
  - The initial set is composed of all the processors in the machine
  - At any time, the memory state of all the processors in the same set is identical
  - An OTOSP computation assumes that all the processors have the same input data and the same program in memory
Example of computation

    #pragma omp parallel for
    #pragma llc result(ri + i, si[i])
    for (i = 1; i <= 3; i++) {
      ...
      #pragma omp parallel for
      #pragma llc result(rj + j, sj[j])
      for (j = 0; j <= i; j++) {
        rj[j] = J_function(i, j, &sj[j], ...);
      }
      ri[i] = I_function(i, &si[i], ...);
    }
[Figure: the initial set {0, 1, 2, ..., 23, ...} contains all the processors; at the outer loop it splits into one set per iteration: {0, 3, 6, 9, 12, 15, ...} for i=1, {1, 4, 7, 10, 13, 16, ...} for i=2 and {2, 5, 8, ..., 23, ...} for i=3]
Example

[Same code as on the previous slide. Figure: each set splits recursively over the inner loop — the set for i=1, {0, 3, 6, 9, 12, 15, ...}, splits again into a set for j=0 ({0, 6, 12, ...}) and a set for j=1 ({3, 9, 15, ...})]
The real situation

- In a real scenario, the number of processors is limited
- Two situations have to be considered:
  - The number of tasks is larger than the number of processors available in the current set: nT > nP
  - The number of processors in the current set is larger than the number of tasks: nP > nT
nT > nP

- In the case of more tasks than processors:

[Figure: sixteen tasks, numbered 0-15, to be computed by six processors P0-P5]
nT > nP

- Each processor has to compute several tasks:

[Figure: the sixteen tasks dealt out among the six processors P0-P5]

- Each processor constitutes a different set
- At the end of the computation, each processor sends its results to its partner processors (#pragma llc result); one possible assignment is sketched below
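A simple way to realize this splitting is a cyclic assignment of task indices to processors (our sketch of one policy, not necessarily the one llCoMP applies):

    /* Cyclic distribution for nT > nP: processor `rank` computes tasks
       rank, rank + nP, rank + 2*nP, ...; afterwards the results are
       exchanged among the partners, as the result clause describes. */
    void my_tasks(int rank, int nP, int nT, void (*compute_task)(int))
    {
        for (int t = rank; t < nT; t += nP)
            compute_task(t);
    }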
nT < nP

- In the case of more processors than tasks:

[Figure: four tasks, numbered 0-3, for six processors P0-P5]
nT < nP

- Several processors replicate the computation of the same task

[Figure: the four tasks replicated across the six processors P0-P5]

- All the processors replicating the same task are in the same set
- Each processor exchanges the corresponding results with its partners in the other sets; a possible mapping is sketched below
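A possible mapping of processors to replication sets (again our sketch, one policy among several):

    /* For nP > nT, group the processors into nT sets; every processor in
       the same set replicates the same task, so their memories stay
       identical.  E.g. nP = 6, nT = 4 gives sets {0,4}, {1,5}, {2}, {3}. */
    int my_set(int rank, int nT)
    {
        return rank % nT;
    }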
llCoMP

- llCoMP is our prototype compiler for the OTOSP model (ll stands for La Laguna)
- The llCoMPiler:
  - Transforms annotated C code into C + MPI calls
  - Is implemented using lex and yacc
  - Portability (of both the compiler and the generated code) is a design goal
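To give the flavour of the translation (a rough sketch of the idea, not actual llCoMP output): a parallel loop carrying a result clause can become a block-distributed loop followed by a collective exchange that restores identical memories inside the set.

    #include <mpi.h>

    /* Sketch: compute r[i] = work(i) for 0 <= i < n in blocks, then let
       every processor gather the full array.  Assumes r was allocated
       with size*chunk slots so the in-place allgather fits. */
    void translated_loop(double *r, int n, double (*work)(int))
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = (n + size - 1) / size;        /* block distribution */
        int lo = rank * chunk;
        int hi = (lo + chunk < n) ? lo + chunk : n;
        for (int i = lo; i < hi; i++)
            r[i] = work(i);

        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      r, chunk, MPI_DOUBLE, MPI_COMM_WORLD);
    }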
llCoMP

[Figure: compilation flow — source C code with pragma annotations → C preprocessor → lexical analysis → syntax analysis → intermediate code (C + MPI functions) → C compiler + MPI library → binary MPI code]
Computing π

    double t, pi = 0.0, w;
    long i, n = 100000000;
    w = 1.0 / n;
    ...
    #pragma omp parallel for reduction(+:pi) private(t)
    #pragma llc reduction_type(double)
    for (i = 0; i < n; i++) {
      t = (i + 0.5) * w;
      pi += 4.0 / (1.0 + t*t);
    }
    pi *= w;
    ...

π = ∫₀¹ 4/(1+x²) dx ≈ (1/N) Σ_{i=0}^{N-1} 4 / (1 + ((i+0.5)/N)²)
Computing π

[Same code as on the previous slide]

- When compiled with llCoMP, the loop iterations (tasks) are split among the processors
- The private clause is kept only for compatibility with OpenMP: in the OTOSP model, all storage is already private
- The OpenMP reduction clause implies a collective communication among all the processors in the set
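In C+MPI terms that collective is essentially an all-reduce; a hedged sketch of what the generated code amounts to:

    #include <mpi.h>

    /* Each processor accumulates a partial sum over its own iterations;
       the reduction then combines the partials and leaves the same total
       everywhere, keeping the memories of the set identical. */
    double combine_pi(double partial)
    {
        double pi;
        MPI_Allreduce(&partial, &pi, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return pi;
    }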
Computing π

- Speedup computing π on a SunFire 6800 (10^9 iterations)
Combining reduction and result clauses: Molecular Dynamics

    void compute(int np, int nd, double *box, vnd_t *pos, vnd_t *vel,
                 double mass, vnd_t *f, double *pot_p, double *kin_p) {
      double x, d, pot, kin;
      int i, j, k;
      vnd_t rij;

      pot = kin = 0.0;

      #pragma omp parallel for default(shared) private(i, j, k, rij, d) reduction(+: pot, kin)
      #pragma llc reduction_type(double, double)
      #pragma llc result(f[i], nd)
      for (i = 0; i < np; i++) {  /* energies and forces */
        for (j = 0; j < nd; j++)
          f[i][j] = 0.0;
        for (j = 0; j < np; j++) {
          if (i != j) {
            d = dist(nd, box, pos[i], pos[j], rij);
            pot = pot + 0.5 * v(d);
            for (k = 0; k < nd; k++) {
              f[i][k] = f[i][k] - rij[k] * dv(d) / d;
            }
          }
        }
        kin = kin + dotr8(nd, vel[i], vel[i]);
      }
      kin = kin * 0.5 * mass;
      *pot_p = pot;
      *kin_p = kin;
    }
Molecular Dynamics

- SGI Origin 3800
  - dim = 3
  - 8192 particles
  - simulation steps = 10
The Single Resource Allocation Problem (SRAP)

- M units of an indivisible resource and a set of N tasks
- fn(r) ≡ benefit obtained when r units of resource are allocated to task n

  maximize    Σ_{n=1}^{N} fn(rn)
  subject to  Σ_{n=1}^{N} rn = M
              rn ≥ 0 and integer,  n = 1, ..., N

- The dynamic programming recurrence:

  G[n][r] = max{ G[n-1][r-i] + fn(i) : 0 ≤ i ≤ r }
A sequential dynamic programming algorithm for the SRAP problem

    int srap(int N, int M, cost f, table G, table L) {
      int r, n, i, s, decision_i, temp, pos, chunksize, buffersize;

      for (n = 0; n < N; n++) {
        if (n == 0)
          for (r = 0; r <= M; r++) {
            G[0][r] = f(0, r);   /* f is non-decreasing */
          }
        else
          for (r = 0; r <= M; r++) {
            temp = G[n-1][r];
            pos = 0;
            for (i = 1; i <= r; i++) {
              decision_i = G[n-1][r-i] + f(n, i);
              if (decision_i > temp) {
                temp = decision_i;
                pos = i;
              }
            }
            G[n][r] = temp;
            L[n][r] = pos;
          }
      }
      return G[N-1][M];
    }
Using the PIPE skeleton for the SRAP problem

    int srap(int N, int M, cost f, table G, table L) {
      int r, n, i, s, decision_i, temp, pos, chunksize, buffersize;

      #pragma llc pipeline schedule(chunksize, buffersize)
      #pragma llc result (&G[n][0], M) (&L[n][0], M)
      for (n = 0; n < N; n++) {
        if (n == 0)
          for (r = 0; r <= M; r++) {
            G[0][r] = f(0, r);   /* f is non-decreasing */
            #pragma llc send (&G[0][r], 1)
          }
        else
          for (r = 0; r <= M; r++) {
            #pragma llc receive (&G[n-1][r], &s)
            temp = G[n-1][r];
            pos = 0;
            for (i = 1; i <= r; i++) {
              decision_i = G[n-1][r-i] + f(n, i);
              if (decision_i > temp) {
                temp = decision_i;
                pos = i;
              }
            }
            G[n][r] = temp;
            #pragma llc send (&G[n][r], 1)
            L[n][r] = pos;
          }
      }
      return G[N-1][M];
    }
The Pipe skeleton

- Let's look at the organization of the processors for the case nP > nT (more processors than stages in the pipe)

[Figure: the data items stream through stages Stage0, Stage1, Stage2, ..., StageN-1; the processors are dealt round-robin across the stages, so Stage0 is shared by processors 0, N, 2N, ..., Stage1 by 1, N+1, 2N+1, ..., and StageN-1 by N-1, 2N-1, ...]
Dependencies

- In the case of a PIPE skeleton, the tasks are not independent: there is a specific relationship among them

  G[n][r] = max{ G[n-1][r-i] + fn(i) : 0 ≤ i ≤ r }

[Figure: processor n-1 computes the column G[n-1][·] and feeds it to processor n, which consumes G[n-1][r] to produce G[n][r]]
The Single Resource Allocation Problem

- Cray T3E
  - Tasks: 350
  - Resource units: 4000
Nesting the skeletons: the FFT algorithm

    void FFT(Complex *A, Complex *a, Complex *W, unsigned N,
             unsigned stride, Complex *D) {
      Complex *B, *C;
      Complex Aux, *pW;
      unsigned i, n;

      if (N == 1) {
        A[0].re = a[0].re;
        A[0].im = a[0].im;
      }
      else {
        n = (N >> 1);
        B = D;
        C = D + n;

        #pragma omp parallel for
        #pragma llc result(D+i*n, n)
        for (i = 0; i <= 1; i++)
          FFT(D+i*n, a+i*stride, W, n, stride<<1, A+i*n);

        for (i = 0, pW = W; i < n; i++, pW += stride) {
          Aux.re = pW->re * C[i].re - pW->im * C[i].im;
          Aux.im = pW->re * C[i].im + pW->im * C[i].re;
          A[i].re = B[i].re + Aux.re;
          A[i].im = B[i].im + Aux.im;
          A[i+n].re = B[i].re - Aux.re;
          A[i+n].im = B[i].im - Aux.im;
        }
      }
    }
The FFT algorithm

A = (A[0], ..., A[N-1]) ∈ C^N
FFT(A) = B = (B[0], ..., B[N-1]) ∈ C^N

B[i] = Σ_{k=0}^{N-1} A[k]·w^{k·i},   w = e^{2πi/N} = cos(2π/N) + i·sin(2π/N)

Splitting into even- and odd-indexed elements:

B[i] = Σ_{k=0}^{N/2-1} A[2k]·(w²)^{k·i} + w^i · Σ_{k=0}^{N/2-1} A[2k+1]·(w²)^{k·i}

[Figure: the recursion tree — FFT(A[0], A[1], ..., A[N-1]) splits into FFTs over the even- and odd-indexed halves, and the subtrees are assigned to processor groups P0, P1, P2, P3]
The Fast Fourier Transform

- Cray T3E
  - Sizes: 64K, 120K, 256K, 512K, 1M
The Matrix product

    #pragma omp parallel for
    #pragma llc weight (1 << t)
    #pragma llc result(CC + t * m * m, m * m)
    for (t = 0; t <= tasks - 1; t++) {
      col = n << t;                        /* task t multiplies m x col blocks */
      A = AA + m * n * ((1 << t) - 1);
      B = BB + m * n * ((1 << t) - 1);
      C = CC + m * m * t;
      #pragma omp parallel for
      #pragma llc result(C + i * m, m)
      for (i = 0; i <= m - 1; i++) {
        for (j = 0; j < m; j++)
          for (C[i * m + j] = 0.0, k = 0; k < col; k++)
            C[i * m + j] += A[i * col + k] * B[k * m + j];
      }
    }
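The weight clause presumably gives the compiler the relative cost of each task (here task t costs on the order of 2^t), so that processors can be allotted proportionally when the outer loop is split. A hypothetical helper showing the arithmetic such a hint enables (our assumption about the clause's role, not documented llCoMP behaviour):

    /* Hypothetical: share nP processors among tasks in proportion to
       their weights -- the kind of decision a weight hint enables. */
    int procs_for_task(int t, const long *w, int ntasks, int nP)
    {
        long total = 0;
        for (int i = 0; i < ntasks; i++)
            total += w[i];
        int p = (int)((long)nP * w[t] / total);
        return p > 0 ? p : 1;   /* every task gets at least one processor */
    }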
Matrix products with different levels of parallelism

- SGI 3000
  - Each task is a matrix product
  - Exploiting different levels of parallelism
Current developments

- Algorithms:
  - Molecular Dynamics
  - Conjugate Gradient
  - NAS Embarrassingly Parallel
  - Mandelbrot Set
  - Matrix product
  - Quicksort
  - Fast Fourier Transform
  - Single Resource Allocation Problem
  - Knapsack Problem
- Architectures:
  - Cray T3E
  - Beowulf-type PC cluster
  - SGI Origin 3800
  - Sunfire 6800 SMP
Conclusions

- We have presented a proposal for a skeletal language that extends OpenMP with new constructs
- We have introduced the OTOSP abstract model as the basis for our implementation
- The model guarantees portability to any platform
- We have developed llCoMP, a prototype compiler for the language
- We have shown different examples of algorithms implemented following the ideas of the model
- We have presented computational results for these examples, obtained with the llCoMPiler on different architectures
- We believe these results justify further research and development of tools oriented to the OTOSP model
Future Work

- Add new features to the language
  - Improve the PIPE skeleton
    - Introducing different assignment policies
    - Controlling the number of processors assigned to each stage
    - Using buffers to produce 'tiled' code
    - Managing data sequences with unknown length
    - Managing data sequences with varying sizes
  - Work on the FARM skeleton
- Improve the prototype llCoMP compiler
  - Including type analysis
- Implement the OTOSP model using threads
Acknowledgments

- Edinburgh Parallel Computing Centre (EPCC)
- Centre Europeu de Parallelisme de Barcelona (CEPBA)
- Centre de Supercomputació de Catalunya (CESCA)
- Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas (CIEMAT)

- This research benefits from the support of the Secretaría de Estado de Universidades e Investigación (SEUI), project MaLLBa, TIC1999-0754-C03
- Also from the European Commission, through grant number HPRI-CT-1999-00026
Towards Structured Parallel Programming
Antonio Dorta, Jesús A. González, Casiano Rodríguez and Francisco de Sande ([email protected])
http://nereida.deioc.ull.es/llCoMP/