Towards Structured Parallel Programming
Antonio Dorta, Jesús A. González, Casiano Rodríguez and Francisco de Sande ([email protected])
University of La Laguna, Tenerife, Canary Islands, Spain
Rome, September 19, 2002
Outline

- Skeletons
  - Basic Skeletons
- Goals
- Related Work
- The OTOSP Model
  - Examples
- The llCoMP compiler
- Example(s)
- Computational results
- Conclusions
- Future Work
Skeletons

- Skeletons are software components that capture the common patterns of most parallel programs
- The goal of the skeleton approach is to develop a viable, formally well-founded methodology for parallel programming
- Programs are built from a restricted set of structures, avoiding the send/receive mechanism
- The analogy in sequential programming is Dijkstra's structured programming, which embraced the for, while, repeat, etc. constructs and rejected the unstructured goto
Skeleton Characteristics

- A good skeleton should be a piece of code that is:
  - Carefully designed
  - Reusable
  - Parametrised
  - Shipped with pre-packaged implementations for different architectures
- These codes are named skeletons because they have structure but lack detail
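To make "structure without detail" concrete, here is a minimal C sketch (our illustration, not code from the talk; the names farm, task_t and result_t are hypothetical): the coordination structure is written once, and the application-specific detail arrives as a function parameter.

    #include <stddef.h>

    typedef struct { int id; double data;  } task_t;    /* hypothetical types */
    typedef struct { int id; double value; } result_t;

    /* The skeleton fixes the pattern; the user supplies only the per-task
       detail through `worker`.  A parallel implementation would keep this
       interface unchanged and replace the loop body. */
    void farm(result_t (*worker)(task_t),
              const task_t *tasks, size_t ntasks, result_t *results)
    {
        for (size_t i = 0; i < ntasks; i++)
            results[i] = worker(tasks[i]);   /* tasks are independent */
    }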
Basic Skeletons

- Although there are many flavours of parallel skeletons, these are clearly the most important:
  - FARM / WorkQueuing
  - PIPE
  - MAP / forall
  - REDUCE, SCAN
- Susana Pelagatti, Structured Development of Parallel Programs, Taylor & Francis, 1997
The Farm / Workqueuing skeleton

- Models a set of identical workers computing in parallel a stream of independent tasks

[Figure: the Master holds a TaskList and a ResultList; Tasks flow out to the Workers and Results flow back]
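For contrast with the annotated style shown later, this is a hand-coded C+MPI sketch of the pattern the Farm abstracts (our illustration of the send/receive mechanism the skeleton hides; task count, tag names and the worker body are made up): rank 0 deals task indices to workers on demand and collects their results.

    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS    100
    #define TAG_READY 0
    #define TAG_TASK  1
    #define TAG_STOP  2

    static double run_task(int t) { return (double)t * t; }  /* stand-in body */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                  /* master: TaskList and ResultList */
            int next = 0, active = size - 1;
            while (active > 0) {
                double res;
                MPI_Status st;
                MPI_Recv(&res, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_TASK)          /* a result, not a hello */
                    printf("got %g from worker %d\n", res, st.MPI_SOURCE);
                if (next < NTASKS) {                 /* deal the next task */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK,
                             MPI_COMM_WORLD);
                    next++;
                } else {                             /* no tasks left: stop it */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    active--;
                }
            }
        } else {                          /* worker: request, compute, return */
            double res = 0.0;
            MPI_Send(&res, 1, MPI_DOUBLE, 0, TAG_READY, MPI_COMM_WORLD);
            for (;;) {
                int t;
                MPI_Status st;
                MPI_Recv(&t, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP)
                    break;
                res = run_task(t);
                MPI_Send(&res, 1, MPI_DOUBLE, 0, TAG_TASK, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }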
The Pipe skeleton

- Exploits parallelism in the evaluation of a cascade of stages

[Figure: N data items streaming through Stage0, Stage1, ..., StageNp-2, StageNp-1]
The Map / forall skeleton

- Models independent data-parallel computations in which the same function is applied to all the elements of a data array

[Figure: elements 0, 1, 2, ..., P-1 mapped independently to f(0), f(1), f(2), ..., f(P-1)]
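On shared memory this pattern is exactly an OpenMP parallel loop; a minimal sketch (our example, with a made-up element function f):

    static double f(double x) { return 2.0 * x + 1.0; }  /* any pure function */

    void map(double *a, int n)
    {
        /* the same function applied independently to every element */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] = f(a[i]);
    }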
The Reduce / Scan skeleton

- Implements parallel reduction and prefix computations over the elements of an array by means of an associative and commutative operator ⊕

[Figure: elements d[0], d[1], ..., d[i], ..., d[Np-2], d[Np-1] combined as ⊕ d[i], i = 1, ..., Np]
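On distributed memory both operations map directly onto MPI collectives; a minimal sketch assuming one value per processor and + as the operator (our illustration):

    #include <mpi.h>

    /* Combine one value per processor with the associative, commutative
       operator +: the return value is the global total on every processor,
       and *prefix receives the inclusive prefix sum up to this rank. */
    double reduce_and_scan(double x, double *prefix)
    {
        double total;
        MPI_Allreduce(&x, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        MPI_Scan(&x, prefix, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return total;
    }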
Our goals

- We are working on the design and implementation of a language giving support to these basic skeletons
- The language follows the OpenMP syntax wherever OpenMP provides one for a skeleton
- Our constructs extend the OpenMP directives with new annotations where necessary
- The language should allow the efficient nested combination of any of the basic skeletons
- We want the compiler to produce code for both shared and distributed memory architectures
Related Work

- Skeletons:
  - The P3L project (Prof. S. Pelagatti): a "structured" parallel programming language
  - Project eSkel (Prof. M. Cole): a library-based approach
  - Project COFFE (Prof. S. Gorlatch): intensive and pragmatic use of collective operations
  - The skeleton library (Prof. H. Kuchen)
Related Work

- Nesting forall clauses:
  - The NANOS Project. Ayguadé E., Martorell X., Labarta J., González M. and Navarro N., "Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study", Proc. of the 1999 International Conference on Parallel Processing, Aizu (Japan), September 1999
  - The OMNI Project. Yoshizumi Tanaka, Kenjiro Taura, Mitsuhisa Sato and Akinori Yonezawa, "Performance Evaluation of OpenMP Applications with Nested Parallelism", Languages, Compilers, and Run-Time Systems for Scalable Computers, pp. 100-112, 2000
- The FARM / Workqueuing skeleton:
  - KAI-Intel Group (Shah, Petersen, Throop), "Flexible control structures for parallelism in OpenMP", EWOMP 1999
The OTOSP model

- The One Thread is One Set of Processors (OTOSP) model facilitates the interpretation of how we intend to map the skeletons onto distributed memory machines
An OTOSP computer

- Consider an (ideal) OTOSP computer:
  - It is composed of an infinite number of processors connected through a network
  - Each processor is a RAM machine with its own private memory; the only difference among them is an internal register containing an integer, the NAME (or number) of the processor
  - The processors are organized in sets
  - The initial set is composed of all the processors in the machine
  - At any time, the memory state of all the processors in the same set is identical
  - An OTOSP computation assumes that all the processors have the same input data and the same program in memory
Example of computation

    #pragma omp parallel for
    #pragma llc result(ri + i, si[i])
    for (i = 1; i <= 3; i++) {
      ...
      #pragma omp parallel for
      #pragma llc result(rj + j, sj[j])
      for (j = 0; j <= i; j++) {
        rj[j] = J_function(i, j, &sj[j], ...);
      }
      ri[i] = I_function(i, &si[i], ...);
    }
[Figure: the initial set {0, 1, 2, ..., 23, ...} contains all the processors; at the outer loop it splits into one set per iteration: {0, 3, 6, 9, 12, 15, ...} for i=1, {1, 4, 7, 10, 13, 16, ...} for i=2 and {2, 5, 8, ..., 23, ...} for i=3]
Example

[Same code as on the previous slide. Figure: each set splits recursively over the inner loop — the set for i=1, {0, 3, 6, 9, 12, 15, ...}, splits again into a set for j=0 ({0, 6, 12, ...}) and a set for j=1 ({3, 9, 15, ...})]
The real situation

- In a real scenario, the number of processors is limited
- Two situations have to be considered:
  - The number of tasks is larger than the number of processors available in the current set: nT > nP
  - The number of processors in the current set is larger than the number of tasks: nP > nT
nT > nP

- In the case of more tasks than processors:

[Figure: sixteen tasks, numbered 0-15, to be computed by six processors P0-P5]
nT > nP

- Each processor has to compute several tasks:

[Figure: the sixteen tasks dealt out among the six processors P0-P5]

- Each processor constitutes a different set
- At the end of the computation, each processor sends its results to its partner processors (#pragma llc result); one possible assignment is sketched below
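A simple way to realize this splitting is a cyclic assignment of task indices to processors (our sketch of one policy, not necessarily the one llCoMP applies):

    /* Cyclic distribution for nT > nP: processor `rank` computes tasks
       rank, rank + nP, rank + 2*nP, ...; afterwards the results are
       exchanged among the partners, as the result clause describes. */
    void my_tasks(int rank, int nP, int nT, void (*compute_task)(int))
    {
        for (int t = rank; t < nT; t += nP)
            compute_task(t);
    }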
nT < nP

- In the case of more processors than tasks:

[Figure: four tasks, numbered 0-3, for six processors P0-P5]
nT < nP

- Several processors replicate the computation of the same task

[Figure: the four tasks replicated across the six processors P0-P5]

- All the processors replicating the same task are in the same set
- Each processor exchanges the corresponding results with its partners in the other sets; a possible mapping is sketched below
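A possible mapping of processors to replication sets (again our sketch, one policy among several):

    /* For nP > nT, group the processors into nT sets; every processor in
       the same set replicates the same task, so their memories stay
       identical.  E.g. nP = 6, nT = 4 gives sets {0,4}, {1,5}, {2}, {3}. */
    int my_set(int rank, int nT)
    {
        return rank % nT;
    }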
llCoMP

- llCoMP is our prototype compiler for the OTOSP model (ll stands for La Laguna)
- The llCoMPiler:
  - Transforms annotated C code into C + MPI calls
  - Is implemented using lex and yacc
  - Portability (of both the compiler and the generated code) is a design goal
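To give the flavour of the translation (a rough sketch of the idea, not actual llCoMP output): a parallel loop carrying a result clause can become a block-distributed loop followed by a collective exchange that restores identical memories inside the set.

    #include <mpi.h>

    /* Sketch: compute r[i] = work(i) for 0 <= i < n in blocks, then let
       every processor gather the full array.  Assumes r was allocated
       with size*chunk slots so the in-place allgather fits. */
    void translated_loop(double *r, int n, double (*work)(int))
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = (n + size - 1) / size;        /* block distribution */
        int lo = rank * chunk;
        int hi = (lo + chunk < n) ? lo + chunk : n;
        for (int i = lo; i < hi; i++)
            r[i] = work(i);

        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      r, chunk, MPI_DOUBLE, MPI_COMM_WORLD);
    }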
llCoMP

[Figure: compilation flow — source C code with pragma annotations → C preprocessor → lexical analysis → syntax analysis → intermediate code (C + MPI functions) → C compiler + MPI library → binary MPI code]
Computing π

    double t, pi = 0.0, w;
    long i, n = 100000000;
    w = 1.0 / n;
    ...
    #pragma omp parallel for reduction(+:pi) private(t)
    #pragma llc reduction_type(double)
    for (i = 0; i < n; i++) {
      t = (i + 0.5) * w;
      pi += 4.0 / (1.0 + t*t);
    }
    pi *= w;
    ...

π = ∫₀¹ 4/(1+x²) dx ≈ (1/N) Σ_{i=0}^{N-1} 4 / (1 + ((i+0.5)/N)²)
Computing π

[Same code as on the previous slide]

- When compiled with llCoMP, the loop iterations (tasks) are split among the processors
- The private clause is kept only for compatibility with OpenMP: in the OTOSP model, all storage is already private
- The OpenMP reduction clause implies a collective communication among all the processors in the set
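In C+MPI terms that collective is essentially an all-reduce; a hedged sketch of what the generated code amounts to:

    #include <mpi.h>

    /* Each processor accumulates a partial sum over its own iterations;
       the reduction then combines the partials and leaves the same total
       everywhere, keeping the memories of the set identical. */
    double combine_pi(double partial)
    {
        double pi;
        MPI_Allreduce(&partial, &pi, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return pi;
    }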
Computing π

- Speedup computing π on a SunFire 6800 (10^9 iterations)
Combining reduction and result clauses: Molecular Dynamics

    void compute(int np, int nd, double *box, vnd_t *pos, vnd_t *vel,
                 double mass, vnd_t *f, double *pot_p, double *kin_p) {
      double x, d, pot, kin;
      int i, j, k;
      vnd_t rij;

      pot = kin = 0.0;

      #pragma omp parallel for default(shared) private(i, j, k, rij, d) reduction(+: pot, kin)
      #pragma llc reduction_type(double, double)
      #pragma llc result(f[i], nd)
      for (i = 0; i < np; i++) {  /* energies and forces */
        for (j = 0; j < nd; j++)
          f[i][j] = 0.0;
        for (j = 0; j < np; j++) {
          if (i != j) {
            d = dist(nd, box, pos[i], pos[j], rij);
            pot = pot + 0.5 * v(d);
            for (k = 0; k < nd; k++) {
              f[i][k] = f[i][k] - rij[k] * dv(d) / d;
            }
          }
        }
        kin = kin + dotr8(nd, vel[i], vel[i]);
      }
      kin = kin * 0.5 * mass;
      *pot_p = pot;
      *kin_p = kin;
    }
Molecular Dynamics

- SGI Origin 3800
  - dim = 3
  - 8192 particles
  - simulation steps = 10
The Single Resource Allocation Problem (SRAP)

- M units of an indivisible resource and a set of N tasks
- fn(r) ≡ benefit obtained when r units of resource are allocated to task n

  maximize    Σ_{n=1}^{N} fn(rn)
  subject to  Σ_{n=1}^{N} rn = M
              rn ≥ 0 and integer,  n = 1, ..., N

- The dynamic programming recurrence:

  G[n][r] = max{ G[n-1][r-i] + fn(i) : 0 ≤ i ≤ r }
A sequential dynamic programming algorithm for the SRAP problem

    int srap(int N, int M, cost f, table G, table L) {
      int r, n, i, s, decision_i, temp, pos, chunksize, buffersize;

      for (n = 0; n < N; n++) {
        if (n == 0)
          for (r = 0; r <= M; r++) {
            G[0][r] = f(0, r);   /* f is non-decreasing */
          }
        else
          for (r = 0; r <= M; r++) {
            temp = G[n-1][r];
            pos = 0;
            for (i = 1; i <= r; i++) {
              decision_i = G[n-1][r-i] + f(n, i);
              if (decision_i > temp) {
                temp = decision_i;
                pos = i;
              }
            }
            G[n][r] = temp;
            L[n][r] = pos;
          }
      }
      return G[N-1][M];
    }
Using the PIPE skeleton for the SRAP problem

    int srap(int N, int M, cost f, table G, table L) {
      int r, n, i, s, decision_i, temp, pos, chunksize, buffersize;

      #pragma llc pipeline schedule(chunksize, buffersize)
      #pragma llc result (&G[n][0], M) (&L[n][0], M)
      for (n = 0; n < N; n++) {
        if (n == 0)
          for (r = 0; r <= M; r++) {
            G[0][r] = f(0, r);   /* f is non-decreasing */
            #pragma llc send (&G[0][r], 1)
          }
        else
          for (r = 0; r <= M; r++) {
            #pragma llc receive (&G[n-1][r], &s)
            temp = G[n-1][r];
            pos = 0;
            for (i = 1; i <= r; i++) {
              decision_i = G[n-1][r-i] + f(n, i);
              if (decision_i > temp) {
                temp = decision_i;
                pos = i;
              }
            }
            G[n][r] = temp;
            #pragma llc send (&G[n][r], 1)
            L[n][r] = pos;
          }
      }
      return G[N-1][M];
    }
The Pipe skeleton

- Let's look at the organization of the processors for the case nP > nT (more processors than stages in the pipe)

[Figure: the data items stream through stages Stage0, Stage1, Stage2, ..., StageN-1; the processors are dealt round-robin across the stages, so Stage0 is shared by processors 0, N, 2N, ..., Stage1 by 1, N+1, 2N+1, ..., and StageN-1 by N-1, 2N-1, ...]
Dependencies

- In the case of a PIPE skeleton, the tasks are not independent: there is a specific relationship among them

  G[n][r] = max{ G[n-1][r-i] + fn(i) : 0 ≤ i ≤ r }

[Figure: processor n-1 computes the column G[n-1][·] and feeds it to processor n, which consumes G[n-1][r] to produce G[n][r]]
The Single Resource Allocation Problem

- Cray T3E
  - Tasks: 350
  - Resource units: 4000
Nesting the skeletons: the FFT algorithm

    void FFT(Complex *A, Complex *a, Complex *W, unsigned N,
             unsigned stride, Complex *D) {
      Complex *B, *C;
      Complex Aux, *pW;
      unsigned i, n;

      if (N == 1) {
        A[0].re = a[0].re;
        A[0].im = a[0].im;
      }
      else {
        n = (N >> 1);
        B = D;
        C = D + n;

        #pragma omp parallel for
        #pragma llc result(D+i*n, n)
        for (i = 0; i <= 1; i++)
          FFT(D+i*n, a+i*stride, W, n, stride<<1, A+i*n);

        for (i = 0, pW = W; i < n; i++, pW += stride) {
          Aux.re = pW->re * C[i].re - pW->im * C[i].im;
          Aux.im = pW->re * C[i].im + pW->im * C[i].re;
          A[i].re = B[i].re + Aux.re;
          A[i].im = B[i].im + Aux.im;
          A[i+n].re = B[i].re - Aux.re;
          A[i+n].im = B[i].im - Aux.im;
        }
      }
    }
The FFT algorithm

A = (A[0], ..., A[N-1]) ∈ C^N
FFT(A) = B = (B[0], ..., B[N-1]) ∈ C^N

B[i] = Σ_{k=0}^{N-1} A[k]·w^{k·i},   w = e^{2πi/N} = cos(2π/N) + i·sin(2π/N)

Splitting into even- and odd-indexed elements:

B[i] = Σ_{k=0}^{N/2-1} A[2k]·(w²)^{k·i} + w^i · Σ_{k=0}^{N/2-1} A[2k+1]·(w²)^{k·i}

[Figure: the recursion tree — FFT(A[0], A[1], ..., A[N-1]) splits into FFTs over the even- and odd-indexed halves, and the subtrees are assigned to processor groups P0, P1, P2, P3]
The Fast Fourier Transform

- Cray T3E
  - Sizes: 64K, 120K, 256K, 512K, 1M
The Matrix product

    #pragma omp parallel for
    #pragma llc weight (1 << t)
    #pragma llc result(CC + t * m * m, m * m)
    for (t = 0; t <= tasks - 1; t++) {
      col = n << t;                        /* task t multiplies m x col blocks */
      A = AA + m * n * ((1 << t) - 1);
      B = BB + m * n * ((1 << t) - 1);
      C = CC + m * m * t;
      #pragma omp parallel for
      #pragma llc result(C + i * m, m)
      for (i = 0; i <= m - 1; i++) {
        for (j = 0; j < m; j++)
          for (C[i * m + j] = 0.0, k = 0; k < col; k++)
            C[i * m + j] += A[i * col + k] * B[k * m + j];
      }
    }
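The weight clause presumably gives the compiler the relative cost of each task (here task t costs on the order of 2^t), so that processors can be allotted proportionally when the outer loop is split. A hypothetical helper showing the arithmetic such a hint enables (our assumption about the clause's role, not documented llCoMP behaviour):

    /* Hypothetical: share nP processors among tasks in proportion to
       their weights -- the kind of decision a weight hint enables. */
    int procs_for_task(int t, const long *w, int ntasks, int nP)
    {
        long total = 0;
        for (int i = 0; i < ntasks; i++)
            total += w[i];
        int p = (int)((long)nP * w[t] / total);
        return p > 0 ? p : 1;   /* every task gets at least one processor */
    }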
Matrix products with different levels of parallelism

- SGI 3000
  - Each task is a matrix product
  - Exploiting different levels of parallelism
Current developments

- Algorithms:
  - Molecular Dynamics
  - Conjugate Gradient
  - NAS Embarrassingly Parallel
  - Mandelbrot Set
  - Matrix product
  - Quicksort
  - Fast Fourier Transform
  - Single Resource Allocation Problem
  - Knapsack Problem
- Architectures:
  - Cray T3E
  - Beowulf-type PC cluster
  - SGI Origin 3800
  - Sunfire 6800 SMP
Conclusions

- We have presented a proposal for a skeletal language that extends OpenMP with new constructs
- We have introduced the OTOSP abstract model as the basis for our implementation
- The model guarantees portability to any platform
- We have developed llCoMP, a prototype compiler for the language
- We have shown different examples of algorithms implemented following the ideas of the model
- We have presented computational results for these examples, obtained with the llCoMPiler on different architectures
- We believe these results justify further research and development of tools oriented to the OTOSP model
Future Work

- Add new features to the language
  - Improve the PIPE skeleton
    - Introducing different assignment policies
    - Controlling the number of processors assigned to each stage
    - Using buffers to produce 'tiled' code
    - Managing data sequences with unknown length
    - Managing data sequences with varying sizes
  - Work on the FARM skeleton
- Improve the prototype llCoMP compiler
  - Including type analysis
- Implement the OTOSP model using threads
Acknowledgments

- Edinburgh Parallel Computing Centre (EPCC)
- Centre Europeu de Parallelisme de Barcelona (CEPBA)
- Centre de Supercomputació de Catalunya (CESCA)
- Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas (CIEMAT)

- This research benefits from the support of the Secretaría de Estado de Universidades e Investigación (SEUI), project MaLLBa, TIC1999-0754-C03
- Also from the European Commission, through grant number HPRI-CT-1999-00026
Towards Structured Parallel Programming
Antonio Dorta, Jesús A. González, Casiano Rodríguez and Francisco de Sande ([email protected])
http://nereida.deioc.ull.es/llCoMP/