
ASYNCHRONOUS PARALLEL PROGRAMMING MODEL FOR SMP CLUSTERS

Ta Quoc Viet and Tsutomu Yoshinaga
Graduate School of Information Systems
University of Electro-Communications

Tokyo, Japan

ABSTRACT
Our study proposes a novel MPI-only parallel programming model with improved performance for SMP clusters. By rescheduling tasks in a typical flat MPI solution, our model forces the processors of an SMP node to work in different phases, thereby avoiding unnecessary communication and computation bottlenecks. This study achieves a significant performance improvement with a minimal programming effort. In comparison with a de facto flat MPI solution, our algorithm yields a 21% performance improvement for a 16-node cluster of Xeon dual-processor SMPs performing a distributed matrix multiplication.

KEY WORDS
Asynchronous Parallel Programming Model, Flat MPI, Matrix Multiplication, SUMMA, SMP cluster

1 Introduction

Our study targets an improvement of the flat MPI programming model for SMP clusters. The typical flat MPI model treats all the processors of a cluster equally, without considering the computation and communication links among processors on the same SMP node. This study examines these links and proposes a more effective programming model.

The performance of an SMP node may increase when its processors work asynchronously, i.e., a (number of) processor(s) perform(s) a computation task while the remaining processor(s) perform(s) a communication task. This phenomenon can be explained by the limitation of the shared resources for computation and/or communication. When all processors work synchronously, they require the same kind of resources. This may lead to a scarcity of resources, thereby forming an execution bottleneck. In our system, network and memory bus bandwidth limitations are the root causes of communication and computation bottlenecks, respectively.

Consequently, we propose an asynchronous parallel programming model in which the processors of a node perform computation and communication tasks asynchronously. Moreover, we rearrange the communication tasks to avoid unnecessary internode communication. The programming effort required to develop our new solution from an existing flat MPI code is fairly small.

As an example, the new model is applied to perform a distributed matrix multiplication. The experimental environment consists of a cluster of 16 Intel Xeon 2.8 GHz dual-processor nodes connected via a Gigabit Ethernet network. Each node has 1.5 GB of memory, Red Hat Linux 9.0 as the operating system, and MPICH 1.2.6 [1] as the MPI library. Local matrix multiplication (dgemm) is performed by goto-blas [2].

The significant features of this study are (1) a clear and easy-to-apply MPI-only asynchronous parallel programming model for SMP clusters and (2) a formula for evaluating its performance improvement using variables that describe the problem nature, the problem size, and the system specifications.

The remainder of the paper is organized as follows. Section 2 introduces related studies that also examine programming models for SMP clusters. Section 3 presents our model in detail. Section 4 describes a distributed matrix multiplication solution. Section 5 discusses the experimental results. Finally, Section 6 concludes the paper.

2 Related Studies

Taking into account the hierarchical memory nature of SMP clusters, several studies have proposed hybrid MPI-OpenMP programming models in which each SMP node runs only a single MPI process and computation parallelization inside a node is achieved by OpenMP. Hybrid models are classified as a hybrid model with process-to-process communication (hybrid PC) and a hybrid model with thread-to-thread communication (hybrid TC).

The hybrid PC model was examined earlier and could not show a positive result. F. Cappello et al. first presented a common algorithm to develop a fine-grain hybrid PC solution [3]. The model was subsequently applied to several problems and hardware platforms [4, 5]. These studies revealed that in most cases hybrid PC is inferior to flat MPI despite its three main advantages: (1) low communication cost, (2) dynamic load balancing capability, and (3) coarse-grain communication capability [5]. The poor performance of the fine-grain hybrid PC model is primarily due to its poor intranode OpenMP parallelization efficiency [4] resulting from an extremely low cache hit ratio [5].

We then proposed an enhanced hybrid version, the hybrid TC model [6, 7], which was also discussed by G. Wellein et al. [8] and R. Rabenseifner et al. [9, 10].


Figure 1. MPI variations for a dual-processor SMP node: (a) flat MPI and (b) asynchronous MPI. Each timeline distinguishes communication from computation; in the asynchronous case the execution is divided into T_out-phase and T_in-phase, and the time saved relative to flat MPI is dT.

Finally, we proposed a middle-grain-size approach that allows hybrid TC to achieve an impressive performance on different platforms in various types of experiments [11]. However, for practical problem solving, hybrid TC requires a considerable programming effort to determine an effective middle-grain task size.

The asynchronous MPI model proposed in this paper achieves a performance similar to that of the hybrid TC model without involving OpenMP. Moreover, it retains the flat MPI computation pattern, thereby avoiding the grain-size problem.

SMP-aware message passing [12] is another optimization approach that improves (especially collective) communication performance for SMP clusters. However, since it does not address computation tasks, it is not expected to achieve a performance comparable to that of the asynchronous model.

3 The Asynchronous Model

3.1 Flat and Asynchronous MPI

Figure 1 shows the activities of a dual-processor SMP node for the flat and asynchronous MPI solutions.

In the flat MPI solution, all the processors of a node work in the same phase: they execute communication or computation simultaneously. The execution time T_flat is the sum of the communication time T_m and the computation time T_p, both of which usually depend on the problem size.

In the asynchronous MPI solution, the processors work in different phases. The time during which the processors are in an out-of-phase state, i.e., one processor executes computation while the other performs communication, is denoted by T_out-phase. Similarly, the time during which the processors work simultaneously is denoted by T_in-phase. The sum of T_out-phase and T_in-phase forms the asynchronous execution time T_async.

The communication speed-up a_m is defined as the ratio of the communication speed during T_out-phase to that during T_in-phase. The computation speed-up a_p is defined similarly. Note that the communication and computation speeds during T_in-phase are the same as those of the flat MPI model, whereas during T_out-phase an SMP node communicates and computes faster than under the flat MPI model; in other words, a_m > 1 and a_p > 1.

With regard to Figure 1, one processor communicates at speed-up a_m while its partner computes at speed-up a_p, so the length of the out-of-phase period is bounded by the smaller of the two workloads. When communication is the smaller share, i.e., T_m / a_m <= T_p / a_p (which holds in all of our experiments),

    T_out-phase = 2 T_m / a_m                                   (1)

and the difference in the execution time dT between the two models is given by

    dT = T_flat - T_async = dT_m + dT_p,

where dT_m and dT_p denote the communication and computation time saved by working out of phase. On the other hand, each processor spends half of T_out-phase communicating at speed-up a_m and the other half computing at speed-up a_p, so that

    dT_m = (a_m - 1) T_out-phase / 2

and

    dT_p = (a_p - 1) T_out-phase / 2.

Consequently,

    dT = (a_m + a_p - 2) T_out-phase / 2.                        (2)

The performance speed-up S of the model is evaluated by

    S = T_flat / T_async = (T_m + T_p) / (T_m + T_p - dT).       (3)

Equations (1), (2), and (3) define the time saved and the speed-up of an asynchronous solution through T_m, T_p, a_m, and a_p.

In general, T_m and T_p increase with the problem size, and dT increases with them. This implies that as the problem size increases, the time saved by the asynchronous model also increases.

Speed-up S reaches its maximum value of (a_m + a_p) / 2 when T_in-phase = 0, i.e., when a_p T_m = a_m T_p.
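To make the formulas concrete, the following C sketch evaluates Equations (1)-(3) for a given set of measurements. The function name and the example inputs are ours; in particular, the speed-up factors a_m and a_p used in main() are purely illustrative values, not figures reported in this paper.

    #include <stdio.h>

    /* Predicted behaviour of the asynchronous model on a dual-processor node,
     * following Equations (1)-(3).  Assumes communication is the smaller share,
     * i.e. t_m / a_m <= t_p / a_p, as in the experiments of Section 5.          */
    static void predict(double t_m, double t_p, double a_m, double a_p)
    {
        double t_flat      = t_m + t_p;                             /* flat MPI time */
        double t_out_phase = 2.0 * t_m / a_m;                       /* Equation (1)  */
        double d_t  = (a_m + a_p - 2.0) / 2.0 * t_out_phase;        /* Equation (2)  */
        double s    = t_flat / (t_flat - d_t);                      /* Equation (3)  */

        printf("dT = %.2f s, S = %.3f\n", d_t, s);
    }

    int main(void)
    {
        /* Hypothetical inputs: T_m and T_p as in Table 1 for m=n=k=10000,
         * with illustrative (assumed) speed-up factors a_m and a_p.        */
        predict(6.85, 14.24, 1.4, 1.13);
        return 0;
    }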

3.2 Communication Rearrangement

The asynchronous model requires a rearrangement of communication. In this model, internode communication can be performed only among processes of the same phase. The original flat MPI communication pattern does not always satisfy this requirement.

We propose a simple method to perform the required rearrangement, which is shown in Figure 2, where squares represent MPI processes, the numbers inside the squares denote process IDs, and dotted rectangles indicate communicators. The upper and lower parts of the figure describe the communicators before and after the rearrangement. We assume a flat MPI communicator of eight processes originating from four dual-processor nodes.


Figure 2. Communication rearrangement: eight processes (0-7) on four dual-processor nodes are split into an even communicator {0, 2, 4, 6} and an odd communicator {1, 3, 5, 7}.

Figure 3. The SUMMA algorithm: distribution of the global matrices A, B, and C over the process grid, with a column block a_i and a row block b_i of width nb and their broadcasting scopes (communicators).

This communicator is split into two separate internode parity communicators, the even and the odd, according to the process IDs. There are four additional intranode communicators, which are not shown in the figure. All communication operations are rearranged so that they are performed only inside the newly formed communicators. An internode communication operation between two processes of different parity communicators is replaced by an intranode operation plus an internode operation inside the same parity communicator. For example, communication between processes 2 and 7 may be replaced by communication between processes 2 and 3 together with communication between processes 3 and 7.
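As an illustration of this splitting, a minimal MPI sketch (variable names ours) that creates the parity communicators of Figure 2 together with the per-node intranode communicators, assuming that the two processes of a node have consecutive ranks:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Internode parity communicator: all even ranks form one communicator,
         * all odd ranks the other (the dotted rectangles of Figure 2).          */
        MPI_Comm parity_comm;
        MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &parity_comm);

        /* Intranode communicator: the two ranks of a node (2k and 2k+1)
         * share the same colour.                                                */
        MPI_Comm node_comm;
        MPI_Comm_split(MPI_COMM_WORLD, rank / 2, rank, &node_comm);

        /* ... communication is now restricted to parity_comm and node_comm ... */

        MPI_Comm_free(&node_comm);
        MPI_Comm_free(&parity_comm);
        MPI_Finalize();
        return 0;
    }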

3.3 Task Dependency Problem

Independence between the communication and computation tasks is another requirement of the asynchronous model. The tasks can be arranged in an arbitrary order only if they are independent. Section 3 of ref. [11] has already demonstrated how to eliminate the dependencies. In this study, we do not describe the technique but show the result in Section 4.

for (i = 0; i < kb; i++) {     /* kb = k / nb block steps */
    bcast(a_i, row_comm);
    bcast(b_i, col_comm);
    c += alpha * a_i * b_i;
}

Figure 4. Flat MPI SUMMA pseudo-code.

4 Distributed Matrix Multiplication

4.1 SUMMA

The asynchronous MPI model is demonstrated by performing a distributed matrix multiplication operation:

    C = alpha * op(A) * op(B) + beta * C,    op(X) = X, X^T, or X^H,

where op(A), op(B), and C are global matrices of sizes m x k, k x n, and m x n, respectively, and alpha and beta are scalars. In this paper, we focus on the case op(X) = X, but the idea can easily be extended to the other cases.

The scalable universal matrix multiplication algorithm (SUMMA) [13] is one of the most effective algorithms for this problem. In comparison with other strong candidates such as PUMMA [14] and DIMMA [15], it exhibits comparable performance and high scalability. Moreover, SUMMA is simple and straightforward. Due to these advantages, SUMMA is used as the basis for developing an asynchronous solution.

In SUMMA, processes are mapped onto an nprow x npcol process grid. The global matrices A, B, and C are split into nprow x npcol equal submatrices. A process stores the local submatrices a, b, and c corresponding to its location in the process grid. Then, the local matrices a and b are divided into blocks of columns and blocks of rows, respectively, each of width nb. Figure 3 shows the data distribution for an example process grid; the shaded cells a, b, and c are the submatrices assigned to the process located in the second row and third column of the grid.

Figure 4 shows the pseudo-code of the algorithm, in which the function bcast(data, comm) broadcasts data over the communicator comm.
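For reference, one step of the loop in Figure 4 could be implemented roughly as follows. The argument list, the buffer handling, and the use of the standard CBLAS interface to the local dgemm are simplifying assumptions on our part, not the paper's actual code.

    #include <mpi.h>
    #include <cblas.h>

    /* One SUMMA step i (sketch).
     * a_blk: my_m x nb column block of A, owned by process column a_owner_col,
     * b_blk: nb x my_n row block of B, owned by process row b_owner_row,
     * c    : my_m x my_n local part of C, updated in place (column-major).     */
    void summa_step(int nb, int my_m, int my_n, double alpha,
                    double *a_blk, double *b_blk, double *c,
                    int a_owner_col, int b_owner_row,
                    MPI_Comm row_comm, MPI_Comm col_comm)
    {
        /* Broadcast the current column block of A along the process row.  */
        MPI_Bcast(a_blk, my_m * nb, MPI_DOUBLE, a_owner_col, row_comm);

        /* Broadcast the current row block of B along the process column.  */
        MPI_Bcast(b_blk, nb * my_n, MPI_DOUBLE, b_owner_row, col_comm);

        /* Local rank-nb update: c += alpha * a_blk * b_blk.                */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    my_m, my_n, nb, alpha, a_blk, my_m, b_blk, nb,
                    1.0, c, my_m);
    }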

4.2 Communication

The process grid is organized so as to minimize the rearrangement. Namely, processes are arranged such that all the processes of a single node belong to the same process row. For our 16-node cluster, we build a 4 x 8 process grid in which each column communicator contains processes of the same parity; thus, it is not necessary to rearrange bcast(b_i, col_comm).
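A sketch of one grid layout with this property, assuming (our assumption) that the two processes of a node have consecutive ranks and a row-major 4 x 8 grid; under this mapping the ranks of a column communicator differ by multiples of 8 and therefore share the same parity:

    #include <mpi.h>

    /* Build row and column communicators for a 4 x 8 process grid (sketch). */
    void build_grid(MPI_Comm world, MPI_Comm *row_comm, MPI_Comm *col_comm)
    {
        int rank;
        MPI_Comm_rank(world, &rank);

        int row = rank / 8;   /* 4 process rows: both ranks of a node share a row   */
        int col = rank % 8;   /* 8 process columns: column ranks differ by 8, hence
                                 the same parity                                     */

        MPI_Comm_split(world, row, rank, row_comm);
        MPI_Comm_split(world, col, rank, col_comm);
    }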


Figure 5. Communication rearrangement for a 16-node dual-processor cluster: (a) the two broadcast operations within flat MPI (a_i along a process row 0-7, b_i along a process column 0, 8, 16, 24); (b) the rearrangement of bcast(a_i, row_comm) into share(half_a_i, co_proc), bcast(half_a_i, row_parity_comm), and exchange(half_a_i, co_proc).

The communication rearrangement is shown in Figure 5 under the assumption that process 0 is the source of a_i and b_i. Figure 5 (a) illustrates the two broadcast operations within flat MPI. Figure 5 (b) shows the three operations that replace the original bcast(a_i, row_comm) (a code sketch of these steps follows the list):

- share(half_a_i, co_proc): intranode communication. In practice, this step is implemented by a pair of send and recv calls invoked by the source process and by co_proc, the remaining process of the node.

- bcast(half_a_i, row_parity_comm): internode communication. The even and the odd row communicators broadcast different halves of a_i. After this step, every process holds half of a_i, and the two processes of a node hold different halves.

- exchange(half_a_i, co_proc): intranode communication. Each process exchanges its half with its co_proc. After this step, every process holds the complete a_i.
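A code sketch of these three steps for one node follows. The names co_rank and root (the rank of the source-node process inside its parity communicator), as well as the split of a_i into two half-sized buffers, are our naming assumptions rather than the paper's implementation.

    #include <mpi.h>

    /* Rearranged broadcast of a_i for one dual-processor node (sketch).
     * half_lo/half_hi each hold half of a_i (n elements apiece).
     * co_rank is the other process of the same node in MPI_COMM_WORLD.     */
    void bcast_a_rearranged(double *half_lo, double *half_hi, int n,
                            int is_source_proc, int is_source_node, int is_even,
                            int co_rank, int root, MPI_Comm row_parity_comm)
    {
        double *my_half    = is_even ? half_lo : half_hi;
        double *other_half = is_even ? half_hi : half_lo;

        /* 1. share(half_a_i, co_proc): on the source node, hand one half of a_i
         *    to the co-process so that each parity communicator has a source.   */
        if (is_source_node) {
            if (is_source_proc)
                MPI_Send(other_half, n, MPI_DOUBLE, co_rank, 0, MPI_COMM_WORLD);
            else
                MPI_Recv(my_half, n, MPI_DOUBLE, co_rank, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        /* 2. bcast(half_a_i, row_parity_comm): the even and odd communicators
         *    broadcast different halves of a_i (internode communication).       */
        MPI_Bcast(my_half, n, MPI_DOUBLE, root, row_parity_comm);

        /* 3. exchange(half_a_i, co_proc): swap halves with the co-process so
         *    that every process ends up with the complete a_i.                  */
        MPI_Sendrecv(my_half, n, MPI_DOUBLE, co_rank, 1,
                     other_half, n, MPI_DOUBLE, co_rank, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }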

The internode communication step bcast(half_a_i, row_parity_comm) will be executed within the asynchronous execution time; the intranode communication steps should be executed simultaneously by all processes.

Figure 6. Execution schema for the even and odd processes of a node during a single iteration. The tasks are 1: share(half_a_i, co_proc); 2: bcast(half_a_i, row_parity_comm); 3: bcast(b_i, col_comm); 4: c += alpha * a_i * b_i; 5: exchange(half_a_i, co_proc). Tasks 2, 3, and 4 are executed asynchronously by the two processes.

4.3 Elimination of Dependency

We develop independent communication and computation tasks, which can be executed in any order. A loop iteration shown in Figure 4 is the object of the elimination. By applying one of the techniques described in [11], we reconstruct the loop such that a new iteration includes the computation part of the original iteration and the communication part of the next iteration. By this reconstruction, we obtain a new iteration with no dependency between its communication and computation parts.
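A compact sketch of one way such a reconstruction could look, using double buffering. The helper functions and the buffer layout are hypothetical and merely stand in for the broadcasts and the local update of Figure 4; this is our reading of how the technique of [11] could be applied here, not the paper's code.

    /* Hypothetical helpers standing for the two broadcasts of block i and for
     * the local update c += alpha * a_i * b_i with an already-received block.   */
    static void fetch_block(int i, double *buf)            { (void)i; (void)buf; }
    static void local_update(double *c, const double *buf) { (void)c; (void)buf; }

    /* Reconstructed loop: the new iteration i contains the computation part of
     * the original iteration i and the communication part of iteration i+1, so
     * the two parts of one new iteration no longer depend on each other.        */
    static void summa_loop_independent(double *c, double *buf0, double *buf1, int kb)
    {
        double *buf[2] = { buf0, buf1 };

        fetch_block(0, buf[0]);                        /* prefetch the first block */
        for (int i = 0; i < kb; i++) {
            if (i + 1 < kb)
                fetch_block(i + 1, buf[(i + 1) % 2]);  /* communication of i + 1   */
            local_update(c, buf[i % 2]);               /* computation of i         */
        }
    }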

4.4 Asynchronous Solution

Figure 6 shows an execution schema for the processors inside a node during a single iteration. The corresponding pseudo-code is shown in Figure 7. It is evident that the code is rather simple; consequently, only a small programming effort is required to obtain the asynchronous solution.

Some parts of the program are not amenable to asynchronous execution. In SUMMA, tasks 1 and 5 represent intranode communication, and they should be executed simultaneously by all the processes of a node. However, intranode communication accounts for only a small part of the total execution time.

Tasks 2, 3, and 4, which comprise the main part of the solution, are executed asynchronously. Tasks 2 and 3 represent internode communication; task 4 represents computation. Even processes execute the communication tasks first, followed by the computation task, while odd processes work in the reverse order.


for (i = 0; i < kb; i++) {     /* kb = k / nb block steps */
    if (is source node) {
        if (is source proc) send(half_a_i, co_proc);
        else recv(half_a_i, co_proc);
    }
    if (is even) {
        bcast(half_a_i, row_parity_comm);
        bcast(b_i, col_comm);
        c += alpha * a_i * b_i;
    }
    if (is odd) {
        c += alpha * a_i * b_i;
        bcast(half_a_i, row_parity_comm);
        bcast(b_i, col_comm);
    }
    exchange(half_a_i, co_proc);
}

Figure 7. Asynchronous SUMMA pseudo-code.

Table 1. Predicted saved time dT for different problem sizes.

m=n=k    T_m (sec.)   T_p (sec.)   Expected dT (sec.)
 5000       1.72         1.77            0.65
10000       6.85        14.24            2.59
15000      15.69        49.12            5.94
20000      27.80       116.75           10.53
25000      43.56       222.68           16.50
30000      62.35       385.17           23.62

This ordering produces an out-of-phase period, thereby improving both communication and computation performance.

5 Performance

Preliminary examinations show that when the local matrices a, b, and c are sufficiently large, the communication and computation speed-ups of the asynchronous model remain relatively stable across different problem sizes. The values of a_m and a_p can be estimated by simulating synchronous and asynchronous executions under the corresponding conditions and comparing their performance; the values estimated in this way for our cluster are used for the predictions in Table 1.

Experimental results for different problem sizes are shown in Figure 8. Square matrices A, B, and C of equal size are considered. The block size nb is fixed at 112. The values of m=n=k vary between 5000 and 30000, which are large enough to stabilize the local goto-blas dgemm and small enough to fit within the memory limit. The results are in good agreement with those obtained from the formula given by Equation (2) and shown in Table 1, although there is a small disparity due to measurement error.

The left chart shows the time saved by the asynchronous solution in comparison with the flat MPI one. As expected, the saved time increases with the problem size.

The right chart shows the absolute performance per processor. The asynchronous model always exhibits the better result; however, the gap between the two performance lines narrows gradually. At m=n=k=5000, the difference is approximately 21% (450 MFlops); at m=n=k=30000, it decreases to 5.4% (202 MFlops). This behavior is explained by Equations (2) and (3) together with the nature of matrix multiplication. The growth rate of T_p is O(n^3), while that of T_m is only O(n^2). Consequently, the growth rates of dT and T_async are O(n^2) and O(n^3), respectively. This implies that dT grows more slowly than T_async, and S becomes smaller as n increases.

In comparison with a hybrid TC solution [11], whose performance is not shown in the figure, the asynchronous model is slightly slower, although the difference is less than 1% for all problem sizes. Replacing intranode communication with shared variables saves some time for the hybrid TC model. However, we believe that the ease of programming of the asynchronous model is more important.

6 Conclusions and Discussions

This paper proposed a simple and effective parallel programming model and demonstrated its advantages over existing solutions. The model consistently outperforms the typical and widely accepted flat MPI model. In comparison with the powerful but complicated hybrid TC model, it saves a significant amount of programming effort at a negligible cost in performance.

The formulas for evaluating the performance improvement are also noteworthy. They predict the behavior of a cluster under various circumstances and demonstrate the scalability of the asynchronous model. Scalability has already been proved for the SUMMA algorithm under the flat MPI model [13]; the advantage of the asynchronous model over flat MPI thereby guarantees its scalability as well.

For problems in which the computation and communication times have different growth rates (e.g., SUMMA), the performance improvement decreases for large problem sizes. For problems whose computation-to-communication cost ratio is constant across problem sizes (e.g., NAS CG), we expect a stable rate of performance improvement from the asynchronous model.

Although the analyses and experiments in this paper mainly target a dual-processor cluster, the conclusions can be extended to clusters with more processors per node. In that case, it is necessary to determine an effective number of phases and the corresponding task schedules.


Figure 8. Experimental results: saved time (seconds, left) and performance per processor (MFlops, right) as functions of the matrix size (m=n=k), for the asynchronous MPI and flat MPI solutions.

We are developing an asynchronous MPI version of the HPL benchmark; experimental results and analyses will soon be published on our homepage, www.sowa.is.uec.ac.jp/~viet/.

Acknowledgment

This research is partially supported by the Nippon Foundation of the Japan Science Society (JSS), No. 17-251.

References

[1] MPICH Team. MPICH, a Portable MPI Implementation. http://www-unix.mcs.anl.gov/mpi/mpich/.

[2] K. Goto. High-Performance BLAS by Kazushige Goto. http://www.cs.utexas.edu/users/flame/goto/.

[3] F. Cappello and O. Richard. Intra Node Parallelization of MPI Programs with OpenMP. TR-CAP-9901, technical report, http://www.lri.fr/~fci/goinfreWWW/1196.ps.gz, 1998.

[4] F. Cappello and D. Etiemble. MPI versus MPI+OpenMP on IBM SP for the NAS Benchmarks. Proc. Supercomputing 2000, 2000.

[5] T. Boku, S. Yoshikawa, and M. Sato. Implementation and Performance Evaluation of SPAM Particle Code with OpenMP-MPI Hybrid Programming. Proc. European Workshop on OpenMP 2001, 2001.

[6] T. Q. Viet, T. Yoshinaga, and M. Sowa. A Master-Slave Algorithm for Hybrid MPI-OpenMP Programming on a Cluster of SMPs. IPSJ SIG Notes 2002-HPC-91-19, 2002, 107-112.

[7] T. Q. Viet, T. Yoshinaga, B. A. Abderazek, and M. Sowa. A Hybrid MPI-OpenMP Solution for a Linear System on a Cluster of SMPs. Proc. Symposium on Advanced Computing Systems and Infrastructures, 2003.

[8] G. Wellein, S. Hager, A. Basermann, and H. Fehske. Fast Sparse Matrix-Vector Multiplication for TeraFlop/s Computers. Proc. Vector and Parallel Processing, 2002.

[9] R. Rabenseifner and G. Wellein. Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures. The International Journal of High Performance Computing Applications, 17(1), 2003, 49-62.

[10] R. Rabenseifner. Hybrid Parallel Programming: Performance Problems and Chances. Proc. 45th CUG (Cray User Group) Conference, 2003.

[11] T. Q. Viet, T. Yoshinaga, B. A. Abderazek, and M. Sowa. Construction of Hybrid MPI-OpenMP Solutions for SMP Clusters. IPSJ Transactions on Advanced Computing Systems, 46(SIG 3), 2005, 25-37.

[12] J. L. Traff. SMP-Aware Message Passing Programming. Proc. Eighth International Workshop on High-Level Parallel Programming Models and Supportive Environments, 2003.

[13] R. A. van de Geijn and J. Watts. SUMMA: Scalable Universal Matrix Multiplication Algorithm. LAPACK Working Note 99, technical report, University of Tennessee, 1995.

[14] J. Choi, J. J. Dongarra, and D. W. Walker. PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed Memory Concurrent Computers. Concurrency: Practice and Experience, 6(7), 1994, 543-570.

[15] J. Choi. A Fast Scalable Universal Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers. Proc. IPPS'97, 1997.