Statistical Analysis of Message Passing Programs to Guide Computer Design

William E. Cohen and Basel Ali Mahafzah
Department of Electrical and Computer Engineering
College of Engineering
The University of Alabama in Huntsville
Huntsville, AL 35899
{cohen,mahafzah}@ece.uah.edu

Abstract

Little data exists on how message passing programs use parallel computers. The behavior of these programs can strongly influence design decisions made for future computer systems, and a computer designer's use of incorrect assumptions about program behavior can degrade performance.

In many cases, simple statistical parameters describing characteristics such as message sizes, destinations, sources, and times between sends would give the designers of communication libraries and computer hardware great insight into how the hardware is used by actual programs.

Techniques for collecting statistical information about communication characteristics for system design have been applied to the parallel version of the NAS benchmarks. This paper describes the statistical data collected for multiprocessor runs of the NAS 2.1 benchmarks and some of the characteristics observed in that data.

1. Introduction

Parallel computers are complicated machines designed to efficiently perform computations. As with other machines designed for a specific purpose, the requirements of the task must be understood. The task that these parallel machines support is running large-scale application programs. For portability, many of these parallel application programs have been written to make use of message passing libraries. Thus, it is important that future parallel computers be designed to efficiently support this programming model. Poor performance can result if there is a mismatch between the assumed program behavior and the actual program behavior.

There are a number of performance tools that have been used to evaluate the performance of programs on existing computer systems, e.g. Pablo [11] and Paradyn [7]. These tools often concentrate on time-based metrics, which are appropriate when attempting to minimize the run time of a particular program on a particular computer. However, it is often difficult to extend these results to computer architectures that have significantly different timings associated with the operations, and these tools require the program to be executed on each different machine to gather the information.

Designers need methods of describing aspects of program behavior that are portable across different computer architectures. Statistical measurements of message passing programs would provide this type of information: a few numbers can capture essential characteristics of parallel programs. This approach avoids the difficulties of detailed trace-based data collection and clearly summarizes important properties for the computer architect. Therefore, this work focuses on measuring specific parameters (e.g., distribution of message sizes, average message size, and processing time between message sends) of message passing programs and on measurements obtained from existing parallel programs. Other researchers [4, 15] have proposed the statistical analysis of the communication characteristics of parallel application programs. However, those works were limited to the inter-processor arrival times of messages, the distribution of message sources, and the completion time of the programs. The data collected for the work described in this paper focuses on message sizes and the time for processors to complete work between sends.

The Message Passing Interface (MPI) library [14] used for this work is described in greater detail in section 2. The method of instrumenting message passing programs is described in section 3. Section 4 describes the data collected for the NAS 2.1 benchmarks [1, 2] and makes some observations on the size of messages sent. Section 5 outlines future work planned to extend the utility of these techniques. Finally, section 6 summarizes the results.


2. Overview of Message Passing Interface

MPI is a library of portable, efficient, and flexible functions [5] that can be used in C, C++, and FORTRAN programs. This library can be used on massively parallel machines with distributed-memory architectures (e.g., nCUBE2, IBM SP2, CM-5, and Intel Paragon), and can be used on shared memory machines (e.g., Cray T3D). Also, MPI provides profiling libraries, e.g. time accounting, logfile creation and examination, and runtime animation [6].

In the MPI programming model, a task consists of one or more processes that communicate data between the separate processes by calling the MPI library routines. In a program that uses MPI, a fixed set of processes is created at program initialization, one process per processor. These processes use point-to-point communication to exchange data between pairs of processes and collective communication to exchange data between processes in a group.
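As a brief illustration of this model (a minimal sketch, not code from the paper), the following C program creates the fixed set of processes at initialization and performs a point-to-point exchange between a pair of processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);                 /* fixed set of processes created here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs); /* how many processes in the task? */

    if (rank == 0) {
        value = 42;
        /* point-to-point send to process 1 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}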

3. Instrumentation of Code

A crucial element of this work is to instrument parallel programs in a manner that is portable, minimizes changes to the programs under study, and is able to collect the required data. The MPI standard [14] addresses the issue of instrumenting the MPI library. An overview of instrumenting MPI programs is discussed in section 3.1. The types of data that can be collected from this instrumentation are discussed in section 3.2.

3.1. The MPI Profiling Interface

Various implementations of the MPI library exist with different levels of performance, but all of these implementations share a common programming interface [14]. This interface to the application program provides a clean separation between local operations within a process and operations that involve other processes. Similarly, the MPI profiling interface is defined to ensure that it is relatively easy for designers of profiling tools to interface their codes to MPI implementations on different machines. The design of the MPI library allows the insertion of additional instrumentation code. This profiling code is restricted to the MPI interface between the application program and the MPI library.

The general scheme supported by all versions of MPI is to have two versions of the same set of library functions. The original library has the normally used names. The other library used for profiling has the same functions as the original library, but all of the functions have been renamed, e.g. prefixed with "PMPI_".

The instrumentation routines use the original names of the library functions and are called instead of the library routines. These instrumentation routines are simple wrappers that record the desired information and then call the renamed routines in the profiling library. Thus, to instrument an application program, the application program only needs to be linked with the wrappers and the MPI profiling library.
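For example, a minimal wrapper for MPI_Send, written as a sketch of this scheme rather than the exact instrumentation used for this work, records the message size and then forwards the call to the renamed PMPI_Send routine; the counter names are illustrative:

#include <mpi.h>

static long total_messages = 0;   /* illustrative counters */
static long total_bytes = 0;

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int elem_size;

    MPI_Type_size(datatype, &elem_size);      /* bytes per element of this type */
    total_messages += 1;
    total_bytes += (long)count * elem_size;   /* record the desired information */

    return PMPI_Send(buf, count, datatype, dest, tag, comm);  /* renamed routine */
}

Because the wrapper carries the standard name, relinking the unmodified application against the wrappers and the profiling library is sufficient to enable data collection.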

3.2. Abilities of the Data Collection Technique

The MPI profiling interface [14] provides a portable and convenient method of collecting information about the performance of a program that uses the MPI library. However, there are limitations to the types of data that this interface can collect. The profiling interface has a limited view of what occurs in the program. It can observe the parameters that are passed into it and obtain information from calls to system functions, e.g. time of day. The MPI functions that implement the communication operations are black boxes; similarly, information about the user code is limited to what is passed into the instrumentation code as parameters. As a result, the instrumentation code does not have a view of the low-level operations that are implementation specific, e.g. when a message is sent to the physical network device, or details of what the user code is doing when it is not calling the library.

However, the MPI profiling interface does enable the collection of data about how the library is being used. The following types of data can be collected using the profiling interface:

1. Source of the message.
2. Destination of the message.
3. Number of bytes in the message.
4. Type of data being sent.
5. Number of times each library routine is called.
6. When each library routine is called.

Although collection of this data does not give a complete picture of what a parallel program is doing, it can give designers valuable feedback on how programs use the message passing libraries. Instrumentation was written to collect data about the number of bytes sent in each message, the average message size, and the time between calls to the send functions. The following section illustrates how this type of data collection can be used to examine the behavior of existing parallel programs.
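As a sketch of this collection (assumed code, not the paper's exact instrumentation), each send can be binned by size using the power-of-two bin limits that appear in the histograms of section 4, and the gap since the previous send can be binned in the same way using MPI_Wtime:

#include <mpi.h>

#define NBINS 14                       /* size bins: <=64, <=128, ... bytes */
static long size_hist[NBINS];          /* message size histogram */
static long gap_hist[NBINS];           /* time-between-sends histogram */
static double last_send_time = -1.0;

/* Bin the time since the previous send; bins are <=0.12 ms, <=0.24 ms, ... */
static void record_gap(double seconds)
{
    double ms = seconds * 1000.0, limit = 0.12;
    int bin = 0;
    while (bin < NBINS - 1 && ms > limit) { limit *= 2.0; bin++; }
    gap_hist[bin]++;
}

/* Called from each send wrapper with the message size in bytes. */
static void record_send(int nbytes)
{
    long limit = 64;
    int bin = 0;
    while (bin < NBINS - 1 && nbytes > limit) { limit *= 2; bin++; }
    size_hist[bin]++;

    double now = MPI_Wtime();          /* portable wall-clock time from MPI */
    if (last_send_time >= 0.0)
        record_gap(now - last_send_time);
    last_send_time = now;
}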

4. Collected Data

The goal of this work is to build a library of information that concisely describes the characteristics of a set of parallel programs. We selected the Numerical Aerodynamic Simulation (NAS) benchmarks [1] as the set of programs to instrument and characterize. NAS uses the MPI library, is freely available to researchers, and attempts to mimic the characteristics of numerical aerodynamic simulations. The data collection was limited to collecting histograms of message sizes for both sends and receives and some data about the time between processor sends for the sample size runs.

The instrumentation was performed on two clusters of workstations. The initial instrumentation development and data collection for four processor runs of the programs were performed on a cluster of uni-processor PCs. The larger program runs were performed on a cluster of single-processor SUN SPARC workstations.

Each PC had an Intel 486DX4 100MHz processor, 256KB L2 cache, 16MB of RAM, and a 1GB hard disk. The four PCs were connected via 10Mb/s Ethernet. The software on these systems consisted of the Linux operating system (kernel 1.3.20), MPICH version 1.1 message passing libraries, and the GNU G77 version 2.7.0 Fortran compiler.

A variety of different SUN workstations made up the processing elements in the 16 processor cluster, but all machines used in this cluster had a minimum of 64MB of memory, a single SPARC processor, and a 10Mb/s Ethernet connection. The software on the SUN workstations consisted of Solaris 2.5, MPICH version 1.1, and the GNU G77 version 2.7.2.f.1 Fortran compiler.

The two clusters of workstations recorded the characteristics of the NAS benchmarks. Section 4.1 will describe the NAS benchmarks and the data collected for the individual programs in greater detail. Section 4.2 examines some of the characteristics observed in the data collected.

4.1. NAS Benchmarks

The NAS benchmarks [1] are a collection of computational kernels developed by NASA Ames to determine the suitability of high performance computers for performing aerodynamic simulations via computations. The code in the benchmarks omits input and output operations and concentrates on only the computations. Initially, the NAS 1.0 benchmarks were distributed as "pencil and paper" specifications because of the diversity of high-performance computer hardware and software [1]. With the development of portable parallel software, e.g. message passing libraries, a later generation of NAS, NAS 2.1, is available as code using the MPI communication libraries. NAS 2.1 reflects what a typical parallel application programmer might write.

The NAS 2.1 benchmarks contain five computational kernels: 3-D FFT (FT), LU solver (LU), multigrid (MG), block tridiagonal solver (BT), and pentadiagonal solver (SP). There are four classes of program runs: sample class ("S"), class "A", class "B", and class "C". The sample class requires the fewest computations and class C requires the most. Table 1 shows the problem sizes for the BT, SP, LU, and MG benchmarks and the number of floating point operations performed for each [2, 12]. These different classes allow the runs of the program to be scaled to the size of the machine.

Table 1 NAS benchmark problem sizes.

             Sample class ("S")           Class "A"
Benchmark    Problem size  FLOPS (x10^6)  Problem size  FLOPS (x10^9)
BT           12^3          244.8          64^3          181.3
SP           12^3          172.8          64^3          102.0
LU           12^3          98.0           64^3          64.6
MG           32^3          12.8           256^3         3.9

Table 2 Summary of total number of messages sent, average size of messages sent, and number of floating point operations performed per byte sent for the NAS benchmarks. "Tot." is total messages (x10^3), "Avg." is average message size (bytes x10^3), and "F/B" is FLOPS per message byte.

             Sample ("S"), 4 proc.   Class A, 9 proc. (BT, SP)   Class A, 16 proc.
                                     or 8 proc. (LU, MG)
Benchmark    Tot.   Avg.    F/B      Tot.   Avg.   F/B           Tot.   Avg.   F/B
BT           2.95   4.85    17.1     32.6   69.5   80.0          77.3   45.7   51.3
SP           4.87   3.04    11.7     65.0   60.7   25.9          154.   38.4   17.2
LU           4.42   0.723   30.6     315.   3.07   66.8          756.   1.92   44.5
MG           1.78   1.27    5.69     5.71   27.0   25.3          11.0   18.8   18.8

Data was collected on the characteristics of four of the benchmarks, LU, MG, BT, and SP, for the sample and "A" class problems. The data collected consisted of the average message size, the distribution of message sizes, and the distribution of time between message sends. The remainder of section 4.1 will examine the data collected and show how it would influence the design of a computer system.

For the NAS 2.1 benchmarks we collected data on the size and the number of messages sent by each processor. Table 2 summarizes the total number of messages sent by all the processors and the average size of the messages for each benchmark program. The four programs have a range of average message sizes, with LU having the smallest average message size and BT having the largest average message size. Table 2 shows an additional computed metric for each of the program runs, the number of floating point operations (FLOPS) per byte sent. This metric can be used to estimate the bandwidth required for balanced system operation. The characteristics of the programs will be discussed in greater detail in sections 4.1.1, 4.1.2, and 4.1.3. Section 4.2 will discuss the implications of the recorded data on the design of future computers.

4.1.1. BT and SP Benchmarks

The BT and SP benchmarks perform similar computations and have similar communication patterns, so they will be discussed together in this section. These benchmarks are designed to be representative of the computations associated with the implicit operators in Computational Fluid Dynamics (CFD) code [1]. There are differences between BT and SP in the specific computations performed, but they share common data distribution and communication patterns.

Both BT and SP solve multiple independent systems of non-diagonally dominant block tridiagonal equations with a 5x5 block size for a certain number of time steps. The equations are arranged as a cube with an equal number of equations in each direction. The equations are grouped into cells with n cells in each direction, giving a total of n^3 cells. The number of processors used to solve each problem is equal to the number of cells on a face (n^2). Thus, the number of processors used to solve the problem must have an integer square root, and each processor gets n cells. Rather than giving each processor a simple vertical column of cells, each processor obtains a diagonal column of cells.

In each iteration of the program, the processors exchange variable values for the equations on the faces of each cell with the adjacent cell. This data exchange is performed in the subroutine copy_faces. In the data layout for the BT and SP benchmarks, if processor i contains an adjoining cell in a particular direction (e.g. east) from a cell in processor j, then processor i has all adjoining cells in that direction from processor j. Thus, copy_faces groups these multiple cell faces for a particular direction into a single send, yielding the largest average message size of the NAS benchmarks.

Figure 1 shows the histogram of message sizes for the sample size run of BT on 4 processors, and it is clear that the message sizes are not uniformly distributed. Figure 2 shows similar characteristics for the sample size run of SP on 4 processors. This non-uniform distribution of message sizes is due to the number of cell faces that are being sent to the adjacent processor. These same characteristics are echoed in the class "A" runs of BT and SP for 9 and 16 processors shown in figures 3 and 4.

Figures 5 and 6 show histograms measuring the time between consecutive calls to the send functions in the library for the sample class runs of the BT and SP benchmarks on four processors. Due to the variations in the performance of the machines used for the 9 and 16 processor runs, no timing data was taken for those runs. One surprising feature present in both timing histograms is the dual peaks. If a typical queuing theory model [8] were used, where computation tasks have an exponential distribution for completion and a single send occurs after the computation is complete, then the histograms should show a straight line for the bin sizes used. Instead, the individual computational tasks require time around the right peak to complete, and this is followed by a series of closely spaced sends which cause the left peak in the histogram.

For each time step in both the BT and SP benchmarks, the processors compute all the values for the elements in the cells they hold and then the copy_faces subroutine is used to exchange the data between the processors. Thus, the processors must wait for the communications to complete before the next iteration's computations are performed. A more efficient method of performing the iterations is to split the computations into computations whose results will be sent to other processors and computations that compute only locally used values [13]. The communication operations sending the results from the first part of the computations can be overlapped with the computations producing only locally used values. Table 3 gives the ratio of the total elements computed to the interior elements computed by each processor. For the sample class problems most of the elements are on the exterior of the cell, causing an unfavorable ratio of total elements to interior elements. The ratio is better for the class "A" runs, but as more processors are used to solve the problem, the ratio becomes less favorable.

Using the ratio of total elements to interior elements and the operations per byte sent, an estimate of the required network performance can be made. For example, assume the processors used in the machine perform 100 million floating point operations per second. For the class "A" SP benchmark running on 16 processors, a processor would produce 5.8MB/s (100 MFLOPS / 17.2 FLOPS per byte) of network traffic on average. However, this network traffic could only occur during the computation of the interior elements and would have to be multiplied by 1.49 (the ratio of total elements to interior elements), indicating that each processor requires 8.66MB/s of bandwidth and 16 processors require an aggregate bandwidth of 139MB/s. For this example system, 100Mb/s (10MB/s) Ethernet would not provide sufficient bandwidth for a balanced system and the network would limit performance.
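The estimate above can be expressed as a short routine. The input values (a 100 MFLOPS processor, the FLOPS-per-byte figures from Table 2, and the total-to-interior element ratios from Table 3) come from the paper; the code itself is only an illustrative sketch:

#include <stdio.h>

/* Required per-processor bandwidth (MB/s) for balanced operation.
 * burst_factor is 1.0 when communication fully overlaps computation;
 * for BT and SP it is the total:interior element ratio, since traffic
 * can only be hidden while interior elements are being computed. */
double required_mbps(double mflops, double flops_per_byte, double burst_factor)
{
    return mflops / flops_per_byte * burst_factor;
}

int main(void)
{
    /* Class "A" SP, 16 processors: 17.2 FLOPS/byte, ratio 1.49 */
    double sp = required_mbps(100.0, 17.2, 1.49);   /* about 8.66 MB/s */
    /* Class "A" LU, 16 processors: 44.5 FLOPS/byte, full overlap */
    double lu = required_mbps(100.0, 44.5, 1.0);    /* about 2.25 MB/s */

    printf("SP: %.2f MB/s per processor, %.0f MB/s aggregate\n", sp, 16 * sp);
    printf("LU: %.2f MB/s per processor, %.0f MB/s aggregate\n", lu, 16 * lu);
    return 0;
}

The same routine reproduces the LU figures derived in section 4.1.2 below, where full overlap makes the burst factor 1.0.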

Table 3 Ratio of total elements to interior elements in the BT and SP benchmarks.

         Class "S", 4 proc.   Class "A", 9 proc.   Class "A", 16 proc.
Ratio    3.38 : 1.0           1.33 : 1.0           1.49 : 1.0

4.1.2. LU Benchmark

The LU benchmark is another program with calculations that are representative of the computations performed in CFD code. The LU benchmark performs the computations on a three-dimensional array, with 12 elements in each direction for the sample problem and 64 elements in each direction for the class "A" problem. Recursive bisection is performed on two of the three dimensions in the array to partition the problem. Thus, each processor works on a prism of elements that extends from the top to the bottom of the array. Each prism is composed of "tiles", horizontal planes of elements.

The LU benchmark starts with a single processor performing Symmetric Successive Over-Relaxation (SSOR) on a tile in a corner of the array. Once the SSOR operation is completed on a tile, the updated values for the elements along the edges of the tile are sent to the adjoining processors. Once the adjoining processors receive the updates, they start to perform SSOR on the tiles that needed the information contained in the messages, and the processor that just sent the messages starts to perform SSOR on the tile above the completed tile. Thus, a wave front of tile processing extends from the starting corner.

This method of processing leads to a large number of small messages being sent between the processors. Figure 7 shows that for the sample sized run with four processors there are 10 times as many messages less than 257 bytes in size as messages greater than 4096 bytes in size. Figure 8 shows that for the class "A" runs of the LU benchmark on 8 and 16 processors there are 60 times as many messages less than 2049 bytes as there are messages greater than 65536 bytes. Thus, for this benchmark, message latency and setup overhead may be more of a concern than peak bandwidth because of the small average message size.

Another effect of the disparity of message sizes is that the average message size is not representative of the message traffic. The actual messages are either much smaller or much larger than the average message size listed in table 2.

The average bandwidth requirements can be computed for the LU benchmark using the estimated rate that floating point operations are performed by the processor and the number of operations per byte sent in Table 2. Unlike the BT and SP benchmarks, the LU benchmark fully overlaps the transmission of values with the computations. Thus, if the network can supply the average bandwidth obtained by dividing the rate the processor can perform floating point operations by the number of operations per byte sent in table 2, the computer system is balanced. Assuming a computer system composed of 16 processors, each capable of 100 million floating point operations per second, balanced performance for the class "A" LU benchmark would require network connections with a bandwidth of 2.25MB/s for each processor and a total aggregate bandwidth of 36MB/s.

Figure 9 shows the time between message sends for the sample class run with four processors. Again, the distribution of times between sends is not exponential. A significant number of sends are separated by less than 2 ms. This is due to the multiple messages sent during the updates.

4.1.3. MG Benchmark

The MG benchmark is a multigrid solver, which finds a solution to a relatively coarse grid and then refines the solution for higher and higher resolution grids describing the same problem. Once the solution is determined at the highest resolution grid, the results are propagated back to the lower resolution grids. This algorithm allows iterative methods to converge to a solution more quickly.

The MG benchmark solves the scalar discrete Poisson equations for a three-dimensional grid. A 32x32x32 grid is the highest resolution grid for the sample sized problem, and a 256x256x256 grid is the highest resolution grid for the class "A" problem. Because of the various grid sizes being solved, there is a wide range of sizes for the messages sent between the processors in the machine. Figure 10 shows the histogram of message sizes for the sample sized problem, and figure 11 shows the histograms of the message sizes for the eight and 16-processor runs of the MG benchmark. Even for the class "A" problem, a significant number of small messages are exchanged between the processors. For this type of algorithm, the amount of work being performed by each processor is small when the small messages are being sent. Thus, hiding the overhead of communication with computations may not be feasible, and like the LU benchmark, overall performance may be influenced more by latency than by bandwidth.

Figure 12 provides the histogram of the times between consecutive sends. The sample class MG benchmark has the fewest operations of the four sample class benchmarks, as shown in table 1. Thus, the time between sends is smaller than that for the other benchmarks, and the majority of the sends are separated by less than 4ms of time.

4.2. Discussion of Results

The driving force for collecting these parameters is to characterize parallel programs and use that information to guide the design of future computer systems. The data collected about the message sizes is portable to any computer architecture that supports a message passing model and will not vary as long as the computers have the same sizes for the data types. Several observations can be made about the data:

1) The collected data can be used to estimate the required interconnection bandwidth.
2) There are significant variations in message sizes between the benchmarks.
3) Most of the applications did not use a wide variety of message sizes.
4) The average sizes of the messages were not always representative of the messages.
5) The traffic presented to the network was typically in bursts.

The data collected in this study could be used to estimate the minimum bandwidth required to prevent the interconnection network from being the component in the system that limits performance. As shown in the examples, a 100Mb/s broadcast network would not provide sufficient bandwidth for the class "A" BT, SP, and LU benchmarks for a 16 processor cluster with each processor capable of performing 100 million floating point operations per second.

There is a factor of 6.7 between the largest and smallest average message size for the sample size problems shown in table 2. This variation becomes even more pronounced in the class "A" problems, where the ratio is 23. The small average message size in LU is explained by the fine-grained nature of the program [2]. The average size of the messages shown in the table for the LU benchmark is considerably smaller than the size of messages needed for obtaining the peak performance of the message passing libraries [9, 16].

Many analytical models using queuing theory assume either a continuous distribution of message sizes or one message size [10, 8]. However, the only benchmark that appeared to approach these ideals was MG, which has a wide range of message sizes. The other three benchmarks used few categories of sizes. The different message sizes could also be separated by significant distances, with the class "A" LU benchmark having at least a factor of 16 between the two message categories.

As a result of the wide separation in message sizes, the average size of the messages sent shown in table 2 may not represent the type of messages sent. Rather than forming a single distribution of message sizes around an average, it may be more appropriate to have several widely separated distributions of message sizes in the models.

Finally, a common simplification made in analytical models is to assume an exponential distribution of times between individual send operations [3]. This time between sends is the processing time. However, for the benchmarks examined for this work, a more appropriate model may be to have some limited range of compute times (e.g. a uniform distribution between a minimum and maximum time) and then to have bursts of multiple sends at the end of the compute time.
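A minimal sketch of the suggested model, with illustrative rather than measured parameter values, draws each compute time from a bounded uniform range and then emits a burst of closely spaced sends:

#include <stdio.h>
#include <stdlib.h>

/* Uniformly distributed value in [lo, hi]. */
static double uniform(double lo, double hi)
{
    return lo + (hi - lo) * (rand() / (double)RAND_MAX);
}

int main(void)
{
    for (int step = 0; step < 5; step++) {
        /* bounded compute phase: produces the right-hand histogram peak */
        printf("gap %6.2f ms (compute)\n", uniform(4.0, 16.0));

        /* burst of closely spaced sends: produces the left-hand peak */
        for (int i = 0; i < 3; i++)
            printf("gap %6.2f ms (burst)\n", uniform(0.01, 0.12));
    }
    return 0;
}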

5. Future Work

The results presented in this paper are rather limited measurements of four existing application programs that use message passing. Future work will extend the collection and analysis of the communication metrics discussed in this paper. Work in the following areas is planned:

1) Collect data from a wider variety of programs.
2) Collect data from larger runs of programs to determine if the observed statistical parameters hold.
3) Consider measurement of additional parameters.
4) Apply the parameters collected to improve the quality of the queuing theory models built to predict program performance.

There are a number of desirable extensions that should be made to this work. Section 4 only presents data taken for the NAS benchmarks. These programs were designed to represent typical code found in aerodynamic simulations being performed at NASA Ames [1]. However, it is likely that the characteristics of other application programs, e.g. n-body simulations and image rendering, will differ from the characteristics of the NAS benchmarks. Instrumentation of a broader sampling of programs would ensure that the data collected is not skewed by the peculiarities of a particular type of program.

The data presented was for relatively small-scale runs of programs. It would be instructive to determine whether the properties observed for these small-scale runs hold for larger programs, e.g. the class "B" and "C" NAS problems. To determine these scaling properties, runs of the larger class "B" and "C" problems need to be made, and the number of processors in the runs should be varied.

The data collected for this paper was almost exclusively for the message sizes sent between processors. Data describing the time between sends was limited to the sample sized runs. As the work progresses, other parameters may be collected, such as the distribution of message sources and destinations, to aid in the simulation of "hot spots" in network traffic. These additional parameters should be defined in a machine-independent manner.

Finally, additional work to determine how these parameters improve the prediction of queuing theory models should be performed. The parameters collected could be used to determine the types of models that would best reflect a program's behavior. The collected data could calibrate the simulation models, indicating when there are significant differences between actual programs and the simulations.

6. Conclusion

This paper demonstrates the utility of collecting parameters describing the characteristics of the NAS benchmarks. The parameters collected provide insight into the operation of the NAS benchmarks on parallel computers. It shows that two of the benchmarks, LU and MG, send a number of relatively small messages between the processors. The timing information also shows that the network traffic from a processor may not be even, but may occur in bursts. These characteristics are difficult to note in detailed step-by-step simulations because the usual parameter being measured is the total run time.

Computer architects would be able to use the data to gain an understanding of what features would best support parallel programs. This same data could be used to produce less detailed simulations, e.g. queuing theory models [3, 8, 10], that still capture the essential behavior of the parallel programs. A library of parameters describing the performance of many parallel programs will aid computer architects in producing computers that efficiently support parallel programs.

7. Acknowledgements

We would like to acknowledge the many useful discussions with Jeffery Kulick, Constantine Katsinis, B. Earl Wells, and Rhonda Gaede. We would also like to acknowledge the suggestions made by Gayathri Krishnamurthy to improve this paper.

8. References

[1] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga, "The NAS Parallel Benchmarks," NASA Ames Research Center, Moffett Field, California, RNR-94-007, March 1994.

[2] D. Bailey, T. Harris, W. Saphir, R. Wijngaart, A. Woo, and M. Yarrow, "The NAS Parallel Benchmarks 2.0," NASA Ames Research Center, Moffett Field, California, NAS-95-020, December 1995.

[3] B. Bodnar and A. Liu, "Modeling and Performance Analysis of Single-Bus Tightly-Coupled Multiprocessors," IEEE Transactions on Computers, Vol. 38, No. 3, pp. 464-470, March 1989.

[4] S. Chodnekar, V. Srinivasan, A. S. Vaidya, A. Sivasubramaniam, and C. R. Das, "Towards a Communication Characterization Methodology for Parallel Applications," in Proceedings of High-Performance Computer Architecture 3, San Antonio, Texas, February 1-5, 1997.

[5] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard," Math. & Comput. Sci. Division at Argonne National Laboratory, and Dept. of Comput. Sci. & NSF Engineering Research Center for CFS at Mississippi State Univ., 1996.

[6] E. Karrels and E. Lusk, "Performance Analysis of MPI Programs," in J. J. Dongarra and B. Tourancheau, editors, Environments and Tools for Parallel Scientific Computing, pages 195-200, SIAM, 1994.

[7] B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvin, K. Kunchithapadam, K. L. Karavanic, and T. Newhall, "The Paradyn Parallel Performance Measurement Tool," Computer, Vol. 28, pp. 37-46, November 1995.

[8] P. Mohapatra, C. Das, and T.-Y. Feng, "Performance analysis of cluster-based multiprocessors," IEEE Transactions on Computers, Vol. 43, No. 1, pp. 109-114, January 1994.

[9] N. Nupairoj and L. M. Ni, "Performance evaluation of some MPI implementations on workstation clusters," in Proceedings of the 1994 Scalable Parallel Libraries Conference, IEEE Computer Society Press, October 1994, pp. 98-105.

[10] K. Park, G. Kim, and M. Crovella, "On the relationship between file sizes, transport protocols, and self-similar network traffic," BU-CS-96-016, Computer Science Department, Boston University, Boston, Massachusetts, August 1996.

[11] D. A. Reed, P. C. Roth, R. A. Aydt, K. A. Shields, L. F. Tavera, R. J. Noe, and B. W. Schwartz, "Scalable Performance Analysis: The Pablo Performance Analysis Environment," in Proceedings of the Scalable Parallel Libraries Conference, IEEE Computer Society Press, 1994, pp. 104-113.

[12] S. Saini and D. H. Bailey, "NAS Parallel Benchmark Results 12-95," NASA Ames Research Center, Moffett Field, California, NAS-95-021, December 1995.

[13] A. C. Sawdey, M. T. O'Keefe, and W. B. Jones, "A General Programming Model for Developing Scalable Ocean Circulation Applications," Laboratory for Computational Science and Engineering, Univ. of Minnesota, Minneapolis, Minnesota, January 1997.

[14] M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. Dongarra, MPI: The Complete Reference, The MIT Press, Cambridge, Massachusetts, 1996.

[15] E. Strohmaier, "Statistical Performance Modeling: Case Study of the NPB 2.1 Results," UTK-CS-97-354, Computer Science Department, University of Tennessee, Knoxville, Tennessee, March 1997.

[16] Z. Xu and K. Hwang, "Modeling communication overhead: MPI and MPL performance on the IBM SP2," IEEE Parallel & Distributed Technology: Systems & Applications, Vol. 4, No. 1, pp. 9-24, 1996.

Figure 1 Histogram of the message sizes for the class "S" run of the BT benchmark on four processors. [Histogram omitted; x-axis: bin message size (bytes), y-axis: number of messages sent.]

Figure 2 Histogram of message sizes for the class "S" run of the SP benchmark on four processors. [Histogram omitted; x-axis: bin message size (bytes), y-axis: number of messages sent.]

Figure 3 Distribution of message sizes for class "A" runs of the BT benchmark (9 and 16 processors). [Histogram omitted; x-axis: bin message size (bytes), y-axis: number of messages sent.]

Figure 4 Distribution of message sizes for class "A" runs of the SP benchmark (9 and 16 processors). [Histogram omitted; x-axis: bin message size (bytes), y-axis: number of messages sent.]

Figure 5 Histogram of time between message sends for the class "S" run of the BT benchmark on four processors. [Histogram omitted; x-axis: bin time between sends (ms), y-axis: number of messages sent.]

Figure 6 Histogram of time between message sends for the class "S" run of the SP benchmark on four processors. [Histogram omitted; x-axis: bin time between sends (ms), y-axis: number of messages sent.]

Figure 7 Histogram of message sizes for the class "S" run of the LU benchmark. [Histogram omitted; x-axis: bin message size (bytes), y-axis: number of messages sent.]

Figure 8 Distribution of message sizes for class "A" runs of the LU benchmark (8 and 16 processors). [Histogram omitted; x-axis: bin message size (bytes), y-axis: number of messages sent.]

Figure 9 Histogram of time between sends for the class "S" run of the LU benchmark on four processors. [Histogram omitted; x-axis: bin time between sends (ms), y-axis: number of messages sent.]

"

0 0

80

0

200

0

320

120

360

0

392

0

304

0 0 00

50

100

150

200

250

300

350

400

450

<=

2

<=

4

<=

8

<=

16

<=

32

<=

64

<=

128

<=

256

<=

512

<=

1024

<=

2048

<=

4096

<=

8192

<=16

384

<=32

768

<=65

536

Bin Message Size (bytes)

Num

ber

of M

essa

ges

Sen

t by

4 P

roce

ssor

s

Figure 10 The distribution of message sizes for class“S” run of the MG benchmark.

Figure 11 Distribution of message sizes for class "A" runs of the MG benchmark (8 and 16 processors). [Histogram omitted; x-axis: bin message size (bytes), y-axis: number of messages sent.]

Figure 12 Histogram of time between sends for the class "S" run of the MG benchmark. [Histogram omitted; x-axis: bin time between sends (ms), y-axis: number of messages sent by 4 processors.]