Statistical Analysis of Message Passing Programs to Guide Computer Design
William E. Cohen and Basel Ali Mahafzah
Department of Electrical and Computer Engineering, College of Engineering
The University of Alabama in Huntsville, Huntsville, AL 35899
{cohen,mahafzah}@ece.uah.edu
Abstract

Little data exists on how message passing programs use parallel computers. The behavior of these programs can strongly influence design decisions made for future computer systems. The computer designer's use of incorrect assumptions about program behavior can degrade performance.

In many cases simple statistical parameters describing characteristics such as message sizes, destinations, sources, and times between sends would give the designers of the communication libraries and the computer hardware great insight into how the hardware is used by actual programs.

Techniques of collecting statistical information about the communication characteristics for system design have been applied to the parallel version of the NAS benchmarks. This paper describes the statistical data collected for multiprocessor runs of the NAS 2.1 benchmarks and some of the characteristics observed in that data.
1. Introduction
Parallel computers are complicated machines designed to efficiently perform computations. As with other machines designed for a specific purpose, the requirements of the task must be understood. The task that these parallel machines support is running large-scale application programs. For portability, many of these parallel application programs have been written to make use of message passing libraries. Thus, it is important that future parallel computers be designed to efficiently support this programming model. Poor performance can result if there is a mismatch between the assumed program behavior and the actual program behavior.
1060-3425/98 $10.00 (c) 1998 IEEE
There are a number of performance tools that have been used to evaluate the performance of programs on existing computer systems, e.g. Pablo [11] and Paradyn [7]. These tools often concentrate on time-based metrics, which are appropriate when attempting to minimize the run time of a particular program on a particular computer. However, it is often difficult to extend these results to computer architectures that have significantly different timings associated with the operations. These tools require the program to be executed on each different machine to gather the information.
Designers need methods of describing aspects of program behavior that are portable across different computer architectures. Statistical measurements of message passing programs would provide this type of information. A few numbers can capture essential characteristics of parallel programs. This approach avoids the difficulties of detailed trace-based data collection and clearly summarizes important properties to the computer architect. Therefore, this work focuses on measuring specific parameters (e.g., distribution of message sizes, average message size, and processing time between message sends) of message passing programs, with measurements obtained from existing parallel programs. Other researchers [4, 15] have proposed the statistical analysis of parallel application programs' communication characteristics. However, these works were limited to the inter-processor arrival times of messages, the distribution of message sources, and completion time of the programs. The data collected for the work described in this paper focuses on the message sizes and the time for processors to complete work between sends.
The Message Passing Interface (MPI) library [14] used for this work is described in greater detail in section 2. The method of instrumenting message passing programs is described in section 3. Section 4 describes the data collected for the NAS 2.1 benchmarks [1, 2] and makes some observations on the size of messages sent. Section 5 outlines future work planned to extend the utility of these techniques. Finally, section 6 summarizes the results.
2. Overview of Message Passing Interface
MPI is a library of portable, efficient, and flexible functions [5] that can be used in C, C++, and FORTRAN programs. This library can be used on massively parallel machines with distributed-memory architectures (e.g., nCUBE2, IBM SP2, CM-5, and Intel Paragon), and can be used on shared memory machines (e.g., Cray T3D). Also, MPI provides profiling libraries, e.g. time accounting, logfile creation and examination, and runtime animation [6].
In the MPI programming model, a task consists of one or more processes that communicate data between the separate processes by calling the MPI library routines. In a program that uses MPI, a fixed set of processes is created at program initialization, one process per processor. These processes use point-to-point communication to exchange data between pairs of processes and collective communication to exchange data between processes in a group.
3. Instrumentation of Code
A crucial element of this work is to instrument parallel programs in a manner that is portable, minimizes changes to the programs under study, and is able to collect the required data. The MPI standard [14] addresses the issue of instrumenting the MPI library. An overview of instrumenting MPI programs is discussed in section 3.1. The types of data that can be collected from this instrumentation are discussed in section 3.2.
3.1. The MPI Profiling Interface
Various implementations of the MPI library exist with different levels of performance, but all of these implementations share a common programming interface [14]. This interface to the application program provides a clean separation between local operations within a process and operations that involve other processes. Similarly, an MPI profiling interface is defined to ensure that it is relatively easy for designers of profiling tools to interface their codes to MPI implementations on different machines. The design of the MPI library allows the insertion of additional instrumentation code. This profiling code is restricted to the MPI interface between the application program and the MPI library.
The general scheme supported by all versions of MPI is to have two versions of the same set of library functions. The original library has the normally used names. The other library used for profiling has the same functions as the original library, but all of the functions have been renamed, e.g. prefixed with "PMPI_".
The instrumentation routines use the original names of the library functions and are called instead of the library routines. These instrumentation routines are simple wrappers that record the desired information and then call the renamed routines in the profiling library. Thus, to instrument an application program, the application program only needs to be linked with the wrappers and the MPI profiling library.
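The record-then-forward pattern described above can be sketched as follows. Python stands in for the C or Fortran of a real MPI build purely for illustration; the names mirror the MPI/PMPI naming convention, but the two-argument signature is a deliberate simplification of the real MPI_Send interface.

```python
import time

# Log of (byte_count, timestamp) pairs, one entry per send call.
message_log = []

def PMPI_Send(buf, dest):
    # Stand-in for the renamed library routine; in a real MPI build
    # this is the original implementation of the send operation.
    return 0

def MPI_Send(buf, dest):
    # Wrapper carrying the original name: record the desired
    # information, then forward to the renamed routine.
    message_log.append((len(buf), time.time()))
    return PMPI_Send(buf, dest)
```

Because the application only sees the original names, linking it against these wrappers instruments it without any source changes, which is exactly the property the profiling interface was designed to provide.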
3.2. Abilities of the Data Collection Technique
The MPI profiling interface [14] provides a portable and convenient method of collecting information about the performance of a program that uses the MPI library. However, there are limitations to the types of data that this interface can collect. The profiling interface has a limited view of what occurs in the program. It can observe the parameters that are passed into it and obtain information from calls to system functions, e.g. time of day. The MPI functions that implement the communication operations are black boxes; similarly, information about the user code is limited to what is passed into the instrumentation code as parameters. As a result the instrumentation code does not have a view of the low-level operations that are implementation specific, e.g. when a message is sent to the physical network device, or details of what the user code is doing when it is not calling the library.
However, the MPI profiling interface does enable the collection of data about how the library is being used. The following types of data can be collected using the profiling interface:

1. Source of the message.
2. Destination of the message.
3. Number of bytes in the message.
4. Type of data being sent.
5. Number of times each library routine is called.
6. When each library routine is called.

Although collection of this data does not give a complete picture of what a parallel program is doing, it can give designers valuable feedback on how programs use the message passing libraries. Instrumentation was written to collect data about the number of bytes sent in each message, the average message size, and the time between calls to the send functions. The following section illustrates how this type of data collection can be used to examine the behavior of existing parallel programs.
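The three measurements just described (bytes per message, average message size, and inter-send gaps) can be reduced from a raw send log with a short routine. This sketch assumes a hypothetical (nbytes, timestamp) log format and power-of-two histogram bins; the paper bins message sizes for its figures but does not state the bin edges.

```python
from statistics import mean

def summarize(send_events):
    """send_events: list of (nbytes, timestamp) pairs, one per send call."""
    sizes = [n for n, _ in send_events]
    times = [t for _, t in send_events]
    # Gaps between consecutive send calls (the "processing time" metric).
    gaps = [b - a for a, b in zip(times, times[1:])]
    # Message-size histogram with power-of-two bins (bin edges assumed).
    hist = {}
    for n in sizes:
        top = 1
        while top < n:
            top *= 2
        hist[top] = hist.get(top, 0) + 1
    return {"total_messages": len(sizes),
            "avg_size_bytes": mean(sizes),
            "size_histogram": hist,
            "gaps_between_sends": gaps}
```

The totals and averages correspond to the columns of table 2, and the histogram and gap list correspond to the per-benchmark figures discussed in section 4.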
4. Collected Data
The goal of this work is to build a library of information that concisely describes the characteristics of a set of parallel programs. We selected the Numerical Aerodynamic Simulation (NAS) benchmarks [1] as a set of programs to instrument and characterize. NAS uses the
MPI library, is freely available to researchers, and attempts to mimic the characteristics of numerical aerodynamic simulations. The data collection was limited to collecting histograms of message sizes for both sends and receives and some data about the time between processor sends for the sample size runs.
The instrumentation was performed on two clusters of workstations. The initial instrumentation development and data collection for four processor runs of the programs were performed on a cluster of uni-processor PCs. The larger program runs were performed on a cluster of single processor SUN SPARC workstations.
Each PC had an Intel 486DX4 100MHz processor, 256KB L2 cache, 16MB of RAM, and a 1GB hard disk. The four PCs were connected via 10Mb/s Ethernet. The software on these systems consisted of the Linux operating system (kernel 1.3.20), MPICH version 1.1 message passing libraries, and the GNU G77 version 2.7.0 Fortran compiler.
A variety of different SUN workstations made up the processing elements in the 16 processor cluster, but all machines used in this cluster had a minimum of 64MB of memory, a single SPARC processor, and a 10Mb/s Ethernet connection. The software on the SUN workstations consisted of Solaris 2.5, MPICH version 1.1, and the GNU G77 version 2.7.2.f.1 Fortran compiler.
The two clusters of workstations recorded the characteristics of the NAS benchmarks. Section 4.1 will describe the NAS benchmarks and the data collected for the individual programs in greater detail. Section 4.2 examines some of the characteristics observed in the data collected.
4.1. NAS Benchmarks
The NAS benchmarks [1] are a collection computational kernels developed by NASA Ames
to determine the suitability of high performance computers for performing aerodynamic simulations via computations. The code in the benchmarks omits input and output operations and concentrates on only the computations. Initially, the NAS 1.0 benchmarks were distributed as "pencil and paper" specifications because of the diversity of high-performance computer hardware and software [1]. With the development of portable parallel software, e.g. message passing libraries, a later generation of NAS, NAS 2.1, is available as code using the MPI communication libraries. NAS 2.1 reflects what a typical parallel application programmer might write.
The NAS 2.1 benchmarks contain five computational kernels: 3-D FFT (FT), LU solver (LU), multigrid (MG), block tridiagonal solver (BT), and pentadiagonal solver (SP). There are four classes of program runs: sample class ("S"), class "A", class "B", and class "C". The sample class requires the fewest computations and class C requires the most. Table 1 shows the problem sizes for the BT, SP, LU, and MG benchmarks and the number of floating point operations performed for each [2, 12]. These different classes allow the runs of the program to be scaled to the
Table 1 NAS Benchmark problem sizes.

Bench-   Sample class ("S")          Class "A"
mark     Problem    FLOPS            Problem    FLOPS
         size       (x 10^6)         size       (x 10^9)
BT       12^3       244.8            64^3       181.3
SP       12^3       172.8            64^3       102.0
LU       12^3       98.0             64^3       64.6
MG       32^3       12.8             256^3      3.9
Table 2 Summary of total number of messages sent, average size of messages sent, and number of floating point operations performed per byte sent for NAS benchmarks. (Tot. mesg. in thousands; avg. mesg. size in thousands of bytes. The class "A" middle columns used 9 processors for BT and SP and 8 processors for LU and MG.)

         Sample size ("S"), 4 proc.   Class A, 9/8 proc.          Class A, 16 proc.
Bench-   Tot.    Avg.    FLOPS per    Tot.    Avg.    FLOPS per   Tot.    Avg.    FLOPS per
mark     mesg.   size    mesg. byte   mesg.   size    mesg. byte  mesg.   size    mesg. byte
BT       2.95    4.85    17.1         32.6    69.5    80.0        77.3    45.7    51.3
SP       4.87    3.04    11.7         65.0    60.7    25.9        154.    38.4    17.2
LU       4.42    .723    30.6         315.    3.07    66.8        756.    1.92    44.5
MG       1.78    1.27    5.69         5.71    27.0    25.3        11.0    18.8    18.8
size of the machine. Data was collected on the characteristics for four of the benchmarks: LU, MG, BT, and SP for the sample and "A" class problems. The data collected consisted of average message size, the distribution of message sizes, and the distribution of time between message sends. The remainder of section 4.1 will examine the data collected and show how it would influence the design of a computer system.
For the NAS 2.1 benchmarks we collected data on the size and the number of messages sent by each processor. Table 2 summarizes the total number of messages sent by all the processors and the average size of the messages for each benchmark program. The four programs have a range of average message sizes, with LU having the smallest average message size and BT having the largest average message size. Table 2 shows an additional computed metric for each of the program runs, the number of floating point operations (FLOPS) per byte sent. This metric can be used to estimate the bandwidth required for balanced system operation. The characteristics of the programs will be discussed in greater detail in sections 4.1.1, 4.1.2, and 4.1.3. Section 4.2 will discuss the implications of the recorded data on the design of future computers.
4.1.1. BT and SP Benchmarks
The BT and SP benchmarks perform similar computations and have similar communication patterns, so they will be discussed together in this section. These benchmarks are designed to be representative of the computations associated with the implicit operators in Computational Fluid Dynamics (CFD) code [1]. There are differences between BT and SP in the specific computations performed, but they share common data distribution and communication patterns.
Both BT and SP solve multiple independent systems of non-diagonally dominant block tridiagonal equations with a 5x5 block size for a certain number of time steps. The equations are arranged as a cube with an equal number of equations in each direction. The equations are grouped into cells with n cells in each direction, giving a total of n^3 cells. The number of processors used to solve each problem is equal to the number of cells on a face (n^2). Thus, the number of processors used to solve the problem must have an integer square root, and each processor gets n cells. Rather than giving each processor a simple vertical column of cells, each processor obtains a diagonal column of cells.
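The square-root constraint on the processor count can be captured in a few lines; the function name and return format below are ours, not part of the benchmarks.

```python
import math

def bt_sp_layout(num_procs):
    # BT/SP place n^3 cells on n^2 processors, so the processor
    # count must have an integer square root.
    n = math.isqrt(num_procs)
    if n * n != num_procs:
        raise ValueError("BT/SP require a square number of processors")
    return {"cells_per_side": n,
            "total_cells": n ** 3,
            "cells_per_processor": n}
```

For example, bt_sp_layout(9) yields 3 cells per side, 27 cells in total, and 3 cells per processor, matching the class "A" 9-processor runs reported in table 2.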
In each iteration of the program, the processors exchange variable values for the equations on the faces of each cell with the adjacent cell. This data exchange is performed in the subroutine copy_faces. In the data layout for the BT and SP benchmarks, if processor i contains an adjoining cell in a particular direction (e.g. east) from a cell in processor j, then processor i has n adjoining cells in that direction from processor j. Thus,
copy_faces groups these multiple cell faces for a particular direction into a single send, yielding the largest average message size of the NAS benchmarks.
Figure 1 shows the histogram of message sizes for the sample size run of BT on 4 processors, and it is clear that the message sizes are not uniformly distributed. Figure 2 shows similar characteristics for the sample size run of SP on 4 processors. This non-uniform distribution of message sizes is due to the number of cell faces that are being sent to the adjacent processor. These same characteristics are echoed in the class "A" runs of BT and SP for 9 and 16 processors shown in figures 3 and 4.
Figures 5 and 6 show histograms measuring the time between consecutive calls to the send functions in the library for the sample class runs of the BT and SP benchmarks on four processors. Due to the variations in the performance of machines used for the 9 and 16-processor runs, no timing data was taken. One surprising feature present in both of the timing histograms is the dual peaks. If a typical queuing theory model [8] was used, where computation tasks have an exponential distribution for completion and a single send occurs after the computation is complete, then the histograms should show a straight line for the bin sizes used. The individual computational tasks require time around the right peak to complete; this is followed by a series of closely spaced sends which cause the left peak in the histogram.
For each time step in both the BT and SP benchmarks, the processors compute all the values for the elements in the cells they hold, and then the copy_faces subroutine is used to exchange the data between the processors. Thus, the processors must wait for the communications to complete before the next iteration's computations are performed. A more efficient method of performing the iterations is to split the computations into computations that have results that will be sent to other processors and computations that compute only locally used values [13]. The communication operations sending the results from the first part of the computations can be overlapped with the computations producing only locally used values. Table 3 gives the ratio of the total elements computed to the interior elements computed by each processor. For the sample class problems most of the elements are on the exterior of the cell, causing an unfavorable ratio of total elements to interior elements. The ratio is better for the class "A" runs, but as more processors are used to solve the problem, the ratio becomes less favorable.
Using the ratio of total elements to interior elements and the operations per byte sent, an estimate of the required network performance can be made. For example, suppose the processors used in the machine are estimated to provide 100 million floating point operations per second. For the class "A" SP benchmark running on 16 processors, a processor would produce 5.8MB/s (100 MFLOPS / 17.2 FLOPS per byte) of network traffic on
average. However, this network traffic could only occur during the computation of the interior elements and would have to be multiplied by 1.49 (the ratio of total elements to interior elements), indicating that each processor requires 8.66MB/s of bandwidth and 16 processors require an aggregate bandwidth of 139MB/s. For this example system, 100Mb/s (10MB/s) Ethernet would not provide sufficient bandwidth for a balanced system and the network would limit performance.
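The arithmetic behind this estimate can be reproduced directly; the function below is ours, with the 17.2 FLOPS/byte figure taken from table 2, the 1.49 ratio from table 3, and 1 MB taken as 10^6 bytes.

```python
def per_proc_traffic_MBps(mflops, flops_per_byte):
    # Average bytes/s generated by one processor, in MB/s (1 MB = 10^6 bytes):
    # (mflops * 10^6 FLOPS/s) / (flops_per_byte FLOPS/byte) / 10^6 bytes/MB.
    return mflops / flops_per_byte

avg = per_proc_traffic_MBps(100, 17.2)   # ~5.8 MB/s on average
peak = avg * 1.49                        # ~8.66 MB/s while computing interior elements
aggregate = 16 * peak                    # ~139 MB/s across 16 processors
```

The same function reproduces the LU figure discussed in section 4.1.2: per_proc_traffic_MBps(100, 44.5) gives roughly 2.25 MB/s per processor.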
Table 3 Ratio of total elements to interior elements in the BT and SP benchmarks.

         Ratio, Total Elements : Interior Elements
Class "S", 4 proc.    Class "A", 9 proc.    Class "A", 16 proc.
3.38:1.0              1.33:1.0              1.49:1.0
4.1.2. LU Benchmark
The LU benchmark is another program with calculations that are representative of the computations performed in CFD code. The LU benchmark performs the computations on a three-dimensional array, with 12 elements in each direction for the sample problem and 64 elements in each direction for the class "A" problem. Recursive bisection is performed on two of the three dimensions in the array to partition the problem. Thus, each processor works on a prism of elements that extends from the top to the bottom of the array. Each prism is composed of "tiles", horizontal planes of elements.
The LU benchmark starts with a single processor performing Symmetric Successive Over-Relaxation (SSOR) on a tile in a corner of the array. Once the SSOR operation is completed on a tile, the updated values for the elements along the edges of the tile are sent to the adjoining processors. Once the adjoining processors receive the updates, they start to perform SSOR on the tiles that needed the information contained in the messages, and the processor that just sent the messages starts to perform SSOR on the tile above the completed tile. Thus, a wave front of tile processing extends from the starting corner.
This method of processing leads to a large number of small messages being sent between the processors. Figure 7 shows that for the sample sized run with four processors there are 10 times as many messages less than 257 bytes in size as messages greater than 4096 bytes in size. Figure 8 shows that for the class "A" runs of the LU benchmark for 8 and 16-processor runs there are 60 times as many messages less than 2049 bytes as there are messages greater than 65536 bytes. Thus, for this benchmark, message latency and setup overhead may be more of a concern than peak bandwidth because of the small average message size.
Another effect of the disparity of message sizes is that the average message size is not representative of the message traffic. The actual messages are either much smaller or much larger than the average message size listed in table 2.
The average bandwidth requirements can be computed for the LU benchmark using the estimated rate that floating point operations are performed by the processor and the number of operations per byte sent in table 2. Unlike the BT and SP benchmarks, the LU benchmark fully overlaps the transmission of values with the computations. Thus, if the network can supply the average bandwidth obtained by dividing the rate the processor can perform floating point operations by the number of operations per byte sent in table 2, the computer system is balanced. Assuming a computer system composed of 16 processors each capable of 100 million floating point operations per second, it would require network connections with a bandwidth of 2.25MB/s for each processor and a total aggregate bandwidth of 36MB/s to provide balanced performance for 16 processors performing the class "A" LU benchmark.
Figure 9 shows the time between message sends for the sample class run with four processors. Again, the distribution of times between sends is not exponential. A significant number of sends are separated by less than 2ms. This is due to the multiple messages sent during the edge updates.
4.1.3. MG Benchmark
The MG benchmark is a multigrid solver, which finds a solution to a relatively coarse grid, and then refines the solution for higher and higher resolution grids describing the same problem. Once the solution is determined at the highest resolution grid, the results are propagated back to the lower resolution grids. This algorithm allows iterative methods to converge to a solution more quickly.
The MG benchmark solves the scalar discrete Poisson equations for a three-dimensional grid. A 32x32x32 grid is the highest resolution grid for the sample sized problem, and a 256x256x256 grid is the highest resolution grid for the class "A" problem. Because of the various grid sizes being solved, there is a wide range of sizes for the messages sent between the processors in the machine. Figure 10 shows the histogram of message sizes for the sample sized problem, and figure 11 shows the histograms of the message sizes for the eight and 16-processor runs of the MG benchmark. Even for the class "A" problem a significant number of small messages are exchanged between the processors. For this type of algorithm the amount of work being performed by each processor is small when the small messages are being sent. Thus, hiding the overhead of communication with computation may not be feasible and, like the LU benchmark, overall
performance may be influenced more by latency than by bandwidth.
Figure 12 provides the histogram of the times between consecutive sends. The sample class MG benchmark has the fewest number of operations out of the four sample class benchmarks, as shown in table 1. Thus, the time between sends is smaller than that for the other benchmarks, and the majority of the sends are separated by less than 4ms of time.
4.2. Discussion of Results
The driving force for collecting these parameters is to characterize parallel programs and use that information to guide the design of future computer systems. The data collected about the message sizes is portable to any computer architecture that supports a message passing model and will not vary as long as the computers have the same sizes for the data types. Several observations can be made about the data:

1) The collected data can be used to estimate the required interconnection bandwidth.
2) There are significant variations in message sizes between the benchmarks.
3) Most of the applications did not use a wide variety of message sizes.
4) The average sizes of the messages were not always representative of the messages.
5) The traffic presented to the network was typically in bursts.

The data collected in this study could be used to estimate the minimum bandwidth required to prevent the interconnection network from being the component in the system that limits performance. As shown in the examples, a 100Mb/s broadcast network would not provide sufficient bandwidth for the class "A" BT, SP, and LU benchmarks for a 16 processor cluster with each processor capable of performing 100 million floating point operations per second.
There is a factor of 6.7 between the largest and smallest average message size for the sample size problems shown in table 2. This variation becomes even more pronounced in the class "A" problems, where the ratio is 23. The small average message size in LU is explained by the fine-grained nature of the program [2]. The average size of messages shown in the table for the LU benchmark is considerably smaller than the size of messages needed for obtaining the peak performance of the message passing libraries [9, 16].
Many analytical models using queuing theory assume either a continuous distribution of message sizes or a single message size [10, 8]. However, the only benchmark that appeared to approach these ideals was MG, which had a wide range of message sizes. The other three benchmarks used few categories of sizes. The different message sizes
could also be separated by significant distances, with the class "A" LU benchmark having at least a factor of 16 between the two message size categories.
As a result of the wide separation in message sizes, the average size of the messages sent shown in table 2 may not represent the type of messages sent. Rather than forming a single distribution of message sizes around an average, it may be more appropriate to have several widely separated distributions of message sizes in the models.
Finally, a common simplification made in analytical models is to assume an exponential distribution of times between individual send operations [3]. This time between sends is the processing time. However, for the benchmarks examined for this work a more appropriate model may be to have some limited range of compute times (e.g. a uniform distribution between a minimum and maximum time) and then to have bursts of multiple sends at the end of the compute time.
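The proposed model, a bounded compute time followed by a burst of sends, can be sketched as a generator of inter-send times. Every parameter below (burst length, time bounds, intra-burst gap) is a placeholder for illustration, not a measured value from the benchmarks.

```python
import random

def intersend_times(num_steps, burst_len, tmin, tmax, gap):
    """One sample from the proposed model: each compute step draws a
    uniformly distributed compute time, then issues a burst of
    closely spaced sends at the end of the step."""
    times = []
    for _ in range(num_steps):
        times.append(random.uniform(tmin, tmax))   # long gap before the first send of a step
        times.extend([gap] * (burst_len - 1))      # short gaps within the burst
    return times
```

Histograms of samples from this model would show the dual peaks observed in figures 5 and 6, rather than the straight line an exponential assumption predicts.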
5. Future Work
The results presented in this paper are rather limited: measurements of four existing application programs that use message passing. Future work will extend the collection and analysis of the communication metrics discussed in this paper. Work in the following areas is planned:
1) Collect data from a wider variety of programs.
2) Collect data from larger runs of programs to determine if the observed statistical parameters hold.
3) Consider measurement of additional parameters.
4) Apply the parameters collected to improve the quality of the queuing theory models built to predict program performance.
There are a number of desirable extensions that should be made to this work. Section 4 only presents data taken for the NAS benchmarks. These programs were designed to represent typical code found in aerodynamic simulations being performed at NASA Ames [1]. However, it is likely that the characteristics of other application programs, e.g. n-body simulations and image rendering, will differ from the characteristics of the NAS benchmarks. Instrumentation of a broader sampling of programs would ensure that the data collected is not skewed by the peculiarities of a particular type of program.
The data presented was for relatively small-scale runs of programs. It would be instructive to determine whether the properties for these small-scale runs hold for larger programs, e.g. the class "B" and "C" NAS problems. To determine these scaling properties, large runs need to be made. These runs would be for the larger class "B" and "C" problems in NAS. Also, the number of processors in the runs should be varied.
The data collected for this paper was almost exclusively for the message sizes sent between processors. Data
describing the time between sends was limited to the sample sized runs. As the work progresses, other parameters may be collected, such as the distribution of message sources and destinations, to aid in the simulation of "hot spots" in network traffic. These additional parameters should be defined in a machine-independent manner.
Finally, additional work to determine how these parameters improve the prediction of queuing theory models should be performed. The parameters collected could be used to determine the types of models that would best reflect a program's behavior. The collected data could calibrate the simulation models, indicating when there are significant differences between actual programs and the simulations.
6. Conclusion
This paper demonstrates the utility of collecting parameters describing the characteristics of the NAS benchmarks. The parameters collected provide insight into the operation of the NAS benchmarks on parallel computers. It shows that two of the benchmarks, LU and MG, send a number of relatively small messages between the processors. The timing information also shows that the network traffic from a processor may not be even, but may occur in bursts. These characteristics are difficult to note in detailed step-by-step simulations because the usual parameter being measured is the total run time.
Computer architects would be able to use the data to gain an understanding of what features would best support parallel programs. This same data could be used to produce less detailed simulations, e.g. queuing theory models [3, 8, 10], that still capture the essential behavior of the parallel programs. A library of parameters describing the performance of many parallel programs will aid computer architects in producing computers that efficiently support parallel programs.
7. Acknowledgements
We would like to acknowledge the many useful discussions with Jeffery Kulick, Constantine Katsinis, B. Earl Wells, and Rhonda Gaede. We would also like to acknowledge the suggestions made by Gayathri Krishnamurthy to improve this paper.
8. References
[1] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga, "The NAS Parallel
Benchmarks," NASA Ames Research Center, Moffett Field, California, RNR-94-007, March 1994.
[2] D. Bailey, T. Harris, W. Saphir, R. Wijngaart, A. Woo, and M. Yarrow, "The NAS Parallel Benchmarks 2.0," NASA Ames Research Center, Moffett Field, California, NAS-95-020, December 1995.
[3] B. Bodnar and A. Liu, "Modeling and Performance Analysis of Single-Bus Tightly-Coupled Multiprocessors," IEEE Transactions on Computers, Vol. 38, No. 3, pp. 464-470, March 1989.
[4] S. Chodnekar, V. Srinivasan, A. S. Vaidya, A. Sivasubramaniam, and C. R. Das, "Towards a Communication Characterization Methodology for Parallel Applications," in Proceedings of High-Performance Computer Architecture 3, San Antonio, Texas, February 1-5, 1997.
[5] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard,” Math. & Comput. Sci. Division, Argonne National Laboratory, and Dept. of Comput. Sci. & NSF Engineering Research Center for CFS, Mississippi State Univ., 1996.
[6] E. Karrels and E. Lusk, “Performance Analysis of MPI Programs,” in J. J. Dongarra and B. Tourancheau, editors, Environments and Tools for Parallel Scientific Computing, pages 195-200, SIAM, 1994.
[7] B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvin, K. Kunchithapadam, K. L. Karavanic, and T. Newhall, “The Paradyn Parallel Performance Measurement Tool,” Computer, Vol. 28, pp. 37-46, November 1995.
[8] P. Mohapatra, C. Das, and T.-Y. Feng, “Performance analysis of cluster-based multiprocessors,” IEEE Transactions on Computers, Vol. 43, No. 1, pp. 109-114, January 1994.
[9] N. Nupairoj and L. M. Ni, “Performance evaluation of some MPI implementations on workstation clusters,” in Proceedings of the 1994 Scalable Parallel Libraries Conference, IEEE Computer Society Press, October 1994, pp. 98-105.
[10] K. Park, G. Kim, and M. Crovella, “On the relationship between file sizes, transport protocols, and self-similar network traffic,” BU-CS-96-016, Computer Science Department, Boston University, Boston, Massachusetts, August 1996.
[11] D. A. Reed, P. C. Roth, R. A. Aydt, K. A. Shields, L. F. Tavera, R. J. Noe, and B. W. Schwartz, “Scalable Performance Analysis: The Pablo Performance Analysis Environment,” in Proceedings of the Scalable Parallel Libraries Conference, IEEE Computer Society Press, 1994, pp. 104-113.
[12] S. Saini and D. H. Bailey, “NAS Parallel Benchmark Results 12-95,” NASA Ames Research Center, Moffett Field, California, NAS-95-021, December 1995.
[13] A. C. Sawdey, M. T. O’Keefe, and W. B. Jones, “A General Programming Model for Developing Scalable Ocean Circulation Applications,” Laboratory for Computational Science and Engineering, Univ. of Minnesota, Minneapolis, Minnesota, January 1997.
[Figure 1: histogram; x-axis “Bin Message Size (bytes)”, y-axis “Number of Messages Sent”.]
Figure 1 Histogram of the message sizes for the class “S” run of the BT benchmark on four processors.
[Figure 2: histogram; x-axis “Bin Message Size (bytes)”, y-axis “Number of Messages Sent”.]
Figure 2 Histogram of message sizes for the class “S” run of the SP benchmark on four processors.
[14] M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. Dongarra, MPI: The Complete Reference, The MIT Press, Cambridge, Massachusetts, and London, England, 1996.
[15] E. Strohmaier, “Statistical Performance Modeling: Case Study of the NPB 2.1 Results,” UTK-CS-97-354, Computer Science Department, University of Tennessee, Knoxville, Tennessee, March 1997.
[16] Z. Xu and K. Hwang, “Modeling communication overhead: MPI and MPL performance on the IBM SP2,” IEEE Parallel & Distributed Technology: Systems & Applications, Vol. 4, No. 1, pp. 9-24, 1996.
[Figure 3: histogram; x-axis “Bin Message Size (bytes)”, y-axis “Number of Messages Sent”; series “Compiled Processors = 9” and “Compiled Processors = 16”.]
Figure 3 Distribution of message sizes for class “A” runs of the BT benchmark.
[Figure 4: histogram; x-axis “Bin Message Size (bytes)”, y-axis “Number of Messages Sent”; series “Compiled Processors = 9” and “Compiled Processors = 16”.]
Figure 4 Distribution of message sizes for class “A” runs of the SP benchmark.
[Figure 5: histogram; x-axis “Bin Time Between Sends (millisec)”, y-axis “Number of Messages Sent”.]
Figure 5 Histogram of time between message sends for the class “S” run of the BT benchmark on four processors.
[Figure 6: histogram; x-axis “Bin Time Between Sends (millisec)”, y-axis “Number of Messages Sent”.]
Figure 6 Histogram of time between message sends for the class “S” run of the SP benchmark on four processors.
[Figure 7: histogram; x-axis “Bin Message Size (bytes)”, y-axis “Number of Messages Sent”.]
Figure 7 Histogram of message sizes for the class “S” run of the LU benchmark.
[Figure 8: histogram; x-axis “Bin Message Size (bytes)”, y-axis “Number of Messages Sent”; series “Compiled Processors = 8” and “Compiled Processors = 16”.]
Figure 8 Distribution of message sizes for class “A” runs of the LU benchmark.
[Figure 9: histogram; x-axis “Bin Time Between Sends (millisec)”, y-axis “Number of Messages Sent”.]
Figure 9 Histogram of time between sends for the class “S” run of the LU benchmark on four processors.
[Figure 10: histogram; x-axis “Bin Message Size (bytes)”, y-axis “Number of Messages Sent by 4 Processors”.]
Figure 10 The distribution of message sizes for the class “S” run of the MG benchmark.
[Figure 11: histogram; x-axis “Bin Message Size (bytes)”, y-axis “Number of Messages Sent”; series “Compiled Processors = 8” and “Compiled Processors = 16”.]
Figure 11 Distribution of message sizes for class “A” runs of the MG benchmark.
[Figure 12: histogram; x-axis “Bin Time Between Sends (millisec)”, y-axis “Number of Messages Sent by 4 Processors”.]
Figure 12 Histogram of time between sends for the class “S” run of the MG benchmark.