National Library of Canada / Bibliothèque nationale du Canada
Acquisitions and Bibliographic Services / Acquisitions et services bibliographiques
395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Supervisor: Dr. Nikitas J. Dimopoulos
Abstract

With the availability of fast microprocessors and small-scale multiprocessors, inter-node communication has become an increasingly important factor that limits the performance of parallel computers. Essentially, message-passing parallel computers require extremely short communication latency so that message transmissions have minimal impact on the overall computation time. This thesis concentrates on issues regarding hardware communication latency in single-hop reconfigurable networks, and software communication latency regardless of the type of network.
The first contribution of this thesis is the design and evaluation of two different categories of prediction techniques for message-passing systems. This thesis utilizes the communications locality property of message-passing parallel applications to devise a number of heuristics that can be used to predict the target of subsequent communication requests, and to predict the next consumable message at the receiving ends of communications.
Specifically, I propose two sets of predictors: cycle-based predictors, which are purely dynamic predictors, and tag-based predictors, which are static/dynamic predictors. The proposed predictors, especially Better-cycle2 and Tag-bettercycle2, perform very well on the application benchmarks studied in this thesis. They could be easily implemented on the network interface due to their simple algorithms and low memory requirements.
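The flavour of the purely dynamic predictors can be conveyed with a small sketch. The following Python fragment is an illustrative simplification, not one of the exact algorithms evaluated in Chapter 3, and all names in it are mine: it records the destinations seen between two occurrences of a "cycle head" and predicts that the same cycle of destinations will repeat.

```python
class SingleCyclePredictor:
    """Illustrative cycle-style message-destination predictor.

    A simplified sketch of the idea only: record the sequence of
    destinations between two occurrences of a 'cycle head' and
    assume the cycle repeats on subsequent sends.
    """

    def __init__(self):
        self.head = None   # first destination seen: the cycle head
        self.cycle = []    # destinations recorded in the last full cycle
        self.forming = []  # cycle currently being recorded
        self.pos = 0       # replay position within self.cycle

    def predict(self):
        """Predicted destination of the next send (None if unknown)."""
        if not self.cycle:
            return None
        if self.pos < len(self.cycle):
            return self.cycle[self.pos]
        return self.head   # end of the cycle: expect the head again

    def update(self, dest):
        """Feed the destination of the send that actually occurred."""
        if self.head is None:
            self.head = dest
        elif dest == self.head:   # cycle closed: commit and restart
            self.cycle = self.forming
            self.forming = []
            self.pos = 0
        else:
            self.forming.append(dest)
            self.pos += 1


def hit_rate(trace):
    """Fraction of sends whose destination was predicted correctly."""
    predictor, hits = SingleCyclePredictor(), 0
    for dest in trace:
        hits += (predictor.predict() == dest)
        predictor.update(dest)
    return hits / len(trace)
```

On a perfectly cyclic trace such a predictor misses only during the first cycle and then hits on every send, which is the behaviour the communications locality property suggests for the benchmark applications.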
As the second contribution of this thesis, I show that the majority of reconfiguration delays in single-hop reconfigurable networks can be hidden by using one of the proposed high hit-ratio predictors. The proposed predictors can be used to establish a communication pathway between a source and a destination in such networks before this pathway is to be used.
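A back-of-the-envelope model shows why a correct prediction hides reconfiguration delay: if pathway setup starts as soon as the destination is predicted, it overlaps the computation interval that precedes the send, and only the part of the delay that does not fit in that interval remains exposed. The model and the names below are illustrative, not the thesis's own formulation.

```python
def residual_reconfig_time(gaps, d, predicted):
    """Toy model of reconfiguration-delay hiding.

    gaps:      inter-send computation times preceding each send
    d:         reconfiguration delay of the interconnect
    predicted: per-send flags, True when the destination was
               predicted correctly (so setup could start early)

    For a correct prediction, min(gap, d) of the delay overlaps
    computation and is hidden; a misprediction exposes the full d.
    Returns the total exposed reconfiguration time.
    """
    total = 0.0
    for gap, hit in zip(gaps, predicted):
        total += d - min(gap, d) if hit else d
    return total
```

For example, with d = 25 microseconds, a correctly predicted send preceded by a 40-microsecond computation gap exposes no reconfiguration time at all, while a mispredicted send always exposes the full 25 microseconds.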
This thesis' third contribution is the analysis of a broadcasting algorithm that utilizes latency hiding and reconfiguration in the network to speed up the broadcasting operation. The analysis yields closed formulations for the termination time of the algorithms.
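As a toy version of such an analysis (not the thesis's closed forms): if every informed node can forward the message to k new nodes per step, the number of informed nodes grows by a factor of k + 1 each step, so a greedy broadcast terminates in ceil(log base (k+1) of N) steps; charging each step one reconfiguration delay d plus one transmission time gives a simple termination-time estimate.

```python
def greedy_broadcast_rounds(n, k):
    """Rounds for a greedy broadcast in which every informed node
    informs k new nodes per round, so the informed set grows by a
    factor of (k + 1) each round.  Illustrative model only; the
    thesis derives exact closed-form termination times."""
    rounds, informed = 0, 1
    while informed < n:
        informed *= (k + 1)
        rounds += 1
    return rounds  # equals ceil(log_{k+1}(n))


def termination_time(n, k, d, t_m):
    """Toy termination time: each round pays one reconfiguration
    delay d plus one message transmission time t_m."""
    return greedy_broadcast_rounds(n, k) * (d + t_m)
```

Under single-port modeling (k = 1) this reduces to the familiar binomial-tree broadcast, which reaches 8 nodes in 3 rounds.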
Table of Contents

Abstract
Table of Contents
List of Figures
List of Tables
Trademarks
Glossary
Acknowledgments

Chapter 1  Introduction
  1.1  Communications Locality and Prediction Techniques
  1.2  Using the Proposed Predictors at the Send Side
  1.3  Redundant Message Copying in Software Messaging Layers
  1.4  Collective Communications
  1.5  Thesis Contributions
Chapter 2  Application Benchmarks and Experimental Methodology
  2.1  Parallel Benchmarks
    2.1.1  The NAS Parallel Benchmarks Suite
      2.1.1.1  CG
      2.1.1.2  MG
      2.1.1.3  LU
      2.1.1.4  BT and SP
    2.1.2  PSTSWM
    2.1.3  QCDMPI
  2.2  Applications' Communication Primitives
    2.2.1  MPI_Send
    2.2.2  MPI_Isend
    2.2.3  MPI_Sendrecv_replace
    2.2.4  MPI_Recv
    2.2.5  MPI_Irecv
    2.2.6  MPI_Wait
    2.2.7  MPI_Waitall
  2.3  Experimental Methodology
Chapter 3  Design and Evaluation of Latency Hiding/Reduction Message Destination Predictors
  3.1  Introduction
    3.1.1  Message Switching Layers
    3.1.2  Reconfigurable Optical Networks
      3.1.2.1  Communication Modeling
  3.2  Communication Frequency and Message Destination Distribution
  3.3  Communication Locality and Caching
    3.3.1  The LRU, FIFO and LFU Heuristics
  3.4  Message Destination Predictors
    3.4.1  The Single-cycle Predictor
    3.4.2  The Single-cycle2 Predictor
    3.4.3  The Better-cycle and Better-cycle2 Predictors
    3.4.4  The Tagging Predictor
    3.4.5  The Tag-cycle and Tag-cycle2 Predictors
    3.4.6  The Tag-bettercycle and Tag-bettercycle2 Predictors
  3.5  Predictors' Comparison
    3.5.1  Predictors' Memory Requirements
  3.6  Using Message Predictors
  3.7  Summary
Chapter 4  Reconfiguration Time Enhancements Using Predictors
  4.1  Distribution of Message Sizes
  4.2  Inter-send Computation Times
  4.3  Total Reconfiguration Time Enhancement
  4.4  Predictors' Effect on the Receive Side
  4.5  Summary
Chapter 5  Collective Communications on a Reconfigurable Interconnection Network
  5.1  Introduction
  5.2  Communication Modeling for Broadcasting/Multi-broadcasting
  5.3  Broadcasting and Multi-broadcasting
    5.3.1  Broadcasting
      5.3.1.1  Analysis of the Greedy Algorithm
      5.3.1.2  Grouping Schema
    5.3.2  Multi-broadcasting
  5.4  Communication Modeling for Other Collective Communications
  5.5  Scattering
  5.6  Multinode Broadcasting
  5.7  Total Exchange
  5.8  Summary
Chapter 6  Efficient Communication Using Message Prediction for Clusters of Multiprocessors
  6.1  Introduction
  6.2  Motivation and Related Work
  6.3  Using Message Predictions
  6.4  Experimental Methodology
  6.5  Receiver-side Locality Estimation
    6.5.1  Communication Locality
    6.5.2  The LRU, FIFO and LFU Heuristics
  6.6  Message Predictors
    6.6.1  The Tagging Predictor
    6.6.2  The Single-cycle Predictor
    6.6.3  The Tag-cycle2 Predictor
    6.6.4  The Tag-bettercycle2 Predictor
  6.7  Message Predictors' Comparison
    6.7.1  Predictors' Memory Requirements
  6.8  Summary

Chapter 7  Conclusions and Directions for Future Research
  7.1  Future Research

Bibliography

Appendix A  Removing Timing Disturbances
List of Figures
Figure 1.1: A generic parallel computer
Figure 3.1: RON(k, N), a massively parallel computer interconnected by a complete free-space optical interconnection network
Figure 3.2: Number of send calls per process in the applications under different system sizes
Figure 3.3: Number of message destinations per process in the applications under different system sizes
Figure 3.4: Distribution of message destinations in the applications when N = 64
Figure 3.5: Distribution of message destinations in the applications for process zero, when N = 64
Figure 3.6: Comparison of the LRU, FIFO, and LFU heuristics when N = 64
Figure 3.7: Effects of the scalability of the LRU, FIFO, and LFU heuristics on the BT, SP and CG applications
Figure 3.8: Effects of the scalability of the LRU, FIFO, and LFU heuristics on the MG and LU applications
Figure 3.9: Effects of the scalability of the LRU, FIFO, and LFU heuristics on the PSTSWM and QCDMPI applications
Figure 3.10: Operation of the Single-cycle predictor on a sample request sequence
Figure 3.11: Effect of the Single-cycle predictor on the applications
Figure 3.12: Comparison of the performance of the Single-cycle predictor with the LRU, LFU, and FIFO heuristics on the applications under single-port modeling when N = 64
Figure 3.13: Operation of the Single-cycle2 predictor on the sample request sequence
Figure 3.14: Effect of the Single-cycle2 predictor on the applications
Figure 3.15: State diagram of the Better-cycle predictor
Figure 3.16: Operation of the Better-cycle predictor on the sample request sequence
Figure 3.17: Effect of the Better-cycle predictor on the applications
Figure 3.18: Operation of the Better-cycle2 predictor on the sample request sequence
Figure 3.19: Effect of the Better-cycle2 predictor on the applications
Figure 3.20: Effects of the Tagging predictor on the applications
Figure 3.21: Effects of the Tag-cycle predictor on the applications
Figure 3.22: Effects of the Tag-cycle2 predictor on the applications
Figure 3.23: Effects of the Tag-bettercycle predictor on the applications
Figure 3.24: Effects of the Tag-bettercycle2 predictor on the applications
Figure 3.25: Comparison of the performance of the predictors proposed in this chapter when the number of processes is 64, 32 (36 for BT and SP), and 16
Figure 4.1: Distribution of message sizes of the applications when N = 4
Figure 4.2: Distribution of message sizes of the applications when N = 9 for BT and SP, and 8 for CG, MG, LU, PSTSWM, and QCDMPI
Figure 4.3: Distribution of message sizes of the applications when N = 16
Figure 4.4: Distribution of message sizes of the BT, SP, PSTSWM, and QCDMPI applications when N = 25
Figure 4.5: Cumulative distribution function of the inter-send computation times for node zero of the application benchmarks when the number of processors is 16 for CG, MG, and LU, and 25 for BT, SP, QCDMPI, and PSTSWM
Figure 4.6: Percentage of the inter-send computation times for different benchmarks that are more than 5, 10, and 25 microseconds when N = 4, 8 or 9, 16, and 25
Figure 4.7: Different scenarios for message transmission in a multicomputer with a reconfigurable optical interconnect: (a) when the message-transfer delay is less than the inter-send time, and the available time is larger than the reconfiguration delay; (b) when the message-transfer delay is less than the inter-send time, and the available time is less than the reconfiguration delay; (c) when the message-transfer delay is larger than the inter-send time
Figure 4.8: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 4 nodes (shorter bars are better)
Figure 4.9: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 9 nodes for BT and SP, 8 nodes for the other applications (shorter bars are better)
Figure 4.10: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 16 nodes (shorter bars are better)
Figure 4.11: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 25 nodes (shorter bars are better)
Figure 4.12: Summary of the average ratio of the total reconfiguration time after hiding over the total original reconfiguration time with the current generation and a 10 times faster CPU when applying the Tag-bettercycle2 predictor on the benchmarks with d = 25 microseconds; A class for NPB, and under different system sizes
Figure 4.13: Average percentage of the times the receive calls are issued before the corresponding send calls
Figure 5.1: Some collective communication operations
Figure 5.2: Latency hiding broadcasting algorithm for RON(k, N), N = 4, k = 2, d = 1
Figure 5.3: First and second generation trees. The numbers underneath each tree denote the number of trees having the same height. These trees are rooted at nodes that were at the same level in the first generation tree
Figure 5.4: Sequential tree algorithm
Figure 5.5: Spanning binomial tree algorithm
Figure 5.6: Multinode broadcasting on an 8-node RON(k, N) under single-port modeling
Figure 5.7: Multinode broadcasting on an 8-node RON(k, N) under 2-port modeling
Figure 5.8: Total exchange on an 8-node RON(k, N) under single-port modeling
Figure 5.9: Total exchange on an 8-node RON(k, N) under 2-port modeling
Figure 6.1: Data transfers in a traditional messaging layer
Figure 6.2: Number of receive calls in the applications under different system sizes
Figure 6.3: Number of unique message identifiers in the applications under different system sizes
Figure 6.4: Distribution of the unique message identifiers for process zero in the applications
Figure 6.5: Effects of the LRU, FIFO, and LFU heuristics on the applications
Figure 6.6: Effects of the Tagging predictor on the applications
Figure 6.7: Effects of the Single-cycle predictor on the applications
Figure 6.8: Effects of the Tag-cycle2 predictor on the applications
Figure 6.9: Effects of the Tag-bettercycle2 predictor on the applications
Figure 6.10: Comparison of the performance of the predictors on the applications
List of Tables
Table 3.1: Memory requirements (in bytes) of the predictors when N = 64
Table 4.1: Minimum inter-send computation times (in microseconds) in the NAS Parallel Benchmarks, PSTSWM, and QCDMPI when N = 4, 8, 9, 16, and 25
Table 4.2: Communication to computation ratio of the applications
Table 5.1: Broadcasting time, k = 2, d = 1
Table 5.2: Broadcasting time, k = 4, d = 3
Table 5.3: Broadcasting time, d = 3
Table 5.4: Multi-broadcasting time, k = 4, d = 3, M = 10
Table 5.5: Total exchange time, N = 1024, single-port
Table 5.6: Total exchange time, N = 1024, k = 3
Table 6.1: Memory requirements (in 6-tuple sets) for the predictors when N = 64 for CG, and N = 49 for BT, SP, and PSTSWM
Trademarks
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Trademarks and registered trademarks used in this work, where the author was aware of them, are listed below. All other trademarks are the property of their respective owners.

IBM SP2 is a registered trademark of International Business Machines Corp.

IBM P2SC CPU is a registered trademark of International Business Machines Corp.

IBM Vulcan Switch is a registered trademark of International Business Machines Corp.

Myrinet is a registered trademark of Myricom.

ServerNet is a registered trademark of the Tandem Division of Compaq.

SGI Origin 2000 is a registered trademark of Silicon Graphics, Inc.

SGI Spider Switch is a registered trademark of Silicon Graphics, Inc.

WaveStar LambdaRouter is a registered trademark of Lucent Technologies.
Glossary
AM: Active Messages
ASCI: Accelerated Strategic Computing Initiative program
BIP: Basic Interface for Parallelism
BT: Block Tridiagonal Application Benchmark
CA: Communication Assist
CDF: Cumulative Distribution Function
CGH: Computer Generated Holograms
CIC: Computing, Information and Communications Project
CG: Conjugate Gradient Application Benchmark
CLUMP: Cluster of Multiprocessors
COW: Cluster of Workstations
DM: Deformable Mirrors
DSM: Distributed Shared-Memory Multiprocessor
EP: Embarrassingly Parallel Application Benchmark
FIFO: First-in-first-out
FM: Fast Messages
FT: 3-D Fast-Fourier Transform Application Benchmark
HPF: High Performance Fortran
IS: Integer Sort Application Benchmark
LAM/MPI: Local Area Multicomputer/Message Passing Interface
LAN: Local Area Networks
LAPI: Low-level Application Programmers Interface
LIFO: Last-in-first-out
LRU: Least Recently Used
LU: Lower-Upper Diagonal Application Benchmark
MG: Multigrid Application Benchmark
MIMD: Multiple Instructions Multiple Data
MPI: Message Passing Interface
MPICH: A Portable Implementation of MPI
MPP: Massively Parallel Processors systems
NI: Network Interface
NOW: Networks of Workstations
NPB: NAS Parallel Benchmarks
ORPC(k): Optically Reconfigurable Parallel Computer
OPS: Optical Passive Stars
P2SC: Power2 Super Chip Microprocessor
POPS: Partitioned Optical Passive Stars
PM: A High-Performance Communication Library
PSTSWM: Parallel Spectral Transform Shallow Water Model
PVM: Parallel Virtual Machine
QCDMPI: Quantum Chromodynamics with Message Passing Interface
RON(k, N): Reconfigurable Optical Network
RMA: Remote Memory Access
SAN: System Area Networks
SEED: Self Electro-optic Effect Device
SHRIMP: Scalable High-Performance Really Inexpensive Multiprocessor
SP: Scalar Pentadiagonal Application Benchmark
SPMD: Single Program Multiple Data
TLB: Translation Lookaside Buffer
U-Net: A User-Level Network Interface Architecture
VCC: Virtual Circuit Caching
VCSEL: Vertical Cavity Surface Emitting Laser
VIA: Virtual Interface Architecture
VMMC: Virtual Memory-Mapped Communications
Acknowledgments
I would like to express my deepest appreciation to my supervisor, Dr. Nikitas J. Dimopoulos, for his thoughtful suggestions that shaped and improved my ideas. I am very grateful to Nikitas for providing me with his valuable guidance, encouragement, support, criticism, patience, and kindness from the first day I came to Victoria.

I would like to thank the members of my dissertation committee. I wish to thank Dr. Kin F. Li, Dr. Vijay K. Bhargava, and Dr. D. Michael Miller for their support and suggestions. I am very grateful to Dr. José Duato for his kind acceptance to be the external examiner of this dissertation, and for his brilliant suggestions.

I am greatly indebted to my wife, Azita Gerami, for her continuous support and encouragement. Without her understanding, I would not have finished my dissertation. I would like to express my gratitude to my parents, who always encouraged me to pursue a Ph.D.

I want to thank all my friends and graduate fellows, especially the fellow researchers at LAPIS, including André Schuorl, Nicolaos P. Kourounakis, Shahadat Khan, Mohamed Watheq El-Kharashi, Stephen W. Neville, Rafael Parra Hernandez, Caedmon Somers, Ion Kanie, and Eric Lasdal, who have made my stay so much fun.

I would like to thank the department's system and office staff for their continuous cooperation. I am thankful to Vicky Smith, Lynne Barrett, Maureen Denning, and Moneca Bracken.

Special thanks to Dr. Murray Campbell at the IBM T. J. Watson Research Center for his kind cooperation and help in accessing the IBM Deep Blue, and the staff of the computer center at the University of Victoria for access to the University IBM SP2.

My dissertation research was supported by grants from the Natural Sciences and Engineering Research Council (NSERC) of Canada, and the University of Victoria.
Chapter 1
Introduction
Research in the area of advanced computer architecture has been primarily focused on how to improve the performance of computers in order to solve computationally intensive problems [61, 62, 60]. Some of these problems are called grand challenges. A grand challenge is a fundamental problem in science or engineering that has a broad economic and/or scientific impact: coupled fields; geophysical and astrophysical fluid dynamics (GAFD) turbulence; modeling the global climate system; formation of the large-scale universe; global optimization algorithms for macromolecular modeling; petroleum exploration; aerodynamic simulations; and ocean circulation are just a few to mention.
The performance of processors is doubling every eighteen months [9]. However, there is always a demand for more computing power. To solve grand challenge problems, computer systems at the teraflop (10^12 floating-point operations per second) and petaflop (10^15 floating-point operations per second) performance levels are needed.
Processors are becoming very complex, and only a few companies are designing new processors. Therefore, it is not cost-effective to build high-performance computers just by using custom-designed high-performance processors. The trend is to design parallel computers using commodity processors to achieve teraflop and petaflop performance. For instance, two major projects to develop high-performance supercomputers in the USA are the federal program in Computing, Information and Communications (CIC) at the national coordination office [98], and the Department of Energy Accelerated Strategic Computing Initiative (ASCI) program, including the Intel/Sandia Option Red, the IBM/Lawrence Livermore National Laboratory Blue Pacific, and the SGI/Los Alamos National Laboratory Blue Mountain [59].

This should not give us the wrong impression that such high-performance computers, often called Massively Parallel Processor (MPP) systems, are only used for grand challenges and parallel scientific applications. Even for applications requiring lower computing power, parallel computing is a cost-effective solution. These days, many high-performance parallel computing systems are being used in network and commercial applications such as data warehousing, Internet servers, and digital libraries.
Parallel processing is at the heart of such powerful computers. Although parallelism appears at different levels in a single-processor system, such as lookahead, pipelining, superscalarity, speculative execution, vectorization, interleaving, overlapping, multiplicity, time sharing, multitasking, multiprogramming, and multithreading, it is parallel processing and parallel computing among different processors which brings us such levels of performance.
Basically, a parallel computer is a "collection of processing elements that communicate and cooperate to solve large problems fast" [9]. In other words, a parallel computer, whether message-passing or distributed shared-memory (DSM), is a collection of complete computers, including processor and memory, that communicate through a general-purpose, high-performance, scalable interconnection network using a communication assist (CA) and/or a network interface (NI) [8], as shown in Figure 1.1.

[Figure 1.1: A generic parallel computer. Each node contains a processor (P), cache, memory, and a network interface, and the nodes attach to an interconnection network.]
Message-passing multicomputers, among all known parallel architectures, are the best suited to achieve such computing performance levels. Message-passing multicomputers are characterized by the distribution of memory among a number of computing nodes that communicate with each other by exchanging messages through their interconnection networks. Each node has its own processor, local memory, and communication assist/network interface. All local memories are private and are accessible only by the local processors. The wide acceptance of message-passing multiprocessor systems has been proven by the introduction of the Message Passing Interface (MPI) standard [91, 93]. Currently, in addition to vendor implementations of MPI on commercial machines, there are many freely available MPI implementations, including MPICH [37] and LAM/MPI [75].
Recently, Networks of Workstations (NOW) [11], Clusters of Workstations (COW), and Clusters of Multiprocessors (CLUMP) [97] have been proposed to build inexpensive parallel computers, however, often at a lower performance level compared to MPP systems. The development of high-performance switches, specially designed for building cost-effective interconnects and known as System Area Networks (SAN) [73, 67, 113, 54], has motivated the suitability of networks of workstations/multiprocessors as an inexpensive high-performance computing platform. System area networks such as the Myricom Myrinet [23], the IBM Vulcan switch in the IBM SP2 machine [113], the Tandem ServerNet [67], and the Spider switch in the SGI Origin 2000 machine [85], are a new generation of networks that falls between memory buses and commercial local area networks (LANs).
Parallel processing, whether MPP, DSM, NOW, COW, or CLUMP, puts tremendous pressure on the interconnection networks and the memory hierarchy subsystems. As communication overhead is one of the most important factors affecting the performance of parallel computers [76, 69, 43], there has been a growing interest in the design of interconnection networks. In this respect, various types of interconnection networks, such as complete networks, hypercubes, meshes, rings, tori, irregular switch-based networks, stack-graphs, and hypermeshes, have been proposed and some of them have been implemented [46, 134, 108]. Meanwhile, many routing algorithms [47, 56, 12] have been proposed for such networks.
In parallel processing systems, the ability to efficiently communicate and share data between processors is critical to obtaining high performance. In essence, parallel computers require extremely short communication latencies such that network transactions have minimal impact on the overall computation time. Communication hardware latency, communication software latency, and the user environment (multiprogramming, multiuser) are the major factors affecting the performance of parallel computer systems. This thesis concentrates on issues regarding hardware communication latency in electronic networks and reconfigurable optical networks, and software communication latency (regardless of the type of network).
In this thesis, I propose a number of techniques to achieve efficient communications in message-passing systems. This thesis makes five contributions:

- The first contribution of this thesis (Chapter 3) is the design and evaluation of two different categories of prediction techniques for message-passing systems. Specifically, I use these predictors to predict the target of communication messages in parallel applications.
- As the second contribution of this thesis (Chapter 4), I show that the majority of reconfiguration delays in reconfigurable networks can be hidden by using one of the high hit ratio predictors proposed in Chapter 3.
- The third contribution of this thesis (Chapter 5) is the analysis of a latency-hiding broadcasting algorithm on single-hop reconfigurable networks under single-port and k-port modeling, which brings up closed formulations that yield the termination time.
- As the fourth contribution of this thesis (Chapter 5), I propose a new total exchange algorithm in single-hop reconfigurable networks under single-port and k-port modeling.
- Finally, the fifth contribution (Chapter 6) is the use and evaluation of the predictors proposed in Chapter 3 to predict the next consumable message at the receiving ends of message-passing systems (regardless of the type of network). I argue that these message predictors can be efficiently used to drain the network and cache the incoming messages even if the corresponding receive calls have not been posted yet.
Chapter 2 introduces the parallel applications used in this thesis. Chapter 7 concludes this dissertation and gives directions for future research. Appendix A describes how timing disturbances have been removed from the timing profiles of the parallel applications used in this thesis.

The rest of this chapter is organized as follows. In Section 1.1, I explain the communication locality in message-passing parallel applications and discuss different latency hiding techniques for parallel computer systems. In Section 1.2, I discuss the advantages of using prediction techniques at the send side of communications in the reconfigurable optical interconnection networks, and in the circuit-switched and wormhole routing electronic interconnection networks. In Section 1.3, I describe the issues related to the messaging layer and software communication overhead in message-passing systems, and how prediction can help eliminate redundant message copying operations. I give an introduction to the issues regarding collective communications in Section 1.4. Finally, I summarize the contributions of this thesis in Section 1.5.
1.1 Communications Locality and Prediction Techniques
In this thesis, I am interested in the message-passing model of parallelism, as message-passing parallel computers scale much better than shared-memory parallel computers. Communication properties of message-passing parallel applications can be categorized by the spatial, temporal, and volume attributes of the communications [30, 75, 65]. The temporal attribute of communications in parallel applications characterizes the rate of message generation and the rate of computations in the applications. The volume of communications is characterized by the number of messages and the distribution of message sizes in the applications.
The spatial attribute of communications in parallel applications is characterized by the distribution of message destinations. Point-to-point communication patterns may be repetitive in message-passing applications, as most parallel algorithms consist of a number of computation and communication phases. Several researchers have worked to find or use the communications locality properties of parallel applications [30, 75, 65, 36, 37].
By message destination communication locality I mean that if a certain source-destination pair has been used, it will be re-used with high probability by a portion of code that is "near" the place that was used earlier, and that it will be re-used in the near future. By message reception communication locality I mean that if a certain message reception call has been used, it will be re-used with high probability by a portion of code that is "near" the place that was used earlier, and that it will be re-used in the near future.
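One way to make the destination-locality notion concrete is to replay a send trace against a small LRU set of recently used destinations and measure the hit ratio. This is an illustrative sketch: the trace and cache size below are made up for the example, not taken from the thesis data.

```python
# Hypothetical sketch: measuring message-destination locality by replaying a
# destination trace against an LRU "cache" of recently used destinations.
from collections import OrderedDict

def lru_hit_ratio(trace, cache_size):
    """Fraction of sends whose destination is already among the
    `cache_size` most recently used destinations (LRU replacement)."""
    cache = OrderedDict()
    hits = 0
    for dest in trace:
        if dest in cache:
            hits += 1
            cache.move_to_end(dest)          # mark as most recently used
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)    # evict least recently used
            cache[dest] = True
    return hits / len(trace)

# A repetitive, phase-structured trace, as in many parallel algorithms:
trace = [1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3]
print(lru_hit_ratio(trace, 3))
```

A highly repetitive trace yields a high hit ratio, which is exactly the property the thesis's predictors exploit.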
Traditionally, one approach to dealing with communication latency is to tolerate the latency; that is, hide the latency from the processor's critical path by overlapping it with other high-latency events, or hide it with computations. The processor is then free to do other useful tasks.
Three approaches can be used to tolerate latency in shared-memory and message-passing systems [32]. They are proceeding past communication in the same thread, multithreading, and precommunication. The first approach, proceeding past communication in the same thread in message-passing systems, is to make communication messages asynchronous and proceed past them either to other asynchronous communication messages, or to the computation in the same thread. This approach is usually used by parallel algorithm designers. Some of the applications studied in this thesis use this type of latency tolerance by using nonblocking asynchronous MPI calls.
In multithreading, a thread issuing a communication operation suspends itself and lets another thread run. This approach is used for other threads too. It is hoped that when the first thread is rescheduled, its communication operations have concluded. Multithreading can be done in software or hardware. Software multithreading is very expensive. Some hardware multithreading research architectures for message-passing systems, such as the J-Machine [35] and the M-Machine [52], have been reported.
In precommunication, communication operations are pulled up from the place where communications naturally occur in the program so that they are partially or entirely completed before the data is needed. This can be done in software by inserting precommunication operations, or in hardware, by predicting the subsequent communication operations and issuing them early.
Precommunication is common in receiver-initiated communications (that is, in shared-memory systems) where communication commences when data is needed, such as in a read operation. In software-controlled prefetching, the programmer or the compiler decides when and what to prefetch by analyzing the program and then inserting prefetch instructions before the actual data request in the program [%]. In hardware-controlled prefetching, dedicated hardware is used to predict future accesses of sharing patterns and coherence activities by looking at their observed behavior [96, 77, 73, 133, 34, 107]. Thus, there is no need to add instructions to the program. These techniques assume that memory accesses and coherence activities in the near future will follow past patterns. Then, the hardware prefetches the data based on its prediction.
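As a sketch of the hardware-controlled idea (illustrative only; real prefetch engines are considerably more elaborate), a minimal stride predictor watches the address stream and predicts the next address once a stride repeats:

```python
# Illustrative model of a stride-based hardware prefetcher: once two
# consecutive accesses exhibit the same non-zero stride, it predicts the
# next address in the pattern.
class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def access(self, addr):
        """Record an access; return the predicted next address, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.last_stride and stride != 0:
                prediction = addr + stride   # pattern confirmed: prefetch
            self.last_stride = stride
        self.last_addr = addr
        return prediction

pf = StridePrefetcher()
print([pf.access(a) for a in [100, 108, 116, 124]])
# predictions appear once the 8-byte stride has been seen twice
```

The same "observe past behavior, extrapolate the next event" structure underlies the message predictors developed in this thesis, applied to destinations rather than addresses.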
In sendcr-initiated systrms (that is. in mrssagc-passing systrtms). it is usually ditfiçult
to do the communicrition operation enrlier rit the send sides and thus hide the Iiitency. This
is because message communication is naturally initiatrd to trnnsfer the data when the data
is produced. Howrver. messages may arrive enrlier cit the receiver than it is needcd which
leads to a prrcomrnunication for the receiver side of communication.
As far as the author is aware, no precommunication technique has been proposed for message-passing systems. Prediction techniques can be used to predict the subsequent message destinations, and message reception calls, in message-passing systems. This thesis, for the first time, proposes and evaluates two categories of pattern-based predictors, namely, Cycle-based predictors and Tag-based predictors, for message-passing systems. These predictors can be used dynamically (at the send side or receive side of communications) at the communication assist or network interface with or without the help of a programmer or the compiler.
1.2 Using the Proposed Predictors at the Send Side
In the following, I explain how message destination prediction can be helpful in hiding the reconfiguration delay in single-hop and multi-hop reconfigurable optical interconnection networks, and in hiding path setup time in circuit-switched electronic networks. I also describe the benefit of message destination prediction techniques in reducing the latency of communications in current commercial wormhole-routed networks.
The interconnection network plays a key role in the performance of message-passing parallel computers. A message is sent from a source to a destination through the interconnection network. High communication bandwidth and low communication latency are essential for efficient communication between a source and a destination. However, communication latency is the most important factor affecting the performance of message-passing parallel computers. In this thesis, I am interested in hiding and reducing the communication latency. Two categories of interconnection networks exist: electronic interconnection networks and optical interconnection networks. I have developed prediction techniques that can be applied to both electronic and optical interconnection networks.
The proposed predictors can be used to set up the paths in advance in electronic networks using either circuit switching or wave switching. In circuit switching, the routing header flit progresses through to the message destination and reserves physical links. Wave switching is a hybrid switching technique for high-performance routers in electronic interconnection networks. Wave switching combines wormhole switching and circuit switching in the same router architecture to reduce the fixed overhead of communication latency by exploiting communication locality. Hence, it is possible to hide the hardware communication latency using message destination predictions to pre-establish physical circuits in circuit switching and wave switching networks.
The predictors can even be useful for reducing communication latency in current commercial networks. For example, Myrinet networks [23] have a relatively long routing time compared with the link transmission time. Predictors would allow sending the message header in advance for the predicted message destination. When data becomes available, it can be directly transmitted through the network if the prediction was correct, thus reducing latency significantly. In case of a mis-prediction, a message tail is forwarded to tear the path down. Obviously, null messages must be discarded at the destination.
Optics is ideally suited for implementing interconnection networks because of its superior characteristics over electronic interconnects, such as higher bandwidth, greater number of fan-ins and fan-outs, higher interconnection densities, less signal crosstalk, freedom from planar constraints (as it can easily exploit the third spatial dimension, which dramatically increases the available communication bandwidth), lower signal and clock skew, lower power dissipation, inherent parallelism, immunity from electromagnetic interference and ground loops, and suitability for reconfigurable interconnects [100, 51, 74, 19, 50, 129, 83].
Future massively parallel computers might benefit from using reconfigurable optical interconnection networks. Currently, there are some problems with the optical interconnect technology. Signal attenuation, optical element alignment, slow conversion between electronics and photonics and vice versa, and high reconfiguration delay are some disadvantages of optics, which are mostly due to its relatively immature technology. However, this technology is maturing fast. As an example, the Lucent WaveStar LambdaRouter [86] relies on an array of hundreds of electrically configurable microscopic mirrors fabricated on a single substrate so that an individual wavelength can be passed to any of 256 input and output fibers.
As stated above, the reconfiguration delay in reconfigurable optical interconnection networks is currently very high. The proposed message destination predictors can be efficiently used to hide the reconfiguration delay in single-hop and multi-hop reconfigurable optical interconnection networks concurrently with the computations [127, 84].
1.3 Redundant Message Copying in Software Messaging Layers
The communication software overhead currently dominates the communication time in clusters of workstations/multiprocessors. Crossing protection boundaries several times between the user space and the kernel space, passing through several protocol layers, and involving a number of memory copying operations are three different sources of software communication cost.
Several researchers are working to minimize the cost of crossing protection boundaries, and to use simple protocol layers, by utilizing user-level messaging techniques such as Active Messages (AM) [125], Fast Messages (FM) [102], Virtual Memory-Mapped Communications (VMMC-2) [43], U-Net [126], LAPI [110], Basic Interface for Parallelism (BIP) [105], Virtual Interface Architecture (VIA) [49], and PM [121].
A significant portion of the software communication overhead belongs to a number of message copying operations. Ideally, message protocols should copy the message directly from the send buffer in its user space to the receive buffer at the destination without any intermediate buffering. However, applications at the send side do not know the final receive buffer addresses and, hence, the communication subsystems at the receiving end still copy messages at a temporary buffer.
Several research groups have tried to avoid memory copying [79, 14, 106, 119, 118]. They have been able to remove the extra memory copying operations between the application user buffer space and the network interface at the send side. However, they haven't been able to remove the memory copying at the receiver sides. They may achieve a zero-copy messaging at the receiver sides only when the receive call is already posted, a rendezvous-type communication is used for large messages, or the destination buffer address is already known by an extra communication (pre-communication). However, the predictors proposed in this dissertation can be efficiently used to predict the next message reception calls and thus move the corresponding incoming messages to a place near the CPU, such as a staging cache.
1.4 Collective Communications
Communication operations may be either point-to-point, which involve a single source and a single destination, or collective, in which more than two processes participate. Collective communications are common basic patterns of interprocessor communication that are frequently used as building blocks in a variety of parallel algorithms. Proper implementation of these basic communication operations is a key to the performance of parallel computers. Therefore, there has been a great deal of interest in their design and the study of their performance. Excellent surveys on collective communication algorithms can be found in [90, 53, 61].
Collective communication operations can be used for data movement, process control, or global operations. Data movement operations include broadcasting, multicasting, scattering, gathering, multinode broadcasting, and total exchange. Barrier synchronization is a type of process control. Global operations include reduction and scan. The growing interest in collective communications is evident by their inclusion in the Message Passing Interface (MPI) [93, 92].
1.5 Thesis Contributions
In Chapter 2, I describe the applications used in this thesis along with the point-to-point communication primitives that they use. I explain the experimental methodology used to collect the communication traces of the applications.
In Chapter 3, I introduce a complete interconnection network using free-space reconfigurable optical interconnects for message-passing parallel machines. A computing node in this parallel machine configures its communication link(s) to reach its destination node(s). Then it sends its message(s) over the established link(s).
I characterize some communication properties of the parallel applications by presenting their communication frequency and message destination distributions. I define the concept of communication locality in message-passing parallel applications, and of caching in reconfigurable networks. I present evidence, using the classical memory hierarchy heuristics LRU, LFU, and FIFO, that there exists message destination communication locality in the message-passing parallel applications.
The first contribution of this thesis (Chapter 3) is the design and evaluation (in terms of hit ratio) of two different categories of hardware/software communication latency hiding predictors for such reconfigurable message-passing environments. I have utilized the message destination locality property of message-passing parallel applications to devise a number of heuristics that can be used to predict the target of subsequent communication calls. This technique can be applied directly to reconfigurable interconnects to hide the communication latency by reconfiguring the communication network concurrently with the computation.
Specifically, I propose two sets of message destination predictors: Cycle-based predictors, which are purely dynamic predictors, and Tag-based predictors, which are static/dynamic predictors. In the Cycle-based predictors, Single-cycle, Single-cycle2, Better-cycle, and Better-cycle2, predictions are done dynamically at the network interface without any help from the programmer or compiler. In the Tag-based predictors, Tagging, Tag-cycle, Tag-cycle2, Tag-bettercycle, and Tag-bettercycle2, predictions are done dynamically at the network interface as well, but they require an interface to pass some information from the program to the network interface. This can be done with the help of a programmer or the compiler through inserting instructions in the program such as pre-connect(tag) (or pre-receive(tag) as in Chapter 6). The performance of the proposed predictors Better-cycle2 and Tag-bettercycle2 is very high and proves that they have the potential to hide the hardware communication latency in reconfigurable networks. The memory requirements of the predictors are very low. That makes them very attractive for implementation on the communication assist or network interface.
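The tag-based idea can be sketched as follows. This is an illustrative reading only, not the exact Tagging or Tag-cycle algorithms: the program (or compiler) identifies each communication call site with a tag, a hypothetical pre-connect(tag) queries the network interface for the last destination recorded under that tag, and the real send updates the record.

```python
# Illustrative per-tag destination predictor ("last destination wins").
class TagPredictor:
    def __init__(self):
        self.last_dest = {}                 # tag -> last destination used

    def predict(self, tag):
        """Called at pre-connect(tag): guess the destination for this tag."""
        return self.last_dest.get(tag)      # None until the tag is seen

    def update(self, tag, dest):
        """Called when the real send for this tag is issued."""
        self.last_dest[tag] = dest

tp = TagPredictor()
sends = [(7, 3), (9, 5), (7, 3), (9, 5), (7, 4)]   # (tag, destination) pairs
hits = 0
for tag, dest in sends:
    if tp.predict(tag) == dest:
        hits += 1
    tp.update(tag, dest)
print(hits, "of", len(sends))
```

Because each call site tends to talk to the same partner across iterations, even this one-entry-per-tag table captures much of the repetition that the richer Tag-cycle variants exploit.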
In order to efficiently use the predictors proposed in Chapter 3 to hide the hardware latency of the reconfigurable interconnects, enough lead time should exist such that the reconfiguration of the interconnect can be completed before the communication request arrives. In Chapter 4, I present the pure execution times of the computation phases of the parallel applications on the IBM Deep Blue machine at the IBM T. J. Watson Research Center using its high-performance switch and under the user space mode.
As the second contribution of this thesis, Chapter 4 states that, by comparing the inter-send computation times of these parallel benchmarks with some specific reconfiguration times, most of the time we are able to fully utilize these computation times for the concurrent reconfiguration of the interconnect when we know, in advance, the next target using one of the proposed high hit ratio target prediction algorithms introduced in Chapter 3. I present the performance enhancements of the proposed predictors on the application benchmarks for the total reconfiguration time. Finally, I show that by applying the predictors at the send sides, applications at the receiver sides would also benefit, as messages arrive earlier than before.

As the third contribution of this thesis (Chapter 5), I present and analyze a broadcasting algorithm that utilizes latency hiding and reconfiguration in the network to speed up the broadcasting operation under single-port and k-port modeling. In this algorithm, the reconfiguration phase of some of the nodes is overlapped with the message transmission phase of the other nodes, which ultimately reduces the broadcasting time. The analysis brings up a closed formulation that yields the termination time of the algorithm.
The fourth contribution of this thesis (Chapter 5) is a combined total exchange algorithm based on a combination of the direct [105, 120] and standard exchange [71, 24] algorithms. This ensures a better termination time than that which can be achieved by either of the two algorithms. Also, known algorithms [20, 40] for scattering and all-to-all broadcasting have been adapted to the network.
In Chapter 6, I present the frequency and distributions of receive communication calls in the applications. I present evidence that there exists message reception communications locality in the message-passing parallel applications. As I stated earlier, the communication subsystems at the receiving end still copy early arriving messages unnecessarily at a temporary buffer. As far as the author is aware, no prediction techniques have been proposed to remove this unnecessary message copying.
I use the predictors introduced in Chapter 3 to predict the next consumable message, and to thus establish the existence of message reception communications locality. As the fifth contribution of this thesis, Chapter 6 argues that these message predictors can be efficiently used to drain the network and cache the incoming messages even if the corresponding receive calls have not been posted yet. This way, there is no need to unnecessarily copy the early arriving messages into a temporary buffer.
The performance of the proposed predictors Single-cycle, Tag-cycle2, and Tag-bettercycle2, in terms of hit ratio on the parallel applications, is quite promising and suggests that prediction has the potential to eliminate most of the remaining message copies. Moreover, the memory requirements of these predictors are very low, making them easy to implement. Finally, I discuss ways in which these predictions could be used to drastically reduce the latency due to message copying.
In Chapter 7, I conclude this thesis and give some directions for future research.
Chapter 2
Application Benchmarks and Experimental Methodology
In Section 2.1, I describe the applications used in this thesis. I explain the various point-to-point message-passing primitives of the applications in Section 2.2. I discuss the experimental methodology in Section 2.3.
2.1 Parallel Benchmarks
This thesis (except Chapter 5) studies the computation and communication characteristics of actual parallel applications. For these studies, I have used some well-known parallel benchmarks from the NAS Parallel Benchmarks (NPB) suite [13], the Parallel Spectral Transform Shallow Water Model (PSTSWM) parallel application [125], and the pure Quantum Chromo Dynamics Monte Carlo Simulation Code with MPI (QCDMPI) parallel application [65]. Although the results presented in this thesis are for the above parallel applications, these applications have been widely used as benchmarks representing the computations in scientific and engineering parallel applications.
I used the MPI [92] implementation of the NAS benchmarks, version 2.3, the PSTSWM, version 6.2, and the QCDMPI, version 1.4, and ran them on several IBM SP2 machines. I chose the IBM SP2 as it is a message-passing parallel machine, so that the chosen parallel applications are mapped directly onto it. I used different system sizes and problem sizes of the applications in this study. NPB 2.3 comes with five problem sizes for each benchmark: small class "S", workstation class "W", large class "A", and larger classes "B" and "C". Due to access limitations in the use of the IBM Deep Blue machine at the IBM T. J. Watson Research Center, and space limitations in using the University of Victoria IBM SP2, I was able to experiment with only the "W" and "A" classes, and the results included in this thesis represent these classes.
2.1.1 NPB: NAS Parallel Benchmarks Suite
The NAS Parallel Benchmarks (NPB) [13] have been developed at the NASA Ames Research Center to study the performance of massively parallel processor systems and networks of workstations. The NAS Parallel Benchmarks are a set of eight benchmark problems, each of which focuses on some important aspect of highly parallel supercomputing for aerophysics applications. The NPB are a set of implementations of the NAS Parallel Benchmarks based on Fortran 77 and the MPI message-passing interface standard, and are not tied to any specific system.
The NPB consists of five "kernels" and three "simulated computational fluid dynamics (CFD) applications". The three simulated CFD application benchmarks, lower-upper diagonal (LU), scalar pentadiagonal (SP), and block tridiagonal (BT), are intended to accurately represent the principal computational and data movement requirements of modern CFD applications. The kernels, conjugate gradient (CG), multigrid (MG), embarrassingly parallel (EP), 3-D fast-Fourier transform (FT), and integer sort (IS), are relatively compact problems, each of which emphasizes a particular type of numerical computation. I am interested in the point-to-point patterns of the LU, BT, and SP applications, and the CG and MG kernels. The EP, FT, and IS kernels are not suitable for this study. EP and FT use only collective communication operations, while each node in the IS kernel always communicates with a specific node.
2.1.1.1 CG

The conjugate gradient kernel, CG, tests the performance of the system for unstructured grid computations, which by their nature require irregular long-distance communications, a challenge for all kinds of parallel computers. Essentially, it requires computing a sparse matrix-vector product. The inverse power method is used to find an estimate of the largest eigenvalue of a symmetric positive-definite sparse matrix with a random pattern of non-zeros. This code requires a power-of-two number of processors.
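The kernel's core operation, a repeated sparse matrix-vector product driving an eigenvalue estimate, can be illustrated with a tiny sketch. This is plain power iteration for the dominant eigenvalue, shown only to make the structure of the computation concrete; it is not NPB's inverse power method.

```python
# Illustrative only: repeated sparse matrix-vector products, the heart of
# CG-style kernels, driving a dominant-eigenvalue estimate.
def spmv(rows, x):
    """Sparse matrix-vector product; rows is a list of {col: value} dicts."""
    return [sum(v * x[j] for j, v in row.items()) for row in rows]

def power_iteration(rows, iters=50):
    x = [1.0] * len(rows)
    for _ in range(iters):
        y = spmv(rows, x)
        norm = max(abs(v) for v in y)   # infinity-norm scaling
        x = [v / norm for v in y]
    return norm                         # estimate of |lambda_max|

# The 2x2 symmetric matrix [[2, 1], [1, 2]] has eigenvalues 1 and 3:
A = [{0: 2.0, 1: 1.0}, {0: 1.0, 1: 2.0}]
print(power_iteration(A))
```

In the real benchmark the matrix is large, randomly sparse, and distributed, which is what makes the communication pattern irregular and long-distance.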
2.1.1.2 MG
The second kernel benchmark is a simplified multigrid kernel, MG, which solves a 3-D Poisson PDE. Four iterations of the V-cycle multigrid algorithm are used to obtain an approximate solution u to the discrete Poisson problem ∇²u = v on a 256 × 256 × 256 grid with periodic boundary conditions. This code is a good test of both short and long distance highly structured communication. This code requires a power-of-two number of processors. The partitioning of the grid onto the processors occurs such that the grid is successively halved, starting with the z dimension, then the y dimension, and then the x dimension, and repeating until all power-of-two processors are assigned.
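The halving rule just described can be sketched as follows. The function and its dictionary output are assumed from the textual description above, for illustration, and are not taken from the NPB source.

```python
# Illustrative sketch of the MG partitioning rule: halve the grid, cycling
# through the z, y, x dimensions, until all 2^k processors are assigned.
def mg_partition(p):
    """Return how many times each dimension is halved for p = 2^k procs."""
    halvings = {"z": 0, "y": 0, "x": 0}
    order = ["z", "y", "x"]
    i = 0
    while p > 1:
        halvings[order[i % 3]] += 1     # halve the next dimension in turn
        p //= 2
        i += 1
    return halvings

print(mg_partition(8))     # one halving per dimension
print(mg_partition(32))    # the cycle wraps around: z and y halved twice
```

The cyclic halving keeps subdomains roughly cubic, which balances the surface area (and hence the boundary-exchange communication) across dimensions.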
2.1.1.3 LU
The lower-upper diagonal benchmark, LU, employs a symmetric successive over-relaxation (SSOR) numerical scheme to solve a regular-sparse block 5 × 5 lower and upper triangular system. A 2-D partitioning of the grid onto processors occurs by halving the grid repeatedly in the first two dimensions, alternately x and then y, until all power-of-two processors are assigned, resulting in vertical pencil-like grid partitions on the individual processors. The ordering of point-based operations constituting the SSOR procedure proceeds on diagonals which progressively sweep from one corner of a given z plane to the opposite corner of the same z plane, thereupon proceeding to the next z plane. Communication of partition boundary data occurs after completion of computation on all diagonals that contact an adjacent partition. LU is very sensitive to the small-message communication performance of an MPI implementation. It is the only benchmark in the NPB 2.3 suite that sends large numbers of very small (40 byte) messages.
2.1.1.4 BT and SP
The BT and SP algorithms have a similar structure: each solves three sets of uncoupled systems of equations, first in the x, then in the y, and finally in the z direction. In the block tridiagonal benchmark, BT, multiple independent systems of non-diagonally dominant, block tridiagonal equations with a 5 × 5 block size are solved. In the scalar pentadiagonal benchmark, SP, multiple independent systems of non-diagonally dominant, scalar pentadiagonal equations are solved. Both the BT and SP codes require a square number of processors. These codes have been written so that if a given parallel platform only permits a power-of-two number of processors to be assigned to a job, then unneeded processors are deemed inactive and are ignored during computation, but are counted when determining Mflop/s rates.
2.1.2 PSTSWM
The Parallel Spectral Transform Shallow Water Model (PSTSWM) application [125] was developed by Worley at Oak Ridge National Laboratory and Foster at Argonne National Laboratory. PSTSWM is a message-passing benchmark code and parallel algorithm testbed that solves the nonlinear shallow water equations on a rotating sphere using the spectral transform method. PSTSWM was developed to evaluate parallel algorithms for the spectral transform method as it is used in global atmospheric circulation models. Multiple parallel algorithms are embedded in the code and can be selected at run-time, as can the problem size, number of processors, and data decomposition. PSTSWM is written in Fortran 77 with VMS extensions and a small number of C preprocessor directives. I used the MPI implementation of the PSTSWM with the default input sizes.
2.1.3 QCDMPI
Pure Quantum Chromo Dynamics Monte Carlo Simulation Code with MPI
(QCDMPI) [65], written by Hioki at Tezukayama University, is a pure Quantum Chromo
Dynamics simulation code with MPI calls. It is a powerful tool to analyze the non-pertur-
bative aspects of QCD. This program can be applied to any dimensional QCD, such as the
4-dimensional QCD in which the color and/or quark confinement mechanisms are
obtained. QCDMPI runs on any number of processors, and any dimensional partition-
ing of the system can be applied.
2.2 Applications' Communication Primitives
As stated earlier, I am only interested in the patterns of the point-to-point communica-
tions between pair-wise nodes in the above applications, as discussed in Chapter 3, Chapter
4, and Chapter 6 of this thesis. Efficient algorithms for collective communications are pre-
sented in Chapter 5. These applications use synchronous and asynchronous MPI send and
receive primitives [91]. I briefly explain these communication primitives here.
An MPI program consists of autonomous processes, executing their own code, in a
multiple instructions multiple data (MIMD) style. Note that all parallel applications stud-
ied in this thesis use a single program multiple data (SPMD) style. Processes are identi-
fied according to their relative rank in a group, that is, consecutive integers in the range 0
to groupsize - 1. If the group consists of all processes then the processes are ranked from 0
to N - 1, where N is the total number of processes in the application.
The processes communicate via calls to MPI communication primitives. The basic
point-to-point communication operations are send and receive. There are two general
types of point-to-point communication operations in MPI: blocking and nonblocking. Blocking
send or receive calls will not return until the parameters of the calls can be safely modi-
fied. That is, in the case of a send call, the message envelope has been created and the mes-
sage has been sent out or has been buffered into a system buffer. In the case of a receive
call, it means that the message has been received into the receive buffer. Note that the mes-
sage envelope consists of a fixed number of fields (source, dest, tag, comm) and it is used to
distinguish messages and selectively receive them. Nonblocking communication opera-
tions just post or start the operation. Thus the application programmer must explicitly
complete the communication call later at some point in the program using one of the vari-
ous function calls in MPI such as MPI_Wait or MPI_Waitall.
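The envelope-based selection described above can be made concrete with a small sketch. The code below is an illustrative simulation, not the MPI library itself; the names ANY_SOURCE, ANY_TAG, and matches are hypothetical stand-ins for MPI's wildcard constants and its internal matching rule.

```python
# Hypothetical sketch of how a message envelope (source, dest, tag, comm)
# lets a posted receive select a message. Not MPI itself.
ANY_SOURCE = -1   # stands in for MPI_ANY_SOURCE
ANY_TAG = -1      # stands in for MPI_ANY_TAG

def matches(envelope, recv_source, recv_tag, recv_comm):
    """Return True if a posted receive accepts this message envelope."""
    source, dest, tag, comm = envelope
    return (comm == recv_comm and                  # contexts never mix
            recv_source in (ANY_SOURCE, source) and
            recv_tag in (ANY_TAG, tag))

# A message from rank 3 to rank 0, tag 7, on communicator 0:
env = (3, 0, 7, 0)
assert matches(env, ANY_SOURCE, 7, 0)       # wildcard source matches
assert not matches(env, ANY_SOURCE, 7, 1)   # different communicator: no match
```

The key property mirrored here is that the communicator acts as a separate communication universe: no combination of source and tag wildcards lets a receive accept a message sent in a different context.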
There are four communication modes in MPI: standard, buffered, synchronous, and
ready. These correspond to four different types of send operations. In the synchronous
mode send call, the call will not finish until a matching receive call has been issued and
has begun reception of the message. In the buffered mode send call, the send call is local
(in contrast to the other communication modes, where the send calls are nonlocal) and does not
wait for the receive call to be posted; instead, it buffers the data if the receive call has not
been posted. In the ready mode send call, the receive call must have been posted earlier. In
the standard mode, it is up to the system to buffer the data or send it as in the synchronous
mode. Note that the standard mode is the only mode for the receive calls.
2.2.1 MPI_Send
MPI_Send (buf, count, datatype, dest, tag, comm) [92] is a standard blocking send call
which is a combination of the buffered and synchronous modes and is dependent on the imple-
mentation. When the call finishes, the send buffer can be reused. In the buffered mode, data
is written from the send buffer to the system buffer and the call returns. In the synchronous
mode, the call waits for the receive to be posted and then returns. The LU, MG, CG, and
PSTSWM applications use this type of send call.
2.2.2 MPI_Isend
MPI_Isend (buf, count, datatype, dest, tag, comm, request) [92] is a standard non-
blocking send call. It returns immediately; therefore, the send buffer cannot be reused until
the call is completed. It can be implemented in the buffered or synchronous mode. It needs another call, MPI_Wait
or MPI_Waitall, to complete the call. These completion calls are explained later in Section
2.2.6 and Section 2.2.7, respectively. BT and SP use this type of send call.
2.2.3 MPI_Sendrecv_replace
MPI_Sendrecv_replace (buf, count, datatype, dest, sendtag, source, recvtag, comm,
status) [92] combines in one call the sending of a message and the receiving of another message
in the same buffer. QCDMPI uses this type of communication call.
2.2.4 MPI_Recv
MPI_Recv (buf, count, datatype, source, tag, comm, status) [92] is a standard blocking
receive call. When it returns, the data is available at the destination buffer. LU and
PSTSWM use this type of receive call.
2.2.5 MPI_Irecv
MPI_Irecv (buf, count, datatype, source, tag, comm, request) [93] is a standard non-
blocking receive call. It immediately posts the call and returns. Hence, data is not available
at the time of return. It needs another completion call such as MPI_Wait or MPI_Waitall to
complete this call. All applications except QCDMPI use this type of receive call.
2.2.6 MPI_Wait
A call to MPI_Wait (request, status) [92] returns when the operation identified by
request is complete. For an MPI_Isend operation, when MPI_Wait returns, the send buffer can
be reused. For an MPI_Irecv operation, the completion of the MPI_Wait call notifies the
availability of the data at the receive buffer. The BT, LU, MG, CG, and PSTSWM applications all
use this type of completion call.
2.2.7 MPI_Waitall
MPI_Waitall (count, array_of_requests, array_of_statuses) [92] waits for the comple-
tion of all nonblocking calls associated with the active handles in the list. BT and SP use
this type of completion call.
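The post-then-complete discipline of the nonblocking primitives above can be sketched abstractly. The toy classes below are illustrative simulations, not MPI bindings; Request and waitall are hypothetical names chosen to echo MPI_Isend/MPI_Irecv handles and MPI_Waitall.

```python
# Toy model (not MPI) of posting nonblocking operations and completing
# them later with a Waitall-style call.
class Request:
    """Stands in for the request handle returned by MPI_Isend/MPI_Irecv."""
    def __init__(self, op):
        self.op, self.done = op, False

    def wait(self):              # analogous to MPI_Wait on one handle
        self.done = True         # a real runtime would block until transfer ends
        return self.op

def waitall(requests):           # analogous to MPI_Waitall
    return [r.wait() for r in requests]

# Post sends and receives to two peers, then complete them all at once;
# only after waitall may the buffers be reused (sends) or read (receives).
reqs = [Request(("send", peer)) for peer in (1, 2)]
reqs += [Request(("recv", peer)) for peer in (1, 2)]
waitall(reqs)
assert all(r.done for r in reqs)
```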
2.3 Experimental Methodology
I executed the applications on the 12-node IBM SP2 machine at the University of Vic-
toria for gathering their communication traces, and on the 30-node IBM Deep Blue at the
IBM T. J. Watson Research Center for collecting their timing profiles. I wrote my own
profiling codes using the wrapper facility of MPI to gather the communication traces
and the timing profiles of these applications. I did this by inserting monitor operations in
the profiling MPI library for the communication related activities. These operations
include arithmetic operations for the calculation of the desired characteristics. It is worth
mentioning that gathering communication traces does not affect the communication pat-
terns of these applications. However, it affects the temporal properties of these applica-
tions. In Appendix A, I explain the approach used to remove the timing disturbances from
the timing profiles of the applications.
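The wrapper idea behind this profiling approach can be sketched as follows. The sketch is a simulation: real_send, profiled_send, and trace are illustrative names, not the thesis's actual profiling library, which wraps the MPI calls themselves.

```python
# Sketch of a profiling wrapper: intercept each send call, record a trace
# record (a monitor operation), then forward to the underlying routine.
# The communication pattern itself is unchanged by the interception.
trace = []

def real_send(dest, nbytes):
    """Stands in for the underlying MPI send routine."""
    pass

def profiled_send(dest, nbytes):
    trace.append((dest, nbytes))    # monitor operation: record the event
    return real_send(dest, nbytes)  # then perform the real communication

for dest, nbytes in [(1, 40), (2, 40), (1, 1024)]:
    profiled_send(dest, nbytes)
assert len(trace) == 3 and trace[0] == (1, 40)
```

The same interception point can accumulate timing samples, which is why the traces perturb the temporal properties (extra work per call) but not the spatial pattern of destinations.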
Chapter 3
Design and Evaluation of Latency Hiding/Reduction Message Destination Predictors
Interconnection networks and their services such as message delivery and flow control
are a major source of communication hardware latency in parallel computer systems. In
Section 3.1, I briefly describe message-passing computers and message switching layers.
Then, as a specific circuit switched interconnection network, I introduce a reconfigurable
optical network, RON (k, N), for message-passing parallel computers. The advantages of
such reconfigurable optical interconnects are their high bandwidth and their ability to pro-
vide versatile application-dependent network reconfigurations.
I characterize some communication properties of the parallel application benchmarks
by presenting their communication frequency and message destination distributions in
Section 3.2. I define the concepts of communication locality in message-passing parallel
applications, and caching in reconfigurable networks, in Section 3.3. I present evidence
that there exists message destination communication locality in the message-passing par-
allel applications in Section 3.3.1. Using classical replacement heuristics, LRU, LFU, and
FIFO, I show that message destinations display a form of locality.
I have utilized the message destination locality property of message-passing parallel
applications to devise a number of heuristics that can be used to predict the target of sub-
sequent communication requests. Thus, in Section 3.4, I contribute by proposing and eval-
uating (in terms of hit ratio) two different categories of hardware/software communication
latency hiding predictors for message-passing environments. By utilizing such predictors,
the hardware communication latency in reconfigurable interconnects can be effectively
hidden by reconfiguring the communication network concurrent to the computation. I
compare the performance and storage requirements of the proposed predictors in Section
3.5. In Section 3.6, I elaborate on how these predictors can be used and integrated into the
network interfaces. Finally, I summarize this chapter in Section 3.7.
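The hit-ratio style of evaluation used for these heuristics can be illustrated with a minimal sketch: an LRU-managed set of the k most recently used destinations stands in for k links that could have been kept configured, and the hit ratio measures how often a send finds its target already in the set. The destination stream below is illustrative, not a trace from the benchmarks.

```python
from collections import OrderedDict

def lru_hit_ratio(destinations, k):
    """Fraction of sends whose destination is already among the k most
    recently used ones (i.e., whose link could have stayed configured)."""
    cache, hits = OrderedDict(), 0
    for d in destinations:
        if d in cache:
            hits += 1
            cache.move_to_end(d)           # d becomes most recently used
        else:
            if len(cache) == k:
                cache.popitem(last=False)  # evict the least recently used
            cache[d] = True
    return hits / len(destinations)

# A repetitive destination stream shows strong locality even with k = 2:
stream = [1, 2, 1, 2, 1, 2, 3, 1, 2]
assert abs(lru_hit_ratio(stream, 2) - 4 / 9) < 1e-12
```

LFU and FIFO variants differ only in the eviction rule (least frequently used counter, or insertion order); comparing their hit ratios on the same stream is exactly the locality measurement described above.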
3.1 Introduction
Message-passing multicomputers are composed of a number of computing modules
that communicate with each other by exchanging messages through their interconnection
networks. Each computing module has its own processors, local memory, and communi-
cation assist/network interface. All local memories are private and are accessible only by
the local processors. Communication hardware latency, communication software latency,
and the user environment (multiprogramming, multiuser) are the major factors affecting
the performance of message-passing parallel computer systems.
Interconnection networks, and their services such as message delivery and flow control,
are a major source of communication hardware latency. Essentially, an interconnection
network is characterized by its topology, switching strategy, flow control mechanism, and
routing algorithm. The topology is the physical structure of the network. The interconnec-
tion network [46] might be a shared-medium network (such as Ethernet, Token Ring), a
direct network (such as mesh, torus), an indirect network (a multistage interconnection net-
work such as IBM SP [117], or irregular such as Myrinet [23]), or a hybrid network (such
as hypermesh) [117].
The routing algorithm determines which routes messages should follow through the
network to reach their destinations. There are many different routing algorithms with dif-
ferent guarantees and performance, such as Duato's adaptive routing [47], Glass and Ni's
turn-model routing [56], and up*/down* routing.
The flow control mechanism determines when the message, or packet, or portion of a
message should move along its route. Packets or flits may be blocked, buffered, discarded,
or detoured to an alternate route based on the flow control mechanism.
3.1.1 Message Switching Layers
The switching strategy determines how a message moves along its route. There are
many switching strategies. Circuit switching, packet switching, virtual cut-through, and
wormhole switching are the basic switching strategies [46]. In packet switching, messages
are divided into fixed-size packets. Each packet is routed individually from source to des-
tination and has to be buffered in each intermediate node. It is also called store-and-for-
ward switching. In virtual cut-through switching, the entire packet does not need to be
buffered in the nodes. The packet header can be examined, and after the routing decision is
made and the output channel is free, the header and the following data can be immediately
transmitted. In wormhole switching, the packet is broken up into flits. Wormhole switch-
ing pipelines the flits through the network just like the virtual cut-through switching strat-
egy, but it has reduced buffer requirements.
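The latency difference between these strategies can be seen with the standard textbook approximations (these formulas and the numbers below are illustrative, not measurements from this thesis): store-and-forward pays the full packet transmission time at every hop, while cut-through and wormhole pay the per-hop cost only for the header.

```python
# Back-of-the-envelope latency comparison for a packet of L bytes with an
# Lh-byte header crossing D hops on links of B bytes/s. Standard textbook
# approximations; routing and switching overheads are ignored.
def store_and_forward(L, D, B):
    return D * (L / B)             # whole packet buffered at every hop

def cut_through(L, Lh, D, B):
    return D * (Lh / B) + L / B    # only the header pays the per-hop cost

L, Lh, D, B = 4096, 16, 8, 1e9     # illustrative parameters
assert cut_through(L, Lh, D, B) < store_and_forward(L, D, B)
```

As D grows, store-and-forward latency scales with D * L while the pipelined strategies scale with D * Lh + L, which is why wormhole and virtual cut-through dominate in multi-hop networks; circuit switching removes even the header's per-hop cost once the path is established, at the price of the setup phase.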
In circuit switching, a physical path is reserved from a source to a destination before
the actual message transmission takes place. The routing header is injected into the net-
work. It reserves physical links as it is transmitted through intermediate nodes. A com-
plete path is set up when the routing header reaches the destination. Then an
acknowledgment is transmitted back to the source. Then, the message contents can be sent
along the reserved channels. The disadvantage is that during message transmission other
messages may be blocked. The advantage is the minimum message transfer latency, as the
physical path is already established.
In Chapter 3 through Chapter 5 of this thesis, I am interested in the circuit switching
strategy. As I explain later in Section 3.3, message destinations in message-passing paral-
lel applications display a form of locality. Thus, it is possible to use this communication
locality to pre-establish the physical links and thus hide the path setup time. This applies
both to electronic circuit switched interconnection networks, and to reconfigurable
optical interconnection networks. However, as I describe in Section 3.4, the prediction
techniques that I propose in this chapter would also reduce the communication time in
wormhole routed networks. In the next section, I consider a circuit switched reconfig-
urable optical interconnection network as a specific case.
3.1.2 Reconfigurable Optical Networks
Several topological properties, such as degree, average distance, and diameter, can be
used to evaluate and compare different interconnection networks. Most of these properties
can be derived from the underlying graph of an interconnection network, where processors
and communication links are mapped onto the vertices (nodes) and edges (links) of the
graph, respectively.
A graph consists of a set of vertices, V, interconnected by a set of edges, E, symbol-
ized as G = (V, E) [123]. The number of vertices and edges in a graph is N = |V| and |E|,
respectively. An edge e in E connects vertices u and v, written as e = uv, and is said to be
incident with u and v. A vertex v has degree d_v if it is incident with exactly d_v edges. A
path is a sequence of distinct vertices v_1, v_2, ..., v_k such that for every 1 <= i < k, the edge v_i v_{i+1} is
in E. The distance between u and v, dist(u, v), is the minimum length of a path between u
and v. The eccentricity of u is e(u) = dist(u, w), where w is a vertex such that
dist(u, w) = max over all vertices v of dist(u, v). The maximum eccentricity among all vertices is the
diameter of the graph.
I am interested in having a complete interconnection network, where any computing
node can communicate with any other node in a single hop. Complete interconnection net-
works can be modeled by a complete graph, K_N. A complete graph is a regular graph
where all N vertices are linked together and the diameter is one. Each vertex has degree d_G
equal to N - 1, and the number of edges, |E|, is N(N - 1)/2, far too high to be of practical
interest when N is large. These limitations prevent implementing complete networks using
metal-based interconnections as there is a fixed physical link between any two nodes.
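The definitions above can be checked mechanically. The sketch below computes distances by breadth-first search and verifies the degree and diameter properties of a small complete graph; the graph instance and function names are illustrative.

```python
from collections import deque

def distances(adj, u):
    """BFS distances dist(u, v) in an undirected graph given as adjacency lists."""
    dist = {u: 0}
    q = deque([u])
    while q:
        v = q.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    return dist

def eccentricity(adj, u):
    return max(distances(adj, u).values())   # e(u) = max_v dist(u, v)

def diameter(adj):
    return max(eccentricity(adj, u) for u in adj)

# Complete graph K4: every vertex has degree N - 1 and the diameter is 1.
N = 4
K4 = {u: [v for v in range(N) if v != u] for u in range(N)}
assert diameter(K4) == 1
assert all(len(K4[u]) == N - 1 for u in K4)
assert sum(len(K4[u]) for u in K4) // 2 == N * (N - 1) // 2   # |E| = N(N-1)/2
```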
Optics is ideally suited for implementing interconnection networks because of its
superior characteristics over electronics [100, 51, 74], such as higher interconnection den-
sity, higher bandwidth, suitability for reconfigurable interconnects, greater fan-in and fan-
out, lower error rate, freedom from planar constraints (light beams can easily cross each
other), immunity from electromagnetic fields and ground loops, and lower signal crosstalk.
Several research groups in academia and industry are working on different aspects of uti-
lizing optical interconnects in massively parallel processing systems, including works on
the feasibility study and technology related problems of optical interconnects, architec-
tures for optically interconnected computer systems, and communications and algorithmic
issues for such parallel systems [92, 19].
One of the main features of an optical interconnect is its capability to reconfigure. This
is very suitable for the construction of 3-D VLSI computers [5]. By interconnect reconfig-
uration, I simply mean the ability to change the interconnect dynamically upon
demand. In essence, the advantages of reconfigurable optical interconnects are due to their
ability to provide versatile application-dependent network configurations. Free-space opti-
cal interconnects are a class of optical interconnects that can support network reconfigura-
tion.
Free-space optical interconnects use free space (vacuum, air, or glass) for optical sig-
nal propagation. In free-space optical interconnects, optical signals can propagate very
close to each other and pass each other without interaction. They can easily exploit the third
spatial dimension, which dramatically increases the available communication bandwidth.
Free-space reconfigurable optical interconnects result in much denser interconnection net-
works than metal-based and guided-wave interconnections [29, 83], and have the potential
to solve the problems associated with implementing complete networks due to their ability
to reconfigure.
I introduce an abstract model [1] for a complete interconnection network using free-
space reconfigurable optical interconnects for massively parallel computers, and discuss
its characteristics.
Definition A reconfigurable optical network, RON (k, N), consists of N computing
nodes with their own local memory. A node is capable of connecting directly to any other
node. A node can establish k simultaneous connections. These connections are established
dynamically by reconfiguring the optical interconnect. The links remain established until
they are explicitly destroyed.
Messages are sent using circuit switching. That is, a connection must be established
between the source and destination pair before the message is sent. Each node has the abil-
ity to simultaneously send and receive k messages on its k links (the k-port model), or
exactly one message on one of its links (the single-port model). Full-duplex communica-
tion, where a node can send and receive messages at the same time, is supported. A simpli-
fied block diagram of the network is shown in Figure 3.1, where each node uses only one
of its links.
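A minimal sketch of this abstraction follows: each node keeps at most k configured links, established on demand and retained until evicted or destroyed. The class and method names are illustrative, not part of the thesis's formal model.

```python
# Minimal sketch of the RON(k, N) node abstraction: at most k links are
# configured at once; a link is set up on demand and kept until torn down.
class RONNode:
    def __init__(self, k):
        self.k = k
        self.links = set()          # currently configured destinations

    def connect(self, dest):
        """Establish a link to dest. Returns True if a reconfiguration
        (and hence the reconfiguration delay d) was actually incurred."""
        if dest in self.links:
            return False            # link reuse: no reconfiguration delay
        if len(self.links) == self.k:
            self.links.pop()        # tear down some link to make room
        self.links.add(dest)
        return True

node = RONNode(k=2)
assert node.connect(5) is True      # first use: pay the reconfiguration delay
assert node.connect(5) is False     # reuse: the delay is avoided
```

The choice of which link to tear down when all k are busy is exactly where the replacement heuristics (LRU, LFU, FIFO) discussed later in this chapter enter.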
Figure 3.1: RON (1, N), a massively parallel computer interconnected by a complete free-space optical interconnection network (dashed arrows denote potential links; solid arrows denote effective links)
Various implementation technologies exist to embody the above abstract model. Such
technologies include vertical-cavity surface-emitting lasers (VCSELs) for photon genera-
tion, self-electro-optic effect devices (SEEDs) for modulation, frequency hopping for cod-
ing, wavelength tuning for transmitters and receivers, computer generated holograms
(CGH), and deformable mirrors (DM) for switching and optical beam routing. The
switching in the case of CGH can be achieved by recording the desired source-destination
communication patterns. As stated in Chapter 1, deformable mirrors, such as Lucent's
WaveStar LambdaRouter [96], are also reaching maturity. Optical beam routing in a free-
space optical interconnection network often employs other external optical elements such
as mirrors, prisms, and lenses.
Each node has a fixed number of tunable transmitters for sending optical beams toward
its beam router, such as a computer generated hologram or a deformable mirror, to be redi-
rected to the receivers of the other nodes. Also, each node has a large number of fixed
receivers at its input ports. Some of these input ports may be used only for collective com-
munications operations while others may be used for pair-wise communications.
The path setup phase can be done by sending an encoded light beam to the beam router to
reprogram the computer generated hologram, or to deform the mirror, such that the actual
message can be delivered to the destination(s) directly. It can be done in two different
ways. First, the router (CGH or DM) upon receiving the message (which includes the pay-
load) stores the message in a buffer and then configures its output links so that it can for-
ward the message to the destination node(s). This approach needs a buffer for the entire
message at each beam router, which is of high cost. It also involves an extra copy. The bet-
ter approach is to send an optical beam having only the destination address to the beam
router for the path setup phase. Then, after some time, to be called the reconfiguration delay,
the second beam containing the actual message can be sent through the configured router
to its destination.
Collisions can happen at the receiving nodes considering the fact that several beams
may arrive at a destination node at the same time. Hence, a destination node may not be
able to complete the path setup phase, or accept the message. However, I assume that due
to the availability of a large number of fixed receivers at the destinations, connections are
established after some time (the reconfiguration delay).
I assume an unbounded number of available wavelengths for the system. However, in
the case of a limited number of available wavelengths, one can utilize spread-spectrum tech-
niques where each transmitter sends its information changing the wavelength in a pseudo-
random fashion. The receiver can reconstruct the transmitted message if it is aware of the
pseudo-random code used for encoding the sequence of wavelengths used during the
transmission.
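The spread-spectrum idea reduces to the transmitter and receiver deriving the same pseudo-random wavelength sequence from a shared code. The sketch below illustrates this with a seeded pseudo-random generator; the seed, hop count, and wavelength count are illustrative assumptions.

```python
import random

# Sketch of pseudo-random wavelength hopping: transmitter and receiver
# share a code (here, a PRNG seed), so the receiver can reproduce the
# transmitter's wavelength sequence and follow the transmission.
def hop_sequence(seed, n_hops, n_wavelengths):
    rng = random.Random(seed)
    return [rng.randrange(n_wavelengths) for _ in range(n_hops)]

tx = hop_sequence(seed=42, n_hops=8, n_wavelengths=16)
rx = hop_sequence(seed=42, n_hops=8, n_wavelengths=16)
assert tx == rx        # shared code: receiver reconstructs the sequence
```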
I am not interested in the technology itself, and implementation concerns are outside
the scope of this dissertation. Instead, I am particularly interested in the abstract model of
this network. I shall assume that one or more of the technologies outlined above will be
used to implement the proposed interconnect. Under such an implementation, the various
overheads associated with the reconfiguration of the network (such as beam steering, set-
ting up the computer-generated holograms, tuning the transmitters, or sending the fre-
quency code in a frequency hopping implementation, etc.) are lumped together as the
reconfiguration delay d. I assume that the reconfiguration delay, d, is constant most of the
time, but occasionally may be unbounded due to hot spots in applications.
3.1.2.1 Communication Modeling
An important concern is to model the communication time T required to send a mes-
sage from one node to another. I use the communication modeling of Hockney [64]. Hock-
ney's model characterizes the communication time for a point-to-point communication
operation as T = t_s + l_m / r_inf, where t_s is the start-up time, which is equal to the time needed
to send a zero byte message, and includes the time required to prepare the message, such
as adding a header and a trailer; l_m is the length of the message to be transmitted; and r_inf is
the asymptotic bandwidth in Mbytes per second, the maximum bandwidth achiev-
able when the message length approaches infinity. The communication time can be written
as T = t_s + l_m t, where t is the per unit transmission time and is equal to the reciprocal of
r_inf. For the RON (k, N), I amend the model by explicitly including the reconfiguration
delay d that is necessary for a node to configure a link that would connect directly to its
target node(s). The transmission time then becomes T = d + t_s + l_m t.
The time on the fly, l_m t, for small messages is negligible compared to the setup time,
t_s, and the reconfiguration delay, d. In the current generation of parallel computer systems,
the setup time, t_s, is several tens of microseconds [43]. Several researchers are working to
minimize the setup time by using user-level messaging techniques such as Active Mes-
sages (AM) [135] and Fast Messages (FM) [102]. In Chapter 6, I discuss issues regarding
the software overhead component of the communication latency. I utilize the prediction
techniques proposed in this chapter to reduce the communication latency by avoiding
unnecessary memory copying operations at the receiver side of communications.
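As a worked example of the amended model T = d + t_s + l_m t, the sketch below plugs in illustrative parameter values (assumptions for the sake of the example, not measurements from this thesis) and confirms that for a 40-byte message the on-the-fly time is dwarfed by t_s + d.

```python
# Worked example of the amended Hockney model T = d + t_s + l_m * t.
# All parameter values below are illustrative assumptions.
def comm_time(l_m, t_s, r_inf, d=0.0):
    t = 1.0 / r_inf                 # per-byte transmission time (t = 1/r_inf)
    return d + t_s + l_m * t

t_s = 40e-6                         # 40 us start-up time
r_inf = 100e6                       # 100 Mbytes/s asymptotic bandwidth
d = 20e-6                           # 20 us reconfiguration delay

# For a 40-byte message, the on-the-fly time l_m * t is 0.4 us,
# negligible next to t_s + d = 60 us:
T = comm_time(40, t_s, r_inf, d)
assert abs(T - 60.4e-6) < 1e-9
```

This is exactly why hiding d (and t_s) matters for small-message workloads such as LU: the fixed per-message costs dominate T almost entirely.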
In this chapter, I am particularly interested in the techniques that hide the reconfigura-
tion delay, d. For this, and for the first time as far as the author is aware, I propose and evalu-
ate different communication latency hiding predictors at the send side of communications
in message-passing systems using reconfigurable networks so that the reconfiguration
delay can be hidden. In essence, by utilizing such predictors, the hardware communication
latency in reconfigurable interconnects can be effectively hidden by reconfiguring the
communication networks concurrent to the computations.
3.2 Communication Frequency and Message Destination Distribution
Several researchers have investigated the communication behavior of parallel applica-
tions [30, 75, 65, 72, 37]. Chodnekar and his colleagues [30] have developed a traffic char-
acterization methodology for parallel applications. They have considered the inter-arrival
time distribution of messages (send calls), spatial message distribution, and the message
volume in message-passing and shared-memory applications. Kim and Lilja [75] exam-
ined the communication patterns of message-passing parallel scientific programs in terms
of message size, message destination, and generation distributions for the send time,
receive time, and computation time. Hsu and Banerjee [65] analyzed the communication
characteristics of parallel CAD applications on a hypercube. Karlsson and Brorsson [71]
have compared the communication properties of parallel applications in message-passing
applications using MPI, and shared memory applications using TreadMarks [10]. Lahaut
and Germain [37] have shown that in scientific applications written in High Perfor-
mance Fortran (HPF) [88] a large part of communications can be known from the analysis
of the code. This is called static communications, communications that can be known at
compile-time, in contrast to dynamic communications, where communications can be
determined only at run-time.
Essentially, communication properties of parallel applications can be categorized by
the spatial, temporal, and volume attributes of the communications [30, 75, 65]. The tem-
poral attribute of communications in parallel applications characterizes the rate of mes-
sage generations, and the rate of computations. I present the cumulative distribution
function of the inter-send computation times of the applications studied in this thesis in
Chapter 4.
The volume of communications is characterized by the number of messages, and the
distribution of message sizes in the applications. In this chapter, I am particularly inter-
ested in the number of messages. In Chapter 4, I show the distribution of message sizes in
the parallel applications.
One of the communication volume characteristics of parallel applications is the fre-
quency of send messages. I use a number of parallel benchmarks, as introduced in Chapter
2, and extract their communication traces. The processes in these applications use block-
ing and nonblocking standard MPI send primitives, namely MPI_Send, MPI_Isend, and
MPI_Sendrecv_replace [92]. Figure 3.2 illustrates the number of send communication
calls per process in the applications under different system sizes. I executed all applica-
tions once for each different system size and counted the number of send calls for each
process of the applications. Hence, in Figure 3.2, by average, minimum, and maximum, I
mean the average, minimum, and maximum number of send calls taken over all processes
of each application. It is evident that processes in the BT, SP, CG, and QCDMPI applica-
tions have the same number of send communication calls for each different system size.
This is also true for LU, MG, and PSTSWM when the number of processes is four, four
and eight, and a power of two, respectively.
The spatial attribute of communications in parallel applications is characterized by the
distribution of message destinations. It is commonly assumed that the message destina-
tions are evenly distributed among all of the processes, although an individual process may
not see a uniform message destination distribution [75, 30].
Figure 3.2: Number of send calls per process in the applications under different system sizes
In MPI, the send operation (the MPI_Send, MPI_Isend, and MPI_Sendrecv_replace com-
munication calls in the parallel applications studied in this thesis) associates an envelope
with a message. Messages, in addition to the data part, carry information that can be used to
distinguish messages and selectively receive them. This information consists of a fixed
number of fields, which is collectively called the message envelope. These fields are the
source process of a message, source; the destination process of a message, dest; the mes-
sage tag, tag; and the message communicator, comm. The message source is implicitly
determined by the identity of the message sender and need not be explicitly carried by
messages. The other fields are specified by arguments in the send operation. The destina-
tion process is specified by the dest argument. The integer-valued message tag is specified
by the tag argument. This integer can be used by the program to distinguish different types
of messages. A communicator specifies the communication context for a communication
operation. It also specifies the set of processes that share this communication context.
Each communication context provides a separate communication universe. Messages are
always received within the context they were sent, and messages sent in different contexts
do not interfere. The BT, SP, and PSTSWM applications use a number of different com-
municators, including the predefined communicator, MPI_COMM_WORLD, provided by
MPI, while the other parallel applications, CG, MG, LU, and QCDMPI, use only the pre-
defined communicator.
As stated above, a message envelope consists of source, dest, tag, and comm. The
source and tag of a message envelope do not affect the link establishment phase for a mes-
sage transmission to a destination process. Thus, I assigned a different identifier, called the
unique message destination identifier, to each <dest, comm> tuple found in the communi-
cation traces of the applications. For simplicity, from now on, I use the term "message
destination" instead of unique message destination identifier. Figure 3.3 shows the mini-
mum, average, and maximum number of message destinations per process in the applica-
tions under different system sizes. It is evident that processes in all applications
communicate with only a favorite subset of all other processes. Note that processes in the
BT and SP applications, in contrast to the other applications, have the same number of
message destinations under different system sizes (except when N is four). This is also
Figure 3.3: Number of message destinations per process in the applications under different system sizes
true for CG when the number of processes is 8 and 16, and for MG when it is 4 and 8. Meanwhile, in all applications except BT and SP, the number of message destinations increases when the number of processes increases (note the exception cases in PSTSWM and QCDMPI when the number of processes increases from 32 to 36).
Figure 3.4 illustrates the distribution of message destinations in the applications when the number of processes is 64. The BT, SP, CG, PSTSWM, and QCDMPI applications verify the assumption that the message destinations are uniformly distributed among all of the processes. MG shows an almost uniform message destination distribution. However, LU presents three different peaks for message destinations.
Figure 3.5 shows the distribution of message destinations for one of the processes, process zero, of the applications when the number of processes is 64. I choose process zero because it is a favorite destination of all processes and is usually responsible for distributing data and verifying the results of the computation. It is clear that this process tends to communicate with only a favorite subset of all other processes in the applications. I have found similar results for all other processes in each application, as can be seen in Figure 3.4.
3.3 Communication Locality and Caching
I define the terms message destination communication locality and caching in conjunction with this work as follows. By message destination communication locality I mean that if a certain source-destination pair has been used, it will be re-used with high probability by a portion of code that is "near" the place it was used earlier, and that it will be re-used in the near future. If communication locality exists in parallel applications, then it is possible to cache the configuration that a previous communication request has made and reuse it at a later stage. Caching in the context of this discussion will mean that when a communication channel is established it will remain established until it is explicitly destroyed. As already mentioned, in the context of free-space optical interconnects, maintaining an established communication channel does not interfere with communications that are in progress in other parts of the network.
Figure 3.4: Distribution of message destinations in the applications when N = 64
Figure 3.5: Distribution of message destinations in the applications for process zero, when N = 64
In the message-passing programming paradigm, many parallel algorithms are built from loops consisting of computation and communication phases. Therefore, communication patterns may be repetitive. This has motivated researchers to find the communication locality properties of parallel applications [75, 68]. Kim and Lilja [75] have recently shown that there is a locality in message destinations, message sizes, and consecutive runs of send and receive primitives in parallel algorithms. They have proposed and expanded the concept of memory access locality based on the Least Recently Used (LRU) [68] stack model to determine these localities.
In the following subsection, I expand on the work by Kim and Lilja [75] by utilizing the FIFO and LFU heuristics on the applications to see the existence of message destination communication locality, or repetitive message destinations. I use the term hit ratio to establish and compare the performance of these heuristics. If the next message destination is already in the set of message destinations maintained by the LRU, LFU, and FIFO heuristics, I count a hit; otherwise, I count a miss. It is clear that the hit ratio is equal to the number of hits divided by the total number of hits and misses.
3.3.1 The LRU, FIFO and LFU Heuristics
The Least Recently Used (LRU), First-In-First-Out (FIFO), and Least Frequently Used (LFU) heuristics all maintain a set of k (k is the window size) message destinations. If the next message destination is not in the set, then it replaces one of the destinations in the set according to which of the LRU, FIFO or LFU strategies is adopted. The window size, k, corresponds to the number of input/output ports used in RON(k, N). Figure 3.6 shows the results of the LRU, FIFO, and LFU heuristics on the applications when the number of processes is 64. Figure 3.7, Figure 3.8 and Figure 3.9 illustrate the size scalability of these heuristics on the applications. It is clear that the hit ratios in all applications approach 1 as the window size increases. The performance of the FIFO algorithm is almost the same as the LRU for all benchmarks. However, the LFU algorithm has a better performance than the LRU and FIFO heuristics; the exception is the LU benchmark, when k = 2 and N = 16, 32, and 64.
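The three window heuristics and the hit-ratio bookkeeping can be sketched as follows. This is an illustrative Python sketch of my own, not thesis code; `trace` stands for a per-process sequence of message destinations extracted from a communication trace.

```python
from collections import Counter

def hit_ratio(trace, k, policy):
    """Fraction of requests whose destination is already in the k-entry window."""
    window = []        # current set of cached destinations, oldest first
    freq = Counter()   # use counts, needed for LFU victim selection
    hits = 0
    for dest in trace:
        if dest in window:
            hits += 1
            if policy == "LRU":            # a hit refreshes recency order
                window.remove(dest)
                window.append(dest)
        else:
            if len(window) == k:           # window full: evict one entry
                if policy == "LFU":
                    victim = min(window, key=lambda d: freq[d])
                else:                      # LRU and FIFO both evict window[0]
                    victim = window[0]
                window.remove(victim)
            window.append(dest)
        freq[dest] += 1
    return hits / len(trace)
```

Under this formulation LRU and FIFO differ only in whether a hit refreshes an entry's position, which is consistent with their near-identical measured performance, while LFU evicts the least-used destination and so tends to retain the favorite destinations noted above.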
Figure 3.6: Comparison of the LRU, FIFO, and LFU heuristics when N = 64
Figure 3.7: Effects of the scalability of the LRU, FIFO, and LFU heuristics on the BT, SP and CG applications
Figure 3.8: Effects of the scalability of the LRU, FIFO, and LFU heuristics on the MG and LU applications
Figure 3.9: Effects of the scalability of the LRU, FIFO, and LFU heuristics on the PSTSWM and QCDMPI applications
Basically, the LRU, FIFO and LFU heuristics do not predict exactly the next message destination but show the probability that the next message destination is in the message destination set of the LRU, FIFO and LFU heuristics, respectively. For instance, the PSTSWM application shows nearly a 70% hit ratio for a window size of seven under the LRU heuristic when the number of processes is 64. This means that 70% of the time, one of the seven most recent message destinations will be used in the next message. The LRU, FIFO, and LFU heuristics perform better when k is sufficiently large. However, this adds to the hardware complexity, as k links should be set up and remain active before the next message is ready to be sent.
I am interested in having predictors that can predict the next message destination with a high probability, and work under single-port modeling to minimize the cost of hardware implementation. In the following section, I propose a number of novel message destination predictors.
3.4 Message Destination Predictors
As noted earlier, a node sends a message to another node by first establishing a link to the target (hence the reconfiguration delay) and then sending the actual message over the established link. It is obvious that if the link is already in place, then the configuration phase does not enter the picture, with a commensurate saving in the message transmission time. I would like to establish efficient algorithms where the link establishment costs are minimized. The stated objective can be accomplished if the target of the communication operation can be predicted before the message itself is available. In this way, the communication pathway can be established and be ready to be used as soon as the message to be sent becomes available.
There are several ways of accomplishing this. If the communication operation is regular and known, then it is possible that one can determine the destinations and the instances at which these shall be used. I have developed such algorithms for broadcasting/multibroadcasting [1] and discuss them in Chapter 5. However, if the algorithm is not known, as is usually the case for point-to-point communications, the approach mentioned above cannot be used.
Prediction techniques have been proposed in the past to predict the future accesses of sharing patterns and coherence activities in distributed shared memory (DSM) by looking at their observed behavior [96, 77, 73, 133, 34, 107]. These techniques assume that memory accesses and coherence activities in the near future will follow past patterns. Sakr and his colleagues have used time series and neural networks for the prediction of the next memory sharing requests [107]. Dahlgren and his colleagues devised hardware regular stride techniques to prefetch several blocks ahead of the current data block [34]. More elaborate hardware-based irregular stride prefetching approaches have been proposed by Zhang and Torrellas [133]. Kaxiras and Goodman have recently proposed an instruction-based approach which maintains the history of load and store instructions in relation to cache misses and predicts their future behavior [73]. This is in contrast to address-based techniques that keep data-access history for the predictions. Mukherjee and Hill proposed a general pattern-based predictor, cosmos, to learn and predict the coherence activity for a memory block in a DSM [96]. Cosmos makes a prediction in two steps. First, it uses a cache block address to index into a message history table to obtain the <processor, message-type> tuples of the last few coherence messages received for that cache block. Then it uses these <processor, message-type> tuples to index a pattern history table to obtain a <processor, message-type> tuple prediction. In a recent paper, Lai and Falsafi proposed a new class of pattern-based predictors, memory sharing predictors, to eliminate the coherence overhead on a remote access latency by predicting just the memory request messages, those primary messages that invoke a sequence of protocol actions [77]. It improves prediction accuracy over cosmos by eliminating the acknowledgment messages from the pattern tables. It also reduces memory overhead and perturbation in the tables due to message re-ordering. Both works in [96, 77] are adaptations of Yeh and Patt's two-level PAp branch predictor [131]. PAp is a two-level adaptive branch predictor based on the past behavior of the same branch.
In software-controlled prefetching, the programmer or compiler decides when and what to prefetch by analyzing the code and inserting prefetch instructions. Mowry and Gupta [95] have used software-controlled prefetching and multithreading to hide and reduce the latency in shared memory multiprocessors.
As stated above, many prediction techniques have been proposed to reduce or hide the latency of a remote memory access in shared memory systems. However, to the best of my knowledge, no prediction technique has been proposed to predict the next message destination for message-passing systems to hide the latency of the reconfiguration delay in reconfigurable networks.
I explore the effect that a number of heuristics have in predicting the target of a communication request. The set of predictors proposed in this section [2, 3] predict the message destination of a subsequent communication request based on a past history of communication patterns on a per source process basis. These predictors can be used dynamically at the communication assist or network interface with or without the help of the programmer or a compiler.
Actually, I propose two sets of predictors in this thesis: Cycle-based predictors, which are pure dynamic predictors, and Tag-based predictors, which are static/dynamic predictors. In the Cycle-based predictors, Single-cycle, Single-cycle2, Better-cycle, and Better-cycle2, predictions are done dynamically at the network interface without any help from the programmer or compiler. In the Tag-based predictors, Tagging, Tag-cycle, Tag-cycle2, Tag-bettercycle, and Tag-bettercycle2, predictions are done dynamically at the network interface as well, but they require some information to be passed from the program to the network interface. This can be done with the help of the programmer and/or the compiler through inserting instructions such as pre-connect (tag) in the program. The Tag-based predictors can be pure dynamic predictors if another level of prediction is done on the tags themselves at the network interface. This way, there is no need for the program to pass pre-connect (tag) information to the network interface. I leave this approach for future research.
It is worth mentioning that these predictors can be used in any circuit-switched network, including the works proposed in [36, 132]. Dao and his colleagues [36] exploit the communication locality to improve the performance of parallel computers using wave switching, a hybrid switching technique for high performance routers in electronic interconnection networks. Wave switching combines wormhole switching and circuit switching in the same router architecture to reduce the fixed overhead of communication latency by exploiting communication locality. Thus, it is possible to reduce latency for communications that display locality and use pre-established physical circuits. Yuan and others [132] use the communication locality in circuit-switched time-multiplexed optical interconnection networks. They rely upon existing techniques for identifying communication patterns such that their compiled communication algorithms compute the minimal multiplexing degree required for establishing all-optical paths from sources to destinations in such networks.
The predictors can even be useful in reducing the latency in current commercial networks. For example, Myrinet networks [23] have a relatively long routing time compared with the link transmission time. Predictors would allow sending the routing header in advance for the predicted message destination. When the message becomes available, it can be directly transmitted through the network if the prediction was correct, thus reducing latency significantly. In case of a mis-prediction, a message tail is forwarded to tear the path down. Obviously, null messages must be discarded at the destination.
As in the LRU, LFU, and FIFO heuristics, I use the hit ratio to establish and compare the performance of these predictors. As a hit ratio, I define the percentage of times that the predicted message destination was correct out of all communication requests. The hit ratios presented for the performance of the predictors are either the minimum, the average, or the maximum of the hit ratios taken over all nodes of each application.
3.4.1 The Single-cycle Predictor
The Single-cycle predictor is based on the fact that if a group of message destinations are requested repeatedly in a cyclical fashion, then a single port can accommodate these requests by ensuring that the connection to the subsequent message destination in the cycle can be established as soon as the current request terminates. This predictor implements a simple cycle discovery algorithm. Starting with a cycle-head message destination (this is the first message destination that is requested at start-up, or the one that causes a miss), I log the sequence of requests until the cycle-head is requested again. This stored sequence constitutes a cycle, and can be used to predict the subsequent requests. If the predicted message destination coincides with the subsequent requested message destination, then I record a hit. Otherwise, I record a miss and the cycle formation stage commences with the cycle-head being the message destination that caused the miss.
Figure 3.10 illustrates an example of the operation of the Single-cycle predictor. The top trace represents the sequence of requested message destinations, while the bottom trace represents the predicted message destinations according to the Single-cycle predictor. The arrows with the cross represent misses, while the ones with the circle represent hits. The "dash" in place of a predicted message destination indicates that a cycle is being formed, and therefore no predicted message destination is offered (note that this is also added to the misses).
Figure 3.10: Operation of the Single-cycle predictor on a sample request sequence
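The cycle discovery algorithm described above can be sketched as follows; this is an illustrative Python sketch of my own, not the simulator code used in this thesis. During formation no prediction is offered, and every formation step counts as a miss.

```python
class SingleCycle:
    """Single-cycle predictor: discover one cycle and replay it."""

    def __init__(self):
        self.forming = True   # True while a cycle is being logged
        self.head = None      # current cycle-head destination
        self.cycle = []       # logged cycle of destinations
        self.pos = 0          # index of the next prediction in the cycle

    def observe(self, dest):
        """Feed the next requested destination; return True on a hit."""
        if self.forming:
            if self.head is None:
                self.head, self.cycle = dest, [dest]
            elif dest == self.head:
                # cycle closed; next prediction is the element after the head
                self.forming = False
                self.pos = 1 % len(self.cycle)
            else:
                self.cycle.append(dest)
            return False      # no prediction offered during formation
        if dest == self.cycle[self.pos]:
            self.pos = (self.pos + 1) % len(self.cycle)
            return True
        # miss: restart formation with the missed destination as the new head
        self.forming, self.head, self.cycle = True, dest, [dest]
        return False
```

On a purely cyclic trace such as 1, 3, 5 repeated three times, the sketch misses on the first four requests (three formation steps plus the closing request) and hits on the remaining five, and a destination repeated forever incurs exactly two misses, as discussed for cycles of length one below.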
Figure 3.11 shows the behavior of this algorithm. The performance of the Single-cycle predictor is very good on CG, LU, MG (except when N = 4, 8), and on BT and SP (except when N = 4). The Single-cycle predictor behaves poorly on the PSTSWM (except when N = 36, 49) and QCDMPI applications.

The performance of the Single-cycle predictor is much better than the LRU, FIFO and LFU heuristics under single-port modeling for the LU and CG benchmarks, for the MG and PSTSWM applications (except when N = 4, 8), and for BT and SP (except when N = 4). However, the performance for QCDMPI is almost the same. Note that I compare the performance of the predictors with the LRU, LFU, and FIFO heuristics under single-port modeling for the same optical interconnect implementation cost, although the proposed
Figure 3.11: Effect of the Single-cycle predictor on the applications
predictors have higher memory requirements (refer to Section 3.5.1). Figure 3.12 compares the performance of the Single-cycle predictor with the LRU, LFU, and FIFO under single-port modeling when N = 64.
Figure 3.12: Comparison of the performance of the Single-cycle predictor with the LRU, LFU, and FIFO heuristics on the applications under single-port modeling when N = 64
3.4.2 The Single-cycle2 Predictor
In the communication traces of some of the applications, there exist cycles of length one (such as the one composed of the requested message destination 7 in Figure 3.10). For these situations, there will always be two misses until the predictor determines that there is a cycle of length one. The Single-cycle2 predictor is identical to the Single-cycle predictor with the addition that during cycle formation, the previously requested message destination is offered as the predicted message destination. If a miss occurs during cycle formation, the formation phase continues until a cycle is formed. Then and only then do misses cause a new cycle formation phase to begin. I applied the Single-cycle2 predictor to the request sequence of the previous example, as shown in Figure 3.13. As was expected, the Single-cycle2 predictor reacts better to cycles of length one.
Figure 3.13: Operation of the Single-cycle2 predictor on the sample request sequence
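The modification can be sketched as a small change to the Single-cycle logic: while a cycle is being formed, the previously requested destination serves as the prediction. Again, this is an illustrative sketch of my own, not thesis code.

```python
class SingleCycle2:
    """Single-cycle with last-destination prediction during formation."""

    def __init__(self):
        self.forming = True
        self.head = None
        self.cycle = []
        self.pos = 0
        self.prev = None      # previously requested destination

    def observe(self, dest):
        """Feed the next requested destination; return True on a hit."""
        if self.forming:
            hit = (dest == self.prev)   # formation-phase prediction
            if self.head is None:
                self.head, self.cycle = dest, [dest]
            elif dest == self.head:
                self.forming = False
                self.pos = 1 % len(self.cycle)
            else:
                self.cycle.append(dest)
            self.prev = dest
            return hit
        self.prev = dest
        if dest == self.cycle[self.pos]:
            self.pos = (self.pos + 1) % len(self.cycle)
            return True
        # only after a completed cycle does a miss restart formation
        self.forming, self.head, self.cycle = True, dest, [dest]
        return False
```

A destination repeated forever now costs a single miss instead of two: the second request already matches the previous one while it simultaneously closes the length-one cycle.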
Figure 3.14 illustrates the performance of the Single-cycle2 predictor. This predictor has a better performance than the Single-cycle algorithm.
Figure 3.14: Effect of the Single-cycle2 predictor on the applications
3.4.3 The Better-cycle and Better-cycle2 Predictors
In the Single-cycle and Single-cycle2 algorithms, as soon as a message destination breaks a cycle, I discard the cycle and start forming a new cycle with this message destination as the new cycle-head. Then I just rely upon the new cycle to predict the next message destination. The Single-cycle and Single-cycle2 predictors could achieve a better performance if the previous cycle information was not discarded as a new cycle is formed.
In the Better-cycle predictor, each cycle-head has its own cycle. For this, I keep the last cycle associated with each cycle-head encountered in the communication pattern of each process. This means that when a cycle breaks, I keep this cycle in memory for the corresponding cycle-head for later references. When a cycle breaks, if I haven't already seen the new cycle-head, then I form a cycle for it; otherwise, I predict the next message destination based on the members of the cycle associated with this cycle-head that I have from the past in memory. If the predicted message destination coincides with the subsequent requested message destination, then I record a hit. If not, then I record a miss and revise the cycle for this cycle-head. The state diagram of this predictor is shown in Figure 3.15.
Figure 3.15: State diagram of the Better-cycle predictor
The top left state is the "cycle formation phase", initiated with a cycle-head. This is the same as the cycle formation phase in the Single-cycle predictor. Upon a cycle completion, I enter the "cycle prediction phase". In case of a mis-prediction in the "cycle prediction phase", I move back to the "cycle formation phase" if the new cycle-head has not been visited so far (that is, there is no cycle associated with this new cycle-head in the memory). Otherwise, I move forward to the "cycle prediction phase for the new cycle-head". I move back to the "cycle prediction phase" after one complete cycle to continue the predictions for this new cycle-head. In case of a mis-prediction during the first cycle of predictions in the "cycle prediction phase for the new cycle-head", I move to the "cycle-revision phase" to revise the cycle for this new cycle-head. It is clear that after the revision phase, I move to the "cycle prediction phase" for the next cycles of predictions.
Figure 3.16 illustrates the operation of the Better-cycle predictor on the sample request sequence. It is clear that the first cycle associated with cycle-head 1 consists of message destinations 1, 3, 5, and 6. However, in the fourth appearance of this cycle-head a revised cycle forms which contains message destinations 1, 3, and 2.
Figure 3.16: Operation of the Better-cycle predictor on the sample request sequence
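A simplified sketch of this idea keeps one remembered cycle per cycle-head. It is my own illustrative approximation of the state machine in Figure 3.15: it collapses the separate cycle-revision phase into ordinary re-formation, so it captures the reuse of stored cycles but not every transition of the full predictor.

```python
class BetterCycle:
    """Cycle predictor that remembers the last cycle seen for each head."""

    def __init__(self):
        self.memory = {}      # cycle-head -> last cycle discovered for it
        self.forming = True
        self.head = None
        self.cycle = []
        self.pos = 0

    def _restart(self, dest):
        """On a miss, reuse the stored cycle for `dest` if one exists."""
        self.head = dest
        if dest in self.memory:
            self.forming = False
            self.cycle = self.memory[dest]
            self.pos = 1 % len(self.cycle)
        else:
            self.forming = True
            self.cycle = [dest]

    def observe(self, dest):
        """Feed the next requested destination; return True on a hit."""
        if self.forming:
            if self.head is None:
                self.head, self.cycle = dest, [dest]
            elif dest == self.head:
                self.memory[self.head] = self.cycle   # remember this cycle
                self.forming = False
                self.pos = 1 % len(self.cycle)
            else:
                self.cycle.append(dest)
            return False
        if dest == self.cycle[self.pos]:
            self.pos = (self.pos + 1) % len(self.cycle)
            return True
        self._restart(dest)
        return False
```

The benefit shows up when a previously seen cycle returns: on a trace such as 1, 3, 5, 1, 3, 5 interrupted by a run of 7s and then resumed, the stored cycle for head 1 lets prediction restart immediately instead of paying a full re-formation phase, as the Single-cycle predictor would.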
The performance of the Better-cycle predictor on the benchmarks is shown in Figure 3.17. It is evident that its performance is exceptionally better for all benchmarks compared to the Single-cycle predictor, except for the QCDMPI benchmark when N = 25, 32, 36 and 49.
Figure 3.17: Effect of the Better-cycle predictor on the applications
The Better-cycle2 predictor is identical to the Better-cycle predictor with the addition that during the cycle formation and cycle revision phases, the previously requested message destination is offered as the predicted message destination. Figure 3.18 illustrates the operation of the Better-cycle2 predictor on the same sample request sequence.
Figure 3.18: Operation of the Better-cycle2 predictor on the sample request sequence
The Better-cycle2 predictor has a better performance than the Single-cycle, Single-cycle2, and Better-cycle predictors for the QCDMPI benchmark. The performance of this predictor is shown in Figure 3.19. It is worth mentioning that I found that the applications have a very small number of cycle-heads (at most 9) under the Better-cycle and Better-cycle2 predictors and different system sizes. Section 3.5.1 discusses the memory requirements of all predictors proposed in this thesis.
Figure 3.19: Effect of the Better-cycle2 predictor on the applications
3.4.4 The Tagging Predictor
The Tagging predictor assumes a static communication environment in the sense that a particular communication request (send) in a section of code will be to the same message destination with a large probability. Therefore, as the execution trace nears the section of code in question, it can cause the communication subsystem to establish the connection to the target node before the actual communications request is issued. This can be implemented with the help of the compiler or by the programmer through a pre-connect (tag) operation which will force the communication system to establish the communication connection before the actual communication request is issued. As noted earlier, for this predictor and the other Tag-based predictors, I can avoid the help from the compiler or the programmer by predicting the tag itself at the network interface. This way, there is no need for the program to pass pre-connect (tag) information to the network interface. However, the performance of these 2-level Tag-based prediction techniques has not been evaluated yet.
I attach a different tag (this is different than the tag in an MPI communication call; it may be a unique identifier or the program counter at the address of the communication call) to each of the communication requests found in the applications. This tag is passed to the communication subsystem by the pre-connect (tag) operation. To this tag, and at the communication assist, I assign the requested message destination the first time a link is established. A hit is recorded if, in subsequent encounters of the tag, the requested message destination is the same as the one already associated with the tag. Otherwise, a miss is recorded and the tag is assigned the newly requested message destination.
The performance of the Tagging predictor is presented in Figure 3.20. As can be seen, the Tagging predictor results in excellent performance (hit ratios in the upper 90%) for all the application benchmarks except CG, PSTSWM, and QCDMPI. The reason is that these benchmarks include send operations with message destinations calculated based on loop variables. Thus, the same section of code cycles through a number of different message destinations. As we have seen earlier, the Better-cycle and Better-cycle2 predictors are excellent in discovering such cyclic occurrences for the CG and PSTSWM benchmarks. Meanwhile, the Better-cycle2 predictor has a better performance for the QCDMPI benchmark compared to the Tagging predictor.
Figure 3.20: Effects of the Tagging predictor on the applications
3.4.5 The Tag-cycle and Tag-cycle2 Predictors
The Tagging predictor does not have a good performance on the CG, PSTSWM, and QCDMPI benchmarks, while the Single-cycle and Single-cycle2 predictors showed good results for the CG benchmark. I combine the Tagging algorithm with the Single-cycle algorithm and call it the Tag-cycle algorithm.
In the Tag-cycle predictor, I attach a different tag to each of the communication requests found in the application benchmarks and do a Single-cycle discovery algorithm on each tag. To this tag, and at the communication assist, I assign the requested message destination, to be called the tagcycle-head message destination (this is the first message destination that is requested at this tag, or the one that causes a miss). I log the sequence of the requests at this tag until the tagcycle-head is requested again. This stored sequence constitutes a cycle at each tag, and can be used to predict the subsequent requests. A hit is recorded if, in subsequent encounters of the tag, the requested message destination is the same as the predicted one in the cycle. If not, then I record a miss and the cycle formation stage begins with the tagcycle-head being the message destination that caused the miss. The Tag-cycle predictor performs exceptionally well across all the benchmarks except for the QCDMPI benchmark, as shown in Figure 3.21.
Figure 3.21: Effects of the Tag-cycle predictor on the applications
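The combination can be sketched as one Single-cycle discovery instance per tag; the inner class below reproduces the Single-cycle logic in compact form. As before, this is an illustrative sketch of my own, not thesis code.

```python
class TagCycle:
    """Tag-cycle predictor: a Single-cycle discovery instance per tag."""

    class _Cycle:
        def __init__(self):
            self.forming, self.head, self.cycle, self.pos = True, None, [], 0

        def observe(self, dest):
            if self.forming:
                if self.head is None:
                    self.head, self.cycle = dest, [dest]
                elif dest == self.head:           # cycle at this tag closed
                    self.forming, self.pos = False, 1 % len(self.cycle)
                else:
                    self.cycle.append(dest)
                return False
            if dest == self.cycle[self.pos]:
                self.pos = (self.pos + 1) % len(self.cycle)
                return True
            self.forming, self.head, self.cycle = True, dest, [dest]
            return False

    def __init__(self):
        self.per_tag = {}   # tag -> its own cycle-discovery state

    def observe(self, tag, dest):
        """Feed a (tag, destination) request; return True on a hit."""
        return self.per_tag.setdefault(tag, TagCycle._Cycle()).observe(dest)
```

Because each tag keeps its own state, a call site that cycles through destinations driven by a loop variable is predicted after one pass through its loop, independently of what other call sites do in between.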
The Tag-cycle2 predictor is identical to the Tag-cycle predictor with the addition that during cycle formation, similar to the Single-cycle2 predictor, the previously requested message destination is offered as the predicted one. The performance of the Tag-cycle2 predictor, as shown in Figure 3.22, is better than the Tagging and Tag-cycle predictors for all benchmarks.
Figure 3.22: Effects of the Tag-cycle2 predictor on the applications
3.4.6 The Tag-bettercycle and Tag-bettercycle2 Predictors
The Better-cycle and Better-cycle2 algorithms have better performance on the parallel applications than the Single-cycle and Single-cycle2 algorithms. Therefore, I combine the Better-cycle and Better-cycle2 algorithms with the Tagging algorithm to get better performance than the Tag-cycle and Tag-cycle2 algorithms. I call these the Tag-bettercycle and Tag-bettercycle2 predictors. The performance of these two predictors is shown in Figure 3.23 and Figure 3.24.
Figure 3.23: Effects of the Tag-bettercycle predictor on the applications
In the Tag-bettercycle predictor, I attach a different tag to each of the communication requests found in the benchmarks and do a Better-cycle discovery algorithm on each tag. To this tag, and at the communication assist, I assign the requested target node, to be called the tagbettercycle-head node. The Tag-bettercycle2 predictor is identical to the Tag-bettercycle predictor with the addition that during cycle formation, similar to the Better-cycle2 predictor, the previously requested message destination is offered as the predicted message destination. The performance of Tag-bettercycle for the QCDMPI benchmark is better than the Tag-cycle algorithm, but not better than the Tag-cycle2 predictor. However, the Tag-bettercycle2 predictor is superior to all other predictors for all parallel benchmarks. Moreover, I found that the applications have a very small number of tagbettercycle-heads (at most 3) under the Tag-bettercycle and Tag-bettercycle2 predictors and different system sizes.
Figure 3.24: Effects of the Tag-bettercycle2 predictor on the applications
3.5 Predictors' Comparison
Figure 3.25 presents a comparison of the performance of the predictors presented in
this chapter when the number of processors is 64, 32 and 36, and 16, respectively. It is evident
that the Tag-bettercycle2 predictor has the best overall performance for all applications
(except for QCDMPI when the number of processes is 16 and 64, where Better-cycle2
has better performance) and its hit ratio is consistently very high. It is also clear
that under single-port modeling, the proposed predictors outperform the classical LRU,
LFU, and FIFO heuristics.
3.5.1 Predictors' Memory Requirements
Table 3.1 compares the maximum memory requirement of the proposed message destination
predictors on the application benchmarks when the number of processors is 64. I
have found that the memory requirement of the predictors decreases gradually when the
number of processes decreases. The numbers in the table are the multiplication factor for
the amount of storage needed to maintain the message destination and its communicator.
Having 64 processes in this case study, and at most 4 different communicators in the applications,
one needs only one byte of storage per message destination and its
communicator.
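The one-byte figure follows from simple bit arithmetic; the helper below is my own illustration of that count, not code from the thesis:

```python
import math

def entry_bits(num_processes, num_communicators):
    """Bits needed to encode one (message destination, communicator) pair:
    ceil(log2(P)) bits for the destination plus ceil(log2(C)) for the
    communicator."""
    return (math.ceil(math.log2(num_processes))
            + math.ceil(math.log2(num_communicators)))

# 64 processes -> 6 bits, 4 communicators -> 2 bits: 8 bits, one byte.
bits = entry_bits(64, 4)
assert bits == 8
```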
Table 3.1: Memory requirements (in bytes) of the predictors when N = 64
It is quite clear that the memory requirements of the predictors are very low. That makes
them very attractive for implementation on the communication assist or network interface.
Comparatively, the Better-cycle and Tag-bettercycle predictors have slightly higher memory
requirements than the other predictors. Although the classical LRU, LFU, and FIFO
heuristics need less memory, as stated earlier, the beauty of the proposed predictors lies in
the fact that they operate under single-port modeling. That is, only one communication
channel is available at any time, and this is reconfigured on demand. This brings the cost
of optical interconnect implementation to a minimum. The storage requirements of the
predictors have been found using the following formulae:
[Table 3.1: rows are the Single-cycle(2), Better-cycle(2), Tagging, Tag-cycle(2), and Tag-bettercycle(2) predictors; columns are the BT, SP, CG, MG, LU, QCD, and PSTSWM benchmarks. The entries are small, ranging from a few bytes up to 297 bytes (PSTSWM).]
Mem_Single-cycle(2) = Mem_Better-cycle(2) = Maximum cycle length x Maximum number of cycle-heads    (3.2)
Mem_Tagging = Maximum number of tags    (3.3)
Mem_Tag-cycle(2) = Mem_Tag-bettercycle(2) = Mem_Tagging x Maximum cycle length of each tag    (3.4)
3.6 Using Message Predictors
In this section, I briefly discuss how a message destination predictor can be used and
integrated into the network interface. Predictors would reside beside the communication
assist or network interface and accelerate the reconfiguration phase of the interconnect.
They monitor the message destination patterns of their host node and make a prediction
according to their prediction algorithms. Then, the network interface uses the predictions
to establish the links to its final message destinations.
As stated above, the predictors would execute on the communication assist of each
node of the parallel machine, and predict the message destinations for communications
originating at the node on which they reside, based on the past history of communications.
The cycle-based predictors (Single-cycle, Single-cycle2, Better-cycle, and Better-cycle2)
do not need any help from the compiler or programmer. However, as stated earlier,
the tag-based predictors (Tagging, Tag-cycle, Tag-cycle2, Tag-bettercycle, and
Tag-bettercycle2) require an interface to pass some information from the program
to the network interface. With simple help from the programmer or compiler, this can be
done by inserting pre-connect (tag) instructions in the program well above each specific
send communication operation, but evidently after the previous send communication
operation.
Determining when to perform the path setup action (reconfiguration phase) is quite
simple. Basically, the predictors should map the prediction into the path setup action when the
previous communication has terminated. Thus, as soon as the previous message transmission
is complete, the communication assist reconfigures the link to the next message destination.
It is clear that upon a mis-prediction, the ongoing reconfiguration, which is not
correct and may or may not have completed by the time of the mis-prediction due to a
shorter inter-send computation time (to be discussed in Chapter 4), immediately stops and
a new reconfiguration takes place.
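The timing rule implied by this policy can be captured in a short function. This is my own sketch of the latency the application would observe under the scheme described (speculative setup begins when the previous transmission completes); the function and parameter names are illustrative:

```python
def exposed_reconfig_delay(predicted, actual, lead_time, reconfig_delay):
    """Reconfiguration latency the application actually sees.

    predicted, actual: predicted and requested message destinations.
    lead_time: computation time available before the next send is issued.
    reconfig_delay: full cost of one interconnect reconfiguration.
    """
    if predicted == actual:
        # Correct prediction: the speculative setup overlaps with the
        # computation; only the uncovered remainder is exposed.
        return max(0.0, reconfig_delay - lead_time)
    # Mis-prediction: the wrong setup is aborted at send time and a full
    # reconfiguration toward the true destination must be paid.
    return reconfig_delay
```

With a 25-microsecond reconfiguration delay, a correct prediction with 30 microseconds of lead time exposes no latency at all, while 10 microseconds of lead time still hides 10 of the 25 microseconds.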
3.7 Summary
Interconnection networks are still a source of bottleneck for high-performance communications
in massively parallel environments. In this chapter, I introduced a reconfigurable
interconnection network that could alleviate the communication problems in such
environments.
In order to benefit from such interconnects effectively, the reconfiguration delay should be
hidden. For this, I analyzed the communication properties of some parallel applications in
terms of communication frequency and message destination distributions. Using classical
memory hierarchy heuristics, I found that message destinations display a form of locality.
Given this message destination locality in parallel applications, I proposed a number of
predictors that can be used to accurately predict the message destination of the subsequent
communication request. The proposed predictors would execute on the communication
assist of each node of the parallel machine. The performance of the proposed predictors,
especially Better-cycle2 and Tag-bettercycle2, is very good, and they could effectively
hide the hardware communication latency by reconfiguring the communications network
concurrently with the computation.
For these predictors to be used efficiently, I shall argue, in Chapter 4, that at least in the
application benchmarks studied, there is enough computation preceding a communication
request such that the predictors can effectively hide the reconfiguration cost [43].
Chapter 4
Reconfiguration Time Enhancements Using Predictors
To reconfigure the optical interconnect concurrently with the computation, or to speculatively
set up the path in electronic interconnects, two conditions are necessary: (1) an
accurate prediction of the destination; (2) enough lead time so that the reconfiguration of
the interconnect (or the path setup phase) can be completed before the communication request arrives.
In Chapter 3, I utilized the message destination locality property of parallel applications
to devise a number of heuristics that can be used to "predict" the target of subsequent
communication requests. This technique can be applied directly to reconfigurable interconnects
to hide the communication latency by reconfiguring the communication network
concurrently with the computation.
I present the pure execution times of the computation phases of the parallel benchmarks
on the IBM Deep Blue machine at the IBM T. J. Watson Research Center, using its
high-performance switch under the user space mode. This chapter contributes by arguing
that, by comparing the inter-communication computation times of these parallel benchmarks
with some specific reconfiguration times, most of the time we are able to fully utilize
these computation times for the concurrent reconfiguration of the interconnect when
we know, in advance, the next target using one of the proposed high hit-ratio target prediction
algorithms introduced in Chapter 3.
In this chapter, I first show the distribution of message sizes of the applications in Section
4.1. In Section 4.2, the pure inter-send computation times of the parallel applications
on an IBM SP2 machine are presented. I present the performance enhancements of the proposed
predictors on the application benchmarks for the total reconfiguration time in Section
4.3. In Section 4.4, I discuss how the predictors at the send side affect the receive side
of communications. Finally, I conclude this chapter in Section 4.5.
4.1 Distribution of Message Sizes
The volume of communications is characterized by the number of messages and the
distribution of message sizes in the applications. I presented the number of messages in
Chapter 3. In this chapter, I am particularly interested in the distribution of message sizes
in the applications. In Section 4.3, I use the size of messages in the applications to calculate
the message transfer delay time. Figure 4.1 through Figure 4.4 illustrate the distribution
of message sizes of all applications under different system sizes. The MG,
PSTSWM, SP, and BT applications use more distinct message sizes in their communication
calls than the other applications. The CG, LU, and QCDMPI applications use only a few distinct
message sizes.
4.2 Inter-send Computation Times
In Section 4.3, I shall examine the effectiveness of the proposed predictors. I shall
quantify the ability of the proposed predictors to hide the reconfiguration delays. For
this, I need to know the pure computation times between any two send communication
operations.
I did experiments on a fast machine to establish the inter-send computation times and
the effects of the heuristics on the total reconfiguration delay. I used the IBM SP2 Deep
Blue machine at the IBM T. J. Watson Research Center, a 30-node machine with 160 MHz
P2SC thin nodes, 256 MB RAM, and a second-generation high-performance switch, and ran
the suite of applications, one process on each node under the user space mode, when I was
the only user of this machine. This avoided any task switching that might have affected my
measurements. My measurements determined a lower bound on the inter-send computation
times (i.e., the time devoted to computation between any two send communication
calls).
I excluded all timing overheads in the profiling codes to compute the execution times
of the computation and communication phases of the parallel application benchmarks. The
inter-send computation measurements excluded any overhead associated with any other
Figure 4.1: Distribution of message sizes of the applications when N = 4
Figure 4.2: Distribution of message sizes of the applications when N = 9 for BT and SP, and 8 for CG, MG, LU, PSTSWM, and QCDMPI
Figure 4.3: Distribution of message sizes of the applications when N = 16
Figure 4.4: Distribution of message sizes of the BT, SP, PSTSWM, and QCDMPI applications when N = 25
communication primitives (e.g., receive communication calls, collective communications).
Thus it can be considered as a lower bound on the pure computation time. In Appendix A,
I explain how the pure inter-send computation times have been computed.
The temporal attribute of inter-send computations in parallel applications characterizes
the rate of computations. The inter-arrival times of the computation phases can be used to
obtain the cumulative distribution function (CDF) of the computation times. The CDF of
the computation times can then be used for curve fitting to generate the inter-arrival times
of computation times for simulation purposes. Figure 4.5 presents the cumulative distribution
function of the inter-send computation times for node zero of the applications (16
nodes for CG, MG, and LU; 25 nodes for BT, SP, PSTSWM, and QCDMPI). Note that I
have found similar cumulative distribution function plots for other system sizes.
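An empirical CDF of the kind plotted in Figure 4.5 is straightforward to compute from a trace of inter-send times. The helper below is a generic sketch (my own names and sample values, not the thesis data):

```python
def empirical_cdf(samples):
    """Empirical CDF of inter-send computation times: returns the sorted
    sample points paired with the fraction of samples <= each point."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

# Hypothetical inter-send gaps in microseconds.
cdf = empirical_cdf([12.0, 40.0, 25.0, 25.0])
# Fraction of gaps no longer than 25 microseconds:
frac_le_25 = max(f for x, f in cdf if x <= 25.0)
```

Such a function, fitted to a parametric curve, can then drive a simulator's inter-arrival time generator as described above.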
Figure 4.5: Cumulative distribution function of the inter-send computation times for node zero of the application benchmarks when the number of processors is 16 for CG, MG, and
LU, and 25 for BT, SP, QCDMPI, and PSTSWM.
Table 4.1 shows the minimum pure inter-send computation times of the applications
under different system sizes. Note that LU, MG, and CG run only on a power-of-two number
of processors. The inter-send computation times for the CG (4 nodes) and QCDMPI
application benchmarks are quite large, while all other applications have a minimum of
less than 23 microseconds of pure computation time.
Table 4.1: Minimum inter-send computation times (microseconds) in the NAS Parallel Benchmarks, PSTSWM, and QCDMPI when N = 4, 8, 9, 16, and 25
[Table columns: 4 nodes, 8 nodes (9 for BT, SP), 16 nodes, 25 nodes]
IBM Deep Blue uses a state-of-the-art high-performance CPU, the Power2-Super (P2SC)
microprocessor, in its nodes. The nodes are interconnected via an adapter to a high-performance,
multistage, packet-switched network for interprocessor communications. I am
interested in a rough comparison between the pure inter-send computation times of
the applications running on such powerful machines and the current state-of-the-art reconfiguration
delay associated with optical interconnects. Researchers in optical engineering
are using different approaches to design reconfigurable interconnects [103, 81]. In [103],
the authors report a 25-microsecond reconfiguration delay for their experimental reconfigurable
interconnects. Based on these reports, I compare the pure computation times of
the application benchmarks with a 25-microsecond reconfiguration time, and with reconfiguration
times of 10, 5, and 1 microseconds as a measure of future advancements in the
area of reconfigurable interconnects. Figure 4.6 presents the distribution of the inter-send
computation times of the different applications when the computation times are more than 5,
10, and 25 microseconds and the number of processors is 4, 8 or 9, 16, and 25.
Examining the distribution of the inter-send times reveals that they are quite widely
distributed. All applications have nearly 100% of inter-send computation times that are
greater than 5 microseconds. For the BT, SP, LU, MG, and CG (except 4 nodes) application
benchmarks, between 60% and 80% of the computation times are above 25 microseconds.
The PSTSWM and QCDMPI application benchmarks have nearly 100% of inter-send
computation times greater than 25 microseconds. It is evident that the majority of
the reconfigurations can proceed in parallel with the computation and be readied before
the end of the computation. For the cases where the computation time is not sufficiently
long to completely hide the reconfiguration, it effectively reduces the reconfiguration cost
by the corresponding length of time.
4.3 Total Reconfiguration Time Enhancement
I assume a multicomputer with nodes similar to the thin nodes of an IBM SP2 system,
but with a reconfigurable optical interconnect which has a reconfiguration delay d (d = 1,
5, 10, 25 microseconds). It is interesting to see the effectiveness of the proposed predictors
on such a multicomputer system. Specifically, I shall quantify the ability of the proposed
predictors to hide the reconfiguration delays. For the calculations used to quantify the
reconfiguration hiding capabilities of the predictors, I use the lower bound of the inter-send
computation times.
Figure 4.7 illustrates different scenarios for message transmission in the multicomputer
with the reconfigurable optical interconnect. Note that as soon as a send call is
issued, the message can be sent to the destination if the link is already established. Reconfiguration
is started as soon as the message is delivered to the destination. Thus, the
message-transfer-delay (the delay associated with the transfer of a message) reduces the
Figure 4.6: Percentage of the inter-send computation times for different benchmarks that are more than 5, 10, and 25 microseconds when N = 4, 8 or 9, 16, and 25.
amount of time available before the next send call is issued. For this, I subtract the
message-transfer-delay (for the specific message size) from the corresponding inter-send
time and call the remaining time the available-time. This allows me to compute the lower
bound of the times that can be hidden. For each message-transfer-delay calculation, I use
the corresponding message size and a one Gigabyte per second communication channel.
If the available-time is greater than zero, as in Figure 4.7(a) (that is, the
message-transfer-delay is less than the corresponding inter-send time), and it is more
than the reconfiguration-delay, then a correct prediction would help completely hide the
reconfiguration-delay. If the available-time is greater than zero, as in Figure 4.7(b), but it
is less than the reconfiguration-delay, then the part of the reconfiguration-delay equal to the
available-time can be hidden. However, if the available-time is less than zero, as in
Figure 4.7(c) (that is, the message-transfer-delay is greater than the corresponding
inter-send time), then no part of the reconfiguration-delay can be hidden.
Figure 4.7: Different scenarios for message transmission in a multicomputer with a reconfigurable optical interconnect: (a) when the message-transfer-delay is less than the inter-send time, and the available time is larger than the reconfiguration-delay; (b) when the message-transfer-delay is less than the inter-send time, and the available time is less
than the reconfiguration-delay; (c) when the message-transfer-delay is larger than the inter-send time
The algorithm used to obtain the time spent in reconfiguring the interconnect with and
without applying the predictors is given by the following pseudocode. The
total_original_reconfiguration is the sum of the reconfiguration delays encountered in the
applications' run-time. The total_new_reconfiguration is the sum of the reconfiguration
delays encountered in the applications' run-time when predictions are used to hide them
within the inter-send computation times. The reconfiguration_ratio is the ratio of
total_new_reconfiguration over total_original_reconfiguration. It is clear that the lower this
ratio, the better the predictor's capability to hide the reconfiguration delay.
total_new_reconfiguration = 0.0;
total_original_reconfiguration = 0.0;
for each inter_send_computation {
    available_time = inter_send_computation - message_transfer_delay;
    if (available_time < 0) {
        total_new_reconfiguration += reconfiguration_delay;
        total_original_reconfiguration += reconfiguration_delay;
    } else {
        if (hit) then
            if (available_time < reconfiguration_delay) then
                total_new_reconfiguration += reconfiguration_delay - available_time;
            else
                ; /* the reconfiguration delay is completely hidden */
        else
            total_new_reconfiguration += reconfiguration_delay;
        total_original_reconfiguration += reconfiguration_delay;
    }
}
reconfiguration_ratio = total_new_reconfiguration / total_original_reconfiguration
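The pseudocode above translates directly into a small runnable function; the tuple layout and function name here are my own:

```python
def reconfiguration_ratio(events, reconfig_delay):
    """Ratio of post-prediction to original total reconfiguration time.

    `events` is a list of (inter_send_computation, message_transfer_delay,
    hit) tuples, one per send; `hit` is True when the predictor guessed
    the destination correctly.
    """
    total_new = total_original = 0.0
    for inter_send, transfer_delay, hit in events:
        total_original += reconfig_delay
        available = inter_send - transfer_delay
        if available <= 0 or not hit:
            # No time available to hide anything, or a mis-prediction:
            # the full reconfiguration delay is exposed.
            total_new += reconfig_delay
        elif available < reconfig_delay:
            # Part of the delay is hidden by the available computation.
            total_new += reconfig_delay - available
        # else: the delay is completely hidden; nothing is added.
    return total_new / total_original
```

For example, with a 25-microsecond delay, a correctly predicted send with 40 microseconds of available time contributes nothing, while one with only 20 microseconds available contributes 5 microseconds to the exposed total.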
Figure 4.8 through Figure 4.11 illustrate the reconfiguration-ratio, the average ratio of
the total new reconfiguration delay (after applying predictions) over the total original
reconfiguration delay, for each application benchmark under two different CPU speeds and
four different reconfiguration delays. I present the results for two different CPU speeds:
one for the current P2SC thin nodes, and one for a 10 times faster CPU as a measure of
future CPUs. The results are shown for the best predictors, Better-cycle2 and
Tag-bettercycle2. In these figures, shorter bars are better. For the sake of completeness, I have
included the results for the LRU, LFU, and FIFO heuristics under single-port modeling (recall
Figure 4.8: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 4 nodes
(shorter bars are better)
Figure 4.9: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and
a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 9 nodes for BT and SP, 8 nodes for other applications (shorter bars are better)
Figure 4.10: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 16 nodes
(shorter bars are better)
Figure 4.11: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 25 nodes
(shorter bars are better)
that the LRU, LFU, and FIFO heuristics under single-port modeling predict the next destination
to be the same as the previous message destination). It is clear that the Better-cycle2
and Tag-bettercycle2 predictors outperform the LRU/LFU/FIFO heuristics. The
Tag-bettercycle2 predictor improves the total reconfiguration delay more than the Better-cycle2
predictor, especially when the number of processors is 4 or 9. Under the
Tag-bettercycle2 predictor, the majority of reconfiguration delays in the CG, MG, and LU
benchmarks can be hidden. Meanwhile, the reconfiguration-ratio for BT and SP decreases
from 0.4 to 0.13 when the number of nodes increases from 4 to 25. QCDMPI has a
reconfiguration-ratio between 0.3 and 0.5. However, the PSTSWM application shows a
consistent reconfiguration-ratio of near 0.6 (except when N = 4). It is also evident that the
ratios increase with a faster CPU for the same reconfiguration delay. However, the reconfiguration
delay time may also decrease in the future. In this respect, it is informative to
compare the bar graphs under different reconfiguration delays and processor speeds. From
the plots for BT, SP, QCDMPI, and PSTSWM, it seems that the reconfiguration delay is
not a factor. It means that either the inter-send computation times are so short that they
cannot hide the reconfiguration delays, or they are long enough that they can hide large
reconfiguration delays.
In general, the results are consistent with the fact that we can hide most of the reconfiguration
delays using one of the proposed high hit-ratio predictors. Figure 4.12 shows a
summary of the average ratio of the total new reconfiguration delay over the total original
reconfiguration delay with the current generation and a 10 times faster CPU when applying
the Tag-bettercycle2 predictor on the benchmarks for d = 25 microseconds, A class for
NPB, and under different system sizes.
4.4 Predictors' Effect on the Receive Side
It is interesting to discover the effect of applying the heuristics at the send side of communications
on the receiving sides, and hence on the total execution time. Using one of the
high hit-ratio predictors reduces the total reconfiguration delay. When this happens at the
sender sides, most of the time the messages are delivered sooner at the receiver sides. If
Figure 4.12: Summary of the average ratio of the total reconfiguration time after hiding over the total original reconfiguration time with the current generation and a 10 times
faster CPU when applying the Tag-bettercycle2 predictor on the benchmarks with d = 25 microseconds, A class for NPB, and under different system sizes
the receive calls have been issued after the message has arrived, there would be no gain.
However, if they are issued earlier, then there would be a performance enhancement on the
receiving side, and therefore on the whole execution time. This is shown in Figure 4.13.
I have used the following strategy for discovering the number of times that the receive
calls are issued earlier than their corresponding send calls. I synchronized the timing
traces of each node of these applications. I have considered the times just before the send
and receive calls are issued. In the case of blocking and non-blocking send calls, the times just
before the calls (MPI_Send and MPI_Isend) have been taken into account. That is the time
that the message is ready to be sent over. For the blocking receive call (MPI_Recv), I did
the same. That is the time that the receiver is ready to get the message. However, for the
non-blocking receive call (MPI_Irecv), I consider the time when the wait call (MPI_Wait)
is issued for the corresponding receive call (MPI_Irecv). This gives us the worst-case scenario
for the number of times the receive calls are issued before their corresponding send
calls.
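Given synchronized traces, the count itself reduces to a pairwise timestamp comparison. The sketch below is my own illustration of that step, with hypothetical timestamps; it assumes the traces have already been matched into send/receive pairs:

```python
def count_early_receives(send_times, recv_times):
    """Worst-case count of receives posted before their matching sends.

    send_times[i]: synchronized timestamp taken just before MPI_Send or
    MPI_Isend for message i.
    recv_times[i]: timestamp just before MPI_Recv, or at MPI_Wait for an
    MPI_Irecv (the worst case for non-blocking receives).
    """
    return sum(1 for s, r in zip(send_times, recv_times) if r < s)

# Hypothetical matched pairs: two of the three receives are posted early.
early = count_early_receives([10.0, 20.0, 30.0], [5.0, 25.0, 28.0])
```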
Figure 4.13: Heuristics' effects on the receiving side
I present the average percentage of the times that the receive calls are issued earlier
than their corresponding send calls for the CG, SP, and PSTSWM benchmarks in Figure
4.14. The results hold for d = 1, 5, 10, and 25 microseconds. The LU and MG benchmarks
use MPI_ANY_SOURCE [92] for some of their receive calls, and hence one cannot
identify the sources of messages to compare with. What I have calculated is a lower bound
on the improvement. A trace-driven simulator should be written for the exact calculation of
the improvement.
4.5 Summary

In order to efficiently use the predictors proposed in Chapter 3 to hide the hardware
latency of the reconfigurable interconnects, enough lead time should exist such that the
reconfiguration of the interconnect can be completed before the communication request
arrives. For this, I presented the distribution of execution times of the computation phases
of the parallel application benchmarks on an IBM SP2 machine. The results showed that
most of the time, we are able to fully utilize these computation times for the concurrent
reconfiguration of the interconnect when we know, in advance, the next target using one of
the proposed high hit-ratio target prediction algorithms.
Figure 4.14: Average percentage of the times the receive calls are issued before the corresponding send calls
I also presented the performance enhancements of the best predictors, Better-cycle2
and Tag-bettercycle2, on the application benchmarks for the total reconfiguration time.
Finally, I considered the effects that using message destination predictors have on the
receiving sides of communications. I showed that up to 50% of the time, applications
might benefit from situations where they post early receive calls. However, a trace-driven
simulator should be written for the exact calculation of the improvement.
I did not evaluate the application speedup when using the predictors on the applications.
Rough estimates point to minimal speedup gains. This is because the parallel applications
studied are very coarse-grained and hence the communication-to-computation ratio is small.
Table 4.2 shows the communication-to-computation ratios for the applications under different
system sizes. These applications have been written to avoid a lot of communication
between pair-wise nodes, mostly because of the high communication latency in the current
generation of parallel systems [15], and partly because of the algorithms themselves. As
shown in Table 4.2, the communication-to-computation ratio increases when the number
of nodes increases. This means that we might have better speedup for these applications
at larger system sizes. However, the inter-send computation times may decrease and
thus reconfiguration delays cannot be hidden.
Table 4.2: Communication-to-computation ratio of the applications
[Table columns: 4 nodes, 8 nodes (9 for BT, SP), 16 nodes, 25 nodes]
In this chapter and Chapter 3 of this dissertation, I am particularly interested in the
point-to-point communications in parallel applications. In Chapter 5, I discuss efficient
collective communication algorithms for such reconfigurable interconnects.
Chapter 5
Collective Communications on a Reconfigurable Interconnection Network
Collective communications are basic patterns of interprocessor communication that
are frequently used as building blocks in a variety of parallel algorithms. Proper implementation
of collective communication algorithms is a key to the overall performance of
parallel computers.
Free-space optical interconnection is used to fashion a reconfigurable network. Since network reconfiguration is expensive compared to message transmission in such networks, latency hiding techniques can be used to increase the performance of collective communication operations.
I present and analyze a broadcasting/multi-broadcasting algorithm [20] that utilizes latency hiding and reconfiguration in the network, RON(k, N), to speed up these operations. As the first contribution of this chapter, the analysis of the broadcasting algorithm includes a closed formulation that yields the termination time. Secondly, I contribute by proposing a combined total exchange algorithm based on a combination of the direct [109, 110] and standard exchange [71, 24] algorithms. This ensures a better termination time than what can be achieved by either of the two algorithms. Meanwhile, known algorithms for scattering and all-to-all broadcasting from the literature [40, 21] have been adapted to the network.
5.1 Introduction
Communication operations may be either point-to-point, as discussed so far, or collective, in which more than two processes participate. The study of classical algorithms brings up some generic communication patterns, collective communications, that appear very often in parallel algorithms [70, 76]. Collective communications are common basic patterns of interprocessor communication that are frequently used as building blocks in a variety of parallel algorithms. Proper implementation of these basic communication operations on various parallel architectures is key to the efficient execution of the parallel algorithms that use them, and hence to the overall performance of the parallel computers.
Whether communication operations are programmed by the user (low-level routines), contained in a library such as MPI [92, 93] and Parallel Virtual Machine (PVM) [115], or generated by a compiler to translate a high-level data parallel language such as High Performance Fortran (HPF) [52], their latency directly affects the total computation time of the parallel application. The growing interest in collective communication operations is evident from their inclusion in MPI.
Collective communication operations can be used for data movement, process synchronization, or global operations, as shown in Figure 5.1. Data movement operations include broadcasting, multi-broadcasting, multicasting, scattering, gathering, multinode broadcasting, and total exchange. In broadcasting, a node sends its unique message to all other nodes. Broadcasting is used in a variety of linear algebra algorithms [76], such as matrix-vector multiplication, matrix-matrix multiplication, LU-factorization, and Householder transformations. It is also used in database queries and transitive closure algorithms. In multi-broadcasting, a node broadcasts a number of messages to all other nodes. In multicasting, a special case of broadcasting, a node sends its unique message to a subset of all the other nodes. In scattering, a node sends a different message to all other nodes; it is basically used for distribution of data among the processors. Gathering is the exact reverse of scattering; that is, a node receives a different message from all other nodes. I will not discuss it here as a separate operation. In multinode broadcasting, all nodes send their unique messages to all other nodes. In total exchange, all nodes send their different messages to all other nodes. Personalized communications (scattering, gathering, and total exchange) are used, for instance, in transposing a matrix, in the conversion between different data structures, or in neural network simulations. It is worth mentioning that the terminology is not yet standard. For example, broadcasting is referred to as one-to-all, multinode broadcasting is referred to as all-to-all or gossiping, scattering is referred to as personalized one-to-all, and total exchange is referred to as multi-scattering or personalized all-to-all.
Barrier synchronization is a type of process synchronization. It defines a logical point in the control flow of an algorithm at which all members of the group must arrive before any of the processes in the subset is allowed to proceed further. Therefore, one of the processes plays the role of a barrier process. This process gathers messages from all other processes, and then broadcasts a message to them indicating that they can continue.
Global operations include reduction and scan. In reduction, an operation such as sum, max, or min is applied across data items received from each member of the group. In an N/1 reduction operation, the resultant data resides at the root node; therefore, it contains a gathering operation. In an N/N reduction operation, every node or process involved in the operation obtains a copy of the reduced data. Hence, it is a combination of gathering and broadcasting. In a scan operation, given processes p_0, p_1, ..., p_n and data items d_0, d_1, ..., d_n, an operation ⊕ is applied such that the result d_0 ⊕ d_1 ⊕ ... ⊕ d_i is available at the process p_i.
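The scan operation can be illustrated with a short sketch; here ⊕ is taken to be ordinary addition, and the data values are arbitrary illustrative choices:

```python
from itertools import accumulate
from operator import add

# Data items d_0 .. d_n, held one each by processes p_0 .. p_n.
data = [3, 1, 4, 1, 5]

# After the scan, process p_i holds d_0 (+) d_1 (+) ... (+) d_i.
prefix = list(accumulate(data, add))

for i, value in enumerate(prefix):
    print(f"process p_{i} holds {value}")
# process p_0 holds 3 ... process p_4 holds 14
```

With a different associative operation (max, min, a product), only the operator passed to `accumulate` changes.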
Collective operations have usually been proposed and designed for systems that support only point-to-point, or unicast, communication in hardware. In these environments, collective operations are implemented by sending multiple unicast messages. Such implementations are called unicast-based. An alternative approach is to provide more direct support for collective communication in the hardware. Two main approaches have been studied. The first approach uses a network other than the primary data network to implement collective communications [80]. In the second approach, the data network is enhanced to better support some collective communications. To improve collective communication performance and reduce software overhead, two such enhancements to routers have been proposed: message replication and intermediate reception. Message replication refers to the ability to duplicate incoming messages onto more than one outgoing channel, while intermediate reception is the ability to simultaneously deliver an incoming message to the local processor and to an outgoing channel. Ni has proposed how scalable parallel computers should support efficient hardware multicast [99].

Figure 5.1: Some collective communication operations (broadcast, scatter, gather, multinode broadcast, total exchange, barrier, reduction, scan)
Numerous works have been reported on collective communications. Excellent surveys on collective communication algorithms in store-and-forward systems can be found in [53]. Another survey of broadcasting and multinode broadcasting in store-and-forward systems can be found in [64]. Dimakopoulos and Dimopoulos have shown how total exchange can be done in Cayley graphs [41]. They have also presented collective communication algorithms on binary fat trees [43]. McKinley and his colleagues have surveyed collective communications on hypercubes, meshes, and tori in wormhole-routed networks [90]. Recently, Banikazemi and others have proposed efficient broadcasting and multicasting algorithms using the communication capabilities of heterogeneous networks of workstations [15]. In the context of optical interconnection networks, Berthomé and Ferreira [20, 21] have presented broadcasting and multicasting algorithms for networks using optical passive stars (OPS). A comparative study of one-to-many wavelength division multiplexing (WDM) lightwave interconnection networks, based on hypergraph theory [18], has been presented by Bourdin and his colleagues [3]. Gravenstreter and Melhem have presented some communication algorithms in partitioned optical passive stars (POPS) networks [59].
In this chapter, I present and analyze some collective communication algorithms for the reconfigurable network, RON(k, N), defined in Chapter 3. In Section 5.2, I describe the communication modeling. I present and analyze broadcasting [20] and multi-broadcasting algorithms that utilize the reconfiguration capabilities of the network in Section 5.3. Later on, in Section 5.5 and Section 5.6, known algorithms from the literature for scattering and multinode broadcasting [20, 40] are adapted to the network. Then, I propose a new algorithm for the total exchange operation, to be called the combined total exchange algorithm, in Section 5.7. Finally, I summarize this chapter in Section 5.8.
5.2 Communication Modeling for Broadcasting/Multi-broadcasting
As discussed in Chapter 3, I use a modified Hockney's communication model [66]. I modify the Hockney model into two models. In this section, I define the first model, as used for hiding the reconfiguration delays in the broadcasting and multi-broadcasting algorithms. In Section 5.4, I define the second model for the other collective communication algorithms. The second model supports combining messages into a single message, as used in the scattering, multinode broadcasting, and total exchange algorithms, to be discussed later. Note that these algorithms are efficient, but they do not hide the reconfiguration delay in the network.

The communication time to send a unit length message from one node to another in the network is equal to T = d + t_s + t_tr, where d is the reconfiguration delay, t_s is the start-up time, and t_tr is the transmission time. I incorporate both t_s and t_tr into a single message delay t_m = t_s + t_tr. Thus, a unit length message transmission takes T = d + t_m. For the remainder of the discussion, and without loss of generality, I shall assume that t_m = 1 for a message of fixed length used in broadcasting/multi-broadcasting.
Culler and his colleagues have proposed the LogP model [33], which uses another terminology for communication modeling. LogP models sequences of point-to-point communications of short messages. L is the network hardware latency for a one-word message transfer. o is the combined overhead in processing the message at the sender (o_s) and receiver (o_r). P is the number of processors. The gap, g, is the minimum time interval between two consecutive message transmissions from a processor. Alexandrov and others have proposed the LogGP model [R], which incorporates long messages into the LogP model. The Gap per byte for long messages, G, is defined as the time per byte for a long message. Bar-Noy and Kipnis have developed the postal model [16], a special case of the LogP model, where g is one. However, they do not consider the parameters o and G.

A node in the LogP, LogGP, and postal models can send another message g time units after the previous message has been sent, without waiting for the previous message to be delivered at the destination. These models are more suitable for the current state-of-the-art wormhole-routed networks, where messages can be pipelined through the network. However, a node in my communication modeling can send another message only after its previous message has been delivered and its link has been reconfigured (if needed). This is because my model is a telephone-like model based on the circuit-switching technique, which is suitable for reconfigurable optical networks.
The model that I have used is slightly different from the model that is offered in [20, 21, 40]. The difference lies in the fact that in the network, RON(k, N), only the sender is allowed to reconfigure, and hence the delay penalties occur there. The receiver, in contrast to the models in [21, 40], and in [20], is entirely passive.
I use the notations B_m, MB_m, S_m, G_m, and TE_m for the broadcasting time, multi-broadcasting time, scattering time, multinode broadcasting time, and total exchange time, respectively. I derive the time complexities of the collective communication algorithms in the network, RON(k, N), under the model m, where m ∈ {F1, Fk}. F1 stands for full-duplex, single-port communication, while Fk stands for full-duplex, k-port communication.
5.3 Broadcasting and Multi-broadcasting
In this section, I shall concentrate on techniques that can effectively hide the reconfiguration delay d in the network. By reconfiguration latency hiding, I mean the process in which, while some nodes are in their reconfiguration phase, other nodes are in their message transmission phase. Hence, the reconfiguration phase is overlapped with the message transmission phase, which ultimately reduces the broadcasting and multi-broadcasting times.
5.3.1 Broadcasting
In broadcasting, a node, assumed to be node n_0 without loss of generality, sends its unique message to all other nodes. I assume an unbounded number of available wavelengths for the system. As noted earlier in Chapter 3, techniques such as spread-spectrum can be used in case of a limited number of available wavelengths. In the following, I first discuss the broadcasting algorithm under k-port modeling, and then present the results for the single-port modeling.
K-port: The naive algorithm is to let the broadcasting node n_0 inform k new nodes at a step. Clearly, it takes (d + 1)⌈(N − 1)/k⌉ time units. In a more efficient algorithm, B1Fk, node n_0 sends the message to k other nodes, and these k nodes, upon receiving the message, send it to k other nodes each, which are distinct from the nodes that have received the message thus far. Continuing this way, the algorithm will terminate after ⌈log_k(N(k − 1) + 1)⌉ − 1 steps, while in terms of elapsed time, the algorithm will take (d + 1)(⌈log_k(N(k − 1) + 1)⌉ − 1) time units.
Obviously, one can do better than this if one allows the nodes that have already been informed to re-send the same message to a different group of nodes. Thus, starting with node n_0, it sends the message to k nodes. At the end of this step, k + 1 nodes possess the message, which they now send to k other nodes each. Proceeding this way, this algorithm, B2Fk, will terminate after ⌈log_{k+1} N⌉ steps and will require (d + 1)⌈log_{k+1} N⌉ time units.
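The step counts above can be computed with a minimal sketch; the function names and the sample parameters (N = 41, k = 2, d = 1) are illustrative only:

```python
import math

def naive_time(N, k, d):
    # Source alone informs k new nodes per step: ceil((N-1)/k) steps of (d+1) each.
    return (d + 1) * math.ceil((N - 1) / k)

def b1_time(N, k, d):
    # k-ary tree growth: after s steps, 1 + k + ... + k^s nodes are informed.
    steps = math.ceil(math.log(N * (k - 1) + 1, k)) - 1
    return (d + 1) * steps

def b2_time(N, k, d):
    # Informed nodes keep re-sending: (k+1)^s nodes informed after s steps.
    steps = math.ceil(math.log(N, k + 1))
    return (d + 1) * steps

print(naive_time(41, 2, 1), b1_time(41, 2, 1), b2_time(41, 2, 1))  # 40 10 8
```

The floating-point logarithms are adequate for a sketch; an exact implementation would count powers with integer arithmetic to avoid rounding at exact powers of k.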
The above algorithms, B1Fk and B2Fk, are logarithmic in time, but they suffer because of the large reconfiguration delay, d, that each node incurs. I am interested in devising algorithms that will overcome the existence of the large reconfiguration delays by essentially hiding them. The algorithm B1Fk can be improved if the configuration of all the links forming the tree proceeds in parallel. Hence, in this new algorithm, B3Fk, the broadcasting message would reach the leaves of the tree in time d + ⌈log_k(N(k − 1) + 1)⌉ − 1.
The algorithm B3Fk can be improved if the configurations can take place concurrently with the message transmissions. I adopt a greedy algorithm, B4Fk, where a node reconfigures its links to reach k children, which leads to a pre-configured tree of an appropriate O(log_k N) depth. As soon as the broadcasting node has finished sending its message, it reconfigures its links to reach another predefined tree. It is understood that while this node is reconfiguring (this takes d time units), nodes that have already been configured and are in possession of the message send it to k neighbors each. This process repeats at each node every time it sends the message. Potentially, the message, starting at node n_0, will reach 1 + k + k^2 + ... + k^(d+1) = (k^(d+2) − 1)/(k − 1) nodes before node n_0 is able to reconfigure. Figure 5.2 depicts the B4Fk algorithm for a 2-port network with 41 nodes and a reconfiguration delay of 1. This algorithm is optimal since a node, after sending/receiving the message, immediately reconfigures to send the message to a new node. This algorithm is similar to the broadcasting algorithm by Berthomé and Ferreira for their loosely-coupled optically reconfigurable parallel computer, ORPC(k), using optical passive stars (OPS) [20].
It is clear that either this broadcasting network is a dedicated network, or there exists a global control where nodes understand that a broadcasting is going to take place and hence reconfigure their links correspondingly. In the latter case, an early reconfiguration delay should be added to the broadcasting time.
Figure 5.2: Latency hiding broadcasting algorithm for RON(k, N), N = 41, k = 2, d = 1
5.3.1.1 Analysis of the Greedy Algorithm
Before presenting the analysis of the greedy algorithm, it is worth noting that it can be shown that the total number of nodes, N(S), informed up to step S, follows the recurrence relation:

N(S) = N(S − 1) + r(S),  N(0) = 1  (5.1)

It can also be shown that the number of nodes, r(S), that receive the message at each step, S, follows the recurrence relations:

r(S) = k·r(S − 1) + r(S − d − 1) for S > d + 1,  r(S) = k·r(S − 1) for 1 ≤ S ≤ d + 1,  r(0) = 1  (5.2)
These recurrence relations are a kind of generalization of the Fibonacci functions defined by Bar-Noy and Kipnis for the postal model [16], and are similar to the recurrence relations of the broadcasting algorithms by Berthomé and Ferreira [20]. The above relations and those in [16, 20] cannot be solved for a general d. They should be computed step by step, or be given in a table, in order to find the termination time of the algorithms. However, as will be shown in the following, the analysis of the broadcasting algorithm includes a closed formulation that yields the termination time.
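The step-by-step evaluation can be sketched as follows, assuming the recurrence r(S) = k·r(S − 1) + r(S − d − 1), with r(S) = k·r(S − 1) for the first d + 1 steps (before any node has finished reconfiguring); for 41 nodes, k = 2, d = 1, it reaches all nodes at step 4, matching the 41-node example:

```python
def greedy_broadcast_time(N, k, d):
    """Smallest step S at which the informed-node count reaches N, under
    N(S) = N(S-1) + r(S) and r(S) = k*r(S-1) + r(S-d-1) (assumed form),
    where the second term appears only once senders have reconfigured."""
    r = [1]                # r[0] = 1: the source holds the message at step 0
    informed, S = 1, 0
    while informed < N:
        S += 1
        new = k * r[S - 1] + (r[S - d - 1] if S > d + 1 else 0)
        r.append(new)
        informed += new
    return S

# 41 nodes, 2 ports, reconfiguration delay 1: r = 2, 4, 10, 24 -> step 4
print(greedy_broadcast_time(41, 2, 1))  # 4
```

Tabulating `greedy_broadcast_time` for a range of N is exactly the "computed step by step or given in a table" approach; the closed formulation derived below avoids storing the whole history of r(S).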
I present another approach to find a closed formula for the total number of nodes, N(S), informed up to step S. The problem I shall endeavor to solve is to find the time required for the greedy algorithm to complete. I shall approach the analysis constructively; that is, I shall find the number of nodes that will be informed as time progresses, and I shall stop when all N nodes have been informed.

Denote by S the termination time (in units of t_m). Then, starting from an arbitrary node n_0, the nodes that will be informed, assuming no reconfiguration, belong to a k-ary tree rooted at node n_0 and of depth S. There are N_1 = (k^(S+1) − 1)/(k − 1) nodes in this tree, and I shall reference them as belonging to the first generation. Each of the nodes in this tree, once it has broadcast the message to its own children, will reconfigure and will become the root of a new tree over which a new wave of broadcasting will commence and proceed concurrently with the broadcasting in the first generation tree. This can only happen if S > d + 2, ensuring that the first node to be reconfigured (node n_0) will have enough time to reconfigure and broadcast to its k children.
I shall refer to the nodes belonging to the trees rooted at nodes which were included in the first generation tree and reconfigured as the second generation nodes. Thus, node n_0 can send its message again at time d + 1, after its router has been reconfigured to connect to a set of k new nodes. By sending this new message, n_0 actually embeds a new k-ary tree at depth d + 1. The next k nodes, at depth 1 of the first generation of trees, embed k different k-ary trees at depth d + 2. Using this concept, the k^(S−d−2) nodes at depth S − d − 2 of the first generation embed the last k^(S−d−2) different trees at depth S − 1 in the second generation. Figure 5.3 depicts the embedding of the first two generations of the nodes.

Denote by N_2 the total number of new nodes in the second generation, and by M_i the total number of new nodes in the trees of the second generation rooted at depth i.
Figure 5.3: First and second generation trees. The numbers underneath each tree denote the number of trees having the same height. These trees are rooted at nodes that were at the same level in the first generation tree.
This continues until depth S − 1, where:

M_{S−1} = k^(S−d−2)·(k) = k^(S−d−1)

Therefore, the total number of new nodes in the second generation, N_2, will be:

N_2 = Σ_{i=d+1}^{S−1} M_i = Σ_{i=d+1}^{S−1} k^(i−d)·(k^(S−i) − 1)/(k − 1)
The process of reconfiguring the optical interconnects is continued by the nodes as soon as they have broadcast the message to their children. Each generation of trees embeds a new generation that commences at depth d + 1 from its parent generation. It is clear that the total number of generations is ⌈S/(d + 1)⌉.
Let us now count the total number of nodes, N_3, in the third generation. The first tree of the third generation is embedded at depth 2(d + 1) by n_0. I begin with those trees of this generation which are embedded by the nodes of the first tree in the second generation. Let Q^1_i denote the total number of nodes in these trees rooted at depth i.
This continues until the depth S − 1, where:

Q^1_{S−1} = k^(S−2d−3)·(k) = k^(S−2d−2)
Now, consider the trees embedded in the third generation by the nodes of the next k trees, of depth S − d − 2, in the second generation, and let Q^2_i denote the total number of nodes in these trees rooted at depth i. Therefore,

This continues until the depth S − 1, where:
I continue with the trees embedded in the third generation by the nodes of the next k^2 trees, of depth S − d − 3, in the second generation, and let Q^3_i denote the total number of nodes in these trees rooted at depth i. Therefore,
Hence, the total number of new nodes in the third generation, N_3, will be:

In a similar manner, I can compute the numbers of nodes for the fourth and fifth generations as:
This process implies Lemma 1.

Lemma 1: The number of new nodes in generation i + 1, i ≥ 1, can be found as:
Proof. I give a combinatorial argument for its validity. Assume a tree belonging to generation i − 1 and rooted at depth (i − 1)(d + 1). This tree will produce a number of trees belonging to generation i and rooted at depth i(d + 1). The term k·(k^(S−i(d+1)) − 1)/(k − 1) represents the number of new nodes in the first tree of generation i rooted at depth i(d + 1). Subsequent trees in this generation have a decreasing (by one) number of levels, but since they were produced by nodes that are at lower levels in the parent generation, their numbers grow with the power of k. Therefore, the number of nodes within all the trees at each depth can be counted.

I have, however, only accounted for the number of trees produced by a single tree in a parent generation. There is more than one tree of identical depth in the parent generation, and the multiplicative binomial term accounts for this number, based on Pascal's triangle.
The total number of nodes in all generations, N(S), informed up to step S, is equal to:

Note that Equation 5.30 is a closed formula and is easier to compute (less computation and memory requirements) than the recurrence Equations 5.1 and 5.2. To determine the termination time S, one has to solve Equation 5.30 for S. This equation can be solved numerically. Table 5.1 and Table 5.2 provide a comparison of some numerical examples of the broadcasting time under the different broadcasting algorithms, B1Fk, B2Fk, B3Fk, and B4Fk, and for the best case log_{k+1} N when there is no reconfiguration delay (i.e., d = 0), for a particular number of nodes, N, reconfiguration delay, d, and port modeling, k. It is quite clear that the latency hiding algorithm, B4Fk, performs better than the other algorithms.
Table 5.1: Broadcasting time, k = 2, d = 1

Table 5.2: Broadcasting time, k = 4, d = 3
Single-port: In this case, a node can only use one of its links. Therefore, instead of k-ary trees, linear arrays are embedded. Hence, using the same concept as in the k-port modeling, the total numbers of nodes for generations 1, 2, 3, and 4 are:

If I continue in a similar manner to the k-port modeling, then the total number of nodes in all generations, N(S), would be:
Table 5.3 provides a comparison of some numerical examples of the broadcasting time of the latency hiding algorithm, B4F1, of the spanning binomial algorithm [114], (d + 1)⌈log_2 N⌉, and of the best case log_2 N when there is no reconfiguration delay (i.e., d = 0), for a particular number of nodes, N, and reconfiguration delay, d. It is clear that the algorithm B4F1 performs better than the spanning binomial algorithm.

Table 5.3: Broadcasting time, d = 3
5.3.1.2 Grouping schema
The total number of nodes, N(S), informed up to step S is given by Equation 5.1. Meanwhile, the number of nodes, r(S), that receive the message at each step S is defined by Equation 5.2. The nodes that know the message at any given step can be grouped into those nodes that have already received the message and those that receive it at this time step. The number of nodes that receive at each step is proportional (k times) to the number of nodes that received the message at the last step plus those that sent the message d + 1 steps ago.

The same grouping schema as in [20] can be used to find the set of nodes that transmit the message, and the set of nodes that receive the message, at any given step. The set T(S) consists of the nodes transmitting the message at step S, while the set R(S) consists of the nodes that receive the message at step S. These two sets can be found by Equation 5.36. Note that the same grouping schema can be applied to the multi-broadcasting case, to be discussed in the next section.
5.3.2 Multi-broadcasting
If there are M messages to be broadcast by a node to all other nodes, the simplest algorithm is to use the above latency hiding broadcasting algorithms (B4Fk or B4F1) M times in sequence. This algorithm, denoted MB1, gives an upper bound for multi-broadcasting and takes M × (d + B4Fk) and M × (d + B4F1) time units under k-port and single-port modeling, respectively. A lower bound for multi-broadcasting, M − 1 + MB_opt (MB_opt is the broadcasting time for an optimal algorithm), can be achieved by pipelining the messages through the network. That is, node n_0 sends its M messages in sequence in an optimal broadcasting algorithm.
One may think of another algorithm, MB2Fk, where the first message embeds a broadcasting tree (first generation tree) rooted at node n_0. Each of the subsequent messages uses this embedded tree to broadcast, thus bypassing the reconfiguration costs that the first message incurred. Hence, the first message will incur a delay of d + (⌈log_k(N(k − 1) + 1)⌉ − 1) time units to broadcast over all N nodes and to embed the broadcast tree, while the second and subsequent messages each incur only a broadcast delay of ⌈log_k(N(k − 1) + 1)⌉ − 1. Therefore, the total cost is d + M(⌈log_k(N(k − 1) + 1)⌉ − 1).
Table 5.4 compares the two algorithms, MB1Fk and MB2Fk. Note that an optimal algorithm for multi-broadcasting would pipeline the messages through the embedded trees using the latency hiding broadcasting algorithms (B4Fk or B4F1).

Table 5.4: Multi-broadcasting time, k = 4, d = 4, M = 10
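The two costs can be compared with a small sketch; the node count N = 64 is an illustrative assumption (the table's N is not reproduced here), and the B4Fk termination step is computed from the step-by-step recurrence of Section 5.3.1.1:

```python
import math

def greedy_broadcast_time(N, k, d):
    # Termination step of the latency hiding broadcast, from the assumed
    # recurrence r(S) = k*r(S-1) + r(S-d-1) (see Section 5.3.1.1).
    r, informed, S = [1], 1, 0
    while informed < N:
        S += 1
        new = k * r[S - 1] + (r[S - d - 1] if S > d + 1 else 0)
        r.append(new)
        informed += new
    return S

def mb1_time(N, k, d, M):
    # MB1: M independent latency hiding broadcasts, each paying d again.
    return M * (d + greedy_broadcast_time(N, k, d))

def mb2_time(N, k, d, M):
    # MB2Fk: embed the tree once (cost d), then each message pays the depth.
    depth = math.ceil(math.log(N * (k - 1) + 1, k)) - 1
    return d + M * depth

print(mb1_time(64, 4, 4, 10), mb2_time(64, 4, 4, 10))  # 70 34
```

For these parameters MB2Fk wins because the reconfiguration delay is paid once instead of M times; for very small M or very small d the ordering can reverse.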
5.4 Communication Modeling for other Collective Communications
In this section, I define the second communication modeling used for the scattering, multinode broadcasting, and total exchange algorithms. This model supports combining messages into a single larger message, as used in these algorithms. Note that the algorithms for scattering, multinode broadcasting, and total exchange are quite efficient, but they do not hide the reconfiguration delay in the network.

As stated in Section 5.2, the communication time to send a unit length message from one node to another in the network is equal to T = d + t_s + t_tr. Without loss of generality, I normalize the time T with respect to t_tr. Thus, a representative length message transmission takes T = d + t_s + 1. The communication time to send an M representative length message from one node to another would be T = d + t_s + M. Note that sending a combined message (that is, a larger message) does not affect the start-up time, t_s, or the reconfiguration delay, d. For simplicity, I incorporate both t_s and d into a single message delay.
5.5 Scattering
The scattering operation is basically used to distribute data to the nodes of a parallel computer. The easiest algorithm for the scattering operation is based on the sequential tree [101]. In this case, the source node sends its different messages to each of the other nodes sequentially, as shown in Figure 5.4 for single-port modeling. As the source of communication is the same for the whole scattering operation, this node must reconfigure its links after each step. Therefore, the scattering time, S1F1, is (N − 1)(d + 1) time units.
Figure 5.4: Sequential tree algorithm
The spanning binomial tree algorithm [91], used for broadcasting/multicasting operations, can also be used for the scattering operation. In this algorithm, the number of informed nodes doubles at each step, and each node stores its own message and forwards the rest of the messages it received, if necessary, to its children. As illustrated in Figure 5.5, the source node sends its messages for the upper half of the nodes to node 4. In the second step, nodes 0 and 4 are responsible for sending messages to the nodes in their halves; that is, to node 2 (messages for nodes 2 and 3), and node 6 (messages for nodes 6 and 7), respectively. In the third step, all nodes send the remaining messages to the remaining nodes. These three steps (actually log_2 N steps) take (d + 4), (d + 2), and (d + 1) time units, respectively. Generally, this algorithm has a scattering time:

S2F1 = Σ_{i=1}^{log_2 N} (d + N/2^i) = d·log_2 N + N − 1

Note that I have neglected the data permutation time at each node. It should be noted that the spanning binomial algorithm has a much better termination time than the sequential algorithm for the RON(k, N) (except for the trivial case, N = 2, where they have the same termination time).
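The two single-port scattering times can be checked numerically; this sketch assumes N is a power of two and uses the per-step bundle sizes (d + N/2^i) from the example above:

```python
import math

def sequential_scatter_time(N, d):
    # Source sends one message per step, reconfiguring after each: (N-1)(d+1).
    return (N - 1) * (d + 1)

def binomial_scatter_time(N, d):
    # Step i forwards a bundle of N/2^i messages: sum_i (d + N/2^i),
    # which telescopes to d*log2(N) + N - 1 for N a power of two.
    steps = int(math.log2(N))
    return sum(d + N // 2 ** i for i in range(1, steps + 1))

for N in (8, 64):
    print(N, sequential_scatter_time(N, 3), binomial_scatter_time(N, 3))
# 8 28 16
# 64 252 81
```

The gap widens quickly with N: the sequential tree pays the reconfiguration delay N − 1 times, the binomial tree only log_2 N times.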
k-port: The sequential tree algorithm can be extended to k-port modeling. That is, at each step the source node sends its k different messages to k other, different nodes. Therefore, S1Fk = (d + 1)⌈(N − 1)/k⌉.
Figure 5.5: Spanning binomial tree algorithm
Desprez and his colleagues have extended the spanning binomial algorithm to k-port modeling [40]. In this algorithm, the scattering node, n_0, sends k messages, of length N/(k + 1) each, to its k children. Therefore, there are then (k + 1) nodes having N/(k + 1) different messages each. These nodes, at step 2, communicate each with their k children and send one (k + 1)-th of their initial messages to each one. This process continues, and all nodes are informed after log_{k+1} N communication steps. Thus, the scattering time is equal to:

S2Fk = Σ_{i=1}^{log_{k+1} N} (d + N/(k + 1)^i) = d·log_{k+1} N + (N − 1)/k
5.6 Multinode Broadcasting
In multinode broadcasting, also called gossiping [53], all nodes send their unique messages to all other nodes; this is basically used in parallel algorithms when all nodes need to exchange their data. The simplest algorithm for multinode broadcasting is to use the latency hiding broadcasting algorithm N times, once for each node. Another algorithm is to consider multinode broadcasting as a degenerate case of total exchange, to be discussed in the next section. However, better algorithms exist.
Single-port: In the direct algorithms [109, 120], at any step i, a node p sends its message to node (p + i) mod N. Clearly, the cost of this algorithm, G1F1, is (N − 1)(d + 1).
One may use a better algorithm, just like the standard exchange algorithms for the total exchange operation [71, 24], where during each step the complete network is recursively divided into halves, and messages are exchanged across the new divisions at each step. This algorithm combines messages into larger messages to be transmitted as a single unit. Actually, each node sends its message along with the other messages it received at the previous steps. Hence, the multinode broadcasting has log_2 N steps, and a cost of:

G2F1 = Σ_{i=1}^{log_2 N} (d + 2^(i−1)) = d·log_2 N + N − 1

Figure 5.6 shows the pairwise communications and the lengths of the messages at each step for multinode broadcasting on an 8-node message-passing multicomputer. Unfortunately, latency hiding cannot improve this cost.
Figure 5.6: Multinode broadcasting on an 8-node RON (k, N) under single-port modeling
k-port: A simple algorithm is based on the extension of the direct algorithm to k-port modeling. That is, at step i, node p sends its message to the nodes (p + (i - 1)k + 1) mod N, (p + (i - 1)k + 2) mod N, ..., (p + ik) mod N. This algorithm has a cost of ceil((N - 1)/k)(d + 1).
Desprez and his colleagues [40] extended the G1F1 algorithm to k-port modeling by letting the nodes combine the messages to reduce the effect of the reconfiguration delay. Figure 5.7 illustrates this algorithm. Nodes are partitioned into groups of (k + 1) nodes each. Nodes are grouped as (0, 1, ..., k), (k + 1, k + 2, ..., 2(k + 1) - 1), ..., (N - (k + 1), N - (k + 1) + 1, ..., N - 1). At step 1, all nodes within a group exchange their messages. At the end of this step, each node has (k + 1) messages. At step 2, node p exchanges all its messages with nodes (p + (k + 1)) mod N, (p + 2(k + 1)) mod N, ..., (p + k(k + 1)) mod N. At the end of this step, each node has (k + 1)^2 messages. Let S = log_{k+1} N. This process continues to step S, where node p exchanges its messages with nodes (p + (k + 1)^(S-1)) mod N, ..., (p + k(k + 1)^(S-1)) mod N. It is clear that at each step i of this algorithm, each node sends (k + 1)^(i-1) messages to k other nodes. Hence, this algorithm, GkFk, has a multinode broadcasting time of d log_{k+1} N + (N - 1)/k.
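A quick simulation of this grouped exchange (my own sketch, not the thesis's code; partners are computed by cycling one base-(k + 1) digit per level, which matches the group structure described above when N = (k + 1)^S):

```python
import math

def gossip_kport(N, k):
    """Simulate grouped k-port multinode broadcasting (gossiping)
    on N = (k+1)**S nodes; returns the number of steps taken."""
    S = round(math.log(N, k + 1))
    held = [{p} for p in range(N)]       # each node starts with its own message
    for i in range(S):
        stride = (k + 1) ** i
        new = [set(h) for h in held]
        for p in range(N):
            digit = (p // stride) % (k + 1)
            base = p - digit * stride
            # exchange with the k nodes differing only in this digit
            for j in range(k + 1):
                new[p] |= held[base + j * stride]
        held = new
    assert all(h == set(range(N)) for h in held)   # everyone knows everything
    return S

assert gossip_kport(9, 2) == 2     # the N = 9, k = 2 case of the text
assert gossip_kport(8, 1) == 3     # k = 1 reduces to recursive doubling
```

At level i each node ships (k + 1)^i messages per port, which is where the d log_{k+1} N + (N - 1)/k time comes from.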
Figure 5.7: Multinode broadcasting on a 9-node RON (k, N) under 2-port modeling
5.7 Total Exchange
In total exchange, all nodes send their different messages to all other nodes. A naive algorithm for total exchange is to perform a scattering operation N times in sequence. However, better algorithms exist.
Single-port: In the direct algorithms [109, 120], at any step i, a node p sends the message destined to node (p + i) mod N. Clearly, the cost of this algorithm, TE1F1, is equal to (N - 1)(d + 1).
One may also use the standard exchange algorithm for total exchange, similar to the ones used in hypercubes and meshes [71, 24], where during each step the complete network is recursively divided into halves, and messages are exchanged across the new divisions at each step. Nodes combine messages into larger messages to be transmitted as a single unit. Consider this algorithm for an 8-node multicomputer, as shown in Figure 5.8. There are N/2 messages to be sent by each node at any step in this algorithm. I describe this only for node 0. Node 0 sends all its messages for the nodes in the upper half (that is, nodes 4, 5, 6, and 7) to node 4 at step 1. At the same time, it receives the messages for its half from node 4. At the second step, node 0 sends its message, along with the messages from node 4 destined to nodes 2 and 3, to node 2. At the same time, it receives the messages from nodes 2 and 6 for itself and node 1. At the third step (in general, there are log2 N steps), node 0 sends its message along with the other messages from nodes 2, 4, and 6 to node 1. It is clear that at the end of this step all nodes have exchanged all their messages. Thus, this algorithm has a cost of (d + N/2) log2 N.
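The recursive-halving schedule above can be checked with a short simulation, an illustration of mine that assumes N is a power of two and labels each message with its (source, destination) pair:

```python
def total_exchange(N):
    """Simulate the standard-exchange total exchange on N = 2**m
    nodes. held[p] is the set of (source, destination) messages
    currently stored at node p."""
    held = [{(p, q) for q in range(N)} for p in range(N)]
    bit = N // 2
    while bit:
        outgoing = [set() for _ in range(N)]
        for p in range(N):
            partner = p ^ bit
            # ship every message whose destination lies on the
            # partner's side of the current division
            moving = {m for m in held[p] if (m[1] & bit) == (partner & bit)}
            outgoing[partner] |= moving
            held[p] -= moving
        for p in range(N):
            held[p] |= outgoing[p]
        bit //= 2
    return held

held = total_exchange(8)
# after log2(8) = 3 steps every node holds exactly its own 8 messages
assert all(held[p] == {(q, p) for q in range(8)} for p in range(8))
```

Each of the log2 N steps moves N/2 combined messages per node, which is where the (d + N/2) log2 N cost comes from.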
Figure 5.8: Total exchange on an 8-node RON (k, N) under single-port modeling
Which algorithm, the direct or the standard exchange, is faster depends on the number of nodes N and the term d. I propose another algorithm, called the combined total exchange algorithm, which is a combination of these two algorithms.
I begin this algorithm by doing some (or even none) of the steps involved in the standard total exchange algorithm, and then continue with the direct algorithm. That is, divide the nodes in the complete network in half and do the steps involved in the standard total exchange algorithm up to the point where there is no gain in continuing to do so. From that step on, the direct algorithm is used for all the nodes in each of the created subgroups at the same time. Actually, the goal is to find the number of steps, or a bound for the number of steps, before switching to the direct algorithm such that the time associated with this algorithm is less than (or at least equal to) that of the other two (direct and standard exchange) algorithms.
Let me explain this algorithm with i = 1 (the number of standard exchange steps performed) for the example shown in Figure 5.8. At step 1, the nodes in the complete network are divided into halves. Each node exchanges 4 messages with its corresponding node in the other half. This takes d + 4, and at this point each of the network halves contains only messages destined to the half itself. As a matter of fact, each node now has two messages for each of the nodes in its half. These messages can be distributed to their destinations using a direct algorithm. There are 4 nodes in each half and 2 messages to be exchanged at a time, for a cost of (4 - 1)(d + 2) = 3d + 6. Hence, this algorithm has a total cost of 4d + 10.
Lemma 1: The combined total exchange algorithm under single-port modeling on RON (k, N) has a cost of i(d + N/2) + (N/2^i - 1)(d + 2^i), where i is the number of steps of the standard exchange algorithm performed before switching to the direct algorithm.
Proof. In the combined total exchange algorithm, each time a standard exchange algorithm step is done, a cost of d + N/2 is added. This brings up the term i(d + N/2). The first part of the second term, N/2^i - 1, accounts for the number of nodes in the groups doing the direct algorithms simultaneously. The second part, (d + 2^i), stands for the delay associated with the transfer of messages, whose length has doubled at each step.
It is clear that this algorithm is exactly the same as the direct algorithm when i = 0, and the standard exchange algorithm when i = log2 N.
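The closed form of Lemma 1 can be sanity-checked numerically; the snippet below is a sketch of mine, with d and N as free parameters:

```python
def combined_cost(N, d, i):
    """Lemma 1 cost (single-port): i standard-exchange steps, then
    the direct algorithm inside each subgroup of N / 2**i nodes."""
    return i * (d + N // 2) + (N // 2**i - 1) * (d + 2**i)

# the endpoints reduce to the two known algorithms (here N = 8, d = 5)
assert combined_cost(8, 5, 0) == (8 - 1) * (5 + 1)     # direct
assert combined_cost(8, 5, 3) == 3 * (5 + 8 // 2)      # standard exchange
# and i = 1 reproduces the 4d + 10 worked example
assert combined_cost(8, 5, 1) == 4 * 5 + 10
```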
k-port: The direct algorithm for k-port modeling requires node p at step i to send its messages to the nodes (p + (i - 1)k + 1) mod N, (p + (i - 1)k + 2) mod N, ..., (p + ik) mod N. This algorithm, TE1Fk, has a cost of ceil((N - 1)/k)(d + 1).
The same grouping and algorithm as GkFk can be used for total exchange, with the exception that this time each node sends N/(k + 1) messages at a time. Therefore, the cost of this algorithm is log_{k+1} N (d + N/(k + 1)). Figure 5.9 shows the above algorithm when N = 9 and k = 2.
Figure 5.9: Total exchange on a 9-node RON (k, N) under 2-port modeling
Which algorithm, TE1Fk or the standard exchange algorithm, is faster depends on the number of nodes N, the number of input/output channels k, and the term d. Just like the single-port modeling, a combined total exchange algorithm is proposed which is a combination of the above two algorithms.

Lemma 3: The combined total exchange algorithm under k-port modeling on RON (k, N) has a cost of i(d + N/(k + 1)) + ceil((N/(k + 1)^i - 1)/k)(d + (k + 1)^i), where i is the number of steps of the standard exchange algorithm performed before switching to the direct algorithm.
Proof. In the combined total exchange algorithm under k-port modeling, each time a standard exchange algorithm step is done, a cost of d + N/(k + 1) is added. This brings up the term i(d + N/(k + 1)). The first part of the second term, ceil((N/(k + 1)^i - 1)/k), accounts for the number of direct-algorithm steps performed simultaneously in the groups. The second part, (d + (k + 1)^i), stands for the delay associated with the transfer of messages.
It is clear that this algorithm is exactly the same as the direct algorithm when i = 0, and the standard exchange algorithm when i = log_{k+1} N. I have not found any mathematical proof that this algorithm is better than the known algorithms. However, in all the numerical examples (more than one hundred thousand) that I have performed for the comparison of these algorithms, I have always found a step, i, for which the combined total exchange algorithm had a shorter or equal exchange time compared to both the direct and the standard exchange algorithms. The above statement is also true for single-port modeling. Therefore, it is conjectured that the proposed algorithm is better than (or at least equal to) both known algorithms. Table 5.5 and Table 5.6 summarize some typical examples with optimal costs for the combined algorithms under single-port and k-port modeling.
Table 5.5: Total exchange time, N = 1024, single-port
Table 5.6: Total exchange time, N = 1024, k = 3
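The optimization over i behind such tables can be sketched as a direct search; this is my own illustration for the single-port case, with N and d as parameters:

```python
def best_combined(N, d):
    """Search every switch-over step i of the combined total
    exchange algorithm (single-port model) and return the best i,
    its cost, and the two endpoint costs (direct, standard)."""
    m = N.bit_length() - 1                       # log2 N
    cost = lambda i: i * (d + N // 2) + (N // 2**i - 1) * (d + 2**i)
    i_opt = min(range(m + 1), key=cost)
    return i_opt, cost(i_opt), cost(0), cost(m)

i_opt, c_opt, c_direct, c_std = best_combined(1024, 16)
# for this N and d the combined algorithm beats both endpoints
assert c_opt < min(c_direct, c_std)
```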
5.8 Summary
In this chapter, I presented and analyzed a broadcasting algorithm [20] that could effectively hide the reconfiguration delay d in the network RON (k, N). Essentially, in this algorithm, the reconfiguration phase of some of the nodes is overlapped with the message transmission phase of the other nodes, which ultimately reduces the broadcasting time. The analysis of the broadcasting algorithm includes a closed formulation that yields the termination time.
The solution for the total exchange problem combines two known algorithms, direct [109, 120] and standard exchange [71, 24], and it includes an optimization phase that determines the number of steps after which the first algorithm terminates and the second one is engaged. This ensures a termination time that is better than what can be accomplished by either of the two algorithms. Meanwhile, known algorithms for scattering and all-to-all broadcasting from the literature [40, 21] have been adapted to the network RON (k, N).
The scattering, multinode broadcasting, and total exchange algorithms discussed in this chapter assumed that the number of nodes in the RON (k, N) is a power of 2, or a power of (k + 1), under single-port and k-port modeling, respectively. However, when the number of processors is not a power of 2, or a power of (k + 1), dummy nodes can be assumed to exist up to the next power of 2 or (k + 1), with a small performance loss.
So far in this thesis, I have been concerned with efficient communications in message-passing parallel computer systems using reconfigurable interconnects. I have used knowledge of the next destination (obtained either by prediction or algorithmically) to hide the reconfiguration latency of the interconnect. In Chapter 6, regardless of the type of the interconnection network, I utilize prediction techniques in general, and more specifically the predictors proposed in Chapter 3, to remove the redundant message copying at the receiving side of communications in message-passing systems.
Chapter 6
Efficient Communication Using Message Prediction for Clusters of Multiprocessors
A significant portion of the software communication overhead belongs to a number of message copying operations. Ideally, it is desirable to have a true zero-copy protocol, where the message moves directly from the send buffer in its user space to the receive buffer at the destination without any intermediate buffering. However, because message-passing applications at the send side do not know the final receive buffer addresses, early arriving messages have to be buffered in a temporary area.
I explain the motivation behind this work and discuss related work in Section 6.2. In Section 6.3, I elaborate on how prediction would help eliminate message copying at the receiving side of communications. I explain the experimental methodology used to gather communication traces of the parallel applications in Section 6.4. I characterize some communication properties of the parallel application benchmarks by presenting the frequency and distribution of receive communication calls in Section 6.5. I show that there is a message reception communication locality in message-passing parallel applications [5]. Given this communication locality at the receiver side, I use the predictors introduced in Chapter 3 to predict the next consumable message. This chapter contributes by arguing that these message predictors can be efficiently used to drain the network and cache the incoming messages even if the corresponding receive calls have not been posted yet. This way, there is no need to unnecessarily copy the early arriving messages into a temporary buffer. As shown in Section 6.6, the performance of these predictors, in terms of hit ratio, on some parallel applications is quite promising [5] and suggests that prediction has the potential to eliminate most of the remaining message copies. I compare the performance and storage requirements of the predictors in Section 6.7. Finally, I summarize this chapter in Section 6.8.
6.1 Introduction
With the increasing uniprocessor and SMP computation power available today, interprocessor communication has become an important factor that limits the performance of workstation clusters. Essentially, communication overhead is one of the most important factors affecting the performance of parallel computers. Many factors affect the performance of communication subsystems in parallel systems. Specifically, communication hardware and its services, communication software, and the user environment (multiprogramming, multiuser) are the major sources of the communication overhead.

The communication hardware aspect includes the architecture and placement of the
network interface, and the interconnection network and its services. Many architectures have been proposed for the network interfaces. They are classified as (1) direct [52, 63, 97, 88] and (2) memory-based [48, 111, 126, 13]. Direct network interfaces allow a processor to directly access the network queue. However, they mostly ignore the issue of multiprogramming; that is, only a single thread can use the network interface at a time. Memory-based interfaces provide protection but have high latency. Interconnection networks themselves are another source of communication hardware latency. Communication services, including flow control and message delivery, also add to this latency.
Communication software overhead currently dominates the communication time in clusters of workstations. In the current generation of parallel computer systems, the software overheads are tens of microseconds [43]. This is worse in clusters of workstations. Even with the high performance networks [23, 67, 111] available today, there is still a gap between what the network can offer and what the user application can see. The communication software overhead comes mainly from three different sources: crossing protection boundaries several times between the user space and the kernel space, passing through several protocol layers, and involving a number of memory copying operations.
Several researchrrs are working to minimize the cost of crossing protection bound-
aries. and using simple protocol layers by utilizinz ~iser=levrl messaging techniques such
as .-fcri\-e :Clessa,os ( A M ) [125]. Fasr Messages ( F M ) [102]. bïrt~tal iClemo~?*-hlapped
Cornnizrriications (VMMC-2) [48], LWer [ 1261. LAPI [ 1 1 O ] . Basic lntelfacejor- ParaIlel-
isnl (BIP) [105]. ?irrirnl Inre~fice .-lrchi&ectine (VIA) [19]. and PM [ I I I ] . A sigiticant
portion of the soliware communication overhead belonçs to a nurnber of message copy-
ing. Idrally. message protocols should transfer messages in a single copy (this is usually
called o true zero-copy). In other words. the protocol should çopy the message directly
tiorn the srnd buRer in its user space to the receive bufer in the destination without any
intermediate buflerin-. However. applications at the send side do not know the final
receiw butfer addresses and. hence. the communication subsystems at the receiving end
still copy messages unnrcessarily from the network interface to a system butfer. and then
tiom the system butiir to the user buWcr when the receiving application posts the reçeive
crtll.
Some researchers have tried to avoid memory copying [48, 79, 106, 14, 119, 118]. While they have been able to remove the memory copying between the application buffer space and the network interface at the send side by using user-level messaging techniques, they have not been able to remove the memory copying at the receiver side completely. They may achieve zero-copy messaging at the receiver side only if the receive call is already posted, a rendez-vous type communication is used for large messages, or the destination buffer address is already known by a pre-communication. Note, however, that
MPI-2 [93] supports a remote memory access (RMA) operation, but this is mostly suitable for receiver-initiated communications arising from the shared-memory paradigm.
I am interested in bypassing the memory copying at the destination in the general case, eager or rendez-vous, and for sender-initiated communications as in MPI [92, 93]. In this chapter, I argue that it is possible to address the message copying problem at the receiving side by speculation. I support my claim by showing that messages display a form of locality at the receiving ends of communications.
I introduce here, for the first time, the notion of message prediction for the receiving side of message-passing systems. By predicting the next receive communication call, and hence the next destination buffer address, before the receive call is posted, one will be able to copy the message directly into the CPU cache speculatively before it is needed, so that in effect a zero-copy transfer can be achieved.
I am interested in utilizing the predictors proposed in Chapter 3 [3, 2], but this time at the receiver side, to predict the next consumable message and drain the network as soon as the message arrives. Upon a message arrival, a user-level thread is invoked. If the receive call has not been issued yet, the message will be cached, but efficient cache mapping mechanisms need to be devised to facilitate binding at the moment the receive call is issued. If the receive call has already been issued, then the message can be written to its final destination.
This chapter concentrates on message prediction at the destinations in message-passing systems using MPI, in isolation. This is analogous to studying branch prediction, or coherence activity prediction [97], in isolation. Our tools are not yet ready for measuring the effectiveness of the predictors on the application run-time. My preliminary evaluation measures the accuracy of the predictors in terms of hit ratio. The results are quite promising and suggest that prediction has the potential to eliminate most of the remaining message copies.
6.2 Motivation and Related Work
High performance computing is increasingly concerned with efficient communication across the interconnect due to the availability of high-speed, highly-advanced processors. Modern switched networks, called System Area Networks (SAN), such as Myrinet [23] and ServerNet [67], provide high communication bandwidth and low communication latency. However, because of the high processing overhead due to communication software, including network interface control, flow control, buffer management, memory copying, polling, and interrupt handling, users cannot see much difference compared to traditional local area networks.
Fortunately, several user-level messaging techniques have been developed to remove the operating system kernel and protocol stack from the critical path of communications [125, 107, 18, 126, 49, 105, 110, 121]. This way, applications can send and receive messages without operating system intervention, which often greatly reduces the communication latency.
Data transfer mechanisms and message copying, control transfer mechanisms, address translation mechanisms, protection mechanisms, and reliability issues are the key factors for the performance of a user-level communication system. In this chapter, I am particularly interested in avoiding message copying at the receiver side of communications.
A significant portion of the software communication overhead belongs to a number of message copying operations. With traditional software messaging layers, there are usually four message copying operations from the send buffer to the receive buffer, as shown in Figure 6.1. These copies are, namely, from the send buffer to the system buffer (1), from the system buffer to the network interface (NI) (2), and, at the other end of the communication, from the network interface to the system buffer (3), and from the system buffer to the receive buffer (4) when the receive call is posted. Note that I have not considered the data transfer from the network interface (NI) at the sending process to the network interface at the receiving process as a separate copy. Also, the network interface can be placed either on the I/O bus or on the memory bus.
At the send side, some user-level messaging layers use programmed I/O to avoid the system buffer copy. FM uses programmed I/O, while AM-II and BIP do so only for small messages. Some other user-level messaging layers use DMA: VMMC-2, U-Net, and PM use DMA to bypass the system buffer copy, while AM-II and BIP do so only for large messages. In systems that use DMA, applications or a library dynamically pin and unpin pages in the user space that contain the send and the receive buffers. Address translation can be done using a kernel module, as in BIP, or by caching a limited number of address translations for the pinned pages, as in VMMC-2, U-Net/MM [17], and PM. Some network interfaces also permit bypassing message copying at the network interface by directly writing into the network.
Contrary to the send side, bypassing the system buffer copy at the receiving side may not be achievable. Processes at the sending side do not know the destination buffer addresses. Therefore, when a message arrives at the receiving side, it has to be buffered if the receive call has not been posted yet.

Figure 6.1: Data transfers in a traditional messaging layer

VMMC for the SHRIMP multicomputer is a communication model that provides direct data transfer between the sender's and
receiver's virtual address space. However, it can achieve a zero-copy transfer only if the sender knows the destination buffer address. Therefore, the receiver exports its buffer address by sending a message to the sender before the actual transmission can take place. This leads to a 2-phase rendez-vous protocol, which adds to the network traffic and to the network latency, especially for short messages.
VMMC-2 [48] uses a transfer redirection mechanism instead. It uses a default, redirectable receive buffer for a sender who does not know the address of the receive buffer. When a message arrives at the receiving network interface, the redirection mechanism checks to see whether the receiver has already posted its buffer address. If the receive buffer has been posted earlier than the message arrival, the message will be directly transferred to the user buffer; thus it achieves a zero-copy transfer. If the buffer address is not posted, the message must be buffered in the default buffer and will be transferred when the receive buffer is posted; thus, it achieves a one-copy transfer. However, if the receiver posts its buffer address while the message is arriving, part of the message is buffered in the default buffer and the rest is transferred to the user buffer.
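The three outcomes of the redirection check can be captured in a tiny model (hypothetical code of mine, not the VMMC-2 implementation): packets of a message arrive one per tick, and the receive buffer address may be posted at any tick.

```python
def deliver(packets, post_tick):
    """Count packets landing in the default buffer (and needing a
    later copy) versus packets redirected straight to the user
    buffer, given the tick at which the buffer address is posted."""
    buffered = direct = 0
    for t in range(packets):
        if t < post_tick:      # address not yet known: default buffer
            buffered += 1
        else:                  # redirection hits the user buffer
            direct += 1
    return buffered, direct

assert deliver(4, 0) == (0, 4)   # posted before arrival: zero-copy
assert deliver(4, 4) == (4, 0)   # posted after arrival: one-copy
assert deliver(4, 2) == (2, 2)   # posted mid-arrival: split delivery
```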
Fast sockets [106] has been built using active messages. It uses a mechanism at the receiver side called receive posting to avoid the message copy into the fast socket buffer. If the message handler knows the data's final memory destination upon message arrival, the message is directly moved to the application user space. Otherwise, it has to be copied into the fast socket buffer.
FM 2.x [79] uses an approach similar to that of fast sockets. FM collaborates with the handler to direct the incoming messages into the destination buffer if the receive call has already been posted.
MPI-LAPI [14] is an implementation of MPI on top of LAPI [110] for the IBM SP machines. In the implementation of the eager protocol, the header handler of LAPI returns a buffer pointer to LAPI, which tells LAPI where the packets of the message must be reassembled. If a receive call has been posted, the address of the user buffer is returned to LAPI. If the header handler does not find a matching receive, it will return the address of an early arrival buffer, and hence a one-copy transfer is accomplished. Meanwhile, messages larger than the eager size are transferred using a 2-phase rendez-vous protocol.
Some research projects have proposed solutions for multi-protocol message-passing interfaces on clusters of multiprocessors (Clumps), using both shared memory for intra-node communications and message-passing for inter-node communications [119, 55, 87].
MPICH-PM/CLUMP [119] is an MPI library implemented on a cluster of SMPs. It uses a message-passing-only model, where each process runs on a processor of an SMP node. For inter-node communications, it uses eager and rendez-vous protocols. For short messages, it achieves one-copy using the eager protocol, as the message is copied into a temporary buffer if the MPI receive primitive has not been issued. For large messages, it uses the rendez-vous protocol to achieve zero-copy by using a remote write operation, but it needs an extra communication. For intra-node communications, it achieves one-copy using a kernel primitive that allows copying messages from the sender to the receiver without involving the communication buffer.
BIP-SMP [8], for intra-node communications, uses shared memory for small messages, with two memory copies, and direct copy for large messages, with a kernel overhead. For inter-node communications, it works like MPI-BIP, which is a port of MPICH [57].
TOMPI [38] is a threaded implementation of MPI on a single SMP node. It copies a message only once by utilizing multiple threads on an SMP node. Unfortunately, it is not scalable to a cluster of SMP machines.
Other techniques to bypass extra copying are the page re-mapping and copy-on-write techniques [31, 45]. Both techniques require switching to the supervisor mode, acquiring the necessary locks on virtual memory data structures, changing the virtual memory mapping at several levels for each page, performing Translation Lookaside Buffer (TLB)/cache consistency actions, and finally returning to the user mode. This limits the performance of the page re-mapping and copy-on-write techniques. A zero-copy TCP stack has been implemented in Solaris by using copy-on-write pages and re-mapping to improve communication performance [31]. It achieves a relatively high throughput for large messages. However, it does not perform well for small messages. This work is also solely dedicated to the SUN Solaris virtual memory system.
fbufs [45] also uses the re-mapping technique to avoid the penalty of copying large messages across different layers of the protocol stack. However, fbufs allows re-mapping only for a limited range of the user virtual memory.
It is quite clear that even user-level messaging techniques may not achieve a zero-copy communication all the time at the receiver side of communications. Meanwhile, the major problem with all page re-mapping techniques is their poor performance for short messages, which is extremely important for parallel computing.
As stated in Chapter 3, many prediction techniques have been proposed in the past to predict the future accesses of sharing patterns and coherence activities in distributed shared memory (DSM) systems by looking at their observed behavior [96, 77, 73, 133, 31, 107]. Recently, Afsahi and Dimopoulos proposed some heuristics to predict the destination target of subsequent communication requests at the send side of communications in message-passing systems [3, 4]. However, to the best of my knowledge, no prediction technique has been proposed for the receive side of communications in message-passing systems to reduce the latency of a message transfer.
This chapter of the thesis reports on an innovative approach for removing message copying at the receiving ends of communications for message-passing systems. I argue that it is possible to address the message copying problem at the receiving side by speculation. I introduce message prediction techniques such that messages can be directly transferred to the cache even if the receive calls have not been posted yet.
6.3 Using Message Predictions
In this section, I analyze the problem of the early arrival of messages at the destinations in message-passing systems. In such systems, a number of messages arrive in arbitrary order at the destinations. The consuming process or thread will consume one message at a time. If I know which message is going to be consumed next, then I can move the message upon its arrival near the place where it is to be consumed (e.g., a staging cache), or I could schedule which thread to execute next, preferably on the same processor as the consuming thread, to enhance the chances that the data will be in the processor cache when it is accessed by the consumer.
For this, one has to consider three different issues. First, deciding which message is going to be consumed next; this can be done by devising receive call predictors, history-based predictors that predict the subsequent receive calls issued by a given process in a message-passing program. Second, deciding where and how this message is to be moved into the cache. Third, efficient cache re-mapping and late binding mechanisms need to be devised for when the receive call is posted.

In this chapter, I am addressing the first problem, that is, utilizing message predictors and evaluating their performance. I am working on several methods to address the remaining issues.
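As a concrete, deliberately minimal illustration of a history-based receive-call predictor, the sketch below predicts that the message identifier following X will be the same one that followed X last time. This is my own example on a synthetic trace; the Chapter 3 predictors evaluated later are more elaborate.

```python
class NextAfterPredictor:
    """Predicts the next receive by remembering, for each message
    identifier, which identifier followed it last time."""
    def __init__(self):
        self.next_after = {}     # msg id -> id that last followed it
        self.last = None

    def predict(self):
        return self.next_after.get(self.last)

    def observe(self, msg_id):
        if self.last is not None:
            self.next_after[self.last] = msg_id
        self.last = msg_id

# a cyclic receive pattern, as an iterative solver might produce
trace = [1, 2, 3, 4] * 50
p = NextAfterPredictor()
hits = 0
for m in trace:
    hits += (p.predict() == m)
    p.observe(m)
assert hits == 195       # misses occur only while the cycle is being learned
```

On this 200-call trace the predictor misses only the first five calls, a 97.5% hit ratio, illustrating how repetitive receive patterns make speculation attractive.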
6.4 Experimental Methodology
In exploring the effect that different heuristics have in predicting the next receive call, I used a number of parallel benchmarks and extracted their communication traces, on which I applied the predictors. Specifically, I used the BT, SP, and CG benchmarks from the NPB suite [15], and the PSTSWM application, introduced in Chapter 2. I did not use the MG and LU benchmarks from the NPB suite because these benchmarks use MPI_ANY_SOURCE in some of their receive calls (MPI_Recv and MPI_Irecv). This means that the applications may receive a particular message from different sources depending on the order of arrival. I also did not use the QCDMPI application, as this application uses the synchronous communication primitive MPI_Ssend, where the sender waits for the receive call to be posted and only then transmits the message. In this case, prediction would not help, as the receive call is already posted.
I experimented with the workstation class "W" and the larger class "A" of the NPB suite, and the default problem size for the PSTSWM application. Note that because of space and access limitations, I did not experiment with the larger classes "B" and "C". The NPB results are almost the same for the "W" and "A" classes; hence, I report only the "A" class results here. Note that I also removed the initialization part from the communication traces of the PSTSWM application.
6.5 Receiver-side Locality Estimation
The applications use the blocking and nonblocking standard MPI receive primitives, namely MPI_Recv and MPI_Irecv [92]. MPI_Recv (buf, count, datatype, source, tag, comm, status) is a standard blocking receive call; when it returns, the data is available in the destination buffer. The PSTSWM application uses this type of receive call. MPI_Irecv (buf, count, datatype, source, tag, comm, request) is a standard nonblocking receive call. It immediately posts the call and returns; hence, the data is not available at the time of return, and another call is needed to complete the receive. All applications in this study use this type of receive call.
As noted earlier in Chapter 3, one of the communication characteristics of any parallel application is the frequency of communications. Figure 6.2 illustrates the minimum, average, and maximum number of receive communication calls in the applications under different system sizes. I executed the applications once for each system size and counted the number of receive calls for each process of the applications. Hence, in Figure 6.2, by average, minimum, and maximum, I mean the average, minimum, and maximum number of receive calls taken over all processes of each application. It is clear that all processes in the BT, SP, and CG applications have the same number of receive communication calls for each system size, while processes in the PSTSWM application have different numbers of receive communication calls.
Figure 6.2: Number of receive calls in the applications under different system sizes
MPI_Recv and MPI_Irecv calls have a 7-tuple set consisting of source, tag, count, datatype, buf, comm, and status or request. In order to choose precisely one of the received messages at the network interface and transfer it to the cache, the predictors need to consider all the details of a message envelope; that is, source, tag, count, datatype, buf, and comm (I don't consider status and request, as they are just handles when the calls return). I did not rely only on the buffer address, buf, of a receive call, as many processes may send their messages to the same buffer address of a particular destination process. Nor could I depend only on the sender, source, of a message, or on the length, count, of a message.
Therefore, I assigned a different identifier to each unique 6-tuple found in the communication traces of the applications. Figure 6.3 shows the number of unique message identifiers in the applications under different system sizes. By average, minimum, and maximum, I mean the average, minimum, and maximum number of unique identifiers taken over all processes of each application. It is evident that all processes in the BT and CG applications have the same number of unique message identifiers, while processes in the SP and PSTSWM applications have different numbers of unique message identifiers (except when the number of processes is four for the SP benchmark).
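As an illustration, the identifier assignment can be sketched as follows; the function and the sample trace values are hypothetical reconstructions, not taken from the actual traces:

```python
def build_identifiers(trace):
    """Assign a small integer identifier to each unique 6-tuple
    (source, tag, count, datatype, buf, comm) seen in a receive-call
    trace, and rewrite the trace as a sequence of identifiers."""
    ids = {}
    sequence = []
    for call in trace:
        if call not in ids:
            ids[call] = len(ids)      # next unused identifier
        sequence.append(ids[call])
    return ids, sequence

# Hypothetical trace entries: (source, tag, count, datatype, buf, comm).
trace = [(1, 7, 64, "MPI_DOUBLE", 0x1000, 0),
         (2, 7, 64, "MPI_DOUBLE", 0x2000, 0),
         (1, 7, 64, "MPI_DOUBLE", 0x1000, 0)]
ids, seq = build_identifiers(trace)   # two unique identifiers; seq = [0, 1, 0]
```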
Figure 6.4 shows the distribution of each unique message identifier for process zero of the applications when the number of processes is 64 for CG and 49 for the other applications. I chose process zero because this process almost always had the largest number of unique message identifiers among all processes in the applications and is also responsible for distributing data and verifying the results of the computation. As shown in Figure 6.4, the message identifiers are evenly distributed in BT. However, the distributions of the message identifiers in CG and PSTSWM are almost bimodal, with two separated peaks. The SP benchmark shows four different peaks for the message identifiers. Similar distributions have been found for other system sizes [6].
6.5.1 Communication Locality
As noted in Chapter 3, some researchers have tried to find or use the communications locality properties of parallel applications [3, 4, 75, 30, 36]. I define the term message reception locality in conjunction with this work. By message reception locality I mean that if a certain message reception call has been used, it will be re-used with high probability by a portion of code that is "near" the place where it was used earlier, and that it will be re-used in the near future.
Figure 6.3: Number of unique message identifiers in the applications under different system sizes
In the following subsection, I present the performance of the classical LRU, LFU, and FIFO heuristics on the applications to see the existence of locality or repetitive receive calls. I use the hit ratio to establish and compare the performance of these heuristics. As the hit ratio, I define the percentage of the times that the predicted receive call was correct out of all receive communication requests.
Figure 6.4: Distribution of the unique message identifiers for process zero in the applications
6.5.2 The LRU, FIFO and LFU Heuristics
The Least Recently Used (LRU), First-In-First-Out (FIFO), and Least Frequently Used (LFU) heuristics all maintain a set of k (k is the window size) unique message identifiers. If the next message identifier is already in the set, then a hit is recorded. Otherwise, a miss is recorded and the new message identifier replaces one of the identifiers in the set according to which of the LRU, FIFO, or LFU strategies is adopted.
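To make the window mechanics concrete, the following is a minimal sketch (illustrative, not the instrumentation used in this thesis) that replays a trace of message identifiers through a k-entry window under each policy and reports the hit ratio defined above:

```python
from collections import Counter

def hit_ratio(sequence, k, policy="LRU"):
    """Fraction of receive calls whose identifier is already in a
    k-entry window maintained under LRU, FIFO, or LFU replacement."""
    window = []          # oldest / least-recently-used element at index 0
    freq = Counter()     # use counts, needed only by the LFU policy
    hits = 0
    for ident in sequence:
        freq[ident] += 1
        if ident in window:
            hits += 1
            if policy == "LRU":               # refresh recency on a hit
                window.remove(ident)
                window.append(ident)
        else:
            if len(window) == k:              # miss with a full window: evict
                if policy == "LFU":
                    window.remove(min(window, key=lambda i: freq[i]))
                else:                         # LRU and FIFO both evict the front
                    window.pop(0)
            window.append(ident)
    return hits / len(sequence)

# A strictly cyclic trace is fully captured once the window holds the
# whole cycle; with a smaller window the same trace thrashes.
ratio = hit_ratio([0, 1, 2] * 10, k=3, policy="LRU")   # 27 hits out of 30
```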
Figure 6.5 shows the results of the LRU, FIFO, and LFU heuristics on the application benchmarks when the number of processes is 64 for CG and 49 for all other applications. It is clear that the hit ratios in all benchmarks approach 1 as the window size increases. The performance of the FIFO algorithm is the same as the LRU for the BT and PSTSWM benchmarks, and almost the same for the SP and CG benchmarks. The LFU algorithm consistently has a better performance than the LRU and FIFO heuristics on the BT, CG, and PSTSWM applications. It also has a better performance than the LRU and FIFO heuristics on the SP benchmark for window sizes of greater than five. It is interesting to see that a real application like PSTSWM needs window sizes of greater than 150 to achieve a good performance (hit ratios above 80%) under the LFU policy. Similar performance results for the LRU, FIFO, and LFU heuristics on other system sizes can be found in [6].
Figure 6.5: Effects of the LRU, FIFO, and LFU heuristics on the applications
Essentially, the LRU, FIFO, and LFU heuristics do not predict exactly the next receive call but show the probability that the next receive call might be in the set. For instance, the SP benchmark shows nearly a 60% hit ratio for a window size of five under the LRU heuristic. This means that 60% of the time one of the five most recently issued calls will be issued next. These heuristics perform better when the window size k is sufficiently large. However, this large window adds to the hardware and software implementation complexity, as one needs to move all messages in the set to the cache in the likelihood that one of them is going to be used next. This is prohibitive for large window sizes.
I am interested in having predictors that can predict the next receive call with a high probability. In Section 6.6, I utilize the novel message predictors proposed in Chapter 3, employing different heuristics, and evaluate their performance on the applications.
6.6 Message Predictors
The set of predictors used in this section predict the subsequent receive calls based on the past history of communication patterns on a per-process basis. These predictors were proposed in Chapter 3 to predict the destination target of subsequent communication requests at the send side of communications. It is worth mentioning that the message re-ordering effect [77] (messages from different processes may arrive out-of-order even if messages from the same processes arrive in-order in most networks) has no effect on the predictions, as the predictors predict the next receive calls based on the patterns of the receive calls in the program that runs on the same process and not on the arriving messages, unless the order of receive calls depends on the order of message arrival. Note that in the following figures, by average, minimum, and maximum, I mean the average, minimum, and maximum hit ratio taken over all processes of each application.
6.6.1 The Tagging Predictor
As described earlier in Chapter 3, the Tagging predictor assumes a static communication environment, in the sense that a particular communication receive call in a section of code will be the same one with a large probability. I attach a different tag to each of the receive calls found in the applications. This can be implemented with the help of a compiler or by the programmer through a pre-receive (tag) operation, which will be passed to the communication subsystem to predict the next receive call before the actual receive call is issued.
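A minimal sketch of this idea (illustrative, not the Chapter 3 implementation) keeps, for each tag, the message identifier it carried last time and predicts that the same identifier will recur; the tag and identifier values below are hypothetical:

```python
def tagging_hit_ratio(events):
    """events: (tag, identifier) pairs, one per executed receive call.
    Predict that a tag repeats the identifier it carried last time."""
    last_seen = {}
    hits = 0
    for tag, ident in events:
        if last_seen.get(tag) == ident:    # prediction was correct
            hits += 1
        last_seen[tag] = ident             # update the tag's history
    return hits / len(events)

# A call site that alternates between two messages is mispredicted
# every time it switches: miss, hit, miss, hit.
ratio = tagging_hit_ratio([(0, 5), (0, 5), (0, 6), (0, 6)])   # 0.5
```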
Figure 6.6: Effects of the Tagging predictor on the applications (N = 64 for CG, and 49 for the others)
The performance of the Tagging predictor is shown in Figure 6.6. It is evident that this predictor doesn't have a good performance for the applications studied. It cannot predict the communication patterns of PSTSWM at all, and has a degrading performance for all other applications when the number of processes increases.
6.6.2 The Single-cycle Predictor
The Single-cycle predictor, proposed in Chapter 3, is based on the fact that if a group of receive calls are issued repeatedly in a cyclical fashion, then I can predict the next request one step ahead. The performance of the Single-cycle predictor is shown in Figure 6.7. It is evident that its performance is consistently very high (hit ratios of more than 0.9). Note that for the PSTSWM application, the Single-cycle predictor has a zero hit ratio for one of the processes. However, it doesn't affect the average hit ratio over all the processes. It is worth mentioning that all Cycle-based predictors proposed in Chapter 3 (Single-cycle, Single-cycle2, Better-cycle, and Better-cycle2) have the same performance for the applications studied. Thus, I just report the results for the Single-cycle predictor here.
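A simplified sketch of the cycle idea (the actual Chapter 3 algorithm differs in its details) records identifiers until the first one reappears, treats the recorded run as a cycle, and then predicts by replaying it; a misprediction discards the cycle and starts recording afresh:

```python
def single_cycle_hit_ratio(sequence):
    """Hit ratio of a simplified single-cycle predictor over a
    sequence of message identifiers."""
    head, recording = None, []     # cycle being learned
    cycle, pos = None, 0           # cycle being replayed
    hits = 0
    for ident in sequence:
        if cycle is not None:
            if ident == cycle[pos]:                   # predicted correctly
                hits += 1
                pos = (pos + 1) % len(cycle)
                continue
            head, recording, cycle = None, [], None   # cycle broken
        if head is None:
            head, recording = ident, [ident]
        elif ident == head:                           # head reappeared: cycle formed
            cycle, pos = recording, 1 % len(recording)
        else:
            recording.append(ident)
    return hits / len(sequence)

# Four repetitions of the cycle 0,1,2: the first pass is spent learning;
# everything after the cycle forms is predicted correctly.
ratio = single_cycle_hit_ratio([0, 1, 2] * 4)    # 8 hits out of 12
```

As the repetition count grows, the learning pass is amortized and the hit ratio approaches 1, matching the behavior reported for the benchmarks.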
6.6.3 The Tag-cycle2 Predictor
The Tag predictor didn't have a good performance on the applications, while the Single-cycle predictor had a very good performance. The Tag-cycle2 predictor, proposed in Chapter 3, is a combination of the Tag predictor and the Single-cycle2 predictor. In the Tag-cycle2 predictor, I attach a different tag to each of the communication requests found
Figure 6.7: Effects of the Single-cycle predictor on the applications (N = 64 for CG, and 49 for the others)
in the benchmarks and do a Single-cycle2 discovery algorithm on each tag. The performance of the Tag-cycle2 predictor is shown in Figure 6.8. The Tag-cycle2 predictor performs well on all benchmarks. Its performance is the same as the Single-cycle predictor on BT and PSTSWM. However, it has a better performance on CG and a lower performance on SP.
Figure 6.8: Effects of the Tag-cycle2 predictor on the applications (N = 64 for CG, and 49 for the others)
6.6.4 The Tag-bettercycle2 Predictor
In the Single-cycle and Tag-cycle2 predictors, as soon as a receive call breaks a cycle, I remove the cycle and form a new cycle. In the Tag-bettercycle2 predictor, proposed in Chapter 3, I keep the last cycle associated with each tagbettercycle-head encountered in the communication patterns of each process. This means that when a cycle breaks, I maintain the elements of this cycle in memory for later reference. The performance of the Tag-bettercycle2 predictor is shown in Figure 6.9. The Tag-bettercycle2 predictor performs well on all benchmarks. Its performance is the same as the Single-cycle and Tag-cycle2 predictors on BT and PSTSWM. However, it has a better performance on CG and a lower performance on SP relative to the Single-cycle predictor. The Tag-bettercycle2 predictor has a better performance on the SP application compared to the Tag-cycle2 predictor. I also found that the applications have a very small number of tagbettercycle-heads (at most 2) under the Tag-bettercycle2 predictor and different system sizes.
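This refinement can be sketched as follows; for brevity the sketch ignores tags, so it is closer in spirit to Better-cycle2 than to Tag-bettercycle2, and it is an illustrative reconstruction rather than the thesis's algorithm. A broken cycle is saved, keyed by its head, and re-instated the next time that head appears, instead of being re-learned from scratch:

```python
def better_cycle_hit_ratio(sequence):
    """Hit ratio of a simplified 'keep the last cycle' predictor."""
    saved = {}                     # head identifier -> last cycle observed
    head, recording = None, []
    cycle, pos = None, 0
    hits = 0
    for ident in sequence:
        if cycle is not None:
            if ident == cycle[pos]:                   # predicted correctly
                hits += 1
                pos = (pos + 1) % len(cycle)
                continue
            saved[cycle[0]] = cycle                   # remember the broken cycle
            head, recording, cycle = None, [], None
        if head is None:
            if ident in saved:                        # re-instate a saved cycle
                cycle, pos = saved[ident], 1 % len(saved[ident])
                continue
            head, recording = ident, [ident]
        elif ident == head:                           # head reappeared: cycle formed
            cycle, pos = recording, 1 % len(recording)
        else:
            recording.append(ident)
    return hits / len(sequence)

# When the pattern 0,1 returns after an excursion to 2,3, the saved
# cycle is re-instated and predicts correctly without re-learning.
ratio = better_cycle_hit_ratio([0, 1, 0, 1, 2, 3, 2, 3, 0, 1, 0, 1])
```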
Figure 6.9: Effects of the Tag-bettercycle2 predictor on the applications (N = 64 for CG, and 49 for the others)
6.7 Message Predictors' Comparison
Figure 6.10 presents a comparison of the performance of the predictors on the applications under some typical system sizes. As we have seen so far, Single-cycle, Tag-cycle2, and Tag-bettercycle2 all perform exceptionally well on the benchmarks. However, the performance of the Single-cycle is better on the SP benchmark, while Tag-cycle2 and Tag-bettercycle2 have better performance on the CG benchmark.
6.7.1 Predictors' Memory Requirements
Table 6.1 compares the maximum memory requirements of the message predictors on the application benchmarks when the number of processes is 64 for CG, and 49 for BT, SP, and PSTSWM. I have found that the memory requirements of the predictors decrease gradually when the number of processes decreases. The numbers in the table are the multiplication factor for the amount of storage needed to maintain the message 6-tuple sets. It is quite clear that the memory requirements of the predictors are low. That makes them very attractive for implementation at the network interface. Comparatively, the predictors (Single-cycle, Tag-cycle, and Tag-bettercycle) need more memory for the PSTSWM application. Although the classical LRU, LFU, and FIFO heuristics need less memory, as stated earlier, the beauty of the predictors lies in the fact that they predict with high accuracy and transfer only one message to the cache, which should dramatically reduce the cache pollution effect, if any. This should also bring down the software cost of the implementation.

Figure 6.10: Comparison of the performance of the predictors on the applications (N = 64 for CG, and 49 for the others; N = 32 for CG and PSTSWM, and 36 for BT and SP)
Table 6.1: Memory requirements (in 6-tuple sets) for the predictors when N = 64 for CG, and N = 49 for BT, SP, and PSTSWM
6.8 Summary
Communication latency adversely affects the performance of networks of workstations. A significant portion of the software communication overhead belongs to a number of message copying operations. Ideally, it is very desirable to have a true zero-copy protocol, where the message is moved directly from the send buffer in its user space to the receive buffer in the destination without any intermediate buffering. However, this is not always possible, as a message may arrive at the destination when the corresponding receive call has not been issued yet. Hence, the message has to be buffered in a temporary buffer.
In this chapter of the dissertation, I have shown that there is message reception communication locality in message-passing applications. I have utilized the different predictors proposed in Chapter 3 to predict the next receive call at the receiver side of communications. By predicting receive calls early, a process can perform the necessary data placement upon message reception and move the message directly into the cache. I presented the performance of these predictors on some parallel applications. The performance results are quite promising and justify more work in this area.
I envision these predictors being used to drain the network and place the incoming messages in the cache in such a way as to increase the probability that the messages will still be in cache when the consuming thread needs to access them.
Chapter 7
Conclusions and Directions for Future Research
Parallel processing is the key to the design of high-performance computers. However, with the availability of fast microprocessors and small-scale multiprocessors, internode communication has become an increasingly important factor that limits the performance of parallel computers. In essence, parallel computers require extremely short communication latency such that network transactions have minimal impact on the overall computation time. This thesis uses a number of techniques to achieve efficient communications in message-passing systems. This thesis makes five contributions.
The first contribution of this thesis is the design and evaluation of two different categories of prediction techniques for message-passing systems. I present evidence that message destinations display a form of locality. This thesis utilizes the message destination locality property of message-passing parallel applications to devise a number of heuristics that can be used to predict the target of subsequent communication requests.
Specifically, I propose two sets of message destination predictors: Cycle-based predictors, which are purely dynamic predictors, and Tag-based predictors, which are static/dynamic predictors. In the Cycle-based predictors (Single-cycle, Single-cycle2, Better-cycle, and Better-cycle2), predictions are made dynamically at the network interface without any help from the programmer or compiler. In the Tag-based predictors (Tagging, Tag-cycle, Tag-cycle2, Tag-bettercycle, and Tag-bettercycle2), predictions are made dynamically at the network interface as well, but they require an interface to pass some information from the program to the network interface. This can be done with the help of the programmer or compiler by inserting instructions such as pre-connect (tag) in the program. The performance of the proposed predictors, especially Better-cycle2 and Tag-bettercycle2, is very good on all application benchmarks. Meanwhile, the memory requirements of the predictors are very low. The proposed predictors should be easily implementable on the network interface due to their simple algorithms and low memory requirements.
The heuristics proposed are only possible because of the existence of communications locality, which can be used in establishing a communication pathway between a source and a destination in reconfigurable interconnects before this pathway is to be used. This is a very desirable property, since it allows us to effectively hide the cost of establishing such communications links, thus providing the application with the raw power of the underlying hardware (e.g., a reconfigurable optical interconnect).
As the second contribution of this thesis, I show that the majority of reconfiguration delays in single-hop reconfigurable networks can be hidden by using one of the proposed high hit ratio predictors. In other words, by comparing the inter-send computation times of some parallel benchmarks with some specific reconfiguration times, most of the time we are able to fully utilize these computation times for the concurrent reconfiguration of the interconnect when we know, in advance, the next target using one of the proposed high hit ratio target prediction algorithms. This thesis also states that by utilizing the predictors at the send side of communications, applications at the receiver sides would also benefit, as messages arrive earlier than before.
As the third contribution of this thesis, I analyze a broadcasting algorithm that utilizes latency hiding and reconfiguration in the network to speed up the broadcasting operation under single-port and k-port modeling. In this algorithm, the reconfiguration phase of some of the nodes is overlapped with the message transmission phase of the other nodes, which ultimately reduces the broadcasting time. The analysis yields closed formulations that give the termination time of the algorithms.
The fourth contribution of this thesis is a new total exchange algorithm for single-hop reconfigurable networks under single-port and k-port modeling. I conjecture that this algorithm ensures a better termination time than what can be achieved by either the direct or the standard exchange algorithm.
Ideally, message protocols should copy the message directly from the send buffer in its user space to the receive buffer in the destination without any intermediate buffering. However, applications at the send side do not know the final receive buffer addresses, and, hence, the communication subsystems at the receiving end still copy messages unnecessarily into a temporary buffer.
This thesis presents evidence that there exists message reception communications locality in message-passing parallel applications. Given this message reception communications locality, the fifth contribution of this thesis is the use and evaluation of the proposed predictors to predict the next consumable message at the receiving ends of communications. This thesis contributes by claiming that these message predictors can be efficiently used to drain the network and cache the incoming messages even if the corresponding receive calls have not been posted yet. This way, there is no need to unnecessarily copy the early arriving messages into a temporary buffer. The performance of the proposed predictors, Single-cycle, Tag-cycle2, and Tag-bettercycle2, on the parallel applications is quite promising and suggests that prediction has the potential to eliminate most of the remaining message copies.
7.1 Future Research
The predictors proposed in Chapter 3 of this thesis, such as Tag-bettercycle2 and Better-cycle2, perform exceptionally well on all applications except QCDMPI, under different system sizes. It seems that this application repeatedly changes its message destinations in different cycles, such that even the best proposed predictors cannot always capture them. Thus, it might be helpful to devise other predictors, called All-cycle and Tag-allcycle, that could maintain all cycles associated with each cycle-head and tagbettercycle-head found in the communication traces of the applications. In case these two predictors, All-cycle and Tag-allcycle, have high memory requirements, it might be better to devise predictors that fall somewhere between the extreme cases; that is, predictors that can maintain more than one cycle but less than all of the cycles associated with each cycle-head and tagbettercycle-head. Not to mention that searching in different cycles may add to the performance penalty.
The Tag-based predictors proposed in Chapter 3 can be pure dynamic predictors if another level of prediction is done on the tags themselves at the network interface. This way, there is no need for the program to pass pre-connect (tag) (or pre-receive (tag) as in Chapter 6) information to the network interface. It is interesting to see what the performance of such 2-level Tag-based predictors would be.
In Chapter 4, I roughly showed that up to 50% of the time, applications at the receiving end might benefit when the predictors are applied at the send side of communications. However, a trace-driven simulator should be written to precisely evaluate the effect that applying the predictors at the send side has on the receive side, and on the total application run-time.
In Chapter 5, this thesis analyzes efficient broadcasting/multi-broadcasting algorithms that utilize latency hiding to speed up these operations. An optimal algorithm for multi-broadcasting remains to be devised, such that messages are pipelined in the embedded trees using the latency hiding broadcasting algorithms (BdFn or BFI). In this thesis, although the algorithms for scattering, all-to-all broadcasting, and total exchange are very efficient, they do not use the latency hiding technique. Although very challenging, efficient algorithms for multicasting, scattering, all-to-all broadcasting, and total exchange should be devised such that they use the latency hiding technique to hide the reconfiguration delay in the network.
As stated in Chapter 6, by predicting receive calls early, a node can perform the necessary data placement upon message reception and move the message directly into the cache in such a way as to increase the probability that the messages will still be in cache when the consuming thread needs to access them. Further issues that should be investigated include deciding where and how this message is to be moved in the cache. Would this cache be a first-level cache, a second-level cache, a third-level cache, or even a network cache? What mechanism should be used to transfer the message into the cache? User-level messaging and/or a multithreaded MPI environment? Meanwhile, efficient cache re-mapping and late binding mechanisms need to be devised for when the receive call is posted. Also, cache pollution and inaccurate timing are other issues that should be addressed.
The performance of the predictors proposed in this thesis was evaluated under single-port modeling; that is, the predictors predict one step ahead. However, the Cycle-based predictors (Single-cycle, Single-cycle2, Better-cycle, and Better-cycle2) and the Tagcycle-based predictors (Tag-cycle, Tag-cycle2, Tag-bettercycle, and Tag-bettercycle2) maintain the message destinations of a cycle. Therefore, it is possible to predict more than one step ahead. It would be interesting to find the performance of the predictors under such modeling in terms of hit ratio, total reconfiguration delays, and application run time.
Finally, all the applications studied in this dissertation are scientific and engineering ones. It would be interesting to discover the impact of the predictors on the performance of commercial applications.
Bibliography
A. Afsahi and N. J. Dimopoulos, "Collective Communications on a Reconfigurable Optical Interconnect", Proceedings of OPODIS'97, International Conference on Principles of Distributed Systems, December 1997, pp. 167-181.

A. Afsahi and N. J. Dimopoulos, "Hiding Communication Latency in Reconfigurable Message-Passing Environments", Proceedings of IPPS/SPDP 1999, 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing, April 1999, pp. 55-60.

A. Afsahi and N. J. Dimopoulos, "Communication Latency Hiding in Reconfigurable Message-Passing Environments: Quantitative Studies", Proceedings of HPCS'99, 13th Annual International Symposium on High Performance Computing Systems and Applications, Kluwer Academic Publishers, June 1999, pp. 111-126.

A. Afsahi and N. J. Dimopoulos, "Efficient Communication Using Message Prediction for Clusters of Multiprocessors", Technical Report ECE-99-5, Department of Electrical and Computer Engineering, University of Victoria, December 1999.

A. Agarwal, R. Bianchini, D. Chaiken, K. L. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie and D. Yeung, "The MIT Alewife Machine: Architecture and Performance", Proceedings of the Annual International Symposium on Computer Architecture, 1998.

A. Alexandrov, M. Ionescu, K. E. Schauser and C. Scheiman, "LogGP: Incorporating Long Messages into the LogP Model - One Step Closer Towards a Realistic Model for Parallel Computation", 7th Annual Symposium on Parallel Algorithms and Architectures (SPAA'95), July 1995.

G. S. Almasi and A. Gottlieb, Highly Parallel Computing, Benjamin/Cummings, 1989.

C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu and W. Zwaenepoel, "TreadMarks: Shared Memory Computing on Networks of Workstations", IEEE Computer, Volume 29, No. 2, February 1996, pp. 18-28.

T. E. Anderson, D. E. Culler, D. A. Patterson, and the NOW team, "A Case for Networks of Workstations: NOW", IEEE Micro, February 1995.

T. E. Anderson, S. S. Owicki, J. B. Saxe, and C. P. Thacker, "High Speed Switch Scheduling for Local Area Networks", International Conference on Architectural Support for Programming Languages and Operating Systems, 1992, pp. 98-110.

D. H. Bailey, T. Harris, W. Saphir, R. V. der Wijngaart, A. Woo and M. Yarrow, "The NAS Parallel Benchmarks 2.0: Report NAS-95-020", NASA Ames Research Center, December 1995.

M. Banikazemi, R. K. Govindaraju, R. Blackmore and D. K. Panda, "Implementing Efficient MPI on LAPI for IBM RS/6000 SP Systems: Experiences and Performance Evaluation", Proceedings of IPPS/SPDP 1999, 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing, April 1999, pp. 183-190.

M. Banikazemi, J. Sampathkumar, S. Prabhu, D. K. Panda, and P. Sadayappan, "Communication Modeling of Heterogeneous Networks of Workstations for Performance Characterization of Collective Operations", Proceedings of the International Workshop on Heterogeneous Computing, in conjunction with IPPS/SPDP '99, April 1999, pp. 123-131.

A. Bar-Noy and S. Kipnis, "Designing Broadcasting Algorithms in the Postal Model for Message-Passing Systems", 4th Annual ACM Symposium on Parallel Algorithms and Architectures, June 1992, pp. 11-22.

A. Basu, M. Welsh, T. V. Eicken, "Incorporating Memory Management into User-Level Network Interfaces", Hot Interconnects V, August 1997.

C. Berge, Hypergraphs, North-Holland, 1989.

P. Berthomé and A. Ferreira, Editors, Optical Interconnections and Parallel Processing: Trends at the Interface, Kluwer Academic Publishers, 1998.

P. Berthomé and A. Ferreira, "Communication Issues in Parallel Systems with Optical Interconnections", International Journal of Foundations of Computer Science, Volume 8, Number 2, June 1997, pp. 143-162.

P. Berthomé and A. Ferreira, "On Broadcasting Schemes in Restricted Optical Passive Star Systems", Interconnection Networks and Mapping and Scheduling Parallel Computations.

D. E. Culler, J. P. Singh and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1999.

D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian and T. von Eicken, "LogP: Towards a Realistic Model of Parallel Computation", 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1993.

F. Dahlgren, M. Dubois and P. Stenström, "Sequential Hardware Prefetching in Shared-Memory Multiprocessors", IEEE Transactions on Parallel and Distributed Systems, 6(7), 1995.

W. J. Dally, J. A. S. Fiske, J. S. Keen, R. A. Lethin, M. D. Noakes, P. R. Nuth, "The Message-Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms", IEEE Micro, April 1992, pp. 23-39.

B. V. Dao, Sudhakar Yalamanchili, and Jose Duato, "Architectural Support for Reducing Communication Overhead in Multiprocessor Interconnection Networks", Proceedings of the Third International Symposium on High Performance Computer Architecture, 1997, pp. 343-352.

Department of Energy Accelerated Strategic Computing Initiative (ASCI) Project, http://www.llnl.gov/asci/.

F. Desprez, A. Ferreira and B. Tourancheau, "Efficient Communication Operations on Passive Optical Star Networks", Proceedings of the First International Conference on Massively Parallel Processing Using Optical Interconnections, 1994, pp. 52-55.

V. Dimakopoulos and N. J. Dimopoulos, "Total Exchange in Cayley Networks", Euro-Par'96 Parallel Processing, Lecture Notes in Computer Science, 1996, pp. 341-346.

V. V. Dimakopoulos and N. J. Dimopoulos, "Communications in Binary Fat Trees", Proceedings of the International Conference on Parallel and Distributed Computing, September 1995, pp. 383-388.

J. J. Dongarra and T. Dunigan, "Message-Passing Performance of Various Computers", Concurrency, Volume 9, No. 10, December 1997, pp. 915-926.

P. W. Dowd, "Wavelength Division Multiple Access Channel Hypercube Processor Interconnection", IEEE Transactions on Computers, Volume 41, October 1992, pp. 1223-1241.

P. Druschel and L. L. Peterson, "Fbufs: A High-Bandwidth Cross-Domain Transfer Facility", Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, 1993, pp. 189-202.

J. Duato, S. Yalamanchili and L. Ni, Interconnection Networks: An Engineering Approach, IEEE Computer Society Press, 1997.

J. Duato, "A Necessary and Sufficient Condition for Deadlock-Free Adaptive Routing in Wormhole Networks", IEEE Transactions on Parallel and Distributed Systems, Volume 6, No. 10, 1995, pp. 1055-1067.

C. Dubnicki, A. Bilas, Y. Chen, S. Damianakis and K. Li, "VMMC-2: Efficient Support for Reliable, Connection-Oriented Communication", Proceedings of Hot Interconnects '97, 1997.

D. Dunning, G. Regnier, G. McAlpine, D. Cameron, B. Shubert, F. Berry, A. M. Merritt, E. Gronke, and C. Dodd, "The Virtual Interface Architecture", IEEE Micro, March-April 1998, pp. 66-76.

L. Fan, M. C. Wu, H. C. Lee and P. Grodzinski, "Optical Interconnection Networks for Massively Parallel Processors Using Beam Steering Vertical Cavity Surface-Emitting Lasers", Proceedings of the Second International Conference on Massively Parallel Processing Using Optical Interconnections, October 1995, pp. 18-34.

M. R. Feldman, S. C. Esener, C. C. Guest and S. H. Lee, "Comparison Between Optical and Electrical Interconnects Based on Power and Speed Considerations", Applied Optics, 27(9), May 1988, pp. 1742-1751.
M. Fillo. S. W. Kcckler. W. J. Dally. N. P. Carter. A. Chang, Y Gurevich and W. S. L w . "The M-Machine Multiomputer". Proceedit~gs gsf the 18îli .-ltlmul IEEE; -4 C.Lf Ittrerrioriorial Siniposiirm on ~Cfi~~~oci~~cliirec~~i~'~~". 1 995.
P. Fraigniaud and E. Lazard. "Methods and Problems of Communication in Cisual Networks". Discivie .-fppfied .Ifuilwnn~ics, Volume 53. 1994. pp. 79- 133.
M. Galles. "Spider: A Hish-Speed Network Interconnect". IEEE Micio. Volume 1 7. No. 1. January/February 1997.
[j5] P. Geofhay, L. Prylli. and B. Tourancheau. "BIP-SMP: High Performance Mes- sage Passing Over a Cluster ofcommodity SMPs". SCYY: High Pei;foinzniice Ner- iiw-kiig a d Cotnpiiritrg Co,? f>r.etlce. Yovember. 1 999.
[56] C. J. Glass and L. M. Ni. "The Tum Mode1 for Adaptive Routing". Aocrrditigs of die i 7th Iirrei-mriotial Simposiiinr oii Cui7ipzi~cv- .-lrrhirrctiiw. 1 992. pp. 275-287.
[57] W. Gropp and E. Lusk. "User's Guide for MPICH. n Portable lmplementation of bl P 1". .-l rgmile .Voriotirii Lciboi-urot?: .Llorlienrnrics arid Conrpiirer Sciorce Dili- siotl, June, 1999.
[58] J. W. Goodman. F. 1. Leonberger. S-Y. Kung and R. A. Athale. "Optical Intercon- nections for VLSI Systems". Pt-uccaiit~gs uf ' lEEE. Volume 72. No. 7. July 198-1.
[FI] G. Grwenstreter and R. G. .Llrlhem. -'Realizing Cornmon Communication Pat- terns in Pntitioned Opticnl Püssivr Stars (POPS) Networks". IEEE Tiuiisri~*tioris oti C'oniprir~w. Valumc 47. No. 9. 1098. pp. 9%- I O 13.
[60] 41. W. Haney and M. P. Christensen. "Fundamental Geomerric Advantages of Free-Spacc Optical I nterconnect". Pt-o~*~'ccii~igs uf'rlze Thir-ci Itirri-riariorrd Cu11 /kt-- erice 011 .Llrissiivii Pot-cil1c.l Ptmwsit~g tisittg Op ticcil Iitfe~ruii~lccriotis. 1996. pp. 16-23.
[6 I ] S. M. Hrdetneimi et al.. 'A Sun7ey of Gossiping and Broadcristiny in Communica- tion Nztw«rks". . V m t A - s . Volume 1 S. 1985. pp. 3 19-3.19.
[62] J . L. Hennrssy and D. A. Patterson. Conipiir~v. .-lrr*lzirecrtirz.: .-I Qiicitirirnriiv .-lpprwtrdi. Morgan Kiiutinann. 1 096.
[64] H. S. Hintcin. T. J. Cluonan. F. B. illcConnick. Jr.. A. L. Lentine and F. A. P. Tooley. "Frce-Space Digital Optical Systems". Pt-ocecdiugs of'lEEE. Spccial Isstie or1 Opricd Conrpiiiirtg Sistrnu. Volume SI . No. 1 1. Nov. 1994. pp. 1632- 1649.
[65] S. Hioki. "Construction of Siaples in Lattice Gauge Theory on a Parallel Com- puter". Parallel Conrpiitiig. Volume 22. '10. 10. October 1996. pp. I 335- 1344.
[66] R. W. Hockney, 'The Communication Challenge for MPP: Intel Paragon and Meiko CS-1". Paralief Compicting. Volume 20. N o . 3. March 1994. pp. 359-398.
[67] R. W. Horst and D. Garcia. "ServerNet SAN IiO Architecture". Proceediirgs qf'rhe Hot Iiitercor~riec~s C,: 1997.
J. Hsu and P. Banjerer. "Pertbrmance Measuremrnt and Trace Dnven Simulation of Parallel CXD and Numencal Applications on a Hypercube Multicomputer". Ptoceeditigs qfrlie I 7th Irriet.tiariotiai $mpositu~i oti Cunipiirw At-cl,irecnrte. 1990. p p . 260-169.
K. Hwang and 2. Nu. Scalablr Parciilel Conrpirtitig: Potulielism. Scalibilih: Pt-o- gr~~nznzcibilih: McGraw-Hill. 1998.
S. L. Johnsson. "Communication in Network .4rchitecturcs". in I'LSI cilid Par-criid Chnipirrririoii. cd. R. Suaya and G. Birtwistle. Morgan Kaufmünn. 1900.
S. L. Johnsson and C.-T. Ho. "Optimum Broadcasting and Personalizrd Communi- cation in H ypercubes". IEEE fimscic*riotis otz Cot~rpirters. Volume C-3 8. Srptem- ber 1980. pp. 1249- 126%
S. Klirlson and M. Brorsson. '*A Comparative Charücterization of Communication Patterns in Applications Using MPi and Shared Mrmory on an IBM SP2". Pt-o- cvétiitlgs of'the Ilbt%-shop ou Cunrrniariccrriorr. .-lrdiirecrirrr. micf .-lppliciitiutis /Or .Vcni~-k-busecl Pnr*ciIlrl Cbnipiiritrg, Itircr-tzr~riot~trl Siwiposiirtn oii Hiy11 Pc);fbr-- riicimv Coi?ipiircv- .-lrrhirt~r-nrr-c~. February I 9'23.
S. Küxiras and I. R. Goodman. "Improving CC-NUMA Pertbrmancr Using Instruction-Bascd Prediçtion". Iriferwrio~ml Siwposiirt~r uri High Pci-futmatrce C ' V I I I P I I I L ~ ~ . . - ~ ~ . L - / I ~ ~ L ? c I I ~ T . I 909.
F. E. Kiarnilrv. "Pertbrmance Corn parison between Optoelectronic and V LS 1 Mul- tistage intc<rconnestion Nrtworks". Jowmd ot'Li3/irti.arc Gcl~tioloy?*. Volume 9. Na. 12. December 19C)l. pp. 1674-1692.
V. Kumar. 4. Grama. A. Gupta and G. Karypis. Irtrt~dlrcrion ro Pai-allei Cornpur- itig: Desigti arid .-fria(rsis of .-l lgot.ir/inrs. The Benjamiru'Cummings Publis hing Company. Inc.. 1 991.
A.-C. Lai and B. FaIsafi. "Memoty Shwing Predictor: The Key to a Speculative Coherent DSM". Proceeciiiigs oj' the 16d1 .-ltiriirai Itirertintiottol Smposiirm oit Conrprr fer- .4rr/zitecrures. 1999. pp. 1 72483.
LAM/MPI Parallel Computing. University of Notre Darne. hnp:/! www.mpi.nd.rdu/larn/.
[79] M. Launa. S. Pakin and A. A. Chien. "Efficient Layering for High Speed Commu- nication: Fast Messages 2 .Y'. Pr-oceeditigs q f ' rhe 7th H i 9 Perjormailce Distrib- iireti Conipirtirig ( H P D C i ) Cotr/l.,-rrlce. 1998.
[YU] C. E. Leiserson. Z. S. Abuharndeh. D. C. Douglas. C. R. Feynman. M. N. Gan- mukhi. J. V. Hill. W. D. Hillis. B. C. Kuszmaut. M. A. St. Pierre. D. S. Wells, M. C. Wong. S-W.lhng and R. Zak. "The Network Architecture of the Connection Mac hine C 41-5". PI-ocretfirrgs q f ' the 4th .-l i t l $mposiirrrt or1 Parïill~~l .-i lgor-itllnis ~ r i i c f -4)-cliirecnrrw. Juns 1 992. pp. 272-255.
[S 1 ] .A. L. Lentinr. K. W. Goosen. J. A. CValker. L. M. F. Chirovsky. L. A. D' Asaro. S. P. Hui. B. J. Tscng. R. E. Lribenguth. J. E. Cunningham. W. Y. Jan. L M . Kuo. D. W. Dülirinper. D. P. Kossives. O. D. Bacon. Ci. Livesue. R. K. Momson. R. .A. 'iovotny. and D. B. Buçhholz. "High-Sprrd Optoelectronic VLSI Switching Chip with 3 4000 Optical PO Based on Flip-chip Bonding of MQW Modulators and Detrctors to Silicon CMOS". IEEE Jorownl u f 'S~-k~te t l Topic~ in Qirtrrrtiinl Elec- rror~ic.s. Volume 2. April. 1996.
[ 8 2 ] K. Li. Y. Pm and S. Q. Zheng. Editors. Por*czllel Conrputiiig CSitlg Opricd ho-- C U I ~ I ~ C J L Y ~ O ~ IS. Kluwer .=\cademic PubIishers. 1 99S.
[X3j .A. Loun and H. K. Sunp. "An Optical blulti-Mesh Hyprrcube: A Scalüble Optical interwnnection Network for .llüssively Parüllel Computing". Jo~rimil of ' Liglir- i t y r i v P~+li~ro/c)~q: Lbluine 1 2. No. 4. 1 994. pp. 704-7 16.
[84] .A. Louri and H. K. Sung. "Scalable Opticül Hypercube-basrd Interconnection Nrt- work fbr Milassivcly Parallcl Cornputing". .-lpplietf Oprics. Volume 33. No. 33. No\.. 1 W4. pp. 7558-7598.
[Ssj D. B. Loveman. "High Performance Fortran". lEEE Par-allel mtd Disrr-ihiired T'cchr~olog.. Volume 1. Febniary 1993. pp. 25-42.
[S6] Luçent's Wavestar LambdaRouter. IEEE. Cumptrrc.,-. lanuary 1000. pp. 26.
[S7] S. S. Lumetta. A. bl. Mainwaring. and D. E. Culler. "Multi-Protocol Active Mes- sages on a Cl uster O f S bl Ps". SC9 7: Higli Peijo~maricc iVentwrking uiiri Conzp~rr- iiig Ci)~?fc'i~rtce. Novcmber. 1 997.
[SSJ K. Mackenzie. J. Kubiatowiz. M. Frank. W. Lee. V. Lee. A. Apanval and M. F. Kaashork. "Exploiting Two-Case Delivery for Fast Protected Messaging". Pro- ceediigs of' the 4th Ittterwnriotral Siwposilrm or, Higl~-Peq?mm-uice Compzrtei- .-lrr-Jtirecrio-e. February 1998.
[S9] P. J. Marchand. A. V. Knshnamooflhy. G. 1. Yayla. S. C. Esener. and U. Etion. "Optically Augnrnted 3-D Cornputer: System Technoloçy and Architecture".
Joirr-ml q#'Pur-del mid Disrribirted Conipiiririg. Specinl lsslre or1 Optical hiferrori- riecrs, Febt-uary 3. 1997. pp. 20-35.
P. K. McKinley and D. F. Robinson. "Collective Communication in Wormhole- Routed Massively Parallel Cornputers". IEEE Cor>tpiirer: December 1995. pp. 39- 50.
P. K. McKinley. H. Xu. A. -H. Esfahanian and L. M. Ni. "Unicast-basrd Multicast Communication in Wormhole-routed Networks". lEEE fi.arisac*tioris ori Par-alle1 r i r d Disaibrrred Si.sterrts. 5( 1 2 ) : 1 252- 1265. Deçcmber 1994.
V. N. b!orozov. H. Temkin and A. S. Fedor. "Analysis of ii Three-Dimensional Computcr Optical Schcme Bascd on Bidirectional Free-Space Opticül Intercon- nccts". Opri~wl Errgiiic~cv-itrg. Volume 34. No. 2 . 1995. pp. 513-534.
T. Ylowry and A. Gupta. "Tolerating Latency Through Sotiware-Controllrd Prefetching in S hared-blemory blui t i proçessors". Jotrr-mil q/'Purallel ciriri Distrib- irtcti Couipictit~g, 132). 199 1 . pp. Y 7- 106.
S. S. Mukherjre and M. D. Hill. "Using Prediction to Accelernte Coherence Proto- cols". l'roccerlirrg.s of' rire 25th .-ltrriiicil I~ice~.rrutio~iuf Simposilrni o/l Co~?rpirrer .-l~z-liir~~crirr.~~. 1 0 W.
S. S. Mukherjee. B. Fiilsafi. M. D. Hill and D. A. Wood. "Coherent Nrtwork Inter- hces for Fine-Grain Communication". Praceetlirrgs o/'rlte 3rlr .-lriri ird Irrterwa- fiorral $~.rnposiiun or1 Compirrer- -4 ~rlzi~ertic~r. 1 996.
'lationnl Coordination OtFice for Cornpirfirrg, hzformczriorr. ancf Comniirnicririoris (NCOiCIC). http:!!www.cçic.gov:.
L. M. Yi. "Should Scalablz Parallel Cornputers Support Efticient Hardware Multi- c.stI?'- . Iirtcrwrrtio~irrl Cor~fL;r~orce on Pot-allel Pr*ocessbig. Workshop. A pnl 1 995.
R. A. Nordin. A. F. Levi. R. N. Nottenburg. J. O'Gorman. T. Tmbun-Ek. and R. A. Logan. "A System Perspective on Digital Interconnection TechnologyT*. [EEE Jow-1102 oj'Lighi*ni*e Tec/zriolog?: Volume 10. Junr 1992. pp. 50 1-827.
N. Nupairoj and L. M. Ni. "Benchmarking of Multicast Communication Services". Ecilr~ical Repori LCISC'CPS---ICS- 103, Lbfic/~igun Srafe ~L',liwrsih: September
[ 1 OS]
[ 1 O')]
S. Pakin. M. Lauria. and A. Chien. "High Performance Mrssaging on Workstation: Illinois Fast Messages (FM) for Myrinet." Pioceetfirgs qf 'the Sirpet-conipiiring '95. Nuv.. 1995.
K. Panajotov. hi. Nieuborg .A. Goulet. 1. Veretenniçoffand H. Thictnpont. **A Free- space Rrcontigurable Optical Interconneçtion based on Polarization-Switching VCSEL's and Polarization-Sctlectiw DiKractive Optical Channels". Pioceetliiigs o f ' [ / I L ) Optics iii Conrpirriitg, 19%. pp. 15 1 - 154.
T. M. Pinkston. "Design Considerations for Optical Interconnects in Parallel Com- put ers". P~-u~wdiiig.s of ' ille First Iitrcremr rior 101 Ikbi-kslt op oit .\ltissiiv(i. Ptri.cille/ P~~occssiiig L Siiig Opticcil Iiztei.coirricc~rs. Apnl 199-1. pp. 306-322.
S. H. Rodrigues. T. E. Anderson and D. E. Cullrr. "High-Perf~mnûnce Local Areü Communication with Fast Sockets". L'SEh'Lri' / 99 7 .-l r i i r i rd TL'cIliiictd COI~/~ ' I 'C ' I IL '~ . January 1997.
M. F. Siikr. S. P. Levitan. D. M. Chianilli. B. G. Home. and C. L. Giies."Predicting Multiprocctssur Merno- .4çcrss Patterns with Learning htodrls". Pia~*eecli>tgs of' rite fiiri-rcmlrli Iri~~-~itltiuitcd Co~l/~iri ice oii .\luc/liiie L~.mriitg.*' 1997. pp. 305- 312.
S. R. Sridel. "Circuit Switçhed vs. Store-and-Fonvard Solutions to S~mmrîric Communication Problrms". hocw.ciigs qf' rlie 4th Coizfiwm-e oii &perriibe C'ontpiir~w riittf Coiiciirre~it .-lpplict~tioits. 1 98 9. pp. 3 3 - 3 5 .
Ci. Shah. J . Nieplocha. J. Mina and C. Kim. R. Harrison. R. K. Govindaraju. K. Gildea. P. DiNicola. and C. Bender. "Performance and Experiençe with LAPI -- a New High-Performance Communication Library for the 1 B M RS/6000 SP". Firsr .Lfci& *ntposiirm IPPSSPDP 1998 l lrh Iim~watioital Pordel Piacessiilg -ni- pusilun d 9th .$wiposirrm oii Pardiel aiid Disri-ibirted Pioccssi~ig. ! 998.
R. S heifert. "Gigabit Ethemrt". .-lddisoti- GVeslqi: 1 998.
M. Snir and P. Hochschild. "The Communication Sofiware and Parallel Environ-
ment of the IBM S PT"' IBM reins Jotri-wl. 34(1):205-22 1 . 1995.
C. B. Stunkel. D. G. Shea. B. Abali. M. G. Atkins. C. A. Bender. D. G. Gnce. P. H. Hochschild. D. J. Joseph. B. J. Nathanson. R. .A. Swetz. R. F. Stucke. M. Tsao. and P. R. Varker. "The SP? High-Performance Switch". IBM Systerns Journal. 34(2): 185-204, 1905.
H. Sullivan and T. R. Büshkow. " A Larse Ssale. Homogenrous. Fully Distributed Parallcl Machine". hoceedhgs ofdie 4rh .4riiiiial $mposiirni 011 Co»~ppirrei. .-klii- recrirw. Volume 5. March 1977. pp. 105- 124.
V. S. Sunderam. -'PVM: A Framework for Prirallçl Distributed Computing". Coiz- ~~11171(1c~: Pi-cicrice oiid Espericricv. Voluinc 34). Deceinber 1 990. pp. 3 1 5-3 39.
T. Szymanski. "Hppenneshrs: Opticai Interconneçtion Networks h r Parallttl Com- putins". Jotiixal of'Pmnlie1 mti Disn-ibirrd Conrpirriiig. 16. 1995. pp . 1-35.
Y. Tanaka, M. btatsuda. M. Ando, K. Kubota and M. Sato. "COMPaS: A Pentium Pro PC-based S M P Cluster and its Experience". Piocwriiirgs of'rAe PC-iVO?ÇVY4: Iiirci-iidoiznl Ilbrksliop 0 1 1 Pc.rsoizrrl Conipirrw brised .Ventvol;ks Of' IVOI-kstcrrioiis. iii corljirrrcrioii with PPSLSPDP'YIY. 199s.
R. Thakur and A. Choudhary bgAll-to-all Communication on Meshes with Wom- ho le Routin y ". Pi~ocwtfiugs of [lie I Y Y4 l~zrt.>-iicrriomd Pai*a/ld hoc~essing S l nzpo- siirni. 1994. pp. 56 1-565.
W. Tezuka. F. O'Carroll, A, Hori. and Y. Ishikawa. "Pin-tiown Cache: A Virtual blemory Management Technique for Zero-copy Communication". Firsr Mer& Sinlposi~im PPSiSPDP 1998 12th Intei-nafioiial Parallei Processing spposilrm d; 9rh Simposiirm on Paralle1 und Distrib~ited Piocessiig 1998.
K. Thulasiraman and M. N. S. Swamy. Graphs: Tlieory and Algorithms. John Wiley. 1992.
G. Tricoles. Tomputer Generated Holograms: A Historical review". .-lpp(ied
Optics, Specinl Issue oii Conipiitei Geirerored Hologranis. Volume 26. No. 70. 1987. pp. 435 1-4360.
[ 1 3 1 T. Von Eicken. D. E. Culler. S. C. Goldstein. and K. E. Schauser. "Active Mes- sages: X Mechanisrn for Inte,~~ated Communication and Computation". Proceed- i~igs q#'t/le 19th .-!iiriirczl ln~ci~~rntiot~al $rnposiirn~ 011 Conlpirrw .-lirhirectir~r. May 1992. pp. 256-265.
[176] T. Von Eickcn. A. Basu. V. Buch and W. Vogels. "U-Net: A User-Lrvel Nenvork Interthce for Parallel and Distributcd Computing". Procredings of the 15th .ACM Symposium on Operating Systems Principles. Dçcrmber. 1905.
[ 1771 D. S. Wills. W. S. Lÿcy. and J. Cruz-Riverri. "The Otfset Cube: .4n Optoelrçtroniç Interconnection Yetwork". in K. Bolding and L. Syndrr (ED.) Parallel Cornputer Rouitng and Communication. Springer-Verlag. LNCS 353. pp. 56-1 00. 1994.
[ 1 IS 1 P. H. Worley and I. T. Foster. "ParriIlel Spectral Transform Shallow Watrr Modei: .-\ Runtims-tunable parallel benchmark code". P~ûceeditigs of'the Siwlcihl~j High P L ' > - # U I W I ~ C L ~ Co~~zptiINzg C~II#EIPIICL>. 1 pp. 207-2 1-1.
[ 1 T. Ylitüyai. "Optical Cumputing and Interconncct". A-oc~criitg.~ qfIEEE. Volume 84. No. 6. June 1996. pp. YX-852.
[131] Ci. 1. lhylri. P. J. Marchand. and S. C. Esener. "Speed and Enrrgy hnaiysis of Diy- itül Interconnections: Cornparison of On-chip. OR-chip and Free-Space Teclindo- gicis". .-lpplicd Optics. Volume 37. No. 2. Janunry 1998. pp. 205-227.
[ l3 l ] T-Y l'eh and Y. Patt. "*Alternative lmplcmentation of Two-Levei Adaptive Brançh Prediçtion". P~aceedirigs O/' flic /9tli .-lr~i~ual Iitre~-~iotiotml Siviposirrm on Com- puter .-lr.cltirecrir~r. 1992. pp. 124- 134.
(1321 N. Yuan. R. Melhem and R. Gupta. "Compiled Communication for All-Optical TDM Networks". Piwcceditlgs ofrlre S~rpemmprizi~ag '96. 1996.
[l33] Z. Zhang and J. Torrellas. "Speeding Up Irregular Applications in Shared-~Memory Multiprocessors: Mrmory Binding and Group Prefetching". Procecrii,igs of' the 3 r d .-ln~ilinl Sinrposiirni otr Compirrer .4rcl1iternrre. 1 995. pp. 1 88- 1 99.
Therefore, the pure computation time is equal to t4 - t1 - ((t4 - t3) + (t2 - t1)). To compute the pure inter-send computation times, I need to know the exact times before and after each MPI call. For these, I did not insert the MPI_Wtime call in the source codes of the applications, but instead I wrote my own profiling codes to gather the timing traces. Thus, each MPI call in the applications calls its own profiling code, as shown in the following example for MPI_Send.
The MPI_Wtime calls give the times t2 and t3, before and after the profiling call, PMPI_Send, respectively, while what I really need are the times t1 and t4. It is clear that there are overheads entering and exiting the profiling code, in addition to the overhead of instructions i and ii. I computed these extra overheads for each type of MPI call used in the applications and subtracted them out to find the pure inter-send computation times.