National Library of Canada / Bibliothèque nationale du Canada
Acquisitions and Bibliographic Services / Acquisitions et services bibliographiques
395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Supervisor: Dr. Nikitas J. Dimopoulos
Abstract

With the availability of fast microprocessors and small-scale multiprocessors, inter-node communication has become an increasingly important factor that limits the performance of parallel computers. Essentially, message-passing parallel computers require extremely short communication latency so that message transmissions have minimal impact on the overall computation time. This thesis concentrates on issues regarding hardware communication latency in single-hop reconfigurable networks, and software communication latency regardless of the type of network.
The first contribution of this thesis is the design and evaluation of two different categories of prediction techniques for message-passing systems. This thesis utilizes the communications locality property of message-passing parallel applications to devise a number of heuristics that can be used to predict the target of subsequent communication requests, and to predict the next consumable message at the receiving ends of communications.
Specifically, I propose two sets of predictors: cycle-based predictors, which are purely dynamic predictors, and tag-based predictors, which are static/dynamic predictors. The proposed predictors, especially Better-cycle2 and Tag-bettercycle2, perform very well on the application benchmarks studied in this thesis. They could be easily implemented on the network interface due to their simple algorithms and low memory requirements.
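The flavour of the purely dynamic predictors can be conveyed with a small sketch. The following Python fragment is an illustrative simplification, not one of the exact algorithms evaluated in Chapter 3, and all names in it are mine: it records the destinations seen between two occurrences of a "cycle head" and predicts that the same cycle of destinations will repeat.

```python
class SingleCyclePredictor:
    """Illustrative cycle-style message-destination predictor.

    A simplified sketch of the idea only: record the sequence of
    destinations between two occurrences of a 'cycle head' and
    assume the cycle repeats on subsequent sends.
    """

    def __init__(self):
        self.head = None   # first destination seen: the cycle head
        self.cycle = []    # destinations recorded in the last full cycle
        self.forming = []  # cycle currently being recorded
        self.pos = 0       # replay position within self.cycle

    def predict(self):
        """Predicted destination of the next send (None if unknown)."""
        if not self.cycle:
            return None
        if self.pos < len(self.cycle):
            return self.cycle[self.pos]
        return self.head   # end of the cycle: expect the head again

    def update(self, dest):
        """Feed the destination of the send that actually occurred."""
        if self.head is None:
            self.head = dest
        elif dest == self.head:   # cycle closed: commit and restart
            self.cycle = self.forming
            self.forming = []
            self.pos = 0
        else:
            self.forming.append(dest)
            self.pos += 1


def hit_rate(trace):
    """Fraction of sends whose destination was predicted correctly."""
    predictor, hits = SingleCyclePredictor(), 0
    for dest in trace:
        hits += (predictor.predict() == dest)
        predictor.update(dest)
    return hits / len(trace)
```

On a perfectly cyclic trace such a predictor misses only during the first cycle and then hits on every send, which is the behaviour the communications locality property suggests for the benchmark applications.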
As the second contribution of this thesis, I show that the majority of reconfiguration delays in single-hop reconfigurable networks can be hidden by using one of the proposed high hit-ratio predictors. The proposed predictors can be used to establish a communication pathway between a source and a destination in such networks before this pathway is to be used.
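A back-of-the-envelope model shows why a correct prediction hides reconfiguration delay: if pathway setup starts as soon as the destination is predicted, it overlaps the computation interval that precedes the send, and only the part of the delay that does not fit in that interval remains exposed. The model and the names below are illustrative, not the thesis's own formulation.

```python
def residual_reconfig_time(gaps, d, predicted):
    """Toy model of reconfiguration-delay hiding.

    gaps:      inter-send computation times preceding each send
    d:         reconfiguration delay of the interconnect
    predicted: per-send flags, True when the destination was
               predicted correctly (so setup could start early)

    For a correct prediction, min(gap, d) of the delay overlaps
    computation and is hidden; a misprediction exposes the full d.
    Returns the total exposed reconfiguration time.
    """
    total = 0.0
    for gap, hit in zip(gaps, predicted):
        total += d - min(gap, d) if hit else d
    return total
```

For example, with d = 25 microseconds, a correctly predicted send preceded by a 40-microsecond computation gap exposes no reconfiguration time at all, while a mispredicted send always exposes the full 25 microseconds.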
This thesis' third contribution is the analysis of a broadcasting algorithm that utilizes latency hiding and reconfiguration in the network to speed up the broadcasting operation. The analysis yields closed formulations for the termination time of the algorithms.
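As a toy version of such an analysis (not the thesis's closed forms): if every informed node can forward the message to k new nodes per step, the number of informed nodes grows by a factor of k + 1 each step, so a greedy broadcast terminates in ceil(log base (k+1) of N) steps; charging each step one reconfiguration delay d plus one transmission time gives a simple termination-time estimate.

```python
def greedy_broadcast_rounds(n, k):
    """Rounds for a greedy broadcast in which every informed node
    informs k new nodes per round, so the informed set grows by a
    factor of (k + 1) each round.  Illustrative model only; the
    thesis derives exact closed-form termination times."""
    rounds, informed = 0, 1
    while informed < n:
        informed *= (k + 1)
        rounds += 1
    return rounds  # equals ceil(log_{k+1}(n))


def termination_time(n, k, d, t_m):
    """Toy termination time: each round pays one reconfiguration
    delay d plus one message transmission time t_m."""
    return greedy_broadcast_rounds(n, k) * (d + t_m)
```

Under single-port modeling (k = 1) this reduces to the familiar binomial-tree broadcast, which reaches 8 nodes in 3 rounds.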
Table of Contents

Abstract
Table of Contents
List of Figures
List of Tables
Trademarks
Glossary
Acknowledgments

Chapter 1  Introduction
  1.1  Communications Locality and Prediction Techniques
  1.2  Using the Proposed Predictors at the Send Side
  1.3  Redundant Message Copying in Software Messaging Layers
  1.4  Collective Communications
  1.5  Thesis Contributions
Chapter 2  Application Benchmarks and Experimental Methodology
  2.1  Parallel Benchmarks
    2.1.1  The NAS Parallel Benchmarks Suite
      2.1.1.1  CG
      2.1.1.2  MG
      2.1.1.3  LU
      2.1.1.4  BT and SP
    2.1.2  PSTSWM
    2.1.3  QCDMPI
  2.2  Applications' Communication Primitives
    2.2.1  MPI_Send
    2.2.2  MPI_Isend
    2.2.3  MPI_Sendrecv_replace
    2.2.4  MPI_Recv
    2.2.5  MPI_Irecv
    2.2.6  MPI_Wait
    2.2.7  MPI_Waitall
  2.3  Experimental Methodology
Chapter 3  Design and Evaluation of Latency Hiding/Reduction Message Destination Predictors
  3.1  Introduction
    3.1.1  Message Switching Layers
    3.1.2  Reconfigurable Optical Networks
      3.1.2.1  Communication Modeling
  3.2  Communication Frequency and Message Destination Distribution
  3.3  Communication Locality and Caching
    3.3.1  The LRU, FIFO and LFU Heuristics
  3.4  Message Destination Predictors
    3.4.1  The Single-cycle Predictor
    3.4.2  The Single-cycle2 Predictor
    3.4.3  The Better-cycle and Better-cycle2 Predictors
    3.4.4  The Tagging Predictor
    3.4.5  The Tag-cycle and Tag-cycle2 Predictors
    3.4.6  The Tag-bettercycle and Tag-bettercycle2 Predictors
  3.5  Predictors' Comparison
    3.5.1  Predictors' Memory Requirements
  3.6  Using Message Predictors
  3.7  Summary
Chapter 4  Reconfiguration Time Enhancements Using Predictors
  4.1  Distribution of Message Sizes
  4.2  Inter-send Computation Times
  4.3  Total Reconfiguration Time Enhancement
  4.4  Predictors' Effect on the Receive Side
  4.5  Summary
Chapter 5  Collective Communications on a Reconfigurable Interconnection Network
  5.1  Introduction
  5.2  Communication Modeling for Broadcasting/Multi-broadcasting
  5.3  Broadcasting and Multi-broadcasting
    5.3.1  Broadcasting
      5.3.1.1  Analysis of the Greedy Algorithm
      5.3.1.2  Grouping Schema
    5.3.2  Multi-broadcasting
  5.4  Communication Modeling for Other Collective Communications
  5.5  Scattering
  5.6  Multinode Broadcasting
  5.7  Total Exchange
  5.8  Summary
Chapter 6  Efficient Communication Using Message Prediction for Clusters of Multiprocessors
  6.1  Introduction
  6.2  Motivation and Related Work
  6.3  Using Message Predictions
  6.4  Experimental Methodology
  6.5  Receiver-side Locality Estimation
    6.5.1  Communication Locality
    6.5.2  The LRU, FIFO and LFU Heuristics
  6.6  Message Predictors
    6.6.1  The Tagging Predictor
    6.6.2  The Single-cycle Predictor
    6.6.3  The Tag-cycle2 Predictor
    6.6.4  The Tag-bettercycle2 Predictor
  6.7  Message Predictors' Comparison
    6.7.1  Predictors' Memory Requirements
  6.8  Summary

Chapter 7  Conclusions and Directions for Future Research
  7.1  Future Research

Bibliography

Appendix A  Removing Timing Disturbances
List of Figures
Figure 1.1: A generic parallel computer
Figure 3.1: RON(k, N), a massively parallel computer interconnected by a complete free-space optical interconnection network
Figure 3.2: Number of send calls per process in the applications under different system sizes
Figure 3.3: Number of message destinations per process in the applications under different system sizes
Figure 3.4: Distribution of message destinations in the applications when N = 64
Figure 3.5: Distribution of message destinations in the applications for process zero, when N = 64
Figure 3.6: Comparison of the LRU, FIFO, and LFU heuristics when N = 64
Figure 3.7: Effects of the scalability of the LRU, FIFO, and LFU heuristics on the BT, SP and CG applications
Figure 3.8: Effects of the scalability of the LRU, FIFO, and LFU heuristics on the MG and LU applications
Figure 3.9: Effects of the scalability of the LRU, FIFO, and LFU heuristics on the PSTSWM and QCDMPI applications
Figure 3.10: Operation of the Single-cycle predictor on a sample request sequence
Figure 3.11: Effect of the Single-cycle predictor on the applications
Figure 3.12: Comparison of the performance of the Single-cycle predictor with the LRU, LFU, and FIFO heuristics on the applications under single-port modeling when N = 64
Figure 3.13: Operation of the Single-cycle2 predictor on the sample request sequence
Figure 3.14: Effect of the Single-cycle2 predictor on the applications
Figure 3.15: State diagram of the Better-cycle predictor
Figure 3.16: Operation of the Better-cycle predictor on the sample request sequence
Figure 3.17: Effect of the Better-cycle predictor on the applications
Figure 3.18: Operation of the Better-cycle2 predictor on the sample request sequence
Figure 3.19: Effect of the Better-cycle2 predictor on the applications
Figure 3.20: Effects of the Tagging predictor on the applications
Figure 3.21: Effects of the Tag-cycle predictor on the applications
Figure 3.22: Effects of the Tag-cycle2 predictor on the applications
Figure 3.23: Effects of the Tag-bettercycle predictor on the applications
Figure 3.24: Effects of the Tag-bettercycle2 predictor on the applications
Figure 3.25: Comparison of the performance of the predictors proposed in this chapter when the number of processes is 64, 32 (36 for BT and SP), and 16
Figure 4.1: Distribution of message sizes of the applications when N = 4
Figure 4.2: Distribution of message sizes of the applications when N = 9 for BT and SP, and 8 for CG, MG, LU, PSTSWM, and QCDMPI
Figure 4.3: Distribution of message sizes of the applications when N = 16
Figure 4.4: Distribution of message sizes of the BT, SP, PSTSWM, and QCDMPI applications when N = 25
Figure 4.5: Cumulative distribution function of the inter-send computation times for node zero of the application benchmarks when the number of processors is 16 for CG, MG, and LU, and 25 for BT, SP, QCDMPI, and PSTSWM
Figure 4.6: Percentage of the inter-send computation times for different benchmarks that are more than 5, 10, and 25 microseconds when N = 4, 8 or 9, 16, and 25
Figure 4.7: Different scenarios for message transmission in a multicomputer with a reconfigurable optical interconnect: (a) when the message-transfer delay is less than the inter-send time, and the available time is larger than the reconfiguration delay; (b) when the message-transfer delay is less than the inter-send time, and the available time is less than the reconfiguration delay; (c) when the message-transfer delay is larger than the inter-send time
Figure 4.8: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 4 nodes (shorter bars are better)
Figure 4.9: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 9 nodes for BT and SP, 8 nodes for the other applications (shorter bars are better)
Figure 4.10: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 16 nodes (shorter bars are better)
Figure 4.11: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 25 nodes (shorter bars are better)
Figure 4.12: Summary of the average ratio of the total reconfiguration time after hiding over the total original reconfiguration time with the current generation and a 10 times faster CPU when applying the Tag-bettercycle2 predictor on the benchmarks with d = 25 microseconds; A class for NPB, and under different system sizes
Figure 4.13: Average percentage of the times the receive calls are issued before the corresponding send calls
Figure 5.1: Some collective communication operations
Figure 5.2: Latency hiding broadcasting algorithm for RON(k, N), N = 4, k = 2, d = 1
Figure 5.3: First and second generation trees. The numbers underneath each tree denote the number of trees having the same height. These trees are rooted at nodes that were at the same level in the first generation tree
Figure 5.4: Sequential tree algorithm
Figure 5.5: Spanning binomial tree algorithm
Figure 5.6: Multinode broadcasting on an 8-node RON(k, N) under single-port modeling
Figure 5.7: Multinode broadcasting on an 8-node RON(k, N) under 2-port modeling
Figure 5.8: Total exchange on an 8-node RON(k, N) under single-port modeling
Figure 5.9: Total exchange on an 8-node RON(k, N) under 2-port modeling
Figure 6.1: Data transfers in a traditional messaging layer
Figure 6.2: Number of receive calls in the applications under different system sizes
Figure 6.3: Number of unique message identifiers in the applications under different system sizes
Figure 6.4: Distribution of the unique message identifiers for process zero in the applications
Figure 6.5: Effects of the LRU, FIFO, and LFU heuristics on the applications
Figure 6.6: Effects of the Tagging predictor on the applications
Figure 6.7: Effects of the Single-cycle predictor on the applications
Figure 6.8: Effects of the Tag-cycle2 predictor on the applications
Figure 6.9: Effects of the Tag-bettercycle2 predictor on the applications
Figure 6.10: Comparison of the performance of the predictors on the applications
List of Tables
Table 3.1: Memory requirements (in bytes) of the predictors when N = 64
Table 4.1: Minimum inter-send computation times (in microseconds) in the NAS Parallel Benchmarks, PSTSWM, and QCDMPI when N = 4, 8, 9, 16, and 25
Table 4.2: Communication to computation ratio of the applications
Table 5.1: Broadcasting time, k = 2, d = 1
Table 5.2: Broadcasting time, k = 4, d = 3
Table 5.3: Broadcasting time, d = 3
Table 5.4: Multi-broadcasting time, k = 4, d = 3, M = 10
Table 5.5: Total exchange time, N = 1024, single-port
Table 5.6: Total exchange time, N = 1024, k = 3
Table 6.1: Memory requirements (in 6-tuple sets) for the predictors when N = 64 for CG, and N = 49 for BT, SP, and PSTSWM
Trademarks
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Trademarks and registered trademarks used in this work, where the author was aware of them, are listed below. All other trademarks are the property of their respective owners.

IBM SP2 is a registered trademark of International Business Machines Corp.

IBM P2SC CPU is a registered trademark of International Business Machines Corp.

IBM Vulcan Switch is a registered trademark of International Business Machines Corp.

Myrinet is a registered trademark of Myricom.

ServerNet is a registered trademark of the Tandem Division of Compaq.

SGI Origin 2000 is a registered trademark of Silicon Graphics, Inc.

SGI Spider Switch is a registered trademark of Silicon Graphics, Inc.

WaveStar LambdaRouter is a registered trademark of Lucent Technologies.
Glossary
AM: Active Messages
ASCI: Accelerated Strategic Computing Initiative program
BIP: Basic Interface for Parallelism
BT: Block Tridiagonal Application Benchmark
CA: Communication Assist
CDF: Cumulative Distribution Function
CGH: Computer Generated Holograms
CIC: Computing, Information and Communications Project
CG: Conjugate Gradient Application Benchmark
CLUMP: Cluster of Multiprocessors
COW: Cluster of Workstations
DM: Deformable Mirrors
DSM: Distributed Shared-Memory Multiprocessor
EP: Embarrassingly Parallel Application Benchmark
FIFO: First-in-first-out
FM: Fast Messages
FT: 3-D Fast-Fourier Transform Application Benchmark
HPF: High Performance Fortran
IS: Integer Sort Application Benchmark
LAM/MPI: Local Area Multicomputer/Message Passing Interface
LAN: Local Area Networks
LAPI: Low-level Application Programmers Interface
LIFO: Last-in-first-out
LRU: Least Recently Used
LU: Lower-Upper Diagonal Application Benchmark
MG: Multigrid Application Benchmark
MIMD: Multiple Instructions Multiple Data
MPI: Message Passing Interface
MPICH: A Portable Implementation of MPI
MPP: Massively Parallel Processors systems
NI: Network Interface
NOW: Networks of Workstations
NPB: NAS Parallel Benchmarks
ORPC(k): Optically Reconfigurable Parallel Computer
OPS: Optical Passive Stars
P2SC: Power2 Super Chip Microprocessor
POPS: Partitioned Optical Passive Stars
PM: A High-Performance Communication Library
PSTSWM: Parallel Spectral Transform Shallow Water Model
PVM: Parallel Virtual Machine
QCDMPI: Quantum Chromodynamics with Message Passing Interface
RON(k, N): Reconfigurable Optical Network
RMA: Remote Memory Access
SAN: System Area Networks
SEED: Self Electro-optic Effect Device
SHRIMP: Scalable High-Performance Really Inexpensive Multiprocessor
SP: Scalar Pentadiagonal Application Benchmark
SPMD: Single Program Multiple Data
TLB: Translation Lookaside Buffer
U-Net: A User-Level Network Interface Architecture
VCC: Virtual Circuit Caching
VCSEL: Vertical Cavity Surface Emitting Laser
VIA: Virtual Interface Architecture
VMMC: Virtual Memory-Mapped Communications
Acknowledgments
I would like to express my deepest appreciation to my supervisor, Dr. Nikitas J. Dimopoulos, for his thoughtful suggestions that shaped and improved my ideas. I am very grateful to Nikitas for providing me with his valuable guidance, encouragement, support, criticism, patience, and kindness from the first day I came to Victoria.

I would like to thank the members of my dissertation committee. I wish to thank Dr. Kin F. Li, Dr. Vijay K. Bhargava, and Dr. D. Michael Miller for their support and suggestions. I am very grateful to Dr. José Duato for his kind acceptance to be the external examiner of this dissertation, and for his brilliant suggestions.

I am greatly indebted to my wife, Azita Gerami, for her continuous support and encouragement. Without her understanding, I would not have finished my dissertation. I would like to express my gratitude to my parents, who always encouraged me to pursue a Ph.D.

I want to thank all my friends and graduate fellows, especially the fellow researchers at LAPIS, including André Schuorl, Nicolaos P. Kourounakis, Shahadat Khan, Mohamed Watheq El-Kharashi, Stephen W. Neville, Rafael Parra Hernandez, Caedmon Somers, Ion Kanie, and Eric Lasdal, who have made my stay so much fun.

I would like to thank the department's system and office staff for their continuous cooperation. I am thankful to Vicky Smith, Lynne Barrett, Maureen Denning, and Moneca Bracken.

Special thanks to Dr. Murray Campbell at the IBM T. J. Watson Research Center for his kind cooperation and help in accessing the IBM Deep Blue, and the staff of the computer center at the University of Victoria for access to the University IBM SP2.

My dissertation research was supported by grants from the Natural Sciences and Engineering Research Council (NSERC) of Canada, and the University of Victoria.
Chapter 1
Introduction
Research in the area of advanced computer architecture has been primarily focused on how to improve the performance of computers in order to solve computationally intensive problems [61, 62, 60]. Some of these problems are called grand challenges. A grand challenge is a fundamental problem in science or engineering that has a broad economic and/or scientific impact: coupled fields; geophysical and astrophysical fluid dynamics (GAFD) turbulence; modeling the global climate system; formation of the large-scale universe; global optimization algorithms for macromolecular modeling; petroleum exploration; aerodynamic simulations; and ocean circulation are just a few to mention.
The performance of processors is doubling every eighteen months [9]. However, there is always a demand for more computing power. To solve grand challenge problems, computer systems at the teraflop (10^12 floating-point operations per second) and petaflop (10^15 floating-point operations per second) performance levels are needed.
Processors are becoming very complex, and only a few companies are designing new processors. Therefore, it is not cost-effective to build high-performance computers just by using custom-designed high-performance processors. The trend is to design parallel computers using commodity processors to achieve teraflop and petaflop performance. For instance, two major projects to develop high-performance supercomputers in the USA are the federal program in Computing, Information and Communications (CIC) at the national coordination office [98], and the Department of Energy Accelerated Strategic Computing Initiative (ASCI) program, including the Intel/Sandia Option Red, the IBM/Lawrence Livermore National Laboratory Blue Pacific, and the SGI/Los Alamos National Laboratory Blue Mountain [59].

This should not give us the wrong impression that such high-performance computers, often called Massively Parallel Processor (MPP) systems, are only used for grand challenges and parallel scientific applications. Even for applications requiring lower computing power, parallel computing is a cost-effective solution. These days, many high-performance parallel computing systems are being used in network and commercial applications such as data warehousing, Internet servers, and digital libraries.
Parallel processing is at the heart of such powerful computers. Although parallelism appears at different levels in a single-processor system, such as lookahead, pipelining, superscalarity, speculative execution, vectorization, interleaving, overlapping, multiplicity, time sharing, multitasking, multiprogramming, and multithreading, it is parallel processing and parallel computing among different processors which brings us such levels of performance.
Basically, a parallel computer is a "collection of processing elements that communicate and cooperate to solve large problems fast" [9]. In other words, a parallel computer, whether message-passing or distributed shared-memory (DSM), is a collection of complete computers, including processor and memory, that communicate through a general-purpose, high-performance, scalable interconnection network using a communication assist (CA) and/or a network interface (NI) [8], as shown in Figure 1.1.

[Figure 1.1: A generic parallel computer. Each node contains a processor (P), cache, memory, and a network interface, and the nodes attach to an interconnection network.]
Message-passing multicomputers, among all known parallel architectures, are the best suited to achieve such computing performance levels. Message-passing multicomputers are characterized by the distribution of memory among a number of computing nodes that communicate with each other by exchanging messages through their interconnection networks. Each node has its own processor, local memory, and communication assist/network interface. All local memories are private and are accessible only by the local processors. The wide acceptance of message-passing multiprocessor systems has been proven by the introduction of the Message Passing Interface (MPI) standard [91, 93]. Currently, in addition to vendor implementations of MPI on commercial machines, there are many freely available MPI implementations, including MPICH [37] and LAM/MPI [75].
Recently, Networks of Workstations (NOW) [11], Clusters of Workstations (COW), and Clusters of Multiprocessors (CLUMP) [97] have been proposed to build inexpensive parallel computers, however, often at a lower performance level compared to MPP systems. The development of high-performance switches, specially designed for building cost-effective interconnects and known as System Area Networks (SAN) [73, 67, 113, 54], has motivated the suitability of networks of workstations/multiprocessors as an inexpensive high-performance computing platform. System area networks such as the Myricom Myrinet [23], the IBM Vulcan switch in the IBM SP2 machine [113], the Tandem ServerNet [67], and the Spider switch in the SGI Origin 2000 machine [85], are a new generation of networks that falls between memory buses and commercial local area networks (LANs).
Parallel processing, whether MPP, DSM, NOW, COW, or CLUMP, puts tremendous pressure on the interconnection networks and the memory hierarchy subsystems. As communication overhead is one of the most important factors affecting the performance of parallel computers [76, 69, 43], there has been a growing interest in the design of interconnection networks. In this respect, various types of interconnection networks, such as complete networks, hypercubes, meshes, rings, tori, irregular switch-based networks, stack-graphs, and hypermeshes, have been proposed and some of them have been implemented [46, 134, 108]. Meanwhile, many routing algorithms [47, 56, 12] have been proposed for such networks.
In parallel processing systems, the ability to efficiently communicate and share data between processors is critical to obtaining high performance. In essence, parallel computers require extremely short communication latencies such that network transactions have minimal impact on the overall computation time. Communication hardware latency, communication software latency, and the user environment (multiprogramming, multiuser) are the major factors affecting the performance of parallel computer systems. This thesis concentrates on issues regarding hardware communication latency in electronic networks and reconfigurable optical networks, and software communication latency (regardless of the type of network).
In this thesis, I propose a number of techniques to achieve efficient communications in message-passing systems. This thesis makes five contributions:

- The first contribution of this thesis (Chapter 3) is the design and evaluation of two different categories of prediction techniques for message-passing systems. Specifically, I use these predictors to predict the target of communication messages in parallel applications.
- As the second contribution of this thesis (Chapter 4), I show that the majority of reconfiguration delays in reconfigurable networks can be hidden by using one of the high hit ratio predictors proposed in Chapter 3.
- The third contribution of this thesis (Chapter 5) is the analysis of a latency-hiding broadcasting algorithm on single-hop reconfigurable networks under single-port and k-port modeling, which brings up closed formulations that yield the termination time.
- As the fourth contribution of this thesis (Chapter 5), I propose a new total exchange algorithm in single-hop reconfigurable networks under single-port and k-port modeling.
- Finally, the fifth contribution (Chapter 6) is the use and evaluation of the predictors proposed in Chapter 3 to predict the next consumable message at the receiving ends of message-passing systems (regardless of the type of network). I argue that these message predictors can be efficiently used to drain the network and cache the incoming messages even if the corresponding receive calls have not been posted yet.
Chapter 2 introduces the parallel applications used in this thesis. Chapter 7 concludes this dissertation and gives directions for future research. Appendix A describes how timing disturbances have been removed from the timing profiles of the parallel applications used in this thesis.

The rest of this chapter is organized as follows. In Section 1.1, I explain the communication locality in message-passing parallel applications and discuss different latency hiding techniques for parallel computer systems. In Section 1.2, I discuss the advantages of using prediction techniques at the send side of communications in the reconfigurable optical interconnection networks, and in the circuit-switched and wormhole routing electronic interconnection networks. In Section 1.3, I describe the issues related to the messaging layer and software communication overhead in message-passing systems, and how prediction can help eliminate redundant message copying operations. I give an introduction to the issues regarding collective communications in Section 1.4. Finally, I summarize the contributions of this thesis in Section 1.5.
1.1 Communications Locality and Prediction Techniques
In this thesis, I am interested in the message-passing model of parallelism, as message-passing parallel computers scale much better than shared-memory parallel computers. Communication properties of message-passing parallel applications can be categorized by the spatial, temporal, and volume attributes of the communications [30, 75, 65]. The temporal attribute of communications in parallel applications characterizes the rate of message generation and the rate of computations in the applications. The volume of communications is characterized by the number of messages and the distribution of message sizes in the applications.
The spatial attribute of communications in parallel applications is characterized by the distribution of message destinations. Point-to-point communication patterns may be repetitive in message-passing applications, as most parallel algorithms consist of a number of computation and communication phases. Several researchers have worked to find or use the communications locality properties of parallel applications [30, 75, 65, 36, 37].
By message destination communication locality I mean that if a certain source-destination pair has been used, it will be re-used with high probability by a portion of code that is "near" the place that was used earlier, and that it will be re-used in the near future. By message reception communication locality I mean that if a certain message reception call has been used, it will be re-used with high probability by a portion of code that is "near" the place that was used earlier, and that it will be re-used in the near future.
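One way to make the destination-locality notion concrete is to replay a send trace against a small LRU set of recently used destinations and measure the hit ratio. This is an illustrative sketch: the trace and cache size below are made up for the example, not taken from the thesis data.

```python
# Hypothetical sketch: measuring message-destination locality by replaying a
# destination trace against an LRU "cache" of recently used destinations.
from collections import OrderedDict

def lru_hit_ratio(trace, cache_size):
    """Fraction of sends whose destination is already among the
    `cache_size` most recently used destinations (LRU replacement)."""
    cache = OrderedDict()
    hits = 0
    for dest in trace:
        if dest in cache:
            hits += 1
            cache.move_to_end(dest)          # mark as most recently used
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)    # evict least recently used
            cache[dest] = True
    return hits / len(trace)

# A repetitive, phase-structured trace, as in many parallel algorithms:
trace = [1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3]
print(lru_hit_ratio(trace, 3))
```

A highly repetitive trace yields a high hit ratio, which is exactly the property the thesis's predictors exploit.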
Traditionally, one approach to dealing with communication latency is to tolerate the latency; that is, hide the latency from the processor's critical path by overlapping it with other high-latency events, or hide it with computations. The processor is then free to do other useful tasks.
Three approaches can be used to tolerate latency in shared-memory and message-passing systems [32]. They are proceeding past communication in the same thread, multithreading, and precommunication. The first approach, proceeding past communication in the same thread in message-passing systems, is to make communication messages asynchronous and proceed past them either to other asynchronous communication messages, or to the computation in the same thread. This approach is usually used by parallel algorithm designers. Some of the applications studied in this thesis use this type of latency tolerance by using nonblocking asynchronous MPI calls.
In multithreading, a thread issuing a communication operation suspends itself and lets another thread run. This approach is used for other threads too. It is hoped that when the first thread is rescheduled, its communication operations have concluded. Multithreading can be done in software or hardware. Software multithreading is very expensive. Some hardware multithreading research architectures for message-passing systems, such as the J-Machine [35] and the M-Machine [52], have been reported.
In precommunication, communication operations are pulled up from the place where communications naturally occur in the program so that they are partially or entirely completed before the data is needed. This can be done in software by inserting precommunication operations, or in hardware, by predicting the subsequent communication operations and issuing them early.
Precommunication is common in receiver-initiated communications (that is, in shared-memory systems) where communication commences when data is needed, such as in a read operation. In software-controlled prefetching, the programmer or the compiler decides when and what to prefetch by analyzing the program and then inserting prefetch instructions before the actual data request in the program [%]. In hardware-controlled prefetching, dedicated hardware is used to predict future accesses of sharing patterns and coherence activities by looking at their observed behavior [96, 77, 73, 133, 34, 107]. Thus, there is no need to add instructions to the program. These techniques assume that memory accesses and coherence activities in the near future will follow past patterns. Then, the hardware prefetches the data based on its prediction.
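As a sketch of the hardware-controlled idea (illustrative only; real prefetch engines are considerably more elaborate), a minimal stride predictor watches the address stream and predicts the next address once a stride repeats:

```python
# Illustrative model of a stride-based hardware prefetcher: once two
# consecutive accesses exhibit the same non-zero stride, it predicts the
# next address in the pattern.
class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def access(self, addr):
        """Record an access; return the predicted next address, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.last_stride and stride != 0:
                prediction = addr + stride   # pattern confirmed: prefetch
            self.last_stride = stride
        self.last_addr = addr
        return prediction

pf = StridePrefetcher()
print([pf.access(a) for a in [100, 108, 116, 124]])
# predictions appear once the 8-byte stride has been seen twice
```

The same "observe past behavior, extrapolate the next event" structure underlies the message predictors developed in this thesis, applied to destinations rather than addresses.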
In sendcr-initiated systrms (that is. in mrssagc-passing systrtms). it is usually ditfiçult
to do the communicrition operation enrlier rit the send sides and thus hide the Iiitency. This
is because message communication is naturally initiatrd to trnnsfer the data when the data
is produced. Howrver. messages may arrive enrlier cit the receiver than it is needcd which
leads to a prrcomrnunication for the receiver side of communication.
As far as the author is aware, no precommunication technique has been proposed for message-passing systems. Prediction techniques can be used to predict the subsequent message destinations, and message reception calls, in message-passing systems. This thesis, for the first time, proposes and evaluates two categories of pattern-based predictors, namely, Cycle-based predictors and Tag-based predictors, for message-passing systems. These predictors can be used dynamically (at the send side or receive side of communications) at the communication assist or network interface with or without the help of a programmer or the compiler.
1.2 Using the Proposed Predictors at the Send Side
In the following, I explain how message destination prediction can be helpful in hiding the reconfiguration delay in single-hop and multi-hop reconfigurable optical interconnection networks, and in hiding path setup time in circuit-switched electronic networks. I also describe the benefit of message destination prediction techniques in reducing the latency of communications in current commercial wormhole-routed networks.
The interconnection network plays a key role in the performance of message-passing parallel computers. A message is sent from a source to a destination through the interconnection network. High communication bandwidth and low communication latency are essential for efficient communication between a source and a destination. However, communication latency is the most important factor affecting the performance of message-passing parallel computers. In this thesis, I am interested in hiding and reducing the communication latency. Two categories of interconnection networks exist: electronic interconnection networks and optical interconnection networks. I have developed prediction techniques that can be applied to both electronic and optical interconnection networks.
The proposed predictors can be used to set up the paths in advance in electronic networks using either circuit switching or wave switching. In circuit switching, the routing header flit progresses through to the message destination and reserves physical links. Wave switching is a hybrid switching technique for high-performance routers in electronic interconnection networks. Wave switching combines wormhole switching and circuit switching in the same router architecture to reduce the fixed overhead of communication latency by exploiting communication locality. Hence, it is possible to hide the hardware communication latency using message destination predictions to pre-establish physical circuits in circuit switching and wave switching networks.
The predictors can even be useful for reducing communication latency in current commercial networks. For example, Myrinet networks [23] have a relatively long routing time compared with the link transmission time. Predictors would allow sending the message header in advance for the predicted message destination. When data becomes available, it can be directly transmitted through the network if the prediction was correct, thus reducing latency significantly. In case of a mis-prediction, a message tail is forwarded to tear the path down. Obviously, null messages must be discarded at the destination.
Optics is ideally suited for implementing interconnection networks because of its superior characteristics over electronic interconnects, such as higher bandwidth, greater number of fan-ins and fan-outs, higher interconnection densities, less signal crosstalk, freedom from planar constraints (as it can easily exploit the third spatial dimension, which dramatically increases the available communication bandwidth), lower signal and clock skew, lower power dissipation, inherent parallelism, immunity from electromagnetic interference and ground loops, and suitability for reconfigurable interconnects [100, 51, 74, 19, 50, 129, 83].
Future massively parallel computers might benefit from using reconfigurable optical interconnection networks. Currently, there are some problems with the optical interconnect technology. Signal attenuation, optical element alignment, slow conversion between electronics and photonics and vice versa, and high reconfiguration delay are some disadvantages of optics, which are mostly due to its relatively immature technology. However, this technology is maturing fast. As an example, the Lucent WaveStar LambdaRouter [86] relies on an array of hundreds of electrically configurable microscopic mirrors fabricated on a single substrate so that an individual wavelength can be passed to any of 256 input and output fibers.
As stated above, the reconfiguration delay in reconfigurable optical interconnection networks is currently very high. The proposed message destination predictors can be efficiently used to hide the reconfiguration delay in single-hop and multi-hop reconfigurable optical interconnection networks concurrently with the computations [127, 84].
1.3 Redundant Message Copying in Software Messaging Layers
The communication software overhead currently dominates the communication time in clusters of workstations/multiprocessors. Crossing protection boundaries several times between the user space and the kernel space, passing through several protocol layers, and involving a number of memory copying operations are three different sources of software communication cost.
Several researchers are working to minimize the cost of crossing protection boundaries, and to use simple protocol layers, by utilizing user-level messaging techniques such as Active Messages (AM) [125], Fast Messages (FM) [102], Virtual Memory-Mapped Communications (VMMC-2) [43], U-Net [126], LAPI [110], Basic Interface for Parallelism (BIP) [105], Virtual Interface Architecture (VIA) [49], and PM [121].
A significant portion of the software communication overhead belongs to a number of message copying operations. Ideally, message protocols should copy the message directly from the send buffer in its user space to the receive buffer at the destination without any intermediate buffering. However, applications at the send side do not know the final receive buffer addresses and, hence, the communication subsystems at the receiving end still copy messages at a temporary buffer.
Several research groups have tried to avoid memory copying [79, 14, 106, 119, 118]. They have been able to remove the extra memory copying operations between the application user buffer space and the network interface at the send side. However, they haven't been able to remove the memory copying at the receiver sides. They may achieve a zero-copy messaging at the receiver sides only when the receive call is already posted, a rendezvous-type communication is used for large messages, or the destination buffer address is already known by an extra communication (pre-communication). However, the predictors proposed in this dissertation can be efficiently used to predict the next message reception calls and thus move the corresponding incoming messages to a place near the CPU, such as a staging cache.
1.4 Collective Communications
Communication operations may be either point-to-point, which involve a single source and a single destination, or collective, in which more than two processes participate. Collective communications are common basic patterns of interprocessor communication that are frequently used as building blocks in a variety of parallel algorithms. Proper implementation of these basic communication operations is a key to the performance of parallel computers. Therefore, there has been a great deal of interest in their design and the study of their performance. Excellent surveys on collective communication algorithms can be found in [90, 53, 61].
Collective communication operations can be used for data movement, process control, or global operations. Data movement operations include broadcasting, multicasting, scattering, gathering, multinode broadcasting, and total exchange. Barrier synchronization is a type of process control. Global operations include reduction and scan. The growing interest in collective communications is evident by their inclusion in the Message Passing Interface (MPI) [93, 92].
1.5 Thesis Contributions
In Chapter 2, I describe the applications used in this thesis along with the point-to-point communication primitives that they use. I explain the experimental methodology used to collect the communication traces of the applications.
In Chapter 3, I introduce a complete interconnection network using free-space reconfigurable optical interconnects for message-passing parallel machines. A computing node in this parallel machine configures its communication link(s) to reach its destination node(s). Then it sends its message(s) over the established link(s).
I characterize some communication properties of the parallel applications by presenting their communication frequency and message destination distributions. I define the concept of communication locality in message-passing parallel applications, and of caching in reconfigurable networks. I present evidence, using the classical memory hierarchy heuristics LRU, LFU, and FIFO, that there exists message destination communication locality in the message-passing parallel applications.
The first contribution of this thesis (Chapter 3) is the design and evaluation (in terms of hit ratio) of two different categories of hardware/software communication latency hiding predictors for such reconfigurable message-passing environments. I have utilized the message destination locality property of message-passing parallel applications to devise a number of heuristics that can be used to predict the target of subsequent communication calls. This technique can be applied directly to reconfigurable interconnects to hide the communication latency by reconfiguring the communication network concurrently with the computation.
Specifically, I propose two sets of message destination predictors: Cycle-based predictors, which are purely dynamic predictors, and Tag-based predictors, which are static/dynamic predictors. In the Cycle-based predictors, Single-cycle, Single-cycle2, Better-cycle, and Better-cycle2, predictions are done dynamically at the network interface without any help from the programmer or compiler. In the Tag-based predictors, Tagging, Tag-cycle, Tag-cycle2, Tag-bettercycle, and Tag-bettercycle2, predictions are done dynamically at the network interface as well, but they require an interface to pass some information from the program to the network interface. This can be done with the help of a programmer or the compiler through inserting instructions in the program such as pre-connect(tag) (or pre-receive(tag) as in Chapter 6). The performance of the proposed predictors Better-cycle2 and Tag-bettercycle2 is very high and proves that they have the potential to hide the hardware communication latency in reconfigurable networks. The memory requirements of the predictors are very low. That makes them very attractive for implementation on the communication assist or network interface.
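The tag-based idea can be sketched as follows. This is an illustrative reading only, not the exact Tagging or Tag-cycle algorithms: the program (or compiler) identifies each communication call site with a tag, a hypothetical pre-connect(tag) queries the network interface for the last destination recorded under that tag, and the real send updates the record.

```python
# Illustrative per-tag destination predictor ("last destination wins").
class TagPredictor:
    def __init__(self):
        self.last_dest = {}                 # tag -> last destination used

    def predict(self, tag):
        """Called at pre-connect(tag): guess the destination for this tag."""
        return self.last_dest.get(tag)      # None until the tag is seen

    def update(self, tag, dest):
        """Called when the real send for this tag is issued."""
        self.last_dest[tag] = dest

tp = TagPredictor()
sends = [(7, 3), (9, 5), (7, 3), (9, 5), (7, 4)]   # (tag, destination) pairs
hits = 0
for tag, dest in sends:
    if tp.predict(tag) == dest:
        hits += 1
    tp.update(tag, dest)
print(hits, "of", len(sends))
```

Because each call site tends to talk to the same partner across iterations, even this one-entry-per-tag table captures much of the repetition that the richer Tag-cycle variants exploit.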
In order to efficiently use the predictors proposed in Chapter 3 to hide the hardware latency of the reconfigurable interconnects, enough lead time should exist such that the reconfiguration of the interconnect can be completed before the communication request arrives. In Chapter 4, I present the pure execution times of the computation phases of the parallel applications on the IBM Deep Blue machine at the IBM T. J. Watson Research Center using its high-performance switch and under the user space mode.
As the second contribution of this thesis, Chapter 4 states that, by comparing the inter-send computation times of these parallel benchmarks with some specific reconfiguration times, most of the time we are able to fully utilize these computation times for the concurrent reconfiguration of the interconnect when we know, in advance, the next target using one of the proposed high hit ratio target prediction algorithms introduced in Chapter 3. I present the performance enhancements of the proposed predictors on the application benchmarks for the total reconfiguration time. Finally, I show that by applying the predictors at the send sides, applications at the receiver sides would also benefit, as messages arrive earlier than before.

As the third contribution of this thesis (Chapter 5), I present and analyze a broadcasting algorithm that utilizes latency hiding and reconfiguration in the network to speed up the broadcasting operation under single-port and k-port modeling. In this algorithm, the reconfiguration phase of some of the nodes is overlapped with the message transmission phase of the other nodes, which ultimately reduces the broadcasting time. The analysis brings up a closed formulation that yields the termination time of the algorithm.
The fourth contribution of this thesis (Chapter 5) is a combined total exchange algorithm based on a combination of the direct [105, 120] and standard exchange [71, 24] algorithms. This ensures a better termination time than that which can be achieved by either of the two algorithms. Also, known algorithms [20, 40] for scattering and all-to-all broadcasting have been adapted to the network.
In Chapter 6, I present the frequency and distributions of receive communication calls in the applications. I present evidence that there exists message reception communications locality in the message-passing parallel applications. As I stated earlier, the communication subsystems at the receiving end still copy early arriving messages unnecessarily at a temporary buffer. As far as the author is aware, no prediction techniques have been proposed to remove this unnecessary message copying.
I use the predictors introduced in Chapter 3 to predict the next consumable message, and to thus establish the existence of message reception communications locality. As the fifth contribution of this thesis, Chapter 6 argues that these message predictors can be efficiently used to drain the network and cache the incoming messages even if the corresponding receive calls have not been posted yet. This way, there is no need to unnecessarily copy the early arriving messages into a temporary buffer.
The performance of the proposed predictors Single-cycle, Tag-cycle2, and Tag-bettercycle2, in terms of hit ratio on the parallel applications, is quite promising and suggests that prediction has the potential to eliminate most of the remaining message copies. Moreover, the memory requirements of these predictors are very low, making them easy to implement. Finally, I discuss ways in which these predictions could be used to drastically reduce the latency due to message copying.
In Chapter 7, I conclude this thesis and give some directions for future research.
Chapter 2
Application Benchmarks and Experimental Methodology
In Section 2.1, I describe the applications used in this thesis. I explain the various point-to-point message-passing primitives of the applications in Section 2.2. I discuss the experimental methodology in Section 2.3.
2.1 Parallel Benchmarks
This thesis (except Chapter 5) studies the computation and communication characteristics of actual parallel applications. For these studies, I have used some well-known parallel benchmarks from the NAS Parallel Benchmarks (NPB) suite [13], the Parallel Spectral Transform Shallow Water Model (PSTSWM) parallel application [125], and the pure Quantum Chromo Dynamics Monte Carlo Simulation Code with MPI (QCDMPI) parallel application [65]. Although the results presented in this thesis are for the above parallel applications, these applications have been widely used as benchmarks representing the computations in scientific and engineering parallel applications.
I used the MPI [92] implementation of the NAS benchmarks, version 2.3, the PSTSWM, version 6.2, and the QCDMPI, version 1.4, and ran them on several IBM SP2 machines. I chose the IBM SP2 as it is a message-passing parallel machine, so that the chosen parallel applications are mapped directly onto it. I used different system sizes and problem sizes of the applications in this study. NPB 2.3 comes with five problem sizes for each benchmark: small class "S", workstation class "W", large class "A", and larger classes "B" and "C". Due to access limitations in the use of the IBM Deep Blue machine at the IBM T. J. Watson Research Center, and space limitations in using the University of Victoria IBM SP2, I was able to experiment with only the "W" and "A" classes, and the results included in this thesis represent these classes.
2.1.1 NPB: NAS Parallel Benchmarks Suite
The NAS Parallel Benchmarks (NPB) [13] have been developed at the NASA Ames Research Center to study the performance of massively parallel processor systems and networks of workstations. The NAS Parallel Benchmarks are a set of eight benchmark problems, each of which focuses on some important aspect of highly parallel supercomputing for aerophysics applications. The NPB are a set of implementations of the NAS Parallel Benchmarks based on Fortran 77 and the MPI message-passing interface standard, and are not tied to any specific system.
The NPB consists of five "kernels" and three "simulated computational fluid dynamics (CFD) applications". The three simulated CFD application benchmarks, lower-upper diagonal (LU), scalar pentadiagonal (SP), and block tridiagonal (BT), are intended to accurately represent the principal computational and data movement requirements of modern CFD applications. The kernels, conjugate gradient (CG), multigrid (MG), embarrassingly parallel (EP), 3-D fast-Fourier transform (FT), and integer sort (IS), are relatively compact problems, each of which emphasizes a particular type of numerical computation. I am interested in the point-to-point patterns of the LU, BT, and SP applications, and the CG and MG kernels. The EP, FT, and IS kernels are not suitable for this study. EP and FT use only collective communication operations, while each node in the IS kernel always communicates with a specific node.
2.1.1.1 CG

The conjugate gradient kernel, CG, tests the performance of the system for unstructured grid computations, which by their nature require irregular long-distance communications, a challenge for all kinds of parallel computers. Essentially, it requires computing a sparse matrix-vector product. The inverse power method is used to find an estimate of the largest eigenvalue of a symmetric positive-definite sparse matrix with a random pattern of non-zeros. This code requires a power-of-two number of processors.
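The kernel's core operation, a repeated sparse matrix-vector product driving an eigenvalue estimate, can be illustrated with a tiny sketch. This is plain power iteration for the dominant eigenvalue, shown only to make the structure of the computation concrete; it is not NPB's inverse power method.

```python
# Illustrative only: repeated sparse matrix-vector products, the heart of
# CG-style kernels, driving a dominant-eigenvalue estimate.
def spmv(rows, x):
    """Sparse matrix-vector product; rows is a list of {col: value} dicts."""
    return [sum(v * x[j] for j, v in row.items()) for row in rows]

def power_iteration(rows, iters=50):
    x = [1.0] * len(rows)
    for _ in range(iters):
        y = spmv(rows, x)
        norm = max(abs(v) for v in y)   # infinity-norm scaling
        x = [v / norm for v in y]
    return norm                         # estimate of |lambda_max|

# The 2x2 symmetric matrix [[2, 1], [1, 2]] has eigenvalues 1 and 3:
A = [{0: 2.0, 1: 1.0}, {0: 1.0, 1: 2.0}]
print(power_iteration(A))
```

In the real benchmark the matrix is large, randomly sparse, and distributed, which is what makes the communication pattern irregular and long-distance.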
2.1.1.2 MG
The second kernel benchmark is a simplified multigrid kernel, MG, which solves a 3-D Poisson PDE. Four iterations of the V-cycle multigrid algorithm are used to obtain an approximate solution u to the discrete Poisson problem ∇²u = v on a 256 × 256 × 256 grid with periodic boundary conditions. This code is a good test of both short and long distance highly structured communication. This code requires a power-of-two number of processors. The partitioning of the grid onto the processors occurs such that the grid is successively halved, starting with the z dimension, then the y dimension, and then the x dimension, and repeating until all power-of-two processors are assigned.
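The halving rule just described can be sketched as follows. The function and its dictionary output are assumed from the textual description above, for illustration, and are not taken from the NPB source.

```python
# Illustrative sketch of the MG partitioning rule: halve the grid, cycling
# through the z, y, x dimensions, until all 2^k processors are assigned.
def mg_partition(p):
    """Return how many times each dimension is halved for p = 2^k procs."""
    halvings = {"z": 0, "y": 0, "x": 0}
    order = ["z", "y", "x"]
    i = 0
    while p > 1:
        halvings[order[i % 3]] += 1     # halve the next dimension in turn
        p //= 2
        i += 1
    return halvings

print(mg_partition(8))     # one halving per dimension
print(mg_partition(32))    # the cycle wraps around: z and y halved twice
```

The cyclic halving keeps subdomains roughly cubic, which balances the surface area (and hence the boundary-exchange communication) across dimensions.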
2.1.1.3 LU
The lower-upper diagonal benchmark, LU, employs a symmetric successive over-relaxation (SSOR) numerical scheme to solve a regular-sparse block 5 × 5 lower and upper triangular system. A 2-D partitioning of the grid onto processors occurs by halving the grid repeatedly in the first two dimensions, alternately x and then y, until all power-of-two processors are assigned, resulting in vertical pencil-like grid partitions on the individual processors. The ordering of point-based operations constituting the SSOR procedure proceeds on diagonals which progressively sweep from one corner of a given z plane to the opposite corner of the same z plane, thereupon proceeding to the next z plane. Communication of partition boundary data occurs after completion of computation on all diagonals that contact an adjacent partition. LU is very sensitive to the small-message communication performance of an MPI implementation. It is the only benchmark in the NPB 2.3 suite that sends large numbers of very small (40 byte) messages.
2.1.1.4 BT and SP
The BT and SP algorithms have a similar structure: each solves three sets of uncoupled systems of equations, first in the x, then in the y, and finally in the z direction. In the block tridiagonal benchmark, BT, multiple independent systems of non-diagonally dominant, block tridiagonal equations with a 5 × 5 block size are solved. In the scalar pentadiagonal benchmark, SP, multiple independent systems of non-diagonally dominant, scalar pentadiagonal equations are solved. Both the BT and SP codes require a square number of processors. These codes have been written so that if a given parallel platform only permits a power-of-two number of processors to be assigned to a job, then unneeded processors are deemed inactive and are ignored during computation, but are counted when determining Mflop/s rates.
2.1.2 PSTSWM
The Parallel Spectral Transform Shallow Water Model (PSTSWM) application [125] was developed by Worley at Oak Ridge National Laboratory and Foster at Argonne National Laboratory. PSTSWM is a message-passing benchmark code and parallel algorithm testbed that solves the nonlinear shallow water equations on a rotating sphere using the spectral transform method. PSTSWM was developed to evaluate parallel algorithms for the spectral transform method as it is used in global atmospheric circulation models. Multiple parallel algorithms are embedded in the code and can be selected at run-time, as can the problem size, number of processors, and data decomposition. PSTSWM is written in Fortran 77 with VMS extensions and a small number of C preprocessor directives. I used the MPI implementation of the PSTSWM with the default input sizes.
2.1.3 QCDMPI
Pure Quantum Chromo Dynamics Monte Carlo Simulation Code with MPI
(QCDMPI) [65], written by Hioki at Tezukayama University, is a pure Quantum Chromo
Dynamics simulation code with MPI calls. It is a powerful tool to analyze the non-pertur-
bative aspects of QCD. This program can be applied to any dimensional QCD, such as the
4-dimensional QCD in which the color and/or quark confinement mechanisms are
obtained. QCDMPI runs on any number of processors, and any dimensional partition-
ing of the system can be applied.
2.2 Applications' Communication Primitives
As stated earlier, I am only interested in the patterns of the point-to-point communica-
tions between pair-wise nodes in the above applications, as discussed in Chapter 3, Chapter
4, and Chapter 6 of this thesis. Efficient algorithms for collective communications are pre-
sented in Chapter 5. These applications use synchronous and asynchronous MPI send and
receive primitives [91]. I briefly explain these communication primitives here.
An MPI program consists of autonomous processes, executing their own code, in a
multiple instructions multiple data (MIMD) style. Note that all parallel applications stud-
ied in this thesis use a single program multiple data (SPMD) style. Processes are identi-
fied according to their relative rank in a group, that is, consecutive integers in the range 0
to groupsize - 1. If the group consists of all processes then the processes are ranked from 0
to N - 1, where N is the total number of processes in the application.
The processes communicate via calls to MPI communication primitives. The basic
point-to-point communication operations are send and receive. There are two general
types of point-to-point communication operations in MPI: blocking and nonblocking. Blocking
send or receive calls will not return until the parameters of the calls can be safely modi-
fied. That is, in the case of a send call, the message envelope has been created and the mes-
sage has been sent out or has been buffered into a system buffer. In the case of a receive
call, it means that the message has been received into the receive buffer. Note that the mes-
sage envelope consists of a fixed number of fields (source, dest, tag, comm) and it is used to
distinguish messages and selectively receive them. Nonblocking communication opera-
tions just post or start the operation. Thus the application programmer must explicitly
complete the communication call later at some point in the program using one of the vari-
ous function calls in MPI such as MPI_Wait or MPI_Waitall.
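The envelope-based selection described above can be made concrete with a small sketch. The code below is an illustrative simulation, not the MPI library itself; the names ANY_SOURCE, ANY_TAG, and matches are hypothetical stand-ins for MPI's wildcard constants and its internal matching rule.

```python
# Hypothetical sketch of how a message envelope (source, dest, tag, comm)
# lets a posted receive select a message. Not MPI itself.
ANY_SOURCE = -1   # stands in for MPI_ANY_SOURCE
ANY_TAG = -1      # stands in for MPI_ANY_TAG

def matches(envelope, recv_source, recv_tag, recv_comm):
    """Return True if a posted receive accepts this message envelope."""
    source, dest, tag, comm = envelope
    return (comm == recv_comm and                  # contexts never mix
            recv_source in (ANY_SOURCE, source) and
            recv_tag in (ANY_TAG, tag))

# A message from rank 3 to rank 0, tag 7, on communicator 0:
env = (3, 0, 7, 0)
assert matches(env, ANY_SOURCE, 7, 0)       # wildcard source matches
assert not matches(env, ANY_SOURCE, 7, 1)   # different communicator: no match
```

The key property mirrored here is that the communicator acts as a separate communication universe: no combination of source and tag wildcards lets a receive accept a message sent in a different context.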
There are four communication modes in MPI: standard, buffered, synchronous, and
ready. These correspond to four different types of send operations. In the synchronous
mode send call, the call will not finish until a matching receive call has been issued and
has begun reception of the message. In the buffered mode send call, the send call is local
(in contrast to the other communication modes, where the send calls are nonlocal) and does not
wait for the receive call to be posted; instead, it buffers the data if the receive call has not
been posted. In the ready mode send call, the receive call must have been posted earlier. In
the standard mode, it is up to the system to buffer the data or send it as in the synchronous
mode. Note that the standard mode is the only mode for the receive calls.
2.2.1 MPI_Send
MPI_Send (buf, count, datatype, dest, tag, comm) [92] is a standard blocking send call
which is a combination of the buffered and synchronous modes and is dependent on the imple-
mentation. When the call finishes, the send buffer can be reused. In the buffered mode, data
is written from the send buffer to the system buffer and the call returns. In the synchronous
mode, the call waits for the receive to be posted and then returns. The LU, MG, CG, and
PSTSWM applications use this type of send call.
2.2.2 MPI_Isend
MPI_Isend (buf, count, datatype, dest, tag, comm, request) [92] is a standard non-
blocking send call. It returns immediately; therefore, the send buffer cannot be reused until
the call is completed. It can be implemented in the buffered or synchronous mode. It needs another call, MPI_Wait
or MPI_Waitall, to complete the call. These completion calls are explained later in Section
2.2.6 and Section 2.2.7, respectively. BT and SP use this type of send call.
2.2.3 MPI_Sendrecv_replace
MPI_Sendrecv_replace (buf, count, datatype, dest, sendtag, source, recvtag, comm,
status) [92] combines in one call the sending of a message and the receiving of another message
in the same buffer. QCDMPI uses this type of communication call.
2.2.4 MPI_Recv
MPI_Recv (buf, count, datatype, source, tag, comm, status) [92] is a standard blocking
receive call. When it returns, the data is available at the destination buffer. LU and
PSTSWM use this type of receive call.
2.2.5 MPI_Irecv
MPI_Irecv (buf, count, datatype, source, tag, comm, request) [93] is a standard non-
blocking receive call. It immediately posts the call and returns. Hence, data is not available
at the time of return. It needs another completion call such as MPI_Wait or MPI_Waitall to
complete this call. All applications except QCDMPI use this type of receive call.
2.2.6 MPI_Wait
A call to MPI_Wait (request, status) [92] returns when the operation identified by
request is complete. For an MPI_Isend operation, when MPI_Wait returns, the send buffer can
be reused. For an MPI_Irecv operation, the completion of the MPI_Wait call notifies the
availability of the data at the receive buffer. The BT, LU, MG, CG, and PSTSWM applications all
use this type of completion call.
2.2.7 MPI_Waitall
MPI_Waitall (count, array_of_requests, array_of_statuses) [92] waits for the comple-
tion of all nonblocking calls associated with the active handles in the list. BT and SP use
this type of completion call.
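The post-then-complete discipline of the nonblocking primitives above can be sketched abstractly. The toy classes below are illustrative simulations, not MPI bindings; Request and waitall are hypothetical names chosen to echo MPI_Isend/MPI_Irecv handles and MPI_Waitall.

```python
# Toy model (not MPI) of posting nonblocking operations and completing
# them later with a Waitall-style call.
class Request:
    """Stands in for the request handle returned by MPI_Isend/MPI_Irecv."""
    def __init__(self, op):
        self.op, self.done = op, False

    def wait(self):              # analogous to MPI_Wait on one handle
        self.done = True         # a real runtime would block until transfer ends
        return self.op

def waitall(requests):           # analogous to MPI_Waitall
    return [r.wait() for r in requests]

# Post sends and receives to two peers, then complete them all at once;
# only after waitall may the buffers be reused (sends) or read (receives).
reqs = [Request(("send", peer)) for peer in (1, 2)]
reqs += [Request(("recv", peer)) for peer in (1, 2)]
waitall(reqs)
assert all(r.done for r in reqs)
```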
2.3 Experimental Methodology
I executed the applications on the 12-node IBM SP2 machine at the University of Vic-
toria for gathering their communication traces, and on the 30-node IBM Deep Blue at the
IBM T. J. Watson Research Center for collecting their timing profiles. I wrote my own
profiling codes using the wrapper facility of MPI to gather the communication traces
and the timing profiles of these applications. I did this by inserting monitor operations in
the profiling MPI library for the communication related activities. These operations
include arithmetic operations for the calculation of the desired characteristics. It is worth
mentioning that gathering communication traces does not affect the communication pat-
terns of these applications. However, it affects the temporal properties of these applica-
tions. In Appendix A, I explain the approach used to remove the timing disturbances from
the timing profiles of the applications.
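The wrapper idea behind this profiling approach can be sketched as follows. The sketch is a simulation: real_send, profiled_send, and trace are illustrative names, not the thesis's actual profiling library, which wraps the MPI calls themselves.

```python
# Sketch of a profiling wrapper: intercept each send call, record a trace
# record (a monitor operation), then forward to the underlying routine.
# The communication pattern itself is unchanged by the interception.
trace = []

def real_send(dest, nbytes):
    """Stands in for the underlying MPI send routine."""
    pass

def profiled_send(dest, nbytes):
    trace.append((dest, nbytes))    # monitor operation: record the event
    return real_send(dest, nbytes)  # then perform the real communication

for dest, nbytes in [(1, 40), (2, 40), (1, 1024)]:
    profiled_send(dest, nbytes)
assert len(trace) == 3 and trace[0] == (1, 40)
```

The same interception point can accumulate timing samples, which is why the traces perturb the temporal properties (extra work per call) but not the spatial pattern of destinations.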
Chapter 3
Design and Evaluation of Latency Hiding/Reduction Message Destination Predictors
Interconnection networks and their services such as message delivery and flow control
are a major source of communication hardware latency in parallel computer systems. In
Section 3.1, I briefly describe message-passing computers and message switching layers.
Then, as a specific circuit switched interconnection network, I introduce a reconfigurable
optical network, RON (k, N), for message-passing parallel computers. The advantages of
such reconfigurable optical interconnects are their high bandwidth and their ability to pro-
vide versatile application-dependent network reconfigurations.
I characterize some communication properties of the parallel application benchmarks
by presenting their communication frequency and message destination distributions in
Section 3.2. I define the concepts of communication locality in message-passing parallel
applications, and caching in reconfigurable networks, in Section 3.3. I present evidence
that there exists message destination communication locality in the message-passing par-
allel applications in Section 3.3.1. Using classical replacement heuristics, LRU, LFU, and
FIFO, I show that message destinations display a form of locality.
I have utilized the message destination locality property of message-passing parallel
applications to devise a number of heuristics that can be used to predict the target of sub-
sequent communication requests. Thus, in Section 3.4, I contribute by proposing and eval-
uating (in terms of hit ratio) two different categories of hardware/software communication
latency hiding predictors for message-passing environments. By utilizing such predictors,
the hardware communication latency in reconfigurable interconnects can be effectively
hidden by reconfiguring the communication network concurrent to the computation. I
compare the performance and storage requirements of the proposed predictors in Section
3.5. In Section 3.6, I elaborate on how these predictors can be used and integrated into the
network interfaces. Finally, I summarize this chapter in Section 3.7.
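The hit-ratio style of evaluation used for these heuristics can be illustrated with a minimal sketch: an LRU-managed set of the k most recently used destinations stands in for k links that could have been kept configured, and the hit ratio measures how often a send finds its target already in the set. The destination stream below is illustrative, not a trace from the benchmarks.

```python
from collections import OrderedDict

def lru_hit_ratio(destinations, k):
    """Fraction of sends whose destination is already among the k most
    recently used ones (i.e., whose link could have stayed configured)."""
    cache, hits = OrderedDict(), 0
    for d in destinations:
        if d in cache:
            hits += 1
            cache.move_to_end(d)           # d becomes most recently used
        else:
            if len(cache) == k:
                cache.popitem(last=False)  # evict the least recently used
            cache[d] = True
    return hits / len(destinations)

# A repetitive destination stream shows strong locality even with k = 2:
stream = [1, 2, 1, 2, 1, 2, 3, 1, 2]
assert abs(lru_hit_ratio(stream, 2) - 4 / 9) < 1e-12
```

LFU and FIFO variants differ only in the eviction rule (least frequently used counter, or insertion order); comparing their hit ratios on the same stream is exactly the locality measurement described above.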
3.1 Introduction
Message-passing multicomputers are composed of a number of computing modules
that communicate with each other by exchanging messages through their interconnection
networks. Each computing module has its own processors, local memory, and communi-
cation assist/network interface. All local memories are private and are accessible only by
the local processors. Communication hardware latency, communication software latency,
and the user environment (multiprogramming, multiuser) are the major factors affecting
the performance of message-passing parallel computer systems.
Interconnection networks, and their services such as message delivery and flow control,
are a major source of communication hardware latency. Essentially, an interconnection
network is characterized by its topology, switching strategy, flow control mechanism, and
routing algorithm. The topology is the physical structure of the network. The interconnec-
tion network [46] might be a shared-medium network (such as Ethernet, Token Ring), a
direct network (such as mesh, torus), an indirect network (a multistage interconnection net-
work such as IBM SP [117], or irregular such as Myrinet [23]), or a hybrid network (such
as hypermesh) [117].
The routing algorithm determines which routes messages should follow through the
network to reach their destinations. There are many different routing algorithms with dif-
ferent guarantees and performance, such as Duato's adaptive routing [47], Glass and Ni's
turn-model routing [56], and up*/down* routing.
The flow control mechanism determines when the message, or packet, or portion of a
message should move along its route. Packets or flits may be blocked, buffered, discarded,
or detoured to an alternate route based on the flow control mechanism.
3.1.1 Message Switching Layers
The switching strategy determines how a message moves along its route. There are
many switching strategies. Circuit switching, packet switching, virtual cut-through, and
wormhole switching are the basic switching strategies [46]. In packet switching, messages
are divided into fixed-size packets. Each packet is routed individually from source to des-
tination and has to be buffered in each intermediate node. It is also called store-and-for-
ward switching. In virtual cut-through switching, the entire packet does not need to be
buffered in the nodes. The packet header can be examined, and after the routing decision is
made and the output channel is free, the header and the following data can be immediately
transmitted. In wormhole switching, the packet is broken up into flits. Wormhole switch-
ing pipelines the flits through the network just like the virtual cut-through switching strat-
egy, but it has reduced buffer requirements.
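The latency difference between these strategies can be seen with the standard textbook approximations (these formulas and the numbers below are illustrative, not measurements from this thesis): store-and-forward pays the full packet transmission time at every hop, while cut-through and wormhole pay the per-hop cost only for the header.

```python
# Back-of-the-envelope latency comparison for a packet of L bytes with an
# Lh-byte header crossing D hops on links of B bytes/s. Standard textbook
# approximations; routing and switching overheads are ignored.
def store_and_forward(L, D, B):
    return D * (L / B)             # whole packet buffered at every hop

def cut_through(L, Lh, D, B):
    return D * (Lh / B) + L / B    # only the header pays the per-hop cost

L, Lh, D, B = 4096, 16, 8, 1e9     # illustrative parameters
assert cut_through(L, Lh, D, B) < store_and_forward(L, D, B)
```

As D grows, store-and-forward latency scales with D * L while the pipelined strategies scale with D * Lh + L, which is why wormhole and virtual cut-through dominate in multi-hop networks; circuit switching removes even the header's per-hop cost once the path is established, at the price of the setup phase.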
In circuit switching, a physical path is reserved from a source to a destination before
the actual message transmission takes place. The routing header is injected into the net-
work. It reserves physical links as it is transmitted through intermediate nodes. A com-
plete path is set up when the routing header reaches the destination. Then an
acknowledgment is transmitted back to the source. Then, the message contents can be sent
along the reserved channels. The disadvantage is that during message transmission other
messages may be blocked. The advantage is the minimum message transfer latency, as the
physical path is already established.
In Chapter 3 through Chapter 5 of this thesis, I am interested in the circuit switching
strategy. As I explain later in Section 3.3, message destinations in message-passing paral-
lel applications display a form of locality. Thus, it is possible to use this communication
locality to pre-establish the physical links and thus hide the path setup time. This applies
both to electronic circuit switched interconnection networks, and to reconfigurable
optical interconnection networks. However, as I describe in Section 3.4, the prediction
techniques that I propose in this chapter would also reduce the communication time in
wormhole routed networks. In the next section, I consider a circuit switched reconfig-
urable optical interconnection network as a specific case.
3.1.2 Reconfigurable Optical Networks
Several topological properties, such as degree, average distance, and diameter, can be
used to evaluate and compare different interconnection networks. Most of these properties
can be derived from the underlying graph of an interconnection network, where processors
and communication links are mapped onto the vertices (nodes) and edges (links) of the
graph, respectively.
A graph consists of a set of vertices, V, interconnected by a set of edges, E, symbol-
ized as G = (V, E) [123]. The number of vertices and edges in a graph is N = |V| and |E|,
respectively. An edge e in E connects vertices u and v, written as e = uv, and is said to be
incident with u and v. A vertex v has degree d_v if it is incident with exactly d_v edges. A
path is a sequence of distinct vertices v_1, v_2, ..., v_k such that for every 1 <= i < k, the edge v_i v_{i+1} is
in E. The distance between u and v, dist(u, v), is the minimum length of a path between u
and v. The eccentricity of u is e(u) = dist(u, w), where w is a vertex such that
dist(u, w) = max over all vertices v of dist(u, v). The maximum eccentricity among all vertices is the
diameter of the graph.
I am interested in having a complete interconnection network, where any computing
node can communicate with any other node in a single hop. Complete interconnection net-
works can be modeled by a complete graph, K_N. A complete graph is a regular graph
where all N vertices are linked together and the diameter is one. Each vertex has degree d_G
equal to N - 1, and the number of edges, |E|, is N(N - 1)/2, far too high to be of practical
interest when N is large. These limitations prevent implementing complete networks using
metal-based interconnections as there is a fixed physical link between any two nodes.
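The definitions above can be checked mechanically. The sketch below computes distances by breadth-first search and verifies the degree and diameter properties of a small complete graph; the graph instance and function names are illustrative.

```python
from collections import deque

def distances(adj, u):
    """BFS distances dist(u, v) in an undirected graph given as adjacency lists."""
    dist = {u: 0}
    q = deque([u])
    while q:
        v = q.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    return dist

def eccentricity(adj, u):
    return max(distances(adj, u).values())   # e(u) = max_v dist(u, v)

def diameter(adj):
    return max(eccentricity(adj, u) for u in adj)

# Complete graph K4: every vertex has degree N - 1 and the diameter is 1.
N = 4
K4 = {u: [v for v in range(N) if v != u] for u in range(N)}
assert diameter(K4) == 1
assert all(len(K4[u]) == N - 1 for u in K4)
assert sum(len(K4[u]) for u in K4) // 2 == N * (N - 1) // 2   # |E| = N(N-1)/2
```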
Optics is ideally suited for implementing interconnection networks because of its
superior characteristics over electronics [100, 51, 74], such as higher interconnection den-
sity, higher bandwidth, suitability for reconfigurable interconnects, greater fan-in and fan-
out, lower error rate, freedom from planar constraints (light beams can easily cross each
other), immunity from electromagnetic fields and ground loops, and lower signal crosstalk.
Several research groups in academia and industry are working on different aspects of uti-
lizing optical interconnects in massively parallel processing systems, including works on
the feasibility study and technology related problems of optical interconnects, architec-
tures for optically interconnected computer systems, and communications and algorithmic
issues for such parallel systems [92, 19].
One of the main features of an optical interconnect is its capability to reconfigure. This
is very suitable for the construction of 3-D VLSI computers [5]. By interconnect reconfig-
uration, I simply mean the ability to change the interconnect dynamically upon
demand. In essence, the advantages of reconfigurable optical interconnects are due to their
ability to provide versatile application-dependent network configurations. Free-space opti-
cal interconnects are a class of optical interconnects that can support network reconfigura-
tion.
Free-space optical interconnects use free space (vacuum, air, or glass) for optical sig-
nal propagation. In free-space optical interconnects, optical signals can propagate very
close to each other and pass each other without interaction. They can easily exploit the third
spatial dimension, which dramatically increases the available communication bandwidth.
Free-space reconfigurable optical interconnects result in much denser interconnection net-
works than metal-based and guided-wave interconnections [29, 83], and have the potential
to solve the problems associated with implementing complete networks due to their ability
to reconfigure.
I introduce an abstract model [1] for a complete interconnection network using free-
space reconfigurable optical interconnects for massively parallel computers, and discuss
its characteristics.
Definition A reconfigurable optical network, RON (k, N), consists of N computing
nodes with their own local memory. A node is capable of connecting directly to any other
node. A node can establish k simultaneous connections. These connections are established
dynamically by reconfiguring the optical interconnect. The links remain established until
they are explicitly destroyed.
Messages are sent using circuit switching. That is, a connection must be established
between the source and destination pair before the message is sent. Each node has the abil-
ity to simultaneously send and receive k messages on its k links (the k-port model), or
exactly one message on one of its links (the single-port model). Full-duplex communica-
tion, where a node can send and receive messages at the same time, is supported. A simpli-
fied block diagram of the network is shown in Figure 3.1, where each node uses only one
of its links.
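A minimal sketch of this abstraction follows: each node keeps at most k configured links, established on demand and retained until evicted or destroyed. The class and method names are illustrative, not part of the thesis's formal model.

```python
# Minimal sketch of the RON(k, N) node abstraction: at most k links are
# configured at once; a link is set up on demand and kept until torn down.
class RONNode:
    def __init__(self, k):
        self.k = k
        self.links = set()          # currently configured destinations

    def connect(self, dest):
        """Establish a link to dest. Returns True if a reconfiguration
        (and hence the reconfiguration delay d) was actually incurred."""
        if dest in self.links:
            return False            # link reuse: no reconfiguration delay
        if len(self.links) == self.k:
            self.links.pop()        # tear down some link to make room
        self.links.add(dest)
        return True

node = RONNode(k=2)
assert node.connect(5) is True      # first use: pay the reconfiguration delay
assert node.connect(5) is False     # reuse: the delay is avoided
```

The choice of which link to tear down when all k are busy is exactly where the replacement heuristics (LRU, LFU, FIFO) discussed later in this chapter enter.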
Figure 3.1: RON (1, N), a massively parallel computer interconnected by a complete free-space optical interconnection network (dashed arrows denote potential links; solid arrows denote effective links)
Various implementation technologies exist to embody the above abstract model. Such
technologies include vertical-cavity surface-emitting lasers (VCSELs) for photon genera-
tion, self-electro-optic effect devices (SEEDs) for modulation, frequency hopping for cod-
ing, wavelength tuning for transmitters and receivers, computer generated holograms
(CGH), and deformable mirrors (DM) for switching and optical beam routing. The
switching in the case of CGH can be achieved by recording the desired source-destination
communication patterns. As stated in Chapter 1, deformable mirrors, such as Lucent's
WaveStar LambdaRouter [96], are also reaching maturity. Optical beam routing in a free-
space optical interconnection network often employs other external optical elements such
as mirrors, prisms, and lenses.
Each node has a fixed number of tunable transmitters for sending optical beams toward
its beam router, such as a computer generated hologram or a deformable mirror, to be redi-
rected to the receivers of the other nodes. Also, each node has a large number of fixed
receivers at its input ports. Some of these input ports may be used only for collective com-
munications operations while others may be used for pair-wise communications.
The path setup phase can be done by sending an encoded light beam to the beam router to
reprogram the computer generated hologram, or to deform the mirror, such that the actual
message can be delivered to the destination(s) directly. It can be done in two different
ways. First, the router (CGH or DM) upon receiving the message (which includes the pay-
load) stores the message in a buffer and then configures its output links so that it can for-
ward the message to the destination node(s). This approach needs a buffer for the entire
message at each beam router, which is of high cost. It also involves an extra copy. The bet-
ter approach is to send an optical beam having only the destination address to the beam
router for the path setup phase. Then, after some time, to be called the reconfiguration delay,
the second beam containing the actual message can be sent through the configured router
to its destination.
Collisions can happen at the receiving nodes considering the fact that several beams
may arrive at a destination node at the same time. Hence, a destination node may not be
able to complete the path setup phase, or accept the message. However, I assume that due
to the availability of a large number of fixed receivers at the destinations, connections are
established after some time (the reconfiguration delay).
I assume an unbounded number of available wavelengths for the system. However, in
the case of a limited number of available wavelengths, one can utilize spread-spectrum tech-
niques where each transmitter sends its information changing the wavelength in a pseudo-
random fashion. The receiver can reconstruct the transmitted message if it is aware of the
pseudo-random code used for encoding the sequence of wavelengths used during the
transmission.
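The spread-spectrum idea reduces to the transmitter and receiver deriving the same pseudo-random wavelength sequence from a shared code. The sketch below illustrates this with a seeded pseudo-random generator; the seed, hop count, and wavelength count are illustrative assumptions.

```python
import random

# Sketch of pseudo-random wavelength hopping: transmitter and receiver
# share a code (here, a PRNG seed), so the receiver can reproduce the
# transmitter's wavelength sequence and follow the transmission.
def hop_sequence(seed, n_hops, n_wavelengths):
    rng = random.Random(seed)
    return [rng.randrange(n_wavelengths) for _ in range(n_hops)]

tx = hop_sequence(seed=42, n_hops=8, n_wavelengths=16)
rx = hop_sequence(seed=42, n_hops=8, n_wavelengths=16)
assert tx == rx        # shared code: receiver reconstructs the sequence
```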
I am not interested in the technology itself, and implementation concerns are outside
the scope of this dissertation. Instead, I am particularly interested in the abstract model of
this network. I shall assume that one or more of the technologies outlined above will be
used to implement the proposed interconnect. Under such an implementation, the various
overheads associated with the reconfiguration of the network (such as beam steering, set-
ting up the computer-generated holograms, tuning the transmitters, or sending the fre-
quency code in a frequency hopping implementation, etc.) are lumped together as the
reconfiguration delay d. I assume that the reconfiguration delay, d, is constant most of the
time, but occasionally may be unbounded due to hot spots in applications.
3.1.2.1 Communication Modeling
An important concern is to model the communication time T required to send a mes-
sage from one node to another. I use the communication modeling of Hockney [64]. Hock-
ney's model characterizes the communication time for a point-to-point communication
operation as T = t_s + l_m / r_inf, where t_s is the start-up time, which is equal to the time needed
to send a zero byte message, and includes the time required to prepare the message, such
as adding a header and a trailer; l_m is the length of the message to be transmitted; and r_inf is
the asymptotic bandwidth in Mbytes per second, the maximum bandwidth achiev-
able when the message length approaches infinity. The communication time can be written
as T = t_s + l_m t, where t is the per unit transmission time and is equal to the reciprocal of
r_inf. For the RON (k, N), I amend the model by explicitly including the reconfiguration
delay d that is necessary for a node to configure a link that would connect directly to its
target node(s). The transmission time then becomes T = d + t_s + l_m t.
The time on the fly, l_m t, for small messages is negligible compared to the setup time,
t_s, and the reconfiguration delay, d. In the current generation of parallel computer systems,
the setup time, t_s, is several tens of microseconds [43]. Several researchers are working to
minimize the setup time by using user-level messaging techniques such as Active Mes-
sages (AM) [135] and Fast Messages (FM) [102]. In Chapter 6, I discuss issues regarding
the software overhead component of the communication latency. I utilize the prediction
techniques proposed in this chapter to reduce the communication latency by avoiding
unnecessary memory copying operations at the receiver side of communications.
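As a worked example of the amended model T = d + t_s + l_m t, the sketch below plugs in illustrative parameter values (assumptions for the sake of the example, not measurements from this thesis) and confirms that for a 40-byte message the on-the-fly time is dwarfed by t_s + d.

```python
# Worked example of the amended Hockney model T = d + t_s + l_m * t.
# All parameter values below are illustrative assumptions.
def comm_time(l_m, t_s, r_inf, d=0.0):
    t = 1.0 / r_inf                 # per-byte transmission time (t = 1/r_inf)
    return d + t_s + l_m * t

t_s = 40e-6                         # 40 us start-up time
r_inf = 100e6                       # 100 Mbytes/s asymptotic bandwidth
d = 20e-6                           # 20 us reconfiguration delay

# For a 40-byte message, the on-the-fly time l_m * t is 0.4 us,
# negligible next to t_s + d = 60 us:
T = comm_time(40, t_s, r_inf, d)
assert abs(T - 60.4e-6) < 1e-9
```

This is exactly why hiding d (and t_s) matters for small-message workloads such as LU: the fixed per-message costs dominate T almost entirely.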
In this chapter, I am particularly interested in the techniques that hide the reconfigura-
tion delay, d. For this, and for the first time as far as the author is aware, I propose and evalu-
ate different communication latency hiding predictors at the send side of communications
in message-passing systems using reconfigurable networks so that the reconfiguration
delay can be hidden. In essence, by utilizing such predictors, the hardware communication
latency in reconfigurable interconnects can be effectively hidden by reconfiguring the
communication networks concurrent to the computations.
3.2 Communication Frequency and Message Destination Distribution
Several researchers have investigated the communication behavior of parallel applica-
tions [30, 75, 65, 72, 37]. Chodnekar and his colleagues [30] have developed a traffic char-
acterization methodology for parallel applications. They have considered the inter-arrival
time distribution of messages (send calls), spatial message distribution, and the message
volume in message-passing and shared-memory applications. Kim and Lilja [75] exam-
ined the communication patterns of message-passing parallel scientific programs in terms
of message size, message destination, and generation distributions for the send time,
receive time, and computation time. Hsu and Banerjee [65] analyzed the communication
characteristics of parallel CAD applications on a hypercube. Karlsson and Brorsson [71]
have compared the communication properties of parallel applications in message-passing
applications using MPI, and shared memory applications using TreadMarks [10]. Lahaut
and Germain [37] have shown that in scientific applications written in High Perfor-
mance Fortran (HPF) [88] a large part of communications can be known from the analysis
of the code. This is called static communications, communications that can be known at
compile-time, in contrast to dynamic communications, where communications can be
determined only at run-time.
Essentially, communication properties of parallel applications can be categorized by
the spatial, temporal, and volume attributes of the communications [30, 75, 65]. The tem-
poral attribute of communications in parallel applications characterizes the rate of mes-
sage generations, and the rate of computations. I present the cumulative distribution
function of the inter-send computation times of the applications studied in this thesis in
Chapter 4.
The volume of communications is characterized by the number of messages, and the
distribution of message sizes in the applications. In this chapter, I am particularly inter-
ested in the number of messages. In Chapter 4, I show the distribution of message sizes in
the parallel applications.
One of the communication volume characteristics of parallel applications is the fre-
quency of send messages. I use a number of parallel benchmarks, as introduced in Chapter
2, and extract their communication traces. The processes in these applications use block-
ing and nonblocking standard MPI send primitives, namely MPI_Send, MPI_Isend, and
MPI_Sendrecv_replace [92]. Figure 3.2 illustrates the number of send communication
calls per process in the applications under different system sizes. I executed all applica-
tions once for each different system size and counted the number of send calls for each
process of the applications. Hence, in Figure 3.2, by average, minimum, and maximum, I
mean the average, minimum, and maximum number of send calls taken over all processes
of each application. It is evident that processes in the BT, SP, CG, and QCDMPI applica-
tions have the same number of send communication calls for each different system size.
This is also true for LU, MG, and PSTSWM when the number of processes is four, four
and eight, and a power of two, respectively.
The spatial attribute of communications in parallel applications is characterized by the
distribution of message destinations. It is commonly assumed that the message destina-
tions are evenly distributed among all of the processes, although an individual process may
not see a uniform message destination distribution [75, 30].
Figure 3.2: Number of send calls per process in the applications under different system sizes
In MPI, the send operation (the MPI_Send, MPI_Isend, and MPI_Sendrecv_replace com-
munication calls in the parallel applications studied in this thesis) associates an envelope
with a message. Messages, in addition to the data part, carry information that can be used to
distinguish messages and selectively receive them. This information consists of a fixed
number of fields, which is collectively called the message envelope. These fields are the
source process of a message, source; the destination process of a message, dest; the mes-
sage tag, tag; and the message communicator, comm. The message source is implicitly
determined by the identity of the message sender and need not be explicitly carried by
messages. The other fields are specified by arguments in the send operation. The destina-
tion process is specified by the dest argument. The integer-valued message tag is specified
by the tag argument. This integer can be used by the program to distinguish different types
of messages. A communicator specifies the communication context for a communication
operation. It also specifies the set of processes that share this communication context.
Each communication context provides a separate communication universe. Messages are
always received within the context they were sent, and messages sent in different contexts
do not interfere. The BT, SP, and PSTSWM applications use a number of different com-
municators, including the predefined communicator, MPI_COMM_WORLD, provided by
MPI, while the other parallel applications, CG, MG, LU, and QCDMPI, use only the pre-
defined communicator.
As stated above, a message envelope consists of source, dest, tag, and comm. The
source and tag of a message envelope do not affect the link establishment phase for a mes-
sage transmission to a destination process. Thus, I assigned a different identifier, called the
unique message destination identifier, to each <dest, comm> tuple found in the communi-
cation traces of the applications. For simplicity, from now on, I use the term "message
destination" instead of unique message destination identifier. Figure 3.3 shows the mini-
mum, average, and maximum number of message destinations per process in the applica-
tions under different system sizes. It is evident that processes in all applications
communicate with only a favorite subset of all other processes. Note that processes in the
BT and SP applications, in contrast to the other applications, have the same number of
message destinations under different system sizes (except when N is four). This is also
Figure 3.3: Number of message destinations per process in the applications under different system sizes
true for CG when the number of processes is 8 and 16, and for MG when it is 4 and 8. Meanwhile, in all applications except BT and SP, the number of message destinations increases when the number of processes increases (note the exception cases in PSTSWM and QCDMPI when the number of processes increases from 32 to 36).
Figure 3.4 illustrates the distribution of message destinations in the applications when the number of processes is 64. The BT, SP, CG, PSTSWM, and QCDMPI applications verify the assumption that the message destinations are uniformly distributed among all of the processes. MG shows an almost uniform message destination distribution. However, LU presents three different peaks for message destinations.
Figure 3.5 shows the distribution of message destinations for one of the processes, process zero, of the applications when the number of processes is 64. I choose process zero because it is a favorite destination of all processes and is usually responsible for distributing data and verifying the results of the computation. It is clear that this process tends to communicate with only a favorite subset of all other processes in the applications. I have found similar results for all other processes in each application, as can be seen in Figure 3.4.
3.3 Communication Locality and Caching
I define the terms message destination communication locality and caching in conjunction with this work as follows. By message destination communication locality I mean that if a certain source-destination pair has been used, it will be re-used with high probability by a portion of code that is "near" the place it was used earlier, and that it will be re-used in the near future. If communication locality exists in parallel applications, then it is possible to cache the configuration that a previous communication request has made and reuse it at a later stage. Caching in the context of this discussion will mean that when a communication channel is established it will remain established until it is explicitly destroyed. As already mentioned, in the context of free-space optical interconnects, maintaining an established communication channel does not interfere with communications that are in progress in other parts of the network.
Figure 3.4: Distribution of message destinations in the applications when N = 64
Figure 3.5: Distribution of message destinations in the applications for process zero, when N = 64
In the message-passing programming paradigm, many parallel algorithms are built from loops consisting of computation and communication phases. Therefore, communication patterns may be repetitive. This has motivated researchers to find the communication locality properties of parallel applications [75, 68]. Kim and Lilja [75] have recently shown that there is a locality in message destinations, message sizes, and consecutive runs of send and receive primitives in parallel algorithms. They have proposed and expanded the concept of memory access locality based on the Least Recently Used (LRU) [68] stack model to determine these localities.
In the following subsection, I expand on the work by Kim and Lilja [75] by utilizing the FIFO and LFU heuristics on the applications to see the existence of message destination communication locality, or repetitive message destinations. I use the term hit ratio to establish and compare the performance of these heuristics. If the next message destination is already in the set of message destinations maintained by the LRU, LFU, and FIFO heuristics, I count a hit; otherwise, I count a miss. It is clear that the hit ratio is equal to the number of hits divided by the total number of hits and misses.
3.3.1 The LRU, FIFO and LFU Heuristics
The Least Recently Used (LRU), First-In-First-Out (FIFO), and Least Frequently Used (LFU) heuristics all maintain a set of k (k is the window size) message destinations. If the next message destination is not in the set, then it replaces one of the destinations in the set according to which of the LRU, FIFO or LFU strategies is adopted. The window size, k, corresponds to the number of input/output ports used in RON(k, N). Figure 3.6 shows the results of the LRU, FIFO, and LFU heuristics on the applications when the number of processes is 64. Figure 3.7, Figure 3.8 and Figure 3.9 illustrate the size scalability of these heuristics on the applications. It is clear that the hit ratios in all applications approach 1 as the window size increases. The performance of the FIFO algorithm is almost the same as the LRU for all benchmarks. However, the LFU algorithm has a better performance than the LRU and FIFO heuristics; the exception is the LU benchmark, when k = 2 and N = 16, 32, and 64.
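The three window heuristics and the hit-ratio bookkeeping can be sketched as follows. This is an illustrative Python sketch of my own, not thesis code; `trace` stands for a per-process sequence of message destinations extracted from a communication trace.

```python
from collections import Counter

def hit_ratio(trace, k, policy):
    """Fraction of requests whose destination is already in the k-entry window."""
    window = []        # current set of cached destinations, oldest first
    freq = Counter()   # use counts, needed for LFU victim selection
    hits = 0
    for dest in trace:
        if dest in window:
            hits += 1
            if policy == "LRU":            # a hit refreshes recency order
                window.remove(dest)
                window.append(dest)
        else:
            if len(window) == k:           # window full: evict one entry
                if policy == "LFU":
                    victim = min(window, key=lambda d: freq[d])
                else:                      # LRU and FIFO both evict window[0]
                    victim = window[0]
                window.remove(victim)
            window.append(dest)
        freq[dest] += 1
    return hits / len(trace)
```

Under this formulation LRU and FIFO differ only in whether a hit refreshes an entry's position, which is consistent with their near-identical measured performance, while LFU evicts the least-used destination and so tends to retain the favorite destinations noted above.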
Figure 3.6: Comparison of the LRU, FIFO, and LFU heuristics when N = 64
Figure 3.7: Effects of the scalability of the LRU, FIFO, and LFU heuristics on the BT, SP and CG applications
Figure 3.8: Effects of the scalability of the LRU, FIFO, and LFU heuristics on the MG and LU applications
Figure 3.9: Effects of the scalability of the LRU, FIFO, and LFU heuristics on the PSTSWM and QCDMPI applications
Basically, the LRU, FIFO and LFU heuristics do not predict exactly the next message destination but show the probability that the next message destination is in the message destination set of the LRU, FIFO and LFU heuristics, respectively. For instance, the PSTSWM application shows nearly a 70% hit ratio for a window size of seven under the LRU heuristic when the number of processes is 64. This means that 70% of the time, one of the seven most recent message destinations will be used in the next message. The LRU, FIFO, and LFU heuristics perform better when k is sufficiently large. However, this adds to the hardware complexity, as k links should be set up and remain active before the next message is ready to be sent.
I am interested in having predictors that can predict the next message destination with a high probability, and work under single-port modeling to minimize the cost of hardware implementation. In the following section, I propose a number of novel message destination predictors.
3.4 Message Destination Predictors
As noted earlier, a node sends a message to another node by first establishing a link to the target (hence the reconfiguration delay) and then sending the actual message over the established link. It is obvious that if the link is already in place, then the configuration phase does not enter the picture, with a commensurate saving in the message transmission time. I would like to establish efficient algorithms where the link establishment costs are minimized. The stated objective can be accomplished if the target of the communication operation can be predicted before the message itself is available. In this way, the communication pathway can be established and be ready to be used as soon as the message to be sent becomes available.
There are several ways of accomplishing this. If the communication operation is regular and known, then it is possible that one can determine the destinations and the instances at which these shall be used. I have developed such algorithms for broadcasting/multibroadcasting [1] and discuss them in Chapter 5. However, if the algorithm is not known, as is usually the case for point-to-point communications, the approach mentioned above cannot be used.
Prediction techniques have been proposed in the past to predict the future accesses of sharing patterns and coherence activities in distributed shared memory (DSM) by looking at their observed behavior [96, 77, 73, 133, 34, 107]. These techniques assume that memory accesses and coherence activities in the near future will follow past patterns. Sakr and his colleagues have used time series and neural networks for the prediction of the next memory sharing requests [107]. Dahlgren and his colleagues devised hardware regular stride techniques to prefetch several blocks ahead of the current data block [34]. More elaborate hardware-based irregular stride prefetching approaches have been proposed by Zhang and Torrellas [133]. Kaxiras and Goodman have recently proposed an instruction-based approach which maintains the history of load and store instructions in relation to cache misses and predicts their future behavior [73]. This is in contrast to address-based techniques that keep data-access history for the predictions. Mukherjee and Hill proposed a general pattern-based predictor, cosmos, to learn and predict the coherence activity for a memory block in a DSM [96]. Cosmos makes a prediction in two steps. First, it uses a cache block address to index into a message history table to obtain the <processor, message-type> tuples of the last few coherence messages received for that cache block. Then it uses these <processor, message-type> tuples to index a pattern history table to obtain a <processor, message-type> tuple prediction. In a recent paper, Lai and Falsafi proposed a new class of pattern-based predictors, memory sharing predictors, to eliminate the coherence overhead on a remote access latency by predicting just the memory request messages, those primary messages that invoke a sequence of protocol actions [77]. It improves prediction accuracy over cosmos by eliminating the acknowledgment messages from the pattern tables. It also reduces memory overhead and perturbation in the tables due to message re-ordering. Both works in [96, 77] are adaptations of Yeh and Patt's two-level PAp branch predictor [131]. PAp is a two-level adaptive branch predictor based on the past behavior of the same branch.
In software-controlled prefetching, the programmer or compiler decides when and what to prefetch by analyzing the code and inserting prefetch instructions. Mowry and Gupta [95] have used software-controlled prefetching and multithreading to hide and reduce the latency in shared memory multiprocessors.
As stated above, many prediction techniques have been proposed to reduce or hide the latency of a remote memory access in shared memory systems. However, to the best of my knowledge, no prediction technique has been proposed to predict the next message destination for message-passing systems to hide the latency of the reconfiguration delay in reconfigurable networks.
I explore the effect that a number of heuristics have in predicting the target of a communication request. The set of predictors proposed in this section [2, 3] predict the message destination of a subsequent communication request based on a past history of communication patterns on a per source process basis. These predictors can be used dynamically at the communication assist or network interface with or without the help of the programmer or a compiler.
Actually, I propose two sets of predictors in this thesis: Cycle-based predictors, which are pure dynamic predictors, and Tag-based predictors, which are static/dynamic predictors. In the Cycle-based predictors, Single-cycle, Single-cycle2, Better-cycle, and Better-cycle2, predictions are done dynamically at the network interface without any help from the programmer or compiler. In the Tag-based predictors, Tagging, Tag-cycle, Tag-cycle2, Tag-bettercycle, and Tag-bettercycle2, predictions are done dynamically at the network interface as well, but they require some information to be passed from the program to the network interface. This can be done with the help of the programmer and/or the compiler through inserting instructions such as pre-connect (tag) in the program. The Tag-based predictors can be pure dynamic predictors if another level of prediction is done on the tags themselves at the network interface. This way, there is no need for the program to pass pre-connect (tag) information to the network interface. I leave this approach for future research.
It is worth mentioning that these predictors can be used in any circuit-switched network, including the works proposed in [36, 132]. Dao and his colleagues [36] exploit the communication locality to improve the performance of parallel computers using wave switching, a hybrid switching technique for high performance routers in electronic interconnection networks. Wave switching combines wormhole switching and circuit switching in the same router architecture to reduce the fixed overhead of communication latency by exploiting communication locality. Thus, it is possible to reduce latency for communications that display locality and use pre-established physical circuits. Yuan and others [132] use the communication locality in circuit-switched time-multiplexed optical interconnection networks. They rely upon existing techniques for identifying communication patterns such that their compiled communication algorithms compute the minimal multiplexing degree required for establishing all-optical paths from sources to destinations in such networks.
The predictors can even be useful in reducing the latency in current commercial networks. For example, Myrinet networks [23] have a relatively long routing time compared with the link transmission time. Predictors would allow sending the routing header in advance for the predicted message destination. When the message becomes available, it can be directly transmitted through the network if the prediction was correct, thus reducing latency significantly. In case of a mis-prediction, a message tail is forwarded to tear the path down. Obviously, null messages must be discarded at the destination.
As in the LRU, LFU, and FIFO heuristics, I use the hit ratio to establish and compare the performance of these predictors. As a hit ratio, I define the percentage of times that the predicted message destination was correct out of all communication requests. The hit ratios presented for the performance of the predictors are either the minimum, the average, or the maximum of the hit ratios taken over all nodes of each application.
3.4.1 The Single-cycle Predictor
The Single-cycle predictor is based on the fact that if a group of message destinations are requested repeatedly in a cyclical fashion, then a single port can accommodate these requests by ensuring that the connection to the subsequent message destination in the cycle can be established as soon as the current request terminates. This predictor implements a simple cycle discovery algorithm. Starting with a cycle-head message destination (this is the first message destination that is requested at start-up, or the one that causes a miss), I log the sequence of requests until the cycle-head is requested again. This stored sequence constitutes a cycle, and can be used to predict the subsequent requests. If the predicted message destination coincides with the subsequent requested message destination, then I record a hit. Otherwise, I record a miss and the cycle formation stage commences with the cycle-head being the message destination that caused the miss.
Figure 3.10 illustrates an example of the operation of the Single-cycle predictor. The top trace represents the sequence of requested message destinations, while the bottom trace represents the predicted message destinations according to the Single-cycle predictor. The arrows with the cross represent misses, while the ones with the circle represent hits. The "dash" in place of a predicted message destination indicates that a cycle is being formed, and therefore no predicted message destination is offered (note that this is also added to the misses).
Figure 3.10: Operation of the Single-cycle predictor on a sample request sequence
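The cycle discovery algorithm described above can be sketched as follows; this is an illustrative Python sketch of my own, not the simulator code used in this thesis. During formation no prediction is offered, and every formation step counts as a miss.

```python
class SingleCycle:
    """Single-cycle predictor: discover one cycle and replay it."""

    def __init__(self):
        self.forming = True   # True while a cycle is being logged
        self.head = None      # current cycle-head destination
        self.cycle = []       # logged cycle of destinations
        self.pos = 0          # index of the next prediction in the cycle

    def observe(self, dest):
        """Feed the next requested destination; return True on a hit."""
        if self.forming:
            if self.head is None:
                self.head, self.cycle = dest, [dest]
            elif dest == self.head:
                # cycle closed; next prediction is the element after the head
                self.forming = False
                self.pos = 1 % len(self.cycle)
            else:
                self.cycle.append(dest)
            return False      # no prediction offered during formation
        if dest == self.cycle[self.pos]:
            self.pos = (self.pos + 1) % len(self.cycle)
            return True
        # miss: restart formation with the missed destination as the new head
        self.forming, self.head, self.cycle = True, dest, [dest]
        return False
```

On a purely cyclic trace such as 1, 3, 5 repeated three times, the sketch misses on the first four requests (three formation steps plus the closing request) and hits on the remaining five, and a destination repeated forever incurs exactly two misses, as discussed for cycles of length one below.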
Figure 3.11 shows the behavior of this algorithm. The performance of the Single-cycle predictor is very good on CG, LU, MG (except when N = 4, 8), and on BT and SP (except when N = 4). The Single-cycle predictor behaves poorly on the PSTSWM (except when N = 36, 49) and QCDMPI applications.

The performance of the Single-cycle predictor is much better than the LRU, FIFO and LFU heuristics under single-port modeling for the LU and CG benchmarks, for the MG and PSTSWM applications (except when N = 4, 8), and for BT and SP (except when N = 4). However, the performance for QCDMPI is almost the same. Note that I compare the performance of the predictors with the LRU, LFU, and FIFO heuristics under single-port modeling for the same optical interconnect implementation cost, although the proposed
Figure 3.11: Effect of the Single-cycle predictor on the applications
predictors have higher memory requirements (refer to Section 3.5.1). Figure 3.12 compares the performance of the Single-cycle predictor with the LRU, LFU, and FIFO under single-port modeling when N = 64.
Figure 3.12: Comparison of the performance of the Single-cycle predictor with the LRU, LFU, and FIFO heuristics on the applications under single-port modeling when N = 64
3.4.2 The Single-cycle2 Predictor
In the communication traces of some of the applications, there exist cycles of length one (such as the one composed of the requested message destination 7 in Figure 3.10). For these situations, there will always be two misses until the predictor determines that there is a cycle of length one. The Single-cycle2 predictor is identical to the Single-cycle predictor with the addition that during cycle formation, the previously requested message destination is offered as the predicted message destination. If a miss occurs during cycle formation, the formation phase continues until a cycle is formed. Then and only then do misses cause a new cycle formation phase to begin. I applied the Single-cycle2 predictor to the request sequence of the previous example, as shown in Figure 3.13. As was expected, the Single-cycle2 predictor reacts better to cycles of length one.
Figure 3.13: Operation of the Single-cycle2 predictor on the sample request sequence
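The modification can be sketched as a small change to the Single-cycle logic: while a cycle is being formed, the previously requested destination serves as the prediction. Again, this is an illustrative sketch of my own, not thesis code.

```python
class SingleCycle2:
    """Single-cycle with last-destination prediction during formation."""

    def __init__(self):
        self.forming = True
        self.head = None
        self.cycle = []
        self.pos = 0
        self.prev = None      # previously requested destination

    def observe(self, dest):
        """Feed the next requested destination; return True on a hit."""
        if self.forming:
            hit = (dest == self.prev)   # formation-phase prediction
            if self.head is None:
                self.head, self.cycle = dest, [dest]
            elif dest == self.head:
                self.forming = False
                self.pos = 1 % len(self.cycle)
            else:
                self.cycle.append(dest)
            self.prev = dest
            return hit
        self.prev = dest
        if dest == self.cycle[self.pos]:
            self.pos = (self.pos + 1) % len(self.cycle)
            return True
        # only after a completed cycle does a miss restart formation
        self.forming, self.head, self.cycle = True, dest, [dest]
        return False
```

A destination repeated forever now costs a single miss instead of two: the second request already matches the previous one while it simultaneously closes the length-one cycle.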
Figure 3.14 illustrates the performance of the Single-cycle2 predictor. This predictor has a better performance than the Single-cycle algorithm.
Figure 3.14: Effect of the Single-cycle2 predictor on the applications
3.4.3 The Better-cycle and Better-cycle2 Predictors
In the Single-cycle and Single-cycle2 algorithms, as soon as a message destination breaks a cycle, I discard the cycle and start forming a new cycle with this message destination as the new cycle-head. Then I just rely upon the new cycle to predict the next message destination. The Single-cycle and Single-cycle2 predictors could achieve a better performance if the previous cycle information was not discarded as a new cycle is formed.
In the Better-cycle predictor, each cycle-head has its own cycle. For this, I keep the last cycle associated with each cycle-head encountered in the communication pattern of each process. This means that when a cycle breaks, I keep this cycle in memory for the corresponding cycle-head for later references. When a cycle breaks, if I haven't already seen the new cycle-head, then I form a cycle for it; otherwise, I predict the next message destination based on the members of the cycle associated with this cycle-head that I have from the past in memory. If the predicted message destination coincides with the subsequent requested message destination, then I record a hit. If not, then I record a miss and revise the cycle for this cycle-head. The state diagram of this predictor is shown in Figure 3.15.
Figure 3.15: State diagram of the Better-cycle predictor
The top left state is the "cycle formation phase", initiated with a cycle-head. This is the same as the cycle formation phase in the Single-cycle predictor. Upon a cycle completion, I enter the "cycle prediction phase". In case of a mis-prediction in the "cycle prediction phase", I move back to the "cycle formation phase" if the new cycle-head has not been visited so far (that is, there is no cycle associated with this new cycle-head in the memory). Otherwise, I move forward to the "cycle prediction phase for the new cycle-head". I move back to the "cycle prediction phase" after one complete cycle to continue the predictions for this new cycle-head. In case of a mis-prediction during the first cycle of predictions in the "cycle prediction phase for the new cycle-head", I move to the "cycle-revision phase" to revise the cycle for this new cycle-head. It is clear that after the revision phase, I move to the "cycle prediction phase" for the next cycles of predictions.
Figure 3.16 illustrates the operation of the Better-cycle predictor on the sample request sequence. It is clear that the first cycle associated with cycle-head 1 consists of message destinations 1, 3, 5, and 6. However, in the fourth appearance of this cycle-head a revised cycle forms which contains message destinations 1, 3, and 2.
Figure 3.16: Operation of the Better-cycle predictor on the sample request sequence
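A simplified sketch of this idea keeps one remembered cycle per cycle-head. It is my own illustrative approximation of the state machine in Figure 3.15: it collapses the separate cycle-revision phase into ordinary re-formation, so it captures the reuse of stored cycles but not every transition of the full predictor.

```python
class BetterCycle:
    """Cycle predictor that remembers the last cycle seen for each head."""

    def __init__(self):
        self.memory = {}      # cycle-head -> last cycle discovered for it
        self.forming = True
        self.head = None
        self.cycle = []
        self.pos = 0

    def _restart(self, dest):
        """On a miss, reuse the stored cycle for `dest` if one exists."""
        self.head = dest
        if dest in self.memory:
            self.forming = False
            self.cycle = self.memory[dest]
            self.pos = 1 % len(self.cycle)
        else:
            self.forming = True
            self.cycle = [dest]

    def observe(self, dest):
        """Feed the next requested destination; return True on a hit."""
        if self.forming:
            if self.head is None:
                self.head, self.cycle = dest, [dest]
            elif dest == self.head:
                self.memory[self.head] = self.cycle   # remember this cycle
                self.forming = False
                self.pos = 1 % len(self.cycle)
            else:
                self.cycle.append(dest)
            return False
        if dest == self.cycle[self.pos]:
            self.pos = (self.pos + 1) % len(self.cycle)
            return True
        self._restart(dest)
        return False
```

The benefit shows up when a previously seen cycle returns: on a trace such as 1, 3, 5, 1, 3, 5 interrupted by a run of 7s and then resumed, the stored cycle for head 1 lets prediction restart immediately instead of paying a full re-formation phase, as the Single-cycle predictor would.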
The performance of the Better-cycle predictor on the benchmarks is shown in Figure 3.17. It is evident that its performance is exceptionally better for all benchmarks compared to the Single-cycle predictor, except for the QCDMPI benchmark when N = 25, 32, 36 and 49.
Figure 3.17: Effect of the Better-cycle predictor on the applications
The Better-cycle2 predictor is identical to the Better-cycle predictor with the addition that during the cycle formation and cycle revision phases, the previously requested message destination is offered as the predicted message destination. Figure 3.18 illustrates the operation of the Better-cycle2 predictor on the same sample request sequence.
Figure 3.18: Operation of the Better-cycle2 predictor on the sample request sequence
The Better-cycle2 predictor has a better performance than the Single-cycle, Single-cycle2, and Better-cycle predictors for the QCDMPI benchmark. The performance of this predictor is shown in Figure 3.19. It is worth mentioning that I found that the applications have a very small number of cycle-heads (at most 9) under the Better-cycle and Better-cycle2 predictors and different system sizes. Section 3.5.1 discusses the memory requirements of all predictors proposed in this thesis.
Figure 3.19: Effect of the Better-cycle2 predictor on the applications
3.4.4 The Tagging Predictor
The Tagging predictor assumes a static communication environment in the sense that a particular communication request (send) in a section of code will be to the same message destination with a large probability. Therefore, as the execution trace nears the section of code in question, it can cause the communication subsystem to establish the connection to the target node before the actual communications request is issued. This can be implemented with the help of the compiler or by the programmer through a pre-connect (tag) operation which will force the communication system to establish the communication connection before the actual communication request is issued. As noted earlier, for this predictor and the other Tag-based predictors, I can avoid the help from the compiler or the programmer by predicting the tag itself at the network interface. This way, there is no need for the program to pass pre-connect (tag) information to the network interface. However, the performance of these 2-level Tag-based prediction techniques has not been evaluated yet.
I attach a different tag (this is different than the tag in an MPI communication call; it may be a unique identifier or the program counter at the address of the communication call) to each of the communication requests found in the applications. This tag is passed to the communication subsystem by the pre-connect (tag) operation. To this tag, and at the communication assist, I assign the requested message destination the first time a link is established. A hit is recorded if, in subsequent encounters of the tag, the requested message destination is the same as the one already associated with the tag. Otherwise, a miss is recorded and the tag is assigned the newly requested message destination.
The performance of the Tagging predictor is presented in Figure 3.20. As can be seen, the Tagging predictor results in excellent performance (hit ratios in the upper 90%) for all the application benchmarks except CG, PSTSWM, and QCDMPI. The reason is that these benchmarks include send operations with message destinations calculated based on loop variables. Thus, the same section of code cycles through a number of different message destinations. As we have seen earlier, the Better-cycle and Better-cycle2 predictors are excellent in discovering such cyclic occurrences for the CG and PSTSWM benchmarks. Meanwhile, the Better-cycle2 predictor has a better performance for the QCDMPI benchmark compared to the Tagging predictor.
Figure 3.20: Effects of the Tagging predictor on the applications
3.4.5 The Tag-cycle and Tag-cycle2 Predictors
The Tagging predictor does not have a good performance on the CG, PSTSWM, and QCDMPI benchmarks, while the Single-cycle and Single-cycle2 predictors showed good results for the CG benchmark. I combine the Tagging algorithm with the Single-cycle algorithm and call it the Tag-cycle algorithm.
In the Tag-cycle predictor, I attach a different tag to each of the communication requests found in the application benchmarks and do a Single-cycle discovery algorithm on each tag. To this tag, and at the communication assist, I assign the requested message destination, to be called the tagcycle-head message destination (this is the first message destination that is requested at this tag, or the one that causes a miss). I log the sequence of the requests at this tag until the tagcycle-head is requested again. This stored sequence constitutes a cycle at each tag, and can be used to predict the subsequent requests. A hit is recorded if, in subsequent encounters of the tag, the requested message destination is the same as the predicted one in the cycle. If not, then I record a miss and the cycle formation stage begins with the tagcycle-head being the message destination that caused the miss. The Tag-cycle predictor performs exceptionally well across all the benchmarks except for the QCDMPI benchmark, as shown in Figure 3.21.
Figure 3.21: Effects of the Tag-cycle predictor on the applications
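The combination can be sketched as one Single-cycle discovery instance per tag; the inner class below reproduces the Single-cycle logic in compact form. As before, this is an illustrative sketch of my own, not thesis code.

```python
class TagCycle:
    """Tag-cycle predictor: a Single-cycle discovery instance per tag."""

    class _Cycle:
        def __init__(self):
            self.forming, self.head, self.cycle, self.pos = True, None, [], 0

        def observe(self, dest):
            if self.forming:
                if self.head is None:
                    self.head, self.cycle = dest, [dest]
                elif dest == self.head:           # cycle at this tag closed
                    self.forming, self.pos = False, 1 % len(self.cycle)
                else:
                    self.cycle.append(dest)
                return False
            if dest == self.cycle[self.pos]:
                self.pos = (self.pos + 1) % len(self.cycle)
                return True
            self.forming, self.head, self.cycle = True, dest, [dest]
            return False

    def __init__(self):
        self.per_tag = {}   # tag -> its own cycle-discovery state

    def observe(self, tag, dest):
        """Feed a (tag, destination) request; return True on a hit."""
        return self.per_tag.setdefault(tag, TagCycle._Cycle()).observe(dest)
```

Because each tag keeps its own state, a call site that cycles through destinations driven by a loop variable is predicted after one pass through its loop, independently of what other call sites do in between.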
The Tag-cycle2 predictor is identical to the Tag-cycle predictor with the addition that during cycle formation, similar to the Single-cycle2 predictor, the previously requested message destination is offered as the predicted one. The performance of the Tag-cycle2 predictor, as shown in Figure 3.22, is better than the Tagging and Tag-cycle predictors for all benchmarks.
Figure 3.22: Effects of the Tag-cycle2 predictor on the applications
3.4.6 The Tag-bettercycle and Tag-bettercycle2 Predictors
The Better-cycle and Better-cycle2 algorithms have better performance on the parallel applications than the Single-cycle and Single-cycle2 algorithms. Therefore, I combine the Better-cycle and Better-cycle2 algorithms with the Tagging algorithm to get better performance than the Tag-cycle and Tag-cycle2 algorithms. I call these the Tag-bettercycle and Tag-bettercycle2 predictors. The performance of these two predictors is shown in Figure 3.23 and Figure 3.24.
Figure 3.23: Effects of the Tag-bettercycle predictor on the applications
In the Tag-bettercycle predictor, I attach a different tag to each of the communication requests found in the benchmarks and do a Better-cycle discovery algorithm on each tag. To this tag, and at the communication assist, I assign the requested target node, to be called the tagbettercycle-head node. The Tag-bettercycle2 predictor is identical to the Tag-bettercycle predictor with the addition that during cycle formation, similar to the Better-cycle2 predictor, the previously requested message destination is offered as the predicted message destination. The performance of Tag-bettercycle for the QCDMPI benchmark is better than the Tag-cycle algorithm, but not better than the Tag-cycle2 predictor. However, the Tag-bettercycle2 predictor is superior to all other predictors for all parallel benchmarks. Moreover, I found that the applications have a very small number of tagbettercycle-heads (at most 3) under the Tag-bettercycle and Tag-bettercycle2 predictors and different system sizes.
Figure 3.24: Effects of the Tag-bettercycle2 predictor on the applications
3.5 Predictors' Comparison
Figure 3.25 presents a comparison of the performance of the predictors presented in
this chapter when the number of processors is 64, 32 and 36, and 16, respectively. It is evident
that the Tag-bettercycle2 predictor has the best overall performance for all applications
(except for QCDMPI when the number of processes is 16 and 64, where Better-cycle2
has better performance) and its hit ratio is consistently very high. It is also clear
that under single-port modeling, the proposed predictors outperform the classical LRU,
LFU, and FIFO heuristics.
3.5.1 Predictors' Memory Requirements
Table 3.1 compares the maximum memory requirement of the proposed message destination
predictors on the application benchmarks when the number of processors is 64. I
have found that the memory requirement of the predictors decreases gradually when the
number of processes decreases. The numbers in the table are the multiplication factor for
the amount of storage needed to maintain the message destination and its communicator.
Having 64 processes in this case study, and at most 4 different communicators in the applications,
one needs only one byte of storage per message destination and its
communicator.
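The one-byte figure follows from simple bit arithmetic; the helper below is my own illustration of that count, not code from the thesis:

```python
import math

def entry_bits(num_processes, num_communicators):
    """Bits needed to encode one (message destination, communicator) pair:
    ceil(log2(P)) bits for the destination plus ceil(log2(C)) for the
    communicator."""
    return (math.ceil(math.log2(num_processes))
            + math.ceil(math.log2(num_communicators)))

# 64 processes -> 6 bits, 4 communicators -> 2 bits: 8 bits, one byte.
bits = entry_bits(64, 4)
assert bits == 8
```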
Table 3.1: Memory requirements (in bytes) of the predictors when N = 64
It is quite clear that the memory requirements of the predictors are very low. That makes
them very attractive for implementation on the communication assist or network interface.
Comparatively, the Better-cycle and Tag-bettercycle predictors have slightly higher memory
requirements than the other predictors. Although the classical LRU, LFU, and FIFO
heuristics need less memory, as stated earlier, the beauty of the proposed predictors lies in
the fact that they operate under single-port modeling. That is, only one communication
channel is available at any time, and this is reconfigured on demand. This brings the cost
of optical interconnect implementation to a minimum. The storage requirements of the
predictors have been found using the following formulae:
[Table 3.1: rows are the Single-cycle(2), Better-cycle(2), Tagging, Tag-cycle(2), and Tag-bettercycle(2) predictors; columns are the BT, SP, CG, MG, LU, QCD, and PSTSWM benchmarks. The entries are small, ranging from a few bytes up to 297 bytes (PSTSWM).]
Mem_Single-cycle(2) = Mem_Better-cycle(2) = Maximum cycle length x Maximum number of cycle-heads    (3.2)
Mem_Tagging = Maximum number of tags    (3.3)
Mem_Tag-cycle(2) = Mem_Tag-bettercycle(2) = Mem_Tagging x Maximum cycle length of each tag    (3.4)
3.6 Using Message Predictors
In this section, I briefly discuss how a message destination predictor can be used and
integrated into the network interface. Predictors would reside beside the communication
assist or network interface and accelerate the reconfiguration phase of the interconnect.
They monitor the message destination patterns of their host node and make a prediction
according to their prediction algorithms. Then, the network interface uses the predictions
to establish the links to its final message destinations.
As stated above, the predictors would execute on the communication assist of each
node of the parallel machine, and predict the message destinations for communications
originating at the node on which they reside, based on the past history of communications.
The cycle-based predictors (Single-cycle, Single-cycle2, Better-cycle, and Better-cycle2)
do not need any help from the compiler or programmer. However, as stated earlier,
the tag-based predictors (Tagging, Tag-cycle, Tag-cycle2, Tag-bettercycle, and
Tag-bettercycle2) require an interface to pass some information from the program
to the network interface. With simple help from the programmer or compiler, this can be
done by inserting pre-connect (tag) instructions in the program well above each specific
send communication operation, but evidently after the previous send communication
operation.
Determining when to perform the path setup action (reconfiguration phase) is quite
simple. Basically, the predictors should map the prediction into the path setup action when the
previous communication has terminated. Thus, as soon as the previous message transmission
is complete, the communication assist reconfigures the link to the next message destination.
It is clear that upon a mis-prediction, the ongoing reconfiguration, which is not
correct and may or may not have completed by the time of the mis-prediction due to a
shorter inter-send computation time (to be discussed in Chapter 4), immediately stops and
a new reconfiguration takes place.
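The timing rule implied by this policy can be captured in a short function. This is my own sketch of the latency the application would observe under the scheme described (speculative setup begins when the previous transmission completes); the function and parameter names are illustrative:

```python
def exposed_reconfig_delay(predicted, actual, lead_time, reconfig_delay):
    """Reconfiguration latency the application actually sees.

    predicted, actual: predicted and requested message destinations.
    lead_time: computation time available before the next send is issued.
    reconfig_delay: full cost of one interconnect reconfiguration.
    """
    if predicted == actual:
        # Correct prediction: the speculative setup overlaps with the
        # computation; only the uncovered remainder is exposed.
        return max(0.0, reconfig_delay - lead_time)
    # Mis-prediction: the wrong setup is aborted at send time and a full
    # reconfiguration toward the true destination must be paid.
    return reconfig_delay
```

With a 25-microsecond reconfiguration delay, a correct prediction with 30 microseconds of lead time exposes no latency at all, while 10 microseconds of lead time still hides 10 of the 25 microseconds.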
3.7 Summary
Interconnection networks are still a source of bottleneck for high-performance communications
in massively parallel environments. In this chapter, I introduced a reconfigurable
interconnection network that could alleviate the communication problems in such
environments.
In order to benefit from such interconnects effectively, the reconfiguration delay should be
hidden. For this, I analyzed the communication properties of some parallel applications in
terms of communication frequency and message destination distributions. Using classical
memory hierarchy heuristics, I found that message destinations display a form of locality.
Given this message destination locality in parallel applications, I proposed a number of
predictors that can be used to accurately predict the message destination of the subsequent
communication request. The proposed predictors would execute on the communication
assist of each node of the parallel machine. The performance of the proposed predictors,
especially Better-cycle2 and Tag-bettercycle2, is very good, and they could effectively
hide the hardware communication latency by reconfiguring the communications network
concurrently with the computation.
For these predictors to be used efficiently, I shall argue, in Chapter 4, that at least in the
application benchmarks studied, there is enough computation preceding a communication
request such that the predictors can effectively hide the reconfiguration cost [43].
Chapter 4
Reconfiguration Time Enhancements Using Predictors
To reconfigure the optical interconnect concurrently with the computation, or to speculatively
set up the path in electronic interconnects, two conditions are necessary: (1) an
accurate prediction of the destination; (2) enough lead time so that the reconfiguration of
the interconnect (or the path setup phase) can be completed before the communication request arrives.
In Chapter 3, I utilized the message destination locality property of parallel applications
to devise a number of heuristics that can be used to "predict" the target of subsequent
communication requests. This technique can be applied directly to reconfigurable interconnects
to hide the communication latency by reconfiguring the communication network
concurrently with the computation.
I present the pure execution times of the computation phases of the parallel benchmarks
on the IBM Deep Blue machine at the IBM T. J. Watson Research Center, using its
high-performance switch under the user space mode. This chapter contributes by arguing
that, by comparing the inter-communication computation times of these parallel benchmarks
with some specific reconfiguration times, most of the time we are able to fully utilize
these computation times for the concurrent reconfiguration of the interconnect when
we know, in advance, the next target using one of the proposed high hit-ratio target prediction
algorithms introduced in Chapter 3.
In this chapter, I first show the distribution of message sizes of the applications in Section
4.1. In Section 4.2, the pure inter-send computation times of the parallel applications
on an IBM SP2 machine are presented. I present the performance enhancements of the proposed
predictors on the application benchmarks for the total reconfiguration time in Section
4.3. In Section 4.4, I discuss how the predictors at the send side affect the receive side
of communications. Finally, I conclude this chapter in Section 4.5.
4.1 Distribution of Message Sizes
The volume of communications is characterized by the number of messages and the
distribution of message sizes in the applications. I presented the number of messages in
Chapter 3. In this chapter, I am particularly interested in the distribution of message sizes
in the applications. In Section 4.3, I use the size of messages in the applications to calculate
the message transfer delay time. Figure 4.1 through Figure 4.4 illustrate the distribution
of message sizes of all applications under different system sizes. The MG,
PSTSWM, SP, and BT applications use more distinct message sizes in their communication
calls than the other applications. The CG, LU, and QCDMPI applications use only a few distinct
message sizes.
4.2 Inter-send Computation Times
In Section 4.3, I shall examine the effectiveness of the proposed predictors. I shall
quantify the ability of the proposed predictors to hide the reconfiguration delays. For
this, I need to know the pure computation times between any two send communication
operations.
I did experiments on a fast machine to establish the inter-send computation times and
the effects of the heuristics on the total reconfiguration delay. I used the IBM SP2 Deep
Blue machine at the IBM T. J. Watson Research Center, a 30-node machine with 160 MHz
P2SC thin nodes, 256 MB RAM, and a second-generation high-performance switch, and ran
the suite of applications, one process on each node under the user space mode, when I was
the only user of this machine. This avoided any task switching that might have affected my
measurements. My measurements determined a lower bound on the inter-send computation
times (i.e., the time devoted to computation between any two send communication
calls).
I excluded all timing overheads in the profiling codes to compute the execution times
of the computation and communication phases of the parallel application benchmarks. The
inter-send computation measurements excluded any overhead associated with any other
Figure 4.1: Distribution of message sizes of the applications when N = 4
Figure 4.2: Distribution of message sizes of the applications when N = 9 for BT and SP, and 8 for CG, MG, LU, PSTSWM, and QCDMPI
Figure 4.3: Distribution of message sizes of the applications when N = 16
Figure 4.4: Distribution of message sizes of the BT, SP, PSTSWM, and QCDMPI applications when N = 25
communication primitives (e.g., receive communication calls, collective communications).
Thus it can be considered as a lower bound on the pure computation time. In Appendix A,
I explain how the pure inter-send computation times have been computed.
The temporal attribute of inter-send computations in parallel applications characterizes
the rate of computations. The inter-arrival times of the computation phases can be used to
obtain the cumulative distribution function (CDF) of the computation times. The CDF of
the computation times can then be used for curve fitting to generate the inter-arrival times
of computation times for simulation purposes. Figure 4.5 presents the cumulative distribution
function of the inter-send computation times for node zero of the applications (16
nodes for CG, MG, and LU; 25 nodes for BT, SP, PSTSWM, and QCDMPI). Note that I
have found similar cumulative distribution function plots for other system sizes.
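An empirical CDF of the kind plotted in Figure 4.5 is straightforward to compute from a trace of inter-send times. The helper below is a generic sketch (my own names and sample values, not the thesis data):

```python
def empirical_cdf(samples):
    """Empirical CDF of inter-send computation times: returns the sorted
    sample points paired with the fraction of samples <= each point."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

# Hypothetical inter-send gaps in microseconds.
cdf = empirical_cdf([12.0, 40.0, 25.0, 25.0])
# Fraction of gaps no longer than 25 microseconds:
frac_le_25 = max(f for x, f in cdf if x <= 25.0)
```

Such a function, fitted to a parametric curve, can then drive a simulator's inter-arrival time generator as described above.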
Figure 4.5: Cumulative distribution function of the inter-send computation times for node zero of the application benchmarks when the number of processors is 16 for CG, MG, and
LU, and 25 for BT, SP, QCDMPI, and PSTSWM.
Table 4.1 shows the minimum pure inter-send computation times of the applications
under different system sizes. Note that LU, MG, and CG run only on a power-of-two number
of processors. The inter-send computation times for the CG (4 nodes) and QCDMPI
application benchmarks are quite large, while all other applications have a minimum of
less than 23 microseconds of pure computation time.
Table 4.1: Minimum inter-send computation times (microseconds) in the NAS Parallel Benchmarks, PSTSWM, and QCDMPI when N = 4, 8, 9, 16, and 25
[Table columns: 4 nodes, 8 nodes (9 for BT, SP), 16 nodes, 25 nodes]
IBM Deep Blue uses a state-of-the-art high-performance CPU, the Power2-Super (P2SC)
microprocessor, in its nodes. The nodes are interconnected via an adapter to a high-performance,
multistage, packet-switched network for interprocessor communications. I am
interested in a rough comparison between the pure inter-send computation times of
the applications running on such powerful machines and the current state-of-the-art reconfiguration
delay associated with optical interconnects. Researchers in optical engineering
are using different approaches to design reconfigurable interconnects [103, 81]. In [103],
the authors report a 25-microsecond reconfiguration delay for their experimental reconfigurable
interconnects. Based on these reports, I compare the pure computation times of
the application benchmarks with a 25-microsecond reconfiguration time, and with reconfiguration
times of 10, 5, and 1 microseconds as a measure of future advancements in the
area of reconfigurable interconnects. Figure 4.6 presents the distribution of the inter-send
computation times of the different applications when the computation times are more than 5,
10, and 25 microseconds and the number of processors is 4, 8 or 9, 16, and 25.
Examining the distribution of the inter-send times reveals that they are quite widely
distributed. All applications have nearly 100% of inter-send computation times that are
greater than 5 microseconds. For the BT, SP, LU, MG, and CG (except 4 nodes) application
benchmarks, between 60% and 80% of the computation times are above 25 microseconds.
The PSTSWM and QCDMPI application benchmarks have nearly 100% of inter-send
computation times greater than 25 microseconds. It is evident that the majority of
the reconfigurations can proceed in parallel with the computation and be readied before
the end of the computation. For the cases where the computation time is not sufficiently
long to completely hide the reconfiguration, it effectively reduces the reconfiguration cost
by the corresponding length of time.
4.3 Total Reconfiguration Time Enhancement
I assume a multicomputer with nodes similar to the thin nodes of an IBM SP2 system,
but with a reconfigurable optical interconnect which has a reconfiguration delay d (d = 1,
5, 10, 25 microseconds). It is interesting to see the effectiveness of the proposed predictors
on such a multicomputer system. Specifically, I shall quantify the ability of the proposed
predictors to hide the reconfiguration delays. For the calculations used to quantify the
reconfiguration hiding capabilities of the predictors, I use the lower bound of the inter-send
computation times.
Figure 4.7 illustrates different scenarios for message transmission in the multicomputer
with the reconfigurable optical interconnect. Note that as soon as a send call is
issued, the message can be sent to the destination if the link is already established. Reconfiguration
is started as soon as the message is delivered to the destination. Thus, the
message-transfer-delay (the delay associated with the transfer of a message) reduces the
Figure 4.6: Percentage of the inter-send computation times for different benchmarks that are more than 5, 10, and 25 microseconds when N = 4, 8 or 9, 16, and 25.
amount of time available before the next send call is issued. For this, I subtract the
message-transfer-delay (for the specific message size) from the corresponding inter-send
time and call the remaining time the available-time. This allows me to compute the lower
bound of the times that can be hidden. For each message-transfer-delay calculation, I use
the corresponding message size and a one Gigabyte per second communication channel.
If the available-time is greater than zero, as in Figure 4.7(a) (that is, the
message-transfer-delay is less than the corresponding inter-send time), and it is more
than the reconfiguration-delay, then a correct prediction would help completely hide the
reconfiguration-delay. If the available-time is greater than zero, as in Figure 4.7(b), but it
is less than the reconfiguration-delay, then the part of the reconfiguration-delay equal to the
available-time can be hidden. However, if the available-time is less than zero, as in
Figure 4.7(c) (that is, the message-transfer-delay is greater than the corresponding
inter-send time), then no part of the reconfiguration-delay can be hidden.
Figure 4.7: Different scenarios for message transmission in a multicomputer with a reconfigurable optical interconnect: (a) when the message-transfer-delay is less than the inter-send time, and the available time is larger than the reconfiguration-delay; (b) when the message-transfer-delay is less than the inter-send time, and the available time is less
than the reconfiguration-delay; (c) when the message-transfer-delay is larger than the inter-send time
The algorithm used to obtain the time spent in reconfiguring the interconnect with and
without applying the predictors is given by the following pseudocode. The
total_original_reconfiguration is the sum of the reconfiguration delays encountered in the
applications' run-time. The total_new_reconfiguration is the sum of the reconfiguration
delays encountered in the applications' run-time when predictions are used to hide them
within the inter-send computation times. The reconfiguration_ratio is the ratio of
total_new_reconfiguration over total_original_reconfiguration. It is clear that the lower this
ratio, the better the predictor's capability to hide the reconfiguration delay.
total_new_reconfiguration = 0.0;
total_original_reconfiguration = 0.0;
for each inter_send_computation {
    available_time = inter_send_computation - message_transfer_delay;
    if (available_time < 0) {
        total_new_reconfiguration += reconfiguration_delay;
        total_original_reconfiguration += reconfiguration_delay;
    } else {
        if (hit) then
            if (available_time < reconfiguration_delay) then
                total_new_reconfiguration += reconfiguration_delay - available_time;
            else
                ; /* the reconfiguration delay is completely hidden */
        else
            total_new_reconfiguration += reconfiguration_delay;
        total_original_reconfiguration += reconfiguration_delay;
    }
}
reconfiguration_ratio = total_new_reconfiguration / total_original_reconfiguration
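The pseudocode above translates directly into a small runnable function; the tuple layout and function name here are my own:

```python
def reconfiguration_ratio(events, reconfig_delay):
    """Ratio of post-prediction to original total reconfiguration time.

    `events` is a list of (inter_send_computation, message_transfer_delay,
    hit) tuples, one per send; `hit` is True when the predictor guessed
    the destination correctly.
    """
    total_new = total_original = 0.0
    for inter_send, transfer_delay, hit in events:
        total_original += reconfig_delay
        available = inter_send - transfer_delay
        if available <= 0 or not hit:
            # No time available to hide anything, or a mis-prediction:
            # the full reconfiguration delay is exposed.
            total_new += reconfig_delay
        elif available < reconfig_delay:
            # Part of the delay is hidden by the available computation.
            total_new += reconfig_delay - available
        # else: the delay is completely hidden; nothing is added.
    return total_new / total_original
```

For example, with a 25-microsecond delay, a correctly predicted send with 40 microseconds of available time contributes nothing, while one with only 20 microseconds available contributes 5 microseconds to the exposed total.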
Figure 4.8 through Figure 4.11 illustrate the reconfiguration-ratio, the average ratio of
the total new reconfiguration delay (after applying predictions) over the total original
reconfiguration delay, for each application benchmark under two different CPU speeds and
four different reconfiguration delays. I present the results for two different CPU speeds:
one for the current P2SC thin nodes, and one for a 10 times faster CPU as a measure of
future CPUs. The results are shown for the best predictors, Better-cycle2 and
Tag-bettercycle2. In these figures, shorter bars are better. For the sake of completeness, I have
included the results for the LRU, LFU, and FIFO heuristics under single-port modeling (recall
Figure 4.8: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 4 nodes
(shorter bars are better)
Figure 4.9: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and
a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 9 nodes for BT and SP, 8 nodes for other applications (shorter bars are better)
Figure 4.10: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 16 nodes
(shorter bars are better)
Figure 4.11: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 25 nodes
(shorter bars are better)
that the LRU, LFU, and FIFO heuristics under single-port modeling predict the next destination
to be the same as the previous message destination). It is clear that the Better-cycle2
and Tag-bettercycle2 predictors outperform the LRU/LFU/FIFO heuristics. The
Tag-bettercycle2 predictor improves the total reconfiguration delay more than the Better-cycle2
predictor, especially when the number of processors is 4 or 9. Under the
Tag-bettercycle2 predictor, the majority of reconfiguration delays in the CG, MG, and LU
benchmarks can be hidden. Meanwhile, the reconfiguration-ratio for BT and SP decreases
from 0.4 to 0.13 when the number of nodes increases from 4 to 25. QCDMPI has a
reconfiguration-ratio between 0.3 and 0.5. However, the PSTSWM application shows a
consistent reconfiguration-ratio of near 0.6 (except when N = 4). It is also evident that the
ratios increase with a faster CPU for the same reconfiguration delay. However, the reconfiguration
delay time may also decrease in the future. In this respect, it is informative to
compare the bar graphs under different reconfiguration delays and processor speeds. From
the plots for BT, SP, QCDMPI, and PSTSWM, it seems that the reconfiguration delay is
not a factor. It means that either the inter-send computation times are so short that they
cannot hide the reconfiguration delays, or they are long enough that they can hide large
reconfiguration delays.
In general, the results are consistent with the fact that we can hide most of the reconfiguration
delays using one of the proposed high hit-ratio predictors. Figure 4.12 shows a
summary of the average ratio of the total new reconfiguration delay over the total original
reconfiguration delay with the current generation and a 10 times faster CPU when applying
the Tag-bettercycle2 predictor on the benchmarks for d = 25 microseconds, A class for
NPB, and under different system sizes.
4.4 Predictors' Effect on the Receive Side
It is interesting to discover the effect of applying the heuristics at the send side of communications
on the receiving sides, and hence on the total execution time. Using one of the
high hit-ratio predictors reduces the total reconfiguration delay. When this happens at the
sender sides, most of the time the messages are delivered sooner at the receiver sides. If
Figure 4.12: Summary of the average ratio of the total reconfiguration time after hiding over the total original reconfiguration time with the current generation and a 10 times
faster CPU when applying the Tag-bettercycle2 predictor on the benchmarks with d = 25 microseconds, A class for NPB, and under different system sizes
the receive calls have been issued after the message has arrived, there would be no gain.
However, if they are issued earlier, then there would be a performance enhancement on the
receiving side, and therefore on the whole execution time. This is shown in Figure 4.13.
I have used the following strategy for discovering the number of times that the receive
calls are issued earlier than their corresponding send calls. I synchronized the timing
traces of each node of these applications. I have considered the times just before the send
and receive calls are issued. In the case of blocking and non-blocking send calls, the times just
before the calls (MPI_Send and MPI_Isend) have been taken into account. That is the time
that the message is ready to be sent over. For the blocking receive call (MPI_Recv), I did
the same. That is the time that the receiver is ready to get the message. However, for the
non-blocking receive call (MPI_Irecv), I consider the time when the wait call (MPI_Wait)
is issued for the corresponding receive call (MPI_Irecv). This gives us the worst-case scenario
for the number of times the receive calls are issued before their corresponding send
calls.
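Given synchronized traces, the count itself reduces to a pairwise timestamp comparison. The sketch below is my own illustration of that step, with hypothetical timestamps; it assumes the traces have already been matched into send/receive pairs:

```python
def count_early_receives(send_times, recv_times):
    """Worst-case count of receives posted before their matching sends.

    send_times[i]: synchronized timestamp taken just before MPI_Send or
    MPI_Isend for message i.
    recv_times[i]: timestamp just before MPI_Recv, or at MPI_Wait for an
    MPI_Irecv (the worst case for non-blocking receives).
    """
    return sum(1 for s, r in zip(send_times, recv_times) if r < s)

# Hypothetical matched pairs: two of the three receives are posted early.
early = count_early_receives([10.0, 20.0, 30.0], [5.0, 25.0, 28.0])
```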
Figure 4.13: Heuristics' effects on the receiving side
I present the average percentage of the times that the receive calls are issued earlier
than their corresponding send calls for the CG, SP, and PSTSWM benchmarks in Figure
4.14. The results hold for d = 1, 5, 10, and 25 microseconds. The LU and MG benchmarks
use MPI_ANY_SOURCE [92] for some of their receive calls, and hence one cannot
identify the sources of messages to compare with. What I have calculated is a lower bound
on the improvement. A trace-driven simulator should be written for the exact calculation of
the improvement.
4.5 Summary

In order to efficiently use the predictors proposed in Chapter 3 to hide the hardware
latency of the reconfigurable interconnects, enough lead time should exist such that the
reconfiguration of the interconnect can be completed before the communication request
arrives. For this, I presented the distribution of execution times of the computation phases
of the parallel application benchmarks on an IBM SP2 machine. The results showed that
most of the time, we are able to fully utilize these computation times for the concurrent
reconfiguration of the interconnect when we know, in advance, the next target using one of
the proposed high hit-ratio target prediction algorithms.
Figure 4.14: Average percentage of the times the receive calls are issued before the corresponding send calls
I also presented the performance enhancements of the best predictors, Better-cycle2
and Tag-bettercycle2, on the application benchmarks for the total reconfiguration time.
Finally, I considered the effects that using message destination predictors have on the
receiving sides of communications. I showed that up to 50% of the time, applications
might benefit from situations where they post early receive calls. However, a trace-driven
simulator should be written for the exact calculation of the improvement.
I did not evaluate the application speedup when using the predictors on the applications.
Rough estimates point to minimal speedup gains. This is because the parallel applications
studied are very coarse-grained and hence the communication-to-computation ratio is small.
Table 4.2 shows the communication-to-computation ratios for the applications under different
system sizes. These applications have been written to avoid a lot of communication
between pair-wise nodes, mostly because of the high communication latency in the current
generation of parallel systems [15], and partly because of the algorithms themselves. As
shown in Table 4.2, the communication-to-computation ratio increases when the number
of nodes increases. This means that we might have better speedup for these applications
at larger system sizes. However, the inter-send computation times may decrease and
thus reconfiguration delays cannot be hidden.
Table 4.2: Communication-to-computation ratio of the applications
[Table columns: 4 nodes, 8 nodes (9 for BT, SP), 16 nodes, 25 nodes]
In this chapter and Chapter 3 of this dissertation, I am particularly interested in the
point-to-point communications in parallel applications. In Chapter 5, I discuss efficient
collective communication algorithms for such reconfigurable interconnects.
Chapter 5
Collective Communications on a Reconfigurable Interconnection Network
Collective communications are basic patterns of interprocessor communication that
are frequently used as building blocks in a variety of parallel algorithms. Proper implementation
of collective communication algorithms is a key to the overall performance of
parallel computers.
Free-space optical interconnection is used to fashion a reconfigurable network. Since network reconfiguration is expensive compared to message transmission in such networks, latency hiding techniques can be used to increase the performance of collective communication operations.
I present and analyze a broadcasting/multi-broadcasting algorithm [20] that utilizes latency hiding and reconfiguration in the network, RON(k, N), to speed up these operations. As the first contribution of this chapter, the analysis of the broadcasting algorithm includes a closed formulation that yields the termination time. Secondly, I contribute by proposing a combined total exchange algorithm based on a combination of the direct [109, 110] and standard exchange [71, 24] algorithms. This ensures a better termination time than what can be achieved by either of the two algorithms. Meanwhile, known algorithms for scattering and all-to-all broadcasting from the literature [40, 21] have been adapted to the network.
5.1 Introduction
Communication operations may be either point-to-point, as discussed so far, or collective, in which more than two processes participate. The study of classical algorithms brings up some generic communication patterns, collective communications, that appear very often in parallel algorithms [70, 76]. Collective communications are common basic patterns of interprocessor communication that are frequently used as building blocks in a variety of parallel algorithms. Proper implementation of these basic communication operations on various parallel architectures is key to the efficient execution of the parallel algorithms that use them, and hence to the overall performance of the parallel computers.
Whether communication operations are programmed by the user (low-level routines), contained in a library such as MPI [92, 93] and Parallel Virtual Machine (PVM) [115], or generated by a compiler to translate a high-level data parallel language such as High Performance Fortran (HPF) [52], their latency directly affects the total computation time of the parallel application. The growing interest in collective communication operations is evident from their inclusion in MPI.
Collective communication operations can be used for data movement, process synchronization, or global operations, as shown in Figure 5.1. Data movement operations include broadcasting, multi-broadcasting, multicasting, scattering, gathering, multinode broadcasting, and total exchange. In broadcasting, a node sends its unique message to all other nodes. Broadcasting is used in a variety of linear algebra algorithms [76], such as matrix-vector multiplication, matrix-matrix multiplication, LU-factorization, and Householder transformations. It is also used in database queries and transitive closure algorithms. In multi-broadcasting, a node broadcasts a number of messages to all other nodes. In multicasting, a special case of broadcasting, a node sends its unique message to a subset of all the other nodes. In scattering, a node sends a different message to all other nodes; it is basically used for distribution of data among the processors. Gathering is the exact reverse of scattering; that is, a node receives a different message from all other nodes. I will not discuss it here as a separate operation. In multinode broadcasting, all nodes send their unique messages to all other nodes. In total exchange, all nodes send their different messages to all other nodes. Personalized communications (scattering, gathering, and total exchange) are used, for instance, in transposing a matrix, in the conversion between different data structures, or in neural network simulations. It is worth mentioning that the terminology is not yet standard. For example, broadcasting is referred to as one-to-all, multinode broadcasting is referred to as all-to-all or gossiping, scattering is referred to as personalized one-to-all, and total exchange is referred to as multi-scattering or personalized all-to-all.
Barrier synchronization is a type of process synchronization. It defines a logical point in the control flow of an algorithm at which all members of the group must arrive before any of the processes in the subset is allowed to proceed further. Therefore, one of the processes plays the role of a barrier process. This process gathers messages from all other processes, and then broadcasts a message to them indicating that they can continue.
Global operations include reduction and scan. In reduction, an operation such as sum, max, or min is applied across data items received from each member of the group. In an N/1 reduction operation, the resultant data resides at the root node; therefore, it contains a gathering operation. In an N/N reduction operation, every node or process involved in the operation obtains a copy of the reduced data. Hence, it is a combination of gathering and broadcasting. In a scan operation, given processes p_0, p_1, ..., p_n and data items d_0, d_1, ..., d_n, an operation ⊕ is applied such that the result d_0 ⊕ d_1 ⊕ ... ⊕ d_i is available at the process p_i.
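The scan operation can be illustrated with a short sketch; here ⊕ is taken to be ordinary addition, and the data values are arbitrary illustrative choices:

```python
from itertools import accumulate
from operator import add

# Data items d_0 .. d_n, held one each by processes p_0 .. p_n.
data = [3, 1, 4, 1, 5]

# After the scan, process p_i holds d_0 (+) d_1 (+) ... (+) d_i.
prefix = list(accumulate(data, add))

for i, value in enumerate(prefix):
    print(f"process p_{i} holds {value}")
# process p_0 holds 3 ... process p_4 holds 14
```

With a different associative operation (max, min, a product), only the operator passed to `accumulate` changes.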
Collective operations have usually been proposed and designed for systems that support only point-to-point, or unicast, communication in hardware. In these environments, collective operations are implemented by sending multiple unicast messages. Such implementations are called unicast-based. An alternative approach is to provide more direct support for collective communication in the hardware. Two main approaches have been studied. The first approach uses a network other than the primary data network to implement collective communications [80]. In the second approach, the data network is enhanced to better support some collective communications. To improve collective communication performance and reduce software overhead, two such enhancements to routers have been proposed: message replication and intermediate reception. Message replication refers to the ability to duplicate incoming messages onto more than one outgoing channel, while intermediate reception is the ability to simultaneously deliver an incoming message to the local processor and to an outgoing channel. Ni has proposed how scalable parallel computers should support efficient hardware multicast [99].

Figure 5.1: Some collective communication operations (broadcast, scatter, gather, multinode broadcast, total exchange, barrier, reduction, scan)
Numerous works have been reported on collective communications. Excellent surveys on collective communication algorithms in store-and-forward systems can be found in [53]. Another survey of broadcasting and multinode broadcasting in store-and-forward systems can be found in [64]. Dimakopoulos and Dimopoulos have shown how total exchange can be done in Cayley graphs [41]. They have also presented collective communication algorithms on binary fat trees [43]. McKinley and his colleagues have surveyed collective communications on hypercubes, meshes, and tori in wormhole-routed networks [90]. Recently, Banikazemi and others have proposed efficient broadcasting and multicasting algorithms using the communication capabilities of heterogeneous networks of workstations [15]. In the context of optical interconnection networks, Berthomé and Ferreira [20, 21] have presented broadcasting and multicasting algorithms for networks using optical passive stars (OPS). A comparative study of one-to-many wavelength division multiplexing (WDM) lightwave interconnection networks, based on hypergraph theory [18], has been presented by Bourdin and his colleagues [3]. Gravenstreter and Melhem have presented some communication algorithms in partitioned optical passive stars (POPS) networks [59].
In this chapter, I present and analyze some collective communication algorithms for the reconfigurable network, RON(k, N), defined in Chapter 3. In Section 5.2, I describe the communication modeling. I present and analyze broadcasting [20] and multi-broadcasting algorithms that utilize the reconfiguration capabilities of the network in Section 5.3. Later on, in Section 5.5 and Section 5.6, known algorithms from the literature for scattering and multinode broadcasting [20, 40] are adapted to the network. Then, I propose a new algorithm for the total exchange operation, to be called the combined total exchange algorithm, in Section 5.7. Finally, I summarize this chapter in Section 5.8.
5.2 Communication Modeling for Broadcasting/Multi-broadcasting
As discussed in Chapter 3, I use a modified Hockney's communication model [66]. I modify the Hockney model into two models. In this section, I define the first model, as used for hiding the reconfiguration delays in the broadcasting and multi-broadcasting algorithms. In Section 5.4, I define the second model for the other collective communication algorithms. The second model supports combining messages into a single message, as used in the scattering, multinode broadcasting, and total exchange algorithms, to be discussed later. Note that these algorithms are efficient, but they do not hide the reconfiguration delay in the network.

The communication time to send a unit length message from one node to another in the network is equal to T = d + t_s + t_tr, where d is the reconfiguration delay, t_s is the start-up time, and t_tr is the transmission time. I incorporate both t_s and t_tr into a single message delay t_m = t_s + t_tr. Thus, a unit length message transmission takes T = d + t_m. For the remainder of the discussion, and without loss of generality, I shall assume that t_m = 1 for a message of fixed length used in broadcasting/multi-broadcasting.
Culler and his colleagues have proposed the LogP model [33], which uses another terminology for communication modeling. LogP models sequences of point-to-point communications of short messages. L is the network hardware latency for a one-word message transfer. o is the combined overhead in processing the message at the sender (o_s) and receiver (o_r). P is the number of processors. The gap, g, is the minimum time interval between two consecutive message transmissions from a processor. Alexandrov and others have proposed the LogGP model [R], which incorporates long messages into the LogP model. The Gap per byte for long messages, G, is defined as the time per byte for a long message. Bar-Noy and Kipnis have developed the postal model [16], a special case of the LogP model, where g is one. However, they do not consider the parameters o and G.

A node in the LogP, LogGP, and postal models can send another message g time units after the previous message has been sent, without waiting for the previous message to be delivered at the destination. These models are more suitable for the current state-of-the-art wormhole-routed networks, where messages can be pipelined through the network. However, a node in my communication modeling can send another message only after its previous message has been delivered and its link has been reconfigured (if needed). This is because my model is a telephone-like model based on the circuit-switching technique, which is suitable for reconfigurable optical networks.
The model that I have used is slightly different from the model that is offered in [20, 21, 40]. The difference lies in the fact that in the network, RON(k, N), only the sender is allowed to reconfigure, and hence the delay penalties occur there. The receiver, in contrast to the models in [21, 40], and in [20], is entirely passive.
I use the notations B_m, MB_m, S_m, G_m, and TE_m for the broadcasting time, multi-broadcasting time, scattering time, multinode broadcasting time, and total exchange time, respectively. I derive the time complexities of the collective communication algorithms in the network, RON(k, N), under the model m, where m ∈ {F1, Fk}. F1 stands for full-duplex, single-port communication, while Fk stands for full-duplex, k-port communication.
5.3 Broadcasting and Multi-broadcasting
In this section, I shall concentrate on techniques that can effectively hide the reconfiguration delay d in the network. By reconfiguration latency hiding, I mean the process in which, while some nodes are in their reconfiguration phase, other nodes are in their message transmission phase. Hence, the reconfiguration phase is overlapped with the message transmission phase, which ultimately reduces the broadcasting and multi-broadcasting times.
5.3.1 Broadcasting
In broadcasting, a node, assumed to be node n_0 without loss of generality, sends its unique message to all other nodes. I assume an unbounded number of available wavelengths for the system. As noted earlier in Chapter 3, techniques such as spread-spectrum can be used in case of a limited number of available wavelengths. In the following, I first discuss the broadcasting algorithm under k-port modeling, and then present the results for the single-port modeling.
K-port: The naive algorithm is to let the broadcasting node n_0 inform k new nodes at a step. Clearly, it takes (d + 1)⌈(N − 1)/k⌉ time units. In a more efficient algorithm, B1Fk, node n_0 sends the message to k other nodes, and these k nodes, upon receiving the message, send it to k other nodes each, which are distinct from the nodes that have received the message thus far. Continuing this way, the algorithm will terminate after ⌈log_k(N(k − 1) + 1)⌉ − 1 steps, while in terms of elapsed time, the algorithm will take (d + 1)(⌈log_k(N(k − 1) + 1)⌉ − 1) time units.
Obviously, one can do better than this if one allows the nodes that have already been informed to re-send the same message to a different group of nodes. Thus, starting with node n_0, it sends the message to k nodes. At the end of this step, k + 1 nodes possess the message, which they now send to k other nodes each. Proceeding this way, this algorithm, B2Fk, will terminate after ⌈log_{k+1} N⌉ steps and will require (d + 1)⌈log_{k+1} N⌉ time units.
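The step counts above can be computed with a minimal sketch; the function names and the sample parameters (N = 41, k = 2, d = 1) are illustrative only:

```python
import math

def naive_time(N, k, d):
    # Source alone informs k new nodes per step: ceil((N-1)/k) steps of (d+1) each.
    return (d + 1) * math.ceil((N - 1) / k)

def b1_time(N, k, d):
    # k-ary tree growth: after s steps, 1 + k + ... + k^s nodes are informed.
    steps = math.ceil(math.log(N * (k - 1) + 1, k)) - 1
    return (d + 1) * steps

def b2_time(N, k, d):
    # Informed nodes keep re-sending: (k+1)^s nodes informed after s steps.
    steps = math.ceil(math.log(N, k + 1))
    return (d + 1) * steps

print(naive_time(41, 2, 1), b1_time(41, 2, 1), b2_time(41, 2, 1))  # 40 10 8
```

The floating-point logarithms are adequate for a sketch; an exact implementation would count powers with integer arithmetic to avoid rounding at exact powers of k.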
The above algorithms, B1Fk and B2Fk, are logarithmic in time, but they suffer because of the large reconfiguration delay, d, that each node incurs. I am interested in devising algorithms that will overcome the existence of the large reconfiguration delays by essentially hiding them. The algorithm B1Fk can be improved if the configuration of all the links forming the tree proceeds in parallel. Hence, in this new algorithm, B3Fk, the broadcasting message would reach the leaves of the tree in time d + ⌈log_k(N(k − 1) + 1)⌉ − 1.
The algorithm B3Fk can be improved if the configurations can take place concurrently with the message transmissions. I adopt a greedy algorithm, B4Fk, where a node reconfigures its links to reach k children, which leads to a pre-configured tree of an appropriate O(log_k N) depth. As soon as the broadcasting node has finished sending its message, it reconfigures its links to reach another predefined tree. It is understood that while this node is reconfiguring (this takes d time units), nodes that have already been configured and are in possession of the message send it to k neighbors each. This process repeats at each node every time it sends the message. Potentially, the message, starting at node n_0, will reach 1 + k + k^2 + ... + k^(d+1) = (k^(d+2) − 1)/(k − 1) nodes before node n_0 is able to reconfigure. Figure 5.2 depicts the B4Fk algorithm for a 2-port network with 41 nodes and a reconfiguration delay of 1. This algorithm is optimal since a node, after sending/receiving the message, immediately reconfigures to send the message to a new node. This algorithm is similar to the broadcasting algorithm by Berthomé and Ferreira for their loosely-coupled optically reconfigurable parallel computer, ORPC(k), using optical passive stars (OPS) [20].
It is clear that either this broadcasting network is a dedicated network, or there exists a global control where nodes understand that a broadcasting is going to take place and hence reconfigure their links correspondingly. In the latter case, an early reconfiguration delay should be added to the broadcasting time.
Figure 5.2: Latency hiding broadcasting algorithm for RON(k, N), N = 41, k = 2, d = 1
5.3.1.1 Analysis of the Greedy Algorithm
Before presenting the analysis of the greedy algorithm, it is worth noting that it can be shown that the total number of nodes, N(S), informed up to step S, follows the recurrence relation:

N(S) = N(S − 1) + r(S),  N(0) = 1  (5.1)

It can also be shown that the number of nodes, r(S), that receive the message at each step, S, follows the recurrence relations:

r(S) = k·r(S − 1) + r(S − d − 1) for S > d + 1,  r(S) = k·r(S − 1) for 1 ≤ S ≤ d + 1,  r(0) = 1  (5.2)
These recurrence relations are a kind of generalization of the Fibonacci functions defined by Bar-Noy and Kipnis for the postal model [16], and are similar to the recurrence relations of the broadcasting algorithms by Berthomé and Ferreira [20]. The above relations and those in [16, 20] cannot be solved for a general d. They should be computed step by step, or be given in a table, in order to find the termination time of the algorithms. However, as will be shown in the following, the analysis of the broadcasting algorithm includes a closed formulation that yields the termination time.
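The step-by-step evaluation can be sketched as follows, assuming the recurrence r(S) = k·r(S − 1) + r(S − d − 1), with r(S) = k·r(S − 1) for the first d + 1 steps (before any node has finished reconfiguring); for 41 nodes, k = 2, d = 1, it reaches all nodes at step 4, matching the 41-node example:

```python
def greedy_broadcast_time(N, k, d):
    """Smallest step S at which the informed-node count reaches N, under
    N(S) = N(S-1) + r(S) and r(S) = k*r(S-1) + r(S-d-1) (assumed form),
    where the second term appears only once senders have reconfigured."""
    r = [1]                # r[0] = 1: the source holds the message at step 0
    informed, S = 1, 0
    while informed < N:
        S += 1
        new = k * r[S - 1] + (r[S - d - 1] if S > d + 1 else 0)
        r.append(new)
        informed += new
    return S

# 41 nodes, 2 ports, reconfiguration delay 1: r = 2, 4, 10, 24 -> step 4
print(greedy_broadcast_time(41, 2, 1))  # 4
```

Tabulating `greedy_broadcast_time` for a range of N is exactly the "computed step by step or given in a table" approach; the closed formulation derived below avoids storing the whole history of r(S).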
I present another approach to find a closed formula for the total number of nodes, N(S), informed up to step S. The problem I shall endeavor to solve is to find the time required for the greedy algorithm to complete. I shall approach the analysis constructively; that is, I shall find the number of nodes that will be informed as time progresses, and I shall stop when all N nodes have been informed.

Denote by S the termination time (in units of t_m). Then, starting from an arbitrary node n_0, the nodes that will be informed, assuming no reconfiguration, belong to a k-ary tree rooted at node n_0 and of depth S. There are N_1 = (k^(S+1) − 1)/(k − 1) nodes in this tree, and I shall reference them as belonging to the first generation. Each of the nodes in this tree, once it has broadcast the message to its own children, will reconfigure and will become the root of a new tree over which a new wave of broadcasting will commence and proceed concurrently with the broadcasting in the first generation tree. This can only happen if S > d + 2, ensuring that the first node to be reconfigured (node n_0) will have enough time to reconfigure and broadcast to its k children.
I shall refer to the nodes belonging to the trees rooted at nodes which were included in the first generation tree and reconfigured as the second generation nodes. Thus, node n_0 can send its message again at time d + 1, after its router has been reconfigured to connect to a set of k new nodes. By sending this new message, n_0 actually embeds a new k-ary tree at depth d + 1. The next k nodes, at depth 1 of the first generation of trees, embed k different k-ary trees at depth d + 2. Using this concept, the k^(S−d−2) nodes at depth S − d − 2 of the first generation embed the last k^(S−d−2) different trees at depth S − 1 in the second generation. Figure 5.3 depicts the embedding of the first two generations of the nodes.

Denote by N_2 the total number of new nodes in the second generation, and by M_i the total number of new nodes in the trees of the second generation rooted at depth i.
Figure 5.3: First and second generation trees. The numbers underneath each tree denote the number of trees having the same height. These trees are rooted at nodes that were at the same level in the first generation tree.
This continues until depth S − 1, where:

M_{S−1} = k^(S−d−2)·(k) = k^(S−d−1)

Therefore, the total number of new nodes in the second generation, N_2, will be:

N_2 = Σ_{i=d+1}^{S−1} M_i = Σ_{i=d+1}^{S−1} k^(i−d)·(k^(S−i) − 1)/(k − 1)
The process of reconfiguring the optical interconnects is continued by the nodes as soon as they have broadcast the message to their children. Each generation of trees embeds a new generation that commences at depth d + 1 from its parent generation. It is clear that the total number of generations is ⌈S/(d + 1)⌉.
Let us now count the total number of nodes, N_3, in the third generation. The first tree of the third generation is embedded at depth 2(d + 1) by n_0. I begin with those trees of this generation which are embedded by the nodes of the first tree in the second generation. Let Q^1_i denote the total number of nodes in these trees rooted at depth i.
This continues until the depth S − 1, where:

Q^1_{S−1} = k^(S−2d−3)·(k) = k^(S−2d−2)
Now, consider the trees embedded in the third generation by the nodes of the next k trees, of depth S − d − 2, in the second generation, and let Q^2_i denote the total number of nodes in these trees rooted at depth i. Therefore,

This continues until the depth S − 1, where:
I continue with the trees embedded in the third generation by the nodes of the next k^2 trees, of depth S − d − 3, in the second generation, and let Q^3_i denote the total number of nodes in these trees rooted at depth i. Therefore,
Hence, the total number of new nodes in the third generation, N_3, will be:

In a similar manner, I can compute the numbers of nodes for the fourth and fifth generations as:
This process implies Lemma 1.

Lemma 1: The number of new nodes in generation i + 1, i ≥ 1, can be found as:
Proof. I give a combinatorial argument for its validity. Assume a tree belonging to generation i − 1 and rooted at depth (i − 1)(d + 1). This tree will produce a number of trees belonging to generation i and rooted at depth i(d + 1). The term k·(k^(S−i(d+1)) − 1)/(k − 1) represents the number of new nodes in the first tree of generation i rooted at depth i(d + 1). Subsequent trees in this generation have a decreasing (by one) number of levels, but since they were produced by nodes that are at lower levels in the parent generation, their numbers grow with the power of k. Therefore, the number of nodes within all the trees at each depth can be counted.

I have, however, only accounted for the number of trees produced by a single tree in a parent generation. There is more than one tree of identical depth in the parent generation, and the multiplicative binomial term accounts for this number, based on Pascal's triangle.
The total number of nodes in all generations, N(S), informed up to step S, is equal to:

Note that Equation 5.30 is a closed formula and is easier to compute (less computation and memory requirements) than the recurrence Equations 5.1 and 5.2. To determine the termination time S, one has to solve Equation 5.30 for S. This equation can be solved numerically. Table 5.1 and Table 5.2 provide a comparison of some numerical examples of the broadcasting time under the different broadcasting algorithms, B1Fk, B2Fk, B3Fk, and B4Fk, and for the best case log_{k+1} N when there is no reconfiguration delay (i.e., d = 0), for a particular number of nodes, N, reconfiguration delay, d, and port modeling, k. It is quite clear that the latency hiding algorithm, B4Fk, performs better than the other algorithms.
Table 5.1: Broadcasting time, k = 2, d = 1

Table 5.2: Broadcasting time, k = 4, d = 3
Single-port: In this case, a node can only use one of its links. Therefore, instead of k-ary trees, linear arrays are embedded. Hence, using the same concept as in the k-port modeling, the total numbers of nodes for generations 1, 2, 3, and 4 are:

If I continue in a similar manner to the k-port modeling, then the total number of nodes in all generations, N(S), would be:
Table 5.3 provides a comparison of some numerical examples of the broadcasting time of the latency hiding algorithm, B4F1, of the spanning binomial algorithm [114], (d + 1)⌈log_2 N⌉, and of the best case log_2 N when there is no reconfiguration delay (i.e., d = 0), for a particular number of nodes, N, and reconfiguration delay, d. It is clear that the algorithm B4F1 performs better than the spanning binomial algorithm.

Table 5.3: Broadcasting time, d = 3
5.3.1.2 Grouping schema
The total number of nodes, N(S), informed up to step S is given by Equation 5.1. Meanwhile, the number of nodes, r(S), that receive the message at each step S is defined by Equation 5.2. The nodes that know the message at any given step can be grouped into those nodes that have already received the message and those that receive it at this time step. The number of nodes that receive at each step is proportional (k times) to the number of nodes that received the message at the last step plus those that sent the message d + 1 steps ago.

The same grouping schema as in [20] can be used to find the set of nodes that transmit the message, and the set of nodes that receive the message, at any given step. The set T(S) consists of the nodes transmitting the message at step S, while the set R(S) consists of the nodes that receive the message at step S. These two sets can be found by Equation 5.36. Note that the same grouping schema can be applied to the multi-broadcasting case, to be discussed in the next section.
5.3.2 Multi-broadcasting
If there are M messages to be broadcast by a node to all other nodes, the simplest algorithm is to use the above latency hiding broadcasting algorithms (B4Fk or B4F1) M times in sequence. This algorithm, denoted MB1, gives an upper bound for multi-broadcasting and takes M × (d + B4Fk) and M × (d + B4F1) time units under k-port and single-port modeling, respectively. A lower bound for multi-broadcasting, M − 1 + MB_opt (MB_opt is the broadcasting time for an optimal algorithm), can be achieved by pipelining the messages through the network. That is, node n_0 sends its M messages in sequence in an optimal broadcasting algorithm.
One may think of another algorithm, MB2Fk, where the first message embeds a broadcasting tree (first generation tree) rooted at node n_0. Each of the subsequent messages uses this embedded tree to broadcast, thus bypassing the reconfiguration costs that the first message incurred. Hence, the first message will incur a delay of d + (⌈log_k(N(k − 1) + 1)⌉ − 1) time units to broadcast over all N nodes and to embed the broadcast tree, while the second and subsequent messages each incur only a broadcast delay of ⌈log_k(N(k − 1) + 1)⌉ − 1. Therefore, the total cost is d + M(⌈log_k(N(k − 1) + 1)⌉ − 1).
Table 5.4 compares the two algorithms, MB1Fk and MB2Fk. Note that an optimal algorithm for multi-broadcasting would pipeline the messages through the embedded trees using the latency hiding broadcasting algorithms (B4Fk or B4F1).

Table 5.4: Multi-broadcasting time, k = 4, d = 4, M = 10
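The two costs can be compared with a small sketch; the node count N = 64 is an illustrative assumption (the table's N is not reproduced here), and the B4Fk termination step is computed from the step-by-step recurrence of Section 5.3.1.1:

```python
import math

def greedy_broadcast_time(N, k, d):
    # Termination step of the latency hiding broadcast, from the assumed
    # recurrence r(S) = k*r(S-1) + r(S-d-1) (see Section 5.3.1.1).
    r, informed, S = [1], 1, 0
    while informed < N:
        S += 1
        new = k * r[S - 1] + (r[S - d - 1] if S > d + 1 else 0)
        r.append(new)
        informed += new
    return S

def mb1_time(N, k, d, M):
    # MB1: M independent latency hiding broadcasts, each paying d again.
    return M * (d + greedy_broadcast_time(N, k, d))

def mb2_time(N, k, d, M):
    # MB2Fk: embed the tree once (cost d), then each message pays the depth.
    depth = math.ceil(math.log(N * (k - 1) + 1, k)) - 1
    return d + M * depth

print(mb1_time(64, 4, 4, 10), mb2_time(64, 4, 4, 10))  # 70 34
```

For these parameters MB2Fk wins because the reconfiguration delay is paid once instead of M times; for very small M or very small d the ordering can reverse.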
5.4 Communication Modeling for other Collective Communications
In this section, I define the second communication modeling used for the scattering, multinode broadcasting, and total exchange algorithms. This model supports combining messages into a single larger message, as used in these algorithms. Note that the algorithms for scattering, multinode broadcasting, and total exchange are quite efficient, but they do not hide the reconfiguration delay in the network.

As stated in Section 5.2, the communication time to send a unit length message from one node to another in the network is equal to T = d + t_s + t_tr. Without loss of generality, I normalize the time T with respect to t_tr. Thus, a representative length message transmission takes T = d + t_s + 1. The communication time to send an M representative length message from one node to another would be T = d + t_s + M. Note that sending a combined message (that is, a larger message) does not affect the start-up time, t_s, or the reconfiguration delay, d. For simplicity, I incorporate both t_s and d into a single message delay.
5.5 Scattering
The scattering operation is basically used to distribute data to the nodes of a parallel computer. The easiest algorithm for the scattering operation is based on the sequential tree [101]. In this case, the source node sends its different messages to each of the other nodes sequentially, as shown in Figure 5.4 for single-port modeling. As the source of communication is the same for the whole scattering operation, this node must reconfigure its links after each step. Therefore, the scattering time, S1F1, is (N − 1)(d + 1) time units.
Figure 5.4: Sequential tree algorithm
The spanning binomial tree algorithm [91], used for broadcasting/multicasting operations, can also be used for the scattering operation. In this algorithm, the number of informed nodes doubles at each step, and each node stores its own message and forwards the rest of the messages it received, if necessary, to its children. As illustrated in Figure 5.5, the source node sends its messages for the upper half of the nodes to node 4. In the second step, nodes 0 and 4 are responsible for sending messages to the nodes in their halves; that is, to node 2 (messages for nodes 2 and 3), and node 6 (messages for nodes 6 and 7), respectively. In the third step, all nodes send the remaining messages to the remaining nodes. These three steps (actually log_2 N steps) take (d + 4), (d + 2), and (d + 1) time units, respectively. Generally, this algorithm has a scattering time:

S2F1 = Σ_{i=1}^{log_2 N} (d + N/2^i) = d·log_2 N + N − 1

Note that I have neglected the data permutation time at each node. It should be noted that the spanning binomial algorithm has a much better termination time than the sequential algorithm for the RON(k, N) (except for the trivial case, N = 2, where they have the same termination time).
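The two single-port scattering times can be checked numerically; this sketch assumes N is a power of two and uses the per-step bundle sizes (d + N/2^i) from the example above:

```python
import math

def sequential_scatter_time(N, d):
    # Source sends one message per step, reconfiguring after each: (N-1)(d+1).
    return (N - 1) * (d + 1)

def binomial_scatter_time(N, d):
    # Step i forwards a bundle of N/2^i messages: sum_i (d + N/2^i),
    # which telescopes to d*log2(N) + N - 1 for N a power of two.
    steps = int(math.log2(N))
    return sum(d + N // 2 ** i for i in range(1, steps + 1))

for N in (8, 64):
    print(N, sequential_scatter_time(N, 3), binomial_scatter_time(N, 3))
# 8 28 16
# 64 252 81
```

The gap widens quickly with N: the sequential tree pays the reconfiguration delay N − 1 times, the binomial tree only log_2 N times.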
k-port: The sequential tree algorithm can be extended to k-port modeling. That is, at each step the source node sends its k different messages to k other, different nodes. Therefore, S1Fk = (d + 1)⌈(N − 1)/k⌉.
Figure 5.5: Spanning binomial tree algorithm
Desprez and his colleagues have extended the spanning binomial algorithm to k-port modeling [40]. In this algorithm, the scattering node, n_0, sends k messages, of length N/(k + 1) each, to its k children. Therefore, there are then (k + 1) nodes having N/(k + 1) different messages each. These nodes, at step 2, communicate each with their k children and send one (k + 1)-th of their initial messages to each one. This process continues, and all nodes are informed after log_{k+1} N communication steps. Thus, the scattering time is equal to:

S2Fk = Σ_{i=1}^{log_{k+1} N} (d + N/(k + 1)^i) = d·log_{k+1} N + (N − 1)/k
5.6 Multinode Broadcasting
In multinode broadcasting, also called gossiping [53], all nodes send their unique messages to all other nodes; this is basically used in parallel algorithms when all nodes need to exchange their data. The simplest algorithm for multinode broadcasting is to use the latency hiding broadcasting algorithm N times, once for each node. Another algorithm is to consider multinode broadcasting as a degenerate case of total exchange, to be discussed in the next section. However, better algorithms exist.
Single-port: In the direct algorithms [109, 120], at any step i, a node p sends its message to node (p + i) mod N. Clearly, the cost of this algorithm, G1F1, is (N − 1)(d + 1).
One may use a better algorithm, just like the standard exchange algorithms for the total exchange operation [71, 24], where during each step the complete network is recursively divided into halves, and messages are exchanged across the new divisions at each step. This algorithm combines messages into larger messages to be transmitted as a single unit. Actually, each node sends its message along with the other messages it received at the previous steps. Hence, the multinode broadcasting has log_2 N steps, and a cost of:

G2F1 = Σ_{i=1}^{log_2 N} (d + 2^(i−1)) = d·log_2 N + N − 1

Figure 5.6 shows the pairwise communications and the lengths of the messages at each step for multinode broadcasting on an 8-node message-passing multicomputer. Unfortunately, latency hiding cannot improve this cost.
Figure 5.6: Multinode broadcasting on an 8-node RON (k, N) under single-port modeling
k-port: A simple algorithm is based on the extension of the direct algorithm to k-port modeling. That is, at step i, node p sends its message to the nodes (p + (i - 1)k + 1) mod N, (p + (i - 1)k + 2) mod N, ..., (p + ik) mod N. This algorithm has a cost of ceil((N - 1)/k)(d + 1).
Desprez and his colleagues [40] extended the G1F1 algorithm to k-port modeling by letting the nodes combine the messages to reduce the effect of the reconfiguration delay. Figure 5.7 illustrates this algorithm. Nodes are partitioned into groups of (k + 1) nodes each. Nodes are grouped as (0, 1, ..., k), (k + 1, k + 2, ..., 2(k + 1) - 1), ..., (N - (k + 1), N - (k + 1) + 1, ..., N - 1). At step 1, all nodes within a group exchange their messages. At the end of this step, each node has (k + 1) messages. At step 2, node p exchanges all its messages with nodes (p + (k + 1)) mod N, (p + 2(k + 1)) mod N, ..., (p + k(k + 1)) mod N. At the end of this step, each node has (k + 1)^2 messages. Let S = log_{k+1} N. This process continues to step S, where node p exchanges its messages with nodes (p + (k + 1)^(S-1)) mod N, ..., (p + k(k + 1)^(S-1)) mod N. It is clear that at each step i of this algorithm, each node sends (k + 1)^(i-1) messages to k other nodes. Hence, this algorithm, GkFk, has a multinode broadcasting time of d log_{k+1} N + (N - 1)/k.
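A quick simulation of this grouped exchange (my own sketch, not the thesis's code; partners are computed by cycling one base-(k + 1) digit per level, which matches the group structure described above when N = (k + 1)^S):

```python
import math

def gossip_kport(N, k):
    """Simulate grouped k-port multinode broadcasting (gossiping)
    on N = (k+1)**S nodes; returns the number of steps taken."""
    S = round(math.log(N, k + 1))
    held = [{p} for p in range(N)]       # each node starts with its own message
    for i in range(S):
        stride = (k + 1) ** i
        new = [set(h) for h in held]
        for p in range(N):
            digit = (p // stride) % (k + 1)
            base = p - digit * stride
            # exchange with the k nodes differing only in this digit
            for j in range(k + 1):
                new[p] |= held[base + j * stride]
        held = new
    assert all(h == set(range(N)) for h in held)   # everyone knows everything
    return S

assert gossip_kport(9, 2) == 2     # the N = 9, k = 2 case of the text
assert gossip_kport(8, 1) == 3     # k = 1 reduces to recursive doubling
```

At level i each node ships (k + 1)^i messages per port, which is where the d log_{k+1} N + (N - 1)/k time comes from.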
Figure 5.7: Multinode broadcasting on a 9-node RON (k, N) under 2-port modeling
5.7 Total Exchange
In total exchange, all nodes send their different messages to all other nodes. A naive algorithm for total exchange is to perform a scattering operation N times in sequence. However, better algorithms exist.
Single-port: In the direct algorithms [109, 120], at any step i, a node p sends the message destined to node (p + i) mod N. Clearly, the cost of this algorithm, TE1F1, is equal to (N - 1)(d + 1).
One may also use the standard exchange algorithm for total exchange, similar to the ones used in hypercubes and meshes [71, 24], where during each step the complete network is recursively divided into halves, and messages are exchanged across the new divisions at each step. Nodes combine messages into larger messages to be transmitted as a single unit. Consider this algorithm for an 8-node multicomputer, as shown in Figure 5.8. There are N/2 messages to be sent by each node at any step in this algorithm. I describe this only for node 0. Node 0 sends all its messages for the nodes in the upper half (that is, nodes 4, 5, 6, and 7) to node 4 at step 1. At the same time, it receives the messages for its half from node 4. At the second step, node 0 sends its message, along with the messages from node 4 destined to nodes 2 and 3, to node 2. At the same time, it receives the messages from nodes 2 and 6 for itself and node 1. At the third step (in general, there are log2 N steps), node 0 sends its message along with the other messages from nodes 2, 4, and 6 to node 1. It is clear that at the end of this step all nodes have exchanged all their messages. Thus, this algorithm has a cost of (d + N/2) log2 N.
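The recursive-halving schedule above can be checked with a short simulation, an illustration of mine that assumes N is a power of two and labels each message with its (source, destination) pair:

```python
def total_exchange(N):
    """Simulate the standard-exchange total exchange on N = 2**m
    nodes. held[p] is the set of (source, destination) messages
    currently stored at node p."""
    held = [{(p, q) for q in range(N)} for p in range(N)]
    bit = N // 2
    while bit:
        outgoing = [set() for _ in range(N)]
        for p in range(N):
            partner = p ^ bit
            # ship every message whose destination lies on the
            # partner's side of the current division
            moving = {m for m in held[p] if (m[1] & bit) == (partner & bit)}
            outgoing[partner] |= moving
            held[p] -= moving
        for p in range(N):
            held[p] |= outgoing[p]
        bit //= 2
    return held

held = total_exchange(8)
# after log2(8) = 3 steps every node holds exactly its own 8 messages
assert all(held[p] == {(q, p) for q in range(8)} for p in range(8))
```

Each of the log2 N steps moves N/2 combined messages per node, which is where the (d + N/2) log2 N cost comes from.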
Figure 5.8: Total exchange on an 8-node RON (k, N) under single-port modeling
Which algorithm, the direct or the standard exchange, is faster depends on the number of nodes N and the term d. I propose another algorithm, called the combined total exchange algorithm, which is a combination of these two algorithms.
I begin this algorithm by doing some (or even none) of the steps involved in the standard total exchange algorithm, and then continue with the direct algorithm. That is, divide the nodes in the complete network in half and do the steps involved in the standard total exchange algorithm up to the point where there is no gain in continuing to do so. From that step on, the direct algorithm is used for all the nodes in each of the created subgroups at the same time. Actually, the goal is to find the number of steps, or a bound for the number of steps, before switching to the direct algorithm such that the time associated with this algorithm is less than (or at least equal to) that of the other two (direct and standard exchange) algorithms.
Let me explain this algorithm with i = 1 (the number of standard exchange steps performed) for the example shown in Figure 5.8. At step 1, the nodes in the complete network are divided into halves. Each node exchanges 4 messages with its corresponding node in the other half. This takes d + 4, and at this point each of the network halves contains only messages destined to the half itself. As a matter of fact, each node now has two messages for each of the nodes in its half. These messages can be distributed to their destinations using a direct algorithm. There are 4 nodes in each half and 2 messages to be exchanged at a time, for a cost of (4 - 1)(d + 2) = 3d + 6. Hence, this algorithm has a total cost of 4d + 10.
Lemma 1: The combined total exchange algorithm under single-port modeling on RON (k, N) has a cost of i(d + N/2) + (N/2^i - 1)(d + 2^i), where i is the number of steps of the standard exchange algorithm performed before switching to the direct algorithm.
Proof. In the combined total exchange algorithm, each time a standard exchange algorithm step is done, a cost of d + N/2 is added. This brings up the term i(d + N/2). The first part of the second term, N/2^i - 1, accounts for the number of nodes in the groups doing the direct algorithms simultaneously. The second part, (d + 2^i), stands for the delay associated with the transfer of messages, whose length has doubled at each step.
It is clear that this algorithm is exactly the same as the direct algorithm when i = 0, and the standard exchange algorithm when i = log2 N.
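The closed form of Lemma 1 can be sanity-checked numerically; the snippet below is a sketch of mine, with d and N as free parameters:

```python
def combined_cost(N, d, i):
    """Lemma 1 cost (single-port): i standard-exchange steps, then
    the direct algorithm inside each subgroup of N / 2**i nodes."""
    return i * (d + N // 2) + (N // 2**i - 1) * (d + 2**i)

# the endpoints reduce to the two known algorithms (here N = 8, d = 5)
assert combined_cost(8, 5, 0) == (8 - 1) * (5 + 1)     # direct
assert combined_cost(8, 5, 3) == 3 * (5 + 8 // 2)      # standard exchange
# and i = 1 reproduces the 4d + 10 worked example
assert combined_cost(8, 5, 1) == 4 * 5 + 10
```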
k-port: The direct algorithm for k-port modeling requires node p at step i to send its messages to the nodes (p + (i - 1)k + 1) mod N, (p + (i - 1)k + 2) mod N, ..., (p + ik) mod N. This algorithm, TE1Fk, has a cost of ceil((N - 1)/k)(d + 1).
The same grouping and algorithm as GkFk can be used for total exchange, with the exception that this time each node sends N/(k + 1) messages at a time. Therefore, the cost of this algorithm is log_{k+1} N (d + N/(k + 1)). Figure 5.9 shows the above algorithm when N = 9 and k = 2.
Figure 5.9: Total exchange on a 9-node RON (k, N) under 2-port modeling
Which algorithm, TE1Fk or the standard exchange algorithm, is faster depends on the number of nodes N, the number of input/output channels k, and the term d. Just like the single-port modeling, a combined total exchange algorithm is proposed which is a combination of the above two algorithms.

Lemma 3: The combined total exchange algorithm under k-port modeling on RON (k, N) has a cost of i(d + N/(k + 1)) + ceil((N/(k + 1)^i - 1)/k)(d + (k + 1)^i), where i is the number of steps of the standard exchange algorithm performed before switching to the direct algorithm.
Proof. In the combined total exchange algorithm under k-port modeling, each time a standard exchange algorithm step is done, a cost of d + N/(k + 1) is added. This brings up the term i(d + N/(k + 1)). The first part of the second term, ceil((N/(k + 1)^i - 1)/k), accounts for the number of direct-algorithm steps performed simultaneously in the groups. The second part, (d + (k + 1)^i), stands for the delay associated with the transfer of messages.
It is clear that this algorithm is exactly the same as the direct algorithm when i = 0, and the standard exchange algorithm when i = log_{k+1} N. I have not found any mathematical proof that this algorithm is better than the known algorithms. However, in all the numerical examples (more than one hundred thousand) that I have performed for the comparison of these algorithms, I have always found a step, i, for which the combined total exchange algorithm had a shorter or equal exchange time compared to both the direct and the standard exchange algorithms. The above statement is also true for single-port modeling. Therefore, it is conjectured that the proposed algorithm is better than (or at least equal to) both known algorithms. Table 5.5 and Table 5.6 summarize some typical examples with optimal costs for the combined algorithms under single-port and k-port modeling.
Table 5.5: Total exchange time, N = 1024, single-port
Table 5.6: Total exchange time, N = 1024, k = 3
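The optimization over i behind such tables can be sketched as a direct search; this is my own illustration for the single-port case, with N and d as parameters:

```python
def best_combined(N, d):
    """Search every switch-over step i of the combined total
    exchange algorithm (single-port model) and return the best i,
    its cost, and the two endpoint costs (direct, standard)."""
    m = N.bit_length() - 1                       # log2 N
    cost = lambda i: i * (d + N // 2) + (N // 2**i - 1) * (d + 2**i)
    i_opt = min(range(m + 1), key=cost)
    return i_opt, cost(i_opt), cost(0), cost(m)

i_opt, c_opt, c_direct, c_std = best_combined(1024, 16)
# for this N and d the combined algorithm beats both endpoints
assert c_opt < min(c_direct, c_std)
```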
5.8 Summary
In this chapter, I presented and analyzed a broadcasting algorithm [20] that could effectively hide the reconfiguration delay d in the network RON (k, N). Essentially, in this algorithm, the reconfiguration phase of some of the nodes is overlapped with the message transmission phase of the other nodes, which ultimately reduces the broadcasting time. The analysis of the broadcasting algorithm includes a closed formulation that yields the termination time.
The solution for the total exchange problem combines two known algorithms, direct [109, 120] and standard exchange [71, 24], and it includes an optimization phase that determines the number of steps after which the first algorithm terminates and the second one is engaged. This ensures a termination time that is better than what can be accomplished by either of the two algorithms. Meanwhile, known algorithms for scattering and all-to-all broadcasting from the literature [40, 21] have been adapted to the network RON (k, N).
The scattering, multinode broadcasting, and total exchange algorithms discussed in this chapter assumed that the number of nodes in the RON (k, N) is a power of 2, or a power of (k + 1), under single-port and k-port modeling, respectively. However, when the number of processors is not a power of 2, or a power of (k + 1), dummy nodes can be assumed to exist up to the next power of 2 or (k + 1), with a small performance loss.
So far in this thesis, I have been concerned with efficient communications in message-passing parallel computer systems using reconfigurable interconnects. I have used knowledge of the next destination (obtained either by prediction or algorithmically) to hide the reconfiguration latency of the interconnect. In Chapter 6, regardless of the type of the interconnection network, I utilize prediction techniques in general, and more specifically the predictors proposed in Chapter 3, to remove the redundant message copying at the receiving side of communications in message-passing systems.
Chapter 6
Efficient Communication Using Message Prediction for Clusters of Multiprocessors
A significant portion of the software communication overhead belongs to a number of message copying operations. Ideally, it is desirable to have a true zero-copy protocol, where the message moves directly from the send buffer in its user space to the receive buffer at the destination without any intermediate buffering. However, because message-passing applications at the send side do not know the final receive buffer addresses, early arriving messages have to be buffered in a temporary area.
I explain the motivation behind this work and discuss related work in Section 6.2. In Section 6.3, I elaborate on how prediction would help eliminate message copying at the receiving side of communications. I explain the experimental methodology used to gather communication traces of the parallel applications in Section 6.4. I characterize some communication properties of the parallel application benchmarks by presenting the frequency and distribution of receive communication calls in Section 6.5. I show that there is a message reception communication locality in message-passing parallel applications [5]. Given this communication locality at the receiver side, I use the predictors introduced in Chapter 3 to predict the next consumable message. This chapter contributes by arguing that these message predictors can be efficiently used to drain the network and cache the incoming messages even if the corresponding receive calls have not been posted yet. This way, there is no need to unnecessarily copy the early arriving messages into a temporary buffer. As shown in Section 6.6, the performance of these predictors, in terms of hit ratio, on some parallel applications is quite promising [5] and suggests that prediction has the potential to eliminate most of the remaining message copies. I compare the performance and storage requirements of the predictors in Section 6.7. Finally, I summarize this chapter in Section 6.8.
6.1 Introduction
With the increasing uniprocessor and SMP computation power available today, interprocessor communication has become an important factor that limits the performance of workstation clusters. Essentially, communication overhead is one of the most important factors affecting the performance of parallel computers. Many factors affect the performance of communication subsystems in parallel systems. Specifically, communication hardware and its services, communication software, and the user environment (multiprogramming, multiuser) are the major sources of the communication overhead.

The communication hardware aspect includes the architecture and placement of the
network interface, and the interconnection network and its services. Many architectures have been proposed for the network interfaces. They are classified as (1) direct [52, 63, 97, 88] and (2) memory-based [48, 111, 126, 13]. Direct network interfaces allow a processor to directly access the network queue. However, they mostly ignore the issue of multiprogramming; that is, only a single thread can use the network interface at a time. Memory-based interfaces provide protection but have high latency. Interconnection networks themselves are another source of communication hardware latency. Communication services, including flow control and message delivery, also add to this latency.
Communication software overhead currently dominates the communication time in clusters of workstations. In the current generation of parallel computer systems, the software overheads are tens of microseconds [43]. This is worse in clusters of workstations. Even with the high performance networks [23, 67, 111] available today, there is still a gap between what the network can offer and what the user application can see. The communication software overhead comes mainly from three different sources: crossing protection boundaries several times between the user space and the kernel space, passing through several protocol layers, and involving a number of memory copying operations.
Several researchrrs are working to minimize the cost of crossing protection bound-
aries. and using simple protocol layers by utilizinz ~iser=levrl messaging techniques such
as .-fcri\-e :Clessa,os ( A M ) [125]. Fasr Messages ( F M ) [102]. bïrt~tal iClemo~?*-hlapped
Cornnizrriications (VMMC-2) [48], LWer [ 1261. LAPI [ 1 1 O ] . Basic lntelfacejor- ParaIlel-
isnl (BIP) [105]. ?irrirnl Inre~fice .-lrchi&ectine (VIA) [19]. and PM [ I I I ] . A sigiticant
portion of the soliware communication overhead belonçs to a nurnber of message copy-
ing. Idrally. message protocols should transfer messages in a single copy (this is usually
called o true zero-copy). In other words. the protocol should çopy the message directly
tiorn the srnd buRer in its user space to the receive bufer in the destination without any
intermediate buflerin-. However. applications at the send side do not know the final
receiw butfer addresses and. hence. the communication subsystems at the receiving end
still copy messages unnrcessarily from the network interface to a system butfer. and then
tiom the system butiir to the user buWcr when the receiving application posts the reçeive
crtll.
Some researchers have tried to avoid memory copying [48, 79, 106, 14, 119, 118]. While they have been able to remove the memory copying between the application buffer space and the network interface at the send side by using user-level messaging techniques, they have not been able to remove the memory copying at the receiver side completely. They may achieve zero-copy messaging at the receiver side only if the receive call is already posted, a rendez-vous type communication is used for large messages, or the destination buffer address is already known by a pre-communication. Note, however, that
MPI-2 [93] supports a remote memory access (RMA) operation, but this is mostly suitable for receiver-initiated communications arising from the shared-memory paradigm.
I am interested in bypassing the memory copying at the destination in the general case, eager or rendez-vous, and for sender-initiated communications as in MPI [92, 93]. In this chapter, I argue that it is possible to address the message copying problem at the receiving side by speculation. I support my claim by showing that messages display a form of locality at the receiving ends of communications.
I introduce here, for the first time, the notion of message prediction for the receiving side of message-passing systems. By predicting the next receive communication call, and hence the next destination buffer address, before the receive call is posted, one will be able to copy the message directly into the CPU cache speculatively before it is needed, so that in effect a zero-copy transfer can be achieved.
I am interested in utilizing the predictors proposed in Chapter 3 [3, 2], but this time at the receiver side, to predict the next consumable message and drain the network as soon as the message arrives. Upon a message arrival, a user-level thread is invoked. If the receive call has not been issued yet, the message will be cached, but efficient cache mapping mechanisms need to be devised to facilitate binding at the moment the receive call is issued. If the receive call has already been issued, then the message can be written to its final destination.
This chapter concentrates on message prediction at the destinations in message-passing systems using MPI, in isolation. This is analogous to studying branch prediction, or coherence activity prediction [97], in isolation. Our tools are not yet ready for measuring the effectiveness of the predictors on the application run-time. My preliminary evaluation measures the accuracy of the predictors in terms of hit ratio. The results are quite promising and suggest that prediction has the potential to eliminate most of the remaining message copies.
6.2 Motivation and Related Work
High performance computing is increasingly concerned with efficient communication across the interconnect due to the availability of high-speed, highly-advanced processors. Modern switched networks, called System Area Networks (SAN), such as Myrinet [23] and ServerNet [67], provide high communication bandwidth and low communication latency. However, because of the high processing overhead due to communication software, including network interface control, flow control, buffer management, memory copying, polling, and interrupt handling, users cannot see much difference compared to traditional local area networks.
Fortunately, several user-level messaging techniques have been developed to remove the operating system kernel and protocol stack from the critical path of communications [125, 107, 18, 126, 49, 105, 110, 121]. This way, applications can send and receive messages without operating system intervention, which often greatly reduces the communication latency.
Data transfer mechanisms and message copying, control transfer mechanisms, address translation mechanisms, protection mechanisms, and reliability issues are the key factors for the performance of a user-level communication system. In this chapter, I am particularly interested in avoiding message copying at the receiver side of communications.
A significant portion of the software communication overhead belongs to a number of message copying operations. With traditional software messaging layers, there are usually four message copying operations from the send buffer to the receive buffer, as shown in Figure 6.1. These copies are, namely, from the send buffer to the system buffer (1), from the system buffer to the network interface (NI) (2), and, at the other end of the communication, from the network interface to the system buffer (3), and from the system buffer to the receive buffer (4) when the receive call is posted. Note that I have not considered the data transfer from the network interface (NI) at the sending process to the network interface at the receiving process as a separate copy. Also, the network interface can be placed either on the I/O bus or on the memory bus.
At the send side, some user-level messaging layers use programmed I/O to avoid the system buffer copy. FM uses programmed I/O, while AM-II and BIP do so only for small messages. Some other user-level messaging layers use DMA: VMMC-2, U-Net, and PM use DMA to bypass the system buffer copy, while AM-II and BIP do so only for large messages. In systems that use DMA, applications or a library dynamically pin and unpin pages in the user space that contain the send and the receive buffers. Address translation can be done using a kernel module, as in BIP, or by caching a limited number of address translations for the pinned pages, as in VMMC-2, U-Net/MM [17], and PM. Some network interfaces also permit bypassing message copying at the network interface by directly writing into the network.
Contrary to the send side, bypassing the system buffer copy at the receiving side may not be achievable. Processes at the sending side do not know the destination buffer addresses. Therefore, when a message arrives at the receiving side, it has to be buffered if the receive call has not been posted yet.

Figure 6.1: Data transfers in a traditional messaging layer

VMMC for the SHRIMP multicomputer is a communication model that provides direct data transfer between the sender's and
receiver's virtual address space. However, it can achieve a zero-copy transfer only if the sender knows the destination buffer address. Therefore, the receiver exports its buffer address by sending a message to the sender before the actual transmission can take place. This leads to a 2-phase rendez-vous protocol, which adds to the network traffic and to the network latency, especially for short messages.
VMMC-2 [48] uses a transfer redirection mechanism instead. It uses a default, redirectable receive buffer for a sender who does not know the address of the receive buffer. When a message arrives at the receiving network interface, the redirection mechanism checks to see whether the receiver has already posted its buffer address. If the receive buffer has been posted earlier than the message arrival, the message will be directly transferred to the user buffer; thus it achieves a zero-copy transfer. If the buffer address is not posted, the message must be buffered in the default buffer and will be transferred when the receive buffer is posted; thus, it achieves a one-copy transfer. However, if the receiver posts its buffer address while the message is arriving, part of the message is buffered in the default buffer and the rest is transferred to the user buffer.
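The three outcomes of the redirection check can be captured in a tiny model (hypothetical code of mine, not the VMMC-2 implementation): packets of a message arrive one per tick, and the receive buffer address may be posted at any tick.

```python
def deliver(packets, post_tick):
    """Count packets landing in the default buffer (and needing a
    later copy) versus packets redirected straight to the user
    buffer, given the tick at which the buffer address is posted."""
    buffered = direct = 0
    for t in range(packets):
        if t < post_tick:      # address not yet known: default buffer
            buffered += 1
        else:                  # redirection hits the user buffer
            direct += 1
    return buffered, direct

assert deliver(4, 0) == (0, 4)   # posted before arrival: zero-copy
assert deliver(4, 4) == (4, 0)   # posted after arrival: one-copy
assert deliver(4, 2) == (2, 2)   # posted mid-arrival: split delivery
```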
Fast sockets [106] has been built using active messages. It uses a mechanism at the receiver side called receive posting to avoid the message copy into the fast socket buffer. If the message handler knows the data's final memory destination upon message arrival, the message is directly moved to the application user space. Otherwise, it has to be copied into the fast socket buffer.
FM 2.x [79] uses an approach similar to that of fast sockets. FM collaborates with the handler to direct the incoming messages into the destination buffer if the receive call has already been posted.
MPI-LAPI [14] is an implementation of MPI on top of LAPI [110] for the IBM SP machines. In the implementation of the eager protocol, the header handler of LAPI returns a buffer pointer to LAPI, which tells LAPI where the packets of the message must be reassembled. If a receive call has been posted, the address of the user buffer is returned to LAPI. If the header handler does not find a matching receive, it will return the address of an early arrival buffer, and hence a one-copy transfer is accomplished. Meanwhile, messages larger than the eager size are transferred using a 2-phase rendez-vous protocol.
Some research projects have proposed solutions for multi-protocol message-passing interfaces on clusters of multiprocessors (Clumps), using both shared memory for intra-node communications and message-passing for inter-node communications [119, 55, 87].
MPICH-PM/CLUMP [119] is an MPI library implemented on a cluster of SMPs. It uses a message-passing-only model, where each process runs on a processor of an SMP node. For inter-node communications, it uses eager and rendez-vous protocols. For short messages, it achieves one-copy using the eager protocol, as the message is copied into a temporary buffer if the MPI receive primitive has not been issued. For large messages, it uses the rendez-vous protocol to achieve zero-copy by using a remote write operation, but it needs an extra communication. For intra-node communications, it achieves one-copy using a kernel primitive that allows copying messages from the sender to the receiver without involving the communication buffer.
BIP-SMP [8], for intra-node communications, uses shared memory for small messages, with two memory copies, and direct copy for large messages, with a kernel overhead. For inter-node communications, it works like MPI-BIP, which is a port of MPICH [57].
TOMPI [38] is a threaded implementation of MPI on a single SMP node. It copies a message only once by utilizing multiple threads on an SMP node. Unfortunately, it is not scalable to a cluster of SMP machines.
Other techniques to bypass extra copying are the page re-mapping and copy-on-write techniques [31, 45]. Both techniques require switching to the supervisor mode, acquiring the necessary locks on virtual memory data structures, changing the virtual memory mapping at several levels for each page, performing Translation Lookaside Buffer (TLB)/cache consistency actions, and finally returning to the user mode. This limits the performance of the page re-mapping and copy-on-write techniques. A zero-copy TCP stack has been implemented in Solaris by using copy-on-write pages and re-mapping to improve communication performance [31]. It achieves a relatively high throughput for large messages. However, it does not perform well for small messages. This work is also solely dedicated to the SUN Solaris virtual memory system.
fbufs [45] also uses the re-mapping technique to avoid the penalty of copying large messages across different layers of the protocol stack. However, fbufs allows re-mapping only for a limited range of the user virtual memory.
It is quite clear that even user-level messaging techniques may not achieve a zero-copy communication all the time at the receiver side of communications. Meanwhile, the major problem with all page re-mapping techniques is their poor performance for short messages, which is extremely important for parallel computing.
As stated in Chapter 3, many prediction techniques have been proposed in the past to predict the future accesses of sharing patterns and coherence activities in distributed shared memory (DSM) systems by looking at their observed behavior [96, 77, 73, 133, 31, 107]. Recently, Afsahi and Dimopoulos proposed some heuristics to predict the destination target of subsequent communication requests at the send side of communications in message-passing systems [3, 4]. However, to the best of my knowledge, no prediction technique has been proposed for the receive side of communications in message-passing systems to reduce the latency of a message transfer.
This chapter of the thesis reports on an innovative approach for removing message copying at the receiving ends of communications for message-passing systems. I argue that it is possible to address the message copying problem at the receiving side by speculation. I introduce message prediction techniques such that messages can be directly transferred to the cache even if the receive calls have not been posted yet.
6.3 Using Message Predictions
In this section, I analyze the problem of the early arrival of messages at the destinations in message-passing systems. In such systems, a number of messages arrive in arbitrary order at the destinations. The consuming process or thread will consume one message at a time. If I know which message is going to be consumed next, then I can move the message upon its arrival near the place where it is to be consumed (e.g., a staging cache), or I could schedule which thread to execute next, preferably on the same processor as the consuming thread, to enhance the chances that the data will be in the processor cache when it is accessed by the consumer.
For this, one has to consider three different issues. First, deciding which message is going to be consumed next; this can be done by devising receive call predictors, history-based predictors that predict the subsequent receive calls issued by a given process in a message-passing program. Second, deciding where and how this message is to be moved into the cache. Third, efficient cache re-mapping and late binding mechanisms need to be devised for when the receive call is posted.

In this chapter, I am addressing the first problem, that is, utilizing message predictors and evaluating their performance. I am working on several methods to address the remaining issues.
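As a concrete, deliberately minimal illustration of a history-based receive-call predictor, the sketch below predicts that the message identifier following X will be the same one that followed X last time. This is my own example on a synthetic trace; the Chapter 3 predictors evaluated later are more elaborate.

```python
class NextAfterPredictor:
    """Predicts the next receive by remembering, for each message
    identifier, which identifier followed it last time."""
    def __init__(self):
        self.next_after = {}     # msg id -> id that last followed it
        self.last = None

    def predict(self):
        return self.next_after.get(self.last)

    def observe(self, msg_id):
        if self.last is not None:
            self.next_after[self.last] = msg_id
        self.last = msg_id

# a cyclic receive pattern, as an iterative solver might produce
trace = [1, 2, 3, 4] * 50
p = NextAfterPredictor()
hits = 0
for m in trace:
    hits += (p.predict() == m)
    p.observe(m)
assert hits == 195       # misses occur only while the cycle is being learned
```

On this 200-call trace the predictor misses only the first five calls, a 97.5% hit ratio, illustrating how repetitive receive patterns make speculation attractive.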
6.4 Experimental Methodology
In exploring the effect that different heuristics have in predicting the next receive call, I used a number of parallel benchmarks and extracted their communication traces, on which I applied the predictors. Specifically, I used the BT, SP, and CG benchmarks from the NPB suite [15], and the PSTSWM application, introduced in Chapter 2. I did not use the MG and LU benchmarks from the NPB suite because these benchmarks use MPI_ANY_SOURCE in some of their receive calls (MPI_Recv and MPI_Irecv). This means that the applications may receive a particular message from different sources depending on the order of arrival. I also did not use the QCDMPI application, as this application uses the synchronous communication primitive MPI_Ssend, where the sender waits for the receive call to be posted and only then transmits the message. In this case, prediction would not help, as the receive call is already posted.
I experimented with the workstation class "W" and the larger class "A" of the NPB suite, and the default problem size for the PSTSWM application. Note that because of space and access limitations, I did not experiment with the larger classes "B" and "C". The NPB results are almost the same for the "W" and "A" classes; hence, I report only the "A" class results here. Note that I also removed the initialization part from the communication traces of the PSTSWM application.
6.5 Receiver-side Locality Estimation
The applications use the blocking and nonblocking standard MPI receive primitives, namely MPI_Recv and MPI_Irecv [92]. MPI_Recv (buf, count, datatype, source, tag, comm, status) is a standard blocking receive call; when it returns, the data is available in the destination buffer. The PSTSWM application uses this type of receive call. MPI_Irecv (buf, count, datatype, source, tag, comm, request) is a standard nonblocking receive call. It immediately posts the call and returns; hence, the data is not available at the time of return, and another call is needed to complete the receive. All applications in this study use this type of receive call.
As noted earlier in Chapter 3, one of the communication characteristics of any parallel application is the frequency of communications. Figure 6.2 illustrates the minimum, average, and maximum number of receive communication calls in the applications under different system sizes. I executed the applications once for each system size and counted the number of receive calls for each process of the applications. Hence, in Figure 6.2, by average, minimum, and maximum, I mean the average, minimum, and maximum number of receive calls taken over all processes of each application. It is clear that all processes in the BT, SP, and CG applications have the same number of receive communication calls for each system size, while processes in the PSTSWM application have different numbers of receive communication calls.
Figure 6.2: Number of receive calls in the applications under different system sizes
MPI_Recv and MPI_Irecv calls have a 7-tuple set consisting of source, tag, count, datatype, buf, comm, and status or request. In order to choose precisely one of the received messages at the network interface and transfer it to the cache, the predictors need to consider all the details of a message envelope; that is, source, tag, count, datatype, buf, and comm (I don't consider status and request, as they are just handles when the calls return). I did not rely only on the buffer address, buf, of a receive call, as many processes may send their messages to the same buffer address of a particular destination process. Nor could I depend only on the sender, source, of a message, or on the length, count, of a message.
Therefore, I assigned a different identifier to each unique 6-tuple found in the communication traces of the applications. Figure 6.3 shows the number of unique message identifiers in the applications under different system sizes. By average, minimum, and maximum, I mean the average, minimum, and maximum number of unique identifiers taken over all processes of each application. It is evident that all processes in the BT and CG applications have the same number of unique message identifiers, while processes in the SP and PSTSWM applications have different numbers of unique message identifiers (except when the number of processes is four for the SP benchmark).
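As an illustration, the identifier assignment can be sketched as follows; the function and the sample trace values are hypothetical reconstructions, not taken from the actual traces:

```python
def build_identifiers(trace):
    """Assign a small integer identifier to each unique 6-tuple
    (source, tag, count, datatype, buf, comm) seen in a receive-call
    trace, and rewrite the trace as a sequence of identifiers."""
    ids = {}
    sequence = []
    for call in trace:
        if call not in ids:
            ids[call] = len(ids)      # next unused identifier
        sequence.append(ids[call])
    return ids, sequence

# Hypothetical trace entries: (source, tag, count, datatype, buf, comm).
trace = [(1, 7, 64, "MPI_DOUBLE", 0x1000, 0),
         (2, 7, 64, "MPI_DOUBLE", 0x2000, 0),
         (1, 7, 64, "MPI_DOUBLE", 0x1000, 0)]
ids, seq = build_identifiers(trace)   # two unique identifiers; seq = [0, 1, 0]
```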
Figure 6.4 shows the distribution of each unique message identifier for process zero of the applications when the number of processes is 64 for CG and 49 for the other applications. I chose process zero because this process almost always had the largest number of unique message identifiers among all processes in the applications and is also responsible for distributing data and verifying the results of the computation. As shown in Figure 6.4, the message identifiers are evenly distributed in BT. However, the distributions of the message identifiers in CG and PSTSWM are almost bimodal, with two separated peaks. The SP benchmark shows four different peaks for the message identifiers. Similar distributions have been found for other system sizes [6].
6.5.1 Communication Locality
As noted in Chapter 3, some researchers have tried to find or use the communications locality properties of parallel applications [3, 4, 75, 30, 36]. I define the term message reception locality in conjunction with this work. By message reception locality I mean that if a certain message reception call has been used, it will be re-used with high probability by a portion of code that is "near" the place where it was used earlier, and that it will be re-used in the near future.
Figure 6.3: Number of unique message identifiers in the applications under different system sizes
In the following subsection, I present the performance of the classical LRU, LFU, and FIFO heuristics on the applications to see the existence of locality or repetitive receive calls. I use the hit ratio to establish and compare the performance of these heuristics. As the hit ratio, I define the percentage of the times that the predicted receive call was correct out of all receive communication requests.
Figure 6.4: Distribution of the unique message identifiers for process zero in the applications
6.5.2 The LRU, FIFO and LFU Heuristics
The Least Recently Used (LRU), First-In-First-Out (FIFO), and Least Frequently Used (LFU) heuristics all maintain a set of k (k is the window size) unique message identifiers. If the next message identifier is already in the set, then a hit is recorded. Otherwise, a miss is recorded and the new message identifier replaces one of the identifiers in the set according to which of the LRU, FIFO, or LFU strategies is adopted.
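To make the window mechanics concrete, the following is a minimal sketch (illustrative, not the instrumentation used in this thesis) that replays a trace of message identifiers through a k-entry window under each policy and reports the hit ratio defined above:

```python
from collections import Counter

def hit_ratio(sequence, k, policy="LRU"):
    """Fraction of receive calls whose identifier is already in a
    k-entry window maintained under LRU, FIFO, or LFU replacement."""
    window = []          # oldest / least-recently-used element at index 0
    freq = Counter()     # use counts, needed only by the LFU policy
    hits = 0
    for ident in sequence:
        freq[ident] += 1
        if ident in window:
            hits += 1
            if policy == "LRU":               # refresh recency on a hit
                window.remove(ident)
                window.append(ident)
        else:
            if len(window) == k:              # miss with a full window: evict
                if policy == "LFU":
                    window.remove(min(window, key=lambda i: freq[i]))
                else:                         # LRU and FIFO both evict the front
                    window.pop(0)
            window.append(ident)
    return hits / len(sequence)

# A strictly cyclic trace is fully captured once the window holds the
# whole cycle; with a smaller window the same trace thrashes.
ratio = hit_ratio([0, 1, 2] * 10, k=3, policy="LRU")   # 27 hits out of 30
```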
Figure 6.5 shows the results of the LRU, FIFO, and LFU heuristics on the application benchmarks when the number of processes is 64 for CG and 49 for all other applications. It is clear that the hit ratios in all benchmarks approach 1 as the window size increases. The performance of the FIFO algorithm is the same as the LRU for the BT and PSTSWM benchmarks, and almost the same for the SP and CG benchmarks. The LFU algorithm consistently has a better performance than the LRU and FIFO heuristics on the BT, CG, and PSTSWM applications. It also has a better performance than the LRU and FIFO heuristics on the SP benchmark for window sizes of greater than five. It is interesting to see that a real application like PSTSWM needs window sizes of greater than 150 to achieve a good performance (hit ratios above 80%) under the LFU policy. Similar performance results for the LRU, FIFO, and LFU heuristics on other system sizes can be found in [6].
Figure 6.5: Effects of the LRU, FIFO, and LFU heuristics on the applications
Essentially, the LRU, FIFO, and LFU heuristics do not predict exactly the next receive call but show the probability that the next receive call might be in the set. For instance, the SP benchmark shows nearly a 60% hit ratio for a window size of five under the LRU heuristic. This means that 60% of the time one of the five most recently issued calls will be issued next. These heuristics perform better when the window size k is sufficiently large. However, this large window adds to the hardware and software implementation complexity, as one needs to move all messages in the set to the cache in the likelihood that one of them is going to be used next. This is prohibitive for large window sizes.
I am interested in having predictors that can predict the next receive call with a high probability. In Section 6.6, I utilize the novel message predictors proposed in Chapter 3, employing different heuristics, and evaluate their performance on the applications.
6.6 Message Predictors
The set of predictors used in this section predict the subsequent receive calls based on the past history of communication patterns on a per-process basis. These predictors were proposed in Chapter 3 to predict the destination target of subsequent communication requests at the send side of communications. It is worth mentioning that the message re-ordering effect [77] (messages from different processes may arrive out-of-order even if messages from the same processes arrive in-order in most networks) has no effect on the predictions, as the predictors predict the next receive calls based on the patterns of the receive calls in the program that runs on the same process and not on the arriving messages, unless the order of receive calls depends on the order of message arrival. Note that in the following figures, by average, minimum, and maximum, I mean the average, minimum, and maximum hit ratio taken over all processes of each application.
6.6.1 The Tagging Predictor
As described earlier in Chapter 3, the Tagging predictor assumes a static communication environment, in the sense that a particular communication receive call in a section of code will be the same one with a large probability. I attach a different tag to each of the receive calls found in the applications. This can be implemented with the help of a compiler or by the programmer through a pre-receive (tag) operation, which will be passed to the communication subsystem to predict the next receive call before the actual receive call is issued.
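A minimal sketch of this idea (illustrative, not the Chapter 3 implementation) keeps, for each tag, the message identifier it carried last time and predicts that the same identifier will recur; the tag and identifier values below are hypothetical:

```python
def tagging_hit_ratio(events):
    """events: (tag, identifier) pairs, one per executed receive call.
    Predict that a tag repeats the identifier it carried last time."""
    last_seen = {}
    hits = 0
    for tag, ident in events:
        if last_seen.get(tag) == ident:    # prediction was correct
            hits += 1
        last_seen[tag] = ident             # update the tag's history
    return hits / len(events)

# A call site that alternates between two messages is mispredicted
# every time it switches: miss, hit, miss, hit.
ratio = tagging_hit_ratio([(0, 5), (0, 5), (0, 6), (0, 6)])   # 0.5
```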
Figure 6.6: Effects of the Tagging predictor on the applications (N = 64 for CG, and 49 for the others)
The performance of the Tagging predictor is shown in Figure 6.6. It is evident that this predictor doesn't have a good performance for the applications studied. It cannot predict the communication patterns of PSTSWM at all, and has a degrading performance for all other applications when the number of processes increases.
6.6.2 The Single-cycle Predictor
The Single-cycle predictor, proposed in Chapter 3, is based on the fact that if a group of receive calls are issued repeatedly in a cyclical fashion, then I can predict the next request one step ahead. The performance of the Single-cycle predictor is shown in Figure 6.7. It is evident that its performance is consistently very high (hit ratios of more than 0.9). Note that for the PSTSWM application, the Single-cycle predictor has a zero hit ratio for one of the processes. However, it doesn't affect the average hit ratio over all the processes. It is worth mentioning that all Cycle-based predictors proposed in Chapter 3 (Single-cycle, Single-cycle2, Better-cycle, and Better-cycle2) have the same performance for the applications studied. Thus, I just report the results for the Single-cycle predictor here.
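A simplified sketch of the cycle idea (the actual Chapter 3 algorithm differs in its details) records identifiers until the first one reappears, treats the recorded run as a cycle, and then predicts by replaying it; a misprediction discards the cycle and starts recording afresh:

```python
def single_cycle_hit_ratio(sequence):
    """Hit ratio of a simplified single-cycle predictor over a
    sequence of message identifiers."""
    head, recording = None, []     # cycle being learned
    cycle, pos = None, 0           # cycle being replayed
    hits = 0
    for ident in sequence:
        if cycle is not None:
            if ident == cycle[pos]:                   # predicted correctly
                hits += 1
                pos = (pos + 1) % len(cycle)
                continue
            head, recording, cycle = None, [], None   # cycle broken
        if head is None:
            head, recording = ident, [ident]
        elif ident == head:                           # head reappeared: cycle formed
            cycle, pos = recording, 1 % len(recording)
        else:
            recording.append(ident)
    return hits / len(sequence)

# Four repetitions of the cycle 0,1,2: the first pass is spent learning;
# everything after the cycle forms is predicted correctly.
ratio = single_cycle_hit_ratio([0, 1, 2] * 4)    # 8 hits out of 12
```

As the repetition count grows, the learning pass is amortized and the hit ratio approaches 1, matching the behavior reported for the benchmarks.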
6.6.3 The Tag-cycle2 Predictor
The Tag predictor didn't have a good performance on the applications, while the Single-cycle predictor had a very good performance. The Tag-cycle2 predictor, proposed in Chapter 3, is a combination of the Tag predictor and the Single-cycle2 predictor. In the Tag-cycle2 predictor, I attach a different tag to each of the communication requests found
Figure 6.7: Effects of the Single-cycle predictor on the applications (N = 64 for CG, and 49 for the others)
in the benchmarks and do a Single-cycle2 discovery algorithm on each tag. The performance of the Tag-cycle2 predictor is shown in Figure 6.8. The Tag-cycle2 predictor performs well on all benchmarks. Its performance is the same as the Single-cycle predictor on BT and PSTSWM. However, it has a better performance on CG and a lower performance on SP.
Figure 6.8: Effects of the Tag-cycle2 predictor on the applications (N = 64 for CG, and 49 for the others)
6.6.4 The Tag-bettercycle2 Predictor
In the Single-cycle and Tag-cycle2 predictors, as soon as a receive call breaks a cycle, I remove the cycle and form a new cycle. In the Tag-bettercycle2 predictor, proposed in Chapter 3, I keep the last cycle associated with each tagbettercycle-head encountered in the communication patterns of each process. This means that when a cycle breaks, I maintain the elements of this cycle in memory for later reference. The performance of the Tag-bettercycle2 predictor is shown in Figure 6.9. The Tag-bettercycle2 predictor performs well on all benchmarks. Its performance is the same as the Single-cycle and Tag-cycle2 predictors on BT and PSTSWM. However, it has a better performance on CG and a lower performance on SP relative to the Single-cycle predictor. The Tag-bettercycle2 predictor has a better performance on the SP application compared to the Tag-cycle2 predictor. I also found that the applications have a very small number of tagbettercycle-heads (at most 2) under the Tag-bettercycle2 predictor and different system sizes.
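This refinement can be sketched as follows; for brevity the sketch ignores tags, so it is closer in spirit to Better-cycle2 than to Tag-bettercycle2, and it is an illustrative reconstruction rather than the thesis's algorithm. A broken cycle is saved, keyed by its head, and re-instated the next time that head appears, instead of being re-learned from scratch:

```python
def better_cycle_hit_ratio(sequence):
    """Hit ratio of a simplified 'keep the last cycle' predictor."""
    saved = {}                     # head identifier -> last cycle observed
    head, recording = None, []
    cycle, pos = None, 0
    hits = 0
    for ident in sequence:
        if cycle is not None:
            if ident == cycle[pos]:                   # predicted correctly
                hits += 1
                pos = (pos + 1) % len(cycle)
                continue
            saved[cycle[0]] = cycle                   # remember the broken cycle
            head, recording, cycle = None, [], None
        if head is None:
            if ident in saved:                        # re-instate a saved cycle
                cycle, pos = saved[ident], 1 % len(saved[ident])
                continue
            head, recording = ident, [ident]
        elif ident == head:                           # head reappeared: cycle formed
            cycle, pos = recording, 1 % len(recording)
        else:
            recording.append(ident)
    return hits / len(sequence)

# When the pattern 0,1 returns after an excursion to 2,3, the saved
# cycle is re-instated and predicts correctly without re-learning.
ratio = better_cycle_hit_ratio([0, 1, 0, 1, 2, 3, 2, 3, 0, 1, 0, 1])
```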
Figure 6.9: Effects of the Tag-bettercycle2 predictor on the applications (N = 64 for CG, and 49 for the others)
6.7 Message Predictors' Comparison
Figure 6.10 presents a comparison of the performance of the predictors on the applications under some typical system sizes. As we have seen so far, Single-cycle, Tag-cycle2, and Tag-bettercycle2 all perform exceptionally well on the benchmarks. However, the performance of the Single-cycle is better on the SP benchmark, while Tag-cycle2 and Tag-bettercycle2 have better performance on the CG benchmark.
6.7.1 Predictors' Memory Requirements
Table 6.1 compares the maximum memory requirements of the message predictors on the application benchmarks when the number of processes is 64 for CG, and 49 for BT, SP, and PSTSWM. I have found that the memory requirements of the predictors decrease gradually when the number of processes decreases. The numbers in the table are the multiplication factor for the amount of storage needed to maintain the message 6-tuple sets. It is quite clear that the memory requirements of the predictors are low. That makes them very attractive for implementation at the network interface. Comparatively, the predictors (Single-cycle, Tag-cycle, and Tag-bettercycle) need more memory for the PSTSWM application. Although the classical LRU, LFU, and FIFO heuristics need less memory, as stated earlier, the beauty of the predictors lies in the fact that they predict with high accuracy and transfer only one message to the cache, which should dramatically reduce the cache pollution effect, if any. This should also bring down the software cost of the implementation.

Figure 6.10: Comparison of the performance of the predictors on the applications (N = 64 for CG, and 49 for the others; N = 32 for CG and PSTSWM, and 36 for BT and SP)
Table 6.1: Memory requirements (in 6-tuple sets) for the predictors when N = 64 for CG, and N = 49 for BT, SP, and PSTSWM
6.8 Summary
Communication latency adversely affects the performance of networks of workstations. A significant portion of the software communication overhead belongs to a number of message copying operations. Ideally, it is very desirable to have a true zero-copy protocol, where the message is moved directly from the send buffer in its user space to the receive buffer in the destination without any intermediate buffering. However, this is not always possible, as a message may arrive at the destination when the corresponding receive call has not been issued yet. Hence, the message has to be buffered in a temporary buffer.
In this chapter of the dissertation, I have shown that there is message reception communication locality in message-passing applications. I have utilized the different predictors proposed in Chapter 3 to predict the next receive call at the receiver side of communications. By predicting receive calls early, a process can perform the necessary data placement upon message reception and move the message directly into the cache. I presented the performance of these predictors on some parallel applications. The performance results are quite promising and justify more work in this area.
I envision these predictors being used to drain the network and place the incoming messages in the cache in such a way as to increase the probability that the messages will still be in cache when the consuming thread needs to access them.
Chapter 7
Conclusions and Directions for Future Research
Parallel processing is the key to the design of high-performance computers. However, with the availability of fast microprocessors and small-scale multiprocessors, internode communication has become an increasingly important factor that limits the performance of parallel computers. In essence, parallel computers require extremely short communication latency such that network transactions have minimal impact on the overall computation time. This thesis uses a number of techniques to achieve efficient communications in message-passing systems. This thesis makes five contributions.
The first contribution of this thesis is the design and evaluation of two different categories of prediction techniques for message-passing systems. I present evidence that message destinations display a form of locality. This thesis utilizes the message destination locality property of message-passing parallel applications to devise a number of heuristics that can be used to predict the target of subsequent communication requests.
Specifically, I propose two sets of message destination predictors: Cycle-based predictors, which are purely dynamic predictors, and Tag-based predictors, which are static/dynamic predictors. In the Cycle-based predictors (Single-cycle, Single-cycle2, Better-cycle, and Better-cycle2), predictions are made dynamically at the network interface without any help from the programmer or compiler. In the Tag-based predictors (Tagging, Tag-cycle, Tag-cycle2, Tag-bettercycle, and Tag-bettercycle2), predictions are made dynamically at the network interface as well, but they require an interface to pass some information from the program to the network interface. This can be done with the help of the programmer or compiler by inserting instructions such as pre-connect (tag) in the program. The performance of the proposed predictors, especially Better-cycle2 and Tag-bettercycle2, is very good on all application benchmarks. Meanwhile, the memory requirements of the predictors are very low. The proposed predictors should be easily implementable on the network interface due to their simple algorithms and low memory requirements.
The heuristics proposed are only possible because of the existence of communications locality, which can be used in establishing a communication pathway between a source and a destination in reconfigurable interconnects before this pathway is to be used. This is a very desirable property, since it allows us to effectively hide the cost of establishing such communications links, thus providing the application with the raw power of the underlying hardware (e.g., a reconfigurable optical interconnect).
As the second contribution of this thesis, I show that the majority of reconfiguration delays in single-hop reconfigurable networks can be hidden by using one of the proposed high hit ratio predictors. In other words, by comparing the inter-send computation times of some parallel benchmarks with some specific reconfiguration times, most of the time we are able to fully utilize these computation times for the concurrent reconfiguration of the interconnect when we know, in advance, the next target using one of the proposed high hit ratio target prediction algorithms. This thesis also states that by utilizing the predictors at the send side of communications, applications at the receiver sides would also benefit, as messages arrive earlier than before.
As the third contribution of this thesis, I analyze a broadcasting algorithm that utilizes latency hiding and reconfiguration in the network to speed up the broadcasting operation under single-port and k-port modeling. In this algorithm, the reconfiguration phase of some of the nodes is overlapped with the message transmission phase of the other nodes, which ultimately reduces the broadcasting time. The analysis yields closed formulations that give the termination time of the algorithms.
The fourth contribution of this thesis is a new total exchange algorithm for single-hop reconfigurable networks under single-port and k-port modeling. I conjecture that this algorithm ensures a better termination time than what can be achieved by either the direct or the standard exchange algorithm.
Ideally, message protocols should copy the message directly from the send buffer in its user space to the receive buffer in the destination without any intermediate buffering. However, applications at the send side do not know the final receive buffer addresses, and, hence, the communication subsystems at the receiving end still copy messages unnecessarily into a temporary buffer.
This thesis presents evidence that there exists message reception communications locality in message-passing parallel applications. Given this message reception communications locality, the fifth contribution of this thesis is the use and evaluation of the proposed predictors to predict the next consumable message at the receiving ends of communications. This thesis contributes by claiming that these message predictors can be efficiently used to drain the network and cache the incoming messages even if the corresponding receive calls have not been posted yet. This way, there is no need to unnecessarily copy the early arriving messages into a temporary buffer. The performance of the proposed predictors, Single-cycle, Tag-cycle2, and Tag-bettercycle2, on the parallel applications is quite promising and suggests that prediction has the potential to eliminate most of the remaining message copies.
7.1 Future Research
The predictors proposed in Chapter 3 of this thesis, such as Tag-bettercycle2 and Better-cycle2, perform exceptionally well on all applications except QCDMPI, under different system sizes. It seems that this application repeatedly changes its message destinations in different cycles, such that even the best proposed predictors cannot always capture them. Thus, it might be helpful to devise other predictors, called All-cycle and Tag-allcycle, that could maintain all cycles associated with each cycle-head and tagbettercycle-head found in the communication traces of the applications. In case these two predictors, All-cycle and Tag-allcycle, have high memory requirements, it might be better to devise predictors that fall somewhere between the extreme cases; that is, predictors that can maintain more than one cycle but less than all of the cycles associated with each cycle-head and tagbettercycle-head. Not to mention that searching in different cycles may add to the performance penalty.
The Tag-based predictors proposed in Chapter 3 can be pure dynamic predictors if another level of prediction is done on the tags themselves at the network interface. This way, there is no need for the program to pass pre-connect (tag) (or pre-receive (tag) as in Chapter 6) information to the network interface. It is interesting to see what the performance of such 2-level Tag-based predictors would be.
In Chapter 4, I roughly showed that up to 50% of the time, applications at the receiving end might benefit when the predictors are applied at the send side of communications. However, a trace-driven simulator should be written to precisely evaluate the effect that applying the predictors at the send side has on the receive side, and on the total application run-time.
In Chapter 5, this thesis analyzes efficient broadcasting/multi-broadcasting algorithms that utilize latency hiding to speed up these operations. An optimal algorithm for multi-broadcasting remains to be devised, such that messages are pipelined in the embedded trees using the latency hiding broadcasting algorithms (BdFn or BFI). In this thesis, although the algorithms for scattering, all-to-all broadcasting, and total exchange are very efficient, they do not use the latency hiding technique. Although very challenging, efficient algorithms for multicasting, scattering, all-to-all broadcasting, and total exchange should be devised such that they use the latency hiding technique to hide the reconfiguration delay in the network.
As stated in Chapter 6, by predicting receive calls early, a node can perform the necessary data placement upon message reception and move the message directly into the cache in such a way as to increase the probability that the messages will still be in cache when the consuming thread needs to access them. Further issues that should be investigated include deciding where and how this message is to be moved in the cache. Would this cache be a first-level cache, a second-level cache, a third-level cache, or even a network cache? What mechanism should be used to transfer the message into the cache? User-level messaging and/or a multithreaded MPI environment? Meanwhile, efficient cache re-mapping and late binding mechanisms need to be devised for when the receive call is posted. Also, cache pollution and inaccurate timing are other issues that should be addressed.
The performance of the predictors proposed in this thesis was evaluated under single-port modeling; that is, the predictors predict one step ahead. However, the Cycle-based predictors (Single-cycle, Single-cycle2, Better-cycle, and Better-cycle2) and the Tagcycle-based predictors (Tag-cycle, Tag-cycle2, Tag-bettercycle, and Tag-bettercycle2) maintain the message destinations of a cycle. Therefore, it is possible to predict more than one step ahead. It would be interesting to find the performance of the predictors under such modeling in terms of hit ratio, total reconfiguration delays, and application run time.
Finally, all the applications studied in this dissertation are scientific and engineering ones. It would be interesting to discover the impact of the predictors on the performance of commercial applications.
Bibliography
A. Afsahi and N. J. Dimopoulos, "Collective Communications on a Reconfigurable Optical Interconnect", Proceedings of OPODIS'97, International Conference on Principles of Distributed Systems, December 1997, pp. 167-181.

A. Afsahi and N. J. Dimopoulos, "Hiding Communication Latency in Reconfigurable Message-Passing Environments", Proceedings of IPPS/SPDP 1999, 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing, April 1999, pp. 55-60.

A. Afsahi and N. J. Dimopoulos, "Communication Latency Hiding in Reconfigurable Message-Passing Environments: Quantitative Studies", Proceedings of HPCS'99, 13th Annual International Symposium on High Performance Computing Systems and Applications, Kluwer Academic Publishers, June 1999, pp. 111-126.

A. Afsahi and N. J. Dimopoulos, "Efficient Communication Using Message Prediction for Clusters of Multiprocessors", Technical Report ECE-99-5, Department of Electrical and Computer Engineering, University of Victoria, December 1999.

A. Agarwal, R. Bianchini, D. Chaiken, K. L. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie and D. Yeung, "The MIT Alewife Machine: Architecture and Performance", Proceedings of the Annual International Symposium on Computer Architecture, 1998.

A. Alexandrov, M. Ionescu, K. E. Schauser and C. Scheiman, "LogGP: Incorporating Long Messages into the LogP Model - One Step Closer Towards a Realistic Model for Parallel Computation", 7th Annual Symposium on Parallel Algorithms and Architectures (SPAA'95), July 1995.

G. S. Almasi and A. Gottlieb, Highly Parallel Computing, Benjamin/Cummings, 1989.

C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu and W. Zwaenepoel, "TreadMarks: Shared Memory Computing on Networks of Workstations", IEEE Computer, Volume 29, No. 2, February 1996, pp. 18-28.

T. E. Anderson, D. E. Culler, D. A. Patterson, and the NOW team, "A Case for Networks of Workstations: NOW", IEEE Micro, February 1995.

T. E. Anderson, S. S. Owicki, J. B. Saxe, and C. P. Thacker, "High Speed Switch Scheduling for Local Area Networks", International Conference on Architectural Support for Programming Languages and Operating Systems, 1992, pp. 98-110.

D. H. Bailey, T. Harris, W. Saphir, R. V. der Wijngaart, A. Woo and M. Yarrow, "The NAS Parallel Benchmarks 2.0: Report NAS-95-020", NASA Ames Research Center, December 1995.

M. Banikazemi, R. K. Govindaraju, R. Blackmore and D. K. Panda, "Implementing Efficient MPI on LAPI for IBM RS/6000 SP Systems: Experiences and Performance Evaluation", Proceedings of IPPS/SPDP 1999, 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing, April 1999, pp. 183-190.

M. Banikazemi, J. Sampathkumar, S. Prabhu, D. K. Panda, and P. Sadayappan, "Communication Modeling of Heterogeneous Networks of Workstations for Performance Characterization of Collective Operations", Proceedings of the International Workshop on Heterogeneous Computing, in conjunction with IPPS/SPDP '99, April 1999, pp. 123-131.

A. Bar-Noy and S. Kipnis, "Designing Broadcasting Algorithms in the Postal Model for Message-Passing Systems", 4th Annual ACM Symposium on Parallel Algorithms and Architectures, June 1992, pp. 11-22.

A. Basu, M. Welsh, T. V. Eicken, "Incorporating Memory Management into User-Level Network Interfaces", Hot Interconnects V, August 1997.

C. Berge, Hypergraphs, North-Holland, 1989.

P. Berthomé and A. Ferreira, Editors, Optical Interconnections and Parallel Processing: Trends at the Interface, Kluwer Academic Publishers, 1998.

P. Berthomé and A. Ferreira, "Communication Issues in Parallel Systems with Optical Interconnections", International Journal of Foundations of Computer Science, Volume 8, Number 2, June 1997, pp. 143-162.

P. Berthomé and A. Ferreira, "On Broadcasting Schemes in Restricted Optical Passive Star Systems", Interconnection Networks and Mapping and Scheduling Parallel Computations.

D. E. Culler, J. P. Singh and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1999.

D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian and T. von Eicken, "LogP: Towards a Realistic Model of Parallel Computation", 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1993.

F. Dahlgren, M. Dubois and P. Stenström, "Sequential Hardware Prefetching in Shared-Memory Multiprocessors", IEEE Transactions on Parallel and Distributed Systems, 6(7), 1995.

W. J. Dally, J. A. S. Fiske, J. S. Keen, R. A. Lethin, M. D. Noakes, P. R. Nuth, "The Message-Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms", IEEE Micro, April 1992, pp. 23-39.

B. V. Dao, Sudhakar Yalamanchili, and Jose Duato, "Architectural Support for Reducing Communication Overhead in Multiprocessor Interconnection Networks", Proceedings of the Third International Symposium on High Performance Computer Architecture, 1997, pp. 343-352.

Department of Energy Accelerated Strategic Computing Initiative (ASCI) Project, http://www.llnl.gov/asci/.

F. Desprez, A. Ferreira and B. Tourancheau, "Efficient Communication Operations on Passive Optical Star Networks", Proceedings of the First International Conference on Massively Parallel Processing Using Optical Interconnections, 1994, pp. 52-55.

V. Dimakopoulos and N. J. Dimopoulos, "Total Exchange in Cayley Networks", Euro-Par'96 Parallel Processing, Lecture Notes in Computer Science, 1996, pp. 341-346.

V. V. Dimakopoulos and N. J. Dimopoulos, "Communications in Binary Fat Trees", Proceedings of the International Conference on Parallel and Distributed Computing, September 1995, pp. 383-388.

J. J. Dongarra and T. Dunigan, "Message-Passing Performance of Various Computers", Concurrency, Volume 9, No. 10, December 1997, pp. 915-926.

P. W. Dowd, "Wavelength Division Multiple Access Channel Hypercube Processor Interconnection", IEEE Transactions on Computers, Volume 41, October 1992, pp. 1223-1241.

P. Druschel and L. L. Peterson, "Fbufs: A High-Bandwidth Cross-Domain Transfer Facility", Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, 1993, pp. 189-202.

J. Duato, S. Yalamanchili and L. Ni, Interconnection Networks: An Engineering Approach, IEEE Computer Society Press, 1997.

J. Duato, "A Necessary and Sufficient Condition for Deadlock-Free Adaptive Routing in Wormhole Networks", IEEE Transactions on Parallel and Distributed Systems, Volume 6, No. 10, 1995, pp. 1055-1067.

C. Dubnicki, A. Bilas, Y. Chen, S. Damianakis and K. Li, "VMMC-2: Efficient Support for Reliable, Connection-Oriented Communication", Proceedings of Hot Interconnects '97, 1997.

D. Dunning, G. Regnier, G. McAlpine, D. Cameron, B. Shubert, F. Berry, A. M. Merritt, E. Gronke, and C. Dodd, "The Virtual Interface Architecture", IEEE Micro, March-April 1998, pp. 66-76.

L. Fan, M. C. Wu, H. C. Lee and P. Grodzinski, "Optical Interconnection Networks for Massively Parallel Processors Using Beam Steering Vertical Cavity Surface-Emitting Lasers", Proceedings of the Second International Conference on Massively Parallel Processing Using Optical Interconnections, October 1995, pp. 18-34.

M. R. Feldman, S. C. Esener, C. C. Guest and S. H. Lee, "Comparison Between Optical and Electrical Interconnects Based on Power and Speed Considerations", Applied Optics, 27(9), May 1988, pp. 1742-1751.
M. Fillo. S. W. Kcckler. W. J. Dally. N. P. Carter. A. Chang, Y Gurevich and W. S. L w . "The M-Machine Multiomputer". Proceedit~gs gsf the 18îli .-ltlmul IEEE; -4 C.Lf Ittrerrioriorial Siniposiirm on ~Cfi~~~oci~~cliirec~~i~'~~". 1 995.
P. Fraigniaud and E. Lazard. "Methods and Problems of Communication in Cisual Networks". Discivie .-fppfied .Ifuilwnn~ics, Volume 53. 1994. pp. 79- 133.
M. Galles. "Spider: A Hish-Speed Network Interconnect". IEEE Micio. Volume 1 7. No. 1. January/February 1997.
[j5] P. Geofhay, L. Prylli. and B. Tourancheau. "BIP-SMP: High Performance Mes- sage Passing Over a Cluster ofcommodity SMPs". SCYY: High Pei;foinzniice Ner- iiw-kiig a d Cotnpiiritrg Co,? f>r.etlce. Yovember. 1 999.
[56] C. J. Glass and L. M. Ni. "The Tum Mode1 for Adaptive Routing". Aocrrditigs of die i 7th Iirrei-mriotial Simposiiinr oii Cui7ipzi~cv- .-lrrhirrctiiw. 1 992. pp. 275-287.
[57] W. Gropp and E. Lusk. "User's Guide for MPICH. n Portable lmplementation of bl P 1". .-l rgmile .Voriotirii Lciboi-urot?: .Llorlienrnrics arid Conrpiirer Sciorce Dili- siotl, June, 1999.
[58] J. W. Goodman. F. 1. Leonberger. S-Y. Kung and R. A. Athale. "Optical Intercon- nections for VLSI Systems". Pt-uccaiit~gs uf ' lEEE. Volume 72. No. 7. July 198-1.
[FI] G. Grwenstreter and R. G. .Llrlhem. -'Realizing Cornmon Communication Pat- terns in Pntitioned Opticnl Püssivr Stars (POPS) Networks". IEEE Tiuiisri~*tioris oti C'oniprir~w. Valumc 47. No. 9. 1098. pp. 9%- I O 13.
[60] 41. W. Haney and M. P. Christensen. "Fundamental Geomerric Advantages of Free-Spacc Optical I nterconnect". Pt-o~*~'ccii~igs uf'rlze Thir-ci Itirri-riariorrd Cu11 /kt-- erice 011 .Llrissiivii Pot-cil1c.l Ptmwsit~g tisittg Op ticcil Iitfe~ruii~lccriotis. 1996. pp. 16-23.
[6 I ] S. M. Hrdetneimi et al.. 'A Sun7ey of Gossiping and Broadcristiny in Communica- tion Nztw«rks". . V m t A - s . Volume 1 S. 1985. pp. 3 19-3.19.
[62] J . L. Hennrssy and D. A. Patterson. Conipiir~v. .-lrr*lzirecrtirz.: .-I Qiicitirirnriiv .-lpprwtrdi. Morgan Kiiutinann. 1 096.
[64] H. S. Hintcin. T. J. Cluonan. F. B. illcConnick. Jr.. A. L. Lentine and F. A. P. Tooley. "Frce-Space Digital Optical Systems". Pt-ocecdiugs of'lEEE. Spccial Isstie or1 Opricd Conrpiiiirtg Sistrnu. Volume SI . No. 1 1. Nov. 1994. pp. 1632- 1649.
[65] S. Hioki. "Construction of Siaples in Lattice Gauge Theory on a Parallel Com- puter". Parallel Conrpiitiig. Volume 22. '10. 10. October 1996. pp. I 335- 1344.
[66] R. W. Hockney, 'The Communication Challenge for MPP: Intel Paragon and Meiko CS-1". Paralief Compicting. Volume 20. N o . 3. March 1994. pp. 359-398.
[67] R. W. Horst and D. Garcia. "ServerNet SAN IiO Architecture". Proceediirgs qf'rhe Hot Iiitercor~riec~s C,: 1997.
J. Hsu and P. Banjerer. "Pertbrmance Measuremrnt and Trace Dnven Simulation of Parallel CXD and Numencal Applications on a Hypercube Multicomputer". Ptoceeditigs qfrlie I 7th Irriet.tiariotiai $mpositu~i oti Cunipiirw At-cl,irecnrte. 1990. p p . 260-169.
K. Hwang and 2. Nu. Scalablr Parciilel Conrpirtitig: Potulielism. Scalibilih: Pt-o- gr~~nznzcibilih: McGraw-Hill. 1998.
S. L. Johnsson. "Communication in Network .4rchitecturcs". in I'LSI cilid Par-criid Chnipirrririoii. cd. R. Suaya and G. Birtwistle. Morgan Kaufmünn. 1900.
S. L. Johnsson and C.-T. Ho. "Optimum Broadcasting and Personalizrd Communi- cation in H ypercubes". IEEE fimscic*riotis otz Cot~rpirters. Volume C-3 8. Srptem- ber 1980. pp. 1249- 126%
S. Klirlson and M. Brorsson. '*A Comparative Charücterization of Communication Patterns in Applications Using MPi and Shared Mrmory on an IBM SP2". Pt-o- cvétiitlgs of'the Ilbt%-shop ou Cunrrniariccrriorr. .-lrdiirecrirrr. micf .-lppliciitiutis /Or .Vcni~-k-busecl Pnr*ciIlrl Cbnipiiritrg, Itircr-tzr~riot~trl Siwiposiirtn oii Hiy11 Pc);fbr-- riicimv Coi?ipiircv- .-lrrhirt~r-nrr-c~. February I 9'23.
S. Küxiras and I. R. Goodman. "Improving CC-NUMA Pertbrmancr Using Instruction-Bascd Prediçtion". Iriferwrio~ml Siwposiirt~r uri High Pci-futmatrce C ' V I I I P I I I L ~ ~ . . - ~ ~ . L - / I ~ ~ L ? c I I ~ T . I 909.
F. E. Kiarnilrv. "Pertbrmance Corn parison between Optoelectronic and V LS 1 Mul- tistage intc<rconnestion Nrtworks". Jowmd ot'Li3/irti.arc Gcl~tioloy?*. Volume 9. Na. 12. December 19C)l. pp. 1674-1692.
V. Kumar. 4. Grama. A. Gupta and G. Karypis. Irtrt~dlrcrion ro Pai-allei Cornpur- itig: Desigti arid .-fria(rsis of .-l lgot.ir/inrs. The Benjamiru'Cummings Publis hing Company. Inc.. 1 991.
A.-C. Lai and B. FaIsafi. "Memoty Shwing Predictor: The Key to a Speculative Coherent DSM". Proceeciiiigs oj' the 16d1 .-ltiriirai Itirertintiottol Smposiirm oit Conrprr fer- .4rr/zitecrures. 1999. pp. 1 72483.
LAM/MPI Parallel Computing. University of Notre Darne. hnp:/! www.mpi.nd.rdu/larn/.
[79] M. Launa. S. Pakin and A. A. Chien. "Efficient Layering for High Speed Commu- nication: Fast Messages 2 .Y'. Pr-oceeditigs q f ' rhe 7th H i 9 Perjormailce Distrib- iireti Conipirtirig ( H P D C i ) Cotr/l.,-rrlce. 1998.
[YU] C. E. Leiserson. Z. S. Abuharndeh. D. C. Douglas. C. R. Feynman. M. N. Gan- mukhi. J. V. Hill. W. D. Hillis. B. C. Kuszmaut. M. A. St. Pierre. D. S. Wells, M. C. Wong. S-W.lhng and R. Zak. "The Network Architecture of the Connection Mac hine C 41-5". PI-ocretfirrgs q f ' the 4th .-l i t l $mposiirrrt or1 Parïill~~l .-i lgor-itllnis ~ r i i c f -4)-cliirecnrrw. Juns 1 992. pp. 272-255.
[S 1 ] .A. L. Lentinr. K. W. Goosen. J. A. CValker. L. M. F. Chirovsky. L. A. D' Asaro. S. P. Hui. B. J. Tscng. R. E. Lribenguth. J. E. Cunningham. W. Y. Jan. L M . Kuo. D. W. Dülirinper. D. P. Kossives. O. D. Bacon. Ci. Livesue. R. K. Momson. R. .A. 'iovotny. and D. B. Buçhholz. "High-Sprrd Optoelectronic VLSI Switching Chip with 3 4000 Optical PO Based on Flip-chip Bonding of MQW Modulators and Detrctors to Silicon CMOS". IEEE Jorownl u f 'S~-k~te t l Topic~ in Qirtrrrtiinl Elec- rror~ic.s. Volume 2. April. 1996.
[ 8 2 ] K. Li. Y. Pm and S. Q. Zheng. Editors. Por*czllel Conrputiiig CSitlg Opricd ho-- C U I ~ I ~ C J L Y ~ O ~ IS. Kluwer .=\cademic PubIishers. 1 99S.
[X3j .A. Loun and H. K. Sunp. "An Optical blulti-Mesh Hyprrcube: A Scalüble Optical interwnnection Network for .llüssively Parüllel Computing". Jo~rimil of ' Liglir- i t y r i v P~+li~ro/c)~q: Lbluine 1 2. No. 4. 1 994. pp. 704-7 16.
[84] .A. Louri and H. K. Sung. "Scalable Opticül Hypercube-basrd Interconnection Nrt- work fbr Milassivcly Parallcl Cornputing". .-lpplietf Oprics. Volume 33. No. 33. No\.. 1 W4. pp. 7558-7598.
[Ssj D. B. Loveman. "High Performance Fortran". lEEE Par-allel mtd Disrr-ihiired T'cchr~olog.. Volume 1. Febniary 1993. pp. 25-42.
[S6] Luçent's Wavestar LambdaRouter. IEEE. Cumptrrc.,-. lanuary 1000. pp. 26.
[S7] S. S. Lumetta. A. bl. Mainwaring. and D. E. Culler. "Multi-Protocol Active Mes- sages on a Cl uster O f S bl Ps". SC9 7: Higli Peijo~maricc iVentwrking uiiri Conzp~rr- iiig Ci)~?fc'i~rtce. Novcmber. 1 997.
[SSJ K. Mackenzie. J. Kubiatowiz. M. Frank. W. Lee. V. Lee. A. Apanval and M. F. Kaashork. "Exploiting Two-Case Delivery for Fast Protected Messaging". Pro- ceediigs of' the 4th Ittterwnriotral Siwposilrm or, Higl~-Peq?mm-uice Compzrtei- .-lrr-Jtirecrio-e. February 1998.
[S9] P. J. Marchand. A. V. Knshnamooflhy. G. 1. Yayla. S. C. Esener. and U. Etion. "Optically Augnrnted 3-D Cornputer: System Technoloçy and Architecture".
Joirr-ml q#'Pur-del mid Disrribirted Conipiiririg. Specinl lsslre or1 Optical hiferrori- riecrs, Febt-uary 3. 1997. pp. 20-35.
P. K. McKinley and D. F. Robinson. "Collective Communication in Wormhole- Routed Massively Parallel Cornputers". IEEE Cor>tpiirer: December 1995. pp. 39- 50.
P. K. McKinley. H. Xu. A. -H. Esfahanian and L. M. Ni. "Unicast-basrd Multicast Communication in Wormhole-routed Networks". lEEE fi.arisac*tioris ori Par-alle1 r i r d Disaibrrred Si.sterrts. 5( 1 2 ) : 1 252- 1265. Deçcmber 1994.
V. N. b!orozov. H. Temkin and A. S. Fedor. "Analysis of ii Three-Dimensional Computcr Optical Schcme Bascd on Bidirectional Free-Space Opticül Intercon- nccts". Opri~wl Errgiiic~cv-itrg. Volume 34. No. 2 . 1995. pp. 513-534.
T. Ylowry and A. Gupta. "Tolerating Latency Through Sotiware-Controllrd Prefetching in S hared-blemory blui t i proçessors". Jotrr-mil q/'Purallel ciriri Distrib- irtcti Couipictit~g, 132). 199 1 . pp. Y 7- 106.
S. S. Mukherjre and M. D. Hill. "Using Prediction to Accelernte Coherence Proto- cols". l'roccerlirrg.s of' rire 25th .-ltrriiicil I~ice~.rrutio~iuf Simposilrni o/l Co~?rpirrer .-l~z-liir~~crirr.~~. 1 0 W.
S. S. Mukherjee. B. Fiilsafi. M. D. Hill and D. A. Wood. "Coherent Nrtwork Inter- hces for Fine-Grain Communication". Praceetlirrgs o/'rlte 3rlr .-lriri ird Irrterwa- fiorral $~.rnposiiun or1 Compirrer- -4 ~rlzi~ertic~r. 1 996.
'lationnl Coordination OtFice for Cornpirfirrg, hzformczriorr. ancf Comniirnicririoris (NCOiCIC). http:!!www.cçic.gov:.
L. M. Yi. "Should Scalablz Parallel Cornputers Support Efticient Hardware Multi- c.stI?'- . Iirtcrwrrtio~irrl Cor~fL;r~orce on Pot-allel Pr*ocessbig. Workshop. A pnl 1 995.
R. A. Nordin. A. F. Levi. R. N. Nottenburg. J. O'Gorman. T. Tmbun-Ek. and R. A. Logan. "A System Perspective on Digital Interconnection TechnologyT*. [EEE Jow-1102 oj'Lighi*ni*e Tec/zriolog?: Volume 10. Junr 1992. pp. 50 1-827.
N. Nupairoj and L. M. Ni. "Benchmarking of Multicast Communication Services". Ecilr~ical Repori LCISC'CPS---ICS- 103, Lbfic/~igun Srafe ~L',liwrsih: September
[ 1 OS]
[ 1 O')]
S. Pakin. M. Lauria. and A. Chien. "High Performance Mrssaging on Workstation: Illinois Fast Messages (FM) for Myrinet." Pioceetfirgs qf 'the Sirpet-conipiiring '95. Nuv.. 1995.
K. Panajotov. hi. Nieuborg .A. Goulet. 1. Veretenniçoffand H. Thictnpont. **A Free- space Rrcontigurable Optical Interconneçtion based on Polarization-Switching VCSEL's and Polarization-Sctlectiw DiKractive Optical Channels". Pioceetliiigs o f ' [ / I L ) Optics iii Conrpirriitg, 19%. pp. 15 1 - 154.
T. M. Pinkston. "Design Considerations for Optical Interconnects in Parallel Com- put ers". P~-u~wdiiig.s of ' ille First Iitrcremr rior 101 Ikbi-kslt op oit .\ltissiiv(i. Ptri.cille/ P~~occssiiig L Siiig Opticcil Iiztei.coirricc~rs. Apnl 199-1. pp. 306-322.
S. H. Rodrigues. T. E. Anderson and D. E. Cullrr. "High-Perf~mnûnce Local Areü Communication with Fast Sockets". L'SEh'Lri' / 99 7 .-l r i i r i rd TL'cIliiictd COI~/~ ' I 'C ' I IL '~ . January 1997.
M. F. Siikr. S. P. Levitan. D. M. Chianilli. B. G. Home. and C. L. Giies."Predicting Multiprocctssur Merno- .4çcrss Patterns with Learning htodrls". Pia~*eecli>tgs of' rite fiiri-rcmlrli Iri~~-~itltiuitcd Co~l/~iri ice oii .\luc/liiie L~.mriitg.*' 1997. pp. 305- 312.
S. R. Sridel. "Circuit Switçhed vs. Store-and-Fonvard Solutions to S~mmrîric Communication Problrms". hocw.ciigs qf' rlie 4th Coizfiwm-e oii &perriibe C'ontpiir~w riittf Coiiciirre~it .-lpplict~tioits. 1 98 9. pp. 3 3 - 3 5 .
Ci. Shah. J . Nieplocha. J. Mina and C. Kim. R. Harrison. R. K. Govindaraju. K. Gildea. P. DiNicola. and C. Bender. "Performance and Experiençe with LAPI -- a New High-Performance Communication Library for the 1 B M RS/6000 SP". Firsr .Lfci& *ntposiirm IPPSSPDP 1998 l lrh Iim~watioital Pordel Piacessiilg -ni- pusilun d 9th .$wiposirrm oii Pardiel aiid Disri-ibirted Pioccssi~ig. ! 998.
R. S heifert. "Gigabit Ethemrt". .-lddisoti- GVeslqi: 1 998.
M. Snir and P. Hochschild. "The Communication Sofiware and Parallel Environ-
ment of the IBM S PT"' IBM reins Jotri-wl. 34(1):205-22 1 . 1995.
C. B. Stunkel. D. G. Shea. B. Abali. M. G. Atkins. C. A. Bender. D. G. Gnce. P. H. Hochschild. D. J. Joseph. B. J. Nathanson. R. .A. Swetz. R. F. Stucke. M. Tsao. and P. R. Varker. "The SP? High-Performance Switch". IBM Systerns Journal. 34(2): 185-204, 1905.
H. Sullivan and T. R. Büshkow. " A Larse Ssale. Homogenrous. Fully Distributed Parallcl Machine". hoceedhgs ofdie 4rh .4riiiiial $mposiirni 011 Co»~ppirrei. .-klii- recrirw. Volume 5. March 1977. pp. 105- 124.
V. S. Sunderam. -'PVM: A Framework for Prirallçl Distributed Computing". Coiz- ~~11171(1c~: Pi-cicrice oiid Espericricv. Voluinc 34). Deceinber 1 990. pp. 3 1 5-3 39.
T. Szymanski. "Hppenneshrs: Opticai Interconneçtion Networks h r Parallttl Com- putins". Jotiixal of'Pmnlie1 mti Disn-ibirrd Conrpirriiig. 16. 1995. pp . 1-35.
Y. Tanaka, M. btatsuda. M. Ando, K. Kubota and M. Sato. "COMPaS: A Pentium Pro PC-based S M P Cluster and its Experience". Piocwriiirgs of'rAe PC-iVO?ÇVY4: Iiirci-iidoiznl Ilbrksliop 0 1 1 Pc.rsoizrrl Conipirrw brised .Ventvol;ks Of' IVOI-kstcrrioiis. iii corljirrrcrioii with PPSLSPDP'YIY. 199s.
R. Thakur and A. Choudhary bgAll-to-all Communication on Meshes with Wom- ho le Routin y ". Pi~ocwtfiugs of [lie I Y Y4 l~zrt.>-iicrriomd Pai*a/ld hoc~essing S l nzpo- siirni. 1994. pp. 56 1-565.
W. Tezuka. F. O'Carroll, A, Hori. and Y. Ishikawa. "Pin-tiown Cache: A Virtual blemory Management Technique for Zero-copy Communication". Firsr Mer& Sinlposi~im PPSiSPDP 1998 12th Intei-nafioiial Parallei Processing spposilrm d; 9rh Simposiirm on Paralle1 und Distrib~ited Piocessiig 1998.
K. Thulasiraman and M. N. S. Swamy. Graphs: Tlieory and Algorithms. John Wiley. 1992.
G. Tricoles. Tomputer Generated Holograms: A Historical review". .-lpp(ied
Optics, Specinl Issue oii Conipiitei Geirerored Hologranis. Volume 26. No. 70. 1987. pp. 435 1-4360.
[ 1 3 1 T. Von Eicken. D. E. Culler. S. C. Goldstein. and K. E. Schauser. "Active Mes- sages: X Mechanisrn for Inte,~~ated Communication and Computation". Proceed- i~igs q#'t/le 19th .-!iiriirczl ln~ci~~rntiot~al $rnposiirn~ 011 Conlpirrw .-lirhirectir~r. May 1992. pp. 256-265.
[176] T. Von Eickcn. A. Basu. V. Buch and W. Vogels. "U-Net: A User-Lrvel Nenvork Interthce for Parallel and Distributcd Computing". Procredings of the 15th .ACM Symposium on Operating Systems Principles. Dçcrmber. 1905.
[ 1771 D. S. Wills. W. S. Lÿcy. and J. Cruz-Riverri. "The Otfset Cube: .4n Optoelrçtroniç Interconnection Yetwork". in K. Bolding and L. Syndrr (ED.) Parallel Cornputer Rouitng and Communication. Springer-Verlag. LNCS 353. pp. 56-1 00. 1994.
[ 1 IS 1 P. H. Worley and I. T. Foster. "ParriIlel Spectral Transform Shallow Watrr Modei: .-\ Runtims-tunable parallel benchmark code". P~ûceeditigs of'the Siwlcihl~j High P L ' > - # U I W I ~ C L ~ Co~~zptiINzg C~II#EIPIICL>. 1 pp. 207-2 1-1.
[ 1 T. Ylitüyai. "Optical Cumputing and Interconncct". A-oc~criitg.~ qfIEEE. Volume 84. No. 6. June 1996. pp. YX-852.
[131] Ci. 1. lhylri. P. J. Marchand. and S. C. Esener. "Speed and Enrrgy hnaiysis of Diy- itül Interconnections: Cornparison of On-chip. OR-chip and Free-Space Teclindo- gicis". .-lpplicd Optics. Volume 37. No. 2. Janunry 1998. pp. 205-227.
[ l3 l ] T-Y l'eh and Y. Patt. "*Alternative lmplcmentation of Two-Levei Adaptive Brançh Prediçtion". P~aceedirigs O/' flic /9tli .-lr~i~ual Iitre~-~iotiotml Siviposirrm on Com- puter .-lr.cltirecrir~r. 1992. pp. 124- 134.
(1321 N. Yuan. R. Melhem and R. Gupta. "Compiled Communication for All-Optical TDM Networks". Piwcceditlgs ofrlre S~rpemmprizi~ag '96. 1996.
[l33] Z. Zhang and J. Torrellas. "Speeding Up Irregular Applications in Shared-~Memory Multiprocessors: Mrmory Binding and Group Prefetching". Procecrii,igs of' the 3 r d .-ln~ilinl Sinrposiirni otr Compirrer .4rcl1iternrre. 1 995. pp. 1 88- 1 99.
Therefore, the pure computation time is equal to t4 - t1 - ((t4 - t3) + (t2 - t1)). To compute the pure inter-send computation times, I need to know the exact times before and after each MPI call. For these, I did not insert the MPI_Wtime call in the source codes of the applications, but instead I wrote my own profiling codes to gather the timing traces. Thus, each MPI call in the applications calls its own profiling code, as shown in the following example for MPI_Send.
The MPI_Wtime calls give the times t2 and t3, before and after the profiling call, PMPI_Send, respectively, while what I really need are the times t1 and t4. It is clear that there are overheads entering and exiting the profiling code, in addition to the overhead of instructions i and ii. I computed these extra overheads for each type of MPI call used in the applications and subtracted them out to find the pure inter-send computation times.