49
AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics and Computer Science Division - . . I.~ * - Parallel Solution of the Time- dependent Ginzburg-Landau Equations and Other Experiences Using BlockComm-Chameleon and PCN on the IBM SP inter iPSC/860, and Clusters of Workstations by E. Coskun and M. K. Kwong .. ;**j~ . Argonne Natioral Laboratory, Argonne. Ilinois 60439 operated by The University of Chicago .or the United States Uepalrnment of Energy under Contract W-31-109-Eg-38 7 a P. At.. I I I n4L

digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

AN L-95/49

III

M1"

Mathematics and ComputerScience Division

Mathematics and ComputerScience Division

Mathematics and ComputerScience Division

- . . I.~ * -

Parallel Solution of the Time-dependent Ginzburg-Landau

Equations and Other ExperiencesUsing BlockComm-Chameleon

and PCN on the IBM SPinter iPSC/860, and Clusters

of Workstations

by E. Coskun and M. K. Kwong

.. ;**j~.

Argonne Natioral Laboratory, Argonne. Ilinois 60439operated by The University of Chicago.or the United States Uepalrnment of Energy under Contract W-31-109-Eg-38

7

a P. At..

I

I

I

n4L

Page 2: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

Argonne National Labortory, with kacilitics in the states of titinois and Idaho, isowned by the United States government, and operated by The I Uni; crsity otChicagowilder the provkions of a contract with the Depatmeit of "1ncrgy.

DISCLAIMER.This report was prepared as an account of work sponsorecd by an agency' atthe United States Government. Neither the I nited States (iovcrnment norany agency thereof, nor any of their employees, makes any warranty, cxpre<,sor implied, or assumes an 1egali ability or responihility for the accuracy,completeness. or usefulness of any information. apparatus, product, or process disclosed. or represents that its use would not inftigc pnvately ow nednghts. Reference herein to any specittc commercial product. process. orservice by trade name. tradkiark, marniuacturer, or other ise, does notnecessaily constitute or itnpl% its endorsement, recommendation, orfavoring by taw United States Government or aint agency thereot. the tbs «n

and opinions of authors expressed here do not neceswanly state or reflectthosc ot the United States Covernment or any agency thereof.

Reprnluced frmti Ott hvi aviluble c oijy.

Available to DOE and D)(? c&luntr;ltvtwrs fi'*i the(ffitr e , Scientitt i anT ethical lif; t nalttion

P.M Box 62(.)ak Ridge. TN 37?it

Putctes availa!Ae tic'm 1421 s'(7 ""401

Available to the pub ic fimrt theNational Tchmical Infomiiion Start ie

i.S. Iepatr nme f Comi rc5295 POr R id Road

Spriligfield, VA 22 6;

Page 3: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

ANL-95/49

Parallel Solution of the Time-dependent Ginzburg-Landau

Equations and Other Experiences Using

BlockComm-Chameleon and PCN on the IBM SP,

Intel iPSC/860, and Clusters of Workstations

by

Erhan CoskuLni and Man Kam Nwon9

Mathematics and (Computer Science Division

S.pt1 eumbe, 1995

MASTERDepartment of Mathernatical ciences, Northern lhliuois University, L)eKalb, IL 60115. Present ad.

dreas,: Karadcniz Technical University, Department of Mathematics, Trabron, 610Jlt Turkey 1.il:*rhanlostOl.bia. ktu.edu. tr

2This author wa" supported by the Mathematical, hinforuatiku, and Computational .itiencc. 1 )isiuo. .s.bprogram of the Office of ('omputational and 'echnology Rcsearch, U.S. Department of Energy, under (unt raet

W-3b-1O9-Fu,-3.

DSTRIBIm ON OF mut s tM Nt' ' I M3

Distribution Category:Mathematics aid

Computer Science (UC.405)

A RGONNE NATIONAL LABORATOItY9700 South Cass Avenue

Argonne, TL 60439

I

I

Page 4: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics
Page 5: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

DISCLAIMER

Portions of this document may be illegiblein electronic image products. Images areproduced from the best available originaldocument.

Page 6: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

DISCLAIMER

Portions of this document may be illegiblein electronic image products. Images areproduced from the best available originaldocument.

Page 7: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics
Page 8: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

Contents

Abstract 1

1 Introduction 1

2 Preliminaries 2

3 Test Problems 4

4 I'arailtel Programs with ilockComii/Chameleon 64.1 ProgSum BC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64." Prog1'iH ............... .. ..... ... .... .......... 94.3 P rogPd RC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . )44 ProgTdgl C . . .. . . . . . . .. . ... . . . . . . . . .. . . . . . . . . . . . . 11

5 lusterss of Workstations as a Parallel Computing Environment 13

6 Parallel Programs with PCN 146.1 V rogP iP('N . . . . . . . . . . . . . . . . . . . . . . . . . . . ..........146.2 lProgPd PCN . . . . . . . .. . .. . . . . . . . . . . . . . . . . . . . .. . . . . 16

7 Conclusion 19

Acknowledgments 20

Appenlix: Program Listings 21

Iteferetices 37

iii

Page 9: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics
Page 10: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

Parallel Solution of the Time-dependent Ginzburg-Landau Equations andOther Experiences Using BlockComm-Chnaieleoii and PCN

on the IBM SP, Intel iPSC/860, and Clusters of Workstations

by

Erhan Coskun and Aan Kam Kwong

Abstract

Tiite-dep'rident Ginxibrg-Landau (1'I)GL) equatitnmi are cvjnidertd for ,11tode rig athin-ti iim limit rit ,tilrcoTnId.or placed under magnetic field. The prtbiemii then leadtyto the um: of sO-called itatuial boundary condi.Itis. (AomplmutAtionAl domiaiiin i- partititlued

mt ubdoin s ad bond variabbls art' used in obhtaiing tlie correponding discrete }y tem

of equaLtiu.1n Au dliclint tiruc-dilkreciciug it'ethod laued on the Forward Filer mlt& lIIl isdevelctped. Finally, a variahlc treugth magintic lkld re-utiing i a %ortex olititdn 11 T1 %p11 ligh T. suip(rconductin; tilms is intOrduced.

We tackled our problem using two dilferean state-of-the-art parallel cormputing ttol!31ck(omm/h(.harnelcon and CN. We had access to two high performance distributedmemory supercomputers: the Intel iPSC/860t and lIt SI'l. We alo teated thc : .. tk,using, as a parallel computing environment, a cluster of Sun Sparc workstations.

1 Introduction

it, our ';t miy of thet. m teati'lmCal mnodo-ling -.f sulp rt t wtI~i ivity, t1 htv. (lMv-lopied altf14 1#i1t

algorithm to solve numerically the time dependent Ginzburg Landau (TDGL) equations into diml1(en1ionls (see [3 ). Tht currespondling problem in three diiliisienoi, is, howetvi, viry

comnpt Lati onaliy extensive. The stud y is iimpractical on a conventional uniprocesSor comnpi er,even if the most efficient algorithm is used. The inumerical simulation 'if such Grand Chrllernge

pr blems (t e th reei tmier iional TI'I )GI, in its rti ire generaiti v) Il.-pei d- 1,11 high jperferm.ni;,computing techniques and resources.

We tackled the 31) problem using two different state-of-the art parallel! computing tool,:l;ceck(?omm/Chamieleon and PCN. the development of both involves Argoune scientists. Site

the coimpletion of this work. a new tool. the Message Passing Interface ( M I '!9}. has emerged.It 11, ;1 ei t iexi-111 Tr pE1ti1 to llot-c me 1.t1" StAlifr(ld mtessge-a.shing tool. 'uture extt iion

of our work will dcfrnitely include MPI. We had access to two high.performance distributd-niotmorv .;upivircom)Iputers: the Intel iWSC/6ti0 and lIM S P. We iLso te--td the rvde :;ing,as a paraie-1 coriptaing onvironwnis.t. , custvr of Snn Spare wtrkstatljimls 1i, the M\atlhemiiaics

and Cotmputer Science Division of Argonne National Laboratory.Although our main objective was to develop a parallel code for the for ward Euler method

(so 'It 11 , lv the TDG I. i-quat i trs, wE . arr.-d with iii; *'t siiplt 1 r warm up prt'blom". Oir

experience wit. these three problems is also described here: the' are used as examples to

illustrate ;omi, of tw concepts of parallel programming tools. More in-depth li~t'isciti of allOie prolmi'ills ; ri-l- r fl il thli rE'pErt, .t liEhr with all compl'tr pIarallei rod ,e; anm rn lilt

protieires, is Pive? in (s,.I)a v. Levin- of Ar gonre National ,Lalboratory has alst dAvelp t le parall-l c mlo Stir :.h itg

thw TWG1. trrirtg lhock(omrim. hut with a differ-it method of discretizirig the equations; c Irl-stul

Page 11: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

i:. 1101, and the forthcoming paper 1161. Earlier, two other colleagues, Paul Plassmanii andSteve Wright, developed a parallel code for solving the static Ginzburg-Landau equations using

optilmizativin techliiquies; their work is reported in [61.

2 Preliminaries

We begii by introduciug some terminology that will he used thrnighout this report. We alsobriefly describe the parallel programning tools and etvirotiments we used.

When a particular instance of a code or a part of a code is executed on a machine, all ofthe work needed to execute that portion of the program is referred as a single In.k, tor proces.

Pai-ailce processing is information processing or niumnrical computation that enphasizs th'concurrent iiiainipulatilOnl of data elemnents belonging to one or more processes in solving a.single problemuu.

Early supercomputers achieved concurrency with the method of pipcitnirng. namely, by

dividing a coItputation into a number of step that are processed ilk an as eibly-line fashion.

More modern architectures use ttultiple CPUs, eacih capable (f executing instructic-ijs vittirkyindependently of others.

HIow a processor arcesses the computer meitory u(shujrfd mtemnory or dishributed tijr mIewy)i

affects how a parallel program will be desic tied and coded. It is generally ac cepted [1x} trashared-memory parallel programming can usually be (lone through minor extenions to exi.;tingprogramming languages, operating systems, and rode librariF's. On t he ot her hand, (ist ribntid-

memory programming is a bit more involved. but it has the advantages of massive paralk'lism.Onr Pxpi-riments were done excsi lively "n distrilbuted-mN memory envi iriinment

A pwirnll syscnt 117) is the coifihitiat ion of att algorithm and th' Iarallel a r~litei ir, on

which it is itplenented. As mentioned in i17. the perfornian-e of a parallel algorithm cannothe evalnated in isolation from a parallel at chitecture. Threfore, it is more appropritN' ti talkabout performance of a parallel system than performance of a parallel algorithm.

Vrius ietrics are used to melia,;ure the ;wrformatce of t parallel tyi:temi. We mIntti. 4u

Wily a few tif tlt'm below.

. h'll- pogiu'el runt 6mte at( his' ed titl mtu from 1 the moment a parallel compt 1.a(t i rn s

tot thtie imotmetnt the last processor finishiest execution.

* The sp-edup is defined as

, Mrial run tIme Gor ti#he st Sefiwntial alg"rithm

para.ilel run tim e using p processors

The .specedlup .S , rcPIrc i1i tIc beu1- L of ru lvinp, a problem in j)Erallel using p idt; iCl

processors. A more practical definitiont (since it is often difficul t deterin tl, hetsequeintial algorithml is obht ined by replacing the expre,;ion in the unuim rator ahove hvAexeciUtiou til, P of the Arite iNde isiip, a -inigle pr( iPssu r'." We *i r Ilii ti th1 t niddefiuition to evaluate our numerical results.

* The f fit 'y is defivo-d as

i 1 th. ideaL 4 ""iC0 iui f .e I. .%pt (Af, ., - , tt E, = 1.

Page 12: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

* the cost of solving a problem on a paralel systemin is delini-d as the prodhict of parall'l runtime and the number of processors used. It reflects the sum of time that each processorspends solving the problem.

The generic goal in the development of parallel algorithin is to achieve as high a ep ypg o)sible. Tht perfect speadup ., - 1, or optimal eflirienrcy E, - 1, is obtainable (,iily for

essentially trivial probleins. All causes of imperfect speedup of a parallel system are collectivelvreferred to as the (rcrhfad resulting front paralel processing. Some factors that cause overihi.e:dare as follows (see [131, [171, and [183):

* lack of a perfect degree of parallelism in the islgoi ithim),

e lIa k of )erfe(t I lit14d balancing,

* communication or contention time, and

* el:tra collmputaticon.

in the ideal situation when each coiupuiational step of an algorithm can be done independentlyof the other steps, we say that the algorithm lhs a perfect degree of paralle.lisim. In reality,this ratrvly happens. A processor often utst wait ill the middle of a rim utirlil ;I hits rew-ived

all the data or information from other processors it needs to execute the next computational

Load balfnzing is the assigmltnnt of tasks to the processors of the system so as o keepeach processor doing useful work for as much of the time as possible. The determination ofthis optimal assignment is also called the mapping problem. Load balancing may I( achi've'deither statically or dVnamicallV. In static load balancing, tasks are assigned to processor' atthe beginning of a computation. In dynamic load balancing. tasks are assigned to processorsi, ti' computati n procc:'eds.

In distributed memory system, each processor can address only its own local memory.Commciunication between processor., takes place by me.ssayr pxloimj, .t proce~ that take. iel

ativ.4y more time thais direct access to local memory. In a shlared-;ttemfory system, all ith.

processors have access to a common memory. Each processor can also have its own local, butIilolited, mn'enory for program code and intermcaediate results. Conn micatic. beetwc'en indi-vidual processors is through the common memory. A major advantage of a shared umetmorySystem is the rapid crm iimtnicatioti (it data between pr ce'SS.Cr5. A ,f'rlows (1i.dd.1(antage is that

.1;(Tere'n prcest)rs may wish I it11 me tw t fl cOmiln mremr i orv ati ab ogp t the sai timet 1i 1' especiallylly

'wh1'len ew values are to he deposited), in which ca&w there will he a delay until the memory isfree, or until the proper order of access is established. This delay is called contf iiOm 11i(.

An efhicietL serial algorith in may not lend itself to .'ihlcient parlIelizatios 1)ecae' ofthe dependency of computational steps on results from previous steps. As a consequence, arrdesignT of the algorithm necessitating eztha compiuation may I) reqxuircd. In 111 extremesit 'cation, a better serial algorithm may have to be sacrificed in favor of ati inferior title.

We close Ithis sECtion tby in roduiciag the paidlel ptugrahsnsing tools Chamcleon. Block-Comm, anmd PCN, used in our study.

Chameleon is a library of low level. comprehensive, and ve"ry efficient ies-sag'-pa singrout inies dev'lopedl by W. Gropp and B. Smith (11'.

Hlfc)rk('O"m is a library of high-lev'l message- pausing rout ines icgnic'd hy C rocpp ile

manage the effcient communication of blocks of data between processorr. It provides shortruts for many cotsnoin message-passing tasks often fund il the cC1m5 putatlOnil tLchiceiie el

Page 13: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

domain decomposition. Both packages are ,till tinder active development. One can consult.(7 for the most cuirrient documentation about BlockCommn. Although the use Block('ommgreatly simplifies the coding of domain decoiposition algorithmiis. it ds in it provide I hl- datt

reduction and broadcast routines that are needed in our case. Hence, we have used a combi-1aliol of C(hamleleon and hxlock( ouiuii rqimttilles in the ,mlw program. Alt ough Noth p~axkage:,

have both Fortran and C versions, we have chosen Fortran as our prograiniing language (seeStitiona 1.

PCN (Proram (7oiposition NvotatiOni) is a ptarallel progratinilg lauguNgo devrel ,wd

jointly by Argonne (1. Foster), Caltech, and the Aerospace Corporation. It provides a paradigm

for composing parallel prograins out of modules of parallel or sequueuitial sbrtuitiines t hat maybe written either in PCN itself or in more conventional prograniuning languages. The progranmier needs to specify only which modules are to be run concurrently and what datae onununications are needed between modules. T1he actual assigumuent of ta isks to spedi ic pr-cessors and message passing are transparent to the programmer. See E"; for more itforniationand its use for various parallel environments.

The two programming tools we used are highly portable over a wide variety of computerarchitectures. We have used three different parallel environments in our study: the InteliP'SC,/R0, the IBM S', and clusters o Sui Sparc wuJrPsitations. Al of them are ishitriluted-

memtory imultiple instruction multiple data (rrtmlni) sysu emits. For each problE'mi, the tamt!'

program i recompiled with the appropriate makefiles i ere used in the three systems. The IntelilS( '/86(3 at A rgoinne has eight itcde,. All pr cesCr nodes are identical aud are ctmlnectod by

bidirectional links in a hyperdulbe topology. See [1? for its hardware and software specifiratiotnS.We used this machine mainly for program development because it is freely accessible and there

is ro limiitati m om the amivunit of time onie can work ol ihe machine. The A rgt inne 1I11M Sl"has its nodues. Fach node is an 1S/6000 miodel 370 and has 12 NHytes of mlemiory per nde,l (;B te local disk per node, full Unixi on crach node, and a high-perforniance Omega switch.The pep:k performance of each lnode is 123 NFlops. 'There tire several t.raiusp'urt layers on

the SUP including EIji. Ulli, and p4. EUllI is the low-overhead implementation of ihe U1interface. LI!I is 1KMs . mmessage-pasing interface to th" higlh-performanice switcih. See lM1 foimore current information about the SP and how to ruse these transport layers.

3 Test Problems

In this section. we describe tih' fojur t est problemS in our experiments. Our ultimatum goal i to

derekl 1 a parallel code izmiplemtieiting the Forward Euler algorithmii for the TDGL et atitls. Aswarm-up trials, we experiuented with three simpler but computtatlomdy intensive problems.

The first two problems tre ex;,mnl jes of the partitioiig teclurhjiie known I fti.Hiwtiii!

dE r-( >. vitirr tuhe others use the dontain decomposition technique.Problem 1: We consider the slowly divergent harmony :eric:es

t= 1

.a heriaticians are interested in investigating its rata' of divergence. 'rie extremely slow I4teof divergelicc -if the -erier means t hat a large iimno lnr of tertils will ble iwe-d'lo ii n uciNer:

'Th- nrk d'-.cnbcd in tih. report .da e doutrml; the pcrd v! M. .99-Mu I'i91 . Sizi ti s ti !i',+ ',i,- 1 11 .t rulfyc'uium ll. hun..' ', vr+e i, Anld n re neflwn'it t 'n-mmunication switches haxe' b.r-n nst aid

Page 14: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

eIperime tws and tis requiremiwut makes the problem an interesting example for parallelprogramming. A parallel code using lllockComin to compute the partial sumis wilt bw pre -iit P-d

i nvether wit Ii some performance results. The tode will h referred to as ProgSumBC.Problem 2: Our second problem is a well-known simple numerical integration problem.

It has been the arch-example used in the introductirn of many prallel program ming toolmeatsauik. Th ob'jectiv is to approximate the integral

f( x)dx,/i

where.1

I 1+ r-

by tlsin-g the rectangllr rzie:

If(/) = hf(:r1),a-I

where I = TI umbcr and x, = (i - ) jh. One can easily modify ProgSumBC to obtain a parallelBlkCXo bf1il code ror 1his pre.t A parallel P('N ctdi- for thi. problem , itm1d Prog?iPCN,will also be presented.

Problem 3: We study the following two-dimensional PDF':

--- 1 . a (CY fi) = 0 (y.y)

im (0. 1) x( U, 1) with the houndary conditions

t(.) = 0, u(., 1) = ., a(0, y) = 0, u(l.y = ya,

where c is a ronti .at. Tho exact solution, a-s one cait s'asilv verify, is u = ry'. Fiy approzimating

the seco'.d derivatives ini the PDL' bV the usual C011trlT (Iilfl'T&'Ieci formua1111s. v',' tso tail 1.h ii1.atr

systcln of eg(plations

S-20 + V ' 21 +- F*- I)- - r~y - - 0,.

for i - 1,..,.V - 1, j - I....,. - 1, where >z - 1/N. Ay - imethotd, x,iAz. yl : jAy. We use the notation U1 to,-. denote the value of U at the point iabsve

tlir. cerruut ISFIP, and so ut.

Hv expanding the function uhx,y) as a Taylor series at the point (x,,ki), we see that

hla truncation 0rror invc{ves only the fourth.or(der derivatives of u(x, y). Since u(.r, y) =ryluth nj and U, are identically zero. 'herefore, the trnncation error is ideiuicdly Y"'ro as

wvll. 'haej the pliaalueter c i, greater thian appr xiaasktely 2 r2, thie coetflci('t mtatrix L1 the

linear sv-tem is positive dt'firite (see 1211). The SOR (s ccestiiv-'( gverrIuai g,., inothf i,therefore. guaranteed to converge if the relaxation parameter is chosen from the interval (Q2).Th piar.allal rodeat for thtis problems with KlwkosrCom m; and PCN. which we :aMuAed ProgPd-RC,a1d ProgPdePCN, respctivdly, ;are given it t, e app-eeldix.

Problem 4: Matchematical d Taiols of the saGL are given elsewhere (see 3. 1 I(, 15!,,Iia(d the r'faiernces cited therein n. It sufikes to say nnktt we ure soling a ?ytem po (pmit ldi fT''rey Lti a~luamtio IOaE'qIat i',ns governing two unknown funct in aft sane andi(j~ spac.' p' .,itinan: ii

rl

Page 15: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

chlomplex-vallued scalar . (called the order pa:rameter); and a three dincusional vector A (called

the vector potential). We %sed all unconventional method >N; (14') Io dscr-t ize fhe eqatilis

with respect to the space variables. The resulting system is then solved using a forward Euler

met iod. A parallel TilockComm code ProgTdglBC, for impleilwiiting this atlgoritliim is Kivenlin the Appendix. Since the code itself is rather complicated and specialized, we will present

in this report only the performance results, and refer the readers to 131 for a detal diicussion

of the code. We Iio t1( 1 hat. We have also developed a parallel P(CN code for 4his prcilbleimi, bit

performance results were less complete. As a cons(-quenice, we have decided not. to preset ills,cod4 in this report

4 Parallel Programs with BlockConm/Chameleon

4.1 ProgSunBC

ProgSumBC i the parallel program for Problem I written with llock ( ommn awd C(ha neleonr.

We give the program listing blow aid explain it coilent. TH1 litl r brstc in IIh li-1(i1ghave been added for ea.y reference and are not part of the code. 'I he subrout ine calk that

begin with the letters BC are lok Conm routines, while thoso that begin with PI tri-Chameleon routines. The fir7t five lines of the program declare the appropriate funct ion rmiirir

ndlii variables.

1 integer function corker()2 integer nbytes, Plmytid, myid, sx, ex. N3 integer intaize, msg.int, Paallprocs4 paramater(intsizc4,msg.int=1,Psallprocs'O,nbyte;8)5 double prec.iion ti, t2, SYGetElapsiodTino

Strictly speaking, the tauie ProgSumBC refers to the file PRofSumBC.f th4t (en1;ii .1

Vortra subroutimiv. called worker(), as declared in line I above. The worker() .ubroutinen IlA very rmiu like le . rr.spming meq i col. r.r Ile sagime poI ble.t ico st ig 4f

instructions for the numerical computations. In the actual execution of a parallel program. theCOSIpU t'r nceds SIimC C ..- .r rh'1t( ad instr;Ct iOIIS, such as initial setup dirii live. ( to :zrod U p1 he pr 11 r1s, Ii e1 i.bllish t('IfII igt ijq ii lIlhk. Vo1.III)! I lem1) 1Il .i'.1-11 dirotee im (1ti

after all the computations are finally completed). Many parallel programming took require

the programmier to explicitly include these instructions. in their programs. Chaiil-ou adst hi

toe iin't ruci ous, 9uch as PIC'all used to czdl worker( ) in a parallel execution mtiode, h6t it

proVides a convnient alternative that frees a user froimi his extra effort. Overhead! inistrtio.S

#hat. M-, commnnic to m 4ost program mi have been colct ei in a uiain sutroutinv arnl p r copiulAi 'difllthe' bjrit files fiua in .0 (for I'i rtran codes) and cuain.o (for C Codes). the a pproplriat* 0.I1P

of which is to be linked to the coipatational mubrmutie whoa compiling the provra m. Themeo derate price to pmy is that one no longer theirk4 i1. ter mo of writing a meaig F-rtrmn ecide (,.,

a rainO ro(utill" ill { J. 11111 just a fIn.'t ti),. with the titandatory name vorkero, a' we hanv.

done in Limp 1.

C

Page 16: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

6 myid = Plmvtid()T if(myid ,eq. 0) thenS print*,'Oumber of points'9 read(5,*) N

j 10 endif11 cali PlbcastSrc(S.intsize,0,Psallprocs,msg~int)

When the code is executed on thte coniptiter, every processor is given the same set of in-structions contained in ProgSumBC, bitt not every processor will execute all the steps containedhi Ohw program. The prograin uses l I 1 number of thl calling processor (c 1 ta; l in litio6 using the Chameleon routine PImytidO and assigned to the variable myid) to determinewhich segments of cotles are appropriate for the pocessor. Lines 7 to 10 are an example ofsuch a segment. One of the processors, that with ID # 0, is given the rcspncsilbilit y t obtain,interactively) the user's input of the number of terms in the harmonic series to be suiii1uicd.

Line 11 calls the ('hameleon routine PIbcastSrc to broadcast the value N to all processors.Even though orly processor # 0 is the sender, and all other processors are receivers, this routinemust be called by all the processors. Roughly speaking, PIbcastSrc is shorthand for processor

# 0 14 enid a message to all other processors, and for all other processors to wait for thismessage to arrive. The arguments of PIbcastSrc are, respectively, the variable (bdtiffwr) thatcontains the message, the size of the buffer, the lD of the processor that broadcast the message,

i w sol 111 pro ci ss .rs ;. hat rOCei v.. ii Ih A g e ( by c(dveintins .rc s, X11 pro M '41.0 iyv,,iked w loll

this argument is O). and the data type of the Imessage. For more prCis4 syntax u'firit ions ofChan.eleon rouitinc call, consult the ('hameleon manual 1 1.

1 12 call getindex(N,sx,er)13 call PIgsync(0)14 tI=SYGetElapsedTim( )1S call compute(sx,ex,myid)16 t2=SYGetElapsedTime() - tl

N'ilw Ihat each 1 rle**,~tr kn-ws the val cia of N, lit- iiNext step is 1i- fii d tmt oh- raniiy- ',

.hose tcrlis in 1t1c harmonic series that it is responsible to work on. This is done in line LA by(ai"ing- the subrolutine getindex to (m.pu11())t. the indice of the start.iiig terit sx and the lt.tterm ex in the range. The subroutine getindex is given below.

IN hn 13, a global .VyInchr,)imzatiion call li use to make all the processors hegin t ining attihc same tim-. Liires 14 and 1I return it clip apsed tine cis id by the 1hrouiiIie 4 1ipdIIUte iii

line 15, which does the acttial stimiming.

Page 17: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

subroutine getindex(mxsx,ex)

include '/home/gropp/tools.n/blkcm/mosh.h'

integer mx, sx, ex, nd

integer sz(0;&,0:0)

integer myid, nproc. PInuntid3, Plmytid

nd-1

sz(szmdim.0) - mxsz(szisparallel.0) = 1sz(szndim.0) = -1myid = Plmytid()

nproc = PInuatids()

call BCGlobalToLocalArray( nd, sz, nproc, myid

I

i

fpI

11

sx sz(szstart,0) + 1

ex = sz(szend,0) + 1returnend

The 131ckCmmttt subrumitinp BCGlobalToLocalArray deteritines t1h afpprupriat. data

donm;Qn Ihat a processor i: responsible for. given the decomposition style nd. the number4 processors nproc, and the processor ID # myid. The BlockCoatuu call slore, its ipsults inthe array sz. The precise definitions of each comuponents of sz are given in the nia ntmal.

subroutine compute(sx,ex,myid)

integer sx, ex, i, nxyid

double pxecision sum, workcum=0.0do izsxex

sumasum+IdOintegralenddocall Plgdsum(sua,>1,work,0)

i; (myid eq. 0)thenprint*, smalll' , sun

endif

returnend

The first part of compute finds the partial runt of tin series from the trim With index

x ti Ili, torn w ;iid x tax, inwiw vo-ly. The call PTgdIum findk O th g 4~baIuf (-{d 'ieir

precision) sun. by adding up all the results stored in the local variable sum attached to sat iprocca:or. The other arguient3 f4 PIgdsum are, re p.ctiv.ily, the length of the array sum (inOw 1'111 t rF'1 Sum 1, a scal;4r and so the value of this argulnitt! i lllpby 1), ni I;ri

work of the aine .iize as sum to be use as a work area to compute the global sum. and the,(t of pro es.scars involved (L. me ti4ued -- l i*er, a I I I If 0, 1by ! ItNv ti1 1 , I i 11,Le'.t i kit 4111

''' r-o 'r, are .it b. included ). The result of tlhc' comn put: tion, the gfohai suni , overwrites te:ocal auw originally stur(d 'n the varidbe sum.

leilie seIf-e llpa nu tlry per fet ai m r. i r~'iil .,w iIuai t w-, ili l'igirt I

I1

jI

f

I

I

J

I

1

Page 18: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

E

C

0.

0 5

1

0.8

0.6

0.4

0 20 5 10

8 -

4 i

2 !tt

n t 0'J0 10 20

Number of Processors

0.

0.

0.

0 10 20

80; -,- -

60

40-

20

0

0

a

tt

1

.8

4 10 20 10 ?0

Number of Processors

Figure 1: Parallel run time and efficiency versus number of processors for ProgSum.C-SPI-EUH,ir m with N 10.,000,000 ( t.eft), N_ 100,000.000 (.liddle ), N =200,000,000 ( Right

4.2 ProgPiRC

One z.eeds only to modify the computation routine compute in ProgSumBC to get a parallelcode for Problem 2 in lockComm. As a matter of fact. the only difference between Problemi aiid I'roblom 2 ii the form of t 1, termn, in the eries to I- simed. In {,thi word, rhe oTh y

ir'al res') ' n01(ed ar' in modify ing the line "sum=9um+1d0integral."

We include this example to make the point that once a prototype parallel program hasb.poi, writ ton, impo si. (?f It car he reused to writ,- aioth(.r program. IIenre, thlte iitir otiiv., i ti

is wurthwhii".

4.3 ProgPdeBC

OC r iwi, h' id w i tl ion for Pr ll 4 emi :, i5 to dlc.,mipose the domain in whirl, the piiri4T d lf,r-

"(tI 'I7Ipticn 1i 1 eme +nt as"c11 111 a o 11 .t i d e.1l~ laill' a- the lttmlwr 1?f ptroirv m , itsr1 r.ACh

150 ----

100

50

00 10 20

1.2

0

t

{

1 2 . -

11

6i

44

Page 19: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

processor is assigned the data of one of the subdoinains, called a block, and a share of tiv cowmutations that involves mainly data in the associated block. At each time step. each processoralso r'qi i res s(e rie ox tra inform inii 6 froni p roc o I SI r 0 itedl with I git igH c e ii g 1Ilocks ill

order to complete the assigned computation. in most domain decomposition algorithns forwc'1vin1g partial difTerential 4eieatioici, this extra informiiationr is typi-cally dala carried by .4'

of lattice points, the so-called ghost points, that borders the sulbdomiain. Tlhe "xcha:g" of information aiuon- processors is perfurted by mnssagc-passing ibri,.ry calls. A two diiie isionMCo0Ijp11tatioinal domllaile willi a typical ,iihdlomeain atod it,. ghost point, for ai five-p in" ,te-i4t

is illustrated below.

j1

,J

F%,ure 2. A nine processror (lecoml position t(f a 21) domiain with ghubt points (0)

If only a general purpose. low level nessage-passing tool, :such t ihame'lon, i.; U-e towT itr a parai.li '!mAil ii decom1i iti.n a lgciritlmn, oni h as to ini.lue o xpE1icit code s'gmiilts t I

*. dfine each su bdomuain (i.e., (lt'termic tie ranges of indieices for the Iattice poin ibltbeltilg6 to the subdoinain ).

.: each subdoman W a p1rcer'Mr.

:. determine the ghost points dnd tihe flow of mila s and

-i. send and receive each message explicitly.

Block 'omn pr-vides ,uhioutine ails to autotuat" these steps for a wi>d class of comiulalfo4

1ilaiii ''eliimpiisitiH}11 algcrithtms fur roctai riiolar dn miaiiis 1c' x mit rplE' I he ( ;l

BCGlobalToLocalArray, usved earlier in the subroutine compute in Section 1.1. take5 careof Steps 1 3. Another szbroutinr BCexec() can he used to automate Step 4.

T)w c-i le ProgPd"C is givon in the Appeudix. Suisme performia ce reult are pre-tU'ted ir. 'table I. For .his partic ular ex perinivit, c - 20. w { retaxa ti piarantie'r 1 = 1 . and

we tae ued 5)00 grid pi intjs adid 1000 iter;tion t-,ps.

10

Page 20: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

Table 1. Performance r.silts for the ProgPdeBC-SP1-EU1H system

1 536.82

12 245.3 I 5.2773 U.13!8

20 -224.0 .. 7707 0.28

4.4 ProgTdgIBC

The code for ProgTdglBC is rather long and is give in the Appendix. It ha,, been ru coif t Iind I iPSC,';60, the IBM SP, and a chister of Sun workstations without further mnodificatioit.

Typical performni4e results for the ProgTdg1BC-iPSC/860 And ProgTdglBC-SP1-P4 sys-toms are' plotted in Figm, ;. The latter uses IhI- ver'wi ';f W1 iuk C1 . 11n1 tiaf i s .,- ton the

p4 inacro package. develop-d by E. L. Lusk at Ar.gonne. and uses the Ethernet transport layer.The graph suggests that the speedup for the first parallel system is far better tlan thoat .. f

Ihe ,-mi(d. This ie Io Iith, facr ihat tour I tEt prtbl),wml has a rather iiw granularily for 60-SP. As a result. SP nodes have to spend more tune in communication than in computation.This explanation is confirmed by the fact. that when we switched to the imores efficient trdhlc.Ip- tjayer . ' I1 for the SP, the speeslup cIrve shows a much heiter purformiiance.

I.,

2.1123

-. 0.5 9

Nul. of PIoc. Paradlll Run Time SE

1294.95 I

2 '391.55 j 12

e(.d up

. 725

1ffici"uy

1

0.6031

0.50742

Page 21: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

Zu SP 1

4 -t

21- I

I S !l 5 6 7 8Sumbes of Th a.sa."

.; :-3. Speo.dup for thc ProgTdglBC-SP1-P4 and ProgTdglBC-iPSC sy.te-in

1J

Page 22: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

U

$100O

E

C 500-

0 -

4 0 5 10 15 20 25 30 35 40a

d 5r- 1

X10

N-5-

0I

.5

i'O 15 20

0W

0 L.. I . L _ f

' 5 10 15 20 25 30 35 40Number of Processors

Figures I. Swim.. p.erfortiancro risitits for thi ProgTdglBC-SPI-EU1H system

5 Clusters of Workstations as a Parallel Computing Environment

)u. to the kw acceS priority given to parallel j bs rinitg i' the backgroud. performtan-( oha ) t-if wor, rlations is not cosiStent. varying, according to thi deuiaud of ol hwI 11r, 1sthe workstatiorm . I hir enivirowzI1 i" ther . 111W11or1. tmily d usm tor te.st i ins Ati ftli db'.j ilig.

.\n j, ow Ob er t ri pr ,."s vreat;ifl oll r,0,,pq4l, w41rtKsIlI Miof, tA ., l u rai ..-rbtln)i .tj,1Ut

I :t

Page 23: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

of time. 1 vpical performance results obtained by running ProgPiBC with n - 10, 000,000 ona coilhctioit of workitations are shown below. Here, the real and svstem ties arc obtailnwd bythe Unix's time ct'mmanid and elapsed time is computed by lhe program.

Table 2. Performi ance rults for ProgPiBC on a cluster of workstations

( tuite !In sec)

Time Number of Workstatins

_ 71 --2 3 5 6 7

teal 19.0 t1. 57.2 F9.8 5S.0 60.3 60.6 71.0

System (12 0.6 0.0 1.1 1.4 1. 1.6 1.8

pd .8 . 4 3.S4 J 2 .03 25.4 2 IT .

0 Parallel Programs with PCN

6.1 ProgPiPCN

i ma=n(arge:, argv, rc)2 { ? argv ?= [L.ninervals,intervalsize) .>3{4 ;::ys:stringto.integsr(n.intervals,ni),5 :ys:stringtointeger(interval.size,1i).

h-ni*li,

, 7 ;i.th=1.numberx,8 :ainbody(ni,li,with) in vts:array(ni),9 rc=0

10 },11 defauli ->

12 (;ntdio-printf("usage %'/, <Iint:ervals> cintsizw>\n",argv[01), )13 o = 1

14 r

15 }

Tlv! S)yilt;aY elf PCN is ,imiilar to that of C. The Oxmilua., however, i its.'t ;%, lhe cuimaidterminator, while the semicolon is used to dcci: re a sequential procedure. ProgPiPCN covnistsof Ue 1'( N pr.)rddsi ar d a Firtran j)TetdIXtre. The arguriinebIt argc aiid argv of nain()1ha 111 11sual Lm.''ltings as in C, and re i used for a reoturi curde. But tliilho it 1', tie

ar:inplits to main() must be specified in the definition, whether we are planning to paS* ,n1crimaniii dJ line arguii we.iIk tti t plt jrogri I nt. lille 2 tevl i dud.tl pIrpf)t : ti ilumirter

of c'nmmarnd line ar:.um1lNrits is cieked. and if i hat is eqai tiwo. th.e valiie, f argv r1] andtargv[2] are signed no r intervals and intervaL.size. In ihues 4 .. Pt'N's sys mioditlc

11:,ed to di-fine ni anti i1 i to be t1 intrfro tege valUef. represented by tilh striys n intervwis:nd interval.. siz , r-spoctiivety. hi lines 1J 7, lihe tetal n't ibei r of p m IIII ' nil4d IO.- wi *I I or

11

Page 24: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

the intervals are computed. Line S is a call to the procedure main..body: the infix operatorin i, used to specify the map function vta:array (i), which < reatPs a virtual array thpolugyof size ni. This topology guarantees the' portability of the program Across difereti. computerplatforms. See (i) for more on virtual topologies and map functions. Line 9 sets the returncode v, riable to zero. ines II 1 l print an eiror garage aii case tIhe 11um1be of a rgo'!;Iilse'1

supplied iS wrong.

16 mainbody(ni,li,width) 17 p l n ;

17 port globals[nodes()J;

18 {i i rectangle(ni,li.width.globals),

12 display(0,0,globals,ni)

?.0 }

The built-in fiancti-in nodes() determines the Ill mbler of zisude's preweat. In linte 17, a

port array globals with nodes() elements is created. This port array is used for the globaloperations to be performed later. Lines Ma I.) are two procedure calls to be executed in parallel

Moti..The firbt procedure call impipenmenits tlhe rett angular rule to approximate thne v-Ah e' of:r a;;d the second displays the results. The' rmle of the arguments passed to these proredriresis clear from the context of the program.

21 rectangie(ni,li,width,globals)

22 ?ort global [;23 {11 i over 0 .. ni- :: ni intervals "/

24 start.interval(iiwidth,globals[i])4node(i)25 }

start.interval(i,li,idth,globals)

double sum;

compsum_(li~width, sun),Rtd2o-printf("li=%d vidrh="f sue ='\1r",{i with, nn,nj.),globals={sue},stdio:print("globals=/f\n",{ltiobals},_),

Thp itrat-ive < 4k.truct in hine 2a creates ni instanues of start interval( ). euof whIch

ralIs tho Fortrav1 1)roredire (compsum tl 10C11tP1te the fo Al ce 1 r ibuion te 1Ihe vahci of r. T1i

value is snapshot by the definitional variahle globals for uise in the procedure display.

display (count,globsam,glhbals,ni)

port global U ;1 {? count<ni

{; display(count+1,globsum+globals~count],globalsnz) },default ->

;

15

Page 25: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

We ran this program on the Intel iPSC/s60 and on the WIJM SP. The performance resndts

for ProgPiPCN a)e illsl ated helow using gauge, an e'cition p-oiler for VCN progr1ims. l Thii

utility provides many options to ariayvze the performance of a parallel PCN pri:graiu. migthe.se are the profile data. for the titie spent in each procedure on each node, the number of

iium .wauh pri itedoore is ralkd, idle ties, iiterili'de ie'si age cutl i .u 'ohutiwS, a11 varittlis

statistical results based on these profile data. Th first graph pertains to :rogPiPCN ri n onthe ln ei PS C/tC( with eight nodes.

M A S--- - - -

C N~vts cabtr~eboa . .?arAfl '

i,,etla on lm

I e-V I *_U- 0 45 5-- w C. ' M 3.01 3.1 1 .CK71D)* 1A Id ___Tom ___________ __ _____

.jg swta. i It... , - -' a f

.dau -- ,-i -- -M

sy wsTar.IJ _l (itui( ir miaa..i sc );01

vinyl: ~ *inuirulM S1e~ry7 rw r l I "j / " - rI

&,. 74

gimf R T

Ta't I ul cma 70Tnoa' *'.Av wnster:3 2TMU

Execution Time i'y Procedures

Figure 5. Execution tli ie metric of ProgPiPCU

Tlw gra ph lh w thl Pxc ut i m time nti rit if ProgP iPCN by prov ll ;rN. 'h e' p11(s Itrt

names with the prefix compi hejong to omr code, and thi other procedures are in the built-inPCN inoduke sys and std io. Notice that the time ;pext by Ihe I'ortran proccdurc cornpsum i:

tmulchw greater 1'r 1,11 1at 1 of sib h r prIedir t irpw1 d b ,l w iiedit h gra1hi i., 1 t i w\,'q j Ot

times. the number of reductions. and the nutriber of suspensions. A redeicition is orEi rou pletedexecution uf a process, ad a .;jspension ocrci ra when a process requires value of .su undefineddeufiini;tri.al vaiiibled . A prow"" suspend.i'i 1 until the defliiti nal a ib" E givel I viihm"

6.2 ProgPdcPCN

for he code ProgPdePCN we discuss only the procedure named square. A which imapt, ach bi-ckto a nOdF ii. a virt ual a aty t1ij op logy. The ther CA ri 1 elrre 1 1 re i ii 1.1r ii ' lf ProgP iPCN

'If 11mpIe tude I_ Aiven in the ApeCndix.

16

Page 26: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

squaro(max.iter,globals)

port l [nodos()],E[nodes()],4lob&ls[]:{Hf i over 0 .. isize:

{HI j over 0 .. joize-1 ::(mo=id(ij),

startblock(max.iter,i,j,Nfine] ,H[id(i, j--1)].,E[me) ,Elid(i-1 ,j)l.globals [m-]) node (me)

} }} }

start-block(max~iteri,jN,S,E,,global.s)

The domain is decomposed into isize horizontal and size vertical blocks. Each block isa:imwd an I) nibnler by the fUncdI ion id and mnapped to the tilmeber node(me) if tHie array

nodo. The port arrays N[nodes()] and E[nodes()] are used to communicate data on the

ghost points (which form the edgo). Notice that the north ghost points of block(id(ij-:))are 'he outh gh.'st pcilt, of block(id(i j)). And the inrth input of a h1.ck is the snuthIioutput of its north neighbor. The procedulres send..edge and receive..edge in the A ppendixsend and receive data on the edge.

Figures 6-- give the performance results of ProgPdePCN run on tli IBM Sl1 with iho,

nodes. The first graph shows the execution time by procedures. Notice hiat the time tiled bythe computational procedure compute is a bout one hundreds timps i hi by tie clluillN'ilatienI

procedures get-edge and receive.edge.T he second graph shows the time breakdown by nodes. The gray b.iis reprent41 i4le tile

c h ie the hark ones represent t he execut ion time. Not ic I h: t oat h node spend. A re' at *e ellIhi

auoun! of time waiting for data f: un other node. To improve performance, one must findways tc, reduce this idle time.

i

Page 27: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

Usajge Calr fsitarttc Su st , r"]eta 'lean Colar S:, a-ctr

BuckeucThrraD.. "inor scale U tzert

Current Gnarsho Slaci _ --- - f- GwAll

axcuuon Tioe

1.--13 Iw-GI C.OCl S.C I'

pe m.mhwv, _d 4 . M 5.t . zwe*'. jd/. j

P-mw11..s.... WSpian .llsuc.iV s&.

01deo rsar_utltn 2i

pden "l:"k t2

1I

11

ii

it III

fl t11 111 I

I'

lii 1

1pI'

1J

Total liseruiion Ttrme ;mmns:+rswasuos 1 19 .7om.j RedU.tisaa 532Toa! guapr.s:ana. 195117

Execution Time by Procedures

F i. r G I'.rorrt,?Ix.r,4' of ProgPdePCN on hf SP

UsaR ; Cl) Sattos bL D clt, j C-:-!s--]'.m -- aEwbt m e

Ur.zn,.-a- r. r.,t 1rPTV~.r.edur4

Currwmt S p.,ot s._achoI- --

II ncmuAJJ

Time a aldaw

u 0 is A as 30 3e 40 d 0 * on

Total Execation , s (mIns:s:rsoacs). 1 19 b2Tf~a eAr!' AftSo: Wre?

Terai Ssmrsrnwwns: i95'.tr

Time Beakdown by Nodca

{i,w .. Tinw Irxkoirw by IiN

Co

as o 1 A

- - -

--

- l ~ !

[

Page 28: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

the third graph shows the exeiIh t time by procdtire aItd nodes. The time is r'ptesetnted

by the color (unfortunately, the celor cannot be reproduced in this report) of the square thaicorre'spunds to the procedire a.rl node.

Via s:;t'al [Staljs [uhtieb ?e Iclor Cl1 (Help .r~f cQuit

Z aam 4.4, . ucket

f ivr l Un"r ayPraceduroa

- -r t G-ap a S Uct-

r h . A u

t("arn Rin a

_.._... S

r@"a.w1o. ds_ 0.Q.J. -r _"1a",..Sd

-g m-. j 3 d s i

Total r nUlbenn Time teninaseamace); 1.1.0WTOW ot dwuod, 53=Total Smaponrin. 195137

Ex'clltl. r Timek b-y Prcedtires and N(Kl s

Figure 1 . FExe,rutioii timt- by p'rocedkres and nlodo.

'1 Conclusion

[lie observations -.iv(n beimw are ba~ised coi our limited experience with ther tooals. And mlay

('Venl be bIit .1d,

. PC X is a p~rugrammiivw li aguage, whereas BlockCommu is a library of routines. Froma uispr't point ofview, ibil. means that to use ] CN, one has to master the la~guagE~

synna;t, whereats I t t R11,('kC( mm/11Ch1amtlet' , (fe hw; t(1 learn how andt( wh}ei, 1(I rlS(

the BlockComim/Chameleon subroutines to modify a sequential code. The new .\P1 tool

is mo(ri like OhP latter.

" For mtor,, complicated aPPlications, Block("omm must be supple mente d by Chamlleo n

resutini ., for para.lBei 1/0, data reduction, broxadcasiting, ec)

(t Alilmigh BI,,ck(:fona1 hta! versions for bo th Fowtr,-n and C, writinn a tl4m,,itt de4 cmtpoo-

sition code in C is not as convenient, because C arrays cannot bn declared with arbit rary

Index rmiges. itre4'(d, tnur originall ~fequentia.l T (G L ((KIP wavs writ ten in C, ,1111 we hilv1

r,( rcozvfert it it, Frrran t( L,1_k advan.ta;ge or th(, i(rk~Ctonal1,e i,;t!; ".

" '1 fwe (tirfr lit. M od.( 'mim doit mflentiltlon i:* v Titt,,r Ior F"''Orir.Ll ll. :, «intled:, t 1,4t Of

'lt:, ll, l" I is ,I (' u, f . iote w " ,.-d it, Ili, t'hametr(, n Imuti in w(, otrl , N I

19

Page 29: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

program. we have to sometimes guess the Fortran sytLtax for Somie Chamemlf n routine

calls. It. would bo of great help to the users if both rOrtrmi and ( docuIimctintation.s forthe two packages were available.

" Ti use PCN to rewri'e a sequential code in general involves rr'lalvAV movie mifooi t hu.

to ~e a tessage-passing tool.

* Siuc', the coinpilation technology for C('N is still in its infa.ncy (and so i6 iot A., g d

as that of Fcrt ran or C), a program written entirely in CN usually do not prod uce

the most ctficicnt code. The approach of multilingual programming permits us to takeadvantage of the unique features of PCN, such as mapping, communication, and sdteidul-ing, to conpiement the proven efficiency of Fortran and C programmIing fur sequentiialcomputation [4". This approach calls for dividing up a sequential program into some con

vnruient parts anud converting these pieces to procedures to be called by lCN. A Fo.rtrasequential subroutine can be called from PCN directly, except that the suflix <- has tohe appended to the subroutine name to form the correspond VCN procedure name. Inthe cisC of C suhroutinet, ;'rgun~'its (except arrays) paaed t. a C pruc(-dure frzm PUN

must be declared as pointers in the C procedure.

Acknowledgments

NWe thank mur colleagues who have made their work on the various paraiel prograu:ulnr too4.available to us and helped us with 1 ativ of our questiorns. 'his list includes Ian Futcr, W0fliam1Gripp, Ewing L Lusk, and Stev, Tuecke. We also thank Dave Levine for sharing with uus hiversionl Of parallel TDGI. code and Paul Plassiua.nn for his parallel GL code: botn provided

ahbi la ' :Mi5 I an c to get u., s1mted 11i lear iii Illtc n ('uni iii.

20

Page 30: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

APPENDIX: Program Listings

ProgPdePCN: A PCN Program for Problem 3

include "grid.h"ikdefine id(i,j) (((i+isizc)ixisira)+((j+jsir. 1'jaixe)*illire)main(argcargvrc)

(7 argv "'-.,maxnumof~iterations] ->

sys:string_to_integer(waxnum-of~iterations pax_itcr),mainibody(maxitar) in vts~array(isizeisize),rc-0

default ->

stdio:printf("usage:Xs <maXiter>\n'",argvl)l},.),rc=1

}}main_bodiy(max_itor)port globals[nodesi) :

({{ aquaro(eaiter,globals),i tsFlAy(0,4,globair)

}

squ~ar"(w wx_ itar, global s)port N;nodes)Ernodes()jglobalsL1;

{H i over 0 .. isize-1 ::(11 j ovor 0 .. jsize-1 ::

start~blocktaax~iter,i,j,NU[e).iid(i,j-1)I.t[aej .Eid(i-1.j)j.globalarmel)Onoda(me)

}

}

startblock(saziter. i.;,tj.SEWY.lobals)double 1sua[b-jz.b1z1,.dgerbsz):if! ={PI1,3o},E-{E.EO},

{ ? S?-{So,Si), W?-{WoIi}.- > {;

iriit iliz _(i,j square),start~clock(),

blwek(nax~iter~i,j~aquar*,edg*,fNi,Si,Fi,Wi),(NoSnF.Wn}g}oba1 0}

}biockiw. x_ tei,i, j,eqgarw,«.1gNe,i,gloleil.x, wunt)double squarej].edge, .error;

s!nd-edgo(square,edge,02s,(s1)Ieceivk~edge(ui,si~ei,ai,1a,Ia1),cnwpte1V.o_ i,j,RqnirA.nisii.0wi, rror),

Sr.uil <m x.. tet ->

.l Jcukmaitr ,i.. qual.e,vdgsr.isiOal.global5.scount+

21

Page 31: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

default ->

{; topclockO,global.s=arror,

e td io:pr int f("done\n'',.{},_)}

}}aend-edte (square , dge.O .MA,0 IIdMoble square 0, edge ;]

{: getedge.(N(1RTHsquareedge),5[tedge}l t12,

getedge.(SUTHl.squareedge),S- {edge}ISni.

geLedge_(EST.square, edge),E,-L{edge})EI1,

getedgA_(WES T,egquarw,eigo),W-[{edge)tW1J.

}

}

receive_aedga(ni.ai..:i,ei .Is,Tst)

i i?-{NS,E.W} ->

{(I{? N'?=(.{nn}Ikt_twpj ->{;ni-nn,H1-ML_etp}},{? S?-f{ss}!S1_tApI ->(;si-ft,1-Stap}),{7 ETglfae}iFEt_twp] ->{;mot=P.,EL-EI~tmp}},

{? w''=CtY }l11tmp] ->t~vtyv ,vl-w1.'.mp}},Tll-(Ni1,S1,ESW1}

}

di splay(count,globiax , lobd.s )port. globalst]{? count<isizuej Size ->

;tewpmax-g1obalscount],

getaax(globaax.tcxpmax.ne._anx),d isplay (C oulnt, now-MAX,gl obal! .,)

de.fault ->t;stdzo:przntf(""axerror-%f\n",{globnax}..),

Atdiopr imtf("don't\n", _}).

}

gc Aax(x~y.--){' x>y ->-,t.

Iinc.1lude ''ogvid.h'

subroutine jintialize(i blockc)

nteg.er i, j

double precizina blockt.S1ZE,r^IZEJteger ii. jj

do i--1, BSZE

''.5

Page 32: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

do 3;=1. BSIZEblock(ii. jj) - 0.0

enddo,nddo

return

end

subroueino compute(i.j,v.ned,sed,ecd.ederrmax)

integer i,j,ii,jjdouble precision v(BSIZE.BSIZE),u(0:BSIZE+1.0:BSlZE+1)double precision ned(BSIZE).wsdjSIZE)double precision eed(BSIZE).ved(BSIZE)double prw isiwn dx ,dy ,crreax,err.,d ,x(BSIZE) .y(AS17FE)errmax-0.0

dx-1.d0/1 isIzi'CBSIZF.-1 .d0)dy-dxu1.dOa=20.d0do ii a ].BSIZE

dn jj = 1.BSIZEu(ii.j;)-v(ii.j j)

.nddo

anddodo ii-1,S17F

u(0,ii)-vad(ii)u(BSIZE+I, ii) (aiA

u(ii.USIZE+1)-ned(:i)u(ii,0)-eed(ii)

enddodo ii -1,BSIZE

x tiii(RST7Fe*i+(i-it)dxy cii)-(BSlZ1Ej+(i.i-1))"dyit (i cq. 0) u(0,ii)-0.0it (i .eq. isize-1)u(BSIZE+1,ii)-y(ii)'*3if (J .eq. 0) u(ii,01-0.0

if (j .eq. )size-1)u(ii,SIZE+1)-x(ii)enddo

do kk-1.20

orrmax=0.0do jji-1.BSZEdo ii=1,BSIZEu(ii,. ; iwlu ii. j )- r"((-u(i i+1. ;;)+2"u(1i. ; )-ut ii-:1i 1)/dj""-

/ +(-it(ii.jj+1).+.n(ii..j-nrii,jj-1)/dye.?

+ auu(ii.jil-x(ii)y(i ))caey(jj)e2-6))/41dz"e"+a)err=ahstuiii.jju-x(ii)eyijJ.e3)*rruax-emax (er ru.1, orr )

enddosenddo

do ii i.BSIZEdo jj-1,USI?.Ev' ii,jj)-ffii.jj')enddoteuddu

retirn

dUtr OUtih r gtt.dgh(idb .k .tdg)douoff- p rer7-ton hiorik bJ!'Lt.hNl/F), rrg1R 1 /r.)

integer i,id

C fior th facef4 aid . fqo. WORTH> thoit

M2/

Page 33: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

do i-,BSIZEedge(i) = block(i.ISIZE)

onddoondii

C' South tacoif (id .eq. SOUTH) then

de 1 1,IlSIEedge(i' - block(i,l)

*nddosndif

East facoif iid .eq. EAST) then

do i-1.BSIZEedge(i) - block(HSIZEi)

enddoenrdif

C went fare

if (id .&q. WEST) thenAn t-I ,iSIZE

edge(i) -block(1,i)enddo

endifI eLtrn

end

21

Page 34: digital.library.unt.edu/67531/metadc283200/m2/1/high_re… · AN L-95/49 I I I M1" Mathematics and Computer Science Division Mathematics and Computer Science Division Mathematics

ProgPdeBC:A BlckCommiii Program for Problem 3

XXXXXXXxXXXXXYXXXXXXXXXxxXXXXX 'XXXXXX'XXxXYX'XXXYXYXXXXx XYXXX

integer function worker()double precision errmax.vovk,dx.dy.vinteger nxnyparameter (nx=501, nyuSl0.a=20)double precision ut(nx+2)w(ny+2)).x(nx+2).y(ny+2)double precision v((nx+2).(ny+2))dnhi. precision ti, '2. SYGetElapseTimeinteger piaytid. pgn,ayid. Uatepinteger qx,txgpwx,PxgJY"yE'g^y^TSg

call indexcomp(nxny.sx,ex.sxgpexgp.+Sy.ey.eyKp.eygg.pm%.rrax-0.0W=I d0call InitDomaint u,nx,ny,sx,sxgp,ex,exgp,.y,sygp,ey,eygp)call initDomain( v.nx.nty,+sI.xgla.exgpsy.aygly.ygp)

call PIgaync (0)1 1- SYGetEl.apsdTim'e)

dx-1 .dO(nx-1'

dy-1.dC/(ny-1)

call bottd(n,x,y~nx,ny.dx,IT,ax~nxgp.ex,exg9ss:."Tygg+.y-0Ygr)cell hound(v.x.ytx.ny.dxdy.ex.ac, : .xwxg.sy,aygp-ey.eygplbegin iteration!.nteF 2000ein 20 ie -~ -o -,call BCxec(pgau.u)

cx}} comput (u,r,x~y~n:,ny,dx~dy,e,ertra*,r.

+r.x,sagp.oxzexpIg,cysygj.oy..ygp)

call PCoec(pga.v.v)

call compute(v.ux.y,nx.ny,dx,dyv.ezrrtax.a,x.xgp.a.z.xgp.sy.sygp'eyoygp)all PIg aatezrrzax.l,work.)

if (uyid .rce 0)print 30.i3.1FiZ(i.rr/7.0).atrwa20 %ontinu.'.0 futaat(Sa~fR.2,i:0,i1F.12)

t2 - SYC tElapsedTiae() -t

print. t , 'Tot i twn - , ori 't, ptiyt i.l )

call BCfrea(pgn)ujtkek-G

:rc'urnend

      subroutine indexcomp(nx, ny, sx, ex, sxgp, exgp,
     +                     sy, ey, sygp, eygp, pgm)
      integer pimytid, pinumtids, iper(2)
      include '/home/gropp/tools.n/blkcm/meshf.h'
      integer myid, nproc, nx, ny, nd, NBYTES
      integer pgm, sz(0:9,0:1)
      integer sx, sxgp, ex, exgp, sy, sygp, ey, eygp

      NBYTES = 8
      nd     = 2
c     describe the global array; -1 lets BlockComm choose the
c     processor grid in that dimension
      sz(szgdim,0)       = nx
      sz(szisparallel,0) = 1
      sz(szndim,0)       = -1
      sz(szgdim,1)       = ny
      sz(szisparallel,1) = 1
      sz(szndim,1)       = -1
      call BCFindGhostFromStencil( nd, sz, 0, 0, 1 )
      myid  = pimytid()
      nproc = pinumtids()
      if (myid .eq. 0) print *, 'nproc = ', nproc
      call BCGlobalToLocalArray( nd, sz, nproc, myid )
      iper(1) = 0
      iper(2) = 0
      call BCSetGhostWidths( nd, sz, iper )
      pgm = BCBuildArrayPGM( nd, sz, nproc, myid, NBYTES )
      call BCArrayCompile( pgm, 0 )
      sx   = sz(szstart,0) + 1
      ex   = sz(szend,0)   + 1
      sxgp = sz(szsg,0)
      exgp = sz(szeg,0)
      sy   = sz(szstart,1) + 1
      ey   = sz(szend,1)   + 1
      sygp = sz(szsg,1)
      eygp = sz(szeg,1)
      return
      end
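The calls in indexcomp follow the usual BlockComm life cycle: describe the global array, let the library compute the local block and ghost widths, compile a communication program, and then reuse that program for every ghost exchange. A minimal sketch of that cycle, using only the calls that appear in these listings (the array u, the counter niter, and the loop variable iter are illustrative placeholders, not part of the library), is:

c     Sketch of the BlockComm life cycle used by these programs
c     (u, niter, and iter are illustrative placeholders).
      call BCFindGhostFromStencil( nd, sz, 0, 0, 1 )
      call BCGlobalToLocalArray( nd, sz, nproc, myid )
      call BCSetGhostWidths( nd, sz, iper )
      pgm = BCBuildArrayPGM( nd, sz, nproc, myid, NBYTES )
      call BCArrayCompile( pgm, 0 )
      do iter = 1, niter
c        refresh ghost cells, then relax on the owned block
         call BCexec( pgm, u, u )
c        ... local sweep over sx:ex, sy:ey ...
      enddo
      call BCfree( pgm )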

      subroutine InitDomain( u, nx, ny, sx, sxgp, ex, exgp,
     +                       sy, sygp, ey, eygp )
      integer sx, sxgp, ex, exgp, sy, sygp, ey, eygp
      double precision u(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      integer i, j, nx, ny
      do j = sy-sygp, ey+eygp
         do i = sx-sxgp, ex+exgp
            u(i,j) = 0.0d0
         enddo
      enddo
      return
      end

      subroutine bound( u, x, y, nx, ny, dx, dy,
     +                  sx, sxgp, ex, exgp, sy, sygp, ey, eygp )
      integer sx, sxgp, ex, exgp, sy, sygp, ey, eygp, i, j
      double precision u(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision x(sx:ex)
      double precision y(sy:ey)
      double precision dx, dy
      integer nx, ny
      do i = sx, ex
         x(i) = (i-1)*dx
      enddo
      do j = sy, ey
         y(j) = (j-1)*dy
      enddo
c     Bottom (sy = 1)
      if (sy .eq. 1) then
         do i = sx, ex
            u(i,sy) = 0.0
         enddo
      endif
c     Top (ey = ny)
      if (ey .eq. ny) then
         do i = sx, ex
            u(i,ey) = x(i)
         enddo
      endif
c     Left (sx = 1)
      if (sx .eq. 1) then
         do j = sy, ey
            u(sx,j) = 0.0
         enddo
      endif
c     Right (ex = nx)
      if (ex .eq. nx) then
         do j = sy, ey
            u(ex,j) = y(j)*y(j)*y(j)
         enddo
      endif
      return
      end

      subroutine compute(u, v, x, y, nx, ny, dx, dy, w, errmax, a,
     +     sx, sxgp, ex, exgp, sy, sygp, ey, eygp)
      integer sx, sxgp, ex, exgp, sy, sygp, ey, eygp
      double precision u(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision v(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision x(sx:ex)
      double precision y(sy:ey)
      double precision dx, dy, errmax, err, w, a
      integer ssx, ssy, eex, eey, i, j, nx, ny

      ssx = sx
      ssy = sy
      eex = ex
      eey = ey
      if (sx .eq. 1)  ssx = 2
      if (sy .eq. 1)  ssy = 2
      if (ex .eq. nx) eex = nx-1
      if (ey .eq. ny) eey = ny-1
      errmax = 0.0
      do 15 j = ssy, eey
         do 15 i = ssx, eex
            v(i,j) = u(i,j) - w*((-u(i+1,j)+2*u(i,j)-u(i-1,j))/dx**2
     /           + (-u(i,j+1)+2*u(i,j)-u(i,j-1))/dy**2
     /           + a*u(i,j) - x(i)*y(j)*(a*y(j)**2-6))/(4/dx**2+a)
            err = abs(v(i,j) - x(i)*y(j)**3)
            errmax = max(errmax, err)
 15   continue
      return
      end
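Written out, the interior sweep in compute is a damped Jacobi step for the five-point discretization of the problem above: with relaxation factor \(w\) (set to 1 in worker), the update the loop appears to implement is
\[
v_{i,j} \;=\; u_{i,j} \;-\; w\,
\frac{\dfrac{-u_{i+1,j}+2u_{i,j}-u_{i-1,j}}{\Delta x^{2}}
     +\dfrac{-u_{i,j+1}+2u_{i,j}-u_{i,j-1}}{\Delta y^{2}}
     + a\,u_{i,j} - x_{i}\,y_{j}\,(a\,y_{j}^{2}-6)}
     {4/\Delta x^{2} + a},
\]
that is, the residual at \((i,j)\) divided by the diagonal of the discrete operator (here \(\Delta x=\Delta y\)). The two compute calls in worker alternate the roles of u and v, so each BCexec/compute pair advances the iterate by one sweep.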


ProgTdglBC: A BlockComm Program for Problem 4

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
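ProgTdglBC marches the order parameter \(\psi = p_1 + i\,p_2\) and the vector potential \(A=(a,b)\) forward in time with an explicit scheme, using link variables \(e^{-i\kappa a\Delta x}\), \(e^{-i\kappa b\Delta y}\) (the cosine/sine factors in comput4) so that the finite differences are gauge invariant. A common zero-electric-potential form of the TDGL system that is consistent with these kernels is
\[
\frac{\partial\psi}{\partial t}
  = \frac{1}{\kappa^{2}}\bigl(\nabla - i\kappa A\bigr)^{2}\psi
    + \bigl(1-|\psi|^{2}\bigr)\psi,
\qquad
\frac{\partial A}{\partial t}
  = \frac{1}{2i\kappa}\bigl(\bar\psi\nabla\psi-\psi\nabla\bar\psi\bigr)
    - |\psi|^{2}A - \nabla\times\nabla\times A,
\]
with the applied field \(H\) entering through the boundary conditions imposed in subroutine bound; these equations are offered as a reading of the code, with the nondimensionalization as used in the report.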

integer function worker()

      INTEGER sx, ex, sy, ey, sxgp, exgp, sygp, eygp
      INTEGER nproc, myid, pimytid, pinumtids, pgm, sz(0:9,0:1)
      INTEGER nx, ny, nd, np, nr, ns, nsx(2), nsy(2), nxm, nym, count
      parameter(nx = 52, ny = 52, nd = 2)
      double precision a(nx*ny), b(nx*ny)
      double precision da(nx*ny), db(nx*ny)
      double precision p1(nx*ny), p2(nx*ny)
      double precision dp1(nx*ny), dp2(nx*ny)
      double precision hh(nx*ny), seed(2)
      double precision time, SYGetElapsedTime
      double precision dx, dt0, dxy, dt, t, dy, rky, tp, h, rk
      double precision dx2, rkx, rk2, dy2

      myid  = pimytid()
      nproc = pinumtids()
      CALL getindex(nx, ny, nd, sz, sx, ex, sy, ey,
     +              sxgp, sygp, exgp, eygp, pgm)
      CALL checkindex(sz, sx, ex, sxgp, exgp, sy, ey, sygp, eygp,
     +                nx, ny, myid, nproc)
      if (myid .eq. 0) print*, 'Reading parameters...'
      CALL maininput(rk, h, tp, nx, ny, np, nr, ns,
     +     dx2, dy2, rk2, rkx, rky, dxy, nxm, nym,
     +     dt0, dx, dy, nsx, nsy, seed, myid)
      if (myid .eq. 0) print*, 'Initializing...'
      CALL initialize(p1, p2, a, b, h, dx,
     +     sx, ex, sxgp, exgp, sy, ey, sygp, eygp,
     +     nsx, nsy, seed, ns, myid, nx, ny)
      t     = 0
      count = 0
      dt    = 0
c     ********** Main loop **********
      if (myid .eq. 0) print*, 'Start time = ', SYGetElapsedTime()
 10   IF (t .lt. tp) THEN
         CALL bound(p1, p2, a, b, h, rk, nx, ny, dx, dy, nxm, nym,
     +        sx, ex, sxgp, exgp, sy, ey, sygp, eygp, rkx, rky)
         CALL comput4(p1, p2, a, b, da, db, dp1, dp2, dxy,
     +        nx, ny, dx, dy, nxm, nym, dx2, dy2, rkx, rky, rk2, rk,
     +        h, dt, count, sx, ex, sxgp, exgp, sy, ey, sygp, eygp,
     +        pgm)
         if (MOD(count,np) .eq. 0) then
            CALL compsum(p1, p2, a, b, hh, myid, count, pgm,
     +           dx, dx2, dy, dy2, rkx, rky, rk2, nx, ny, nxm, nym,
     +           h, rk, t, sx, ex, sxgp, exgp, sy, ey, sygp, eygp)
         endif
         dt = min(tp-t, dt0)
         t  = t + dt
         count = count + 1
         GO TO 10
      ENDIF
c     End of main loop
      if (myid .eq. 0) then
         time = SYGetElapsedTime()
         print*, ' Elapsed time : ', time
         print*, ' Average time : ', time/count
      endif
      worker = 0
      RETURN
      END

      SUBROUTINE getindex(nx, ny, nd, sz, sx, ex, sy, ey,
     +     sxgp, sygp, exgp, eygp, pgm)
      integer pimytid, pinumtids, iper(2)
      include '/home/gropp/tools.n/blkcm/meshf.h'
      integer myid, nproc, nx, ny, nd, NBYTES
      integer pgm, sz(0:9,0:1)
      integer sx, sxgp, ex, exgp, sy, sygp, ey, eygp

      NBYTES = 8
      sz(szgdim,0)       = nx
      sz(szisparallel,0) = 1
      sz(szndim,0)       = -1
      sz(szgdim,1)       = ny
      sz(szisparallel,1) = 1
      sz(szndim,1)       = -1
      call BCFindGhostFromStencil( nd, sz, 0, 0, 1 )
      myid  = pimytid()
      nproc = pinumtids()
      if (myid .eq. 0) print*, 'nproc = ', nproc
      call BCGlobalToLocalArray( nd, sz, nproc, myid )
      iper(1) = 0
      iper(2) = 0
      call BCSetGhostWidths( nd, sz, iper )
      pgm = BCBuildArrayPGM( nd, sz, nproc, myid, NBYTES )
      call BCArrayCompile( pgm, 0 )
      sx   = sz(szstart,0) + 1
      ex   = sz(szend,0)   + 1
      sxgp = sz(szsg,0)
      exgp = sz(szeg,0)
      sy   = sz(szstart,1) + 1
      ey   = sz(szend,1)   + 1
      sygp = sz(szsg,1)
      eygp = sz(szeg,1)
      return
      end

#include "tools.h"
#include "comm/comm.h"
#include <stdio.h>
#include "blkcm/bc.h"
#include "blkcm/mesh.h"
#include "comm/io/pio.h"

#ifdef rs6000
#define checkindex_ checkindex
#endif

/* Write a short report describing the block decomposition. */
void checkindex_( sz, sx, ex, sxgp, exgp, sy, ey, sygp, eygp,
                  nx, ny, myid, nproc )
BCArrayPart sz[];
int *sx, *ex, *sxgp, *exgp;
int *sy, *ey, *sygp, *eygp;
int *nx, *ny;
int *myid, *nproc;
{
    FILE *pw;
    static char filename[] = "blkrep";
    int  i, lx, ly;
    int  glx, gly;            /* dimensions of block with ghosts */

    if ( *myid == 0 ) {
        printf("Writing report\n");
        if ( (pw = fopen(filename,"w")) == NULL ) {
            printf("cannot open %s\n", filename);
            exit(0);
        }
        fprintf(pw, "            Decomposition Report\n");
        fprintf(pw, "************************************************\n\n");
        fprintf(pw, "Total processors     : %d \n", *nproc);
        fprintf(pw, "Global size (x,y)    : %d %d\n",
                *nx, *ny);
        fprintf(pw, "Block Decomposition :\n");
        fprintf(pw, "Processor Distribution (x,y): %d %d\n",
                sz[0].ndim, sz[1].ndim);
        fprintf(pw, "node\tblock size\tblock endpoints\t");
        fprintf(pw, "block w/ghosts points\n");
        for (i = 1; i <= 70; i++) fprintf(pw, "-");
        fprintf(pw, "\n");
        fclose(pw);
    }
    lx  = *ex - *sx + 1;
    ly  = *ey - *sy + 1;
    glx = *ex + *exgp - *sx + *sxgp + 1;
    gly = *ey + *eygp - *sy + *sygp + 1;
    for (i = 0; i < *nproc; i++) {
        if (GTOKEN(0,i)) {
            pw = fopen(filename, "a");
            fprintf(pw, " %d\t(%d x %d) ", *myid, lx, ly);
            fprintf(pw, "\t(%d:%d, %d:%d)", *sx, *ex, *sy, *ey);
            fprintf(pw, "\t(%d:%d, %d:%d)\n",
                    *sx - *sxgp, *ex + *exgp, *sy - *sygp, *ey + *eygp);
            /* fprintf(pw, "done\n"); */
            fclose(pw);
        }
    }
}

c     The input file is read by processor 0 and then the data is
c     scattered to the other processors.
      SUBROUTINE maininput(rk, h, tp, nx, ny, np, nr, ns,
     +     dx2, dy2, rk2, rkx, rky, dxy, nxm, nym,
     +     dt0, dx, dy, nsx, nsy, seed, myid)
      integer isz, msg_int, msg_dbl, all, dsz
      parameter(isz=4, msg_int=1, all=0)
      parameter(dsz=8, msg_dbl=4)
      real*8 dx, dt0, dxy, dy, cdl, xlength, ylength
      real*8 rk2, dy2, rky, rkx, h, rk, dx2, tp
      integer np, nr, ns, nsx(2), nsy(2)
      integer i, nx, ny, myid, nxm, nym
      double precision seed(2)
      CHARACTER*72 discrp

      if (myid .eq. 0) then
         OPEN(unit=9, file='defaults')
         REWIND 9
         READ (9,25) discrp
         READ (9,*)  rk
         READ (9,25) discrp
         READ (9,*)  h
         READ (9,25) discrp
         READ (9,*)  tp
         READ (9,25) discrp
         READ (9,*)  xlength
         READ (9,25) discrp
         READ (9,*)  ylength
         READ (9,25) discrp
         READ (9,*)  np
         READ (9,25) discrp
         READ (9,*)  nr
         READ (9,25) discrp
         READ (9,*)  cdl
         READ (9,25) discrp
         READ (9,*)  ns
         do i = 1, ns
            READ (9,25) discrp
            READ (9,*)  nsx(i), nsy(i), seed(i)
         end do
         CLOSE (9)
 25      FORMAT(A72)
         dx  = xlength/(nx-2)
         dy  = ylength/(ny-2)
         dxy = dx*dy
         dx2 = dx*dx
         dy2 = dy*dy
         rk2 = rk*rk
         rkx = rk*dx
         rky = rk*dy
         dt0 = rk*cdl/(1./dx2/rk2 + 1./dy2/rk2
     +        + 1./dx2 + 1./dy2)
         nxm = nx - 1
         nym = ny - 1
      endif
c     scatter the data
      call PIbcastSrc(np,  isz, 0, all, msg_int)
      call PIbcastSrc(nr,  isz, 0, all, msg_int)
      call PIbcastSrc(ns,  isz, 0, all, msg_int)
      call PIbcastSrc(nxm, isz, 0, all, msg_int)
      call PIbcastSrc(nym, isz, 0, all, msg_int)
      call PIbcastSrc(h,   dsz, 0, all, msg_dbl)
      call PIbcastSrc(dt0, dsz, 0, all, msg_dbl)
      call PIbcastSrc(tp,  dsz, 0, all, msg_dbl)
      call PIbcastSrc(dx,  dsz, 0, all, msg_dbl)
      call PIbcastSrc(dy,  dsz, 0, all, msg_dbl)
      call PIbcastSrc(dx2, dsz, 0, all, msg_dbl)
      call PIbcastSrc(dy2, dsz, 0, all, msg_dbl)
      call PIbcastSrc(rk,  dsz, 0, all, msg_dbl)
      call PIbcastSrc(rkx, dsz, 0, all, msg_dbl)
      call PIbcastSrc(rky, dsz, 0, all, msg_dbl)
      call PIbcastSrc(rk2, dsz, 0, all, msg_dbl)
      call PIbcastSrc(dxy, dsz, 0, all, msg_dbl)
      do i = 1, ns
         call PIbcastSrc(nsx(i),  isz, 0, all, msg_int)
         call PIbcastSrc(nsy(i),  isz, 0, all, msg_int)
         call PIbcastSrc(seed(i), dsz, 0, all, msg_dbl)
      enddo
      RETURN
      end


      SUBROUTINE initialize(p1, p2, a, b, h, dx,
     +     sx, ex, sxgp, exgp, sy, ey, sygp, eygp,
     +     nsx, nsy, seed, ns, myid, nx, ny)
      INTEGER sx, ex, sy, ey, sxgp, exgp, sygp, eygp, ns
      INTEGER nsx(ns), nsy(ns)
      double precision seed(ns), dx, h
      double precision p1(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision p2(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision a(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision b(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      INTEGER ix, iy, myid, nx, ny

      DO iy = sy-sygp, ey+eygp
         DO ix = sx-sxgp, ex+exgp
            p1(ix,iy) = 0
            p2(ix,iy) = 0
            a(ix,iy)  = 0
            b(ix,iy)  = 0
         END DO
      END DO
      CALL reinit(p1, sx, ex, sy, ey, sxgp, exgp, sygp, eygp,
     +     nsx, nsy, seed, ns, myid)
      RETURN
      END

      SUBROUTINE reinit(p1, sx, ex, sy, ey, sxgp, exgp,
     +     sygp, eygp, nsx, nsy, seed, ns, myid)
      INTEGER ns, myid
      INTEGER sx, ex, sy, ey, sxgp, exgp, sygp, eygp
      double precision p1(sx-sxgp:ex+exgp, sy-sygp:ey+eygp), seed(ns)
      INTEGER nsx(ns), nsy(ns), i, ix, iy

c     place the seed values that fall inside this processor's block
      DO i = 1, ns
         DO ix = sx, ex
            IF ((nsx(i) .ge. sx) .and. (nsx(i) .le. ex)) then
               DO iy = sy, ey
                  IF ((nsy(i) .ge. sy) .and. (nsy(i) .le. ey)) then
                     p1(nsx(i),nsy(i)) = seed(i)
                  ENDIF
               ENDDO
            ENDIF
         ENDDO
      ENDDO
      RETURN
      END

      SUBROUTINE bound(p1, p2, a, b, h, k, nx, ny, dx, dy, nxm, nym,
     +     sx, ex, sxgp, exgp, sy, ey, sygp, eygp, kx, ky)
      INTEGER nx, ny, i, j, nym, nxm
      INTEGER sx, ex, sy, ey
      INTEGER ssx, ssy, eex, eey
      INTEGER sxgp, exgp, sygp, eygp
      double precision p1(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision p2(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision a(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision b(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision dx, dy, kx, ky, h, k

      ssx = sx
      ssy = sy
      eex = ex
      eey = ey
      if (ex .eq. nx) eex = nxm
      if (ey .eq. ny) eey = nym
c     Bottom (sy = 1)
      if (sy .eq. 1) then
         do i = ssx, eex
            p1(i,1) = p1(i,2)*cos(b(i,1)*ky)
     /           + p2(i,2)*sin(b(i,1)*ky)
            p2(i,1) = p2(i,2)*cos(b(i,1)*ky)
     /           - p1(i,2)*sin(b(i,1)*ky)
            a(i,1)  = a(i,2) + (h - (b(i+1,1)-b(i,1))/dx)*dy
         enddo
      endif
c     Top (ey = ny)
      if (ey .eq. ny) then
         do i = ssx, eex
            p1(i,ny) = p1(i,nym)*cos(b(i,nym)*ky)
     /           - p2(i,nym)*sin(b(i,nym)*ky)
            p2(i,ny) = p2(i,nym)*cos(b(i,nym)*ky)
     /           + p1(i,nym)*sin(b(i,nym)*ky)
            a(i,ny)  = a(i,nym) - (h - (b(i+1,nym)-b(i,nym))/dx)*dy
         enddo
      endif
c     Left (sx = 1)
      if (sx .eq. 1) then
         do j = ssy, eey
            p1(1,j) = p1(2,j)*cos(a(1,j)*kx)
     /           + p2(2,j)*sin(a(1,j)*kx)
            p2(1,j) = p2(2,j)*cos(a(1,j)*kx)
     /           - p1(2,j)*sin(a(1,j)*kx)
            b(1,j)  = b(2,j) - (h + (a(1,j+1)-a(1,j))/dy)*dx
         enddo
      endif
c     Right (ex = nx)
      if (ex .eq. nx) then
         do j = ssy, eey
            p1(nx,j) = p1(nxm,j)*cos(a(nxm,j)*kx)
     /           - p2(nxm,j)*sin(a(nxm,j)*kx)
            p2(nx,j) = p2(nxm,j)*cos(a(nxm,j)*kx)
     /           + p1(nxm,j)*sin(a(nxm,j)*kx)
            b(nx,j)  = b(nxm,j) + (h + (a(nxm,j+1)-a(nxm,j))/dy)*dx
         enddo
      endif
      RETURN
      END
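Up to the discretization, the four edge updates above appear to impose the standard superconductor-insulator boundary conditions
\[
\bigl(\nabla - i\kappa A\bigr)\psi\cdot n = 0, \qquad \nabla\times A = H \quad \text{on } \partial\Omega .
\]
For example, on the bottom edge \(\psi_{i,1} = \psi_{i,2}\,e^{-i\kappa b_{i,1}\Delta y}\), so the gauge-covariant difference across the edge vanishes, and \(a_{i,1} = a_{i,2} + \bigl(H - (b_{i+1,1}-b_{i,1})/\Delta x\bigr)\Delta y\), which makes the discrete curl \((b_{i+1,1}-b_{i,1})/\Delta x - (a_{i,2}-a_{i,1})/\Delta y\) equal to \(H\) in the boundary cell.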

      SUBROUTINE comput4(ph1, ph2, a1, a2, fg1, fg2, hg1, hg2, dxy,
     +     nx, ny, dx, dy, nxm, nym, dx2, dy2, rkx, rky, rk2, rk, h,
     +     dt, count, sx, ex, sxgp, exgp, sy, ey, sygp, eygp, pgm)
      INTEGER sx, ex, sy, ey, ssy, ssx, eey, eex
      INTEGER sxgp, exgp, sygp, eygp, pgm
      INTEGER nx, ny, count, i, j, nxm, nym
      double precision ph1(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision ph2(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision a1(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision a2(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision fg1(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision fg2(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision hg1(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision hg2(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision vm(60,60)
      double precision rk, h, dt, c21, s21
      double precision dx, dy, dx2, dy2, c10, s10, c20, s20, c11, s11
      double precision rk2, rkx, rky, dxy

      ssy = sy
      ssx = sx
      eex = ex
      eey = ey
c     exchange ghost points before differencing
      call BCexec(pgm, ph1, ph1)
      call BCexec(pgm, ph2, ph2)
      call BCexec(pgm, a1, a1)
      call BCexec(pgm, a2, a2)
      if (sy .eq. 1)  ssy = 2
      if (sx .eq. 1)  ssx = 2
      if (ey .eq. ny) eey = nym
      if (ex .eq. nx) eex = nxm

      do j = ssy, eey
         do i = ssx, eex
            c10 = cos(a1(i-1,j)*rkx)
            s10 = sin(a1(i-1,j)*rkx)
            c20 = cos(a2(i,j-1)*rky)
            s20 = sin(a2(i,j-1)*rky)
            c11 = cos(a1(i,j)*rkx)
            s11 = sin(a1(i,j)*rkx)
            c21 = cos(a2(i,j)*rky)
            s21 = sin(a2(i,j)*rky)
            vm(i,j) = ph1(i,j)**2 + ph2(i,j)**2
c           order parameter: nonlinear term plus link-variable Laplacian
            hg1(i,j) = ph1(i,j)*(1.-vm(i,j))
     /           + ((c10*ph1(i-1,j) - s10*ph2(i-1,j) - 2*ph1(i,j)
     /           + c11*ph1(i+1,j) + s11*ph2(i+1,j))/dx2
     /           + (c20*ph1(i,j-1) - s20*ph2(i,j-1) - 2*ph1(i,j)
     /           + c21*ph1(i,j+1) + s21*ph2(i,j+1))/dy2)/rk2
            hg2(i,j) = ph2(i,j)*(1.-vm(i,j))
     /           + ((c10*ph2(i-1,j) + s10*ph1(i-1,j) - 2*ph2(i,j)
     /           + c11*ph2(i+1,j) - s11*ph1(i+1,j))/dx2
     /           + (c20*ph2(i,j-1) + s20*ph1(i,j-1) - 2*ph2(i,j)
     /           + c21*ph2(i,j+1) - s21*ph1(i,j+1))/dy2)/rk2
c           x-component of the vector potential equation
            if (j .eq. 1) then
               fg1(i,j) = (a1(i,j+1)-a1(i,j))/dy2 + h/dy
     /              + (a2(i,j)-a2(i+1,j))/(dxy)
     /              + ((ph1(i,j)*ph2(i+1,j)-ph2(i,j)*ph1(i+1,j))*c11
     /              - (ph1(i,j)*ph1(i+1,j)+ph2(i,j)*ph2(i+1,j))*s11)
     /              /rkx
            else if (j .eq. nym) then
               fg1(i,j) = (-a1(i,j)+a1(i,j-1))/dy2 - h/dy
     /              + (-a2(i,j-1)+a2(i+1,j-1))/(dxy)
     /              + ((ph1(i,j)*ph2(i+1,j)-ph2(i,j)*ph1(i+1,j))*c11
     /              - (ph1(i,j)*ph1(i+1,j)+ph2(i,j)*ph2(i+1,j))*s11)
     /              /rkx
            else
               fg1(i,j) = (a1(i,j+1)-2.*a1(i,j)+a1(i,j-1))/dy2
     /              + (a2(i,j)-a2(i+1,j)-a2(i,j-1)+a2(i+1,j-1))/(dxy)
     /              + ((ph1(i,j)*ph2(i+1,j)-ph2(i,j)*ph1(i+1,j))*c11
     /              - (ph1(i,j)*ph1(i+1,j)+ph2(i,j)*ph2(i+1,j))*s11)
     /              /rkx
            end if
c           y-component of the vector potential equation
            if (i .eq. 1) then
               fg2(i,j) = (a2(i+1,j)-a2(i,j))/dx2 - h/dx
     /              + (a1(i,j)-a1(i,j+1))/(dxy)
     /              + ((ph1(i,j)*ph2(i,j+1)-ph2(i,j)*ph1(i,j+1))*c21
     /              - (ph1(i,j)*ph1(i,j+1)+ph2(i,j)*ph2(i,j+1))*s21)
     /              /rky
            else if (i .eq. nxm) then
               fg2(i,j) = (-a2(i,j)+a2(i-1,j))/dx2 + h/dx
     /              + (-a1(i-1,j)+a1(i-1,j+1))/(dxy)
     /              + ((ph1(i,j)*ph2(i,j+1)-ph2(i,j)*ph1(i,j+1))*c21
     /              - (ph1(i,j)*ph1(i,j+1)+ph2(i,j)*ph2(i,j+1))*s21)
     /              /rky
            else
               fg2(i,j) = (a2(i+1,j)-2.*a2(i,j)+a2(i-1,j))/dx2
     /              + (a1(i,j)-a1(i,j+1)-a1(i-1,j)+a1(i-1,j+1))/(dxy)
     /              + ((ph1(i,j)*ph2(i,j+1)-ph2(i,j)*ph1(i,j+1))*c21
     /              - (ph1(i,j)*ph1(i,j+1)+ph2(i,j)*ph2(i,j+1))*s21)
     /              /rky
            end if
         enddo
      enddo
c     forward Euler update
      do j = ssy, eey
         do i = ssx, eex
            ph1(i,j) = ph1(i,j) + dt*hg1(i,j)
            ph2(i,j) = ph2(i,j) + dt*hg2(i,j)
            if (i .lt. nxm) a1(i,j) = a1(i,j) + dt*fg1(i,j)*dx
            if (j .lt. nym) a2(i,j) = a2(i,j) + dt*fg2(i,j)*dy
         enddo
      enddo
      RETURN
      END
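The closing loop is the explicit (forward Euler) time step. With \(G_{\psi}\), \(G_{a}\), \(G_{b}\) denoting the right-hand sides just assembled in hg1/hg2, fg1, and fg2, the update performed is
\[
\psi^{n+1}_{i,j} = \psi^{n}_{i,j} + \Delta t\,G_{\psi,ij},\qquad
a^{n+1}_{i,j} = a^{n}_{i,j} + \Delta t\,\Delta x\,G_{a,ij},\qquad
b^{n+1}_{i,j} = b^{n}_{i,j} + \Delta t\,\Delta y\,G_{b,ij},
\]
with the step size chosen in the main loop as \(\Delta t = \min(t_{p}-t,\;\Delta t_{0})\), where \(\Delta t_{0}\) is the step bound computed in maininput.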

      SUBROUTINE compsum(p1, p2, a, b, hh, myid, count, pgm,
     +     dx, dx2, dy, dy2, kx, ky, k2, nx, ny, nxm, nym, h, k, t,
     +     sx, ex, sxgp, exgp, sy, ey, sygp, eygp)
      INTEGER sx, ex, sy, ey, i, j, pgm
      INTEGER sxgp, exgp, sygp, eygp, nym, nxm
      INTEGER myid, count, nx, ny, ssx, ssy, eex, eey
      double precision p1(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision p2(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision a (sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision b (sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision hh(sx-sxgp:ex+exgp, sy-sygp:ey+eygp)
      double precision h, k, t
      double precision sum, p2max, work
      double precision dx, dy, dx2, dy2, k2, kx, ky, p2m
      double precision c1, s1, c2, s2

      ssy = sy
      ssx = sx
      eey = ey
      eex = ex
      call BCexec(pgm, p1, p1)
      call BCexec(pgm, p2, p2)
      call BCexec(pgm, a, a)
      call BCexec(pgm, b, b)
      if (sy .eq. 1)  ssy = 2
      if (sx .eq. 1)  ssx = 2
      if (ey .eq. ny) eey = nym
      if (ex .eq. nx) eex = nxm
      p2max = 0
      sum   = 0
      do j = ssy, eey
         do i = ssx, eex
            p2m   = p1(i,j)**2 + p2(i,j)**2
            p2max = max(p2max, p2m)
c           local magnetic field: discrete curl of (a,b)
            hh(i,j) = (b(i+1,j)-b(i,j))/dx
     /           - (a(i,j+1)-a(i,j))/dy
            c1 = cos(a(i,j)*kx)
            s1 = sin(a(i,j)*kx)
            c2 = cos(b(i,j)*ky)
            s2 = sin(b(i,j)*ky)
c           kinetic, condensation, and field energy contributions
            sum = sum
     /           + (((p1(i+1,j)-(c1*p1(i,j)-s1*p2(i,j)))**2
     /           +   (p2(i+1,j)-(c1*p2(i,j)+s1*p1(i,j)))**2)/dx2
     /           +  ((p1(i,j+1)-(c2*p1(i,j)-s2*p2(i,j)))**2
     /           +   (p2(i,j+1)-(c2*p2(i,j)+s2*p1(i,j)))**2)/dy2)/k2
     /           - p2m + 0.5*p2m**2
            sum = sum + (hh(i,j)-h)**2
         end do
      end do
      sum   = sum*dx*dy
      p2max = sqrt(p2max)
      call PIgdsum(sum,   1, work, 0)
      call PIgdmax(p2max, 1, work, 0)
      if (myid .eq. 0) then
         write(6,991) t, p2max, sum
 991     format('t = ', f10.6, ',  max(phi) = ', f12.7,
     /        ',  energy = ', f16.10)
      endif
      RETURN
      END
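The diagnostic printed here is \(\max|\psi|\) together with a discrete free energy: the quantity accumulated in sum appears to be the Ginzburg-Landau energy
\[
\mathcal{E} \;=\; \int_{\Omega}\Bigl[\,\bigl|\bigl(\tfrac{1}{i\kappa}\nabla - A\bigr)\psi\bigr|^{2}
 - |\psi|^{2} + \tfrac12\,|\psi|^{4}
 + \bigl(\nabla\times A - H\bigr)^{2}\Bigr]\,dx\,dy ,
\]
with the covariant differences evaluated through the same link variables as in comput4, summed over interior cells and scaled by \(\Delta x\,\Delta y\); PIgdsum and PIgdmax then combine the partial sum and the partial maximum across processors before processor 0 writes the line.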



Distribution for ANL-95/49

Internal:

J. H. Beumer (5)
F. Y. Fradin
M. K. Kwong (10)
G. W. Pieper
R. L. Stevens
C. L. Wilkinson

TIS File

External:

DOE-OSTI, for distribution per UC-405 (52)

ANL-E Library
ANL-W Library
Manager, Chicago Operations Office, DOE
Mathematics and Computer Science Division Review Committee:

F. Berman, University of California at La Jolla

G. Cybenko, Dartmouth College

T. DuPont, The University of Chicago

J. G. Glimm, State University of New York at Stony Brook
M. T. Heath, University of Illinois, Urbana
E. F. Infante, University of Minnesota
K. Kunen, University of Wisconsin at Madison
R. E. O'Malley, University of Washington
L. R. Petzold, University of Minnesota

E. Coskun, Karadeniz Technical University (10)

D. Nelson, DOE - Office of Computational and Technology Research

F. Howes, DOE - Office of Computational and Technology Research
