Pattern Language for Parallel Programming, 2004
"Ifyoubuildit,theywillcome." Andsowebuiltthem.Multiprocessorworkstations,massivelyparallelsupercomputers,aclusterin everydepartment...andtheyhaven'tcome.Programmershaven'tcometoprogramthesewonderful machines.Oh,afewprogrammersinlovewiththechallengehaveshownthatmosttypesofproblems canbeforcefitontoparallelcomputers,butgeneralprogrammers,especiallyprofessional programmerswho"havelives",ignoreparallelcomputers. Andtheydosoattheirownperil.Parallelcomputersaregoingmainstream.Multithreaded microprocessors,multicoreCPUs,multiprocessorPCs,clusters,parallelgameconsoles...parallel computersaretakingovertheworldofcomputing.Thecomputerindustryisreadytofloodthemarket withhardwarethatwillonlyrunatfullspeedwithparallelprograms.Butwhowillwritethese programs? Thisisanoldproblem.Evenintheearly1980s,whenthe"killermicros"startedtheirassaulton traditionalvectorsupercomputers,weworriedendlesslyabouthowtoattractnormalprogrammers. Wetriedeverythingwecouldthinkof:highlevelhardwareabstractions,implicitlyparallel programminglanguages,parallellanguageextensions,andportablemessagepassinglibraries.But aftermanyyearsofhardwork,thefactofthematteristhat"they"didn'tcome.Theoverwhelming majorityofprogrammerswillnotinvesttheefforttowriteparallelsoftware. Acommonviewisthatyoucan'tteacholdprogrammersnewtricks,sotheproblemwillnotbesolved untiltheoldprogrammersfadeawayandanewgenerationtakesover. Butwedon'tbuyintothatdefeatistattitude.Programmershaveshownaremarkableabilitytoadopt newsoftwaretechnologiesovertheyears.LookathowmanyoldFortranprogrammersarenow writingelegantJavaprogramswithsophisticatedobjectorienteddesigns.Theproblemisn'twithold programmers.Theproblemiswitholdparallelcomputingexpertsandthewaythey'vetriedtocreatea poolofcapableparallelprogrammers. Andthat'swherethisbookcomesin.Wewanttocapturetheessenceofhowexpertparallel programmersthinkaboutparallelalgorithmsandcommunicatethatessentialunderstandinginaway professionalprogrammerscanreadilymaster.Thetechnologywe'veadoptedtoaccomplishthistaskis apatternlanguage.Wemadethischoicenotbecausewestartedtheprojectasdevoteesofdesign patternslookingforanewfieldtoconquer,butbecausepatternshavebeenshowntoworkinwaysthat wouldbeapplicableinparallelprogramming.Forexample,patternshavebeenveryeffectiveinthe fieldofobjectorienteddesign.Theyhaveprovidedacommonlanguageexpertscanusetotalkabout theelementsofdesignandhavebeenextremelyeffectiveathelpingprogrammersmasterobject orienteddesign.

This book contains our pattern language for parallel programming. The book opens with a couple of chapters to introduce the key concepts in parallel computing. These chapters focus on the parallel computing concepts and jargon used in the pattern language as opposed to being an exhaustive introduction to the field.

The pattern language itself is presented in four parts corresponding to the four phases of creating a parallel program:

* Finding Concurrency. The programmer works in the problem domain to identify the available concurrency and expose it for use in the algorithm design.

* Algorithm Structure. The programmer works with high-level structures for organizing a parallel algorithm.

* Supporting Structures. We shift from algorithms to source code and consider how the parallel program will be organized and the techniques used to manage shared data.

* Implementation Mechanisms. The final step is to look at specific software constructs for implementing a parallel program.

The patterns making up these four design spaces are tightly linked. You start at the top (Finding Concurrency), work through the patterns, and by the time you get to the bottom (Implementation Mechanisms), you will have a detailed design for your parallel program.

If the goal is a parallel program, however, you need more than just a parallel algorithm. You also need a programming environment and a notation for expressing the concurrency within the program's source code. Programmers used to be confronted by a large and confusing array of parallel programming environments. Fortunately, over the years the parallel programming community has converged around three programming environments.

* OpenMP. A simple language extension to C, C++, or Fortran to write parallel programs for shared-memory computers.

* MPI. A message-passing library used on clusters and other distributed-memory computers.

* Java. An object-oriented programming language with language features supporting parallel programming on shared-memory computers and standard class libraries supporting distributed computing.

Many readers will already be familiar with one or more of these programming notations, but for readers completely new to parallel computing, we've included a discussion of these programming environments in the appendixes.

In closing, we have been working for many years on this pattern language. Presenting it as a book so people can start using it is an exciting development for us. But we don't see this as the end of this effort. We expect that others will have their own ideas about new and better patterns for parallel programming. We've assuredly missed some important features that really belong in this pattern language. We embrace change and look forward to engaging with the larger parallel computing community to iterate on this language. Over time, we'll update and improve the pattern language until it truly represents the consensus view of the parallel programming community. Then our real work will begin: using the pattern language to guide the creation of better parallel programming environments and helping people to use these technologies to write parallel software. We won't rest until the day sequential software is rare.

ACKNOWLEDGMENTS

We started working together on this pattern language in 1998. It's been a long and twisted road, starting with a vague idea about a new way to think about parallel algorithms and finishing with this book. We couldn't have done this without a great deal of help.

Mani Chandy, who thought we would make a good team, introduced Tim to Beverly and Berna. The National Science Foundation, Intel Corp., and Trinity University have supported this research at various times over the years. Help with the patterns themselves came from the people at the Pattern Languages of Programs (PLoP) workshops held in Illinois each summer. The format of these workshops and the resulting review process was challenging and sometimes difficult, but without them we would have never finished this pattern language. We would also like to thank the reviewers who carefully read early manuscripts and pointed out countless errors and ways to improve the book.

Finally, we thank our families. Writing a book is hard on the authors, but that is to be expected. What we didn't fully appreciate was how hard it would be on our families. We are grateful to Beverly's family (Daniel and Steve), Tim's family (Noah, August, and Martha), and Berna's family (Billie) for the sacrifices they've made to support this project.

Tim Mattson, Olympia, Washington, April 2004
Beverly Sanders, Gainesville, Florida, April 2004
Berna Massingill, San Antonio, Texas, April 2004

Chapter 1. A Pattern Language for Parallel Programming
  Section 1.1. INTRODUCTION
  Section 1.2. PARALLEL PROGRAMMING
  Section 1.3. DESIGN PATTERNS AND PATTERN LANGUAGES
  Section 1.4. A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING
Chapter 2. Background and Jargon of Parallel Computing
  Section 2.1. CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS
  Section 2.2. PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION
  Section 2.3. PARALLEL PROGRAMMING ENVIRONMENTS
  Section 2.4. THE JARGON OF PARALLEL COMPUTING
  Section 2.5. A QUANTITATIVE LOOK AT PARALLEL COMPUTATION
  Section 2.6. COMMUNICATION
  Section 2.7. SUMMARY
Chapter 3. The Finding Concurrency Design Space
  Section 3.1. ABOUT THE DESIGN SPACE
  Section 3.2. THE TASK DECOMPOSITION PATTERN
  Section 3.3. THE DATA DECOMPOSITION PATTERN
  Section 3.4. THE GROUP TASKS PATTERN
  Section 3.5. THE ORDER TASKS PATTERN
  Section 3.6. THE DATA SHARING PATTERN
  Section 3.7. THE DESIGN EVALUATION PATTERN
  Section 3.8. SUMMARY
Chapter 4. The Algorithm Structure Design Space
  Section 4.1. INTRODUCTION
  Section 4.2. CHOOSING AN ALGORITHM STRUCTURE PATTERN
  Section 4.3. EXAMPLES
  Section 4.4. THE TASK PARALLELISM PATTERN
  Section 4.5. THE DIVIDE AND CONQUER PATTERN
  Section 4.6. THE GEOMETRIC DECOMPOSITION PATTERN
  Section 4.7. THE RECURSIVE DATA PATTERN
  Section 4.8. THE PIPELINE PATTERN
  Section 4.9. THE EVENT-BASED COORDINATION PATTERN
Chapter 5. The Supporting Structures Design Space
  Section 5.1. INTRODUCTION
  Section 5.2. FORCES
  Section 5.3. CHOOSING THE PATTERNS
  Section 5.4. THE SPMD PATTERN
  Section 5.5. THE MASTER/WORKER PATTERN
  Section 5.6. THE LOOP PARALLELISM PATTERN
  Section 5.7. THE FORK/JOIN PATTERN
  Section 5.8. THE SHARED DATA PATTERN
  Section 5.9. THE SHARED QUEUE PATTERN
  Section 5.10. THE DISTRIBUTED ARRAY PATTERN
  Section 5.11. OTHER SUPPORTING STRUCTURES
Chapter 6. The Implementation Mechanisms Design Space
  Section 6.1. OVERVIEW
  Section 6.2. UE MANAGEMENT
  Section 6.3. SYNCHRONIZATION
  Section 6.4. COMMUNICATION
Endnotes
Appendix A: A Brief Introduction to OpenMP
  Section A.1. CORE CONCEPTS
  Section A.2. STRUCTURED BLOCKS AND DIRECTIVE FORMATS
  Section A.3. WORKSHARING
  Section A.4. DATA ENVIRONMENT CLAUSES
  Section A.5. THE OpenMP RUNTIME LIBRARY
  Section A.6. SYNCHRONIZATION
  Section A.7. THE SCHEDULE CLAUSE
  Section A.8. THE REST OF THE LANGUAGE
Appendix B: A Brief Introduction to MPI
  Section B.1. CONCEPTS
  Section B.2. GETTING STARTED
  Section B.3. BASIC POINT-TO-POINT MESSAGE PASSING
  Section B.4. COLLECTIVE OPERATIONS
  Section B.5. ADVANCED POINT-TO-POINT MESSAGE PASSING
  Section B.6. MPI AND FORTRAN
  Section B.7. CONCLUSION
Appendix C: A Brief Introduction to Concurrent Programming in Java
  Section C.1. CREATING THREADS
  Section C.2. ATOMICITY, MEMORY SYNCHRONIZATION, AND THE volatile KEYWORD
  Section C.3. SYNCHRONIZED BLOCKS
  Section C.4. WAIT AND NOTIFY
  Section C.5. LOCKS
  Section C.6. OTHER SYNCHRONIZATION MECHANISMS AND SHARED DATA STRUCTURES
  Section C.7. INTERRUPTS
Glossary
Bibliography
About the Authors
Index


Chapter 1. A Pattern Language for Parallel Programming

1.1 INTRODUCTION
1.2 PARALLEL PROGRAMMING
1.3 DESIGN PATTERNS AND PATTERN LANGUAGES
1.4 A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING

1.1. INTRODUCTION

Computers are used to model physical systems in many fields of science, medicine, and engineering. Modelers, whether trying to predict the weather or render a scene in the next blockbuster movie, can usually use whatever computing power is available to make ever more detailed simulations. Vast amounts of data, whether customer shopping patterns, telemetry data from space, or DNA sequences, require analysis. To deliver the required power, computer designers combine multiple processing elements into a single larger system. These so-called parallel computers run multiple tasks simultaneously and solve bigger problems in less time.

Traditionally, parallel computers were rare and available for only the most critical problems. Since the mid-1990s, however, the availability of parallel computers has changed dramatically. With multithreading support built into the latest microprocessors and the emergence of multiple processor cores on a single silicon die, parallel computers are becoming ubiquitous. Now, almost every university computer science department has at least one parallel computer. Virtually all oil companies, automobile manufacturers, drug development companies, and special effects studios use parallel computing.

For example, in computer animation, rendering is the step where information from the animation files, such as lighting, textures, and shading, is applied to 3D models to generate the 2D image that makes up a frame of the film. Parallel computing is essential to generate the needed number of frames (24 per second) for a feature-length film. Toy Story, the first completely computer-generated feature-length film, released by Pixar in 1995, was processed on a "render farm" consisting of 100 dual-processor machines [PS00]. By 1999, for Toy Story 2, Pixar was using a 1,400-processor system with the improvement in processing power fully reflected in the improved details in textures, clothing, and atmospheric effects. Monsters, Inc. (2001) used a system of 250 enterprise servers each containing 14 processors for a total of 3,500 processors. It is interesting that the amount of time required to generate a frame has remained relatively constant: as computing power (both the number of processors and the speed of each processor) has increased, it has been exploited to improve the quality of the animation.

The biological sciences have taken dramatic leaps forward with the availability of DNA sequence information from a variety of organisms, including humans. One approach to sequencing, championed and used with success by Celera Corp., is called the whole genome shotgun algorithm. The idea is to break the genome into small segments, experimentally determine the DNA sequences of the segments, and then use a computer to construct the entire sequence from the segments by finding overlapping areas. The computing facilities used by Celera to sequence the human genome included 150 four-way servers plus a server with 16 processors and 64 GB of memory. The calculation involved 500 million trillion base-to-base comparisons [Ein00].

The SETI@home project [SET, ACK02+] provides a fascinating example of the power of parallel computing. The project seeks evidence of extraterrestrial intelligence by scanning the sky with the world's largest radio telescope, the Arecibo Telescope in Puerto Rico. The collected data is then analyzed for candidate signals that might indicate an intelligent source. The computational task is beyond even the largest supercomputer, and certainly beyond the capabilities of the facilities available to the SETI@home project. The problem is solved with public resource computing, which turns PCs around the world into a huge parallel computer connected by the Internet. Data is broken up into work units and distributed over the Internet to client computers whose owners donate spare computing time to support the project. Each client periodically connects with the SETI@home server, downloads the data to analyze, and then sends the results back to the server. The client program is typically implemented as a screen saver so that it will devote CPU cycles to the SETI problem only when the computer is otherwise idle. A work unit currently requires an average of between seven and eight hours of CPU time on a client. More than 205,000,000 work units have been processed since the start of the project. More recently, similar technology to that demonstrated by SETI@home has been used for a variety of public resource computing projects as well as internal projects within large companies utilizing their idle PCs to solve problems ranging from drug screening to chip design validation.

Although computing in less time is beneficial, and may enable problems to be solved that couldn't be otherwise, it comes at a cost. Writing software to run on parallel computers can be difficult. Only a small minority of programmers have experience with parallel programming. If all these computers designed to exploit parallelism are going to achieve their potential, more programmers need to learn how to write parallel programs.

This book addresses this need by showing competent programmers of sequential machines how to design programs that can run on parallel computers. Although many excellent books show how to use particular parallel programming environments, this book is unique in that it focuses on how to think about and design parallel algorithms. To accomplish this goal, we will be using the concept of a pattern language. This highly structured representation of expert design experience has been heavily used in the object-oriented design community.

The book opens with two introductory chapters. The first gives an overview of the parallel computing landscape and background needed to understand and use the pattern language. This is followed by a more detailed chapter in which we lay out the basic concepts and jargon used by parallel programmers. The book then moves into the pattern language itself.

1.2. PARALLEL PROGRAMMING

The key to parallel computing is exploitable concurrency. Concurrency exists in a computational problem when the problem can be decomposed into subproblems that can safely execute at the same time. To be of any use, however, it must be possible to structure the code to expose and later exploit the concurrency and permit the subproblems to actually run concurrently; that is, the concurrency must be exploitable.

Most large computational problems contain exploitable concurrency. A programmer works with exploitable concurrency by creating a parallel algorithm and implementing the algorithm using a parallel programming environment. When the resulting parallel program is run on a system with multiple processors, the amount of time we have to wait for the results of the computation is reduced. In addition, multiple processors may allow larger problems to be solved than could be done on a single-processor system.

As a simple example, suppose part of a computation involves computing the summation of a large set of values. If multiple processors are available, instead of adding the values together sequentially, the set can be partitioned and the summations of the subsets computed simultaneously, each on a different processor. The partial sums are then combined to get the final answer. Thus, using multiple processors to compute in parallel may allow us to obtain a solution sooner. Also, if each processor has its own memory, partitioning the data between the processors may allow larger problems to be handled than could be handled on a single processor.

This simple example shows the essence of parallel computing. The goal is to use multiple processors to solve problems in less time and/or to solve bigger problems than would be possible on a single processor. The programmer's task is to identify the concurrency in the problem, structure the algorithm so that this concurrency can be exploited, and then implement the solution using a suitable programming environment. The final step is to solve the problem by executing the code on a parallel system.

Parallel programming presents unique challenges. Often, the concurrent tasks making up the problem include dependencies that must be identified and correctly managed. The order in which the tasks execute may change the answers of the computations in nondeterministic ways. For example, in the parallel summation described earlier, a partial sum cannot be combined with others until its own computation has completed. The algorithm imposes a partial order on the tasks (that is, they must complete before the sums can be combined). More subtly, the numerical value of the summations may change slightly depending on the order of the operations within the sums because floating-point arithmetic is nonassociative. A good parallel programmer must take care to ensure that nondeterministic issues such as these do not affect the quality of the final answer. Creating safe parallel programs can take considerable effort from the programmer.

Even when a parallel program is "correct", it may fail to deliver the anticipated performance improvement from exploiting concurrency. Care must be taken to ensure that the overhead incurred by managing the concurrency does not overwhelm the program runtime. Also, partitioning the work among the processors in a balanced way is often not as easy as the summation example suggests. The effectiveness of a parallel algorithm depends on how well it maps onto the underlying parallel computer, so a parallel algorithm could be very effective on one parallel architecture and a disaster on another.

We will revisit these issues and provide a more quantitative view of parallel computation in the next chapter.
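To make the summation example concrete, here is a rough sketch in C with OpenMP (one of the programming environments used later in this book); the array name, its size, and the values it holds are arbitrary choices for illustration. The reduction clause gives each thread a private partial sum and combines the partial sums at the end, which is the partition-then-combine strategy just described.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double values[N];      /* the large set of values to be summed */
        double sum = 0.0;

        for (int i = 0; i < N; i++)   /* arbitrary test data */
            values[i] = 1.0 / (i + 1);

        /* Each thread sums a subset of the values into a private partial
           sum; the reduction(+:sum) clause combines the partial sums into
           the shared variable sum when the loop ends. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += values[i];

        printf("sum = %f\n", sum);
        return 0;
    }

Because the partial sums are combined in an order that depends on the number of threads and the schedule, the result may differ in the last few bits from the sequential sum, which is exactly the floating-point nondeterminism mentioned above.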

1.3. DESIGN PATTERNS AND PATTERN LANGUAGES

A design pattern describes a good solution to a recurring problem in a particular context. The pattern follows a prescribed format that includes the pattern name, a description of the context, the forces (goals and constraints), and the solution. The idea is to record the experience of experts in a way that can be used by others facing a similar problem. In addition to the solution itself, the name of the pattern is important and can form the basis for a domain-specific vocabulary that can significantly enhance communication between designers in the same area.

Design patterns were first proposed by Christopher Alexander. The domain was city planning and architecture [AIS77]. Design patterns were originally introduced to the software engineering community by Beck and Cunningham [BC87] and became prominent in the area of object-oriented programming with the publication of the book by Gamma, Helm, Johnson, and Vlissides [GHJV95], affectionately known as the GoF (Gang of Four) book. This book gives a large collection of design patterns for object-oriented programming. To give one example, the Visitor pattern describes a way to structure classes so that the code implementing a heterogeneous data structure can be kept separate from the code to traverse it. Thus, what happens in a traversal depends on both the type of each node and the class that implements the traversal. This allows multiple functionality for data structure traversals, and significant flexibility as new functionality can be added without having to change the data structure class. The patterns in the GoF book have entered the lexicon of object-oriented programming; references to its patterns are found in the academic literature, trade publications, and system documentation. These patterns have by now become part of the expected knowledge of any competent software engineer.

An educational nonprofit organization called the Hillside Group [Hil] was formed in 1993 to promote the use of patterns and pattern languages and, more generally, to improve human communication about computers "by encouraging people to codify common programming and design practice". To develop new patterns and help pattern writers hone their skills, the Hillside Group sponsors an annual Pattern Languages of Programs (PLoP) workshop and several spinoffs in other parts of the world, such as ChiliPLoP (in the western United States), KoalaPLoP (Australia), EuroPLoP (Europe), and MensorePLoP (Japan). The proceedings of these workshops [Pat] provide a rich source of patterns covering a vast range of application domains in software development and have been used as a basis for several books [CS95, VCK96, MRB97, HFR99].

In his original work on patterns, Alexander provided not only a catalog of patterns, but also a pattern language that introduced a new approach to design. In a pattern language, the patterns are organized into a structure that leads the user through the collection of patterns in such a way that complex systems can be designed using the patterns. At each decision point, the designer selects an appropriate pattern. Each pattern leads to other patterns, resulting in a final design in terms of a web of patterns. Thus, a pattern language embodies a design methodology and provides domain-specific advice to the application designer. (In spite of the overlapping terminology, a pattern language is not a programming language.)

1.4. A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING

This book describes a pattern language for parallel programming that provides several benefits. The immediate benefits are a way to disseminate the experience of experts by providing a catalog of good solutions to important problems, an expanded vocabulary, and a methodology for the design of parallel programs. We hope to lower the barrier to parallel programming by providing guidance through the entire process of developing a parallel program. The programmer brings to the process a good understanding of the actual problem to be solved and then works through the pattern language, eventually obtaining a detailed parallel design or possibly working code. In the longer term, we hope that this pattern language can provide a basis for both a disciplined approach to the qualitative evaluation of different programming models and the development of parallel programming tools.

The pattern language is organized into four design spaces (Finding Concurrency, Algorithm Structure, Supporting Structures, and Implementation Mechanisms) which form a linear hierarchy, with Finding Concurrency at the top and Implementation Mechanisms at the bottom, as shown in Fig. 1.1.

Figure 1.1. Overview of the pattern language

The Finding Concurrency design space is concerned with structuring the problem to expose exploitable concurrency. The designer working at this level focuses on high-level algorithmic issues and reasons about the problem to expose potential concurrency. The Algorithm Structure design space is concerned with structuring the algorithm to take advantage of potential concurrency. That is, the designer working at this level reasons about how to use the concurrency exposed in working with the Finding Concurrency patterns. The Algorithm Structure patterns describe overall strategies for exploiting concurrency. The Supporting Structures design space represents an intermediate stage between the Algorithm Structure and Implementation Mechanisms design spaces. Two important groups of patterns in this space are those that represent program-structuring approaches and those that represent commonly used shared data structures. The Implementation Mechanisms design space is concerned with how the patterns of the higher-level spaces are mapped into particular programming environments. We use it to provide descriptions of common mechanisms for process/thread management (for example, creating or destroying processes/threads) and process/thread interaction (for example, semaphores, barriers, or message passing). The items in this design space are not presented as patterns because in many cases they map directly onto elements within particular parallel programming environments. They are included in the pattern language anyway, however, to provide a complete path from problem description to code.

Chapter 2. Background and Jargon of Parallel Computing

2.1 CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS
2.2 PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION
2.3 PARALLEL PROGRAMMING ENVIRONMENTS
2.4 THE JARGON OF PARALLEL COMPUTING
2.5 A QUANTITATIVE LOOK AT PARALLEL COMPUTATION
2.6 COMMUNICATION
2.7 SUMMARY

In this chapter, we give an overview of the parallel programming landscape, and define any specialized parallel computing terminology that we will use in the patterns. Because many terms in computing are overloaded, taking different meanings in different contexts, we suggest that even readers familiar with parallel programming at least skim this chapter.

2.1. CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS

Concurrency was first exploited in computing to better utilize or share resources within a computer. Modern operating systems support context switching to allow multiple tasks to appear to execute concurrently, thereby allowing useful work to occur while the processor is stalled on one task. This application of concurrency, for example, allows the processor to stay busy by swapping in a new task to execute while another task is waiting for I/O. By quickly swapping tasks in and out, giving each task a "slice" of the processor time, the operating system can allow multiple users to use the system as if each were using it alone (but with degraded performance).

Most modern operating systems can use multiple processors to increase the throughput of the system. The UNIX shell uses concurrency along with a communication abstraction known as pipes to provide a powerful form of modularity: Commands are written to accept a stream of bytes as input (the consumer) and produce a stream of bytes as output (the producer). Multiple commands can be chained together with a pipe connecting the output of one command to the input of the next, allowing complex commands to be built from simple building blocks. Each command is executed in its own process, with all processes executing concurrently. Because the producer blocks if buffer space in the pipe is not available, and the consumer blocks if data is not available, the job of managing the stream of results moving between commands is greatly simplified. More recently, with operating systems with windows that invite users to do more than one thing at a time, and the Internet, which often introduces I/O delays perceptible to the user, almost every program that contains a GUI incorporates concurrency.

Although the fundamental concepts for safely handling concurrency are the same in parallel programs and operating systems, there are some important differences. For an operating system, the problem is not finding concurrency; the concurrency is inherent in the way the operating system functions in managing a collection of concurrently executing processes (representing users, applications, and background activities such as print spooling) and providing synchronization mechanisms so resources can be safely shared. However, an operating system must support concurrency in a robust and secure way: Processes should not be able to interfere with each other (intentionally or not), and the entire system should not crash if something goes wrong with one process. In a parallel program, finding and exploiting concurrency can be a challenge, while isolating processes from each other is not the critical concern it is with an operating system. Performance goals are different as well. In an operating system, performance goals are normally related to throughput or response time, and it may be acceptable to sacrifice some efficiency to maintain robustness and fairness in resource allocation. In a parallel program, the goal is to minimize the running time of a single program.

2.2. PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION

There are dozens of different parallel architectures, among them networks of workstations, clusters of off-the-shelf PCs, massively parallel supercomputers, tightly coupled symmetric multiprocessors, and multiprocessor workstations. In this section, we give an overview of these systems, focusing on the characteristics relevant to the programmer.

2.2.1. Flynn's Taxonomy

By far the most common way to characterize these architectures is Flynn's taxonomy [Fly72]. He categorizes all computers according to the number of instruction streams and data streams they have, where a stream is a sequence of instructions or data on which a computer operates. In Flynn's taxonomy, there are four possibilities: SISD, SIMD, MISD, and MIMD.

Single Instruction, Single Data (SISD). In a SISD system, one stream of instructions processes a single stream of data, as shown in Fig. 2.1. This is the common von Neumann model used in virtually all single-processor computers.

Figure 2.1. The Single Instruction, Single Data (SISD) architecture

Single Instruction, Multiple Data (SIMD). In a SIMD system, a single instruction stream is concurrently broadcast to multiple processors, each with its own data stream (as shown in Fig. 2.2). The original systems from Thinking Machines and MasPar can be classified as SIMD. The CPP DAP Gamma II and Quadrics Apemille are more recent examples; these are typically deployed in specialized applications, such as digital signal processing, that are suited to fine-grained parallelism and require little interprocess communication. Vector processors, which operate on vector data in a pipelined fashion, can also be categorized as SIMD. Exploiting this parallelism is usually done by the compiler.

Figure 2.2. The Single Instruction, Multiple Data (SIMD) architecture

Multiple Instruction, Single Data (MISD). No well-known systems fit this designation. It is mentioned for the sake of completeness.

Multiple Instruction, Multiple Data (MIMD). In a MIMD system, each processing element has its own stream of instructions operating on its own data. This architecture, shown in Fig. 2.3, is the most general of the architectures in that each of the other cases can be mapped onto the MIMD architecture. The vast majority of modern parallel systems fit into this category.

Figure 2.3. The Multiple Instruction, Multiple Data (MIMD) architecture

2.2.2. A Further Breakdown of MIMD

The MIMD category of Flynn's taxonomy is too broad to be useful on its own; this category is typically decomposed according to memory organization.

Shared memory. In a shared-memory system, all processes share a single address space and communicate with each other by writing and reading shared variables.

One class of shared-memory systems is called SMPs (symmetric multiprocessors). As shown in Fig. 2.4, all processors share a connection to a common memory and access all memory locations at equal speeds. SMP systems are arguably the easiest parallel systems to program because programmers do not need to distribute data structures among processors. Because increasing the number of processors increases contention for the memory, the processor/memory bandwidth is typically a limiting factor. Thus, SMP systems do not scale well and are limited to small numbers of processors.

Figure 2.4. The Symmetric Multiprocessor (SMP) architecture

The other main class of shared-memory systems is called NUMA (nonuniform memory access). As shown in Fig. 2.5, the memory is shared; that is, it is uniformly addressable from all processors, but some blocks of memory may be physically more closely associated with some processors than others. This reduces the memory bandwidth bottleneck and allows systems with more processors; however, as a result, the access time from a processor to a memory location can be significantly different depending on how "close" the memory location is to the processor. To mitigate the effects of nonuniform access, each processor has a cache, along with a protocol to keep cache entries coherent. Hence, another name for these architectures is cache-coherent nonuniform memory access systems (ccNUMA). Logically, programming a ccNUMA system is the same as programming an SMP, but to obtain the best performance, the programmer will need to be more careful about locality issues and cache effects.

Figure 2.5. An example of the nonuniform memory access (NUMA) architecture

Distributed memory. In a distributed-memory system, each process has its own address space and communicates with other processes by message passing (sending and receiving messages). A schematic representation of a distributed-memory computer is shown in Fig. 2.6.

Figure 2.6. The distributed-memory architecture

Depending on the topology and technology used for the processor interconnection, communication speed can range from almost as fast as shared memory (in tightly integrated supercomputers) to orders of magnitude slower (for example, in a cluster of PCs interconnected with an Ethernet network). The programmer must explicitly program all the communication between processors and be concerned with the distribution of data.

Distributed-memory computers are traditionally divided into two classes: MPP (massively parallel processors) and clusters. In an MPP, the processors and the network infrastructure are tightly coupled and specialized for use in a parallel computer. These systems are extremely scalable, in some cases supporting the use of many thousands of processors in a single system [MSW96, IBM02].

Clusters are distributed-memory systems composed of off-the-shelf computers connected by an off-the-shelf network. When the computers are PCs running the Linux operating system, these clusters are called Beowulf clusters. As off-the-shelf networking technology improves, systems of this type are becoming more common and much more powerful. Clusters provide an inexpensive way for an organization to obtain parallel computing capabilities [Beo]. Preconfigured clusters are now available from many vendors. One frugal group even reported constructing a useful parallel system by using a cluster to harness the combined power of obsolete PCs that otherwise would have been discarded [HHS01].

Hybrid systems. These systems are clusters of nodes with separate address spaces in which each node contains several processors that share memory.

According to van der Steen and Dongarra's "Overview of Recent Supercomputers" [vdSD03], which contains a brief description of the supercomputers currently or soon to be commercially available, hybrid systems formed from clusters of SMPs connected by a fast network are currently the dominant trend in high-performance computing. For example, in late 2003, four of the five fastest computers in the world were hybrid systems [Top].

Grids. Grids are systems that use distributed, heterogeneous resources connected by LANs and/or WANs [FK03]. Often the interconnection network is the Internet. Grids were originally envisioned as a way to link multiple supercomputers to enable larger problems to be solved, and thus could be viewed as a special type of distributed-memory or hybrid MIMD machine. More recently, the idea of grid computing has evolved into a general way to share heterogeneous resources, such as computation servers, storage, application servers, information services, or even scientific instruments. Grids differ from clusters in that the various resources in the grid need not have a common point of administration. In most cases, the resources on a grid are owned by different organizations that maintain control over the policies governing use of the resources. This affects the way these systems are used, the middleware created to manage them, and most importantly for this discussion, the overhead incurred when communicating between resources within the grid.

2.2.3. Summary

We have classified these systems according to the characteristics of the hardware. These characteristics typically influence the native programming model used to express concurrency on a system; however, this is not always the case. It is possible for a programming environment for a shared-memory machine to provide the programmer with the abstraction of distributed memory and message passing. Virtual distributed shared memory systems contain middleware to provide the opposite: the abstraction of shared memory on a distributed-memory machine.

2.3. PARALLEL PROGRAMMING ENVIRONMENTS

Parallel programming environments provide the basic tools, language features, and application programming interfaces (APIs) needed to construct a parallel program. A programming environment implies a particular abstraction of the computer system called a programming model. Traditional sequential computers use the well-known von Neumann model. Because all sequential computers use this model, software designers can design software to a single abstraction and reasonably expect it to map onto most, if not all, sequential computers.

Unfortunately, there are many possible models for parallel computing, reflecting the different ways processors can be interconnected to construct a parallel system. The most common models are based on one of the widely deployed parallel architectures: shared memory, distributed memory with message passing, or a hybrid combination of the two.

Programming models too closely aligned to a particular parallel system lead to programs that are not portable between parallel computers. Because the effective lifespan of software is longer than that of hardware, many organizations have more than one type of parallel computer, and most programmers insist on programming environments that allow them to write portable parallel programs. Also, explicitly managing large numbers of resources in a parallel computer is difficult, suggesting that higher-level abstractions of the parallel computer might be useful. The result is that as of the mid-1990s, there was a veritable glut of parallel programming environments. A partial list of these is shown in Table 2.1. This created a great deal of confusion for application developers and hindered the adoption of parallel computing for mainstream applications.

Table 2.1. Some Parallel Programming Environments from the Mid-1990s

"C*inC ABCPL ACE ACT++ ADDAP Adl Adsmith

CUMULVS DAGGER DAPPLE DataParallelC DC++ DCE++ DDD

JavaRMI javaPG JAVAR JavaSpaces JIDL Joyce Karma

PRIO P3L P4Linda Pablo PADE PADRE Panda

Quake Quark QuickThreads Sage++ SAM SCANDAL SCHEDULE

AFAPI ALWAN AM AMDC Amoeba AppLeS ARTS

DICE DIPC Distributed Smalltalk DOLIB DOME DOSMOS DRL

Khoros KOAN/FortranS LAM Legion Lilac Linda LiPS Locust Lparx Lucid Maisie Manifold Mentat MetaChaos Midway Millipede Mirage Modula2* ModulaP MOSIX MpC MPC++ MPI Multipol Munin NanoThreads NESL NetClasses++

Papers Para++ Paradigm Parafrase2 Paralation Parallaxis Parallel Haskell ParallelC++ ParC ParLib++ ParLin Parlog Parmacs Parti pC pC++ PCN PCP: PCU PEACE PENNY PET PETSc PH Phosphorus POET Polaris POOLT

SciTL SDDA SHMEM SIMPLE Sina SISAL SMI SONiC SplitC SR Sthreads Strand SUIF SuperPascal Synergy TCGMSG Telegraphos TheFORCE Threads.h++ TRAPPER TreadMarks UC uC++ UNITY V Vic* VisifoldVNUS VPE

AthapascanOb DSMThreads Aurora Automap bb_threads Blaze BlockComm BSP C* C** C4 CarlOS Cashmere CC++ Charlotte Charm Charm++ Chu Cid Cilk CMFortran Code Ease ECO Eilean Emerald EPL Excalibur Express Falcon Filaments FLASH FM Fork FortranM FX GA GAMMA Glenda GLU GUARD HAsL

ConcurrentML HORUS Converse COOL CORRELATE CparPar CPS CRL CSP Cthreads HPC HPC++ HPF IMPACT ISETLLinda ISIS JADA JADE

Nexus Nimrod NOW ObjectiveLinda Occam Omega OOF90 Orca P++

POOMA POSYBL PRESTO Prospero Proteus PSDM PSI PVM QPC++

Win32threads WinPar WWWinda XENOOPS XPC Zounds ZPL

Fortunately, by the late 1990s, the parallel programming community converged predominantly on two environments for parallel programming: OpenMP [OMP] for shared memory and MPI [Mesb] for message passing.

OpenMP is a set of language extensions implemented as compiler directives. Implementations are currently available for Fortran, C, and C++. OpenMP is frequently used to incrementally add parallelism to sequential code. By adding a compiler directive around a loop, for example, the compiler can be instructed to generate code to execute the iterations of the loop in parallel. The compiler takes care of most of the details of thread creation and management. OpenMP programs tend to work very well on SMPs, but because its underlying programming model does not include a notion of nonuniform memory access times, it is less ideal for ccNUMA and distributed-memory machines.
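As a rough illustration of this directive style (the function and array names below are arbitrary), a sequential loop can be parallelized with a single directive:

    #include <omp.h>

    /* Scale one array into another. The directive asks an OpenMP compiler
       to divide the loop iterations among a team of threads; thread
       creation and management are handled by the compiler and runtime. */
    void scale(double *a, const double *b, double s, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] = s * b[i];
    }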

MPI is a set of library routines that provide for process management, message passing, and some collective communication operations (these are operations that involve all the processes involved in a program, such as barrier, broadcast, and reduction). MPI programs can be difficult to write because the programmer is responsible for data distribution and explicit interprocess communication using messages. Because the programming model assumes distributed memory, MPI is a good choice for MPPs and other distributed-memory machines.

Neither OpenMP nor MPI is an ideal fit for hybrid architectures that combine multiprocessor nodes, each with multiple processes and a shared memory, into a larger system with separate address spaces for each node: The OpenMP model does not recognize nonuniform memory access times, so its data allocation can lead to poor performance on machines that are not SMPs, while MPI does not include constructs to manage data structures residing in a shared memory. One solution is a hybrid model in which OpenMP is used on each shared-memory node and MPI is used between the nodes. This works well, but it requires the programmer to work with two different programming models within a single program. Another option is to use MPI on both the shared-memory and distributed-memory portions of the algorithm and give up the advantages of a shared-memory programming model, even when the hardware directly supports it.

New high-level programming environments that simplify portable parallel programming and more accurately reflect the underlying parallel architectures are topics of current research [Cen]. Another approach, more popular in the commercial sector, is to extend MPI and OpenMP. In the mid-1990s, the MPI Forum defined an extended MPI called MPI 2.0, although implementations are not widely available at the time this was written. It is a large, complex extension to MPI that includes dynamic process creation, parallel I/O, and many other features. Of particular interest to programmers of modern hybrid architectures is the inclusion of one-sided communication. One-sided communication mimics some of the features of a shared-memory system by letting one process write into or read from the memory regions of other processes. The term "one-sided" refers to the fact that the read or write is launched by the initiating process without the explicit involvement of the other participating process. A more sophisticated abstraction of one-sided communication is available as part of the Global Arrays [NHL96, NHK02+, Gloa] package. Global Arrays works together with MPI to help a programmer manage distributed array data. After the programmer defines the array and how it is laid out in memory, the program executes "puts" or "gets" into the array without needing to explicitly manage which MPI process "owns" the particular section of the array. In essence, the global array provides an abstraction of a globally shared array. This only works for arrays, but these are such common data structures in parallel computing that this package, although limited, can be very useful.

Just as MPI has been extended to mimic some of the benefits of a shared-memory environment, OpenMP has been extended to run in distributed-memory environments. The annual WOMPAT (Workshop on OpenMP Applications and Tools) workshops contain many papers discussing various approaches and experiences with OpenMP in clusters and ccNUMA environments.

MPI is implemented as a library of routines to be called from programs written in a sequential programming language, whereas OpenMP is a set of extensions to sequential programming languages. They represent two of the possible categories of parallel programming environments (libraries and language extensions), and these two particular environments account for the overwhelming majority of parallel computing being done today. There is, however, one more category of parallel programming environments, namely languages with built-in features to support parallel programming. Java is such a language. Rather than being designed to support high-performance computing, Java is an object-oriented, general-purpose programming environment with features for explicitly specifying concurrent processing with shared memory. In addition, the standard I/O and network packages provide classes that make it easy for Java to perform interprocess communication between machines, thus making it possible to write programs based on both the shared-memory and the distributed-memory models. The newer java.nio packages support I/O in a way that is less convenient for the programmer, but gives significantly better performance, and Java 2 1.5 includes new support for concurrent programming, most significantly in the java.util.concurrent.* packages. Additional packages that support different approaches to parallel computing are widely available.

Although there have been other general-purpose languages, both prior to Java and more recent (for example, C#), that contained constructs for specifying concurrency, Java is the first to become widely used. As a result, it may be the first exposure for many programmers to concurrent and parallel programming. Although Java provides software engineering benefits, currently the performance of parallel Java programs cannot compete with OpenMP or MPI programs for typical scientific computing applications. The Java design has also been criticized for several deficiencies that matter in this domain (for example, a floating-point model that emphasizes portability and more reproducible results over exploiting the available floating-point hardware to the fullest, inefficient handling of arrays, and lack of a lightweight mechanism to handle complex numbers).

The performance difference between Java and other alternatives can be expected to decrease, especially for symbolic or other nonnumeric problems, as compiler technology for Java improves and as new packages and language extensions become available. The Titanium project [Tita] is an example of a Java dialect designed for high-performance computing in a ccNUMA environment.

For the purposes of this book, we have chosen OpenMP, MPI, and Java as the three environments we will use in our examples: OpenMP and MPI for their popularity and Java because it is likely to be many programmers' first exposure to concurrent programming. A brief overview of each can be found in the appendixes.
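To give a feel for MPI's library-routine style, in contrast to OpenMP's directive style shown earlier, here is a minimal C sketch (ours, not taken from the appendixes) in which each process contributes a partial result and a collective reduction combines the results on one process:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double partial, total = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes     */

        partial = (double)rank;                 /* stand-in for local work */

        /* Collective communication: sum the partial results onto rank 0. */
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %f from %d processes\n", total, size);

        MPI_Finalize();
        return 0;
    }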

2.4. THE JARGON OF PARALLEL COMPUTING

In this section, we define some terms that are frequently used throughout the pattern language. Additional definitions can be found in the glossary.

Task. The first step in designing a parallel program is to break the problem up into tasks. A task is a sequence of instructions that operate together as a group. This group corresponds to some logical part of an algorithm or program. For example, consider the multiplication of two order-N matrices. Depending on how we construct the algorithm, the tasks could be (1) the multiplication of subblocks of the matrices, (2) inner products between rows and columns of the matrices, or (3) individual iterations of the loops involved in the matrix multiplication. These are all legitimate ways to define tasks for matrix multiplication; that is, the task definition follows from the way the algorithm designer thinks about the problem.

Unit of execution (UE). To be executed, a task needs to be mapped to a UE such as a process or thread. A process is a collection of resources that enables the execution of program instructions. These resources can include virtual memory, I/O descriptors, a runtime stack, signal handlers, user and group IDs, and access control tokens. A more high-level view is that a process is a "heavyweight" unit of execution with its own address space. A thread is the fundamental UE in modern operating systems. A thread is associated with a process and shares the process's environment. This makes threads lightweight (that is, a context switch between threads takes only a small amount of time). A more high-level view is that a thread is a "lightweight" UE that shares an address space with other threads.

We will use unit of execution or UE as a generic term for one of a collection of possibly concurrently executing entities, usually either processes or threads. This is convenient in the early stages of program design when the distinctions between processes and threads are less important.

Processing element (PE). We use the term processing element (PE) as a generic term for a hardware element that executes a stream of instructions. The unit of hardware considered to be a PE depends on the context. For example, some programming environments view each workstation in a cluster of SMP workstations as executing a single instruction stream; in this situation, the PE would be the workstation. A different programming environment running on the same hardware, however, might view each processor of each workstation as executing an individual instruction stream; in this case, the PE is the individual processor, and each workstation contains several PEs.
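Returning to the matrix multiplication example used above to define tasks, the familiar triple loop below (a sketch of our own, using C99 array syntax) shows how the three task definitions map onto the same computation:

    /* C = A * B for order-N matrices. */
    void matmul(int n, double A[n][n], double B[n][n], double C[n][n])
    {
        for (int i = 0; i < n; i++)        /* task option (3): one iteration  */
            for (int j = 0; j < n; j++) {  /* task option (2): the inner      */
                C[i][j] = 0.0;             /* product producing one C[i][j]   */
                for (int k = 0; k < n; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
        /* Task option (1) would instead group this work into updates of
           subblocks of C, each subblock update being one task. */
    }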

Load balance and load balancing. To execute a parallel program, the tasks must be mapped to UEs, and the UEs to PEs. How the mappings are done can have a significant impact on the overall performance of a parallel algorithm. It is crucial to avoid the situation in which a subset of the PEs is doing most of the work while others are idle. Load balance refers to how well the work is distributed among PEs. Load balancing is the process of allocating work to PEs, either statically or dynamically, so that the work is distributed as evenly as possible.

Synchronization. In a parallel program, due to the nondeterminism of task scheduling and other factors, events in the computation might not always occur in the same order. For example, in one run, a task might read variable x before another task reads variable y; in the next run with the same input, the events might occur in the opposite order. In many cases, the order in which two events occur does not matter. In other situations, the order does matter, and to ensure that the program is correct, the programmer must introduce synchronization to enforce the necessary ordering constraints. The primitives provided for this purpose in our selected environments are discussed in the Implementation Mechanisms design space (Section 6.3).

Synchronous versus asynchronous. We use these two terms to qualitatively refer to how tightly coupled in time two events are. If two events must happen at the same time, they are synchronous; otherwise they are asynchronous. For example, message passing (that is, communication between UEs by sending and receiving messages) is synchronous if a message sent must be received before the sender can continue. Message passing is asynchronous if the sender can continue its computation regardless of what happens at the receiver, or if the receiver can continue computations while waiting for a receive to complete.

Race conditions. A race condition is a kind of error peculiar to parallel programs. It occurs when the outcome of a program changes as the relative scheduling of UEs varies. Because the operating system and not the programmer controls the scheduling of the UEs, race conditions result in programs that potentially give different answers even when run on the same system with the same data. Race conditions are particularly difficult errors to debug because by their nature they cannot be reliably reproduced. Testing helps, but is not as effective as with sequential programs: A program may run correctly the first thousand times and then fail catastrophically on the thousand-and-first execution, and then run again correctly when the programmer attempts to reproduce the error as the first step in debugging.

Race conditions result from errors in synchronization. If multiple UEs read and write shared variables, the programmer must protect access to these shared variables so the reads and writes occur in a valid order regardless of how the tasks are interleaved. When many variables are shared or when they are accessed through multiple levels of indirection, verifying by inspection that no race conditions exist can be very difficult. Tools are available that help detect and fix race conditions, such as Thread Checker from Intel Corporation, and the problem remains an area of active and important research [NM92].
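As a small illustration of a race condition, consider the following C/OpenMP sketch (the counter and loop bound are arbitrary). The unprotected increment is a read-modify-write on a shared variable, so two threads can read the same old value and one update is lost; the atomic directive is one way to make the reads and writes occur in a valid order.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        long counter = 0;

        /* RACE: the final value varies from run to run because updates
           from different threads can overlap and be lost. */
        #pragma omp parallel for
        for (int i = 0; i < 1000000; i++)
            counter++;
        printf("unprotected counter = %ld\n", counter);

        counter = 0;
        /* Protected: each increment is performed indivisibly. */
        #pragma omp parallel for
        for (int i = 0; i < 1000000; i++) {
            #pragma omp atomic
            counter++;
        }
        printf("protected counter   = %ld\n", counter);
        return 0;
    }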

Deadlocks. Deadlocks are another type of error peculiar to parallel programs. A deadlock occurs when there is a cycle of tasks in which each task is blocked waiting for another to proceed. Because all are waiting for another task to do something, they will all be blocked forever. As a simple example, consider two tasks in a message-passing environment. Task A attempts to receive a message from task B, after which A will reply by sending a message of its own to task B. Meanwhile, task B attempts to receive a message from task A, after which B will send a message to A. Because each task is waiting for the other to send it a message first, both tasks will be blocked forever. Fortunately, deadlocks are not difficult to discover, as the tasks will stop at the point of the deadlock.
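Written as an MPI sketch (ours, intended to be run with exactly two processes), the cycle is easy to see: both tasks block in the receive before either reaches its send. Reversing the order of the two calls on one of the ranks breaks the cycle.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, msg = 0, other;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        other = 1 - rank;                 /* assumes exactly two processes */

        /* Both tasks wait here for a message the other has not yet sent. */
        MPI_Recv(&msg, 1, MPI_INT, other, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&rank, 1, MPI_INT, other, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }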

2.5. A QUANTITATIVE LOOK AT PARALLEL COMPUTATION

The two main reasons for implementing a parallel program are to obtain better performance and to solve larger problems. Performance can be both modeled and measured, so in this section we will take another look at parallel computations by giving some simple analytical models that illustrate some of the factors that influence the performance of a parallel program.

Consider a computation consisting of three parts: a setup section, a computation section, and a finalization section. The total running time of this program on one PE is then given as the sum of the times for the three parts.

Equation 2.1

Ttotal(1) = Tsetup + Tcompute(1) + Tfinalization

What happens when we run this computation on a parallel computer with multiple PEs? Suppose that the setup and finalization sections cannot be carried out concurrently with any other activities, but that the computation section could be divided into tasks that would run independently on as many PEs as are available, with the same total number of computation steps as in the original computation. The time for the full computation on P PEs can therefore be given by

Equation 2.2

Ttotal(P) = Tsetup + Tcompute(1)/P + Tfinalization

Of course, Eq. 2.2 describes a very idealized situation. However, the idea that computations have a serial part (for which additional PEs are useless) and a parallelizable part (for which more PEs decrease the running time) is realistic. Thus, this simple model captures an important relationship.

An important measure of how much additional PEs help is the relative speedup S, which describes how much faster a problem runs in a way that normalizes away the actual running time.

Equation 2.3

S(P) = Ttotal(1) / Ttotal(P)

A related measure is the efficiency E, which is the speedup normalized by the number of PEs.

Equation 2.4

E(P) = S(P) / P

Equation 2.5

E(P) = Ttotal(1) / (P · Ttotal(P))

Ideally, we would want the speedup to be equal to P, the number of PEs. This is sometimes called perfect linear speedup. Unfortunately, this is an ideal that can rarely be achieved because times for setup and finalization are not improved by adding more PEs, limiting the speedup. The terms that cannot be run concurrently are called the serial terms. Their running times represent some fraction of the total, called the serial fraction, denoted γ.

Equation 2.6

γ = (Tsetup + Tfinalization) / Ttotal(1)

The fraction of time spent in the parallelizable part of the program is then (1 − γ). We can thus rewrite the expression for total computation time with P PEs as

Equation 2.7

Ttotal(P) = γ · Ttotal(1) + (1 − γ) · Ttotal(1) / P

Now, rewriting S in terms of the new expression for Ttotal(P), we obtain the famous Amdahl's law:

Equation 2.8

S(P) = Ttotal(1) / (γ · Ttotal(1) + (1 − γ) · Ttotal(1) / P)

Equation 2.9

S(P) = 1 / (γ + (1 − γ)/P)

Thus, in an ideal parallel algorithm with no overhead in the parallel part, the speedup should follow Eq. 2.9. What happens to the speedup if we take our ideal parallel algorithm and use a very large number of processors? Taking the limit as P goes to infinity in our expression for S yields

Equation 2.10

S(P) → 1/γ as P → ∞

Eq. 2.10 thus gives an upper bound on the speedup obtainable in an algorithm whose serial part represents γ of the total computation.

These concepts are vital to the parallel algorithm designer. In designing a parallel algorithm, it is important to understand the value of the serial fraction so that realistic expectations can be set for performance. It may not make sense to implement a complex, arbitrarily scalable parallel algorithm if 10% or more of the algorithm is serial, and 10% is fairly common.

Of course, Amdahl's law is based on assumptions that may or may not be true in practice. In real life, a number of factors may make the actual running time longer than this formula implies. For example, creating additional parallel tasks may increase overhead and the chances of contention for shared resources. On the other hand, if the original serial computation is limited by resources other than the availability of CPU cycles, the actual performance could be much better than Amdahl's law would predict. For example, a large parallel machine may allow bigger problems to be held in memory, thus reducing virtual memory paging, or multiple processors each with its own cache may allow much more of the problem to remain in the cache. Amdahl's law also rests on the assumption that for any given input, the parallel and serial implementations perform exactly the same number of computational steps. If the serial algorithm being used in the formula is not the best possible algorithm for the problem, then a clever parallel algorithm that structures the computation differently can reduce the total number of computational steps.

It has also been observed [Gus88] that the exercise underlying Amdahl's law, namely running exactly the same problem with varying numbers of processors, is artificial in some circumstances. If, say, the parallel application were a weather simulation, then when new processors were added, one would most likely increase the problem size by adding more details to the model while keeping the total execution time constant. If this is the case, then Amdahl's law, or fixed-size speedup, gives a pessimistic view of the benefits of additional processors.

To see this, we can reformulate the equation to give the speedup in terms of performance on a P-processor system. Earlier in Eq. 2.2, we obtained the execution time for P processors, Ttotal(P), from the execution time of the serial terms and the execution time of the parallelizable part when executed on one processor. Here, we do the opposite and obtain Ttotal(1) from the serial and parallel terms when executed on P processors.

Equation 2.11

Ttotal(1) = Tsetup + P · Tcompute(P) + Tfinalization

Now, we define the so-called scaled serial fraction, denoted γscaled, as

Equation 2.12

γscaled = (Tsetup + Tfinalization) / Ttotal(P)

and then

Equation 2.13

Ttotal(1) = γscaled · Ttotal(P) + P · (1 − γscaled) · Ttotal(P)

Rewriting the equation for speedup (Eq. 2.3) and simplifying, we obtain the scaled (or fixed-time) speedup.[1]

[1] This equation, sometimes known as Gustafson's law, was attributed in [Gus88] to E. Barsis.

Equation 2.14

S(P) = γscaled + P · (1 − γscaled)

This gives exactly the same speedup as Amdahl's law, but allows a different question to be asked when the number of processors is increased. Since γscaled depends on P, the result of taking the limit isn't immediately obvious, but would give the same result as the limit in Amdahl's law. However, suppose we take the limit in P while holding Tcompute and thus γscaled constant. The interpretation is that we are increasing the size of the problem so that the total running time remains constant when more processors are added. (This contains the implicit assumption that the execution time of the serial terms does not change as the problem size grows.) In this case, the speedup is linear in P. Thus, while adding more processors to solve a fixed problem may hit the speedup limits of Amdahl's law with a relatively small number of processors, if the problem grows as more processors are added, Amdahl's law will be pessimistic. These two models of speedup, along with a fixed-memory version of speedup, are discussed in [SN90].
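As a worked example with invented numbers, suppose the serial fraction is γ = 0.1. Amdahl's law (Eq. 2.9) gives S(10) = 1/(0.1 + 0.9/10) ≈ 5.3 and S(100) = 1/(0.1 + 0.9/100) ≈ 9.2, with an upper bound of 1/γ = 10 no matter how many PEs are added. If instead the problem grows with the machine so that γscaled stays at 0.1, Eq. 2.14 gives S(10) = 0.1 + 10(0.9) = 9.1 and S(100) = 90.1, growing linearly with P.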

2.6. COMMUNICATION

2.6.1. Latency and Bandwidth

A simple but useful model characterizes the total time for message transfer as the sum of a fixed cost plus a variable cost that depends on the length of the message.

Equation 2.15

    T(N) = α + N/β

The fixed cost α is called latency and is essentially the time it takes to send an empty message over the communication medium, from the time the send routine is called to the time the data is received by the recipient. Latency (given in some appropriate time unit) includes overhead due to software and network hardware plus the time it takes for the message to traverse the communication medium. The bandwidth β (given in some measure of bytes per time unit) is a measure of the capacity of the communication medium. N is the length of the message.

The latency and bandwidth can vary significantly between systems depending on both the hardware used and the quality of the software implementing the communication protocols. Because these values can be measured with fairly simple benchmarks [DD97], it is sometimes worthwhile to measure values for α and β, as these can help guide optimizations to improve communication performance. For example, in a system in which α is relatively large, it might be worthwhile to try to restructure a program that sends many small messages to aggregate the communication into a few large messages instead. Data for several recent systems has been presented in [BBC+03].
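Because α and β can be estimated with a simple benchmark, a minimal MPI ping-pong sketch along the following lines is one common approach; the message size, repetition count, and the requirement of at least two processes are assumptions for illustration, not details from the text. Timing a very small message approximates α, and the slope of the measured time versus N approximates 1/β.

/* Sketch of a ping-pong benchmark for estimating the latency (alpha) and
 * bandwidth (beta) of Eq. 2.15: T(N) = alpha + N/beta.
 * Run with at least two MPI processes; ranks 0 and 1 do the exchange. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;
    const int N = 1 << 20;                 /* illustrative 1 MB message */
    char *buf = malloc(N);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t_msg = (MPI_Wtime() - t0) / (2.0 * reps);   /* average one-way time */
    if (rank == 0)
        printf("average one-way time for %d bytes: %g s\n", N, t_msg);
    /* Repeating with a very small N approximates alpha; the slope of the
       measured time versus N approximates 1/beta. */
    free(buf);
    MPI_Finalize();
    return 0;
}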

2.6.2. Overlapping Communication and Computation and Latency Hiding

If we look more closely at the total time spent within a single task on a single processor, it can roughly be decomposed into computation time, communication time, and idle time. The communication time is the time spent sending and receiving messages (and thus only applies to distributed-memory machines), whereas the idle time is time that no work is being done because the task is waiting for an event, such as the release of a resource held by another task.

A common situation in which a task may be idle is when it is waiting for a message to be transmitted through the system. This can occur when sending a message (as the UE waits for a reply before proceeding) or when receiving a message. Sometimes it is possible to eliminate this wait by restructuring the task to send the message and/or post the receive (that is, indicate that it wants to receive a message) and then continue the computation. This allows the programmer to overlap communication and computation. We show an example of this technique in Fig. 2.7. This style of message passing is more complicated for the programmer, because the programmer must take care to wait for the receive to complete after any work that can be overlapped with communication is completed.

Figure 2.7. Communication without (left) and with (right) support for overlapping communication and computation. Although UE 0 in the computation on the right still has some idle time waiting for the reply from UE 1, the idle time is reduced and the computation requires less total time because of UE 1's earlier start.
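A minimal sketch of this restructuring in MPI, assuming a pair of UEs exchanging fixed-size buffers; the buffer sizes, partner choice, and placeholder "work" loops are illustrative, not code from the figure.

/* Sketch of overlapping communication and computation with nonblocking
 * MPI calls, in the spirit of Fig. 2.7. Each even-ranked UE pairs with the
 * next odd-ranked UE; an unpaired rank simply exits. */
#include <mpi.h>
#include <stdio.h>

#define N 100000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static double sendbuf[N], recvbuf[N], local[N];
    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;
    if (partner < 0 || partner >= size) { MPI_Finalize(); return 0; }

    MPI_Request reqs[2];
    /* Post the receive and send early ... */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... overlap the transfer with work that does not need the message ... */
    double sum = 0.0;
    for (int i = 0; i < N; i++) sum += local[i] * local[i];

    /* ... and wait only when the incoming data is actually required. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    for (int i = 0; i < N; i++) sum += recvbuf[i];

    printf("rank %d: result %g\n", rank, sum);
    MPI_Finalize();
    return 0;
}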

Another technique used on many parallel computers is to assign multiple UEs to each PE, so that when one UE is waiting for communication, it will be possible to context switch to another UE and keep the processor busy. This is an example of latency hiding. It is increasingly being used on modern high-performance computing systems, the most famous example being the MTA system from Cray Research [ACC+90].

2.7. SUMMARY

This chapter has given a brief overview of some of the concepts and vocabulary used in parallel computing. Additional terms are defined in the glossary. We also discussed the major programming environments in use for parallel computing: OpenMP, MPI, and Java. Throughout the book, we will use these three programming environments for our examples. More details about OpenMP, MPI, and Java and how to use them to write parallel programs are provided in the appendixes.

Chapter 3. The Finding Concurrency Design Space

3.1 ABOUT THE DESIGN SPACE
3.2 THE TASK DECOMPOSITION PATTERN
3.3 THE DATA DECOMPOSITION PATTERN
3.4 THE GROUP TASKS PATTERN
3.5 THE ORDER TASKS PATTERN
3.6 THE DATA SHARING PATTERN
3.7 THE DESIGN EVALUATION PATTERN
3.8 SUMMARY

3.1. ABOUT THE DESIGN SPACE

The software designer works in a number of domains. The design process starts in the problem domain with design elements directly relevant to the problem being solved (for example, fluid flows, decision trees, atoms, etc.). The ultimate aim of the design is software, so at some point, the design elements change into ones relevant to a program (for example, data structures and software modules). We call this the program domain. Although it is often tempting to move into the program domain as soon as possible, a designer who moves out of the problem domain too soon may miss valuable design options.

This is particularly relevant in parallel programming. Parallel programs attempt to solve bigger problems in less time by simultaneously solving different parts of the problem on different processing elements. This can only work, however, if the problem contains exploitable concurrency, that is, multiple activities or tasks that can execute at the same time. After a problem has been mapped onto the program domain, however, it can be difficult to see opportunities to exploit concurrency.

Hence, programmers should start their design of a parallel solution by analyzing the problem within the problem domain to expose exploitable concurrency. We call the design space in which this analysis is carried out the Finding Concurrency design space. The patterns in this design space will help identify and analyze the exploitable concurrency in a problem. After this is done, one or more patterns from the Algorithm Structure space can be chosen to help design the appropriate algorithm structure to exploit the identified concurrency.

An overview of this design space and its place in the pattern language is shown in Fig. 3.1.

Figure 3.1. Overview of the Finding Concurrency design space and its place in the pattern language

Experienced designers working in a familiar domain may see the exploitable concurrency immediately and could move directly to the patterns in the Algorithm Structure design space.

3.1.1. Overview

Before starting to work with the patterns in this design space, the algorithm designer must first consider the problem to be solved and make sure the effort to create a parallel program will be justified: Is the problem large enough and the results significant enough to justify expending effort to solve it faster? If so, the next step is to make sure the key features and data elements within the problem are well understood. Finally, the designer needs to understand which parts of the problem are most computationally intensive, because the effort to parallelize the problem should be focused on those parts.

After this analysis is complete, the patterns in the Finding Concurrency design space can be used to start designing a parallel algorithm. The patterns in this design space can be organized into three groups.

Decomposition Patterns. The two decomposition patterns, Task Decomposition and Data Decomposition, are used to decompose the problem into pieces that can execute concurrently.

Dependency Analysis Patterns. This group contains three patterns that help group the tasks and analyze the dependencies among them: Group Tasks, Order Tasks, and Data Sharing. Nominally, the patterns are applied in this order. In practice, however, it is often necessary to work back and forth between them, or possibly even revisit the decomposition patterns.

Design Evaluation Pattern. The final pattern in this space guides the algorithm designer through an analysis of what has been done so far before moving on to the patterns in the Algorithm Structure design space. This pattern is important because it often happens that the best design is not found on the first attempt, and the earlier design flaws are identified, the easier they are to correct. In general, working through the patterns in this space is an iterative process.

3.1.2. Using the Decomposition Patterns

The first step in designing a parallel algorithm is to decompose the problem into elements that can execute concurrently. We can think of this decomposition as occurring in two dimensions.

The task decomposition dimension views the problem as a stream of instructions that can be broken into sequences called tasks that can execute simultaneously. For the computation to be efficient, the operations that make up the task should be largely independent of the operations taking place inside other tasks.

The data decomposition dimension focuses on the data required by the tasks and how it can be decomposed into distinct chunks. The computation associated with the data chunks will only be efficient if the data chunks can be operated upon relatively independently.

Viewing the problem decomposition in terms of two distinct dimensions is somewhat artificial. A task decomposition implies a data decomposition and vice versa; hence, the two decompositions are really different facets of the same fundamental decomposition. We divide them into separate dimensions, however, because a problem decomposition usually proceeds most naturally by emphasizing one dimension of the decomposition over the other. By making them distinct, we make this design emphasis explicit and easier for the designer to understand.

3.1.3. Background for Examples

In this section, we give background information on some of the examples that are used in several patterns. It can be skipped for the time being and revisited later when reading a pattern that refers to one of the examples.

Medical imaging

PET (Positron Emission Tomography) scans provide an important diagnostic tool by allowing physicians to observe how a radioactive substance propagates through a patient's body. Unfortunately, the images formed from the distribution of emitted radiation are of low resolution, due in part to the scattering of the radiation as it passes through the body. It is also difficult to reason from the absolute radiation intensities, because different pathways through the body attenuate the radiation differently.

To solve this problem, models of how radiation propagates through the body are used to correct the images. A common approach is to build a Monte Carlo model, as described by Ljungberg and King [LK98]. Randomly selected points within the body are assumed to emit radiation (usually a gamma ray), and the trajectory of each ray is followed. As a particle (ray) passes through the body, it is attenuated by the different organs it traverses, continuing until the particle leaves the body and hits a camera model, thereby defining a full trajectory. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

This problem can be parallelized in two ways. Because each trajectory is independent, it is possible to parallelize the application by associating each trajectory with a task. This approach is discussed in the Examples section of the Task Decomposition pattern. Another approach would be to partition the body into sections and assign different sections to different processing elements. This approach is discussed in the Examples section of the Data Decomposition pattern.

Linear algebra

Linear algebra is an important tool in applied mathematics: It provides the machinery required to analyze solutions of large systems of linear equations. The classic linear algebra problem asks, for matrix A and vector b, what values for x will solve the equation

Equation 3.1

    A x = b

The matrix A in Eq. 3.1 takes on a central role in linear algebra. Many problems are expressed in terms of transformations of this matrix. These transformations are applied by means of a matrix multiplication

Equation 3.2

    C = T A

If T, A, and C are square matrices of order N, matrix multiplication is defined such that each element of the resulting matrix C is

Equation 3.3

    Cij = Σk Tik Akj,   k = 1, ..., N

where the subscripts denote particular elements of the matrices. In other words, the element of the product matrix C in row i and column j is the dot product of the ith row of T and the jth column of A. Hence, computing each of the N² elements of C requires N multiplications and N−1 additions, making the overall complexity of matrix multiplication O(N³).
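For reference, a direct serial rendering of Eq. 3.3 in C (assuming row-major storage of the matrices) makes the O(N³) cost explicit:

/* Direct rendering of Eq. 3.3: each element C[i][j] is the dot product of
 * row i of T and column j of A. Matrices are stored row-major in flat
 * arrays of length n*n; the storage choice is an assumption for this sketch. */
void matmul(int n, const double *T, const double *A, double *C) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)          /* dot product over k */
                sum += T[i * n + k] * A[k * n + j];
            C[i * n + j] = sum;
        }
    }
}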

There are many ways to parallelize a matrix multiplication operation. It can be parallelized using either a task-based decomposition (as discussed in the Examples section of the Task Decomposition pattern) or a data-based decomposition (as discussed in the Examples section of the Data Decomposition pattern).

Molecular dynamics

Molecular dynamics is used to simulate the motions of a large molecular system. For example, molecular dynamics simulations show how a large protein moves around and how differently shaped drugs might interact with the protein. Not surprisingly, molecular dynamics is extremely important in the pharmaceutical industry. It is also a useful test problem for computer scientists working on parallel computing: It is straightforward to understand, relevant to science at large, and difficult to parallelize effectively. As a result, it has been the subject of much research [Mat94, PH95, Pli95].

The basic idea is to treat a molecule as a large collection of balls connected by springs. The balls represent the atoms in the molecule, while the springs represent the chemical bonds between the atoms. The molecular dynamics simulation itself is an explicit time-stepping process. At each time step, the force on each atom is computed and then standard classical mechanics techniques are used to compute how the force moves the atoms. This process is carried out repeatedly to step through time and compute a trajectory for the molecular system.

The forces due to the chemical bonds (the "springs") are relatively simple to compute. These correspond to the vibrations and rotations of the chemical bonds themselves. These are short-range forces that can be computed with knowledge of the handful of atoms that share chemical bonds. The major difficulty arises because the atoms have partial electrical charges. Hence, while atoms only interact with a small neighborhood of atoms through their chemical bonds, the electrical charges cause every atom to apply a force on every other atom.

This is the famous N-body problem. On the order of N² terms must be computed to find these nonbonded forces. Because N is large (tens or hundreds of thousands) and the number of time steps in a simulation is huge (tens of thousands), the time required to compute these nonbonded forces dominates the computation. Several ways have been proposed to reduce the effort required to solve the N-body problem. We are only going to discuss the simplest one: the cutoff method.

The idea is simple. Even though each atom exerts a force on every other atom, this force decreases with the square of the distance between the atoms. Hence, it should be possible to pick a distance beyond which the force contribution is so small that it can be ignored. By ignoring the atoms that exceed this cutoff, the problem is reduced to one that scales as O(N × n), where n is the number of atoms within the cutoff volume, usually hundreds. The computation is still huge, and it dominates the overall runtime for the simulation, but at least the problem is tractable.

There are a host of details, but the basic simulation can be summarized as in Fig. 3.2. The primary data structures hold the atomic positions (atoms), the velocities of each atom (velocities), the forces exerted on each atom (forces), and lists of atoms within the cutoff distance of each atom (neighbors). The program itself is a time-stepping loop, in which each iteration computes the short-range force terms, updates the neighbor lists, and then finds the nonbonded forces. After the force on each atom has been computed, a simple ordinary differential equation is solved to update the positions and velocities. Physical properties based on atomic motions are then updated, and we go to the next time step.

There are many ways to parallelize the molecular dynamics problem. We consider the most common approach, starting with the task decomposition (discussed in the Task Decomposition pattern) and following with the associated data decomposition (discussed in the Data Decomposition pattern). This example shows how the two decompositions fit together to guide the design of the parallel algorithm.

Figure 3.2. Pseudocode for the molecular dynamics example

Int const N                       // number of atoms
Array of Real :: atoms(3,N)       // 3D coordinates
Array of Real :: velocities(3,N)  // velocity vector
Array of Real :: forces(3,N)      // force in each dimension
Array of List :: neighbors(N)     // atoms in cutoff volume

loop over time steps
   vibrational_forces(N, atoms, forces)
   rotational_forces(N, atoms, forces)
   neighbor_list(N, atoms, neighbors)
   non_bonded_forces(N, atoms, neighbors, forces)
   update_atom_positions_and_velocities(N, atoms, velocities, forces)
   physical_properties( ... Lots of stuff ... )
end loop

3.2. THE TASK DECOMPOSITION PATTERN

Problem

How can a problem be decomposed into tasks that can execute concurrently?

Context

Every parallel algorithm design starts from the same point, namely a good understanding of the problem being solved. The programmer must understand which are the computationally intensive parts of the problem, the key data structures, and how the data is used as the problem's solution unfolds.

The next step is to define the tasks that make up the problem and the data decomposition implied by the tasks. Fundamentally, every parallel algorithm involves a collection of tasks that can execute concurrently. The challenge is to find these tasks and craft an algorithm that lets them run concurrently.

In some cases, the problem will naturally break down into a collection of independent (or nearly independent) tasks, and it is easiest to start with a task-based decomposition. In other cases, the tasks are difficult to isolate and the decomposition of the data (as discussed in the Data Decomposition pattern) is a better starting point. It is not always clear which approach is best, and often the algorithm designer needs to consider both.

Regardless of whether the starting point is a task-based or a data-based decomposition, however, a parallel algorithm ultimately needs tasks that will execute concurrently, so these tasks must be identified.

Forces

The main forces influencing the design at this point are flexibility, efficiency, and simplicity.

Flexibility. Flexibility in the design will allow it to be adapted to different implementation requirements. For example, it is usually not a good idea to narrow the options to a single computer system or style of programming at this stage of the design.

Efficiency. A parallel program is only useful if it scales efficiently with the size of the parallel computer (in terms of reduced runtime and/or memory utilization). For a task decomposition, this means we need enough tasks to keep all the PEs busy, with enough work per task to compensate for overhead incurred to manage dependencies. However, the drive for efficiency can lead to complex decompositions that lack flexibility.

Simplicity. The task decomposition needs to be complex enough to get the job done, but simple enough to let the program be debugged and maintained with reasonable effort.

Solution

The key to an effective task decomposition is to ensure that the tasks are sufficiently independent so that managing dependencies takes only a small fraction of the program's overall execution time. It is also important to ensure that the execution of the tasks can be evenly distributed among the ensemble of PEs (the load-balancing problem).

In an ideal world, the compiler would find the tasks for the programmer. Unfortunately, this almost never happens. Instead, it must usually be done by hand based on knowledge of the problem and the code required to solve it. In some cases, it might be necessary to completely recast the problem into a form that exposes relatively independent tasks.

In a task-based decomposition, we look at the problem as a collection of distinct tasks, paying particular attention to

The actions that are carried out to solve the problem. (Are there enough of them to keep the processing elements on the target machines busy?)

Whether these actions are distinct and relatively independent.

As a first pass, we try to identify as many tasks as possible; it is much easier to start with too many tasks and merge them later on than to start with too few tasks and later try to split them.

Tasks can be found in many different places.

In some cases, each task corresponds to a distinct call to a function. Defining a task for each function call leads to what is sometimes called a functional decomposition.

Another place to find tasks is in distinct iterations of the loops within an algorithm. If the iterations are independent and there are enough of them, then it might work well to base a task decomposition on mapping each iteration onto a task. This style of task-based decomposition leads to what are sometimes called loop-splitting algorithms (a minimal sketch follows this list).

Tasks also play a key role in data-driven decompositions. In this case, a large data structure is decomposed and multiple units of execution concurrently update different chunks of the data structure. In this case, the tasks are those updates on individual chunks.
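For example, a loop-splitting decomposition can be written directly in OpenMP, with each independent iteration becoming a task; process_item() and the array it fills are hypothetical stand-ins for the loop body, not code from the text.

/* Sketch of a loop-splitting task decomposition: each independent loop
 * iteration becomes a task, expressed here with an OpenMP parallel loop. */
#include <stdio.h>

static double results[1000];

/* Hypothetical independent work on item i (illustration only). */
static void process_item(int i) { results[i] = (double)i * i; }

int main(void) {
    int n_items = 1000;
    /* Loop splitting: the iterations are independent, so they can run
       concurrently as tasks distributed across the available UEs. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n_items; i++)
        process_item(i);
    printf("results[999] = %f\n", results[999]);
    return 0;
}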

Also keep in mind the forces given in the Forces section:

Flexibility. The design needs to be flexible in the number of tasks generated. Usually this is done by parameterizing the number and size of tasks on some appropriate dimension. This will let the design be adapted to a wide range of parallel computers with different numbers of processors.

Efficiency. There are two major efficiency issues to consider in the task decomposition. First, each task must include enough work to compensate for the overhead incurred by creating the tasks and managing their dependencies. Second, the number of tasks should be large enough so that all the units of execution are busy with useful work throughout the computation.

Simplicity. Tasks should be defined in a way that makes debugging and maintenance simple. When possible, tasks should be defined so they reuse code from existing sequential programs that solve related problems.

After the tasks have been identified, the next step is to look at the data decomposition implied by the tasks. The Data Decomposition pattern may help with this analysis.

Examples

Medical imaging

Consider the medical imaging problem described in Sec. 3.1.3. In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

It is natural to associate a task with each trajectory. These tasks are particularly simple to manage concurrently because they are completely independent. Furthermore, there are large numbers of trajectories, so there will be many tasks, making this decomposition suitable for a large range of computer systems, from a shared-memory system with a small number of processing elements to a large cluster with hundreds of processing elements.

With the basic tasks defined, we now consider the corresponding data decomposition, that is, we define the data associated with each task. Each task needs to hold the information defining the trajectory. But that is not all: The tasks need access to the model of the body as well. Although it might not be apparent from our description of the problem, the body model can be extremely large. Because it is a read-only model, this is no problem if there is an effective shared-memory system; each task can read data as needed. If the target platform is based on a distributed-memory architecture, however, the body model will need to be replicated on each PE. This can be very time-consuming and can waste a great deal of memory. For systems with small memories per PE and/or with slow networks between PEs, a decomposition of the problem based on the body model might be more effective.

This is a common situation in parallel programming: Many problems can be decomposed primarily in terms of data or primarily in terms of tasks. If a task-based decomposition avoids the need to break up and distribute complex data structures, it will be a much simpler program to write and debug. On the other hand, if memory and/or network bandwidth is a limiting factor, a decomposition that focuses on the data might be more effective. It is not so much a matter of one approach being "better" than another as a matter of balancing the needs of the machine with the needs of the programmer. We discuss this in more detail in the Data Decomposition pattern.
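A minimal sketch of the task-per-trajectory decomposition in OpenMP, under the shared-memory assumption discussed above; the BodyModel and Camera types and simulate_trajectory() are hypothetical stand-ins, the body model is shared read-only, and each UE accumulates into a private tally that is combined at the end.

#include <stdio.h>

typedef struct { int resolution; } BodyModel;   /* stands in for the read-only body model */
typedef struct { double counts[64]; } Camera;   /* stands in for the camera tallies */

/* Hypothetical stand-in: deposit the trajectory's result in one detector bin.
 * Real code would trace the particle through the attenuation model. */
static void simulate_trajectory(const BodyModel *body, unsigned seed, Camera *cam) {
    (void)body;
    cam->counts[seed % 64] += 1.0;
}

void run_simulation(const BodyModel *body, long n_trajectories, Camera *result) {
    #pragma omp parallel
    {
        Camera local = {{0}};                     /* per-UE tally */
        #pragma omp for schedule(static)
        for (long t = 0; t < n_trajectories; t++)
            simulate_trajectory(body, (unsigned)t, &local);   /* one task per trajectory */

        #pragma omp critical                      /* combine per-UE tallies */
        for (int i = 0; i < 64; i++)
            result->counts[i] += local.counts[i];
    }
}

int main(void) {
    BodyModel body = { 256 };
    Camera cam = {{0}};
    run_simulation(&body, 100000, &cam);
    printf("bin 0 count: %g\n", cam.counts[0]);
    return 0;
}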

Matrix multiplication

Consider the multiplication of two matrices (C = AB), as described in Sec. 3.1.3. We can produce a task-based decomposition of this problem by considering the calculation of each element of the product matrix as a separate task. Each task needs access to one row of A and one column of B. This decomposition has the advantage that all the tasks are independent, and because all the data that is shared among tasks (A and B) is read-only, it will be straightforward to implement in a shared-memory environment.

The performance of this algorithm, however, would be poor. Consider the case where the three matrices are square and of order N. For each element of C, N elements from A and N elements from B would be required, resulting in 2N memory references for N multiply/add operations. Memory access time is slow compared to floating-point arithmetic, so the bandwidth of the memory subsystem would limit the performance.

A better approach would be to design an algorithm that maximizes reuse of data loaded into a processor's caches. We can arrive at this algorithm in two different ways. First, we could group together the element-wise tasks we defined earlier so the tasks that use similar elements of the A and B matrices run on the same UE (see the Group Tasks pattern). Alternatively, we could start with the data decomposition and design the algorithm from the beginning around the way the matrices fit into the caches. We discuss this example further in the Examples section of the Data Decomposition pattern.
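One way to express such a grouping is to make each task a block of C rather than a single element, as in the following OpenMP sketch; the block size BS is an assumed tuning parameter, the matrix order n is assumed to be a multiple of BS, and C is assumed to be zero-initialized by the caller.

/* Sketch of grouping the element-wise tasks into blocks to improve cache
 * reuse: each task computes one BS-by-BS block of C, and the blocks are
 * distributed across UEs. Matrices are stored row-major in flat arrays. */
#define BS 64   /* assumed block size; tune to the cache */

void matmul_blocked(int n, const double *A, const double *B, double *C) {
    /* Assumes n is a multiple of BS and C is zero-initialized. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ii = 0; ii < n; ii += BS)
        for (int jj = 0; jj < n; jj += BS)
            for (int kk = 0; kk < n; kk += BS)     /* sweep blocks of the k dimension */
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++) {
                        double sum = C[i * n + j];
                        for (int k = kk; k < kk + BS; k++)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;        /* each (ii,jj) block is owned by one UE */
                    }
}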

Molecular dynamics

Consider the molecular dynamics problem described in Sec. 3.1.3. Pseudocode for this example is shown again in Fig. 3.3.

Before performing the task decomposition, we need to better understand some details of the problem. First, the neighbor_list() computation is time-consuming. The gist of the computation is a loop over each atom, inside of which every other atom is checked to determine whether it falls within the indicated cutoff volume. Fortunately, the time steps are very small, and the atoms don't move very much in any given time step. Hence, this time-consuming computation is only carried out every 10 to 100 steps.

Figure 3.3. Pseudocode for the molecular dynamics example

Int const N                       // number of atoms
Array of Real :: atoms(3,N)       // 3D coordinates
Array of Real :: velocities(3,N)  // velocity vector
Array of Real :: forces(3,N)      // force in each dimension
Array of List :: neighbors(N)     // atoms in cutoff volume

loop over time steps
   vibrational_forces(N, atoms, forces)
   rotational_forces(N, atoms, forces)
   neighbor_list(N, atoms, neighbors)
   non_bonded_forces(N, atoms, neighbors, forces)
   update_atom_positions_and_velocities(N, atoms, velocities, forces)
   physical_properties( ... Lots of stuff ... )
end loop

Second, the physical_properties() function computes energies, correlation coefficients, and a host of interesting physical properties. These computations, however, are simple and do not significantly affect the program's overall runtime, so we will ignore them in this discussion.

Because the bulk of the computation time will be in non_bonded_forces(), we must pick a problem decomposition that makes that computation run efficiently in parallel. The problem is made easier by the fact that each of the functions inside the time loop has a similar structure: In the sequential version, each function includes a loop over atoms to compute contributions to the force vector. Thus, a natural task definition is the update required by each atom, which corresponds to a loop iteration in the sequential version. After performing the task decomposition, therefore, we obtain the following tasks.

Tasks that find the vibrational forces on an atom

Tasks that find the rotational forces on an atom

Tasks that find the nonbonded forces on an atom

Tasks that update the position and velocity of an atom

A task to update the neighbor list for all the atoms (which we will leave sequential)
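A sketch of the per-atom non-bonded force task in OpenMP; the flat array layout, the MAX_NEIGHBORS bound, and the inverse-square placeholder force law are assumptions for illustration, not details from the text. Because each task writes only its own atom's force entries, the iterations are independent.

#define MAX_NEIGHBORS 512   /* assumed fixed bound on the neighbor list length */

/* Each iteration of the loop over atoms is one task: it reads the positions
 * of atom i's neighbors and updates only forces[3*i..3*i+2], so tasks do not
 * conflict with one another. */
void non_bonded_forces(int n_atoms, const double *atoms,
                       const int *neighbors, const int *n_neighbors,
                       double *forces) {
    #pragma omp parallel for schedule(dynamic)   /* dynamic helps balance uneven neighbor counts */
    for (int i = 0; i < n_atoms; i++) {
        for (int m = 0; m < n_neighbors[i]; m++) {
            int j = neighbors[i * MAX_NEIGHBORS + m];
            double dx = atoms[3*i]   - atoms[3*j];
            double dy = atoms[3*i+1] - atoms[3*j+1];
            double dz = atoms[3*i+2] - atoms[3*j+2];
            double r2 = dx*dx + dy*dy + dz*dz + 1e-12;
            double f  = 1.0 / r2;                /* placeholder force magnitude */
            forces[3*i]   += f * dx;
            forces[3*i+1] += f * dy;
            forces[3*i+2] += f * dz;
        }
    }
}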

With our collection of tasks in hand, we can consider the accompanying data decomposition. The key data structures are the neighbor list, the atomic coordinates, the atomic velocities, and the force vector. Every iteration that updates the force vector needs the coordinates of a neighborhood of atoms. The computation of nonbonded forces, however, potentially needs the coordinates of all the atoms, because the molecule being simulated might fold back on itself in unpredictable ways. We will use this information to carry out the data decomposition (in the Data Decomposition pattern) and the data-sharing analysis (in the Data Sharing pattern).

Known uses

Task-based decompositions are extremely common in parallel computing. For example, the distance geometry code DGEOM [Mat96] uses a task-based decomposition, as does the parallel WESDYN molecular dynamics program [MR95].

3.3. THE DATA DECOMPOSITION PATTERN

Problem

How can a problem's data be decomposed into units that can be operated on relatively independently?

Context

The parallel algorithm designer must have a detailed understanding of the problem being solved. In addition, the designer should identify the most computationally intensive parts of the problem, the key data structures required to solve the problem, and how data is used as the problem's solution unfolds.

After the basic problem is understood, the parallel algorithm designer should consider the tasks that make up the problem and the data decomposition implied by the tasks. Both the task and data decompositions need to be addressed to create a parallel algorithm. The question is not which decomposition to do. The question is which one to start with. A data-based decomposition is a good starting point if the following is true.

The most computationally intensive part of the problem is organized around the manipulation of a large data structure.

Similar operations are being applied to different parts of the data structure, in such a way that the different parts can be operated on relatively independently.

For example, many linear algebra problems update large matrices, applying a similar set of operations to each element of the matrix. In these cases, it is straightforward to drive the parallel algorithm design by looking at how the matrix can be broken up into blocks that are updated concurrently. The task definitions then follow from how the blocks are defined and mapped onto the processing elements of the parallel computer.

Forces

The main forces influencing the design at this point are flexibility, efficiency, and simplicity.

Flexibility. Flexibility will allow the design to be adapted to different implementation requirements. For example, it is usually not a good idea to narrow the options to a single computer system or style of programming at this stage of the design.

Efficiency. A parallel program is only useful if it scales efficiently with the size of the parallel computer (in terms of reduced runtime and/or memory utilization).

Simplicity. The decomposition needs to be complex enough to get the job done, but simple enough to let the program be debugged and maintained with reasonable effort.

Solution

In shared-memory programming environments such as OpenMP, the data decomposition will frequently be implied by the task decomposition. In most cases, however, the decomposition will need to be done by hand, because the memory is physically distributed, because data dependencies are too complex without explicitly decomposing the data, or to achieve acceptable efficiency on a NUMA computer.

If a task-based decomposition has already been done, the data decomposition is driven by the needs of each task. If well-defined and distinct data can be associated with each task, the decomposition should be simple.

When starting with a data decomposition, however, we need to look not at the tasks, but at the central data structures defining the problem, and consider whether they can be broken down into chunks that can be operated on concurrently. A few common examples include the following.

Array-based computations. Concurrency can be defined in terms of updates of different segments of the array. If the array is multidimensional, it can be decomposed in a variety of ways (rows, columns, or blocks of varying shapes).

Recursive data structures. We can think of, for example, decomposing the parallel update of a large tree data structure by decomposing the data structure into subtrees that can be updated concurrently.

Regardless of the nature of the underlying data structure, if the data decomposition is the primary factor driving the solution to the problem, it serves as the organizing principle of the parallel algorithm.

When considering how to decompose the problem's data structures, keep in mind the competing forces.

Flexibility. The size and number of data chunks should be flexible to support the widest range of parallel systems. One approach is to define chunks whose size and number are controlled by a small number of parameters. These parameters define granularity knobs that can be varied to modify the size of the data chunks to match the needs of the underlying hardware. (Note, however, that many designs are not infinitely adaptable with respect to granularity.)

The easiest place to see the impact of granularity on the data decomposition is in the overhead required to manage dependencies between chunks. The time required to manage dependencies must be small compared to the overall runtime. In a good data decomposition, the dependencies scale at a lower dimension than the computational effort associated with each chunk. For example, in many finite difference programs, the cells at the boundaries between chunks, that is, the surfaces of the chunks, must be shared. The size of the set of dependent cells scales as the surface area, while the effort required in the computation scales as the volume of the chunk. This means that the computational effort can be scaled (based on the chunk's volume) to offset overheads associated with data dependencies (based on the surface area of the chunk).

Efficiency. It is important that the data chunks be large enough that the amount of work to update the chunk offsets the overhead of managing dependencies. A more subtle issue to consider is how the chunks map onto UEs. An effective parallel algorithm must balance the load between UEs. If this isn't done well, some PEs might have a disproportionate amount of work, and the overall scalability will suffer. This may require clever ways to break up the problem. For example, if the problem clears the columns in a matrix from left to right, a column mapping of the matrix will cause problems as the UEs with the leftmost columns will finish their work before the others. A row-based block decomposition or even a block-cyclic decomposition (in which rows are assigned cyclically to PEs) would do a much better job of keeping all the processors fully occupied (a minimal sketch of such a cyclic mapping follows this list). These issues are discussed in more detail in the Distributed Array pattern.

Simplicity. Overly complex data decompositions can be very difficult to debug. A data decomposition will usually require a mapping of a global index space onto a task-local index space. Making this mapping abstract allows it to be easily isolated and tested.
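A minimal sketch of the cyclic row mapping mentioned under Efficiency; the owner() function is the whole idea, and the update applied to each owned row is a placeholder for whatever the real computation does.

/* Sketch of a cyclic (round-robin) assignment of matrix rows to UEs, which
 * keeps work spread across UEs even when the active region of the matrix
 * shifts over time. The row update here is a placeholder. */
static inline int owner(int row, int n_ues) {
    return row % n_ues;                  /* row i is owned by UE (i mod n_ues) */
}

void update_my_rows(int my_ue, int n_ues, int n_rows, int n_cols, double *M) {
    for (int i = 0; i < n_rows; i++) {
        if (owner(i, n_ues) != my_ue)    /* skip rows owned by other UEs */
            continue;
        for (int j = 0; j < n_cols; j++)
            M[i * n_cols + j] *= 0.5;    /* placeholder update of an owned row */
    }
}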

After the data has been decomposed, if it has not already been done, the next step is to look at the task decomposition implied by the data decomposition. The Task Decomposition pattern may help with this analysis.

Examples

Medical imaging

Consider the medical imaging problem described in Sec. 3.1.3. In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

In a data-based decomposition of this problem, the body model is the large central data structure around which the computation can be organized. The model is broken into segments, and one or more segments are associated with each processing element. The body segments are only read, not written, during the trajectory computations, so there are no data dependencies created by the decomposition of the body model.

After the data has been decomposed, we need to look at the tasks associated with each data segment. In this case, each trajectory passing through the data segment defines a task. The trajectories are initiated and propagated within a segment. When a segment boundary is encountered, the trajectory must be passed between segments. It is this transfer that defines the dependencies between data chunks.

On the other hand, in a task-based approach to this problem (as discussed in the Task Decomposition pattern), the trajectories for each particle drive the algorithm design. Each PE potentially needs to access the full body model to service its set of trajectories. In a shared-memory environment, this is easy because the body model is a read-only data set. In a distributed-memory environment, however, this would require substantial startup overhead as the body model is broadcast across the system.

This is a common situation in parallel programming: Different points of view lead to different algorithms with potentially very different performance characteristics. The task-based algorithm is simple, but it only works if each processing element has access to a large memory and if the overhead incurred loading the data into memory is insignificant compared to the program's runtime. An algorithm driven by a data decomposition, on the other hand, makes efficient use of memory and (in distributed-memory environments) less use of network bandwidth, but it incurs more communication overhead during the concurrent part of computation and is significantly more complex. Choosing which is the appropriate approach can be difficult and is discussed further in the Design Evaluation pattern.

Matrix multiplication

Consider the standard multiplication of two matrices (C = AB), as described in Sec. 3.1.3. Several data-based decompositions are possible for this problem. A straightforward one would be to decompose the product matrix C into a set of row blocks (sets of adjacent rows). From the definition of matrix multiplication, computing the elements of a row block of C requires the full B matrix, but only the corresponding row block of A. With such a data decomposition, the basic task in the algorithm becomes the computation of the elements in a row block of C.

An even more effective approach that does not require the replication of the full B matrix is to decompose all three matrices into submatrices or blocks. The basic task then becomes the update of a C block, with the A and B blocks being cycled among the tasks as the computation proceeds. This decomposition, however, is much more complex to program; communication and computation must be carefully coordinated during the most time-critical portions of the problem. We discuss this example further in the Geometric Decomposition and Distributed Array patterns.

One of the features of the matrix multiplication proble