Lecture 7: Principles of Parallel Algorithm Design

Concurrent and Multicore Programming, CSE 436/536
Department of Computer Science and Engineering
Yonghong Yan
[email protected]/~yan
Topics (Part 1)

• Introduction
• Programming on shared memory systems (Chapter 7)
  – OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on shared memory systems (Chapter 7)
  – Cilk/Cilkplus and OpenMP tasking
  – PThreads, mutual exclusion, locks, synchronizations
• Analysis of parallel program executions (Chapter 5)
  – Performance metrics for parallel systems
    • Execution time, overhead, speedup, efficiency, cost
  – Scalability of parallel systems
  – Use of performance tools
Topics (Part 2)

• Parallel architectures and hardware
  – Parallel computer architectures
  – Memory hierarchy and cache coherency
• Manycore GPU architectures and programming
  – GPU architectures
  – CUDA programming
  – Introduction to the offloading model in OpenMP
• Programming on large scale systems (Chapter 6)
  – MPI (point to point and collectives)
  – Introduction to PGAS languages, UPC and Chapel
• Parallel algorithms (Chapters 8, 9 & 10)
  – Sorting and stencil
“parallel” and “for” OpenMP Constructs

Sequential code:

for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region (manual division of iterations among threads):

#pragma omp parallel shared(a, b)
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id + 1) * N / Nthrds;
    for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
}

OpenMP parallel region and a worksharing “for” construct:

#pragma omp parallel shared(a, b) private(i)
#pragma omp for schedule(static)
for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
OpenMP Best Practices

#pragma omp parallel private(i)
{
    #pragma omp for nowait
    for (i = 0; i < n; i++)
        a[i] += b[i];
    #pragma omp for nowait
    for (i = 0; i < n; i++)
        c[i] += d[i];
    #pragma omp barrier
    #pragma omp for nowait reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] + c[i];
}
False Sharing in OpenMP and Solution

• False sharing
  – Occurs when at least one thread writes to a cache line while other threads access it
    • Thread 0: = A[1] (read)
    • Thread 1: A[0] = ... (write)
  – A store into a shared cache line invalidates the other copies of that line: the system cannot distinguish between changes within one individual line
• Solution: use array padding

Code with false sharing (all threads update elements on the same cache line):

int a[max_threads];
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < max_threads; i++)
    a[i] += i;

Code with array padding (each thread's element sits on its own cache line, assuming cache_line_size is given in ints):

int a[max_threads][cache_line_size];
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < max_threads; i++)
    a[i][0] += i;

[Figure: threads T0 and T1 on different CPUs, each with a cached copy of line A, backed by memory]

(Source: "Getting OpenMP Up To Speed," RvdP/V1 tutorial, IWOMP 2010, CCS Un. of Tsukuba, June 14, 2010)
NUMA First-Touch

About "first touch" placement: on a NUMA system, memory is placed on the node whose thread first touches it.

#pragma omp parallel for num_threads(2)
for (i = 0; i < 100; i++)
    a[i] = 0;

First touch: both memories each have "their half" of the array (thread 0's node holds a[0]:a[49], thread 1's node holds a[50]:a[99]).

(Source: "Getting OpenMP Up To Speed," RvdP/V1 tutorial, IWOMP 2010, CCS Un. of Tsukuba, June 14, 2010)
Work with First-Touch in OpenMP

• First-touch in practice
  – Initialize data consistently with the computations

#pragma omp parallel for
for (i = 0; i < N; i++) {
    a[i] = 0.0; b[i] = 0.0; c[i] = 0.0;
}
readfile(a, b, c);
#pragma omp parallel for
for (i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}
SPMD Program Models in OpenMP

• SPMD (Single Program, Multiple Data) for parallel regions
  – All threads of the parallel region execute the same code
  – Each thread has a unique ID
• Use the thread ID to diverge the execution of the threads
  – Different threads can follow different paths through the same code:

    if (my_id == x) { /* one path */ } else { /* another path */ }

• SPMD is by far the most commonly used pattern for structuring parallel programs
  – MPI, OpenMP, CUDA, etc.
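A minimal runnable sketch of the SPMD pattern in OpenMP (not from the original slides; the work in each branch is a placeholder):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int my_id = omp_get_thread_num();      /* unique ID per thread */
        int nthreads = omp_get_num_threads();
        if (my_id == 0)
            printf("thread %d of %d: coordinator path\n", my_id, nthreads);
        else
            printf("thread %d of %d: worker path\n", my_id, nthreads);
    }
    return 0;
}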
Overview: Algorithms and Concurrency

• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
• Decomposition Techniques
  – Recursive Decomposition
  – Data Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition
• Characteristics of Tasks and Interactions
  – Task Generation, Granularity, and Context
  – Characteristics of Task Interactions
Overview: Concurrency and Mapping

• Mapping Techniques for Load Balancing
  – Static and Dynamic Mapping
• Methods for Minimizing Interaction Overheads
  – Maximizing Data Locality
  – Minimizing Contention and Hot Spots
  – Overlapping Communication and Computation
  – Replication vs. Communication
  – Group Communication vs. Point-to-Point Communication
• Parallel Algorithm Design Models
  – Data-Parallel, Work-Pool, Task Graph, Master-Slave, Pipeline, and Hybrid Models
Overview: Algorithms and Concurrency

• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
• Decomposition Techniques
  – Recursive Decomposition
  – Data Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition
• Characteristics of Tasks and Interactions
  – Task Generation, Granularity, and Context
  – Characteristics of Task Interactions
Decomposition, Tasks, and Dependency Graphs

• Decompose work into tasks that can be executed concurrently
• The decomposition can be done in many different ways
• Tasks may be of the same, different, or even indeterminate sizes
• Task dependency graph:
  – node = task
  – edge = control dependence, output-input dependency
  – No dependency == parallelism
Example: Dense Matrix-Vector Multiplication

[Figure: matrix A multiplied by vector b yields vector y, one task per element of y]

• Computation of each element of the output vector y is independent
• Decomposed into n tasks, one per element in y: easy
• Observations
  – Each task only reads one row of A, and writes one element of y
  – All tasks share vector b (the shared data)
  – No control dependencies between tasks
  – All tasks are of the same size in terms of number of operations

REAL A[n][n], b[n], y[n];
REAL sum;
int i, j;
for (i = 0; i < n; i++) {
    sum = 0.0;
    for (j = 0; j < n; j++)
        sum += A[i][j] * b[j];
    y[i] = sum;
}
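Since the n tasks are independent, this decomposition maps directly onto an OpenMP worksharing loop; a minimal sketch (not from the original slides), reusing the declarations above:

#pragma omp parallel for private(j)
for (i = 0; i < n; i++) {
    REAL sum = 0.0;               /* private partial sum per iteration */
    for (j = 0; j < n; j++)
        sum += A[i][j] * b[j];    /* each task reads one row of A and all of b */
    y[i] = sum;                   /* and writes one element of y */
}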
Example: Database Query Processing

Consider the execution of the query:

MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")

on the following table:

ID#   Model    Year  Color  Dealer  Price
4523  Civic    2002  Blue   MN      $18,000
3476  Corolla  1999  White  IL      $15,000
7623  Camry    2001  Green  NY      $21,000
9834  Prius    2001  Green  CA      $18,000
6734  Civic    2001  White  OR      $17,000
5342  Altima   2001  Green  FL      $19,000
3845  Maxima   2001  Blue   NY      $22,000
8354  Accord   2000  Green  VT      $18,000
4395  Civic    2001  Red    CA      $17,000
7352  Civic    2002  Red    WA      $18,000
Example: Database Query Processing

• Tasks: each task searches the whole table for entries that satisfy one predicate
  – Result: a list of entries
• Edges: the output of one task serves as input to the next
  – for the query MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")
Example: Database Query Processing

• An alternate decomposition of the same query:
  MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")
• Different decompositions may yield different parallelism and performance
Granularity of Task Decompositions

• Granularity is task size (the amount of computation)
  – Determined by the number of tasks for the same problem size
• Fine-grained decomposition = large number of tasks
• Coarse-grained decomposition = small number of tasks
• Granularity for the dense matrix-vector product (see the sketch below)
  – fine-grain: each task computes an individual element of y
  – coarser-grain: each task computes 3 elements of y

[Figure: y = A x b with rows of A grouped three at a time per task]
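A hedged sketch (not from the slides) of the coarser-grained decomposition using the OpenMP loop constructs from earlier: schedule(static, 3) hands each thread chunks of 3 iterations, i.e., 3 elements of y per chunk, whereas schedule(static, 1) would correspond to the fine-grained version.

/* coarser grain: each chunk computes 3 elements of y */
#pragma omp parallel for schedule(static, 3) private(j)
for (i = 0; i < n; i++) {
    REAL sum = 0.0;
    for (j = 0; j < n; j++)
        sum += A[i][j] * b[j];
    y[i] = sum;
}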
Degree of Concurrency

• Definition: the number of tasks that can be executed in parallel
  – May change over program execution
• Metrics
  – Maximum degree of concurrency
    • The maximum number of concurrent tasks at any point during execution
  – Average degree of concurrency
    • The average number of tasks that can be processed in parallel over the execution of the program
• Speedup: serial_execution_time / parallel_execution_time
• Inverse relationship between degree of concurrency and task granularity
  – Task granularity up (fewer tasks) => degree of concurrency down
  – Task granularity down (more tasks) => degree of concurrency up
Examples: Degree of Concurrency

For each example, consider the maximum and average degree of concurrency:
• Database query
  – Max: 4
  – Average: 7/3 (assuming each task takes the same time)
• Matrix-vector multiplication
  – Max: n
  – Average: n
Critical Path Length

• A directed path: a sequence of tasks that must be serialized
  – Executed one after another
• Critical path:
  – The longest weighted path through the graph
• Critical path length: the shortest time in which the program can be executed in parallel
  – A lower bound on parallel execution time

[Figure: task dependency graph for a building project]

Critical Path Length and Degree of Concurrency

Questions, for the two database query task dependency graphs:
• What are the tasks on the critical path for each dependency graph?
• What is the shortest parallel execution time?
• How many processors are needed to achieve the minimum time?
• What is the maximum degree of concurrency?
• What is the average parallelism (average degree of concurrency)?
  – Total amount of work / (critical path length)
  – 2.33 (63/27) and 1.88 (64/34) for the two graphs
Limits on Parallel Performance

• What bounds parallel execution time?
  – Minimum task granularity
    • e.g., dense matrix-vector multiplication has ≤ n^2 concurrent tasks
  – The fraction of application work that can't be parallelized
    • more about Amdahl's law in a later lecture...
  – Dependencies between tasks
  – Parallelization overheads
    • e.g., the cost of communication between tasks
• Measures of parallel performance
  – speedup = T1 / Tp
  – parallel efficiency = T1 / (p * Tp)
Task Interaction Graphs

• Tasks generally exchange data with one another
  – Example: dense matrix-vector multiply
    • If b is not replicated across all tasks, each task has some, but not all, of b
    • Tasks will have to communicate elements of b
• Task interaction graph
  – node = task
  – edge = interaction or data exchange
• Task interaction graphs vs. task dependency graphs
  – Task dependency graphs represent control dependences
  – Task interaction graphs represent data dependences
Task Interaction Graphs: An Example

Sparse matrix-vector multiplication y = A x b:
• Computation of each result element y[i] is an independent task
• Only the non-zero elements of the sparse matrix A participate in the computation
• If b is partitioned across tasks, i.e., task Ti has b[i] only:
  – The task interaction graph of the computation = the graph of the matrix A
  – A is the adjacency matrix of that graph
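For concreteness, a minimal sketch of the per-task computation, assuming A is stored in CSR form with arrays row_ptr, col_idx, and vals (these names are assumptions, not from the slides):

/* task i computes y[i]; only the non-zeros of row i participate */
for (int i = 0; i < n; i++) {
    REAL sum = 0.0;
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
        sum += vals[k] * b[col_idx[k]];  /* b[col_idx[k]] may be owned by another task */
    y[i] = sum;
}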
Task Interaction Graphs, Granularity, and Communication

• Finer task granularity => more overhead from task interactions
  – Overhead as a ratio of the useful work of a task
• Example: the sparse matrix-vector product interaction graph
• Assumptions:
  – each dot product term (A[i][j] * b[j]) takes unit time to process
  – each communication (edge) incurs an overhead of one unit of time
• If node 0 is a task: communication = 3; computation = 4
• If nodes 0, 4, and 5 are one task: communication = 5; computation = 15
  – Coarser-grain decomposition => smaller communication/computation ratio (3/4 vs. 5/15)
Processes and Mapping

• Generally
  – # of tasks >= # of processing elements (PEs) available
  – A parallel algorithm must map tasks to processes
• Mapping: aggregate tasks into processes
  – Process/thread = a processing or computing agent that performs work
  – Assign a collection of tasks and associated data to a process
• A PE, e.g., a core/hardware thread, has its own system abstraction
  – It is not easy to bind tasks directly to physical PEs; thus there is at least one more conceptual layer between tasks and PEs
  – Process in MPI, thread in OpenMP/pthreads, etc.
• The terms "process" and "thread" are overloaded
  – Task -> processes -> OS processes -> CPUs -> cores
  – For the sake of simplicity, processes = PEs/cores
Processes and Mapping

• The mapping of tasks to processes is critical to the parallel performance of an algorithm
• On what basis should one choose mappings?
  – The task dependency graph
  – The task interaction graph
• Task dependency graphs
  – Used to ensure an equal spread of work across all processes at any point
    • minimum idling
    • optimal load balance
• Task interaction graphs
  – Used to minimize interactions
    • Minimize communication and synchronization
Processes and Mapping

A good mapping must minimize parallel execution time by:
• Mapping independent tasks to different processes
  – Maximize concurrency
• Giving tasks on the critical path high priority of being assigned to processes
• Minimizing interaction between processes
  – Map tasks with dense interactions to the same process
• Difficulty: these criteria often conflict with each other
  – e.g., no decomposition, i.e., one task, minimizes interaction but yields no speedup at all!
  – Other such conflicting cases?
Processes and Mapping: Example

Example: mapping database queries to processes
• Consider the dependency graph in levels
  – No nodes in a level depend upon one another
• Assign all tasks within a level to different processes
  – Compute the levels using topological sort, as in the sketch below
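A sketch of the level computation, assuming the dependency graph is given as adjacency lists with precomputed in-degrees (all names here are assumptions for illustration):

#include <stdlib.h>

/* Assign each task a level: sources get level 0, every other task gets
   1 + the maximum level among its predecessors. Tasks that share a level
   are independent and can be mapped to different processes. */
void compute_levels(int n, int *indegree, int **succ, int *nsucc, int *level) {
    int *queue = malloc(n * sizeof(int));
    int head = 0, tail = 0;
    for (int v = 0; v < n; v++) {
        level[v] = 0;
        if (indegree[v] == 0) queue[tail++] = v;   /* sources: level 0 */
    }
    while (head < tail) {
        int u = queue[head++];
        for (int k = 0; k < nsucc[u]; k++) {
            int w = succ[u][k];
            if (level[u] + 1 > level[w]) level[w] = level[u] + 1;
            if (--indegree[w] == 0) queue[tail++] = w;  /* all preds done */
        }
    }
    free(queue);
}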
Overview: Algorithms and Concurrency

• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
• Decomposition Techniques
  – Recursive Decomposition
  – Data Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition
• Characteristics of Tasks and Interactions
  – Task Generation, Granularity, and Context
  – Characteristics of Task Interactions
Decomposition Techniques

So how does one decompose a task into various subtasks?
• There is no single recipe that works for all problems
• Practically used techniques:
  – Recursive decomposition
  – Data decomposition
  – Exploratory decomposition
  – Speculative decomposition
Recursive Decomposition

Generally suited to problems solvable by the divide-and-conquer strategy.
Steps:
1. Decompose a problem into a set of sub-problems
2. Recursively decompose each sub-problem
3. Stop decomposing when the minimum desired granularity is reached
Recursive Decomposition: Quicksort

At each level and for each vector:
1. Select a pivot
2. Partition the set around the pivot
3. Recursively sort each subvector

quicksort(A, lo, hi)
    if lo < hi
        p = pivot_partition(A, lo, hi)
        quicksort(A, lo, p - 1)
        quicksort(A, p + 1, hi)

Each vector can be sorted concurrently (i.e., each sorting represents an independent subtask), as in the tasking sketch below.
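A hedged sketch of this decomposition with OpenMP tasks, mirroring the tasking style used for Fib later in this lecture (pivot_partition is the helper assumed by the pseudocode above):

void quicksort(int *A, int lo, int hi) {
    if (lo < hi) {
        int p = pivot_partition(A, lo, hi);
        #pragma omp task shared(A)   /* left subvector: independent subtask */
        quicksort(A, lo, p - 1);
        #pragma omp task shared(A)   /* right subvector: independent subtask */
        quicksort(A, p + 1, hi);
        #pragma omp taskwait         /* join both subtasks before returning */
    }
}

/* typically launched as:
   #pragma omp parallel
   #pragma omp single
   quicksort(A, 0, n - 1);
*/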
Recursive Decomposition: Min

Finding the minimum in a vector using divide-and-conquer:

procedure SERIAL_MIN(A, n)
    min = A[0];
    for i := 1 to n - 1 do
        if (A[i] < min) min := A[i];
    return min;

procedure RECURSIVE_MIN(A, n)
    if (n = 1) then
        min := A[0];
    else
        lmin := RECURSIVE_MIN(A, n/2);
        rmin := RECURSIVE_MIN(&(A[n/2]), n - n/2);
        if (lmin < rmin) then min := lmin;
        else min := rmin;
    return min;

Applicable to other associative operations, e.g., sum, AND...
Known as a reduction operation (see the OpenMP sketch below).
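OpenMP can express this reduction pattern directly; a minimal sketch, noting that min is a built-in reduction operator since OpenMP 3.1:

int min_val = A[0];
#pragma omp parallel for reduction(min:min_val)
for (int i = 1; i < n; i++)
    if (A[i] < min_val) min_val = A[i];  /* per-thread minima combined at the end */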
Recursive Decomposition: Min

Example: finding the minimum in the set {4, 9, 1, 7, 8, 11, 2, 12}.

Task dependency graph:
• The RECURSIVE_MIN calls form a binary tree
• Each min task finishes and closes when its children complete
• Tasks in the same level can run in parallel
Fib with OpenMP Tasking

• Task completion occurs when the task reaches the end of the task region code
• Multiple tasks are joined to complete through the use of task synchronization constructs
  – taskwait
  – barrier construct
• taskwait constructs:
  – #pragma omp taskwait
  – !$omp taskwait

int fib(int n) {
    int x, y;
    if (n < 2) return n;
    else {
        #pragma omp task shared(x)
        x = fib(n - 1);
        #pragma omp task shared(y)
        y = fib(n - 2);
        #pragma omp taskwait
        return x + y;
    }
}
Data Decomposition: The Most Commonly Used Approach

• Steps:
  1. Identify the data on which computations are performed
  2. Partition this data across various tasks
• Partitioning induces a decomposition of the problem, i.e., the computation is partitioned
• Data can be partitioned in various ways
  – This is critical for parallel performance
• Decomposition can be based on
  – output data
  – input data
  – input + output data
  – intermediate data
Output Data Decomposition

• Each element of the output can be computed independently of the others
  – simply as a function of the input
• A natural problem decomposition
Output Data Decomposition: Matrix Multiplication

Multiplying two n x n matrices A and B to yield matrix C.
The output matrix C can be partitioned into four tasks, one per 2 x 2 block of C:

Task 1: C1,1 = A1,1 B1,1 + A1,2 B2,1
Task 2: C1,2 = A1,1 B1,2 + A1,2 B2,2
Task 3: C2,1 = A2,1 B1,1 + A2,2 B2,1
Task 4: C2,2 = A2,1 B1,2 + A2,2 B2,2
Output Data Decomposition: Example

A partitioning of the output data does not result in a unique decomposition into tasks. For example, for the same problem as on the previous slide, with an identical output data distribution, we can derive the following two (other) decompositions:

Decomposition I:
Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,1 B1,2
Task 4: C1,2 = C1,2 + A1,2 B2,2
Task 5: C2,1 = A2,1 B1,1
Task 6: C2,1 = C2,1 + A2,2 B2,1
Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2

Decomposition II:
Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,2 B2,2
Task 4: C1,2 = C1,2 + A1,1 B1,2
Task 5: C2,1 = A2,2 B2,1
Task 6: C2,1 = C2,1 + A2,1 B1,1
Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2
Output Data Decomposition: Example

Count the frequency of itemsets in database transactions:
• Decompose the itemsets to count
  – each task computes the total count for each of its itemsets
  – append the per-task totals to produce the total count result
Output Data Decomposition: Example

From the previous example, the following observations can be made:
• If the database of transactions is replicated across the processes, each task can be accomplished independently, with no communication.
• If the database is partitioned across processes as well (for reasons of memory utilization), each task first computes partial counts. These counts are then aggregated at the appropriate task.
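A sketch of the partitioned-database case, where each task counts over its own partition and the partial counts are then summed (names such as count_partition, db_part, and partial_count are assumptions for illustration):

/* Each task t scans only its own partition of the database and fills
   partial_count[t][s] for every itemset s; the aggregation loops then
   sum the partial counts into the final result. */
#pragma omp parallel for
for (int t = 0; t < num_tasks; t++)
    count_partition(db_part[t], partial_count[t]);

for (int s = 0; s < num_itemsets; s++)
    for (int t = 0; t < num_tasks; t++)
        total_count[s] += partial_count[t][s];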
Input Data Partitioning

• Generally applicable if each output can be naturally computed as a function of the input
• In many cases, this is the only natural decomposition if the output is not clearly known a priori
  – e.g., the problem of finding the minimum, or sorting
• A task is associated with each input data partition
  – The task performs computation on its part of the data
  – Subsequent processing combines these partial results
• This is the model behind MapReduce
Input Data Partitioning: Example

[Figure: itemset counting with the transaction database partitioned across tasks; each task produces partial counts that are later combined]
Partitioning Input and Output Data

• Partition on both input and output for more concurrency
• Example: itemset counting
Intermediate Data Partitioning

• Useful if the computation is a sequence of transforms
  – from input data to output data, e.g., an image-processing workflow
• Can decompose based on the data of the intermediate stages
Intermediate Data Partitioning: Example

• Dense matrix multiplication
  – Visualize this computation in terms of intermediate matrices D
• A decomposition based on the intermediate data yields 8 + 4 tasks:

Stage I:
Task 01: D1,1,1 = A1,1 B1,1    Task 02: D2,1,1 = A1,2 B2,1
Task 03: D1,1,2 = A1,1 B1,2    Task 04: D2,1,2 = A1,2 B2,2
Task 05: D1,2,1 = A2,1 B1,1    Task 06: D2,2,1 = A2,2 B2,1
Task 07: D1,2,2 = A2,1 B1,2    Task 08: D2,2,2 = A2,2 B2,2

Stage II:
Task 09: C1,1 = D1,1,1 + D2,1,1    Task 10: C1,2 = D1,1,2 + D2,1,2
Task 11: C2,1 = D1,2,1 + D2,2,1    Task 12: C2,2 = D1,2,2 + D2,2,2
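A hedged sketch of this two-stage decomposition with OpenMP tasks; following the table above, D(q,i,j) = A(i,q) B(q,j) (0-based indices in the code), and mat_block_mul / mat_block_add are assumed block helpers, not from the slides:

#pragma omp parallel
#pragma omp single
{
    /* Stage I: 8 independent block products D[q][i][j] = A[i][q] * B[q][j] */
    for (int q = 0; q < 2; q++)
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++) {
                #pragma omp task shared(D, A, B)
                mat_block_mul(D[q][i][j], A[i][q], B[q][j]);
            }
    #pragma omp taskwait   /* Stage II consumes all Stage I results */

    /* Stage II: 4 independent block sums C[i][j] = D[0][i][j] + D[1][i][j] */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            #pragma omp task shared(C, D)
            mat_block_add(C[i][j], D[0][i][j], D[1][i][j]);
        }
}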
Intermediate Data Partitioning: Example

The task dependency graph for the decomposition into 12 tasks (shown on the previous slide): each Stage II task (09-12) depends on the two Stage I tasks that produce its D blocks.
The Owner Computes Rule

• Each datum is assigned to a process
• Each process computes the values associated with its data
• Implications
  – Input data decomposition
    • All computations using an input datum are performed by its process
  – Output data decomposition
    • An output is computed by the process assigned to that output data
References

• Adapted from the slides "Principles of Parallel Algorithm Design" by Ananth Grama
• Based on Chapter 3 of "Introduction to Parallel Computing" by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003.