Distributed-memory Graph Algorithms: Case studies with ... · Sayan Ghosh, Mahantesh Halappanavar+, Ananth Kalyanaraman, Assefaw Gebremedhin, Antonino Tumeo+ Washington State

• Distributed-memory graph applications exhibit irregularcommunication patterns, challenging to parallelize

• We study distributed-memory implementations ofCommunity Detection (using Louvain method) andMaximumWeight Matching (half approximate method)• Partition a graph into clusters (or communities) such that

each cluster consists of vertices that are denselyconnected within the cluster and sparsely connected tothe rest of the graph

• A matching in a graph is a subset of edges such that notwo “matched” edges are incident on the same vertex

Distributed-memoryGraphAlgorithms:CasestudieswithCommunityDetectionandWeightedMatchingSayanGhosh*,MahanteshHalappanavar+,AnanthKalyanaraman*,AssefawGebremedhin*,AntoninoTumeo+

*WashingtonStateUniversity,Pullman,WA+PacificNorthwestNationalLaboratory,Richland,WA

About

Acknowledgements

Objective: Todeviseheuristicsthatimproveexecutiontimeperformanceand/orquality.

• Goodness of partitioning measured using a global metriccalled modularity (Q), that depends on sum of intra andinter community edge weights

• In 2008, Blondel, et al. introduced a multi-phase,iterative heuristic for modularity optimization, called theLouvain method

Louvainmethodforgraphclustering

Contig GenerationExperimentsconductedonNERSCCoriandEdisonsupercomputers

Performance:CommunityDetection

TheresearchisinpartsupportedbytheU.S.DOEExaGraphproject,acollaborativeeffortofU.S.DOESCandNNSAatDOEPNNL.

• S.Ghosh,M.Halappanavar,A.Tumeo,A.Kalyanaraman,H.Lu,D.Chavarrià-Miranda,A.Khan,A.Gebremedhin"DistributedLouvainAlgorithmforGraphCommunityDetection“, 2018IEEEInternationalParallelandDistributedProcessingSymposium(IPDPS)

• S.Ghosh,M.Halappanavar,A.Kalyanaraman,A.Tumeo,A.Gebremedhin,“miniVite:AGraphAnalyticsBenchmarkingToolforMassivelyParallelSystems”,2019PerformanceModeling,BenchmarkingandSimulationofHighPerformanceComputerSystems(PMBS)

• S.Ghosh,M.Halappanavar,A.Kalyanaraman,A.Khan,A.Gebremedhin,“ExploringMPICommunicationModelsforGraphApplicationsUsingGraphMatchingasaCaseStudy”[underreview]

References

HeuristicsforCommunityDetectionObjective:Implementedhalf-approxmatchingusingMPISend-Recv(NSR),Neighborhoodcollectives(NCL)andRMA.

Observed2-3.5xspeedupon4-16KprocessesforbothNCLNCL/RMAisnotefficientforthisinputandRMAversionsrelativetoNSR

RMAperformsatleast25-35%betterthanNSRandNCLLargeneighborhoodresultsinpoorperformance

Performance:Halfapproximatematching

Energy/MemoryformatchingonCori

Within each iteration• ΔQ when a vertex migrates• Move vertex from current community to one

that yields max ΔQ

At the end of a phase, the graph is rebuilt

Phase continues until ΔQ between successive iterations is below a threshold

Initially each vertex assigned to a separate community

• In the first phase, the initial set of locally dominant edgesare identified and added to matching set M

• Next phase is iterative, for each vertex in M, itsunmatched neighboring vertices are matched

N’v represents unmatched vertices in v’s neighborhood

Vertex with the heaviest unmatchededge incident on v is referred as v’smate

mate of a vertex can change as it may try to match with multiplevertices in its neighborhood

i j

kC(k)

C(i) C(j)

Penalizeavertexineveryiterationifitstaysinthesamecommunity,eventuallyitbecomesimmobileifthecumulativepenaltyfallbelowacutoff(anothervariantrequiresglobalcommunication).

Whenα iscloseto1,thisschemeismoreaggressiveinterminatingvertices,whereascloseto0isthebaselinecase.

Withoutcoloring,inparallelanegativegainscenarioispossible(sinceprocessesworkwithoutdatedinformation).WecolorafractionofverticeswithpreselectednumberofcolorclassesusingtheJones-Plassmann algorithm.

ColoringisexpensiveindistributedmemoryasentireLouvainiterationneedstobeinvokedpercolorclass,increasingthecommunicationcalls.

Initially,whenthegraphislarge,increasingτleadstoquickerexitperphase.

EarlyTermination ThresholdCycling Incompletecoloring

CommunicationcharacteristicsonNERSCCori(1Kprocesses)Matching Communitydetection Graph500BFS

Friendster(1.8Bedges) R-MATgraph(2.1Bedges)

Executiontime ModularityPerformanceofGraphChallengeInputgraphson16nodesofEdison(except200Kcase,whichwasrunon1node).Coloringperformanceis~8-10xworse,modularityimprovesbyanequalfactor!

Observed2-46xspeeduprelativetoaparallelbaselineversiononreal-worldgraphs!

ImplementedcommunicationintensivepartsusingMPIcollectives(COLL),blockingSend-Recv(SR),nonblockingSend-Recv(NBSR)andRMA.Observed4-18%divergenceinperformanceacrossversions.

VersionsyieldingthebestperformanceovertheSend-Recvbaselineversion(runon512-16Kprocesses)forinputgraphs.

• AveragememoryconsumptionforNCListheleast,~1.03−2.3xlessthanNSR,~9−27%lessthanRMA• OverallnodeenergyconsumptionofNSRisabout4xtothatofNCLandRMAforFriendster

With color W/Ocolor

Maximumweightmatching

Documents

Distributed-memory Graph Algorithms: Case studies with ... · Sayan Ghosh*, Mahantesh Halappanavar+, Ananth Kalyanaraman*, Assefaw Gebremedhin*, Antonino Tumeo+ *Washington State

Distributed-memory Graph Algorithms: Case studies with ... · Sayan Ghosh, Mahantesh Halappanavar+, Ananth Kalyanaraman, Assefaw Gebremedhin, Antonino Tumeo+ Washington State