Scaling Challenges in NAMD: Past and Future
NAMD Team: Abhinav Bhatele, Eric Bohm, Sameer Kumar, David Kunzman, Chao Mei, Chee Wai Lee, Kumaresh P., James Phillips, Gengbin Zheng, Laxmikant Kale, Klaus Schulten
Outline
- NAMD: An Introduction
- Past Scaling Challenges
  - Conflicting Adaptive Runtime Techniques
  - PME Computation
  - Memory Requirements
- Performance Results
  - Comparison with other MD codes
- Future Challenges
  - Load Balancing
  - Parallel I/O
  - Fine-grained Parallelization
What is NAMD?
- A parallel molecular dynamics application
- Simulates the life of a bio-molecule
How is the simulation performed?
- The simulation window is broken down into a large number of time steps (typically 1 fs each)
- Forces on every atom are calculated every time step
- Velocities and positions are updated and atoms are migrated to their new positions
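As a rough illustration of the per-step work just described, here is a minimal sketch of one time step (hypothetical types and names, not NAMD's actual integrator): forces are recomputed every step, then velocities and positions are advanced by dt.

```cpp
// Minimal sketch of one MD time step (illustrative only, not NAMD code).
#include <vector>

struct Vec3 { double x, y, z; };
struct Atom { Vec3 pos, vel, force; double mass; };

// Placeholder: in NAMD this covers bonded, short-range non-bonded, and PME forces.
void computeForces(std::vector<Atom>& atoms) {
    for (Atom& a : atoms) a.force = {0.0, 0.0, 0.0};  // stub only
}

void timeStep(std::vector<Atom>& atoms, double dt) {
    computeForces(atoms);                        // forces on every atom, every step
    for (Atom& a : atoms) {
        a.vel.x += dt * a.force.x / a.mass;      // update velocities...
        a.vel.y += dt * a.force.y / a.mass;
        a.vel.z += dt * a.force.z / a.mass;
        a.pos.x += dt * a.vel.x;                 // ...then positions
        a.pos.y += dt * a.vel.y;
        a.pos.z += dt * a.vel.z;
        // atoms that leave their spatial domain are migrated to neighboring patches (not shown)
    }
}
```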
How is NAMD parallelized?
- Hybrid decomposition
[Diagram: Patches multicast atom data to Bonded Computes, Non-bonded Computes, and PME; results return to Patch Integration via reductions and point-to-point messages]
What makes NAMD efficient?
- Charm++ runtime support
  - Asynchronous message-driven model
  - Adaptive overlap of communication and computation
[Timeline: adaptive overlap of non-bonded work, bonded work, integration, PME, and communication]
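The message-driven overlap above can be illustrated with a generic scheduler sketch (this is not the Charm++ API; all names are illustrative): work units run as their input messages arrive, so computation on data already present overlaps with communication still in flight.

```cpp
// Hedged sketch of a message-driven execution model (generic illustration).
#include <functional>
#include <queue>
#include <utility>

struct Message { int targetObject; std::function<void()> handler; };

class Scheduler {
    std::queue<Message> inbox;                    // filled asynchronously by the network layer
public:
    void deliver(Message m) { inbox.push(std::move(m)); }
    void run() {
        while (!inbox.empty()) {
            Message m = std::move(inbox.front()); // no global barriers: whichever message is
            inbox.pop();                          // available next is processed, overlapping
            m.handler();                          // computation with pending communication
        }
    }
};
```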
What makes NAMD efficient? (continued)
- Load balancing support
  - Difficult problem: balancing heterogeneous computation
  - Measurement-based load balancing
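To make "measurement-based" concrete, here is a hedged sketch of a greedy balancer (illustrative names, not Charm++'s actual load balancer, which also accounts for communication): per-object times measured on recent steps drive reassignment of compute objects, heaviest first, to the least-loaded processor.

```cpp
// Hedged sketch of greedy, measurement-based load balancing (illustrative only).
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct ComputeObj { int id; double measuredTime; };   // time observed on previous steps

// Returns assignment[id] = processor; assumes ids are 0..n-1.
std::vector<int> greedyAssign(std::vector<ComputeObj> objs, int numProcs) {
    std::sort(objs.begin(), objs.end(),
              [](const ComputeObj& a, const ComputeObj& b) {
                  return a.measuredTime > b.measuredTime;      // heaviest objects first
              });
    using Load = std::pair<double, int>;                       // (current load, processor id)
    std::priority_queue<Load, std::vector<Load>, std::greater<Load>> procs;
    for (int p = 0; p < numProcs; ++p) procs.push({0.0, p});

    std::vector<int> assignment(objs.size());
    for (const ComputeObj& c : objs) {
        auto [load, p] = procs.top(); procs.pop();
        assignment[c.id] = p;                                  // place on least-loaded processor
        procs.push({load + c.measuredTime, p});
    }
    return assignment;
}
```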
What makes NAMD highly scalable?
- Hybrid decomposition scheme
- Variants of this hybrid scheme are used by Blue Matter and Desmond
[Figure: spatial Patch objects and interaction Compute objects]
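A hedged sketch of the hybrid decomposition idea (illustrative data structures, not NAMD's classes): the simulation box is split spatially into patches, and a separate compute object is created for each pair of neighboring patches, so the number of schedulable work units grows well beyond the number of patches.

```cpp
// Hedged sketch of hybrid decomposition (illustrative, not NAMD's classes).
#include <vector>

struct Patch { int id; /* atoms owned by this spatial domain */ };

struct PairCompute { int patchA, patchB; };   // non-bonded work for one patch pair

// Build one compute per patch and per half of its 26 neighbors on a px*py*pz grid
// (self-pair included; periodic wrap-around omitted for brevity).
std::vector<PairCompute> buildComputes(int px, int py, int pz) {
    auto index = [&](int x, int y, int z) { return (x * py + y) * pz + z; };
    std::vector<PairCompute> computes;
    for (int x = 0; x < px; ++x)
      for (int y = 0; y < py; ++y)
        for (int z = 0; z < pz; ++z)
          for (int dx = 0; dx <= 1; ++dx)
            for (int dy = (dx ? -1 : 0); dy <= 1; ++dy)
              for (int dz = (dx || dy ? -1 : 0); dz <= 1; ++dz) {
                int nx = x + dx, ny = y + dy, nz = z + dz;
                if (nx < 0 || ny < 0 || nz < 0 || nx >= px || ny >= py || nz >= pz) continue;
                computes.push_back({index(x, y, z), index(nx, ny, nz)});
              }
    return computes;
}
```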
Scaling Challenges
- Scaling a few-thousand-atom simulation to tens of thousands of processors
  - Interaction of adaptive runtime techniques
  - Optimizing the PME implementation
- Running multi-million-atom simulations on machines with limited memory
  - Memory optimizations
Conflicting Adaptive Runtime Techniques
- Patches multicast data to computes
- At the load balancing step, computes are re-assigned to processors
- The spanning tree must be re-built after computes have migrated
Solution
- Persistent spanning trees
- Centralized spanning tree creation
- Unifying the two techniques
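One loose illustration of the persistent-tree idea (purely hypothetical structures, not the actual NAMD/Charm++ solution): keep the processor-level multicast tree across load balancing steps and only update a migrated compute's leaf entry, rebuilding the tree only when a compute lands on a processor not already in it.

```cpp
// Hedged sketch of a persistent multicast spanning tree (illustrative only).
#include <map>
#include <vector>

struct SpanningTree {
    std::vector<int> procs;              // processors reached by this patch's multicast
    std::map<int, int> computeToProc;    // compute id -> processor currently holding it

    // A compute migrated: keep the processor-level tree and move only the leaf,
    // as long as the destination processor is already part of the tree.
    bool migrate(int computeId, int newProc) {
        for (int p : procs)
            if (p == newProc) { computeToProc[computeId] = newProc; return true; }
        return false;                    // otherwise the tree must be extended or rebuilt
    }
};
```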
PME Calculation
- The Particle Mesh Ewald (PME) method is used for long-range interactions
- 1D decomposition of the FFT grid
  - PME is a small portion of the total computation
  - Better than a 2D decomposition for small numbers of processors
- On larger partitions
  - Use a 2D decomposition
  - More parallelism and better overlap
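A back-of-the-envelope way to see the trade-off (a hedged sketch with an illustrative grid size, not NAMD's PME code): slab (1D) decomposition assigns whole planes, so at most gridX processors can work on the FFT, while pencil (2D) decomposition assigns lines, allowing up to gridX * gridY processors at the cost of extra transposes.

```cpp
// Available parallelism for PME FFT decompositions (illustrative arithmetic only).
#include <cstdio>

int maxPmeProcs1D(int gridX)            { return gridX; }            // one slab per processor
int maxPmeProcs2D(int gridX, int gridY) { return gridX * gridY; }    // one pencil per processor

int main() {
    const int gx = 108, gy = 108;   // illustrative FFT grid dimensions
    std::printf("1D (slab)   decomposition: up to %d PME processors\n", maxPmeProcs1D(gx));
    std::printf("2D (pencil) decomposition: up to %d PME processors\n", maxPmeProcs2D(gx, gy));
    return 0;
}
```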
Automatic Runtime Decisions
- Use of the 1D or 2D algorithm for PME
- Use of spanning trees for multicast
- Splitting of patches for fine-grained parallelism
These depend on:
- Characteristics of the machine
- Number of processors
- Number of atoms in the simulation
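A hedged sketch of how such decisions might be encoded (the thresholds and names below are purely illustrative, not NAMD's actual heuristics): the choices are functions of processor count, atom count, and machine characteristics known at startup.

```cpp
// Illustrative (not NAMD's actual) runtime decision logic.
struct MachineInfo { bool hasFastCollectives; };

struct RuntimeChoices {
    bool use2DPME;           // 2D pencil decomposition vs 1D slabs
    bool useSpanningTrees;   // spanning-tree multicast from patches to computes
    bool splitPatches;       // finer-grained patch splitting
};

RuntimeChoices decide(int numProcs, long numAtoms, const MachineInfo& m) {
    RuntimeChoices c;
    c.use2DPME         = numProcs > 1024;                 // more PME parallelism at scale
    c.useSpanningTrees = numProcs > 512 && m.hasFastCollectives;
    c.splitPatches     = numAtoms / numProcs < 200;       // few atoms per processor
    return c;
}
```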
Reducing the memory footprint
- Exploit the fact that the building blocks of a bio-molecule have common structures
- Store information about a particular kind of atom only once
[Figure: two water molecules (O, H, H) with global atom indices 14332-14334 and 14495-14497 sharing one description that uses relative indices -1, 0, +1]
Reducing the memory footprint (continued)
- Static atom information increases only with the addition of unique proteins in the simulation
- Allows simulation of the 2.8 million-atom Ribosome on Blue Gene/L
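A hedged sketch of the compression idea (illustrative structures, not NAMD's actual ones): static per-atom-kind information lives once in a shared signature table, and each atom instance keeps only an index into it, so static memory grows with the number of unique building blocks rather than with the atom count.

```cpp
// Hedged sketch of static-information compression (illustrative only).
#include <cstdint>
#include <string>
#include <vector>

struct AtomSignature {          // static data shared by all atoms of this kind
    std::string type;           // e.g. "O" or "H" in a water molecule
    double mass, charge;
    // bonded-term templates use *relative* atom indices (e.g. -1, 0, +1),
    // so one signature serves every copy of the building block
};

struct AtomInstance {           // per-atom data that must stay per-atom
    double x, y, z;             // position
    std::uint32_t signature;    // index into the shared signature table
};

struct CompressedMolecule {
    std::vector<AtomSignature> signatures;  // grows only with unique building blocks
    std::vector<AtomInstance>  atoms;       // grows with system size
};
```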
NAMD on Blue Gene/L
- 1 million-atom simulation on 64K processors (LLNL BG/L)
NAMD on Cray XT3/XT4
- 5570-atom simulation on 512 processors at 1.2 ms/step
Comparison with Blue Matter
- Blue Matter was developed specifically for Blue Gene/L
- Time for ApoA1 (ms/step)
- NAMD running on 4K cores of XT3 is comparable to Blue Matter running on 32K cores of BG/L
Comparison with Desmond
- Desmond is a proprietary MD program
- Time (ms/step) for Desmond on 2.4 GHz Opterons and NAMD on 2.6 GHz Xeons
- Desmond uses single precision and exploits SSE instructions
- Desmond uses low-level InfiniBand primitives tuned for MD
NAMD on Blue Gene/P
Future Work
- Optimizing PME computation
  - Use of one-sided puts between FFTs
- Reducing communication and other overheads with increasingly fine-grained parallelism
- Running NAMD on Blue Waters
  - Improved distributed load balancers
  - Parallel input/output
Summary
- NAMD is a highly scalable and portable MD program
  - Runs on a variety of architectures
  - Available free of cost on machines at most supercomputing centers
  - Supports a range of molecular system sizes
- Uses adaptive runtime techniques for high scalability
  - Automatic selection at runtime of the algorithms best suited to the scenario
- With the new optimizations, NAMD is ready for the next generation of parallel machines
Questions?
Speaker notes:
- Here on behalf of the NAMD developers
- STMV: 1 million atoms
- Solutions might be applicable in other scenarios as well
- Motivate why it is challenging to parallelize: generally a few thousand atoms, and one time step is only a few seconds of computation, yet this must be parallelized onto a large number of processors
- The number of time steps is large (millions to billions), since interesting phenomena happen on the scale of ns or ms; 1 fs = 10^-15 s
- Patch dimension = Rc + margin
- Computes are Charm++ objects which can be load balanced and allow adaptive overlap of computation and communication
- Bonded forces; electrostatic and van der Waals forces (these decide the cutoff Rc and margin)
- Short range: computes; long range: PME
- One kind of compute can also have different amounts of work depending on the number of atoms
- Further splitting of patches: 2-AwayX
- Load balancer: balances computes, minimizes communication
- Patches multicast data to computes; intermediate nodes have heavy load; move just a few computes around
- 2D PME has more communication, but it can be overlapped; 1D PME has less latency
- 1D FFT: a single transpose and one computation phase; 2D: 3 transposes and 2 phases of computation
- PME is a small portion of the computation; this has been done before
- Need to decide at runtime which decomposition to use
- X-axis is the number of processors; Y-axis is the number of time steps simulated per second
- Scales well on Abe, Ranger, and Blue Gene/P
- Desmond uses low-level InfiniBand primitives tuned for MD