Scaling Challenges in NAMD: Past and Future


  • Scaling Challenges in NAMD: Past and Future
    NAMD Team: Abhinav Bhatele, Eric Bohm, Sameer Kumar, David Kunzman, Chao Mei, Chee Wai Lee, Kumaresh P., James Phillips, Gengbin Zheng, Laxmikant Kale, Klaus Schulten

  • Outline
    NAMD: An Introduction
    Past Scaling Challenges
      Conflicting Adaptive Runtime Techniques
      PME Computation
      Memory Requirements
    Performance Results
      Comparison with other MD codes
    Future Challenges
      Load Balancing
      Parallel I/O
      Fine-grained Parallelization

  • What is NAMD?
    A parallel molecular dynamics application: it simulates the life of a bio-molecule.
    How is the simulation performed? (a minimal sketch of this loop follows)
      The simulation window is broken down into a large number of time steps (typically 1 fs each).
      Forces on every atom are calculated every time step.
      Velocities and positions are updated and atoms are migrated to their new positions.
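The per-time-step structure described above can be sketched as follows. This is a minimal, hypothetical C++ sketch; the names (Atom, computeForces, runSimulation) are illustrative and are not NAMD code.

```cpp
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };
struct Atom { Vec3 pos, vel, force; double mass = 1.0; };

// Placeholder: in NAMD the force work is split into bonded, non-bonded and PME parts.
void computeForces(std::vector<Atom>& atoms) {
    for (Atom& a : atoms) a.force = Vec3{};   // real force evaluation omitted
}

void runSimulation(std::vector<Atom>& atoms, int numSteps, double dt /* ~1 fs */) {
    for (int step = 0; step < numSteps; ++step) {
        computeForces(atoms);                 // forces on every atom, every step
        for (Atom& a : atoms) {               // update velocities, then positions
            a.vel.x += dt * a.force.x / a.mass;
            a.vel.y += dt * a.force.y / a.mass;
            a.vel.z += dt * a.force.z / a.mass;
            a.pos.x += dt * a.vel.x;
            a.pos.y += dt * a.vel.y;
            a.pos.z += dt * a.vel.z;
        }
        // atoms that have drifted out of their spatial region would migrate here
    }
}
```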

  • How is NAMD parallelized?
    Hybrid decomposition (sketched after the diagram below)

    [Diagram: hybrid decomposition into patches (spatial domains) and computes (force objects); data flow: patch integration → multicast → bonded computes, non-bonded computes, and PME (point-to-point) → reductions → patch integration]
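As a rough illustration of the decomposition: the simulation box is divided into patches whose edge length is about the cutoff plus a margin, and one compute object is created per pair of neighboring patches. The types and names below are assumptions for the sketch, not NAMD data structures, and periodic boundary wraparound is ignored.

```cpp
#include <array>
#include <cmath>
#include <cstdlib>
#include <vector>

struct Patch   { std::array<int, 3> cell; };    // one spatial region of the box
struct Compute { int patchA, patchB; };         // non-bonded work for a patch pair

// Divide the simulation box into patches whose edge is roughly cutoff + margin.
std::vector<Patch> makePatches(const std::array<double, 3>& box,
                               double cutoff, double margin) {
    const double dim = cutoff + margin;
    std::array<int, 3> n;
    for (int d = 0; d < 3; ++d)
        n[d] = std::max(1, static_cast<int>(std::floor(box[d] / dim)));
    std::vector<Patch> patches;
    for (int i = 0; i < n[0]; ++i)
        for (int j = 0; j < n[1]; ++j)
            for (int k = 0; k < n[2]; ++k)
                patches.push_back({{i, j, k}});
    return patches;
}

// One compute per pair of patches that are spatial neighbors (including self pairs).
std::vector<Compute> makeComputes(const std::vector<Patch>& patches) {
    std::vector<Compute> computes;
    for (std::size_t a = 0; a < patches.size(); ++a)
        for (std::size_t b = a; b < patches.size(); ++b) {
            bool neighbors = true;
            for (int d = 0; d < 3 && neighbors; ++d)
                if (std::abs(patches[a].cell[d] - patches[b].cell[d]) > 1)
                    neighbors = false;
            if (neighbors)
                computes.push_back({static_cast<int>(a), static_cast<int>(b)});
        }
    return computes;
}
```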

  • What makes NAMD efficient?
    Charm++ runtime support
    Asynchronous message-driven model
    Adaptive overlap of communication and computation (a toy sketch follows the diagram below)

  • [Timeline diagram: non-bonded work, bonded work, integration, PME, and communication overlapped on each processor; data flow: patch integration → multicast → bonded/non-bonded computes and PME (point-to-point) → reductions → patch integration]
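A toy sketch of the message-driven idea follows. This is not the Charm++ API; the Scheduler and Message types are invented for illustration. Work arrives as messages, and the runtime executes whichever object has data ready, so one object's communication latency is hidden behind another object's computation.

```cpp
#include <functional>
#include <queue>
#include <utility>

struct Message {
    int targetObject;              // e.g., a compute waiting for patch data
    std::function<void()> work;    // what to do when the data arrives
};

class Scheduler {
public:
    void post(Message m) { ready_.push(std::move(m)); }
    void run() {
        // Whatever is ready runs next; nothing blocks waiting for one specific message.
        while (!ready_.empty()) {
            Message m = std::move(ready_.front());
            ready_.pop();
            m.work();
        }
    }
private:
    std::queue<Message> ready_;
};
```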

  • What makes NAMD efficient?
    Charm++ runtime support
    Asynchronous message-driven model
    Adaptive overlap of communication and computation
    Load balancing support
      Difficult problem: balancing heterogeneous computation
      Measurement-based load balancing (sketched below)
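A hedged sketch of the measurement-based idea, under the assumption of a simple greedy strategy: measured per-compute times are rebalanced by placing the heaviest computes on the currently least-loaded processors. NAMD's Charm++ balancers also minimize communication and move only a few computes; that refinement is omitted, and all names below are illustrative.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct ComputeLoad { int computeId; double measuredTime; };   // from instrumentation

// Greedily place the heaviest measured computes on the least-loaded processors.
// Returns (computeId, processor) pairs.
std::vector<std::pair<int, int>> greedyAssign(std::vector<ComputeLoad> computes,
                                              int numProcs) {
    std::sort(computes.begin(), computes.end(),
              [](const ComputeLoad& a, const ComputeLoad& b) {
                  return a.measuredTime > b.measuredTime;     // heaviest first
              });
    using Proc = std::pair<double, int>;                      // (load so far, proc id)
    std::priority_queue<Proc, std::vector<Proc>, std::greater<Proc>> procs;
    for (int p = 0; p < numProcs; ++p) procs.push({0.0, p});

    std::vector<std::pair<int, int>> assignment;
    for (const ComputeLoad& c : computes) {
        auto [load, p] = procs.top();                         // least-loaded processor
        procs.pop();
        assignment.push_back({c.computeId, p});
        procs.push({load + c.measuredTime, p});
    }
    return assignment;
}
```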

  • What makes NAMD highly scalable?
    Hybrid decomposition scheme
    Variants of this hybrid scheme are used by Blue Matter and Desmond
    [Diagram: patch and compute objects]

  • Scaling Challenges
    Scaling few-thousand-atom simulations to tens of thousands of processors
      Interaction of adaptive runtime techniques
      Optimizing the PME implementation
    Running multi-million-atom simulations on machines with limited memory
      Memory optimizations

  • Conflicting Adaptive Runtime Techniques
    Patches multicast data to computes
    At each load balancing step, computes are re-assigned to processors
    The multicast tree is re-built after computes have migrated

  • Solution
    Persistent spanning trees
    Centralized spanning tree creation
    Unifying the two techniques (see the sketch below)
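An illustrative sketch of a persistent spanning tree for one patch's multicast; this is not NAMD/Charm++ code, and the k-ary layout and names are assumptions. The tree over subscriber processors is built once, centrally, and reused every step until the next load-balancing phase changes the compute placement.

```cpp
#include <cstddef>
#include <vector>

struct SpanningTree {
    int root;                                  // processor that owns the patch
    std::vector<int> nodes;                    // processors in level order; nodes[0] == root
    std::vector<std::vector<int>> children;    // children[i] = indices (into nodes) of node i's children
};

// Centrally build a k-ary spanning tree over the processors that host
// computes subscribed to this patch's multicast.
SpanningTree buildKaryTree(const std::vector<int>& subscriberProcs,
                           int rootProc, int branchingFactor) {
    SpanningTree tree;
    tree.root = rootProc;
    tree.nodes.push_back(rootProc);
    for (int p : subscriberProcs)
        if (p != rootProc) tree.nodes.push_back(p);
    tree.children.resize(tree.nodes.size());
    for (std::size_t i = 1; i < tree.nodes.size(); ++i)
        tree.children[(i - 1) / branchingFactor].push_back(static_cast<int>(i));
    return tree;
}
```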

  • PME Calculation
    The Particle Mesh Ewald (PME) method is used for long-range interactions
    1D decomposition of the FFT grid
      PME is a small portion of the total computation
      Better than the 2D decomposition for small numbers of processors
    On larger partitions
      Use a 2D decomposition
      More parallelism and better overlap (see the sketch below)
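A back-of-the-envelope illustration of the parallelism argument (the grid size in the example is made up, not a NAMD benchmark value): a 1D (slab) decomposition of an Nx × Ny × Nz FFT grid can keep at most Nz processors busy, while a 2D (pencil) decomposition can use up to Ny × Nz.

```cpp
// Maximum useful processor counts for the two FFT-grid decompositions.
struct PmeParallelism {
    long maxProcs1D;   // one slab (plane) per processor
    long maxProcs2D;   // one pencil (line) per processor
};

PmeParallelism pmeParallelism(long /*nx stays local in both cases*/, long ny, long nz) {
    return { nz, ny * nz };
}

// Example: a 128 x 128 x 128 grid admits at most 128 processors with the 1D
// decomposition, but up to 16384 with the 2D decomposition.
```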

  • Automatic Runtime Decisions
    Use of the 1D or 2D algorithm for PME
    Use of spanning trees for multicast
    Splitting of patches for fine-grained parallelism
    These depend on (a hypothetical selection function is sketched below):
      Characteristics of the machine
      Number of processors
      Number of atoms in the simulation
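Purely to illustrate the shape of such a decision, a hypothetical selection function follows; the thresholds are invented and do not reflect NAMD's actual heuristics. The point is only that the choices are a function of the machine, the processor count, and the system size.

```cpp
struct RunConfig {
    bool use2dPme;          // 2D pencil decomposition instead of 1D slabs
    bool useSpanningTrees;  // spanning trees for patch-to-compute multicasts
    bool splitPatches;      // finer-grained patches (e.g., 2-AwayX)
};

// Hypothetical: pick algorithms from the processor count, the system size,
// and a coarse machine characteristic. Thresholds are illustrative only.
RunConfig chooseConfig(int numProcs, long numAtoms, bool lowBandwidthNetwork) {
    RunConfig c;
    c.use2dPme         = numProcs > 1024;
    c.useSpanningTrees = numProcs > 512 || lowBandwidthNetwork;
    c.splitPatches     = numAtoms / numProcs < 1000;   // few atoms per processor
    return c;
}
```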

  • Reducing the memory footprint
    Exploit the fact that the building blocks of a bio-molecule have common structures
    Store information about a particular kind of atom only once

  • [Diagram: two water molecules (O, H, H) with distinct global atom indices (14332–14334 and 14495–14497) sharing one stored structure, referenced through relative indices −1, 0, +1]

  • Reducing the memory footprint
    Exploit the fact that the building blocks of a bio-molecule have common structures
    Store information about a particular kind of atom only once (see the sketch below)
    Static atom information then grows only with the addition of unique proteins to the simulation
    Allows simulation of the 2.8 M-atom Ribosome on Blue Gene/L
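A minimal sketch of the sharing idea, assuming a signature table indexed by each atom; the type and field names are illustrative, not NAMD's structures. Per-atom storage keeps only dynamic state plus a small index, while static properties and bonded-term templates with relative (not absolute) atom indices live in a table that grows only with the number of unique building blocks.

```cpp
#include <cstdint>
#include <vector>

struct AtomSignature {             // stored once per unique kind of atom
    double mass;
    double charge;
    int    vdwType;
    // bonded-term templates would also live here, using relative atom indices
};

struct AtomInstance {              // stored per atom: grows with system size
    std::int32_t signatureIndex;   // index into the shared signature table
    double position[3];
    double velocity[3];
};

struct CompressedMolecule {
    std::vector<AtomSignature> signatures;  // grows only with unique building blocks
    std::vector<AtomInstance>  atoms;       // grows with the number of atoms
};
```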

  • [Chart: memory footprint, annotated "< 0.5 MB"]

  • NAMD on Blue Gene/L
    1-million-atom simulation on 64K processors (LLNL BG/L)

  • NAMD on Cray XT3/XT4
    5570-atom simulation on 512 processors at 1.2 ms/step

  • Comparison with Blue Matter
    Blue Matter was developed specifically for Blue Gene/L
    Time for ApoA1 (ms/step)
    NAMD running on 4K cores of XT3 is comparable to Blue Matter running on 32K cores of BG/L

  • Comparison with Desmond
    Desmond is a proprietary MD program
    Time (ms/step) for Desmond on 2.4 GHz Opterons and NAMD on 2.6 GHz Xeons
    Desmond uses single precision and exploits SSE instructions
    Desmond uses low-level InfiniBand primitives tuned for MD

  • NAMD on Blue Gene/P

  • Future Work
    Optimizing PME computation
      Use of one-sided puts between FFTs
    Reducing communication and other overheads with increasingly fine-grained parallelism
    Running NAMD on Blue Waters
      Improved distributed load balancers
      Parallel input/output

  • Summary
    NAMD is a highly scalable and portable MD program
      Runs on a variety of architectures
      Available free of cost on machines at most supercomputing centers
      Supports a range of sizes of molecular systems
    Uses adaptive runtime techniques for high scalability
      Automatic selection at runtime of the algorithms best suited to the scenario
    With new optimizations, NAMD is ready for the next generation of parallel machines

  • Questions?

  • Speaker notes
    Here on behalf of the NAMD developers.
    STMV: 1 million atoms.
    Solutions might be applicable in other scenarios as well.
    Motivate why it is challenging to parallelize: generally a few thousand atoms, and one time step is a few seconds; this needs to be parallelized onto a large number of processors.
    The number of time steps is large (millions to billions) since interesting phenomena happen on the scale of ns or ms; 1 fs = 10^-15 s.
    Patch dimension = Rc + margin.
    Computes are Charm++ objects which can be load balanced and give adaptive overlap of computation and communication.
    Bonded forces; electrostatic and van der Waals forces (these decide the cutoff Rc and margin). Short range: computes; long range: PME.
    One kind of compute can also have different amounts of work depending on the number of atoms.
    Further splitting of patches: 2-AwayX.
    Load balancer: balances computes, minimizes communication.
    Patches multicast data to computes; intermediate nodes have heavy load; move just a few computes around.
    2D PME has more communication, but it can be overlapped; 1D PME has less latency.
    1D FFT: a single transpose and computation phase; 2D: three transposes and two phases of computation.
    PME is a small portion of the computation; this has been done before. Need to decide at runtime which decomposition to use.
    On the performance plots, the x-axis is the number of processors and the y-axis is the number of time steps simulated per second.
    NAMD scales well on Abe, Ranger, and Blue Gene/P.
    Desmond uses low-level InfiniBand primitives tuned to MD.