NAMD: Biomolecular Simulation on Thousands of Processors

  • NAMD: Biomolecular Simulation on Thousands of Processors
    James C. Phillips, Gengbin Zheng, Sameer Kumar, Laxmikant Kale
    http://charm.cs.uiuc.edu
    Parallel Programming Laboratory, Dept. of Computer Science, and Theoretical Biophysics Group, Beckman Institute, University of Illinois at Urbana-Champaign

  • Acknowledgements
    Funding agencies: NIH, NSF, DOE (ASCI center)
    Students and staff:
    Parallel Programming Laboratory: Orion Lawlor, Milind Bhandarkar, Ramkumar Vadali, Robert Brunner
    Theoretical Biophysics: Klaus Schulten, Bob Skeel
    Coworkers:
    PSC: Ralph Roskies, Rich Raymond, Sergiu Sanielivici, Chad Vizino, Ken Hackworth
    NCSA: David O'Neal

  • Outline
    The challenge of MD
    Charm++: virtualization and load balancing; principle of persistence; measurement-based load balancing
    NAMD parallelization: virtual processors
    Optimizations and ideas:
      Better load balancing: explicitly model communication cost; refinement (cycle description)
      Consistency of speedup over timesteps; problem: communication/OS jitter; asynchronous reductions; dynamic substep balancing to handle jitter
      PME parallelization: PME description, 3D FFT, FFTW and modifications, VP picture, multiple timestepping, overlap, transpose optimization
    Performance data: speedup table; components (angle, non-bonded, PME, integration); communication overhead

  • NAMD: A Production MD Program
    Fully featured program
    NIH-funded development
    Distributed free of charge (~5,000 downloads so far), as binaries and source code
    Installed at NSF centers
    User training and support
    Large published simulations (e.g., the aquaporin simulation featured in the keynote)

  • Aquaporin Simulation
    NAMD, CHARMM27, PME; NpT ensemble at 310 or 298 K; 1 ns equilibration, 4 ns production
    Protein: ~15,000 atoms; lipids (POPE): ~40,000 atoms; water: ~51,000 atoms; total: ~106,000 atoms
    3.5 days/ns on 128 O2000 CPUs
    11 days/ns on 32 Linux CPUs
    0.35 days/ns on 512 LeMieux CPUs
    F. Zhu, E.T., K. Schulten, FEBS Lett. 504, 212 (2001); M. Jensen, E.T., K. Schulten, Structure 9, 1083 (2001)

  • Molecular Dynamics in NAMD
    Collection of [charged] atoms, with bonds; Newtonian mechanics
    Thousands of atoms (10,000 - 500,000)
    At each timestep:
      Calculate forces on each atom: bonds; non-bonded (electrostatic and van der Waals), short-range every timestep, long-range using PME (3D FFT)
      Multiple timestepping: PME every 4 timesteps
      Calculate velocities and advance positions
    Challenge: femtosecond timestep, millions of steps needed!
    Collaboration with K. Schulten, R. Skeel, and coworkers
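
    To make the per-timestep structure concrete, here is a minimal, illustrative C++ sketch of a timestep loop with multiple timestepping (the force routines are empty placeholders, and every name and value here is an assumption of this write-up, not NAMD code):

```cpp
#include <cstdio>
#include <vector>

struct Vec3 { double x, y, z; };
struct Atom { Vec3 pos, vel, force; double mass; };

// Placeholders standing in for the real force kernels (assumptions, not NAMD's).
void addBondedAndShortRangeForces(std::vector<Atom>& atoms) { /* every timestep */ }
void addLongRangePMEForces(std::vector<Atom>& atoms)        { /* every 4th timestep */ }

int main() {
  const double dt = 1.0;                 // one femtosecond timestep (arbitrary units here)
  const int nSteps = 20, pmeEvery = 4;   // PME only every 4 steps (multiple timestepping)
  std::vector<Atom> atoms(3, Atom{{0, 0, 0}, {0, 0, 0}, {0, 0, 0}, 12.0});

  for (int step = 0; step < nSteps; ++step) {
    for (Atom& a : atoms) a.force = {0, 0, 0};
    addBondedAndShortRangeForces(atoms);            // bonds + short-range non-bonded
    if (step % pmeEvery == 0)
      addLongRangePMEForces(atoms);                 // long-range electrostatics via PME

    // Calculate velocities and advance positions (simple explicit update).
    for (Atom& a : atoms) {
      a.vel.x += dt * a.force.x / a.mass;
      a.vel.y += dt * a.force.y / a.mass;
      a.vel.z += dt * a.force.z / a.mass;
      a.pos.x += dt * a.vel.x;
      a.pos.y += dt * a.vel.y;
      a.pos.z += dt * a.vel.z;
    }
  }
  std::printf("ran %d steps (PME every %d steps)\n", nSteps, pmeEvery);
  return 0;
}
```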

  • Sizes of Simulations Over Time
    BPTI: 3K atoms
    Estrogen receptor: 36K atoms (1996)
    ATP synthase: 327K atoms (2001)

  • Parallel MD: Easy or Hard?
    Easy: tiny working data, spatial locality, uniform atom density, persistent repetition, multiple timestepping
    Hard: sequential timesteps, short iteration time, full electrostatics, fixed problem size, dynamic variations, multiple timestepping!

  • Other MD Programs for Biomolecules
    CHARMM, Amber, GROMACS, NWChem, LAMMPS

  • Traditional Approaches: Not Isoefficient
    Replicated data: all atom coordinates stored on each processor; communication/computation ratio: O(P log P)
    Atom decomposition: partition the atom array across processors; nearby atoms may not be on the same processor; C/C ratio: O(P)
    Force decomposition: distribute the force matrix to processors; the matrix is sparse and non-uniform; C/C ratio: O(sqrt(P))
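
    As a rough back-of-the-envelope sketch of why the replicated-data ratio grows with P while spatial decomposition stays constant (the notation N for atoms and P for processors is an assumption of this write-up, not from the slide):

```latex
% Replicated data: each processor computes only O(N/P) of the work, but every
% step it must combine coordinate/force data for all N atoms across all P
% processors, e.g. an O(N log P) reduction/broadcast:
\frac{T_{\text{comm}}}{T_{\text{comp}}}\bigg|_{\text{replicated}}
   \sim \frac{N \log P}{\,N/P\,} = P \log P

% Spatial decomposition: each cell exchanges data with a constant number of
% neighboring cells regardless of P, so
\frac{T_{\text{comm}}}{T_{\text{comp}}}\bigg|_{\text{spatial}} \sim O(1)
```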

  • Spatial Decomposition
    Atoms distributed to cubes based on their location
    Size of each cube: just a bit larger than the cut-off radius
    Communicate only with neighbors; work: for each pair of neighboring objects (cubes)
    C/C ratio: O(1)
    However: load imbalance, limited parallelism
    Cells, cubes, or "patches"; Charm++ is useful to handle this
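
    A minimal sketch (illustrative C++, not NAMD code) of the cube assignment: the box is cut into cells just larger than the cutoff, and an atom's cell index follows directly from its coordinates, so all cutoff-range partners lie in the same or a neighboring cell. The box length, cutoff, and coordinates below are made-up values:

```cpp
#include <array>
#include <cmath>
#include <cstdio>

int main() {
  const double boxLength = 100.0;   // assumed cubic box (Angstroms)
  const double cutoff    = 12.0;    // assumed non-bonded cutoff

  // Choose the cell count so each cell is a bit larger than the cutoff.
  const int cellsPerDim = static_cast<int>(std::floor(boxLength / cutoff));
  const double cellSize = boxLength / cellsPerDim;

  // Map a coordinate in [0, boxLength) to its cell index.
  auto cellOf = [&](const std::array<double, 3>& r) {
    std::array<int, 3> c;
    for (int d = 0; d < 3; ++d)
      c[d] = static_cast<int>(r[d] / cellSize) % cellsPerDim;
    return c;
  };

  std::array<double, 3> atom = {37.2, 5.8, 91.4};
  auto c = cellOf(atom);
  std::printf("%d cells per dimension, cell size %.2f; atom -> cell (%d, %d, %d)\n",
              cellsPerDim, cellSize, c[0], c[1], c[2]);
  return 0;
}
```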

  • Virtualization: Object-Based Parallelization
    User view vs. system implementation: the user is concerned only with the interaction between objects; the runtime system maps the many virtual objects onto physical processors

  • Data-Driven Execution
    [Diagram: each processor has its own scheduler and message queue; the scheduler picks the next message and invokes the method of the object it is addressed to.]
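
    A minimal plain-C++ sketch (not Charm++ itself) of the data-driven scheduler loop in the diagram: messages wait in a queue, and whichever object the next message addresses gets to run. All names here are assumptions of this write-up:

```cpp
#include <cstdio>
#include <queue>
#include <vector>

struct Message { int targetObject; int value; };

struct Object {
  int id;
  // Handling a message may generate further messages for other objects.
  void receive(int value, std::queue<Message>& q) {
    std::printf("object %d handling value %d\n", id, value);
    if (value > 0) q.push({(id + 1) % 3, value - 1});
  }
};

int main() {
  std::vector<Object> objects = {{0}, {1}, {2}};
  std::queue<Message> messageQ;         // the per-processor message queue
  messageQ.push({0, 3});                // seed with one message

  // The scheduler loop: pick the next message, deliver it, repeat.
  while (!messageQ.empty()) {
    Message m = messageQ.front();
    messageQ.pop();
    objects[m.targetObject].receive(m.value, messageQ);
  }
  return 0;
}
```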

  • Charm++ and Adaptive MPI: Realizations of the Virtualization Approach
    Charm++: parallel C++ with asynchronous methods; in development for over a decade; the basis of several parallel applications; runs on all popular parallel machines and clusters
    AMPI: a migration path for MPI codes that gives them the dynamic load-balancing capabilities of Charm++; minimal modifications to convert existing MPI programs; bindings for C, C++, and Fortran90
    Both available from http://charm.cs.uiuc.edu
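
    For readers who have not seen Charm++, here is a rough, minimal chare-array example in the Charm++ style (written for this write-up; not code from the talk or from NAMD, and details may differ across Charm++ versions). The interface (.ci) file is shown as comments above the C++ source; entry methods are invoked asynchronously through proxies, and the runtime decides where the array elements live:

```cpp
// hello.ci -- Charm++ interface file (processed by charmc):
//   mainmodule hello {
//     readonly CProxy_Main mainProxy;
//     mainchare Main {
//       entry Main(CkArgMsg* m);
//       entry void done();
//     };
//     array [1D] Hello {
//       entry Hello();
//       entry void greet();
//     };
//   };

// hello.C -- implementation
#include "hello.decl.h"

/*readonly*/ CProxy_Main mainProxy;

class Main : public CBase_Main {
  int replies = 0, nElems = 8;
public:
  Main(CkArgMsg* m) {
    delete m;
    mainProxy = thisProxy;
    CProxy_Hello arr = CProxy_Hello::ckNew(nElems);  // create 8 virtual processors
    arr.greet();                                     // asynchronous broadcast
  }
  void done() { if (++replies == nElems) CkExit(); }
};

class Hello : public CBase_Hello {
public:
  Hello() {}
  Hello(CkMigrateMessage* m) {}                      // needed so elements can migrate
  void greet() {
    CkPrintf("element %d running on PE %d\n", thisIndex, CkMyPe());
    mainProxy.done();                                // asynchronous method call
  }
};

#include "hello.def.h"
```

    Building with charmc generates hello.decl.h and hello.def.h from the interface file, and the same source runs unchanged on different numbers of physical processors, which is the point of virtualization.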

  • Benefits of Virtualization
    Software engineering: the number of virtual processors can be controlled independently; separate VPs for modules
    Message-driven execution: adaptive overlap; modularity; predictability (automatic out-of-core)
    Dynamic mapping: heterogeneous clusters (vacate, adjust to speed, share); automatic checkpointing; change the set of processors
    Principle of persistence enables runtime optimizations: automatic dynamic load balancing, communication optimizations, other runtime optimizations
    More info: http://charm.cs.uiuc.edu

  • Measurement-Based Load Balancing
    Principle of persistence: object communication patterns and computational loads tend to persist over time, in spite of dynamic behavior (abrupt but infrequent changes; slow and small changes)
    Runtime instrumentation measures communication volume and computation time
    Measurement-based load balancers use the instrumented database periodically to make new decisions
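
    A minimal sketch of the measurement-based idea (a generic greedy rebalancer written for this write-up, not the actual Charm++ load-balancing strategies): each object carries the compute time measured over recent steps, and objects are placed heaviest-first onto the currently least-loaded processor:

```cpp
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Obj { int id; double measuredLoad; };  // load measured by runtime instrumentation

int main() {
  std::vector<Obj> objs = {{0, 4.0}, {1, 3.5}, {2, 2.0}, {3, 1.5}, {4, 1.0}, {5, 0.5}};
  const int nProcs = 3;

  // Heaviest objects first, then place each on the currently least-loaded processor.
  std::sort(objs.begin(), objs.end(),
            [](const Obj& a, const Obj& b) { return a.measuredLoad > b.measuredLoad; });

  using Entry = std::pair<double, int>;  // (accumulated load, processor)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  for (int p = 0; p < nProcs; ++p) heap.push({0.0, p});

  for (const Obj& o : objs) {
    auto [load, proc] = heap.top();
    heap.pop();
    std::printf("object %d (load %.1f) -> processor %d\n", o.id, o.measuredLoad, proc);
    heap.push({load + o.measuredLoad, proc});
  }
  return 0;
}
```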

  • Spatial Decomposition via Charm++
    The same cube/patch decomposition as above, with the cells (patches) implemented as Charm++ objects: neighbor-only communication and an O(1) C/C ratio, while the Charm++ runtime helps handle the resulting load imbalance and limited parallelism

  • Object-Based Parallelization for MD: Force Decomposition + Spatial Decomposition
    Now we have many objects to load balance: each "diamond" (pairwise force-computation object) can be assigned to any processor
    Number of diamonds (3D): 14 × number of patches
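
    A small illustrative program (not NAMD code; the grid size and periodic wrap are assumptions of this write-up) that enumerates these objects: one self-interaction per patch plus one object for each of the 13 "upstream" neighbors, giving 14 per patch with every neighboring pair counted exactly once:

```cpp
#include <cstdio>
#include <tuple>
#include <vector>

int main() {
  const int nx = 4, ny = 4, nz = 4;              // assumed patch grid
  std::vector<std::tuple<int, int, int, int, int, int>> computes;

  for (int x = 0; x < nx; ++x)
    for (int y = 0; y < ny; ++y)
      for (int z = 0; z < nz; ++z) {
        computes.emplace_back(x, y, z, x, y, z);  // self interaction
        for (int dx = -1; dx <= 1; ++dx)
          for (int dy = -1; dy <= 1; ++dy)
            for (int dz = -1; dz <= 1; ++dz) {
              if (dx == 0 && dy == 0 && dz == 0) continue;
              // Keep only the 13 lexicographically positive offsets so each
              // pair of neighboring patches appears once.
              if (dx < 0 || (dx == 0 && (dy < 0 || (dy == 0 && dz < 0)))) continue;
              int x2 = (x + dx + nx) % nx;        // periodic wrap (assumption)
              int y2 = (y + dy + ny) % ny;
              int z2 = (z + dz + nz) % nz;
              computes.emplace_back(x, y, z, x2, y2, z2);
            }
      }

  const int nPatches = nx * ny * nz;
  std::printf("%d patches, %lu compute objects (%.1f per patch)\n",
              nPatches, (unsigned long)computes.size(),
              (double)computes.size() / nPatches);
  return 0;
}
```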

  • Bond Forces
    Multiple types of forces: bonds (2 atoms), angles (3), dihedrals (4), ...
    Luckily, each involves atoms in neighboring patches only
    Straightforward implementation: send a message to all neighbors and receive forces from them: 26 × 2 messages per patch!
    Instead, we do: send to the 7 upstream neighbors; each force is calculated at exactly one patch
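
    A minimal sketch (illustrative C++, not NAMD code, and ignoring periodic boundaries) of how each bonded term can be assigned to exactly one patch: because the atoms of a bond, angle, or dihedral are close together, their patches fall within a 2x2x2 block of neighbors, and the term can be computed on the minimum-corner patch of that block, which then only needs position data from 7 neighboring patches (offsets in {0,1}^3 other than the origin):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct PatchId { int x, y, z; };

// Home patch = component-wise minimum of the patches holding the tuple's atoms
// (a simplification chosen for this sketch).
PatchId homePatch(const std::vector<PatchId>& atomPatches) {
  PatchId h = atomPatches.front();
  for (const PatchId& p : atomPatches) {
    h.x = std::min(h.x, p.x);
    h.y = std::min(h.y, p.y);
    h.z = std::min(h.z, p.z);
  }
  return h;
}

int main() {
  // An angle term whose three atoms sit in three neighboring patches.
  std::vector<PatchId> angleAtoms = {{3, 2, 5}, {3, 3, 5}, {4, 3, 5}};
  PatchId h = homePatch(angleAtoms);
  std::printf("angle computed on patch (%d, %d, %d)\n", h.x, h.y, h.z);
  return 0;
}
```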

  • Performance Data: SC2000

    [Chart: "Speedup on ASCI Red" (x-axis: Processors, y-axis: Speedup), with supporting tables of time/step, speedup, efficiency, and GFLOPS per processor count for the ~92,000-atom APO benchmark and the BC1 benchmark on ASCI Red and the Cray T3E; the plotted speedup reaches roughly 1250 on 2048 processors.]