
Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Page 1: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++

NIH Resource for Macromolecular Modeling and Bioinformatics, http://www.ks.uiuc.edu/

Beckman Institute, UIUC

Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++

James Phillips, Beckman Institute, University of Illinois, http://www.ks.uiuc.edu/Research/namd/

Chao Mei, Parallel Programming Lab, University of Illinois, http://charm.cs.illinois.edu/

Page 2: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


UIUC Beckman Institute is a “home away from home” for interdisciplinary researchers

Theoretical and Computational Biophysics Group

Page 3: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Biomolecular simulations are our computational microscope

Ribosome: synthesizes proteins from genetic information, target for antibiotics

Silicon nanopore: bionanodevice for sequencing DNA efficiently

Page 4: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Our goal for NAMD is practical supercomputing for NIH researchers

• 44,000 users can’t all be computer experts.
  – 11,700 have downloaded more than one version.
  – 2,300 citations of NAMD reference papers.
• One program for all platforms.
  – Desktops and laptops: setup and testing
  – Linux clusters: affordable local workhorses
  – Supercomputers: free allocations on TeraGrid
  – Blue Waters: sustained petaflop/s performance
• User knowledge is preserved.
  – No change in input or output files.
  – Run any simulation on any number of cores.
• Available free of charge to all.

Phillips et al., J. Comp. Chem. 26:1781-1802, 2005.

Page 5: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


NAMD uses a hybrid force-spatial parallel decomposition

• Spatially decompose data and communication.
• Separate but related work decomposition.
• “Compute objects” facilitate an iterative, measurement-based load balancing system (see the sketch below).

Kale et al., J. Comp. Phys. 151:283-312, 1999.
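As a rough illustration of this decomposition (a sketch, not NAMD's actual classes), spatial "patches" own the atoms in one region, while migratable "compute" objects each own one patch pair and carry a measurable cost that a load balancer can use. The names Patch and PairCompute below are hypothetical.

```cpp
// Minimal sketch of a force/spatial decomposition, using simplified
// stand-ins (Patch, PairCompute) for NAMD's real objects.
#include <cstdio>
#include <vector>

struct Atom { double x, y, z; };

// A Patch owns the atoms in one cubic region of the simulation box.
struct Patch {
    int id;
    std::vector<Atom> atoms;
};

// A PairCompute evaluates short-range forces between two neighboring
// patches; it is the unit a measurement-based load balancer migrates.
struct PairCompute {
    const Patch* a;
    const Patch* b;
    int assignedPe = -1;        // processor chosen by the load balancer
    long work() const {         // crude cost estimate: atom-pair count
        return (long)a->atoms.size() * (long)b->atoms.size();
    }
};

int main() {
    Patch p0{0, std::vector<Atom>(100)}, p1{1, std::vector<Atom>(120)};
    PairCompute c{&p0, &p1};
    c.assignedPe = 3;           // placement decided from measured load
    std::printf("compute(%d,%d) cost=%ld on PE %d\n",
                c.a->id, c.b->id, c.work(), c.assignedPe);
    return 0;
}
```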

Page 6: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Charm++ overlaps NAMD algorithms

Objects are assigned to processors, queued as data arrives, and executed in priority order.

Phillips et al., SC2002.
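To make "queued as data arrives, and executed in priority order" concrete, here is a generic priority-queue scheduler sketch in plain C++; it only illustrates the idea and is not the Charm++ scheduler.

```cpp
// Generic sketch of priority-ordered, message-driven execution:
// work is enqueued as its input data arrives and run by priority.
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Task {
    int priority;                       // smaller value = more urgent
    std::function<void()> run;
};

struct ByPriority {
    bool operator()(const Task& a, const Task& b) const {
        return a.priority > b.priority; // min-heap on priority
    }
};

int main() {
    std::priority_queue<Task, std::vector<Task>, ByPriority> ready;

    // Work becomes ready as its data "arrives"...
    ready.push({20, [] { std::puts("long-range (PME) work"); }});
    ready.push({10, [] { std::puts("short-range force work"); }});
    ready.push({30, [] { std::puts("trajectory output"); }});

    // ...and is executed strictly in priority order.
    while (!ready.empty()) { ready.top().run(); ready.pop(); }
    return 0;
}
```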

Page 7: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


NAMD adjusts grainsize to match parallelism to processor count

• Tradeoff between parallelism and overhead
• Maximum patch size is based on the cutoff
• Ideally one or more patches per processor
  – To double the patch count, split in the x, y, or z dimension
  – Number of computes grows much faster! (see the sketch below)
• Hard to automate completely
  – Also need to select the number of PME pencils
• Computes partitioned in outer atom loop
  – Old: heuristic based on distance and atom count
  – New: measurement-based compute partitioning
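A back-of-the-envelope sketch of why computes outpace patches: splitting patches shrinks the patch edge below the cutoff, so each patch must interact with more neighbor shells. The counting model below (every patch pair within the cutoff gets one pairwise compute) is an illustrative assumption, not NAMD's exact bookkeeping.

```cpp
// Rough patch/compute counting for a periodic box, assuming every patch
// pair whose regions fall within the cutoff needs a pairwise compute.
#include <cmath>
#include <cstdio>

int main() {
    double box = 256.0, cutoff = 12.0;            // Angstroms (illustrative)
    for (int split = 1; split <= 4; split *= 2) { // 1x, 2x, 4x per dimension
        int perDim = (int)(box / cutoff) * split; // patches per dimension
        double edge = box / perDim;               // patch edge length
        long patches = (long)perDim * perDim * perDim;
        // Neighbor shells needed so that patch pairs cover the cutoff.
        int shells = (int)std::ceil(cutoff / edge);
        long nbrs = (2L * shells + 1) * (2L * shells + 1) * (2L * shells + 1) - 1;
        long computes = patches + patches * nbrs / 2;  // self + pair computes
        std::printf("split=%d  patches=%ld  computes=%ld\n",
                    split, patches, computes);
    }
    return 0;
}
```

With these illustrative numbers, doubling the patches per dimension multiplies the patch count by 8 but the compute count by roughly 36, which is why grainsize selection is hard to automate completely.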

Page 8: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Measurement-based grainsize tuning enables scalable implicit solvent simulation

Before: heuristic (256 cores)

After: measurement-based (512 cores)

Page 9: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


The age of petascale biomolecular simulation is near

Page 10: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Larger machines enable larger simulations

Page 11: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


2002 Gordon Bell Award
PSC Lemieux: 3,000 cores
ATP synthase: 300K atoms

Blue Waters: 300,000 cores, 1.2M threads
Chromatophore: 100M atoms

Target is still 100 atoms per thread

Page 12: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Scale brings other challenges

• Limited memory per core

• Limited memory per node

• Finicky parallel filesystems

• Limited inter-node bandwidth

• Long load balancer runtimes

Which is why we collaborate with PPL!

Page 13: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Challenges in 100M-Atom Biomolecular Simulation

• How to overcome the sequential bottlenecks?
  – Initialization
  – Output of trajectory & restart data
• How to achieve good strong-scaling results?
  – Charm++ runtime

Page 14: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Loading Data into System (1)

• Traditionally done on a single core
  – Molecule size was small
• Result for the 100M-atom system
  – Memory: 40.5 GB!
  – Time: 3,301.9 sec!

Page 15: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Loading Data into System (2)

• Compression scheme
  – Atom “signature” representing the common attributes of an atom (see sketch below)
  – Supports more science simulation parameters
  – However, not enough
• Memory: 12.8 GB!
• Time: 125.5 sec!
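One way to picture the atom-signature compression: atoms with identical static parameters share a single signature record, and each atom stores only its changing data plus a small index. The field names below are hypothetical, not NAMD's actual structures.

```cpp
// Minimal sketch of signature-based compression: static parameters that
// many atoms share are stored once; each atom keeps only an index.
#include <cstdint>
#include <cstdio>
#include <vector>

struct AtomSignature {          // shared, static per-type data
    double mass, charge;
    int    vdwType;
};

struct CompressedAtom {         // stored per atom
    float    x, y, z;           // coordinates (change every step)
    uint32_t sig;               // index into the signature table
};

int main() {
    std::vector<AtomSignature> signatures = {
        {15.999, -0.834, 0},    // e.g. water-oxygen-like parameters
        { 1.008,  0.417, 1},    // e.g. water-hydrogen-like parameters
    };
    std::vector<CompressedAtom> atoms = {
        { 0.00f, 0.00f, 0.0f, 0},
        { 0.96f, 0.00f, 0.0f, 1},
        {-0.24f, 0.93f, 0.0f, 1},
    };
    for (const auto& a : atoms) {
        const AtomSignature& s = signatures[a.sig];
        std::printf("atom at (%.2f,%.2f,%.2f) mass=%.3f charge=%.3f\n",
                    a.x, a.y, a.z, s.mass, s.charge);
    }
    return 0;
}
```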

Page 16: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Loading Data into System (3)

• Parallelizing initialization (see the sketch below)
  – Number of input procs: a parameter chosen by the user or auto-computed at runtime
  – First, each loads 1/N of all atoms
  – Second, atoms are shuffled with neighbor procs for later spatial decomposition
  – Good enough: e.g., with 600 input procs
• Memory: 0.19 GB
• Time: 12.4 sec
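A sketch of the two-phase parallel load under simplifying assumptions: each of N input processors reads a contiguous 1/N block of atom records, then buckets its atoms by spatial position so they can be forwarded toward the processors that own those regions. The helpers blockRange and bucketByPatch are illustrative, not NAMD's API.

```cpp
// Illustrative two-phase parallel input: (1) each input processor loads a
// contiguous 1/N block of atoms, (2) atoms are bucketed by spatial patch
// (1D slabs in x here, for simplicity) so they can be forwarded onward.
#include <algorithm>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

struct Atom { long id; double x, y, z; };

// Phase 1: which atom indices does input processor `rank` load?
std::pair<long, long> blockRange(long numAtoms, int numInputProcs, int rank) {
    long per = (numAtoms + numInputProcs - 1) / numInputProcs;
    long begin = (long)rank * per;
    long end = std::min(numAtoms, begin + per);
    return {begin, end};
}

// Phase 2: bucket loaded atoms by destination patch.
std::map<int, std::vector<Atom>> bucketByPatch(const std::vector<Atom>& atoms,
                                               double boxX, int patchesX) {
    std::map<int, std::vector<Atom>> buckets;
    for (const Atom& a : atoms) {
        int patch = (int)(a.x / boxX * patchesX);
        buckets[patch].push_back(a);
    }
    return buckets;
}

int main() {
    auto [lo, hi] = blockRange(100'000'000, 600, 42);   // 600 input procs
    std::printf("input proc 42 loads atoms [%ld, %ld)\n", lo, hi);

    std::vector<Atom> mine = {{lo, 10.0, 0, 0}, {lo + 1, 200.0, 0, 0}};
    auto buckets = bucketByPatch(mine, 256.0, 8);
    for (auto& [patch, atoms] : buckets)
        std::printf("patch %d gets %zu atoms\n", patch, atoms.size());
    return 0;
}
```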

Page 17: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Output Trajectory & Restart Data (1)

• At least 4.8 GB written to the file system per output step
  – A target of tens of ms/step makes this even more critical
• Parallelizing output (see the sketch below)
  – Each output proc is responsible for a portion of the atoms
• Output to a single file for compatibility
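To keep a single output file while letting each output processor write only its own atom range, each writer can seek to a precomputed per-atom byte offset so writes never overlap. The snippet below is a plain std::fstream stand-in for a parallel writer (in practice each output processor would hold its own handle to the shared file), and the header size and record layout are assumptions.

```cpp
// Sketch of range-based output to one shared file: every output processor
// owns a contiguous atom range and writes at a precomputed byte offset.
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <vector>

constexpr std::size_t kHeaderBytes  = 1024;               // assumed header size
constexpr std::size_t kBytesPerAtom = 3 * sizeof(float);  // x, y, z per atom

void writeRange(std::fstream& f, std::size_t firstAtom,
                const std::vector<float>& xyz) {
    f.seekp((std::streamoff)(kHeaderBytes + firstAtom * kBytesPerAtom));
    f.write(reinterpret_cast<const char*>(xyz.data()),
            (std::streamsize)(xyz.size() * sizeof(float)));
}

int main() {
    // Create/truncate the file, then reopen it for positioned writes.
    { std::ofstream create("frame.bin", std::ios::binary); }
    std::fstream f("frame.bin", std::ios::binary | std::ios::in | std::ios::out);

    // Two "output processors", each responsible for two atoms (x, y, z each).
    writeRange(f, 0, {0.f, 0.f, 0.f, 1.f, 1.f, 1.f});   // atoms [0, 2)
    writeRange(f, 2, {2.f, 2.f, 2.f, 3.f, 3.f, 3.f});   // atoms [2, 4)
    std::puts("wrote non-overlapping ranges into one shared file");
    return 0;
}
```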

Page 18: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Output Issue (1)

Page 19: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Output Issue (2)

• Multiple independent files

• Post-processing into a single file

Page 20: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Initial Strong Scaling on Jaguar

[Scaling plot at 6,720; 53,760; 107,520; and 224,076 cores]

Page 21: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Multi-threading MPI-based Charm++ Runtime

• Exploits multicore nodes
• Portable, since it is based on MPI
• On each node:
  – A “processor” is represented as a thread
  – N “worker” threads share 1 “communication” thread
• Worker threads: handle only computation
• Communication thread: handles only network messages (see the sketch below)
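A stripped-down model of the node layout described above, using std::thread as a stand-in for the real MPI-based runtime: worker threads only compute and hand outgoing messages to a shared queue, and a single communication thread is the only one that touches the network (here it just prints).

```cpp
// Toy model of SMP mode: worker threads produce messages, one dedicated
// communication thread drains them (stand-in for the MPI-facing thread).
#include <atomic>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

std::queue<std::string> outbox;
std::mutex outboxMutex;
std::atomic<bool> done{false};

void worker(int id) {                       // computation only
    std::lock_guard<std::mutex> lock(outboxMutex);
    outbox.push("forces from worker " + std::to_string(id));
}

void commThread() {                         // network traffic only
    while (true) {
        std::string msg;
        {
            std::lock_guard<std::mutex> lock(outboxMutex);
            if (!outbox.empty()) { msg = outbox.front(); outbox.pop(); }
            else if (done) return;
        }
        if (!msg.empty()) std::printf("comm thread sends: %s\n", msg.c_str());
    }
}

int main() {
    std::thread comm(commThread);
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) workers.emplace_back(worker, i);
    for (auto& w : workers) w.join();
    done = true;
    comm.join();
    return 0;
}
```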

Page 22: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Benefits of SMP Mode (1)

• Intra-node communication is faster
  – A message is transferred as a pointer
• Program launch time is reduced
  – 224K cores: from ~6 min to ~1 min
• Transparent to application developers
  – A correct Charm++ program runs in both non-SMP and SMP mode

Page 23: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Benefits of SMP Mode (2)

• Reduces memory footprint further
  – Read-only data structures are shared
  – Memory footprint of the MPI library is reduced
  – On average, a 7X reduction!
• Better cache performance

Enables the 100M-atom run on Intrepid (Blue Gene/P, 2 GB/node)

Page 24: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Potential Bottleneck on Communication Thread

• Computation/communication overlap alleviates the problem to some extent

Page 25: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Node-aware Communication

• In the runtime: multicast, broadcast, etc.
  – E.g., a series of broadcasts at startup: 2.78X reduction
• In the application: multicast tree
  – Incorporates knowledge of the computation to guide the construction of the tree
  – Least loaded node used as an intermediate node (see the sketch below)
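One simple way to realize "least loaded node as intermediate node": sort the destination nodes by measured load and let the least loaded ones fan the message out to the rest. The construction below is a hedged illustration, not the runtime's actual tree algorithm.

```cpp
// Illustrative node-aware multicast tree: the root sends to k intermediates
// chosen as the least loaded destination nodes; each intermediate forwards
// the message to an equal share of the remaining nodes.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

struct Node { int id; double load; };

int main() {
    std::vector<Node> dests = {{1, 0.9}, {2, 0.2}, {3, 0.5},
                               {4, 0.1}, {5, 0.7}, {6, 0.4}};
    const int k = 2;                      // fan-out from the root

    // The least loaded destination nodes become intermediates.
    std::sort(dests.begin(), dests.end(),
              [](const Node& a, const Node& b) { return a.load < b.load; });

    for (int i = 0; i < k; ++i) {
        std::printf("root -> intermediate node %d (load %.1f):",
                    dests[i].id, dests[i].load);
        // Remaining destinations are split round-robin among intermediates.
        for (std::size_t j = (std::size_t)(k + i); j < dests.size(); j += k)
            std::printf(" -> node %d", dests[j].id);
        std::printf("\n");
    }
    return 0;
}
```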

Page 26: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Handle Burst of Messages (1)

• A global barrier after each timestep, due to the constant-pressure algorithm

• Amplified further because there is only 1 comm thread per node

Page 27: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Handle Burst of Messages (2)

• Workflow of the comm thread
  – Alternates among send/release/receive modes (see the sketch below)
• Dynamic flow control
  – Decides when to exit one mode for another
  – E.g., a 12.3% improvement for a 4,480-node run (53,760 cores)
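The send/release/receive alternation can be pictured as a small state machine in which the communication thread caps the work done in each mode before switching, so a burst in one direction cannot starve the others. The per-mode budget and exit condition below are illustrative; the real runtime tunes them dynamically.

```cpp
// Sketch of a communication-thread loop that alternates between send,
// release (handing received messages to workers), and receive modes,
// with a per-mode budget so a burst in one mode cannot starve the rest.
#include <cstdio>

enum class Mode { Send, Release, Receive };

// Stand-ins for the real per-mode work; each returns work items handled.
int doSends(int budget)    { return budget; }
int doReleases(int budget) { return budget; }
int doReceives(int budget) { return budget; }

int main() {
    Mode mode = Mode::Send;
    const int budget = 8;                 // max items before switching mode
    for (int iter = 0; iter < 6; ++iter) {
        switch (mode) {
            case Mode::Send:
                std::printf("send mode: %d msgs\n", doSends(budget));
                mode = Mode::Release;  break;
            case Mode::Release:
                std::printf("release mode: %d msgs\n", doReleases(budget));
                mode = Mode::Receive;  break;
            case Mode::Receive:
                std::printf("receive mode: %d msgs\n", doReceives(budget));
                mode = Mode::Send;     break;
        }
    }
    return 0;
}
```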

Page 28: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Hierarchical Load Balancer

• The centralized load balancer consumes too much memory

• Processors are divided into groups

• Load balancing is done within each group (see the sketch below)
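The grouping idea can be sketched as: partition the processors into fixed-size groups, gather load statistics only within each group, and run a greedy rebalance per group, which bounds both memory use and decision time. This is a generic greedy sketch, not Charm++'s actual hierarchical balancer.

```cpp
// Generic sketch of hierarchical load balancing: processors are split into
// groups and a greedy balance runs independently inside each group.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Greedily assign object loads to the least loaded processor in the group.
std::vector<double> balanceGroup(int procs, std::vector<double> objLoads) {
    std::sort(objLoads.rbegin(), objLoads.rend());          // heaviest first
    std::priority_queue<double, std::vector<double>, std::greater<double>>
        procLoad;
    for (int p = 0; p < procs; ++p) procLoad.push(0.0);
    for (double load : objLoads) {
        double lightest = procLoad.top(); procLoad.pop();
        procLoad.push(lightest + load);
    }
    std::vector<double> result;
    while (!procLoad.empty()) { result.push_back(procLoad.top()); procLoad.pop(); }
    return result;
}

int main() {
    // Two groups of 4 processors, each balancing only its own objects.
    std::vector<std::vector<double>> groupObjects = {
        {4.0, 3.0, 2.0, 2.0, 1.0, 1.0}, {5.0, 2.5, 2.5, 1.0}};
    for (std::size_t g = 0; g < groupObjects.size(); ++g) {
        auto loads = balanceGroup(4, groupObjects[g]);
        std::printf("group %zu per-proc load:", g);
        for (double l : loads) std::printf(" %.1f", l);
        std::printf("\n");
    }
    return 0;
}
```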

Page 29: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Improvement due to Load Balancing

Page 30: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Performance Improvement of SMP over non-SMP Mode on Jaguar

Page 31: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Strong Scaling on Jaguar (2)

[Scaling plot at 6,720; 53,760; 107,520; and 224,076 cores]

Page 32: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Weak Scaling on Intrepid (~1,466 atoms/core)

[Weak-scaling plot for 2M, 6M, 12M, 24M, 48M, and 100M-atom systems]

1. The 100M-atom system ONLY runs in SMP mode

2. Dedicating one core per node to communication in SMP mode (a 25% loss of compute cores) caused the performance gap

Page 33: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Conclusion and Future Work

• I/O bottleneck solved by parallelization

• An approach that optimizes both the application and its underlying runtime
  – SMP mode in the runtime

• Continue to improve performance
  – PME calculation

• Integrate and optimize new science codes

Page 34: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Acknowledgement

• Gengbin Zheng, Yanhua Sun, Eric Bohm, Chris Harrison, Osman Sarood for the 100M-atom simulation

• David Tanner for the implicit solvent work

• Machines: Jaguar@NCCS and Intrepid@ANL, supported by DOE

• Funding: NIH, NSF

Page 35: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Thanks