Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++
James Phillips, Beckman Institute, University of Illinois
Chao Mei, Parallel Programming Lab, University of Illinois

UIUC Beckman Institute is a home away from home for interdisciplinary researchers
Theoretical and Computational Biophysics Group

Biomolecular simulations are our computational microscope
- Ribosome: synthesizes proteins from genetic information; a target for antibiotics
- Silicon nanopore: a bionanodevice for sequencing DNA efficiently

Our goal for NAMD is practical supercomputing for NIH researchers
- 44,000 users can't all be computer experts.
  - 11,700 have downloaded more than one version.
  - 2,300 citations of NAMD reference papers.
- One program for all platforms.
  - Desktops and laptops: setup and testing
  - Linux clusters: affordable local workhorses
  - Supercomputers: free allocations on TeraGrid
  - Blue Waters: sustained petaflop/s performance
- User knowledge is preserved.
  - No change in input or output files.
  - Run any simulation on any number of cores.
- Available free of charge to all.
Phillips et al., J. Comp. Chem. 26, 2005.

NAMD uses a hybrid force-spatial parallel decomposition
- Spatially decompose data and communication.
- Separate but related work decomposition.
- "Compute objects" facilitate an iterative, measurement-based load balancing system.
Kale et al., J. Comp. Phys. 151, 1999.

Charm++ overlaps NAMD algorithms
- Objects are assigned to processors, queued as data arrives, and executed in priority order.
Phillips et al., SC2002.

NAMD adjusts grainsize to match parallelism to processor count
- Tradeoff between parallelism and overhead
- Maximum patch size is based on the cutoff distance
- Ideally one or more patches per processor
- To double the patch count, split patches in the x, y, or z dimension
- The number of computes grows much faster!
- Hard to automate completely
- Also need to select the number of PME pencils
- Computes are partitioned over the outer atom loop
  - Old: heuristic based on distance and atom count
  - New: measurement-based compute partitioning (see the sketch below)
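A minimal sketch of what measurement-based compute partitioning can look like, assuming a hypothetical Compute record and partitionByMeasurement helper (neither is NAMD source): each compute object is timed over recent steps and split across its outer atom loop only when its measured cost exceeds a target, rather than being split by a distance/atom-count heuristic.

```cpp
// Minimal sketch (not NAMD source) of measurement-based compute partitioning:
// instead of guessing work from distance/atom counts, each compute object is
// timed for a few steps and split only if its measured cost is too large.
#include <vector>

struct Compute {
  int patchA, patchB;   // pair of patches whose interactions this object computes
  double measuredMs;    // wall-clock time measured over recent timesteps
  int parts = 1;        // how many pieces the outer atom loop is divided into
  int part = 0;         // which piece of the outer atom loop this object handles
};

// Hypothetical helper: split the outer atom loop of oversized computes so that
// no single object exceeds 'targetMs' of measured work.
std::vector<Compute> partitionByMeasurement(const std::vector<Compute>& computes,
                                            double targetMs) {
  std::vector<Compute> result;
  for (const Compute& c : computes) {
    int parts = static_cast<int>(c.measuredMs / targetMs) + 1;
    for (int p = 0; p < parts; ++p) {
      Compute piece = c;
      piece.parts = parts;                 // each piece handles 1/parts of outer atoms
      piece.part = p;
      piece.measuredMs = c.measuredMs / parts;
      result.push_back(piece);             // pieces become independent work units
    }
  }
  return result;                           // handed to the load balancer as usual
}
```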
Measurement-based grainsize tuning enables scalable implicit solvent simulation
- After: measurement-based (512 cores)
- Before: heuristic (256 cores)

The age of petascale biomolecular simulation is near

Larger machines enable larger simulations

2002 Gordon Bell Award
- PSC Lemieux: 3,000 cores; ATP synthase: 300K atoms
- Blue Waters: 300,000 cores, 1.2M threads; chromatophore: 100M atoms
- The target is still 100 atoms per thread

Scale brings other challenges
- Limited memory per core
- Limited memory per node
- Finicky parallel filesystems
- Limited inter-node bandwidth
- Long load balancer runtimes
Which is why we collaborate with PPL!

Challenges in 100M-Atom Biomolecular Simulation
- How to overcome the sequential bottleneck?
  - Initialization
  - Output of trajectory & restart data
- How to achieve good strong-scaling results?
  - The Charm++ runtime

Loading Data into the System (1)
- Traditionally done on a single core
  - Fine while the molecule is small
- Result for the 100M-atom system:
  - Memory: 40.5 GB!
  - Time: … sec!

Loading Data into the System (2)
- Compression scheme
  - An "atom signature" represents the common attributes of an atom
  - Supports more scientific simulation parameters
- However, not enough:
  - Memory: 12.8 GB!
  - Time: … sec!

Loading Data into the System (3)
- Parallelized initialization (sketched below)
  - Number of input procs: a parameter chosen by the user or auto-computed at runtime
  - First, each input proc loads 1/N of all atoms
  - Second, atoms are shuffled with neighboring procs for the later spatial decomposition
- Good enough, e.g. with 600 input procs:
  - Memory: 0.19 GB
  - Time: 12.4 sec
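A minimal sketch of the two-phase parallel input described above, using hypothetical helpers inputSlice and destinationProc and a simple one-dimensional slab binning for illustration (not the NAMD reader): each of N input procs reads a contiguous 1/N slice of the atom records, then forwards each atom to the input proc that owns its spatial slab so the later patch decomposition finds the data nearby.

```cpp
// Minimal sketch (hypothetical, not NAMD source) of two-phase parallel input:
// phase 1: each of the N input procs reads a contiguous 1/N slice of the atoms;
// phase 2: atoms are shuffled to the input proc whose spatial slab they fall in,
//          matching the later spatial (patch) decomposition.
#include <algorithm>
#include <cstdint>
#include <utility>

// Contiguous slice [begin, end) of atom indices owned by input proc 'rank'.
std::pair<int64_t, int64_t> inputSlice(int64_t numAtoms, int numInputProcs, int rank) {
  int64_t base = numAtoms / numInputProcs;
  int64_t extra = numAtoms % numInputProcs;   // first 'extra' procs get one more atom
  int64_t begin = rank * base + std::min<int64_t>(rank, extra);
  int64_t end = begin + base + (rank < extra ? 1 : 0);
  return {begin, end};
}

// Destination input proc for an atom, based on its x coordinate: the box is cut
// into numInputProcs slabs, so the shuffle mostly talks to neighboring procs.
int destinationProc(double x, double boxMinX, double boxMaxX, int numInputProcs) {
  double frac = (x - boxMinX) / (boxMaxX - boxMinX);
  int dest = static_cast<int>(frac * numInputProcs);
  if (dest < 0) dest = 0;
  if (dest >= numInputProcs) dest = numInputProcs - 1;
  return dest;
}
```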
Output of Trajectory & Restart Data (1)
- At least 4.8 GB of output to the file system per output step
  - The target of tens of ms per step makes this even more critical
- Parallelized output
  - Each output proc is responsible for a portion of the atoms
  - Output goes to a single file for compatibility

Output Issue (1)

Output Issue (2)
- Multiple independent files
- Post-processing into a single file

Initial Strong Scaling on Jaguar
(Plot: 6,720; 53,760; 107,520; and 224,076 cores.)

Multi-threading the MPI-based Charm++ Runtime
- Exploits multicore nodes; portable because it is based on MPI
- On each node a processor is represented as a thread
  - N worker threads share 1 communication thread
  - Worker threads handle only computation
  - The communication thread handles only network messages

Benefits of SMP Mode (1)
- Intra-node communication is faster
  - Messages are transferred as pointers
- Program launch time is reduced
  - On 224K cores: from ~6 min to ~1 min
- Transparent to application developers
  - A correct Charm++ program runs in both non-SMP and SMP mode

Benefits of SMP Mode (2)
- Memory footprint is reduced further
  - Read-only data structures are shared
  - The memory footprint of the MPI library is reduced
  - On average a 7x reduction!
- Better cache performance
- Enables the 100M-atom run on Intrepid (BlueGene/P, 2 GB/node)

Potential Bottleneck on the Communication Thread
- Overlapping computation and communication alleviates the problem to some extent

Node-aware Communication
- In the runtime: multicast, broadcast, etc.
  - E.g. a series of broadcasts during startup: 2.78x reduction
- In the application: multicast tree
  - Incorporate knowledge of the computation to guide construction of the tree
  - Use the least-loaded node as the intermediate node

Handling Bursts of Messages (1)
- A global barrier after each timestep due to the constant-pressure algorithm
- Amplified further because there is only one comm thread per node

Handling Bursts of Messages (2)
- Work flow of the comm thread (sketched below)
  - Alternates between send/release/receive modes
- Dynamic flow control
  - When to exit one mode and enter another
  - E.g. 12.3% on 4,480 nodes (53,760 cores)
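A minimal sketch of a communication thread that alternates between send, release, and receive modes, assuming a hypothetical CommThread class and a per-mode budget as the flow-control rule (the real Charm++ machine layer is more involved): bounding the work done in each mode keeps a burst of incoming messages from starving the send and release phases.

```cpp
// Minimal sketch (hypothetical, not the Charm++ machine layer) of a comm thread
// that cycles through send / release / receive modes with simple dynamic flow
// control: each mode handles at most a bounded amount of work before yielding.
#include <queue>

struct Message { /* payload omitted */ };

class CommThread {
 public:
  void run() {
    while (!shutdown_) {
      sendMode(maxPerMode_);     // push outgoing messages to the network
      releaseMode(maxPerMode_);  // hand received messages to worker threads
      receiveMode(maxPerMode_);  // pull incoming messages off the network
    }
  }
  void stop() { shutdown_ = true; }

 private:
  void sendMode(int budget) {
    while (budget-- > 0 && !outgoing_.empty()) {
      outgoing_.pop();           // a network send would happen here
    }
  }
  void releaseMode(int budget) {
    while (budget-- > 0 && !received_.empty()) {
      received_.pop();           // enqueue onto a worker thread's queue here
    }
  }
  void receiveMode(int budget) {
    while (budget-- > 0 && networkHasMessage()) {
      received_.push(Message{}); // probe/receive from the network here
    }
  }
  bool networkHasMessage() { return false; }  // placeholder for a network probe

  std::queue<Message> outgoing_, received_;
  int maxPerMode_ = 64;          // flow-control budget; an illustrative knob, not a real one
  bool shutdown_ = false;
};
```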
Hierarchical Load Balancer
- The centralized balancer has large memory consumption
- Processors are divided into groups
- Load balancing is done within each group
- (A sketch of the grouping idea follows the closing slide.)

Improvement due to Load Balancing

Performance Improvement of SMP over non-SMP on Jaguar

Strong Scaling on Jaguar (2)
(Plot: 6,720; 53,760; 107,520; and 224,076 cores.)

Weak Scaling on Intrepid (~1,466 atoms/core)
(Plot: systems of 2M, 6M, 12M, 24M, 48M, and 100M atoms.)
1. The 100M-atom system runs ONLY in SMP mode.
2. Dedicating one core per node to communication in SMP mode (a 25% loss) caused the performance gap.

Conclusion and Future Work
- The I/O bottleneck is solved by parallelization
- An approach that optimizes both the application and its underlying runtime
  - SMP mode in the runtime
- Continue to improve performance
  - PME calculation
  - Integrate and optimize new science codes

Acknowledgements
- Gengbin Zheng, Yanhua Sun, Eric Bohm, Chris Harrison, and Osman Sarood for the 100M-atom simulation
- David Tanner for the implicit solvent work
- Machines: supported by DOE
- Funding: NIH, NSF

Thanks
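As referenced on the Hierarchical Load Balancer slide, a minimal sketch of the grouping idea, assuming a hypothetical balanceGroup helper and a greedy within-group policy (not the Charm++ hierarchical balancer): load is collected and rebalanced only inside each processor group, so no single processor has to hold load data for the whole machine.

```cpp
// Minimal sketch (hypothetical) of group-wise load balancing: PEs are partitioned
// into groups and objects are rebalanced only among the PEs of their own group,
// which bounds the memory and runtime of each balancing step.
#include <algorithm>
#include <vector>

struct Obj { int id; double load; };

// Greedy rebalance within one group: heaviest object goes to the least-loaded PE.
std::vector<std::vector<Obj>> balanceGroup(std::vector<Obj> objs, int pesInGroup) {
  std::sort(objs.begin(), objs.end(),
            [](const Obj& a, const Obj& b) { return a.load > b.load; });
  std::vector<std::vector<Obj>> assignment(pesInGroup);
  std::vector<double> peLoad(pesInGroup, 0.0);
  for (const Obj& o : objs) {
    int pe = static_cast<int>(
        std::min_element(peLoad.begin(), peLoad.end()) - peLoad.begin());
    assignment[pe].push_back(o);
    peLoad[pe] += o.load;
  }
  return assignment;
}

// The machine is handled one group at a time; groups never exchange objects in
// this sketch, which is what keeps per-group state and balancing time small.
```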