
Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Page 1: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++

NIH Resource for Macromolecular Modeling and Bioinformatics, http://www.ks.uiuc.edu/

Beckman Institute, UIUC

Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++

James Phillips, Beckman Institute, University of Illinois, http://www.ks.uiuc.edu/Research/namd/

Chao Mei, Parallel Programming Lab, University of Illinois, http://charm.cs.illinois.edu/

Page 2: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


UIUC Beckman Institute is a “home away from home” for interdisciplinary researchers

Theoretical and Computational Biophysics Group

Page 3: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Biomolecular simulations are our computational microscope

Ribosome: synthesizes proteins from genetic information, target for antibiotics

Silicon nanopore: bionanodevice for sequencing DNA efficiently

Page 4: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Our goal for NAMD is practical supercomputing for NIH researchers

• 44,000 users can’t all be computer experts.
  – 11,700 have downloaded more than one version.
  – 2,300 citations of NAMD reference papers.
• One program for all platforms.
  – Desktops and laptops: setup and testing
  – Linux clusters: affordable local workhorses
  – Supercomputers: free allocations on TeraGrid
  – Blue Waters: sustained petaflop/s performance
• User knowledge is preserved.
  – No change in input or output files.
  – Run any simulation on any number of cores.
• Available free of charge to all.

Phillips et al., J. Comp. Chem. 26:1781-1802, 2005.

Page 5: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


NAMD uses a hybrid force-spatial parallel decomposition

• Spatially decompose data and communication.
• Separate but related work decomposition.
• “Compute objects” facilitate an iterative, measurement-based load balancing system (see the sketch below).

Kale et al., J. Comp. Phys. 151:283-312, 1999.
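As a rough illustration of this decomposition (a sketch, not NAMD's actual classes), spatial "patches" own the atoms in one region, while migratable "compute" objects each own one patch pair and carry a measurable cost that a load balancer can use. The names Patch and PairCompute below are hypothetical.

```cpp
// Minimal sketch of a force/spatial decomposition, using simplified
// stand-ins (Patch, PairCompute) for NAMD's real objects.
#include <cstdio>
#include <vector>

struct Atom { double x, y, z; };

// A Patch owns the atoms in one cubic region of the simulation box.
struct Patch {
    int id;
    std::vector<Atom> atoms;
};

// A PairCompute evaluates short-range forces between two neighboring
// patches; it is the unit a measurement-based load balancer migrates.
struct PairCompute {
    const Patch* a;
    const Patch* b;
    int assignedPe = -1;        // processor chosen by the load balancer
    long work() const {         // crude cost estimate: atom-pair count
        return (long)a->atoms.size() * (long)b->atoms.size();
    }
};

int main() {
    Patch p0{0, std::vector<Atom>(100)}, p1{1, std::vector<Atom>(120)};
    PairCompute c{&p0, &p1};
    c.assignedPe = 3;           // placement decided from measured load
    std::printf("compute(%d,%d) cost=%ld on PE %d\n",
                c.a->id, c.b->id, c.work(), c.assignedPe);
    return 0;
}
```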

Page 6: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Charm++ overlaps NAMD algorithms

Objects are assigned to processors, queued as data arrives, and executed in priority order.

Phillips et al., SC2002.
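To make "queued as data arrives, and executed in priority order" concrete, here is a generic priority-queue scheduler sketch in plain C++; it only illustrates the idea and is not the Charm++ scheduler.

```cpp
// Generic sketch of priority-ordered, message-driven execution:
// work is enqueued as its input data arrives and run by priority.
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Task {
    int priority;                       // smaller value = more urgent
    std::function<void()> run;
};

struct ByPriority {
    bool operator()(const Task& a, const Task& b) const {
        return a.priority > b.priority; // min-heap on priority
    }
};

int main() {
    std::priority_queue<Task, std::vector<Task>, ByPriority> ready;

    // Work becomes ready as its data "arrives"...
    ready.push({20, [] { std::puts("long-range (PME) work"); }});
    ready.push({10, [] { std::puts("short-range force work"); }});
    ready.push({30, [] { std::puts("trajectory output"); }});

    // ...and is executed strictly in priority order.
    while (!ready.empty()) { ready.top().run(); ready.pop(); }
    return 0;
}
```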

Page 7: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


NAMD adjusts grainsize to match parallelism to processor count

• Tradeoff between parallelism and overhead
• Maximum patch size is based on the cutoff
• Ideally one or more patches per processor
  – To double the patch count, split in the x, y, or z dimension
  – Number of computes grows much faster! (see the sketch below)
• Hard to automate completely
  – Also need to select the number of PME pencils
• Computes partitioned in outer atom loop
  – Old: heuristic based on distance and atom count
  – New: measurement-based compute partitioning
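A back-of-the-envelope sketch of why computes outpace patches: splitting patches shrinks the patch edge below the cutoff, so each patch must interact with more neighbor shells. The counting model below (every patch pair within the cutoff gets one pairwise compute) is an illustrative assumption, not NAMD's exact bookkeeping.

```cpp
// Rough patch/compute counting for a periodic box, assuming every patch
// pair whose regions fall within the cutoff needs a pairwise compute.
#include <cmath>
#include <cstdio>

int main() {
    double box = 256.0, cutoff = 12.0;            // Angstroms (illustrative)
    for (int split = 1; split <= 4; split *= 2) { // 1x, 2x, 4x per dimension
        int perDim = (int)(box / cutoff) * split; // patches per dimension
        double edge = box / perDim;               // patch edge length
        long patches = (long)perDim * perDim * perDim;
        // Neighbor shells needed so that patch pairs cover the cutoff.
        int shells = (int)std::ceil(cutoff / edge);
        long nbrs = (2L * shells + 1) * (2L * shells + 1) * (2L * shells + 1) - 1;
        long computes = patches + patches * nbrs / 2;  // self + pair computes
        std::printf("split=%d  patches=%ld  computes=%ld\n",
                    split, patches, computes);
    }
    return 0;
}
```

With these illustrative numbers, doubling the patches per dimension multiplies the patch count by 8 but the compute count by roughly 36, which is why grainsize selection is hard to automate completely.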

Page 8: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Measurement-based grainsize tuning enables scalable implicit solvent simulation

Before: heuristic (256 cores)

After: measurement-based (512 cores)

Page 9: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


The age of petascale biomolecular simulation is near

Page 10: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Larger machines enable larger simulations

Page 11: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


2002 Gordon Bell Award
PSC Lemieux: 3,000 cores
ATP synthase: 300K atoms

Blue Waters: 300,000 cores, 1.2M threads
Chromatophore: 100M atoms

Target is still 100 atoms per thread

Page 12: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Scale brings other challenges

• Limited memory per core

• Limited memory per node

• Finicky parallel filesystems

• Limited inter-node bandwidth

• Long load balancer runtimes

Which is why we collaborate with PPL!

Page 13: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Challenges in 100M-Atom Biomolecular Simulation

• How to overcome the sequential bottlenecks?
  – Initialization
  – Output of trajectory & restart data
• How to achieve good strong-scaling results?
  – Charm++ runtime

Page 14: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Loading Data into System (1)

• Traditionally done on a single core
  – Molecule size was small
• Result for the 100M-atom system
  – Memory: 40.5 GB!
  – Time: 3,301.9 sec!

Page 15: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Loading Data into System (2)

• Compression scheme
  – Atom “signature” representing the common attributes of an atom (see sketch below)
  – Supports more science simulation parameters
  – However, not enough
• Memory: 12.8 GB!
• Time: 125.5 sec!
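One way to picture the atom-signature compression: atoms with identical static parameters share a single signature record, and each atom stores only its changing data plus a small index. The field names below are hypothetical, not NAMD's actual structures.

```cpp
// Minimal sketch of signature-based compression: static parameters that
// many atoms share are stored once; each atom keeps only an index.
#include <cstdint>
#include <cstdio>
#include <vector>

struct AtomSignature {          // shared, static per-type data
    double mass, charge;
    int    vdwType;
};

struct CompressedAtom {         // stored per atom
    float    x, y, z;           // coordinates (change every step)
    uint32_t sig;               // index into the signature table
};

int main() {
    std::vector<AtomSignature> signatures = {
        {15.999, -0.834, 0},    // e.g. water-oxygen-like parameters
        { 1.008,  0.417, 1},    // e.g. water-hydrogen-like parameters
    };
    std::vector<CompressedAtom> atoms = {
        { 0.00f, 0.00f, 0.0f, 0},
        { 0.96f, 0.00f, 0.0f, 1},
        {-0.24f, 0.93f, 0.0f, 1},
    };
    for (const auto& a : atoms) {
        const AtomSignature& s = signatures[a.sig];
        std::printf("atom at (%.2f,%.2f,%.2f) mass=%.3f charge=%.3f\n",
                    a.x, a.y, a.z, s.mass, s.charge);
    }
    return 0;
}
```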

Page 16: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Loading Data into System (3)

• Parallelizing initialization (see the sketch below)
  – Number of input procs: a parameter chosen by the user or auto-computed at runtime
  – First, each loads 1/N of all atoms
  – Second, atoms are shuffled with neighbor procs for later spatial decomposition
  – Good enough: e.g., with 600 input procs
• Memory: 0.19 GB
• Time: 12.4 sec
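A sketch of the two-phase parallel load under simplifying assumptions: each of N input processors reads a contiguous 1/N block of atom records, then buckets its atoms by spatial position so they can be forwarded toward the processors that own those regions. The helpers blockRange and bucketByPatch are illustrative, not NAMD's API.

```cpp
// Illustrative two-phase parallel input: (1) each input processor loads a
// contiguous 1/N block of atoms, (2) atoms are bucketed by spatial patch
// (1D slabs in x here, for simplicity) so they can be forwarded onward.
#include <algorithm>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

struct Atom { long id; double x, y, z; };

// Phase 1: which atom indices does input processor `rank` load?
std::pair<long, long> blockRange(long numAtoms, int numInputProcs, int rank) {
    long per = (numAtoms + numInputProcs - 1) / numInputProcs;
    long begin = (long)rank * per;
    long end = std::min(numAtoms, begin + per);
    return {begin, end};
}

// Phase 2: bucket loaded atoms by destination patch.
std::map<int, std::vector<Atom>> bucketByPatch(const std::vector<Atom>& atoms,
                                               double boxX, int patchesX) {
    std::map<int, std::vector<Atom>> buckets;
    for (const Atom& a : atoms) {
        int patch = (int)(a.x / boxX * patchesX);
        buckets[patch].push_back(a);
    }
    return buckets;
}

int main() {
    auto [lo, hi] = blockRange(100'000'000, 600, 42);   // 600 input procs
    std::printf("input proc 42 loads atoms [%ld, %ld)\n", lo, hi);

    std::vector<Atom> mine = {{lo, 10.0, 0, 0}, {lo + 1, 200.0, 0, 0}};
    auto buckets = bucketByPatch(mine, 256.0, 8);
    for (auto& [patch, atoms] : buckets)
        std::printf("patch %d gets %zu atoms\n", patch, atoms.size());
    return 0;
}
```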

Page 17: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Output Trajectory & Restart Data (1)

• At least 4.8 GB written to the file system per output step
  – A target of tens of ms/step makes this even more critical
• Parallelizing output (see the sketch below)
  – Each output proc is responsible for a portion of the atoms
• Output to a single file for compatibility
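To keep a single output file while letting each output processor write only its own atom range, each writer can seek to a precomputed per-atom byte offset so writes never overlap. The snippet below is a plain std::fstream stand-in for a parallel writer (in practice each output processor would hold its own handle to the shared file), and the header size and record layout are assumptions.

```cpp
// Sketch of range-based output to one shared file: every output processor
// owns a contiguous atom range and writes at a precomputed byte offset.
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <vector>

constexpr std::size_t kHeaderBytes  = 1024;               // assumed header size
constexpr std::size_t kBytesPerAtom = 3 * sizeof(float);  // x, y, z per atom

void writeRange(std::fstream& f, std::size_t firstAtom,
                const std::vector<float>& xyz) {
    f.seekp((std::streamoff)(kHeaderBytes + firstAtom * kBytesPerAtom));
    f.write(reinterpret_cast<const char*>(xyz.data()),
            (std::streamsize)(xyz.size() * sizeof(float)));
}

int main() {
    // Create/truncate the file, then reopen it for positioned writes.
    { std::ofstream create("frame.bin", std::ios::binary); }
    std::fstream f("frame.bin", std::ios::binary | std::ios::in | std::ios::out);

    // Two "output processors", each responsible for two atoms (x, y, z each).
    writeRange(f, 0, {0.f, 0.f, 0.f, 1.f, 1.f, 1.f});   // atoms [0, 2)
    writeRange(f, 2, {2.f, 2.f, 2.f, 3.f, 3.f, 3.f});   // atoms [2, 4)
    std::puts("wrote non-overlapping ranges into one shared file");
    return 0;
}
```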

Page 18: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Output Issue (1)

Page 19: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Output Issue (2)

• Multiple independent files

• Post-processing into a single file

Page 20: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Initial Strong Scaling on Jaguar

[Scaling plot at 6,720; 53,760; 107,520; and 224,076 cores]

Page 21: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Multi-threading MPI-based Charm++ Runtime

• Exploits multicore nodes
• Portable, since it is based on MPI
• On each node:
  – A “processor” is represented as a thread
  – N “worker” threads share 1 “communication” thread
• Worker threads: handle only computation
• Communication thread: handles only network messages (see the sketch below)
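A stripped-down model of the node layout described above, using std::thread as a stand-in for the real MPI-based runtime: worker threads only compute and hand outgoing messages to a shared queue, and a single communication thread is the only one that touches the network (here it just prints).

```cpp
// Toy model of SMP mode: worker threads produce messages, one dedicated
// communication thread drains them (stand-in for the MPI-facing thread).
#include <atomic>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

std::queue<std::string> outbox;
std::mutex outboxMutex;
std::atomic<bool> done{false};

void worker(int id) {                       // computation only
    std::lock_guard<std::mutex> lock(outboxMutex);
    outbox.push("forces from worker " + std::to_string(id));
}

void commThread() {                         // network traffic only
    while (true) {
        std::string msg;
        {
            std::lock_guard<std::mutex> lock(outboxMutex);
            if (!outbox.empty()) { msg = outbox.front(); outbox.pop(); }
            else if (done) return;
        }
        if (!msg.empty()) std::printf("comm thread sends: %s\n", msg.c_str());
    }
}

int main() {
    std::thread comm(commThread);
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) workers.emplace_back(worker, i);
    for (auto& w : workers) w.join();
    done = true;
    comm.join();
    return 0;
}
```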

Page 22: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Benefits of SMP Mode (1)

• Intra-node communication is faster
  – A message is transferred as a pointer
• Program launch time is reduced
  – 224K cores: from ~6 min to ~1 min
• Transparent to application developers
  – A correct Charm++ program runs in both non-SMP and SMP mode

Page 23: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Benefits of SMP Mode (2)

• Reduces memory footprint further
  – Read-only data structures are shared
  – Memory footprint of the MPI library is reduced
  – On average, a 7X reduction!
• Better cache performance

Enables the 100M-atom run on Intrepid (Blue Gene/P, 2 GB/node)

Page 24: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Potential Bottleneck on Communication Thread

• Computation/communication overlap alleviates the problem to some extent

Page 25: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Node-aware Communication

• In the runtime: multicast, broadcast, etc.
  – E.g., a series of broadcasts at startup: 2.78X reduction
• In the application: multicast tree
  – Incorporates knowledge of the computation to guide the construction of the tree
  – Least loaded node used as an intermediate node (see the sketch below)
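One simple way to realize "least loaded node as intermediate node": sort the destination nodes by measured load and let the least loaded ones fan the message out to the rest. The construction below is a hedged illustration, not the runtime's actual tree algorithm.

```cpp
// Illustrative node-aware multicast tree: the root sends to k intermediates
// chosen as the least loaded destination nodes; each intermediate forwards
// the message to an equal share of the remaining nodes.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

struct Node { int id; double load; };

int main() {
    std::vector<Node> dests = {{1, 0.9}, {2, 0.2}, {3, 0.5},
                               {4, 0.1}, {5, 0.7}, {6, 0.4}};
    const int k = 2;                      // fan-out from the root

    // The least loaded destination nodes become intermediates.
    std::sort(dests.begin(), dests.end(),
              [](const Node& a, const Node& b) { return a.load < b.load; });

    for (int i = 0; i < k; ++i) {
        std::printf("root -> intermediate node %d (load %.1f):",
                    dests[i].id, dests[i].load);
        // Remaining destinations are split round-robin among intermediates.
        for (std::size_t j = (std::size_t)(k + i); j < dests.size(); j += k)
            std::printf(" -> node %d", dests[j].id);
        std::printf("\n");
    }
    return 0;
}
```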

Page 26: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Handle Burst of Messages (1)

• A global barrier after each timestep, due to the constant-pressure algorithm

• Amplified further because there is only 1 comm thread per node

Page 27: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Handle Burst of Messages (2)

• Workflow of the comm thread
  – Alternates among send/release/receive modes (see the sketch below)
• Dynamic flow control
  – Decides when to exit one mode for another
  – E.g., a 12.3% improvement for a 4,480-node run (53,760 cores)
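The send/release/receive alternation can be pictured as a small state machine in which the communication thread caps the work done in each mode before switching, so a burst in one direction cannot starve the others. The per-mode budget and exit condition below are illustrative; the real runtime tunes them dynamically.

```cpp
// Sketch of a communication-thread loop that alternates between send,
// release (handing received messages to workers), and receive modes,
// with a per-mode budget so a burst in one mode cannot starve the rest.
#include <cstdio>

enum class Mode { Send, Release, Receive };

// Stand-ins for the real per-mode work; each returns work items handled.
int doSends(int budget)    { return budget; }
int doReleases(int budget) { return budget; }
int doReceives(int budget) { return budget; }

int main() {
    Mode mode = Mode::Send;
    const int budget = 8;                 // max items before switching mode
    for (int iter = 0; iter < 6; ++iter) {
        switch (mode) {
            case Mode::Send:
                std::printf("send mode: %d msgs\n", doSends(budget));
                mode = Mode::Release;  break;
            case Mode::Release:
                std::printf("release mode: %d msgs\n", doReleases(budget));
                mode = Mode::Receive;  break;
            case Mode::Receive:
                std::printf("receive mode: %d msgs\n", doReceives(budget));
                mode = Mode::Send;     break;
        }
    }
    return 0;
}
```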

Page 28: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Hierarchical Load Balancer

• The centralized load balancer consumes too much memory

• Processors are divided into groups

• Load balancing is done within each group (see the sketch below)
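The grouping idea can be sketched as: partition the processors into fixed-size groups, gather load statistics only within each group, and run a greedy rebalance per group, which bounds both memory use and decision time. This is a generic greedy sketch, not Charm++'s actual hierarchical balancer.

```cpp
// Generic sketch of hierarchical load balancing: processors are split into
// groups and a greedy balance runs independently inside each group.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Greedily assign object loads to the least loaded processor in the group.
std::vector<double> balanceGroup(int procs, std::vector<double> objLoads) {
    std::sort(objLoads.rbegin(), objLoads.rend());          // heaviest first
    std::priority_queue<double, std::vector<double>, std::greater<double>>
        procLoad;
    for (int p = 0; p < procs; ++p) procLoad.push(0.0);
    for (double load : objLoads) {
        double lightest = procLoad.top(); procLoad.pop();
        procLoad.push(lightest + load);
    }
    std::vector<double> result;
    while (!procLoad.empty()) { result.push_back(procLoad.top()); procLoad.pop(); }
    return result;
}

int main() {
    // Two groups of 4 processors, each balancing only its own objects.
    std::vector<std::vector<double>> groupObjects = {
        {4.0, 3.0, 2.0, 2.0, 1.0, 1.0}, {5.0, 2.5, 2.5, 1.0}};
    for (std::size_t g = 0; g < groupObjects.size(); ++g) {
        auto loads = balanceGroup(4, groupObjects[g]);
        std::printf("group %zu per-proc load:", g);
        for (double l : loads) std::printf(" %.1f", l);
        std::printf("\n");
    }
    return 0;
}
```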

Page 29: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Improvement due to Load Balancing

Page 30: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Performance Improvement of SMP over non-SMP Mode on Jaguar

Page 31: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Strong Scaling on Jaguar (2)

[Scaling plot at 6,720; 53,760; 107,520; and 224,076 cores]

Page 32: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Weak Scaling on Intrepid (~1,466 atoms/core)

[Weak-scaling plot for 2M, 6M, 12M, 24M, 48M, and 100M-atom systems]

1. The 100M-atom system ONLY runs in SMP mode

2. Dedicating one core per node to communication in SMP mode (a 25% loss of compute cores) caused the performance gap

Page 33: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Conclusion and Future Work

• I/O bottleneck solved by parallelization

• An approach that optimizes both the application and its underlying runtime
  – SMP mode in the runtime

• Continue to improve performance
  – PME calculation

• Integrate and optimize new science codes

Page 34: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Acknowledgement

• Gengbin Zheng, Yanhua Sun, Eric Bohm, Chris Harrison, Osman Sarood for the 100M-atom simulation

• David Tanner for the implicit solvent work

• Machines: Jaguar@NCCS and Intrepid@ANL, supported by DOE

• Funding: NIH, NSF

Page 35: Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Thanks