
PVM-AMBER: A Parallel Implementation of the AMBER Molecular Mechanics Package for Workstation Clusters

ERIC SWANSON and TERRY P. LYBRAND* University of Washington, Center for Bioengineering, Molecular Bioengineering Program, Box 351750, Seattle, Washington 98195-1750

Received 16 August 1994; accepted 16 December 1994

ABSTRACT A parallel version of the popular molecular mechanics package AMBER suitable for execution on workstation clusters has been developed. Computer-intensive portions of molecular dynamics or free-energy perturbation computations, such as nonbonded pair list generation or calculation of nonbonded energies and forces, are distributed across a collection of Unix workstations linked by Ethernet or FDDI connections. This parallel implementation utilizes the message-passing software PVM (Parallel Virtual Machine) from Oak Ridge National Laboratory to coordinate data exchange and processor synchronization. Test simulations performed for solvated peptide, protein, and lipid bilayer systems indicate that reasonable parallel efficiency (70-90%) and computational speedup (2-5 times relative to serial runtimes) can be achieved with small workstation clusters (typically six to eight machines) for typical biomolecular simulation problems. PVM-AMBER is also easily and rapidly portable to different hardware platforms due to the availability of PVM for numerous computers. The current version of PVM-AMBER has been tested successfully on Silicon Graphics, IBM RS6000, DEC ALPHA, and HP 735 workstation clusters and heterogeneous clusters of these machines, as well as on CRAY T3D and Kendall Square KSR2 parallel supercomputers. Thus, PVM-AMBER provides a simple and cost-effective mechanism for parallel molecular dynamics simulations on readily available hardware platforms. Factors that affect the efficiency of this approach are discussed. © 1995 by John Wiley & Sons, Inc.

*Author to whom all correspondence should be addressed.

Journal of Computational Chemistry, Vol. 16, No. 9, 1131-1140 (1995) © 1995 by John Wiley & Sons, Inc.

Introduction

Computer simulation has become an important tool for the study of complex chemical systems. Techniques such as Monte Carlo or molecular dynamics statistical mechanical calculations are used routinely to study condensed-phase systems as well as polymers and large biomolecules. These simulation tools can provide information and insight that is difficult to obtain from direct experimental measurements.


However, the merit and reliability of computer simulation results are often limited by the quality of the underlying mathematical models used in calculations and/or by the extent of configurational sampling that can be accomplished with available computer resources. Enhancements in computational resources would enable longer simulations (i.e., more extensive sampling) for larger systems with more sophisticated and accurate mathematical models. Much effort at present focuses on efficient utilization of parallel computing platforms to perform longer or larger simulations. However, it is often a nontrivial task to develop a highly efficient parallel algorithm for molecular simulations, and such algorithms are usually not portable from one hardware platform to another. There are also many users who have only limited access to state-of-the-art massively parallel computers but may have access to a number of high-performance workstations. We have therefore set out to develop a parallel implementation of the molecular simulation package AMBER (ref. 1) that is both computationally efficient and highly portable for various parallel computing platforms (including both networked workstation clusters and parallel supercomputers).

In a typical molecular dynamics simulation of macromolecules, calculation of nonbonded energies and forces accounts for 90% or more of the total computation time, especially for simulations including solvent. For this reason, efforts to achieve greater speed have focused on this section of the calculation, regarding both algorithms and the use of vector or parallel machine architectures. For vector machines, the main difficulty is to design the programs such that critical loops can be vectorized by the compiler; this detailed work has largely been accomplished for widely used programs such as AMBER and CHARMM (ref. 6). In the case of parallel architectures, the methodology is still evolving with regard to both hardware and software. The principle is easily stated: the calculation should be arranged such that many processors can work simultaneously toward the result. In practice, efficiency is achieved if the division of labor exhibits "coarse granularity," that is, if intervals of parallel calculation are long relative to the time required to communicate between processors and perform serial calculations.
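For reference, the two quantities used to characterize this tradeoff in the benchmark tables later in the article, speedup and parallel efficiency, can be written down as a small sketch. The C fragment below is ours, not part of AMBER; the example values are the tripeptide elapsed times for the serial run and for four slaves taken from Table II.

    /* Illustrative definitions of speedup and parallel efficiency from measured
     * elapsed times; our sketch, not code from the AMBER package. */
    #include <stdio.h>

    static double speedup(double t_serial, double t_parallel)
    {
        return t_serial / t_parallel;
    }

    static double efficiency(double t_serial, double t_parallel, int nslaves)
    {
        return speedup(t_serial, t_parallel) / nslaves;
    }

    int main(void)
    {
        /* tripeptide benchmark: serial run versus four slaves (Table II) */
        printf("speedup = %.2f, efficiency = %.0f%%\n",
               speedup(15567.0, 5860.0),
               100.0 * efficiency(15567.0, 5860.0, 4));
        return 0;
    }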

For most macromolecular and condensed-phase applications, the nonbonded calculation in a molecular dynamics simulation requires calculation of interactions between many pairs of atoms. For large systems, an exhaustive approach, in which all pairwise interactions are calculated, would be very expensive. The usual practice is to restrict the calculation to interactions between atoms within some cutoff distance of one another. There are several techniques to accomplish this; the choice depends on both the size of the molecular system to be simulated and the software and hardware resources available. A grid cell approach, in which space is subdivided into cubical cells, usually such that only interactions between particles in neighboring cells need be considered, has advantages when simulating large numbers of atoms using hardware with many processors (ref. 11). By mapping processors to cells, parallel efficiency is achieved because communication between processors reflects only neighboring cell interactions.
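PVM-AMBER does not use the grid cell method, but the decomposition is easy to sketch. The C fragment below is our illustration (all names are ours): atoms are binned into cubical cells whose edge is at least the cutoff, so that the interaction partners of an atom need only be sought in its own cell and the 26 neighboring cells, and cells can then be mapped onto processors.

    /* Hypothetical sketch of a grid (cell) decomposition: each atom is assigned
     * to a cubical cell whose edge is at least the nonbonded cutoff, so candidate
     * partners of an atom lie in its own cell and the 26 neighboring cells. */
    #include <math.h>

    /* box[3]: orthorhombic box lengths; cutoff: nonbonded cutoff distance.
     * On return, ncell[3] holds the grid dimensions and cell_of[i] the linear
     * cell index of atom i. */
    void assign_cells(int natom, const double (*x)[3], const double box[3],
                      double cutoff, int ncell[3], int *cell_of)
    {
        for (int d = 0; d < 3; d++) {
            ncell[d] = (int)floor(box[d] / cutoff);
            if (ncell[d] < 1)
                ncell[d] = 1;
        }
        for (int i = 0; i < natom; i++) {
            int c[3];
            for (int d = 0; d < 3; d++) {
                double s = x[i][d] / box[d];
                s -= floor(s);                       /* wrap into the primary box */
                c[d] = (int)(s * ncell[d]);
                if (c[d] == ncell[d])
                    c[d] = ncell[d] - 1;             /* guard against s == 1.0 */
            }
            cell_of[i] = (c[2] * ncell[1] + c[1]) * ncell[0] + c[0];
        }
    }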

Another approach involves generating a list of pairs of atoms which lie within the cutoff distance and restricting the nonbonded calculation to partners contained in this list. While it takes significant time to generate the pair list, this is not too burdensome, since the list need not be updated at each dynamics step. Because atomic movements for a typical dynamics timestep (e.g., 1 femtosecond) are small relative to the cutoff distance, it suffices to update the pair list occasionally, perhaps every 25 steps. Furthermore, in a parallel implementation, the full pair list need not be accessed by each processor, a saving of both memory and time. Still, our choice to use the pair list approach was largely pragmatic, since the programs already incorporated pair list routines.
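A minimal sketch of the pair list idea, for the block of atoms assigned to one slave, is shown below. This is our simplified C version, not the actual AMBER routine; minimum-image periodic boundaries and bonded-exclusion lists are omitted for brevity.

    /* Hypothetical sketch of building a cutoff pair list for one block of atoms
     * (ifirst..ilast), as a slave would for its assigned atoms. */
    #include <stddef.h>

    /* Returns the number of pairs stored; pairs are written as (i, j) couples
     * into pair_i[] / pair_j[], which must be large enough to hold them. */
    size_t build_pair_list(int natom, const double (*x)[3], double cutoff,
                           int ifirst, int ilast, int *pair_i, int *pair_j)
    {
        size_t np = 0;
        double cut2 = cutoff * cutoff;

        for (int i = ifirst; i <= ilast; i++) {
            for (int j = i + 1; j < natom; j++) {
                double dx = x[i][0] - x[j][0];
                double dy = x[i][1] - x[j][1];
                double dz = x[i][2] - x[j][2];
                if (dx * dx + dy * dy + dz * dz <= cut2) {
                    pair_i[np] = i;
                    pair_j[np] = j;
                    np++;
                }
            }
        }
        return np;
    }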

In this article we describe the use of a virtual parallel machine, which consists of several workstations connected by Ethernet. These separate workstations effectively become a parallel processing system through the use of message-passing software (for an overview, see ref. 12). We chose to use the Parallel Virtual Machine (PVM) package developed at Oak Ridge National Laboratory (refs. 13, 14) because of its support for many hardware platforms and wide dissemination in computational science disciplines. The motivation for our approach was simply the availability of the hardware; the workstations are used for interactive molecular modeling during the day, while during off hours they can be used as a computational resource. This situation influenced our choice of methodology, which would likely be quite different if the target hardware were, for example, a massively parallel distributed-memory machine. Since multi-workstation sites are commonplace today, we believe that interest is strong in tapping the substantial compute power these workstations represent (ref. 15).


While parallel molecular dynamics programs are now relatively common, few previous attempts have been made to develop hardware-independent programs suitable for use on workstation clusters.*

Computing Methods

The programs MINMD, SANDER, and GIBBS from the AMBER suite were modified to incorporate parallel computation of nonbonded energies and forces. MINMD performs both energy minimization and molecular dynamics calculations. SANDER is similar to MINMD but can include constraints, which are typically obtained from nuclear magnetic resonance (NMR) data, to perform simulated annealing structure refinement calculations. GIBBS is a molecular dynamics program which calculates free-energy differences between two states. Its code is significantly different from MINMD, but the method for computing nonbonded energies and forces is the same.

For each of these programs, two programs were derived: a master version and a slave version. One workstation executes the master version and the remaining workstations execute the slave version. The master is responsible for assigning work to the slaves and providing them with required data and will also perform any computations not assigned to slaves. Each slave performs only those computations requested by the master, returning results to the master. In the case of molecular dynamics calculations, the master must provide each slave with coordinates of all the atoms at the beginning of every dynamics step, and each slave returns to the master the nonbonded energies and forces it has calculated. The modifications were functionally the same for MINMD, SANDER, and GIBBS.

For the master versions, (1) nonbonded computations were removed from the molecular dynamics routines, (2) calls were added early in the programs for the purpose of starting PVM and launching the slaves, (3) a routine was added which calculates how much work each slave should do (i.e., defines blocks of atoms for which each slave will calculate nonbonded energies and forces), (4) a routine was added in the molecular dynamics loop which sends updated coordinates to the slaves, (5) a routine was added to receive energies and forces from the slaves, and (6) a call to terminate the slaves was added.

The molecular dynamics loop on the master is now

    DO FOR N STEPS
        Send coordinates to slaves
        IF pair list update is needed
            Allocate work among slaves
            Direct slaves to generate new pair lists
        ENDIF
        Calculate energies and forces, except for nonbonded contributions
        Receive and sum nonbonded energies and forces from slaves
        Calculate new coordinates and velocities
    END DO

On each slave, the molecular dynamics loop has been reduced to

    DO UNTIL terminated by master
        Receive new coordinates from master
        IF pair list update was ordered by master
            Calculate new pair list for allocated atoms
        ENDIF
        Calculate nonbonded energies and forces
        Send energies and forces to master
    END DO
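The published programs implement this exchange with the PVM FORTRAN interface; as a language-neutral illustration, the sketch below expresses the same step with the PVM 3 C API. All identifiers here (TAG_COORDS, TAG_FORCES, compute_nonbonded, and the argument lists) are our own assumptions, and the real code also handles pair list update requests, energy terms, and termination messages.

    /* Hypothetical sketch (not the authors' Fortran code): coordinate broadcast
     * and force reduction for one dynamics step, using the PVM 3 C interface. */
    #include <pvm3.h>

    #define TAG_COORDS 10
    #define TAG_FORCES 20

    void compute_nonbonded(int natom, const double *x, double *f);  /* assumed routine */

    /* Master: send coordinates to every slave, then add each slave's returned
     * nonbonded force array into the total force array f (length 3*natom). */
    void master_step(int nslaves, int *slave_tid, int natom,
                     double *x, double *f, double *scratch)
    {
        pvm_initsend(PvmDataRaw);                  /* raw data, no XDR encoding */
        pvm_pkdouble(x, 3 * natom, 1);
        pvm_mcast(slave_tid, nslaves, TAG_COORDS);

        /* ... bonded energies and forces are computed here on the master ... */

        for (int s = 0; s < nslaves; s++) {
            pvm_recv(-1, TAG_FORCES);              /* accept results as they arrive */
            pvm_upkdouble(scratch, 3 * natom, 1);
            for (int i = 0; i < 3 * natom; i++)
                f[i] += scratch[i];
        }
    }

    /* Slave: receive coordinates, compute the assigned nonbonded contributions,
     * and return the resulting force array to the master. */
    void slave_step(int natom, double *x, double *f)
    {
        int master = pvm_parent();

        pvm_recv(master, TAG_COORDS);
        pvm_upkdouble(x, 3 * natom, 1);

        compute_nonbonded(natom, x, f);            /* fills f for the assigned pairs */

        pvm_initsend(PvmDataRaw);
        pvm_pkdouble(f, 3 * natom, 1);
        pvm_send(master, TAG_FORCES);
    }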

The slaves actually reproduce all calculations preceding the first dynamics step, executing essentially the same code as on the master. This is done because certain variables and arrays needed in the nonbonded calculations must have correct initial values (which remain invariant during dynamics), and it is simpler to compute these locally than to have the master send all these data to each slave. Occasionally, the master must send ancillary data to the slaves, such as the size of the periodic box in a constant-pressure calculation, but there is little such information. Because the slaves are so specialized, they can remain ignorant of net forces and velocities.

PVM version 3.2.6 was used for message passing via the FORTRAN interface routines included with PVM. Two configuration options were chosen for increased efficiency relative to the defaults. An option to pass raw data was used, which bypasses XDR data encoding. This was appropriate because of the homogeneous collection of machines; with heterogeneous machines, data encoding is required unless the machines conform to the same data representation standard. Also, an option to route messages directly between user processes (not via the PVM daemons) was chosen, which improved performance with no loss of functionality. This option is selected by calling pvmfadvise(PVMROUTEDIRECT, ierr).
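For readers more familiar with the C binding, the corresponding calls are roughly as follows. This is a hedged sketch, not the PVM-AMBER source; the direct-routing option appears as pvm_setopt(PvmRoute, PvmRouteDirect) in later PVM 3 releases, while PVM 3.2 exposed it through the advise call quoted above.

    /* Approximate C-interface equivalents of the two options described above. */
    #include <pvm3.h>

    void configure_messaging(void)
    {
        /* route messages directly between user tasks, bypassing the pvmd daemons */
        pvm_setopt(PvmRoute, PvmRouteDirect);

        /* begin each outgoing message with raw (unencoded) data rather than XDR;
         * appropriate only when all machines share the same data representation */
        pvm_initsend(PvmDataRaw);
    }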

An important feature of PVM is the freedom to execute UNIX shell scripts on slave machines, thereby providing a convenient means for managing files and doing ancillary tasks as well as the computational work. With suitable scripts in place, the user can do a series of computational runs with little attention to the slave machines.

The timing of events in one dynamics step is shown in Figure 1, taken from a calculation on lysozyme (described later) using four slaves. The master does useful work for part of the time that the slaves are busy with the nonbonded calculations (e.g., calculation of bond, bond angle, and torsion angle energies and forces). However, the master typically finishes in time to receive forces from the slaves without delay. In any case, PVM ensures that no data are lost regardless of timing considerations. It would also be possible to have the master participate in nonbonded calculations to the extent that time permits. This would improve efficiency, especially if only two or three slaves are used. In our environment, this is not desirable because the master is used to perform other tasks simultaneously while it controls a PVM dynamics job.

FIGURE 1. Time course of a molecular dynamics step (time axis in seconds), taken from the middle of a parallel calculation using four slaves. Arrows represent data movement: coordinates are sent from the master M to slaves S1-S4 (arrows pointing up); nonbonded energies and forces are calculated on the slaves and sent to the master (arrows pointing down). Arrows at the right show the beginning of the next dynamics step. The time shown for data movement includes buffer management as well as actual movement of data on the physical network. Heavy horizontal lines represent CPU activity. Note that data movement may overlap, as in the return of energies and forces from slaves 3 and 4 to the master.

LOAD BALANCING

Ideally, each slave will require an equal time to complete its portion of the nonbonded calculation. In practice, it is difficult for the master to know a priori what the workload allocation should be to achieve perfect performance parity for all the slaves. Initially, the master inspects the pair list (which it has calculated just once for this purpose) and allocates the work, assuming equal speed of each slave and that the time required will be proportional to the number of nonbonded pairs. Thereafter, the master keeps track of the time actually required by each slave. When a pair list update is due, the master reallocates the nonbonded calculation to correct for differences in execution times among the slaves.

For example, in a molecular system with 10,000 atoms and 2,000,000 nonbonded pairs, the master processor will analyze the pair list distribution and allocate work accordingly. If there are four slave processors to be used for the calculation, the master will split the workload into four (estimated) equivalent fragments. In this case, the pair list analysis might reveal the following pattern: atoms 1-1000 have a total of 500,000 nonbonded interaction pairs with the entire system, atoms 1001-2500 have a total of 500,000 nonbonded pairs, atoms 2501-5000 have a total of 500,000 nonbonded pairs, and atoms 5001-9999 account for the remaining 500,000 nonbonded interactions. The master would then assign atoms 1-1000 to slave processor 1, atoms 1001-2500 to slave processor 2, atoms 2501-5000 to slave processor 3, and the remaining atoms to slave processor 4. Once these work assignments are made, each slave is responsible for all aspects of the nonbonded energy and force calculations, including the calculation of the nonbonded pair list for its assigned atoms (i.e., at the first iteration, the pair list is actually calculated twice, once by the master for initial workload analysis and once by the slaves to perform the respective tasks). This parallel evaluation of the nonbonded pair list is possible because each slave receives a copy of coordinates and atom types for the entire system. After this initial workload distribution, the master never again calculates the nonbonded pair list. Instead, the master monitors the elapsed time performance for each slave in the cluster. If slave processor 3 is consistently two times slower than the other slave processors, the master will reduce the number of atoms assigned to slave 3 by something less than 50% (to avoid an excessive correction), redistributing the additional atoms as evenly as possible among the other slaves.


In the aforementioned example, slave processor 1 would perhaps now be responsible for atoms 1-1300, slave processor 2 would have atoms 1301-3100, slave processor 3 would have only atoms 3101-4700, and slave processor 4 would get atoms 4701-9999. In other words, slave processor 3 is now responsible for 900 fewer atoms, and the other three slave processors each get 300 additional atoms. Further fine-tuning of the workload partitions would continue until all slave processors are able to complete their assigned tasks in equivalent elapsed time periods. The workload adjustment is implemented, if needed, each time the nonbonded pair list update is scheduled. Provided there is not excessive fluctuation in the central processing unit (CPU) loads for each slave processor due to other tasks, or dramatic variation in the number of nonbonded pairs for clusters of atoms assigned to each slave, this simple load-balancing scheme can quickly find the optimal workload distribution without ever explicitly calculating the full nonbonded pair list serially after the initial iteration. This scheme has the advantage of compensating for differences in speed among the slaves, whether due to different intrinsic CPU speeds or variable CPU demands caused by, for example, an interactive graphics session or other background jobs running on some of the slaves.
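The rebalancing step can be sketched as follows. The proportional-to-measured-speed rule and the explicit damping factor below are our rendering of the scheme just described, not the exact AMBER code; atom counts are rescaled toward each slave's measured throughput, but only part of the indicated correction is applied at each pair list update.

    /* Sketch of a damped load-rebalancing step (our illustration).  atoms[s]
     * is the current number of atoms assigned to slave s, elapsed[s] the time
     * that slave took for the last interval; damping in (0,1] limits the size
     * of the correction applied at each pair list update. */
    void rebalance(int nslaves, const int *atoms, const double *elapsed,
                   double damping, int *new_atoms)
    {
        double total_rate = 0.0;
        int total_atoms = 0, assigned = 0;

        for (int s = 0; s < nslaves; s++) {
            total_rate  += atoms[s] / elapsed[s];   /* effective atoms per second */
            total_atoms += atoms[s];
        }
        for (int s = 0; s < nslaves; s++) {
            double share  = (atoms[s] / elapsed[s]) / total_rate;
            double target = share * total_atoms;    /* ideal new allocation */
            new_atoms[s]  = (int)(atoms[s] + damping * (target - atoms[s]));
            assigned     += new_atoms[s];
        }
        new_atoms[nslaves - 1] += total_atoms - assigned;  /* absorb rounding drift */
    }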

In the case of the GIBBS program, an additional modification was made to enhance parallel performance. Preliminary tests showed that, depending on the options selected for the run, significant time may be spent calculating contributions to the free energy which arise from constraints (the constraints are applied to selected distances, angles, and torsions and can help define the pathway from the initial to final state). Within this section of the calculation, a considerable number of nonbonded interactions must again be computed; furthermore, the program is designed to report free energies for both forward and reverse directions (the reason for this is explained fully in the GIBBS documentation). Fortunately, these calculations are performed by a distinct set of routines which are amenable to execution on separate slave processors. We chose to allocate two slaves for this purpose, for forward and reverse calculations, respectively. In typical solvated protein systems with roughly 10,000 atoms, use of four slaves for nonbonded calculations and two slaves for constraint calculations usually yields a reasonable load balance.

VERIFICATION OF COMPUTATIONS

We compared the forces and energies obtained from parallel computations with those obtained from the standard serial AMBER programs, for a variety of molecular systems and runtime options, using one to eight slave processors. For the initial dynamics step, forces and energies agreed to within expected precision for double-precision floating point calculations. For subsequent dynamics time steps, the results from parallel molecular dynamics (MD) calculations do not agree precisely with results obtained from serial calculations. The disagreement is insignificant in the first 200 steps or so but increases gradually, and eventually (after 1000 to 2000 steps) the trajectories are substantially different. We believe that this effect is caused by imprecision in the double-precision floating point summations of forces for each step. In the parallel versions of the programs, nonbonded forces calculated on each slave are sent to the master (as double-precision reals) and summed with the other contributions to forces. Therefore, the order of the summation is different from the serial version, and this leads to differences in the least significant bits of the forces. Although these differences are tiny, the semichaotic nature of molecular dynamics tends to amplify the differences over time.

To test whether these small differences in forces could account for the observed divergence of trajectories, the following experiment was done. A serial version of MINMD was made in which, after the force calculation but before the calculation of new coordinates, the least significant bit of 50% of the elements of the force vector was toggled (governed by a random number generator). Trajectories from MD runs on pancreatic trypsin inhibitor (data set 5PTI from the Protein Data Bank, ref. 16), using the standard MINMD program, modified MINMD program, and parallel MINMD program (three slaves), were compared (Table I). The similarity of the deviations as a function of the number of steps is indicative of a common cause and not the result of some unknown error in the program. In parallel runs, the order of summing the forces is nondeterministic, especially with load balancing in effect, and therefore perfectly reproducible trajectories are not obtained. We note further that the divergence of MD trajectories for a parallel versus a serial run is comparable to the divergence seen when a double-precision serial MD run is performed on two different hardware platforms.
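Both effects are easy to reproduce in a few lines. The short C program below (ours, not the modified MINMD code) shows that the same three numbers summed in a different order can give a different double-precision result, and how toggling the least significant mantissa bit of a double, as in the modified MINMD experiment, perturbs a value only in its last binary digit.

    /* Minimal illustration of (a) order-dependent floating-point summation and
     * (b) toggling the least significant mantissa bit of an IEEE-754 double. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    static double toggle_lsb(double x)
    {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);  /* reinterpret the IEEE-754 bit pattern */
        bits ^= 1ULL;                    /* flip the least significant mantissa bit */
        memcpy(&x, &bits, sizeof bits);
        return x;
    }

    int main(void)
    {
        double a = 1.0e16, b = -1.0e16, c = 1.0;
        printf("(a+b)+c = %.17g,  a+(b+c) = %.17g\n", (a + b) + c, a + (b + c));
        printf("f = %.17g,  f with LSB toggled = %.17g\n", 3.14, toggle_lsb(3.14));
        return 0;
    }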


TABLE I. Comparison of Molecular Dynamics Trajectories for Modified and Parallel MINMD Runs on BPTI, with Reference to a Standard (Serial) MINMD Calculation.

                rms Deviation in Coordinates (Å)
No. Steps    Modified MINMD    Parallel MINMD
   200          < 0.001           < 0.001
   500            0.001             0.001
  1000            0.036             0.009
  1500            0.230             0.219
  2000            0.535             0.511

Trajectory files store coordinates to 0.001 Å precision.


MOLECULAR DYNAMICS BENCHMARKS

Most benchmarks were run using a cluster of Silicon Graphics Indigo workstations with MIPS R3000 CPUs and 24-32 MB of memory. A single processor in a Silicon Graphics 4D/340S computer was used as the master CPU for most runs. Examination of system reports showed that available memory was not a limitation in these tests. The systems are directly connected by Ethernet and are physically close to one another, with no intervening network hardware such as bridges. Tests were run at times when there was no competing use of the systems, other than normal housekeeping functions, and when network traffic was expected to be light.

The smallest test case for molecular dynamics simulations was the tripeptide THR-ASN-VAL (all-atom model with acetyl and N-methyl amide groups capping the amino and carboxy termini, respectively), in a rectangular periodic water box extending 12.0 Å beyond the peptide. The system had 56 solute atoms and 996 water molecules, for a total of 3044 atoms. A 9.0-Å nonbonded cutoff distance was used, which resulted in approximately 600,000 nonbonded pairs in the calculation. Two thousand steps of dynamics were run with a 0.001-ps timestep, and the pair list was updated every 25 steps. The reference calculation using the standard MINMD program required 15,567 s elapsed time and 15,551 s CPU time, of which 15,038 s were spent in nonbonded and pair list calculations, as reported in the summary at the end of the output file. Results are summarized in Table II. The relative speedup as a function of number of slaves for this and the following test cases is shown in Figure 2.

A larger test case was lysozyme (data set 1LZ1 from the Protein Data Bank, ref. 16) with added hydrogens, in a rectangular periodic box extending 10.5 Å beyond the protein.

FIGURE 2. Speedup of molecular dynamics calculations as a function of the number of slaves employed, for three test cases: tripeptide, lysozyme, and lipid bilayer. Efficiency depends on the number of atoms and the number of nonbonded pairs (see text).


TABLE II. Molecular Dynamics Benchmarks for Tripeptide Test Case (3044 Atoms).

No. Slaves   Elapsed Time   Speedup   Master CPU   Slave CPU
    0           15567        1.000      15551          -
    1           15675        0.993        715       14488
    2            8900        1.749        937        7390
    3            6868        2.267       1170        4990
    4            5860        2.656       1419        3851
    5            5383        2.892       1641        3105
    6            4979        3.127       1907        2666
    7            4824        3.227       2149        2357
    8            4916        3.167       2375        2185

Times are in seconds; slave CPU times are averages. All nonbonded computations are done exclusively on slaves. The 0-slave case is a serial MINMD calculation.

This system had 2038 protein atoms and 5130 waters, for a total of 17,428 atoms. A 9.0-Å nonbonded cutoff distance was used, yielding approximately 3,900,000 nonbonded pairs in the dynamics calculation. One thousand steps (1 ps) of dynamics were run using the protocol outlined earlier for the tripeptide test. The serial calculation requires 56,218 s elapsed time and 56,160 s CPU time, of which 53,615 s were spent in nonbonded and pair list calculations. Statistics for these calculations are shown in Table III.

A third test case consisted of a lipid bilayer of 3000 atoms with 1610 waters, for a total of 7830 atoms. A 9.0-Å nonbonded cutoff distance was used, which resulted in approximately 3,700,000 nonbonded pairs in the calculation.

TABLE III. Molecular Dynamics Benchmarks for Lysozyme Test Case (17,428 Atoms).

No. Slaves   Elapsed Time   Speedup   Master CPU   Slave CPU
    0           56218        1.000      56161          -
    1           56325        0.998       2912       53924
    2           31298        1.796       3571       27632
    3           22766        2.469       4173       18651
    4           18719        3.003       5314       14122
    5           16638        3.379       5620       11563
    6           15544        3.617       6277        9920
    7           14882        3.776       7070        8704
    8           14791        3.801       7856        7933

Data are as described for the tripeptide case in Table II.

Five hundred steps (0.5 ps) of dynamics were run, using the same protocol outlined earlier. Timings are shown in Table IV. Results for the SANDER program are identical to those for MINMD, as would be expected, since the pair list and nonbonded energy routines are essentially identical for the two programs. Results for the parallel GIBBS program are generally comparable to the results reported here for MINMD and SANDER. However, a larger percentage of CPU work for the GIBBS program involves subroutines not modified for parallel computation, so the scaling factor for GIBBS parallel performance is smaller (10-15% less efficient than MINMD or SANDER).

As expected, the efficiency is highest with two slaves and drops as more slaves are added, since overhead becomes more significant. However, performance continues to increase through six slaves. The overhead referred to here includes CPU time on the master related to message passing, CPU time on each slave related to message passing, and actual communication. For example, in the lysozyme test case on the SGIs, the CPU time on the master increases roughly 700 s per slave, while on the slaves the decrease in CPU time per slave is about 700 s less than would be the case with 100% efficiency. These times reflect buffer management and account for the major part of the overhead. Time spent in data movement on the network is less significant. This agrees with the observation of Douglas et al. (ref. 17) that the cost of buffer management is significant. (We also used the profiling utility "prof" to survey the time spent in each subroutine. The results were consistent with our interpretation.)

As a further test of the hypothesis that buffer management is more significant than physical communication time in these benchmarks, two simultaneous parallel MINMD runs were performed on the SGI workstation cluster.

TABLE IV. Molecular Dynamics Benchmarks for Lipid Bilayer (7830 Atoms).

No. Slaves   Elapsed Time   Speedup   Master CPU   Slave CPU
    0           36516        1.000      36516          -
    1           35770        1.021        845       35048
    2           18272        1.998        987       17320
    4            9902        3.688       1282        8695
    6            7415        4.925       1616        5917
    8            6370        5.732       1960        4525

Data are as described for the tripeptide case in Table II.


One run was the lysozyme test case using three slaves; the other run was the tripeptide test case with three slaves, each for 5000 steps. The runs used separate processors but shared a common Ethernet network. The resulting speedup for the lysozyme case was 2.40; for the tripeptide case, the speedup was 2.78. These speedups are 96-97% of the speedups seen in the standalone benchmarks.

Further insight into the relative importance of CPU overhead versus network bandwidth can be obtained by repeating benchmarks with a cluster of faster workstations in the same network configuration. To perform some of these tests, we have used an SGI Indigo2 R4400 processor (on loan from Silicon Graphics) as a master with two, four, or six slower slave processors (Indigo R3000 systems) to rerun the lysozyme benchmarks. For these tests, more detailed measurements were made of CPU time spent in PVM routines on the master associated with sending and receiving data to and from the slaves. We also measured the elapsed time for data transmission to the slaves. These data were compared to comparable numbers generated when an R3000 system was used as the master processor and are shown in Table V.

When sending data from an R4400 master to the slaves, the master CPU is not a limiting factor; a combination of the slaves' ability to receive the data and network bandwidth presumably limits transfer speed in this arrangement. With an R3000 master, the CPU speed appears to be reasonably matched to network throughput.

TABLE V. Comparison of R3000 versus R4400 Master Processor Times in PVM Routines for the Lysozyme Benchmark.

Master   No. Slaves   Send CPU   Send Elapsed   Receive CPU   Runtime
R3000        2           475         1146           679        29771
R3000        4           976         2340          1416        18108
R3000        6          1439         3502          2332        15107
R4400        2           125         1005           152        28659
R4400        4           222         2254           325        17117
R4400        6           352         3451           537        14163

Send CPU and Receive CPU refer to the actual CPU times on the master associated with sending and receiving data to and from the slaves, respectively. Send Elapsed reports the total elapsed time for data transmission from the master to all slaves (elapsed time for data receipt from all slaves is impractical to measure with the analysis tools available at present). Runtime refers to the total runtime for each job.

The CPU burden when the master receives data from the slaves is somewhat greater than for data transmission to the slaves, perhaps because separate buffers receive data from each slave. With an R3000 master, the CPU time associated with data reception from six slaves is a significant fraction (15%) of the total runtime. With an R4400 master, CPU time for data reception from six slaves is much smaller, as shown in Table V. These results show clearly that a faster master processor improves overall performance for the cluster due to improved performance of the PVM data transmission management routines (i.e., improved buffer management performance).

Of course, it must be the case that at some point faster processors will overwhelm network bandwidth capacity, thus causing network bandwidth to become the limiting factor in overall performance. From the aforementioned tests, we can estimate that a cluster of six or more R4400 processors might overwhelm our thinwire Ethernet network for some problems, and a cluster of even faster processors will likely overmatch the thinwire Ethernet network for many jobs. We note that clusters of fast processors on our thinwire network will still yield good parallel performance for many jobs, but network bandwidth overhead would become the most significant performance limitation. Clusters of faster processors will require higher bandwidth networks to maintain the performance scaleups we have seen in our tests (on the other hand, our thinwire Ethernet network can hardly be considered a state-of-the-art high-bandwidth communications network these days). Since data transmission from slaves back to the master processor imposes greater demands on the network, additional performance enhancement can probably be obtained via better management of the data return process. One simple scheme to improve data return management would require each slave to send a request to the master for "permission" to return results. The master processor would then grant permission for each slave to send data, in order, thus eliminating the contention of slaves with each other for network bandwidth. Future versions of PVM-AMBER will implement this additional data transmission management scheme.
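A sketch of how such a handshake might look with the PVM C interface follows. This is entirely our illustration of the proposed scheme (the message tags, routine names, and the use of pvm_bufinfo to identify the requesting slave are assumptions); it is not code from the current PVM-AMBER release.

    /* Hypothetical sketch of the proposed "permission to send" handshake: each
     * slave announces that its results are ready, and the master grants
     * permission to one slave at a time, so only one result message crosses
     * the network at once. */
    #include <pvm3.h>

    #define TAG_READY   30
    #define TAG_GO      31
    #define TAG_FORCES  20

    /* Slave side: announce readiness, wait for the go-ahead, then send results. */
    void slave_return_forces(int natom, double *f)
    {
        int master = pvm_parent();

        pvm_initsend(PvmDataRaw);
        pvm_send(master, TAG_READY);               /* empty "ready" message */

        pvm_recv(master, TAG_GO);                  /* block until permission granted */

        pvm_initsend(PvmDataRaw);
        pvm_pkdouble(f, 3 * natom, 1);
        pvm_send(master, TAG_FORCES);
    }

    /* Master side: grant permission in arrival order, one slave at a time. */
    void master_collect_forces(int nslaves, int natom, double *f, double *scratch)
    {
        for (int s = 0; s < nslaves; s++) {
            int bytes, tag, tid;
            int bufid = pvm_recv(-1, TAG_READY);   /* wait for any ready slave */
            pvm_bufinfo(bufid, &bytes, &tag, &tid); /* identify which slave sent it */

            pvm_initsend(PvmDataRaw);
            pvm_send(tid, TAG_GO);                 /* grant permission */

            pvm_recv(tid, TAG_FORCES);             /* receive that slave's forces */
            pvm_upkdouble(scratch, 3 * natom, 1);
            for (int i = 0; i < 3 * natom; i++)
                f[i] += scratch[i];
        }
    }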

The efficiency of the parallel MINMD calculation depends on the size and character of the problem. For large systems, we expect that the time required for a dynamics step in a serial calculation will be approximately proportional to the number of nonbonded pairs. The overhead in the parallel calculation is roughly proportional to the number of atoms times the number of slaves.


Therefore, the speedup as a function of the number of slaves should go through a maximum. We observe this pattern, predicted by Janak and Pattnaik (ref. 18), up to the maximum of eight slaves available to us. The small tripeptide test case shows little decrease in execution time when more than four slaves are used, because increased overhead nullifies the gain from distributing the nonbonded calculation. For the lysozyme test case, the speedup curve levels off between seven and eight slaves. The lipid bilayer test case generates a larger number of nonbonded pairs relative to the number of atoms and is efficiently computed by the parallel approach when as many as eight slaves are employed.
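The argument can be made explicit with a toy cost model (ours; the coefficients below are illustrative assumptions, not values fitted to the benchmarks). If a step on p slaves costs T(p) = a*Npairs/p + b*Natoms*p, with the first term the distributed nonbonded work and the second the per-slave overhead, then the speedup relative to the serial step time a*Npairs peaks near p* = sqrt(a*Npairs / (b*Natoms)).

    /* Toy scaling model for speedup versus number of slaves; the cost
     * coefficients are illustrative assumptions, and the system sizes are
     * those of the lysozyme test case. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double a = 5.0e-6, b = 2.0e-5;         /* assumed cost coefficients (s) */
        const double npairs = 3.9e6, natoms = 17428.0;
        const double t_serial = a * npairs;          /* serial step: no overhead term */

        for (int p = 1; p <= 8; p++) {
            double t_p = a * npairs / p + b * natoms * p;
            printf("p = %d   estimated speedup = %.2f\n", p, t_serial / t_p);
        }
        printf("speedup peaks near p = %.1f\n", sqrt(a * npairs / (b * natoms)));
        return 0;
    }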

Table VI presents comparative results for the lipid bilayer test case run on a number of different single-processor and workstation cluster platforms.

TABLE VI. Performance Statistics for PVM-AMBER MD Runs on Workstation Clusters versus Some Single-Processor Simulations.

Processor        CPU Time (s)   Performance Ratio   Efficiency
1 Indigo*           350,000           43.2             100%
2 Indigos           185,500           22.9              95%
4 Indigos           100,000           12.3              88%
6 Indigos            76,000            9.4              77%
8 Indigos            64,600            8.0              68%
1 DEC ALPHA*         81,100           10.0             100%
2 DEC ALPHAs         48,700            6.0              84%
Cray Y-MP*           14,600            1.8               -
Cray C-90*            8,100            1.0               -

An asterisk (*) denotes a single-processor calculation. All Indigos are MIPS R3000 processors, and DEC ALPHAs are 3000/400 AXP models. The master processor for all cluster runs was an SGI 4D/340S (MIPS R3000 processor), and all machines are connected via thinwire Ethernet. CPU time indicates elapsed CPU time required for a 5-picosecond MD simulation on a dedicated workstation cluster; performance ratio is the simple CPU time ratio relative to Cray C-90 (single-processor) performance (i.e., the ratio of elapsed time on dedicated workstation clusters versus total execution time on a C-90); and efficiency indicates the efficiency of CPU utilization of slave CPUs relative to a single-processor run on that architecture, as determined by analysis of CPU times and system activity statistics. For example, 100% efficiency for a cluster with four slave processors indicates that a job runs exactly four times faster than it would on a single slave CPU; 75% efficiency for an eight-slave cluster indicates that a job runs six times faster than it would on a single slave CPU; etc.

As can be seen from these results, relatively small workstation clusters (four to eight SGI Indigos or a few DEC ALPHAs) connected by thinwire Ethernet can yield respectable speedups for modest to large MD simulations. These numbers strongly suggest that parallel MD simulation on workstation clusters can be extremely cost-effective.

Conclusion

We have demonstrated the feasibility of performing molecular dynamics computations on workstation clusters configured as a virtual parallel machine. In comparison to the performance achieved on certain problems using parallel supercomputers (refs. 19-21), the speedups we obtain are modest. Still, our typical speedups of two- to fivefold for small workstation clusters should be useful for many applications.

Our results clearly indicate that we have achieved our goal of highly portable parallel code. We have successfully ported PVM-AMBER to run on clusters of Silicon Graphics, IBM RS6000, DEC ALPHA, and HP workstations (and heterogeneous workstation clusters) as well as Silicon Graphics multiprocessor workstations and CRAY T3D and Kendall Square KSR2 parallel supercomputers. Ports to these diverse hardware platforms have generally been quite easy, requiring only 1 to 2 h of work to make necessary code adjustments (e.g., changes to CPU timing and file handling routines), recompile and successfully execute the programs, and verify results for standard test cases. As outlined in this article, the computational efficiency depends on specific system size and characteristics (i.e., systems with a high ratio of nonbonded interaction pairs to total atoms perform well on workstation clusters, while smaller systems run less efficiently). We do not yet have enough definitive results to compare performance for PVM-AMBER versus native-mode parallel code on parallel supercomputers. It is clear that our custom codes for specific parallel architectures do perform better than the PVM version. In some cases, the performance difference has been substantial (as much as 30%) in favor of our native-mode parallel codes. However, we expect that as we adopt newer versions of PVM that support shared-memory architectures better, the PVM-AMBER code will compare more favorably with native parallel code on some parallel machines. A number of hardware vendors now promise custom versions of PVM, tuned for optimal performance on their machines.


Thus, it is likely that PVM-based parallel codes will also become a reasonable alternative from an efficiency standpoint for at least some parallel machines in the future.

There will always be situations in which a parallel computation is not the most cost-effective solution. For example, potential of mean force calculations are generally performed by defining a reaction coordinate for the process of interest and then subdividing the reaction coordinate into segments that will be sampled in individual Monte Carlo or MD simulations. The optimal "parallel" implementation in such cases is to use each available processor to perform a separate simulation, thus achieving 100% efficiency. However, many problems require long simulations to obtain adequate configurational sampling or involve extremely large molecular systems, such that total runtime becomes a major concern. These classes of problems will benefit from the enhanced throughput offered by parallel simulation programs, and parallel computation on workstation clusters should be cost-effective for many of these problems.

Acknowledgments

This work was supported in part by grants from the National Science Foundation (DMB-9196006 and MCB-9405405), the Whitaker Foundation, Silicon Graphics, and Digital Equipment Corporation. We also wish to thank the Pittsburgh Supercomputer Center for technical support and CPU resources to help port and test PVM-AMBER on the PSC DEC ALPHA supercluster, and the technical staff at Kendall Square Research for assistance with porting and testing for the University of Washington KSR2.

References

1. D. A. Pearlman, D. A. Case, J. C. Caldwell, G. L. Seibel, U. C. Singh, P. Weiner, and P. Kollman, AMBER 4.0, University of California at San Francisco, 1991.

2. M. A. Shifman, A. Windemuth, K. Schulten, and P. L. Miller, Comp. Biomed. Res., 25, 168 (1992).

3. T. W. Clark and J. A. McCammon, Comp. and Chem., 14, 219 (1990).

4. F. Mueller-Plathe, Comp. Physics Comm., 61, 285 (1990).

5. J. E. Mertz, D. J. Tobias, C. L. Brooks, and U. C. Singh, J. Comp. Chem., 12, 1270 (1991).

6. B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus, J. Comp. Chem., 4, 187 (1983).

7. T. W. Clark, K. Kennedy, and L. R. Scott, Proceedings Scalable High Performance Computing SHPCC-92, 98 (1992).

8. S. L. Lin, J. Mellor-Crummey, B. M. Pettitt, and G. N. Phillips, Jr., J. Comp. Chem., 13, 1022 (1992).

9. W. F. van Gunsteren, H. J. C. Berendsen, F. Colonna, D. Perahia, J. P. Hollenberg, and D. Lellouch, J. Comp. Chem., 5, 272 (1984).

10. R. D. Skeel, J. Comp. Chem., 12, 175 (1991).

11. K. Esselink, B. Smit, and P. A. J. Hilbers, J. Comp. Physics, 106, 101 (1993).

12. O. A. McBryan, Parallel Computing, 20, 417 (1994). The remainder of this issue describes specific message-passing implementations.

13. G. A. Geist and V. S. Sunderam, Concurrency: Pract. Exper., 4, 293 (1992).

14. J. Dongarra, G. A. Geist, R. Manchek, and V. S. Sunderam, Comp. in Phys., 7, 166 (1993). PVM may be obtained by anonymous ftp from cs.utk.edu in directory pub/xnetlib, or via e-mail server by sending the message "send index from pvm3" to [email protected].

15. D. W. Duke, Comp. Phys., 7, 176 (1993).

16. F. C. Bernstein, T. F. Koetzle, G. J. B. Williams, E. F. Meyer, Jr., M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, J. Mol. Biol., 112, 535 (1977).

17. C. C. Douglas, T. G. Mattson, and M. H. Schultz, Parallel Programming Systems for Workstation Clusters, Technical Report YALEU/DCS/TR-975, Department of Computer Science, Yale University, 1993. (Available from casper.na.cs.yale.edu in /pub/tr975.ps.)

18. J. F. Janak and P. C. Pattnaik, J. Comp. Chem., 13, 533 (1992).

19. H. Sato, Y. Tanaka, H. Iwama, S. Kawakika, M. Saito, K. Morikami, T. Yao, and S. Tsutsumi, Proceedings Scalable High Performance Computing SHPCC-92, 113 (1992).

20. S. E. DeBolt and P. A. Kollman, J. Comp. Chem., 14, 312 (1993).

21. W. S. Young and C. L. Brooks III, J. Comp. Chem., 15, 44 (1994).
