Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms
Reporter: Jilin Zhang
Authors: Changjun Hu, Yali Liu, and Jianjiang Li
Information Engineering School, University of Science and Technology Beijing, Beijing, P.R.China
Outline
1 Motivation
2 Related Works
3 Spatial Decomposition Coloring (SDC) Approach
4 Short-Range Forces Calculations of EAM using SDC method
5 Experiments and Discussion
6 Conclusion and Future Directions
1 Motivation
The process of molecular dynamics simulations
Fig. 1 The process of molecular dynamics simulations (panels: set init_state, calculate forces, calculate new positions of atoms).
- The intensive computation appears in the short-range force calculation procedure of MD simulations.
- The neighbor-list method greatly reduces this computation: each atom interacts only with the atoms in its neighbor region.
- Newton's third law can halve the force computations, but it introduces reduction operations on irregular arrays.
for (i = 0; i < N; i++) {
    neighstart = neighindex[i];
    neighend = neighstart + neighlen[i];
    for (k = neighstart; k < neighend; k++) {
        j = neighlist[k];
        xd = Coord[j][X] - Coord[i][X];
        yd = Coord[j][Y] - Coord[i][Y];
        zd = Coord[j][Z] - Coord[i][Z];
        …
        forc = …
        force[i][X] += forc*xd;
        force[i][Y] += forc*yd;
        force[i][Z] += forc*zd;
        force[j][X] -= forc*xd;
        force[j][Y] -= forc*yd;
        force[j][Z] -= forc*zd;
    }
}
Fig. 2 Code of the force calculations.
2 Related Works --- parallel reduction operations on irregular arrays
Some types of solutions:
- enclosing the reduction operation in a critical section
- privatizing the reduction array
- using a redundant computations strategy
Enclosing the reduction operation in a critical section
- Create a critical section in the inner loop.
- Advantage: straightforward and easy to parallelize.
- Disadvantage: high synchronization cost, caused by the critical region, atomic operation or lock in the inner loop.
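A minimal sketch of this approach, using the atomic variant named above. It is reduced to one dimension, and `pair_force` is a hypothetical stand-in for the elided `forc` expression; the neighbor-list arrays follow the layout of Fig. 2 with each pair stored once.

```c
#include <assert.h>

/* Hypothetical 1D pair force (a unit-stiffness spring): force on i from j. */
static double pair_force(double xi, double xj) { return xj - xi; }

/* Half neighbor list (each pair stored once). Every update of the shared
   force array is protected, because another thread may touch the same atom. */
void forces_atomic(int n, const double *x, const int *neighindex,
                   const int *neighlen, const int *neighlist, double *force)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) force[i] = 0.0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int neighend = neighindex[i] + neighlen[i];
        for (int k = neighindex[i]; k < neighend; k++) {
            int j = neighlist[k];
            double f = pair_force(x[i], x[j]);
            #pragma omp atomic
            force[i] += f;          /* synchronization in the inner loop */
            #pragma omp atomic
            force[j] -= f;          /* Newton's third law */
        }
    }
}
```

The two atomic updates per pair are where the synchronization cost of this approach comes from.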
Privatizing the reduction array
- Each thread accumulates into its own private array, then updates the shared array in a critical region according to the values of its private array.
- This reduces the number of entries into the critical region and so reduces the synchronization cost.
- The high memory overhead of the private arrays limits the number of particles allowed in a simulation.
- The private arrays compete for cache space and decrease program speed.
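A minimal sketch of array privatization under the same simplifications: one dimension, a hypothetical `pair_force`, and a half neighbor list. The function name and the use of `calloc` for the private copy are illustrative choices, not taken from the slides.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical 1D pair force (a unit-stiffness spring): force on i from j. */
static double pair_force(double xi, double xj) { return xj - xi; }

void forces_privatized(int n, const double *x, const int *neighindex,
                       const int *neighlen, const int *neighlist, double *force)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) force[i] = 0.0;

    #pragma omp parallel
    {
        /* One full-size private array per thread: this is the memory overhead. */
        double *priv = calloc(n > 0 ? (size_t)n : 1, sizeof *priv);
        assert(priv != NULL);

        #pragma omp for
        for (int i = 0; i < n; i++) {
            int neighend = neighindex[i] + neighlen[i];
            for (int k = neighindex[i]; k < neighend; k++) {
                int j = neighlist[k];
                double f = pair_force(x[i], x[j]);
                priv[i] += f;       /* no synchronization needed here */
                priv[j] -= f;
            }
        }

        /* Merge step: each thread enters the critical region once. */
        #pragma omp critical
        for (int i = 0; i < n; i++) force[i] += priv[i];

        free(priv);
    }
}
```

The critical region is entered once per thread instead of once per pair, at the price of one array of `n` entries per thread.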
Redundant computations strategy
- Does not use Newton's third law, so each pair interaction has to be calculated twice.
- High parallelizability, since the data dependence between loop iterations has been removed.
- The computation is doubled, and the neighbor list requires more memory space.
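A minimal sketch of the redundant computations strategy, again in one dimension with a hypothetical `pair_force`. Here the neighbor list is assumed to be a full list: atom j appears in the list of i and atom i appears in the list of j.

```c
#include <assert.h>

/* Hypothetical 1D pair force (a unit-stiffness spring): force on i from j. */
static double pair_force(double xi, double xj) { return xj - xi; }

/* Full neighbor list: every pair is stored, and evaluated, twice.
   Iteration i writes only force[i], so no synchronization is needed. */
void forces_redundant(int n, const double *x, const int *neighindex,
                      const int *neighlen, const int *neighlist, double *force)
{
    assert(n >= 0);

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double fi = 0.0;
        int neighend = neighindex[i] + neighlen[i];
        for (int k = neighindex[i]; k < neighend; k++) {
            int j = neighlist[k];
            fi += pair_force(x[i], x[j]);
        }
        force[i] = fi;
    }
}
```

Compared with the half-list versions, `neighlist` is twice as long and `pair_force` is called twice per pair.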
3 Spatial Decomposition Coloring (SDC) Approach
Spatial Decomposition (SD) method
- Used on distributed-memory multiprocessors involving several hundred processors.
- There, the programmer must change all array declarations and all loop bounds, and explicitly code the periodic transfer of boundary data between processors.
- It is simple to implement SD in OpenMP.
- The SD method places a restriction on parallelism in OpenMP.
- Synchronization is required to ensure that multiple threads do not attempt to update the same atom simultaneously.
Fig. 3 SD method (numbered subdomains, with the distance rc marked).
SDC method
The SDC method consists of the following steps:
Step 1): Split domain
Step 2): Coloring subdomains
Step 3): Parallel Computing
Step 1): Split domain
- Split the spatial domain into subdomains.
- The length of a subdomain must be longer than the diameter of an atom's neighbor region.
- The number of subdomains in each decomposed dimension should be even.
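The two constraints of Step 1 can be sketched as a small helper that picks the subdomain count along one axis. The function name is hypothetical, and the "diameter" is assumed here to be twice the cutoff radius `rc` of the neighbor region.

```c
#include <assert.h>

/* Largest even number of subdomains along one axis such that each subdomain
   is strictly longer than 2*rc. Returns 0 if the box is too small to split.
   Assumption: "diameter" means 2*rc, the diameter of the neighbor region. */
int subdomains_per_axis(double box_len, double rc)
{
    assert(box_len > 0.0 && rc > 0.0);
    int n = (int)(box_len / (2.0 * rc));    /* subdomains of length >= 2*rc */
    if (n % 2 != 0) n--;                    /* the count must be even */
    while (n >= 2 && box_len / n <= 2.0 * rc)
        n -= 2;                             /* enforce "strictly longer" */
    return n >= 2 ? n : 0;
}
```

For a box of length 100 and `rc` = 5, ten subdomains would have length exactly 10, which is not longer than the diameter, so the helper falls back to 8.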
Step 2): Coloring subdomains
- The number of subdomains with each color must be equal.
- Each subdomain is surrounded only by subdomains with different colors.
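One coloring that satisfies both rules is a parity coloring, with two colors per decomposed dimension. This is an illustrative assumption; the slides do not state which coloring the authors use. The sketch below also includes a hypothetical checker for the "surrounded only by different colors" rule on a periodic grid.

```c
#include <assert.h>

/* Parity coloring (an assumption): the color is built from the parity of the
   subdomain's grid index in each dimension, giving 2, 4 or 8 colors. */
int subdomain_color(int ix, int iy, int iz)
{
    return (ix & 1) | ((iy & 1) << 1) | ((iz & 1) << 2);
}

/* Returns 1 if, on a periodic nx*ny*nz grid, every subdomain has a different
   color from all of its surrounding subdomains, and 0 otherwise. */
int coloring_is_valid(int nx, int ny, int nz)
{
    assert(nx > 0 && ny > 0 && nz > 0);
    for (int ix = 0; ix < nx; ix++)
    for (int iy = 0; iy < ny; iy++)
    for (int iz = 0; iz < nz; iz++)
        for (int dx = -1; dx <= 1; dx++)
        for (int dy = -1; dy <= 1; dy++)
        for (int dz = -1; dz <= 1; dz++) {
            int jx = (ix + dx + nx) % nx;   /* periodic wrap */
            int jy = (iy + dy + ny) % ny;
            int jz = (iz + dz + nz) % nz;
            if (jx == ix && jy == iy && jz == iz)
                continue;                   /* the subdomain itself */
            if (subdomain_color(jx, jy, jz) == subdomain_color(ix, iy, iz))
                return 0;
        }
    return 1;
}
```

With an even count in every decomposed dimension, each parity class holds half of the indices, so every color gets the same number of subdomains. An odd count breaks the rule at the periodic wrap, which is one way to read the "should be even" requirement of Step 1.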
Step 3): Parallel Computing
- Calculations of forces on subdomains with one color can run in parallel.
- A barrier is needed to wait for all threads to complete the computation on this color.
- Calculations on subdomains with different colors must run in a serial fashion.
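Step 3 can be sketched as a runnable loop nest. The sketch is one-dimensional, `pair_force` is a hypothetical stand-in for the real force expression, and subdomain `s` is assumed to have color `s % colors`. The arrays `pstart` and `partindex` list the atoms of each subdomain in the same compressed layout as the neighbor list.

```c
#include <assert.h>

/* Hypothetical 1D pair force (a unit-stiffness spring): force on i from j. */
static double pair_force(double xi, double xj) { return xj - xi; }

void forces_sdc(int n, int subdomains, int colors, const double *x,
                const int *pstart, const int *partindex,
                const int *neighindex, const int *neighlen,
                const int *neighlist, double *force)
{
    assert(colors > 0 && subdomains % colors == 0);
    for (int i = 0; i < n; i++) force[i] = 0.0;

    #pragma omp parallel
    for (int c = 0; c < colors; c++) {          /* colors: one after another */
        #pragma omp for
        for (int s = c; s < subdomains; s += colors) {   /* same color: parallel */
            for (int p = pstart[s]; p < pstart[s + 1]; p++) {
                int i = partindex[p];
                int neighend = neighindex[i] + neighlen[i];
                for (int k = neighindex[i]; k < neighend; k++) {
                    int j = neighlist[k];
                    double f = pair_force(x[i], x[j]);
                    force[i] += f;              /* unprotected updates */
                    force[j] -= f;
                }
            }
        }
        /* The implicit barrier at the end of "omp for" is the barrier of
           Step 3: no thread starts the next color early. */
    }
}
```

The unprotected updates are only safe when the subdomains obey the length and coloring rules of Steps 1 and 2; the function itself does not check them.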
SDC method: advantages
- The neighbor list is usually not updated at every time step, so the cost of the SDC method is very low.
- A higher-dimensional decomposition creates more subdomains.
- Scalable and suitable for multi-core and many-core architectures.
SDC method: disadvantage
- As a Spatial Decomposition method, it is prone to load imbalance unless the simulated system has uniform density.
4 Short-Range Forces Calculations of EAM using SDC method
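EAM method: short-range forces
- The short-range forces are the intensive computation.
- Three computational phases:
  1. electron density: $\rho_i = \sum_{j \neq i}^{N} \phi(r_{ij})$
  2. derivative of the embedding function: $F'(\rho_i)$
  3. force: $\mathbf{F}_i = -\sum_{j \neq i}^{N} \left( V'(r_{ij}) + F'(\rho_i)\,\phi'(r_{ij}) + F'(\rho_j)\,\phi'(r_{ij}) \right) \frac{\mathbf{r}_{ij}}{r_{ij}}$, with $\mathbf{r}_{ij} = \mathbf{r}_i - \mathbf{r}_j$
- The most time-consuming parts are phases 1 and 3.
Fig. 4 Short-range forces in the EAM method.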
The parallel procedure of short-range forces calculations using SDC method
1) Run the electron density computations using the SDC method.
2) Calculate the embedding function values and their derivatives in parallel.
3) Run the force calculations using the SDC method.
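The three phases can be sketched end to end in one dimension. The functions `phi`, `dphi`, `dV` and `dF`, the cutoff `RC`, and the rule that subdomain `s` has color `s % colors` are all toy assumptions chosen so the example is self-contained; they are not the potentials used in the experiments.

```c
#include <assert.h>

#define RC 2.0  /* hypothetical cutoff radius */

/* Toy EAM ingredients (assumptions, not a real Fe potential). */
static double phi(double r)  { return (RC - r) * (RC - r); }  /* density term   */
static double dphi(double r) { return -2.0 * (RC - r); }      /* phi'(r)        */
static double dV(double r)   { return -4.0 * (RC - r); }      /* V = 2(RC-r)^2  */
static double dF(double rho) { return -1.0 / (1.0 + rho); }   /* F = -ln(1+rho) */

void eam_forces_sdc(int n, int subdomains, int colors, const double *x,
                    const int *pstart, const int *partindex,
                    const int *neighindex, const int *neighlen,
                    const int *neighlist,
                    double *rho, double *dfrho, double *force)
{
    assert(colors > 0 && subdomains % colors == 0);
    for (int i = 0; i < n; i++) { rho[i] = 0.0; force[i] = 0.0; }

    #pragma omp parallel
    {
        /* Phase 1: electron densities, color by color (scatter to i and j). */
        for (int c = 0; c < colors; c++) {
            #pragma omp for
            for (int s = c; s < subdomains; s += colors)
                for (int p = pstart[s]; p < pstart[s + 1]; p++) {
                    int i = partindex[p];
                    for (int k = neighindex[i]; k < neighindex[i] + neighlen[i]; k++) {
                        int j = neighlist[k];
                        double d = x[j] - x[i];
                        double r = d < 0.0 ? -d : d;
                        rho[i] += phi(r);
                        rho[j] += phi(r);
                    }
                }
        }

        /* Phase 2: embedding derivative, independent per atom. */
        #pragma omp for
        for (int i = 0; i < n; i++) dfrho[i] = dF(rho[i]);

        /* Phase 3: forces, color by color (scatter to i and j). */
        for (int c = 0; c < colors; c++) {
            #pragma omp for
            for (int s = c; s < subdomains; s += colors)
                for (int p = pstart[s]; p < pstart[s + 1]; p++) {
                    int i = partindex[p];
                    for (int k = neighindex[i]; k < neighindex[i] + neighlen[i]; k++) {
                        int j = neighlist[k];
                        double d = x[j] - x[i];
                        double r = d < 0.0 ? -d : d;
                        /* Force on i: -(V' + (F'_i + F'_j) phi') (x_i - x_j)/r */
                        double f = (dV(r) + (dfrho[i] + dfrho[j]) * dphi(r)) * d / r;
                        force[i] += f;
                        force[j] -= f;
                    }
                }
        }
    }
}
```

Phases 1 and 3 both scatter into arrays indexed by `j`, which is why they need the color loop, while phase 2 writes only `dfrho[i]` and can use a plain parallel loop.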
Force calculations based on SDC method
L1: loops over the colors (computations on subdomains with different colors)
L2: computations on subdomains with the same color
L3: deals with all atoms that constitute a subdomain
L4: deals with the neighbors of an atom
#pragma omp parallel private(cpart)
for (cpart = 0; cpart < colors; cpart++) {                              /* L1 */
    ...
    #pragma omp for private(spart,i,j,k,…)
    for (spart = cpart; spart < subdomains; spart += colors)            /* L2 */
        for (ipart = pstart[spart]; ipart < pstart[spart+1]; ipart++) { /* L3 */
            i = partindex[ipart];
            neighstart = neighindex[i];
            neighend = neighstart + neighlen[i];
            for (k = neighstart; k < neighend; k++) {                   /* L4 */
                j = neighlist[k];
                …
                forc = …
                force[i][X] += forc*xd;
                force[i][Y] += forc*yd;
                force[i][Z] += forc*zd;
                force[j][X] -= forc*xd;
                force[j][Y] -= forc*yd;
                force[j][Z] -= forc*zd;
            }
        }
}
Fig. 5 Force calculations using SDC.
5 Experiments and Discussion
Experimental environment
- Four Intel Xeon(R) quad-core E7320 processors (L2 cache 4 MB), 16 GB memory
- OS: Fedora release 9 with kernel 2.6.25
- Compiler: gcc 4.3.0
Experimental cases
- Observe the micro-deformation behaviors of pure Fe metal material
- Taken from the XMD program, under periodic boundary conditions
- Initial state: body-centered cubic (bcc) lattice arrangement
Test cases
- Small-scale case (1): 54,000 atoms
- Medium-scale case (2): 265,302 atoms
- Large-scale case (3): 1,062,882 atoms
- Large-scale case (4): 3,456,000 atoms
Speedup

Small case (1)
Cores              2      3      4      8      12     16
SDC (one-dim)      1.71   2.46   3.07   4.17   -      -
SDC (two-dim)      1.70   2.46   3.07   4.74   5.90   6.43
SDC (three-dim)    1.66   2.40   2.99   4.61   5.74   6.30

Medium case (2)
Cores              2      3      4      8      12     16
SDC (one-dim)      1.84   2.64   3.37   6.24   6.33   -
SDC (two-dim)      1.84   2.65   3.39   6.20   8.89   10.90
SDC (three-dim)    1.82   2.65   3.36   6.16   8.76   10.78

Large case (3)
Cores              2      3      4      8      12     16
SDC (one-dim)      1.86   2.76   3.67   6.82   9.76   9.59
SDC (two-dim)      1.87   2.78   3.64   6.74   9.73   12.31
SDC (three-dim)    1.86   2.75   3.64   6.64   9.65   12.29

Large case (4)
Cores              2      3      4      8      12     16
SDC (one-dim)      1.88   2.79   3.66   6.30   9.97   9.82
SDC (two-dim)      1.87   2.80   3.65   6.77   9.84   12.42
SDC (three-dim)    1.87   2.80   3.67   6.74   9.82   12.34

Table 1. The Speedups of Spatial Decomposition Coloring (SDC) Methods
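The speedups in Table 1 can also be read as parallel efficiency, the speedup divided by the number of cores. This is the standard definition, not a quantity reported in the slides. For the two-dimensional SDC method on 16 cores:

```latex
E(p) = \frac{S(p)}{p}, \qquad
E_{\text{case (4)}}(16) = \frac{12.42}{16} \approx 0.78, \qquad
E_{\text{case (1)}}(16) = \frac{6.43}{16} \approx 0.40
```

The gap between the two values reflects the trend in the table: the larger the case, the more of the 16 cores are put to use.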
Scalability of our SDC method
- The performance of the multi-dimensional SDC method improves as the number of cores and the number of atoms increase.
Performance of the SDC methods
- The two-dimensional SDC method achieves the highest efficiency. The two-dimensional decomposition strives to make subdomains with small surface area and large volume, which results in better cache locality than the one-dimensional decomposition strategy.
- The three-dimensional SDC method performs slightly worse because of the higher overhead of thread fork-join and scheduling.
Fig. 6 Speedup versus number of cores (2, 3, 4, 8, 12, 16) of the two-dimensional Spatial Decomposition Coloring (SDC) method, Critical Section (CS) method, Share Array Privatization (SAP) method and Redundant Computations (RC) method on cases (1) to (4).
- The SDC method achieves a nearly linear speedup, and a higher speedup than the other methods. The speedup is nearly linear because the low synchronization cost of the implicit barriers in our method is amortized over a large amount of computation.
- The CS method achieves the lowest efficiency, because it encloses the reduction operations on the irregular array in a critical section.
- The performance of the SAP method degrades as the number of executing cores increases (memory overhead + synchronization overhead).
- RC vs. SDC: the RC method does nearly twice the computation work of the SDC method for the short-range force calculations, so its efficiency is lower than that of the SDC method.
6 Conclusion and Future Directions
A scalable spatial decomposition coloring (SDC) method
- It solves a class of short-range force calculation problems on shared-memory multi-core platforms.
- It is scalable not only to large simulation systems but also to many-core architectures.
Future directions
- Study the SDC method on NUMA memory architectures.
- Implement the SDC method using MPI+OpenMP on multi-core clusters.
Thank You!