25
One-sided Communication Implementation in FMO Method J. Maki, Y. Inadomi, T. Takami, R. Susukit a , H. Honda, J. Ooba, T. Kobayashi, R. Nogita, K. Inoue and M. Aoyagi Computing and Communications Center, Kyushu University Fukuoka Industry, Science & Technology Foundation

One-sided Communication Implementation in FMO Method J. Maki, Y. Inadomi, T. Takami, R. Susukita †, H. Honda, J. Ooba, T. Kobayashi, R. Nogita, K. Inoue

Embed Size (px)

Citation preview

One-sided Communication Implementation in FMO Method

J. Maki, Y. Inadomi, T. Takami, R. Susukita†, H. Honda,J. Ooba, T. Kobayashi, R. Nogita, K. Inoue and M. Aoyagi

Computing and Communications Center, Kyushu University†Fukuoka Industry, Science & Technology Foundation

Overview of FMO method (1)• FMO(Fragment Molecular Orbital method) has been devel

oped by Kitaura (AIST) and co-workers to calculate electronic states of a macromolecule such as a protein and nucleotide

• The target molecule is divided into fragments (monomers)

• ab initio molecular orbital (MO) calculation is carried out on each fragment and fragment pair (dimer).

• Usually executed with high parallel efficiency (>90%)

• The method can execute all-electron calculation on a protein molecule with 10,000 atoms.

Example of fragmentation

Hamiltonian

monomer calculation

Overview of FMO method (2)

1

| |[ ]I J

i i

i ji I J I i j I

H h V

rr r

environmentalelectrostatic potential

(potential from other monomers)

monomer calculation depends on other monomers electron density

electron density of monomer J

,

KKK A

A K i j KA

ZV d

r

r rr R r r

JJ J

iterated until all electron densities unchanged(SCC procedure)

I I I IH E

Overview of FMO method (3)

Schrødinger equation

basis functions

2 I I II I I IIj k ji ki j jk

jk i jk

C C D

MO coefficients

density matrix elements

RHF (Restricted Hartree-Fock) SCF method

Molecular Orbital (MO)

electron density

number of basis functions

Hamiltonian

dimer calculation

( 2)IJ I

I J I

E E N E

Total energy

IJ IJ IJ IJH E

Overview of FMO method (4)

,

1

| |[ ]IJ K

i i

i ji IJ K I J i j IJ

H h V

rr r

environmental electrostatic potential

Schrødinger equation

,

KKK A

A K i j KA

ZV d

r

r rr R r r

not all dimers are calculated by SCF

dimer-es approximation• For distant dimers, SCF calculation are not carried out • Energy is calculated by electrostatic approximation

monomer

SCFdimer

ES (non-SCF)dimer

1 21 2

1 2

I JIJ I JE E E d d

r r

r rr r

SCC procedure

convergence check of SCC procedure

already converged

not yet converged

electronic structure calculation for monomer

initial density calculation 0ID

electronic structure calculation for SCF-dimer

ES dimer calculation

total energy calculationFor all monomers

For all SCF-dimers

For all separated dimers

Flow chart of FMO calculation

( 2)IJ I

I J I

E E N E

I I I IH E

IJ IJ IJ IJH E

1 21 2

1 2

I JIJ I JE E E d d

r r

r rr r

SCC procedure

convergence check of SCC procedure

already converged

not yet converged

electronic structure calculation for monomer

initial density calculation 0ID

electronic structure calculation for SCF-dimer

ES-dimer calculation

total energy calculation

update monomer density matrices

refer monomer matrices needed in electronic structure calculation

refer two monomer matrices needed in dimer-ES calculation

refer monomer matrices needed in electronic structure calculation

update monomer density matrices

For all monomers

For all SCF-dimers

For all separated dimers

Update all monomer density matrices

Refer some monomer density matrices

( 2)IJ I

I J I

E E N E

I I I IH E

IJ IJ IJ IJH E

1 21 2

1 2

I JIJ I JE E E d d

r r

r rr r

Density requirements and update in FMO calculation

4-center two electron integral calculationsfor all environment monomers are prohibitive

,

K I K Iij i j

KI I I IAi j i j

A K i j KA

V V d

Zd d d

r r r r

rr r r r r r rr R r r

,

KI Ii j

I I K Ki j k l K

klk l K

d d

d d D

rr r r r

r r

r r r rr r

r r

KijVmatrix element of KV r (needed in SCF calculation)

involves 4 center tow electron integrals

order of calculation costs number of basis functions

Approximation of calculation (1)

esp-aoc approximation

Approximation of

KS

K K

iiD S

K Ki iK K KA

iiA K i KA

ZV d D S

r r

r rr R r r

calculation (2)

overlapmatrix of fragment K

Mulliken AO population of

K K K K Ki i ii

i K

D S

r r r

K K Kij i jS d r r r( )

Ki r

Approximation of

,

K K K K KA ij jiii

i A i A j K

Q D S D S

K K

K A A

A K A KA A

Z QV

rr R r R

K KA A

A K

Q

r r R

calculation (3)

esp-ptc approximation

Mulliken atomic charge of the nucleus AKAQ

point charge approx. of

number of environment monomersfor which 4-center two electron integralcalculations are carried out

Hypothetical petascale computing environment

Number of fragments : 10,000-100,000

~1,000) (the current largest

Number of CPUs : 1PFlops = 100,000 CPU

(current one CPU peak performance ~10GFlops)

Memory requirement in FMO

RIJ

1,000 350MB 4MB

5,000 1.7GB 95MB

10,000 3.4GB 380MB

50,000 17GB 9.3GB

100,000 34GB 37GB

Number of fragments

It is difficult for each process to store all necessary data

Memory per one CPU core > 10 GB×

The size of one density matrix ≒ 350KBRIJInterfragment distance data ≒ / 2

OSC implementation in OpenFMO (1)

node00

0 1

node01

2 3

node02

4 5

node03

6 7

node04

8 9

Group00 Group01 Group02

• In its execution, only the process of rank 0 is used as a server process for dynamic load balancing. All the other processes are worker processes and are divided into groups and the two-level parallelization is used as the other implementations.

OSC implementation in OpenFMO (2)

node00

0 1

node01

2 3

node02

4 5

node03

6 7

node04

8 9

Group00 Group01 Group02

memory window created by MPI_Win

density matrix data

MPI_Get

Estimation of communication cost (1)Assumptions

1. The sizes of all density matrices are the same.All groups have the same number of worker processes.

2. The time to put or get one density matrix is equal to the average point-to-point communication time .

3. MPI_Bcast implements a binomial tree algorithm so that the time to broadcast one density matrix over processes is obtained by .

4. And for OSC scheme, dynamic load balancing works well and the delay due to competing put or get requests is ignorable. Therefore all groups execute the same number of monomer and dimer jobs and this means that all groups have the same

N

SCF dimers

SCC iterationsneighboring monomers (average)

neighboring monomers (average)

17

20

10,000-100,000

96,000

32

7

12

Estimation of communication cost (2)

monomers with which a monomerforms a SCF dimer

point-to-point communication time to transfer one density matrix

mon monget SCC 4C 1( )fN N N N

number of SCF dimers

total number of dimers

mon dim esget get get get

monSCC 4C

dim dimSCF 4C

1

/ 2 1

( )

( )f

f f f

N N N N

N N N

N N N N N

total number of get requests

Estimation of communication cost (3)

dim dim dimget SCF 4C 2 / 2( )fN N N N

number of monomerget requests

number of SCFdimerget requests

OSC scheme

dimSCF1 / 2 / 2( )f f fN N N N number of ES dimers

es dimget SCF1 / 2 / 2 2( )f f fN N N N N

number of ES dimerget requests

p2p 2 proc group1 log /( )t N N

SCC p2p group/fN N t N

Estimation of communication cost (4)

Bcast scheme

total cost per one group

SCC p2p groupOSC

p2p proc group get group2

/

1 log / /( )fT N N t N

t N N N N

cost of MPI_Bcast

cost for one get requests

total cost of put requests

total cost per one group

SCC p2p procBcast 2logfT N N t N

OSC scheme

Estimation of communication cost (5)

Simulation using skelton code (1)• FMO calculation was simulated using skelton code

– quantum calculation parts are removed• Skelton code accumulates estimated time of :

– molecular integral calculation– Fock matrix builging, SCF procedure

• Time of each communication are estimated and accumulated– MPI_Send, MPI_recv, MPI_get, MPI_put

estimated from measurements– MPI_Bcast, MPI_Allreduce

estimated from point to point communication timeby assuming binomial tree, butterfly algorithm

Simulation using skelton code (2)• integral calculation time estimation formula

integrals required for environmental potential

1 electron

integrals required for SCF

• Fock builging time estimation

A=1.12072E-05 B=6.03297E-08 env 2

1e b b bf N AN BN

env 22 3e C b b bf N AN BN

env 22 4e C b b bf N AN BN

2 32e b b bf N AN BN

2 3Fock b b bf N AN BN

2 electron (3center)A=5.28555E-05 B=2.60718E-07

A=7.57123E-05 B=9.33117E-07

A=5.84313E-05 B=1.12603E-06

2 electron (4center)A=0.00050 B=3.26399E-06

simulation using skelton codes (3)

OSC cost < Bcast cost

> 512 with

Aquaporin protein

6-31G* basis set

PDBID 2F2B,

=492, 14,492 atoms

> 32) (

=4, 8, 16, 32, 64

=16

Linux cluster in RIKEN Super Combined Cluster (RSCC) was used

Summary

• OSC of MPI-2 standard has been implemented in a new FMO code to reduce the memory requirement per process.

• These results show OSC scheme is prospective for the petascale computing environment.

• Evaluation of communication costs shows that OSC scheme has an advantage over the Bcast scheme for

=10,000-100,000, =96,000.

by broadcast unnecessary.• OSC implementation also makes the exchange of

• Simulations using skelton code also show that OSC scheme has an advantage in communication cost with large .