
Page 1

Implementation of Scientific Computing Applications on the Cell Broadband Engine Processor

Guochun Shi, Volodymyr Kindratenko

National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Page 2

Three scientific applications

• NAnoscale Molecular Dynamics (NAMD): James Phillips, Theoretical and Computational Biophysics Group, UIUC
• MIMD Lattice Computation (MILC): Steven Gottlieb, Physics Department, Indiana University
• Electron Repulsion Integral (ERI) in quantum chemistry: Ivan S. Ufimtsev and Todd J. Martinez, Chemistry Department, UIUC

Page 3

Presentation outline

• Introduction
• Cell Broadband Engine
• Target applications
  1. NAnoscale Molecular Dynamics (NAMD)
  2. MIMD Lattice Computation (MILC) collaboration
  3. Electron Repulsion Integral (ERI) in quantum chemistry
• In-house task library and task dispatch system
• Implementation and performance
• Application porting and optimization
• Summary
• Conclusions

Page 4

Cell Broadband Engine

• One Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs); each SPE has 256 KB of local store
• 3.2 GHz processor
• 25.6 GB/s processor-to-memory bandwidth
• > 200 GB/s sustained aggregate EIB bandwidth
• Theoretical peak performance of 204.8 GFLOPS (SP) and 14.63 GFLOPS (DP); a quick check of these numbers follows below

[Block diagram: the PPE (PowerPC Processor Unit with L1 instruction and data caches, plus the PowerPC Processor Storage Subsystem with the L2 cache) and SPEs 0-7 (each a Synergistic Processor Unit with its Local Store and a Memory Flow Controller containing a DMA controller) are connected by the Element Interconnect Bus (EIB) to the Memory Interface Controller (MIC) and the I/O interfaces IOIF0 and IOIF1.]
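As a sanity check on the quoted peaks (a sketch, assuming each SPE issues one 4-wide single-precision FMA per cycle and one 2-wide double-precision FMA every seven cycles, as on the original Cell/B.E. DP pipeline):

  8\,\mathrm{SPEs} \times 4\,\mathrm{lanes} \times 2\,\tfrac{\mathrm{flops}}{\mathrm{FMA}} \times 3.2\,\mathrm{GHz} = 204.8\ \mathrm{GFLOPS\ (SP)}

  8\,\mathrm{SPEs} \times \frac{2 \times 2\,\mathrm{flops}}{7\,\mathrm{cycles}} \times 3.2\,\mathrm{GHz} \approx 14.6\ \mathrm{GFLOPS\ (DP)}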

Page 5

In-house task library and dispatch system

Compute task struct:

  typedef struct task_s {
      int cmd;    // operation
      int size;   // the size of the task structure
  } task_t;

  typedef struct compute_task_s {
      task_t common;
      <user_type1> <user_var_name1>;
      <user_type2> <user_var_name2>;
      ...
  } compute_task_t;

API for the PPE and the SPE:

  int  ppu_task_init(int argc, char **argv, spe_program_handle_t); // initialization
  int  ppu_task_run(volatile task_t *task);                        // start a task in all SPEs
  int  ppu_task_spu_run(volatile task_t *task, int spe);           // start a task in one SPE
  int  ppu_task_spu_wait(void);      // wait for any SPE to finish, blocking call
  void ppu_task_spu_waitall(void);   // wait for all SPEs to finish, blocking call

  int  spu_task_init(unsigned long long);
  int  spu_task_register(dotask_t, int);   // register a task
  int  spu_task_run(void);                 // start the infinite loop, wait for tasks

Task dispatch system in NAMD (a PPE-side sketch follows below):

[Flowchart: start with a pool of patch pairs to be processed and a pool of idle SPEs. While the task pool is not empty, find a dependency-free patch pair and an idle SPE, remove the patch pair from the pool, and run a pair-compute task on that SPE, blocking in ppu_task_spu_wait() whenever no SPE is idle. When the pool is empty, wait until all SPU tasks exit, then stop.]
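A minimal sketch of how the PPE-side dispatch loop in the flowchart might look with this API. The header name, the pool helpers, the PAIR_COMPUTE command id, and the pair_task_t payload are hypothetical illustrations, and it is assumed that ppu_task_spu_wait() returns the index of the SPE that finished; none of this is the actual library code.

  #include "task.h"   /* hypothetical header exposing task_t and the ppu_task_* API */

  /* Hypothetical helpers for the patch-pair pool; not part of the library. */
  int  pool_empty(void);
  int  find_dependency_free_pair(void);   /* -1 if none is runnable yet */
  int  find_idle_spe(void);               /* -1 if all SPEs are busy    */
  void remove_pair_from_pool(int pair);

  #define PAIR_COMPUTE 1                  /* hypothetical command id */

  typedef struct pair_task_s {
      task_t common;                      /* cmd + size header        */
      int    patch_a, patch_b;            /* hypothetical payload     */
  } pair_task_t;

  void dispatch_patch_pairs(void)
  {
      pair_task_t task;
      task.common.cmd  = PAIR_COMPUTE;
      task.common.size = sizeof(pair_task_t);

      while (!pool_empty()) {
          int pair = find_dependency_free_pair();
          if (pair < 0) {                 /* nothing runnable yet: wait for */
              ppu_task_spu_wait();        /* an SPE to finish, then retry   */
              continue;
          }
          int spe = find_idle_spe();
          if (spe < 0)                    /* all SPEs busy: block until any */
              spe = ppu_task_spu_wait();  /* one finishes                   */

          remove_pair_from_pool(pair);
          /* fill task.patch_a / task.patch_b from 'pair' here */
          ppu_task_spu_run((volatile task_t *)&task, spe);   /* start on one SPE */
      }
      ppu_task_spu_waitall();             /* wait until all SPU tasks exit */
  }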

Page 6

NAMD

• Compute forces and update positions repeatedly
• The simulation space is divided into rectangular regions called patches
  • Patch dimensions > cutoff radius for the non-bonded interaction
• Each patch only needs to be checked against nearby patches
  • Self-compute and pair-compute

Page 7

NAMD kernel

NAMD SPEC 2006 CPU benchmark kernel:

  1: for each atom i in patch pk
  2:   for each atom j in patch pl
  3:     if atoms i and j are bonded, compute bonded forces
  4:     otherwise, if atoms i and j are within the cutoff distance, add atom j to atom i's pair list
  5:   end
  6:   for each atom k in atom i's pair list
  7:     compute non-bonded forces (L-J potential and PME direct sum, both via lookup tables)
  8:   end
  9: end

We implemented a simplified version of the kernel that excludes pair lists and bonded forces (a C sketch follows below):

  1: for each atom i in patch pk
  2:   for each atom j in patch pl
  3:     if atoms i and j are within the cutoff distance
  4:       compute non-bonded forces (L-J potential and PME direct sum, both via lookup tables)
  5:   end
  6: end
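A minimal scalar C sketch of the simplified kernel above, just to make the structure concrete. The atom_t layout, the table arguments, and the nonbonded_from_tables() helper are illustrative assumptions, not NAMD's actual data structures or code.

  typedef struct { float x, y, z, q; } atom_t;

  /* Hypothetical helper: combines the L-J and PME-direct-sum table lookups
     into a scalar force magnitude over the distance; not the NAMD code.   */
  float nonbonded_from_tables(float r2, float qq,
                              const float *lj_table, const float *table_four);

  /* Every atom of patch pk against every atom of patch pl. */
  void pair_compute(const atom_t *pk, int nk, const atom_t *pl, int nl,
                    float (*force_k)[3], float (*force_l)[3], float cutoff2,
                    const float *lj_table, const float *table_four)
  {
      for (int i = 0; i < nk; i++) {                 /* 1: atoms of patch pk */
          for (int j = 0; j < nl; j++) {             /* 2: atoms of patch pl */
              float dx = pk[i].x - pl[j].x;
              float dy = pk[i].y - pl[j].y;
              float dz = pk[i].z - pl[j].z;
              float r2 = dx*dx + dy*dy + dz*dz;
              if (r2 < cutoff2) {                    /* 3: within cutoff     */
                  /* 4: non-bonded forces via lookup tables */
                  float f = nonbonded_from_tables(r2, pk[i].q * pl[j].q,
                                                  lj_table, table_four);
                  force_k[i][0] += f * dx;  force_l[j][0] -= f * dx;
                  force_k[i][1] += f * dy;  force_l[j][1] -= f * dy;
                  force_k[i][2] += f * dz;  force_l[j][2] -= f * dz;
              }
          }
      }
  }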

Page 8

NAMD Implementation: SPU data movement for distance computation (SP case)

• SIMD: each atom component is kept in a separate vector (see the transposition sketch below)
• Data movement dominates the time
• Buffer sizes are carefully chosen to fit into the local store

[Figure: atom data in memory is laid out per atom as x0 y0 z0 c0 id0 | x1 y1 z1 c1 id1 | ...; a shuffle operation transposes it into per-component vectors (x0 x1 x2 x3), (y0 y1 y2 y3), (z0 z1 z2 z3), (c0 c1 c2 c3), from which the difference vectors (dx, dy, dz) and the distance vector (r0 r1 r2 r3) are computed.]

Local store usage (KB):

        Code   L-J table   Table_four   Atom buffer   Force buffer   Stack and others
  SP     25       55           45            30             18              83
  DP     25       55           91            48             18              19
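A portable C sketch of the array-of-structures to structure-of-arrays transposition pictured above. On the SPU this is typically done with spu_shuffle() on 128-bit vectors; the scalar loops below only illustrate the data movement, and the atom layout is an assumption.

  /* Transpose 4 atoms from the in-memory layout {x,y,z,c,id} into
     per-component arrays so that 4 distances can be computed with one
     SIMD operation per component.                                     */
  typedef struct { float x, y, z, c; int id; } atom_aos_t;

  typedef struct {
      float x[4], y[4], z[4], c[4];      /* one SIMD vector per component */
  } atom_soa_t;

  static void aos_to_soa4(const atom_aos_t in[4], atom_soa_t *out)
  {
      for (int k = 0; k < 4; k++) {
          out->x[k] = in[k].x;
          out->y[k] = in[k].y;
          out->z[k] = in[k].z;
          out->c[k] = in[k].c;
      }
  }

  /* With the transposed layout, 4 squared distances fall out of three
     element-wise subtractions and multiplies, which is exactly what the
     SPU does with 4-wide vector instructions.                           */
  static void distances4(const atom_soa_t *a,
                         float xi, float yi, float zi, float r2[4])
  {
      for (int k = 0; k < 4; k++) {
          float dx = xi - a->x[k], dy = yi - a->y[k], dz = zi - a->z[k];
          r2[k] = dx*dx + dy*dy + dz*dz;
      }
  }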

Page 9

Implementation: optimizations

• Different vectorization schemes are applied in order to get the best performance
• Self-compute: do redundant computations and fill zeros into the unused slots
• Pair-compute: save enough pairs of atoms within the cutoff, then do the calculations four at a time (see the sketch below)

[Flowchart: task start; DMA in the atoms and forces for the patch. Self-compute path: compute distances for 4 pairs of atoms; if any of them is within the cutoff, compute 4 non-bonded forces and nullify the forces for those outside the cutoff distance; repeat until all pairs are processed. Pair-compute path: compute distances for 4 pairs of atoms and save the pairs within the cutoff distance; once the number of saved pairs is >= 4, compute 4 non-bonded forces; repeat until all pairs are processed. Finally, DMA the forces back to main memory and exit the task.]
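A sketch of the pair-compute buffering idea, in scalar C for clarity: pairs that pass the cutoff test are queued until four are available, and only then is a 4-wide force evaluation issued. The queue layout and compute_4_forces() are illustrative assumptions, not the actual SPU code.

  /* Queue atom pairs that pass the cutoff test; flush the queue with one
     4-wide SIMD force evaluation whenever 4 pairs have accumulated.      */
  typedef struct { int i, j; float dx, dy, dz, r2; } saved_pair_t;

  /* Hypothetical helper: evaluates 4 non-bonded forces with SIMD intrinsics. */
  void compute_4_forces(const saved_pair_t p[4]);

  static saved_pair_t queue[4];
  static int          nsaved = 0;

  static void maybe_save_pair(int i, int j, float dx, float dy, float dz,
                              float r2, float cutoff2)
  {
      if (r2 >= cutoff2)
          return;                          /* outside cutoff: nothing to save */
      queue[nsaved++] = (saved_pair_t){ i, j, dx, dy, dz, r2 };
      if (nsaved == 4) {                   /* enough work for one SIMD step   */
          compute_4_forces(queue);
          nsaved = 0;
      }
  }
  /* Any leftover (< 4) pairs at the end are handled by padding the unused
     SIMD slots with zeros, as the self-compute path does.                */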

Page 10

NAMD Performance: static analysis

[Charts: number of clock cycles spent in scalar force computation, scalar distance computation, SIMD force computation, and merge/SIMD distance computation, broken into compute cycles and data-manipulation cycles, for the single-precision and double-precision self-compute and pair-compute kernels.]

• The distance computation code takes most of the time
• The data manipulation time is significant

Page 11

NAMD Performance

[Charts: scaling and speedup of the force-field kernel as compared to a 3.0 GHz Intel Xeon processor. Single-precision speedups on 1, 2, 4, 8, and 16 SPEs are roughly 1.7x, 3.4x, 6.8x, 13.4x, and 25.8x; double-precision speedups are roughly 1.5x, 2.9x, 5.8x, 11.6x, and 22.5x. A second chart compares kernel execution time across architectures: Cell/B.E. PPE (3.2 GHz), Intel Itanium 2 (900 MHz), IBM Power4 (1.2 GHz), Intel Xeon (3.0 GHz), AMD Opteron (2.2 GHz), and Cell/B.E. with 8 SPEs (3.2 GHz).]

• 13.4x speedup for SP and 11.6x speedup for DP compared to a 3.0 GHz Intel Xeon processor
• SP performance is < 2x better than DP

Page 12

MILC Introduction

• MILC performs large-scale numerical simulations to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics
• It is a large consumer of supercomputer cycles
• Our target application: dynamical clover fermions (clover_dynamical) using the hybrid molecular dynamics R algorithm
• Our view of the MILC application: a sequence of communication and computation blocks

[Diagram of the original CPU-based implementation: compute loop 1, MPI scatter/gather for loop 2, compute loop 2, ..., compute loop n, MPI scatter/gather for loop n+1.]

Page 13

MILC Performance on the PPE

• Step 1: try to run the application on the PPE
• On the PPE it runs approximately 2-3x slower than a modern CPU
• MILC is bandwidth-bound
• This agrees with what we see with the STREAM benchmark (a triad sketch follows below)

[Charts: run time on the CPU and the PPE for the 8x8x16x16 and 16x16x16x16 lattices; STREAM benchmark bandwidth (Copy, Scale, Add, Triad) on the CPU and the PPE.]
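For reference, the bandwidth-bound behavior that STREAM measures comes from kernels like the triad below, which do almost no arithmetic per byte moved. This is a minimal sketch in the spirit of the STREAM benchmark, not the benchmark code itself.

  #include <stddef.h>

  /* STREAM-style triad: a[i] = b[i] + q * c[i].  Two reads and one write per
     iteration and a single multiply-add, so the run time is set almost
     entirely by memory bandwidth.                                           */
  void triad(double *a, const double *b, const double *c, double q, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          a[i] = b[i] + q * c[i];
  }

  /* Sustained bandwidth is then roughly 3 * n * sizeof(double) bytes divided
     by the measured run time.                                               */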

Page 14

MILC: Execution profile and kernels to be ported

[Chart: per-kernel run time and cumulative run time for the 8x8x16x16 and 16x16x16x16 lattices, over kernels such as udadu_mu_nu, dslash_w_site_special, su3mat_copy, mult_su3_nn, mult_su3_na, scalar_mult_add_wvec, d_congrad2_cl, set_neighbor, update_u, compute_clov, update_h_cl, wp_shrink, gauge_action, and reunitarize.]

• 10 of these subroutines are responsible for >90% of the overall runtime
• su3mat_copy (i.e., memcpy) is responsible for nearly 20% of all runtime
• All of the kernels together are responsible for 98.8% of the runtime

Page 15

MILC: Kernel memory access pattern

• Neighbor data access is taken care of by the MPI framework
• In each iteration only small elements are accessed
  • lattice site (struct site): 1,832 bytes
  • su3_matrix: 72 bytes
  • wilson_vector: 96 bytes
• Challenge: how to get data into the SPUs as fast as possible?
  • The data is not aligned
  • The data is not a multiple of 128 bytes

One sample kernel from the udadu_mu_nu() routine:

  #define FORSOMEPARITY(i,s,choice) \
      for( i = ((choice)==ODD ? even_sites_on_node : 0), s = &(lattice[i]); \
           i < ((choice)==EVEN ? even_sites_on_node : sites_on_node); \
           i++, s++ )

  FORSOMEPARITY(i, s, parity) {
      mult_adj_mat_wilson_vec( &(s->link[nu]), ((wilson_vector *)F_PT(s,rsrc)), &rtemp );
      mult_adj_mat_wilson_vec( (su3_matrix *)(gen_pt[1][i]), &rtemp, &(s->tmp) );
      mult_mat_wilson_vec( (su3_matrix *)(gen_pt[0][i]), &(s->tmp), &rtemp );
      mult_mat_wilson_vec( &(s->link[mu]), &rtemp, &(s->tmp) );
      mult_sigma_mu_nu( &(s->tmp), &rtemp, mu, nu );
      su3_projector_w( &rtemp, ((wilson_vector *)F_PT(s,lsrc)), ((su3_matrix *)F_PT(s,mat)) );
  }

Page 16

Approach I: packing and unpacking

[Diagram: on the PPE, the needed fields are packed out of each struct site in main memory into contiguous buffers; DMA operations move the buffers to the SPEs, and the results are unpacked back into the sites.]

• Good performance of the DMA operations
• Packing and unpacking are expensive on the PPE

Page 17

Approach II: Indirect memory access

[Diagram: the original lattice is modified so that elements of struct site are replaced with pointers into contiguous memory regions, which are then moved to the SPEs with DMA operations.]

• Replace elements in struct site with pointers
• The pointers point to contiguous memory regions
• PPU overhead due to the indirect memory access

Page 18

Approach III: Padding and small memory DMAs

[Diagram: struct site after padding; small DMA operations move the padded elements directly to the SPEs.]

• Pad the elements and struct site to appropriate sizes
• Gained good bandwidth performance at the cost of padding overhead (see the padding sketch below)
  • su3_matrix goes from a 3x3 complex matrix (72 bytes) to a 4x4 complex matrix (128 bytes); bandwidth efficiency lost: 44%
  • wilson_vector goes from 4x3 complex (96 bytes) to 4x4 complex (128 bytes); bandwidth efficiency lost: 23%
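A sketch of what the padding amounts to in C, for the single-precision case. The type names are illustrative rather than MILC's actual definitions; the sizes are the ones quoted on the slide, and the point is that a padded element becomes a DMA-friendly 128-byte unit at the cost of transferring unused bytes.

  #include <complex.h>

  /* 3x3 complex matrix: 3*3*2*4 = 72 bytes in single precision, which is
     neither 128-byte friendly nor naturally aligned for efficient DMA.    */
  typedef struct { float complex e[3][3]; } su3_matrix_orig;    /*  72 bytes */

  /* Padded to a 4x4 complex matrix: 4*4*2*4 = 128 bytes, so each element
     maps onto exactly one 128-byte transfer; 56 of the 128 bytes are
     padding, i.e. the 44% bandwidth-efficiency loss quoted above.         */
  typedef struct { float complex e[4][4]; } su3_matrix_padded;  /* 128 bytes */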

Page 19

MILC struct site padding

• 128-byte-stride accesses have different performance depending on the stride
• This is due to the 16 banks in main memory
• Odd stride multiples always reach the peak
• We chose to pad struct site to 2,688 bytes (21 x 128)

DMA bandwidth versus stride (in multiples of 128 bytes):

  stride:   1      2      3      4      5      6      7      8      9      10     11     12     13     14     15     16     17
  GB/s:     25.38  12.69  25.36  8.60   25.49  12.89  25.48  4.26   25.50  12.93  25.34  8.58   25.51  12.89  25.47  2.13   25.34

Page 20

MILC Kernel performance

[Charts: per-kernel percentage of peak GFLOPS, percentage of peak bandwidth, and speedup over the CPU, for the 8x8x16x16 and 16x16x16x16 lattices, over the same set of kernels as on page 14.]

• GFLOPS are low for all kernels
• Bandwidth is around 80% of peak for most of the kernels
• Kernel speedups compared to the CPU are between 10x and 20x for most of the kernels
• The set_memory_to_zero kernel has a ~40x speedup
• The memcpy speedup is >15x

Page 21

MILC Application performance

[Charts: execution time for the 8x8x16x16 and 16x16x16x16 lattices in five execution modes: 1 Xeon core, 8 Xeon cores, 1 PPE + 8 SPEs, 1 PPE + 16 SPEs (NUMA), and 2 PPEs with 8 SPEs per PPE (MPI); the SPE contribution to the Cell run time is shown separately.]

• Single-Cell application speedup is ~8-10x compared to a single Xeon core
• Profile on the Xeon: 98.8% parallel code, 1.2% serial code
• On the Cell, the kernels (SPU time) account for 67-38% of the overall runtime and the PPU for 33-62%
• The PPE is standing in the way of further improvement (see the estimate below)
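A rough Amdahl-style estimate of why the PPE dominates, under two assumptions taken from the earlier slides: the 1.2% serial fraction runs about 3x slower on the PPE than on the Xeon, and the 98.8% parallel fraction is sped up by roughly 15x on the SPEs. Normalizing the single-core Xeon run time to 1:

  t_{\mathrm{Cell}} \approx 3 \times 0.012 + \frac{0.988}{15} \approx 0.036 + 0.066 = 0.102

  \text{overall speedup} \approx \frac{1}{0.102} \approx 9.8\times, \qquad
  \text{PPU share} \approx \frac{0.036}{0.102} \approx 35\%

Both numbers land in the ranges reported above (~8-10x speedup, 33-62% PPU time), and adding more SPEs only shrinks the parallel term, so the PPU share grows.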

Page 22

Introduction – Quantum Chemistry

• Two basic questions in chemistry*:
  1. Where are the electrons?
  2. Where are the nuclei?
• Quantum chemistry focuses on the first question by solving the time-independent Schrödinger equation to obtain the electronic wave functions; the absolute square is interpreted as the probability distribution for the positions of the electrons
• The probability distribution function is usually expanded in Gaussian-type basis functions
• To find the coefficients in this expansion, we need to evaluate many two-Electron Repulsion Integrals (ERI)

* http://mtzweb.scs.uiuc.edu/research/gpu/

Page 23

Introduction – Electron Repulsion Integral (ERI)

Reference CPU implementation:

  for (s1 = startShell; s1 < stopShell; s1++)
    for (s2 = s1; s2 < totNumShells; s2++)
      for (s3 = s1; s3 < totNumShells; s3++)
        for (s4 = s3; s4 < totNumShells; s4++)
          for (p1 = 0; p1 < numPrimitives[s1]; p1++)
            for (p2 = 0; p2 < numPrimitives[s2]; p2++)
              for (p3 = 0; p3 < numPrimitives[s3]; p3++)
                for (p4 = 0; p4 < numPrimitives[s4]; p4++) {
                  .....
                  H_ReductionSum[s1,s2,s3,s4] += sqrt(F_PI*rho) *
                      I1*I2*I3*Weight*Coeff1*Coeff2*Coeff3*Coeff4;
                }

[Equation (figure): the general form of the [ss|ss] integral; a standard form is sketched below.]

• The four outer loops sequence through all unique combinations of electron shells
• The four inner loops sequence through all shell primitives
• The primitive [ss|ss] integrals are computed and summed up in the core code
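For reference, the standard textbook form of the primitive [ss|ss] integral over four s-type Gaussians with exponents \alpha, \beta, \gamma, \delta centered at A, B, C, D (an assumption about what the slide's equation image showed; the code's I1*I2*I3*Weight factors presumably bundle pieces of this expression):

  [ss|ss] = \frac{2\pi^{5/2}}{(\alpha+\beta)(\gamma+\delta)\sqrt{\alpha+\beta+\gamma+\delta}}
            \, e^{-\frac{\alpha\beta}{\alpha+\beta}|A-B|^2 - \frac{\gamma\delta}{\gamma+\delta}|C-D|^2}
            \, F_0\!\left(\frac{(\alpha+\beta)(\gamma+\delta)}{\alpha+\beta+\gamma+\delta}\,|P-Q|^2\right)

  \text{where } P = \frac{\alpha A + \beta B}{\alpha+\beta}, \quad
                Q = \frac{\gamma C + \delta D}{\gamma+\delta}, \quad
                F_0(t) = \frac{1}{2}\sqrt{\frac{\pi}{t}}\,\operatorname{erf}(\sqrt{t})

This form is consistent with the erf() interpolation table used in the SPE kernel (page 25) and with the pairwise quantities precomputed on page 26.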

Page 24

Porting to Cell B.E. – load balancing the SPEs

• Function-offload programming model

[Diagram: the PPE computes the work distribution, the SPEs compute, and the PPE collects the results from the SPEs and continues.]

• Each SPE is assigned a range of (s1, s2, s3, s4) quadruplets to work on
• The load balance is computed on the PPE (a counting sketch follows below):
  • Each primitive integral has roughly the same amount of computation
  • Each contracted integral may have a different amount of computation
  • Each iteration of the outermost loop has a different amount of computation
  • The PPE must compute the amount of computation before the SPEs run

(The shell and primitive loop nest is the reference implementation shown on page 23.)
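A sketch of how the PPE-side work estimate could be formed: count the number of primitive integrals contributed by each outermost-loop iteration and split the s1 range into chunks of roughly equal work. The function and variable names here are illustrative, not the application's actual code.

  /* Work (number of primitive [ss|ss] integrals) contributed by one
     iteration s1 of the outermost shell loop, following the loop bounds
     of the reference implementation.                                    */
  static long work_for_s1(int s1, int totNumShells, const int *numPrimitives)
  {
      long work = 0;
      for (int s2 = s1; s2 < totNumShells; s2++)
          for (int s3 = s1; s3 < totNumShells; s3++)
              for (int s4 = s3; s4 < totNumShells; s4++)
                  work += (long)numPrimitives[s1] * numPrimitives[s2] *
                          (long)numPrimitives[s3] * numPrimitives[s4];
      return work;
  }

  /* Assign each SPE a contiguous s1 range carrying roughly total/nspes
     primitive integrals; SPE k gets s1 in [start[k], start[k+1]).       */
  static void split_work(int totNumShells, const int *numPrimitives,
                         int nspes, int *start)
  {
      long total = 0;
      for (int s1 = 0; s1 < totNumShells; s1++)
          total += work_for_s1(s1, totNumShells, numPrimitives);

      long target = total / nspes, acc = 0;
      int  spe = 0;
      start[0] = 0;
      for (int s1 = 0; s1 < totNumShells && spe < nspes - 1; s1++) {
          acc += work_for_s1(s1, totNumShells, numPrimitives);
          if (acc >= (spe + 1) * target)
              start[++spe] = s1 + 1;       /* next SPE starts after s1 */
      }
      while (spe < nspes - 1)              /* degenerate case: leftover SPEs */
          start[++spe] = totNumShells;
      start[nspes] = totNumShells;
  }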

Page 25

SPE kernels

• Is the local store big enough?
  • Input data: an array of coordinates + an array of shells + an array of primitives, < 32 KB
  • The Gauss error function erf() is evaluated by interpolating a table; the table size is < 85 KB
  • We have enough local store
• Precomputing is necessary to reduce redundant computation
  • But the precomputed intermediate results are much larger than the local store

Page 26

SPE kernel – precompute

• Instead of computing every primitive integral from the equations, we precompute pairwise quantities and store them in main memory
• All pairwise quantities are DMAed in before computing a contracted integral
• Precomputed quantities (for a pair of primitives with exponents \alpha_1, \alpha_2 at centers R_1, R_2):

  \alpha_1 + \alpha_2, \qquad
  \left(\frac{\pi}{\alpha_1+\alpha_2}\right)^{3/2} e^{-\frac{\alpha_1\alpha_2}{\alpha_1+\alpha_2}\,|R_1-R_2|^2}, \qquad
  \frac{\alpha_1 R_1 + \alpha_2 R_2}{\alpha_1+\alpha_2}

• Trade bandwidth for computation

  for (s1 = startShell; s1 < stopShell; s1++)
    for (s2 = s1; s2 < totNumShells; s2++)
      for (s3 = s1; s3 < totNumShells; s3++)
        for (s4 = s3; s4 < totNumShells; s4++) {
          // DMA in precomputed pair quantities
          for (p1 = 0; p1 < numPrimitives[s1]; p1++)
            for (p2 = 0; p2 < numPrimitives[s2]; p2++)
              for (p3 = 0; p3 < numPrimitives[s3]; p3++)
                for (p4 = 0; p4 < numPrimitives[s4]; p4++) {
                  .....
                  H_ReductionSum[s1,s2,s3,s4] += sqrt(F_PI*rho) *
                      I1*I2*I3*Weight*Coeff1*Coeff2*Coeff3*Coeff4;
                }
        }

Page 27

SPE kernel – inner loop optimizations

• Naïve way of SIMDizing the kernel
  • Use a counter to keep track of the number of primitive integrals
  • When the counter reaches 4, do a 4-way SIMD computation
• Loop switch
  • If one of the loops' lengths is a multiple of 4, switch that loop to the innermost position
  • Advance in increments of 4 and get rid of the counter
• Loop unrolling
  • If numPrimitives is the same for all primitive integrals and happens to be a nice number, completely unrolling the inner loops generates the most efficient code

Naïve implementation of the SIMD kernel:

  .....
  // DMA in precomputed pair quantities
  for (p1 = 0; p1 < numPrimitives[s1]; p1++)
    for (p2 = 0; p2 < numPrimitives[s2]; p2++)
      for (p3 = 0; p3 < numPrimitives[s3]; p3++)
        for (p4 = 0; p4 < numPrimitives[s4]; p4++) {
          static int count = 0;
          count++;
          if (count == 4) {
            // compute 4 primitive integrals using SIMD intrinsics
            count = 0;
          }
        }

Loop switch (the length of loop 1 is a multiple of 4, so loops 1 and 4 are switched):

  for (p4 = 0; p4 < numPrimitives[s4]; p4++)
    for (p2 = 0; p2 < numPrimitives[s2]; p2++)
      for (p3 = 0; p3 < numPrimitives[s3]; p3++)
        for (p1 = 0; p1 < numPrimitives[s1]; p1 += 4) {
          // compute 4 primitive integrals using SIMD intrinsics
        }

Page 28

Performance results

                             Model 1          Model 2
  # of atoms                 30               64
  Basis set                  6-311G           STO-6G
  # of integrals             528,569,315      2,861,464,320
  # of reduction elements    3,146,010        2,207,920
  Xeon (2.33 GHz), seconds   21.04            112.55
  Cell blade (16 SPEs), s    1.1              0.92
  Speedup                    19x              122x

• Model 1 is 10 water molecules (30 atoms in total); Model 2 is 64 hydrogen atoms arranged in a lattice
• Model 1 has non-uniform contracted-integral intensity, ranging from 4,096 down to just 1 primitive integral
  • Loop switching is not always possible, so the naïve SIMD implementation with more control overhead is used
  • Loop unrolling is not possible since the number of iterations in each inner loop changes
  • Precomputing proves to slow things down due to the DMA overhead
• Model 2 has uniform computation intensity
  • Precomputing and loop unrolling prove to be efficient
  • The module outperforms the GPU implementation*

* Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation. Ivan S. Ufimtsev and Todd J. Martinez, Journal of Chemical Theory and Computation, February 2008

Page 29

Summary

  Ported code             Precision   Limiting factor   Cell blade vs. single core (1)   Cell blade vs. Xeon blade (2)
  NAMD (modified kernel)  SP, DP      Computation       22-26x                           N/A
  MILC (application)      SP          Bandwidth         10-12x                           1.5-4.1x
  ERI (kernel)            SP          Computation       19-122x                          N/A

  1. The CPU comparison core is a 3.0 GHz Xeon for NAMD and a 2.33 GHz Xeon for MILC and ERI.
  2. One Xeon blade contains 8 cores with 2 MB of L2 cache per core.

Page 30

Lessons learned

• Keep the code running on the PPE to a minimum
  • In MILC, code that is 1.2% of the runtime on the Xeon accounts for 33-62% of the runtime on the PPE
• Find the limiting factor in the kernel and work on it
  • MILC is bandwidth-limited, so we focused on improving the usable bandwidth
  • NAMD and ERI are compute-bound, so we focused on optimizing the SIMD kernels
• Sometimes we can trade bandwidth for computation, or vice versa
  • In ERI, depending on the input data, we can either precompute some quantities on the PPE and DMA them in, or do all of the computation on the SPEs
• The application's data layout in memory is critical
  • Padding would not be necessary if MILC were field-centered, an improvement in both performance and productivity
  • A proper data layout makes SIMDizing the kernel code easier and makes it faster

Page 31

Thank You

• Questions?
