Stanford Streaming Supercomputer
Eric Darve
Mechanical Engineering Department
Stanford University
[Diagram: SSS node — a scalar processor with scalar cache, micro-controller, and stream controller; a Stream Register File feeding Cluster 0 through Cluster 15 via an inter-cluster crossbar; a memory system/network with DRAM banks]
12/10/2002 Eric Darve - Stanford Streaming Supercomputer 2/33
Overview of Streaming Project
• Main PIs:
  – Pat Hanrahan, [email protected]
  – Bill Dally, [email protected]
• Objectives:
  – Cost/Performance: 100:1 compared to clusters.
  – Programmable: applicable to a large class of scientific applications.
  – Porting and developing new code made easier: stream language, support of legacy codes.
Performance/Cost
Item               Cost    Per Node
Processor chip      200         200
Router chip         200          50
Memory chip          20         320
Board/Backplane    3000         188
Cabinet           50000          49
Power                 1          50
Per-Node Cost                   976

Cost estimate: about $1K/node. Preliminary numbers, parts cost only, no I/O included. Expect 2x to 4x to account for margin and I/O.
FOR IMMEDIATE RELEASE
October 21, 2002
Sandia National Laboratories and Cray Inc. finalize $90 million contract for new supercomputer
Collaboration on Red Storm System under Department of Energy’s Advanced Simulation and Computing Program (ASCI)
ALBUQUERQUE, N.M. and SEATTLE, Wash. — The Department of Energy’s Sandia National Laboratories and Cray Inc. (Nasdaq NM: CRAY) today announced that they have finalized a multiyear contract, valued at approximately $90 million, under which Cray will collaborate with Sandia to develop and deliver a new massively parallel processing (MPP) supercomputer called Red Storm. In June 2002, Sandia reported that Cray had been selected for the award, subject to successful contract negotiations.
Performance/Cost Comparisons
• Earth Simulator (today): peak 40 TFLOPS, ~$450M
  – 0.09 MFLOPS/$; sustained 0.03 MFLOPS/$
• Red Storm (2004): peak 40 TFLOPS, ~$90M
  – 0.44 MFLOPS/$
• SSS (proposed 2006): peak 40 TFLOPS, < $1M
  – 128 MFLOPS/$; sustained 30 MFLOPS/$ (single node)
• Numbers are sketchy today, but even if we are off by 2x, the improvement over the status quo is large.
[Chart: GFLOPS versus cost for the Earth Simulator, Red Storm, ASCI machines, desktop processors, and SSS]
How did we achieve that?
VLSI Makes Computation Plentiful
VLSI: very large-scale integration, the current level of computer microchip miniaturization (microchips containing hundreds of thousands of transistors and beyond).
• Abundant, inexpensive arithmetic
– Can put 100s of 64-bit ALUs on a chip
– 20pJ per FP operation
• (Relatively) high off-chip bandwidth
– 1Tb/s demonstrated, 2nJ per word off chip
• Memory is inexpensive: $100/Gbyte
nVidia GeForce4: ~120 Gflops/s, ~1.2 Tops/s
Velio VC3003: 1 Tb/s I/O BW
But VLSI imposes some constraints

Current architecture: few ALUs per chip = expensive and limited performance.

Objective for SSS architecture:
• Keep hundreds of ALUs per chip busy.

Difficulty:
• Locality of data: we need to match 20 Tb/s ALU bandwidth to ~100 Gb/s chip bandwidth.
• Latency tolerance: to cover 500-cycle remote memory access time.

Arithmetic is cheap, global bandwidth is expensive: local << global on-chip << off-chip << global system.

[Diagram: a 64-bit ALU drawn to scale on a chip; architecture of a Pentium 4]
The Stream Model exposes parallelism and locality in applications
• Streams of records passing through kernels
• Parallelism
  – Across stream elements
  – Across kernels
• Locality
  – Within kernels
  – Producer-consumer locality between kernels
[Diagram: stream graph — a Grid of Cells stream feeds kernel K1 (12W I/O, 50 ops), then K2 (14W I/O, 100 ops, reading a Table), K3 (15W I/O, 70 ops), and K4 (9W I/O, 80 ops), which writes back a Grid of Cells; an Index Stream drives the gathers]
Streams match scientific computation to constraints of VLSI
Stream program matches application to Bandwidth Hierarchy 32:4:1
[Diagram: the same stream program mapped onto the bandwidth hierarchy — Memory, Stream Cache, Stream Reg File, Local Registers — with kernels K1 (50 ops) through K4 (80 ops), streams (Grid of Cells, Indices, Table, Results 1-4), and word counts at each level (300 ops and 900W in local registers versus 58, 12, and 9.5 words at the outer levels)]
Scientific programs stream well
StreamFEM bandwidth by polynomial order (Euler equations), in GB/s, log scale on the chart:

Order        Mem BW   SRF BW   LRF BW
Constant         17       47      856
Linear           16       71     1289
Quadratic        11       83     1372
Cubic             7       93     1447

StreamFEM results show L:S:M ratios of 206:13:1 to 50:3:1.
BW Hierarchy of SSS
[Diagram: DRAM → Stream Cache (banks 0-7) → Stream Register File → functional units in Cluster 0 through Cluster 15, with bandwidths of 16 GB/s, 64 GB/s, 512 GB/s, and 3,840 GB/s respectively]
Stream processor = Vector processor + Local registers
• Like a vector processor, stream processors
  – Amortize instruction overhead over records of a stream
  – Hide latency by loading (storing) streams of records
  – Can exploit producer-consumer locality at the SRF (VRF) level
• Stream processors add local registers and microcoded kernels
  – >90% of all references from local registers
  – Increases effective bandwidth and capacity of SRF (VRF) by 10x
  – Enables 10x number of ALUs
  – Enables SRF to capture the working set
[Diagram: vector processor — Memory (1x) → Vector Register File (10x); stream processor — Memory (1x) → Stream Register File (10x) → Local Registers (100x)]
Brook: streaming language
• C with streaming
  – Make data parallelism explicit
  – Declare communication pattern
• Streams
  – View of records in memory
  – Operated on in parallel
  – Accessing stream values is not permitted outside of kernels
Brook Kernels
• Kernels
  – Functions which operate only on streams
    • Stream arguments are read-only or write-only
    • Reduction variables (associative operations only)
  – Restricted communication between records
    • No state or “static” variables
    • No global memory access
Brook Example: Molecular Dynamics
struct Vector {
  float x, y, z;
};

typedef stream struct Vector Vectors;

kernel void UpdatePosition (Vectors sPos, Vectors sVel,
                            const float timestep, out Vectors sNewPos) {
  sNewPos.x = sPos.x + timestep * sVel.x;
  sNewPos.y = sPos.y + timestep * sVel.y;
  sNewPos.z = sPos.z + timestep * sVel.z;
}
struct Vector {
  float x, y, z;
};

typedef stream struct Vector Vectors;

void main () {
  struct Vector Pos[MAX] = {…};
  struct Vector Vel[MAX] = {…};

  Vectors sPos, sVel, sNewPos;

  streamLoad (sPos, Pos, MAX);
  streamLoad (sVel, Vel, MAX);
  UpdatePosition (sPos, sVel, 0.2f, sNewPos);
  streamStore (sNewPos, Pos);
}
StreamMD: motivation
• Application: study the folding of human proteins.
• Molecular Dynamics: computer simulation of the dynamics of macromolecules.
• Why this application?
  – Expect high arithmetic intensity.
  – Requires variable-length neighborlists.
  – Molecular Dynamics can be used in engine simulation to model spray, e.g. droplet formation and breakup, drag, deformation of droplets.
• Test case chosen for initial evaluation: box of water molecules.

[Images: DNA molecule; human immunodeficiency virus (HIV)]
Numerical Algorithm
• Interaction between atoms is modeled by the potential energy associated with each configuration. Includes:
  – Chemical bond potentials.
  – Electrostatic interactions.
  – Van der Waals interactions.
• Newton’s second law of motion is used to compute the trajectory of all atoms.
• Velocity Verlet time integrator (leap-frog).
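The equations on this slide were lost in extraction; the following is a standard reconstruction of Newton's second law and the velocity Verlet scheme, not the slide's original typesetting.

```latex
% Newton's second law for atom i, with potential energy V:
m_i \frac{d^2 \mathbf{x}_i}{dt^2} = \mathbf{F}_i
  = -\nabla_{\mathbf{x}_i} V(\mathbf{x}_1,\dots,\mathbf{x}_N)

% Velocity Verlet (leap-frog form), with timestep \Delta t:
\mathbf{v}_i^{\,n+1/2} = \mathbf{v}_i^{\,n} + \frac{\Delta t}{2 m_i}\,\mathbf{F}_i^{\,n},
\qquad
\mathbf{x}_i^{\,n+1} = \mathbf{x}_i^{\,n} + \Delta t\,\mathbf{v}_i^{\,n+1/2},
\qquad
\mathbf{v}_i^{\,n+1} = \mathbf{v}_i^{\,n+1/2} + \frac{\Delta t}{2 m_i}\,\mathbf{F}_i^{\,n+1}
```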
High-Level Implementation in Brook
• A cutoff is used to compute non-bonded forces: two particles do not interact if they are farther apart than the cutoff radius.
• A gridding technique is used to accelerate the search for all atoms within the cutoff radius.
• A stream of variable length is associated with each cell of the grid: it contains all the water molecules inside the cell.
• High-level Brook functionality is used:
  – streamGatherOP: used to construct the list of all water molecules inside a cell.
  – streamScatterOP: used to reduce the partial forces computed for each molecule.
[Diagram: streamGatherOP gathers records from Memory into the SRF, incrementing a per-cell count n; streamScatterOP scatters partial forces f back to memory with a reduction f+g]
StreamMD Results

[Figure: VLIW kernel schedules for the StreamMD inner loop on the Imagine and SSS architectures — densely packed FMUL/FADD/FSUB/FSQRT/FDIV/FINVSQRT_LOOKUP/SELECT/SPREAD slots across all ALUs over a few hundred cycles]
• Preliminary schedule obtained using the Imagine architecture:
  – High arithmetic intensity: all ALUs are kept busy. Gflops expected to be very high.
  – SRF bandwidth is sufficient: about 1 word per 30 instructions.
• Results helped guide architectural decisions for SSS.
Observations
• Arithmetic intensity is sufficient. Bandwidth is not going to be the limiting factor in these applications. Computation can be naturally organized in a streaming fashion.
• The interaction between the application developers and the language development group has helped ensure that Brook can be used to code real scientific applications.
• The architecture has been refined in the process of evaluating these applications.
• Implementation is much easier than MPI. Brook hides all the parallelization complexity from the user. The code is very clean and easy to understand. The streaming versions of these applications are in the range of 1,000-5,000 lines of code.
A GPU is a stream processor
• The GPU on a graphics card is a streaming processor.
• NVIDIA recently announced that their latest graphics card, the NV30, will be programmable and capable of delivering 51 Gflops peak performance (vs. 1.6 Gflops for a Pentium 4).

Can we use this computing power for scientific applications?
Cg: Assembly or High-level?
Assembly…
DP3 R0, c[11].xyzx, c[11].xyzx;
RSQ R0, R0.x;
MUL R0, R0.x, c[11].xyzx;
MOV R1, c[3];
MUL R1, R1.x, c[0].xyzx;
DP3 R2, R1.xyzx, R1.xyzx;
RSQ R2, R2.x;
MUL R1, R2.x, R1.xyzx;
ADD R2, R0.xyzx, R1.xyzx;
DP3 R3, R2.xyzx, R2.xyzx;
RSQ R3, R3.x;
MUL R2, R3.x, R2.xyzx;
DP3 R2, R1.xyzx, R2.xyzx;
MAX R2, c[3].z, R2.x;
MOV R2.z, c[3].y;
MOV R2.w, c[3].y;
LIT R2, R2;
...
Cg
COLOR cPlastic = Ca + Cd * dot(Nf, L) + Cs * pow(max(0, dot(Nf, H)), phongExp);
or PhongShader
Cg uses separate vertex and fragment programs
[Diagram: Application → Vertex Processor (vertex program) → Assembly & Rasterization → Fragment Processor (fragment program, Textures) → Framebuffer Operations → Framebuffer]
Characteristics of NV30 & Cg
• Characteristics of GPU:
  – Optimized for 4-vector arithmetic
  – Cg has vector data types and operations, e.g. float2, float3, float4
  – Cg also has matrix data types, e.g. float3x3, float3x4, float4x4
• Some math:
  – Sin/cos/etc.
  – Normalize
  – Dot product: dot(v1,v2);
  – Matrix multiply:
    • matrix-vector: mul(M, v); // returns a vector
    • vector-matrix: mul(v, M); // returns a vector
    • matrix-matrix: mul(M, N); // returns a matrix
Innermost loop in C: computation of LJ and Coulomb interactions.
for (k = nj0; k < nj1; k++) {    // loop over indices in neighborlist
  jnr = jjnr[k];                 // get index of next j atom (array LOAD)
  j3  = 3*jnr;                   // calc j atom index in coord & force arrays
  jx  = pos[j3];                 // load x,y,z coordinates for j atom
  jy  = pos[j3+1];
  jz  = pos[j3+2];
  qq  = iq*charge[jnr];          // load j charge and calc. product
  dx  = ix - jx;                 // calc vector distance i-j
  dy  = iy - jy;
  dz  = iz - jz;
  rsq    = dx*dx+dy*dy+dz*dz;    // calc square distance i-j
  rinv   = 1.0/sqrt(rsq);        // 1/r
  rinvsq = rinv*rinv;            // 1/(r*r)
  vcoul  = qq*rinv;              // potential from this interaction
  fscal  = vcoul*rinvsq;         // scalar force/|dr|
  vctot += vcoul;                // add to temporary potential variable
  fix   += dx*fscal;             // add to i atom temporary force variable
  fiy   += dy*fscal;             // F = dr*scalarforce/|dr|
  fiz   += dz*fscal;
  force[j3]   -= dx*fscal;       // subtract from j atom forces
  force[j3+1] -= dy*fscal;
  force[j3+2] -= dz*fscal;
}
Example: MD
Inner loop in Cg:

/* Find the index and coordinates of j atom */
jnr = f4tex1D (jjnr, k);

/* Get the atom position */
j1 = f3tex1D(pos, jnr.x);
j2 = f3tex1D(pos, jnr.y);
j3 = f3tex1D(pos, jnr.z);
j4 = f3tex1D(pos, jnr.w);
We compute four interactions at a time to take advantage of the high performance of vector arithmetic.

We fetch the coordinates of the atoms: the data is stored as a texture.
/* Get the vectorial distance, and r^2 */
d1 = i - j1;
d2 = i - j2;
d3 = i - j3;
d4 = i - j4;

rsq.x = dot(d1, d1);
rsq.y = dot(d2, d2);
rsq.z = dot(d3, d3);
rsq.w = dot(d4, d4);

/* Calculate 1/r */
rinv.x = rsqrt(rsq.x);
rinv.y = rsqrt(rsq.y);
rinv.z = rsqrt(rsq.z);
rinv.w = rsqrt(rsq.w);
Computing the square of the distance: we use the built-in dot product for float3 arithmetic.

Built-in function: rsqrt.
/* Calculate interactions */
rinvsq  = rinv * rinv;
rinvsix = rinvsq * rinvsq * rinvsq;

Highly efficient float4 arithmetic.

vnb6   = rinvsix * temp_nbfp;
vnb12  = rinvsix * rinvsix * temp_nbfp;
vnbtot = vnb12 - vnb6;

qq    = iqA * temp_charge;
vcoul = qq*rinv;
fs    = (12f * vnb12 - 6f * vnb6 + vcoul) * rinvsq;
vctot = vcoul;

/* Calculate vectorial force and update local i atom force */
fi1 = d1 * fs.x;
fi2 = d2 * fs.y;
fi3 = d3 * fs.z;
fi4 = d4 * fs.w;
This is the force computation
ret_prev.fi_with_vtot.xyz += fi1 + fi2 + fi3 + fi4;
ret_prev.fi_with_vtot.w   += dot(vnbtot, float4(1, 1, 1, 1))
                           + dot(vctot,  float4(1, 1, 1, 1));
Return type is:
struct inner_ret { float4 fi_with_vtot; };
Contains x, y and z coordinates of force and total energy.
Computing total force due to 4 interactions
Computing total potential energy for this particle
Conclusion
• Three representative applications show high bandwidth ratios: StreamMD, StreamFLO, StreamFEM.
• Feasibility of streaming established for scientific applications: high arithmetic intensity; the bandwidth hierarchy is sufficient.
• Available today: NVIDIA NV30 graphics card.
• Future work:
  – StreamMD to GROMACS (Folding@Home)
  – StreamFEM and StreamFLO to 3D
  – Multinode versions of all applications
  – Sparse solvers for implicit time-stepping
  – Adaptive meshing
  – Numerics