1
Comparison of MD Simulator and Supercomputers Per Board 1 Pentium + 1GB DRAM 4 FPGAs + 24 MB SRAM Performance 1X 40X Cost ~$1500 ~$1500 Power 106W 40W Metric Improvement of MD System Performance/Power 100X Improvement Performance/Cost 40X Improvement Performance/Space 40X Improvement FPGA-Based Supercomputing: An Implementation for Molecular Dynamics Ian Kuon, Navid Azizi, Ahmad Darabiha,Aaron Egier, and Paul Chow Deparment of ECE, University of Toronto, Toronto, Ontario Canada {ikuon,nazizi,ahmadd,aegier,pc}@eecg.utoronto.ca System Overview Lennard-Jones Force Generator Uses lower bits of r 2 to multiply with slope to interpolate Adds function value to interpolated amount to obtain pseudo-acceleration Not true acceleration because it includes division by r and multiplication by dt , which were included in the look up table to simplify hardware Final multiplication by distance components to obtain acceleration components Receives distance from PG Uses middle-order bits of r 2 to lookup value and slope of LJ potential function Since LJ potential does not change much at large distances, the higher-order bits are only used to limit the maximum distance Allows for finer granularity of lookup where function is rapidly changing Molecular Dynamics Simulates atom and molecule interactions with Newtonian mechanics Applications - defect analysis, protein folding Any interesting volume has far too many particles to simulate Use periodic boundary conditions to reduce number of particles that must be tracked a cube of particles is simulated cube is replicated in every dimension a particle on the edge of the cube interacts with its neighbour on the other side of the cube Simulation Process Model intermolecular forces using Lennard- Jones potential Forces must be calculated for each pairwise interaction between particles Acceleration due to net force is calculated using Newton's second law F=ma Must perform time integration to get position and velocity updates. Use Velocity Verlet Update rule to perform this Integration problem becomes O(n 2 ) for simplicity insignificant interactions can be discarded Previous Hardware no known previous implementations used programmable logic MODEL chip - floating point throughout MD-GRAPE - fixed and floating point design 76 MODEL chips are 50 times faster than a 200 MHz Sun Ultra 2 Communication with host computer slows simulation speed Pair Generator Validation Results Timestep (s) Relative Speedup to Software (10.7s) TM3 37 sec 0.29X Better Memory Organization Modern FPGA 2.1 sec 5.1X 0.51 sec 21X Multiple FPGAs 0.51/n sec 21*nX Verlet Update Uses acceleration to perform time integration and update velocity and position Conventionally two separate steps are performed to obtain a half timestep velocity For hardware implementation this is inefficient Instead a single combined operation is performed Multiplication is simplified by pre-multiplying acceleration in lookup table by the timestep Performed once per particle however, velocity stored in memory is a half timestep ahead of current time For every timestep in the simulation 1. PG examines a pair of particles and determines the distance between the pair. 2. LJFC computes the force between the pair 3. AU computes the total acceleration on a particle from the series of incremental forces. 4. VU uses the total acceleration to update the position and velocity PairGen LJFC(0) LJFC( n-1) LJFC(1) LJFC(k) AU(1) AA AU(0) AA AU(k) AU(n-1) AA AA VU(1) VA VU(0) VA VU(k) VU(n-1) VA VA P A i P A i P A i P A i PAj PAj PAj PAj Slope Mem Value Mem Value Mem Slope Mem Value Mem Slope Mem Value Mem Slope Mem subtract subtract a_in b_in Minimum force_dir overflow N-1 N-1 1 N N N N 1 N-1 0 1 P(t) V(t+½dt) P(t+dt) V(t-½dt) a(t) dt dt dt dt 2 Distance Squared FF...FF lookup_ offset lookup_ size high _bits Residual (used for interpolation) Address to lookup tables Scaling Factors and Precision of Various Fields Field Used in Block Scaling Factor Precision r PG 2 -64 38 r 2 PG, LJFC 2 -128 76 Slope LJGC 2 77 36 Function LJFC 2 16 47 Residual x slope LJFC 2 15 48 Pseudo-acceleration LJFC, AU 2 -14 50 Velocity VU 2 -15 51 Velocity after step 2 of Verlet VU 2 -15 51 Acceleration VU 2 -64 37 Velocity x dt VU 2 -64 37 Representation of t VU 2 -70 22 Comparison of Energy of TM3 vs Software Model Time Step Kinetic Energy (KJ/mol) Potential Energy (KJ/mol) Total Energy (KJ/mol) Initial 1794 293.4 2087 Software – End 2078 14.4 2093 Hardware -End 2068 13.52 2081 Comparison of MD Simulator and State-of-the-Art Processor Per Board 2.4GHz P4 + 1GB DRAM 4 FPGAs + 24 MB SRAM Performance 1X Min 40X Cost ~$1500 ~$1500 Power 106W 40W Metric Improvement of MD System Performance/Power 106X Improvement Performance/Cost 40X Improvement Performance/Space 40X Improvement + min(x1-x2, x2-x1) + min(y1-y2, y2-y1) 2 2 min(z1-z2, z2-z1) 2 Retrieves 3-D co-ordinates for all pairs of particles from memory. Computes the distance squared between all pairs of particles and the force direction. Distance squared between two particles calculated using the formula: By using unsigned numbers in the Minimum operation the Periodic Boundary Conditions are accounted for Software Compared against academic C-based simulator: MD3DLJ Potential and Kinetic Energy track Difference in energy for software and hardware was on the order of 1% RMS. Top Level VHDL Generator Uses the generated constants to create a wrapper Constants Package Generates shared constants for hardware and software Includes precision, scaling factors, number of particles A’ Replicated Box A’ Replicated Box A’ Replicated Box A’ Replicated Box A’ Replicated Box A’ Replicated Box A’ Replicated Box A’ Replicated Box Box being simulated B’ B’ B’ B’ B’ B’ B’ B’ A B PairGen (PG) Verlet Update (VU) Force Computer (LJFC) Acceleration Update Sun Workstation System Control Memory Function Value Memory Memory Controller Slope Memory Memory Controller Memory Controller ABS X X1 X2 X Y Z X*X Y*Y Z*Z DISTANCE SQUARED ABS Y Y1 Y2 ABS Z Z1 Z2 X*X + Y*Y + Z*Z X1-X2 X2-X1 Y1-Y2 Y2-Y1 Z1-Z2 Z2-Z1 Stage 1 Stage 2 Stage 3 Input Output 0 1 0 1 0 1 PARTICLE A PARTICLE B X DISTANCE Y DISTANCE Z DISTANCE FORCE DIRECTION DISTANCE SQUARED FF...FF PARTICLE A PARTICLE B FORCE DIRECTION ACCEL X SLOPE VALUE DISTANCE SQUARED LOOKUP Input 1 Stage 1 Stage 2 Stage 3 Output Input 2 ACCEL Y ACCEL Z DISTANCE SQUARED LOOKUP Calculate Time Step One Time Step Complete Select Particle 1 Velocity Verlet Update Retrieve Particle 2 Store New Information Compute Force with Lennard- Jones potential Retrieve particle parameters No more new particle 1's? F No more new particle 2's? F T Separation<Cut-off Threshold F T T Energy (kJ/mol) Timestep -500 0 500 1000 1500 2000 2500 0 100 200 300 400 500 600 700 TM3 Implementation -Total Software Implementation - Total TM3 Implementation - Kinetic Software Implementation - Kinetic TM3 Implementation - Potential Software Implementation - Potential F= - (r= r LJ (r= = 4 - r 6 r 12 LJ 2 t + v = v(t) + a(t) t 2 t t 2 t - v v(t) = + a(t) 2 t 2 2 t r(t+ t)=r(t)+ tv(t)+ a(t) Supercomputers = many P4-like processors Why not FPGA processors instead? The TM3 project has been made possible by support from Micronet and Xilinx.

FPGA-Based Supercomputing: An Implementation for Molecular ...pc/research/publications/fpgamd.fpg… · FPGA-Based Supercomputing: An Implementation for Molecular Dynamics Ian Kuon,

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: FPGA-Based Supercomputing: An Implementation for Molecular ...pc/research/publications/fpgamd.fpg… · FPGA-Based Supercomputing: An Implementation for Molecular Dynamics Ian Kuon,

Comparison of MD Simulator and Supercomputers

Per Board 1 Pentium +

1GB DRAM 4 FPGAs +

24 MB SRAM Performance 1X 40X Cost ~$1500 ~$1500 Power 106W 40W Metric Improvement of MD System Performance/Power 100X Improvement Performance/Cost 40X Improvement Performance/Space 40X Improvement

FPGA-Based Supercomputing: AnImplementation for Molecular Dynamics

Ian Kuon, Navid Azizi, Ahmad Darabiha,Aaron Egier, and Paul ChowDeparment of ECE, University of Toronto, Toronto, Ontario Canada

{ikuon,nazizi,ahmadd,aegier,pc}@eecg.utoronto.ca

System Overview

Lennard-Jones Force Generator Uses lower bits of r2 to multiply with slope to

interpolate Adds function value to interpolated amount to

obtain pseudo-acceleration

Not true acceleration because it includes divisionby r and multiplication by dt, which were includedin the look up table to simplify hardware

Final multiplication by distance components toobtain acceleration components

Receives distance from PG Uses middle-order bits of r2 to lookup value and

slope of LJ potential function Since LJ potential does not change much at large

distances, the higher-order bits are only used tolimit the maximum distance Allows for finer granularity of lookup where function is

rapidly changing

Molecular Dynamics Simulates atom and molecule interactions with

Newtonian mechanics Applications - defect analysis, protein folding Any interesting volume has far too many particles

to simulate Use periodic boundary conditions to reduce

number of particles that must be tracked

a cube of particles is simulated cube is replicated in every dimension a particle on the edge of the cube interacts with

its neighbour on the other side of the cube

Simulation Process

Model intermolecular forces using Lennard-Jones potential

Forces must be calculated for each pairwiseinteraction between particles

Acceleration due to net force is calculatedusing Newton's second law F=ma

Must perform time integration to get positionand velocity updates.

Use Velocity Verlet Update rule to perform thisIntegration

problem becomes O(n2)for simplicity insignificant interactions can be discarded

Previous Hardware no known previous implementations used

programmable logic MODEL chip - floating point throughout

MD-GRAPE - fixed and floating point design

76 MODEL chips are 50 times faster than a200 MHz Sun Ultra 2

Communication with host computer slowssimulation speed

Pair GeneratorValidation

Results

Timestep (s)Relative Speedup to Software (10.7s)

TM3

37 sec0.29X

BetterMemory Organization

Modern FPGA

2.1 sec5.1X

0.51 sec21X

Multiple FPGAs

0.51/n sec21*nX

Verlet Update

Uses acceleration to perform time integration andupdate velocity and position

Conventionally two separate steps are performed toobtain a half timestep velocity

For hardware implementation this is inefficient Instead a single combined operation is performed

Multiplication is simplified by pre-multiplyingacceleration in lookup table by the timestep

Performed once per particle

however, velocity stored in memory is a half timestepahead of current time

For every timestep in the simulation 1. PG examines a pair of particles and determines the distance between the pair. 2. LJFC computes the force between the pair 3. AU computes the total acceleration on a particle from the series of incremental forces. 4. VU uses the total acceleration to update the position and velocity

PairGen

LJFC(0)

LJFC(n-1)

LJFC(1)

LJFC(k)

AU(1)

AA

AU(0)

AA

AU(k)

AU(n-1)

AA

AA

VU(1)

VA

VU(0)

VA

VU(k)

VU(n-1)

VA

VA

PA

iP

Ai

PA

iP

Ai

PA

jP

Aj

PA

jP

Aj

Slope Mem

Value Mem

Value Mem

Slope Mem

Value Mem

Slope Mem

Value Mem

Slope Mem

subtract

subtract

a_in

b_in

Minimum

force_dir

overflow

N-1

N-1

1N

N

N

N

1

N-1

0

1

P(t)

V(t+½dt)

P(t+dt)

V(t-½dt)

a(t)

dt

dt

dt

dt2

DistanceSquared

FF...FF

lookup_offset

lookup_size

high_bits

Residual(used for

interpolation)

Address tolookup tables

Scaling Factors and Precision of Various Fields Field Used in Block Scaling Factor Precision r PG 2-64 38 r2 PG, LJFC 2-128 76 Slope LJGC 277 36 Function LJFC 216 47 Residual x slope LJFC 215 48 Pseudo-acceleration LJFC, AU 2-14 50 Velocity VU 2-15 51 Velocity after step 2 of Verlet VU 2-15 51 Acceleration VU 2-64 37 Velocity x dt VU 2-64 37 Representation of t VU 2-70 22

Comparison of Energy of TM3 vs Software Model Time Step

Kinetic Energy (KJ/mol)

Potential Energy (KJ/mol)

Total Energy (KJ/mol)

Initial 1794 293.4 2087 Software – End 2078 14.4 2093 Hardware -End 2068 13.52 2081

Comparison of MD Simulator and State-of-the-Art Processor

Per Board 2.4GHz P4 + 1GB DRAM 4 FPGAs + 24 MB

SRAM Performance 1X Min 40X Cost ~$1500 ~$1500 Power 106W 40W

Metric Improvement of MD System Performance/Power 106X Improvement Performance/Cost 40X Improvement Performance/Space 40X Improvement

+min(x1-x2, x2-x1) + min(y1-y2, y2-y1)2 2 min(z1-z2, z2-z1)2

Retrieves 3-D co-ordinates for all pairsof particles from memory.

Computes the distance squared between all pairsof particles and the force direction.

Distance squared between two particlescalculated using the formula:

By using unsigned numbers in the Minimumoperation the Periodic Boundary Conditions areaccounted for

Software

Compared against academic C-based simulator:MD3DLJ

Potential and Kinetic Energy track

Difference in energy for software and hardwarewas on the order of 1% RMS.

Top Level VHDL GeneratorUses the generated constants to create a wrapper

Constants PackageGenerates shared constants for hardware and software

Includes precision, scaling factors, number of particles

A’

Replicated Box

A’

Replicated Box

A’

Replicated Box

A’

Replicated Box

A’

Replicated Box

A’

Replicated Box

A’

Replicated Box

A’

Replicated Box

Box being simulated

B’

B’

B’

B’

B’

B’

B’

B’

AB

PairGen(PG)

VerletUpdate

(VU)

ForceComputer

(LJFC)

AccelerationUpdate

Sun Workstation

SystemControl

Memory

FunctionValue

Memory

MemoryController

SlopeMemory

MemoryController

MemoryController

ABS X X1

X2

X

Y

Z

X*X

Y*Y

Z*Z

DISTANCESQUARED

ABS Y Y1

Y2

ABS Z Z1

Z2

X*X + Y*Y+ Z*Z

X1-X2

X2-X1

Y1-Y2

Y2-Y1

Z1-Z2

Z2-Z1

Stage 1 Stage 2 Stage 3Input Output

0

1

0

1

0

1

PARTICLE A

PARTICLE B

X DISTANCE

Y DISTANCE

Z DISTANCE

FORCEDIRECTION

DISTANCESQUARED

FF...FF

PARTICLE A

PARTICLE B

FORCEDIRECTION

ACCEL X

SLOPE

VALUE

DISTANCESQUAREDLOOKUP

Input 1 Stage 1 Stage 2 Stage 3 OutputInput 2

ACCEL Y

ACCEL Z

DISTANCESQUAREDLOOKUP

Calculate TimeStep

One Time StepComplete

Select Particle 1

Velocity VerletUpdate

Retrieve Particle2

Store NewInformation

Compute Forcewith Lennard-

Jones potential

Retrieve particleparametersNo more new

particle 1's?F

No more newparticle 2's?

FT

Separation<Cut-offThreshold

F

T

T

En

erg

y (k

J/m

ol)

Timestep-500

0

500

1000

1500

2000

2500

0 100 200 300 400 500 600 700

TM3 Implementation -Total Software Implementation - TotalTM3 Implementation - Kinetic Software Implementation - KineticTM3 Implementation - Potential Software Implementation - Potential

F= - (r)r LJ

(r) = 4 - r

6

r

12

LJ

2t +v = v(t) + a(t)t2t t

2t -vv(t) = + a(t)2t

2

2tr(t+ t)=r(t)+ tv(t)+ a(t)

Supercomputers = many P4-like processorsWhy not FPGA processors instead?

The TM3 project has been made possible by support from Micronet and Xilinx.