Accelerating numerically intensive life science codes for molecular and quantum mechanics. Real Science, Real Numbers, Real Software. Simon McIntosh-Smith, VP Customer Applications, simon@clearspeed.com. MRSC08 conference, April 2nd 2008. Copyright © 2008 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com


Page 1: Accelerating numerically intensive life science codes for molecular and quantum mechanics

Real Science, Real Numbers, Real Software
Simon McIntosh-Smith, VP Customer Applications
simon@clearspeed.com
MRSC08 conference, April 2nd 2008

Page 2: Agenda

• ClearSpeed introduction
• Accelerator overview
• Software tools description
• Accelerated application success stories
• Future developments
• Summary

Page 3: ClearSpeed Introduction

• Founded in 2002, we are a UK company, based in Bristol, with offices in San Jose, CA
• We are driven by the principle that to deliver HIGH PERFORMANCE you have to deliver HIGH POWER EFFICIENCY as systems become more space, power supply and cooling constrained
• Therefore we deliver the World’s most power efficient, high-performance processors, with supporting subsystems, software development tools, libraries and applications
• We provide solutions for both the High Performance Computing (HPC) and embedded systems markets
• Partnering with HP, IBM, SGI, Sun and other OEMs to deliver in systems

Page 4: ClearSpeed accelerator overview

Page 5: ClearSpeed’s accelerator architecture

• Architecture designed for Coarse-Grained Data Parallel Processing:
  – Achieves high performance, low power
  – Multi-threading enables asynchronous, overlapped I/O with compute
  – Scalable array of many Processor Elements (PEs)
  – Includes enterprise-class reliability features necessary for HPC, such as ECC on memories, spare PEs etc.
• Programmed in an extended version of ANSI C called Cn:
  – Rich expressive semantics
  – Single “poly” data type modifier

[Block diagram: MTAP. Labels: Mono Controller, Poly Controller, Instruction Cache, Data Cache, Control and Debug, PE 0, PE 1, ..., PE n-1, Programmable I/O to DRAM, Peripheral Network, High Speed Bus (two)]
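The "poly" modifier mentioned above can be pictured with a small emulation. This is a Python sketch and not Cn syntax or any ClearSpeed API: the names `Poly`, `NUM_PES` and `pe_numbers` are hypothetical. It assumes the usual SIMD reading, in which a mono value exists once and is shared, a poly value exists once per PE, arithmetic runs on all PEs in lockstep, and a condition on a poly value masks PEs instead of branching.

```python
# Toy emulation of mono vs. poly values on a SIMD array.
# Hypothetical names throughout; this is not Cn and not a ClearSpeed API.

NUM_PES = 8  # assumed small array size for illustration (a CSX600 has 96 PEs)


class Poly:
    """One value per processing element (PE)."""

    def __init__(self, values):
        if len(values) != NUM_PES:
            raise ValueError("a poly value needs one entry per PE")
        self.values = list(values)

    @staticmethod
    def broadcast(mono_value):
        """Copy a mono (single, shared) value to every PE."""
        return Poly([mono_value] * NUM_PES)

    def _lift(self, other):
        # A mono operand is implicitly broadcast before a poly operation.
        return other if isinstance(other, Poly) else Poly.broadcast(other)

    def __add__(self, other):
        other = self._lift(other)
        return Poly([a + b for a, b in zip(self.values, other.values)])

    def __mul__(self, other):
        other = self._lift(other)
        return Poly([a * b for a, b in zip(self.values, other.values)])

    def where(self, mask, other):
        """Masked update: PEs whose mask is True take `other`, the rest keep
        their current value. This mimics a SIMD `if` on a poly condition."""
        other = self._lift(other)
        return Poly([b if m else a
                     for a, b, m in zip(self.values, other.values, mask)])


def pe_numbers():
    """Each PE sees its own index, a common way to divide work."""
    return Poly(range(NUM_PES))


if __name__ == "__main__":
    scale = 10                      # mono: one copy, shared by all PEs
    x = pe_numbers() * scale + 1    # poly: every PE computes its own value
    even = [v % 2 == 0 for v in pe_numbers().values]
    y = x.where(even, 0)            # only even-numbered PEs are updated
    print(x.values)
    print(y.values)
```

The masked `where` is the part worth noticing: on a lockstep array all PEs step through the same instructions, so a data-dependent branch turns into enabling and disabling individual PEs.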

Page 6: Processing Element (PE) architecture

• Multiple execution units per PE
  – Floating point adder (32 & 64-bit IEEE 754)
  – Floating point multiplier (32 & 64-bit IEEE 754)
  – Fixed-point MAC 16x16 → 32+64
  – Integer ALU with shifter
  – Load/store
• High-bandwidth, 5-port register file per PE
• Fast inter-PE communication path (swazzle)
• Closely coupled SRAM for data
  – Keeping data close is key to low power
• Per PE address generators & DMA (PIO)
  – Complete pointer model, including parallel pointer chasing and vectors of addresses
  – Key for gather/scatter and sparse operations

[Block diagram: PE n, with neighbours PE n–1 and PE n+1. Labels: FP Mul, FP Add, MAC, ALU, Register File (128 Bytes), PE SRAM (e.g. 6 KBytes), Memory address generator (PIO), PIO Collection & Distribution; data path widths marked 32, 64 and 128]

Page 7: The CSX600 Processor

• Array of 96 Processor Elements
• Brute force approach:
  – Simple to use
  – Power efficient
• 64-bit and 32-bit floating point
  – Also integer processing
• 210 MHz: key to low power and therefore to absolute performance
• Integrated DDR2 memory controller
• ~1 TB/sec internal bandwidth
  – At the register file
• 128 million transistors
• Low Power, Approx 10 Watts

Page 8: The ClearSpeed Advance™ e620 accelerator board

• An Enterprise-class HPC accelerator
  – Designed to fit into existing servers, even 1U. Single slot PCI Express x8 card
• Low power consumption, small, light
  – Designed for high stability and reliability (MTBF)
    • Board DRAM (1 GByte) is error protected and no moving parts (e.g. fans) are required
• 80.64 Double Precision (D.P.) IEEE 754 GFLOPS peak
  – R∞ ≈ 66 GFLOPS for 64-bit matrix multiply (DGEMM) calls
  – Hardware also supports 32-bit floating point and integer calculations
• Over 1 GByte/s between accelerator and host
• 32/64-bit drivers for Linux (Red Hat and Suse) and Windows
• 250 grams, 15 cm long, 35 watts for entire card (at socket)
• No extra power connectors, cooling or space (slots) required
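The 80.64 GFLOPS peak figure can be reproduced from numbers elsewhere in the deck. The sketch below is a back-of-envelope check, not a vendor formula: it takes 96 PEs and 210 MHz from the CSX600 slide, two CSX600 per board from the "120 boards, 240 processors" line in the BUDE results, and assumes that each PE completes one floating point add and one multiply per cycle (the PE has a separate adder and multiplier, but the per-cycle rate is an assumption).

```python
# Back-of-envelope check of the board's quoted peak rate.
# Figures marked "slides" come from the deck; the rest are assumptions.

PES_PER_CSX600 = 96         # slides: array of 96 Processor Elements
CLOCK_MHZ = 210             # slides: 210 MHz
CSX600_PER_BOARD = 2        # slides: 120 boards hold 240 CSX600 processors
FLOPS_PER_PE_PER_CYCLE = 2  # assumption: one FP add plus one FP multiply


def peak_mflops(chips, pes, clock_mhz, flops_per_cycle):
    """Peak rate in MFLOPS, kept in integers to avoid rounding noise."""
    return chips * pes * clock_mhz * flops_per_cycle


if __name__ == "__main__":
    board = peak_mflops(CSX600_PER_BOARD, PES_PER_CSX600, CLOCK_MHZ,
                        FLOPS_PER_PE_PER_CYCLE)
    print(f"peak: {board / 1000:.2f} GFLOPS")         # 80.64 GFLOPS
    # DGEMM efficiency relative to peak, using the quoted ~66 GFLOPS
    print(f"DGEMM efficiency: {66_000 / board:.0%}")  # about 82%
```

Under these assumptions the quoted DGEMM rate of about 66 GFLOPS corresponds to roughly 82% of peak.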

Page 9: Ultra dense systems: 1 TFLOP 64-bit in 1U

Low power means you can build very dense systems to achieve the highest performance:

• 1U standard server
  – Intel 5365 3.0GHz
  – 2-socket, quad core
  – 0.096 Double Precision (D.P.) TFLOPS peak
  – ~600 watts
  – 36 servers → ~3.5 TFLOPS peak in a 25 kW rack
• ClearSpeed Accelerated Terascale System (CATS)
  – 24 CSX600 processors
  – ~1 D.P. TFLOPS peak
  – ~600 watts
  – 36 CATS → ~35 TFLOPS peak in a 25 kW rack
• 10X speedup for the same power, space and cooling
  – Or a 90% reduction in energy used and CO2 emitted for the same compute

[Photo label: 2 PCIe x8 cables]
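The rack-level numbers follow from simple multiplication, sketched here. Unit counts, clock speeds and the ~600 W figure are from the slides; the per-cycle FLOP rates (4 double-precision FLOPs per x86 core, 2 per PE) are assumptions chosen because they reproduce the quoted 0.096 TFLOPS and ~1 TFLOPS peaks.

```python
# Rack-level comparison using the figures quoted on the slide.
# The per-cycle FLOP rates are assumptions, not quoted values.

SERVERS_PER_RACK = 36


def x86_server_gflops(sockets=2, cores=4, ghz=3.0, flops_per_cycle=4):
    # assumption: 4 double-precision FLOPs per core per cycle
    return sockets * cores * ghz * flops_per_cycle


def cats_gflops(processors=24, pes=96, mhz=210, flops_per_cycle=2):
    # assumption: each PE retires one FP add and one FP multiply per cycle
    return processors * pes * mhz * flops_per_cycle / 1000.0


if __name__ == "__main__":
    x86_rack = SERVERS_PER_RACK * x86_server_gflops() / 1000.0
    cats_rack = SERVERS_PER_RACK * cats_gflops() / 1000.0
    rack_kw = SERVERS_PER_RACK * 600 / 1000.0  # ~600 W per 1U unit (slides)
    print(f"x86 rack : {x86_rack:.1f} TFLOPS peak")
    print(f"CATS rack: {cats_rack:.1f} TFLOPS peak")
    print(f"power    : {rack_kw:.1f} kW per rack")
    print(f"ratio    : {cats_rack / x86_rack:.1f}x")
```

Both racks draw about the same power in this estimate, so the peak-performance ratio is also the performance-per-watt ratio.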

Page 10: CATS 1 TFLOP in 1U architecture

[Photo labels:]
• Twelve Advance e620 boards (two layers of six)
• Power supply
• PCI switching
• Cooling fans
• PCIe connectors

Page 11: CATS launched at SC07 in Reno

• ~12 TFLOPS 64-bit in 12U (21”)
• 144 GB of memory
• 921 GB/s memory bandwidth
• Applications shown:
  – Molpro (Quantum Mechanics)
  – Amber (Molecular Dynamics)
  – BUDE (Drug Docking)
  – Sire (QM/MM)
  – Monte Carlo (Financial Application)
• Sustained over 2 TFLOPS on Molpro
• Ran off battery power for 5 minutes during a power cut!

Page 12: ClearSpeed Development Environment

Page 13: ClearSpeed development environment

• Version 3.0 just released
• Cn optimising compiler
  – C with poly extension for SIMD datatypes
  – Uses ACE CoSy compiler development system
• Assembler, linker, simulators
• Debugger – a port of gdb
  – Runs on ClearSpeed’s hardware at full speed
• Profiling – csprof
  – Heterogeneous, system-wide visualisation of an accelerated application’s performance while running on both a multi-core host and ClearSpeed accelerators. ClearSpeed’s hardware can be profiled in real-time
• Libraries (BLAS, RNG, FFT, more…) & High level APIs
• Preview of an ECLIPSE IDE
• Documentation, training materials
• Available for Windows and Linux (Red Hat and SLES)

Page 14: Compiler improvements in the 3.0 release

[Bar chart: "Opt Level 4". Vertical axis from 0 to 2.5. Legend: 2.51, 3.00 latest. Benchmarks: amber, american_option_pde_mono, american_option_pde_poly, asian_option_mono, asian_option_poly, asian_option_vector, blackscholes_mono, blackscholes_poly, blackscholes_vector, broadieglasserman_mono, broadieglasserman_poly, broadieglasserman_vector, docking_compute, euro_option, euro_option_binomial_mono, euro_option_binomial_poly, euro_option_binomial_vector, logp, mandelbrot_poly, mersenne_twister, sinp, square]

Represents an average of 15-20% performance improvement over the previous release

Page 15: Eclipse IDE now available in preview form

• Integrated development environment supporting command line CSX development tools.
• Based on the industry standard Eclipse platform.
• Supports development of application code using the ClearSpeed Cn language.
• Can be used as a plug-in with other existing Eclipse-based tools to provide an IDE for heterogeneous software development.

Page 16: ClearSpeed csgdb/ddd debugger

[Screenshot callouts:]
• Real time plot of contents of PE memory
• Cn source level: breakpoints, watchpoints, single step
• Disassembly: breakpoints, watchpoints, single step
• Register contents
• On-chip vector contents displayed

Page 17: Eclipse IDE – CSX Debug Perspective

• Standard Eclipse graphical debug interface for CSX processor debugging.
• CSX processor provides full hardware debugging of application code.
• Provides seamless view of all 96 processor cores and the associated state.
• Allows full symbolic debug of the Cn language.
• Enhanced views for CSX specific information.

Page 18: ClearSpeed profiler for heterogeneous multi-processor systems

[Diagram labels: Advance™ Accelerator Board, CSX 600 Pipeline, Host CPU(s)]

• HOST/BOARD INTERACTION: View host/board interactions. Provides performance information for data transfer operations. Trace cluster node/board interaction. See overlap of host compute and board compute.
• CSX PIPELINE: View detailed instruction issue information. Visualize overlap of executing instructions. Optimize code at the instruction level. View instruction level performance bottlenecks. Get accurate instruction timing.
• CSX SYSTEM: View system level trace. Visually inspect the overlap of compute and I/O. Visualize cache utilization. View branch trace of code executing. Find and analyse performance bottlenecks. Get accurate event timing.
• HOST CODE PROFILING: Visually inspect host code executing. Supports multiple threads and processes. Time specific code sections. See overlap of host threads executing. Platform and processor agnostic trace collection.

Page 19: Developer support

• Lots of online support and self-training materials:
  – http://developer.clearspeed.com/resources/training/
    • Self-paced training for programmers, includes optimisation tips
  – http://developer.clearspeed.com
    • Online manuals, training materials, forums, support
  – http://support.clearspeed.com
    • All the latest software downloads, including example codes

Page 20: Porting life science applications

Page 21: Life science codes are good targets for accelerators

• Why is this?
  – They usually contain massive parallelism that can be exploited
  – Often data parallel
• Scientists in the fields of computational chemistry, biology and materials science want:
  – Simulations of larger systems (more atoms etc.)
  – Longer simulations (e.g. protein folding)
  – More accurate simulations, usually requiring more computationally expensive methods
• ClearSpeed has focused on life science codes requiring floating point operations:
  – Molecular Dynamics codes such as Amber
  – Quantum Chemistry codes such as Molpro

Page 22: Accelerated applications

• Molpro electronic molecular structure code
  – Collaboration with Dr Fred Manby’s group at Bristol University’s Centre for Computational Chemistry
• Sire QM/MM free energy code (uses Molpro)
  – Collaboration with Dr Christopher Woods
• BUDE molecular dynamics-based drug docking
  – Collaboration with Dr Richard Sessions at Bristol University’s Department of Biochemistry
• Amber 9 implicit (production code)
  – Molecular dynamics simulation with implicit solvent
• Amber 9 explicit (beta)
  – Molecular dynamics simulation with explicit solvent

Page 23: Experiences porting the Molpro/Sire Quantum Mechanics code

• Focused on the Density Functional Theory (DFT) part of Molpro: http://developer.clearspeed.com/resources/training/
• Most of the run-time is spent in the exchange and correlation kernels. These use quadrature on a grid to evaluate integrals. The computational cost of both scales cubically with the size of the problem:
  – Building the density: [equation not preserved in the extracted text]
  – Building the contribution to the Fock matrix: [equation not preserved in the extracted text]
  – Both can be expressed as large DGEMM + matrix-vector ops
  – Use blocked DGEMM for maximum performance
• It was also possible to overlap this computation on the accelerator with other Coulomb computation on the host at the same time (heterogeneous computation)
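To make "blocked DGEMM" concrete, here is a minimal tiled matrix multiply in pure Python. It is a teaching sketch, not the ClearSpeed BLAS or Molpro code: the function names and the tile size are illustrative assumptions, and a real DGEMM would come from a tuned library.

```python
# Minimal blocked (tiled) matrix multiply, C = A @ B, in pure Python.
# Illustrative only: a production DGEMM lives in a tuned BLAS library.

def blocked_matmul(a, b, block=4):
    """Multiply an n x k matrix by a k x m matrix, one tile at a time.

    Working on block x block tiles keeps each tile's operands small enough
    to stay in fast local memory while they are reused.
    """
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):            # tile rows of C
        for j0 in range(0, m, block):        # tile columns of C
            for p0 in range(0, k, block):    # tiles along the inner dimension
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ip * row_b[j]
    return c


def naive_matmul(a, b):
    """Reference triple loop used to check the blocked version."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


if __name__ == "__main__":
    a = [[float(i + j) for j in range(5)] for i in range(6)]      # 6 x 5
    b = [[float(i * j + 1) for j in range(7)] for i in range(5)]  # 5 x 7
    assert blocked_matmul(a, b, block=2) == naive_matmul(a, b)
    print("blocked result matches the naive product")
```

The arithmetic is identical to the naive loop. Only the traversal order changes, so that each tile is reused while it is still close to the execution units, which matters on hardware with small per-PE memories.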

Page 24: Accelerated Molpro/Sire Results

• Modeling new Neuraminidase inhibitors to vaccinate against the Influenza virus
• Sustaining >220 double precision GFLOPS on the Molpro application per CATS node
• With 10 CATS nodes → 2.2 TFLOPS sustained in a single rack at SC07 (DGEMM at ~7 TFLOPS!)
• Executed ~60 x 10^15 64-bit floating point operations on Molpro in ~17 hours spread over 3 days during SC07
• 7.3X speedup per CATS node across whole app
  – Accelerated portion is ~17X faster
• Enables 1 ligand per day with QM/MM levels of accuracy: 55 QM atoms, ~1600 MM atoms, DFT BLYP VDZ

[Figure: Ligand bound to Neuraminidase active site]
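The two speedups quoted above (7.3X for the whole application, ~17X for the accelerated portion) can be related through Amdahl's law. The sketch below is an interpretation rather than anything stated on the slide: it assumes the run time splits cleanly into an accelerated part and an untouched part, and it ignores overlap between host and accelerator.

```python
# Amdahl's law applied to the speedups quoted on the slide.

def accelerated_fraction(overall_speedup, kernel_speedup):
    """Fraction of the original run time spent in the accelerated kernel.

    Solves 1/S = (1 - p) + p/s for p, where S is the whole-application
    speedup and s is the speedup of the accelerated portion.
    """
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / kernel_speedup)


def overall_speedup(fraction, kernel_speedup):
    """Whole-application speedup when `fraction` of the time is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


if __name__ == "__main__":
    p = accelerated_fraction(7.3, 17.0)
    print(f"accelerated fraction of run time: {p:.1%}")   # about 91.7%
    print(f"speedup limit for an infinitely fast kernel: {1 / (1 - p):.1f}x")
    print(f"aggregate: {10 * 220 / 1000:.1f} TFLOPS on 10 nodes")
```

Under these assumptions roughly 92% of the original run time was in the accelerated kernels, which caps the whole-application speedup at about 12X however fast the accelerator becomes.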

Page 25: Experiences porting the BUDE drug docking code

• Uses Molecular Dynamics to perform Monte Carlo-based drug docking
• Research into peptide-based protease inhibitors
• Multiple degrees of massive parallelism to exploit
  – Multiple potential drug candidates requiring millions of docking operations
  – Different orientations and configurations of flexible ligands mean that the energy of billions of poses must be calculated
• The fitness of a pose is evaluated by a computationally expensive, highly accurate atom-atom empirical free energy force field calculation
• The main kernel was responsible for >99% of the computational requirements but consisted of just 500 lines of FORTRAN source code
• Ported in a few man days and fully optimised in just 1-2 man weeks
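The structure of the problem, many independent poses each scored by an atom-atom energy sum, can be shown with a toy example. Everything here is a stand-in: the pair potential is a generic Lennard-Jones-style term rather than BUDE's empirical free energy force field, poses are rigid translations only, and the search is plain random sampling. The function names are hypothetical.

```python
# Toy pose scoring: NOT BUDE's force field, just a stand-in pair potential.
import math
import random


def pair_energy(r, r_min=1.5, depth=1.0):
    """Lennard-Jones-style 12-6 term with its minimum (-depth) at r = r_min."""
    ratio = r_min / max(r, 1e-6)   # clamp to avoid division by zero
    return depth * (ratio ** 12 - 2.0 * ratio ** 6)


def pose_energy(ligand, receptor, shift):
    """Sum the pair term over every ligand-atom / receptor-atom pair."""
    total = 0.0
    for lx, ly, lz in ligand:
        px, py, pz = lx + shift[0], ly + shift[1], lz + shift[2]
        for rx, ry, rz in receptor:
            total += pair_energy(math.dist((px, py, pz), (rx, ry, rz)))
    return total


def random_search(ligand, receptor, n_poses, seed=0, box=4.0):
    """Score n_poses random rigid translations and return the best one.

    Every pose is scored independently, which is what makes the problem
    data parallel: each PE (or core) could take a different pose.
    """
    rng = random.Random(seed)
    poses = [tuple(rng.uniform(-box, box) for _ in range(3))
             for _ in range(n_poses)]
    energies = [pose_energy(ligand, receptor, s) for s in poses]
    best = min(range(n_poses), key=energies.__getitem__)
    return poses[best], energies[best]


if __name__ == "__main__":
    receptor = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.5, 0.0)]
    ligand = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    shift, energy = random_search(ligand, receptor, n_poses=2000)
    print("best shift:", tuple(round(v, 2) for v in shift))
    print("best energy:", round(energy, 3))
```

The inner double loop over atom pairs plays the role of the small, hot kernel described above, and the outer list of poses is the part that spreads across processing elements.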

Page 26: Accelerated BUDE results

• Scales linearly across multiple CATS nodes
  – Ran on 10 CATS nodes simultaneously at SC07
  – → 120 ClearSpeed e620 accelerator boards
  – → 240 CSX600 processors
  – → 23,040 64-bit Processor Elements!
• 13.5x speedup per CATS node across the whole application compared to the latest 3.0GHz quad core CPUs
• A whole peptide library calculation took 18 hours on ten CATS nodes, compared to the 5 days it would have taken on a quad-core based x86 system of the same size and approximate power consumption
• The first set of real peptides based on simulations run on CATS has now been synthesized and is undergoing tests in the laboratory
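The hardware counts above multiply out from figures given elsewhere in the deck: twelve e620 boards per CATS and 96 PEs per CSX600. The two-processors-per-board figure is implied by the 120 to 240 step rather than stated directly. A short check:

```python
# Hardware counts behind the SC07 BUDE run, multiplied out.

CATS_NODES = 10          # slides: ran on 10 CATS nodes
BOARDS_PER_CATS = 12     # slides: twelve Advance e620 boards per CATS
CSX600_PER_BOARD = 2     # implied by 120 boards -> 240 processors
PES_PER_CSX600 = 96      # slides: array of 96 Processor Elements


def totals(nodes=CATS_NODES):
    """Return (boards, processors, processing elements) for `nodes` CATS."""
    boards = nodes * BOARDS_PER_CATS
    processors = boards * CSX600_PER_BOARD
    pes = processors * PES_PER_CSX600
    return boards, processors, pes


if __name__ == "__main__":
    boards, processors, pes = totals()
    print(boards, processors, pes)      # 120 240 23040
    # Wall-clock comparison quoted on the slide: 5 days vs 18 hours
    print(f"{5 * 24 / 18:.1f}x faster wall clock")  # 6.7x
```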

Page 27: Amber 9 sander application acceleration

• Accelerated Amber 9 sander implicit shipping now (download from ClearSpeed website)
• Amber 9 sander explicit available in beta soon
• On CATS, both explicit and implicit achieve 10 to 20X speedups
  – Saving 90-95% of the simulation time -> more simulations
  – Alternatively, enabling larger, more accurate simulations
• While reducing power consumption by 66%
• And increasing server room capacity by 300%
• Full-scale MD applications have been hard to port: complicated codes with lots of cases
• ClearSpeed source code will be included in the Amber 10 release by default
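The "90-95% of the simulation time" figure follows directly from the 10 to 20X speedups: a speedup of s leaves 1/s of the original run time. A short check, with a hypothetical helper name:

```python
def time_saved(speedup):
    """Fraction of the original run time saved by a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    for s in (10, 20):
        print(f"{s}x speedup -> {time_saved(s):.0%} of run time saved")
```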

Page 28: Other acceleratable applications

• Computational Fluid Dynamics, Smooth Particle Hydrodynamics methods
  – Collaborations with Jamil Appa at BAE Systems and Dr Graham Pullan and Tobias Brandvik at the Whittle Laboratory, Cambridge University
• Large matrix-matrix arithmetic operations
  – E.g. Electromagnetics, radar cross section etc.
• Image processing
• RADAR/SONAR applications (e.g. SAR)
• Star-P from Interactive Supercomputing
• MATLAB and Mathematica when performing large matrix operations, such as solving systems of linear equations

Page 29: Future

Page 30: Future developments – Software

• Ease of use improvements, e.g.:
  – ClearStack (near)
    • Makes talking between the host and card much simpler
    • Includes C++ object migration base class (beta, with partner)
• Heterogeneous programming environment (next)
  – “Exploiting Loop-Level Parallelism for SIMD Arrays using OpenMP”, IWOMP 2007 (Beijing)
• In general, programming model becoming more OpenMP-like
  – Cross-platform
  – Heterogeneous

Page 31: “Exploiting Loop-Level Parallelism for SIMD Arrays using OpenMP”

Page 32: Future developments – Hardware

• Next generation processor “Callanish” being released this year
• New boards and CATS based on Callanish
• Big increases in performance and performance per watt
• Binary compatible with current products since release 3.0 of the SDK
• Watch this space for more announcements!

Page 33: Summary

• ClearSpeed accelerators are designed for HPC
• Have focused on Enterprise-class features to enable reliable large-scale systems
• Real applications starting to prove accelerators are credible
  – E.g. real compounds synthesized from BUDE docking sims!
• ClearSpeed enables Petascale systems and beyond, in smaller form factors than previously possible and within existing infrastructure constraints
  – E.g. 100 TFLOPS double precision in 3-4 racks, small enough to fit in a department

Page 34: Upcoming events

• The next ClearSpeed User Group meeting will be held at the International Supercomputing Conference (ISC08) in Dresden, Germany on Monday June 16th
  – Inviting potential speakers to submit proposals now!
