Nvidia CUDA TM and AMD Stream TM (for Cosmology): Experiences So Far Steven Gratton, Institute of Astronomy, University of Cambridge. 29 October 2008

Nvidia CUDA™ and AMD Stream™ (for Cosmology):

Experiences So Far

Steven Gratton,

Institute of Astronomy,

University of Cambridge.

29 October 2008

Brief Outline (1)

• Why I got interested in GPGPU…

• Thinking about CUDA

• An almost-ideal test case to understand the early universe: random walks, random numbers and reductions…

• A trickier problem, for data analysis: Cholesky factorization

• Thoughts on optimization

Brief Outline (2)

• Trying AMD Stream

• Programming for a GPU cluster?

• Is it worth it?

The Motivation

• I remember reading about CUDA and Nvidia’s new GPUs back in early 2007

– Having waited hours for 16-core jobs to start on the local supercomputer, the idea of having 128 processors, all sitting in one’s own machine, seemed very enticing…

– CUDA seemed okay and fun enough for me to have a go…

Some support

• Received a mini-grant from the Foundational Questions Institute to look at GPGPU with CUDA for cosmology and provide a resources page; the latter is at:

http://www.ast.cam.ac.uk/~stg20/gpgpu/index.html

http://www.fqxi.org

I bought…

• 2x Nvidia 8800GTSs

• A Sun Ultra 40M2 Opteron workstation to host them (good PSU and multiple PCIe slots…)

The problem

• “Inflation” in the early universe…

• A crash course in cosmology:

– The universe is expanding (i.e. galaxies are moving away from other galaxies) and the radiation in it is cooling down

– Extrapolating backwards would lead to a “big bang”

– Modify this by having a new form of matter, the “inflaton”, early on

– The inflaton can fluctuate; indeed small fluctuations in it can be the seeds that under gravitational collapse lead to the galaxies etc. that we see today

– But we might also have large fluctuations of the inflaton across the early universe…

• Basically the whole universe could have undergone a random walk!

• So what happened on the average?

– What should we expect to see in our past?

– And with what spread?

Subtleties

• Starting from an initial inflaton value, some of the histories don’t stop inflating

– We may only want to look at the subset that do; a constrained random walk

• Different histories inflate by different factors

– We have to weight each history according to a function of the entire history

The Idea

• To simulate a whole load of histories numerically

• Find the average history (and spread)

Nvidia Concepts (1)

• The gpu sequentially executes one or more grids of threads. Each thread in a grid runs the same program or kernel.

• Threads in a grid are organised into blocks.

– Threads in a block can synchronize and communicate via fast shared memory.

– All threads can read and write anywhere in the GPU’s main global memory.

Nvidia Concepts (2)

• A thread block runs on a single multiprocessor. It is split into warps of 32 threads.

• A high-end gpu contains say 12-30 multiprocessors, each consisting of 8 processors. Each processor runs the same instruction 4 times in a row to process a warp.

• Ideally, a multiprocessor should run multiple warps

Nvidia CUDA Programming Guide, Nvidia

Nvidia Concepts (3)

• Write kernels in CUDA, basically C

• This gets compiled into a virtual assembly language, ptx (you can inspect this)

• ptx gets compiled into real card-specific code (officially, you can’t see this)

(but see decuda, http://www.cs.rug.nl/wladimir/decuda)

The Idea (2)

• Get each thread to simulate a history

• Form a partial average in each block and store to global memory

• Launch a new kernel to form the final average

• Transfer average history back to cpu
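
The four steps above can be sketched on the cpu as follows. This is only an illustration of the two-stage reduction, assuming unit gaussian steps and an unweighted average; the function names, the step size, and the omission of the weighting and of the stopping constraint are all assumptions, not the code used for the talk.

```python
import random


def simulate_history(rng, n_steps, step_sigma=1.0):
    """One random-walk history: the running sum of gaussian kicks (one per 'thread')."""
    value = 0.0
    history = []
    for _ in range(n_steps):
        value += rng.gauss(0.0, step_sigma)
        history.append(value)
    return history


def block_partial_sum(histories):
    """Sum a list of equal-length histories step by step (the per-block reduction)."""
    n_steps = len(histories[0])
    return [sum(h[t] for h in histories) for t in range(n_steps)]


def average_history(n_blocks, threads_per_block, n_steps, seed=0):
    """Stage 1: one partial sum per block. Stage 2: combine the partial sums."""
    rng = random.Random(seed)
    partials = []
    for _ in range(n_blocks):
        block = [simulate_history(rng, n_steps) for _ in range(threads_per_block)]
        partials.append(block_partial_sum(block))  # would be stored to global memory
    total = block_partial_sum(partials)            # the second "kernel"
    n_histories = n_blocks * threads_per_block
    return [s / n_histories for s in total]
```

Only the final list of `n_steps` averages would need to be copied back to the cpu; the individual histories never leave the card.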

Parallel RNG

• It is tricky to get good parallel streams of random numbers

• I wanted something fast and reasonably good (will check soon, honest)

• Each thread uses a Marsaglia “Multiply With Carry” generator, each with its own multiplier.

• Then used a log/trig Box-Muller transform to get a gaussian
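
A minimal sketch of the scheme just described, assuming a multiply-with-carry state held as a 32-bit value plus a carry in the upper bits. The helper names are hypothetical, and any multiplier passed in is the caller's choice; a real generator needs multipliers selected so that each stream has a long period.

```python
import math

MASK32 = 0xFFFFFFFF


def mwc_step(state, multiplier):
    """Advance a multiply-with-carry state: low 32 bits are the value, the rest the carry."""
    return multiplier * (state & MASK32) + (state >> 32)


def uniform_from_state(state):
    """Map the low 32 bits of the state to a float in (0, 1], so log() is always safe."""
    return ((state & MASK32) + 1) / 4294967296.0


def gaussian_pair(state, multiplier):
    """Draw two uniforms from one thread's generator and apply the log/trig Box-Muller transform."""
    state = mwc_step(state, multiplier)
    u1 = uniform_from_state(state)
    state = mwc_step(state, multiplier)
    u2 = uniform_from_state(state)
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle), state
```

Each thread would keep its own `state` and `multiplier`, so no communication between threads is needed to draw numbers.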

Basic Optimization Guide

• Access global memory as little as possible

– When you do, access it in a coalesced manner, having all threads in a warp access nearby addresses

• Keep all threads in a warp doing the same thing

• Access shared memory in an efficient manner

Performance

• 1000x my cpu code!

• Wow!

Performance (2)

• After more than a couple of evenings of thought and effort, 1000x my

– single-threaded, unoptimized cpu code

– that took about 30 minutes to write

– running on an old 1.6 GHz cpu …

• Hmm…

Data Analysis

• Bayes’ Theorem:

P(model|data) = P(data|model) P(model) / P(data)

• Often, P(data|model) is a gaussian because

– the mismatch between data and signal is down to noise, which is often best taken to be gaussian

– the signal is a gaussian

e.g. the CMB

• Radiation “from the big bang” is a correlated Gaussian (?) random field

– Correlations in the field depend on the model

WMAP science team

• So, p(data|model) is given by:

$$ \frac{1}{|C+N|^{1/2}} \, e^{-X^T (C+N)^{-1} X / 2} $$

• C is a function of the cosmological parameters!

Cholesky Factorization

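$$ A = L L^T, \qquad L = \begin{pmatrix} \cdot & 0 & 0 & 0 \\ \cdot & \cdot & 0 & 0 \\ \cdot & \cdot & \cdot & 0 \\ \cdot & \cdot & \cdot & \cdot \end{pmatrix}, \qquad L^T = \begin{pmatrix} \cdot & \cdot & \cdot & \cdot \\ 0 & \cdot & \cdot & \cdot \\ 0 & 0 & \cdot & \cdot \\ 0 & 0 & 0 & \cdot \end{pmatrix} $$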

Why does this help?

• Exponent…

– Can solve Ly = x in O(N^2); then x^T A^{-1} x = y^T y

• Prefactor…

– Det(A) is just the square of the product of the diagonal elements of L, which takes O(N)

• So after one O(N^3) operation, you can get all you need.

– The number in front of N^3 is small for Cholesky
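
Once L is known, the exponent and the prefactor both come cheaply. A minimal sketch, assuming L is given as a list of rows and dropping any constant normalization factor; the function names are illustrative.

```python
import math


def forward_substitute(lower, x):
    """Solve lower * y = x for y, with lower a lower-triangular list of rows; O(N^2) work."""
    y = []
    for i, row in enumerate(lower):
        partial = sum(row[j] * y[j] for j in range(i))
        y.append((x[i] - partial) / row[i])
    return y


def log_gaussian(lower, x):
    """log of det(A)^(-1/2) * exp(-x^T A^{-1} x / 2), where A = L L^T."""
    y = forward_substitute(lower, x)
    exponent = -0.5 * sum(v * v for v in y)       # x^T A^{-1} x equals y^T y
    log_sqrt_det = sum(math.log(row[i]) for i, row in enumerate(lower))  # O(N)
    return exponent - log_sqrt_det
```

Working with the logarithm avoids overflow or underflow in the determinant for large N.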

A suitable parallel implementation

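$$ \begin{pmatrix} A & B & C & D \\ B & A' & B' & C' \\ C & B' & \cdot & \cdot \\ D & C' & \cdot & \cdot \end{pmatrix} = \begin{pmatrix} a & 0 & 0 & 0 \\ b & a' & 0 & 0 \\ c & b' & \cdot & 0 \\ d & c' & \cdot & \cdot \end{pmatrix} \begin{pmatrix} a & b & c & d \\ 0 & a' & b' & c' \\ 0 & 0 & \cdot & \cdot \\ 0 & 0 & 0 & \cdot \end{pmatrix} $$

$$ a^2 = A, \quad b = B/a, \quad c = C/a, \ \ldots $$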

• What about the bottom right corner?...

• Just another Cholesky factorization, in one lower dimension!

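$$ \begin{pmatrix} A' & B' & C' \\ B' & \cdot & \cdot \\ C' & \cdot & \cdot \end{pmatrix} - \begin{pmatrix} b \\ c \\ d \end{pmatrix} \begin{pmatrix} b & c & d \end{pmatrix} = \begin{pmatrix} a' & 0 & 0 \\ b' & \cdot & 0 \\ c' & \cdot & \cdot \end{pmatrix} \begin{pmatrix} a' & b' & c' \\ 0 & \cdot & \cdot \\ 0 & 0 & \cdot \end{pmatrix} $$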

So…

• Need a loop over reducing matrix sizes of 3 kernels (we need to sync between each step)

– 1 to do the square root

– 1 to update the strip

– 1 to subtract the outer product of the strip from the remaining bottom right corner

• The last kernel takes all the time!
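
The loop of three steps can be written serially as below. This is a plain, unblocked cpu sketch with hypothetical names, intended only to show which work belongs to which "kernel"; it is not the gpu code from the talk.

```python
import math


def cholesky_three_step(a):
    """Return lower-triangular L with L L^T = a, for a symmetric positive-definite list of rows."""
    n = len(a)
    work = [list(row) for row in a]  # work on a copy; only the lower triangle is used
    for k in range(n):
        # "Kernel" 1: square root of the current top-left element.
        work[k][k] = math.sqrt(work[k][k])
        # "Kernel" 2: update the strip below it.
        for i in range(k + 1, n):
            work[i][k] /= work[k][k]
        # "Kernel" 3: subtract the outer product of the strip from the bottom right corner.
        for i in range(k + 1, n):
            for j in range(k + 1, i + 1):
                work[i][j] -= work[i][k] * work[j][k]
    return [[work[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]
```

The doubly nested loop in the third step is where almost all of the arithmetic sits, which matches the observation that the last kernel takes all the time.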

Performance and Optimizations

• In the outer product, we need to do n(n+1)/2 multiply subtracts per triangle, so N^3/6 in total.

• We could find ourselves memory bandwidth bound…

• To avoid this, we in fact treat “blocks” of the matrix at a time, storing them in fast shared memory, reducing global memory bandwidth by a factor of the block size.
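
The N^3/6 total can be checked by summing the triangle sizes as the corner shrinks, taking the corner left after the first step to have side N-1 and the last one to have side 1:

$$ \sum_{n=1}^{N-1} \frac{n(n+1)}{2} = \frac{(N-1)\,N\,(N+1)}{6} = \frac{N^3 - N}{6} \approx \frac{N^3}{6} $$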

Memory issues…

• Bizarre factor of two slowdowns for certain matrix sizes; different cards suffered at different sizes

• Could lead to the “better” 8800GTX card being slower than the “slower” 8800GTS card…

• My interpretation:

• Perhaps the memory is interleaved in units of 256 bytes between partitions!

• Then, for certain matrix sizes, all of the strip might be stored in the same partition.

• Then all thread blocks working down a column hit the same partition all the time…

• Problem basically goes away for a version of the code where each block works on a row instead

Hitting memory partitions…

GeForce 8800 Architecture Overview, Nvidia
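
A toy model of this interpretation, assuming that memory is interleaved between partitions in 256-byte units and that the partition is simply the unit index modulo the number of partitions. Both the mapping and the function name are hypothetical; the real hardware mapping is not documented.

```python
def partitions_hit(row_floats, n_partitions, n_rows, interleave_bytes=256, float_bytes=4):
    """Set of partitions touched when reading the first element of each of n_rows rows."""
    row_pitch = row_floats * float_bytes
    hit = set()
    for row in range(n_rows):
        address = row * row_pitch
        hit.add((address // interleave_bytes) % n_partitions)
    return hit
```

In this model, `partitions_hit(1536, 6, 64)` touches a single partition while `partitions_hit(1536, 5, 64)` spreads over all five, so cards with different partition counts would suffer at different matrix sizes, as observed.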

Also, paging?

• Made a minor change to the layout of the arrays to store each 16x16 block contiguously, i.e. from A[N][N] to A[N/16][N/16][16][16]

– Found a 50% speedup!!

• But why should one have to be finding these things out by trial, error and a lot of thought and effort?
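
The layout change amounts to a different mapping from (i, j) to a memory offset. A small sketch, assuming N is a multiple of the block size; the function names are illustrative.

```python
def flat_index_row_major(i, j, n):
    """Offset of element (i, j) when stored as A[n][n]."""
    return i * n + j


def flat_index_blocked(i, j, n, block=16):
    """Offset of element (i, j) when stored as A[n/block][n/block][block][block]."""
    blocks_per_row = n // block
    block_row, in_row = divmod(i, block)
    block_col, in_col = divmod(j, block)
    return ((block_row * blocks_per_row + block_col) * block + in_row) * block + in_col
```

In the row-major layout the 16 rows of a block are N elements apart, whereas in the blocked layout all 256 elements of a block occupy one contiguous run.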

Performance bottom line

• 12228^2 matrix, single precision

– 8s on an 8800GTX

– 17s using Intel MKL on a node of the University supercomputer (4 cores at 3GHz)

• So, after hours and hours and hours of thought and effort, roughly a factor of two faster than a one-line call to a (professional) lapack library routine on a 4-core cpu node…
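
As a rough check of what these timings mean, assume the factorization costs about N^3/3 floating-point operations (the N^3/6 multiply-subtracts counted as two operations each; this counting convention is an assumption, not a figure from the talk):

$$ \frac{N^3}{3} = \frac{12228^3}{3} \approx 6.1 \times 10^{11}, \qquad \frac{6.1 \times 10^{11}}{8\ \mathrm{s}} \approx 76\ \mathrm{GFLOP/s}, \qquad \frac{6.1 \times 10^{11}}{17\ \mathrm{s}} \approx 36\ \mathrm{GFLOP/s} $$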

So, general performance issues

• No clear description about global memory (other than coalescing), let alone textures!

• Possible issue with shared memory operands (see forum discussion) – no advice from Nvidia

• No official view of what the gpu is actually doing; ptx can obscure what matters for optimization…

Current NV hardware and DP

• Consumer: GTX 260/280

– 896MB/1GB

– nice sticker

• Tesla: C1060

– 4GB

– no nice sticker

DP vital for Cholesky

• Basically just changed “float” to “double” in my code and now get 40 GFLOP/s, 2/3 of peak performance on a GTX260!! Very happy…

• Note DP doubles memory bandwidth requirements, but cards are more than twice as slow at DP as at SP at present, so DP code is less likely to be memory bandwidth bound…

AMD Stream

• Alluring theoretical ALU performance, both in SP and DP!

– (Earlier DP support than Nvidia)

• Beguilingly offers views of the actual gpu code! (Can even program in it…)

• SDK still in beta unfortunately…

AMD concepts

• Brook+

• CAL/IL

• gpuisa

Brook+ example: adding 2 matrices

kernel void sum(float a<>, float b<>, out float c<>)
{
    c = a + b;
}

int main()
{
    …
    float a<1024,1024>;
    float b<1024,1024>;
    float c<1024,1024>;
    …
    streamRead(a, a_cpu);
    streamRead(b, b_cpu);
    …
    sum(a, b, c);
    …
    streamWrite(c, c_cpu);
    …
}

• Ideal for “pure” streaming applications, i.e. doing the same thing to many many elements

• Allows for “reductions” and also for more complicated memory access patterns

• Handles all the gpu complexity behind the scenes

CAL/IL

• CAL=Compute Abstraction Layer

– Provides C functions to set up and copy memory to and from the gpu, and to compile, set up and run kernels on the gpu

• IL=Intermediate Language

– A pseudo-assembly language for AMD gpus

– 128-bit registers!

Hardware Summary

• Last generation: 3800 Series, FireStream 9170

– DP

• New generation: 4800 Series, FireStream 9250

– Over 2x faster, over 1TFLOP/s SP!

– Can support compute shaders with interthread communication

Current cards…

• Consumer: HD4870

– 512MB/1GB GDDR5

– Nice(?) sticker

• Professional: 9250

– 1GB GDDR3

– Single slot, <150W

– No nice sticker…

Current cards

TeraScale Graphics Engine presentation, AMD

• 10 “SIMDs”

• x16 “thread processors”

• x5 “stream cores”

TeraScale Graphics Engine presentation, AMD

• All 16 thread processors in a SIMD run the same instruction, 4 times over in 4 clock cycles, on a “wavefront” of 64 threads

• Each instruction is a VLIW one!

– Basically a separate command to each stream core in the thread processor

– Some instructions can only run on the “t” stream core

– Some instructions, e.g. DP multiply, take up multiple stream cores…
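
The counts quoted for the HD4870 multiply out as follows. The 750 MHz clock and the counting of one multiply-add per stream core per cycle as 2 operations are assumptions, not figures from the talk:

$$ 10 \times 16 \times 5 = 800\ \text{stream cores}, \qquad 16 \times 4 = 64\ \text{threads per wavefront} $$

$$ 800 \times 2 \times 0.75\ \mathrm{GHz} = 1.2\ \mathrm{TFLOP/s}\ \text{(SP)} $$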

gpuisa

• Perhaps now we see the reason for IL existing, being a scalar language

• You can see the actual gpuisa though!

– very helpful to see what your program is actually doing

– aids in optimization

Cholesky on Stream

• Similar structure to Nvidia

• Currently based on 4x4 blocking of the matrix, due to float4 nature of registers

• Currently using “pixel” shaders (compute shaders are new and only supported on latest hardware); more graphics oriented

IL Shaders

• Set up…

il_ps_2_0
dcl_input_position_interp(linear_noperspective) vWinCoord0.xy
dcl_output_generic o0
dcl_output_generic o1
dcl_output_generic o2
dcl_output_generic o3
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(3)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)

• Load in data (that was partitioned between multiple buffers)…

sample_resource(0)_sampler(0) r0, vWinCoord0.xyxx
sample_resource(1)_sampler(0) r1, vWinCoord0.xyxx
sample_resource(2)_sampler(0) r2, vWinCoord0.xyxx
sample_resource(3)_sampler(0) r3, vWinCoord0.xyxx

;PS; -------- Disassembly --------------------
00 TEX: ADDR(80) CNT(4) VALID_PIX
      0  SAMPLE R1, R0.xyxx, t0, s0 UNNORM(XYZW)
      1  SAMPLE R3, R0.xyxx, t1, s0 UNNORM(XYZW)
      2  SAMPLE R2, R0.xyxx, t2, s0 UNNORM(XYZW)
      3  SAMPLE R0, R0.xyxx, t3, s0 UNNORM(XYZW)

• Calculate the top-left block; note the float4 nature of the registers!

sqrt r0.x,r0.xxxx
div r0._yzw,r0,r0.xxxx
mad r1._yzw,r0.yyyy,r0_neg(xyzw),r1
mad r2.__zw,r0.zzzz,r0_neg(xyzw),r2
mad r3.___w,r0.wwww,r0_neg(xyzw),r3
sqrt r1.y,r1.yyyy
div r1.__zw,r1,r1.yyyy
mad r2.__zw,r1.zzzz,r1_neg(xyzw),r2
mad r3.___w,r1.wwww,r1_neg(xyzw),r3
sqrt r2.z,r2.zzzz
div r2.___w,r2,r2.zzzz
mad r3.___w,r2.wwww,r2_neg(xyzw),r3
sqrt r3.w,r3.wwww

01 ALU: ADDR(32) CNT(39)
      4  t: SQRT_e R4.x, R1.x
      5  t: RCP_sat ____, PS4
      6  y: MUL R4.y, R1.y, PS5
         z: MUL R4.z, R1.z, PS5
         w: MUL R4.w, R1.w, PS5
      7  x: MULADD T0.x, PV6.y, -PV6.z, R3.z
         y: MULADD T0.y, PV6.z, -PV6.z, R2.z VEC_021
         z: MULADD R123.z, PV6.y, -PV6.y, R3.y
         w: MULADD T0.w, PV6.y, -PV6.w, R3.w
         t: MULADD T0.z, PV6.z, -PV6.w, R2.w
      8  x: MULADD T1.x, R4.w, -R4.w, R0.w
         t: SQRT_e R3.y, PV7.z
      9  t: RCP_sat ____, PS8
     10  z: MUL R3.z, T0.x, PS9
         w: MUL R3.w, T0.w, PS9
     11  x: MULADD T1.x, PV10.z, -PV10.w, T0.z
         y: MULADD R123.y, PV10.z, -PV10.z, T0.y
         z: MULADD T0.z, PV10.w, -PV10.w, T1.x
     12  t: SQRT_e R2.z, PV11.y
     13  t: RCP_sat ____, PS12
     14  w: MUL R2.w, T1.x, PS13
     15  x: MULADD R123.x, PV14.w, -PV14.w, T0.z
     16  t: SQRT_e R0.w, PV15.x

• Write out as a stream

mov o0,r0
mov o1,r1
mov o2,r2
mov o3,r3
ret_dyn
end

     17  x: MOV R8.x, R0.x
         y: MOV R8.y, R0.y
         z: MOV R8.z, R0.z
         w: MOV R8.w, PS16
     18  x: MOV R7.x, R2.x
         y: MOV R7.y, R2.y
         z: MOV R7.z, R2.z
         w: MOV R7.w, R2.w
     19  x: MOV R6.x, R3.x
         y: MOV R6.y, R3.y
         z: MOV R6.z, R3.z
         w: MOV R6.w, R3.w
     20  x: MOV R5.x, R4.x
         y: MOV R5.y, R4.y
         z: MOV R5.z, R4.z
         w: MOV R5.w, R4.w
02 EXP_DONE: PIX0, R5 BRSTCNT(3)
END_OF_PROGRAM

AMD issues

• Almost NO information or advice about the memory system, other than to have each thread write 4 float4’s at a time…

• Documentation is improving, but still some way to go (e.g. gpuisa document is 2 generations out of date; very limited discussion of compute shaders at present…)

• But there is help on the AMD Stream forum

Teething issues, e.g. …

• Fussy about supported gpu/os/driver combinations

• Can’t seem to use all of the card’s memory for a compute shader/global buffer; miss out on 256MB

• Can’t easily access resources larger than 255MB in size from the cpu side

• In some cases all cards have to have had a monitor plugged into them to be accessible; imagine that in a cluster!

Multi-gpu?

• Assuming you can just combine results at the end (on the cpu say), no problem

• If you want to share data (e.g. the Cholesky problem), must bear communication costs in mind

– As a lower limit, a kernel call takes about 5-10 μs

– PCIe and cpu memory, say 5GB/s

Big gpu clusters?

• E.g. for Cholesky, you’d split the matrix row-wise and pass the processed rows via MPI

• Latencies (cf. kernel launch overhead)…

– No expert but apparently Ethernet 50 μs, Infiniband 2 μs

• …and bandwidth (100-1000 MB/s)

• Each kernel should run for longer than this!

• But what about memory errors? (Current cards non-ECC…)
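
The latency and bandwidth figures above can be combined into a simple cost estimate. This is a minimal latency-plus-bandwidth model with hypothetical function names; it ignores contention, protocol overheads, and the extra PCIe hop between gpu and cpu memory.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Simple latency-plus-bandwidth cost of sending one message, in seconds."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def row_transfer_time(row_floats, latency_s, bandwidth_bytes_per_s, float_bytes=4):
    """Time to pass one processed row of single-precision values to another node."""
    return transfer_time(row_floats * float_bytes, latency_s, bandwidth_bytes_per_s)
```

For example, `row_transfer_time(12228, 2e-6, 1e9)` models one row of the matrix above over Infiniband at 1000 MB/s and comes to roughly 51 μs, several times the kernel launch overhead, so a kernel that runs for less than that would leave the gpu waiting on the network.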

Big gpu clusters? (2)

• I think gpu manufacturers should make available all the optimization info they possibly can

– You need all the factors of a few that there are in order to justify coprocessors

– Perhaps this is a bit different from the consumer market where compatibility and performance over a wide range of products is vital

Is it worth it?

• CUDA especially is a great way to start programming in parallel

– A gpu is in many ways analogous to a supercomputer cluster

• Need to work hard to get close to peak performance

– More info from both AMD and Nvidia would help here; it is strange that one can find out as much or more about the cards and how they work from forums and computer hardware websites as opposed to from the companies themselves…

Is it worth it? (2)

• Expect up to a factor of a few over decent (quad-core) cpu code…

• Future standardization efforts might help too (if they allow access to the advanced features of the cards in a close to optimal way).

• Libraries (cuBLAS, cuFFT, ACML-gpu…) might be a good way to start

– MPI versions for multi-gpu systems would be great!