Nvidia CUDA™ and AMD Stream™ (for Cosmology):
Experiences So Far
Steven Gratton,
Institute of Astronomy,
University of Cambridge.
29 October 2008
Brief Outline (1)
• Why I got interested in GPGPU…
• Thinking about CUDA
• An almost-ideal test case to understand the early universe: random walks, random numbers and reductions…
• A trickier problem, for data analysis: Cholesky factorization
• Thoughts on optimization
Brief Outline (2)
• Trying AMD Stream
• Programming for a GPU cluster?
• Is it worth it?
The Motivation
• I remember reading about CUDA and Nvidia’s new GPUs back in early 2007 – having waited hours for 16-core jobs to start on the local supercomputer, the idea of having 128 processors, all sitting in one’s own machine, seemed very enticing…
– CUDA seemed okay and fun enough for me to have a go…
Some support
• Received a mini-grant from the Foundational Questions Institute to look at GPGPU with CUDA for cosmology and provide a resources page; the latter at:
http://www.ast.cam.ac.uk/~stg20/gpgpu/index.html
http://www.fqxi.org
I bought…
• 2x Nvidia 8800GTSs
• A Sun Ultra 40M2 Opteron workstation to host them (good PSU and multiple PCIe slots…)
The problem
• “Inflation” in the early universe…
• A crash course in cosmology:
– The universe is expanding (i.e. galaxies are moving away from other galaxies) and the radiation in it is cooling down
– Extrapolating backwards would lead to a “big bang”
– Modify this by having a new form of matter, the “inflaton”, early on
– The inflaton can fluctuate; indeed small fluctuations in it can be the seeds that under gravitational collapse lead to the galaxies etc. that we see today
– But we might also have large fluctuations of the inflaton across the early universe…
• Basically the whole universe could have undergone a random walk!
• So what happened on the average?
– What should we expect to see in our past?
– And with what spread?
Subtleties
• Starting from an initial inflaton value, some of the histories don’t stop inflating
– We may only want to look at the subset that do; a constrained random walk
• Different histories inflate by different factors
– We have to weight each history according to a function of the entire history
The Idea
• To simulate a whole load of histories numerically
• Find the average history (and spread)
Nvidia Concepts (1)
• The gpu sequentially executes one or more grids of threads. Each thread in a grid runs the same program or kernel.
• Threads in a grid are organised into blocks.
– Threads in a block can synchronize and communicate via fast shared memory.
– All threads can read and write anywhere in the GPU’s main global memory.
Nvidia Concepts (2)
• A thread block runs on a single multiprocessor. It is split into warps of 32 threads.
• A high-end gpu contains say 12-30 multiprocessors, each consisting of 8 processors. Each processor runs the same instruction 4 times in a row to process a warp.
• Ideally, a multiprocessor should run multiple warps concurrently to hide latency.
Nvidia CUDA Programming Guide, Nvidia
Nvidia Concepts (3)
• Write kernels in CUDA, basically C
• This gets compiled into a virtual assembly language, ptx (you can inspect this)
• ptx gets compiled into real card-specific code (officially, you can’t see this)
(but see decuda,
http://www.cs.rug.nl/wladimir/decuda)
The Idea (2)
• Get each thread to simulate a history
• Form a partial average in each block and store to global memory
• Launch a new kernel to form the final average
• Transfer average history back to cpu
Parallel RNG
• It is tricky to get good parallel streams of random numbers
• I wanted something fast and reasonably good (will check soon, honest)
• Each thread uses a Marsaglia “Multiply With Carry” generator, each with its own multiplier.
• Then used a log/trig Box-Muller transform to get a gaussian
Basic Optimization Guide
• Access global memory as little as possible– When you do, access it in a coalesced
manner, having all threads in a warp access nearby addresses
• Keep all threads in a warp doing the same thing
• Access shared memory in an efficient manner
Performance
• 1000x my cpu code!
• Wow!
Performance(2)
• After more than a couple of evenings of thought and effort, 1000x my
– single-threaded, unoptimized cpu code
– that took about 30 minutes to write
– running on an old 1.6 GHz cpu…
• Hmm…
Data Analysis
• Bayes’ Theorem:
P(model|data) = P(data|model) P(model) / P(data)
• Often, P(data|model) is a gaussian because
– the mismatch between data and signal is down to noise, which is often best taken to be gaussian
– the signal is a gaussian
e.g. the CMB
• Radiation “from the big bang” is a correlated Gaussian (?) random field
– Correlations in the field depend on the model
WMAP science team
• So, p(data|model) is given by:

p(X|C) = (2π)^(−N/2) |C|^(−1/2) exp( −X^T C^(−1) X / 2 )

• C is a function of the cosmological parameters!
Cholesky Factorization
• A = L L^T, with L lower triangular (zeros above the diagonal) and L^T its transpose (zeros below)
Why does this help?
• Exponent…
– Can solve in O(N^2)
• Prefactor…
– Det(A) is just the square of the product of the diagonal elements of L, takes O(N)
• So after one O(N^3) operation, you can get all you need.
– The number in front of N^3 is small for Cholesky
A suitable parallel implementation
• First step, written out for a 4×4 symmetric matrix whose first column is (A, B, C, D): the first column of L is

a = √A, b = B/a, c = C/a, d = D/a

with zeros above the diagonal.
• What about the bottom right corner?...
• Just another Cholesky factorization, in one lower dimension!
• The same step, one dimension down: with first column (A, B, C) of the updated 3×3 block,

a = √A, b = B/a, c = C/a

and so on down to a single element.
So…
• Need a loop over reducing matrix sizes of 3 kernels (we need to sync between each step)
– 1 to do the square root
– 1 to update the strip
– 1 to subtract the outer product of the strip from the remaining bottom right corner
• The last kernel takes all the time!
Performance and Optimizations
• In the outer product, we need to do n(n+1)/2 multiply subtracts per triangle, so N^3/6 in total.
• We could find ourselves memory bandwidth bound…
• To avoid this, we in fact treat “blocks” of the matrix at a time, storing them in fast shared memory, reducing global memory bandwidth by a factor of the block size.
Memory issues…
• Bizarre factor of two slowdowns for certain matrix sizes; different cards suffered at different sizes
• Could lead to the “better” 8800GTX card being slower than the “slower” 8800GTS card…
• My interpretation:
• Perhaps the memory is interleaved in units of 256 bytes between partitions!
• Then, for certain matrix sizes, all of the strip might be stored in the same partition.
• Then all thread blocks working down a column hit the same partition all the time…
• Problem basically goes away for a version of the code where each block works on a row instead
Hitting memory partitions…
(GeForce 8800 Architecture Overview, Nvidia)
Also, paging?
• Made a minor change to the layout of the arrays to store each 16x16 block contiguously, i.e. from A[N][N] to A[N/16][N/16][16][16]
– Found a 50% speedup!!
• But why should one have to be finding these things out by trial, error and a lot of thought and effort?
Performance bottom line
• 12228^2 matrix, single precision
– 8s on an 8800GTX
– 17s using Intel MKL on a node of the University supercomputer (4 cores at 3GHz)
• So, after hours and hours and hours of thought and effort, a factor of a few times faster than a one-line call to a (professional) lapack library routine on a 4-core cpu node…
So, general performance issues
• No clear description of global memory behaviour (other than coalescing), let alone textures!
• Possible issue with shared memory operands (see forum discussion) – no advice from Nvidia
• No official view of what the gpu is actually doing; ptx can obscure what actually needs optimizing…
Current NV hardware and DP
• Consumer: GTX 260/280
896MB/1GB
nice sticker
• Tesla: C1060
4GB
no nice sticker
DP vital for Cholesky
• Basically just changed “float” to “double” in my code and now get 40 GFLOP/s, 2/3 of peak performance on a GTX260!! Very happy…
• Note DP doubles memory bandwidth requirements, and current cards are more than twice as slow at DP as at SP…
AMD Stream
• Alluring theoretical ALU performance, both in SP and DP!– (Earlier DP precision support than Nvidia)
• Beguilingly offers views of the actual gpu code! (Can even program in it…)
• SDK still in beta unfortunately…
AMD concepts
• Brook+
• CAL/IL
• gpuisa
Brook+ example: adding 2 matrices

kernel void sum(float a<>, float b<>, out float c<>)
{
    c = a + b;
}

int main()
{
    …
    float a<1024,1024>;
    float b<1024,1024>;
    float c<1024,1024>;
    …
    streamRead(a, a_cpu);
    streamRead(b, b_cpu);
    …
    sum(a, b, c);
    …
    streamWrite(c, c_cpu);
    …
}
• Ideal for “pure” streaming applications, i.e. doing the same thing to many many elements
• Allows for “reductions” and also for more complicated memory access patterns
• Handles all the gpu complexity behind the scenes
CAL/IL
• CAL=Compute Abstraction Layer– Provides c functions to set up and copy
memory to and from the gpu, and to compile, set up and run kernels on the gpu
• IL=Intermediate Language– A pseudo-assembly language for AMD gpus– 128-bit registers!
Hardware Summary
• Last generation: 3800 Series, Firestream 9170– DP
• New generation: 4800 Series, Firestream 9250
– Over 2x faster, over 1TFLOP/s SP!
– Can support compute shaders with interthread communication
Current cards…
• Consumer: HD4870
– 512MB/1GB GDDR5
– Nice(?) sticker
• Professional: 9250
– 1GB GDDR3
– Single slot, <150W
– No nice sticker…
Current cards
TeraScale Graphics Engine presentation, AMD
• 10 “SIMDs”
• x16 “thread processors”
• x5 “stream cores”
TeraScale Graphics Engine presentation, AMD
• All 16 thread processors in a SIMD run the same instruction, 4 times over in 4 clock cycles, on a “wavefront” of 64 threads
• Each instruction is a VLIW one!
– Basically a separate command to each stream core in the thread processor
– Some instructions can only run on the “t” stream core
– Some instructions, e.g. DP multiply, take up multiple stream cores…
gpuisa
• Perhaps now we see the reason for IL existing, being a scalar language
• You can see the actual gpuisa though!
– very helpful to see what your program is actually doing
– aids in optimization
Cholesky on Stream
• Similar structure to Nvidia
• Currently based on 4x4 blocking of the matrix, due to float4 nature of registers
• Currently using “pixel” shaders (compute shaders are new and only supported on latest hardware); more graphics oriented
IL Shaders
il_ps_2_0
dcl_input_position_interp(linear_noperspective) vWinCoord0.xy
dcl_output_generic o0
dcl_output_generic o1
dcl_output_generic o2
dcl_output_generic o3
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(3)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
• Set up…
• Load in data (that was partitioned between multiple buffers)…
sample_resource(0)_sampler(0) r0, vWinCoord0.xyxx
sample_resource(1)_sampler(0) r1, vWinCoord0.xyxx
sample_resource(2)_sampler(0) r2, vWinCoord0.xyxx
sample_resource(3)_sampler(0) r3, vWinCoord0.xyxx
;PS
; -------- Disassembly --------------------
00 TEX: ADDR(80) CNT(4) VALID_PIX
 0 SAMPLE R1, R0.xyxx, t0, s0 UNNORM(XYZW)
 1 SAMPLE R3, R0.xyxx, t1, s0 UNNORM(XYZW)
 2 SAMPLE R2, R0.xyxx, t2, s0 UNNORM(XYZW)
 3 SAMPLE R0, R0.xyxx, t3, s0 UNNORM(XYZW)
• Calculate the top-left block; note the float4 nature of the registers!
sqrt r0.x,r0.xxxx
div r0._yzw,r0,r0.xxxx
mad r1._yzw,r0.yyyy,r0_neg(xyzw),r1
mad r2.__zw,r0.zzzz,r0_neg(xyzw),r2
mad r3.___w,r0.wwww,r0_neg(xyzw),r3
sqrt r1.y,r1.yyyy
div r1.__zw,r1,r1.yyyy
mad r2.__zw,r1.zzzz,r1_neg(xyzw),r2
mad r3.___w,r1.wwww,r1_neg(xyzw),r3
sqrt r2.z,r2.zzzz
div r2.___w,r2,r2.zzzz
mad r3.___w,r2.wwww,r2_neg(xyzw),r3
sqrt r3.w,r3.wwww
01 ALU: ADDR(32) CNT(39)
 4 t: SQRT_e R4.x, R1.x
 5 t: RCP_sat ____, PS4
 6 y: MUL R4.y, R1.y, PS5
   z: MUL R4.z, R1.z, PS5
   w: MUL R4.w, R1.w, PS5
 7 x: MULADD T0.x, PV6.y, -PV6.z, R3.z
   y: MULADD T0.y, PV6.z, -PV6.z, R2.z VEC_021
   z: MULADD R123.z, PV6.y, -PV6.y, R3.y
   w: MULADD T0.w, PV6.y, -PV6.w, R3.w
   t: MULADD T0.z, PV6.z, -PV6.w, R2.w
 8 x: MULADD T1.x, R4.w, -R4.w, R0.w
   t: SQRT_e R3.y, PV7.z
 9 t: RCP_sat ____, PS8
10 z: MUL R3.z, T0.x, PS9
   w: MUL R3.w, T0.w, PS9
11 x: MULADD T1.x, PV10.z, -PV10.w, T0.z
   y: MULADD R123.y, PV10.z, -PV10.z, T0.y
   z: MULADD T0.z, PV10.w, -PV10.w, T1.x
12 t: SQRT_e R2.z, PV11.y
13 t: RCP_sat ____, PS12
14 w: MUL R2.w, T1.x, PS13
15 x: MULADD R123.x, PV14.w, -PV14.w, T0.z
16 t: SQRT_e R0.w, PV15.x
• Write out as a stream
mov o0,r0
mov o1,r1
mov o2,r2
mov o3,r3
ret_dyn
end
17 x: MOV R8.x, R0.x
   y: MOV R8.y, R0.y
   z: MOV R8.z, R0.z
   w: MOV R8.w, PS16
18 x: MOV R7.x, R2.x
   y: MOV R7.y, R2.y
   z: MOV R7.z, R2.z
   w: MOV R7.w, R2.w
19 x: MOV R6.x, R3.x
   y: MOV R6.y, R3.y
   z: MOV R6.z, R3.z
   w: MOV R6.w, R3.w
20 x: MOV R5.x, R4.x
   y: MOV R5.y, R4.y
   z: MOV R5.z, R4.z
   w: MOV R5.w, R4.w
02 EXP_DONE: PIX0, R5 BRSTCNT(3)
END_OF_PROGRAM
AMD issues
• Almost NO information or advice about the memory system, other than to have each thread write 4 float4’s at a time…
• Documentation is improving, but still some way to go (e.g. gpuisa document is 2 generations out of date; very limited discussion of compute shaders at present…)
• But help from the AMD Stream forum
Teething issues, e.g. …
• Fussy about supported gpu/os/driver combinations
• Can’t seem to use all of the card’s memory for a compute shader/global buffer; miss out on 256MB
• Can’t easily access resources larger than 255MB in size from the cpu side
• In some cases all cards have to have had a monitor plugged into them to be accessible; imagine that in a cluster!
Multi-gpu?
• Assuming you can just combine results at the end (on the cpu say), no problem
• If you want to share data (e.g. the Cholesky problem), must bear communication costs in mind
– As a lower limit, a kernel call takes about 5-10 μs
– PCIe and cpu memory, say 5GB/s
Big gpu clusters?
• E.g. for Cholesky, you’d split the matrix row-wise and pass the processed rows via MPI
• Latencies (c.f. kernel launch overhead)…
– No expert, but apparently Ethernet 50 μs, Infiniband 2 μs
• …and bandwidth (100-1000 MB/s)
• Each kernel should run for longer than this!
• But what about memory errors? (Current cards non-ECC…)
Big gpu clusters? (2)
• I think gpu manufacturers should make available all the optimization info they possibly can
– You need all the factors of a few that there are in order to justify coprocessors
– Perhaps this is a bit different from the consumer market, where compatibility and performance over a wide range of products is vital
Is it worth it?
• CUDA especially is a great way to start programming in parallel
– A gpu is in many ways analogous to a supercomputer cluster
• Need to work hard to get close to peak performance
– More info from both AMD and Nvidia would help here; it is strange that one can find out as much or more about the cards and how they work from forums and computer hardware websites as from the companies themselves…
Is it worth it?(2)
• Expect up to a factor of a few over decent (quad-core) cpu code…
• Future standardization efforts might help too (if they allow access to the advanced features of the cards in a close to optimal way).
• Libraries (cuBLAS, cuFFT, ACML-gpu…) might be a good way to start
– MPI versions for multi-gpu systems would be great!