
Page 1: The Missouri S&T CS GPU Cluster

The Missouri S&T CS GPU Cluster

Cyriac Kandoth

Page 2: The Missouri S&T CS GPU Cluster

Preface

• NVIDIA is a manufacturer of graphics processor technologies that has begun to promote its GPUs as general-purpose devices (GPGPUs)

• They donated 8 Tesla GPUs to the Missouri S&T Computer Science Department

• This presentation will introduce you to the concept of creating applications that take advantage of the massive parallelism inherent in GPUs

Page 3: The Missouri S&T CS GPU Cluster

Tech Specs

• 4 desktop-style cases, each housing:

• Intel Core i7 920 @ 2.66 GHz (quad core with HyperThreading, i.e. 8 logical processors)

• 8 GB DDR3 1333 @ 1066 MHz (the Core i7 under-clocks the RAM to a speed that it supports)

• 500 GB 3.0 Gbps 7200 rpm SATA hard disk drive

• Two Tesla C1060 cards on PCI-Express 2.0 x16 (a Tesla C1060 is a compute-only GPU that has no video output ports)

• ATI Radeon HD2400 on standard PCI (for display)

Page 4: The Missouri S&T CS GPU Cluster

The Tesla C1060

Form Factor: 10.5" x 4.376", dual slot
# of Streaming Processor Cores: 240
Frequency of Processor Cores: 1.3 GHz
Single Precision Floating Point Performance (peak): 933 GFLOP/s
Double Precision Floating Point Performance (peak): 78 GFLOP/s
Floating Point Precision: IEEE 754 single & double
Total Dedicated Memory: 4 GB GDDR3
Memory Speed: 800 MHz
Memory Interface: 512-bit
Memory Bandwidth: 102 GB/sec
Max Power Consumption: 200 W peak, 160 W typical
System Interface: PCIe x16

Source: http://www.nvidia.com/object/product_tesla_c1060_us.html

Page 5: The Missouri S&T CS GPU Cluster

Cluster networking

[Diagram: gpu0, gpu1, gpu2, and gpu3 each connect via eth1 to a Gigabit Switch; gpu0 also connects via eth0 to the Missouri S&T network]

• The 4 nodes are named gpu0 thru gpu3
• gpu0 is the frontend that acts as the gateway into the cluster from the mst.edu domain
• At the moment, only gpu1 thru gpu3 are available

Page 6: The Missouri S&T CS GPU Cluster

The CUDA programming model

• CUDA (Compute Unified Device Architecture) is a C/C++ programming model and API (Application Programming Interface) introduced by NVIDIA to enable software developers to write general-purpose apps that run on the massively parallel hardware of GPUs.

• GPUs are optimal for data-parallel apps, aka SIMD (Single Instruction, Multiple Data). CUDA also lets us code MIMD apps, but at reduced efficiency.

• Threads running in parallel use extremely fast shared memory for communication. There is no MPI_Send(), but the equivalent of MPI_Barrier() is __syncthreads().

Page 7: The Missouri S&T CS GPU Cluster

The CUDA programming model

• In your code, you can create a kernel (a function) that will run many instances of itself on parallel threads on the GPU. Threads running in parallel are collectively known as a grid.

• Kernels are run on the device (GPU) while the rest of the code runs on the host (CPU).

Page 8: The Missouri S&T CS GPU Cluster

The CUDA programming model

• A grid is organized into blocks, and each block is organized into threads.

• Only threads within the same block can communicate via shared memory and synchronize.

• This type of organization helps the GPU parallelize thread execution using its built-in hardware protocols.

Page 9: The Missouri S&T CS GPU Cluster

Built-in Variables accessible in a Kernel

dim3 gridDim
• Contains the dimensions of the grid in blocks, as specified during kernel invocation: gridDim.x, gridDim.y (.z is unused)

uint3 blockIdx
• Contains the block index within the grid: blockIdx.x, blockIdx.y (.z is unused)

dim3 blockDim
• Contains the dimensions of a block in threads: blockDim.x, blockDim.y, and blockDim.z

uint3 threadIdx
• Contains the thread index within the block: threadIdx.x, threadIdx.y, and threadIdx.z
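These built-ins are typically combined to give each thread a unique global index. A minimal sketch (ours, not from the slides; writeGlobalIndex and out are hypothetical names), assuming a 1D grid of 1D blocks:

// A minimal sketch (not from the slides): each thread computes a
// unique global index from the built-in variables and records it.
__global__ void writeGlobalIndex(int* out)
{
    // This block's offset in the grid, plus this thread's offset in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}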

Page 10: The Missouri S&T CS GPU Cluster

E.g. Host invokes kernel on a device

// Kernel definition, runs a copy on every thread
__global__ void vectorAdd( float* A, float* B, float* C )
{
    ...
}

int main(int argc, char** argv)
{
    dim3 blockSize(16, 16); // 256 threads per block (up to 3D)
    dim3 gridSize(4, 2);    // 8 blocks in the grid (up to 2D)

    // Invoke the kernel on the device (GPU)
    vectorAdd<<<gridSize, blockSize>>>(A, B, C);

    ... // Continue running on host (CPU) when device is done
}
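The elided kernel body could be completed as follows. This is one hedged possibility (our illustration, not the author's code), flattening the 2D grid and 2D block into a single element index:

// One possible completion of the elided body (an assumption, not the
// author's code): flatten the 2D grid and 2D block into a linear index.
__global__ void vectorAdd( float* A, float* B, float* C )
{
    int threadsPerBlock = blockDim.x * blockDim.y;          // 256
    int blockId  = blockIdx.y * gridDim.x + blockIdx.x;     // 0..7
    int threadId = threadIdx.y * blockDim.x + threadIdx.x;  // 0..255
    int i = blockId * threadsPerBlock + threadId;           // 0..2047
    C[i] = A[i] + B[i]; // each thread adds one pair of elements
}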

Page 11: The Missouri S&T CS GPU Cluster

CUDA Type Qualifiers

Function type qualifiers

__device__
• Executed on the device
• Callable from the device only

__global__
• Executed on the device
• Callable from the host only

__host__
• Executed on the host
• Callable from the host only
• Default type if unspecified

Variable type qualifiers

__device__
• Resides in the global memory space
• Is accessible from all the threads within the grid

__constant__
• Resides in the constant memory space
• Is accessible from all the threads within the grid

__shared__
• Resides in the shared memory space of a thread block
• Is only accessible from the threads within the block
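To make the qualifiers concrete, here is a minimal sketch (our own; demo, scale, square, and buffer are hypothetical names, not from the slides):

// A minimal sketch (assumption, not from the slides) using each qualifier:
__constant__ float scale;          // constant memory; readable by all threads in the grid
                                   // (set from the host with cudaMemcpyToSymbol())

__device__ float square(float x)   // runs on the device; callable from kernels only
{
    return x * x;
}

__global__ void demo(float* out)   // runs on the device; launched from the host
{
    __shared__ float buffer[256];  // shared memory; visible only within this block
    int i = threadIdx.x;
    buffer[i] = square(scale * i);
    __syncthreads();               // wait until every thread in the block has written
    out[blockIdx.x * blockDim.x + i] = buffer[i];
}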

Page 12: The Missouri S&T CS GPU Cluster

Template of a typical main()

int main(int argc, char** argv)
{
    // Allocate memory on the host for input data - malloc()
    // Initialize input data from file, user input, etc.

    // Allocate memory on the device - cudaMalloc()
    // Send input data to the device - cudaMemcpy()

    // Set up grid and block dimensions - dim3 variables
    // Invoke the kernel on the device (GPU) -
    //   kernelName<<<gridSize, blockSize>>>(input_params);

    // Copy results from device to host - cudaMemcpy()
    // Free up device memory - cudaFree()

    // Print results at the host, because the device can't.
    // printf() from a kernel only works in emulation mode.
}
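As one hedged, concrete filling-in of this template (our illustration, not the author's code), reusing the vectorAdd kernel from the earlier slide:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vectorAdd( float* A, float* B, float* C ); // defined on the earlier slide

int main(int argc, char** argv)
{
    const int N = 2048;            // 8 blocks x 256 threads
    size_t bytes = N * sizeof(float);

    // Allocate memory on the host for input data - malloc()
    float* hA = (float*)malloc(bytes);
    float* hB = (float*)malloc(bytes);
    float* hC = (float*)malloc(bytes);

    // Initialize input data
    for (int i = 0; i < N; i++) { hA[i] = (float)i; hB[i] = 2.0f * i; }

    // Allocate memory on the device - cudaMalloc()
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);

    // Send input data to the device - cudaMemcpy()
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Set up grid and block dimensions, then invoke the kernel
    dim3 blockSize(16, 16); // 256 threads per block
    dim3 gridSize(4, 2);    // 8 blocks in the grid
    vectorAdd<<<gridSize, blockSize>>>(dA, dB, dC);

    // Copy results from device to host - cudaMemcpy()
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    // Free up device memory - cudaFree()
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    // Print a result at the host
    printf("hC[100] = %f\n", hC[100]);

    free(hA); free(hB); free(hC);
    return 0;
}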

Page 13: The Missouri S&T CS GPU Cluster

CUDA apps in emulation mode

• Compile the program with the emu parameter enabled: make emu=1

• The program emulates a GPU on the host CPU. Usually much slower.

• Helps with debugging, because printf() statements are allowed in device code (i.e., inside CUDA kernels)
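For example, a device-side debug printf() of this kind compiles and prints only when built in emulation mode (a sketch of ours with hypothetical names, not from the slides):

#include <stdio.h>

// A hedged sketch (not from the slides): printf() inside a kernel
// works only when the program is built with make emu=1.
__global__ void debugScale(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
    printf("thread %d wrote %f\n", i, data[i]); // emulation mode only
}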

Page 14: The Missouri S&T CS GPU Cluster

Types of memory

• Registers: The fastest form of memory on the GPU. Only accessible by individual threads, with the lifetime of a thread. We don't need to deal with them directly (but we can).

• Shared memory: Can be as fast as a register when there are no bank conflicts (which occur when threads in a warp access different addresses in the same memory bank). Accessible by any thread of the block from which it was created. Has the lifetime of the block.

• Global memory: Potentially 150x slower than register or shared memory because of un-coalesced reads and writes. Accessible from either the host or the device. Has the lifetime of the application. Read-only global memory is called constant memory.

• Local memory: Resides in global memory and can be 150x slower than register/shared memory. Only accessible by the owning thread. Has the lifetime of the thread.

Page 15: The Missouri S&T CS GPU Cluster

A few CUDA API functions

• cudaSetDevice(int dev) - Sets the device to run the kernel.

• __syncthreads() - Blocks execution of all threads within a block until they synchronize.

• cudaMalloc(void** devPtr, size_t count) - Allocates count bytes in GPU memory and returns a pointer to it in the parameter *devPtr.

• cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind) - Copies count bytes from src to dst, where kind specifies the direction: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyHostToHost.

A complete listing of the CUDA API functions can be found in the Reference Manual.
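As a minimal sketch (our assumption, not from the slides), cudaSetDevice() can select one of the two Tesla C1060s in a node before any memory is allocated on it; cudaGetDeviceCount() is another real API call used here for illustration:

#include <cuda_runtime.h>

int main()
{
    int deviceCount;
    cudaGetDeviceCount(&deviceCount); // e.g. 2 on a node with two C1060s
    if (deviceCount > 1)
        cudaSetDevice(1);             // subsequent CUDA calls use device 1

    float* dBuf;
    cudaMalloc((void**)&dBuf, 1024 * sizeof(float)); // allocated on device 1
    cudaFree(dBuf);
    return 0;
}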

Page 16: The Missouri S&T CS GPU Cluster

Tips for speedy code

• Have the kernel use the whole card - use a multiple of 32 threads per block and at least as many blocks as there are multiprocessors (the Tesla C1060 has 30 multiprocessors behind its 240 streaming processor cores).

• Access global memory properly. Coalescing - memory reads by consecutive threads are combined by the hardware into a few wide memory reads (see the sketch after this list).

• Avoid shared memory bank conflicts.

• Have as few branching conditionals and loops as possible.

• Have small loops unrolled.

• Have no unnecessary __syncthreads() calls.

• See the CUDA Programming Guide for further discussion on all of the above.
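As a hedged sketch of the coalescing point above (ours, not from the slides), compare a copy kernel whose consecutive threads read consecutive addresses with a strided variant that defeats coalescing:

__global__ void coalescedCopy(float* in, float* out)
{
    // Consecutive threads touch consecutive addresses, so the hardware
    // combines the accesses into a few wide memory transactions.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

__global__ void stridedCopy(float* in, float* out, int stride)
{
    // Consecutive threads touch addresses far apart, so the accesses
    // cannot be coalesced and each may cost a separate transaction.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];
}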

Page 17: The Missouri S&T CS GPU Cluster

Questions?