Vivek Sarkar
Department of Computer Science, Rice University
[email protected]
September 24, 2007
COMP 635: Seminar on Heterogeneous Processors
Lecture 4: Introduction to General-Purpose Computation on GPUs (GPGPUs)
www.cs.rice.edu/~vsarkar/comp635
Announcements
• Acknowledgments
  — Wen-mei Hwu & David Kirk, UIUC ECE 498 AL1 course, “Programming Massively Parallel Processors”
    – http://courses.ece.uiuc.edu/ece498/al1/Syllabus.html
  — Dana Schaa, “Using CUDA for High Performance Scientific Computing”
    – http://www.ece.neu.edu/~dschaa/files/dschaa_cuda.ppt
• Class TA: Raghavan Raman
• Reading list for next lecture (10/1) --- volunteers needed to lead discussion!
  1. “Scan Primitives for GPU Computing”, S. Sengupta et al., Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware
     • http://graphics.idav.ucdavis.edu/publications/func/return_pdf?pub_id=915
  2. “EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system”, P. Wang et al., PLDI 2007
     • http://doi.acm.org/10.1145/1250734.1250753
• Additional references
  • NVIDIA, NVIDIA CUDA Programming Guide, NVIDIA, 2007
    • http://developer.download.nvidia.com/compute/cuda/1_0/NVIDIA_CUDA_Programming_Guide_1.0.pdf
  • Hubert Nguyen, GPU Gems 3, Addison Wesley, 2007
• First Workshop on General Purpose Processing on Graphics Processing Units, October 4, 2007, Boston
  • For details, see http://www.ece.neu.edu/GPGPU/
  • Contact me if you’re interested in attending, either to work on a class project or to give a summary report back to the class
Why GPUs?
• Two major trends
  1. Increasing performance gap relative to mainstream CPUs
     – Calculation: 367 GFLOPS vs. 32 GFLOPS
     – Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
  2. Availability of more general (non-graphics) programming interfaces
• GPU in every PC and workstation – massive volume and potential impact
What is GPGPU?
• General-Purpose computation using a GPU in applications other than 3D graphics
  — GPU accelerates the critical path of the application
• Data-parallel algorithms leverage GPU attributes
  — Large data arrays, streaming throughput
  — Fine-grain SIMD parallelism
  — Low-latency floating point (FP) computation
• Applications – see GPGPU.org
  — Game effects (FX) physics, image processing
  — Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
Nvidia GeForce 8800 GTX (a.k.a. G80)
• The device is a set of 16 multiprocessors
• Each multiprocessor is a set of 32-bit processors with a Single Instruction, Multiple Data architecture – shared instruction unit
• Each multiprocessor has:
  — 32 32-bit registers per processor
  — 16 KB on-chip shared memory per multiprocessor
  — A read-only constant cache
  — A read-only texture cache
[Figure: hardware model – the device contains N multiprocessors; each multiprocessor contains M processors with their own registers, a shared instruction unit, on-chip shared memory, a constant cache, and a texture cache, all backed by device memory.]
CUDA Taxonomy
• CUDA = Compute Unified Device Architecture
• Device = GPU, Host = CPU, Kernel = GPU program
• Thread = instance of a kernel program
• Warp = group of threads that execute in SIMD mode
  - Maximum warp size for the Nvidia G80 is 32 threads
• Thread block = group of warps (all warps in the same block must be of equal size)
  - One thread block at a time is assigned to a multiprocessor
  - Each warp contains threads of consecutive, increasing thread indices, with the first warp containing thread 0
  - Maximum block size for the Nvidia G80 is 16 warps
• Grid = array of thread blocks
  - Blocks within a grid cannot be synchronized
  - Nvidia G80 has 16 multiprocessors, so a minimum of 16 blocks (2^4 * 2^4 * 2^5 = 8K threads) is needed to fully utilize the device? (see the sketch below)
  - A multiprocessor can hold multiple blocks if resources (registers, thread space, shared memory) permit
  - Maximum grid size of 64K (65,535) blocks in each grid dimension
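• To make the 8K-thread arithmetic concrete, here is a minimal, hypothetical host program (made-up kernel and variable names) that launches 16 blocks of 512 threads (16 warps of 32) each:

#include <cuda_runtime.h>

// Placeholder kernel: each of the 8192 threads writes its own element.
__global__ void busyKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = (float) i;
}

int main() {
    float *d_data;
    cudaMalloc((void **) &d_data, 8192 * sizeof(float));

    // 16 blocks x (16 warps * 32 threads/warp) = 8192 threads:
    // one block per multiprocessor on a 16-multiprocessor G80.
    busyKernel<<<16, 512>>>(d_data);
    cudaThreadSynchronize();     // launches are asynchronous; wait for completion

    cudaFree(d_data);
    return 0;
}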
CUDA Taxonomy (contd.)
[Figure: a grid drawn as a 4×4 array of thread blocks, each block containing a few warps (W).]
Thread Batching: Grids and Blocks
• A kernel is executed as a grid of thread blocks
  — All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  — Synchronizing their execution
    – For hazard-free shared memory accesses
  — Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate
[Figure: the host launches Kernel 1 on Grid 1 (a 3×2 array of blocks) and Kernel 2 on Grid 2; Block (1,1) is expanded into a 5×3 array of threads. Courtesy: NVIDIA]
Block and Thread IDs
• Threads and blocks have IDs
  — So each thread can decide what data to work on
  — Block ID: 1D or 2D
  — Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data (see the sketch below)
  — Image processing
  — Solving PDEs on volumes
  — …
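• A minimal sketch (hypothetical kernel and layout, not from the slides) of how 2D block and thread IDs combine to address a row-major width × height image:

// Each thread brightens one pixel of a width x height image in global memory.
__global__ void brighten(float *image, int width, int height, float delta) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index from 2D IDs
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < width && y < height)                     // guard partial edge blocks
        image[y * width + x] += delta;
}

// Possible launch over the image:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   brighten<<<grid, block>>>(d_image, width, height, 0.1f);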
[Figure: the same grid/block/thread diagram as the previous slide, with block indices (x, y) within Grid 1 and thread indices (x, y) within Block (1,1). Courtesy: NVIDIA]
Device Memory Space Overview
• Each thread can:
  — R/W per-thread registers
  — R/W per-thread local memory
  — R/W per-block shared memory
  — R/W per-grid global memory
  — Read only per-grid constant memory
  — Read only per-grid texture memory
[Figure: memory spaces of a (device) grid – each thread has registers and local memory, each block has shared memory, and all blocks share the grid’s global, constant, and texture memories, which are also accessible from the host.]
• The host can R/W global, constant, and texture memories
• These memory spaces are persistent across kernels called by the same application
CUDA Host-Device Data Transfer
• cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
• Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind is one of
— cudaMemcpyHostToHost
— cudaMemcpyHostToDevice
— cudaMemcpyDeviceToHost
— cudaMemcpyDeviceToDevice
• The memory areas may not overlap
• Calling cudaMemcpy() with dst and src pointers that do not match the direction of the copy results in undefined behavior
• Synchronous in CUDA 1.0 --- will become asynchronous in a future CUDA version?
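• A small usage sketch (illustrative sizes and names) of cudaMemcpy together with cudaMalloc:

#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    const size_t count = 1024 * sizeof(float);
    float *h_data = (float *) malloc(count);   // host buffer
    float *d_data;
    cudaMalloc((void **) &d_data, count);      // device (global memory) buffer

    // Host -> device copy before launching a kernel ...
    cudaMemcpy(d_data, h_data, count, cudaMemcpyHostToDevice);

    // ... kernel launch would go here ...

    // Device -> host copy of the results (synchronous in CUDA 1.0).
    cudaMemcpy(h_data, d_data, count, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);
    return 0;
}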
Access Times
• Register – dedicated HW - single cycle
• Shared Memory – dedicated HW - single cycle
• Local Memory – DRAM, no cache - *slow*
• Global Memory – DRAM, no cache - *slow*
• Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Instruction Memory (invisible) – DRAM, cached
CUDA Variable Declarations
• __device__ is optional when used with __local__, __shared__, or __constant__
• Automatic variables without any qualifier reside in a register
  — Except arrays, which reside in local memory
• Pointers can only point to memory allocated or declared in global memory:
  — Allocated in the host and passed to the kernel: __global__ void KernelFunc(float* ptr)
  — Obtained as the address of a global variable: float* ptr = &GlobalVar;
Variable declaration                        | Memory   | Scope  | Lifetime
__device__ __local__    int LocalVar;       | local    | thread | thread
__device__ __shared__   int SharedVar;      | shared   | block  | block
__device__              int GlobalVar;      | global   | grid   | application
__device__ __constant__ int ConstantVar;    | constant | grid   | application
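• A short sketch (illustrative names, not from the slides) combining the qualifiers from the table in one kernel:

__device__ __constant__ float coeff[16];   // constant memory: grid scope, application lifetime
__device__ float globalBuf[256];           // global memory:   grid scope, application lifetime

// ptr points to global memory that was allocated on the host and passed in.
__global__ void qualifiers(float *ptr) {
    __shared__ float tile[64];             // shared memory: one copy per block, block lifetime
    float t = coeff[0];                    // unqualified automatic scalar: lives in a register
    tile[threadIdx.x] = t;                 // assumes blockDim.x <= 64
    __syncthreads();
    ptr[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x] + globalBuf[threadIdx.x];
}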
CUDA Function Declarations
Declaration                        | Executed on the: | Only callable from the:
__device__ float DeviceFunc()      | device           | device
__global__ void  KernelFunc()      | device           | host
__host__   float HostFunc()        | host             | host
• __global__ defines a kernel function
  — Must return void
• __device__ and __host__ can be used together
• __device__ functions cannot have their address taken
• For functions executed on the device:
  — No recursion
  — No static variable declarations inside the function
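• A minimal sketch (hypothetical function names) showing the three qualifiers together:

__device__ float square(float x) { return x * x; }             // callable only from device code

__host__ __device__ float twice(float x) { return 2.0f * x; }  // compiled for both CPU and GPU

__global__ void apply(float *data, int n) {                    // kernel: launched from the host, returns void
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = twice(square(data[i]));
}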
Invoking a Kernel Function – Thread Creation
• A kernel function must be called with an execution configuration:
__global__ void KernelFunc(...);
dim3 DimGrid(100, 50); // 5000 thread blocks
dim3 DimBlock(4, 8, 8); // 256 threads per block
size_t SharedMemBytes = 64; // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);
• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking
• cudaThreadSynchronize() explicitly forces the runtime to wait until all preceding device tasks have finished
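• The configuration above, fleshed out into a complete (hypothetical) toy program so the asynchronous launch and the explicit synchronization are visible; KernelFunc’s body is a placeholder:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void KernelFunc(float *data) {
    // Flatten the 2D grid of 3D blocks into one linear index.
    int block  = blockIdx.y * gridDim.x + blockIdx.x;
    int thread = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
    int i = block * blockDim.x * blockDim.y * blockDim.z + thread;
    data[i] += 1.0f;
}

int main() {
    float *d_data;
    cudaMalloc((void **) &d_data, 5000 * 256 * sizeof(float));

    dim3 DimGrid(100, 50);     // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);    // 256 threads per block
    KernelFunc<<<DimGrid, DimBlock>>>(d_data);   // returns to the host immediately

    cudaThreadSynchronize();   // block until the kernel (and any earlier work) completes
    printf("kernel finished\n");
    cudaFree(d_data);
    return 0;
}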
Language Extensions: Built-in Types & Variables
• [u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4]
  — Structures accessed with x, y, z, w fields:
      uint4 param;
      int y = param.y;
• dim3
— Based on uint3
  — Used to specify dimensions
• dim3 gridDim;
— Dimensions of the grid in blocks (gridDim.z unused)
• dim3 blockDim;
— Dimensions of the block in threads
• dim3 blockIdx;
  — Block index within the grid
• dim3 threadIdx;
— Thread index within the block
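• A common idiom (not shown on the slide) is to combine blockIdx, blockDim, and threadIdx into a global element index, e.g. in a SAXPY-style kernel:

// y[i] = a * x[i] + y[i], one element per thread.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
    if (i < n)                                       // guard the final partial block
        y[i] = a * x[i] + y[i];
}

// Typical launch: saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);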
Host Runtime Component
• Provides functions to deal with:
  — Device management (including multi-device systems)
  — Memory management
  — Error handling
• Initializes the first time a runtime function is called
• A host thread can invoke device code on only one device
  — Multiple host threads are required to run on multiple devices
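• A small host-side sketch (illustrative, using the runtime calls cudaGetDeviceCount, cudaGetDeviceProperties, and cudaSetDevice) of device management in a multi-device system:

#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);                 // number of CUDA-capable devices

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d: %s, %lu bytes of global memory\n",
               d, prop.name, (unsigned long) prop.totalGlobalMem);
    }

    cudaSetDevice(0);    // this host thread will issue all device work to device 0
    return 0;
}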
Device Mathematical Functions
• Some mathematical functions (e.g. sin(x)) have a less accurate, but faster device-only version (e.g. __sin(x))
  — __pow
  — __log, __log2, __log10
  — __exp
  — __sin, __cos, __tan
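• A sketch contrasting the two variants; in the CUDA headers the single-precision intrinsics carry an “f” suffix (__sinf, __expf, …), corresponding to the __sin/__exp names listed above:

__global__ void compareSin(const float *x, float *accurate, float *fast, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]);     // standard single-precision sine
        fast[i]     = __sinf(x[i]);   // fast hardware intrinsic, lower accuracy
    }
}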
Device Synchronization Function
• void __syncthreads();
• Synchronizes all threads in a block
• Once all threads have reached this point, execution resumes normally
• Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory
• Allowed in conditional constructs only if the conditional is uniform across the entire thread block
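• A minimal sketch (illustrative kernel, assuming blockDim.x <= 256) in which __syncthreads() separates a shared-memory write phase from a read phase, avoiding a read-after-write hazard:

// Reverse the blockDim.x elements owned by this block.
__global__ void reverseBlock(float *data) {
    __shared__ float tmp[256];
    int base = blockIdx.x * blockDim.x;

    tmp[threadIdx.x] = data[base + threadIdx.x];                     // write phase
    __syncthreads();                                                 // all writes complete before any read
    data[base + threadIdx.x] = tmp[blockDim.x - 1 - threadIdx.x];    // read phase
}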
Extended C (Summary)
• Declspecs
  — global, device, shared, local, constant
• Keywords
  — threadIdx, blockIdx
• Intrinsics
  — __syncthreads
• Runtime API
  — Memory, symbol, execution management
• Function launch
__device__ float filter[N];

__global__ void convolve (float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
float *myimage;
cudaMalloc((void **) &myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);
[Figure: CUDA compilation flow – integrated source (foo.cu) is processed by cudacc (EDG C/C++ frontend, Open64 global optimizer), which emits CPU host code (foo.cpp), compiled by gcc/cl, and GPU assembly (foo.s), compiled by OCG into G80 SASS (foo.sass).]
Matrix Multiply Example in CUDA (Host Code)
Source: NVIDIA Compute Unified Device Architecture Programming Guide, Version 1.0
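• The host code itself appears on the original slide as an image and is not reproduced here. Below is a hedged sketch, assuming square N×N row-major matrices and a kernel named matMul (defined in the sketch under the next slide); it follows the usual allocate / copy in / launch / copy out pattern:

#include <cuda_runtime.h>

__global__ void matMul(const float *A, const float *B, float *C, int N);  // defined below

// Multiply two N x N matrices supplied in host memory.
void matMulHost(const float *hA, const float *hB, float *hC, int N) {
    size_t bytes = (size_t) N * N * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc((void **) &dA, bytes);
    cudaMalloc((void **) &dB, bytes);
    cudaMalloc((void **) &dC, bytes);

    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);   // copy inputs to the device
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                                  // 256 threads per block
    dim3 grid((N + 15) / 16, (N + 15) / 16);
    matMul<<<grid, block>>>(dA, dB, dC, N);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);   // copy the result back

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}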
Matrix Multiply Example in CUDA (Device Code)
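• The device code is likewise shown on the slide as an image. Below is a simple, untiled sketch in which each thread computes one element of C; the version in the Programming Guide additionally stages sub-blocks of A and B in shared memory for reuse:

// Each thread computes one element C[row][col] of the N x N product.
__global__ void matMul(const float *A, const float *B, float *C, int N) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}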
Ideal CUDA Programs
• High intrinsic parallelism
  - per-pixel or per-element operations
• Minimal communication (if any) between threads
  - limited synchronization
• High ratio of arithmetic to memory operations
• Few control flow statements
  - Divergent execution paths among threads from the same warp must be serialized (costly); see the sketch below
  - The compiler may replace conditional instructions with predicated instructions to reduce divergence
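• A small illustration (not from the slides) of a per-thread branch that makes threads of the same warp take different paths; a compiler may turn such short branches into predicated instructions:

// Threads with even and odd indices take different paths; within a warp the two
// paths are executed one after the other, so half the lanes idle in each phase.
__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}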
Sample GPU Applications
Application | Source (lines) | Kernel (lines) | % time in kernel | Description
FDTD        | 1,365          | 93             | 16%              | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation
MRI-Q       | 490            | 33             | >99%             | Computing a matrix Q, a scanner’s configuration in MRI reconstruction
TRACF       | 536            | 98             | 96%              | Two Point Angular Correlation Function
SAXPY       | 952            | 31             | >99%             | Single-precision implementation of saxpy, used in Linpack’s Gaussian elim. routine
PNS         | 322            | 160            | >99%             | Petri Net simulation of a distributed system
RPES        | 1,104          | 281            | 99%              | Rye Polynomial Equation Solver, quantum chem, 2-electron repulsion
FEM         | 1,874          | 146            | 99%              | Finite element modeling, simulation of 3D graded materials
RC5-72      | 1,979          | 218            | >99%             | Distributed.net RC5-72 challenge client code
LBM         | 1,481          | 285            | >99%             | SPEC ’06 version, change to single precision and print fewer reports
H.264       | 34,811         | 194            | 35%              | SPEC ’06 version, change in guess vector
Performance of Sample Kernels and Applications
• GeForce 8800 GTX vs. 2.2 GHz Opteron 248
• 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
• 25× to 400× speedup if the function’s data requirements and control flow suit the GPU and the application is optimized
• Keep in mind that the speedup also reflects how suitable the CPU is for executing the kernel

Source: Slide 21, Lecture 1, UIUC ECE 498, David Kirk & Wen-mei Hwu, http://courses.ece.uiuc.edu/ece498/al1/lectures/lecture1%20intro%20fall%202007.ppt