CUDA Lecture 5: CUDA at the University of Akron
Prepared 6/23/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Overview: CUDA Equipment
- Your own PCs running G80 emulators
  - Better debugging environment
  - Sufficient for the first couple of weeks
- Your own PCs with a CUDA-enabled GPU
- NVIDIA boards in the department
  - GeForce family of processors for high-performance gaming
  - Tesla C2070 for high-performance computing: no graphics output (?) and more memory

CUDA at the University of Akron – Slide 2
Summary: NVIDIA Technology

Description                  | Card Models                                    | Where Available
Low Power                    | Ion                                            | Netbooks in CAS 241.
Consumer Graphics Processors | GeForce 8500GT, GeForce 9500GT, GeForce 9600GT | Add-in cards in Dell Optiplex 745s in department.
2nd Generation GPUs          | GeForce GTX275                                 | In Dell Precision T3500s in department.
Fermi GPUs                   | GeForce GTX480                                 | In select Dell Precision T3500s in department.
Fermi GPUs                   | Tesla C2070                                    | In Dell Precision T7500 Linux server (tesla.cs.uakron.edu).
Hardware View, Consumer Procs.
The basic building block is a “streaming multiprocessor” (SM); different chips have different numbers of these SMs:

Product        | SMs | Compute Capability
GeForce 8500GT | 2   | v. 1.1
GeForce 9500GT | 4   | v. 1.1
GeForce 9600GT | 8   | v. 1.1
Hardware View, 2nd Generation
Here the basic building block is a “streaming multiprocessor” with:
- 8 cores, each with 2048 registers
- up to 128 threads per core
- 16KB of shared memory
- 8KB cache for constants held in device memory
Different chips have different numbers of these SMs:

Product | SMs | Bandwidth | Memory | Compute Capability
GTX275  | 30  | 127 GB/s  | 1-2 GB | v. 1.3
Hardware View, Fermi
Each streaming multiprocessor has:
- 32 cores, each with 1024 registers
- up to 48 threads per core
- 64KB of shared memory / L1 cache
- 8KB cache for constants held in device memory
There is also a unified 384KB L2 cache. Different chips again have different numbers of SMs:

Product     | SMs | Bandwidth | Memory     | Compute Capability
GTX480      | 15  | 180 GB/s  | 1.5 GB     | v. 2.0
Tesla C2070 | 14  | 140 GB/s  | 6 GB (ECC) | v. 2.0
Different Compute Capabilities

Feature                                                             | v. 1.1 | v. 1.3, 2.x
Integer atomic functions operating on 64-bit words in global memory | no     | yes
Integer atomic functions operating on 32-bit words in shared memory | no     | yes
Warp vote functions                                                 | no     | yes
Double-precision floating-point operations                          | no     | yes
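As a sketch of what the v. 1.3 column enables, the kernel below combines a shared-memory integer atomic with double-precision arithmetic; it will not compile for compute capability 1.1 targets. The kernel and variable names are illustrative, not from the lecture.

```cuda
// Sketch: features requiring compute capability >= 1.2/1.3.
// Compile with, e.g.: nvcc -arch=sm_13 count.cu
__global__ void count_positive(const double *x, int n, int *total)
{
    __shared__ int block_count;                 // one counter per block
    if (threadIdx.x == 0) block_count = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] > 0.0)                    // double-precision compare (v. 1.3)
        atomicAdd(&block_count, 1);             // 32-bit atomic in shared memory (v. 1.2+)
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(total, block_count);          // 32-bit atomic in global memory
}
```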
Different Compute Capabilities (continued)

Feature                                                                              | v. 1.1, 1.3 | v. 2.x
3D grid of thread blocks                                                             | no          | yes
Floating-point atomic addition operating on 32-bit words in global and shared memory | no          | yes
__ballot()                                                                           | no          | yes
__threadfence_system()                                                               | no          | yes
__syncthreads_count(), __syncthreads_and(), __syncthreads_or()                       | no          | yes
Surface functions                                                                    | no          | yes
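A minimal sketch of the v. 2.x-only intrinsics from this table (it will not compile for sm_1x targets; the kernel name and outputs are illustrative):

```cuda
// Sketch: warp ballot and block-wide predicate count, v. 2.x only.
// Compile with, e.g.: nvcc -arch=sm_20 vote.cu
__global__ void vote_demo(const int *flags, unsigned int *ballots, int *counts)
{
    int pred = flags[blockIdx.x * blockDim.x + threadIdx.x];

    unsigned int mask = __ballot(pred);      // bit i set if lane i's predicate is nonzero
    int votes = __syncthreads_count(pred);   // number of threads in the block with pred != 0

    if (threadIdx.x == 0) {
        ballots[blockIdx.x] = mask;          // one warp's ballot (single-warp blocks assumed)
        counts[blockIdx.x]  = votes;
    }
}
```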
Common Technical Specifications

Spec                                                                                             | Value
Maximum x- or y-dimension of a grid of thread blocks                                             | 65535
Maximum dimensionality of a thread block                                                         | 3
Maximum z-dimension of a block                                                                   | 64
Warp size                                                                                        | 32
Maximum number of resident blocks per multiprocessor                                             | 8
Constant memory size                                                                             | 64 K
Cache working set per multiprocessor for constant memory                                         | 8 K
Maximum width for a 1D texture reference bound to linear memory                                  | 2^27
Maximum width, height and depth for a 3D texture reference bound to linear memory or a CUDA array | 2048 x 2048 x 2048
Maximum number of textures that can be bound to a kernel                                         | 128
Maximum number of instructions per kernel                                                        | 2 million
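The grid-dimension limit matters once an array needs more blocks than a grid dimension allows. One common pattern, sketched below with illustrative names, is a grid-stride loop with the block count capped at the maximum x-dimension:

```cuda
// Sketch: covering N elements without exceeding the 65535 grid-dimension limit.
__global__ void scale(float *x, int n, float a)
{
    // Grid-stride loop: each thread handles every (gridDim.x * blockDim.x)-th element.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= a;
}

void launch_scale(float *d_x, int n, float a)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (blocks > 65535) blocks = 65535;     // cap at the maximum grid x-dimension
    scale<<<blocks, threads>>>(d_x, n, a);
}
```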
Different Technical Specifications

Spec                                                  | v. 1.1 | v. 1.3 | v. 2.x
Maximum number of resident warps per multiprocessor   | 24     | 32     | 48
Maximum number of resident threads per multiprocessor | 768    | 1024   | 1536
Number of 32-bit registers per multiprocessor         | 8 K    | 16 K   | 32 K
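A short worked example of how these per-multiprocessor limits combine, assuming a v. 1.3 device and a 256-thread block (the block size is an illustrative choice, not from the lecture):

```cuda
// Occupancy arithmetic for one v. 1.3 multiprocessor
// (32 resident warps, 1024 resident threads, 8 resident blocks, 16 K registers):
//   block size      = 256 threads = 8 warps
//   resident blocks = min(1024 threads / 256, 8) = 4 blocks
//   resident warps  = 4 blocks * 8 warps = 32 warps  -> full occupancy
//   register budget = 16384 registers / 1024 threads = 16 registers per thread
dim3 block(256);   // a block size that can fully occupy a v. 1.3 multiprocessor
```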
Different Technical Specifications (continued)

Spec                                                          | v. 1.1, 1.3 | v. 2.x
Maximum dimensionality of a grid of thread blocks             | 2           | 3
Maximum x- or y-dimension of a block                          | 512         | 1024
Maximum number of threads per block                           | 512         | 1024
Maximum amount of shared memory per multiprocessor            | 16 K        | 48 K
Number of shared memory banks                                 | 16          | 32
Amount of local memory per thread                             | 16 K        | 512 K
Maximum width for a 1D texture reference bound to a CUDA array | 8192        | 32768
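The shared-memory limit above is what bounds the third launch parameter when using dynamically sized shared memory. A minimal sketch (kernel name and logic are illustrative):

```cuda
// Sketch: dynamic shared memory, sized at launch time; n * sizeof(float)
// must fit the per-multiprocessor limit (16 KB on v. 1.x, up to 48 KB on v. 2.x).
__global__ void reverse(float *x, int n)
{
    extern __shared__ float buf[];          // size supplied by the launch configuration
    int t = threadIdx.x;
    if (t < n) buf[t] = x[t];
    __syncthreads();                        // all loads complete before any store
    if (t < n) x[t] = buf[n - 1 - t];
}

// Launch with n floats of dynamic shared memory:
//   reverse<<<1, n, n * sizeof(float)>>>(d_x, n);
```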
Different Technical Specifications (continued)

Spec                                                                         | v. 1.1, 1.3   | v. 2.x
Maximum width and number of layers for a 1D layered texture reference        | 8192 x 512    | 16384 x 2048
Maximum width and height for a 2D texture reference bound to linear memory or a CUDA array | 65536 x 32768 | 65536 x 65536
Maximum width, height, and number of layers for a 2D layered texture reference | 8192 x 8192 x 512 | 16384 x 16384 x 2048
Maximum width for a 1D surface reference bound to a CUDA array               | Not supported | 8192
Maximum width and height for a 2D surface reference bound to a CUDA array    | Not supported | 8192 x 8192
Maximum number of surfaces that can be bound to a kernel                     | Not supported | 8
Overview: CUDA Components
CUDA (Compute Unified Device Architecture) is NVIDIA’s program development environment:
- based on C with some extensions
- C++ support increasing steadily
- FORTRAN support provided by the PGI compiler
- lots of example code and good documentation: a 2-4 week learning curve for those with experience of OpenMP and MPI programming
- large user community on NVIDIA forums

When installing CUDA on a system, there are three components:
- driver
  - low-level software that controls the graphics card
  - usually installed by the sys-admin
- toolkit
  - the nvcc CUDA compiler
  - some profiling and debugging tools
  - various libraries
  - usually installed by the sys-admin in /usr/local/cuda
- SDK
  - lots of demonstration examples
  - a convenient Makefile for building applications
  - some error-checking utilities
  - not supported by NVIDIA; almost no documentation
  - often installed by the user in their own directory
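To make "C with some extensions" concrete, here is a minimal, self-contained sketch of a complete CUDA program: a kernel marked __global__, launched with the <<<...>>> syntax, with explicit host-device copies. The kernel and variable names are illustrative.

```cuda
// Minimal CUDA program sketch: vector addition.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];          // one element per thread
}

int main()
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_a[n], h_b[n], h_c[n];
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2 * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[1023] = %f\n", h_c[1023]);    // 1023 + 2046 = 3069
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```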
Accessing the Tesla Card
Remotely access the front end:
    ssh tesla.cs.uakron.edu
ssh sends your commands over an encrypted stream, so your passwords, etc. can’t be sniffed over the network.
The first time you do this, after login run
    /root/gpucomputingsdk_3.2.16_linux.run
and just take the default answers to get your own personal copy of the SDK. Then
    cd ~/NVIDIA_GPU_Computing_SDK/C
    make -j12 -k
will build all that can be built.

Binaries end up in ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release. In particular, the header file <cutil_inline.h> is in ~/NVIDIA_GPU_Computing_SDK/C/common/inc.

You can then get a summary of the technical specs and compute capabilities by executing
    ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery
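The core of what deviceQuery reports can be obtained directly from the CUDA runtime API; a minimal sketch:

```cuda
// Sketch: querying device properties, the same information deviceQuery prints.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);      // properties of device 0
    printf("%s: compute capability %d.%d, %d multiprocessors, %zu bytes global memory\n",
           prop.name, prop.major, prop.minor,
           prop.multiProcessorCount, prop.totalGlobalMem);
    return 0;
}
```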
CUDA Makefile
Two choices:
- use nvcc within a standard Makefile
- use the special Makefile template provided in the SDK
The SDK Makefile provides some useful options:
- make emu=1 uses an emulation library for debugging on a CPU
- make dbg=1 activates run-time error checking
In general just use a standard Makefile.
Sample Tesla Makefile

GENCODE_ARCH := -gencode=arch=compute_10,code=\"sm_10,compute_10\" -gencode=arch=compute_13,code=\"sm_13,compute_13\" -gencode=arch=compute_20,code=\"sm_20,compute_20\"

INCLOCS := -I$(HOME)/NVIDIA_GPU_Computing_SDK/shared/inc -I$(HOME)/NVIDIA_GPU_Computing_SDK/C/common/inc

LIBLOCS := -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib -L$(HOME)/NVIDIA_GPU_Computing_SDK/C/lib

LIBS = -lcutil_x86_64

<progName>: <progName>.cu <progName>.cuh
	nvcc $(GENCODE_ARCH) $(INCLOCS) <progName>.cu $(LIBLOCS) $(LIBS) -o <progName>
CUDA Tools and Threads – Slide 2
Compiling a CUDA Program
Parallel Thread Execution (PTX):
- virtual machine and ISA
- programming model
- execution resources and state
Compilation
- Any source file containing CUDA extensions must be compiled with NVCC.
- NVCC is a compiler driver: it works by invoking all the necessary tools and compilers like cudacc, g++, cl, etc.
- NVCC outputs:
  - C code (host CPU code), which must then be compiled with the rest of the application using another tool
  - PTX: either object code directly, or PTX source interpreted at runtime
Linking
Any executable with CUDA code requires two dynamic libraries:
- the CUDA runtime library (cudart)
- the CUDA core library (cuda)
Debugging Using the Device Emulation Mode
An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime:
- no need for any device or CUDA driver
- each device thread is emulated with a host thread
Running in device emulation mode, one can:
- use host native debug support (breakpoints, inspection, etc.)
- access any device-specific data from host code and vice versa
- call any host function from device code (e.g. printf) and vice versa
- detect deadlock situations caused by improper usage of __syncthreads
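A sketch of the kind of trace this enables; under -deviceemu each emulated thread can call host functions such as printf (on real hardware, device-side printf requires compute capability 2.0). The kernel name is illustrative.

```cuda
// Sketch: per-thread debug tracing under device emulation mode.
#include <cstdio>

__global__ void trace(const float *x, int n)
{
    int i = threadIdx.x;
    if (i < n)
        printf("thread %d sees x[%d] = %f\n", i, i, x[i]);  // host call, legal under -deviceemu
}
```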
Device Emulation Mode Pitfalls
- Emulated device threads execute sequentially, so simultaneous access of the same memory location by multiple threads could produce different results.
- Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode.
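A concrete instance of the second pitfall, sketched below: the commented-out host-side dereference of a device pointer appears to work under emulation but faults on a real device.

```cuda
// Sketch: a bug that device emulation mode hides.
float *d_x;
cudaMalloc((void **)&d_x, sizeof(float));
// *d_x = 1.0f;            // "works" under -deviceemu; an error in device execution mode

float one = 1.0f;
cudaMemcpy(d_x, &one, sizeof(float), cudaMemcpyHostToDevice);  // the portable way
```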
Floating Point
Results of floating-point computations will differ slightly because of:
- different compiler outputs and instruction sets
- use of extended precision for intermediate results
There are various options to force strict single precision on the host.
Nexus
- New Visual Studio-based GPU integrated development environment
- http://developer.nvidia.com/object/nexus.html
- Available in beta (as of October 2009)
End Credits
Based on original material from:
- http://en.wikipedia.org/wiki/CUDA, accessed 6/22/2011
- The University of Akron: Charles Van Tilburg
- The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
- Oxford University: Mike Giles
- Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 6/23/2011.