CUDA Lecture 5: CUDA at the University of Akron. Prepared 6/23/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.


Page 1: CUDA Lecture  5 CUDA at the University of Akron

Prepared 6/23/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

CUDA Lecture 5: CUDA at the University of Akron

Page 2: CUDA Lecture  5 CUDA at the University of Akron

Your own PCs running G80 emulators
  Better debugging environment
  Sufficient for the first couple of weeks

Your own PCs with a CUDA-enabled GPU

NVIDIA boards in the department
  GeForce family of processors for high-performance gaming
  Tesla C2070 for high-performance computing: no graphics output (?) and more memory

CUDA at the University of Akron – Slide 2

Overview: CUDA Equipment

Page 3: CUDA Lecture  5 CUDA at the University of Akron

CUDA at the University of Akron – Slide 3

Summary: NVIDIA Technology

Description                  | Card Models                                    | Where Available
Low Power Ion                |                                                | Netbooks in CAS 241.
Consumer Graphics Processors | GeForce 8500GT, GeForce 9500GT, GeForce 9600GT | Add-in cards in Dell Optiplex 745s in department.
2nd Generation GPUs          | GeForce GTX275                                 | In Dell Precision T3500s in department.
Fermi GPUs                   | GeForce GTX480                                 | In select Dell Precision T3500s in department.
                             | Tesla C2070                                    | In Dell Precision T7500 Linux server (tesla.cs.uakron.edu).

Page 4: CUDA Lecture  5 CUDA at the University of Akron

Basic building block is a “streaming multiprocessor”; different chips have different numbers of these SMs:

CUDA at the University of Akron – Slide 4

Hardware View, Consumer Procs.

Product        | SMs | Compute Capability
GeForce 8500GT | 2   | v. 1.1
GeForce 9500GT | 4   | v. 1.1
GeForce 9600GT | 8   | v. 1.1

Page 5: CUDA Lecture  5 CUDA at the University of Akron

Basic building block is a “streaming multiprocessor” with
  8 cores, each with 2048 registers
  up to 128 threads per core
  16KB of shared memory
  8KB cache for constants held in device memory

Different chips have different numbers of these SMs:

CUDA at the University of Akron – Slide 5

Hardware View, 2nd Generation

Product | SMs | Bandwidth | Memory | Compute Capability
GTX275  | 30  | 127 GB/s  | 1-2 GB | v. 1.3

Page 6: CUDA Lecture  5 CUDA at the University of Akron

Each streaming multiprocessor has
  32 cores, each with 1024 registers
  up to 48 threads per core
  64KB of shared memory / L1 cache
  8KB cache for constants held in device memory

There’s also a unified 384KB L2 cache. Different chips again have different numbers of SMs:

CUDA at the University of Akron – Slide 6

Hardware View, Fermi

Product     | SMs | Bandwidth | Memory     | Compute Capability
GTX480      | 15  | 180 GB/s  | 1.5 GB     | v. 2.0
Tesla C2070 | 14  | 140 GB/s  | 6 GB (ECC) | v. 2.1
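The SM counts, compute capabilities, and memory sizes tabulated above can also be read back at runtime. Below is a minimal sketch using the CUDA runtime API (cudaGetDeviceProperties); error checking is omitted for brevity:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // Mirrors the hardware-view tables: name, SMs, compute capability, memory
        printf("%s: %d SMs, compute capability v. %d.%d, %.1f GB memory\n",
               prop.name, prop.multiProcessorCount, prop.major, prop.minor,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```

Running this on tesla.cs.uakron.edu should report figures consistent with the Tesla C2070 row above.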

Page 7: CUDA Lecture  5 CUDA at the University of Akron

CUDA at the University of Akron – Slide 7

Different Compute Capabilities

Feature                                                             | v. 1.1 | v. 1.3, 2.x
Integer atomic functions operating on 64-bit words in global memory | no     | yes
Integer atomic functions operating on 32-bit words in shared memory | no     | yes
Warp vote functions                                                 | no     | yes
Double-precision floating-point operations                          | no     | yes
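As a hedged sketch of what the v. 1.3 column enables, the kernel below combines double-precision arithmetic with a 64-bit integer atomic in global memory; per the table, neither is available at v. 1.1. It would be compiled with something like nvcc -arch=sm_13:

```cuda
#include <cuda_runtime.h>

// Count the positive entries of a double array.
// Illustrative only; the kernel name and signature are our own.
__global__ void countPositives(const double *in, int n,
                               unsigned long long *count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0)      // double-precision compare: v. 1.3+
        atomicAdd(count, 1ULL);    // 64-bit integer atomic in global memory
}
```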

Page 8: CUDA Lecture  5 CUDA at the University of Akron

CUDA at the University of Akron – Slide 8

Different Compute Capabilities

Feature                                                                              | v. 1.1, 1.3 | v. 2.x
3D grid of thread blocks                                                             | no          | yes
Floating-point atomic addition operating on 32-bit words in global and shared memory | no          | yes
__ballot()                                                                           | no          | yes
__threadfence_system()                                                               | no          | yes
__syncthreads_count(), __syncthreads_and(), __syncthreads_or()                       | no          | yes
Surface functions                                                                    | no          | yes
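Similarly, here is a hedged sketch combining three of the v. 2.x-only features above: floating-point atomicAdd, __ballot(), and __syncthreads_count(). It would be compiled for sm_20; the kernel name and parameters are illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void positiveStats(const float *in, int n,
                              float *sum, int *blockCounts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool pos = (i < n) && (in[i] > 0.0f);

    if (pos)
        atomicAdd(sum, in[i]);          // FP atomic add on a 32-bit word: v. 2.x

    unsigned warpMask = __ballot(pos);  // per-warp bitmask of predicates: v. 2.x
    (void)warpMask;

    // Barrier that also counts how many threads in the block pass the
    // predicate: v. 2.x (all threads reach this call, as required)
    int posInBlock = __syncthreads_count(pos);
    if (threadIdx.x == 0)
        blockCounts[blockIdx.x] = posInBlock;
}
```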

Page 9: CUDA Lecture  5 CUDA at the University of Akron

CUDA at the University of Akron – Slide 9

Common Technical Specifications

Spec                                                                                             | Value
Maximum x- or y-dimension of a grid of thread blocks                                             | 65535
Maximum dimensionality of thread block                                                           | 3
Maximum z-dimension of a block                                                                   | 64
Warp size                                                                                        | 32
Maximum number of resident blocks per multiprocessor                                             | 8
Constant memory size                                                                             | 64 KB
Cache working set per multiprocessor for constant memory                                         | 8 KB
Maximum width for 1D texture reference bound to linear memory                                    | 2^27
Maximum width, height and depth for a 3D texture reference bound to linear memory or a CUDA array | 2048 x 2048 x 2048
Maximum number of textures that can be bound to a kernel                                         | 128
Maximum number of instructions per kernel                                                        | 2 million

Page 10: CUDA Lecture  5 CUDA at the University of Akron

CUDA at the University of Akron – Slide 10

Different Technical Specifications

Spec                                                  | v. 1.1 | v. 1.3 | v. 2.x
Maximum number of resident warps per multiprocessor   | 24     | 32     | 48
Maximum number of resident threads per multiprocessor | 768    | 1024   | 1536
Number of 32-bit registers per multiprocessor         | 8 K    | 16 K   | 32 K

Page 11: CUDA Lecture  5 CUDA at the University of Akron

CUDA at the University of Akron – Slide 11

Different Technical Specifications

Spec                                                         | v. 1.1, 1.3 | v. 2.x
Maximum dimensionality of grid of thread blocks              | 2           | 3
Maximum x- or y-dimension of a block                         | 512         | 1024
Maximum number of threads per block                          | 512         | 1024
Maximum amount of shared memory per multiprocessor           | 16 K        | 48 K
Number of shared memory banks                                | 16          | 32
Amount of local memory per thread                            | 16 K        | 512 K
Maximum width for 1D texture reference bound to a CUDA array | 8192        | 32768
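Rather than hard-coding the per-capability limits tabulated above, a program can read most of them from cudaDeviceProp at runtime. A small sketch (error checking omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("max threads per block     : %d\n", p.maxThreadsPerBlock);
    printf("max block dims            : %d x %d x %d\n",
           p.maxThreadsDim[0], p.maxThreadsDim[1], p.maxThreadsDim[2]);
    printf("shared memory per block   : %zu bytes\n", p.sharedMemPerBlock);
    printf("32-bit registers per block: %d\n", p.regsPerBlock);
    printf("warp size                 : %d\n", p.warpSize);
    return 0;
}
```

This is essentially what the SDK's deviceQuery example (mentioned later in these slides) does in more detail.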

Page 12: CUDA Lecture  5 CUDA at the University of Akron

CUDA at the University of Akron – Slide 12

Different Technical Specifications

Spec                                                                                     | v. 1.1, 1.3       | v. 2.x
Maximum width and number of layers for a 1D layered texture reference                    | 8192 x 512        | 16384 x 2048
Maximum width and height for 2D texture reference bound to linear memory or a CUDA array | 65536 x 32768     | 65536 x 65536
Maximum width, height, and number of layers for a 2D layered texture reference           | 8192 x 8192 x 512 | 16384 x 16384 x 2048
Maximum width for a 1D surface reference bound to a CUDA array                           | Not supported     | 8192
Maximum width and height for a 2D surface reference bound to a CUDA array                | Not supported     | 8192 x 8192
Maximum number of surfaces that can be bound to a kernel                                 | Not supported     | 8

Page 13: CUDA Lecture  5 CUDA at the University of Akron

CUDA (Compute Unified Device Architecture) is NVIDIA’s program development environment:
  based on C with some extensions
  C++ support increasing steadily
  FORTRAN support provided by PGI compiler
  lots of example code and good documentation: 2-4 week learning curve for those with experience of OpenMP and MPI programming
  large user community on NVIDIA forums

CUDA at the University of Akron – Slide 13

Overview: CUDA Components
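To make “C with some extensions” concrete, here is a minimal, hedged sketch of a complete CUDA program: __global__ marks device code, <<<...>>> launches it, and cudaMalloc/cudaMemcpy move data between host and device (error checking omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread increments one array element.
__global__ void addOne(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1;
}

int main(void) {
    const int n = 256;
    int h[n], *d;
    for (int i = 0; i < n; ++i) h[i] = i;

    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    addOne<<<(n + 127) / 128, 128>>>(d, n);   // 2 blocks of 128 threads
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("h[0]=%d h[%d]=%d\n", h[0], n - 1, h[n - 1]);
    return 0;
}
```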

Page 14: CUDA Lecture  5 CUDA at the University of Akron

When installing CUDA on a system, there are 3 components:

driver
  low-level software that controls the graphics card
  usually installed by sys-admin

toolkit
  nvcc CUDA compiler
  some profiling and debugging tools
  various libraries
  usually installed by sys-admin in /usr/local/cuda

CUDA at the University of Akron – Slide 14

Overview: CUDA Components

Page 15: CUDA Lecture  5 CUDA at the University of Akron

SDK
  lots of demonstration examples
  a convenient Makefile for building applications
  some error-checking utilities
  not supported by NVIDIA
  almost no documentation
  often installed by user in own directory

CUDA at the University of Akron – Slide 15

Overview: CUDA Components

Page 16: CUDA Lecture  5 CUDA at the University of Akron

Remotely access the front end:

  ssh tesla.cs.uakron.edu

ssh sends your commands over an encrypted stream so your passwords, etc., can’t be sniffed over the network.

CUDA at the University of Akron – Slide 16

Accessing the Tesla Card

Page 17: CUDA Lecture  5 CUDA at the University of Akron

The first time you do this: after login, run

  /root/gpucomputingsdk_3.2.16_linux.run

and just take the default answers to get your own personal copy of the SDK.

Then:

  cd ~/NVIDIA_GPU_Computing_SDK/C
  make -j12 -k

will build all that can be built.

CUDA at the University of Akron – Slide 17

Accessing the Tesla Card

Page 18: CUDA Lecture  5 CUDA at the University of Akron

The first time you do this, binaries end up in

  ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release

In particular, header file <cutil_inline.h> is in ~/NVIDIA_GPU_Computing_SDK/C/common/inc

You can then get a summary of technical specs and compute capabilities by executing

  ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery

CUDA at the University of Akron – Slide 18

Accessing the Tesla Card

Page 19: CUDA Lecture  5 CUDA at the University of Akron

Two choices:
  use nvcc within a standard Makefile
  use the special Makefile template provided in the SDK

The SDK Makefile provides some useful options:
  make emu=1 uses an emulation library for debugging on a CPU
  make dbg=1 activates run-time error checking

In general just use a standard Makefile.

CUDA at the University of Akron – Slide 19

CUDA Makefile

Page 20: CUDA Lecture  5 CUDA at the University of Akron

CUDA at the University of Akron – Slide 20

Sample Tesla Makefile

  GENCODE_ARCH := -gencode=arch=compute_10,code=\"sm_10,compute_10\" \
                  -gencode=arch=compute_13,code=\"sm_13,compute_13\" \
                  -gencode=arch=compute_20,code=\"sm_20,compute_20\"

  INCLOCS := -I$(HOME)/NVIDIA_GPU_Computing_SDK/shared/inc \
             -I$(HOME)/NVIDIA_GPU_Computing_SDK/C/common/inc

  LIBLOCS := -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib \
             -L$(HOME)/NVIDIA_GPU_Computing_SDK/C/lib

  LIBS := -lcutil_x86_64

  <progName>: <progName>.cu <progName>.cuh
          nvcc $(GENCODE_ARCH) $(INCLOCS) <progName>.cu $(LIBLOCS) $(LIBS) -o <progName>

Page 21: CUDA Lecture  5 CUDA at the University of Akron

CUDA Tools and Threads – Slide 21

Parallel Thread Execution (PTX)
  Virtual machine and ISA
  Programming model
  Execution resources and state

Compiling a CUDA Program

Page 22: CUDA Lecture  5 CUDA at the University of Akron

Any source file containing CUDA extensions must be compiled with NVCC.

NVCC is a compiler driver
  Works by invoking all the necessary tools and compilers like cudacc, g++, cl, …

NVCC outputs
  C code (host CPU code): must then be compiled with the rest of the application using another tool
  PTX: object code directly, or PTX source interpreted at runtime

CUDA Tools and Threads – Slide 22

Compilation

Page 23: CUDA Lecture  5 CUDA at the University of Akron

Any executable with CUDA code requires two dynamic libraries:
  The CUDA runtime library (cudart)
  The CUDA core library (cuda)

CUDA Tools and Threads – Slide 23

Linking

Page 24: CUDA Lecture  5 CUDA at the University of Akron

An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime
  No need of any device and CUDA driver
  Each device thread is emulated with a host thread

CUDA Tools and Threads – Slide 24

Debugging Using the Device Emulation Mode

Page 25: CUDA Lecture  5 CUDA at the University of Akron

Running in device emulation mode, one can
  Use host native debug support (breakpoints, inspection, etc.)
  Access any device-specific data from host code and vice-versa
  Call any host function from device code (e.g. printf) and vice-versa
  Detect deadlock situations caused by improper usage of __syncthreads

CUDA Tools and Threads – Slide 25

Debugging Using the Device Emulation Mode

Page 26: CUDA Lecture  5 CUDA at the University of Akron

Emulated device threads execute sequentially, so simultaneous access of the same memory location by multiple threads could produce different results

Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode

CUDA Tools and Threads – Slide 26

Device Emulation Mode Pitfalls
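A minimal sketch of the first pitfall (kernel names are our own): the racy kernel below may appear correct under sequential emulation, because emulated threads run one after another, but gives nondeterministic results on real hardware; the atomic version is correct in both modes:

```cuda
#include <cuda_runtime.h>

__global__ void racyCount(int *counter) {
    *counter = *counter + 1;   // unsynchronized read-modify-write: a data race
                               // when many device threads run concurrently
}

__global__ void safeCount(int *counter) {
    atomicAdd(counter, 1);     // hardware-serialized update, correct on device
}
```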

Page 27: CUDA Lecture  5 CUDA at the University of Akron

Results of floating-point computations will slightly differ because of
  Different compiler outputs, instruction sets
  Use of extended precision for intermediate results

There are various options to force strict single precision on the host.

CUDA Tools and Threads – Slide 27

Floating Point

Page 28: CUDA Lecture  5 CUDA at the University of Akron

New Visual Studio-based GPU integrated development environment

http://developer.nvidia.com/object/nexus.html

Available in Beta (as of October 2009)

CUDA Tools and Threads – Slide 28

Nexus

Page 29: CUDA Lecture  5 CUDA at the University of Akron

Based on original material from
  http://en.wikipedia.com/wiki/CUDA, accessed 6/22/2011
  The University of Akron: Charles Van Tilburg
  The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
  Oxford University: Mike Giles
  Stanford University: Jared Hoberock, David Tarjan

Revision history: last updated 6/23/2011.

CUDA at the University of Akron – Slide 29

End Credits