CUDA Lecture 5: CUDA at the University of Akron
Prepared 6/23/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Overview: CUDA Equipment
- Your own PCs running G80 emulators
  - Better debugging environment
  - Sufficient for the first couple of weeks
- Your own PCs with a CUDA-enabled GPU
- NVIDIA boards in the department
  - GeForce family of processors for high-performance gaming
  - Tesla C2070 for high-performance computing: no graphics output (?) and more memory

CUDA at the University of Akron – Slide 2
Summary: NVIDIA Technology

Description                  | Card Models                                    | Where Available
Low Power                    | Ion                                            | Netbooks in CAS 241.
Consumer Graphics Processors | GeForce 8500GT, GeForce 9500GT, GeForce 9600GT | Add-in cards in Dell Optiplex 745s in department.
2nd Generation GPUs          | GeForce GTX275                                 | In Dell Precision T3500s in department.
Fermi GPUs                   | GeForce GTX480                                 | In select Dell Precision T3500s in department.
Fermi GPUs                   | Tesla C2070                                    | In Dell Precision T7500 Linux server (tesla.cs.uakron.edu).
Hardware View, Consumer Procs.
The basic building block is a “streaming multiprocessor” (SM); different chips have different numbers of these SMs:

Product        | SMs | Compute Capability
GeForce 8500GT | 2   | v. 1.1
GeForce 9500GT | 4   | v. 1.1
GeForce 9600GT | 8   | v. 1.1
Hardware View, 2nd Generation
Here the basic building block is a “streaming multiprocessor” with:
- 8 cores, each with 2048 registers
- up to 128 threads per core
- 16KB of shared memory
- 8KB cache for constants held in device memory
Different chips have different numbers of these SMs:

Product | SMs | Bandwidth | Memory | Compute Capability
GTX275  | 30  | 127 GB/s  | 1-2 GB | v. 1.3
Hardware View, Fermi
Each streaming multiprocessor has:
- 32 cores, each with 1024 registers
- up to 48 threads per core
- 64KB of shared memory / L1 cache
- 8KB cache for constants held in device memory
There is also a unified 384KB L2 cache. Different chips again have different numbers of SMs:

Product     | SMs | Bandwidth | Memory     | Compute Capability
GTX480      | 15  | 180 GB/s  | 1.5 GB     | v. 2.0
Tesla C2070 | 14  | 140 GB/s  | 6 GB (ECC) | v. 2.0
Different Compute Capabilities

Feature                                                             | v. 1.1 | v. 1.3, 2.x
Integer atomic functions operating on 64-bit words in global memory | no     | yes
Integer atomic functions operating on 32-bit words in shared memory | no     | yes
Warp vote functions                                                 | no     | yes
Double-precision floating-point operations                          | no     | yes
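As a sketch of what the v. 1.3 column enables, the kernel below combines a shared-memory integer atomic with double-precision arithmetic; it will not compile for compute capability 1.1 targets. The kernel and variable names are illustrative, not from the lecture.

```cuda
// Sketch: features requiring compute capability >= 1.2/1.3.
// Compile with, e.g.: nvcc -arch=sm_13 count.cu
__global__ void count_positive(const double *x, int n, int *total)
{
    __shared__ int block_count;                 // one counter per block
    if (threadIdx.x == 0) block_count = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] > 0.0)                    // double-precision compare (v. 1.3)
        atomicAdd(&block_count, 1);             // 32-bit atomic in shared memory (v. 1.2+)
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(total, block_count);          // 32-bit atomic in global memory
}
```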
Different Compute Capabilities (continued)

Feature                                                                              | v. 1.1, 1.3 | v. 2.x
3D grid of thread blocks                                                             | no          | yes
Floating-point atomic addition operating on 32-bit words in global and shared memory | no          | yes
__ballot()                                                                           | no          | yes
__threadfence_system()                                                               | no          | yes
__syncthreads_count(), __syncthreads_and(), __syncthreads_or()                       | no          | yes
Surface functions                                                                    | no          | yes
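A minimal sketch of the v. 2.x-only intrinsics from this table (it will not compile for sm_1x targets; the kernel name and outputs are illustrative):

```cuda
// Sketch: warp ballot and block-wide predicate count, v. 2.x only.
// Compile with, e.g.: nvcc -arch=sm_20 vote.cu
__global__ void vote_demo(const int *flags, unsigned int *ballots, int *counts)
{
    int pred = flags[blockIdx.x * blockDim.x + threadIdx.x];

    unsigned int mask = __ballot(pred);      // bit i set if lane i's predicate is nonzero
    int votes = __syncthreads_count(pred);   // number of threads in the block with pred != 0

    if (threadIdx.x == 0) {
        ballots[blockIdx.x] = mask;          // one warp's ballot (single-warp blocks assumed)
        counts[blockIdx.x]  = votes;
    }
}
```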
Common Technical Specifications

Spec                                                                                             | Value
Maximum x- or y-dimension of a grid of thread blocks                                             | 65535
Maximum dimensionality of a thread block                                                         | 3
Maximum z-dimension of a block                                                                   | 64
Warp size                                                                                        | 32
Maximum number of resident blocks per multiprocessor                                             | 8
Constant memory size                                                                             | 64 K
Cache working set per multiprocessor for constant memory                                         | 8 K
Maximum width for a 1D texture reference bound to linear memory                                  | 2^27
Maximum width, height and depth for a 3D texture reference bound to linear memory or a CUDA array | 2048 x 2048 x 2048
Maximum number of textures that can be bound to a kernel                                         | 128
Maximum number of instructions per kernel                                                        | 2 million
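The grid-dimension limit matters once an array needs more blocks than a grid dimension allows. One common pattern, sketched below with illustrative names, is a grid-stride loop with the block count capped at the maximum x-dimension:

```cuda
// Sketch: covering N elements without exceeding the 65535 grid-dimension limit.
__global__ void scale(float *x, int n, float a)
{
    // Grid-stride loop: each thread handles every (gridDim.x * blockDim.x)-th element.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= a;
}

void launch_scale(float *d_x, int n, float a)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (blocks > 65535) blocks = 65535;     // cap at the maximum grid x-dimension
    scale<<<blocks, threads>>>(d_x, n, a);
}
```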
Different Technical Specifications

Spec                                                  | v. 1.1 | v. 1.3 | v. 2.x
Maximum number of resident warps per multiprocessor   | 24     | 32     | 48
Maximum number of resident threads per multiprocessor | 768    | 1024   | 1536
Number of 32-bit registers per multiprocessor         | 8 K    | 16 K   | 32 K
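A short worked example of how these per-multiprocessor limits combine, assuming a v. 1.3 device and a 256-thread block (the block size is an illustrative choice, not from the lecture):

```cuda
// Occupancy arithmetic for one v. 1.3 multiprocessor
// (32 resident warps, 1024 resident threads, 8 resident blocks, 16 K registers):
//   block size      = 256 threads = 8 warps
//   resident blocks = min(1024 threads / 256, 8) = 4 blocks
//   resident warps  = 4 blocks * 8 warps = 32 warps  -> full occupancy
//   register budget = 16384 registers / 1024 threads = 16 registers per thread
dim3 block(256);   // a block size that can fully occupy a v. 1.3 multiprocessor
```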
Different Technical Specifications (continued)

Spec                                                          | v. 1.1, 1.3 | v. 2.x
Maximum dimensionality of a grid of thread blocks             | 2           | 3
Maximum x- or y-dimension of a block                          | 512         | 1024
Maximum number of threads per block                           | 512         | 1024
Maximum amount of shared memory per multiprocessor            | 16 K        | 48 K
Number of shared memory banks                                 | 16          | 32
Amount of local memory per thread                             | 16 K        | 512 K
Maximum width for a 1D texture reference bound to a CUDA array | 8192        | 32768
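The shared-memory limit above is what bounds the third launch parameter when using dynamically sized shared memory. A minimal sketch (kernel name and logic are illustrative):

```cuda
// Sketch: dynamic shared memory, sized at launch time; n * sizeof(float)
// must fit the per-multiprocessor limit (16 KB on v. 1.x, up to 48 KB on v. 2.x).
__global__ void reverse(float *x, int n)
{
    extern __shared__ float buf[];          // size supplied by the launch configuration
    int t = threadIdx.x;
    if (t < n) buf[t] = x[t];
    __syncthreads();                        // all loads complete before any store
    if (t < n) x[t] = buf[n - 1 - t];
}

// Launch with n floats of dynamic shared memory:
//   reverse<<<1, n, n * sizeof(float)>>>(d_x, n);
```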
Different Technical Specifications (continued)

Spec                                                                         | v. 1.1, 1.3   | v. 2.x
Maximum width and number of layers for a 1D layered texture reference        | 8192 x 512    | 16384 x 2048
Maximum width and height for a 2D texture reference bound to linear memory or a CUDA array | 65536 x 32768 | 65536 x 65536
Maximum width, height, and number of layers for a 2D layered texture reference | 8192 x 8192 x 512 | 16384 x 16384 x 2048
Maximum width for a 1D surface reference bound to a CUDA array               | Not supported | 8192
Maximum width and height for a 2D surface reference bound to a CUDA array    | Not supported | 8192 x 8192
Maximum number of surfaces that can be bound to a kernel                     | Not supported | 8
Overview: CUDA Components
CUDA (Compute Unified Device Architecture) is NVIDIA’s program development environment:
- based on C with some extensions
- C++ support increasing steadily
- FORTRAN support provided by the PGI compiler
- lots of example code and good documentation: a 2-4 week learning curve for those with experience of OpenMP and MPI programming
- large user community on NVIDIA forums

When installing CUDA on a system, there are three components:
- driver
  - low-level software that controls the graphics card
  - usually installed by the sys-admin
- toolkit
  - the nvcc CUDA compiler
  - some profiling and debugging tools
  - various libraries
  - usually installed by the sys-admin in /usr/local/cuda
- SDK
  - lots of demonstration examples
  - a convenient Makefile for building applications
  - some error-checking utilities
  - not supported by NVIDIA; almost no documentation
  - often installed by the user in their own directory
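To make "C with some extensions" concrete, here is a minimal, self-contained sketch of a complete CUDA program: a kernel marked __global__, launched with the <<<...>>> syntax, with explicit host-device copies. The kernel and variable names are illustrative.

```cuda
// Minimal CUDA program sketch: vector addition.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];          // one element per thread
}

int main()
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_a[n], h_b[n], h_c[n];
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2 * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[1023] = %f\n", h_c[1023]);    // 1023 + 2046 = 3069
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```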
Accessing the Tesla Card
Remotely access the front end:
    ssh tesla.cs.uakron.edu
ssh sends your commands over an encrypted stream, so your passwords, etc. can’t be sniffed over the network.
The first time you do this, after login run
    /root/gpucomputingsdk_3.2.16_linux.run
and just take the default answers to get your own personal copy of the SDK. Then
    cd ~/NVIDIA_GPU_Computing_SDK/C
    make -j12 -k
will build all that can be built.

Binaries end up in ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release. In particular, the header file <cutil_inline.h> is in ~/NVIDIA_GPU_Computing_SDK/C/common/inc.

You can then get a summary of the technical specs and compute capabilities by executing
    ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery
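The core of what deviceQuery reports can be obtained directly from the CUDA runtime API; a minimal sketch:

```cuda
// Sketch: querying device properties, the same information deviceQuery prints.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);      // properties of device 0
    printf("%s: compute capability %d.%d, %d multiprocessors, %zu bytes global memory\n",
           prop.name, prop.major, prop.minor,
           prop.multiProcessorCount, prop.totalGlobalMem);
    return 0;
}
```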
CUDA Makefile
Two choices:
- use nvcc within a standard Makefile
- use the special Makefile template provided in the SDK
The SDK Makefile provides some useful options:
- make emu=1 uses an emulation library for debugging on a CPU
- make dbg=1 activates run-time error checking
In general just use a standard Makefile.
Sample Tesla Makefile

GENCODE_ARCH := -gencode=arch=compute_10,code=\"sm_10,compute_10\" -gencode=arch=compute_13,code=\"sm_13,compute_13\" -gencode=arch=compute_20,code=\"sm_20,compute_20\"

INCLOCS := -I$(HOME)/NVIDIA_GPU_Computing_SDK/shared/inc -I$(HOME)/NVIDIA_GPU_Computing_SDK/C/common/inc

LIBLOCS := -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib -L$(HOME)/NVIDIA_GPU_Computing_SDK/C/lib

LIBS = -lcutil_x86_64

<progName>: <progName>.cu <progName>.cuh
	nvcc $(GENCODE_ARCH) $(INCLOCS) <progName>.cu $(LIBLOCS) $(LIBS) -o <progName>
CUDA Tools and Threads – Slide 2
Compiling a CUDA Program
Parallel Thread Execution (PTX):
- virtual machine and ISA
- programming model
- execution resources and state
Compilation
- Any source file containing CUDA extensions must be compiled with NVCC.
- NVCC is a compiler driver: it works by invoking all the necessary tools and compilers like cudacc, g++, cl, etc.
- NVCC outputs:
  - C code (host CPU code), which must then be compiled with the rest of the application using another tool
  - PTX: either object code directly, or PTX source interpreted at runtime
Linking
Any executable with CUDA code requires two dynamic libraries:
- the CUDA runtime library (cudart)
- the CUDA core library (cuda)
Debugging Using the Device Emulation Mode
An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime:
- no need for any device or CUDA driver
- each device thread is emulated with a host thread
Running in device emulation mode, one can:
- use host native debug support (breakpoints, inspection, etc.)
- access any device-specific data from host code and vice versa
- call any host function from device code (e.g. printf) and vice versa
- detect deadlock situations caused by improper usage of __syncthreads
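A sketch of the kind of trace this enables; under -deviceemu each emulated thread can call host functions such as printf (on real hardware, device-side printf requires compute capability 2.0). The kernel name is illustrative.

```cuda
// Sketch: per-thread debug tracing under device emulation mode.
#include <cstdio>

__global__ void trace(const float *x, int n)
{
    int i = threadIdx.x;
    if (i < n)
        printf("thread %d sees x[%d] = %f\n", i, i, x[i]);  // host call, legal under -deviceemu
}
```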
Device Emulation Mode Pitfalls
- Emulated device threads execute sequentially, so simultaneous access of the same memory location by multiple threads could produce different results.
- Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode.
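A concrete instance of the second pitfall, sketched below: the commented-out host-side dereference of a device pointer appears to work under emulation but faults on a real device.

```cuda
// Sketch: a bug that device emulation mode hides.
float *d_x;
cudaMalloc((void **)&d_x, sizeof(float));
// *d_x = 1.0f;            // "works" under -deviceemu; an error in device execution mode

float one = 1.0f;
cudaMemcpy(d_x, &one, sizeof(float), cudaMemcpyHostToDevice);  // the portable way
```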
Floating Point
Results of floating-point computations will differ slightly because of:
- different compiler outputs and instruction sets
- use of extended precision for intermediate results
There are various options to force strict single precision on the host.
Nexus
- New Visual Studio-based GPU integrated development environment
- http://developer.nvidia.com/object/nexus.html
- Available in beta (as of October 2009)
End Credits
Based on original material from:
- http://en.wikipedia.org/wiki/CUDA, accessed 6/22/2011
- The University of Akron: Charles Van Tilburg
- The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
- Oxford University: Mike Giles
- Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 6/23/2011.