Lecture #5: Advanced CUDA | February 22nd, 2011 Nicolas Pinto (MIT, Harvard) [email protected] Massively Parallel Computing CS 264 / CSCI E-292

[Harvard CS264] 05 - Advanced-level CUDA Programming

http://cs264.org

Page 1: [Harvard CS264] 05 - Advanced-level CUDA Programming

Lecture #5: Advanced CUDA | February 22nd, 2011

Nicolas Pinto (MIT, Harvard) [email protected]

Massively Parallel Computing CS 264 / CSCI E-292

Page 2: [Harvard CS264] 05 - Advanced-level CUDA Programming

Administrivia

• HW2: out, due Mon 3/14/11 (not Fri 3/11/11)

• Projects: think about it, consult the staff (*), proposals due ~ Fri 3/25/11

• Guest lectures:

• schedule coming soon

• on Fridays 7.35-9.35pm (March, April) ?

Page 3: [Harvard CS264] 05 - Advanced-level CUDA Programming

During this course,

we’ll try to

and use existing material ;-)

“ ”

adapted for CS264

Page 4: [Harvard CS264] 05 - Advanced-level CUDA Programming

Today, yey!!

Page 5: [Harvard CS264] 05 - Advanced-level CUDA Programming

Outline

1. Hardware Review

2. Memory/Communication Optimizations

3. Threading/Execution Optimizations

Page 6: [Harvard CS264] 05 - Advanced-level CUDA Programming

1. Hardware Review

Page 7: [Harvard CS264] 05 - Advanced-level CUDA Programming

8© NVIDIA Corporation 2008

10-Series Architecture

240 thread processors execute kernel threads

30 multiprocessors, each contains

8 thread processors

One double-precision unit

Shared memory enables thread cooperation

(Diagram: a multiprocessor containing thread processors, a double-precision unit, and shared memory)

Page 8: [Harvard CS264] 05 - Advanced-level CUDA Programming

© 2008 NVIDIA Corporation.

Execution Model

Software -> Hardware

Thread -> Thread Processor: threads are executed by thread processors

Thread Block -> Multiprocessor: thread blocks are executed on multiprocessors; thread blocks do not migrate; several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)

Grid -> Device: a kernel is launched as a grid of thread blocks; only one kernel can execute on a device at one time

Threading Hierarchy

Page 9: [Harvard CS264] 05 - Advanced-level CUDA Programming

10© NVIDIA Corporation 2008

Warps and Half Warps

A thread block consists of 32-thread warps

A warp is executed physically in parallel (SIMD) on a multiprocessor

A half-warp of 16 threads can coordinate global memory accesses into a single transaction

(Diagram: a thread block on a multiprocessor split into 32-thread warps and 16-thread half-warps; device memory = DRAM holding global and local memory)

Page 10: [Harvard CS264] 05 - Advanced-level CUDA Programming

11© NVIDIA Corporation 2008

Memory Architecture

(Diagram: Host side: CPU, chipset, DRAM. Device side: DRAM holding global, constant, texture, and local memory; GPU with several multiprocessors, each with registers and shared memory, plus constant and texture caches.)

Page 11: [Harvard CS264] 05 - Advanced-level CUDA Programming

© 2008 NVIDIA Corporation.

Kernel Memory Access

Per-thread: registers (on-chip) and local memory (off-chip, uncached)

Per-block: shared memory (on-chip, small, fast)

Per-device: global memory (off-chip, large, uncached, persistent across kernel launches, kernel I/O), shared by successive kernels (Kernel 0, Kernel 1, ...) over time

Page 12: [Harvard CS264] 05 - Advanced-level CUDA Programming

© 2008 NVIDIA Corporation.

Global Memory

Per-device: off-chip, large, uncached, persistent across kernel launches, used for kernel I/O (Kernel 0, Kernel 1, ... share it over time)

Different types of “global memory”:

• Linear Memory

• Texture Memory

• Constant Memory

Page 13: [Harvard CS264] 05 - Advanced-level CUDA Programming

12© NVIDIA Corporation 2008

Memory Architecture

Memory Location Cached Access Scope Lifetime

Register On-chip N/A R/W One thread Thread

Local Off-chip No R/W One thread Thread

Shared On-chip N/A R/W All threads in a block Block

Global Off-chip No R/W All threads + host Application

Constant Off-chip Yes R All threads + host Application

Texture Off-chip Yes R All threads + host Application

Page 14: [Harvard CS264] 05 - Advanced-level CUDA Programming

2. Memory/Communication Optimizations

Page 15: [Harvard CS264] 05 - Advanced-level CUDA Programming

2.1 Host/Device Transfer Optimizations

Review

Page 16: [Harvard CS264] 05 - Advanced-level CUDA Programming

Thread blocks (review)

• A thread block may have up to 512 threads

• All threads in a thread block are run on the same multiprocessor, so they can communicate via shared memory and synchronize

• Threads of a block are multiplexed onto a multiprocessor as warps

PCI-Express (review)

• PCIe (which replaced AGP): full-duplex, serial, symmetric bus with 250 MB/s of bandwidth in each direction (per lane)

• Multiple-lane configurations, e.g. PCI-E 16x = 16 lanes, i.e. 16 times the bandwidth (4 GB/s)

• The CUDA specification has been updated: version 1.0 (initial release), version 1.1 (update for newer hardware, backwards compatible), with version 1.2 / 2.0 expected to add 64-bit floating point support (i.e. double)

• Version 1.1 added some important useful features. Software: asynchronous memory copies, asynchronous GPU program launch. Hardware: atomic memory instructions.

(Diagram: CPU, Front Side Bus, Northbridge, Memory Bus, DRAM; Northbridge, PCI-Express bus, graphics card w/ CUDA; Southbridge, PCI bus, SATA, Ethernet. Bandwidths noted on the figure: 3+ Gb/s, 8 GB/s, 25+ GB/s, and 160+ GB/s to VRAM.)

PC Architecture

modified from Matthew Bolitho

Review

Page 17: [Harvard CS264] 05 - Advanced-level CUDA Programming

The PCI-“not-so”-e Bus

• PCIe bus is slow

• Try to minimize/group transfers

• Use pinned memory on host whenever possible

• Try to perform copies asynchronously (e.g. Streams)

• Use “Zero-Copy” when appropriate

• Examples in the SDK (e.g. bandwidthTest)

Review
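To make these recommendations concrete, here is a minimal hedged sketch (not from the slides; the buffer names and the scale kernel are illustrative) combining pinned host memory with asynchronous copies queued on a CUDA stream:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *d, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] *= 2.0f;
}

int main(void)
{
  const int n = 1 << 20;
  float *h_pinned, *d_data;
  cudaMallocHost(&h_pinned, n * sizeof(float));   // pinned (page-locked) host memory
  cudaMalloc(&d_data, n * sizeof(float));
  for (int i = 0; i < n; ++i) h_pinned[i] = 1.0f;

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // queue async H2D copy, kernel, and async D2H copy on the same stream
  cudaMemcpyAsync(d_data, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice, stream);
  scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
  cudaMemcpyAsync(h_pinned, d_data, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  printf("h_pinned[0] = %f\n", h_pinned[0]);
  cudaStreamDestroy(stream);
  cudaFree(d_data);
  cudaFreeHost(h_pinned);
  return 0;
}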

Page 18: [Harvard CS264] 05 - Advanced-level CUDA Programming

2.2 Device Memory Optimizations

Page 19: [Harvard CS264] 05 - Advanced-level CUDA Programming

Definitions

• gmem: global memory

• smem: shared memory

• tmem: texture memory

• cmem: constant memory

• bmem: binary code (cubin) memory ?!? (covered next week)

Page 20: [Harvard CS264] 05 - Advanced-level CUDA Programming

Performance Analysis: e.g. Matrix Transpose

Page 21: [Harvard CS264] 05 - Advanced-level CUDA Programming

22© NVIDIA Corporation 2008

Matrix Transpose

Transpose 2048x2048 matrix of floats

Performed out-of-place

Separate input and output matrices

Use tile of 32x32 elements, block of 32x8 threads

Each thread processes 4 matrix elements

In general, tile and block size are fair game for optimization

Process

Get the right answer

Measure effective bandwidth (relative to theoretical or reference case)

Address global memory coalescing, shared memory bank conflicts, and partition camping while repeating the above steps

Page 22: [Harvard CS264] 05 - Advanced-level CUDA Programming

23© NVIDIA Corporation 2008

Theoretical Bandwidth

Device bandwidth of GTX 280:

1107 * 10^6 (memory clock, Hz) * (512 / 8) (memory interface, bytes) * 2 (DDR) / 1024^3 = 131.9 GB/s

Specs report 141 GB/s: they use the 10^9 B/GB conversion rather than 1024^3; whichever you use, be consistent

Page 23: [Harvard CS264] 05 - Advanced-level CUDA Programming

24© NVIDIA Corporation 2008

Effective Bandwidth

Transpose effective bandwidth:

2048^2 * 4 B/element (matrix size, bytes) / 1024^3 * 2 (read and write) / (time in secs) = GB/s

Reference case - matrix copy:

Transpose operates on tiles, so we need a better comparison than raw device bandwidth

Look at the effective bandwidth of a copy that uses tiles

Page 24: [Harvard CS264] 05 - Advanced-level CUDA Programming

25© NVIDIA Corporation 2008

Matrix Copy Kernel

__global__ void copy(float *odata, float *idata, int width, int height)
{
  int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
  int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
  int index  = xIndex + width * yIndex;

  for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
    odata[index + i*width] = idata[index + i*width];
  }
}

TILE_DIM = 32, BLOCK_ROWS = 8: 32x32 tile, 32x8 thread block

idata and odata in global memory

(Figure: elements copied by a half-warp of threads)

Page 25: [Harvard CS264] 05 - Advanced-level CUDA Programming

26© NVIDIA Corporation 2008

Matrix Copy Kernel Timing

Measure elapsed time over loop

Looping/timing done in two ways:

Over kernel launches (nreps = 1)

Includes launch/indexing overhead

Within the kernel over loads/stores (nreps > 1)

Amortizes launch/indexing overhead

__global__ void copy(float *odata, float *idata, int width, int height, int nreps)
{
  int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
  int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
  int index  = xIndex + width * yIndex;

  for (int r = 0; r < nreps; r++) {
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
      odata[index + i*width] = idata[index + i*width];
    }
  }
}
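As a hedged companion to the timing discussion (not from the slides; timeCopy and the launch configuration are illustrative, assuming the copy kernel, TILE_DIM, BLOCK_ROWS, and device buffers set up as above), one way to measure elapsed time and convert it to effective bandwidth is with CUDA events:

#include <cstdio>

void timeCopy(float *d_odata, float *d_idata, int width, int height, int nreps)
{
  dim3 threads(TILE_DIM, BLOCK_ROWS);
  dim3 grid(width / TILE_DIM, height / TILE_DIM);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, 0);
  copy<<<grid, threads>>>(d_odata, d_idata, width, height, nreps);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);                       // elapsed time in ms
  double bytes = 2.0 * width * height * sizeof(float) * nreps;  // read + write
  printf("effective bandwidth: %.1f GB/s\n", bytes / (ms / 1e3) / (1024.0 * 1024.0 * 1024.0));

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
}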

Page 26: [Harvard CS264] 05 - Advanced-level CUDA Programming

27© NVIDIA Corporation 2008

Naïve Transpose

Similar to copy

Input and output matrices have different indices

__global__ void transposeNaive(float *odata, float *idata, int width, int height, int nreps)
{
  int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
  int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;

  int index_in  = xIndex + width * yIndex;
  int index_out = yIndex + height * xIndex;

  for (int r = 0; r < nreps; r++) {
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
      odata[index_out + i] = idata[index_in + i*width];
    }
  }
}

idata odata

Page 27: [Harvard CS264] 05 - Advanced-level CUDA Programming

28© NVIDIA Corporation 2008

Effective Bandwidth

Effective Bandwidth (GB/s), 2048x2048, GTX 280

                   Loop over kernel   Loop in kernel
Simple Copy        96.9               81.6
Naïve Transpose    2.2                2.2

Page 28: [Harvard CS264] 05 - Advanced-level CUDA Programming
Page 29: [Harvard CS264] 05 - Advanced-level CUDA Programming

gmem coalescing

Page 30: [Harvard CS264] 05 - Advanced-level CUDA Programming

Memory Coalescing

GPU memory controller granularity is 64 or 128 bytes

Must also be 64 or 128 byte aligned

Suppose thread loads a float (4 bytes)

Controller loads 64 bytes, throws 60 bytes away

Page 31: [Harvard CS264] 05 - Advanced-level CUDA Programming

Memory Coalescing

Memory controller actually more intelligent

Consider half-warp (16 threads)

Suppose each thread reads consecutive float

Memory controller will perform one 64 byte load

This is known as coalescing

Make threads read consecutive locations
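A hedged illustration of that rule (not from the slides; STRIDE and the kernel names are made up): consecutive threads reading consecutive words coalesce, while strided reads scatter across segments and break into many transactions.

#define STRIDE 16

// coalesced: thread k of a half-warp reads word k, so one transaction serves the half-warp
__global__ void copyCoalesced(float *out, const float *in, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];
}

// strided: each thread skips STRIDE elements, so accesses spread over many segments
__global__ void copyStrided(float *out, const float *in, int n)
{
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * STRIDE;
  if (i < n) out[i] = in[i];
}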

Page 32: [Harvard CS264] 05 - Advanced-level CUDA Programming

30© NVIDIA Corporation 2008

Coalescing

Global Memory

Half-warp of threads

} 64B aligned segment (16 floats)

Global memory access of 32-, 64-, or 128-bit words by a half-warp of threads can result in as few as one (or two) transaction(s) if certain access requirements are met

Depends on compute capability

1.0 and 1.1 have stricter access requirements

Examples – float (32-bit) data

}128B aligned segment (32 floats)

Page 33: [Harvard CS264] 05 - Advanced-level CUDA Programming

31© NVIDIA Corporation 2008

CoalescingCompute capability 1.0 and 1.1

The k-th thread must access the k-th word in the segment (or the k-th word in 2 contiguous 128B segments for 128-bit words); not all threads need to participate

Coalesces – 1 transaction

Out of sequence – 16 transactions Misaligned – 16 transactions

Page 34: [Harvard CS264] 05 - Advanced-level CUDA Programming

Memory Coalescing

GT200 has hardware coalescer

Inspects memory requests from each half-warp

Determines minimum set of transactions which are

64 or 128 bytes long

64 or 128 byte aligned

Page 35: [Harvard CS264] 05 - Advanced-level CUDA Programming

32© NVIDIA Corporation 2008

CoalescingCompute capability 1.2 and higher

1 transaction - 64B segment

2 transactions - 64B and 32B segments 1 transaction - 128B segment

Coalescing is achieved for any pattern of addresses that fits into a segment of size: 32B for 8-bit words, 64B for 16-bit words, 128B for 32- and 64-bit words

Smaller transactions may be issued to avoid wasted bandwidth due to unused words

(e.g. GT200 like the C1060)

Page 36: [Harvard CS264] 05 - Advanced-level CUDA Programming

© NVIDIA Corporation 2010

CoalescingCompute capability 2.0 (Fermi, Tesla C2050)

32 transactions - 32 x 32B segments, instead of 32 x 128B segments.

2 transactions - 2 x 128B segment - but next warp probably only 1 extra transaction, due to L1 cache.

Memory transactions are handled per warp (32 threads)

L1 cache ON: always issues 128B segment transactions and caches them in the 16 kB or 48 kB L1 cache per multiprocessor

L1 cache OFF: always issues 32B segment transactions; e.g. an advantage for widely scattered thread accesses

Page 37: [Harvard CS264] 05 - Advanced-level CUDA Programming

Coalescing Summary

Coalescing dramatically speeds global memory access

Strive for perfect coalescing:

Align starting address (may require padding)

A warp should access within a contiguous region

Page 38: [Harvard CS264] 05 - Advanced-level CUDA Programming

33© NVIDIA Corporation 2008

Coalescing in Transpose

Naïve transpose coalesces reads, but not writes

idata odata

Elements transposed by a half-warp of threads

Q: How to coalesce writes ?

Page 39: [Harvard CS264] 05 - Advanced-level CUDA Programming

smem as a cache

Page 40: [Harvard CS264] 05 - Advanced-level CUDA Programming

Shared Memory

SMs can access gmem at 80+ GiB/sec

but have hundreds of cycles of latency

Each SM has 16 kiB ‘shared’ memory

Essentially user-managed cache

Speed comparable to registers

Accessible to all threads in a block

Reduces load/stores to device memory

Page 41: [Harvard CS264] 05 - Advanced-level CUDA Programming

34© NVIDIA Corporation 2008

Shared Memory

~Hundred times faster than global memory

Cache data to reduce global memory accesses

Threads can cooperate via shared memory

Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing

Page 42: [Harvard CS264] 05 - Advanced-level CUDA Programming

33© NVIDIA Corporation 2008

Coalescing in Transpose

Naïve transpose coalesces reads, but not writes

idata odata

Elements transposed by a half-warp of threads

Q: How to coalesce writes ?

Page 43: [Harvard CS264] 05 - Advanced-level CUDA Programming

34© NVIDIA Corporation 2008

Shared Memory

~Hundred times faster than global memory

Cache data to reduce global memory accesses

Threads can cooperate via shared memory

Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing

Page 44: [Harvard CS264] 05 - Advanced-level CUDA Programming

© 2008 NVIDIA Corporation

A Common Programming Strategy

• Partition data into subsets that fit into shared memory

Page 45: [Harvard CS264] 05 - Advanced-level CUDA Programming

© 2008 NVIDIA Corporation

A Common Programming Strategy

• Handle each data subset with one thread block

Page 46: [Harvard CS264] 05 - Advanced-level CUDA Programming

© 2008 NVIDIA Corporation

A Common Programming Strategy

• Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism

Page 47: [Harvard CS264] 05 - Advanced-level CUDA Programming

© 2008 NVIDIA Corporation

A Common Programming Strategy

• Perform the computation on the subset from shared memory

Page 48: [Harvard CS264] 05 - Advanced-level CUDA Programming

© 2008 NVIDIA Corporation

A Common Programming Strategy

• Copy the result from shared memory back to global memory (the five steps are sketched in code below)
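A minimal hedged sketch of the five-step strategy (not from the slides; squareTiled and TILE are illustrative, launched with TILE = 256 threads per block), with one block handling one tile of data:

#define TILE 256

__global__ void squareTiled(float *out, const float *in, int n)
{
  __shared__ float tile[TILE];
  int i = blockIdx.x * TILE + threadIdx.x;             // one subset (tile) per thread block

  if (i < n) tile[threadIdx.x] = in[i];                // load the subset into shared memory
  __syncthreads();

  if (i < n) tile[threadIdx.x] *= tile[threadIdx.x];   // compute on the subset in shared memory
  __syncthreads();

  if (i < n) out[i] = tile[threadIdx.x];               // copy the result back to global memory
}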

Page 49: [Harvard CS264] 05 - Advanced-level CUDA Programming

35© NVIDIA Corporation 2008

Coalescing through shared memory

Access columns of a tile in shared memory to write contiguous data to global memory

Requires __syncthreads() since threads write data read by other threads

Elements transposed by a half-warp of threads

idata odata

tile

Page 50: [Harvard CS264] 05 - Advanced-level CUDA Programming

36

__global__ void transposeCoalesced(float *odata, float *idata, int width, int height, int nreps)
{
  __shared__ float tile[TILE_DIM][TILE_DIM];

  int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
  int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
  int index_in = xIndex + yIndex * width;

  xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
  yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
  int index_out = xIndex + yIndex * height;

  for (int r = 0; r < nreps; r++) {
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
      tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i*width];
    }

    __syncthreads();

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
      odata[index_out + i*height] = tile[threadIdx.x][threadIdx.y + i];
    }
  }
}

© NVIDIA Corporation 2008

Coalescing through shared memory

Page 51: [Harvard CS264] 05 - Advanced-level CUDA Programming

37© NVIDIA Corporation 2008

Effective Bandwidth

Effective Bandwidth (GB/s), 2048x2048, GTX 280

                      Loop over kernel   Loop in kernel
Simple Copy           96.9               81.6
Shared Memory Copy    80.9               81.1
Naïve Transpose       2.2                2.2
Coalesced Transpose   16.5               17.1

The coalesced transpose uses a shared memory tile and __syncthreads()

Page 52: [Harvard CS264] 05 - Advanced-level CUDA Programming
Page 53: [Harvard CS264] 05 - Advanced-level CUDA Programming

smem bank conflicts

Page 54: [Harvard CS264] 05 - Advanced-level CUDA Programming

39© NVIDIA Corporation 2008

Shared Memory Architecture

Many threads access memory, therefore memory is divided into banks

Successive 32-bit words are assigned to successive banks

Each bank can service one address per cycle; a memory can service as many simultaneous accesses as it has banks

Multiple simultaneous accesses to a bank result in a bank conflict; conflicting accesses are serialized

(Figure: banks 0 through 15)

Page 55: [Harvard CS264] 05 - Advanced-level CUDA Programming

Shared Memory Banks

Shared memory divided into 16 ‘banks’

Shared memory is (almost) as fast as registers (...)

Exception is in case of bank conflicts

(Figure: 32-bit words 0-31 laid out across banks 0-15, 4 bytes per bank; word k and word k+16 fall in the same bank)

Page 56: [Harvard CS264] 05 - Advanced-level CUDA Programming

40© NVIDIA Corporation 2008

Bank Addressing Examples

No Bank Conflicts: linear addressing, stride == 1

No Bank Conflicts: random 1:1 permutation

(Figure: in both cases, threads 0-15 each map to a distinct bank 0-15)

Page 57: [Harvard CS264] 05 - Advanced-level CUDA Programming

41© NVIDIA Corporation 2008

Bank Addressing Examples

2-way Bank Conflicts: linear addressing, stride == 2

8-way Bank Conflicts: linear addressing, stride == 8

(Figure: with stride 2, the 16 threads hit only the even banks, two threads per bank; with stride 8, the threads land on banks 0 and 8 only, eight threads per bank)

Page 58: [Harvard CS264] 05 - Advanced-level CUDA Programming

42© NVIDIA Corporation 2008

Shared memory bank conflicts

Shared memory is ~ as fast as registers if there are no bank conflicts

The warp_serialize profiler signal reflects conflicts

The fast case:

If all threads of a half-warp access different banks, there is no bank conflict

If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)

The slow case:

Bank conflict: multiple threads in the same half-warp access the same bank

Must serialize the accesses

Cost = max # of simultaneous accesses to a single bank

Page 59: [Harvard CS264] 05 - Advanced-level CUDA Programming

43© NVIDIA Corporation 2008

Bank Conflicts in Transpose

32x32 shared memory tile of floats

Data in columns k and k+16 are in same bank

16-way bank conflict reading half columns in tile

Solution: pad the shared memory array: __shared__ float tile[TILE_DIM][TILE_DIM+1];

Data in anti-diagonals are in same bank

idata odata

tile

Q: How to avoid bank conflicts ?

Page 60: [Harvard CS264] 05 - Advanced-level CUDA Programming

43© NVIDIA Corporation 2008

Bank Conflicts in Transpose

32x32 shared memory tile of floats

Data in columns k and k+16 are in same bank

16-way bank conflict reading half columns in tile

Solution: pad the shared memory array: __shared__ float tile[TILE_DIM][TILE_DIM+1];

Data in anti-diagonals are in same bank

idata odata

tile

Page 61: [Harvard CS264] 05 - Advanced-level CUDA Programming

Illustration: Shared Memory, Avoiding Bank Conflicts

32x32 SMEM array

Warp accesses a column: 32-way bank conflicts (threads in a warp access the same bank)

(Figure: warps 0, 1, 2, ..., 31 against banks 0, 1, ..., 31; a column access puts all 32 threads of a warp into one bank)

© NVIDIA 2010

Page 62: [Harvard CS264] 05 - Advanced-level CUDA Programming

Illustration: Shared Memory, Avoiding Bank Conflicts

Add a column for padding: 32x33 SMEM array

Warp accesses a column: 32 different banks, no bank conflicts

(Figure: with the padding column, a column access spreads the 32 threads of a warp across banks 0-31)

© NVIDIA 2010

Page 63: [Harvard CS264] 05 - Advanced-level CUDA Programming

44© NVIDIA Corporation 2008

Effective Bandwidth

Effective Bandwidth (GB/s), 2048x2048, GTX 280

                              Loop over kernel   Loop in kernel
Simple Copy                   96.9               81.6
Shared Memory Copy            80.9               81.1
Naïve Transpose               2.2                2.2
Coalesced Transpose           16.5               17.1
Bank Conflict Free Transpose  16.6               17.2

Page 64: [Harvard CS264] 05 - Advanced-level CUDA Programming

Need a pause?

Page 65: [Harvard CS264] 05 - Advanced-level CUDA Programming

Unrelated: Thatcher Illusion

Page 66: [Harvard CS264] 05 - Advanced-level CUDA Programming

Unrelated: Thatcher Illusion

Page 67: [Harvard CS264] 05 - Advanced-level CUDA Programming

gmem partition camping

Page 68: [Harvard CS264] 05 - Advanced-level CUDA Programming

46© NVIDIA Corporation 2008

Partition Camping

Global memory accesses go through partitions

6 partitions on 8-series GPUs, 8 partitions on 10-series GPUs

Successive 256-byte regions of global memory are assigned to successive partitions

For best performance:

Simultaneous global memory accesses GPU-wide should be distributed evenly amongst partitions

Partition camping occurs when global memory accesses at an instant use only a subset of partitions

Directly analogous to shared memory bank conflicts, but on a larger scale

Page 69: [Harvard CS264] 05 - Advanced-level CUDA Programming

47© NVIDIA Corporation 2008

(Figure: tile IDs in idata run across a row: 0, 1, 2, 3, 4, 5, then 64, 65, ..., 128, 129, 130, ...; in odata the same tiles run down a column: 0, 64, 128 / 1, 65, 129 / 2, 66, 130 / ...)

Partition Camping in Transpose

tiles in matrices

colors = partitions

blockId = gridDim.x * blockIdx.y + blockIdx.x

Partition width = 256 bytes = 64 floats

Twice width of tile

On GTX 280 (8 partitions), data 2KB apart map to the same partition

2048 floats divides evenly by 2KB, so the columns of the matrices map to the same partition

Page 70: [Harvard CS264] 05 - Advanced-level CUDA Programming

48© NVIDIA Corporation 2008

Partition Camping Solutions

blockId = gridDim.x * blockIdx.y + blockIdx.x

Pad matrices (by two tiles)

In general might be expensive/prohibitive memory-wise

Diagonally reorder blocks

Interpret blockIdx.y as different diagonal slices and blockIdx.x as distance along a diagonal

(Figure: with diagonal block ordering, successive blocks in idata and odata land in different partitions)

Page 71: [Harvard CS264] 05 - Advanced-level CUDA Programming

49

__global__ void transposeDiagonal(float *odata, float *idata, int width, int height, int nreps)
{
  __shared__ float tile[TILE_DIM][TILE_DIM+1];

  int blockIdx_y = blockIdx.x;
  int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;

  int xIndex = blockIdx_x * TILE_DIM + threadIdx.x;
  int yIndex = blockIdx_y * TILE_DIM + threadIdx.y;
  int index_in = xIndex + yIndex * width;

  xIndex = blockIdx_y * TILE_DIM + threadIdx.x;
  yIndex = blockIdx_x * TILE_DIM + threadIdx.y;
  int index_out = xIndex + yIndex * height;

  for (int r = 0; r < nreps; r++) {
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
      tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i*width];
    }
    __syncthreads();
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
      odata[index_out + i*height] = tile[threadIdx.x][threadIdx.y + i];
    }
  }
}

© NVIDIA Corporation 2008

Diagonal Transpose

Add lines to map diagonal to Cartesian coordinates; replace blockIdx.x with blockIdx_x and blockIdx.y with blockIdx_y

Page 72: [Harvard CS264] 05 - Advanced-level CUDA Programming

50

if (width == height) {
  blockIdx_y = blockIdx.x;
  blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;
} else {
  int bid = blockIdx.x + gridDim.x * blockIdx.y;
  blockIdx_y = bid % gridDim.y;
  blockIdx_x = ((bid / gridDim.y) + blockIdx_y) % gridDim.x;
}

© NVIDIA Corporation 2008

Diagonal Transpose

Previous slide for square matrices (width == height)

More generally:

Page 73: [Harvard CS264] 05 - Advanced-level CUDA Programming

51© NVIDIA Corporation 2008

Effective Bandwidth

Effective Bandwidth (GB/s), 2048x2048, GTX 280

                              Loop over kernel   Loop in kernel
Simple Copy                   96.9               81.6
Shared Memory Copy            80.9               81.1
Naïve Transpose               2.2                2.2
Coalesced Transpose           16.5               17.1
Bank Conflict Free Transpose  16.6               17.2
Diagonal                      69.5               78.3

Page 74: [Harvard CS264] 05 - Advanced-level CUDA Programming

52© NVIDIA Corporation 2008

Order of Optimizations

Larger optimization issues can mask smaller ones

Proper order of some optimization techniques is not known a priori

E.g. partition camping is problem-size dependent

Don't dismiss an optimization technique as ineffective until you know it was applied at the right time

(Diagram: Naïve 2.2 GB/s -> Coalescing 16.5 GB/s, then either Bank Conflicts 16.6 GB/s -> Partition Camping 69.5 GB/s, or Partition Camping 48.8 GB/s -> Bank Conflicts 69.5 GB/s)

Page 75: [Harvard CS264] 05 - Advanced-level CUDA Programming

53© NVIDIA Corporation 2008

Transpose Summary

Coalescing and shared memory bank conflicts are small-scale phenomena

They deal with memory access within a half-warp

Problem-size independent

Partition camping is a large-scale phenomenon

It deals with simultaneous memory accesses by warps on different multiprocessors

Problem-size dependent

You wouldn't see it in a (2048+32)^2 matrix

Coalescing is generally the most critical

SDK Transpose Example: http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html

Page 76: [Harvard CS264] 05 - Advanced-level CUDA Programming

tmem

Page 77: [Harvard CS264] 05 - Advanced-level CUDA Programming

55© NVIDIA Corporation 2008

Textures in CUDA

Texture is an object for reading data

Benefits:

Data is cached (optimized for 2D locality): helpful when coalescing is a problem

Filtering: linear / bilinear / trilinear, dedicated hardware

Wrap modes (for “out-of-bounds” addresses): clamp to edge / repeat

Addressable in 1D, 2D, or 3D, using integer or normalized coordinates

Usage: CPU code binds data to a texture object; the kernel reads data by calling a fetch function

Page 78: [Harvard CS264] 05 - Advanced-level CUDA Programming

Other goodies: optional “format conversion”

• {char, short, int, half} (16bit) to float (32bit)

• “for free”

• useful for *mem compression (see later)

Page 79: [Harvard CS264] 05 - Advanced-level CUDA Programming

56© NVIDIA Corporation 2008

Texture Addressing

Wrap: the out-of-bounds coordinate is wrapped (modulo arithmetic)

Clamp: the out-of-bounds coordinate is replaced with the closest boundary

(Figure: a small 2D texture fetched at (5.5, 1.5), (2.5, 0.5), and (1.0, 1.0) under wrap vs. clamp addressing)

Page 80: [Harvard CS264] 05 - Advanced-level CUDA Programming

57© NVIDIA Corporation 2008

Two CUDA Texture Types

Bound to linear memory: a global memory address is bound to a texture; only 1D; integer addressing; no filtering, no addressing modes

Bound to CUDA arrays: a CUDA array is bound to a texture; 1D, 2D, or 3D; float addressing (size-based or normalized); filtering; addressing modes (clamping, repeat)

Both: return either the element type or a normalized float

Page 81: [Harvard CS264] 05 - Advanced-level CUDA Programming

58© NVIDIA Corporation 2008

CUDA Texturing Steps

Host (CPU) code:

Allocate/obtain memory (global linear, or CUDA array)

Create a texture reference object (currently must be at file scope)

Bind the texture reference to memory/array

When done: unbind the texture reference, free resources

Device (kernel) code:

Fetch using the texture reference

Linear memory textures: tex1Dfetch()

Array textures: tex1D(), tex2D(), or tex3D()
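A hedged sketch of these steps for the linear-memory case (texRef, readThroughTexture, and launch are illustrative names, using the texture-reference API of that era):

texture<float, 1, cudaReadModeElementType> texRef;   // texture reference at file scope

__global__ void readThroughTexture(float *out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = tex1Dfetch(texRef, i);         // fetch through the texture cache
}

void launch(float *d_in, float *d_out, int n)
{
  cudaBindTexture(NULL, texRef, d_in, n * sizeof(float));   // bind linear global memory
  readThroughTexture<<<(n + 255) / 256, 256>>>(d_out, n);
  cudaUnbindTexture(texRef);                                // unbind when done
}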

Page 82: [Harvard CS264] 05 - Advanced-level CUDA Programming

cmem

Page 83: [Harvard CS264] 05 - Advanced-level CUDA Programming

Constant Memory

Ideal for coefficients and other data that is read uniformly by warps

Data is stored in global memory, read through a constant cache:

__constant__ qualifier in declarations

Can only be read by GPU kernels

Limited to 64 KB

Fermi adds uniform accesses:

Kernel pointer argument qualified with const

The compiler must determine that all threads in a threadblock will dereference the same address

No limit on array size, can use any global memory pointer

Constant cache throughput: 32 bits per warp per 2 clocks per multiprocessor

To be used when all threads in a warp read the same address; serializes otherwise

© NVIDIA 2010

Page 84: [Harvard CS264] 05 - Advanced-level CUDA Programming

Constant Memory

Ideal for coefficients and other data that is read uniformly by warps

Data is stored in global memory, read through a constant cache (__constant__ qualifier in declarations; can only be read by GPU kernels; limited to 64 KB)

Fermi adds uniform accesses: a kernel pointer argument qualified with const, where the compiler must determine that all threads in a threadblock will dereference the same address; no limit on array size, can use any global memory pointer

__global__ void kernel( const float *g_a )
{
  ...
  float x = g_a[15];              // uniform
  float y = g_a[blockIdx.x + 5];  // uniform
  float z = g_a[threadIdx.x];     // non-uniform
  ...
}

© NVIDIA 2010
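A hedged host/device sketch of the __constant__ path above (c_coeffs, setCoeffs, and applyCoeffs are illustrative names):

__constant__ float c_coeffs[64];                    // lives in constant memory (64 KB total limit)

void setCoeffs(const float *h_coeffs)
{
  cudaMemcpyToSymbol(c_coeffs, h_coeffs, 64 * sizeof(float));   // host fills constant memory
}

__global__ void applyCoeffs(float *data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // the index depends only on blockIdx, so all threads of a warp read the same word:
  // a uniform access served by the constant cache
  if (i < n) data[i] *= c_coeffs[blockIdx.x % 64];
}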

Page 85: [Harvard CS264] 05 - Advanced-level CUDA Programming

Constant Memory

Kernel executes 10K threads (320 warps) per SM during its lifetime

All threads access the same 4B word

Using ordinary global-memory loads:

Each warp fetches 32 B: about 10 KB of bus traffic

Wasting loads is potentially worse: a 128B line is very likely to be evicted multiple times

(Figure: addresses from a warp against 32B memory segments)

© NVIDIA 2010

Page 86: [Harvard CS264] 05 - Advanced-level CUDA Programming

Constant Memory

Kernel executes 10K threads (320 warps) per SM during its lifetime

All threads access the same 4B word

Using constant/uniform access:

The first warp fetches 32 bytes

All others hit in the constant cache (~32 bytes of bus traffic)

Unlikely to be evicted over the kernel lifetime: other loads do not go through this cache

(Figure: addresses from a warp)

© NVIDIA 2010

Page 87: [Harvard CS264] 05 - Advanced-level CUDA Programming

*mem compression

Page 88: [Harvard CS264] 05 - Advanced-level CUDA Programming

Optimizing with Compression

When all else has been optimized and the kernel is limited by the number of bytes needed: consider compression

Approaches:

Int: conversion between 8-, 16-, 32-bit integers is 1 instruction (64-bit requires a couple)

FP: conversion between fp16, fp32, fp64 is one instruction (fp16 (half) is storage only, no math instructions)

Range-based: lower and upper limits are kernel arguments; data is an index for interpolation

Application in practice: Clark et al., "Solving Lattice QCD systems of equations using mixed precision solvers on GPUs", http://arxiv.org/abs/0911.3191

© NVIDIA 2010

Page 89: [Harvard CS264] 05 - Advanced-level CUDA Programming

Accelerating GPU computation through

mixed-precision methods

Michael Clark

Harvard-Smithsonian Center for Astrophysics

Harvard University

SC’10

Page 90: [Harvard CS264] 05 - Advanced-level CUDA Programming

caching, bank conflicts, coalescing, partition camping, clamping, mixed precision, broadcasting, streams, zero-copy

... too much ?

Page 91: [Harvard CS264] 05 - Advanced-level CUDA Programming

Parallel Programming / Parking is Hard (but you'll pick it up)

Page 92: [Harvard CS264] 05 - Advanced-level CUDA Programming

(you are not alone)

Page 93: [Harvard CS264] 05 - Advanced-level CUDA Programming

3. Threading/Execution Optimizations

Page 94: [Harvard CS264] 05 - Advanced-level CUDA Programming

3.1 Exec. Configuration Optimizations

Page 95: [Harvard CS264] 05 - Advanced-level CUDA Programming

60© NVIDIA Corporation 2008

Occupancy

Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy

Occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently

Limited by resource usage:

Registers

Shared memory

Page 96: [Harvard CS264] 05 - Advanced-level CUDA Programming

61© NVIDIA Corporation 2008

Grid/Block Size Heuristics

# of blocks > # of multiprocessors: so all multiprocessors have at least one block to execute

# of blocks / # of multiprocessors > 2: multiple blocks can run concurrently on a multiprocessor; blocks that aren't waiting at a __syncthreads() keep the hardware busy; subject to resource availability (registers, shared memory)

# of blocks > 100 to scale to future devices: blocks are executed in pipeline fashion; 1000 blocks per grid will scale across multiple generations

Page 97: [Harvard CS264] 05 - Advanced-level CUDA Programming

62© NVIDIA Corporation 2008

Register Dependency

Read-after-write register dependency: an instruction's result can be read ~24 cycles later

Scenarios (CUDA -> PTX):

x = y + 5;        ->   add.f32 $f3, $f1, $f2
z = x + 3;             add.f32 $f5, $f3, $f4

s_data[0] += 3;   ->   ld.shared.f32 $f3, [$r31+0]
                       add.f32 $f3, $f3, $f4

To completely hide the latency: run at least 192 threads (6 warps) per multiprocessor; at least 25% occupancy (1.0/1.1), 18.75% (1.2/1.3); threads do not have to belong to the same thread block

Page 98: [Harvard CS264] 05 - Advanced-level CUDA Programming

63© NVIDIA Corporation 2008

Register Pressure

Hide latency by using more threads per SM

Limiting factors:

Number of registers per kernel: 8K/16K per SM, partitioned among concurrent threads

Amount of shared memory: 16KB per SM, partitioned among concurrent thread blocks

Compile with the -ptxas-options=-v flag

Use the -maxrregcount=N flag to NVCC (N = desired maximum registers per kernel)

At some point “spilling” into local memory may occur

Reduces performance – local memory is slow
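For example (transpose.cu is just a placeholder file name), the two flags from this slide would be passed to nvcc as:

nvcc --ptxas-options=-v --maxrregcount=32 -c transpose.cu

The -v output from ptxas reports the registers, shared memory, and local memory used per kernel, which is how you can see whether spilling occurred.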

Page 99: [Harvard CS264] 05 - Advanced-level CUDA Programming

64© NVIDIA Corporation 2008

Occupancy Calculator

Page 100: [Harvard CS264] 05 - Advanced-level CUDA Programming

65© NVIDIA Corporation 2008

Optimizing threads per block

Choose threads per block as a multiple of warp size: avoid wasting computation on under-populated warps

Want to run as many warps as possible per multiprocessor (hide latency)

A multiprocessor can run up to 8 blocks at a time

Heuristics:

Minimum: 64 threads per block, and only if there are multiple concurrent blocks

192 or 256 threads is a better choice, usually still enough registers to compile and invoke successfully

This all depends on your computation, so experiment!

Page 101: [Harvard CS264] 05 - Advanced-level CUDA Programming

66© NVIDIA Corporation 2008

Occupancy != Performance

Increasing occupancy does not necessarily increase performance

BUT ...

Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels

(It all comes down to arithmetic intensity and available parallelism)

Page 102: [Harvard CS264] 05 - Advanced-level CUDA Programming

Better Performance at Lower Occupancy

Vasily Volkov

UC Berkeley

September 22, 2010

GTC’10

Page 103: [Harvard CS264] 05 - Advanced-level CUDA Programming

66© NVIDIA Corporation 2008

Occupancy != Performance

Increasing occupancy does not necessarily increase performance

BUT ...

Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels

(It all comes down to arithmetic intensity and available parallelism)

Page 104: [Harvard CS264] 05 - Advanced-level CUDA Programming

3.2 Instruction Optimizations

Page 105: [Harvard CS264] 05 - Advanced-level CUDA Programming

69© NVIDIA Corporation 2008

CUDA Instruction Performance

Instruction cycles (per warp) = sum of: operand read cycles, instruction execution cycles, result update cycles

Therefore instruction throughput depends on: nominal instruction throughput, memory latency, memory bandwidth

“Cycle” refers to the multiprocessor clock rate (1.3 GHz on the Tesla C1060, for example)

Page 106: [Harvard CS264] 05 - Advanced-level CUDA Programming

70© NVIDIA Corporation 2008

Maximizing Instruction Throughput

Maximize use of high-bandwidth memory: maximize use of shared memory, minimize accesses to global memory, maximize coalescing of global memory accesses

Optimize performance by overlapping memory accesses with HW computation: high arithmetic intensity programs (i.e. a high ratio of math to memory transactions), many concurrent threads

Page 107: [Harvard CS264] 05 - Advanced-level CUDA Programming

71© NVIDIA Corporation 2008

Arithmetic Instruction Throughput

int and float add, shift, min, max and float mul, mad: 4 cycles per warp

int multiply (*) is by default 32-bit and requires multiple cycles per warp

Use the __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply

Integer divide and modulo are more expensive: the compiler will convert literal power-of-2 divides to shifts, but we have seen it miss some cases

Be explicit in cases where the compiler can't tell that the divisor is a power of 2!

Useful trick: foo % n == foo & (n-1) if n is a power of 2
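A hedged one-kernel illustration of both tricks (indexTricks is a made-up name; out is assumed to hold gridDim.x * blockDim.x elements and n must be a power of 2):

__global__ void indexTricks(int *out, const int *in, int n)
{
  int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;  // 4-cycle 24-bit integer multiply
  int wrapped = i & (n - 1);                              // equals i % n when n is a power of 2
  out[i] = in[wrapped];
}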

Page 108: [Harvard CS264] 05 - Advanced-level CUDA Programming

72© NVIDIA Corporation 2008

Runtime Math Library

There are two types of runtime math operations in single-precision:

__funcf(): direct mapping to the hardware ISA; fast but lower accuracy (see the programming guide for details); examples: __sinf(x), __expf(x), __powf(x,y)

funcf(): compiles to multiple instructions; slower but higher accuracy (5 ulp or less); examples: sinf(x), expf(x), powf(x,y)

The -use_fast_math compiler option forces every funcf() to compile to __funcf()
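A hedged side-by-side example (sines is an illustrative name):

__global__ void sines(float *fast, float *accurate, const float *x, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    fast[i]     = __sinf(x[i]);   // hardware intrinsic: fast, lower accuracy
    accurate[i] = sinf(x[i]);     // multiple instructions: slower, higher accuracy
  }
}

Compiling with -use_fast_math would make the second line behave like the first.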

Page 109: [Harvard CS264] 05 - Advanced-level CUDA Programming

73© NVIDIA Corporation 2008

GPU results may not match CPU

Many variables: hardware, compiler, optimization settings

CPU operations aren't strictly limited to 0.5 ulp: sequences of operations can be more accurate due to 80-bit extended-precision ALUs

Floating-point arithmetic is not associative!

Page 110: [Harvard CS264] 05 - Advanced-level CUDA Programming

74© NVIDIA Corporation 2008

FP Math is Not Associative!

In symbolic math, (x+y)+z == x+(y+z)

This is not necessarily true for floating-point addition: try x = 10^30, y = -10^30 and z = 1 in the above equation

When you parallelize computations, you potentially change the order of operations

Parallel results may not exactly match sequential results

This is not specific to GPU or CUDA - it is an inherent part of parallel execution
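The example above can be checked directly with a few lines of host code:

#include <stdio.h>

int main(void)
{
  float x = 1e30f, y = -1e30f, z = 1.0f;
  printf("(x+y)+z = %f\n", (x + y) + z);   // 1.000000
  printf("x+(y+z) = %f\n", x + (y + z));   // 0.000000: y+z rounds back to -1e30
  return 0;
}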

Page 111: [Harvard CS264] 05 - Advanced-level CUDA Programming

75© NVIDIA Corporation 2008

Control Flow Instructions

Main performance concern with branching is divergence: threads within a single warp take different paths, and the different execution paths must be serialized

Avoid divergence when the branch condition is a function of thread ID

Example with divergence:

if (threadIdx.x > 2) { }

Branch granularity < warp size

Example without divergence:

if (threadIdx.x / WARP_SIZE > 2) { }

Branch granularity is a whole multiple of warp size

Page 112: [Harvard CS264] 05 - Advanced-level CUDA Programming

Scared ?

Page 113: [Harvard CS264] 05 - Advanced-level CUDA Programming

Howwwwww?! (do I start)

Scared ?

Page 114: [Harvard CS264] 05 - Advanced-level CUDA Programming

Profiler

Page 115: [Harvard CS264] 05 - Advanced-level CUDA Programming

Analysis with Profiler

Profiler counters:

instructions_issued, instructions_executed: both incremented by 1 per warp; “issued” includes replays, “executed” does not

gld_request, gst_request: incremented by 1 per warp for each load/store instruction; an instruction may be counted if it is “predicated out”

l1_global_load_miss, l1_global_load_hit, global_store_transaction: incremented by 1 per L1 line (a line is 128B)

uncached_global_load_transaction: incremented by 1 per group of 1, 2, 3, or 4 transactions

Useful derived quantities:

32 * instructions_issued (32 = warp size)

128B * (global_store_transaction + l1_global_load_miss)

© NVIDIA 2010

Page 116: [Harvard CS264] 05 - Advanced-level CUDA Programming

© NVIDIA Corporation 2010

CUDA Visual Profiler data for memory transfers

Memory transfer type and direction (D=Device, H=Host, A=cuArray)

e.g. H to D: Host to Device

Synchronous / Asynchronous

Memory transfer size, in bytes

Stream ID

Page 117: [Harvard CS264] 05 - Advanced-level CUDA Programming

© NVIDIA Corporation 2010

CUDA Visual Profiler data for kernels

Page 118: [Harvard CS264] 05 - Advanced-level CUDA Programming

© NVIDIA Corporation 2010

CUDA Visual Profiler computed data for kernels

Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate

Global memory read throughput (Gigabytes/second)

Global memory write throughput (Gigabytes/second)

Overall global memory access throughput (Gigabytes/second)

Global memory load efficiency

Global memory store efficiency

Page 119: [Harvard CS264] 05 - Advanced-level CUDA Programming

© NVIDIA Corporation 2010

CUDA Visual Profiler data analysis views

Views: summary table, kernel table, memcopy table, summary plot, GPU time height plot, GPU time width plot, profiler counter plot, profiler table column plot, multi-device plot, multi-stream plot

Analyze profiler counters

Analyze kernel occupancy

Page 120: [Harvard CS264] 05 - Advanced-level CUDA Programming

© NVIDIA Corporation 2010

CUDA Visual Profiler misc.: multiple sessions

Compare views for different sessions

Comparison Summary plot

Profiler projects save & load

Import/Export profiler data (.CSV format)

Page 121: [Harvard CS264] 05 - Advanced-level CUDA Programming

meh!!!! I don’t like to profile

Scared ?

Page 122: [Harvard CS264] 05 - Advanced-level CUDA Programming

Modified source code

Page 123: [Harvard CS264] 05 - Advanced-level CUDA Programming

Analysis with Modified Source Code

Time memory-only and math-only versions of the kernel

Easier for codes that don't have data-dependent control-flow or addressing

Gives you good estimates for: time spent accessing memory, time spent executing instructions

Comparing the times for modified kernels:

Helps decide whether the kernel is memory or math bound

Shows how well memory operations are overlapped with arithmetic: compare the sum of the mem-only and math-only times to the full-kernel time

© NVIDIA 2010
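A hedged sketch of such source modifications (the kernel names, the trivial body, and the flag trick are illustrative; launch the math-only version with flag = 0 so the compiler cannot remove the arithmetic):

__global__ void kernelFull(float *out, const float *in, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = sqrtf(in[i]) * 2.0f + 1.0f;   // memory accesses + math
}

__global__ void kernelMemOnly(float *out, const float *in, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];                        // same loads/stores, math removed
}

__global__ void kernelMathOnly(float *out, float v, int n, int flag)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  float x = sqrtf(v) * 2.0f + 1.0f;                 // same math, no global memory traffic
  if (flag && i < n) out[i] = x;                    // flag is 0 at runtime: store never executes
}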

Page 124: [Harvard CS264] 05 - Advanced-level CUDA Programming

I want to believe...

Scared ?

Page 125: [Harvard CS264] 05 - Advanced-level CUDA Programming

Some Example Scenarios

(Timeline diagram comparing mem-only, math-only, and full-kernel times for four cases)

Memory and latency bound: poor mem-math overlap, latency is a problem

Math-bound: good mem-math overlap, latency not a problem (assuming instruction throughput is not low compared to HW theory)

Memory-bound: good mem-math overlap, latency not a problem (assuming memory throughput is not low compared to HW theory)

Balanced: good mem-math overlap, latency not a problem (assuming memory/instruction throughput is not low compared to HW theory)

Memory bound ? Math bound ? Latency bound ?

© NVIDIA 2010

Page 126: [Harvard CS264] 05 - Advanced-level CUDA Programming


Page 127: [Harvard CS264] 05 - Advanced-level CUDA Programming


Page 128: [Harvard CS264] 05 - Advanced-level CUDA Programming

Argh&%#$... too many optimizations!!!

Page 129: [Harvard CS264] 05 - Advanced-level CUDA Programming

67© NVIDIA Corporation 2008

Parameterize Your Application

Parameterization helps adaptation to different GPUs

GPUs vary in many ways: # of multiprocessors, memory bandwidth, shared memory size, register file size, max. threads per block

You can even make apps self-tuning (like FFTW and ATLAS): an “experiment” mode discovers and saves the optimal configuration

Page 130: [Harvard CS264] 05 - Advanced-level CUDA Programming

More?

• Next week:

GPU “Scripting”, Meta-programming, Auto-tuning

• Thu 3/31/11: PyOpenCL (A. Knockler, NYU), ahh (C. Omar, CMU)

• Tue 3/29/11: Algorithm Strategies (W. Hwu, UIUC)

• Tue 4/5/11: Analysis-driven Optimization (C. Wooley, NVIDIA)

• Thu 4/7/11: Irregular Parallelism & Efficient Data Structures (J. Owens, UC Davis)

• Thu 4/14/11: Optimization for Ninjas (D. Merill, UVirg)

• ...

Page 131: [Harvard CS264] 05 - Advanced-level CUDA Programming

iPhD: one more thing or two...

Page 132: [Harvard CS264] 05 - Advanced-level CUDA Programming

Life/Code Hacking #2.x: Speed {listen,read,writ}ing

accelerated e-learning (c) / massively parallel {learn,programm}ing (c)

Page 133: [Harvard CS264] 05 - Advanced-level CUDA Programming

Life/Code Hacking #2.2: Speed writing

accelerated e-learning (c) / massively parallel {learn,programm}ing (c)

Page 134: [Harvard CS264] 05 - Advanced-level CUDA Programming

Life/Code Hacking #2.2: Speed writing

accelerated e-learning (c) / massively parallel {learn,programm}ing (c)

http://steve-yegge.blogspot.com/2008/09/programmings-dirtiest-little-secret.html

Page 135: [Harvard CS264] 05 - Advanced-level CUDA Programming

Life/Code Hacking #2.2: Speed writing

accelerated e-learning (c) / massively parallel {learn,programm}ing (c)

Typing tutors: gtypist, ktouch, typingweb.com, etc.

Page 136: [Harvard CS264] 05 - Advanced-level CUDA Programming

Life/Code Hacking #2.2: Speed writing

accelerated e-learning (c) / massively parallel {learn,programm}ing (c)

Kinesis Advantage (QWERTY/DVORAK)

Page 137: [Harvard CS264] 05 - Advanced-level CUDA Programming

Demo

Page 138: [Harvard CS264] 05 - Advanced-level CUDA Programming

COME