
Page 1

DUANE MERRILL, PH.D.

NVIDIA RESEARCH

CUB: A pattern of “collective” software design, abstraction, and reuse

for kernel-level programming

Page 2

What is CUB?

1. A design model for collective kernel-level primitives

How to make reusable software components for SIMT groups (warps, blocks, etc.)

2. A library of collective primitives

Block-reduce, block-sort, block-histogram, warp-scan, warp-reduce, etc.

3. A library of global primitives (built from collectives)

Device-reduce, device-sort, device-scan, etc.

Demonstrate collective composition, performance, performance-portability
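To make the "library of collective primitives" above concrete, here is a minimal usage sketch (not taken from the slides) of one warp-level primitive, cub::WarpReduce. The kernel and pointer names are hypothetical, and it assumes a launch with a single 32-thread block.

#include <cub/cub.cuh>

// Every thread in the warp contributes one value; the collective returns the
// warp-wide sum to lane 0.
__global__ void WarpSumKernel(int* d_in, int* d_out)
{
    typedef cub::WarpReduce<int> WarpReduceT;

    // Shared storage the collective needs, reflected from the specialized type
    __shared__ typename WarpReduceT::TempStorage temp_storage;

    int thread_data = d_in[threadIdx.x];

    // Collective call: all 32 threads participate; lane 0 receives the sum
    int warp_sum = WarpReduceT(temp_storage).Sum(thread_data);

    if (threadIdx.x == 0) d_out[0] = warp_sum;
}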

Page 3

Outline

1. Software reuse

2. SIMT collectives: the “missing” CUDA abstraction layer

3. The soul of collective component design

4. Using CUB’s collective primitives

5. Making your own collective primitives

6. Other Very Useful Things in CUB

7. Final thoughts

Page 4

Software reuse: abstraction & composability are fundamental design principles

Reduce redundant programmer effort

Save time, energy, money

Reduce buggy software

Encapsulate complexity

Empower productivity-oriented programmers

Insulation from underlying hardware

– five NVIDIA GPU architectures between 2008 and 2014

Software reuse empowers a durable programming model

Page 5

Software reuse: abstraction & composability are fundamental design principles

Reduce redundant programmer effort

Save time, energy, money

Reduce buggy software

Encapsulate complexity

Empower productivity-oriented programmers

Insulation from changing capabilities of the underlying hardware

– NVIDIA has produced nine different CUDA GPU architectures since 2008!

Software reuse empowers a durable programming model

Page 6

Outline

1. Software reuse

2. SIMT collectives: the “missing” CUDA abstraction layer

3. The soul of collective component design

4. Using CUB’s collective primitives

5. Making your own collective primitives

6. Other Very Useful Things in CUB

7. Final thoughts

Page 7

Parallel programming is hard…

Page 8

No, cooperative parallel programming is hard…

Parallel decomposition and grain sizing

Synchronization

Deadlock, livelock, and data races

Plurality of state

Plurality of flow control (divergence, etc.)

Bookkeeping control structures

Memory access conflicts, coalescing, etc.

Occupancy constraints from SMEM, RF, etc.

Algorithm selection and instruction scheduling

Special hardware functionality, instructions, etc.

Page 9

No, cooperative parallel programming is hard…

Parallel decomposition and grain sizing

Synchronization

Deadlock, livelock, and data races

Plurality of state

Plurality of flow control (divergence, etc.)

Bookkeeping control structures

Memory access conflicts, coalescing, etc.

Occupancy constraints from SMEM, RF, etc.

Algorithm selection and instruction scheduling

Special hardware functionality, instructions, etc.
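A minimal sketch (not from the slides) of the simplest of these hazards: reusing shared memory among cooperating threads requires barriers, or later writes race with earlier reads. The kernel name is hypothetical and it assumes a 128-thread block.

__global__ void NeighborShiftKernel(int* d_data)
{
    __shared__ int smem[128];
    int tid = threadIdx.x;

    smem[tid] = d_data[tid];
    __syncthreads();                           // all writes visible before any reads

    int left = (tid > 0) ? smem[tid - 1] : 0;  // read a neighbor's element
    __syncthreads();                           // barrier before the storage is repurposed

    smem[tid] = left;                          // safe reuse of the same shared storage
    __syncthreads();

    d_data[tid] = smem[tid];
}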

Page 10

CUDA today

[Diagram: application → CUDA function stub → threadblocks]

Page 11

Software abstraction in CUDA

PROBLEM: virtually every CUDA kernel written today is cobbled together from scratch, which is a tunability, portability, and maintenance concern

[Diagram: application → CUDA function stub → kernel threadblock]

Page 12

Software abstraction in CUDA

Collective software components reduce development cost, hide complexity and bugs, etc.

[Diagram: application → kernel function stub → threadblocks composed of collective components (BlockLoad, BlockSort, BlockStore), each exposing a collective interface built on per-thread (scalar) functions]

Page 13

What do these applications have in common?

[Figure: parallel sparse graph traversal, parallel radix sort, parallel BWT compression, and parallel SpMV]

Page 14

What do these applications have in common? Block-wide prefix-scan

Scan for enqueueing, scan for segmented reduction, scan for solving recurrences (move-to-front), scan for partitioning

[Figure: the same four applications (parallel sparse graph traversal, parallel radix sort, parallel BWT compression, parallel SpMV), each annotated with its use of block-wide prefix scan]
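A hedged worked example (not from the slides) of the partitioning/enqueueing case: an exclusive prefix sum over selection flags yields each selected item's scatter offset plus the total count. Plain host code with made-up data:

#include <cstdio>

int main()
{
    int flags[8] = {1, 0, 1, 1, 0, 1, 0, 0};  // which items are selected
    int offsets[8];
    int running = 0;

    for (int i = 0; i < 8; ++i)               // exclusive prefix sum
    {
        offsets[i] = running;                 // offset BEFORE adding flags[i]
        running += flags[i];
    }

    // offsets = {0, 1, 1, 2, 3, 3, 4, 4}; selected item i is scattered to
    // offsets[i], and 'running' (= 4) is the size of the selected partition.
    for (int i = 0; i < 8; ++i) printf("%d ", offsets[i]);
    printf("\ntotal selected = %d\n", running);
    return 0;
}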

Page 15

Examples of parallel scan data flow: 16 threads contributing 4 items each

[Figure: two scan data-flow networks over the 64 items: a Brent-Kung hybrid (work-efficient, ~130 binary ops, depth 15) and a Kogge-Stone hybrid (depth-efficient, ~170 binary ops, depth 12)]
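For comparison with the data-flow strategies above, a hedged sketch (not from the slides) of a depth-efficient Kogge-Stone inclusive scan across a single warp using register shuffles. It uses __shfl_up_sync, which postdates this 2015 talk (CUDA 9+); the older __shfl_up intrinsic had the same shape.

__device__ int InclusiveWarpScan(int value)
{
    int lane = threadIdx.x & 31;                           // lane index within the warp
    for (int offset = 1; offset < 32; offset *= 2)
    {
        int n = __shfl_up_sync(0xffffffff, value, offset); // value from 'offset' lanes below
        if (lane >= offset) value += n;
    }
    return value;                                          // lane i now holds x0 + ... + xi
}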

Page 16

CUDA today: kernel programming is complicated

[Diagram: application → CUDA function stub → threadblocks]

Page 17

Software abstraction in CUDA

Collective software components reduce development cost, hide complexity and bugs, etc.

[Diagram: application → kernel function stub → threadblocks composed of collective components (BlockLoad, BlockSort, BlockStore), each exposing a collective interface built on per-thread (scalar) functions]

Page 18

Outline

1. Software reuse

2. SIMT collectives: the “missing” CUDA abstraction layer

3. The soul of collective component design

4. Using CUB’s collective primitives

5. Making your own collective primitives

6. Other Very Useful Things in CUB

7. Final thoughts

Page 19

Collective composition: CUB primitives are easily nested & sequenced

[Diagram: application → CUDA stub → threadblocks, each running a BlockSort collective]

Page 20

Collective composition: CUB primitives are easily nested & sequenced

[Diagram: as above; each threadblock's BlockSort is composed from BlockRadixRank and BlockExchange]

Page 21

Collective composition: CUB primitives are easily nested & sequenced

[Diagram: as above; BlockRadixRank is itself composed from BlockScan]

Page 22

Collective composition: CUB primitives are easily nested & sequenced

[Diagram: as above; BlockScan is itself composed from WarpScan]

Page 23

Tunable composition: flexible grain size ("shape" remains the same)

[Diagram: the same nested BlockSort composition (BlockRadixRank, BlockScan, WarpScan, BlockExchange) instantiated at a different parallel width per threadblock]

Page 24

Tunable composition: flexible grain size ("shape" remains the same)

[Diagram: the same nested BlockSort composition (BlockRadixRank, BlockScan, WarpScan, BlockExchange) instantiated at a different parallel width per threadblock]

Page 25

Tunable composition: algorithmic-variant selection

[Diagram: the same nested BlockSort composition (BlockRadixRank, BlockScan, WarpScan, BlockExchange), with alternative algorithmic variants selectable for each sub-primitive]

Page 26

Tunable composition: algorithmic-variant selection

[Diagram: the same nested BlockSort composition (BlockRadixRank, BlockScan, WarpScan, BlockExchange), with alternative algorithmic variants selectable for each sub-primitive]

Page 27

Tunable composition: algorithmic-variant selection

[Diagram: the same nested BlockSort composition (BlockRadixRank, BlockScan, WarpScan, BlockExchange), with alternative algorithmic variants selectable for each sub-primitive]

Page 28

Tunable composition: algorithmic-variant selection

[Diagram: the same nested BlockSort composition (BlockRadixRank, BlockScan, WarpScan, BlockExchange), with alternative algorithmic variants selectable for each sub-primitive]

Page 29

CUB: device-wide performance-portability vs. Thrust and NPP across the last 4 major NVIDIA arch families (Tesla, Fermi, Kepler, Maxwell)

Global radix sort (billions of 32b keys/sec), CUB vs. Thrust v1.7.1:
Tesla C1060: 0.50 vs. 0.51
Tesla C2050: 1.05 vs. 0.71
Tesla K20C: 1.40 vs. 0.66

Global prefix scan (billions of 32b items/sec), CUB vs. Thrust v1.7.1:
Tesla C1060: 8 vs. 4
Tesla C2050: 14 vs. 6
Tesla K20C: 21 vs. 6

Global histogram (billions of 8b items/sec), CUB vs. NPP:
Tesla C1060: 2.7 vs. 0
Tesla C2050: 16.2 vs. 2
Tesla K20C: 19.3 vs. 2

Global partition-if (billions of 32b inputs/sec), CUB vs. Thrust v1.7.1:
Tesla C1060: 4.2 vs. 1.7
Tesla C2050: 8.6 vs. 2.2
Tesla K20C: 16.4 vs. 2.4

Page 30

Outline

1. Software reuse

2. SIMT collectives: the “missing” CUDA abstraction layer

3. The soul of collective component design

4. Using CUB’s collective primitives

5. Making your own collective primitives

6. Other Very Useful Things in CUB

7. Final thoughts

Page 31

CUB collective usage

__global__ void ExampleKernel(...)

{

}

Page 32

CUB collective usage

1. Collective specialization

__global__ void ExampleKernel(...)

{

// Specialize cub::BlockScan for 128 threads

typedef cub::BlockScan<int, 128> BlockScanT;

}

1

Page 33

CUB collective usage: 3 parameter fields (specialization, construction, function call) + resource reflection

1. Collective specialization

2. Reflected shared resource type

__global__ void ExampleKernel(...)

{

// Specialize cub::BlockScan for 128 threads

typedef cub::BlockScan<int, 128> BlockScanT;

// Allocate temporary storage in shared memory

__shared__ typename BlockScanT::TempStorage scan_storage;

}

1

2

Page 34

CUB collective usage

1. Collective specialization

2. Reflected shared resource type

__global__ void ExampleKernel(...)

{

// Specialize cub::BlockScan for 128 threads

typedef cub::BlockScan<int, 128> BlockScanT;

// Allocate temporary storage in shared memory

__shared__ typename BlockScanT::TempStorage scan_storage;

// Obtain a tile of 512 items blocked across 128 threads

int items[4];

...

}

1

2

Page 35

CUB collective usage

1. Collective specialization

2. Reflected shared resource type

3. Collective construction

4. Collective function call

__global__ void ExampleKernel(...)

{

// Specialize cub::BlockScan for 128 threads

typedef cub::BlockScan<int, 128> BlockScanT;

// Allocate temporary storage in shared memory

__shared__ typename BlockScanT::TempStorage scan_storage;

// Obtain a tile of 512 items blocked across 128 threads

int items[4];

...

// Compute block-wide prefix sum

BlockScanT(scan_storage).ExclusiveSum(items, items);

...

}

1

3 4

2

Page 36

CUB collective usage: 3 parameter fields (specialization, construction, function call) + resource reflection

1. Collective specialization

2. Reflected shared resource type

3. Collective construction

4. Collective function call

__global__ void ExampleKernel(...)

{

// Specialize cub::BlockScan for 128 threads

typedef cub::BlockScan<int, 128> BlockScanT;

// Allocate temporary storage in shared memory

__shared__ typename BlockScanT::TempStorage scan_storage;

// Obtain a tile of 512 items blocked across 128 threads

int items[4];

...

// Compute block-wide prefix sum

BlockScanT(scan_storage).ExclusiveSum(items, items);

...

}

1

3 4

2

Page 37

Sequencing CUB primitives

// A kernel for computing tiled prefix sums

__global__ void ExampleKernel(int* d_in, int* d_out)

{

}

Page 38

Sequencing CUB primitives

// A kernel for computing tiled prefix sums

__global__ void ExampleKernel(int* d_in, int* d_out)

{

// Specialize for 128 threads owning 4 integers each

typedef cub::BlockLoad<int*, 128, 4> BlockLoadT;

typedef cub::BlockScan<int, 128> BlockScanT;

typedef cub::BlockStore<int*, 128, 4> BlockStoreT;

}

1. Specialize the collective

primitive types

Page 39

Sequencing CUB primitives

// A kernel for computing tiled prefix sums

__global__ void ExampleKernel(int* d_in, int* d_out)

{

// Specialize for 128 threads owning 4 integers each

typedef cub::BlockLoad<int*, 128, 4> BlockLoadT;

typedef cub::BlockScan<int, 128> BlockScanT;

typedef cub::BlockStore<int*, 128, 4> BlockStoreT;

// Allocate temporary storage in shared memory

__shared__ union {

typename BlockLoadT::TempStorage load;

typename BlockScanT::TempStorage scan;

typename BlockStoreT::TempStorage store;

} temp_storage;

}

1. Specialize the collective

primitive types

2. Allocate shared memory with a

union of TempStorage structured-

layout types

Page 40

Sequencing CUB primitives

// A kernel for computing tiled prefix sums

__global__ void ExampleKernel(int* d_in, int* d_out)

{

// Specialize for 128 threads owning 4 integers each

typedef cub::BlockLoad<int*, 128, 4> BlockLoadT;

typedef cub::BlockScan<int, 128> BlockScanT;

typedef cub::BlockStore<int*, 128, 4> BlockStoreT;

// Allocate temporary storage in shared memory

__shared__ union {

typename BlockLoadT::TempStorage load;

typename BlockScanT::TempStorage scan;

typename BlockStoreT::TempStorage store;

} temp_storage;

// Cooperatively load a tile of 512 items across 128 threads

int items[4];

BlockLoadT(temp_storage.load).Load(d_in, items);

}

3. Block-wide load

1. Specialize the collective

primitive types

2. Allocate shared memory with a

union of TempStorage structured-

layout types

Page 41

Sequencing CUB primitives

// A kernel for computing tiled prefix sums

__global__ void ExampleKernel(int* d_in, int* d_out)

{

// Specialize for 128 threads owning 4 integers each

typedef cub::BlockLoad<int*, 128, 4> BlockLoadT;

typedef cub::BlockScan<int, 128> BlockScanT;

typedef cub::BlockStore<int*, 128, 4> BlockStoreT;

// Allocate temporary storage in shared memory

__shared__ union {

typename BlockLoadT::TempStorage load;

typename BlockScanT::TempStorage scan;

typename BlockStoreT::TempStorage store;

} temp_storage;

// Cooperatively load a tile of 512 items across 128 threads

int items[4];

BlockLoadT(temp_storage.load).Load(d_in, items);

__syncthreads(); // Barrier for smem reuse

}

3. Block-wide load,

4. barrier

1. Specialize the collective

primitive types

2. Allocate shared memory with a

union of TempStorage structured-

layout types

Page 42

Sequencing CUB primitives

// A kernel for computing tiled prefix sums

__global__ void ExampleKernel(int* d_in, int* d_out)

{

// Specialize for 128 threads owning 4 integers each

typedef cub::BlockLoad<int*, 128, 4> BlockLoadT;

typedef cub::BlockScan<int, 128> BlockScanT;

typedef cub::BlockStore<int*, 128, 4> BlockStoreT;

// Allocate temporary storage in shared memory

__shared__ union {

typename BlockLoadT::TempStorage load;

typename BlockScanT::TempStorage scan;

typename BlockStoreT::TempStorage store;

} temp_storage;

// Cooperatively load a tile of 512 items across 128 threads

int items[4];

BlockLoadT(temp_storage.load).Load(d_in, items);

__syncthreads(); // Barrier for smem reuse

// Compute a block-wide exclusive prefix sum

BlockScanT(temp_storage.scan).ExclusiveSum(items, items);

}

3. Block-wide load,

4. barrier,

5. block-wide scan

1. Specialize the collective

primitive types

2. Allocate shared memory with a

union of TempStorage structured-

layout types

Page 43

Sequencing CUB primitives

// A kernel for computing tiled prefix sums

__global__ void ExampleKernel(int* d_in, int* d_out)

{

// Specialize for 128 threads owning 4 integers each

typedef cub::BlockLoad<int*, 128, 4> BlockLoadT;

typedef cub::BlockScan<int, 128> BlockScanT;

typedef cub::BlockStore<int*, 128, 4> BlockStoreT;

// Allocate temporary storage in shared memory

__shared__ union {

typename BlockLoadT::TempStorage load;

typename BlockScanT::TempStorage scan;

typename BlockStoreT::TempStorage store;

} temp_storage;

// Cooperatively load a tile of 512 items across 128 threads

int items[4];

BlockLoadT(temp_storage.load).Load(d_in, items);

__syncthreads(); // Barrier for smem reuse

// Compute a block-wide exclusive prefix sum

BlockScanT(temp_storage.scan).ExclusiveSum(items, items);

__syncthreads(); // Barrier for smem reuse

}

3. Block-wide load,

4. barrier,

5. block-wide scan,

6. barrier

1. Specialize the collective

primitive types

2. Allocate shared memory with a

union of TempStorage structured-

layout types

Page 44

Sequencing CUB primitives

// A kernel for computing tiled prefix sums

__global__ void ExampleKernel(int* d_in, int* d_out)

{

// Specialize for 128 threads owning 4 integers each

typedef cub::BlockLoad<int*, 128, 4> BlockLoadT;

typedef cub::BlockScan<int, 128> BlockScanT;

typedef cub::BlockStore<int*, 128, 4> BlockStoreT;

// Allocate temporary storage in shared memory

__shared__ union {

typename BlockLoadT::TempStorage load;

typename BlockScanT::TempStorage scan;

typename BlockStoreT::TempStorage store;

} temp_storage;

// Cooperatively load a tile of 512 items across 128 threads

int items[4];

BlockLoadT(temp_storage.load).Load(d_in, items);

__syncthreads(); // Barrier for smem reuse

// Compute a block-wide exclusive prefix sum

BlockScanT(temp_storage.scan).ExclusiveSum(items, items);

__syncthreads(); // Barrier for smem reuse

// Cooperatively store a tile of 512 items across 128 threads

BlockStoreT(temp_storage.store).Store(d_out, items);

}

3. Block-wide load,

4. barrier,

5. block-wide scan

6. barrier,

7. block-wide store

1. Specialize the collective

primitive types

2. Allocate shared memory with a

union of TempStorage structured-

layout types

Page 45

Tuning with CUB primitives

int* d_in; // = ...

int* d_out; // = ...

// Invoke kernel (GF110 Fermi)

ExampleKernel <<<1, 128>>>(

d_in,

d_out);

// A kernel for computing tiled prefix sums

__global__ void ExampleKernel(int* d_in, int* d_out)

{

// Specialize for 128 threads owning 4 integers each

typedef cub::BlockLoad<int*, 128, 4> BlockLoadT;

typedef cub::BlockScan<int, 128> BlockScanT;

typedef cub::BlockStore<int*, 128, 4> BlockStoreT;

// Allocate temporary storage in shared memory

__shared__ union {

typename BlockLoadT::TempStorage load;

typename BlockScanT::TempStorage scan;

typename BlockStoreT::TempStorage store;

} temp_storage;

// Cooperatively load a tile of 512 items across 128 threads

int items[4];

BlockLoadT(temp_storage.load).Load(d_in, items);

__syncthreads(); // Barrier for smem reuse

// Compute a block-wide exclusive prefix sum

BlockScanT(temp_storage.scan).ExclusiveSum(items, items);

__syncthreads(); // Barrier for smem reuse

// Cooperatively store a tile of 512 items across 128 threads

BlockStoreT(temp_storage.store).Store(d_out, items);

}

Page 46

Tuning with CUB primitives

int* d_in; // = ...

int* d_out; // = ...

// Invoke kernel (GF110 Fermi)

ExampleKernel <<<1, 128>>>(

d_in,

d_out);

template <typename T>

__global__ void ExampleKernel(T* d_in, T* d_out)

{

// Specialize for 128 threads owning 4 Ts each

typedef cub::BlockLoad<T*, 128, 4> BlockLoadT;

typedef cub::BlockScan<T, 128> BlockScanT;

typedef cub::BlockStore<T*, 128, 4> BlockStoreT;

// Allocate temporary storage in shared memory

__shared__ union {

typename BlockLoadT::TempStorage load;

typename BlockScanT::TempStorage scan;

typename BlockStoreT::TempStorage store;

} temp_storage;

// Cooperatively load a tile of 512 items across 128 threads

T items[4];

BlockLoadT(temp_storage.load).Load(d_in, items);

__syncthreads(); // Barrier for smem reuse

// Compute a block-wide exclusive prefix sum

BlockScanT(temp_storage.scan).ExclusiveSum(items, items);

__syncthreads(); // Barrier for smem reuse

// Cooperatively store a tile of 512 items across 128 threads

BlockStoreT(temp_storage.store).Store(d_out, items);

}

Page 47

Tuning with CUB primitives

int* d_in; // = ...

int* d_out; // = ...

// Invoke kernel (GF110 Fermi)

ExampleKernel <128> <<<1, 128>>>(

d_in,

d_out);

template <int BLOCK_THREADS, typename T>

__global__ void ExampleKernel(T* d_in, T* d_out)

{

// Specialize for BLOCK_THREADS threads owning 4 integers each

typedef cub::BlockLoad<T*, BLOCK_THREADS, 4> BlockLoadT;

typedef cub::BlockScan<T, BLOCK_THREADS> BlockScanT;

typedef cub::BlockStore<T*, BLOCK_THREADS, 4> BlockStoreT;

// Allocate temporary storage in shared memory

__shared__ union {

typename BlockLoadT::TempStorage load;

typename BlockScanT::TempStorage scan;

typename BlockStoreT::TempStorage store;

} temp_storage;

// Cooperatively load a tile of items

T items[4];

BlockLoadT(temp_storage.load).Load(d_in, items);

__syncthreads(); // Barrier for smem reuse

// Compute a block-wide exclusive prefix sum

BlockScanT(temp_storage.scan).ExclusiveSum(items, items);

__syncthreads(); // Barrier for smem reuse

// Cooperatively store a tile of items

BlockStoreT(temp_storage.store).Store(d_out, items);

}

Page 48

Tuning with CUB primitives

int* d_in; // = ...

int* d_out; // = ...

// Invoke kernel (GF110 Fermi)

ExampleKernel <128, 4> <<<1, 128>>>(

d_in,

d_out);

template <int BLOCK_THREADS, int ITEMS_PER_THREAD, typename T>

__global__ void ExampleKernel(T* d_in, T* d_out)

{

// Specialize for BLOCK_THREADS threads owning ITEMS_PER_THREAD integers each

typedef cub::BlockLoad<T*, BLOCK_THREADS, ITEMS_PER_THREAD> BlockLoadT;

typedef cub::BlockScan<T, BLOCK_THREADS> BlockScanT;

typedef cub::BlockStore<T*, BLOCK_THREADS, ITEMS_PER_THREAD> BlockStoreT;

// Allocate temporary storage in shared memory

__shared__ union {

typename BlockLoadT::TempStorage load;

typename BlockScanT::TempStorage scan;

typename BlockStoreT::TempStorage store;

} temp_storage;

// Cooperatively load a tile of items

T items[ITEMS_PER_THREAD];

BlockLoadT(temp_storage.load).Load(d_in, items);

__syncthreads(); // Barrier for smem reuse

// Compute a block-wide exclusive prefix sum

BlockScanT(temp_storage.scan).ExclusiveSum(items, items);

__syncthreads(); // Barrier for smem reuse

// Cooperatively store a tile of items

BlockStoreT(temp_storage.store).Store(d_out, items);

}

Page 49

Tuning with CUB primitives

int* d_in; // = ...

int* d_out; // = ...

// Invoke kernel (GF110 Fermi)

ExampleKernel <128, 4, BLOCK_LOAD_WARP_TRANSPOSE>

<<<1, 128>>>(

d_in,

d_out);

template <int BLOCK_THREADS, int ITEMS_PER_THREAD, BlockLoadAlgorithm LOAD_ALGO, typename T>

__global__ void ExampleKernel(T* d_in, T* d_out)

{

// Specialize for BLOCK_THREADS threads owning ITEMS_PER_THREAD integers each

typedef cub::BlockLoad<T*, BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO> BlockLoadT;

typedef cub::BlockScan<T, BLOCK_THREADS> BlockScanT;

typedef cub::BlockStore<T*, BLOCK_THREADS, ITEMS_PER_THREAD> BlockStoreT;

// Allocate temporary storage in shared memory

__shared__ union {

typename BlockLoadT::TempStorage load;

typename BlockScanT::TempStorage scan;

typename BlockStoreT::TempStorage store;

} temp_storage;

// Cooperatively load a tile of items

T items[ITEMS_PER_THREAD];

BlockLoadT(temp_storage.load).Load(d_in, items);

__syncthreads(); // Barrier for smem reuse

// Compute a block-wide exclusive prefix sum

BlockScanT(temp_storage.scan).ExclusiveSum(items, items);

__syncthreads(); // Barrier for smem reuse

// Cooperatively store a tile of items

BlockStoreT(temp_storage.store).Store(d_out, items);

}

Page 50

Tuning with CUB primitives

int* d_in; // = ...

int* d_out; // = ...

// Invoke kernel (GF110 Fermi)

ExampleKernel <128, 4, BLOCK_LOAD_WARP_TRANSPOSE,

BLOCK_SCAN_RAKING> <<<1, 128>>>(

d_in,

d_out);

template <int BLOCK_THREADS, int ITEMS_PER_THREAD, BlockLoadAlgorithm LOAD_ALGO,

BlockScanAlgorithm SCAN_ALGO, typename T>

__global__ void ExampleKernel(T* d_in, T* d_out)

{

// Specialize for BLOCK_THREADS threads owning ITEMS_PER_THREAD integers each

typedef cub::BlockLoad<T*, BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO> BlockLoadT;

typedef cub::BlockScan<T, BLOCK_THREADS, SCAN_ALGO> BlockScanT;

typedef cub::BlockStore<T*, BLOCK_THREADS, ITEMS_PER_THREAD> BlockStoreT;

// Allocate temporary storage in shared memory

__shared__ union {

typename BlockLoadT::TempStorage load;

typename BlockScanT::TempStorage scan;

typename BlockStoreT::TempStorage store;

} temp_storage;

// Cooperatively load a tile of items

T items[ITEMS_PER_THREAD];

BlockLoadT(temp_storage.load).Load(d_in, items);

__syncthreads(); // Barrier for smem reuse

// Compute a block-wide exclusive prefix sum

BlockScanT(temp_storage.scan).ExclusiveSum(items, items);

__syncthreads(); // Barrier for smem reuse

// Cooperatively store a tile of items

BlockStoreT(temp_storage.store).Store(d_out, items);

}

Page 51

Tuning with CUB primitives

int* d_in; // = ...

int* d_out; // = ...

// Invoke kernel (GF110 Fermi)

ExampleKernel <128, 4, BLOCK_LOAD_WARP_TRANSPOSE,

LOAD_DEFAULT, BLOCK_SCAN_RAKING> <<<1, 128>>>(

d_in,

d_out);

template <int BLOCK_THREADS, int ITEMS_PER_THREAD, BlockLoadAlgorithm LOAD_ALGO,

CacheLoadModifier LOAD_MODIFIER, BlockScanAlgorithm SCAN_ALGO, typename T>

__global__ void ExampleKernel(T* d_in, T* d_out)

{

// Specialize for BLOCK_THREADS threads owning ITEMS_PER_THREAD integers each

typedef cub::BlockLoad<T*, BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO> BlockLoadT;

typedef cub::BlockScan<T, BLOCK_THREADS, SCAN_ALGO> BlockScanT;

typedef cub::BlockStore<T*, BLOCK_THREADS, ITEMS_PER_THREAD> BlockStoreT;

// Allocate temporary storage in shared memory

__shared__ union {

typename BlockLoadT::TempStorage load;

typename BlockScanT::TempStorage scan;

typename BlockStoreT::TempStorage store;

} temp_storage;

// Cooperatively load a tile of items

T items[ITEMS_PER_THREAD];

typedef cub::CacheModifiedInputIterator<LOAD_MODIFIER, T> InputItr;

BlockLoadT(temp_storage.load).Load(InputItr(d_in), items);

__syncthreads(); // Barrier for smem reuse

// Compute a block-wide exclusive prefix sum

BlockScanT(temp_storage.scan).ExclusiveSum(items, items);

__syncthreads(); // Barrier for smem reuse

// Cooperatively store a tile of items

BlockStoreT(temp_storage.store).Store(d_out, items);

}

Page 52

Tuning with CUB primitives

int* d_in; // = ...

int* d_out; // = ...

// Invoke kernel (GK110 Kepler)

ExampleKernel <128, 21, BLOCK_LOAD_DIRECT,

LOAD_LDG, BLOCK_SCAN_WARP_SCANS> <<<1, 128>>>(

d_in,

d_out);

template <int BLOCK_THREADS, int ITEMS_PER_THREAD, BlockLoadAlgorithm LOAD_ALGO,

CacheLoadModifier LOAD_MODIFIER, BlockScanAlgorithm SCAN_ALGO, typename T>

__global__ void ExampleKernel(T* d_in, T* d_out)

{

// Specialize for BLOCK_THREADS threads owning ITEMS_PER_THREAD integers each

typedef cub::BlockLoad<T*, BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO> BlockLoadT;

typedef cub::BlockScan<T, BLOCK_THREADS, SCAN_ALGO> BlockScanT;

typedef cub::BlockStore<T*, BLOCK_THREADS, ITEMS_PER_THREAD> BlockStoreT;

// Allocate temporary storage in shared memory

__shared__ union {

typename BlockLoadT::TempStorage load;

typename BlockScanT::TempStorage scan;

typename BlockStoreT::TempStorage store;

} temp_storage;

// Cooperatively load a tile of items

T items[ITEMS_PER_THREAD];

typedef cub::CacheModifiedInputIterator<LOAD_MODIFIER, T> InputItr;

BlockLoadT(temp_storage.load).Load(InputItr(d_in), items);

__syncthreads(); // Barrier for smem reuse

// Compute a block-wide exclusive prefix sum

BlockScanT(temp_storage.scan).ExclusiveSum(items, items);

__syncthreads(); // Barrier for smem reuse

// Cooperatively store a tile of items

BlockStoreT(temp_storage.store).Store(d_out, items);

}
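The tuning parameters threaded through the kernel above are often grouped into a single compile-time "policy" type so each architecture gets its own preset. A hedged sketch of that idea only, not CUB's actual internal policy classes; the struct and preset names are hypothetical.

#include <cub/cub.cuh>

template <
    int                     _BLOCK_THREADS,
    int                     _ITEMS_PER_THREAD,
    cub::BlockLoadAlgorithm _LOAD_ALGO,
    cub::CacheLoadModifier  _LOAD_MODIFIER,
    cub::BlockScanAlgorithm _SCAN_ALGO>
struct TilePolicy
{
    enum { BLOCK_THREADS = _BLOCK_THREADS, ITEMS_PER_THREAD = _ITEMS_PER_THREAD };
    static const cub::BlockLoadAlgorithm LOAD_ALGO     = _LOAD_ALGO;
    static const cub::CacheLoadModifier  LOAD_MODIFIER = _LOAD_MODIFIER;
    static const cub::BlockScanAlgorithm SCAN_ALGO     = _SCAN_ALGO;
};

// Hypothetical per-architecture presets matching the two launches shown above
typedef TilePolicy<128,  4, cub::BLOCK_LOAD_WARP_TRANSPOSE, cub::LOAD_DEFAULT, cub::BLOCK_SCAN_RAKING>     FermiPolicy;
typedef TilePolicy<128, 21, cub::BLOCK_LOAD_DIRECT,         cub::LOAD_LDG,     cub::BLOCK_SCAN_WARP_SCANS> KeplerPolicy;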

Page 53

Outline

1. Software reuse

2. SIMT collectives: the “missing” CUDA abstraction layer

3. The soul of collective component design

4. Using CUB’s collective primitives

5. Making your own collective primitives

6. Other Very Useful Things in CUB

7. Final thoughts

Page 54

// Simple collective primitive for block-wide prefix sum

template <typename T, int BLOCK_THREADS>

class BlockScan

{

};

Block-wide prefix sum (simplified)

[Figure: inclusive prefix sum over inputs x0..x7, producing outputs x0:x0, x0:x1, ..., x0:x7]

Page 55

// Simple collective primitive for block-wide prefix sum

template <typename T, int BLOCK_THREADS>

class BlockScan

{

// Type of shared memory needed by BlockScan

typedef T TempStorage[BLOCK_THREADS];

};

Block-wide prefix sum (simplified)

[Figure: inclusive prefix sum over inputs x0..x7, producing outputs x0:x0, x0:x1, ..., x0:x7]

Page 56

// Simple collective primitive for block-wide prefix sum

template <typename T, int BLOCK_THREADS>

class BlockScan

{

// Type of shared memory needed by BlockScan

typedef T TempStorage[BLOCK_THREADS];

// Per-thread data (reference to shared storage)

TempStorage &temp_storage;

// Constructor

BlockScan (TempStorage &storage) : temp_storage(storage) {}

};

Block-wide prefix sum (simplified)

[Figure: inclusive prefix sum over inputs x0..x7, producing outputs x0:x0, x0:x1, ..., x0:x7]

Page 57

// Simple collective primitive for block-wide prefix sum

template <typename T, int BLOCK_THREADS>

class BlockScan

{

// Type of shared memory needed by BlockScan

typedef T TempStorage[BLOCK_THREADS];

// Per-thread data (reference to shared storage)

TempStorage &temp_storage;

// Constructor

BlockScan (TempStorage &storage) : temp_storage(storage) {}

// Inclusive prefix sum operation (each thread contributes its own data item)

T InclusiveSum (T thread_data)

{

int tid = threadIdx.x; // thread rank within the block

#pragma unroll

for (int i = 1; i < BLOCK_THREADS; i *= 2)

{

temp_storage[tid] = thread_data;

__syncthreads();

if (tid >= i) thread_data += temp_storage[tid - i];

__syncthreads();

}

return thread_data;

}

};

Block-wide prefix sum (simplified)

[Figure: inclusive prefix sum over inputs x0..x7, producing outputs x0:x0, x0:x1, ..., x0:x7]
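A hedged usage sketch (not from the slides) of the simplified BlockScan above, one input item per thread. It assumes the members shown are public and that the kernel is launched with exactly BLOCK_THREADS threads per block.

template <int BLOCK_THREADS>
__global__ void SimpleScanKernel(int* d_in, int* d_out)
{
    typedef BlockScan<int, BLOCK_THREADS> BlockScanT;

    // The caller owns the shared memory and hands it to the collective
    __shared__ typename BlockScanT::TempStorage temp_storage;

    int thread_data = d_in[threadIdx.x];
    thread_data = BlockScanT(temp_storage).InclusiveSum(thread_data);
    d_out[threadIdx.x] = thread_data;
}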

Page 58

Block-wide reduce-by-key (simplified)

keys:            a    a    b    b    c    c    c    c
values:         1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
head-flags:      0    0    1    0    1    0    0    0
prev-keys:       -    a    a    b    b    c    c    c
scanned-values:  0   1.0  2.0  1.0  2.0  1.0  2.0  3.0
scanned-flags:   0    0    0    1    1    2    2    2

// Reduce-by-segment scan data type

struct ValueOffsetPair

{

ValueT value;

int offset;

// Sum operation

ValueOffsetPair operator+(ValueOffsetPair &other)

{

ValueOffsetPair retval;

retval.offset = offset + other.offset;

retval.value = (other.offset) ?

other.value :

value + other.value;

return retval;

}

};
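A hedged worked example (not from the slides) of how this operator behaves during a scan, assuming ValueT is double: sums accumulate within a segment and restart whenever the right-hand operand carries a set head flag.

__device__ void ValueOffsetPairDemo()
{
    ValueOffsetPair a = {2.0, 0};    // running sum within a segment
    ValueOffsetPair b = {1.0, 0};    // next item, same segment
    ValueOffsetPair c = {1.0, 1};    // next item starts a NEW segment (head flag set)

    ValueOffsetPair ab  = a + b;     // -> {3.0, 0}: values accumulate
    ValueOffsetPair abc = ab + c;    // -> {1.0, 1}: c's value restarts the sum
}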


Page 59

// Block-wide reduce-by-key

template <typename KeyT, typename ValueT, int BLOCK_THREADS, int ITEMS_PER_THREAD>

struct BlockReduceByKey

{

// Parameterized BlockDiscontinuity type for keys

typedef BlockDiscontinuity<KeyT, BLOCK_THREADS> BlockDiscontinuityT;

// Parameterized BlockScan type

typedef BlockScan<ValueOffsetPair, BLOCK_THREADS> BlockScanT;

// Temporary storage type

union TempStorage

{

typename BlockDiscontinuityT::TempStorage discontinuity;

typename BlockScanT::TempStorage scan;

};

// Reduce segments using addition operator.

// Returns the "carry-out" of the last segment

ValueT Sum(

TempStorage& temp_storage, // shared storage reference

KeyT keys[ITEMS_PER_THREAD], // [in|out] keys

ValueT values[ITEMS_PER_THREAD], // [in|out] values

int segment_indices[ITEMS_PER_THREAD]) // [out] segment indices (-1 if invalid)

{

...

}

};

Block-wide reduce-by-key (simplified)

Page 60

// Reduce segments using addition operator.

// Returns the "carry-out" of the last segment

ValueT Sum(

TempStorage& temp_storage,

KeyT keys[ITEMS_PER_THREAD],

ValueT values[ITEMS_PER_THREAD],

int segment_indices[ITEMS_PER_THREAD])

{

KeyT prev_keys[ITEMS_PER_THREAD];

ValueOffsetPair scan_items[ITEMS_PER_THREAD];

// Set head segment_flags.

BlockDiscontinuityT(temp_storage.discontinuity).FlagHeads(

segment_indices, keys, prev_keys);

__syncthreads();

// Unset the flag for the first item

if (threadIdx.x == 0)

segment_indices[0] = 0;

// Zip values and segment_flags

for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)

{

scan_items[ITEM].offset = segment_indices[ITEM];

scan_items[ITEM].value = values[ITEM];

}

...

...

// Exclusive scan of values and segment_flags

ValueOffsetPair tile_aggregate;

BlockScanT(temp_storage.scan).ExclusiveSum(

scan_items, scan_items, tile_aggregate);

// Unzip values and segment indices

for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)

{

segment_indices[ITEM] = segment_indices[ITEM] ?

scan_items[ITEM].offset :

-1;

keys[ITEM] = prev_keys[ITEM];

values[ITEM] = scan_items[ITEM].value;

}

// Return “carry-out”

return tile_aggregate.value;

}

Block-wide reduce-by-key (simplified)

Page 61

Outline

1. Software reuse

2. SIMT collectives: the “missing” CUDA abstraction layer

3. The soul of collective component design

4. Using CUB’s collective primitives

5. Making your own collective primitives

6. Other Very Useful Things in CUB

7. Final thoughts

Page 62

Cache-modified input iterators

#include <cub/cub.cuh>

// Standard layout type

struct Foo

{

double x;

char y;

};

__global__ void Kernel(Foo* d_in, Foo* d_out)

{

// In host or device code: create an LDG wrapper

cub::CacheModifiedInputIterator<cub::LOAD_LDG, Foo> ldg_itr(d_in);

cub::CacheModifiedOutputIterator<cub::STORE_WT, Foo> volatile_itr(d_out);

volatile_itr[threadIdx.x] = ldg_itr[threadIdx.x];

}

code for sm_35

Function : _Z6KernelPdS_

MOV R1, c[0x0][0x44];

S2R R0, SR_TID.X;

ISCADD R2, R0, c[0x0][0x140], 0x3;

LDG.64 R4, [R2];

LDG.64 R2, [R6];

ISCADD R0, R0, c[0x0][0x144], 0x4;

TEXDEPBAR 0x1;

ST.WT.64 [R0], R4;

TEXDEPBAR 0x0;

ST.WT.64 [R0+0x8], R2;

EXIT;

LOAD_DEFAULT, ///< Default (no modifier)

LOAD_CA, ///< Cache at all levels

LOAD_CG, ///< Cache at global level

LOAD_CS, ///< Cache streaming (likely to be accessed once)

LOAD_CV, ///< Cache as volatile (including cached system lines)

LOAD_LDG, ///< Cache as texture

LOAD_VOLATILE, ///< Volatile (any memory space)
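The same iterator wrappers compose with CUB's device-wide entry points. A hedged sketch (not from the slides): a reduction over streaming loads, where the wrapper function name is hypothetical and the two-phase call sizes the temporary storage before running.

#include <cub/cub.cuh>

void SumWithStreamingLoads(int* d_in, int* d_sum, int num_items)
{
    cub::CacheModifiedInputIterator<cub::LOAD_CS, int> in_itr(d_in);

    // Two-phase convention: the first call only reports the temp storage size
    void*  d_temp_storage     = NULL;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, in_itr, d_sum, num_items);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, in_itr, d_sum, num_items);
    cudaFree(d_temp_storage);
}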

Page 63

Texture obj (and ref) input iterators

#include <cub/cub.cuh>

// Standard layout type

struct Foo

{

int y;

double x;

};

template <typename InputIteratorT, typename OutputIteratorT>

__global__ void Kernel(InputIteratorT d_in, OutputIteratorT d_out)

{

d_out[threadIdx.x] = d_in[threadIdx.x];

}

// Create a texture object input iterator

Foo* d_foo;

cub::TexObjInputIterator<Foo> d_foo_tex;

d_foo_tex.BindTexture(d_foo);

Kernel<<<1, 32>>>(d_foo_tex, d_foo);

d_foo_tex.UnbindTexture();

code for sm_35

Function :

_Z6KernelIN3cub19TexObjInputIteratorI3FooiEEPS

2_EvT_T0_

MOV R1, c[0x0][0x44];

S2R R0, SR_TID.X;

IADD R2, R0, c[0x0][0x144];

SHF.L R2, RZ, 0x1, R2;

IADD R3, R2, 0x1;

TLD.LZ.T R2, R2, 0x52, 1D, 0x1;

TLD.LZ.P R4, R3, 0x52, 1D, 0x3;

ISCADD R0, R0, c[0x0][0x150], 0x4;

TEXDEPBAR 0x1;

ST [R0], R2;

TEXDEPBAR 0x0;

ST.64 [R0+0x8], R4;

EXIT;

Page 64

Collective primitives

WarpReduce

reduction & segmented reduction

WarpScan

BlockDiscontinuity

BlockExchange

BlockHistogram

BlockLoad & BlockStore

BlockRadixSort

BlockReduce

BlockScan
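All of the block-wide primitives listed above follow the same specialize / reflect TempStorage / construct / call pattern shown earlier. A hedged sketch (not from the slides) for cub::BlockRadixSort, sorting a 512-key tile with a 128-thread block; the kernel name is hypothetical.

#include <cub/cub.cuh>

__global__ void SortTileKernel(int* d_keys)
{
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSortT;
    __shared__ typename BlockRadixSortT::TempStorage temp_storage;

    // Each thread owns 4 keys in a blocked arrangement
    int thread_keys[4];
    cub::LoadDirectBlocked(threadIdx.x, d_keys, thread_keys);

    BlockRadixSortT(temp_storage).Sort(thread_keys);    // collective block-wide sort

    cub::StoreDirectBlocked(threadIdx.x, d_keys, thread_keys);
}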

Page 65

Device-wide (global) primitives (Usable with CDP, streams, and your own memory allocator)

DeviceHistogram

histogram-even

histogram-range

DevicePartition

partition-if

partition-flagged

DeviceRadixSort

ascending / descending

DeviceReduce

reduction

arg-min, arg-max

reduce-by-key

DeviceRunLengthEncode

RLE

Non-trivial segments

DeviceScan

inclusive / exclusive

DeviceSelect

select-flagged

select-if

keep-unique

DeviceSpmv
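Each of these device-wide entry points uses the same two-phase calling convention, which is what makes them usable with streams and caller-provided allocation. A hedged sketch (not from the slides), with a hypothetical wrapper function name:

#include <cub/cub.cuh>

void ExclusiveSumAsync(int* d_in, int* d_out, int num_items, cudaStream_t stream)
{
    void*  d_temp_storage     = NULL;
    size_t temp_storage_bytes = 0;

    // Phase 1: a no-op call that only reports how much temporary storage is needed
    cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_in, d_out, num_items, stream);

    // The caller allocates (cudaMalloc here, but any allocator will do)
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Phase 2: the actual scan, enqueued on the caller's stream
    cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_in, d_out, num_items, stream);

    cudaFree(d_temp_storage);
}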

Page 66

NEW: performance-resilient histogram. Simple intra-thread RLE provides a uniform performance response regardless of input sample distribution.

[Figure: average elapsed time (us) on GeForce GTX 980, 1-channel 1920x1080 uchar1 pixels; series: RLE CUB, SMEM atomic, GMEM atomic]

// RLE pixel counts within the thread's pixels

int accumulator = 1;

for (int PIXEL = 0;

PIXEL < PIXELS_PER_THREAD - 1;

++PIXEL)

{

if (bins[PIXEL] == bins[PIXEL + 1])

{

accumulator++;

}

else

{

atomicAdd(

privatized_histogram + bins[PIXEL],

accumulator);

accumulator = 1;

}

}

// Commit the final run owned by this thread

atomicAdd(privatized_histogram + bins[PIXELS_PER_THREAD - 1], accumulator);

Page 67

NEW: CSR SpMV. Merge-based parallel decomposition for load balance.

[Figure: merge-path decomposition over the CSR row-offsets (0 2 2 4 8) and the indices of the non-zeros; performance chart: fp32 SpMV gigaflops on Tesla K40, CUB vs. cuSPARSE]

Page 68

Outline

1. Software reuse

2. SIMT collectives: the “missing” CUDA abstraction layer

3. The soul of collective component design

4. Using CUB’s collective primitives

5. Making your own collective primitives

6. Other Very Useful Things in CUB

7. Final thoughts

Page 69

Benefits of using CUB primitives

Simplicity of composition

Kernels are simply sequences of primitives

High performance

CUB uses the best known algorithms, abstractions, strategies, and techniques

Performance portability

CUB is specialized for the target hardware (e.g., memory conflict rules, special instructions, etc.)

Simplicity of tuning

CUB adapts to various grain sizes (threads per block, items per thread, etc.)

CUB provides alternative algorithms

Robustness and durability

CUB supports arbitrary data types and block sizes

Page 70

Questions?

Please visit the CUB project on GitHub

http://nvlabs.github.com/cub

Duane Merrill ([email protected])

Page 71

THANK YOU

Page 72

[Figure: backup diagram of participants p0-p3 exchanging values across barriers (id = identity)]