Thread Block Compaction for Efficient SIMT Control Flow
Wilson W. L. Fung, Tor M. Aamodt
University of British Columbia
HPCA-17, Feb 14, 2011



Page 1: Compaction for Efficient SIMT Control Flow


Thread Block Compaction for Efficient SIMT Control Flow

Wilson W. L. Fung, Tor M. Aamodt
University of British Columbia
HPCA-17, Feb 14, 2011

Page 2: Compaction for Efficient SIMT Control Flow

Wilson Fung, Tor Aamodt Thread Block Compaction 2

Graphics Processor Unit (GPU)

- Commodity many-core accelerator
- SIMD HW: compute bandwidth + efficiency
- Non-graphics APIs: CUDA, OpenCL, DirectCompute
- Programming model: hierarchy of scalar threads (Grid → Blocks → Thread Blocks → Scalar Threads)
- SIMD-ness not exposed at the ISA
- Scalar threads grouped into warps, run in lockstep: Single-Instruction, Multiple-Thread (SIMT)

Example: scalar threads 1 2 3 4 5 6 7 8 9 10 11 12 grouped into warps.

Page 3: Compaction for Efficient SIMT Control Flow


SIMT Execution Model

A: K = A[tid.x];
B: if (K > 10)
C:     K = 10;
   else
D:     K = 0;
E: B = C[tid.x] + K;

Reconvergence stack (PC, RPC, active mask):
E  --  1111
D  E   0011
C  E   1100

Execution timeline under branch divergence (time →):
A: 1 2 3 4
B: 1 2 3 4
C: 1 2 -- --
D: -- -- 3 4
E: 1 2 3 4

With A[] = {4,8,12,16}: 50% SIMD efficiency in the divergent region!
In some cases, SIMD efficiency drops to 20%.
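The 50% figure can be checked with a small sketch (Python here, purely illustrative; the `simd_efficiency` helper and the 4-wide warp are assumptions for this example, not the paper's simulator). It counts issued lane slots versus occupied lanes over the two divergent blocks C and D:

```python
def simd_efficiency(masks, width=4):
    """Fraction of SIMD lanes doing useful work across issued blocks."""
    issued = sum(width for m in masks if any(m))   # lane slots issued
    busy = sum(sum(m) for m in masks if any(m))    # lanes actually active
    return busy / issued

# Branch outcome for A[] = {4, 8, 12, 16} with "if (K > 10)":
taken = [k > 10 for k in [4, 8, 12, 16]]           # [False, False, True, True]
div_blocks = [taken, [not t for t in taken]]       # blocks C and D
print(simd_efficiency(div_blocks))                 # 0.5
```

Each divergent block occupies 2 of 4 lanes, so half the issued slots are wasted.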

Page 4: Compaction for Efficient SIMT Control Flow


Dynamic Warp Formation (W. Fung et al., MICRO 2007)

Execution per static warp (time →), with threads diverging at the branch:

Warp 0: A 1 2 3 4 → B → C 1 2 -- -- / D -- -- 3 4 → E 1 2 3 4
Warp 1: A 5 6 7 8 → B → C 5 -- 7 8 / D -- 6 -- -- → E 5 6 7 8
Warp 2: A 9 10 11 12 → B → C -- -- 11 12 / D 9 10 -- -- → E 9 10 11 12

DWF packs threads from different warps that reach the same PC into new warps
(reissued after memory latency): C 1 2 7 8 and C 5 -- 11 12.
SIMD efficiency: 88%.

Page 5: Compaction for Efficient SIMT Control Flow


This Paper

Identified DWF pathologies:
- Greedy warp scheduling → starvation → lower SIMD efficiency
- Breaks up coalesced memory accesses → lower memory efficiency
- Some CUDA apps require lockstep execution of static warps; DWF breaks them! (Extreme case: 5X slowdown.)

Additional HW to fix these issues? Simpler solution: Thread Block Compaction.

Page 6: Compaction for Efficient SIMT Control Flow


Thread Block Compaction

- Block-wide reconvergence stack: regroup threads within a block
- Better reconvergence stack: likely convergence (converge before the immediate post-dominator); robust
- Avg. 22% speedup on divergent CUDA apps; no penalty on others

Per-warp stacks (PC, RPC, active mask):
Warp 0: E -- 1111 | D E 0011 | C E 1100
Warp 1: E -- 1111 | D E 0100 | C E 1011
Warp 2: E -- 1111 | D E 1100 | C E 0011

Block-wide stack (Thread Block 0):
PC  RPC  Active Mask
E   --   1111 1111 1111
D   E    0011 0100 1100
C   E    1100 1011 0011

Compacted warps: C → Warp X, Warp Y; D → Warp T, Warp U; E → Warp 0, Warp 1, Warp 2.

Page 7: Compaction for Efficient SIMT Control Flow


Outline

Introduction GPU Microarchitecture DWF Pathologies Thread Block Compaction Likely-Convergence Experimental Results Conclusion

Page 8: Compaction for Efficient SIMT Control Flow


GPU Microarchitecture

[Figure: Multiple SIMT cores connect through an interconnection network to memory partitions, each with a last-level cache bank and an off-chip DRAM channel. Each SIMT core has a SIMT front end (fetch, decode, schedule, branch; a done signal carries the warp ID) feeding a SIMD datapath, plus a memory subsystem: shared memory, L1 D$, texture $, constant $, and the interconnect network interface. More details in the paper.]

Page 9: Compaction for Efficient SIMT Control Flow


DWF Pathologies: Starvation

Majority scheduling (best performing in prior work): prioritize the largest group of threads with the same PC.
Starvation → LOWER SIMD efficiency!
Other warp scheduler? Tricky: variable memory latency.

B: if (K > 10)
C:     K = 10;
   else
D:     K = 0;
E: B = C[tid.x] + K;

Timeline (time →): the majority warps C 1 2 7 8 and C 5 -- 11 12 run ahead through E 1 2 7 8 and E 5 -- 11 12, while the minority warps D 9 6 3 4 and D -- 10 -- -- are starved for 1000s of cycles behind memory latency before finally executing D and then E 9 6 3 4 and E -- 10 -- --.

Page 10: Compaction for Efficient SIMT Control Flow


DWF Pathologies: Extra Uncoalesced Accesses

Coalesced memory access = memory SIMD; a 1st-order CUDA programmer optimization. Not preserved by DWF.

E: B = C[tid.x] + K;

No DWF (static warps):          With DWF (dynamic warps):
E 1 2 3 4                       E 1 2 7 12
E 5 6 7 8                       E 9 6 3 8
E 9 10 11 12                    E 5 10 11 4
Memory: 0x100 / 0x140 / 0x180   Memory: 0x100 / 0x140 / 0x180
#Acc = 3                        #Acc = 9

The L1 cache absorbs the redundant memory traffic, but at the cost of L1$ port conflicts.
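The two access counts can be reproduced with a small sketch. The geometry is an assumption chosen to match the slide's addresses (16-byte elements of C[] starting at 0x100, 64-byte memory transactions), not a figure from the paper:

```python
# Count distinct memory transactions touched by each warp's loads.
SEG = 64      # bytes per memory transaction (assumed)
ELEM = 16     # bytes per element of C[] (assumed, matches 0x100/0x140/0x180)
BASE = 0x100  # base address of C[] (assumed)

def num_accesses(warps):
    total = 0
    for tids in warps:
        # Each thread t loads C[t-1]; a warp issues one transaction
        # per distinct 64-byte segment its lanes touch.
        segments = {(BASE + (t - 1) * ELEM) // SEG for t in tids}
        total += len(segments)
    return total

static_warps = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
dwf_warps    = [[1, 2, 7, 12], [9, 6, 3, 8], [5, 10, 11, 4]]
print(num_accesses(static_warps), num_accesses(dwf_warps))  # 3 9
```

Static warps touch one segment each; the scrambled DWF warps each straddle three, tripling the transaction count.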

Page 11: Compaction for Efficient SIMT Control Flow


DWF Pathologies: Implicit Warp Sync.

Some CUDA applications depend on the lockstep execution of "static warps", e.g. a task queue in ray tracing:

Thread 0...31 → Warp 0; Thread 32...63 → Warp 1; Thread 64...95 → Warp 2

int wid = tid.x / 32;
if (tid.x % 32 == 0) {
    sharedTaskID[wid] = atomicAdd(g_TaskID, 32);
}
my_TaskID = sharedTaskID[wid] + tid.x % 32;
ProcessTask(my_TaskID);

This code implicitly assumes warp-synchronous execution (implicit warp sync).

Page 12: Compaction for Efficient SIMT Control Flow


Observation

Compute kernels usually contain divergent and non-divergent (coherent) code segments.
Coalesced memory accesses usually occur in coherent code segments, so DWF provides no benefit there.

Coherent (static warps, coalesced LD/ST) → divergence → Divergent (dynamic warps) → reconvergence point (reset warps) → Coherent (static warps)

Page 13: Compaction for Efficient SIMT Control Flow


Thread Block Compaction

- Barrier @ branch/reconvergence point: all available threads arrive at the branch; insensitive to warp scheduling → fixes starvation.
- Run a thread block like a warp: the whole block moves between coherent and divergent code; a block-wide stack tracks execution paths and reconvergence → preserves implicit warp sync.
- Warp compaction: regroup with all available threads; with no divergence this yields the static warp arrangement → no extra uncoalesced memory accesses.

Page 14: Compaction for Efficient SIMT Control Flow


Thread Block Compaction

Block-wide stack:
PC  RPC  Active Threads
A   -    1 2 3 4 5 6 7 8 9 10 11 12
D   E    -- -- 3 4 -- 6 -- -- 9 10 -- --
C   E    1 2 -- -- 5 -- 7 8 -- -- 11 12
E   -    1 2 3 4 5 6 7 8 9 10 11 12

A: K = A[tid.x];
B: if (K > 10)
C:     K = 10;
   else
D:     K = 0;
E: B = C[tid.x] + K;

Execution (time →): warps A 1 2 3 4, A 5 6 7 8, A 9 10 11 12 run block A.
Uncompacted, the two sides would occupy six partially-full warps:
C 1 2 -- --, C 5 -- 7 8, C -- -- 11 12 and D -- -- 3 4, D -- 6 -- --, D 9 10 -- --.
After compaction they occupy four: C 1 2 7 8, C 5 -- 11 12 and D 9 6 3 4, D -- 10 -- --.
At the reconvergence point, static warps E 1 2 3 4, E 5 6 7 8, E 9 10 11 12 resume.
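A minimal sketch of the compaction step (illustrative only; lane alignment, which the thread compactor's priority encoders handle in the real design, is ignored here):

```python
# At the divergent branch, all 12 threads of the block synchronize and
# each side of the branch is packed into the fewest possible warps.
WIDTH = 4

def compact(tids, width=WIDTH):
    """Pack a side's threads into ceil(n/width) dense warps."""
    tids = sorted(tids)
    return [tids[i:i + width] for i in range(0, len(tids), width)]

c_side = compact([1, 2, 5, 7, 8, 11, 12])   # threads taking C
d_side = compact([3, 4, 6, 9, 10])          # threads taking D
# Per-warp stacks would issue 3 partially-full warps per side (6 total);
# block-wide compaction issues only 2 + 2.
print(len(c_side) + len(d_side))  # 4
```

The barrier is what makes this possible: only when every thread of the block has reported its branch outcome can both sides be packed optimally.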

Page 15: Compaction for Efficient SIMT Control Flow


Thread Block Compaction

Barrier at every basic block?! (Idle pipeline.)
Switch to warps from other thread blocks: multiple thread blocks run on a core; already done in most CUDA applications.

Time →: while Block 0 stalls at a branch for warp compaction, Blocks 1 and 2 keep executing, hiding the barrier.

Page 16: Compaction for Efficient SIMT Control Flow


Microarchitecture Modification

- Per-warp stack → block-wide stack
- I-buffer + TIDs → warp buffer: stores the dynamic warps
- New unit: thread compactor, translates the active mask into compact dynamic warps
- More detail in the paper

[Figure: SIMT pipeline with Fetch, I-Cache, Decode, Warp Buffer, Score-Board, Issue, RegFile, ALUs, MEM. The block-wide stack receives Done (WID), the branch target PC, Valid[1:N], and the active mask, and feeds the thread compactor (with predicates), which fills the warp buffer.]

Page 17: Compaction for Efficient SIMT Control Flow


Likely-Convergence

- Immediate post-dominator: conservative; all paths from a divergent branch must merge there.
- Convergence can happen earlier: when any two of the paths merge.
- Extended the reconvergence stack to exploit this. With TBC: 30% speedup for Ray Tracing.

while (i < K) {
    X = data[i];
A:  if (X == 0)
B:      result[i] = Y;
C:  else if (X == 1)
D:      break;
E:  i++;
}
F: return result[i];

B and C are rarely taken. The paths from B and C merge early at E, but the iPDom of A is F, because the break at D exits the loop. Details in paper.

Page 18: Compaction for Efficient SIMT Control Flow


Outline

Introduction GPU Microarchitecture DWF Pathologies Thread Block Compaction Likely-Convergence Experimental Results Conclusion

Page 19: Compaction for Efficient SIMT Control Flow


Evaluation

Simulation: GPGPU-Sim (2.2.1b), configured like a Quadro FX5800 plus L1 & L2 caches.
21 benchmarks:
- All of GPGPU-Sim's original benchmarks
- Rodinia benchmarks
- Other important applications: Face Detection from Visbench (UIUC), DNA sequencing (MUMMER-GPU++), molecular dynamics simulation (NAMD), ray tracing from NVIDIA Research

Page 20: Compaction for Efficient SIMT Control Flow


Experimental Results

2 benchmark groups: COHE = non-divergent CUDA applications; DIVG = divergent CUDA applications.

[Chart: IPC relative to the per-warp-stack baseline (0.6-1.3) for TBC and DWF on COHE and DIVG.]

DWF: serious slowdown from the pathologies. TBC: no penalty on COHE; 22% speedup on DIVG.

Page 21: Compaction for Efficient SIMT Control Flow


Conclusion

- Thread Block Compaction: addressed some key challenges of DWF; one significant step closer to reality.
- Benefits from advancements on the reconvergence stack: likely-convergence points.
- Extensible: can integrate other stack-based proposals.

Page 22: Compaction for Efficient SIMT Control Flow


Thank You!

Page 23: Compaction for Efficient SIMT Control Flow

Thread Block Compaction

Backups


Page 24: Compaction for Efficient SIMT Control Flow


Thread Compactor

Converts the active mask from the block-wide stack into thread IDs in the warp buffer, using an array of priority encoders (one per SIMD lane).

Block-wide stack entry: C E 1 2 -- -- 5 -- 7 8 -- -- 11 12

Lane candidates (columns are lanes):
1   2   --  --
5   --  7   8
--  --  11  12
P-Enc P-Enc P-Enc P-Enc

Warp buffer: C 1 2 7 8 and C 5 -- 11 12
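The per-lane priority encoding can be sketched as follows (illustrative model, not the RTL: each "encoder" just pops the first pending thread in its lane each cycle):

```python
# Sketch of the thread compactor: one priority encoder per SIMD lane.
# Each cycle, every encoder emits at most one pending thread from its
# lane; together the emitted threads form one compacted dynamic warp.
def thread_compactor(lane_queues):
    warps = []
    while any(lane_queues):
        # None marks an empty lane slot (shown as "--" on the slide).
        warps.append([q.pop(0) if q else None for q in lane_queues])
    return warps

# Lane candidates for path C, from the block-wide active mask
# "1 2 -- -- | 5 -- 7 8 | -- -- 11 12" (columns are lanes):
lanes = [[1, 5], [2], [7, 11], [8, 12]]
print(thread_compactor(lanes))  # [[1, 2, 7, 8], [5, None, 11, 12]]
```

Because each thread stays in its home lane, the compacted warps need no extra register-file crossbar; the first warp is fully dense and only the second carries a hole.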

Page 25: Compaction for Efficient SIMT Control Flow


Effect on Memory Traffic

TBC still generates some extra uncoalesced memory accesses:

C 1 2 7 8
C 5 -- 11 12
Memory: 0x100 / 0x140 / 0x180 → #Acc = 4, but the 2nd access hits the L1 cache.

No change to the overall memory traffic in/out of a core.

[Charts: (1) memory stalls normalized to the baseline (0%-300%) for TBC, DWF, and the baseline; (2) memory traffic normalized to the baseline (0.6-1.2, one outlier at 2.67x) for the TBC-AGE and TBC-RRB configurations. Benchmarks: DIVG = BFS2, FCDT, HOTSP, LPS, MUM, MUMpp, NAMD, NVRT; COHE = AES, BACKP, CP, DG, HRTWL, LIB, LKYT, MGST, NNC, RAY, STMCL, STO, WP.]

Page 26: Compaction for Efficient SIMT Control Flow


Likely-Convergence (2)

NVIDIA uses a break instruction for loop exits; that handles the last example.
Our solution: likely-convergence points. This paper: only used to capture loop breaks.

Stack evolution (PC, RPC, LPC, LPos, active threads):
F  --  --  --  1 2 3 4
E  F   --  --
B  F   E   1   1
C  F   E   1   2 3 4
E  F   E   1   2
D  F   E   1   3 4

When entries reaching the likely-convergence point E meet, they merge into E F -- -- 1 2 → Convergence!

Page 27: Compaction for Efficient SIMT Control Flow


Likely-Convergence (3)

Applies to both the per-warp stack (PDOM) and thread block compaction (TBC). Enables more thread grouping for TBC. Side effect: reduces stack usage in some cases.