Code GPU with CUDA - SIMT




    Created by Marina Kolpakova for cuda.geek, Itseez


  • OUTLINE
    Hardware revisions
    SIMT architecture
    Warp scheduling
    Divergence & convergence
    Predicated execution
    Conditional execution

  • OUT OF SCOPE
    Computer graphics capabilities

  • HARDWARE REVISIONS
    SM (shading model) denotes a particular hardware implementation.

    Generation  SM     GPU models
    Tesla       sm_10  G80 G92(b) G94(b)
                sm_11  G86 G84 G98 G96(b) G94(b) G92(b)
                sm_12  GT218 GT216 GT215
                sm_13  GT200 GT200b
    Fermi       sm_20  GF100 GF110
                sm_21  GF104 GF114 GF116 GF108 GF106
    Kepler      sm_30  GK104 GK106 GK107
                sm_32  GK20A
                sm_35  GK110 GK208
                sm_37  GK210
    Maxwell     sm_50  GM107 GM108
                sm_52  GM204
                sm_53  GM20B
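    The SM revision is also visible to device code: nvcc defines the __CUDA_ARCH__ macro (the targeted SM version times ten, e.g. 350 for sm_35) during device compilation, so a kernel can be specialized per hardware revision. A minimal sketch (the kernel name is illustrative):

```cuda
// __CUDA_ARCH__ is defined only while device code is being compiled,
// and holds the targeted SM version * 10 (e.g. 500 for sm_50).
__global__ void scaled_copy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 350
    out[i] = __ldg(&in[i]);   // sm_35+: load via the read-only data cache
#else
    out[i] = in[i];           // older revisions: plain global load
#endif
}
```

    Compile with, for example, nvcc -arch=sm_35 to target Kepler GK110-class parts.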

  • LATENCY VS THROUGHPUT ARCHITECTURES
    Modern CPUs and GPUs are both multi-core systems.

    CPUs are latency oriented:
      Pipelining, out-of-order, superscalar execution
      Caching, on-die memory controllers
      Speculative execution, branch prediction
      Compute cores occupy only a small part of the die

    GPUs are throughput oriented:
      100s of simple compute cores
      Zero-cost scheduling of 1000s of threads
      Compute cores occupy most of the die

  • SIMD vs SIMT vs SMT
    SIMT: Single Instruction, Multiple Threads

    SIMD: elements of short vectors are processed in parallel. Represents the problem as short vectors and processes it vector by vector. Hardware support for wide arithmetic.
    SMT: instructions from several threads are run in parallel. Represents the problem as a set of independent tasks and assigns them to different threads. Hardware support for multi-threading.
    SIMT: vector processing + light-weight threading:
      A warp is the unit of execution; it performs the same instruction on all lanes each cycle. A warp is 32 lanes wide.
      Thread scheduling and fast context switching between warps minimize stalls.
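    A warp and a thread's lane within it are not exposed as built-in indices, but both follow directly from threadIdx. A minimal sketch (kernel name is illustrative):

```cuda
#include <cstdio>

// Each thread derives its warp number within the block and its lane
// within the warp. warpSize is 32 on all hardware listed above.
__global__ void warp_lane_ids() {
    int warpId = threadIdx.x / warpSize;  // which warp of the block
    int laneId = threadIdx.x % warpSize;  // this thread's lane in its warp
    if (laneId == 0)
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warpId, threadIdx.x);
}

int main() {
    warp_lane_ids<<<1, 64>>>();  // one block of 64 threads = 2 warps
    cudaDeviceSynchronize();
    return 0;
}
```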


    1. SIMT is an abstraction over vector hardware:
       Threads are grouped into warps (32 threads for NVIDIA)
       A thread in a warp is usually called a lane
       Vector register file; registers are accessed line by line
       A lane loads the laneId-th element of a register
       Single program counter (PC) for the whole warp
       Only a couple of special registers, like the PC, can be scalar

    2. SIMT hardware is responsible for warp scheduling:
       Static for all latest hardware revisions
       Zero overhead on context switching
       Score-boarding for long-latency operations


    Memory instructions are separated from arithmetic
    Arithmetic is performed only on registers and immediates
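    This load/store split is visible in what the compiler emits for even a trivial kernel. The SASS in the comment below is an approximate sketch, not the exact output of any particular compiler version:

```cuda
// Load/store architecture: memory ops are separate instructions from
// arithmetic. For c[i] = a[i] + b[i] the compiler emits roughly:
//   LD   R2, [a + i*4];   // memory instruction
//   LD   R3, [b + i*4];   // memory instruction
//   FADD R4, R2, R3;      // arithmetic on registers only
//   ST   [c + i*4], R4;   // memory instruction
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
```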

  • SIMT PIPELINE
    The warp scheduler manages warps and selects one that is ready to execute
    A fetch/decode unit is associated with each warp scheduler
    Execution units are SC (scalar cores), SFU (special function units), LD/ST (load/store units)

    Area- and power-efficiency thanks to regularity.

  • VECTOR REGISTER FILE
    ~Zero-cost warp switching requires a big vector register file (RF)

    While a warp is resident on an SM it occupies a portion of the RF
    The GPU's RF is 32-bit; 64-bit values are stored in register pairs
    Fast switching costs register wastage on duplicated items
    Narrow data types are as costly as wide data types

    Size of the RF depends on the architecture. Fermi: 128KB per SM, Kepler: 256KB per SM, Maxwell: 64KB per scheduler.
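    These sizes bound how many warps can be resident at once. A back-of-the-envelope sketch using the Kepler figure from the slide (the per-thread register count is a hypothetical example):

```cuda
#include <cstdio>

int main() {
    // Slide numbers: Kepler has a 256KB register file per SM;
    // registers are 32-bit (4 bytes); a warp is 32 lanes wide.
    const int rf_bytes        = 256 * 1024;
    const int regs_per_thread = 32;   // hypothetical kernel register usage
    const int warp_size       = 32;

    // Each resident warp pins regs_per_thread registers per lane.
    int bytes_per_warp = regs_per_thread * 4 * warp_size;  // 4096 bytes
    int max_warps      = rf_bytes / bytes_per_warp;        // 64 warps
    printf("register file alone allows %d resident warps per SM\n", max_warps);
    return 0;
}
```

    Other limits (scheduler slots, shared memory) can lower this further; the RF is just one of the occupancy constraints.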


    In-order execution
      Instructions are fetched, executed & completed in compiler-generated order
      In case one instruction stalls, all following instructions stall too

    Dynamic scheduling
      Instructions are fetched in compiler-generated order
      Instructions are executed out of order
      A special unit tracks dependencies and reorders instructions
      Independent instructions behind a stalled instruction can pass it

  • WARP SCHEDULING
    The GigaThread engine subdivides work between SMs
    Work for an SM is sent to a warp scheduler
    Once assigned, a warp cannot migrate between schedulers
    A warp has its own lines in the register file, its own PC and activity mask
    A warp can be in one of the following states:

      Executing - performs an operation
      Ready - waits to be executed
      Waiting - waits for resources
      Resident - waits for completion of other warps within the same block

  • WARP SCHEDULING
    Depending on the generation, scheduling is dynamic (Fermi) or static (Kepler, Maxwell)

  • WARP SCHEDULING (CONT)
    Modern warp schedulers support dual issue (sm_21+): decoding an instruction pair for an active warp per clock

    An SM has 2 or 4 warp schedulers depending on the architecture

    Warps belong to blocks; hardware tracks this relation as well

  • DIVERGENCE & (RE)CONVERGENCE
    Divergence: not all lanes in a warp take the same code path

    Convergence is handled via a convergence stack. A convergence stack entry includes:

      convergence PC
      next-path PC
      lane mask (marks the active lanes on that path)

    The SSY instruction pushes an entry onto the convergence stack. It occurs before potentially divergent instructions.
    The .S suffix indicates the convergence point: the instruction after which all lanes in a warp take the same code path.

  • DIVERGENT CODE EXAMPLE
    (void) atomicAdd( &smem[0], src[threadIdx.x] );

    /*0050*/ SSY 0x80;
    /*0058*/ LDSLK P0, R3, [RZ];
    /*0060*/ @P0 IADD R3, R3, R0;
    /*0068*/ @P0 STSUL [RZ], R3;
    /*0070*/ @!P0 BRA 0x58;
    /*0078*/ NOP.S;

    Assume warp size == 4
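    The one-liner above can be wrapped in a complete program for experimentation; the kernel name and sizes below are illustrative. The LDSLK/STSUL pair in the SASS is a shared-memory lock/unlock: lanes that fail to take the lock loop back (the @!P0 BRA), so the warp diverges until every lane has completed its add.

```cuda
#include <cstdio>

// Hypothetical wrapper around the slide's divergent one-liner.
__global__ void sum_kernel(const int* src, int* dst) {
    __shared__ int smem[1];
    if (threadIdx.x == 0) smem[0] = 0;
    __syncthreads();
    (void) atomicAdd(&smem[0], src[threadIdx.x]);  // divergent retry loop in SASS
    __syncthreads();
    if (threadIdx.x == 0) *dst = smem[0];
}

int main() {
    int h_src[32], h_dst = 0;
    for (int i = 0; i < 32; ++i) h_src[i] = 1;
    int *d_src, *d_dst;
    cudaMalloc(&d_src, sizeof h_src);
    cudaMalloc(&d_dst, sizeof h_dst);
    cudaMemcpy(d_src, h_src, sizeof h_src, cudaMemcpyHostToDevice);
    sum_kernel<<<1, 32>>>(d_src, d_dst);
    cudaMemcpy(&h_dst, d_dst, sizeof h_dst, cudaMemcpyDeviceToHost);
    printf("%d\n", h_dst);  // 32: one contribution per lane
    return 0;
}
```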


  • PREDICATED EXECUTION
    Frequently used for if-then statements, rarely for if-then-else. The decision is made by a compiler heuristic.
    Optimizes divergence overhead.
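    A sketch of the kind of code the heuristic typically predicates (kernel name is illustrative): the if-then body is a single short statement, so the compiler can guard the store with a predicate instead of emitting a branch.

```cuda
// Hypothetical kernel: the short if-then body is a good candidate for
// predication. Instead of a branch, the compiler can emit roughly:
//   ISETP.LT P0, ... ;        // condition sets predicate P0
//   @P0 ST [x + i*4], RZ;     // store executes only on lanes where P0 holds
__global__ void clamp_negatives(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] < 0.0f)
        x[i] = 0.0f;
}

int main() {
    float h[4] = {-1.0f, 2.0f, -3.0f, 4.0f};
    float* d;
    cudaMalloc(&d, sizeof h);
    cudaMemcpy(d, h, sizeof h, cudaMemcpyHostToDevice);
    clamp_negatives<<<1, 4>>>(d, 4);
    cudaMemcpy(h, d, sizeof h, cudaMemcpyDeviceToHost);
    // h is now {0, 2, 0, 4}
    return 0;
}
```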

    Conditional execution
      A compare instruction sets condition code (CC) registers
      CC is a 4-bit state vector (sign, carry, zero, overflow)
      No write-back (WB) stage for CC-marked registers
      Used in Maxwell to skip unneeded computations for arithmetic operations implemented in hardware with multiple instructions

    IMAD R8.CC, R0, 0x4, R3;   // multiply-add that also updates the carry flag, consumed by a following .X instruction

  • FINAL WORDS
    SIMT is a RISC-based, throughput-oriented architecture
    SIMT combines vector processing and light-weight threading
    SIMT instructions are executed per warp
    A warp has its own PC and activity mask
    Branching is done via divergence, predicated or conditional execution


    2013-2015 CUDA.GEEK