Code GPU with CUDA - SIMT




    Created by Marina Kolpakova for cuda.geek, Itseez


  • OUTLINE
    Hardware revisions
    SIMT architecture
    Warp scheduling
    Divergence & convergence
    Predicated execution
    Conditional execution

  • OUT OF SCOPE
    Computer graphics capabilities

  • HARDWARE REVISIONS
    SM (shading model) denotes a particular hardware implementation.

    Generation  SM     GPU models
    Tesla       sm_10  G80 G92(b) G94(b)
                sm_11  G86 G84 G98 G96(b) G94(b) G92(b)
                sm_12  GT218 GT216 GT215
                sm_13  GT200 GT200b
    Fermi       sm_20  GF100 GF110
                sm_21  GF104 GF114 GF116 GF108 GF106
    Kepler      sm_30  GK104 GK106 GK107
                sm_32  GK20A
                sm_35  GK110 GK208
                sm_37  GK210
    Maxwell     sm_50  GM107 GM108
                sm_52  GM204
                sm_53  GM20B
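    The SM revision is also visible to device code: nvcc defines the __CUDA_ARCH__ macro (the targeted SM version times ten, e.g. 350 for sm_35) during device compilation, so a kernel can be specialized per hardware revision. A minimal sketch (the kernel name is illustrative):

```cuda
// __CUDA_ARCH__ is defined only while device code is being compiled,
// and holds the targeted SM version * 10 (e.g. 500 for sm_50).
__global__ void scaled_copy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 350
    out[i] = __ldg(&in[i]);   // sm_35+: load via the read-only data cache
#else
    out[i] = in[i];           // older revisions: plain global load
#endif
}
```

    Compile with, for example, nvcc -arch=sm_35 to target Kepler GK110-class parts.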

  • LATENCY VS THROUGHPUT ARCHITECTURES
    Modern CPUs and GPUs are both multi-core systems.

    CPUs are latency oriented:
      Pipelining, out-of-order, superscalar execution
      Caching, on-die memory controllers
      Speculative execution, branch prediction
      Compute cores occupy only a small part of the die

    GPUs are throughput oriented:
      100s of simple compute cores
      Zero-cost scheduling of 1000s of threads
      Compute cores occupy most of the die

  • SIMD vs SIMT vs SMT
    SIMT: Single Instruction, Multiple Threads

    SIMD: elements of short vectors are processed in parallel. Represents the problem as short vectors and processes it vector by vector. Hardware support for wide arithmetic.
    SMT: instructions from several threads are run in parallel. Represents the problem as a set of independent tasks and assigns them to different threads. Hardware support for multi-threading.
    SIMT: vector processing + light-weight threading:
      A warp is the unit of execution; it performs the same instruction on all lanes each cycle. A warp is 32 lanes wide.
      Thread scheduling and fast context switching between warps minimize stalls.
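    A warp and a thread's lane within it are not exposed as built-in indices, but both follow directly from threadIdx. A minimal sketch (kernel name is illustrative):

```cuda
#include <cstdio>

// Each thread derives its warp number within the block and its lane
// within the warp. warpSize is 32 on all hardware listed above.
__global__ void warp_lane_ids() {
    int warpId = threadIdx.x / warpSize;  // which warp of the block
    int laneId = threadIdx.x % warpSize;  // this thread's lane in its warp
    if (laneId == 0)
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warpId, threadIdx.x);
}

int main() {
    warp_lane_ids<<<1, 64>>>();  // one block of 64 threads = 2 warps
    cudaDeviceSynchronize();
    return 0;
}
```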


    1. SIMT is an abstraction over vector hardware:
       Threads are grouped into warps (32 threads for NVIDIA)
       A thread in a warp is usually called a lane
       Vector register file; registers are accessed line by line
       A lane loads the laneId-th element of a register
       Single program counter (PC) for the whole warp
       Only a couple of special registers, like the PC, can be scalar

    2. SIMT hardware is responsible for warp scheduling:
       Static for all latest hardware revisions
       Zero overhead on context switching
       Score-boarding for long-latency operations


    Memory instructions are separated from arithmetic
    Arithmetic is performed only on registers and immediates
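    This load/store split is visible in what the compiler emits for even a trivial kernel. The SASS in the comment below is an approximate sketch, not the exact output of any particular compiler version:

```cuda
// Load/store architecture: memory ops are separate instructions from
// arithmetic. For c[i] = a[i] + b[i] the compiler emits roughly:
//   LD   R2, [a + i*4];   // memory instruction
//   LD   R3, [b + i*4];   // memory instruction
//   FADD R4, R2, R3;      // arithmetic on registers only
//   ST   [c + i*4], R4;   // memory instruction
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
```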

  • SIMT PIPELINE
    The warp scheduler manages warps and selects one that is ready to execute
    A fetch/decode unit is associated with each warp scheduler
    Execution units are SC (scalar cores), SFU (special function units), LD/ST (load/store units)

    Area- and power-efficiency thanks to regularity.

  • VECTOR REGISTER FILE
    ~Zero-cost warp switching requires a big vector register file (RF)

    While a warp is resident on an SM it occupies a portion of the RF
    The GPU's RF is 32-bit; 64-bit values are stored in register pairs
    Fast switching costs register wastage on duplicated items
    Narrow data types are as costly as wide data types

    Size of the RF depends on the architecture. Fermi: 128KB per SM, Kepler: 256KB per SM, Maxwell: 64KB per scheduler.
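    These sizes bound how many warps can be resident at once. A back-of-the-envelope sketch using the Kepler figure from the slide (the per-thread register count is a hypothetical example):

```cuda
#include <cstdio>

int main() {
    // Slide numbers: Kepler has a 256KB register file per SM;
    // registers are 32-bit (4 bytes); a warp is 32 lanes wide.
    const int rf_bytes        = 256 * 1024;
    const int regs_per_thread = 32;   // hypothetical kernel register usage
    const int warp_size       = 32;

    // Each resident warp pins regs_per_thread registers per lane.
    int bytes_per_warp = regs_per_thread * 4 * warp_size;  // 4096 bytes
    int max_warps      = rf_bytes / bytes_per_warp;        // 64 warps
    printf("register file alone allows %d resident warps per SM\n", max_warps);
    return 0;
}
```

    Other limits (scheduler slots, shared memory) can lower this further; the RF is just one of the occupancy constraints.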


    In-order execution
      Instructions are fetched, executed & completed in compiler-generated order
      In case one instruction stalls, all following instructions stall too

    Dynamic scheduling
      Instructions are fetched in compiler-generated order
      Instructions are executed out of order
      A special unit tracks dependencies and reorders instructions
      Independent instructions behind a stalled instruction can pass it

  • WARP SCHEDULING
    The GigaThread engine subdivides work between SMs
    Work for an SM is sent to a warp scheduler
    Once assigned, a warp cannot migrate between schedulers
    A warp has its own lines in the register file, its own PC and activity mask
    A warp can be in one of the following states:

      Executing - performs an operation
      Ready - waits to be executed
      Waiting - waits for resources
      Resident - waits for completion of other warps within the same block

  • WARP SCHEDULING
    Depending on the generation, scheduling is dynamic (Fermi) or static (Kepler, Maxwell)

  • WARP SCHEDULING (CONT)
    Modern warp schedulers support dual issue (sm_21+): decoding an instruction pair for an active warp per clock

    An SM has 2 or 4 warp schedulers depending on the architecture

    Warps belong to blocks; hardware tracks this relation as well

  • DIVERGENCE & (RE)CONVERGENCE
    Divergence: not all lanes in a warp take the same code path

    Convergence is handled via a convergence stack. A convergence stack entry includes:

      convergence PC
      next-path PC
      lane mask (marks the active lanes on that path)

    The SSY instruction pushes an entry onto the convergence stack. It occurs before potentially divergent instructions.
    The .S suffix indicates the convergence point: the instruction after which all lanes in a warp take the same code path.

  • DIVERGENT CODE EXAMPLE
    (void) atomicAdd( &smem[0], src[threadIdx.x] );

    /*0050*/ SSY 0x80;
    /*0058*/ LDSLK P0, R3, [RZ];
    /*0060*/ @P0 IADD R3, R3, R0;
    /*0068*/ @P0 STSUL [RZ], R3;
    /*0070*/ @!P0 BRA 0x58;
    /*0078*/ NOP.S;

    Assume warp size == 4
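    The one-liner above can be wrapped in a complete program for experimentation; the kernel name and sizes below are illustrative. The LDSLK/STSUL pair in the SASS is a shared-memory lock/unlock: lanes that fail to take the lock loop back (the @!P0 BRA), so the warp diverges until every lane has completed its add.

```cuda
#include <cstdio>

// Hypothetical wrapper around the slide's divergent one-liner.
__global__ void sum_kernel(const int* src, int* dst) {
    __shared__ int smem[1];
    if (threadIdx.x == 0) smem[0] = 0;
    __syncthreads();
    (void) atomicAdd(&smem[0], src[threadIdx.x]);  // divergent retry loop in SASS
    __syncthreads();
    if (threadIdx.x == 0) *dst = smem[0];
}

int main() {
    int h_src[32], h_dst = 0;
    for (int i = 0; i < 32; ++i) h_src[i] = 1;
    int *d_src, *d_dst;
    cudaMalloc(&d_src, sizeof h_src);
    cudaMalloc(&d_dst, sizeof h_dst);
    cudaMemcpy(d_src, h_src, sizeof h_src, cudaMemcpyHostToDevice);
    sum_kernel<<<1, 32>>>(d_src, d_dst);
    cudaMemcpy(&h_dst, d_dst, sizeof h_dst, cudaMemcpyDeviceToHost);
    printf("%d\n", h_dst);  // 32: one contribution per lane
    return 0;
}
```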


  • PREDICATED EXECUTION
    Frequently used for if-then statements, rarely for if-then-else. The decision is made by a compiler heuristic.
    Optimizes divergence overhead.
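    A sketch of the kind of code the heuristic typically predicates (kernel name is illustrative): the if-then body is a single short statement, so the compiler can guard the store with a predicate instead of emitting a branch.

```cuda
// Hypothetical kernel: the short if-then body is a good candidate for
// predication. Instead of a branch, the compiler can emit roughly:
//   ISETP.LT P0, ... ;        // condition sets predicate P0
//   @P0 ST [x + i*4], RZ;     // store executes only on lanes where P0 holds
__global__ void clamp_negatives(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] < 0.0f)
        x[i] = 0.0f;
}

int main() {
    float h[4] = {-1.0f, 2.0f, -3.0f, 4.0f};
    float* d;
    cudaMalloc(&d, sizeof h);
    cudaMemcpy(d, h, sizeof h, cudaMemcpyHostToDevice);
    clamp_negatives<<<1, 4>>>(d, 4);
    cudaMemcpy(h, d, sizeof h, cudaMemcpyDeviceToHost);
    // h is now {0, 2, 0, 4}
    return 0;
}
```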

    Conditional execution
      A compare instruction sets condition code (CC) registers
      CC is a 4-bit state vector (sign, carry, zero, overflow)
      No write-back (WB) stage for CC-marked registers
      Used in Maxwell to skip unneeded computations for arithmetic operations implemented in hardware with multiple instructions

    IMAD R8.CC, R0, 0x4, R3;   // multiply-add that also updates the carry flag, consumed by a following .X instruction

  • FINAL WORDS
    SIMT is a RISC-based, throughput-oriented architecture
    SIMT combines vector processing and light-weight threading
    SIMT instructions are executed per warp
    A warp has its own PC and activity mask
    Branching is done via divergence, predicated or conditional execution


    2013-2015 CUDA.GEEK