MULTITHREADING.ppt


Slide 1/30

    CS 162 Computer Architecture

    Lecture 10: Multithreading

    Instructor: L.N. Bhuyan

http://www.cs.ucr.edu/~bhuyan/cs162

(Adopted from Internet)
Slide 2/30

Slide 3/30

Multithreading

How can we guarantee no dependencies between instructions in a pipeline?

One way is to interleave execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe (a scheduling sketch follows the example):

    T1: LW r1, 0(r2)

    T2: ADD r7, r1, r4

    T3: XORI r5, r4, #12

    T4: SW 0(r7), r5

    T1: LW r5, 12(r1)
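A minimal C sketch of the fixed round-robin selection implied here (not from the slides; the names and printed trace are illustrative). With 4 threads rotating on a 5-stage pipe, two instructions from the same thread are always 4 cycles apart, so T1's LW has reached writeback before T1's next instruction reads its registers, and no bypassing is needed.

    /* Hypothetical fine-grained (cycle-by-cycle) thread interleave. */
    #include <stdio.h>

    #define NUM_THREADS 4

    int main(void) {
        const char *next_inst[NUM_THREADS] = {   /* next instruction of each thread */
            "T1: LW   r1, 0(r2)",
            "T2: ADD  r7, r1, r4",
            "T3: XORI r5, r4, #12",
            "T4: SW   0(r7), r5",
        };

        for (int cycle = 0; cycle < 8; cycle++) {
            int tid = cycle % NUM_THREADS;       /* fixed interleave: slot i -> thread i */
            printf("cycle %2d: issue %s\n", cycle, next_inst[tid]);
        }
        return 0;
    }

Running it prints a cycle-by-cycle trace in which consecutive instructions from the same thread are always separated by the other three threads.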

Slide 4/30

CDC 6600 Peripheral Processors (Cray, 1965)

First multithreaded hardware

10 virtual I/O processors

Fixed interleave on simple pipeline; pipeline has 100ns cycle time

Each processor executes one instruction every 1000ns

Accumulator-based instruction set to reduce processor state

Slide 5/30

Simple Multithreaded Pipeline

Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage (see the sketch below).
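A minimal sketch, assuming a simple in-order pipe (the structure and field names are mine, not from the slides): every pipeline latch carries a thread id alongside the instruction, and each stage uses that id, rather than a single global "current thread", to pick the right copy of the state.

    /* Hypothetical pipeline latch for a multithreaded in-order pipe. */
    #include <stdint.h>

    #define NUM_THREADS 4
    #define NUM_GPRS    32

    typedef struct {
        uint8_t  tid;     /* thread select, carried stage to stage  */
        uint32_t inst;    /* instruction bits owned by thread tid   */
        uint32_t result;  /* value to write back once computed      */
        uint8_t  rd;      /* destination register number            */
    } pipe_latch_t;

    static uint32_t gpr[NUM_THREADS][NUM_GPRS];   /* one register file per thread */

    /* Writeback stage: index architectural state by the latch's thread id,
     * never by a single global "current thread". */
    void writeback(const pipe_latch_t *wb) {
        gpr[wb->tid][wb->rd] = wb->result;
    }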

Slide 6/30

Multithreading Costs

Appears to software (including the OS) as multiple slower CPUs

Each thread requires its own user state:

GPRs

PC

Also needs its own OS control state:

virtual memory page table base register

exception handling registers

Other costs?
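A rough sketch of the state that must be replicated per thread, covering the user state and OS control state listed above; the field names and sizes are illustrative assumptions, not taken from any particular machine.

    /* Hypothetical per-thread hardware context for a multithreaded CPU. */
    #include <stdint.h>

    #define NUM_THREADS 4
    #define NUM_GPRS    32

    typedef struct {
        /* User state, one copy per thread */
        uint32_t gpr[NUM_GPRS];       /* general-purpose registers       */
        uint32_t pc;                  /* program counter                 */

        /* OS control state, also per thread */
        uint32_t page_table_base;     /* virtual-memory page table base  */
        uint32_t exception_pc;        /* exception handling registers... */
        uint32_t exception_cause;     /* ...names are illustrative       */
    } thread_context_t;

    /* The replication itself (plus the muxes to select among the copies)
     * is the main hardware cost the slide is pointing at. */
    static thread_context_t contexts[NUM_THREADS];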

Slide 7/30

Thread Scheduling Policies

Fixed interleave (CDC 6600 PPUs, 1965)

each of N threads executes one instruction every N cycles

if thread not ready to go in its slot, insert pipeline bubble

Software-controlled interleave (TI ASC PPUs, 1971)

OS allocates S pipeline slots amongst N threads

hardware performs fixed interleave over S slots, executing whichever thread is in that slot

Hardware-controlled thread scheduling (HEP, 1982)

hardware keeps track of which threads are ready to go

picks next thread to execute based on hardware priority scheme
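A sketch of the three policies above, reduced to "which thread issues this cycle"; the data structures and the lowest-id priority rule are simplifying assumptions, not descriptions of the actual CDC, TI, or HEP hardware.

    #include <stdbool.h>

    #define N 8                     /* hardware thread contexts / pipeline slots */

    static bool ready[N];           /* thread can issue this cycle               */
    static int  slot_map[N];        /* software-controlled: thread id per slot   */

    /* Fixed interleave (CDC 6600 PPU style): slot i always belongs to thread i;
     * return -1 to mean "insert a pipeline bubble". */
    int fixed_interleave(int cycle) {
        int tid = cycle % N;
        return ready[tid] ? tid : -1;
    }

    /* Software-controlled interleave (TI ASC PPU style): the OS fills slot_map,
     * hardware still rotates over the S (= N here) slots in fixed order. */
    int software_interleave(int cycle) {
        int tid = slot_map[cycle % N];
        return ready[tid] ? tid : -1;
    }

    /* Hardware-controlled scheduling (HEP style): pick a ready thread by a
     * hardware priority rule -- here simply the lowest-numbered ready thread. */
    int hardware_scheduled(void) {
        for (int tid = 0; tid < N; tid++)
            if (ready[tid])
                return tid;
        return -1;                  /* nothing ready: bubble */
    }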

Slide 8/30

What Grain Multithreading?

So far assumed fine-grained multithreading

CPU switches every cycle to a different thread. When does this make sense?

Coarse-grained multithreading

CPU switches every few cycles to a different thread. When does this make sense?

Slide 9/30

Multithreading Design Choices

Context switch to another thread every cycle, or on hazard, L1 miss, L2 miss, or network request

Per-thread state and context-switch overhead

Interactions between threads in the memory hierarchy

Slide 10/30

Denelcor HEP (Burton Smith, 1982)

First commercial machine to use hardware threading in main CPU

120 threads per processor

10 MHz clock rate

Up to 8 processors

Precursor to Tera MTA (Multithreaded Architecture)

Slide 11/30

Tera MTA Overview

Up to 256 processors

Up to 128 active threads per processor

Processors and memory modules populate a sparse 3D torus interconnection fabric

Flat, shared main memory

No data cache

Sustains one main memory access per cycle per processor

50W/processor @ 260MHz

Slide 12/30

MTA Instruction Format

Three operations packed into a 64-bit instruction word (short VLIW)

One memory operation, one arithmetic operation, plus one arithmetic or branch operation

Memory operations incur ~150 cycles of latency

Explicit 3-bit lookahead field in the instruction gives the number of subsequent instructions (0-7) that are independent of this one (cf. instruction grouping in VLIW)

allows fewer threads to fill the machine pipeline (a small worked example follows)

used for variable-sized branch delay slots
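A back-of-the-envelope model (my assumption, not the exact MTA issue rule) of why lookahead lets fewer threads fill the pipe: if each instruction promises that its next L instructions are independent, a thread can keep L+1 instructions in flight, so roughly ceil(depth / (L+1)) threads cover the pipeline.

    #include <stdio.h>

    /* Threads needed if each instruction's lookahead lets a thread keep
     * lookahead+1 of its own instructions in flight at once. */
    static int threads_needed(int pipeline_depth, int lookahead) {
        int in_flight_per_thread = lookahead + 1;
        return (pipeline_depth + in_flight_per_thread - 1) / in_flight_per_thread;
    }

    int main(void) {
        for (int la = 0; la <= 7; la++)   /* 3-bit lookahead field: 0..7 */
            printf("lookahead %d -> ~%2d threads to fill a 21-stage pipeline\n",
                   la, threads_needed(21, la));
        return 0;
    }

With the maximum lookahead of 7, about 3 threads suffice instead of 21 in this simplified model.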

Slide 13/30

MTA Multithreading

Each processor supports 128 active hardware threads

128 SSWs, 1024 target registers, 4096 general-purpose registers

Every cycle, one instruction from one active thread is launched into the pipeline

Instruction pipeline is 21 cycles long

At best, a single thread can issue one instruction every 21 cycles

Clock rate is 260MHz; effective single-thread issue rate is 260/21 = 12.4MHz

Slide 14/30

MTA Pipeline

Slide 15/30

Coarse-Grain Multithreading

Tera MTA designed for supercomputing applications with large data sets and low locality

No data cache

Many parallel threads needed to hide large memory latency

Other applications are more cache-friendly

Few pipeline bubbles when the cache is getting hits

Just add a few threads to hide occasional cache-miss latencies

Swap threads on cache misses
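A sketch of the switch-on-miss idea, under the assumption of a simple ready/blocked bit per thread; this is illustrative, not the policy of any specific machine.

    #include <stdbool.h>

    #define NUM_THREADS 4

    static bool ready[NUM_THREADS] = { true, true, true, true };
    static int  current = 0;

    /* Called once per cycle by the issue logic; returns the thread to run.
     * Unlike fine-grained interleave, we stay on `current` until it misses. */
    int select_thread(bool cache_miss_on_current) {
        if (cache_miss_on_current) {
            ready[current] = false;              /* blocked until the miss returns */
            for (int i = 1; i <= NUM_THREADS; i++) {
                int cand = (current + i) % NUM_THREADS;
                if (ready[cand]) { current = cand; break; }
            }
        }
        return current;   /* if no thread is ready, the pipeline simply stalls */
    }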

Slide 16/30

MIT Alewife

    Modified SPARC chips

register windows hold different thread contexts

    Up to four threads per node

    Thread switch on local cache miss

Slide 17/30

IBM PowerPC RS64-III (Pulsar)

Commercial coarse-grain multithreading CPU

Based on PowerPC with a quad-issue, in-order, five-stage pipeline

Each physical CPU supports two virtual CPUs

On an L2 cache miss, the pipeline is flushed and execution switches to the second thread

short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency

flush pipeline to simplify exception handling

Slide 18/30

Speculative, Out-of-Order Superscalar Processor

Slide 19/30

Superscalar Machine Efficiency

Why horizontal waste? (issue slots left unused within a cycle)

Why vertical waste? (cycles in which no instruction issues at all)

Slide 20/30

Vertical Multithreading

Cycle-by-cycle interleaving of a second thread removes vertical waste

Slide 21/30

Ideal Multithreading for Superscalar

Interleave multiple threads to multiple issue slots with no restrictions

Slide 22/30

Simultaneous Multithreading

Add multiple contexts and fetch engines to a wide out-of-order superscalar processor [Tullsen, Eggers, Levy, UW, 1995]

OOO instruction window already has most of the circuitry required to schedule from multiple threads

Any single thread can utilize the whole machine

Slide 23/30

Comparison of Issue Capabilities

(Courtesy of Susan Eggers; used with permission)

Slide 24/30

From Superscalar to SMT

SMT is an out-of-order superscalar extended with hardware to support multiple executing threads

Slide 25/30

From Superscalar to SMT

Extra pipeline stages for accessing thread-shared register files

Slide 26/30

From Superscalar to SMT

Fetch from the two highest-throughput threads.

    Why?

Slide 27/30

From Superscalar to SMT

Small items:

per-thread program counters

per-thread return stacks

per-thread bookkeeping for instruction retirement, trap & instruction dispatch queue flush

thread identifiers, e.g., with BTB & TLB entries
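A sketch of how these small items might look as hardware state: cheap per-thread structures are simply replicated, while shared structures such as the BTB or TLB are tagged with a thread id. Sizes and field names are illustrative assumptions, not from any real core.

    #include <stdint.h>

    #define NUM_THREADS 2
    #define RAS_DEPTH   16

    /* Replicated per thread: small, so simply duplicated. */
    typedef struct {
        uint64_t pc;                        /* per-thread program counter      */
        uint64_t return_stack[RAS_DEPTH];   /* per-thread return-address stack */
        int      ras_top;
    } smt_thread_frontend_t;

    /* Shared structures are instead tagged with a thread id so entries
     * belonging to different threads are never confused. */
    typedef struct {
        uint64_t tag;
        uint64_t target;
        uint8_t  tid;
    } btb_entry_t;

    static smt_thread_frontend_t frontend[NUM_THREADS];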

Slide 28/30

Simultaneous Multithreaded Processor

Slide 29/30

SMT Design Issues

Which thread to fetch from next?

Don't want to clog the instruction window with a thread that has many stalls; try to fetch from the thread that has the fewest instructions in the window (sketched below)

Locks

A virtual CPU spinning on a lock executes many instructions but gets nowhere; add ISA support to lower the priority of a thread spinning on a lock
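A sketch of the fetch-steering idea (it resembles the ICOUNT policy from the SMT literature, but the code and names here are illustrative assumptions): each cycle, prefer the thread with the fewest instructions currently occupying the window.

    #define NUM_THREADS 2

    /* Instructions each thread currently has in the decode/rename/issue
     * queues; incremented at dispatch, decremented at issue or flush. */
    static int insts_in_window[NUM_THREADS];

    /* Fetch steering: prefer the thread occupying the least window space,
     * so a thread stalled on long-latency misses cannot clog the machine. */
    int pick_fetch_thread(void) {
        int best = 0;
        for (int t = 1; t < NUM_THREADS; t++)
            if (insts_in_window[t] < insts_in_window[best])
                best = t;
        return best;
    }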


Slide 30/30

Intel Pentium-4 Xeon Processor

Hyperthreading == SMT

Dual physical processors, each 2-way SMT

Logical processors share nearly all resources of the physical processor: caches, execution units, branch predictors

Die area overhead of hyperthreading ~5%

When one logical processor is stalled, the other can make progress

No logical processor can use all entries in queues when two threads are active

Allows a processor running only one active software thread to run at the same speed with or without hyperthreading