8/10/2019 MULTITHREADING.ppt
CS 162 Computer Architecture
Lecture 10: Multithreading
Instructor: L.N. Bhuyan
www.cs.ucr.edu/~bhuyan/cs162
Adopted from Internet
Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
One way is to interleave execution of instructions from different program threads on the same pipeline
Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe
T1: LW r1, 0(r2)
T2: ADD r7, r1, r4
T3: XORI r5, r4, #12
T4: SW 0(r7), r5
T1: LW r5, 12(r1)
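The interleaving above can be sketched in code. This is a minimal model (not from the slides): with 4 threads issued round-robin, consecutive instructions of the same thread are 4 cycles apart, so T1's second LW reads r1 only after the first LW has nearly left the pipeline, and no bypassing or interlock is needed.

```python
# Round-robin interleave of 4 threads on a single-issue pipeline.
threads = ["T1", "T2", "T3", "T4"]
program = {t: [f"{t}.i{n}" for n in range(3)] for t in threads}

issue_order = []
for round_ in range(3):            # one instruction per thread per round
    for t in threads:
        issue_order.append(program[t][round_])

# Cycle in which each instruction issues (one issue per cycle).
issue_cycle = {inst: cyc for cyc, inst in enumerate(issue_order)}

# Distance between consecutive instructions of the same thread:
gap = issue_cycle["T1.i1"] - issue_cycle["T1.i0"]
print(gap)   # 4 cycles: enough to cover a 5-stage pipe's write-back window
```

With N threads the gap is always N cycles, which is why the slide's 4-thread interleave removes the need for bypass paths on a 5-stage pipe.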
CDC 6600 Peripheral Processors (Cray, 1965)
First multithreaded hardware
10 virtual I/O processors
fixed interleave on simple pipeline
pipeline has 100 ns cycle time
each processor executes one instruction every 1000 ns
accumulator-based instruction set to reduce processor state
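The timing above follows directly from the fixed interleave, a quick check:

```python
# With a fixed 10-way interleave on a 100 ns pipeline, each virtual
# peripheral processor owns one slot every 10 cycles.
num_pps = 10
cycle_ns = 100
per_pp_ns = num_pps * cycle_ns
print(per_pp_ns)   # 1000 ns between instructions of one peripheral processor
```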
Simple Multithreaded Pipeline
Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
Multithreading Costs
Appears to software (including OS) as multiple slower CPUs
Each thread requires its own user state
GPRs
PC
Also needs its own OS control state
virtual memory page table base register
exception handling registers
Other costs?
Thread Scheduling Policies
Fixed interleave (CDC 6600 PPUs, 1965)
each of N threads executes one instruction every N cycles
if thread not ready to go in its slot, insert pipeline bubble
Software-controlled interleave (TI ASC PPUs, 1971)
OS allocates S pipeline slots amongst N threads
hardware performs fixed interleave over S slots, executing whichever thread is in that slot
Hardware-controlled thread scheduling (HEP, 1982)
hardware keeps track of which threads are ready to go
picks next thread to execute based on hardware priority scheme
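The difference between the first and third policies above can be shown on a toy trace. A minimal sketch with assumed names: fixed interleave wastes a thread's slot as a bubble when it is not ready, while HEP-style hardware scheduling issues any ready thread instead (using lowest thread number as a stand-in for the real priority scheme).

```python
def fixed_interleave(ready, n_threads, cycles):
    """Thread (c mod N) owns cycle c; bubble (None) if it is not ready."""
    issued = []
    for c in range(cycles):
        t = c % n_threads
        issued.append(t if ready(t, c) else None)
    return issued

def hep_style(ready, n_threads, cycles):
    """Issue the lowest-numbered ready thread each cycle, if any."""
    issued = []
    for c in range(cycles):
        for t in range(n_threads):
            if ready(t, c):
                issued.append(t)
                break
        else:
            issued.append(None)   # no thread ready at all
    return issued

# Thread 1 is stalled (e.g. waiting on memory) for the whole window.
ready = lambda t, c: t != 1
print(fixed_interleave(ready, 4, 8))   # bubbles in thread 1's slots
print(hep_style(ready, 4, 8))          # no bubbles while others are ready
```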
What Grain Multithreading?
So far assumed fine-grained multithreading
CPU switches every cycle to a different thread
When does this make sense?
Coarse-grained multithreading
CPU switches every few cycles to a different thread
When does this make sense?
Multithreading Design Choices
Context switch to another thread every cycle, or on hazard or L1 miss or L2 miss or network request
Per-thread state and context-switch overhead
Interactions between threads in memory hierarchy
Denelcor HEP (Burton Smith, 1982)
First commercial machine to use hardware threading in main CPU
120 threads per processor
10 MHz clock rate
Up to 8 processors
precursor to Tera MTA (Multithreaded Architecture)
Tera MTA Overview
Up to 256 processors
Up to 128 active threads per processor
Processors and memory modules populate a sparse 3D torus interconnection fabric
Flat, shared main memory
No data cache
Sustains one main memory access per cycle per processor
50 W/processor @ 260 MHz
MTA Instruction Format
Three operations packed into 64-bit instruction word (short VLIW)
One memory operation, one arithmetic operation, plus one arithmetic or branch operation
Memory operations incur ~150 cycles of latency
Explicit 3-bit lookahead field in instruction gives number of subsequent instructions (0-7) that are independent of this one
cf. instruction grouping in VLIW
allows fewer threads to fill machine pipeline
used for variable-sized branch delay slots
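To make the lookahead idea concrete, here is a sketch of packing and reading a 3-bit lookahead field in a 64-bit word. The bit position and names are illustrative assumptions, not the real MTA encoding; the point is that 3 bits encode 0-7 independent successors.

```python
# Hypothetical layout: top 3 bits of the 64-bit instruction word.
LOOKAHEAD_SHIFT = 61
LOOKAHEAD_MASK = 0b111

def set_lookahead(word, k):
    """Mark the next k (0-7) instructions as independent of this one."""
    assert 0 <= k <= 7          # only 3 bits available
    word &= ~(LOOKAHEAD_MASK << LOOKAHEAD_SHIFT)
    return word | (k << LOOKAHEAD_SHIFT)

def get_lookahead(word):
    """Hardware may issue this many following instructions without waiting."""
    return (word >> LOOKAHEAD_SHIFT) & LOOKAHEAD_MASK

w = set_lookahead(0, 5)
print(get_lookahead(w))   # 5
```

A lookahead of k lets one thread keep k+1 instructions in flight, which is why fewer threads suffice to fill the 21-stage pipeline.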
MTA Multithreading
Each processor supports 128 active hardware threads
128 SSWs, 1024 target registers, 4096 general-purpose registers
Every cycle, one instruction from one active thread is launched into pipeline
Instruction pipeline is 21 cycles long
At best, a single thread can issue one instruction every 21 cycles
Clock rate is 260 MHz, effective single-thread issue rate is 260/21 = 12.4 MHz
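The single-thread rate above is just the clock divided by the pipeline depth, since one thread keeps at most one instruction in the 21-stage pipe:

```python
# With one instruction per thread in flight and a 21-cycle pipeline,
# a single thread issues at most once every 21 cycles.
clock_mhz = 260
pipeline_depth = 21
single_thread_mhz = clock_mhz / pipeline_depth
print(round(single_thread_mhz, 1))   # 12.4 MHz effective issue rate
```

Conversely, 21 ready threads are enough to issue every cycle, and the 128-thread budget leaves headroom to cover the ~150-cycle memory latency as well.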
MTA Pipeline
Coarse-Grain Multithreading
Tera MTA designed for supercomputing applications with large data sets and low locality
No data cache
Many parallel threads needed to hide large memory latency
Other applications are more cache friendly
Few pipeline bubbles when cache getting hits
Just add a few threads to hide occasional cache miss latencies
Swap threads on cache misses
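The switch-on-miss idea above can be sketched as a toy model (all names and numbers here are illustrative): each thread runs until it misses, then the CPU swaps to another ready thread so the miss latency is hidden by useful work.

```python
def run_coarse_grained(miss_cycles, miss_latency, total_cycles):
    """miss_cycles[t]: set of instruction indices at which thread t misses."""
    n = len(miss_cycles)
    blocked_until = [0] * n            # cycle when each thread becomes ready
    current, executed, trace = 0, [0] * n, []
    for c in range(total_cycles):
        if c < blocked_until[current] or executed[current] in miss_cycles[current]:
            if executed[current] in miss_cycles[current]:
                blocked_until[current] = c + miss_latency   # start the miss
                miss_cycles[current].discard(executed[current])
            # swap to any ready thread, if one exists
            ready = [t for t in range(n) if blocked_until[t] <= c]
            if not ready:
                trace.append(None)     # all threads waiting on memory
                continue
            current = ready[0]
        executed[current] += 1
        trace.append(current)
    return trace

# Thread 0 misses on its 3rd instruction; thread 1 covers the latency.
print(run_coarse_grained([{2}, set()], miss_latency=4, total_cycles=8))
```

With only two threads, one cache-friendly companion thread already keeps the pipeline busy across an occasional miss, matching the slide's "just add a few threads".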
MIT Alewife
Modified SPARC chips
register windows hold different thread contexts
Up to four threads per node
Thread switch on local cache miss
IBM PowerPC RS64-III (Pulsar)
Commercial coarse-grain multithreading CPU
Based on PowerPC with quad-issue in-order five-stage pipeline
Each physical CPU supports two virtual CPUs
On L2 cache miss, pipeline is flushed and execution switches to second thread
short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
flush pipeline to simplify exception handling
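The tradeoff above is worth a quick check: switching costs a 4-cycle flush, which pays off whenever the miss latency is much larger. The memory latency below is an assumed illustrative number, not from the slides.

```python
flush_penalty = 4      # cycles, from the slide
miss_latency = 100     # cycles, an assumed L2-miss-to-memory latency
cycles_hidden = miss_latency - flush_penalty
print(cycles_hidden)   # cycles of useful work the second thread gains
```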
Speculative, Out-of-Order Superscalar Processor
Superscalar Machine Efficiency
Why horizontal waste? (issue slots left unused within a cycle)
Why vertical waste? (entire issue cycles left unused)
Vertical Multithreading
Cycle-by-cycle interleaving of second thread removes vertical waste
Ideal Multithreading for Superscalar
Interleave multiple threads into multiple issue slots with no restrictions
Simultaneous Multithreading
Add multiple contexts and fetch engines to wide out-of-order superscalar processor [Tullsen, Eggers, Levy, UW, 1995]
OOO instruction window already has most of the circuitry required to schedule from multiple threads
Any single thread can utilize whole machine
Comparison of Issue Capabilities (courtesy of Susan Eggers; used with permission)
From Superscalar to SMT
SMT is an out-of-order superscalar extended with hardware to support multiple executing threads
From Superscalar to SMT
Extra pipeline stages for accessing thread-shared register files
From Superscalar to SMT
Fetch from the two highest-throughput threads. Why?
From Superscalar to SMT
Small items:
per-thread program counters
per-thread return stacks
per-thread bookkeeping for instruction retirement, trap & instruction dispatch queue flush
thread identifiers, e.g., with BTB & TLB entries
Simultaneous Multithreaded Processor
SMT Design Issues
Which thread to fetch from next?
Don't want to clog instruction window with thread with many stalls
try to fetch from thread that has fewest insts in window
Locks
Virtual CPU spinning on lock executes many instructions but gets nowhere
add ISA support to lower priority of thread spinning on lock
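The fewest-instructions-in-window heuristic above (an ICOUNT-style policy, in the terminology of Tullsen et al.) can be sketched as a small selection function; the spinning-thread exclusion stands in for the ISA-level priority lowering mentioned for locks. Names and the set-based lock interface are illustrative assumptions.

```python
def pick_fetch_thread(insts_in_window, spinning_on_lock=()):
    """insts_in_window[t] = instructions thread t already has in the window.
    Fetch from the least-represented thread, skipping lock spinners."""
    candidates = [t for t in range(len(insts_in_window))
                  if t not in spinning_on_lock]   # ISA-lowered priority
    return min(candidates, key=lambda t: insts_in_window[t])

print(pick_fetch_thread([12, 3, 30, 7]))                        # thread 1
print(pick_fetch_thread([12, 3, 30, 7], spinning_on_lock={1}))  # thread 3
```

A stalled thread accumulates instructions in the window and stops being selected, so it cannot starve the other threads of window entries.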
Intel Pentium-4 Xeon Processor
Hyperthreading == SMT
Dual physical processors, each 2-way SMT
Logical processors share nearly all resources of the physical processor
Caches, execution units, branch predictors
Die area overhead of hyperthreading ~5%
When one logical processor is stalled, the other can make progress
No logical processor can use all entries in queues when two threads are active
A processor running only one active software thread runs at the same speed with or without hyperthreading