8/10/2019 MULTITHREADING.ppt
CS 162 Computer Architecture
Lecture 10: Multithreading
Instructor: L.N. Bhuyan
www.cs.ucr.edu/~bhuyan/cs162
Adopted from Internet
Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
One way is to interleave execution of instructions from different program threads on the same pipeline
Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe
T1: LW r1, 0(r2)
T2: ADD r7, r1, r4
T3: XORI r5, r4, #12
T4: SW 0(r7), r5
T1: LW r5, 12(r1)
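The interleaving above can be sketched in code. This is a minimal model (not from the slides): with 4 threads issued round-robin, consecutive instructions of the same thread are 4 cycles apart, so T1's second LW reads r1 only after the first LW has nearly left the pipeline, and no bypassing or interlock is needed.

```python
# Round-robin interleave of 4 threads on a single-issue pipeline.
threads = ["T1", "T2", "T3", "T4"]
program = {t: [f"{t}.i{n}" for n in range(3)] for t in threads}

issue_order = []
for round_ in range(3):            # one instruction per thread per round
    for t in threads:
        issue_order.append(program[t][round_])

# Cycle in which each instruction issues (one issue per cycle).
issue_cycle = {inst: cyc for cyc, inst in enumerate(issue_order)}

# Distance between consecutive instructions of the same thread:
gap = issue_cycle["T1.i1"] - issue_cycle["T1.i0"]
print(gap)   # 4 cycles: enough to cover a 5-stage pipe's write-back window
```

With N threads the gap is always N cycles, which is why the slide's 4-thread interleave removes the need for bypass paths on a 5-stage pipe.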
CDC 6600 Peripheral Processors (Cray, 1965)
First multithreaded hardware
10 virtual I/O processors
fixed interleave on simple pipeline
pipeline has 100 ns cycle time
each processor executes one instruction every 1000 ns
accumulator-based instruction set to reduce processor state
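The timing above follows directly from the fixed interleave, a quick check:

```python
# With a fixed 10-way interleave on a 100 ns pipeline, each virtual
# peripheral processor owns one slot every 10 cycles.
num_pps = 10
cycle_ns = 100
per_pp_ns = num_pps * cycle_ns
print(per_pp_ns)   # 1000 ns between instructions of one peripheral processor
```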
Simple Multithreaded Pipeline
Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
Multithreading Costs
Appears to software (including OS) as multiple slower CPUs
Each thread requires its own user state
GPRs
PC
Also needs its own OS control state
virtual memory page table base register
exception handling registers
Other costs?
Thread Scheduling Policies
Fixed interleave (CDC 6600 PPUs, 1965)
each of N threads executes one instruction every N cycles
if thread not ready to go in its slot, insert pipeline bubble
Software-controlled interleave (TI ASC PPUs, 1971)
OS allocates S pipeline slots amongst N threads
hardware performs fixed interleave over S slots, executing whichever thread is in that slot
Hardware-controlled thread scheduling (HEP, 1982)
hardware keeps track of which threads are ready to go
picks next thread to execute based on hardware priority scheme
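The difference between the first and third policies above can be shown on a toy trace. A minimal sketch with assumed names: fixed interleave wastes a thread's slot as a bubble when it is not ready, while HEP-style hardware scheduling issues any ready thread instead (using lowest thread number as a stand-in for the real priority scheme).

```python
def fixed_interleave(ready, n_threads, cycles):
    """Thread (c mod N) owns cycle c; bubble (None) if it is not ready."""
    issued = []
    for c in range(cycles):
        t = c % n_threads
        issued.append(t if ready(t, c) else None)
    return issued

def hep_style(ready, n_threads, cycles):
    """Issue the lowest-numbered ready thread each cycle, if any."""
    issued = []
    for c in range(cycles):
        for t in range(n_threads):
            if ready(t, c):
                issued.append(t)
                break
        else:
            issued.append(None)   # no thread ready at all
    return issued

# Thread 1 is stalled (e.g. waiting on memory) for the whole window.
ready = lambda t, c: t != 1
print(fixed_interleave(ready, 4, 8))   # bubbles in thread 1's slots
print(hep_style(ready, 4, 8))          # no bubbles while others are ready
```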
What Grain Multithreading?
So far assumed fine-grained multithreading
CPU switches every cycle to a different thread
When does this make sense?
Coarse-grained multithreading
CPU switches every few cycles to a different thread
When does this make sense?
Multithreading Design Choices
Context switch to another thread every cycle, or on hazard or L1 miss or L2 miss or network request
Per-thread state and context-switch overhead
Interactions between threads in memory hierarchy
Denelcor HEP (Burton Smith, 1982)
First commercial machine to use hardware threading in main CPU
120 threads per processor
10 MHz clock rate
Up to 8 processors
precursor to Tera MTA (Multithreaded Architecture)
Tera MTA Overview
Up to 256 processors
Up to 128 active threads per processor
Processors and memory modules populate a sparse 3D torus interconnection fabric
Flat, shared main memory
No data cache
Sustains one main memory access per cycle per processor
50 W/processor @ 260 MHz
MTA Instruction Format
Three operations packed into 64-bit instruction word (short VLIW)
One memory operation, one arithmetic operation, plus one arithmetic or branch operation
Memory operations incur ~150 cycles of latency
Explicit 3-bit lookahead field in instruction gives number of subsequent instructions (0-7) that are independent of this one
cf. instruction grouping in VLIW
allows fewer threads to fill machine pipeline
used for variable-sized branch delay slots
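To make the lookahead idea concrete, here is a sketch of packing and reading a 3-bit lookahead field in a 64-bit word. The bit position and names are illustrative assumptions, not the real MTA encoding; the point is that 3 bits encode 0-7 independent successors.

```python
# Hypothetical layout: top 3 bits of the 64-bit instruction word.
LOOKAHEAD_SHIFT = 61
LOOKAHEAD_MASK = 0b111

def set_lookahead(word, k):
    """Mark the next k (0-7) instructions as independent of this one."""
    assert 0 <= k <= 7          # only 3 bits available
    word &= ~(LOOKAHEAD_MASK << LOOKAHEAD_SHIFT)
    return word | (k << LOOKAHEAD_SHIFT)

def get_lookahead(word):
    """Hardware may issue this many following instructions without waiting."""
    return (word >> LOOKAHEAD_SHIFT) & LOOKAHEAD_MASK

w = set_lookahead(0, 5)
print(get_lookahead(w))   # 5
```

A lookahead of k lets one thread keep k+1 instructions in flight, which is why fewer threads suffice to fill the 21-stage pipeline.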
MTA Multithreading
Each processor supports 128 active hardware threads
128 SSWs, 1024 target registers, 4096 general-purpose registers
Every cycle, one instruction from one active thread is launched into pipeline
Instruction pipeline is 21 cycles long
At best, a single thread can issue one instruction every 21 cycles
Clock rate is 260 MHz, effective single-thread issue rate is 260/21 = 12.4 MHz
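The single-thread rate above is just the clock divided by the pipeline depth, since one thread keeps at most one instruction in the 21-stage pipe:

```python
# With one instruction per thread in flight and a 21-cycle pipeline,
# a single thread issues at most once every 21 cycles.
clock_mhz = 260
pipeline_depth = 21
single_thread_mhz = clock_mhz / pipeline_depth
print(round(single_thread_mhz, 1))   # 12.4 MHz effective issue rate
```

Conversely, 21 ready threads are enough to issue every cycle, and the 128-thread budget leaves headroom to cover the ~150-cycle memory latency as well.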
MTA Pipeline
Coarse-Grain Multithreading
Tera MTA designed for supercomputing applications with large data sets and low locality
No data cache
Many parallel threads needed to hide large memory latency
Other applications are more cache friendly
Few pipeline bubbles when cache getting hits
Just add a few threads to hide occasional cache miss latencies
Swap threads on cache misses
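The switch-on-miss idea above can be sketched as a toy model (all names and numbers here are illustrative): each thread runs until it misses, then the CPU swaps to another ready thread so the miss latency is hidden by useful work.

```python
def run_coarse_grained(miss_cycles, miss_latency, total_cycles):
    """miss_cycles[t]: set of instruction indices at which thread t misses."""
    n = len(miss_cycles)
    blocked_until = [0] * n            # cycle when each thread becomes ready
    current, executed, trace = 0, [0] * n, []
    for c in range(total_cycles):
        if c < blocked_until[current] or executed[current] in miss_cycles[current]:
            if executed[current] in miss_cycles[current]:
                blocked_until[current] = c + miss_latency   # start the miss
                miss_cycles[current].discard(executed[current])
            # swap to any ready thread, if one exists
            ready = [t for t in range(n) if blocked_until[t] <= c]
            if not ready:
                trace.append(None)     # all threads waiting on memory
                continue
            current = ready[0]
        executed[current] += 1
        trace.append(current)
    return trace

# Thread 0 misses on its 3rd instruction; thread 1 covers the latency.
print(run_coarse_grained([{2}, set()], miss_latency=4, total_cycles=8))
```

With only two threads, one cache-friendly companion thread already keeps the pipeline busy across an occasional miss, matching the slide's "just add a few threads".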
MIT Alewife
Modified SPARC chips
register windows hold different thread contexts
Up to four threads per node
Thread switch on local cache miss
IBM PowerPC RS64-III (Pulsar)
Commercial coarse-grain multithreading CPU
Based on PowerPC with quad-issue in-order five-stage pipeline
Each physical CPU supports two virtual CPUs
On L2 cache miss, pipeline is flushed and execution switches to second thread
short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
flush pipeline to simplify exception handling
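The tradeoff above is worth a quick check: switching costs a 4-cycle flush, which pays off whenever the miss latency is much larger. The memory latency below is an assumed illustrative number, not from the slides.

```python
flush_penalty = 4      # cycles, from the slide
miss_latency = 100     # cycles, an assumed L2-miss-to-memory latency
cycles_hidden = miss_latency - flush_penalty
print(cycles_hidden)   # cycles of useful work the second thread gains
```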
Speculative, Out-of-Order Superscalar Processor
Superscalar Machine Efficiency
Why horizontal waste? (issue slots left unused within a cycle)
Why vertical waste? (entire issue cycles left unused)
Vertical Multithreading
Cycle-by-cycle interleaving of second thread removes vertical waste
Ideal Multithreading for Superscalar
Interleave multiple threads into multiple issue slots with no restrictions
Simultaneous Multithreading
Add multiple contexts and fetch engines to wide out-of-order superscalar processor [Tullsen, Eggers, Levy, UW, 1995]
OOO instruction window already has most of the circuitry required to schedule from multiple threads
Any single thread can utilize whole machine
Comparison of Issue Capabilities (courtesy of Susan Eggers; used with permission)
From Superscalar to SMT
SMT is an out-of-order superscalar extended with hardware to support multiple executing threads
From Superscalar to SMT
Extra pipeline stages for accessing thread-shared register files
From Superscalar to SMT
Fetch from the two highest-throughput threads. Why?
From Superscalar to SMT
Small items:
per-thread program counters
per-thread return stacks
per-thread bookkeeping for instruction retirement, trap & instruction dispatch queue flush
thread identifiers, e.g., with BTB & TLB entries
Simultaneous Multithreaded Processor
SMT Design Issues
Which thread to fetch from next?
Don't want to clog instruction window with thread with many stalls
try to fetch from thread that has fewest insts in window
Locks
Virtual CPU spinning on lock executes many instructions but gets nowhere
add ISA support to lower priority of thread spinning on lock
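The fewest-instructions-in-window heuristic above (an ICOUNT-style policy, in the terminology of Tullsen et al.) can be sketched as a small selection function; the spinning-thread exclusion stands in for the ISA-level priority lowering mentioned for locks. Names and the set-based lock interface are illustrative assumptions.

```python
def pick_fetch_thread(insts_in_window, spinning_on_lock=()):
    """insts_in_window[t] = instructions thread t already has in the window.
    Fetch from the least-represented thread, skipping lock spinners."""
    candidates = [t for t in range(len(insts_in_window))
                  if t not in spinning_on_lock]   # ISA-lowered priority
    return min(candidates, key=lambda t: insts_in_window[t])

print(pick_fetch_thread([12, 3, 30, 7]))                        # thread 1
print(pick_fetch_thread([12, 3, 30, 7], spinning_on_lock={1}))  # thread 3
```

A stalled thread accumulates instructions in the window and stops being selected, so it cannot starve the other threads of window entries.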
Intel Pentium-4 Xeon Processor
Hyperthreading == SMT
Dual physical processors, each 2-way SMT
Logical processors share nearly all resources of the physical processor
Caches, execution units, branch predictors
Die area overhead of hyperthreading ~5%
When one logical processor is stalled, the other can make progress
No logical processor can use all entries in queues when two threads are active
A processor running only one active software thread runs at the same speed with or without hyperthreading