
Page 1: Hardware Multithreading

1

Hardware Multithreading

COMP25212

Page 2: Hardware Multithreading

2

Increasing CPU Performance

• By increasing clock frequency – pipelining

• By increasing Instructions per Clock – superscalar

• Minimizing memory access impact – caches

• Maximizing pipeline utilization – branch prediction

• Maximizing pipeline utilization – forwarding

• Maximizing instruction issue – dynamic scheduling

Page 3: Hardware Multithreading

3

Increasing Parallelism

• The amount of parallelism we can exploit is limited by the program
– Some areas exhibit great parallelism
– Some others are essentially sequential

• In the latter case, where can we find additional independent instructions?
– In a different program!

Page 4: Hardware Multithreading

4

Software Multithreading - Revision

• Modern Operating Systems allow several processes/threads to run concurrently

• Transparent to the user – all of them appear to be running at the same time

• BUT, actually, they are scheduled (and interleaved) by the OS

Page 5: Hardware Multithreading

5

OS Thread Switching - Revision

[Diagram: the Operating System alternates between Thread T0 and Thread T1. While T0 executes, T1 waits; on a switch the OS saves T0’s state into PCB0 and loads T1’s state from PCB1, then T1 executes while T0 waits, and the reverse happens on the next switch.]

COMP25111 – Lect. 5

Page 6: Hardware Multithreading

6

Process Control Block (PCB) - Revision

PCBs store information about the state of ‘alive’ processes handled by the OS:

• Process ID
• Process State
• PC
• Stack Pointer
• General Registers
• Memory Management Info
• Open File List, with positions
• Network Connections
• CPU time used
• Parent Process ID

Page 7: Hardware Multithreading

7

OS Process States - Revision

[Diagram: OS process states. New → Ready (waiting for a CPU); Ready → Running on a CPU (Dispatched); Running → Blocked waiting for event (Wait, e.g. I/O); Blocked → Ready (Event occurs); Running → Ready (Pre-empted); Running → Terminated.]

COMP25111 – Lect. 5

Page 8: Hardware Multithreading

8

Hardware Multithreading

• Allow multiple threads to share a single processor

• Requires replicating the independent state of each thread

• Virtual memory can be used to share memory among threads
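The replicated state above can be pictured as one small record per hardware thread. A minimal sketch in Python (the field names and sizes are illustrative, not from any real ISA):

```python
# Hypothetical sketch of the state a multithreaded core replicates per
# hardware thread; everything else (caches, pipeline logic) is shared.
from dataclasses import dataclass, field
from typing import List

@dataclass
class HWThreadContext:
    pc: int = 0                        # per-thread program counter
    regs: List[int] = field(default_factory=lambda: [0] * 32)  # register file
    asid: int = 0                      # selects this thread's VA mapping

# two hardware threads sharing one core, each with its own context
contexts = [HWThreadContext(asid=0), HWThreadContext(asid=1)]
```

Each context is independent, so switching hardware threads never requires saving or restoring this state to memory, unlike an OS thread switch through a PCB.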

Page 9: Hardware Multithreading

9

CPU Support for Multithreading

[Diagram: a five-stage pipeline — Fetch Logic, Decode Logic, Exec Logic, Mem Logic, Write Logic — backed by a shared Inst Cache and Data Cache. Per-thread state is replicated: program counters PCA and PCB feed Fetch, register files RegA and RegB feed Exec, and VA MappingA / VA MappingB feed the shared Address Translation.]

Page 10: Hardware Multithreading

10

Hardware Multithreading Issues

• How HW MT is presented to the OS
– Normally present each hardware thread as a virtual processor (Linux, UNIX, Windows)
– Requires multiprocessor support from the OS

• Need to share or replicate resources
– Registers – normally replicated
– Caches – normally shared
• Each thread will use a fraction of the cache
• Cache thrashing issues – can harm performance

Page 11: Hardware Multithreading

11

Example of Thrashing - Revision

Memory accesses (direct-mapped cache; both threads’ addresses map to the same index, line 13D):

Thread A      Thread B      Action taken        Line 13D Tag
:             :                                 Invalid
0x075A13D0                  MISS: Load 0x075A   0x075A
              0x018313D4    MISS: Load 0x0183   0x0183
0x075A13D4                  MISS: Load 0x075A   0x075A
              0x018313D8    MISS: Load 0x0183   0x0183
0x075A13D8                  MISS: Load 0x075A   0x075A
              0x018313DC    MISS: Load 0x0183   0x0183

Every access misses: the two threads keep evicting each other’s line, because their addresses share the same index in a direct-mapped cache.
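The pattern above can be reproduced with a few lines of Python. The address split below (high tag bits, 12-bit index, 16-byte lines) is an assumption chosen to match the figures on the slide:

```python
# Sketch of direct-mapped cache thrashing: two interleaved threads whose
# addresses share index 0x13D keep evicting each other's line.
# Address layout (assumed): | tag | 12-bit index | 4-bit offset |

def line_index(addr):
    return (addr >> 4) & 0xFFF   # bits 4..15 select the cache line

def tag(addr):
    return addr >> 16            # remaining high bits are the tag

def count_misses(accesses):
    cache = {}                   # index -> tag (one line per index)
    misses = 0
    for addr in accesses:
        idx = line_index(addr)
        if cache.get(idx) != tag(addr):
            misses += 1          # miss: load the line, evicting any tenant
            cache[idx] = tag(addr)
    return misses

# interleave thread A and thread B, as in the table above
interleaved = [0x075A13D0, 0x018313D4, 0x075A13D4,
               0x018313D8, 0x075A13D8, 0x018313DC]
print(count_misses(interleaved))                            # 6 misses
print(count_misses([0x075A13D0, 0x075A13D4, 0x075A13D8]))   # 1 miss
```

Running thread A alone touches one 16-byte line once (a single compulsory miss); interleaving the threads turns every access into a miss.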

Page 12: Hardware Multithreading

12

Hardware Multithreading

• Different ways to exploit this new source of parallelism

• When & how to switch threads?
– Coarse-grain Multithreading
– Fine-grain Multithreading
– Simultaneous Multithreading

Page 13: Hardware Multithreading

13

Coarse-Grain Multithreading

Page 14: Hardware Multithreading

14

Coarse-Grain Multithreading

• Issue instructions from a single thread
• Operate like a simple pipeline

• Switch thread on an “expensive” operation:
– E.g. I-cache miss
– E.g. D-cache miss

Page 15: Hardware Multithreading

15

Switch Threads on I-cache miss

Cycle:   1    2    3        4    5    6    7
Inst a   IF   ID   EX       MEM  WB
Inst b        IF   ID       EX   MEM  WB
Inst c             IF-MISS  -    -    -    -
Inst X                      IF   ID   EX   MEM
Inst Y                           IF   ID   EX
Inst Z                                IF   ID

• Remove Inst c and switch to the ‘grey’ thread
• The ‘grey’ thread will continue its execution until there is another I-cache or D-cache miss

Page 16: Hardware Multithreading

16

Switch Threads on D-cache miss

Cycle:   1    2    3    4       5    6    7
Inst a   IF   ID   EX   M-MISS  -    -    -    (re-issued after the miss resolves)
Inst b        IF   ID   EX      -    -    -    (aborted)
Inst c             IF   ID      -    -    -    (aborted)
Inst d                  IF      -    -    -    (aborted)
Inst X                          IF   ID   EX
Inst Y                               IF   ID

• Remove Inst a and switch to the ‘grey’ thread
– Remove the already-issued instructions (b, c, d) from the ‘white’ thread: they are in the “shadow” of the miss
– Roll back the ‘white’ PC to point to Inst a
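The abort-and-rollback step can be sketched as a toy model. The snapshot representation and all names here are hypothetical, chosen only to illustrate the mechanism:

```python
# Toy model of recovery from a D-cache miss under coarse-grain
# multithreading: the missing instruction and everything younger in the
# pipeline (all from the same thread, since only one thread is running)
# are removed, and the thread's PC rolls back so it restarts from the
# missing instruction once the data arrives.

def squash_on_miss(in_flight, miss_pos, thread_pcs):
    """in_flight: oldest-first list of (thread_id, pc) entries.
    miss_pos: index of the instruction whose MEM access missed."""
    miss_thread, miss_pc = in_flight[miss_pos]
    thread_pcs[miss_thread] = miss_pc    # roll back PC to the missing inst
    # instructions older than the miss complete; the rest are aborted
    return in_flight[:miss_pos]

# Inst a at 0x1000 misses; b and c (its shadow) are aborted with it
pcs = {'white': 0x100C, 'grey': 0x2000}
pipe = [('white', 0x1000), ('white', 0x1004), ('white', 0x1008)]
pipe = squash_on_miss(pipe, 0, pcs)
```

After the squash the pipeline is free for the ‘grey’ thread, and the ‘white’ PC points back at Inst a.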

Page 17: Hardware Multithreading

17

Coarse Grain Multithreading

• Good to compensate for infrequent, but expensive, pipeline disruptions

• Minimal pipeline changes
– Need to abort all the instructions in the “shadow” of a D-cache miss
– Swap instruction streams

• Data/control hazards within a thread are not solved

Page 18: Hardware Multithreading

18

Fine-Grain Multithreading

Page 19: Hardware Multithreading

19

Fine-Grain Multithreading

• Interleave the execution of several threads

• Usually using Round Robin among all the ready hardware threads

• Requires instantaneous thread switching
– Complex hardware

Page 20: Hardware Multithreading

20

Fine-Grain Multithreading

• Multithreading helps alleviate fine-grain dependencies (e.g. forwarding may no longer be needed, since adjacent pipeline stages hold instructions from independent threads)

Cycle:   1    2    3    4    5    6    7
Inst a   IF   ID   EX   MEM  WB
Inst M        IF   ID   EX   MEM  WB
Inst b             IF   ID   EX   MEM  WB
Inst N                  IF   ID   EX   MEM
Inst c                       IF   ID   EX
Inst P                            IF   ID

Page 21: Hardware Multithreading

21

I-cache misses in Fine Grain Multithreading

• An I-cache miss is overcome transparently

Cycle:   1    2    3        4    5    6    7
Inst a   IF   ID   EX       MEM  WB
Inst M        IF   ID       EX   MEM  WB
Inst b             IF-MISS  -    -    -    -
Inst N                      IF   ID   EX   MEM
Inst P                           IF   ID   EX
Inst Q                                IF   ID

Inst b is removed and ‘white’ is marked as not ‘ready’
The ‘white’ thread is not ready, so ‘grey’ is executed

Page 22: Hardware Multithreading

22

D-cache misses in Fine Grain Multithreading

• Mark the thread as not ‘ready’ and issue only from the other thread(s)

Cycle:   1    2    3    4       5     6     7
Inst a   IF   ID   EX   M-MISS  MISS  MISS  WB
Inst M        IF   ID   EX      MEM   WB
Inst b             IF   ID      -     -     -
Inst N                  IF      ID    EX    MEM
Inst P                          IF    ID    EX
Inst Q                                IF    ID

‘White’ is marked as not ‘ready’; remove Inst b and update the ‘white’ PC.
The ‘white’ thread is not ready, so ‘grey’ is executed.

Page 23: Hardware Multithreading

23

Fine Grain Multithreading in Out-of-order Processors

• In an out-of-order processor we may continue issuing instructions from both threads
– Unless the O-o-O algorithm stalls one of the threads

Cycle:   1    2    3    4       5     6     7
Inst a   IF   RO   EX   M-MISS  MISS  MISS  WB
Inst M        IF   RO   EX      MEM   WB
Inst b             IF   RO      (RO)  (RO)  EX
Inst N                  IF      RO    EX    MEM
Inst c                          IF    (RO)  (RO)
Inst P                                IF    RO

(RO = read operands; an instruction shown as (RO) is held there, waiting for the stalled ‘white’ thread’s miss to resolve)

Page 24: Hardware Multithreading

24

Fine Grain Multithreading

• Utilization of pipeline resources is increased, i.e. better overall performance

• The impact of short stalls is alleviated by executing instructions from other threads

• Single-thread execution is slowed down
• Requires an instantaneous thread-switching mechanism
– Expensive in terms of hardware

Page 25: Hardware Multithreading

25

Simultaneous Multi-Threading

Page 26: Hardware Multithreading

26

Simultaneous Multi-Threading

• The main idea is to exploit instruction-level parallelism and thread-level parallelism at the same time

• In a superscalar processor, issue instructions from different threads in the same cycle

• Instructions from different threads can be using the same stage of the pipeline
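As a toy illustration of filling issue slots from several threads in one cycle, here is a hypothetical 2-wide issue stage. The round-robin policy and all names are assumptions for the sketch; real SMT cores use smarter heuristics (e.g. ICOUNT):

```python
# Sketch of a 2-wide SMT issue stage: each cycle, the issue slots are
# filled round-robin from whichever threads still have ready
# instructions, so one cycle can mix instructions from different threads.
from collections import deque

def smt_issue(queues, width=2):
    """queues: dict thread_id -> deque of ready instructions.
    Returns a list of per-cycle issue groups [(thread, instr), ...]."""
    schedule, tids, rr = [], list(queues), 0
    while any(queues.values()):
        slots = []
        while len(slots) < width and any(queues.values()):
            t = tids[rr % len(tids)]   # next thread in round-robin order
            rr += 1
            if queues[t]:
                slots.append((t, queues[t].popleft()))
        schedule.append(slots)
    return schedule

qs = {'A': deque(['a1', 'a2', 'a3']), 'B': deque(['b1'])}
sched = smt_issue(qs)
# cycle 1 issues a1 + b1 (different threads); cycle 2 issues a2 + a3
```

When one thread runs out of ready instructions, the other thread takes both slots, which is exactly how SMT keeps a wide pipeline busy.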

Page 27: Hardware Multithreading

27

Simultaneous Multi-Threading

Cycle:   1    2    3    4    5    6    7    8    9    10
Inst a   IF   ID   EX   MEM  WB
Inst b   IF   ID   EX   MEM  WB
Inst M        IF   ID   EX   MEM  WB
Inst N        IF   ID   EX   MEM  WB
Inst c             IF   ID   EX   MEM  WB
Inst P             IF   ID   EX   MEM  WB
Inst Q                  IF   ID   EX   MEM  WB
Inst d                  IF   ID   EX   MEM  WB
Inst e                       IF   ID   EX   MEM  WB
Inst R                       IF   ID   EX   MEM  WB

• Two instructions are issued per cycle: some pairs come from the same thread (e.g. Inst a and Inst b), others from different threads (e.g. Inst c and Inst P)

Page 28: Hardware Multithreading

28

SMT issues

• Asymmetric pipeline stall (from superscalar)
– One part of the pipeline stalls – we want the other pipeline to continue

• Overtaking – we want non-stalled threads to make progress

• Existing implementations are on O-o-O, register-renamed architectures (similar to Tomasulo)
– e.g. Intel Hyper-Threading

Page 29: Hardware Multithreading

29

SMT: Glimpse into the Future

• Scout threads
– A thread to prefetch memory – reduces cache miss overhead

• Speculative threads
– Allow a thread to execute speculatively way past branch/jump/call/miss/etc.
– Needs revised O-o-O logic
– Needs extra memory support

Page 30: Hardware Multithreading

30

Simultaneous Multi Threading

• Extracts the most parallelism from instructions and threads

• Implemented only in out-of-order processors because they are the only ones able to exploit that much parallelism

• Has a significant hardware overhead

Page 31: Hardware Multithreading

31

Example

Consider that we want to execute 2 programs of 100 instructions each. The first program suffers an i-cache miss at instruction #30, and the second program another at instruction #70. Assume that:

+ There is enough parallelism to execute all instructions independently (no hazards)
+ Switching threads can be done instantaneously
+ A cache miss requires 20 cycles to get the instruction into the cache
+ The two programs do not interfere with each other’s cache lines

Calculate the execution time observed by each of the programs (cycles elapsed between the execution of the first and the last instruction of that application) and the total time to execute the workload:

a) Sequentially (no multithreading),

b) With coarse-grain multithreading,

c) With fine-grain multithreading,

d) With 2-way simultaneous multithreading.
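One way to sanity-check answers to (a)-(c) is a tiny scheduling simulator under the stated assumptions: single-issue, one instruction per cycle, instantaneous switching, and a 20-cycle stall before fetching the given instruction. The structure and policies below are an illustrative sketch, not the official model answer, and case (d) would additionally need a multiple-issue model:

```python
# Sketch simulator for the exercise above (cases a-c).
MISS_PENALTY = 20

def simulate(threads, policy):
    """threads: list of (n_instructions, miss_before_instr).
    policy: 'coarse' switches thread only on a miss,
            'fine' round-robins among ready threads every cycle.
    Returns (total_cycles, per-thread finish cycle)."""
    st = [dict(n=n, miss=m, done=0, ready_at=1, missed=False)
          for n, m in threads]
    finish = [None] * len(st)
    cycle, current = 0, 0
    while any(s['done'] < s['n'] for s in st):
        cycle += 1
        issued = False
        for _ in range(len(st)):          # find a ready thread this cycle
            s = st[current]
            if s['done'] < s['n'] and s['ready_at'] <= cycle:
                if not s['missed'] and s['done'] + 1 == s['miss']:
                    # i-cache miss: stall this thread, switch instantly
                    s['missed'] = True
                    s['ready_at'] = cycle + MISS_PENALTY
                    current = (current + 1) % len(st)
                    continue
                s['done'] += 1            # issue one instruction
                if s['done'] == s['n']:
                    finish[current] = cycle
                issued = True
                break
            current = (current + 1) % len(st)
        if policy == 'fine' and issued:
            current = (current + 1) % len(st)   # round robin every cycle
    return cycle, finish

# (a) sequential: run each program alone, one after the other
t1, _ = simulate([(100, 30)], 'coarse')
t2, _ = simulate([(100, 70)], 'coarse')
print('sequential total:', t1 + t2)   # 120 + 120 = 240 cycles
# (b) coarse-grain and (c) fine-grain, both programs together
print('coarse:', simulate([(100, 30), (100, 70)], 'coarse'))
print('fine:  ', simulate([(100, 30), (100, 70)], 'fine'))
```

Under these assumptions the workload takes 240 cycles sequentially but only 200 cycles with either coarse- or fine-grain multithreading: each program’s 20-cycle miss is completely hidden by the other program’s instructions.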

Page 32: Hardware Multithreading

32

Summary of Hardware Multithreading

Page 33: Hardware Multithreading

33

Benefits of Hardware Multithreading

• Multithreading techniques improve the utilisation of processor resources and, hence, the overall performance

• If the different threads access the same input data they may use the same regions of memory
– Cache efficiency improves in these cases

Page 34: Hardware Multithreading

34

Disadvantages of Hardware Multithreading

• Single-thread performance may be degraded compared with a single-threaded CPU
– Multiple threads interfere with each other

• Shared caches mean that, effectively, each thread uses a fraction of the whole cache
– Thrashing may exacerbate this issue

• Thread scheduling at the hardware level adds significant complexity to processor design
– Thread state, managing priorities, OS-level information, …

Page 35: Hardware Multithreading

35

Multithreading Summary

• A cost-effective way of finding additional parallelism for the CPU pipeline

• Available in x86, Itanium, POWER and SPARC
• Presents each additional hardware thread as an additional virtual CPU to the Operating System
• Operating Systems Beware!!! (why?)

Page 36: Hardware Multithreading

36

Comparison of Multithreading Techniques – 4-way superscalar

[Diagram: issue-slot occupancy over time for threads A, B, C and D on a 4-way superscalar machine. With coarse-grain multithreading, all four slots are filled from one thread until an expensive event triggers a switch; with fine-grain, the threads take turns cycle by cycle; with SMT, slots within the same cycle are filled from different threads, leaving the fewest slots empty.]