Multiscalar Processors Presented by Matthew Misler Gurindar S. Sohi, Scott E. Breach, T. N. Vijaykumar University of Wisconsin-Madison ISCA ‘95


Page 1: Multiscalar Processors

Multiscalar Processors

Presented by Matthew Misler

Gurindar S. Sohi, Scott E. Breach, T. N. Vijaykumar

University of Wisconsin-Madison

ISCA ‘95

Page 2: Multiscalar Processors

Scalar Processors


Instruction Queue

addu $20, $20, 16

ld $23, SYMVAL -16($20)

move $17, $21

beq $17, $0, SKIPINNER

ld $8, LELE($17)

Execution Unit

Page 3: Multiscalar Processors

Superscalar Processors


Instruction Queue

addu $20, $20, 16

ld $23, SYMVAL -16($20)

move $17, $21

beq $17, $0, SKIPINNER

ld $8, LELE($17)

Execution Unit

Page 4: Multiscalar Processors

Fetch-Execute

– Paradigm has been around for about 60 years

– Superscalar processors execute instructions out of order
– Sometimes the re-ordering is done in hardware
– Sometimes in software
– Sometimes both
– Partial ordering

Page 5: Multiscalar Processors

Control Flow Graphs

– Segments are split on control dependencies (conditional branches)

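The splitting rule above can be sketched in code. This is a minimal illustration, not the paper's actual task-selection algorithm: it assumes tasks are cut after each conditional branch, and reuses the instruction strings from the earlier slides.

```python
# Hypothetical sketch: split a linear trace of instructions into task
# segments, ending a segment at each conditional branch (a control
# dependence). The cut rule is a simplification for illustration.

def split_into_tasks(instructions):
    """Group instructions into tasks, cutting after each conditional branch."""
    tasks, current = [], []
    for instr in instructions:
        current.append(instr)
        if instr.startswith("beq") or instr.startswith("bne"):
            tasks.append(current)   # branch closes the current task
            current = []
    if current:
        tasks.append(current)       # trailing instructions form a final task
    return tasks

trace = ["addu $20, $20, 16", "ld $23, SYMVAL -16($20)", "move $17, $21",
         "beq $17, $0, SKIPINNER", "ld $8, LELE($17)"]
tasks = split_into_tasks(trace)
# the first task ends at the beq; the remaining load starts a new task
```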

Page 6: Multiscalar Processors

Sequential “Walk”

– Walk through the CFG with enough parallelism
– Use speculative execution and branch prediction to raise the level of parallelism
– Sequential semantics must be preserved
– Can still execute out of order, but with in-order commit

Page 7: Multiscalar Processors

Multiscalars and Tasks

– CFG broken down into tasks
– Multiscalar steps through the CFG at the task level
– No inspection of instructions within a task
– Each task is assigned to one ‘processing unit’
– Multiple tasks can execute in parallel
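The task-per-unit idea can be sketched as a toy sequencer. The class names and policy below are assumptions for illustration, not the paper's hardware: units are held in a circular queue, only the head (oldest task) commits, preserving sequential order.

```python
from collections import deque

# Toy sketch (assumed structure): a sequencer hands tasks to processing
# units; the head of the active queue is the oldest, non-speculative task.

class Sequencer:
    def __init__(self, num_units):
        self.free_units = deque(range(num_units))  # units available for work
        self.active = deque()                      # (unit, task); head = oldest

    def dispatch(self, task):
        if not self.free_units:
            return False                           # all units busy: stall
        unit = self.free_units.popleft()
        self.active.append((unit, task))
        return True

    def retire_head(self):
        # Only the head may commit, enforcing sequential task order.
        unit, task = self.active.popleft()
        self.free_units.append(unit)
        return task

seq = Sequencer(num_units=4)
for t in ["A", "B", "C", "B"]:
    seq.dispatch(t)
# all 4 units are now busy; a 5th dispatch would have to wait
```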

Page 8: Multiscalar Processors

Multiscalar Microarchitecture

– Sequencer
– Queue of processing units
  – Unidirectional ring
  – Each has an instruction cache, processing element, register file
– Interconnect
– Data banks
  – Each has: address resolution buffer, data cache

Page 9: Multiscalar Processors

Multiscalar Microarchitecture


Page 10: Multiscalar Processors

Outline

– Multiscalar Microarchitecture
– Tasks
– Multiscalars in-depth
– Distribution of cycles
– Comparison to other paradigms
– Performance
– Conclusion

Page 11: Multiscalar Processors

Tasks

– Sequencer distributes a task to a processing unit
– Unit fetches and executes the task until completion
– Instructions in the window are bounded
  – By the first instruction in the earliest executing task
  – By the last instruction in the latest executing task

Page 12: Multiscalar Processors

Tasks

– Sequencer distributes a task to a processing unit
– Unit fetches and executes the task until completion
– The instruction window is bounded by
  – The first instruction in the earliest executing task
  – The last instruction in the latest executing task
– So? Instruction windows can be huge
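A quick back-of-the-envelope sketch of why the window grows: the task sizes below are made-up numbers, but they show how the effective window spans every in-flight task even though each unit only inspects its own.

```python
# Hypothetical per-task instruction counts (assumed numbers, not from the
# paper): the effective window spans every task currently in flight.

task_lengths = {"A": 12, "B": 30, "C": 7, "D": 25}
in_flight = ["A", "B", "C", "D"]        # oldest to newest active task

window_size = sum(task_lengths[t] for t in in_flight)
# 74 instructions in flight, yet each unit inspects only its own task
```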

Page 13: Multiscalar Processors

Tasks Example


A B C D E

A B C B B C D

Page 14: Multiscalar Processors

Tasks Example


A B C D E

A B C B B C D

A B B C D

A B C B C D E

Page 15: Multiscalar Processors

Tasks

– Hold true to sequential semantics inside each block
– Enforce sequential order overall on tasks
  – The circular queue takes care of this part
– In the previous example:
  – Head of queue does ABCBBCD
  – Middle unit does ABBCD
  – Tail of the queue does ABCBCDE

Page 16: Multiscalar Processors

Tasks

– Registers
  – Create mask: values the task may produce for a future task
  – Forward values down the ring
  – Accum mask: union of the create masks of active tasks
– Memory
  – If it’s a known producer-consumer relationship, then synchronize on loads and stores
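The create/accum mask bookkeeping above can be sketched as plain bitmasks. The register numbering and mask layout here are assumptions for illustration; the point is that the accum mask is just the OR of the active tasks' create masks.

```python
# Bitmask sketch of the register masks: bit r of a task's create mask is
# set if the task may write register r; the accum mask is the union of
# the create masks of all active tasks. (Layout is an assumption.)

def create_mask(written_regs):
    mask = 0
    for r in written_regs:
        mask |= 1 << r
    return mask

def accum_mask(create_masks):
    mask = 0
    for m in create_masks:
        mask |= m               # union over active tasks
    return mask

t1 = create_mask([17, 20])      # a task that may write $17 and $20
t2 = create_mask([8, 17])       # a task that may write $8 and $17
acc = accum_mask([t1, t2])

def may_be_written(mask, reg):
    """A later task must wait on reg if an active task may still write it."""
    return bool(mask & (1 << reg))
```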

Page 17: Multiscalar Processors

Tasks

– Memory (cont’d): unknown P-C relationship
  – Conservative approach: wait
  – Aggressive approach: speculate
– Conservative approach means sequential operation
– Aggressive approach requires dynamic checking, squashing, and recovery

Page 18: Multiscalar Processors

Outline

– Multiscalar basics
– Tasks
– Multiscalars in-depth
– Distribution of cycles
– Comparison to other paradigms
– Performance
– Conclusion

Page 19: Multiscalar Processors

Multiscalar Programs

– Code for the tasks
– Small changes to the existing ISA
  – Add specification of tasks; no major overhaul
– Structure of the CFG and tasks
– Communication between tasks

Page 20: Multiscalar Processors

Control Flow Graph Structure

– Successors
  – Listed in the task descriptor
– Producing and consuming values
  – Forward register information on the last update
  – Compiler can mark instructions: operate and forward
– Stopping conditions
  – Special condition, evaluate conditions, complete
– All of these can be viewed as tag bits

Page 21: Multiscalar Processors

Multiscalar Hardware

– Walks through the CFG
  – Assigns tasks to processing units
  – Executes tasks in a ‘sequential’ order
– Sequencer fetches the task descriptors
  – Using the address of the first instruction
  – Specifying the create masks
  – Constructing the accum mask
– Using the task descriptor, predict the successor

Page 22: Multiscalar Processors

Multiscalar Hardware

– Data banks
  – Updates to the cache are not speculative
– Use of the Address Resolution Buffer (ARB)
  – Detects violations of dependencies
  – Initiates corrective actions
  – If it runs out of space, squash tasks
– The head of the queue doesn’t use the ARB
  – It can stall rather than squash
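The ARB's dependence check can be sketched as follows. This is a heavily simplified illustration (the data layout and policy are assumptions, not the ARB's actual design): speculative loads are recorded per address, and a store from an earlier task to an address a later task already loaded signals a violation.

```python
# Simplified sketch of ARB-style checking: a store by an earlier task to
# an address a later task has already loaded means the later task read a
# stale value and must be squashed. (Structure is an assumption.)

class ARB:
    def __init__(self):
        self.loads = {}   # address -> set of task ids that loaded it

    def load(self, task, addr):
        self.loads.setdefault(addr, set()).add(task)

    def store(self, task, addr):
        # Return later tasks that loaded addr before this store arrived.
        return sorted(t for t in self.loads.get(addr, ()) if t > task)

arb = ARB()
arb.load(task=2, addr=0x100)                # task 2 speculatively loads 0x100
violators = arb.store(task=1, addr=0x100)   # earlier task 1 stores to it
# violators == [2]: task 2 (and its successors) would be squashed
```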

Page 23: Multiscalar Processors

Multiscalar Hardware


– Remember the earlier architectural picture?

Page 24: Multiscalar Processors

Multiscalar Hardware

– It’s not the only possible architecture
– Possible design with shared functional units
– Possible design with the ARB and data cache on the same side as the processing units
– Scaling the interconnect is non-trivial
  – Glossed over

Page 25: Multiscalar Processors

Outline

– Multiscalar Basics
– Tasks
– Multiscalars In-Depth
– Distribution of Cycles
– Comparison to Other Paradigms
– Performance
– Conclusion

Page 26: Multiscalar Processors

Distribution of Cycles

– Wasted cycles:
  – Non-useful computation: squashed
  – No computation: waiting
  – Remains idle: no assigned task

Page 27: Multiscalar Processors

Distribution of Cycles

– Non-useful computation cycles
  – Determine useless computation early
  – Validate the prediction early: check if the next task is predicted correctly (e.g. test for loop exit at the start of the loop)
– Tasks violating sequentiality are squashed
  – To avoid this, try to synchronize memory communication with register communication
  – Could delay the load for a number of cycles
  – Can use signal-wait synchronization
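The signal-wait idea above can be illustrated with threads. The hardware mechanism is different; this sketch only shows the ordering property: the consuming task waits on a flag that the producing task raises after its store, instead of speculating the load.

```python
import threading

# Signal-wait sketch (illustration only, not the hardware mechanism):
# the consumer waits for a signal that the producer's store has happened.

ready = threading.Event()
shared = {}

def producer():
    shared["x"] = 42       # the store the later task depends on
    ready.set()            # signal: the value is now safe to load

def consumer(out):
    ready.wait()           # wait instead of speculating the load
    out.append(shared["x"])

result = []
t1 = threading.Thread(target=consumer, args=(result,))
t2 = threading.Thread(target=producer)
t1.start(); t2.start()
t1.join(); t2.join()
# result == [42]: the load observed the producer's store, no squash needed
```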

Page 28: Multiscalar Processors

Distribution of Cycles

– Contrast with no assigned task
– No computation cycles
  – Dependencies within the same task
  – Dependencies between tasks (earlier/later)
  – Load balancing

Page 29: Multiscalar Processors

Outline

– Multiscalar Basics
– Tasks
– Multiscalars In-Depth
– Distribution of Cycles
– Comparison to Other Paradigms
– Performance
– Conclusion

Page 30: Multiscalar Processors

Comparison to Other Paradigms

– Branch prediction
  – Sequencer only needs to predict branches across tasks
– Wide instruction window
  – Must check which instructions are ready for issue; in a multiscalar, relatively few need inspection

Page 31: Multiscalar Processors

Comparison to Other Paradigms

– Issue logic
  – Superscalar processors have n² issue logic
  – Multiscalar logic is distributed: each processing unit issues instructions independently
– Loads and stores
  – Normally sequence numbers are needed to manage the buffers
  – In a multiscalar, the loads and stores are independent
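The n² point can be made concrete with a crude counting model (the model is my assumption, purely to show the scaling): a centralized window cross-checks every pair of entries, while distributing the same capacity over independent units checks only within each unit.

```python
# Back-of-the-envelope sketch of the n^2 argument (counting model is an
# assumption): pairwise dependence checks in one big window vs. several
# small, independent windows of the same total capacity.

def pairwise_checks(n):
    return n * (n - 1) // 2         # grows quadratically with window size

centralized = pairwise_checks(32)       # one 32-entry window
distributed = 4 * pairwise_checks(8)    # four independent 8-entry windows
# centralized == 496, distributed == 112
```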

Page 32: Multiscalar Processors

Comparison to Other Paradigms

– Superscalar processors need to discover the CFG as they decode branches
  – Multiscalar only requires the compiler to split code into tasks
– Multiprocessors require all dependences to be known or conservatively provided for
  – If the compiler can compile tasks independently, they can be executed in parallel

Page 33: Multiscalar Processors

Outline

– Multiscalar Basics
– Tasks
– Multiscalars In-Depth
– Distribution of Cycles
– Comparison to Other Paradigms
– Performance
– Conclusion

Page 34: Multiscalar Processors

Performance

– Simulated
  – 5-stage pipeline
  – (table of functional unit latencies)

Page 35: Multiscalar Processors

Performance

– Memory
  – Non-blocking loads and stores
  – 10-cycle latency for the first 4 words
  – 1 cycle for each additional 4 words
– Instruction cache: 1 cycle for 4 words
  – 10+3 cycles for a miss
– Data cache: 1 word per cycle (multiscalar)
  – 10+3 cycles plus bus contention for a miss
– 1024-entry cache of task descriptors
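The memory latency rule above can be written as a small formula. Treating the rule as "10 cycles for the first 4-word chunk, 1 cycle per additional chunk" is my reading of the slide, not the paper's exact timing model:

```python
import math

# Sketch of the stated memory latency rule (my reading of the slide):
# 10 cycles for the first 4 words, +1 cycle per additional 4-word chunk.

def access_latency(words):
    chunks = math.ceil(words / 4)
    return 10 + (chunks - 1)

# access_latency(4)  -> 10 cycles
# access_latency(8)  -> 11 cycles
# access_latency(16) -> 13 cycles
```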

Page 36: Multiscalar Processors

Performance

+12.2% increase in instruction count, on average

Page 37: Multiscalar Processors

Performance – In-Order


Page 38: Multiscalar Processors

Performance – Out-of-Order


Page 39: Multiscalar Processors

Performance – Summary

– Most of the benchmarks achieve speedup
  – E.g. an average of 1.924 on a 1-way, in-order, 4-unit multiscalar
– Worst case: 0.86 speedup (a slowdown)
  – Many squashes from prediction and memory ordering in gcc and xlisp
  – Leads to almost sequential execution
– Keep in mind the 12.2% increase in instruction count
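One way to read the two averages together (combining them like this is my illustration, not a figure from the paper): if the multiscalar executes 12.2% more instructions yet still runs 1.924x faster, its instruction throughput advantage is larger than the raw speedup.

```python
# Hedged arithmetic on the summary numbers (illustration only):
# more work done in less time means a larger per-cycle throughput ratio.

ic_ratio = 1.122          # multiscalar instruction count / base count
time_speedup = 1.924      # base execution time / multiscalar time

throughput_ratio = time_speedup * ic_ratio
# roughly 2.16x more instructions completed per unit time
```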

Page 40: Multiscalar Processors

Outline

– Multiscalar Basics
– Tasks
– Multiscalars In-Depth
– Distribution of Cycles
– Comparison to Other Paradigms
– Performance
– Conclusion

Page 41: Multiscalar Processors

Conclusion

– Divide the CFG into tasks
– Assign tasks to processing units
– Walk the CFG in task-size steps
– Shows performance gains