Slide 1 Instruction-Level Parallelism Review of Pipelining (the laundry analogy)

Slide 1

Instruction-Level Parallelism

• Review of Pipelining (the laundry analogy)

Slide 2

Instruction-Level Parallelism

• Review of Pipelining (Appendix A)

Slide 3

Instruction-Level Parallelism

• Review of Pipelining (Appendix A) – the five stages of the MIPS pipeline:

» IF – instruction fetch

» ID – instruction decode and operand fetch

» EX – execution using the ALU, including effective-address and branch-target computation

» MEM – memory access for load and store instructions

» WB – write the result back to the destination register
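The overlap of the five stages can be sketched in a few lines of Python. This is a minimal illustration of an ideal, hazard-free pipeline; the function names and the sample instruction list are made up for the example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def ideal_cycle_count(n_instructions):
    """Cycles needed to run n instructions when nothing ever stalls."""
    return n_instructions + len(STAGES) - 1


def pipeline_diagram(instructions):
    """Return (name, cells) rows; instruction k enters IF in cycle k + 1."""
    n_cycles = ideal_cycle_count(len(instructions))
    rows = []
    for k, name in enumerate(instructions):
        cells = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            cells[k + s] = stage  # one stage per cycle, shifted by k
        rows.append((name, cells))
    return rows


# Hypothetical instruction sequence, used only to draw the diagram.
for name, cells in pipeline_diagram(["LW", "ADD", "SUB", "SW"]):
    print(f"{name:<5}" + "".join(f"{c:<5}" for c in cells))
```

Each row is the previous one shifted right by one cycle, which is the "datapaths shifted in time" picture.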

Slide 4

• The “naïve” MIPS pipeline

Slide 5

Instruction-Level Parallelism

• The “naïve” MIPS pipeline: implementation

Slide 6

Instruction-Level Parallelism

• A series of datapaths shifted in time

Slide 7

Instruction-Level Parallelism

• A pipeline showing the pipeline registers between stages

Slide 8

The major hurdles of pipelining: pipeline hazards

• Structural hazards: resource conflicts over, e.g., the bus, register file ports, or memory ports.

Slide 9

The major hurdles of pipelining: pipeline hazards

• Data hazards: data dependences (a producer-consumer relationship, i.e., read after write). Some can be resolved by forwarding.

Slide 10

The major hurdles of pipelining: pipeline hazards

• Data hazards: data hazard detection in the MIPS pipeline

Slide 11

The major hurdles of pipelining: pipeline hazards

• Data hazards: the logic for forwarding data in the MIPS pipeline
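The forwarding decision for one ALU operand can be sketched as follows. This is a simplified model, not the circuit from the slide's figure: the `PipeReg` type and its field names are assumptions made for the example, and only the EX/MEM and MEM/WB pipeline registers are considered as forwarding sources.

```python
from dataclasses import dataclass


@dataclass
class PipeReg:
    """Assumed summary of the instruction held in a pipeline register."""
    reg_write: bool  # the instruction writes a register
    rd: int          # its destination register number


def forward_select(src_reg, ex_mem, mem_wb):
    """Pick where an ALU operand for src_reg should come from."""
    if src_reg == 0:
        return "REGFILE"  # R0 is hardwired to zero in MIPS
    if ex_mem.reg_write and ex_mem.rd == src_reg:
        return "EX/MEM"   # the most recent producer wins
    if mem_wb.reg_write and mem_wb.rd == src_reg:
        return "MEM/WB"
    return "REGFILE"


# Operand R1 was produced by the instruction now in EX/MEM.
print(forward_select(1, PipeReg(True, 1), PipeReg(True, 1)))
```

Checking EX/MEM before MEM/WB matters: when both older instructions write the same register, the consumer must see the younger result.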

Slide 12

The major hurdles of pipelining: pipeline hazards

• Data hazards: forwarding of data in the MIPS pipeline

Slide 13

The major hurdles of pipelining: pipeline hazards

• Data hazards: some cannot be resolved by forwarding and therefore require stalls
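The classic case that forwarding cannot fix is a load followed immediately by an instruction that uses the loaded value: the data only exists at the end of MEM, but the consumer needs it at the start of EX. A minimal sketch of the interlock check, with hypothetical parameter names:

```python
def needs_load_use_stall(id_ex_is_load, id_ex_dest, if_id_sources):
    """True if the instruction in ID must stall one cycle.

    id_ex_is_load: the instruction now in EX is a load
    id_ex_dest:    its destination register number
    if_id_sources: source register numbers of the instruction in ID
    """
    return id_ex_is_load and id_ex_dest != 0 and id_ex_dest in if_id_sources


# LW R1,0(R2) followed by SUB R4,R1,R5: the SUB must wait one cycle.
print(needs_load_use_stall(True, 1, [1, 5]))
# LW R1,0(R2) followed by SUB R4,R3,R5: no dependence, no stall.
print(needs_load_use_stall(True, 1, [3, 5]))
```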

Slide 14

The major hurdles of pipelining: pipeline hazards

• Data hazards: avoiding non-forwardable data hazards through compiler scheduling:

Slide 15

The major hurdles of pipelining: pipeline hazards

• Branch (control) hazards: can cause a greater performance loss than data hazards (e.g., a 3-cycle loss in the “naïve” MIPS pipeline)

Slide 16

The major hurdles of pipelining: pipeline hazards

• Branch (control) hazards: an improved MIPS pipeline with a one-cycle loss

Slide 17

The major hurdles of pipelining: pipeline hazards

• Reducing branch penalties:

1. Freeze or flush

2. Predict-not-taken or predict-taken

3. Delayed branch:

1) Branch instruction

2) Sequential successor

3) Branch target if taken

“Canceling/nullifying” branch: the delay-slot instruction is nullified if the prediction is incorrect

Slide 18

The major hurdles of pipelining: pipeline hazards

• Scheduling the branch delay slot:

Slide 19

Performance of Pipelining

• Example 1:

– Consider an unpipelined machine A and a pipelined machine B, where CCT_A = 10 ns, CPI(A)_ALU = CPI(A)_Br = 4, CPI(A)_l/s = 5, and CCT_B = 11 ns. Assuming an instruction mix of 40% ALU, 20% branches, and 40% loads/stores (l/s), what is the speedup of B over A under ideal conditions?
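One way to work Example 1, sketched in Python. The only assumption beyond the numbers given is that "ideal conditions" means machine B sustains CPI = 1 (no stalls); the variable names are illustrative.

```python
def average_cpi(mix, cpi):
    """Weighted-average CPI for an instruction mix (fractions sum to 1)."""
    return sum(mix[kind] * cpi[kind] for kind in mix)


mix = {"alu": 0.4, "branch": 0.2, "load_store": 0.4}
cpi_a = {"alu": 4, "branch": 4, "load_store": 5}

cct_a_ns = 10.0
cct_b_ns = 11.0
cpi_b_ideal = 1.0  # assumption: an ideal pipeline completes one instruction per cycle

time_per_instr_a = average_cpi(mix, cpi_a) * cct_a_ns  # 4.4 cycles * 10 ns
time_per_instr_b = cpi_b_ideal * cct_b_ns              # 1 cycle * 11 ns
speedup = time_per_instr_a / time_per_instr_b

print(f"average CPI of A: {average_cpi(mix, cpi_a):.2f}")
print(f"speedup of B over A: {speedup:.2f}")
```

Under that assumption the average instruction takes 44 ns on A and 11 ns on B, a speedup of 4.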

Slide 20

Performance of Pipelining

• Impacts of pipeline hazards:

Slide 21

Performance of Pipelining

• Performance of branch schemes:

Overall costs of a variety of branch schemes with the MIPS pipeline

Slide 22

Performance of Pipelining

• Example 2: For a deeper pipeline such as that of the MIPS R4000, it takes three pipeline stages before the target address is known and an additional stage before the condition is evaluated. This leads to the branch penalties for the three simplest branch schemes listed below:

Find the effective addition to the CPI arising from branches for this pipeline, assuming that unconditional, untaken conditional, and taken conditional branches account for 4%, 6%, and 10% of instructions, respectively.

Answer:
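A sketch of the calculation in Python. The per-scheme penalties below are an assumption derived from the stage counts stated in the example (2 cycles when only the target address is needed, 3 cycles when the condition is needed, 0 when a prediction is correct and nothing has to be redone); they are not values taken from the slide.

```python
# Fraction of all instructions that are each kind of branch (from the example).
FREQUENCY = {"unconditional": 0.04, "untaken": 0.06, "taken": 0.10}

# Assumed penalties in cycles, derived from "target known after 3 stages,
# condition known one stage later".
PENALTY = {
    "flush pipeline":    {"unconditional": 2, "untaken": 3, "taken": 3},
    "predict taken":     {"unconditional": 2, "untaken": 3, "taken": 2},
    "predict not taken": {"unconditional": 2, "untaken": 0, "taken": 3},
}


def branch_cpi_addition(penalties, frequency=FREQUENCY):
    """Extra CPI = sum over branch kinds of frequency * penalty."""
    return sum(frequency[kind] * penalties[kind] for kind in frequency)


for scheme, penalties in PENALTY.items():
    print(f"{scheme:<18} +{branch_cpi_addition(penalties):.2f} CPI")
```

With these assumed penalties the additions are 0.56, 0.46, and 0.38 CPI for the flush, predict-taken, and predict-not-taken schemes.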

Slide 23

What Makes Pipelining Hard to Implement?

1. Exceptional conditions (e.g., interrupts) often change the order of instruction execution;

Slide 24

What Makes Pipelining Hard to Implement?

• Actions needed for different types of exceptional conditions:

Slide 25

What Makes Pipelining Hard to Implement?

• Stopping and Restarting Execution: Two Challenges

Slide 26

What Makes Pipelining Hard to Implement?

• Stopping and Restarting Execution: Two Challenges (cont’d)

Slide 27

What Makes Pipelining Hard to Implement?

• Precise Exception Handling in MIPS

Pipeline stage   Problem exceptions occurring
IF               Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID               Undefined or illegal opcode
EX               Arithmetic exception
MEM              Page fault on data fetch; misaligned memory access; memory-protection violation
WB               None

Slide 28

What Makes Pipelining Hard to Implement?

• Precise Exception Handling in MIPS

Slide 29

Extending MIPS Pipeline to Handle Multicycle Operations

• Handling floating-point operations in a single cycle (CPI = 1) requires either a very long CCT or highly complex logic.

• Multiple-cycle, long-latency alternative: repeat the EX cycle as many times as needed and/or use multiple FP functional units.

The MIPS pipeline with three additional unpipelined floating-point units

Slide 30

Extending MIPS Pipeline to Handle Multicycle Operations

• Pipelining FP functional units:

• Latency: the number of intervening cycles between the producer and the consumer of an operand (0 for ALU and 1 for LW)

• Initiation interval: the minimum number of cycles between the issue of two instructions that use the same functional unit.

Functional unit    Int. ALU   Data Mem   FP Add   Multiply   Divide
Latency            0          1          3        6          24
Init. interval     1          1          1        1          25
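The two numbers in the table answer two different questions, which a small Python sketch makes concrete. This is a simplified model with hypothetical function names: it assumes full forwarding and ignores special cases such as a store consuming an FP result.

```python
# (latency, initiation interval) per functional unit, from the table above.
UNITS = {
    "int_alu":  (0, 1),
    "data_mem": (1, 1),
    "fp_add":   (3, 1),
    "fp_mul":   (6, 1),
    "fp_div":   (24, 25),
}


def raw_stall_cycles(producer_unit, distance=1):
    """Stalls seen by a consumer issued `distance` instructions after its producer.

    distance=1 means the consumer immediately follows the producer.
    """
    latency, _ = UNITS[producer_unit]
    return max(0, latency - (distance - 1))


def structural_stall_cycles(unit, distance=1):
    """Stalls seen by a second instruction that needs the same unit."""
    _, interval = UNITS[unit]
    return max(0, interval - distance)


print(raw_stall_cycles("fp_mul"))         # consumer right after a multiply
print(structural_stall_cycles("fp_div"))  # back-to-back divides
```

Latency governs RAW stalls between dependent instructions; the initiation interval governs structural stalls, and only the unpipelined divider has an interval above 1.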

Slide 31

Extending MIPS Pipeline to Handle Multicycle Operations

• Pipeline timing of a set of independent FP instructions:

MUL.D   IF   ID   M1   M2   M3   M4   M5   M6   M7   MEM  WB
ADD.D        IF   ID   A1   A2   A3   A4   MEM  WB
L.D               IF   ID   EX   MEM  WB
S.D                    IF   ID   EX   MEM  WB

• A typical FP code sequence showing the stalls arising from RAW hazards:

Instruction     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17
L.D F4,0(R2)    IF    ID    EX    MM    WB
MUL.D F0,F4,F6        IF    ID    stall M1    M2    M3    M4    M5    M6    M7    MM    WB
ADD.D F2,F0,F8              IF    stall ID    stall stall stall stall stall stall A1    A2    A3    A4    MM    WB
S.D F2,0(R2)                      IF    stall stall stall stall stall stall ID    EX    stall stall stall MM

• Three instructions want to perform a write back to the FP register file simultaneously (all reach WB in cycle 11):

Instruction     1    2    3    4    5    6    7    8    9    10   11
MUL.D F0,F4,F6  IF   ID   M1   M2   M3   M4   M5   M6   M7   MEM  WB
…                    IF   ID   EX   MEM  WB
…                         IF   ID   EX   MEM  WB
ADD.D F2,F4,F6                 IF   ID   A1   A2   A3   A4   MEM  WB
…                                   IF   ID   EX   MEM  WB
…                                        IF   ID   EX   MEM  WB
L.D F2,0(R2)                                  IF   ID   EX   MEM  WB

Slide 32

Extending MIPS Pipeline to Handle Multicycle Operations

• Difficulties in exploiting ILP: various hazards impose dependences among instructions. For an instruction i that precedes an instruction j:

– RAW (read after write): j tries to read a source before i writes it

– WAW (write after write): j tries to write an operand before it is written by i

– WAR (write after read): j tries to write a destination before it is read by i

• Implementing pipelining for FP: hazards and forwarding in longer-latency pipelines

– Divide is not fully pipelined (structural hazard)

– Multiple writes can occur in one cycle because instructions arrive at WB at varying times (WAW and structural hazards). Would there be WAR?

– Out-of-order completion of instructions causes more problems for exception handling

– Higher RAW frequency and longer stalls due to longer latency
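The three definitions reduce to comparing register names, as the following Python sketch shows. It is a minimal illustration: the `(dest, sources)` tuple encoding is an assumption of the example, and it only reports which dependences exist between an earlier instruction i and a later instruction j. Whether a dependence becomes an actual hazard depends on the pipeline.

```python
def classify_dependences(instr_i, instr_j):
    """Return the set of dependence types between i (earlier) and j (later).

    Each instruction is a (dest, sources) tuple; dest may be None
    for instructions that write no register (e.g., a store).
    """
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    found = set()
    if dest_i is not None and dest_i in srcs_j:
        found.add("RAW")  # j reads what i writes
    if dest_i is not None and dest_i == dest_j:
        found.add("WAW")  # both write the same register
    if dest_j is not None and dest_j in srcs_i:
        found.add("WAR")  # j overwrites something i reads
    return found


mul_d = ("F0", ["F4", "F6"])  # MUL.D F0,F4,F6
add_d = ("F2", ["F0", "F8"])  # ADD.D F2,F0,F8
l_d = ("F2", ["R2"])          # L.D   F2,0(R2)

print(classify_dependences(mul_d, add_d))
print(classify_dependences(add_d, l_d))
```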

Slide 33

Extending MIPS Pipeline to Handle Multicycle Operations

• Introduce interlocks:

– track the use of the write port at ID and stall issue if a conflict is detected

– use a shift register to track issued instructions' use of the write port

– alternatively, stall when entering MEM:

1. can stall any of the contending instructions,

2. no need to detect the conflict early, when it is harder to see,

3. can give priority to the unit with the longest latency,

4. can cause bottleneck stalling

– WAW occurs if LD is issued one cycle earlier and has F2 as its destination (WAW with ADDD). Solutions:

1. delay issuing LD until ADDD enters MEM, or

2. stamp out the result of ADDD

» Hazard detection with the FP pipeline:

1. check for structural hazards: a. functional units, b. write ports

2. check for RAW hazards: source reg. in ID = dest. reg. of an issued instruction

3. check for WAW hazards: dest. reg. in ID = dest. reg. of an issued instruction
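The shift-register interlock for the write port can be sketched as follows. This is a minimal model with hypothetical names: bit k of the register is set when some already-issued instruction will use the write port k cycles from now, an instruction in ID may issue only if its own write-back slot is free, and the register shifts by one position every cycle.

```python
class WritePortReservation:
    """Shift register tracking future use of the single FP write port."""

    def __init__(self, horizon=32):
        # bits[k] is True if the write port is already reserved k cycles ahead.
        self.bits = [False] * horizon

    def try_issue(self, cycles_until_wb):
        """Reserve the port for an instruction in ID; False means stall."""
        if self.bits[cycles_until_wb]:
            return False
        self.bits[cycles_until_wb] = True
        return True

    def tick(self):
        """Advance one clock cycle."""
        self.bits = self.bits[1:] + [False]


port = WritePortReservation()
# MUL.D is in ID in cycle 2 and would write back in cycle 11 (9 cycles later).
print(port.try_issue(9))
for _ in range(3):
    port.tick()
# ADD.D is in ID in cycle 5 and would also write back in cycle 11 (6 cycles later).
print(port.try_issue(6))
```

The second request is refused, so ADD.D stalls in ID; one cycle later its write-back slot has moved to a free position and it can issue.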

Slide 34

Extending MIPS Pipeline to Handle Multicycle Operations

• Maintaining precise exceptions:

– Example of out-of-order completion:

» DIVF F0, F2, F3
» ADDF F10, F10, F8
» SUBF F12, F12, F14

An exception raised by SUBF at the end of ADDF causes an imprecise exception, which cannot be solved by HW/SW.

– Solutions:

1. Fast imprecise (tolerable in the '60s and '70s, but much less so now due to pipelined FP, virtual memory, and the IEEE standard) or slow precise

2. Buffering of results until all predecessors finish:

– the bigger the difference among instruction execution lengths, the more expensive this is to implement (e.g., a large number of comparators and MUXes and a large amount of buffer space)

– history file: keeps track of the original register values

– future file: keeps the newer values of registers until all predecessors have completed

3. Quasi-precise exceptions: keep enough information for the trap-handling routine to create a precise sequence for the exception:

– the operations in the pipeline and their PCs

– software finishes all instructions issued prior to the latest completed instruction

4. Guarded issuing: issue only if it is certain that all prior instructions will complete without causing an exception, stalling to maintain precise exceptions

Slide 35

The MIPS R4000 Pipeline

– R4000 pipeline leads to a 2-cycle load delay

Slide 36

The MIPS R4000 Pipeline

– R4000 pipeline leads to a 3-cycle basic branch delay since the condition evaluation is performed during the EX stage

Slide 37

Dynamic Scheduling with Scoreboard

Dynamic Scheduling: hardware rearranges the instruction execution order to reduce stalls:

1. handles situations where dependences are unknown or difficult to detect at compile time, thus simplifying compiler design;

2. increases the portability of compiled code;

3. solves problems associated with the so-called “head-of-the-queue” (HOTQ) blocking caused by the “in-order issue” of earlier pipelines. Example:

4. MIPS, which is “in-order issue”, can be made to execute “out of order” (implying “out-of-order” completion) by splitting ID into two phases: (1) In-order issue: check for structural hazards. (2) Read operands: wait until there are no data hazards, then read operands (and then execute, possibly out of order!). The HOTQ problem above can be solved in this new MIPS!
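The effect of splitting ID can be illustrated with a toy timing model in Python. Everything here is an assumption made for the sketch: the three-instruction sequence is a common textbook illustration of HOTQ blocking rather than the slide's own example, the latencies are hypothetical, and the model tracks only RAW dependences (no structural, WAR, or WAW hazards, which a real scoreboard must also handle).

```python
def execution_start_cycles(program, latency, in_order):
    """Cycle in which each instruction begins executing.

    program:  list of (opcode, dest, sources) in program order
    latency:  cycles after which a dependent instruction may start
    in_order: if True, an instruction cannot start before its predecessor
    """
    ready = {}   # register -> first cycle in which its new value can be used
    starts = []
    prev_start = 0
    for index, (op, dest, sources) in enumerate(program):
        issue = index + 1  # one instruction issued per cycle
        operands_ready = max([ready.get(r, 0) for r in sources], default=0)
        start = max(issue, operands_ready)
        if in_order:
            start = max(start, prev_start + 1)  # blocked behind the head of the queue
        starts.append(start)
        ready[dest] = start + latency[op]
        prev_start = start
    return starts


program = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F10", ["F0", "F8"]),    # depends on the divide (RAW on F0)
    ("SUB.D", "F12", ["F8", "F14"]),   # independent of both
]
latency = {"DIV.D": 25, "ADD.D": 4, "SUB.D": 4}  # hypothetical values

print(execution_start_cycles(program, latency, in_order=True))
print(execution_start_cycles(program, latency, in_order=False))
```

In the in-order case the independent SUB.D waits behind the stalled ADD.D; with the issue and read-operands phases separated it starts as soon as it is issued.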

Slide 38

Dynamic Scheduling with Scoreboard

Slide 39

Dynamic Scheduling with Scoreboard

Slide 40

Dynamic Scheduling with Scoreboard

Slide 41

Dynamic Scheduling with Scoreboard

Slide 42

Dynamic Scheduling with Scoreboard

Slide 43

Dynamic Scheduling with Scoreboard

Slide 44

Dynamic Scheduling with Scoreboard

Slide 45

Unpipelined Processor (MIPS)

Slide 46

Pipelined Processor (MIPS)

Slide 47

The Eight-stage Pipeline of the R4000

Slide 48

A 2-cycle Load Delay of The R4000 Integer Pipeline