Enhancing performance - Pipelining Chapter 6 Part 1 – Concepts N. Guydosh 3/24/04

Enhancing performance - Pipelining

Chapter 6

Part 1 – Concepts

N. Guydosh

3/24/04

Introduction

• Parallelism built into the processor hardware

– The logical sequence of events in the execution of an instruction is generally wasteful of time.

– Example: while an instruction is doing arithmetic using registers, the memory is idle ... why not fetch the next instruction during this time?

– The key idea is to overlap the processing of multiple instructions.

Introduction – An Analogy

• Analogous to an assembly line in a factory

– The time to build an individual car does not decrease, but the number of cars built per unit time is greatly increased.

– Multiple cars are built simultaneously

– While the engine is installed in one car, the seats are installed in another ... at the same time

– Success of the assembly line depends on how well balanced it is ... we don't want one task (phase) taking 10 minutes while another takes 1 hour. The car that finishes the short task at one station will have to wait idle for the next station to free up.

– The series of work stations in an assembly line is analogous to the functional units in a processor data path

– A series of cars to be built passes through the assembly line simultaneously - each work station is busy.

– On startup and shutdown of the line, it takes a time equal to the sum of the work station times to fill and empty the line (pipeline).

Introduction – Computer Instructions

• In the assembly line example, the car becomes an instruction.

• The tasks done on the car become the instruction phases (or stages).

• The work stations become the functional units in the data path.

• The assembly line becomes the data path

• The data path simultaneously executing multiple instructions is called a pipeline

Introduction – Computer Instructions (cont)

• Pipelining improves instruction throughput rather than individual instruction execution time

• The time required to move an instruction one step down the pipeline is one clock cycle

• The length of a clock cycle is determined by the time required for the slowest pipeline stage because all stages must proceed at the same rate

• The goal of the designer is to balance the length of each stage - otherwise there will be idle time during a stage.

Steps to Take

• Decompose the processing of instructions into phases

• Simplest decomposition is two phases or stages: fetch and execute

– 1st stage fetches and buffers the instruction

– 2nd stage (execution) receives the buffered instruction from the 1st stage when it is free

– While 2nd stage is executing, the 1st stage takes advantage of any unused memory cycles to fetch and buffer the next instruction: this is called instruction pre-fetch or fetch overlap

Steps to Take (Cont.)

• Problems with this approach

– Execution time is generally longer than fetch time.

– Fetch stage may have to wait before it can empty its buffer

– Ideally we would like to have the various stages of instruction processing take the same amount of time.

– A conditional branch instruction makes the address of the next instruction uncertain ... thus the fetch stage waits until the execute stage (branch) determines the next instruction address

• Both of the above situations result in performance loss - the latter (conditional branch) can be reduced by guessing the outcome of the branch.

Steps to Take (Cont.)

• An improvement would be to decompose the instruction processing into smaller steps (finer granularity)

– There would be less variation in processing time among the stages

– These are the familiar phases of our instruction execution

IF   Instruction fetch

ID   Instruction decode and register fetch

EX   Execution and effective address calculation

MEM  Memory access (fetch memory operands)

WB   Write back (into register file)

– The various phases (5 of them) will be more nearly equal in duration

– Register read and register write take only 1 ns and all the other phases take 2 ns. Thus every phase is allotted 2 ns - register operations will idle for 1 ns during the register phases

Steps to Take (Cont.)

• Fundamental concept: In order to make each phase as independent as possible of the other phases, we will use the single clock cycle data path (fig. 5.19, p. 360) and a multiple clock cycle timing scheme.

– A hybrid of the two schemes in chapter 5.

– The single clock cycle data path has redundant hardware which enhances parallelism and phase independence.

• ALU and two adders

• But it is functionally the same as the multiple clock data path.

Single Clock Cycle Datapath for Multi Clock Cycle Timing

[Fig 6.10: single clock cycle datapath (PC, instruction memory, adders, register file, sign extend, shift left 2, ALU, data memory, muxes) divided into five stages. IF: Instruction fetch; ID: Instruction decode/register file read; EX: Execute/address calculation; MEM: Memory access; WB: Write back.]

Performance Example

• Execution of three consecutive lw instructions – see p. 439

– 2 ns per phase except for reg phase which is 1 ns

[Figure: timing of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) in program execution order. Non-pipelined: each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg, and a new instruction starts every 8 ns. Pipelined: the same five phases, with a new instruction starting every 2 ns.]


Performance Example (cont.)

• Ideally, with no idle time in the register phases, it would take 8 ns to execute an lw instruction and 24 ns to do three of them sequentially.

– In a 5 stage pipeline the three could be done in 14 ns.

– Ideally we would expect to complete an instruction every 8/5 = 1.6 ns for the 5 stage pipeline.

– Instead we see 2 ns between instructions

– This is due to an imbalance in the time for each phase – all phases are 2 ns except the register phases, which are 1 ns

• Since an instruction is fired off every 2 ns in the 5 stage pipeline as opposed to every 8 ns in a non-pipelined scheme, it would seem that the performance advantage should be 8/2 = 4.

– But what we see is 24/14 = 1.7

– The reason we are not getting the 4:1 ratio is that this example never filled the pipe: about 2/3 of the time was spent filling and emptying the pipe

– Maximum parallelism is achieved only when the pipe is filled

– Suppose we increase the number of instructions executed by 1000

Non-pipelined: 24 + 1000(8 ns/inst) = 8024 ns

Pipelined: 14 + 1000(2 ns/inst) = 2014 ns

Ratio = 8024/2014 = 3.98 ≈ 8/2 = 4
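The arithmetic above can be checked with a small sketch. It assumes the stage latencies from the lw example (2 ns per phase, 1 ns for the two register phases) and an idealized pipeline with no hazards; the function names are illustrative only and do not come from the text.

```python
# Stage latencies in ns from the lw example: IF, ID (register read), EX, MEM, WB (register write).
STAGE_NS = [2, 1, 2, 2, 1]

def nonpipelined_ns(n_instructions, stages=STAGE_NS):
    """Each instruction runs start to finish before the next one begins."""
    return n_instructions * sum(stages)

def pipelined_ns(n_instructions, stages=STAGE_NS):
    """The clock period is set by the slowest stage. The first instruction
    needs len(stages) cycles; each later one completes one cycle after it."""
    cycle = max(stages)
    return (len(stages) + n_instructions - 1) * cycle

for n in (3, 1003):
    seq, pipe = nonpipelined_ns(n), pipelined_ns(n)
    print(f"{n:5d} instructions: {seq} ns vs {pipe} ns, speedup {seq / pipe:.2f}")
```

With 3 instructions the fill and drain time dominates; with 1003 the speedup approaches the 8/2 = 4 limit set by the 2 ns clock.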

Principles

• Principle: Keep the pipe full and make the phase times as equal as possible.

– Sometimes “disruptions” cause it to empty and have to be refilled .... as can happen with successful branches

Principles (cont.)

• Principle: In order for a pipelined scheme to work well, the data path stages (functional units) must be designed in such a way that instructions executing at a particular stage will do so independently of instructions simultaneously executing in other stages.

– As in an assembly line, the instruction should “flow” through the data path from stage to stage and not require the services of multiple stages independently

– It turns out the single clock cycle data path implementation we came up with in chapter 5 has this property to a large degree.

– fig. 6.10, p. 450 (single-cycle data path) is an idealistic abstraction and must be “modified” to make it work well in a pipelined environment. Multi-cycle clocking will be added to this single cycle datapath.

– Instructions roughly flow from left to right as they get executed. ... Instruction 1 could be in the “ID” stage while instruction 2 is in the “EX” stage.

– Two exceptions to the left to right flow:

(a) Write back (WB) stage flows from the end of the pipe to the register file in the middle of the pipe

(b) The MEM stage feeds back to the fetch stage with a possible non-incremented branch address

Making the Pipeline Work

• Pipeline phase buffering

– Pipeline buffer registers between phases save the data for the next phase, thus making a phase immediately reusable by another instruction - see fig 6.12, p. 452

– There is no pipeline register between the WB stage and the ID phase (a right to left path). This is OK since this is a “natural” interdependence between instructions being executed ... an lw places data in the register file and a later instruction uses it. All instructions generally change the state of the machine. ... These kinds of instruction interdependence could get hairy - see later

Pipelined Datapath Showing Pipeline Registers Between Phases

[Fig. 6.12: the datapath of fig. 6.10 with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages.]

Preserving Information in the Pipeline

• Data to be stored by sw instruction

The data from the rt register to be stored in memory is buffered in the ID/EX pipeline register but needed in the MEM stage ... so it is automatically transferred to the EX/MEM pipeline register during the EX phase. See fig 6.16, p. 457 or fig 6.18

• Destination register number (rt) needed by lw instruction

In the lw instruction the register number to write the data into is needed at the output of the MEM phase (MEM/WB register) ... but is first buffered in the IF/ID register ... so it is automatically transferred through three pipeline registers to the MEM/WB register where it is needed. See fig. 6.18, p. 460

This move is: ID/EX → EX/MEM → MEM/WB → register file write register specification

– Initially the given datapath did not have this path in it (a deliberate bug).

– If we did not make this correction, the destination register number stored in ID/EX would get overwritten by the next instruction coming down the pipe and thus would result in an error.
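The overwrite problem can be illustrated with a toy model. This is only a sketch under stated assumptions: each pipeline register holds just an instruction label and its destination register number, the four lw instructions are hypothetical, and `write_reg_source` (a name invented here) selects which pipeline register supplies the write register number at write-back time.

```python
def run(program, write_reg_source):
    """program: list of (label, dest_reg). Returns [(label, reg_actually_written)].

    write_reg_source names the pipeline register whose destination field
    drives the register file's write-register input during write back.
    """
    names = ["IF/ID", "ID/EX", "EX/MEM", "MEM/WB"]
    regs = {name: None for name in names}
    log = []
    # Trailing None entries let the last instructions drain out of the pipe.
    for fetched in list(program) + [None] * len(names):
        # Clock edge: every pipeline register takes the contents of its left neighbor.
        regs["MEM/WB"] = regs["EX/MEM"]
        regs["EX/MEM"] = regs["ID/EX"]
        regs["ID/EX"] = regs["IF/ID"]
        regs["IF/ID"] = fetched
        finishing = regs["MEM/WB"]          # instruction now in the WB stage
        if finishing is not None:
            source = regs[write_reg_source]
            written = source[1] if source is not None else None
            log.append((finishing[0], written))
    return log

program = [("lw A", 10), ("lw B", 11), ("lw C", 12), ("lw D", 13)]
print("buggy:", run(program, "ID/EX"))    # number taken from a later instruction
print("fixed:", run(program, "MEM/WB"))   # number carried along with the instruction
```

In the buggy configuration each lw writes into the register named by an instruction that entered the pipe after it; carrying the number along to MEM/WB makes each instruction write its own destination.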

Data Path Showing Information Preservation

[Fig. 6.18: pipelined datapath with two annotated paths. "Pass rt data for sw": Read data 2 is carried from ID/EX through EX/MEM to the data memory Write data input. "Preserve destination register number for lw": the write register number is carried through ID/EX, EX/MEM, and MEM/WB back to the register file Write register input.]

Notes For “Scenarios” Or Walk Thru’s Given in the Text – See overhead slides

• Single instruction lw – see fig. 6.13 through 6.15. Assumes correction of fig 6.18

– The target register number (rt) is determined in the decode/register fetch stage, but is not used until the final stage (write-back).

– Thus the rt number stored in the ID/EX register must be “moved along” with the instruction to the MEM/WB register where it is needed:

– This move is: ID/EX → EX/MEM → MEM/WB → register file write register specification

– If we didn’t do this, the destination register number would get wiped out in ID/EX by the next instruction coming down the pipe. ... And it would be all over but the laughing.

– This is essentially the same reason we put the IRWrite control line on the instruction register for the multi-clock non-pipelined case in chapter 5 ... a copy of certain data from the fetched instruction must be maintained throughout the execution of the instruction.

Notes On “Scenarios” Or Walk Thru’s Given in the Text (cont.)

• Single instruction sw – see fig. 6.16 through 6.17. Assumes correction of fig 6.18

– First two stages (fetch and decode) identical to lw

– Shows need to keep information used in later stages of execution of the instruction:

– The source data from rt (to be written to memory) is read from the register file during the decode/register fetch stage and stored in the ID/EX register, but is needed in the MEM stage for storage in memory.

– Thus this field is transferred along with the instruction from the ID/EX register to the EX/MEM register, where it is now available for writing to memory

– This is similar to the situation in lw, but not identical.

Notes On “Scenarios” Or Walk Thru’s Given in the Text (cont.)

• Two instructions lw and sub – see fig. 6.22 through 6.24

– Illustrates that each instruction must “visit” each phase even if it does not need any services in the phases.

– sub does not need the mem phase, so the ALU output is merely passed to the next pipeline register (MEM/WB) to await being written to the register file

Graphical Representation Of Pipelines

• Single clock cycle diagram

– What was used in the “scenarios”

– Shows state of the entire datapath during a single clock cycle

– All instructions in the pipeline identified by labels above respective stages

– Requires a sequence of such diagrams to show the execution of instruction(s)

– Example: figs. 6.22 - 6.24 ... “walk thru” of two instructions: lw and sub

• Multiple-clock-cycle pipeline diagram

– Gives a high level overview

– Shows the pipeline activity for all clock pulses in a single diagram - see fig 6.20, p. 462

Multiple-clock-cycle pipeline diagram

[Fig 6.20: lw $10, 20($1) followed by sub $11, $2, $3 over clock cycles CC 1 to CC 6, each instruction drawn as the hardware it uses in turn: IM, Reg, ALU, DM, Reg.]

[Fig 6.21: the same two instructions over CC 1 to CC 6, each cycle labeled by stage name: Instruction fetch, Instruction decode, Execution, Data access, Write back.]

What Can Go Wrong – Pipeline Hazards: A Preview

• Hazard: a situation in which the next instruction cannot execute in the following clock cycle

• Structural hazard

– What if there were only a single memory?

– Example: a lw followed by other instructions – the lw could be accessing data in the memory while a following instruction is attempting to be fetched from the same memory.

What Can Go Wrong – Pipeline Hazards: A Preview (cont.)

• Control hazard

– The need to make a decision based on the results of one instruction while others are already executing. The decision may have an effect on instructions already executing

– Example: a conditional branch (beq) could invalidate instructions already in execution if the branch is successful (taken).

– Possible solutions:

Stall the instruction after beq until the decision is determined (successful or unsuccessful branch)

Predict or guess the outcome of the branch. If you are correct then you run at full speed; if you are wrong, then the following instruction must be flushed and the pipe refills from the new branched instruction stream.

What Can Go Wrong – Pipeline Hazards: A Preview (cont.)

• Data hazard

– An instruction depends on the result of a previous instruction still in the pipeline.

– Example:

add $s0, $t0, $t1    # $s0 available in 5th stage

sub $t2, $s0, $t3    # $s0 needed in 2nd stage

– Naïve approach would be to stall sub until data is ready – performance penalty

– Better: make the data available earlier by forwarding or bypassing stages: give the sub instruction the result before writing to the register file

– Sometimes even with forwarding a stall may be necessary, example:

lw $s0, 20($t1)      # $s0 available in 5th stage – must access memory

sub $t2, $s0, $t3    # $s0 needed in 2nd stage

Control and data hazard resolution is easier said than done – complicates controls – implementation details later
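The two examples above differ only in the producing instruction, which a short sketch can make concrete. It is a simplification under stated assumptions: it looks only at two adjacent instructions, handles only the lw, sw, and three-register forms, and the helper names `parse` and `classify` are invented for illustration.

```python
import re

def parse(line):
    """Return (op, dest, sources) for the few MIPS forms used in this sketch."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "lw":                  # lw rt, offset(rs): writes rt, reads rs
        return op, regs[0], regs[1:]
    if op == "sw":                  # sw rt, offset(rs): writes nothing, reads both
        return op, None, regs
    return op, regs[0], regs[1:]    # R-format: op rd, rs, rt

def classify(first, second):
    """Classify how `second` depends on the instruction immediately before it."""
    op, dest, _ = parse(first)
    _, _, sources = parse(second)
    if dest is None or dest not in sources:
        return "none"
    if op == "lw":
        return "stall"     # loaded value is not ready until after the memory access
    return "forward"       # ALU result can be forwarded to the next instruction

print(classify("add $s0, $t0, $t1", "sub $t2, $s0, $t3"))
print(classify("lw $s0, 20($t1)", "sub $t2, $s0, $t3"))
```

A real hazard unit also checks instructions two or more positions apart and works on register numbers held in the pipeline registers; this sketch only mirrors the two-instruction examples on the slide.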

Data Hazard Forwarding

[Figure: pipeline timing (IF, ID, EX, MEM, WB against time in ns) for add $s0, $t0, $t1 followed by sub $t2, $s0, $t3 with forwarding, and for lw $s0, 20($t1) followed by sub $t2, $s0, $t3 with a bubble inserted between the two instructions.]

Pipeline Control

• Start with the controls used for the non-pipelined case (single clock cycle with controls): fig 5.19, p. 360

[Fig 5.19: single clock cycle datapath with the Control unit driven by Instruction [31–26] and producing RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; ALU control is driven by Instruction [5–0]; PCSrc selects between PC + 4 and the branch target.]

Pipeline Control (cont.)

• No controls needed for pipeline registers - they are written each clock cycle.

• Each control line is associated with a component active in only a single pipeline stage - thus divide the control lines into potentially 5 groups (per stage) ... see fig. 6.29, p. 469

– Since we are using the single clock data path, the controls are only for the last 3 stages.

– Thus, of the 5 potential groups, we will need only three groups

Pipeline Control (cont.)

• All controls are created during the decode phase and stored in the ID/EX pipeline register – extending the register.

• As the clock pulses and the instruction advances thru the pipeline:

– Control signals needed by the current execution phase are utilized and ...

– The remainder of them are passed to the next pipeline register to be used by later phases. See fig. 6.28 - 6.29, p. 469

• This method of asserting control lines is reminiscent of horizontal microcode, where the control lines (bits in the microword) are asserted as the microword is “executed”

– ... The control bits in the phase registers play this role - controls for a particular phase become asserted when the phase occurs (becomes active)

• When a stage is inactive, the control lines for that stage are deasserted (killing that phase)

Pipeline Control: Comparison With Multi-clock Non-pipelined Control

• In chapter 5 (multi-clock), the sequencing of control required a special hardware implementation of an FSM (see fig. 5.42, 5.43). In this case the sequencing is embedded in the pipeline structure itself (pipeline registers).

– All control is computed during the instruction decode phase and then passed along via pipeline registers

– The generation of the control values in the decode phase is combinational logic – done in one clock pulse as in the single-clock design of chapter 5.

– Sequencing is achieved by an instruction moving from one phase to another – the control signals associated with a phase being “presented” as the stage is entered via the pipeline register for that phase.

• In chapter 5 (multi-clock), instruction execution took a variable number of clock cycles (see fig 5.42). In this case all instructions take the same number of cycles.

Pipeline Control (cont.)

                 Execution/address calc stage      Memory access stage       Write-back stage

Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite    RegWrite  MemtoReg

R-format        1       1       0       0         0        0        0           1         0

lw              0       0       0       1         0        1        0           1         1

sw              X       0       0       1         0        0        1           0         X

beq             X       0       1       0         1        0        0           0         X

Fig. 6.28

[Fig. 6.29: the Control unit generates the EX, M, and WB control groups from the instruction in IF/ID. ID/EX holds EX, M, and WB; EX/MEM holds M and WB; MEM/WB holds WB.]

Pass control signals along just like the data
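The way control values ride along in the pipeline registers can be sketched as follows. The signal values are copied from the table of fig. 6.28; the dictionary layout and the `decode` and `advance` helpers are assumptions made for illustration, and None stands for a don't-care (X) entry.

```python
# Control values per instruction class, grouped by the stage that consumes them.
CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
                 "M":  {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":       {"EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "M":  {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":       {"EX": {"RegDst": None, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "M":  {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                 "WB": {"RegWrite": 0, "MemtoReg": None}},
    "beq":      {"EX": {"RegDst": None, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
                 "M":  {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 0, "MemtoReg": None}},
}

def decode(kind):
    """ID stage: produce all three groups at once, to be latched in ID/EX."""
    return dict(CONTROL[kind])

def advance(bundle, consumed):
    """Drop the group used by the stage just completed and pass the rest on."""
    return {group: bits for group, bits in bundle.items() if group != consumed}

id_ex = decode("lw")             # EX, M, WB
ex_mem = advance(id_ex, "EX")    # M, WB
mem_wb = advance(ex_mem, "M")    # WB only
print(sorted(id_ex), sorted(ex_mem), sorted(mem_wb))
```

Each pipeline register ends up narrower than the one before it, which matches the EX / M / WB boxes shrinking from ID/EX to MEM/WB in fig. 6.29.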

Datapath with Pipeline Control

[Fig. 6.30: pipelined datapath with the Control unit in the ID stage. The EX, M, and WB control groups are carried in ID/EX, EX/MEM, and MEM/WB. Signals shown: RegDst, ALUOp, ALUSrc (EX); Branch, MemRead, MemWrite, PCSrc (MEM); RegWrite, MemtoReg (WB).]

Datapath with Pipeline Control: Changes from fig. 5.19 (single clock)

• Changes from fig. 5.19 (single clock) to fig 6.30 (pipelined)

– Destination register (rt or rd) propagated because of multi-clock timing.

– PC set twice via PCSrc for the MUX:

• increment in fetch phase

• branch address if instruction is beq in MEM phase – overwrites incremented value if successful

– jump instruction not implemented in fig 6.30

A Scenario Showing Pipeline Controls: See pdf figures 6.31 – 6.35

• Scenario for the following sequence (see overhead slides). Note that these instructions are independent of each other!

lw $10, 20($1)

sub $11, $2, $3

and $12, $4, $5

or $13, $6, $7

add $14, $8, $9

• A fully loaded pipeline (5 instructions), with controls ... Takes 9 clock cycles to complete. See fig 6.31 - 6.35

• Although one instruction begins (and completes) each clock cycle, an individual instruction takes five cycles to complete

• Note the propagation of the destination register thru the pipe starting in fig 6.33

• It takes 4 cycles before the 5 stage pipeline is operating at full efficiency (see fig 6.33) – “filling the pipe”.
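The cycle counts in this scenario can be reproduced with a small multiple-clock-cycle diagram generator. This is a sketch that assumes an ideal pipeline with no stalls, where instruction i enters IF in cycle i + 1; the `diagram` function is a name chosen here for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def diagram(program):
    """Return (total_cycles, rows), where each row is (instruction, cells)
    and cells[c] names the stage occupied in clock cycle c + 1 ("" if none)."""
    total = len(STAGES) + len(program) - 1
    rows = []
    for i, instr in enumerate(program):
        cells = [""] * total
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage
        rows.append((instr, cells))
    return total, rows

program = [
    "lw $10, 20($1)",
    "sub $11, $2, $3",
    "and $12, $4, $5",
    "or $13, $6, $7",
    "add $14, $8, $9",
]
total, rows = diagram(program)
print(" " * 18 + "".join(f"CC{c:<3d}" for c in range(1, total + 1)))
for instr, cells in rows:
    print(f"{instr:<18}" + "".join(f"{cell:<5}" for cell in cells))
```

The five instructions occupy 5 + 5 - 1 = 9 cycles, and cycle 5 is the first (and, for this short sequence, the only) cycle in which all five stages are busy.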
