Pipelining Hwanmo Sung CS147 Presentation Professor Sin-Min Lee

Pipelining

What is pipelining?

An implementation technique that overlaps the execution of multiple instructions.

Any architecture in which digital information flows through a series of stations that each inspect, interpret or modify the information.

Example (1): Laundry

Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.

Washer takes 30 minutes
Dryer takes 40 minutes
"Folder" takes 20 minutes

Sequential Laundry

Sequential laundry takes 6 hours for 4 loads.

[Figure: task order A, B, C, D plotted against time from 6 PM to midnight; each load takes 30 + 40 + 20 minutes, and the loads run one after another.]

Pipelined Laundry: Start Work ASAP

Pipelined laundry takes 3.5 hours for 4 loads.

Speedup = 6 / 3.5 = 1.7

[Figure: task order A, B, C, D plotted against time from 6 PM; the stages overlap, so the total is 30 + 40 + 40 + 40 + 40 + 20 minutes.]
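The 6-hour and 3.5-hour totals can be checked with a small scheduling sketch. The function names and the model are assumptions made for illustration: one washer, one dryer, one folder, and each load moves to the next station as soon as that station is free.

```python
STAGE_MINUTES = [30, 40, 20]  # washer, dryer, "folder" (from the example)
LOADS = 4                     # Ann, Brian, Cathy, Dave


def sequential_minutes(stages, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(stages, loads):
    """Each load enters a stage once it has left the previous stage
    and the stage itself is free."""
    stage_free_at = [0] * len(stages)  # time at which each station is free
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for i, duration in enumerate(stages):
            start = max(ready, stage_free_at[i])
            ready = start + duration
            stage_free_at[i] = ready
        finish = ready
    return finish


seq = sequential_minutes(STAGE_MINUTES, LOADS)
pipe = pipelined_minutes(STAGE_MINUTES, LOADS)
print(f"sequential: {seq / 60} h, pipelined: {pipe / 60} h, "
      f"speedup: {seq / pipe:.1f}")
```

In this model the 40-minute dryer is the bottleneck: after the first load, one load finishes every 40 minutes.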

Pipelining Lessons

Pipelining doesn't help latency of a single task; it helps throughput of the entire workload.
Pipeline rate is limited by the slowest pipeline stage.
Multiple tasks operate simultaneously.
Potential speedup = number of pipe stages.
Unbalanced lengths of pipe stages reduce speedup.
Time to "fill" the pipeline and time to "drain" it reduce speedup.

Computer Pipelines

Computers execute billions of instructions, so throughput is what matters.

RISC desirable features: all instructions the same length, registers located in the same place in the instruction format, memory operands only in loads or stores.

Unpipelined Design

Single-cycle implementation
  The cycle time depends on the slowest instruction.
  Every instruction takes the same amount of time.

Multi-cycle implementation
  Divide the execution of an instruction into multiple steps.
  Each instruction may take a variable number of steps (clock cycles).

Unpipelined System

[Figure: a block of combinational logic (30 ns) feeding a register (3 ns), driven by one clock; Op1, Op2, Op3 shown back to back on the time axis.]

One operation must complete before the next can begin.
Operations are spaced 33 ns apart.

Pipelined Design

Divide the execution of an instruction into multiple steps (stages).
Overlap the execution of different instructions in different stages.
Each cycle, different instructions are executed in different stages.
For example, in a 5-stage pipeline (Fetch-Decode-Read-Execute-Write), 5 instructions are executed concurrently in 5 different pipeline stages.
Complete the execution of one instruction every cycle (instead of every 5 cycles).
Can increase the throughput of the machine 5 times.

Example (1): 3-Stage Pipeline

Operations are spaced 13 ns apart.
3 operations occur simultaneously.

[Figure: three stages, each a block of combinational logic (10 ns) followed by a register (3 ns), driven by one clock; Op1 through Op4 overlap on the time axis.]

Delay = 39 ns
Throughput = 1 / 13 ns ≈ 77 MHz
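The start and finish time of each operation follows directly from these figures. The sketch below is illustrative only; `op_schedule` and its parameters are hypothetical names, and it assumes the 30 ns of logic splits evenly across the stages.

```python
def op_schedule(num_ops, stages, logic_ns=30.0, reg_ns=3.0):
    """Return (start_ns, finish_ns) for each operation.

    Assumes the combinational logic divides evenly across `stages`
    and that one new operation is issued per clock cycle.
    """
    clock_ns = logic_ns / stages + reg_ns   # one stage plus its register
    latency_ns = clock_ns * stages          # time for one op to pass through
    return [(k * clock_ns, k * clock_ns + latency_ns) for k in range(num_ops)]


for label, stages in (("unpipelined", 1), ("3-stage", 3)):
    print(label)
    for k, (start, finish) in enumerate(op_schedule(4, stages), start=1):
        print(f"  Op{k}: starts {start:5.1f} ns, finishes {finish:5.1f} ns")
```

With four operations, the unpipelined system finishes Op4 at 132 ns, while the 3-stage pipeline finishes it at 78 ns, even though each individual operation now takes 39 ns instead of 33 ns.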

Example (2)

5-stage pipeline: Fetch - Decode - Read - Execute - Write

Cycle     1 2 3 4 5 6 7 8 9
Instr 1   F D R E W
Instr 2     F D R E W
Instr 3       F D R E W
Instr 4         F D R E W
Instr 5           F D R E W

Cycles 1 to 4: filling the pipeline. Cycles 6 to 9: draining the pipeline.

Non-pipelined processor: 25 cycles = number of instrs (5) * number of stages (5)

Pipelined processor: 9 cycles = start-up latency (4) + number of instrs (5)
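The two cycle counts generalize to any number of instructions and stages. A minimal sketch, with hypothetical function names and assuming no stalls:

```python
def nonpipelined_cycles(num_instrs, num_stages):
    """Every instruction occupies the machine for all of its stages."""
    return num_instrs * num_stages


def pipelined_cycles(num_instrs, num_stages):
    """Start-up latency (stages - 1) plus one completion per cycle."""
    return (num_stages - 1) + num_instrs


for n in (5, 100, 10_000):
    slow = nonpipelined_cycles(n, 5)
    fast = pipelined_cycles(n, 5)
    print(f"{n:6d} instrs: {slow:6d} vs {fast:6d} cycles, "
          f"speedup {slow / fast:.2f}")
```

For 5 instructions the speedup is only 25 / 9, because fill and drain time dominate; as the instruction count grows, the ratio approaches the number of stages.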

Basic Performance Issues in Pipelining

Pipelining increases the CPU instruction throughput (the number of instructions completed per unit of time), but it does not reduce the execution time of an individual instruction.

Pipeline Speedup Example

Assume the multi-cycle machine has a 10-ns clock cycle, loads take 5 clock cycles and account for 40% of the instructions, and all other instructions take 4 clock cycles.

If pipelining the machine adds 1 ns to the clock cycle, how much speedup in instruction execution rate do we get from pipelining?

MC Ave Instr. Time = Clock cycle x Average CPI
                   = 10 ns x (0.6 x 4 + 0.4 x 5)
                   = 44 ns

PL Ave Instr. Time = 10 + 1 = 11 ns

Speedup = 44 / 11 = 4

This ignores time needed to fill & empty the pipeline and delays due to hazards.
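The same calculation can be written as a small function so the instruction mix or clock overhead can be varied. The names are hypothetical, and the sketch carries the same simplification as the example: the pipelined machine is assumed to complete one instruction per cycle.

```python
def average_cpi(mix):
    """mix: list of (fraction_of_instructions, cycles) pairs."""
    return sum(fraction * cycles for fraction, cycles in mix)


def pipelining_speedup(mc_clock_ns, mix, pipeline_overhead_ns):
    """Multi-cycle average instruction time over pipelined instruction time.

    Assumes an ideal pipeline (CPI = 1) whose clock is the multi-cycle
    clock plus a fixed overhead; fill/drain time and hazards are ignored.
    """
    mc_time_ns = mc_clock_ns * average_cpi(mix)
    pl_time_ns = mc_clock_ns + pipeline_overhead_ns
    return mc_time_ns / pl_time_ns


# Values from the example: 60% of instructions take 4 cycles, 40% take 5.
speedup = pipelining_speedup(10.0, [(0.6, 4), (0.4, 5)], 1.0)
print(f"speedup = {speedup:.2f}")
```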

What makes it easy?

All instructions are the same length.
Just a few instruction formats.
Memory operands appear only in loads and stores.

It's Not Easy for Computers

Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.

Structural hazards: the hardware cannot support this combination of instructions; two instructions need the same resource.
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline.
Control hazards: pipelining of branches and other instructions that change the PC.

A common solution is to stall the pipeline until the hazard is resolved, inserting one or more "bubbles" in the pipeline.

Limitation: Nonuniform Pipelining

[Figure: three stages driven by one clock, with combinational logic of 5 ns, 15 ns, and 10 ns, each followed by a 3 ns register.]

Throughput is limited by the slowest stage.
Delay is determined by clock period * number of stages.
Must attempt to balance stages.

Delay = 18 * 3 = 54 ns
Throughput = 1 / 18 ns ≈ 55.6 MHz
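A short sketch of why the slowest stage sets the pace. The function name is hypothetical, and the balanced case is included only for comparison with the 3-stage example.

```python
def nonuniform_timing(stage_logic_ns, reg_ns):
    """Return (clock_ns, delay_ns, throughput_mhz) for a pipeline whose
    stages have the given combinational-logic delays.

    The clock must be long enough for the slowest stage plus its register.
    """
    clock_ns = max(stage_logic_ns) + reg_ns
    delay_ns = clock_ns * len(stage_logic_ns)
    throughput_mhz = 1000.0 / clock_ns  # one result per clock; 1/ns = 1000 MHz
    return clock_ns, delay_ns, throughput_mhz


unbalanced = nonuniform_timing([5, 15, 10], 3)   # stages from the figure
balanced = nonuniform_timing([10, 10, 10], 3)    # same 30 ns of logic, evenly split

print("unbalanced:", unbalanced)
print("balanced:  ", balanced)
```

Both pipelines contain 30 ns of logic in total, but the unbalanced one needs an 18 ns clock instead of 13 ns, so both its delay and its throughput are worse.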

Limitation: Deep Pipelines

Diminishing returns as more pipeline stages are added.
Register delays become the limiting factor.
  Increased latency
  Small throughput gains

[Figure: six stages driven by one clock, each with 5 ns of combinational logic followed by a 3 ns register.]

Delay = 6 * 8 ns = 48 ns
Throughput = 1 / 8 ns = 125 MHz
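The diminishing returns show up clearly when the stage count is swept. This is a sketch under stated assumptions: 30 ns of total logic that divides evenly, a fixed 3 ns register per stage, and a 30-stage case that is purely hypothetical.

```python
LOGIC_NS = 30.0  # total combinational logic, as in the earlier examples
REG_NS = 3.0     # register delay added to every stage


def deep_pipeline_table(stage_counts):
    """Return (stages, delay_ns, throughput_mhz) for each stage count."""
    rows = []
    for n in stage_counts:
        clock_ns = LOGIC_NS / n + REG_NS
        rows.append((n, clock_ns * n, 1000.0 / clock_ns))
    return rows


rows = deep_pipeline_table([1, 3, 6, 30])
for n, delay_ns, mhz in rows:
    print(f"{n:2d} stages: delay {delay_ns:6.1f} ns, throughput {mhz:6.1f} MHz")
```

With 6 stages the clock is 5 + 3 = 8 ns, so throughput is 1000 / 8 = 125 MHz. Going from 6 to 30 stages only doubles throughput (125 to 250 MHz) while the delay grows from 48 ns to 120 ns. However many stages are added, throughput in this model cannot exceed 1 / 3 ns, roughly 333 MHz, because the register delay never shrinks.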

Limitation: Sequential Dependencies

Op4 gets its result from Op1.
Pipeline hazard

[Figure: three-stage pipeline (combinational logic and register per stage, one clock) with Op1 through Op4 overlapping on the time axis.]

Structural Hazard

Sometimes called a resource conflict.

Example: some pipelined machines share a single memory pipeline for data and instructions. As a result, when an instruction contains a data memory reference, it conflicts with the instruction reference for a later instruction.

Solutions to Structural Hazards

Resource duplication, for example:
  Separate I and D caches for memory access conflicts
  Time-multiplexed or multi-port register file for register file access conflicts

Data Hazard

Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined machine.
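The most common case, a read that follows a write to the same register while the writer is still in the pipeline, can be detected with a simple scan. Everything in this sketch is hypothetical: the `Instr` record, the sample program, and the `window` parameter, which stands in for how many later instructions can overlap with a producer.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                # assembly-like text, for display only
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read


def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) for every read of
    a register written by one of the previous `window` instructions.

    Simplification: any matching producer inside the window is reported,
    even if a nearer instruction overwrites the same register.
    """
    hazards = []
    for j, consumer in enumerate(program):
        for i in range(max(0, j - window), j):
            producer = program[i]
            if producer.dest is not None and producer.dest in consumer.srcs:
                hazards.append((i, j, producer.dest))
    return hazards


program = [
    Instr("add r1, r2, r3", "r1", ("r2", "r3")),
    Instr("sub r4, r1, r5", "r4", ("r1", "r5")),
    Instr("and r6, r7, r8", "r6", ("r7", "r8")),
    Instr("or  r9, r1, r4", "r9", ("r1", "r4")),
]

for i, j, reg in raw_hazards(program):
    print(f"'{program[j].text}' reads {reg} written by '{program[i].text}'")
```

With a window of 2, the `or` instruction is flagged for `r4` but not for `r1`, because the `add` that wrote `r1` is assumed to have left the pipeline by then.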

Solutions to Data Hazards

Freezing the pipeline
(Internal) forwarding
Compiler scheduling

Control (Branch) Hazards

Caused by branches.
The fetch of the next instruction has to wait until the target (including the branch condition) of the current branch instruction is resolved.

Solutions to Control Hazards

Optimized branch processing
  1. Find out whether the branch is taken or not early → simplified branch condition
  2. Compute the branch target address early → extra hardware

Branch prediction
  Predict the next target address and, if wrong, flush all the speculatively fetched instructions from the pipeline.

Delayed branch
  Pipeline stall to delay the fetch of the next instruction.

Summary

Pipelining overlaps the execution of multiple instructions.

With an ideal pipeline, the CPI (cycles per instruction) is one, and the speedup is equal to the number of stages in the pipeline.

However, several factors prevent us from achieving the ideal speedup, including:
  Not being able to divide the pipeline evenly
  The time needed to fill and drain the pipeline
  Overhead needed for pipelining
  Structural, data, and control hazards

Summary

Just overlap tasks; this is easy if the tasks are independent.

Speedup vs. pipeline depth; if ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline Depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle Unpipelined}}{\text{Clock Cycle Pipelined}}$$

Hazards limit performance on computers:
  Structural: need more HW resources
  Data: need forwarding, compiler scheduling
  Control: discuss next time
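The speedup formula (pipeline depth divided by 1 plus the pipeline stall CPI, times the ratio of unpipelined to pipelined clock cycle) is easy to explore numerically. In this sketch the function name and the sample stall values are hypothetical; only the formula itself comes from the slide.

```python
def pipeline_speedup(depth, stall_cpi,
                     clock_unpipelined=1.0, clock_pipelined=1.0):
    """Speedup = depth / (1 + stall CPI) * (unpipelined clock / pipelined clock).

    Assumes an ideal CPI of 1, as the slide does.
    """
    return depth / (1.0 + stall_cpi) * (clock_unpipelined / clock_pipelined)


# Hypothetical stall rates for a 5-stage pipeline with equal clock cycles.
for stall_cpi in (0.0, 0.25, 1.0):
    print(f"stall CPI {stall_cpi:4.2f}: "
          f"speedup {pipeline_speedup(5, stall_cpi):.2f}")
```

With no stalls the speedup equals the pipeline depth; an average of one stall cycle per instruction halves it.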
