Pipelining
Hwanmo Sung
CS147 Presentation
Professor Sin-Min Lee
Pipelining
What is pipelining?
An implementation technique that overlaps the execution of multiple instructions.
Any architecture in which digital information flows through a series of stations that each inspect, interpret or modify the information.
Example(1): Laundry
Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes
Loads: A, B, C, D
Sequential Laundry
[Figure: task order A, B, C, D against time from 6 PM to midnight; each load takes 30 + 40 + 20 minutes and starts only after the previous load has finished]
Sequential laundry takes 6 hours for 4 loads
Pipelined Laundry: Start work ASAP
[Figure: task order A, B, C, D against time from 6 PM to midnight; the overlapped schedule runs 30, 40, 40, 40, 40, 20 minutes]
Pipelined laundry takes 3.5 hours for 4 loads
Speedup = 6/3.5 = 1.7
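The two laundry totals can be reproduced with a small scheduling sketch. It assumes a finished load can wait between machines until the next machine is free; the function names are illustrative and not part of the slides.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Start each stage as soon as both the load and the machine are free."""
    # free_at[s] is the time at which machine s finishes its current load.
    free_at = [0] * len(stage_minutes)
    for _ in range(loads):
        ready = 0  # time at which this load has left the previous stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, free_at[s])
            ready = start + minutes
            free_at[s] = ready
    return free_at[-1]


stages = [30, 40, 20]  # washer, dryer, "folder"
seq = sequential_minutes(stages, 4)
pipe = pipelined_minutes(stages, 4)
print(seq / 60, pipe / 60, round(seq / pipe, 1))  # hours, hours, speedup
```

The pipelined total is 30 + 4 x 40 + 20 minutes: after the first wash, the 40-minute dryer sets the pace.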
Pipelining Lessons
Pipelining doesn’t help latency of a single task, it helps throughput of the entire workload
Pipeline rate limited by slowest pipeline stage
Multiple tasks operating simultaneously
Potential speedup = Number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to “fill” pipeline and time to “drain” it reduce speedup
Computer Pipelines
Execute billions of instructions, so throughput is what matters
RISC desirable features: all instructions same length, registers located in same place in instruction format, memory operands only in loads or stores
Unpipelined Design
Single-cycle implementation
  The cycle time depends on the slowest instruction
  Every instruction takes the same amount of time
Multi-cycle implementation
  Divide the execution of an instruction into multiple steps
  Each instruction may take variable number of steps (clock cycles)
Unpipelined System
[Figure: Comb. Logic (30ns) feeding REG (3ns), driven by Clock; timeline shows Op1, Op2, Op3 running one after another]
One operation must complete before next can begin
Operations spaced 33ns apart
Pipelined Design
Divide the execution of an instruction into multiple steps (stages)
Overlap the execution of different instructions in different stages
  Each cycle, a different instruction is executed in each stage
  For example, in a 5-stage pipeline (Fetch-Decode-Read-Execute-Write), 5 instructions are executed concurrently in 5 different pipeline stages
Complete the execution of one instruction every cycle (instead of every 5 cycles)
  Can increase the throughput of the machine 5 times
Example(1): 3 Stage Pipeline
Space operations 13ns apart
3 operations occur simultaneously
[Figure: three stages in series, each Comb. Logic (10ns) followed by REG (3ns), driven by a shared Clock; timeline shows Op1 through Op4 overlapping]
Delay = 39ns
Throughput = 77MHz
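The delay and throughput figures follow from the stage delays. A minimal sketch, assuming the clock period equals the slowest logic delay plus the register delay (the function name is illustrative):

```python
def pipeline_timing(logic_ns, reg_ns):
    """Return (clock period in ns, delay in ns, throughput in MHz)."""
    period = max(logic_ns) + reg_ns    # clock must cover the slowest stage plus its register
    delay = period * len(logic_ns)     # an operation passes through every stage
    throughput_mhz = 1000.0 / period   # one operation completes per clock period
    return period, delay, throughput_mhz


unpipelined = pipeline_timing([30], 3)          # one 30ns block of logic
three_stage = pipeline_timing([10, 10, 10], 3)  # the same logic split three ways
print(unpipelined)
print(three_stage)
```

Splitting the logic raises the delay from 33ns to 39ns, because each added register costs 3ns, while throughput goes from about 30MHz to about 77MHz.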
Example(2)
5 stage pipeline: Fetch – Decode – Read – Execute – Write
[Figure: F D R E W stage sequences for five instructions, with labels marking the cycles spent filling the pipeline and draining the pipeline]
Non-pipelined processor: 25 cycles = number of instrs (5) * number of stages (5)
Pipelined processor: 9 cycles = start-up latency (4) + number of instrs (5)
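The cycle counts can be checked, and the staggered schedule redrawn, with a short sketch. It assumes no stalls, and the helper names are illustrative.

```python
STAGES = "FDREW"  # Fetch, Decode, Read, Execute, Write


def non_pipelined_cycles(instrs, stages=len(STAGES)):
    return instrs * stages


def pipelined_cycles(instrs, stages=len(STAGES)):
    # (stages - 1) cycles to fill the pipeline, then one instruction completes per cycle.
    return (stages - 1) + instrs


def diagram(instrs):
    """One row per instruction, one column per cycle; '.' marks an idle slot."""
    total = pipelined_cycles(instrs)
    rows = []
    for i in range(instrs):
        row = ["."] * total
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i enters stage s in cycle i + s
        rows.append(" ".join(row))
    return rows


print(non_pipelined_cycles(5), pipelined_cycles(5))
print("\n".join(diagram(5)))
```

Each row starts one cycle after the row above it, so the last of the five instructions finishes in cycle 9.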
Basic Performance Issues in Pipelining
Pipelining increases the CPU instruction throughput (the number of instructions completed per unit of time), but it does not reduce the execution time of an individual instruction.
Pipeline Speedup Example
Assume the multi-cycle implementation has a 10-ns clock cycle, loads take 5 clock cycles and account for 40% of the instructions, and all other instructions take 4 clock cycles.
If pipelining the machine adds 1 ns to the clock cycle, how much speedup in instruction execution rate do we get from pipelining?
MC Ave Instr. Time = Clock cycle x Average CPI
  = 10 ns x (0.6 x 4 + 0.4 x 5)
  = 44 ns
PL Ave Instr. Time = 10 + 1 = 11 ns
Speedup = 44 / 11 = 4
This ignores time needed to fill & empty the pipeline and delays due to hazards.
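The same arithmetic as a sketch, assuming an ideal pipelined CPI of 1 as the example does. Variable names are illustrative.

```python
def average_cpi(mix):
    """mix is a list of (fraction of instructions, cycles per instruction) pairs."""
    return sum(fraction * cycles for fraction, cycles in mix)


multi_cycle_clock_ns = 10
pipelined_clock_ns = multi_cycle_clock_ns + 1  # pipelining adds 1 ns to the clock

# 60% of instructions take 4 cycles, 40% (loads) take 5 cycles.
mc_time_ns = multi_cycle_clock_ns * average_cpi([(0.6, 4), (0.4, 5)])
pl_time_ns = pipelined_clock_ns * 1            # ideal pipeline: one instruction per cycle
speedup = mc_time_ns / pl_time_ns
print(round(mc_time_ns, 3), pl_time_ns, round(speedup, 3))
```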
What makes it easy?
all instructions are the same length
just a few instruction formats
memory operands appear only in loads and stores
It’s not Easy for Computers
Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
  Structural hazards: Hardware cannot support this combination of instructions - two instructions need the same resource.
  Data hazards: Instruction depends on result of prior instruction still in the pipeline
  Control hazards: Pipelining of branches & other instructions that change the PC
Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline
Limitation: Nonuniform Pipelining
[Figure: three stages driven by a shared Clock, with Comb. Logic delays of 5ns, 15ns and 10ns, each followed by REG (3ns)]
Throughput limited by slowest stage
Delay determined by clock period * number of stages
Must attempt to balance stages
Delay = 18 * 3 = 54 ns
Throughput = 55MHz
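To see what the imbalance costs, the sketch below compares the 5/15/10ns split with an even 10/10/10ns split of the same 30ns of logic. The even split is a hypothetical comparison case, and the function names are illustrative.

```python
def clock_period_ns(logic_ns, reg_ns):
    # The clock must accommodate the slowest stage plus the register delay.
    return max(logic_ns) + reg_ns


def report(logic_ns, reg_ns=3):
    period = clock_period_ns(logic_ns, reg_ns)
    return {
        "period_ns": period,
        "delay_ns": period * len(logic_ns),
        "throughput_mhz": round(1000 / period, 1),
    }


uneven = report([5, 15, 10])     # stage delays from the slide
balanced = report([10, 10, 10])  # same total logic, split evenly (hypothetical)
print(uneven)
print(balanced)
```

The uneven pipeline runs at an 18ns clock, about 55.6MHz (the slide gives 55MHz), against 13ns for the balanced one.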
Limitation: Deep Pipelines
Diminishing returns as more pipeline stages are added
  Register delays become limiting factor
  Increased latency
  Small throughput gains
[Figure: six stages driven by a shared Clock, each Com. Log. (5ns) followed by REG (3ns)]
Delay = 48ns, Throughput = 125MHz
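The diminishing returns show up when the same 30ns of logic is cut into more and more stages. This sketch assumes perfectly balanced stages and a fixed 3ns register delay per stage; depths other than 1, 3 and 6 are hypothetical.

```python
TOTAL_LOGIC_NS = 30  # total combinational delay, as in the unpipelined system
REG_NS = 3           # register delay added per stage


def sweep(depths):
    rows = []
    for n in depths:
        period = TOTAL_LOGIC_NS / n + REG_NS  # assumes perfectly balanced stages
        rows.append((n, period, period * n, 1000 / period))
    return rows


for n, period, delay, mhz in sweep([1, 3, 6, 10, 30]):
    print(f"{n:2d} stages: period {period:5.1f}ns, delay {delay:5.1f}ns, throughput {mhz:6.1f}MHz")
```

With six stages the period is 5 + 3 = 8ns, so the delay is 48ns and the throughput is 1/8ns = 125MHz. However deep the pipeline, the period cannot drop below the 3ns register delay, while the delay keeps growing.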
Limitation: Sequential Dependencies
Op4 gets result from Op1
Pipeline Hazard
[Figure: three stages of Comb. Logic and REG driven by a shared Clock; timeline shows Op1 through Op4, with Op4 needing the result of Op1]
Structural Hazard
Sometimes called Resource Conflict.
Example: Some pipelined machines share a single memory pipeline for data and instructions. As a result, when an instruction contains a data memory reference, it will conflict with the instruction reference for a later instruction.
Solutions to Structural Hazard
Resource Duplication, for example:
  Separate I and D caches for memory access conflict
  Time-multiplexed or multi-port register file for register file access conflict
Data Hazard
Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined machine
Solutions to Data Hazard
Freezing the pipeline
(Internal) Forwarding
Compiler scheduling
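The three approaches can be compared with a simplified in-order model of the 5-stage F-D-R-E-W pipeline. It rests on assumptions that are not in the slides: operands are read in Read, results are written back in Write, a register cannot be read in the same cycle it is written, and with forwarding a result is usable by an Execute stage in the cycle after it is computed. The function and register names are illustrative.

```python
READ, EXECUTE, WRITE = 2, 3, 4  # stage indices in F-D-R-E-W


def stall_cycles(program, forwarding):
    """program: list of (dest_register, [source_registers]). Returns total bubbles."""
    ready = {}  # register -> first cycle in which a consumer may use its value
    fetch = 0   # cycle in which the current instruction is fetched
    stalls = 0
    need_stage = EXECUTE if forwarding else READ      # where operands are needed
    produce_stage = EXECUTE if forwarding else WRITE  # where results become visible
    for dest, sources in program:
        earliest = max([ready.get(r, 0) for r in sources], default=0)
        bubble = max(0, earliest - (fetch + need_stage))
        stalls += bubble
        fetch += bubble  # freezing delays this instruction...
        ready[dest] = fetch + produce_stage + 1
        fetch += 1       # ...and every instruction behind it
    return stalls


dependent = [("r1", ["r2", "r3"]), ("r4", ["r1", "r5"])]
scheduled = [("r1", ["r2", "r3"]), ("r6", ["r7"]), ("r8", ["r9"]), ("r4", ["r1", "r5"])]
print(stall_cycles(dependent, forwarding=False))  # freeze until the value is written
print(stall_cycles(dependent, forwarding=True))   # forward the result between stages
print(stall_cycles(scheduled, forwarding=False))  # independent work fills the gap
```

Under these assumptions, freezing costs two bubbles for a back-to-back dependence. Forwarding removes them, and so does moving two independent instructions between producer and consumer.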
Control (Branch) Hazards
Caused by branches
Instruction fetch of the next instruction has to wait until the target (including the branch condition) of the current branch instruction is resolved
Solutions to Control Hazard
Optimized branch processing
  1. Find out branch taken or not early → simplified branch condition
  2. Compute branch target address early → extra hardware
Branch prediction
  - Predict the next target address (branch prediction) and if wrong, flush all the speculatively fetched instructions from the pipeline
Delayed branch
  - Pipeline stall to delay the fetch of the next instruction
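A rough way to weigh these options is the average number of cycles lost per instruction. The sketch below uses a common first-order model, CPI = 1 + branch fraction x misprediction rate x penalty. The workload numbers are hypothetical and chosen only for illustration.

```python
def effective_cpi(branch_fraction, penalty_cycles, mispredict_rate=1.0):
    """Ideal CPI of 1 plus the average stall cycles lost to branches."""
    return 1.0 + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% branches, 3 cycles lost whenever fetch went the wrong way.
always_stall = effective_cpi(0.20, 3)                     # stall on every branch
predicted = effective_cpi(0.20, 3, mispredict_rate=0.10)  # flush only on a wrong prediction
print(round(always_stall, 2), round(predicted, 2))
```

Resolving the branch earlier lowers `penalty_cycles`, while better prediction lowers `mispredict_rate`.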
Summary
Pipelining overlaps the execution of multiple instructions.
With an ideal pipeline, the CPI (Cycles Per Instruction) is one, and the speedup is equal to the number of stages in the pipeline.
However, several factors prevent us from achieving the ideal speedup, including
  Not being able to divide the pipeline evenly
  The time needed to fill and empty the pipeline
  Overhead needed for pipelining
  Structural, data, and control hazards
Summary
Just overlap tasks, and easy if tasks are independent
Speed Up VS. Pipeline Depth; if ideal CPI is 1, then:
$$\text{Speedup} = \frac{\text{Pipeline Depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle Unpipelined}}{\text{Clock Cycle Pipelined}}$$
Hazards limit performance on computers:
  Structural: need more HW resources
  Data: need forwarding, compiler scheduling
  Control: discuss next time
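The speedup relation in this summary can be evaluated directly. In the sketch below, the depth, stall CPI and clock values are hypothetical inputs, and the function name is illustrative.

```python
def pipeline_speedup(depth, stall_cpi, clock_unpipelined, clock_pipelined):
    """Speedup = depth / (1 + stall CPI) x (unpipelined clock / pipelined clock)."""
    return depth / (1.0 + stall_cpi) * (clock_unpipelined / clock_pipelined)


ideal = pipeline_speedup(5, 0.0, 10, 10)         # no stalls, same clock: speedup = depth
with_stalls = pipeline_speedup(5, 0.25, 10, 10)  # 0.25 stall cycles per instruction
print(ideal, with_stalls)
```

With no stalls and equal clocks, the speedup equals the pipeline depth. Every stall cycle per instruction pulls it below that.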