Computer Architecture: Pipelining

A. Pipelining: Basic Concepts

What is pipelining?

How is pipelining implemented?

What makes pipelining hard to implement?


What is Pipelining?

Pipelining:

“A technique designed into some computers to increase speed by starting the execution of one instruction before completing the previous one.”

----Modern English-Chinese Dictionary

An implementation technique whereby different instructions are overlapped in execution at the same time; an implementation technique to make fast CPUs.


It is like an auto assembly line: an arrangement of workers, machines, and equipment in which the product being assembled passes consecutively from operation to operation until completed.

Ford installed the first moving assembly line in 1913. The picture on the right shows the moving assembly line at Ford Motor Company's Michigan plant (84 distinct steps).


Trucking gas from depot to gas station

The steps:
Get the barrels
Load them into the truck
Drive to the gas station
Unload the gas
Return for more oil

Let's do the math:
Each truck can carry 5 barrels
A truck can be loaded with 5 barrels in 1 hour
It takes each truck 1 day to drive to and from the gas station
How many barrels per week are delivered?


Looks a Lot Like a Multi-cycle Processor

What are the steps?
Fetch an instruction (Get the barrels)
Decode the instruction (Load them into the truck)
ALU OP (Drive to the gas station)
Memory Access (Unload the gas)
Write-back (Return for more oil)


A better way, but dangerous

Roll the barrels down the road: a big fire hazard.


Big idea: Build a pipeline

Now let's do the math:
The pipeline can accept 1 barrel every hour
How many barrels get delivered to the gas station per day?


Trucking vs. Pipelines

Trucks: a truck with 5 barrels takes 1 day to drive to and from the gas station, plus 2 hours for loading and unloading. There is LOTS of time when the loading area, the gas station, and pieces of the road are unused.

Pipelines: the pipeline can accept 1 barrel every hour, and the resources (loading area, gas station, pipeline) are always in use. A quick calculation is sketched below.
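A rough back-of-the-envelope comparison using the slide's numbers (the 26-hour trip time and round-the-clock operation are assumptions for illustration):

```python
HOURS_PER_WEEK = 7 * 24

# Trucking: 5 barrels per trip; 1 h load + 24 h round trip + 1 h unload
# (assuming one truck working around the clock).
hours_per_trip = 1 + 24 + 1
truck_barrels_per_week = 5 * (HOURS_PER_WEEK / hours_per_trip)   # ~32 barrels

# Pipeline: accepts 1 barrel every hour, and the resources are never idle.
pipe_barrels_per_week = 1 * HOURS_PER_WEEK                       # 168 barrels

print(f"truck:    {truck_barrels_per_week:.0f} barrels/week")
print(f"pipeline: {pipe_barrels_per_week:.0f} barrels/week")
```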


Why Pipelining? It's Natural: Laundry

Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold

Washer takes 30 minutes
Dryer takes 40 minutes
"Folder" takes 20 minutes



Sequential Laundry

Sequential laundry takes 6 hours for 4 loads.
If they learned pipelining, how long would laundry take?

[Timing figure: loads A, B, C, and D run one after another from 6 PM to midnight; each load takes 30 + 40 + 20 minutes (wash, dry, fold), 6 hours in total.]


Pipelined Laundry--Start work ASAP

Pipelined laundry takes 3.5 hours for 4 loads

[Timing figure: loads A, B, C, and D overlap; a new load starts every 40 minutes (stage boundaries at 30, 40, 40, 40, 40, 20 minutes), and all four loads finish by 9:30 PM.]


Why pipelining: overlap

Latches, called pipeline registers, break the computation into 5 stages, so 5 tasks can be dealt with at the same time.

Without pipelining, only one task is dealt with at a time, and that task takes "such a long time."


Why pipelining: faster

In the unpipelined structure, a new computation can be "launched" every 100 ns, so it can finish 10^7 computations per second.

In the pipelined structure, a new computation can be launched every 20 ns, so it can finish 5×10^7 computations per second.


Why pipelining : conclusion

The key implementation technique used to make fast CPUs: it decreases CPU time.

It improves throughput (rather than individual instruction execution time).

It improves the efficiency of resources (functional units).


What is a pipeline?
A pipeline is like an auto assembly line.
A pipeline has many stages.
Each stage carries out a different part of an instruction or operation.
The stages, which operate on a synchronized clock, are connected to form a pipe.
An instruction or operation enters through one end, progresses through the stages, and exits through the other end.
Pipelining is an implementation technique that exploits parallelism among the instructions in a sequential instruction stream.


Pipeline Characteristics: Latency vs. Throughput

Latency: each instruction takes a certain time to complete. This is the latency for that operation. It's the amount of time between when the instruction is issued and when it completes.

Throughput: the number of items (cars, instructions) exiting the pipeline per unit time. The throughput of an assembly line is the number of products completed per hour. The throughput of a CPU pipeline is the number of instructions completed per second.


Pipeline Characteristics: Clock Cycle vs. Machine Cycle

Clock cycle: everything in a CPU moves in lockstep, synchronized by the clock (the heartbeat of the CPU).

Machine cycle (processor cycle, stage time): the time required to complete a single pipeline stage. The pipeline designer's goal is to balance the length of each pipeline stage; in many instances, machine cycle = max(times for all stages). A machine cycle is usually one, sometimes two, clock cycles long, but rarely more.
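A small illustration of that rule (the per-stage delays below are made-up numbers, not from the slides): the slowest stage sets the machine cycle, and time is wasted in the faster stages.

```python
# Hypothetical per-stage delays in ns for IF, ID, EX, MEM, WB.
stage_ns = {"IF": 18, "ID": 14, "EX": 20, "MEM": 20, "WB": 12}

machine_cycle = max(stage_ns.values())      # 20 ns: the slowest stage sets the pace
latency = machine_cycle * len(stage_ns)     # 100 ns for one instruction to pass through
throughput = 1 / (machine_cycle * 1e-9)     # instructions per second, ideally

print(machine_cycle, latency, throughput)   # 20, 100, 5e7
```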


Performance for Pipelining


Ideal Performance for Pipelining

If the stages are perfectly balanced, the time per instruction on the pipelined processor equals:

Time per instruction (pipelined) = Time per instruction on the unpipelined machine / Number of pipe stages

So the ideal speedup equals the number of pipe stages.
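A minimal sketch of this ideal relationship (assuming perfectly balanced stages and no pipelining overhead):

```python
def ideal_pipeline(time_per_instr_unpipelined_ns, n_stages):
    """Ideal time per instruction and speedup for perfectly balanced stages."""
    time_pipelined = time_per_instr_unpipelined_ns / n_stages
    speedup = time_per_instr_unpipelined_ns / time_pipelined   # == n_stages
    return time_pipelined, speedup

print(ideal_pipeline(100, 5))   # (20.0, 5.0)
```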


Why not just make a 50-stage pipeline?

Some computations just won’t divide into any finer (shorter in time) logical implementation.

5 stages OK

50 stages NO. Sorry!


Why not just make a 50-stage pipeline ?

Those latches are NOT free: they take up area, and there is a real delay to go THRU the latch itself.

Machine cycle > latch latency + clock skew
In modern, deep pipelines (10-20 stages), this is a real effect.
We typically see logic "depths" of 10-20 "gates" in one pipe stage.

At these speeds, and with so few levels of logic, latch delay is important.
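A rough sketch of why that overhead puts a floor under the machine cycle (the 100 ns of total logic and the 1 ns of latch delay plus skew per stage are assumed numbers for illustration, not from the slides):

```python
TOTAL_LOGIC_NS = 100.0   # total logic delay of the whole computation (assumed)
OVERHEAD_NS = 1.0        # latch delay + clock skew added per stage (assumed)

for stages in (1, 5, 10, 20, 50):
    cycle = TOTAL_LOGIC_NS / stages + OVERHEAD_NS   # best-case cycle time per stage
    speedup = (TOTAL_LOGIC_NS + OVERHEAD_NS) / cycle
    print(f"{stages:2d} stages: cycle = {cycle:6.2f} ns, speedup = {speedup:5.2f}")

# Speedup flattens out well below 50x: with 50 stages, the cycle is dominated
# by the fixed per-stage overhead rather than by useful logic.
```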


How Many Pipeline Stages? E.g., Intel:

Pentium III, Pentium 4: 20+ stages
More than 20 instructions in flight
High clock frequency (>1 GHz)
High IPC

Too many stages:
Lots of complications
Must take care of possible dependencies among in-flight instructions
Control logic is huge


Simple implementation of a RISC (MIPS)

Start with Implementation without pipelining

single-cycle implementation
multi-cycle implementation

Pipelining the RISC Instruction Set
Pipelining performance issues
How can we do it efficiently?
Examples


MIPS ISA


MIPS instruction format
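As a reminder of the standard MIPS32 field layout (a minimal sketch; the opcode and funct values shown are the usual ones for add and lw, and the register numbers follow the standard $t0=8, $t1=9, $t2=10, $sp=29 convention):

```python
def encode_r(opcode, rs, rt, rd, shamt, funct):
    """R-format: opcode(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def encode_i(opcode, rs, rt, imm):
    """I-format: opcode(6) rs(5) rt(5) immediate(16)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

# add $t2, $t0, $t1  ->  R-type, opcode 0, funct 0x20
print(hex(encode_r(0, 8, 9, 10, 0, 0x20)))   # 0x1095020

# lw $t0, 4($sp)     ->  I-type, opcode 0x23
print(hex(encode_i(0x23, 29, 8, 4)))         # 0x8fa80004
```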


9 Instructions


R-instr. Add / Sub


R-Instruction: And/Or


I-Instruction: LW/SW


Branch / Jump


SLT: set when less than


Single-cycle implementation

Data Path for R-instruction


R-instruction Data Path


What’s the data path for I-instruction (R-I, LW/SW/BEQ)?


Data Path for LW


Data Path for SW


Data Path for BEQ


What’s the data path for J-instruction (jump)?


Data Path for Jump


Single-cycle implementation

seldom used !


Let’s have a break


Single cycle vs. multiple cycle


Finite State Diagram


Multiple Cycle MIPS CPU


Multiple Cycle MIPS CPU


5-cycle MIPS CPU


Multi-cycle implementation


About Multi-cycle implementation

Temporary storage locations were added to the datapath of the unpipelined machine to make it easy to pipeline. Note that branch and store instructions take 4 clock cycles.

Assuming a branch frequency of 12% and a store frequency of 10%, the CPI is 4.78.

This implementation is not optimal.


How to improve the performance ?

For a possible branch, do the equality test and compute the possible branch target (by adding the sign-extended offset to the incremented PC) earlier, in ID. Also, complete ALU instructions during the MEM cycle. Then branch instructions take only 2 cycles, store and ALU instructions take 4 cycles, and load instructions take the longest time, 5 cycles. CPI drops to 4.07, assuming a 47% ALU operation frequency.

CPI = 2×12% + 4×(10% + 47%) + 5×31% = 4.07
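These CPI figures follow directly from the instruction mix; a quick check (the frequencies are the ones assumed on the slides, with loads making up the remaining 31%):

```python
# Assumed instruction mix: 12% branch, 10% store, 47% ALU, 31% load.
freq = {"branch": 0.12, "store": 0.10, "alu": 0.47, "load": 0.31}

# Original multi-cycle design: branches and stores take 4 cycles, everything else 5.
cycles_before = {"branch": 4, "store": 4, "alu": 5, "load": 5}
cpi_before = sum(freq[i] * cycles_before[i] for i in freq)   # 4.78

# Improved design: branches resolved in ID (2 cycles), ALU ops finish in MEM (4 cycles).
cycles_after = {"branch": 2, "store": 4, "alu": 4, "load": 5}
cpi_after = sum(freq[i] * cycles_after[i] for i in freq)     # 4.07

print(cpi_before, cpi_after)
```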


Optimized Multi-cycle implementation

[Datapath figure: PC, instruction memory, register file, sign extender, multiplexers, ALU, zero test, and data memory, with the temporary storage locations (NPC, IR, A, B, Imm, ALU output, LMD) spanning the IF, ID, EX, MEM, and WB stages.]


Improvement: reducing hardware redundancy

ALU can be shared.

Data and instruction memory can be combined since access occurs on different clock cycles.


Pipelining the MIPS instruction set

Since there are five separate stages, we can have a pipeline in which one instruction is in each stage. CPI decreases to 1, since one instruction is issued (or finished) each cycle. During any cycle, one instruction is present in each stage.

Ideally, performance is increased five-fold!
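A tiny sketch of what "one instruction in each stage" looks like cycle by cycle (purely illustrative; the instruction stream is hypothetical):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
instrs = ["lw", "sw", "add", "sub", "and"]   # hypothetical instruction stream

# Instruction i occupies stage s during cycle i + s; one new instruction issues per cycle.
for cycle in range(len(instrs) + len(STAGES) - 1):
    row = []
    for i, name in enumerate(instrs):
        s = cycle - i
        if 0 <= s < len(STAGES):
            row.append(f"{name}:{STAGES[s]}")
    print(f"cycle {cycle + 1}:", "  ".join(row))
```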


5-stage Version of the MIPS Datapath

[Figure: the datapath divided into five stages by pipeline registers, or latches, with the load and store paths marked.]

Why latches?


Single-cycle implementation vs. pipelining

Single-cycle implementation: CPI = 1, but a long clock cycle.

[Timing figure: each instruction (Load, then Store) occupies one long cycle, and shorter instructions waste part of it.]

Pipeline implementation: CPI = 1, clock cycle ≈ long clock cycle / 5.

[Timing figure: Load, Store, and R-type instructions overlap, each passing through IF, ID, EX, MEM, and WB in successive cycles.]


Multi-cycle implementation vs. pipelining

Multi-cycle implementation: CPI = 5.

[Timing figure: Load, Store, and R-type execute serially, each passing through IF, ID, EX, MEM, and WB over 5 cycles before the next begins.]

Pipeline implementation: CPI = 1.

[Timing figure: Load, Store, and R-type overlap, with a new instruction entering the pipeline each cycle.]


How does pipelining decrease execution time?

If your starting point is a single clock cycle per instruction machine, then pipelining decreases the cycle time.

If your starting point is a multiple clock cycle per instruction machine, then pipelining decreases the CPI.
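Either way, CPU time = instruction count × CPI × cycle time drops by roughly the pipeline depth. A quick sketch with assumed numbers (the instruction count, 10 ns single-cycle clock, and 2 ns pipelined clock are illustrative values, not from the slides):

```python
def cpu_time(instr_count, cpi, cycle_ns):
    return instr_count * cpi * cycle_ns

N = 1_000_000   # assumed instruction count

# Single-cycle baseline: pipelining keeps CPI = 1 but cuts the cycle time.
single = cpu_time(N, cpi=1, cycle_ns=10)
# Multi-cycle baseline: pipelining keeps the cycle time but cuts CPI to 1.
multi  = cpu_time(N, cpi=5, cycle_ns=2)
piped  = cpu_time(N, cpi=1, cycle_ns=2)

print(single / piped, multi / piped)   # ~5x speedup from either starting point
```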


performance issues in pipelining

Latency: the execution time of each instruction does not decrease with pipelining; in fact, it is always longer than on the unpipelined machine.
Imbalance among stages reduces performance.
Overhead from register (latch) delay and clock skew also contributes to the lower limit on the machine cycle.
Pipeline hazards are the major hurdle of pipelining; they prevent the machine from reaching the ideal performance.
Time to "fill" the pipeline and time to "drain" it reduce the speedup.


Is it really as simple as this?

[Figure: the 5-stage datapath again, with the pipeline registers (latches) and the load and store paths marked. Why do we need to add this line?]


Problems that pipelining introduces

Focus: no two different operations may use the same datapath resource on the same clock cycle (structural hazard). There is a conflict over the memory!

[Pipeline diagram: with a single memory, the Ld/St instruction's MEM access and a later instruction's IF both need the memory in the same clock cycle.]


There is also a conflict over the registers!


A conflict also occurs when the PC is updated

The PC must be incremented and stored every clock. What happens when we meet a branch?

Branches change the value of the PC, but the condition is not evaluated until ID! If the branch is taken, the instructions fetched behind the branch are invalid!

This is clearly a serious problem (a control hazard) that needs to be addressed. We will deal with it later.


Must latches be engaged? Yes!

They ensure that instructions in different stages do not interfere with one another. Only through the latches can the stages be connected one after another to form a pipeline. The latches are the pipeline registers, and there are many more of them than in the multi-cycle version:

IR: IF/ID.IR; ID/EX.IR; EX/DM.IR; DM/WB.IR
B: ID/EX.B; EX/DM.B
ALUoutput: EX/DM.ALUoutput; DM/WB.ALUoutput

Any value needed on a later stage must be placed in a register and copied from one register to the next, until it is no longer needed.
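A minimal sketch of that "copy it forward until it is no longer needed" rule, tracking just the IR field (the register names follow the slide's IF/ID, ID/EX, EX/DM, DM/WB naming; the loop is illustrative, not a full simulator):

```python
# Each pipeline register holds what its downstream stage will need; on every
# clock edge the values move one register downstream, oldest first, so nothing
# is overwritten before it has been copied.
pipe = {"IF/ID": None, "ID/EX": None, "EX/DM": None, "DM/WB": None}

def clock_edge(fetched_ir):
    pipe["DM/WB"] = pipe["EX/DM"]   # instruction leaving MEM, about to write back
    pipe["EX/DM"] = pipe["ID/EX"]
    pipe["ID/EX"] = pipe["IF/ID"]
    pipe["IF/ID"] = fetched_ir      # newly fetched instruction

for ir in ["lw", "sw", "add", "sub", "and"]:   # hypothetical instruction stream
    clock_edge(ir)
    print(pipe)
```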


Separate instruction and data memories: use split instruction and data caches.

The memory system must deliver 5 times the bandwidth of the unpipelined version.

[Pipeline diagram: with separate instruction memory (IM) and data memory (DM), the Ld/St instruction and the following instructions each pass through IM, Reg, ALU, DM, and Reg without conflicting over a single memory.]


Pipeline hazard: the major hurdle

A hazard is a condition that prevents an instruction in the pipe from executing its next scheduled pipe stage

Taxonomy of hazards

Structural hazards: conflicts over hardware resources.

Data hazards: an instruction depends on the result of a prior computation that is not ready (computed or stored) yet.

Control hazards: the branch condition and the branch target PC are not available in time to fetch an instruction on the next clock.


Hazards can always be resolved by stalling

The simplest way to "fix" hazards is to stall the pipeline. A stall suspends the pipeline for some instructions by one or more clock cycles. The stall delays all instructions issued after the instruction that was stalled, while instructions issued earlier proceed as usual. A pipeline stall is also called a pipeline bubble, or simply a bubble. No new instructions are fetched during a stall.


Performance of pipeline with stalls

Pipeline stalls decrease performance from the ideal. Recall the speedup formula:
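Written out, the standard form of the formula (the one the two cases below plug into) is:

Speedup = (Average instruction time unpipelined) / (Average instruction time pipelined)
        = (CPI unpipelined × Clock cycle unpipelined) / (CPI pipelined × Clock cycle pipelined)

where CPI pipelined = ideal CPI + pipeline stall cycles per instruction = 1 + pipeline stall cycles per instruction.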


Assumptions for the calculation

The ideal CPI on a pipelined processor is almost always 1 (it may be less than or greater than 1). So:

Ignore the clock-cycle overhead of pipelining.
Pipe stages are ideally balanced.
Two different unpipelined implementations to compare against:

single-cycle implementation
multi-cycle implementation


Case of single-cycle implementation

CPI unpipelined = 1

Clock cycle pipelined = Clock cycle unpipelined / pipeline depth


Case of multi-cycle implementation

Clock cycle unpipelined = Clock cycle pipelined
CPI unpipelined = pipeline depth
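Putting the two cases together with the speedup formula (a sketch; the pipeline depth of 5 and the 0.2 stall cycles per instruction are assumed example values):

```python
def speedup(cpi_unpiped, cycle_unpiped, cpi_piped, cycle_piped):
    return (cpi_unpiped * cycle_unpiped) / (cpi_piped * cycle_piped)

depth = 5
stalls_per_instr = 0.2            # assumed average stall cycles per instruction
cpi_piped = 1 + stalls_per_instr  # ideal CPI of 1 plus stalls

# Single-cycle baseline: CPI = 1, but the pipelined clock is 'depth' times faster.
print(speedup(1, depth, cpi_piped, 1))    # = depth / (1 + stalls) ≈ 4.17

# Multi-cycle baseline: same clock, but the unpipelined CPI equals the pipeline depth.
print(speedup(depth, 1, cpi_piped, 1))    # = depth / (1 + stalls) ≈ 4.17
```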


That’s all for today !
