Recap (Pipelining)


1

Recap (Pipelining)

2

What is Pipelining?

• A way of speeding up execution of tasks

• Key idea: overlap execution of multiple tasks

3

Automobile Manufacturing

1. Build frame. 60 min.

2. Add engine. 50 min.

3. Build body. 80 min.

4. Paint. 40 min.

5. Finish. 45 min.

Total: 275 min.

Latency: Time from start to finish for one car (smaller is better). 275 minutes per car.

Throughput: Number of finished cars per time unit (larger is better). 1 car/275 min = 0.218 cars/hour.

Issues: How can we make the process better by adding?

4

An Assembly line

[Figure: cars 1 to 4 moving through the five stages over time. Stage times are 60, 50, 80, 40 and 45 min, but every stage is paced at 80 min.]

First two stages can't produce faster than one car/80 min or a backlog will occur at third stage.

Last two stages only receive one car/80 min to work on.

Latency: 400 min/car

Throughput: 4 cars/640 min (1 car/160 min). Will approach 1 car/80 min as time goes on.
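The assembly-line numbers can be checked with a short calculation. This is a minimal sketch, assuming every stage advances in lockstep at the pace of the slowest stage; the function name `assembly_line` and the car counts in the loop are illustrative and not part of the original slides.

```python
# Stage times (minutes) from the automobile example.
STAGE_MINUTES = [60, 50, 80, 40, 45]

def assembly_line(stage_minutes, cars):
    """Return (latency, total_time) in minutes for a line paced by its slowest stage."""
    beat = max(stage_minutes)              # every stage advances once per beat
    latency = beat * len(stage_minutes)    # one car passes through every stage
    total = latency + (cars - 1) * beat    # each later car finishes one beat after the previous one
    return latency, total

latency, total = assembly_line(STAGE_MINUTES, cars=4)
print(latency, total, total / 4)           # 400 640 160.0

# Average minutes per car approaches the 80-minute beat as more cars are built.
for cars in (4, 40, 400):
    _, total = assembly_line(STAGE_MINUTES, cars)
    print(cars, round(total / cars, 1))
```

With 4 cars the average is 160 min per car; with 400 cars it is already about 80.8 min per car.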

5

Pipelining a Digital System

• Key idea: break big computation up into pieces

Separate each piece with a pipeline register

[Figure: a 1ns block of logic split into five 200ps pieces, separated by pipeline registers.]

6

Pipelining a Digital System

• Why do this? Because it's faster for repeated computations

Non-pipelined (one 1ns block): 1 operation finishes every 1ns

Pipelined (five 200ps stages): 1 operation finishes every 200ps

7

Comments about pipelining

• Pipelining improves throughput, but does not reduce latency

– Answer available every 200ps, BUT

– A single computation still takes 1ns

• Limitations:

– Computations must be divisible into stages of equal sizes

– Pipeline registers add overhead

8

Another Example

[Figure: Unpipelined System. Comb. Logic (30ns) followed by REG (3ns), driven by Clock. Timeline: Op1, Op2, Op3, ...]

Delay = 33ns

Throughput = 30MHz

– One operation must complete before next can begin

– Operations spaced 33ns apart

9

3 Stage Pipelining

[Figure: three stages, each Comb. Logic (10ns) followed by REG (3ns), driven by Clock. Timeline: Op1 to Op4 overlapping in time.]

Delay = 39ns

Throughput = 77MHz

– Space operations 13ns apart

– 3 operations occur simultaneously

10

Limitation: Nonuniform Pipelining

[Figure: three stages driven by Clock, with Comb. Logic delays of 5ns, 15ns and 10ns, each followed by a 3ns REG.]

Delay = 18 * 3 = 54 ns

Throughput = 55MHz

• Throughput limited by slowest stage

• Delay determined by clock period * number of stages

• Must attempt to balance stages
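The delay and throughput figures of the unpipelined, 3-stage and nonuniform examples follow from one rule: the clock period is the slowest stage's logic delay plus the register delay. The sketch below applies that rule; the function name `pipeline_timing` is an illustrative choice, and it assumes one register per stage with the 3ns delay used in the slides.

```python
def pipeline_timing(logic_ns, reg_ns=3):
    """logic_ns: combinational delay of each stage.

    Returns (clock_ns, delay_ns, throughput_mhz).
    """
    clock_ns = max(logic_ns) + reg_ns      # slowest stage plus its register sets the clock
    delay_ns = clock_ns * len(logic_ns)    # every stage takes one full clock period
    throughput_mhz = 1000 / clock_ns       # one result per clock; 1/ns = 1000 MHz
    return clock_ns, delay_ns, throughput_mhz

for name, stages in [("unpipelined", [30]),
                     ("3-stage uniform", [10, 10, 10]),
                     ("3-stage nonuniform", [5, 15, 10])]:
    clock, delay, mhz = pipeline_timing(stages)
    print(f"{name}: clock {clock} ns, delay {delay} ns, throughput {mhz:.1f} MHz")
```

The nonuniform case has the same 30ns of logic as the uniform one, yet its 15ns stage forces an 18ns clock, which is why balancing the stages matters.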

11

Limitation: Deep Pipelines

• Diminishing returns as we add more pipeline stages

• Register delays become limiting factor

• Increased latency

• Small throughput gains

• More hazards

[Figure: six stages driven by Clock, each Comb. Logic (5ns) followed by REG (3ns).]

Delay = 48ns, Throughput = 125MHz
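The diminishing returns can be made concrete by splitting the same 30ns of logic into more and more stages. This is a rough sketch that assumes the logic divides perfectly evenly, which real circuits rarely allow; the stage counts 10 and 30 are hypothetical and do not appear in the slides.

```python
LOGIC_NS = 30   # total combinational delay from the earlier example
REG_NS = 3      # delay of each pipeline register

def evenly_split(stages):
    """Return (delay_ns, throughput_mhz) when the logic is split into equal stages."""
    clock_ns = LOGIC_NS / stages + REG_NS  # per-stage logic plus one register
    return clock_ns * stages, 1000 / clock_ns

for stages in (1, 3, 6, 10, 30):
    delay, mhz = evenly_split(stages)
    print(f"{stages:2d} stages: delay {delay:.0f} ns, throughput {mhz:.0f} MHz")
```

However many stages are used, the clock period can never drop below the 3ns register delay, so throughput is bounded by roughly 333MHz while the total delay keeps growing.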

12

MIPS Pipelining

13

MIPS 5-stage pipeline

• The MIPS processor needs 5 stages to execute instructions

• Pipelining stages:

– IF - Instruction Fetch

– ID - Instruction Decode

– EX - Execute / Address Calculation

– MEM - Memory Access (read / write)

– WB - Write Back (results into register file)

• Not all instructions need all the stages (e.g., add instruction does not need the MEM stage)

14

Basic MIPS Pipelined Processor

[Figure: MIPS datapath (PC, Instruction Memory, Register File, ALU, Data Memory) divided by the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB.]

15

Pipelined Example - Executing Multiple Instructions

• Consider the following instruction sequence:

lw $r0, 10($r1)

sw $r3, 20($r4)

add $r5, $r6, $r7

sub $r8, $r9, $r10

16

Executing Multiple Instructions

Clock Cycle 1

[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM and MEM/WB registers. Instruction in the pipeline: LW.]

17

Executing Multiple Instructions

Clock Cycle 2

[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM and MEM/WB registers. Instructions in the pipeline: LW, SW.]

18

Executing Multiple Instructions

Clock Cycle 3

[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM and MEM/WB registers. Instructions in the pipeline: LW, SW, ADD.]

19

Executing Multiple Instructions

Clock Cycle 4

[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM and MEM/WB registers. Instructions in the pipeline: LW, SW, ADD, SUB.]

20

Executing Multiple Instructions

Clock Cycle 5

[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM and MEM/WB registers. Instructions in the pipeline: LW, SW, ADD, SUB.]

21

Executing Multiple Instructions

Clock Cycle 6

[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM and MEM/WB registers. Instructions in the pipeline: SW, ADD, SUB.]

22

Executing Multiple Instructions

Clock Cycle 7

[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM and MEM/WB registers. Instructions in the pipeline: ADD, SUB.]

23

Executing Multiple Instructions

Clock Cycle 8

[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM and MEM/WB registers. Instruction in the pipeline: SUB.]

24

Alternative View - Multicycle Diagram

                     CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8
lw  $r0, 10($r1)     IM    REG   ALU   DM    REG
sw  $r3, 20($r4)           IM    REG   ALU   DM    REG
add $r5, $r6, $r7                IM    REG   ALU   DM    REG
sub $r8, $r9, $r10                     IM    REG   ALU   DM    REG
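The cycle-by-cycle view can be reproduced with a few lines of code. The sketch below is illustrative only: it assumes one instruction enters the pipeline per cycle with no stalls or hazards, and it uses the stage names IF, ID, EX, MEM and WB in place of the unit names IM, REG, ALU, DM and REG shown in the diagram. The function name `occupancy` is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
PROGRAM = ["lw", "sw", "add", "sub"]

def occupancy(program, stages=STAGES):
    """Map each clock cycle (1-based) to {stage: instruction}, assuming no stalls."""
    cycles = len(program) + len(stages) - 1
    table = {}
    for cycle in range(1, cycles + 1):
        row = {}
        for i, instr in enumerate(program):
            s = cycle - 1 - i              # instruction i enters IF in cycle i + 1
            if 0 <= s < len(stages):
                row[stages[s]] = instr
        table[cycle] = row
    return table

for cycle, row in occupancy(PROGRAM).items():
    print(f"CC {cycle}: " + "  ".join(f"{st}={row.get(st, '-')}" for st in STAGES))
```

Four instructions on a five-stage pipeline take 4 + 5 - 1 = 8 cycles, matching the eight clock cycles walked through in the slides.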

25

Processor Pipelining

• There are two ways that pipelining can help:

1. Reduce the clock cycle time, and keep the same CPI

2. Reduce the CPI, and keep the same clock cycle time

CPU time = Instruction count * CPI * Clock cycle time

26

Reduce the clock cycle time, and keep the same CPI

[Figure: MIPS datapath, shown without pipeline registers.]

CPI = 1

Clock = X Hz

27

Reduce the clock cycle time, and keep the same CPI

[Figure: MIPS datapath with pipeline registers.]

CPI = 1

Clock = X*5 Hz

28

Reduce the CPI, and keep the same cycle time

[Figure: MIPS datapath, shown without pipeline registers.]

CPI = 5

Clock = X*5 Hz

29

Reduce the CPI, and keep the same cycle time

[Figure: MIPS datapath with pipeline registers.]

CPI = 1

Clock = X*5 Hz
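Both routes to a faster processor can be compared with the CPU time equation. The following is a minimal sketch: the base clock rate `X` and the instruction count `N` are hypothetical values chosen only to make the arithmetic concrete, and the labels single-cycle and multi-cycle are my reading of the two non-pipelined starting points (CPI = 1 at X Hz, and CPI = 5 at X*5 Hz).

```python
def cpu_time(instruction_count, cpi, clock_hz):
    """CPU time = Instruction count * CPI * Clock cycle time (1 / clock rate)."""
    return instruction_count * cpi / clock_hz

X = 100e6          # hypothetical base clock rate in Hz
N = 1_000_000      # hypothetical instruction count

single_cycle = cpu_time(N, cpi=1, clock_hz=X)       # CPI = 1, Clock = X Hz
multi_cycle = cpu_time(N, cpi=5, clock_hz=5 * X)    # CPI = 5, Clock = X*5 Hz
pipelined = cpu_time(N, cpi=1, clock_hz=5 * X)      # CPI = 1, Clock = X*5 Hz

print(single_cycle / pipelined)   # speedup from the shorter clock cycle
print(multi_cycle / pipelined)    # speedup from the lower CPI
```

Either way the ideal pipelined design comes out about five times faster, which is the number of stages.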

30

Pipeline performance

• Ideally we get a speedup (by reducing clock cycle or reducing the CPI) equal to the number of stages.

• In practice, we do not achieve that – but we get close:

– Pipelining has additional overhead (e.g., pipeline registers)

– Pipeline hazards

31

Pipeline Hazards

• Hazards are situations in pipelining which prevent the next instruction in the instruction stream from executing during the designated clock cycle.

• Hazards reduce the ideal speedup gained from pipelining (e.g., CPI =1) and are classified into three classes:

– Structural hazards

– Data hazards

– Control hazards
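One common way to quantify the cost of hazards is to treat each stall as extra cycles added to the ideal CPI of 1. The sketch below uses that textbook approximation, which is not stated in the slides: it assumes perfectly balanced stages and ignores pipeline register overhead. The stall rates are hypothetical and `pipeline_speedup` is an illustrative name.

```python
def pipeline_speedup(depth, stall_cycles_per_instr):
    """Approximate speedup over an unpipelined design, ignoring register overhead."""
    effective_cpi = 1 + stall_cycles_per_instr   # ideal CPI of 1 plus hazard stalls
    return depth / effective_cpi

for stalls in (0.0, 0.25, 0.5, 1.0):             # hypothetical average stalls per instruction
    print(stalls, round(pipeline_speedup(5, stalls), 2))
```

With no stalls the 5-stage pipeline reaches the ideal speedup of 5; an average of one stall cycle per instruction halves it to 2.5.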
