80
UC Regents Spr 2010 © UCB EECS 150 - L11: Processor Pipelining 2010-2-23 John Wawrzynek EECS 150 -- Digital Design Lecture 11-- Processor Pipelining www-inst.eecs.berkeley.edu/~cs150 Today’s lecture by John Lazzaro 1

EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

Embed Size (px)

Citation preview

Page 1: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

2010-2-23John Wawrzynek

EECS 150 -- Digital DesignLecture 11-- Processor Pipelining

www-inst.eecs.berkeley.edu/~cs150

Today’s lecture by John Lazzaro

1

Page 2: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Today: PipeliningHow to apply the performance equation to our single-cycle CPU.

Why pipelining is hard: data hazards,control hazards, structural hazards.

Pipelining: an idea from assemblyline production applied to CPU design

Visualizing pipelines to evaluatehazard detection and resolution.

A tool kit for hazard resolution.2

Page 3: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

New successful instruction sets are rare

instruction set

software

hardware

Implementors suffer with original sins of ISAs, to support the installed base of software.

3

Page 4: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

define: The Architect’s ContractTo the program, it appears that instructions execute in the correct order defined by the ISA.

What the machine actually does is up to the hardware designers, as long as the contract is kept.

As each instruction completes, themachine state (regs, mem) appears to the program to obey the ISA.

Goal: Keep contract and run programs faster.

4

Page 5: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Performance Measurement(as seen by a CPU designer)

Q. Why do we care about a program’s performance?

A. We want the CPU we are designing to run it well !

5

Page 6: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Step 1: Analyze the right measurement!

CPU Time:Time the CPU spends running program under measurement.

Response Time:

Total time: CPU Time + time spent waiting (for disk, I/O, ...).

Guides CPU design

Guides system design

Measuring CPU time (Unix):% time <program name>25.77u 0.72s 0:29.17 90.8%

6

Page 7: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

CPU time: Proportional to Instruction Count

CPU timeProgram

Machine InstructionsProgram

Rationale: Every additional

instruction you execute takes time.

Q. How does an architect influence the number of machine instructions needed to run an algorithm?A. Create new instructions:instruction set architect.

Q. Static count?(lines of program printout)Or dynamic count? (trace of execution)

A. Dynamic.

Q. Once ISA is set, who can influence instructioncount?A. Compiler writer,application developer.

7

Page 8: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

CPU time: Proportional to Clock Period

TimeProgram

TimeOne Clock Period

Q. What ultimately limitsan architect’s ability to reduce clock period ?

A. Clock-to-Q, setup times.

Q. How can architects (not technologists) reduce clock period?A. Shorten the machine’s critical path.

Rationale: We measure each instruction’s

execution time in “number of cycles”. By shortening the period for

each cycle, we shorten execution time.

8

Page 9: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Completing the performance equation

SecondsProgram

InstructionsProgram

= SecondsCycle

We need all three terms, and only these terms, to

compute CPU Time!

When is it OK to compare clock rates?

What factors make different programs have different CPIs?

Instruction mix varies.

Cache behavior varies.

Branch prediction varies.

“CPI” -- The Average Number of Clock

Cycles Per Instruction For the Program

InstructionCycles

9

Page 10: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Consider Lecture 10 single-cycle CPU ...

All instructions take 1 cycle to execute every time they run.

CPI of any program running on machine? 1.0

“average CPI for the program” is a more-useful concept for more complicated machines ...

10

Page 11: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Consider machine with a data cache ...

InstructionsProgram

= SecondsCycle

A program’s load instructions “stride”

through every memory address.

The cache never “hits”, so every load goes to

DRAM (100x slower than loads that go to cache).

Thus, the average number of cycles for load instructions is higher for this program.

InstructionCycles

Thus, the average number of cycles for all instructions is higher for this program.

SecondsProgram

Thus, program takes longer to run!

11

Page 12: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Final thoughts: Performance Equation

SecondsProgram

InstructionsProgram

= SecondsCycle Instruction

Cycles

Goal is to optimize execution time, notindividualequation

terms.

The CPI of the

program.Reflects

the program’s instruction

mix.

Machinesare

optimizedwith

respect toprogram

workloads.

Clockperiod.

Optimizejointlywith

machineCPI.

12

Page 13: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Pipelining

13

Page 14: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Clocking methodology ...

Processor uses synchronous logicdesign (a “clock”).

!"#$%&'())* ++,!-.)'/ 012-)34$5$%& 67&1'8

!"#$%&'

( )#*#&&'&+,-+.'*/#&+0-12'*,'*3+

#

4 5+! ,/$'60&7"89+:+,/$'6$;"9+:+,/$'6.',;%9

5+! #0&7"8 :+#$;" :+#.',;%

0&7

f T1 MHz 1 μs

10 MHz 100 ns100 MHz 10 ns

1 GHz 1 ns

All state elements act like positive edge-triggered flip flops.

D Q

clk14

Page 15: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Memory and register file semantics ...

32rd1

RegFile

32rd2

WE32wd

5rs1

5rs2

5ws

Reads are combinational: Put a stable address on input, a short time later data appears on output.

32Dout

Data Memory

WE32Din

32Addr

Writes are clocked: If WE is high, memory Addr (or register file ws) captures memory Din (or register file wd) on positive edge of clock.

Note: May not be best choice for project.15

Page 16: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Recall: A single-cycle processor

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Dout

Data Memory

WE

Din

Addr

MemToReg

Addr Data

Instr

Mem32A

L

U

32

32

op

Ext

SecondsProgram

InstructionsProgram

= SecondsCycle Instruction

Cycles

CPI == 1This is good.

Slow.This is bad.

Challenge: Speed up clock while keeping CPI == 1

16

Page 17: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

A MIPS R-format CPU design

32rd1

RegFile

32rd2

WE32wd

5rs1

5rs2

5ws

32ALU

32

32

op

opcode rs rt rd functshamt

Decode fields to get : ADD $8 $9 $10

Logic

17

Page 18: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

How data flows after posedge

32rd1

RegFile

32rd2

WE32wd

5rs1

5rs2

5ws

32ALU

32

32

op

Logic

Addr Data

InstrMem

D

PC

Q+

0x4

18

Page 19: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Next posedge: Update state and repeat

32rd1

RegFile

32rd2

WE32wd

5rs1

5rs2

5ws

D

PC

Q

19

Page 20: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Observation: Logic idle most of cycle

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Dout

Data Memory

WE

Din

Addr

MemToReg

Addr Data

Instr

Mem32A

L

U

32

32

op

Ext

For most of cycle, ALU is either “waiting” for its inputs, or “holding” its output

Ideal: a CPU architecture where each part is always “working”.

20

Page 21: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Inspiration: Automobile assembly lineAssembly line moves on a steady clock.

Each station does the same task on each car.Car

body shell

Car chassis

Mergestation

Boltingstation

The clock

21

Page 22: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Lessons from car assembly lines

Faster line movement yields more cars per hour off the line.

Faster line movement requires more stages, each doing simpler tasks.

To maximize efficiency, all stages should take same amount of time(if not, workers in fast stages are idle)

“Filling”, “flushing”, and “stalling” assembly line are all bad news.

22

Page 23: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Key analogy: The instruction is the car

D

PC

Q

+

0x4

Addr Data

Instr

Mem

IR IR IR

Instruction Fetch

IR

Pipeline Stage #1 Stage #2

Controlshardware

in stage 2

Stage #3

Controlshardware

in stage 3

Stage #4

Controlshardware

in stage 4

Stage #5

Controlshardware

in stage 5

“Data-stationary control”

23

Page 24: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Example: Decode & Register Fetch stage

D

PC

Q

+

0x4

Addr Data

Instr

Mem

IR

Instr Fetch

Pipeline Stage #1

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

Ext

IR

B

A

M

Stage #2

Decode & Reg Fetch

IR

Stage #3

ADD R4,R3,R2OR R7,R6,R5SUB R10, R9,R8

ADD R4,R3,R2OR R7,R6,R5SUB R10,R9,R8

A sample program

R’s chosen so that instructions are

independent - like cars on the line.

24

Page 25: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Decode & Reg Fetch

Performance Equation and Pipelining

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Addr Data

Instr

Mem

Ext

IR IR IR

B

A

M

Instr Fetch Stage #3

SecondsProgram

InstructionsProgram= Seconds

Cycle InstructionCycles

To get shortest clock period,

balance the work to do in each

pipeline stage.

CPI == 1Once pipe is fill,one instructioncompletes per

cycle

Clock period is shorter

Less work to do in each cycle

25

Page 26: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Hazards: An instruction is not a car ...

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Addr Data

Instr

Mem

Ext

IR IR IR

B

A

M

Instr Fetch

Stage #1 Stage #2 Stage #3

Decode & Reg Fetch

ADD R4,R3,R2OR R5,R4,R2

An example of a “hazard” -- we must

(1) detect and (2) resolve all hazards

to make a CPU that matches ISA

R4 not written yet ...... wrong value of R4 fetched from RegFile, contract with programmer broken! Oops! ADD R4,R3,R2

OR R5,R4,R2

New sample program

26

Page 27: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Decode & Reg Fetch

Performance Equation and Hazards

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Addr Data

Instr

Mem

Ext

IR IR IR

B

A

M

Instr Fetch Stage #3

SecondsProgram

InstructionsProgram= Seconds

Cycle InstructionCycles

“Software slows the machine

down”Seymour Cray

Some ways to cope with hazards

makes CPI > 1“stalling pipeline”

Added logic to detect and resolve hazards increases

clock period

27

Page 28: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

A (simplified) 5-stage pipelined CPU

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Addr Data

Instr

Mem

Ext

IR IR

B

A

M

Instr Fetch

“IF” Stage “ID/RF” Stage

Decode & Reg Fetch

1 2

“EX” StageExecution

32A

L

U

32

32

op

IR

Y

M

3

IR

Dout

Data Memory

WE

Din

Addr

MemToReg

R

“MEM” StageMemory

WE, MemToReg

4WB5

WriteBack

Mux,Logic

28

Page 29: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Sometimes, “contract” is a challenge

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Addr Data

Instr

Mem

Ext

IR IR

B

A

M

Instr Fetch

“IF” Stage “ID/RF” Stage

Decode & Reg Fetch

1 2

“EX” StageExecution

32A

L

U

32

32

op

IR

Y

M

3

IR

Dout

Data Memory

WE

Din

Addr

MemToReg

R

“MEM” StageMemory

WE, MemToReg

4WB5

WriteBack

Mux,Logic

LW R4,0(R0)OR R5,R4,R2

Sample ProgramLW R4,0(R0)

OR R5,R4,R2

... but we haven’t even started the load yet!

One approach: change the contract!

29

Page 30: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

MIPS LW contract: Delayed Loads

InstructionFetch

InstructionDecode

OperandFetch

Execute

ResultStore

NextInstruction

Fetch the load inst from memory

“Retrieve” register value: $2

Compute memory address: 32 + $2

Load memory address contents into: $1

Prepare to fetch instr that follows the LW in the program. Depending on load semantics, new $1 is visible to that instr, or not until the following instr (”delayed loads”).

Decode fields to get : LW $1, 32($2)

opcode rs rt offset “I-Format”

30

Page 31: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

After we change the contract ...

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Addr Data

Instr

Mem

Ext

IR IR

B

A

M

Instr Fetch

“IF” Stage “ID/RF” Stage

Decode & Reg Fetch

1 2

“EX” StageExecution

32A

L

U

32

32

op

IR

Y

M

3

IR

Dout

Data Memory

WE

Din

Addr

MemToReg

R

“MEM” StageMemory

WE, MemToReg

4WB5

WriteBack

Mux,Logic

LW R4,0(R0)OR R5,R4,R2

Sample ProgramLW R4,0(R0)

OR R5,R4,R2

... “delayed load” contract does not guarantee new R4 is seen.

Only partially solves problem ... soon, we

finish the story.

31

Page 32: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Visualizing Pipelines

32

Page 33: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Pipeline Representation #1: Timeline

D

PC

Q

+

0x4

Addr Data

Instr

Mem

IR IR

IF (Fetch) ID (Decode) EX (ALU)

IR IR

MEM WB

ADD R4,R3,R2

OR R7,R6,R5

SUB R1,R9,R8XOR R3,R2,R1

AND R6,R5,R4I1:I2:I3:I4:I5:

Sample Program

IF ID

IF

EX

ID

IF

MEM

EX

ID

IF

WB

MEM

EX

IFID

WB

MEM

IDEX

IF

WB

EXMEM

IDMEMWB

EX

I1:I2:I3:I4:I5:

t1 t2 t3 t4 t5 t6 t7 t8Time:Inst

I6:

Good for visualizing pipeline fills.

Pipeline is “full”

33

Page 34: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Pipeline is “full”

Good for visualizing pipeline stalls.

Representation #2: Resource Usage

D

PC

Q

+

0x4

Addr Data

Instr

Mem

IR IR IR IR

ADD R4,R3,R2

OR R7,R6,R5

SUB R1,R9,R8XOR R3,R2,R1

AND R6,R5,R4I1:I2:I3:I4:I5:

Sample Program

I1 I2

I1

I3

I2

I1

I4

I3

I2

I1

I5

I4

I3

I1I2

IF:ID:EX:MEM:WB:

t1 t2 t3 t4 t5 t6 t7 t8Time:Stage

IF (Fetch) ID (Decode) EX (ALU) MEM WB

I5

I4

I2I3

I6

I5

I3I4

I6

I7

I4I5

I6

I7

I8

34

Page 35: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Hazard Taxonomy

35

Page 36: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Structural Hazards

Several pipeline stages need to use the same hardware resource at the same time.

Solution #1: Add extra copies ofthe resource (only works sometime).

Solution #2: Change resource sothat it can handle concurrent use.

Solution #3: Stages “take turns”by stalling parts of the pipeline.

36

Page 37: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

Ext

IR IR

B

A

M

“IF” Stage “ID/RF” Stage “EX” Stage

32A

L

U

32

32

op

IR

Y

M

IR

R

“MEM” Stage

WE, MemToReg

WB

Mux,Logic

32Dout

Data Memory

WE32

Din

Addr

MemToReg

PC

To branch logic

Structural Hazard Example: One Memory

Used by IF stage

and MEM stage

37

Page 38: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

A solution: “Extra copies” of memory

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Addr Data

Instr

Mem

Ext

IR IR

B

A

M

Instr Fetch

“IF” Stage “ID/RF” Stage

Decode & Reg Fetch

1 2

“EX” StageExecution

32A

L

U

32

32

op

IR

Y

M

3

IR

Dout

Data Memory

WE

Din

Addr

MemToReg

R

“MEM” StageMemory

WE, MemToReg

4WB5

WriteBack

Mux,Logic

I and D caches are a

hybrid solution

38

Page 39: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Alternatively: Concurrent use ...

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Addr Data

Instr

Mem

Ext

IR IR

B

A

M

Instr Fetch

“IF” Stage “ID/RF” Stage

Decode & Reg Fetch

1 2

“EX” StageExecution

32A

L

U

32

32

op

IR

Y

M

3

IR

Dout

Data Memory

WE

Din

Addr

MemToReg

R

“MEM” StageMemory

WE, MemToReg

4WB5

WriteBack

Mux,Logic

ID and WB stages use register file in

same clock cycle

39

Page 40: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Data Hazards: 3 Types (RAW, WAR, WAW)

Several pipeline stages read or write thesame data location in an incompatible way.

Read After Write (RAW) hazards.Instruction I2 expects to read a datavalue written by an earlier instruction,but I2 executes “too early” and readsthe wrong copy of the data.

Note “data value”, not “register”. Data hazards are possible for any architected state (such as main memory). In practice, main memory hazard avoidance is the job of the memory system.

40

Page 41: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Recall: RAW example

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Addr Data

Instr

Mem

Ext

IR IR IR

B

A

M

Instr Fetch

Stage #1 Stage #2 Stage #3

Decode & Reg Fetch

ADD R4,R3,R2OR R5,R4,R2

R4 not written yet ...... wrong value of R4 fetched from RegFile, contract with programmer broken! Oops!

ADD R4,R3,R2OR R5,R4,R2

Sample program

This is what we mean

when we say Read After

Write (RAW) Hazard

41

Page 42: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Data Hazards: 3 Types (RAW, WAR, WAW)

Write After Read (WAR) hazards. Instruction I2 expects to write over a data value after an earlier instruction I1 reads it. But instead, I2 writes too early, and I1 sees the new value.

Write After Write (WAW) hazards. Instruction I2 writes over data an earlier instruction I1 also writes. But instead, I1 writes after I2, and the final data value is incorrect.

WAR and WAW not possible in our 5-stage pipeline. But are possible in other pipeline designs.

42

Page 43: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

I1:I2:I3:I4:I5:

t1 t2 t3 t4 t5 t6 t7 t8Time:Inst

I6:

Control Hazards: A taken branch/jump

D

PC

Q

+

0x4

Addr Data

Instr

Mem

IR IR

IF (Fetch) ID (Decode) EX (ALU)

IR IR

MEM WB

BEQ R4,R3,25

SUB R1,R9,R8AND R6,R5,R4

I1:I2:I3:

Sample Program(ISA w/o branch delay slot) IF ID

IF

EX

ID

IF

MEM WBEX stage computes if branch is taken

Note: with branch delay slot, I2 MUST complete, I3 MUST NOT complete.

If branch is taken, these instructions

MUST NOT complete!43

Page 44: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Hazards Recap

Structural Hazards

Data Hazards (RAW, WAR, WAW)

Control Hazards (taken branches and jumps)

On each clock cycle, we must detect the presenceof all of these hazards, and resolve them before they break the “contract with the programmer”.

44

Page 45: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Hazard Resolution Tools

45

Page 46: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

The Hazard Resolution Toolkit

Stall earlier instructions in pipeline.

Kill earlier instructions in pipeline.

Forward results computed in later pipeline stages to earlier stages.Add new hardware or rearrange hardware design to eliminate hazard.

Make hardware handle concurrent requests to eliminate hazard.

Change ISA to eliminate hazard.

46

Page 47: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Resolving a RAW hazard by stalling

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Addr Data

Instr

Mem

Ext

IR IR IR

B

A

M

Instr Fetch

Stage #1 Stage #2 Stage #3

Decode & Reg Fetch

ADD R4,R3,R2OR R5,R4,R2

Let ADD proceed to WB stage, so that R4 is written to regfile.

ADD R4,R3,R2OR R5,R4,R2

Sample programKeep executingOR instructionuntil R4 is ready.Until then, sendNOPS to IR 2/3.

Freeze PC and IR until stall is over.

New datapath hardware

(1) Mux into IR 2/3to feed in NOP.

(2) Write enable on PC and IR 1/2

47

Page 48: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

The Hazard Resolution Toolkit

Stall earlier instructions in pipeline.

Kill earlier instructions in pipeline.

Forward results computed in later pipeline stages to earlier stages.Add new hardware or rearrange hardware design to eliminate hazard.

Make hardware handle concurrent requests to eliminate hazard.

Change ISA to eliminate hazard.

48

Page 49: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Addr Data

Instr

Mem

Ext

IR IR

B

A

M

Instr Fetch

“IF” Stage “ID/RF” Stage

Decode & Reg Fetch

1 2

“EX” StageExecution

32A

L

U

32

32

op

IR

Y

M

3

Resolving a RAW hazard by forwarding

ADD R4,R3,R2OR R5,R4,R2ADD R4,R3,R2OR R5,R4,R2

Sample program

ALU computes R4 in the EX stage, so ...Just forward it

back!

Unlike stalling, does not change CPI. May hurt cycle time.

49

Page 50: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

The Hazard Resolution Toolkit

Stall earlier instructions in pipeline.

Kill earlier instructions in pipeline.

Forward results computed in later pipeline stages to earlier stages.Add new hardware or rearrange hardware design to eliminate hazard.

Make hardware handle concurrent requests to eliminate hazard.

Change ISA to eliminate hazard.

50

Page 51: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

I1:I2:I3:I4:I5:

t1 t2 t3 t4 t5 t6 t7 t8Time:Inst

I6:

Control Hazards: Fix with more hardware

D

PC

Q

+

0x4

Addr Data

Instr

Mem

IR IR

IF (Fetch) ID (Decode) EX (ALU)

IR IR

MEM WB

BEQ R4,R3,25

SUB R1,R9,R8AND R6,R5,R4

I1:I2:I3:

Sample Program(ISA w/o branch delay slot) IF ID

IF

EX

ID

IF

MEM WBEX stage computes if branch is taken

If branch is taken, these instructions

MUST NOT complete!

If we add hardware, can we move it here?

51

Page 52: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Resolving control hazard with hardware

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Addr Data

Instr

Mem

Ext

IR IR IR

B

A

M

Instr Fetch

Stage #1 Stage #2 Stage #3

Decode & Reg Fetch

==

To branch control logic

52

Page 53: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

I1:I2:I3:I4:I5:

t1 t2 t3 t4 t5 t6 t7 t8Time:Inst

I6:

Control Hazards: After more hardware

D

PC

Q

+

0x4

Addr Data

Instr

Mem

IR IR

IF (Fetch) ID (Decode) EX (ALU)

IR IR

MEM WB

BEQ R4,R3,25

SUB R1,R9,R8AND R6,R5,R4

I1:I2:I3:

Sample Program(ISA w/o branch delay slot) IF ID

IF

EX MEM WB

If branch is taken, this instruction MUST NOT

complete!

ID stage computes if branch is taken

If we change ISA, can we always let I2 complete (”branch delay slot”) and

eliminate the control hazard.

53

Page 54: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

MIPS Delayed Branch: BEQ $1,$2,25

InstructionFetch

InstructionDecode

OperandFetch

Execute

ResultStore

NextInstruction

Fetch branch inst from memory

“Retrieve” register values: $1, $2

Compute if we take branch: $1 == $2 ?

Decode fields to get: BEQ $1, $2, 25

opcode rs rt offset “I-Format”

ALWAYS prepare to fetch instr that follows the BEQ in the program (”delayed branch”). IF we take branch, the instr we fetch AFTER that instruction is PC + 4 + 100.

PC == “Program Counter”54

Page 55: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

The Hazard Resolution Toolkit

Stall earlier instructions in pipeline.

Kill earlier instructions in pipeline.

Forward results computed in later pipeline stages to earlier stages.Add new hardware or rearrange hardware design to eliminate hazard.

Make hardware handle concurrent requests to eliminate hazard.

Change ISA to eliminate hazard.

55

Page 56: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Resolve control hazard by killing instr

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Addr Data

Instr

Mem

Ext

IR IR IR

B

A

M

Instr Fetch

Stage #1 Stage #2 Stage #3

Decode & Reg Fetch

J 200

J 200OR R5,R4,R2

Sample program(no delay slot) Detect J

instruction, muxa NOP into IR 1/2

Compute new PC using hardware not shown ...

This hurts CPI.

Can we do better?

56

Page 57: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

The Hazard Resolution Toolkit

Stall earlier instructions in pipeline.

Kill earlier instructions in pipeline.

Forward results computed in later pipeline stages to earlier stages.Add new hardware or rearrange hardware design to eliminate hazard.

Make hardware handle concurrent requests to eliminate hazard.

Change ISA to eliminate hazard.

57

Page 58: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Structural hazard solution: concurrent use

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Addr Data

Instr

Mem

Ext

IR IR

B

A

M

Instr Fetch

“IF” Stage “ID/RF” Stage

Decode & Reg Fetch

1 2

“EX” StageExecution

32A

L

U

32

32

op

IR

Y

M

3

IR

Dout

Data Memory

WE

Din

Addr

MemToReg

R

“MEM” StageMemory

WE, MemToReg

4WB5

WriteBack

Mux,Logic

ID and WB stages use register file in

same clock cycle

Does not come for free ...

58

Page 59: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Hazard Diagnosis

59

Page 60: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Data Hazards: Read After Write

Read After Write (RAW) hazards.Instruction I2 expects to read a datavalue written by an earlier instruction,but I2 executes “too early” and readsthe wrong copy of the data.

Classic solution: use forwarding heavily, fall back on stalling when forwarding won’twork or slows down the critical path too much.

60

Page 61: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Mux,Logic

Full bypass network ...

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

Ext

IR IR

B

A

M

32A

L

U

32

32

op

IR

Y

M

IR

Dout

Data Memory

WE

Din

Addr

MemToReg

R

WE, MemToReg

ID (Decode) EX MEM WB

From WB

61

Page 62: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Mux,Logic

Common bug: Multiple forwards ...

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

Ext

IR IR

B

A

M

32A

L

U

32

32

op

IR

Y

M

IR

Dout

Data Memory

WE

Din

Addr

MemToReg

R

WE, MemToReg

ID (Decode) EX MEM WB

From WB

ADD R4,R3,R2 OR R2,R3,R1 AND R2,R2,R1

Which do we forward from?

62

Page 63: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Mux,Logic

Common bug: Multiple forwards II ...

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

Ext

IR IR

B

A

M

32A

L

U

32

32

op

IR

Y

M

IR

Dout

Data Memory

WE

Din

Addr

MemToReg

R

WE, MemToReg

ID (Decode) EX MEM WB

From WB

ADD R4,R0,R2 OR R0,R3,R1 AND R0,R2,R1

Which do we forward from?

63

Page 64: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

LW and Hazards

No load “delay slot”

64

Page 65: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Mux,Logic

Questions about LW and forwarding

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

Ext

IR IR

B

A

M

32A

L

U

32

32

op

IR

Y

M

IR

Dout

Data Memory

WE

Din

Addr

MemToReg

R

WE, MemToReg

ID (Decode) EX MEM WB

From WB

ADDIU R1 R1 24 LW R1 128(R29)

Do we need to stall ?OR R3,R3,R2

65

Page 66: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Mux,Logic

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

Ext

IR IR

B

A

M

32A

L

U

32

32

op

IR

Y

M

IR

Dout

Data Memory

WE

Din

Addr

MemToReg

R

WE, MemToReg

ID (Decode) EX MEM WB

From WB

ADDIU R1 R1 24 LW R1 128(R29)

Do we need to stall ?OR R1,R3,R1

Questions about LW and forwarding

66

Page 67: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Resolving a RAW hazard by stalling

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Addr Data

Instr

Mem

Ext

IR IR IR

B

A

M

Instr Fetch

Stage #1 Stage #2 Stage #3

Decode & Reg Fetch

ADD R4,R3,R2OR R5,R4,R2

Let ADD proceed to WB stage, so that R4 is written to regfile.

ADD R4,R3,R2OR R5,R4,R2

Sample programKeep executingOR instructionuntil R4 is ready.Until then, sendNOPS to IR 2/3.

Freeze PC and IR until stall is over.

New datapath hardware

(1) Mux into IR 2/3to feed in NOP.

(2) Write enable on PC and IR 1/2

67

Page 68: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Branches and Hazards

Single “delay slot”

68

Page 69: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Recall: Control hazard and hardware

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

D

PC

Q

+

0x4

Addr Data

Instr

Mem

Ext

IR IR IR

B

A

M

Instr Fetch

Stage #1 Stage #2 Stage #3

Decode & Reg Fetch

==

To branch control logic

69

Page 70: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

I1:I2:I3:I4:I5:

t1 t2 t3 t4 t5 t6 t7 t8Time:Inst

I6:

Recall: After more hardware, change ISA

D

PC

Q

+

0x4

Addr Data

Instr

Mem

IR IR

IF (Fetch) ID (Decode) EX (ALU)

IR IR

MEM WB

BEQ R4,R3,25

SUB R1,R9,R8AND R6,R5,R4

I1:I2:I3:

Sample Program(ISA w/o branch delay slot) IF ID

IF

EX MEM WB

If branch is taken, this instruction MUST NOT

complete!

ID stage computes if branch is taken

If we change ISA, can we always let I2 complete (”branch delay slot”) and

eliminate the control hazard.

70

Page 71: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Mux,Logic

Question about branch and forwards:

rd1

RegFile

rd2

WEwd

rs1

rs2

ws

Ext

IR IR

B

A

M

32A

L

U

32

32

op

IR

Y

M

IR

Dout

Data Memory

WE

Din

Addr

MemToReg

R

WE, MemToReg

ID (Decode) EX MEM WB

BEQ R1 R3 label

Will this work as shown?OR R3,R3,R1

==

To branch control logic

71

Page 72: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Lessons learned

Pipelining is hard

Write test code in advance

Study every instruction

Think about interactions ...

72

Page 73: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Lessons learned

Pipelining is hard

Write test code in advance

Study every instruction

Think about interactions ... between forwarding, branch and jump delay slots, R0 issues LW issues ... a long list!

73

Page 74: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Control Implementation

74

Page 75: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Recall: What is single cycle control?

32rd1

RegFile

32rd2

WE32wd

5rs1

5rs2

5ws

ExtRegDest

ALUsrcExtOp

ALUctr

32A

L

U

32

32

op

MemToReg

32Dout

Data Memory

WE32

Din

Addr

MemWr

Equal

RegWr

32Addr Data

InstrMem

Equal

RegDestRegWr

ExtOpALUsrc MemWr

MemToReg

PCSrc

Combinational Logic(Only Gates, No Flip Flops)Just specify logic functions!

75

Page 76: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

In pipelines, all IR registers are used

IR IR IR IR

ID (Decode) EX MEM WB

Equal

RegDestRegWr

ExtOp MemToReg

PCSrc

Combinational Logic(Only Gates, No Flip Flops)

(add extra state outside!)

A “conceptual” design -- for shortest critical path, IR registers may hold decoded info,

not the complete 32-bit instruction

76

Page 77: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Exceptions and InterruptsException: An unusual event happens to an instruction during its execution. Examples: divide by zero, undefined opcode.

Interrupt: Hardware signal to switch the processor to a new instruction stream. Example: a sound card interrupts when it needs more audio output samples (an audio “click” happens if it is left waiting).

77

Page 78: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Challenge: Precise Interrupt / Exception

!"#$%&

!"#$%

&'"()*+,-*.//0

123.4-*5+.66+,

788%($9:%;%##<

=%;'>9;?*';@*AB$6C86C"@%"*%D%(B$9C;*E'#*89"#$

9>FG%>%;$%@*9;*+H1H*9;*IJ&*41/KH+*LB$*@9@*;C$*#)CE*

BF*9;*$)%*#BL#%MB%;$*>C@%G#*B;$9G*>9@6N9;%$9%#2

!"#$%

&'()*+)

+2*7D(%F$9C;#*;C$*F"%(9#%O

.2*788%($9:%*C;*'*:%"P*#>'GG*(G'##*C8*F"C?"'>#

A;%*>C"%*F"CLG%>*;%%@%@*$C*L%*#CG:%@

!"#$%"&'$%(#)*+%)

!"#$%

&'"()*+,-*.//0

123.4-*5+.66+1

Q"%(9#%*I;$%""BF$#

,$'-.)$'(//+(%'()'0*'(#'0#$+%%./$'0)'$(1+#'2+$3++#'$3"'0#)$%.4$0"#) !"#$%&' #()%&'*+,

- ./0%01102.%31%#44%'(".562.'3("%67%.3%#()%'(246)'(8%&' '".3.#44$%239740.0

- (3%01102.%31%#($%'(".562.'3(%#1.05%&' /#"%.#:0(%74#20

;/0%'(.05567.%/#()405%0'./05%#<35."%./0%75385#9%35%50".#5."%'.%#.%&'*+%=

Definition:

Follows from the “contract” between the architect and the programmer ...

(or exception)

78

Page 79: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Precise Exceptions in Static Pipelines

!"#$%&

!"#$%

&'"()*+,-*.//0

123.4-*5+.66+7899%($*:;*<;$%""=>$#!"#$%&$%'()'*+%,-.)#/%0

!" !"#! $%&' $%& $(!# )! $*& (+,-./!$ 01)2! $3& $*& $(!% !"#! $4& $%& $*!& 516! $73& $3& $%!' 8!!! $%& $4& $*

()*+(,+(-./-01(23 7'''*'''* .'''7 ('''. +'''+ ( %'''%

-/4*(-/0,#0 -/4*(-/0,"5

9:;<=>?-'=;@?--AB@<

6-/174/078*/--)3*409-/0.7,,71):*0*(0723:/2/8*09*0;7<;043//.+ =98*0*(04*9-*0/>/1)*7(80(,0:9*/-0784*-)1*7(840?/,(-//>1/3*7(801;/1@40,7874;/.0(80/9-:7/-0784*-)1*7(84

!"#$%

&'"()*+,-*.//0

123.4-*5+.66+38?(%>$@:;*A';BC@;D120$!'()'*3/4)$5#67)*8/-)./0)9

C D:E>'?FG?B@=:;'$EHI<'=;'B=B?E=;?'A;@=E'G:JJ=@'B:=;@',0'<@HI?/C KFG?B@=:;<'=;'?H-E=?-'B=B?'<@HI?<':L?--=>?'EH@?-'?FG?B@=:;<C ";M?G@'?F@?-;HE'=;@?--AB@<'H@'G:JJ=@'B:=;@',:L?--=>?':@N?-</C "$'?FG?B@=:;'H@'G:JJ=@O'AB>H@?'9HA<?'H;>'KP9'-?I=<@?-<&'Q=EEHEE'<@HI?<&'=;M?G@'NH;>E?-'P9'=;@:'$?@GN'<@HI?

A4B81;-(8()40!8*/--)3*4

KFG!

P9!

P9";<@R'0?J ! !?G:>? K 0

!H@H'0?J ST

KFGK

P9K

KFG0

P90

C9)4/

D6C

E7::0F0G*9</

E7::0H0G*9</

E7::0D0G*9</

!::/<9:0I31(./

IJ/-,:(=F9*90A..-0D>1/3*

6C0A..-/440D>1/3*7(84

E7::0K-7*/?91@

G/:/1*0L98.:/-06C

!"##$%&

'"$(%

Key observation: architected state only change in memory and register write stages.

79

Page 80: EECS 150 -- Digital Designcs150/sp10/Lecture/lec11-pipelining.pdf · EECS 150 - L11: Processor Pipelining UC Regents Spr 2010 © UCB 2010-2-23 John Wawrzynek EECS 150 -- Digital Design

UC Regents Spr 2010 © UCBEECS 150 - L11: Processor Pipelining

Thursday:

80