Pipelined MIPS Microarchitecture · Pipelined MIPS Microarchitecture. Carnegie Mellon 2 What Will We Learn ... Single-Cycle and Pipelined Datapath SignImmE CLK A RD Instruction Memory

Carnegie Mellon

1

Design of Digital Circuits 2014Srdjan CapkunFrank K. Gürkaynak

Adapted from Digital Design and Computer Architecture, David Money Harris & Sarah L. Harris ©2007 Elsevier

http://www.syssec.ethz.ch/education/Digitaltechnik_14

Pipelined MIPS Microarchitecture

Carnegie Mellon

2

What Will We Learn

How to do more per unit time

Parallelism

Pipelining

Single Cycle vs Pipelined

Pipelined MIPS architecture

Hazards and how to solve them

Data hazards

Control hazards

Performance of the Pipelined architecture

Carnegie Mellon

3

Parallelism

Two types of parallelism:

Spatial parallelism

duplicate hardware performs multiple tasks at once

Temporal parallelism

task is broken into multiple stages

also called pipelining

for example, an assembly line

Carnegie Mellon

4

Parallelism Definitions

Some definitions:

Token: A group of inputs processed to produce a group of outputs

Latency: Time for one token to pass from start to end

Throughput: The number of tokens that can be produced per unit time

Parallelism increases throughput.

Carnegie Mellon

5

Parallelism Example

Example:

Ben Bitdiddle is baking cookies to celebrate the installation of his traffic light controller. It takes 5 minutes to roll the cookies and 15 minutes to bake them. After finishing one batch he immediately starts the next batch. What is the latency and throughput if Ben doesn’t use parallelism?

Latency =

Throughput =

Carnegie Mellon

6

Parallelism Example

Example:

Ben Bitdiddle is baking cookies to celebrate the installation of his traffic light controller. It takes 5 minutes to roll the cookies and 15 minutes to bake them. After finishing one batch he immediately starts the next batch. What is the latency and throughput if Ben doesn’t use parallelism?

Latency = 5 + 15 = 20 minutes = 1/3 hour

Throughput = 1 tray/ 1/3 hour = 3 trays/hour

Carnegie Mellon

7

Parallelism Example

What is the latency and throughput if Ben uses parallelism?

Spatial parallelism: Ben asks Allysa P. Hacker to help, using her own oven

Temporal parallelism: Ben breaks the task into two stages: roll and baking. He uses two trays. While the first batch is baking he rolls the second batch, and so on.

Carnegie Mellon

8

Spatial ParallelismS

pa

tial

Para

lleli

sm Roll

Bake

Ben 1 Ben 1

Alyssa 1 Alyssa 1

Ben 2 Ben 2

Alyssa 2 Alyssa 2

Time

0 5 10 15 20 25 30 35 40 45 50

Tray 1

Tray 2

Tray 3

Tray 4

Latency:

time to

first tray

Legend

Latency =

Throughput=

Carnegie Mellon

9

Spatial ParallelismS

pa

tial

Para

lleli

sm Roll

Bake

Ben 1 Ben 1

Alyssa 1 Alyssa 1

Ben 2 Ben 2

Alyssa 2 Alyssa 2

Time

0 5 10 15 20 25 30 35 40 45 50

Tray 1

Tray 2

Tray 3

Tray 4

Latency:

time to

first tray

Legend


Throughput= 2 trays/ 1/3 hour = 6 trays/hour

Carnegie Mellon

10

Temporal ParallelismT

em

po

ral

Para

lleli

sm Ben 1 Ben 1

Ben 2 Ben 2

Ben 3 Ben 3

Time

0 5 10 15 20 25 30 35 40 45 50

Latency:

time to

first tray

Tray 1

Tray 2

Tray 3

Latency =

Throughput=

Carnegie Mellon

11

Temporal ParallelismT

em

po

ral

Para

lleli

sm Ben 1 Ben 1

Ben 2 Ben 2

Ben 3 Ben 3

Time

0 5 10 15 20 25 30 35 40 45 50

Latency:

time to

first tray

Tray 1

Tray 2

Tray 3


Throughput= 1 trays/ 1/4 hour = 4 trays/hour

Using both techniques, the throughput would be 8 trays/hour

Carnegie Mellon

12

Pipelined MIPS Processor

Temporal parallelism

Divide single-cycle processor into 5 stages:

Fetch

Decode

Execute

Memory

Writeback

Add pipeline registers between stages

Carnegie Mellon

13

Single-Cycle vs. Pipelined Performance

Time (ps)Instr

Fetch

InstructionDecodeRead Reg

Execute

ALU

Memory

Read / Write

Write

Reg1

2

0 100 200 300 400 500 600 700 800 900 1100 1200 1300 1400 1500 1600 1700 1800 19001000

Instr

1

2

3

Fetch


Execute

ALU

Memory

Read / Write

Write

Reg

Fetch


Execute

ALU

Memory

Read/Write

Write

Reg

Fetch


Execute

ALU

Memory

Read/Write

Write

Reg

Fetch


Execute

ALU

Memory

Read/Write

Write

Reg

Single-Cycle

Pipelined

Carnegie Mellon

14

Pipelining Abstraction

Time (cycles)

lw $s2, 40($0) RF 40

$0

RF$s2

+ DM

RF $t2

$t1

RF$s3

+ DM

RF $s5

$s1

RF$s4

- DM

RF $t6

$t5

RF$s5

& DM

RF 20

$s1

RF$s6

+ DM

RF $t4

$t3

RF$s7

| DM

add $s3, $t1, $t2

sub $s4, $s1, $s5

and $s5, $t5, $t6

sw $s6, 20($s1)

or $s7, $t3, $t4

1 2 3 4 5 6 7 8 9 10

add

IM

IM

IM

IM

IM

IMlw

sub

and

sw

or

Carnegie Mellon

15

Single-Cycle and Pipelined Datapath

SignImmE

CLK

A RD

Instruction

Memory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

Register

File

0

1

0

1

A RD

Data

Memory

WD

WE0

1

PCF0

1

PC' InstrD25:21

20:16

15:0

SrcBE

20:16

15:11

RtE

RdE

<<2+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchM

ResultW

PCPlus4EPCPlus4F

ZeroM

CLK CLK

ALU

WriteRegE4:0

CLK

CLK

CLK

SignImm

CLK

A RD

Instruction

Memory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

Register

File

0

1

0

1

A RD

Data

Memory

WD

WE0

1

PC0

1

PC' Instr25:21

20:16

15:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

Zero

CLK

ALU

Fetch Decode Execute Memory Writeback

Carnegie Mellon

16

Corrected Pipelined Datapath

WriteReg must arrive at the same time as Result

SignImmE

CLK

A RD

Instruction

Memory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

Register

File

0

1

0

1

A RD

Data

Memory

WD

WE0

1

PCF0

1

PC' InstrD25:21

20:16

15:0

SrcBE

20:16

15:11

RtE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4EPCPlus4F

ZeroM

CLK CLK

WriteRegW4:0

ALU

WriteRegE4:0

CLK

CLK

CLK

Fetch Decode Execute Memory Writeback

Carnegie Mellon

17

Pipelined Control

SignImmE

CLK

A RD

Instruction

Memory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

Register

File

0

1

0

1

A RD

Data

Memory

WD

WE0

1

PCF0

1

PC' InstrD25:21

20:16

15:0

5:0

SrcBE

20:16

15:11

RtE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4EPCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD

ALUSrcD

RegWriteD

Op

Funct

Control

Unit

ZeroM

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU

RegWriteE RegWriteM RegWriteW

MemtoRegE MemtoRegM MemtoRegW

MemWriteE MemWriteM

BranchE BranchM

RegDstE

ALUSrcE

WriteRegE4:0

Same control unit as single-cycle processorControl delayed to proper pipeline stage

Carnegie Mellon

18

Pipeline Hazard

Occurs when an instruction depends on results from previous instruction that hasn’t completed.

Types of hazards:

Data hazard: register value not written back to register file yet

Control hazard: next instruction not decided yet (caused by branches)

Carnegie Mellon

19

Data Hazard

The register file can be read and written in the same cycle:

write takes place during the 1st half of the cycle

read takes place during the 2nd half of the cycle => no hazard !!!

However operations that involve register file have only half a clock cycle to complete the operation!!

Time (cycles)

add $s0, $s2, $s3 RF $s3

$s2

RF$s0

+ DM

RF $s1

$s0

RF$t0

& DM

RF $s0

$s4

RF$t1

| DM

RF $s5

$s0

RF$t2

- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMadd

or

sub

Carnegie Mellon

20

Data Hazard

One instruction writes a register ($s0) and next instructions read this register => read after write (RAW) hazard.

add writes into $s0 in the first half of cycle 5

and reads $s0 on cycle 3, obtaining the wrong value

or reads $s0 on cycle 4, again obtaining the wrong value.

sub reads $s0 in the second half of cycle 5, obtaining the correct value

subsequent instructions read the correct value of $s0

Time (cycles)

add $s0, $s2, $s3 RF $s3

$s2

RF$s0

+ DM

RF $s1

$s0

RF$t0

& DM

RF $s0

$s4

RF$t1

| DM

RF $s5

$s0

RF$t2

- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMadd

or

sub

Carnegie Mellon

21

How Can You Handle Data Hazards?

Insert “NOP”s (No OPeration) in code at compile time

Rearrange code at compile time

Forward data at run time

Stall the processor at run time

Carnegie Mellon

22

Compile-Time Hazard Elimination

Time (cycles)

add $s0, $s2, $s3 RF $s3

$s2

RF$s0

+ DM

RF $s1

$s0

RF$t0

& DM

RF $s0

$s4

RF$t1

|DM

RF $s5

$s0

RF$t2

-DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMadd

or

sub

nop

nop

RF RFDMnopIM

RF RFDMnopIM

9 10

Insert enough NOPs for result to be ready

Or (if you can) move independent useful instructions forward

Carnegie Mellon

23

Data Forwarding

Time (cycles)

add $s0, $s2, $s3 RF $s3

$s2

RF$s0

+ DM

RF $s1

$s0

RF$t0

& DM

RF $s0

$s4

RF$t1

| DM

RF $s5

$s0

RF$t2

- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMadd

or

sub

Carnegie Mellon

24

Data Forwarding

SignImmE

CLK

A RD

Instruction

Memory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign

Extend

Register

File

0

1

0

1

A RD

Data

Memory

WD

WE

1

0

PCF0

1

PC' InstrD25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

Control

Unit

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

SignImmD

Forw

ard

AE

Forw

ard

BE

20:16RtE

RsD

RdD

RtD

RegW

rite

M

RegW

rite

W

Hazard Unit

PCPlus4E

BranchE BranchM

ZeroM

Carnegie Mellon

27

Stalling

Time (cycles)

lw $s0, 40($0) RF 40

$0

RF$s0

+ DM

RF $s1

$s0

RF$t0

& DM

RF $s0

$s4

RF$t1

| DM

RF $s5

$s0

RF$t2

- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMlw

or

sub

Trouble!

Forwarding is sufficient to solve RAW data hazards

but …

Carnegie Mellon

28

Stalling

Time (cycles)

lw $s0, 40($0) RF 40

$0

RF$s0

+ DM

RF $s1

$s0

RF$t0

& DM

RF $s0

$s4

RF$t1

| DM

RF $s5

$s0

RF$t2

- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMlw

or

sub

Trouble!

The lw instruction does not finish reading data until the end of the Memory stage, so its result cannot be forwarded to the Execute stage of the next instruction.

Carnegie Mellon

29

Stalling

Time (cycles)

lw $s0, 40($0) RF 40

$0

RF$s0

+ DM

RF $s1

$s0

RF$t0

& DM

RF $s0

$s4

RF$t1

| DM

RF $s5

$s0

RF$t2

- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMlw

or

sub

Trouble!

The lw instruction has a two-cycle latency, therefore a dependent instruction cannot use its result until two cycles later.

The lw instruction receives data from memory at the end of cycle 4. But the and instruction needs that data as a source operand at the beginning of cycle 4. There is no way to solve this hazard with forwarding.

Carnegie Mellon

30

Stalling

Time (cycles)

lw $s0, 40($0) RF 40

$0

RF$s0

+ DM

RF $s1

$s0

RF$t0

& DM

RF $s0

$s4

RF$t1

| DM

RF $s5

$s0

RF$t2

- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMlw

or

sub

9

RF $s1

$s0

IMor

Stall

Carnegie Mellon

31

Stalling Hardware

Stalls are supported by:

adding enable inputs (EN) to the Fetch and Decode pipeline registers

and a synchronous reset/clear (CLR) input to the Execute pipeline register.

When a lw stall occurs

StallD and StallF are asserted to force the Decode and Fetch stage pipeline registers to hold their old values.

FlushE is also asserted to clear the contents of the Execute stage pipeline register, introducing a bubble

Carnegie Mellon

32

Stalling Hardware

SignImmE

CLK

A RD

Instruction

Memory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign

Extend

Register

File

0

1

0

1

A RD

Data

Memory

WD

WE

1

0

PCF0

1

PC' InstrD25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

Control

Unit

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

SignImmD

Sta

llF

Sta

llD

Forw

ard

AE

Forw

ard

BE

20:16RtE

RsD

RdD

RtD

RegW

rite

M

RegW

rite

W

Mem

toR

egE

Hazard Unit

Flu

shE

PCPlus4E

BranchE BranchM

ZeroM

EN

EN

CLR

Carnegie Mellon

33

Control Hazards

beq:

branch is not determined until the fourth stage of the pipeline

Instructions after the branch are fetched before branch occurs

These instructions must be flushed if the branch happens

Branch misprediction penalty

number of instruction flushed when branch is taken

May be reduced by determining branch earlier

Carnegie Mellon

34

Control Hazards: Original Pipeline

SignImmE

CLK

A RD

Instruction

Memory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign

Extend

Register

File

0

1

0

1

A RD

Data

Memory

WD

WE

1

0

PCF0

1

PC' InstrD25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

Control

Unit

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

SignImmD

Sta

llF

Sta

llD

Forw

ard

AE

Forw

ard

BE

20:16RtE

RsD

RdD

RtD

RegW

rite

M

RegW

rite

W

Mem

toR

egE

Hazard Unit

Flu

shE

PCPlus4E

BranchE BranchM

ZeroM

EN

EN

CLR

Carnegie Mellon

35

Control Hazards

Time (cycles)

beq $t1, $t2, 40 RF $t2

$t1

RF- DM

RF $s1

$s0

RF& DM

RF $s0

$s4

RF| DM

RF $s5

$s0

RF- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMlw

or

sub

20

24

28

2C

30

...

...

9

Flush

these

instructions

64 slt $t3, $s2, $s3 RF $s3

$s2

RF$t3s

lt

DMIMslt

Carnegie Mellon

36

Control Hazards: Early Branch Resolution

EqualD

SignImmE

CLK

A RD

Instruction

Memory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign

Extend

Register

File

0

1

0

1

A RD

Data

Memory

WD

WE

1

0

PCF0

1

PC' InstrD25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchD

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

Control

Unit

PCSrcD

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

=

SignImmD

Sta

llF

Sta

llD

Forw

ard

AE

Forw

ard

BE

20:16RtE

RsD

RdE

RtD

RegW

rite

M

RegW

rite

W

Mem

toR

egE

Hazard Unit

Flu

shE

EN

EN

CLR

CLR

Introduced another data hazard in Decode stage.. this is another story

Carnegie Mellon

37

Early Branch Resolution

Time (cycles)

beq $t1, $t2, 40 RF $t2

$t1

RF- DM

RF $s1

$s0

RF& DMand $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

andIM

IMlw

20

24

28

2C

30

...

...

9

Flush

this

instruction

64 slt $t3, $s2, $s3 RF $s3

$s2

RF$t3s

lt

DMIMslt

Carnegie Mellon

38

Handling Data and Control Hazards

EqualD

SignImmE

CLK

A RD

Instruction

Memory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign

Extend

Register

File

0

1

0

1

A RD

Data

Memory

WD

WE

1

0

PCF0

1

PC' InstrD25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchD

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

Control

Unit

PCSrcD

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

0

1

0

1

=

SignImmD

Sta

llF

Sta

llD

Forw

ard

AE

Forw

ard

BE

Forw

ard

AD

Forw

ard

BD

20:16RtE

RsD

RdD

RtD

RegW

rite

E

RegW

rite

M

RegW

rite

W

Mem

toR

egE

Bra

nchD

Hazard Unit

Flu

shE

EN

EN

CLR

CLR

Possible solution to data hazard in Decode stage.

Carnegie Mellon

40

Branch Prediction

Guess whether branch will be taken

Backward branches are usually taken (loops)

Perhaps consider history of whether branch was previously taken to improve the guess

Good prediction reduces the fraction of branches requiring a flush

Carnegie Mellon

41

Pipelined Performance Example

SPECINT2000 benchmark:

25% loads

10% stores

11% branches

2% jumps

52% R-type

Suppose:

40% of loads used by next instruction

25% of branches mispredicted

All jumps flush next instruction

What is the average CPI?

Carnegie Mellon

42

Pipelined Performance Example Solution

Load/Branch CPI = 1 when no stalling, 2 when stalling.Thus:

CPIlw = 1(0.6) + 2(0.4) = 1.4 Average CPI for load

CPIbeq = 1(0.75) + 2(0.25) = 1.25 Average CPI for branch

And

Average CPI =

Carnegie Mellon

43

Pipelined Performance Example Solution

Load/Branch CPI = 1 when no stalling, 2 when stalling.Thus:

CPIlw = 1(0.6) + 2(0.4) = 1.4 Average CPI for load

CPIbeq = 1(0.75) + 2(0.25) = 1.25 Average CPI for branch

And

Average CPI = (0.25)(1.4) + load(0.1)(1) + store(0.11)(1.25) + beq(0.02)(2) + jump(0.52)(1) r-type

= 1.15

Carnegie Mellon

44

Pipelined Performance

There are 5 stages, and 5 different timing paths:

Tc = max {tpcq + tmem + tsetup fetch

2(tRFread + tmux + teq + tAND + tmux + tsetup ) decode

tpcq + tmux + tmux + tALU + tsetup execute

tpcq + tmemwrite + tsetup memory

2(tpcq + tmux + tRFwrite) writeback

}

The operation speed depends on the slowest operation

Decode and Writeback use register file and have only half aclock cycle to complete, that is why there is a 2 in front of them

Carnegie Mellon

45


Element Parameter Delay (ps)

Register clock-to-Q tpcq_PC 30

Register setup tsetup 20

Multiplexer tmux 25

ALU tALU 200

Memory read tmem 250

Register file read tRFread 150

Register file setup tRFsetup 20

Equality comparator teq 40

AND gate tAND 15

Memory write Tmemwrite 220

Register file write tRFwrite 100

Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup )= 2[150 + 25 + 40 + 15 + 25 + 20] ps= 550 ps

Carnegie Mellon

46


For a program with 100 billion instructions executing on a pipelined MIPS processor:

CPI = 1.15

Tc = 550 ps

Execution Time = (# instructions) × CPI × Tc

= (100 × 109)(1.15)(550 × 10-12)= 63 seconds

Carnegie Mellon

47

Performance Summary for MIPS arch.

Processor

Execution Time

(seconds)

Speedup

(single-cycle is baseline)

Single-cycle 95 1

Multicycle 133 0.71

Pipelined 63 1.51

Fastest of the three MIPS architectures is Pipelined.

However, even though we have 5 fold pipelining, it is not 5 times faster than single cycle.

Carnegie Mellon

48

What Did We Learn?

How to design a pipelined architecture

Break down long combinational path by registers

Shortens the clock period

You can start processing next instruction once the first part is complete

Problems of the pipelined architecture

Data you need for the next instruction may not be ready (Data Hazard)

You may not yet know the next instruction (Control Hazard)

Solutions to Hazards

Performance of pipelined MIPS architecture

Documents

Pipelined MIPS Microarchitecture · Pipelined MIPS Microarchitecture. Carnegie Mellon 2 What Will We Learn ... Single-Cycle and Pipelined Datapath SignImmE CLK A RD Instruction Memory