19
L18 – Pipeline Issues 1 Comp 411 CPU Pipelining Issues Finishing up 4.5-4.8 This pipe stuff makes my head hurt! What have you been beating your head against?

CPU Pipelining Issues

  • Upload
    patia

  • View
    47

  • Download
    0

Embed Size (px)

DESCRIPTION

CPU Pipelining Issues. What have you been beating your head against?. This pipe stuff makes my head hurt!. Finishing up 4.5-4.8. Pipelining. Improve performance by increasing instruction throughput Ideal speedup is number of stages in the pipeline. Do we achieve this?. Pipelining. - PowerPoint PPT Presentation

Citation preview

Page 1: CPU Pipelining Issues

L18 – Pipeline Issues 1Comp 411

CPU Pipelining Issues

Finishing up 4.5-4.8

This pipe stuff makesmy head hurt!

What have you been beating

your head against?

Page 2: CPU Pipelining Issues

L18 – Pipeline Issues 2Comp 411

Pipelining

Improve performance by increasing instruction throughput

Ideal speedup is number of stages in the pipeline. Do we achieve this?

Instructionfetch

Reg ALUData

accessReg

8 nsInstruction

fetchReg ALU

Dataaccess

Reg

8 nsInstruction

fetch

8 ns

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

...

Programexecutionorder(in instructions)

Instructionfetch

Reg ALUData

accessReg

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

fetchReg ALU

Dataaccess

Reg

2 nsInstruction

fetchReg ALU

Dataaccess

Reg

2 ns 2 ns 2 ns 2 ns 2 ns

Programexecutionorder(in instructions)

Page 3: CPU Pipelining Issues

L18 – Pipeline Issues 3Comp 411

Pipelining

What makes it easyall instructions are the same lengthjust a few instruction formatsmemory operands appear only in loads and stores

What makes it hard?structural hazards: suppose we had only one memorycontrol hazards: need to worry about branch instructionsdata hazards: an instruction depends on a previous

instruction

Individual Instructions still take the same number of cycles

But we’ve improved the through-put by increasing the number of simultaneously executing instructions

Page 4: CPU Pipelining Issues

L18 – Pipeline Issues 4Comp 411

Structural Hazards

InstFetch

RegRead

ALU DataAccess

Reg Write

InstFetch

RegRead

ALU DataAccess

Reg Write

InstFetch

RegRead

ALU DataAccess

Reg Write

InstFetch

RegRead

ALU DataAccess

Reg Write

Page 5: CPU Pipelining Issues

L18 – Pipeline Issues 5Comp 411

Problem with starting next instruction before first is finisheddependencies that “go backward in time” are

data hazards

Data Hazards

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecutionorder(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2:

DM Reg

Reg

Reg

Reg

DM

Page 6: CPU Pipelining Issues

L18 – Pipeline Issues 6Comp 411

Have compiler guarantee no hazardsWhere do we insert the “nops” ?

sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)

Problem: this really slows us down!

Software Solution

Page 7: CPU Pipelining Issues

L18 – Pipeline Issues 7Comp 411

Use temporary results, don’t wait for them to be written register file forwarding to handle read/write to same register ALU forwarding

Forwarding

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecution order(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2 :

DM Reg

Reg

Reg

Reg

X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :

DM

Page 8: CPU Pipelining Issues

L18 – Pipeline Issues 8Comp 411

Load word can still cause a hazard:an instruction tries to read a register following a load

instruction that writes to the same register.

Thus, we need a hazard detection unit to “stall” the instruction

Can't always forward

Reg

IM

Reg

Reg

IM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

lw $2, 20($1)

Programexecutionorder(in instructions)

and $4, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

DM Reg

Reg

Reg

DM

Page 9: CPU Pipelining Issues

L18 – Pipeline Issues 9Comp 411

Stalling

We can stall the pipeline by keeping an instruction in the same stage

lw $2, 20($1)

Programexecutionorder(in instructions)

and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

Reg

IM

Reg

Reg

IM DM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)

IM Reg DM RegIM

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9 CC 10

DM Reg

RegReg

Reg

bubble

Page 10: CPU Pipelining Issues

L18 – Pipeline Issues 10Comp 411

When we decide to branch, other instructions are in the pipeline!

We are predicting “branch not taken”need to add hardware for flushing instructions if

we are wrong

Branch Hazards

Reg

Reg

CC 1

Time (in clock cycles)

40 beq $1, $3, 7

Programexecutionorder(in instructions)

IM Reg

IM DM

IM DM

IM DM

DM

DM Reg

Reg Reg

Reg

Reg

RegIM

44 and $12, $2, $5

48 or $13, $6, $2

52 add $14, $2, $2

72 lw $4, 50($7)

CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9

Reg

Page 11: CPU Pipelining Issues

L18 – Pipeline Issues 11Comp 411

Improving Performance

Try to avoid stalls! E.g., reorder these instructions:

lw $t0, 0($t1)lw $t2, 4($t1)sw $t2, 0($t1)sw $t0, 4($t1)

Add a “branch delay slot”the next instruction after a branch is always

executedrely on compiler to “fill” the slot with something

useful

Superscalar: start more than one instruction in the same cycle

Page 12: CPU Pipelining Issues

L18 – Pipeline Issues 12Comp 411

Dynamic Scheduling

The hardware performs the “scheduling” hardware tries to find instructions to execute

out of order execution is possible

speculative execution and dynamic branch prediction

All modern processors are very complicatedPentium 4: 20 stage pipeline, 6 simultaneous

instructions

PowerPC and Pentium: branch history table

Compiler technology important

Page 13: CPU Pipelining Issues

L18 – Pipeline Issues 13Comp 411

Pipeline Summary (I)

• Started with unpipelined implementation– direct execute, 3-5 cycles/instruction– it had a long cycle time: mem + regs + alu + mem

+ wb

• We ended up with a 5-stage pipelined implementation– increase throughput (3x???)– delayed branch decision (1 cycle)

Choose to execute instruction after branch– delayed register writeback (3 cycles)

Add bypass paths (6 x 2 = 12) to forward correct value

– memory data available only in WB stageIntroduce NOPs at IRALU, to stall IF and RF

stages until LD result was ready

Page 14: CPU Pipelining Issues

L18 – Pipeline Issues 14Comp 411

Pipeline Summary (II)

Fallacy #1: Pipelining is easySmart people get it wrong all of the time!

Fallacy #2: Pipelining is independent of ISAMany ISA decisions impact how easy/costly it is to implement pipelining (i.e. branch semantics, addressing modes).

Fallacy #3: Increasing Pipeline stages improves performanceDiminishing returns. Increasing complexity.

Page 15: CPU Pipelining Issues

L18 – Pipeline Issues 15Comp 411

RISC = Simplicity???

Generalization of registers and

operand coding

Complex instructions, addressing modes

Addressingfeatures, eg

index registers

RISCs

Primitive Machines with

direct implementati

ons

VLIWs,Super-Scalars ?

Pipelines, Bypasses,

Annulment, …, ...

“The P.T. Barnum World’s Tallest Dwarf Competition”

World’s Most Complex RISC?

Page 16: CPU Pipelining Issues

L18 – Pipeline Issues 16Comp 411

Chapter 5 Preview

Memory Hierarchy

Page 17: CPU Pipelining Issues

L18 – Pipeline Issues 17Comp 411

Memory Hierarchy

Memory devices come in several different flavorsSRAM – Static Ram

fast (0.5 to 10ns)expensive (>10 times DRAM)small capacity (few megabytes typically)

DRAM – Dynamic RAMMuch slower than SRAM (50ns – 100ns)Access time varies with addressAffordable ($22 / gigabyte) (2GB DDR3,

Mar 2010)4 Gig typical

DISKSlow! (10ms access time)Cheap! (< $0.10 / GB) (down from

$193,000/GB in 1980)Big! (1Tbyte is no problem)

Page 18: CPU Pipelining Issues

L18 – Pipeline Issues 18Comp 411

Users want large and fast memories!

Memory Hierarchy

Page 19: CPU Pipelining Issues

L18 – Pipeline Issues 19Comp 411

Locality

A principle that makes having a memory hierarchy a good idea

If an item is referenced,

temporal locality: it will tend to be referenced again soonspatial locality: nearby items will tend to be referenced soon.

Why does code have locality?