ENGS 116 Lecture 6

Pipelining Difficulties and MIPS R4000

Vincent H. Berk

October 6, 2008

Reading for today: A.3 – A.4, article: Yeager

Reading for Wednesday: A.5 – A.6, article: Smith & Pleszkun

FRIDAY: NO CLASS




Exception Characterization

Synchronous vs. asynchronous
– Synchronous: event occurs at the same place every time
– Asynchronous: caused by devices external to CPU & memory, also hw malfunctions

User requested vs. user coerced
– Requested: user task asks for it
– Coerced: hw event not under control of user program

User maskable vs. user nonmaskable
– Maskable: event that can be disabled by user task

Within vs. between instructions
– Within: occurs during execution of an instruction; hard to handle; usually synchronous since the instruction is the trigger

Resume vs. terminate
– Terminating: execution always stops after the interrupt


Exception Handling

Table of interrupt vector addresses

• Base register of this table stored in CPU by OS

• Addresses of interrupt handling routines are stored in table

• On interrupt, CPU jumps to the handler address stored at: base + 4 * int_num (4-byte table entries)

• Usually 16 or 32 interrupts

• Physical pins on CPU, as well as software calls
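The table lookup above can be sketched in a few lines. This is a minimal illustration, assuming 32-bit little-endian handler addresses; the names `build_vector_table` and `handler_address`, the memory image and the handler addresses are all hypothetical and not taken from the lecture.

```python
import struct

ENTRY_SIZE = 4  # bytes per table entry (assumed 32-bit handler addresses)


def build_vector_table(handler_addresses):
    """Pack handler addresses into a little-endian byte table."""
    return b"".join(struct.pack("<I", addr) for addr in handler_addresses)


def handler_address(memory, base, int_num):
    """Return the handler address stored at base + 4 * int_num."""
    offset = base + ENTRY_SIZE * int_num
    (addr,) = struct.unpack_from("<I", memory, offset)
    return addr


# Hypothetical memory image: 16 bytes of padding, then a 4-entry table.
BASE = 16
handlers = [0x1000, 0x1040, 0x1080, 0x10C0]
memory = bytes(BASE) + build_vector_table(handlers)

# Interrupt 2 dispatches to the third entry of the table.
target = handler_address(memory, BASE, 2)
```

The OS would write `BASE` into the CPU's table base register once; after that, the hardware only needs the interrupt number to find the handler.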


Exception Examples (see also: Figure A.27)

• I/O request: device requests attention from CPU

• System call or Supervisor call from software

• Breakpoint or instruction tracing: software debugging, single-step

• Arithmetic: Integer or FP, overflow, underflow, division by zero

• Page fault: requested virtual address was not present in main memory

• Misaligned address: bus error

• Memory protection: read/write/execute forbidden on requested address

• Invalid opcode: CPU was given a wrongly formatted instruction

• Hardware malfunction: CRC errors, component failure

• Power failure


Pipelining Complications

• Exceptions: 5 instructions executing in 5-stage pipeline
  – How to stop the pipeline?
  – How to restart the pipeline?
  – Who caused the exception?

Stage   Problem exceptions occurring
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation


Pipelining Complications

• Simultaneous exceptions in more than one pipeline stage, e.g.,
  – Load with data page fault in MEM stage
  – Add with instruction page fault in IF stage
  – The add's fault is raised BEFORE the load's fault, even though the load comes first in program order

• Solution #1
  – Interrupt status vector per instruction
  – Defer check till last stage, kill state update if exception

• Solution #2
  – Interrupt ASAP
  – Restart everything that is incomplete

Another advantage for state update late in pipeline!
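The ordering problem and Solution #1 can be checked with a small timing model. This is a sketch only, assuming an ideal 5-stage pipeline that issues one instruction per cycle with no stalls; the function names and the `faults` list are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def cycle_in_stage(issue_index, stage):
    """Clock cycle (1-based) in which instruction `issue_index` (0-based)
    occupies `stage`, assuming one issue per cycle and no stalls."""
    return issue_index + STAGES.index(stage) + 1


def order_detected(faults):
    """Order in which the hardware first sees the faults (by clock cycle)."""
    ranked = sorted(faults, key=lambda f: cycle_in_stage(f[1], f[2]))
    return [name for name, _, _ in ranked]


def order_handled_deferred(faults):
    """Order in which faults are taken when each instruction carries a
    status vector that is only checked in WB (Solution #1)."""
    ranked = sorted(faults, key=lambda f: cycle_in_stage(f[1], "WB"))
    return [name for name, _, _ in ranked]


# (name, program-order index, stage where the fault is raised)
faults = [("LD", 0, "MEM"), ("ADD", 1, "IF")]

detected = order_detected(faults)         # ADD's fault shows up first in time
handled = order_handled_deferred(faults)  # LD's fault is taken first
```

In this model the add faults in cycle 2 and the load in cycle 4, but checking the status vector in WB restores program order. In a real machine the add would simply be squashed and refetched after the load's handler returns.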


Pipelining Complications

• Complex addressing modes and instructions

• Address modes: Autoincrement causes register change during instruction execution

– Interrupts? Need to restore register state

– Adds WAR and WAW hazards since writes no longer in last stage

• Memory-memory move instructions

– Must be able to handle multiple page faults

– Long-lived instructions: partial state save on interrupt

• Floating point: long execution time; out of order completion


Stopping and Starting Execution

Most difficult exception occurrences have 2 properties

– They occur within instructions

– They must be restartable

The pipeline must be shut down safely and the state must be saved for correct restarting

Restarting is usually done by saving PC of instruction at which to start

Branches and delayed branches require special treatment

With precise exceptions, all instructions before the faulting instruction complete, and the faulting instruction and those after it can be restarted from scratch


Figure A.29 The MIPS pipeline with three additional unpipelined, floating-point, functional units.

[Figure: IF → ID → one of four parallel EX units (integer unit; FP/integer multiply; FP adder; FP/integer divider) → MEM → WB]


Figure A.31 A pipeline that supports multiple outstanding FP operations

[Figure: IF → ID → one of: integer unit (EX); FP/integer multiply (M1 to M7); FP adder (A1 to A4); FP/integer divider (DIV) → MEM → WB]


Figure A.33 A typical FP code sequence showing the stalls arising from RAW hazards.

Clock Cycle Number

Instruction       1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17
LD F4, 0(R2)      IF    ID    EX    MEM   WB
MULTD F0, F4, F6        IF    ID    stall M1    M2    M3    M4    M5    M6    M7    MEM   WB
ADDD F2, F0, F8               IF    stall ID    stall stall stall stall stall stall A1    A2    A3    A4    MEM
SD 0(R2), F2                              IF    stall stall stall stall stall stall ID    EX    stall stall stall MEM


Case Study: MIPS R4000 (100 MHz to 200 MHz)

• 8 Stage Pipeline:

– IF – first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.

– IS – second half of access to instruction cache.

– RF – instruction decode and register fetch, hazard checking and also instruction cache hit detection.

– EX – execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.

– DF – data fetch, first half of access to data cache.

– DS – second half of access to data cache.

– TC – tag check, determine whether the data cache access hit.

– WB – write back for loads and register-register operations.

• 8 Stages: What is impact on Load delay? Branch delay? Why?


Figure A.37 The eight-stage pipeline structure of the R4000 uses pipelined instruction and data cache accesses.

[Figure: stages IF, IS, RF, EX, DF, DS, TC, WB drawn over the hardware they use: instruction memory, Reg, ALU, data memory, Reg]


Case Study: MIPS R4000

[Figure: overlapped execution of successive instructions through IF, IS, RF, EX, DF, DS, TC, WB, one instruction entering per cycle]

TWO-cycle load latency

THREE-cycle branch latency (conditions evaluated during EX phase)
– Delay slot plus two stalls
– Branch likely cancels delay slot if not taken
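The two-cycle load latency and three-cycle branch latency follow from how far apart the producing and consuming stages sit. The sketch below assumes that load data is forwarded from the end of DS into EX of a later instruction (an assumption based on the textbook figure, not stated on the slide) and that the branch outcome is known at the end of EX. The function names and the five-stage comparison are illustrative.

```python
R4000_STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]
FIVE_STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_distance(produce_stage, consume_stage, stages):
    """Number of stages between where a value is produced and where a
    following instruction needs it; equals the delay in cycles."""
    return stages.index(produce_stage) - stages.index(consume_stage)


def load_delay(stages, data_ready_stage):
    """Cycles a dependent instruction must wait behind a load,
    given the stage at whose end the load data is available."""
    return stage_distance(data_ready_stage, "EX", stages)


def branch_delay(stages, resolve_stage):
    """Cycles of fetch lost after a branch resolved in `resolve_stage`."""
    return stage_distance(resolve_stage, "IF", stages)


r4000_load = load_delay(R4000_STAGES, "DS")      # 2 cycles
r4000_branch = branch_delay(R4000_STAGES, "EX")  # 3 cycles
classic_load = load_delay(FIVE_STAGES, "MEM")    # 1 cycle
```

Splitting the cache accesses over more stages pushes the data-ready and branch-resolve points further from the front of the pipeline, which is why both delays grow relative to the five-stage design.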


MIPS R4000 Floating Point

• FP Adder, FP Multiplier, FP Divider

• Last step of FP Multiplier/Divider uses FP Adder HW

• 8 kinds of stages in FP units:

Stage   Functional unit   Description
A       FP adder          Mantissa ADD stage
D       FP divider        Divide pipeline stage
E       FP multiplier     Exception test stage
M       FP multiplier     First stage of multiplier
N       FP multiplier     Second stage of multiplier
R       FP adder          Rounding stage
S       FP adder          Operand shift stage
U                         Unpack FP numbers


R4000 Performance

• Not ideal CPI of 1:

– Load stalls (1 or 2 clock cycles)

– Branch stalls (2 cycles + unfilled slots)

– FP result stalls: RAW data hazard (latency)

– FP structural stalls: Not enough FP hardware (parallelism)
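The four stall sources above add directly to the ideal CPI: pipeline CPI = ideal CPI + stall cycles per instruction. The sketch below shows the arithmetic only. The stall figures in `example_stalls` are made-up placeholders, not R4000 measurements, and `pipeline_cpi` is a hypothetical helper name.

```python
def pipeline_cpi(stalls_per_instruction, ideal_cpi=1.0):
    """Pipeline CPI = ideal CPI + sum of average stall cycles per instruction."""
    return ideal_cpi + sum(stalls_per_instruction.values())


# Placeholder values for illustration only (NOT measured data).
example_stalls = {
    "load": 0.10,
    "branch": 0.25,
    "fp_result": 0.15,
    "fp_structural": 0.02,
}

cpi = pipeline_cpi(example_stalls)
```

With these placeholder numbers the branch term dominates, which is the kind of breakdown one would look for when deciding which stall source to attack first.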


Instruction Level Parallelism

Want to exploit parallelism among instruction sequences

Branches interfere with parallelism - gcc has branch every 5 or 6 instructions (on average)

Need to find sequences of unrelated instructions that can be overlapped

Often see loop-level parallelism

for (i = 0; i < 100; i = i + 1)
    x[i] = x[i] + y[i];

Want to convert loop-level parallelism to instruction-level parallelism
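Because no iteration of the loop above depends on another, several iterations can be written out side by side, giving a scheduler independent operations to overlap. The sketch below uses loop unrolling by four as one common way to do this; the slide itself does not name a technique, so that choice is an assumption, and the function names are hypothetical. Python gains no speed from this; it only shows the shape of the transformation.

```python
def add_rolled(x, y):
    """Original loop: one element per iteration, one branch per element."""
    for i in range(len(x)):
        x[i] = x[i] + y[i]
    return x


def add_unrolled4(x, y):
    """Same computation with the body replicated four times.

    The four statements touch different elements, so they are independent
    of each other, and the loop branch runs once per four elements."""
    n = len(x)
    i = 0
    while i + 4 <= n:
        x[i] = x[i] + y[i]
        x[i + 1] = x[i + 1] + y[i + 1]
        x[i + 2] = x[i + 2] + y[i + 2]
        x[i + 3] = x[i + 3] + y[i + 3]
        i += 4
    while i < n:  # clean-up loop for leftover elements
        x[i] = x[i] + y[i]
        i += 1
    return x


xs = list(range(100))
ys = [2 * v for v in range(100)]
result = add_unrolled4(list(xs), ys)
```

Fewer branches per element also helps with the branch frequency problem noted above.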


FP Loop: Where are the Hazards?

Loop:   LD    F0, 0(R1)     ; F0 = vector element
        ADDD  F4, F0, F2    ; add scalar in F2
        SD    0(R1), F4     ; store result
        SUBI  R1, R1, #8    ; decrement pointer 8 bytes (DW)
        BNEZ  R1, Loop      ; branch R1 != zero
        NOP                 ; delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0


FP Loop Hazards

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0

• Where are the stalls?

Loop:   LD    F0, 0(R1)     ; F0 = vector element
        ADDD  F4, F0, F2    ; add scalar in F2
        SD    0(R1), F4     ; store result
        SUBI  R1, R1, #8    ; decrement pointer 8 bytes (DW)
        BNEZ  R1, Loop      ; branch R1 != zero
        NOP                 ; delayed branch slot


FP Loop Showing Stalls

 1  Loop:   LD    F0, 0(R1)     ; F0 = vector element
 2          stall
 3          ADDD  F4, F0, F2    ; add scalar in F2
 4          stall
 5          stall
 6          SD    0(R1), F4     ; store result
 7          SUBI  R1, R1, #8    ; decrement pointer 8 bytes (DW)
 8          stall               ; wait for result R1
 9          BNEZ  R1, Loop      ; branch R1 != zero
10          stall               ; delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

• Rewrite code to minimize stalls?
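The stall positions above can be reproduced mechanically from the latency table, which makes it easy to try other instruction orders. The sketch below models one loop iteration on an in-order, single-issue pipeline. The integer-to-branch latency of 1 is an assumption inferred from the "wait for result R1" stall, and the `rescheduled` order is one possible answer, not taken from the slides. The model tracks timing only, so the adjusted store offset appears just in the instruction text.

```python
# Stall cycles required between a producer and a dependent consumer.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
    ("int", "branch"): 1,  # assumption: inferred from the stall before BNEZ
}


def issue_cycles(program):
    """Issue cycle of each instruction for in-order, single-issue execution.

    program: list of (text, kind, dest_register_or_None, source_registers)."""
    producers = {}  # register -> (issue cycle, kind) of its latest writer
    cycles = []
    cycle = 0
    for _text, kind, dest, sources in program:
        earliest = cycle + 1
        for reg in sources:
            if reg in producers:
                p_cycle, p_kind = producers[reg]
                wait = LATENCY.get((p_kind, kind), 0)
                earliest = max(earliest, p_cycle + 1 + wait)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            producers[dest] = (cycle, kind)
    return cycles


original = [
    ("LD F0, 0(R1)", "load", "F0", ["R1"]),
    ("ADDD F4, F0, F2", "fp_alu", "F4", ["F0", "F2"]),
    ("SD 0(R1), F4", "store", None, ["R1", "F4"]),
    ("SUBI R1, R1, #8", "int", "R1", ["R1"]),
    ("BNEZ R1, Loop", "branch", None, ["R1"]),
    ("NOP", "nop", None, []),  # unfilled delay slot
]

# Hypothetical reordering: SUBI moved up, SD moved into the delay slot
# (its offset becomes 8(R1) because R1 has already been decremented).
rescheduled = [
    ("LD F0, 0(R1)", "load", "F0", ["R1"]),
    ("SUBI R1, R1, #8", "int", "R1", ["R1"]),
    ("ADDD F4, F0, F2", "fp_alu", "F4", ["F0", "F2"]),
    ("BNEZ R1, Loop", "branch", None, ["R1"]),
    ("SD 8(R1), F4", "store", None, ["R1", "F4"]),
]

original_cycles = issue_cycles(original)
rescheduled_cycles = issue_cycles(rescheduled)
```

Under these assumptions the original order finishes issuing in cycle 10, matching the listing above, while the reordered version finishes in cycle 6 with a single remaining stall while the store waits for the ADDD result.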