


  • Instruction Level Parallelism

    Appendix C and Chapter 3, HP5e

  • Outline
    ● Pipelining, Hazards
    ● Branch prediction
    ● Static and Dynamic Scheduling
    ● Speculation
    ● Compiler techniques, VLIW
    ● Limits of ILP

  • Pipelining Basics

  • Implementation of RISC ISA - Stages
    ● Instruction Fetch (IF)
    ● Instruction Decode/Register Fetch (ID)
      – Fixed field decoding
    ● Execution/Effective address (EX)
    ● Memory Access (MEM)
    ● Write back (WB)

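  • MIPS Datapath

    [Figure: MIPS five-stage datapath. Components: PC, adder (+4), instruction memory (IM), NPC, IR, register file (Regs; fields rs, rt, rd), sign extend (Imm, 16 to 32 bits), operand latches A and B, ALU (ALUOutput), Zero?/Cond, data memory (DM), LMD, and multiplexers. Stages: Instruction Fetch (IF), Instruction Decode/Register Fetch (ID), Execute/Address Calculation (EX), Memory Access (MEM), Write Back (WB).]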

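  • Multiple Issue Integer Pipeline

    [Figure: multiple-issue integer pipeline. Components: IM, RF Read, operand latches A and B, Zero?, DM, RF Write, and two instruction registers (IR0, IR1). Stages: IF, ID, EX, MEM, WB.]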

  • Pipeline Performance

    An unpipelined processor has a 1 ns clock cycle. ALU operations and branches take 4 cycles, and memory operations take 5 cycles. The relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, due to clock skew and setup, pipelining adds 0.2 ns of overhead to the clock. What is the speedup?

    Average instruction execution time = Clock cycle × Average CPI

    $\mathrm{CPI} = \sum_{i=1}^{n} \frac{IC_i}{\text{Instruction count}} \times \mathrm{CPI}_i$
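    The exercise above can be worked through numerically. The sketch below is one way to do it in Python; it assumes that the pipelined machine completes one instruction per cycle (CPI = 1) and that its cycle time is the unpipelined 1 ns plus the 0.2 ns overhead. The function and variable names are illustrative only.

```python
# Figures from the exercise: operation frequencies and unpipelined cycle counts.
FREQ = {"alu": 0.40, "branch": 0.20, "mem": 0.40}
CYCLES = {"alu": 4, "branch": 4, "mem": 5}


def average_cpi(freq, cycles):
    """Weighted CPI: sum over operation classes of (IC_i / IC) * CPI_i."""
    return sum(freq[k] * cycles[k] for k in freq)


def pipelining_speedup(clock_ns=1.0, overhead_ns=0.2):
    """Speedup = unpipelined time per instruction / pipelined time per instruction."""
    time_unpipelined = clock_ns * average_cpi(FREQ, CYCLES)
    # Assumption: the pipelined processor has CPI = 1, so its average
    # instruction time is just the (slower) pipelined clock cycle.
    time_pipelined = clock_ns + overhead_ns
    return time_unpipelined / time_pipelined


print(f"average CPI = {average_cpi(FREQ, CYCLES):.1f}")  # 4.4
print(f"speedup     = {pipelining_speedup():.2f}")       # 3.67
```

    Under these assumptions the unpipelined average instruction time is 4.4 ns, the pipelined one is 1.2 ns, and the speedup is about 3.7.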

  • Dependences

    Pipeline Hazards – Structural & Data

  • Outline
    ● Data dependences
    ● Name dependences
    ● Structural hazards
    ● Data hazards
      – Stalling, Forwarding

  • Basic Block
    ● A straight-line code sequence with no branches in except to the entry and no branches out except at the exit

    Loop: L.D F0, 0(R1)

    ADD.D F4, F0, F2

    S.D F4, 0(R1)

    DADDUI R1, R1, #-8

    BNE R1, R2, Loop

  • Dependence
    ● Name dependences
      – Register renaming
    ● Hazard
      – Overlap during execution could change the order of access to the operand involved in the dependence.

    for (i=0; i

  • Hazards
    ● Program Order
      – ILP preserves program order only where it affects the outcome of the program
    ● Structural Hazards
      – Resource conflicts
    ● Data Hazards
      – RAW, WAW, WAR
    ● Control Hazard
      – Whether or not an instruction should be executed depends on a control decision made by an earlier instruction
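    The three data hazard classes (RAW, WAW, WAR) can be told apart by comparing the registers two instructions read and write. The sketch below is illustrative only: the `Instr` record and the `data_hazards` function are hypothetical names, each instruction is assumed to write at most one register, and the result lists potential hazards. Whether one actually causes a problem depends on the pipeline.

```python
from typing import FrozenSet, NamedTuple, Optional, Set


class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, a set of sources."""
    dest: Optional[str]
    srcs: FrozenSet[str]


def data_hazards(first: Instr, second: Instr) -> Set[str]:
    """Classify potential data hazards when `second` follows `first`."""
    hazards = set()
    # RAW: the later instruction reads what the earlier one writes.
    if first.dest is not None and first.dest in second.srcs:
        hazards.add("RAW")
    # WAW: both instructions write the same register.
    if first.dest is not None and first.dest == second.dest:
        hazards.add("WAW")
    # WAR: the later instruction writes what the earlier one reads.
    if second.dest is not None and second.dest in first.srcs:
        hazards.add("WAR")
    return hazards


# R1 <- R2 + R3 followed by R4 <- R1 + R5
i1 = Instr("R1", frozenset({"R2", "R3"}))
i2 = Instr("R4", frozenset({"R1", "R5"}))
print(data_hazards(i1, i2))  # {'RAW'}
```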

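  • Structural Hazard

    Cycle   1    2    3    4    5    6    7    8    9
    i1      MEM  ID   EX   MEM  WB
    i2           MEM  ID   EX   MEM  WB
    i3                MEM  ID   EX   MEM  WB
    i4                     MEM  ID   EX   MEM  WB
    i5                          MEM  ID   EX   MEM  WB
    ...

    HAZARD: the first MEM in each row is the instruction fetch from the unified memory, so a fetch and a data access can fall in the same cycle (for example, i1 and i4 in cycle 4).

    ● Unified memory example
    ● Register file: WB, ID example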

  • Cost of a Load Structural Hazard
    ● Data references constitute 40% of the instruction mix. Ideal CPI = 1 (with no structural hazards). Assume that the processor with the structural hazard has a clock rate that is 1.1 times higher than the clock rate of the processor without the hazard. Which processor is faster, and by how much?

    $\text{Avg. instruction time} = \mathrm{CPI} \times \text{Clock cycle time}$

    $\text{Avg. instruction time}_{ideal} = \mathrm{CPI} \times \text{Clock cycle time}_{ideal}$

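  • Cost of a Load Structural Hazard

    $\text{Avg. instruction time} = \mathrm{CPI} \times \text{Clock cycle time}$

    $\text{Avg. instruction time} = (1 + 0.4 \times 1) \times \frac{\text{Clock cycle time}_{ideal}}{1.1}$

    $\text{Avg. instruction time} = 1.27 \times \text{Clock cycle time}_{ideal}$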
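    The same arithmetic can be checked with a short Python sketch. It assumes, as the calculation above does, that each data reference costs one stall cycle on the machine with the hazard. Times are expressed relative to the ideal machine's clock cycle, and the function name is illustrative.

```python
def relative_instruction_time(ideal_cpi, data_ref_fraction,
                              stall_cycles, clock_rate_ratio):
    """Average instruction time in units of the ideal machine's cycle time.

    clock_rate_ratio is how much faster the hazard machine's clock runs
    (1.1 in the exercise), so its cycle time is 1 / clock_rate_ratio.
    """
    cpi_with_hazard = ideal_cpi + data_ref_fraction * stall_cycles
    return cpi_with_hazard / clock_rate_ratio


t_hazard = relative_instruction_time(1.0, 0.4, 1, 1.1)
t_ideal = 1.0  # CPI = 1 at the ideal clock cycle time

print(f"with hazard: {t_hazard:.2f} x ideal cycle time")  # 1.27
print(f"ideal machine is {t_hazard / t_ideal:.2f}x faster")
```

    Since 1.27 is greater than 1, the processor without the structural hazard is the faster one, by a factor of about 1.27, despite its slower clock.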

  • Data Hazards

    [Figure: MIPS five-stage datapath (PC, IM, Regs, sign extend, ALU, DM, LMD, multiplexers) with the instruction register (IR) carried through the pipeline stages.]

    R1 ← R2 + R3

    R4 ← R1 + R5

    R1 is updated in the WB stage.

  • Stalled Stages and Pipeline Bubbles

    R1 ← R2 + R3

    R4 ← R1 + R5

    [Figure: two pipeline diagrams over time (clock cycles) for instructions I1 to I5 through stages IF, ID, EX, MA, WB. The dependent instruction repeats ID and the following instruction repeats IF (the stalled stages) while nops (bubbles) pass through EX, MA, and WB.]

    How to overcome this hazard?

  • Resolving Data Hazards
    ● Stalling one of the instructions
    ● Data Forwarding (Bypassing)
    ● Scheduling hazardous instructions away from each other

  • Stalling (Interlocking)

    [Figure: MIPS five-stage datapath with the IR carried through the pipeline stages. A stall condition detected for the pair below inserts a NOP into the pipeline.]

    R1 ← R2 + R3

    R4 ← R1 + R5

  • Pipeline Performance

    $\text{Speedup}_{pipelining} = \frac{\text{Pipeline depth}}{1 + \text{Stall cycles per instruction}}$

    $\text{Speedup}_{pipelining} = \frac{\mathrm{CPI}_{unpipelined}}{\mathrm{CPI}_{pipelined}}$
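    A small Python sketch of the first formula shows how stalls erode the ideal speedup. It assumes an ideal pipelined CPI of 1, perfectly balanced stages, and no clock overhead from pipelining. The 0.4 stalls per instruction in the second call is an arbitrary illustrative value, not a figure from the slides.

```python
def pipeline_speedup(depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + stall cycles per instruction).

    Assumes an ideal CPI of 1, balanced stages, and no pipelining overhead.
    """
    if stall_cycles_per_instruction < 0:
        raise ValueError("stall cycles per instruction must be non-negative")
    return depth / (1.0 + stall_cycles_per_instruction)


print(pipeline_speedup(5, 0.0))            # 5.0: no stalls, speedup = depth
print(f"{pipeline_speedup(5, 0.4):.2f}")   # 3.57
```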

  • Forwarding

    DADD R1, R2, R3
    DSUB R4, R1, R5
    AND  R6, R1, R7
    OR   R8, R1, R9
    XOR  R10, R1, R11

    [Figure: pipeline diagram over time (clock cycles) showing DADD, DSUB, and AND passing through IM, REG, ALU, DM, REG.]

  • Forwarding

    R1 ← R2 + R3

    R4 ← R1 + R5

    [Figure: pipeline diagrams over time (clock cycles), stages IF, ID, EX, MA, WB.
     Before bypassing: the dependent instruction repeats ID and the next instruction repeats IF (stalled stages), so CPI > 1.
     After bypassing: no stalled stages, so CPI = 1.]

  • Cost of Forwarding
    ● In longer pipelines?
    ● In multiple issue pipelines?
    ● Have all the dependences been resolved?

  • Forwarding
    ● Forwarding cannot solve all data dependence problems

    LD  R2, 4(R1)
    ADD R4, R2, R3

    [Figure: pipeline diagram over time (clock cycles) showing LD and ADD passing through IM, REG, ALU, DM, REG.]

  • Forwarding - Stall Condition
    ● Forwarding cannot solve all data dependence problems

    LD  R2, 4(R1)
    ADD R4, R2, R3

    [Figure: pipeline diagram over time (clock cycles) showing LD and ADD passing through IM, REG, ALU, DM, REG, with ADD held in REG for one extra cycle (STALL).]
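    The load followed by a dependent ADD is the case that forwarding alone cannot cover. A minimal Python sketch of the check follows, assuming a classic five-stage pipeline with full forwarding in which loaded data becomes available only after the memory stage, so only the instruction immediately after a load needs one stall cycle. `Instr`, `LOAD_OPS`, and `needs_load_use_stall` are hypothetical names for illustration.

```python
from typing import NamedTuple, Tuple

# Assumption: these opcodes are treated as loads.
LOAD_OPS = {"LD", "LW", "L.D"}


class Instr(NamedTuple):
    """Hypothetical instruction record."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def needs_load_use_stall(producer: Instr, consumer: Instr) -> bool:
    """True if `consumer`, issued right after `producer`, must stall one cycle.

    With forwarding, an ALU result can be bypassed straight to the next
    instruction, but a loaded value is not ready in time for it.
    """
    return producer.op in LOAD_OPS and producer.dest in consumer.srcs


ld = Instr("LD", "R2", ("R1",))           # LD  R2, 4(R1)
add = Instr("ADD", "R4", ("R2", "R3"))    # ADD R4, R2, R3
print(needs_load_use_stall(ld, add))      # True

dadd = Instr("DADD", "R1", ("R2", "R3"))  # DADD R1, R2, R3
dsub = Instr("DSUB", "R4", ("R1", "R5"))  # DSUB R4, R1, R5
print(needs_load_use_stall(dadd, dsub))   # False: forwarding suffices
```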

  • Instruction Level Parallelism

    Static Scheduling

  • Outline
    ● ILP
    ● Multicycle instructions
    ● Loop unrolling, scheduling
    ● Superscalar pipelines

  • ILP
    ● Instruction-level parallelism: overlap among instructions, through pipelining or multiple instruction execution
    ● What determines the degree of ILP?
      – dependences: property of the program
      – hazards: property of the pipeline

  • Pipeline Scheduling
    ● Reorder instructions so that dependent instructions are far enough apart
    ● Done by the compiler, before the program runs:
      – Static Instruction Scheduling
    ● Done by the hardware, when the program is running:
      – Dynamic Instruction Scheduling

  • Static vs. Dynamic Scheduling
    ● Dynamic scheduling:
      – requires complex structures to identify independent instructions (scoreboards, issue queue)
      – high power consumption
      – low clock speed
      – high design and verification effort
    ● Static: Compiler can compute instruction latencies and dependences

  • Pipeline Scheduling

    Original Program            Scheduled Code
    LW   R3, 0(R1)              LW   R3, 0(R1)
    stall                       LW   R13, 0(R11)
    ADDI R5, R3, 1              ADDI R5, R3, 1
    ADD  R2, R2, R3             ADD  R2, R2, R3
    LW   R13, 0(R11)            ADD  R12, R13, R3
    stall
    ADD  R12, R13, R3

    Total Execution Cycles: 7   Total Execution Cycles: 5
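    The cycle counts for the two orderings can be reproduced with a short Python sketch. It assumes that the only stall source is a load immediately followed by an instruction that reads the loaded register (one stall cycle), and it counts issue slots, as the slide does, rather than pipeline fill and drain time. The tuple encoding and the `issue_cycles` function are illustrative only.

```python
# Each instruction is encoded as (opcode, destination, source registers).
ORIGINAL = [
    ("LW",   "R3",  ("R1",)),
    ("ADDI", "R5",  ("R3",)),
    ("ADD",  "R2",  ("R2", "R3")),
    ("LW",   "R13", ("R11",)),
    ("ADD",  "R12", ("R13", "R3")),
]

SCHEDULED = [
    ("LW",   "R3",  ("R1",)),
    ("LW",   "R13", ("R11",)),
    ("ADDI", "R5",  ("R3",)),
    ("ADD",  "R2",  ("R2", "R3")),
    ("ADD",  "R12", ("R13", "R3")),
]


def issue_cycles(program):
    """Count issue slots: one per instruction plus one per load-use stall."""
    cycles = 0
    prev = None
    for instr in program:
        _, _, srcs = instr
        # Assumption: a load's result cannot feed the very next instruction.
        if prev is not None and prev[0] == "LW" and prev[1] in srcs:
            cycles += 1  # stall (bubble)
        cycles += 1
        prev = instr
    return cycles


print(issue_cycles(ORIGINAL))   # 7
print(issue_cycles(SCHEDULED))  # 5
```

    Moving the second load up separates each load from its first consumer, which removes both stalls without changing what the program computes.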

  • References
    ● HP5e. Chapter 3 – Instruction-Level Parallelism and Its Exploitation.
    ● HP5e. Appendix A – Instruction Set Principles.
    ● HP5e. Appendix C – Pipelining: Basic and Intermediate Concepts.
    ● HP5e. Appendix H – Hardware and Software for VLIW and EPIC.
