EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University EE382A – Autumn 2009 John P Shen Lecture 7- 1 Stanford University http://eeclass.stanford.edu/ee382a



  • EE382A Lecture 7: Dynamic Scheduling

    Department of Electrical Engineering, Stanford University

    EE382A – Autumn 2009, John P Shen

    http://eeclass.stanford.edu/ee382a

  • Announcements

    • Project proposal due on Wed 10/14
      – 2-3 pages submitted through email
      – List the group members
      – Describe the topic, including why it is important, and your thesis
      – Describe the methodology you will use (experiments, tools, machines)
      – Statement of expected results
      – Few key references to related work

  • What Limits ILP

    INSTRUCTION PROCESSING CONSTRAINTS

    • Resource Contention (Structural Dependences)
    • Code Dependences
      – Control Dependences
      – Data Dependences
        • True Dependences (RAW)
        • Storage Conflicts
          – Anti-Dependences (WAR)
          – Output Dependences (WAW)

  • The Reason for WAW and WAR: Register Recycling

    COMPILER REGISTER ALLOCATION

    • CODE GENERATION: Single Assignment, Symbolic Reg.
    • REG. ALLOCATION: Map Symbolic Reg. to Physical Reg.; Maximize Reuse of Reg.

    INSTRUCTION LOOPS

    [Figure: example loop code; only these fragments survive extraction]

    9 $34: mul $14 $7, 40
    10 addu $15, $4, $14
    11 l $24 $9 4

    For (k=1;k

  • Resolving False Dependences

    WAR: must prevent (2) from completing before (1) is dispatched

        (1) R4 ← R3 + 1
        •••
        (2) R3 ← R5 + 1

    WAW: must prevent (2) from completing before (1) completes

        (1) R3 ← R3 + R5
        ••• ← R3
        •••
        (2) R3 ← R5 + 1

    • Stalling: delay dispatching (or write back) of the later instruction
    • Copy Operands: copy not-yet-used operand to prevent being overwritten (WAR)
    • Register Renaming: use a different register (WAW & WAR)

  • Register Renaming: The Idea

    • Anti and output dependences are false dependences

        r3 ← r1 op r2
        r5 ← r3 op r4
        r3 ← r6 op r7

    • The dependence is on name/location rather than data

    • Given unlimited number of registers, anti and output dependences can always be eliminated

        Original            Renamed
        r1 ← r2 / r3        r1 ← r2 / r3
        r4 ← r1 * r5        r4 ← r1 * r5
        r1 ← r3 + r6        r8 ← r3 + r6
        r3 ← r1 - r4        r9 ← r8 - r4
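The renaming idea above can be sketched in a few lines of Python. This is a minimal illustration, not the scheme of any particular processor: the function name `rename`, the identity initial mapping, and the register counts are all assumptions. Unlike the slide, which renames only the conflicting destinations, this sketch gives every destination a fresh physical register, which is how a map-table renamer typically behaves.

```python
def rename(instrs, num_arch=8, num_phys=16):
    """Rename (dest, src1, src2) triples of architectural register numbers.

    Hypothetical sketch: architectural register rN starts out mapped to
    physical register N; physical registers num_arch..num_phys-1 are free.
    """
    map_table = {r: r for r in range(num_arch)}    # architectural -> physical
    free_list = list(range(num_arch, num_phys))
    renamed = []
    for dest, s1, s2 in instrs:
        ps1, ps2 = map_table[s1], map_table[s2]    # read sources before remapping dest
        if not free_list:
            raise RuntimeError("stall: no free physical register")
        pd = free_list.pop(0)                      # fresh register for every write
        map_table[dest] = pd
        renamed.append((pd, ps1, ps2))
    return renamed, map_table


# The four-instruction example from the slide: r1<-r2/r3, r4<-r1*r5, r1<-r3+r6, r3<-r1-r4
original = [(1, 2, 3), (4, 1, 5), (1, 3, 6), (3, 1, 4)]
renamed, final_map = rename(original)
```

After renaming, no two instructions write the same physical register, so only the true (RAW) dependences remain.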

  • Register Renaming Technique

    Register Renaming Resolves: Anti-Dependences, Output Dependences

    Design of Redundant Registers:

    • Number: One or Multiple
    • Allocation: Fixed for Each Register, or Pooled for all Registers
    • Location: Attached to Register File (Centralized), or Attached to functional units (Distributed)

    [Figure: Architected Registers R1, R2, ••• Rn mapped onto Physical Registers P1, P2, ••• Pn, ••• Pn + k]

  • Integrating Map Tables with the ARF


  • Register Renaming Operations

    • At Decode/Dispatch: for each instruction handled in parallel
      1. Source Read: Check availability of source operands
      2. Destination Allocate: Map destination register to new physical register
         • Stall if no register available
      – Note: must have enough ports to any map tables

    • At Finish:
      3. Register Update: update physical register

    • At Complete/Commit: for each instruction handled in parallel
      3. Register Update: update architectural register
         – Copy from RRF/ROB to ARF & deallocate RRF entry; OR
         – Upgrade physical location and deallocate register with old value
           • It is now safe to do that

    • Question: can we allocate later or deallocate earlier?
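As a rough sketch of these operations for the separate-RRF variant, the Python class below models source read, destination allocate, and the two register-update steps. The class name `RenameUnit`, its fields, and the sizes are hypothetical, chosen only for illustration; it also assumes instructions commit in program order.

```python
class RenameUnit:
    """Toy ARF + RRF renamer; names and sizes are illustrative assumptions."""

    def __init__(self, num_arch=8, rrf_size=4):
        self.arf = [0] * num_arch        # committed (architectural) values
        self.map = [None] * num_arch     # arch reg -> RRF tag, None if ARF is current
        self.rrf = [None] * rrf_size     # entry dict, or None when free

    def source_read(self, reg):
        """Step 1: return ('value', v) if ready, else ('tag', rrf_index)."""
        tag = self.map[reg]
        if tag is None:
            return ("value", self.arf[reg])
        entry = self.rrf[tag]
        return ("value", entry["value"]) if entry["valid"] else ("tag", tag)

    def dest_allocate(self, reg):
        """Step 2: map reg to a free RRF entry; None signals a stall."""
        for tag, entry in enumerate(self.rrf):
            if entry is None:
                self.rrf[tag] = {"dest": reg, "value": None, "valid": False}
                self.map[reg] = tag
                return tag
        return None

    def finish(self, tag, value):
        """Step 3 at finish: write the result into the rename register."""
        self.rrf[tag]["value"] = value
        self.rrf[tag]["valid"] = True

    def commit(self, tag):
        """Step 3 at commit: copy RRF -> ARF and free the RRF entry."""
        entry = self.rrf[tag]
        self.arf[entry["dest"]] = entry["value"]
        if self.map[entry["dest"]] == tag:   # no younger rename of this register
            self.map[entry["dest"]] = None
        self.rrf[tag] = None


ru = RenameUnit()
t = ru.dest_allocate(3)         # rename r3
before = ru.source_read(3)      # value not produced yet, so a tag comes back
ru.finish(t, 42)
after = ru.source_read(3)       # value now available from the RRF
ru.commit(t)
committed = ru.source_read(3)   # value now read from the ARF
```

A consumer that receives a tag would wait in the scheduler until that tag is broadcast; a consumer that receives a value can proceed immediately.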

  • Renaming Operation


  • Renaming Buffer Options

    1. Unified/merged register file – MIPS R10K, Alpha 21264
       – Registers change role between architected and renamed

    2. Rename register file (RRF) – PA 8500, PPC 620
       – Holds new values until they are committed to ARF
       – Extra data transfer…

    3. Renaming in the ROB – Pentium III

    Note: can have a single scheme or separate schemes for integer/FP

  • Unified Register File: Physical Register FSM

  • Register Renaming in the IBM RS6000 FPU

    FPU Register Renaming…

  • Renaming Difficulties: Wide Instruction Issue

    • Need many ports in RFs and mapping tables

    • Instruction dependences during dispatching/issuing/committing
      – Must handle dependencies across instructions
      – E.g. add R1←R2+R3; sub R6←R1-R5
      – Implementation: use comparators, multiplexors, counters
        • Comparators: discover RAW dependencies
        • Multiplexors: generate right physical address (old or new allocation)
        • Counters: determine number of physical registers allocated
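To make the roles of the comparators, multiplexors, and counters concrete, here is a hedged Python sketch of renaming a whole dispatch group in one step. The function name `rename_group` and the data layout are assumptions; real hardware does the comparisons in parallel, while this code simply loops.

```python
def rename_group(group, map_table, free_list):
    """Rename a dispatch group of (dest, src1, src2) tuples in one 'cycle'.

    Every map-table lookup sees the table as it was before the group, so
    RAW dependences inside the group are patched by comparing each source
    against the destinations of older instructions in the same group.
    """
    if len(free_list) < len(group):       # counter: registers needed this cycle
        return None                       # stall the whole group
    new_regs = [free_list.pop(0) for _ in group]
    old_map = dict(map_table)             # snapshot read by all lookups
    renamed = []
    for i, (dest, s1, s2) in enumerate(group):
        srcs = []
        for s in (s1, s2):
            phys = old_map[s]             # mux default: old allocation
            for j in range(i):            # comparators against older group members
                if group[j][0] == s:
                    phys = new_regs[j]    # mux picks the new allocation; youngest match wins
            srcs.append(phys)
        renamed.append((new_regs[i], srcs[0], srcs[1]))
    for i, (dest, _, _) in enumerate(group):
        map_table[dest] = new_regs[i]     # later writers in the group win
    return renamed


# add R1<-R2+R3 ; sub R6<-R1-R5 dispatched together
table = {r: r for r in range(8)}
free = [8, 9, 10, 11]
out = rename_group([(1, 2, 3), (6, 1, 5)], table, free)
```

Here the second instruction's source R1 must pick up the physical register just allocated to the first instruction, not the stale entry in the map table.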

  • Renaming Difficulties: Mispredictions & Exceptions

    • If exception/misprediction occurs, register mapping must be precise

    • Separate RRF: consider all RRF entries free

    • ROB renaming: consider all ROB entries free

    • Unified RF: restore precise mapping
      – Single map: traverse ROB to undo mapping (history file approach)
        • ROB must remember old mapping…
      – Two maps: architectural and future register map
        • On exception, copy architectural map into future map…
      – Checkpointing: keep regular checkpoints of map, restore when needed
        • When do we make a checkpoint? On every instruction? On every branch?
        • What are the trade-offs?
        • We'll revisit this approach later on…
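The single-map (history file) and checkpointing options for a unified register file can be contrasted with a small Python sketch. All names here (`rename_dest`, `squash_after`, the tuple layout of ROB entries) are hypothetical; the point is only that walking the ROB backwards and restoring a checkpoint reach the same mapping.

```python
def rename_dest(map_table, free_list, rob, dest):
    """Allocate a new physical register and log the old mapping in the ROB."""
    new = free_list.pop(0)
    rob.append((dest, map_table[dest], new))   # (logical, old phys, new phys)
    map_table[dest] = new


def squash_after(map_table, free_list, rob, keep):
    """Undo every ROB entry beyond the first `keep`, youngest first."""
    while len(rob) > keep:
        dest, old, new = rob.pop()
        map_table[dest] = old                  # restore the previous mapping
        free_list.insert(0, new)               # the squashed register is free again


table = {1: 1, 2: 2}
free = [8, 9, 10]
rob = []
rename_dest(table, free, rob, 1)       # older than the branch
checkpoint = dict(table)               # checkpoint taken at the branch
rename_dest(table, free, rob, 2)       # wrong-path instruction
rename_dest(table, free, rob, 1)       # wrong-path instruction
squash_after(table, free, rob, keep=1) # branch mispredicted: undo the wrong path
```

The ROB walk costs time proportional to the number of squashed instructions, while the checkpoint restores in one step but costs storage per checkpoint, which is one way to read the trade-off question above.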

  • “Dataflow Engine” for Dynamic Execution

    [Figure: dataflow engine. Dispatch Buffer → Dispatch (Reg. File, Ren. Reg.) → Reservation Stations → functional units (Branch, Integer, Integer, Float.-Point, Load/Store) → Compl. Buffer (Reorder Buff.) → Complete; Reg. Write Back]

    • Dispatch
      - Read register or assign register tag
      - Advance instructions to reservation stations
      - Allocate Reorder Buffer entries

    • Reservation Stations
      - Monitor reg. tag
      - Receive data being forwarded
      - Issue when all operands ready

    • Forwarding: forward results to Res. Sta. & rename registers

    • Compl. Buffer (Reorder Buff.): managed as a queue; maintains sequential order of all instructions in flight ("takeoff" = dispatching; "landing" = completion)

  • Historical Background

    • Dynamic or Data-flow Scheduling:
      – Scheduling hardware allows an instruction to be executed as soon as its source operands are ready and a FU is available
      – Assuming renaming, only limited by RAW and structural hazards

    • First proposal: Tomasulo's algorithm in IBM 360/91 FPU (1967)
      – 1 instruction per cycle, distributed implementation, imprecise exceptions…

    • We will talk directly about modern implementations
      – Read the original in the textbook
      – Differences: renaming, precise exceptions, multiple instructions per cycle,

  • Steps in Dynamic Execution (1)

    • FETCH instruction (in-order, speculative)
      – I-cache access, predictions, insert in a fetch buffer

    • DISPATCH (in-order, speculative)
      – Read operands from Register File (ARF) and/or Rename Register File (RRF)
        • RRF may return a ready value or a Tag for a physical location
      – Allocate new RRF entry (rename destination register) for destination
      – Allocate Reorder Buffer (ROB) entry
      – Advance instruction to appropriate entry in the scheduling hardware
        • Typical name for centralized: issue queue or instruction window
        • Typical name for distributed: reservation stations

  • Steps in Dynamic Execution (2)

    • ISSUE & EXECUTE (out-of-order, speculative)
      – Scheduler entry monitors result bus for rename register Tag(s)
        • Find out if source operand becomes ready
      – When all operands ready, issue instruction into Functional Unit (FU) and deallocate scheduler entry (wake-up & select)
        • Subject to structural hazards & priorities
      – When execution finishes, broadcast result to waiting scheduler entries and RRF entry

    • COMMIT/RETIRE/GRADUATE (in-order, non-speculative)
      – When ready to commit result into "in-order" state (head of the ROB):
        • Update architectural register from RRF entry, deallocate RRF entry, and if it is a store instruction, advance it to Store Buffer
        • Deallocate ROB entry and instruction is considered architecturally completed
        • Update predictors based on instruction result
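The contrast between out-of-order finish and in-order commit can be shown with a tiny reorder buffer model in Python. The class `ReorderBuffer` and its entry fields are assumptions made for this sketch; a real ROB entry carries much more state.

```python
from collections import deque


class ReorderBuffer:
    """Minimal ROB: entries are allocated in program order at dispatch."""

    def __init__(self):
        self.entries = deque()            # leftmost entry = oldest instruction (head)

    def dispatch(self, name):
        entry = {"name": name, "finished": False}
        self.entries.append(entry)
        return entry

    def commit(self):
        """Retire finished instructions from the head; stop at the first unfinished one."""
        retired = []
        while self.entries and self.entries[0]["finished"]:
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
a, b, c = rob.dispatch("a"), rob.dispatch("b"), rob.dispatch("c")
c["finished"] = True          # youngest instruction finishes first
first = rob.commit()          # nothing retires: head is still executing
a["finished"] = True
second = rob.commit()         # only the head retires
b["finished"] = True
third = rob.commit()          # b retires, and c follows in the same call
```

Even though `c` finished first, it becomes architecturally visible last, which is what keeps the in-order state precise.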

  • Centralized Instruction Window or Issue Queue Implementation

    + info for executing instruction (e.g. opcode, ROB entry, RRF entry)

  • Instruction Window Source Operand Options

    • Option (a): read at dispatch and keep in the window

    • Option (b): read at issue

  • ROB Implementation


  • Example: MIPS R10000 circa 1996


  • R10000 Design Choices

    • Register Renaming
      – Map table lookup + dependency check on simultaneous dispatches
      – Unified physical register file
      – 4-deep branch stack to backup the map table on branch predictions
      – Sequential (4-at-a-time) back-tracking to recover from exceptions

    • Instruction Queues
      – Separate 16-entry floating point and integer instruction queues
      – Prioritized, dataflow-ordered scheduling

    • Reorder Buffer
      – One entry per outstanding instruction, FIFO ordered
      – Stores PC, logical destination number, old physical destination number
      – Why not current physical destination number?
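One way to think about the closing question is to look at what commit has to do in a unified register file. The Python sketch below is a simplified assumption-laden model, not the R10000's actual logic: the helper names, the dictionary-based ROB entries, and the register numbers are all made up for illustration.

```python
def dispatch(dest, map_table, free_list, rob):
    """Rename `dest`, recording the mapping it replaces in the ROB entry."""
    old = map_table[dest]
    new = free_list.pop(0)
    map_table[dest] = new
    rob.append({"logical": dest, "old_phys": old})
    return new


def commit(rob, free_list):
    """Retire the oldest instruction and recycle the register it superseded."""
    entry = rob.pop(0)
    free_list.append(entry["old_phys"])   # the previous value can no longer be needed
    return entry


table = {1: 1}             # logical r1 currently lives in physical p1
free = [32, 33]
rob = []
dispatch(1, table, free, rob)   # r1 -> p32, remembers p1
dispatch(1, table, free, rob)   # r1 -> p33, remembers p32
commit(rob, free)               # first write commits: p1 is recycled
free_after_first = list(free)
commit(rob, free)               # second write commits: p32 is recycled
```

In this model the current physical destination is never needed from the ROB: it stays live as the committed value, while the old one is exactly what commit frees and what recovery restores.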

  • R10000 Block Diagram


  • R10000 Instruction Fetch and Branch


  • R10000 Register Renaming


  • R10000 Pipelines


  • R10000 Integer Queue


  • Priority/Select Logic

    • Tree of arbiters that works in 2 phases

    • First phase
      – Request signals are propagated up the tree. Only ready instructions send requests
      – This in turn raises the ready signal of its parent arbiter cell. At the root cell one or more of the input request signals will be high if there are one or more instructions that are ready.
      – The root cell grants the functional unit to one of its children by raising one of its grant outputs.

    • Second phase
      – Grant signal is propagated down the tree to the instruction that is selected
      – The enable signal to the root cell is high whenever the functional unit is ready to execute an instruction.
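The two phases can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses 2-input arbiter cells, a power-of-two window size, and gives priority to the lower-numbered child, none of which is specified by the slide.

```python
def select(requests, enable=True):
    """Two-phase tree select; returns the granted slot index or None.

    requests[i] is True when window slot i holds a ready instruction.
    Assumes len(requests) is a power of two and lower index = higher priority.
    """
    # Phase 1: propagate requests up the tree (each cell ORs its two children).
    levels = [list(requests)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] or prev[i + 1] for i in range(0, len(prev), 2)])
    # The root grants only if the functional unit is ready and someone requested.
    if not (enable and levels[-1][0]):
        return None
    # Phase 2: propagate the grant down, picking one requesting child per cell.
    index = 0
    for level in reversed(levels[:-1]):
        left = 2 * index
        index = left if level[left] else left + 1
    return index


granted = select([False, False, False, True, False, True, False, False])
```

Because the grant has to travel up and then back down the tree, select delay grows with the logarithm of the window size.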

  • Priority/Select Logic Issues

    • Selection is easier if the priority depends on instruction location
      – Older instructions are at the bottom of window and receive priority

    • This creates an issue of compacting/collapsing:
      – As instructions depart, compress remaining towards the bottom
      – Younger instructions will be inserted towards the top (lower priority)

    • Compacting the window is not easy!
      – Its complexity can affect performance (clock frequency)
      – Often implemented in some restricted form
        • E.g. split window into two parts, allow compaction from 2nd half towards 1st

    • Trade-off between window utilization and compaction simplicity

  • Wake-up and Select Latency

    • Assume a result becomes available in cycle i
      – When can you start executing an instruction that waits for it?

    • Ideal solution: in cycle i+1
      – Back-to-back execution, just like with a 5-stage pipeline
      – Requirement: the following have to work in one cycle
        • Distribute result tag to the window & detect that instruction becomes ready
        • Select instruction for execution & forward its info/operands to FU
      – May stress clock cycle in wide processor

    • Alternative: split wake-up and select in separate cycles
      – Simpler hardware, faster clock cycle
      – Lower IPC (dependencies cost one extra cycle)
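The IPC cost of splitting wake-up and select shows up most clearly on a chain of dependent instructions. The Python function below is a back-of-the-envelope model, with the assumption that the first instruction starts in cycle 1 and that each dependent pays `extra_delay` cycles on top of the ideal i+1 start; the function name is hypothetical.

```python
def chain_completion(latencies, extra_delay):
    """Cycle in which the last result of a dependent chain becomes available.

    latencies: execution latency of each instruction in the chain.
    extra_delay: 0 for single-cycle wake-up + select, 1 if they are split.
    """
    start = 1
    result_cycle = 0
    for lat in latencies:
        result_cycle = start + lat - 1            # cycle i: result becomes available
        start = result_cycle + 1 + extra_delay    # earliest start of the dependent
    return result_cycle


ideal = chain_completion([1, 1, 1, 1], extra_delay=0)
split = chain_completion([1, 1, 1, 1], extra_delay=1)
```

For four single-cycle dependent operations the split design nearly doubles the chain's completion time in cycles, so it only pays off if the clock gets correspondingly faster or if independent work fills the gaps.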

  • Result Forwarding (Common Data Bus – CDB)

    • Common data bus: used to broadcast results of FUs

    • Broadcast destinations
      – RF or RRF or ROB, depending on the renaming scheme
      – Instruction window
        • May need result or tag for the result

    • Number of CDBs
      – Best case, 1 per functional unit
      – Can have fewer, but now we may have structural hazard

    • Notes:
      – CDBs can be slow as they go across large chip area
      – Broadcast tag early

  • Dynamic Scheduling Implementation Cost

    • To support N-way dispatch into IW per cycle
      – Nx2 simultaneous lookups into the rename map (or associative search)
      – N simultaneous write ports into the IW and the ROB

    • To support N-way issue per cycle (assuming read at issue)
      – 1 prioritized associative lookup of N entries
      – N read ports into the IW
      – Nx2 read ports into the RF

    • To support N-way finish per cycle
      – N write ports into the RF and the ROB
      – Nx2 associative lookup and write in IW

    • To support N-way retire per cycle
      – N read ports in the ROB
      – N ports into the RF (potentially)
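The counts above scale linearly with the machine width N. The short Python helper below just tabulates them for a given N; the function name, the dictionary keys, and the default of two source operands per instruction are assumptions made for this sketch.

```python
def scheduling_ports(n, sources_per_instr=2):
    """Tabulate the per-cycle port and lookup counts listed above for width n."""
    return {
        "dispatch": {
            "rename_map_lookups": n * sources_per_instr,
            "iw_write_ports": n,
            "rob_write_ports": n,
        },
        "issue": {
            "iw_read_ports": n,
            "rf_read_ports": n * sources_per_instr,
        },
        "finish": {
            "rf_write_ports": n,
            "rob_write_ports": n,
            "iw_associative_lookups": n * sources_per_instr,
        },
        "retire": {
            "rob_read_ports": n,
            "rf_ports_potential": n,
        },
    }


four_wide = scheduling_ports(4)
```

For a 4-wide machine the register file alone needs 8 read ports and 4 write ports, before counting the potential retire ports.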

  • Instruction Window Alternatives

    • Single vs. multiple buffers (trade-offs?)
      – Single centralized window
      – Single centralized window with static alignment for different FUs
      – Separate integer – FP – LSU windows
      – Separate buffers for each FU
        • Aka, reservation stations (see Tomasulo algorithm)

    • Management policies to keep in mind
      – Random access or FIFO
        • In-order vs out-of-order within each queue
      – Age-prioritized or criticality-based
      – Value vs. tag only
      – When to deallocate

    • Reservation stations for Ld/St units are more complicated

  • MIPS R10000
