Current Trends in Processor Design (Tendinte Actuale in Proiectarea SMPA)


  • 8/11/2019 Tendinte Actuale in Proiectarea SMPA

    1/98

    1

    Current and Future Trends in Processor Architecture

    Theo Ungerer, Borut Robic, Jurij Silc


    Tutorial Background Material

    Jurij Silc, Borut Robic, Theo Ungerer: Processor Architecture - From Dataflow to Superscalar and Beyond (Springer-Verlag, Berlin, Heidelberg, New York 1999).

    Book homepage: http://goethe.ira.uka.de/people/ungerer/proc-arch/

    Slide collection of tutorial slides: http://goethe.ira.uka.de/people/ungerer/

    Slide collection of book contents (in 15 lectures): http://goethe.ira.uka.de/people/ungerer/prozarch/prslides99-00.html


    Part I: State-of-the-art multiple-issue processors

    Superscalar

    Overview

    Superscalar in more detail

    Instruction Fetch and Branch Prediction

    Decode

    Rename

    Issue

    Dispatch

    Execution Units

    Completion

    Retirement

    VLIW/EPIC


    Multiple-issue Processors

    Today's microprocessors utilize instruction-level parallelism through a multi-stage instruction pipeline and through the superscalar or the VLIW/EPIC technique.

    Most of today's general-purpose microprocessors are four- or six-issue superscalars.

    VLIW (very long instruction word) is the choice for most signal processors.

    VLIW is enhanced to EPIC (explicitly parallel instruction computing) by HP/Intel for its IA-64 ISA.


    Instruction Pipelining


    Superscalar Pipeline

    [Figure: pipeline stages IF, ID and Rename, Issue from the instruction window, parallel EX units, Retire and Write-Back]

    Instructions in the instruction window are free from control dependences due to branch prediction, and free from name dependences due to register renaming.

    So, only (true) data dependences and structural conflicts remain to be solved.


    Superscalar vs. VLIW

    Superscalar and VLIW: more than a single instruction can be issued to the execution units per cycle.

    Superscalar machines are able to dynamically issue multiple instructions each clock cycle from a conventional linear instruction stream.

    VLIW processors use a long instruction word that contains a usually fixed number of instructions that are fetched, decoded, issued, and executed synchronously.

    Superscalar: dynamic issue; VLIW: static issue


    Sections of a Superscalar Pipeline

    The ability to issue and execute instructions out-of-order partitions a superscalar pipeline into three distinct sections:

    an in-order section with the instruction fetch, decode, and rename stages - the issue is also part of the in-order section in case of an in-order issue,

    an out-of-order section starting with the issue in case of an out-of-order issue processor, the execution stage, and usually the completion stage, and again

    an in-order section that comprises the retirement and write-back stages.


    Components of a Superscalar Processor

    [Figure: block diagram with I-cache, instruction fetch unit (with BTAC and BHT), branch unit, instruction buffer, instruction decode and register rename unit, instruction issue unit, reorder buffer, retire unit, load/store unit, integer unit(s), floating-point unit(s), rename registers, general-purpose registers, floating-point registers, D-cache, MMUs, and a bus interface unit with 32 (64)-bit data and address buses plus a control bus]


    Branch-Target Buffer or Branch-Target Address Cache

    The Branch Target Buffer (BTB) or Branch-Target Address Cache (BTAC) stores branch and jump addresses, their target addresses, and optionally prediction information.

    The BTB is accessed during the IF stage.

    Branch address | Target address | Prediction bits
    ...            | ...            | ...
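    The lookup described above can be sketched as a small direct-mapped BTAC in C. This is a minimal sketch: the table size, field widths, and function names are illustrative assumptions, not taken from any concrete processor.

```c
#include <assert.h>
#include <stdint.h>

#define BTAC_ENTRIES 64   /* illustrative size */

typedef struct {
    int      valid;
    uint32_t branch_addr;   /* address of the branch instruction */
    uint32_t target_addr;   /* stored target address */
    uint8_t  pred_bits;     /* optional prediction information */
} btac_entry_t;

static btac_entry_t btac[BTAC_ENTRIES];

/* Index with low-order bits of the (word-aligned) branch address. */
static unsigned btac_index(uint32_t addr) { return (addr >> 2) % BTAC_ENTRIES; }

void btac_insert(uint32_t branch, uint32_t target, uint8_t pred) {
    btac_entry_t *e = &btac[btac_index(branch)];
    e->valid = 1;
    e->branch_addr = branch;
    e->target_addr = target;
    e->pred_bits = pred;
}

/* During the IF stage: returns 1 and the target if the fetch address hits. */
int btac_lookup(uint32_t fetch_addr, uint32_t *target) {
    btac_entry_t *e = &btac[btac_index(fetch_addr)];
    if (e->valid && e->branch_addr == fetch_addr) {
        *target = e->target_addr;
        return 1;
    }
    return 0;
}
```

    A real BTAC would also handle aliasing (tags wider than the index) and entry replacement; the direct-mapped scheme above keeps only the core idea of a target-address cache probed in the IF stage.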


    Branch Prediction

    Branch prediction foretells the outcome of conditional branch instructions.

    Excellent branch handling techniques are essential for today's and for future microprocessors.

    Requirements of high-performance branch handling:

    an early determination of the branch outcome (the so-called branch resolution),

    buffering of the branch target address in a BTAC,

    an excellent branch predictor (i.e. branch prediction technique) and speculative execution mechanism,

    often another branch is predicted while a previous branch is still unresolved, so the processor must be able to pursue two or more speculation levels,

    and an efficient rerolling mechanism when a branch is mispredicted (minimizing the branch misprediction penalty).


    Misprediction Penalty

    The performance of branch prediction depends on the prediction accuracy and the cost of misprediction.

    The misprediction penalty depends on many organizational features:

    the pipeline length (favoring shorter pipelines over longer pipelines),

    the overall organization of the pipeline,

    whether misspeculated instructions can be removed from internal buffers, or have to be executed and can only be removed in the retire stage,

    the number of speculative instructions in the instruction window or the reorder buffer. Typically only a limited number of instructions can be removed each cycle.

    Misprediction is expensive:

    4 to 9 cycles in the Alpha 21264,

    11 or more cycles in the Pentium II.


    Static Branch Prediction

    Static branch prediction always predicts the same direction for the same branch during the whole program execution.

    It comprises hardware-fixed prediction and compiler-directed prediction.

    Simple hardware-fixed direction mechanisms can be:

    predict always not taken,

    predict always taken,

    backward branch predict taken, forward branch predict not taken.

    Sometimes a bit in the branch opcode allows the compiler to decide the prediction direction.
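    The backward-taken/forward-not-taken heuristic above is simple enough to express directly. A minimal sketch, assuming the predictor sees only the branch address and its target; the function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Backward-taken/forward-not-taken (BTFN): a branch whose target lies at
 * a lower address than the branch itself is assumed to close a loop and
 * is therefore predicted taken; forward branches are predicted not taken. */
int btfn_predict_taken(uint32_t branch_addr, uint32_t target_addr) {
    return target_addr < branch_addr;   /* backward branch -> predict taken */
}
```

    The heuristic works because loop-closing branches are backward and are taken on every iteration but the last.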


    Dynamic Branch Prediction

    Dynamic branch prediction: the hardware influences the prediction while execution proceeds.

    Prediction is decided on the computation history of the program.

    During the start-up phase of the program execution, where a static branch prediction might be effective, the history information is gathered and dynamic branch prediction becomes effective.

    In general, dynamic branch prediction gives better results than static branch prediction, but at the cost of increased hardware complexity.


    One-bit Predictor

    [Figure: two-state diagram; in state "Predict Taken" a taken branch (T) keeps the state and a not-taken branch (NT) switches to "Predict Not Taken", and vice versa]


    One-bit vs. Two-bit Predictors

    A one-bit predictor correctly predicts a branch at the end of a loop iteration, as long as the loop does not exit.

    In nested loops, a one-bit prediction scheme will cause two mispredictions for the inner loop:

    one at the end of the loop, when the iteration exits the loop instead of looping again, and

    one when executing the first loop iteration, when it predicts exit instead of looping.

    Such a double misprediction in nested loops is avoided by a two-bit predictor scheme.

    Two-bit prediction: a prediction must miss twice before it is changed when a two-bit prediction scheme is applied.


    Two-bit Predictors (Saturation Counter Scheme)

    [Figure: four-state saturating counter; (11) Predict Strongly Taken, (10) Predict Weakly Taken, (01) Predict Weakly Not Taken, (00) Predict Strongly Not Taken; a taken branch (T) moves the state toward (11), a not-taken branch (NT) moves it toward (00)]
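    The four states in the figure map directly onto a 2-bit saturating counter. A minimal sketch in C; the type and function names are illustrative, but the state encoding (00 through 11) follows the figure.

```c
#include <assert.h>

/* Two-bit saturating counter:
 * 0 = strongly not taken, 1 = weakly not taken,
 * 2 = weakly taken,       3 = strongly taken.
 * Predict taken when the counter is in the upper half. */
typedef struct { int counter; } twobit_t;

int twobit_predict(const twobit_t *p) { return p->counter >= 2; }

/* Saturating update: move toward 3 on taken, toward 0 on not taken. */
void twobit_update(twobit_t *p, int taken) {
    if (taken) { if (p->counter < 3) p->counter++; }
    else       { if (p->counter > 0) p->counter--; }
}
```

    Starting from a strong state, a single misprediction only weakens the counter; the predicted direction changes only after the second consecutive miss, which is exactly the "must miss twice" property described on the previous slide.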


    Two-bit Predictors

    Two-bit predictors can be implemented in the Branch Target Buffer (BTB) by assigning two state bits to each entry in the BTB.

    Another solution is to use a BTB for target addresses and a separate Branch History Table (BHT) as prediction buffer.

    A misprediction in the BHT occurs for two reasons:

    either a wrong guess for that branch,

    or the branch history of a wrong branch is used because of the way the table is indexed.

    In an indexed table lookup, part of the instruction address is used as index to identify a table entry.


    Two-bit Predictors and Correlation-based Prediction

    Two-bit predictors work well for programs which contain many frequently executed loop-control branches (floating-point intensive programs).

    Shortcomings arise from dependent (correlated) branches, which are frequent in integer-dominated programs.

    Example:

    if (d == 0)      /* branch b1 */
        d = 1;
    if (d == 1)      /* branch b2 */
        ...


    Predictor Behavior in Example

    A one-bit predictor initialized to "predict taken" for branches b1 and b2:

    every branch is mispredicted.

    A two-bit predictor of the saturation counter scheme starting from the state "predict weakly taken":

    every branch is mispredicted.

    The two-bit predictor of the hysteresis scheme mispredicts every second execution of b1 and b2.

    A (1,1) correlating predictor takes advantage of the correlation of the two branches; it mispredicts only in the first iteration when d = 2.


    Correlation-based Predictor

    The two-bit predictor scheme uses only the recent behavior of a single branch to predict the future of that branch.

    Correlations between different branch instructions are not taken into account.

    Correlation-based predictors or correlating predictors additionally use the behavior of other branches to make a prediction.

    While two-bit predictors use self-history only, the correlating predictor additionally uses neighbor history.

    Notation: an (m,n)-correlation-based predictor or (m,n)-predictor uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.

    Branch history register (BHR): the global history of the most recent m branches can be recorded in an m-bit shift register where each bit records whether the branch was taken or not taken.


    Correlation-based Prediction: (2,2)-predictor

    [Figure: bits of the branch address index a row of pattern history tables (PHTs) of 2-bit predictors; the branch history register (BHR), a 2-bit shift register, selects which of the four PHTs to use]


    Two-level Adaptive Predictors

    Developed by Yeh and Patt at the same time (1992) as the correlation-based prediction scheme.

    The basic two-level predictor uses a single global branch history register (BHR) of k bits to index into a pattern history table (PHT) of 2-bit counters.

    Global history schemes correspond to correlation-based predictor schemes.

    Example for the notation GAg:

    a single global BHR (denoted G) and

    a single global PHT (denoted g); A stands for adaptive.

    All PHT implementations of Yeh and Patt use 2-bit predictors.

    A GAg-predictor with a 4-bit BHR length is denoted as GAg(4).


    Implementation of a GAg(4)-predictor

    [Figure: a 4-bit branch history register (BHR), e.g. 1100, is shifted with each branch outcome and indexes the branch pattern history table (PHT); the selected 2-bit counter, e.g. 11, yields the prediction "taken"]

    In the GAg predictor schemes the PHT lookup depends entirely on the bit pattern in the BHR and is completely independent of the branch address.
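    A GAg(4) predictor as in the figure can be sketched in a few lines of C: one global 4-bit BHR indexing one global PHT of 2^4 = 16 two-bit saturating counters. The variable and function names are illustrative; the structure follows the Yeh/Patt scheme described above.

```c
#include <assert.h>

#define BHR_BITS 4
#define PHT_SIZE (1 << BHR_BITS)   /* 16 entries for GAg(4) */

static unsigned bhr;               /* single global branch history register */
static unsigned pht[PHT_SIZE];     /* 2-bit saturating counters, start at 0 */

/* The lookup uses only the BHR pattern, never the branch address. */
int gag_predict(void) { return pht[bhr] >= 2; }

void gag_update(int taken) {
    unsigned *c = &pht[bhr];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    /* shift the actual outcome into the history register */
    bhr = ((bhr << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1);
}
```

    Because the branch address is ignored, different branches that happen to produce the same history pattern share a PHT counter; the per-address and per-set variants on the next slides address exactly this aliasing.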


    Variations of Two-level Adaptive Predictors

    Mispredictions can be reduced by additionally using:

    the full branch address to distinguish multiple PHTs (called per-address PHTs),

    a subset of branches (e.g. n bits of the branch address) to distinguish multiple PHTs (called per-set PHTs),

    the full branch address to distinguish multiple BHRs (called per-address BHRs),

    a subset of branches to distinguish multiple BHRs (called per-set BHRs),

    or a combination scheme.


    Two-level Adaptive Predictors

                         single global PHT   per-set PHTs   per-address PHTs
    single global BHR    GAg                 GAs            GAp
    per-address BHRs     PAg                 PAs            PAp
    per-set BHRs         SAg                 SAs            SAp


    Hybrid Predictors

    The second strategy of McFarling is to combine multiple separate branch predictors, each tuned to a different class of branches.

    Two or more predictors and a predictor selection mechanism are necessary in a combining or hybrid predictor.

    McFarling: combination of a two-bit predictor and a gshare two-level adaptive predictor,

    Young and Smith: a compiler-based static branch prediction with a two-level adaptive type,

    and many more combinations!

    Hybrid predictors are often better than single-type predictors.
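    The selection mechanism can be sketched as a 2-bit chooser counter in the style of McFarling's combining predictor. This is a simplified illustration: the two component predictions are passed in as plain values (in a real design they would come from, e.g., a bimodal predictor and gshare), and all names are assumptions.

```c
#include <assert.h>

/* 2-bit chooser: values >= 2 mean "trust component B", else "trust A". */
static unsigned chooser = 2;

int hybrid_predict(int pred_a, int pred_b) {
    return (chooser >= 2) ? pred_b : pred_a;
}

/* Train the chooser only when the components disagree:
 * move toward whichever component predicted the actual outcome. */
void hybrid_update(int pred_a, int pred_b, int taken) {
    if (pred_a == pred_b) return;
    if (pred_b == taken) { if (chooser < 3) chooser++; }
    else                 { if (chooser > 0) chooser--; }
}
```

    A full combining predictor keeps one such chooser counter per table entry, so different branches can favor different component predictors.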


    Simulations of Grunwald 1998

    SAg, gshare, and McFarling's combining predictor for some SPECmarks


    Results

    A simulation by Keeton et al. 1998 using an OLTP (online transaction processing) workload on a PentiumPro multiprocessor reported a misprediction rate of 14% with a branch instruction frequency of about 21%.

    Two different conclusions may be drawn from these simulation results:

    branch predictors should be further improved,

    and/or branch prediction is only effective if the branch is predictable.

    If a branch outcome is dependent on irregular data inputs, the branch often shows an irregular behavior.

    Question: Confidence of a branch prediction?


    Predicated Instructions

    A method to remove branches.

    Predicated or conditional instructions use one or more predicate registers; the predicate register serves as an additional input operand.

    The Boolean result of a condition test is recorded in a (one-bit) predicate register.

    Predicated instructions are fetched, decoded, and placed in the instruction window like non-predicated instructions.

    It depends on the processor architecture how far a predicated instruction proceeds speculatively in the pipeline before its predication is resolved:

    A predicated instruction executes only if its predicate is true; otherwise the instruction is discarded.

    Alternatively, the predicated instruction may be executed, but commits only if the predicate is true; otherwise the result is discarded.


    Predication Example

    if (x == 0) {    /* branch b1 */
        a = b + c;
        d = e - f;
    }
    g = h * i;       /* instruction independent of branch b1 */

    Pred = (x == 0);            /* branch b1: Pred is set to true if x equals 0 */
    if Pred then a = b + c;     /* the operations are only performed */
    if Pred then d = e - f;     /* if Pred is set to true */
    g = h * i;


    Predication

    + Able to eliminate a branch and therefore the associated branch prediction, increasing the distance between mispredictions.

    + The run length of a code block is increased, which enables better compiler scheduling.

    - Predication affects the instruction set, adds a port to the register file, and complicates instruction execution.

    - Predicated instructions that are discarded still consume processor resources, especially the fetch bandwidth.

    Predication is most effective when control dependences can be completely eliminated, such as in an if-then with a small then body.

    The use of predicated instructions is limited when the control flow involves more than a simple alternative sequence.


    Eager (Multipath) Execution

    Execution proceeds down both paths of a branch, and no prediction is made.

    When a branch resolves, all operations on the non-taken path are discarded.

    With limited resources, the eager execution strategy must be employed carefully.

    A mechanism is required that decides when to employ prediction and when eager execution: e.g. a confidence estimator.

    Rarely implemented (IBM mainframes), but some research projects: Dansoft processor, Polypath architecture, selective dual path execution, simultaneous speculation scheduling, disjoint eager execution.


    Branch handling techniques and implementations

    Technique                              Implementation examples
    No branch prediction                   Intel 8086
    Static prediction:
      always not taken                     Intel i486
      always taken                         Sun SuperSPARC
      backward taken, forward not taken    HP PA-7x00
      semistatic with profiling            early PowerPCs
    Dynamic prediction:
      1-bit                                DEC Alpha 21064, AMD K5
      2-bit                                PowerPC 604, MIPS R10000, Cyrix 6x86 & M2, NexGen 586
      two-level adaptive                   Intel PentiumPro, Pentium II, AMD K6
    Hybrid prediction                      DEC Alpha 21264
    Predication                            Intel/HP Itanium, ARM processors, TI TMS320C6201
    Eager execution (limited)              IBM mainframes: IBM 360/91, IBM 3090


    High-Bandwidth Branch Prediction

    Future microprocessors will require more than one prediction per cycle, starting speculation over multiple branches in a single cycle.

    When multiple branches are predicted per cycle, instructions must be fetched from multiple target addresses per cycle, complicating I-cache access.

    Solution: trace cache in combination with next-trace prediction.


    Back to the Superscalar Pipeline

    [Figure: pipeline stages IF, ID and Rename, Issue from the instruction window, parallel EX units, Retire and Write-Back]

    In-order delivery of instructions to the out-of-order execution kernel!


    Decode Stage

    Delivery task: keep the instruction window full

    the deeper instruction look-ahead allows the processor to find more instructions to issue to the execution units.

    Instructions are fetched and decoded at a higher bandwidth than they are executed.

    Today's processors fetch and decode about 1.4 to twice as many instructions as they commit (because of mispredicted branch paths).

    Typically the decode bandwidth is the same as the instruction fetch bandwidth.

    Multiple instruction fetch and decode is supported by a fixed instruction length.


    Decoding variable-length instructions

    Variable instruction length:

    often the case for legacy CISC instruction sets such as the Intel IA32 ISA.

    A multistage decode is necessary:

    The first stage determines the instruction limits within the instruction stream.

    The second stage decodes the instructions, generating one or several micro-ops from each instruction.

    Complex CISC instructions are split into micro-ops which resemble ordinary RISC instructions.


    Two principal techniques to implement renaming

    Separate sets of architectural registers and rename (physical) registers are provided.

    The physical registers contain values (of completed but not yet retired instructions); the architectural registers store the committed values.

    After commitment of an instruction, copying its result from the rename register to the architectural register is required.

    Only a single set of registers is provided, and architectural registers are dynamically mapped to physical registers.

    The physical registers contain committed values and temporary results.

    After commitment of an instruction, the physical register is made permanent and no copying is necessary.

    An alternative to dynamic renaming is static renaming in combination with a large register file, as defined for the Intel Itanium.
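    The second technique (a single register set with dynamic mapping) can be sketched as a map table plus a free list. This is a minimal illustration with made-up sizes and names; real renamers also checkpoint the map for branch recovery.

```c
#include <assert.h>

#define ARCH_REGS 8     /* illustrative sizes */
#define PHYS_REGS 16

static int map[ARCH_REGS];           /* architectural -> physical mapping */
static int free_list[PHYS_REGS];
static int free_top;

void rename_init(void) {
    for (int i = 0; i < ARCH_REGS; i++) map[i] = i;  /* identity at start */
    free_top = 0;
    for (int p = PHYS_REGS - 1; p >= ARCH_REGS; p--)
        free_list[free_top++] = p;                   /* remaining regs free */
}

/* Renaming a destination allocates a fresh physical register; the old
 * mapping is returned so it can be freed when the instruction retires. */
int rename_dest(int arch_reg, int *old_phys) {
    *old_phys = map[arch_reg];
    map[arch_reg] = free_list[--free_top];
    return map[arch_reg];
}

/* Source operands simply read the current mapping. */
int rename_src(int arch_reg) { return map[arch_reg]; }
```

    Because every write gets a fresh physical register, write-after-write and write-after-read (name) dependences disappear; only true data dependences remain, as stated on the superscalar pipeline slide.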


    Issue and Dispatch

    The notion of the instruction window comprises all the waiting stations between the decode (rename) and execute stages.

    The instruction window isolates the decode/rename stages from the execution stages of the pipeline.

    Instruction issue is the process of initiating instruction execution in the processor's functional units:

    issue to a FU or a reservation station,

    dispatch, if a second issue stage exists, denotes when an instruction starts to execute in the functional unit.

    The instruction-issue policy is the protocol used to issue instructions.

    The processor's lookahead capability is the ability to examine instructions beyond the current point of execution in hope of finding independent instructions to execute.


    Issue

    The issue logic examines the waiting instructions in the instruction window and simultaneously assigns (issues) a number of instructions to the FUs up to a maximum issue bandwidth.

    The program order of the issued instructions is stored in the reorder buffer.

    Instruction issue from the instruction window can be:

    in-order (only in program order) or out-of-order,

    it can be subject to simultaneous data dependences and resource constraints,

    or it can be divided into two (or more) stages, checking structural conflicts in the first and data dependences in the next stage (or vice versa).

    In the case of structural conflicts first, the instructions are issued to reservation stations (buffers) in front of the FUs, where the issued instructions await missing operands.


    Reservation Station(s)

    Two definitions in the literature:

    A reservation station is a buffer for a single instruction with its operands (original Tomasulo paper, Flynn's book, Hennessy/Patterson book).

    A reservation station is a buffer (in front of one or more FUs) with one or more entries, where each entry can buffer an instruction with its operands (PowerPC literature).

    Depending on the specific processor, reservation stations can be central to a number of FUs, or each FU has one or more reservation stations of its own.

    Instructions await their operands in the reservation stations, as in the Tomasulo algorithm.


    The following issue schemes are commonly used

    Single-level, central issue: single-level issue out of a central window, as in the Pentium II processor.

    [Figure: Decode and Rename feed a central Issue and Dispatch window, which feeds the Functional Units]


    Single-level, two-window issue

    Single-level, two-window issue: single-level issue with instruction window decoupling using two separate windows,

    most common: separate floating-point and integer windows, as in the HP 8000 processor.

    [Figure: Decode and Rename feed two Issue and Dispatch windows, each feeding its own Functional Units]


    Two-level issue with multiple windows

    Two-level issue with multiple windows: a centralized window in the first stage and separate windows in the second stage (PowerPC 604 and 620 processors).

    [Figure: Decode and Rename feed a central Issue stage, which dispatches to reservation stations in front of the individual Functional Units]


    Execution Stages

    Various types of FUs, classified as single-cycle (latency of one) or multiple-cycle (latency of more than one) units.

    Single-cycle units produce a result one cycle after an instruction started execution. Usually they are also able to accept a new instruction each cycle (throughput of one).

    Multi-cycle units perform more complex operations that cannot be implemented within a single cycle.

    Multi-cycle units can be pipelined to accept a new operation each cycle or every other cycle, or they are non-pipelined.

    Another class of units exists that performs operations with variable cycle times.


    Types of FUs

    single-cycle (single latency) units:

    (simple) integer and (integer-based) multimedia units,

    multicycle units that are pipelined (throughput of one):

    complex integer, floating-point, and (floating-point-based) multimedia units (also called multimedia vector units),

    multicycle units that are pipelined but do not accept a new operation each cycle (throughput of 1/2 or less):

    often the 64-bit floating-point operations in a floating-point unit,

    multicycle units that are often not pipelined:

    division unit, square root units, complex multimedia units,

    variable cycle time units:

    load/store unit (depending on cache misses) and special implementations of e.g. floating-point units.


    Multimedia Units

    Utilization of subword parallelism (data-parallel instructions, SIMD)

    Saturation arithmetic

    Additional arithmetic instructions, e.g. pavgusb (average instruction), masking and selection instructions, reordering and conversion

    MM streams and/or 3D graphics supported

    [Figure: packed multiply; register R1 holds x1..x4, register R2 holds y1..y4, and R3 receives the lane-wise products x1*y1 .. x4*y4]
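    The lane-wise multiply in the figure can be emulated in portable C by slicing a 64-bit register into four 16-bit lanes. This is a sketch of the principle only: products are truncated to the low 16 bits here, whereas real multimedia ISAs offer high-half, low-half, and saturating variants.

```c
#include <assert.h>
#include <stdint.h>

/* Subword parallelism: treat a 64-bit value as four independent 16-bit
 * lanes and multiply lane by lane (R3 = R1 * R2 per lane, as in the figure). */
uint64_t packed_mul16(uint64_t r1, uint64_t r2) {
    uint64_t r3 = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(r1 >> (16 * lane));
        uint16_t y = (uint16_t)(r2 >> (16 * lane));
        uint16_t prod = (uint16_t)(x * y);       /* low 16 bits of product */
        r3 |= (uint64_t)prod << (16 * lane);
    }
    return r3;
}
```

    A hardware multimedia unit performs all four lane multiplications in parallel in one instruction; the loop here only models the semantics.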


    Finalizing Pipelined Execution - Retirement and Write-Back

    Retiring means removal from the scheduler with or without the commitment of operation results, whichever is appropriate.

    Retiring an operation does not by itself imply that the results of the operation are permanent or non-permanent.

    A result is made permanent:

    either by making the mapping of architectural to physical register permanent (if no separate physical registers exist), or

    by copying the result value from the rename register to the architectural register (in case of separate physical and architectural registers),

    in a separate write-back stage after the commitment!


    Reorder Buffers

    The reorder buffer keeps the original program order of the instructions after instruction issue and allows result serialization during the retire stage.

    State bits store whether an instruction is on a speculative path and, when the branch is resolved, whether the instruction is on a correct path or must be discarded.

    When an instruction completes, the state is marked in its entry.

    Exceptions are marked in the reorder buffer entry of the triggering instruction.

    The reorder buffer is implemented as a circular FIFO buffer.

    Reorder buffer entries are allocated in the (first) issue stage and deallocated serially when the instruction retires.
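    The circular-FIFO organization above can be sketched directly. A minimal illustration with assumed sizes and field names: entries are allocated at the tail during issue and retired in order from the head, and an entry may retire only once it is marked completed.

```c
#include <assert.h>

#define ROB_SIZE 8   /* illustrative size */

typedef struct { int valid; int completed; int tag; } rob_entry_t;

static rob_entry_t rob[ROB_SIZE];
static int head, tail, count;

/* Allocate at issue; returns the entry index, or -1 if the ROB is full. */
int rob_alloc(int tag) {
    if (count == ROB_SIZE) return -1;
    rob[tail] = (rob_entry_t){ 1, 0, tag };
    int idx = tail;
    tail = (tail + 1) % ROB_SIZE;
    count++;
    return idx;
}

/* Mark an instruction's entry when it completes (possibly out of order). */
void rob_complete(int idx) { rob[idx].completed = 1; }

/* Retire strictly in order: only the head entry may leave, and only
 * when it has completed. Returns 1 on success. */
int rob_retire(int *tag) {
    if (count == 0 || !rob[head].completed) return 0;
    *tag = rob[head].tag;
    rob[head].valid = 0;
    head = (head + 1) % ROB_SIZE;
    count--;
    return 1;
}
```

    The head pointer is what enforces result serialization: a younger instruction that finishes early simply waits in the buffer until everything older has retired.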


    Precise Interrupt (Precise Exception)

    An interrupt or exception is called precise if the saved processor state corresponds with the sequential model of program execution, where one instruction execution ends before the next begins.

    Precise exception means that all instructions before the faulting instruction are committed and those after it can be restarted from scratch.

    If an interrupt occurred, all instructions that are in program order before the interrupt-signaling instruction are committed, and all later instructions are removed.

    Depending on the architecture and the type of exception, the faulting instruction is either committed or removed without any lasting effect.


    VLIW and EPIC

    VLIW (very long instruction word) and EPIC (explicitly parallel instruction computing):

    The compiler packs a fixed number of instructions into a single VLIW/EPIC instruction.

    The instructions within a VLIW instruction are issued and executed in parallel; EPIC is more flexible.

    Examples:

    VLIW: high-end signal processors (TMS320C6201)

    EPIC: Intel Merced/Itanium


    Intel's IA-64 EPIC Format

    IA-64 instructions are packed by the compiler into bundles.

    A bundle is a 128-bit long instruction word (LIW) containing three IA-64 instructions along with a so-called template that contains instruction grouping information.

    IA-64 does not insert no-op instructions to fill slots in the bundles.

    The template explicitly indicates parallelism, that is:

    whether the instructions in the bundle can be executed in parallel,

    or if one or more must be executed serially,

    and whether the bundle can be executed in parallel with the neighbor bundles.

    [Figure: IA-64 instruction word, 128 bits: a 5-bit template followed by three 41-bit instruction slots (instruction 0, instruction 1, instruction 2)]
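    The bundle layout from the figure can be unpacked with plain shifts and masks. A sketch assuming the bit layout shown on the slide (bits 0-4 template, bits 5-45 slot 0, bits 46-86 slot 1, bits 87-127 slot 2) and holding the 128-bit bundle as two 64-bit halves; the type and function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } bundle_t;   /* lo = bits 0..63 */

/* Extract len (<= 41) bits starting at bit position pos of the bundle,
 * handling fields that straddle the 64-bit boundary. */
static uint64_t bundle_bits(bundle_t b, int pos, int len) {
    uint64_t v;
    if (pos >= 64)            v = b.hi >> (pos - 64);
    else if (pos + len <= 64) v = b.lo >> pos;
    else                      v = (b.lo >> pos) | (b.hi << (64 - pos));
    return v & ((1ULL << len) - 1);
}

uint64_t bundle_template(bundle_t b)    { return bundle_bits(b, 0, 5); }
uint64_t bundle_slot(bundle_t b, int i) { return bundle_bits(b, 5 + 41 * i, 41); }
```

    Slot 1 is the interesting case: it spans bits 46-86 and therefore straddles the two 64-bit halves, which is why the extraction helper stitches the low and high words together.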


    Part II: Microarchitectural solutions for future microprocessors

    Technology prognosis

    Speed-up of a single-threaded application:

    Advanced superscalar

    Trace Cache

    Superspeculative

    Multiscalar processors

    Speed-up of multi-threaded applications:

    Chip multiprocessors (CMPs)

    Simultaneous multithreading


    Technological Forecasts

    Moore's Law:number of transistors per chip double every two years

SIA (Semiconductor Industry Association) prognosis 1998:
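The doubling rule can be turned into a quick projection; the base figures in the example are illustrative assumptions, not SIA data:

```python
def transistors(base_count, base_year, year, doubling_period=2):
    """Moore's law projection: the transistor count per chip doubles
    every `doubling_period` years."""
    return base_count * 2 ** ((year - base_year) / doubling_period)

# Illustrative: from ~10 million transistors in 1998, a doubling every
# two years predicts ~320 million by 2008.
assert transistors(10e6, 1998, 2008) == 320e6
```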


    62

    Design Challenges

Increasing the clock speed,

increasing the amount of work that can be performed per cycle,

and reducing the number of instructions needed to perform a task.

Today's general trend toward more complex designs is opposed by the wiring

delay within the processor chip as the main technological problem:

at the higher clock rates of sub-quarter-micron designs,

on-chip interconnecting wires cause a significant portion of the delay time in circuits.

    Functional partitioning becomes more important!


    63

Architectural Challenges and Implications

Preserve object code compatibility (may be avoided by a virtual machine that

targets run-time ISAs).

Find ways of expressing and exposing more parallelism to the processor. It is

doubtful whether enough ILP is available; thread-level parallelism (TLP) must be harnessed

additionally.

Memory bottleneck.

Power consumption for mobile computers and appliances.

Soft errors caused by cosmic rays or gamma radiation may be faced with fault-tolerant design throughout the chip.


    64

    Future Processor ArchitecturePrinciples

    Speed-up of a single-threaded application

    Advanced superscalar

    Trace Cache

    Superspeculative

    Multiscalar processors

Speed-up of multi-threaded applications

    Chip multiprocessors (CMPs)

    Simultaneous multithreading


    65

Processor Techniques to Speed-up Single-threaded Applications

Advanced superscalar processors scale current designs up to an issue width of 16 or 32

instructions per cycle.

Trace cache facilitates instruction fetch and branch prediction.

Superspeculative processors enhance wide-issue superscalar performance by

speculating aggressively at every point.

Multiscalar processors divide a program into a collection of tasks that are distributed

to a number of parallel processing units under the control of a single hardware

sequencer.


    66

Advanced Superscalar Processors for Billion-Transistor Chips

    Aggressive speculation, such as a very aggressive dynamic branch predictor,

    a large trace cache,

    very-wide-issue superscalar processing (an issue width of 16 or 32 instructions

    per cycle),

    a large number of reservation stations to accommodate 2,000 instructions,

    24 to 48 highly optimized, pipelined functional units,

    sufficient on-chip data cache, and

    sufficient resolution and forwarding logic.


    67

    The Trace Cache

Trace cache is a special I-cache that captures dynamic instruction sequences, in

contrast to the I-cache, which contains static instruction sequences.

Like the I-cache, the trace cache is accessed using the starting address of the next

block of instructions.

Unlike the I-cache, it stores logically contiguous instructions in physically

contiguous storage.

A trace cache line stores a segment of the dynamic instruction trace across

multiple, potentially taken branches.

Each line stores a snapshot, or trace, of the dynamic instruction stream.

The trace construction is off the critical path.
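A toy model of this behavior, assuming a simple dictionary keyed by the starting fetch address; real trace caches add fill units, branch-outcome matching, and a replacement policy:

```python
class TraceCache:
    """Toy trace cache: lines are indexed by the starting address of a
    dynamic instruction sequence (which may span taken branches),
    unlike an I-cache line that holds a static, address-contiguous block."""
    def __init__(self, max_trace_len=16):
        self.lines = {}
        self.max_trace_len = max_trace_len

    def fill(self, start_addr, dynamic_trace):
        # Store a snapshot of the dynamic instruction stream.
        self.lines[start_addr] = dynamic_trace[:self.max_trace_len]

    def fetch(self, start_addr):
        # Hit: deliver the whole trace in one fetch; miss: None.
        return self.lines.get(start_addr)

tc = TraceCache()
# A dynamic trace crossing a taken branch: addresses 0x100, 0x104, then 0x200.
tc.fill(0x100, [0x100, 0x104, 0x200])
assert tc.fetch(0x100) == [0x100, 0x104, 0x200]
assert tc.fetch(0x200) is None
```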


    68

    I-cache and Trace Cache

(Figure: instruction fetch from the I-cache vs. from the trace cache.)


    70

    Strong- vs. Weak-dependence Model

Strong-dependence model for program execution: a total instruction ordering of a sequential program.

Two instructions are identified as either dependent or independent, and when in doubt,

dependences are pessimistically assumed to exist.

Dependences are never allowed to be violated and are enforced during instruction processing.

Weak-dependence model:

specifies that dependences can be temporarily violated during instruction execution as

long as recovery can be performed prior to affecting the permanent machine state.

Advantage: the machine can speculate aggressively and temporarily violate the

dependences.

The machine can exceed the performance limit imposed by the strong-dependence

model.


    72

    Superflow processor

The Superflow processor speculates on

instruction flow: two-phase branch predictor combined with trace cache

register data flow:

dependence prediction: predict the register value dependence between instructions

source operand value prediction

constant value prediction

value stride prediction: speculate on constant, incremental increases in operand values

memory data flow: prediction of load values, of load addresses, and alias prediction
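The value stride prediction mentioned above can be sketched as a last-value-plus-stride table per instruction; a minimal illustration under assumed simplifications, not the Superflow design:

```python
class StridePredictor:
    """Toy value stride predictor: if an instruction's last two produced
    values differed by a constant stride, predict last + stride next."""
    def __init__(self):
        self.last = {}    # pc -> last observed value
        self.stride = {}  # pc -> last observed stride

    def predict(self, pc):
        if pc in self.last and pc in self.stride:
            return self.last[pc] + self.stride[pc]
        return None  # no prediction available yet

    def update(self, pc, value):
        # Learn the stride from the two most recent values.
        if pc in self.last:
            self.stride[pc] = value - self.last[pc]
        self.last[pc] = value

sp = StridePredictor()
sp.update(0x40, 100)   # e.g. an address-increment instruction
sp.update(0x40, 104)   # stride 4 learned
assert sp.predict(0x40) == 108
```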


    73

    Superflow Processor Proposal


    75

    Multiscalar mode of execution

(Figure: a control flow graph with nodes A, B, C, D, E; tasks A, B, D, and E are assigned to processing elements PE 0 to PE 3, with data values forwarded between the PEs.)


    76

    Multiscalar processor


    77

    Multiscalar, Trace and SpeculativeMultithreaded Processors

Multiscalar: A program is statically partitioned into tasks which are marked by annotations of the CFG.

    Trace Processor: Tasks are generated from traces of the trace cache.

    Speculative multithreading: Tasks are otherwise dynamically constructed.

Common target: Increase of single-thread program performance by dynamically

utilizing thread-level speculation in addition to instruction-level parallelism.

A thread here means a hardware thread.


    78

    Additional utilization of more coarse-grained parallelism

Chip multiprocessors (CMPs) or multiprocessor chips

integrate two or more complete processors on a single chip;

every functional unit of a processor is duplicated.

Simultaneous multithreaded (SMT) processors

store multiple contexts in different register sets on the chip;

the functional units are multiplexed between the threads;

instructions of different contexts are simultaneously executed.


    79

    Shared memory candidates for CMPs

(Figure: four processors sharing a primary cache, backed by a secondary cache and global memory.)

Shared primary cache


    80

    Shared memory candidates for CMPs

(Figures: two further configurations. Shared caches and memory: four processors, each with private primary and secondary caches, above a shared global memory. Shared secondary cache: four processors with private primary caches sharing one secondary cache and global memory.)


    81

    Hydra: A Single-Chip Multiprocessor

(Figure: the Hydra single-chip multiprocessor. Four CPUs, each with primary I-cache, primary D-cache, and a memory controller, are connected through centralized bus arbitration mechanisms to an on-chip secondary cache, a Rambus memory interface, an off-chip L3 interface, an I/O bus interface, and DMA. Off chip: cache SRAM array, DRAM main memory, and I/O devices.)


    82

    Shared memory candidates for CMPs

(Figure: four processors connected directly to a shared global memory, with no caches.)

Shared global memory, no caches


    83

    Motivation for Processor-in-Memory

Technological trends have produced a large and growing gap between processor speed and DRAM access latency.

    Today, it takes dozens of cycles for data to travel between the CPU and main

    memory.

CPU-centric design philosophy has led to very complex superscalar processors with deep pipelines.

    Much of this complexity is devoted to hiding memory access latency.

    Memory wall: the phenomenon that access times are increasingly limiting system

    performance.

Memory-centric design is envisioned for the future!


    84

    PIM or Intelligent RAM (IRAM)

PIM (processor-in-memory) or IRAM (intelligent RAM) approaches couple processor execution with large, high-bandwidth, on-chip DRAM banks.

PIM or IRAM merges processor and memory into a single chip.

    Advantages:

The processor-DRAM gap in access speed will increase further in the future. PIM provides higher

bandwidth and lower latency for (on-chip) memory accesses.

    DRAM can accommodate 30 to 50 times more data than the same chip area devoted

    to caches.

On-chip memory may be treated as main memory - in contrast to a cache, which is just a redundant memory copy.

    PIM decreases energy consumption in the memory system due to the reduction of

    off-chip accesses.


    85

    PIM Challenges

    Scaling a system beyond a single PIM.

    The DRAM technology today does not allow on-chip coupling of high

    performance processors with DRAM memory since the clock rate of DRAM

    memory is too low.

Logic and DRAM manufacturing processes are fundamentally different.

    The PIM approach can be combined with most processor organizations.

    The processor(s) itself may be a simple or moderately superscalar standard

    processor,

    it may also include a vector unit as in the vector IRAM type,

    or be designed around a smart memory system.

    In future: potentially memory-centric architectures.


    86

    Conclusions on CMP

Usually, a CMP will feature separate L1 I-cache and D-cache per on-chip CPU

and an optional unified L2 cache.

If the CPUs always execute threads of the same process, the L2 cache

organization will be simplified, because different processes do not have to be

distinguished.

Recently announced commercial processors with CMP hardware:

IBM Power4 processor with two processors on a single die

Sun MAJC-5200 with two processors on a die (each processor a four-threaded block-

interleaving VLIW)


    87

Motivation for Multithreaded Processors

    Aim: Latency tolerance

What is the problem? Load access latencies measured on an AlphaServer 4100

SMP with four 300 MHz Alpha 21164 processors are:

7 cycles for a primary cache miss which hits in the on-chip L2 cache of the 21164 processor,

21 cycles for an L2 cache miss which hits in the L3 (board-level) cache,

80 cycles for a miss that is served by the memory, and

125 cycles for a dirty miss, i.e., a miss that has to be served from another processor's

cache memory.
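These latencies suggest a rule of thumb for how many thread contexts a multithreaded processor needs to bridge a stall; the formula below is an idealized estimate that ignores context-switch overhead:

```python
import math

def threads_to_hide(latency_cycles, run_cycles):
    """Rule of thumb: if one thread runs `run_cycles` between long-latency
    events, roughly latency/run extra contexts fill its stall cycles."""
    return 1 + math.ceil(latency_cycles / run_cycles)

# With ~40 busy cycles between misses:
assert threads_to_hide(80, 40) == 3    # memory miss (80 cycles above)
assert threads_to_hide(125, 40) == 5   # dirty miss (125 cycles above)
```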


    88

    Multithreading

The ability to pursue two or more threads of control in parallel within a processor

pipeline.

Advantage: The latencies that arise in the computation of a single instruction stream are filled by computations of another thread.

Multithreaded processors are able to bridge latencies by switching to another thread of control - in contrast to chip multiprocessors.


    89

(Figure: four register sets, each with its own PC and PSR, holding the contexts of threads 1 to 4.)

    Multithreaded Processors

Multithreading: Provide several program counters (and usually several register sets) on chip

    Fast context switching by switching to another thread of control


    90

    Approaches of MultithreadedProcessors

Cycle-by-cycle interleaving

An instruction of another thread is fetched and fed into the execution pipeline at each

processor cycle.

Block interleaving

The instructions of a thread are executed successively until an event occurs that may cause a latency. This event induces a context switch.

Simultaneous multithreading

Instructions are simultaneously issued from multiple threads to the FUs of a

superscalar processor; this combines wide superscalar instruction issue with multithreading.
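The cycle-by-cycle interleaving approach can be sketched as a round-robin fetch schedule; a simplified illustration that ignores stalls and thread readiness:

```python
from itertools import cycle, islice

def interleave_cycle_by_cycle(threads, n_cycles):
    """Cycle-by-cycle interleaving: each processor cycle fetches one
    instruction from the next thread, round-robin."""
    schedule = []
    rr = cycle(range(len(threads)))
    iters = [iter(t) for t in threads]
    for tid in islice(rr, n_cycles):
        # A thread with no instructions left contributes a pipeline bubble.
        schedule.append((tid, next(iters[tid], 'bubble')))
    return schedule

t0 = ['i0', 'i1', 'i2']
t1 = ['j0', 'j1', 'j2']
sched = interleave_cycle_by_cycle([t0, t1], 4)
assert sched == [(0, 'i0'), (1, 'j0'), (0, 'i1'), (1, 'j1')]
```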


    91

Comparison of Multithreading with Non-Multithreading Approaches

    (a) single-threaded scalar

    (b) cycle-by-cycle interleaving

    multithreaded scalar

    (c) block interleaving

    multithreaded scalar

(Figure: issue slots over processor cycles for the three models, with context switches marked in (b) and (c).)


    92

Simultaneous Multithreading (SMT) and Chip Multiprocessors (CMP)

    (a) SMT

    (b) CMP

(Figure: issue-slot utilization over processor cycles for (a) SMT and (b) CMP.)


    93

    Simultaneous Multithreading

State of research

SMT is simulated and evaluated with Spec92, Spec95, and with database transaction and

decision support workloads.

Mostly unrelated programs are loaded in the thread slots!

    Typical result: 8-threaded SMT reaches a two- to threefold IPC increase over single-threaded superscalar.

    State of industrial development

DEC/Compaq announced the Alpha EV8 (21464) as a

4-threaded, 8-wide superscalar SMT processor


    94

    Combining SMT and Multimedia

Start with a wide-issue superscalar general-purpose processor

    Enhance by simultaneous multithreading

Enhance by multimedia unit(s)

Enhance by on-chip RAM memory for constants and local variables


    95

The SMT Multimedia Processor Model


    96

    IPC of Maximum Processor Models


    97

    CMP or SMT?

The performance race between SMT and CMP is not yet decided.

CMP is easier to implement, but only SMT has the ability to hide latencies.

A functional partitioning is not easily reached within an SMT processor due to the

centralized instruction issue.

A separation of the thread queues is a possible solution, although it does not remove the central instruction issue.

A combination of simultaneous multithreading with the CMP may be superior.

Research: combine the SMT or CMP organization with the ability to create threads

with compiler support or fully dynamically out of a single thread (thread-level speculation,

close to multiscalar).
