Microarchitecture of Superscalars (4): Decoding



  • Microarchitecture of Superscalars (4): Decoding

    Dezső Sima

    Fall 2007 (Ver. 2.0). Dezső Sima, 2007

  • Overview

    1. Overview
    2. Straightforward parallel decoding
    3. Predecoding
    4. Decoding with CISC/RISC conversion
       4.1 Overview
       4.2 Decoding into µops
       4.3 Decoding into macroops
    5. Using a trace cache
    6. Decoding with instruction grouping
       6.1 Overview
       6.2 Grouping of RISC instructions
       6.3 Grouping of CISC instructions

  • 1. Overview

    Decoding techniques used in superscalars:

    Straightforward parallel decoding (1. gen. RISC superscalars)

    Predecoding (beginning with 2. gen. superscalars)

    Decoding with CISC/RISC conversion (beginning with 2. gen. superscalar CISCs)
       Decoding into µops (Intel, beginning with the Pentium Pro)
       Decoding into macroops, up to two µops (AMD, beginning with the K7)

    Using a trace cache (P4-family)

    Decoding with instruction grouping
       Grouping of RISC instructions: POWER4, POWER5
       Grouping of CISC instructions: Pentium M, Core
       Also shown: K7 (Athlon), K8 (Hammer)

  • 2. Straightforward parallel decoding

    Figure 2.1: The PowerPC 601's front end

    Source: Stokes, J.H., PowerPC on Apple: An architecture history, Aug. 2004. http://arstechnica.com/articles

  • 3. Predecoding (1)

    Figure 3.1: Contrasting decoding and instruction issue in a scalar and a 4-way superscalar processor

    Figure labels: I-cache, scalar issue; I-cache, superscalar issue.

  • 3. Predecoding (1)

    Figure 3.2: The principle of predecoding

    Figure labels: second-level cache (or memory); typically 128 bits/cycle; predecode unit; e.g. 148 bits/cycle.

    When instructions are written into the I-cache, the predecode unit of a RISC processor appends 4-7 bits to each instruction.

    AMD's CISC processors append n bits to each byte (K5, K6: 5 bits/byte; K7, K8: 3 bits/byte).

    Source: Sima, D. et al., ACA, Addison-Wesley, 1997
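
The bit counts on this slide can be checked with a little arithmetic. The sketch below is only an illustration: the function name `predecoded_width` is hypothetical, and the choice of four 32-bit instructions with 5 predecode bits each is one assumption (within the 4-7 bit range quoted above) that happens to reproduce the 148 bits/cycle figure. Applying the per-byte AMD figures to a 128-bit block is likewise an assumption made for comparison, not a number from the slides.

```python
def predecoded_width(fetch_bits, unit_bits, extra_bits_per_unit):
    """Return the width of a fetch block after predecode bits are appended.

    fetch_bits:          raw block width coming from the L2 cache or memory
    unit_bits:           size of the unit that gets tagged (instruction or byte)
    extra_bits_per_unit: predecode bits appended to each unit
    """
    units, remainder = divmod(fetch_bits, unit_bits)
    if remainder:
        raise ValueError("fetch width must be a whole number of units")
    return fetch_bits + units * extra_bits_per_unit


# RISC case (assumed): four 32-bit instructions, 5 predecode bits each.
risc_bits = predecoded_width(128, 32, 5)

# AMD case (assumed 128-bit block): bits are appended per byte, not per instruction.
k6_bits = predecoded_width(128, 8, 5)  # K5, K6: 5 bits/byte
k8_bits = predecoded_width(128, 8, 3)  # K7, K8: 3 bits/byte

print(risc_bits, k6_bits, k8_bits)
```

Tagging every byte rather than every instruction costs more storage, which is what a variable-length instruction set needs, since instruction boundaries can fall on any byte.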

  • 3. Predecoding (2)

    Figure 3.3: The introduction of predecoding

    Source: Sima, D. et al., ACA, Addison-Wesley, 1997

  • 3. Predecoding (3)

    Figure 3.4: Variable length instruction decoding in the Athlon

    Source: de Vries, H., Understanding the detailed Architecture of AMD's 64 bit Core, Sept. 2003, http://www.chip-architect.com

  • 3. Predecoding (4)

    Figure 3.5: Opteron's instruction cache and decoding

    Source: de Vries, H., Understanding the detailed Architecture of AMD's 64 bit Core, Sept. 2003, http://www.chip-architect.com

  • 4. Decoding with CISC/RISC conversion

    4.1 Overview

    Figure 4.1: Principle of decoding with CISC/RISC conversion

    Figure labels: CISC instructions; decoding with CISC/RISC conversion; RISC core; retiring with RISC/CISC conversion; modification of the program state after RISC/CISC re-conversion.

    Examples: PPro, K6; µops, macroops.

    Source: Sima, D. et al., ACA, Addison-Wesley, 1997

  • 4.2 Decoding into µops (1)

    Figure 4.2: The Microarchitecture of the Pentium Pro

    Source: Shanley, T., Pentium Pro Processor System Architecture, Addison-Wesley Press, 1997

  • 4.2 Decoding into µops (2)

    Figure 4.3: Basic misprediction pipeline of the Pentium III

    Source: Hinton, G. et al., The Microarchitecture of the Pentium 4 Processor, Intel Technology Journal Q1, 2001

  • 4.2 Decoding into µops (3)

    Figure 4.4: Decoding in AMD's K6

    Source: Shriver, B., Smith, B., The Anatomy of a High-Performance Microprocessor, IEEE Computer Society Press, 1998

  • 4.2 Decoding into µops (4)

    Figure 4.5: The Microarchitecture of the Pentium M (Yonah)

    Source: Kanter, D., Intel's next Generation Microarchitecture Unveiled, Real World Tech., March 9, 2006.

  • 4.2 Decoding into µops (5)

    Figure 4.6: The Microarchitecture of the Core processor family

    Source: Kanter, D., Intel's next Generation Microarchitecture Unveiled, Real World Tech., March 9, 2006.

  • 4.3 Decoding into macroops (1)

    Figure 4.7: AMD Athlon™: the Microarchitecture of the Athlon

    Source: Meyer, D., The AMD-K7 Processor, MPF, Oct. 1998

  • 4.3 Decoding into macroops (2)

    Figure 4.8: Decoding in the Athlon (1)

    Source: Meyer, D., The AMD-K7 Processor, MPF, Oct. 1998

  • 4.3 Decoding into macroops (3)

    Figure 4.9: Decoding in the Athlon (2)

    Source: Meyer, D., The AMD-K7 Processor, MPF, Oct. 1998

  • 4.3 Decoding into macroops (4)

    Each MacroOp: 1 or 2 operations (OPs), e.g.:

    ADD EAX, EBX: 1 ADD OP

    AND EAX, [EBX+16]: 1 LOAD OP, 1 AND OP

    Up to 3 MacroOps per cycle, with up to 3 FX + 2 L/S OPs (dual ported D$!) per cycle
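
The two examples on this slide can be reproduced with a toy decoder. This is a minimal sketch under stated assumptions, not AMD's decode logic: the names `macroop_ops` and `fits_in_one_cycle` are hypothetical, only two-operand instructions with a register destination are handled, and every non-load OP is counted as a fixed-point (FX) OP.

```python
LOAD_STORE_OPS = {"LOAD", "STORE"}


def macroop_ops(instruction):
    """Split one x86 instruction (text form) into the OPs of its MacroOp.

    Assumption: "MNEMONIC dest, src" syntax with a register destination.
    A memory source operand ("[...]") adds one LOAD OP in front.
    """
    mnemonic, operands = instruction.split(maxsplit=1)
    _dest, src = [part.strip() for part in operands.split(",")]
    ops = []
    if src.startswith("["):
        ops.append("LOAD")
    ops.append(mnemonic.upper())
    return ops


def fits_in_one_cycle(instructions):
    """Check the per-cycle limits quoted on the slide:
    up to 3 MacroOps, with up to 3 FX OPs and up to 2 load/store OPs."""
    macroops = [macroop_ops(text) for text in instructions]
    all_ops = [op for ops in macroops for op in ops]
    ls_ops = sum(op in LOAD_STORE_OPS for op in all_ops)
    fx_ops = len(all_ops) - ls_ops
    return len(macroops) <= 3 and fx_ops <= 3 and ls_ops <= 2


print(macroop_ops("ADD EAX, EBX"))
print(macroop_ops("AND EAX, [EBX+16]"))
print(fits_in_one_cycle(["ADD EAX, EBX", "AND EAX, [EBX+16]", "SUB ECX, [ESI]"]))
```

Three instructions that each read memory would need three LOAD OPs and so would not fit the two load/store slots in this model.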

  • 4.3 Decoding into macroops (5)

    Figure 4.10: The Microarchitecture of the Hammer

    Source: Weber, F., AMD's Next Generation Microprocessor Architecture, MPF, Oct. 2001

  • 5. Using a trace cache (1)

    Figure 5.1: The Microarchitecture of the Pentium 4 (Willamette)

  • 5. Using a trace cache (2)

    Figure 5.2: Basic misprediction pipeline of the Pentium 4 (Willamette)

    Source: Hinton, G. et al., The Microarchitecture of the Pentium 4 Processor, Intel Technology Journal Q1, 2001

  • 5. Using a trace cache (3)

    Figure 5.3: The Microarchitecture of the Pentium 4 (Prescott)

    Source: Kanter, D., Intel's next Generation Microarchitecture Unveiled, Real World Tech., March 9, 2006.

  • 6. Decoding with instruction grouping

    6.1 Overview

    Decoding with instruction grouping:
       Grouping of RISC instructions: POWER4, POWER5
       Grouping of CISC instructions: Pentium M, Core arch.
       Also shown: K7 (Athlon), K8 (Hammer)

  • 6.2 Grouping of RISC instructions (1)

    Figure 5.3: Instruction grouping in the K7 and K8

    Up to 3 MacroOps are decoded per cycle; these MacroOps are allocated a line in the ROB.

    The ROB has 24 lines of 3 entries each. The ROB retires a line if it is the oldest one and all MacroOps in that line are completed.

    Operation of the Reorder Buffer (ROB)

    Figure layout: index 1 to 12; lane 0, lane 1, lane 2.

    Figure legend: out-of-order finished instructions, results still speculative; instructions being retired now; retired instructions, not speculative anymore.

    Source: de Vries, H., Understanding the detailed Architecture of AMD's 64 bit Core, Sept. 2003, http://www.chip-architect.com
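
The retirement rule stated on this slide (a line retires only when it is the oldest and all of its MacroOps have completed) can be sketched as a small model. This is an illustration only, not the K7/K8 implementation: the class name `LineROB` and its methods are hypothetical, and MacroOps are identified by unique strings for simplicity.

```python
from collections import deque


class LineROB:
    """Toy reorder buffer organised as lines of up to `lanes` MacroOps."""

    def __init__(self, num_lines=24, lanes=3):
        self.num_lines = num_lines
        self.lanes = lanes
        self.lines = deque()  # oldest line on the left

    def allocate(self, macroops):
        """Allocate one line for the MacroOps decoded in a cycle.

        Returns False when the ROB is full (the decoder would stall)."""
        if not 1 <= len(macroops) <= self.lanes:
            raise ValueError("a line holds between 1 and `lanes` MacroOps")
        if len(self.lines) == self.num_lines:
            return False
        # Each line maps MacroOp name -> completed flag (names assumed unique).
        self.lines.append({name: False for name in macroops})
        return True

    def complete(self, name):
        """Mark a MacroOp as finished; this may happen out of order."""
        for line in self.lines:
            if name in line:
                line[name] = True
                return
        raise KeyError(name)

    def retire(self):
        """Retire the oldest line if every MacroOp in it has completed."""
        if self.lines and all(self.lines[0].values()):
            return list(self.lines.popleft())
        return []


rob = LineROB()
rob.allocate(["a", "b", "c"])
rob.allocate(["d", "e"])
for name in ("d", "e"):      # the younger line finishes first
    rob.complete(name)
blocked = rob.retire()       # nothing retires: the oldest line is not done
for name in ("a", "b", "c"):
    rob.complete(name)
first = rob.retire()
second = rob.retire()
print(blocked, first, second)
```

Tracking whole lines instead of single entries means one completion check and one pointer update per group of up to three MacroOps.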

  • 6.2 Grouping of RISC instructions (2)

    Figure 6.1: Out of order execution of MacroOps from the FX schedulers in the K8L (to be introduced in Q2 2007)

    (The K8L scheduler has 8*3 entries vs. 6*3 in the K8)

    Figure labels: decoders, schedulers, EUs.

    Source: Malich, Y., AMD's Next Generation Microarchitecture Preview: from K8 to K8L, Aug. 2006.

  • 6.2 Grouping of RISC instructions (3)

    Figure 6.1: The principle of instruction grouping in IBM's POWER4 and POWER5 processors

    Figure labels: instruction groups, issue queues, execution units, ROB, retire.

    Dispatch instruction groups in-order, forward individual instructions to the issue queues.

    Execute individual instructions ooo.

    Retire instruction groups in-order, modify program state.

  • 6.2 Grouping of RISC instructions (4)

    Figure 6.2: Implementation of instruction grouping in IBM's POWER5 processor

    Source: Sinharoy, B. et al., POWER5 system microarchitecture, IBM J. Res. & Dev., July/Sept. 2005.

  • 6.3 Grouping of CISC instructions (1)

    (Intel: macro-op fusion)

    Introduced in the Core architecture.

    x86 instructions: macro-ops; internal instructions: µops.

    Macro-op fusion combines two macro-ops into a single µop. Specifically, x86 compare or test instructions are fused with x86 jumps to produce a single µop.

    Any decoder can perform macro-op fusion, but only one macro-op fusion can be performed in each cycle.

    In the Core architecture the max. decode bandwidth is 4+1 x86 instructions/cycle.

    Macro-op fusion can reduce the number of µops by about 10%.
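
The "4+1" decode bandwidth can be illustrated with a toy decode loop. This is a rough sketch under simplifying assumptions, not Intel's decoder: the function names are hypothetical, instructions are represented by mnemonic only, every unfused x86 instruction is assumed to produce exactly one µop, and any mnemonic starting with "J" other than JMP is treated as a fusible conditional jump.

```python
COMPARES = {"CMP", "TEST"}


def is_conditional_jump(mnemonic):
    # Assumption: Jcc-style mnemonics are fusible, the unconditional JMP is not.
    return mnemonic.startswith("J") and mnemonic != "JMP"


def decode_cycle(stream, start, width=4):
    """Decode up to `width` µops starting at stream[start].

    At most one compare/test + jump pair is fused per cycle, so a cycle
    can consume up to width + 1 x86 instructions.
    Returns (µops produced, index of the next undecoded instruction).
    """
    uops = []
    fused_this_cycle = False
    i = start
    while i < len(stream) and len(uops) < width:
        current = stream[i]
        following = stream[i + 1] if i + 1 < len(stream) else None
        if (not fused_this_cycle and current in COMPARES
                and following is not None and is_conditional_jump(following)):
            uops.append(current + "+" + following)  # one fused µop
            fused_this_cycle = True
            i += 2
        else:
            uops.append(current)
            i += 1
    return uops, i


program = ["MOV", "CMP", "JNE", "ADD", "TEST", "JZ", "SUB"]
cycle1, nxt = decode_cycle(program, 0)
cycle2, end = decode_cycle(program, nxt)
print(cycle1, nxt)
print(cycle2, end)
```

In the first cycle five x86 instructions are consumed but only four µops are produced; the second TEST is not fused because the single fusion allowed in that cycle has already been used.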

  • 6.3 Grouping of CISC instructions (2)

    Benefits:

    Fewer µops

    Increased performance

    ooo execution becomes more effective, as the instruction window now includes more (~10%) x86 instructions