Microarchitecture of Superscalars (4) Decoding
Dezső Sima
Fall 2007
(Ver. 2.0) Dezső Sima, 2007
Overview
1. Overview
2. Straightforward parallel decoding
3. Predecoding
4. Decoding with CISC/RISC conversion
    4.1 Overview
    4.2 Decoding into µops
    4.3 Decoding into macroops
5. Using a trace cache
6. Decoding with instruction grouping
    6.1 Overview
    6.2 Grouping of RISC instructions
    6.3 Grouping of CISC instructions
1. Overview
Decoding techniques used in superscalars
• Straightforward parallel decoding (1st gen. RISC superscalars)
• Predecoding (beginning with 2nd gen. superscalars)
• Decoding with CISC/RISC conversion (beginning with 2nd gen. superscalar CISCs)
    • Decoding into µops: Intel, beginning with the Pentium Pro
    • Decoding into macroops (up to two µops): AMD, beginning with the K7
• Using a trace cache (P4-family)
• Decoding with instruction grouping
    • Grouping of RISC instructions: POWER4, POWER5
    • Grouping of CISC instructions: Pentium M, Core; K7 (Athlon), K8 (Hammer)
2 Straightforward parallel decoding
Figure 2.1: The PowerPC 601’s front end
Source: Stokes, J.H., „PowerPC on Apple: An architecture history”, Aug. 2004. http://arstechnica.com/articles
3 Predecoding (1)
Figure 3.1: Contrasting the decoding and instruction issues in a scalar and a 4-way superscalar processor
[Figure 3.1 panels: scalar issue and superscalar issue, each showing I-cache, instruction buffer and decode / check / issue, with the typical FX-pipeline layout of each case]
3 Predecoding (1)
Figure 3.2: The principle of predecoding
Second-level cache (or memory), typically 128 bits/cycle → predecode unit → I-cache, e.g. 148 bits/cycle
When instructions are written into the I-cache, the predecode unit of a RISC processor appends 4-7 bits to each instruction.
AMD’s CISC processors append n bits to each byte (K5, K6: 5 bits/byte; K7, K8: 3 bits/byte).
Source: Sima, D. et al., „ACA”, Addison-Wesley 1997
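The bit widths quoted for Figure 3.2 follow from simple arithmetic. A minimal sketch, assuming 32-bit RISC instructions, four instructions per cycle and 5 predecode bits per instruction (one value inside the quoted 4-7 bit range); the 16 bytes/cycle used for the CISC case and both function names are illustrative assumptions, not taken from the slides:

```python
def risc_predecoded_width(instr_bits: int, instrs_per_cycle: int,
                          predecode_bits: int) -> int:
    """Bits per cycle written into the I-cache when every instruction
    receives a fixed number of extra predecode bits."""
    return instrs_per_cycle * (instr_bits + predecode_bits)


def cisc_predecoded_width(bytes_per_cycle: int,
                          predecode_bits_per_byte: int) -> int:
    """Bits per cycle when predecode bits are appended to each
    instruction byte, as in AMD's K5 to K8."""
    return bytes_per_cycle * (8 + predecode_bits_per_byte)


# 4 x 32 = 128 bits arrive, 4 x (32 + 5) = 148 bits are stored.
print(risc_predecoded_width(32, 4, 5))
# Assumed 16 bytes/cycle with 3 bits/byte (K7, K8 style).
print(cisc_predecoded_width(16, 3))
```

Per-byte predecoding is costly in storage: 5 bits on every 8-bit byte is a 62.5% overhead, 3 bits is 37.5%.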
3 Predecoding (2)
Figure 3.3: The introduction of predecoding
Source: Sima, D. et al., „ACA”, Addison-Wesley 1997
3 Predecoding (3)
Figure 3.4: Variable length instruction decoding in the Athlon
Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept. 2003, http://www.chip-architect.com
3 Predecoding (4)
Figure 3.5: Opteron’s instruction cache and decoding
Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept.2003, http://www.chip-architect.com
4 Decoding with CISC/RISC conversion
4.1 Overview
[Figure 4.1 labels: CISC instructions; decoding with CISC/RISC conversion (examples: PPro, K6; internal operations: µops, macroops); RISC core; retiring with RISC/CISC conversion; modification of the program state after RISC/CISC re-conversion]
Figure 4.1: Principle of decoding with CISC/RISC conversion
Source: Sima, D. et al., „ACA”, Addison-Wesley 1997
4.2 Decoding into µops (1)
Figure 4.2: The Microarchitecture of the Pentium Pro
Source: Shanley, T., „Pentium Pro Processor System Architecture”, Addison-Wesley Press, 1997
4.2 Decoding into µops (2)
Figure 4.3: Basic misprediction pipeline of the Pentium III
Source: Hinton, G. et al., „The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal Q1, 2001
Figure 4.4: Decoding in AMD’s K6
Source: Shriver, B., Smith, B., „The Anatomy of a High-Performance Microprocessor”, IEEE Computer Society Press, 1998
4.2 Decoding into µops (3)
Figure 4.5: The Microarchitecture of the Pentium M (Yonah)
4.2 Decoding into µops (4)
Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.
4.2 Decoding into µops (5)
Figure 4.6: The Microarchitecture of the Core processor family
Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.
4.3 Decoding into macroops (1)
Figure 4.7: The Microarchitecture of the AMD Athlon
Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998
4.3 Decoding into macroops (2)
Figure 4.8: Decoding in the Athlon (1)
Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998
4.3 Decoding into macroops (3)
Figure 4.9: Decoding in the Athlon (2)
Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998
Each MacroOp: 1 or 2 operations (OPs)
e.g.: ADD EAX, EBX → 1 ADD OP
      AND EAX, [EBX+16] → 1 LOAD OP + 1 AND OP
Up to 3 MacroOps per cycle with up to 3 FX + 2 L/S OPs (dual ported D$!) per cycle
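The two examples and the per-cycle limits can be sketched as follows. This is a simplified illustration, not AMD's decoder logic: it handles only two-operand instructions with an optional memory source operand, and the names `MacroOp`, `decode_to_macroop` and `fits_in_one_cycle` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MacroOp:
    text: str                                 # original x86 instruction
    ops: list = field(default_factory=list)   # 1 or 2 OP names


def decode_to_macroop(instr: str) -> MacroOp:
    """Map one two-operand instruction to a MacroOp of 1 or 2 OPs.
    Assumption: only a memory *source* operand adds a LOAD OP."""
    mnemonic, operands = instr.split(None, 1)
    _dest, src = [o.strip() for o in operands.split(",", 1)]
    ops = []
    if src.startswith("["):       # memory operand: load first
        ops.append("LOAD")
    ops.append(mnemonic.upper())  # then the FX operation itself
    return MacroOp(instr, ops)


def fits_in_one_cycle(macroops) -> bool:
    """Check the quoted limits: up to 3 MacroOps with up to
    3 FX OPs and 2 load/store OPs per cycle."""
    ls = sum(op == "LOAD" for m in macroops for op in m.ops)
    fx = sum(op != "LOAD" for m in macroops for op in m.ops)
    return len(macroops) <= 3 and fx <= 3 and ls <= 2


print(decode_to_macroop("ADD EAX, EBX").ops)       # ['ADD']
print(decode_to_macroop("AND EAX, [EBX+16]").ops)  # ['LOAD', 'AND']
```

Three register-memory instructions in the same cycle would need three load OPs, which exceeds the two load/store OPs the dual-ported data cache allows.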
4.3 Decoding into macroops (4)
4.3 Decoding into macroops (5)
Figure 4.10: The Microarchitecture of the Hammer
Source: Weber, F., „AMD’s Next Generation Microprocessor Architecture”, MPF. Oct. 2001
5 Using a trace cache (1)
Figure 5.1: The Microarchitecture of the Pentium 4 (Willamette)
5 Using a trace cache (2)
Figure 5.2: Basic misprediction pipeline of the Pentium 4 (Willamette)
Source: Hinton, G. et al., „The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal Q1, 2001
5 Using a trace cache (3)
Figure 5.3: The Microarchitecture of the Pentium 4 (Prescott)
Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.
6 Decoding with instruction grouping
6.1 Overview
Decoding with instruction grouping
• Grouping of RISC instructions: POWER4, POWER5
• Grouping of CISC instructions: Pentium M, Core arch.; K7 (Athlon), K8 (Hammer)
Operation of the Reorder Buffer (ROB)
[Figure: ROB lines with index 1 to 12, each with lane 0, lane 1 and lane 2. Legend: out-of-order finished instructions (results still speculative); instructions being retired now; retired instructions (no longer speculative).]
Figure 5.3: Instruction grouping in the K7 and K8
Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept.2003, http://www.chip-architect.com
Up to 3 MacroOps are decoded per cycle; these MacroOps are allocated one line in the ROB.
The ROB has 24 lines of 3 entries each. The ROB retires a line if it is the oldest one and all MacroOps in that line are completed.
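A minimal sketch of this line-based retirement rule, assuming one line is allocated per decode cycle and decode stalls when all 24 lines are in use. The class and method names are hypothetical, and MacroOps are represented by plain strings.

```python
from collections import deque

LINES, LANES = 24, 3  # ROB geometry quoted above


class LineROB:
    def __init__(self):
        self.lines = deque()  # oldest line sits at the left end
        self.next_id = 0

    def allocate(self, macroops):
        """Allocate one line for the MacroOps decoded in one cycle.
        Returns the line id, or None if the ROB is full."""
        if not 1 <= len(macroops) <= LANES:
            raise ValueError("a line holds 1 to 3 MacroOps")
        if len(self.lines) == LINES:
            return None
        line = {"id": self.next_id, "done": {m: False for m in macroops}}
        self.next_id += 1
        self.lines.append(line)
        return line["id"]

    def complete(self, line_id, macroop):
        """Mark one MacroOp as finished (may happen out of order)."""
        for line in self.lines:
            if line["id"] == line_id:
                line["done"][macroop] = True
                return
        raise KeyError(line_id)

    def retire(self):
        """Retire the oldest line only if all of its MacroOps are
        completed; return its MacroOps, otherwise None."""
        if self.lines and all(self.lines[0]["done"].values()):
            return list(self.lines.popleft()["done"])
        return None


rob = LineROB()
a = rob.allocate(["m0", "m1", "m2"])
b = rob.allocate(["m3", "m4"])
rob.complete(b, "m3")
rob.complete(b, "m4")
print(rob.retire())  # None: line b is finished but is not the oldest
```

Tracking completion per line rather than per MacroOp keeps retirement simple, at the price that one slow MacroOp holds back the other entries of its line.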
6.2 Grouping of RISC instructions (1)
Figure 6.1: Out of order execution of MacroOps from the FX schedulers in the K8L (to be introduced in Q2 2007)
(The K8L scheduler has 8*3 entries vs 6*3 in the K8)
Source: Malich, Y., „AMD's Next Generation Microarchitecture Preview: from K8 to K8L”, Aug. 2006.
6.2 Grouping of RISC instructions (2)
[Figure labels: decoders, schedulers, EUs]
Figure 6.1: The principle of instruction grouping in IBM’s POWER4 and POWER5 processors
6.2 Grouping of RISC instructions (3)
[Figure labels: instruction groups, issue queues, execution units (EUs), ROB, retire]
• Dispatch instruction groups in order, forward individual instructions to the issue queues
• Execute individual instructions out of order (ooo)
• Retire instruction groups in order, modify program state
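The three steps can be traced with a small model. A minimal sketch, assuming instructions are identified by unique names and that the out-of-order completion sequence is given as input; `retire_trace` is a hypothetical helper and does not model POWER4/POWER5 group formation rules.

```python
def retire_trace(groups, completion_order):
    """groups: lists of instruction names in dispatch (program) order.
    completion_order: order in which individual instructions finish.
    Returns (finished instruction, group indices retired at that
    point) pairs, retiring groups strictly in order."""
    done = set()
    oldest = 0  # index of the oldest group not yet retired
    trace = []
    for instr in completion_order:
        done.add(instr)
        retired = []
        # A group retires only when it is the oldest one and
        # every instruction in it has finished.
        while oldest < len(groups) and all(i in done for i in groups[oldest]):
            retired.append(oldest)
            oldest += 1
        trace.append((instr, retired))
    return trace


groups = [["a", "b"], ["c", "d"]]
for step in retire_trace(groups, ["c", "d", "b", "a"]):
    print(step)
# ('c', []), ('d', []), ('b', []), ('a', [0, 1])
```

Group 1 finishes first but has to wait: both groups retire only once the last instruction of group 0 completes.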
6.2 Grouping of RISC instructions (4)
Figure 6.2: Implementation of instruction grouping in IBM’s POWER 5 processor
Source: Sinharoy, B. et al., „POWER5 system microarchitecture”, IBM J. Res. & Dev., July/Sept. 2005.
6.3 Grouping of CISC instructions (1) (Intel: macro-op fusion)
x86 instructions: macro-ops
internal instructions: μops
Macro-op fusion: combines two macro-ops into a single μop.
Specifically: an x86 compare or test instruction is fused with the conditional jump that follows it to produce a single μop.
Any decoder can perform macro-op fusion, but only one macro-op fusion can be performed in each cycle.
In the Core architecture the max. decode bandwidth is 4+1 x86 instructions/cycle (four decoders, plus one extra instruction in a cycle where a pair is fused).
Macro-op fusion can reduce the number of μops by about 10%.
Introduced in the Core architecture
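A minimal sketch of one decode cycle with macro-op fusion. It assumes, for illustration only, that every x86 instruction decodes into exactly one μop and that any mnemonic starting with `j` other than `jmp` is a conditional jump; `decode_cycle` and its parameters are hypothetical names, not Intel's implementation.

```python
COMPARE = {"cmp", "test"}


def is_cond_jump(mnemonic: str) -> bool:
    # Simplification: jne, jz, jl, ... count; the unconditional jmp does not.
    return mnemonic.startswith("j") and mnemonic != "jmp"


def decode_cycle(instrs, width=4, max_fusions=1):
    """Decode from the front of `instrs` for one cycle.
    Returns (list of uops, number of x86 instructions consumed)."""
    uops, i, fusions = [], 0, 0
    while i < len(instrs) and len(uops) < width:
        m = instrs[i].split()[0].lower()
        nxt = instrs[i + 1].split()[0].lower() if i + 1 < len(instrs) else ""
        if m in COMPARE and is_cond_jump(nxt) and fusions < max_fusions:
            # Fuse the compare/test with the following jump into one uop.
            uops.append(instrs[i] + " + " + instrs[i + 1])
            fusions += 1
            i += 2
        else:
            uops.append(instrs[i])
            i += 1
    return uops, i


prog = ["mov eax, [esi]", "add eax, ebx", "cmp eax, 10", "jne loop", "inc ecx"]
uops, consumed = decode_cycle(prog)
print(len(uops), consumed)  # 4 5
```

With one fused pair in the stream, four μops cover five x86 instructions, which is the 4+1 bandwidth stated above. A second compare-and-jump pair in the same cycle is decoded as two separate μops.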
6.3 Grouping of CISC instructions (2)
Benefits:
• Fewer μops → increased performance
• ooo execution becomes more effective, as the instruction window now holds more (~10%) x86 instructions