Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007

Microarchitecture of Superscalars (4) Decoding

Dezső Sima

Fall 2007

(Ver. 2.0) Dezső Sima, 2007

Overview

1. Overview

2. Straightforward parallel decoding

3. Predecoding

4. Decoding with CISC/RISC conversion

4.1 Overview

4.2 Decoding into µops

4.3 Decoding into macroops

5. Using a trace cache

6. Decoding with instruction grouping

6.1 Overview

6.2 Grouping of RISC instructions

6.3 Grouping of CISC instructions

1. Overview

Decoding techniques used in superscalars:

• Straightforward parallel decoding (1. gen. RISC superscalars)

• Predecoding (beginning with 2. gen. superscalars)

• Decoding with CISC/RISC conversion (beginning with 2. gen. superscalar CISCs)

    • Decoding into µops: Intel, beginning with the Pentium Pro

    • Decoding into macroops (up to two µops): AMD, beginning with the K7

• Using a trace cache (P4-family)

• Decoding with instruction grouping

    • Grouping of RISC instructions: POWER4, POWER5

    • Grouping of CISC instructions: Pentium M, Core

    • Further examples: K7 (Athlon), K8 (Hammer)

2 Straightforward parallel decoding

Figure 2.1: The PowerPC 601’s front end

Source: Stokes, J.H., „PowerPC on Apple: An architecture history”, Aug. 2004. http://arstechnica.com/articles

3 Predecoding (1)

Figure 3.1: Contrasting the decoding and instruction issues in a scalar and a 4-way superscalar processor

[Figure: in both cases instructions pass from the Icache through the instruction buffer to a Decode / Issue / Check stage. Typical FX-pipeline layout with scalar issue: F, D/I, ...; with superscalar issue: F, D, I, ...]

3 Predecoding (1)

Figure 3.2: The principle of predecoding

[Figure: Second-level cache (or memory), predecode unit, I-cache. Typically 128 bits/cycle enter the predecode unit; e.g. 148 bits/cycle are written into the I-cache.]

When instructions are written into the I-cache, the predecode unit of a RISC processor appends 4-7 bits to each instruction.

AMD’s CISC processors append n bits to each byte (K5, K6: 5 bits/byte; K7, K8: 3 bits/byte).

Source: Sima, D. et al., „ACA”, Addison-Wesley 1997
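The bit counts quoted for Figure 3.2 can be checked with a little arithmetic. The sketch below assumes 32-bit RISC instructions, so that 128 bits/cycle correspond to four instructions, and picks 5 predecode bits per instruction (inside the quoted 4-7 range) because that value reproduces the 148 bits/cycle example. The function names and the 16-byte x86 block are illustrative assumptions, not taken from the slides.

```python
def risc_predecoded_bits(fetch_bits, instr_bits, predecode_bits_per_instr):
    """Width of one fetch block after per-instruction predecode bits are appended."""
    instructions = fetch_bits // instr_bits
    return instructions * (instr_bits + predecode_bits_per_instr)


def cisc_predecoded_bits(fetch_bytes, predecode_bits_per_byte):
    """Width of a block of x86 code bytes after per-byte predecode bits are appended."""
    return fetch_bytes * (8 + predecode_bits_per_byte)


# Assumption: 32-bit RISC instructions with 5 predecode bits each.
print(risc_predecoded_bits(128, 32, 5))   # 4 * (32 + 5) = 148

# Per-byte overheads from the slide, applied to a hypothetical 16-byte block.
print(cisc_predecoded_bits(16, 5))        # K5, K6: 16 * (8 + 5) = 208
print(cisc_predecoded_bits(16, 3))        # K7, K8: 16 * (8 + 3) = 176
```

Per-byte tagging costs more storage than per-instruction tagging, which fits the lower 3 bits/byte figure quoted for the K7 and K8.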

3 Predecoding (2)

Figure 3.3: The introduction of predecoding

Source: Sima, D. et al., „ACA”, Addison-Wesley 1997

3 Predecoding (3)

Figure 3.4: Variable length instruction decoding in the Athlon

Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept. 2003, http://www.chip-architect.com

3 Predecoding (4)

Figure 3.5: Opteron’s instruction cache and decoding

Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept. 2003, http://www.chip-architect.com

4 Decoding with CISC/RISC conversion

4.1 Overview

Figure 4.1: Principle of decoding with CISC/RISC conversion

[Figure: CISC instructions, decoding with CISC/RISC conversion, RISC core, retiring with RISC/CISC conversion. The program state is modified after RISC/CISC re-conversion. Examples: PPro, K6. Internal operations: µops, macroops.]

Source: Sima, D. et al., „ACA”, Addison-Wesley 1997

4.2 Decoding into µops (1)

Figure 4.2: The Microarchitecture of the Pentium Pro

Source: Shanley, T., „Pentium Pro Processor System Architecture”, Addison-Wesley Press, 1997

4.2 Decoding into µops (2)

Figure 4.3: Basic misprediction pipeline of the Pentium III

Source: Hinton, G. et al., „The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal Q1, 2001

4.2 Decoding into µops (3)

Figure 4.4: Decoding in AMD’s K6

Source: Shriver, B., Smith, B., „The Anatomy of a High-Performance Microprocessor”, IEEE Computer Society Press, 1998

4.2 Decoding into µops (4)

Figure 4.5: The Microarchitecture of the Pentium M (Yonah)

Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.

4.2 Decoding into µops (5)

Figure 4.6: The Microarchitecture of the Core processor family

Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.

4.3 Decoding into macroops (1)

Figure 4.7: The Microarchitecture of the AMD Athlon

Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998

4.3 Decoding into macroops (2)

Figure 4.8: Decoding in the Athlon (1)

Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998

4.3 Decoding into macroops (3)

Figure 4.9: Decoding in the Athlon (2)

Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998

4.3 Decoding into macroops (4)

Each MacroOp: 1 or 2 operations (OPs)

e.g.:

ADD EAX, EBX: 1 ADD OP

AND EAX, [EBX+16]: 1 LOAD OP + 1 AND OP

Up to 3 MacroOps per cycle with up to 3 FX + 2 L/S OPs (dual ported D$!) per cycle
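The two examples on this slide can be mimicked with a toy decoder. This is only a sketch under stated assumptions: the `MacroOp` class and `decode_to_macroop` function are hypothetical names, an operand in square brackets is taken to be a memory operand, and only the register/register and register/memory-source forms shown on the slide are modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MacroOp:
    text: str        # the x86 instruction this MacroOp was decoded from
    ops: List[str]   # 1 or 2 operations (OPs)


def decode_to_macroop(mnemonic: str, dest: str, src: str) -> MacroOp:
    """Toy decoder: a memory source operand adds a LOAD OP before the ALU OP."""
    ops = []
    if src.startswith("["):          # assumption: brackets mark a memory operand
        ops.append("LOAD")
    ops.append(mnemonic)
    return MacroOp(f"{mnemonic} {dest}, {src}", ops)


for args in [("ADD", "EAX", "EBX"), ("AND", "EAX", "[EBX+16]")]:
    m = decode_to_macroop(*args)
    print(m.text, "->", " + ".join(op + " OP" for op in m.ops))
```

Keeping the load and the ALU operation inside one MacroOp means the pair occupies a single decode slot, which is how 3 MacroOps per cycle can carry more than 3 OPs.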

4.3 Decoding into macroops (5)

Figure 4.10: The Microarchitecture of the Hammer

Source: Weber, F., „AMD’s Next Generation Microprocessor Architecture”, MPF. Oct. 2001

5 Using a trace cache (1)

Figure 5.1: The Microarchitecture of the Pentium 4 (Willamette)

5 Using a trace cache (2)

Figure 5.2: Basic misprediction pipeline of the Pentium 4 (Willamette)

Source: Hinton, G. et al., „The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal Q1, 2001

5 Using a trace cache (3)

Figure 5.3: The Microarchitecture of the Pentium 4 (Prescott)

Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.

6. Decoding with instruction grouping

6.1 Overview

Decoding with instruction grouping:

• Grouping of RISC instructions: POWER4, POWER5

• Grouping of CISC instructions: Pentium M, Core arch.

• Further examples: K7 (Athlon), K8 (Hammer)

6.2 Grouping of RISC instructions (1)

Operation of the Reorder Buffer (ROB)

[Figure: ROB grid with index columns 1 to 12 and rows lane 0, lane 1, lane 2. The legend distinguishes instructions that finished out of order with results still speculative, instructions being retired now, and retired instructions that are no longer speculative.]

Figure 5.3: Instruction grouping in the K7 and K8

Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept. 2003, http://www.chip-architect.com

Up to 3 MacroOps are decoded per cycle; these MacroOps are allocated one line in the ROB.

The ROB has 24 lines of 3 entries each. The ROB retires a line if it is the oldest one and all MacroOps in that line are completed.
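The line-based allocation and retirement rule can be sketched as a small model. The geometry (24 lines of 3 entries) comes from the slide; the class name `LineRob`, its methods, and the use of unique string names for MacroOps are assumptions made for illustration and do not describe the actual hardware structures.

```python
from collections import deque

LINES, ENTRIES_PER_LINE = 24, 3   # ROB geometry quoted on the slide


class LineRob:
    """Toy reorder buffer that allocates and retires whole lines of MacroOps."""

    def __init__(self):
        self.lines = deque()      # oldest line at the left

    def allocate(self, macroops):
        """Allocate one line for up to 3 MacroOps decoded in the same cycle."""
        if not 1 <= len(macroops) <= ENTRIES_PER_LINE:
            raise ValueError("a line holds 1 to 3 MacroOps")
        if len(self.lines) == LINES:
            return False          # ROB full: decode would have to stall
        self.lines.append({name: False for name in macroops})
        return True

    def complete(self, name):
        """Mark a MacroOp as finished; this may happen out of order."""
        for line in self.lines:
            if name in line:
                line[name] = True
                return
        raise KeyError(name)

    def retire(self):
        """Retire the oldest line only if all of its MacroOps are completed."""
        if self.lines and all(self.lines[0].values()):
            return list(self.lines.popleft())
        return []


rob = LineRob()
rob.allocate(["m0", "m1", "m2"])
rob.allocate(["m3", "m4"])
for name in ["m3", "m4", "m0", "m1"]:   # the younger line finishes first
    rob.complete(name)
print(rob.retire())    # [] because m2 in the oldest line is still pending
rob.complete("m2")
print(rob.retire())    # ['m0', 'm1', 'm2']
print(rob.retire())    # ['m3', 'm4']
```

Tracking completion per line rather than per instruction keeps retirement in program order even though the MacroOps themselves finish out of order.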

6.2 Grouping of RISC instructions (2)

Figure 6.1: Out of order execution of MacroOps from the FX schedulers in the K8L (to be introduced in Q2 2007)

[Figure: Decoders, Schedulers, EUs]

(The K8L scheduler has 8*3 entries vs. 6*3 in the K8)

Source: Malich, Y., „AMD's Next Generation Microarchitecture Preview: from K8 to K8L”, Aug. 2006.

6.2 Grouping of RISC instructions (3)

Figure 6.1: The principle of instruction grouping in IBM’s POWER4 and POWER5 processors

[Figure: instruction groups, issue queues, execution units (EU), ROB, retire]

• Dispatch instruction groups in-order, forward individual instructions to the issue queues

• Execute individual instructions out of order (ooo)

• Retire instruction groups in-order, modify program state
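The effect of retiring whole groups in order, while their instructions execute out of order, can be illustrated with a short calculation. This is a sketch under simplifying assumptions: the finish cycles are made-up inputs, the function name `group_retire_cycles` is hypothetical, and any limit on how many groups may retire per cycle is ignored.

```python
def group_retire_cycles(groups):
    """groups: list of groups in program order; each group is a list of the
    cycles in which its individual instructions finish executing (any order).
    Returns the earliest cycle in which each group can retire."""
    retire_cycles = []
    previous = 0
    for finish_cycles in groups:
        # A group is complete when its slowest instruction has finished ...
        ready = max(finish_cycles)
        # ... and it may not retire before the group ahead of it.
        previous = max(ready, previous)
        retire_cycles.append(previous)
    return retire_cycles


# Hypothetical finish cycles: the second group finishes early but must wait.
print(group_retire_cycles([[5, 3, 9], [4, 6], [12, 10]]))   # [9, 9, 12]
```

The second group is ready in cycle 6 yet retires in cycle 9, since the program state may only be modified in group order.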

6.2 Grouping of RISC instructions (4)

Figure 6.2: Implementation of instruction grouping in IBM’s POWER 5 processor

Source: Sinharoy, B. et al., „POWER5 system microarchitecture”, IBM J. Res. & Dev., July/Sept. 2005.

6.3 Grouping of CISC instructions (1) (Intel: macro-op fusion)

x86 instructions: macro-ops

Internal instructions: μops

Macro-op fusion: combines two macro-ops into a single μop.

Specifically: x86 compare or test instructions are fused with x86 jumps to produce a single μop.

Any decoder can perform macro-op fusion, but only one macro-op fusion can be performed in each cycle.

In the Core architecture the max. decode bandwidth is 4+1 x86 instructions/cycle.

Macro-op fusion can reduce the number of μops by about 10%.

Introduced in the Core architecture
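The fusion rule and the "4+1" decode bandwidth can be illustrated with a toy decode loop. This is a sketch, not a description of the real decoders: it assumes four decode slots per cycle, treats every x86 instruction as producing exactly one μop, regards any `J*` mnemonic other than `JMP` as a conditional jump, and uses hypothetical names throughout.

```python
FUSABLE_FIRST = {"CMP", "TEST"}
DECODERS = 4   # assumption: four decode slots, matching the "4+1" figure


def is_cond_jump(mnemonic):
    # Assumption: any J* mnemonic other than JMP is a conditional jump.
    return mnemonic.startswith("J") and mnemonic != "JMP"


def decode(stream):
    """Return a list of decode cycles; each cycle is a list of uops, and a
    fused uop is written as 'CMP+JNE'. Every x86 instruction is assumed to
    produce exactly one uop, which is a simplification."""
    cycles, i = [], 0
    while i < len(stream):
        uops, fused_this_cycle = [], False
        while i < len(stream) and len(uops) < DECODERS:
            pair = (i + 1 < len(stream)
                    and stream[i] in FUSABLE_FIRST
                    and is_cond_jump(stream[i + 1]))
            if pair and not fused_this_cycle:
                # At most one macro-op fusion per cycle.
                uops.append(stream[i] + "+" + stream[i + 1])
                fused_this_cycle = True
                i += 2
            else:
                uops.append(stream[i])
                i += 1
        cycles.append(uops)
    return cycles


code = ["MOV", "CMP", "JNE", "ADD", "TEST", "JZ", "SUB"]
for n, uops in enumerate(decode(code)):
    print(n, uops)
```

In this example the first cycle consumes five x86 instructions through four slots, and the second compare/jump pair stays unfused because one fusion has already been performed in that cycle.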

6.3 Grouping of CISC instructions (2)

Benefits:

• Fewer μops, and thus increased performance

• Out-of-order (ooo) execution becomes more effective, as the instruction window now holds about 10% more x86 instructions