
Page 1: Procesadores Superescalares

Prof. Mateo Valero

Procesadores Superescalares

Las Palmas de Gran Canaria

November 26, 1999

Page 2: Procesadores Superescalares

M. Valero 2

Initial developments

• Mechanical machines

• 1854: Boolean algebra by G. Boole

• 1904: Diode vacuum tube by J.A. Fleming

• 1946: ENIAC by J.P. Eckert and J. Mauchly

• 1945: Stored-program concept by J. von Neumann

• 1949: EDSAC by M. Wilkes

• 1952: UNIVAC I and IBM 701

Page 3: Procesadores Superescalares

M. Valero 3

ENIAC 1946

Page 4: Procesadores Superescalares

M. Valero 4

EDSAC 1949

Page 5: Procesadores Superescalares

M. Valero 5

Pipeline

Page 6: Procesadores Superescalares

M. Valero 6

Superscalar Processor

[Pipeline diagram: Fetch, Decode, Rename, Instruction Window, Wakeup+Select, Register File, Bypass, Data Cache.]

Fetch of multiple instructions every cycle.
Rename of registers to eliminate added dependencies.
Instructions wait for source operands and for functional units.
Out-of-order execution, but in-order graduation.

Scalable Pipes
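A minimal sketch in Python (all names hypothetical, not from the slides) of the "out-of-order execution, in-order graduation" rule: results may complete in any order, but instructions leave the reorder buffer strictly in program order.

    from collections import deque

    class RobEntry:
        def __init__(self, tag):
            self.tag = tag          # program-order id of the instruction
            self.done = False       # set when execution finishes

    rob = deque()                   # reorder buffer, oldest instruction at the left

    def dispatch(tag):
        e = RobEntry(tag)
        rob.append(e)               # allocated in program order
        return e

    def complete(entry):
        entry.done = True           # execution may finish out of order

    def graduate():
        retired = []
        while rob and rob[0].done:  # retire only from the head: in-order graduation
            retired.append(rob.popleft().tag)
        return retired

    i0, i1, i2 = dispatch(0), dispatch(1), dispatch(2)
    complete(i2); complete(i0)
    print(graduate())               # [0]    -- 1 is not done, so 2 must wait
    complete(i1)
    print(graduate())               # [1, 2]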

Page 7: Procesadores Superescalares

M. Valero 7

Technology Trends and Impact

[Chart: delay in picoseconds for 0.80, 0.35 and 0.18 micron technologies, with issue widths of 4 and 8 and ROB sizes of 32 and 64; delays range up to about 3500 ps.]

S. Palacharla et al., "Complexity Effective…", ISCA 1997, Denver.

Page 8: Procesadores Superescalares

M. Valero 8

Physical Scalability

[Chart: percentage of the die reachable in 1, 2, 4, 8 and 16 clocks versus processor generation (0.25, 0.18, 0.13, 0.1, 0.08 and 0.06 microns).]

Doug Matzke, "Will Physical Scalability…", IEEE Computer, Sept. 1997, pp. 37-39.

Page 9: Procesadores Superescalares

M. Valero 9

Register influence on ILP

• Spec95

[Chart: IPC versus register file size (48 to 256 registers) for integer and floating-point codes.]

Configuration: 8-way fetch/issue, window of 256 entries, up to 1 taken branch per cycle, g-share with 64K entries, one-cycle latency.

Page 10: Procesadores Superescalares

M. Valero 10

Register File Latency

– 66% and 20% performance improvement when moving from 2- to 1-cycle latency

[Charts: IPC with a 1-cycle vs. a 2-cycle register file. SpecFP95: applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5, Hmean. SpecInt95: compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, Hmean.]

Page 11: Procesadores Superescalares

M. Valero 11

Outline

• Virtual-physical registers
• A register file cache
• VLIW architectures

Page 12: Procesadores Superescalares

M. Valero 12

Virtual-Physical Registers

• Motivation

– Conventional renaming scheme

– Virtual-Physical Registers

[Timeline figure (I-cache → Decode&Rename → Commit): intervals during which the register is unused vs. used, under the conventional renaming scheme and with Virtual-Physical Registers.]

Page 13: Procesadores Superescalares

M. Valero 13

Example

  load f2, 0(r4)              load p1, 0(r4)
  fdiv f2, f2, f10   rename   fdiv p2, p1, p10
  fmul f2, f2, f12     →      fmul p3, p2, p12
  fadd f2, f2, 1              fadd p4, p3, 1

Latencies: cache miss 20, fdiv 20, fmul 10, fadd 5 cycles.

– Register pressure: average registers per cycle

[Timeline: cycles 0 to 55 during which p1-p4 remain allocated, under the conventional scheme and with Virtual-Physical Registers.]

Conventional: 3.6
Virtual-Physical: 0.7
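As a reference point for the example above, a minimal sketch in Python (all names hypothetical) of conventional renaming: every destination takes a free physical register already at decode, which is why registers sit allocated for the whole latency of long operations.

    free_list = ["p1", "p2", "p3", "p4", "p5"]              # free physical registers
    map_table = {"f2": "p0", "f10": "p10", "f12": "p12"}    # logical -> physical

    def rename(op, dst, srcs):
        srcs = [map_table[s] if s in map_table else s for s in srcs]
        preg = free_list.pop(0)      # allocated at decode time, long before write-back
        map_table[dst] = preg        # later uses of dst read this mapping
        return (op, preg, srcs)

    code = [("load", "f2", ["0(r4)"]),
            ("fdiv", "f2", ["f2", "f10"]),
            ("fmul", "f2", ["f2", "f12"]),
            ("fadd", "f2", ["f2", "1"])]

    for op, dst, srcs in code:
        print(rename(op, dst, srcs))
    # ('load', 'p1', ['0(r4)']), ('fdiv', 'p2', ['p1', 'p10']),
    # ('fmul', 'p3', ['p2', 'p12']), ('fadd', 'p4', ['p3', '1'])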

Page 14: Procesadores Superescalares

M. Valero 14

Percentage of Used/Wasted Registers

[Two charts: breakdown of registers into used and wasted.]

Page 15: Procesadores Superescalares

M. Valero 15

Virtual-Physical Registers

• Physical registers play two different roles
  – Keep track of dependences (decode)
  – Provide a storage location for results (write-back)
• Proposal: three types of registers
  – Logical: architected registers
  – Virtual-Physical (VP): keep track of dependences
  – Physical: store values
• Approach
  – Decode: rename from logical to VP
  – Write-back (or issue): rename from VP to physical
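A minimal sketch in Python (all names hypothetical) of the two-step approach: decode maps the logical destination to a virtual-physical register, which tracks the dependence but consumes no storage; a physical register is bound only at write-back, when the value is actually produced.

    vp_counter = 0
    vp_map = {}          # logical -> virtual-physical (assigned at decode)
    phys_map = {}        # virtual-physical -> physical (assigned at write-back)
    free_physical = ["P0", "P1", "P2"]

    def decode(logical_dst):
        # dependence tracking only: no physical register is consumed here
        global vp_counter
        vp = f"VP{vp_counter}"
        vp_counter += 1
        vp_map[logical_dst] = vp
        return vp

    def write_back(vp, value):
        # storage is allocated only when the result arrives
        preg = free_physical.pop(0)
        phys_map[vp] = (preg, value)
        return preg

    vp = decode("f2")             # e.g. the load's destination
    # ... many cycles later, when the cache miss resolves ...
    print(write_back(vp, 3.14))   # 'P0' is occupied only from now until commit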

Page 16: Procesadores Superescalares

M. Valero 16

Virtual-Physical Registers

• Hardware support

[Block diagram of the hardware support across the Fetch, Decode, Issue, Execute, Write-back and Commit stages: instruction-queue entries hold the destination VP register and the source registers (Src1/R1, Src2/R2); the ROB holds the logical register and its VP register; a General Map Table (VP, Preg, V, Lreg) and a Physical Map Table (Preg) implement the two renaming steps.]

Page 17: Procesadores Superescalares

M. Valero 17

Virtual-Physical Registers

• No free physical register at write-back
  – Re-execute later… but what if it is the oldest instruction?
  – Avoiding deadlock
    • A number of registers (NRR) is reserved for the oldest instructions
• 21% speedup for Spec95 on an 8-way issue processor [HPCA-4]
• Conclusions
  – Optimal NRR is different for each program
  – For a given program, the best NRR may be different for different sections of code
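One plausible form of the reservation check, as a minimal Python sketch (names and the exact rule are assumptions, not taken from the slides): when few registers remain free, only the NRR oldest in-flight instructions may allocate, so the oldest instruction can always make progress and deadlock is avoided.

    NRR = 4  # number of physical registers reserved for the oldest instructions

    def may_allocate(free_regs, rob_position):
        """rob_position: 0 for the oldest in-flight instruction, 1 for the next, ..."""
        if free_regs > NRR:
            return True                      # plenty of registers: anyone may allocate
        # only the NRR oldest instructions may dip into the reserved pool
        return rob_position < NRR and free_regs > 0

    print(may_allocate(free_regs=10, rob_position=50))  # True
    print(may_allocate(free_regs=3,  rob_position=50))  # False: wait and retry later
    print(may_allocate(free_regs=3,  rob_position=1))   # True: the oldest keep progressing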

Page 18: Procesadores Superescalares

M. Valero 18

Virtual-Physical Registers

– Performance evaluation
  • SimpleScalar OoO with modified renaming
  • 8-way issue
  • RUU: 128 entries
  • Functional units (latency)
    » 8 simple integer (1)
    » 4 integer multiply (7)
    » 6 simple FP (4)
    » 4 FP multiply (4)
    » 4 FP divide (16)
    » 4 memory ports
  • L1 Dcache: 32 KB, 2-way, 32 B/line, 1 cycle
  • L1 Icache: 32 KB, 2-way, 64 B/line, 1 cycle
  • L2 cache: 1 MB, 2-way, 64 B/line, 12 cycles
  • Main memory: 50 cycles
  • Branch prediction: 18-bit gshare, up to 2 taken branches
  • Benchmarks: SPEC95, Compaq/DEC compilers, -O5

Page 19: Procesadores Superescalares

M. Valero 19

Virtual-Physical Registers

– Performance evaluation

[Chart: percentage speedup with 64 registers; per-program speedups range from about 0% to 42%.]

Speedup for 64 registers

Page 20: Procesadores Superescalares

M. Valero 20

IPC and NRR

[Chart: IPC versus NRR (1, 4, 8, 16, 24 and 36 reserved registers) for li and applu.]

Page 21: Procesadores Superescalares

M. Valero 21

Virtual-Physical Registers

• What is the optimal allocation policy?
  – Approximation
    • Registers should be allocated to the instructions that can use them earliest (avoid unused registers)
    • If some instruction has to stall because of the lack of registers, choose the latest instructions (delaying the earliest would also delay the commit of the latest)
  – Implementation
    • Each instruction allocates a physical register at write-back; if none is available, it steals a register from the latest instruction after the current one (see the sketch below)
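A minimal sketch in Python (all names hypothetical) of this allocation policy: at write-back the instruction takes a free physical register if one exists, otherwise it steals one from the latest (youngest) instruction after itself that currently holds a register.

    def allocate_at_writeback(inst, free_regs, holders):
        """
        inst:      program-order id of the instruction writing back
        free_regs: list of free physical registers
        holders:   {program-order id: physical register} for uncommitted values
        """
        if free_regs:
            return free_regs.pop()
        younger = [i for i in holders if i > inst]
        if not younger:
            return None                      # nothing to steal: must wait
        victim = max(younger)                # the latest instruction after the current one
        return holders.pop(victim)           # victim will have to re-acquire a register

    holders = {12: "P7", 20: "P3", 25: "P9"}
    print(allocate_at_writeback(inst=15, free_regs=[], holders=holders))  # 'P9' (stolen from inst 25)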

Page 22: Procesadores Superescalares

M. Valero 22

DSY Performance

[Charts: IPC for conventional, vp-original and vp-dsy renaming. SpecInt95: compress, gcc, go, li, perl, Hmean. SpecFP95: mgrid, tomcatv, applu, swim, hydro2d, Hmean.]

Page 23: Procesadores Superescalares

M. Valero 23

Performance and Number of Registers

[Charts: IPC versus number of physical registers (48 to 160) for conventional, vp-original and vp-dsy, for SpecInt95 and SpecFP95.]

Page 24: Procesadores Superescalares

M. Valero 24

Outline

• Virtual-physical registers
• A register file cache
• VLIW architectures

Page 25: Procesadores Superescalares

M. Valero 25

Register Requirements

[Charts: register requirements for SpecInt95 and SpecFP95 — percentage (0 to 100) versus number of registers (0 to 32); series: "Value & Instruction" and "Value & Ready Instruction".]

Page 26: Procesadores Superescalares

M. Valero 26

Register File Latency

– 66% and 20% performance improvement when moving from 2- to 1-cycle latency

[Charts: IPC with a 1-cycle vs. a 2-cycle register file. SpecFP95: applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5, Hmean. SpecInt95: compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, Hmean.]

Page 27: Procesadores Superescalares

M. Valero 27

Register File Bypass

[Chart: SpecInt95 IPC for 1-cycle with 1 bypass level, 2-cycle with 2 bypass levels, and 2-cycle with 1 bypass level.]

Page 28: Procesadores Superescalares

M. Valero 28

Register File Bypass

[Chart: SpecFP95 IPC (applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5, Hmean) for 1-cycle with 1 bypass level, 2-cycle with 2 bypass levels, and 2-cycle with 1 bypass level.]

Page 29: Procesadores Superescalares

M. Valero 29

Register File Cache

• Organization
  – Bank 1 (Register File, RF)
    • All registers (128)
    • 2-cycle latency
  – Bank 2 (Register File Cache, RFC)
    • A subset of the registers (16)
    • 1-cycle latency

[Figure: RF and RFC banks.]
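A minimal sketch in Python (all names hypothetical) of the read path of this two-level organization: a hit in the small RFC returns the value with 1-cycle latency, otherwise the value comes from the full register file in 2 cycles.

    class RegisterFileCache:
        def __init__(self, rfc_size=16):
            self.rf = {}                 # bank 1: all 128 physical registers (2 cycles)
            self.rfc = {}                # bank 2: small cached subset (1 cycle)
            self.rfc_size = rfc_size

        def write(self, reg, value, cache_it):
            self.rf[reg] = value
            if cache_it:                 # a caching policy decides what goes in the RFC
                if len(self.rfc) >= self.rfc_size:
                    self.rfc.pop(next(iter(self.rfc)))   # evict the oldest cached value
                self.rfc[reg] = value

        def read(self, reg):
            if reg in self.rfc:
                return self.rfc[reg], 1  # (value, latency in cycles)
            return self.rf[reg], 2

    regs = RegisterFileCache()
    regs.write("P5", 42, cache_it=True)
    regs.write("P6", 7, cache_it=False)
    print(regs.read("P5"))   # (42, 1)  -- RFC hit
    print(regs.read("P6"))   # (7, 2)   -- served by the main register file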

Page 30: Procesadores Superescalares

M. Valero 30

Experimental Framework

– OoO simulator
  • 8-way issue/commit
  • Functional units (latency)
    – 2 simple integer (1)
    – 3 complex integer: mult. (2), div. (14)
    – 4 simple FP (2)
    – 2 FP div. (14)
    – 3 branch (1)
    – 4 load/store
  • 128-entry ROB
  • 16-bit gshare
  • Icache and Dcache
    – 64 KB, 2-way set-associative
    – 1/8-cycle hit/miss
    – Dcache: lock-up free, 16 outstanding misses
– Benchmarks
  • Spec95
  • DEC compiler, -O4 (int.), -O5 (FP)
  • 100 million instructions after initializations
– Access time and area models
  • Extension to the Wilton & Jouppi models

Page 31: Procesadores Superescalares

M. Valero 31

Caching Policy (1 of 3)

• First policy
  – Many values (85% integer and 84% FP) are used at most once
  – Thus, only non-bypassed values are cached
  – FIFO replacement (see the sketch below)

[Figure: RF and RFC banks.]
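A minimal sketch in Python (all names hypothetical) of the first policy's fill decision: a result that was already forwarded over the bypass network to all its consumers is not written into the RFC; everything else is, with FIFO replacement.

    from collections import OrderedDict

    RFC_SIZE = 16
    rfc = OrderedDict()   # insertion order gives FIFO replacement for free

    def on_writeback(reg, value, consumers_bypassed):
        """consumers_bypassed: True if every waiting consumer already received the
        value through the bypass network (most values are used at most once)."""
        if consumers_bypassed:
            return                      # policy 1: do not cache bypassed values
        if len(rfc) >= RFC_SIZE:
            rfc.popitem(last=False)     # FIFO: evict the oldest cached value
        rfc[reg] = value

    on_writeback("P3", 10, consumers_bypassed=True)   # not cached
    on_writeback("P4", 11, consumers_bypassed=False)  # cached
    print(list(rfc))                                  # ['P4']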

Page 32: Procesadores Superescalares

M. Valero 32

Performance

– 20% and 4% improvement over 2-cycle

– 29% and 13% degradation over 1-cycle

[Charts: IPC for the 1-cycle, RFC.1 and 2-cycle configurations; SpecInt95 (compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, Hmean) and SpecFP95 (applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5, Hmean).]

Page 33: Procesadores Superescalares

M. Valero 33

Caching Policy (2 of 3)

• Second policy
  – Cache values that are sources of any non-issued instruction with all its operands ready
    • Not issued because of lack of functional units
    • or because the other operand is in the main register file

[Figure: RF and RFC banks.]

Page 34: Procesadores Superescalares

M. Valero 34

Performance

– 24% and 5% improvement over 2-cycle

– 25% and 12% degradation over 1-cycle

[Charts: IPC for the 1-cycle, RFC.2 and 2-cycle configurations; SpecInt95 (compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, Amean, Hmean) and SpecFP95 (applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5, Hmean).]

Page 35: Procesadores Superescalares

M. Valero 35

Caching Policy (3 of 3)

• Third policy
  – Cache values that are sources of any non-issued instruction with all its operands ready
  – Prefetching
    • A table that, for each physical register, indicates which is the other operand of the first instruction that uses it
  – Replacement: give priority to those values already read at least once (see the sketch below)
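A minimal sketch in Python (all names hypothetical) of the prefetch part of the third policy: a small table remembers, for each physical register, the other source operand of the first instruction that reads it, so when the register is written the partner value can be brought into the RFC ahead of time.

    # partner_of[p] = the other source operand of the first instruction that uses p
    partner_of = {}

    def on_rename(dst, src_a, src_b):
        # record, for each source, the register it will be paired with (first use only)
        partner_of.setdefault(src_a, src_b)
        partner_of.setdefault(src_b, src_a)

    def on_writeback(reg, rfc, register_file):
        partner = partner_of.get(reg)
        if partner is not None and partner in register_file:
            rfc[partner] = register_file[partner]   # prefetch the partner operand

    register_file = {"P1": 5, "P2": 9}
    rfc = {}
    on_rename("P3", "P1", "P2")      # the first consumer of P1 also needs P2 (and vice versa)
    on_writeback("P1", rfc, register_file)
    print(rfc)                       # {'P2': 9}: ready in the RFC before the consumer issues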

Page 36: Procesadores Superescalares

M. Valero 36

Performance

– 27% and 7% improvement over 2-cycle

– 24% and 11% degradation over 1-cycle

[Charts: IPC for the 1-cycle, RFC.3 and 2-cycle configurations; SpecInt95 (compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, Hmean) and SpecFP95 (applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5, Hmean).]

Page 37: Procesadores Superescalares

M. Valero 37

Speed for Different RFC Architectures

[Chart: SpecInt95 performance for configurations C1-C4 with 1-cycle, 2-cycle plus one bypass level, and non-bypass caching plus prefetch-first-pair; access time is taken into account.]

Page 38: Procesadores Superescalares

M. Valero 38

Speed for Different RFC Architectures

[Chart: SpecFP95 performance for configurations C1-C4 with 1-cycle, 2-cycle plus one bypass level, and non-bypass caching plus prefetch-first-pair.]

Page 39: Procesadores Superescalares

M. Valero 39

Conclusions

– Register file access time is critical
– Virtual-physical registers significantly reduce the register pressure
  • 24% improvement for SpecFP95
– A register file cache can reduce the average access time
  • 27% and 7% improvement for a two-level, locality-based partitioning architecture

Page 40: Procesadores Superescalares

High performance instruction fetch through a software/hardware cooperation

Alex Ramirez

Josep Ll. Larriba-Pey

Mateo Valero

UPC-Barcelona

Page 41: Procesadores Superescalares

M. Valero 41

Superscalar Processor

[Pipeline diagram: Fetch, Decode, Rename, Instruction Window, Wakeup+Select, Register File, Bypass, Data Cache.]

Fetch of multiple instructions every cycle.

Rename of registers to eliminate added dependencies.

Instructions wait for source operands and for functional units.

Out-of-order execution, but in-order graduation.

J.E. Smith and S. Vajapeyam, "Trace Processors…", IEEE Computer, Sept. 1997, pp. 68-74.

Page 42: Procesadores Superescalares

M. Valero 42

Motivation

• Instruction fetch rate is important not only in steady state
  – Program start-up
  – Misspeculation points
  – Program segments with little ILP

[Figure: Instruction Fetch & Decode feeding the Instruction Queue(s), which feed Instruction Execution; branch/jump outcomes are fed back to the fetch stage.]

Page 43: Procesadores Superescalares

M. Valero 43

Motivation

• Instruction fetch effectively limits the performance of superscalar processors
  – Even more relevant at program startup points
• More aggressive processors need higher fetch bandwidth
  – Multiple basic block fetching becomes necessary
• Current solutions need extensive additional hardware
  – Branch address cache
  – Collapsing buffer: multi-ported cache
  – Trace cache: special-purpose cache

Page 44: Procesadores Superescalares

M. Valero 44

PostgreSQL

[Chart: Postgres results for configurations 32KB, 64KB, F4, F8, F16, PBr, Pic, Bw4, Bw8, Bw16, PF- and PF4; 64KB I1, 64KB D1, 256KB L2.]

Page 45: Procesadores Superescalares

M. Valero 45

Program Behaviour

[Chart: results for Postgres, Gcc and Vortex across configurations 32KB, 64KB, F4, F8, F16, PBr, Pic, Bw4, Bw8, Bw16, PF- and PF4; 64KB I1, 64KB D1, 256KB L2.]

Page 46: Procesadores Superescalares

M. Valero 46

The Fetch Unit (1 of 3)

[Figure: scalar fetch unit. The fetch address indexes the instruction cache (i-cache) and the branch prediction mechanism; shift & mask logic and the next-address logic produce the instructions sent to decode and the next fetch address.]

• Scalar fetch unit
  – Few instructions per cycle
  – 1 branch
• Limitations
  – Prediction accuracy
  – I-cache miss rate
• Previous work, code reordering (software, to reduce cache misses)
  – Fisher (IEEE Tr. on Comp. '81)
  – Hwu and Chang (ISCA'89)
  – Pettis and Hansen (Sigplan'90)
  – Torrellas et al. (HPCA'95)
  – Kalamatianos et al. (HPCA'98)

Page 47: Procesadores Superescalares

M. Valero 47

The Fetch Unit (2 of 3)

[Figure: aggressive core fetch unit. The fetch address indexes the instruction cache (i-cache), the branch target buffer, the return stack and a multiple branch predictor; shift & mask logic and the next-address logic produce the instructions sent to decode and the next fetch address.]

• Aggressive fetch unit
  – Many instructions per cycle
  – Several branches
• Limitations
  – Prediction accuracy
  – Sequentiality
  – I-cache miss rate
• Previous work, trace building (hardware, forming traces at run time)
  – Yeh et al. (ICS'93)
  – Conte et al. (ISCA'95)
  – Rotenberg et al. (MICRO'96)
  – Friendly et al. (MICRO'97)

Page 48: Procesadores Superescalares

M. Valero 48

Trace Cache

[Figure: control flow graph with basic blocks b0-b8.]

A trace is a sequence of logically contiguous instructions.

A trace cache line stores a segment of the dynamic instruction trace across multiple, potentially taken branches (e.g. b1-b2-b4, b1-b3-b7, …).

It is indexed by the fetch address and the branch outcomes.

History-based fetch mechanism.
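A minimal sketch in Python (all names hypothetical) of the lookup: the trace cache is indexed by the fetch address together with the predicted branch outcomes, so b1 with outcomes (taken, not-taken) and b1 with (not-taken, taken) select different trace lines.

    trace_cache = {}   # (fetch address, branch outcomes) -> list of basic blocks

    def fill(start_pc, outcomes, blocks):
        trace_cache[(start_pc, outcomes)] = blocks

    def fetch(start_pc, predicted_outcomes):
        return trace_cache.get((start_pc, predicted_outcomes))  # None on a miss

    fill(0x400, (True, False), ["b1", "b2", "b4"])   # b1 -> b2 -> b4
    fill(0x400, (False, True), ["b1", "b3", "b7"])   # b1 -> b3 -> b7

    print(fetch(0x400, (True, False)))    # ['b1', 'b2', 'b4']
    print(fetch(0x400, (False, False)))   # None: build the trace from the i-cache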

Page 49: Procesadores Superescalares

M. Valero 49

The Fetch Unit (3 of 3)

[Figure: the aggressive core fetch unit (i-cache, branch target buffer, return stack, multiple branch predictor, shift & mask, next-address logic) extended with a trace cache (t-cache) and a fill buffer fed from fetch or commit. The trace cache aims at forming traces at run time.]

Page 50: Procesadores Superescalares

M. Valero 50

Our Contribution

• Mixed software-hardware approach
  – Optimize performance at compile time
    • Use profiling information
    • Make optimum use of the available hardware
  – Avoid redundant work at run time
    • Do not repeat what was done at compile time
    • Adapt the hardware to the new software
• Software Trace Cache
  – Profile-directed code reordering & mapping
• Selective Trace Storage
  – Fill unit modification

Page 51: Procesadores Superescalares

M. Valero 51

Our Work

• Workload analysis
  – Temporal locality
  – Sequentiality
• Software Trace Cache
  – Seed selection
  – Trace building
  – Trace mapping
  – Results
• Selective Trace Storage
  – Counting blue traces
  – Implementation
  – Results

[Chart: FIPA for gcc, li and postgres with Base, TC, STC and STS; 32KB instruction cache, 64KB trace cache.]

Page 52: Procesadores Superescalares

M. Valero 52

Workload Analysis (Reference Locality)

• Considerable amount of reference locality

Dynamic references:

Benchmark    75%     90%     99%    Code size
swim          148     232     763     110350
hydro2d      1223    1977    5371     125946
applu        2407    5060   10509     132803
m88ksim       458    1006    2863      51341
li            325     563    1365      38126
gcc          9595   22098   57878     349382
compress      243     338     525      21991
postgres     2716    5221   11748     374399

Page 53: Procesadores Superescalares

M. Valero 53

Workload Analysis (Sequentiality)

Benchmark    Unpredictable   Predictable
swim              45.3           54.7
mgrid             19.9           81.1
apsi              22.1           77.9
m88ksim           37.3           62.7
li                49.2           50.8
gcc               60.1           39.9
ijpeg             70.2           29.8
postgres          23.8           76.2

Unpredictable: loop branches, indirect jumps, subroutine returns, unpredictable conditional branches.
Predictable: fall-through, unconditional branches, conditional branches with fixed behaviour, subroutine calls.

Page 54: Procesadores Superescalares

M. Valero 54

Software Trace Cache

• Profile-directed code reordering
  – Obtain a weighted control flow graph
  – Select seeds or starting basic blocks
  – Build basic block traces
    • Map dynamically consecutive basic blocks to physically contiguous storage
    • Move unused basic blocks out of the execution path
  – Carefully map these traces in memory
    • Avoid conflict misses in the most popular traces
    • Minimize conflicts among the rest
• Increased role of the instruction cache
  – Able to provide longer instruction traces

Page 55: Procesadores Superescalares

M. Valero 55

STC : Seed Selection

• All procedure entry points
  – Ordered by popularity
  – Start building traces at the most popular procedures
• Knowledge-based selection
  – Based on source code knowledge
  – Leads to longer sequences
    • Inlining of the main path of found procedures
  – Loses temporal locality
    • Less popular basic blocks surround the most popular ones

Page 56: Procesadores Superescalares

M. Valero 56

STC : Trace Building

• Greedy algorithm
  – Follow the most likely path out of a basic block
  – Add secondary seeds for all other targets
• Two threshold values
  – Execution threshold
    • Do not include unpopular basic blocks
  – Transition threshold
    • Do not follow unlikely transitions
• Iterate the process with less restrictive thresholds (see the sketch after the figure)

[Figure: weighted control flow graph with basic blocks A1-A8, B1 and C1-C5, annotated with execution counts and transition probabilities. Transitions below the branch threshold are not followed, blocks below the execution threshold are not included, and the remaining targets are marked "valid, visit later" as secondary seeds.]
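A minimal sketch in Python (all names and threshold values hypothetical) of the greedy trace builder: starting from a seed, it keeps following the most likely successor, stops at the execution or transition thresholds, and queues the remaining targets as secondary seeds.

    def build_trace(seed, successors, exec_count,
                    exec_threshold=10, branch_threshold=0.6):
        """successors[b] = list of (target block, transition probability)."""
        trace, secondary_seeds, visited = [], [], set()
        block = seed
        while block is not None and block not in visited:
            if exec_count.get(block, 0) < exec_threshold:
                break                               # too unpopular to include
            trace.append(block)
            visited.add(block)
            next_block = None
            for target, prob in sorted(successors.get(block, []),
                                       key=lambda t: t[1], reverse=True):
                if next_block is None and prob >= branch_threshold:
                    next_block = target             # follow the most likely path
                else:
                    secondary_seeds.append(target)  # valid, visit later
            block = next_block
        return trace, secondary_seeds

    succ = {"A1": [("A2", 0.9), ("B1", 0.1)], "A2": [("A3", 1.0)], "A3": []}
    count = {"A1": 100, "A2": 90, "A3": 90, "B1": 10}
    print(build_trace("A1", succ, count))   # (['A1', 'A2', 'A3'], ['B1'])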

Page 57: Procesadores Superescalares

M. Valero 57

STC : Trace Mapping

[Figure: trace mapping in memory. The most popular traces are mapped into a CFA of I-cache size; a region with no code follows; the least popular traces are mapped after it.]

Page 58: Procesadores Superescalares

M. Valero 58

I-cache Miss Rate

[Figure: the core fetch unit (i-cache, xchange/shift & mask, BTB, RAS, BP, next-address logic).]

[Table: I-cache miss rate (%) for the Base layout, the P&H, Torrellas, Auto and Ops code layouts, and 2-way and victim-cache configurations, for 8KB, 32KB and 64KB I-caches with several CFA sizes. The base miss rates of 6.5 (8KB), 2.7 (32KB) and 1.4 (64KB) drop to 2.1-5.2, 0.2-0.4 and 0.02-0.14 respectively with the reordered layouts.]

Page 59: Procesadores Superescalares

M. Valero 59

Fetch Bandwidth

[Figure: the core fetch unit (i-cache, xchange/shift & mask, BTB, RAS, BP, next-address logic).]

[Table: fetch bandwidth for the Base layout and the P&H, Torrellas, Auto and Ops code layouts, without and with a 16KB trace cache, for an ideal cache and for 8KB, 32KB and 64KB I-caches with several CFA sizes. The ideal bandwidth grows from 7.6 (Base) to 8.5-10.7 with code reordering, and up to 12.2 with a 16KB trace cache plus the Ops layout; realistic 32KB and 64KB configurations reach roughly 8 to 12.]

Page 60: Procesadores Superescalares

M. Valero 60

STC : Results

32KB Instruction cache, 64KB Trace cache

[Chart: FIPC for gcc, li and postgres with Base, STC, TC and S/HTC; values range from 2.2 to 5.64.]

Page 61: Procesadores Superescalares

M. Valero 61

STC: Conclusions

• STC increases the role of the core fetch unit
  – Build traces at compile time
    • Increases code sequentiality
  – Map them carefully in memory
    • Reduces the instruction cache miss rate
• Increased core fetch unit performance
  – Trace cache-like performance with no additional hardware cost
    • Compile-time solution
  or…
  – Optimum results with a small supporting trace cache
    • Better fail-safe mechanism on a trace cache miss

Page 62: Procesadores Superescalares

M. Valero 62

Selective Trace Storage

• The STC constructed traces at compile time
  – Blue traces
    • Built at compile time
    • Traces containing only consecutive instructions
    • May be provided by the instruction cache in a single cycle
  – Red traces
    • Built at run time
    • Traces containing taken branches
    • Can be provided by the trace cache in a single cycle
• Blue traces need not be stored in the trace cache (see the sketch below)
  – Better usage of the storage space
    • Better performance at the same cost
    • Equivalent performance at lower cost
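A minimal sketch in Python (all names hypothetical) of Selective Trace Storage in the fill unit: a finished trace is written into the trace cache only if it contains at least one taken branch (a red trace); blue traces, being sequential, can be supplied by the instruction cache anyway.

    def fill_unit_commit(trace, trace_cache):
        """trace: list of (pc, branch_taken) pairs collected for a finished trace."""
        has_taken_branch = any(taken for _, taken in trace)
        if not has_taken_branch:
            return False            # blue trace: redundant, the i-cache can supply it
        start_pc = trace[0][0]
        outcomes = tuple(taken for _, taken in trace)
        trace_cache[(start_pc, outcomes)] = trace
        return True                 # red trace: stored in the trace cache

    tc = {}
    print(fill_unit_commit([(0x100, False), (0x104, False)], tc))  # False (blue, filtered out)
    print(fill_unit_commit([(0x200, True),  (0x220, False)], tc))  # True  (red, stored)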

Page 63: Procesadores Superescalares

M. Valero 63

STS: Counting Blue Traces

[Chart: percentage of traces with 0, 1, 2 and 3+ breaks.]

Reordering reduces the number of breaks. There is a high degree of redundancy, even in the original code.

Page 64: Procesadores Superescalares

M. Valero 64

STS: Implementation

[Figure: fetch unit with trace cache and fill unit. The fill unit filters out blue (redundant) traces; only red trace components are stored in the trace cache. The BTB, multiple branch predictor, return address stack, xchange/shift & mask and next-address logic provide the next fetch address and the instructions sent to decode.]

Page 65: Procesadores Superescalares

M. Valero 65

STS: FIPA - Realistic Branch Predictor

[Chart: FIPA with a realistic branch predictor for Gcc, Li and Postgres.]

Page 66: Procesadores Superescalares

M. Valero 66

STS: FIPC - Realistic BP - 64KB i-cache

[Chart: FIPC with a realistic branch predictor and a 64KB i-cache for Gcc, Li and Postgres.]

Page 67: Procesadores Superescalares

M. Valero 67

STS: FIPA - Perfect Branch Predictor

[Chart: FIPA with a perfect branch predictor for Gcc, Li and Postgres.]

Page 68: Procesadores Superescalares

M. Valero 68

STS: Conclusions

• Minor hardware modification
  – Filter out blue traces in the fill unit
    • Avoids redundant run-time work
• Better usage of the storage space
  – Higher performance at the same cost
  – Equivalent performance at much lower cost
• Benefits of STS increase when used with STC
  – The more work done at compile time, the less work left to do at run time

Page 69: Procesadores Superescalares

M. Valero 69

Conclusions

• Instruction fetch is better approached using both software and hardware techniques
  – Compile-time code reorganization
    • Increase code sequentiality
    • Minimize instruction cache misses
  – Avoid redundant run-time work
    • Do not store the same traces twice
• High fetch unit performance with little additional hardware
  – Small 2KB complementary trace cache & smart fill unit

Page 70: Procesadores Superescalares

M. Valero 70

Future Work

• Further increasing fetch performance
  – Increase i-cache performance
    • Reduce miss ratio
    • Reduce miss penalty
  – Increase the quality of the provided instructions
    • Better branch prediction accuracy
    • Faster recovery after mispredictions
• Take the path of least resistance
  – Simplicity of design
  – Software approach whenever possible

Page 71: Procesadores Superescalares

M. Valero 71

The End