Complexity-Effective Superscalar Processors S. Palacharla, N. P. Jouppi, and J. E. Smith Presented by: Jason Zebchuk


Complexity-Effective Superscalar Processors

S. Palacharla, N. P. Jouppi, and J. E. Smith

Presented by: Jason Zebchuk


Brainiac vs. Speed Demon

• Brainiac:

• Maximize # of instructions per cycle

• More complex, slower clock

• Speed Demon:

• Maximize clock frequency

• Faster clock, fewer instructions per cycle


Complexity-Effective

• Issue Width vs. Clock Cycle

• Goal is to balance both

• Best design is Complexity-Effective

• Allow complex issue scheme AND fast clock cycle


What is Complexity?

• Delay of Critical Path through a piece of logic.


Creating a Complexity-Effective Architecture

• Analyze complexity of each pipeline stage

• Select component with most complexity

• Propose less-complex alternative that achieves similar performance


Agenda

• OoO Superscalar Pipeline Overview

• Register Renaming

• Wakeup & Select

• Bypass

• Complexity-Effective Optimizations

• Dependence-based Scheduler

• Clustering

Pipeline Overview

[Figure: pipeline diagram with Fetch, Decode, Rename, Issue Window (Wakeup, Select), Register File, Bypass, Data Cache]

• Fetch:

• Read Instructions from I-Cache

• Predict Branches

Pipeline Overview

• Decode:

• Parse instruction

• Shuffle opcode parts to appropriate ports for rename

Pipeline Overview

• Rename:

• Map architectural registers to physical registers

• Eliminate false dependences

• Dispatch renamed instructions to scheduler

Pipeline Overview

• Wakeup:

• Instructions check whether they have become ready

• Compare source register names against those coming from the Writeback stage

• Select:

• Choose from among the ready instructions

Pipeline Overview

• Register File Read:

• Read source operands

Pipeline Overview

• Bypass and Execute:

• Execute instructions in functional units

• Bypass results from outputs to inputs

Pipeline Overview

• Data Cache Access:

• Load & Store to data cache

Pipeline Overview

• Write result to register file

• Broadcast tag to wake up waiting instructions

• Broadcast happens two cycles before the result is produced

Alternate Pipeline

[Figure: pipeline diagram with Fetch, Decode, Rename, Reorder Buffer, Reservation Stations (Wakeup, Select), Register File, Bypass, Data Cache]

• Used by Pentium Pro, PowerPC

• Re-order buffer (ROB) holds values

• Tomasulo-like renaming (point to ROB entries)

Sources of Complexity

[Figure: pipeline diagram with Fetch, Decode, Rename, Issue Window (Wakeup, Select), Register File, Bypass, Data Cache]

Complexity scales with issue width and window size.


What about.....

• Fetch, Decode, Register File, Cache?

• Previously studied

• Register File, Caches

• Easy to scale ?

• Fetch, Decode, Functional Units

Register Renaming

• Eliminate WAR and WAW hazards

Architectural (Logical) Registers:

Ld r1, [r2]
Add r3, r1, r2
Sub r1, r2, r4
Ld r2, [r1]

Physical Registers:

Ld p5, [p7]
Add p3, p5, p7
Sub p6, p7, p4
Ld p9, [p6]
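The renaming example above can be reproduced with a small sketch of a register alias table plus a free list. This is only an illustration under stated assumptions: the `rename` function, the initial mapping (including the hypothetical `p1` for `r1`), and the order of the free list are chosen here so that the output matches the slide, and are not taken from the paper.

```python
def rename(instructions, rat, free_list):
    """Rename a list of (dest, [sources]) instructions in program order.

    rat maps each logical register to its current physical register.
    free_list holds unused physical registers, allocated from the front.
    """
    renamed = []
    for dest, srcs in instructions:
        # Sources read the current mapping before the destination is remapped.
        phys_srcs = [rat[s] for s in srcs]
        # A fresh physical register per destination removes WAR and WAW hazards.
        phys_dest = free_list.pop(0)
        rat[dest] = phys_dest
        renamed.append((phys_dest, phys_srcs))
    return renamed


program = [
    ("r1", ["r2"]),        # Ld  r1, [r2]
    ("r3", ["r1", "r2"]),  # Add r3, r1, r2
    ("r1", ["r2", "r4"]),  # Sub r1, r2, r4
    ("r2", ["r1"]),        # Ld  r2, [r1]
]
# Assumed starting state, picked to reproduce the physical names on the slide.
rat = {"r1": "p1", "r2": "p7", "r4": "p4"}
free_list = ["p5", "p3", "p6", "p9"]

renamed = rename(program, rat, free_list)
for phys_dest, phys_srcs in renamed:
    print(phys_dest, phys_srcs)
```

Only the true (read-after-write) dependences survive: the final load reads `p6`, the register written by the `Sub`, while the earlier `Add` still reads `p5`.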

Register Rename Logic

[Figure: logical source & destination registers feed the Register Alias Table and the Dependence Check Logic; a MUX produces the physical source & destination registers]

SRAM array; scales with issue width

Register Alias Table

[Figure: array indexed by logical register name that outputs a physical register name; labelled parts are Decode, Wordlines, Bitlines, Senseamps]

$T_{rename} = T_{decode} + T_{wordline} + T_{bitline} + T_{senseamp}$

Rename Delay

[Figure: rename delay plot]

Scales linearly with issue width

Instruction Wakeup

• Broadcast newly available operands

• Update which operands are ready for each instruction

• Mark instruction ready when both left and right operands are ready
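The three wakeup steps above can be sketched in software as follows. This is a behavioural illustration only: the `WindowEntry` class and `wakeup` function are hypothetical names, and real wakeup logic performs all tag comparisons in parallel in hardware rather than in a loop.

```python
from dataclasses import dataclass


@dataclass
class WindowEntry:
    left_tag: str
    right_tag: str
    left_ready: bool = False
    right_ready: bool = False

    @property
    def ready(self):
        # An instruction is ready only when both operands are ready.
        return self.left_ready and self.right_ready


def wakeup(window, result_tags):
    """Broadcast result tags to every entry and update operand-ready flags."""
    for entry in window:
        for tag in result_tags:
            if entry.left_tag == tag:
                entry.left_ready = True
            if entry.right_tag == tag:
                entry.right_ready = True
    return [entry for entry in window if entry.ready]


window = [
    WindowEntry("p5", "p7", right_ready=True),  # waiting only on p5
    WindowEntry("p6", "p8"),                    # waiting on both operands
]
ready_now = wakeup(window, ["p5"])
print(len(ready_now))
```

Every broadcast tag is compared against both operand tags of every window entry, which is why the cost of this step grows with both issue width and window size.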

Wakeup Logic

• $Delay = T_{tagdrive} + T_{tagmatch} + T_{matchOR}$

[Figure: wakeup logic showing Tag Drive, Tag Match, and Match OR]

Wakeup Delay

[Figure: wakeup delay plot]

Delay increases quadratically with window size

Wakeup Delay Breakdown

[Figure: wakeup delay breakdown plot]

Wire delay dominates at smaller feature sizes!

Select

• Consider all instructions that are ready

• Select one ready instruction for each functional unit

• Uses some selection policy

• e.g., Oldest First

• Choice of policy has little effect on performance
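A minimal sketch of an oldest-first selection policy, assuming each ready instruction carries an `age` field where a smaller value means older. The field name and the `select_oldest_first` function are illustrative assumptions, not part of the paper.

```python
def select_oldest_first(ready, num_units):
    """Grant up to num_units ready instructions, oldest (lowest age) first."""
    return sorted(ready, key=lambda instr: instr["age"])[:num_units]


ready = [
    {"name": "sub", "age": 7},
    {"name": "ld", "age": 3},
    {"name": "add", "age": 5},
]
granted = select_oldest_first(ready, num_units=2)
print([instr["name"] for instr in granted])
```

With two functional units the two oldest ready instructions are granted, and the third waits for a later cycle.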

Selection Logic

• $Delay = c_0 + c_1 \times \log_4(WINSIZE)$

[Figure: selection logic showing Request Signal and Grant Signal]
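The base-4 logarithm in the delay formula suggests a tree of 4-input arbiters. The sketch below counts tree levels under that assumption and rounds up to whole levels, which is also an assumption; the constants passed for `c0` and `c1` are placeholders, not measured values.

```python
def arbiter_levels(winsize, fan_in=4):
    """Number of levels in a tree of fan_in-input arbiters covering winsize entries."""
    levels, covered = 0, 1
    while covered < winsize:
        covered *= fan_in
        levels += 1
    return levels


def select_delay(winsize, c0, c1):
    """Delay model c0 + c1 * levels, with levels standing in for log4(WINSIZE)."""
    return c0 + c1 * arbiter_levels(winsize)


for size in (16, 32, 64, 128):
    print(size, arbiter_levels(size))
```

Under these assumptions, doubling the window from 32 to 64 entries adds no tree level, so selection delay grows slowly with window size.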

Selection Delay

[Figure: selection delay plot]

Only logic delay; wire delay ignored

Data Bypass Network

• Forward results from completing instructions to dependent instructions

• # of paths depends on pipeline depth and issue width

• # of paths = $2 \times IW^2 \times S$

• IW = issue width, S = number of pipe stages after results are produced
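As a quick worked example of the path-count formula, the sketch below evaluates it for a few configurations. The function name `bypass_paths` is illustrative.

```python
def bypass_paths(issue_width, stages_after_result):
    """Number of bypass paths: 2 * IW^2 * S."""
    return 2 * issue_width ** 2 * stages_after_result


for iw in (2, 4, 8):
    print(iw, bypass_paths(iw, stages_after_result=1))
```

Doubling the issue width quadruples the number of paths, while adding a pipe stage after the result only scales the count linearly.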

Data Bypass Logic

• FUs broadcast results (potentially multiple subsequent results)

• Regfile reads current operand values

• MUX selects correct source

Bypass Delay

• $T_{bypass} = 0.5 \times R_{metal} \times C_{metal} \times L^2$

• L = length of result wires

• Delay independent of feature size

• Dependent on specific layout

Issue Width   Wire Length (λ)   Delay (ps)
4             20500             184.9
8             49000             1056.4
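Because the delay is proportional to the square of the wire length, the 8-wide figure in the table can be checked from the 4-wide one without knowing the metal resistance and capacitance. The helper name `scaled_bypass_delay` is illustrative.

```python
def scaled_bypass_delay(ref_delay_ps, ref_length, new_length):
    """Scale a known bypass delay to a new wire length, using delay ~ L^2."""
    return ref_delay_ps * (new_length / ref_length) ** 2


# Reference point: 4-wide machine, 20500 lambda of wire, 184.9 ps.
predicted = scaled_bypass_delay(184.9, 20500, 49000)
print(round(predicted, 1))
```

The prediction agrees with the 1056.4 ps listed for the 8-wide machine: a wire about 2.4 times longer costs about 5.7 times the delay.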

Putting it All Together

Technology   Issue Width   Window Size   Rename Delay (ps)   Wakeup+Select Delay (ps)   Bypass Delay (ps)
0.8μm        4             32            1577.9              2903.7                     184.9
0.8μm        8             64            1710.5              3369.4                     1056.4
0.35μm       4             32            627.2               1248.4                     184.9
0.35μm       8             64            726.6               1484.8                     1056.4
0.18μm       4             32            351.0               578.0                      184.9
0.18μm       8             64            427.9               724.0                      1056.4
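Taking the limiting stage of each design to be the slowest of the three delays in the table above, a short sketch can pick it out per design. The data are copied from the table; the function and variable names are illustrative.

```python
designs = [
    # (technology, issue width, window size, rename, wakeup+select, bypass), delays in ps
    ("0.8um", 4, 32, 1577.9, 2903.7, 184.9),
    ("0.8um", 8, 64, 1710.5, 3369.4, 1056.4),
    ("0.35um", 4, 32, 627.2, 1248.4, 184.9),
    ("0.35um", 8, 64, 726.6, 1484.8, 1056.4),
    ("0.18um", 4, 32, 351.0, 578.0, 184.9),
    ("0.18um", 8, 64, 427.9, 724.0, 1056.4),
]


def limiting_stage(rename, wakeup_select, bypass):
    """Return the name of the stage with the largest delay."""
    delays = {"rename": rename, "wakeup+select": wakeup_select, "bypass": bypass}
    return max(delays, key=delays.get)


limits = {(tech, iw): limiting_stage(r, ws, b) for tech, iw, _, r, ws, b in designs}
for design, stage in limits.items():
    print(design, stage)
```

Wakeup+Select is the slowest stage in five of the six designs; only the 8-wide machine at 0.18μm is limited by bypass, because the bypass wire delay does not shrink with feature size.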


Key Sources of Complexity

• Wakeup+Select Logic:

• Limiting stage for 5 out of 6 designs

• Considers all instructions simultaneously

• Bypass Paths

• Wire dominated delay

• Lots of long wires

Dependence-based Microarchitecture

[Figure: pipeline diagram with Fetch, Decode, Rename, Steer, FIFOs (Wakeup, Select), Register File, Bypass, Data Cache]

• Replace Issue Window with a few small FIFO queues

• Only schedule from head of FIFO

• Adds new Steering stage


Instruction Steering Heuristic

• 3 possible cases for instruction I:

1. All operands ready & in register file

• I is steered to an empty FIFO

2. I requires one operand produced by Isource in FIFO Fa

• If no instruction is behind Isource, I is steered to Fa

• Otherwise, I is steered to an empty FIFO

3. I requires two operands produced by Ileft and Iright

• Apply rule 2 to Ileft

• If that would steer I to an empty FIFO, apply rule 2 to Iright instead
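The three cases above can be sketched as a single function. This is a simplified model under several assumptions that the slides do not state: FIFOs are Python lists with the tail at the end, `producer_of` maps a register to the id of its most recent in-flight producer, no instruction issues while the sequence is dispatched, and an instruction stalls (returns `None`) when no empty FIFO exists. All names are hypothetical.

```python
def steer(srcs, producer_of, fifos):
    """Return the index of the FIFO for an instruction, or None to stall."""

    def fifo_behind_producer(reg):
        producer = producer_of.get(reg)
        if producer is None:
            return None  # operand is already in the register file
        for idx, fifo in enumerate(fifos):
            # Rule 2: usable only if nothing is queued behind the producer.
            if fifo and fifo[-1] == producer:
                return idx
        return None

    # Case 3 falls out of trying the left operand first, then the right.
    for reg in srcs:
        idx = fifo_behind_producer(reg)
        if idx is not None:
            return idx
    # Case 1, or rule 2 failing for every operand: use an empty FIFO.
    for idx, fifo in enumerate(fifos):
        if not fifo:
            return idx
    return None  # assumption: stall when no FIFO is empty


def dispatch(instr_id, dest, srcs, producer_of, fifos):
    idx = steer(srcs, producer_of, fifos)
    if idx is not None:
        fifos[idx].append(instr_id)
        if dest is not None:
            producer_of[dest] = instr_id
    return idx


# Six instructions modelled on the start of the steering example, four FIFOs.
program = [
    (0, "r18", ["r0", "r2"]),   # addu  r18, r0, r2
    (1, "r2", ["r0"]),          # addiu r2, r0, -1
    (2, None, ["r18", "r2"]),   # beq   r18, r2, L2
    (3, "r4", ["r28"]),         # lw    r4, -32768(r28)
    (4, "r2", ["r18", "r20"]),  # sllv  r2, r18, r20
    (5, "r16", ["r2", "r19"]),  # xor   r16, r2, r19
]
fifos = [[], [], [], []]
producer_of = {}
placement = [dispatch(i, d, s, producer_of, fifos) for i, d, s in program]
print(placement, fifos)
```

Dependent instructions end up queued directly behind their producers, so only the FIFO heads need to be examined by the wakeup and select logic.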

Steering Example

 0: addu r18, r0, r2
 1: addiu r2, r0, -1
 2: beq r18, r2, L2
 3: lw r4, -32768(r28)
 4: sllv r2, r18, r20
 5: xor r16, r2, r19
 6: lw r3, -32676(r28)
 7: sll r2, r16, 0x2
 8: addu r2, r2, r23
 9: lw r2, 0(r2)
10: sllv r4, r18, r4
11: addu r17, r4, r19
12: addiu r3, r3, 1
13: sw r3, -32676(r28)
14: beq r2, r17, L3

[Figure: instructions 0 to 14 steered into the FIFOs step by step]

Performance Comparison

[Figure: performance comparison chart]

Cycle count within 8% for all benchmarks

But . . .

• $T_{Execution} = \frac{\#\,\text{Instructions} \times \text{Clock Period}}{\text{Instructions per Cycle}}$

• Dependence-based microarchitecture is less complex

• Allows for faster clock

• Overall, dependence-based has better performance

Improving Bypass Paths

• One-cycle bypass within Cluster

• Two-cycle bypass between Clusters


Performance of Clustering

• IPC up to 12% lower

• Clock is 25% faster

• Overall, 10% - 22% faster

• 16% faster on average
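These numbers follow from execution time = instructions × clock period / IPC: for a fixed instruction count, speedup is the IPC ratio times the clock frequency ratio. The sketch below works through the worst case, assuming "25% faster" means a 25% higher clock frequency; the function name is illustrative.

```python
def speedup(ipc_ratio, clock_freq_ratio):
    """Old execution time divided by new execution time, for a fixed instruction count."""
    return ipc_ratio * clock_freq_ratio


# Worst case from the slide: IPC 12% lower, clock 25% faster.
worst_case = speedup(ipc_ratio=0.88, clock_freq_ratio=1.25)
print(round(worst_case, 2))
```

A 12% IPC loss combined with a 25% faster clock still gives about a 10% net gain, which matches the lower end of the reported 10% to 22% range.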


Other Alternatives

1. 1 Window + 1 Cluster

2. 1 Window + 2 Clusters

3. 2 Windows + 2 Clusters

i. Issue Window + Steering heuristic

ii. Issue Window + Random steering

iii. FIFOs + Steering heuristic

Performance of Clustering

Steering has largest impact on performance


Conclusions

• Wakeup+Select and Bypass paths are likely to be the limiting pipeline stages

• Atomicity makes these stages critical

• Consider more complexity-effective designs for these stages

• Accept a small IPC decrease in exchange for a higher clock frequency