Microarchitectural Techniques to Reduce Interconnect Power in Clustered Processors
Karthik Ramani, Naveen Muralimanohar, Rajeev Balasubramonian
Power and Temperature-Aware Microarchitecture
June 20th 2004, University of Utah
Motivation
Wire delays do not scale as well as their transistor counterparts
Future processors will be communication bound
Increased use of interconnects, and hence an increase in power dissipation
50% of dynamic power is in interconnect switching (Magen et al., SLIP '04)
The MIT Raw processor's on-chip network consumes 36% of total chip power (Wang et al., 2003)
Interconnect Power
Reducing power comes at the cost of increased latency
Dynamic Power = αCV²f
Different methods:
Frequency scaling
Voltage scaling
Reducing the size of repeaters
Reducing the number of repeaters
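Since dynamic power scales as αCV²f, supply voltage has a quadratic effect while frequency is only linear. The minimal C++ sketch below makes this concrete; the activity factor, capacitance, voltage, and frequency values are illustrative assumptions, not figures from the talk.

```cpp
// Illustrative sketch (assumed values, not from the talk): dynamic power of a
// wire segment as alpha * C * V^2 * f, and the relative effect of scaling.
#include <cstdio>

double dynamic_power(double alpha, double cap_farads, double vdd, double freq_hz) {
    return alpha * cap_farads * vdd * vdd * freq_hz;   // P_dyn = aCV^2f
}

int main() {
    const double alpha = 0.15;      // switching activity factor (assumed)
    const double cap   = 2.0e-12;   // switched capacitance, 2 pF (assumed)
    const double vdd   = 1.2;       // supply voltage in volts (assumed)
    const double freq  = 3.0e9;     // clock frequency, 3 GHz (assumed)

    double base     = dynamic_power(alpha, cap, vdd, freq);
    double v_scaled = dynamic_power(alpha, cap, 0.8 * vdd, freq);  // quadratic effect
    double f_scaled = dynamic_power(alpha, cap, vdd, 0.8 * freq);  // linear effect

    std::printf("baseline : %.3f mW\n", base * 1e3);
    std::printf("0.8x Vdd : %.3f mW (%.0f%% of baseline)\n",
                v_scaled * 1e3, 100.0 * v_scaled / base);
    std::printf("0.8x freq: %.3f mW (%.0f%% of baseline)\n",
                f_scaled * 1e3, 100.0 * f_scaled / base);
    return 0;
}
```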
Power-Delay Tradeoff
Conventional interconnect design is performance oriented:
Low latency
High power dissipation
Power can be reduced by tolerating some delay penalty
Total Power = Leakage Power + Dynamic Power + Short-Circuit Power
Reducing repeater size lowers all three components (leakage, dynamic, and short-circuit power)
Decreasing the number of repeaters lowers all three components as well
In both cases, latency increases
Power Reduction
Ref: Banerjee et al., IEEE Transactions on Electron Devices, 2002
Impact of Power-centric Design
Delay-optimized case: wires optimized for delay
Power-optimized case: wires optimized for power
Performance difference: 20%
[Chart: IPC per SPEC benchmark for the delay-optimized and power-optimized wire configurations]
Heterogeneous Interconnects
Proposed design: implement wires with varied characteristics
Delay-optimized interconnect
Power-optimized interconnect: latencies twice those of the delay-optimized wires, 80% reduction in power (by focusing on repeaters alone)
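One way to picture the proposal is as two link classes with different latency and energy characteristics. The sketch below is only illustrative: the 2x latency and 80% repeater-power reduction come from the slide, while the absolute numbers and type names are assumptions.

```cpp
// Illustrative model of the two interconnect classes described above.
// The 2x latency and ~80% repeater-power reduction come from the slide;
// base latency and energy values are assumed for illustration only.
#include <cstdio>
#include <initializer_list>

struct WireClass {
    const char* name;
    int         latency_cycles;   // one-hop transfer latency
    double      energy_per_bit;   // relative energy cost per bit
};

int main() {
    const WireClass delay_optimized = {"delay-optimized", 2, 1.00};
    // Power-optimized wires: twice the latency, roughly 80% less repeater power.
    const WireClass power_optimized = {"power-optimized", 2 * 2, 0.20};

    for (const WireClass& w : {delay_optimized, power_optimized}) {
        std::printf("%-16s latency = %d cycles, relative energy/bit = %.2f\n",
                    w.name, w.latency_cycles, w.energy_per_bit);
    }
    return 0;
}
```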
Outline
Motivation & Proposed solution
Base Architecture
Interconnect Transfers
Results
Conclusion & Future work
Architecture for Evaluation
A dynamically scheduled clustered model with 16 clusters
Hierarchical interconnects: crossbar and ring
Centralized front-end: I-Cache & D-Cache, LSQ, branch predictor
Four FUs per cluster
[Diagram: 16 clusters connected by crossbars (1 cycle latency) and a ring interconnect (4 cycle latency), with centralized I-Cache, D-Cache, and LSQ]
Simulator Parameters
SimpleScalar with contention modeled in detail
15-entry out-of-order issue queue in each cluster (int & fp each)
30 physical registers (int & fp each)
In-flight window: 480 instructions
Inter-cluster latencies:
Delay-optimized: 2-10 cycles
Power-optimized: 4-20 cycles
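For reference, the parameters above can be restated as a single configuration record; the field names below are our own, and the values are simply those quoted on this slide.

```cpp
// Restatement of the simulated configuration as a plain struct.
// Field names are our own; values are the ones quoted on the slide.
struct ClusterConfig {
    int num_clusters        = 16;   // dynamically scheduled clustered model
    int issue_queue_entries = 15;   // per cluster, int and fp each
    int physical_registers  = 30;   // int and fp each
    int in_flight_window    = 480;  // in-flight instructions
    int fast_latency_min    = 2;    // delay-optimized inter-cluster hops
    int fast_latency_max    = 10;
    int slow_latency_min    = 4;    // power-optimized inter-cluster hops
    int slow_latency_max    = 20;
};
```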
Interconnect Transfers - Types
[Pie chart: breakdown of transfers in the interconnect - ready register values 24%, bypassed register values 31%, load data 16%, store data 4%, address transfers 25%]
Bypassed Register Values
Operands produced in one cluster that are immediately required by another cluster
Criticality is based on two factors:
Operand arrival time at the consuming cluster
Actual issue time of the sourcing instruction
Criticality changes at runtime, so a dynamic predictor is needed
[Diagram: clustered pipeline (rename & dispatch feeding per-cluster issue queues, register files, and FUs); the producing instruction completes execution at cycle 120 while the consumer instruction is dispatched at cycle 100]
The Data Criticality Predictor
A table indexed by the lower-order bits of the instruction address, updated dynamically to indicate the criticality of the data
The difference between arrival time and use time is computed for each operand of an instruction
Difference < threshold: critical
Difference > threshold: non-critical
Ready Register Values
Source operands that are already available at the time of dispatch
Premise: significant latency between dispatch and issue
Latency tolerant, so they are sent on the power-optimized wires
[Diagram: the operand is ready at cycle 90 while the consumer instruction is dispatched only at cycle 100, so the transfer can tolerate the slower wires]
Load & Store Data
Store data: often non-critical
Impact of delayed stores (rare cases): dependent loads have to wait; commit stalls if the store is at the head of the reorder buffer
Latency insensitive, so store data uses the power-optimized network
Load data: critical! Often on the critical path
Latency sensitive, so load data uses the fast network
Address Prediction
High-confidence predictions cover 51% of effective address transfers
When the address is predicted with high confidence, the actual address transfer is needed only for verification and can use the power-optimized wires
[Diagram: cluster with an address predictor (AP) alongside the FU and register file sending predicted effective addresses to the LSQ and L1 cache, contrasted with a cluster that sends the computed address directly]
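The slide does not say what kind of address predictor is used; as one plausible illustration, the sketch below uses a PC-indexed stride predictor with a small confidence counter to mark high-confidence predictions.

```cpp
// One plausible form of the effective-address predictor: a PC-indexed stride
// predictor with a confidence counter. This is an illustrative assumption;
// the slide does not specify the prediction mechanism.
#include <array>
#include <cstddef>
#include <cstdint>

class StrideAddressPredictor {
public:
    struct Prediction {
        uint64_t address;
        bool     high_confidence;
    };

    Prediction predict(uint64_t pc) const {
        const Entry& e = table_[index(pc)];
        return {e.last_addr + e.stride, e.confidence >= 3};
    }

    void train(uint64_t pc, uint64_t actual_addr) {
        Entry& e = table_[index(pc)];
        int64_t new_stride = static_cast<int64_t>(actual_addr - e.last_addr);
        if (new_stride == e.stride) { if (e.confidence < 3) ++e.confidence; }
        else                        { e.confidence = 0; e.stride = new_stride; }
        e.last_addr = actual_addr;
    }

private:
    struct Entry {
        uint64_t last_addr  = 0;
        int64_t  stride     = 0;
        uint8_t  confidence = 0;   // saturating confidence counter
    };

    static constexpr std::size_t kEntries = 1024;  // assumed table size
    static std::size_t index(uint64_t pc) { return pc & (kEntries - 1); }

    std::array<Entry, kEntries> table_{};
};
```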
Summary of Transfers
Critical: load values; unpredicted effective addresses; bypassed register values predicted critical
Non-Critical: store values; predicted effective addresses; bypassed register values predicted non-critical; ready register values
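This summary amounts to a steering rule that selects a network for each transfer. A minimal sketch is below; the enum and function names are our own, not from the paper.

```cpp
// Minimal sketch of the steering rule implied by the summary above.
// Enum and function names are our own.
enum class TransferType {
    LoadValue, StoreValue, EffectiveAddress, ReadyRegister, BypassedRegister
};

enum class Network { DelayOptimized, PowerOptimized };

// 'non_critical_hint' applies to effective addresses (high-confidence
// predictor hit) and to bypassed registers (criticality predictor says
// non-critical); it is ignored for the other transfer types.
Network pick_network(TransferType t, bool non_critical_hint) {
    switch (t) {
        case TransferType::LoadValue:     return Network::DelayOptimized;
        case TransferType::StoreValue:    return Network::PowerOptimized;
        case TransferType::ReadyRegister: return Network::PowerOptimized;
        case TransferType::EffectiveAddress:
        case TransferType::BypassedRegister:
            return non_critical_hint ? Network::PowerOptimized
                                     : Network::DelayOptimized;
    }
    return Network::DelayOptimized;  // unreachable; keeps compilers happy
}
```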
Outline
Motivation & Proposed solution
Base Architecture
Interconnect Transfers
Simulation results
Conclusion
Methodology
Three cases for simulation:
High-performance case: a clustered model with only delay-optimized wires
Low-power case: a clustered model with all power-optimized wires
Criticality-based case: a clustered model using heterogeneous wires
Results
[Chart: per-benchmark IPC analysis, stacking the low-power IPC, the criticality-based IPC minus the low-power IPC, and the high-performance IPC minus the criticality-based IPC]
Performance loss of the criticality-based case relative to the high-performance case: 2.5%
Performance loss of the low-power case relative to the high-performance case: 20%
Results
[Chart: per-benchmark percentage of non-critical transfers versus percentage IPC loss]
Summary of non-critical interconnect transfers
[Pie chart: breakdown of interconnect transfers by class - effective address predicted, unpredicted address, bypassed non-critical, bypassed critical, ready register, store value, load value; slices labelled 13%, 24%, 4%, 8%, 23%, 16%, 12%]
Result summary
Two kinds of non-critical transfers:
Data that are not immediately used: 38%
Verification of address predictions: 13%
Criticality-based case:
49% of all data transfers go through the power-optimized wires
Performance penalty: only 2.5%
Potential energy savings of around 50% in the interconnects
Related Work
Several heuristics proposed for data criticality – Tune et al. [HPCA-7], Srinivasan et al. [ISCA-28]
Redirection of instructions to units based on criticality – Seng et al. [MICRO 2001]
Balasubramonian et al. evaluated heterogeneous cache banks [MICRO 2003]
Banerjee and Mehrotra developed an analytical model for designing interconnects for a given delay penalty [IEEE Trans. Electron Devices 2002]
Future Work
Other metrics for data criticality prediction (e.g., low-confidence branches)
Application of heterogeneous interconnects in other parts of the microprocessor (caches, etc.)
Other configurations of the heterogeneous interconnect
Conclusion
A single interconnect optimized for delay or power alone is not enough
A heterogeneous interconnect model alleviates this problem
The criticality predictor efficiently identifies non-critical data
49% of transfers go on the non-critical (power-optimized) network, with a performance loss of only 2.5%
Questions?
Thank You