Microarchitectural Techniques to Reduce Interconnect Power in Clustered Processors
Karthik Ramani, Naveen Muralimanohar, Rajeev Balasubramonian
Power and Temperature-Aware Microarchitecture
June 20th 2004, University of Utah
Motivation
Wire delays do not scale as well as their transistor counterparts
Future processors will be communication bound
Increased use of interconnects, and hence an increase in power dissipation
50% of dynamic power is in interconnect switching (Magen et al., SLIP '04)
The MIT Raw processor's on-chip network consumes 36% of total chip power (Wang et al., 2003)
Interconnect Power
Reducing power comes at the cost of increased latency
Dynamic Power = αCV²f
Different methods:
Frequency scaling
Voltage scaling
Reducing the size of repeaters
Reducing the number of repeaters
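Since dynamic power scales as αCV²f, supply voltage has a quadratic effect while frequency is only linear. The minimal C++ sketch below makes this concrete; the activity factor, capacitance, voltage, and frequency values are illustrative assumptions, not figures from the talk.

```cpp
// Illustrative sketch (assumed values, not from the talk): dynamic power of a
// wire segment as alpha * C * V^2 * f, and the relative effect of scaling.
#include <cstdio>

double dynamic_power(double alpha, double cap_farads, double vdd, double freq_hz) {
    return alpha * cap_farads * vdd * vdd * freq_hz;   // P_dyn = aCV^2f
}

int main() {
    const double alpha = 0.15;      // switching activity factor (assumed)
    const double cap   = 2.0e-12;   // switched capacitance, 2 pF (assumed)
    const double vdd   = 1.2;       // supply voltage in volts (assumed)
    const double freq  = 3.0e9;     // clock frequency, 3 GHz (assumed)

    double base     = dynamic_power(alpha, cap, vdd, freq);
    double v_scaled = dynamic_power(alpha, cap, 0.8 * vdd, freq);  // quadratic effect
    double f_scaled = dynamic_power(alpha, cap, vdd, 0.8 * freq);  // linear effect

    std::printf("baseline : %.3f mW\n", base * 1e3);
    std::printf("0.8x Vdd : %.3f mW (%.0f%% of baseline)\n",
                v_scaled * 1e3, 100.0 * v_scaled / base);
    std::printf("0.8x freq: %.3f mW (%.0f%% of baseline)\n",
                f_scaled * 1e3, 100.0 * f_scaled / base);
    return 0;
}
```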
Power-Delay Tradeoff
Conventional interconnect design is performance oriented:
Low latency
High power dissipation
Power can be reduced by tolerating some delay penalty
Total Power = Leakage Power + Dynamic Power + Short-Circuit Power
Reducing repeater size lowers all three components (leakage, dynamic, and short-circuit power)
Decreasing the number of repeaters lowers all three components as well
In both cases, latency increases
Power Reduction
Ref: Banerjee et al., IEEE Transactions on Electron Devices, 2002
Impact of Power-centric Design
Delay-optimized case: wires optimized for delay
Power-optimized case: wires optimized for power
Performance difference: 20%
[Chart: IPC per SPEC benchmark for the delay-optimized and power-optimized wire configurations]
Heterogeneous Interconnects
Proposed design: implement wires with varied characteristics
Delay-optimized interconnect
Power-optimized interconnect: latencies twice those of the delay-optimized wires, 80% reduction in power (by focusing on repeaters alone)
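One way to picture the proposal is as two link classes with different latency and energy characteristics. The sketch below is only illustrative: the 2x latency and 80% repeater-power reduction come from the slide, while the absolute numbers and type names are assumptions.

```cpp
// Illustrative model of the two interconnect classes described above.
// The 2x latency and ~80% repeater-power reduction come from the slide;
// base latency and energy values are assumed for illustration only.
#include <cstdio>
#include <initializer_list>

struct WireClass {
    const char* name;
    int         latency_cycles;   // one-hop transfer latency
    double      energy_per_bit;   // relative energy cost per bit
};

int main() {
    const WireClass delay_optimized = {"delay-optimized", 2, 1.00};
    // Power-optimized wires: twice the latency, roughly 80% less repeater power.
    const WireClass power_optimized = {"power-optimized", 2 * 2, 0.20};

    for (const WireClass& w : {delay_optimized, power_optimized}) {
        std::printf("%-16s latency = %d cycles, relative energy/bit = %.2f\n",
                    w.name, w.latency_cycles, w.energy_per_bit);
    }
    return 0;
}
```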
Outline
Motivation & Proposed solution
Base Architecture
Interconnect Transfers
Results
Conclusion & Future work
Architecture for Evaluation
A dynamically scheduled clustered model with 16 clusters
Hierarchical interconnects: crossbar and ring
Centralized front-end: I-Cache & D-Cache, LSQ, branch predictor
Four FUs per cluster
[Diagram: 16 clusters connected by crossbars (1 cycle latency) and a ring interconnect (4 cycle latency), with centralized I-Cache, D-Cache, and LSQ]
Simulator Parameters
SimpleScalar with contention modeled in detail
15-entry out-of-order issue queue in each cluster (int & fp each)
30 physical registers (int & fp each)
In-flight window: 480 instructions
Inter-cluster latencies:
Delay-optimized: 2-10 cycles
Power-optimized: 4-20 cycles
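For reference, the parameters above can be restated as a single configuration record; the field names below are our own, and the values are simply those quoted on this slide.

```cpp
// Restatement of the simulated configuration as a plain struct.
// Field names are our own; values are the ones quoted on the slide.
struct ClusterConfig {
    int num_clusters        = 16;   // dynamically scheduled clustered model
    int issue_queue_entries = 15;   // per cluster, int and fp each
    int physical_registers  = 30;   // int and fp each
    int in_flight_window    = 480;  // in-flight instructions
    int fast_latency_min    = 2;    // delay-optimized inter-cluster hops
    int fast_latency_max    = 10;
    int slow_latency_min    = 4;    // power-optimized inter-cluster hops
    int slow_latency_max    = 20;
};
```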
Interconnect Transfers - Types
[Pie chart: breakdown of transfers in the interconnect - ready register values 24%, bypassed register values 31%, load data 16%, store data 4%, address transfers 25%]
Bypassed Register Values
Operands produced in one cluster that are immediately required by another cluster
Criticality is based on two factors:
Operand arrival time at the consuming cluster
Actual issue time of the sourcing instruction
Criticality changes at runtime, so a dynamic predictor is needed
[Diagram: clustered pipeline (rename & dispatch feeding per-cluster issue queues, register files, and FUs); the producing instruction completes execution at cycle 120 while the consumer instruction is dispatched at cycle 100]
The Data Criticality Predictor
A table indexed by the lower-order bits of the instruction address, updated dynamically to indicate the criticality of the data
The difference between arrival time and use time is computed for each operand of an instruction
Difference < threshold: critical
Difference > threshold: non-critical
Ready Register Values
Source operands that are already available at the time of dispatch
Premise: significant latency between dispatch and issue
Latency tolerant, so they are sent on the power-optimized wires
[Diagram: the operand is ready at cycle 90 while the consumer instruction is dispatched only at cycle 100, so the transfer can tolerate the slower wires]
Load & Store Data
Store data: often non-critical
Impact of delayed stores (rare cases): dependent loads have to wait; commit stalls if the store is at the head of the reorder buffer
Latency insensitive, so store data uses the power-optimized network
Load data: critical! Often on the critical path
Latency sensitive, so load data uses the fast network
Address Prediction
High-confidence predictions cover 51% of effective address transfers
When the address is predicted with high confidence, the actual address transfer is needed only for verification and can use the power-optimized wires
[Diagram: cluster with an address predictor (AP) alongside the FU and register file sending predicted effective addresses to the LSQ and L1 cache, contrasted with a cluster that sends the computed address directly]
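The slide does not say what kind of address predictor is used; as one plausible illustration, the sketch below uses a PC-indexed stride predictor with a small confidence counter to mark high-confidence predictions.

```cpp
// One plausible form of the effective-address predictor: a PC-indexed stride
// predictor with a confidence counter. This is an illustrative assumption;
// the slide does not specify the prediction mechanism.
#include <array>
#include <cstddef>
#include <cstdint>

class StrideAddressPredictor {
public:
    struct Prediction {
        uint64_t address;
        bool     high_confidence;
    };

    Prediction predict(uint64_t pc) const {
        const Entry& e = table_[index(pc)];
        return {e.last_addr + e.stride, e.confidence >= 3};
    }

    void train(uint64_t pc, uint64_t actual_addr) {
        Entry& e = table_[index(pc)];
        int64_t new_stride = static_cast<int64_t>(actual_addr - e.last_addr);
        if (new_stride == e.stride) { if (e.confidence < 3) ++e.confidence; }
        else                        { e.confidence = 0; e.stride = new_stride; }
        e.last_addr = actual_addr;
    }

private:
    struct Entry {
        uint64_t last_addr  = 0;
        int64_t  stride     = 0;
        uint8_t  confidence = 0;   // saturating confidence counter
    };

    static constexpr std::size_t kEntries = 1024;  // assumed table size
    static std::size_t index(uint64_t pc) { return pc & (kEntries - 1); }

    std::array<Entry, kEntries> table_{};
};
```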
Summary of Transfers
Critical: load values; unpredicted effective addresses; bypassed register values predicted critical
Non-Critical: store values; predicted effective addresses; bypassed register values predicted non-critical; ready register values
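This summary amounts to a steering rule that selects a network for each transfer. A minimal sketch is below; the enum and function names are our own, not from the paper.

```cpp
// Minimal sketch of the steering rule implied by the summary above.
// Enum and function names are our own.
enum class TransferType {
    LoadValue, StoreValue, EffectiveAddress, ReadyRegister, BypassedRegister
};

enum class Network { DelayOptimized, PowerOptimized };

// 'non_critical_hint' applies to effective addresses (high-confidence
// predictor hit) and to bypassed registers (criticality predictor says
// non-critical); it is ignored for the other transfer types.
Network pick_network(TransferType t, bool non_critical_hint) {
    switch (t) {
        case TransferType::LoadValue:     return Network::DelayOptimized;
        case TransferType::StoreValue:    return Network::PowerOptimized;
        case TransferType::ReadyRegister: return Network::PowerOptimized;
        case TransferType::EffectiveAddress:
        case TransferType::BypassedRegister:
            return non_critical_hint ? Network::PowerOptimized
                                     : Network::DelayOptimized;
    }
    return Network::DelayOptimized;  // unreachable; keeps compilers happy
}
```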
Outline
Motivation & Proposed solution
Base Architecture
Interconnect Transfers
Simulation results
Conclusion
Methodology
Three cases for simulation:
High-performance case: a clustered model with only delay-optimized wires
Low-power case: a clustered model with all power-optimized wires
Criticality-based case: a clustered model using heterogeneous wires
Results
[Chart: per-benchmark IPC analysis, stacking the low-power IPC, the criticality-based IPC minus the low-power IPC, and the high-performance IPC minus the criticality-based IPC]
Performance loss of the criticality-based case relative to the high-performance case: 2.5%
Performance loss of the low-power case relative to the high-performance case: 20%
Results
[Chart: per-benchmark percentage of non-critical transfers versus percentage IPC loss]
Summary of non-critical interconnect transfers
[Pie chart: breakdown of interconnect transfers by class - effective address predicted, unpredicted address, bypassed non-critical, bypassed critical, ready register, store value, load value; slices labelled 13%, 24%, 4%, 8%, 23%, 16%, 12%]
Result summary
Two kinds of non-critical transfers:
Data that are not immediately used: 38%
Verification of address predictions: 13%
Criticality-based case:
49% of all data transfers go through the power-optimized wires
Performance penalty: only 2.5%
Potential energy savings of around 50% in the interconnects
Related Work
Several heuristics proposed for data criticality – Tune et al. [HPCA-7], Srinivasan et al. [ISCA-28]
Redirection of instructions to units based on criticality – Seng et al. [MICRO 2001]
Balasubramonian et al. evaluated heterogeneous cache banks [MICRO 2003]
Banerjee and Mehrotra developed an analytical model for designing interconnects for a given delay penalty [IEEE Trans. Electron Devices 2002]
Future Work
Other metrics for data criticality prediction (e.g., low-confidence branches)
Application of heterogeneous interconnects in other parts of the microprocessor (caches, etc.)
Other configurations of the heterogeneous interconnect
Conclusion
A single interconnect optimized for delay or power alone is not enough
A heterogeneous interconnect model alleviates this problem
The criticality predictor efficiently identifies non-critical data
49% of transfers go on the non-critical (power-optimized) network, with a performance loss of only 2.5%
Questions?
Thank You