Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Variable-Width Datapath for On-Chip Network Static Power Reduction
George Michelogiannakis, John Shalf
Postdoctoral Research Fellow
Computer Architecture Laboratory
Lawrence Berkeley National Laboratory
In a Nutshell
Leakage power is an increasing problem in future or near threshold
voltage (NTV) technologies
Leakage power can be important even at high network loads
This work proposes variable-width datapaths
Parts of channels, buffers, and crossbars can be activated on
demand
We demonstrate an average of 33% total power reduction with
PARSEC benchmarks
Today’s Menu
Leakage power / motivation
Related work
Variable-width datapaths
Results
Conclusions
Leakage Power Contribution
32nm (above). 45nm (below). Orion 2.0.
[Top left] “FlexiBuffer: Reducing Leakage Power in On-
Chip Network Routers”. DAC 2011
[Rest] “NoRD: Node-Router Decoupling for Effective
Power-gating of On-Chip Routers”. MICRO 2012
Single router. 45nm. 1.0V Orion 2.0.
PARSEC benchmarks
Subthreshold Leakage at NTV
0%
10%
20%
30%
40%
50%
60%
45nm 32nm 22nm 14nm 10nm 7nm 5nm
SD
Leakag
e P
ow
er
100% Vdd
75% Vdd
50% Vdd
40% Vdd
Increasing
Variations
NTV operation reduces total power, improves energy efficiency
Subthreshold leakage power is substantial portion of the total
Near-threshold voltage (NTV) design — Opportunities and challenges. DAC 2012
Assumes 20% leakage power at 100% Vdd
Applications Load Network Unevenly
“Fine-grained bandwidth adaptivity in networks-on-chip using bidirectional channels”. NOCS 2012
Today’s Menu
Leakage power / motivation
Related work
Variable-width datapaths
Results
Conclusions
Power Gating
High threshold voltage “sleep switch” transistor
Savings when sleep time enough to overcome energy overheads
“MP3: Minimizing Performance Penalty for Power-gating of Clos Network-on-Chip”. HPCA 2014
Drowsy SRAMs
Put SRAM lines into low-power (low voltage) “drowsy” mode
Preserves data
Faster activation than power-gated SRAMs (1-2 cycles)
Higher leakage current while drowsy. Higher activation penalty
“Drowsy Caches: Simple Techniques for Reducing Leakage Power”. ISCA 2002
CatNap: Multiple Networks
Router
Router
Router
Router
Router
Router
Router
Router
Router
Router
Router
Router
Router
Router
Router
Router
Router
Router
“Catnap: Energy Proportional Multiple Network-on-Chip”. ISCA 2013
High traffic area
Source
Low traffic
injection
Multinets cannot share resources (e.g.,
channels) in low traffic regions
Optimal decisions at injection
are a challenge
Today’s Menu
Leakage power / motivation
Related work
Variable-width datapaths
Results
Conclusions
Variable-Width Channels
Router Router
Router RouterPacket 1
Packet 1
Packet 2
Same bisection bandwidth
Input Buffer Gating
VC1
VC2
VC3
VC4
“Adding slow-silent virtual channels for low-power on-chip networks”. NOCS 2008
Router
Packet 1
Packet 2
VC1
VC2
If flits from any channel lane can choose any VC, that
necessasitates multiplexers
We map channel lanes to use only a subset of VCs (1-1 with equal
VCs and lanes)
VC 0
VC 1
VC 2
VC 3
Crossbar Gating
“Segment gating for static energy reduction in networks-on-chip”. NoCArc 2009
VC1
VC2
VC3
VC4
VC1
VC2
VC1 VC2 VC3 VC4VC1 VC4
Output VCs
Input VCs
Activation Mechanism
Flits winning switch allocation (SA) activate in the next router:
Output channels and switch lanes (3 cycles)
Input buffers (1 cycle with drowsy SRAMs)
No false activations
With the below 4-stage router pipeline, no activation stalls
InBuf VA SA ST
InBuf VA SA STSTInBuf
Router Impact
ABN switch allcocators: ( Inputs x VCs ) x ( Outputs x ChanLanes )
As long as ChanLanes no greater than VCs, switch allocator no more
complex than VC allocator
If VC and switch allocators in different pipeline stages, router cycle
time does not extend
VC allocators consume 2-10mW and occupy 5000um
Both very small percentages of the router
Therefore increase of switch allocator’s cost insigificant
“Allocator implementations for network- on-chip routers”. SC 2009
Today’s Menu
Leakage power / motivation
Related work
Variable-width datapaths
Results
Conclusions
Methodology
Booksim network simulator
8x8 Mesh. DOR
We compare:
Single-lane: Single-lane power-gated network
ABN: Flits choose a lane based on their output VC
Multinets: Multiple power-gated subnetworks
Router pipeline previously presented
2 VCs as baseline. Normalize for buffer size by adjusting VCs
Activation and deactivation delays (65nm at 1GHz):
Channel and crossbar activation delay: 3 cycles
Channel and crossbar activation wait: 15 cycles
Channel and crossbar deactivation wait: 6 cycles
Buffer (VC) deactivation wait: 3 cycles
Two Subnetworks/Lanes. Static Power
0 20 40 600
1
2
3
4
5
Injection rate (request packets/cycle * 1000)
Sta
tic p
ow
er
(W)
8x8 mesh. DOR. UR traffic
Single lane
ABN
MultiNets
UR worst case for ABNs
ABNs better powers down resources at high loads
Two Subnetworks/Lanes. Dynamic Power
0 10 20 30 40 500
2
4
6
8
Injection rate (request packets/cycle * 1000)
Dyn
am
ic p
ow
er
(W)
8x8 mesh. DOR. UR traffic
Single lane
ABN
MultiNets
UR worst case for ABNs
Multinets takes advantage of lower-radix switches
Two Subnetworks. Latency
0 20 40 600
50
100
150
200
Injection rate (request packets/cycle * 1000)
Ave
rag
e la
ten
cy (
clo
ck c
ycle
s)
8x8 mesh. DOR. UR traffic
Single lane
ABN
MultiNets
Multinets cannot make perfect injection decisions or use resources in
another subnetwork after injection to combat transient imbalance
PARSEC Results
0
5
10
15
20
25
30
35
40
Perc
en
tag
e t
ota
l p
ow
er
red
ucti
on
(%
)
Low Medium High
Two ABN lanes and two multinet subnetworks
Scaling Up
0
5
10
15
20
25
30
Perc
en
tag
e t
ota
l p
ow
er
red
ucti
on
(%
)
Low Medium High
Four ABN lanes and four multinet subnetworks
Scaling Up
0 10 20 30 40 500
50
100
150
200
250
Injection rate (request packets/cycle * 1000)
Ave
rag
e la
ten
cy (
clo
ck c
ycle
s)
8x8 mesh. DOR. UR traffic
ABN two lanes
ABN four lanes
Two multinets
Four multinets
Effects of transient imbalance in multinets are intensified
Today’s Menu
Leakage power / motivation
Related work
Variable-width datapaths
Results
Conclusions
Conclusions
Leakage power is a growing concern in future technologies
Dividing datapaths in lanes provides more flexibility than multi-
network approaches
But there are tradeoffs
Using drowsy SRAMs allows hiding the activation delay without
false activations
Can change with shallow router pipelines
We demonstrate an average of 33% total power reduction with
PARSEC benchmarks
Questions?
Acknowledgment: D.O.E. office of science