Power Management in High Performance Processors through Dynamic Resource Adaptation and Multiple Sleep Mode Assignments
Houman Homayoun
National Science Foundation Computing Innovation Fellow
Department of Computer Science, University of California San Diego
Copyright © 2010 Houman Homayoun, University of California San Diego
Outline – Multiple Sleep Mode
- Brief overview of a state-of-the-art superscalar processor
- The idea of multiple sleep mode design
- Architectural control of multiple sleep modes
- Results
- Conclusions
Superscalar Architecture
[Pipeline diagram: Fetch, Decode, Rename, Dispatch, Issue, Execute, Write-Back; with the Instruction Queue, Reservation Station, ROB, Load Store Queue, Logical and Physical Register Files, and functional units (F.U.)]
On-chip SRAMs+CAMs and Power
On-chip SRAMs and CAMs in high-performance processors are large:
- Branch Predictor
- Reorder Buffer
- Instruction Queue
- Instruction/Data TLB
- Load and Store Queue
- L1 Data Cache
- L1 Instruction Cache
- L2 Cache

Together they occupy more than 60% of the chip area budget and dissipate a significant portion of power via leakage.

[Pentium M processor die photo, courtesy of intel.com]
Techniques to Address Leakage in SRAM+CAM
Circuit techniques:
- Gated-Vdd, Gated-Vss
- Voltage Scaling (DVFS)
- ABB-MTCMOS
- Forward Body Biasing (FBB), Reverse Body Biasing (RBB)
- Sleepy Stack
- Sleepy Keeper

Architecture techniques:
- Way Prediction, Way Caching, Phased Access: predict or cache recently accessed ways, or read tags first
- Drowsy Cache: keeps cache lines in a low-power state with data retention
- Cache Decay: evict lines not used for a while, then power them down
- Applying DVS, Gated-Vdd, or Gated-Vss to the memory cells, with substantial architectural support
Sleep Transistor Stacking Effect
Subthreshold leakage current is an inverse exponential function of the threshold voltage.

Stacking transistor N on top of sleep transistor slpN: when both transistors are off, the source-to-body voltage (V_M) of transistor N increases, which raises its threshold voltage through the body effect and thereby reduces its subthreshold leakage current:

$V_T = V_{T0} + \gamma\,(\sqrt{2\phi_F + V_{SB}} - \sqrt{2\phi_F})$

[Circuit diagram: transistor N stacked on footer sleep transistor slpN between the load capacitance C_L (at voltage V_C) and vss; V_M is the virtual-ground node between N and slpN; V_gn and V_gslpn are the gate voltages of N and slpN]

Drawbacks: increased rise time, fall time, wakeup delay, area, dynamic power, and instability.
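The stacking effect can be sketched numerically: the body effect raises V_T as the source-to-body voltage grows, and subthreshold leakage falls exponentially with V_T. A minimal sketch follows; all device parameter values (V_T0, gamma, phi_F, the subthreshold slope factor) are illustrative assumptions, not values from the slides.

```python
import math

# Body-effect model: stacking raises the source-to-body voltage V_SB of the
# stacked transistor, which raises V_T and cuts subthreshold leakage
# exponentially. Parameter values are illustrative assumptions.
VT0 = 0.30         # zero-bias threshold voltage (V), assumed
GAMMA = 0.40       # body-effect coefficient (V^0.5), assumed
PHI_F = 0.35       # Fermi potential (V), assumed
N_FACTOR = 1.5     # subthreshold slope factor, assumed
V_THERMAL = 0.026  # kT/q at room temperature (V)

def threshold_voltage(v_sb):
    """V_T = V_T0 + gamma * (sqrt(2*phi_F + V_SB) - sqrt(2*phi_F))."""
    return VT0 + GAMMA * (math.sqrt(2 * PHI_F + v_sb) - math.sqrt(2 * PHI_F))

def relative_subthreshold_leakage(v_sb):
    """Leakage relative to V_SB = 0: I is proportional to exp(-V_T/(n*vT))."""
    dvt = threshold_voltage(v_sb) - threshold_voltage(0.0)
    return math.exp(-dvt / (N_FACTOR * V_THERMAL))

# Even a modest virtual-ground voltage V_M noticeably suppresses leakage.
for v_m in (0.0, 0.05, 0.10, 0.20):
    print(f"V_SB = {v_m:.2f} V -> relative leakage = "
          f"{relative_subthreshold_leakage(v_m):.3f}")
```

With these assumed parameters, the leakage ratio drops monotonically as V_M rises, matching the qualitative trend on the slide.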
Wakeup Latency
To benefit the most from the leakage savings of sleep transistor stacking, keep the gate bias voltage of an NMOS sleep transistor as low as possible (and of a PMOS sleep transistor as high as possible).

Drawback: this maximizes the impact on the circuit's wakeup latency (sleep transistor wakeup delay plus sleep signal propagation delay).

The trade-off can be tuned by controlling the gate voltage of the sleep transistors: increasing the gate voltage of a footer sleep transistor reduces the virtual-ground voltage (V_M), which reduces the circuit's wakeup delay overhead but also reduces the leakage power savings.
Increasing the bias voltage increases the leakage power while decreasing the wakeup delay overhead.

[Chart: normalized leakage power (1.0 to 5.0) and normalized wake-up delay (0.2 to 1.0) versus the (footer, header) gate bias voltage pair]
Wakeup Delay vs. Leakage Power Reduction
There is a trade-off between the wakeup delay overhead and the leakage power savings.
Multiple Sleep Modes Specifications
The wakeup delay varies from 1 to more than 10 processor cycles (at 2.2 GHz), and large SRAMs incur a large wakeup power overhead. We therefore need to find periods of infrequent access.
On-chip SRAM multiple sleep mode normalized leakage power savings:

Mode      BPRED  FRF   IRF   IL1   DL1   L2    DTLB  ITLB
basic-lp  0.29   0.21  0.21  --    --    --    0.25  0.25
lp        0.43   0.31  0.31  0.37  0.37  --    0.34  0.34
aggr-lp   0.55   0.58  0.58  0.48  0.48  0.44  0.49  0.49
ultra-lp  0.67   0.65  0.65  0.69  0.64  0.63  0.57  0.57
Reducing Leakage in SRAM Peripherals
To maximize the leakage reduction, put the SRAM into the ultra low-power mode; this adds a few cycles to the SRAM access latency and significantly reduces performance.

To minimize performance degradation, put the SRAM into the basic low-power mode; this requires near-zero wakeup overhead but yields no noticeable leakage power reduction.
Motivation for Dynamically Controlling Sleep Mode
The ultra and aggressive low-power modes offer a large leakage reduction benefit; the basic-lp mode offers a low performance impact. Periods of frequent access call for the basic-lp mode, while periods of infrequent access call for the ultra and aggressive low-power modes. The sleep power mode should therefore be adjusted dynamically.
Architectural Motivations
A load miss in the L1/L2 caches takes a long time to service and prevents dependent instructions from being issued. When dependent instructions cannot issue, performance is lost, and at the same time, energy is lost as well. This is an opportunity to save energy.
Multiple Sleep Mode Control Mechanism
An L2 cache miss, or multiple pending DL1 misses, triggers a power mode transition. The general algorithm may not deliver optimal results for all units, so it is modified for the individual on-chip SRAM-based units to maximize the leakage reduction at no performance cost.
[State diagram over the modes basic-lp, lp, aggr-lp and ultra-lp; transition events include pending DL1 misses, a processor stall with 3 pending DL1 misses, pending L2 miss(es)/an L2 miss, and the processor continuing once all pending DL1 misses are serviced]
General state machine to control power mode transitions
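The control mechanism can be sketched in software. The state names and trigger events below come from the slides; the event interface and the exact transition priorities are assumptions for illustration, not the actual controller.

```python
# Sketch of the general power-mode state machine. Mode names and triggers
# follow the slides; the transition details are assumed for illustration.
BASIC_LP, LP, AGGR_LP, ULTRA_LP = "basic-lp", "lp", "aggr-lp", "ultra-lp"

class SleepModeController:
    def __init__(self):
        self.mode = BASIC_LP
        self.pending_dl1 = 0
        self.pending_l2 = 0

    def on_dl1_miss(self):
        self.pending_dl1 += 1
        self._update()

    def on_dl1_serviced(self):
        self.pending_dl1 = max(0, self.pending_dl1 - 1)
        self._update()

    def on_l2_miss(self):
        self.pending_l2 += 1
        self._update()

    def on_l2_serviced(self):
        self.pending_l2 = max(0, self.pending_l2 - 1)
        self._update()

    def _update(self):
        if self.pending_l2 > 0:        # pending L2 miss/es: long stall ahead
            self.mode = ULTRA_LP
        elif self.pending_dl1 >= 3:    # processor stall: 3 pending DL1 misses
            self.mode = AGGR_LP
        elif self.pending_dl1 > 0:     # some pending DL1 misses
            self.mode = LP
        else:                          # processor continues: all serviced
            self.mode = BASIC_LP
```

A unit deepens its sleep mode as the expected idle period grows, and returns to basic-lp as soon as all pending misses are serviced.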
Branch predictor: instructions per branch-predictor access (IPB)

ammp    4.5    equake   4.21   mcf      3.9    twolf    7.6
applu   324.1  facerec  20.0   mesa     11.0   vortex   5.7
apsi    28.9   galgel   14.3   mgrid    310.4  vpr      9.0
art     8.1    gap      14.2   parser   6.0    wupwise  8.7
bzip2   6.7    gcc      6.3    perlbmk  7.2    average  37.8
crafty  8.5    gzip     9.5    sixtrack 11.9
eon     8.2    lucas    25.6   swim     77.1
On average, 1 out of every 9 fetched instructions in the integer benchmarks, and 1 out of every 63 fetched instructions in the floating-point benchmarks, accesses the branch predictor. Always keeping the branch predictor in a deep low-power mode (lp, ultra-lp or aggr-lp) and waking it up on access causes noticeable performance degradation for some benchmarks.
Observation: Branch Predictor Access Pattern
[Plots: IPB measured every 512 cycles over a 1M-cycle window, for equake (IPB up to ~30) and swim (IPB up to ~350)]
Within a benchmark there is significant variation in Instructions Per Branch (IPB).
Once the IPB drops (or rises) significantly, it may remain low (or high) for a long period of time.
Distribution of the number of branches per 512-instruction interval (over 1M cycles)
Branch Predictor Peripherals Leakage Control
A high-IPB period can be identified as soon as its first interval is detected: the number of fetched branches is counted every 512 cycles, and once the branch count in an interval falls below a certain threshold (24 in this work), a high-IPB period is identified. The IPB is then predicted to remain high for the next twenty 512-cycle intervals (10K cycles).

The branch predictor peripherals transition from basic-lp mode to lp mode when a high-IPB period is identified. During pre-stall and stall periods, the branch predictor peripherals transition to aggr-lp and ultra-lp mode, respectively.
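The IPB-based part of the policy can be sketched as follows. The 512-cycle interval, the 24-branch threshold, and the twenty-interval (10K-cycle) prediction window come from the slides; the surrounding bookkeeping and method names are assumptions for illustration.

```python
# Sketch of the IPB-based mode controller for the branch predictor
# peripherals. Constants are from the slides; the interface is assumed.
INTERVAL_CYCLES = 512
BRANCH_THRESHOLD = 24
HIGH_IPB_INTERVALS = 20  # predicted duration of a high-IPB period

class BPredModeController:
    def __init__(self):
        self.mode = "basic-lp"
        self.branches_this_interval = 0
        self.high_ipb_intervals_left = 0

    def on_fetched_branch(self):
        self.branches_this_interval += 1

    def on_interval_end(self):
        """Called once every INTERVAL_CYCLES cycles."""
        if self.branches_this_interval < BRANCH_THRESHOLD:
            # Few branches fetched: high IPB, predictor is rarely accessed,
            # so predict the next 20 intervals (~10K cycles) stay high-IPB.
            self.high_ipb_intervals_left = HIGH_IPB_INTERVALS
        elif self.high_ipb_intervals_left > 0:
            self.high_ipb_intervals_left -= 1
        self.mode = "lp" if self.high_ipb_intervals_left > 0 else "basic-lp"
        self.branches_this_interval = 0
```

Pre-stall and stall periods would override this mode with aggr-lp and ultra-lp, as described above.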
Leakage Power Reduction
[Bar chart: normalized leakage power reduction (0% to 40%) for each benchmark and on average, broken down by the basic-lp, lp, aggr-lp and ultra-lp modes]

There is a noticeable contribution from both the ultra and basic low-power modes.
Outline – Resource Adaptation
- Why are the IQ, ROB and RF major power dissipators?
- Study of processor resource utilization during the service time of L2/multiple L1 misses
- An architectural approach to dynamically adjusting the size of resources during cache miss periods for power conservation
- Results
- Conclusions
Instruction Queue
The Instruction Queue is a CAM-like structure which holds instructions until they can be issued. It must:
- Set entries for newly dispatched instructions
- Read entries to issue instructions to the functional units
- Wake up instructions waiting in the IQ once a result is ready
- Select instructions for issue when the number of ready instructions exceeds the processor issue limit (issue width)

The main source of complexity is the wakeup logic.
Logical View of Instruction Queue
[Diagram: wakeup logic for one IQ entry. Each source tag cell has four one-bit comparators, one per broadcast tagline (tag00..tag03 through tagIW0..tagIW3); matchline1..matchline4 are pre-charged to Vdd, and their OR sets the operand's ready bit]

There is no need to always have such an aggressive wakeup/issue width!
At each cycle, the matchlines are pre-charged high to allow the individual bits of an instruction's tag to be compared with the results broadcast on the taglines. Upon a mismatch, the corresponding matchline is discharged; otherwise, the matchline stays at Vdd, which indicates a tag match. Since up to 4 instructions broadcast on the taglines each cycle, four sets of one-bit comparators are needed for each one-bit cell, and all four matchlines must be ORed together to detect a match on any of the broadcast tags. The result of the OR sets the ready bit of the instruction's source operand.
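Behaviorally, the wakeup logic amounts to comparing every waiting source tag against all broadcast destination tags and ORing the per-tag matches. A minimal sketch follows; the data structures and tag names are assumptions for illustration, not the actual circuit.

```python
# Behavioral sketch of IQ wakeup: each cycle, up to ISSUE_WIDTH result tags
# are broadcast; every waiting source operand is compared against all of them
# (the per-matchline one-bit comparators), and the OR of the matches sets the
# operand's ready bit.
ISSUE_WIDTH = 4

class IQEntry:
    def __init__(self, src_tags):
        # One ready bit per source operand; None means no operand is needed.
        self.src_tags = list(src_tags)
        self.ready = [tag is None for tag in src_tags]

    def is_ready(self):
        return all(self.ready)

def wakeup(entries, broadcast_tags):
    """Broadcast up to ISSUE_WIDTH result tags to every IQ entry."""
    assert len(broadcast_tags) <= ISSUE_WIDTH
    for entry in entries:
        for i, tag in enumerate(entry.src_tags):
            if tag is None or entry.ready[i]:
                continue
            # OR of the four matchline comparisons for this operand.
            entry.ready[i] = any(tag == b for b in broadcast_tags)
```

An entry waiting only on tag "p7" becomes ready after `wakeup([entry], ["p7"])`; halving the broadcast width halves the comparator activity, which is the lever the resizing scheme later exploits.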
ROB and Register File
The ROB and the register file are multi-ported SRAM structures with several functions:
- Setting entries for up to IW instructions in each cycle
- Releasing up to IW entries per cycle during the commit stage
- Flushing entries during branch recovery
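The three operations listed above can be sketched with a circular buffer. This is a behavioral sketch only; the sizes, the entry payload, and the simplification that commit releases head entries unconditionally are assumptions for illustration.

```python
# Minimal circular-buffer sketch of a Reorder Buffer: allocate up to IW
# entries per cycle at dispatch, release up to IW entries per cycle at
# commit, and flush everything on branch recovery. Sizes are assumed.
from collections import deque

ISSUE_WIDTH = 4   # IW, assumed
ROB_SIZE = 96     # assumed

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()

    def dispatch(self, insts):
        """Allocate entries for up to IW dispatched instructions."""
        assert len(insts) <= ISSUE_WIDTH
        allocated = 0
        for inst in insts:
            if len(self.entries) == ROB_SIZE:
                break  # ROB full: dispatch stalls
            self.entries.append(inst)
            allocated += 1
        return allocated

    def commit(self):
        """Release up to IW entries from the head, in program order."""
        released = []
        while self.entries and len(released) < ISSUE_WIDTH:
            released.append(self.entries.popleft())
        return released

    def flush(self):
        """Branch recovery: discard all in-flight entries."""
        self.entries.clear()
```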
[Pie charts: register file power breakdown.
Dynamic power: bitline and memory cell 58%, data output driver 29%, decode 8%, sense amp 4%, wordline 1%.
Leakage power: bitline and memory cell 63%, data output driver 15%, decode 11%, wordline 8%, sense amp 3%.]
Architectural Motivations
A load miss in the L1/L2 caches takes a long time to service and prevents dependent instructions from being issued. When dependent instructions cannot issue, the instruction window (ROB, Instruction Queue, Store Queue, Register Files) fills up after a number of cycles; the processor issue stalls and performance is lost. At the same time, energy is lost as well. This is an opportunity to save energy.

Scenario I: an L2 cache miss period.
Scenario II: three or more pending DL1 cache misses.
How Architecture Can Help Reduce Power in the ROB, Register File and Instruction Queue
[Bar chart: per-benchmark issue rate decrease (0% to 100%) during Scenario I and Scenario II]

Scenario I: the issue rate drops by more than 80%.
Scenario II: the issue rate drops by 22% for integer benchmarks and 32.6% for floating-point benchmarks.
A significant issue width decrease!
How Architecture Can Help Reduce Power in the ROB, Register File and Instruction Queue
Increase in ROB occupancy (%) during Scenario I and Scenario II:

Benchmark   Scen. I  Scen. II    Benchmark   Scen. I  Scen. II
bzip2       165.0    88.6        applu       13.8     -4.9
crafty      179.6    63.6        apsi        46.6     18.2
gap         6.6      61.7        art         31.7     56.9
gcc         97.7     43.9        equake      49.8     38.1
gzip        152.9    41.0        facerec     87.9     14.1
mcf         42.2     40.6        galgel      30.9     34.4
parser      31.3     102.3       lucas       -0.7     54.0
twolf       81.8     58.8        mgrid       8.8      5.6
vortex      118.7    57.8        swim        -4.3     11.4
vpr         96.6     55.7        wupwise     40.2     24.4
INT average 98.2     61.4        FP average  30.5     25.2
ROB occupancy grows significantly during Scenarios I and II for the integer benchmarks: 98% and 61% on average, respectively. The increase in ROB occupancy for the floating-point benchmarks is smaller: 30% and 25% on average for Scenarios I and II.
How Architecture Can Help Reduce Power in the ROB, Register File and Instruction Queue
Register File occupancy (%):

Benchmark  ScenI-IRF  nonScenI-IRF  ScenI-FRF  nonScenI-FRF  ScenII-IRF  nonScenII-IRF  ScenII-FRF  nonScenII-FRF
bzip2 74.4 28.8 0.0 0.0 56.6 30.7 0.0 0.0
crafty 83.4 31.9 0.1 0.0 51.4 32.2 0.0 0.0
gap 46.2 41.1 0.1 0.7 65.8 42.9 0.6 0.5
gcc 46.3 21.2 0.2 0.1 28.7 24.0 0.0 0.1
gzip 45.1 27.2 0.0 0.0 39.8 27.2 0.0 0.0
mcf 40.8 29.3 1.0 1.1 46.8 36.4 3.2 0.1
parser 37.4 29.8 0.0 0.0 57.0 29.8 0.1 0.0
twolf 58.7 32.3 2.6 2.1 46.0 29.8 2.5 2.0
vortex 70.9 31.1 0.3 0.2 52.4 35.0 0.2 0.2
vpr 63.9 29.0 7.8 8.6 66.4 41.0 8.7 8.3
INT average 55.3 29.2 1.1 1.2 50.3 32.0 1.4 1.0
applu 6.0 5.6 76.6 64.8 1.7 6.2 77.3 73.7
apsi 16.1 18.3 65.7 37.6 15.8 17.9 58.8 43.6
art 35.4 25.0 36.2 30.7 23.0 29.0 42.9 6.3
equake 34.2 27.4 16.1 7.1 32.7 29.4 21.0 9.6
facerec 52.6 22.5 50.0 28.9 30.3 38.4 48.1 35.0
galgel 50.4 27.4 41.8 48.7 32.1 26.0 61.0 44.2
lucas 21.7 23.8 47.7 44.0 41.7 22.1 29.7 47.0
mgrid 5.9 6.2 90.0 80.7 1.9 6.4 96.7 87.2
swim 23.3 27.8 77.1 78.1 29.7 23.1 87.1 76.2
wupwise 26.3 28.8 53.5 28.7 40.5 26.9 38.0 42.2
FP average 26.6 20.9 56.5 44.7 24.0 22.1 56.2 46.0
IRF occupancy always grows in both scenarios when running the integer benchmarks. A similar pattern holds for the FRF when running the floating-point benchmarks, but only during Scenario II.
Proposed Architectural Approach
Adaptive resource resizing during cache miss periods:
- Reduce the issue and wakeup width of the processor during the L2 miss service time.
- Increase the size of the ROB and RF during the L2 miss service time or when at least three DL1 misses are pending.
- A simple resizing scheme is used (resize to half): it is not necessarily optimized for individual units, but it is simple to implement in circuit.
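The issue/wakeup-width half of the policy can be sketched as follows: halve the width while an L2 miss is outstanding and restore it when all misses are serviced. The width values and the event interface are assumptions for illustration, not the actual circuit-level mechanism.

```python
# Sketch of the adaptive resizing policy for the issue/wakeup width.
# Full width and the event interface are assumed.
FULL_ISSUE_WIDTH = 4

class IssueWidthController:
    def __init__(self):
        self.pending_l2_misses = 0
        self.issue_width = FULL_ISSUE_WIDTH

    def on_l2_miss(self):
        self.pending_l2_misses += 1
        self._resize()

    def on_l2_miss_serviced(self):
        self.pending_l2_misses = max(0, self.pending_l2_misses - 1)
        self._resize()

    def _resize(self):
        # The issue rate drops by more than 80% during L2 miss service
        # (Scenario I), so a halved issue/wakeup width costs little.
        if self.pending_l2_misses > 0:
            self.issue_width = FULL_ISSUE_WIDTH // 2
        else:
            self.issue_width = FULL_ISSUE_WIDTH
```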
Results
Small performance loss (~1%) with 15~30% dynamic and leakage power reduction.

[Bar charts: power (dynamic/leakage) reduction for the INT/FP register files (0% to 40%); IPC degradation (0% to 6%); power reduction for ROB leakage, ROB dynamic, and the Issue Queue (0% to 50%)]
Conclusions
- Introduced the idea of multiple sleep mode design
- Applied multiple sleep modes to on-chip SRAMs: find periods of low activity for state transitions
- Introduced the idea of resource adaptation
- Applied resource adaptation to on-chip SRAMs+CAMs: find periods of low activity for state transitions
- Applying similar adaptive techniques to other energy-hungry resources in the processor: multiple sleep mode functional units