Power, Temperature, Reliability and Performance-Aware Optimizations in On-Chip SRAMs
Houman Homayoun
PhD Candidate
Dept. of Computer Science, UC Irvine
Outline

Past Research
- Low Power Design
  - Power Management in Cache Peripheral Circuits (CASES-2008, ICCD-2008, ICCD-2007, TVLSI, CF-2010)
  - Clock Tree Leakage Power Management (ISQED-2010)
- Thermal-Aware Design
  - Thermal Management in Register File (HiPEAC-2010)
- Reliability-Aware Design
  - Process Variation Aware Cache Architecture for Aggressive Voltage-Frequency Scaling (DATE-2009, CASES-2009)
- Performance Evaluation and Improvement
  - Adaptive Resource Resizing for Improving Performance in Embedded Processor (DAC-2008, LCTES-2008)
Current Research
- Inter-core Selective Resource Pooling in 3D Chip Multiprocessor
- Extend Previous Work (for Journal Publication!)
Leakage Power Management in Cache Peripheral Circuits
Outline: Leakage Power in Cache Peripherals
- L2 cache power dissipation: why cache peripherals?
- Circuit techniques to reduce leakage in peripherals (ICCD-08, TVLSI)
- Static approach to reduce leakage in the L2 cache (ICCD-07)
- Adaptive techniques to reduce leakage in the L2 cache (ICCD-08)
- Reducing leakage in the L1 cache (CASES-2008)
On-chip Caches and Power
- On-chip caches in high-performance processors are large: more than 60% of the chip budget
- They dissipate a significant portion of power via leakage
- Much of this leakage used to be in the SRAM cells, and many architectural techniques have been proposed to remedy it
- Today there is also significant leakage in the peripheral circuits of an SRAM (cache), in part because cell design has already been optimized

[Pentium M processor die photo, courtesy of intel.com]
Peripherals?
- Data input/output drivers
- Address input/output drivers
- Row pre-decoder
- Wordline drivers
- Row decoder
- Others: sense-amps, bitline pre-chargers, memory cells, decoder logic

[Diagram: SRAM organization — address input global drivers, predecoder and global wordline drivers, row decoders, global/local wordlines, bitlines, sense amps, and global output drivers]
Why Peripherals?
- Cells use minimum-sized transistors for area reasons, while peripherals use larger, faster, and accordingly leakier transistors to satisfy timing requirements
- Cells use high-Vt transistors, while peripherals use typical-threshold-voltage transistors
[Chart: leakage power per device (pW, log scale from 1 to 100,000) for a memory cell vs. inverters of increasing drive strength (INVX through INV32X); peripheral inverters leak roughly 200X to 6300X more than a memory cell]
Leakage Power Components of L2 Cache
SRAM peripheral circuits dissipate more than 90% of the total leakage power
- Global address input drivers: 11%
- Global data input drivers: 14%
- Global row predecoder: 1%
- Local row decoders: 33%
- Local data output drivers: 8%
- Global data output drivers: 25%
- Others: 8%
Leakage Power as a Fraction of L2 Power Dissipation
- L2 cache leakage power dominates its dynamic power: above 87% of the total
[Chart: leakage vs. dynamic power as a fraction of total L2 power across SPEC2K benchmarks (ammp through wupwise, plus average)]
Circuit Techniques to Address Leakage in the SRAM Cell
- Gated-Vdd, Gated-Vss
- Voltage scaling (DVFS)
- ABB-MTCMOS
- Forward Body Biasing (FBB), RBB
- Sleepy Stack
- Sleepy Keeper
All of these target the SRAM memory cell.
Architectural Techniques
- Way Prediction, Way Caching, Phased Access: predict or cache recently accessed ways, or read the tag first
- Drowsy Cache: keeps cache lines in a low-power state, with data retention
- Cache Decay: evicts lines not used for a while, then powers them down
- Applying DVS, Gated-Vdd, or Gated-Vss to the memory cell, with much architectural support proposed to do so
All of these target the cache SRAM memory cell.

Multiple Sleep Modes
Zig-Zag Horizontal and Vertical Sleep Transistor Sharing
Sleep Transistor Stacking Effect
- Subthreshold current is an inverse exponential function of the threshold voltage
- Stacking transistor N with slpN: the source-to-body voltage (V_M) of transistor N increases, which reduces its subthreshold leakage current when both transistors are off (see the numeric sketch below)

$$V_T = V_{T0} + \gamma\left(\sqrt{2\phi_F + V_{SB}} - \sqrt{2\phi_F}\right)$$

[Circuit diagram: driver transistor N stacked on footer sleep transistor slpN, with the virtual ground node V_M between them]
Drawbacks: increased rise time, fall time, wakeup delay, area, and dynamic power; potential instability
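To make the stacking effect concrete, here is a minimal Python sketch. The device parameters (VT0, body-effect coefficient, Fermi potential, subthreshold slope factor) are illustrative placeholders, not values from the slides:

```python
import math

def vt_body_effect(vt0=0.30, gamma=0.4, phi_f=0.45, v_sb=0.0):
    """Body effect: VT = VT0 + gamma*(sqrt(2*phi_F + V_SB) - sqrt(2*phi_F))."""
    return vt0 + gamma * (math.sqrt(2 * phi_f + v_sb) - math.sqrt(2 * phi_f))

def leakage_ratio(delta_vt, n=1.5, v_t=0.026):
    """Subthreshold current scales as exp(-VT/(n*vT)), so a VT increase
    of delta_vt multiplies leakage by exp(-delta_vt/(n*vT))."""
    return math.exp(-delta_vt / (n * v_t))

# With the sleep transistor on, V_M ~ 0; when both transistors are off,
# V_M rises (say to 100 mV), raising VT of the stacked device:
vt_on = vt_body_effect(v_sb=0.0)
vt_off = vt_body_effect(v_sb=0.10)
print(f"VT: {vt_on:.3f} V -> {vt_off:.3f} V")
print(f"leakage scaled by ~{leakage_ratio(vt_off - vt_on):.2f}")
# The full stacking benefit is larger still: the raised V_M also makes
# VGS of the stacked transistor negative and reduces VDS (less DIBL).
```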
A Redundant Circuit Approach
[Circuit diagram: redundant sleep-transistor scheme — each inverter stage in the wordline driver chain (P1/N1 through P4/N4) gets its own header (slpP) and footer (slpN) sleep transistors, all controlled by the sleep signal]
Drawback: impact on the wordline driver's output rise time, fall time, and propagation delay
Impact on Rise Time and Fall Time
- The rise time and fall time of an inverter's output are proportional to Rpeq x CL and Rneq x CL, respectively
- Inserting the sleep transistors increases both Rneq and Rpeq

[Circuit diagram: inverter stages with header/footer sleep transistors and their leakage paths]

- Increase in rise time and fall time
- Impact on performance
- Impact on memory functionality
A Zig-Zag Circuit
- Rpeq for the first and third inverters and Rneq for the second and fourth inverters don't change
- The fall time of the circuit therefore does not change

[Circuit diagram: zig-zag scheme — sleep transistors alternate along the inverter chain (footer slpN1, header slpP2, footer slpN3, header slpP4), all controlled by the sleep signal]
A Zig-Zag Share Circuit
- To improve the leakage reduction and area-efficiency of the zig-zag scheme, one set of sleep transistors is shared between multiple stages of inverters
- Zig-Zag Horizontal Sharing
- Zig-Zag Horizontal and Vertical Sharing
Zig-Zag Horizontal Sharing
- Comparing zz-hs with the zigzag scheme at the same area overhead: zz-hs has less impact on rise time, and both reduce leakage by almost the same amount

[Circuit diagram: zz-hs — one shared header (slpP) and one double-width shared footer (2x slpN) serve all four inverter stages (P1/N1 through P4/N4) of the wordline driver]
Zig-Zag Horizontal and Vertical Sharing

[Circuit diagram: zz-hvs — wordline driver lines K and K+1 share a single header (slpP) and footer (slpN) pair across all of their inverter stages]
Leakage Reduction of ZZ Horizontal and Vertical Sharing
- An increase in the virtual ground voltage increases the leakage reduction
- With vertical sharing, the shared footer carries the leakage of two drivers, so the virtual ground voltage rises further (V_M2 > V_M1) and the leakage reduction improves; the analysis expresses V_M1 and V_M2 as logarithmic functions of the ratio of driver width to sleep transistor width

[Circuit diagrams: (a) one driver (N11) per footer slpN, with virtual ground V_M1; (b) two drivers (N11, N21) sharing one footer slpN, with virtual ground V_M2]
ZZ-HVS Evaluation: Power Results
- Increasing the number of wordline rows that share sleep transistors increases the leakage reduction and reduces the area overhead
- Leakage power reduction varies from 10X to 100X as 1 to 10 wordlines share the same sleep transistors
- 2~10X more leakage reduction compared to the zig-zag scheme

[Chart: leakage power (nW, log scale) vs. number of wordline rows sharing sleep transistors, for baseline, redundant, zigzag, zz-hs, and zz-hvs]
Wakeup Latency
- To benefit the most from the leakage savings of stacking sleep transistors, keep the gate bias voltage of the NMOS sleep transistor as low as possible (and of the PMOS as high as possible)
- Drawback: this impacts the wakeup latency of the wordline drivers
- The gate voltage of the sleep transistors can be controlled: increasing the gate voltage of the footer sleep transistor reduces the virtual ground voltage (V_M), which reduces the circuit wakeup delay overhead but also reduces the leakage power savings
Wakeup Delay vs. Leakage Power Reduction

[Chart: normalized leakage power and normalized wake-up delay across (footer, header) gate bias voltage pairs]

- There is a trade-off between the wakeup overhead and the leakage power saving
- Increasing the bias voltage increases the leakage power while decreasing the wakeup delay overhead
Multiple Sleep Modes

power mode | wakeup delay (cycles) | leakage reduction (%)
basic-lp   | 1 | 42
lp         | 2 | 75
aggr-lp    | 3 | 81
ultra-lp   | 4 | 90

- The power overhead of waking up the peripheral circuits is almost equivalent to the switching power of the sleep transistors
- Sharing a set of sleep transistors horizontally and vertically across the multiple stages of a (wordline) driver makes the power overhead even smaller
(A mode-selection sketch follows below.)
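As a concrete illustration of how a controller might use this table, here is a minimal mode-selection sketch. The mode names, wakeup delays, and leakage reductions are from the table above; the amortization threshold, function names, and example idle lengths are assumptions:

```python
# Illustrative sketch (not from the slides): pick the deepest sleep mode
# whose wakeup delay is worth paying for a predicted idle period.

SLEEP_MODES = [
    # (name, wakeup_delay_cycles, leakage_reduction) -- from the table
    ("basic-lp", 1, 0.42),
    ("lp",       2, 0.75),
    ("aggr-lp",  3, 0.81),
    ("ultra-lp", 4, 0.90),
]

def pick_mode(predicted_idle_cycles, min_idle_per_wakeup_cycle=100):
    """Choose the deepest mode whose wakeup delay is amortized over the
    predicted idle period (the threshold is an assumed heuristic)."""
    best = None
    for name, delay, reduction in SLEEP_MODES:
        if predicted_idle_cycles >= delay * min_idle_per_wakeup_cycle:
            best = (name, delay, reduction)
    return best

print(pick_mode(150))   # short idle -> basic-lp
print(pick_mode(1000))  # long idle  -> ultra-lp
```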
Reducing Leakage in L2 Cache Peripheral Circuits
Using Zig-Zag Share Circuit Technique
Static Architectural Techniques: SM
- The SM technique (ICCD'07) asserts the sleep signal by default
- Wakes up the L2 peripherals on an access to the cache
- Keeps the cache in the normal state for J cycles (the turn-on period) before returning it to stand-by mode (SM_J); there is no wakeup penalty during this period
- A larger J leads to lower performance degradation but lower energy savings
Static Architectural Techniques: IM
- The IM technique (ICCD'07) monitors the issue logic and functional units of the processor after an L2 cache miss
- Asserts the sleep signal if the issue logic has not issued any instructions and the functional units have not executed any instructions for K consecutive cycles (K=10)
- De-asserts the sleep signal M cycles before the miss is serviced
- No performance loss
(Both policies are sketched in code below.)
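Both policies can be restated as small per-cycle controllers. The sketch below is a behavioral restatement of SM_J and IM as described above, not the authors' hardware; the signal names are assumptions, and J=750 matches the SM-750 configuration used later:

```python
class SMController:
    """SM_J: sleep asserted by default; wake on access and stay in the
    normal state for J cycles (the turn-on period)."""
    def __init__(self, j=750):
        self.j = j
        self.turn_on_left = 0

    def step(self, l2_access):
        if l2_access:
            self.turn_on_left = self.j      # (re)start turn-on period
        sleep_asserted = self.turn_on_left == 0
        if self.turn_on_left:
            self.turn_on_left -= 1
        return sleep_asserted

class IMController:
    """IM: after an L2 miss, assert sleep once the issue logic and
    functional units have been idle K consecutive cycles; de-assert M
    cycles before the miss is serviced."""
    def __init__(self, k=10):
        self.k = k
        self.idle_cycles = 0

    def step(self, l2_miss_pending, issued_or_executed, miss_done_within_m):
        self.idle_cycles = 0 if issued_or_executed else self.idle_cycles + 1
        return (l2_miss_pending
                and self.idle_cycles >= self.k
                and not miss_done_within_m)
```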
More Insight on SM and IM
- For some benchmarks the SM and IM techniques are both effective: facerec, gap, perlbmk, and vpr
- IM works well in almost half of the benchmarks but is ineffective in the other half
- SM works well in about half of the benchmarks, but not the same benchmarks as IM
[Chart: leakage reduction of IM vs. SM-750 across SPEC2K benchmarks]

An adaptive technique combining IM and SM has the potential to deliver an even greater power reduction.
Which Technique Is the Best and When?

benchmark | DL1 miss rate | L2 miss rate | L1xL2 miss rates x 10K
ammp     | 0.05 | 0.19 | 96.11
applu    | 0.06 | 0.66 | 368.03
apsi     | 0.03 | 0.28 | 75.01
art      | 0.41 | 0.00 | 0.41
bzip2    | 0.02 | 0.04 | 7.09
crafty   | 0.00 | 0.01 | 0.17
eon      | 0.00 | 1.00 | 0.00
equake   | 0.02 | 0.67 | 124.36
facerec  | 0.03 | 0.31 | 86.11
galgel   | 0.04 | 0.01 | 2.11
gap      | 0.01 | 0.55 | 38.54
gcc      | 0.05 | 0.04 | 16.88
gzip     | 0.01 | 0.05 | 3.28
lucas    | 0.10 | 0.67 | 645.73
mcf      | 0.24 | 0.43 | 1023.88
mesa     | 0.00 | 0.27 | 8.02
mgrid    | 0.04 | 0.46 | 165.13
parser   | 0.02 | 0.07 | 13.76
perlbmk  | 0.01 | 0.46 | 22.88
sixtrack | 0.01 | 0.00 | 0.14
swim     | 0.09 | 0.63 | 561.41
twolf    | 0.05 | 0.00 | 0.16
vortex   | 0.00 | 0.23 | 6.94
vpr      | 0.02 | 0.15 | 33.95
wupwise  | 0.02 | 0.68 | 122.40
Average  | 0.05 | 0.31 | 136.50

For the L2 to be idle, either there are few L1 misses, or there are many L2 misses waiting for memory. The miss rate product (MRP) may therefore be a good indicator of cache behavior.
The Adaptive Techniques
- Adaptive Static Mode (ASM): MRP is measured only once, during an initial learning period (the first 100M committed instructions)
  - MRP > A -> IM (A=90); MRP <= A -> SM_J; initial technique: SM_J
- Adaptive Dynamic Mode (ADM): MRP is measured continuously over a K-cycle period (K is 10M) to choose IM or SM for the next 10M cycles
  - MRP > A -> IM (A=100); A >= MRP > B -> SM_N (B=200); otherwise -> SM_P
(See the sketch below.)
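A minimal sketch of the MRP metric and the two selection rules. Thresholds are quoted from the slide (note the ADM middle band is empty with A=100 and B=200 as printed, suggesting the original B is smaller than A); function names and example values are illustrative:

```python
def mrp(dl1_miss_rate, l2_miss_rate):
    """Miss rate product, scaled by 10K as in the table above."""
    return dl1_miss_rate * l2_miss_rate * 10_000

def asm_select(m, a=90):
    """ASM: profiled once over the first 100M committed instructions."""
    return "IM" if m > a else "SM_J"

def adm_select(m, a=100, b=200):
    """ADM: re-evaluated every 10M cycles. Thresholds as printed on the
    slide; the middle band is empty unless B < A."""
    if m > a:
        return "IM"
    if b < m <= a:
        return "SM_N"
    return "SM_P"

# Example with table values: mcf (DL1 0.24, L2 0.43) -> MRP ~ 1032
# (the table's 1023.88 comes from unrounded rates) -> IM.
print(asm_select(mrp(0.24, 0.43)))   # IM
print(adm_select(mrp(0.02, 0.04)))   # bzip2: MRP ~ 8 -> SM_P
```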
More Insight on ASM and ADM
- ASM attempts to find the more effective static technique per benchmark by profiling a small subset of the program
- ADM is more complex and attempts to find the more effective static technique at the finer granularity of 10M-cycle intervals, based on profiling the previous interval
Compare ASM with IM and SM

[Chart: fraction of IM and SM contributions for ASM_750 across SPEC2K benchmarks]

- For most benchmarks ASM correctly selects the more effective static technique (exception: equake)
- A small subset of a program can be used to identify L2 cache behavior: whether the cache is accessed very infrequently, or is idle because the processor is idle
ADM Results

[Chart: fraction of IM and SM contributions for ADM across SPEC2K benchmarks]

- For many benchmarks both IM and SM make a noticeable contribution: ADM is effective in combining them
- For some benchmarks either the IM or the SM contribution is negligible: ADM selects the better static technique
Power Results

[Charts: (a) leakage power savings and (b) total energy-delay reduction for ASM and ADM across SPEC2K benchmarks]

- Leakage reduction using ASM and ADM is 34% and 52%, respectively
- The overall energy-delay reduction is 29.4% and 45.5%, respectively
- 2~3X more leakage power reduction, and less performance loss, compared to the static approaches
RELOCATE: Register File Local Access Pattern Redistribution Mechanism for Power and Thermal Management in Out-of-Order Embedded Processor
Outline
- Motivation
- Background study
- Study of register file underutilization
- Study of register file default access patterns
- Access concentration and activity redistribution to relocate register file access patterns
- Results
Why Register File?
- The RF is one of the hottest units in a processor
- A small, heavily multi-ported SRAM, accessed very frequently
- Example: IBM PowerPC 750FX
Prior Work: Activity Migration
- Reduces temperature by migrating the activity to a replicated unit
- Requires a replicated unit: large area overhead, and leads to a large performance degradation

[Chart: temperature over time alternating between active and idle periods (T_ambient, T_init, T_final, T_crisis) for AM and AM+PG]
Conventional Register Renaming

[Diagram: register renamer with free list and active list (head and tail pointers)]

Instruction # | Original code | Renamed code
1 | RA <- ...    | PR1 <- ...
2 | .... <- RA   | .... <- PR1
3 | branch to _L | branch to _L
4 | RA <- ...    | PR4 <- ...
5 | ...          | ...
6 | _L:          | _L:
7 | .... <- RA   | .... <- PR1

Physical registers are allocated and released in a somewhat random order.
Analysis of Register File Operation
- Register file occupancy

[Charts: fraction of time RF occupancy is <16, 16-32, 32-48, and 48-64 entries for (a) MiBench and (b) SPECint2K]
Performance Degradation with a Smaller RF

[Charts: performance degradation with 48-, 32-, and 16-entry register files for (a) MiBench and (b) SPECint2K]
Analysis of Register File Operation
- Register file access distribution: the coefficient of variation (CV) shows the deviation from the average number of accesses across individual physical registers (computed in the sketch below)
- na_i is the number of accesses to physical register i during a specific period (10K cycles), n̄a is the average, and N is the total number of physical registers

$$CV_{access} = \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(na_i - \overline{na}\right)^2}}{\overline{na}}$$
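A direct transcription of the CV_access formula, with made-up access-count vectors showing how a uniform pattern scores near zero while a concentrated one scores high; only the 10K-cycle sampling period comes from the slides:

```python
import math

def cv_access(accesses):
    """CV = stddev(per-register access counts) / mean."""
    n = len(accesses)                     # N physical registers
    mean = sum(accesses) / n
    var = sum((a - mean) ** 2 for a in accesses) / n
    return math.sqrt(var) / mean

# A uniform pattern has CV ~ 0; concentrating the same traffic on a few
# registers drives CV up (illustrative counts, one 10K-cycle sample).
print(cv_access([100] * 64))             # 0.0  (uniform)
print(cv_access([800] * 8 + [0] * 56))   # ~2.65 (concentrated)
```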
Coefficient of Variation

[Charts: % coefficient of variation of RF accesses for (a) MiBench and (b) SPEC2K]
Register File Operation
- Underutilization that is distributed uniformly: while only a small number of registers are occupied at any given time, the total accesses are uniformly distributed over the entire physical register file during the course of execution
RELOCATE: Access Redistribution within a Register File
- The goal is to "concentrate" accesses within one partition (region) of the RF
- Some regions will then be idle (for 10K cycles): they can be power-gated and allowed to cool down

[Diagram: register activity under (a) the baseline, (b) in-order, and (c) distant redistribution patterns]
An Architectural Mechanism for Access Redistribution
- Active partition: a register renamer partition currently used in register renaming
- Idle partition: a register renamer partition that does not participate in renaming
- Active region: a region of the register file corresponding to a register renamer partition (whether active or idle) that has live registers
- Idle region: a region of the register file corresponding to a register renamer partition (whether active or idle) that has no live registers
Activity Migration without Replication
- An access concentration mechanism allocates registers from only one partition
- This default active partition (DAP) may run out of free registers before the 10K-cycle "convergence period" is over; another partition (chosen according to some algorithm) is then activated (an additional active partition, AAP)
- To keep physical registers concentrated in the DAP, if two or more partitions are active and have free registers, allocation is performed in the same order in which the partitions were activated
The Access Concentration Mechanism

Partition activation order is 1-3-2-4.

[Diagram: four renamer partitions (P1-P4), each with its own free list and active list, activated in the order 1-3-2-4 as the earlier partitions run out of free registers]
The Redistribution Mechanism
- The default active partition is changed once every N cycles to redistribute the activity within the register file (according to some algorithm)
- Once a new default partition (NDP) is selected, all active partitions (DAP+AAP) become idle
- Idle partitions do not participate in register renaming, but their corresponding RF regions may have to be kept active (powered up), since a physical register in an idle partition may still be live
- An idle RF region is power-gated when its active list becomes empty
(A combined sketch of concentration and redistribution follows.)
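The concentration and redistribution mechanisms can be sketched together. This behavioral model keeps the slide's rules (ordered allocation across active partitions, epoch-based change of the default partition); the partition count and size, the round-robin NDP choice, and the data structures are assumptions:

```python
class RFPartition:
    def __init__(self, size):
        self.free = list(range(size))   # free physical registers
        self.live = set()               # allocated (live) registers

class Relocate:
    def __init__(self, num_parts=4, part_size=16, epoch=10_000):
        self.parts = [RFPartition(part_size) for _ in range(num_parts)]
        self.order = [0]                # activation order: DAP first, then AAPs
        self.epoch = epoch              # redistribution period in cycles

    def allocate(self):
        """Concentrate: allocate from partitions in activation order
        (assumes a free register exists somewhere in the RF)."""
        for idx in self.order:
            p = self.parts[idx]
            if p.free:
                reg = p.free.pop()
                p.live.add(reg)
                return idx, reg
        # DAP and AAPs exhausted: activate one more partition (AAP).
        nxt = next(i for i in range(len(self.parts)) if i not in self.order)
        self.order.append(nxt)
        return self.allocate()

    def redistribute(self):
        """New epoch: pick a new default partition; old partitions go idle,
        and their regions can be power-gated once their live sets drain
        (register release/draining is not modeled in this sketch)."""
        ndp = (self.order[0] + 1) % len(self.parts)  # assumed round-robin
        self.order = [ndp]
```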
Performance Impact?
- There is a two-cycle delay to wake up a power-gated physical register region
- Register renaming occurs in the front end of the microprocessor pipeline, whereas the register access occurs in the back end: there is a delay of at least two pipeline stages between renaming and accessing a physical register
- So the required register file region can be woken up in time, without incurring a performance penalty at the time of access
Results: MiBench RF Power Reduction

[Chart: RF power reduction (0-55%) across MiBench benchmarks for num_partition = 2, 4, and 8]
Results: SPEC2K RF Power Reduction

[Chart: RF power reduction (0-50%) across SPEC2K benchmarks for num_partition = 2, 4, and 8]
Analysis of Power Reduction
- Increasing the number of RF partitions provides more opportunity to capture unmapped registers and cluster them into a partition; it also means the wakeup overhead is amortized over a larger number of partitions
- Some exceptions: the overall power overhead of waking up an idle region grows as the number of partitions increases, and power gating becomes frequent but ineffective, adding overhead
Peak Temperature Reduction

Table 1. Peak temperature reduction for MiBench benchmarks

benchmark | base temp (C) | reduction 2P (C) | 4P (C) | 8P (C)
basicMath | 94.3 | 3.6 | 4.8 | 5.0
bc | 95.4 | 3.8 | 4.4 | 5.2
crc | 92.8 | 5.3 | 6.0 | 6.0
dijkstra | 98.4 | 6.3 | 6.8 | 6.4
djpeg | 96.3 | 2.8 | 3.5 | 2.4
fft | 94.5 | 6.8 | 7.4 | 7.6
gs | 89.8 | 6.5 | 7.4 | 9.7
gsm | 92.3 | 5.8 | 6.7 | 6.9
lame | 90.6 | 6.2 | 8.5 | 11.3
mad | 93.3 | 3.8 | 4.3 | 2.2
patricia | 79.2 | 11.0 | 12.4 | 13.2
qsort | 88.3 | 10.1 | 11.6 | 11.9
search | 93.8 | 8.7 | 9.3 | 9.1
sha | 90.1 | 5.1 | 5.4 | 4.5
susan_corners | 92.7 | 4.7 | 5.3 | 5.1
susan_edges | 91.9 | 3.7 | 5.8 | 6.3
tiff2bw | 98.5 | 4.5 | 5.9 | 4.1
average | 92.5 | 5.6 | 6.8 | 6.9

Table 2. Peak temperature reduction for SPEC2K integer benchmarks

benchmark | base temp (C) | reduction 2P (C) | 4P (C) | 8P (C)
bzip2 | 92.7 | 4.8 | 3.9 | 3.1
crafty | 83.6 | 9.5 | 11 | 10.4
eon | 77.3 | 10.6 | 12.4 | 12.5
galgel | 89.4 | 6.9 | 7.2 | 5.8
gap | 86.7 | 4.8 | 5.9 | 7.1
gcc | 79.8 | 7.9 | 9.4 | 10.1
gzip | 95.4 | 3.2 | 3.8 | 3.9
mcf | 85.8 | 6.9 | 8.7 | 9.4
parser | 97.8 | 4.3 | 5.8 | 4.8
perlbmk | 85.8 | 10.6 | 12.3 | 12.6
twolf | 86.2 | 8.8 | 10.2 | 10.5
vortex | 81.7 | 11.3 | 12.5 | 12.9
vpr | 94.6 | 4.9 | 5.2 | 4.4
average | 87.4 | 7.2 | 8.3 | 8.2
Analysis of Temperature Reduction
- Increasing the number of partitions results in a larger power density in each partition, because RF access activity is concentrated in a smaller partition
- While capturing more idle partitions and power-gating them may potentially yield a higher power reduction, the larger power density of the smaller partitions results in an overall higher temperature
Adaptive Resource Resizing for Improving Performance in
Embedded Processor
Introduction
- Technology scaling into the ultra-deep submicron regime has allowed hundreds of millions of gates to be integrated onto a single chip
- Designers have ample silicon budget to add more processor resources to exploit application parallelism and improve performance
- Restrictions on the power budget and on practically achievable operating clock frequencies are the limiting factors
- Increasing the register file (RF) size increases its access time, which reduces processor frequency
- Dynamically resizing the RF in tandem with dynamic frequency scaling (DFS) significantly improves performance
Motivation for Increasing RF Size
- After a long-latency L2 cache miss, the processor executes some independent instructions but eventually ends up stalled
- After an L2 cache miss, one of the ROB, IQ, RF, or LQ/SQ fills up and the processor stalls until the miss is serviced
- With larger resources it is less likely that these resources fill up completely during the L2 cache miss service time, which can potentially improve performance
- The sizes of these resources have to be scaled up together; otherwise the non-scaled ones become a performance bottleneck

[Chart: frequency of stalls due to L2 cache misses (0-40%) in the PowerPC 750FX architecture]
Impact of Increasing RF Size
- Increasing the size of the RF (as well as the ROB, LQ, and IQ) can potentially increase processor performance by reducing the occurrence of idle periods
- But it has a critical impact on the achievable processor operating frequency: the RF decides the maximum achievable operating frequency, and the bitline delay increases significantly as the RF size grows

[Chart: breakdown of RF component delay (input driver, decoder, wordline, bitline, sense amp, output driver) for RF-24, RF-32, and RF-48]
Analysis of RF Component Access Delay
- The equivalent capacitance on the bitline is Ceq = N x (diffusion capacitance of the pass transistors) + wire capacitance (usually 10% of the total diffusion capacitance), where N is the total number of rows (a numeric sketch follows)
- As the number of rows increases, the equivalent bitline capacitance increases, and therefore the propagation delay increases

[Chart: reduction in clock frequency with increasing resource size]
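A back-of-the-envelope version of this bitline model. The diffusion capacitance and equivalent resistance are placeholder values, and 0.69·RC is the usual first-order step-response delay estimate:

```python
# Ceq = N * Cdiff + ~10% wire capacitance, delay ~ 0.69 * R * Ceq,
# following the slide's model; component values are illustrative.

def bitline_delay(rows, c_diff_ff=0.5, r_eq_kohm=10.0):
    """Return an RC-style bitline delay estimate in picoseconds."""
    c_eq_ff = rows * c_diff_ff          # pass-transistor diffusion cap
    c_eq_ff *= 1.10                     # + ~10% wire capacitance
    return 0.69 * r_eq_kohm * c_eq_ff   # kohm * fF = ps

for rows in (24, 32, 48):
    print(f"RF-{rows}: ~{bitline_delay(rows):.0f} ps")
# Delay grows linearly with the number of rows, which is why the larger
# RF configurations force a lower clock frequency.
```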
Impact on Execution Time
- Execution time increases with larger resource sizes
- There is a trade-off between larger resources (which reduce the occurrence of idle periods) and the lower clock frequency they force; the latter plays the major role in deciding performance in terms of execution time

[Chart: normalized execution time for Baseline, Conf-1, and Conf-2 with reduced operating frequency, compared to the baseline architecture]
Dynamic Register File Resizing
- Dynamic RF scaling based on L2 cache misses lets the processor use a smaller RF (with a lower access time) during periods with no pending L2 cache miss (normal periods), and a larger RF (at the cost of a higher access time) during L2 cache miss periods
- To keep RF access at one cycle, the operating clock frequency is reduced when the RF is scaled up
- DFS needs to be applied fast, otherwise it erodes the performance benefit: this requires a PLL architecture capable of applying DFS with the least transition delay
- The studied processor (IBM PowerPC 750) uses a dual-PLL architecture that allows fast DFS with effectively zero latency
Circuit Modification
- The challenge is to design the RF so that its access time can be controlled dynamically
- Among all RF components, the bitline delay increase is responsible for the majority of the RF access time increase, so the bitline load is adjusted dynamically

[Circuit diagram: proposed RF modification — the bitline is split into a lower and an upper segment by a transmission gate (segment select); each upper-segment entry carries a single free/taken bit used to signal upper segment full/empty; the sense amp and bitline pre-charge circuit sit below the lower segment]
L2 Miss Driven RF Scaling (L2MRFS)
- Normal period: the upper segment is power-gated and the transmission gate is turned off, isolating the lower bitline segment from the upper bitline segment; only the lower segment's bitline is pre-charged
- L2 cache miss period: the transmission gate is turned on and both segments' bitlines are pre-charged
- Downsize at the end of the cache miss period, once the upper segment is empty
- The upper segment is augmented with one extra bit per entry: the bit is set when a register is taken and reset when it is released, and ORing these bits detects when the segment is empty
(Sketched in code below.)
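A behavioral sketch of the L2MRFS controller. The per-entry taken bits with OR-reduction and the miss-driven upsize/downsize triggers follow the slides; the class structure and method names are mine:

```python
class L2MRFS:
    """Sketch of the L2-miss-driven RF scaling controller."""

    def __init__(self, upper_entries=16):
        self.upper_taken = [False] * upper_entries  # extra bit per entry
        self.upsized = False            # transmission gate on, both segments
        self.downsize_pending = False

    def upper_empty(self):
        # OR-reduce the taken bits, as on the slide (inverted here).
        return not any(self.upper_taken)

    def on_l2_miss(self):
        # Connect the upper bitline segment; clock frequency is lowered.
        self.upsized = True
        self.downsize_pending = False

    def on_miss_serviced(self):
        # Want to shrink, but only once the upper segment drains.
        self.downsize_pending = True
        self._try_downsize()

    def on_register_allocated(self, entry):
        self.upper_taken[entry] = True

    def on_register_released(self, entry):
        self.upper_taken[entry] = False
        self._try_downsize()

    def _try_downsize(self):
        if self.downsize_pending and self.upper_empty():
            self.upsized = False    # power-gate upper segment, raise clock
            self.downsize_pending = False
```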
Performance and Energy-Delay

[Charts: (a) normalized performance improvement for L2MRFS with DYN_Conf_1 and DYN_Conf_2; (b) normalized energy-delay product compared to conf_1 and conf_2]

- Performance improvement: 6% and 11%
- Energy-delay reduction: 3.5% and 7%
Inter-core Selective Resource Pooling in 3D Chip
Multiprocessor
An Example!

[Chart: number of occupied RF entries over time for mcf and gcc on a dual-core CMP; across intervals I1-I3 the roles swap between the greedy core and the helper core (greedy: mcf, helper: gcc, then greedy: gcc, helper: mcf)]
Preliminary Results for Register File Pooling

[Diagram: register files on two stacked layers (layer 0, layer 1) participating in resource pooling via MUXes and die-to-die vias]

[Chart: speedup (normalized IPC) of resource pooling for single-core, 2-core, 3-core, and 4-core configurations]
Challenges
- The level of resource sharing: "loose pooling" gives the HELPER core higher priority in accessing the pooled resource; "tight pooling" gives priority to the GREEDY core
- The granularity of resource sharing: number of entries, number of ports
- The level of confidence in predicting resource utilization: avoid starving the HELPER core while avoiding over-provisioning for the GREEDY core
- A new floorplan that puts identical resources close to each other can incur additional thermal and power burden on already power-hungry and thermally critical resources
Conclusion
- Power-, thermal-, and reliability-aware high-performance design through an inter-disciplinary approach
Reducing Leakage in L2 Cache Peripheral Circuits
Using Multiple Sleep Mode Technique
Multiple Sleep Modes

power mode | wakeup delay (cycles) | leakage reduction (%)
basic-lp   | 1 | 42
lp         | 2 | 75
aggr-lp    | 3 | 81
ultra-lp   | 4 | 90

- The power overhead of waking up the peripheral circuits is almost equivalent to the switching power of the sleep transistors
- Sharing a set of sleep transistors horizontally and vertically across the multiple stages of a (wordline) driver makes the power overhead even smaller
Reducing Leakage in L1 Data Cache
- To maximize the leakage reduction in the DL1 cache, put the DL1 peripherals into the ultra low power mode; but this adds 4 cycles to the DL1 latency and significantly reduces performance
- To minimize performance degradation, put the DL1 peripherals into the basic low power mode; this requires only one cycle to wake up, and that latency can be hidden during the address computation stage, so performance does not degrade — but the leakage power reduction is not noticeable
Motivation for Dynamically Controlling the Sleep Mode
- Large leakage reduction benefit: ultra and aggressive low power modes
- Low performance impact benefit: basic-lp mode
- Periods of frequent access: basic-lp mode
- Periods of infrequent access: ultra and aggressive low power modes
- Therefore: dynamically adjust the peripheral circuits' sleep power mode
Reducing DL1 Wakeup Delay
- Whether an instruction is a load or a store can be determined at least one cycle prior to cache access, so the DL1 peripherals can be woken up one cycle before the access
- Accessing the DL1 while its peripherals are in basic-lp mode then doesn't require an extra cycle, and one cycle of wakeup delay can be hidden for all other low-power modes
- Hence: put the DL1 in basic-lp mode by default
Architectural Motivations
- A load miss in the L1/L2 caches takes a long time to service and prevents dependent instructions from being issued
- When dependent instructions cannot issue, performance is lost, and energy is lost as well
- This is an opportunity to save energy
Low-End Architecture
- Given a miss service time of 30 cycles, the processor likely stalls during the miss service period
- Additional cache misses occurring while one DL1 cache miss is already pending further increase the chance of a pipeline stall

[State diagram: a DL1 miss moves the peripherals from basic-lp to lp; additional pending DL1 misses move them to aggr-lp and then ultra-lp; once the misses are serviced and the processor continues, the peripherals return to basic-lp]
Low Power Modes in a 2KB DL1 Cache
- Fraction of total execution time the DL1 cache spends in each power mode
- 85% of the time the DL1 peripherals are put into low power modes
- Most of the time is spent in the basic-lp mode (58% of total execution time)

[Chart: fraction of execution time in hp, trivial-lp, lp, aggr-lp, and ultra-lp modes across MiBench benchmarks]
Low Power Modes in the Low-End Architecture

[Charts: (a) frequency of the different low power modes (hp, basic-lp, lp, aggr-lp, ultra-lp) for 2KB-16KB DL1 caches; (b) performance degradation across MiBench benchmarks for 2KB-16KB DL1 caches]

- Increasing the cache size reduces the DL1 cache miss rate
- This reduces the opportunities to put the cache into the more aggressive low power modes
- It also reduces the performance degradation for larger DL1 caches
High-End Architecture

[State diagram: a DL1 miss moves the peripherals from basic-lp to lp; an L2 miss moves them to ultra-lp; they return as the misses are serviced]

- The DL1 transitions to ultra-lp mode right after an L2 miss occurs
- Given a long L2 cache miss service time (80 cycles), the processor will stall waiting for memory
- The DL1 returns to basic-lp mode once the L2 miss is serviced
(Sketched below.)
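The high-end policy reduces to a small state machine. This sketch follows the transitions stated above (basic-lp by default, lp on a DL1 miss, ultra-lp on an L2 miss, back to basic-lp when serviced); the event-handler framing and names are assumptions:

```python
BASIC_LP, LP, ULTRA_LP = "basic-lp", "lp", "ultra-lp"

class DL1ModeController:
    """High-end DL1 peripheral power-mode controller (behavioral sketch)."""

    def __init__(self):
        self.mode = BASIC_LP            # default mode

    def on_dl1_miss(self):
        if self.mode == BASIC_LP:
            self.mode = LP

    def on_l2_miss(self):
        self.mode = ULTRA_LP            # long stall expected (~80 cycles)

    def on_l2_miss_serviced(self):
        self.mode = BASIC_LP

    def on_dl1_miss_serviced(self, misses_pending):
        if self.mode == LP and not misses_pending:
            self.mode = BASIC_LP
```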
Low Power Modes in a 4KB Cache
- For many benchmarks the ultra-lp mode makes a considerable contribution
- These benchmarks have a high L2 miss rate, which triggers the transition to the ultra low power mode

[Chart: fraction of execution time in hp, trivial-lp, lp, and ultra-lp modes across SPEC2K benchmarks]
Low Power Modes in the High-End Architecture

[Charts: (a) frequency of the different low power modes (hp, basic-lp, lp, ultra-lp) and (b) performance degradation, for 4KB-64KB through 32KB-512KB DL1-L2 configurations]

- Increasing the cache size reduces the DL1 cache miss rate
- This reduces the opportunities to put the cache into the more aggressive low power modes
- It also reduces the performance degradation
Leakage Power Reduction: Low-End Architecture
- DL1 leakage is reduced by 50%
- While the ultra-lp mode occurs much less frequently than the basic-lp mode, its leakage reduction is comparable: in ultra-lp mode the peripheral leakage is reduced by 90%, almost twice that of basic-lp mode

[Chart: leakage reduction contribution of trivial-lp, lp, aggr-lp, and ultra-lp modes across MiBench benchmarks]
Leakage Power Reduction: High-End Architecture
- The average leakage reduction is almost 50%

[Chart: leakage reduction contribution of trivial-lp, lp, and ultra-lp modes across SPEC2K benchmarks]