SRC Project 1617.001 Temperature-Aware SoC Optimization Framework PIs: Fadi Kurdahi & Nikil Dutt, UCI
PIs: Fadi J. Kurdahi and Nikil D. Dutt
Center for Embedded Computer Systems (CECS)
University of California, Irvine
{kurdahi, dutt}@uci.edu
Temperature-Aware SoC Optimization Framework
SRC Task # 1617.001
Annual Review: March 2009
Outline
Background and Motivation
Task Details, Accomplishments
Technical Overview
Background and Motivation: SoC Design Methodologies
Traditionally focused on performance, cost, and switching power
  Temperature and its effects were second-tier metrics
Temperature is increasingly becoming a primary design constraint
  Particularly for sub-100 nm process technologies
Temperature & SRAM
Effects of high temperature:
  Increased leakage power
  Reduced lifetime (e.g. electromigration, stress)
  Increased interconnect signal propagation delay
  Increased switching delay of transistors
Increased cell delay due to temperature
  SRAM's access time (read/write) will increase
  A failure occurs when access time > rated time period
Thus, an increase in temperature can cause an SRAM cell to fail.
Process Variation & SRAM
Random Dopant Fluctuation (RDF):
  Dominant impact on a transistor's strength mismatch
  Intra-die variation (different characteristics of cells within an SRAM block)
  RDF is typically modeled as a Gaussian distribution of threshold voltage
Because of process variation, not all the cells in an SRAM block will fail at the same temperature
  Different cells will fail at different temperatures
  Read failure, write failure
Because of variation in threshold voltage, the value stored in a cell may flip (destructive read failure)
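The two slides above can be tied together with a small Monte Carlo sketch: each cell draws a Gaussian threshold voltage, and a cell fails at the temperature where its access time first exceeds the rated period. The first-order delay model, the coefficients k_temp and k_vth, the nominal threshold voltage, and the rated period are all hypothetical values chosen for illustration; they are not this project's SRAM models.

```python
import random


def failure_temperature(vth, *, vth_nominal=0.30, t_access_nominal=0.32,
                        t_ref=45.0, period=0.40, k_temp=0.002, k_vth=2.0):
    """Temperature (deg C) at which a cell's access time first exceeds the period.

    Hypothetical first-order model (all coefficients are assumptions):
        access_time(T) = t_access_nominal
                         * (1 + k_vth * (vth - vth_nominal))
                         * (1 + k_temp * (T - t_ref))
    """
    vth_scale = 1.0 + k_vth * (vth - vth_nominal)
    # Solve access_time(T) == period for T.
    return t_ref + (period / (t_access_nominal * vth_scale) - 1.0) / k_temp


def sample_failure_temperatures(n_cells, sigma_vth=0.03, seed=0):
    """Draw a Gaussian Vth per cell (RDF) and return each cell's failure temperature."""
    rng = random.Random(seed)
    return [failure_temperature(rng.gauss(0.30, sigma_vth)) for _ in range(n_cells)]


temps = sample_failure_temperatures(1000)
print(f"weakest cell fails at {min(temps):.1f} C, strongest at {max(temps):.1f} C")
```

Under this model a cell with a higher threshold voltage is slower and therefore fails at a lower temperature, which is why the cells of one block do not all fail together.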
RELOCATE
Register File Local Access Pattern Redistribution Mechanism for Power and Thermal Management in Out-of-Order Embedded Processor
Houman Homayoun, Aseem Gupta, Avesta Sasan, Alex Veidenbaum,
Fadi Kurdahi, Nikil Dutt
University of California Irvine
Outline
Motivation
Background study
Study of register file underutilization
Study of register file default access patterns
Access concentration and activity redistribution to relocate register file access patterns
Results
Why Register File?
RF is one of the hottest units in a processor
  A small, heavily multi-ported SRAM
  Accessed very frequently
Example: IBM PowerPC 750FX, AMD Athlon 64
[Figure: AMD Athlon 64 core floorplan blocks, and a thermal image of the same floorplan blocks taken with infrared cameras. Courtesy of Renau et al., ISCA 2007]
Why Temperature?
Higher power densities (watts per mm²) lead to higher operating temperatures, which
(i) Increase the probability of timing violations
(ii) Reduce IC lifetime
(iii) Lower operating frequency
(iv) Increase leakage power
(v) Require expensive cooling mechanisms
(vi) Overall increase in design effort and cost
Prior Work: Activity Migration
Reduces temperature by migrating the activity to a replicated unit
  Requires a replicated unit
  Large area overhead
  Leads to a large performance degradation
[Figure: temperature vs. time across alternating active and idle periods, with levels T ambient, T init, T crisis, and T final marked; curves labeled AM and AM+PG]
Conventional Register Renaming
[Figure: Register Renamer, showing the Free List and the Active List with head and tail pointers]

Register allocation-release
Instruction #   Original code   Renamed code
1               RA <- ...       PR1 <- ...
2               ... <- RA       ... <- PR1
3               branch to _L    branch to _L
4               RA <- ...       PR4 <- ...
5               ...             ...
                ...             ...
6               _L:             _L:
7               ... <- RA       ... <- PR1

• Physical registers are allocated/released in a somewhat random order
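A minimal sketch of why a conventional renamer scatters allocations: registers leave the head of a single FIFO free list and return to its tail whenever their previous mapping is released, so the list order drifts away from the physical order. The class and method names below are illustrative, not taken from any simulator.

```python
from collections import deque


class Renamer:
    """Toy register renamer with one FIFO free list for the whole register file."""

    def __init__(self, num_physical):
        self.free_list = deque(range(num_physical))  # head = next register to allocate
        self.map_table = {}                          # architectural -> physical

    def rename_dest(self, arch_reg):
        """Allocate a new physical register for a write to arch_reg."""
        phys = self.free_list.popleft()
        previous = self.map_table.get(arch_reg)
        self.map_table[arch_reg] = phys
        # The previous mapping can be released once this instruction commits.
        return phys, previous

    def rename_src(self, arch_reg):
        """Look up the physical register currently holding arch_reg."""
        return self.map_table[arch_reg]

    def release(self, phys):
        """Return a physical register to the tail of the free list."""
        self.free_list.append(phys)


renamer = Renamer(8)
renamer.rename_dest("RA")              # RA -> physical 0
renamer.rename_dest("RB")              # RB -> physical 1
_, old = renamer.rename_dest("RA")     # RA -> physical 2; physical 0 is now dead
renamer.release(old)
print(list(renamer.free_list))         # [3, 4, 5, 6, 7, 0]
```

After only three renames the free list is no longer in physical order, and with releases arriving in commit order the drift continues over time.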
Analysis of Register File Operation
1. Register File Occupancy
[Figure: register file occupancy distribution (0% to 100%) in four bins: RF_occupancy < 16, 16 < RF_occupancy < 32, 32 < RF_occupancy < 48, 48 < RF_occupancy < 64; (a) MiBench, (b) SPECint2K]
Performance Degradation with a Smaller Register File
[Figure: % performance degradation for 48-entry, 32-entry, and 16-entry register files; (a) MiBench, (b) SPECint2K]
Analysis of Register File Operation
2. Register File Access Distribution
Coefficient of variation (CV) shows a "deviation" from the average number of accesses for individual physical registers:

$$CV_{access} = \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(na_i - \overline{na}\right)^2}}{\overline{na}}$$

na_i is the number of accesses to physical register i during a specific period (10K cycles)
\overline{na} is the average number of accesses per physical register
N is the total number of physical registers
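As a quick illustration of the metric, the sketch below computes the coefficient of variation of per-register access counts (population standard deviation divided by the mean). The two access-count vectors are made-up inputs, not measured data.

```python
import math


def access_cv(access_counts):
    """Coefficient of variation of per-register access counts over one period."""
    n = len(access_counts)
    mean = sum(access_counts) / n
    if mean == 0:
        return 0.0  # no accesses in this period
    variance = sum((count - mean) ** 2 for count in access_counts) / n
    return math.sqrt(variance) / mean


# Hypothetical 64-entry register file, 6400 accesses in the period.
uniform = [100] * 64                   # accesses spread evenly
concentrated = [400] * 16 + [0] * 48   # same total, confined to one quarter

print(access_cv(uniform), access_cv(concentrated))
```

A low CV means accesses are spread almost evenly over the physical registers; concentrating the same number of accesses into one quarter of the file raises the CV sharply.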
Coefficient of Variation
[Figure: % coefficient of variation of register file accesses; (a) MiBench, (b) SPEC2K]
Register File Operation
Underutilization that is uniformly distributed
  While only a small number of registers are occupied at any given time, the total accesses are uniformly distributed over the entire physical register file during the course of execution
RELOCATE: Access Redistribution within a Register File
The goal is to "concentrate" accesses within a partition of the RF (a region)
  Some regions will be idle (for 10K cycles)
  These regions can be power-gated and allowed to cool down
[Figure: register activity for (a) baseline, (b) in-order, and (c) distant patterns]
An Architectural Mechanism to Support Access Redistribution
Active partition: a register renamer partition currently used in register renaming
Idle partition: a register renamer partition which does not participate in renaming
Active region: a region of the register file corresponding to a register renamer partition (whether active or idle) which has live registers
Idle region: a region of the register file corresponding to a register renamer partition (whether active or idle) which has no live registers
Activity Migration without Replication
An access concentration mechanism allocates registers from only one partition
This default active partition (DAP) may run out of free registers before the 10K-cycle "convergence period" is over
  Another partition is then activated (according to some algorithm); such partitions are referred to as additional active partitions (AAPs)
To facilitate physical register concentration in the DAP, if two or more partitions are active and have free registers, allocation is performed in the same order in which the partitions were activated
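The allocation rule above can be sketched as a partitioned free list that always tries active partitions in activation order, DAP first. The slides leave the choice of the next partition to "some algorithm"; the lowest-numbered-inactive policy used here, like the class and method names, is an assumption made for illustration.

```python
class PartitionedFreeList:
    """Free list split into partitions; allocation follows activation order."""

    def __init__(self, num_registers=64, num_partitions=4, default_partition=0):
        self.size = num_registers // num_partitions
        self.free = [list(range(p * self.size, (p + 1) * self.size))
                     for p in range(num_partitions)]
        self.active_order = [default_partition]  # DAP first, then AAPs

    def _activate_next(self):
        # Assumed policy: lowest-numbered inactive partition with free registers.
        for p in range(len(self.free)):
            if p not in self.active_order and self.free[p]:
                self.active_order.append(p)
                return p
        raise RuntimeError("no free physical registers")

    def allocate(self):
        # Prefer the earliest-activated partition that still has a free register.
        for p in self.active_order:
            if self.free[p]:
                return self.free[p].pop(0)
        return self.free[self._activate_next()].pop(0)

    def release(self, reg):
        self.free[reg // self.size].append(reg)


fl = PartitionedFreeList(num_registers=16, num_partitions=4)
first_five = [fl.allocate() for _ in range(5)]  # the fifth spills into an AAP
fl.release(1)                                   # a DAP register frees up
back_in_dap = fl.allocate()                     # allocation returns to the DAP
print(first_five, fl.active_order, back_in_dap)
```

Because the DAP is always tried first, allocation drifts back into it as soon as one of its registers is released, which keeps the additional partitions as lightly used as possible.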
The Access Concentration Mechanism
Partition activation order is 1-3-2-4
[Figure: partitions P1 to P4, each with its own Free List and Active List. Labels: Free-list 1 full, Free-list 2 full, Free-list 3 full, Free-list 4 full; Active List 1 empty, Active List 2 empty, Active List 3 empty, Active List 4 empty]
The Redistribution Mechanism
The default active partition is changed once every N cycles to redistribute the activity within the register file (according to some algorithm)
  Once a new default partition (NDP) is selected, all active partitions (DAP+AAP) become idle
The idle partitions do not participate in register renaming, but their corresponding RF regions may have to be kept active (powered up)
  A physical register in an idle partition may be live
An idle RF region is power gated when its active list becomes empty.
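A minimal sketch of the bookkeeping this implies: a partition stops renaming as soon as a new default partition is chosen, but its RF region stays powered until its last live register is released. The round-robin choice of the new default partition, and all names below, are assumptions for illustration only.

```python
class RegionState:
    """Tracks which partitions rename and which RF regions are powered."""

    def __init__(self, num_partitions=4):
        self.live = [0] * num_partitions          # live registers per region
        self.renaming = [False] * num_partitions  # partition takes part in renaming
        self.powered = [False] * num_partitions   # region is powered up
        self.default = 0
        self._activate(0)

    def _activate(self, p):
        self.renaming[p] = True
        self.powered[p] = True

    def _gate_empty_idle_regions(self):
        for p in range(len(self.live)):
            if not self.renaming[p] and self.live[p] == 0:
                self.powered[p] = False

    def allocate(self, p):
        if not self.renaming[p]:
            raise ValueError("idle partitions do not participate in renaming")
        self.live[p] += 1

    def release(self, p):
        self.live[p] -= 1
        self._gate_empty_idle_regions()

    def redistribute(self):
        """Select a new default partition; every active partition becomes idle."""
        new_default = (self.default + 1) % len(self.live)  # assumed round-robin
        self.renaming = [False] * len(self.live)
        self.default = new_default
        self._activate(new_default)
        self._gate_empty_idle_regions()


state = RegionState(4)
state.allocate(0)
state.allocate(0)
state.redistribute()                   # partition 0 goes idle but holds live registers
still_powered = state.powered[0]       # region 0 must stay powered for now
state.release(0)
state.release(0)                       # last live register leaves region 0
print(still_powered, state.powered)
```

The region of the old default partition is gated only after its last release, matching the rule that a physical register in an idle partition may still be live.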
The Redistribution Mechanism
[Figure: partitions P1 to P4, each with its own Free List and Active List. Labels: Free-list 1 full, Free-list 2 full, Free-list 3 full, Free-list 4 full; Active List 1 empty, Active List 2 empty, Active List 3 empty, Active List 4 empty]
Performance Impact?
There is a two-cycle delay to wake up a power-gated physical register region
Register renaming occurs in the front end of the microprocessor pipeline, whereas register access occurs in the back end
  There is a delay of at least two pipeline stages between renaming and accessing a physical register
  The requested region can therefore be woken up in time
A required register file region can be woken up without incurring a performance penalty at the time of access
Experimental setup
MASE (SimpleScalar 4.0)
  Models a MIPS-74K processor, 800 MHz
MiBench and SPECint2K benchmarks compiled with the Compaq compiler, -O4 flag
Industrial memory compiler used
  64-entry, 64-bit single-ended SRAM memory in TSMC 45nm technology
HotSpot to estimate thermal profiles
Experimental setup

Table 1. Processor Architecture
L1 I-cache                8KB, 4 way, 2 cycles
L1 D-cache                8KB, 4 way, 2 cycles
L2-cache                  128KB, 15 cycles
Fetch, dispatch           2 wide
Register file             64 entry
Memory                    50 cycles
Instruction fetch queue   2
Load/store queue          16 entry
Arithmetic units          2 integer
Complex unit              2 INT
Pipeline                  12 stages
Processor speed           800 MHz
Issue                     Out-of-order

Table 2. RF Design specification
Process                                                  45nm CMOS, 9 metal layers
Register file layout area                                0.009 mm²
Operating Modes                                          Active: R/W; Sleep: no data retention
Operating Voltage                                        0.6V~1.1V
Read Access Cycle                                        200MHz to 1.1GHz
Access time, typical corner (0.9V, 45 °C)                0.32ns
Active Power (Total), typical corner (0.9V, 45 °C)       66mW @ 800MHz
Active Leakage Power, typical corner (0.9V, 45 °C)       15mW
Sleep Leakage Power, typical corner (0.9V, 45 °C)        2mW
Wakeup Delay                                             0.42ns
Wakeup Energy per register file row (64 bits)            0.42nJ
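The leakage and wakeup figures in Table 2 allow a rough break-even estimate: how long a region must stay gated before the leakage saved exceeds the energy spent waking it up. This is a back-of-the-envelope sketch that assumes leakage is spread evenly over the 64 rows and ignores any energy cost of entering sleep; neither assumption comes from the slides.

```python
ROWS = 64                          # register file entries (Table 1)
ACTIVE_LEAKAGE_W = 15e-3           # active leakage power, whole RF (Table 2)
SLEEP_LEAKAGE_W = 2e-3             # sleep leakage power, whole RF (Table 2)
WAKEUP_ENERGY_PER_ROW_J = 0.42e-9  # wakeup energy per row (Table 2)
CLOCK_HZ = 800e6                   # processor speed (Table 1)


def break_even_cycles(rows_in_region):
    """Cycles a region must stay gated to pay back its wakeup energy."""
    # Assumption: every row contributes an equal share of leakage.
    saved_power = (ACTIVE_LEAKAGE_W - SLEEP_LEAKAGE_W) * rows_in_region / ROWS
    wakeup_energy = WAKEUP_ENERGY_PER_ROW_J * rows_in_region
    return wakeup_energy / saved_power * CLOCK_HZ


for partitions in (2, 4, 8):
    rows = ROWS // partitions
    print(f"{partitions} partitions: {break_even_cycles(rows):.0f} cycles")
```

Under these assumptions the break-even point is roughly 1,650 cycles regardless of region size, well below a 10K-cycle idle period, so a region that stays idle for a full period saves net energy.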
Results
[Figure (a): MiBench RF power reduction (%) for num_partition = 2, 4, and 8]
Results
[Figure (b): SPEC2K RF power reduction (%) for num_partition = 2, 4, and 8]
Analysis of Power Reduction
Increasing the number of RF partitions provides more opportunity to capture and cluster unmapped registers into a partition
  Indicates that wakeup overhead is amortized for a larger number of partitions
Some exceptions
  The overall power overhead associated with waking up an idle region becomes larger as the number of partitions increases
  Power gating becomes frequent but ineffective, and its overhead grows, as the number of partitions increases
Peak Temperature Reduction
Table 1. Peak temperature reduction for MiBench benchmarks
                 Base temperature   Temperature reduction (°C) by number of partitions
Benchmark        (°C)               2P      4P      8P
basicMath        94.3               3.6     4.8     5.0
bc               95.4               3.8     4.4     5.2
crc              92.8               5.3     6.0     6.0
dijkstra         98.4               6.3     6.8     6.4
djpeg            96.3               2.8     3.5     2.4
fft              94.5               6.8     7.4     7.6
gs               89.8               6.5     7.4     9.7
gsm              92.3               5.8     6.7     6.9
lame             90.6               6.2     8.5     11.3
mad              93.3               3.8     4.3     2.2
patricia         79.2               11.0    12.4    13.2
qsort            88.3               10.1    11.6    11.9
search           93.8               8.7     9.3     9.1
sha              90.1               5.1     5.4     4.5
susan_corners    92.7               4.7     5.3     5.1
susan_edges      91.9               3.7     5.8     6.3
tiff2bw          98.5               4.5     5.9     4.1
average          92.5               5.6     6.8     6.9

Table 2. Peak temperature reduction for SPEC2K integer benchmarks
                 Base temperature   Temperature reduction (°C) by number of partitions
Benchmark        (°C)               2P      4P      8P
bzip2            92.7               4.8     3.9     3.1
crafty           83.6               9.5     11      10.4
eon              77.3               10.6    12.4    12.5
galgel           89.4               6.9     7.2     5.8
gap              86.7               4.8     5.9     7.1
gcc              79.8               7.9     9.4     10.1
gzip             95.4               3.2     3.8     3.9
mcf              85.8               6.9     8.7     9.4
parser           97.8               4.3     5.8     4.8
perlbmk          85.8               10.6    12.3    12.6
twolf            86.2               8.8     10.2    10.5
vortex           81.7               11.3    12.5    12.9
vpr              94.6               4.9     5.2     4.4
average          87.4               7.2     8.3     8.2
Analysis of Temperature Reduction
Increasing the number of partitions results in larger power density in each partition because RF access activity is concentrated in a smaller partition
While capturing more idle partitions and power gating them may yield a larger power reduction, the larger power density caused by the smaller partition size can result in a higher overall temperature
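The power-density effect can be illustrated with the RF figures from the design specification (0.009 mm² layout area, 66 mW total active power, 15 mW active leakage). The sketch assumes the worst case in which all dynamic power lands in a single active partition while leakage stays proportional to area; this is a simplification for illustration, not the HotSpot model used for the results.

```python
RF_AREA_MM2 = 0.009      # register file layout area
ACTIVE_POWER_W = 66e-3   # total active power at 800 MHz
LEAKAGE_W = 15e-3        # active leakage power


def concentrated_power_density(num_partitions):
    """W/mm^2 in the active partition if all dynamic power is concentrated there."""
    dynamic = ACTIVE_POWER_W - LEAKAGE_W           # assumed dynamic share
    partition_area = RF_AREA_MM2 / num_partitions
    leakage_share = LEAKAGE_W / num_partitions     # leakage assumed proportional to area
    return (dynamic + leakage_share) / partition_area


for partitions in (1, 2, 4, 8):
    print(f"{partitions} partition(s): "
          f"{concentrated_power_density(partitions):.1f} W/mm^2")
```

In this simplified view the density in the active partition grows almost linearly with the partition count, which is consistent with the 8-partition configuration giving little or no extra temperature reduction over 4 partitions for several benchmarks.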
Conclusions
Showed Register File Underutilization
Studied Register file default access patterns
Proposed access concentration and activity redistribution to relocate register file accesses
Results show a noticeable power and temperature reduction in the RF
RELOCATE technique can be applied when units are underutilized as opposed to activity migration, which requires replication
Current and Future Work Extension
Formulate the selection of the best partition, out of the available partitions, for activity redistribution.
Apply activity concentration and redistribution mechanism to other hot units; example: L1 cache.
Apply Proactive NBTI Recovery to the idle partitions to improve lifetime reliability.
Trade-off NBTI recovery and power gating to simultaneously reduce power and improve lifetime reliability.
Tackle the temperature barrier in 3D stack processor design using similar activity concentration and redistribution.