Power, Temperature, Reliability and Performance-Aware Optimizations in On-Chip SRAMs
Houman Homayoun
PhD Candidate
Dept. of Computer Science, UC Irvine
Outline

Past Research
- Low Power Design
  - Power Management in Cache Peripheral Circuits (CASES-2008, ICCD-2008, ICCD-2007, TVLSI, CF-2010)
  - Clock Tree Leakage Power Management (ISQED-2010)
- Thermal-Aware Design
  - Thermal Management in Register File (HiPEAC-2010)
- Reliability-Aware Design
  - Process Variation Aware Cache Architecture for Aggressive Voltage-Frequency Scaling (DATE-2009, CASES-2009)
- Performance Evaluation and Improvement
  - Adaptive Resource Resizing for Improving Performance in Embedded Processor (DAC-2008, LCTES-2008)
Current Research
- Inter-core Selective Resource Pooling in 3D Chip Multiprocessor
- Extend Previous Work (for Journal Publication!)
Leakage Power Management in Cache Peripheral Circuits
Outline: Leakage Power in Cache Peripherals
- L2 cache power dissipation: why cache peripherals?
- Circuit techniques to reduce leakage in peripherals (ICCD-08, TVLSI)
- Static approach to reduce leakage in the L2 cache (ICCD-07)
- Adaptive techniques to reduce leakage in the L2 cache (ICCD-08)
- Reducing leakage in the L1 cache (CASES-2008)
On-chip Caches and Power
- On-chip caches in high-performance processors are large: more than 60% of the chip budget
- They dissipate a significant portion of power via leakage
- Much of this leakage used to be in the SRAM cells, and many architectural techniques have been proposed to remedy it
- Today there is also significant leakage in the peripheral circuits of an SRAM (cache), in part because cell design has already been optimized

[Pentium M processor die photo, courtesy of intel.com]
Peripherals?
- Data input/output drivers
- Address input/output drivers
- Row pre-decoder
- Wordline drivers
- Row decoder
- Others: sense-amps, bitline pre-chargers, memory cells, decoder logic

[Diagram: SRAM organization — address input global drivers, predecoder and global wordline drivers, row decoders, global/local wordlines, bitlines, sense amps, and global output drivers]
Why Peripherals?
- Cells use minimum-sized transistors for area reasons, while peripherals use larger, faster, and accordingly leakier transistors to satisfy timing requirements
- Cells use high-Vt transistors, while peripherals use typical-threshold-voltage transistors
[Chart: leakage power per device (pW, log scale from 1 to 100,000) for a memory cell vs. inverters of increasing drive strength (INVX through INV32X); peripheral inverters leak roughly 200X to 6300X more than a memory cell]
Leakage Power Components of L2 Cache
SRAM peripheral circuits dissipate more than 90% of the total leakage power
- Global address input drivers: 11%
- Global data input drivers: 14%
- Global row predecoder: 1%
- Local row decoders: 33%
- Local data output drivers: 8%
- Global data output drivers: 25%
- Others: 8%
Leakage Power as a Fraction of L2 Power Dissipation
- L2 cache leakage power dominates its dynamic power: above 87% of the total
[Chart: leakage vs. dynamic power as a fraction of total L2 power across SPEC2K benchmarks (ammp through wupwise, plus average)]
Circuit Techniques to Address Leakage in the SRAM Cell
- Gated-Vdd, Gated-Vss
- Voltage scaling (DVFS)
- ABB-MTCMOS
- Forward Body Biasing (FBB), RBB
- Sleepy Stack
- Sleepy Keeper
All of these target the SRAM memory cell.
Architectural Techniques
- Way Prediction, Way Caching, Phased Access: predict or cache recently accessed ways, or read the tag first
- Drowsy Cache: keeps cache lines in a low-power state, with data retention
- Cache Decay: evicts lines not used for a while, then powers them down
- Applying DVS, Gated-Vdd, or Gated-Vss to the memory cell, with much architectural support proposed to do so
All of these target the cache SRAM memory cell.

Multiple Sleep Modes
Zig-Zag Horizontal and Vertical Sleep Transistor Sharing
Sleep Transistor Stacking Effect
- Subthreshold current is an inverse exponential function of the threshold voltage
- Stacking transistor N with slpN: the source-to-body voltage (V_M) of transistor N increases, which reduces its subthreshold leakage current when both transistors are off (see the numeric sketch below)

$$V_T = V_{T0} + \gamma\left(\sqrt{2\phi_F + V_{SB}} - \sqrt{2\phi_F}\right)$$

[Circuit diagram: driver transistor N stacked on footer sleep transistor slpN, with the virtual ground node V_M between them]
Drawbacks: increased rise time, fall time, wakeup delay, area, and dynamic power; potential instability
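To make the stacking effect concrete, here is a minimal Python sketch. The device parameters (VT0, body-effect coefficient, Fermi potential, subthreshold slope factor) are illustrative placeholders, not values from the slides:

```python
import math

def vt_body_effect(vt0=0.30, gamma=0.4, phi_f=0.45, v_sb=0.0):
    """Body effect: VT = VT0 + gamma*(sqrt(2*phi_F + V_SB) - sqrt(2*phi_F))."""
    return vt0 + gamma * (math.sqrt(2 * phi_f + v_sb) - math.sqrt(2 * phi_f))

def leakage_ratio(delta_vt, n=1.5, v_t=0.026):
    """Subthreshold current scales as exp(-VT/(n*vT)), so a VT increase
    of delta_vt multiplies leakage by exp(-delta_vt/(n*vT))."""
    return math.exp(-delta_vt / (n * v_t))

# With the sleep transistor on, V_M ~ 0; when both transistors are off,
# V_M rises (say to 100 mV), raising VT of the stacked device:
vt_on = vt_body_effect(v_sb=0.0)
vt_off = vt_body_effect(v_sb=0.10)
print(f"VT: {vt_on:.3f} V -> {vt_off:.3f} V")
print(f"leakage scaled by ~{leakage_ratio(vt_off - vt_on):.2f}")
# The full stacking benefit is larger still: the raised V_M also makes
# VGS of the stacked transistor negative and reduces VDS (less DIBL).
```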
A Redundant Circuit Approach
[Circuit diagram: redundant sleep-transistor scheme — each inverter stage in the wordline driver chain (P1/N1 through P4/N4) gets its own header (slpP) and footer (slpN) sleep transistors, all controlled by the sleep signal]
Drawback: impact on the wordline driver's output rise time, fall time, and propagation delay
Impact on Rise Time and Fall Time
- The rise time and fall time of an inverter's output are proportional to Rpeq x CL and Rneq x CL, respectively
- Inserting the sleep transistors increases both Rneq and Rpeq

[Circuit diagram: inverter stages with header/footer sleep transistors and their leakage paths]

- Increase in rise time and fall time
- Impact on performance
- Impact on memory functionality
A Zig-Zag Circuit
- Rpeq for the first and third inverters and Rneq for the second and fourth inverters don't change
- The fall time of the circuit therefore does not change

[Circuit diagram: zig-zag scheme — sleep transistors alternate along the inverter chain (footer slpN1, header slpP2, footer slpN3, header slpP4), all controlled by the sleep signal]
A Zig-Zag Share Circuit
- To improve the leakage reduction and area-efficiency of the zig-zag scheme, one set of sleep transistors is shared between multiple stages of inverters
- Zig-Zag Horizontal Sharing
- Zig-Zag Horizontal and Vertical Sharing
Zig-Zag Horizontal Sharing
- Comparing zz-hs with the zigzag scheme at the same area overhead: zz-hs has less impact on rise time, and both reduce leakage by almost the same amount

[Circuit diagram: zz-hs — one shared header (slpP) and one double-width shared footer (2x slpN) serve all four inverter stages (P1/N1 through P4/N4) of the wordline driver]
Zig-Zag Horizontal and Vertical Sharing

[Circuit diagram: zz-hvs — wordline driver lines K and K+1 share a single header (slpP) and footer (slpN) pair across all of their inverter stages]
Leakage Reduction of ZZ Horizontal and Vertical Sharing
- An increase in the virtual ground voltage increases the leakage reduction
- With vertical sharing, the shared footer carries the leakage of two drivers, so the virtual ground voltage rises further (V_M2 > V_M1) and the leakage reduction improves; the analysis expresses V_M1 and V_M2 as logarithmic functions of the ratio of driver width to sleep transistor width

[Circuit diagrams: (a) one driver (N11) per footer slpN, with virtual ground V_M1; (b) two drivers (N11, N21) sharing one footer slpN, with virtual ground V_M2]
ZZ-HVS Evaluation: Power Results
- Increasing the number of wordline rows that share sleep transistors increases the leakage reduction and reduces the area overhead
- Leakage power reduction varies from 10X to 100X as 1 to 10 wordlines share the same sleep transistors
- 2~10X more leakage reduction compared to the zig-zag scheme

[Chart: leakage power (nW, log scale) vs. number of wordline rows sharing sleep transistors, for baseline, redundant, zigzag, zz-hs, and zz-hvs]
Wakeup Latency
- To benefit the most from the leakage savings of stacking sleep transistors, keep the gate bias voltage of the NMOS sleep transistor as low as possible (and of the PMOS as high as possible)
- Drawback: this impacts the wakeup latency of the wordline drivers
- The gate voltage of the sleep transistors can be controlled: increasing the gate voltage of the footer sleep transistor reduces the virtual ground voltage (V_M), which reduces the circuit wakeup delay overhead but also reduces the leakage power savings
Wakeup Delay vs. Leakage Power Reduction

[Chart: normalized leakage power and normalized wake-up delay across (footer, header) gate bias voltage pairs]

- There is a trade-off between the wakeup overhead and the leakage power saving
- Increasing the bias voltage increases the leakage power while decreasing the wakeup delay overhead
Multiple Sleep Modes

power mode | wakeup delay (cycles) | leakage reduction (%)
basic-lp   | 1 | 42
lp         | 2 | 75
aggr-lp    | 3 | 81
ultra-lp   | 4 | 90

- The power overhead of waking up the peripheral circuits is almost equivalent to the switching power of the sleep transistors
- Sharing a set of sleep transistors horizontally and vertically across the multiple stages of a (wordline) driver makes the power overhead even smaller
(A mode-selection sketch follows below.)
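As a concrete illustration of how a controller might use this table, here is a minimal mode-selection sketch. The mode names, wakeup delays, and leakage reductions are from the table above; the amortization threshold, function names, and example idle lengths are assumptions:

```python
# Illustrative sketch (not from the slides): pick the deepest sleep mode
# whose wakeup delay is worth paying for a predicted idle period.

SLEEP_MODES = [
    # (name, wakeup_delay_cycles, leakage_reduction) -- from the table
    ("basic-lp", 1, 0.42),
    ("lp",       2, 0.75),
    ("aggr-lp",  3, 0.81),
    ("ultra-lp", 4, 0.90),
]

def pick_mode(predicted_idle_cycles, min_idle_per_wakeup_cycle=100):
    """Choose the deepest mode whose wakeup delay is amortized over the
    predicted idle period (the threshold is an assumed heuristic)."""
    best = None
    for name, delay, reduction in SLEEP_MODES:
        if predicted_idle_cycles >= delay * min_idle_per_wakeup_cycle:
            best = (name, delay, reduction)
    return best

print(pick_mode(150))   # short idle -> basic-lp
print(pick_mode(1000))  # long idle  -> ultra-lp
```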
Reducing Leakage in L2 Cache Peripheral Circuits
Using Zig-Zag Share Circuit Technique
Static Architectural Techniques: SM
- The SM technique (ICCD'07) asserts the sleep signal by default
- Wakes up the L2 peripherals on an access to the cache
- Keeps the cache in the normal state for J cycles (the turn-on period) before returning it to stand-by mode (SM_J); there is no wakeup penalty during this period
- A larger J leads to lower performance degradation but lower energy savings
Static Architectural Techniques: IM
- The IM technique (ICCD'07) monitors the issue logic and functional units of the processor after an L2 cache miss
- Asserts the sleep signal if the issue logic has not issued any instructions and the functional units have not executed any instructions for K consecutive cycles (K=10)
- De-asserts the sleep signal M cycles before the miss is serviced
- No performance loss
(Both policies are sketched in code below.)
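Both policies can be restated as small per-cycle controllers. The sketch below is a behavioral restatement of SM_J and IM as described above, not the authors' hardware; the signal names are assumptions, and J=750 matches the SM-750 configuration used later:

```python
class SMController:
    """SM_J: sleep asserted by default; wake on access and stay in the
    normal state for J cycles (the turn-on period)."""
    def __init__(self, j=750):
        self.j = j
        self.turn_on_left = 0

    def step(self, l2_access):
        if l2_access:
            self.turn_on_left = self.j      # (re)start turn-on period
        sleep_asserted = self.turn_on_left == 0
        if self.turn_on_left:
            self.turn_on_left -= 1
        return sleep_asserted

class IMController:
    """IM: after an L2 miss, assert sleep once the issue logic and
    functional units have been idle K consecutive cycles; de-assert M
    cycles before the miss is serviced."""
    def __init__(self, k=10):
        self.k = k
        self.idle_cycles = 0

    def step(self, l2_miss_pending, issued_or_executed, miss_done_within_m):
        self.idle_cycles = 0 if issued_or_executed else self.idle_cycles + 1
        return (l2_miss_pending
                and self.idle_cycles >= self.k
                and not miss_done_within_m)
```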
More Insight on SM and IM
- For some benchmarks the SM and IM techniques are both effective: facerec, gap, perlbmk, and vpr
- IM works well in almost half of the benchmarks but is ineffective in the other half
- SM works well in about half of the benchmarks, but not the same benchmarks as IM
[Chart: leakage reduction of IM vs. SM-750 across SPEC2K benchmarks]

An adaptive technique combining IM and SM has the potential to deliver an even greater power reduction.
Which Technique Is the Best and When?

benchmark | DL1 miss rate | L2 miss rate | L1xL2 miss rates x 10K
ammp     | 0.05 | 0.19 | 96.11
applu    | 0.06 | 0.66 | 368.03
apsi     | 0.03 | 0.28 | 75.01
art      | 0.41 | 0.00 | 0.41
bzip2    | 0.02 | 0.04 | 7.09
crafty   | 0.00 | 0.01 | 0.17
eon      | 0.00 | 1.00 | 0.00
equake   | 0.02 | 0.67 | 124.36
facerec  | 0.03 | 0.31 | 86.11
galgel   | 0.04 | 0.01 | 2.11
gap      | 0.01 | 0.55 | 38.54
gcc      | 0.05 | 0.04 | 16.88
gzip     | 0.01 | 0.05 | 3.28
lucas    | 0.10 | 0.67 | 645.73
mcf      | 0.24 | 0.43 | 1023.88
mesa     | 0.00 | 0.27 | 8.02
mgrid    | 0.04 | 0.46 | 165.13
parser   | 0.02 | 0.07 | 13.76
perlbmk  | 0.01 | 0.46 | 22.88
sixtrack | 0.01 | 0.00 | 0.14
swim     | 0.09 | 0.63 | 561.41
twolf    | 0.05 | 0.00 | 0.16
vortex   | 0.00 | 0.23 | 6.94
vpr      | 0.02 | 0.15 | 33.95
wupwise  | 0.02 | 0.68 | 122.40
Average  | 0.05 | 0.31 | 136.50

For the L2 to be idle, either there are few L1 misses, or there are many L2 misses waiting for memory. The miss rate product (MRP) may therefore be a good indicator of cache behavior.
The Adaptive Techniques
- Adaptive Static Mode (ASM): MRP is measured only once, during an initial learning period (the first 100M committed instructions)
  - MRP > A -> IM (A=90); MRP <= A -> SM_J; initial technique: SM_J
- Adaptive Dynamic Mode (ADM): MRP is measured continuously over a K-cycle period (K is 10M) to choose IM or SM for the next 10M cycles
  - MRP > A -> IM (A=100); A >= MRP > B -> SM_N (B=200); otherwise -> SM_P
(See the sketch below.)
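A minimal sketch of the MRP metric and the two selection rules. Thresholds are quoted from the slide (note the ADM middle band is empty with A=100 and B=200 as printed, suggesting the original B is smaller than A); function names and example values are illustrative:

```python
def mrp(dl1_miss_rate, l2_miss_rate):
    """Miss rate product, scaled by 10K as in the table above."""
    return dl1_miss_rate * l2_miss_rate * 10_000

def asm_select(m, a=90):
    """ASM: profiled once over the first 100M committed instructions."""
    return "IM" if m > a else "SM_J"

def adm_select(m, a=100, b=200):
    """ADM: re-evaluated every 10M cycles. Thresholds as printed on the
    slide; the middle band is empty unless B < A."""
    if m > a:
        return "IM"
    if b < m <= a:
        return "SM_N"
    return "SM_P"

# Example with table values: mcf (DL1 0.24, L2 0.43) -> MRP ~ 1032
# (the table's 1023.88 comes from unrounded rates) -> IM.
print(asm_select(mrp(0.24, 0.43)))   # IM
print(adm_select(mrp(0.02, 0.04)))   # bzip2: MRP ~ 8 -> SM_P
```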
More Insight on ASM and ADM
- ASM attempts to find the more effective static technique per benchmark by profiling a small subset of the program
- ADM is more complex and attempts to find the more effective static technique at the finer granularity of 10M-cycle intervals, based on profiling the previous interval
Compare ASM with IM and SM

[Chart: fraction of IM and SM contributions for ASM_750 across SPEC2K benchmarks]

- For most benchmarks ASM correctly selects the more effective static technique (exception: equake)
- A small subset of a program can be used to identify L2 cache behavior: whether the cache is accessed very infrequently, or is idle because the processor is idle
ADM Results

[Chart: fraction of IM and SM contributions for ADM across SPEC2K benchmarks]

- For many benchmarks both IM and SM make a noticeable contribution: ADM is effective in combining them
- For some benchmarks either the IM or the SM contribution is negligible: ADM selects the better static technique
Power Results

[Charts: (a) leakage power savings and (b) total energy-delay reduction for ASM and ADM across SPEC2K benchmarks]

- Leakage reduction using ASM and ADM is 34% and 52%, respectively
- The overall energy-delay reduction is 29.4% and 45.5%, respectively
- 2~3X more leakage power reduction, and less performance loss, compared to the static approaches
RELOCATE: Register File Local Access Pattern Redistribution Mechanism for Power and Thermal Management in Out-of-Order Embedded Processor
Outline
- Motivation
- Background study
- Study of register file underutilization
- Study of register file default access patterns
- Access concentration and activity redistribution to relocate register file access patterns
- Results
Why Register File?
- The RF is one of the hottest units in a processor
- A small, heavily multi-ported SRAM, accessed very frequently
- Example: IBM PowerPC 750FX
Prior Work: Activity Migration
- Reduces temperature by migrating the activity to a replicated unit
- Requires a replicated unit: large area overhead, and leads to a large performance degradation

[Chart: temperature over time alternating between active and idle periods (T_ambient, T_init, T_final, T_crisis) for AM and AM+PG]
Conventional Register Renaming

[Diagram: register renamer with free list and active list (head and tail pointers)]

Instruction # | Original code | Renamed code
1 | RA <- ...    | PR1 <- ...
2 | .... <- RA   | .... <- PR1
3 | branch to _L | branch to _L
4 | RA <- ...    | PR4 <- ...
5 | ...          | ...
6 | _L:          | _L:
7 | .... <- RA   | .... <- PR1

Physical registers are allocated and released in a somewhat random order.
Analysis of Register File Operation
- Register file occupancy

[Charts: fraction of time RF occupancy is <16, 16-32, 32-48, and 48-64 entries for (a) MiBench and (b) SPECint2K]
Performance Degradation with a Smaller RF

[Charts: performance degradation with 48-, 32-, and 16-entry register files for (a) MiBench and (b) SPECint2K]
Analysis of Register File Operation
- Register file access distribution: the coefficient of variation (CV) shows the deviation from the average number of accesses across individual physical registers (computed in the sketch below)
- na_i is the number of accesses to physical register i during a specific period (10K cycles), n̄a is the average, and N is the total number of physical registers

$$CV_{access} = \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(na_i - \overline{na}\right)^2}}{\overline{na}}$$
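A direct transcription of the CV_access formula, with made-up access-count vectors showing how a uniform pattern scores near zero while a concentrated one scores high; only the 10K-cycle sampling period comes from the slides:

```python
import math

def cv_access(accesses):
    """CV = stddev(per-register access counts) / mean."""
    n = len(accesses)                     # N physical registers
    mean = sum(accesses) / n
    var = sum((a - mean) ** 2 for a in accesses) / n
    return math.sqrt(var) / mean

# A uniform pattern has CV ~ 0; concentrating the same traffic on a few
# registers drives CV up (illustrative counts, one 10K-cycle sample).
print(cv_access([100] * 64))             # 0.0  (uniform)
print(cv_access([800] * 8 + [0] * 56))   # ~2.65 (concentrated)
```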
Coefficient of Variation

[Charts: % coefficient of variation of RF accesses for (a) MiBench and (b) SPEC2K]
Register File Operation
- Underutilization that is distributed uniformly: while only a small number of registers are occupied at any given time, the total accesses are uniformly distributed over the entire physical register file during the course of execution
RELOCATE: Access Redistribution within a Register File
- The goal is to "concentrate" accesses within one partition (region) of the RF
- Some regions will then be idle (for 10K cycles): they can be power-gated and allowed to cool down

[Diagram: register activity under (a) the baseline, (b) in-order, and (c) distant redistribution patterns]
An Architectural Mechanism for Access Redistribution
- Active partition: a register renamer partition currently used in register renaming
- Idle partition: a register renamer partition that does not participate in renaming
- Active region: a region of the register file corresponding to a register renamer partition (whether active or idle) that has live registers
- Idle region: a region of the register file corresponding to a register renamer partition (whether active or idle) that has no live registers
Activity Migration without Replication
- An access concentration mechanism allocates registers from only one partition
- This default active partition (DAP) may run out of free registers before the 10K-cycle "convergence period" is over; another partition (chosen according to some algorithm) is then activated (an additional active partition, AAP)
- To keep physical registers concentrated in the DAP, if two or more partitions are active and have free registers, allocation is performed in the same order in which the partitions were activated
The Access Concentration Mechanism

Partition activation order is 1-3-2-4.

[Diagram: four renamer partitions (P1-P4), each with its own free list and active list, activated in the order 1-3-2-4 as the earlier partitions run out of free registers]
The Redistribution Mechanism
- The default active partition is changed once every N cycles to redistribute the activity within the register file (according to some algorithm)
- Once a new default partition (NDP) is selected, all active partitions (DAP+AAP) become idle
- Idle partitions do not participate in register renaming, but their corresponding RF regions may have to be kept active (powered up), since a physical register in an idle partition may still be live
- An idle RF region is power-gated when its active list becomes empty
(A combined sketch of concentration and redistribution follows.)
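The concentration and redistribution mechanisms can be sketched together. This behavioral model keeps the slide's rules (ordered allocation across active partitions, epoch-based change of the default partition); the partition count and size, the round-robin NDP choice, and the data structures are assumptions:

```python
class RFPartition:
    def __init__(self, size):
        self.free = list(range(size))   # free physical registers
        self.live = set()               # allocated (live) registers

class Relocate:
    def __init__(self, num_parts=4, part_size=16, epoch=10_000):
        self.parts = [RFPartition(part_size) for _ in range(num_parts)]
        self.order = [0]                # activation order: DAP first, then AAPs
        self.epoch = epoch              # redistribution period in cycles

    def allocate(self):
        """Concentrate: allocate from partitions in activation order
        (assumes a free register exists somewhere in the RF)."""
        for idx in self.order:
            p = self.parts[idx]
            if p.free:
                reg = p.free.pop()
                p.live.add(reg)
                return idx, reg
        # DAP and AAPs exhausted: activate one more partition (AAP).
        nxt = next(i for i in range(len(self.parts)) if i not in self.order)
        self.order.append(nxt)
        return self.allocate()

    def redistribute(self):
        """New epoch: pick a new default partition; old partitions go idle,
        and their regions can be power-gated once their live sets drain
        (register release/draining is not modeled in this sketch)."""
        ndp = (self.order[0] + 1) % len(self.parts)  # assumed round-robin
        self.order = [ndp]
```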
Performance Impact?
- There is a two-cycle delay to wake up a power-gated physical register region
- Register renaming occurs in the front end of the microprocessor pipeline, whereas the register access occurs in the back end: there is a delay of at least two pipeline stages between renaming and accessing a physical register
- So the required register file region can be woken up in time, without incurring a performance penalty at the time of access
Results: MiBench RF Power Reduction

[Chart: RF power reduction (0-55%) across MiBench benchmarks for num_partition = 2, 4, and 8]
Results: SPEC2K RF Power Reduction

[Chart: RF power reduction (0-50%) across SPEC2K benchmarks for num_partition = 2, 4, and 8]
Analysis of Power Reduction
- Increasing the number of RF partitions provides more opportunity to capture unmapped registers and cluster them into a partition; it also means the wakeup overhead is amortized over a larger number of partitions
- Some exceptions: the overall power overhead of waking up an idle region grows as the number of partitions increases, and power gating becomes frequent but ineffective, adding overhead
Peak Temperature Reduction

Table 1. Peak temperature reduction for MiBench benchmarks

benchmark | base temp (C) | reduction 2P (C) | 4P (C) | 8P (C)
basicMath | 94.3 | 3.6 | 4.8 | 5.0
bc | 95.4 | 3.8 | 4.4 | 5.2
crc | 92.8 | 5.3 | 6.0 | 6.0
dijkstra | 98.4 | 6.3 | 6.8 | 6.4
djpeg | 96.3 | 2.8 | 3.5 | 2.4
fft | 94.5 | 6.8 | 7.4 | 7.6
gs | 89.8 | 6.5 | 7.4 | 9.7
gsm | 92.3 | 5.8 | 6.7 | 6.9
lame | 90.6 | 6.2 | 8.5 | 11.3
mad | 93.3 | 3.8 | 4.3 | 2.2
patricia | 79.2 | 11.0 | 12.4 | 13.2
qsort | 88.3 | 10.1 | 11.6 | 11.9
search | 93.8 | 8.7 | 9.3 | 9.1
sha | 90.1 | 5.1 | 5.4 | 4.5
susan_corners | 92.7 | 4.7 | 5.3 | 5.1
susan_edges | 91.9 | 3.7 | 5.8 | 6.3
tiff2bw | 98.5 | 4.5 | 5.9 | 4.1
average | 92.5 | 5.6 | 6.8 | 6.9

Table 2. Peak temperature reduction for SPEC2K integer benchmarks

benchmark | base temp (C) | reduction 2P (C) | 4P (C) | 8P (C)
bzip2 | 92.7 | 4.8 | 3.9 | 3.1
crafty | 83.6 | 9.5 | 11 | 10.4
eon | 77.3 | 10.6 | 12.4 | 12.5
galgel | 89.4 | 6.9 | 7.2 | 5.8
gap | 86.7 | 4.8 | 5.9 | 7.1
gcc | 79.8 | 7.9 | 9.4 | 10.1
gzip | 95.4 | 3.2 | 3.8 | 3.9
mcf | 85.8 | 6.9 | 8.7 | 9.4
parser | 97.8 | 4.3 | 5.8 | 4.8
perlbmk | 85.8 | 10.6 | 12.3 | 12.6
twolf | 86.2 | 8.8 | 10.2 | 10.5
vortex | 81.7 | 11.3 | 12.5 | 12.9
vpr | 94.6 | 4.9 | 5.2 | 4.4
average | 87.4 | 7.2 | 8.3 | 8.2
Analysis of Temperature Reduction
- Increasing the number of partitions results in a larger power density in each partition, because RF access activity is concentrated in a smaller partition
- While capturing more idle partitions and power-gating them may potentially yield a higher power reduction, the larger power density of the smaller partitions results in an overall higher temperature
Adaptive Resource Resizing for Improving Performance in
Embedded Processor
Introduction
- Technology scaling into the ultra-deep submicron regime has allowed hundreds of millions of gates to be integrated onto a single chip
- Designers have ample silicon budget to add more processor resources to exploit application parallelism and improve performance
- Restrictions on the power budget and on practically achievable operating clock frequencies are the limiting factors
- Increasing the register file (RF) size increases its access time, which reduces processor frequency
- Dynamically resizing the RF in tandem with dynamic frequency scaling (DFS) significantly improves performance
Motivation for Increasing RF Size
- After a long-latency L2 cache miss, the processor executes some independent instructions but eventually ends up stalled
- After an L2 cache miss, one of the ROB, IQ, RF, or LQ/SQ fills up and the processor stalls until the miss is serviced
- With larger resources it is less likely that these resources fill up completely during the L2 cache miss service time, which can potentially improve performance
- The sizes of these resources have to be scaled up together; otherwise the non-scaled ones become a performance bottleneck

[Chart: frequency of stalls due to L2 cache misses (0-40%) in the PowerPC 750FX architecture]
Impact of Increasing RF Size
- Increasing the size of the RF (as well as the ROB, LQ, and IQ) can potentially increase processor performance by reducing the occurrence of idle periods
- But it has a critical impact on the achievable processor operating frequency: the RF decides the maximum achievable operating frequency, and the bitline delay increases significantly as the RF size grows

[Chart: breakdown of RF component delay (input driver, decoder, wordline, bitline, sense amp, output driver) for RF-24, RF-32, and RF-48]
Analysis of RF Component Access Delay
- The equivalent capacitance on the bitline is Ceq = N x (diffusion capacitance of the pass transistors) + wire capacitance (usually 10% of the total diffusion capacitance), where N is the total number of rows (a numeric sketch follows)
- As the number of rows increases, the equivalent bitline capacitance increases, and therefore the propagation delay increases

[Chart: reduction in clock frequency with increasing resource size]
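A back-of-the-envelope version of this bitline model. The diffusion capacitance and equivalent resistance are placeholder values, and 0.69·RC is the usual first-order step-response delay estimate:

```python
# Ceq = N * Cdiff + ~10% wire capacitance, delay ~ 0.69 * R * Ceq,
# following the slide's model; component values are illustrative.

def bitline_delay(rows, c_diff_ff=0.5, r_eq_kohm=10.0):
    """Return an RC-style bitline delay estimate in picoseconds."""
    c_eq_ff = rows * c_diff_ff          # pass-transistor diffusion cap
    c_eq_ff *= 1.10                     # + ~10% wire capacitance
    return 0.69 * r_eq_kohm * c_eq_ff   # kohm * fF = ps

for rows in (24, 32, 48):
    print(f"RF-{rows}: ~{bitline_delay(rows):.0f} ps")
# Delay grows linearly with the number of rows, which is why the larger
# RF configurations force a lower clock frequency.
```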
Impact on Execution Time
- Execution time increases with larger resource sizes
- There is a trade-off between larger resources (which reduce the occurrence of idle periods) and the lower clock frequency they force; the latter plays the major role in deciding performance in terms of execution time

[Chart: normalized execution time for Baseline, Conf-1, and Conf-2 with reduced operating frequency, compared to the baseline architecture]
Dynamic Register File Resizing
- Dynamic RF scaling based on L2 cache misses lets the processor use a smaller RF (with a lower access time) during periods with no pending L2 cache miss (normal periods), and a larger RF (at the cost of a higher access time) during L2 cache miss periods
- To keep RF access at one cycle, the operating clock frequency is reduced when the RF is scaled up
- DFS needs to be applied fast, otherwise it erodes the performance benefit: this requires a PLL architecture capable of applying DFS with the least transition delay
- The studied processor (IBM PowerPC 750) uses a dual-PLL architecture that allows fast DFS with effectively zero latency
Circuit Modification
- The challenge is to design the RF so that its access time can be controlled dynamically
- Among all RF components, the bitline delay increase is responsible for the majority of the RF access time increase, so the bitline load is adjusted dynamically

[Circuit diagram: proposed RF modification — the bitline is split into a lower and an upper segment by a transmission gate (segment select); each upper-segment entry carries a single free/taken bit used to signal upper segment full/empty; the sense amp and bitline pre-charge circuit sit below the lower segment]
L2 Miss Driven RF Scaling (L2MRFS)
- Normal period: the upper segment is power-gated and the transmission gate is turned off, isolating the lower bitline segment from the upper bitline segment; only the lower segment's bitline is pre-charged
- L2 cache miss period: the transmission gate is turned on and both segments' bitlines are pre-charged
- Downsize at the end of the cache miss period, once the upper segment is empty
- The upper segment is augmented with one extra bit per entry: the bit is set when a register is taken and reset when it is released, and ORing these bits detects when the segment is empty
(Sketched in code below.)
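A behavioral sketch of the L2MRFS controller. The per-entry taken bits with OR-reduction and the miss-driven upsize/downsize triggers follow the slides; the class structure and method names are mine:

```python
class L2MRFS:
    """Sketch of the L2-miss-driven RF scaling controller."""

    def __init__(self, upper_entries=16):
        self.upper_taken = [False] * upper_entries  # extra bit per entry
        self.upsized = False            # transmission gate on, both segments
        self.downsize_pending = False

    def upper_empty(self):
        # OR-reduce the taken bits, as on the slide (inverted here).
        return not any(self.upper_taken)

    def on_l2_miss(self):
        # Connect the upper bitline segment; clock frequency is lowered.
        self.upsized = True
        self.downsize_pending = False

    def on_miss_serviced(self):
        # Want to shrink, but only once the upper segment drains.
        self.downsize_pending = True
        self._try_downsize()

    def on_register_allocated(self, entry):
        self.upper_taken[entry] = True

    def on_register_released(self, entry):
        self.upper_taken[entry] = False
        self._try_downsize()

    def _try_downsize(self):
        if self.downsize_pending and self.upper_empty():
            self.upsized = False    # power-gate upper segment, raise clock
            self.downsize_pending = False
```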
Performance and Energy-Delay

[Charts: (a) normalized performance improvement for L2MRFS with DYN_Conf_1 and DYN_Conf_2; (b) normalized energy-delay product compared to conf_1 and conf_2]

- Performance improvement: 6% and 11%
- Energy-delay reduction: 3.5% and 7%
Inter-core Selective Resource Pooling in 3D Chip
Multiprocessor
An Example!

[Chart: number of occupied RF entries over time for mcf and gcc on a dual-core CMP; across intervals I1-I3 the roles swap between the greedy core and the helper core (greedy: mcf, helper: gcc, then greedy: gcc, helper: mcf)]
Preliminary Results for Register File Pooling

[Diagram: register files on two stacked layers (layer 0, layer 1) participating in resource pooling via MUXes and die-to-die vias]

[Chart: speedup (normalized IPC) of resource pooling for single-core, 2-core, 3-core, and 4-core configurations]
Challenges
- The level of resource sharing: "loose pooling" gives the HELPER core higher priority in accessing the pooled resource; "tight pooling" gives priority to the GREEDY core
- The granularity of resource sharing: number of entries, number of ports
- The level of confidence in predicting resource utilization: avoid starving the HELPER core while avoiding over-provisioning for the GREEDY core
- A new floorplan that puts identical resources close to each other can incur additional thermal and power burden on already power-hungry and thermally critical resources
Conclusion
- Power-, thermal-, and reliability-aware high-performance design through an inter-disciplinary approach
Reducing Leakage in L2 Cache Peripheral Circuits
Using Multiple Sleep Mode Technique
Multiple Sleep Modes

power mode | wakeup delay (cycles) | leakage reduction (%)
basic-lp   | 1 | 42
lp         | 2 | 75
aggr-lp    | 3 | 81
ultra-lp   | 4 | 90

- The power overhead of waking up the peripheral circuits is almost equivalent to the switching power of the sleep transistors
- Sharing a set of sleep transistors horizontally and vertically across the multiple stages of a (wordline) driver makes the power overhead even smaller
Reducing Leakage in L1 Data Cache
- To maximize the leakage reduction in the DL1 cache, put the DL1 peripherals into the ultra low power mode; but this adds 4 cycles to the DL1 latency and significantly reduces performance
- To minimize performance degradation, put the DL1 peripherals into the basic low power mode; this requires only one cycle to wake up, and that latency can be hidden during the address computation stage, so performance does not degrade — but the leakage power reduction is not noticeable
Motivation for Dynamically Controlling the Sleep Mode
- Large leakage reduction benefit: ultra and aggressive low power modes
- Low performance impact benefit: basic-lp mode
- Periods of frequent access: basic-lp mode
- Periods of infrequent access: ultra and aggressive low power modes
- Therefore: dynamically adjust the peripheral circuits' sleep power mode
Reducing DL1 Wakeup Delay
- Whether an instruction is a load or a store can be determined at least one cycle prior to cache access, so the DL1 peripherals can be woken up one cycle before the access
- Accessing the DL1 while its peripherals are in basic-lp mode then doesn't require an extra cycle, and one cycle of wakeup delay can be hidden for all other low-power modes
- Hence: put the DL1 in basic-lp mode by default
Architectural Motivations
- A load miss in the L1/L2 caches takes a long time to service and prevents dependent instructions from being issued
- When dependent instructions cannot issue, performance is lost, and energy is lost as well
- This is an opportunity to save energy
Low-End Architecture
- Given a miss service time of 30 cycles, the processor likely stalls during the miss service period
- Additional cache misses occurring while one DL1 cache miss is already pending further increase the chance of a pipeline stall

[State diagram: a DL1 miss moves the peripherals from basic-lp to lp; additional pending DL1 misses move them to aggr-lp and then ultra-lp; once the misses are serviced and the processor continues, the peripherals return to basic-lp]
Low Power Modes in a 2KB DL1 Cache
- Fraction of total execution time the DL1 cache spends in each power mode
- 85% of the time the DL1 peripherals are put into low power modes
- Most of the time is spent in the basic-lp mode (58% of total execution time)

[Chart: fraction of execution time in hp, trivial-lp, lp, aggr-lp, and ultra-lp modes across MiBench benchmarks]
Low Power Modes in the Low-End Architecture

[Charts: (a) frequency of the different low power modes (hp, basic-lp, lp, aggr-lp, ultra-lp) for 2KB-16KB DL1 caches; (b) performance degradation across MiBench benchmarks for 2KB-16KB DL1 caches]

- Increasing the cache size reduces the DL1 cache miss rate
- This reduces the opportunities to put the cache into the more aggressive low power modes
- It also reduces the performance degradation for larger DL1 caches
High-End Architecture

[State diagram: a DL1 miss moves the peripherals from basic-lp to lp; an L2 miss moves them to ultra-lp; they return as the misses are serviced]

- The DL1 transitions to ultra-lp mode right after an L2 miss occurs
- Given a long L2 cache miss service time (80 cycles), the processor will stall waiting for memory
- The DL1 returns to basic-lp mode once the L2 miss is serviced
(Sketched below.)
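The high-end policy reduces to a small state machine. This sketch follows the transitions stated above (basic-lp by default, lp on a DL1 miss, ultra-lp on an L2 miss, back to basic-lp when serviced); the event-handler framing and names are assumptions:

```python
BASIC_LP, LP, ULTRA_LP = "basic-lp", "lp", "ultra-lp"

class DL1ModeController:
    """High-end DL1 peripheral power-mode controller (behavioral sketch)."""

    def __init__(self):
        self.mode = BASIC_LP            # default mode

    def on_dl1_miss(self):
        if self.mode == BASIC_LP:
            self.mode = LP

    def on_l2_miss(self):
        self.mode = ULTRA_LP            # long stall expected (~80 cycles)

    def on_l2_miss_serviced(self):
        self.mode = BASIC_LP

    def on_dl1_miss_serviced(self, misses_pending):
        if self.mode == LP and not misses_pending:
            self.mode = BASIC_LP
```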
Low Power Modes in a 4KB Cache
- For many benchmarks the ultra-lp mode makes a considerable contribution
- These benchmarks have a high L2 miss rate, which triggers the transition to the ultra low power mode

[Chart: fraction of execution time in hp, trivial-lp, lp, and ultra-lp modes across SPEC2K benchmarks]
Low Power Modes in the High-End Architecture

[Charts: (a) frequency of the different low power modes (hp, basic-lp, lp, ultra-lp) and (b) performance degradation, for 4KB-64KB through 32KB-512KB DL1-L2 configurations]

- Increasing the cache size reduces the DL1 cache miss rate
- This reduces the opportunities to put the cache into the more aggressive low power modes
- It also reduces the performance degradation
Leakage Power Reduction: Low-End Architecture
- DL1 leakage is reduced by 50%
- While the ultra-lp mode occurs much less frequently than the basic-lp mode, its leakage reduction is comparable: in ultra-lp mode the peripheral leakage is reduced by 90%, almost twice that of basic-lp mode

[Chart: leakage reduction contribution of trivial-lp, lp, aggr-lp, and ultra-lp modes across MiBench benchmarks]
Leakage Power Reduction: High-End Architecture
- The average leakage reduction is almost 50%

[Chart: leakage reduction contribution of trivial-lp, lp, and ultra-lp modes across SPEC2K benchmarks]