Power Management in High Performance Processors through Dynamic Resource Adaptation and Multiple Sleep Mode Assignments
Houman Homayoun
National Science Foundation Computing Innovation Fellow
Department of Computer Science, University of California San Diego
Copyright © 2010 Houman Homayoun, University of California San Diego
Outline – Multiple Sleep Mode
- Brief overview of a state-of-the-art superscalar processor
- The idea of multiple sleep mode design
- Architectural control of multiple sleep modes
- Results
- Conclusions
Superscalar Architecture
[Pipeline diagram: Fetch, Decode, Rename, Dispatch, Issue, Execute, Write-Back; with the Instruction Queue, Reservation Station, ROB, Load Store Queue, Logical and Physical Register Files, and functional units (F.U.)]
On-chip SRAMs+CAMs and Power
On-chip SRAMs and CAMs in high-performance processors are large:
- Branch Predictor
- Reorder Buffer
- Instruction Queue
- Instruction/Data TLB
- Load and Store Queue
- L1 Data Cache
- L1 Instruction Cache
- L2 Cache

Together they occupy more than 60% of the chip area budget and dissipate a significant portion of power via leakage.

[Pentium M processor die photo, courtesy of intel.com]
Techniques to Address Leakage in SRAM+CAM
Circuit techniques:
- Gated-Vdd, Gated-Vss
- Voltage Scaling (DVFS)
- ABB-MTCMOS
- Forward Body Biasing (FBB), Reverse Body Biasing (RBB)
- Sleepy Stack
- Sleepy Keeper

Architecture techniques:
- Way Prediction, Way Caching, Phased Access: predict or cache recently accessed ways, or read tags first
- Drowsy Cache: keeps cache lines in a low-power state with data retention
- Cache Decay: evict lines not used for a while, then power them down
- Applying DVS, Gated-Vdd, or Gated-Vss to the memory cells, with substantial architectural support
Sleep Transistor Stacking Effect
Subthreshold leakage current is an inverse exponential function of the threshold voltage.

Stacking transistor N on top of sleep transistor slpN: when both transistors are off, the source-to-body voltage (V_M) of transistor N increases, which raises its threshold voltage through the body effect and thereby reduces its subthreshold leakage current:

$V_T = V_{T0} + \gamma\,(\sqrt{2\phi_F + V_{SB}} - \sqrt{2\phi_F})$

[Circuit diagram: transistor N stacked on footer sleep transistor slpN between the load capacitance C_L (at voltage V_C) and vss; V_M is the virtual-ground node between N and slpN; V_gn and V_gslpn are the gate voltages of N and slpN]

Drawbacks: increased rise time, fall time, wakeup delay, area, dynamic power, and instability.
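The stacking effect can be sketched numerically: the body effect raises V_T as the source-to-body voltage grows, and subthreshold leakage falls exponentially with V_T. A minimal sketch follows; all device parameter values (V_T0, gamma, phi_F, the subthreshold slope factor) are illustrative assumptions, not values from the slides.

```python
import math

# Body-effect model: stacking raises the source-to-body voltage V_SB of the
# stacked transistor, which raises V_T and cuts subthreshold leakage
# exponentially. Parameter values are illustrative assumptions.
VT0 = 0.30         # zero-bias threshold voltage (V), assumed
GAMMA = 0.40       # body-effect coefficient (V^0.5), assumed
PHI_F = 0.35       # Fermi potential (V), assumed
N_FACTOR = 1.5     # subthreshold slope factor, assumed
V_THERMAL = 0.026  # kT/q at room temperature (V)

def threshold_voltage(v_sb):
    """V_T = V_T0 + gamma * (sqrt(2*phi_F + V_SB) - sqrt(2*phi_F))."""
    return VT0 + GAMMA * (math.sqrt(2 * PHI_F + v_sb) - math.sqrt(2 * PHI_F))

def relative_subthreshold_leakage(v_sb):
    """Leakage relative to V_SB = 0: I is proportional to exp(-V_T/(n*vT))."""
    dvt = threshold_voltage(v_sb) - threshold_voltage(0.0)
    return math.exp(-dvt / (N_FACTOR * V_THERMAL))

# Even a modest virtual-ground voltage V_M noticeably suppresses leakage.
for v_m in (0.0, 0.05, 0.10, 0.20):
    print(f"V_SB = {v_m:.2f} V -> relative leakage = "
          f"{relative_subthreshold_leakage(v_m):.3f}")
```

With these assumed parameters, the leakage ratio drops monotonically as V_M rises, matching the qualitative trend on the slide.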
Wakeup Latency
To benefit the most from the leakage savings of sleep transistor stacking, keep the gate bias voltage of an NMOS sleep transistor as low as possible (and of a PMOS sleep transistor as high as possible).

Drawback: this maximizes the impact on the circuit's wakeup latency (sleep transistor wakeup delay plus sleep signal propagation delay).

The trade-off can be tuned by controlling the gate voltage of the sleep transistors: increasing the gate voltage of a footer sleep transistor reduces the virtual-ground voltage (V_M), which reduces the circuit's wakeup delay overhead but also reduces the leakage power savings.
Increasing the bias voltage increases the leakage power while decreasing the wakeup delay overhead.

[Chart: normalized leakage power (1.0 to 5.0) and normalized wake-up delay (0.2 to 1.0) versus the (footer, header) gate bias voltage pair]
Wakeup Delay vs. Leakage Power Reduction
There is a trade-off between the wakeup delay overhead and the leakage power savings.
Multiple Sleep Modes Specifications
The wakeup delay varies from 1 to more than 10 processor cycles (at 2.2 GHz), and large SRAMs incur a large wakeup power overhead. We therefore need to find periods of infrequent access.
On-chip SRAM multiple sleep mode normalized leakage power savings:

Mode      BPRED  FRF   IRF   IL1   DL1   L2    DTLB  ITLB
basic-lp  0.29   0.21  0.21  --    --    --    0.25  0.25
lp        0.43   0.31  0.31  0.37  0.37  --    0.34  0.34
aggr-lp   0.55   0.58  0.58  0.48  0.48  0.44  0.49  0.49
ultra-lp  0.67   0.65  0.65  0.69  0.64  0.63  0.57  0.57
Reducing Leakage in SRAM Peripherals
To maximize the leakage reduction, put the SRAM into the ultra low-power mode; this adds a few cycles to the SRAM access latency and significantly reduces performance.

To minimize performance degradation, put the SRAM into the basic low-power mode; this requires near-zero wakeup overhead but yields no noticeable leakage power reduction.
Motivation for Dynamically Controlling Sleep Mode
The ultra and aggressive low-power modes offer a large leakage reduction benefit; the basic-lp mode offers a low performance impact. Periods of frequent access call for the basic-lp mode, while periods of infrequent access call for the ultra and aggressive low-power modes. The sleep power mode should therefore be adjusted dynamically.
Architectural Motivations
A load miss in the L1/L2 caches takes a long time to service and prevents dependent instructions from being issued. When dependent instructions cannot issue, performance is lost, and at the same time, energy is lost as well. This is an opportunity to save energy.
Multiple Sleep Mode Control Mechanism
An L2 cache miss, or multiple pending DL1 misses, triggers a power mode transition. The general algorithm may not deliver optimal results for all units, so it is modified for the individual on-chip SRAM-based units to maximize the leakage reduction at no performance cost.
[State diagram over the modes basic-lp, lp, aggr-lp and ultra-lp; transition events include pending DL1 misses, a processor stall with 3 pending DL1 misses, pending L2 miss(es)/an L2 miss, and the processor continuing once all pending DL1 misses are serviced]
General state machine to control power mode transitions
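The control mechanism can be sketched in software. The state names and trigger events below come from the slides; the event interface and the exact transition priorities are assumptions for illustration, not the actual controller.

```python
# Sketch of the general power-mode state machine. Mode names and triggers
# follow the slides; the transition details are assumed for illustration.
BASIC_LP, LP, AGGR_LP, ULTRA_LP = "basic-lp", "lp", "aggr-lp", "ultra-lp"

class SleepModeController:
    def __init__(self):
        self.mode = BASIC_LP
        self.pending_dl1 = 0
        self.pending_l2 = 0

    def on_dl1_miss(self):
        self.pending_dl1 += 1
        self._update()

    def on_dl1_serviced(self):
        self.pending_dl1 = max(0, self.pending_dl1 - 1)
        self._update()

    def on_l2_miss(self):
        self.pending_l2 += 1
        self._update()

    def on_l2_serviced(self):
        self.pending_l2 = max(0, self.pending_l2 - 1)
        self._update()

    def _update(self):
        if self.pending_l2 > 0:        # pending L2 miss/es: long stall ahead
            self.mode = ULTRA_LP
        elif self.pending_dl1 >= 3:    # processor stall: 3 pending DL1 misses
            self.mode = AGGR_LP
        elif self.pending_dl1 > 0:     # some pending DL1 misses
            self.mode = LP
        else:                          # processor continues: all serviced
            self.mode = BASIC_LP
```

A unit deepens its sleep mode as the expected idle period grows, and returns to basic-lp as soon as all pending misses are serviced.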
Branch predictor: instructions per branch-predictor access (IPB)

ammp    4.5    equake   4.21   mcf      3.9    twolf    7.6
applu   324.1  facerec  20.0   mesa     11.0   vortex   5.7
apsi    28.9   galgel   14.3   mgrid    310.4  vpr      9.0
art     8.1    gap      14.2   parser   6.0    wupwise  8.7
bzip2   6.7    gcc      6.3    perlbmk  7.2    average  37.8
crafty  8.5    gzip     9.5    sixtrack 11.9
eon     8.2    lucas    25.6   swim     77.1
On average, 1 out of every 9 fetched instructions in the integer benchmarks, and 1 out of every 63 fetched instructions in the floating-point benchmarks, accesses the branch predictor. Always keeping the branch predictor in a deep low-power mode (lp, ultra-lp or aggr-lp) and waking it up on access causes noticeable performance degradation for some benchmarks.
Observation: Branch Predictor Access Pattern
[Plots: IPB measured every 512 cycles over a 1M-cycle window, for equake (IPB up to ~30) and swim (IPB up to ~350)]
Within a benchmark there is significant variation in Instructions Per Branch (IPB).
Once the IPB drops (or rises) significantly, it may remain low (or high) for a long period of time.
Distribution of the number of branches per 512-instruction interval (over 1M cycles)
Branch Predictor Peripherals Leakage Control
A high-IPB period can be identified as soon as its first interval is detected: the number of fetched branches is counted every 512 cycles, and once the branch count in an interval falls below a certain threshold (24 in this work), a high-IPB period is identified. The IPB is then predicted to remain high for the next twenty 512-cycle intervals (10K cycles).

The branch predictor peripherals transition from basic-lp mode to lp mode when a high-IPB period is identified. During pre-stall and stall periods, the branch predictor peripherals transition to aggr-lp and ultra-lp mode, respectively.
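The IPB-based part of the policy can be sketched as follows. The 512-cycle interval, the 24-branch threshold, and the twenty-interval (10K-cycle) prediction window come from the slides; the surrounding bookkeeping and method names are assumptions for illustration.

```python
# Sketch of the IPB-based mode controller for the branch predictor
# peripherals. Constants are from the slides; the interface is assumed.
INTERVAL_CYCLES = 512
BRANCH_THRESHOLD = 24
HIGH_IPB_INTERVALS = 20  # predicted duration of a high-IPB period

class BPredModeController:
    def __init__(self):
        self.mode = "basic-lp"
        self.branches_this_interval = 0
        self.high_ipb_intervals_left = 0

    def on_fetched_branch(self):
        self.branches_this_interval += 1

    def on_interval_end(self):
        """Called once every INTERVAL_CYCLES cycles."""
        if self.branches_this_interval < BRANCH_THRESHOLD:
            # Few branches fetched: high IPB, predictor is rarely accessed,
            # so predict the next 20 intervals (~10K cycles) stay high-IPB.
            self.high_ipb_intervals_left = HIGH_IPB_INTERVALS
        elif self.high_ipb_intervals_left > 0:
            self.high_ipb_intervals_left -= 1
        self.mode = "lp" if self.high_ipb_intervals_left > 0 else "basic-lp"
        self.branches_this_interval = 0
```

Pre-stall and stall periods would override this mode with aggr-lp and ultra-lp, as described above.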
Leakage Power Reduction
[Bar chart: normalized leakage power reduction (0% to 40%) for each benchmark and on average, broken down by the basic-lp, lp, aggr-lp and ultra-lp modes]

There is a noticeable contribution from both the ultra and basic low-power modes.
Outline – Resource Adaptation
- Why are the IQ, ROB and RF major power dissipators?
- Study of processor resource utilization during the service time of L2/multiple L1 misses
- An architectural approach to dynamically adjusting the size of resources during cache miss periods for power conservation
- Results
- Conclusions
Instruction Queue
The Instruction Queue is a CAM-like structure which holds instructions until they can be issued. It must:
- Set entries for newly dispatched instructions
- Read entries to issue instructions to the functional units
- Wake up instructions waiting in the IQ once a result is ready
- Select instructions for issue when the number of ready instructions exceeds the processor issue limit (issue width)

The main source of complexity is the wakeup logic.
Logical View of Instruction Queue
[Diagram: wakeup logic for one IQ entry. Each source tag cell has four one-bit comparators, one per broadcast tagline (tag00..tag03 through tagIW0..tagIW3); matchline1..matchline4 are pre-charged to Vdd, and their OR sets the operand's ready bit]

There is no need to always have such an aggressive wakeup/issue width!
At each cycle, the matchlines are pre-charged high to allow the individual bits of an instruction's tag to be compared with the results broadcast on the taglines. Upon a mismatch, the corresponding matchline is discharged; otherwise, the matchline stays at Vdd, which indicates a tag match. Since up to 4 instructions broadcast on the taglines each cycle, four sets of one-bit comparators are needed for each one-bit cell, and all four matchlines must be ORed together to detect a match on any of the broadcast tags. The result of the OR sets the ready bit of the instruction's source operand.
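Behaviorally, the wakeup logic amounts to comparing every waiting source tag against all broadcast destination tags and ORing the per-tag matches. A minimal sketch follows; the data structures and tag names are assumptions for illustration, not the actual circuit.

```python
# Behavioral sketch of IQ wakeup: each cycle, up to ISSUE_WIDTH result tags
# are broadcast; every waiting source operand is compared against all of them
# (the per-matchline one-bit comparators), and the OR of the matches sets the
# operand's ready bit.
ISSUE_WIDTH = 4

class IQEntry:
    def __init__(self, src_tags):
        # One ready bit per source operand; None means no operand is needed.
        self.src_tags = list(src_tags)
        self.ready = [tag is None for tag in src_tags]

    def is_ready(self):
        return all(self.ready)

def wakeup(entries, broadcast_tags):
    """Broadcast up to ISSUE_WIDTH result tags to every IQ entry."""
    assert len(broadcast_tags) <= ISSUE_WIDTH
    for entry in entries:
        for i, tag in enumerate(entry.src_tags):
            if tag is None or entry.ready[i]:
                continue
            # OR of the four matchline comparisons for this operand.
            entry.ready[i] = any(tag == b for b in broadcast_tags)
```

An entry waiting only on tag "p7" becomes ready after `wakeup([entry], ["p7"])`; halving the broadcast width halves the comparator activity, which is the lever the resizing scheme later exploits.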
ROB and Register File
The ROB and the register file are multi-ported SRAM structures with several functions:
- Setting entries for up to IW instructions in each cycle
- Releasing up to IW entries per cycle during the commit stage
- Flushing entries during branch recovery
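The three operations listed above can be sketched with a circular buffer. This is a behavioral sketch only; the sizes, the entry payload, and the simplification that commit releases head entries unconditionally are assumptions for illustration.

```python
# Minimal circular-buffer sketch of a Reorder Buffer: allocate up to IW
# entries per cycle at dispatch, release up to IW entries per cycle at
# commit, and flush everything on branch recovery. Sizes are assumed.
from collections import deque

ISSUE_WIDTH = 4   # IW, assumed
ROB_SIZE = 96     # assumed

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()

    def dispatch(self, insts):
        """Allocate entries for up to IW dispatched instructions."""
        assert len(insts) <= ISSUE_WIDTH
        allocated = 0
        for inst in insts:
            if len(self.entries) == ROB_SIZE:
                break  # ROB full: dispatch stalls
            self.entries.append(inst)
            allocated += 1
        return allocated

    def commit(self):
        """Release up to IW entries from the head, in program order."""
        released = []
        while self.entries and len(released) < ISSUE_WIDTH:
            released.append(self.entries.popleft())
        return released

    def flush(self):
        """Branch recovery: discard all in-flight entries."""
        self.entries.clear()
```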
[Pie charts: register file power breakdown.
Dynamic power: bitline and memory cell 58%, data output driver 29%, decode 8%, sense amp 4%, wordline 1%.
Leakage power: bitline and memory cell 63%, data output driver 15%, decode 11%, wordline 8%, sense amp 3%.]
Architectural Motivations
A load miss in the L1/L2 caches takes a long time to service and prevents dependent instructions from being issued. When dependent instructions cannot issue, the instruction window (ROB, Instruction Queue, Store Queue, Register Files) fills up after a number of cycles; the processor issue stalls and performance is lost. At the same time, energy is lost as well. This is an opportunity to save energy.

Scenario I: an L2 cache miss period.
Scenario II: three or more pending DL1 cache misses.
How Architecture Can Help Reduce Power in the ROB, Register File and Instruction Queue
[Bar chart: per-benchmark issue rate decrease (0% to 100%) during Scenario I and Scenario II]

Scenario I: the issue rate drops by more than 80%.
Scenario II: the issue rate drops by 22% for integer benchmarks and 32.6% for floating-point benchmarks.
A significant issue width decrease!
How Architecture Can Help Reduce Power in the ROB, Register File and Instruction Queue
Increase in ROB occupancy (%) during Scenario I and Scenario II:

Benchmark   Scen. I  Scen. II    Benchmark   Scen. I  Scen. II
bzip2       165.0    88.6        applu       13.8     -4.9
crafty      179.6    63.6        apsi        46.6     18.2
gap         6.6      61.7        art         31.7     56.9
gcc         97.7     43.9        equake      49.8     38.1
gzip        152.9    41.0        facerec     87.9     14.1
mcf         42.2     40.6        galgel      30.9     34.4
parser      31.3     102.3       lucas       -0.7     54.0
twolf       81.8     58.8        mgrid       8.8      5.6
vortex      118.7    57.8        swim        -4.3     11.4
vpr         96.6     55.7        wupwise     40.2     24.4
INT average 98.2     61.4        FP average  30.5     25.2
ROB occupancy grows significantly during Scenarios I and II for the integer benchmarks: 98% and 61% on average, respectively. The increase in ROB occupancy for the floating-point benchmarks is smaller: 30% and 25% on average for Scenarios I and II.
How Architecture Can Help Reduce Power in the ROB, Register File and Instruction Queue
Register File occupancy (%):

Benchmark  ScenI-IRF  nonScenI-IRF  ScenI-FRF  nonScenI-FRF  ScenII-IRF  nonScenII-IRF  ScenII-FRF  nonScenII-FRF
bzip2 74.4 28.8 0.0 0.0 56.6 30.7 0.0 0.0
crafty 83.4 31.9 0.1 0.0 51.4 32.2 0.0 0.0
gap 46.2 41.1 0.1 0.7 65.8 42.9 0.6 0.5
gcc 46.3 21.2 0.2 0.1 28.7 24.0 0.0 0.1
gzip 45.1 27.2 0.0 0.0 39.8 27.2 0.0 0.0
mcf 40.8 29.3 1.0 1.1 46.8 36.4 3.2 0.1
parser 37.4 29.8 0.0 0.0 57.0 29.8 0.1 0.0
twolf 58.7 32.3 2.6 2.1 46.0 29.8 2.5 2.0
vortex 70.9 31.1 0.3 0.2 52.4 35.0 0.2 0.2
vpr 63.9 29.0 7.8 8.6 66.4 41.0 8.7 8.3
INT average 55.3 29.2 1.1 1.2 50.3 32.0 1.4 1.0
applu 6.0 5.6 76.6 64.8 1.7 6.2 77.3 73.7
apsi 16.1 18.3 65.7 37.6 15.8 17.9 58.8 43.6
art 35.4 25.0 36.2 30.7 23.0 29.0 42.9 6.3
equake 34.2 27.4 16.1 7.1 32.7 29.4 21.0 9.6
facerec 52.6 22.5 50.0 28.9 30.3 38.4 48.1 35.0
galgel 50.4 27.4 41.8 48.7 32.1 26.0 61.0 44.2
lucas 21.7 23.8 47.7 44.0 41.7 22.1 29.7 47.0
mgrid 5.9 6.2 90.0 80.7 1.9 6.4 96.7 87.2
swim 23.3 27.8 77.1 78.1 29.7 23.1 87.1 76.2
wupwise 26.3 28.8 53.5 28.7 40.5 26.9 38.0 42.2
FP average 26.6 20.9 56.5 44.7 24.0 22.1 56.2 46.0
IRF occupancy always grows in both scenarios when running the integer benchmarks. A similar pattern holds for the FRF when running the floating-point benchmarks, but only during Scenario II.
Proposed Architectural Approach
Adaptive resource resizing during cache miss periods:
- Reduce the issue and wakeup width of the processor during the L2 miss service time.
- Increase the size of the ROB and RF during the L2 miss service time or when at least three DL1 misses are pending.
- A simple resizing scheme is used (resize to half): it is not necessarily optimized for individual units, but it is simple to implement in circuit.
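The issue/wakeup-width half of the policy can be sketched as follows: halve the width while an L2 miss is outstanding and restore it when all misses are serviced. The width values and the event interface are assumptions for illustration, not the actual circuit-level mechanism.

```python
# Sketch of the adaptive resizing policy for the issue/wakeup width.
# Full width and the event interface are assumed.
FULL_ISSUE_WIDTH = 4

class IssueWidthController:
    def __init__(self):
        self.pending_l2_misses = 0
        self.issue_width = FULL_ISSUE_WIDTH

    def on_l2_miss(self):
        self.pending_l2_misses += 1
        self._resize()

    def on_l2_miss_serviced(self):
        self.pending_l2_misses = max(0, self.pending_l2_misses - 1)
        self._resize()

    def _resize(self):
        # The issue rate drops by more than 80% during L2 miss service
        # (Scenario I), so a halved issue/wakeup width costs little.
        if self.pending_l2_misses > 0:
            self.issue_width = FULL_ISSUE_WIDTH // 2
        else:
            self.issue_width = FULL_ISSUE_WIDTH
```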
Results
Small performance loss (~1%) with 15~30% dynamic and leakage power reduction.

[Bar charts: power (dynamic/leakage) reduction for the INT/FP register files (0% to 40%); IPC degradation (0% to 6%); power reduction for ROB leakage, ROB dynamic, and the Issue Queue (0% to 50%)]
Conclusions
- Introduced the idea of multiple sleep mode design
- Applied multiple sleep modes to on-chip SRAMs: find periods of low activity for state transitions
- Introduced the idea of resource adaptation
- Applied resource adaptation to on-chip SRAMs+CAMs: find periods of low activity for state transitions
- Applying similar adaptive techniques to other energy-hungry resources in the processor: multiple sleep mode functional units