Advanced Microarchitecture Lecture 15: Power


Page 1: Advanced  Microarchitecture

Advanced Microarchitecture

Lecture 15: Power

Page 2: Advanced  Microarchitecture

Basic Power Review

• Power = Voltage × Current
  – Voltage is usually a constant (we’ll talk about voltage scaling later)
  – Current varies
    • Depends on the block (cache vs. ALU vs. decoder …)
    • Depends on the application (int vs. FP vs. multimedia)
    • Depends on the program phase
• Another form:
  – $i = C\,dv/dt \;\Rightarrow\; v\,i\,dt = C\,v\,dv \;\Rightarrow\; E = \tfrac{1}{2}CV^2$ (energy per (dis)charge of one capacitor)
  – Power = energy of each capacitor × average number of times it is (dis)charged / time to (dis)charge
  – $P = \sum_{b \in \text{All Blocks}} \tfrac{1}{2} C_b V^2 a_b / t_c = \tfrac{1}{2} V^2 f \sum_b C_b a_b = \tfrac{1}{2} a C V^2 f$, with $f = 1/t_c$
  – C = total capacitance, a = average activity factor
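As a quick numerical check of the per-block sum above, here is a minimal sketch. The block capacitances, activity factors, voltage, and frequency are hypothetical values chosen only for illustration.

```python
def dynamic_power(blocks, voltage, frequency):
    """Return dynamic power in watts as the sum over blocks of 0.5 * C_b * a_b * V^2 * f.

    blocks: iterable of (capacitance_farads, activity_factor) pairs.
    """
    return sum(0.5 * c_b * a_b * voltage ** 2 * frequency for c_b, a_b in blocks)


# Hypothetical blocks: (capacitance in farads, activity factor)
blocks = [(2e-9, 0.1), (1e-9, 0.5)]
power_watts = dynamic_power(blocks, voltage=1.0, frequency=1e9)
print(f"dynamic power: {power_watts:.2f} W")
```

With these assumed numbers the two blocks contribute 0.10 W and 0.25 W, which matches the lumped form ½aCV²f when aC is taken as the sum of C_b·a_b.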

Page 3: Advanced  Microarchitecture

Static Power

• We talked about this in Lecture 1
  – Two types of static power
    • Leakage through the channel (sub-threshold conductance)
    • Leakage through the gate/oxide (tunneling)
• $P_{static} = P_{sub} + P_{oxide}$
• $P_{total} = P_{dynamic} + P_{static} = \tfrac{1}{2}aCV^2f + K_1 W e^{-V_T/(nV_\theta)}\left(1 - e^{-V/V_\theta}\right) + K_2 W (V/T_{ox})^2 e^{-\alpha T_{ox}/V}$

Page 4: Advanced  Microarchitecture

Trading Power for Performance

• $P = \tfrac{1}{2}aCV^2f$ and $f \propto V$ $\Rightarrow$ $P \propto V^3$
• To a first order, Perf $\propto f$ $\Rightarrow$ Perf $\propto V$
• [Figure: power vs. voltage, following $P \propto V^3$]
• For a linear decrease in voltage (and performance), we get a cubic decrease in (dynamic) power consumption
• Rule of thumb: for small $\Delta V$/$\Delta f$, 1% performance for every 3% power

http://download.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/vol7iss2_art03.pdf
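The rule of thumb follows directly from the cubic relationship. A small sketch, assuming the first-order model in which voltage, frequency, and performance all scale together:

```python
def relative_dynamic_power(scale):
    """Relative dynamic power when V and f are both scaled by `scale` (P ~ V^3)."""
    return scale ** 3


for scale in (1.00, 0.99, 0.90, 0.80):
    saved = 1.0 - relative_dynamic_power(scale)
    print(f"performance {scale:.0%} -> power {relative_dynamic_power(scale):.1%} "
          f"(saves {saved:.1%})")
```

At 99% performance the model gives about 97% power, i.e. roughly 3% power for 1% performance. For larger steps the ratio is no longer 3:1, which is why the rule is stated only for small changes.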

Page 5: Advanced  Microarchitecture

Limits of Trading

• $V_{dd} - V_T > V_{\text{Noise Margin}}$
• $V_{dd}$ cannot be scaled below $V_T + V_{\text{Noise Margin}}$
  – Gnd noise can cause a transistor to accidentally switch!
• [Figure: power vs. voltage/frequency, following $P \propto V^3$; a second sketch marks $V_T$, $V_{dd}$, and the noise on Gnd]
• Voltage scaling can take the supply voltage down only so far
• Below this, we can only use frequency scaling (decrease f, but keep V constant), which provides only linear power reduction ($\tfrac{1}{2}aCV^2f$)
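The two scaling regimes can be sketched as one piecewise function. This is an illustrative model only: it assumes f ∝ V above the voltage floor, and the floor position `perf_at_vmin` is a hypothetical parameter, not a value from the lecture.

```python
def relative_power(perf, perf_at_vmin):
    """Relative dynamic power at a fraction `perf` of peak performance.

    perf_at_vmin: fraction of peak performance at which Vdd reaches
    V_T + V_NoiseMargin and can no longer be lowered (hypothetical parameter).
    """
    if perf >= perf_at_vmin:
        # Voltage and frequency scale together: cubic reduction.
        return perf ** 3
    # Voltage is pinned at its minimum; only frequency drops: linear reduction.
    return perf_at_vmin ** 2 * perf


for perf in (1.0, 0.8, 0.6, 0.3):
    print(f"perf {perf:.1f} -> power {relative_power(perf, perf_at_vmin=0.6):.3f}")
```

The two branches agree at the floor, so the curve is continuous but flattens from cubic to linear below it.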

Page 6: Advanced  Microarchitecture

DVFS

• Dynamic Voltage/Frequency Scaling
• Someone tracks performance demands, idleness, etc.
  – “Someone” is typically the OS with hardware support
  – … but you could have a hardware-only approach
    • Under thermal emergencies, the HW takes over regardless of what voltage/frequency the OS asks for
• Goal: consume minimum power necessary while still meeting performance demands
• Can also do just DVS or DFS
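A minimal sketch of the selection policy described above: pick the lowest operating point that still meets demand, and let a thermal emergency override the request. The operating-point table and function name are hypothetical, not taken from any real processor or OS.

```python
# Hypothetical (frequency in Hz, voltage in V) pairs, sorted by frequency.
OPERATING_POINTS = [(0.6e9, 0.80), (1.0e9, 0.95), (1.6e9, 1.10), (2.0e9, 1.20)]


def pick_operating_point(required_hz, thermal_emergency=False):
    """Return the lowest-power point that meets the demand.

    In a thermal emergency the hardware ignores the request and drops
    to the lowest point.
    """
    if thermal_emergency:
        return OPERATING_POINTS[0]
    for point in OPERATING_POINTS:
        if point[0] >= required_hz:
            return point
    # Demand exceeds the fastest point: run flat out.
    return OPERATING_POINTS[-1]


print(pick_operating_point(1.2e9))
print(pick_operating_point(1.2e9, thermal_emergency=True))
```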

Page 7: Advanced  Microarchitecture

Clock Gating

• CMOS logic is also called “static” logic:
  – If the inputs don’t change, neither do the outputs (or any other intermediate nodes)
• Therefore, to reduce dynamic power in CMOS circuits, don’t let the inputs change if you don’t need to!
• [Figure: a latch feeding a CMOS block. Each new input value makes the block dissipate power. Clock gate this block? Then the latch doesn’t grab the new value, so its output (the block’s input) doesn’t change.]

Page 8: Advanced  Microarchitecture

Example: ALU

• [Figure: an ALU with +, logic, shift, comp, and × units; the opcode selects one result]
• Without gating: all units consume power, but only one output is useful
• With clock-gating logic: based on the opcode, the logic clock-gates all but the one required unit
• Note: this logic consumes its own power
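The gating decision amounts to a one-hot decode of the opcode. A behavioral sketch follows; the unit names and the opcode encoding are hypothetical stand-ins for the five units in the figure.

```python
UNITS = ("add", "logic", "shift", "comp", "mul")  # hypothetical unit names


def clock_enables(opcode):
    """Return a per-unit clock-enable map: only the unit named by `opcode` is clocked."""
    return {unit: unit == opcode for unit in UNITS}


enables = clock_enables("shift")
print(enables)
```

An opcode that names none of the units (for example a no-op) leaves every unit gated in this sketch.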

Page 9: Advanced  Microarchitecture

Logic Timing

• To properly clock-gate, you must know you’re going to gate the cycle before (otherwise it’ll be too late, as the clock edge will have already arrived)
• [Figure: Payload RAM (Opcode, ValueE, ValueL) feeding the +, logic, and comp units and the clock-gating logic]

Page 10: Advanced  Microarchitecture

Logic Timing

• Not all blocks can be easily gated
  – may be difficult to know whether gating should be applied ahead of time
    • likely true for critical path circuits: e.g., gating select logic is probably difficult since bidders are not known until the last moment
  – computation of gating condition may be complex
    • value-based (is input zero?)
    • multi-value based (are all inputs zero?)
    • multi-condition based (are all RS entries not bidding?)

Page 11: Advanced  Microarchitecture

Clock Gating Dynamic Logic

• CMOS logic toggles only when input changes
• Dynamic logic may consume power regardless
• [Figure: CMOS NOR gate vs. N-Domino NOR gate, each with output X; pictures from http://6004.csail.mit.edu/6.371/handouts/L11.pdf]
• If A (or B) equals 1 and does not change, then the sequence is: precharge X to 1, evaluate discharges X to 0, precharge X to 1, evaluate …
• Gating inputs is not enough; need to ensure CLK is disabled.

Page 12: Advanced  Microarchitecture

Clock Gating is for Dynamic Power

• Even if gates are not toggling, they continue to leak
• [Figure: an inverter between Vdd and Gnd, shown with input 1 and with input 0. In each case one transistor is On and the other is Off; subthreshold leakage and gate leakage are marked in both cases.]

Page 13: Advanced  Microarchitecture

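Reducing Leakage: Stacking

• [Figure: a single off transistor with V across it vs. two stacked off transistors, each drawn as a resistance R, with the intermediate node at V/2; channel leakage is marked on both]
• Intermediate node has V > 0
• Stack has higher resistance
• Higher VSB increases VT (VB = 0, VS ≈ V/2)
• Higher threshold voltage decreases leakage current
• Higher resistance increases gate latency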

Page 14: Advanced  Microarchitecture

Body Bias Effect

• [Figure: two transistors with body voltage VB and source voltage VS; the one with larger VSB has less channel leakage]
• WARNING: This is a GROSSLY simplified explanation!!!
• If you’re interested in low-power circuits and microarchitecture, you should go read up on some real semiconductor/electronics literature.

Page 15: Advanced  Microarchitecture

Dual VT Devices

• Manufacture two types of transistors:
  – Low VT gates: fast, high leakage
  – High VT gates: slow, low leakage (typically 10x less)
  – Designer chooses what kind to use
• Pro:
  – less area than stacking (one high-VT gate = one low-VT gate in area; stacking requires multiple gates)
• Con:
  – Manufacturing process needs to provide two device types

Page 16: Advanced  Microarchitecture

Use Only Where Appropriate

• Stacking and higher VT both slow down the gates
• Analyze circuits and…
  – apply one or both techniques to gates not on the critical path
  – apply to longest path if timing permits (i.e., this circuit is not a frequency limiter)
• [Figure: circuit with the critical path gates marked; the annotation “Stack or use high-VT gates here” points at the other gates]

Page 17: Advanced  Microarchitecture

Standby Input Vectors

• The amount of leakage depends on the clock-gated inputs to the gate
• [Figure: a two-input gate drawn once per input vector, with each transistor marked On or Off. Annotations on the four cases: 2 off transistors in parallel; 1 off transistor in leakage path; 1 off transistor in leakage path; 2 off transistors in leakage path]

Page 18: Advanced  Microarchitecture

Standby Input Vectors

• When clock-gating a block
  – disable latch clock (as usual)
  – load leakage-minimizing input vector (stored elsewhere)
• [Figure: a clock-gated latch driving the block with a standby vector of all 1s]
• Can cause spurious transitions that consume more dynamic power
• How to determine best input vector for n-input gate?
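One straightforward (if brute-force) way to approach the question of the best input vector is to evaluate leakage for every vector and keep the minimum. The sketch below assumes per-vector leakage values are already available, for example from circuit simulation; the table is made up for illustration and does not describe any real gate.

```python
from itertools import product

# Hypothetical leakage (nA) per input vector of a 2-input gate.
HYPOTHETICAL_LEAKAGE_NA = {(0, 0): 10.0, (0, 1): 25.0, (1, 0): 22.0, (1, 1): 40.0}


def leakage_of(vector):
    """Look up the (assumed) leakage for one input vector."""
    return HYPOTHETICAL_LEAKAGE_NA[vector]


def best_standby_vector(n_inputs, leakage):
    """Exhaustively search all 2^n input vectors for the one with least leakage."""
    return min(product((0, 1), repeat=n_inputs), key=leakage)


print(best_standby_vector(2, leakage_of))
```

The search is exponential in the number of inputs, so it is only practical per gate or for small blocks; larger blocks would need a heuristic.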

Page 19: Advanced  Microarchitecture

Variant: Embedded Dual-VT

• Instead of at the gate level, choose high-VT vs. low-VT at the transistor level
• [Figure: a gate built from a mix of high-VT devices and low-VT devices]
• Can be used if some transitions are more important than others
  – “more important” can be speed or power
• Combine with setting input sleep vectors
  – make the off transistors high-VT if possible to further reduce leakage

Page 20: Advanced  Microarchitecture

Power Gating

• If you turn off the power, then the gates can’t leak
• [Figure: an inverter between Vdd and Gnd vs. the same inverter supplied from a Virtual Vdd, which a gating transistor connects to Vdd; with the gating transistor off, the leakage path is cut]
• This gating transistor is a beast… it needs to be big enough to supply the necessary current when not gated, and it also needs to be low leakage (high-VT gate)
• Gating transistor is also called a “sleep” transistor

Page 21: Advanced  Microarchitecture

Power Gating

• [Figure: block between Virtual Vdd (gated from Vdd) and Gnd] After gating, residual charge in the system will continue to leak
• [Figure: block between Virtual Vdd and Virtual Gnd, with a sleep transistor on each side] Both paths cut off now

Page 22: Advanced  Microarchitecture

Turn-On/Turn-Off Latency

• Sleep transistors are slow high-VT devices
• Depending on size of block covered by sleep transistor, virtual Vdd/Gnd may have a lot of capacitance to charge/discharge
  – [Figure: Vdd connected to Virt. Vdd through resistance R, with capacitance C] Moderate R, large C $\Rightarrow$ large RC (slow)
• [Timeline: ADD inst ready to execute while the ALU is asleep, then the delay to wake up the ALU, then ADD exec]
• Wakeup delay can cause significant performance penalties when units are unavailable

Page 23: Advanced  Microarchitecture

Turn-On/Turn-Off Latency

• In some situations, can know early enough ahead
• [Figure: crude pipeline from fetch and decode through exec, with an fadd moving through it and a wakeup signal sent to the FPU]
  – FP inst decoded! Immediately send wakeup to FPU
  – Hopefully by the time the fadd makes it to the OOO core, gets scheduled, and makes it to the FPU, the turn-on has completed
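Whether the early wakeup hides the turn-on latency is a simple comparison. A sketch with hypothetical cycle counts (neither number comes from the lecture):

```python
def wakeup_stall_cycles(wake_latency, cycles_until_needed):
    """Cycles the instruction waits for the unit after reaching it.

    wake_latency: cycles for the sleeping unit to become usable.
    cycles_until_needed: cycles between sending the wakeup (at decode)
    and the instruction arriving at the unit.
    """
    return max(0, wake_latency - cycles_until_needed)


# Hypothetical numbers: 20-cycle FPU wakeup, fadd arrives 12 or 30 cycles after decode.
print(wakeup_stall_cycles(20, 12))  # wakeup only partly hidden
print(wakeup_stall_cycles(20, 30))  # wakeup fully hidden
```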

Page 24: Advanced  Microarchitecture

Turn-On/Turn-Off Latency

• In some cases it’s much harder
  – pipeline full/stalled (maybe due to D$ miss to main memory)
  – power-off front-end units (fetch, decode, etc.)
  – miss serviced, back-end starts moving again; front-end starts wake up
  – back-end gets starved because front-end wakeup is too slow and can’t refill the pipeline
• But it’s hard to start the power-on early because we don’t know when the memory request will be fulfilled (and whether that will cause the back-end to drain)

Page 25: Advanced  Microarchitecture

Turn-On/Turn-Off Power

• (Dis)Charging Virtual Vdd/Gnd consumes quite a bit of energy/power ($P = \tfrac{1}{2}aCV^2f$)
• Worst-case: charge up as soon as you’re done discharging
  – [Figure: Virt. Vdd vs. time: “Go to sleep!”, then “Done discharging, now wakeup!”]
  – We just wasted $2 \times \tfrac{1}{2} \times C_{\text{Virt Vdd}} \times V_{dd}^2$ joules to discharge and then recharge the virtual Vdd
  – And we spent zero cycles fully asleep, so we didn’t save any/much leakage power

Page 26: Advanced  Microarchitecture

Turn-On/Turn-Off Power

• Must stay asleep for some time, just to break even!
• [Figure: energy consumed vs. time. Without sleeping, energy consumed from leakage keeps growing. With sleeping: energy to discharge Virtual Vdd/Gnd, then zero energy consumed while sleeping, then energy to recharge Virtual Vdd/Gnd.]
  – Minimum sleep time for energy break-even
  – Too little sleep… ends up costing more energy than doing nothing (extra energy spent)
  – Sleep interval > break-even length: energy reduction
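The break-even point can be estimated by setting the leakage energy avoided equal to the discharge-plus-recharge cost of 2 × ½ × C × Vdd². This is a simplified sketch: it treats leakage while asleep as zero, ignores the time spent in transition, and uses hypothetical component values.

```python
def transition_energy(c_virtual, vdd):
    """Energy (J) to discharge and then recharge the virtual rail: 2 * 0.5 * C * Vdd^2."""
    return 2 * 0.5 * c_virtual * vdd ** 2


def break_even_sleep_time(c_virtual, vdd, leakage_power):
    """Sleep time (s) at which leakage energy avoided equals the transition energy."""
    return transition_energy(c_virtual, vdd) / leakage_power


def energy_saved(sleep_time, c_virtual, vdd, leakage_power):
    """Net energy saved (J); negative means sleeping cost more than staying awake."""
    return leakage_power * sleep_time - transition_energy(c_virtual, vdd)


# Hypothetical block: 10 nF on the virtual rail, 1.0 V supply, 100 mW of leakage.
t_be = break_even_sleep_time(10e-9, 1.0, 0.1)
print(f"break-even sleep time: {t_be * 1e9:.0f} ns")
```

Sleeping for less than `t_be` gives a negative `energy_saved`, which corresponds to the "extra energy spent" region of the figure.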

Page 27: Advanced  Microarchitecture

Turn-On/Turn-Off Noise

• Instantly turning on the sleep transistor to recharge virtual Vdd causes very large current spike (di/dt)
• [Figure: water-tank analogy. The tank supplies $I_{shower}$; a flush draws $I_{john}$, leaving $I_{shower} - I_{john}$ and causing a pressure drop. $I_{john}$ stands for the current for recharging virtual Vdd.]
• Solution: progressive turn-on; recharge virtual Vdd slowly, which limits $I_{john}$ (i.e., $I_{recharge}$) to keep pressure drop (supply noise) under control
• Slowing down recharge increases performance penalty when recharge is late

Page 28: Advanced  Microarchitecture

Example: Intel Core (not Core 2)

• OS power management (OSPM)
  – algorithm monitors CPU load over some window of time
  – computes target performance point, requests from CPU
  – CPU is expected to modify operating voltage/frequency to match OSPM’s request
• [Figure: relative power consumption, with a “voltage and frequency scaling” region and a “frequency scaling only” region]
• OS can choose different power saving states (C0 – Cn)
  – C0: active state (no power saving)
  – Ci: higher i $\Rightarrow$ more power savings, but longer recovery time

http://download.intel.com/technology/itj/2006/volume10issue02/vol10_art03.pdf

Page 29: Advanced  Microarchitecture

Example: Core Idle States

• C0: Active
• C1 (processor-centric measures)
  – instruction execution halted, clocks are gated
• C2: CPU does not access bus w/o chipset’s consent
  – allows bus to be put in low-power mode
• C3: CPU disables PLLs (clock generators)
• C4: CPU lowers voltage to minimum level while still being able to retain state (e.g., cache contents)
• DC4: “Deep” C4 (next slide)

Page 30: Advanced  Microarchitecture

Example: Core Sleep State

• Upon entering C4, flush L2 cache to main memory
  – Don’t do it all at once!
    • If C4 period is short, then you waste more power due to flushing
    • Can have performance impact on wakeup since cache will be cold
    • Flush only part of the L2 (1/8 to 1/2) by ways
      – once a complete way has been flushed, power gate it with sleep transistors (discussed earlier)
    • Do this upon each entry into C4 state
• When L2 has shrunk to 0 bytes, enter DC4
  – Greatly reduce voltage since there’s no state to retain
    • No need to wake up cache for snoops
    • Chipset directs snoop traffic directly to memory
    • Typically expand cache to minimum of two ways on exit from DC4

Page 31: Advanced  Microarchitecture

Example: Core Duo

• Many shared resources
  – PLL, power supply, L2 cache
• Can’t (easily) run cores at different clock speeds with a single PLL
• Can’t run cores at different voltages with a single power supply
• Can’t turn off L2 cache just because one core is idle
• External interface complications
  – OS sees two separate CPUs
    • one C-state per core
    • OS can request C-states on a per-core basis
  – Platform views the whole processor as a single entity for power-management (for C2 state and higher)
    • Platform sees only a single C-state (the lower of the two)
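The resolution rule in the last bullet is just a minimum over the per-core requests. A sketch, assuming C-states are encoded as integers (C0 = 0, C1 = 1, …), which is a convention adopted here for illustration:

```python
def platform_c_state(core_c_states):
    """Return the single C-state the platform sees: the lower (shallower) of the requests."""
    return min(core_c_states)


# Core 0 asks for C4, core 1 is only in C1: the shared resources can go no deeper than C1.
print(platform_c_state([4, 1]))
# One core still active (C0) keeps the whole package active.
print(platform_c_state([0, 4]))
```

The shared PLL, power supply, and L2 listed above are why the shallower request wins: a deeper state would take those resources away from the core that is still using them.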

Page 32: Advanced  Microarchitecture

Turbo-Mode

• If one core is in deep-sleep, it’s not consuming much power
• Idea: use DVFS in reverse to increase voltage/frequency
• [Figure: per-core power against the power limit, and relative performance, for both cores in C0 and for core 0 in C0 with core 1 in DC4]
• Deliver more performance when running a single program and not worried about battery life (plugged in to wall)
• “Intel Dynamic Acceleration Technology”

Page 33: Advanced  Microarchitecture

Variable VT Devices

• Pros:
  – significant standby leakage reduction
  – memory elements retain state
  – no transistor sizing/partitioning required
  – dynamically tunable VT at runtime
• Cons:
  – requires expensive triple-well fabrication process
  – body-biasing effect decreases with technology scaling
• Earlier body-bias effect from stacked transistors was due to higher source voltage: higher VSB increases VT (VB = 0, VS ≈ V/2)
• Provide a way to explicitly bias VB
  – Setting VBBN < 0 makes VSB > 0 for this NFET
  – Since VBBN < 0, also called “reverse biasing”

Kao et al., Embedded Tutorial: Subthreshold Leakage Modeling and Reduction Techniques, ICCAD 2002

Page 34: Advanced  Microarchitecture

Body-Biased Cache

• Super-high VT for caches (very slow)
• Use selective forward-body biasing during access to read/write at a reasonable speed
• [Figure: cache lines with VBBN = 0: very-high-VT devices (very low leakage, slow access speed)]
• Access: Vfwd-bias is applied to the accessed lines; VSB < 0 $\Rightarrow$ VT decreases $\Rightarrow$ transistors are faster (but consume more power)
• Access completed: the bias returns to 0
• A few cache lines go into high leakage mode, but only very briefly (during access). The rest of the time, the cache consumes very little leakage power.

Page 35: Advanced  Microarchitecture

GALS

• Different blocks have different performance needs
  – and this varies in time
• Idea: clock different blocks at different speeds
  – Apply voltage/frequency scaling to blocks/groups-of-blocks
    • e.g., FP units can be slowed down (or maybe even completely turned off) for integer applications
  – Block consumes less power when it doesn’t have to operate in max-performance mode
• GALS = Globally Asynchronous, Locally Synchronous

Page 36: Advanced  Microarchitecture

GALS Example

• [Figure: Baseline Processor vs. GALS Processor, from http://www.ece.cmu.edu/~dianam/conferences/isca02.pdf]

Page 37: Advanced  Microarchitecture

GALS Issues

• How to communicate between clock domains?
• Timing issues:
  – [Figure: Asynchronous FIFO Design, Chelcea and Nowick]
  – Producer can clear empty, but it gets cleared on clk2
  – Consumer clears the full signal, but it occurs on clk1
• Voltage issues:
  – [Figure: the Vdd1 domain swings from 0V (“0”) to 0.75V (“1”); the Vdd2 domain swings from 0V to 1.5V, so a “1” (0.75V) from the first domain arrives at 0.75V: is that a 0 or a 1?]
  – FIFO between domains must “speak” both voltages